Latest revision as of 06:18, 10 September 2014
This page shall contain every change that is made during a qa release of the version mentioned above. If it's not here, it never happened!
PubMan 7.7 Release
Affected Servers
Prepare read only system
Fedora
Coreservice Apache
Coreservice JBoss
Core Infrastructure
- copy jboss-log4j.xml from srv11 /home/siedersleben/escidoc-core-1.3.10-For-Release7.7 to JBOSS_HOME/server/default/conf
- copy escidoc-core-1.3.10-SNAPSHOT-build72.ear, fedoragsearch.war, srw.war from /home/siedersleben/escidoc-core-1.3.10-For-Release7.7 to JBOSS_HOME/server/default/deploy
- chown jboss:jboss ....
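The copy steps above could be scripted roughly as follows. This is only a sketch: the function name `deploy_core_artifacts` and its two arguments (source directory and JBoss home) are assumptions made for illustration, and the chown step is deliberately left out because its target is not spelled out above.

```shell
#!/bin/sh
# Sketch only: copy the core artifacts listed above into a JBoss instance.
# Usage: deploy_core_artifacts SOURCE_DIR JBOSS_HOME
# The chown step is not included because its target is not given in the list.
deploy_core_artifacts() {
    src=$1
    jboss_home=$2
    conf_dir="$jboss_home/server/default/conf"
    deploy_dir="$jboss_home/server/default/deploy"

    # Check everything first, so a missing file does not leave a half-done copy.
    for f in jboss-log4j.xml escidoc-core-1.3.10-SNAPSHOT-build72.ear fedoragsearch.war srw.war; do
        if [ ! -f "$src/$f" ]; then
            echo "missing in $src: $f" >&2
            return 1
        fi
    done
    if [ ! -d "$conf_dir" ] || [ ! -d "$deploy_dir" ]; then
        echo "conf or deploy directory not found below $jboss_home" >&2
        return 1
    fi

    # The logging configuration goes to conf, the archives go to deploy.
    cp "$src/jboss-log4j.xml" "$conf_dir/" || return 1
    for f in escidoc-core-1.3.10-SNAPSHOT-build72.ear fedoragsearch.war srw.war; do
        cp "$src/$f" "$deploy_dir/" || return 1
    done
}
```

A call would look like `deploy_core_artifacts /home/siedersleben/escidoc-core-1.3.10-For-Release7.7 "$JBOSS_HOME"`, followed by the ownership change done by hand.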
Core Properties
- escidoc-core.properties: remove the PDF extractor properties from escidoc-core.properties and add the corresponding properties to fedoragsearch.properties (gs: changes added on srv11 in escidoc-core.properties.release7.7)
    # true|false Defines what happens if an Exception occurs while extracting the text from a pdf for indexing
    # if set to true, Exception is ignored and object is indexed without the fulltext.
    # if set to false, Exception is thrown and object is not indexed at all.
    gsearch.ignoreTextExtractionErrors = true
    # Location of the indexingStylesheet that generates the indexInformation-Document for gsearch-indexing.
    # has to be a URL
    # currently the eSciDoc-Core-Infrastructure provides 2 index-databases: escidoc_all and escidocou_all
    # stylesheet-path-property for index escidoc_all is gsearch.escidoc.indexingStylesheet
    # stylesheet-path-property for index escidocou_all is gsearch.escidocou.indexingStylesheet
    #gsearch.escidoc.indexingStylesheet = http://escidoc1.escidoc.mpg.de/resources/searchIndexDefinition/mpdlEscidocXmlToLucene_1.2.xslt
    gsearch.escidoc.indexingStylesheet = http://coreservice.mpdl.mpg.de/mpdlEscidocXmlToLucene.xslt
    gsearch.escidocou.indexingStylesheet =
    # if pdfBox (internally used by gsearch to extract text from pdfs) is not working well for your pdfs,
    # define command-line-command to custom pdf-text-extractor (has to get installed separately)
    # define command with full path, define inputfile with <inputfile> and outputfile with <outputfile>
    #example: C:/Programme/xpdf-3.02pl2-win32/pdftotext -cfg C:/Programme/xpdf-3.02pl2-win32/xpdfrc <inputfile> <outputfile>
    # gsearch.pdfTextExtractorCommand = /usr/bin/pdftotext -cfg /etc/xpdfrc <inputfile> <outputfile>
    gsearch.pdfTextExtractorCommand = /usr/bin/java -classpath /usr/share/jboss/server/default/conf/pdf-extraction/classes:/usr/share/jboss/server/default/conf/pdf-extraction/lib/iText-5.0.6.jar de.mpg.escidoc.services.extraction.ExtractionChain <inputfile> <outputfile>
    # Analyzer to use for indexing and search
    lucene.analyzer = de.escidoc.sb.common.lucene.analyzer.EscidocAnalyzer
- add the new property for skipping reindexing to escidoc-core.properties (gs: already in escidoc-core.properties.release7.7)
    # Comma-separated list of method names for which automatic indexing is skipped
    escidoc-core.skip.notify.indexer.methods = assignObjectPid, assignVersionPid
- fedoragsearch.properties: add the following properties to JBOSS_HOME/conf/search/config/fedoragsearch.properties (gs: changes added on srv11 in fedoragsearch.properties.release7.7)
    # if pdfBox (internally used by gsearch to extract text from pdfs) is not working well for your pdfs,
    # use a command-line tool.
    # If you want to use a command-line tool,
    # define command-line-command to custom pdf-text-extractor (has to get installed separately)
    # define command with full path, define inputfile with <inputfile> and outputfile with <outputfile>
    #example: C:/Programme/xpdf-3.02pl2-win32/pdftotext -cfg C:/Programme/xpdf-3.02pl2-win32/xpdfrc <inputfile> <outputfile>
    fedoragsearch.pdfTextExtractorCommand=/usr/bin/java -classpath /usr/share/jboss/server/default/conf/pdf-extraction/classes:/usr/share/jboss/server/default/conf/pdf-extraction/lib/itextpdf-5.5.1.jar de.mpg.escidoc.services.extraction.ExtractionChain <inputfile> <outputfile>
    # true|false Defines what happens if an Exception occurs while extracting the text from a pdf for indexing
    # if set to true, Exception is ignored and object is indexed without the fulltext.
    # if set to false, Exception is thrown and object is not indexed at all.
    fedoragsearch.ignoreTextExtractionErrors=true
- copy directory pdf-extraction and pdfbox-app-1.8.6.jar from /home/siedersleben/escidoc-core-1.3.10-For-Release7.7 to JBOSS_HOME/conf (gs: already done)
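After editing both files, a quick consistency check can catch a property that was forgotten on either side. The sketch below assumes that "the PDF extractor properties" means the two keys `gsearch.ignoreTextExtractionErrors` and `gsearch.pdfTextExtractorCommand`; the function names `has_property` and `check_core_properties` are made up for this example.

```shell
#!/bin/sh
# has_property FILE KEY: succeeds if FILE contains an uncommented "KEY = ..." line.
has_property() {
    # Escape the dots so they match literally in the regular expression.
    key_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$1"
}

# check_core_properties CORE_PROPERTIES FEDORAGSEARCH_PROPERTIES
# Prints one line per problem and returns non-zero if any problem was found.
check_core_properties() {
    core=$1
    fgs=$2
    status=0

    # Assumption: these two keys are the ones that must leave escidoc-core.properties.
    for key in gsearch.ignoreTextExtractionErrors gsearch.pdfTextExtractorCommand; do
        if has_property "$core" "$key"; then
            echo "still present in $core: $key"
            status=1
        fi
    done

    # The new skip-reindex property must be set in escidoc-core.properties.
    if ! has_property "$core" escidoc-core.skip.notify.indexer.methods; then
        echo "missing in $core: escidoc-core.skip.notify.indexer.methods"
        status=1
    fi

    # The replacement properties must be set in fedoragsearch.properties.
    for key in fedoragsearch.pdfTextExtractorCommand fedoragsearch.ignoreTextExtractionErrors; do
        if ! has_property "$fgs" "$key"; then
            echo "missing in $fgs: $key"
            status=1
        fi
    done
    return $status
}
```

Commented-out lines are ignored on purpose, so a leftover `# gsearch.pdfTextExtractorCommand = ...` example does not count as "still present".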
Core Index Properties
- Reindex: set the following properties in $JBOSS_HOME/conf/search/config/index/escidoc_all/index.properties (remove the comment sign)
    # Use this property for bulk index operations: the index is held in memory until ramBufferSize is reached.
    # Make sure this property does not conflict with fgsindex.maxBufferedDocs.
    fgsindex.ramBufferSizeMb = 128
    # Use this property to minimize garbage collection during indexing. Be careful: the index is not thread-safe when running in this mode.
    fgsindex.indexMode = 1
- same for item_container_admin
- set indexing to asynchronous for item_container_admin during the reindex in $JBOSS_HOME/conf/search/config/index/item_container_admin/index.object-types.properties
    Resource.Item.indexAsynchronous=true
- check that PDF extraction works properly by looking for log messages like the following in fedoragsearch.log (the WARN can be ignored)
    WARN 2014-08-06 09:17:54,269 (TransformerToText)(http-0.0.0.0-8080-4) error while transforming pdf to text with external tool:
    Extracting PDF content ----------------------------------------
    Infile: /usr/share/jboss-4.2.3.GA/server/default/1407309473650.pdf
    Outfile: /usr/share/jboss-4.2.3.GA/server/default/1407309473650.txt
    Wed Aug 06 09:17:53 CEST 2014 -- started
    Extracting with xPDF
    Wed Aug 06 09:17:54 CEST 2014 -- finished successfully
    Extraction took 314
- don't forget to reset all of the properties modified above once the reindex has finished
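Because the reindex settings have to be reverted afterwards, it may help to keep a backup next to each edited file. The following sketch assumes the two properties are present in index.properties as commented-out lines such as `#fgsindex.ramBufferSizeMb = 128`; the function names and the `.pre-reindex` backup suffix are hypothetical.

```shell
#!/bin/sh
# enable_bulk_index FILE: back up FILE, then uncomment the two bulk-index properties.
enable_bulk_index() {
    props=$1
    backup="$props.pre-reindex"
    if [ -f "$backup" ]; then
        echo "backup already exists, refusing to overwrite: $backup" >&2
        return 1
    fi
    cp -p "$props" "$backup" || return 1
    # Strip a leading "#" only in front of the two property keys;
    # ordinary comment lines are left untouched.
    sed -e 's/^[[:space:]]*#[[:space:]]*\(fgsindex\.ramBufferSizeMb[[:space:]]*=\)/\1/' \
        -e 's/^[[:space:]]*#[[:space:]]*\(fgsindex\.indexMode[[:space:]]*=\)/\1/' \
        "$backup" > "$props"
}

# restore_index_props FILE: put the backup made by enable_bulk_index back in place.
restore_index_props() {
    props=$1
    backup="$props.pre-reindex"
    if [ ! -f "$backup" ]; then
        echo "no backup found: $backup" >&2
        return 1
    fi
    mv "$backup" "$props"
}

# count_successful_extractions LOGFILE: print the number of "finished successfully" lines.
count_successful_extractions() {
    grep -c 'finished successfully' "$1"
}
```

The same pair of calls would apply to the item_container_admin index.properties. The `Resource.Item.indexAsynchronous` change is not covered by this sketch and still has to be set and reverted separately.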
Core Lucene Index
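- Take the indexing stylesheets from the wildfly branch, not from trunk:
    - https://subversion.mpdl.mpg.de/repos/common/wildfly_migration/common_services/framework_access/src/main/resources/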
PubMan EAR
PubMan Properties
- escidoc.transformation.wos.stylesheet.filename=/usr/share/jboss/server/default/conf/transformation/transformations/otherFormats/xslt/wosxml2escidoc.xslt (and check that the transformation exists)
- escidoc.framework_access.framework.url=http://localhost:8080 (instead of coreservice)
- CHANGE: escidoc.pubman.favicon.url=/pubman/faces/javax.faces.resources/pubman_favicon_32_32.png
- escidoc.dataaquisition.resources.fop.configuration (?)
- escidoc.cone.modelsxml.path (?)
- escidoc.transformation.edoc.stylesheet.filename (?)
- escidoc.transformation.endnote.ice.stylesheet.filename
- escidoc.transformation.endnote.stylesheet.filename (?)
- escidoc.transformation.edoc.configuration.filename (?)
- escidoc.transformation.escidoc2marcxml.stylesheet.filename (?)
- escidoc.aa.public.key.file
- escidoc.aa.private.key.file
- escidoc.aa.config.file
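The "check that the transformation exists" step can be generalized to any property whose value is a file path. The sketch below treats the values as plain filesystem paths, which is an assumption for the properties marked with (?) above; `property_value` and `check_property_files` are illustrative names.

```shell
#!/bin/sh
# property_value FILE KEY: print the value of the last uncommented "KEY=value" line.
property_value() {
    # Escape the dots so they match literally in the regular expression.
    key_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# check_property_files FILE KEY...
# Reports every KEY that is unset or whose value is not an existing file.
check_property_files() {
    props=$1
    shift
    status=0
    for key in "$@"; do
        value=$(property_value "$props" "$key")
        if [ -z "$value" ]; then
            echo "not set: $key"
            status=1
        elif [ ! -f "$value" ]; then
            echo "file not found for $key: $value"
            status=1
        fi
    done
    return $status
}
```

For example, `check_property_files pubman.properties escidoc.transformation.wos.stylesheet.filename escidoc.aa.config.file` prints nothing and returns 0 when both files exist.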
PubMan Apache
- Add ProxyPassReverse for /cone, /sword-app, /dataacquisition in Apache 2 config, if not done yet
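To see whether the directives are "not done yet", the Apache configuration can be scanned for the three paths. This is a rough sketch: the function name `missing_proxy_reverse` is made up, the configuration file is passed in by the caller, and the match is a simple line-based, case-sensitive pattern that ignores includes and other ways of writing the directive.

```shell
#!/bin/sh
# missing_proxy_reverse CONF_FILE
# Prints each of the three paths that has no ProxyPassReverse line in CONF_FILE
# and returns non-zero if at least one is missing.
missing_proxy_reverse() {
    conf=$1
    status=0
    for prefix in /cone /sword-app /dataacquisition; do
        # Matches e.g. "ProxyPassReverse /cone http://..." with an optional trailing slash.
        if ! grep -Eq "^[[:space:]]*ProxyPassReverse[[:space:]]+${prefix}/?[[:space:]]" "$conf"; then
            echo "$prefix"
            status=1
        fi
    done
    return $status
}
```

The path of the configuration file depends on the server setup and is not given here, so it has to be supplied by whoever runs the check.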
PubMan Wildfly
- Add pubman module, which should contain all properties and configuration files for PubMan:
    - Create directory WILDFLY_HOME/modules/pubman/main
    - Add a file called module.xml to this directory, containing the following XML
        <?xml version="1.0" encoding="UTF-8"?>
        <module xmlns="urn:jboss:module:1.1" name="pubman">
            <resources>
                <resource-root path="."/>
            </resources>
        </module>
    - Add all necessary property files to this directory (pubman.properties, solution.properties, auth.properties, cone.properties, conf.xml, apache-fop-config.xml)
    - Make this module global by adding the following XML snippet to standalone.xml, subsystem urn:jboss:domain:ee
        <global-modules>
            <module name="pubman" slot="main"/>
        </global-modules>
- Wildfly has a default maximum POST size of 10 MB and a default limit of 1000 POST parameters, which is too low for large file uploads
    - Increase max-post-size and max-parameters in standalone.xml, subsystem urn:jboss:domain:undertow, by changing the http-listener as follows (here 1024000000 bytes, i.e. about 1 GB)
        <http-listener name="default" socket-binding="http" max-post-size="1024000000" max-parameters="50000"/>
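The module setup described above (directory, module.xml, property files) could be scripted along these lines. It is a sketch under assumptions: the function name `install_pubman_module` and the idea of a single source directory holding the property files are not from this page, and the standalone.xml changes are left as a manual step.

```shell
#!/bin/sh
# install_pubman_module WILDFLY_HOME SOURCE_DIR
# Creates modules/pubman/main, writes module.xml and copies the property files
# found in SOURCE_DIR. Files that are not present are reported, not treated as errors.
install_pubman_module() {
    wildfly_home=$1
    src=$2
    module_dir="$wildfly_home/modules/pubman/main"

    mkdir -p "$module_dir" || return 1

    # module.xml as given in the list above.
    cat > "$module_dir/module.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<module xmlns="urn:jboss:module:1.1" name="pubman">
    <resources>
        <resource-root path="."/>
    </resources>
</module>
EOF

    for f in pubman.properties solution.properties auth.properties cone.properties conf.xml apache-fop-config.xml; do
        if [ -f "$src/$f" ]; then
            cp "$src/$f" "$module_dir/" || return 1
        else
            echo "warning: $f not found in $src" >&2
        fi
    done
}
```

After running it, the global-modules and http-listener entries still need to be added to standalone.xml by hand.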
PubMan PidCache
eSciDoc-OAI-Provider
- copy escidoc-oaiprovider.war from /home/walter