ESciDoc Developer Workshop 14 15 07 2011
Latest revision as of 07:43, 20 July 2011
Developer Workshop
- Date: July 14/15, 2011
- Place: Munich
- Previous workshop 22-23.09.2010, Karlsruhe
Participants MPDL
- Wilhelm Frank
- Lu Yu
- Benjamin Knoth
- Richard Bourke
- Michael Franke
- Marcus Haarländer
- Natasa Bulatovic
Participants FIZ
- Steffen Wagner
- Michael Hoppe
- Christian Herlambang
- Matthias Razum
Agenda 14.07.2011
Fulltext indexing
- Questions related to indexing stylesheets enhanced with custom XSLT
  - configuration of the search result output (rather than returning the complete item/component/container)
  - highlighting of search results, e.g. getting the last page-break tag from a TEI fulltext (eSciDoc fulltext index)
    - each digitized book is represented by two items, each with one fulltext document
      - METS file: contains the table of contents with links to the digitized pages of the book (the digitized pages are image files stored externally on the file system)
      - TEI file: contains the fulltext of the book as shown on the digitized pages, plus page-break links to the image where the content starts or ends
      - a fulltext search should return exactly the matching snippet from the TEI file, together with the exact link to the image file
      - MPDL has written custom stylesheets in which the text between two page breaks is treated as a separate file and sent for highlighting together with the link
  - fulltext indexing for all fulltext (FT) visibilities, with searching and snippet display according to privileges
  - selective indexing of resources from the Admin tool
  - incremental indexing problems
- Solr support and interfaces
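The page-break handling described above (text between two page breaks treated as a separate unit, kept together with its image link) can be illustrated with a minimal sketch. It is not the MPDL stylesheet: it assumes that page breaks are TEI `<pb/>` elements in the standard TEI namespace and that the image link sits in the `facs` attribute, and the function name `split_by_page_breaks` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the TEI fulltext uses the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
PB_TAG = f"{{{TEI_NS}}}pb"


def split_by_page_breaks(tei_xml):
    """Return a list of (image_link, page_text) pairs, one per <pb/>.

    Text that appears before the first <pb/> is ignored.
    """
    root = ET.fromstring(tei_xml)
    pages = []      # finished (image_link, page_text) pairs
    current = None  # [image_link, text_chunks] of the page being collected

    def close_page():
        # Join the chunks and normalize whitespace before storing the page.
        text = " ".join("".join(current[1]).split())
        pages.append((current[0], text))

    def walk(elem):
        nonlocal current
        if elem.tag == PB_TAG:
            if current is not None:
                close_page()
            # Assumption: the link to the page image is in @facs.
            current = [elem.get("facs"), []]
        elif current is not None and elem.text:
            current[1].append(elem.text)
        for child in elem:
            walk(child)
            if current is not None and child.tail:
                current[1].append(child.tail)

    walk(root)
    if current is not None:
        close_page()
    return pages


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<pb facs="img/0001.jpg"/><p>First page text.</p>'
    '<pb facs="img/0002.jpg"/><p>Second <hi>page</hi> text.</p>'
    '</body></text></TEI>'
)
pages = split_by_page_breaks(sample)
```

Each resulting pair could then be indexed as its own highlight field, so that a hit can be reported with the link to the matching page image.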
Outcome on Fulltext indexing
- The current approach is not bad: one document is generated with many file-highlight fields. It still has to be checked whether the number of highlighted fields is limited to 100, and what the performance impact is.
  - performance issues could be caused by highlighting during indexing itself, but mostly stem from search performance
  - FIZ will provide more input after analysis, as the problem is clear
- Fulltext indexing in accordance with privileges
  - search returns the items for which the user has privileges, but when searching fulltexts with restricted privileges, a user who has rights on one fulltext of an item and not on another gets the highlights of both back
    - workaround: always exclude fulltext highlighting for non-public texts, or include the visibility when populating the highlighting (again with a potential performance cost)
- Selective indexing
  - would be good to prioritize it (from October)
  - todo: send more info on the internal script
  - indexing fulltext only or metadata only (in selective reindexing) would not be possible unless the indexes are split
- Incremental reindexing does not work in 1.2 (to be checked by 1.3)
- Solr
  - GSearch can index via Solr
  - an XML document that Solr understands has to be created (similar to Lucene)
  - GSearch can do this
  - however, many Solr settings are required, e.g. the configuration of specific fields
  - so in practice Solr can be set up now, but it takes a lot of configuration
  - todo: send config and test locally
- Reformatting of search results is possible with an external stylesheet (todo: send more info)
  - stylesheet caching is defined in properties
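As an illustration of the "XML document that Solr understands" mentioned above, the following sketch builds a document in Solr's XML update format (`<add><doc><field name="...">`). It is only a sketch under assumptions: the field names `id`, `page_image` and `page_text` and the helper `build_solr_add` are hypothetical, and the fields would have to match whatever is configured in the Solr schema. It is not the output of the GSearch stylesheets.

```python
import xml.etree.ElementTree as ET


def build_solr_add(doc_id, metadata, pages):
    """Build a Solr XML update message for one document.

    metadata: dict of field name -> value
    pages:    list of (image_link, page_text) pairs
    All field names are placeholders and must exist in the Solr schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        elem = ET.SubElement(doc, "field", name=name)
        elem.text = value

    field("id", doc_id)
    for name, value in metadata.items():
        field(name, value)
    # One pair of (multi-valued) fields per page, so a hit can be traced
    # back to the page image it belongs to.
    for image_link, text in pages:
        field("page_image", image_link)
        field("page_text", text)
    return ET.tostring(add, encoding="unicode")


xml_out = build_solr_add(
    "escidoc:1",
    {"title": "A & B"},
    [("img/0001.jpg", "First page")],
)
```

Building the message with an XML library rather than string concatenation keeps special characters in titles and fulltexts correctly escaped.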
Future development plans
- Future development plans: short-term roadmap on versions/features for releases 1.4 to 2.0
  - critical: internally managed vs. externally managed datastreams of MD records
Outcome
- 1.4 is expected by the end of July
- 1.5 is expected by October, for the eSciDoc Days
- see the eSciDoc Org Wiki (https://www.escidoc.org/wiki/Main_Page) for preliminary roadmaps (comments are welcome)
- 2.0
  - scalability
  - performance
Digilib integration
- Digilib integration
  - plans, ideas, replacements?
Outcome
- alternatives have been considered
- in digilib, only the transformation on a folder is implemented; the list of images is slow to load in digilib
- digilib has to be used for the DL project
  - alternatives are also being considered by MPDL
- is digilib needed in eSciDoc if the images are not in eSciDoc?
  - closed until there is a bigger requirement for digilib/eSciDoc
  - MPIWG people are working on digilib stream retrieval per URL
Admin Tool
- Admin tool 1.3 offers only repository information
Outcome
- the wrong version was delivered with 1.3
  - the eSciDoc Admin tool can work with several instances at one time (per URL)
  - role assignment in eSciDoc Admin is restricted to resources within the scope of the role
  - better selection of the URLs of the core service, e.g. a property file and a pull-down list to select from
- www.escidoc.org/artifactory/
  - search for "admintool": new snapshot version of the eSciDoc Admin tool
  - search for "ijc": the newest snapshots of escidoc-core
- SVN is not available from outside at the moment; work to make it public for read access is under way
Scalability & Performance
- creation of items (MPDL provides some numbers)
- reindexing
- statistics: another store, faster and not dependent on escidoc-core and Fedora?
- stress testing, mass data generation, monitoring of the core service: how this is done internally at FIZ, with reference to the FIZ Fedora Performance and Scalability Wiki
- JBoss, other AS, Tomcat
  - supporting newer versions of JBoss
  - support for other AS
  - Tomcat
- LTA (long-term archiving)
Outcome
- second VIRR instance with 2 million eSciDoc objects on 1.2
  - there is a big difference in scaling between 1.2 and 1.3 (as the database is removed)
  - Springer texts at FIZ with 2.5 million items (6 weeks), which went fine
  - performance did not get worse: ingest times started below 1 s and ended at slightly more than 1 s, on weaker hardware
  - ingest directly into Fedora is not recommended
  - ingest interface (but a reindex afterwards is still necessary)
  - it should sometimes be reconsidered whether all fulltexts, i.e. component contents (images, videos), should be stored in eSciDoc, or only the metadata (with content served by image viewers, video streamers, etc.)
  - performance is the next big step: not only scaling but also stabilization of the services
  - test data sets: it has to be thought through how this is to be done; one possibility is to rebuild Fedora and reindex via eSciDoc (depending on the data set size)
  - Fedora tests: how to do something similar for eSciDoc?
    - how to set up a proper evaluation environment for test/monitoring/performance
    - at the moment there is neither a possibility nor any effort under way
    - todo: define a list of what needs to be checked
    - scaling: eSciDoc services cannot be split among different servers, so no horizontal scaling is possible at the moment, but indexing can be moved to a separate server
    - high availability
      - not yet natively possible, planned
    - however, it has to be clarified what this means
    - a distributed environment can also be problematic, e.g. how to make a backup
    - load balancing, e.g. for read operations only
    - journaling mode: nice feature, will be considered
    - the first step will probably focus mostly on indexing/searching scalability and performance
  - reindexing (more alternatives, see below)
    - distribute over different machines
    - sharding (one index on several machines)
    - more optimal would be several machines each holding the complete index
    - optimization of the stylesheet transformation
    - GSearch can work with multiple threads, but this is not practical as it may lock up the whole machine; it is therefore set to 1 thread
    - todo: check the possibility of running two or more GSearch instances, each single-threaded
    - keep indexes in memory and only write when a threshold is reached
  - statistics: another store, faster and not dependent on escidoc-core and Fedora?
    - MongoDB as store and BIRT as standard interface may be considered
  - performance/scalability
    - potential to use scalable stores, e.g. MongoDB (as a configuration option, instead of Fedora or as Fedora storage), in parallel with Lucene
- Performance priority: workflow settings (content model, context?)
  - items immediately released (with proper indexing afterwards) -> one call to the service
- Item event log (update, insert comments?)
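For the reindexing todo above (running two or more single-threaded indexer instances), one possible way to divide the work is to assign every object ID deterministically to one instance. The sketch below only illustrates that idea; it is not part of GSearch or escidoc-core, the helper `partition_ids` is hypothetical, and the CRC32-modulo scheme is an assumption chosen for simplicity.

```python
import zlib


def partition_ids(object_ids, instances):
    """Split object IDs into one work list per indexer instance.

    The assignment depends only on the ID, so the same object always
    goes to the same instance across runs.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    buckets = [[] for _ in range(instances)]
    for object_id in object_ids:
        # CRC32 is stable across processes, unlike Python's built-in hash().
        index = zlib.crc32(object_id.encode("utf-8")) % instances
        buckets[index].append(object_id)
    return buckets


ids = [f"escidoc:{n}" for n in range(1000)]
buckets = partition_ids(ids, 2)
```

Each work list could then be handed to a separate single-threaded instance. Whether the resulting partial indexes are merged afterwards or queried as shards is a separate decision.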
JBoss, other AS, Tomcat
- work to remove the dependencies on JBoss is in progress
- work to enable running in a servlet container is in progress
- next version after 1.4
SPO
- Semantic Store Handler
- not needed
- interfaces will be deprecated
Other
- todo: issue: add content relation in the REST client handler
- todo: issue: container delete updates release (MPDL to specify)
- critical: namespace preservation bug in MD record
  - https://www.escidoc.org/jira/browse/INFR-947
  - https://www.escidoc.org/jira/browse/INFR-1190
  - todo: use the workaround for now; it will be fixed, but it is a more serious issue
Agenda 15.07.2011
OAI provider
- transformations from item (not only from metadata record)?
Outcome
- to check later
AA
- external roles
- standalone service
Outcome
- it is possible to define a new role (policy); the policy evaluates an action, and the PDP methods evaluate the rights
- evaluation is based on the XML (it expects the user ID and some attributes)
- the user must exist in eSciDoc and the role must exist in eSciDoc
- can it be completely standalone, outside of Fedora?
  - statistics depend on Fedora (primary key)
  - otherwise it is completely independent of Fedora
- not possible at the moment as a completely standalone component
  - possible with limitations together with eSciDoc: external roles, external actions
Content Models
- Any plans from MPDL side?
- Pragmatic and iterative approach: some ideas
Outcome
What can be defined in the content model?
- schema for the metadata record (already done in 1.2)
- transformation for a "subview" or custom representation of an item (already done in 1.2)
  - note: to be clarified whether the above refers to the item representation or the metadata record representation --Natasa 09:37, 20 July 2011 (CEST)
  - components: content categories, MIME types
  - validation method: for metadata schema compliance and for item structure compliance with components, content categories and MIME types
- versioned content models?
  - each resource references one content model
  - if this reference points to a particular version, a "transition" could be done more smoothly
  - migration: stylesheet for converting between different versions of a content model
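The structural validation idea above (checking an item's components against the content categories and MIME types a content model allows) can be sketched as follows. This is purely illustrative: the classes `ContentModel` and `Component` and the function `validate_item` are hypothetical and do not reflect the eSciDoc content model schema or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ContentModel:
    # Hypothetical structure: content category -> allowed MIME types.
    allowed: Dict[str, Set[str]]


@dataclass
class Component:
    content_category: str
    mime_type: str


def validate_item(model: ContentModel, components: List[Component]) -> List[str]:
    """Return a list of violations; an empty list means the item complies."""
    errors = []
    for position, component in enumerate(components):
        mime_types = model.allowed.get(component.content_category)
        if mime_types is None:
            errors.append(
                f"component {position}: content category "
                f"'{component.content_category}' is not allowed"
            )
        elif component.mime_type not in mime_types:
            errors.append(
                f"component {position}: MIME type '{component.mime_type}' "
                f"is not allowed for '{component.content_category}'"
            )
    return errors


book_model = ContentModel(allowed={"tei-fulltext": {"text/xml", "application/xml"}})
ok = validate_item(book_model, [Component("tei-fulltext", "text/xml")])
bad = validate_item(
    book_model,
    [Component("tei-fulltext", "image/jpeg"), Component("video", "video/mp4")],
)
```

Returning all violations at once, rather than stopping at the first, would let a client report every structural problem of an item in one pass.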
PubMan
- Migration plans to version 1.4 of the eSciDoc Infrastructure
- Evaluation of the eSciDoc Infrastructure Java Connector ("Java Client Library")
Outcome
- development version 6.3
  - CoNE improvements
  - Japanese language support
  - problems with newer browsers (JavaScript)
- the next version can be on 1.3 (planned for the eSciDoc Days)
- it would be good to bring PubMan onto core service 1.3 or 1.4
Other eSciDoc Applications
- Status of Digitization Lifecycle
  - successor of VIRR
  - to cover more institutes and a larger data scale
  - books in eSciDoc (METS container with the structure, and an item with a TEI component for the fulltext)
  - images are in the file system (Digilib)
  - customization of stylesheets for fulltext (XML) search
  - JSF 2.0
  - first version by the end of September (initially a VIRR equivalent: browse, view)
  - possibly upload functionality by the eSciDoc Days
- Status of Imeji
  - demo
  - further plans: Imeji on Tomcat, RDF, switch to the Java connector for escidoc-core, thesaurus
  - timing not clear yet
- Status of eSciDoc Browser
  - demo, ideas
Building environment
- eSciDoc building environment
  - FIZ development plugins: JRebel, code formatting during the build (code style, spaces, line breaks, line length; somewhat problematic with Javadoc)
    - static code analysis with Sonar
    - scrum room: code reports visible
  - MPDL development plugins: Checkstyle (not used intensively), code templates, JRebel
    - for more info see "MPDL Building and Development Environment" (wiki page Escidoc_Building_and_Developing_Environment)
    - for Selenium tests see "General" (wiki page Automatic_GUI_tests_with_selenium) and "PubMan and Selenium tests"
Outcome
- todo: check the Maven versions used by both teams to ensure smooth building from sources for both core and PubMan code
- todo: check the JBoss ports used in Maven builds
- FIZ: maven-failsafe-plugin for integration tests; parallelization of tests has not been very successful with the plug-in
- MPDL: uses Surefire (more applicable to unit testing)
- test artifacts (to be used by both MPDL and FIZ) to ensure that both the core services and the solutions, e.g. PubMan, keep working
- todo: publish source packages now and open SVN in the future
Other
- eSciDoc core 1.5 requires a migration (planned for the eSciDoc Days)
  - the migration would be on FoXML
    - FoXML
    - Fedora rebuild
    - triple store
    - Lucene indexes?