Revision as of 13:04, 14 July 2011
Developer Workshop
- Date: July 14/15, 2011
- Place: Munich
- Previous workshop 22-23.09.2010, Karlsruhe
Participants MPDL
Participants FIZ
- Steffen Wagner
- Michael Hoppe
- Christian Herlambang
- Matthias Razum
Agenda 14.07.2011
Fulltext indexing
- enhanced with own xslt - questions from MPDL related to
- configuration of the search results output (rather than complete item/component/container)
- highlighting of search results (e.g. get the last page break tag)
- full text indexing for all FT visibility, searching according privileges and displaying snippets according privileges
- selective indexing from Admin tools
- incremental indexing
- Solr support and interfaces
Outcome on Fulltext indexing
- current approach is not bad: one document is generated with many file-highlight fields - but it has to be checked whether the limit of highlighted fields is 100, and what the performance issues are
- performance issues could also be caused by highlighting during indexing itself, but mostly by search performance itself
- more input from FIZ after analysis, as the problem is clear
- fulltext search
- search returns only items for which the user has privileges, but highlighting ignores per-fulltext restrictions: if the user has rights on one ft of an item and not on another ft, she still gets both highlights back.
- workaround: always exclude ft highlighting for non-public texts, or include visibility when populating the highlighting (again a potential performance cost)
- selective indexing
- would be good to prioritize it (from October)
- todo: send more info on internal script
- indexing fulltext only or metadata only (in selective reindexing) would not be possible (unless the indexes are split)
- incremental reindexing - does not work in 1.2 (to be checked in 1.3)
- Solr
- GSearch can index via Solr
- we have to create an XML document that Solr understands (similar to the one for Lucene)
- GSearch can do it
- however, many Solr settings have to be made, e.g. specific field configurations
- so practically Solr can be set up now, but it takes a lot of configuration
- todo: send config and test locally
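The "XML document which Solr understands" mentioned above could look roughly like the output of the following sketch. It builds a Solr XML update message (`<add><doc><field name="...">`) for one item. The field names `id`, `title` and `fulltext`, the function name and the sample values are assumptions for illustration only; the real fields depend on the Solr schema and on the GSearch stylesheet that still has to be configured.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(item_id, title, fulltexts):
    """Return a Solr XML update message (<add><doc>...) for one item.

    Field names are hypothetical; they must match the Solr schema in use.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        # Each indexed value becomes one <field name="..."> element.
        element = ET.SubElement(doc, "field", name=name)
        element.text = value

    field("id", item_id)
    field("title", title)
    for text in fulltexts:
        # One field per fulltext file; this assumes a multi-valued field.
        field("fulltext", text)

    return ET.tostring(add, encoding="unicode")


# Hypothetical sample item with two fulltext files.
xml_message = build_solr_add_xml(
    "escidoc:123", "Sample item", ["text of first file", "text of second file"]
)
print(xml_message)
```

A message of this shape would be posted to the Solr update handler; the field definitions themselves (types, stored/indexed, multi-valued) are part of the Solr configuration effort noted above.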
Future development plans
- Future development plans - short term roadmap on versions/features for release 1.4 to 2.0
- critical: internally managed vs. externally managed datastreams of MD records
- critical: namespace preservation bug in MD record
Outcome
- Admin tool 1.3 offers only repository information
- Digilib integration
- plans, ideas, replacements?
Scalability & Performance
- creation of items (MPDL provides some numbers)
- reindexing
- statistics - other store - faster and not dependent on escidoc-core and fedora?
- stress testing, mass data generation, monitoring of core service - how this is done internally at FIZ, with reference to the FIZ Fedora Performance and Scalability Wiki
- JBoss, other AS, Tomcat
- supporting newer versions of JBoss
- support for other AS
- Tomcat
- LTA (long-term archiving)
Outcome
- second VIRR instance with 2 million eSciDoc objects with 1.2
- there is a big scaling difference between 1.2 and 1.3 (as the database goes away)
- Springer texts by FIZ with 2.5 million items (6 weeks) - but it was fine
- performance is not worse - ingest started below 1 sec, but is a bit more than 1 sec with weaker HW
- ingest direct into Fedora - not recommended
- ingest interface (but reindex afterwards)
- it should also sometimes be reconsidered whether all ft (images, videos) should be in eSciDoc, or only the metadata (image viewers, video streamers etc.)
- performance is the big next step - not only scaling but also stabilization of services
- test data sets - one has to think about how this is to be done; one possibility is to rebuild Fedora and reindex via eSciDoc (but it depends on the data set size)
- Fedora tests - how to do something similar for eSciDoc?
- how to make a proper evaluation environment for test/monitoring/performance
- at the moment there is neither a possibility nor any effort
- todo: define a list of what needs to be checked
- scaling - eSciDoc services cannot be split among different servers - no horizontal scaling is possible at the moment, but indexing can be moved to a separate server
- high availability
- not yet natively possible, planned but not clear why
- workflow settings (content model, context?)
- items immediately released (with proper indexing afterwards) -> one call to the service
- item event log (update, insert comments?)
- Content Models
- Any plans from MPDL side?
- Pragmatic and iterative approach - some ideas
- SPO
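As a rough cross-check of the ingest figures in the Scalability & Performance outcome (2.5 million items in 6 weeks), the average time per item can be estimated as below. This is only a back-of-the-envelope sketch: it assumes the ingest ran continuously as a single stream for the full six weeks, which the notes do not state, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ingest_seconds(items, weeks):
    """Average wall-clock seconds per item, assuming a continuous ingest."""
    total_seconds = weeks * 7 * SECONDS_PER_DAY
    return total_seconds / items


# Figures from the notes: 2.5 million items, 6 weeks.
avg = average_ingest_seconds(2_500_000, 6)
print(f"{avg:.2f} s per item")
```

Under that assumption the average comes out at about 1.45 s per item, which is at least in the same range as the "a bit more than 1 sec" per ingest mentioned in the notes.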