Developer Workshop

Date: July 14/15, 2011
Place: Munich
Previous workshop: 22-23.09.2010, Karlsruhe

Participants MPDL

Participants FIZ

Steffen Wagner
Michael Hoppe
Christian Herlambang
Matthias Razum

Agenda 14.07.2011

Fulltext indexing

- enhanced with own XSLT - questions from MPDL relate to:
- configuration of the search results output (rather than returning the complete item/component/container)
- highlighting of search results (e.g. getting the last page break tag)
- full-text indexing for all full-text visibilities, with searching and snippet display according to privileges
- selective indexing from the Admin tools
- incremental indexing
- Solr support and interfaces

Outcome on Fulltext indexing

the current approach is not bad: one document is generated with many file-highlight fields - but it has to be checked whether the number of highlighted fields is limited to 100, and what the performance impact is
- performance issues could be caused by highlighting during indexing itself, but mostly by search performance itself
- FIZ will provide more input after analysis, as the problem is clear
fulltext search
- search returns the items for which the user has privileges, but when searching full texts with restricted privileges - if the user has rights on one full text of the item and not on another, she will get highlights back for both.
  - workaround: either always exclude full-text highlighting for non-public texts, or include visibility when populating the highlighting (again a potential performance cost)
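
The second workaround (taking visibility into account when populating the highlighting) could be sketched as a post-filter over the highlight snippets. This is only an illustration under assumptions: the `Snippet` structure, the visibility values and the `readable_component_ids` set are hypothetical and not part of the eSciDoc or GSearch API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """One highlight fragment; all field names are hypothetical."""
    component_id: str   # full-text component the fragment comes from
    visibility: str     # assumed values: "public" or "private"
    text: str           # highlighted fragment


def filter_snippets(snippets, readable_component_ids):
    """Keep a snippet if its full text is public or the user may read it."""
    return [
        s for s in snippets
        if s.visibility == "public" or s.component_id in readable_component_ids
    ]


hits = [
    Snippet("component:1", "public", "... <b>fedora</b> ..."),
    Snippet("component:2", "private", "... <b>fedora</b> internal ..."),
]
# The user has no rights on component:2, so only the public snippet remains.
visible = filter_snippets(hits, readable_component_ids=set())
```

Filtering after the search keeps the index unchanged, at the cost of one extra privilege check per snippet, which is where the performance concern noted above would show up.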
selective indexing
- would be good to prioritize it (from October)
- todo: send more info on the internal script
- indexing full text only or metadata only (in selective reindexing) would not be possible (unless the indexes are split)
incremental reindexing - does not work in 1.2 (to be checked with 1.3)
Solr
- GSearch can index via Solr
- we have to create an XML document that Solr understands (similar to the one for Lucene)
- GSearch can do it,
- however, many Solr settings are required, e.g. specific field configurations
- so in practice Solr can be set up now, but it takes a lot of configuration
- todo: send config and test locally
- reformatting of search results is possible with an external stylesheet - todo: send more info
- stylesheet caching - property
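
For orientation, the XML update message that Solr accepts has the shape `<add><doc><field name="...">...</field></doc></add>`. The sketch below builds such a message in Python only to show the target format; in the setup discussed above the document would presumably be produced by a GSearch stylesheet instead. The field names and the identifier are hypothetical and depend on the Solr schema that is configured.

```python
import xml.etree.ElementTree as ET


def build_solr_add(doc_fields):
    """Serialize (name, value) pairs as one Solr <add><doc> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names; the real ones depend on the Solr schema in use.
xml_message = build_solr_add([
    ("id", "escidoc:1234"),
    ("title", "Sample item"),
    ("fulltext", "Text extracted from the first component"),
])
```

Every field name used in such a message has to be declared in the Solr schema, which is the configuration effort mentioned in the notes.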

Future development plans

Future development plans - short-term roadmap of versions/features for releases 1.4 to 2.0
- critical: internally managed vs. externally managed datastreams of MD records
- critical: namespace preservation bug in MD record
  - https://www.escidoc.org/jira/browse/INFR-947
  - https://www.escidoc.org/jira/browse/INFR-1190

outcome

Digilib integration

Digilib integration
- plans, ideas, replacements?

outcome

alternatives have been considered
in digilib, only the transformation on a folder is implemented - the list of images is slow to load in digilib
digilib has to be used for the DL project
- alternatives are also being considered by MPDL
is digilib needed in eSciDoc if the images are not in eSciDoc?
topic closed until there is a bigger requirement for digilib/eSciDoc
MPIWG people are working on digilib stream retrieval per URL

Admin Tool

Admin tool 1.3 offers only repository information

outcome

the wrong version was delivered with 1.3
- the eSciDoc Admin tool can work with several instances at a time (per URL)
- role assignment in eSciDoc Admin is restricted to resources within the scope of the role
- better selection of the URLs of the core service, e.g. a property file and a pull-down list to select from
www.escidoc.org/artifactory/
- search for "admintool" - new snapshot version of the eSciDoc Admin tool
- search for "ijc" - the newest snapshots of escidoc-core
SVN is not available from outside at the moment - access is on the way

Scalability&Performance

- creation of items (MPDL provides some numbers)
- reindexing
- statistics - other store - faster and not dependent on escidoc-core and fedora?
stress testing, mass data generation, monitoring of the core service - how this is done internally at FIZ, with reference to the FIZ Fedora Performance and Scalability Wiki
JBoss, other AS, Tomcat
- supporting newer versions of JBoss
- support for other AS
- Tomcat
LTA (long-term archiving)

Outcome

second VIRR instance with 2 million eSciDoc objects with 1.2
- there is a big difference in scaling between 1.2 and 1.3 (as the database goes away)
- Springer texts at FIZ with 2.5 million items (6 weeks) - but it was fine
- performance did not get worse - ingest started below 1 sec and later took a bit more than 1 sec - with weaker hardware
- ingest direct into Fedora - not recommended
- ingest interface (but reindex afterwards)
- it should also be reconsidered case by case whether all full texts (images, videos) should be in eSciDoc, or only the metadata (image viewers, video streamers etc.)
- performance is the big next step - not only scaling but also stabilization of services
- test data sets - it has to be thought through how this is to be done; one possibility is to rebuild Fedora and reindex via eSciDoc (but it depends on the data set size)
- Fedora tests - how to do something similar for eSciDoc?
  - how to make a proper evaluation environment for test/monitoring/performance
  - at the moment there is neither a possibility nor any effort in this direction
  - todo: define a list of what needs to be checked
  - scaling - eSciDoc services cannot be split among different servers, so no horizontal scaling is possible at the moment, but indexing can be moved to a separate server
  - high availability
    - not yet natively possible, planned
  - however, it has to be clarified what it means
  - a distributed environment can also be problematic - e.g. how do we make a backup
  - load balancing - for read operations only, for example
  - journaling mode - nice feature - will be considered
  - as a first step, the focus will probably be on indexing/searching scalability
- reindexing (more alternatives, see below)
  - distribute on different machines
  - sharding (1 index spread over several machines)
  - it would be better if several machines each held the complete index
  - optimization of the stylesheet transformation
  - GSearch can work with multiple threads, but this is not practical, as it may lock up the complete machine - therefore it is set to 1 thread
  - two or more GSearch instances with 1 thread each
  - keep indexes in memory and only write when a threshold is reached
- statistics - other store - faster and not dependent on escidoc-core and fedora?
  - MongoDB as store, BIRT as standard interface
- performance/scalability
  - potential to use scalable stores e.g. MongoDB (as configuration option - instead of Fedora or as Fedora storage) in parallel with Lucene
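
The reindexing option of keeping index updates in memory and writing only when a threshold is reached can be illustrated with a small buffering wrapper. This is a generic sketch, not GSearch or Lucene code: the class name, the `flush` callback and the threshold value are assumptions chosen for the example.

```python
class BufferedIndexWriter:
    """Collects documents in memory and hands them over in batches."""

    def __init__(self, flush, threshold=1000):
        self._flush = flush          # callable that persists a list of documents
        self._threshold = threshold  # buffered document count that triggers a write
        self._buffer = []

    def add(self, document):
        self._buffer.append(document)
        if len(self._buffer) >= self._threshold:
            self.flush()

    def flush(self):
        # Write the buffered documents, if any, and start a fresh buffer.
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []


written_batches = []
writer = BufferedIndexWriter(flush=written_batches.append, threshold=3)
for n in range(7):
    writer.add({"id": n})
writer.flush()  # write the remainder at the end of the reindexing run
```

With a threshold of 3, the seven documents are written in three batches instead of seven single writes; the final explicit `flush()` is needed so that the last partial batch is not lost.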

Performance priority: workflow settings (content model, context?)
- items immediately released (with proper indexing afterwards) -> one call to the service
- item event log (update, insert comments?)

JBoss, other AS, Tomcat
- work is under way to remove dependencies on JBoss
- work is under way to enable running in a servlet container
- next version after 1.4

Content Models
- Any plans from MPDL side?
- Pragmatic and iterative approach - some ideas

SPO

to be checked again

Other

todo: issue: add content relation in REST client handler
todo: issue: container delete updates

Agenda 15.07.2011
