ESciDoc Developer Workshop 14 15 07 2011

Developer Workshop

 * Date: July 14/15, 2011
 * Place: Munich
 * Previous workshop 22-23.09.2010, Karlsruhe

Participants MPDL

 * Wilhelm Frank
 * Lu Yu
 * Benjamin Knoth
 * Richard Bourke
 * Michael Franke
 * Marcus Haarländer
 * Natasa Bulatovic

Participants FIZ

 * Steffen Wagner
 * Michael Hoppe
 * Christian Herlambang
 * Matthias Razum

=Agenda 14.07.2011=

Fulltext indexing

 * questions related to indexing stylesheets enhanced with own xslt
 * configuration of the search results output (rather than complete item/component/container)
 * highlighting of search results e.g. get the last page break tag from a TEI fulltext(escidoc fulltext index)
 * each digitized book is represented with two items each with 1 fulltext document
 * METS file - containing table of contents with links to digitized pages of the book (where digitized pages are image files stored externally on the file system)
 * TEI file - containing the fulltext (contents) of the book shown on the digitized pages, and page-break links to the images where the content starts/ends
 * a fulltext search shall return exactly the correct snippet from the TEI file, with the exact link to the image file
 * MPDL had done some custom stylesheets, where text between two page-breaks is "treated" as separate files sent for highlighting together with the link
 * fulltext indexing for all FT visibility, searching according to privileges and displaying snippets according to privileges
 * selective indexing of resources from Admin tool
 * incremental indexing problems
 * Solr support and interfaces
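
The page-break approach in the list above (text between two page breaks treated as a separate unit and sent for highlighting together with its link) can be sketched roughly as follows. This is only a minimal Python illustration, not the MPDL stylesheets themselves, which are XSLT. It assumes that page breaks are encoded as TEI `pb` elements whose `facs` attribute holds the image link; the function name `split_by_page_break` is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace; the use of <pb facs="..."/> for page breaks is an assumption.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def split_by_page_break(tei_xml):
    """Return a list of (image_link, text) pairs, one per <pb/> milestone.

    Text that appears before the first <pb/> is ignored.
    """
    root = ET.fromstring(tei_xml)
    pages = []       # each entry: [image_link, [text chunks]]
    current = None   # page currently being filled

    def walk(elem):
        nonlocal current
        if elem.tag == TEI_NS + "pb":
            # A page break starts a new page unit.
            current = [elem.get("facs"), []]
            pages.append(current)
        elif current is not None and elem.text:
            current[1].append(elem.text)
        for child in elem:
            walk(child)
            # Text following a child belongs to whatever page is open now.
            if current is not None and child.tail:
                current[1].append(child.tail)

    walk(root)
    # Collapse runs of whitespace so each page yields one clean string.
    return [(link, " ".join("".join(chunks).split())) for link, chunks in pages]
```

Each returned pair could then be indexed as its own highlight field, so that a hit can be shown together with the image link of the page it occurs on.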

Outcome on Fulltext indexing

 * current approach is not bad: one document is generated with many file-highlight fields - but it has to be checked whether the number of highlighted fields is limited to 100, and what the performance impact is
 * performance issues could also be caused by highlighting during indexing itself, but mostly by search performance itself
 * more input from FIZ after analysis, as the problem is clear
 * fulltext indexing in accordance with the privileges
 * search returns items for which the user has privileges, but in fulltext search with restricted privileges, if the user has rights on one fulltext of the item and not on the other, she will still get both highlights back
 * workaround: always exclude fulltext highlighting for non-public texts, or include visibility when populating the highlighting (again a potential performance cost)
 * selective indexing
 * would be good to prioritize it (from October)
 * todo: send more info on internal script
 * index fulltext only or metadata only (in selective reindexing) - would not be possible (unless the indexes are split)
 * incremental reindexing - does not function in 1.2 (to be checked by 1.3)
 * Solr
 * gSearch can index via Solr
 * we have to create an XML document which Solr understands (similar to Lucene)
 * gSearch can do it
 * however, many Solr settings have to be made, e.g. configuration of specific fields
 * so practically Solr can be set up now, but it takes a lot of configuration
 * todo: send config and test locally
 * reformatting of search results is possible with external stylesheet - todo: send more info
 * stylesheet caching - defined in properties
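
As an illustration of the "XML document which Solr understands" mentioned above: Solr's XML update format wraps each document in `add`/`doc` elements with named `field` children. The following is a minimal Python sketch under stated assumptions. The field names `page_image` and `page_text` and the function `build_solr_add` are hypothetical, the repeated fields would have to be declared as multi-valued in the Solr schema, and the output of the actual gSearch stylesheets may look different.

```python
import xml.etree.ElementTree as ET


def build_solr_add(doc_id, metadata, pages):
    """Build a Solr <add> message for one item.

    metadata: dict of field name -> value (hypothetical field names)
    pages:    list of (image_link, text) pairs, one per digitized page
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        # ElementTree escapes &, < and > in text on serialization.
        elem = ET.SubElement(doc, "field", name=name)
        elem.text = value

    field("id", doc_id)
    for name, value in metadata.items():
        field(name, value)
    for image_link, text in pages:
        # Repeated names require multi-valued fields in the Solr schema.
        field("page_image", image_link)
        field("page_text", text)
    return ET.tostring(add, encoding="unicode")
```

The resulting string would be posted to the update handler of a Solr instance; which fields are stored, indexed or highlighted is exactly the kind of schema configuration the notes above refer to.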

Future development plans

 * Future development plans - short term roadmap on versions/features for release 1.4 to 2.0
 * critical: Internal managed vs. externally managed datastreams of MD records

outcome

 * 1.4 would come by end of July
 * 1.5 would come by October for eSciDoc days
 * see eSciDoc Org Wiki for preliminary roadmaps (comments are welcome)
 * 2.0
 * scalability
 * performance

Digilib integration

 * Digilib integration
 * plans, ideas, replacements?

outcome

 * alternatives have been considered
 * at digilib - only the transformation on folder is implemented - the list of the images is slow to load in digilib
 * digilib has to be applied for DL project
 * alternatives as well considered by MPDL
 * is digilib needed in escidoc if images are not in escidoc?
 * closed until there is a bigger requirement for digilib/escidoc
 * MPIWG people work on digilib stream retrieval per url

Admin Tool

 * Admin tool 1.3 offers only repository information

outcome

 * wrong version was delivered with 1.3
 * escidoc Admin tool can work with more instances at one time (per URL)
 * role-assignment in eSciDoc Admin - restricted to resources of scope of the role
 * better selection of the urls to the core service e.g. property file and pulldown-list to select
 * www.escidoc.org/artifactory/
 * search for "admintool" - new version snapshot of the escidocadmin tool
 * search for "ijc" - the newest snapshots from escidoc-core
 * SVN - not available from outside at the moment - work to make it public for read access is under way

Scalability&Performance

 * creation of items (MPDL provides some numbers)
 * reindexing
 * statistics - other store - faster and not dependent on escidoc-core and fedora?
 * stress testing, mass data generation, monitoring of core service - how is done internally at FIZ with reference to FIZ Fedora Performance and Scalability Wiki
 * JBoss, other AS, Tomcat
 * supporting newer versions of JBoss
 * support for other AS
 * Tomcat
 * LTA Long term archiving

Outcome

 * second VIRR instance with 2 million eSciDoc objects with 1.2
 * there is a big difference between 1.2 and 1.3 with big scaling difference (as the database goes away)
 * Springer texts by FIZ with 2.5 million items (6 weeks) - but was fine
 * performance - did not get much worse - ingest started below 1 sec and finished at a bit more than 1 sec - with weaker hardware
 * ingest direct into Fedora - not recommended
 * ingest interface (but reindex afterwards is still necessary)
 * it should also be reconsidered at some point whether all fulltexts, i.e. component contents (images, videos), should be in eSciDoc, or only metadata should be in eSciDoc (image viewers, video streamers etc.)
 * performance is the big next step - not only scaling but also stabilization of services
 * test data sets - needs some thought on how this is to be done; one possibility is to rebuild Fedora and reindex via eSciDoc (but it depends on the data set size)
 * Fedora tests- how to make similar for eSciDoc?
 * how to make a proper evaluation environment for test/monitoring/performance
 * at the moment there is neither a possibility nor any effort
 * todo: define a list of what needs to be checked
 * scaling - escidoc services cannot be split among different servers - no horizontal scaling is possible at the moment, but indexing can be moved to a separate server
 * high availability
 * not yet natively possible, planned
 * however to be clarified what it means
 * distributed environment can also be problematic - e.g. how we make a backup
 * load balancing - only read operation for example
 * journaling mode - nice feature - will be considered
 * probably focus most for first step on indexing/searching scalability and performance
 * reindexing (more alternatives, see below)
 * distribute on different machines
 * sharding (1 index on more machines)
 * more optimal would be when more machines hold complete index
 * optimization of the stylesheet transformation
 * gsearch can work with multiple threads, but this is not really practical - as it may lock the complete machine - therefore it is set to 1 thread
 * todo: check the possibility to run two or more gsearch instances each single thread
 * keep indexes in memory and only write when a threshold is reached
 * statistics - other store - faster and not dependent on escidoc-core and fedora?
 * mongoDb as store, BIRT as standard interface may be considered
 * performance/scalability
 * potential to use scalable stores e.g. MongoDB (as configuration option - instead of Fedora or as Fedora storage) in parallel with Lucene
 * Performance prio: workflow settings (content model, context?)
 * items immediately released (with proper indexing afterwards) -> one call to the service
 * Item event log (update, insert comments?)
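
One of the indexing ideas noted above, keeping index updates in memory and only writing when a threshold is reached, can be sketched as follows. This is a generic Python illustration, not gSearch or Lucene code; the class `BufferedIndexWriter`, the `flush_fn` callback and the default threshold are all hypothetical.

```python
class BufferedIndexWriter:
    """Collect index documents in memory and write them in batches.

    flush_fn is a caller-supplied function that receives a list of
    documents and writes them to the real index (hypothetical interface).
    """

    def __init__(self, flush_fn, threshold=1000):
        self._flush_fn = flush_fn
        self._threshold = threshold
        self._pending = []

    def add(self, doc):
        """Buffer one document; write the batch once the threshold is hit."""
        self._pending.append(doc)
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Write all buffered documents, if any, and clear the buffer."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush_fn(batch)
```

The trade-off in such a design is that buffered documents are not searchable, and would be lost on a crash, until `flush()` has run, so the caller has to flush explicitly at the end of a reindexing run.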

JBOSS, other AS, Tomcat

 * work to remove dependencies on JBoss is in progress
 * work to enable running in a servlet container is in progress
 * next version after 1.4

SPO

 * Semantic Store Handler
 * not needed
 * interfaces will be deprecated

Other

 * toDo: issue: Add content relation in REST Client handler
 * todo: issue container delete updates release (MPDL to specify)
 * critical: namespace preservation bug in MD record
 * https://www.escidoc.org/jira/browse/INFR-947
 * https://www.escidoc.org/jira/browse/INFR-1190
 * to do: at the moment use the workaround, will be fixed, but is more serious issue

=Agenda 15.07.2011=

OAI provider

 * transformations from item (not only from metadata record)?

outcome

 * to check later

AA

 * external roles
 * standalone service

Outcome

 * it is possible to define a new role (policy); the policy evaluates an action, and the PDP methods evaluate the rights
 * evaluation based on the XML (expects the user-id and some attributes)
 * user must exist in escidoc and role must exist in escidoc
 * can it be completely standalone outside of Fedora?
 * statistics are dependent on Fedora (primary key)
 * otherwise it is completely independent of Fedora
 * not possible at the moment as completely standalone component
 * possible with limitations together with eSciDoc - external roles, external actions

Content Models

 * Any plans from MPDL side?
 * Pragmatic and iterative approach - some ideas

outcome
What can be defined in the content model?
 * schema for metadata record (already done in 1.2)
 * transformation for a "subview" or freemade representation of an item (already done in 1.2)
 * note - to clarify: is it the item representation or the metadata record representation that is referenced above? --Natasa 09:37, 20 July 2011 (CEST)


 * components - content categories, mime types
 * validation method - for metadata schema compliance, item structure compliance with components, content categories, mime types
 * versioned content models?
 * each resource references one content model
 * when this reference is made into a particular version then "transition" could be done a bit more smoothly
 * migration - stylesheet for migrating between different versions of a content model
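
The validation idea above (checking an item's components against the content categories and mime types a content model allows) could look roughly like this. It is a minimal Python sketch on plain dictionaries, not the eSciDoc content model format; the `allowed` mapping, the component keys and the function `validate_item` are hypothetical.

```python
def validate_item(content_model, components):
    """Check components against a content model; return a list of errors.

    content_model: {"allowed": {content_category: set of MIME types}}
    components:    list of dicts with "content_category" and "mime_type"
    (both structures are hypothetical, for illustration only)
    """
    allowed = content_model["allowed"]
    errors = []
    for index, comp in enumerate(components):
        category = comp.get("content_category")
        mime = comp.get("mime_type")
        if category not in allowed:
            errors.append(
                f"component {index}: content category {category!r} not allowed"
            )
        elif mime not in allowed[category]:
            errors.append(
                f"component {index}: MIME type {mime!r} not allowed for {category!r}"
            )
    return errors
```

An empty result means the item structure complies. With versioned content models, the same check could be run against the specific model version that the resource references.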

PubMan

 * Migration plans to version 1.4 of eSciDoc Infrastructure
 * Evaluation of eSciDoc Infrastructure Java Connector ("Java Client Library")

outcome

 * development version 6.3
 * cone improvements
 * japanese language support
 * problems with newer browsers (JavaScript)
 * next version can be on 1.3 (planned for eSciDoc days)
 * would be good to bring pubman on core service 1.3 or 1.4

Other eSciDoc Applications

 * Status of Digitization Lifecycle
 * successor of VIRR
 * to cover more institutes and bigger data scale
 * books in eSciDoc (mets container with structure and item with TEI component for fulltext)
 * images are in file system (Digilib)
 * customization of stylesheets for fulltext (xml) search
 * jsf2.0
 * end of september the first version (VIRR equivalent in first place - browse, view)
 * eventual upload functionality until eSciDoc days
 * Status of Imeji
 * demo
 * further plans: imeji on tomcat, rdf, switch to java connector for escidoc-core, thesaurus
 * timing not clear yet
 * Status of eSciDoc Browser
 * demo, ideas

Building environment

 * eSciDoc building environment
 * FIZ development plugins - JRebel, code formatting by build (code style, spaces, line breaks, line length - a bit problematic with javadoc)
 * static code analysis with Sonar
 * scrum room: code reports visible
 * MPDL development plugins: checkstyle (not intensively used), code template, JRebel
 * more info see here MPDL Building and Development Environment
 * selenium tests see General and PubMan and Selenium tests

outcome

 * todo: check on Maven versions used by both teams to ensure smooth building from sources on both core/pubman code
 * todo: check on JBoss ports used in maven builds
 * FIZ: maven-failsafe-plugin for integration tests - parallelization of tests - not too much success with the plug-in
 * MPDL: uses surefire (more applicable for unit testing)
 * test artifacts (to be used by both MPDL and FIZ) to ensure functioning of both core services and solutions, e.g. PubMan
 * todo: (now) publish source packages and in future open SVN

other

 * 1.5 eSciDoc core requires migration (planned for eSciDoc days)
 * migration would be on FoXML
 * foxml
 * fedora rebuild
 * triple store
 * Lucene indexes?