eSciDoc Developer Workshop 14.-15.07.2011

From MPDLMediaWiki
Revision as of 07:43, 20 July 2011 by Natasab

Developer Workshop

Participants MPDL

  • Wilhelm Frank
  • Lu Yu
  • Benjamin Knoth
  • Richard Bourke
  • Michael Franke
  • Marcus Haarländer
  • Natasa Bulatovic

Participants FIZ

  • Steffen Wagner
  • Michael Hoppe
  • Christian Herlambang
  • Matthias Razum

Agenda 14.07.2011

Fulltext indexing

  • questions related to indexing stylesheets enhanced with custom XSLT
    • configuration of the search results output (rather than complete item/component/container)
    • highlighting of search results, e.g. get the last page-break tag from a TEI fulltext (eSciDoc fulltext index)
      • each digitized book is represented by two items, each with 1 fulltext document
        • METS file - containing table of contents with links to digitized pages of the book (where digitized pages are image files stored externally on the file system)
        • TEI file - containing the fulltext (contents) of the book shown on the digitized pages and page-break links to the images where the content starts/ends
        • when searching in a fulltext, the search shall return exactly the correct snippet from the TEI file, with the exact link to the image file
        • MPDL has written some custom stylesheets in which the text between two page breaks is treated as a separate file and sent for highlighting together with the link
    • fulltext indexing for all FT visibilities, with searching and display of snippets according to privileges
    • selective indexing of resources from Admin tool
    • incremental indexing problems
  • Solr support and interfaces
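
The page-break idea from the questions above (text between two page breaks handled as a separate unit and sent for highlighting together with its image link) can be sketched as follows. This is only a minimal sketch, not MPDL's actual stylesheet: it uses Python instead of XSLT and assumes that page breaks are TEI `<pb>` elements in the TEI namespace whose `facs` attribute holds the image link. The function name `split_by_page_breaks` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: page breaks are TEI <pb> elements in the standard TEI namespace.
PB_TAG = "{http://www.tei-c.org/ns/1.0}pb"


def split_by_page_breaks(tei_xml):
    """Return a list of (image_link, page_text) pairs in document order.

    Text that appears before the first <pb> is ignored in this sketch.
    """
    root = ET.fromstring(tei_xml)
    pages = []  # each entry: [facs value, list of text chunks]

    def walk(elem):
        # A page break opens a new page; all following text belongs to it.
        if elem.tag == PB_TAG:
            pages.append([elem.get("facs"), []])
        if elem.text and pages:
            pages[-1][1].append(elem.text)
        for child in elem:
            walk(child)
            # The tail is the text that follows the child's closing tag.
            if child.tail and pages:
                pages[-1][1].append(child.tail)

    walk(root)
    # Join chunks with a space and normalize whitespace.
    return [(facs, " ".join(" ".join(chunks).split())) for facs, chunks in pages]


sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<pb facs="page1.jpg"/><p>First page text.</p>'
    '<p>More on page one <pb facs="page2.jpg"/>and onto page two.</p>'
    "</body></text></TEI>"
)

for image_link, text in split_by_page_breaks(sample):
    print(image_link, "->", text)
```

Each returned pair could then be written to its own highlight field, so that a hit can be linked to exactly one page image.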

Outcome on Fulltext indexing

  • current approach is not bad: one document is generated with many file-highlight fields - but it has to be checked whether the number of highlighted fields is limited to 100, and what the performance impact is
    • performance issues could also be caused by highlighting during indexing itself, but mostly stem from search performance itself
    • more input from FIZ after analysis, as the problem is clear
  • fulltext indexing in accordance with the privileges
    • search returns items for which the user has privileges, but when searching fulltexts with restricted privileges, a user who has rights on one FT of an item and not on another FT will still get both highlights back
      • workaround: always exclude FT highlighting for non-public texts, or include visibility when populating the highlighting (again a potential performance cost)
  • selective indexing
    • would be good to prioritize it (from October)
    • todo: send more info on internal script
    • indexing fulltext only or metadata only (in selective reindexing) would not be possible (unless the indexes are split)
  • incremental reindexing - does not function in 1.2 (to be checked by 1.3)
  • Solr
    • GSearch can index via Solr
    • we have to create an XML document which Solr understands (similar to Lucene)
    • GSearch can do it
    • however, many Solr settings have to be made, e.g. the configuration of specific fields
    • so practically Solr can be set up now, but it takes a lot of configuration
    • todo: send config and test locally
  • reformatting of search results is possible with external stylesheet - todo: send more info
    • stylesheet caching - defined in properties
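
As an illustration of the "XML document which Solr understands", the sketch below builds a document in Solr's XML update format (`<add><doc><field name="...">`). The field names (`id`, `title`, `fulltext`), the object ID and the helper name `build_solr_add` are assumptions for illustration only; real field names would have to match the field configuration in the Solr schema, which is the configuration effort noted above.

```python
import xml.etree.ElementTree as ET


def build_solr_add(doc_id, fields):
    """Build a Solr XML update message for one document.

    `fields` maps a field name to a string or a list of strings
    (a list produces a multi-valued field).
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical item with one title and two fulltext pages.
message = build_solr_add(
    "escidoc:123",
    {"title": "Sample book", "fulltext": ["text of page 1", "text of page 2"]},
)
print(message)
```

A message like this would be posted to the update handler of a Solr core; in the setup discussed here, GSearch would produce the equivalent document through its indexing stylesheet.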

Future development plans

  • Future development plans - short term roadmap on versions/features for release 1.4 to 2.0
    • critical: Internal managed vs. externally managed datastreams of MD records

outcome

  • 1.4 would come by end of July
  • 1.5 would come by October for eSciDoc days
  • see eSciDoc Org Wiki for preliminary roadmaps (comments are welcome)
  • 2.0
    • scalability
    • performance

Digilib integration

  • Digilib integration
    • plans, ideas, replacements?

outcome

  • alternatives have been considered
  • in digilib, only the transformation on a folder is implemented; the list of images is slow to load in digilib
  • digilib has to be used for the DL project
    • alternatives are also being considered by MPDL
  • is digilib needed in escidoc if images are not in escidoc?
    • closed until there is a bigger requirement for digilib/escidoc
    • MPIWG people work on digilib stream retrieval via URL

Admin Tool

  • Admin tool 1.3 offers only repository information

outcome

  • wrong version was delivered with 1.3
    • eSciDoc Admin tool can work with several instances at a time (via URL)
    • role-assignment in eSciDoc Admin - restricted to resources of scope of the role
    • better selection of the urls to the core service e.g. property file and pulldown-list to select
  • www.escidoc.org/artifactory/
    • search for "admintool" - new version snapshot of the escidocadmin tool
    • search for "ijc" - the newest snapshots from escidoc-core
  • SVN is not available from outside at the moment - work to make it public for read access is under way

Scalability & Performance

    • creation of items (MPDL provides some numbers)
    • reindexing
    • statistics - other store - faster and not dependent on escidoc-core and fedora?
  • stress testing, mass data generation, monitoring of core service - how it is done internally at FIZ, with reference to the FIZ Fedora Performance and Scalability Wiki
  • JBoss, other AS, Tomcat
    • supporting newer versions of JBoss
    • support for other AS
    • Tomcat
  • LTA Long term archiving

Outcome

  • second VIRR instance with 2 million eSciDoc objects with 1.2
    • there is a big scaling difference between 1.2 and 1.3 (as the database goes away)
    • Springer texts at FIZ with 2.5 million items (6 weeks) - but it was fine
    • performance did not get much worse - ingest started at below 1 sec and finished at a bit more than 1 sec - on weaker hardware
    • ingest direct into Fedora - not recommended
    • ingest interface (but reindex afterwards is still necessary)
    • it should also be reconsidered from case to case whether all fulltexts, i.e. component contents (images, videos), should be in eSciDoc, or only the metadata (image viewers, video streamers etc.)
    • performance is the big next step - not only scaling but also stabilization of services
    • test data sets - it has to be thought through how this is to be done; one possibility is to rebuild Fedora and reindex via eSciDoc (but it depends on the data set size)
    • Fedora tests - how to set up something similar for eSciDoc?
      • how to set up a proper evaluation environment for testing/monitoring/performance
      • at the moment there is neither a possibility nor any effort under way
      • todo: define a list of what needs to be checked
      • scaling - escidoc services cannot be split among different servers - no horizontal scaling is possible at the moment, but indexing can be moved to a separate server
      • high availability
        • not yet natively possible, planned
      • however, it is to be clarified what it means
      • distributed environment can also be problematic - e.g. how to make a backup
      • load balancing - e.g. for read operations only
      • journaling mode - nice feature - will be considered
      • as a first step, the focus will probably be mostly on indexing/searching scalability and performance
    • reindexing (more alternatives, see below)
      • distribute on different machines
      • sharding (1 index on more machines)
      • better would be several machines each holding the complete index
      • optimization of the stylesheet transformation
      • GSearch can work with multiple threads, but this is not really practical, as it may lock up the complete machine - therefore it is set to 1 thread
      • todo: check the possibility to run two or more single-threaded GSearch instances
      • keep indexes in memory and only write when a threshold is reached
    • statistics - other store - faster and not dependent on escidoc-core and fedora?
      • MongoDB as store, BIRT as standard interface may be considered
    • performance/scalability
      • potential to use scalable stores e.g. MongoDB (as configuration option - instead of Fedora or as Fedora storage) in parallel with Lucene
  • Performance prio: workflow settings (content model, context?)
    • items immediately released (with proper indexing afterwards) -> one call to the service
  • Item event log (update, insert comments?)
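
The sharding alternative mentioned under reindexing (1 index spread over several machines) could be sketched as below. This is only an illustration of the general technique, not a feature of eSciDoc or GSearch: the function names and the object IDs are hypothetical, and a stable hash (CRC32) is used so that the same object always lands on the same shard.

```python
import zlib


def shard_for(object_id, shard_count):
    """Map an object ID to a shard number in the range 0..shard_count-1."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # CRC32 is stable across runs and machines, unlike Python's built-in hash().
    return zlib.crc32(object_id.encode("utf-8")) % shard_count


def partition(object_ids, shard_count):
    """Group object IDs into one list per shard."""
    shards = [[] for _ in range(shard_count)]
    for object_id in object_ids:
        shards[shard_for(object_id, shard_count)].append(object_id)
    return shards


# Hypothetical IDs; each inner list would be indexed by one machine.
ids = ["escidoc:%d" % n for n in range(1, 11)]
for number, members in enumerate(partition(ids, 3)):
    print("shard", number, members)
```

A search would then have to query all shards and merge the results, which is the trade-off against the other alternative noted above, where several machines each hold the complete index.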

JBoss, other AS, Tomcat

    • work to remove dependencies on JBoss is in progress
    • work to enable running in a servlet container is in progress
    • next version after 1.4

SPO

  • Semantic Store Handler
  • not needed
  • interfaces will be deprecated

Other

Agenda 15.07.2011

OAI provider

  • transformations from item (not only from metadata record)?

outcome

    • to check later

AA

    • external roles
    • standalone service

Outcome

  • it is possible to define a new role (policy); the policy evaluates an action, and the PDP methods evaluate the rights
  • evaluation based on the XML (expects the user-id and some attributes)
  • user must exist in escidoc and role must exist in escidoc
  • can it be completely standalone outside of Fedora?
    • statistics are dependent on Fedora (primary key)
    • otherwise it is completely independent of Fedora
  • not possible at the moment as completely standalone component
    • possible with limitations together with eSciDoc - external roles, external actions

Content Models

    • Any plans from MPDL side?
    • Pragmatic and iterative approach - some ideas

outcome

What can be defined in the content model?

  • schema for metadata record (already done in 1.2)
  • transformation for a "subview" or freemade representation of an item (already done in 1.2)
note: to be clarified whether it is the item representation or the metadata record representation that is referenced above. --Natasa 09:37, 20 July 2011 (CEST)
    • components - content categories, mime types
    • validation method - for metadata schema compliance, and for item structure compliance with components, content categories, mime types
  • versioned content models?
    • each resource references one content model
    • when this reference points to a particular version, the "transition" could be done a bit more smoothly
    • migration - stylesheet for migrating between different versions of a content model
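
The validation method discussed above (item structure compliance with components, content categories and mime types) could look roughly like the following sketch. It is not the eSciDoc content model implementation: the classes `Component` and `ContentModel`, the function `validate_item` and the sample categories are all hypothetical, and metadata schema validation is left out.

```python
from dataclasses import dataclass


@dataclass
class Component:
    content_category: str
    mime_type: str


@dataclass
class ContentModel:
    # Hypothetical structure: content category -> set of allowed mime types.
    allowed: dict


def validate_item(components, model):
    """Return a list of error messages; an empty list means the item complies."""
    errors = []
    for index, component in enumerate(components):
        allowed_mime_types = model.allowed.get(component.content_category)
        if allowed_mime_types is None:
            errors.append(
                "component %d: content category %r is not allowed"
                % (index, component.content_category)
            )
        elif component.mime_type not in allowed_mime_types:
            errors.append(
                "component %d: mime type %r is not allowed for category %r"
                % (index, component.mime_type, component.content_category)
            )
    return errors


book_model = ContentModel(
    allowed={"fulltext": {"application/xml"}, "image": {"image/jpeg", "image/tiff"}}
)
item = [Component("fulltext", "application/xml"), Component("image", "image/png")]
for error in validate_item(item, book_model):
    print(error)
```

With versioned content models, the same check could be run against the specific model version that a resource references.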

PubMan

    • Migration plans to version 1.4 of eSciDoc Infrastructure
    • Evaluation of eSciDoc Infrastructure Java Connector ("Java Client Library")

outcome

  • development version 6.3
    • CoNE improvements
    • Japanese language support
    • problems with newer browsers (JavaScript)
  • next version can be on 1.3 (planned for eSciDoc days)
  • would be good to bring PubMan onto core service 1.3 or 1.4

Other eSciDoc Applications

  • Status of Digitization Lifecycle
    • successor of VIRR
    • to cover more institutes and bigger data scale
    • books in eSciDoc (METS container with structure and item with TEI component for fulltext)
    • images are in file system (Digilib)
    • customization of stylesheets for fulltext (xml) search
    • JSF 2.0
    • first version at the end of September (a VIRR equivalent in the first place - browse, view)
    • possibly upload functionality by eSciDoc days
  • Status of Imeji
    • demo
    • further plans: imeji on Tomcat, RDF, switch to Java connector for escidoc-core, thesaurus
    • timing not clear yet
  • Status of eSciDoc Browser
    • demo, ideas

Building environment

  • eSciDoc building environment
    • FIZ development plugins - JRebel, code formatting during build (code style, spaces, line breaks, line length - works a bit problematically with Javadoc)
      • static code analysis with Sonar
      • scrum room: code reports visible
    • MPDL development plugins: Checkstyle (not intensively used), code template, JRebel

outcome

  • todo: check on Maven versions used by both teams to ensure smooth building from sources on both core/pubman code
  • todo: check on JBoss ports used in maven builds
  • FIZ: maven-failsafe-plugin for integration tests - parallelization of tests - not too much success with the plug-in
  • MPDL: uses surefire (more applicable for unit testing)
  • test artifacts (to be used by both MPDL and FIZ) to ensure functioning of both core services and solutions, e.g. PubMan
  • todo: (now) publish source packages and in future open SVN

other

  • eSciDoc core 1.5 requires migration (planned for eSciDoc days)
    • migration would be on FOXML
      • FOXML
      • fedora rebuild
      • triple store
      • Lucene indexes?