Latest revision as of 07:43, 20 July 2011

Developer Workshop

Date: July 14/15, 2011
Place: Munich
Previous workshop: 22-23.09.2010, Karlsruhe

Participants MPDL

Wilhelm Frank
Lu Yu
Benjamin Knoth
Richard Bourke
Michael Franke
Marcus Haarländer
Natasa Bulatovic

Participants FIZ

Steffen Wagner
Michael Hoppe
Christian Herlambang
Matthias Razum

Agenda 14.07.2011

Fulltext indexing

questions related to indexing stylesheets enhanced with custom XSLT
- configuration of the search results output (rather than returning the complete item/component/container)
- highlighting of search results, e.g. getting the last page-break tag from a TEI fulltext (eSciDoc fulltext index)
  - each digitized book is represented by two items, each with one fulltext document
    - METS file: contains the table of contents with links to the digitized pages of the book (the digitized pages are image files stored externally on the file system)
    - TEI file: contains the fulltext (contents) of the book shown on the digitized pages, plus page-break links to the image where the content starts/ends
    - a fulltext search shall return exactly the matching snippet from the TEI file, together with the exact link to the image file
    - MPDL has written custom stylesheets in which the text between two page breaks is treated as a separate file and sent for highlighting together with the link
- fulltext indexing for all fulltext (FT) visibilities; searching and displaying snippets according to privileges
- selective indexing of resources from the Admin tool
- incremental indexing problems
Solr support and interfaces
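
The page-break approach described above (text between two page breaks treated as its own segment, paired with the image link) can be sketched roughly as follows. This is only an illustration in Python, not the MPDL stylesheets, which are XSLT. It assumes the TEI page breaks are `<pb>` elements carrying the image link in a `facs` attribute, and it uses regular expressions instead of a real XML parser; the function name `split_by_page_breaks` is hypothetical.

```python
import re

# Assumption: page breaks look like <pb n="2" facs="img/0002.jpg"/>.
PB_PATTERN = re.compile(r'<pb\b([^>]*)/?>')
FACS_PATTERN = re.compile(r'facs="([^"]*)"')
TAG_PATTERN = re.compile(r'<[^>]+>')


def _plain_text(fragment):
    """Strip markup and collapse whitespace in a TEI fragment."""
    return ' '.join(TAG_PATTERN.sub(' ', fragment).split())


def split_by_page_breaks(tei_body):
    """Return (image_link, text) pairs, one per non-empty page segment."""
    segments = []
    current_link = None  # text before the first <pb> has no image link
    position = 0
    for match in PB_PATTERN.finditer(tei_body):
        text = _plain_text(tei_body[position:match.start()])
        if text:
            segments.append((current_link, text))
        facs = FACS_PATTERN.search(match.group(1))
        current_link = facs.group(1) if facs else None
        position = match.end()
    tail = _plain_text(tei_body[position:])
    if tail:
        segments.append((current_link, tail))
    return segments
```

Each returned pair could then be indexed as one highlight field, so that a hit in the text carries the link to the page image it belongs to.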

Outcome on Fulltext indexing

current approach is not bad: one document is generated with many file-highlight fields, but it has to be checked whether the number of highlighted fields is limited to 100, and what the performance impact is
- performance issues could also be caused by highlighting during indexing itself, but mostly by search performance
- more input from FIZ after analysis, as the problem is clear
fulltext indexing in accordance with the privileges
- search returns items for which the user has privileges, but when searching fulltexts with restricted privileges, a user who has rights on one fulltext of an item and not on another will get both highlights back
  - workaround: always exclude fulltext highlighting for non-public texts, or include visibility when populating the highlighting (again a potential performance cost)
selective indexing
- would be good to prioritize it (from October)
- todo: send more info on internal script
- indexing fulltext only or metadata only (in selective reindexing) would not be possible (unless the indexes are split)
incremental reindexing - does not work in 1.2 (to be checked in 1.3)
Solr
- GSearch can index via Solr
- we have to create an XML document which Solr understands (similar to Lucene)
- GSearch can do it
- however, many Solr settings are needed, i.e. specific field configuration has to be done
- so in practice Solr can be set up now, but it takes a lot of configuration
- todo: send config and test locally
reformatting of search results is possible with an external stylesheet - todo: send more info
- stylesheet caching - defined in properties
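
As a rough illustration of the "XML document which Solr understands" mentioned above: Solr accepts update messages of the form `<add><doc><field name="...">...</field></doc></add>`. The Python sketch below only builds such a message; in the setup discussed here GSearch and its stylesheets would produce it. The field names used in the usage example (`id`, `fulltext`) are assumptions and would have to match the fields configured in the Solr schema.

```python
import xml.etree.ElementTree as ET


def build_solr_add_document(fields):
    """Serialize (name, value) pairs as a Solr XML <add><doc> message.

    Field names are placeholders; they must exist in the Solr schema.
    """
    add = ET.Element('add')
    doc = ET.SubElement(add, 'doc')
    for name, value in fields:
        field = ET.SubElement(doc, 'field', {'name': name})
        field.text = value
    return ET.tostring(add, encoding='unicode')


# Hypothetical usage with made-up field names and values.
message = build_solr_add_document([
    ('id', 'escidoc:1'),
    ('fulltext', 'Text between two page breaks'),
])
print(message)
```

Repeating a field name in the list yields a multi-valued field, which is one way the many file-highlight fields could be represented.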

Future development plans

Future development plans - short term roadmap on versions/features for release 1.4 to 2.0
- critical: internally managed vs. externally managed datastreams of MD records

outcome

1.4 would come by end of July
1.5 would come by October for eSciDoc days
see eSciDoc Org Wiki for preliminary roadmaps (comments are welcome)
2.0
- scalability
- performance

Digilib integration

Digilib integration
- plans, ideas, replacements?

outcome

alternatives have been considered
in digilib, only the transformation on a folder is implemented; the list of images is slow to load
digilib has to be used for the DL project
- alternatives are also being considered by MPDL
is digilib needed in eSciDoc if images are not in eSciDoc?
- closed until there is a bigger requirement for digilib/eSciDoc
- MPIWG people are working on digilib stream retrieval via URL

Admin Tool

Admin tool 1.3 offers only repository information

outcome

wrong version was delivered with 1.3
- the eSciDoc Admin tool can work with several instances at a time (per URL)
- role assignment in eSciDoc Admin - restricted to resources within the scope of the role
- better selection of the URLs of the core service, e.g. a property file and a pull-down list to select from
www.escidoc.org/artifactory/
- search for "admintool" - new version snapshot of the escidocadmin tool
- search for "ijc" - the newest snapshots from escidoc-core
SVN - not available from outside at the moment; work to make it public for read access is under way

Scalability&Performance

- creation of items (MPDL provides some numbers)
- reindexing
- statistics - other store - faster and not dependent on escidoc-core and fedora?
stress testing, mass data generation, monitoring of core service - how it is done internally at FIZ, with reference to the FIZ Fedora Performance and Scalability Wiki
JBoss, other AS, Tomcat
- supporting newer versions of JBoss
- support for other AS
- Tomcat
LTA Long term archiving

Outcome

second VIRR instance with 2 million eSciDoc objects with 1.2
- there is a big scaling difference between 1.2 and 1.3 (as the database goes away)
- Springer texts at FIZ with 2.5 million items (6 weeks) - but was fine
- performance is not worse: ingest started below 1 sec and finished at a bit more than 1 sec, on weaker hardware
- ingest direct into Fedora - not recommended
- ingest interface (but reindex afterwards is still necessary)
- it should also be reconsidered from time to time whether all fulltexts, i.e. component contents (images, videos), should be in eSciDoc, or only the metadata (image viewers, video streamers etc.)
- performance is the big next step - not only scaling but also stabilization of services
- test data sets - one has to think about how this is to be done; one possibility is to rebuild Fedora and reindex via eSciDoc (but it depends on the data set size)
- Fedora tests - how to do something similar for eSciDoc?
  - how to make a proper evaluation environment for test/monitoring/performance
  - at the moment there is neither a possibility nor an effort
  - todo: define a list of what needs to be checked
  - scaling - eSciDoc services cannot be split among different servers - no horizontal scaling is possible at the moment, but indexing can be moved to a separate server
  - high availability
    - not yet natively possible, planned
  - however, it has to be clarified what it means
  - a distributed environment can also be problematic - e.g. how do we make a backup
  - load balancing - only for read operations, for example
  - journaling mode - nice feature - will be considered
  - probably focus first on indexing/searching scalability and performance
- reindexing (more alternatives, see below)
  - distribute on different machines
  - sharding (1 index across several machines)
  - it would be better if several machines each held the complete index
  - optimization of the stylesheet transformation
  - gsearch can work with multiple threads, but this is not really practical, as it may lock the complete machine; therefore it is set to 1 thread
  - todo: check the possibility of running two or more single-threaded gsearch instances
  - keep indexes in memory and only write when a threshold is reached
- statistics - other store - faster and not dependent on escidoc-core and fedora?
  - MongoDB as store, BIRT as standard interface may be considered
- performance/scalability
  - potential to use scalable stores e.g. MongoDB (as configuration option - instead of Fedora or as Fedora storage) in parallel with Lucene
Performance priority: workflow settings (content model, context?)
- items immediately released (with proper indexing afterwards) -> one call to the service
Item event log (update, insert comments?)
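
The reindexing idea noted above (several single-threaded gsearch instances, or distributing work across machines) needs some rule for deciding which instance handles which object. A minimal Python sketch of one such rule follows, assuming object IDs are strings and each instance gets a fixed number. The helper names and the hash-based assignment are illustrative assumptions, not something gsearch or eSciDoc provides.

```python
import hashlib


def assign_instance(object_id, instance_count):
    """Map an object ID to an instance number in a stable way.

    A cryptographic hash is used only because its result does not
    change between runs or machines (unlike Python's built-in hash()).
    """
    digest = hashlib.sha1(object_id.encode('utf-8')).hexdigest()
    return int(digest, 16) % instance_count


def partition(object_ids, instance_count):
    """Split object IDs into one batch per indexer instance."""
    batches = [[] for _ in range(instance_count)]
    for object_id in object_ids:
        batches[assign_instance(object_id, instance_count)].append(object_id)
    return batches
```

Whether each instance then writes its own index shard or all feed one complete index is a separate decision, as the notes above point out.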

JBOSS, other AS, Tomcat

- work to remove dependencies on JBoss is in progress
- work to enable running in a servlet container is in progress
- next version after 1.4

SPO

Semantic Store Handler
not needed
interfaces will be deprecated

Other

todo: issue: add content relation in REST Client handler
todo: issue: container delete updates release (MPDL to specify)
critical: namespace preservation bug in MD record
- https://www.escidoc.org/jira/browse/INFR-947
- https://www.escidoc.org/jira/browse/INFR-1190
- todo: use the workaround for now; it will be fixed, but it is a more serious issue

Agenda 15.07.2011

OAI provider

transformations from item (not only from metadata record)?

outcome

- to check later

AA

- external roles
- standalone service

Outcome

it is possible to define a new role (policy); the policy evaluates an action, and the PDP methods evaluate the rights
evaluation is based on XML (expects the user-id and some attributes)
user must exist in escidoc and role must exist in escidoc
can it be completely standalone outside of Fedora?
- statistics are dependent on Fedora (primary key)
- otherwise it is completely independent of Fedora
not possible at the moment as a completely standalone component
- possible with limitations together with eSciDoc - external roles, external actions

Content Models

- Any plans from MPDL side?
- Pragmatic and iterative approach - some ideas

outcome

What can be defined in the content model?

schema for metadata record (already done in 1.2)
transformation for a "subview" or freemade representation of an item (already done in 1.2)

note: to clarify whether the above refers to an item representation or a metadata record representation --Natasa 09:37, 20 July 2011 (CEST)

- components - content categories, mime types
- validation method - for metadata schema compliance, and for item structure compliance regarding components, content categories, mime types
versioned content models?
- each resource references one content model
- when this reference points to a particular version, the "transition" could be done a bit more smoothly
- migration - stylesheet per content model for migrating between content model versions
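
The structure check discussed above (components validated against the content categories and mime types a content model allows) can be sketched as follows. This is a Python illustration under stated assumptions only: the classes `ContentModel` and `Component` and the function `validate_components` are hypothetical and do not reflect the eSciDoc content model format or API.

```python
from dataclasses import dataclass, field


@dataclass
class ContentModel:
    # Hypothetical shape: allowed content category -> accepted mime types.
    allowed: dict = field(default_factory=dict)


@dataclass
class Component:
    content_category: str
    mime_type: str


def validate_components(model, components):
    """Return a list of problems; an empty list means compliant."""
    problems = []
    for index, component in enumerate(components):
        mime_types = model.allowed.get(component.content_category)
        if mime_types is None:
            problems.append(
                f'component {index}: content category '
                f'{component.content_category!r} is not allowed')
        elif component.mime_type not in mime_types:
            problems.append(
                f'component {index}: mime type {component.mime_type!r} '
                f'is not allowed for {component.content_category!r}')
    return problems
```

Metadata schema compliance would be a separate check against the schema referenced by the content model; it is left out here.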

PubMan

- Migration plans to version 1.4 of eSciDoc Infrastructure
- Evaluation of eSciDoc Infrastructure Java Connector ("Java Client Library")

outcome

development version 6.3
- cone improvements
- Japanese language support
- problems with newer browsers (JavaScript)
next version can be on 1.3 (planned for eSciDoc days)
would be good to bring PubMan onto core service 1.3 or 1.4

Other eSciDoc Applications

Status of Digitization Lifecycle
- successor of VIRR
- to cover more institutes and bigger data scale
- books in eSciDoc (METS container with structure and item with TEI component for fulltext)
- images are in file system (Digilib)
- customization of stylesheets for fulltext (XML) search
- JSF 2.0
- first version at the end of September (VIRR equivalent in the first place - browse, view)
- possibly upload functionality by eSciDoc days
Status of Imeji
- demo
- further plans: imeji on Tomcat, RDF, switch to Java connector for escidoc-core, thesaurus
- timing not clear yet
Status of eSciDoc Browser
- demo, ideas

Building environment

eSciDoc building environment
- FIZ development plugins - JRebel, code formatting by build (code style, spaces, line breaks, line length - works a bit problematically with Javadoc)
  - static code analysis with Sonar
  - scrum room: code reports visible
- MPDL development plugins: Checkstyle (not intensively used), code template, JRebel
  - for more info see MPDL Building and Development Environment
  - for Selenium tests see General and PubMan and Selenium tests

outcome

todo: check the Maven versions used by both teams to ensure smooth building from sources for both core and PubMan code
todo: check the JBoss ports used in Maven builds
FIZ: maven-failsafe-plugin for integration tests - parallelization of tests - not much success with the plug-in
MPDL: uses Surefire (more applicable for unit testing)
test artifacts (to be used by both MPDL and FIZ) to ensure functioning of both core services and solutions, e.g. PubMan
todo: (now) publish source packages and in future open SVN

other

eSciDoc core 1.5 requires migration (planned for eSciDoc days)
- migration would be on FoXML
  - foxml
  - fedora rebuild
  - triple store
  - Lucene indexes?

Difference between revisions of "ESciDoc Developer Workshop 14 15 07 2011"
