ESciDoc Developer Workshop 2009-01-13

ESciDoc,NIMS

Date: 13.01.2009 Start time: 14:30

Location: Karlsruhe, München (Video conference)

Participants MPDL: Wilhelm Frank, Natasa Bulatovic

Participants FIZ: Harald Kappus, Matthias Razum, Frank Schwichtenberg

=Agenda=

Previous workshop

 * ESciDoc_Developer_Workshop_2008-12-16

Next workshop

 * ESciDoc_Developer_Workshop_2009-01-20

Administrative search requirements for PubMan

 * See PubMan Administrative search

Outcome

 * to check access rights
 * not certain how to do this, or whether it is possible with a Lucene index
 * not certain whether indexing is fast enough to be done synchronously
 * filters will be discarded because administrative search requires fulltext search
 * it will be OK to index fulltext asynchronously
 * in this case it will take about 10 seconds until the search index is updated
 * to check whether there is another tool besides Lucene for this purpose
 * another issue:
 * a solution needs to be found for metadata that is not populated
 * no exact solution yet
 * after Japanese indexing, this will be the priority, followed by the administrative search
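
The asynchronous indexing option discussed above can be illustrated with a minimal sketch. It is not the eSciDoc implementation: the class `AsyncIndexer`, its methods, and the in-memory dictionary standing in for a Lucene index are all hypothetical. It only shows the trade-off from the minutes: the write path returns immediately, and a document becomes searchable a short while later.

```python
import queue
import threading


class AsyncIndexer:
    """Toy in-memory index filled by a background worker (hypothetical names)."""

    def __init__(self):
        self._pending = queue.Queue()   # documents waiting to be indexed
        self._index = {}                # term -> set of object ids
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, object_id, fulltext):
        """Called on the write path; returns without waiting for indexing."""
        self._pending.put((object_id, fulltext))

    def _run(self):
        # Background loop: take one queued document at a time and index it.
        while True:
            object_id, fulltext = self._pending.get()
            try:
                with self._lock:
                    for term in fulltext.lower().split():
                        self._index.setdefault(term, set()).add(object_id)
            finally:
                self._pending.task_done()

    def wait_until_indexed(self):
        """Block until every queued document is searchable."""
        self._pending.join()

    def search(self, term):
        """Return the ids of all objects whose fulltext contains the term."""
        with self._lock:
            return set(self._index.get(term.lower(), set()))
```

A search issued right after `submit` may miss the new document; calling `wait_until_indexed()` first removes that gap, at the cost of blocking the caller.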

Japanese Charset

 * Status (Feedback provided by NIMS)

Outcome

 * there are two problems:
 * extracting Japanese text from PDFs that have no fonts included
 * PDFBox and iText are not able to extract Japanese text
 * There is a freely available tool from Adobe which extracts (Japanese) text correctly, but it inserts a space after each character. This behaviour eliminates all word boundaries for Western languages and is therefore a severe issue for mixed-language documents.
 * tokenizing Japanese text
 * xpdf is written in C; setting up the required path variables is rather complicated
 * a character set needs to be defined when extracting text from PDFs, which sounds complicated
 * FIZ would wait for iText (the problems have already been communicated)
 * Japanese analyzer
 * FIZ will send information on a commercial tool (assessed)
 * metadata is not a problem; the only problem is PDF extraction
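
To make the tokenizing problem concrete: Japanese text has no spaces between words, so a whitespace tokenizer returns whole sentences as single tokens. One common workaround, when no dictionary-based analyzer is available, is to index overlapping character bigrams. The sketch below illustrates that technique only. The minutes do not say which analyzer will be chosen, and the function name `tokenize` and the character ranges used are assumptions.

```python
import re

# Hiragana, Katakana and the basic CJK Unified Ideographs block (assumed ranges).
_CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")
# A token candidate is either a run of CJK characters or a run of ASCII letters/digits.
_TOKEN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+|[A-Za-z0-9]+")


def tokenize(text):
    """Split mixed-language text: Western words stay whole, CJK runs become bigrams."""
    tokens = []
    for match in _TOKEN.finditer(text):
        run = match.group(0)
        if _CJK_RUN.fullmatch(run):
            if len(run) == 1:
                tokens.append(run)
            else:
                # Overlapping pairs: "日本語" -> "日本", "本語"
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.append(run.lower())
    return tokens
```

Bigrams need no dictionary and tolerate unknown words, but they enlarge the index and can match across real word boundaries, which is why a dedicated Japanese analyzer may still be preferable.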

New user roles, discussion on existing user roles

 * See Faces_User_Management
 * possibility to have additional roles that do not need to be granted each time after an object is created?
 * possibility to extend Administrator role
 * possibility to have "Grantor roles" etc.
 * check status on Collaborator roles

Outcome

 * rule-based role granting needs to be researched, as it is not possible at present
 * the Collaborator role is implemented; a Collaborator can retrieve the item and may retrieve the content of the components
 * a Collaborator can be defined for a context, container, item, or component
 * new role: Collaborator modifier, who may modify items, containers, and components in the same manner as a depositor

TOC

 * MPDL uses Component of Item for storing TOC
 * result of tests at MPDL
 * versioning issue

Outcome

 * MPDL currently uses a normal item with CModel:TOC; the TOC content is a component
 * since TOCs are rapidly increasing in size, the only solution, due to a Fedora limitation, is to store the TOC as a component
 * the retrieve TOC method does not work with these items
 * retrieveToc methods are to be allowed for the CModel
 * to check how to move forward with CModel and CModel-specific methods
 * introduce a new struct-map for TOCs
 * one method would return the list of TOCs of this container
 * one method would return the content of the TOCs of this container
 * at present: the recommendation is to use only the staging servlet if the TOC is expected to be a large object

Content Relations

 * see Talk:ESciDoc_Content_Relations

Outcome

 * on the upcoming TODO list (next weeks) for implementation

Content Model

 * see concept and definition at ESciDoc_Content_Models
 * see implementation and example object at ESciDoc_Content_Model_Object

JHOVE integration

 * discussion at ESciDoc_JHove_Integration

Outcome

 * will not be part of the Item handler; it will remain a separate service
 * configuring it via CModel was discarded
 * CLOSED

Default MD-Records

 * discussion at ESciDoc_Metadata_Records_Manipulation

Outcome

 * CLOSED:
 * Discussion is closed and will be reopened if needed in the future
 * no changes, no new virtual resources

TOC

 * discussion at ESciDoc_Toc

Ontology Manager

 * see ESciDoc_Content_Relations

Outcome

 * FIZ will provide some more input on storage

OAI-PMH

 * MPDL: set up requirements for sets

Outcome

 * FIZ works with GFZ Potsdam to provide OAI-PMH
 * there is an extended DC mapping that solves the problem with release dates and last-modified dates
 * set definitions are possible; more input to come from Rozita
 * FIZ will provide some more input
 * MPDL link to requirements: PubMan_Func_Spec_Export/OAI_Data_Provider
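
For orientation on how sets and dates surface to a harvester, the sketch below builds an OAI-PMH `ListRecords` request. The verb and the `metadataPrefix`, `set`, `from`, and `until` arguments come from the OAI-PMH protocol. The base URL, the set name, and the helper `list_records_url` are hypothetical and do not describe the FIZ implementation.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, until_date=None):
    """Build a selective-harvesting ListRecords request URL."""
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if set_spec:
        params.append(("set", set_spec))      # restrict the harvest to one set
    if from_date:
        params.append(("from", from_date))    # lower bound on the record datestamp
    if until_date:
        params.append(("until", until_date))  # upper bound on the record datestamp
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and set name, for illustration only.
example = list_records_url("http://example.org/oai",
                           set_spec="pubman:institute-a",
                           from_date="2009-01-01")
```

Because `from` and `until` filter on a single datestamp per record, it matters which date the mapping exposes there, which is where the distinction between release date and last-modified date becomes relevant.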

Fedora FOXML handling

 * Fedora memory problem
 * use a heap size of more than 1 or 2 GByte as a workaround
 * FIZ will try to make clear to the Fedora people that some of the problems could prevent MPDL from using Fedora in a production environment
 * MPDL proposes a joint FIZ/Fedora/MPDL meeting to discuss possible resolutions

Outcome

 * the problem can also be raised with the Fedora people; Matthias now has good arguments :), such as:
 * metadata is affected by this problem too, in addition to TOC size and the large number of container versions, where it originally appeared
 * FIZ to provide information on MAX-XML-SIZE