ESciDoc Ingest Tool

Functional and Technical Requirements for an eSciDoc (Enhanced Scientific Documentation) Ingest Tool

Being able to ingest large numbers (100,000 - 200,000) of objects is crucial for the successful roll-out of most eSciDoc solutions. Nearly all the data has to be migrated/ingested by the end of May 2008. We therefore need to provide a working solution as soon as possible to meet the goal of a timely roll-out, followed by a much improved version of the ingest tool after the initial productive deployment of PubMan (Publication Management) and other solutions. For this reason, we have split the planning into short-term and mid-term goals.

Improved Performance of Object Manager

Short-term Goals

  • Ingest rate of 1 item/s or ~ 80,000 objects/day (with 2 components and one descriptive metadata (MD) record besides Dublin Core (DC))
  • Create a separate ingest method in Object Manager which allows for
    • the explicit setting of the item status (pending, submitted, released)
    • inclusion of existing PIDs
    • delayed synch of the triple store

Mid-term Goals

  • Ingest rate of 5 items/s or ~ 400,000 objects/day (with 2 components and one descriptive MD record besides DC)
  • Automatic transformation of item XML into separate FOXML (Fedora Object XML) 1.1 objects
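
As a rough check of what the two target rates mean for the migration volume mentioned in the introduction (100,000 - 200,000 objects), the arithmetic can be sketched as below. The function names are illustrative only and not part of any eSciDoc API. The sketch assumes uninterrupted ingestion of "standard objects" at a constant rate, which a real run will not achieve exactly.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def objects_per_day(items_per_second: float) -> int:
    """Objects ingested in 24 hours at a constant rate."""
    return int(items_per_second * SECONDS_PER_DAY)


def hours_to_ingest(total_objects: int, items_per_second: float) -> float:
    """Wall-clock hours needed to ingest total_objects at a constant rate."""
    return total_objects / items_per_second / 3600


if __name__ == "__main__":
    for rate in (1, 5):  # short-term and mid-term target rates (items/s)
        per_day = objects_per_day(rate)
        low = hours_to_ingest(100_000, rate)
        high = hours_to_ingest(200_000, rate)
        print(f"{rate} item/s: {per_day} objects/day, "
              f"{low:.1f} to {high:.1f} hours for 100,000 to 200,000 objects")
```

At exactly 1 item/s this gives 86,400 objects/day and at 5 items/s 432,000 objects/day, so the figures of ~80,000 and ~400,000 in the goals are rounded down. Under these assumptions the full migration would take roughly 28 to 56 hours at the short-term rate and roughly 5.6 to 11 hours at the mid-term rate.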

Ingest Tool

Short-term Goals

  • Read in validated item XML files from local disc and ingest them into Object Manager
Does this mean one can provide a URL or directory path for ingestion and the rest is part of the Ingest service?--Natasa 11:53, 6 May 2008 (CEST)
  • Item XML must include proper references to Context, Creator, and Content Type (mra: is this sufficient? correct?)
Can't we provide context, state, creator, content-model (optionally), and format (optionally; for a start, eSciDoc item/container XML) as parameters? This would enable immediate validation of the references, and they could then actually be set by the FW.--Natasa 11:53, 6 May 2008 (CEST)
  • Create the ingest tool from the start as a core service (not as a utility tool)?--Natasa 11:57, 6 May 2008 (CEST)
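
The short-term behaviour described above (read item XML files from a local directory, check that the required references are present, hand each file to Object Manager) could be sketched roughly as follows. This is only an illustration under stated assumptions: the element names in REQUIRED_REFERENCES are placeholders for the Context, Creator, and Content Type references and are not taken from the eSciDoc schema, and the `ingest` callable stands in for whatever ingest method Object Manager will eventually offer. The check only tests for the presence of elements; it does not replace schema validation.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Placeholder element names (assumption, not the real eSciDoc schema).
REQUIRED_REFERENCES = ("context", "created-by", "content-model")


def local_name(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def missing_references(xml_text: str,
                       required: Iterable[str] = REQUIRED_REFERENCES) -> List[str]:
    """Return the required element names that do not occur in the document."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    present = {local_name(element.tag) for element in root.iter()}
    return [name for name in required if name not in present]


def ingest_directory(directory: str,
                     ingest: Callable[[str], None],
                     required: Iterable[str] = REQUIRED_REFERENCES) -> Dict:
    """Pass every acceptable *.xml file in directory to the ingest callable.

    'ingest' is a hypothetical stand-in for the Object Manager ingest method.
    Returns a report with the ingested file names and the rejected ones.
    """
    required = tuple(required)
    report = {"ingested": [], "rejected": {}}
    for path in sorted(Path(directory).glob("*.xml")):
        text = path.read_text(encoding="utf-8")
        try:
            missing = missing_references(text, required)
        except ET.ParseError as error:
            report["rejected"][path.name] = f"not well-formed: {error}"
            continue
        if missing:
            report["rejected"][path.name] = "missing: " + ", ".join(missing)
            continue
        ingest(text)
        report["ingested"].append(path.name)
    return report
```

Passing the references as parameters instead, as proposed in the comments above, would amount to replacing the presence check with a step that sets these values before the hand-over.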

Mid-term Goals

  • Accept objects from other sources than just local disc
  • Accept more formats than just item XML
  • Better support for collections

Open Questions

  • How to handle dates (should the original creation date be maintained as property)?
The creation date should probably be set by the FW, as otherwise it may cause a lot of other interesting problems (e.g. for OAI (Open Archives Initiative) providers, "items changed/created since" queries may lead to inconsistent views, etc.). --Natasa 11:46, 6 May 2008 (CEST)
If the original creation date must be maintained, then we should probably set up another property or simply keep this information as an extra metadata record. Note that the original creation date is probably not the only property from the former system that has to be maintained; what about users? We cannot expect that all users, e.g. from eDoc (Electronic Documentation), are created and referenced correctly, as many of them may in fact no longer be active. --Natasa 11:46, 6 May 2008 (CEST)
  • Shall the Ingest Tool support a stylesheet transformation from input format to item XML?
This is a discussion about supported ingest formats. For the short term it would be great to have the standard eSciDoc item.xml supported and to resolve the basic ingestion functionality needed.--Natasa 11:46, 6 May 2008 (CEST)
  • Should we allow an ingest method on the Item/Container handlers? Proposal: provide it only from the Ingest tool/service (even if in the background it may actually use the item/container manager handlers).
  • Why are the Object Manager short/mid-term goals limited to 2 components and one descriptive MD (metadata) record? For the VIRR (Virtueller Raum Reichsrecht) solution we already have the case of 3 components and 2 metadata records.
No, there is no limitation. The numbers mentioned are just meant as "standard objects" for the ingest rate. If objects are more complex, they will ingest fine, but it will take longer to ingest them. --Matthias 12:41, 6 May 2008 (CEST)