ESciDoc Ingest Tool

Functional and Technical Requirements for an eSciDoc Ingest Tool

Being able to ingest large numbers (100,000 - 200,000) of objects is crucial for the successful roll-out of most eSciDoc Solutions. Nearly all the data has to be migrated/ingested by the end of May 2008. We therefore need to provide a working solution as soon as possible to meet the goal of a timely roll-out, followed by a much improved version of the ingest tool after the initial productive deployment of PubMan and other solutions. For this reason, we have split the planning into short-term and mid-term goals.

Improved Performance of Object Manager

Short-term Goals

  • Ingest rate of 1 item/s or ~ 80,000 objects/day (with 2 components and one descriptive MD record besides DC)
  • Create a separate ingest method in Object Manager which allows for
    • the explicit setting of the item status (pending, submitted, released)
    • inclusion of existing PIDs
    • delayed synch of the triple store

Mid-term Goals

  • Ingest rate of 5 items/s or ~ 400,000 objects/day (with 2 components and one descriptive MD record besides DC)
  • Automatic transformation of item XML into separate FOXML 1.1 objects
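
The relation between the target ingest rates and the 100,000 - 200,000 objects mentioned in the introduction can be checked with a little arithmetic. The following is only a sketch: the function names are illustrative, and it assumes an uninterrupted, constant ingest rate, which a real migration run is unlikely to sustain.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def objects_per_day(items_per_second: float) -> int:
    """Objects ingested in 24 hours at a constant rate."""
    return int(items_per_second * SECONDS_PER_DAY)


def hours_to_ingest(total_objects: int, items_per_second: float) -> float:
    """Wall-clock hours needed to ingest total_objects at a constant rate."""
    return total_objects / items_per_second / 3600


if __name__ == "__main__":
    # 1 item/s is the short-term goal, 5 items/s the mid-term goal.
    for rate in (1, 5):
        print(
            f"{rate} item/s -> {objects_per_day(rate):,} objects/day; "
            f"200,000 objects take {hours_to_ingest(200_000, rate):.1f} h"
        )
```

At exactly 1 item/s this gives 86,400 objects/day (about 55.6 hours for 200,000 objects), and at 5 items/s it gives 432,000 objects/day (about 11.1 hours), so the rounded figures in the goals leave a small margin for pauses.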

Ingest Tool

Short-term Goals

  • Read in validated item XML files from local disc and ingest them into Object Manager
Does this mean one can provide a URL or directory path for ingestion, and the rest is handled by the Ingest service? --Natasa 11:53, 6 May 2008 (CEST)
  • Item XML must include proper references to Context, Creator, and Content Type (mra: is this sufficient? correct?)
Can't we provide context, state, creator, content model (optionally), and format (optionally; to start with, eSciDoc item/container XML) as parameters? This would enable immediate validation of the references, and the values could then actually be set by the FW. --Natasa 11:53, 6 May 2008 (CEST)
  • Create the ingest tool from the start as a core service (not as a utility tool)? --Natasa 11:57, 6 May 2008 (CEST)
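
The short-term goals above (read item XML files from a local directory, check that the required references are present, hand each file to Object Manager) could be sketched roughly as follows. This is an illustration under several assumptions, not the actual tool: the element names in `REQUIRED_REFERENCES` are guesses at how Context, Creator, and Content Type appear in item XML, the check only looks for the presence of elements and is no substitute for schema validation, and `ingest` is a hypothetical callable standing in for whatever ingest method Object Manager will provide.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed local element names for Context, Creator and Content Type.
# These are placeholders and must be adapted to the real item XML schema.
REQUIRED_REFERENCES = ("context", "created-by", "content-model")


def local_name(tag: str) -> str:
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def missing_references(
    xml_text: str, required: Iterable[str] = REQUIRED_REFERENCES
) -> List[str]:
    """Return the required element names that do not occur in the document."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    present = {local_name(element.tag) for element in root.iter()}
    return [name for name in required if name not in present]


def ingest_directory(
    directory: Path, ingest: Callable[[str], str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Ingest every *.xml file in a directory and report the outcome per file.

    `ingest` is a hypothetical function that takes item XML and returns
    the identifier assigned to the new object.
    """
    report: Dict[str, List[Tuple[str, str]]] = {"ingested": [], "rejected": []}
    for path in sorted(directory.glob("*.xml")):
        xml_text = path.read_text(encoding="utf-8")
        try:
            missing = missing_references(xml_text)
        except ET.ParseError as err:
            report["rejected"].append((path.name, f"not well-formed: {err}"))
            continue
        if missing:
            reason = "missing: " + ", ".join(missing)
            report["rejected"].append((path.name, reason))
            continue
        report["ingested"].append((path.name, ingest(xml_text)))
    return report
```

Passing the ingest step in as a function keeps the file handling independent of the open question of whether the tool calls Object Manager directly or runs as a core service. Rejected files are collected in the report rather than aborting the run, which seems preferable for batches of 100,000 or more objects.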

Mid-term Goals

  • Accept objects from sources other than just local disc
  • Accept more formats than just item XML
  • Better support for collections

Open Questions

  • How should dates be handled (should the original creation date be maintained as a property)?
The creation date should probably be set by the FW, as anything else may cause a lot of other interesting problems (e.g. OAI providers and "items changed/created since" queries may end up with inconsistent views). --Natasa 11:46, 6 May 2008 (CEST)
If the original creation date must be maintained, then we should probably set up another property or simply keep this information as an extra metadata record. Note that the original creation date is probably not the only property from the former system that would have to be maintained; what about users? We cannot expect that all users, e.g. from eDoc, are created and referenced correctly, as many of them may in fact no longer be active. --Natasa 11:46, 6 May 2008 (CEST)
  • Shall the Ingest Tool support a stylesheet transformation from input format to item XML?
This is a discussion about supported ingest formats. For the short term it would be great to have the standard eSciDoc item.xml supported and to resolve the basic ingestion functionality needed. --Natasa 11:46, 6 May 2008 (CEST)
  • Should we allow an ingest method on the Item/Container handlers? Proposal: provide it only through the Ingest tool/service (even if in the background it may actually use the item/container manager handlers).
  • Why are the Object Manager short-term and mid-term goals limited to 2 components and one descriptive MD record? For the VIRR solution we already have the case of 3 components and 2 metadata records.
No, there is no limitation. The numbers mentioned are just meant as "standard objects" for the ingest rate. More complex objects will ingest fine, but it will take longer to ingest them. --Matthias 12:41, 6 May 2008 (CEST)