Difference between revisions of "ESciDoc Ingest Tool"

From MPDLMediaWiki

Revision as of 09:53, 6 May 2008

Functional and Technical Requirements for an eSciDoc Ingest Tool

Being able to ingest large numbers (100,000 - 200,000) of objects is crucial for the successful roll-out of most eSciDoc Solutions. Nearly all the data has to be migrated/ingested by the end of May 2008. We therefore need to provide a working solution as soon as possible to meet the goal of a timely roll-out, and a much improved version of the ingest tool after the initial productive deployment of PubMan and other solutions. For this reason, we have split the planning into short-term and mid-term goals.

Improved Performance of Object Manager

Short-term Goals

  • Ingest rate of 1 item/s or ~80,000 objects/day (with 2 components and one descriptive MD record besides DC)
  • Create a separate ingest method in Object Manager which allows for
    • the explicit setting of the item status (pending, submitted, released)
    • inclusion of existing PIDs
    • delayed synchronization of the triple store

Mid-term Goals

  • Ingest rate of 5 items/s or ~400,000 objects/day (with 2 components and one descriptive MD record besides DC)
  • Automatic transformation of item XML into separate FOXML 1.1 objects
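
As a rough check of the ingest-rate targets above, the following sketch converts a sustained rate in items per second into objects per day and estimates how long the 100,000 - 200,000 objects mentioned in the introduction would take. It assumes uninterrupted running at a constant rate, which is an idealisation; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def objects_per_day(items_per_second: float) -> int:
    """Objects ingested in 24 hours of uninterrupted running at a constant rate."""
    return round(items_per_second * SECONDS_PER_DAY)


def hours_to_ingest(total_objects: int, items_per_second: float) -> float:
    """Hours needed to ingest total_objects at a constant rate."""
    return total_objects / items_per_second / 3600


if __name__ == "__main__":
    for rate in (1, 5):  # short-term and mid-term target rates
        print(f"{rate} item/s -> {objects_per_day(rate):,} objects/day")
        for total in (100_000, 200_000):
            print(f"  {total:,} objects: {hours_to_ingest(total, rate):.1f} h")
```

Under these assumptions, 1 item/s corresponds to 86,400 objects/day and 5 items/s to 432,000 objects/day, so the rounded targets of ~80,000 and ~400,000 sit slightly below the theoretical maximum.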

Ingest Tool

Short-term Goals

  • Read in validated item XML files from local disc and ingest them into Object Manager
Does this mean one can provide a URL or directory path for ingestion, and the rest is handled by the Ingest service? --Natasa 11:53, 6 May 2008 (CEST)
  • Item XML must include proper references to Context, Creator, and Content Type (mra: is this sufficient? correct?)
Can't we provide context, state, creator, and (optionally) content-model as parameters? This would enable immediate validation of the references, and they could then actually be set by the FW. --Natasa 11:53, 6 May 2008 (CEST)
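
A minimal sketch of the pre-ingest check implied by the short-term goals above: walk a local directory, parse each item XML file, and report files that are not well-formed or lack one of the required references. The element names (`context`, `created-by`, `content-model`) and all function names are assumptions for illustration and are not taken from the eSciDoc item schema; the actual call to Object Manager is deliberately left out.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical local element names standing in for the references to
# Context, Creator, and Content Type; the real schema may differ.
REQUIRED_REFERENCES = ("context", "created-by", "content-model")


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def missing_references(root: ET.Element) -> list[str]:
    """Return the required reference names that do not occur anywhere in the item."""
    present = {local_name(element.tag) for element in root.iter()}
    return [name for name in REQUIRED_REFERENCES if name not in present]


def check_directory(directory: Path) -> dict[str, list[str]]:
    """Map each problematic *.xml file name to the list of problems found."""
    report: dict[str, list[str]] = {}
    for path in sorted(directory.glob("*.xml")):
        try:
            problems = missing_references(ET.parse(path).getroot())
        except ET.ParseError as exc:
            problems = [f"not well-formed: {exc}"]
        if problems:
            report[path.name] = problems
    return report
```

Files that come back with an empty report would be candidates for ingestion. If context, creator, and content model were instead passed as parameters, as suggested in the comment above, the same check could run once per batch rather than once per file.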

Mid-term Goals

  • Accept objects from other sources than just local disc
  • Accept more formats than just item XML
  • Better support for collections

Open Questions

  • How to handle dates (should the original creation date be maintained as property)?
The creation date should probably be set by the FW, as anything else may cause a lot of other interesting problems (e.g. for OAI providers, or "items changed/created since" requests, which may lead to inconsistent views). --Natasa 11:46, 6 May 2008 (CEST)
If the original creation date must be maintained, then we should probably set up another property or simply keep this information as an extra metadata record. --Natasa 11:46, 6 May 2008 (CEST)
  • Shall the Ingest Tool support a stylesheet transformation from input format to item XML?
This is a discussion about supported ingest formats. For the short term it would be great to have the standard eSciDoc item.xml supported, and to resolve the basic ingestion functionality needed. --Natasa 11:46, 6 May 2008 (CEST)
  • Should we allow an ingest method from the Item/Container handler? Proposal: provide it only from the Ingest tool/service (even if in the background it may actually use the item/container manager handlers).