Talk:PubMan Func Spec Ingestion

From MPDLMediaWiki
Revision as of 15:59, 8 September 2009 by Melanie.stetter (talk | contribs)

work in progress

Phase 1

  • provide multiple item submission (batch import) for files containing more than one reference in the following formats: EndNote, WoS, eSciDoc XML, BibTeX, RIS.

Rupert: Does it mean all formats are subject to single item import too? (see comment in UC_PM_IN_01) --Rupert 08:06, 30 March 2009 (UTC)

For future development, yes, possibly. For R5, the focus is on batch ingestion.--Ulla 11:48, 7 April 2009 (UTC)

  • For EndNote, consider files in various versions: either version 1.x-7 or version 8.x
    • encoding of files depends on the EndNote version: 1.x to 7 support ASCII, 8.x supports UTF-8
    • Mapping to PubMan genres depends on the EndNote version (different mappings needed)
    • First priority: EndNote version in use by ICE and MPI Pflanze--Ulla 12:58, 24 February 2009 (UTC)
  • For BibTeX, consider that a record can contain a URL tag pointing to the full text that belongs to the bibliographic record; this full text should be uploaded to PubMan (see also the BibTeX mapping)
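
The version-dependent encoding rule for EndNote files could be sketched as follows. This is only an illustration: the function name and the way the version is passed in (a string such as "7" or "8.2") are assumptions, not part of the specification.

```python
def endnote_encoding(version: str) -> str:
    """Return the text encoding to assume for an EndNote export file.

    Assumption: `version` is a string such as "7" or "8.2"; only the
    major version number is inspected.
    """
    major = int(version.split(".")[0])
    if 1 <= major <= 7:
        # Versions 1.x to 7 support ASCII only.
        return "ascii"
    if major == 8:
        # Version 8.x supports UTF-8.
        return "utf-8"
    raise ValueError(f"unsupported EndNote version: {version}")
```

The same switch could also select the genre mapping, since the mapping to PubMan genres differs per version as well.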

Functional specification

UC_PM_IN_01 import file in structured format

To avoid manual typing, for example, the user wants to upload a file in a structured format such as BibTeX, EndNote Export Format or RIS. A complete overview of the import formats supported by PubMan can be found in Category:ESciDoc_Mappings.

Status/Schedule

  • status: in specification
  • schedule: R 5

Triggers

  • the user wants to upload a file in structured format, containing one or more items, in order to create eSciDoc items

Actors

  • user, who has depositor and moderator rights

Pre-Conditions

  • Target context, incl. its validation rules for submission and release, is selected
  • Recommendation for users: prepare local data according to the genre-specific constraints and validation rules of the selected target context, to avoid import failures

Flow of events

  • The user starts to import a file to the system
  • The user provides the path of the import file, the import format (BibTeX, EndNote, WoS, RIS, eSciDoc XML) and the context into which s/he would like to import the items.
    • If the user selects an import format for which customized mappings have been created beforehand, the user can additionally select the customized mapping (e.g. "Endnote for MPI ICE")
    • If the user has chosen an import format that contains links to full texts (like BibTeX), the full texts are imported as well.
  • The user defines what should happen if the validation fails:
    • cancel ingestion
    • ingest only valid items
  • The user provides a description of the ingestion task, which will be attached to the items in the local tags. In addition, the system adds the ingestion timestamp to the local tags of the imported items.
  • the user triggers the ingestion
  • the system checks if one or more persons within the import are already within CoNE
    • for persons, which are already in CoNE: the system adds the CoNE ID to the respective persons in the import data (already in R5)
    • otherwise: the system creates unauthorized person entries (future development)
  • the system informs the user on the progress and outcome of the import.
    • If the ingestion fails (wrong import format, full text not fetched, corrupted file, failed validation), the ingestion is cancelled or only the valid items are ingested, depending on the option chosen by the user.
      • During the ingestion both sets of validation rules are applied: the one for creating an item and the one for releasing an item. If the create validation fails, the item(s) cannot be imported. If the release validation fails, the items are imported and the user gets a report.
    • If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry a system-specific property for the ingestion task (this would be best accomplished via a content-model-specific property such as <ingestion><ingestion-task></ingestion-task></ingestion>), the ingestion date and the owner of the ingestion.
  • The user can view the ingested pending items in his/her import workspace, do batch operations (batch delete, batch submit/release, remove ingestion task) and view the ingestion report.
  • Proceed with UC_PM_IN_02 Batch release imported items
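
The interplay of the two validation points and the user's failure option in the flow above could be sketched roughly like this. All names (`OnValidationFailure`, `ImportReport`, `run_import`) and the representation of items as plain dicts are assumptions for illustration; the actual validation rules come from the target context.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class OnValidationFailure(Enum):
    CANCEL = "cancel ingestion"
    INGEST_VALID_ONLY = "ingest only valid items"


@dataclass
class ImportReport:
    imported: List[dict] = field(default_factory=list)
    rejected: List[dict] = field(default_factory=list)          # failed the "create item" rules
    release_warnings: List[dict] = field(default_factory=list)  # imported, but not yet releasable
    cancelled: bool = False


def run_import(
    items: List[dict],
    valid_for_create: Callable[[dict], bool],
    valid_for_release: Callable[[dict], bool],
    policy: OnValidationFailure,
) -> ImportReport:
    report = ImportReport()
    for item in items:
        if not valid_for_create(item):
            # Items failing the create rules can never be imported.
            report.rejected.append(item)
            continue
        report.imported.append(item)
        if not valid_for_release(item):
            # Release failures do not block the import; they are only reported.
            report.release_warnings.append(item)
    if report.rejected and policy is OnValidationFailure.CANCEL:
        # The user chose to cancel the whole ingestion on any failure.
        report.imported.clear()
        report.release_warnings.clear()
        report.cancelled = True
    return report
```

In this reading, only the create rules can trigger the cancel option; a failed release validation always ends up in the report.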

Post conditions

New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.

Future Development

  • If the user chooses an import format, which contains URLs to the full text, the user can specify if s/he would like to import the full texts or not.
  • Automatic upload (give the URL of the server from where to get the data on a regular basis).
  • Check if journal names within the import file are already in CoNE.

UC_PM_IN_02 Batch release/submit imported items

The user wants to make a set of ingested items visible via PubMan.

Status/Schedule

  • status: in specification
  • schedule: R 5

Actors

  • user with Depositor and Moderator rights

Pre-Conditions

  • One ingestion set has been selected

Flow of events

  • the user selects one set of ingestion tasks from the workspace.
  • Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
  • The system checks for potential duplicates. Include use case UC_PM_IN_03 Do duplicate check (OPEN)
  • The system checks the items against the validation rules provided for the context, incl. the genre-specific constraint.
    • If the items are valid, the items are submitted or released. The user is informed how many items have been submitted or released.
    • If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and restart the batch release process.
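
One possible reading of the validation step above is an all-or-nothing batch: nothing is submitted or released while any item is invalid. The sketch below assumes that reading; `batch_process`, the dict-based items and the `state` key are hypothetical.

```python
from typing import Callable, Dict, List


def batch_process(
    items: List[dict],
    is_valid: Callable[[dict], bool],
    target_state: str,
) -> Dict[str, object]:
    """Submit or release a set of items only if all of them are valid.

    `target_state` would be "submitted" or "released", depending on the
    workflow of the target context and the privileges of the user.
    """
    invalid = [item for item in items if not is_valid(item)]
    if invalid:
        # Nothing is changed; the caller shows the invalid items for editing.
        return {"processed": 0, "invalid": invalid}
    for item in items:
        item["state"] = target_state
    return {"processed": len(items), "invalid": []}
```

After the user has edited the invalid items, calling the function again corresponds to restarting the batch release process.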

Future Development

  • batch release/submit should also be possible from Depositor and QA Workspace
  • the user should be able to define sets for batch operations by himself/herself

UC_PM_IN_03 Do duplicate check

The user wants to avoid creating duplicates in a specific context during the ingest of new items.

OPEN: Where to include the duplicate check? During import or during batch submission/release?

Status/Schedule

  • status: in specification
  • schedule: R5

Actors

  • user with moderator and depositor rights

Pre-Conditions

  • One ingestion set has been selected
  • the ingested items carry a unique identifier to base the duplicate check on (i.e. duplicate check on item level)

Flow of events

  • the user specifies what should be done in case one or more duplicates have been found:
    • cancel the operation
    • skip the potential duplicates and only handle the non-duplicates
    • ignore duplicates and overwrite existing entries
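
The three options above could be modelled as a simple policy applied to the identifier-based check. This is a sketch under assumptions: items are dicts with an `identifier` key, the existing entries are held in a dict keyed by identifier, and all names are invented for illustration.

```python
from enum import Enum
from typing import Dict, List


class DuplicatePolicy(Enum):
    CANCEL = "cancel the operation"
    SKIP = "skip potential duplicates"
    OVERWRITE = "overwrite existing entries"


class DuplicatesFound(Exception):
    """Raised when duplicates exist and the policy is CANCEL."""


def apply_duplicate_policy(
    incoming: List[dict],
    existing: Dict[str, dict],
    policy: DuplicatePolicy,
) -> Dict[str, dict]:
    """Return the resulting entries; `existing` itself is not modified."""
    duplicates = [i for i in incoming if i["identifier"] in existing]
    if duplicates and policy is DuplicatePolicy.CANCEL:
        raise DuplicatesFound(f"{len(duplicates)} potential duplicate(s) found")
    result = dict(existing)
    for item in incoming:
        if item["identifier"] in existing and policy is DuplicatePolicy.SKIP:
            continue  # keep the existing entry, handle only non-duplicates
        result[item["identifier"]] = item  # new item, or overwrite
    return result
```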

Future Development

  • the user should be able to view a duplicate checking report and decide for each item, which action should be taken

UC_PM_IN_04 batch delete items

The user wants to delete several items from the import manager interface, as they were duplicates of items that already existed in PubMan and are no longer needed.

Status/Schedule

  • status: in specification
  • schedule: R5

Flow of events

  • the user can remove all imported items that have been ingested (will be possible from the ingestion workspace)

UC_PM_IN_05 batch attach local tags

The user wants to assign one or more local tags to a set of items.

Status/Schedule

  • status: in specification
  • schedule: not R5

UC_PM_IN_06 batch assign organizational units

The user wants to assign one or more OUs to a set of items.

Status/Schedule

  • status: in specification
  • schedule: not R5

Architecture and thoughts

  • R5 - Import logic/design
    • import page modification to include
      • ingestion task identifier (username+timestamp)
      • checkbox to either skip the creation of duplicates or create possible duplicates as new items
    • asynchronous start of ingestion process
    • sendmail to user when ingestion is finished or failed
      • email in case of failure:
        • at which item (by title) it failed
        • possible cause: mapping, creation, validation message (for Val.point default)
        • info where to go for further steps
      • email in case of success:
        • how many items were created (with or without fulltext)
        • which items were possible duplicates (by title) (if the duplicate check is to be applied)
        • info where to go for further steps
    • on ingestion tasks (tbd: BT or eSciDoc days?)
      • possible to create items with CM: ingestion-task
      • these items should never be released
        • this will enable filtering ingestion tasks workspace when it comes
      • items may contain:
        • special metadata (or content-model-specific?) to indicate the status of the ingestion task (scheduled, in-progress, finished successfully, finished unsuccessfully) etc. (NBU: to check the data model from before, as it was defined in detail).
        • component or MD record stream with links to ingested items (preferably component) and info on status: failed/success, and info on possibly found duplicates
        • component with the original file uploaded for import
      • ingestion task items can be "cleaned up", i.e. deleted, if desired
      • Advantage by defining them as items:
        • we can even have separate role if needed
        • we can re-use existing functionality of item handler and storage
sounds to me like a workflow engine.--Robert 11:27, 30 March 2009 (UTC)
i also think it would be somewhat strange, if we put management process data like ingestion tasks into our repository - including persistent identifiers, lta, etc. - while keeping the cone stuff, which is actually part of the core data, out.--Robert 12:01, 30 March 2009 (UTC)
Correct. There is a WF manager set-up (according to FIZ), but this would require a lot of testing. The above proposal is made in order to avoid such complexity and to avoid introducing another external component at present. It is in any case doubtful that we will have ingestion tasks now; the purpose is only to understand where to store them, as we will probably have to store the originally uploaded files somewhere anyway. --Natasa 08:09, 1 April 2009 (UTC)
  • R6
    • introduce ingestion definition (again as item)?
    • to help ingest-users to define their own ingestion settings and remember them
    • ingestion tasks will in that case have relations to ingestion definition
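
To make the discussion of ingestion tasks more concrete, a task record with the properties listed above (identifier from username and timestamp, status, per-item results) might look like the following. This is only a sketch: the class names, the identifier format and the summary wording are assumptions, and whether such records are stored as items is exactly the open question above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class TaskStatus(Enum):
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in-progress"
    FINISHED_SUCCESSFULLY = "finished successfully"
    FINISHED_UNSUCCESSFULLY = "finished unsuccessfully"


@dataclass
class ItemResult:
    item_id: str
    succeeded: bool
    possible_duplicate: bool = False


@dataclass
class IngestionTask:
    username: str
    started: datetime  # assumed to be in UTC
    status: TaskStatus = TaskStatus.SCHEDULED
    results: List[ItemResult] = field(default_factory=list)

    @property
    def identifier(self) -> str:
        # Assumed format: username, underscore, compact UTC timestamp.
        return f"{self.username}_{self.started.strftime('%Y%m%dT%H%M%SZ')}"

    def record(self, item_id: str, succeeded: bool, possible_duplicate: bool = False) -> None:
        self.results.append(ItemResult(item_id, succeeded, possible_duplicate))

    def summary(self) -> str:
        """One-line summary, e.g. for the notification e-mail."""
        created = sum(r.succeeded for r in self.results)
        failed = len(self.results) - created
        duplicates = sum(r.possible_duplicate for r in self.results)
        return f"{created} item(s) created, {failed} failed, {duplicates} possible duplicate(s)"
```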

Future development

  • Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
  • Set up an automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. the respective update of eSciDoc items
  • Provide separate interface to define the mapping for customized fields to eSciDoc. These Mappings can be stored in e.g. User preferences or a "Mapping library" open for all users.

Use case fetch full text from identifier (arXiv)

General Thoughts

  • Should we enhance the (technical) metadata of an item with the information where the item was originally created (in this case arXiv)?
  • We should add something like a progress bar when importing data from another system
  • The system must take precautions not to get blocked by arXiv for indiscriminate automated download (see http://arxiv.org/RobotsBeware.html).
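
One such precaution could be a minimum delay between consecutive requests to the remote source. The sketch below shows the idea only; the `Throttle` class is hypothetical, and the actual interval would have to follow the guidelines published by arXiv rather than any value assumed here.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval_seconds
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

The import would call `wait()` before every full text download, so that a large batch is spread out over time instead of hitting the remote server in one burst.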