Difference between revisions of "Talk:PubMan Func Spec Ingestion"

Revision as of 10:14, 30 March 2009

work in progress

Phase 1

  • provide multiple item submission (batch import) for Endnote references, WoS references, eSciDoc xml, Bibtex, RIS, containing more than one reference.

Rupert: Does it mean all formats are subject to single item import too? (see comment in UC_PM_IN_01) --Rupert 08:06, 30 March 2009 (UTC)

  • For Endnote, consider files in various versions: either version 1.x-7 or version 8.x
    • encoding of files depends on the Endnote version: 1.x to 7 support ASCII, 8.x supports UTF-8
    • Mapping to PubMan genres depends on the Endnote version (different mappings needed)
    • First Prio: Endnote version in use by ICE and MPI Pflanze--Ulla 12:58, 24 February 2009 (UTC)
  • For BibTeX, consider that the record can contain a URL tag which points to the fulltext belonging to the bibliographic record and which should be uploaded to PubMan (see also BibTeX mapping)
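
To illustrate the version-dependent encoding noted above, here is a minimal sketch in Python. The function names `endnote_encoding` and `decode_export` are hypothetical and not part of PubMan; the sketch only assumes the rule stated in the list (versions 1.x to 7 use ASCII, 8.x uses UTF-8).

```python
def endnote_encoding(version: str) -> str:
    """Return the text encoding to assume for an Endnote export.

    Assumption taken from the list above: versions 1.x to 7 write ASCII,
    version 8.x writes UTF-8. Other versions are rejected.
    """
    major = int(version.split(".")[0])
    if 1 <= major <= 7:
        return "ascii"
    if major == 8:
        return "utf-8"
    raise ValueError(f"unsupported Endnote version: {version}")


def decode_export(raw: bytes, version: str) -> str:
    """Decode the raw bytes of an export file according to its Endnote version."""
    return raw.decode(endnote_encoding(version))
```

A version-specific genre mapping could be selected the same way, keyed on the major version.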

Functional specification

UC_PM_IN_01 import file in structured format

For example, in order to save manual typing, the user wants to upload a file in a structured format such as Endnote, BibTeX, WoS, RIS or XML. (Please add a separate page here where we list all supported structured formats and their respective mappings, so we do not have to update the use case each time we provide an additional format--Ulla 17:50, 15 March 2009 (UTC))

Status/Schedule

  • status: in specification
  • schedule: R 5

Triggers

  • the user wants to upload a file in structured format in order to create eSciDoc items

Rupert: ... containing one or more references ... (?) --Rupert 08:08, 30 March 2009 (UTC)

Sure --Natasa 09:08, 30 March 2009 (UTC)

Actors

  • Depositor
  • Moderator

Pre-Conditions

  • Target context, incl. its validation rules for submission and release, is selected
  • Recommendation for users: Local data is prepared for the genre-specific constraints and validation rules of the selected target context, to avoid a failed import
I would actually include a link to the help text where the genre-specific mapping constraints are described. --Natasa 10:12, 30 March 2009 (UTC)

Flow of events

  • The user starts to import a file to the system
  • The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, escidocXML) and the context into which s/he would like to import the items.
  • In case the user selects an import format for which customized mappings have been created beforehand, the user can in addition select the customized mapping (e.g. "Endnote for MPI ICE")
  • Optionally, the user can provide an ingestion description for the ingestion task (OPEN: Isn't that actually the possibility to batch assign local tags to the imported data? see below the use case--Ulla 17:30, 27 March 2009 (UTC))
Yes, therefore this is not optional. Users need to provide a value for their ingestion task. (Suggestion not to do it by the system in stage 1). I think we can make a default value of the field based on username and date --Natasa 10:11, 30 March 2009 (UTC)
  • Optionally, the user can decide to fetch external files which are referenced in the import file (e.g. URL tag in BibTeX file).

Rupert: For the user it should not make a difference whether one or more records are uploaded. In tests they provided a file and expected all records to be imported (which was not the case at this time). From the interface point of view ingest is rather an extension point of import. If the system detects more than one record it should automatically invoke the ingest process with a message. --Rupert 12:10, 24 March 2009 (UTC)

OK, this is fine for me as well. --Natasa 10:11, 30 March 2009 (UTC)
  • the user triggers the ingestion
  • the system informs the user on the progress and outcome of the import. OPEN: needs to be decided how (e.g. progress indicator, automatic mail generated as soon as finished)
I guess that's largely a question of how long the import takes, but it shouldn't take too long to count the items to import and then decide automatically whether a progress indicator on the page will do, or whether an email is sent, in which case it doesn't make sense to wait for completion.--Robert 18:01, 27 March 2009 (UTC)
Yes, ingest should start asynchronously. In R5 the ingest process will send an email to the user when the task is finished. It is not only about counting how many items there are: if the ingest feature in addition needs to download files from external servers, we cannot estimate how long it will take. --Natasa 10:11, 30 March 2009 (UTC)
    • If the ingestion fails (wrong import format, fulltext not fetched, corrupted file, failed validation), the ingestion is canceled. No items are created in the ingestion workspace, i.e. no ingestion task is available in the workspace. (???) The user gets a warning message and can repeat the ingestion.
I would state: user should be notified by email on the outcome. --Natasa 10:11, 30 March 2009 (UTC)
In addition, to be checked within DEV whether it is possible to mark at which item (by title) and step (e.g. mapping transformation, validation or creation of item) the ingestion failed. I would not go for any further details. --Natasa 10:14, 30 March 2009 (UTC)
    • If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry a system-specific property for the ingestion task (This would be best accomplished via content-model-specific-property such as <ingestion><ingestion-task></ingestion-task></ingestion>), the ingestion date and the owner of the ingestion.
  • The user can view the ingested pending items in his import workspace
  • Proceed with UC_PM_IN_02 Batch release imported items
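
The flow above, together with the comments on it, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PubMan implementation: all names (`Step`, `IngestionResult`, `default_task_name`, `ingest`) are hypothetical, records are plain dictionaries, and a missing title stands in for a failed mapping transformation. It shows the all-or-nothing behaviour, the default ingestion description built from username and date, and the reporting of the item and step at which the ingestion failed.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Callable, List, Optional


class Step(Enum):
    MAPPING = "mapping transformation"
    VALIDATION = "validation"
    CREATION = "creation of item"


@dataclass
class IngestionResult:
    task: str                                   # ingestion description
    created: List[str] = field(default_factory=list)
    failed_title: Optional[str] = None          # item at which it failed
    failed_step: Optional[Step] = None          # step at which it failed

    @property
    def ok(self) -> bool:
        return self.failed_step is None


def default_task_name(username: str, day: date) -> str:
    """Default description suggested in the discussion: username plus date."""
    return f"{username} {day.isoformat()}"


def ingest(records: List[dict], username: str, day: date,
           is_valid: Callable[[dict], bool],
           description: Optional[str] = None) -> IngestionResult:
    """All-or-nothing ingestion: one failing record cancels the whole task."""
    task = description or default_task_name(username, day)
    pending = []
    for record in records:
        title = record.get("title", "")
        if not title:
            # Stand-in for a failed mapping transformation (assumption).
            return IngestionResult(task, failed_title="(no title)",
                                   failed_step=Step.MAPPING)
        if not is_valid(record):
            return IngestionResult(task, failed_title=title,
                                   failed_step=Step.VALIDATION)
        pending.append(title)
    # Only now would the pending items be created in the import workspace.
    return IngestionResult(task, created=pending)
```

The result object carries just enough to build the notification email: the task name and, on failure, the title and step.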

Post conditions

New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.
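
As an illustration of the system-specific property that each pending item would carry, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Only the `ingestion` and `ingestion-task` element names come from the proposal in the flow of events; `ingestion-date` and `ingestion-owner` are hypothetical names chosen for this example, as is the function name.

```python
import xml.etree.ElementTree as ET


def ingestion_property(task: str, ingestion_date: str, owner: str) -> str:
    """Build the ingestion property as an XML string."""
    root = ET.Element("ingestion")
    ET.SubElement(root, "ingestion-task").text = task
    # The two element names below are assumptions, not part of the proposal.
    ET.SubElement(root, "ingestion-date").text = ingestion_date
    ET.SubElement(root, "ingestion-owner").text = owner
    return ET.tostring(root, encoding="unicode")
```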


UC_PM_IN_02 Batch release/submit imported items

The user wants to make a set of ingested items visible via PubMan.

Status/Schedule

  • status: in specification
  • schedule: R 5

Actors

  • Depositor
  • Moderator

Pre-Conditions

  • One or more ingestion sets have been selected (e.g. via basket)

Flow of events

  • the user selects one or more sets of ingestion tasks from the workspace. Only item sets which have been imported for the same context can be combined.
  • Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
  • The system checks for potential duplicates. Include use case UC_PM_IN_03 Do duplicate check (OPEN)
  • The system checks the items against the validation rules provided for the context, incl. the genre-specific constraints. (OPEN: the only reason I see for another validation here is the case where the user has an option to modify the pending items after ingest. During import, the validation rules for the context have already been checked....)
  • If the items are valid, the items are submitted or released. User gets information how many items have been submitted or released
  • If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process
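
The batch release flow above can be sketched as follows. This is an illustration only: `IngestionSet` and `batch_release` are hypothetical names, items are plain dictionaries, and the duplicate check (still OPEN) is left out. The sketch enforces the same-context rule, validates all items, and either reports the number of released items or returns the invalid ones for editing.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class IngestionSet:
    context: str          # target context the set was imported for
    items: List[dict]     # pending items of this ingestion task


def batch_release(sets: List[IngestionSet],
                  is_valid: Callable[[dict], bool]) -> Tuple[int, List[dict]]:
    """Return (number of released items, invalid items).

    Nothing is released if at least one item is invalid; the caller can
    then show the invalid items, let the user edit them and restart.
    """
    contexts = {s.context for s in sets}
    if len(contexts) != 1:
        raise ValueError(
            "select one or more sets imported for the same context")
    items = [item for s in sets for item in s.items]
    invalid = [item for item in items if not is_valid(item)]
    if invalid:
        return 0, invalid
    return len(items), []
```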

OPEN: If a user imports 500 references from WoS or Endnote, he needs the possibility to check the quality of each of them, and potentially modify them, before the actual release. Should this happen after the import to the import workspace? If he can modify the pending item in the import workspace, where can he store the modified item? If modification is not feasible before the release, the user has to modify the ingested items after the batch release process, which is not nice. One option would be to allow import only for contexts where the standard workflow is applied, and to restrict the import to users with moderator privileges. As a consequence, there is only the option to batch submit, and the items can be modified like "normal", manually submitted items.

OPEN: What happens to imported persons and person data in CONE? (see also below, duplicate check for CONE persons?)

UC_PM_IN_03 Do duplicate check

The user wants to avoid creating duplicates in a specific context during ingest of new items.

OPEN: Where to include the duplicate check? During import or during batch submission/release?

Status/Schedule

  • status: in specification
  • schedule: R5?

Actors

  • Depositor
  • Moderator

Pre-Conditions

  • One or more ingestion sets have been selected (e.g. via basket)
  • the ingested items carry a unique identifier to base the duplicate check on (i.e. duplicate check on item level)

Flow of events

  • The user selects an identifier type to base the duplicate check on (e.g. WoS identifier)
  • If no duplicates are found, the user is informed.
  • In case duplicates are found, the system provides the user with a report of possible duplicates and with the following possibilities to proceed:
    • import all, including the duplicates, and overwrite the existing items (i.e. create new version)
    • skip the potential duplicates and import only the non-duplicates
    • cancel the import
  • The use case ends successfully.
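
The duplicate check and the three ways to proceed can be sketched like this. It is a minimal illustration under stated assumptions: items are dictionaries keyed by identifier type, `existing_ids` stands for the identifiers already present in the target context, and the names `Choice` and `plan_import` are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Set


class Choice(Enum):
    OVERWRITE = "import all, creating new versions of the duplicates"
    SKIP = "import only the non-duplicates"
    CANCEL = "cancel the import"


def plan_import(incoming: List[dict], existing_ids: Set[str],
                id_type: str, choice: Choice) -> Dict[str, List[dict]]:
    """Split the incoming items according to the user's choice.

    'create' holds items to be created as new items, 'new_version' holds
    items that overwrite an existing item by creating a new version.
    """
    fresh, duplicates = [], []
    for item in incoming:
        is_dup = id_type in item and item[id_type] in existing_ids
        (duplicates if is_dup else fresh).append(item)
    if not duplicates:
        return {"create": fresh, "new_version": []}
    if choice is Choice.CANCEL:
        return {"create": [], "new_version": []}
    if choice is Choice.SKIP:
        return {"create": fresh, "new_version": []}
    return {"create": fresh, "new_version": duplicates}
```

Items that lack the selected identifier are treated as non-duplicates here; whether that is the desired behaviour would still have to be decided.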

OPEN: Duplicate check in CONE service possible for person?

  • The system checks if the creators within the imported data already exist within the CoNE Service.
    • The creators don't exist in the CoNE Service: New unauthorized entries are being created in CoNE.
    • The creators already exist in the CoNE Service: The system displays the possible entry in CoNE. The user can decide to use the CoNE person or to add a new person in CoNE or to add name variants, affiliations etc. to an existing CoNE entry.
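
A rough sketch of the creator check follows. It assumes, purely for illustration, that creators are matched against CoNE by a case-insensitive name comparison; how CoNE actually matches persons is not specified here, and `check_creators` is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def check_creators(creators: Iterable[str],
                   cone_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (creators to add as new unauthorized entries,
    creators with a possible CoNE match for the user to review)."""
    known = {name.casefold() for name in cone_names}
    to_create, to_review = [], []
    for creator in creators:
        if creator.casefold() in known:
            to_review.append(creator)
        else:
            to_create.append(creator)
    return to_create, to_review
```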

UC_PM_IN_03 batch delete items

The user wants to delete several items from the import manager interface, as they were duplicates of items which already existed in PubMan and are no longer needed.


Not needed, to my understanding... depends on the final set-up of the import workspace, i.e. whether we can keep track of the history of ingests (as long as imported data are not batch released, they are visible in the workspace; as soon as they are released/submitted, they are processed, i.e. might be retrieved via "my ingest history").--Ulla 17:28, 27 March 2009 (UTC)

UC_PM_IN_04 batch attach local tags

The user wants to assign one or more local tags to a set of items.

To be discussed if possible--Ulla 17:28, 27 March 2009 (UTC)

UC_PM_IN_05 batch assign organizational units

The user wants to assign one or more OUs to a set of items.

To be discussed if possible, e.g. for WoS or BibTeX import--Ulla 17:28, 27 March 2009 (UTC)

Future development

  • Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
  • Set up an automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. the respective update of eSciDoc items
  • Provide a separate interface to define the mapping of customized fields to eSciDoc. These mappings can be stored in e.g. user preferences or a "mapping library" open to all users.