Talk:PubMan Func Spec Ingestion

From MPDLMediaWiki

work in progress

Phase 1[edit]

  • provide multiple-item submission (batch import) for files in EndNote, WoS, eSciDoc XML, BibTeX or RIS format containing more than one reference.

Rupert: Does it mean all formats are subject to single item import too? (see comment in UC_PM_IN_01) --Rupert 08:06, 30 March 2009 (UTC)

For future development, yes, it might be. For R5, the focus is set on batch ingestion.--Ulla 11:48, 7 April 2009 (UTC)

  • For EndNote, consider files in various versions: either version 1.x-7 or version 8.x
    • encoding of files depends on the EndNote version: 1.x to 7 support ASCII, 8.x supports UTF-8
    • mapping to PubMan genres depends on the EndNote version (different mappings needed)
    • First prio: EndNote version in use by ICE and MPI Pflanze--Ulla 12:58, 24 February 2009 (UTC)
  • for BibTeX, consider that the record can contain a URL tag which points to the fulltext belonging to the bibliographic record and which should be uploaded to PubMan (see also BibTeX mapping)
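
A minimal sketch of how such a URL tag could be pulled out of a BibTeX record. The sample record and the regex-based parsing are illustrative assumptions only; a real import would use a proper BibTeX parser together with the PubMan BibTeX mapping.

```python
import re

# Hypothetical sample record with a URL tag pointing to a fulltext.
RECORD = """@article{sample2009,
  author = {Doe, Jane},
  title  = {An Example},
  url    = {http://example.org/fulltext.pdf}
}"""

def extract_url_fields(bibtex: str) -> list:
    """Collect the contents of any 'url' fields in a BibTeX string,
    accepting both brace- and quote-delimited field values."""
    return re.findall(r'url\s*=\s*[{"]([^}"]+)[}"]', bibtex, re.IGNORECASE)
```

The extracted URLs would then be fetched and attached to the created item as fulltext components, if the user opted in to fetching external files.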

Functional specification[edit]

UC_PM_IN_01 import file in structured format[edit]

In order to save manual typing, for example, the user wants to upload a file in a structured format such as EndNote, BibTeX, WoS, RIS or XML. (Please add a separate page here where we list all supported structured formats and their respective mappings, so we do not have to update the use case each time we provide an additional format--Ulla 17:50, 15 March 2009 (UTC))

Status/Schedule[edit]

  • status: in specification
  • schedule: R 5

Triggers[edit]

  • the user wants to upload a file in structured format in order to create eSciDoc items

Rupert: ... containing one or more references ... (?) --Rupert 08:08, 30 March 2009 (UTC)

Sure --Natasa 09:08, 30 March 2009 (UTC)

Actors[edit]

  • Depositor
  • Moderator

Pre-Conditions[edit]

  • Target context, incl. its validation rules for submission and release, is selected
  • Recommendation for users: local data is prepared for the genre-specific constraints and validation rules of the selected target context, to avoid import failure
I would actually include a link to the help text where the genre-specific mapping constraints are written. --Natasa 10:12, 30 March 2009 (UTC)

Flow of events[edit]

  • The user starts to import a file to the system
  • The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, escidocXML) and the context into which s/he would like to import the items.
  • In case the user selects an import format for which customized mappings have been created beforehand, the user can in addition select the customized mapping (e.g. "EndNote for MPI ICE")
  • Optionally, the user can provide an ingestion description for the ingestion task (OPEN: Isn't that actually the possibility to batch assign local tags to the imported data? see below the use case--Ulla 17:30, 27 March 2009 (UTC))
Yes, therefore this is not optional. Users need to provide a value for their ingestion task. (Suggestion not to do it by the system in stage 1). I think we can make a default value of the field based on username and date --Natasa 10:11, 30 March 2009 (UTC)
  • Optionally, the user can decide to fetch external files which are referenced in the import file (e.g. URL tag in BibTex file).

Rupert: For the user it should not make a difference to upload one or more records. In tests they provided a file and expected all records to be imported (which was not the case at this time). From the interface point of view ingest is rather an extension point of import. If the system detects more than one record it should automatically invoke the ingest process with a message. --Rupert 12:10, 24 March 2009 (UTC)

OK, this is fine for me as well. --Natasa 10:11, 30 March 2009 (UTC)
  • the user triggers the ingestion
  • the system informs the user on the progress and outcome of the import. OPEN: needs to be decided how (e.g. progress indicator, automatic mail generated as soon as finished)
i guess that's largely a question of how long the import takes, but it shouldn't take too long to count the items to import and then decide automatically whether a progress indicator on the page will do, or whether an email is sent - meaning it doesn't make sense to wait for completion.--Robert 18:01, 27 March 2009 (UTC)
Yes, ingest should start asynchronously. In R5 the ingest process will send an email to the user when the task is finished. It is not only about counting how many items there are: if the ingest feature in addition needs to download files from external servers, we cannot estimate how long it will take. --Natasa 10:11, 30 March 2009 (UTC)
    • If the ingestion fails (wrong import format, fulltext not fetched, corrupted file, failed validation), the ingestion is canceled. No items are created in the ingestion workspace, i.e. no ingestion task is available in the workspace. (???)The user gets a warning message and can repeat the ingestion.
I would state: user should be notified by email on the outcome. --Natasa 10:11, 30 March 2009 (UTC)
In addition, to be checked within DEV, if it is possible to mark at which item(by title)/step (e.g. mapping transformation, validation or creation of item) the ingestion failed. I would not go for more other details. --Natasa 10:14, 30 March 2009 (UTC)
    • If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry a system-specific property for the ingestion task (This would be best accomplished via content-model-specific-property such as <ingestion><ingestion-task></ingestion-task></ingestion>), the ingestion date and the owner of the ingestion.
  • The user can view the ingested pending items in his import workspace
  • Proceed with UC_PM_IN_02 Batch release imported items
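
The flow above can be sketched as follows. This is a hedged illustration only: the `IngestionTask` and `ingest` names, the dict-based item model and the `validate`/`notify` callbacks are assumptions, not PubMan's actual API. It shows the all-or-nothing rule (no items created if anything fails), the default task value built from username and date, and the email-style notification at the end.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class IngestionTask:
    owner: str
    name: str = ""
    items: list = field(default_factory=list)

    def __post_init__(self):
        # Default task value from username and date, as suggested above.
        if not self.name:
            self.name = f"{self.owner}-{date.today().isoformat()}"

def ingest(records, owner, validate, notify):
    """All-or-nothing ingest: if any record fails validation, no items
    are created and the owner is notified of the failing record."""
    task = IngestionTask(owner=owner)
    pending = []
    for rec in records:
        if not validate(rec):
            notify(owner, f"Ingestion '{task.name}' failed at record "
                          f"{rec.get('title', '?')!r}")
            return None
        # Imported items carry the ingestion task as a system property.
        pending.append({**rec, "status": "pending",
                        "ingestion-task": task.name})
    task.items = pending
    notify(owner, f"Ingestion '{task.name}' finished: "
                  f"{len(pending)} pending items created")
    return task
```

In the real system the loop would run asynchronously and `notify` would send the completion email discussed above.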

Post conditions[edit]

New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.


UC_PM_IN_02 Batch release/submit imported items[edit]

The user wants to make a set of ingested items visible via PubMan.

Status/Schedule[edit]

  • status: in specification
  • schedule: R 5

Actors[edit]

  • Depositor
  • Moderator

Pre-Conditions[edit]

  • One or more ingestion sets have been selected (e.g. via basket)
For R5 we should state: 1 ingestion set is selected. --Natasa 10:16, 30 March 2009 (UTC)

Flow of events[edit]

  • the user selects one or more sets of ingestion tasks from the workspace. Only item sets which have been imported for the same context can be combined.
See remark above: 1 set is selected for batch release. --Natasa 10:26, 30 March 2009 (UTC)
  • Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
  • The system checks for potential duplicates. Include use case UC_PM_IN_03 Do duplicate check (OPEN)
The step on checking duplicates was not supposed to be in the batch release, but during the import triggering. We agreed to have a check-box in import form, based on which users tell to the system whether to skip the creation if duplicate is found or to create an item in any case. --Natasa 10:26, 30 March 2009 (UTC)
  • The system checks the items against the validation rules provided for the context, incl. the genre-specific constraint. (OPEN: the only reason I see another validation here, is, in case the user has an option to modify the pending items after ingest. During import, already validation rules for the context have been checked....)
As well during the creation validation rules have to be checked for validation point default. In this case validation rules for submit/release should apply. --Natasa 10:26, 30 March 2009 (UTC)
  • If the items are valid, the items are submitted or released. The user gets information on how many items have been submitted or released
  • If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process
Should the system generate a list which contains the validation messages for each item? --Natasa 10:26, 30 March 2009 (UTC)
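
The batch release step above, including the list of validation messages per item that Natasa asks about, could look like this. The function name, the dict-based items and the `validate_release` callback are illustrative assumptions, not the actual implementation.

```python
def batch_release(items, validate_release):
    """Split a set of ingested items into released and invalid ones.

    validate_release(item) returns a list of validation messages
    (empty means valid). Invalid items are returned together with
    their messages so the user can edit them and restart the batch.
    """
    released, invalid = [], []
    for item in items:
        errors = validate_release(item)
        if errors:
            invalid.append((item, errors))
        else:
            released.append({**item, "status": "released"})
    return released, invalid
```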

OPEN: In case the user imports 500 references from WoS or EndNote, he needs to have the possibility to check the quality of each of them, and potentially modify them, before the actual release. Should this happen after the import to the import workspace? If he can modify the pending item in the import workspace, where can he store the modified item? If modification is not feasible before the release, the user has to modify the ingested items after the batch release process, which is not nice. An option would be to allow import only for contexts where the standard workflow is applied, and to restrict the import to users with moderator privileges. As a consequence, there is only the option to batch submit, and the items can be modified like "normal", manually submitted items.

The import workspace should impose no restrictions with respect to the privileges of the user. It is just there to distinguish items that are created by a certain ingest process. Users need to be able to modify and release the items in single mode as well, in accordance with the workflow of the context. The only differences between the depositor workspace and the import workspace are the possibility of filtering by ingestion task value, the availability of batch operations, and that the import workspace shows ONLY PENDING items. The ingestion task should not be filterable via the import workspace if the items are in status "submitted" or "released". --Natasa 10:26, 30 March 2009 (UTC)
Note: QA roles should also have the batch release operation available and the possibility to filter by import tasks. Therefore the import workspace is also to be allowed for both QA and Depositor roles. --Natasa 10:26, 30 March 2009 (UTC)
i don't really understand the necessity of an import workspace in addition to the depositor workspace. if the latter provides filtering by date, context, status, etc. wouldn't it be just as easy to find items resulting from a batch import in there?--Robert 11:15, 30 March 2009 (UTC)
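
The import-workspace view being debated here amounts to one extra filter over the items a user already owns. A minimal sketch under that assumption (field names are hypothetical):

```python
def import_workspace(items, owner, task=None):
    """Import-workspace view: only PENDING items of the owner,
    optionally narrowed to one ingestion task value."""
    return [i for i in items
            if i.get("owner") == owner
            and i.get("status") == "pending"
            and (task is None or i.get("ingestion-task") == task)]
```

Whether this filter lives in a separate workspace or as an extra facet of the depositor workspace is exactly the open question above.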

OPEN: What happens to imported persons and person data in CONE? (see also below, duplicate check for CONE persons?)

UC_PM_IN_03 Do duplicate check[edit]

The user wants to avoid creating duplicates in a specific context during ingest of new items.

OPEN: Where to include the duplicate check? During import or during batch submission/release?

Status/Schedule[edit]

  • status: in specification
  • schedule: R5?

Actors[edit]

  • Depositor
  • Moderator

Pre-Conditions[edit]

  • One or more ingestion sets have been selected (e.g. via basket)
  • the ingested items carry a unique identifier to base the duplicate check on (i.e. duplicate check on item level)

Flow of events[edit]

  • The user selects an identifier type to base the duplicate check on (e.g. WoS identifier)
  • If no duplicates are found, the user is informed.
  • In case duplicates are found, the system provides the user with a report of possible duplicates and with the following possibilities to proceed:
    • import all, including the duplicates, and overwrite the existing items (i.e. create new versions)
    • import all (duplicates may be created)
    • skip the potential duplicates and import only the non-duplicates
    • cancel the import
  • The use case ends successfully.
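
The identifier-based check and the four ways to proceed can be sketched as one function. The policy names, the dict-based records and the identifier key are assumptions for illustration; in the real flow the user would choose the policy from the duplicate report.

```python
def duplicate_check(incoming, existing, id_type, policy):
    """Duplicate check based on one identifier type (e.g. a WoS id).

    policy is one of:
      'overwrite'  - import all; existing duplicates get a new version
      'import-all' - import all; duplicates may be created
      'skip'       - import only the non-duplicates
      'cancel'     - cancel the whole import
    Returns the records that should actually be imported.
    """
    existing_ids = {e[id_type] for e in existing if id_type in e}
    duplicates = [r for r in incoming if r.get(id_type) in existing_ids]
    if not duplicates:
        return incoming            # no duplicates found; user is informed
    if policy == "cancel":
        return []
    if policy == "skip":
        return [r for r in incoming if r not in duplicates]
    return incoming                # 'overwrite' / 'import-all'
```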

OPEN: Duplicate check in CONE service possible for person?

  • The system checks if the creators within the imported data already exist within the CoNE Service.
    • The creators don't exist in the CoNE Service: new unauthorized entries are created in CoNE.
    • The creators already exist in the CoNE Service: The system displays the possible entry in CoNE. The user can decide to use the CoNE person or to add a new person in CoNE or to add name variants, affiliations etc. to an existing CoNE entry.
this needs to be very carefully checked. For R5 i would state the following: Always create unauthorized entries (unless is a custom import of MPI ICE which needs to be handled differently)--Natasa 10:30, 30 March 2009 (UTC)
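
The simple R5 rule Natasa proposes, reuse an existing CoNE entry or otherwise always create an unauthorized one, could be sketched like this (the name-keyed lookup and the `authorized` flag are illustrative assumptions about the CoNE data, not its actual API):

```python
def resolve_creators(creators, cone):
    """For each creator name, reuse an existing CoNE entry or create a
    new *unauthorized* one (the rule proposed for R5 above)."""
    resolved = []
    for name in creators:
        if name not in cone:
            cone[name] = {"name": name, "authorized": False}
        resolved.append(cone[name])
    return resolved
```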

UC_PM_IN_04 batch delete items[edit]

The user wants to delete several items from the import manager interface, as they were duplicates of items which already existed in PubMan and are no longer needed.


not needed to my understanding... depends on the final set-up of the import workspace, i.e. if we can keep track of the history of ingests (as long as imported data are not batch released, they are visible in the workspace; as soon as they are released/submitted, they are processed, i.e. might be retrieved by "my ingest history").--Ulla 17:28, 27 March 2009 (UTC)

Actually, the batch delete of imported items is a fine idea. But it is not related only to the import workspace - it can also be offered as functionality in the depositor workspace. --Natasa 10:32, 30 March 2009 (UTC)

UC_PM_IN_05 batch attach local tags[edit]

The user wants to assign one or more local tags to a set of items.

To be discussed if possible--Ulla 17:28, 27 March 2009 (UTC)

UC_PM_IN_06 batch assign organizational units[edit]

The user wants to assign one or more OUs to a set of items.

to be discussed if possible, e.g. for WoS or BibTeX import--Ulla 17:28, 27 March 2009 (UTC)

Architecture and thoughts[edit]

  • R5 - Import logic/design
    • import page modification to include
      • ingestion task identifier (username+timestamp)
      • checkbox to either skip the creation of duplicates or create possible duplicates as new items
    • asynchronous start of ingestion process
    • send mail to the user when the ingestion is finished or has failed
      • email in case of failure:
        • at which item (by title) it failed
        • possible cause: mapping, creation, validation message (for Val.point default)
        • info where to go for further steps
      • email in case of success:
        • how many items were created (with or without fulltext)
        • which items were possible duplicates (by title) (if DC is to be applied)
        • info where to go for further steps
    • on ingestion tasks (tbd: BT or eSciDoc days?)
      • possible to create items with CM: ingestion-task
      • these items should never be released
        • this will enable filtering in the ingestion tasks workspace when it comes
      • items may contain:
        • special metadata (or content-model-specific?) to point to the status of the ingestion task (scheduled, in-progress, finished successfully, finished unsuccessfully) etc. (NBU: to check the data model from before, as it was defined in detail).
        • component or MD record stream with links to the ingested items (preferably a component) and info on status (failed/success), plus info on any duplicate found
        • component with the original file uploaded for import
      • ingestion task items can be "cleaned up", i.e. deleted, if wished
      • Advantage by defining them as items:
        • we can even have separate role if needed
        • we can re-use existing functionality of item handler and storage
sounds to me like a workflow engine.--Robert 11:27, 30 March 2009 (UTC)
i also think it would be somewhat strange, if we put management process data like ingestion tasks into our repository - including persistent identifiers, lta, etc. - while keeping the cone stuff, which is actually part of the core data, out.--Robert 12:01, 30 March 2009 (UTC)
correct - there is a WF manager set-up (according to FIZ), but this would require a lot of testing. The above proposal is made in order to avoid such complexity and to not introduce another external component at present. It is in any case doubtful that we will have ingestion tasks now - the purpose is only to understand where to store them, as we will probably have to store the originally uploaded files somewhere anyway. --Natasa 08:09, 1 April 2009 (UTC)
  • R6
    • introduce ingestion definition (again as item)?
    • to help ingest-users to define their own ingestion settings and remember them
    • ingestion tasks will in that case have relations to ingestion definition
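
The "ingestion task as item" proposal above can be summarized in a small data model sketch. Everything here is an assumption drawn from the bullet list (the content model name, the status values, the component for the original file and the per-item outcome records), not an agreed design:

```python
from dataclasses import dataclass, field

# Status values for an ingestion task, as listed above.
STATUSES = ("scheduled", "in-progress",
            "finished-successfully", "finished-unsuccessfully")

@dataclass
class IngestionTaskItem:
    """An ingestion task stored as a regular repository item
    (hypothetical content model 'ingestion-task'); never released."""
    owner: str
    status: str = "scheduled"
    original_file: bytes = b""                    # file originally uploaded for import
    ingested: list = field(default_factory=list)  # links + per-item outcome

    def record(self, item_id, success, duplicate_of=None):
        # One entry per ingested item: link, outcome, any duplicate found.
        self.ingested.append(
            {"id": item_id, "success": success, "duplicate-of": duplicate_of})
```

Storing the task as an ordinary item would, as noted above, let it reuse the existing item handler and storage, at the price of keeping process data in the repository.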

Future development[edit]

  • Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
  • Set up an automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. the respective update of eSciDoc items
  • Provide separate interface to define the mapping for customized fields to eSciDoc. These Mappings can be stored in e.g. User preferences or a "Mapping library" open for all users.