Difference between revisions of "Talk:PubMan Func Spec Ingestion"

Revision as of 08:53, 17 April 2009

work in progress

Phase 1

  • Provide multiple item submission (batch import) for EndNote references, WoS references, eSciDoc XML, BibTeX and RIS files containing more than one reference.

Rupert: Does it mean all formats are subject to single item import too? (see comment in UC_PM_IN_01) --Rupert 08:06, 30 March 2009 (UTC)

For future development, yes, it might be. For R5, the focus is on batch ingestion.--Ulla 11:48, 7 April 2009 (UTC)

  • For EndNote, consider files in various versions: either version 1.x-7 or version 8.x
    • encoding of files depends on the EndNote version: 1.x to 7 support ASCII, 8.x supports UTF-8
    • Mapping to PubMan genres depends on the EndNote version (different mappings needed)
    • First priority: EndNote version in use by ICE and MPI Pflanze--Ulla 12:58, 24 February 2009 (UTC)
  • For BibTeX, consider that the record can contain a URL tag that points to the full text belonging to the bibliographic record; this full text should be uploaded to PubMan (see also BibTeX mapping)
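
The version-dependent encoding noted above could be handled with a small lookup. The following is only a sketch: the function names are hypothetical (not part of PubMan), and treating every version from 8 upwards as UTF-8 is an assumption, since the notes only mention 8.x.

```python
def endnote_encoding(major_version: int) -> str:
    """Return the text encoding to assume for an EndNote export file.

    Per the notes above: versions 1.x to 7 use ASCII, 8.x uses UTF-8.
    Assumption: versions later than 8 are treated like 8.x.
    """
    if major_version < 1:
        raise ValueError(f"unknown EndNote version: {major_version}")
    if major_version <= 7:
        return "ascii"
    return "utf-8"


def read_endnote_export(data: bytes, major_version: int) -> str:
    """Decode the raw bytes of an export file with the matching encoding."""
    return data.decode(endnote_encoding(major_version))
```

A separate genre mapping per version could be selected with the same kind of lookup.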

Functional specification

UC_PM_IN_01 import file in structured format

To save manual typing, for example, the user wants to upload a file in a structured format such as BibTeX, EndNote Export Format or RIS. A complete overview of the import formats supported by PubMan can be found in Category:ESciDoc_Mappings.

Status/Schedule

  • status: in specification
  • schedule: R 5

Triggers

  • the user wants to upload a file in structured format, containing one or more items, in order to create eSciDoc items

Actors

  • user, who has depositor and moderator rights

Pre-Conditions

  • Target context, incl. its validation rules for submission and release, is selected
  • Recommendation for users: Local data is prepared for the genre-specific constraints and validation rules of the selected target context, to avoid import failure

Flow of events

  • The user starts to import a file to the system
  • The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, eSciDoc XML) and the context into which s/he would like to import the items.
    • If the user selects an import format for which customized mappings have been created beforehand, the user can additionally select the customized mapping (e.g. "Endnote for MPI ICE")
    • If the user has chosen an import format that contains links to full texts (like BibTeX), the full texts are imported as well.
  • The user defines what should happen if the validation fails:
    • cancel ingestion
    • ingest only valid items
  • The user provides an ingestion description for the ingestion task, which will be attached to the items within the local tags. In addition the system assigns the timestamp to the imported items within the local tags.
  • The user triggers the ingestion
  • The system informs the user on the progress and outcome of the import.
    • If the ingestion fails (wrong import format, full text not fetched, corrupted file, failed validation), the ingestion is canceled or only the valid items are ingested, depending on what the user chose for failed validation.
    • If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry a system-specific property for the ingestion task (this would be best accomplished via a content-model-specific property such as <ingestion><ingestion-task></ingestion-task></ingestion>), the ingestion date and the owner of the ingestion.
  • The user can view the ingested pending items in his import workspace, do batch operations and view the ingestion report.
  • Proceed with UC_PM_IN_02 Batch release imported items
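
The choice between "cancel ingestion" and "ingest only valid items" in the flow above can be sketched as follows. This is an illustration only: `OnInvalid`, `IngestionOutcome` and `ingest` are hypothetical names, items are plain dictionaries, and the validation rule is passed in as a function rather than taken from a real context.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Callable


class OnInvalid(Enum):
    """What the user chose to happen if validation fails."""
    CANCEL = "cancel ingestion"
    VALID_ONLY = "ingest only valid items"


@dataclass
class IngestionOutcome:
    description: str           # ingestion description provided by the user
    owner: str                 # owner of the ingestion
    ingestion_date: datetime   # timestamp assigned by the system
    pending: list = field(default_factory=list)   # items created as pending
    rejected: list = field(default_factory=list)  # items that failed validation
    canceled: bool = False


def ingest(items: list, is_valid: Callable[[dict], bool], on_invalid: OnInvalid,
           description: str, owner: str) -> IngestionOutcome:
    """Apply the user's failure policy to a batch of already mapped items."""
    outcome = IngestionOutcome(description, owner, datetime.now(timezone.utc))
    outcome.rejected = [item for item in items if not is_valid(item)]
    if outcome.rejected and on_invalid is OnInvalid.CANCEL:
        # At least one invalid item and the user asked to cancel: create nothing.
        outcome.canceled = True
        return outcome
    outcome.pending = [item for item in items if is_valid(item)]
    return outcome
```

The rejected items are kept in the outcome so that they can be listed in the ingestion report.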

Post conditions

New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.

Future Development

  • If the user chooses an import format, which contains URLs to the full text, the user can specify if s/he would like to import the full texts or not.

UC_PM_IN_02 Batch release/submit imported items

The user wants to make a set of ingested items visible via PubMan.

Status/Schedule

  • status: in specification
  • schedule: R 5

Actors

  • Depositor
  • Moderator

Pre-Conditions

  • One or more ingestion sets have been selected (e.g. via basket)
For R5 we should state: 1 ingestion set is selected. --Natasa 10:16, 30 March 2009 (UTC)

Flow of events

  • the user selects one or more sets of ingestion tasks from the workspace. Only item sets which have been imported for the same context can be combined.
See remark above: 1 set is selected for batch release. --Natasa 10:26, 30 March 2009 (UTC)
  • Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
  • The system checks for potential duplicates. Include use case UC_PM_IN_03 Do duplicate check (OPEN)
The step on checking duplicates was not supposed to be in the batch release, but in the import triggering. We agreed to have a check-box in the import form, by which users tell the system whether to skip the creation if a duplicate is found or to create the item in any case. --Natasa 10:26, 30 March 2009 (UTC)
  • The system checks the items against the validation rules provided for the context, incl. the genre-specific constraints. (OPEN: the only reason I see for another validation here is the case where the user has the option to modify the pending items after ingest. During import, the validation rules for the context have already been checked....)
During creation as well, validation rules have to be checked, for validation point default. In this case the validation rules for submit/release should apply. --Natasa 10:26, 30 March 2009 (UTC)
  • If the items are valid, the items are submitted or released. The user is informed how many items have been submitted or released
  • If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process
Should the system generate a list which contains the validation messages for each item? --Natasa 10:26, 30 March 2009 (UTC)
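
One way to read the last two steps, including the per-item message list asked about above, is an all-or-nothing release. The sketch below assumes that interpretation; `batch_release` is a hypothetical name, items are plain dictionaries, and only release (not submission) is modelled.

```python
from typing import Callable


def batch_release(items: list[dict],
                  validate: Callable[[dict], list[str]]
                  ) -> tuple[int, list[tuple[str, list[str]]]]:
    """Release all items only if every item passes validation.

    `validate` returns the validation messages for one item (empty if valid).
    Returns (number of released items, [(title, messages), ...] for invalid items).
    """
    report = []
    for item in items:
        messages = validate(item)
        if messages:
            report.append((item["title"], messages))
    if report:
        # Nothing is released; the user edits the invalid items and restarts.
        return 0, report
    for item in items:
        item["status"] = "released"
    return len(items), []
```

The returned report gives the user one entry per invalid item, which could be shown next to the edit option.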

OPEN: If a user imports 500 references from WoS or EndNote, he needs to be able to check the quality of each of them, and potentially modify them, before the actual release. Should this happen after the import to the import workspace? If he can modify the pending item in the import workspace, where can he store the modified item? If modification is not feasible before the release, the user has to modify the ingested items after the batch release process, which is not nice. An option would be to allow import only for contexts where the standard workflow is applied, and to restrict the import to users with moderator privileges. As a consequence, there is only the option to batch submit, and the items can be modified like "normal", manually submitted items.

The import workspace should impose no restrictions with respect to the privileges of the user. It only serves to distinguish items that are created by a certain ingest process. Users need to be able to modify and release the items in single mode as well, in accordance with the workflow of the context. The only differences between the depositor workspace and the import workspace are the possibility of filtering by ingestion task value, the possibility of batch operations, and that the import workspace shows ONLY PENDING items. The ingestion task should not be filterable via the import workspace if the items are in status "submitted" or "released". --Natasa 10:26, 30 March 2009 (UTC)
Note: QA roles should also have the batch release operation available and the possibility to filter by import tasks. Therefore the import workspace is also to be allowed for both QA and Depositor roles. --Natasa 10:26, 30 March 2009 (UTC)
i don't really understand the necessity of an import workspace in addition to the depositor workspace. if the latter provides filtering by date, context, status, etc. wouldn't it be just as easy to find items resulting from a batch import in there?--Robert 11:15, 30 March 2009 (UTC)

OPEN: What happens to imported persons and person data in CoNE? (see also below, duplicate check for CoNE persons?)

UC_PM_IN_03 Do duplicate check

The user wants to avoid creating duplicates in a specific context during ingest of new items.

OPEN: Where to include the duplicate check? During import or during batch submission/release?

Status/Schedule

  • status: in specification
  • schedule: R5?

Actors

  • Depositor
  • Moderator

Pre-Conditions

  • One or more ingestion sets have been selected (e.g. via basket)
  • the ingested items carry a unique identifier to base the duplicate check on (i.e. duplicate check on item level)

Flow of events

  • The user selects an identifier type to base the duplicate check on (e.g. WoS identifier)
  • If no duplicates are found, the user is informed.
  • In case duplicates are found, the system provides the user with a report of possible duplicates and with the following possibilities to proceed:
    • import all, including the duplicates, and overwrite the existing items (i.e. create new version)
    • import all (duplicates may be created)
    • skip the potential duplicates and import only the non-duplicates
    • cancel the import
  • The use case ends successfully.
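
The identifier-based check and the possibilities to proceed listed above could look roughly like this. All names (`OnDuplicate`, `is_duplicate`, `plan_import`) and the dictionary layout of an item are assumptions made for illustration; the set of existing identifiers stands in for a lookup in the target context.

```python
from enum import Enum


class OnDuplicate(Enum):
    OVERWRITE = "import all and overwrite existing items (new version)"
    IMPORT_ALL = "import all (duplicates may be created)"
    SKIP = "skip potential duplicates"
    CANCEL = "cancel the import"


def is_duplicate(item: dict, existing_ids: set, id_type: str) -> bool:
    """An item is a potential duplicate if its identifier of the chosen type exists."""
    value = item.get("identifiers", {}).get(id_type)
    return value is not None and value in existing_ids


def plan_import(incoming: list, existing_ids: set, id_type: str,
                action: OnDuplicate) -> tuple[list, list]:
    """Return (items to create as new items, items to store as new versions)."""
    duplicates = [i for i in incoming if is_duplicate(i, existing_ids, id_type)]
    others = [i for i in incoming if not is_duplicate(i, existing_ids, id_type)]
    if not duplicates or action is OnDuplicate.IMPORT_ALL:
        return list(incoming), []
    if action is OnDuplicate.CANCEL:
        return [], []
    if action is OnDuplicate.OVERWRITE:
        return others, duplicates
    return others, []  # OnDuplicate.SKIP
```

Items without an identifier of the selected type are never reported as duplicates here, which matches the pre-condition that the check relies on a unique identifier.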

OPEN: Is a duplicate check for persons possible in the CoNE service?

  • The system checks if the creators within the imported data already exist within the CoNE Service.
    • The creators don't exist in the CoNE Service: New unauthorized entries are being created in CoNE.
    • The creators already exist in the CoNE Service: The system displays the possible entry in CoNE. The user can decide to use the CoNE person or to add a new person in CoNE or to add name variants, affiliations etc. to an existing CoNE entry.
This needs to be checked very carefully. For R5 I would state the following: always create unauthorized entries (unless it is a custom import of MPI ICE, which needs to be handled differently)--Natasa 10:30, 30 March 2009 (UTC)

UC_PM_IN_04 batch delete items

The user wants to delete several items from the import manager interface, as they were duplicates of items that already existed in PubMan and are no longer needed.


Not needed, to my understanding... depends on the final set-up of the import workspace, i.e. whether we can keep track of the history of ingests (as long as imported data are not batch released, they are visible in the workspace; as soon as they are released/submitted, they are processed, i.e. might be retrieved by "my ingest history").--Ulla 17:28, 27 March 2009 (UTC)

Actually, the batch delete of imported items is a fine idea. But it is not related only to the import workspace - it can also be offered as functionality in the depositor workspace. --Natasa 10:32, 30 March 2009 (UTC)

UC_PM_IN_05 batch attach local tags

The user wants to assign one or more local tags to a set of items.

To be discussed if possible--Ulla 17:28, 27 March 2009 (UTC)

UC_PM_IN_06 batch assign organizational units

The user wants to assign one or more OUs to a set of items.

To be discussed if possible, e.g. for WoS or BibTeX import--Ulla 17:28, 27 March 2009 (UTC)

Architecture and thoughts

  • R5 - Import logic/design
    • import page modification to include
      • ingestion task identifier (username+timestamp)
      • checkbox: skip creation of duplicates, or create possible duplicates as new items
    • asynchronous start of ingestion process
    • send mail to the user when the ingestion has finished or failed
      • email in case of failure:
        • at which item (by title) it failed
        • possible cause: mapping, creation, validation message (for validation point default)
        • info where to go for further steps
      • email in case of success:
        • how many items were created (with or without fulltext)
        • which items were possible duplicates (by title) (if the duplicate check is to be applied)
        • info where to go for further steps
    • on ingestion tasks (tbd: BT or eSciDoc days?)
      • possible to create items with CM: ingestion-task
      • these items should never be released
        • this will enable filtering ingestion tasks workspace when it comes
      • items may contain:
        • special metadata (or content-model-specific?) to indicate the status of the ingestion task (scheduled, in progress, finished successfully, finished unsuccessfully), etc. (NBU: to check data model from before, as it was defined in detail).
        • component or MD record stream with links to ingested items (preferably a component) and info on status (failed/success) and on any duplicates found
        • component with the original file uploaded for import
      • ingestion task items can be "cleaned up", i.e. deleted, if desired
      • Advantage of defining them as items:
        • we can even have a separate role if needed
        • we can re-use existing functionality of item handler and storage
sounds to me like a workflow engine.--Robert 11:27, 30 March 2009 (UTC)
i also think it would be somewhat strange, if we put management process data like ingestion tasks into our repository - including persistent identifiers, lta, etc. - while keeping the cone stuff, which is actually part of the core data, out.--Robert 12:01, 30 March 2009 (UTC)
Correct - there is a WF manager set-up (according to FIZ), but this would require a lot of testing. The above proposal is made in order to avoid such complexity and the introduction of another external component at present. It is in any case doubtful that we will have ingestion tasks now - the purpose is only to understand where to store them, as we will probably have to store the originally uploaded files somewhere anyway. --Natasa 08:09, 1 April 2009 (UTC)
  • R6
    • introduce ingestion definition (again as item)?
    • to help ingest-users to define their own ingestion settings and remember them
    • ingestion tasks will in that case have relations to ingestion definition
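
To make the ingestion task proposal above more concrete, here is a rough data sketch: an identifier built from username and timestamp, the four task states, links to the ingested items, and the figures needed for the notification mail. Class and field names are hypothetical and the identifier format is an assumption; nothing here reflects an existing eSciDoc content model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class TaskStatus(Enum):
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in-progress"
    FINISHED_OK = "finished successfully"
    FINISHED_FAILED = "finished unsuccessfully"


@dataclass
class IngestedItemLink:
    title: str
    created: bool                     # False if mapping, creation or validation failed
    has_fulltext: bool = False
    possible_duplicate: bool = False


@dataclass
class IngestionTask:
    username: str
    started: datetime
    status: TaskStatus = TaskStatus.SCHEDULED
    links: list[IngestedItemLink] = field(default_factory=list)

    @property
    def task_id(self) -> str:
        """Ingestion task identifier: username plus timestamp (format assumed)."""
        return f"{self.username}_{self.started:%Y%m%dT%H%M%S}"

    def report(self) -> dict:
        """Collect the figures listed above for the success/failure mail."""
        created = [link for link in self.links if link.created]
        return {
            "created": len(created),
            "with_fulltext": sum(link.has_fulltext for link in created),
            "possible_duplicates": [l.title for l in created if l.possible_duplicate],
            "failed": [link.title for link in self.links if not link.created],
        }
```

An ingestion definition as considered for R6 could later be referenced from such a task by an additional field.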

Future development

  • Check for new version of item. It should be possible to check whether a newer version of the PubMan item has been created at the import source.
  • Set up an automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. the respective update of eSciDoc items
  • Provide a separate interface to define the mapping of customized fields to eSciDoc. These mappings can be stored e.g. in the user preferences or in a "mapping library" open to all users.