Difference between revisions of "Talk:PubMan Func Spec Ingestion"


Revision as of 12:10, 24 March 2009

work in progress

Phase 1

  • provide multiple item submission (batch import) for local Endnote files, bibtex files and WoS records, containing more than one reference.

following formats have to be supported: endnote, escidoc xml, bibtex, wos, RIS--Ulla 12:58, 24 February 2009 (UTC)

  • For endnote, consider files in various versions: either version 1.x-7 or version 8.x
    • encoding of files depends on endnote version: 1.x to 7 support ASCII, 8.x support UTF8
    • Mapping to PubMan Genres depends on endnote version (different mappings needed)
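
A minimal sketch of how the version-dependent encoding noted above could be chosen when decoding an EndNote file. The function names are hypothetical, and treating versions later than 8.x as UTF-8 is an assumption that is not stated on this page.

```python
def endnote_encoding(version: str) -> str:
    """Return the encoding to assume for an EndNote file of the given version.

    Rule from the notes above: versions 1.x to 7 use ASCII, 8.x uses UTF-8.
    Assumption: anything newer than 8.x is also treated as UTF-8.
    """
    major = int(version.split(".")[0])
    if major < 1:
        raise ValueError(f"unknown EndNote version: {version}")
    return "ascii" if major <= 7 else "utf-8"


def decode_endnote(raw: bytes, version: str) -> str:
    """Decode the raw bytes of an uploaded EndNote file (hypothetical helper)."""
    return raw.decode(endnote_encoding(version))
```

The same version switch could also select the genre mapping, since the mapping to PubMan genres differs per version as well.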

First Prio: Endnote version in use by ICE and MPI pflanze--Ulla 12:58, 24 February 2009 (UTC)

  • For WoS, consider:
    • include "times cited"? (AEI)
      Please note, that this information is not stable, but evolves over time. Newly published articles are not cited at all --Inga 10:25, 28 July 2008 (UTC)

exactly, rather provide feature by look-up service--Ulla 12:58, 24 February 2009 (UTC)

  • for BibTeX, consider that the record can contain a URL tag pointing to the fulltext belonging to the bibliographic record, which should be uploaded to PubMan (see also BibTeX mapping)

Important scenario for ICE: references are maintained in Endnote and ingested from time to time to PubMan. Therefore, use case has to include:

  • decision by user if ingest
    • a) creates new data
    • b) overwrites existing data (based on local Endnote ID)
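
The two ingest modes above can be sketched as follows. This is only an illustration: the in-memory `store`, the `local_id` key and the `ingest` function are hypothetical, and leaving already known records untouched in mode a) is an assumption, since the intended behaviour is not spelled out here.

```python
def ingest(store: dict, records: list, overwrite: bool) -> dict:
    """Ingest records keyed by their local EndNote ID.

    store     -- maps local EndNote ID to the stored record (hypothetical)
    overwrite -- False: mode a), only create new data
                 True:  mode b), also overwrite existing data with the same ID
    """
    result = {"created": 0, "overwritten": 0, "skipped": 0}
    for record in records:
        local_id = record["local_id"]
        if local_id not in store:
            store[local_id] = record
            result["created"] += 1
        elif overwrite:
            store[local_id] = record
            result["overwritten"] += 1
        else:
            # Assumption: in mode a) a known ID is left untouched.
            result["skipped"] += 1
    return result
```

Returning the counts lets the interface tell the user how many references were new and how many were replaced.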

to be clarified:

  • in case endnote import contains both new references and modified references...can user select "last modified entries" in his local endnote library to import only modified entries to pubman?
  • is institute aware that PubMan is richer than endnote, ie. additional data submissions/modifications might have to be done on PubMan?--Ulla 12:58, 24 February 2009 (UTC)

Phase 2

duplicate identification, duplicate handling

basic duplicate identification (based on ID) should be part of Phase 1--Ulla 12:59, 24 February 2009 (UTC)

Phase 3

workflow based ingestion, incl. task manager and processing of ingested items

Functional specification

UC_PM_IN_01 import file in structured format

To save manual typing, for example, the user wants to upload a file in a structured format such as BibTeX, EndNote Export Format or RIS.

we should refer here to a separate page, where we list all supported structured formats. (incl. escidoc xml, WOS) and their respective mappings. so we do not have to update the use case each time we provide additional format--Ulla 17:50, 15 March 2009 (UTC)

Status/Schedule

  • status: in specification
  • schedule: R 5

Triggers

  • the user wants to upload a file in structured format in order to create eSciDoc items

Actors

  • Import manager (?)
  • Moderator (?)
I would here add the depositor and remove the moderator. --Natasa 08:49, 23 March 2009 (UTC)

Pre-Conditions

  • Target collection has to be selected.
(please use term "context" as below in flow of events description --Natasa 08:52, 23 March 2009 (UTC))
  • validation rules for import have to be selected.

What kind of validation rules are meant?--Ulla 17:48, 15 March 2009 (UTC)

Michael always uses very simple, or no validation rules for the import of eDoc items. I suppose for the start this would be easy to use. --Nicole 08:30, 20 March 2009 (UTC)
Validation rules need not be selected explicitly. They are part of the target collection validation rules and are related to the event (default, submit, release etc.) --Natasa 08:51, 23 March 2009 (UTC)

Flow of events

  • 1. The user starts to import a file to the system
    • 1.1 The system prompts for the path of the import file, the specification of the Import Format (BibTeX, EndNote Export Format, RIS) and the context to where s/he would like to import the items.

Rupert: For the user it should not make a difference whether one or more records are uploaded. In tests they provided a file and expected all records to be imported (which was not the case at this time). From the interface point of view, ingest is rather an extension point of import. If the system detects more than one record, it should automatically invoke the ingest process with a message. --Rupert 12:10, 24 March 2009 (UTC)

Furthermore, the user can specify whether there are customizable fields in the import file and where their values should be mapped to in PubMan.

I think selection of the customizable fields can not be done in R5. In R5 we should prepare special custom set of mappings (as the only custom is for MPI ICE) --Natasa 08:53, 23 March 2009 (UTC)
    • 1.2 The user enters the path to the file, specifies the import format and confirms the input.
    • 1.3 The system checks the import format and the availability of Fedora storage.
What does import have to do with the availability of Fedora storage? I do not understand this sentence :) --Natasa 08:54, 23 March 2009 (UTC)
      • 1.3.1 The import format is valid. Continue with step 1.4
      • 1.3.2 The import format is invalid. The system discards the file and displays an error message saying that the import format is invalid and that the user should check the file and try to upload again.
    • 1.4 The file is uploaded to the system.

Rupert: Before the duplicate check is done, the user needs to be informed whether the system has received all records properly. In usability tests, users mistrust any automatic logic and would like to know: How many items are available (to cross-check that everything worked well)? Are there items where something went wrong? What comes next? As early as possible, the user needs to know whether he is required to correct something or repeat the upload. Usually this is done by some kind of preview (of the first record). Another use case would make sense here, "check import/ingest", defining what is displayed from the functional point of view before entering the duplicate check. --Rupert 11:43, 24 March 2009 (UTC)
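
As an illustration of the format check in step 1.3 and of the record count asked for in the comment above, a rough sketch follows. It relies on the usual record-start markers of the three formats (BibTeX entries starting with "@type{", RIS records with "TY  - ", EndNote export records with "%0 ") and should be read as a heuristic, not a parser. The name `check_import` and the shape of the returned summary are hypothetical.

```python
import re

# Heuristic record-start markers per declared import format.
RECORD_START = {
    # Skip the non-entry BibTeX blocks @comment, @preamble and @string.
    "bibtex": re.compile(
        r"^[ \t]*@(?!(?:comment|preamble|string)\b)\w+\s*[{(]",
        re.IGNORECASE | re.MULTILINE,
    ),
    "ris": re.compile(r"^TY  - ", re.MULTILINE),
    "endnote": re.compile(r"^%0 ", re.MULTILINE),
}


def check_import(text: str, declared_format: str) -> dict:
    """Count the records found in an uploaded file for the declared format."""
    pattern = RECORD_START.get(declared_format)
    if pattern is None:
        raise ValueError(f"unsupported import format: {declared_format}")
    count = len(pattern.findall(text))
    # No record marker at all is treated as an invalid format (step 1.3.2).
    return {"format": declared_format, "records": count, "valid": count > 0}
```

A summary like this could be shown before the duplicate check, so that the user can compare the number of detected records with the number expected.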

    • 1.5 Include UC_PM_IN_02 Do duplicate check
    • 1.6 The system creates new PubMan entries in status pending and displays them after the successful import in the import manager workspace.
The system should also mark these entries as part of a single import. This would be best accomplished via a content-model-specific property such as <ingestion><ingestion-task></ingestion-task></ingestion>
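
A small sketch of how such a property could be generated. The element names are taken from the comment above. Using a task identifier as the element content is an assumption, as are the function name and the example identifier; namespaces and the final schema are not decided on this page.

```python
import xml.etree.ElementTree as ET


def ingestion_property(task_id: str) -> str:
    """Build the marker that ties an item to one import run.

    Assumption: the <ingestion-task> element carries an identifier of the run.
    """
    ingestion = ET.Element("ingestion")
    task = ET.SubElement(ingestion, "ingestion-task")
    task.text = task_id
    return ET.tostring(ingestion, encoding="unicode")


print(ingestion_property("import-2009-03-24-001"))
```

All items created by the same upload would carry the same identifier, which would allow the import workspace to list them together.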

Comment 1: The imported items shall only get into status pending if the functionality "batch release" is available. Otherwise the items should directly be released. --Nicole 08:55, 20 March 2009 (UTC)

This is dependent on the target collection workflow - Whether directly released, or submitted. --Natasa 08:57, 23 March 2009 (UTC)

Rupert: I wouldn't expect the user to have the workflow definition in mind. And if so, it might not be clear to them that our workflow is also related to ingest/import, as the workflow is set up to handle a single publication. --Rupert 11:53, 24 March 2009 (UTC)

Comment 2: There is no maximum file size. The size of the file, that can be uploaded is variable and depends on how many users are trying to upload or are currently uploading at the same time (talked to Willi). So it can be, that it is not possible to upload a file, but after some minutes if there is less traffic it is possible in fact. Don't know how to integrate that into the specification. Any ideas? --Nicole 08:54, 20 March 2009 (UTC)

The system must report a technical error message if the file cannot be uploaded or processed due to traffic load. --Natasa 08:58, 23 March 2009 (UTC)

Comment 3: Please leave 1.5 out if not possible for now. --Nicole 16:01, 22 March 2009 (UTC)

Rupert: As the duplicate check is the most difficult thing to provide an interface for, it would be good to know its status --Rupert 11:53, 24 March 2009 (UTC)

Post conditions

New PubMan Entries have been created with the information on when they have been imported and information on the import source.

Continue with UC_PM_IN_02.

Future development

  • Check for CoNE ID
  • Automatic upload (give the URL of the server from where to get the data on a regular basis).
  • Check if journal names within the import file are already in CoNE.

UC_PM_IN_02 Do duplicate check

The user wants to make sure that the items s/he wants to add to PubMan do not already exist in his/her context and that the persons within the items are not already part of CoNE.

Status/Schedule

  • status: in specification
  • schedule: R5?

Triggers

  • the user wants to make new items or new versions of items available via PubMan

Actors

  • Import manager (?)
  • Moderator (?)

Pre-Conditions

  • One or more items have to be selected (e.g. via basket)
  • the target context has to be selected

Flow of events

  • This use case is invoked by UC_PM_IN_01 import file in structured format or by UC_PM_IN_03 batch release items.
  • 1. The system performs a duplicate check on item level and shows the user the possible duplicates.
    • 1.1 No duplicates have been found. Continue with step 1.6
    • 1.2 Possible duplicates have been found: the system provides the user with a report of possible duplicates and with the following possibilities to proceed.
      • a) import only the non duplicate items
      • b) create new version for one or more duplicate items and no new version for the new items.
      • c) remove the duplicate items and copy only the new items
      • d) cancel the upload
  • 1.2 The system checks if the creators within the imported data already exist within the CoNE Service.
    • 1.2.1 The creators don't exist in the CoNE Service: New unauthorized entries are being created in CoNE.
    • 1.2.2 The creators already exist in the CoNE Service: The system displays the possible entry in CoNE. The user can decide to use the CoNE person or to add a new person in CoNE or to add name variants, affiliations etc. to an existing CoNE entry.
  • 2. The use case ends successfully.
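
The item-level part of the flow above could be sketched like this, using the basic ID-based identification mentioned for Phase 1. The function name, the `id` key and the returned plan are hypothetical. Option c) is left out because the text does not make clear how it differs from a), and reading option b) as "duplicates become new versions, new items are created normally" is an assumption.

```python
def plan_import(incoming: list, existing_ids: set, choice: str) -> dict:
    """Split incoming items into new items and possible duplicates by ID,
    then derive what to do from the user's choice (a, b or d)."""
    new_items = [item for item in incoming if item["id"] not in existing_ids]
    duplicates = [item for item in incoming if item["id"] in existing_ids]
    report = [item["id"] for item in duplicates]

    if not duplicates:
        # Step 1.1: nothing to decide, all items are created.
        return {"create": new_items, "new_version": [], "report": report}
    if choice == "a":
        # Import only the non-duplicate items.
        return {"create": new_items, "new_version": [], "report": report}
    if choice == "b":
        # Assumption: duplicates become new versions of the existing items.
        return {"create": new_items, "new_version": duplicates, "report": report}
    if choice == "d":
        # Cancel the upload.
        return {"create": [], "new_version": [], "report": report}
    raise ValueError(f"unsupported choice: {choice}")
```

The CoNE check on creators would run separately on the items that remain in the plan.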


UC_PM_IN_03 Batch release imported items

The user wants to make items, especially imported ones, visible to the public via PubMan and save time.

Status/Schedule

  • status: in specification
  • schedule: R 5

Triggers

  • the user wants to release items in order to make them publicly available via PubMan

Actors

  • Import manager (?)
  • Moderator (?)

Pre-Conditions

  • One or more items have to be selected (e.g. via basket)
  • validation rules for release item have to be selected.

Flow of events

  • 1. The user selects one or more items from the import workspace, which s/he would like to release.
  • 2. The system provides the user with an interface, on which s/he can specify which item set s/he would like to batch release.
  • 3. The user selects the item set, which s/he would like to release. Only items from the same context are allowed for a set.
  • 4. The system checks the items against the validation rules, provided for the context.
    • 4.1 The items are valid. Continue with step 5.
    • 4.2 One or more items are invalid. The system shows the invalid items and gives the user the possibility to edit them. Continue with step 4.
  • 5. One or more items have been released.
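
Steps 3 to 5 above can be sketched as follows. Representing validation rules as simple callables and items as dictionaries with `context` and `status` keys is an assumption made for illustration; the actual validation service is not described here.

```python
def batch_release(items: list, rules: list) -> dict:
    """Release a set of items if all of them pass the context's validation rules.

    rules -- callables taking an item and returning True if it is valid
             (hypothetical stand-in for the context's validation rules)
    """
    # Step 3: only items from the same context are allowed in one set.
    contexts = {item["context"] for item in items}
    if len(contexts) > 1:
        raise ValueError("all items in a set must belong to the same context")

    # Step 4: check every item against the rules.
    invalid = [item for item in items if not all(rule(item) for rule in rules)]
    if invalid:
        # Step 4.2: nothing is released; the user edits and the check is repeated.
        return {"released": [], "invalid": invalid}

    # Step 5: release the whole set.
    for item in items:
        item["status"] = "released"
    return {"released": items, "invalid": []}
```

Calling the function again after the user has corrected the invalid items corresponds to "Continue with step 4".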

UC_PM_IN_03 batch delete items

The user wants to delete several items from the import manager interface, as they were duplicates of items which already existed in PubMan and are no longer needed.

UC_PM_IN_04 batch attach local tags

The user wants to assign one or more local tags to a set of items.

UC_PM_IN_05 batch assign organizational units

The user wants to assign one or more OUs to a set of items.

Future development

  • Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
  • Regular automated imports including an update of the existing items.