PubMan Func Spec Ingestion

work in progress!!!

=Scenarios=

Scenario 1
Scientists maintain personal publication lists in bibliographic management systems (e.g. EndNote, RefMan) or in other bibliographic formats such as BibTeX. The ingestion feature allows them to upload a list of publications. If duplicate identification/handling is provided by PubMan, ingestion can be performed periodically to enrich/complete existing collections.

Scenario 2
External Abstracting & Indexing services (e.g. WoS) can be queried for publication references of a specific organizational unit and/or individual author. The publication references gathered from an external service are uploaded once to the system to create initial content for a collection (for an organizational unit and/or individual).

Implementation approach
see Discussion page

see eSciDoc Ingest Tool

=Functional specification=

UC_PM_IN_01 fetch data from external system
The system fetches (meta)data from an external system based on an external identifier. These (meta)data are used to populate the metadata of a new item. If a full-text version of the item is available, it is fetched as well.

Status/Schedule

 * Status: implemented
 * Schedule: R5

Triggers

 * The user wants to fetch the (meta)data for an item from an external system in order to create a new eSciDoc item.

Actors

 * Depositor

Pre-Conditions

 * Target context, incl. its validation rules, is selected

Flow of Events

 * 1.	The user chooses to harvest data from an external service.
 * 2. The user chooses a context where new PubMan items will be created from the fetched data.
 * 3.	The system displays a list of all available ingestion sources which provide an ID fetch method.
 * 4.	The user selects one ingestion source and specifies an external ID appropriate to the selected ingestion source.


 * 4.1. If the ingestion source provides full-text download, the user can select whether to download the default fetching format (PDF), all available formats, or none (arXiv), or simply choose whether to download the full text at all (PubMedCentral).
 * 4.2. The full text is imported with visibility set to "public".


 * 5.    The user triggers the query of the external system.
 * 6.	The system queries for item data with the specified ID.
 * 6.1. The system checks if one or more persons within the import are already within CoNE.
 * 6.1.1. For persons, which are already in CoNE: the system adds the CoNE ID to the respective persons in the import data.
 * 6.1.2. Otherwise: the system creates unauthorized person entries (future development).


 * 6.2. The system receives item data for the specified ID. Continue with use case UC_PM_SM_02 edit item, with the fetched metadata values as defaults. The use case ends successfully.
 * 6.3. The query fails. The identifier does not exist or is access-restricted. The system does not receive item metadata for the specified ID. The system displays an error message (MSG_PM_SM_10). The use case ends without success.
 * 6.4. The query fails partially. The system receives only some of the full texts for the specified IDs. The system displays an information message (MSG_PM_SM_XX). The use case ends with success. --Melanie.stetter 07:17, 9 September 2009 (UTC) Is that part of R5/still a requirement?
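The branching in steps 6.2 to 6.4 can be sketched as follows. All names here are hypothetical, not the actual PubMan API; `fetch_fn` stands in for the real query against the external system:

```python
# Illustrative sketch (hypothetical names) of the fetch-by-ID outcomes in
# UC_PM_IN_01: success (6.2), failure (6.3), partial success (6.4).

def fetch_item(source, external_id, fetch_fn):
    """fetch_fn(source, external_id) -> dict with 'metadata' and 'fulltexts',
    or None if the identifier does not exist or is access-restricted."""
    result = fetch_fn(source, external_id)
    if result is None or not result.get("metadata"):
        # step 6.3: no metadata received -> error message, no item created
        return {"status": "failed", "message": "MSG_PM_SM_10"}
    missing = [f for f in result.get("fulltexts", []) if f.get("content") is None]
    if missing:
        # step 6.4: metadata arrived, but some full texts could not be fetched
        return {"status": "partial", "metadata": result["metadata"],
                "message": "MSG_PM_SM_XX"}
    # step 6.2: continue with UC_PM_SM_02, metadata pre-filled
    return {"status": "ok", "metadata": result["metadata"]}
```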

Post-Conditions / Results

 * Fetched metadata are used to create a new item.

Constraints

 * The user should not be able to change the name of an uploaded file (uploaded automatically or manually).
 * To check: does this restriction really have to be this strong? (It's possible, but if really needed it would be moved to some other release. --Natasa 13:47, 6 June 2008 (UTC))

The supported sources can be found here.
 * For arXiv, the full-text version can be retrieved as a PDF, PS, or TeX document. We do not try to retrieve an HTML full-text version because it is very rare.

Future Development

 * The user can choose to see a preview of the uploaded file.
 * Suggestion for different failure messages during upload:
 * The identifier does not exist or the access is restricted. (implemented)
 * There is no fulltext version in arXiv.
 * Technical Problems when communicating with arXiv (Probably try later or something)
 * Consider SFX and OpenURL?
 * Handle origin information.

Status/Schedule

 * Status: in specification
 * Schedule:to be defined


 * default content category per genre (specified default MD)
 * default creator roles per genre (specified default MD)
 * default source genre per item genre (specified default MD)
 * default creator role if creator is of type organisation (specified default MD)
 * default affiliation (same as previous)(specified as default on GUI)

Default metadata for an item means that a default item template with defaulted metadata is created in the system. As a start, this should be a system setting. Future development might include local definitions of item templates at collection level.

Default metadata are pre-populated on the GUI as a kind of proposal, but can be changed by the user.

Relation to collection settings: the collection defines the allowed genres; the default MD setting defines the default metadata for a certain genre or a certain creator role.
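As an illustration of the pre-population described above, a minimal sketch of a genre-keyed default matrix. The genres, field names, and default values here are made-up placeholders, not the actual matrix (which is still to be defined, see TODO below):

```python
# Hypothetical system-level default-metadata matrix: per-genre defaults are
# pre-populated on the GUI as a proposal and may be overridden by the user.

DEFAULTS = {
    "article": {"content_category": "publisher-version",
                "creator_role": "author",
                "source_genre": "journal"},
    "book":    {"content_category": "publisher-version",
                "creator_role": "author",
                "source_genre": "series"},
}

def prefill_metadata(genre, user_values=None):
    """Start from the genre defaults; user-entered values always win."""
    merged = dict(DEFAULTS.get(genre, {}))
    merged.update(user_values or {})
    return merged
```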

TODO:
 * define sensible defaults in matrix - where to document the matrix in CoLab
 * check dependencies with the specs "create item from template" and "create new revision" => we have collection settings (limitation of allowed genres) and we have default metadata. If an item is used as a template, the templated item should "overwrite" the default metadata, but cannot overwrite the collection setting. (?) --Ulla 13:26, 27 February 2008 (CET)

Status/Schedule

 * Status: implemented
 * Schedule:R4?

Genre-specific Metadata are bound to a certain application profile and are defined as system setting.

This matrix describes the metadata elements that are always, never, or optionally displayed on the edit mask (in easy submission and in normal submission), dependent on a certain genre type. Optionally displayed means that the user can fill them if needed, but they are somewhat "hidden" as less-used fields. This matrix is needed for GUI design. Genre-specific metadata are not related to validation rules!

TODO:
 * define matrix of genre-specific Metadata (Dimensions: Genre, Metadata or Metadata group. Values: always on Easy Submission, always on Normal Submission, optional on ESM, always on NSM). Documentation in CoLab.
 * crosscheck assumptions on genre-specific MD with Early Adopter (using functional prototype)

UC_PM_IN_02 import file in structured format
To save manual typing, the user wants to upload a locally stored file in a structured format such as BibTeX, EndNote Export Format, or RIS. A complete overview of the import formats supported by PubMan can be found in the Category:ESciDoc_Mappings.

Status/Schedule

 * status: implemented
 * schedule: R5

Triggers

 * the user wants to upload a file in structured format, containing one or more items, in order to create eSciDoc items

Actors

 * user, who has depositor and moderator rights

Pre-Conditions

 * Target context, incl. its validation rules for submission and release, is selected
 * Recommendation for users: local data is prepared for the genre-specific constraints and validation rules of the selected target context, to avoid import failures

Flow of events

 * 1. The user chooses to upload a file to the system
 * 2. The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, escidocXML) and the context into which s/he would like to import the items.


 * 2.1. If the user selects an import format for which customized mappings have been created beforehand, the user can additionally select the customized mapping (e.g. "Endnote for MPI ICE").
 * 2.2. If the user has chosen an import format which contains links to full texts (like BibTeX), the full texts are imported with visibility set to "public".
 * 3. The user defines what should happen if the validation fails:
 * 3.1. cancel ingestion
 * 3.2. ingest only valid items
 * 4. The user defines what should happen if duplicates exist:
 * 4.1. System should not check for duplicates
 * 4.2. System should not import duplicate publications
 * 4.3. System should not import anything if duplicates are detected
 * 5. The user provides a description for the ingestion task, which will be attached to the items within the local tags. In addition, the system assigns a timestamp to the imported items within the local tags.
 * 6. the user triggers the ingestion
 * 7. the system checks if one or more persons within the import are already within CoNE
 * 7.1. for persons, which are already in CoNE: the system adds the CoNE ID to the respective persons in the import data (already in R5)
 * 7.2. otherwise: the system creates unauthorized person entries (future development)
 * 8. the system informs the user on the progress and outcome of the import.
 * 8.1. If the ingestion fails (wrong import format, full text not fetched, corrupted file, failed validation), the ingestion is canceled or only the valid items are ingested.
 * 8.1.1. During the ingestion both validation rule sets are applied, the one for create item and the one for release item. If the create-item rules fail, the item(s) cannot be imported. If the release rules fail, the items are imported and the user gets a report.
 * 8.2. If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry the description of the ingestion task as a local tag, the ingestion date, and the owner of the ingestion.
 * 9. The user can view the ingested pending items in his import workspace, do batch operations (batch delete, batch submit, batch submit/release, remove ingestion task) and view the ingestion report. The use case ends successfully.
 * 10. Proceed with UC_PM_IN_03 Batch release imported items
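The interaction of step 3 (what happens on validation failure) with the two validation rule sets from step 8.1.1 can be sketched as follows. Function names and the item structure are assumptions for illustration, not the actual PubMan validation API:

```python
def ingest(items, create_valid, release_valid, on_invalid="ingest_valid"):
    """Sketch of the two-stage validation in UC_PM_IN_02.

    create_valid / release_valid: predicates for the two rule sets.
    on_invalid: 'cancel' or 'ingest_valid' (the user's choice in step 3).
    """
    creatable = [i for i in items if create_valid(i)]
    if on_invalid == "cancel" and len(creatable) != len(items):
        # step 3.1: any create-rule failure cancels the whole ingestion
        return {"ingested": [], "report": ["ingestion canceled"]}
    # step 8.1.1: items failing only the release rules are still imported,
    # but the user gets a report about them
    report = [f"{i['id']}: release validation failed"
              for i in creatable if not release_valid(i)]
    return {"ingested": creatable, "report": report}
```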

Post conditions
New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.

Constraints
For BibTeX files there are at the moment two possibilities for uploading files in PubMan:


 * Import (file containing one BibTex reference):
 * If the BibTeX record contains "URL", the system creates a full text within the record. The system can try to fetch a full text by following this URL. My proposal would be that if this fetching fails, the submission continues with a message that the file could not be fetched. --Kleinfercher 12:14, 13 January 2009 (UTC) The user can decide whether references are handled as locators or files. The user can specify the content type and change the MIME type in the edit mask afterwards. If the system is unable to upload the file, the user gets an error message.


 * Multiple Import (one or multiple references, only for users with depositor and moderator rights):
 * If the BibTeX record contains "URL", the system creates a full text within the record. The system can try to fetch a full text by following this URL. References are handled as locators. The user can follow the import of the item(s) in the import workspace. The user can specify the content type in the edit mask afterwards. If the system is unable to upload the file, the user gets an error message.

BibTeX File, structured format. See example file by the AEI.
 * BibTeX files are idiosyncratically structured; BibTool may help with preprocessing/normalization,
 * e.g. upper- and lower-case corrections, resolving macros, Unicode encodings vs. (La)TeX encodings, etc.
 * Basic TeX parsing is needed to interpret non-ASCII characters etc., see for example https://dev.livingreviews.org/projects/epubtk/browser/trunk/ePubTk/lib/bibtexlib.py.
 * In BibTeX, fields are not repeatable; thus multiple authors need to be parsed from the author field.
 * BibTeX allows different formats for representing an author's name; thus the parser needs to be smart enough to recognize them all. See for example http://search.cpan.org/~gward/Text-BibTeX-0.34/BibTeX/Name.pm
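A minimal sketch of splitting and normalizing the BibTeX author field, as described in the last two points. It handles only the two most common name forms ("Last, First" and "First Last"), not BibTeX's full name syntax (von parts, Jr parts, braces):

```python
import re

def split_authors(author_field):
    """Split a BibTeX author field on the ' and ' keyword and normalize
    'Last, First' vs 'First Last' forms to (last, first) tuples."""
    names = re.split(r"\s+and\s+", author_field.strip())
    out = []
    for name in names:
        if "," in name:
            # 'Last, First' form
            last, first = [p.strip() for p in name.split(",", 1)]
        else:
            # 'First Last' form: treat the final word as the last name
            parts = name.split()
            last, first = parts[-1], " ".join(parts[:-1])
        out.append((last, first))
    return out
```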

Suggested steps to prepare BibTeX files for import

 * Normalize BibTeX with BibTool (resolves macros, may be used to map field names, unifies the syntax).
 * Parse the - now normalized - records.
 * Allow for/provide a mapping for non-standard fields (and possibly genres).
 * Handle substructure of fields
 * Multiple entries in author and keyword fields. (see also http://nwalsh.com/tex/texhelp/bibtx-23.html)
 * (La)TeX encoding for special characters/formulae. (see for example https://dev.livingreviews.org/projects/epubtk/browser/trunk/ePubTk/lib/charmaps/tex2unicode.py)
 * Map BibTeX fields/genres (including non-standard ones) to eSciDoc PubItem application profile. Mapping can be found here.
 * Java Tools to check
 * http://jabref.sourceforge.net/
 * http://www-plan.cs.colorado.edu/henkel/stuff/javabib/
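As an illustration of the (La)TeX-encoding step above, a tiny sketch in the spirit of the tex2unicode.py module linked from the list. The mapping table here is a small made-up subset; a real mapping covers hundreds of sequences:

```python
# Hypothetical subset of a (La)TeX-to-Unicode character map for
# normalizing special characters in BibTeX field values.

TEX2UNI = {
    r'{\"a}': "ä", r'{\"o}': "ö", r'{\"u}': "ü",
    r"{\'e}": "é", r'{\`e}': "è", r'{\ss}': "ß",
}

def detex(value):
    """Replace known (La)TeX escape sequences with their Unicode characters."""
    for tex, uni in TEX2UNI.items():
        value = value.replace(tex, uni)
    return value
```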

Future Development

 * If the user chooses an import format, which contains URLs to the full text, the user can specify if s/he would like to import the full texts or not.
 * Automatic upload (give the URL of the server from where to get the data on a regular basis).
 * Check if journal names within the import file are already in CoNE.

UC_PM_IN_03 Batch release/submit imported items
The user wants to make a set of ingested items visible via PubMan.

Status/Schedule

 * status: implemented
 * schedule: R5

Actors

 * user with Depositor and Moderator rights

Pre-Conditions

 * One ingestion set has been selected

Flow of events

 * 1. The user selects one set of ingestion tasks from the workspace.
 * 2. Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
 * 2.1. Simple workflow: User has the option "Release", where items will be batch released
 * 2.2. Standard workflow: User has the option to either "Submit" (items will be batch submitted) or "Release" (items will be batch released).
 * 3. The system checks for potential duplicates. Include use case UC_PM_IN_04 Do duplicate check (OPEN)
 * 4. The system checks the items against the validation rules provided for the context, incl. the genre-specific constraint.
 * 4.1. If the items are valid, the items are submitted or released. User gets information how many items have been submitted or released. The use case ends successfully.
 * 4.2. If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process.

Future Development

 * batch release/submit should also be possible from Depositor and QA Workspace
 * the user should be able to define sets for batch operations by himself/herself

UC_PM_IN_04 Do duplicate check
The user wants to avoid creating duplicates in a specific context during the ingest of new items, and can perform a duplicate check based on identifiers. Current implementation: the check is based on identifiers and done during import, i.e. before creating the items. --Ulla 15:36, 29 April 2009 (UTC)

Status/Schedule

 * status: implemented
 * schedule: R5

Actors

 * User with moderator and depositor rights

Pre-Conditions

 * One ingestion set has been selected
 * the ingested items carry a unique identifier on which to base the duplicate check (i.e. duplicate check on item level)

Flow of events

 * 1. The user wants to specify what kind of duplicate check should be done during the import.
 * 2. The user selects one of the following options in case one or more duplicates are found during the import:


 * 2.1. Don't check for duplicate publications
 * 2.2. Don't import duplicate publications
 * 2.3. If duplicates are detected, don't import anything


 * 2.4. For all three options, the user can see the detailed report in the import workspace.
 * 3. The items are checked for duplicate publications during the import according to the options specified by the user in import parameters. The use case ends successfully.
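The three duplicate-handling options can be sketched as follows. The function name and item structure are assumptions for illustration; `existing_ids` stands for the identifiers of already released PubMan items (the constraint below):

```python
def apply_duplicate_policy(items, existing_ids, policy):
    """Sketch of the duplicate-check options in UC_PM_IN_04.

    policy: 'no_check' (2.1) | 'skip_duplicates' (2.2) | 'abort_on_duplicate' (2.3).
    Returns (items to ingest, report lines for the import workspace).
    """
    if policy == "no_check":
        return items, []
    dups = [i for i in items if i["identifier"] in existing_ids]
    if policy == "abort_on_duplicate" and dups:
        # option 2.3: any duplicate aborts the whole ingestion
        return [], [f"duplicate: {d['identifier']}" for d in dups]
    # option 2.2: ingest only the non-duplicates, report the rest
    kept = [i for i in items if i["identifier"] not in existing_ids]
    return kept, [f"duplicate skipped: {d['identifier']}" for d in dups]
```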

Constraints

 * Duplicate check can be done only on released PubMan items

Future Development

 * The user should be able to view a duplicate checking report and decide for each item, which action should be taken

UC_PM_IN_05 batch delete items
The user wants to delete an ingestion task.

Status/Schedule

 * status: implemented
 * schedule: R5

Preconditions
Items in ingestion task must be in status pending.

Flow of events

 * The user wants to delete all items of an ingestion task from the system.
 * The user triggers a batch delete in the import workspace.
 * All items that are part of the ingestion task are deleted. The use case ends successfully.

Post-Conditions

 * The items of the selected ingestion task have been deleted.
 * Comment: the delete button is also available if the items of the ingestion task are already submitted or released, although items in these states can no longer be deleted.

UC_PM_IN_06 batch remove items
The user wants to remove an ingestion task from the import workspace list.

Status/Schedule

 * status: implemented
 * schedule: R5

Flow of Events

 * The user wants to remove an ingestion task from the import workspace.
 * The user triggers a batch remove in the import workspace.
 * The ingestion task is removed from the import workspace list; the ingested items are not affected. The use case ends successfully.

Post-Conditions

 * The selected ingestion task has been removed from the import workspace and is no longer shown in the My items import filter.

UC_PM_IN_07 batch attach local tags
The user wants to assign one or more local tags to a set of items, in addition to the ingestion task description.

Status/Schedule

 * status: to be specified
 * schedule: to be specified

UC_PM_IN_08 batch assign organizational units
The user wants to assign one or more OUs to a set of items.

Status/Schedule

 * status: to be specified
 * schedule: to be specified

Future development

 * Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
 * Set up an automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. the respective update of eSciDoc items
 * Provide separate interface to define the mapping for customized fields to eSciDoc. These Mappings can be stored in e.g. User preferences or a "Mapping library" open for all users.