PubMan Func Spec Ingestion

work in progress!!!

Scenarios

Scenario 1

Scientists maintain personal publication lists in bibliographic management systems (e.g. EndNote, RefMan) or in other bibliographic formats such as BibTeX. The ingestion feature allows them to upload a list of publications. If PubMan provides duplicate identification and handling, the ingestion can be repeated periodically to enrich or complete existing collections.

Scenario 2

External Abstracting & Indexing services (e.g. WoS) can be queried for the publication references of a specific organizational unit and/or individual author. The publication references gathered from an external service are uploaded once to the system to create the initial content of a collection (for an organizational unit and/or individual).

Implementation approach

see Discussion page

see eSciDoc Ingest Tool

Functional specification

UC_PM_IN_01 fetch data from external system

Given an external identifier, the system fetches (meta)data from an external system. These (meta)data are used to populate the metadata of a new item. If a full-text version of the item is provided, it is fetched as well.

Status/Schedule

  • Status: implemented
  • Schedule:R5

Triggers

  • The user wants to fetch the (meta)data for an item from an external system in order to create a new eSciDoc item.

Actors

  • Depositor

Pre-Conditions

  • Target context, incl. its validation rules, is selected

Flow of Events

  • 1. The user chooses to harvest data from an external service.
  • 2. The user chooses a context where new PubMan items will be created from the fetched data.
  • 3. The system displays a list of all available ingestion sources which provide an ID fetch method.
  • 4. The user selects one ingestion source and specifies an external ID appropriate to the selected ingestion source.
    • 4.1. If the ingestion source provides full-text download, the user can choose to download the default fetching format (PDF), all available formats, or none (arXiv), or simply choose whether to download the full text or not (PubMedCentral).
    • 4.2. The full text will be imported with visibility set to "public".
  • 5. The user triggers the query of the external system.
  • 6. The system queries for item data with the specified ID.
    • 6.1. The system checks whether one or more persons within the import are already in CoNE.
      • 6.1.1. For persons who are already in CoNE: the system adds the CoNE ID to the respective persons in the import data.
      • 6.1.2. Otherwise: the system creates unauthorized person entries (future development).
    • 6.2. The system receives item data for the specified ID. Continue with use case UC_PM_SM_02 edit item, with the fetched metadata values as defaults. The use case ends successfully.
    • 6.3. The query fails. The identifier does not exist or is access-restricted. The system does not receive item metadata for the specified ID. The system displays an error message (MSG_PM_SM_10). The use case ends without success.
    • 6.4. The query fails partially. The system receives only some of the full texts for the specified IDs. The system displays an information message (MSG_PM_SM_XX). The use case ends successfully.
      Comment: Is that part of R5/still a requirement? --Melanie.stetter 07:17, 9 September 2009 (UTC)
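
The three outcomes of step 6 and the CoNE enrichment of step 6.1.1 can be sketched as follows. This is a minimal illustration, not the PubMan implementation: the names `FetchResult`, `classify` and `attach_cone_ids` are hypothetical, and matching persons by exact name is an assumption made only for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FetchOutcome(Enum):
    SUCCESS = "success"   # step 6.2
    FAILED = "failed"     # step 6.3 (MSG_PM_SM_10)
    PARTIAL = "partial"   # step 6.4 (only some full texts received)


@dataclass
class FetchResult:
    # None models "identifier does not exist or is access-restricted".
    metadata: Optional[dict]
    requested_formats: list = field(default_factory=list)
    fetched_formats: list = field(default_factory=list)


def classify(result: FetchResult) -> FetchOutcome:
    """Map a fetch result to one of the three outcomes of step 6."""
    if result.metadata is None:
        return FetchOutcome.FAILED
    missing = set(result.requested_formats) - set(result.fetched_formats)
    return FetchOutcome.PARTIAL if missing else FetchOutcome.SUCCESS


def attach_cone_ids(persons: list, cone_index: dict) -> list:
    """Step 6.1.1: add the CoNE ID to persons that are already known.

    Assumption: cone_index maps a person's name to a CoNE ID.
    Unknown persons are returned unchanged (step 6.1.2 is future development).
    """
    enriched = []
    for person in persons:
        cone_id = cone_index.get(person["name"])
        enriched.append({**person, "cone_id": cone_id} if cone_id else dict(person))
    return enriched
```

A caller would show the error message only for `FetchOutcome.FAILED` and continue to the edit mask in the other two cases.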

Post-Conditions / Results

  • Fetched metadata are used to create a new item.

Constraints

  • The user should not be able to change the name of an uploaded file (uploaded automatically or manually).
    Comment: To check whether this restriction is really needed in such a strong form. (It is possible, but if really needed I would move it to some other release.) --Natasa 13:47, 6 June 2008 (UTC)


The supported sources can be found here

  • For arXiv, the full-text version can be retrieved as a PDF, PS, or TeX document. We do not try to retrieve a full-text version in HTML because it is very rarely available.


Future Development

  • The user can choose to see a preview of the uploaded file.
  • Suggestion for different failure messages during upload:
    • The identifier does not exist or the access is restricted. (implemented)
    • There is no fulltext version in arXiv.
    • Technical problems when communicating with arXiv (e.g. suggest trying again later)
  • Consider SFX and OpenURL?
  • Handle origin information.

Default Metadata for an item

Status/Schedule

  • Status: in specification
  • Schedule: to be defined
    • default content category per genre (specified default metadata)
    • default creator roles per genre (specified default metadata)
    • default source genre per item genre (specified default metadata)
    • default creator role if creator is of type organisation (specified default metadata)
    • default affiliation (same as previous) (specified as default on the GUI)

Default metadata for an item means that the system holds a default item template with pre-set metadata. As a start, this should be done as a system setting. Future development might include local definitions of item templates on collection level.

Default metadata are pre-populated on the GUI as a proposal, but can be changed by the user.

Relation to collection settings: the allowed genres are defined on the collection. The default metadata setting defines the default metadata for a certain genre or a certain creator role.
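
As a rough sketch of how these two settings could interact: the collection limits the allowed genres, the system setting supplies per-genre defaults, and user input overrides the defaults. The genre names, field names and default values below are invented for illustration and are not taken from the PubMan metadata profile.

```python
# Hypothetical system-wide defaults per genre (values are illustrative only).
DEFAULTS_BY_GENRE = {
    "article": {
        "content_category": "publisher-version",
        "creator_role": "author",
        "source_genre": "journal",
    },
    "book-item": {
        "content_category": "any-fulltext",
        "creator_role": "author",
        "source_genre": "book",
    },
}


def build_item(genre, allowed_genres, user_values=None):
    """Create an item pre-populated with defaults that the user may override.

    The collection setting (allowed_genres) is enforced and cannot be
    overridden; the default metadata are only a proposal.
    """
    if genre not in allowed_genres:
        raise ValueError(f"genre {genre!r} is not allowed in this collection")
    item = {"genre": genre, **DEFAULTS_BY_GENRE.get(genre, {})}
    item.update(user_values or {})  # user input wins over defaults
    return item
```

For example, `build_item("article", {"article"}, {"creator_role": "editor"})` keeps the default content category and source genre but records the creator role chosen by the user.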

TODO:

  • define sensible defaults in a matrix; decide where to document the matrix in CoLab
  • check dependencies in the specs "create item from template" and "create new revision": we have collection settings (limitation of allowed genres) and we have default metadata. If an item is used as a template, the templated item should "overwrite" the default metadata, but cannot overwrite the collection setting. (?) --Ulla 13:26, 27 February 2008 (CET)

Genre-specific Metadata

Status/Schedule

  • Status: implemented
  • Schedule:R4?

Genre-specific Metadata are bound to a certain application profile and are defined as system setting.

This matrix describes the metadata elements which are always, never, or optionally displayed on the edit mask (in easy submission and in normal submission), depending on the genre type. "Optionally displayed" means that the user has the option to fill them in if needed, but they are somewhat "hidden" because they are used less often. This matrix is needed for the GUI design. Genre-specific metadata are not related to validation rules!
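
One possible shape for such a matrix is a lookup from (genre, edit mask) to a display mode per metadata element. The sketch below is only an illustration of the data structure; the genres, elements and assignments are made up and do not reflect the real matrix, which is still to be defined.

```python
from enum import Enum


class Display(Enum):
    ALWAYS = "always"
    OPTIONAL = "optional"   # available, but collapsed/hidden by default
    NEVER = "never"


# Hypothetical excerpt: (genre, edit mask) -> metadata element -> display mode.
MATRIX = {
    ("article", "easy"): {
        "title": Display.ALWAYS,
        "source": Display.ALWAYS,
        "abstract": Display.OPTIONAL,
        "degree": Display.NEVER,
    },
    ("thesis", "easy"): {
        "title": Display.ALWAYS,
        "source": Display.NEVER,
        "abstract": Display.OPTIONAL,
        "degree": Display.ALWAYS,
    },
}


def layout(genre, mask):
    """Return (elements shown directly, elements offered as optional)."""
    row = MATRIX[(genre, mask)]
    shown = [name for name, mode in row.items() if mode is Display.ALWAYS]
    optional = [name for name, mode in row.items() if mode is Display.OPTIONAL]
    return shown, optional
```

Because the matrix only drives the display, an element marked `NEVER` here says nothing about whether a validation rule requires it.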

TODO:

  • define matrix of genre-specific metadata (Dimensions: genre, metadata or metadata group. Values: always on Easy Submission, always on Normal Submission, optional on ESM, always on NSM). Documentation in CoLab.
  • crosscheck assumptions on genre-specific metadata with Early Adopters (using functional prototype)

UC_PM_IN_02 import file in structured format

To save manual typing, for example, the user wants to upload a locally stored file in a structured format such as BibTeX, EndNote Export Format or RIS. A complete overview of the import formats supported by PubMan can be found in Category:ESciDoc_Mappings.

Status/Schedule

  • status: implemented
  • schedule: R 5

Triggers

  • The user wants to upload a file in structured format, containing one or more items, in order to create eSciDoc items

Actors

  • user, who has depositor and moderator rights

Pre-Conditions

  • Target context, incl. its validation rules for submission and release, is selected
  • Recommendation for users: prepare the local data so that they meet the genre-specific constraints and validation rules of the selected target context; otherwise the import may fail

Flow of events

  • 1. The user chooses to upload a file to the system
  • 2. The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, eSciDoc XML) and the context into which s/he would like to import the items.
    • 2.1. If the user selects an import format for which customized mappings have been created beforehand, the user can additionally select the customized mapping (e.g. "Endnote for MPI ICE").
    • 2.2. If the user has chosen an import format which contains links to full texts (like BibTeX), the full text will be imported with visibility set to "public".
  • 3. The user defines what should happen if the validation fails:
    • 3.1. cancel ingestion
    • 3.2. ingest only valid items
  • 4. The user defines what should happen if duplicates exist:
    • 4.1. System should not check for duplicates
    • 4.2. System should not import duplicate publications, or
    • 4.3. System should not import anything if duplicates are detected
  • 5. The user provides an ingestion description for the ingestion task, which will be attached to the items within the local tags. In addition the system assigns the timestamp to the imported items within the local tags.
  • 6. The user triggers the ingestion.
  • 7. The system checks whether one or more persons within the import are already in CoNE.
    • 7.1. For persons who are already in CoNE: the system adds the CoNE ID to the respective persons in the import data (already in R5).
    • 7.2. Otherwise: the system creates unauthorized person entries (future development).
  • 8. The system informs the user about the progress and outcome of the import.
    • 8.1. If the ingestion fails (wrong import format, full text not fetched, corrupted file, failed validation), the ingestion is cancelled or only the valid items are ingested, depending on the choice made in step 3.
      • 8.1.1. During the ingestion both sets of validation rules are applied: the one for creating an item and the one for releasing an item. If the validation for create fails, the item(s) cannot be imported. If the validation for release fails, the items are imported and the user gets a report.
    • 8.2. If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry the description of the ingestion task as a local tag, the ingestion date and the owner of the ingestion.
  • 9. The user can view the ingested pending items in his/her import workspace, do batch operations (batch delete, batch submit, batch submit/release, remove ingestion task) and view the ingestion report. The use case ends successfully.
  • 10. Proceed with UC_PM_IN_03 Batch release imported items
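
The interplay of the user's choice in step 3 with the two sets of validation rules in step 8.1.1 can be sketched as below. The function and parameter names are hypothetical, and the validation rules are passed in as plain predicates; the real rules are defined per context.

```python
from enum import Enum


class OnInvalid(Enum):
    CANCEL = "cancel ingestion"          # step 3.1
    ONLY_VALID = "ingest only valid"     # step 3.2


def ingest(items, create_rule, release_rule, on_invalid):
    """Apply both validation rule sets and return an import report.

    create_rule / release_rule are callables returning True for valid items.
    Items failing create_rule cannot be imported; items failing only
    release_rule are imported but listed in the report.
    """
    creatable = [item for item in items if create_rule(item)]
    rejected = [item for item in items if not create_rule(item)]

    if rejected and on_invalid is OnInvalid.CANCEL:
        # One invalid item is enough to cancel the whole ingestion.
        return {"imported": [], "rejected": rejected, "release_warnings": []}

    release_warnings = [item for item in creatable if not release_rule(item)]
    return {
        "imported": creatable,
        "rejected": rejected,
        "release_warnings": release_warnings,
    }
```

With `OnInvalid.ONLY_VALID`, an item that lacks a field required only for release still ends up in `imported` and is additionally listed under `release_warnings`.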

Post conditions

New pending PubMan entries have been created in the import workspace of the owner of the ingestion.

Constraints

For BibTeX files there are currently two ways of uploading files to PubMan:

  • Import (file containing one BibTeX reference):
    • If the BibTeX record contains a "URL" field, the system creates a full text within the record. The system can try to fetch a full text by following this URL. The user can decide whether references are handled as locator or file. The user can specify the content type and change the MIME type in the edit mask afterwards. If the system is unable to upload the file, the user gets an error message.
    • Comment: My proposal would be that if this fetching fails, the submission continues with a message that the file could not be fetched. --Kleinfercher 12:14, 13 January 2009 (UTC)
  • Multiple Import (one or multiple references, only for users with depositor and moderator rights):
    • If the BibTeX record contains a "URL" field, the system creates a full text within the record. The system can try to fetch a full text by following this URL. References are handled as locators. The user can follow the import of the item(s) in the import workspace. The user can specify the content type in the edit mask afterwards. If the system is unable to upload the file, the user gets an error message.

BibTeX file, structured format. See the example file by the AEI.

  • BibTeX files are idiosyncratically structured; BibTool may help with preprocessing/normalization.
    • e.g. upper- and lower-case corrections, resolving macros, Unicode encodings vs. (La)TeX encoding, etc.
  • Basic TeX parsing is needed to interpret non-ASCII characters etc., see for example https://dev.livingreviews.org/projects/epubtk/browser/trunk/ePubTk/lib/bibtexlib.py .
  • In BibTeX, fields are not repeatable; thus multiple authors need to be parsed from the author field.
  • BibTeX allows for different formats of representing an author's name; thus the parser needs to be smart enough to recognize them all. See for example http://search.cpan.org/~gward/Text-BibTeX-0.34/BibTeX/Name.pm
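
To make the last two points concrete, here is a simplified sketch of author parsing. It is not the parser used by PubMan and it does not cover all of BibTeX's name rules: it assumes authors are separated by the lowercase word "and" outside braces, treats a fully braced name as a literal (e.g. a corporate author), and does not handle commas inside braces.

```python
def split_authors(field):
    """Split a BibTeX author field on the word 'and' at brace depth 0."""
    names, current, depth = [], [], 0
    for word in field.split():
        if word == "and" and depth == 0:
            if current:
                names.append(" ".join(current))
            current = []
            continue
        depth += word.count("{") - word.count("}")
        current.append(word)
    if current:
        names.append(" ".join(current))
    return names


def parse_name(name):
    """Parse 'First von Last', 'von Last, First' or 'von Last, Jr, First'."""
    if name.startswith("{") and name.endswith("}"):
        # Braced names are kept as one literal family name.
        return {"given": "", "family": name[1:-1], "suffix": ""}
    parts = [part.strip() for part in name.split(",")]
    if len(parts) == 3:
        family, suffix, given = parts
    elif len(parts) == 2:
        (family, given), suffix = parts, ""
    else:
        words = name.split()
        # The family name is the last word plus any preceding
        # lowercase particles such as "van" or "de".
        start = len(words) - 1
        while start > 0 and words[start - 1][0].islower():
            start -= 1
        given, family, suffix = " ".join(words[:start]), " ".join(words[start:]), ""
    return {"given": given, "family": family, "suffix": suffix}
```

For instance, `split_authors("Knuth, Donald E. and Ludwig van Beethoven")` yields two names, and `parse_name` assigns "van Beethoven" as the family name of the second one.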

Suggested steps to prepare BibTeX files for import

Future Development

  • If the user chooses an import format, which contains URLs to the full text, the user can specify if s/he would like to import the full texts or not.
  • Automatic upload (give the URL of the server from which to get the data on a regular basis).
  • Check if journal names within the import file are already in CoNE.

UC_PM_IN_03 Batch release/submit imported items

The user wants to make a set of ingested items visible via PubMan.

Status/Schedule

  • status: implemented
  • schedule: R5

Actors

  • user with Depositor and Moderator rights

Pre-Conditions

  • One ingestion set has been selected

Flow of events

  • 1. The user selects one set of ingestion tasks from the workspace.
  • 2. Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
    • 2.1. Simple workflow: User has the option "Release", where items will be batch released
    • 2.2. Standard workflow: User has the option to either "Submit" (items will be batch submitted) or "Release" (items will be batch released).
  • 3. The system checks for potential duplicates. Include use case UC_PM_IN_04 Do duplicate check (OPEN)
  • 4. The system checks the items against the validation rules provided for the context, incl. the genre-specific constraint.
    • 4.1. If the items are valid, the items are submitted or released. User gets information how many items have been submitted or released. The use case ends successfully.
    • 4.2. If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process.
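
A minimal sketch of steps 2 and 4, assuming items are plain dictionaries with a `state` field and validation is a predicate supplied by the caller. The names are hypothetical, and user privileges and the duplicate check of step 3 are left out.

```python
def available_actions(workflow):
    """Step 2: batch actions offered for the target context's workflow."""
    if workflow == "simple":
        return ["release"]
    if workflow == "standard":
        return ["submit", "release"]
    raise ValueError(f"unknown workflow: {workflow!r}")


def batch_process(items, action, workflow, is_valid):
    """Step 4: submit or release all items, or report the invalid ones.

    Nothing is changed if at least one item is invalid, so the user can
    edit the invalid items and restart the batch process.
    """
    if action not in available_actions(workflow):
        raise ValueError(f"{action!r} is not available in the {workflow} workflow")
    invalid = [item for item in items if not is_valid(item)]
    if invalid:
        return {"processed": 0, "invalid": invalid}
    new_state = "released" if action == "release" else "submitted"
    for item in items:
        item["state"] = new_state
    return {"processed": len(items), "invalid": []}
```

The returned `processed` count corresponds to the information the user gets in step 4.1.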

Future Development

  • batch release/submit should also be possible from the Depositor and QA Workspace
  • the user should be able to define sets for batch operations by himself/herself

UC_PM_IN_04 Do duplicate check

The user wants to avoid creating duplicates in a specific context during the ingest of new items. The user can do a duplicate check based on identifiers.
Comment: Current implementation: the check is based on identifiers and is done during import, i.e. before creating the item. --Ulla 15:36, 29 April 2009 (UTC)

Status/Schedule

  • status: implemented
  • schedule: R5

Actors

  • User with moderator and depositor rights

Pre-Conditions

  • One ingestion set has been selected
  • The ingested items carry a unique identifier on which to base the duplicate check (i.e. duplicate check on item level)

Flow of events

  • 1. The user wants to specify what kind of duplicate check should be done during the import.
  • 2. The user selects one of the following options for the case that one or more duplicates are found during the import:
    • 2.1. Don't check for duplicate publications
    • 2.2. Don't import duplicate publications
    • 2.3. If duplicates are detected, don't import anything
    • 2.4. For all three options, the user can see the detailed report in the import workspace.
  • 3. The items are checked for duplicate publications during the import according to the options specified by the user in the import parameters. The use case ends successfully.
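
The three options can be sketched as a policy applied to the incoming items before they are created. This is an illustration under assumptions: items are dictionaries with an `identifiers` list, and `released_identifiers` stands for the set of identifiers of items already released in PubMan.

```python
from enum import Enum


class DuplicatePolicy(Enum):
    NO_CHECK = "don't check for duplicate publications"            # 2.1
    SKIP_DUPLICATES = "don't import duplicate publications"        # 2.2
    ABORT_ON_DUPLICATE = "if duplicates are detected, import nothing"  # 2.3


def apply_duplicate_policy(incoming, released_identifiers, policy):
    """Return (items to import, duplicates found) for the chosen policy."""
    if policy is DuplicatePolicy.NO_CHECK:
        return list(incoming), []

    unique, duplicates = [], []
    for item in incoming:
        is_duplicate = any(
            ident in released_identifiers for ident in item["identifiers"]
        )
        (duplicates if is_duplicate else unique).append(item)

    if policy is DuplicatePolicy.ABORT_ON_DUPLICATE and duplicates:
        return [], duplicates
    return unique, duplicates
```

The second element of the result is what a detailed report in the import workspace could list.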

Constraints

  • Duplicate check can be done only on released PubMan items

Future Development

  • The user should be able to view a duplicate checking report and decide for each item, which action should be taken

UC_PM_IN_05 batch delete items

The user wants to delete an ingestion task.

Status/Schedule

  • status: implemented
  • schedule: R5

Preconditions

Items in ingestion task must be in status pending.

Flow of events

  • The user wants to delete all items of an ingestion task from the system.
  • The user triggers a batch delete in the import workspace.
  • All items that are part of the ingestion task are deleted. The use case ends successfully.

Post-Conditions

  • The items of the selected ingestion task have been deleted.
  • Comment: The delete button is also available, if the items of the ingestion task are already submitted or released, although the items in these states can no longer be deleted.

UC_PM_IN_06 batch remove items

The user wants to remove an ingestion task from the import workspace list.

Status/Schedule

  • status: implemented
  • schedule: R5

Flow of Events

  • The user wants to remove an ingestion task from the import workspace.
  • The user triggers a batch remove in the import workspace.
  • The ingestion task is removed from the import workspace list; the ingested items are not affected. The use case ends successfully.

Post-Conditions

  • The selected ingestion task has been removed from the import workspace and is no longer shown in the My items import filter.
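
The difference between batch delete (UC_PM_IN_05) and batch remove (UC_PM_IN_06) can be sketched with two dictionaries: a hypothetical `workspace` mapping ingestion task IDs to item IDs, and a `repository` mapping item IDs to items. The source does not say what happens to the task entry after a batch delete; emptying it here is an assumption of this sketch.

```python
def batch_delete(task_id, workspace, repository):
    """Delete all items of an ingestion task; only pending items qualify."""
    item_ids = workspace[task_id]
    not_pending = [i for i in item_ids if repository[i]["state"] != "pending"]
    if not_pending:
        raise ValueError(f"items not in state pending: {not_pending}")
    for item_id in item_ids:
        del repository[item_id]
    # Assumption: the task entry stays in the workspace, now without items.
    workspace[task_id] = []


def batch_remove(task_id, workspace):
    """Remove the task from the workspace list; the items are untouched."""
    del workspace[task_id]
```

After `batch_remove`, the items of the task are still in the repository and only the workspace entry is gone.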

UC_PM_IN_07 batch attach local tags

The user wants to assign one or more local tags to a set of items, in addition to the ingestion task description.

Status/Schedule

  • status: to be specified
  • schedule: to be specified

UC_PM_IN_08 batch assign organizational units

The user wants to assign one or more OUs to a set of items.

Status/Schedule

  • status: to be specified
  • schedule: to be specified

Future development

  • Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
  • Set up automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. respective update of eSciDoc items
  • Provide a separate interface to define the mapping of customized fields to eSciDoc. These mappings can be stored in e.g. user preferences or a "mapping library" open to all users.