PubMan Func Spec Ingestion
work in progress!!!
Scenarios[edit]
Scenario 1[edit]
Scientists maintain personal publication lists in bibliographic management systems (e.g. EndNote, RefMan) or in other bibliographic formats such as BibTeX. The ingestion feature allows them to upload a list of publications. If duplicate identification/handling is provided by PubMan, the ingestion can be done periodically to enrich/complete existing collections.
Scenario 2[edit]
External Abstracting & Indexing services (e.g. WoS) can be queried for publication references of a specific organizational unit and/or individual author. The publication references gathered from an external service are uploaded once to the system to create initial content for a collection (for an organizational unit and/or individual).
Implementation approach[edit]
see Discussion page
Functional specification[edit]
UC_PM_IN_01 fetch data from external system[edit]
The system fetches (meta)data from an external system providing an external identifier. These (meta)data are used to populate metadata for a new item. If a fulltext version of this item is provided it is fetched as well.
Status/Schedule[edit]
- Status: implemented
- Schedule: R5
Triggers[edit]
- The user wants to fetch the (meta)data for an item from an external system in order to create a new eSciDoc item.
Actors[edit]
- Depositor
Pre-Conditions[edit]
- Target context, incl. its validation rules, is selected
Flow of Events[edit]
- 1. The user chooses to harvest data from an external service.
- 2. The system displays a list of all available ingestion sources which provide an ID fetch method.
- 3. The user selects one ingestion source and specifies an external ID appropriate to the selected ingestion source.
- 3.1. If the ingestion source provides full-text download, the user can either select whether to download the default fetching format, all available formats, or none (arXiv), or simply choose whether to download the full text or not (PubMedCentral).
- 3.2. The full text will be imported, with visibility set to "public".
- 4. The user triggers the query of the external system.
- 5. The system queries for item data with the specified ID.
- 5.1. The system checks if one or more persons within the import are already within CoNE.
- 5.1.1. For persons that are already in CoNE, the system adds the CoNE ID to the respective persons in the import data.
- 5.1.2. Otherwise, the system creates unauthorized person entries (future development).
- 5.2. The system receives item data for the specified ID. Continue with use case UC_PM_SM_02 edit item, using the fetched metadata values as defaults (a minimal query sketch follows this flow). The use case ends successfully.
- 5.3. The query fails: the identifier does not exist or is access-restricted. The system does not receive item metadata for the specified ID and displays an error message (MSG_PM_SM_10). The use case ends without success.
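A minimal sketch of steps 4-5, assuming arXiv as the ingestion source and its public query API; the class name, the response check and the console output are illustrative only, not the actual PubMan implementation:
 import java.net.URI;
 import java.net.http.HttpClient;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 // Sketch only: query arXiv for one external ID and decide between flow 5.2 and 5.3.
 public class ArxivFetchSketch {
     public static void main(String[] args) throws Exception {
         String externalId = "2101.00001"; // hypothetical ID entered by the user in step 3
         URI query = URI.create("http://export.arxiv.org/api/query?id_list=" + externalId);
         HttpClient client = HttpClient.newHttpClient();
         HttpRequest request = HttpRequest.newBuilder(query).GET().build();
         HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
         if (response.statusCode() == 200 && response.body().contains("<entry>")) {
             // 5.2: metadata received -> map the Atom entry to item metadata and
             // continue with UC_PM_SM_02 edit item (mapping omitted here)
             System.out.println("Metadata fetched, continue with edit item.");
         } else {
             // 5.3: nothing usable received -> display error message MSG_PM_SM_10
             System.out.println("Fetch failed, display error message.");
         }
     }
 }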
Post-Conditions / Results[edit]
- Fetched metadata are used to create a new item.
Constraints[edit]
The supported sources can be found here
Future Development[edit]
- Consider SFX and OpenURL?
- Handle origin information.
Default Metadata for an item[edit]
Status/Schedule[edit]
- Status: in specification
- Schedule: to be defined
- default content category per genre (specified default MD)
- default creator roles per genre (specified default MD)
- default source genre per item genre (specified default MD)
- default creator role if creator is of type organisation (specified default MD)
- default affiliation (same as previous) (specified as default on GUI)
Default Metadata for an item means that a default item template with pre-set metadata is created in the system. As a start, this should be a system setting; future development might include local definitions of item templates on collection level (a template sketch follows at the end of this subsection).
Default Metadata means that the values are pre-populated on the GUI as a kind of proposal, but can be changed by the user.
Relation to collection settings: on the collection, the allowed genres are defined. In the default MD setting, the default MD for a certain genre or a certain creator role are defined.
TODO:
- define sensible defaults in a matrix - decide where to document the matrix in CoLab
- check dependencies in the specs "create item from template" and "create new revision" => we have collection settings (limitation of allowed genres) and we have default Metadata. In case an item is used as template, the templated item should "overwrite" the default Metadata, but cannot overwrite the collection setting. (?) --Ulla 13:26, 27 February 2008 (CET)
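A minimal sketch, assuming the defaults are kept as a plain map in a system setting; the genre names, field names and default values below are illustrative and do not preempt the matrix to be defined in CoLab:
 import java.util.Map;
 // Sketch only: one default-metadata template per genre, used to pre-populate the edit mask.
 public class DefaultMetadataSketch {
     // Hypothetical holder for the defaults listed above (content category, creator role, source genre).
     record GenreDefaults(String contentCategory, String creatorRole, String sourceGenre) {}
     // System setting; future development might move this to collection level.
     static final Map<String, GenreDefaults> DEFAULTS = Map.of(
             "ARTICLE", new GenreDefaults("any-fulltext", "AUTHOR", "JOURNAL"),
             "BOOK_CHAPTER", new GenreDefaults("any-fulltext", "AUTHOR", "BOOK"),
             "TALK_AT_EVENT", new GenreDefaults("abstract", "AUTHOR", "PROCEEDINGS"));
     // Returns the proposal shown on the GUI; every value can be changed by the user.
     static GenreDefaults proposalFor(String genre) {
         return DEFAULTS.get(genre); // null if no default is defined for this genre
     }
 }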
Genre-specific Metadata[edit]
Status/Schedule[edit]
- Status: implemented
- Schedule: R4?
Genre-specific Metadata are bound to a certain application profile and are defined as a system setting.
This matrix describes the Metadata elements which are always OR never OR optionally displayed on the edit mask (in easy submission, in normal submission), depending on the genre type. Optionally displayed means that the user has the option to fill them in if needed, but they are somewhat "hidden" as less used. This matrix is needed for GUI design. Genre-specific Metadata are not related to validation rules! (A matrix sketch follows at the end of this subsection.)
TODO:
- define matrix of genre-specific Metadata (Dimensions: Genre, Metadata or Metadata group. Values: always on Easy Submission, always on Normal Submission, optional on ESM, always on NSM). Documentation in CoLab.
- crosscheck assumptions on genre-specific MD with Early Adopter (using functional prototype)
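A minimal sketch of one matrix row, assuming an enum-based representation; the field names and the THESIS row values are invented for illustration and do not preempt the matrix to be documented in CoLab:
 import java.util.EnumMap;
 import java.util.Map;
 // Sketch only: which metadata elements the edit mask shows for one genre.
 public class GenreMatrixSketch {
     enum Display { ALWAYS_EASY_AND_NORMAL, ALWAYS_NORMAL_ONLY, OPTIONAL, NEVER }
     enum Field { TITLE, EVENT, DEGREE, PUBLISHER }
     // One row of the matrix, here for a hypothetical genre "THESIS".
     static Map<Field, Display> thesisRow() {
         Map<Field, Display> row = new EnumMap<>(Field.class);
         row.put(Field.TITLE, Display.ALWAYS_EASY_AND_NORMAL);
         row.put(Field.DEGREE, Display.ALWAYS_NORMAL_ONLY);
         row.put(Field.PUBLISHER, Display.OPTIONAL); // "hidden" by default, user may open it
         row.put(Field.EVENT, Display.NEVER);        // not shown for this genre at all
         return row;
     }
 }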
UC_PM_IN_02 import file in structured format[edit]
In order to save manual typing, for example, the user wants to upload a locally stored file in a structured format such as BibTeX, EndNote Export Format or RIS. A complete overview of the import formats supported by PubMan can be found in the Category:ESciDoc_Mappings.
For BibTeX files there are at the moment two possibilities for uploading files in PubMan (the "URL" handling is sketched below the list):
- Import:
- If the BibTeX record contains "URL", the system creates a full text within the record. The user can decide whether references are handled as locators or files. The user can specify the content type and change the MIME type in the edit mask afterwards. If the system is unable to upload the file, the user gets an error message.
- Multiple Import (only for users with depositor and moderator rights):
- If the BibTeX record contains "URL", the system creates a full text within the record. References are handled as locators. The user can follow the import of the item(s) in the import workspace. The user can specify the content type in the edit mask afterwards. If the system is unable to upload the file, the user gets an error message.
BibTeX file, structured format. See the example file provided by the AEI.
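A minimal sketch of the "URL" handling described above, assuming a plain HTTP download; component creation and visibility are only indicated in comments, and the class and method names are not part of the PubMan API:
 import java.net.URI;
 import java.net.http.HttpClient;
 import java.net.http.HttpRequest;
 import java.net.http.HttpResponse;
 // Sketch only: turn a BibTeX "URL" field into a file (download) or a locator.
 public class BibtexUrlSketch {
     enum Handling { FILE, LOCATOR }
     static void handleUrl(String url, Handling handling) throws Exception {
         if (handling == Handling.LOCATOR) {
             System.out.println("Create locator component for " + url + " (visibility: public)");
             return;
         }
         // FILE: try to fetch the content; on failure the user gets an error message
         HttpClient client = HttpClient.newHttpClient();
         HttpResponse<byte[]> response = client.send(
                 HttpRequest.newBuilder(URI.create(url)).GET().build(),
                 HttpResponse.BodyHandlers.ofByteArray());
         if (response.statusCode() == 200) {
             System.out.println("Attach " + response.body().length + " bytes as file component (visibility: public)");
         } else {
             System.out.println("Upload failed, show error message to the user");
         }
     }
 }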
Status/Schedule[edit]
- status: implemented
- schedule: R5
Triggers[edit]
- the user wants to upload a file in structured format, containing one or more items, in order to create eSciDoc items
Actors[edit]
- user, who has depositor and moderator rights
Pre-Conditions[edit]
- Target context, incl. its validation rules for submission and release, is selected
- Recommendation for users: prepare local data according to the genre-specific constraints and validation rules of the selected target context, to avoid import failures.
Flow of events[edit]
- 1. The user chooses to upload a file to the system
- 2. The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, escidocXML) and the context into which s/he would like to import the items.
- 2.1. In case the user selects an import format, where customized mappings have been created beforehand, the user can in addition select the customized mapping (e.g. "Endnote for MPI ICE")
- 2.2. If the user has chosen an import format which contains links to full texts (like BibTeX), the full texts will be imported, with visibility set to "public".
- 3. The user defines what should happen if the validation fails:
- 3.1. cancel ingestion
- 3.2. ingest only valid items
- 4. The user defines what should happen if duplicates exist:
- 4.1. The system should not check for duplicates,
- 4.2. the system should not import duplicate publications, or
- 4.3. the system should not import anything if duplicates are detected.
- 5. The user provides an ingestion description for the ingestion task, which will be attached to the items within the local tags. In addition, the system assigns the timestamp to the imported items within the local tags (the choices from steps 2-5 are sketched in code after this flow).
- 6. The user triggers the ingestion.
- 7. The system checks if one or more persons within the import are already within CoNE.
- 7.1. For persons that are already in CoNE, the system adds the CoNE ID to the respective persons in the import data (already in R5).
- 7.2. Otherwise, the system creates unauthorized person entries (future development).
- 8. The system informs the user about the progress and outcome of the import.
- 8.1. If the ingestion fails (wrong import format, full text not fetched, corrupted file, failed validation), the ingestion is canceled or only valid items are being ingested.
- 8.1.1. During the ingestion both validation rules are applied, the one for "create item" and the one for "release item". If the "create item" validation fails, the item(s) cannot be imported. If the "release" validation fails, the items are imported and the user gets a report.
- 8.2. If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry the description of the ingestion task as a local tag, the ingestion date and the owner of the ingestion.
- 9. The user can view the ingested pending items in his import workspace, do batch operations (batch delete, batch submit, batch submit/release, remove ingestion task) and view the ingestion report. The use case ends successfully.
- 10. Proceed with UC_PM_IN_03 Batch release/submit imported items.
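A minimal sketch of the choices collected in steps 2-5, assuming a simple value object; all enum and field names are assumptions, not the PubMan import API:
 import java.time.Instant;
 import java.util.List;
 // Sketch only: the options of one ingestion task and the local tags attached to every imported item.
 public class ImportTaskSketch {
     enum Format { BIBTEX, ENDNOTE, WOS, RIS, ESCIDOC_XML }
     enum OnInvalid { CANCEL_INGESTION, INGEST_ONLY_VALID }       // step 3
     enum OnDuplicate { NO_CHECK, SKIP_DUPLICATES, ROLLBACK_ALL } // step 4
     record ImportTask(String filePath,
                       Format format,
                       String targetContextId,
                       OnInvalid onInvalid,
                       OnDuplicate onDuplicate,
                       String ingestionDescription) {
         // Local tags attached to every imported item (steps 5 and 8.2).
         List<String> localTags() {
             return List.of(ingestionDescription, "imported:" + Instant.now());
         }
     }
 }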
Post conditions[edit]
New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.
Constraints[edit]
- BibTeX files are idiosyncratically structured; BibTool may help with preprocessing/normalization.
- e.g. upper and lower case corrections, resolving macros, Unicode encodings vs. (La)TeX encoding, etc.
- Basic TeX parsing is needed to interpret non-ASCII characters etc., see for example https://dev.livingreviews.org/projects/epubtk/browser/trunk/ePubTk/lib/bibtexlib.py .
- In BibTeX, fields are not repeatable; thus multiple authors need to be parsed from the author field.
- BibTeX allows for different formats of representing an author's name; thus the parser needs to be smart enough to recognize them all (a simple parsing sketch follows this list). See for example http://search.cpan.org/~gward/Text-BibTeX-0.34/BibTeX/Name.pm
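A minimal sketch of the author-field constraint, assuming only the two most common name forms; the full BibTeX name grammar ("von" parts, braces, "and others") is richer, see the Text::BibTeX link above:
 import java.util.ArrayList;
 import java.util.List;
 // Sketch only: split a non-repeatable BibTeX author field and normalize the two common name forms.
 public class BibtexAuthorSketch {
     // BibTeX separates authors with the keyword "and"; each part may be "Last, First" or "First Last".
     static List<String> parseAuthors(String authorField) {
         List<String> names = new ArrayList<>();
         for (String raw : authorField.split("\\s+and\\s+")) {
             String name = raw.trim();
             if (name.contains(",")) {            // "Last, First" -> keep as-is
                 names.add(name);
             } else {                             // "First Last" -> normalize to "Last, First"
                 int lastSpace = name.lastIndexOf(' ');
                 names.add(lastSpace < 0 ? name
                         : name.substring(lastSpace + 1) + ", " + name.substring(0, lastSpace));
             }
         }
         return names;
     }
     public static void main(String[] args) {
         // prints the two normalized names "Einstein, Albert" and "Curie, Marie"
         System.out.println(parseAuthors("Albert Einstein and Curie, Marie"));
     }
 }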
Suggested steps to prepare BibTeX files for import[edit]
- Normalize BibTeX with BibTool (resolves macros, may be used to map field names, unifies the syntax).
- Parse the - now normalized - records.
- Allow for/provide a mapping for non-standard fields (and possibly genres).
- Handle substructure of fields
- Multiple entries in author and keyword fields. (see also http://nwalsh.com/tex/texhelp/bibtx-23.html)
- (La)TeX encoding for special characters/formulae (see for example https://dev.livingreviews.org/projects/epubtk/browser/trunk/ePubTk/lib/charmaps/tex2unicode.py ; a small mapping sketch follows this list).
- Map BibTeX fields/genres (including non-standard ones) to eSciDoc PubItem application profile. Mapping can be found here.
- Java Tools to check
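A minimal sketch of the (La)TeX-encoding step, assuming a small hand-written replacement table; real converters such as the tex2unicode map linked above cover far more sequences and math mode:
 import java.util.Map;
 // Sketch only: replace a few LaTeX escape sequences by their Unicode characters.
 public class TexToUnicodeSketch {
     // Illustrative subset of the mapping; not complete.
     static final Map<String, String> TEX_TO_UNICODE = Map.of(
             "{\\\"a}", "ä",
             "{\\\"o}", "ö",
             "{\\\"u}", "ü",
             "{\\ss}", "ß",
             "{\\'e}", "é");
     static String decode(String value) {
         for (Map.Entry<String, String> e : TEX_TO_UNICODE.entrySet()) {
             value = value.replace(e.getKey(), e.getValue());
         }
         return value;
     }
     public static void main(String[] args) {
         System.out.println(decode("M{\\\"u}ller, K{\\'e}vin")); // prints Müller, Kévin
     }
 }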
Future Development[edit]
- If the user chooses an import format which contains URLs to the full text, the user can specify whether s/he would like to import the full texts or not.
- Automatic upload (give the URL of the server from where to get the data on a regular basis).
- Check if journal names within the import file are already in CoNE.
UC_PM_IN_03 Batch release/submit imported items[edit]
The user wants to make a set of ingested items visible via PubMan.
Status/Schedule[edit]
- status: implemented
- schedule: R5
Actors[edit]
- user with Depositor and Moderator rights
Pre-Conditions[edit]
- One ingestion set has been selected
Flow of events[edit]
- the user selects one set of ingestion tasks from the workspace.
- Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
- Simple workflow: User has the option "submit&release", where items will be batch released
- Standard workflow: User has the option to either "submit" or "submit&release" (items will be batch released). The current implementation only offers one option or the other; it is not possible to do a batch submit and batch release the items afterwards from the import workspace.
- The system checks for potential duplicates. Include use case UC_PM_IN_04 Do duplicate check (OPEN).
- The system checks the items against the validation rules provided for the context, incl. the genre-specific constraint.
- If the items are valid, the items are submitted or released. The user gets information on how many items have been submitted or released.
- If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process (a validation/release sketch follows below).
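A minimal sketch of the validate-then-release loop, assuming a generic validator interface; the names and the workflow handling are illustrative, not the actual implementation:
 import java.util.ArrayList;
 import java.util.List;
 // Sketch only: validate each ingested item against the context rules, then submit/release the valid ones.
 public class BatchReleaseSketch {
     interface Validator { List<String> validate(String itemId); } // returns rule violations, empty = valid
     record Result(List<String> released, List<String> invalid) {}
     static Result batchSubmitAndRelease(List<String> itemIds, Validator contextRules) {
         List<String> released = new ArrayList<>();
         List<String> invalid = new ArrayList<>();
         for (String id : itemIds) {
             if (contextRules.validate(id).isEmpty()) {
                 released.add(id);  // submit & release (simple workflow) or submit only (standard workflow)
             } else {
                 invalid.add(id);   // shown to the user for editing and a re-run of the batch release
             }
         }
         return new Result(released, invalid);
     }
 }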
Future Development[edit]
- batch release/submit should also be possible from Depositor and QA Workspace
- the user should be able to define sets for batch operations by himself/herself
UC_PM_IN_04 Do duplicate check[edit]
The user wants to avoid creating duplicates in a specific context during ingest of new items. The user can do a duplicate check based on identifiers.
OPEN: Where to include the duplicate check? During import or during batch submission/release?
Current implementation: the check is based on identifiers and done during import, i.e. before creating the item. --Ulla 15:36, 29 April 2009 (UTC)
Status/Schedule[edit]
- status: in design
- schedule: R5
Actors[edit]
- user with moderator and depositor rights
Pre-Conditions[edit]
- One ingestion set has been selected
- the ingested items carry a unique identifier to base the duplicate check on (i.e. duplicate check on item level)
Flow of events[edit]
- the user specifies what should be done in case one or more duplicates have been found:
- cancel the operation
- skip the potential duplicates and only handle the non-duplicates
- ignore duplicates and overwrite existing entries
The current implementation offers three options: --Ulla 15:36, 29 April 2009 (UTC)
- don't check for duplicates
- don't import duplicates
- If duplicate is found, do rollback
For all three options, the user can see the detailed report in the import workspace (a filtering sketch follows below).
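A minimal sketch of the identifier-based check, assuming the identifiers of released items are available from a search; the method and enum names are assumptions, not the PubMan API:
 import java.util.List;
 import java.util.Set;
 // Sketch only: identifier-based duplicate check applied before items are created.
 public class DuplicateCheckSketch {
     enum Policy { NO_CHECK, SKIP_DUPLICATES, ROLLBACK_ALL }
     // releasedIdentifiers would come from a search over released PubMan items (see constraint below).
     static List<String> itemsToImport(List<String> importIdentifiers,
                                       Set<String> releasedIdentifiers,
                                       Policy policy) {
         if (policy == Policy.NO_CHECK) {
             return importIdentifiers;                  // "don't check for duplicates"
         }
         boolean hasDuplicate = importIdentifiers.stream().anyMatch(releasedIdentifiers::contains);
         if (policy == Policy.ROLLBACK_ALL && hasDuplicate) {
             return List.of();                          // "if duplicate is found, do rollback"
         }
         return importIdentifiers.stream()              // "don't import duplicates"
                 .filter(id -> !releasedIdentifiers.contains(id))
                 .toList();
     }
 }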
Constraints[edit]
- Duplicate check can be done only on released PubMan items
Future Development[edit]
- the user should be able to view a duplicate checking report and decide for each item, which action should be taken
UC_PM_IN_05 batch delete items[edit]
The user wants to delete an ingestion task from the import manager interface.
Status/Schedule[edit]
- status: in design
- schedule: R5
Flow of events[edit]
- remove ingestion task from workspace list (ingested items not affected)
- delete ingestion task from workspace (i.e. delete all items part of ingestion task)
UC_PM_IN_06 batch attach local tags[edit]
The user wants to assign one or more local tags to a set of items, in addition to the ingestion task description.
Status/Schedule[edit]
- status: to be specified
- schedule: to be specified
UC_PM_IN_07 batch assign organizational units[edit]
The user wants to assign one or more OUs to a set of items.
Status/Schedule[edit]
- status: to be specified
- schedule: to be specified
Future development[edit]
- Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
- Set up an automatic ingestion mechanism (regular automatic ingest from a specific URL), incl. the respective update of eSciDoc items
- Provide a separate interface to define the mapping of customized fields to eSciDoc. These mappings can be stored in e.g. user preferences or a "Mapping library" open for all users.