Talk:PubMan Func Spec Ingestion

work in progress

Phase 1

 * provide multiple item submission (batch import) for Endnote references, WoS references, eSciDoc xml, Bibtex, RIS, containing more than one reference.

Rupert: Does it mean all formats are subject to single item import too? (see comment in UC_PM_IN_01) --Rupert 08:06, 30 March 2009 (UTC)

For future development, yes, might be. For R5, focus is set on batch ingestion.--Ulla 11:48, 7 April 2009 (UTC)


 * For endnote, consider files in various versions: either version 1.x-7 or version 8.x
 * encoding of files depends on endnote version: 1.x to 7 support ASCII, 8.x supports UTF-8
 * Mapping to PubMan Genres depends on endnote version (different mappings needed)
 * First Prio: Endnote version in use by ICE and MPI Pflanze--Ulla 12:58, 24 February 2009 (UTC)
 * for BibTeX, consider that the record can contain a URL tag which points to the full text belonging to the bibliographic record and which should be uploaded to PubMan (see also BibTeX mapping)
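
A minimal sketch of the version-dependent encoding rule above, in Python. The function names and the error raised for versions outside 1.x-8.x are assumptions for illustration only; the list above only states ASCII for 1.x to 7 and UTF-8 for 8.x.

```python
def endnote_encoding(version: str) -> str:
    """Return the encoding to assume for an EndNote export of the given version."""
    major = int(version.split(".")[0])
    if 1 <= major <= 7:
        return "ascii"   # 1.x to 7 support ASCII
    if major == 8:
        return "utf-8"   # 8.x supports UTF-8
    # Assumption: other versions are not covered by the list above.
    raise ValueError(f"unsupported EndNote version: {version}")


def read_endnote_export(path: str, version: str) -> str:
    """Read an export file using the encoding implied by the EndNote version."""
    with open(path, encoding=endnote_encoding(version)) as handle:
        return handle.read()
```

The same lookup could also select the genre mapping, since the mapping depends on the EndNote version as well.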

=Functional specification=

UC_PM_IN_01 import file in structured format
In order to save manual typing for example, the user wants to upload a file in a structured format such as BibTeX, EndNote Export Format or RIS. A complete overview on supported import formats on PubMan can be found in the Category:ESciDoc_Mappings.

Status/Schedule

 * status: in specification
 * schedule: R 5

Triggers

 * the user wants to upload a file in structured format, containing one or more items, in order to create eSciDoc items

Actors

 * user, who has depositor and moderator rights

Pre-Conditions

 * Target context, incl. its validation rules for submission and release, is selected
 * Recommendation for users: Local data is prepared for the genre-specific constraints and validation rules of the selected target context, to avoid import failures

Flow of events

 * The user starts to import a file to the system
 * The user provides the path of the import file, the type of the import format (BibTeX, EndNote, WoS, RIS, escidocXML) and the context into which s/he would like to import the items.
 * In case the user selects an import format, where customized mappings have been created beforehand, the user can in addition select the customized mapping (e.g. "Endnote for MPI ICE")
 * If the user has chosen an import format which contains links to full texts (like BibTeX), the full texts are imported as well.
 * The user defines what should happen if the validation fails:
 * cancel ingestion
 * ingest only valid items
 * The user provides an ingestion description for the ingestion task, which will be attached to the items within the local tags. In addition the system assigns the timestamp to the imported items within the local tags.
 * the user triggers the ingestion
 * the system checks if one or more persons within the import are already within CoNE
 * for persons, which are already in CoNE: the system adds the CoNE ID to the respective persons in the import data (already in R5)
 * otherwise: the system creates unauthorized person entries (future development)
 * the system informs the user on the progress and outcome of the import.
 * If the ingestion fails (wrong import format, full text not fetched, corrupted file, failed validation), the ingestion is canceled or only valid items are being ingested.
 * During the ingestion both validation rules are applied: the one for create item and the one for release item. If the one for create item fails, the item(s) cannot be imported. If the one for release fails, the items are imported and the user gets a report.
 * If the import is successful, the imported items are created as pending items in the import workspace of the user. The imported items carry a system-specific property for the ingestion task (This would be best accomplished via content-model-specific-property such as  ), the ingestion date and the owner of the ingestion.
 * The user can view the ingested pending items in his import workspace, do batch operations (batch delete, batch submit/release, remove ingestion task) and view the ingestion report.
 * Proceed with UC_PM_IN_02 Batch release imported items
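
The interplay between the two validation rules and the user's choice for failed validation can be sketched as follows. This is only an illustration under stated assumptions: `OnInvalid`, `IngestionReport`, `ingest` and the callback-style validators are hypothetical names, not part of PubMan or the eSciDoc framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List

Item = Dict[str, str]  # assumption: an item is a flat dict of metadata


class OnInvalid(Enum):
    CANCEL = "cancel ingestion"
    INGEST_VALID = "ingest only valid items"


@dataclass
class IngestionReport:
    imported: List[Item] = field(default_factory=list)          # created as pending items
    rejected: List[Item] = field(default_factory=list)          # failed the create rule
    release_warnings: List[Item] = field(default_factory=list)  # imported, but fail the release rule
    cancelled: bool = False


def ingest(items: List[Item],
           valid_for_create: Callable[[Item], bool],
           valid_for_release: Callable[[Item], bool],
           on_invalid: OnInvalid) -> IngestionReport:
    report = IngestionReport()
    accepted: List[Item] = []
    for item in items:
        (accepted if valid_for_create(item) else report.rejected).append(item)
    # The user decided beforehand what happens if the create rule fails.
    if report.rejected and on_invalid is OnInvalid.CANCEL:
        report.cancelled = True
        return report
    report.imported = accepted
    # A failed release rule does not block the import; it only ends up in the report.
    report.release_warnings = [i for i in accepted if not valid_for_release(i)]
    return report
```

With `OnInvalid.INGEST_VALID`, items failing the create rule are listed in `rejected` while the remaining items are imported; with `OnInvalid.CANCEL`, nothing is imported as soon as one item fails.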

Post conditions
New pending PubMan Entries have been created in the import workspace of the owner of the ingestion.

Future Development

 * If the user chooses an import format, which contains URLs to the full text, the user can specify if s/he would like to import the full texts or not.
 * Automatic upload (give the URL of the server from where to get the data on a regular basis).
 * Check if journal names within the import file are already in CoNE.

UC_PM_IN_02 Batch release/submit imported items
The user wants to make a set of ingested items visible via PubMan.

Status/Schedule

 * status: in specification
 * schedule: R 5

Actors

 * user with Depositor and Moderator rights

Pre-Conditions

 * One ingestion set has been selected

Flow of events

 * the user selects one set of ingestion tasks from the workspace.
 * Depending on the workflow for the target context and the privileges of the user, the user can trigger either a batch release or a batch submission.
 * The system checks for potential duplicates. Include use case UC_PM_IN_03 Do duplicate check (OPEN)
 * The system checks the items against the validation rules provided for the context, incl. the genre-specific constraint.
 * If the items are valid, the items are submitted or released. User gets information how many items have been submitted or released.
 * If one or more items are invalid, the system shows the invalid items and gives the user the possibility to edit them and re-start the batch release process
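
As a rough sketch of the flow above (validation first, then release or submission of the whole set): the function name, the `state` key and the dict-based items are assumptions made for this example, and the duplicate check (still OPEN) is left out.

```python
from typing import Callable, Dict, List

Item = Dict[str, str]  # assumption: an item is a flat dict with a "state" entry


def batch_release(items: List[Item],
                  is_valid: Callable[[Item], bool],
                  action: str) -> Dict[str, object]:
    """Release or submit a whole ingestion set, or report the invalid items.

    action is "release" or "submit", depending on the workflow of the
    target context and the privileges of the user.
    """
    if action not in ("release", "submit"):
        raise ValueError(f"unknown action: {action}")
    invalid = [item for item in items if not is_valid(item)]
    if invalid:
        # Nothing is changed; the user edits the invalid items and restarts.
        return {"processed": 0, "invalid": invalid}
    new_state = "released" if action == "release" else "submitted"
    for item in items:
        item["state"] = new_state
    return {"processed": len(items), "invalid": []}
```

The `processed` count corresponds to the information the user gets on how many items have been submitted or released.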

Future Development

 * batch release/submit should also be possible from Depositor and QA Workspace
 * the user should be able to define sets for batch operations by himself/herself

UC_PM_IN_03 Do duplicate check
The user wants to avoid creating duplicates in a specific context during the ingest of new items.

OPEN: Where to include the duplicate check? During import or during batch submission/release?

Status/Schedule

 * status: in specification
 * schedule: R5

Actors

 * user with moderator and depositor rights

Pre-Conditions

 * One ingestion set has been selected
 * the ingested items carry a unique identifier to base the duplicate check on (i.e. duplicate check on item level)

Flow of events

 * the user specifies what should be done in case one or more duplicates have been found:
 * cancel the operation
 * skip the potential duplicates and only handle the non-duplicates
 * ignore duplicates and overwrite existing entries
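
The three options can be illustrated with a small sketch. It assumes that every item carries its unique identifier under an `identifier` key and that the identifiers already present in the context are known; `DuplicatePolicy` and `plan_ingest` are hypothetical names.

```python
from enum import Enum
from typing import Dict, List, Set

Item = Dict[str, str]  # assumption: each item has an "identifier" entry


class DuplicatePolicy(Enum):
    CANCEL = "cancel the operation"
    SKIP = "skip the potential duplicates"
    OVERWRITE = "overwrite existing entries"


def plan_ingest(incoming: List[Item],
                existing_ids: Set[str],
                policy: DuplicatePolicy) -> Dict[str, object]:
    """Decide which items to create and which existing entries to overwrite."""
    duplicates = [i for i in incoming if i["identifier"] in existing_ids]
    fresh = [i for i in incoming if i["identifier"] not in existing_ids]
    if duplicates and policy is DuplicatePolicy.CANCEL:
        return {"create": [], "overwrite": [], "cancelled": True}
    if policy is DuplicatePolicy.OVERWRITE:
        return {"create": fresh, "overwrite": duplicates, "cancelled": False}
    # SKIP, or CANCEL without any duplicates: only the non-duplicates are handled.
    return {"create": fresh, "overwrite": [], "cancelled": False}
```

Returning a plan instead of applying the changes directly would also fit the future per-item duplicate report, since the user could review the plan before anything is written.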

Future Development

 * the user should be able to view a duplicate checking report and decide for each item, which action should be taken

UC_PM_IN_04 batch delete items
The user wants to delete several items from the import manager interface, as they were duplicates of items which already existed in PubMan and are no longer needed.

Status/Schedule

 * status: in specification
 * schedule: R5

Flow of events

 * the user can remove all imported items which have been ingested (will be possible from the ingestion workspace)

UC_PM_IN_05 batch attach local tags
The user wants to assign one or more local tags to a set of items.

Status/Schedule

 * status: in specification
 * schedule: not R5

UC_PM_IN_06 batch assign organizational units
The user wants to assign one or more OUs to a set of items.

Status/Schedule

 * status: in specification
 * schedule: not R5

Architecture and thoughts

 * R5 - Import logic/design
 * import page modification to include
 * ingestion task identifier (username+timestamp)
 * checkbox to either skip the creation of duplicates or create possible duplicates as new items
 * asynchronous start of ingestion process
 * sendmail to user when ingestion is finished or failed
 * email in case of failure:
 * at which item (by title) it failed
 * possible cause: mapping, creation, validation message (for Val.point default)
 * info where to go for further steps
 * email in case of success:
 * how many items were created (with or without fulltext)
 * which items were possible duplicates (by title) (if DC is to be applied)
 * info where to go for further steps
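
A possible shape for the ingestion task identifier and the success e-mail text, as a sketch only. The timestamp format, the function names and the wording of the message are assumptions; actually sending the mail is left out.

```python
from datetime import datetime, timezone
from typing import List


def ingestion_task_id(username: str, started: datetime) -> str:
    """Build the task identifier from username + timestamp (UTC, assumed format)."""
    stamp = started.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{username}+{stamp}"


def success_report(task_id: str, created: int, with_fulltext: int,
                   possible_duplicates: List[str]) -> str:
    """Compose the body of the e-mail sent after a successful ingestion."""
    lines = [
        f"Ingestion task {task_id} finished.",
        f"Items created: {created} ({with_fulltext} with full text, "
        f"{created - with_fulltext} without).",
    ]
    if possible_duplicates:
        lines.append("Possible duplicates (by title):")
        lines.extend(f"  - {title}" for title in possible_duplicates)
    lines.append("Please go to your import workspace for further steps.")
    return "\n".join(lines)
```

The failure e-mail could be composed the same way from the title of the failing item and the possible cause (mapping, creation, validation message).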


 * on ingestion tasks (tbd: BT or eSciDoc days?)
 * possible to create items with CM: ingestion-task
 * these items should never be released
 * this will enable filtering ingestion tasks workspace when it comes
 * items may contain:
 * special metadata (or content-model-specific?) to point to the status of the ingestion task (scheduled, in-progress, finished successfully, finished unsuccessfully) etc. (NBU: to check data model from before, as it was defined in detail).
 * component or MD record stream with links to ingested items (preferably component) and info on status: failed/success, and info on possibly found duplicates
 * component with the original file uploaded for import
 * ingestion task items can be "cleaned-up" i.e. deleted if wished or not
 * Advantage by defining them as items:
 * we can even have separate role if needed
 * we can re-use existing functionality of item handler and storage


 * sounds to me like a workflow engine.--Robert 11:27, 30 March 2009 (UTC)


 * i also think it would be somewhat strange, if we put management process data like ingestion tasks into our repository - including persistent identifiers, lta, etc. - while keeping the cone stuff, which is actually part of the core data, out.--Robert 12:01, 30 March 2009 (UTC)
 * correct - There is a WF manager (according to FIZ) set-up, but this would require a lot of testing. The above proposal is made in order to avoid such complexity and to avoid introducing another external component at present. It is in any case doubtful that we will have ingestion tasks now - the purpose is only to understand where to store them, as we will probably have to store the originally uploaded files somewhere anyway. --Natasa 08:09, 1 April 2009 (UTC)


 * R6
 * introduce ingestion definition (again as item)?
 * to help ingest-users to define their own ingestion settings and remember them
 * ingestion tasks will in that case have relations to ingestion definition

Future development

 * Check for new version of item. It should be possible to check if a newer version of the PubMan item has been created at the import source.
 * Set up automatic ingestion mechanism (regular automatic ingest from specific URL), incl. respective update of escidoc items
 * Provide separate interface to define the mapping for customized fields to eSciDoc. These Mappings can be stored in e.g. User preferences or a "Mapping library" open for all users.

General Thoughts
(in this case arXiv)
 * Should we enhance the (technical) metadata of an item with the information where the item originally was created?
 * We should add something like a progress bar when importing data from another system
 * The system must take precautions not to get blocked from arxiv for indiscriminate automated download (see http://arxiv.org/RobotsBeware.html).

The Full Text Version can be retrieved with:

We don't try to retrieve a fulltext version in html because it is very rare
 * http://arxiv.org/pdf/ +arXiv ID (pdf)
 * http://arxiv.org/ps/ +arXiv ID (ps)
 * http://arxiv.org/src/ +arXiv ID (tex)
 * http://arxiv.org/html/ +arXiv ID (html) Found one example with html (http://arxiv.org/html/cs/9811020)
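
The URL patterns above translate into a small helper, sketched here in Python. The function and constant names are hypothetical; HTML is left out on purpose because it is very rare. Any real download would additionally have to be throttled to respect the arXiv robots policy mentioned above.

```python
# Path segment per full-text format, taken from the list above.
ARXIV_FULLTEXT_PATHS = {"pdf": "pdf", "ps": "ps", "tex": "src"}


def arxiv_fulltext_url(arxiv_id: str, fmt: str = "pdf") -> str:
    """Return the full-text URL for an arXiv ID in the requested format."""
    if fmt not in ARXIV_FULLTEXT_PATHS:
        raise ValueError(f"unsupported full-text format: {fmt}")
    return f"http://arxiv.org/{ARXIV_FULLTEXT_PATHS[fmt]}/{arxiv_id}"
```

For example, `arxiv_fulltext_url("cs/9811020", "tex")` yields `http://arxiv.org/src/cs/9811020`.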

Discussion closed
1) External locator for content: As just learned in Nijmegen, the user needs the option to provide an external locator for the full text. I.e., in addition to uploading binary content (= upload file), he needs the option to specify a locator/identifier for binary content located externally, together with the respective content category. This is true for Easy as well as normal submission. This external locator will not be part of the metadata, but modeled in the content model (component?).

External locator now part of prototype --Rupert 11:09, 10 March 2008 (CET)

2) Fetch MD, Step 3: Typo on GUI, short short. In addition, would re-phrase to "...might not cover all fetched Metadata". --Ulla 12:35, 15 February 2008 (CET)

As discussed with Natasa - Step 3 is now view item version, with all metadata visible. Short edit is now only for 'Manual Submission' --Rupert 11:09, 10 March 2008 (CET)

- Step 2 (Select collection): Shouldn't there be a note about having only one collection or more than one?
 * Abstract prototype

For Easy Submission there will be only one collection in most cases. If only one collection is available the step is not visible. --Rupert 13:58, 27 February 2008 (CET)

- Typo: "Contiuer and complete"

Done --Rupert 13:58, 27 February 2008 (CET)

- After finishing step 5 there is a decision diamond without a condition. I guess it is the validation, right?

Yes (abstract prototype is done by func team) --Rupert 13:58, 27 February 2008 (CET)

- After this decision one is led to step 1.4? I guess this is a typo, too.

I took this out. --Rupert 13:58, 27 February 2008 (CET)

- Another typo: "sucess message"

Done --Rupert 13:58, 27 February 2008 (CET)

- Step 2.1: I do not understand it: Is the upload and the preview on the same page? I would also appreciate some more information on the preview. Or will this be part of the GUI design? - Yet another typo: "successfull"

The page flow diagram is more detailed here: Editable Preview is after step 4 (manual) or after step 3 (BibTeX/Fetch MD) on a separate page.

- The texts next to "choose collection" are swapped.
 * Page flow

This was wrong ... done --Rupert 13:58, 27 February 2008 (CET)

- From "view item version" there is no direct way to submit the item, only to the Edit item mask.

Right! View item version is just a rough preview in this case. Because for the existing "view item version" an item must at least be in state pending?! Please ask Natasa just to be sure.

Comment Natasa: According to my understanding, the view item version step is invoked if the user decides to preview the item quickly without invoking the full edit mask. The item is not yet created; the view-item-version page is shown for the VO (value object) of the item only (this means the Submit action should be available). My comment is also in the PageFlow diagram. However, the prototype does not show this; instead, on BibTex_Fetch_MD_Step3 it provides two options: a) short short preview and quickly submit (please note that the short preview does not show all metadata fetched), b) check or edit all available metadata


 * The prototype should not offer option a) and option b) to be selected by the user, but should automatically invoke "option a)", which was added with the intention to provide a "classical view item" with no item id and no status information (because the GUI Team thought it would be too much disruption to directly show the full-edit mask, as was agreed originally). Therefore the alternative approach was to compose the view-item page from the value object (not retrieved from the FW); in addition, the user would be able to "submit" the item (as she regularly does from the full edit mask) or go back to "edit" the item by invoking the full-item edit mask. Therefore, "option a)" is what the user automatically gets after Step 2. --Natasa 16:05, 3 March 2008 (CET)

Done, Natasa can you please check that? --Rupert 22:12, 3 March 2008 (CET)

- Can this be linked to the according colab page?
 * Choose Collection

Done --Rupert 13:58, 27 February 2008 (CET)

Error: it is linked to the "submit item" use case and not to "create item" use case --Natasa 16:05, 3 March 2008 (CET)

- Where does "cancel" lead?
 * Choose submission method

Back to the Workspace ... Page Flow is updated. --Rupert 17:09, 27 February 2008 (CET)

Comment Natasa:
 * There are misleading labels: In action links on left vertical bar one has "Easy submission". Breadcrumbs say "Short Submission".

Rupert: Breadcrumb is more common now: You are here: Home > Main Function > Sub Function > Action--Rupert 10:19, 4 March 2008 (CET)

- "content-type" is now "content category"
 * Manual submission step 2

Must be replaced in every file then. At least for ES and FS it's done now. --Rupert 10:19, 4 March 2008 (CET)

- The design of a file input cannot be influenced by CSS. It only depends on the locale set in the clients browser and on the OS (Windows, Linux, Mac). I will attach some examples. The GUI design has to take this into account. - I guess the red star at "genre" means that this field is mandatory. Why isn't there one at "title"?

Done, I added another asterisk to the first line of authors --Rupert 13:58, 27 February 2008 (CET)

Comment Natasa
 * please use consistent rule for labeling of fields (e.g. at present one has Upload new File, Content Category, Please Upload a file and define the type of content - here we have a mixture of sometimes camel case sometimes not, also the field label is content category and the message asks for the type of content - misleading)
 * would be useful if label "Uploaded" is changed to "File" and if the file-name does not contain the directory name but only the "C:\filename.pdf"

Comment Tobias
 * I would prefer to change the buttons for the content category into a radio button group. We agreed that buttons should only be used when actions are triggered. Here the user only makes a choice which content type he wants to use.... So it should better be obvious that he is not triggering an action by simply selecting a content category. --Tschraut 12:11, 4 March 2008 (CET)

OK, as discussed...--Rupert 15:22, 5 March 2008 (CET)


 * for uploaded files it would be useful if besides the "trash can" icon one has "editing icon" i.e. to be able to edit the category of the file without having to once again upload the same file for another content category (but that would also require some other extra work probably)

Reorganized now...--Rupert 15:22, 5 March 2008 (CET)


 * Back/Next are labels to the arrow or are buttons with the arrow icon? (not clear, preference would be to have it as a button, in a same manner as "cancel")
 * Maybe back/next can be right-aligned next to each other and the cancel button can be left-aligned (this way it would not be the central button on the form) (valid for all steps)
 * missing file visibility for files and information on the file size, mime type after the file has been uploaded

As discussed with Ulla and Nicole file visibility and others are not required. --Rupert 15:22, 5 March 2008 (CET)
 * in that case, will the user know whether the file uploaded by the system has its size/mime-type recognized as he expects - or is that not needed?--Natasa 16:43, 5 March 2008 (CET)


 * Proposal: why not name "Manual submission" "Use form for data entry" or something similar, as "manual submission" is not clear --Natasa 16:05, 3 March 2008 (CET)

Not sure what a librarian would expect to see here. Perhaps we will know more after the workshop ... There are basically two concerns for the user here: Do I have to fill out something (Manually)? Or can I get data from somewhere else? --Rupert 10:19, 4 March 2008 (CET)

Participants of the workshop were all fine with the term 'Manual Submission' --Rupert 11:26, 10 March 2008 (CET)


 * To this issue: I assume the idea was to click on a button and upload a file (without the need to explicitly specify the content category), as it was thought that in 80% of cases the category can be defaulted based on the genre value.

Whatever you decide for GUI at the end. The defaulting is anyway to be resolved for R4.--Natasa 16:43, 5 March 2008 (CET)

If the last action is 'upload' I would recommend labeling it 'uploaded' (conformity with user expectations). The handling of the directory is browser dependent.

- Creator names are split up into "Name" and "Family name". I expect this would cause faulty entries, because "Name" often is associated either with the surname or with the full name. IMO "Family name" and "Given name" would be better.
 * Manual submission step 3

I took 'first name' because during interviews people were not sure about given name (!). --Rupert 13:58, 27 February 2008 (CET)


 * consistent labeling needed (currently in the prototype: First Name, Family name, Creator Type - it is "Role" actually)--Natasa 18:07, 3 March 2008 (CET)

- Why should the user enter the number of an author?

If the list contains more authors this can be used to insert the author above. --Rupert 13:58, 27 February 2008 (CET)


 * even if this is the case it is not very nice to put numbers in, as the "old" numbers will switch (reorder). Why not simply using arrows up/down for this purpose? --Natasa 18:07, 3 March 2008 (CET)

- Once entered, an author cannot be edited anymore, can he?

OK as discussed with Tobias now. Looks a little bit more complex now. --Rupert 15:22, 5 March 2008 (CET)--Rupert 13:58, 27 February 2008 (CET)


 * --Natasa 18:07, 3 March 2008 (CET)it is edit form, right? Why preventing the editing in such a manner? Maybe enabling "edit icon" (same issue as for files in step 1) can explicitly enable the editing of fields for the selected row (and thus making only 1 row editable at a time) of the author. In addition moving up/down of the complete record will not be considered as editing of the row, just re-positioning of the row (in this case no explicit numbering is needed, especially not when creating the author-record).

Comment Tobias: I would like to have the following actions for each author row: edit, remove, up, down. All these actions could be placed in the action column. Even easier would be to enable all author rows in the table. That means the rows aren't displayed as simple output text but as editable input fields. So you do not need the "edit" link in the action column anymore and users are able to edit author data faster... --Tschraut 14:37, 4 March 2008 (CET)

As we learned from the interviews, scientists only give the corresponding author and do not deal with lists of authors. --Rupert 15:22, 5 March 2008 (CET)

- If so, there should at least be the possibility to move creators up or down. Otherwise, the following can occur: The user enters 5 authors. Then she realises that she left out the first author. Now she has to delete all 5 authors to bring them back into the right order.

(--Natasa 18:07, 3 March 2008 (CET) Agree, see comment above as well)

- Is there a concept for entering authors in a predefined format yet? See http://colab.mpdl.mpg.de/mediawiki/Talk:Providing_Lists_of_Authors#Varieties_of_Lists

This would be wonderful, but lists of Authors are not scheduled for R3 so I took this simple approach. --Rupert 13:58, 27 February 2008 (CET)

- As it is decided that ONLY the "date published in print" will be asked for, there is no need for a dropdown menu, is there?
 * Manual submission step 4

Could be a misunderstanding; as far as I know "one" publication date only should be possible which can have several types. The dropdown just contains a dummy entry. --Rupert 17:09, 27 February 2008 (CET)

--Natasa 18:07, 3 March 2008 (CET) According to the last functional/GUI meeting, it is only "date published in print" - so no need for a dropdown menu. It would then be up to the librarians to later copy/paste the appropriate date.

OK. --Rupert 22:12, 3 March 2008 (CET)

- Because "Language", "Subject" and "Abstract" follow "Title of source", I as a user would have difficulties deciding whether these fields belong to my publication or to its source.

So we put the Title of source below the other fields.--Rupert 17:09, 27 February 2008 (CET)


 * --Natasa 18:07, 3 March 2008 (CET)Or one can clearly specify the source as "separate group" visually?

- Same for file input as above - If the import was successful, the user is led to "Bibtex import step 3". What happens if the import fails?
 * Bibtex import step 2


 * --Natasa 18:27, 3 March 2008 (CET)What is the label Metadata source next to the "Provide ID" text?

OK, as discussed now --Rupert 15:24, 5 March 2008 (CET)


 * --Natasa 18:27, 3 March 2008 (CET)We have talked that the BibTex file upload should contain the possibility either to upload a file with 1 reference or to directly paste the BibTex reference in a text area field (whereas if a file with 1 reference is uploaded, users see the uploaded reference in a text area field). This is not specified on step 2.

Please see annotation in Axure--Rupert 15:22, 5 March 2008 (CET)


 * Breadcrumb "Fetch Metadata" or "Provide Metadata" (consistent labeling needed)
 * Back/Next/Cancel button issue (see for Manual submission remarks above)

Moves back to step 2 with a message above (see Page Flow). --Rupert 17:09, 27 February 2008 (CET)

- Here and on "Choose submission method" radio buttons are used. The user could save one click if we used direct links ?!?
 * Bibtex import step 3

Right, but navigation should be done only with back and next. --Rupert 17:09, 27 February 2008 (CET)

--Natasa 18:27, 3 March 2008 (CET)
 * Why it is important to make the navigation only with back/next in "Fetch metadata" and in "choose submission method" pages (imho it can be a valid argument for "manual submission")?
 * in fetch metadata step 2 "Back" means "Choose other submission method"
 * in manual submission step 2 "Back" means "Choose other submission method" (so either label the button "change submission method" or simply remove the button and allow only for "cancel", as Step 3 from fetch metadata is to be removed anyway)
 * In manual submission giving "label" to the step such as "Title/Files", "Creators", "Publication info" and naming the "Back/Next" accordingly may be more "wordy" (of course, this is not generic solution, but it may be worth thinking)
 * In general, maybe the selected collection name is worth displaying somewhere (in the breadcrumb or somewhere else on the page) - as users already had to select it in the first place.

Rupert: I would do so only for Full Submission. For ES it would not make sense.--Rupert 22:12, 3 March 2008 (CET)
 * Please take care of the following: ES does not exclude selection of a collection! In any case there may be several collections to which user can make (ES/FS). Displaying the collection name in that case makes sense probably. --Natasa 16:50, 5 March 2008 (CET)