Talk:EDoc to PubMan migration

Revision as of 13:07, 25 April 2012 by Webers (talk | contribs)

Setting up EA institute on PubMan and preparing migration

  • Prepare contexts with validation rules
    . We will have 2 contexts, so that the items can be filtered in the workspace. The import will automatically release the eDoc data on PubMan. Only released eDoc records will be imported to PubMan. For the imported data we suggest a dedicated import user, as we did for MPIPL.
  • Ask the institute which workflow should be set up for the contexts
    • Simple publication and modification workflow, where the depositor can immediately release an item.
    • Extended publication and modification workflow, where the depositor can only submit the item and the moderator has to check and can then release it.
  • Create user accounts on PubMan
  • Create organizational units
    • Mapping of current eDoc affiliations to PubMan org units
    • Ask if affiliations have changed (history of affiliation). The affiliations in the eDoc metadata have to be the same as the affiliations in the affiliation admin interface. If they don't match, we have to check with the institute what to do.
    • Special case for MPIPL: they have no affiliations, but their data is organized collection-wise by departments. This has to be mapped to OUs in PubMan.
  • Check data on eDoc (like done for MPIPL) and consider special handling, e.g.
    • What to do with <sub> and <sup>?
      • They will be imported, but there will be no HTML processing. Check transformation or alternative notation possibilities for tags.
    • What to do with markuptitle, markuptype and markupabstract? Check eDoc query and ask concerned institutes for "help".
    • What to do with values/information filled in not-mapped metadata elements?
      • All eDoc data will be imported as ZIM Export XML and it has to be decided how it will be stored with a PubItem.
After consulting with FIZ: we will store the XML as an additional component (file) with content-category "original-metadata". This file should not be visible from PubMan components. --Natasa 11:58, 8 September 2008 (UTC)
  • Delete test data


Institutes working with PubMan need to export their PubMan data to eDoc, to generate the Yearbook.

  • Create eDoc collection "Yearbook YYYY" on eDoc
  • Enable export from PubMan to eDoc. Only PubMan Metadata have to be exported, no components.
    • Only PubMan original items will be exported (no eDoc ID). If the user changes imported data on PubMan s/he also has to change it in eDoc.
  • Create a container for the yearbook records in PubMan. (Search, mark for container and then export to yearbook/eDoc). Items that are in a container will be marked as exported for yearbook XY. Records can be selected/deselected from the container. The container will be released when it is ready for the export to eDoc.
  • Containers can be created, viewed and edited by moderators (privileges have to be checked; this will be done by Natasa).
  • Users have to be aware that the export of PubMan data for the YB to eDoc should be done only once, to avoid duplicates. PubMan data imported to eDoc will be put in one virtual collection and can then be copied to an archival collection on eDoc if needed (not recommended).

PubMan ID will be kept as local ID in eDoc and URI to PubMan.

eDoc Affiliations versus PubMan Organizational Units

Concept on how to transfer eDoc affiliations to PubMan OUs

  • What to do if the affiliation, which exists in eDoc is now another organization? Shall both be created in PubMan even though we can't handle the history of affiliations?
My proposal is to first check how many such affiliations there are and decide on a case-by-case basis. I believe there are not too many. In addition, I would point out that the mapping of all affiliations from eDoc to PubMan should be kept and known, as it is important for the YB processing later on. --Natasa 07:48, 19 August 2008 (UTC)
See above preparations for the migration.
Another idea would be to add the old and the new affiliation (if known) and take care of the relations between the two (or more) affiliations at a later stage. --Nicole 11:16, 8 September 2008 (UTC)
  • Consider various migration options, e.g.
    1. Do not add any affiliation
    2. Add selected affiliation to each
    3. Add all affiliations to the first creator only
Would opt for adding all MPG affiliations attached to the eDoc item to all MPG authors (it is easier for the moderators to remove them afterwards than to add them) --Natasa 07:48, 19 August 2008 (UTC) AGREED.
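
The agreed option (every MPG affiliation attached to the eDoc item goes to every MPG author) could be sketched roughly as follows. The `Creator` record, its `is_mpg` flag and the `assign_affiliations` helper are hypothetical names chosen for illustration; they are not taken from the eDoc or PubMan data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Creator:
    # Hypothetical, simplified creator record; not the real PubMan model.
    name: str
    is_mpg: bool
    affiliations: List[str] = field(default_factory=list)


def assign_affiliations(creators: List[Creator], mpg_affiliations: List[str]) -> None:
    """Attach every MPG affiliation of the eDoc item to every MPG creator."""
    for creator in creators:
        if creator.is_mpg:
            # Each creator gets an own copy, so moderators can prune per creator.
            creator.affiliations = list(mpg_affiliations)


creators = [Creator("Author A", True), Creator("Author B", False)]
assign_affiliations(creators, ["Department 1", "Department 2"])
```

Non-MPG creators are left without an affiliation in this sketch; moderators would then remove the superfluous entries by hand.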



  • Current assumption: There is no need to move eDoc users to PubMan
Wouldn't it be a reasonable feature to allow [edoc/escidoc] users to access "their" items under "my items"? --Inga 14:18, 6 August 2008 (UTC)
Talked to Karin today. She thinks that there is no need to migrate users. --Nicole 13:32, 7 August 2008 (UTC)

Item Versions

  • eDoc item history will not be moved, thus the last eDoc version is imported to PubMan by creating the first version of a new object
Question: Do we need to check if the last eDoc version is in any specific state (e.g. "released")? --Inga 19:49, 24 August 2008 (UTC)
Only eDoc data in state released will be imported to PubMan.
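
A minimal sketch of this selection rule, assuming each eDoc item is available as a list of versions ordered from oldest to newest. The dictionary layout, the `select_for_import` name and the "pending" state are assumptions made for the example; the sketch also assumes that an item is skipped when its last version is not released, which the discussion above does not spell out.

```python
def select_for_import(versions_by_item):
    """Pick the last version of each eDoc item, if that version is released."""
    selected = {}
    for item_id, versions in versions_by_item.items():
        last = versions[-1]  # assumes versions are ordered oldest to newest
        if last["state"] == "released":
            # History is not migrated: the last eDoc version becomes version 1.
            selected[item_id] = {**last, "version": 1}
    return selected


edoc_items = {
    "1001": [
        {"state": "released", "title": "First draft"},
        {"state": "released", "title": "Final title"},
    ],
    # "pending" is a made-up state name for the example.
    "1002": [{"state": "pending", "title": "Not yet released"}],
}
imported = select_for_import(edoc_items)
```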

Mapping Data: eDoc <=> PubMan

eDoc Identifier

  • search for eDoc IDs in PubMan will be possible
  • Duplicate handling? "Clever" duplicate detection and handling for the import of the eDoc data to PubMan, due to the collection policy in eDoc. See also CoLab page on duplicates.
At first, duplicate detection should be based on eDoc IDs.--Ulla 10:22, 18 July 2008 (UTC)
If an eDoc record is moved to PubMan, then it can be marked as moved on eDoc, so there is no need to check duplicates during import. In addition, as there are relations between potential duplicate records on eDoc, this can be resolved in the same manner, i.e. duplicates can also be marked as removed. So we would not check duplicates during import. --Natasa 07:50, 19 August 2008 (UTC)
One more issue: we need to check if new identifiers need to be also put on eDoc record or it is sufficient to have eDoc identifier in PubMan record. Moving the identifier to eDoc record can enable directly linking to PubMan record for modification. --Natasa 10:37, 19 August 2008 (UTC)
Conclusion: Duplication problem has to be solved on eDoc. We have to check, if we can help by providing a report on similar titles, authors and identifiers (to be premised).
Would opt for the first alternative. --Natasa 07:50, 19 August 2008 (UTC)
I think this is not meant as a choice; ideally all these URLs will remain valid. -- Robert 07:57, 19 August 2008 (UTC)
True, got it. What was meant for the first option is to include it as Item identifier in Item metadata. Sure, all other URLs should be valid for resolution in future as well.--Natasa 08:11, 19 August 2008 (UTC)

Content categories

  • Identification and assignment of content categories to each file on eDoc. Decision: all imported files will be of category "any full text".

Access level for components/files

Mapping of eDoc access levels (public, MPG, institute, internal, privileged users) to PubMan access levels (currently supported: public, private)

  • Option a (preferred) = access level to org units is provided. In this case mapping can be offered for access levels public, MPG, institute and internal.
  • Option b (realistic) = no access level to org units is provided
    • For now, option b will be realized. We will migrate all full texts, set "private" visibility on all full texts except the public ones, create users with privileged view, and set locators to eDoc (for non-public full texts). The eDoc file visibility will also be stored on PubMan in the component metadata.
  • Who is owner of migrated data? (relevant for access level "private")
Assumption: eDoc items migrated to PubMan get a "batch" owner (i.e. a user of the institute, decided by the institute; recommended: librarian) --Ulla 11:19, 23 July 2008 (UTC)
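
The option b mapping described above could look roughly like this. The function name, the dictionary keys and the example URL are hypothetical; only the eDoc access levels and the two PubMan visibilities come from the discussion.

```python
# eDoc access levels as listed above.
EDOC_LEVELS = ("public", "MPG", "institute", "internal", "privileged users")


def map_component(edoc_visibility, edoc_file_url):
    """Map one eDoc file to a PubMan component description (option b)."""
    if edoc_visibility not in EDOC_LEVELS:
        raise ValueError("unknown eDoc access level: %r" % (edoc_visibility,))
    is_public = edoc_visibility == "public"
    return {
        # PubMan currently supports only "public" and "private".
        "visibility": "public" if is_public else "private",
        # The original eDoc visibility is kept in the component metadata.
        "edoc_visibility": edoc_visibility,
        # Non-public full texts additionally get a locator back to eDoc.
        "locator": None if is_public else edoc_file_url,
    }


# Placeholder URL, for illustration only.
component = map_component("MPG", "https://edoc.example.org/file/1")
```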

Future ideas on controlled vocabulary

See: Control_of_Named_Entities/eDoc_HowTo

After migration

  • eDoc user accounts will be de-activated (i.e. privileges will be de-activated)
  • eDoc URIs will resolve to PIDs assigned during ingestion on PubMan, i.e. eSciDoc PIDs resolve to eDocURIs and eSciDocIDs.

Scenario "double maintenance"

Institutes working with 2 systems (eDoc for productive usage plus entering "real" data on PubMan, for test purposes):

  • Submit reference on eDoc
  • Enter the same reference on PubMan and provide the eDoc ID

Maybe this can actually be done automatically? --Natasa 10:51, 19 August 2008 (UTC)

  • When migrating the eDoc data to PubMan, we migrate only eDoc IDs which are not referenced on PubMan (duplicate checking based on eDoc ID)

See previous comments: if eDoc records are marked accordingly, there would be no need for duplicate checking. --Natasa 10:51, 19 August 2008 (UTC)
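
If the check based on eDoc IDs is used after all, it amounts to a simple set difference. A minimal sketch, assuming PubMan items are available as dictionaries with an optional `edoc_id` key (a hypothetical layout, as is the function name):

```python
def edoc_ids_to_migrate(edoc_ids, pubman_items):
    """Return the eDoc IDs that no PubMan item references yet."""
    # Collect the eDoc IDs already provided on PubMan items.
    referenced = {item["edoc_id"] for item in pubman_items if item.get("edoc_id")}
    # Keep the original order of the eDoc IDs.
    return [edoc_id for edoc_id in edoc_ids if edoc_id not in referenced]


pubman_items = [
    {"title": "Entered on both systems", "edoc_id": "1001"},
    {"title": "PubMan original item"},  # no eDoc ID
]
to_migrate = edoc_ids_to_migrate(["1001", "1002", "1003"], pubman_items)
```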

All in all, I really doubt the effectiveness of working with 2 systems and the same items at the same time on PubMan and eDoc. If an item is originally entered on eDoc and then created on PubMan, which one would be correct in case of modification? Maybe we should think of an "either-or" workflow, i.e. users enter items either only on eDoc or only on PubMan. In that case, moving PubMan records back to eDoc for the sake of the YB would at least be minimized. --Natasa 10:51, 19 August 2008 (UTC)

To clarify: the idea was in fact that a single item is maintained either only on eDoc or only on PubMan. Maybe we can actually do the splitting by eDoc collections? --Natasa 11:00, 19 August 2008 (UTC)

Related to the YB as well: how often are the YB rules changed? Do we change them each year? If so, maybe one can validate the PubMan item for the YB before putting it on eDoc for YB inclusion (i.e. a new validation schema will be necessary in this case), to prevent the need to modify the item on eDoc and end up with discrepancies in the data. --Natasa 10:51, 19 August 2008 (UTC)

Queries to match names

  • Placed on CoLab, so as not to lose them
    • The query returns the number of possible authors, and the number of entries in docs.

Those with more entries may possibly have different name variants (for first names).

select substr(p2.uml_idx,1,position(',' in p2.uml_idx)), count(*) from people p2 where p2.col=73 and p2.rm is null and archivalgrp(p2.grp)=1 group by substr(p2.uml_idx,1,position(',' in p2.uml_idx)) order by 2 desc

    • The query returns all mpg authors that match the uml_idx criteria above (extended to mpg only)

select distinct, p1.fname, substr(p1.uml_idx,1,position(',' in p1.uml_idx)) from people p1 where substr(p1.uml_idx,1,position(',' in p1.uml_idx)) in ( select substr(p2.uml_idx,1,position(',' in p2.uml_idx)) from people p2 where p2.col=73 and p2.rm is null and archivalgrp(p2.grp)=1 and p2.mpgpeople=1 group by substr(p2.uml_idx,1,position(',' in p2.uml_idx)) ) and p1.mpgpeople=1 and p1.rm is null and p1.col=73 and archivalgrp(p1.grp)=1 order by 3, 1, 2 asc
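
For readers less familiar with the SQL, the grouping key that both queries use (everything in `uml_idx` up to and including the first comma) can be reproduced in Python. This is only an illustration on in-memory strings, assuming `uml_idx` holds values of the form "Surname, Firstname"; the filters on `col`, `rm`, `grp` and `mpgpeople` are left out, and the helper names are made up.

```python
from collections import Counter


def surname_key(uml_idx):
    """Mimic substr(uml_idx, 1, position(',' in uml_idx))."""
    # str.find returns -1 when there is no comma, which yields '' here,
    # the same result the SQL expression gives for position() = 0.
    return uml_idx[: uml_idx.find(",") + 1]


def count_by_surname(uml_idx_values):
    """Count entries per surname key, most frequent first."""
    return Counter(surname_key(value) for value in uml_idx_values).most_common()


counts = count_by_surname(["Mueller, Hans", "Mueller, H.", "Schmidt, Anna"])
```

Keys with more than one entry are the candidates for differing first-name variants.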