Talk:EDoc to PubMan migration

From MPDLMediaWiki

Setting up EA institute on PubMan and preparing migration

  • Prepare contexts with validation rules
    • We will have 2 contexts, so that the items can be filtered in the workspace. The import will automatically release the eDoc data on PubMan. Only released eDoc records will be imported to PubMan. For the imported data we suggest a dedicated import user, as we did for MPIPL.
  • Ask the institute which workflow should be set up for the contexts
    • Simple publication and modification workflow, where the depositor can immediately release an item.
    • Extended publication and modification workflow, where the depositor can only submit the item and the moderator has to check and can then release it.
  • Create user accounts on PubMan
  • Create organizational units
    • Mapping of current eDoc affiliations to PubMan org units
    • Ask if affiliations have changed (history of affiliation). The affiliations in the eDoc metadata have to be the same as the affiliations in the affiliation admin interface. If they don't match, we have to check with the institute what to do.
    • Special case for MPIPL: they have no affiliations, but their data is organized collection-wise by departments. This has to be mapped to OUs in PubMan.
  • Check data on eDoc (as done for MPIPL) and consider special handling, e.g.
    • what to do with <sub> and <sup>?
      • will be imported, but there will be no HTML processing. Check transformation or alternative notation possibilities for tags.
    • What to do with markuptitle, markuptype and markupabstract? Check eDoc query and ask concerned institutes for "help".
    • what to do with values/information filled in not-mapped metadata elements?
      • All eDoc data will be imported as ZIM Export XML, and it has to be decided how it will be stored with a PubItem.
After consulting FIZ: we will store the XML as an additional component (file) with content-category "original-metadata". This file should not be visible from PubMan components. --Natasa 11:58, 8 September 2008 (UTC)
  • Delete test data


Institutes working with PubMan need to export their PubMan data to eDoc to generate the Yearbook.

  • Create eDoc collection "Yearbook YYYY" on eDoc
  • Enable export from PubMan to eDoc. Only PubMan metadata have to be exported, no components.
    • Only PubMan original items will be exported (no eDoc ID). If the user changes imported data on PubMan, s/he also has to change it in eDoc.
  • Create a container for the yearbook records in PubMan (search, mark for container and then export to yearbook/eDoc). Items which are in a container will be marked as exported for yearbook XY. Records can be selected/deselected from the container. The container will be released when it is ready for the export to eDoc.
  • Containers can be created, viewed and edited by moderators (privileges have to be checked; will be done by Natasa).
  • Users have to be aware that the export of PubMan data for the YB to eDoc should be done only once, to avoid duplicates. PubMan data imported to eDoc will be put in one virtual collection and can then be copied to an archival collection on eDoc if needed (not recommended).

The PubMan ID will be kept as local ID in eDoc and as URI to PubMan.

eDoc Affiliations versus PubMan Organizational Units

Concept on how to transfer eDoc affiliations to PubMan OUs

  • What to do if the affiliation which exists in eDoc is now another organization? Shall both be created in PubMan even though we can't handle the history of affiliations?
My proposal is first to check how many such affiliations there are and decide on a case-by-case basis. I believe there are not too many. In addition, note that the mapping for all affiliations from eDoc to PubMan should be kept and known, as it is important for the Yb processing later on. --Natasa 07:48, 19 August 2008 (UTC)
See above preparations for the migration.
Another idea would be to add the old and the new affiliation (if known) and take care of the relations between the two (or more) affiliations at a later stage. --Nicole 11:16, 8 September 2008 (UTC)
  • Consider various migration options, e.g.
    1. Do not add any affiliation
    2. Add selected affiliation to each
    3. Add all affiliations to the first creator only
Would opt for adding all MPG affiliations attached to the eDoc item to all MPG authors (it is easier for the moderators to remove them afterwards than to add them) --Natasa 07:48, 19 August 2008 (UTC) AGREED.
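
The agreed option could be sketched roughly as follows. This is only an illustration: the `Creator` class, its fields and the OU names are hypothetical and not part of the eDoc or PubMan data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Creator:
    # Hypothetical, simplified creator record (not the real PubMan model).
    name: str
    is_mpg: bool
    affiliations: List[str] = field(default_factory=list)


def assign_affiliations(creators, item_affiliations):
    """Attach every MPG affiliation of the eDoc item to every MPG creator."""
    for creator in creators:
        if creator.is_mpg:
            # Copy the list so moderators can later prune each creator separately.
            creator.affiliations = list(item_affiliations)
    return creators


creators = [
    Creator("Author A", is_mpg=True),
    Creator("Author B", is_mpg=False),
    Creator("Author C", is_mpg=True),
]
assign_affiliations(creators, ["OU-1", "OU-2"])
```

Non-MPG creators are left without affiliation here; moderators would then remove the affiliations that do not apply to a given MPG author.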



  • Current assumption: There is no need to move eDoc users to PubMan
Wouldn't it be a reasonable feature to allow [edoc/escidoc] users to access "their" items under "my items"? --Inga 14:18, 6 August 2008 (UTC)
Talked to Karin today. She thinks that there is no need to migrate users. --Nicole 13:32, 7 August 2008 (UTC)

Item Versions

  • eDoc item history will not be moved; thus the last eDoc version is imported to PubMan by creating the first version of a new object
Question: Do we need to check if the last eDoc version is in any specific state (e.g. "released")? --Inga 19:49, 24 August 2008 (UTC)
Only eDoc data in state released will be imported to PubMan.
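
A minimal sketch of this selection rule, assuming each eDoc version is available as a dictionary with the hypothetical keys `edoc_id`, `version`, `state` and `metadata`. It also assumes that a record whose last version is not released is skipped entirely, which the discussion above does not settle.

```python
def select_for_import(edoc_versions):
    """Pick the last version of each eDoc record and import it as version 1,
    but only if that last version is in state "released"."""
    latest = {}
    for rec in edoc_versions:
        current = latest.get(rec["edoc_id"])
        if current is None or rec["version"] > current["version"]:
            latest[rec["edoc_id"]] = rec
    return [
        # The eDoc history is dropped: the new PubMan object starts at version 1.
        {"edoc_id": rec["edoc_id"], "version": 1, "metadata": rec["metadata"]}
        for rec in latest.values()
        if rec["state"] == "released"
    ]


versions = [
    {"edoc_id": 1, "version": 1, "state": "released", "metadata": "old"},
    {"edoc_id": 1, "version": 2, "state": "released", "metadata": "new"},
    {"edoc_id": 2, "version": 1, "state": "pending", "metadata": "draft"},
]
imported = select_for_import(versions)
```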

Mapping Data: eDoc <=> PubMan

eDoc Identifier

  • search for eDoc IDs in PubMan will be possible
  • Duplicate handling? "Clever" duplicate detection and handling for the import of the eDoc data to PubMan, due to the collection policy in eDoc. See also CoLab page on duplicates.
At first, duplicate detection should be based on eDoc IDs. --Ulla 10:22, 18 July 2008 (UTC)
If an eDoc record is moved to PubMan, then it can be marked as moved on eDoc; therefore there is no need to check duplicates during import. In addition, as on eDoc there are relations between potential duplicate records, this can be resolved in the same manner, i.e. duplicates can also be marked as removed. So we would not check duplicates during import. --Natasa 07:50, 19 August 2008 (UTC)
One more issue: we need to check whether the new identifiers also need to be put on the eDoc record, or whether it is sufficient to have the eDoc identifier in the PubMan record. Moving the identifier to the eDoc record can enable linking directly to the PubMan record for modification. --Natasa 10:37, 19 August 2008 (UTC)
Conclusion: The duplication problem has to be solved on eDoc. We have to check if we can help by providing a report on similar titles, authors and identifiers (to be premised).
Would opt for the first alternative. --Natasa 07:50, 19 August 2008 (UTC)
I think this is not meant as a choice; ideally all these URLs will remain valid. -- Robert 07:57, 19 August 2008 (UTC)
True, got it. What was meant by the first option is to include it as Item identifier in the Item metadata. Sure, all other URLs should remain valid for resolution in future as well. --Natasa 08:11, 19 August 2008 (UTC)

Content categories

  • Identification and assignment of content categories to each file on eDoc. Decision: all imported files will be of category "any full text".

Access level for components/files

Mapping of eDoc access levels (public, MPG, institute, internal, privileged users) to PubMan access levels (currently supported: public, private)

  • Option a (preferred) = access level to org units is provided. In this case mapping can be offered for access levels public, MPG, institute and internal.
  • Option b (realistic) = no access level to org units is provided
    • For now, option b will be realized: we will migrate all full texts, set "private" visibility on all full texts except the public ones, create users with privileged view, set locators to eDoc (for non-public full texts), and also store the eDoc file visibility on PubMan in the component metadata (MD).
  • Who is owner of migrated data? (relevant for access level "private")
Assumption: eDoc items migrated to PubMan get a "batch" owner (i.e. user of institute, decided by institute, recommended: librarian) --Ulla 11:19, 23 July 2008 (UTC)
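
Option b can be illustrated with a small mapping function. This is a sketch under assumptions: the dictionary keys (`access_level`, `edoc_url`, `edoc_visibility`, `locator`) and the example URL are hypothetical and do not reflect the actual PubMan component schema.

```python
# eDoc access levels as listed above.
EDOC_LEVELS = {"public", "MPG", "institute", "internal", "privileged users"}


def map_component(edoc_file):
    """Map one eDoc file to a PubMan component description (option b)."""
    level = edoc_file["access_level"]
    if level not in EDOC_LEVELS:
        raise ValueError("unknown eDoc access level: %r" % level)
    component = {
        # PubMan currently supports only public and private.
        "visibility": "public" if level == "public" else "private",
        # Keep the original eDoc visibility in the component metadata.
        "metadata": {"edoc_visibility": level},
    }
    if level != "public":
        # Non-public full texts additionally get a locator back to eDoc.
        component["locator"] = edoc_file["edoc_url"]
    return component


restricted = map_component(
    {"access_level": "institute", "edoc_url": "https://edoc.example.org/file/1"}
)
public = map_component(
    {"access_level": "public", "edoc_url": "https://edoc.example.org/file/2"}
)
```

Because the original eDoc level is preserved in the metadata, a finer mapping as in option a could be applied later without going back to eDoc.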

Future ideas on controlled vocabulary

See: Control_of_Named_Entities/eDoc_HowTo

After migration

  • eDoc user accounts will be de-activated (i.e. privileges will be de-activated)
  • eDoc URIs will resolve to PIDs assigned during ingestion on PubMan, i.e. eSciDoc PIDs resolve to eDoc URIs and eSciDoc IDs.

Scenario "double maintenance"

Institutes working with 2 systems (eDoc for productive usage plus entering "real" data on PubMan, for test purposes):

  • submit reference on eDoc
  • enter same reference on PubMan and provide the eDoc ID

Maybe this can actually be done automatically? --Natasa 10:51, 19 August 2008 (UTC)

  • when migrating the eDoc data to PubMan, we migrate only eDoc IDs which are not referenced on PubMan (duplicate checking based on eDoc ID)

See previous comments: if eDoc records are marked accordingly, there would be no need for duplicate checking. --Natasa 10:51, 19 August 2008 (UTC)
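
If the ID-based check is used after all, it could look roughly like this. The item structure and the `edoc_id` key are assumptions made for illustration only.

```python
def ids_to_migrate(edoc_ids, pubman_items):
    """Return the eDoc IDs that are not yet referenced by any PubMan item."""
    referenced = {
        item["edoc_id"] for item in pubman_items if item.get("edoc_id") is not None
    }
    # Keep the original order of the eDoc IDs.
    return [edoc_id for edoc_id in edoc_ids if edoc_id not in referenced]


pubman_items = [
    {"title": "Entered on both systems", "edoc_id": 1002},
    {"title": "PubMan original item"},  # no eDoc ID
]
to_migrate = ids_to_migrate([1001, 1002, 1003], pubman_items)
```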

All in all, I really doubt the effectiveness of working with 2 systems and the same items at the same time on PubMan and eDoc. If an item is originally entered on eDoc and then created on PubMan, which one (in case of modification) would be correct? Maybe we should think of an "either-or" workflow, i.e. users enter items either only on eDoc or only on PubMan. In that case, moving PubMan records back to eDoc for the sake of the YB would at least be minimized. --Natasa 10:51, 19 August 2008 (UTC)

To clarify: the idea was in fact that a single item is maintained either only on eDoc or only on PubMan. Maybe we can actually do the splitting by eDoc collections? --Natasa 11:00, 19 August 2008 (UTC)

Related to YB as well: how often are Yb rules changed? Do we change them each year? If so, maybe before putting a PubMan item on eDoc for Yb inclusion one can actually validate the PubMan item for the Yb (i.e. a new validation schema will be necessary in this case), to prevent the need to modify the item on eDoc and have discrepancies in the data. --Natasa 10:51, 19 August 2008 (UTC)

Queries to match names

  • Placed on CoLab, so as not to lose them
    • The query returns the number of possible authors, and the number of entries in docs.

Those with more entries may possibly have different name variants (for first names).

select substr(p2.uml_idx, 1, position(',' in p2.uml_idx)), count(*)
from people p2
where p2.col = 73
  and p2.rm is null
  and archivalgrp(p2.grp) = 1
group by substr(p2.uml_idx, 1, position(',' in p2.uml_idx))
order by 2 desc

    • The query returns all mpg authors that match the uml_idx criteria above (extended to mpg only)

select distinct, p1.fname, substr(p1.uml_idx, 1, position(',' in p1.uml_idx))
from people p1
where substr(p1.uml_idx, 1, position(',' in p1.uml_idx)) in (
    select substr(p2.uml_idx, 1, position(',' in p2.uml_idx))
    from people p2
    where p2.col = 73
      and p2.rm is null
      and archivalgrp(p2.grp) = 1
      and p2.mpgpeople = 1
    group by substr(p2.uml_idx, 1, position(',' in p2.uml_idx))
  )
  and p1.mpgpeople = 1
  and p1.rm is null
  and p1.col = 73
  and archivalgrp(p1.grp) = 1
order by 3, 1, 2 asc
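
To make the grouping key of both queries easier to follow, here is a small Python sketch of what `substr(uml_idx, 1, position(',' in uml_idx))` computes. It assumes that `uml_idx` holds a normalized name of the form "surname, first name"; the sample names are invented.

```python
from collections import Counter


def surname_key(uml_idx):
    """Mirror substr(s, 1, position(',' in s)): everything up to and
    including the first comma, or an empty string if there is no comma."""
    pos = uml_idx.find(",") + 1  # find() returns -1 when no comma is present
    return uml_idx[:pos]


def count_by_surname(names):
    """Count entries per surname key, most frequent first (like 'order by 2 desc')."""
    return Counter(surname_key(name) for name in names).most_common()


sample = ["meier, a", "meier, anna", "schmidt, b"]
counts = count_by_surname(sample)
```

Surname keys with more than one entry are the candidates for first-name variants of the same author.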