PubMan Duplicates


The PubMan solution should provide a reasonable mechanism for detecting and handling duplicates. This page collects the related use cases.


Assumption for the policy regarding duplicate handling:

Publications written in collaboration between several institutions within the MPG may appear as duplicates in the repository, because each of the institutions involved wants to manage the publication in its own context or wants the publication to be listed in connection with the Organization Search. Users are provided with functionality to run duplicate checks and, if desired, to eliminate the duplicates found, provided they hold the necessary rights (e.g. for contexts of other institutes).


Duplicates - publication items

  • Detect possible duplicate candidates during submission by matching metadata fields. The matching algorithm might use fields such as author names, publication year, title, source title, volume, and pages. The more fields are provided, the more likely a match will be found.
  • Detect possible duplicate candidates by matching fulltexts (similarity check).
  • Detect possible duplicate candidates as part of quality assurance processes (e.g. "run duplicate check on all items in a specific context or all items affiliated to a specific OU").
  • Batch processes for handling duplicate candidates (see the current handling on eDoc below).
    • Assumption: Users can handle duplicate candidates as long as they have modifying privileges for the contexts in which the items are stored. Example: if candidates are identified that are managed in contexts of another institute, the user would need modifying privileges for those contexts. A duplicate check across all contexts of all available organisational units could be considered (optionally, the user can restrict the contexts to be checked). Still, the final handling of candidates depends on modifying privileges in the respective contexts.
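
The metadata matching and the privilege assumption above can be sketched as follows. This is only an illustration: the record layout, the scoring rule (share of agreeing fields among those present in both records), the threshold, and all function names are assumptions, not part of PubMan.

```python
# Field names follow the list above; everything else here is hypothetical.
MATCH_FIELDS = ("authors", "year", "title", "source_title", "volume", "pages")


def normalize(value):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value).lower().split())


def match_score(submitted, existing, fields=MATCH_FIELDS):
    """Return (score, compared): the share of agreeing fields among those filled in both records."""
    compared = [f for f in fields if submitted.get(f) and existing.get(f)]
    if not compared:
        return 0.0, 0
    agreeing = sum(normalize(submitted[f]) == normalize(existing[f]) for f in compared)
    return agreeing / len(compared), len(compared)


def find_candidates(submitted, repository, threshold=0.75, min_fields=2):
    """Return (score, item id) pairs for likely duplicates, best score first."""
    hits = []
    for item in repository:
        score, compared = match_score(submitted, item)
        if compared >= min_fields and score >= threshold:
            hits.append((score, item["id"]))
    return sorted(hits, reverse=True)


def may_modify(item, user_contexts):
    """A candidate can only be handled if the user has modifying privileges for its context."""
    return item["context"] in user_contexts


repository = [
    {"id": "item:1", "context": "ctx-A", "authors": "Doe, J.", "year": 2009,
     "title": "Duplicate Detection in Repositories", "volume": "12"},
    {"id": "item:2", "context": "ctx-B", "authors": "Doe, J.", "year": 2009,
     "title": "Another Paper"},
]
submitted = {"authors": "Doe, J.", "year": 2009,
             "title": "duplicate detection  in repositories"}

candidates = find_candidates(submitted, repository)
```

Here only `item:1` is reported, because all three fields present in both records agree, while `item:2` agrees on two of three and falls below the assumed threshold. A user with privileges for `ctx-A` only could handle `item:1` but not an item stored in `ctx-B`.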

Duplicates - controlled entities (CoNE)

  • Detect possible duplicates of person names stored in CoNE during update or import (depends on the import feature for CoNE).
  • Duplicate check based on
    • CoNE identifiers
    • external identifiers (e.g. local IDs, ResearcherID, etc.)
    • what else?
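
An identifier-based check at import time might look like the sketch below. The record structure, the identifier type names (`cone`, `researcherid`, `local`), and the sample values are made up for illustration and do not reflect the actual CoNE data model.

```python
def shared_identifiers(a, b):
    """Return the identifier types on which two person records carry the same value."""
    return sorted(
        id_type
        for id_type, value in a["identifiers"].items()
        if value and b["identifiers"].get(id_type) == value
    )


def duplicates_on_import(incoming, existing_records):
    """List existing persons that share at least one identifier with the incoming record."""
    hits = []
    for record in existing_records:
        shared = shared_identifiers(incoming, record)
        if shared:
            hits.append((record["name"], shared))
    return hits


# Hypothetical sample data
existing = [
    {"name": "Doe, Jane", "identifiers": {"cone": "persons1", "researcherid": "X-0001-2009"}},
    {"name": "Doe, John", "identifiers": {"cone": "persons2"}},
]
incoming = {"name": "J. Doe", "identifiers": {"researcherid": "X-0001-2009", "local": "mpi-42"}}

import_hits = duplicates_on_import(incoming, existing)
```

The incoming record is flagged against "Doe, Jane" because both carry the same ResearcherID, even though the name strings differ.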

Duplicate Detection

How to identify duplicates?

Duplicate tool provided by Framework

eSciDoc Core services provide a tool for identifying duplicates, which is currently not configured. This tool can do the following:

  • check for exact match in defined metadata fields
  • compare fulltexts for similarity (the algorithm is already defined, but its details are not known). The system provides a list of items with potentially similar fulltexts, together with the probability of similarity. Constraint: only known/supported fulltext formats are considered.
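
Since the framework's similarity algorithm is not known, the following sketch substitutes a common stand-in, Jaccard similarity over word shingles, purely to illustrate what a ranked list of similar fulltexts could look like. It is not the eSciDoc implementation; the function names, shingle size, and threshold are assumptions.

```python
import re


def shingles(text, size=3):
    """Split text into overlapping word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b, size=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def similar_items(fulltext, corpus, threshold=0.5):
    """corpus maps item id to extracted text; return (id, score) pairs, best first."""
    scored = [(item_id, similarity(fulltext, text)) for item_id, text in corpus.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


corpus = {
    "item:1": "the quick brown fox sleeps",
    "item:2": "an entirely different text about repositories",
}
ranked = similar_items("the quick brown fox jumps", corpus)
```

The sketch presumes that plain text has already been extracted from the fulltext file, which matches the constraint that only supported formats can be considered.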

Status on eDoc

  • two items are considered as duplicates if they have an identifier of the same type with identical value
  • the duplicate check is triggered when
    • items are copied from virtual collection to archival collection (batch)
    • one item is copied from one virtual or archival collection to another archival collection (manual)
  • the duplicate check is not triggered when
    • items are moved from one archival collection to another one (batch)
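
The eDoc rules above can be written down as a small lookup. The function names and the tuple encoding are hypothetical; combinations that the list does not mention are deliberately left undefined (`None`) rather than guessed.

```python
def is_duplicate(item_a, item_b):
    """Two items are duplicates if they share an identifier of the same type with identical value."""
    return bool(set(item_a["identifiers"]) & set(item_b["identifiers"]))


# (operation, source collection, target collection, mode) -> is the check triggered?
TRIGGER_RULES = {
    ("copy", "virtual", "archival", "batch"): True,
    ("copy", "virtual", "archival", "manual"): True,
    ("copy", "archival", "archival", "manual"): True,
    ("move", "archival", "archival", "batch"): False,
}


def check_triggered(operation, source, target, mode):
    """Return True/False for the documented cases, None where the behaviour is not stated."""
    return TRIGGER_RULES.get((operation, source, target, mode))


# Hypothetical identifiers for illustration
a = {"identifiers": [("doi", "10.1000/example"), ("isbn", "000")]}
b = {"identifiers": [("doi", "10.1000/example")]}
c = {"identifiers": [("isi", "10.1000/example")]}
```

Note that `a` and `c` are not duplicates under this rule: the value is identical, but the identifier type differs.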


  • offers an action "Find duplicates" via its References menu
  • the action can be set up via user preferences, where the user can change which set of fields is considered for the detection and whether "spaces and punctuation should be ignored"
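
A configurable "Find duplicates" action of this kind could be approximated as below. The field names, the lowercasing step, and the function names are assumptions made for the sketch; only the two preferences (field set, ignoring spaces and punctuation) come from the description above.

```python
import string
from collections import defaultdict


def normalize(value, ignore_spaces_and_punctuation=True):
    """Lowercase the value and optionally drop all whitespace and ASCII punctuation."""
    text = str(value).lower()
    if ignore_spaces_and_punctuation:
        text = "".join(
            ch for ch in text
            if ch not in string.punctuation and not ch.isspace()
        )
    return text


def find_duplicates(references, fields=("author", "year", "title"),
                    ignore_spaces_and_punctuation=True):
    """Group references whose selected fields are equal after normalization."""
    groups = defaultdict(list)
    for index, ref in enumerate(references):
        key = tuple(
            normalize(ref.get(f, ""), ignore_spaces_and_punctuation) for f in fields
        )
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


references = [
    {"author": "Doe, J.", "year": 2009, "title": "On Duplicates"},
    {"author": "Doe J", "year": "2009", "title": "On  duplicates."},
    {"author": "Roe, R.", "year": 2010, "title": "Other"},
]

loose = find_duplicates(references)
strict = find_duplicates(references, ignore_spaces_and_punctuation=False)
```

With the option enabled, the first two references fall into one group; with it disabled, the differing punctuation and spacing keep them apart.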

Duplicate Handling

How to proceed if duplicates have been detected?

Status on eDoc

  • if two items are considered as duplicates, the system asks the user if
    1. the process should be canceled
    2. a new metadata version should be created for the target item
    3. a new intellectual version should be created for the target item