Control of Named Entities/eDoc HowTo

From MPDLMediaWiki

Controlled Vocabulary

In eDoc the same creator or journal can occur several times in the system, because eDoc does not have a controlled vocabulary.


  • go through all eDoc creators and identify duplicate entries that should be treated as one author. For this, every creator in eDoc will get a unique ID, so that the IDs belonging to the same person can be taken into account when moving them to PubMan.
Has to be considered for prototype service for authors--Ulla 15:28, 15 July 2008 (UTC)
In addition, some matching procedures need to be developed that include author names and affiliation IDs, to make it easier to tell whether an entry is really one author with different name variants in eDoc or two different authors. --Natasa 10:55, 19 August 2008 (UTC)
TODO: Define procedure for matching author names to IDs and quality check by Karin. First tests will go without ID.
  • Mapping of eDoc "sourcecreatorfullname" or "seriescreatorfullname" to eSciDoc (in case we delete the element person-complete name); see MDS revision for R3
Talked to Michael. This is also possible without person-complete name, as there will be a solution that splits the complete name into first and last name. --Nicole 15:34, 28 July 2008 (UTC)
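
The matching procedure for author names is still to be defined (see the TODO above), so the following is only a rough sketch of one possible first pass, not the agreed method. It assumes that creators are available as (creator ID, full name, affiliation ID) tuples and that names are written as "Family, Given"; the function names and the key format are hypothetical. Entries that share a normalized name and an affiliation ID are proposed as candidates for the manual quality check.

```python
import unicodedata
from collections import defaultdict


def normalize_name(full_name: str) -> str:
    """Build a comparison key 'family, first initial' (lower case, no diacritics).

    Assumption: names are stored as 'Family, Given'.
    """
    # Decompose accented characters and drop the combining marks (ü -> u).
    decomposed = unicodedata.normalize("NFKD", full_name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    family, _, given = plain.partition(",")
    family = " ".join(family.lower().split())
    given = given.strip().lower()
    initial = given[0] if given else ""
    return f"{family}, {initial}"


def group_candidates(creators):
    """Group creator IDs that share a normalized name and an affiliation ID.

    `creators` is an iterable of (creator_id, full_name, affiliation_id).
    Only groups with more than one ID are returned; these are candidates
    for a manual check, not confirmed duplicates.
    """
    groups = defaultdict(list)
    for creator_id, full_name, affiliation_id in creators:
        groups[(normalize_name(full_name), affiliation_id)].append(creator_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Grouping on the affiliation ID as well as the name keeps two different people with the same family name and initial apart when they belong to different institutes, at the cost of missing authors who changed affiliation.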


  • Done:
  1. Excel file with journal names in eDoc
  2. run journals against SFX
  3. put journal names in DB in order to get hold of a controlled vocabulary for journal names from eDoc
  4. exported journal names and eDoc IDs from eDoc production and ran a procedure that matches them against the journal names in the SFX-enriched database. For each exported entry the eDoc IDs of the documents are matched automatically (manual work is not needed or is minimized).
  5. subject categories/subject subcategories are matched and "cleared" per sfxobjectid. Only 1 record is problematic: it could not be matched because of false data in subject category/subject subcategory. (The following rule is applied: one subject is combined from "n-th element of category" + " / " + "n-th element of subcategory".)
  6. tips/how-to instructions for manual checking
  7. managed to match 122953 eDoc IDs (Submitted, Released) with the original journal table (no matter whether the entry has an sfxid or not); at the time of matching there were 130137 Submitted or Released documents in eDoc.
  • ToDo's
  1. check whether entries that could not be found in SFX are really journal names. Some of them are combinations of journal, volume, and page.
  2. search for the "real" journal name and, if it exists, add the unmatched entry to it as an alternative name.
  3. search for the "wrong" journal name in eDoc and add the eDoc ID(s) to the "real" entry
  4. delete the "wrong" entry
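
The subject rule from step 5 of the Done list (n-th element of category + " / " + n-th element of subcategory) can be illustrated with a minimal sketch. It assumes that categories and subcategories arrive as delimited strings; the semicolon separator and the function name are assumptions, not taken from the SFX data. A record whose two lists differ in length is reported as an error, which is one plausible way a record with false data would surface.

```python
def combine_subjects(categories: str, subcategories: str, sep: str = ";") -> list:
    """Pair the n-th category with the n-th subcategory as 'category / subcategory'.

    Assumption: both fields are strings whose elements are separated by `sep`.
    """
    cats = [part.strip() for part in categories.split(sep)]
    subs = [part.strip() for part in subcategories.split(sep)]
    if len(cats) != len(subs):
        # The rule pairs elements by position, so unequal lengths cannot be combined.
        raise ValueError(
            f"{len(cats)} categories but {len(subs)} subcategories; check record manually"
        )
    return [f"{cat} / {sub}" for cat, sub in zip(cats, subs)]
```

For example, `combine_subjects("Physics; Chemistry", "Optics; Organic")` pairs the elements by position and yields one combined subject per pair.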

see discussion page for the documentation of Despoina's work