Control of Named Entities

This page collects information on the control of named entities in the context of PubMan, the eSciDoc Solution for publication data management. It contains general information on the envisioned PubMan services in this domain. Functional and technical specifications can be found on the page Service for Control of Named Entities.

Introduction
The control of named entities is important for managing and retrieving high-quality metadata, for keeping the data in the system consistent, and for providing the basis for good search results. In library and information science, the creation and maintenance of controlled named entities (so-called authority files or authority records) is well established and follows dedicated guidelines and workflows. Such authority records are usually maintained as separate records and linked to other records. An authority record normally contains, among other things, the authorized name of e.g. a person, name variants, additional information e.g. to distinguish the person from other persons with the same name, and information about relationships between the record and other records. Authority files establish and offer uniform access points, e.g. for persons, group together the various works of a person, and make consolidated indexes possible. Furthermore, additional information such as the name variants of a person has to be maintained and curated in only one record (the authority record), yet users can search under any name variant and retrieve all records of this person regardless of the variant used.
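
The structure of an authority record described above can be illustrated with a minimal sketch. The class name, field names and sample data are hypothetical and are not taken from PubMan or any existing authority file; the sketch only shows how name variants kept in one record give access to all linked publications.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class AuthorityRecord:
    """Hypothetical authority record; all field names are illustrative assumptions."""
    record_id: str
    authorized_name: str
    variants: List[str] = field(default_factory=list)
    note: str = ""  # additional information used to tell persons apart
    related_ids: List[str] = field(default_factory=list)  # links to other records

    def all_names(self) -> List[str]:
        return [self.authorized_name, *self.variants]


def build_name_index(records: List[AuthorityRecord]) -> Dict[str, Set[str]]:
    """Map every case-folded name form to the IDs of the records that carry it."""
    index: Dict[str, Set[str]] = {}
    for record in records:
        for name in record.all_names():
            index.setdefault(name.casefold(), set()).add(record.record_id)
    return index


def find_publications(query: str,
                      index: Dict[str, Set[str]],
                      publications: List[Tuple[str, str]]) -> List[str]:
    """Return titles of publications linked to any record matching the name form."""
    record_ids = index.get(query.casefold(), set())
    return [title for title, creator_id in publications if creator_id in record_ids]


# Invented sample data: two persons who share the variant "Smith, J."
records = [
    AuthorityRecord("p1", "Smith, Jane", ["Smith, J.", "Jane Smith"], note="physicist"),
    AuthorityRecord("p2", "Smith, John", ["Smith, J."], note="historian"),
]
publications = [("Paper A", "p1"), ("Paper B", "p1"), ("Paper C", "p2")]
index = build_name_index(records)
```

With this sample data, a search for the variant "Jane Smith" returns Paper A and Paper B, while the ambiguous form "Smith, J." matches both records. The latter case is where the disambiguating note becomes necessary.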

Scope
For PubMan, several metadata elements are candidates for the control of named entities and the generation of controlled lists. The information maintained for each of these metadata elements depends on various factors:
 * the kind of information that is needed to identify the relevant metadata value (e.g. during submission),
 * the kind of information that is needed for searching and browsing,
 * the source used (i.e. what information is actually offered),
 * the usage in the system,
 * the quantity of information that should be filled in by the users and should be maintained and stored in the system,
 * and finally the kind and quantity of information that is relevant within the scope of PubMan.

Controlled metadata records should mainly contain descriptive metadata. As mentioned above, the selection of potential descriptive elements depends on various factors: the source of the controlled metadata values, which values have to be filled in manually and which can be filled in via batch operations, the usage of the data, the quantity of information that should be stored and maintained in PubMan, and the question of which elements have a predefined controlled value list and which elements take freely definable values. The list of descriptive metadata elements should be extensible.

The publication item stored and maintained in PubMan should contain all important information needed to constitute a coherent item in itself. Publication items might be linked with the database of controlled named entities or with other (external) sources in order to enrich the publication item with additional information (e.g. rights statements for journals).

Benefits
The benefits of controlled named entities in the context of PubMan are as follows:

Search/retrieval: controlled named entities support accurate search results and allow their refinement.

Consolidated indexes/browsing: controlled named entities allow the generation of consolidated indexes and enable users to browse by e.g. person or organizational unit.

Metadata entry: controlled named entities facilitate the metadata entry for publication items. During submission the user may select a controlled named entity from a controlled list for certain metadata fields.

List of references: controlled lists enable the generation of reference lists (e.g. for authors or institutes).

Consistency: controlled named entities foster consistent data on the system.

Data enrichment: controlled named entities allow the enrichment of data with additional information (e.g. ISSN). Data enrichment is also relevant for export functionalities.

Batch operations: controlled named entities facilitate batch operations (e.g. switching the access status of all publication items containing a certain publisher from private to public).

Linking: controlled named entities allow linking from publication items to other sources (e.g. from a publication item to holding information via OpenURL).

Improved interoperability: controlled named entities enable better information integration either in eSciDoc itself or via other service providers which have harvested the PubMan data or provide meta-searches.

Add-ons: the handling of controlled named entities is important for certain functionalities and services offered by PubMan (e.g. web service for organizational units, creation of researcher pages).

Application models
In theory, there are various models for creating and maintaining controlled named entities. As a general remark, an external authority record will most likely not be available for every metadata element that is a candidate for controlled named entities (e.g. person names). This means that

a) controlled and uncontrolled named entities will coexist and/or

b) internally controlled named entities have to be created and maintained.

The application model selected for each PubMan service on controlled named entities will be stated in the respective functional specification.

External sources (import of complete authority system)

Critical factors:
 * Data of external sources are constantly updated and extended, so there is a risk of inconsistencies between the external sources and their local copies. This application model is therefore primarily practicable for data that are not subject to constant change (e.g. classifications).
 * Legal situation: it has to be clarified whether external data may be stored and maintained in PubMan.
 * Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
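
The last point can be sketched as follows. This is only an illustration under assumptions: the class names, the field names and the source name "ExampleJournalList" are invented, and an entry without provenance is treated here as internally created.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance of an imported record."""
    source: str   # name of the external authority file (assumed label)
    version: str  # version label or release date of the imported data


@dataclass
class ControlledEntry:
    """Hypothetical controlled entry; provenance is None for internal entries."""
    name: str
    provenance: Optional[Provenance] = None


def describe_origin(entry: ControlledEntry) -> str:
    """Build a display label that exposes where the entry comes from."""
    if entry.provenance is None:
        return f"{entry.name} [internal]"
    return f"{entry.name} [{entry.provenance.source}, version {entry.provenance.version}]"


imported = ControlledEntry("Nature", Provenance("ExampleJournalList", "2009-01"))
internal = ControlledEntry("Institute Working Papers")
```

A retrieval interface could show such a label next to each hit, so that users can judge the reliability of a value by its origin.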

External controlled metadata values one at a time (import of values, not of complete authority system, e.g. via web services)

Critical factors:
 * It has to be clarified whether an appropriate web service is provided for all relevant external sources (e.g. web service for PND?).
 * Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.

Build-up of controlled metadata values (within PubMan/eSciDoc and with internal QA process)

Critical factors:
 * A controlling/QA process has to be set up internally. It has to be clarified to what extent standardization efforts and national and international guidelines should be taken into account.
 * Creation and maintenance of controlled named entities is time-consuming and expensive and requires trained staff.
 * Licensing: it has to be clarified whether a special license is needed in Germany to build up databases for persons (e.g. controlled named entities for person names).

Referencing to external sources (e.g. via ID)

Critical factors:
 * Controlled named entities are not stored and maintained in the system. Only a reference (e.g. an ID) links to the external source where the entities are maintained. If the external source is unavailable, the values are not accessible.

Hybrid application models:

'''Initial import of external sources as start content. The data will be further maintained and extended in PubMan and merged with internally created controlled metadata values.'''

Critical factors:
 * It has to be clarified who the rights holder of the data is and whether storage and further editing are permitted.
 * It has to be decided whether data from external sources can/should be edited or not.
 * Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
 * Internally created controlled named entities should be marked to facilitate the further internal QA process.

'''Shared/combined use of regularly harvested external sources and internally created controlled named entities. This can be done either by integrating harvested external sources into the internally built controlled named entities or by downloading and integrating external authority file records one at a time (via web services).'''

Critical factors:
 * It has to be clarified who the rights holder of the data is and whether storage, maintenance and further editing are permitted.
 * Duplicate detection has to be ensured.
 * The handling and procedure for updating/regularly scheduled harvesting of external sources have to be specified.
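
One simple way to approach duplicate detection when harvested names are merged with internally created ones is to compare normalized keys. The sketch below is an assumption for illustration, not the method used by PubMan; the function names are hypothetical, and real authority data would need stronger matching (e.g. on identifiers or dates) than name strings alone.

```python
import re
import unicodedata
from typing import List, Tuple


def normalize_name(name: str) -> str:
    """Reduce a name to a comparison key without accents, case or punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punctuation = re.sub(r"[^\w\s]", " ", without_marks.casefold())
    return " ".join(no_punctuation.split())  # collapse runs of whitespace


def merge_harvest(internal: List[str], harvested: List[str]) -> Tuple[List[str], List[str]]:
    """Append harvested names that match no existing key; report the others."""
    seen = {normalize_name(name) for name in internal}
    merged = list(internal)
    duplicates = []
    for name in harvested:
        key = normalize_name(name)
        if key in seen:
            duplicates.append(name)  # candidate duplicate, left for manual QA
        else:
            seen.add(key)
            merged.append(name)
    return merged, duplicates
```

Names reported as duplicates are only candidates. Two different persons can share a normalized name, so a QA step would still have to confirm each match.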

Methods of gathering external sources

 * Access to web service interface of external sources and integration of selected records.
 * Import of external sources (e.g. as start content) and their integration into the system.
 * Regularly scheduled harvest of external sources and their integration into the system.

Management of controlled named entities
The management of controlled named entities depends, among other things, on the chosen application model and especially on the kind of data (external or internally created) that has to be administered. It also depends on the user group and the respective usage scenarios. The management of controlled named entities will be specified in detail in the functional specification of each PubMan service for controlled named entities. The listing below gives only a rough overview of the basic functionalities that have to be supported by the system:

For the creation and/or administration of controlled named entities at least the following functionalities have to be supported (depending on the chosen application model):
 * creation (only valid for internally built controlled metadata records), editing, searching and deletion/deactivation of controlled named entities
 * exporting of controlled named entities (e.g. as an XML or CSV file)
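
As an illustration of the export functionality, the following minimal sketch writes controlled named entities as CSV text. The column names, the dictionary keys and the "|" separator for name variants are assumptions made for this example and do not reflect an existing PubMan export format.

```python
import csv
import io
from typing import Dict, List


def export_csv(entries: List[Dict]) -> str:
    """Serialize entries to CSV text; name variants share one cell, joined by '|'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "authorized_name", "variants"])  # assumed column layout
    for entry in entries:
        writer.writerow([
            entry["id"],
            entry["authorized_name"],
            "|".join(entry["variants"]),
        ])
    return buffer.getvalue()


# Invented sample entry
sample = [{"id": "p1", "authorized_name": "Smith, Jane",
           "variants": ["Smith, J.", "Jane Smith"]}]
csv_text = export_csv(sample)
```

The `csv` module quotes any field that contains a comma, which matters here because person names in inverted form ("Smith, Jane") contain the delimiter.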

For the user (e.g. depositor) of the system at least the following functionalities have to be provided:
 * searching, displaying, and selection of controlled named entities.

Customization/Usage
There are several metadata elements which are candidates for controlled named entities. The controlled list of named entities will depend not only on the respective metadata element but also on other factors such as the collection or the user profile. Therefore, the system has to provide customization options on various levels.

Implementation
The MPDL developed an independent service for the control of named entities (CoNE).

Implementation details and the corresponding functionalities can be found here: CoNE