Difference between revisions of "Control of Named Entities"

From MPDLMediaWiki
Revision as of 12:19, 28 December 2007


The aim of this page is to collect information concerning control of named entities in the context of PubMan/eSciDoc. The page contains general information as well as functional and technical specifications of envisioned PubMan services in this domain.

Note for users maintaining this page:

Issues that are still work in progress or need to be discussed should be put on the discussion page first. As soon as something has been agreed, it will be moved to the main article page.


Introduction

Controlling named entities is essential for managing and retrieving high-quality metadata, keeping the data on the system consistent, and providing a sound basis for search. In library and information science, the creation and maintenance of controlled named entities (so-called authority files or authority records) is well established and follows dedicated guidelines and workflows. Authority records are usually maintained as separate records that are linked to other records. An authority record normally contains, among other things, the authorized name (e.g. of a person), name variants, additional information (e.g. to disambiguate the person from others with the same name), and information about relationships between the record and other records. Authority files establish uniform access points (e.g. for persons), group together the various works of a person, and support consolidated indexes. Furthermore, additional information such as the name variants of a person only has to be maintained and curated in one place (the authority record), yet users can search under any name variant and retrieve all records of that person regardless of which variant is used.
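The idea of one authority record serving as the access point for all name variants can be illustrated with a minimal sketch. The class `AuthorityRecord`, its field names, and the helper functions are assumptions made for this illustration only; they are not the eSciDoc or PubMan data model.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    """One controlled named entity (hypothetical structure, not the eSciDoc schema)."""
    record_id: str
    authorized_name: str
    name_variants: list[str] = field(default_factory=list)
    note: str = ""  # free text used to disambiguate entities with the same name


def normalize(name: str) -> str:
    """Fold case and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(name.casefold().split())


def build_variant_index(records: list[AuthorityRecord]) -> dict[str, set[str]]:
    """Map every normalized name form to the ids of the records that carry it."""
    index: dict[str, set[str]] = {}
    for record in records:
        for name in [record.authorized_name, *record.name_variants]:
            index.setdefault(normalize(name), set()).add(record.record_id)
    return index


# Illustrative data: variants are curated once, in the authority record.
records = [
    AuthorityRecord("p1", "Müller, Anna", ["Mueller, A.", "Muller, Anna"]),
    AuthorityRecord("p2", "Smith, John", ["Smith, J."]),
]
index = build_variant_index(records)

# A search under any variant leads to the same authority record.
print(index[normalize("mueller, a.")])
```

Publication items would then only store the record id, so adding a new variant to the authority record makes all linked items retrievable under that variant.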

Benefits

The benefits of controlled named entities in the context of PubMan are as follows:


Search/retrieval: controlled named entities guarantee accurate search results and allow their refinement.

Consolidated indexes/browsing: controlled named entities allow the generation of consolidated indexes and enable users to browse by e.g. person or organizational unit.

Metadata entry: controlled named entities facilitate the metadata entry for publication items. During submission the user may select a controlled named entity from a controlled list for certain metadata fields.

List of references: controlled lists enable the generation of reference lists (e.g. for authors or institutes).

Consistency: controlled named entities foster consistent data on the system.

Data enrichment: controlled named entities allow data to be enriched with additional information (e.g. an ISSN). Data enrichment is also relevant for export functionalities.

Batch operations: controlled named entities facilitate batch operations (e.g. switching the access status of all publication items with a certain publisher from private to public).

Linking: controlled named entities make it possible to link from publication items to other sources (e.g. from a publication item to holdings information via OpenURL).

Improved interoperability: controlled named entities enable better information integration either in eSciDoc itself or via other service providers which have harvested the PubMan data or provide meta-searches.

Add-ons: the handling of controlled named entities is important for certain functionalities and services offered by PubMan (e.g. web service for organizational units, creation of researcher pages).


Application models

In theory, there are various models for creating and maintaining controlled named entities. As a general remark, it is very likely that an external authority record will not be available for every metadata element that is a candidate for controlled named entities (e.g. person names). This means that a) controlled and uncontrolled named entities will coexist and/or b) internally controlled named entities have to be created and maintained.

The application model selected for each PubMan service on controlled named entities will be stated in the respective functional specification.


External sources (import of complete authority system)

Critical factors:

  • Data from external sources are constantly updated and extended, so there is a risk of inconsistencies between the external sources and local copies. This application model is therefore primarily practicable for data that are not subject to constant change (e.g. classifications).
  • Legal situation: it has to be clarified whether external data may be stored and maintained in PubMan.
  • Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.


External controlled metadata values one at a time (import of values, not of complete authority system, e.g. via web services)

Critical factors:

  • It has to be clarified whether an appropriate web service is provided for all relevant external sources (e.g. is there a web service for PND?).
  • Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.


Build-up of controlled metadata values (within PubMan/eSciDoc and with internal QA process)

Critical factors:

  • A controlling/QA process has to be set up internally. It has to be clarified to what extent standardization efforts and national and international guidelines should be taken into account.
  • Creation and maintenance of controlled named entities is time-consuming and expensive, and requires trained staff.
  • Licensing: it has to be clarified whether a special license is needed in Germany to build up databases of persons (e.g. controlled named entities for person names).


Referencing to external sources (e.g. via ID)

Critical factors:

  • Controlled named entities are not stored and maintained in the system. Only a reference (e.g. an ID) links to the external source where the entities are maintained. If the external source is not available, the values are not accessible.


Hybrid application models:


Initial import of external sources as start content. The data are then further maintained and extended in PubMan and merged with internally created controlled metadata values.

Critical factors:

  • It has to be clarified who the rights holder of the data is and whether storage and further editing are permitted.
  • It has to be decided whether data from external sources can/should be edited or not.
  • Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
  • Internally created controlled named entities should be marked to facilitate the further internal QA process.


Shared/combined use of regularly harvested external sources and internally created controlled named entities. This can be done either by integrating harvested external sources into the internally built controlled named entities or by downloading and integrating external authority file records one at a time (via web services).

Critical factors:

  • It has to be clarified who the rights holder of the data is and whether storage, maintenance and further editing are permitted.
  • Duplicate detection has to be ensured.
  • The handling and procedure for updating and for regularly scheduled harvests of external sources have to be specified.


Methods of gathering external sources

  • Access to web service interface of external sources and integration of selected records.
  • Import of external sources (e.g. as start content) and their integration into the system.
  • Regularly scheduled harvest of external sources and their integration into the system.


Management of controlled named entities

The management of controlled named entities depends, among other things, on the chosen application model and especially on the kind of data (external or internally created) that has to be administered. It also depends on the user group and the respective usage scenarios. The management of controlled named entities will be specified in detail in the functional specification of each PubMan service for controlled named entities. The listing below gives only a rough overview of the basic functionalities that the system has to support:

For the creation and/or administration of controlled named entities at least the following functionalities have to be supported (depending on the chosen application model):

  • creation (only valid for internally built controlled metadata records), editing, and deactivation of controlled named entities
  • exporting of controlled named entities (e.g. as XML or csv file)
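As a rough illustration of the administrative functions listed above, the sketch below deactivates an entity (rather than deleting it, so existing links keep working) and exports the list as CSV and XML. The dictionary layout, the column names, and the XML element names are assumptions for this example, not a format defined by PubMan.

```python
import csv
import io
import xml.etree.ElementTree as ET


def deactivate(entities: list[dict], entity_id: str) -> None:
    """Mark an entity as inactive instead of removing it."""
    for entity in entities:
        if entity["id"] == entity_id:
            entity["active"] = False


def export_csv(entities: list[dict]) -> str:
    """Serialize the entities to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "name", "active"], lineterminator="\n")
    writer.writeheader()
    for entity in entities:
        writer.writerow({key: entity[key] for key in ("id", "name", "active")})
    return buffer.getvalue()


def export_xml(entities: list[dict]) -> str:
    """Serialize the entities to a simple XML document (hypothetical element names)."""
    root = ET.Element("entities")
    for entity in entities:
        node = ET.SubElement(
            root, "entity", id=entity["id"], active=str(entity["active"]).lower()
        )
        ET.SubElement(node, "name").text = entity["name"]
    return ET.tostring(root, encoding="unicode")


# Illustrative data only.
entities = [
    {"id": "j1", "name": "Journal of the ACM", "active": True},
    {"id": "j2", "name": "Obsolete Journal", "active": True},
]
deactivate(entities, "j2")
csv_text = export_csv(entities)
xml_text = export_xml(entities)
```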

For the user (e.g. depositor) of the system at least the following functionalities have to be provided:

  • searching, displaying, and selecting controlled named entities.

Customization/Usage

There are several metadata elements which are candidates for controlled named entities. The controlled list of named entities will depend not only on the respective metadata element but also on other factors such as the collection or the user profile. The system therefore has to provide customization options on various levels.

First services

Core service "organizational unit handler"

The eSciDoc system includes a core service, the "organizational unit handler", provided by FIZ Karlsruhe. This service manages the organisational units for the eSciDoc system. In future, this core service might be extended by an additional service for named entity control for organisational units, in order to track and manage further information needed for organisational units.

The basic descriptive elements which will be covered by the core service are:

  • Name

The name of the organization, including translations. (Translations need a respective language flag.)

Cataloging rules are necessary, i.e. the full network path should be visible in the name of the org unit.

  • Alternative name

Any alternative name or abbreviation used for the organization

  • City

The city where the organization is located

  • Country

The country where the organization is located

  • Type

Type of organization, i.e. institution, institute, department, group, research unit, project, sub-project, research school

  • Time period

Indicates the time period in which the organisation was active.

  • Relations
    • current hierarchies and network relations, e.g. sub-units

Remark Traugott: hierarchical relations between org units might be kept sufficiently within the title/name element, with the sequence from higher to lower.

    • historical dependencies, i.e. successor-predecessor
  • Identifier

URI, eSciDoc Identifier for the organisation. In addition, other identifiers can be kept.

  • MPS-Section(?)

Affiliation of the organisational unit to one of the three sections

To be checked: if this is needed only for statistical reports, we might consider keeping a separate source (section-org URI) and retrieving the data only when needed.

Ongoing work on the core service definition: see the Discussion page on the functional specification for org unit management.
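The descriptive elements listed above, together with the cataloging rule that the full path should be visible in the name, can be sketched as a small data structure. The class `OrgUnit`, its attribute names, and the " / " separator are assumptions for illustration; the actual core service is defined by FIZ Karlsruhe and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OrgUnit:
    """Hypothetical organizational unit record, loosely following the element list."""
    identifier: str
    name: str
    unit_type: str  # e.g. "institution", "institute", "department", "group"
    parent: Optional["OrgUnit"] = None
    alternative_names: list[str] = field(default_factory=list)
    city: str = ""
    country: str = ""
    predecessor_ids: list[str] = field(default_factory=list)  # historical dependencies

    def path_name(self, separator: str = " / ") -> str:
        """Build the full name from the highest unit down to this one."""
        parts = []
        unit: Optional[OrgUnit] = self
        while unit is not None:
            parts.append(unit.name)
            unit = unit.parent
        return separator.join(reversed(parts))


# Illustrative hierarchy with made-up names.
society = OrgUnit("ou-1", "Example Society", "institution")
institute = OrgUnit("ou-2", "Institute A", "institute", parent=society)
group = OrgUnit("ou-3", "Group B", "group", parent=institute)

print(group.path_name())
```

Deriving the path from the parent relation, rather than storing it in the name, is one possible answer to the remark on keeping hierarchies within the name element; storing it directly would be simpler but has to be updated whenever a higher unit is renamed.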

Prototype service for controlled named entities - journal names

To better understand the issues of controlled named entities for a specific application, we decided to start with a prototype service for PubMan on controlled named entities.

Stages of prototyping:

  1. select an authority file (corporate bodies, journals, authors) and available external source
  2. create (import) data locally into an authority file from a selected source
  3. implement the referencing from the PubMan edit interface (to start with, let the authority file grow automatically when no reference to an existing entry is made)
  4. create very simple viewer/editor for the authority file data
  5. get feedback from potential pilot users
  6. extend the prototype with another authority file and repeat steps 2-5
  7. modify/add functionalities based on the functional and technical feedback
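Stage 3 can be read as a "resolve or create" operation: when the user enters a value that matches no existing entry, a new entry is added to the authority file and flagged for later review. The sketch below assumes this reading; the class `AuthorityFile`, the `local-` id prefix, and the `reviewed` flag are invented for illustration.

```python
class AuthorityFile:
    """Minimal in-memory authority file that grows when no entry matches."""

    def __init__(self) -> None:
        self._by_name: dict[str, dict] = {}
        self._next_id = 1

    def resolve(self, entered_name: str) -> dict:
        """Return the matching entry, creating an unreviewed one if none exists."""
        key = " ".join(entered_name.casefold().split())
        entry = self._by_name.get(key)
        if entry is None:
            entry = {
                "id": f"local-{self._next_id}",
                "name": entered_name.strip(),
                "reviewed": False,  # marks the entry for the internal QA process
            }
            self._next_id += 1
            self._by_name[key] = entry
        return entry

    def unreviewed(self) -> list[dict]:
        """Entries that were created automatically and still await QA."""
        return [e for e in self._by_name.values() if not e["reviewed"]]


journals = AuthorityFile()
first = journals.resolve("Journal of the ACM")
again = journals.resolve("  journal of the acm ")  # same entry, no duplicate created
```

Such automatic growth makes the prototype usable from the start, at the cost of a QA backlog that the simple viewer/editor from stage 4 would have to support.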

Please see work in progress on Talk:ControlledVocab

In selecting the descriptive metadata, the main focus is on the minimum level of information needed to disambiguate entities. The list of descriptive metadata elements can be extended with new elements.

Metadata elements:

  • Journal title [1]

The name of the journal (e.g. "Journal of the ACM")

  • Alternative title [0-n]

Any alternative name or abbreviation of the journal

Remark Inga: Tagging of abbreviations as such? Indicating the origin of abbreviation if known?
  • Publisher [0-1?]

The name of the institution that publishes the journal

  • Identifier [0-n]

Any external identifier (e.g. ISSN, EZB-ID, ZDB-ID)

Remark Traugott: Schema has to be indicated
Remark Inga: eISSN is no specific schema and therefore has been deleted
  • Locator [0-1?]

Locator of the authority file source

Question Inga: Do we mean a URL pointing to the record?
  • Rights [0-n]

Statement on open access availability

  • Subject [0-n]

Subject/domain field of the journal

Possible Relations:

  • isSuccessorOf
  • isPredecessorOf
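The element list and its cardinalities can be sketched as a record type with a small validation step. The class and function names are assumptions for illustration, and elements marked "[0-1?]" are modelled as optional since their cardinality is still open. Identifiers are stored together with their scheme, following the remark above; the ISSN check uses the standard ISSN check-digit rule (weights 8 to 2, modulo 11, "X" for 10).

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JournalRecord:
    """Hypothetical journal authority record following the element list above."""
    title: str                                                         # [1]
    alternative_titles: list[str] = field(default_factory=list)       # [0-n]
    publisher: Optional[str] = None                                    # [0-1?]
    identifiers: list[tuple[str, str]] = field(default_factory=list)  # [0-n] (scheme, value)
    locator: Optional[str] = None                                      # [0-1?]
    rights: list[str] = field(default_factory=list)                   # [0-n]
    subjects: list[str] = field(default_factory=list)                 # [0-n]
    successor_of: list[str] = field(default_factory=list)             # isSuccessorOf
    predecessor_of: list[str] = field(default_factory=list)           # isPredecessorOf


def is_valid_issn(issn: str) -> bool:
    """Check the NNNN-NNNC form and the modulo-11 check digit of an ISSN."""
    if not re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        return False
    digits = issn.replace("-", "")
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return digits[7] == ("X" if check == 10 else str(check))


def validate(record: JournalRecord) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not record.title.strip():
        problems.append("journal title is required [1]")
    for scheme, value in record.identifiers:
        if not scheme:
            problems.append(f"identifier {value!r} has no scheme")
        elif scheme.upper() == "ISSN" and not is_valid_issn(value):
            problems.append(f"invalid ISSN: {value}")
    return problems


jacm = JournalRecord(title="Journal of the ACM", identifiers=[("ISSN", "0004-5411")])
problems = validate(jacm)
```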

Further Reading