Difference between revisions of "Control of Named Entities"

From MPDLMediaWiki
 
(98 intermediate revisions by 6 users not shown)
Line 1: Line 1:
<accesscontrol>MPDL</accesscontrol>
{{ESciDoc Solutions}}


[[Category:PubMan]]


The aim of this page is to collect information concerning the control of named entities in the context of the eSciDoc Solution for publication data management, [http://colab.mpdl.mpg.de/mediawiki/Portal:PubMan PubMan]. The page contains general information on the envisioned PubMan services in this domain. Functional and technical specifications can be found on the page called [[Service_for_Control_of_Named_Entities|Service for Control of Named Entities]].


== Open Issues ==


== Introduction ==


'''Metadata'''
The control of named entities is important for managing and retrieving high-quality metadata, keeping data on the system consistent, and providing a basis for good search results. In library and information science, the creation and maintenance of controlled named entities (so-called authority files or authority records) is well established and follows dedicated guidelines and workflows. Such authority records are usually maintained as separate records and linked to other records. An authority record normally contains, among other things, the authorized name of e.g. a person, name variants, additional information (e.g. to disambiguate the person from others with the same name), and information about relationships between the record and other records. Authority files establish and offer uniform access points, e.g. for persons, group together the various works of a person, and make consolidated indexes possible. Furthermore, additional information such as the alternative name variants of a person has to be maintained and curated in only one record (the authority record), yet it allows users to search under any name variant of the person and to retrieve all records of that person regardless of the name variant used.


*Agree on a list of potential candidates for authority files.
== Scope ==
*Define what kind of descriptive elements an authority record should contain. Descriptive elements may differ from authority file to authority file and should therefore be defined individually.
*Decide whether an IR-item should be self-contained or not.
*Define how to map authority files to a MD element in a specific MDS. Note: for every MD element the system supports authority files for, we probably need to specify a list of descriptive information available for the authority file (e.g. journal names: title, translation of title, title abbreviations, ISSN, eISSN, etc. persons: last name, first name, etc.)
*Specify linking between IR items and authority records via ID.
*Specify linking of different authority files/databases (e.g. user database - personal name authority file - affiliation authority file).


'''Handling of authority files'''
For [[PubMan | PubMan]], several metadata elements are candidates for control of named entities and the generation of controlled lists. The information maintained for each of these metadata elements depends on various factors:
*the kind of information that is needed to identify the relevant metadata value (e.g. during submission),
*the kind of information that is needed for searching and browsing,
*the source used (i.e. what information is actually offered),
*the usage in the system,
*the quantity of information that should be filled in by the users and should be maintained and stored in the system
*and finally the kind and quantity of information that is relevant within the scope of PubMan.


*Describe selection of authority record (Depositor during submission? Free-text field? System suggests authority record while Depositor fills in information? Depositor may search within authority file database and select a record?).
Controlled metadata records should mainly contain descriptive metadata. The selection of potential descriptive elements depends, as mentioned above, on various factors: the source of the controlled metadata values, which values have to be filled in manually and which can be filled in via batch operations, the usage of the data, the quantity of information that should be stored and maintained in PubMan, and the question of which elements have a predefined controlled value list and which elements accept freely definable values. The list of descriptive metadata elements should be extensible.
*Specify assignment of items to an authority record and when it will take place (during submission by selecting an authority record from the selection list? During submission by accepting an authority record selected by the system? During FQA?).
*Describe administration and control of authority files (who is allowed to create, edit, delete, redirect, and authorize authority records? See proposal of new role AF-Editor).
*Define what will happen in case no appropriate authority record is available.
*Specify if different kinds of authority files require different handling.
*Clarify dilemma between authority files and the Autopsie-Prinzip, i.e. the principle of cataloguing from the item in hand (scenario: user selects an authority record. System fills in certain fields automatically. User edits one or more of the automatically filled fields afterwards). Proposal made by Inga: the entry in the IR item follows the Autopsie-Prinzip, but the browse tree is generated from the authority record and standardized data. The notation found in the original (Vorlage) should be integrated in the authority record as an alternative (e.g. alternative name).
*Specify duplicate checking for authority records. Duplicate checking should also compare e.g. name and alternative name.
*Specify users and their rights and privileges concerning authority files (see proposal of new role AF-Editor).
*Specify if a separate authority file workflow is required.
*Describe entry of multiple authors (via copy and paste).


'''Handling of new authority records'''
The publication item stored and maintained in PubMan should contain all important information needed to constitute a coherent item in itself. Publication items might be linked with the database of controlled named entities or with other (external) sources in order to enrich the publication item with additional information (e.g. rights statements for journals).


*Describe creation of new authority records (e.g. when does a user create a new record? Depositor during submission? Moderator during FQA? AF-Editor in a separate workflow? Is it possible to use an existing entry as a template? Should the system generate a message to the AF-Editor when a new authority record has been created?).
== Benefits ==
*Specify a set of rules (“Regelwerk”) for the creation of new authority records.
*Specify if an authority record has obligatory elements.


'''Import of external authority files'''


*Specify how external authority files can be provided (licensed by MPS? Online available? CD-ROM?) and which procedures are required (includes: harvesting, data conversion (format and character set), linking to IR items, update mechanism, maintenance).
The benefits of controlled named entities in the context of PubMan are as follows:
*Describe import of external authority files or subset of it.
*Where will imports of data sets such as name authority files (e.g. PND), user/person-related information, imports from the MPG-IP database hosted at GWDG, and other authority files (e.g. Zeitschriftendatenbank) be described and handled? They are not described in USC_ingestion.


'''Build-up of internal authority files'''


*Describe procedure of how to create incrementally built authority files.
'''Search/retrieval:''' controlled named entities help ensure accurate search results and allow them to be refined.
*Clarify integration/interaction of internal and external authority files (initial import of external authority files or harvest of authority files scheduled on a regular basis?).
*Will it be possible to extend/modify authority records when authority files are loaded/synchronized from an external source? (Assumption: no, or only via customizable fields.)


'''Customization'''
'''Consolidated indexes/browsing:''' controlled named entities allow the generation of consolidated indexes and enable users to browse by e.g. person or organizational unit.


*Clarify on which criteria authority files are chosen (customization of authority files on collection, user, user group level?).
'''Metadata entry:''' controlled named entities facilitate the metadata entry for publication items. During submission the user may select a controlled named entity from a controlled list for certain metadata fields.
*Describe setup of authority files on collection level.


'''Export'''
'''List of references:''' controlled lists enable the generation of reference lists (e.g. for authors or institutes).


*Specify export of authority records/authority files
'''Consistency:''' controlled named entities foster consistent data on the system.  
*Define what kind of descriptive elements of an authority record should be exportable in case IR item is not self-contained or in case IR item should be enriched with additional information from authority record.


'''Ingestion'''
'''Data enrichment:''' controlled named entities allow the enrichment of data with additional information (e.g. ISSN number). The data enrichment is also relevant for export functionalities.


*Specify assignment of authority records for ingested data.
'''Batch operations:''' controlled named entities facilitate the performance of batch operations (e.g. switch the access status of publication items containing a certain publisher from private to public).


'''Searching'''
'''Linking:''' controlled named entities make it possible to link from publication items to other sources (e.g. from a publication item to holding information via OpenURL).


*Define which elements of authority records are searchable (simple, advanced and expert search).
'''Improved interoperability:''' controlled named entities enable better information integration either in eSciDoc itself or via other service providers which have harvested the PubMan data or provide meta-searches.
*Specify searching in external authority files (provide interface for AF-Editor and Moderator to external authority files for inquiries and data transfer).
*Define generation of browse trees (proposal: browse trees should be generated from the standardized data of the authority record).
*Specify search in internal authority files.
*Describe basket functionality for authority records (e.g. important for re-direction in batch mode or re-use of data).


'''Migration of eDoc data'''
'''Add-ons:''' the handling of controlled named entities is important for certain functionalities and services offered by PubMan (e.g. web service for organizational units, creation of researcher pages).


*Specify assignment of authority records to migrated eDoc items.


'''Others'''


*Is the SFX knowledge base an alternative to the ZDB?
 
*The favourite co-authors feature has to be implemented in accordance with the authority file concept (a dictionary function, "Wörterbuchfunktion", could be an alternative to the favourite co-authors feature).
== Application models ==
*An automatic re-linking process (“Umverknüpfungsprozess”) has to be specified. Privileges and rights of IR items have to be considered (i.e. how to handle the re-direction of items from other collections).
 
 
In theory, there are various models for creating and maintaining controlled named entities. In general, it is very likely that an external authority record will not be available for every metadata element that is a candidate for controlled named entities (e.g. person names). This means that

a) controlled and uncontrolled named entities will coexist and/or

b) internally controlled named entities have to be created and maintained.

Which application model is selected for the PubMan services on controlled named entities will be stated in the respective functional specification.
 
 
'''External sources (import of complete authority system)'''
 
Critical factors:
*Data from external sources are constantly updated and extended, so there is a risk of inconsistencies between the external sources and local copies. This application model is therefore primarily practicable for data that are not subject to constant change (e.g. classifications).
*Legal situation: it has to be clarified whether external data can be stored and maintained in PubMan.
*Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
 
 
'''External controlled metadata values one at a time (import of values, not of complete authority system, e.g. via web services)'''
 
Critical factors:
*It has to be clarified whether an appropriate web service is provided for all relevant external sources (e.g. a web service for PND?).
*Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
 
 
'''Build-up of controlled metadata values (within PubMan/eSciDoc and with internal QA process)'''
 
Critical factors:
*A controlling/QA process has to be set up internally. It has to be clarified to what extent standardization efforts and national and international guidelines should be taken into account.
*Creation and maintenance of controlled named entities are time-consuming and expensive, and they require trained staff.
*Licensing: it has to be clarified whether a special license is needed in Germany to build up databases of persons (e.g. controlled named entities for person names).
 
 
'''Referencing to external sources (e.g. via ID)'''
 
Critical factors:
*Controlled named entities are not stored and maintained in the system. Only a reference (e.g. an ID) links to the external source where the entities are maintained. If the external source is unavailable, the values are not accessible.
 
 
'''Hybrid application models:'''
 
 
'''Initial import of external sources as initial content. The data will be further maintained and extended in PubMan and merged with internally created controlled metadata values.'''
 
Critical factors:
*It has to be clarified who holds the rights to the data and whether storage and further editing are permitted.
*It has to be decided whether data from external sources can/should be edited or not.
*Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
*Internally created controlled named entities should be marked to facilitate the further internal QA process.
 
 
'''Shared/combined use of regularly harvested external sources and internally created controlled named entities. This can be done either by integrating harvested external sources into the internally built controlled named entities or by downloading and integrating external authority file records one at a time (via web services).'''
 
Critical factors:
*It has to be clarified who holds the rights to the data and whether storage, maintenance and further editing are permitted.
*Duplicate detection has to be ensured.
*The handling and procedure for updating/regularly scheduled harvesting of external sources have to be specified.
 
== Methods of gathering external sources ==
 
*Access to the web service interface of external sources and integration of selected records.
*Import of external sources (e.g. as initial content) and their integration into the system.
*Regularly scheduled harvesting of external sources and their integration into the system.
 
 
== Management of controlled named entities ==
 
The management of controlled named entities depends, among other things, on the chosen application model and especially on the kind of data (external or internally created) that has to be administered. It also depends on the user group and the respective usage scenarios. The management of controlled named entities will be specified in detail in the functional specification of each PubMan service for controlled named entities. The listing below gives only a rough overview of the basic functionalities that the system has to support:
 
For the creation and/or administration of controlled named entities at least the following functionalities have to be supported (depending on the chosen application model):
*creation (only valid for internally built controlled metadata records), editing, searching and deletion/deactivation of controlled named entities
*exporting of controlled named entities (e.g. as an XML or CSV file)
 
For the user (e.g. depositor) of the system at least the following functionalities have to be provided:
*searching, displaying, and selection of controlled named entities.
 
== Customization/Usage ==
 
There are several metadata elements which are candidates for controlled named entities. The controlled list of named entities will depend not only on the respective metadata element but also on other factors such as the collection or the user profile. Therefore the system has to provide customization options on various levels.

== Implementation ==
The MPDL developed an independent service for the control of named entities (CoNE).
 
Implementation details and the corresponding functionalities can be found here: [http://colab.mpdl.mpg.de/mediawiki/Service_for_Control_of_Named_Entities CoNE]
 
== Further Reading ==
*Max Kaiser, Hans-Jörg Lieder, Kurt Majcen, Heribert Vallant: New Ways of Sharing and Using Authority Information: The LEAF project, D-Lib Magazine, vol. 9, issue 11, 2003 http://www.dlib.org/dlib/november03/lieder/11lieder.html
 
*IFLA Working Group on Functional Requirements and Numbering of Authority Records (FRANAR): Functional Requirements for Authority Data: A Conceptual Model, Draft 2007-04-01 [http://www.ifla.org/VII/d4/FRANAR-ConceptualModel-2ndReview.pdf pdf document] http://www.ifla.org/VII/d4/wg-franar.htm (FRANAR)
 
*IFLA UBCIM Working Group on Minimal Level Authority Records and ISADN: Mandatory Data Elements for Internationally Shared Resource Authority Records http://www.ifla.org/VI/3/p1996-2/mlar.htm
 
*Traugott Koch: Name authorities and author identification, 2007 (DRAFT) http://homes.ukoln.ac.uk/~tk213/drafts/name-authority-KE.html
 
*Danskin, Dixon, Docherty, Hill, Moore: A review of the current landscape in relation to a proposed Name Authority Service for UK repositories of research outputs. Prepared for the JISC Names Project. June 2008. [http://names.mimas.ac.uk/documents/Names_landscape_report_1Oct2007.pdf PDF]
 
[[Category:CoNE]]

Latest revision as of 14:06, 5 January 2011



The aim of this page is to collect information concerning the control of named entities in the context of the eSciDoc Solution for publication data management, PubMan. The page contains general information on the envisioned PubMan services in this domain. Functional and technical specifications can be found on the page called Service for Control of Named Entities.


Introduction

The control of named entities is important for managing and retrieving high-quality metadata, keeping data on the system consistent, and providing a basis for good search results. In library and information science, the creation and maintenance of controlled named entities (so-called authority files or authority records) is well established and follows dedicated guidelines and workflows. Such authority records are usually maintained as separate records and linked to other records. An authority record normally contains, among other things, the authorized name of e.g. a person, name variants, additional information (e.g. to disambiguate the person from others with the same name), and information about relationships between the record and other records. Authority files establish and offer uniform access points, e.g. for persons, group together the various works of a person, and make consolidated indexes possible. Furthermore, additional information such as the alternative name variants of a person has to be maintained and curated in only one record (the authority record), yet it allows users to search under any name variant of the person and to retrieve all records of that person regardless of the name variant used.
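
The record structure and variant-independent retrieval described above can be sketched roughly as follows. This is a minimal illustration, not the CoNE data model: the class `AuthorityRecord`, its field names, and the helper functions are all hypothetical, and the normalization is deliberately simplistic.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    """Hypothetical authority record; field names are illustrative only."""
    record_id: str
    authorized_name: str
    name_variants: list[str] = field(default_factory=list)
    disambiguation: str = ""  # e.g. discipline or affiliation
    related_ids: list[str] = field(default_factory=list)

    def all_names(self) -> list[str]:
        """Authorized name first, followed by all name variants."""
        return [self.authorized_name, *self.name_variants]


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(name.casefold().split())


def build_name_index(records: list[AuthorityRecord]) -> dict[str, set[str]]:
    """Map every normalized name form to the IDs of the records carrying it."""
    index: dict[str, set[str]] = {}
    for record in records:
        for name in record.all_names():
            index.setdefault(normalize(name), set()).add(record.record_id)
    return index


def find_works(query: str, index: dict[str, set[str]],
               works_by_record: dict[str, list[str]]) -> list[str]:
    """Return all works linked to any record that matches the queried name form."""
    works: list[str] = []
    for record_id in sorted(index.get(normalize(query), set())):
        works.extend(works_by_record.get(record_id, []))
    return works
```

Because the variants live only in the authority record, adding a new variant there makes all linked works retrievable under that variant without touching the publication items.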

Scope

For PubMan, several metadata elements are candidates for control of named entities and the generation of controlled lists. The information maintained for each of these metadata elements depends on various factors:

  • the kind of information that is needed to identify the relevant metadata value (e.g. during submission),
  • the kind of information that is needed for searching and browsing,
  • the source used (i.e. what information is actually offered),
  • the usage in the system,
  • the quantity of information that should be filled in by the users and should be maintained and stored in the system
  • and finally the kind and quantity of information that is relevant within the scope of PubMan.

Controlled metadata records should mainly contain descriptive metadata. The selection of potential descriptive elements depends, as mentioned above, on various factors: the source of the controlled metadata values, which values have to be filled in manually and which can be filled in via batch operations, the usage of the data, the quantity of information that should be stored and maintained in PubMan, and the question of which elements have a predefined controlled value list and which elements accept freely definable values. The list of descriptive metadata elements should be extensible.

The publication item stored and maintained in PubMan should contain all important information needed to constitute a coherent item in itself. Publication items might be linked with the database of controlled named entities or with other (external) sources in order to enrich the publication item with additional information (e.g. rights statements for journals).

Benefits

The benefits of controlled named entities in the context of PubMan are as follows:


Search/retrieval: controlled named entities help ensure accurate search results and allow them to be refined.

Consolidated indexes/browsing: controlled named entities allow the generation of consolidated indexes and enable users to browse by e.g. person or organizational unit.

Metadata entry: controlled named entities facilitate the metadata entry for publication items. During submission the user may select a controlled named entity from a controlled list for certain metadata fields.

List of references: controlled lists enable the generation of reference lists (e.g. for authors or institutes).

Consistency: controlled named entities foster consistent data on the system.

Data enrichment: controlled named entities allow the enrichment of data with additional information (e.g. ISSN number). The data enrichment is also relevant for export functionalities.

Batch operations: controlled named entities facilitate the performance of batch operations (e.g. switch the access status of publication items containing a certain publisher from private to public).

Linking: controlled named entities make it possible to link from publication items to other sources (e.g. from a publication item to holding information via OpenURL).

Improved interoperability: controlled named entities enable better information integration either in eSciDoc itself or via other service providers which have harvested the PubMan data or provide meta-searches.

Add-ons: the handling of controlled named entities is important for certain functionalities and services offered by PubMan (e.g. web service for organizational units, creation of researcher pages).
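
To illustrate the batch-operations benefit listed above, here is a small sketch of switching the access status of all items that reference a given publisher. The item layout (`publisher_id`, `access`) and the function name are assumptions made for this example only; PubMan items are not plain dictionaries.

```python
def set_access_for_publisher(items: list[dict], publisher_id: str,
                             new_status: str) -> int:
    """Set the access status of every item linked to the given controlled
    publisher entity and return how many items were actually changed."""
    changed = 0
    for item in items:
        if item["publisher_id"] == publisher_id and item["access"] != new_status:
            item["access"] = new_status
            changed += 1
    return changed
```

The operation is reliable only because items reference the publisher by a controlled identifier; with free-text publisher names, spelling variants would be missed.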



Application models

In theory, there are various models for creating and maintaining controlled named entities. In general, it is very likely that an external authority record will not be available for every metadata element that is a candidate for controlled named entities (e.g. person names). This means that

a) controlled and uncontrolled named entities will coexist and/or

b) internally controlled named entities have to be created and maintained.

Which application model is selected for the PubMan services on controlled named entities will be stated in the respective functional specification.


External sources (import of complete authority system)

Critical factors:

  • Data from external sources are constantly updated and extended, so there is a risk of inconsistencies between the external sources and local copies. This application model is therefore primarily practicable for data that are not subject to constant change (e.g. classifications).
  • Legal situation: it has to be clarified whether external data can be stored and maintained in PubMan.
  • Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
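
One way to carry source and version information on imported records, and to spot local copies that have fallen behind the external source, might look like the sketch below. The classes `Provenance` and `ImportedRecord` and the version labels are hypothetical; how an external source actually identifies its versions would have to be clarified per source.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    source: str        # name of the external authority file, e.g. "PND"
    version: str       # release label or snapshot identifier (assumed to exist)
    imported_on: date


@dataclass
class ImportedRecord:
    record_id: str
    authorized_name: str
    provenance: Provenance


def stale_records(records: list[ImportedRecord],
                  current_versions: dict[str, str]) -> list[ImportedRecord]:
    """Return records whose stored version differs from the current version
    of their source. Sources missing from current_versions are skipped."""
    stale = []
    for record in records:
        current = current_versions.get(record.provenance.source)
        if current is not None and current != record.provenance.version:
            stale.append(record)
    return stale
```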


External controlled metadata values one at a time (import of values, not of complete authority system, e.g. via web services)

Critical factors:

  • It has to be clarified whether an appropriate web service is provided for all relevant external sources (e.g. a web service for PND?).
  • Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.


Build-up of controlled metadata values (within PubMan/eSciDoc and with internal QA process)

Critical factors:

  • A controlling/QA process has to be set up internally. It has to be clarified to what extent standardization efforts and national and international guidelines should be taken into account.
  • Creation and maintenance of controlled named entities are time-consuming and expensive, and they require trained staff.
  • Licensing: it has to be clarified whether a special license is needed in Germany to build up databases of persons (e.g. controlled named entities for person names).


Referencing to external sources (e.g. via ID)

Critical factors:

  • Controlled named entities are not stored and maintained in the system. Only a reference (e.g. an ID) links to the external source where the entities are maintained. If the external source is unavailable, the values are not accessible.
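
The availability risk of the reference-only model can be made concrete with a short sketch. The lookup function is passed in as a parameter and stands for whatever interface an external source offers; `SourceUnavailable`, `resolve_label`, and `display_creator` are hypothetical names introduced for this example.

```python
from typing import Callable, Optional


class SourceUnavailable(Exception):
    """Raised by a lookup function when the external source cannot be reached."""


def resolve_label(entity_id: str,
                  lookup: Callable[[str], str]) -> Optional[str]:
    """Resolve an external ID to its authorized name, or None if the
    external source is unavailable."""
    try:
        return lookup(entity_id)
    except SourceUnavailable:
        return None


def display_creator(item: dict, lookup: Callable[[str], str]) -> str:
    """Render the creator of an item that stores only an external reference."""
    label = resolve_label(item["creator_id"], lookup)
    if label is None:
        # Nothing is stored locally, so only the bare ID can be shown.
        return f"[unresolved: {item['creator_id']}]"
    return label
```

A locally cached label would avoid the unresolved case, but that moves the design towards one of the hybrid models described below.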


Hybrid application models:


Initial import of external sources as initial content. The data will be further maintained and extended in PubMan and merged with internally created controlled metadata values.

Critical factors:

  • It has to be clarified who holds the rights to the data and whether storage and further editing are permitted.
  • It has to be decided whether data from external sources can/should be edited or not.
  • Imported external authority file records should contain information about their source and version. This makes it possible to indicate the quality/reliability of the data during retrieval.
  • Internally created controlled named entities should be marked to facilitate the further internal QA process.


Shared/combined use of regularly harvested external sources and internally created controlled named entities. This can be done either by integrating harvested external sources into the internally built controlled named entities or by downloading and integrating external authority file records one at a time (via web services).

Critical factors:

  • It has to be clarified who holds the rights to the data and whether storage, maintenance and further editing are permitted.
  • Duplicate detection has to be ensured.
  • The handling and procedure for updating/regularly scheduled harvesting of external sources have to be specified.

Methods of gathering external sources

  • Access to the web service interface of external sources and integration of selected records.
  • Import of external sources (e.g. as initial content) and their integration into the system.
  • Regularly scheduled harvesting of external sources and their integration into the system.


Management of controlled named entities

The management of controlled named entities depends, among other things, on the chosen application model and especially on the kind of data (external or internally created) that has to be administered. It also depends on the user group and the respective usage scenarios. The management of controlled named entities will be specified in detail in the functional specification of each PubMan service for controlled named entities. The listing below gives only a rough overview of the basic functionalities that the system has to support:

For the creation and/or administration of controlled named entities at least the following functionalities have to be supported (depending on the chosen application model):

  • creation (only valid for internally built controlled metadata records), editing, searching and deletion/deactivation of controlled named entities
  • exporting of controlled named entities (e.g. as an XML or CSV file)
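
The export functionality listed above could be sketched with the Python standard library as follows. The record layout, the column names, the `|` separator for variants, and the XML element names are assumptions made for this example and do not reflect an actual CoNE export format.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["id", "authorized_name", "name_variants"]


def to_csv(records: list[dict]) -> str:
    """Serialize records as CSV; name variants are joined with '|'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({
            "id": record["id"],
            "authorized_name": record["authorized_name"],
            "name_variants": "|".join(record["name_variants"]),
        })
    return buffer.getvalue()


def to_xml(records: list[dict]) -> str:
    """Serialize records as a simple XML document with one element per entity."""
    root = ET.Element("entities")
    for record in records:
        entity = ET.SubElement(root, "entity", id=record["id"])
        ET.SubElement(entity, "authorizedName").text = record["authorized_name"]
        for variant in record["name_variants"]:
            ET.SubElement(entity, "nameVariant").text = variant
    return ET.tostring(root, encoding="unicode")
```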

For the user (e.g. depositor) of the system at least the following functionalities have to be provided:

  • searching, displaying, and selection of controlled named entities.

Customization/Usage

There are several metadata elements which are candidates for controlled named entities. The controlled list of named entities will depend not only on the respective metadata element but also on other factors such as the collection or the user profile. Therefore the system has to provide customization options on various levels.

Implementation

The MPDL developed an independent service for the control of named entities (CoNE).

Implementation details and the corresponding functionalities can be found here: CoNE

Further Reading

  • Danskin, Dixon, Docherty, Hill, Moore: A review of the current landscape in relation to a proposed Name Authority Service for UK repositories of research outputs. Prepared for the JISC Names Project. June 2008. PDF