Talk:Control of Named Entities

Naming

The aim is to find a new name for authority files that does not refer too much to the practice of authority files in libraries. Suggestions and feedback welcome!

Possible alternative names

normalizing metadata/data entries
managing controlled vocabularies
harmonizing metadata/data entries
controlling metadata entries
terminology management
reference information service (can be split in: reference person service, reference affiliation service, reference journal service, etc.)
(proposal) master data management is another term that could be considered (though it does not have exactly the same meaning here as in ERP or CRM systems)
(proposal) metadata value domains/metadata domain value
controlled metadata values
see also ISAAR(CPF): http://www.ica.org/en/node/30230

If I'm not mistaken, CDS Invenio (the software of the CERN Document Server) calls this concept a knowledge base. How it functions is also worth mentioning: no normalization of data is performed on input, i.e. the data in the database will always be what the metadata editor inserted. Knowledge bases only come into play when data is output. Output formatting templates can associate certain fields with knowledge bases and thus force normalization of the data. This concept stems from a requirement that should be familiar from eDoc: scientists want to be able to get their data out exactly as it was inserted, e.g. author names in all caps. Obviously this approach has its own share of problems. Basically, all methods that inspect the data (searching, duplicate detection, etc.) must take the knowledge bases into account, or they will only work in idiosyncratic ways.
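To make the output-only normalization concrete, here is a minimal sketch of the idea. It is not CDS Invenio's actual API: the names AUTHOR_KB, format_field and search, as well as the sample data, are made up for illustration.

```python
# Hypothetical knowledge base: maps stored variants to one normalized form.
AUTHOR_KB = {
    "MUELLER, A.": "Müller, A.",
    "Mueller, A.": "Müller, A.",
}


def format_field(stored_value, knowledge_base=None):
    """Return the value for output; the stored value itself is never changed."""
    if knowledge_base is None:
        return stored_value  # template without knowledge base: output as entered
    return knowledge_base.get(stored_value, stored_value)


def search(records, query, knowledge_base):
    """Compare normalized forms so that variants of one name are found together."""
    wanted = knowledge_base.get(query, query)
    return [
        r for r in records
        if knowledge_base.get(r["author"], r["author"]) == wanted
    ]


# Illustrative stored data, kept exactly as the metadata editor typed it.
records = [
    {"author": "MUELLER, A."},
    {"author": "Mueller, A."},
    {"author": "Smith, J."},
]
```

A template without a knowledge base returns "MUELLER, A." unchanged, while one that uses AUTHOR_KB returns the normalized form. The search function shows why every method that inspects the data has to consult the knowledge base as well: a plain string comparison would find only one of the two variants.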

Open Issues

Metadata

Agree on a list of potential candidates for authority files. Note: If a generic mechanism like CDS Invenio's knowledge base were implemented, no such list would be needed in advance.
Define what kind of descriptive elements an authority record should contain. Descriptive elements may differ from authority file to authority file and should therefore be defined individually.
Decide whether an IR-item should be self-contained or not. Question: What does self-contained mean? Even right now, IR items are not self-contained in the sense that they contain all relevant metadata values, because other repository objects like creators are only referenced.
Define how to map authority files to an MD element in a specific MDS. Note: for every MD element for which the system supports authority files, we probably need to specify the list of descriptive information available in the authority file (e.g. journal names: title, translation of title, title abbreviations, ISSN, eISSN, etc.; persons: last name, first name, etc.).
Specify linking between IR items and authority records via ID.
Specify linking of different authority files/databases (e.g. user database - personal name authority file - affiliation authority file).
Describe selection of an authority record (By the Depositor during submission? Via a free-text field? Does the system suggest an authority record while the Depositor fills in information? May the Depositor search within the authority file database and select a record?).
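As a discussion aid for the linking and self-containment questions above, here is a minimal sketch of one possible model, not a decided design. The IR item keeps the value as entered plus an optional authority record ID. All class, field and variable names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorityRecord:
    record_id: str
    name: str  # standardized form


@dataclass
class CreatorEntry:
    entered_name: str                   # value as typed by the Depositor
    authority_id: Optional[str] = None  # link to an authority record, if any


# Illustrative authority file, keyed by record ID.
AUTHORITY_FILE = {"person:1": AuthorityRecord("person:1", "Müller, Anna")}


def display_name(entry, authority_file):
    """Prefer the standardized name; fall back to the entered value."""
    record = authority_file.get(entry.authority_id) if entry.authority_id else None
    return record.name if record else entry.entered_name
```

Because the entered value stays in the item, the item remains readable even when the link is missing or cannot be resolved. That is one possible reading of "self-contained".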

Handling of authority files

Specify the assignment of items to an authority record and when it takes place (during submission, by selecting an authority record from the selection list? During submission, by accepting an authority record selected by the system? During FQA?).
Describe administration and control of authority files (who is allowed to create, edit, delete, redirect, and authorize authority records? See the proposal for the new role AF-Editor).
Define what will happen in case no appropriate authority record is available.
Specify if different kinds of authority files require different handling.
Clarify the dilemma between authority files and the Autopsie-Prinzip, i.e. recording data as it appears on the item itself (scenario: the user selects an authority record, the system fills in certain fields automatically, and the user then edits one or more of the automatically filled fields). Proposal made by Inga: the entry in the IR item follows the Autopsie-Prinzip, but the browse tree is generated from the authority record and its standardized data. The form used in the original (Vorlage) should be added to the authority record as an alternative (e.g. alternative name).
Specify duplicate checking for authority records. Duplicate checking should also compare e.g. name and alternative name.
Specify users and their rights and privileges concerning authority files.
Specify if a separate authority file workflow is required.
Describe entry of multiple authors (via copy and paste).
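The duplicate check mentioned above (comparing names as well as alternative names) could look roughly like this. It is only a sketch: the record layout, the function names and the normalization rules (case, accents, punctuation) are assumptions that still have to be agreed on.

```python
import unicodedata


def normalize(name):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def name_keys(record):
    """All normalized names of a record: the name plus any alternative names."""
    names = [record["name"], *record.get("alternative_names", [])]
    return {normalize(n) for n in names}


def find_duplicates(candidate, existing_records):
    """Return existing records that share at least one normalized name."""
    keys = name_keys(candidate)
    return [r for r in existing_records if keys & name_keys(r)]


# Illustrative data only.
existing = [
    {"id": "org:1", "name": "Max-Planck-Gesellschaft",
     "alternative_names": ["Max Planck Society", "MPG"]},
    {"id": "org:2", "name": "Deutsche Nationalbibliothek",
     "alternative_names": ["DNB"]},
]
```

With this data, a new record named "Max Planck Society" is reported as a possible duplicate of org:1, although its name matches only an alternative name there. An exact match on normalized strings is the simplest choice; fuzzy matching would be a separate decision.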

Handling of new authority records

Describe creation of new authority records (e.g. who creates a new record and when? The Depositor during submission? The Moderator during FQA? The AF-Editor in a separate workflow? Is it possible to use an existing entry as a template? Should the system send a message to the AF-Editor when a new authority record has been created?).
Specify a set of rules (“Regelwerk”) for the creation of new authority records.
Specify if an authority record has obligatory elements.

Import of external authority files

Specify how external authority files can be provided (licensed by MPS? Available online? CD-ROM?) and which procedures are required (including harvesting, data conversion (format and character set), linking to IR items, update mechanism, maintenance).
Describe the import of external authority files or subsets of them.
Where will imports of data sets such as name authority files (e.g. PND), user/person-related information, imports from the MPG-IP database hosted at GWDG, and other authority files (e.g. Zeitschriftendatenbank) be described and handled? They are not described in USC_ingestion.

Build-up of internal authority files

Describe the procedure for building authority files incrementally.
Clarify integration/interaction of internal and external authority files (initial import of external authority files, or harvest of authority files scheduled on a regular basis?).
Will it be possible to extend/modify authority records when authority files are loaded/synchronized from an external source? (Assumption: no, or only via customizable fields.)

Customization

Clarify on which criteria authority files are chosen (customization of authority files on collection, user, user group level?).
Describe setup of authority files on collection level.

Export

Specify export of authority records/authority files.
Define what kind of descriptive elements of an authority record should be exportable in case IR item is not self-contained or in case IR item should be enriched with additional information from authority record.

Ingestion

Specify assignment of authority records for ingested data.

Searching

Define which elements of authority records are searchable (simple, advanced and expert search).
Specify searching in external authority files (provide interface for AF-Editor and Moderator to external authority files for inquiries and data transfer).
Define generation of browse trees (proposal: browse trees should be generated from the standardized data of the authority record).
Specify search in internal authority files.
Describe basket functionality for authority records (e.g. important for re-direction in batch mode or re-use of data).

Migration of eDoc data

Specify assignment of authority records to migrated eDoc items.

Others

Is the SFX knowledge base an alternative to the ZDB?
The favourite co-authors feature has to be implemented in accordance with the authority file concept (a dictionary function (Wörterbuchfunktion) could be an alternative to the favourite co-authors feature).
An automatic re-linking process (“Umverknüpfungsprozess”) has to be specified. Privileges and rights of IR items have to be considered (i.e. how to handle the re-direction of items from other collections).

New role: AF-Editor

It has to be discussed further whether a new role called AF-Editor has to be established. The idea is that the AF-Editor is responsible for providing and maintaining high data quality of authority records and for ensuring the consistency of the authority file databases. He/she is familiar with the relevant cataloging and standardization rules and takes care of the standardization of selected data. The AF-Editor complements the responsibilities of the Moderator and the MD-Editor and has special privileges to authorize and to delete authority records. Once an authority record has been authorized, it is locked and can only be edited by the AF-Editor him-/herself.

Potential new use cases

authorize an authority record
send an authority record back for revision (e.g. in case a Moderator wants to edit an authority record that has already been authorized)
propose an authority record for deletion

Potential new status

After creation, an authority record is in either the state pending or the state submitted.

Privileges/competencies

The AF-Editor is allowed to authorize and to delete authority records and, beyond that, has privileges for all other actions connected to authority files/authority records.
During a separate AFQA process, the authority record is checked and authorized by the AF-Editor. A list of newly created authority records is displayed in the AF-Editor's workspace.
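To help the discussion of the AFQA process, the states and privileges could be written down as a small transition table. This is a sketch only. This page names just the states pending and submitted and the actions "authorize" and "send back for revision". The state name "authorized", the action names and the role sets below are assumptions.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    SUBMITTED = "submitted"
    AUTHORIZED = "authorized"  # assumed name for the state after authorization


# (current state, action) -> (new state, roles allowed to perform the action)
# The role sets are placeholders for whatever is finally decided.
TRANSITIONS = {
    (State.PENDING, "submit"): (State.SUBMITTED, {"Depositor", "Moderator", "AF-Editor"}),
    (State.SUBMITTED, "authorize"): (State.AUTHORIZED, {"AF-Editor"}),
    (State.AUTHORIZED, "send_back"): (State.PENDING, {"AF-Editor"}),
}


def apply_action(state, action, role):
    """Return the new state, or raise if the action or the role is not allowed."""
    if (state, action) not in TRANSITIONS:
        raise ValueError(f"{action!r} is not possible in state {state.value!r}")
    new_state, allowed_roles = TRANSITIONS[(state, action)]
    if role not in allowed_roles:
        raise PermissionError(f"{role} may not perform {action!r}")
    return new_state
```

In this sketch only the AF-Editor can move a record into or out of the authorized state, which mirrors the locking rule described above. A Moderator who wants to change an authorized record would have to ask for it to be sent back.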

Open issues

The separate AFQA process and its interaction with the FQA process have to be specified. We assume that the release process of IR items is not affected by the new AF workflow when IR items are self-contained and follow the Autopsie-Prinzip.

Prototyping

To understand all the issues better, it would be good to start prototyping an authority file as soon as possible for one selected metadata element, e.g. corporate bodies, journals, or authors.

Stages (proposal):

1. Select an authority file (corporate bodies, journals, authors) and an available external source (todo: Functional Experts)
2. Create (import) data locally into an authority file from the selected source (todo: development team)
3. Implement the referencing from the PubMan edit interface (to start with, let the authority file grow automatically when an entry is not referenced) (todo: development team)
4. Create a very simple viewer/editor for the authority file data (todo: development team)
5. Get feedback from potential pilot users (todo: Functional Experts)
6. Extend the prototype with another authority file and repeat steps 2-5
7. Modify/add functionalities based on the functional and technical feedback

Functional proposal for prototyping a first "controlled vocab service" for next release:
The idea is to start with the design of a service that allows the import of a controlled list and the update/editing of this list. We would start with initial content from eDoc: the controlled list of journal names and their abbreviations (export already done by Vlad/Nicole). That might be a task for R3.
The second step, in a later release, would be to include this service in the edit mask/submission and in the search. --Ulla 15:28, 12 November 2007 (CET)

Inga will provide an enhanced import file (start content) after her holiday, see https://dev.livingreviews.org/projects/vlib/wiki/SFXJournalIssues#JournalListforeSciDoc --Inga 22:27, 13 November 2007 (CET)
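The first step of the proposed service (import a controlled list, then update/edit it) could be prototyped along these lines. This is a rough sketch under assumptions: the CSV layout with the columns title and abbreviation is invented, since the format of the eDoc export is not described here, and the class and method names are hypothetical.

```python
import csv
import io


class ControlledList:
    """In-memory controlled list: journal title -> set of abbreviations."""

    def __init__(self):
        self.entries = {}

    def import_csv(self, text):
        """Load rows with the (assumed) columns 'title' and 'abbreviation'."""
        for row in csv.DictReader(io.StringIO(text)):
            title = row["title"].strip()
            abbreviation = row["abbreviation"].strip()
            abbreviations = self.entries.setdefault(title, set())
            if abbreviation:
                abbreviations.add(abbreviation)

    def rename(self, old_title, new_title):
        """Edit an entry's title, keeping (and merging) its abbreviations."""
        abbreviations = self.entries.pop(old_title)
        self.entries.setdefault(new_title, set()).update(abbreviations)


# Illustrative start content; repeated titles are merged on import.
SAMPLE = (
    "title,abbreviation\n"
    "Journal of the ACM,JACM\n"
    "Journal of the ACM,J. of the ACM\n"
)

journals = ControlledList()
journals.import_csv(SAMPLE)
```

Keeping the list in a plain dictionary is enough for a first prototype. Persistence and the later integration into the edit mask and the search are deliberately left out.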

Scenarios

ControlledVocab

Scope

For PubMan, at least the following metadata elements are candidates for controlled metadata values and the generation of controlled lists: person, organization, journal, and conference/event. Other candidates have to be discussed further. Controlled metadata records should mainly contain descriptive metadata. The selection of potential descriptive elements depends on various factors: the source of the controlled metadata values, which values have to be filled in manually and which can be filled in via batch operations, the usage of the data, the quantity of information that should be stored and maintained in PubMan, and the question of which elements have a controlled, predefined value list and which elements take freely definable values. The list of descriptive metadata elements should be extendable by new elements.

A list of potential descriptive elements is given below. It should be considered a first draft and has to be revised in accordance with the influencing factors mentioned above. The document Systemspecification eSciDoc Metadata Sets (Systemspecification_eSciDoc_metadata_sets.doc) served as the basis for this first draft.

Person

Complete name

The complete name of a person, usually a concatenation of given names and family name

Given name

A given name of a person

Family name

The family name of a person

Alternative name

Any alternative name used for the person

Title

The title or peerage of a person in one string

Pseudonym

The pen or stage name of a person

- Remark Sabine: Can pseudonym also be covered by alternative name?

- Remark Ulla: Let's assume: yes

Affiliation

The institution the person was affiliated with when creating the item

Identifier

Identifier in the Personennormdatei, provided by the Deutsche Nationalbibliothek

- Remark Sabine: IMO other identifiers should be allowed as well (e.g. the identifier of the Library of Congress Name Authority File)

- Remark Ulla: Can be modified if needed

Email

Email address of the person (e.g. will allow users to send an email to the author asking for the fulltext in case it is not available)

- Remark Sabine: I am not sure whether the email address is important information for all persons or for registered PubMan users only. The handling of this "private data" also has to be clarified.

- Remark Traugott: Problem of regularly updating the email address

- Remark Ulla: IMO updating controlled vocab. is always a challenge, not only for emails...?

Homepage

The location of a personal homepage (e.g. in case fulltext is available via personal homepage)

- Remark Sabine: same as for Email

Organization

Name

The name of the organization

Alternative name

Any alternative name used for the organization (also translations)

Abbreviation

Any abbreviation used for the organization

- Remark Sabine: Or should we consider an abbreviation as an alternative name?

- Remark Ulla: fine for me

Address

The postal address of the organization

Country

The country where the organization is located

Type

Type of organization (e.g. institution, institute, department/group/research unit)

Journal

Journal title

The name of the journal (e.g. "Journal of the ACM")

Alternative title

Any alternative name of the journal (e.g. "J. of the ACM", "Journal of the Association for Computing Machinery")

Abbreviation

Any abbreviation used for the journal (e.g. "JACM")

- Remark Sabine: Can abbreviation be covered by alternative title?

- Remark Traugott: no usage for element abbreviation in this context

- Remark Ulla: check with users ... I think abbrev. is needed also for openURL ... if anyone offers controlled vocab. on journals, abbrev. will be included anyway.

No, the abbreviation is not necessarily required for the openURL, but it may be relevant:

for searching (pnas instead of proceedings of the ...)
for citation styles, which sometimes request an abbreviated title instead of the full one

By the way, the first alternative title is already an abbreviation. I don't mind storing abbreviated titles as alternative titles, but then we should tag that it is an abbreviated form, plus the origin of the abbreviation (if known) --Inga 22:18, 13 November 2007 (CET)

Publisher

The name of the institution that has published the journal

Place

Place where journal has been published

- Remark Sabine: I am not sure whether this information is needed. Also Uta does not see a need for it.

- Remark Ulla: can be deleted

Identifier

Any external identifier (e.g. ISSN, eISSN, ZDB-ID)

- Remark Traugott: Schema has to be indicated
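To illustrate two of the remarks above (tagging an alternative title as an abbreviation together with its origin, and indicating the scheme of each identifier), a journal record could be modelled as follows. This is a draft for discussion: the class and field names are hypothetical, and the ISSN in the example is a placeholder, not a real value.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlternativeTitle:
    value: str
    is_abbreviation: bool = False
    origin: Optional[str] = None  # source of the abbreviation, if known


@dataclass
class Identifier:
    scheme: str  # e.g. "ISSN", "eISSN", "ZDB-ID"
    value: str


@dataclass
class JournalRecord:
    title: str
    alternative_titles: List[AlternativeTitle] = field(default_factory=list)
    identifiers: List[Identifier] = field(default_factory=list)
    publisher: Optional[str] = None

    def abbreviations(self):
        """Alternative titles tagged as abbreviated forms."""
        return [t.value for t in self.alternative_titles if t.is_abbreviation]

    def identifier(self, scheme):
        """First identifier of the given scheme, or None."""
        return next((i.value for i in self.identifiers if i.scheme == scheme), None)


jacm = JournalRecord(
    title="Journal of the ACM",
    alternative_titles=[
        AlternativeTitle("J. of the ACM", is_abbreviation=True),
        AlternativeTitle("JACM", is_abbreviation=True),
    ],
    identifiers=[Identifier("ISSN", "0000-0000")],  # placeholder value
)
```

With the tag in place, a separate Abbreviation element is not strictly needed. Searching and citation styles can ask the record for its abbreviations, and anything that needs an ISSN can ask for the identifier with that scheme.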

Conference/event

Title

The name of the event (e.g. Symposium on Theory of Computing)

Alternative title

Any alternative name of the event

Abbreviation

Abbreviated name of the event (e.g. STOC)

Start date

Start date of the event

End date

End date of the event

Place

Place where the event took place

Invitation status

Whether the creator was explicitly invited

- Remark Sabine: Should this information be stored in controlled metadata record?

- Remark Ulla: No, not to my understanding

Other candidates

Other potential candidates for normalized metadata entries have to be discussed further and may have to be defined together with the pilots.

Keywords, classifications, thesauri → see cpt_pubman_classifications
Title of Source (e.g. Series titles)

Web services

It has to be discussed whether part of the controlled metadata values which are stored and maintained in PubMan should be provided via web services (e.g. an interface/plugin for organizational units in order to re-use the data for instance when writing a scientific paper). The legal situation for metadata values from external sources has to be clarified in this context.

Future projects

Build a working group on authority files (drawn from the PubMan pilot group and other interested Max Planck Institutes). Possible tasks:
- sample creation of controlled entries for MPG-related authors (maybe of one institute) according to standard guidelines and in the Library of Congress Authority File format.