PubMan Indexing

The information on this page is deprecated. Please see the page Publication metadata context for more accurate information.


This article lists the search indexes required by the PubMan solution, focusing on the external-repository view as described in the use case PubMan Func Spec Search. Please note that additional indexes are needed to fulfill administrative tasks on the repository.

Initially, the content of this page was copied from the file Systemspecification_Pubman.doc (FE_PM_04 Indexing item data), but the requirements will only be maintained on this page. Please use the discussion page to comment on existing requirements or to submit new ones.

Required Indexes

All released items have to be retrievable via the search engine. Lucene should always index the last released version of an item. The search database should provide the following indexes:

Any Field (escidoc.metadata)

  • All elements defined in the functional Metadata set specification are indexed for the search.
  • In addition, the following properties are indexed: <ID of item>, <PID of item>, <PID of file>
  • Files of any type are indexed, except those of content category "correspondence" and "copyright transfer agreement".

Special requirements for identifiers

In the "any-field" index, identifiers should be indexed as two separate tokens, "identifier-type" and "identifier-value", e.g.

  • ISSN
  • ISBN 232323184738
  • URI http://mpdl.mpg.de/123

Please note remarks on normalization

Example Queries:

  • escidoc.metadata=ISSN - to return all records which have an identifier of type ISSN
  • escidoc.metadata="ISBN 1234*" - to return all records which have an identifier of type ISBN starting with 1234
  • escidoc.metadata="URI http://mpdl.mpg.de" - to return all records which have an identifier of type URI with the exact value "http://mpdl.mpg.de"


Genre (escidoc.any-genre)

  • Publication.Genre

Persons (escidoc.any-persons)

  • all Creator.Person.CompleteName with Creator.CreatorType = "Person"


Organization

by name (escidoc.organization-name)

  • all Organization.Name for each language separately and for all languages at once

by identifier (escidoc.any-organization-pids)

  • Organization.PID and the PIDs of all organizations in the organizational path hierarchy up to the top-level organizations (i.e. the PIDs of the authors' affiliations and the PIDs of all their parent organizations).

Example 1:

  • A PubItem "Title1" has authors:
    • Udo Müller, affiliated to Max-Planck Institute for Psycholinguistics (PID2)
    • Johanna Müllerin, affiliated to Department1 (PID4) of Max-Planck Institute for Plasma Physics
  • If the organizational unit structure is as follows:
MaxPlanckSociety (PID1)
 |---Max-Planck-Institute for Psycholinguistics (PID2)
 |---Max-Planck Institute for Plasma Physics (PID5)
      |__Department 1 (PID4)
PlasmaphysicsSociety (PID6)
 |---Max-Planck Institute for Plasma Physics (PID5)
      |__Department 1 (PID4)
 
  • Outcome: The index on Organization.PID should contain the following values for the PubItem "Title1":
   PID2, PID1, PID5, PID4, PID6

This holds even though the author affiliations in the descriptive metadata are directly related only to PID2 and PID4.

Example 2:

  • A PubItem "Title2" has authors:
    • Organization: Max-Planck Institute for Plasma Physics (PID5)
    • Organization: Max-Planck Institute for Psycholinguistics (PID2)
  • If the organizational unit structure is as follows:
MaxPlanckSociety (PID1)
 |---Max-Planck-Institute for Psycholinguistics (PID2)
 |---Max-Planck Institute for Plasma Physics (PID5)
      |__Department 1 (PID4)
PlasmaphysicsSociety (PID6)
 |---Max-Planck Institute for Plasma Physics (PID5)
      |__Department 1 (PID4)
 
  • Outcome: The index on Organization.PID should contain the following values for the PubItem "Title2":
   PID2, PID5, PID1, PID6

This holds even though the organizations in the creator descriptive metadata are directly related only to PID2 and PID5.
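
The expansion shown in both examples can be sketched as a walk up the organizational hierarchy. This is only an illustration, not the framework's implementation: the PARENTS mapping and the function name organization_pids are hypothetical, and the mapping simply mirrors the example structure above, in which PID5 has two parents.

```python
from collections import deque

# Hypothetical child -> parents mapping that mirrors the example structure.
# An organizational unit may have more than one parent (PID5 has two).
PARENTS = {
    "PID1": [],
    "PID6": [],
    "PID2": ["PID1"],
    "PID5": ["PID1", "PID6"],
    "PID4": ["PID5"],
}


def organization_pids(affiliation_pids, parents=PARENTS):
    """Return the affiliation PIDs plus the PIDs of all their ancestors."""
    seen = set()
    queue = deque(affiliation_pids)
    while queue:
        pid = queue.popleft()
        if pid in seen:
            continue  # already reached via another path
        seen.add(pid)
        queue.extend(parents.get(pid, []))
    return seen


# Example 1: authors affiliated to PID2 and PID4
print(sorted(organization_pids(["PID2", "PID4"])))
# Example 2: creator organizations PID5 and PID2
print(sorted(organization_pids(["PID5", "PID2"])))
```

For Example 1 the sketch yields PID1, PID2, PID4, PID5 and PID6; for Example 2 it yields PID1, PID2, PID5 and PID6, matching the outcomes listed above. A set is used because the same parent can be reached through several paths.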

Please note: clarification

Title (escidoc.any-title)

The following elements are indexed for each language separately and for all languages at once:

  • publication.dc:title
  • publication.dcterms:alternative


Topic (escidoc.any-topic)

The following elements are indexed for each language separately and for all languages at once:

  • publication.dc:title
  • publication.dcterms:alternative
  • publication.dcterms:tableOfContents
  • publication.dcterms:abstract
  • publication.dc:subject


Dates

escidoc.any-dates

All publication dates, including the following elements:

  • publication.dcterms:created
  • publication.dcterms:modified
  • publication.dcterms:dateSubmitted
  • publication.dcterms:dateAccepted
  • publication.published-online
  • publication.dcterms:issued

Search index: escidoc.created

  • indexed field: publication.dcterms:created

Search index: escidoc.modified

  • indexed field: publication.dcterms:modified

Search index: escidoc.dateSubmitted

  • indexed field: publication.dcterms:dateSubmitted

Search index: escidoc.dateAccepted

  • indexed field: publication.dcterms:dateAccepted

Search index: escidoc.published-online

  • indexed field: publication.dcterms:published-online

Search index: escidoc.issued

  • indexed field: publication.dcterms:issued
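
The relation between the individual date indexes and escidoc.any-dates can be sketched as follows. This is a minimal illustration under assumptions: the function name date_index_entries and the input shape (a dictionary from element name to date string) are hypothetical, and elements are keyed by their short names only.

```python
# Element names taken from the list of date indexes above.
DATE_ELEMENTS = [
    "created",
    "modified",
    "dateSubmitted",
    "dateAccepted",
    "published-online",
    "issued",
]


def date_index_entries(dates):
    """Map one item's dates (element name -> date string) to index entries.

    Each date present goes into its own index (escidoc.<element>) and is
    also collected in escidoc.any-dates.
    """
    entries = {}
    for element in DATE_ELEMENTS:
        value = dates.get(element)
        if value is None:
            continue  # the item does not carry this date
        entries[f"escidoc.{element}"] = [value]
        entries.setdefault("escidoc.any-dates", []).append(value)
    return entries


print(date_index_entries({"created": "2008-01-15", "issued": "2008-03-01"}))
```

An item with only a created and an issued date would thus appear in escidoc.created, escidoc.issued and, with both values, in escidoc.any-dates.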

Event (escidoc.any-event)

The following elements are indexed for each language separately and for all languages at once:

  • publication.event.dc:title
  • publication.event.dcterms:alternative
  • publication.event.place


Internal Identifier (escidoc.any-identifier)

Internal identifiers available for an item and its components:

  • ID and PID of item, PID of files

External Identifier (escidoc.any-identifier)

All additional identifiers specified within the metadata record, including

  • publication.dc:identifier
  • publication.source.dc:identifier
  • publication.source.source.dc:identifier


Special requirements for indexing external identifiers

Identifier type and identifier value should be indexed together so that users are able to find an item with identifier type ISSN and identifier value 123-234 with any of the following "search by identifier" criteria:

  • escidoc.any-identifier="ISSN 123-234"
  • escidoc.any-identifier="ISSN 12*"
  • escidoc.any-identifier="ISSN *"
  • escidoc.any-identifier="ISSN"
  • escidoc.any-identifier="123-234"

Please note remarks on normalization
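
One way to satisfy all five criteria is to index three terms per identifier: the type alone, the value alone, and both together. The sketch below illustrates that idea only. It does not use Lucene, the function names identifier_terms and matches are hypothetical, Python's fnmatch stands in for wildcard handling, and normalization is ignored.

```python
from fnmatch import fnmatchcase


def identifier_terms(id_type, id_value):
    """Index terms for one identifier: type alone, value alone, both together."""
    return {id_type, id_value, f"{id_type} {id_value}"}


def matches(query, terms):
    """True if the query (optionally containing * wildcards) matches any term."""
    return any(fnmatchcase(term, query) for term in terms)


terms = identifier_terms("ISSN", "123-234")
for query in ["ISSN 123-234", "ISSN 12*", "ISSN *", "ISSN", "123-234"]:
    print(query, matches(query, terms))
```

All five queries from the list above match the indexed terms, while a query such as "ISBN *" does not. Note that fnmatch also treats ? and [...] as pattern characters, which a real query parser may handle differently.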

Source (escidoc.any-source)

The following elements are indexed for each language separately and for all languages at once:

  • publication.source.dc:title
  • publication.source.dcterms:alternative
  • publication.source.source.dc:title
  • publication.source.source.dcterms:alternative


Components Properties (escidoc.component.<elementname>)

Additional indexes are required for the following component properties:

  • component.properties.content-category
  • component.properties.visibility
  • component.properties.file-name
  • component.properties.pid

The best option may be to index all component properties by default.

Future Development

release/submission comment

Via the administrative search it should be possible to search for the release/submission comment.

CompleteName Index

It should be possible to run an exact search for creators, such as "Kondic, Nicole". With the new framework it will probably be possible to define such indexes. We should then think of an index syntax for persons.

  • complete name index tokens: "persons.lastname, person.firstname"
  • complete name index tokens: "persons.lastname person.firstname"
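
The two token forms proposed above can be sketched as follows. The function name complete_name_tokens is hypothetical, and the sketch assumes that last name and first name are available as separate values.

```python
def complete_name_tokens(last_name, first_name):
    """Build both proposed complete-name tokens for one person."""
    return [
        f"{last_name}, {first_name}",  # "lastname, firstname"
        f"{last_name} {first_name}",   # "lastname firstname"
    ]


print(complete_name_tokens("Kondic", "Nicole"))
```

Indexing both forms would let an exact search succeed whether or not the user types the comma.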

Related Pages

PubMan Search & Export