ESciDoc Search Index Names

Introduction

This article describes the rules for creating search indexes for the resource properties and metadata stored in eSciDoc content resources.
Note: at present the specification covers the indexing rules for item and container resources. The rules for automatically generated index names may also be applied to other resources, but this should be specified separately if needed.

The rules define top-level indexing contexts derived from the XML representation of the eSciDoc resource. The name of each context should be used as the prefix for the index names of the elements indexed within it. The following top-level contexts are defined:

  • property
  • metadata (name of the context may change, depending on the metadata profile name)
    • Important: only metadata records labeled "escidoc" should be indexed, unless specified additionally
  • component
  • struct-map

Automatically generated indexes

This section describes the indexes that should always be created, following the rules of the respective top-level context.

NOTE: Due to the core-service set-up, all indexes have the additional prefix "escidoc" and all sort keys have the additional prefix "sort".

Indexing of properties

  • All properties (represented as elements or attributes in escidoc-resource.xml) are indexed
  • Index names of regular properties are
  property.<path-of-elements-and-attributes-under-properties-element>
  • Index names on content-model-specific-properties (cmsprop) are
  property.content-model-specific.<path-of-elements-and-attributes-under-cmsprop-element>


    • Exception 1: even though property.objid is not explicitly stated in the property section of the resource XML, the index name should be
                *property.objid
    • Exception 2: property indexes that include additionally referenced data
                *property.created-by.objid
                *property.created-by.name
    • Example 1: standard property indexes
                *property.version.status
                *property.version.number
                *property.public-status
                *property.public-status-comment
                *property.context.objid
                *property.content-model.objid
                *property.lock-status
                *property.pid
                *property.creation-date
    • Example 2: content-model-specific-property indexes
                *property.content-model-specific.local-tags.local-tag
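
The path rule for property indexes can be illustrated with a minimal sketch. It assumes a simplified, namespace-free properties fragment (the sample below is hypothetical, not a real eSciDoc record), and the function name `property_indexes` is introduced only for illustration. The two exceptions listed above (property.objid and the additionally referenced created-by data) are not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; real resources use namespaced elements.
SAMPLE = """<properties>
  <creation-date>2009-01-15T10:00:00Z</creation-date>
  <context objid="escidoc:1"/>
  <version>
    <number>3</number>
    <status>released</status>
  </version>
  <content-model-specific>
    <local-tags>
      <local-tag>thesis</local-tag>
    </local-tags>
  </content-model-specific>
</properties>"""


def property_indexes(properties):
    """Return (index name, value) pairs for everything under <properties>."""
    entries = []

    def walk(element, path):
        # Attributes are indexed under the path of their element.
        for attribute, value in element.attrib.items():
            entries.append((".".join(path + [attribute]), value))
        # Elements that carry text are indexed under their own path.
        text = (element.text or "").strip()
        if text:
            entries.append((".".join(path), text))
        for child in element:
            walk(child, path + [child.tag])

    # The top-level context name replaces the name of the root element.
    walk(properties, ["property"])
    return entries


for name, value in property_indexes(ET.fromstring(SAMPLE)):
    print(name, "=", value)
```

Because the content-model-specific element is simply part of the path, the same walk yields both regular and content-model-specific index names.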

Indexing of metadata

  • The top-level context for index names of metadata has a prefix based on the top-level enclosing metadata element. At present, MPDL (Max Planck Digital Library) solutions have the following contexts (Note: these are not fixed and may be extended with new top-level metadata indexing contexts, depending on the application profile in use):
    • publication
    • face-item
    • virr-toc
    • virr-element
    • FacesAlbum (Note: should be renamed to faces-album)
  • Index name should be

<metadata-record-enclosing-element-name>.<path-of-elements-and-attributes-under-metadata-element>

  • Index names should not use a namespace alias (e.g. dc, dcterms, escidoc)
    • Examples publication: generated indexes of the full path of all metadata
         * publication.title
         * publication.alternativeTitle
         * publication.genre
         * publication.source.title
         * publication.source.genre
         * publication.event.title
         * publication.source.source.genre
         * publication.source.source.title
    • Examples face-item: generated indexes of the full path of all metadata
         * face-item.emotion
         * face-item.picture-group
         * face-item.identifier
         * face-item.age
         * face-item.age-group
         * face-item.gender
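
As a minimal sketch of the metadata rules above, the following walks a metadata record, uses the name of the enclosing element as the context, and drops namespace information from every name. The sample record and the `http://example.org/publication` namespace URI are placeholders, and the helper names `local_name` and `metadata_index_names` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder record; the "pub" namespace URI is invented for this example.
SAMPLE = """<pub:publication xmlns:pub="http://example.org/publication"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Main title</dc:title>
  <pub:genre>article</pub:genre>
  <pub:source>
    <dc:title>Journal title</dc:title>
  </pub:source>
</pub:publication>"""


def local_name(qualified):
    """Drop the '{namespace-uri}' part that ElementTree puts before a name."""
    return qualified.rsplit("}", 1)[-1]


def metadata_index_names(record):
    """Return the index names generated for one metadata record."""
    names = []

    def walk(element, path):
        for attribute in element.attrib:
            names.append(".".join(path + [local_name(attribute)]))
        if (element.text or "").strip():
            names.append(".".join(path))
        for child in element:
            walk(child, path + [local_name(child.tag)])

    # The enclosing element name (e.g. "publication") is the context.
    walk(record, [local_name(record.tag)])
    return names


print(metadata_index_names(ET.fromstring(SAMPLE)))
```

ElementTree resolves aliases such as dc to their namespace URI, so stripping the URI part leaves only the local element name in the index name.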

Indexing of components

  • The top-level indexing context name is "component".
  • The same rules as for indexing of properties and indexing of metadata apply.
    • Example 1: index names of component properties
        *component.file-name
        *component.description
        *component.visibility
    • Example 2: index names of component metadata for "file metadata profile"
        *component.file.title
        *component.file.description
        *component.file.format
        *component.file.extent

Indexing of components content

  • Component content storage must be indexed as well, so that searches can distinguish between files and locators.
        *component.content.storage

Compound components index

  • As some searches need to match exactly one component of an item that has several components, a new compound index is introduced.
             *component.compound.properties


This index is derived by concatenating the values of the following component properties, in this order: visibility, storage-type, content-category, mime-type, valid-status, created-by, creation-date, checksum, PID, file-name.

This index is a bit cumbersome, but it allows for more exact search results.
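
The concatenation can be sketched as follows. The separator (a single space), the handling of missing properties (they are skipped) and the dictionary keys are assumptions; only the order of the properties is taken from the text above. The sample values are hypothetical.

```python
# Order of the component properties as listed in this article.
COMPOUND_ORDER = (
    "visibility",
    "storage-type",
    "content-category",
    "mime-type",
    "valid-status",
    "created-by",
    "creation-date",
    "checksum",
    "pid",
    "file-name",
)


def compound_properties(component):
    """Join the property values in the fixed order, skipping missing ones."""
    values = [component[name] for name in COMPOUND_ORDER if component.get(name)]
    return " ".join(values)


# Hypothetical component; dictionary order does not influence the result.
component = {
    "file-name": "thesis.pdf",
    "visibility": "public",
    "storage-type": "internal-managed",
    "content-category": "any-fulltext",
    "mime-type": "application/pdf",
}
print(compound_properties(component))
```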

Full text - content index

  • escidoc.fulltext

If the content of the component, i.e. the full text, cannot be indexed, indexing of the metadata continues, but the following is written to the index field:

   escidoc.fulltext=textfrompdffilenotextractable
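
A minimal sketch of this fallback is given below. The `ExtractionError` exception, the `build_index_fields` function and the extractor callable are hypothetical; only the index name and the placeholder value come from this article.

```python
FULLTEXT_INDEX = "escidoc.fulltext"
NOT_EXTRACTABLE = "textfrompdffilenotextractable"


class ExtractionError(Exception):
    """Raised by the (hypothetical) extractor when no text can be obtained."""


def build_index_fields(metadata_fields, extract_text):
    """Index the metadata in any case; fall back to the placeholder text."""
    fields = dict(metadata_fields)
    try:
        fields[FULLTEXT_INDEX] = extract_text()
    except ExtractionError:
        # Metadata indexing continues; only the full-text value is replaced.
        fields[FULLTEXT_INDEX] = NOT_EXTRACTABLE
    return fields


def failing_extractor():
    raise ExtractionError("text could not be extracted")


fields = build_index_fields(
    {"escidoc.publication.title": "Main title"}, failing_extractor
)
print(fields)
```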

Indexing of content-relations

This index refers to the "simple" type of content relations, i.e. those which are defined directly on the item and are delivered with the XML representation of the item.

        *escidoc.content-relation 

This index is derived as "<predicate> <target>"

  • to search for a predicate, use escidoc.content-relation=<predicate> (the real value is used in place of <predicate>)
  • to search for a target, use escidoc.content-relation=<target> (the real value is used in place of <target>)
  • to search for a combination, use escidoc.content-relation="<predicate> <target>" (phrase)
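
The derivation of the index value and the three search forms can be sketched as follows. The function names and the sample predicate and target are hypothetical, and escaping of special characters in the search terms is not handled here.

```python
INDEX = "escidoc.content-relation"


def index_value(predicate, target):
    """Value written to the index for one content relation."""
    return f"{predicate} {target}"


def query_for_term(term):
    """Search for a single predicate or a single target."""
    return f"{INDEX}={term}"


def query_for_pair(predicate, target):
    """Search for the combination as a phrase."""
    return f'{INDEX}="{predicate} {target}"'


# Hypothetical relation values, for illustration only.
predicate, target = "isRevisionOf", "escidoc:123"
print(index_value(predicate, target))
print(query_for_term(predicate))
print(query_for_term(target))
print(query_for_pair(predicate, target))
```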

Indexing of struct-map

  • The top-level indexing context name is "struct-map".
  • The struct-map indexing context should support the following indexes:
        *struct-map.item.objid
        *struct-map.item.title (derived index)
        *struct-map.container.objid
        *struct-map.container.title (derived index)

Compound and derived indexes

This section defines the naming rules for compound and derived indexes, which are created with a different logic than the one specified in the Automatically generated indexes section.

  • Compound and derived indexes depend on the user requirements and the metadata profile, and are created explicitly
  • Each new index must be given a correct index name when it is implemented
  • In future, XSLT transformations for derivation may be registered at content model and/or metadata profile level

Naming rules

To clarify the naming rules, the following general structure for an index name is given:

        <top-level-indexing-context-name>.<name-part1>...<name-partN>.<end-index-name>

where the number of <name-part> elements depends on the metadata structure.

  • Rules for derived indexes within a single top-level indexing metadata context
    • Names of indexes that are based on a single metadata profile (e.g. publication, virr-element, face-item) MUST start with the actual name of the metadata indexing context (e.g. must have "publication" as the value of <top-level-indexing-context-name>).
    • Indexes that index more than one metadata element on the same level of the metadata XML stream MUST have the string "compound" in the index name.
    • Indexes that index more than one metadata element, starting at a specific level (e.g. level 1) and using metadata at any level below it in the metadata XML stream, MUST have the string "any" in the index name.
    • The string "compound" or "any" must always be placed directly in front of <end-index-name> (i.e. as <name-partN> in the structure given above).
    • The strings "compound" and "any" cannot both be used in the same index name.


  • Example 1: Index of all titles of the publication on publication level (it indexes metadata publication.title and publication.alternativeTitle)
        publication.compound.titles
  • Example 2: Index of all titles of the publication on publication level and any level below (it indexes metadata publication.title, publication.alternativeTitle, source.title, source.source.title, source.alternativeTitle, event.title, event.alternativeTitle)
        publication.any.titles
  • Example 3: Index of all titles of the publication on source level (it indexes metadata source.title, source.alternativeTitle)
       publication.source.compound.titles
  • Example 4: Index of all titles of the publication on source level and any level below (it indexes metadata source.title, source.alternativeTitle, source.source.title, source.source.alternativeTitle)
       publication.source.any.titles
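
The naming rules above can be checked mechanically. The following is a minimal sketch under the assumption that an index name is a dot-separated string; the function name `check_index_name` and the wording of the messages are hypothetical, and a metadata element that is itself named "compound" or "any" is not considered.

```python
MARKERS = ("compound", "any")


def check_index_name(name, context):
    """Return a list of rule violations for a derived index name."""
    parts = name.split(".")
    problems = []
    # The name must start with the metadata indexing context.
    if parts[0] != context:
        problems.append(f"must start with '{context}'")
    # "compound" and "any" exclude each other.
    if all(marker in parts for marker in MARKERS):
        problems.append("'compound' and 'any' cannot be combined")
    # A marker must sit directly in front of the end index name.
    for position, part in enumerate(parts):
        if part in MARKERS and position != len(parts) - 2:
            problems.append(f"'{part}' must directly precede the end index name")
    return problems


for candidate in (
    "publication.compound.titles",
    "publication.source.any.titles",
    "publication.any.compound.titles",
):
    print(candidate, check_index_name(candidate, "publication"))
```

An empty list means that the name satisfies the rules for a single metadata context.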

Publication metadata context

Virr (Virtueller Raum Reichsrecht) metadata context

Faces metadata context

  • No need for compound indexes at present

Future development

  • Rules for derived indexes across several top-level indexing metadata contexts
    • Names of indexes that are based on several metadata profiles (e.g. combining publication and virr-element) MUST always use the string "any" as <top-level-indexing-context-name> and can additionally have only <end-index-name> in their structure. (Note: as this is future development, it is highly probable that this rule does not apply at present.)
  • Examples:
                 any.creator-names
                 any.titles

These indexes would be derived from the corresponding values (e.g. any creator names) in all metadata contexts, or in specified metadata contexts.

Possible use cases: find all resources of which Mr. X was a creator; find all resources with title A.