ESciDoc Search Index Names

Introduction

This article describes the rules for creating search indexes on the resource properties and metadata stored in eSciDoc content resources.
Note: at present the specification covers the indexing rules for item and container resources only. The rules for automatically generated index names may also be used for other resources, but this should be specified separately if needed.

The rules define top-level indexing contexts derived from the XML representation of the eSciDoc resource. The names of these contexts should be used as the prefix for index names of indexed elements. The following top-level contexts are defined:

property
metadata (name of the context may change, depending on the metadata profile name)
- Important: only metadata records labeled "escidoc" should be indexed, unless specified additionally
component
struct-map

Automatically generated indexes

This section describes the rules for indexes that should always be created based on respective rules of the top-level contexts.

Indexing of properties

All properties (represented as elements or attributes in escidoc-resource.xml) are indexed.
Index names of regular properties are:

 property.<path-of-elements-and-attributes-under-properties-element>

Index names of content-model-specific properties (cmsprop) are:

 property.content-model-specific.<path-of-elements-and-attributes-under-cmsprop-element>

- Exception 1: even though property.objid is not explicitly stated in the property section of the resource XML, the index name should be

                *property.objid

- Exception 2: property indexes that include additionally referenced data

                *property.created-by.objid
                *property.created-by.name

- Example 1: standard property indexes

                *property.version.status
                *property.version.number
                *property.public-status
                *property.public-status-comment
                *property.context.objid
                *property.content-model.objid
                *property.lock-status
                *property.pid
                *property.creation-date

- Example 2: content-model-specific-property indexes

                *property.content-model-specific.local-tags.local-tag

Indexing of metadata

The top-level context for metadata index names has a prefix based on the top-level enclosing metadata element. At present, MPDL solutions have the following contexts (Note: these are not fixed and may be extended with new top-level metadata indexing contexts, depending on the application profile in use):
- publication
- face-item
- virr-toc
- virr-element
- FacesAlbum (Note: should be renamed to faces-album)

The index name should be <metadata-record-enclosing-element-name>.<path-of-elements-and-attributes-under-metadata-element>
Index names should not use namespace aliases (e.g. dc, dcterms, escidoc)

- Examples publication: generated indexes of the qualified path of all metadata

         * publication.title
         * publication.alternativeTitle
         * publication.genre
         * publication.source.title
         * publication.source.genre
         * publication.event.title
         * publication.source.source.genre
         * publication.source.source.title

- Examples face-item: generated indexes of the qualified path of all metadata

         * face-item.emotion
         * face-item.picture-group
         * face-item.identifier
         * face-item.age
         * face-item.age-group
         * face-item.gender
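
As a rough illustration of the two metadata rules (context prefix, no namespace alias), the following Python sketch builds an index name from a context name and a path of prefixed element names. The function names and the prefixed paths used in the example calls are assumptions made for illustration only.

```python
def strip_alias(qualified_name):
    """Return the local part of a prefixed name, e.g. 'dc:title' -> 'title'."""
    return qualified_name.split(":", 1)[-1]


def metadata_index_name(context, qualified_path):
    """Join the context name and the alias-free path steps with dots."""
    parts = [context] + [strip_alias(step) for step in qualified_path]
    return ".".join(parts)


# Illustrative paths; the prefixes are placeholders for whatever aliases
# the metadata record actually declares.
print(metadata_index_name("publication", ["escidoc:source", "dc:title"]))
print(metadata_index_name("face-item", ["emotion"]))
```

Names without a prefix pass through unchanged, so the same function works for profiles that do not use namespace aliases.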

Indexing of components

The top-level indexing context name is "component".
The same rules as for indexing of properties and indexing of metadata apply.

- Example 1: index names of component properties

        *component.file-name
        *component.description
        *component.visibility

- Example 2: index names of component metadata for "file metadata profile"

        *component.file.title
        *component.file.description
        *component.file.format
        *component.file.extent

Indexing of struct-map

The top-level indexing context name is "struct-map".
The struct-map indexing context should support the following indexes:

        *struct-map.item.objid
        *struct-map.item.title (derived index)
        *struct-map.container.objid
        *struct-map.container.title (derived index)

Compound and derived indexes

This section defines the naming rules for compound and derived indexes whose creation logic differs from the one specified in the Automatically generated indexes section.

Compound and derived indexes depend on user requirements and on the metadata profile, and are created explicitly.
When implementing a new index, we need to make sure that its index name is correct.
In future, we can expect to register XSLT transformations for derivation at the content model and/or metadata profile level.

Naming rules

To clarify the naming rules, the following general structure of an index name is given:

        <top-level-indexing-metadata-context-name>.<name-part1>...<name-partN>.<end-index-name>

where the number of <name-part> elements depends on the metadata structure.

Rules for derived indexes within a single top-level indexing metadata context:
- index names of indexes that are based on a single metadata profile (e.g. publication, virr-element, face-item) MUST start with the actual name of the metadata indexing context (e.g. must have "publication" as the value of <top-level-indexing-metadata-context-name>).
- indexes that index more than one metadata element on the same level in the metadata XML stream MUST have the string "compound" in the index name
- indexes that index more than one metadata element starting at a specific level (e.g. level 1) and using metadata at any level below it in the metadata XML stream MUST have the string "any" in the index name
- the string "compound" or "any" MUST always be placed directly in front of the <end-index-name> (i.e. as <name-partN> in the structure given above)
- the strings "compound" and "any" MUST NOT both be used in the same index name

- Example 1: Index of all titles of the publication on publication level (it indexes metadata publication.title, and publication.alternativeTitle)

        publication.compound.titles

- Example 2: Index of all titles of the publication at any level (it indexes metadata publication.title, publication.alternativeTitle, source.title, source.source.title, source.alternativeTitle, event.title, event.alternativeTitle)

        publication.any.titles

- Example 3: Index of all titles of the publication on source level (it indexes metadata source.title, source.alternativeTitle)

       publication.source.compound.titles

- Example 4: Index of all titles of the publication on source level and any level below (it indexes metadata source.title, source.alternativeTitle, source.source.title, source.source.alternativeTitle)

       publication.source.any.titles
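
The naming rules above lend themselves to a mechanical check. The following Python sketch is one possible reading of them, not an existing eSciDoc tool; the function name and the wording of the messages are assumptions.

```python
RESERVED = ("compound", "any")


def check_derived_index_name(name, context):
    """Return a list of rule violations; an empty list means the name conforms."""
    parts = name.split(".")
    problems = []
    if parts[0] != context:
        problems.append(f"name must start with the context '{context}'")
    if "compound" in parts and "any" in parts:
        problems.append('"compound" and "any" must not be combined')
    marker_slot = len(parts) - 2  # position directly before <end-index-name>
    if len(parts) < 3 or parts[marker_slot] not in RESERVED:
        problems.append('"compound" or "any" must directly precede the end index name')
    if any(p in RESERVED for i, p in enumerate(parts) if i != marker_slot):
        problems.append('"compound"/"any" used outside the slot before the end index name')
    return problems


for candidate in ("publication.compound.titles",
                  "publication.source.any.titles",
                  "publication.any.compound.titles"):
    print(candidate, check_derived_index_name(candidate, "publication"))
```

The four example names given above pass this check, while a name that mixes both strings or omits them is reported with one message per violated rule.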

Publication metadata context

Creator indexes

See also PubMan_Indexing#Persons_.28escidoc.any-persons.29
Stated: Creator.Person.CompleteName with Creator.CreatorType = "Person"

Automatically generated indexes are:

       publication.creator.person.{complete-name, family-name, given-name, identifier}
       publication.creator.person.organization.{organization-name, address, identifier}
       publication.source.creator.person.{complete-name, family-name, given-name, identifier}
       publication.source.creator.person.organization.{organization-name, address, identifier}
       publication.creator.organization.{organization-name, address, identifier}

Compound&Any indexes
- Note: for all complete-name-like compound/any indexes, please check Complete name indexing tokens

       1) publication.creator.person.compound.person-complete-name

1) Derived from the metadata publication.creator.person.family-name, publication.creator.person.given-name, sortkey field values sorted previously by order of entry i.e. position in the xml

       2) publication.source.any.person-complete-name

2) Derived from the metadata publication.source.creator.person.family-name, publication.source.creator.person.given-name, and any sub-sources that would exist, sortkey field values sorted previously by level (i.e. publication, source1, source1.2, source 1.3, source 2, source 2.1, source 2.2) and order of entry i.e. position in the xml

       3) publication.creator.compound.organization-path-identifiers

3) Derived from identifiers of the organizational unit referenced in the metadata (at publication creator level i.e. creator.organization, creator.person.organization) and all parents of that organizational unit, see Organization index examples

       4) publication.creator.any.organization-path-identifiers

4) Derived from identifiers of the organizational unit referenced in the metadata (at any creator level i.e. creator.organization, creator.person.organization, source.creator.organization, source.creator.person.organization, source.source.creator.person.organization, source.source.creator.organization etc.) and all parents of that organizational unit

       5) publication.compound.publication-creator-names

5) Derived from all creator names such as: publication.creator.person.family-name, publication.creator.person.given-name, publication.creator.organization.organization-name, sortkey field values sorted previously by order of entry i.e. position in the xml

       6) publication.any.publication-creator-names

6) Derived from all creator names at any level of publication or source (including embedded sources) such as: publication.creator.person.family-name, publication.creator.person.given-name, publication.creator.organization.organization-name, publication.source.creator.person.family-name, publication.source.creator.person.given-name, publication.source.creator.organization.organization-name, etc., sortkey field values sorted previously by level (i.e. publication, source1, source1.2, source 1.3, source 2, source 2.1, source 2.2) and order of entry i.e. position in the xml
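
For the organization-path-identifiers indexes (items 3 and 4 above), the derivation amounts to collecting each referenced organizational unit together with all of its parents. The Python sketch below shows this under simplifying assumptions: the hierarchy is given as a child-to-parent mapping with a single parent per unit, and all identifiers are made up for illustration.

```python
# Hypothetical hierarchy: organizational unit identifier -> parent identifier.
PARENT_OF = {
    "ou:dept-a": "ou:institute-1",
    "ou:institute-1": "ou:society",
    "ou:society": None,
}


def organization_path_identifiers(referenced_ids, parent_of):
    """Collect each referenced unit plus all its ancestors, without duplicates.

    Identifiers are returned in order of first appearance.
    """
    seen = []
    for unit in referenced_ids:
        current = unit
        # Stop at the root or as soon as a unit (and hence its ancestors)
        # has already been collected.
        while current is not None and current not in seen:
            seen.append(current)
            current = parent_of.get(current)
    return seen


print(organization_path_identifiers(["ou:dept-a"], PARENT_OF))
```

For the compound variant, referenced_ids would hold only the units referenced at publication creator level; for the any variant, it would also include those referenced by source creators at every level.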

Title indexes

Automatically generated indexes

           publication.title
           publication.alternative
           publication.source.title
           publication.source.alternative
           publication.event.title
           publication.event.alternative
           publication.source.source.title
           publication.source.source.alternative
           ...

Compound&Any indexes

        1) publication.compound.titles

1) Derived from publication.title, publication.alternative, sortkey field values as specified order of derivation

        2) publication.compound.topic

2) Derived from publication.title, publication.alternative, publication.tableOfContents, publication.abstract, publication.subject, sortkey field values as specified order of derivation

        3) publication.event.compound.title-place

3) Derived from publication.event.title, publication.event.alternative, publication.event.place, sortkey field values as specified order of derivation

        4) publication.source.any.title

4) Derived from publication.source.title, publication.source.alternative, and any embedded sources, sortkey field values as positioned in the xml (i.e. source1, source1.1, source1.2, source2 etc.)
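
The difference between a compound index (one level only) and an any index (that level and everything embedded below it) can be sketched in Python as follows. The nested-dictionary representation of a publication record and all function names are assumptions for illustration; real records are XML.

```python
# Hypothetical, simplified publication record.
PUBLICATION = {
    "title": ["Main title"],
    "alternative": ["Subtitle"],
    "source": [
        {
            "title": ["Journal"],
            "alternative": [],
            "source": [
                {"title": ["Series"], "alternative": ["Series alt"], "source": []},
            ],
        },
    ],
}

TITLE_FIELDS = ("title", "alternative")


def compound_titles(node):
    """Title values at this level only, in the order of derivation."""
    return [value for field in TITLE_FIELDS for value in node.get(field, [])]


def any_titles(node):
    """Title values at this level and in all embedded sources, depth-first."""
    values = compound_titles(node)
    for source in node.get("source", []):
        values.extend(any_titles(source))
    return values


def source_any_titles(publication):
    """Values for an index like publication.source.any.title."""
    return [v for source in publication.get("source", []) for v in any_titles(source)]


print(compound_titles(PUBLICATION))    # values for publication.compound.titles
print(source_any_titles(PUBLICATION))  # values for publication.source.any.title
```

The depth-first traversal keeps values in document order (source1, source1.1, source1.2, source2, and so on), which matches the ordering described for the sortkey field values.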

Date indexes

Compound&Any indexes

     1) publication.compound.dates

1) Derived from publication.created, publication.modified, publication.dateSubmitted, publication.dateAccepted, publication.published-online, publication.issued, sortkey field should be populated with logic specified at PubMan Sorting (Default system sorting)

     2) publication.compound.most-recent-date

2) Specification missing

Identifier indexes

Compound&Any indexes

     1) publication.any.identifier

1) Derived from publication.identifier, publication.source.identifier, publication.source.source.identifier (and any further embedded sources), See also Special requirements and remarks on normalization

Future development

Rules for derived indexes across several top-level indexing metadata contexts
- index names of indexes that are based on several metadata profiles (e.g. combining publication and virr-element) MUST always use the string "any" as <top-level-indexing-context-name> and can additionally have only an <end-index-name> in their structure. (Note: as this is future development, there is a high probability that this rule is not valid at present)

Examples:

                 any.creator-names
                 any.titles

These would be derived from the creator names in all metadata contexts, or in specified metadata contexts.

Possible use cases: find all resources of which Mr. X was the creator, find all resources with title A

Discussion/Comments

Faces

In my opinion, this is not relevant for Faces, because in Faces we only have very clear indexes (one per attribute).--Kristina 10:17, 6 February 2009 (UTC)

Ok, but even in this case we have clear naming convention for face indexes such as:

     *faces.emotion instead of escidoc.emotion
     *faces.age instead of escidoc.age

ViRR

Section moved to VIRR Development Page

General

Perhaps it makes sense to add escidoc to beginning of indexes, like:

*escidoc.publication.compound.creator
*escidoc.faces.album.creator

because then we could also query like

*escidoc.compound.creator

Possible use case: find all escidoc items of which Mr. X was the creator.

*escidoc.compound.title delivers all titles on the first level
 **(publication.title, publication.alternativetitle, virrelement title, etc. )

*escidoc.any.title
 **delivers all title elements in all escidoc items