Difference between revisions of "PubMan Indexing"

From MPDLMediaWiki
Revision as of 14:56, 19 February 2008

This article lists the search requirements of the PubMan solution, focusing on the external-repository view. Please note that additional indexes are needed to fulfill administrative tasks on the repository.

Initially, the content of this page was copied from the file Systemspecification_Pubman.doc (FE_PM_04 Indexing item data), but the requirements will only be maintained on this page. Please use the discussion page to comment on the existing requirements or to submit new ones.

Required Indexes

All released items have to be retrievable via the search engine. Lucene should always index the last released version of an item. The search database should provide the following indexes:

Any Field

  • All elements defined in the functional Metadata set specification are indexed for the search.
  • In addition, the following attributes are indexed: <Id of item>, <PID of item>, <PID of file>
  • Files of any type are indexed, except those of type "correspondence" and "copyright transfer agreement".

Special requirements for "any-field" index

In the "any-field" index, identifiers should be indexed with "identifier-type" and "identifier-value" as two separate tokens, e.g.:

  • ISSN 2323-123123
  • ISBN 2323-23184738
  • URI http://mpdl.mpg.de/123

Users should be able to search for, e.g., ISSN, ISBN*, or ISBN 1234*.
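
The two-token requirement can be illustrated with a minimal Python sketch. It is not the PubMan or Lucene implementation: the function names (any_field_tokens, matches_any_field) are hypothetical, and the assumption that a multi-term query such as ISBN 1234* must match consecutive tokens is an interpretation of the requirement, not something stated in it.

```python
from fnmatch import fnmatchcase


def any_field_tokens(identifiers):
    """Index each identifier as two separate tokens: its type, then its value."""
    tokens = []
    for id_type, id_value in identifiers:
        tokens.extend([id_type, id_value])
    return tokens


def matches_any_field(tokens, query):
    """Return True if the query terms match consecutive tokens.

    A trailing * in a term acts as a wildcard (assumption: shell-style
    matching via fnmatchcase stands in for the search engine's wildcards).
    """
    terms = query.split()
    if not terms:
        return False
    n = len(terms)
    return any(
        all(fnmatchcase(tok, term) for tok, term in zip(tokens[i:i + n], terms))
        for i in range(len(tokens) - n + 1)
    )


tokens = any_field_tokens([
    ("ISSN", "2323-123123"),
    ("ISBN", "2323-23184738"),
    ("URI", "http://mpdl.mpg.de/123"),
])
print(matches_any_field(tokens, "ISSN"))        # type only
print(matches_any_field(tokens, "ISBN*"))       # wildcard on the type token
print(matches_any_field(tokens, "ISBN 2323*"))  # type plus value prefix
```

Because type and value are separate tokens, a search for the type alone, for a prefix of the type, or for the type followed by a value prefix can all be answered from the same index.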

Genre

  • Publication.Genre

Persons

  • all Creator.Person.CompleteName with Creator.CreatorType = “Person”

Organizations

  • all Organization.Name for each language separately and for all languages at once
  • Organization.PID and the PIDs of all organizations in the organizational path hierarchy up to the top-level organizations (i.e. the PIDs of the authors' affiliations and the PIDs of all parent organizations)
  • Example:
    • A PubItem "Title1" has authors:
      • Müller, affiliated to Max-Planck Institute for Psycholinguistics (PID2)
      • Müllerin, affiliated to Department 1 (PID4) of Max-Planck Institute for Plasmaphysics
    • If the organizational unit structure is as follows:
MaxPlanckSociety (PID1)
 |---Max-Planck Institute for Psycholinguistics (PID2)
 |---Max-Planck Institute for Plasmaphysics (PID5)
      |__Department 1 (PID4)
PlasmaphysicsSociety (PID6)
 |---Max-Planck Institute for Plasmaphysics (PID5)
      |__Department 1 (PID4)
 
    • Outcome:

The index on Organization.PID should contain the following values for the PubItem "Title1":

   PID2, PID1, PID5, PID4, PID6

These values are indexed even though the author affiliations in the descriptive metadata relate directly only to PID2 and PID4.
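
The example can be reproduced with a short Python sketch that walks from each affiliation up to the top-level organizations. The PARENTS mapping and the function name organization_pids are hypothetical and only encode the example hierarchy; how the repository actually stores parent relations is not specified here.

```python
# Parent relations of the example hierarchy (child PID -> parent PIDs).
# PID5 has two parents because the institute appears under both societies.
PARENTS = {
    "PID1": [],
    "PID2": ["PID1"],
    "PID4": ["PID5"],
    "PID5": ["PID1", "PID6"],
    "PID6": [],
}


def organization_pids(affiliation_pids, parents):
    """Collect the affiliation PIDs plus the PIDs of all their ancestors."""
    collected = set()
    stack = list(affiliation_pids)
    while stack:
        pid = stack.pop()
        if pid in collected:
            continue  # already visited via another path
        collected.add(pid)
        stack.extend(parents.get(pid, []))
    return collected


# PubItem "Title1": Müller is affiliated to PID2, Müllerin to PID4.
index_values = organization_pids(["PID2", "PID4"], PARENTS)
print(sorted(index_values))
```

The result is the same set of five PIDs as listed in the outcome; the order of values in the index is not significant in this sketch.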

Title

  • Publication.Title and Publication.AlternativeTitle for each language separately and for all languages at once
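
One way to read "for each language separately and for all languages at once" is one index field per language plus one combined field. The following Python sketch shows that reading; the field naming scheme (title.en, title.all) and the function name language_fields are assumptions made for illustration and are not taken from the specification.

```python
from collections import defaultdict


def language_fields(name, values):
    """Build per-language fields and one combined field.

    values is a list of (language_code, text) pairs. Every text is added to
    the field of its own language and to the hypothetical "<name>.all" field.
    """
    fields = defaultdict(list)
    for lang, text in values:
        fields[f"{name}.{lang}"].append(text)
        fields[f"{name}.all"].append(text)
    return dict(fields)


fields = language_fields("title", [
    ("en", "Plasma physics"),
    ("de", "Plasmaphysik"),
])
print(fields)
```

A search restricted to one language would then query only the matching per-language field, while a search across all languages would use the combined field.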

Topic

  • Publication.Title, Publication.AlternativeTitle, Publication.TableOfContents, Publication.Abstract and Publication.Subject for each language separately and for all languages at once

Dates

  • Publication.Date

Event

  • all Event.Title, Event.AlternativeTitle and Event.Place for each language separately and for all languages at once

Identifier

  • ID and PID of item, PID of files

Special requirements for indexing identifiers as "any-identifier" index

Identifier type and identifier value should be indexed together, so that users are able to find itemA with identifier type ISSN and identifier value 123-234 with all of the following "search by identifier" criteria:

  • ISSN:123-234
  • ISSN:12*
  • ISSN:*
  • ISSN 123-234
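
A minimal Python sketch of this behaviour, assuming each identifier is stored as one combined term of the form type:value and that a query written with a space is normalized to the same form. The function names (identifier_term, normalize_query, matches_identifier) are hypothetical, and shell-style matching via fnmatchcase only stands in for the search engine's wildcard handling.

```python
import re
from fnmatch import fnmatchcase


def identifier_term(id_type, id_value):
    """Index type and value together as a single term, e.g. ISSN:123-234."""
    return f"{id_type}:{id_value}"


def normalize_query(query):
    """Turn 'ISSN 123-234' or 'ISSN:123-234' into the pattern 'ISSN:123-234'.

    A query that gives only the type (e.g. 'ISSN') matches any value.
    """
    parts = re.split(r"[:\s]+", query.strip(), maxsplit=1)
    id_type = parts[0]
    pattern = parts[1] if len(parts) > 1 and parts[1] else "*"
    return f"{id_type}:{pattern}"


def matches_identifier(term, query):
    """Return True if the indexed term satisfies the identifier query."""
    return fnmatchcase(term, normalize_query(query))


term = identifier_term("ISSN", "123-234")  # identifier of itemA
for query in ["ISSN:123-234", "ISSN:12*", "ISSN:*", "ISSN 123-234"]:
    print(query, matches_identifier(term, query))
```

Under these assumptions all four criteria find itemA, while a query for another type or a non-matching prefix does not.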

Source

  • all Source.Title and Source.AlternativeTitle for each language separately and for all languages at once

Components

We also want to index some properties of the components for the search, such as:

  • content-category
  • visibility
  • component-name
  • pid (of component)

The best option may be to index all component properties by default.