Talk:PubMan Indexing

=Discussion about search requirements for framework release 1.0 =

Questions from Michael, which arrived at the MPDL team on 8 February and were discussed further on this wiki page

Rules for sorting the search result
You wanted to send me the rules for sorting the search result


 * 1) The functional specification regarding sorting is available on PubMan Sorting --Inga 17:56, 19 February 2008 (CET)
 * 2) For the technical implementation, please check: http://colab.mpdl.mpg.de/mediawiki/images/b/bf/PubItemVOComparator.java --Natasa 15:40, 13 February 2008 (CET)

Discussion:


 * PubItemVOComparator.java uses .compareToIgnoreCase. This sorts special characters like ä, ö, ü at the end. I tried the java.text.Collator class, which sorts special characters correctly. I propose using the Collator class. --Michael.hoppe 13:04, 14 February 2008 (CET)


 * Thanks! The involvement of special characters is actually part of the functional specification --Inga 20:09, 19 February 2008 (CET)


 * OK, Michael, if you think the Collator class provides better results, it is fine for us. But, :) one question: does this class actually compare based on the "locale" value for the language? We have a mixture of English, German and probably other languages (i.e. the next candidate is French). How will it behave in this case, and how does it affect the sorting when we use the fuzzy search (as this would probably be the most common case, i.e. people rarely specify the language in which they search)? --Natasa 11:29, 14 February 2008 (CET)


 * I am not sure what the locale is used for in the Collator class. I tried sorting German umlauts, French apostrophes etc. with locale German and locale English and couldn't see any differences. --Michael.hoppe 13:04, 14 February 2008 (CET)
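
The difference between the two sorting approaches discussed above can be sketched as follows. This is only an illustration, not the code of PubItemVOComparator.java: the class name `TitleSortSketch` and the sample titles are made up. Whether the chosen locale changes the order depends on the collation rules the JDK ships for that locale.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PubMan: contrasts the two sort orders.
final class TitleSortSketch {

    // Same ordering as String.compareToIgnoreCase: compares code units after case folding.
    static List<String> sortIgnoreCase(List<String> titles) {
        List<String> copy = new ArrayList<>(titles);
        copy.sort(String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    // Ordering defined by the collation rules the JDK provides for the given locale.
    static List<String> sortWithCollator(List<String> titles, Locale locale) {
        List<String> copy = new ArrayList<>(titles);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00C4pfel" is "Äpfel", escaped so the file compiles with any source encoding.
        List<String> titles = Arrays.asList("Zebra", "\u00C4pfel", "Apfel");
        System.out.println(sortIgnoreCase(titles));                  // umlaut title sorted last
        System.out.println(sortWithCollator(titles, Locale.GERMAN)); // umlaut title next to "Apfel"
    }
}
```

Since Collator implements Comparator, it can be passed wherever the existing comparator is used.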


 * Other remark: When using a custom comparator class with Lucene, the custom class doesn't need untokenized index fields for sorting. We therefore no longer need separate indexes for sorting and can delete the sort.-fields (e.g. sort.title). You would have to use the escidoc.-fields (e.g. escidoc.title) as sort criterion. Example of CQL: escidoc.metadata=escidoc* sortKeys=escidoc.title --Michael.hoppe 07:36, 14 February 2008 (CET)


 * Using the index field names as sortKeys seems to be easier and more comprehensible. It will actually not be that big a problem for us to change the sorting logic. --Natasa 11:29, 14 February 2008 (CET)

Indexes
You have a page where you describe the indexes you want to have for search (PubMan Indexing). Is this page complete? Can I delete all the other fields that we currently index separately (properties of item/container, attributes of elements ...)?


 * This page is to my understanding complete. --Natasa 15:55, 13 February 2008 (CET)


 * Please note that this page only describes the search options currently provided by the PubMan user interface. In addition, we have the requirement to support "expert searches" which could be implemented by specific CQL statements, e.g. to restrict the search to . Therefore, I would like to keep separate indexes for metadata elements and selected item properties. Should I provide a list somewhere? --Inga 21:00, 19 February 2008 (CET)


 * Yes, please provide a list. Currently we index all metadata elements as separate indexes, but no properties! --Michael.hoppe 09:20, 20 February 2008 (CET)

Index escidoc.metadata
We then have the index  instead of escidoc.metadata


 * The index escidoc.metadata is fine as a name, but the logic of what is indexed in this index is a bit wrong. Please do not change the name of the index, as we use it in our business logic for queries. The colab page is the functional specification, and you provided the index names. --Natasa 15:55, 13 February 2008 (CET)


 * OK, i will change the logic of what is indexed in the escidoc.metadata index. --Michael.hoppe 07:57, 14 February 2008 (CET)


 * Regarding index names: Could we at least point from this page to the names of the indexes built to provide a certain functionality? I started to add the index names, but wasn't 100% successful so far. Michael, could you complete them? --Inga 21:00, 19 February 2008 (CET)


 * OK, i completed the list --Michael.hoppe 09:22, 20 February 2008 (CET)


 * There is a difference between the particular index fields on this page and the any-field index on this page: the any-field index still contains ONLY the descriptive metadata of item, container and component (i.e. what is in the respective metadata records) plus the respective IDs and PIDs, and NOT other PROPERTIES (e.g. the context identifier; if needed, these will be separate index fields such as escidoc.item.context etc.) --Natasa 15:55, 13 February 2008 (CET)


 * OK, I will implement it according to the page. --Michael.hoppe 07:57, 14 February 2008 (CET)

Index any.*
Can we remove the  prefix from the index names (escidoc. any-genre gets escidoc.genre)?


 * Please see my answer about renaming index names above. Why would you now rename the indexes? You have given the names for the indexes, and it was previously not clear to us what the logic was. Does it have something to do with repeatable metadata? For a start, not renaming the indexes would be fine. --Natasa 16:48, 13 February 2008 (CET)


 * OK, i will leave the names as they are. --Michael.hoppe 07:57, 14 February 2008 (CET)


 * Hm! The SRU interface currently lists many indexes which are hard to map to the respective elements because we decided not to keep the complete path. What do you think about a dual strategy: Keep the names for all indexes already used (see index names in article), provide new (longer) names for additional indexes (see my remark above) --Inga 21:00, 19 February 2008 (CET)


 * In the article you only refer to the any-indexes which are customized for you. You don't refer to any of the indexes that are named after the element names. The only exception currently is escidoc.organization-name; you can use escidoc.any-organizations instead. Then I could change the index names to longer names (e.g. escidoc.publication.creator.person.complete-name instead of escidoc.complete-name) if you want. --Michael.hoppe 09:29, 20 February 2008 (CET)

Index organization.name
You want to search e.g. all e for each language separately and for all languages at once
 * If you only want to search English organization-names, use the escidoc_en database
 * If you want to search all language organization-names, use the escidoc_all database


 * At the last developer workshop we agreed to have one search database, which is aware of language-specific stemming if the metadata has xml-lang associated with it, and to change the logic to "fuzzy" search logic. Are the two questions above still relevant then? --Natasa 16:48, 13 February 2008 (CET)


 * We cannot do stemming for one index in more than one language. Therefore we decided at the last workshop that we can do fuzzy search instead. So, as agreed in the workshop, I added stopwords to the escidoc_all database. You can do fuzzy search in CQL, e.g.: escidoc.metadata=/fuzzy dokument --Michael.hoppe 07:57, 14 February 2008 (CET)


 * Does that mean that eSciDoc will continue to have various databases, so that something like (escidoc_all.metadata=plankton OR escidoc_de.title=fischfutter) won't be possible? The user may want to specify the language only for a specific part of the search request, e.g. to limit the result. I would suggest forgetting about the language-specific databases. --Inga 21:00, 19 February 2008 (CET)


 * I agree, the concept of different language-specific databases did not prove itself. I suggest using only the language-independent database and doing fuzzy search. --Michael.hoppe 09:34, 20 February 2008 (CET)

Index any-organization-pids
Can we remove the  index?


 * Absolutely not! This index is very important for browsing: searching with the organization PID of MPG should return items that are related directly to a child organizational unit of MPG, even if the item is not related directly to the MPG in the metadata! This is an important issue; the index should provide the path-list of IDs. --Natasa 16:48, 13 February 2008 (CET)


 * Michael.hoppe 07:57, 14 February 2008 (CET) Then maybe you should add it to the http://colab.mpdl.mpg.de/mediawiki/PubMan_Indexing page


 * Added. Hopefully it is clearer now. Please let us know if there are other questions regarding this issue. --Natasa 11:12, 14 February 2008 (CET)


 * Split into two elements suggested, see PubMan_Indexing --Inga 21:00, 19 February 2008 (CET)

Indexes escidoc.component.*
We create new indexes for component-properties with name


 * For all metadata in the metadata record of the component. This record as agreed with your colleagues (pls. check with Rozita) will be provided by us with the component --Natasa 16:48, 13 February 2008 (CET)


 * OK. --Michael.hoppe 07:58, 14 February 2008 (CET)


 * I have no clue what the metadata record of a component is. Has this something to do with containers? Should we document this index somewhere? --Inga 21:00, 19 February 2008 (CET)


 * I also have no clue, but I don't really mind. It is up to you what metadata records you put into the component. The indexer will index all elements of all md-records. For now as escidoc.component., but if you want I can change that to a longer name (escidoc.component.. --Michael.hoppe 09:37, 20 February 2008 (CET)
 * The metadata record of a component holds the metadata of the file itself (a component is a file associated with an item). That means the metadata record of a component contains the file name, file size, content-category and anything else relevant for the specific file. --Natasa 09:29, 25 February 2008 (CET)

Index identifier
We write all pids and ids (as described on the PubMan_Indexing page) in the index  and additionally the pid of the file as  ?


 * Split into two elements suggested, see PubMan Indexing Internal Identifier --Inga 21:00, 19 February 2008 (CET)

Index escidoc.objecttype
Proposal: I think we should have an index  where we can distinguish between item and container


 * Sounds reasonable with respect to reusability --Tom 16:49, 13 February 2008 (CET)


 * Would like to ask here, is this index selective enough? (all objects are items or containers) --Natasa 17:15, 13 February 2008 (CET)


 * What do you mean? You can restrict your search to containers only eg escidoc.metadata=whatever and escidoc.objecttype=container. -- Michael.hoppe 08:02, 14 February 2008 (CET)


 * Another issue: can we make an index for the content-model-name (and not only for the content-model-id)? Does it make sense? --Natasa 17:15, 13 February 2008 (CET)


 * Would make sense. Slows down the index-generation a bit. -- Michael.hoppe 08:13, 14 February 2008 (CET)


 * Does it slow it down dramatically? I thought that in item.xml all referenced objects are present with the ID and the title of the object (I think this is named externaltitle or something similar), so one does not need to make an extra query for the content model. Is this actually the case? --Natasa 11:20, 14 February 2008 (CET)


 * No, this is only partly the case. When retrieving the object with REST, we get an attribute xlink:title containing the name. But when requesting it with SOAP, we don't get the attribute xlink:title. And the indexer works with SOAP! So the indexer would have to retrieve the content-model.xml to get the name. I have to test how long this request takes. Hopefully below one second ;-) -- Michael.hoppe 16:26, 14 February 2008 (CET)


 * Should we document this index somewhere? --Inga 21:00, 19 February 2008 (CET)

Searching for ISSN/ISBN
How do you put the ISSN in the metadata?

Option a: As dc.identifier like URI urn:ISSN:1361-3200 ? Then i could just put it in the index like that and you can search for it (urn:ISSN* or urn:ISSN:13*)

Option b: Or do you put the ISSN like that: 1361-3200

Then I could write it in the index as ISSN:1361-3200 and you can search for it (ISSN* or ISSN:13*)

 * We put it in the metadata in the following manner --Natasa 17:15, 13 February 2008 (CET)
 0028-0836 978-3-499-13467-8


 * OK, then i could take the xsi:type together with the value and index it as ISBN:asdasdasd. Is this OK for you? -- Michael.hoppe 08:13, 14 February 2008 (CET)


 * The problem is that if we index it as e.g. ISBN:asdasdasd we will not be able to find "asdasdasd" alone. That is why we wanted to index it as two word tokens such as "ISBN asdasdasd". Note: we are aware that this leads to inexact results, but users sometimes search only for an identifier value and do not know the exact type of the identifier.


 * I checked what happens if we index it as 2 tokens: we would then have to search for it as a phrase (in double quotes). Unfortunately, Lucene doesn't support wildcards in phrase queries, so we wouldn't be able to search for "ISSN a*". What if we index it as two tokens: one with just the value of the identifier (asdasdasd) and the other one as : (ISSN:asdasdasd)? -- Michael.hoppe 16:18, 14 February 2008 (CET)


 * When we deal with the any-field index we need something like "ISSN" and "asdasd" to be able to search for e.g. ISSN or ISBN* or ISBN 1234* in the any-field index. When we deal with the any-identifier index we need the variants to search for ISSN:12* or ISSN:123123123 or ISSN*. Does this clarify things a bit? I am not certain I can provide an answer in real implementation terms :) --Natasa 17:51, 14 February 2008 (CET)


 * Regarding wildcards in phrase search: we should probably document this limitation somewhere. The current specification "Rest_api_doc_SB_Search.pdf" seems to refer only to CQL, which lists this use case, see -> masking. Do we automatically right-truncate phrases? --Inga 21:19, 19 February 2008 (CET)


 * Yes, we definitely should document this limitation. No, we do no automatic right-truncation of search words. --Michael.hoppe 09:46, 20 February 2008 (CET)
 * Please note that we are not talking about a "phrase search" when we talk about the identifier index. If the requirement were for a phrase search, that would have been stated explicitly. In the examples I used "ISSN" "asdasd" or "ISSN asdasd" not as phrases, but to delimit the search criteria from other text. Maybe this caused the confusion. --Natasa 09:32, 25 February 2008 (CET)

Normalization of ISSN/ISBN
ISSNs and ISBNs may be specified with or without hyphens or blanks. Therefore, a normalization would be required, e.g. by deleting all hyphens and blanks before the string is indexed. In addition, we might consider ISBN-10 to ISBN-13 conversion --Inga 16:50, 19 February 2008 (CET)

Question: Is it also possible to translate a cql query on the fly? It would be appreciated if the cql statement "escidoc.issn=0028-0836" would return the same result as "escidoc.issn=00280836". --Inga 16:50, 19 February 2008 (CET)


 * Normalization should be done not only while indexing but also while searching. So the best place to do this is the analyzer, which runs both while indexing and while searching. But we may only do this normalization for identifiers, so the analyzer has to decide on the normalization depending on the index name. So we could do this for the index any-identifier but not for the index escidoc.metadata, which also contains the identifiers. And then all identifiers would get normalized, not only ISSN and ISBN. --Michael.hoppe 09:58, 20 February 2008 (CET)


 * Different behaviors in escidoc:metadata and escidoc:identifier is probably confusing --Inga 12:32, 21 February 2008 (CET)
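
A minimal sketch of the normalization discussed above, assuming it simply strips hyphens and blanks and, for ISBN-10, prefixes 978 and recomputes the check digit. The class and method names are hypothetical, and the rule "normalize only indexes whose name ends in identifier" is an assumption made for illustration, not the analyzer's actual logic.

```java
import java.util.Locale;

// Hypothetical helper, not part of the eSciDoc indexer.
final class IdentifierNormalizerSketch {

    // Removes hyphens and whitespace; upper-cases so a check character 'x' becomes 'X'.
    static String normalize(String identifier) {
        return identifier.replaceAll("[\\s-]+", "").toUpperCase(Locale.ROOT);
    }

    // Assumed rule: only identifier indexes are normalized, free-text indexes stay untouched.
    static String normalizeForIndex(String indexName, String term) {
        return indexName.endsWith("identifier") ? normalize(term) : term;
    }

    // Converts an ISBN-10 to ISBN-13: prefix 978, drop the old check digit, compute the new one.
    static String isbn10To13(String isbn10) {
        String digits = normalize(isbn10);
        if (!digits.matches("\\d{9}[\\dX]")) {
            throw new IllegalArgumentException("not an ISBN-10: " + isbn10);
        }
        String body = "978" + digits.substring(0, 9);
        int sum = 0;
        for (int i = 0; i < body.length(); i++) {
            int digit = body.charAt(i) - '0';
            // ISBN-13 weights alternate 1, 3, 1, 3, ...
            sum += (i % 2 == 0) ? digit : 3 * digit;
        }
        int check = (10 - sum % 10) % 10;
        return body + check;
    }

    public static void main(String[] args) {
        System.out.println(normalize("0028-0836"));
        System.out.println(isbn10To13("3-499-13467-5"));
    }
}
```

Applying the same function to the indexed value and to the search term is what would make "escidoc.issn=0028-0836" and "escidoc.issn=00280836" return the same result.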


 * I am not sure why you want different behavior of identifiers in any-index and any-identifier index. When we can't use wildcards in phrase-queries, we always should index identifiers as ISSN:dadads and not as ISSN ddads, because we cannot search for ISSN da* as phrase and then might find objects that have ISSN=32143 and ISBN=dader. But we could search for ISSN:da*! We additionally could index the value (dadads) then you also would be able to find only dadads.
 * Your cql-queries then would be:
 * --Michael.hoppe 10:34, 20 February 2008 (CET)
 * --Michael.hoppe 10:34, 20 February 2008 (CET)
 * --Michael.hoppe 10:34, 20 February 2008 (CET)
 * --Michael.hoppe 10:34, 20 February 2008 (CET)
 * --Michael.hoppe 10:34, 20 February 2008 (CET)


 * My ambition was to harmonize the article, which used "ISSN 123-234" for the any-index and "ISSN:123-234" for any-identifier before I started the revision. I agree with your argument for colons, but could you please explain why the 4th example would work? Anyway, I would suggest changing the article accordingly --Inga 12:32, 21 February 2008 (CET)


 * Inga, you are right, the 4th example wouldn't work. So we have to index ISSN:zwtzrz and ISSN and zwtzrz, then all examples will work --Michael.hoppe 10:05, 22 February 2008 (CET)
 * That was actually the requirement. Hopefully it is clearer now? --Natasa 09:34, 25 February 2008 (CET)
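
The three-token scheme agreed on above (ISSN:value, ISSN, value) can be sketched as follows. The class is hypothetical and `matches` is only a simplified stand-in for a single-term query with an optional trailing wildcard; it is not Lucene code and ignores analysis steps such as lower-casing.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the index tokens written for one identifier.
final class IdentifierTokensSketch {

    // Returns TYPE:value, TYPE and the bare value, as proposed in the discussion.
    static List<String> tokensFor(String type, String value) {
        return Arrays.asList(type + ":" + value, type, value);
    }

    // Simplified single-term match: exact token, or prefix match if the term ends with '*'.
    static boolean matches(List<String> tokens, String term) {
        if (term.endsWith("*")) {
            String prefix = term.substring(0, term.length() - 1);
            return tokens.stream().anyMatch(token -> token.startsWith(prefix));
        }
        return tokens.contains(term);
    }
}
```

With these tokens, single-term searches such as ISSN:00*, ISSN* or the bare value all hit without needing a phrase query.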

All metadata records or only escidoc metadata records?
Do you want me to provide a generic lucene-index that contains data from all MD-records that you have in the items/container and not only from the internal-one with attribute name=escidoc?


 * I think there is no more "internal-one" metadata record. Depending on the content-model we use different metadata profiles. Current indexes only deal with pubman metadata profile (unfortunately named as escidoc index). How do you know which metadata record is internal? --Natasa 17:15, 13 February 2008 (CET)


 * The 'internal' md-record for me is the md-record with attribute name = 'escidoc'. The profile of this md-record doesn't really matter to me, as long as I can find the elements that have to get indexed. E.g. for the index any-title, the md-record has to have the elements title and alternative. --Michael.hoppe 08:13, 14 February 2008 (CET)


 * We have not tried putting more than a single metadata record for an item/container (that is something we will do next month). How will then indexing be done? Probably we need to focus on single metadata record for indexing (in whatever profile it is) - as long as you have the information that this is the "default" one. --Natasa 17:15, 13 February 2008 (CET)


 * See above: default md-record is the md-record with attribute name='escidoc' --Michael.hoppe 08:13, 14 February 2008 (CET)

All elements from all records?
Should this Lucene-Index contain the data from all md-elements of all md-records?


 * I would dare to state for now: NO! We will not maintain many metadata records at the same time. The most probable use case for the existence of several metadata records would be: we ingest some data and keep the original metadata as a metadata record, but from then on we work only on the "solution-supported-metadata-profile", which can differ from the originally ingested one. Therefore we probably would not index the original one for searching. --Natasa 17:18, 13 February 2008 (CET)

Naming of this index? Creation of new indexes?
How should we name the indexes?


 * I think the names are not the problem. We maybe need to talk a bit about how to create a new index easily, one that is not only a single-metadata index but sometimes a compound one (as in the case of the any-title index). --Natasa 17:21, 13 February 2008 (CET)


 * For now, we do not have an easy method. We have to change the stylesheet that extracts the data out of the item/container.xml and writes the index-information-document. Then we have to recreate the index-database. (I am currently developing an 'admin-tool' that can do the recreation).-- Michael.hoppe 08:17, 14 February 2008 (CET)


 * "I am currently developing an 'admin-tool' that can do the recreation" - Michael, that is great news! --Natasa 11:20, 14 February 2008 (CET)

=Clarification on Organization index=

On Indexing of organization PIDS

It is important to understand some constraints and limitations regarding the indexing of the organizational unit PIDs (path-list) with the PubItem.


 * Q: what happens when the organizational structure is changed i.e. when the organizational unit is assigned with a new parent? Should all corresponding PubItem indexes be updated?
 * A: according to the organizational unit life-cycle, units have a status "new", "opened" or "closed".
 * when in status "new": the organizational unit cannot be associated with a PubItem, because it has not yet been made "official". In this stage any re-parenting (i.e. assigning new parents, removing old parents) can take place
 * when in status "opened": the organizational unit can be associated with a PubItem; it is "official". In this stage re-parenting cannot take place
 * when in status "closed": the organizational unit can be associated with a PubItem but cannot be associated with new children or parents, i.e. no changes are allowed (it can, however, be associated with other OrgUnits via "successor" and "predecessor" relations).
 * According to these settings, there will be no need to re-index PubItem indexes in case of re-parenting. The background of the idea is that re-parenting an org unit in status "opened" should not be allowed: a real change of the organizational structure simply yields a new organizational unit (even if it has the same name), and the old unit should be related to the newly created one via "successor" or "predecessor". Logically, the old organizational unit has been changed and no longer exists "officially" in its previous context.
 * Q: what happens with successor/predecessor relations when searching?
 * A: The system should be made "smart" enough to inform a user who is searching for a specific organizational unit that the unit has a "successor" or a "predecessor", and to ask whether she would also like to search in that organizational unit. Thus the scalability problem can be substantially reduced, as we will not re-index already existing PubItems, and successor/predecessor relations and changes of the organizational unit structure are not frequent.
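
The life-cycle rules in the answer above can be written down as a small state table. This is only a sketch with hypothetical names; it encodes the statements from the list and shows why no re-indexing is needed: no status allows both linking to a PubItem and re-parenting.

```java
// Hypothetical enum mirroring the "new", "opened", "closed" rules described above.
enum OrgUnitStatusSketch {
    NEW(false, true),     // not yet official: no PubItems, re-parenting allowed
    OPENED(true, false),  // official: PubItems allowed, re-parenting not allowed
    CLOSED(true, false);  // PubItems allowed, no structural changes at all

    private final boolean linkableToPubItem;
    private final boolean reparentingAllowed;

    OrgUnitStatusSketch(boolean linkableToPubItem, boolean reparentingAllowed) {
        this.linkableToPubItem = linkableToPubItem;
        this.reparentingAllowed = reparentingAllowed;
    }

    boolean canBeLinkedToPubItem() {
        return linkableToPubItem;
    }

    boolean canBeReparented() {
        return reparentingAllowed;
    }

    // Re-indexing PubItems would only be needed if some status allowed both operations.
    static boolean reindexEverNeeded() {
        for (OrgUnitStatusSketch status : values()) {
            if (status.linkableToPubItem && status.reparentingAllowed) {
                return true;
            }
        }
        return false;
    }
}
```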

Question & Remark: The fact that an open organizational unit cannot be re-parented is not in sync with the former usage scenarios, is it? Please note that I support any measures which simplify the OU management. But not enabling re-parenting only shifts the problem, because it will force the administrative users to create additional OUs with an "isSuccessorOf" relation to an existing OU.
 * The real issue here is why re-parenting should be allowed when an organizational unit is already "opened". Re-parenting means something has changed in the organizational structure, so from that moment onwards it is actually a new organizational unit. This simplifies a lot. --Natasa 09:39, 25 February 2008 (CET)

To my understanding, indexing of the "Parent-path-list" is only a workaround solution (right?). What we probably require is a fast relation service which returns all objects related [in a specific type] to an object and can be called recursively. This would give solutions the option to retrieve the IDs of all predecessors/successors/children/parents and build the search query on the fly, if the user chooses so. --Inga 19:04, 19 February 2008 (CET)
 * Indexing of the "Parent-path-list" is not necessarily a workaround solution. It is becoming important for fast retrieval of results. I doubt that there is a better alternative than making long queries with "OR/AND". Otherwise we should be more modest with our requirements, i.e. searching for "MPDL" returns only results for "MPDL" and not for the MPDL children as well (returning the children as well was exactly the reason for the extra parent-path-list index!). --Natasa 09:39, 25 February 2008 (CET)

=Search index names&logic synchronization=

In Progress

Setting-up the rules
Shall talk about the
 * indexing context (publication | virrelement | faces)
 * composition (derivation)
 * clear name of which metadata is used for indexing

Automatic index names
As an improvement for this requirement, we would like to set-up a rule for indexes. The rule should be the following:

1) Make automatic index of the qualified path of all metadata/attributes such as:

* publication.title
* publication.alternativeTitle
* publication.genre
* publication.source.title
* publication.source.genre
* publication.event.title

Where qualified means that for item-level structures partly the path is ignored (possible?, makes sense?) e.g.:


 * components element is ignored
 * therefore instead of having index of components.component.file-name we have index such as component.file-name
 * for content-model-specific properties full path is to be taken as we can not know in advance the structure in it e.g.:
 * content-model-specific.local-tags.local-tag

We do not yet have the following structure in the metadata, but if we specify a source within a source it would look like:

* publication.source.source.genre
* publication.source.source.title
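
The naming rule for automatic indexes could be sketched like this: join the element path with dots and skip wrapper elements that are to be ignored. The class name is hypothetical, and the ignore list contains only "components" because that is the only element named in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper deriving an automatic index name from an element path.
final class IndexNameSketch {

    // Wrapper elements that are dropped from the qualified path (assumption: only "components").
    private static final List<String> IGNORED = Arrays.asList("components");

    static String indexName(String... pathElements) {
        List<String> kept = new ArrayList<>();
        for (String element : pathElements) {
            if (!IGNORED.contains(element)) {
                kept.add(element);
            }
        }
        return String.join(".", kept);
    }
}
```

For content-model-specific properties nothing is on the ignore list, so the full path is kept.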

Compound/Derived indexes
Compound index names depend on the metadata set and are created on request. We need to make sure that we have correct index names when requesting a new index (in future we can expect to register XSLT transformations for derivation on content model level):

* publication.compound.publication-titles
** index of all titles of the publication on publication level (title, alternativeTitle)


 * IMO it is enough to name it publication.compound.title --Friederike 10:58, 11 February 2009 (UTC)

* publication.compound.any-titles
** index of all titles of the publication on any level (publication.title, publication.alternativeTitle, source.title, source.source.title, source.alternativeTitle, event.title, event.alternativeTitle)


 * Add new option ANY (compound | any | void)
 * publication.any.title
 * To make it a rule one could say: compound refers to all objects on the same level; any refers to all objects on the same level and below --Friederike 10:58, 11 February 2009 (UTC)

* publication.compound.source-titles
** index of all titles of the publication on source level (source.title, source.alternativeTitle; if desired also source.source.title, source.source.alternativeTitle, which is not necessary now)


 * Respectively: publication.source.compound.title --Friederike 10:58, 11 February 2009 (UTC)

The index publication.source.any.title would also deliver the (not yet possible) titles of a source's source.

The option compound or any always refers to the object in front of it.

Therefore publication.compound.title gives title & alternativeTitle, and publication.source.compound.title gives source.title & source.alternativeTitle.
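
The proposed rule (compound = titles on the object's own level, any = own level and everything below) can be sketched with a tiny tree model. The `MdObject` class is a made-up stand-in for a publication, source or event and is not the actual metadata model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "compound" versus "any" rule.
final class CompoundAnySketch {

    // Stand-in for a metadata object: its own titles plus nested objects (source, event, ...).
    static final class MdObject {
        final List<String> titles;     // values of title and alternativeTitle
        final List<MdObject> children; // e.g. source, source.source, event

        MdObject(List<String> titles, MdObject... children) {
            this.titles = titles;
            this.children = Arrays.asList(children);
        }
    }

    // "compound": all titles on the level of the object in front of the keyword.
    static List<String> compoundTitles(MdObject object) {
        return new ArrayList<>(object.titles);
    }

    // "any": titles on the object's own level and on every level below it.
    static List<String> anyTitles(MdObject object) {
        List<String> result = new ArrayList<>(object.titles);
        for (MdObject child : object.children) {
            result.addAll(anyTitles(child));
        }
        return result;
    }
}
```

Under this reading, publication.compound.title corresponds to `compoundTitles(publication)` and publication.source.any.title to `anyTitles(source)`.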

These are to be created based on our requirements

Person indexes
See also PubMan_Indexing. Stated: Creator.Person.CompleteName with Creator.CreatorType = "Person"

We need compound indexes such as:

* publication.compound.creator.publication-person
** This will index all creator persons on publication level

* publication.compound.creator.any-person
** This will index all creator persons on any level (publication, source, source.source)

* publication.compound.creator.source-person
** This will index all creator persons on source level only (if desired, also on source.source level)

Organization indexes
See also PubMan_Indexing. The logic is the same; the proposed index name changes are:

* escidoc.organization-name => automatically becomes publication.creator.organization.organization-name

* escidoc.any-organization-pids => becomes publication.compound.publication-organization-pids
** if indexing only the organizations of publication creators

* escidoc.any-organization-pids => becomes publication.compound.any-organization-pids
** if indexing the affiliated organizations of all publication and source creators

Creator indexes
We need a compound index for searching any type of creator and for sorting by any type of creator:

* publication.compound.publication-creator-names

This indexes all names of publication creators, regardless of whether the creator is a person or an organization (the respective sort keys are also needed).

NOTE: This is a requirement anyway, as we sort by creator names, and these can be organization or person names. There is no index that does this only for publication creators; the existing index also covers source creators.

@TODO: check whether we also need a similar compound index for PIDs, e.g. publication.compound.publication-creator-pids

Faces

 * In my opinion, this is not relevant for Faces, because in Faces we only have very clear indexes (one per attribute).--Kristina 10:17, 6 February 2009 (UTC)
 * Ok, but even in this case we have clear naming convention for face indexes such as:

* faces.emotion instead of escidoc.emotion
* faces.age instead of escidoc.age

ViRR
Section moved to VIRR Development Page

General
 * Perhaps it makes sense to add escidoc to the beginning of the index names, like:
* escidoc.publication.compound.creator
* escidoc.faces.album.creator
because then we could also query like:
* escidoc.compound.creator
** Possible use case: find all escidoc items of which Mr. X was the creator.
* escidoc.compound.title
** delivers all titles on the first level (publication.title, publication.alternativetitle, virrelement title, etc.)
* escidoc.any.title
** delivers all title elements in all escidoc items