Pubman Func Spec Linguistic Literature Fulltext Search

MPDL,LinguisticLiterature

= Introduction =

This WP consists of two parts:


 * 1) Preparation of full-text indices of selections of documents [[Image:Checkmark.png | 20px]]
 * 2) Show-case website with exemplary usages of such full-text indices [[Image:Symbol_Remove.png | 20px]]

= Usage Scenarios =

Full text indices
There are two kinds of documents made available in PubMan through the Linguistic Literature project. First, there is a large set of documents for which the rights still have to be cleared, so they can only be used for research within the MPI-EVA and by its collaborators (accessibility: restricted to the organization usage group plus specially defined usage groups on a per-person basis). Second, there are the documents that already have a CC license (accessibility: public).

Show-case website
To demonstrate how full-text indexing speeds up the evaluation of content, a search through the books will be implemented. This means that the search within PubMan has to be extended: it shall be possible to restrict the search to a single context. This may also mean that, once a context is selected, only predefined parts of the advanced search mask are displayed rather than the whole mask.

Search within the fulltexts

 * Usage of the already available fulltext search within PubMan (based on Lucene)
 * It would be useful to have links to the actual pages, as made available through the 'splitting pages' WP --> There will be two links in the search result list for each item: one to the detailed view of the item (already there) and one directly to the actual page within the fulltext (should be opened in a separate window/page so that the user can still go to the detailed view of the item from the search result list).
 * An enhancement of the search that translates the search terms automatically into other languages (e.g. using http://translate.google.com) will not be implemented because it is too complicated. But the advanced search offers several search fields where the user can simply enter the same term in different languages.

The search shall always search through all available fulltexts. Whether the user can see the PDF linked to a search result depends on the user's privileges and the accessibility of the PDF. When the user is not allowed to see the PDF, there should be a corresponding note with further information. This search shall be linked through the LDH.
 * COMMENT: That is not possible as the PubMan fulltext search only searches through public full texts.

Further Features

 * Present automatic summaries of books
 * The idea behind this goal is to allow for better identification of useful books for the scientist. The current collection of books in our project consists of grammatical descriptions of the world's languages. The researchers often use these books because they are looking for specific information from different languages. However, not all books have information on this topic. Currently, a lot of time is wasted by going through books only to find out that there is hardly any information in them. The idea of this task is to offer a way to quickly find those books that at least seem to have relevant information, to be checked by the researcher.
 * Practically, this amounts to doing a set of searches based on a given set of terms. The easiest set of terms to use is GOLD (an ontology): http://linguistics-ontology.org/version.
 * All this further information cannot easily be displayed in PubMan (one possibility would be CoNE), but could be integrated in the LDH blog.


 * Search each book/article for the terms in GOLD, and save these searches --> Each term shall be searched in each book
 * Present a tag-cloud with each book showing which terms are most prominently available --> For each book one tag cloud should be displayed. This tag cloud can include all terms which have been found in the document or only the 10 most common ones. This depends on the actual results.
 * The GOLD terminology has a hierarchy: find parts of books that are strongly biased to a particular notion higher up in the GOLD hierarchy (e.g. pages 34-56 of a book might have many terms related to "case property"). --> This can mean that when I find the term "Abessive Case" in a book, the system should show me the next higher term "Case Property" (or all higher terms?) and when I choose one of these terms I will get a new search result list of books that contain this term.
 * --> All these details will be discussed between Michael C. and the development team once we can show some examples of what is possible with the GOLD integration.
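The three GOLD-based ideas above (search each term in each book, show a tag cloud of the most common terms, roll term hits up to higher notions in the hierarchy) can be sketched as follows. This is only a minimal illustration under assumptions: the function names are hypothetical, the book text is assumed to be available as plain text, matching is a simple case-insensitive phrase match, and the small parent map stands in for the real GOLD hierarchy (only the pair "Abessive Case" / "Case Property" is taken from the text above).

```python
import re
from collections import Counter

# Illustrative stand-in for the GOLD hierarchy: child term -> parent term.
# Only "Abessive Case" -> "Case Property" comes from the spec; the second
# entry is an assumption added to make the roll-up visible.
PARENTS = {
    "Abessive Case": "Case Property",
    "Ablative Case": "Case Property",
}


def count_terms(text, terms):
    """Count case-insensitive whole-phrase occurrences of each term."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    return counts


def tag_cloud(counts, limit=10):
    """Return the `limit` most frequent terms with their counts."""
    return counts.most_common(limit)


def roll_up(counts, parents):
    """Add every term's count to all of its ancestors in the hierarchy."""
    totals = Counter(counts)
    for term, hits in counts.items():
        parent = parents.get(term)
        while parent is not None:
            totals[parent] += hits
            parent = parents.get(parent)
    return totals


if __name__ == "__main__":
    page_text = "The abessive case is rare. Abessive Case and ablative case occur here."
    found = count_terms(page_text, ["Abessive Case", "Ablative Case", "Case Property"])
    print(tag_cloud(found))
    print(roll_up(found, PARENTS))
```

Running the same counting per page range rather than per book would give the "pages 34-56 are biased towards case property" view; whether the tag cloud shows all terms or only the top 10 is just the `limit` argument.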


 * Just for reference, in case the usage of GOLD does not work: the following pages give a large set of linguistic terminology. We could extract the terms from these pages to make a list of typical linguistic terms, and use these for the book summary:


 * http://www.glottopedia.de/index.php/Category:Syntax
 * http://www.glottopedia.de/index.php/Category:Phonetics_and_phonology
 * http://www.glottopedia.de/index.php/Category:Morphology
 * http://www.glottopedia.de/index.php/Category:Semantics
 * http://www.glottopedia.de/index.php/Category:Diachrony
 * http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/

= Use Cases =

UC_LL_FS_01 Do context sensitive advanced search
Extension of the already available UC_PM_SR_03 Do advanced search.

Status/Schedule
 * Status: implemented
 * Schedule: PubMan 6.2

Motivation
 * By providing one or more search criteria, the user wants to search only the full texts belonging to the LDH collection, not the whole of PubMan.

Steps
 * 1) The user chooses to execute a context sensitive advanced search.
 * 2) The system displays the advanced search view which additionally provides the possibility to restrict the search to items of a special context (e.g. linguistic literature).
 * 3) (Optionally) The user chooses to search within a selected context (e.g. linguistic literature).
 * 4) Continue with UC_PM_SR_03 Do advanced search Step 3.

Actors Involved
 * User

Further information (Kristina & Natasa on 04.05.10)
 * The blog may link to the PubMan advanced search with predefined criteria for the Linguistic Literature context
 * Three possibilities to allow a search within one context:
 * a) add another field - context name or context-id - in the PubMan advanced search mask --> would be most user-friendly
 * b) change the indexing style-sheet to also include the context name in escidoc.metadata index (at present only includes the context-id)
 * done in 1.2 --Natasa 09:19, 5 October 2010 (UTC)
 * c) do not do anything, but pre-fill the any-metadata field with the context id from linguistic literature (may be confusing for the user --> is not a preferred solution)
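To make options b) and c) more concrete, the following sketch builds a query string in which the user's terms are combined with one additional clause for the context. It is an illustration only: it assumes the search accepts CQL-style `index="value"` clauses joined with `and`, the function names are hypothetical, and the context id `escidoc:12345` is a made-up placeholder. The index name `escidoc.metadata` is the one mentioned in b); any other index would be passed in the same way.

```python
def cql_quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_context_query(terms, context_value, index="escidoc.metadata"):
    """Combine the user's terms with one extra clause restricting the context.

    context_value is the context name (option b) or the context id
    (option c); both end up as an ordinary clause on the same index.
    """
    clauses = [f"{index}={cql_quote(term)}" for term in terms]
    clauses.append(f"{index}={cql_quote(context_value)}")
    return " and ".join(clauses)


if __name__ == "__main__":
    # "escidoc:12345" is a placeholder, not a real context id.
    print(build_context_query(["ergative"], "escidoc:12345"))
```

With option a), the context clause would come from a dedicated field in the search mask instead of being appended behind the scenes, which is why it is the least confusing variant for the user.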

UC_LL_FS_02 View fulltext search results
Status/Schedule
 * Status: implemented
 * Schedule: PubMan 6.2

Motivation
 * The user wants to view the search result list which includes information about the search results within the fulltexts.

Pre-Condition
 * A search result list is available.

Triggers
 * This use case can be included by the use cases
 * UC_LL_FS_01 Do context sensitive advanced search

Steps
 * 1) The user chooses to view a list of items by executing one of the above mentioned use cases.
 * 2) The system displays the number of items and the list of items (see display details on UC_PM_BD_02 View item list). In addition to the already implemented functionalities, this list offers a new sorting by ranking (score), which is the default sorting order (ascending).
 * 3) (Optionally) The user chooses to search directly within one selected pdf of one selected item within the search result list.
 * 3.1 The system displays the selected PDF (in a separate tab) within the user's PDF viewer, with the search query from the current search result list already pre-filled. Here, all functionality of the Adobe viewer is offered to the user (see one example here).
 * 4) Continue with UC_PM_BD_02 View item list Step 3.
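One way to hand the search query over to the PDF viewer in step 3.1 is through PDF open parameters in the URL fragment (`#page=...` and `#search="..."`), which Adobe's viewer documents. The sketch below is an assumption about how such a link could be built, not a description of the PubMan implementation: the function name and the example URL are hypothetical, and whether a given viewer honours the parameters (or percent-encoded spaces) depends on the viewer.

```python
from urllib.parse import quote


def pdf_link(pdf_url, query=None, page=None):
    """Append PDF open parameters for a start page and a pre-filled search."""
    params = []
    if page is not None:
        params.append(f"page={int(page)}")
    if query:
        # The search words are enclosed in double quotes; spaces and
        # reserved characters inside the query are percent-encoded.
        params.append('search="' + quote(query, safe="") + '"')
    if not params:
        return pdf_url
    return pdf_url + "#" + "&".join(params)


if __name__ == "__main__":
    # Hypothetical file URL, for illustration only.
    print(pdf_link("http://example.org/item/book.pdf", query="abessive case", page=34))
```

The same helper would cover the "link directly to the actual page" idea from the usage scenarios by passing only `page`.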

Actors Involved
 * every user

Further Information
 * Only public fulltexts are searched. That is a general PubMan policy. But when we implement the Administrative Search (with Core Service 1.3), we can think about also offering a search in all fulltexts the user has the right to see.

Constraints
 * Step 3 only works for pdfs, not for other formats.

Within the blog
There was the idea
 * to show a tag cloud of relevant terms about each document within the blog and
 * to show a hierarchy for one tag cloud term (in which documents does a given term occur, and how often?)