ESciDoc Administrative Search

From MPDLMediaWiki

Introduction

Discussion

Input FIZ

What are the requirements for the administrative Search, using Lucene?

Currently we use the filters for admin-search, accessing the db-cache that contains all fedora-objects.

  • The db-cache provides search-capabilities for all properties and all metadata of an object.
  • Only the last version of an object is searchable.
  • It is possible to sort by all searchable fields
  • Additionally, it is possible to apply the special filter criteria "user" and "role", which filter the list of retrieved objects by the access rights of the given user with the given role.
  • If no "user" and "role" filter is applied, the list of retrieved objects is filtered by the access rights of the current user with all of his granted roles.

  • The new administrative search will use Lucene as underlying search-database.
  • The Lucene administrative indexes will contain additional fields for access-rights filtering. Access rights are filtered during search by expanding the search query with the access-rights filter.

  • Are the requirements stated above also the requirements for the administrative search using a lucene-index?
  • Are there additional requirements?
    • search in fulltext?
    • search older versions?
    • custom search-result schemas?
  • Index design (one lucene-index containing all object-types or one lucene-index per object-type)
  • Indexing Performance (requirement: synchronous indexing)
    • reindexing of complete trees when members are added/removed from container
  • Proposal:
    • Only use lucene administrative index for fedora-objects (items, containers, contexts, org-units, content-relations, content-models).
    • Leave old filter methods for objects in internal database (user-accounts, user-groups, grants, roles, statistics).
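
The query expansion mentioned above (access rights are filtered by expanding the search query with an access-rights filter) can be pictured with a minimal Python sketch. It is not the eSciDoc implementation. The Lucene-like string syntax is used purely for illustration, the function names are hypothetical, and the choice of permissions-filter.context-id as the only filter field is an assumption (the name is borrowed from the field list further down this page).

```python
from typing import Sequence

# Assumption: a single filter field is enough for this illustration.
FILTER_FIELD = "permissions-filter.context-id"


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def access_filter(context_ids: Sequence[str]) -> str:
    """Build an OR-subquery over the contexts the user may see."""
    if not context_ids:
        # Assumption: with no accessible context there is nothing to search.
        raise ValueError("user has no accessible contexts")
    clauses = [f"{FILTER_FIELD}:{quote(c)}" for c in context_ids]
    return "(" + " OR ".join(clauses) + ")"


def expand_query(user_query: str, context_ids: Sequence[str]) -> str:
    """AND the user's query with the access-rights subquery."""
    return f"({user_query}) AND {access_filter(context_ids)}"


if __name__ == "__main__":
    print(expand_query("title:lucene*", ["escidoc:ctx1", "escidoc:ctx2"]))
```

The point of the sketch is only that the user's query is left untouched and the restriction is added as a separate AND clause, so the filter can be generated independently of what the user typed.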

Input MPDL

On functionalities:

  • DB Cache allows filtering by exact value.
  • Administrative search should allow searching in the same way as the regular search (wildcards also supported).
  • Additional requirements
    • fulltext searching:
      • administrative search shall also allow searching in fulltexts depending on user privileges
        • if this is resolved for the administrative search, it would be good to understand the implications for extending the normal search with respect to privileges on fulltexts
    • custom search-results schemas: not clear
  • searching older versions
    • so far we did not have any special requirement to search for older versions of a resource
    • proposal: stick with this rule; however, with admin search we need to be able to search through both latest versions and latest releases
  • index design: not certain on implications - first impressions:
    • items/containers - single index;
    • OUS, contexts, content-models-> separate indexes;
    • content-relations: start with a separate index and see. We have no solution development based on content relations at the moment, so we need to check what the real scenarios are (e.g. get me all resources created after 2010 and tagged as "my publications"). This kind of query would require a more complex indexing strategy.
  • reindexing of complete trees when members are added/removed from container: not clear why this is needed, some more explanation would help here
    • related: see collaborator-role descriptions at JIRA
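
The difference between filtering by exact value (DB cache) and wildcard search (as requested for the administrative search) can be shown with a small, self-contained Python sketch. The sample records and function names are made up for illustration. The standard-library fnmatch module stands in for wildcard matching here and matches against the whole field value, whereas a real Lucene index matches wildcards against indexed terms, so this only conveys the idea.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

Record = Dict[str, str]

# Made-up sample data, not taken from any eSciDoc repository.
RECORDS: List[Record] = [
    {"id": "item1", "title": "Lucene in Action"},
    {"id": "item2", "title": "Lucene indexing notes"},
    {"id": "item3", "title": "Fedora basics"},
]


def filter_exact(records: List[Record], field: str, value: str) -> List[str]:
    """Exact-value filter: the field must equal the given value."""
    return [r["id"] for r in records if r.get(field) == value]


def search_wildcard(records: List[Record], field: str, pattern: str) -> List[str]:
    """Wildcard search: '*' and '?' are allowed in the pattern."""
    return [r["id"] for r in records if fnmatchcase(r.get(field, ""), pattern)]


if __name__ == "__main__":
    print(filter_exact(RECORDS, "title", "Lucene"))      # no exact match
    print(search_wildcard(RECORDS, "title", "Lucene*"))  # prefix match
```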

Outcome of initial Discussion

  • Administrative search will be realized with Lucene-Indexes containing additional fields for permission-filtering.
    • Additional fields are:
      • permissions-filter.objecttype
      • permissions-filter.context-id
      • permissions-filter.PID
      • permissions-filter.parent-id
      • permissions-filter.direct-parent-id
      • permissions-filter.component-id
      • permissions-filter.created-by
      • permissions-filter.version-status
      • permissions-filter.public-status
  • The permission-filtering is done by expanding the query with a subquery that restricts the search-result to the objects the current user may see.
  • subquery is generated at search-time
    • AA-Service is asked to generate the subquery depending on the roles the current user has been granted
  • Besides the fields needed for permission-filtering, the admin-indexes will contain fields for each property- and each metadata-element and (in case of item) for fulltexts.
  • When searching in fulltexts, the whole item XML is still returned as the search result, not the content of the fulltexts. Therefore it is not necessary that the permission-filter filters for rights to see component-content.
  • Search will behave just like the normal search (wildcards etc)
  • Whenever an object changes (create/update), it is updated in the admin index
  • Admin index will always contain latest version of an object and additionally (if released) the latest release.
  • Dependent on the rights of the user, search result either contains latest version or latest release of the object.
  • The admin indexes are written asynchronously
    • Few seconds of delay between end of update-operation and availability of changed object in the index.
  • search-result returns full eSciDoc-XML-Representation
  • we will have 5 different admin-indexes, containing the following objects:
    • items/containers
    • organizational-units
    • contexts
    • content-models
    • content-relations
  • Allow additional filter "role"
    • in search-query? as additional parameter? not clear how this fits in an srw-request
    • subquery is generated only for the given role and user
    • Example: give all contexts for which user X has role escidoc:role-depositor, give all contexts for which user Y has role (escidoc:role-depositor or escidoc:role-moderator)
  • Sorting
    • same sort keys as in the current search index for released items
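
The rule that the admin index holds the latest version of an object plus (if released) its latest release, and that the result depends on the user's rights, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the IndexedObject class and its fields are hypothetical, and the boolean may_see_latest_version stands in for whatever the permission subquery actually decides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexedObject:
    """Hypothetical view of what the admin index holds for one object."""
    object_id: str
    latest_version: dict                   # always present
    latest_release: Optional[dict] = None  # only present once released


def visible_representation(obj: IndexedObject,
                           may_see_latest_version: bool) -> Optional[dict]:
    """Return the representation a user gets in the search result.

    Users with rights on the working copy get the latest version;
    everyone else gets the latest release, or nothing if the object
    has never been released.
    """
    if may_see_latest_version:
        return obj.latest_version
    return obj.latest_release


if __name__ == "__main__":
    item = IndexedObject(
        object_id="escidoc:1",
        latest_version={"number": 3, "status": "pending"},
        latest_release={"number": 2, "status": "released"},
    )
    print(visible_representation(item, may_see_latest_version=True))
    print(visible_representation(item, may_see_latest_version=False))
```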

Difficulties

  • For some roles, hierarchies of objects have to get resolved
    • e.g. a Collaborator with scope on a container may see all objects that are in the child-hierarchy of the container.
    • permission-filter fields of each object below the container must contain the parent-hierarchy-tree.
    • whenever a member of a container is added/removed, all objects of the child-hierarchy of this member have to get reindexed.
    • --> Do some more evaluation on parallel searching (one index containing the property/metadata fields, another index containing the permission-filter fields) as this could prevent reindexing of properties/metadata and fulltext
  • Question: in this case, most probably permissions-filter.parent-id and permissions-filter.direct-parent-id will not be needed in the content-index, only in the separate permissions-filter index? --Natasa 10:49, 26 March 2010 (UTC)
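
To make the reindexing cost described above concrete: if every object stores its full parent hierarchy, then adding or removing a member means reindexing that member and everything below it. The following Python sketch collects that set with a breadth-first walk. The dictionary-based representation of container members and the function name are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List


def objects_to_reindex(members: Dict[str, List[str]],
                       changed_member: str) -> List[str]:
    """Return the changed member and all objects in its child hierarchy.

    `members` maps a container id to the ids of its direct members.
    Objects are returned in breadth-first order, each one only once.
    """
    seen = {changed_member}
    order = [changed_member]
    queue = deque([changed_member])
    while queue:
        current = queue.popleft()
        for child in members.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # Made-up hierarchy: c1 contains c2 and i1, c2 contains i2 and i3.
    tree = {"c1": ["c2", "i1"], "c2": ["i2", "i3"]}
    print(objects_to_reindex(tree, "c2"))
```

The size of the returned list grows with the depth and breadth of the hierarchy, which is the concern behind the proposal to evaluate alternatives.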

More Input and Questions from FIZ

  • Parallel Searching (1 index with search-fields and another with the permission-fields):
    • Lucene Primary keys of both indexes always must match! When optimize is called on an index, these Primary-Keys might change.
    • --> Both indexes always must get updated/optimized together. But no transactions available in Lucene. Therefore high danger of index-corruption.
    • --> Safer to have one index containing search-fields and permission-fields, no parallel searching
The issue of parallel searching was raised after the initial proposal to index the complete hierarchy. As the hierarchy will be resolved directly at search time (without an index), this issue is in fact obsolete. --Natasa 10:59, 24 June 2010 (UTC)
  • Container- and OrgUnit-Hierarchies
    • Only index direct parent of object
    • Resolve hierarchies at search-time
    • Whenever an object is added to/removed from container, reindex object
There may be an additional problem here. Whenever an object is added to or removed from a container, the object is reindexed. The container must be reindexed as well in this case, as the struct-map is in the container. However, when many objects are added to a container at once (addMembers method), which is a real use case arising from upcoming developments, the problem appears again: all added members and the container itself have to be reindexed. This would mean that, as long as the reindexing is not finished, one would not be able to "filter" (search) for members of this container with e.g. a particular metadata value. This suggests that the direct parent of the object should perhaps NOT be in the index in the case of containers/members (organizational units are another issue, but they are not critical, as OUHandler does not provide such an operation). Is there any possibility to resolve this at search time as well? --Natasa 10:59, 24 June 2010 (UTC)
    • No need to reindex complete child-hierarchy if object is added to/removed from container
  • Fulltext-Permissions
    • No possibility to filter which fulltext user is allowed to search
    • Every user may search all indexed fulltexts
    • Possibility to configure if all fulltexts get indexed or only the ones with visibility=public
Maybe it would be good to be able to explicitly select the minimal visibility level for indexing (private, restricted, public): if level "private" is selected, all fulltexts are indexed; if level "restricted" is selected, restricted and public ones are indexed; if level "public" is selected, only public ones are indexed. --Natasa 12:02, 24 June 2010 (UTC)
    • Fulltext is not shown in search-result (don't show highlight-snippets for fulltexts not publicly visible)
See the minimum visibility level for which fulltexts to index; the same logic (a minimum visibility level for which snippets to show) could be provided here. (Of course, both have to be in sync, or the upper of both is selected in case they differ.) --Natasa 12:02, 24 June 2010 (UTC)
Question: up to now, there was no possibility to distinguish between full-text fields when more than one full-text is associated with the resource. This sometimes causes trouble. Is there any possibility to have each full-text indexed in a separate field (not certain if this is the correct terminology, but hopefully it is understandable :))? --Natasa 11:06, 24 June 2010 (UTC)
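
The proposed minimum visibility level can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the comments above. The ordering private < restricted < public, the function names, and the reading of "the upper of both" as the stricter of the two levels are assumptions.

```python
# Assumed ordering, from least to most publicly visible.
VISIBILITY_ORDER = ["private", "restricted", "public"]


def rank(level: str) -> int:
    """Position of a visibility level in the assumed ordering."""
    return VISIBILITY_ORDER.index(level)


def should_index(visibility: str, min_level: str) -> bool:
    """Index a fulltext if it is at least as visible as the minimum level.

    min_level "private"    -> everything is indexed
    min_level "restricted" -> restricted and public are indexed
    min_level "public"     -> only public is indexed
    """
    return rank(visibility) >= rank(min_level)


def effective_snippet_level(index_level: str, snippet_level: str) -> str:
    """If the two configured levels differ, use the upper (stricter) one."""
    return max(index_level, snippet_level, key=rank)


if __name__ == "__main__":
    for vis in VISIBILITY_ORDER:
        print(vis, should_index(vis, "restricted"))
    print(effective_snippet_level("private", "public"))
```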


  • Search as other user with one or more roles
    • Filter-Query currently only supports one role
    • Who is allowed to search as other user?
In fact, this has most probably been misunderstood. See the bullet points below ... --Natasa 10:42, 24 June 2010 (UTC)
      • In general, the search means that a user searches for resources which s/he can retrieve with a particular role. For example: userA has depositor privilege for context1 and context2, and moderator privilege for context2 and context3. First use case: userA needs to get the list of contexts for which s/he has depositor privilege (context1, context2). Second use case: userA needs to get the list of contexts for which s/he has moderator privilege (context2, context3). This is a filter per role. If the user does not give this filter, then all allowed contexts are provided in the result (e.g. opened, closed and those which this user possibly created but which are not in opened or closed status).
      • another example is when a user has a collaborator/audience privilege for some items (or containers)
        • he would like to retrieve all resources for which he is a collaborator
        • he would like to retrieve all resources where he has "audience" privilege (e.g. retrieve all released items which have restricted files which he may retrieve)
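
The userA example above can be written down as a small Python sketch of the per-role filter. The grant table and the function name are hypothetical and only mirror the example; the sketch also leaves out the contexts a user may see because s/he created them, which the unfiltered case would additionally include.

```python
from typing import Dict, List, Optional, Tuple

# Grants from the example: (role, context) pairs per user.
GRANTS: Dict[str, List[Tuple[str, str]]] = {
    "userA": [
        ("depositor", "context1"),
        ("depositor", "context2"),
        ("moderator", "context2"),
        ("moderator", "context3"),
    ],
}


def contexts_for(user: str, role: Optional[str] = None) -> List[str]:
    """List the contexts a user can retrieve, optionally for one role only.

    Without a role filter, the grants of all roles are combined and
    duplicates are dropped while keeping the original order.
    """
    result: List[str] = []
    for granted_role, context in GRANTS.get(user, []):
        if role is None or granted_role == role:
            if context not in result:
                result.append(context)
    return result


if __name__ == "__main__":
    print(contexts_for("userA", "depositor"))
    print(contexts_for("userA", "moderator"))
    print(contexts_for("userA"))
```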

Mih

Requirements

  • search all statuses