ESciDoc Item List


This page is the result of the eSciDoc workshop held in September 2007. It serves to collect ideas, questions and constraints discovered while evaluating the filtering of items according to a "mini item" information object. Please feel free to use the page.

Basic Idea

Base the filter methods on a "mini item": a specific datastream with minimal metadata. This metadata would be generated per content model whenever an item is created or updated. Sketch: razum1.jpg

The idea of "list metadata" is derived from the necessity to generate Dublin Core (DC) metadata for each Fedora object (and therefore for each eSciDoc object). This generated subset of the solution-specific metadata may be used for filtering/searching eSciDoc objects. The entries in the Dublin Core metadata datastream named "DC" of a Fedora object are automatically written to the resource index. The assumption is that the core metadata entries for one object, which are written to the DC datastream of that object, are sufficient for searching/filtering.

If this leads to DC metadata being used as a container for search-specific properties, the approach has of course failed.

Specification of Prototype

Boundaries for the prototype by FIZ:

  • with AA
  • without paging
  • list format: RSS in RDF

Open question: Which PubMan requirements (only display? display and sorting?) need to be fulfilled by this format? --Inga 20:57, 10 March 2008 (CET)

Requirements, Sorting

Note: As discussed with Natasa and Robert: We must be careful not to mix up item list requirements with searching requirements. If a new sorting of a result list means executing a new search, we only need to consider sorting requirements for item lists not generated from a search (e.g. in the depositor workspace). --Inga 21:18, 12 October 2007 (CEST)

  1. SORTING (from PubMan R3 specification for item lists after a search). Item lists should be sortable
    • by the following metadata elements (mapping provided by xslt)
      • date -> dc:date (mapping needs to be adapted to use one date at most)
      • title -> dc:title (fine)
      • genre/type -> dc:type (mapping needs to be extended)
      • creator name (first creator) -> dc:creator (Person.FamilyName or Organization.Name of the first creator - mapping needs to be adapted)
      • publishingInfo (Organization.Name) -> dc:publisher
      • reviewType -> not mapped at all
      • source.Creator (Person.FamilyName or Organization.Name of the first Source.Creator) -> not mapped at all
      • source.Title -> dc:source (fine)
      • event.Title -> not mapped at all
    • by the following item properties:
      • date of last modification -> last-modification-date
      • collection -> properties/context
      • state -> properties/status (now properties/public-status)
      • owner -> properties/creator (now properties/created-by)
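
The sorting requirements above can be illustrated on the application side. The following is a minimal sketch, assuming each "mini item" has already been flattened to one value per sort key; the record layout, the item ids and the function name `sort_items` are hypothetical and not part of the framework API.

```python
from typing import Dict, List

# Hypothetical flat "mini item" record: at most one value per sort key.
Item = Dict[str, str]


def sort_items(items: List[Item], key: str, descending: bool = False) -> List[Item]:
    """Sort items by one flat key; items lacking the key always come last."""
    present = [item for item in items if item.get(key)]
    missing = [item for item in items if not item.get(key)]
    # casefold() gives a case-insensitive comparison; it is not a real collation.
    present.sort(key=lambda item: item[key].casefold(), reverse=descending)
    return present + missing


items = [
    {"id": "escidoc:3", "dc:title": "beta", "dc:date": "2007-09-01"},
    {"id": "escidoc:1", "dc:title": "Alpha"},
    {"id": "escidoc:2", "dc:title": "gamma", "dc:date": "2006-01-15"},
]

by_title = [item["id"] for item in sort_items(items, "dc:title")]
by_date_desc = [item["id"] for item in sort_items(items, "dc:date", descending=True)]
```

Sorting by date only works this simply because the mapping is assumed to deliver a single date per item in a sortable (ISO-like) form, which is what the note on dc:date above asks for.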


Requirements, Item List Short Display

The article PubMan Display specifies the bibliographic information required for displaying lists of publication items. The item list (short view) includes the following metadata:

Element | Mapped to | Note
Publication.Title | dc:title | -
Publication.Creator | dc:creator = <person/family-name>, <person/given-name>; dc:creator = <organization-name> | all creators of type person and organization
Publication.Date | dc:date | single element, value needs to be filled according to priorities provided
Publication.Genre | dc:type | -
Number of the Files | dc:format | repeating the element provides information about number of files
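
The short-view mapping can be sketched as a small function. This is an illustration only: the input record layout, the key names and the function name `short_view` are hypothetical, and the date priority order is a placeholder, because this page only says that dc:date is filled "according to priorities provided" without stating them.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: placeholder priority order; the real priorities are not given here.
DATE_PRIORITY = ["issued", "published-online", "dateAccepted",
                 "dateSubmitted", "modified", "created"]


def short_view(publication: Dict, files: List[Dict]) -> List[Tuple[str, str]]:
    """Map a nested (hypothetical) publication record to flat DC entries."""
    entries = [("dc:title", publication["title"])]
    # All creators of type person and organization become dc:creator.
    for creator in publication.get("creators", []):
        if "family-name" in creator:
            value = f'{creator["family-name"]}, {creator["given-name"]}'
        else:
            value = creator["organization-name"]
        entries.append(("dc:creator", value))
    # Exactly one dc:date: the first available date in priority order.
    dates = publication.get("dates", {})
    for kind in DATE_PRIORITY:
        if kind in dates:
            entries.append(("dc:date", dates[kind]))
            break
    if "genre" in publication:
        entries.append(("dc:type", publication["genre"]))
    # One dc:format per file, so the repetition count equals the file count.
    for file_info in files:
        entries.append(("dc:format", file_info["mime-type"]))
    return entries


entries = short_view(
    {
        "title": "The Title",
        "creators": [
            {"family-name": "Smith", "given-name": "Joe"},
            {"organization-name": "FIZ Karlsruhe"},
        ],
        "dates": {"created": "2006-11-02", "issued": "2007-03-10"},
        "genre": "article",
    },
    [{"mime-type": "application/pdf"}, {"mime-type": "application/pdf"}],
)
number_of_files = sum(1 for name, _ in entries if name == "dc:format")
```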

Requirements, Item List Medium Display

The article PubMan Display specifies the bibliographic information required for displaying lists of publication items. The following list of elements has been copied from this article.

  1. Publication.identifier (mapped to -> dc:identifier)
  2. Publication.Title (mapped to -> dc:title)
  3. Publication.Creator[role]
    (mapped to dc:creator if role = author, otherwise to dc:contributor) -- Frank
    (alternatively mapped to dc:contributor = <person/family-name>, <person/given-name>, with role as refinement as suggested by http://www.loc.gov/loc.terms/relators/) -- Inga 22:04, 10 March 2008 (CET)
    1. Publication.creator.person.organization (no dc mapping exists)
      Any chance to support nested information, e.g. <dc:creator><rdf:Description><vcard:fn>Smith, Joe</vcard:fn><vcard:org>FIZ Karlsruhe</vcard:org></rdf:Description></dc:creator>?
    2. Publication.creator.organization (no dc mapping exists)
      (mapped to dc:creator = <organization-name> if role = author, otherwise to dc:contributor) -- Frank
      (alternatively mapped to dc:contributor, see above) -- Inga 22:04, 10 March 2008 (CET)
  4. Publication dates (mapped to -> dc:date)
    1. dcterms:created - date created
    2. dcterms:modified - date modified
    3. dcterms:dateSubmitted - date submitted
    4. dcterms:dateAccepted - date accepted
    5. published-online - date published online
    6. dcterms:issued - date published in print
  5. Publication.Genre (mapped to dc:type)
  6. Publication.Degree, e.g. Habilitation, Diploma, etc.
  7. Publication.Publishing-Info
    1. publication.publishing-info.publisher (already in publication metadata mapped as dc:publisher)
    2. publication.publishing-info.place
    3. publication.publishing-info.edition
  8. Publication.total-number-of-pages (can be mapped as "dcterms:extent")
  9. publication.event.title
  10. publication.source
    1. publication.source.title
    2. publication.source(@type) (is genre of the publication)
    3. Publication.source.creator[role] (cannot be fully supported in dc)
    4. publication.source.volume
    5. publication.source.issue
    6. publication.source.publishing-info.publisher (already in publication metadata mapped as dc:publisher)
    7. publication.source.publishing-info.place
    8. publication.source.publishing-info.edition
    9. publication.Source.Identifier -> dc:identifier
  11. item.components (File information, gives the number of attached files -> @FIZ: We need links to content components. Thus we will be able to produce the number on application side. --Natasa 13:27, 17 December 2007 (CET))
  12. publication.review-method
  13. item.username-created-by (check if name is possible)
  14. item.context-name (to be derived from context id)
  15. item.date-last-modified
  16. item.status
  17. item.id
  18. item.pid
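
The role-based creator mapping proposed by Frank in item 3 can be sketched as follows. The record layout and the function names are hypothetical; the rule itself (author goes to dc:creator, every other role to dc:contributor) is the one stated in the list.

```python
from typing import Dict, List, Tuple


def dc_element_for(role: str) -> str:
    """Authors map to dc:creator, every other role to dc:contributor."""
    return "dc:creator" if role == "author" else "dc:contributor"


def creator_value(creator: Dict[str, str]) -> str:
    """Persons are rendered as 'family-name, given-name', organizations by name."""
    if "family-name" in creator:
        return f'{creator["family-name"]}, {creator["given-name"]}'
    return creator["organization-name"]


def map_creators(creators: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    return [(dc_element_for(c["role"]), creator_value(c)) for c in creators]


mapped = map_creators([
    {"role": "author", "family-name": "Smith", "given-name": "Joe"},
    {"role": "editor", "family-name": "Miller", "given-name": "Ann"},
    {"role": "author", "organization-name": "FIZ Karlsruhe"},
])
```

Note that the role itself is lost for everything that is not an author, and the affiliation of a person is not represented at all. This is the limitation the nested-information question in item 3.1 refers to.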

There are 2 types of item lists we can retrieve from core services:

  • as output of Searching service (sorting possible, limit possible, offset possible): search-result.xsd

SORTING REQUIREMENTS for search results (same as those specified above):

  • as output of: ItemHandler.retrieveItems (sorting, limit, offset not possible yet): item-list.xsd

SORTING REQUIREMENTS FOR filter results: minimum sorting options possible are the same as filter criteria available.

To clarify:

  • Where does the search service derive the items in search-result.xml from?
  • Must the "flattened" metadata in the resource index be pure DC or qualified DC, or can they also include metadata from the escidoc namespace?
  • If the resource index is queried for item lists, what is the cost of including limit/offset and sorting by metadata that are in the resource index?

Discussion of Requirements from Issue 323

  • "... this [is] relevant not only for search results but also for filtered item lists."
    • Is "search results" related to SearchAndBrowse?! Frank 15:28, 10 October 2007 (CEST)
Yes, search results come from search and browse - search results at the moment are only from released items. Filters also include non-released items. --Natasa 09:46, 11 October 2007 (CEST)
  • Sorting should be possible by resource properties "Context Name" and "Owner Name" (created-by name?).
    • The names of resources like user and context are not properties of an item or container and therefore not included in the mentioned triplestore-based RDF representation. Frank 15:28, 10 October 2007 (CEST)
Question: but the Context resources are created in Fedora and have their own properties (i.e. context name, description etc.). So why not have the same behaviour for all resources in this case? I.e. all properties and some additional metadata (if they exist) would also be stored in the triple store. Would this kind of sorting (with some more complex queries in the triple store) then also be possible? --Natasa 09:53, 11 October 2007 (CEST)
I just want to mention that sorting by names of referenced resources indeed requires "complex queries". Then we have special cases which are not covered by metadata mapping or resource properties. The context's name is of course stored in the triplestore, but as a property of the context and not as a property of the item. Such a query may be better covered by SearchAndBrowse?! Frank 13:15, 11 October 2007 (CEST)
  • Taking a closer look at our current solution: there is no need to sort released items by context name when they are retrieved via Search/Browse (as the contexts are nothing that is published, they are only the administrative containers). On the other hand, maybe we should think harder on the solution side about how we present item lists to our users. In general, if the user would like to see items grouped by context, that is fine. Then in the solution we first need to query the contexts (ordered by name) and then, for each context, query the respective items (by use of a filter). The same would go for sorting by "created-by" name.
  • Is a combination necessary, i.e. sort by context name and then within the context name by owner name or vice versa? --Natasa 14:06, 26 October 2007 (CEST)
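
The two-step approach described above (first query the contexts ordered by name, then filter the items per context) can be sketched on the solution side. The callable `items_for_context` stands in for a filter request; it, the record layout and all ids are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def items_grouped_by_context(
    contexts: List[Dict[str, str]],
    items_for_context: Callable[[str], List[str]],
) -> List[Tuple[str, List[str]]]:
    """Return (context name, item ids) pairs ordered by context name."""
    grouped = []
    for context in sorted(contexts, key=lambda c: c["name"].casefold()):
        # One filter request per context; this stand-in is hypothetical.
        grouped.append((context["name"], items_for_context(context["id"])))
    return grouped


# In-memory stand-in for the filter request.
ITEMS = {"escidoc:ctx2": ["escidoc:11"], "escidoc:ctx1": ["escidoc:12", "escidoc:13"]}

grouped = items_grouped_by_context(
    [
        {"id": "escidoc:ctx1", "name": "Theses"},
        {"id": "escidoc:ctx2", "name": "articles"},
    ],
    lambda context_id: ITEMS.get(context_id, []),
)
```

The cost of this design is one request per context, which is acceptable only as long as the number of contexts shown to a user stays small.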


  • There are sorting rules pertaining to the order of characters.
    • ITQL allows an "order by" clause. I doubt the order of characters is configurable (to be checked). Frank 15:28, 10 October 2007 (CEST)
The triple store implementation should probably be checked. Normally it should be possible to specify the collating order.
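
If the collating order turns out not to be configurable in the triple store, a solution could re-sort a retrieved list itself. The sketch below shows the principle with a simplified sort key that ignores case and diacritics; it is not a locale-aware collation, and the function name `collation_key` is hypothetical.

```python
import unicodedata


def collation_key(text: str) -> str:
    """Simplified key: decompose characters, drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


names = ["Zeller", "Östlund", "Oberg", "ähnlich", "Adler"]

# Plain code-point order puts every umlaut after "Z".
plain = sorted(names)
# With the key, umlauts sort next to their base letters.
collated = sorted(names, key=collation_key)
```

Re-sorting on the application side only works for lists that are retrieved completely; it cannot be combined with limit/offset handled by the server.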

Request Syntax

OpenSearch

One idea is to use OpenSearch to select items and represent a list of selected items.

OpenSearch defines a format to describe search engines and four search metadata elements ("totalResults", "startIndex", "itemsPerPage" and "Query") to be included in XML response documents.

An OpenSearch description document describes a search interface as a URL. The suggested kinds of search request are HTTP GET and HTTP POST. Accordingly, a description document contains a request template whose parameters are replaced by specific values before a request is executed. OpenSearch defines some basic search parameters and allows the use of custom parameters. A description document as well as an OpenSearch response may contain detailed descriptions of the capabilities of the search engine (e.g. which custom parameters may be used).
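
The substitution of template parameters can be sketched in a few lines. The sketch follows the OpenSearch 1.1 template syntax, in which `{name}` is a required parameter and `{name?}` an optional one; the template URL and the custom parameter `escidoc:context` are hypothetical and only illustrate the mechanism.

```python
import re
from urllib.parse import quote

# {name} is required, {name?} is optional; names may carry a namespace prefix.
_PARAM = re.compile(r"\{([A-Za-z0-9_:.\-]+)(\?)?\}")


def fill_template(template: str, values: dict) -> str:
    """Replace template parameters by URL-encoded values."""
    def substitute(match: "re.Match") -> str:
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters without a value become empty
        raise KeyError(f"missing required template parameter: {name}")

    return _PARAM.sub(substitute, template)


# Hypothetical template; not an existing framework URL.
template = ("http://localhost:8080/ir/items?q={searchTerms}"
            "&start={startIndex?}&context={escidoc:context?}")
url = fill_template(template, {"searchTerms": "solar physics", "startIndex": 1})
```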

If OpenSearch should be used for eSciDoc - e.g. between PubMan and the ObjectManager - some questions arise:

  • Where should a description document be located?
  • Should a search include custom parameters like 'created-by', 'context' etc. and aren't such parameters solution specific?
  • Is OpenSearch a solution task altogether? The Framework offers search capabilities with SearchAndBrowse and the ObjectManager's filter methods, and a solution may describe the search and generate customized responses in a solution-specific way (e.g. to generate a PubMan-Search-Firefox-Plugin).

OpenSearch defines elements that allow autodiscovery of search engines. This implies the usage of generic search clients. Firefox is able to automatically discover search engines by pertinent elements in websites and generate a search field - including autocompletion if provided by the search interface - in the browser's toolbar. The situation between PubMan - or other solutions - and "the framework" does not fit such features. The filter methods of the ObjectManager are described as part of the API documentation and used by solution developers. Autodiscovery seems to be something that a solution may offer to its users.

As an example you may have a look at http://solarphysics.livingreviews.org/refdb/search which links to the OpenSearch description here http://solarphysics.livingreviews.org/refdb/search?format=osd Search results look as follows: http://solarphysics.livingreviews.org/refdb/search?journal=lrsp&a=Abart%2C+R.&tM=substring&js=lrsp&s=yearDesc&l=10&_action_search=Search&format=rss

Note: OpenSearch does not lead to a specific result format. To talk about OpenSearch is to talk about search engine discovery and generic query generation etc. --Frank 13:37, 1 October 2007 (CEST)
Yes. My point when mentioning OpenSearch was: If lists of items are delivered in a format commonly used with OpenSearch (like RSS), support for OpenSearch for search engine discovery will be as cheap as creating the description in addition. --Robert

SRU/W?

Another idea would be to offer an SRU/SRW interface to request and return a list of items.

Result Format / Representation

RSS 1.0

RSS in version 1.0 supports extension with custom-made elements because it is based on RDF. Conversely, it would be possible to enrich an RDF response from the resource index (triplestore) with RSS 1.0 predicates. RSS 1.0 may be interpreted in a semantic-web context and in a newsfeed context.

An RSS 1.0 document is an RDF/XML document which contains an 'items' entry with an RDF sequence containing references to item entries.

<rdf:li rdf:resource="http://localhost:8080/ir/item/escidoc:234"/>

The value of the resource attribute is a URI that MAY be the URL of the entity but IS the identifier of an item entry stated later in the same document.

<rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
  <rss:title>Title of the referred resource</rss:title>
  <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
  <rss:description>Description from the referred resource</rss:description>
</rss:item>

An item entry may be extended by predicates from other contexts (ontologies, namespaces) than RSS, e.g. with DC metadata:

<rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
  <rss:title>Title of the referred resource</rss:title>
  <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
  <rss:description>Description from the referred resource</rss:description>
  <dc:title>The Title</dc:title>
  <dc:identifier>hdl:12345/102030</dc:identifier>
</rss:item>

If an eSciDoc item list is an RSS 1.0 document, the values of the rss predicates title, link and description are retrieved from the appropriate resource index entries. Additionally, all DC entries from the resource index related to the item may be added.

It may be possible to add entries from the mpdl metadata datastream converted to RDF statements. Because the document is an RDF document, it is possible to insert the non-flat structure of these metadata by means of additional resources/entities defined in the mpdl metadata context (ontology?), but newsfeed readers will probably not understand that structure.
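
How such a document could be assembled from flat resource index entries is sketched below with the Python standard library. The input record layout, the channel texts and the function name `build_item_list` are hypothetical; the namespace URIs are the published ones for RDF, RSS 1.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
DC = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("rdf", RDF), ("rss", RSS), ("dc", DC)):
    ET.register_namespace(prefix, uri)


def build_item_list(channel_uri: str, entries: list) -> ET.Element:
    """Build an RSS 1.0 document: a channel with an rdf:Seq, then the items."""
    root = ET.Element(f"{{{RDF}}}RDF")
    channel = ET.SubElement(root, f"{{{RSS}}}channel", {f"{{{RDF}}}about": channel_uri})
    ET.SubElement(channel, f"{{{RSS}}}title").text = "Item list"
    ET.SubElement(channel, f"{{{RSS}}}link").text = channel_uri
    ET.SubElement(channel, f"{{{RSS}}}description").text = "Filtered item list"
    seq = ET.SubElement(ET.SubElement(channel, f"{{{RSS}}}items"), f"{{{RDF}}}Seq")
    for entry in entries:
        # Each rdf:li references an item entry that follows in the same document.
        ET.SubElement(seq, f"{{{RDF}}}li", {f"{{{RDF}}}resource": entry["uri"]})
    for entry in entries:
        item = ET.SubElement(root, f"{{{RSS}}}item", {f"{{{RDF}}}about": entry["uri"]})
        ET.SubElement(item, f"{{{RSS}}}title").text = entry["title"]
        ET.SubElement(item, f"{{{RSS}}}link").text = entry["uri"]
        ET.SubElement(item, f"{{{RSS}}}description").text = entry["description"]
        # Additional DC entries related to the item.
        for name, value in entry.get("dc", []):
            ET.SubElement(item, f"{{{DC}}}{name}").text = value
    return root


root = build_item_list(
    "http://localhost:8080/ir/items",
    [{
        "uri": "http://localhost:8080/ir/item/escidoc:234",
        "title": "Title of the referred resource",
        "description": "Description from the referred resource",
        "dc": [("title", "The Title"), ("identifier", "hdl:12345/102030")],
    }],
)
xml_text = ET.tostring(root, encoding="unicode")
```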

<!-- ... -->
<rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rss:title>Title of the referred resource</rss:title>
  <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
  <rss:description>Description from the referred resource</rss:description>
  <mpdl:creator rdf:resource="http://some/uri"/>
</rss:item>

<rdf:Description rdf:about="http://some/uri"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rdf:type rdf:resource="http://escidoc.mpg.de/metadataprofile/schema/0.1/types/creator"/>
  <mpdl:role>author</mpdl:role><!-- note: here role is qualified (with namespace) -->
  <mpdl:complete-name>Arno Schindlmayr</mpdl:complete-name>
  <mpdl:organization rdf:resource="http://some/other/uri"/>
</rdf:Description>

<rdf:Description rdf:about="http://some/other/uri"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rdf:type rdf:resource="http://escidoc.mpg.de/metadataprofile/schema/0.1/types/organization"/>
  <mpdl:organization-name>Forschungszentrum</mpdl:organization-name>
  <!-- ... -->
</rdf:Description>
<!-- ... -->

See ESciDoc_Container_Toc.
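
A client that does understand the nested structure has to follow the rdf:resource references itself. The following sketch resolves the creator and organization references of the example above; it assumes the fragments are wrapped in a single rdf:RDF root element, and the function name `creators_of` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"
MPDL = "http://escidoc.mpg.de/metadataprofile/schema/0.1/types"

# The example fragments, wrapped in an rdf:RDF root (an assumption of this sketch).
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rss="http://purl.org/rss/1.0/"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
    <rss:title>Title of the referred resource</rss:title>
    <mpdl:creator rdf:resource="http://some/uri"/>
  </rss:item>
  <rdf:Description rdf:about="http://some/uri">
    <mpdl:role>author</mpdl:role>
    <mpdl:complete-name>Arno Schindlmayr</mpdl:complete-name>
    <mpdl:organization rdf:resource="http://some/other/uri"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://some/other/uri">
    <mpdl:organization-name>Forschungszentrum</mpdl:organization-name>
  </rdf:Description>
</rdf:RDF>"""


def creators_of(root: ET.Element, item_uri: str) -> list:
    """Follow mpdl:creator and mpdl:organization references for one item."""
    by_uri = {d.get(f"{{{RDF}}}about"): d for d in root.findall(f"{{{RDF}}}Description")}
    result = []
    for item in root.findall(f"{{{RSS}}}item"):
        if item.get(f"{{{RDF}}}about") != item_uri:
            continue
        for ref in item.findall(f"{{{MPDL}}}creator"):
            creator = by_uri.get(ref.get(f"{{{RDF}}}resource"))
            if creator is None:
                continue  # dangling reference: nothing to resolve
            org_ref = creator.find(f"{{{MPDL}}}organization")
            org = by_uri.get(org_ref.get(f"{{{RDF}}}resource")) if org_ref is not None else None
            result.append({
                "name": creator.findtext(f"{{{MPDL}}}complete-name"),
                "role": creator.findtext(f"{{{MPDL}}}role"),
                "organization": org.findtext(f"{{{MPDL}}}organization-name") if org is not None else None,
            })
    return result


creators = creators_of(ET.fromstring(DOC), "http://localhost:8080/ir/item/escidoc:234")
```

A generic newsfeed reader would stop at the rss:title, rss:link and rss:description elements and ignore everything this function resolves.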

Questions & Discussion

  • How should the required changes be versioned if a change in the transformation requires a change of all items of a specific content type?
The proposed solution can be treated as a "temporary workaround" and a test for having all metadata later placed into proper storage. At present, the "list metadata" are part of the Fedora system datastream (DC metadata) that is "pushed" into the relational database. Note: only non-qualified DC metadata is allowed. --Natasa 12:32, 26 September 2007 (CEST)
  • To Do: think of "moving" the "list metadata" (or all descriptive metadata) into an RDF storage (this will take substantial rework, but will also give substantial benefits for interoperability, discovery and, of course, performance etc.) --Natasa 12:32, 26 September 2007 (CEST)
The idea of using the DC datastream is that its entries are automatically written to the triplestore, isn't it? It would be nice to write all entries from the solution-specific metadata to the triplestore, but then they must be transferred into a flat, well-known structure. The DC mapping is one approach to do that. -- Frank
The entries are automatically extracted from the DC datastream of a Fedora object and inserted into the Fedora resource index. And yes, that is actually the problem with the Fedora DC support. Our metadata set is much richer, so if we only extract DC we have basic (not qualified) DC metadata. But our metadata schemas are not flat. --Natasa 10:36, 28 September 2007 (CEST)
Trying to summarize: It would be nice to use DC metadata to enrich list views because it contains the core metadata of a resource and is written to the resource index. Unfortunately, not all entries from mpdl-escidoc-metadata can be mapped to DC. Another idea is to write the mpdl-escidoc-metadata to the resource index, however we do that.
Question: Is it possible to bring mpdl-escidoc-metadata into a flat structure in order to write the entries to the resource index, or is that nearly the same problem as mapping it to DC? Frank 13:30, 1 October 2007 (CEST)
  • To check: Are standard DC data sufficient to create the information required to display item lists, i.e. are all data required for sorting and displaying available in DC core? A transformation "escidoc publication->dc simple" is available in the specification svn, zim01.gwdg.de
    • If not: To change the functional specification or to "disemploy" the standard?
      • or think of something else like keeping the standard but moving to metadata RDF storage? --Natasa 10:36, 28 September 2007 (CEST)
  • Should a change of the "mini item" data stream
    1. create a new logical version?
    2. create no new logical version?
To consider for a start: If the metadata presented in the "list-metadata" are not sufficient and a new transformation is needed, this does not (from the aspect of content versioning) change the version of the content item; it just provides a new "enriched view" on the content item. In this case, and only if really necessary, a utility should be developed that modifies the content items (without creating new versions of them), as the list metadata is redundant information derived from the actual metadata of the item. --Natasa 12:32, 26 September 2007 (CEST)
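
Regarding the question above of whether mpdl-escidoc-metadata can be brought into a flat structure: mechanically this is simple, as the following sketch shows for a nested record. The record layout and the path notation are hypothetical. The sketch mainly illustrates the trade-off, namely that the nesting survives only inside the generated key names, which would have to become predicates in the resource index.

```python
from typing import Any, List, Tuple


def flatten(node: Any, prefix: str = "") -> List[Tuple[str, str]]:
    """Turn a nested record into (path, value) pairs."""
    pairs: List[Tuple[str, str]] = []
    if isinstance(node, dict):
        for key, value in node.items():
            pairs.extend(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(node, list):
        # Repeated elements keep their position in the path.
        for index, value in enumerate(node):
            pairs.extend(flatten(value, f"{prefix}[{index}]"))
    else:
        pairs.append((prefix, str(node)))
    return pairs


flat = flatten({
    "title": "The Title",
    "creator": [
        {"role": "author",
         "person": {"family-name": "Smith", "organization": "FIZ Karlsruhe"}},
    ],
})
```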

First Prototype

We have a prototype implementation of the dc-mapping and retrieval of "short-item-list". The method retrieveItems may be called with an additional element "format" in the taskParam. The default format is still "full"(-object-list). If format is set to "short", an RDF document is delivered that contains selected triplestore entries for every item. The delivered short format is an unfinished example. Some elements may be renamed or removed before delivery. The short format is based on the retrieveItems implementation for full objects and has no performance advantage. To demonstrate that the basic short format can be transformed, the additional format "atom" has been temporarily added.
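
A caller might build such a taskParam as sketched below. Only the element name "format" and its values "full", "short" and "atom" are taken from the description above; the root element name `param` and the function name `build_task_param` are assumptions, so the actual schema should be checked before use.

```python
import xml.etree.ElementTree as ET

# Format values named in the prototype description.
SUPPORTED_FORMATS = ("full", "short", "atom")


def build_task_param(list_format: str = "full") -> str:
    """Serialize a taskParam that requests the given list format."""
    if list_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported list format: {list_format}")
    param = ET.Element("param")  # ASSUMPTION: root element name of the taskParam
    ET.SubElement(param, "format").text = list_format
    return ET.tostring(param, encoding="unicode")


task_param = build_task_param("short")
```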