ESciDoc Item List

From MPDLMediaWiki
Revision as of 08:53, 23 April 2008 by Tanja

This page is the result of the eSciDoc workshop held in September 2007. It serves to collect ideas, questions and constraints discovered while evaluating the filtering of items according to a "mini item" information object. Please feel free to use the page.

Basic Idea

Base filter methods on a "mini item" = a specific data stream with minimal metadata. The metadata would be created in a content-model-specific way whenever an item is created or updated. (Sketch: razum1.jpg)

The idea of "list metadata" is derived from the necessity to generate Dublin Core Metadata for each Fedora object (and therefore for each eSciDoc object). This generated subset of the solution specific metadata may be used for filtering/searching eSciDoc objects. The entries in the Dublin Core Metadata datastream named "DC" of a Fedora object are automatically written to the resource index. The assumption is that the core metadata entries for one object - that are written to the DC datastream of the object - are sufficient for searching/filtering.

If this leads to DC metadata being used as a container for search-specific properties, the approach has of course failed.

Specification of Prototype

Boundaries for prototype by FIZ:

  • with AA
  • without paging
  • list format: RSS in RDF

Open question: Which PubMan requirements (only display? display and sorting?) need to be fulfilled by this format? --Inga 20:57, 10 March 2008 (CET)

Requirements, Sorting

Note: As discussed with Natasa and Robert: We need to be aware not to mix up item list requirements with searching requirements. If a new sorting of a result list means executing a new search, we only need to consider sorting requirements for item lists not generated from a search (e.g. in the depositor workspace). --Inga 21:18, 12 October 2007 (CEST)

  1. SORTING (from PubMan R3 specification for item lists after a search). Item lists should be sortable
    • by the following metadata elements (mapping provided by XSLT)
      • date -> dc:date (mapping needs to be adapted to use one date at most)
      • title -> dc:title (fine)
      • genre/type -> dc:type (mapping needs to be extended)
      • creator name (first creator) -> dc:creator (Person.FamilyName or Organization.Name of the first creator - mapping needs to be adapted)
      • publishingInfo (Organization.Name) -> dc:publisher
      • reviewType -> not mapped at all
      • source.Creator (Person.FamilyName or Organization.Name of the first Source.Creator) -> not mapped at all
      • source.Title -> dc:source (fine)
      • event.Title -> not mapped at all
    • by the following item properties:
      • date of last modification -> last-modification-date
      • collection -> properties/context
      • state -> properties/status (now properties/public-status)
      • owner -> properties/creator (now properties/created-by)
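
The mapping above can be pictured as a lookup table from PubMan sort criteria to flat keys. The following is a minimal Python sketch, not the framework's implementation: the names SORT_KEYS and sort_items, the criterion labels and the sample records are hypothetical, and items are assumed to be available as flat dictionaries already.

```python
# Hypothetical lookup from PubMan sort criteria to flat DC / property keys,
# following the mapping listed above. Criteria without a mapping map to None.
SORT_KEYS = {
    "date": "dc:date",
    "title": "dc:title",
    "genre": "dc:type",
    "creator": "dc:creator",
    "publishingInfo": "dc:publisher",
    "reviewType": None,            # not mapped at all
    "source.Creator": None,        # not mapped at all
    "source.Title": "dc:source",
    "event.Title": None,           # not mapped at all
    "lastModified": "last-modification-date",
    "collection": "properties/context",
    "state": "properties/public-status",
    "owner": "properties/created-by",
}


def sort_items(items, criterion, descending=False):
    """Sort flat item records by one criterion; unmapped criteria are rejected."""
    key = SORT_KEYS.get(criterion)
    if key is None:
        raise ValueError(f"no flat mapping available for {criterion!r}")
    # In ascending order, items lacking the key sort last instead of failing.
    return sorted(
        items,
        key=lambda item: (key not in item, item.get(key, "")),
        reverse=descending,
    )


# Sample records (invented for illustration).
items = [
    {"dc:title": "Zeta functions", "dc:date": "2007-03-01"},
    {"dc:title": "Alpha decay", "dc:date": "2008-01-15"},
    {"dc:title": "Membranes"},
]
titles_by_date = [item["dc:title"] for item in sort_items(items, "date")]
```

Criteria marked "not mapped at all" fail early here, which makes the gaps in the DC mapping visible to the caller.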


Requirements, Item List Short Display

The article PubMan Display specifies the bibliographic information required for displaying lists of publication items. The item list (short view) includes the following metadata:

Element | Mapped to | Note
Publication.Title | dc:title | -
Publication.Creator | dc:creator = <person/family-name>, <person/given-name>; dc:creator = <organization-name> | all creators of type person and organization
Publication.Date | dc:date | single element, value needs to be filled according to priorities provided
Publication.Genre | dc:type | -
Number of the Files | dc:format | repeating the element provides information about number of files
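
As an illustration of the short-view mapping, the following Python sketch flattens one publication into DC pairs. It is only a sketch under assumptions: the input dictionary layout, the function name short_display_record, the order in DATE_PRIORITY (the page does not state the priorities) and the use of MIME types as dc:format values are all hypothetical.

```python
# ASSUMPTION: the priority order of dates is not given on this page;
# this order is a placeholder for illustration only.
DATE_PRIORITY = ["issued", "published-online", "dateAccepted",
                 "dateSubmitted", "modified", "created"]


def short_display_record(publication):
    """Flatten a publication (hypothetical dict layout) into (predicate, value) pairs."""
    pairs = [("dc:title", publication["title"])]
    for creator in publication.get("creators", []):
        if "family_name" in creator:
            # Persons are rendered as "<family-name>, <given-name>".
            value = f'{creator["family_name"]}, {creator["given_name"]}'
        else:
            value = creator["organization_name"]
        pairs.append(("dc:creator", value))
    # dc:date is a single element: take the first date available by priority.
    dates = publication.get("dates", {})
    for kind in DATE_PRIORITY:
        if kind in dates:
            pairs.append(("dc:date", dates[kind]))
            break
    pairs.append(("dc:type", publication["genre"]))
    # One dc:format per file, so counting the entries yields the number of files.
    for mime_type in publication.get("file_mime_types", []):
        pairs.append(("dc:format", mime_type))
    return pairs


# Sample publication (invented for illustration).
publication = {
    "title": "The Title",
    "genre": "article",
    "creators": [
        {"family_name": "Smith", "given_name": "Joe"},
        {"organization_name": "FIZ Karlsruhe"},
    ],
    "dates": {"created": "2007-01-10", "issued": "2007-09-01"},
    "file_mime_types": ["application/pdf", "text/xml"],
}
record = short_display_record(publication)
number_of_files = sum(1 for predicate, _ in record if predicate == "dc:format")
```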

Requirements, Item List Medium Display

The article PubMan Display specifies the bibliographic information required for displaying lists of publication items. The following list of elements has been copied from this article.

  1. Publication.identifier (mapped to -> dc:identifier)
  2. Publication.Title (mapped to -> dc:title)
  3. Publication.Creator[role]
    (mapped to dc:creator if role = author, otherwise to dc:contributor) -- Frank
    (alternatively mapped to dc:contributor = <person/family-name>, <person/given-name>, with role as refinement as suggested by http://www.loc.gov/loc.terms/relators/) -- Inga 22:04, 10 March 2008 (CET)
    1. Publication.creator.person.organization (no DC mapping exists)
      Any chance to support nested information, e.g. <dc:creator><rdf:Description><vcard:fn>Smith, Joe</vcard:fn><vcard:org>FIZ Karlsruhe</vcard:org></rdf:Description></dc:creator>?
    2. Publication.creator.organization (no DC mapping exists)
      (mapped to dc:creator = <organization-name> if role = author, otherwise to dc:contributor) -- Frank
      (alternatively mapped to dc:contributor, see above) -- Inga 22:04, 10 March 2008 (CET)
  4. Publication dates (mapped to -> dc:date)
    1. dcterms:created - date created
    2. dcterms:modified - date modified
    3. dcterms:dateSubmitted - date submitted
    4. dcterms:dateAccepted - date accepted
    5. published-online - date published online
    6. dcterms:issued - date published in print
  5. Publication.Genre (mapped to dc:type)
  6. Publication.Degree, e.g. Habilitation, Diploma, etc.
  7. Publication.Publishing-Info
    1. publication.publishing-info.publisher (already in publication metadata mapped as dc:publisher)
    2. publication.publishing-info.place
    3. publication.publishing-info.edition
  8. Publication.total-number-of-pages (can be mapped as "dcterms:extent")
  9. publication.event.title
  10. publication.source
    1. publication.source.title
    2. publication.source(@type) (is genre of the publication)
    3. Publication.source.creator[role] (cannot be fully supported in DC)
    4. publication.source.volume
    5. publication.source.issue
    6. publication.source.publishing-info.publisher (already in publication metadata mapped as dc:publisher)
    7. publication.source.publishing-info.place
    8. publication.source.publishing-info.edition
    9. publication.Source.Identifier -> dc:identifier
  11. item.components (File information, gives the number of attached files -> @FIZ: We need links to content components. Thus we will be able to produce the number on application side.--Natasa 13:27, 17 December 2007 (CET))
  12. publication.review-method
  13. item.username-created-by (check if name is possible)
  14. item.context-name (to be derived from context id)
  15. item.date-last-modified
  16. item.status
  17. item.id
  18. item.pid

There are two types of item lists we can retrieve from the core services:

  • as output of the Searching service (sorting, limit and offset possible): search-result.xsd
    Sorting requirements for search results: the same as those specified above.
  • as output of ItemHandler.retrieveItems (sorting, limit and offset not possible yet): item-list.xsd
    Sorting requirements for filter results: the minimum sorting options are the same as the available filter criteria.

To clarify:

  • From where does the search service derive the items in search-result.xml?
  • Must the "flattened" metadata in the resource index be pure DC or qualified DC, or may it also include metadata from the escidoc namespace?
  • If the resource index is queried for item lists, what is the cost of including limit/offset and sorting by metadata that are in the resource index?

Discussion of Requirements from Issue 323

  • "... this [is] relevant not only for search results but also for filtered item lists."
    • Is "search results" related to SearchAndBrowse?! Frank 15:28, 10 October 2007 (CEST)
Yes, search results come from search and browse - search results at the moment are only from released items. Filters include also non-released items. --Natasa 09:46, 11 October 2007 (CEST)
  • Sorting should be possible by resource properties "Context Name" and "Owner Name" (created-by name?).
    • The names of resources like user and context are not properties of an item or container and therefore not included in the mentioned triplestore based RDF representation. Frank 15:28, 10 October 2007 (CEST)
Question: but the Context resources are created in Fedora and have their own properties (i.e. context name, description etc.). So why not have the same behaviour for all resources in this case? I.e. all properties and some additional metadata (if they exist) would also be stored in the triple store. Would this kind of sorting (with some more complex queries in the triple store) then be possible as well? --Natasa 09:53, 11 October 2007 (CEST)
I just want to mention that sorting by names of referenced resources indeed requires "complex queries". Then we have special cases which are not covered by metadata mapping or resource properties. The context's name is of course stored in the triplestore, but as a property of the context and not as a property of the item. Such a query may be better covered by SearchAndBrowse?! Frank 13:15, 11 October 2007 (CEST)
  • Taking a closer look at our current solution: there is no need to sort released items by context name when they are retrieved via Search/Browse (as the contexts are nothing that is published, they are only the administrative containers). On the other hand, maybe we should think harder, on the solution side, about how we present the item lists to our users. In general, if the user would like to see items grouped by context, that is fine. Then in the solution we first need to query contexts (ordered by name) and then, for each context, query the respective items (by use of a filter). The same would go for sorting by "created-by" name.
  • is a combination necessary? i.e. sort by context name and then within the context name by owner name or vice versa? --Natasa 14:06, 26 October 2007 (CEST)


  • There are sorting rules pertaining to the order of characters.
    • ITQL allows an "order by" clause. I doubt the order of characters is configurable (to be checked). Frank 15:28, 10 October 2007 (CEST)
Probably the triple store implementation should be checked. Normally there should be the possibility to specify the collating order.
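
The two-step approach discussed above (first query contexts ordered by name, then query the items of each context) can be sketched in Python as follows. This is an illustration on in-memory data only: the functions collation_key and items_grouped_by_context and the sample data are hypothetical, and the collation key is a simplistic stand-in for a configurable collating order, not what a triple store provides.

```python
import unicodedata


def collation_key(name):
    """Simplistic stand-in for a configurable collating order:
    compare case-insensitively and ignore diacritics."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def items_grouped_by_context(contexts, items):
    """Step 1: order contexts by name. Step 2: per context, filter its items."""
    grouped = []
    for context in sorted(contexts, key=lambda c: collation_key(c["name"])):
        members = [i for i in items if i["context"] == context["id"]]
        members.sort(key=lambda i: collation_key(i["title"]))
        grouped.append((context["name"], [i["title"] for i in members]))
    return grouped


# Sample data (invented for illustration).
contexts = [
    {"id": "escidoc:ctx2", "name": "Überblick"},
    {"id": "escidoc:ctx1", "name": "test collection"},
    {"id": "escidoc:ctx3", "name": "Archive"},
]
items = [
    {"title": "b-item", "context": "escidoc:ctx1"},
    {"title": "A-item", "context": "escidoc:ctx1"},
    {"title": "c-item", "context": "escidoc:ctx3"},
]
grouped = items_grouped_by_context(contexts, items)
```

In a real solution each step would be a separate request (contexts first, then one filter request per context), which avoids the "complex queries" against the triplestore at the price of more round trips.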

Request Syntax

OpenSearch

One idea is to use OpenSearch to select items and represent a list of selected items.

OpenSearch defines a format to describe search engines and four search metadata elements ("totalResults", "startIndex", "itemsPerPage" and "Query") to be included in XML response documents.

An OpenSearch description document describes a search interface by means of URL templates. Suggested kinds of search requests are HTTP GET and HTTP POST requests. A request template contains parameters that are replaced by specific values before a request is executed. OpenSearch defines some basic search parameters and allows the use of custom parameters. A description document as well as an OpenSearch response may contain detailed descriptions of the capabilities of the search engine (e.g. which custom parameters may be used).
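
To make the template mechanism concrete, here is a minimal Python sketch of filling an OpenSearch URL template. In OpenSearch templates, parameters are written in curly braces, a trailing "?" marks a parameter as optional, and custom parameters carry a namespace prefix. The function fill_template, the example URL and the custom parameter escidoc:context are hypothetical and only serve as an illustration.

```python
import re
from urllib.parse import quote


def fill_template(template, values):
    """Replace {name} and {name?} placeholders in an OpenSearch URL template."""
    def substitute(match):
        name, optional = match.group(1), match.group(2) == "?"
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters without a value become empty
        raise KeyError(f"missing required parameter {name!r}")

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)


# Hypothetical template; "escidoc:context" stands for a custom, prefixed parameter.
TEMPLATE = ("http://localhost:8080/search?q={searchTerms}"
            "&start={startIndex?}&n={count?}&ctx={escidoc:context?}")

url = fill_template(TEMPLATE, {"searchTerms": "solar physics", "count": 10})
```

The question raised below, whether parameters like 'created-by' or 'context' are solution specific, corresponds to which prefixed custom parameters such a template would declare.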

If OpenSearch should be used for eSciDoc - e.g. between PubMan and the ObjectManager - some questions arise:

  • Where should a description document be located?
  • Should a search include custom parameters like 'created-by', 'context' etc. and aren't such parameters solution specific?
  • Is OpenSearch a solution task at all? The framework offers search capabilities with SearchAndBrowse and the ObjectManager's filter methods, and a solution may describe the search and generate customized responses in a solution-specific way (e.g. to generate a PubMan-Search-Firefox-Plugin).

OpenSearch defines elements that enable autodiscovery of search engines. This implies the usage of generic search clients. Firefox is able to automatically discover search engines by pertinent elements in websites and generate a search field, including autocompletion if provided by the search interface, in the browser's toolbar. The situation between PubMan (or other solutions) and "the framework" does not fit such features. The filter methods of the ObjectManager are described as part of the API documentation and used by solution developers. Autodiscovery seems to be something that a solution may offer to its users.

As an example you may have a look at http://solarphysics.livingreviews.org/refdb/search which links to the OpenSearch description here http://solarphysics.livingreviews.org/refdb/search?format=osd Search results look as follows: http://solarphysics.livingreviews.org/refdb/search?journal=lrsp&a=Abart%2C+R.&tM=substring&js=lrsp&s=yearDesc&l=10&_action_search=Search&format=rss

Note: OpenSearch does not lead to a specific result format. To talk about OpenSearch is to talk about search engine discovery and generic query generation etc. --Frank 13:37, 1 October 2007 (CEST)
Yes. My point when mentioning OpenSearch was: If lists of items are delivered in a format commonly used with OpenSearch (like RSS), support for OpenSearch for search engine discovery will be as cheap as creating the description in addition. --Robert

SRU/W?

Another idea would be to offer an SRU/SRW interface to request and return a list of items.

Result Format / Representation

RSS 1.0

RSS version 1.0 supports extension with custom-made elements because it is based on RDF. Conversely, it would be possible to enrich an RDF response from the resource index (triplestore) with RSS 1.0 predicates. RSS 1.0 may be interpreted both in a semantic-web context and in a newsfeed context.

An RSS 1.0 document is an RDF/XML document which contains an 'items' entry with an RDF sequence containing references to item entries.

<rdf:li rdf:resource="http://localhost:8080/ir/item/escidoc:234"/>

The value of the resource attribute is a URI that MAY be the URL of the entity but IS the identifier of an item entry stated later in the same document.

<rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
  <rss:title>Title of the referred resource</rss:title>
  <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
  <rss:description>Description from the referred resource</rss:description>
</rss:item>
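
The relation between the sequence entries and the item entries can be sketched with a small Python reader. This is an illustration only: the surrounding channel element, the function list_items and the sample document are assumptions made to obtain a complete, parseable RSS 1.0 document; only the rdf:li and rss:item fragments are taken from this page.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}
RDF = "{%s}" % NS["rdf"]

# Minimal sample document; the channel part is assumed for completeness.
DOCUMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rss="http://purl.org/rss/1.0/">
  <rss:channel rdf:about="http://localhost:8080/ir/items">
    <rss:title>Item list</rss:title>
    <rss:link>http://localhost:8080/ir/items</rss:link>
    <rss:description>Example list</rss:description>
    <rss:items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://localhost:8080/ir/item/escidoc:234"/>
      </rdf:Seq>
    </rss:items>
  </rss:channel>
  <rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
    <rss:title>Title of the referred resource</rss:title>
    <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
    <rss:description>Description from the referred resource</rss:description>
  </rss:item>
</rdf:RDF>
"""


def list_items(xml_text):
    """Return (title, link) pairs in the order given by the rdf:Seq."""
    root = ET.fromstring(xml_text)
    # Index item entries by their rdf:about identifier.
    by_uri = {item.get(RDF + "about"): item
              for item in root.findall("rss:item", NS)}
    result = []
    for li in root.findall("rss:channel/rss:items/rdf:Seq/rdf:li", NS):
        item = by_uri[li.get(RDF + "resource")]
        result.append((item.findtext("rss:title", namespaces=NS),
                       item.findtext("rss:link", namespaces=NS)))
    return result


entries = list_items(DOCUMENT)
```

The order of the list comes from the sequence, not from the position of the item entries in the document, which is why the lookup goes through the rdf:resource identifier.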

An item entry may be extended by predicates from other contexts (ontologies, namespaces) than RSS. E.g. with DC metadata:

<rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
  <rss:title>Title of the referred resource</rss:title>
  <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
  <rss:description>Description from the referred resource</rss:description>
  <dc:title>The Title</dc:title>
  <dc:identifier>hdl:12345/102030</dc:identifier>
</rss:item>

If an eSciDoc item list is an RSS 1.0 document the values of the rss predicates title, link and description are retrieved from the appropriate resource index entries. Additionally all DC entries from the resource index related to the item may be added.
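
A possible way to assemble such an item entry is sketched below in Python. The sketch assumes that the DC entries of one resource have already been read from the resource index as (element, value) pairs, and that rss:title and rss:description are taken from dc:title and dc:description; this page does not state which index entries feed the rss predicates, so that choice, like the function names first and rss_item, is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("rdf", RDF_NS), ("rss", RSS_NS), ("dc", DC_NS)):
    ET.register_namespace(prefix, uri)  # keep readable prefixes on output


def first(entries, name, default=""):
    """Return the first value recorded for a DC element, or a default."""
    return next((value for n, value in entries if n == name), default)


def rss_item(uri, dc_entries):
    """Build one rss:item from the (dc-element, value) pairs of a resource."""
    item = ET.Element(f"{{{RSS_NS}}}item", {f"{{{RDF_NS}}}about": uri})
    # ASSUMPTION: title/description are derived from dc:title/dc:description.
    ET.SubElement(item, f"{{{RSS_NS}}}title").text = first(dc_entries, "title")
    ET.SubElement(item, f"{{{RSS_NS}}}link").text = uri
    ET.SubElement(item, f"{{{RSS_NS}}}description").text = first(dc_entries, "description")
    # Additionally add all DC entries related to the item.
    for name, value in dc_entries:
        ET.SubElement(item, f"{{{DC_NS}}}{name}").text = value
    return item


entries = [("title", "The Title"), ("identifier", "hdl:12345/102030")]
item = rss_item("http://localhost:8080/ir/item/escidoc:234", entries)
xml_text = ET.tostring(item, encoding="unicode")
```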

It may be possible to add entries from the mpdl metadata datastream converted to RDF statements. Because the document is an RDF document, it is possible to insert the non-flat structure of these metadata by means of additional resources/entities defined in the mpdl metadata context (ontology?), but newsfeed readers will probably not understand that structure.

<!-- ... -->
<rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rss:title>Title of the referred resource</rss:title>
  <rss:link>http://localhost:8080/ir/item/escidoc:234</rss:link>
  <rss:description>Description from the referred resource</rss:description>
  <mpdl:creator rdf:resource="http://some/uri"/>
</rss:item>

<rdf:Description rdf:about="http://some/uri"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rdf:type rdf:resource="http://escidoc.mpg.de/metadataprofile/schema/0.1/types/creator"/>
  <mpdl:role>author</mpdl:role><!-- note: here role is qualified (with namespace) -->
  <mpdl:complete-name>Arno Schindlmayr</mpdl:complete-name>
  <mpdl:organization rdf:resource="http://some/other/uri"/>
</rdf:Description>

<rdf:Description rdf:about="http://some/other/uri"
    xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rdf:type rdf:resource="http://escidoc.mpg.de/metadataprofile/schema/0.1/types/organization"/>
  <mpdl:organization-name>Forschungszentrum</mpdl:organization-name>
  <!-- ... -->
</rdf:Description>
<!-- ... -->
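
A client that does understand this structure has to follow the rdf:resource references from the item to the creator and on to the organization. The following Python sketch shows one way to do that; the function creators_of and the enclosing rdf:RDF root are assumptions, while the element names are those of the fragment above.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
MPDL = "{http://escidoc.mpg.de/metadataprofile/schema/0.1/types}"

# The fragment from above, wrapped in an assumed rdf:RDF root element.
DOCUMENT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rss="http://purl.org/rss/1.0/"
         xmlns:mpdl="http://escidoc.mpg.de/metadataprofile/schema/0.1/types">
  <rss:item rdf:about="http://localhost:8080/ir/item/escidoc:234">
    <rss:title>Title of the referred resource</rss:title>
    <mpdl:creator rdf:resource="http://some/uri"/>
  </rss:item>
  <rdf:Description rdf:about="http://some/uri">
    <mpdl:role>author</mpdl:role>
    <mpdl:complete-name>Arno Schindlmayr</mpdl:complete-name>
    <mpdl:organization rdf:resource="http://some/other/uri"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://some/other/uri">
    <mpdl:organization-name>Forschungszentrum</mpdl:organization-name>
  </rdf:Description>
</rdf:RDF>
"""


def creators_of(xml_text, item_uri):
    """Resolve creator and organization references of one item entry."""
    root = ET.fromstring(xml_text)
    # Index every top-level entry by its rdf:about identifier.
    by_uri = {node.get(RDF + "about"): node
              for node in root if node.get(RDF + "about")}
    result = []
    for ref in by_uri[item_uri].findall(MPDL + "creator"):
        creator = by_uri[ref.get(RDF + "resource")]
        org_ref = creator.find(MPDL + "organization")
        org = by_uri.get(org_ref.get(RDF + "resource")) if org_ref is not None else None
        result.append({
            "name": creator.findtext(MPDL + "complete-name"),
            "role": creator.findtext(MPDL + "role"),
            "organization": (org.findtext(MPDL + "organization-name")
                             if org is not None else None),
        })
    return result


creators = creators_of(DOCUMENT, "http://localhost:8080/ir/item/escidoc:234")
```

A plain newsfeed reader would skip the mpdl elements entirely, so the nested information is only available to clients written against this structure.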

See ESciDoc_Container_Toc.

Questions & Discussion

  • How to version the changes required if a change in the transformation requires a change of all items of a specific content type?
The proposed solution can be treated as a "temporary workaround" and as a test for placing all metadata into proper storage later on. At present, the "list metadata" are part of the Fedora system datastream (DC metadata) that is "pushed" into the relational database. Note: only non-qualified DC metadata is allowed. --Natasa 12:32, 26 September 2007 (CEST)
  • To Do: think of "moving" the "list metadata" (or all descriptive metadata) into an RDF storage (this will take substantial rework, but will also give substantial benefits for interoperability, discovery and, of course, performance etc.) --Natasa 12:32, 26 September 2007 (CEST)
The idea behind using the DC datastream is that its entries are automatically written to the triplestore, isn't it? It would be nice to write all entries from the solution-specific metadata to the triplestore, but then they must be transferred into a flat, well-known structure. The DC mapping is one approach to do that. -- Frank
The entries from the DC datastream are automatically extracted from the DC datastream of a Fedora object and inserted into the Fedora resource index. And yes, that is actually the problem with the Fedora DC support. Our metadata set is much richer, so if we only extract DC we have only basic (not qualified) DC metadata. And our metadata schemas are not flat. --Natasa 10:36, 28 September 2007 (CEST)
Trying to summarize: It would be nice to use DC metadata to enrich list views because it contains the core metadata of a resource and is written to the resource index. Unfortunately, not all entries from mpdl-escidoc-metadata can be mapped to DC. Another idea is to write the mpdl-escidoc-metadata to the resource index, however we do that.
Question: Is it possible to bring mpdl-escidoc-metadata into a flat structure in order to write the entries to the resource index, or is that nearly the same problem as mapping it to DC? Frank 13:30, 1 October 2007 (CEST)
  • To check: Are standard DC data sufficient to create the information required to display item lists, i.e. are all required data for sorting and displaying available in DC core? A transformation "escidoc publication->dc simple" is available in the specification svn, zim01.gwdg.de
    • If not: To change the functional specification or to "disemploy" the standard?
      • or think of something else like keeping the standard but moving to metadata RDF storage? --Natasa 10:36, 28 September 2007 (CEST)
  • Should a change of the "mini item" data stream
    1. create a new logical version?
    2. create no new logical version?
To consider for a start: If the metadata presented in the "list-metadata" are not sufficient and a new transformation is needed, this does not (from the point of view of content versioning) change the version of the content item; it just provides a new "enriched view" on the content item. In this case, if really necessary, a utility should be developed that modifies the content items (without creating new versions of them), as this is redundant information derived from the actual metadata of the item. --Natasa 12:32, 26 September 2007 (CEST)

First Prototype

We have a prototype implementation of the DC mapping and of the retrieval of a "short-item-list". The method retrieveItems may be called with an additional element "format" in the taskParam. The default format is still "full"(-object-list). If format is set to "short", an RDF document is delivered that contains selected triplestore entries for every item. The delivered short format is an unfinished example. Some elements may be renamed or removed before delivery. The short format is based on the retrieveItems implementation for full objects and offers no performance advantage. In order to demonstrate the possibility of transforming the basic short format, the additional format "atom" has been added temporarily.
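
As a rough illustration of the request side, the following Python sketch builds a taskParam that carries the "format" element. Only the element name "format" and the values "full", "short" and "atom" come from the description above; the root element name "param", the function build_task_param and the absence of any filter elements are assumptions, so the exact XML should be checked against the API documentation.

```python
import xml.etree.ElementTree as ET

# Format values named in the prototype description above.
KNOWN_FORMATS = ("full", "short", "atom")


def build_task_param(list_format="full"):
    """Build a taskParam string with a format element (assumed layout)."""
    if list_format not in KNOWN_FORMATS:
        raise ValueError(f"unknown list format {list_format!r}")
    param = ET.Element("param")  # ASSUMPTION: root element name of the taskParam
    ET.SubElement(param, "format").text = list_format
    return ET.tostring(param, encoding="unicode")


task_param = build_task_param("short")
```

The resulting string would be passed to retrieveItems in place of the usual taskParam.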