ESciDoc As Archive


Abstract
idea: Provide an archiving service on top of eSciDoc using techniques proposed by OAI-ORE and SWORD.

eSciDoc could be turned into an archive by allowing applications to POST named graphs (or just their names) as an easy way to deposit material that already exists on the web.


 * this remark is also stated below: to avoid misunderstanding about archive and repository, we should probably talk about an ingestion service rather than an archiving service, as the project has so far developed different understandings of archival (LTA preservation) and depositing (ingesting into the repository).--Natasa 10:18, 13 November 2007 (CET)

Named Graphs
The semantic web idea of named graphs was recently picked up by OAI-ORE as a means to specify boundaries/sets of digital objects. In our context, named graphs are basically site maps, identified by a URI.

Depositing
To deposit a compound digital object on escidoc, an application POSTs a named graph to a suitable escidoc service. The invoked service is responsible for retrieving the constituents referenced in the named graph and turning them into components of the item representing the named graph.
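
The deposit step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Component`, `Item` and `deposit` are hypothetical and not part of any eSciDoc API, the named graph is assumed to arrive as a name plus a flat list of (URL, mime-type) pairs, and retrieval is delegated to a `fetch` function supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Component:
    """One constituent of the named graph, stored with its mime-type."""
    source_url: str
    mime_type: str
    content: bytes


@dataclass
class Item:
    """Hypothetical item representing one deposited named graph."""
    name: str
    components: List[Component] = field(default_factory=list)


def deposit(
    name: str,
    resources: List[Tuple[str, str]],
    fetch: Callable[[str], bytes],
) -> Item:
    """Retrieve every referenced resource and attach it as a component.

    `fetch` is injected (an assumption of this sketch) so that the
    retrieval mechanism, e.g. plain HTTP GET, stays replaceable.
    """
    item = Item(name=name)
    for url, mime_type in resources:
        item.components.append(Component(url, mime_type, fetch(url)))
    return item
```

In a real service `fetch` would perform the HTTP retrieval; passing it in keeps the sketch independent of any particular HTTP library.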


 * a named graph can consist of other named graphs: this is something that cannot always simply be put into an item and item components.

A named graph by itself, together with the mime-type of the content represented by the graph nodes, would probably not be sufficient.

One would still like to have "typed named graphs" (i.e. in the escidoc case, a content model for named graphs as well) to be able to manage them and provide some nicer functionality rather than archive/retrieve functionality alone. --Natasa 17:43, 9 November 2007 (CET)

Versioning
Posting a graph for an existing name will create a new revision of the corresponding item.

Authentication/Authorization
The escidoc archiving service will only authenticate posting (registered) applications (possibly by certificate). If the graph references access-controlled resources, the controlling application will have to support OAuth to grant the archiving service access to the resources in question.

Archived items have one of two access levels:


 * public (in case none of the referenced resources is access controlled)
 * private (to the posting application), in case at least one referenced resource is access controlled.
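
The rule for the two access levels is simple enough to state as a function. The sketch below assumes the service already knows, for each referenced resource, whether it is access controlled; the function name `access_level` and the boolean input are illustrative choices, not an existing interface.

```python
from typing import Iterable


def access_level(access_controlled_flags: Iterable[bool]) -> str:
    """Return "private" if at least one referenced resource is access
    controlled, otherwise "public" (including for an empty graph)."""
    return "private" if any(access_controlled_flags) else "public"
```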

Retrieval
Archived items can be retrieved by name or by name and date. In the latter case, the revision of the item current at the given date will be returned.
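
Versioning and retrieval by date could be modelled as below. This is a minimal in-memory sketch, not a description of how eSciDoc stores revisions: the class `RevisionStore` and its methods are hypothetical, and it assumes that revisions of a name are posted in chronological order.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple


class RevisionStore:
    """Keeps every posted graph per name, ordered by posting time."""

    def __init__(self) -> None:
        # name -> list of (timestamp, graph), sorted by timestamp
        self._revisions: Dict[str, List[Tuple[datetime, Any]]] = {}

    def post(self, name: str, graph: Any, when: datetime) -> int:
        """Store a new revision and return its revision number (1-based)."""
        revisions = self._revisions.setdefault(name, [])
        if revisions and when < revisions[-1][0]:
            raise ValueError("revisions must be posted in chronological order")
        revisions.append((when, graph))
        return len(revisions)

    def get(self, name: str, at: Optional[datetime] = None) -> Any:
        """Return the latest revision, or the one current at time `at`.

        Raises KeyError for an unknown name and LookupError if no
        revision existed yet at `at`.
        """
        revisions = self._revisions[name]
        if at is None:
            return revisions[-1][1]
        timestamps = [timestamp for timestamp, _ in revisions]
        index = bisect_right(timestamps, at)
        if index == 0:
            raise LookupError("no revision existed at the given date")
        return revisions[index - 1][1]
```

Posting twice under the same name yields revisions 1 and 2; `get(name, at)` with a date between the two postings returns the first one.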

Services for archived items
Since archived items are basically a graph of mime-typed components, eSciDoc is free to offer all kinds of services applicable to the respective mime-types.

An example of such a service would be converting RDF+XML typed components to JSON suitable as input for Exhibit, or aggregating all data from RDF components and offering faceted browsing on it via Longwell or similar technologies.
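
One way to picture mime-type driven services is a registry that maps a mime-type to the functions applicable to it. The sketch below is purely illustrative: the decorator `service_for`, the lookup `applicable_services` and the toy `word_count` service are assumptions made for this example and stand in for real services such as an RDF to JSON conversion.

```python
from typing import Callable, Dict, List

# mime-type -> services registered for that mime-type
_services: Dict[str, List[Callable[[bytes], object]]] = {}


def service_for(mime_type: str):
    """Decorator registering a function as a service for a mime-type."""
    def register(func: Callable[[bytes], object]):
        _services.setdefault(mime_type, []).append(func)
        return func
    return register


@service_for("text/plain")
def word_count(content: bytes) -> int:
    """Toy service: count whitespace-separated words in a text component."""
    return len(content.decode("utf-8").split())


def applicable_services(mime_type: str) -> List[str]:
    """Names of all services registered for the given mime-type."""
    return [func.__name__ for func in _services.get(mime_type, [])]
```

New services can then be added per mime-type without touching the archived items themselves.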

An Example
An application like Living Reviews in Relativity could provide full-text search and long-term archiving for its publications by posting

 http://www.livingreviews.org/lrr-2005-1
 |- http://relativity.livingreviews.org/Articles/lrr-2005-1/metadata.rdf (application/rdf+xml)
 |- http://relativity.livingreviews.org/Articles/lrr-2005-1/download/lrr-2005-1Color.pdf (application/pdf)

to the archiving service. The corresponding item would thus have two components, creator "Living Reviews in Relativity", ...


 * --Natasa 10:08, 13 November 2007 (CET)

To my understanding, what we have here is a combined use of:


 * ingestion service (see svn )
 * itemhandler service (create item, create metadatarecord, create component)

The ItemHandler service allows posting of an item, item metadata and components via separate methods.

The ingestion service is not yet implemented, nor fully specified (an older draft can be found in svn), but when we discussed this service we were thinking of "data pushing" functionality, i.e. a link (in this case we can consider it to be the named graph) to the content (both metadata and fulltext) is given, and the content is downloaded from a known ingestion source (in the above example the source would be LivRev). The system is then "pulling" the data from the source and creating appropriate items and components. This service is not specified in detail so far. We can put the ingestion service on the agenda as a topic for the upcoming Developer workshop on 17.12.2007.

Benefits
Using escidoc as an archive as described above has the following advantages (over the more tightly coupled framework-solution model currently pursued):

1. Decoupling of escidoc framework and solutions: Most framework services (search, PID, LTA) are still usable, but without coupling the solutions' internal data representation to framework objects. I.e. the benefits of escidoc can be realized in an unobtrusive way by a wider range of applications.

2. Very simple depositing service.

3. Very simple versioning for objects on the web.

4. Better integration of escidoc with the web as a whole:


 * components are simply typed by mime-type (escidoc may become smarter over time in how to interpret particular mime-types)
 * instead of having to specify content models in escidoc, reusing (or registering) mime-types would suffice.


 * would this mean that when people search for an article of a specific journal they can simply search for mime-type "application/pdf"?--Natasa 17:54, 9 November 2007 (CET)
 * No. They would probably search for the article title/author/journal - all of which will be indexed as part of the PDF component, or better yet, as part of a metadata component. Robert 10:54, 10 November 2007 (CET)

5. Making escidoc services available to legacy systems: Legacy systems would just have to POST a sitemap to escidoc. In case the escidoc representation of the application's data is sufficient to provide the application's functionality, this will additionally provide a way to migrate web applications while keeping the data intact.
 * this looks a bit like the "cataloging" service envisioned before, where a catalog is a "published collection" of "things". A "thing" can be a native eSciDoc object or simply a link to an external object wherever. The "external object" in this case can be in addition another "named graph" to my understanding.--Natasa 17:54, 9 November 2007 (CET)
 * The difference may be that I'd like referenced objects (in the named graph) to be imported into escidoc by the service. Robert 10:54, 10 November 2007 (CET)

6. Extensibility: Since the components of archived items will simply be mime-typed content, they can always be manipulated by suitable known methods (e.g. XHTML can be rendered by browsers, ...). Additional interpretation of the data remains with the posting application. Therefore, the application can change/extend its data model without breaking escidoc functionality.
 * in fact, that is how Fedora works - so one does not need to touch eSciDoc content models or use eSciDoc interfaces etc. on top of Fedora to make this happen. --Natasa 17:54, 9 November 2007 (CET)
 * I do not propose to change escidoc content models or interfaces - just to add a service, which - as you say - may easily be implemented because Fedora does provide most of the needed functionality. Robert 10:54, 10 November 2007 (CET)

Conclusion
Using the escidoc framework as an archive for named graphs will make it possible to realize most of its current benefits (search, LTA, PID) without having to switch to a new model of publishing objects on the web. In particular, the possibility to retrofit existing applications to use escidoc as an archive will help greatly with its adoption and content aggregation.


 * --Natasa 17:40, 9 November 2007 (CET)


 * this proposal seems like a complete switch of the escidoc idea from an SOA and a framework (that is, not only core services but also other services) to an archive for LTA. This is not the purpose of escidoc.
 * Note: This is proposing an archival service - thus something very fitting for an SOA and a framework. Robert 10:48, 10 November 2007 (CET)


 * relying solely on mime-types would not offer much more than content retrieval. There are still different content models that need to be supported.
 * Of course. This proposal is not aiming at replacing all other escidoc services. Robert 10:48, 10 November 2007 (CET)


 * posting named graphs and retrieving objects of the named graphs is still an idea that needs to be explored - as a named graph does not always equal an escidoc item (and its components), but can also be e.g. a container object that references other items
 * This proposal is about using very simple named graphs for very simple purposes, which does not preclude further exploration in different contexts. Robert 10:48, 10 November 2007 (CET)


 * note that escidoc is not an archive and is not a persistence layer; also, escidoc is neither the framework only nor pubman only
 * If escidoc is not an archive only, may it still be an archive among other things? Robert 10:49, 10 November 2007 (CET)
 * eSciDoc is not an archive, it is a repository. In terms of archiving, eSciDoc should establish interfaces with LTA partners in the future. Therefore I would prefer distinguishing the terms archive and repository (i.e. archival, depositing) for functional reasons.


 * Additional information on current activities: we have started the "named graph" exploration with a "smaller" issue for the moment, authority file handling for journals. We decided to model journals in the Mulgara RDF store and see how it works. That would be one step towards better understanding the whole issue before we simply turn our items into "named graphs", i.e. turn the current item metadata representation into RDF, which would actually enable us to represent each repository item as a "named graph".
 * I think this is a misunderstanding. I don't want to turn repository items into named graphs. I just want to use named graphs as a format or a specification for what an application wants to store/archive in escidoc. The item type created by this archival service may be called ArchivalItem or whatever, and the item representation will not have to be changed at all. Robert 10:48, 10 November 2007 (CET)
 * To avoid the misunderstanding, why not simply talk about depositing into the repository and ingesting items into the repository? --Natasa 10:13, 13 November 2007 (CET)