Trip Report: Dublin Core 2013

MPDL

Event: Dublin Core 2013 Participants: Andrea Wuchner Conference Site:

=Summaries=

=Tutorial on Metadata Provenance (Kai Eckert, Mannheim University)=

Definitions

 * Provenance: Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.


 * Metadata: Data that provides information about other data. Metadata is structured data that is used to describe the properties of a resource.


 * Resource Description Framework (RDF): All things described by RDF are called resources and are instances of the class rdfs:Resource. This is the class of everything. All other classes are subclasses of this class.


 * RDF Resources: RDF URI Reference: URIs are globally unique and every URI identifies one and only one resource. Literal: Identifies values such as numbers and dates, typed or plain. Blank node: A resource that exists, but is not identified by a URI.


 * Statement: Information about resources is expressed in statements about the resource. A statement is a triple of subject, predicate and object. It generally describes one property of one identifiable resource by assigning a value. The subject is always a resource (a blank node or identified by a URI). The object can be another resource or a literal.
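The constraints on a statement can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names `URIRef`, `BlankNode`, `Literal` and the helper `make_statement` are made up for this note and do not refer to any particular RDF library.

```python
from typing import NamedTuple, Union


class URIRef(str):
    """A resource identified by a URI."""


class BlankNode(str):
    """A resource that exists but has no URI, only a local label."""


class Literal(NamedTuple):
    """A value such as a number or a date; an empty datatype means plain."""
    value: str
    datatype: str = ""


class Statement(NamedTuple):
    subject: Union[URIRef, BlankNode]
    predicate: URIRef
    obj: Union[URIRef, BlankNode, Literal]


def make_statement(subject, predicate, obj) -> Statement:
    """Build a triple, rejecting shapes the definition above does not allow."""
    if not isinstance(subject, (URIRef, BlankNode)):
        raise TypeError("subject must be a resource (URI or blank node)")
    if not isinstance(predicate, URIRef):
        raise TypeError("predicate must be identified by a URI")
    if not isinstance(obj, (URIRef, BlankNode, Literal)):
        raise TypeError("object must be a resource or a literal")
    return Statement(subject, predicate, obj)


# Hypothetical example data.
title = make_statement(
    URIRef("http://example.org/eiffeltower"),
    URIRef("http://purl.org/dc/terms/title"),
    Literal("Eiffel Tower"),
)
```

A literal in subject position is rejected, while a literal in object position is accepted.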


 * Linked Data:
 * Linked Data Principles: 1) Use URIs as names for things. 2) Use HTTP URIs so that people can look up those names. 3) When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL). 4) Include links to other URIs, so that they can discover more things.
 * Information Resources: Resources that are delivered via the Web: Web pages, images, PDF files, ...
 * Non-information Resources: Resources that are not on the Web: Books, concepts, persons, ...
 * Dereferencing a URI from RDF data: Information resources: depending on content negotiation and using HTTP redirects, the server delivers either the resource itself or information on the resource in RDF format. Non-information resources: using an HTTP 303 redirect, the server delivers information on the resource in RDF format.
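The dereferencing rules can be sketched as a small decision function. This is a minimal sketch, not a server implementation: the function name `dereference`, the `.rdf` suffix for the description document and the short list of RDF media types are assumptions made for illustration.

```python
from typing import Tuple

# Media types treated as "the client wants RDF" (assumed, not exhaustive).
RDF_TYPES = ("application/rdf+xml", "text/turtle")


def dereference(uri: str, is_information_resource: bool, accept: str) -> Tuple[int, str]:
    """Return (HTTP status, URL that is served or redirected to)."""
    wants_rdf = any(media_type in accept for media_type in RDF_TYPES)
    description_url = uri + ".rdf"  # hypothetical naming convention

    if not is_information_resource:
        # A book or a person cannot be sent over the wire:
        # always redirect (303) to an RDF description.
        return 303, description_url
    if wants_rdf:
        # Information resource, but the client asked for RDF about it.
        return 303, description_url
    # Information resource requested as itself: deliver it directly.
    return 200, uri
```

For example, `dereference("http://example.org/eiffeltower", False, "text/html")` yields a 303 redirect to the description, because a non-information resource has no representation of its own.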

 * Metadata in a linked data environment: Metadata on a given resource can now come from many sources, can contain redundant statements, can contain false or contradictory statements, and can be created by many means and processes.
 * One would like to keep track of those statements, but provenance, as defined, only deals with resources. --> Thus we need a notion of metadata as a resource. --> We need metadata provenance: What dataset does a given statement belong to? Who (or what) is responsible for it?

Reification

 * Explanation: RDF offers a way to describe statements: reification. A new resource is used to represent a statement. Subject, predicate and object are properties of this resource. Additional information is added by using additional properties. But: you still need the original statement in the data.
 * Limits: There is no link between a statement and its reification other than matching subject, predicate and object. No grouping is possible, which leads to excessive numbers of statements: e.g. an identical creator for 100 statements leads to 500 additional statements. Reification can be used to talk about specific statements.
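The figure of 500 additional statements can be checked with a short sketch. It assumes the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) plus one dct:creator triple per reified statement; the helper `reify` and the example URIs are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT = "http://purl.org/dc/terms/"


def reify(statement, statement_id, creator):
    """Return the triples needed to attach a creator to one statement."""
    s, p, o = statement
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
        # The actual provenance information: one more triple.
        (statement_id, DCT + "creator", creator),
    ]


# 100 hypothetical statements that all share the same creator.
statements = [
    ("http://example.org/item/%d" % i, DCT + "title", "Title %d" % i)
    for i in range(100)
]

extra = []
for i, statement in enumerate(statements):
    extra.extend(reify(statement, "_:stmt%d" % i, "http://example.org/agent/alice"))

print(len(extra))  # 500
```

Each statement needs four triples to be reified and a fifth to carry the creator, so the overhead grows linearly with the number of statements even though the creator never changes.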

Linked Metadata

 * Usual best practice: Metadata provenance is delivered with the metadata.
 * Method 1: Embedded Linked Metadata: Metadata and metadata provenance are linked by a 303 redirect. Problem: What about the provenance of the provenance? There is no URI for the metadata provenance.
 * Method 2: The Link Header: Metadata and metadata provenance are linked by a 303 redirect. We give the metadata provenance a URI. Problem: How do we tell the server that we want the provenance? Content negotiation does not work any more, as both contents are RDF. Missing: a request header that asks for provenance.
 * Method 3: Additional Statements: Provide a reference to the provenance data. Example: ex:eiffeltower-meta rdfs:seeAlso ex:eiffeltower-metameta. Problem: rdfs:seeAlso is very general. There is no commonly accepted property for provenance.
 * Summary:
 * + Based on Linked Data Principles.
 * + Current "best practice"
 * - Not suitable for provenance on statement level.
 * - Requires full control over web server.
 * - No URI for provenance information.
 * - No accepted provenance retrieval mechanism.
 * --> A good starting point, as every provenance mechanism has to fit with the linked data principles.

Named Graphs

 * Definition:
 * An RDF Graph is a set of RDF triples. A Graph does not contain other graphs.
 * A named graph is an RDF graph with an assigned URI as name. Serialization is possible in TriG.
 * Named Graphs will be part of the RDF 1.1 standard and are supported by SPARQL.
 * A client that fetches linked data via a URI usually stores this URI as the graph URI in a quad store. This is great, because this way we can talk about the fetched RDF data and store provenance in our RDF store. But it is only half way there: we cannot easily re-expose the provenance information, because it is not part of the RDF.
 * RDF Stores: RDF stores today are usually quad stores. Each triple is assigned to a graph via the fourth quad element. If the fourth element contains a URI, the URI is interpreted as the name of the graph that contains all triples with the same graph URI.
 * Summary: You can re-expose graphs with names (e.g. with TriG), but there are no directions on how to interpret the graph URI, and when the TriG file is fetched there is no way to store the graphs inside another graph with the URI of the TriG file.
 * --> Half way there, but still enough room for own decisions and developments.
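How a quad store keeps fetched data apart by graph URI can be sketched as follows. This is a toy in-memory model, not a real triple store: the class `QuadStore`, its method names and the `urn:example:provenance` graph are assumptions made for illustration.

```python
from collections import defaultdict


class QuadStore:
    """Toy quad store: every triple lives in exactly one named graph."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, s, p, o, graph):
        # The fourth element names the graph the triple belongs to.
        self._graphs[graph].add((s, p, o))

    def load_fetched(self, source_uri, triples):
        # Data fetched from a URI is stored under that URI as graph name.
        for s, p, o in triples:
            self.add(s, p, o, graph=source_uri)

    def graph(self, uri):
        return set(self._graphs.get(uri, set()))


store = QuadStore()
source = "http://example.org/eiffeltower-meta"
store.load_fetched(source, [
    ("http://example.org/eiffeltower", "dct:title", "Eiffel Tower"),
])

# Because the graph has a name, we can make statements about it.
# They go into a separate, locally chosen provenance graph.
store.add(source, "dct:creator", "http://example.org/agent/alice",
          graph="urn:example:provenance")
```

The provenance triple uses the graph URI as its subject, which is exactly what makes statements about the fetched metadata possible inside the store. Re-exposing it to others remains the open part.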

OAI-ORE

 * Open Archives Initiative - Object Reuse and Exchange
 * Originally addresses another problem that lacks a solution in RDF: How to make a statement about a resource that is only valid in a special context? Example: The ordering of resources in an aggregation like the ordering of articles in a bibliography.
 * Adaptation for provenance: All statements are provided within such a context; the context can be identified and further described by provenance statements.

Europeana Data Model and DM2E Model

 * Status Europeana: Europeana provides data about cultural heritage objects (CHO) from CH institutions all over Europe. Provenance requirements: Distinguish metadata from different institutions talking about the same (owl:sameAs) resource. Provenance is realized by means of OAI-ORE. Problems: Users have to understand proxies and aggregations. Wouldn't named graphs be nicer?
 * Use of proxies and aggregations:
 * Aggregation: Aggregations are used in Europeana to represent the complex constructs that are provided by contributors. An aggregation is associated with the object that it is about by the property edm:aggregatedCHO. The class Europeana Aggregation aggregates other aggregations (from data providers).
 * Removing the proxies: Proxies are resources for the actual resources. Every data provider has its "own" resource to describe, as a placeholder. But data providers use different URIs for their resources anyway. Linking creates owl:sameAs statements and conflates resources. How do we then reliably maintain different descriptions? We simply use named graphs to distinguish descriptions from different providers.
 * A Named Graph per Resource: Named graphs correspond to the EDM aggregations and are used as first-class members in the model.
 * Problem: If we provide a dump of the full dataset from one provider, we have several named graphs within one resource that form another named graph. --> Nested Graph Problem. RDF does not provide a solution; it is not clear how to deal with such data.

State-ful data

 * Content on web pages can change; such pages are usually state-less. Example: http://example.org/weather/lisbon.
 * By commitment the content of a URL can be kept stable. The URL then represents a specific state; it is state-ful. Example: http://example.org/weather/lisbon/2013-09-02.
 * State-ful URLs make provenance life easier. The URL represents the data, so it can be used to identify the fetched data in local systems without problems.
 * State-less URLs are no show-stopper. But the fact that the data might have changed at the source should be indicated: Use a local state-ful URL for your data. Link to the state-less URL as source, e.g. via dct:source or prov:wasDerivedFrom.
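The recommendation for state-less sources can be sketched as follows. It is an illustration under assumptions: the local base URL, the date-suffix naming scheme and the helper names are invented for this note. Only prov:wasDerivedFrom comes from the text above.

```python
from datetime import date

PROV_WAS_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def snapshot_uri(local_base: str, fetched_on: date) -> str:
    """Mint a local state-ful URL by appending the fetch date."""
    return "%s/%s" % (local_base.rstrip("/"), fetched_on.isoformat())


def snapshot_statement(local_base: str, source_url: str, fetched_on: date):
    """Link the local state-ful copy back to the state-less source."""
    return (snapshot_uri(local_base, fetched_on), PROV_WAS_DERIVED_FROM, source_url)


# Hypothetical local base; the source URL is the example from the notes.
link = snapshot_statement(
    "http://local.example/weather/lisbon/",
    "http://example.org/weather/lisbon",
    date(2013, 9, 2),
)
```

The local URL identifies exactly the data that was fetched on that day, while the link records where it came from.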

Versioning

 * Data always changes. Most applications with state-ful URLs will therefore need versioning.
 * The necessary links to other versions can be included with the data. Use a versioning vocabulary for this purpose.
 * Avoid changing properties in your data.

Preliminaries

 * Several ontologies modelling provenance exist: W3C PROV, Open Provenance Model (OPM), Provenance Vocabulary, Provenir... And we have Dublin Core...

Dublin Core for Provenance

 * Some of the 55 terms contain only information about the resource itself, but not how or when it was produced. --> Descriptive terms.
 * Some terms also contain information on the creation or derivation of the resource --> Provenance Terms (Who, When, How?)
 * Who:
 * Range is dct:Agent (a resource that acts or has the power to act), clearly influencing the creation of a resource.
 * Terms: Contributor, Creator, Publisher, RightsHolder
 * When:
 * Ranges: Date range (available, valid), Single date (all others). Dates are basic provenance information.
 * Terms: Available, Created, Date, DateAccepted, DateCopyrighted, DateSubmitted, Issued, Modified, Valid.
 * How:
 * Information on Derivation and Replacement, Information on relations to other resources, Information on processes involved in creation.
 * dcterms:provenance: "statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity and interpretation." Used for "classic" provenance of works of art.
 * Summary: More than half of the DC terms deal with provenance related information. They cover who, when and how. Missing information is where and why.

PROV Ontology
=The PROV Ontology Tutorial (Daniel Garijo)=
 * Basic constructs: Entities, Activities and Agents.
 * Summary:
 * PROV is a complex abstract model. It is initially harder to use than straightforward dcterms, but is able to express complex relationships.
 * PROV models provenance by describing actions.
 * PROV can easily model the whole lifecycle of a resource.
 * Dublin Core vs. PROV
 * Dublin Core: very distinct roles, implicit and part of the semantics, limited number of terms.
 * PROV: explicit modelling of roles, Subclassing, Qualified classes.

Introduction

 * W3C's Provenance Incubator Group
 * 2009 - 2010
 * Report "Provenance XG Final Report": provides an overview of the various existing approaches and vocabularies, proposes the creation of a dedicated W3C Working Group
 * Introduces requirements for provenance on the web
 * Maps different existing vocabulary approaches to OPM
 * Defines three common use case scenarios for provenance: News Aggregator, Disease Outbreak, Business Contract
 * W3C Provenance Working Group
 * Set up in April 2011
 * aims to express how data has evolved
 * difficulties: requires a complete model describing the various constituents (actors, revisions, etc.), model should be usable with RDF to be used on the Semantic Web, has to find a balance between provenance granularities (simple provenance vs. complex provenance)
 * Applications of Provenance: Art (ownership), Open Information Systems (origin of the data, who was responsible for its creation), Science applications (how the results of a publication were obtained, e.g. scientific workflows), News (origins and references of blogs, news items), Law (licensing attribution of documents, data, privacy information)
 * Provenance is Metadata, but not all metadata is provenance.
 * A lot of work has been done in workflow management systems, databases, knowledge representation and information retrieval.
 * Communities and vocabularies are already in use: DC, OPM, Provenir ontology, Provenance Vocabulary, SWAN provenance ontology, SIOC, VoID etc.
 * The existing models track provenance at different granularities in different domains.
 * How do we make the provenance descriptions interchangeable?
 * How do we integrate these heterogeneous provenance data?
 * Goal is to define a standard way to interchange provenance on the web.
 * Focused on the Semantic Web
 * Documents
 * PROV Overview
 * PROV Primer
 * PROV Data Model
 * PROV Ontology
 * PROV XML Serialization
 * etc.

From Dublin Core to PROV-O: An Example

 * Presentation X
 * was created by Daniel.
 * Kai contributed with feedback.
 * Uses previous tutorials as reference
 * Is a refinement of previous draft.


 * dcProv Presentation X a foaf:Document; dc:title "presentation x"; dct:creator daniel; dct:contributor kai; dct:created "2013-08-25"; dct:replaces draft; dct:references previous tutorial.


 * In PROV daniel would be the prov:Agent and presentation x would be prov:Entity. :makingTheTutorial would be prov:Activity.
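How the Dublin Core description might expand into PROV can be sketched in Python. This is one plausible expansion written for this note, not the official Dublin Core to PROV mapping: typing kai as a prov:Agent, the choice of properties and the helper `dc_to_prov` are assumptions. Only daniel as prov:Agent, presentation x as prov:Entity and :makingTheTutorial as prov:Activity come from the example above.

```python
def dc_to_prov(entity, creator, contributor, activity):
    """Expand 'creator' and 'contributor' into explicit PROV triples."""
    return [
        # The three basic constructs: entity, agents, activity.
        (entity, "rdf:type", "prov:Entity"),
        (creator, "rdf:type", "prov:Agent"),
        (contributor, "rdf:type", "prov:Agent"),  # assumption
        (activity, "rdf:type", "prov:Activity"),
        # PROV describes the action that produced the entity ...
        (entity, "prov:wasGeneratedBy", activity),
        (activity, "prov:wasAssociatedWith", creator),
        (activity, "prov:wasAssociatedWith", contributor),
        # ... and can also keep the direct, DC-like shortcut.
        (entity, "prov:wasAttributedTo", creator),
        (entity, "prov:wasAttributedTo", contributor),
    ]


triples = dc_to_prov(":presentationX", ":daniel", ":kai", ":makingTheTutorial")
for triple in triples:
    print(" ".join(triple) + " .")
```

The two Dublin Core statements about creator and contributor become nine PROV triples, which illustrates the trade-off named in the summary: PROV is harder to use at first but makes the activity and the roles explicit.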

Introduction

 * Lightweight OWL ontology for interchanging provenance information
 * Simple
 * Domain neutral
 * Meant to be extended
 * Encodes PROV-DM's "abstract model" in RDF
 * There are alternate encodings for XML, etc.
 * Final W3C recommendation
 * Content negotiation is enabled.

Starting point terms ("Simple")

 * 3 classes + 9 properties
 * Simple binary relationships
 * Class "Entity": a physical, digital, conceptual or other kind of thing with some fixed aspects; entities may be real or imaginary.
 * Class "Activity": any process that used or generated entities. An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using or generating entities.
 * Class "Agent": something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
 * The 3 main classes (Entity, Activity, Agent) can be organized by time or responsibility.
 * prov:used: Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity and could not have been affected by the entity.
 * prov:wasAssociatedWith: An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity. It further allows for a plan to be specified, which is the plan intended by the agent to achieve some goals in the context of this activity.
 * prov:wasGeneratedBy: Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.
 * prov:wasDerivedFrom: A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a preexisting entity.
 * prov:wasInformedBy: Communication is the exchange of some unspecified entity by two activities; one activity using some entity generated by the other.
 * prov:wasAttributedTo: Attribution is the ascribing of an entity to an agent.
 * prov:actedOnBehalfOf: Delegation (actedOnBehalfOf) is the assignment of authority and responsibility to an agent (by itself or by another agent) to carry out a specific activity as a delegate or representative, while the agent it acts on behalf of retains some responsibility for the outcome of the delegated work.
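The seven relations listed above each connect two of the three classes. A minimal sketch of a checker for these pairings follows. It assumes the domains and ranges as I recall them from the PROV-O specification, and it simplifies by giving every resource exactly one class, although in PROV a resource may be both an entity and an agent. The names `SIGNATURES` and `check` are hypothetical.

```python
# property -> (class of the subject, class of the object)
SIGNATURES = {
    "prov:used": ("Activity", "Entity"),
    "prov:wasGeneratedBy": ("Entity", "Activity"),
    "prov:wasDerivedFrom": ("Entity", "Entity"),
    "prov:wasAttributedTo": ("Entity", "Agent"),
    "prov:wasAssociatedWith": ("Activity", "Agent"),
    "prov:wasInformedBy": ("Activity", "Activity"),
    "prov:actedOnBehalfOf": ("Agent", "Agent"),
}


def check(triples, types):
    """Return the triples whose subject or object has the wrong class."""
    problems = []
    for s, p, o in triples:
        if p not in SIGNATURES:
            continue  # not a starting point relation, nothing to check
        domain, range_ = SIGNATURES[p]
        if types.get(s) != domain or types.get(o) != range_:
            problems.append((s, p, o))
    return problems


# Hypothetical example data.
types = {":draft": "Entity", ":slides": "Entity",
         ":writing": "Activity", ":daniel": "Agent"}
good = [(":writing", "prov:used", ":draft"),
        (":slides", "prov:wasGeneratedBy", ":writing"),
        (":writing", "prov:wasAssociatedWith", ":daniel")]
bad = [(":daniel", "prov:used", ":draft")]  # an agent cannot "use" directly
```

Calling `check(good, types)` returns an empty list, while `check(bad, types)` returns the offending triple, since prov:used expects an activity as its subject.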

Expanded terms

 * Extension of the Starting Point Terms.
 * Generic definitions to remain as domain independent as possible.
 * Allow for richer descriptions of resources.
 * Software Agent: is running software.
 * Organization: social or legal institution such as a company, society etc.
 * Person: people.
 * Location: A Location can be identifiable geographic place, but it can also be a non-geographic place such as a directory, row or column. As such, there are numerous ways in which location can be expressed, such as by a coordinate, address, landmark...
 * Plan: a plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goals. Plans are associated with Activities and executed by Agents.
 * Collection: an entity that provides a structure to some constituents that must themselves be entities. These constituents are said to be members of the collection. An empty collection is a collection without members.
 * Bundle: a named set of provenance descriptions. A bundle is itself an entity, which allows provenance of provenance to be expressed.
 * primary source: a primary source relation is a kind of derivation relation from secondary materials to their primary sources.
 * prov:wasQuotedFrom: a quotation is the repeat of (some or all of) an entity, such as text or image, by someone who may or may not be its original author. The blog post is not a quote. The wasQuotedFrom relationship should be used only for the quotes.
 * prov:value: used when entities have a string or numeric value. Provides a value that is a direct representation of an entity.
 * prov:wasStartedBy: Start is when an activity is deemed to have been started by an entity, known as trigger. The activity did not exist before its start.
 * prov:wasEndedBy: End is when an activity is deemed to have been ended by an entity, known as trigger. The activity no longer exists after its end.