ESciDoc Content Model in Fedora

From MPDLMediaWiki

!!! in progress !!!

This page starts with an overview of the basic ideas of content models in the eSciDoc Infrastructure, in Fedora, and in the enhancements to Fedora. For a discussion of mapping content models between eSciDoc and Fedora see chapter 5.

eSciDoc Content Models

In the eSciDoc infrastructure a content model describes the specialization of an Item or Container object and is stored inside the infrastructure as a Content Model Object. Content Model Objects are managed via the Content Model Handler.

A Content Model Object is a versionable eSciDoc resource with a publication workflow. Thus it contains the common object properties of an eSciDoc resource, including version information.

Additionally it holds key-value pairs governing the behavior and form of the specialized content object, e.g. the initial state, whether versioning is enabled, or the name of a mandatory metadata record.

In order to flexibly define the structure of a content object in its variable parts, the Content Model Object may include or refer to a rule document.

In general the term "content model object" refers to a concrete digital object that holds - or consists of - the description of a content model. An eSciDoc resource (e.g. item, container) usually refers to a content model object, which means it conforms (or should conform) to the corresponding content model. Instead of "conforms to" one can say the eSciDoc resource is of the type defined by the content model.

See also eSciDoc Content Model Object.

Common Object Properties

Each eSciDoc resource, and thus also the eSciDoc Content Model Object, holds a set of common properties. The properties of the eSciDoc Content Model are:

  • id (objid)
  • name
  • description
  • creation date
  • creator (created by)
  • modification date (last-modification-date)
  • status (public-status, version-status) + comment (values are pending, in-revision, submitted, released)
  • PID
  • context
  • content model (either define a content model object for content models OR do not state this property; note: a content model is not a special item)
  • lock-status, lock-date, lock-owner
  • version, latest-version, release

Questions and Notes

  • status withdrawn not needed
    • maybe we should still keep status "withdrawn" for the case where no new resources of this content model can be created ... e.g. legacy data--Natasa 15:33, 9 September 2009 (UTC)
  • name and description? (cf. Item and Container)
    • a content model object should have a name and a description which might give some textual information about it, rather than purely the objid. These might be taken (as in the case of items) from e.g. a Dublin Core metadata record of the content model.
  • What does the context of a Content Model Object mean?--Natasa 15:33, 9 September 2009 (UTC)
    • would still keep it: special rights can be managed for each context in which there are content model objects in place. In the case of collaborative environments and usage of a single repository by different organizations this may be an interesting issue. Another use case which comes up from the DARIAH project: content models can be published... so in this case an eSciDoc repository may be a repository of content models (distant use case)--Natasa 15:33, 9 September 2009 (UTC)

Key-Value Pairs

This section covers how to state this information and the form in which it is represented inside the resource representation of a Content Model Object. For a list of possible or needed values see eSciDoc Content Model Object.

There are single values, such as the initial state, and lists of values, such as MIME types or names of metadata records. Lists usually require a definition of whether a particular occurrence is allowed or forbidden, whether the list is closed or open, etc. Therefore lists may be better covered by rules inside a rule document.

Most single values must be accessible for the creation of a resource; these are called creation information in the following. Some must be accessible for specific operations (status transitions, transformation description for DC mapping); these are called transition information. Transition information may also be creation information but not vice versa. Creation information is necessary for the creation process, while transition information is necessary for one or more service operations.

List values can be used as input for a resource validation which must be done before or after every operation. They need not be available for individual read access. Therefore it is sufficient to formulate them in a rule document (cf. List (Properties)). Other values should be available for fast access.

Questions and Notes

  • not certain if transitions should be part of the content model--Natasa 15:33, 9 September 2009 (UTC)
    • I think one important thing is to have the initial state and then e.g. if a resource must be submitted before release. DC-Mapping is also important but probably you don't mean that. Frank 12:40, 10 September 2009 (UTC)

Rule Document

The rule document (if any) is stored as a valid XML document inside the Content Model Object or as a reference to a valid XML document. It contains restrictions on an Item or Container expressed in a rule language (e.g. Schematron-XML).

This approach does not support content validation in Fedora. Even if the rule document related to the eSciDoc resource is in a language Fedora can evaluate, it will describe the eSciDoc resource and not the Fedora object.

Questions and Notes

Fedora Content Models

See Fedora Content Model Architecture (CMA).

One important reason to introduce CMA in Fedora was the need to decouple content objects and Fedora disseminators. Now the behavior of objects - in the form of disseminators - is bound to the content model object of a content object. This is of lower interest for eSciDoc and eSciDoc Content Models and is therefore not discussed here. It does, however, enable the eSciDoc Infrastructure to use disseminators, and in the future eSciDoc content models should be extended to describe the behavior of eSciDoc objects.

Each Fedora object refers to a Fedora content model object by the property info:fedora/fedora-system:def/model#hasModel. This relation is stored in the RELS-EXT datastream of the Fedora object.
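As an illustration of the hasModel relation, the following minimal sketch reads the content model references out of a RELS-EXT document using only the Python standard library. The RELS-EXT sample and the PIDs demo:item-1 and demo:cm-publication are invented for this example; the function name content_models is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MODEL_NS = "info:fedora/fedora-system:def/model#"

# Hypothetical RELS-EXT content; the demo:* PIDs are invented for illustration.
RELS_EXT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fedora-model="info:fedora/fedora-system:def/model#">
  <rdf:Description rdf:about="info:fedora/demo:item-1">
    <fedora-model:hasModel rdf:resource="info:fedora/demo:cm-publication"/>
    <fedora-model:hasModel rdf:resource="info:fedora/fedora-system:FedoraObject-3.0"/>
  </rdf:Description>
</rdf:RDF>
"""


def content_models(rels_ext_xml: str) -> List[str]:
    """Return the URIs of all content models referenced via hasModel, in document order."""
    root = ET.fromstring(rels_ext_xml)
    return [
        element.get(f"{{{RDF_NS}}}resource")
        for element in root.iter(f"{{{MODEL_NS}}}hasModel")
    ]


if __name__ == "__main__":
    for uri in content_models(RELS_EXT):
        print(uri)
```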

Fedora CMA supports "complex single-object models", "commonly called 'compound'", and "multi-object models", "commonly called 'atomistic' or 'linked'" (http://fedora-commons.org/confluence/display/FCR30/Content+Model+Architecture). Since no concrete means of expressing these models could be found, this seems to be a theoretical statement.

Content models "include structural, behavioral and semantic information [and] a description of the permitted, excluded, and required relationships to other digital objects or identifiable entities." (http://fedora-commons.org/confluence/display/FCR30/Content+Model+Architecture) This is defined by means of a "content modeling language".

Fedora Content Modeling Language

Fedora 3 contains a reference implementation of a content modeling language. This implementation seems to be limited to the definition of a very simple XML format. It allows a list of elements, each of which specifies the ID of a datastream that must exist. Additionally, a list of allowed MIME types may be specified for each datastream ID. The Fedora client code includes a validator implementation for that language.

The content modeling language document is stored in the mandatory datastream "DS-COMPOSITE-MODEL" of a Content Model Object.
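To make the scope of this language concrete, here is a minimal sketch of the check it allows: every listed datastream ID must exist, and its MIME type must be one of the allowed ones if any are listed. This is not the validator from the Fedora client code. The element names (dsCompositeModel, dsTypeModel, form) and the namespace reflect the Fedora 3 format as far as known and should be checked against the CMA documentation; the sample model and the function name validate are invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace of the Fedora 3 composite model format (to be verified against the CMA docs).
DSCM_NS = "info:fedora/fedora-system:def/dsCompositeModel#"

# Hypothetical content of a DS-COMPOSITE-MODEL datastream.
DS_COMPOSITE_MODEL = """\
<dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
  <dsTypeModel ID="DC">
    <form MIME="text/xml"/>
  </dsTypeModel>
  <dsTypeModel ID="content">
    <form MIME="image/jpeg"/>
    <form MIME="image/png"/>
  </dsTypeModel>
  <dsTypeModel ID="RELS-EXT"/>
</dsCompositeModel>
"""


def validate(model_xml: str, datastreams: Dict[str, str]) -> List[str]:
    """Check a mapping of datastream ID -> MIME type against the model.

    Returns a list of problems; an empty list means the object conforms.
    """
    problems = []
    root = ET.fromstring(model_xml)
    for type_model in root.findall(f"{{{DSCM_NS}}}dsTypeModel"):
        ds_id = type_model.get("ID")
        allowed = [form.get("MIME") for form in type_model.findall(f"{{{DSCM_NS}}}form")]
        if ds_id not in datastreams:
            problems.append(f"missing datastream {ds_id}")
        elif allowed and datastreams[ds_id] not in allowed:
            problems.append(f"datastream {ds_id} has unexpected MIME type {datastreams[ds_id]}")
    return problems
```

Note that datastreams which are not listed in the model are simply ignored by this check, which matches the observation that the language cannot restrict an object to a defined set of datastreams.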

[Figure: cmodel.png, http://fedora-commons.org/confluence/download/attachments/4718710/cmodel.png]

Questions and Notes


Enhanced Content Models for Fedora

Enhanced Content Models for Fedora (ECM) is an enhancement to Fedora CMA providing a validator implementation, a web service to create compound views from atomistic objects, templates for object creation, an extension to Fedora's content modeling language, and an ontology datastream that extends the possibilities for describing objects. ECM is currently available in version 0.8.

See http://ecm.wiki.sourceforge.net/

Presentation at OpenRepositories 2009 by Asger Blekinge-Rasmussen

ECM supports Content Model Inheritance.

Validator

By binding an object to a special Content Model defined by ECM, the validator can be called as a disseminator of that object. Besides that, it can be called via a specific HTTP URL containing the ID of the object to be validated. There seems to be no tool to validate a FOXML document outside Fedora. At least when modifying an existing Fedora object, the validation must be done after persisting the modification.

Extension to Fedora content modeling language

The Fedora content modeling language is extended by an element that makes it possible to specify an XML Schema (stored inside the Content Model Object) for a datastream (with MIME type text/xml), in addition to the possibility of defining MIME types for a datastream (see "Fedora Content Modeling Language" above). Thus a further specification of XML datastreams is possible.

Ontology Datastream

The Ontology Datastream contains an ontology about objects conforming to the content model. It consists of an RDF/XML document defining a class using RDFS and OWL (Lite) and the predicates an instance of this class may have.

Therefore, among other things, the relations of objects, which are expressed in RDF/XML in the RELS-EXT of these objects, may be restricted to a defined set and to the kind (content model) of objects they point to. This approach fits the idea of the RELS-EXT datastream perfectly.

Templates

ECM defines a predicate, to be stated in the RDF/XML document stored in the RELS-EXT of the Content Model Object, that refers to an object which should be used as a template. New objects conforming to that content model can be created from that template by means of a web service.

Questions and Notes

  • eSciDoc Content Models cannot define templates in the above sense because the structure of Fedora objects is completely hidden by the eSciDoc Infrastructure!?
    • there could be templates of objects such as publication items e.g. articles, books .. these would have populated the standard item structure (no components) + some metadata values populated by default. These template objects in fact are referenced in the context and not on the CModel.
  • eSciDoc Infrastructure is able to create new objects as copy of existing objects (retrieve and then create)
    • this feature is already used from within PubMan
  • ECM template objects in Fedora have state inactive
    • would there be a possibility to define a template object (for a particular context, of type e.g. PubItem) while still knowing it is a template, so that it could be in state inactive? Is there any need to have the template object in state inactive, or could it simply be treated as a regular PubItem with a separate "templateFor" relationship?

Mapping between eSciDoc and Fedora Content Model Objects

In Fedora an object points to its content model by the predicate info:fedora/fedora-system:def/model#hasModel which may be reported as http://escidoc.de/core/01/structural-relations/content-model inside the eSciDoc XML representation of an object.

Because a Fedora object may refer to more than one content model object, and it is recommended to refer to the Fedora-defined "Basic Content Model" whenever a self-defined content model is referred to (http://fedora-commons.org/confluence/display/FCR30/Fedora+Digital+Object+Model#FedoraDigitalObjectModel-ContentModelObject), it may be hard to figure out which of these relations to state in the eSciDoc representation of the content object (e.g. in the item or container).
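One conceivable heuristic, sketched here purely as an assumption and not as a decided approach, is to ignore all content models in the fedora-system namespace and report whatever remains. The function name self_defined_models and the demo PID are invented; info:fedora/fedora-system:FedoraObject-3.0 is used as the URI of the basic content model.

```python
from typing import List

# Assumption: system-defined content models all live in the "fedora-system" PID namespace.
SYSTEM_PREFIX = "info:fedora/fedora-system:"


def self_defined_models(has_model_uris: List[str]) -> List[str]:
    """Return the hasModel targets that are not system-defined content models."""
    return [uri for uri in has_model_uris if not uri.startswith(SYSTEM_PREFIX)]


if __name__ == "__main__":
    uris = [
        "info:fedora/demo:cm-publication",  # invented PID
        "info:fedora/fedora-system:FedoraObject-3.0",
    ]
    print(self_defined_models(uris))
```

If exactly one URI remains, it could be reported as the content model in the eSciDoc representation. If several remain, the ambiguity described above is still unresolved and some further convention would be needed.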

Fedora Datastreams of an eSciDoc Object

The only optional Fedora datastreams of an eSciDoc Object are content streams (only Item) and the metadata records.

Metadata records have a name which is the ID of the corresponding XML datastream in Fedora. In order to define that a metadata record with a given name must appear in an eSciDoc object the Fedora content modeling language is sufficient. It is easy to map a list of mandatory names for metadata records - which may be specified in an eSciDoc Content Model Object - to an appropriate Fedora content modeling language document.
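The mapping of mandatory metadata record names could look like the following minimal sketch, which generates a composite model document with one entry per name. The element names and namespace follow the Fedora 3 format as far as known; the function name build_ds_composite_model and the record names used in the example are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace of the Fedora 3 composite model format (to be verified against the CMA docs).
DSCM_NS = "info:fedora/fedora-system:def/dsCompositeModel#"


def build_ds_composite_model(md_record_names: Iterable[str]) -> str:
    """Build a composite model document that requires one text/xml datastream per name."""
    # Serialize with a default namespace instead of an auto-generated prefix.
    ET.register_namespace("", DSCM_NS)
    root = ET.Element(f"{{{DSCM_NS}}}dsCompositeModel")
    for name in md_record_names:
        type_model = ET.SubElement(root, f"{{{DSCM_NS}}}dsTypeModel", {"ID": name})
        ET.SubElement(type_model, f"{{{DSCM_NS}}}form", {"MIME": "text/xml"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "escidoc" and "dc" are example record names, not taken from a real content model.
    print(build_ds_composite_model(["escidoc", "dc"]))
```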

Accordingly, using the ECM extension of the Fedora content modeling language, the XML Schema a metadata record must conform to can be defined and mapped.

Does the last point mean that, if we have a schema for a metadata record, the metadata record can be validated against that schema?--Natasa 15:40, 9 September 2009 (UTC)
    • Yes, one idea of ECM is to store the XML Schema for a specific XML datastream in the content model object. Frank 13:01, 10 September 2009 (UTC)

Relations

structural vs. content relations

Defined by ONTOLOGY of ECM. Hardly by users.

Compound Views

Key-Value Pairs

Single values defined by an eSciDoc content model (see "eSciDoc Content Models" above) are stored inside the Fedora Content Model Object and may have consequences regarding the set of datastreams. E.g. if an object should not be versioned it does not need the versioning datastream.

Rule vs. Content Modeling Language

The Fedora content modeling language (even when extended by ECM) does not fulfil the requirements for a rule language describing an eSciDoc object.

A rule language is one way to restrict the kind and content of Components. A Fedora content model (with ECM) can define the relation between an Item and its Components but cannot place constraints on the Component object. Not even a cardinality of Components other than 0 or 1 can be expressed. The only way to specify Components by means of Fedora and ECM is to introduce separate Content Models for Components and to state that there must be at least one Component of a specific Content Model.

Questions and Notes

  • How to separate the basic content model from a self-defined content model in RELS-EXT of Fedora Content Model Object?
  • How to store key/value pairs? RELS-EXT?
  • Can everything that needs to be defined for an eSciDoc object be defined in a Fedora Content Model Object using CMA and ECM?
    • If not, two overlapping validation stages?
  • Is it necessary to consider key/value pairs when generating the Fedora content modeling language document? E.g. version-history.

eSciDoc Content Model in Fedora

Values stated in an eSciDoc Content Model Object that define the content and behavior of a content object (e.g. Item, Container) fall into three categories:

  1. Creation; values pertaining to the initial state
  2. Transition; values defining possible transitions or effects of specific transitions
  3. State; values describing content independent of the previous or next state

Creation

Information considered in the creation process.

  • initial state
  • versioning enabled
  • name of main metadata record
  • dc mapping
  • (content checksum enabled)
  • (applies to object pattern)

Transition

Information considered for specific operations in order to decide if the operation is allowed and/or which sub-operations must be triggered or are requirements for that operation.

  • status transitions (e.g. from pending to submitted)
  • cascade information (e.g. for containers, should a release of all members be tried on release)
  • PID assignment

State

Information used to validate the current state of the resource.

  • Name, schema, and occurrence of additional metadata records.
  • mime-types of content
  • Name, content-category etc. of Component
  • content model of allowed members, occurrences

Note: The listings above are not necessarily complete.
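The three categories can be pictured as a small data structure. The following is only a sketch under stated assumptions: all class and field names are invented, the default values (such as "pending" as initial state and "escidoc" as the name of the main metadata record) are examples, and the listed fields are no more complete than the listings above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class CreationInfo:
    """Values needed when a resource is created."""
    initial_state: str = "pending"
    versioning_enabled: bool = True
    main_md_record: str = "escidoc"  # assumed name of the main metadata record


@dataclass
class TransitionInfo:
    """Values consulted for specific operations."""
    allowed_status_transitions: Set[Tuple[str, str]] = field(default_factory=set)
    cascade_release_to_members: bool = False


@dataclass
class StateInfo:
    """Values used to validate the current state of a resource."""
    md_record_schemas: Dict[str, str] = field(default_factory=dict)  # record name -> schema location
    content_mime_types: List[str] = field(default_factory=list)


@dataclass
class ContentModel:
    creation: CreationInfo
    transition: TransitionInfo
    state: StateInfo

    def may_transition(self, current: str, target: str) -> bool:
        """Return True if the status change is listed as allowed."""
        return (current, target) in self.transition.allowed_status_transitions


# Example: a resource must be submitted before it can be released.
EXAMPLE_MODEL = ContentModel(
    creation=CreationInfo(),
    transition=TransitionInfo(
        allowed_status_transitions={("pending", "submitted"), ("submitted", "released")}
    ),
    state=StateInfo(content_mime_types=["application/pdf"]),
)
```

Keeping creation and transition values as plain fields reflects the requirement that they be available for fast access, while the state values would typically feed a validation step.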


Creation and Transition

The main question regarding values for creation or transition is how to store them so that they can be accessed easily and quickly without too much effort.

Some values (e.g. whether versioning is enabled) have consequences for the set of datastreams and may therefore be considered for validation. It is possible to validate whether versioning is enabled by listing the versioning datastream in the Fedora content model's datastream dsCompositeModel.
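A minimal sketch of this idea: the list of datastream IDs to require in the composite model is derived from the key-value pairs of the eSciDoc content model. The datastream ID "version-history" is an assumption based on the mention in the notes above, and the function name required_datastream_ids is invented.

```python
from typing import List

# Assumed ID of the versioning datastream; to be confirmed against the infrastructure.
VERSIONING_DATASTREAM_ID = "version-history"


def required_datastream_ids(versioning_enabled: bool, mandatory_md_records: List[str]) -> List[str]:
    """Derive the datastream IDs that the composite model should list as required."""
    required = list(mandatory_md_records)
    if versioning_enabled:
        # A versioned object must carry the versioning datastream.
        required.append(VERSIONING_DATASTREAM_ID)
    return required
```

The resulting list could then be turned into a composite model document with one entry per ID.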

Validation

Information related to the state of a resource (the content object) is used to validate the object. Such a validation may be done in Fedora based on the information from the dsCompositeModel and possibly also the ONTOLOGY datastream.

Components and Members

The description of which Components of an Item and what kinds of members of a Container are allowed may be validated in Fedora as relations. In fact both are stored as relations in Fedora (see "Structural Relations" below), but information about the kind of the related resource is needed to reach the intended level of description.

Relations and Datastreams

With CMA and ECM the relations and datastreams of a Fedora Object can be validated. Both can easily be mapped from a description of an eSciDoc resource into a description of a Fedora object except for cardinality of relations.

Relations of an eSciDoc resource are not directly mapped to relations of the corresponding Fedora object (see below "Content Relations").

Metadata Datastreams

Metadata records of an eSciDoc resource are defined by a name and an XML Schema. Technically such a record is stored as a datastream of MIME type text/xml in a Fedora object, where the name of the metadata record is the name of the datastream, the XML Schema applies to the content of the datastream, and the datastream is marked as an eSciDoc metadata record.

The description of an eSciDoc metadata record can be mapped to the description of a Fedora datastream and vice versa. So a validation of the eSciDoc resource is possible, as well as a Fedora object validation based on Fedora's modeling language extended by ECM.

This approach lacks the possibility to state optional metadata records or to restrict the set of metadata records to the defined set.

Note that in future we may have metadata records which are defined in RDF/XML - that would mean XML Schema is not the only one to validate against. This is an issue we still consider, but maybe for some new solutions.--Natasa 15:48, 9 September 2009 (UTC)
Yes, the validation of the content of an XML document (what it states) is another task. The syntactical correctness of an RDF/XML datastream may easily be validated by an XML Schema, but further validation should probably be done by an external tool. Frank 13:10, 10 September 2009 (UTC)

Content Datastreams

Content (also referred to as binary content) of an eSciDoc resource is defined by a name (an individual name in the case of a content-stream in an Item, and the name "content" in a Component) and a MIME type. These values can be accurately mapped to the values of a datastream in Fedora. The storage type of the content can be freely chosen and is not restricted by the content model. So a validation of the eSciDoc resource is possible, as well as a Fedora object validation based on Fedora's modeling language extended by ECM. The content itself is not considered for validation by the content model.

This approach lacks the possibility to state optional content-streams in an Item or to restrict the set of content-streams in an Item to a defined set.

Content Relations

In the eSciDoc Infrastructure it should be possible to provide an ontology (usually containing just predicate definitions) in order to allow a specific set of Content Relations for every content object. This idea would collide with Fedora content models if Content Relations were ordinary relations stored in the RELS-EXT datastream of content objects, because then every content model would have to define every possible relation, and changing the mentioned ontology would require a modification of every content model object.

Fortunately, content relations are technically objects in their own right. So every Fedora content model derived from an eSciDoc content model merely has to allow relations to Content Relation Objects. From this point of view it must be considered an advantage that there is only one predicate referring to Content Relations. Otherwise the ECM rule of always stating the complete set of possible relations would break the idea of freely relating resources by Content Relations regardless of the ownership and content model of the resources.

Consequently a definition of allowed Content Relations is not possible by content model.

Structural Relations

For an eSciDoc object there exists a set of structural relations which can be listed in the ECM ONTOLOGY datastream in order to ensure their existence in an object's RELS-EXT datastream. Additionally it is possible to define that such a relation must or can point to an object of a specific content model. So corresponding definitions from an eSciDoc content model can be mapped to a Fedora content model, except for cardinalities other than 1 or 0.

For restrictions on members and components of a content object, a restriction on the (Fedora) content model of the related object is necessary. These restrictions are expressed via someValuesFrom and allValuesFrom restrictions in the ontology part. If more than one content model is allowed, this is described by restrictions to a union of content model classes. (To be checked if supported by ECM.) This approach lacks the possibility to state cardinalities per allowed content model.
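Read as closed-world validation checks (which is a simplification of OWL semantics, and not necessarily how ECM evaluates them), the two restriction types could be sketched as follows. The function names, the object IDs, and the content model IDs are invented for illustration; the allowed set plays the role of the union of content model classes.

```python
from typing import Dict, List, Set


def all_values_from(targets: List[str], models_of: Dict[str, Set[str]], allowed: Set[str]) -> bool:
    """allValuesFrom: every related object has at least one of the allowed content models."""
    return all(models_of.get(target, set()) & allowed for target in targets)


def some_values_from(targets: List[str], models_of: Dict[str, Set[str]], allowed: Set[str]) -> bool:
    """someValuesFrom: at least one related object has one of the allowed content models."""
    return any(models_of.get(target, set()) & allowed for target in targets)


# Hypothetical members of a container and their content models.
MODELS_OF = {
    "demo:m1": {"demo:cm-article"},
    "demo:m2": {"demo:cm-book"},
    "demo:m3": {"demo:cm-image"},
}
# Union of the content models allowed for members.
ALLOWED = {"demo:cm-article", "demo:cm-book"}
```

Neither check can express a rule such as "at most three articles and exactly one book", which is the missing cardinality per allowed content model noted above. Also note that all_values_from is trivially true for an object without any related objects.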

For every component-type stated in the eSciDoc content model a separate Fedora content model object is created. The name (aka content-category) of a component-type and the ID of the eSciDoc Content Model Object are used to generate the ID of the Component Content Model Object. Allowed metadata records are modeled as for common eSciDoc resources (see above "Metadata Datastreams"). The allowed mime-types are listed for the datastream "content". The content-category is ensured by a hasValue-restriction (to be checked if supported by ECM).


File:ContentModel-draft.xml

File:ContentModel-draft-dsCompositeModel.xml

File:ContentModel-draft-ONTOLOGY.xml

File:ContentModel-draft-CM component FULLSIZE-dsCompositeModel.xml

File:ContentModel-draft-CM component FULLSIZE-ONTOLOGY.xml