ESciDoc Content Model in Fedora


!!! in progress !!!

This page starts with an overview of the basic ideas of content models in the eSciDoc Infrastructure, in Fedora, and in enhancements to Fedora. For a discussion of mapping content models between eSciDoc and Fedora see chapter 5.

eSciDoc Content Models

In the eSciDoc infrastructure a content model describes the specialization of an Item or Container object and is stored inside the infrastructure as a Content Model Object. Content Model Objects are managed via the Content Model Handler.

A Content Model Object is a versionable eSciDoc resource with a publication workflow. Thus it contains the common object properties of an eSciDoc resource, including version information.

Additionally it holds key-value pairs governing the behavior and form of the specialized content object, e.g. the initial state, whether versioning is enabled, or the name of a mandatory metadata record.

In order to flexibly define the structure of a content object in its variable parts, the Content Model Object may include or refer to a rule document.

In general the term "content model object" refers to a concrete digital object that holds - or consists of - the description of a content model. An eSciDoc resource (e.g. item, container) usually refers to a content model object, which means it conforms (or should conform) to the appropriate content model. Instead of "conforms to" one can say the eSciDoc resource is of the type defined by the content model.

See also eSciDoc Content Model Object.

Common Object Properties

Each eSciDoc resource, and thus also the eSciDoc Content Model Object, holds a set of common properties. The properties of the eSciDoc Content Model are:

  • id (objid)
  • name
  • description
  • creation date
  • creator (created by)
  • modification date (last-modification-date)
  • status (public-status, version-status) + comment (values are pending, in-revision, submitted, released)
  • PID (Persistent Identifier)
  • context
  • content model (either define a content model object for content models OR do not state this property; note: a content model is not a special item)
  • lock-status, lock-date, lock-owner
  • version, latest-version, release

Questions and Notes

  • status withdrawn not needed
    • maybe we should still keep status "withdrawn" for the case when no new resources of this content model can be created ... e.g. legacy data--Natasa 15:33, 9 September 2009 (UTC)
  • name and description? (cf. Item and Container)
    • a content model object should have a name and a description which might give some textual information about it, rather than purely the objid. These might be taken (as in the case of items) from e.g. a Dublin Core metadata record of the content model.
  • What does the context of a Content Model Object mean?--Natasa 15:33, 9 September 2009 (UTC)
    • would still keep it: special rights can be managed for each context in which there are content model objects in place. In the case of collaborative environments and usage of a single repository by different organizations this may be an interesting issue. Another use case, which comes up from the DARIAH project: content models can be published... so in this case an eSciDoc repository may be a repository of content models (distant use case)--Natasa 15:33, 9 September 2009 (UTC)

Key-Value Pairs

This section covers how to state this information and how to represent it inside the resource representation of a Content Model Object. For a list of possible or needed values see eSciDoc Content Model Object.

There are single values, such as the initial state, and lists of values, such as mime-types or names of metadata records. Lists usually require a definition of whether a particular occurrence is allowed or forbidden, whether the list is complete or open, etc. Therefore lists may be better covered by rules inside a rule document.

Most single values must be accessible for the creation of a resource; in the following these are called creation information. Some must be accessible for specific operations (status transitions, transformation description for DC mapping); in the following these are called transition information. Transition information may also be creation information but not vice versa. Creation information is necessary for the creation process, while transition information is necessary for one or more service operations.

List values can be used as input for a resource validation, which must be done before or after every operation. They need not be available for individual read access. Therefore it is sufficient to formulate them in a rule document (cf. List (Properties)). Other values should be available for fast access.

Questions and Notes

  • not certain if transitions should be part of the content model--Natasa 15:33, 9 September 2009 (UTC)
    • I think one important thing is to have the initial state and then e.g. if a resource must be submitted before release. DC mapping is also important but probably you don't mean that. Frank 12:40, 10 September 2009 (UTC)

Rule Document

The rule document (if any) is stored as a valid XML document inside the Content Model Object or as a reference to a valid XML document. It contains restrictions on an Item or Container expressed in a rule language (e.g. Schematron-XML).

This approach does not support content validation in Fedora. Even if the rule document related to the eSciDoc resource is in a language Fedora can evaluate, it will describe the eSciDoc resource and not the Fedora object.

Questions and Notes

Fedora Content Models

Fedora uses the Content Model Architecture (CMA). See Fedora Content Model Architecture (CMA).

One important reason to introduce CMA in Fedora was the need to decouple content objects and Fedora disseminators. Now the behavior of objects - in the form of disseminators - is bound to the content model object of a content object. This is of lesser interest for eSciDoc and eSciDoc Content Models and is therefore not discussed here. It does, however, enable the eSciDoc Infrastructure to use disseminators, and in the future eSciDoc content models should be extended to describe the behavior of eSciDoc objects.

Each Fedora object refers to a Fedora content model object by the property info:fedora/fedora-system:def/model#hasModel. This relation is stored in the RELS-EXT datastream of the Fedora object.
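
As an illustration of the relation described above, the following minimal sketch reads the hasModel statements from a RELS-EXT document. The sample RDF/XML and the PIDs in it (escidoc:100, escidoc:cmodel-1) are made up for this example; only the predicate namespace is taken from the text.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MODEL_NS = "info:fedora/fedora-system:def/model#"

# Hypothetical RELS-EXT content; the PIDs are invented for illustration.
RELS_EXT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fedora-model="info:fedora/fedora-system:def/model#">
  <rdf:Description rdf:about="info:fedora/escidoc:100">
    <fedora-model:hasModel rdf:resource="info:fedora/escidoc:cmodel-1"/>
  </rdf:Description>
</rdf:RDF>
"""


def content_models(rels_ext_xml):
    """Return the URIs of all content models referenced via hasModel."""
    root = ET.fromstring(rels_ext_xml)
    return [
        element.attrib[f"{{{RDF_NS}}}resource"]
        for element in root.iter(f"{{{MODEL_NS}}}hasModel")
    ]


print(content_models(RELS_EXT))
```

The function returns a list because a Fedora object may state more than one hasModel relation.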

Fedora CMA supports "complex single-object models", "commonly called 'compound'", and "multi-object models and is commonly called 'atomistic' or 'linked'" (http://fedora-commons.org/confluence/display/FCR30/Content+Model+Architecture). Since no concrete means of defining such models could be found, this seems to be a theoretical statement.

Content models "include structural, behavioral and semantic information [and] a description of the permitted, excluded, and required relationships to other digital objects or identifiable entities." (http://fedora-commons.org/confluence/display/FCR30/Content+Model+Architecture) This is defined by means of a "content modeling language".

Fedora Content Modeling Language

Fedora 3 contains a reference implementation of a content modeling language. This implementation seems to be limited to the definition of a very simple XML format. It allows a list of elements where each specifies the ID of a datastream that must exist. Additionally, for a datastream ID a list of allowed MIME types may be specified. The Fedora client code comprises a validator implementation for that language.

The content modeling language document is stored in the mandatory datastream "DS-COMPOSITE-MODEL" of a Content Model Object.
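
To make the check described above concrete, here is a minimal sketch of a validator for such a document. It assumes the element names dsCompositeModel, dsTypeModel and form and the namespace shown in the code, as recalled from the Fedora 3 documentation; these should be verified against the Fedora release in use. The sample model and the function names are hypothetical, and this is not the validator shipped with the Fedora client code.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the Fedora 3 content modeling language.
DSCM_NS = "info:fedora/fedora-system:def/dsCompositeModel#"

# Hypothetical model: "DC" must be text/xml, "content" must be PDF or PNG.
DS_COMPOSITE_MODEL = """\
<dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
  <dsTypeModel ID="DC">
    <form MIME="text/xml"/>
  </dsTypeModel>
  <dsTypeModel ID="content">
    <form MIME="application/pdf"/>
    <form MIME="image/png"/>
  </dsTypeModel>
</dsCompositeModel>
"""


def parse_model(xml_text):
    """Map each required datastream ID to its set of allowed MIME types."""
    root = ET.fromstring(xml_text)
    model = {}
    for type_model in root.findall(f"{{{DSCM_NS}}}dsTypeModel"):
        forms = type_model.findall(f"{{{DSCM_NS}}}form")
        model[type_model.attrib["ID"]] = {
            form.attrib["MIME"] for form in forms if "MIME" in form.attrib
        }
    return model


def validate(datastreams, model):
    """Return a list of problems; an empty list means the object conforms.

    `datastreams` maps datastream IDs of the object to their MIME types.
    """
    problems = []
    for ds_id, allowed in model.items():
        if ds_id not in datastreams:
            problems.append(f"missing datastream {ds_id}")
        elif allowed and datastreams[ds_id] not in allowed:
            problems.append(
                f"datastream {ds_id} has MIME type {datastreams[ds_id]}"
            )
    return problems


model = parse_model(DS_COMPOSITE_MODEL)
print(validate({"DC": "text/xml"}, model))
```

Note that the sketch only reports missing or mistyped datastreams; datastreams not mentioned in the model are accepted, which matches the limitation that the set of datastreams cannot be closed.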

Figure: http://fedora-commons.org/confluence/download/attachments/4718710/cmodel.png

Questions and Notes


Enhanced Content Models for Fedora

Enhanced Content Models for Fedora (ECM) is an enhancement to Fedora CMA providing a validator implementation, a web service to create compound views from atomistic objects, templates for object creation, an extension to Fedora's content modeling language, and an ontology datastream extending its possibilities. ECM is currently available in version 0.8.

See http://ecm.wiki.sourceforge.net/

Presentation at Open Repositories 2009 by Asger Blekinge-Rasmussen

Supports Content Model Inheritance.

Validator

By binding an object to a special Content Model defined by ECM, the validator can be called as a disseminator of that object. Besides that, it can be called via a specific HTTP URL containing the ID of the object to be validated. There seems to be no tool to validate a FOXML document outside Fedora. At least when modifying an existing Fedora object, the validation must be done after persisting the modification.

Extension to Fedora content modeling language

The Fedora content modeling language is extended by an element that adds the possibility to specify an XML Schema (stored inside the Content Model Object) for a datastream (with MIME type text/xml), in addition to the possibility to define MIME types for a datastream (see "Fedora Content Modeling Language" above). So a further specification of datastreams with an XML MIME type is possible.

Ontology Datastream

The Ontology Datastream contains an ontology about objects conforming to the content model. It consists of an RDF/XML document defining a class using RDFS and OWL (Lite) and the predicates an instance of this class may have.

Therefore, among other things, the relations of objects, which are expressed in RDF/XML in the RELS-EXT of these objects, may be restricted to a defined set and to the kind (content model) of objects they point to. This approach fits the idea of the RELS-EXT datastream perfectly.

Templates

ECM defines a predicate, to be stated in the RDF/XML document stored in the RELS-EXT of the Content Model Object, that refers to an object which should be used as a template. New objects conforming to that content model can be created from that template by means of a web service.

Questions and Notes

  • eSciDoc Content Models cannot define templates in the above sense because the structure of Fedora objects is completely hidden by the eSciDoc Infrastructure!?
    • there could be templates of objects such as publication items, e.g. articles, books ... these would have the standard item structure populated (no components) + some metadata values populated by default. These template objects are in fact referenced in the context and not in the CModel.
  • eSciDoc Infrastructure is able to create new objects as copies of existing objects (retrieve and then create)
    • this feature is already used from within PubMan
  • ECM template objects in Fedora have state inactive
    • would there be a possibility to define a template object (for a particular context, of type e.g. PubItem) while still knowing it is a template, so that it could be in state inactive? Is there any need to have the template object in state inactive, or can it simply be treated as a regular PubItem with a separate "templateFor" relationship?

Mapping between eSciDoc and Fedora Content Model Objects

In Fedora an object points to its content model by the predicate info:fedora/fedora-system:def/model#hasModel, which may be reported as http://escidoc.de/core/01/structural-relations/content-model inside the eSciDoc XML representation of an object.

Because a Fedora object may refer to more than one content model object, and it is recommended to refer to the Fedora-defined "Basic Content Model" if a self-defined content model is referred to (http://fedora-commons.org/confluence/display/FCR30/Fedora+Digital+Object+Model#FedoraDigitalObjectModel-ContentModelObject), it may be hard to figure out which of these relations to state in the eSciDoc representation of the content object (e.g. in the item or container).
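
One possible heuristic for this selection, sketched under assumptions: treat every content model in the fedora-system namespace as Fedora's own and report only the remaining ones. The PID fedora-system:FedoraObject-3.0 is, as far as known, the basic content model of Fedora 3.0; the eSciDoc PID and the function name are invented for the example. The sketch does not decide what to do when more than one self-defined content model remains.

```python
# Assumption: Fedora's own content models all live in this namespace.
FEDORA_SYSTEM_PREFIX = "info:fedora/fedora-system:"


def escidoc_content_models(model_uris):
    """Drop Fedora system content models and keep self-defined ones."""
    return [
        uri for uri in model_uris
        if not uri.startswith(FEDORA_SYSTEM_PREFIX)
    ]


models = [
    "info:fedora/fedora-system:FedoraObject-3.0",  # Fedora basic content model
    "info:fedora/escidoc:cmodel-1",                # hypothetical eSciDoc model
]
print(escidoc_content_models(models))
```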

Fedora Datastreams of an eSciDoc Object

The only optional Fedora datastreams of an eSciDoc Object are content streams (Item only) and the metadata records.

Metadata records have a name which is the ID of the corresponding XML datastream in Fedora. In order to define that a metadata record with a given name must appear in an eSciDoc object, the Fedora content modeling language is sufficient. It is easy to map a list of mandatory names for metadata records - which may be specified in an eSciDoc Content Model Object - to an appropriate Fedora content modeling language document.
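
The mapping described above could look roughly like the following sketch, which turns a list of mandatory metadata record names into a content modeling language document. The element names and the namespace are assumed from the Fedora 3 documentation and should be verified; the record names "escidoc" and "dc" and the function name are only examples.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the Fedora 3 content modeling language.
DSCM_NS = "info:fedora/fedora-system:def/dsCompositeModel#"

# Serialize the namespace as the default namespace (no prefix).
ET.register_namespace("", DSCM_NS)


def build_ds_composite_model(md_record_names):
    """Create one required text/xml datastream entry per metadata record."""
    root = ET.Element(f"{{{DSCM_NS}}}dsCompositeModel")
    for name in md_record_names:
        type_model = ET.SubElement(
            root, f"{{{DSCM_NS}}}dsTypeModel", {"ID": name}
        )
        ET.SubElement(type_model, f"{{{DSCM_NS}}}form", {"MIME": "text/xml"})
    return ET.tostring(root, encoding="unicode")


xml_doc = build_ds_composite_model(["escidoc", "dc"])
print(xml_doc)
```

Every listed record becomes mandatory, which reflects the limitation that optional metadata records cannot be expressed this way.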

Accordingly, using the ECM extension of the Fedora content modeling language, the XML Schema a metadata record must conform to can be defined and mapped.

the last one means, if we have a schema for a metadata record, then the metadata record can be validated in accordance with that schema?--Natasa 15:40, 9 September 2009 (UTC)
    • Yes, one idea of ECM is to store the XML Schema for a specific XML datastream in the content model object. Frank 13:01, 10 September 2009 (UTC)

Relations

structural vs. content relations

Defined by ONTOLOGY of ECM. Hardly by users.

Compound Views

Key-Value Pairs

Single values defined by an eSciDoc content model (see "eSciDoc Content Models" above) are stored inside the Fedora Content Model Object and may have consequences regarding the set of datastreams. E.g. if an object should not be versioned, it does not need the versioning datastream.

Rule vs. Content Modeling Language

The Fedora content modeling language (even when extended by ECM) does not fulfil the requirements for a rule language describing an eSciDoc object.

A rule language is one idea to restrict the kind and content of Components. A Fedora content model (with ECM) can define the relation between an Item and its Components but no constraints on the Component Object. Not even a cardinality of Components other than 0 or 1 can be restricted. The only way to specify Components by means of Fedora and ECM is to introduce separate Content Models for Components and to state that there must be at least one Component of a specific Content Model.

Questions and Notes

  • How to separate the basic content model from a self-defined content model in the RELS-EXT of a Fedora Content Model Object?
  • How to store key/value pairs? RELS-EXT?
  • Can everything that needs to be defined for an eSciDoc object be defined in a Fedora Content Model Object using CMA and ECM?
    • If not, two overlapping validation stages?
  • Is it necessary to consider key/value pairs when generating the Fedora content modeling language document? E.g. version-history.

eSciDoc Content Model in Fedora

Values stated in an eSciDoc Content Model Object defining content and behavior of a content object (e.g. Item, Container) are separated into three categories:

  1. Creation; values pertaining to the initial state
  2. Transition; values defining possible transitions or effects of specific transitions
  3. State; values describing content independent of the previous or next state

Creation

Information considered in the creation process.

  • initial state
  • versioning enabled
  • name of main metadata record
  • dc mapping
  • (content checksum enabled)
  • (applies to object pattern)

Transition

Information considered for specific operations in order to decide if the operation is allowed and/or which sub-operations must be triggered or are requirements for that operation.

  • status transitions (e.g. from pending to submitted)
  • cascade information (e.g. for containers, should a release of all members be tried on release)
  • PID assignment

State

Information used to validate the current state of the resource.

  • Name, schema, and occurrence of additional metadata records.
  • mime-types of content
  • Name, content-category etc. of Component
  • content model of allowed members, occurrences

Note: The listings above are not necessarily complete.
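
The three categories could be held in a structure like the one sketched below. All field names, default values and the set of allowed status transitions are hypothetical and only illustrate the separation; the status names are the ones listed under "Common Object Properties".

```python
from dataclasses import dataclass, field


@dataclass
class CreationInfo:
    """Values read once, when a resource is created."""
    initial_state: str = "pending"
    versioning_enabled: bool = True
    main_md_record: str = "escidoc"  # hypothetical record name


@dataclass
class TransitionInfo:
    """Values consulted when an operation changes the status."""
    # Hypothetical set of allowed (from, to) status transitions.
    allowed: set = field(default_factory=lambda: {
        ("pending", "submitted"),
        ("submitted", "released"),
        ("submitted", "in-revision"),
        ("in-revision", "submitted"),
    })

    def allows(self, current, target):
        """Tell whether the transition from current to target is allowed."""
        return (current, target) in self.allowed


@dataclass
class StateInfo:
    """Values used to validate the current state of a resource."""
    md_record_names: list = field(default_factory=list)
    content_mime_types: list = field(default_factory=list)


@dataclass
class ContentModelValues:
    creation: CreationInfo = field(default_factory=CreationInfo)
    transition: TransitionInfo = field(default_factory=TransitionInfo)
    state: StateInfo = field(default_factory=StateInfo)


values = ContentModelValues()
print(values.transition.allows("pending", "submitted"))
```

Keeping creation and transition values as plain attributes reflects the need for fast access, while the state values are lists that would more likely end up in a rule or content modeling language document.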


Creation and Transition

The main question pertaining to values for creation or transition is how to store them in order to provide easy and fast access without too much effort.

Some values (e.g. whether versioning is enabled) have consequences for the set of datastreams and therefore may be considered for validation. It is possible to validate whether versioning is enabled by listing the versioning datastream in the Fedora content model's datastream dsCompositeModel.

Validation

Information related to the state of a resource (the content object) is used to validate the object. Such a validation may be done in Fedora based on the information from the dsCompositeModel and possibly also the ONTOLOGY datastream.

Components and Members

The description of which Components of an Item and what kind of members of a Container are allowed may be validated in Fedora as relations. In fact both are stored as relations in Fedora (see "Structural Relations" below), but information about the kind of the related resource is needed to reach the intended level of description.

Relations and Datastreams

With CMA and ECM the relations and datastreams of a Fedora Object can be validated. Both can easily be mapped from a description of an eSciDoc resource into a description of a Fedora object, except for the cardinality of relations.

Relations of an eSciDoc resource are not directly mapped to relations of the corresponding Fedora object (see "Content Relations" below).

Metadata Datastreams

Metadata records of an eSciDoc resource are defined by a name and an XML Schema. Technically such a record is stored as a datastream of MIME type text/xml in a Fedora object, where the name of the metadata record is the name of the datastream, the XML Schema applies to the content of the datastream, and the datastream is marked as an eSciDoc metadata record.

The description of an eSciDoc metadata record can be mapped to the description of a Fedora datastream and vice versa. So a validation of the eSciDoc resource is possible, as well as a Fedora object validation based on Fedora's modeling language extended by ECM.

This approach lacks the possibility to state optional metadata records or to restrict the set of metadata records to the defined set.

Note that in future we may have metadata records which are defined in RDF/XML - that would mean XML Schema is not the only thing to validate against. This is an issue we still consider, but maybe for some new solutions.--Natasa 15:48, 9 September 2009 (UTC)
Yes, the validation of the content of an XML document (what it states) is another task. The syntactical correctness of an RDF/XML datastream may easily be validated by an XML Schema, but further validation should probably be done by an external tool. Frank 13:10, 10 September 2009 (UTC)

Content Datastreams

Content (also referred to as binary content) of an eSciDoc resource is defined by a name (an individual name in the case of a content-stream in an Item and the name "content" in a Component) and a mime-type. These values can be accurately mapped to the values of a datastream in Fedora. The storage type of the content can be freely chosen and is not restricted by the content model. So a validation of the eSciDoc resource is possible, as well as a Fedora object validation based on Fedora's modeling language extended by ECM. The content itself is not considered for validation by the content model.

This approach lacks the possibility to state optional content-streams in an Item or to restrict the set of content-streams in an Item to a defined set.

Content Relations

In the eSciDoc Infrastructure it should be possible to provide an ontology (usually containing just predicate definitions) in order to allow a specific set of Content Relations for every content object. This idea would collide with Fedora content models if Content Relations were usual relations stored in the RELS-EXT datastream of content objects, because then every content model would have to define every possible relation, and changing the mentioned ontology would require a modification of every content model object.

Fortunately, content relations are technically objects of their own. So every Fedora content model derived from an eSciDoc content model just has to allow relations to Content Relation Objects to be stated. From this point of view it must be considered an advantage that there is only one predicate referring to Content Relations. Otherwise the ECM rule of always stating the complete set of possible relations would break the idea of freely relating resources by Content Relations without regard to ownership and content model of the resources.

Consequently a definition of allowed Content Relations is not possible by content model.

Structural Relations

For an eSciDoc object there exists a set of structural relations which can be listed in the ECM ONTOLOGY datastream in order to ensure their existence in an object's RELS-EXT datastream. Additionally it is possible to define that such a relation must or can point to an object of a specific content model. So the corresponding definitions from an eSciDoc content model can be mapped to a Fedora content model, except for cardinalities other than 1 or 0.

For restrictions on members and components of a content object, a restriction on the (Fedora) content model of the related object is necessary. These restrictions are made via someValuesFrom and allValuesFrom restrictions in the ontology part. If more than one content model is allowed, that is described by restrictions to a union of content model classes. (To be checked if supported by ECM.) This approach lacks the possibility to state cardinalities per allowed content model.

For every component-type stated in the eSciDoc content model a separate Fedora content model object is created. The name (aka content-category) of a component-type and the ID of the eSciDoc Content Model Object are used to generate the ID of the Component Content Model Object. Allowed metadata records are modeled as for common eSciDoc resources (see "Metadata Datastreams" above). The allowed mime-types are listed for the datastream "content". The content-category is ensured by a hasValue restriction (to be checked if supported by ECM).
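
The text does not fix how the ID of the Component Content Model Object is generated. The following sketch shows one conceivable scheme, purely as an assumption: the ID of the eSciDoc Content Model Object is combined with the content-category, and characters outside a conservative set are replaced. The function name, the "-component-" separator and the allowed character set are all hypothetical.

```python
import re


def component_cmodel_id(content_model_id, content_category):
    """Derive an ID for a Component Content Model Object (hypothetical scheme)."""
    # Replace anything outside a conservative character set with "_".
    safe_category = re.sub(r"[^A-Za-z0-9._-]", "_", content_category.strip())
    return f"{content_model_id}-component-{safe_category}"


print(component_cmodel_id("escidoc:cmodel-1", "FULLSIZE"))
```

A deterministic scheme like this would allow the Component Content Model Object to be found again from the eSciDoc Content Model Object and the content-category alone.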


File:ContentModel-draft.xml

File:ContentModel-draft-dsCompositeModel.xml

File:ContentModel-draft-ONTOLOGY.xml

File:ContentModel-draft-CM component FULLSIZE-dsCompositeModel.xml

File:ContentModel-draft-CM component FULLSIZE-ONTOLOGY.xml