EScience Seminar 2008/EScience-Seminar Metadata Infrastructures

Background

All Max Planck Institutes have to cope with the management of an increasing amount of data and its storage for at least 10 years. Metadata descriptions are essential to the solution of the management problem. Metadata can also be used to support resource discovery, to perform scientific data-mining and to generate virtual collections.

Goal

The seminar will present and discuss the role of metadata in the context of management, use and reuse of scientific data (e-Science). Presenters from ongoing large international e-Science projects will describe issues, experiences, problems and solutions. International and German experts will talk about standards development efforts regarding necessary infrastructure components and demonstrate feasible methodologies, e.g. metadata application profiles, linking between datasets and publications, and treatment of aggregations of web resources. Presentations from MPIs are intended to document metadata-related needs and experiences and to further the discussion on cooperation and a strategy for future work in the MPG.

Responsible for content

Traugott Koch (MPDL)
Peter Wittenburg (MPI Nijmegen)

Place

Harnack House, Berlin

Date

14-15 October 2008

Agenda

Tuesday, October 14th, 2008

11.00 - 11.45
Introduction. All participants present themselves

11.45 - 13.00
MPI presentations:
Frank Toussaint (MPI Meteorology, Hamburg; World Data Center for Climate): The CERA-2 meta database and needs for a common information model (Slides: 3.9 MB)

Sylvia Kortüm (MPI for Intellectual Property, Competition and Tax Law, Munich): Law-related MPIs: CMS project (Slides: 3.5 MB)

~~Wolfgang Voges (MPI for Extraterrestrial Physics, Garching; MPDL): Metadata in the case of an astronomical RoR [Registry of Registries]~~

Peter Wittenburg, Wolfgang Voges: MPG Registry of Registries (Slides: 0.7 MB)

Peter Wittenburg, Daan Broeder (MPI for Psycholinguistics, Nijmegen): ISOCat. Component model (Slides: 0.2 MB)

(13.00 - 14.00
Lunch)

14.00 - 15.30
Brian Matthews (Rutherford Appleton Laboratory, UK): Practical experiences from e-science applications and related metadata solutions (Slides: 18.8 MB)

(15.30 - 16.00
Coffee)

16.00 - 17.30
Pete Johnston (Eduserv Foundation, UK): Metadata standard and best practice developments: Dublin Core Abstract Model, SWAP, OAI-ORE etc. (Slides: 6.9 MB)

17.40 - 18.15
Tom Baker (DCMI): Metadata engineering methodology (Slides: 5.3 MB)

Wednesday, October 15th, 2008

9.00 - 10.30
Jacco van Ossenbruggen (Centrum voor Wiskunde en Informatica and Vrije Universiteit Amsterdam, Netherlands): Semantic interoperability of data values, use and matching of ontologies and unstructured vocabularies (Slides: 15.9 MB or from CWI: slides)

(10.30 - 11.00
Coffee)

11.00 - 12.30
Breakout groups (with introduction to tasks and presentation of an MPG-wide metadata registry project).
Topics: Enumeration of metadata-related problems; cooperation options, MPG-wide projects, potential support needed; necessity of common policies and standards

(12.30 - 13.30
Lunch)

13.30 - 14.10
Reports from breakout groups

14.10 - 14.40
Martin Stricker (Helmholtz-Zentrum fuer Kulturtechnik der Humboldt-Universitaet zu Berlin): Developing an ontology for academic disciplines (Slides: 1.5 MB)

(14.40 - 15.10
Coffee)

15.10 - 15.45
Tom Baker (DCMI): Recent developments regarding web-enabled vocabularies: SKOS, tagging, microformats etc. (Slides: 4.4 MB)

15.45 - 16.00
Conclusion

Results of the breakout groups

Reporters: Bettina Mann, Michael Lautenschlager, Daan Broeder, Peter Wittenburg

This note gives a brief report on the topics that were raised during the breakout session at the MPG eScience Seminar about metadata. Each breakout group received the same set of questions to address. Additional points could be taken up by each of the three groups. A few questions, such as which standards are relevant, were not taken up because of time constraints. However, it was argued that such topics can also be elaborated on by the experts via the COLAB wiki.

It should be noted that the term "metadata" was used in the restricted sense of "descriptive metadata", i.e. a keyword-type description of the resource. It is assumed here that metadata is in general rich and contains scientifically useful information, i.e. that it covers more than the typical bibliographic descriptors such as creator, title etc.

Is Metadata creation a "must" for each MPI? Which functions does it serve?

Depending on the realm of use, metadata will serve different needs, and not all information will be relevant for all users. We can differentiate between the following realms: private use, collaboration use, use by discipline scholars, interdisciplinary use and occasional users.
Metadata not only represents a given resource, but also adds information about the resource that is relevant for management, documentation, discovery, citation and research purposes.
Metadata is important for preserving information about resources when researchers leave the institutes, which is the normal case in MPIs. Provided it is detailed enough, it allows resources to be reused even when the creator is no longer available.
All MPIs are required to preserve the context of scientific publications for at least 10 years, i.e. the primary and secondary resources that were used as sources of information. Metadata can be used in references and to organize the holdings so that the context can be located easily. Where MPIs are morally obliged to preserve data for much longer, this requirement is even more important. Without metadata the nature of resources could not be identified easily.
For each MPI that creates and collects large amounts of resources, metadata is the only way to ensure proper management, discovery and long-term preservation. In many cases, metadata is therefore a prerequisite for any reuse of data.
If metadata is rich, it becomes a valuable research object in its own right, since it can be used for selection, filtering, querying, statistics, portal creation etc. Increasingly often in the Web scenario, metadata can be used together with content in complex queries that lead directly to research results. More advanced uses will emerge once critical masses of metadata have been collected and are exploited with the help of semantic web technologies.
In experimentally oriented MPIs, metadata could be used as a lab book if it contained all process information and if authenticity could be guaranteed. This is the direction in which Rutherford Appleton and the MPI for Meteorology are heading.
There will be increasing pressure from research communities and from governments to publish metadata about resources that were created with public money. Only this will create visibility, allow evaluation and enable sharing. This will also affect MPIs if they do not want to damage their reputation.
Currently, publications form the major evaluation criterion. It was assumed that in future resources (data and/or tools) will also be considered in the evaluation process of the FBR and in the forming of public opinion.
Metadata should also inform the user about ways to get access to resources.

In summary, all participants gave a clear "Yes" to the question of whether all institutes that deal with larger amounts of data should create and maintain metadata descriptions of their resources. It was argued that the directors in particular, as well as the researchers, need to be convinced of the necessity of metadata description and of reserving time and attention for this task.

Which aspects are relevant for metadata creation?

It is obvious that metadata will change over time, since a resource's attributes will change, new versions will be created and metadata will be enriched as insight deepens.
It is generally agreed that metadata should be created as early as possible in the resource creation process, since only then can high-quality descriptions be expected with high probability. Where possible, this parallel creation of resource and metadata should be supported by appropriate tools and (web) forms.
Various groups can participate in the metadata creation process and take on different roles: researchers entering (minimal) metadata that is essential for the scientific interpretation, assistants adding standard information, technologists doing script-based curation, librarians adding further information related to publications etc. Certain content-related fields can of course only be filled in by the researchers.
Metadata creation takes effort, and directors need to be convinced to form small groups that involve different people in the creation and curation process.
Requirements differ between MPIs: there is no single schema, and the workflow processes differ, which implies that the tools will often differ as well.
Metadata are only useful when they point to accessible resources.
Experience shows that incentives need to be made clear to researchers to motivate them to improve metadata quality.

Which help is required from outside?

Each MPI is primarily responsible for organizing its metadata creation, management and distribution processes. This question was meant to clarify what kind of support MPIs would like to have from a central facility such as the MPDL.

A major point was that central groups have the capacity to gain knowledge through their outside connections and that they can bring this knowledge back into the internal MPG processes. This knowledge can cover a wide range of topics, from the design of metadata vocabularies to their use in the semantic web. Generic solutions could also be provided where this makes sense. The drawback is certainly that central experts lack knowledge of the domain-specific aspects.
Other areas of support were mentioned: information about standards, legal aspects, and quality checks in various respects (schema correctness, schema compliance, content auditing/evaluation).
A central and shared group could form part of local teams that set up a metadata production line.
A kind of fire-brigade function was discussed, in the sense that such a central group could help set up solutions that are required, for example, to participate in harvesting networks. Experts could help local staff set up OAI-PMH software or design and implement a DCMI gateway, for example. The Helmholtz society has such a group.
It could also be very valuable to help find solutions for integrating data created by the researchers in Excel spreadsheets, Word documents etc.
Effective help requires small pilot projects in order to demonstrate its usefulness. Help needs to focus on practical solutions in the MPIs rather than directly addressing advanced semantic web aspects.
A centralized group should maintain an information site (such as COLAB) where best-practice examples are gathered and commented on. It could also maintain a repository of relevant documentation, as it takes a lot of time to determine the relevance of documents found on the web.
Highly important are hands-on training courses on various matters.
A central group can act as a catalyst to bring together institutes that face similar challenges or for which it makes sense to share metadata.
Support or funds could be provided to join discipline-oriented metadata service collaborations.
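
The harvesting support mentioned above can be illustrated with a minimal Python sketch. It assumes a hypothetical repository endpoint (https://example.org/oai) that exposes unqualified Dublin Core (the oai_dc metadata prefix) over OAI-PMH. The endpoint, the sample response and the function names are illustrative assumptions and do not describe any existing MPG service. The sketch builds a ListRecords request URL and extracts the Dublin Core fields from a response; no network access is performed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and the Dublin Core element set.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (base_url is hypothetical)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective harvesting by set
    return base_url + "?" + urlencode(params)


# Hand-written sample response for illustration only, not real harvested data.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:collection-1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example collection</dc:title>
          <dc:creator>Example Institute</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def parse_records(xml_text):
    """Return one dict per record with its identifier and Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        identifier = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        fields = {}
        for el in rec.iter():
            if el.tag.startswith(f"{{{DC_NS}}}"):
                name = el.tag.split("}", 1)[1]  # strip the namespace part
                fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": identifier, "dc": fields})
    return records


if __name__ == "__main__":
    print(build_list_records_url("https://example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["identifier"], record["dc"])
```

A real harvester would additionally need to follow resumption tokens and handle protocol errors, which this sketch leaves out.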

What function can RoR have within the MPG?

The idea of a Registry of Registries, i.e. gathering and merging metadata from various institutes and offering it via one portal, was born at an earlier eScience seminar, and a grant application was written. In the meantime the grant application was accepted, but it was decided to wait for the results of the metadata meeting before specifying the details of the approach. Only very little time was left during the breakout sessions to talk about this topic.

The big challenge is the broad semantic scope that the MPIs span with their metadata. There will be much semantic overlap between MPIs with similar research foci, but very little overlap between, for example, MPIs from the humanities on the one hand and plasma physics on the other. Therefore, methods are needed that do not claim to develop THE ontology for the MPG, since this would fail.
Presenting domain-oriented clusters of resources as well as the overall view is a challenge. One problem is that institutes belong to several clusters and that there is an almost continuous spectrum between the extremes.
A number of reasons were mentioned that support the RoR idea:
- one portal for the visibility of resources and of their originating institutes, which is important for scientific as well as political reasons in the ongoing competition
- facilitation of interdisciplinary research (even the unexpected cases)
- one portal that could support visibility for Google and other search engines
- makes claims about existing metadata and in particular resources explicit
- puts pressure on metadata providers to keep their services up to date
- a central (machine-readable) catalog would allow participation by institutes that have no local catalog of their own and are not part of other collaborations
- it will help propagate uniformity throughout the MPG with respect to many issues such as quality, persistence, citation information associated with PIDs etc., if the MPG decides to pursue this
Different views and search interfaces need to be offered at the user interface allowing researchers to enter imprecise as well as precise queries. In most cases researchers will prefer to use their discipline terminologies at the user interface. DCMI semantics are obvious for the general user, but not for professional researchers looking for detailed information.
RoR will only work if vocabularies are registered and if researchers can tune the incorporated relations according to their needs.
Metadata should be open, but partial restrictions need to be respected during harvesting. However, those are the responsibility of the participating MPI.
Central harvesting only makes sense at collection level, i.e. at coarse granularity. However, the notion of a collection is partly blurred and in some cases dynamic.
Some institutes' metadata will be harvested by discipline-specific organizations with much more overlapping semantics, while RoR is an organization-oriented approach. The two are complementary. For example, ICSU is an organization that harvests metadata from data centers worldwide, i.e. across disciplines.
One possible approach for establishing a federated metadata model could be to take ISO 690-2 metadata as a basis, as it would be generic enough to be applicable to a wide range of disciplines. ISO 690-2 metadata is used by the MPI for Meteorology for cross-discipline referencing of data entities in their STD-DOI publication service for primary data.
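
The point above about registered vocabularies and tunable relations can be sketched in Python as follows. This is only an illustration under stated assumptions: the vocabulary names, the field names and the generic target elements (title, creator, subject) are hypothetical and are not taken from any MPI schema or from the RoR project. Each registered vocabulary maps discipline-specific fields to generic elements for a central catalog, and a researcher can override individual relations.

```python
# Hypothetical registered vocabularies: discipline field -> generic element.
REGISTERED_RELATIONS = {
    "climate": {
        "experiment_name": "title",
        "institute": "creator",
        "variable": "subject",
    },
    "linguistics": {
        "session_title": "title",
        "researcher": "creator",
        "language": "subject",
    },
}


def to_generic(record, vocabulary, overrides=None):
    """Map a discipline-specific record onto generic elements.

    `overrides` lets a user tune individual relations without changing
    the registered vocabulary itself.
    """
    relations = dict(REGISTERED_RELATIONS[vocabulary])
    relations.update(overrides or {})
    generic = {}
    for field, value in record.items():
        target = relations.get(field)
        if target is None:
            continue  # unmapped fields are not exported to the central view
        generic.setdefault(target, []).append(value)
    return generic


if __name__ == "__main__":
    # Illustrative record, not real data.
    session = {
        "session_title": "Interview 1",
        "language": "Dutch",
        "recording_device": "DAT",
    }
    print(to_generic(session, "linguistics"))
    print(to_generic(session, "linguistics", overrides={"language": "language"}))
```

Keeping the relations in a registry rather than in code is what would allow them to be inspected and adjusted per discipline; the detailed local records remain with the originating institute.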

Links

You can find the seminar preparation page here: Metadata_Infrastructures_Seminar_Preparation