Difference between revisions of "EScience Seminar 2009/EScience-Seminar Repository Systems"

Revision as of 11:20, 21 April 2009

Introduction

The second MPG eScience seminar of 2009 is devoted to one of the big challenges for research institutes: how to guarantee the persistence of their records and continuous access to them for all interested and authorized researchers. Each institute therefore needs a strategy for managing the increasing amount and complexity of data, for guaranteeing online access to it, and for replicating the data for preservation purposes. The term "digital repositories" seems to properly describe the layer of functionality that we want. JISC defines the term as follows ^[1]: "Repositories are important for universities and colleges in helping to capture, manage, and share institutional assets as a part of their information strategy. A digital repository can hold a wide range of materials for a variety of purposes and users. It can support learning, research and administrative processes."

The concept of the "Digital Repository" is of course not new, although its meaning has changed rapidly due to new requirements caused mainly by the increasing amount and complexity of data, the Internet, and the awareness of long-term accessibility. Within the MPG it was Friedrich Hertweck ^[2] from RZG (the Computer Centre in Garching) who created AMOS (Advanced Multi user Operating System) in the 1970s. AMOS ^[3] was an excellent piece of software that allowed scientists in plasma physics (amongst others) to store and retrieve their large data volumes, and it turned out to be an island of stability for many years. The natural sciences now need to maintain repositories that cover petabytes of data, which in general are highly structured. In the humanities, too, we are now close to maintaining hundreds of terabytes, where the problem is often not the sheer amount but the inherent complexity of the data sets.

For some years now, the concept of "digital repositories" has been discussed in a number of different contexts. A recent study from the DRIVER project ^[4], covering 114 digital repositories, revealed that most institutes associate the term with repositories for publications. More than 80% of these repositories contain journal articles and other types of publications; only about 10% also store primary data sets and common data types such as audio and video recordings. This is fully in line with the experience of the European CLARIN project ^[5], which wants to build a network of centers functioning as the backbone for persistent access to language resources and services. About 80% of the potential centers are busy restructuring their repositories to fulfill the new requirements. From these two results we can conclude

that researchers are used to storing ePublications in proper repositories and associating them with proper metadata, for example to support discovery, and
that researchers are not used to storing their research data in ways that let other researchers access them easily.

It seems that researchers in general still use idiosyncratic methods to store their data, that they tend to structure it with minimalistic means such as file names and directory structures, and that long-term accessibility has not been a primary concern. For an increasing number of researchers, in particular those participating in international data-driven collaborations (e.g. genomics or climate), it is becoming obvious, however, that they need to change their behavior.

This eScience seminar will therefore focus on repository solutions for research data, be it primary data generated by some types of sensors or secondary data that is generated by researchers to allow interpretations. The field of primary and in particular secondary data is characterized by an extreme heterogeneity of data types, formats, and implicit or explicit semantics, making it a difficult field for abstractions. This is different from ePublications where the data types and formats are widely standardized, where metadata characterization has a long history and where the semantics of the content can be interpreted by the reading researcher.

Views on Digital Repositories

Much has been written about digital repositories in recent years. We would like to cite three initiatives without claiming to be comprehensive (see below). Other important initiatives, such as FEDORA ^[6] or OAI-ORE ^[7], have also thought about repositories and layers of abstraction.

DELOS Digital Library

The DELOS Digital Library project ^[8] presented a careful analysis of a number of aspects of "Digital Libraries" that can be carried over to digital repositories. They present a summary of the main points of their manifesto ^[9], which we simply include here. From this manifesto they derive an abstract reference model. There is no clear separation between Digital Libraries and Digital Repositories, but it can be stated that a proper Digital Library model will include a proper Digital Repository as its core.

The Digital Library Manifesto in Brief

It is commonly understood that the Digital Library universe is a complex and multifaceted domain that cannot be captured by a single definition. The Manifesto organizes the pieces constituting the puzzle into a single framework (Figure II.1-1).

In particular, it identifies the three different types of systems operating in the Digital Library universe, i.e.

the Digital Library (DL) – the final ‘system’ actually perceived by the end-users as being the digital library;
the Digital Library System (DLS) – the deployed and running software system that implements the DL facilities; and
the Digital Library Management System (DLMS) – the generic software system that supports the production and administration of DLSs and the integration of additional software offering more refined, specialized or advanced facilities.

The Manifesto also organizes the Digital Library universe into domains

The Resource Domain captures generic characteristics that are common to the other specialized domains. Building on this, the model introduces six orthogonal and complementary domains that together strongly characterize the Digital Library universe and capture its specificities with respect to generic information systems. These specialized domains are:
Content – represents the information made available;
User – represents the actors interacting with the system;
Functionality – represents the facilities supported;
Policy – represents the rules and conditions, including digital rights, governing the operation;
Quality – represents the aspects needed to consider digital library systems from a quality point of view;
Architecture – represents the physical software (and hardware) constituents concretely realizing the whole.

Venue

RZG Garching

Date

25/26 June 2009

Responsible for content

Malte Dreyer, Andreas Gros (MPDL), Stefan Heinzel (RZG), Peter Wittenburg, Daan Broeder (MPI for Psycholinguistics), Frank Toussaint, Michael Lautenschlager (MPI for Meteorology)

Speakers

Registration

Registration is open at: http://escience.mpg.de/registration_en.html

References

[1] http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2005/repositories_conference.aspx

[2] http://www.ipp.mpg.de/ippcms/de/presse/archiv/10_98_pi.html

[3] http://de.wikipedia.org/wiki/Rechenzentrum_Garching

[4] http://www.driver-repository.eu/

[5] http://www.clarin.eu/

[6] http://www.fedora-commons.org/

[7] http://www.openarchives.org/ore/

[8] http://www.delos.info/

[9] http://www.delos.info/index.php?option=com_content&task=view&id=345&Itemid=#docs

@@ Line 12: / Line 12: @@
 It seems that researchers in general still use idiosyncratic methods to store their data, that they tend to structure it with minimalistic means such as file names and directory structures, and that long-term accessibility has not been a primary concern. For an increasing number of researchers, in particular those participating in international data-driven collaborations (e.g. genomics or climate), it is becoming obvious, however, that they need to change their behavior.
-This eScience seminar will therefore focus on repository solutions for research data, be it primary data generated by some types of sensors or secondary data that is generated by researchers to allow interpretations. The field of primary and in particular secondary data is characterized by an extreme heterogeneity of data types, formats, and implicit or explicit semantics, making it a difficult field for abstractions. This is different from ePublications where the data types and formats are widely standardized, where metadata characterization has a long history and where the semantics of the content can be interpreted by the reading researcher.
+'''This eScience seminar will therefore focus on repository solutions for research data, be it primary data generated by some types of sensors or secondary data that is generated by researchers to allow interpretations. The field of primary and in particular secondary data is characterized by an extreme heterogeneity of data types, formats, and implicit or explicit semantics, making it a difficult field for abstractions. This is different from ePublications where the data types and formats are widely standardized, where metadata characterization has a long history and where the semantics of the content can be interpreted by the reading researcher.'''
+===Views on Digital Repositories===
+Much has been written about digital repositories in recent years. We would like to cite three initiatives without claiming to be comprehensive (see below). Other important initiatives, such as FEDORA <ref>http://www.fedora-commons.org/</ref> or OAI-ORE <ref>http://www.openarchives.org/ore/</ref>, have also thought about repositories and layers of abstraction.
+====DELOS Digital Library====
+The DELOS Digital Library project <ref>http://www.delos.info/</ref> presented a careful analysis of a number of aspects of "Digital Libraries" that can be carried over to digital repositories. They present a summary of the main points of their manifesto <ref>http://www.delos.info/index.php?option=com_content&task=view&id=345&Itemid=#docs</ref>, which we simply include here. From this manifesto they derive an abstract reference model. There is no clear separation between Digital Libraries and Digital Repositories, but it can be stated that a proper Digital Library model will include a proper Digital Repository as its core.
+=====The Digital Library Manifesto in Brief=====
+It is commonly understood that the Digital Library universe is a complex and multifaceted domain that cannot be captured by a single definition. The Manifesto organizes the pieces constituting the puzzle into a single framework (Figure II.1-1).
+In particular, it identifies the three different types of systems operating in the Digital Library universe, i.e.
+#	the Digital Library (DL) – the final ‘system’ actually perceived by the end-users as being the digital library;
+#	the Digital Library System (DLS) – the deployed and running software system that implements the DL facilities; and
+#	the Digital Library Management System (DLMS) – the generic software system that supports the production and administration of DLSs and the integration of additional software offering more refined, specialized or advanced facilities.
+=====The Manifesto also organizes the Digital Library universe into domains=====
+#	The Resource Domain captures generic characteristics that are common to the other specialized domains. Building on this, the model introduces six orthogonal and complementary domains that together strongly characterize the Digital Library universe and capture its specificities with respect to generic information systems. These specialized domains are:
+#	Content – represents the information made available;
+#	User – represents the actors interacting with the system;
+#	Functionality – represents the facilities supported;
+#	Policy – represents the rules and conditions, including digital rights, governing the operation;
+#	Quality – represents the aspects needed to consider digital library systems from a quality point of view;
+#	Architecture – represents the physical software (and hardware) constituents concretely realizing the whole.
@@ Line 34: / Line 58: @@
+==References==
 <references/>
 [[Category:EScience_Seminars]]