Difference between revisions of "EScience Seminar 2009/EScience-Seminar Repository Systems"

Revision as of 11:13, 21 April 2009

Introduction

The second MPG eScience seminar of 2009 is devoted to one of the big challenges for research institutes: how to guarantee the persistence of their records and continuous access to them for all interested and authorized researchers. Each institute therefore needs a strategy for managing the increasing amount and complexity of data, for guaranteeing online access to it, and for replicating the data for preservation purposes. The term "digital repositories" seems to properly describe the layer of functionality that we want. JISC defines the term as follows ^[1]: "Repositories are important for universities and colleges in helping to capture, manage, and share institutional assets as a part of their information strategy. A digital repository can hold a wide range of materials for a variety of purposes and users. It can support learning, research and administrative processes."

The concept of the "digital repository" is of course not new, although its meaning has changed rapidly due to new requirements caused mainly by the increasing amount and complexity of data, the Internet, and the awareness of long-term accessibility. Within the MPG it was Friedrich Hertweck ^[2] from RZG (the computer centre in Garching) who created AMOS (Advanced Multi user Operating System) in the 1970s. AMOS ^[3] was an excellent piece of software that allowed scientists in plasma physics (amongst others) to store and retrieve their large data volumes, and it turned out to be an island of stability for many years. The natural sciences now need to maintain repositories that cover petabytes of data, which in general is highly structured. In the humanities, too, we are now close to maintaining hundreds of terabytes, where the problem is often not the sheer amount but the inherent complexity of the data sets.

For a few years now, the concept of "digital repositories" has been discussed in a number of different contexts. A recent study from the DRIVER project ^[4], covering 114 digital repositories, revealed that most institutes associate this term with repositories for publications. More than 80% of these repositories contain journal articles and other types of publications; only about 10% also store primary data sets and common data types such as audio and video recordings. This is fully in line with the experiences in the European CLARIN project ^[5], which wants to build a network of centers functioning as the backbone for offering persistent access to language resources and services. About 80% of the potential centers are busy restructuring their repositories to fulfill the new requirements. From these two results we can conclude:

that researchers are used to storing ePublications in proper repositories and associating them with proper metadata, for example to support discovery, and
that researchers are not used to storing their research data in such a way that other researchers can easily access it.

It seems that in general researchers still use idiosyncratic methods to store their data, that they tend to structure it with minimalistic solutions such as file names and directory structures, and that long-term accessibility has not been an issue of primary concern. For an increasing number of researchers, in particular those participating in international data-driven collaborations (e.g. genomics or climate), it is becoming increasingly obvious, however, that they need to change their behavior.

This eScience seminar will therefore focus on repository solutions for research data, be it primary data generated by some type of sensor or secondary data generated by researchers to allow interpretations. The field of primary and, in particular, secondary data is characterized by an extreme heterogeneity of data types, formats, and implicit or explicit semantics, making it a difficult field for abstractions. This is different from ePublications, where the data types and formats are widely standardized, where metadata characterization has a long history, and where the semantics of the content can be interpreted by the reading researcher.

Venue

RZG Garching

Date

25/26 June 2009

Responsible for content

Malte Dreyer, Andreas Gros (MPDL), Stefan Heinzel (RZG), Peter Wittenburg, Daan Broeder (MPI for Psycholinguistics), Frank Toussaint, Michael Lautenschlager (MPI for Meteorology)

Speakers

Registration

Registration is open at: http://escience.mpg.de/registration_en.html

[1] http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2005/repositories_conference.aspx

[2] http://www.ipp.mpg.de/ippcms/de/presse/archiv/10_98_pi.html

[3] http://de.wikipedia.org/wiki/Rechenzentrum_Garching

[4] http://www.driver-repository.eu/

[5] http://www.clarin.eu/


@@ Line 1: / Line 1: @@
 ==Introduction==
 The second MPG eScience seminar of 2009 is devoted to one of the big challenges for research institutes: how to guarantee persistence and continuous access to its records to all interested and authorized researchers.
 Therefore, each institute needs to have a strategy of how to manage the increasing amounts and complexity of data, how to guarantee online access to it and how to replicate the data for preservation purposes. The term "digital repositories" seems to properly describe the layer of functionality that we want. JISC defines the term in the following words <ref>http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2005/repositories_conference.aspx</ref>: "Repositories are important for universities and colleges in helping to capture, manage, and share institutional assets as a part of their information strategy. A digital repository can hold a wide range of materials for a variety of purposes and users. It can support learning, research and administrative processes."
-The concept "Digital Repository" is not new of course, although its meaning changed rapidly due to the new requirements caused mainly by the increasing amount and complexity of data, the Internet and the awareness of long-term accessibility. Within the MPG it was Friedrich Hertweck <ref>http://www.ipp.mpg.de/ippcms/de/presse/archiv/10_98_pi.html</ref> from RZG (Computer Centre in Garching) who created AMOS (Advanced Multi user Operating System) in the 70-ies. AMOS [3] was an excellent piece of software, which allowed scientists in Plasma Physics (amongst others) to store and retrieve their large data volumes, and it turned out to be an island of stability for many years. Natural sciences now need to maintain repositories that cover petabytes of data, which in general is highly structured. But also in the humanities we are now close to maintaining hundreds of terabytes where often the problem is not the sheer amount, but the inherent complexity of the data sets.
+The concept "Digital Repository" is not new of course, although its meaning changed rapidly due to the new requirements caused mainly by the increasing amount and complexity of data, the Internet and the awareness of long-term accessibility. Within the MPG it was Friedrich Hertweck <ref>http://www.ipp.mpg.de/ippcms/de/presse/archiv/10_98_pi.html</ref> from RZG (Computer Centre in Garching) who created AMOS (Advanced Multi user Operating System) in the 70-ies. AMOS <ref>http://de.wikipedia.org/wiki/Rechenzentrum_Garching</ref> was an excellent piece of software, which allowed scientists in Plasma Physics (amongst others) to store and retrieve their large data volumes, and it turned out to be an island of stability for many years. Natural sciences now need to maintain repositories that cover petabytes of data, which in general is highly structured. But also in the humanities we are now close to maintaining hundreds of terabytes where often the problem is not the sheer amount, but the inherent complexity of the data sets.
-Since a few years the concept of "digital repositories" is being discussed in a number of different contexts. A recent study from the DRIVER project [4], covering 114 digital repositories, revealed that most institutes associate with this term repositories for publications. More than 80% of these repositories contain journal articles and other types of publications, only about 10 % also store primary data sets and common data types such as audio and video recordings. This is fully in line with the experiences in the European CLARIN project [5], which wants to build a network of centers functioning as the backbone for offering persistent access to language resources and services. About 80% of the potential centers are busy restructuring their repository to fulfill the new requirements. From these two results we can conclude
+Since a few years the concept of "digital repositories" is being discussed in a number of different contexts. A recent study from the DRIVER project <ref>http://www.driver-repository.eu/</ref>, covering 114 digital repositories, revealed that most institutes associate with this term repositories for publications. More than 80% of these repositories contain journal articles and other types of publications, only about 10 % also store primary data sets and common data types such as audio and video recordings. This is fully in line with the experiences in the European CLARIN project <ref>http://www.clarin.eu/</ref>, which wants to build a network of centers functioning as the backbone for offering persistent access to language resources and services. About 80% of the potential centers are busy restructuring their repository to fulfill the new requirements. From these two results we can conclude
-•	that researchers are used to store ePublications in proper repositories and associate them with proper metadata for example to support discovery and
+*that researchers are used to store ePublications in proper repositories and associate them with proper metadata for example to support discovery and
-•	that researchers are not used to store their research data in such ways that other researchers can easily access them.
+*that researchers are not used to store their research data in such ways that other researchers can easily access them.
 It seems that in general researchers still use idiosyncratic methods to store their data, that they tend to structure it by minimalistic solutions, such as file names and directory structures, and that long-term accessibility was/is not an issue of primary concern. For an increasing number of researchers, in particular when they are participating in international data driven collaborations (e.g. genomics or climate), it becomes increasingly obvious, however, that they need to change their behavior.
@@ Line 36: / Line 35: @@
+<references/>
 [[Category:EScience_Seminars]]