EScience Seminar 2008/EScience-Seminar Aspects of long-term archiving

From MPDL MediaWiki
Revision as of 08:06, 15 July 2008 by Andi (Talk | contribs)


Goal

Building on experience acquired in recent years (GWDG (Gesellschaft für Wissenschaftliche Datenverarbeitung Göttingen) and RZG (Rechenzentrum Garching, now MPCDF) offering services for bitstream preservation, growing awareness of the need for open archive formats), strategies for long-term archiving within the Max Planck Society will be developed. Furthermore, future service offerings and suggestions for file formats and metadata will be discussed. Organisational responsibilities for the lifecycle management of data (format migration, access strategies) within the Max Planck Society will be clarified.

Responsible for content

Dagmar Ullrich (GWDG)
Wolfgang Voges (MPDL, Max Planck Digital Library)


Contributions

(The slides will be uploaded after submission of the respective final versions.)

General approach to digital Long-Term Preservation (dLTP)

  • Introduction, current situation and work done so far at the MPG (Max-Planck-Gesellschaft) (Dagmar Ullrich (GWDG), Wolfgang Voges (MPDL))
  • Dealing with Data: Roles, Rights, Responsibilities and Relationships, Liz Lyon, UKOLN (United Kingdom Office for Library and Information Networking), (Slides, 11.1MB)
  • LTP of digital publications in a memory institution -- a challenge in the triangle of technology, integration and cooperation, Reinhard Altenhöner, DNB (Deutsche Nationalbibliothek), (Slides, 3.8MB)
  • Requirements of e-Science and Grid Projects towards dLTP of Research Data, Jens Klump, GFZ (Deutsches GeoForschungsZentrum Potsdam), (Slides, 0.3MB)


Technical aspects

  • Metadata for digital Long-Term Preservation, Michael Day (UKOLN)
  • Assessing file formats for dLTP, Caroline van Wijk (KB, Koninklijke Bibliotheek)
  • Persistent identifiers for long-term archived data, Malte Dreyer (MPDL)


Organisational aspects

  • Rule-based Distributed Data Management, Reagan Moore, SDSC, (Slides, 3MB)
  • Standards and Standardization in the Context of eScience and dLTP, Peter Rödig, UniBwM (Slides, 0.2MB)
  • Trustworthy Digital Archives, Susanne Dobratz, RZ HU Berlin (Humboldt-Universität zu Berlin) (Slides, 0.5MB)
  • Overview of Sustainable Digital Preservation, Sayeed Choudhury, Blue Ribbon Task Force, (Slides, 0.3MB)
  • Calculating costs of dLTP, Neil Beagrie, Charles Beagrie Limited, (Slides, 0.4MB)


Current practices

  • The role of dLTP in the eSciDoc project, Natasa Bulatovic, MPDL, (Slides, 2.2MB)
  • Digital Long-Term Archiving at GWDG and other Archiving Systems, Dagmar Ullrich, GWDG, (Slides, 0.8MB)
  • Long-Term Archiving of Climate Model Data at WDC Climate and DKRZ, Michael Lautenschlager, MPI-M, (Slides, 2.5MB)
  • Digital Long-Term Preservation of linguistic resources at the MPI for Psycholinguistics, Paul Trilsbeek, Peter Wittenburg, MPI-PL (Slides, 1.2MB)


Future perspective for dLTP in the MPG, final discussion

  • Summary, Dagmar Ullrich (GWDG), Wolfgang Voges (MPDL)


Abstracts

LTP of digital publications in a memory institution -- a challenge in the triangle of technology, integration and cooperation (Reinhard Altenhöner, DNB)

One of the unresolved problems of the global information society is ensuring the long-term accessibility of digital documents. The challenges are especially daunting for institutions that aim to keep information objects available for several hundred years. Not only technological aspects but also organisational questions have to be addressed, not least the question of how long-term preservation should be integrated into the life cycle of a digital information object. The example of kopal (Co-operative Development of a Long-Term Digital Information Archive), a publicly funded, successful realisation of a cooperative digital archive solution, shows what one possible technological solution looks like, and how the development of subsequent steps helps to understand the specific challenges that libraries and cultural heritage organisations face in terms of the underlying technology, the need for cooperation, and the integration of LTP into the life cycle of digital objects.
http://www.kopal.langzeitarchivierung.de/index.php.en


Requirements of e-Science and Grid Projects towards dLTP of Research Data (Jens Klump, GFZ Potsdam)

The enormous amounts of data from Grid projects and the complexity of data from e-science projects suggest that these new types of projects also have new requirements towards long-term archiving of data. On the other hand, Grid technology and semantic tools emerging from e-science might provide us with new methods that may be useful in long-term digital preservation.

The study "Requirements of e-science and Grid projects towards long-term archiving of scientific and scholarly data" investigates, from both a technological and a management perspective, whether existing infrastructures in data-producing e-science and Grid research communities meet the requirements of long-term digital preservation. The study also investigates whether technologies and best practices from e-science and Grid projects can be transferred to organisations and systems in the field of long-term digital preservation.

The interviews conducted as part of this study showed considerable differences between projects in the way they approached long-term digital preservation of data. Their achievements, but also their deficits, are analysed and discussed. The recommendations given in this study are derived from this analysis and from discussions with stakeholders in e-science and Grid projects.


Standards and Standardization in the Context of eScience and dLTP (Peter Rödig, UniBwM)

The talk first introduces the basics of standards and standardization, then briefly presents OAIS as a means for organizing standards, and illustrates the state of the art in handling traditional digital objects in the dLTP community. The presentation then summarizes the specific characteristics of eScience and assesses their impact on OAIS elements and related standards. The talk concludes with an assessment of the current situation and some suggestions for improving standardization efforts.


Overview of Sustainable Digital Preservation (Sayeed Choudhury, Blue Ribbon Task Force)

Johns Hopkins University (JHU) has initiated a series of data curation activities that focus on data and publications as new compound objects. This work has been most thoroughly explored in the context of the Virtual Observatory. While much of this work has been technical in nature, an equally important aspect for consideration is the economics of sustainability. Choudhury, who leads the JHU Libraries' data curation efforts, is also a member of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF-SPDA). The BRTF-SPDA is funded by the US National Science Foundation and the Andrew W. Mellon Foundation, in partnership with the US Library of Congress, the UK Joint Information Systems Committee, the US Council on Library and Information Resources and the US National Archives and Records Administration. During the next two years, the BRTF-SPDA will explore the sustainability challenge with the goal of delivering specific recommendations that are economically viable and of use to a broad audience, from individuals to institutions, and from corporations to cultural heritage centers.

This Task Force will:

  • Conduct an analysis of previous and current models for sustainable digital preservation, and identify current best practices among existing collections, repositories and analogous enterprises.
  • Develop a set of economically viable recommendations to catalyze the development of reliable strategies for the preservation of digital information.
  • Provide a research agenda to organize and motivate future work in the specific area of economic sustainability of digital information.

Sayeed Choudhury will provide an update related to the BRTF-SPDA's recent work featuring a working definition of economic sustainability and highlights from the Task Force's first draft report.


Digital Long-Term Archiving at GWDG and other Archiving Systems (Dagmar Ullrich, GWDG)

This talk describes the current approach to digital long-term archiving at GWDG and shows which technical solutions are already in use for bitstream preservation. The kopal system, which emerged from the kopal project "Co-operative Development of a Long-Term Digital Information Archive" and is hosted at GWDG, is introduced, followed by a short overview of other existing archiving systems.


Long-Term Archiving of Climate Model Data at WDC Climate and DKRZ (Michael Lautenschlager, MPI-M)

The computing capabilities for the production of Earth system model data are growing faster than the prices for mass storage media are falling. If the archiving philosophy were left unchanged during the migration to the next compute server generation, the cost of long-term archiving would consequently rise, and the total cost of archiving would tend to exceed the budget left for compute services. At WDCC (World Data Center Climate) and DKRZ (German Climate Computing Centre), a new concept for long-term archiving has been developed which addresses this problem and improves the overall confidence in the long-term archive. The new archive concept separates data storage with an expiration date at the scientific project level from the documented long-term archive. The transition to the new archive concept has already started; at the end we expect to have a completely documented long-term archive with a searchable data catalogue. This archive concept is supported by a four-level storage hierarchy which reflects the lifetimes of the different data categories.
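The cost dynamic behind this can be sketched with a few lines of arithmetic. The growth and price figures below are illustrative assumptions, not values from the talk: if the volume of data produced each year grows faster than the price per stored byte falls, the money needed for newly archived data rises year over year.

```python
# Illustrative sketch (hypothetical numbers): archive costs rise when data
# production outpaces the decline in storage prices.
data_growth = 1.6      # assumed: data produced per year grows 60% per year
price_decline = 0.7    # assumed: cost per stored TB falls 30% per year

annual_cost = 100.0    # arbitrary cost units for new archive capacity in year 0
for year in range(1, 6):
    # cost scales with (more data) x (cheaper storage) each year
    annual_cost *= data_growth * price_decline
    print(f"year {year}: relative archiving cost {annual_cost:.1f}")
```

With these assumed rates the net factor is 1.6 × 0.7 = 1.12 per year, i.e. roughly 12% annual cost growth despite falling media prices, which is exactly the squeeze the new archive concept is meant to relieve.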


Digital Long-term Preservation of Linguistic Resources at the MPI for Psycholinguistics (Paul Trilsbeek, MPI-PL)

The MPI for Psycholinguistics started developing a digital archive for linguistic resources about 10 years ago. An essential part of the infrastructure is the IMDI metadata set, which was developed together with a substantial number of linguists to reflect the needs of the linguistic community. These metadata descriptions are stored in individual XML files and contain links to the described resources. During the last couple of years, a framework has been developed that consists of many components and tools that can be classified into three groups: preparation tools that help researchers with their analysis work, organization tools that allow the linguist to write metadata descriptions and to upload and organize data into a central archive, and utilization tools that make it possible to use the archived content in various ways via the web.
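The arrangement of one XML metadata file per described unit, with embedded links to the resources, can be illustrated with a small sketch. The element names below are simplified placeholders, not the actual IMDI schema; the point is only the pattern of harvesting resource links out of a metadata description.

```python
# Hedged sketch: extracting resource links from an IMDI-style metadata file.
# Element names are simplified stand-ins, not the real IMDI schema.
import xml.etree.ElementTree as ET

example = """<Session>
  <Name>demo-recording</Name>
  <Resources>
    <MediaFile><ResourceLink>media/demo.wav</ResourceLink></MediaFile>
    <WrittenResource><ResourceLink>annot/demo.eaf</ResourceLink></WrittenResource>
  </Resources>
</Session>"""

root = ET.fromstring(example)
# collect every link this metadata description points at
links = [el.text for el in root.iter("ResourceLink")]
print(links)
```

A crawler that walks such files can follow the links to build a complete inventory of the archive, which is what makes tools operating on the central archive possible.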

A number of measures are taken to increase the chance of long-term survival of the archived material:

  • the creation of at least 7 copies of each resource in various locations in the Netherlands and Germany
  • migration to the latest storage technology every four to five years
  • the establishment of a grid of regional archives, each of which contains a part of the centrally archived data
  • minimizing the number of file formats and encodings and making use of open standards where possible
  • the use of URIDs for each archived resource

In addition, the Max-Planck-Gesellschaft has guaranteed bit-stream preservation for 50 years of the copies of our resources that are stored at the data centers in Göttingen and Garching.
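A minimal sketch of how bit-level integrity across such replicated copies might be verified (an illustrative example, not the institute's or the data centers' actual tooling):

```python
# Hedged sketch: checking that replicated copies of an archived resource are
# still bit-identical, in the spirit of keeping several copies at different
# sites and verifying them against each other.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicas_consistent(paths: list[Path]) -> bool:
    """True if all replica files hash to the same digest."""
    digests = {sha256_of(p) for p in paths}
    return len(digests) == 1
```

Comparing fixed-size digests rather than raw bytes keeps the check cheap enough to run regularly across sites, and a mismatch pinpoints which copy needs to be restored from the others.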