EScience Seminar 2008/EScience-Seminar Aspects of long-term archiving

Goal
Building on experience acquired in recent years (GWDG and RZG offering services for bitstream preservation, growing awareness of the need for open archive formats), strategies for long-term archiving within the Max Planck Society will be developed. Furthermore, future service offerings and suggestions for file formats and metadata will be discussed. Organisational responsibilities for the lifecycle management of data (format migration, access strategies) within the Max Planck Society will be clarified.

Responsible for content
Dagmar Ullrich (GWDG)

Wolfgang Voges (MPDL)

Contributions
(Slides are uploaded once the respective final version has been submitted.)

General approach to digital Long-Term Preservation (dLTP)

 * Introduction, current situation and work done so far at the MPG (Dagmar Ullrich (GWDG), Wolfgang Voges (MPDL))
 * Dealing with Data: Roles, Rights, Responsibilities and Relationships, Liz Lyon, UKOLN, ([[media:ESci08_Sem_2_Dealing_with_Data-Roles_Rights_Responsibilities_and_Relationsships_Lyon.pdf|Slides, 11.1MB]])
 * LTP of digital publications in a memory institution -- a challenge in the triangle of technology, integration and cooperation, Reinhard Altenhöner, DNB, ([[media:ESci08_Sem_2_dLTP_of_digital_publications_in_a_memory_institution_Altenhoener.pdf|Slides, 3.8MB]])
 * Requirements of e-Science and Grid Projects towards dLTP of Research Data, Jens Klump, GFZ Potsdam, ([[media:ESci08_Sem_2_Requirements_of_eScience_and_Grid_Projects_towards_dLTP_Klump.pdf|Slides, 0.3MB]])

Technical aspects

 * Metadata for digital Long-Term Preservation, Michael Day (UKOLN)
 * Assessing file formats for dLTP, Caroline van Wijk (KB)
 * Persistent identifier for long-term archived data, Malte Dreyer (MPDL)

Organisational aspects

 * Rule-based Distributed Data Management, Reagan Moore, SDSC, ([[media:ESci08 Sem 2 Rule-based distributed data management Moore.pdf|Slides, 3MB]])
 * Standards and Standardization in the Context of eScience and dLTP, Peter Rödig, UniBwM ([[media:ESci08_Sem_2_Standards_and_Standardization_in_the_Context_of_eScience_and_dLTP_Roedig.pdf|Slides, 0.2MB]])
 * Trustworthy Digital Archives, Susanne Dobratz, RZ HU Berlin ([[media:ESci08_Sem_2_Trustworthy_digital_archives_Dobratz.pdf|Slides, 0.5MB]])
 * Overview of Sustainable Digital Preservation, Sayeed Choudhury, Blue Ribbon Task Force, ([[media:ESci08_Sem_2_Overview_on_Sustainable_Digital_Preservation_and_Access_Choudhury.pdf|Slides, 0.3MB]])
 * Calculating costs of dLTP, Neil Beagrie, Charles Beagrie Limited, ([[media:ESci08_Sem_2_Calculating_costs_of_dLTP_Beagrie.pdf‎|Slides, 0.4MB]])

Current practices

 * The role of dLTP in the eSciDoc project, Natasa Bulatovic, MPDL, ([[media:EScie08_Sem_2_Role_of_dLTP_in_the_eSciDoc_Project.pdf|Slides, 2.2MB]])
 * Digital Long-Term Archiving at GWDG and other Archiving Systems, Dagmar Ullrich, GWDG, ([[media:ESci08_Sem_2_LZAatGWDG_Ullrich.pdf|Slides, 0.8MB]])
 * Long-Term Archiving of Climate Model Data at WDC Climate and DKRZ, Michael Lautenschlager, MPI-M, ([[media:ESci08_Sem_2_Long-term_Archiving_of_Climate_Model_Data_Lautenschlager.pdf|Slides, 2.5MB]])
 * Digital Long-Term Preservation of linguistic resources at the MPI for Psycholinguistics, Paul Trilsbeek, Peter Wittenburg, MPIPL ([[media:ESci08_Sem_2_dLTP_of_Linguistic_Resources_at_MPIPL_Trilsbeek.pdf|Slides, 1.2MB]])

Future perspective for dLTP in the MPG, final discussion

 * Summary, Dagmar Ullrich (GWDG), Wolfgang Voges (MPDL)

LTP of digital publications in a memory institution -- a challenge in the triangle of technology, integration and cooperation (Reinhard Altenhöner, DNB)
One of the unresolved problems of the global information society is ensuring the long-term accessibility of digital documents. The challenges are considerable, especially for institutions that aim to keep information objects available for several hundred years. Not only technological but also organisational questions have to be answered, not least the question of how long-term preservation should be integrated into the life cycle of a digital information object. The example of kopal (Co-operative Development of a Long-Term Digital Information Archive), a publicly funded and successfully realised cooperative digital archive solution, shows what one possible technological solution looks like. The development of subsequent steps helps to clarify the specific challenges for libraries and cultural heritage organisations in terms of the underlying technology, the need for cooperation, and the integration of LTP into the life cycle of digital objects. http://www.kopal.langzeitarchivierung.de/index.php.en
 * Reinhard Altenhöner's slides: [[media:ESci08_Sem_2_dLTP_of_digital_publications_in_a_memory_institution_Altenhoener.pdf|PDF, 3.8MB]]

Requirements of e-Science and Grid Projects towards dLTP of Research Data (Jens Klump, GFZ Potsdam)
The enormous amounts of data from Grid projects and the complexity of data from e-science projects suggest that these new types of projects also have new requirements towards long-term archiving of data. On the other hand, Grid technology and semantic tools emerging from e-science might provide us with new methods that may be useful in long-term digital preservation.

The study "Requirements of e-science and Grid projects towards long-term archiving of scientific and scholarly data" investigates, from both a technological and a management perspective, whether existing infrastructures in data-producing e-science and Grid research communities meet the requirements of long-term digital preservation. It also investigates whether technologies and best practices from e-science and Grid projects can be transferred to organisations and systems in the field of long-term digital preservation. The interviews conducted as part of this study showed considerable differences between projects in the way they approached long-term digital preservation of data. Their achievements, but also their deficits, are analysed and discussed. The recommendations given in this study are derived from this analysis and from discussion with stakeholders in e-science and Grid projects.
 * Jens Klump's slides: [[media:ESci08_Sem_2_Requirements_of_eScience_and_Grid_Projects_towards_dLTP_Klump.pdf|PDF, 0.3MB]]

Standards and Standardization in the Context of eScience and dLTP (Peter Rödig, UniBwM)
The talk first introduces the basics of standards and standardization, then briefly presents OAIS as a means of organizing standards and illustrates the prior art of handling traditional digital objects in the dLTP community. The presentation then summarizes the specific characteristics of eScience and examines their impact on OAIS elements and related standards. The talk concludes with an assessment of the current situation and some suggestions for improving standardization efforts.
 * Peter Rödig's slides: [[media:ESci08_Sem_2_Standards_and_Standardization_in_the_Context_of_eScience_and_dLTP_Roedig.pdf|PDF, 0.2MB]]

Overview of Sustainable Digital Preservation (Sayeed Choudhury, Blue Ribbon Task Force)
Johns Hopkins University (JHU) has initiated a series of data curation activities that focus on data and publications as new compound objects. This work has been most thoroughly explored in the context of the Virtual Observatory. While much of this work has been technical in nature, the economic issues of sustainability are an equally important consideration. Choudhury, who leads the JHU Libraries' data curation efforts, is also a member of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF-SDPA). The BRTF-SDPA is funded by the US National Science Foundation and the Andrew W. Mellon Foundation, in partnership with the US Library of Congress, the UK Joint Information Systems Committee, the US Council on Library and Information Resources and the US National Archives and Records Administration. During the next two years, the BRTF-SDPA will explore the sustainability challenge with the goal of delivering specific recommendations that are economically viable and of use to a broad audience, from individuals to institutions and corporations to cultural heritage centers.

The Task Force will:
 * Conduct an analysis of previous and current models for sustainable digital preservation, and identify current best practices among existing collections, repositories and analogous enterprises.
 * Develop a set of economically viable recommendations to catalyze the development of reliable strategies for the preservation of digital information.
 * Provide a research agenda to organize and motivate future work in the specific area of economic sustainability of digital information.

Sayeed Choudhury will provide an update on the BRTF-SDPA's recent work, featuring a working definition of economic sustainability and highlights from the Task Force's first draft report.
 * Sayeed Choudhury's slides: [[media:ESci08_Sem_2_Overview_on_Sustainable_Digital_Preservation_and_Access_Choudhury.pdf|PDF, 0.3MB]]

Digital Long-Term Archiving at GWDG and other Archiving Systems (Dagmar Ullrich, GWDG)
This talk describes the current approach to digital long-term archiving at GWDG. It shows which technical solutions are already in use for bitstream preservation. It introduces the kopal system, which emerged from the kopal project "Co-operative Development of a Long-Term Digital Information Archive" and is hosted at GWDG. A short overview of other existing archiving systems is given.
 * Dagmar Ullrich's slides: [[media:ESci08_Sem_2_LZAatGWDG_Ullrich.pdf|PDF, 0.8MB]]

Long-Term Archiving of Climate Model Data at WDC Climate and DKRZ (Michael Lautenschlager, MPI-M)
The computing capabilities for producing Earth system model data are growing faster than the prices for mass storage media are falling. If the archive philosophy were left unchanged during the migration to the next generation of compute servers, the cost of long-term archiving would rise, and the total archiving budget would tend to exceed the money left for compute services. At WDCC (World Data Center Climate) and DKRZ (German Climate Computing Centre), a new concept for long-term archiving has been developed that addresses this problem and improves overall confidence in the long-term archive. The new archive concept separates data storage with an expiration date at the scientific project level from the documented long-term archive. The transition to the new archive concept has already started, and at its end we expect to have a completely documented long-term archive with a searchable data catalogue. The archive concept is supported by a four-level storage hierarchy that reflects the lifetimes of the different data categories.
 * Michael Lautenschlager's slides: [[media:ESci08_Sem_2_Long-term_Archiving_of_Climate_Model_Data_Lautenschlager.pdf|PDF, 2.5MB]]
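
The cost argument in the abstract above can be illustrated with a small calculation. This is only a sketch under stated assumptions: the growth rate, the price decline, the starting share and the idea of a fixed overall budget are hypothetical values chosen for illustration, not figures from the talk.

```python
def archive_budget_share(years, data_growth, price_decline, initial_share):
    """Return the archive's share of a fixed budget for year 0..years.

    Hypothetical model: the archived volume grows by `data_growth` per year
    while the price per stored unit falls by `price_decline` per year, so the
    archive cost is multiplied by (1 + data_growth) * (1 - price_decline).
    """
    shares = []
    share = initial_share
    for _ in range(years + 1):
        shares.append(share)
        share *= (1 + data_growth) * (1 - price_decline)
    return shares


# Assumed values: volume +60 % per year, storage price -30 % per year,
# archive starts at 20 % of the budget.
shares = archive_budget_share(years=5, data_growth=0.6, price_decline=0.3,
                              initial_share=0.2)
for year, share in enumerate(shares):
    print(f"year {year}: archive takes {share:.1%} of the budget")
```

Whenever the combined yearly factor is above 1, the archive's share keeps growing; with these assumed numbers the factor is 1.6 × 0.7 = 1.12.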

Digital Long-term Preservation of Linguistic Resources at the MPI for Psycholinguistics (Paul Trilsbeek, MPIPL)
The MPI for Psycholinguistics started developing a digital archive for linguistic resources about 10 years ago. An essential part of the infrastructure is the IMDI metadata set, which was developed together with a substantial number of linguists to reflect the needs of the linguistic community. These metadata descriptions are stored in individual XML files and contain links to the described resources. During the last couple of years, a framework has been developed that consists of many components and tools, which can be classified into three groups: preparation tools that help researchers with their analysis work, organization tools that allow linguists to write metadata descriptions and to upload and organize data in a central archive, and utilization tools that make the archived content usable in various ways via the web.

A number of measures are taken to increase the chance of long-term survival of the archived material:


 * the creation of at least 7 copies of each resource in various locations in the Netherlands and Germany
 * migration to the latest storage technology every four to five years
 * the establishment of a grid of regional archives, each of which contains a part of the centrally archived data
 * minimizing the number of file formats and encodings and making use of open standards where possible
 * the use of URIDs for each archived resource

In addition, the Max Planck Gesellschaft has guaranteed bit-stream preservation for 50 years for the copies of our resources that are stored at the data centers in Göttingen and Garching.
 * Paul Trilsbeek's slides: [[media:ESci08_Sem_2_dLTP_of_Linguistic_Resources_at_MPIPL_Trilsbeek.pdf|PDF, 1.2MB]]
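
Keeping several copies of each resource, as listed among the measures above, is only useful if damaged copies can be detected. The following is a minimal sketch of such a fixity check, assuming a SHA-256 checksum is recorded for each resource; the function names, file names and the choice of hash are illustrative assumptions and do not describe the archive's actual tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_copies(copies, expected_digest):
    """Return the paths of copies whose content no longer matches the digest."""
    return [str(path) for path in copies if sha256_of(path) != expected_digest]


# Demo with three throwaway "replicas"; one is deliberately corrupted.
with tempfile.TemporaryDirectory() as tmp:
    replicas = [Path(tmp) / f"site{i}" / "recording.wav" for i in range(3)]
    for replica in replicas:
        replica.parent.mkdir()
        replica.write_bytes(b"original bitstream")
    reference = sha256_of(replicas[0])
    replicas[2].write_bytes(b"original bitstreaX")  # simulated bit rot
    damaged = find_divergent_copies(replicas, reference)

print([Path(p).parent.name for p in damaged])  # prints ['site2']
```

A copy flagged this way could then be restored from one of the intact replicas.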