DANS MIXED workshop

MPDL,Vlad

=About MIXED=

MIXED is a project of DANS (Data Archiving and Networked Services). It is partially funded by the Dutch Ministry for Education, Culture, Arts and Sciences.

MIXED aims to contribute to digital preservation by dealing with the problem of file formats. Over time, file formats become obsolete. When that happens, the information in files of those formats is no longer accessible. MIXED follows the strategy of converting files to XML as soon as possible, preferably when data is ingested into the archive. MIXED also converts these XML files to a format chosen by the archive user.
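The two conversion directions described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not MIXED code: the function names `ingest_to_xml` and `disseminate_as_csv` and the element names `table`, `row` and `cell` are assumptions made for this example, not part of SDFP.

```python
import csv
import io
import xml.etree.ElementTree as ET


def ingest_to_xml(header, rows):
    """Ingest step: store tabular data as XML (hypothetical element names)."""
    root = ET.Element("table")
    for row in rows:
        r = ET.SubElement(root, "row")
        for column, value in zip(header, row):
            cell = ET.SubElement(r, "cell", column=column)
            cell.text = str(value)
    return ET.tostring(root, encoding="unicode")


def disseminate_as_csv(xml_text):
    """Dissemination step: convert the archived XML to a user-chosen format (CSV here)."""
    root = ET.fromstring(xml_text)
    rows = root.findall("row")
    # Column names are taken from the first row; an empty table yields an empty header.
    header = [c.get("column") for c in rows[0].findall("cell")] if rows else []
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for r in rows:
        writer.writerow([c.text or "" for c in r.findall("cell")])
    return out.getvalue()


# The archive keeps only the XML; a CSV copy is produced on request.
archived = ingest_to_xml(["id", "name"], [(1, "Ada"), (2, "Grace")])
csv_copy = disseminate_as_csv(archived)
```

In this sketch the XML text is the only thing the archive needs to keep; any number of dissemination formats can be derived from it later without touching the original vendor file.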

MIXED will be implemented in the DANS workflow, but it is public software, to be published in Open Source software repositories.

MIXED is a contribution by DANS to practices that implement digital preservation. While the MIXED project is limited to tabular data and a limited number of file formats, it aims to collect the best practices in file format conversion. This expanding set of high quality file conversions will then be usable by the digital archiving community as a whole.

The intended results of MIXED consist of several parts: 1) at the heart are the conversion plug-ins that "know" obsolete file formats and convert the data in such formats into XML. This points to the second key result: 2) a set of XML schemas, brought together under one umbrella (SDFP = Standard Data Formats for Preservation), that represent the way the information in file formats will be encoded in XML. In order to use MIXED, a repository needs 3) a piece of adaptor software, so that the repository software can use MIXED at designated points in the repository workflow. We hope to contribute to 4) the propagation of MIXED as an implementation of smart migration, and as a means of practical digital preservation for repositories.


 * 1) Conversion plug-ins. Several binary file formats are now within the scope of MIXED conversion plug-ins. We used a DOS emulator and the original software packages together with some open source libraries. The result is 100% pure Java code, publicly usable to convert files in any of these formats into SDFP. The formats are:
 ** Microsoft Access, the pre-2007 versions;
 ** DataPerfect (the WordPerfect database companion);
 ** dBase, versions 3, 4 and 5, by the original vendor and in the Clipper and FoxPro implementations.
 * 2) Standard Data Formats for Preservation (SDFP). MIXED's policy is to connect to the most promising initiatives, and to bring the corresponding file formats under its umbrella.
 ** For spreadsheets, ODF (= Open Document Format, used by Open Office) has been chosen. OOXML may be adopted later, depending on how its user base grows.
 ** For databases there does not seem to be a canonical XML format (e.g. dbXML). For MIXED we need a single schema that fits all databases. We have found a schema in which you can express the data model of a database and its data in a generic way. The data model defines the tables, with their names, properties, fields, constraints, and relationships. We have chosen to include the declarative properties, but not the action items, such as triggers and forms.
 * 3) Adaptor software. We have just started specifying points in DANS's repository workflow where MIXED could step in. For the moment we will use MIXED by triggering it from a repository user interface, but ideally MIXED will be scheduled to perform tasks automatically when certain conditions are met and certain files are encountered. There is nothing in the MIXED software that prevents that. But MIXED implements a new preservation strategy, and we need to find the proper place for MIXED by experience. In order to facilitate the invocation of MIXED, we plan to write a piece of adaptor software that calls MIXED as a web service, feeding it with input files and collecting result files.
 * 4) Propagation of MIXED:
 ** We need to find scenarios where smart migration is really useful.
 ** Advice is needed on how to present SDFP to an audience of repository managers, archivists, researchers and developers.
 ** There are two dimensions of MIXED usage: more users and more usability. If we attract the attention of power users such as repositories with budgets for digital preservation measures, they are candidate plug-in writers for new file formats and new kinds of data. Conversely, repositories that use a growing MIXED can delegate an increasing portion of their file format conversions to MIXED. If this works out, MIXED will collect file format conversions like a snowball.
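
The generic database schema mentioned under SDFP (a data model with tables, fields, constraints and relationships, plus the data itself) might be sketched as follows. This is an illustration in Python, not the SDFP schema: the function `database_to_xml`, the input dictionaries, and all element and attribute names (`database`, `datamodel`, `table`, `field`, `relationship`, `data`, `rows`, `row`, `value`) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def database_to_xml(name, tables, relationships):
    """Serialise a data model and its rows into one generic XML tree.

    All element and attribute names are hypothetical, not the SDFP schema.
    """
    root = ET.Element("database", name=name)

    # Part 1: the data model (declarative properties only, no triggers or forms).
    model = ET.SubElement(root, "datamodel")
    for table in tables:
        t = ET.SubElement(model, "table", name=table["name"])
        for field in table["fields"]:
            ET.SubElement(t, "field", name=field["name"], type=field["type"],
                          constraint=field.get("constraint", ""))
    for source, target in relationships:
        ET.SubElement(model, "relationship", {"from": source, "to": target})

    # Part 2: the data, stored per table and labelled by field name.
    data = ET.SubElement(root, "data")
    for table in tables:
        rows = ET.SubElement(data, "rows", table=table["name"])
        for row in table["rows"]:
            r = ET.SubElement(rows, "row")
            for field, value in zip(table["fields"], row):
                cell = ET.SubElement(r, "value", field=field["name"])
                cell.text = str(value)
    return root


tables = [
    {"name": "person",
     "fields": [{"name": "id", "type": "integer", "constraint": "primary key"},
                {"name": "name", "type": "text"}],
     "rows": [(1, "Ada"), (2, "Grace")]},
    {"name": "visit",
     "fields": [{"name": "person_id", "type": "integer"},
                {"name": "year", "type": "integer"}],
     "rows": [(1, 2008)]},
]
relationships = [("visit.person_id", "person.id")]

doc = database_to_xml("example", tables, relationships)
xml_text = ET.tostring(doc, encoding="unicode")
```

The point of the sketch is that one fixed set of element names covers any database: only the content of the `datamodel` and `data` parts changes from one source database to the next.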

=Workshop discussion themes:=

digital preservation
MIXED represents a variant of the migration strategy for digital preservation. We call it “smart migration” because the method breaks loose from the wheel of eternal upgrades of vendor formats. Instead it converts material into an ‘enlightened’ format, where it may rest for ever. When the material is used, a dissemination copy of it may easily be converted, on the fly, to a current vendor format. Questions that arise are:
 * how do you evaluate “smart migration”, in the presence of plain migration and emulation?
 * migration to the XML data format works best for those aspects of content that can be expressed in XML in standard ways. Presentation and action are aspects for which few canonical formalisms exist. What impact does this have on the usability of smart migration?

research archives
A digital preservation measure such as smart migration must be implemented in a repository context, i.e. in an archive or repository with digital research data. Questions here are:
 * what is the best interface of MIXED in view of being used in a repository workflow?
 * where in the archival process (ingest, administration, management, dissemination) would you want to use the software?
 * what is needed to trust the quality of the MIXED conversions?
 * is MIXED a contribution to being/becoming a Trusted Digital Repository in the sense of TRAC, DRAMBORA or the Data Seal of Approval?

research infrastructures
The task of digital preservation is too daunting to be tackled by individual repositories. There are tasks that do not even make sense at that level, such as standardization of data formats and metadata. In order to exploit the benefits of digital data for scholarly purposes, we need research infrastructures. These are not only about physical data connections, but much more about mutual understandings about how to refer to data, how to shape metadata, how to deal with rights. The vision is that research data in individual repositories will become transparently referable, accessible, usable and reliable for any researcher with access to the infrastructure as a whole. Questions for MIXED are:
 * what is needed to deploy MIXED not in one repository, but on a network of repositories?
 * how can MIXED collect the best practices in the area of file format conversions, so that networks of repositories can employ these practices?
 * how should MIXED be governed? An international board? A foundation with contributing members?

open source software communities
The MIXED software will be published as open source software. That will guarantee the legal usability of the software for whoever needs it. But we want more: contributions of new conversions, improvements of existing conversions, better interfaces and integrations with repository systems, such as FEDORA. The main questions are:
 * how do we foster an open source MIXED community?
 * what platform do we choose? SourceForge, Apache?
 * who is likely to contribute, and in what way? Individuals? Repositories with budget for digital preservation measures? Software vendors? Government ministries?
 * will this help to keep the MIXED software sustainable over time?

These are open questions. We hope that in the process of pondering them, some scenarios will crop up by which we can put MIXED firmly on the road.

=Related links=

MIXED

 * Workshop report
 * Home page
 * The workshop page
 * Memos: [[Media:MIXED_Consultation_Workshop_Memo_01.pdf|1]], [[Media:MIXED_Consultation_Workshop_Memo_02.pdf|2]], [[Media:MIXED_Consultation_Workshop_Memo_03.pdf|3]]
 * MIXED white paper: [[Media:MIXED_white_paper.pdf|1]], 2
 * DANS DBF library

eSciDoc

 * eDoc->eSciDoc Migration Requirements from MPG institutes.
 * Transformation Service
 * Validation Service
 * eSciDoc metadata concept, XSDs: old, new
 * Application profiles