Trip Report: MPI-EVA Linguistic Texts

MPDL  Restricted Access

Visit to the MPI for Evolutionary Anthropology (MPI-EVA) on 20 February 2008

Participants:
 * Gisela Lausberg
 * Michael Cysouw
 * Tanja Friedrich
 * Andreas Gros

Introduction
The Department of Linguistics of the MPI-EVA currently stores about 2500 PDFs of linguistic texts on a local server inside the institute. Links to these files are integrated into the local OPAC. To make the texts more useful for research, Michael Cysouw, Gisela Lausberg, and colleagues would like to store the PDFs in a repository where they can be assigned PIDs and where full-text searches can be run across the whole collection rather than only within single PDFs.

Procedure
As the services provided by PubMan match the requirements of M. Cysouw and G. Lausberg, we offered to start by ingesting the collection into a test installation of PubMan/eSciDoc. The collection of linguistic texts is well described by MAB metadata (exported from LIBERO, the local OPAC system), which has to be mapped to eSciDoc metadata first. During the testing period, access to all files will have to be restricted to members of the MPI-EVA and the MPDL. After the testing period, PDFs whose authors agree to make them publicly available should become accessible to the general public.
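The mapping step could look roughly like the following. This is a minimal sketch only: the field numbers follow common MAB2 usage (100 for the first author, 331 for the main title, 425 for the year, 655 for the electronic location) and would need to be checked against the actual LIBERO export, and the target keys are generic Dublin Core-style placeholders, not the real eSciDoc metadata schema. The names `MAB_TO_TARGET` and `map_record` are hypothetical.

```python
# Hypothetical mapping from MAB field numbers to generic target keys.
# Assumption: field meanings follow common MAB2 usage; verify against
# the real LIBERO export before relying on them.
MAB_TO_TARGET = {
    "100": "creator",   # first author
    "331": "title",     # main title
    "425": "date",      # year of publication
    "655": "file_url",  # electronic location of the PDF
}


def map_record(mab_record):
    """Translate one MAB record (field number -> value) into target metadata.

    Fields without a mapping are returned separately so that nothing
    is dropped silently during the test ingest.
    """
    mapped, unmapped = {}, {}
    for field, value in mab_record.items():
        key = MAB_TO_TARGET.get(field)
        if key is None:
            unmapped[field] = value
        else:
            mapped[key] = value.strip()
    return mapped, unmapped


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = {"100": "Doe, Jane", "331": "Sample title", "999": "local note"}
    print(map_record(sample))
```

Returning the unmapped fields alongside the mapped ones makes it easy to review with the institute which parts of the MAB data still lack a target.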

Once the mapping and ingestion workflow has been tested, testing of the search functionality can begin. Full-text search of these PDFs might pose difficulties, as they vary in internal representation. In the future, PDFs created by the institute will be PDF/A, the standard the DFG suggests for long-term archiving. Whether PDF/A files are searchable in eSciDoc remains to be tested.

The collection should be held in a working (deployed) version of eSciDoc by spring 2009 at the latest.

Future considerations
In the long run, the scientists of MPI-EVA would like to be able to annotate and refer to distinct text passages, so that individual collections of text passages about certain language features (e.g. held in an Excel spreadsheet or any other text document) can link directly into the corresponding PDFs.
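As a rough illustration of such linking, the sketch below turns rows of a table (feature, PDF address, page) into links that ask the PDF viewer to open at a given page. It assumes the table has been exported as CSV with the hypothetical column names `feature`, `pdf_url`, and `page`, and it relies on the `#page=N` open parameter, which many but not all PDF viewers honour. A page-level link is coarser than the passage-level annotation described above.

```python
import csv
import io


def passage_link(pdf_url, page):
    """Build a link that asks the PDF viewer to open at a given page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{pdf_url}#page={page}"


def links_from_table(csv_text):
    """Read rows with the columns feature, pdf_url, page (assumed names)
    and return a list of (feature, link) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["feature"], passage_link(row["pdf_url"], int(row["page"])))
        for row in reader
    ]


if __name__ == "__main__":
    # Made-up example row; the URL is a placeholder, not a real PID.
    table = "feature,pdf_url,page\nergativity,https://example.org/a.pdf,3\n"
    for feature, link in links_from_table(table):
        print(feature, link)
```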

ToDo List

 * Map the MAB data to a metadata format eSciDoc/PubMan can ingest
 * Create a collection and ingest the PDFs into it
 * Test the full-text search
 * Review the results with the institute and iterate
 * Plan further steps (see Future considerations)

Andreas Gros 10:00, 27 February 2008 (CET)