Lamus Content Structures
Introduction
This document describes the basic data model underlying the LAMUS archiving software and the other tools in the LAT suite that operate on a LAMUS/LAT-based archive. LAMUS/LAT provides repository functions (deposit and access) for language resources.
Short Definitions
IMDI: ISLE Metadata Initiative. A metadata set and encoding used to describe various types of LRs, with an emphasis on multimodal, multimedia LRs. The IMDI set is encoded as XML, as specified in the IMDI schema. Element values that are not free text are constrained either by the schema or by a link to a controlled vocabulary specification. IMDI allows specializations of the schema in the form of IMDI profiles: instantiations of the IMDI schema with added key/value pairs that describe domain- or project-specific information.
LR: Language Resource. Any resource used by linguists to study linguistic phenomena or used to develop language technology. LRs cover a wide range of types, such as text corpora, audio and video recordings, images, lexica and ontologies.
LAMUS: Language Archive Management and Upload System. A suite of web applications for uploading and managing resources and IMDI metadata descriptions in a LAMUS-managed LR archive.
The IMDI archive model
The resources in an archive are modeled in a hierarchical structure of corpora and sub-corpora, and eventually all corpora belong to the archive's top corpus or super corpus. Some resources are more tightly related to each other than to others. Such sets of tightly related resources are called sessions, and the resources in a session share most of their descriptive metadata. The session tries to capture the fact that the contained resources all pertain to the same linguistic action or event. The success of modelling LR collections with corpus and session descriptions varies with the linguistic (sub)discipline and the purpose of the repository.
The hierarchical structure of corpora and sub-corpora is not a tree. Some corpora may have several parent corpora, just as some sessions may belong to different corpora and resources may be part of more than one session. The archive is assumed to be dynamic: new material is added and the content of the corpora changes continuously. The most stable elements are the sessions. Metadata is therefore attached at the session level, while only general descriptions are attached at the corpus level.
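Because a corpus, session or resource may have more than one parent, the structure is a directed graph rather than a tree. The following is a minimal sketch of that idea in Python. The class `Archive`, its methods and all node names are hypothetical and serve only as illustration; they are not part of LAMUS or IMDI.

```python
from collections import defaultdict


class Archive:
    """Toy in-memory model: nodes are names, edges point from parent to child."""

    def __init__(self, top_corpus):
        self.top_corpus = top_corpus
        self.children = defaultdict(set)  # parent -> set of children
        self.parents = defaultdict(set)   # child -> set of parents

    def link(self, parent, child):
        """Record that `child` (corpus, session or resource) belongs to `parent`."""
        self.children[parent].add(child)
        self.parents[child].add(parent)

    def descendants(self, node):
        """Return every node reachable from `node`, visiting shared nodes once."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for child in self.children.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


archive = Archive("top")
archive.link("top", "corpus_a")
archive.link("top", "corpus_b")
archive.link("corpus_a", "session_1")
archive.link("corpus_b", "session_1")      # one session, two parent corpora
archive.link("session_1", "recording.wav")
```

Here `archive.parents["session_1"]` holds both `corpus_a` and `corpus_b`, which a strict tree could not express. The traversal keeps a `seen` set so that a shared session is visited only once.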
To implement such a model, an IMDI archive uses IMDI XML files to model the corpora, sub-corpora and sessions and the relations between them. The basic implementation of the model is at the file level for two reasons:
- Interpreting XML files and following URI links requires only very simple technology that can be implemented with limited resources. This allows managed repositories to be created on PCs and notebooks.
- XML files are considered an archivable format. It can be expected that all information can be recovered by studying the content of the files.
The IMDI File data model
Fundamentally, an archive based on IMDI files relies on relations between resources and IMDI files. IMDI files are XML files that come in two varieties, corpus files and session files, both of which conform to the IMDI schema. A corpus file can contain several descriptions and any number of links to other corpus files or session files. A link is at least a URI but may also have a PID. An IMDI session file contains links to resources and can contain several descriptions. Most importantly, it contains the IMDI metadata that pertains to the resources it links to. In the session file, too, every link is at least a URI but may also contain a PID. In addition to containing descriptions, IMDI files may link to information files, which contain human-readable information about the corpus or session. The IMDI metadata contained in the session file can also refer to information files from different levels.
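As an illustration of how little technology is needed to read such links, here is a minimal sketch using only the Python standard library. The sample XML is a simplified, hypothetical corpus file: the element names (`METATRANSCRIPT`, `Corpus`, `CorpusLink`) and the use of an `ArchiveHandle` attribute on each link are assumptions made for this example, and the XML namespace declared by the real schema is omitted. The `Link` class and the `read_links` function are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    uri: str                    # always present
    pid: Optional[str] = None   # present only if the link carries a PID


# Simplified, hypothetical corpus file (assumed element and attribute names).
CORPUS_XML = """
<METATRANSCRIPT Type="CORPUS">
  <Corpus>
    <Name>Example corpus</Name>
    <Description>A first description.</Description>
    <Description>A second description.</Description>
    <CorpusLink Name="sub" ArchiveHandle="hdl:0000/example-1">sub/sub.imdi</CorpusLink>
    <CorpusLink Name="s1">sessions/s1.imdi</CorpusLink>
  </Corpus>
</METATRANSCRIPT>
"""


def read_links(xml_text: str) -> List[Link]:
    """Collect every link in the file as a (URI, optional PID) pair."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for element in root.iter("CorpusLink"):
        uri = (element.text or "").strip()
        if not uri:
            continue  # a link without a URI is not a valid link in this model
        links.append(Link(uri=uri, pid=element.get("ArchiveHandle")))
    return links


links = read_links(CORPUS_XML)
```

The first link in the sample carries both a URI and a PID, while the second has only a URI, which matches the rule that a link is at least a URI.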
The IMDI metadata schema
The IMDI metadata schema specifies the format of both IMDI corpus files and IMDI session files. The IMDI metadata set is structured but remains flexible: extra key/value pairs can be added to different parts of the metadata. The schema is located at http://www.mpi.nl/IMDI/Schema/IMDI_3.0.xsd.
The links in the IMDI files
Links are embedded in IMDI files to refer from one IMDI file to another or to a resource. Every link implements a unidirectional association. A link contains at least a URI referring to the other file, but it may be accompanied by a PID (persistent identifier). If the URI and the PID are inconsistent, the PID has priority. As a result, the links to sub-corpora and resources in the archive remain intact in downloaded IMDI files. For IMDI files inside the archive, however, URIs and PIDs must be consistent.
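The priority rule can be sketched as follows. This is an illustration only: the functions `resolve_link` and `is_consistent`, the lookup table standing in for a PID resolver, and all URIs and handles are hypothetical. Falling back to the URI when a PID cannot be resolved is an assumption of this sketch; the text above does not specify that case.

```python
from typing import Callable, Optional

# Hypothetical stand-in for a PID resolution service.
PID_TABLE = {
    "hdl:0000/example-1": "https://archive.example.org/corpus/sub.imdi",
}


def lookup(pid: str) -> Optional[str]:
    """Return the location registered for `pid`, or None if it is unknown."""
    return PID_TABLE.get(pid)


def resolve_link(uri: str, pid: Optional[str],
                 resolve_pid: Callable[[str], Optional[str]]) -> str:
    """Return the location to follow, giving the PID priority over the URI."""
    if pid is not None:
        resolved = resolve_pid(pid)
        if resolved is not None:
            return resolved
    # No PID, or (assumption) an unresolvable PID: use the URI.
    return uri


def is_consistent(uri: str, pid: Optional[str],
                  resolve_pid: Callable[[str], Optional[str]]) -> bool:
    """Check the in-archive requirement that URI and PID agree."""
    if pid is None:
        return True
    return resolve_pid(pid) == uri


# A downloaded file whose URI was rewritten to a local path still resolves
# to the archive copy, because the PID wins.
target = resolve_link("file:///home/user/sub.imdi", "hdl:0000/example-1", lookup)
```

In this example `target` is the archive location from the lookup table, not the local path, and `is_consistent` reports the mismatch that would not be allowed for a file stored inside the archive.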
Every IMDI file identifies itself via a self-reference in the form of an ArchiveHandle attribute that has the file's own PID as its value.
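Reading the self-reference takes only a few lines with the Python standard library. The sketch below assumes that the ArchiveHandle attribute sits on the root element of the file. The sample document, its element names and the handle value are hypothetical, and the function `self_handle` is not part of any LAMUS tool.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Simplified, hypothetical session file; the handle value is made up.
SESSION_XML = """
<METATRANSCRIPT Type="SESSION" ArchiveHandle="hdl:0000/example-2">
  <Session>
    <Name>s1</Name>
  </Session>
</METATRANSCRIPT>
"""


def self_handle(xml_text: str) -> Optional[str]:
    """Return the file's own PID, or None if it has no ArchiveHandle.

    Assumes the attribute is placed on the root element.
    """
    root = ET.fromstring(xml_text.strip())
    return root.get("ArchiveHandle")


handle = self_handle(SESSION_XML)
```

A file that has not yet been assigned a PID simply yields `None` in this sketch.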
LAMUS API
The LAMUS API is used internally by the MPI-PL for communication between its backend systems. We consider it a starting point for cross-system communication. A short description of the API follows. The main API interface is lamusAPI, which declares a set of atomic operations on the nodes of the corpus tree. These operations are:
- addCorpusTree
- addCorpusNode
- removeHandles
- addSessionNode
- writeIMDIValue
- deleteNode
- addResourceNode
- replaceNode
- getNodeInfo
- readIMDIValue
There are three more classes involved in the API:
- lamusAPIClient
- lamusAPIHelper
- lamusAPIImpl
The lamusAPIClient is an XML-RPC client for LamusAPI. It is used to provide POJO SOAP access to LamusAPI through Axis2 running in a separate web application. The lamusAPIHelper contains helper methods for lams.api.lamusAPI implementations. It is limited to database activity and low-level file operations; it performs no IMDI file editing.