Agenda Meeting 16.10.2008

Project and current status

The XML Workflow project started in September 2008. The project has two main parts:

Defining a working process (workflow) for the production of XML texts, and documenting this process so that it can be reused:
- Digitization of source material, e.g. manuscripts
- Transcription of the text in the manuscripts
- Markup of parts of the XML texts
- Enrichment of the XML texts

Developing tools and infrastructure to:
- enable access to documents, and linking between documents and between internal parts of documents
- build functions for searching, indexing and retrieval of relevant results

The main motivation is to standardize the working processes and develop a Center of competence that provides guidelines for transcription of texts.

The project is currently in the initial design phase.
One of the main goals is to use repository functionality for all artefacts and so make reuse easier, for example importing the texts that need to be worked on into a tool installed on a local system, and easily submitting the modified texts back to the repository.

Introduction into:

Presentations:

Functions that need to be provided are basically:

- as the core system for searching within the XML documents, the project team is considering the eXist database, Lucene or Oracle 11g
- two types of queries need to be supported:
  - structural queries of XML documents (in particular trees and subsets of trees)
  - full-text searching (with integrated language technology)
- support for different languages and scripts, such as Latin, Greek, Chinese, European languages and Sanskrit
- Digilib is to be enabled as a service for viewing in-line images such as figures and diagrams
  - mature, robust tools are needed for working with images
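
The difference between the two query types listed above can be sketched in a few lines. This is only an illustration using Python's standard library, not one of the candidate systems (eXist, Lucene, Oracle 11g); the sample document, its element names and the two helper functions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """<text>
  <chapter n="1">
    <p>The motion of bodies is treated first.</p>
    <figure src="fig1.jpg"/>
  </chapter>
  <chapter n="2">
    <p>On the equilibrium of the lever.</p>
  </chapter>
</text>"""

root = ET.fromstring(SAMPLE)


def structural_query(root, path):
    """Return the elements matching a path expression (a subset of the tree)."""
    return root.findall(path)


def full_text_query(root, word):
    """Return the elements whose own text contains the word, ignoring case."""
    needle = word.lower()
    return [el for el in root.iter() if el.text and needle in el.text.lower()]


# Structural: chapters that have a <figure> as a direct child.
chapters_with_figures = structural_query(root, ".//chapter[figure]")

# Full-text: elements mentioning "lever", wherever they sit in the tree.
lever_hits = full_text_query(root, "lever")
```

The structural query depends only on the shape of the tree, while the full-text query depends only on the words, which is why the notes treat them as separate requirements.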

The functionality MPI WG intends to build is searching for a word by its citation form
- for European languages, stemming is usually sufficient
- for other languages (e.g. Sanskrit, Chinese), words need to be indexed, e.g. woman as ΓΥΝΑΙΚΑ (*gh'inEka*)
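
A minimal sketch of searching by citation form, assuming a lookup table from inflected forms to citation forms is available. The table `LEMMAS`, the Latin sample sentences and all function names are hypothetical; a real system would obtain the mapping from a dictionary or a morphological analyser rather than a hard-coded table.

```python
from collections import defaultdict

# Hypothetical lemma table: inflected form -> citation form.
LEMMAS = {
    "femina": "femina", "feminae": "femina", "feminam": "femina",
    "liber": "liber", "libri": "liber", "librum": "liber",
}


def lemmatize(token):
    """Map a token to its citation form; unknown tokens map to themselves."""
    token = token.lower()
    return LEMMAS.get(token, token)


def build_index(documents):
    """Build an inverted index: citation form -> ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[lemmatize(token.strip(".,;:"))].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of documents containing any form of the word."""
    return sorted(index.get(lemmatize(word), set()))


documents = {
    "doc1": "Feminam vidi.",
    "doc2": "Librum feminae dedi.",
    "doc3": "Libri boni sunt.",
}
index = build_index(documents)
```

Because both the indexed tokens and the search term pass through `lemmatize`, a search for `femina` also finds documents that contain only `feminam` or `feminae`.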

NLP (Natural Language Processing)
- provide a generic interface to dictionaries
- define standards, or check existing ones

Acting as XML middleware
- an XML document is re-indexed each time it is changed
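
The re-index-on-change behaviour can be sketched as follows. This is an assumption about how such middleware might behave, not a description of the project's implementation; the class `IndexingStore` and its methods are hypothetical and keep everything in memory.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


class IndexingStore:
    """Hypothetical in-memory store that re-indexes a document on every save."""

    def __init__(self):
        self.documents = {}            # doc_id -> XML source
        self.index = defaultdict(set)  # word -> ids of documents containing it

    def save(self, doc_id, xml_source):
        # Parse first, so a malformed document leaves the old index untouched.
        text = " ".join(ET.fromstring(xml_source).itertext())
        # Drop stale entries so words removed from the document stop matching.
        self._unindex(doc_id)
        self.documents[doc_id] = xml_source
        for word in text.lower().split():
            self.index[word].add(doc_id)

    def _unindex(self, doc_id):
        for ids in self.index.values():
            ids.discard(doc_id)

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))


store = IndexingStore()
store.save("doc1", "<p>first draft</p>")
store.save("doc1", "<p>second draft</p>")  # the change triggers re-indexing
```

After the second save, a search for `first` no longer returns `doc1`, while `second` and `draft` do.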

At a minimum, a core set of repository services is needed to easily create objects and relate these objects to one another.
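
As a rough illustration of what "create objects and relate these objects" could mean in practice, here is a minimal in-memory sketch. The `Repository` class, the object kinds and the relation name `transcribes` are all hypothetical; an actual repository service would persist its data and expose it over a network interface.

```python
import itertools


class Repository:
    """Hypothetical in-memory repository holding objects and their relations."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.objects = {}       # object id -> metadata
        self.relations = set()  # (subject id, predicate, target id)

    def create(self, kind, **metadata):
        """Create an object of the given kind and return its new id."""
        obj_id = f"{kind}:{next(self._ids)}"
        self.objects[obj_id] = {"kind": kind, **metadata}
        return obj_id

    def relate(self, subject, predicate, target):
        """Record a relation between two existing objects."""
        for obj_id in (subject, target):
            if obj_id not in self.objects:
                raise KeyError(f"unknown object: {obj_id}")
        self.relations.add((subject, predicate, target))

    def related(self, subject, predicate):
        """Return the sorted ids of objects related to subject by predicate."""
        return sorted(
            t for s, p, t in self.relations if s == subject and p == predicate
        )


repo = Repository()
scan = repo.create("image", title="Folio 1r scan")
transcription = repo.create("text", title="Folio 1r transcription")
repo.relate(transcription, "transcribes", scan)
```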