Revision as of 12:22, 5 November 2008

Agenda Meeting 16.10.2008

The XML Workflow project started September 2008. There are two main parts of the project:

Defining a working process (workflow) for production of XML texts (and documenting this process so that it can be reused)
- Digitization of e.g. manuscripts
- Transcription of the text presented on the manuscripts
- Markup of parts of the XML texts
- Enrichment of the XML texts

Developing tools and infrastructure to
- enable access to documents and linking between documents and between internal parts of documents
- build functions for searching, indexing and retrieval of relevant results

The main motivation is to standardize the working processes and develop a Center of competence that provides guidelines for transcription of texts.

Currently the project is in the initial design phase.
One of the main goals is to use repository functionality for all artefacts and to enable easier reuse, for example importing the texts to be worked on into a tool installed on a local system, and easily submitting the modified texts back to the repository.

Introduction to:

eSciDoc service infrastructure
eSciDoc data model
support for versioning
support for persistent identification
support for authorization
currently developed solutions
VIRR Solution (R1, possible relation to XML Workflow project)
- see first Demo release

Presentations:

Functions that need to be provided are basically:

- as the core system for searching within the XML documents, the project team is considering the eXist database, Lucene or Oracle 11g
- two types of queries need to be supported:
  - structural queries of XML documents (in particular trees, subset of trees)
  - Full-text searching (integrated language technology)
  - support for different languages/scripts such as: Latin, Greek, Chinese, European languages, Sanskrit
Digilib - to be enabled as a service for viewing in-line images such as figures, diagrams
- need to have the possibility to use quite mature/robust tools for working with images
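
The two query types listed above (structural and full-text) can be contrasted with a minimal sketch. It uses only Python's standard library rather than eXist, Lucene or Oracle 11g, and the sample document, element names and function names are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
SAMPLE = """<text>
  <chapter n="1"><head>De motu</head><p>Corpus omne perseverare in statu suo.</p></chapter>
  <chapter n="2"><head>De viribus</head><p>Vis impressa est actio in corpus.</p></chapter>
</text>"""


def structural_query(root, path):
    """Structural query: select a subset of the tree by path and attribute."""
    return root.findall(path)


def full_text_query(root, word):
    """Full-text query: return (tag, text) for elements whose own text contains the word."""
    wanted = word.casefold()
    hits = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        tokens = [t.casefold() for t in re.findall(r"\w+", text)]
        if wanted in tokens:
            hits.append((elem.tag, text))
    return hits


root = ET.fromstring(SAMPLE)
heads = structural_query(root, "./chapter[@n='2']/head")
matches = full_text_query(root, "corpus")
```

The structural query depends on where an element sits in the tree, while the full-text query ignores structure and matches on tokens; a production system would delegate both to the chosen search backend.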

MPI WG intends to build functionality for searching by a word in its citation form
- for EU languages stemming is usually fine
- for other languages (e.g. Sanskrit, Chinese) words need to be indexed by their citation form, e.g. woman as ΓΥΝΑΙΚΑ (*gh'inEka*)
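
Searching by citation form can be sketched as an index keyed by lemma instead of by surface form. The lookup table below is a hypothetical stand-in for a real morphological dictionary, and all function names are assumptions; the notes do not specify an implementation.

```python
import unicodedata
from collections import defaultdict


def normalize(word):
    """Strip accents and fold case so that ΓΥΝΑΙΚΑ and γυναίκα compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Hypothetical table: inflected form -> citation form.
# A real system would query a dictionary service instead.
_RAW_LEMMAS = {
    "γυναίκες": "γυναίκα",
    "γυναίκας": "γυναίκα",
    "γυναικών": "γυναίκα",
}
LEMMAS = {normalize(form): normalize(lemma) for form, lemma in _RAW_LEMMAS.items()}


def lemma_of(word):
    """Map a word to its citation form, falling back to the normalized word."""
    key = normalize(word)
    return LEMMAS.get(key, key)


def build_index(docs):
    """Build an inverted index from citation form to the ids of matching documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemma_of(token)].add(doc_id)
    return index


def search(index, word):
    """Look up documents by the citation form of the query word."""
    return index.get(lemma_of(word), set())


docs = {"a": "οι γυναίκες", "b": "της γυναίκας", "c": "ο άνδρας"}
index = build_index(docs)
found = search(index, "ΓΥΝΑΙΚΑ")
```

Keys and queries pass through the same `normalize` function, so differences in accents, case and final sigma do not affect matching.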

NLP (Natural Language Processing)
- enable generic interface to dictionaries
- define/check on standards

Acting as XML Middleware
- an XML document is re-indexed each time it is changed
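
One possible way to realize re-indexing on change is to compare a content digest on every save and call the indexer only when the digest differs. This is a sketch under assumptions: the class name, the `save` method and the indexer callback are hypothetical and are not part of eSciDoc or of the planned middleware.

```python
import hashlib


class ReindexingStore:
    """Keeps XML documents and triggers the indexer whenever content changes."""

    def __init__(self, indexer):
        self._indexer = indexer  # callable(doc_id, xml_text), supplied by the caller
        self._digests = {}
        self._docs = {}

    def save(self, doc_id, xml_text):
        """Store the document; return True if it changed and was re-indexed."""
        digest = hashlib.sha256(xml_text.encode("utf-8")).hexdigest()
        if self._digests.get(doc_id) == digest:
            return False  # unchanged content, index is still current
        self._docs[doc_id] = xml_text
        self._digests[doc_id] = digest
        self._indexer(doc_id, xml_text)
        return True


indexed = []
store = ReindexingStore(lambda doc_id, xml_text: indexed.append(doc_id))
first = store.save("d1", "<text>alpha</text>")
repeat = store.save("d1", "<text>alpha</text>")
changed = store.save("d1", "<text>beta</text>")
```

Skipping unchanged saves avoids redundant indexing work while still guaranteeing that every modification reaches the index.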

As a minimum, there is a need for a core set of repository services to easily create objects and relate these objects to each other

eSciDoc already provides several of the functionalities stated within the high-level requirements:

a repository functionality that supports versioning, persistent identification and a fairly flexible modeling schema for content resources
possibility to define different workflows and schemas for different projects (via the usage of "Context" resource)
possibility to define fine-granular access level to resources and parts of resources
resources are managed via Resource Handlers exposed as web services
authorization is done for each service operation on the resource (as eSciDoc core services are stateless)
services are provided as SOAP or REST interfaces
The possibility to define Containers, Members and TOC is of great advantage for the XML Workflows
all metadata and selected formats of full-texts are searchable
Digilib is integrated with the item handler interface, and Digilib parameters can be appended to the (file) content retrieval request
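
Appending Digilib parameters to a content retrieval request might look like the sketch below. The base URL is a placeholder and not a documented eSciDoc endpoint, and `dw` and `dh` (target width and height) are assumed here to be the relevant Digilib scaling parameters; check both against the actual service documentation.

```python
from urllib.parse import urlencode


def build_content_url(content_url, digilib_params):
    """Append Digilib parameters as a query string to a content retrieval URL."""
    query = urlencode(digilib_params)
    return f"{content_url}?{query}" if query else content_url


# Placeholder URL; the real path depends on the eSciDoc item handler.
url = build_content_url(
    "https://repository.example.org/item/ITEM_ID/content",
    {"dw": 400, "dh": 300},
)
```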
