Talk:MPDL Project XML Workflow

Revision as of 01:59, 10 December 2008 by Hyman (talk | contribs) (→‎Lemmatized search (more precise): added arrow)

Agenda Meeting 16.10.2008

Project and current status

The XML Workflow project started in September 2008. It has two main parts:

  • Defining a working process (workflow) for production of XML texts (and documenting this process so that it can be reused)
    • Digitization of e.g. manuscripts
    • Transcription of the text in the manuscripts
    • Markup of parts of the XML texts
    • Enrichment of the XML texts
  • Developing tools and infrastructure to
    • provide access to documents and support linking between documents and between internal parts of documents
    • provide functions for searching, indexing and retrieval of relevant results

The main motivation is to standardize the working processes and to establish a center of competence that provides guidelines for the transcription of texts.

  • Currently the project is in the initial design phase
  • One of the main goals is to use repository functionality for all artefacts and to make reuse easier, for example by importing the texts that need to be worked on into a tool installed on a local system and by easily submitting the modified texts back to the repository

Introduction and status overview of the eSciDoc project

Introduction to:

  • eSciDoc service infrastructure
  • eSciDoc data model
  • support for versioning
  • support for persistent identification
  • support for authorization
  • currently developed solutions


Presentations:

High-Level Requirements (functional, technical) for XML Workflow project

Repository

The repository needs to provide the following functions:

    • persistent storage for resources that come from various projects
    • versioning of resources
    • persistent identification of resources
    • possibility to access arbitrary portions of XML Documents
    • transformation of XML documents to XHTML for presentation purposes
    • enrichment of data such as:
      • links to language specific functionality
      • links to sources (available on the web)
    • expose resources publicly as soon as possible for further processing

Searching functionality (general)

    • the project team is considering the following as the core system for searching within XML documents: eXist database, Lucene or Oracle 11g
    • two types of queries need to be supported:
      • structural queries of XML documents (in particular subtrees, sets of subtrees)
      • full-text searching (integrated language technology)
    • support for different languages/scripts such as: Latin, Greek, Chinese, European languages, Sanskrit
  • Digilib is to be enabled as a service for viewing inline images such as figures and diagrams
    • mature, robust tools are needed for working with images
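
The two query types listed above can be illustrated with a minimal sketch. It uses only Python's standard xml.etree.ElementTree module rather than any of the candidate systems (eXist, Lucene, Oracle 11g). The sample document, its element names and both helper functions are assumptions made for illustration and do not come from the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative only.
SAMPLE = """<text>
  <chapter n="1"><p>Arma virumque cano</p></chapter>
  <chapter n="2"><p>Gallia est omnis divisa</p><p>in partes tres</p></chapter>
</text>"""


def structural_query(root, path):
    """Return the set of subtrees matched by an ElementTree path expression."""
    return root.findall(path)


def full_text_query(root, word):
    """Return elements whose own text contains the word (case-insensitive)."""
    needle = word.lower()
    return [
        el for el in root.iter()
        if el.text and needle in el.text.lower().split()
    ]


root = ET.fromstring(SAMPLE)

# Structural query: all paragraphs of the chapter numbered 2.
paragraphs = structural_query(root, "./chapter[@n='2']/p")

# Full-text query: every element whose text contains the word "omnis".
hits = full_text_query(root, "omnis")
```

ElementTree supports only a subset of XPath, so a production system would more likely rely on the query language of the chosen core system.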

Lemmatized search (more precise)

  • MPIWG intends to build functionality for searching by a word in its citation form
    • for modern Western European languages stemming may be acceptable
    • for other languages (e.g. Sanskrit, Latin) words need to be indexed by lemma (e.g. "tulit" → "fero")
  • NLP (Natural Language Processing)
    • provide a generic interface to dictionaries
    • identify and check relevant standards
  • Acting as XML Middleware
    • an XML document is re-indexed each time it is changed

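
Indexing by lemma and re-indexing on change, as described above, can be sketched as follows. This is a minimal illustration and not the MPIWG implementation: the small form-to-lemma table stands in for a real dictionary service, and the names LEMMAS, lemma_of and LemmaIndex are hypothetical.

```python
from collections import defaultdict

# Hypothetical form-to-lemma table; a real system would query a
# morphological dictionary service instead of a hard-coded dict.
LEMMAS = {"tulit": "fero", "ferunt": "fero", "latum": "fero", "canit": "cano"}


def lemma_of(word):
    """Map a word form to its lemma, falling back to the lowercased form."""
    form = word.lower()
    return LEMMAS.get(form, form)


class LemmaIndex:
    """Inverted index from lemma to the ids of documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)  # lemma -> set of document ids
        self._docs = {}                    # document id -> set of lemmas

    def index(self, doc_id, text):
        """(Re-)index a document; call this every time the document changes."""
        # Drop the postings left over from the previous version, if any.
        for lemma in self._docs.pop(doc_id, set()):
            self._postings[lemma].discard(doc_id)
        lemmas = {lemma_of(token) for token in text.split()}
        self._docs[doc_id] = lemmas
        for lemma in lemmas:
            self._postings[lemma].add(doc_id)

    def search(self, word):
        """Find documents containing any form of the given word's lemma."""
        return sorted(self._postings.get(lemma_of(word), set()))


idx = LemmaIndex()
idx.index("doc1", "Caesar auxilium tulit")
idx.index("doc2", "poeta canit")
found_by_lemma = idx.search("fero")    # query given in citation form
found_by_form = idx.search("ferunt")   # query given as another inflected form

idx.index("doc1", "Caesar venit")      # doc1 changed, so it is re-indexed
found_after_change = idx.search("fero")
```

Tokenization here is a plain whitespace split; punctuation handling and script-specific segmentation (for example for Chinese) are left out of the sketch.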
Relations

  • at a minimum, a core set of repository services is needed to easily create objects and relate them to each other

Relation between the two projects and possibility for reuse

Overview of supported repository functionalities

  • eSciDoc already provides several functionalities stated within the high-level requirements:
  • a repository functionality that supports versioning, persistent identification and a fairly flexible modeling schema for content resources
  • possibility to define different workflows and schemas for different projects (via the usage of the "Context" resource)
  • possibility to define fine-grained access levels for resources and parts of resources
  • resources are managed via Resource Handlers exposed as web services
  • authorization is done for each service operation on the resource (as eSciDoc core services are stateless)
  • services are provided as SOAP or REST interfaces
  • the possibility to define Containers, Members and TOC is a great advantage for the XML Workflows
  • all metadata and selected formats of full-texts are searchable
  • Digilib is integrated with the item handler interface, and Digilib parameters can be appended to the (file) content retrieval request

Discussion

  • The TOC is also an interesting feature for the XML Workflow project
  • The next release of the VIRR solution may also partly serve the needs of the XML Workflow project
  • The XML documents do not fully follow the TEI standard; they depart from it in part
  • To give the outcome of this project (and the editing of XML documents in general) substantial value, it is recommended to stick to (general or community) standards as much as possible
  • As the XML Workflow project also considered using Fedora, it is recommended to use the eSciDoc service infrastructure, which extends the functionality of Fedora, especially in terms of the content model and the encapsulation of the complexity of object structure
    • an image available as a thumbnail, web-resolution and original-sized image can be encapsulated in eSciDoc as a single resource with three components, even though these are 4 different Fedora objects
    • in this case versioning is done at both component level and resource level
    • in this case a PID can be associated at both resource level and component level
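
The image example above can be made concrete with a small sketch. It is not the eSciDoc data model or API: the Resource and Component classes, their fields and the rule that a component update also bumps the resource version are assumptions chosen only to illustrate versioning and PIDs at both levels. The final count likewise assumes one underlying object for the resource plus one per component.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Component:
    """One rendition of the content (hypothetical model, not eSciDoc's)."""
    name: str
    content: bytes
    version: int = 1
    pid: Optional[str] = None  # persistent identifier at component level


@dataclass
class Resource:
    """A single resource that encapsulates several components."""
    title: str
    components: Dict[str, Component] = field(default_factory=dict)
    version: int = 1
    pid: Optional[str] = None  # persistent identifier at resource level

    def add_component(self, name, content):
        self.components[name] = Component(name, content)

    def update_component(self, name, content):
        # Assumption: changing a component creates a new version of both
        # the component and the resource that contains it.
        component = self.components[name]
        component.content = content
        component.version += 1
        self.version += 1


image = Resource(title="Manuscript page 1")
for rendition in ("thumbnail", "web-resolution", "original"):
    image.add_component(rendition, b"image bytes")

image.update_component("thumbnail", b"new image bytes")

# Assumption: one underlying object for the resource plus one per component.
underlying_objects = 1 + len(image.components)
```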

Development

  • The first stable 1.0 release of the eSciDoc infrastructure will be available by the end of 2008
  • MPDL will inform the XML Workflow project team of its availability
  • The XML Workflow project team can use the demo version of the eSciDoc repository for first tests
  • For further development, and to be independent of MPDL development efforts on other solutions, the XML Workflow project team needs to set up its own development installation
  • The XML Workflow project team can use the productive installation of the eSciDoc repository for real data and work, i.e. once particular solutions have been enabled
  • We need to further discuss the process for feedback and for including possible new features in the core services and MPDL-developed solutions once the XML Workflow project team has established its own development environment
  • The XML Workflow project team is welcome to use Colab (where this page is also written) to share ideas with a bigger community.

Other related projects

  • MPDL Solution VIRR (R1, possible relation to XML Workflow project)
  • MPDL Project INTER (please contact Marc Snijders, MPI PL, for further information on the demo he made with Lexica)
  • DFG Viewer (available at DFG Viewer)
    • ask the DFG Viewer team whether it can be extended with parameters for using the Digilib service (to be checked by Robert Casties)