Talk:MPDL Project XML Workflow


Agenda Meeting 16.10.2008

Project and current status

The XML Workflow project started in September 2008. The project has two main parts:

  • Defining a working process (workflow) for the production of XML texts (and documenting this process so that it can be reused)
    • Digitization of e.g. manuscripts
    • Transcription of the text presented in the manuscripts
    • Markup of parts of the XML texts
    • Enrichment of the XML texts
  • Developing tools and infrastructure to
    • enable access to documents and linking between documents and internal parts of documents
    • provide functions for searching, indexing and retrieval of relevant results

The main motivation is to standardize the working processes and develop a Center of competence that provides guidelines for transcription of texts.

  • Currently the project is in the initial design phase
  • One of the main goals is to use repository functionality for all artefacts and enable easier reuse, for example importing the texts that need to be worked on into a tool installed on a local system, and easily submitting the modified texts back to the repository

Introduction and status overview of the eSciDoc project

Introduction to:

  • eSciDoc service infrastructure
  • eSciDoc data model
  • support for versioning
  • support for persistent identification
  • support for authorization
  • currently developed solutions


Presentations:

High-Level Requirements (functional, technical) for the XML Workflow project

Repository

Functions that need to be provided are basically:

    • persistent storage for resources that come from various projects
    • versioning of resources
    • persistent identification of resources
    • possibility to access arbitrary portions of XML documents
    • transformation of XML documents to XHTML for presentation purposes
    • enrichment of data such as:
      • links to language specific functionality
      • links to sources (available on the web)
    • expose resources publicly as soon as possible for further processing
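Two of the requirements above, accessing an arbitrary portion of an XML document and transforming it to XHTML for presentation, can be illustrated with a minimal sketch. It uses only the Python standard library; the sample document, the element names (`div`, `head`, `p`) and the helper functions `select_portion` and `to_xhtml` are assumptions made for illustration and are not part of the project or of eSciDoc.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; the element names are illustrative only.
SAMPLE = """<text>
  <div id="ch1"><head>Chapter 1</head><p>First paragraph.</p></div>
  <div id="ch2"><head>Chapter 2</head><p>Second paragraph.</p></div>
</text>"""


def select_portion(root, div_id):
    """Return the <div> subtree with the given id, or None if it is absent."""
    return root.find(f".//div[@id='{div_id}']")


def to_xhtml(div):
    """Map one <div> subtree to a small XHTML fragment for presentation."""
    section = ET.Element("div", {"class": "section"})
    for child in div:
        # <head> becomes a heading, every other child becomes a paragraph.
        tag = "h2" if child.tag == "head" else "p"
        out = ET.SubElement(section, tag)
        out.text = child.text
    return ET.tostring(section, encoding="unicode")


root = ET.fromstring(SAMPLE)
fragment = to_xhtml(select_portion(root, "ch2"))
print(fragment)
```

A production setup would more likely use XSLT stylesheets for the transformation; the hand-written mapping here only shows the idea of addressing a subtree and rendering it separately from the rest of the document.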

Searching functionality (general)

    • as the core system for searching within the XML documents, the project team is considering eXist database, Lucene or Oracle 11g
    • two types of queries need to be supported:
      • structural queries of XML documents (in particular subtrees, sets of subtrees)
      • full-text searching (integrated language technology)
    • support for different languages/scripts such as: Latin, Greek, Chinese, European languages, Sanskrit
  • Digilib - to be enabled as a service for viewing in-line images such as figures and diagrams
    • mature and robust tools are needed for working with images
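The difference between the two query types listed above (structural queries and full-text searching) can be sketched as follows. This is a toy illustration with the Python standard library, not a statement about how eXist, Lucene or Oracle 11g work; the sample document and the function names `structural_query` and `full_text_query` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative.
SAMPLE = """<text>
  <div type="chapter" n="1">
    <p>De motu corporum in fluidis.</p>
    <figure n="1"/>
  </div>
  <div type="chapter" n="2">
    <p>De viribus centripetis.</p>
  </div>
</text>"""


def structural_query(root, path):
    """Return the list of subtrees matched by an ElementTree path expression."""
    return root.findall(path)


def full_text_query(root, term, unit="div"):
    """Return every <unit> subtree whose concatenated text contains the term."""
    term = term.lower()
    return [el for el in root.iter(unit)
            if term in "".join(el.itertext()).lower()]


root = ET.fromstring(SAMPLE)

# Structural: all <div> elements that have a <figure> child.
chapters_with_figures = structural_query(root, ".//div[figure]")

# Full-text: all <div> elements whose text mentions "centripetis".
hits = full_text_query(root, "centripetis")

print([d.get("n") for d in chapters_with_figures])
print([d.get("n") for d in hits])
```

The structural query depends only on the shape of the tree, while the full-text query depends only on the character data. A real system would answer both from an index rather than by scanning the document.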

Lemmatized search (more precise)

  • The functionality MPIWG intends to build is to enable searching by a word in its citation form
    • for modern Western European languages stemming may be acceptable
    • for other languages (e.g. Sanskrit, Latin) words need to be indexed by lemma (e.g. "tulit" → "fero")
  • NLP (Natural Language Processing)
    • enable generic interface to dictionaries
    • define/check on standards
  • Acting as XML middleware
    • an XML document is re-indexed each time it is changed
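The idea of indexing words by lemma, and of re-indexing a document whenever it changes, can be sketched with a small inverted index. The lemma table below is a hard-coded stand-in for a real dictionary or morphology service, and the class `LemmaIndex` and function `lemma_of` are hypothetical names used only for this example.

```python
import re
from collections import defaultdict

# Hypothetical lemma table standing in for a dictionary service.
LEMMAS = {"tulit": "fero", "ferunt": "fero", "latum": "fero", "amat": "amo"}


def lemma_of(word):
    """Return the citation form of a word, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


class LemmaIndex:
    """Inverted index from lemma to the ids of the documents containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, text):
        """(Re-)index a document; call this every time the document changes."""
        # Drop postings of the previous version before adding the new ones.
        for doc_ids in self._postings.values():
            doc_ids.discard(doc_id)
        for word in re.findall(r"\w+", text):
            self._postings[lemma_of(word)].add(doc_id)

    def search(self, citation_form):
        """Return the sorted ids of documents containing any form of the lemma."""
        return sorted(self._postings.get(citation_form.lower(), set()))


idx = LemmaIndex()
idx.index("doc1", "Caesar legiones tulit")
idx.index("doc2", "Puella rosam amat")
before = idx.search("fero")   # doc1 matches via the inflected form "tulit"

idx.index("doc1", "Caesar venit")  # the document changed, so it is re-indexed
after = idx.search("fero")
print(before, after)
```

Searching for "fero" finds the document containing "tulit" even though the two strings share no common stem, which is why stemming alone is not sufficient for languages such as Latin.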

Relations

  • as a minimum, a core set of repository services is needed to easily create objects and relate these objects to each other

Relation between the two projects and possibility for reuse

Overview of supported repository functionalities

  • eSciDoc already provides several functionalities stated within the high-level requirements:
  • a repository functionality that supports versioning, persistent identification and quite free modeling schema for content resources
  • possibility to define different workflows and schemas for different projects (via the usage of "Context" resource)
  • possibility to define fine-grained access levels for resources and parts of resources
  • resources are managed via Resource Handlers exposed as web services
  • authorization is done for each service operation on the resource (as eSciDoc core services are stateless)
  • services are provided as SOAP or REST interfaces
  • the possibility to define Containers, Members and TOC is of great advantage for the XML Workflows
  • all metadata and selected formats of full-texts are searchable
  • Digilib is integrated with the item handler interface, and Digilib parameters can be appended to the (file) content retrieval request

Discussion

  • The TOC is an interesting feature for the XML Workflow project as well
  • The next release of the VIRR solution may also partly serve the needs of the XML Workflow project
  • The XML documents do not use TEI as a standard; they partly depart from it
  • It is recommended to stick to (general or community) standards as much as possible, as this adds substantial value to the outcome of this project (and to editing XML documents in general)
  • As the XML Workflow project also considered using Fedora, it is recommended to use the eSciDoc service infrastructure, which extends the functionality of Fedora, especially in terms of the content model and encapsulation of the complexity of the object structure
    • an image available as thumbnail, web-resolution or original-sized image can be encapsulated in eSciDoc as a single resource with three components - even though these are 4 different Fedora objects
    • in this case the versioning is done on both component level and resource level
    • in this case the PID can be associated at both resource level and component level
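The encapsulation described in the image example above can be pictured with a small data-structure sketch. It is not the eSciDoc data model or API: the classes `Resource` and `Component`, the PID strings and the rule that a component update also creates a new resource version are assumptions chosen to illustrate versioning and identification at both levels.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One rendition of the image (hypothetical model, not the eSciDoc schema)."""
    name: str
    pid: str
    version: int = 1


@dataclass
class Resource:
    """A single resource that encapsulates several components."""
    pid: str
    version: int = 1
    components: dict = field(default_factory=dict)

    def add_component(self, name, pid):
        self.components[name] = Component(name, pid)

    def update_component(self, name):
        # Assumption: changing a component versions the component and the resource.
        self.components[name].version += 1
        self.version += 1

    def object_count(self):
        # One underlying object for the resource plus one per component.
        return 1 + len(self.components)


image = Resource(pid="pid:image-1")  # all PIDs are made-up placeholders
for name in ("thumbnail", "web-resolution", "original"):
    image.add_component(name, pid=f"pid:image-1/{name}")

image.update_component("thumbnail")
print(image.object_count(), image.version,
      image.components["thumbnail"].version,
      image.components["original"].version)
```

The client works with one resource, while the count of four underlying objects (one resource plus three components) stays an internal detail.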

Development

  • The first stable 1.0 release of the eSciDoc infrastructure will be available by the end of 2008
  • MPDL will inform the XML Workflow project team when it is available
  • The XML Workflow project team can use the DEMO version of the eSciDoc repository for first tests
  • For further development, to be independent of MPDL development efforts on other solutions, the XML Workflow project team needs to set up its own development installation
  • The XML Workflow project team can use the productive installation of the eSciDoc repository for real data and work, i.e. after enablement of particular solutions
  • Once the XML Workflow project team has established its own development environment, we need to further discuss the process for feedback and for including possible new features into the core services and MPDL-developed solutions
  • The XML Workflow project team is welcome to use Colab (where this page is also written) to share ideas with a bigger community.

Other related projects

  • MPDL Solution VIRR (R1, possible relation to the XML Workflow project)
  • MPDL Project INTER (please contact Marc Snijders, MPI PL, for further information on the demo he made with Lexica)
  • DFG Viewer (available at DFG Viewer)
    • to check with the DFG Viewer people whether it can be extended with parameters to use the Digilib service (will be checked by Robert Casties)