AWOB Project Proposal

MPDL, GAVO

 PLEASE NOTE: This document was only a draft and is now outdated as the final project proposal is available.

Summary [WV]
Within the astronomy community, the need for a user-centered community platform to work collaboratively with distributed research data has been identified. The project will provide a discipline-specific variant of a "scholarly workbench scenario" and is structured in three phases:

The first phase will focus on building a demonstrator community platform together with two project partners (MPI für extraterrestrische Physik, MPI für Astrophysik), which allows shared work during a complete scientific project workflow, from data acquisition through data analysis and scientific data interpretation to the publication of a scientific article.

The collaboration platform will be based on an advanced Wiki-System, which is aligned with the eSciDoc infrastructure for management and storage of the relevant project resources. Discipline-specific services and tools (such as catalogs, databases, as well as analysis tools for comparing, combining, viewing and manipulating externally stored research data) will be integrated.

The second phase will focus on active outreach to the astrophysical community within the MPG and other organisations, to gather necessary feedback on extensions, improvements and their priorities.

The third phase will focus on technical consolidation and extensions needed for the other partner institutes. In addition, technical aspects for the preparation of long-term data preservation and curation of discipline-specific research data will be considered.

Background [J1, GL]
Various developments in astronomy as a whole, and at the MPG in particular, have led to requirements on astronomers to publish their data online and in a form that facilitates their discovery and reuse by colleagues.

 * MPG mandates long-term archiving of data underlying scientific publications.
 * Astronomical journals offer the possibility, and will at some point possibly mandate, to refer to data sets from within the publication in a standardised manner. [TODO Refer to ADS, IVOIdentifiers]
 * The so-called Virtual Observatory (VO) is a world-wide effort that aims to define standards that prescribe how data should be published in a manner that facilitates their reuse and, one might argue, enables their interoperability.
 * Funding agencies increasingly demand that results of the funded projects are published using VO standards.

It is the experience from five years of work in the German Astrophysical Virtual Observatory (GAVO) that, though many scientists are interested in participating in these developments, they see it as a burden on their activities:
 * 1) They generally have little experience with or knowledge of the particular metadata that must be provided. Here the existence of a pre-existing metadata and data standard, FITS (Flexible Image Transport System), is an advantage, as it means that there is at least some superficial familiarity with the concept of metadata among astronomers. But the developments in the VO have led to additional, more specific standards that are not all easy to digest, let alone implement.
 * 2) It is in general not sufficient to provide only static access to files, even if they have been provided with appropriate metadata. The size of data sets is growing at such a rate that one cannot expect users to download all data just to filter out the interesting pieces after the fact. The data should in general be made accessible through services that allow users to select the parts that are potentially of interest on the server side. Implementing such services is as yet not part of the astronomy student's curriculum, and maintaining such services in a stable and robust manner requires skills that are not generally available outside of the larger data centers.
 * 3) The formats of the data that are to be returned by these services are in general different from the data formats used by the scientists themselves. Although FITS is a widely accepted standard in astronomy, it is not identical to what is required. And in general the main effort is not in adapting to a container, but in properly setting its contents, both data and metadata.
 * 4) Gathering the required data together is, especially in the large geographically distributed collaborations that are so common today, a major effort in its own right. Unless some pre-arranged standards are adhered to, the person(s) in charge of the publication effort will in general have to go through multiple translation tasks, one for each of the providers. Moreover, communicating which data sets are required is an exhausting effort.

We believe that much of this problem arises because data publication is, literally, an afterthought. Only at the end of the project, after the observations have been made, analysed and published, do most astronomers (want to) think about this aspect of the project. Since it does not seem to be of direct use to them (after all, data publication is meant to benefit the community at large), the motivation is lacking. By then, knowledge about details of the data processing has been lost or poorly recorded, and the history of data products is hard to trace back. In larger projects many of the participants may have moved elsewhere after finishing their PhD, or to a different post-doctoral position.

The current proposal aims to lift this burden from the astronomers by providing a platform that allows the sharing of data online from the very beginning of a project. It offers a collaborative environment, equipped with special tools that allow for associating (semi-)standardised metadata as well as free form descriptions to uploaded data sets. It supports tracking of history and versioning of the data products. It provides tools facilitating their discovery, retrieval as well as online visualisation and analysis. It finally supplies tools for the eventual publication as services to the VO and as properly annotated and persistently identified data sets to the MPDL long-term storage facilities.

Our proposal is centered around the observation that many of the issues that arise when publishing data to the community at large, according to standard (meta)data formats and protocols, already arise in all earlier phases of collaborative projects. Allowing the participants to explicitly address these issues early in the project means that the collaboration benefits as well. Data sets can be easily found and understood by all collaborators, without the need for ad hoc emailing about the availability of new data products somewhere on an FTP site or personal web page. Once all data are in a well managed environment, the step to publication to the community as a whole is relatively minor.

Needs [J1, GL]
To support collaborations between geographically distributed astronomers, this project will build AWOB, a web-based application platform that allows users to share their resources (this includes documents and data) in an organised fashion. We have identified a number of specific needs that will be addressed in the project. These originate from conversations with astronomers, from existing attempts to organise resources online in collaborative astronomical projects and from experience with Virtual Observatory projects aimed at publishing astronomical results online.

Project and user management
The organisation of AWOB follows the organisation of scientific projects (see the attached images). An AWOB user can create a new Project, of which (s)he becomes the owner. AWOB keeps resources private by default: as long as users have not explicitly published their resources, these are not publicly available. The Project is the "authorisation unit" that organises collaborators, web pages, data and documents within a single, shared workspace. The owner can assign other users to this project with specific roles, which govern capabilities for reading, viewing, writing and uploading of resources. The project structure is hierarchical: sub-projects may be defined with their own internal organisation and user groups. Individual resources as well as projects as a whole can go through various stages, ending in one or more publications to journals, which will be accompanied by making (part of) the project's resources public.
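The project and role model described above can be illustrated with a minimal sketch. It is not the AWOB design: the role names (`owner`, `editor`, `reader`), the capability names and the rule that a sub-project falls back to its parent's role assignments are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Hypothetical roles and capabilities (assumed, not taken from the proposal).
ROLE_CAPABILITIES: Dict[str, Set[str]] = {
    "owner": {"read", "write", "upload", "assign"},
    "editor": {"read", "write", "upload"},
    "reader": {"read"},
}


@dataclass
class Project:
    name: str
    owner: str
    parent: Optional["Project"] = None
    published: bool = False
    roles: Dict[str, str] = field(default_factory=dict)  # user -> role

    def __post_init__(self) -> None:
        # The user who creates a project becomes its owner.
        self.roles[self.owner] = "owner"

    def role_of(self, user: str) -> Optional[str]:
        # Assumption: a sub-project uses its own role assignments first
        # and falls back to the parent project otherwise.
        if user in self.roles:
            return self.roles[user]
        return self.parent.role_of(user) if self.parent else None

    def can(self, user: str, capability: str) -> bool:
        # Resources stay private until the project is explicitly published.
        if capability == "read" and self.published:
            return True
        role = self.role_of(user)
        return role is not None and capability in ROLE_CAPABILITIES[role]

    def assign(self, actor: str, user: str, role: str) -> None:
        # Only users holding the "assign" capability may hand out roles.
        if not self.can(actor, "assign"):
            raise PermissionError(f"{actor} may not assign roles in {self.name}")
        if role not in ROLE_CAPABILITIES:
            raise ValueError(f"unknown role: {role}")
        self.roles[user] = role
```

For example, after `survey = Project("survey", owner="alice")` and `survey.assign("alice", "bob", "reader")`, the call `survey.can("bob", "upload")` returns `False`, while an unregistered visitor can read the project only once `survey.published` is set to `True`.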

Resource management
Within an AWOB project, text can be written, discussions can be held and files can be uploaded. To support text authoring, AWOB offers users a Wiki-like functionality. Users can create web pages and fill these with text and images, for example to give human-readable documentation of newly uploaded results to their collaborators, or for the final publication of the project to the community. These pages fall under the same project management as the other resources. File upload is likewise familiar from wikis, but AWOB adds many functions to it. We will provide functions to discover the data products belonging to the project in a uniform manner, rather than having them scattered over different pages.

Metadata management
AWOB offers specialised management, search and visualisation functionalities for uploaded data products. This functionality depends on appropriate metadata, and AWOB offers tools to associate these with files during upload or at a later time. Some data formats contain metadata in the files, but in all cases extra metadata can be associated using appropriate tools that AWOB will supply. What kind of metadata can be assigned depends on the format of the data and on their contents. VO-specific metadata can be assigned, for example, to images, spectra or source catalogues. If so desired, the metadata can be stored in a database, which will allow flexible query capabilities to discover interesting files. Such capabilities are also at the heart of the various IVOA query protocols, which can therefore be implemented easily.
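As an illustration of metadata that is already contained in a file, the sketch below reads keyword/value pairs from a FITS-style header, which consists of 80-character cards stored in 2880-byte blocks and terminated by an END card. It is a simplified reader written for this example, not an AWOB component: the function names and the synthetic demo header are assumptions, and details such as continuation cards and header extensions are ignored. A real service would more likely rely on an established FITS library.

```python
from typing import Dict, Union

CARD_LEN = 80     # each FITS header card is 80 ASCII characters
BLOCK_LEN = 2880  # headers are padded to a multiple of 2880 bytes

Value = Union[str, int, float, bool]


def parse_value(raw: str) -> Value:
    """Convert the value field of a header card to a Python value."""
    raw = raw.strip()
    if raw.startswith("'"):
        # String value: runs to the closing quote; '' is an escaped quote.
        end = 1
        while end < len(raw):
            if raw[end] == "'":
                if end + 1 < len(raw) and raw[end + 1] == "'":
                    end += 2
                    continue
                break
            end += 1
        return raw[1:end].replace("''", "'").rstrip()
    raw = raw.split("/", 1)[0].strip()  # drop the optional trailing comment
    if raw in ("T", "F"):
        return raw == "T"
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw.replace("D", "E"))  # accept Fortran-style exponents
    except ValueError:
        return raw


def read_primary_header(data: bytes) -> Dict[str, Value]:
    """Collect keyword/value pairs up to the END card."""
    header: Dict[str, Value] = {}
    for start in range(0, len(data), CARD_LEN):
        card = data[start:start + CARD_LEN].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        # Only cards with "= " in columns 9-10 carry a value;
        # COMMENT, HISTORY and blank cards are skipped.
        if card[8:10] == "= ":
            header[keyword] = parse_value(card[10:])
    return header


def make_card(keyword: str, value: str) -> str:
    """Hypothetical helper that formats one card for the demo header."""
    return f"{keyword:<8}= {value}".ljust(CARD_LEN)


def build_demo_header() -> bytes:
    """Synthetic, incomplete header used only to exercise the reader."""
    cards = [
        make_card("SIMPLE", "T".rjust(20)),
        make_card("BITPIX", "16".rjust(20) + " / bits per data value"),
        make_card("OBJECT", "'M 31    '"),
        make_card("EXPTIME", "1.2E3"),
        "COMMENT synthetic header for illustration".ljust(CARD_LEN),
        "END".ljust(CARD_LEN),
    ]
    raw = "".join(cards).encode("ascii")
    padded = -(-len(raw) // BLOCK_LEN) * BLOCK_LEN  # round up to a full block
    return raw.ljust(padded, b" ")
```

Calling `read_primary_header(build_demo_header())` yields a dictionary with the keys `SIMPLE`, `BITPIX`, `OBJECT` and `EXPTIME`, which could then be stored alongside the uploaded file and complemented with user-supplied metadata.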

Relational database usage
Apart from images and spectra, many results of astronomical research come in the form of tables. AWOB will allow users to upload such data into a relational database and make it available for advanced querying. It will provide a form-based management interface through which, for example, tables can be created. This will be combined with adding metadata according to the Virtual Observatory Table Access Protocol [TBD reference]. We will write components that can extract the tabular data from uploaded files in a variety of formats, including standard comma/tab/blank-separated values, but also the common FITS binary table format and the VO's VOTable XML format. The latter two formats include the table metadata in the file, so the step of creating a table through a form can be skipped. These relational database resources will be managed within the context of the Project, i.e. they will be available only to specific users, but can also be published to the community together with the rest of the project.

To make use of the database, AWOB will offer various modes of querying the data. These will include a simple interface for SQL querying, but also ways to define forms that allow parametrised queries. In both cases the TAP query protocol will be supported.
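The upload-and-query workflow described above can be sketched with Python's standard `csv` and `sqlite3` modules. This is only an illustration under stated assumptions: SQLite stands in for whatever database AWOB would use, the table layout, the function names and the sample rows are invented, and nothing here implements TAP itself.

```python
import csv
import io
import sqlite3
from typing import List, Tuple, Union

# Invented sample catalogue (not real measurements).
SAMPLE_CSV = """name,ra,dec,mag
src_a,10.5,41.25,3.5
src_b,23.5,30.5,5.5
src_c,11.75,-25.25,8.0
"""


def quote_ident(name: str) -> str:
    """Accept only simple identifiers, since names cannot be bound as parameters."""
    if not name.isidentifier():
        raise ValueError(f"unsupported table or column name: {name!r}")
    return f'"{name}"'


def convert(cell: str) -> Union[float, str]:
    """Store numeric cells as numbers so that comparisons behave numerically."""
    try:
        return float(cell)
    except ValueError:
        return cell


def load_csv_table(conn: sqlite3.Connection, table: str, text: str) -> int:
    """Create a table from the CSV header row and insert all data rows."""
    reader = csv.reader(io.StringIO(text))
    columns = [quote_ident(col.strip()) for col in next(reader)]
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({', '.join(columns)})")
    rows = [tuple(convert(cell) for cell in row) for row in reader if row]
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})", rows
    )
    return len(rows)


def bright_sources(
    conn: sqlite3.Connection, table: str, max_mag: float
) -> List[Tuple[str, float]]:
    """Parametrised query, as a form with a single 'max_mag' field might issue it."""
    sql = (
        f"SELECT name, mag FROM {quote_ident(table)} "
        "WHERE mag < :max_mag ORDER BY mag"
    )
    return conn.execute(sql, {"max_mag": max_mag}).fetchall()
```

With `conn = sqlite3.connect(":memory:")` and `load_csv_table(conn, "sources", SAMPLE_CSV)`, the call `bright_sources(conn, "sources", 6.0)` returns the two rows with `mag` below 6. The value entered by the user is passed as a bound parameter instead of being pasted into the SQL string, which is the main design point of form-based queries.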

Portal to external services
Since the advent of the internet, many online services have been published that are of use to astronomers. The recent development of the VO has substantially increased the rate at which archive access and other services become available. AWOB will provide a centralised portal to organise and facilitate access to such external services. Here special care will be taken to enable, where possible, interoperability between the uploaded data products and external services. For example, unified visualisation tools will combine images from external archives with the ones inside the project, and project source catalogues can be cross-correlated with external databases.
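Cross-correlating a project source catalogue with an external one usually means a positional match on the sky. The sketch below shows one simple way to do this: it computes the angular separation with the haversine formula and keeps, for each project source, the nearest external source within a search radius. The function names, the tuple layout and the brute-force loop are assumptions chosen for brevity; large catalogues would need a spatial index or a database-side match instead.

```python
import math
from typing import Dict, List, Optional, Tuple

# (identifier, right ascension in degrees, declination in degrees)
Source = Tuple[str, float, float]


def angular_separation(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation in degrees, computed with the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(
    project: List[Source], external: List[Source], radius_arcsec: float
) -> Dict[str, Optional[str]]:
    """Map each project source to its nearest external source within the radius."""
    radius_deg = radius_arcsec / 3600.0
    matches: Dict[str, Optional[str]] = {}
    for name, ra, dec in project:
        best_id: Optional[str] = None
        best_sep = radius_deg
        for ext_id, ext_ra, ext_dec in external:
            sep = angular_separation(ra, dec, ext_ra, ext_dec)
            if sep <= best_sep:
                best_id, best_sep = ext_id, sep
        matches[name] = best_id  # None when nothing lies within the radius
    return matches
```

The cost grows with the product of the two catalogue sizes, so this form is only suitable for small project catalogues matched against a pre-selected region of the external database.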

Publication
AWOB will support the publication of data to the Virtual Observatory and to MPG long-term archives. The web pages and resources of a complete Project can be made publicly available using a "1-click" mechanism that preserves all interactive services, apart from modifying the Project, of course.

System requirements
Apart from all these specialised services, some core functions will be supported. These include logging, version management and history tracking, and backups. Naturally, authentication will be supported as well, which requires registration of users and role assignment. Registered users will receive their own workspace where data can be loaded. This includes a private database where results of queries may be stored.

Preparatory and related work [J1, GL]
GAVO has assisted astronomers in publishing data products and services through online web applications. Here special attention was paid to implementing IVOA standard protocols and data models, but custom services were also created for cases where no such standards were (yet) available. Notable examples are services to publish image and source catalogues resulting from the ROSAT X-ray telescope and spectra from optical (CDFS, zCosmos) and X-ray (2XMM) observatories. For these, special tools were written to extract metadata from the data products in a manner that greatly facilitates their mapping to the standard data models from the IVOA. GAVO has been especially active in promoting theory to the VO. Notable here is the Millennium Database, a web application providing online SQL query facilities for a relational database containing results from the Millennium Simulation. Due to the online publication of the data, that simulation is arguably the one most extensively studied, with more than 200 scientific publications using its results, most of them not by members of the collaboration that produced it. GAVO has moreover been very active in the definition of standards in the IVOA, most notably the Registry, VOTable, and TAP standards, whilst leading the effort on the Simulation Database (SimDB) standard for publishing simulations.

The MPDL provides, with eSciDoc, a scalable, Fedora-based infrastructure which allows the storage and management of various resources. In the course of the development of three solutions for specific artefacts (PubMan, VIRR, Faces), we gained experience in the definition of discipline-specific content models, needed to describe resources in a standardized, platform-independent manner for long-term archiving. The publication process of research data, i.e. the necessary workflows, review processes and assignment of persistent identifiers (PIDs), developed for images, publications and digitized resources, will be extended to support the specifics of the publication process of astronomical resources (such as projects, databases, files, services). The application services developed so far will be extended and improved, with special focus on the integration of external services available in the astronomical community. In addition, the authentication and authorisation mechanisms needed to allow collaborative scenarios between humans, services and databases will be a main benefit for the overall eSciDoc infrastructure.

Wider context/Re-use for others [WV]

 * Collaborative platform for a virtual organisation
 * Assistance to the refereeing process
 * Showcase for added value of implemented standards
 * Integration of standards, stable infrastructure and Web 2.0 technologies to facilitate dynamic and collaborative environments (cf. eSciDoc?)
 * Re-use for astronomy community within MPG
   * MPI für Astronomie (CPTS)
   * MPI für Gravitationsphysik (CPTS)
   * MPI für Kernphysik (CPTS)
   * MPI für Physik (CPTS)
   * MPI für Radioastronomie (CPTS)
   * MPI für Sonnensystemforschung (CPTS)
 * Re-use for astronomy community outside MPG
   * could be made public through GAVO nationally (about 35 astronomy institutions within Germany) and internationally via IVOA
 * Re-use (after some modifications/adaptations) for other communities like geo-science, high-energy particle physics
 * Long-term storage of data sets used in a publication (PIDs, metadata, links to publications)(cf. deposit mandate?)
 * Open access to all results of scientific research online (cf. Berlin declaration?)
 * mandated by some funding agencies
 * IVOA dataset identifier in use by ADS (main portal for astronomers)
 * Integration of ADIP (Astronomical/Astrophysical Data Integration Platform)

Work description
Project duration
 * Overall: 36 months
 * Phase 1: 12 months
 * Phase 2: 3 months
 * Phase 3: 21 months

Partners
[I do not think that it's feasible to include them. We would need more time to align the proposal... Currently, we do not really focus on extension of LTA procedures by GWDG/RZG. In addition, we would need to add additional resources, and we already have quite decent costs ;-) --Ulla 08:52, 9 April 2009 (UTC)]
 * MPI für extraterrestrische Physik (Chemisch-Physikalisch-Technische Sektion, CPTS)
   * Responsible for proposal: Frank Haberl
 * MPI für Astrophysik (CPTS)
   * Responsible for proposal: Simon White(?), Marat Gilfanov (interested) (from Rashid Sunyaev's group)
 * MPDL
   * Responsible for proposal: Malte Dreyer
 * RZG and/or GWDG?

Phase 1 - Pilot phase
Aim: Create a demonstrator that focuses on the integration of XWiki and the eSciDoc infrastructure. The demonstrator will be able to address first collaborative scenarios on the Wiki as well as the sustainable storage of resources within the eSciDoc infrastructure. The main entry point for the user is the wiki, where they can, depending on privileges, read the wiki and its linked resources and/or update the wiki pages and/or update the respective resources stored in eSciDoc. A first demonstrator version will be available after 6 months, to prove sustainability of the envisioned solution, via selected features: creation of project and related resources, browsing and viewing details of resources, basic user roles for viewing and editing.

Duration: 12 months

Workpackage 1: Definition of basic content models (2 months)
 * Database registries
 * Project registries
 * Project resources
 * Wiki sources per project

Workpackage 2: Definition of basic users, roles, workflows (2 months)
 * What can be done by whom in the Wiki?
   * creation and management of users
   * activities such as creation of resources, publication of resources, modification of resources, closure/de-activation of resources
 * Necessary workflows
 * Necessary versioning mechanisms

Workpackage 3: Definition of necessary Wiki extensions (1 month)
 * for end users
 * for interfaces to eSciDoc/integration of eSciDoc resources
 * for integration of external services (e.g. viewers, searches)

Workpackage 4: Definition of basic AWOB AA (authentication and authorisation) component (1 month)
 * to handle the privileges of the wiki AA component and the eSciDoc AA component
 * set up LDAP

Workpackage 5: Definition of AWOB database platform (1 month)
 * function-specific interfaces to query/update databases

Workpackage 6: Implementation of prototype (5 months)

Phase 2 - Evaluation
Aim: In the second phase, the demonstrator will be presented to interested scientists within the MPG and other organisations. Urgent bug fixes for the demonstrator will be implemented if necessary for proper presentation. Feedback on possible extensions and improvements will be gathered.

Duration: 3 months

Workpackage 7: Outreach and communication
 * Pro-active communication to raise interest
 * Presentation of demonstrator to scientists
 * Feedback analysis
 * Align project planning for phase 3
 * Documentation, Support

Phase 3 - Extensions
Aim: Based on the requirements of the new partners, necessary extensions and improvements will be implemented. The final solution will be consolidated and prepared for productive usage by MPIs.

Duration: 21 months

Workpackage 8: Define/implement Wiki extensions

Workpackage 9: Define/implement extensions to eSciDoc content models

Workpackage 10: Define/implement extensions for proper "publication" of research data
 * Assignment of PIDs for research data and services
 * Review, Editorial processes
 * Link between Publication and Research Data

Workpackage 11: Define/implement extensions to AWOB AA component
 * e.g. Shibboleth

Workpackage 12: Define/implement extensions to AWOB database platform
 * integration of external databases, incl. authentication
 * protocols to create web interfaces on external databases

Workpackage 13: Consolidation
 * preparation for productive usage
 * documentation, support and maintenance policies
 * outreach to other communities
 * identify necessary extensions/improvements for re-use in other communities

In-kind contributions from project partners

 * MPDL:
   * Senior advice from Head of Development during complete project lifetime
   * Senior advice from Head of Service Management during complete project lifetime
   * Scientific consultant for astrophysical community (MPG and external) during complete project lifetime
 * MPI für extraterrestrische Physik/MPI für Astrophysik:
   * Senior advice from GAVO team members during complete project lifetime