Trip Report: Elsevier Visit 13th September 2007

Participants: Curt E. Kohler, Anita de Waard, Barbara Kalumenos, Malte Dreyer, Ulla Tschida, Inga Overkamp

Summary: "Elsevier Labs" (formerly known as "Advanced Technology Group") is a group within the publisher Elsevier Science Ltd. The group is responsible for developing new products as well as evaluating new technologies. It is interested in next-generation architectures for scholarly applications. The visit to MPDL was driven by an interest in learning more about eSciDoc and exploring potential areas of cooperation.

Result: Although we found a range of issues of interest to both institutions, no concrete project was envisioned. We may nevertheless keep in contact for further information exchange.

Introducing "Elsevier Labs" & their academic collaborations
Motto: "Scientists publish everywhere - and need to get back all relevant things whenever they need"

Strategy: Establish a web of entities to improve search, retrieval and interlinking of resources

Areas of interest:
 * semantic entities, e.g. publications, structures, proteins, etc.
 * author support, e.g. to help tagging the content semantically
 * research data, which are becoming more important and need to be stored as well as related to other entities
 * desktop tools, e.g. for search and retrieval
 * virtual communities, e.g. for scientific collaboration

Exemplary Projects:
 * BioImage (UK): to come up with a metadata standard for images
 * DOPE for Economics (University of Mannheim): semantic entities; visualizing content concepts in economics
 * Metadata Madness (neuroscience editors): authoring tool to semantically enrich neuroscience publications with bibliographic references, biological references and multimedia entities
 * OKKAM (many partners): architecture to build a global web of entities
 * Pragmatic Research Article (University of Utrecht): the final goal is to develop a structure that shows how knowledge is derived from and represented in a publication
 * Analyzing the storyboards of scientific papers. Finding: 3-5 episodes and then resolution.
 * Analyzing rhetorical moves. Finding: Problems are described in "past" forms

Metadata in the SD publication process
In ScienceDirect (SD), each publication is represented by one main document (in SGML/XML, with few structured parts inside).

The following metadata are available to represent the article front matter:
 * doi/pii
 * keywords
 * authors
 * affiliations
 * pagination
 * document type
 * article title (fulltext, abstract & pdf)
 * dates
 * issn
 * article title
 * unique issue key
 * email addresses

In addition, SD derives further metadata from the full text by applying pattern-matching software. Entities that are searched for in the full text include, e.g.
 * dois/piis, arxiv preprint references
 * urls
 * DNA sequence references, chemical structures, EMBL nucleotide sequences, OMIM (NCBI)

Afterwards, the copy editors are asked to confirm the suggestions, and the document is modified to insert the corresponding markup. Not all e-products leverage all of these entities.
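
To illustrate the general idea of such a suggest-and-confirm workflow, here is a minimal sketch in Python. It is not the software used by SD: the regular expressions, the `Suggestion` class, the function names and the `<entity>` element are all assumptions made for illustration, and only DOIs, arXiv identifiers and URLs are covered.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumptions, not the patterns used by SD).
PATTERNS = {
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+"),
    "arxiv": re.compile(
        r"\barXiv:(?:\d{4}\.\d{4,5}|[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?"
    ),
    "url": re.compile(r"https?://[^\s<>\"]+"),
}


@dataclass
class Suggestion:
    kind: str                # entity type, e.g. "doi"
    text: str                # matched string
    start: int               # offset of the match in the full text
    end: int                 # offset just past the match
    confirmed: bool = False  # set to True by the copy editor


def suggest_entities(fulltext):
    """Return candidate entities found by pattern matching, in text order."""
    suggestions = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(fulltext):
            # Drop sentence punctuation that the greedy patterns swallow.
            text = m.group(0).rstrip(".,;:)")
            suggestions.append(
                Suggestion(kind, text, m.start(), m.start() + len(text))
            )
    return sorted(suggestions, key=lambda s: s.start)


def insert_markup(fulltext, suggestions):
    """Wrap confirmed suggestions in a hypothetical <entity> element.

    Assumes confirmed suggestions do not overlap (e.g. a DOI inside a URL).
    """
    out = fulltext
    # Work from the end so earlier offsets stay valid.
    for s in sorted(suggestions, key=lambda s: s.start, reverse=True):
        if s.confirmed:
            tagged = f'<entity type="{s.kind}">{s.text}</entity>'
            out = out[:s.start] + tagged + out[s.end:]
    return out
```

Only suggestions that a copy editor has marked as `confirmed` end up as markup; unconfirmed matches leave the text unchanged.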

Elsevier and Long Term Archiving
The long-term archiving strategy of Elsevier Ltd. comprises:
 * 1) Dark archives, e.g. at the Koninklijke Bibliotheek in the Netherlands. Dark archives are not available for public access, but can be opened to all Elsevier customers in case Elsevier goes bankrupt
 * 2) "de facto" archives, i.e. installations of the SD software run by library consortia (e.g. Hebis)
 * 3) participation in (C)LOCKSS and Portico

Side tracks & Result

 * Anita pointed us to CATCH (Continuous Access to Cultural Heritage), a Dutch funding program to develop ontologies for describing cultural heritage objects
 * Get in contact again when biomedical/neuroscience issues come up (e.g. protein identification, etc.)
 * Metadata extraction is of great interest to MPS