Difference between revisions of "Living Sources in Lexical Description"

Revision as of 12:05, 19 May 2008

Summary

Living Sources is an infrastructure for publishing scientific data. There are many general issues concerning the publication of data (in contrast to the publication of results) that are applicable to most scientific fields (specifically, issues like persistence, quality control and scientific recognition). The Living Sources concept aims to address these problems, so different fields of scientific inquiry can profit from the solutions proposed. The general plan of publishing data will be approached through a concrete case, namely the Living Sources in Lexical Description, an online data-journal for the publication of dictionaries of the world's languages.

The Living Sources concept

Current situation

In contrast to the common practice of publishing and discussing research results, most scientists currently do not disclose the underlying research data. They do not make them available to a wider audience for various reasons, such as:

  • failure to see wider applicability of data ("Why would anybody be interested in this?")
  • insufficient quality (e.g. the data collection is not finished, it is not properly cross-checked, or the data is not complete)
  • fear of plagiarism (others might not properly acknowledge the data)
  • loss of control over interpretation (others might misunderstand the data, with undeserved blame being cast on the original creator of the data)
  • loss of primacy of discovery (others might come up with important discoveries that the original creator also observed, but did not have time to work out and publish)
  • lack of suitable venues to publish the data (most publishers are not interested in publishing large tables with raw data)
  • lack of technical knowledge of how to make data available
  • limited scientific recognition for making data available

All these completely legitimate reasons lead to the current situation in which data are mostly unavailable for inspection and scientific scrutiny, unavailable for reanalysis, and unavailable for meta-analysis. If much more (raw) data were available, many new possibilities for research would open up, both within and across disciplines.

Prospects

Recent developments in computational infrastructure ("web 2.0") are showing the possibility of new kinds of information exchange. Living Sources will be an online repository of information created for and by scientists, tailored to the goals and needs of these scientists. To reach this goal, the concept of Living Sources will tackle problems that are general enough to be of importance to many fields of inquiry:

  • persistence of data (storage and archiving)
  • systems of quality control ("peer review")
  • securing of scientific recognition and citability

The electronic format of publication offers various additional possibilities:

  • incremental publications (corrections and additions are possible, which is difficult for traditional forms of publication)
  • comments on and citation of individual datapoints (micro-publications that are too small for traditional forms of publication)
  • open peer review schemes
  • addition of digitized legacy material to supplement the newly published data
  • persistence of data through grid-like backup

Strategy

Living Sources will not attempt to force scientists to adapt to new paradigms of how to deal with data. It will function more as a service to those (sub)fields that have a need for data publication and dissemination. An instance of the Living Sources concept will be in need of:

  • Availability of high-quality data
  • Support from scientists in the field
  • Editorial board (technical checks, organisation of field)
  • Peer review (content check)

There are at least two complementary scenarios for the application of the Living Sources concept. The first is the construction of a dedicated technical infrastructure which enhances the usability of data. This should be a "one stop shop" for scientists who are looking for a hosting environment, including all features needed for usage and deployment of such a system (e.g. user interfaces for editors, casual browsers and power users, searchability, persistent data storage, etc.). Second, Living Sources aims to set standards for the structure of data portals (like data journals or data archives) as well as for issues of citation and quality control. This is specifically geared towards groups of scientists who want to keep a strong hold on their data and build their own systems, which would still be interoperable with the dedicated Living Sources infrastructure.

Living Sources in Lexical Description

Living Sources in Lexical Description is the first implementation of the Living Sources concept. It is specifically geared towards the publication of dictionaries and other lexical resources of the world's languages. It will focus primarily on lexical resources for lesser-studied and often endangered languages, offering specialists in such languages a way to publish resources in which traditional publishers mostly have no real interest. In addition, the published lexical data will be cross-searchable and open to cross-annotation, opening up new possibilities for research, such as the investigation of cognate words for historical comparison of the world's languages.

Scientific scope

Words are of prime interest to linguists and the general audience alike. Various branches of linguistics are interested in well-organized and cross-searchable lexical resources, like:

  • Lexicography and terminology research
  • Description and documentation of endangered languages
  • Dialectology
  • Ethnolinguistics
  • Historical linguistics
  • Computational linguistics
  • Psycholinguistics

Also, in the context of the recent movement in linguistics to recognize constructions (including both "set expressions" and "grammatical structures") as language-particular entities on a par with lexical items ("words"), the infrastructure for lexical resources can be expanded to a much larger scope of language description in the future.

Concept of open submission and peer review

One of the main new possibilities for scientific publishing offered by online electronic formats is that submission and quality control can be separated. In a publication system where each publication is costly, quality control has to precede the physical publication.

Submission (technical check by editors):

  • data is directly uploaded (in a first phase some technical assistance should be available)
  • this leads automatically to an evaluation by editors (possibly closed, if requested by the author)
  • editorial check will be on technical issues and requirements only (data structure, terminology, preface, etc.)
  • retraction from peer review is still possible at this point, but the data remain available (with restricted access if wanted)
  • these steps can be iterated until all technical requirements are met
  • data that does not meet the technical requirements can still remain available, categorized as "draft"

Review (content check by peers)

  • open peer-review submission (time-restricted)
  • a critical assessment of the submission as a whole (i.e. commentary on the preface, not on individual entries) decides on acceptance. This should be kept separate from commentary on individual entries of the data.
  • individual errors/shortcomings can and should be corrected, but should not block publication.
  • result: a published database, meaning "the principle of collecting and organising data is good, though there might be discussion about individual items"
  • different publication statuses: e.g. "wordlist", "wordform collection (including frequencies, collocations, etc.)", "semantic field (Wortfeld)", "language-particular dictionary", "comparative dictionary"

Once a submission has passed the technical step (which is already a large hurdle for many traditional lexicographers), it is technically published. We would like to encourage people to publish smaller amounts of data, but such smaller datasets should of course be distinguished from large publications (for example complete dictionaries). To allow for different kinds of publications, some kind of stratification is needed. This stratification will happen through the (open) peer review system.

The two basic modes of publication are "Wordlist" (for onomasiological submissions) and "Wordform collection" (for semasiological submissions). These 'stamps' are given after the technical check, so having just one of these labels is not very rewarding (cf. a lower-rated journal). To get into one of the more rewarding categories,


step 3: living commentary and growth of data

  • addition of more data, corrections, versions
  • discussion about individual items (not time-restricted)
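
The workflow described above (technical check, optional open peer review, ongoing commentary) can be read as a small state machine. The following Python sketch is only an illustration under assumptions: the status names, the class Submission and the transition table are hypothetical and are not part of any existing Living Sources, Lexus or eSciDoc software.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status names, chosen only for this sketch.
    UPLOADED = "uploaded"
    DRAFT = "draft"  # technical requirements not (yet) met
    TECHNICALLY_PUBLISHED = "technically published"
    IN_REVIEW = "in open peer review"
    PEER_REVIEWED = "peer reviewed"


# Allowed transitions. A draft can be re-checked any number of times,
# and a submission that is not accepted in review stays technically published.
TRANSITIONS = {
    Status.UPLOADED: {Status.DRAFT, Status.TECHNICALLY_PUBLISHED},
    Status.DRAFT: {Status.DRAFT, Status.TECHNICALLY_PUBLISHED},
    Status.TECHNICALLY_PUBLISHED: {Status.IN_REVIEW},
    Status.IN_REVIEW: {Status.PEER_REVIEWED, Status.TECHNICALLY_PUBLISHED},
    Status.PEER_REVIEWED: set(),
}


class Submission:
    def __init__(self, title):
        self.title = title
        self.status = Status.UPLOADED
        self.comments = []  # (entry_id, text) pairs

    def move_to(self, new_status):
        """Change status, rejecting transitions the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status

    def technical_check(self, requirements_met):
        """Editorial check on technical requirements only."""
        if requirements_met:
            self.move_to(Status.TECHNICALLY_PUBLISHED)
        else:
            self.move_to(Status.DRAFT)

    def add_comment(self, entry_id, text):
        """Living commentary on individual items: possible in any status."""
        self.comments.append((entry_id, text))
```

In this sketch, commentary on individual entries is deliberately kept outside the transition table, reflecting the idea that discussion of single items is not time-restricted and does not decide on acceptance.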


Infrastructure

Technical issues:

  • Formats (TMF, LMF, TEI/dic.)
  • Technical infrastructure: Lexus (MPI for Psycholinguistics, Nijmegen), eSciDoc
  • Unique identification of data objects
  • Direct reusability of data (local databases, linking of databases)
  • Formats for commentaries
  • Formats for orthography profiles
  • Citation structure (receipts, recipe, granularity)



Functional specification/Requirements

Submission:

Required information, seen as a preface:

  • scientific background/research field
  • editorial background/rationale of the data
  • selection criteria: e.g. sampling, fields, etc.
  • data category/use of data: e.g. ODD specification, schema, specification of orthography, terminology specification etc.
  • links to other databases/sources

Required information about the data itself:

  • upload vs. URL
  • upload on Lexus
  • fulltext/XML
  • webservice
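
The required preface information listed above could be captured as a simple record that an editor checks for completeness during the technical check. The sketch below is a minimal illustration, assuming hypothetical field names (Preface, missing_fields and all attributes are invented here and do not correspond to an existing schema).

```python
from dataclasses import dataclass, field, fields


@dataclass
class Preface:
    # Hypothetical field names mirroring the list of required information.
    scientific_background: str
    editorial_rationale: str
    selection_criteria: str
    data_category: str
    links: list = field(default_factory=list)  # other databases/sources


def missing_fields(preface):
    """Return the names of preface fields that are empty or blank."""
    missing = []
    for f in fields(preface):
        value = getattr(preface, f.name)
        if isinstance(value, str):
            value = value.strip()
        if not value:
            missing.append(f.name)
    return missing
```

A submission whose preface yields a non-empty result from missing_fields would, in this reading, stay in the "draft" category until the gaps are filled. Whether an empty list of links should count as missing is an assumption of this sketch.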


Open issues:

  • check if Living Reviews infrastructure for the peer review process can be re-used
  • need of a sampling strategy on the data
  1. sample of full entries
  2. full overview of specific fields (e.g. all parts of speech, all etymological fields)

Rights

  • Open Access
  • Creative Commons Licence for data and metadata (by default: attribution)
  • No copyright transfer
  • agreement with authors that Living Sources in Lexical Description has the right to store and distribute the data under the Creative Commons Licence


Miscellaneous

  • possibility of third party commentaries by any registered user

Means

Needed man-power:

  • Lexical Curator
  • Infrastructure programmer

Support

  • Potential scientific support from MPI for Psycholinguistics, Nijmegen, MPI for Evolutionary Anthropology, Leipzig, and other Max Planck Institutes
  • Potential financial support: ESF call BABEL, Volkswagenstiftung, Heinz-Nixdorf-Stiftung

Other

  • applied for domains livingsources.org, livingsources.com, livingsources.eu (request processed by AEI Potsdam)