Living Sources in Lexical Description
Summary
Living Sources is an infrastructure for publishing scientific data. There are many general issues concerning the publication of data (in contrast to the publication of results) that are applicable to most scientific fields (specifically, issues like persistence, quality control and scientific recognition). The Living Sources concept aims to address these problems, so different fields of scientific inquiry can profit from the solutions proposed. The general plan of publishing data will be approached through a concrete case, namely the Living Sources in Lexical Description, an online data-journal for the publication of dictionaries of the world's languages.
The Living Sources concept
Current situation
In contrast to the common practice of publishing and discussing research results, most scientists currently do not disclose the underlying research data. They do not make them available to a wider audience for various reasons, such as:
- failure to see wider applicability of data ("Why would anybody be interested in this?")
- insufficient quality (e.g. the data collection is not finished, it is not properly cross-checked, or the data is not complete)
- fear of plagiarism (others might not properly acknowledge the data)
- loss of control over interpretation (others might misunderstand the data, with undeserved blame being cast on the original creator of the data)
- loss of primacy of discovery (others might come up with important discoveries that the original creator also observed, but did not have time to work out and publish)
- lack of suitable venues to publish the data (most publishers are not interested in publishing large tables of raw data)
- lack of technical knowledge of how to make data available
- limited scientific recognition for making data available
All these - completely legitimate - reasons lead to the current situation in which data are mostly unavailable for inspection and scientific scrutiny, unavailable for reanalysis, and unavailable for meta-analysis. If much more (raw) data were available, many new possibilities for research, both within and across disciplines, would open up.
Prospects
Recent developments in computational infrastructure ("web 2.0") show the possibility of new kinds of information exchange. Living Sources will be an online repository of information created for and by scientists, tailored to the goals and needs of these scientists. To reach this goal, the concept of Living Sources will tackle problems that are general enough to be of importance to many fields of inquiry:
- persistence of data (storage and archiving)
- systems of quality control ("peer review")
- securing of scientific recognition and citability
The electronic format of publication offers various additional possibilities:
- incremental publications (corrections and additions possible, which is difficult for traditional forms of publications)
- comments on and citation of individual datapoints (micro-publication too small for traditional forms of publication)
- open peer review schemes
- addition of digitized legacy material to supplement the newly published data
- persistence of data through grid-like backup
Strategy
Living Sources will not attempt to force scientists to adapt to new paradigms of how to deal with data. It will function more as a service to those (sub)fields that have a need for data publication and dissemination. An instance of the Living Sources concept will be in need of:
- Availability of high-quality data
- Support from scientists in the field
- Editorial board (technical checks, organisation of field)
- Active engagement of scientists in the peer review (content check)
There are at least two complementary scenarios for applying the Living Sources concept. The first is the construction of a dedicated technical infrastructure that enhances the usability of data. This should be a "one-stop shop" for scientists looking for a hosting environment, including all features needed for the usage and deployment of such a system (e.g. user interfaces for editors, casual browsers and power users, searchability, persistent data storage, etc.). Second, Living Sources aims to set standards for the structure of data portals (like data journals or data archives) as well as for issues of citation and quality control. This is specifically geared towards groups of scientists who want to keep a strong hold on their data: they can build their own systems that are still interoperable with the dedicated Living Sources infrastructure.
Living Sources in Lexical Description
The Living Sources in Lexical Description is the first implementation of the Living Sources concept. It is specifically geared towards the publication of dictionaries and other lexical resources of the world's languages. It will focus primarily on lexical resources of lesser-studied and often endangered languages, offering specialists in such languages the opportunity to publish resources in which traditional publishers mostly have no real interest. In addition, the published lexical data will offer cross-searchability and cross-annotation, opening up new possibilities for research, like the investigation of cognate words for historical comparison of the world's languages.
Scientific scope
Words are of prime interest to linguists and the general audience alike. Various branches of linguistics are interested in well-organized and cross-searchable lexical resources, like:
- Lexicography and terminology research
- Description and documentation of endangered languages
- Dialectology
- Ethnolinguistics
- Historical linguistics
- Computational linguistics
- Psycholinguistics
Also, in the context of the recent movement in linguistics to recognize constructions (including both "set expressions" and "grammatical structures") as language-particular entities on a par with lexical items ("words"), the infrastructure for lexical resources can be expanded to a much larger scope of language description in the future.
Open submission and peer review
One of the main new possibilities for scientific publishing offered by the online electronic format is that publication and quality control can be separated. In a publication system where each publication is costly, quality control has to precede the physical publication. In contrast, in electronic form, the cost of each publication is small (the main costs relate to the upkeep of the overall system). This allows for a system in which publication itself (i.e. "making available") can happen independently of the assessment of quality ("peer review").
We would like to encourage people to publish smaller amounts of data, but such smaller datasets should of course be distinguished from large publications (for example complete dictionaries). To allow for different kinds of publications, some kind of stratification is needed. This stratification will happen through the (open) peer review system. For lexical data, we propose a two-layered system. The first level of publication will be called "Words of the World" and consists of technically correct submissions that have not (yet) been peer-reviewed. Peer review can (but need not) happen to obtain more scientific recognition, and (if successful) leads to publication in a more prestigious series, like "Dictionaries of the World's Languages".
Step 1: Submission
Submission consists mainly of a technical check by editors.
- data is directly uploaded (in a first phase some technical assistance should be available)
- this leads automatically to an evaluation by the editors (possibly closed, if requested by the author)
- editorial check will be on technical issues and requirements only (data structure, terminology, preface, etc.)
- withdrawal from peer review is still possible at this point, but the data remain available (with restricted access if wanted)
- these steps can be iterated until all technical requirements are met
Step 2: Bare publication
To allow for the availability of data, irrespective of scientific recognition, there should be a level of "bare" publication, with its own brand name.
- data that does not meet all technical requirements can still remain openly available, categorized as Draft
- data that passes the technical check will be announced as published in a special series, for example called Words of the World
Step 3: Review
There can be different, independent, more prestigious series. Such a series simply consists of an editorial board and an active community of peer reviewers. If that particular scientific sub-community accepts a submission, it can give its own "stamp" of recognition by branding a special series. In the context of lexical data, one could think of series like "Dictionaries of the World's Languages", "Intercontinental Dictionary Series", "Loanword Typology Wordlists", or "Cognate set collections". The branding and recognition of such series depend completely on the effort and success of the editors and the community of reviewers. Initially, only one such brand will be established, namely Dictionaries of the World's Languages.
- if wanted by the author, any technically accepted publication can be opened up for peer review to obtain more scientific recognition
- this peer-review will be time restricted and openly available to the whole community
- review should be a critical assessment of submission as a whole (i.e. commentary on kind of collection, and on larger samples of submitted data points)
- comments on individual entries should be seen as separate from commentary on the whole enterprise
- individual errors/shortcomings can and should be corrected, but should not preclude scientific recognition (except of course when the errors are too widespread)
Step 4: Full publication
On the basis of the reviews, the editors decide on acceptance. After acceptance, the result will be a peer-reviewed dictionary, meaning "the principle of collecting and organising data is good, though there might be discussion about individual items". The submission is then published in the series called "Dictionaries of the World's Languages".
Step 5: Editions/Supplements
A central part of the Living Sources concept is that published data is changeable. Authors can add and correct data, and users can add commentary or additional information. Any larger collection of such additions to the system can in turn be submitted to review (we will not put individual entries through the review process). The idea is that once a particular author/user has added a lot of new information (i.e. an author has added much information to his/her dictionary, or a user has collected many sets of cognates across different languages), such a collection of new information can be submitted to the scrutiny of peers, resulting in either a new edition of an available publication, a supplement to an available publication, or a completely new publication. Such substantially new versions should count as publications worthy of being listed on a CV.
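The five steps above can be read as a small state machine. The following Python sketch is only an illustration under stated assumptions: the status names `SUBMITTED` and `IN_REVIEW`, the `advance` function, and the rule that an unsuccessful review leaves a submission in Words of the World are hypothetical and not part of the proposal.

```python
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"                           # Step 1 (hypothetical name)
    DRAFT = "Draft"                                   # Step 2, technical check not passed
    BARE = "Words of the World"                       # Step 2, technical check passed
    IN_REVIEW = "in review"                           # Step 3 (hypothetical name)
    FULL = "Dictionaries of the World's Languages"    # Step 4


# Allowed transitions, as read from Steps 1 to 5.
TRANSITIONS = {
    Status.SUBMITTED: {Status.DRAFT, Status.BARE},  # technical check fails / passes
    Status.DRAFT: {Status.SUBMITTED},               # author revises and resubmits
    Status.BARE: {Status.IN_REVIEW},                # author opts in to peer review
    Status.IN_REVIEW: {Status.BARE, Status.FULL},   # assumption: rejection keeps the bare publication
    Status.FULL: {Status.IN_REVIEW},                # Step 5: new edition or supplement goes to review
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise ValueError if the step is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


status = advance(Status.SUBMITTED, Status.BARE)
status = advance(status, Status.IN_REVIEW)
status = advance(status, Status.FULL)
print(status.value)
```

Keeping the transitions in a single table makes it easy to add further branded series later without touching the checking logic.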
Citation
When they are used, the data in Living Sources in Lexical Description have to be citable. Some such citations have clear parallels in the citation of traditional print media, but some usages of the data call for new forms of citation. We distinguish between (at least) four different kinds of citations that people could use: citation of whole submissions, of individual data points, of micro-publications, and of complex collections of many data points originating from various publications. They are to some extent parallel to traditional forms of citation:
- whole submission ↔ book/article
- individual data points ↔ page in a book/article
- micro-publication ↔ personal communication
- collection of data points ↔ multi-author work
Some fictional examples follow, to illustrate how this could possibly work. (The structure of these example URIs is of course still unsettled.)
Citing whole submissions
The citation of whole submissions is completely parallel to the traditional citation of books and articles. The submission has an author, a title and a submission date. The "Living Sources in Lexical Description" is like a publisher (though without a physical location). The "stamp" is like a series, which might also have a serial number. Being online citations, they of course need a URL and an access date. This might, for example, look like:
Doe, John (2013) Dictionary of Nehali. [Dictionaries of the World's Languages, 3]. Living Sources in Lexical Description. (available online at livingsources.org/dictionaries/doe2013/, accessed on 23 March 2015).
In-text citations likewise function as normal, e.g. (Doe 2013).
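As a rough illustration of how such a bibliography entry could be generated from submission metadata, consider the following Python sketch. The `Submission` class, its field names, and the `slug` identifier are assumptions made for this example; the URI pattern is taken from the fictional example above and is, as noted, still unsettled.

```python
from dataclasses import dataclass
from typing import Optional

BASE = "livingsources.org"  # domain as used in the fictional examples


@dataclass
class Submission:
    family_name: str
    given_name: str
    year: int
    title: str
    slug: str                             # hypothetical identifier, e.g. "doe2013"
    series: Optional[str] = None          # the "stamp", if any
    series_number: Optional[int] = None


def bibliography_entry(sub: Submission, accessed: str) -> str:
    """Format a whole-submission citation in the style of the example."""
    series = ""
    if sub.series:
        number = f", {sub.series_number}" if sub.series_number is not None else ""
        series = f" [{sub.series}{number}]."
    return (
        f"{sub.family_name}, {sub.given_name} ({sub.year}) {sub.title}.{series} "
        f"Living Sources in Lexical Description. "
        f"(available online at {BASE}/dictionaries/{sub.slug}/, accessed on {accessed})."
    )


def in_text(sub: Submission) -> str:
    """Format the in-text citation, e.g. (Doe 2013)."""
    return f"({sub.family_name} {sub.year})"


doe = Submission("Doe", "John", 2013, "Dictionary of Nehali", "doe2013",
                 "Dictionaries of the World's Languages", 3)
print(bibliography_entry(doe, "23 March 2015"))
print(in_text(doe))
```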
Citing individual data points
Often just one lexical entry will be cited, or individual points of information available in the databases. This is completely parallel to traditional citation of pages. Each data entity will have its own URI that can be referred to, so it will be possible to add in-text citations like (Doe 2013: foo). In the bibliography only the whole work will be cited, not the individual URI. The URI to the individual data point is a composite of both: "livingsources.org/dictionaries/doe2013/foo/"
Doe, John (2013) Dictionary of Nehali. [Dictionaries of the World's Languages, 3]. Living Sources in Lexical Description. (available online at livingsources.org/dictionaries/doe2013/, accessed on 23 March 2015).
Citing micro-publications
A more unusual situation that comes up in this new medium is the citation of individual comments that users have added to an entry. Such micro-publications should probably be seen as akin to the tradition of "personal communication", meaning that they are cited in-text, but do not turn up in a bibliography. For example, consider the case of a discussion on the entry discussed previously (Doe 2013: foo). Various comments are posted in this discussion, and an in-text citation of one of them might look like this: (A. Ash 2015, commenting on Doe 2013: foo/talk/7). In the bibliography, only the entry for (Doe 2013) turns up, and the link to the comment works like a page number, leading to the URI "livingsources.org/dictionaries/doe2013/foo/talk/7".
Doe, John (2013) Dictionary of Nehali. [Dictionaries of the World's Languages, 3]. Living Sources in Lexical Description. (available online at livingsources.org/dictionaries/doe2013/, accessed on 23 March 2015).
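The composite URIs for entries and comments could be derived from the submission URI along the following lines. This is a minimal Python sketch that simply mirrors the fictional examples; the function names are hypothetical and the URI layout is, as stated above, not yet decided.

```python
BASE = "livingsources.org/dictionaries"  # prefix as used in the fictional examples


def submission_uri(slug: str) -> str:
    """URI of a whole submission, e.g. .../doe2013/"""
    return f"{BASE}/{slug}/"


def entry_uri(slug: str, entry: str) -> str:
    """URI of a single data point: submission URI plus entry identifier."""
    return f"{submission_uri(slug)}{entry}/"


def comment_uri(slug: str, entry: str, number: int) -> str:
    """URI of the n-th comment in the discussion attached to an entry."""
    return f"{entry_uri(slug, entry)}talk/{number}"


def cite_entry(author: str, year: int, entry: str) -> str:
    """In-text citation of a data point; the entry works like a page number."""
    return f"({author} {year}: {entry})"


def cite_comment(commenter: str, comment_year: int,
                 author: str, year: int, entry: str, number: int) -> str:
    """In-text citation of a micro-publication (a user comment)."""
    return (f"({commenter} {comment_year}, commenting on "
            f"{author} {year}: {entry}/talk/{number})")


print(entry_uri("doe2013", "foo"))
print(comment_uri("doe2013", "foo", 7))
print(cite_comment("A. Ash", 2015, "Doe", 2013, "foo", 7))
```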
Citing complex collections of data points
One of the main advantages of electronic resources is the possibility of searching for data, possibly resulting in a very complex selection of data that cuts across multiple submissions. It is very important to have a good system for citing such usage. The closest parallel in traditional citation is citing a multi-author publication. Such a citation will probably work as follows. After creation of a custom data set (e.g. through search and subsequent hand-picked selection), this data set can be saved online, resulting in a URI for the saved data set (e.g. livingsources.org/users/ArthurAsh/savesets/35). With the saved data set comes a receipt that counts the number of selected data points per author, e.g. John Doe (243), Michael Cysouw (67), Sonia Ash (12), D.H.M. Broom (2). In the citation of this multi-authored publication, the authors should be ordered according to their number of data points, and the date would be the date on which the collection was saved. The citation in the bibliography should list all authors, and might look like:
Doe, John, Michael Cysouw, Sonia Ash, D.H.M. Broom (2015) Custom data collection. Living Sources in Lexical Description. (available at livingsources.org/users/ArthurAsh/savesets/35).
The form of the in-text citation might look like: (Doe et al. 2015), or maybe like: (Doe, Cysouw, Ash et al. 2015), depending on editorial guidelines about citing multi-author works.
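The receipt and the resulting author order can be sketched in a few lines of Python. Everything here is an assumption for illustration: the representation of a data point as an (author, URI) pair, the function names, the alphabetical tie-break between authors with equal counts, and the naive splitting of a name at its last space.

```python
from collections import Counter


def receipt(data_points):
    """Count selected data points per author.

    `data_points` is an iterable of (author, entry_uri) pairs (assumed format).
    """
    return Counter(author for author, _ in data_points)


def ordered_authors(counts):
    """Authors sorted by number of data points, largest first.

    Ties are broken alphabetically (an assumption; the text leaves this open).
    """
    return [a for a, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def invert(name):
    """'John Doe' -> 'Doe, John' (naive split at the last space)."""
    given, _, family = name.rpartition(" ")
    return f"{family}, {given}" if given else family


def bibliography_entry(counts, year, uri):
    authors = ordered_authors(counts)
    names = ", ".join([invert(authors[0])] + authors[1:])
    return (f"{names} ({year}) Custom data collection. "
            f"Living Sources in Lexical Description. (available at {uri}).")


def in_text(counts, year):
    authors = ordered_authors(counts)
    family = authors[0].rpartition(" ")[2]
    return f"({family} et al. {year})" if len(authors) > 1 else f"({family} {year})"


counts = Counter({"John Doe": 243, "Michael Cysouw": 67,
                  "Sonia Ash": 12, "D.H.M. Broom": 2})
print(bibliography_entry(counts, 2015,
                         "livingsources.org/users/ArthurAsh/savesets/35"))
print(in_text(counts, 2015))
```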
Submissions
Submissions should consist of the data in a suitable format (e.g. TMF, LMF, or TEI - concrete decisions are still open on this) with a set of supplementary material. This supplementary material can be considered to be some kind of preface to the data. First, there should be a text document describing the data, addressing at least the following issues:
- introduction/origin of data
- scientific background/research field
- editorial background/rationale of the data
- selection criteria: e.g. sampling, fields, etc.
- links to other databases/sources
Second, there should be various structured documents describing the structure of the data:
- description of data categories (e.g. DTD (document type definition), ODD specification from TEI, database schema)
- specification of orthography used
- specification of terminology used
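The editorial technical check could start from a simple completeness test over these requirements. The following Python sketch assumes a submission is represented as a nested dictionary; the key names are hypothetical, since the proposal lists the required topics but does not prescribe any layout.

```python
# Hypothetical keys mirroring the two lists above.
PREFACE_TOPICS = (
    "origin",                 # introduction/origin of data
    "scientific_background",  # scientific background/research field
    "editorial_rationale",    # editorial background/rationale of the data
    "selection_criteria",     # sampling, fields, etc.
    "links",                  # links to other databases/sources
)
STRUCTURE_DOCS = (
    "data_categories",        # e.g. DTD, ODD specification, database schema
    "orthography",            # specification of orthography used
    "terminology",            # specification of terminology used
)


def missing_parts(submission):
    """Return a list of required parts that are absent or empty."""
    missing = []
    if not submission.get("data"):
        missing.append("data")
    preface = submission.get("preface", {})
    missing += [f"preface:{t}" for t in PREFACE_TOPICS if not preface.get(t)]
    structure = submission.get("structure", {})
    missing += [f"structure:{d}" for d in STRUCTURE_DOCS if not structure.get(d)]
    return missing


example = {
    "data": "nehali.xml",
    "preface": {t: "..." for t in PREFACE_TOPICS},
    "structure": {"data_categories": "schema.odd", "orthography": "orthography.txt"},
}
print(missing_parts(example))
```

An empty result would mean that all listed parts are present; it says nothing about their quality, which remains a matter for the editors and reviewers.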
Infrastructure
Technical issues
- Formats (TMF, LMF, TEI/dic.)
- Technical infrastructure: Lexus, eSciDoc
- Unique identification of data objects
- Direct reusability of data (local databases, linking of databases)
- Formats for commentaries
- Formats for orthography profiles
- Citation structure (receipts, recipe, granularity)
- upload vs. URL
- upload on Lexus
- fulltext/XML
- webservice
Rights
- Open Access
- Creative Commons Licence for data and metadata (by default: attribution)
- No copyright transfer
- agreement with authors that Living Sources in Lexical Description has the right to store and distribute the data under the Creative Commons Licence
Miscellaneous
- possibility of third party commentaries by any registered user
- sampling strategies for peer review
- check if Living Reviews infrastructure for the peer review process can be re-used
- applied for domains livingsources.org, livingsources.com, livingsources.eu (request processed by AEI Potsdam)
Application
Three-year project. Manpower needed:
- Lexical Curator
- Infrastructure programmer
Expenses
- Travel money
- money for workshops to inform and help possible authors
Sources
- Fieldworkers attached to MPI-EVA (Leipzig) and MPI for Psycholinguistics (Nijmegen)
- Lexical collections available online
- Intercontinental Dictionary Series, Loanword Typology Project (Leipzig)
Cooperation
- EMELD
Other support
- Potential scientific support from MPI for Psycholinguistics, Nijmegen, MPI for Evolutionary Anthropology, Leipzig and other Max Planck Institutes
- Potential financial support: ESF call BABEL, Volkswagenstiftung, Heinz-Nixdorf-Stiftung