PubMan Func Spec Submission/arXiv mapping

This page discusses issues related to PubMan Func Spec Submission: UC_PM_SM_04_fetch_metadata_from_external_system

Format Name
Source: arXiv Target: escidoc-publication

Overview on Schemas supported by arXiv
arXiv currently provides the following arXiv metadata formats via the OAI-PMH interface:

Decision
To start with, we will use the arXiv metadata format, as it seems to require the least parsing to map the metadata values to a PubItem.


 * Agreed, please note that splitting of journal-ref information is probably desired --Inga 12:11, 9 April 2008 (CEST)
 * I would plan the splitting of journal-ref information for R4 (if that is not a problem), as there is no single rule (based on checking several records) --Natasa 12:46, 9 April 2008 (CEST)

arXiv format

 * DOI is available, even though it is not mentioned in the schema

1. header/identifier => identifier (without the "oai:arXiv.org:" prefix). Note: this identifier is important because in the output it points to the exact version (i.e. v1, v2), which arXiv uses in "citeAs".

2. Authors
2.1. author/keyname => LastName
2.2. author/forename => Firstname
2.3. author/affiliation => External organization

3. title => title

4. report-no => Source/sequence-number (only if journal-ref is present, otherwise do not map?). Inga would rather map it to identifier.

5. journal-ref => source/title/volume/issue/pages (one single field in arXiv, therefore no 'better' mapping possible)

6. msc-class => dcterms:subject

7. abstract => abstract

8. categories => dcterms:subject

9. doi => dc:identifier of type DOI

10. proxy => ???, e.g. ccsd hal-00260045

11. id => identifier, type=arXiv

12. http://arxiv.org/abs/<header/identifier value> => dc:identifier (arXiv)
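The mapping above can be sketched roughly as follows. This is only an illustration, not the PubMan implementation: the sample record is made up and simplified (no XML namespaces, element names taken from the list above), and the output keys such as `family_name` are hypothetical placeholders for the PubItem fields. Real OAI-PMH responses are namespaced, and the element names should be checked against the arXiv schema.

```python
import xml.etree.ElementTree as ET

OAI_PREFIX = "oai:arXiv.org:"

# Made-up, simplified sample record (assumption: no namespaces).
SAMPLE_RECORD = """
<record>
  <header><identifier>oai:arXiv.org:0804.1234</identifier></header>
  <metadata>
    <arXiv>
      <id>0804.1234</id>
      <authors>
        <author>
          <keyname>Doe</keyname>
          <forename>Jane</forename>
          <affiliation>Example Institute</affiliation>
        </author>
      </authors>
      <title>An example title</title>
      <categories>hep-th math.AG</categories>
      <doi>10.1000/example</doi>
      <abstract>An example abstract.</abstract>
    </arXiv>
  </metadata>
</record>
"""


def map_arxiv_record(xml_text):
    """Map one arXiv-format record to a plain dict (hypothetical target shape)."""
    root = ET.fromstring(xml_text.strip())
    meta = root.find("metadata/arXiv")

    # Step 1: strip the "oai:arXiv.org:" prefix from header/identifier.
    oai_id = root.findtext("header/identifier", default="").strip()
    if oai_id.startswith(OAI_PREFIX):
        local_id = oai_id[len(OAI_PREFIX):]
    else:
        local_id = oai_id

    # Identifier of type arXiv plus the abs URL built from the same value.
    identifiers = [
        ("arXiv", local_id),
        ("URI", "http://arxiv.org/abs/" + local_id),
    ]
    doi = meta.findtext("doi")
    if doi:
        identifiers.append(("DOI", doi.strip()))

    # Step 2: keyname, forename and affiliation per author.
    creators = [
        {
            "family_name": author.findtext("keyname", default="").strip(),
            "given_name": author.findtext("forename", default="").strip(),
            "organization": author.findtext("affiliation", default="").strip(),
        }
        for author in meta.findall("authors/author")
    ]

    return {
        "identifiers": identifiers,
        "creators": creators,
        "title": meta.findtext("title", default="").strip(),
        "abstract": meta.findtext("abstract", default="").strip(),
        # categories is a whitespace-separated list of abbreviations
        "subjects": meta.findtext("categories", default="").split(),
    }


item = map_arxiv_record(SAMPLE_RECORD)
```

The open points in the list (report-no, journal-ref, proxy) are deliberately left out of the sketch, since their target fields are not yet decided.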

Subject Classes
From arXiv we get the abbreviated subject class, which we can resolve to its full name for further use in the publication metadata.
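Resolving the abbreviations could be a simple table lookup, sketched below. The table is a small illustrative excerpt, not a complete list, and the function name `resolve_categories` is hypothetical. Unknown abbreviations are passed through unchanged so that no information is lost.

```python
# Illustrative excerpt only; a real table would cover all arXiv categories.
CATEGORY_NAMES = {
    "hep-th": "High Energy Physics - Theory",
    "astro-ph": "Astrophysics",
    "math.AG": "Mathematics - Algebraic Geometry",
}


def resolve_categories(categories):
    """Turn the whitespace-separated 'categories' value into full names."""
    return [CATEGORY_NAMES.get(abbrev, abbrev) for abbrev in categories.split()]


subjects = resolve_categories("hep-th math.AG xx-unknown")
```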

OAI_DC metadata format

 * Fairly simple, as we also use dc metadata in the publication profile, but the result is not correct unless the values are parsed
 * Affiliations are not available

1. dc:description => dc:abstract (if dc:description does not start with "Comment")

2. dc:date => lists the dates of all existing versions (the earliest date is the date of submission to arXiv, all others are the dates of later revisions)

3. dc:identifier => partly an identifier, partly source information (the journal reference, i.e. journal-ref in the arXiv metadata format)

4. dcterms:subject => dcterms:subject (the full name of the category ids that are delivered in the arXiv format)

5. dc:creator => affiliations are missing.
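The parsing that the OAI_DC format would need can be sketched as below. This is a minimal sketch under assumptions: the inputs are the already extracted text values of the repeated dc elements, dates are ISO strings (YYYY-MM-DD), and the identifier split relies on a simple "starts with http" rule that is a guess, not a confirmed property of the data. All function names are hypothetical.

```python
def pick_abstract(descriptions):
    """Return the first dc:description that does not start with 'Comment'."""
    for text in descriptions:
        cleaned = " ".join(text.split())
        if cleaned and not cleaned.startswith("Comment"):
            return cleaned
    return None


def submission_date(dates):
    """Earliest dc:date; ISO dates sort correctly as plain strings."""
    return min(dates) if dates else None


def split_identifiers(identifiers):
    """Separate URL-like identifiers from source information (heuristic)."""
    urls = [value for value in identifiers if value.startswith("http")]
    sources = [value for value in identifiers if not value.startswith("http")]
    return urls, sources


abstract = pick_abstract(["Comment: 12 pages", "  We study an\n example. "])
first_date = submission_date(["2008-04-09", "2008-03-01", "2008-05-20"])
urls, sources = split_identifiers(
    ["http://arxiv.org/abs/0804.1234", "J. Example Phys. 12, 345-350 (2008)"]
)
```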

See https://dev.livingreviews.org/projects/epubtk/browser/trunk/ePubTk/lib/arxiv.py for an example of how to use arxiv's oai-pmh interface.
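For orientation, a single-record request follows the generic OAI-PMH `GetRecord` pattern. The sketch below only builds the request URL and does not contact any server. The base URL is an assumption and must be checked against the current arXiv documentation, and the identifier is a made-up example.

```python
from urllib.parse import urlencode

# Assumption: arXiv's OAI-PMH base URL; verify before use.
BASE_URL = "http://export.arxiv.org/oai2"


def get_record_url(arxiv_id, metadata_prefix="arXiv"):
    """Build an OAI-PMH GetRecord URL for one arXiv identifier."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": "oai:arXiv.org:" + arxiv_id,
            "metadataPrefix": metadata_prefix,
        }
    )
    return BASE_URL + "?" + query


url = get_record_url("0804.1234")
dc_url = get_record_url("0804.1234", metadata_prefix="oai_dc")
```

Switching `metadata_prefix` between `arXiv` and `oai_dc` selects which of the two formats discussed above is returned.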

Issues

 * Affiliations: there is no way to resolve names such as "MPI für XXX", because the organizational units service does not fully support search by organization name.
 * No extra requirement has been created for the core services, as it is not yet certain whether we would like to handle this within the controlled vocabulary or ask the core services directly for search-organizations methods. It might become an internal requirement for the controlled vocabulary service (institutions).
 * Parsing of journal/source information: check whether it is feasible and whether it can later be related to the controlled vocabulary service (journals)
 * Genre: another problem is that the genre is not apparent in either format; presumably we will simply use Article here
 * To be checked: book chapters?
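To get a feel for the feasibility of parsing journal/source information, one common layout ("Title Volume, Pages (Year)") could be tried with a regular expression, falling back to the unsplit string otherwise. This is a heuristic sketch only: as noted in the comments above, there is no single rule for journal-ref, so many records will not match. The pattern, the function name and the sample values are all illustrative assumptions.

```python
import re

# Assumed layout: "<title> <volume>, <pages> (<year>)"
JOURNAL_REF = re.compile(
    r"^(?P<title>.+?)\s+(?P<volume>\d+)\s*,\s*"
    r"(?P<pages>[\w-]+)\s*\((?P<year>\d{4})\)$"
)


def split_journal_ref(journal_ref):
    """Split journal-ref if it matches the assumed layout, else keep it whole."""
    text = " ".join(journal_ref.split())
    match = JOURNAL_REF.match(text)
    if match is None:
        # Fallback: the whole value becomes the source title.
        return {"title": text}
    return match.groupdict()


parsed = split_journal_ref("J. Example Phys. 12, 345-350 (2008)")
unparsed = split_journal_ref("Proceedings of Some Workshop, to appear")
```

Keeping the fallback means the mapping never loses the original value, which matches the current decision to store journal-ref as a single field.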