PubMan Func Spec Submission/arXiv mapping
This page discusses issues related to PubMan Func Spec Submission: UC_PM_SM_04_fetch_metadata_from_external_system
Overview of schemas supported by arXiv
arXiv currently provides the following metadata formats via its OAI-PMH interface:
format | schema | example record | note |
---|---|---|---|
arXiv | arXiv Schema | Record in arXiv, Record in arXiv (with affiliation), Record in arXiv (with doi) | most complete descriptive metadata format, without complete administrative data |
oai_dc | OAI_DC Schema | Record in oai_dc, Record in oai_dc (with affiliation), Record in oai_dc (with doi) | standard format with some specific mappings to available elements, therefore an arXiv specific parsing would be required |
arXivRaw | arXivRaw Schema | Record in arXivRaw, Record in arXivRaw (with affiliation), Record in arXivRaw (with doi) | not complete yet and subject to modification according to schema comments |
arXivOld | arXivOld Schema | Record in arXivOld | not considered |
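As a rough illustration of how a single record could be requested in one of the formats listed in the table, the following Python sketch builds an OAI-PMH GetRecord URL. The verb and the `identifier` and `metadataPrefix` parameters are standard OAI-PMH; the base URL, the function name `build_get_record_url` and the record id used in the usage line are assumptions for illustration and should be checked against arXiv's current OAI-PMH documentation.

```python
from urllib.parse import urlencode

# Assumption: illustrative base URL, not confirmed by this page.
OAI_BASE_URL = "http://export.arxiv.org/oai2"

# metadataPrefix values as listed in the table of formats.
SUPPORTED_PREFIXES = ("arXiv", "oai_dc", "arXivRaw", "arXivOld")


def build_get_record_url(arxiv_id, metadata_prefix="arXiv", base_url=OAI_BASE_URL):
    """Build an OAI-PMH GetRecord URL for a single arXiv record."""
    if metadata_prefix not in SUPPORTED_PREFIXES:
        raise ValueError(f"unsupported metadata format: {metadata_prefix}")
    query = urlencode({
        "verb": "GetRecord",
        # The OAI identifier carries the "oai:arXiv.org:" prefix.
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "0000.0000" is a placeholder id, not a real record.
    print(build_get_record_url("0000.0000"))
```

Keeping the metadata format as a parameter means the same helper could later be pointed at oai_dc or arXivRaw without further changes.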
Decision
To start, we will use the arXiv metadata format, as it seems to require the least parsing of metadata values when mapping to PubItem.
- Agreed, please note that splitting of journal-ref information is probably desired --Inga 12:11, 9 April 2008 (CEST)
- I would schedule the splitting of journal-ref information for R4, if that is not a problem; having checked several records, there is no single rule for it --Natasa 12:46, 9 April 2008 (CEST)
Mapping from arXiv formats to PubItem
arXiv format
- Missing version dates
- Affiliations are available
- Subject categories are only available as codes, e.g. "hep-th", so a mapping to the descriptor is needed, e.g. "High Energy Physics - Theory"
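The code-to-descriptor mapping mentioned in the last point could be a simple lookup table. The sketch below is a minimal illustration: only the "hep-th" entry comes from this page, the other entries and the names `CATEGORY_DESCRIPTORS` and `categories_to_subjects` are assumptions, and the categories value is assumed to be a space-separated list of codes.

```python
# Partial, illustrative lookup table. Only "hep-th" is taken from this page;
# the remaining entries are examples of arXiv category names.
CATEGORY_DESCRIPTORS = {
    "hep-th": "High Energy Physics - Theory",
    "hep-ph": "High Energy Physics - Phenomenology",
    "astro-ph": "Astrophysics",
    "quant-ph": "Quantum Physics",
}


def categories_to_subjects(categories):
    """Map a space-separated string of category codes to descriptors.

    Unknown codes are passed through unchanged so that no information is lost.
    """
    return [CATEGORY_DESCRIPTORS.get(code, code) for code in categories.split()]


if __name__ == "__main__":
    print(categories_to_subjects("hep-th quant-ph"))
```

Passing unknown codes through unchanged is one possible design choice; rejecting them or logging them would be equally defensible.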
1. header/identifier => identifier (without "oai:arXiv.org:" prefix)
(note: this identifier is important because in the output it points to the exact version, i.e. v1, v2, which arXiv uses in "citeAs")
2. Authors
2.1. author/keyname => LastName
2.2. author/forename => Firstname
2.3. author/affiliation => External organization
3. title => title
4. report-no => Source/sequence-number (only if journal-ref is present; otherwise do not map?). Inga would rather map it to identifier.
5. journal-ref => source/title/volume/issue/pages (Parsing not in R3)
6. msc-class => dc:subject
7. abstract => abstract
8. categories => dc:subject
9. doi => dc:identifier of type DOI
10. proxy => ???, e.g. <proxy>ccsd hal-00260045</proxy>
11. http://arxiv.org/abs/<header/identifier value> => dc:identifier (OTHER)
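The mapping rules above can be sketched in Python with the standard library XML parser. This is only an illustration under stated assumptions: the namespaces, the `authors/author` wrapper, the sample record and the output dictionary keys are all hypothetical and are not the PubItem schema. The list above writes `forename`; the schema may name the element `forenames`, so the sketch tries both.

```python
import xml.etree.ElementTree as ET

# Assumption: namespaces as typically used in OAI-PMH responses from arXiv.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "a": "http://arxiv.org/OAI/arXiv/",
}
OAI_PREFIX = "oai:arXiv.org:"

# Hypothetical record, invented for illustration only.
SAMPLE = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header><identifier>oai:arXiv.org:0000.0000</identifier></header>
  <metadata>
    <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
      <authors>
        <author>
          <keyname>Doe</keyname>
          <forenames>Jane</forenames>
          <affiliation>Example Institute</affiliation>
        </author>
      </authors>
      <title>An example title</title>
      <categories>hep-th</categories>
      <doi>10.0000/example</doi>
      <abstract>An example abstract.</abstract>
    </arXiv>
  </metadata>
</record>
"""


def _text(element, path):
    """Return whitespace-normalized text at path, or "" if the element is absent."""
    value = element.findtext(path, default="", namespaces=NS)
    return " ".join(value.split())


def map_arxiv_record(xml_text):
    """Map one arXiv-format record to a plain dict, following the numbered list."""
    record = ET.fromstring(xml_text)
    meta = record.find("oai:metadata/a:arXiv", NS)

    # 1. header/identifier without the "oai:arXiv.org:" prefix
    oai_id = _text(record, "oai:header/oai:identifier")
    arxiv_id = oai_id[len(OAI_PREFIX):] if oai_id.startswith(OAI_PREFIX) else oai_id

    # 2. authors: keyname, forename(s), affiliation(s)
    creators = []
    for author in meta.findall("a:authors/a:author", NS):
        creators.append({
            "family_name": _text(author, "a:keyname"),
            "given_name": _text(author, "a:forenames") or _text(author, "a:forename"),
            "organizations": [" ".join((aff.text or "").split())
                              for aff in author.findall("a:affiliation", NS)],
        })

    # 6. and 8. msc-class and categories both become subjects
    subjects = _text(meta, "a:categories").split()
    msc_class = _text(meta, "a:msc-class")
    if msc_class:
        subjects.append(msc_class)

    # 9. and 11. DOI and abstract-page URL become typed identifiers
    identifiers = [{"type": "OTHER", "value": "http://arxiv.org/abs/" + arxiv_id}]
    doi = _text(meta, "a:doi")
    if doi:
        identifiers.append({"type": "DOI", "value": doi})

    return {
        "identifier": arxiv_id,
        "creators": creators,
        "title": _text(meta, "a:title"),
        "abstract": _text(meta, "a:abstract"),
        "subjects": subjects,
        "identifiers": identifiers,
        "journal_ref": _text(meta, "a:journal-ref"),  # kept unparsed (not in R3)
    }


if __name__ == "__main__":
    print(map_arxiv_record(SAMPLE))
```

The open points in the list (report-no, proxy) are deliberately left out of the sketch, since their target elements are not yet decided.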
OAI_DC metadata format
- Fairly simple, as we also use dc metadata in the publication profile, but the result is not correct unless the values are parsed
- Affiliations are not available
1. dc:description => dc:abstract (if dc:description does not start with "Comment")
2. dc:date => lists the dates of all existing versions (the earliest date is the date of submission to arXiv; all others are dates of new revisions)
3. dc:identifier => partly an identifier, partly source information (the journal reference, i.e. journal-ref in the arXiv metadata format)
4. dc:subject => dc:subject (the full name of the category ids delivered in the arXiv format)
5. dc:creators => affiliations are missing
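The arXiv specific parsing that oai_dc would need can be sketched as follows. The sample record, the function name `map_oai_dc` and the output keys are hypothetical. Two heuristics are assumptions rather than documented rules: dates are taken to be ISO formatted (so sorting them as strings is chronological), and a dc:identifier value is treated as an identifier only if it starts with "http" or "doi:", with everything else treated as source information.

```python
import xml.etree.ElementTree as ET

DC = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical record, invented for illustration only.
SAMPLE_DC = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Doe, Jane</dc:creator>
  <dc:subject>High Energy Physics - Theory</dc:subject>
  <dc:description>An example abstract.</dc:description>
  <dc:description>Comment: 10 pages, 2 figures</dc:description>
  <dc:date>2008-05-02</dc:date>
  <dc:date>2008-04-14</dc:date>
  <dc:identifier>http://arxiv.org/abs/0000.0000</dc:identifier>
  <dc:identifier>Example J. 1 (2008) 1-10</dc:identifier>
</oai_dc:dc>
"""


def map_oai_dc(xml_text):
    """Apply the oai_dc mapping rules listed above to one record."""
    root = ET.fromstring(xml_text)

    def values(tag):
        # All values of one dc element, with whitespace normalized.
        return [" ".join((e.text or "").split()) for e in root.findall(f"dc:{tag}", DC)]

    # 1. Only descriptions that do not start with "Comment" are abstracts.
    abstracts = [d for d in values("description") if not d.startswith("Comment")]

    # 2. Earliest date = submission; later dates = revisions (assumes ISO dates).
    dates = sorted(values("date"))

    # 3. Split dc:identifier into identifiers and source information (heuristic).
    identifiers, source_info = [], []
    for value in values("identifier"):
        if value.startswith(("http", "doi:")):
            identifiers.append(value)
        else:
            source_info.append(value)

    return {
        "abstract": abstracts[0] if abstracts else None,
        "date_submitted": dates[0] if dates else None,
        "revision_dates": dates[1:],
        "identifiers": identifiers,
        "source_info": source_info,
        "subjects": values("subject"),    # 4. already full category names
        "creators": values("creator"),    # 5. names only, no affiliations
    }


if __name__ == "__main__":
    print(map_oai_dc(SAMPLE_DC))
```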
arXivRaw format
Issues
- Affiliations: there is no possibility of parsing "MPI für XXX", as the organizational units service does not fully support search by organization name
- As it is not certain whether we would like to handle this within the controlled vocab or ask core services directly for search-organizations methods, no issue has been created as an extra requirement for core services. It might be an internal requirement for the controlled vocab service (institutions).
- Parsing of journal/source information: check whether it is feasible and, if so, whether it can be related in future to the controlled vocab service (journals)
- Genre: a further problem is that the genre is not apparent in either format; presumably we will proceed by assuming Article
- to check for book chapters?
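To give a feel for the journal/source parsing issue above, here is a minimal best-effort sketch in Python. It handles only one hypothetical pattern, "Title Volume (Year) Pages", and returns None for anything else, so that the raw journal-ref string can still be stored unparsed. The pattern, the function name `parse_journal_ref` and the example strings are assumptions; as noted in the discussion, there is no single rule for real records.

```python
import re

# Assumption: one illustrative layout only, "Title Volume (Year) Pages".
JOURNAL_REF_PATTERN = re.compile(
    r"^(?P<title>.+?)\s+(?P<volume>\d+)\s*\((?P<year>\d{4})\)\s*(?P<pages>[\w-]+)$"
)


def parse_journal_ref(journal_ref):
    """Return title/volume/year/pages for one known layout, else None."""
    match = JOURNAL_REF_PATTERN.match(journal_ref.strip())
    if match is None:
        return None  # caller falls back to storing the unparsed string
    return match.groupdict()


if __name__ == "__main__":
    # Both strings are invented examples, not taken from real records.
    print(parse_journal_ref("Example J. 12 (2008) 345-350"))
    print(parse_journal_ref("to appear in Example Letters"))
```

Whether a set of such patterns covers enough records to be worthwhile is exactly the feasibility question raised above.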