PubMan Func Spec Submission/arXiv mapping

This page discusses issues related to PubMan Func Spec Submission: UC_PM_SM_04_fetch_metadata_from_external_system

Overview of schemas supported by arXiv

arXiv currently provides the following arXiv metadata formats via the OAI-PMH interface:

format: oai_dc
schema: OAI_DC Schema
example records: Record in oai_dc; Record in oai_dc (with affiliation); Record in oai_dc (with doi)
notes:
  • parsing is needed, see the mappings below
  • source information needs to be parsed from a string, e.g. "<dc:identifier>Phys.Rev. D75 (2007) 083523</dc:identifier>"
  • affiliations are missing

format: arXiv
schema: arXiv Schema
example records: Record in arXiv; Record in arXiv (with affiliation); Record in arXiv (with doi)
notes:
  • subject categories are only available as codes, e.g. "hep-th" -> mapping to descriptor, e.g. "High Energy Physics - Theory"?
  • more complete/precise version dates are missing
  • source information needs to be parsed from a string, e.g. "<journal-ref>Phys.Rev. D75 (2007) 083523</journal-ref>"

format: arXivRaw
schema: arXivRaw Schema
example records: Record in arXivRaw; Record in arXivRaw (with affiliation); Record in arXivRaw (with doi)
notes:
  • not complete yet and subject to modification according to the schema comments
  • authors need to be parsed from a string, e.g. "<authors>Gilbert Holder (McGill), Ian G. McCarthy (Durham), Arif Babul (Victoria)</authors>"

format: arXivOld
schema: arXivOld Schema
example records: Record in arXivOld
notes:
  • not considered
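
As an illustration of how a record could be requested in one of the formats listed above, the following minimal sketch builds an OAI-PMH GetRecord URL. The parameter names (verb, identifier, metadataPrefix) come from the OAI-PMH protocol; the base URL and the function name build_get_record_url are assumptions made for this example and should be checked against the current arXiv documentation.

```python
from urllib.parse import urlencode

# Assumption: base URL of the arXiv OAI-PMH endpoint. Verify before use.
OAI_BASE_URL = "http://export.arxiv.org/oai2"

# Metadata prefixes taken from the overview above.
SUPPORTED_PREFIXES = ("oai_dc", "arXiv", "arXivRaw", "arXivOld")


def build_get_record_url(arxiv_id, metadata_prefix="arXiv", base_url=OAI_BASE_URL):
    """Build an OAI-PMH GetRecord request URL for a single arXiv record."""
    if metadata_prefix not in SUPPORTED_PREFIXES:
        raise ValueError(f"unsupported metadata format: {metadata_prefix}")
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes the colons in the identifier.
    return f"{base_url}?{urlencode(params)}"
```

Only the URL is constructed here; fetching and error handling are left out on purpose.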

Decision

To start, we will use the arXiv metadata format, as it seems to require the least parsing when mapping metadata values to PubItem.

Mapping from arXiv to PubItem

arXiv format

  • Missing version dates
  • Affiliations are available
  • Subject categories are only available as code, e.g. "hep-th" -> mapping to descriptor, e.g. "High Energy Physics - Theory"
1.  header/identifier => identifier (without the "oai:arXiv.org:" prefix)
(note: this identifier is important because in the output it points to the exact version, e.g. v1 or v2, which arXiv uses in "citeAs")

2.  Authors
2.1. author/keyname => LastName
2.2. author/forename => Firstname
2.3. author/affiliation => External organization 

3.  title => title

4.  report-no => source/sequence-number (only if journal-ref is present; otherwise do not map?)
5.  journal-ref => source/title (parsing is not part of R3)
6.  msc-class => dc:subject
7.  abstract => abstract
8.  categories => dc:subject
9.  doi => dc:identifier (DOI)
10. proxy => ???, e.g. <proxy>ccsd hal-00260045</proxy>

11. http://arxiv.org/abs/<header/identifier value> => dc:identifier (OTHER)
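
The mapping above can be sketched as a small function that reads one record in the arXiv format and returns a plain dictionary. This is only an illustration under stated assumptions: the dictionary keys are placeholders and not the actual PubItem schema, the category lookup table is deliberately incomplete, categories is assumed to be a whitespace-separated list of codes, and the sketch accepts both forename (as named on this page) and forenames in case the schema uses the plural. The proxy element is left unmapped because its target is still open.

```python
import xml.etree.ElementTree as ET

OAI_PREFIX = "oai:arXiv.org:"

# Illustrative, incomplete lookup from category code to descriptor.
CATEGORY_NAMES = {"hep-th": "High Energy Physics - Theory"}


def local(tag):
    """Strip the XML namespace: '{ns}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def text_of(parent, name):
    """Text of the first direct child with this local name, or None."""
    for child in parent:
        if local(child.tag) == name:
            return (child.text or "").strip()
    return None


def map_arxiv_record(oai_identifier, metadata_xml):
    """Map one arXiv-format record (the <arXiv> element) to a dict."""
    root = ET.fromstring(metadata_xml)

    # 1. header/identifier without the "oai:arXiv.org:" prefix
    arxiv_id = oai_identifier
    if arxiv_id.startswith(OAI_PREFIX):
        arxiv_id = arxiv_id[len(OAI_PREFIX):]

    # 2. authors: keyname, forename, affiliation
    creators = []
    for elem in root.iter():
        if local(elem.tag) == "author":
            creators.append({
                "family_name": text_of(elem, "keyname"),
                "given_name": text_of(elem, "forename") or text_of(elem, "forenames"),
                "organization": text_of(elem, "affiliation"),
            })

    # 6. and 8. msc-class and categories both map to dc:subject.
    # Category codes are replaced by a descriptor where one is known.
    subjects = []
    msc = text_of(root, "msc-class")
    if msc:
        subjects.append(msc)
    for code in (text_of(root, "categories") or "").split():
        subjects.append(CATEGORY_NAMES.get(code, code))

    # 9. and 11. DOI and the abs URL become identifiers
    identifiers = [("OTHER", f"http://arxiv.org/abs/{arxiv_id}")]
    doi = text_of(root, "doi")
    if doi:
        identifiers.append(("DOI", doi))

    # 4. and 5. a source exists only when journal-ref is present;
    # report-no is mapped only in that case.
    source = None
    journal_ref = text_of(root, "journal-ref")
    if journal_ref:
        source = {"title": journal_ref, "sequence_number": text_of(root, "report-no")}

    return {
        "identifier": arxiv_id,
        "creators": creators,
        "title": text_of(root, "title"),        # 3.
        "abstract": text_of(root, "abstract"),  # 7.
        "subjects": subjects,
        "identifiers": identifiers,
        "source": source,
    }
```

Matching on local element names keeps the sketch independent of the exact namespace URI used by the schema.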

OAI_DC metadata format

  • Fairly simple, as we also use dc metadata in the publication profile, but the result is not correct unless the values are parsed
  • Affiliations are not available
1.  dc:description => dc:abstract (if dc:description does not start with "Comment")
2.  dc:date => lists the dates of all existing versions (the earliest is the date of submission to arXiv; all others are dates of later revisions)
3.  dc:identifier => partly an identifier, partly source information (the journal reference, i.e. journal-ref in the arXiv metadata format)
4.  dc:subject => dc:subject (the full name of the category codes delivered by the arXiv format)
5.  dc:creator => affiliations are missing
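
To show why oai_dc values are "not correct if not parsed", here is a minimal sketch of the separation work described in the list above. It assumes that dc:date values are ISO strings (YYYY-MM-DD), and that dc:identifier values that are real identifiers start with a URL scheme or with "doi:"; everything else is treated as a journal reference. Both assumptions, and the function name classify_oai_dc, are illustrative and would need checking against real records.

```python
import re

# Assumption: real identifiers look like URLs or "doi:..." strings.
IDENTIFIER_RE = re.compile(r"(https?://|doi:)", re.IGNORECASE)


def classify_oai_dc(descriptions, dates, identifiers):
    """Separate the mixed oai_dc values into distinct item fields."""
    # 1. The abstract is the first dc:description not starting with "Comment".
    abstract = next(
        (d.strip() for d in descriptions if not d.strip().startswith("Comment")),
        None,
    )

    # 2. ISO date strings sort chronologically, so the earliest value
    # is the submission date and the rest are revision dates.
    ordered = sorted(dates)

    # 3. dc:identifier mixes identifiers and source information.
    ids, sources = [], []
    for value in identifiers:
        value = value.strip()
        if IDENTIFIER_RE.match(value):
            ids.append(value)
        else:
            sources.append(value)

    return {
        "abstract": abstract,
        "date_submitted": ordered[0] if ordered else None,
        "revision_dates": ordered[1:],
        "identifiers": ids,
        "source_strings": sources,
    }
```

Even with this separation, the source string itself still has to be parsed, and affiliations cannot be recovered from this format.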

arXivRaw format
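
As noted in the overview, arXivRaw delivers all authors in a single string, so names and affiliations would have to be parsed. The following is a minimal sketch of such parsing for the pattern "Name (Affiliation), Name (Affiliation)". It is an assumption that all records follow this pattern; nested parentheses, "and" as a separator, and names containing commas are not handled. Taking the last token as the family name is a heuristic only.

```python
import re

# One author name, optionally followed by "(affiliation)",
# terminated by a comma or the end of the string.
AUTHOR_RE = re.compile(r"\s*([^,()]+?)\s*(?:\(([^)]*)\))?\s*(?:,|$)")


def parse_raw_authors(authors):
    """Split an arXivRaw <authors> string into name/affiliation records."""
    result = []
    for name, affiliation in AUTHOR_RE.findall(authors):
        # Heuristic: the last whitespace-separated token is the family name.
        given, _, family = name.rpartition(" ")
        result.append({
            "given_name": given,
            "family_name": family,
            "affiliation": affiliation or None,
        })
    return result
```

For the example string from the overview this yields three authors with the affiliations McGill, Durham and Victoria.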

Issues

  • Affiliations: names such as "MPI für XXX" cannot be resolved by parsing, because the organizational units service does not fully support search by organization name
    • Because it is not yet certain whether we want this within the controlled vocabulary or should request search-organization methods directly from core services, no issue has been created as an extra requirement for core services. It might become an internal requirement for the controlled vocabulary service (institutions).
  • Parsing of journal names: check whether it is feasible and whether it could later be related to the controlled vocabulary service (journals)
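
To help judge the feasibility of journal-ref parsing, here is a minimal sketch that handles the one pattern quoted on this page, "Phys.Rev. D75 (2007) 083523" (journal, volume, year in parentheses, page or article number). The pattern and the function name parse_journal_ref are assumptions for illustration; real journal-ref values are free text and many will not match, in which case the whole string would still have to go to source/title.

```python
import re

# journal title, volume (optionally with a letter prefix), (year), pages
JOURNAL_REF_RE = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>[A-Z]?\d+)\s+"
    r"\((?P<year>\d{4})\)\s+(?P<pages>\S+)$"
)


def parse_journal_ref(journal_ref):
    """Return the parts of a journal reference, or None if it does not match."""
    match = JOURNAL_REF_RE.match(journal_ref.strip())
    if match is None:
        return None  # caller falls back to the unparsed string
    return match.groupdict()
```

The extracted journal string ("Phys.Rev." in the example) is the value that would later have to be matched against a controlled vocabulary of journals.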