Zeitschrift Naturforschung Discussion

From MPDLMediaWiki

Revision as of 09:45, 2 April 2012 by Unfried (Talk | contribs)

This page will function as a shared discussion page for metadata issues in the ZfN project.

Metadata Problems

Priority | Identifier | Problem | Description/Example | Status
Yellow.png | ZNC-1988-43c-xyz | This is the problem | This is a more detailed description of the problem; comments etc. are also welcome in this area. | e.g. reported/in progress/solved
Testing from Nov. 2011
Red.png | ZNC-1988-43c-0015 | Author has wrong affiliation | | reported/corrected in new run
Red.png | ZNC-1988-43c-0021 | Wrong title | Dilinoylgalactosylglycerol => DiIinoyIgalactosylglycerol (the l becomes a capital i). This problem is OCR related. | reported
Yellow.png | ZNC-1988-43c-0029 | All affiliations are listed as creators (not child from author) | | reported/corrected in new run
Red.png | ZNC-1988-43c-0034 | Merged affiliations | | reported/corrected in new run
Red.png | ZNC-1988-43c-0084 | Author has wrong affiliation | | reported/bug under consideration
Red.png | ZNC-1988-43c-0185 | Wrong author name | Gülz => Giilz. This problem is OCR related. | reported
Red.png | ZNC-1988-43c-0189 | Merged affiliations | | reported
Red.png | ZNC-1988-43c-0211 | Wrong author name | Hans Ulrich Seitz & Ernst Reinhard => Hans Ulrich & Seitz Ernst Reinhard | reported
Red.png | ZNC-1988-43c-0243 | Author has wrong affiliation & Affiliations are listed twice | | reported
Red.png | ZNC-1988-43c-0261 | Wrong author name | Helmut Schipp von Branitz => Helmut Schipp & von Branitz (one author becomes two) | reported
Red.png | ZNC-1988-43c-0287 | Wrong author name | Leopold von Proff => Leopold von & Proff (one author becomes two) | reported
Red.png | ZNC-1988-43c-0340 | Wrong title & Author has wrong affiliation | Pimpinella anisum => Pimpinella anisutn. This problem is OCR related. | reported
Yellow.png | ZNC-1988-43c-0382 | Wrong title | "New Constituents of Essential Oil from Elsholtzia pilosa* Materials and Methods" (the phrase "Materials and Methods" does not belong here) | reported
Red.png | ZNC-1988-43c-0455 | Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0461 | Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0475 | Wrong author name (special character, OCR related) & All affiliations are listed as creators (not child from author) & Merged affiliations | | reported
Red.png | ZNC-1988-43c-0491 | Wrong author name | Schübel => Schiibel. This problem is OCR related. | reported
Red.png | ZNC-1988-43c-0523 | Wrong author name | Borner => Börner. This problem is OCR related. | reported
Red.png | ZNC-1988-43c-0527 | Author has wrong affiliation & place is listed twice | | reported
Red.png | ZNC-1988-43c-0589 | Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0613 | Wrong author name | Ilpo => lipo. This problem is OCR related. | reported
Red.png | ZNC-1988-43c-0621 | All subjects become authors | | reported
Red.png | ZNC-1988-43c-0668 | Merged affiliations | | reported
Red.png | ZNC-1988-43c-0721 | Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0794 | Merged affiliations & Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0830 | Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0869 | Author has wrong affiliation & Merged affiliations | | reported
Red.png | ZNC-1988-43c-0900 | Author has wrong affiliation & Merged affiliations & Wrong author created | Francesco Dall'Acqua becomes 2 authors | reported
Red.png | ZNC-1988-43c-0905 | Part of subtitle becomes author | The English subtitle information gets lost & author Erich F. Elstner becomes 2 authors "Heme Erich" and "F Elstner" (the 'Heme' comes from the subtitle) | reported
Red.png | ZNC-1988-43c-0920 | Author has wrong affiliation | | reported
Red.png | ZNC-1988-43c-0317_b | Author has wrong affiliation & Merged affiliations | | reported
Red.png | ZNC-1988-43c-0372_b | Author has wrong affiliation & Merged affiliations | | reported
Red.png | ZNC-1988-43c-0_805b | All affiliations are listed as creators (not child from author) & Wrong author names | Three authors are merged to 2 authors | reported
Testing from 17.01.2012
ZNA-1947-2a-0491 Wrong pdf???
ZNA-1947-2a-0494 Wrong pdf???
ZNA-1947-2a-0497 Wrong pdf???
ZNA-1947-2a-0509 Wrong pdf???
ZNA-1947-2a-0539b_b Wrong pdf???
Testing from 24.01.2012
ZNB-1947-2b-0001 <forename type="first">Von</forename>
<forename type="middle">A</forename>
<surname>Catsch</surname>
ZNB-1947-2b-0005 content in abstract
ZNB-1947-2b-0010 First part of content becomes abstract and chemical formula becomes text:OH r jYVT^ COOH
ZNB-1947-2b-0012 Beginning of content in abstract
ZNB-1947-2b-0014 title is missing
ZNB-1947-2b-0019 OCR problem, wrong author name, S-Gold-N-Allyl-N' => Ä-Gold-A T -Allyl-A
part of abstract is missing
ZNB-1947-2b-0025 OCR problem: Bacterium => Bactenum, Bact => Bad.
@BULLET in org info
ZNB-1947-2b-0029 OCR problem: Vermehrung => Yermehrung (title), Tabak => T&bak
content in title
ZNB-1947-2b-0035 OCR problem, Beziehungen => Beziehlingen (title)
content becomes abstract
ZNB-1947-2b-0063 part of content in abstract
ZNB-1947-2b-0066 tei missing
ZNB-1947-2b-0072 content becomes abstract
ZNB-1947-2b-0073 OCR problem, Bruno => Bkuno
part of content becomes abstract
ZNB-1947-2b-0081 OCR problem, Strompotentialkurven => Strompotentialkuryen, wrong translation of formula in abstract
ZNB-1947-2b-0089 OCR problem, Verbindungen => Verbinclungen
ZNB-1947-2b-0094 content in abstract
ZNB-1947-2b-0104 OCR problem in title
ZNB-1947-2b-0108 OCR problem in abstract
content in abstract
ZNB-1947-2b-0112 content in abstract
ZNB-1947-2b-0146 OCR problem in title
ZNB-1947-2b-0158 OCR problem in title, author name
ZNB-1947-2b-0187 part of abstract missing
ZNB-1947-2b-0203 tei missing
ZNB-1947-2b-0215 OCR problem in abstract
ZNB-1947-2b-0222 Affiliation missing, content in abstract
ZNB-1947-2b-0233 OCR problem in title
ZNB-1947-2b-0249 OCR problem in title
ZNB-1947-2b-0286 content in abstract
ZNB-1947-2b-0292 content in abstract
ZNB-1947-2b-0295 OCR problem in abstract
ZNB-1947-2b-0301 part of abstract missing
ZNB-1947-2b-0308 OCR problem in title
affiliation missing
ZNB-1947-2b-0313 part of content in abstract
ZNB-1947-2b-0330.header.tei.xml no corresponding pdf on regensburg server
ZNB-1947-2b-0330 OCR problem in author name
part of content in abstract
ZNB-1947-2b-0358 OCR problem in affiliation
ZNB-1947-2b-0361 content in abstract
ZNB-1947-2b-0367 content in abstract
ZNB-1947-2b-0369 content in abstract
ZNB-1947-2b-0382 OCR too bad for tei transformation
ZNB-1947-2b-0397 OCR problem in title
ZNB-1947-2b-0397 content in abstract
ZNB-1947-2b-0400 content in abstract
ZNB-1947-2b-0404 OCR problem in abstract,
content in abstract
ZNB-1947-2b-0410 OCR problem in abstract,
content in abstract
ZNB-1947-2b-0414 OCR problem in abstract
ZNB-1947-2b-0419 content in abstract
ZNB-1947-2b-0421 OCR problem in author name
content in abstract
ZNB-1947-2b-0428 part of abstract missing
ZNB-1947-2b-0433 OCR problem in abstract,
content in abstract
ZNB-1947-2b-0444 OCR problem in abstract
part of abstract missing
ZNB-1947-2b-0450 OCR problem in abstract
part of abstract missing
ZfN-1946-1-0003 content in abstract
ZfN-1946-1-0010 wrong abstract
ZfN-1946-1-0013 content in abstract
ZfN-1946-1-0018 OCR problem in abstract
ZfN-1946-1-0053 OCR problem in abstract Maxwellschen => Maxwell sehen
ZfN-1946-1-0067 Wrong affiliation address info
ZfN-1946-1-0070 OCR problem in title
ZfN-1946-1-0087 content becomes abstract
ZfN-1946-1-0093 OCR problem in author name
ZfN-1946-1-0108
ZfN-1946-1-0120 content becomes abstract
ZfN-1946-1-0121 content becomes abstract
ZfN-1946-1-0125 OCR problem in title (special character)
part of content in abstract
ZfN-1946-1-0131 OCR problem in title
Mattauch -Herzog'schen => M attauch -H e r z o gsehen
ZfN-1946-1-0146 part of abstract missing
ZfN-1946-1-0151 content becomes abstract
Testing from 22.03.2012
ZNC-1988-43c-0001 Keywords are in abstract Very similar layout; may improve as we reach this year proper later since we will have accumulated data corresponding to the field. CNR
ZNC-1988-43c-0019 part of keyword is in dedication Very similar layout, difficult to differentiate
ZNC-1988-43c-0029 Keywords are in abstract idem; keywords and abstract have very similar format
ZNC-1988-43c-0074 Author gets additional affiliation Would go in the training data - checking this with Patrice
ZNC-1988-43c-0099 Department info gets lost Would go in the training data - checking this with Patrice
ZNC-1988-43c-0126 Subtitle is added to title No model for subtitles; Impossible to do any better there
ZNC-1988-43c-0133 special character problem in title, wrong parsing of keywords comes from OCR, goes in training data
ZNC-1988-43c-0155 Subtitle is added to title cf. above
ZNC-1988-43c-0167 All affiliations are listed as creators (not child from author) This is manageable (default rule when all affiliation are applied to all authors)
ZNC-1988-43c-0173 Keywords are in abstract cf. above, very difficult case
ZNC-1988-43c-0177 All affiliations are listed as creators (not child from author)
Authors become affiliations
To be checked on our side whether we can reduce such problems
ZNC-1988-43c-0199 One author becomes two Already identified issue with previous volumes; contrib to training data CNR
ZNC-1988-43c-0213 special character problem in title OCR issue
ZNC-1988-43c-0231 author gets wrong affiliation
zfn becomes affiliation
affiliations are listed as authors
Grobid gets desynch; candidate for training data
ZNC-1988-43c-0269 Title is missing Grobid mistakes (happens...); training data
ZNC-1988-43c-0285 address becomes author Interesting! Training data CNR
ZNC-1988-43c-0337 affiliations are listed as authors
abstract is missing
Difficult case (organisation looks like a name)
ZNC-1988-43c-0363 Institute info is missing
ZNC-1988-43c-0370 Abstract is missing
ZNC-1988-43c-0397 Subtitle is added to title
ZNC-1988-43c-0403 Abstract is missing
ZNC-1988-43c-0418 special characters in affiliations are not recognized (in names they are) OCR issues
ZNC-1988-43c-0431 Keywords are in abstract
ZNC-1988-43c-0438 One author becomes two Training data
ZNC-1988-43c-0443 Affiliation gets lost, author gets wrong affiliation
ZNC-1988-43c-0449 Authors get wrong affiliations
ZNC-1988-43c-0463 Author gets lost
abstract is missing
ZNC-1988-43c-0467 Abstract is missing
ZNC-1988-43c-0479 Keywords in abstract
ZNC-1988-43c-0505 Keywords in abstract
ZNC-1988-43c-0511 Abstract is missing
ZNC-1988-43c-0515 Keywords in abstract
ZNC-1988-43c-0519 Title and subtitle are merged
Abstract is missing
ZNC-1988-43c-0529 Keywords in abstract CNR
ZNC-1988-43c-0545 Institution and department info are lost
ZNC-1988-43c-0554 Institution and department info are lost
Keywords in abstract
ZNC-1988-43c-0563 Keywords in abstract
ZNC-1988-43c-0577 OCR problem in title
ZNC-1988-43c-0601 Title is missing
Author name is title
OCR problem in author name
ZNC-1988-43c-0609 Institution and department info are lost
Keywords in abstract
ZNC-1988-43c-0613 Subtitle is added to title
Abstract is missing
ZNC-1988-43c-0636 Address is added to affiliation name
OCR problem in author name
ZNC-1988-43c-0665 Start of abstract is in keywords CNR
ZNC-1988-43c-0709 Institution and department info are lost
ZNC-1988-43c-0717 Institution and department info are lost
Keywords in abstract
ZNC-1988-43c-0731 Keywords in abstract
ZNC-1988-43c-0765 author affiliation mix up difficult data
ZNC-1988-43c-0769 Keywords in abstract
ZNC-1988-43c-0777 Start of abstract is in keywords
ZNC-1988-43c-0782 Relation between author and affiliation gets lost
Keywords in abstract
ZNC-1988-43c-0795 Relation between author and affiliation gets lost
ZNC-1988-43c-0799 Keywords in abstract
ZNC-1988-43c-0823 Department info in address line
ZNC-1988-43c-0850 Keywords in abstract
ZNC-1988-43c-0857 OCR problem in title
ZNC-1988-43c-0857 affiliations merged
Keywords in abstract
difficult data
ZNC-1988-43c-0893 subtitle becomes author CNR
ZNC-1988-43c-0903 Keywords in abstract
ZNC-1988-43c-0908 Abstract is missing
ZNC-1988-43c-0918 Abstract is missing
ZNC-1988-43c-0938 Department info is missing
ZNC-1988-43c-0955 Keywords in abstract
ZfN-1946-1-0151 content becomes abstract
Testing from 29.03.2012 / Reihe A, Volume 2 (1947)
ZNA-1947-2a-0154 Abstract is missing
ZNA-1947-2a-0159 ok
ZNA-1947-2a-0163 ok
ZNA-1947-2a-0167 ok
ZNA-1947-2a-0171 PDF file is missing, or has the same content as ZNA-1947-2a-0173
ZNA-1947-2a-0173 <affiliation><orgName type="department" key="dep1">Institut für physikalische Chemie und Elektrochemie</orgName><orgName type="department" key="dep2">Kaiser-Wilhelm-Institut für physikalische Chemie und Elektrochemie</orgName><orgName type="institution">Technischen Universität Berlin-Charlottenburg</orgName><address><settlement>Berlin-Dahlem</settlement></address></affiliation> /// Assignment????
ZNA-1947-2a-0177_b ZNA-1947-2a-0175_b.header.tei.xml and ZNA-1947-2a-0177_b.pdf /// File names do not match.
ZNA-1947-2a-0184_n <date type="published" when="10471"/> /// OCR
ZNA-1947-2a-0185 ok
ZNA-1947-2a-0202 <publicationStmt>unknown</publicationStmt> imprint incomplete, abstract missing
ZNA-1947-2a-0216
<head>Abstract</head>

A ls "Neue Sterne" oder "Novae" werden

/// There is no abstract, only content.
ZNA-1947-2a-0217.header.tei /// IDENTICAL TO ZNA-1947-2a-0219.header.tei
ZNA-1947-2a-0219 ok
ZNA-1947-2a-0226 ok
<title level="a" type="main">dereinschalten</title> author missing, imprint incomplete /// Wrong title (Gasballastpumpen)
ZNA-1947-2a-0238_n <date type="published" when="1946"/> /// Wrong title (Gasballastpumpen), imprint incomplete
ZNA-1947-2a-0239_n <orgName type="institution">Unterharzer Berg</orgName><author><persName><forename type="first">Hüttenwerke</forename><forename type="middle">G m b H</forename><roleName>Goslar</roleName></persName> imprint incomplete
ZNA-1947-2a-0241 ok


Red.png : Blocker

Yellow.png : Nice to fix
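
Several rows in the table above (e.g. ZNC-1988-43c-0261 and ZNC-1988-43c-0287) report one author being split in two at a name particle such as "von". The following is a minimal sketch of a post-processing check that re-joins such fragments. The function name `merge_particle_splits` and the particle list are illustrative assumptions and not part of the Grobid pipeline; a genuine author whose name starts with a particle would be merged wrongly, so the result should be treated as a candidate for manual review.

```python
# Hypothetical helper: re-join author fragments split at a name particle.
# The particle list is an assumption and deliberately short.
PARTICLES = {"von", "van", "de", "zu"}


def merge_particle_splits(authors):
    """Return a new author list where fragments split at a particle are merged.

    A fragment is merged into the previous entry when the previous entry
    ends with a lowercase particle ("Leopold von" + "Proff") or when the
    fragment itself starts with one ("Helmut Schipp" + "von Branitz").
    """
    merged = []
    for name in authors:
        tokens = name.split()
        if not tokens:
            continue  # skip empty entries
        previous = merged[-1].split() if merged else []
        previous_ends_with_particle = bool(previous) and previous[-1] in PARTICLES
        starts_with_particle = tokens[0] in PARTICLES
        if merged and (previous_ends_with_particle or starts_with_particle):
            merged[-1] = merged[-1] + " " + " ".join(tokens)
        else:
            merged.append(" ".join(tokens))
    return merged


case_0287 = merge_particle_splits(["Leopold von", "Proff"])
case_0261 = merge_particle_splits(["Helmut Schipp", "von Branitz"])
```

The comparison is case-sensitive on purpose: a capitalised "Von" (as in the forename reported for ZNB-1947-2b-0001) is not treated as a particle.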

OCR Problems

  • special character problems with:
    • Greek symbols like α, β etc.
    • superscript numbers, e.g. 2e² => 2e2
    • mathematical signs like ∞
    • diacritical signs like é, ă
    • special Latin letters like Å
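
Some of the OCR errors reported above follow recurring character confusions (Gülz => Giilz, Schübel => Schiibel, anisum => anisutn, Verbindungen => Verbinclungen). As a rough sketch, such confusions could be used to propose candidate corrections for manual review. The confusion list below is taken only from examples on this page, and the names `OCR_CONFUSIONS` and `candidate_corrections` are hypothetical; this is not how the OCR or Grobid actually works.

```python
# (OCR output, probable intended text) pairs seen in the reports on this page.
OCR_CONFUSIONS = [
    ("ii", "ü"),   # Giilz -> Gülz, Schiibel -> Schübel
    ("tn", "m"),   # anisutn -> anisum
    ("cl", "d"),   # Verbinclungen -> Verbindungen
]


def candidate_corrections(word):
    """Return the set of words obtained by undoing one known confusion.

    Each occurrence of a confusion pattern is replaced on its own, so the
    result contains one candidate per (pattern, position) match. The
    candidates are suggestions only and need checking against the PDF.
    """
    candidates = set()
    for wrong, right in OCR_CONFUSIONS:
        start = word.find(wrong)
        while start != -1:
            candidates.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return candidates


for ocr_word in ["Giilz", "Schiibel", "anisutn", "Verbinclungen"]:
    print(ocr_word, "->", sorted(candidate_corrections(ocr_word)))
```

Words that legitimately contain one of the patterns will also yield candidates, so the output is only useful as a review list, not as an automatic fix.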

pdf/a_1b Problem

In a project meeting it was decided that we use the PDF/A-1b format, but the delivered files do not validate as PDF/A-1b:

Validating file "ZNA-1948-3a-0434.pdf" for conformance level pdfa-1b
The value of the key N is 4 but must be 3.
The document does not conform to the requested standard.
The document doesn't conform to the PDF reference (missing required entries, wrong value types, etc.).
    • validation report from validatepdf.com:
<issues>
<colorSpace>
<problem severity="error" clause="6.2.2" standard="pdfa">OutputIntent object has an incorrect parameter or invalid color profile</problem>
</colorSpace>
<metadata>
<problem severity="warning" objectID="30" clause=" TN0009" standard="pdfa">Predefined XMP property 'amd' of the schema 'pdfaid' should not be defined in custom extension schemas</problem>
<problem severity="warning" objectID="30" clause=" TN0009" standard="pdfa">Predefined XMP property 'conformance' of the schema 'pdfaid' should not be defined in custom extension schemas</problem>
<problem severity="warning" objectID="30" clause=" TN0009" standard="pdfa">Predefined XMP property 'part' of the schema 'pdfaid' should not be defined in custom extension schemas</problem>
<problem severity="warning" objectID="30" clause=" TN0009" standard="pdfa">Predefined XMP property 'InstanceID' of the schema 'xmpMM' should not be defined in custom extension schemas</problem>
</metadata>
</issues>
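
To compare reports across many files, the XML report above could be summarised by severity and clause. The sketch below uses Python's standard `xml.etree.ElementTree` on an excerpt of that report (the error and the first two warnings). The function name `summarize_report` is illustrative, and the assumption that all reports from the service share this `<issues>` layout has not been verified.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Excerpt of the report quoted above (one error, first two warnings).
REPORT = """<issues>
<colorSpace>
<problem severity="error" clause="6.2.2" standard="pdfa">OutputIntent object has an incorrect parameter or invalid color profile</problem>
</colorSpace>
<metadata>
<problem severity="warning" objectID="30" clause=" TN0009" standard="pdfa">Predefined XMP property 'amd' of the schema 'pdfaid' should not be defined in custom extension schemas</problem>
<problem severity="warning" objectID="30" clause=" TN0009" standard="pdfa">Predefined XMP property 'conformance' of the schema 'pdfaid' should not be defined in custom extension schemas</problem>
</metadata>
</issues>"""


def summarize_report(xml_text):
    """Return one dict per <problem>, tagged with its parent category."""
    root = ET.fromstring(xml_text)
    problems = []
    for category in root:  # e.g. <colorSpace>, <metadata>
        for problem in category.findall("problem"):
            problems.append({
                "category": category.tag,
                "severity": problem.get("severity"),
                # clause values in the report carry stray leading spaces
                "clause": (problem.get("clause") or "").strip(),
                "message": (problem.text or "").strip(),
            })
    return problems


problems = summarize_report(REPORT)
severity_counts = Counter(p["severity"] for p in problems)
for p in problems:
    print(p["severity"], p["category"], p["clause"], p["message"])
```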

pdf/a_1b result (mail from Dr. rer. nat. Helge Knüttel, UR - Universität Regensburg)

The question here is: how do I decide whether a given file conforms to the standard?

We create the PDF/A files with Adobe Acrobat Pro 9 and, where necessary, 10. We establish PDF/A conformance by checking and, if needed, converting the files with the Preflight tool built into Acrobat. This tool comes from the company callas (see e.g. http://www.callassoftware.com/callas/doku.php/en:news:press:20110706).

A renewed check of the file you named (I took it from the ftp server) with Acrobat 9.5 as well as with the current callas pdfaPilot 3 revealed no problems regarding conformance with PDF/A-1b. You are welcome to reproduce this; pdfaPilot is available for download as a trial version: http://www.callassoftware.com/callas/doku.php/de:download

In general, the PDF/A standards are complex, and so is the validation of files for conformance with the standard (cf. http://www.pdfa.org/2011/08/validating-pdfa/). There are various software packages for conformance checking. But which software should one trust?

What is missing is a reference implementation, i.e. software approved by the standards body that checks files for conformance reliably, correctly, and completely. There is a competent and extensive test of various software from 2009 (http://www.pdflib.com/fileadmin/pdflib/pdf/pdfa/2009-05-04-Bavaria-report-on-PDFA-validation-accuracy.pdf), including predecessor versions of the software mentioned here. Acrobat/callas do not fare badly in it. No program is perfect. The most common problem with Acrobat, callas, Solid Documents, and PDF Tools is false alarms, i.e. violations of the standard are reported that do not actually exist.

Of the available options, the callas software does not seem the worst choice to us. It evidently has a certain market presence, and it is not without reason that Adobe (the creators of PDF!) integrates it into the Acrobat software. callas is one of the members of the PDF/A Competence Center and actively contributed to the creation of various PDF standards, including the PDF/A ISO standards.

On the other hand: today I also tested the file myself with the online validation from pdf-tools.com that you mentioned and likewise received the error message. It is not very helpful, however, because it does not pinpoint the problem it found. So one (or at least I) cannot really remedy it.

The validation report from validatepdf.com that you quoted flags an error in the OutputIntent but leaves open what exactly the problem is. It also issues warnings about the XMP metadata. The XMP metadata are not flagged by the other programs, however, including the XMP validator for PDF/A-1 from PDFlib (http://www.pdflib.com/de/knowledge-base/xmp-metadaten/kostenloser-xmp-validator/).

Conclusion: Some software confirms conformance with the standard, while other software reports errors or warnings. None of the alleged defects is found by a second piece of software. This matches the picture from the Bavaria report cited above. In substance, the defects, if they exist at all, do not seem serious to me.

What is the actual goal of using PDF/A in this project?

We proposed delivery as PDF/A in order to give you the greatest possible assurance of the long-term availability of the data. Our PDF files are quite simply structured anyway and would give little cause for concern even without PDF/A. Absolute future-proofing does not yet exist in digital long-term archiving. We are confident, however, that the files delivered to you, which are PDF/A-conformant according to Acrobat/callas, are comparatively very future-proof.

I consider PDF/A conformance according to Acrobat/callas to be sufficient. I would be glad if you could agree to this.

(The MPDL project management (Malte Dreyer) agrees and, against this background, is happy to accept the delivery.)
