Digitization Guidelines

Work in progress (14:41, 5 August 2008)

TO DO: Content has to be checked and updated (will be part of the Digitization Lifecycle Project)!

In the following you will soon find a relevant subset of the DFG guidelines for digitization^[1] (new version) and, in addition, a description of a workflow for digitizing and submitting digital documents for digital collections maintained by the MPDL.

Scanning

Resolution and Image Quality

For scanning greyscale or color prints, a minimum resolution of 300 dpi is recommended. Documents containing handwriting, or maps with fine lines and small lettering, might require a scan resolution of up to 400 dpi. Bitonal scans require 600 dpi.
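To get a feeling for what these resolutions mean in practice, the following minimal sketch converts a physical page size into pixel dimensions. The function name `pixel_dimensions` and the choice of an A4 page (210 mm x 297 mm) are assumptions made for illustration; they are not part of the guidelines.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm, height_mm, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return (width_px, height_px)


if __name__ == "__main__":
    # A4 page at the three resolutions mentioned in the text
    for dpi in (300, 400, 600):
        print(dpi, "dpi:", pixel_dimensions(210, 297, dpi))
```

At 300 dpi an A4 page comes out at 2480 x 3508 pixels; doubling the resolution to 600 dpi roughly quadruples the number of pixels.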

Color depth

Bitonal scans (b/w) are generated with a color depth of 2 levels (1 bit) per pixel. Greyscale images are digitized with 256 levels (8 bits = 1 byte) per pixel. Color images use 3 color channels (red, green, blue) with 1 byte per channel (3 bytes = 24 bits per pixel), enabling 256 x 256 x 256 = approx. 16.7 million colors. A color depth of 24 bits is sufficient for color scans. Scanning with 48-bit color depth makes sense only in the few cases where images need to be corrected or reworked after the scanning process.
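The arithmetic above can be turned into a rough estimate of how much storage the raw pixel data of an uncompressed master image needs. This is only a sketch: the names `BITS_PER_PIXEL` and `uncompressed_bytes` are hypothetical, and the estimate ignores file headers, metadata and any row padding a real image file would add.

```python
# Bits per pixel for the color depths described in the text
BITS_PER_PIXEL = {
    "bitonal": 1,     # 2 levels
    "greyscale": 8,   # 256 levels
    "color": 24,      # 3 channels x 8 bits
    "color48": 48,    # 3 channels x 16 bits
}


def uncompressed_bytes(width_px, height_px, mode):
    """Estimate the raw pixel data size in bytes (headers and padding ignored)."""
    bits = width_px * height_px * BITS_PER_PIXEL[mode]
    return (bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    # Number of distinct colors with 24 bits per pixel
    print(256 ** 3)
    # A 2480 x 3508 pixel image (roughly A4 at 300 dpi) in each mode
    for mode in BITS_PER_PIXEL:
        print(mode, uncompressed_bytes(2480, 3508, mode), "bytes")
```

For a 2480 x 3508 pixel image this gives about 8.7 million bytes in greyscale and about 26.1 million bytes in 24-bit color; 48-bit color doubles the latter, which is one reason to reserve it for the cases that need post-processing.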

File formats

Master images of greyscale or color images should be stored as uncompressed TIFF. For bitonal images, TIFF with Group 4 compression can be used. In the future, JPEG2000 could be used as an alternative to uncompressed TIFF. Currently, however, there are not enough software tools available to make JPEG2000 a feasible format for archiving master images. For some text documents the PDF format PDF/A-1a (ISO Standard 19005-1 ^[2]) can be used (see Fulltext digitization below).

Fulltext digitization

The machine-readable form of a text has to be provided in either ASCII (Latin-1) or Unicode (either UTF-8 or UTF-16 with Byte Order Mark (BOM)^[3]). Fulltext can be created from digital master files by Optical Character Recognition (OCR)^[4] or by manual transcription. Current OCR software, however, is suitable only for printed texts, going back as far as texts set in more recent Antiqua or Fraktur typefaces and produced on automated printing presses (from approx. 1850 on). For older texts, manual transcription is the only feasible means of producing fulltext.
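Whether a delivered text file actually starts with a Byte Order Mark can be checked with a few lines of code. The sketch below uses the BOM constants from Python's standard `codecs` module; the function name `detect_bom` is hypothetical, and the check deliberately ignores UTF-32 (whose little-endian BOM begins with the same two bytes as the UTF-16 one), since the guidelines only mention UTF-8 and UTF-16.

```python
import codecs


def detect_bom(data):
    """Return a Python codec name implied by a leading BOM, or None if absent.

    Assumption: only UTF-8 and UTF-16 are expected; UTF-32 is not handled.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # uses the BOM to pick the byte order
    return None


if __name__ == "__main__":
    sample = "Fraktur".encode("utf-16")  # Python's "utf-16" codec writes a BOM
    encoding = detect_bom(sample)
    print(encoding, sample.decode(encoding))
```

A file for which `detect_bom` returns `None` may still be valid UTF-8 or Latin-1; the function only reports whether a BOM is present.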

Manual transcription can be done by single-key or double-key processing. In the latter procedure the same master file is transcribed twice and differences between the two transcriptions are filtered out in an automated procedure. The double-key procedure should result in a precision of up to 99.997%. Offers by service providers with only 99.5% (or less) precision are not acceptable.
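The automated comparison step of the double-key procedure can be sketched with Python's standard `difflib` module. This is an illustration under assumptions, not the procedure any particular service provider uses: the function names and the sample strings are made up. The second function shows what the quoted percentages mean in absolute numbers of wrong characters.

```python
from difflib import SequenceMatcher


def key_differences(first, second):
    """List the spans where two transcriptions of the same page disagree.

    Each entry is (operation, text_in_first, text_in_second).
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def expected_errors(characters, precision_percent):
    """Number of wrong characters to expect at a given precision."""
    return round(characters * (100 - precision_percent) / 100)


if __name__ == "__main__":
    # Hypothetical transcriptions by two typists; the second has one typo
    print(key_differences("Die Buchdruckerkunst", "Die Buchdruckerkumst"))
    print(expected_errors(100_000, 99.997), "vs.", expected_errors(100_000, 99.5))
```

Every reported difference still has to be resolved against the master image, since the comparison alone cannot tell which of the two transcriptions is right. On 100,000 characters, 99.997% corresponds to about 3 wrong characters, 99.5% to about 500.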

If the fulltext is created only to enable fulltext searches in which only positive matches are meaningful, so-called dirty OCR is sufficient. A search in dirty OCR that returns no matches does not necessarily mean that the search term does not occur in the text. The search returns results only if all words/characters of the search expression appear in the original text and were correctly converted into fulltext by OCR. If dirty OCR is used, this should be clearly indicated when the corresponding document is presented to users, so that they can assess the quality of their search results.
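The asymmetry between hits and misses in dirty OCR can be shown with a tiny example. The sample strings and the recognition error (a long s read as "f") are invented for illustration, and `find_term` is a hypothetical helper, not part of any search system.

```python
def find_term(fulltext, term):
    """Case-insensitive substring search, as a stand-in for a fulltext search."""
    return term.lower() in fulltext.lower()


# What is printed on the page, and a hypothetical uncorrected OCR result
original = "Die Geschichte der Buchdruckerkunst"
dirty_ocr = "Die Gefchichte der Buchdruckerkunst"

if __name__ == "__main__":
    # A hit is reliable: the word was recognized correctly
    print(find_term(dirty_ocr, "Buchdruckerkunst"))
    # A miss is not: the word is on the page but was misrecognized
    print(find_term(dirty_ocr, "Geschichte"), find_term(original, "Geschichte"))
```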

In some cases it is important to display the fulltext in the same layout as the original text on the image. Such layout descriptions should use XML-based languages, e.g. XSLT^[5] or XSL-FO^[6], because they are software-independent.

Text documents for which archiving with XML techniques is not feasible should be archived using the container format PDF/A-1a (ISO Standard 19005-1^[7]). This format allows fulltext information to be stored together with a graphical representation of the text. As all fonts used in the visual representation of the text have to be included in a PDF/A-1a document, a complete Unicode representation of the text is possible. If PDF/A-1a cannot be provided, PDF/A-1b is the minimum requirement. In contrast to PDF/A-1a, PDF/A-1b does not require a Unicode representation of the text. You can find more information on the two conformance levels on the website of the PDF/A competence center ^[8]:

http://www.pdfa.org/doku.php?id=artikel:en:improved_pdfa-1b.

For scanning providers it is more difficult to produce PDF/A-1a than PDF/A-1b, because for PDF/A-1a they have to identify the fonts used in the scanned document.

PDF/A Tools

The PDF/A competence center provides a collection of links and descriptions of tools to create and manipulate PDF/A documents:

http://www.pdfa.org/doku.php?id=artikel:en:processing_pdfa_documents&s=tools

Metadata requirements

to be continued

Notes

[1] :Praxisregeln_Digitalisierung_Maerz_2007_DFG.pdf

[2] http://www.iso.org/iso/catalogue_detail?csnumber=38920

[3] http://www.unicode.org/unicode/faq/utf_bom.html

[4] http://en.wikipedia.org/wiki/Optical_character_recognition#References

[5] http://www.w3.org/TR/xslt

[6] http://www.w3schools.com/xslfo/xslfo_intro.asp

[7] http://www.iso.org/iso/catalogue_detail?csnumber=38920

[8] http://www.pdfa.org/doku.php?id=artikel:en:improved_pdfa-1b
