Digitization Guidelines

From MPDLMediaWiki
Revision as of 13:57, 1 April 2008 by Andi (talk | contribs)
* work in progress  -- Andreas Gros 11:06, 1 April 2008 (CEST)

This page will soon provide a relevant subset of the DFG guidelines for digitization[1] and, in addition, a description of a workflow for digitizing documents and submitting them to the digital collections maintained by the MPDL.

Scanning

Resolution and Image Quality

For scanning greyscale or color prints, a minimum resolution of 300 dpi is suggested. Documents containing handwriting, or maps with fine lines and small lettering, might require a scan resolution of up to 400 dpi. Bitonal scans require 600 dpi.
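To get a feeling for what these resolutions mean in practice, the following minimal sketch converts a page size into pixel dimensions. The category names and the A4 example page are assumptions made for illustration; only the dpi values come from the paragraph above.

```python
# Scan resolutions from the paragraph above, in dpi.
# The category names are hypothetical labels chosen for this sketch.
MIN_DPI = {
    "greyscale": 300,
    "color": 300,
    "fine_detail": 400,  # handwriting, maps with fine lines (upper value mentioned)
    "bitonal": 600,
}

MM_PER_INCH = 25.4


def pixel_dimensions(width_mm, height_mm, dpi):
    """Return (width_px, height_px) of a scan at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


# Example: an A4 page (210 mm x 297 mm) at each resolution.
for kind, dpi in MIN_DPI.items():
    print(kind, dpi, pixel_dimensions(210, 297, dpi))
```

An A4 page scanned at 300 dpi comes out at roughly 2480 x 3508 pixels, and at 600 dpi at roughly 4961 x 7016 pixels.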


Color depth

Bitonal scans (b/w) are generated with a color depth of 2 levels (1 bit) per pixel. Greyscale images are digitized with 256 levels (8 bit = 1 byte) per pixel. Color images use 3 color channels (red, green, blue) with 1 byte per channel (= 3 byte = 24 bit) per pixel, which allows 256 x 256 x 256 = approx. 16.7 million different colors. 24 bit provides a sufficiently high color depth for color scans. Scanning with 48 bit color depth makes sense only in the few cases where images need to be corrected or reworked after the scanning process.
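The arithmetic above also determines how large an uncompressed master image becomes. The sketch below is an estimate only: it ignores file headers and any row padding a concrete file format may add, and the pixel dimensions used are an assumed A4 page at 300 dpi.

```python
# Bits per pixel for the three scan types described above.
BITS_PER_PIXEL = {"bitonal": 1, "greyscale": 8, "color": 24}


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image data size in bytes (no headers, no row padding)."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Number of distinct values a 24-bit pixel can take.
colors_24_bit = 2 ** 24

# Assumed example: A4 page at 300 dpi, about 2480 x 3508 pixels.
for kind, bits in BITS_PER_PIXEL.items():
    size = uncompressed_bytes(2480, 3508, bits)
    print(f"{kind}: {size / 1_000_000:.1f} MB")
```

Under these assumptions a 24-bit color scan of a single A4 page takes about 26 MB, three times as much as the same page in 8-bit greyscale.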


File formats

Master images of greyscale or color scans should be stored as "TIFF uncompressed". For bitonal images, TIFF with Group 4 compression can be used. In the future, JPEG2000 could be used as an alternative to "TIFF uncompressed". Unfortunately, there are currently not enough software tools available to make JPEG2000 a feasible format for the archiving of master images.


Fulltext digitization

The machine-readable form of a text has to be provided in either Latin-1 (ISO 8859-1, the 8-bit extension of ASCII) or Unicode (either UTF-8 or UTF-16 with Byte Order Mark (BOM)[2]). Fulltext can be created from the digital master files by Optical Character Recognition (OCR) or by manual transcription. Current OCR software, however, is suitable only for printed texts, and only as far back as the more recent roman type or Gothic (blackletter) print produced on automated printing presses (from approx. 1850 onwards). For older texts, manual transcription is the only feasible means of producing fulltext.
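As an illustration of the encoding requirement, the following minimal sketch encodes a text as UTF-8 or as UTF-16 with a leading BOM, and checks which BOM, if any, a byte string starts with. The function names are hypothetical, and the sketch assumes that the BOM is wanted for UTF-16 only; UTF-32 is not considered.

```python
import codecs


def encode_fulltext(text, encoding="utf-8"):
    """Encode text as UTF-8 (no BOM) or UTF-16 (with BOM)."""
    if encoding == "utf-8":
        return text.encode("utf-8")
    if encoding == "utf-16":
        # Python's "utf-16" codec prepends a BOM in native byte order.
        return text.encode("utf-16")
    raise ValueError(f"unsupported encoding: {encoding}")


def detect_bom(data):
    """Return a codec name based on a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


sample = encode_fulltext("Größe", "utf-16")
print(detect_bom(sample))       # codec name derived from the BOM
print(sample.decode("utf-16"))  # the BOM tells the decoder the byte order
```

Without a BOM, a reader of a UTF-16 file has to guess the byte order, which is presumably why the guidelines ask for one.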

Manual transcription can be done by single-key or double-key processing. In the latter procedure, the same master file is transcribed twice, and differences between the two transcriptions are detected by an automated comparison so that they can be corrected. The double-key procedure should achieve an accuracy of up to 99.997%. Offers from service providers that promise only 99.5% accuracy (or less) are not acceptable.
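The automated comparison step of double-key processing can be sketched with Python's standard difflib module. This is an illustration under assumptions, not the procedure used by any particular service provider; the function names are hypothetical. The second function translates the percentages quoted above into an expected number of wrong characters.

```python
import difflib


def key_differences(first, second):
    """List (position, text_in_first, text_in_second) for each mismatch."""
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    return [
        (i1, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def expected_errors(percent_correct, characters):
    """Expected number of wrong characters for a given percentage."""
    return characters * (100 - percent_correct) / 100


# Two hypothetical transcriptions of the same word.
print(key_differences("Digitization", "Digitisation"))

# Wrong characters per 100,000 characters at both percentages.
print(round(expected_errors(99.997, 100_000)))
print(round(expected_errors(99.5, 100_000)))
```

At 99.997% a text of 100,000 characters contains about 3 wrong characters, at 99.5% about 500, which makes the gap between the two figures easier to appreciate.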

If the fulltext is created only to enable fulltext searches in which just positive matches are meaningful, so-called dirty OCR is sufficient. A search in dirty OCR that returns no matches does not necessarily mean that the search term is absent from the text. The search returns results only if all words/characters of the search expression appear in the original text and were correctly converted into fulltext by OCR. If dirty OCR is used, this should be clearly indicated when the corresponding document is presented to users, so that they can assess the quality of their search results.
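The behaviour of a search over dirty OCR can be illustrated with a small sketch. The sample sentence and the misrecognition of "m" as "rn" are invented for this example, and the function names are hypothetical.

```python
def search_dirty_ocr(term, ocr_text):
    """Case-insensitive substring search over OCR output."""
    return term.lower() in ocr_text.lower()


def describe_result(term, ocr_text):
    """Phrase a search result so that a miss is not read as proof of absence."""
    if search_dirty_ocr(term, ocr_text):
        return "match"
    return "no match in OCR text (the term may still occur in the original)"


# Invented example: the OCR step misread "m" as "rn" in one word.
original = "The archive holds digitized manuscripts."
dirty_ocr = "The archive holds digitized rnanuscripts."

print(describe_result("archive", dirty_ocr))
print(describe_result("manuscripts", dirty_ocr))
```

The word "manuscripts" is present in the original but is not found in the OCR output, which is exactly the kind of false negative users need to be warned about.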

In some cases it is important to display the fulltext with the same layout in which the original text appears on the image. Such layout descriptions should use XML-based technologies, e.g. XSLT or XSL-FO, because they are software-independent.
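As a rough idea of what an XML layout description can look like, the sketch below builds a single absolutely positioned XSL-FO block with Python's standard library. It produces a fragment only, not a complete XSL-FO document; the helper name, the sample text and the coordinates are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of XSL Formatting Objects.
FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def positioned_block(text, left_mm, top_mm):
    """Build an fo:block-container placed at a fixed position on the page."""
    container = ET.Element(
        f"{{{FO_NS}}}block-container",
        {
            "absolute-position": "absolute",
            "left": f"{left_mm}mm",
            "top": f"{top_mm}mm",
        },
    )
    block = ET.SubElement(container, f"{{{FO_NS}}}block")
    block.text = text
    return container


# Hypothetical text line located 25 mm from the left and 30 mm from the top.
element = positioned_block("Chapter 1", 25, 30)
print(ET.tostring(element, encoding="unicode"))
```

Because the result is plain XML, any XSL-FO processor can render it, independent of the software that produced the fulltext.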


Metadata requirements

* to be continued



Notes