Digitization Guidelines
Revision as of 09:27, 1 April 2008
* work in progress -- Andreas Gros 11:06, 1 April 2008 (CEST)
This page will soon contain a relevant subset of the DFG guidelines for digitization[1] and, in addition, a description of the workflow for digitizing documents and submitting them to the digital collections maintained by the MPDL.
Scanning
Resolution and Image Quality
For scanning greyscale or color prints, a minimum resolution of 300 dpi is suggested. Documents containing handwriting, or maps with fine lines and small lettering, may require a scan resolution of up to 400 dpi. Bitonal scans require 600 dpi.
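To make the resolutions above concrete, the following minimal sketch converts a physical page size into pixel dimensions. The function name `pixel_dimensions` and the A4 page size are illustrative assumptions, not part of the guidelines.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm: float, height_mm: float, dpi: int) -> tuple:
    """Return the (width, height) in pixels of a scan at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return (width_px, height_px)


# An A4 page (210 mm x 297 mm, an assumed example size) at the
# resolutions named in the text.
for dpi in (300, 400, 600):
    print(dpi, "dpi:", pixel_dimensions(210, 297, dpi))
```

Doubling the resolution doubles both pixel dimensions and therefore quadruples the number of pixels per page.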
Color depth
Bitonal (black-and-white) scans are generated with a color depth of 1 bit per pixel (2 levels). Greyscale images are digitized with 256 levels (8 bits = 1 byte) per pixel. Color images use 3 color channels (red, green, blue) with 1 byte per channel (3 bytes = 24 bits per pixel), allowing 256 x 256 x 256 ≈ 16.7 million distinct colors. 24 bits provide sufficient color depth for color scans. Scanning with 48-bit color depth makes sense only in the few cases where images need to be corrected or reworked after scanning.
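The color depths above translate directly into storage requirements. The sketch below estimates the raw pixel data size of an uncompressed image, assuming each row is padded to a whole number of bytes and ignoring file headers and metadata. The function name `uncompressed_bytes` and the pixel dimensions (an A4 page at 300 and 600 dpi) are illustrative assumptions.

```python
def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Estimate the raw pixel data size of an uncompressed image in bytes.

    Assumption: every row is padded up to a whole number of bytes;
    headers and metadata are not counted.
    """
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Number of distinct colors with 3 channels of 256 levels each.
print("colors at 24 bit:", 256 ** 3)

# Assumed example: an A4 page is about 2480 x 3508 px at 300 dpi
# and about 4961 x 7016 px at 600 dpi.
print("bitonal, 600 dpi:  ", uncompressed_bytes(4961, 7016, 1), "bytes")
print("greyscale, 300 dpi:", uncompressed_bytes(2480, 3508, 8), "bytes")
print("color, 300 dpi:    ", uncompressed_bytes(2480, 3508, 24), "bytes")
print("color 48 bit:      ", uncompressed_bytes(2480, 3508, 48), "bytes")
```

Under these assumptions a 24-bit color page needs roughly 26 MB, and 48-bit scanning doubles that, which is one reason to reserve it for images that will be reworked.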
File formats
Master images of greyscale or color scans should be stored as uncompressed TIFF. For bitonal images, TIFF with Group 4 compression can be used. In the future, JPEG2000 could serve as an alternative to uncompressed TIFF. Currently, however, too few software tools are available to make JPEG2000 a feasible format for archiving master images.
Fulltext digitization
The machine-readable form of a text has to be provided in either ASCII or Latin-1 (ISO 8859-1), or in Unicode (either UTF-8 or UTF-16 with a Byte Order Mark (BOM), see http://www.unicode.org/unicode/faq/utf_bom.html). Fulltext can be created from digital master files by Optical Character Recognition (OCR) or by manual transcription. Current OCR software, however, is suitable only for printed texts, and only as far back as texts set in more recent roman or Gothic type and produced on automated printing presses (from approx. 1850 onward). For older texts, manual transcription is currently the only feasible means of producing fulltext.
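As an illustration of the encoding requirement, the following sketch checks a delivered text file for a BOM and decodes it accordingly. The function names `detect_bom` and `decode_fulltext` are hypothetical, and the fallback order (UTF-8 first, then Latin-1) is an assumption rather than something the guidelines prescribe.

```python
import codecs


def detect_bom(data: bytes):
    """Return a Python codec name if the data starts with a known BOM, else None.

    Only UTF-8 and UTF-16 are checked, since those are the Unicode
    encodings named in the text.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # uses the BOM to pick the byte order
    return None


def decode_fulltext(data: bytes) -> str:
    """Decode a delivered fulltext file (assumed fallback: UTF-8, then Latin-1)."""
    encoding = detect_bom(data)
    if encoding is not None:
        return data.decode(encoding)
    try:
        # Plain ASCII is also valid UTF-8, so this covers both cases.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return data.decode("latin-1")


sample = "Grüße".encode("utf-16")  # Python writes a BOM for "utf-16"
print(detect_bom(sample), decode_fulltext(sample))
```

Because Latin-1 accepts any byte sequence, the last fallback cannot detect a wrong encoding; it only guarantees that decoding succeeds.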
Manual transcription can be done by single-key or double-key processing. In double-key processing, the same master file is transcribed twice, and the differences between the two transcriptions are identified and eliminated by an automated procedure. The double-key procedure achieves an accuracy of up to 99.997%. Offers from service providers with an accuracy of only 99.5% or less are not acceptable.
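The text does not specify the automated comparison procedure. As one possible sketch, the standard-library `difflib` module can list the places where two independent transcriptions disagree, and a small helper shows what the quoted accuracy figures mean in absolute errors. The function names `key_differences` and `allowed_errors` are illustrative assumptions.

```python
from difflib import SequenceMatcher


def key_differences(first: str, second: str) -> list:
    """List the spans where two transcriptions of the same page disagree.

    Each entry is (operation, text_in_first, text_in_second).
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def allowed_errors(char_count: int, accuracy_percent: float) -> float:
    """Return the number of wrong characters implied by an accuracy figure."""
    return char_count * (100 - accuracy_percent) / 100


# Two hypothetical keyings of the same phrase; the second contains a typo.
print(key_differences("Gothic print", "Gothic prlnt"))

# Errors per 100,000 characters at the two accuracy levels from the text.
print(round(allowed_errors(100_000, 99.997)))
print(round(allowed_errors(100_000, 99.5)))
```

At 99.997% accuracy a text of 100,000 characters contains about 3 errors, whereas 99.5% allows about 500, which illustrates why the lower figure is rejected. A comparison like this only flags disagreements; deciding which keying is correct still requires a rule or a human check.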
Metadata requirements
* to be continued