5.0 Best Practices for Optical Character Recognition

Optical character recognition (OCR) is the process of using specialized software to
convert scanned images of text to electronic text so that the digitized texts can be searched,
indexed, and retrieved. The recommended software for OCR creation is ABBYY FineReader;
however, Adobe Acrobat can produce high-quality OCR for clear, crisp, and structurally
uncomplicated texts in a variety of languages.
Table of contents
5.1 Uses of OCR output
5.2 Functionality of OCR Software
5.3 Output File Formats
5.4 Factors Affecting Accuracy of OCR
5.5 OCR Correction and Rekeying
5.1 Uses of OCR output
There are several uses for the output of the optical character recognition process:¹
• Indexing: OCR text can be output to a text file that is then imported into a search engine
and used as the basis for full-text searching. This can be an effective use of the output
even if it is not 100% accurate.
• Full-text retrieval: Search engine results are displayed with hit highlighting within the
page image displayed.
• Full-text representation: The user is presented with a text file as a representation of the
actual document. In this case, accuracy of the OCR is critical; projects using OCR output
for this purpose generally require that the output has been spell-checked and errors
corrected.
• Full-text representation with XML mark-up: The OCR output is presented to the user
with layout, structure, and/or metadata added through XML mark-up. This presentation
requires significant human intervention.
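As a rough illustration of the indexing use described above, the following Python sketch builds a tiny inverted index from OCR text and runs keyword searches against it. The page identifiers, sample text, and function names are assumptions made for illustration only; a real search engine does far more (ranking, stemming, fuzzy matching). One word in the sample is deliberately misread to show why imperfect OCR output can still support searching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, query):
    """Return the page ids that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical OCR output; "recogn1tion" simulates a misread character.
pages = {
    "page-001": "Optical character recogn1tion converts scanned images",
    "page-002": "Scanned images of text can be searched and indexed",
}
index = build_index(pages)

print(search(index, "scanned images"))
print(search(index, "recognition"))
```

In this sample, a search for "scanned images" still finds both pages, while a search for "recognition" finds nothing because that one word was misread on page-001.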
5.2 Functionality of OCR Software
Text must be scanned or digitally photographed and saved as either an image file or a PDF
(Portable Document Format) file prior to running it through an OCR program. OCR software
converts the patterns of light and dark found in a digital image of a page of text into text
characters and saves them in a format that computers can search or index, such as Unicode or
ASCII. OCR software generally employs a wide variety of language dictionaries. ABBYY
FineReader v. 10 can read over 180 languages, as well as common programming languages,
numbers, and simple chemical formulae. ABBYY FineReader can read multilingual documents,
and includes dictionaries and spell-checking capabilities for 39 languages. Adobe Acrobat 9 can
read over 40 languages and has some basic spell-checking capabilities. Generally, operators
select the language(s) of the document from a drop-down list prior to initiating the OCR process.
¹ Tanner, Simon. Deciding Whether Optical Character Recognition is Feasible. KDCS, 2004.
(http://www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf)
“Reading” of digitized text is the primary function of OCR software; however, some OCR
programs (e.g., ABBYY FineReader) have other functions aimed at either improving the accuracy
of the OCR results or speeding up the OCR process. These functions include:
• Structural analysis to first identify structural features of the document, such as text
orientation, headings, images, tables, captions, and paragraphs. Structural analysis can
both improve OCR accuracy and enable the software to reproduce the original
text formatting if the output is a Word document.
• Basic image processing tasks such as straightening, deskewing, despeckling, cropping,
and rotating that can improve OCR accuracy.
• A training mode to “teach” the software to recognize decorative fonts and unusual
characters.
• Spell-checking and editing functions that prompt the operator to review questionable
spellings and, if necessary, correct spelling errors.
• Batch processing to automate repeated tasks and scheduling functions to set the
software to run automated tasks at night or at other times when the computer’s
processing load is lighter.
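To make one of the image processing tasks listed above more concrete, the following Python sketch shows a very simple form of despeckling on a black-and-white image. It is an illustrative assumption, not a description of how ABBYY FineReader or Adobe Acrobat work internally: the image is modeled as a list of rows of 0 (background) and 1 (dark pixel), and the function name and sample data are hypothetical.

```python
def despeckle(image):
    """Return a copy of a binary image with isolated dark pixels cleared.

    `image` is a list of rows; 1 is a dark pixel and 0 is background.
    A dark pixel is treated as a speck when none of its eight
    neighbors is dark.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_dark_neighbor = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1:
                        has_dark_neighbor = True
            if not has_dark_neighbor:
                cleaned[y][x] = 0
    return cleaned


# Hypothetical scan fragment: a small stroke plus one stray pixel
# in the bottom-right corner.
scan = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
cleaned = despeckle(scan)
```

A rule this simple also removes any legitimate mark that happens to be a single pixel, which is one reason such cleanup is normally left to the OCR software's own tools.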
5.3 Output File Formats
Optical character recognition should be performed for printed textual materials to make the
digitized version easier to search and access.
• For scanned text resources that will be converted and made available to end-users in PDF,
OCR should be done as part of the conversion to PDF to place text behind the page images and
make individual documents searchable when viewed in Acrobat Reader. For searchable PDFs, the
final format should be text-under-images. When saving OCR output as a searchable PDF, the
option to “enable tagged PDF” should always be checked. Tagged PDF is a version of PDF
that provides structure and order information to allow PDF documents to be read by screen
readers used by persons with disabilities.
• Separate text files of OCR output need only be generated for members of a collection of
text resources when it is anticipated that indexing or search and discovery across the
collection as a whole will be required. Notes should be made of the software used to do the
OCR; these notes should include values for all settings used and any training set used for
the OCR.
• Separate XML files of OCR output need only be generated for a collection of text resources
when it is anticipated that structured text will be required for that collection. Notes should
be made as in the preceding case.
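The processing notes called for above could be kept in a small machine-readable record stored alongside the OCR output. The Python sketch below is one possible shape for such a record; the class name, field names, and sample values are assumptions chosen for illustration, not a schema prescribed by this document.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class OcrProcessingNote:
    """Hypothetical record of how OCR was produced for a collection."""
    software: str
    version: str
    languages: list
    settings: dict = field(default_factory=dict)
    training_set: Optional[str] = None  # None when no training set was used


# Sample values are illustrative only.
note = OcrProcessingNote(
    software="ABBYY FineReader",
    version="10",
    languages=["English", "German"],
    settings={
        "output_format": "searchable PDF (text under images)",
        "tagged_pdf": True,
    },
)

# Serialize the note so it can be saved next to the OCR output files.
note_json = json.dumps(asdict(note), indent=2, sort_keys=True)
print(note_json)
```

Recording the settings as key/value pairs keeps the note flexible, since the available options differ from one OCR product and version to another.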
5.4 Factors Affecting Accuracy of OCR
Most commercial software packages claim an OCR accuracy of between 97% and 99%. These
rates are based on character errors, not word errors. Because a single misread character makes
the whole word wrong, a document in which 97% of characters are accurate may have only 75% of
its words spelled correctly. Any of the following factors can also affect the accuracy of the
OCR:
• Textual considerations
o Standard OCR should not be attempted on certain materials. For example, OCR with
default settings should not currently be attempted on most texts published prior to
1850. For some languages (e.g., German) the cutoff date may be even later. Before
trying to create transcriptions for these materials via OCR, detailed analysis and
often experimentation is required to judge trade-offs between custom OCR and
keyboarding options.
o Older and discolored documents must be scanned in RGB mode to capture all the
image data, and to maximize OCR accuracy.
o Low-contrast documents can result in poor OCR.
o Typescript results in poorer OCR than printed type; inconsistent use of font faces
and sizes can lower OCR accuracy.
o Font sizes below 6 points in the original can limit OCR accuracy, although increasing
the resolution of the scanned image to 600 dpi and using greyscale may improve OCR
output.
o Handwritten documents cannot be recognized with any degree of accuracy.
• Scanning considerations that affect the accuracy of OCR include:
o The recommended best scanning resolution for OCR accuracy is 300 dpi. Higher
resolutions do not necessarily result in better accuracy and can slow down OCR
processing time. Resolutions below 300 dpi may affect the quality and accuracy of
OCR results.
o Brightness settings that are too high or too low may adversely affect OCR accuracy.
A medium brightness value of 50% will be suitable in most cases.
o Straightness of the initial scan can affect OCR quality; crooked lines of text produce
poor results.
o Image enhancements, such as contrast adjustment and unsharp mask, have NOT
been shown to significantly enhance the accuracy of OCR.²
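The gap between character accuracy and word accuracy noted at the start of this section can be illustrated with a short calculation. The Python sketch below assumes that character errors are independent of one another, which is a simplification, and the function name is hypothetical. The source does not state how its 75% figure was derived, so this is only a plausibility check. Under this assumption, a 97% character accuracy gives roughly 86% correct words for five-character words and roughly 76% for nine-character words.

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every character in a word is read correctly,
    assuming character errors are independent (a simplification)."""
    return char_accuracy ** word_length


# Compare word-level accuracy at several word lengths for a
# character-level accuracy of 97%.
for length in (5, 7, 9):
    rate = word_accuracy(0.97, length)
    print(f"{length}-character words: {rate:.1%} expected correct")
```

The longer the average word, the further word accuracy falls below the quoted character accuracy, which matters most when OCR output is used for full-text representation rather than indexing.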
5.5 OCR Correction and Rekeying
If the use to which OCR’d text is being put requires 100% accuracy, two options are available:
• The OCR output can be corrected using the spell-check and editing functions of the
software. This process is labor-intensive and expensive.
• For some applications, it may be more cost-effective to rekey and then proofread the
document. Rekeying is the manual process of retyping the original. This can be done
in-house or outsourced to a vendor. Most U.S. vendors subcontract this work to
overseas vendors.
² Booth, Jon M., et al. Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products. USGPO, 2006.
(http://www.gpo.gov/pdfs/fdsys-info/documents/WhitePaper-OptimizingOCRAccuracy.pdf)