5.0 Best Practices for Optical Character Recognition

Optical character recognition (OCR) is a process by which specialized software is used to convert scanned images of text to electronic text so that digitized texts can be searched, indexed, and retrieved. The recommended software for OCR creation is ABBYY FineReader; however, Adobe Acrobat can produce high-quality OCR for clear, crisp, and structurally uncomplicated texts in a variety of languages.

Table of contents
5.1 Uses of OCR output
5.2 Functionality of OCR Software
5.3 Output File Formats
5.4 Factors Affecting Accuracy of OCR
5.5 OCR Correction and Rekeying

5.1 Uses of OCR output

There are several uses for the output of the optical character recognition process:¹

• Indexing: OCR text can be output to a text file that is then imported into a search engine and used as the basis for full-text searching. This can be an effective use of the output even if it is not 100% accurate.
• Full-text retrieval: Search engine results are displayed with hit highlighting within the displayed page image.
• Full-text representation: The user is presented with a text file as a representation of the actual document. In this case, accuracy of the OCR is critical; projects using OCR output for this purpose generally require that the output be spell-checked and errors corrected.
• Full-text representation with XML mark-up: The OCR output is presented to the user with layout, structure, and/or metadata added through XML mark-up. This presentation requires significant human intervention.

¹ Tanner, Simon. Deciding Whether Optical Character Recognition is Feasible. KDCS, 2004. (http://www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf)

5.2 Functionality of OCR Software

Text must be scanned or digitally photographed and saved as either an image or a PDF (Portable Document Format) file prior to running it through an OCR program. OCR software converts the patterns of light and dark found in a digital image of a page of text into text characters and saves them in a format that computers can search or index, such as Unicode or ASCII.

OCR software generally employs a wide variety of language dictionaries. ABBYY FineReader v. 10 can read over 180 languages, as well as common programming languages, numbers, and simple chemical formulae. ABBYY FineReader can read multilingual documents, and includes dictionaries and spell-checking capabilities for 39 languages. Adobe Acrobat 9 can read over 40 languages and has some basic spell-checking capabilities. Generally, operators select the language(s) of the document from a drop-down list prior to initiating the OCR process.

“Reading” of digitized text is the primary function of OCR software; however, some OCR programs (e.g., ABBYY FineReader) have other functions aimed at either improving the accuracy of the OCR results or speeding up the OCR process. Among these functions are:

• Structural analysis to first identify structural features of the document, such as text orientation, headings, images, tables, captions, and paragraphs. Structural analysis can both improve OCR accuracy and enable the software to reproduce the original text formatting if the output is a Word document.
• Basic image processing tasks such as straightening, deskewing, despeckling, cropping, and rotating that can improve OCR accuracy.
• A training mode to “teach” the software to recognize decorative fonts and unusual characters.
• Spell-checking and editing functions that prompt the operator to review questionable spellings and, if necessary, correct spelling errors.
• Batch processing to automate repeated tasks and scheduling functions to set the software to run automated tasks at night or at other times when the computer’s processing load is lighter.

5.3 Output File Formats

Optical character recognition should be performed for printed textual materials to enhance searchability and access of the digitized version.
• For scanned text resources that will be converted and made available to end-users in PDF, OCR should be done as part of the conversion to PDF to provide text behind the page images and make individual documents searchable when viewed in Acrobat Reader. For searchable PDFs, the final format should be text-under-images. When saving OCR output as a searchable PDF, the option to “enable tagged PDF” should always be checked. Tagged PDF is a version of PDF that provides structure and order information to allow PDF documents to be read by screen readers used by persons with disabilities.
• Separate text files of OCR output need only be generated for a collection of text resources when it is anticipated that indexing, search, and discovery across the collection as a whole will be required. Notes should be made of the software used to do the OCR; these notes should include values for all settings used and any training set used for the OCR.
• Separate XML files of OCR output need only be generated for a collection of text resources when it is anticipated that structured text will be required for that collection. Notes should be made as in the preceding case.

5.4 Factors Affecting Accuracy of OCR

Most commercial software packages boast an OCR accuracy of between 97% and 99%. These rates are based on character errors, not word errors. So while 97% of characters may be accurate in an OCR’d document, only 75% of words may be spelled correctly. Any of the following factors can also affect the accuracy of the OCR:

• Textual considerations
  o Standard OCR should not be attempted on certain materials. For example, OCR with default settings should not currently be attempted on most texts published prior to 1850. For some languages (e.g., German) the cutoff date may be even later. Before trying to create transcriptions for these materials via OCR, detailed analysis and often experimentation are required to judge trade-offs between custom OCR and keyboarding options.
  o Older and discolored documents must be scanned in RGB mode to capture all the image data and to maximize OCR accuracy.
  o Low-contrast documents can result in poor OCR.
  o Typescript results in poorer OCR than printed type; inconsistent use of font faces and sizes can lower OCR accuracy.
  o Font sizes below 6 points in the original can limit OCR, although increasing the resolution of the scanned image to 600 dpi and using greyscale may improve OCR output.
  o Handwritten documents cannot be recognized with any degree of accuracy.
• Scanning considerations
  o The recommended scanning resolution for OCR accuracy is 300 dpi. Higher resolutions do not necessarily result in better accuracy and can slow down OCR processing. Resolutions below 300 dpi may reduce the quality and accuracy of OCR results.
  o Brightness settings that are too high or too low may adversely affect OCR accuracy. A medium brightness value of 50% will be suitable in most cases.
  o Straightness of the initial scan can affect OCR quality; crooked lines of text produce poor results.
  o Image enhancements, such as contrast adjustment and unsharp mask, have NOT been shown to significantly enhance the accuracy of OCR.²

² Booth, Jon M., et al. Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products. USGPO, 2006. (http://www.gpo.gov/pdfs/fdsys-info/documents/WhitePaper-OptimizingOCRAccuracy.pdf)

5.5 OCR Correction and Rekeying

If the use to which OCR’d text is being put requires 100% accuracy, two options are available:

• The OCR output can be corrected using the spell-check and editing functions of the software. This process is labor intensive and expensive.
• For some applications, it may be most cost-effective to rekey and then proofread the document. Rekeying is the manual process of retyping the original. This can be done in-house or outsourced to a vendor. Most U.S. vendors subcontract this work to overseas vendors.
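
To judge whether OCR output is accurate enough for its intended use, or whether correction or rekeying is needed, the character and word accuracy figures discussed in section 5.4 can be measured on a hand-checked sample page. The following Python sketch is an illustration only and is not part of any OCR product: the function names, the sample sentence, and the use of edit distance as the error measure are assumptions made for this example. It also includes a simplified model in which character errors are assumed to be independent, to show why a high character accuracy still leaves many words misspelled.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def accuracy(truth, ocr):
    """1 - (edit distance / length of the ground truth), floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def character_accuracy(truth_text, ocr_text):
    """Accuracy measured over individual characters, spaces included."""
    return accuracy(truth_text, ocr_text)


def word_accuracy(truth_text, ocr_text):
    """Accuracy measured over whitespace-separated words."""
    return accuracy(truth_text.split(), ocr_text.split())


def expected_word_accuracy(char_accuracy, word_length):
    """Chance that every character of a word is right, ASSUMING errors
    are independent of one another (a simplification for illustration)."""
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Hypothetical sample: hand-keyed ground truth versus OCR output.
    truth = "the quick brown fox jumps over the lazy dog"
    ocr = "the qu1ck brown fox jumps ovcr the lazy dog"
    print(f"character accuracy: {character_accuracy(truth, ocr):.3f}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.3f}")
    print(f"model, 97% characters, 5-letter words: "
          f"{expected_word_accuracy(0.97, 5):.3f}")
```

In this sample, two wrong characters out of 43 give a character accuracy of about 95%, but because each error falls in a different word, only 7 of 9 words are correct (about 78%). Under the independence assumption, 97% character accuracy gives roughly 86% correct five-letter words and roughly 76% correct nine-letter words, which is in line with the gap between character and word accuracy described in section 5.4.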