InftyReader Group, Inc.
2809 Bohlen Drive
Hilliard, Ohio 43026
United States
Phone: (614) 777-0660
Fax: (614) 259-0013
TTY: (800) 750-0750
http://apps4android.org

InftyReader: An OCR System for Math Documents

by: Masakazu Suzuki and Katsuhito Yamaguchi
The Infty Project and Science Accessibility Net
and, John Gardner
ViewPlus Technologies, Inc.

Updated on June 25, 2011

Table of Contents

Overview
InftyReader Features
Output Formats
System Requirements
Installation Procedure
License
Optimizing the Quality of OCR Recognition
Step by Step Instructions for Using InftyReader
Feedback

Overview

Optical character recognition (OCR) technologies are invaluable for improving access to printed materials by people with print disabilities. Most do not work well with scientific content. Scientific documents typically include mathematical and other special symbols that standard OCR does not recognize. In addition, no standard commercial OCR application can recognize a two-dimensionally-structured math equation and convert it to a standard software format.
The InftyReader OCR application can properly recognize scientific documents scanned from paper or in PDF format. InftyReader recognizes complicated math expressions, tables, graphs and other technical notations and converts them to accessible formats. InftyReader can be used by people with print disabilities in combination with the ChattyInfty accessible scientific editor application. ChattyInfty provides speech access for reading and writing math and for editing the output of InftyReader. Sighted people can use InftyReader with the free Infty Editor to edit InftyReader output and produce accessible scientific content.

InftyReader and ChattyInfty are sold by InftyReader Group, Inc. Their website is: http://inftyreader.org

InftyReader Features

o It uses the "ExpressReaderPro" OCR engine from Toshiba Corporation and the "WinReader" OCR application from MediaDrive Corporation simultaneously to recognize characters in regular text.
o It uses an OCR engine developed by Infty to recognize math and scientific formulas.
o It can recognize tables containing math expressions.
o It can convert black and white scanned documents and PDF files.
o It recognizes individual pages in a PDF file.
o It is licensed for business/commercial use.

Output Formats

InftyReader can output a recognition result in any of the following formats:

o IML
o LaTeX
o HR-TeX
o XHTML+MathML
o Microsoft Word 2007 (XML)

IML is an XML file format developed expressly for Infty Editor and ChattyInfty. By default InftyReader saves results in IML. The original image is retained and can be displayed with either Infty Editor or ChattyInfty. In ChattyInfty, the image can be accessed tactually through an on-line graphics display (available from KGS Japan) or by embossing it on a ViewPlus embosser. Consequently, Infty Editor and ChattyInfty users may compare results with the original image and make corrections as necessary.

Copyright © 2008-2011 by InftyReader Group, Inc. All rights Reserved.
These editors can also convert the result into any of the formats listed above. Apart from IML and HR-TeX, all Infty output formats are standard mainstream formats. HR-TeX (Human-Readable TeX) is an abbreviated LaTeX-like notation developed to be more easily readable than standard LaTeX.

System Requirements

InftyReader and ChattyInfty require the Windows XP, Vista, or Windows 7 operating system, 32 or 64 bit. Microsoft Internet Explorer 7 or later must be installed. In order to correct OCR errors and edit documents, we strongly recommend installing the free Infty Editor for use by sighted people or the ChattyInfty editor for use by people with print disabilities.

Installation Procedure

The initial InftyReader or ChattyInfty download is a zip archive. Extract the contents into any convenient folder. One of the extracted files is named "setup.exe". Run this file to install InftyReader or ChattyInfty. Note that administrator privileges are required to install applications.

License

The InftyReader license is included in the download archive as "License_E.txt". Please read it!

Optimizing the Quality of OCR Recognition

InftyReader can recognize only a high quality black-and-white (binary) image. For paper documents, it is very important to scan the document in black and white (binary) mode and to use a resolution of at least 400 dpi. 600 dpi is recommended for best results. The paper should be flat and carefully aligned in the scanner to avoid images that are fuzzy, skewed, or slanted. If possible, pages in books should be cut from the binding so they will lie flat on the scanner. Save the scanned files in TIFF, GIF or PNG format.

Recognizing math characters requires much higher quality images than standard OCR does, and poor quality images will give correspondingly poor results. Heavy users of InftyReader can improve the quality of recognition by editing scanned images to remove small extraneous scan defects and artifacts.
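The scanning guidance above can be summarized as a small pre-flight check. The following Python sketch is illustrative only: check_scan and its parameters are hypothetical names, not part of InftyReader, and the caller is assumed to already know the resolution, color mode, and file format of each scan.

```python
# Illustrative sketch only; InftyReader itself provides no such function.
MIN_DPI = 400           # minimum resolution stated in this guide
RECOMMENDED_DPI = 600   # resolution recommended for best results
ACCEPTED_FORMATS = {"TIFF", "GIF", "PNG"}  # formats named in the scanning advice


def check_scan(dpi, is_binary, file_format):
    """Return a list of warnings for one scanned page (empty list = no issues)."""
    warnings = []
    if not is_binary:
        warnings.append("Rescan in black-and-white (binary) mode.")
    if dpi < MIN_DPI:
        warnings.append(f"Resolution {dpi} dpi is below the {MIN_DPI} dpi minimum.")
    elif dpi < RECOMMENDED_DPI:
        warnings.append(f"{RECOMMENDED_DPI} dpi is recommended for best results.")
    if file_format.upper() not in ACCEPTED_FORMATS:
        warnings.append("Save the scan as TIFF, GIF or PNG.")
    return warnings


if __name__ == "__main__":
    for message in check_scan(dpi=300, is_binary=False, file_format="jpg"):
        print(message)
```

A scan that passes every check (for example 600 dpi, binary, TIFF) returns an empty list. The check does not inspect the image itself, so skew, fuzziness, and scan defects still have to be judged by the user.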
Recognition can be improved by optimizing the scanner threshold so that fewer than 1% of characters are broken or touch other characters.

InftyReader subdivides the document into text, math, tables, and figures and then uses different procedures to recognize each. Users can improve the quality of recognition by hand-editing images to ensure that the content flows properly. For example, cutting columns apart and arranging them in proper sequence is recommended when pages are partially columnized.

One common problem that needs to be avoided with scanned images is a dark frame that can appear around the page. This problem is caused by a non-white area above the page during scanning. It can be avoided by placing a large sheet of white paper over the page being scanned. The sheet should be large enough to cover the entire scanner surface. Images with such problems can be repaired by removing the offending dark frame in a good image editor. Be careful not to reduce the image dpi during such a process.

A PDF file can also be recognized. Normal PDF files have characters represented in fonts and are subject to fewer OCR problems than scanned images. Always obtain a PDF if possible. Articles in most scientific journals are available as PDF.

Step by Step Instructions for Using InftyReader

InftyReader is a GUI application. It can also be used in command mode in the Console Window by running Infty.exe from the Infty program folder. This tutorial is restricted to the GUI version. Command mode use is covered in the InftyHelpE.txt file included in the archive.

Step 1. Start InftyReader. For Windows 7, type InftyReader in the Search box that takes focus when you go to Start (by pressing the Windows key or CTRL-ESC), and press Enter. For older Windows operating systems, find InftyReader in the Program menu and press Enter.
You will find a number of buttons and edit boxes if you TAB around the initial InftyReader screen. Use the space bar to press buttons, not Enter. The screen has these elements:

o Three buttons giving you the choice of selecting a file to recognize, selecting a folder to recognize, or scanning in a document.
o A read-only box that will show the current input file/folder name. It is initially empty.
o A radio selection box permitting you to select the desired input format. Choices are TIFF, BMP, GIF, PNG, and PDF.
o A read-only box showing the output file/folder name. It is initially empty.
o A radio selection box giving you the choice of output format. Choices are IML, LaTeX, HR-TeX, XHTML(MathML), and Microsoft Word 2007 (XML).
o A yes/no choice of whether to open the output file in the appropriate application. Default is "no".
o A choice of whether to put a NewLine indicator at the end of each paragraph. Default is to do so.
o The "Start OCR" button that should be pressed once all selections are made.
o An Exit button that closes InftyReader.
o A choice of math level as "all math symbols" or "High school math symbols". Select the latter if the math is simple, because recognition will be better.
o A choice of using InftyReader in English or Japanese.

Step 2. Set the various options and then click on the choice of file, folder, or scan input.

o If you choose to convert a single file, click the "file" button by pressing the space bar or left-clicking the mouse. This opens a standard Windows "open" dialog that permits you to type in the file name or browse for it. There is a choice of keeping this file as "read-only", and an "open" button. After selecting the file, press the "open" button. After a short processing delay, the initial screen re-appears with the input and default output file names filled in. You may edit the output file name or browse to another folder if you wish.
o If you click on "folder", you open a selection tree that permits you to browse for the desired input folder. Once the folder is found, press "ok". The initial screen re-appears with the input folder name filled in. The output file name box is empty. You cannot edit the output folder name. An additional box appears, giving you the choice of whether to convert files in sub-folders. Default for this "Search sub-folders" choice is off.

Step 3. Set the desired dpi level in the box giving an option of 600 or 400 dpi. Then press the "Start OCR" button with the space bar or mouse.

If you have selected LaTeX for the output format, after pressing the "Start OCR" button, you will be given a long list of LaTeX options. If you are LaTeX-savvy, you will recognize these options. If not, we recommend that you choose your desired paper size and then accept all other default options and press "ok" to continue.

If you have chosen to convert a single file:

o The output file will be in the same folder as the input file and, by default, will have the same name except for the extension. The extension will be whatever was selected for output format. For example, File0.tiff will produce output file File0.iml if IML is the output format, File0.tex if LaTeX is the output format, etc.
o You will find that InftyReader produces other files as well, in the same folder as the input and output files. At a minimum there will be a log file, sometimes several log files. Other files will be produced, depending on which output format is used.

If you have chosen to convert a folder:

o InftyReader will convert all image files in that folder having the specified input format. All output will be put into a single file having the name of the folder with the extension of the output format. For example, the output for all files in the folder Hi will be in the single file Hi.iml for IML output, Hi.tex for LaTeX output, etc.
o If the "Search sub-folders" box is checked, InftyReader will convert files in all sub-folders as well.
All output from a sub-folder named HiThere will be placed in HiThere and gathered together into the single file HiThere.iml, etc.

Feedback

Steve.jacobs @ inftyreader.org

This work is licensed under a Creative Commons Attribution 3.0 Unported License.
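As an illustration of the output naming rules described in Step 3, the following Python sketch predicts the name of the output file for a single input file or for an input folder. The function names are hypothetical and not part of InftyReader. Only the two extensions this guide states explicitly (.iml and .tex) are included; the extensions for the other output formats are not guessed here.

```python
from pathlib import PureWindowsPath

# Only the extensions stated in this guide; other formats are deliberately omitted.
OUTPUT_EXTENSIONS = {"IML": ".iml", "LaTeX": ".tex"}


def output_name_for_file(input_file, output_format):
    """Default output name for one input file: same name, new extension."""
    stem = PureWindowsPath(input_file).stem
    return stem + OUTPUT_EXTENSIONS[output_format]


def output_name_for_folder(input_folder, output_format):
    """Single output file for a folder: the folder name plus the extension."""
    folder_name = PureWindowsPath(input_folder).name
    return folder_name + OUTPUT_EXTENSIONS[output_format]


if __name__ == "__main__":
    print(output_name_for_file("File0.tiff", "IML"))
    print(output_name_for_folder(r"C:\scans\Hi", "LaTeX"))
```

The functions return only a file name, not a full path. The folder path C:\scans\Hi in the usage lines is an invented example.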