Project Proposal Form
CS791A – Machine Learning

Program Name: Fused Optical Character Recognition System (FOCRS)
Program Participants: Nick Bartlow, Nathan Kalka
New: _X_ Continuation: ____

Description: According to [1], “Optical character recognition (OCR), as understood in the following, is the whole process of transforming a document image (machine printed or handwritten) into a corresponding ASCII text. Many steps are necessary to perform this task, e.g. layout analysis, image preprocessing, line segmentation, character recognition, contextual postprocessing... Modifying one of them may lead to completely different results.” Much research has gone into the development of applications that provide (semi-)automatic OCR. As the technology has matured, the performance of such applications has improved dramatically. That said, the measured performance of any application necessarily depends on the testing environment and data repositories used. Additionally, applications often approach the problem of OCR with (semi-)orthogonal methodologies to reach a final solution. Given this fact, data fusion methodologies may lead to promising results if multiple OCR packages are combined appropriately.

Experimental Plan: We intend to take a series of freely distributed OCR packages (gocr, tesseract, ocrad, ocropus, etc.) and develop a framework for combining their output, with the aim of achieving higher accuracy than any individual package alone. Techniques such as boosting, cascading, and adaptive fusion frameworks may be investigated to this end. Besides applying the chosen techniques to machine generated datasets and samples from paper documents scanned electronically, we will also collect a dataset of handwriting samples gathered electronically through a tablet PC. Formal comparisons of performance will include recognition of individual characters as well as passages of text.
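As an illustration of the kind of output combination the experimental plan has in mind, the following is a minimal sketch of a per-character plurality vote, one of the classical combination rules. It is not the proposed framework: the function name plurality_vote, the sample strings, and the tie-breaking rule are hypothetical, and the sketch assumes the outputs have already been aligned to equal length, which real OCR output generally is not.

```python
from collections import Counter


def plurality_vote(outputs):
    """Fuse pre-aligned, equal-length OCR strings by per-position plurality vote.

    Assumption: every string in `outputs` has the same length and position i
    refers to the same glyph in each. Real outputs would need an alignment
    step first.
    """
    if not outputs:
        raise ValueError("need at least one OCR output")
    length = len(outputs[0])
    if any(len(o) != length for o in outputs):
        raise ValueError("outputs must be aligned to the same length")

    fused = []
    for chars in zip(*outputs):
        counts = Counter(chars)
        best = max(counts.values())
        # Tie-break (an arbitrary choice for this sketch): take the character
        # from the earliest engine in list order that has the top count.
        fused.append(next(c for c in chars if counts[c] == best))
    return "".join(fused)


# Hypothetical outputs for the same line of text; not real engine results.
hypothetical_outputs = {
    "gocr": "He1lo world",
    "tesseract": "Hello world",
    "ocrad": "Hello wor1d",
}
fused = plurality_vote(list(hypothetical_outputs.values()))
print(fused)
```

Each engine here makes a different single-character error, so the majority at every position recovers the intended text. When engines disagree completely, the sketch falls back to the first engine in the list, so list order acts as a crude priority ranking.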
Related Work Elsewhere: [1] incorporates geometrical criteria to prevent incorrect character segmentations and improves performance through classical combination rules such as Borda Count or Plurality Vote. [2] focuses on obtaining a tradeoff between speed and recognition accuracy through a cascade of classifiers. [3] investigates the utility of string alignment algorithms in merging outputs from multiple OCR classifiers.

How ours is Different: To the best of our knowledge, no experiments have applied OCR technologies to handwritten databases collected electronically through a tablet PC. Although the results of collection in this format are arguably similar to scanned handwriting samples, we anticipate different recognition challenges with data acquired in this manner. Besides natural differences in quality related to the capture device (tablet PC vs. paper), the dynamics of the writing process also change. We expect this change in “writing style” to be observed in varying degrees from individual to individual.

Related Work in:
[1] E. Wilczok, W. Lellmann, “Adaptive Combination of Commercial OCR Systems,” Lecture Notes in Computer Science, Vol. 2956, 2004.
[2] K. Chellapilla, M. Shilman, P. Simard, “Optimally combining a cascade of classifiers,” Proceedings of SPIE, 2006.
[3] J.C. Handley, “Improving OCR accuracy through combination: A survey,” Proc. IEEE Int. Conf. Syst. Man Cybern., Vol. 5, pp. 4330-4333, 1998.

Milestones:
(1) Construct / collect a database of testing images including machine generated text, scanned handwritten text, and electronically gathered text.
(2) Construct a cascade of classifiers and/or fusion framework for individual OCR packages.
(3) Analyze results on data collected from (1).

Deliverables: Milestones will result in a technical report / presentation.

Budget: Total: ~$40,000. Students: $30,000 (get us while we’re still cheap). Travel: $5,000. Other (software, office supplies): $5,000 (we need new machines).

Progress to Date: Various individual OCR programs installed / sanity tested.

Knowledge Transfer Target Date: 2 months, Fall 2007.