Project Proposal Form

CS791A – Machine Learning
Program Name: Fused Optical Character Recognition System (FOCRS)
Program Participants: Nick Bartlow, Nathan Kalka
New:_X_ Continuation:____
Description: According to [1], “Optical character recognition (OCR), as understood in the following, is the whole process of
transforming a document image (machine printed or handwritten) into a corresponding ASCII text. Many steps are necessary to
perform this task, e.g. layout analysis, image preprocessing, line segmentation, character recognition, contextual
postprocessing... Modifying one of them may lead to completely different results.” Much research has gone into the
development of applications that provide (semi-)automatic OCR. As the technology has matured, the performance of such applications
has improved dramatically. That said, measured performance necessarily depends on the testing environment and the data
repositories used. Additionally, applications often approach the OCR problem with (semi-)orthogonal methodologies to reach a
final solution. Because their errors may therefore differ, data fusion methodologies may lead to promising results if multiple
OCR packages are combined appropriately.
Experimental Plan: We intend to take a series of freely distributed OCR packages (gocr, tesseract, ocrad, ocropus,
etc.) and develop a framework for combining their output, with the aim of achieving higher accuracy than any
individual package alone. Techniques such as boosting, cascading, and adaptive fusion frameworks may be
investigated to this end. Besides applying the chosen techniques to machine-generated datasets and samples from paper
documents scanned electronically, we will also collect a dataset of handwriting samples gathered electronically
through a tablet PC. Formal comparisons of performance will include recognition of individual characters as well as passages
of text.
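As an illustration of the simplest kind of combination the plan could start from, the sketch below applies a plurality vote to per-character labels from several engines. It is only a sketch under stated assumptions: each engine is assumed to have already produced exactly one label per isolated character image, the function names (plurality_vote, fuse_characters) are hypothetical, and the sample labels are invented for illustration rather than taken from any real run of gocr, tesseract, or ocrad. Combining whole passages would additionally require aligning the outputs, which is not attempted here.

```python
from collections import Counter


def plurality_vote(labels):
    """Return the label chosen by the most engines.

    Ties go to the label that appears first in `labels`, so the caller
    can list engines in order of trust (an assumption of this sketch).
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label


def fuse_characters(per_engine_outputs):
    """Fuse per-character predictions from several OCR engines.

    `per_engine_outputs` maps an engine name to a list of labels,
    one label per input character image. All lists must be equally long.
    """
    outputs = list(per_engine_outputs.values())
    if len({len(o) for o in outputs}) != 1:
        raise ValueError("all engines must label the same number of images")
    # zip(*outputs) yields, for each image, the labels from every engine.
    return [plurality_vote(list(column)) for column in zip(*outputs)]


# Hypothetical labels for five character images (not real engine output).
sample = {
    "gocr": list("he1lo"),
    "tesseract": list("hello"),
    "ocrad": list("hcllo"),
}
fused = "".join(fuse_characters(sample))
```

In this invented example each engine makes at most one mistake and no two engines err on the same image, so the vote recovers the intended word. That is the situation in which fusion helps most; correlated errors would not be corrected by voting alone.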
Related Work Elsewhere: [1] incorporates geometrical criteria to prevent incorrect character
segmentations and improves performance through classical combination rules such as Borda Count or
Plurality Vote. [2] focuses on obtaining a tradeoff between speed and recognition accuracy through a
cascade of classifiers. [3] investigates the utility of string alignment algorithms in merging outputs
from multiple OCR classifiers.
How ours is Different: To the best of our knowledge, no experiments have applied OCR technologies to
handwriting databases collected electronically through a tablet PC. Although the results of collection in this
format may arguably be similar to scanned handwriting samples, we anticipate different recognition challenges
with data acquired in this manner. Besides natural differences in quality related to the capture device (tablet
PC vs. paper), the dynamics of the writing process also change. We expect this change in “writing style” to be
observed in varying degrees from individual to individual.
Related Work in:
[1] E. Wilczok, W. Lellmann, “Adaptive Combination of Commercial OCR Systems,” Lecture Notes in
Computer Science, Vol. 2956, 2004.
[2] K. Chellapilla, M. Shilman, P. Simard, “Optimally combining a cascade of classifiers,”
Proceedings of SPIE, 2006.
[3] J.C. Handley, “Improving OCR accuracy through combination: A survey,” Proc. IEEE Int. Conf. Syst.
Man Cybern., Vol. 5, pp. 4330-4333, 1998.
Milestones:
(1) Construct / collect a database of testing images including machine generated text, scanned
handwritten text, and electronically gathered text.
(2) Construct a cascade of classifiers and/or fusion framework for individual OCR packages.
(3) Analyze results on data collected from (1).
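For the analysis in milestone (3), one common way to score a recognized passage against its ground truth is the character error rate: the edit (Levenshtein) distance between the two strings divided by the length of the ground truth. The sketch below is a minimal illustration of that metric, not a committed evaluation protocol. The function names are hypothetical, and the proposal does not specify which metric will be used.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the ground-truth length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Computing this rate for each individual package and for the fused output on the same passages would give a direct comparison of whether combination improves accuracy. Note that the rate can exceed 1.0 when the hypothesis is much longer than the reference.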
Deliverables: Milestones will result in a technical report / presentation.
Budget: Total: ~$40,000, Students: $30,000 (get us while we’re still cheap), Travel: $5,000, Other (software, office
supplies): $5,000 (we need new machines).
Progress to Date: Various individual OCR programs installed / sanity tested.
Knowledge Transfer Target Date: 2 months, Fall 2007.