POTENTIAL OCR SOFTWARE
FOR NUTRITION FACTS LABELS
Dennis Given
THE GENERAL OPTICAL CHARACTER
RECOGNITION CONCEPT
Input -> OCR -> Output
PREFERENCES FOR THE OCRS
Accurate
Fast
Written in Java
  This will make it easier to find someone to work on the software in the future.
Open-source (free)
  Commercial options, although considerably faster and more accurate, are costly solutions.
Editable
  If necessary, I can go into the OCR engine and change whatever I need to.
  Commercial OCRs don’t always allow for this option.

OCRS THAT MEET SOME OF THE PREFERENCES
Aspire OCR SDK
Java OCR
ABBYY FineReader
Tesseract Version 3.01

COMPARISONS
Preferences:
  Accuracy?
  Speed?
  Java?
  Editable?
  Open-source?
Example image used to determine the best OCR:
  GIF image
  1204x2004 image. This resolution is close to that of images from the iPhone 3GS camera (1500x2000) and the iPhone 3G camera (1200x1600).

EXAMPLE IMAGE
[Figure: the example image, labeled 2004 pixels by 1200 pixels]
ASPIRE OCR
Pros:
  Runs across many platforms
  Relatively fast
  Written in Java and meant to be added to Java applications
Cons:
  Not very accurate
  Must pay for the full SDK (Software Development Kit)

ASPIRE RESULTS
JAVAOCR
Pros:
  Written entirely in Java
  Full source code is given (easy to edit)
  Easy graphical user interface
  Relatively fast
Cons:
  Instead of converting the image to text, it converts it into .png files, one per character
  Not very accurate (sometimes won’t even convert the image to more than one character)
  Even the images that were converted were not done very well…

JAVAOCR RESULTS
ABBYY FINEREADER
Pros:
  Very good interface
  Lots of tools to edit the area being scanned
  The most accurate program tried
Cons:
  Not in Java
  Commercial (not open-source) and VERY expensive

ABBYY FINEREADER RESULTS
TESSERACT VERSION 3.01
Developed by HP Labs
Now used by Google
Pros:
  Close in accuracy to the commercial OCRs
  Easy to use from the command line
  Lots of documentation available
Cons:
  Must use a Java wrapper if we want future edits to be done in Java
  Source code is written in C/C++, so it will be difficult to edit

TESSERACT RESULTS
COMMERCIAL OCR VS. TESSERACT
Commercial OCR:
  100+ languages
  Accuracy is good
  Sophisticated application with complex user interface
  Mostly meant for Windows OS
  Costs $100+ to use
Tesseract:
  6+ languages
  Accuracy is good, but not as good as commercial OCRs
  No user interface
  Runs on Linux, Mac, Windows, and more…
  Open Source – Free!
WHERE TO GO FROM HERE…
Tesseract is our best option at this point.
It…
  Is fast
  Is free
  Outperforms the other available open-source OCR engines
  Has plenty of documentation:
    An Overview of the Tesseract OCR Engine by Ray Smith
    Tesseract OSCON pdf
    http://code.google.com/p/tesseract-ocr/


Three different ways to go
OPTION 1: ~5 WEEKS
USE TESSERACT ENGINE AND WRAP IT
Wrapper Library
  A library is a collection of subroutines or classes used to develop software. Libraries expose interfaces which clients of the library use to execute library routines. Wrapper libraries (or library wrappers) consist of a thin layer of code which translates a library's existing interface into a compatible interface.
By wrapping Tesseract, it won’t matter that Tesseract’s source code is written in C++.
However, this means we will still not be able to customize the Tesseract engine to do exactly what we want (specific to nutrition labels).
We can control the input and output, but the process of determining characters will remain the same.
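
As a rough illustration of the wrapping idea, the sketch below calls the tesseract command-line program from Java instead of binding to the C++ code directly. It assumes that a `tesseract` executable is on the PATH and that it is invoked as `tesseract <image> <outputbase> -l <language>`, writing its result to `<outputbase>.txt`. The class name `TesseractCli` and its method names are hypothetical, chosen only for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Minimal sketch of a Java wrapper around the tesseract command-line tool. */
final class TesseractCli {

    /** Builds the command line: tesseract <image> <outputBase> -l <language>. */
    static List<String> buildCommand(String imagePath, String outputBase, String language) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", language);
    }

    /** Runs tesseract, then returns the text it wrote to <outputBase>.txt. */
    static String recognize(String imagePath, String outputBase, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, outputBase, language))
                .inheritIO()   // show tesseract's own messages on our console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        Path textFile = Paths.get(outputBase + ".txt");
        return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
    }
}
```

A call such as `TesseractCli.recognize("label.tif", "label", "eng")` (file names are placeholders) would return the recognized text as a Java string. This only controls the input and output; character recognition itself still happens inside Tesseract.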
OPTION 2: ~7 WEEKS
BUILD AN OCR ENGINE FROM SCRATCH
Understand general concepts
  Can use ideas and implementations from OCRs such as Tesseract and JavaOCR.
Can customize the engine to run specifically for nutrition facts labels.
  Would be more effective than a “general” OCR which isn’t looking for specifics.
The whole thing can be written in Java (easier for future developers to work on).
However:
  It will take more time
  It will probably have more bugs
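
To make "looking for specifics" a little more concrete, here is a small sketch of label-specific logic applied to recognized text. It assumes, purely for illustration, that each useful line has the shape "nutrient name, amount, unit" (for example "Total Fat 3g") and that only g and mg units matter; real labels will need more cases. The class name `LabelLineParser` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull "name amount unit" entries out of recognized label text. */
final class LabelLineParser {

    // Assumed line shape: "<nutrient name> <amount><unit>", e.g. "Sodium 160mg 7%".
    private static final Pattern LINE = Pattern.compile(
            "^\\s*([A-Za-z][A-Za-z ]*?)\\s+(\\d+(?:\\.\\d+)?)\\s*(mg|g)\\b.*$");

    /** Returns nutrient name -> "amount unit", in the order the lines appear. */
    static Map<String, String> parse(String ocrText) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String line : ocrText.split("\\r?\\n")) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                values.put(m.group(1), m.group(2) + " " + m.group(3));
            }
        }
        return values;
    }
}
```

Lines that do not fit the assumed shape, such as a "Nutrition Facts" title, are simply skipped. A custom engine could go further and use the same kind of knowledge during recognition itself, which a wrapper cannot do.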


Option 3 is to take more time to decide which OCR to use…
GOALS
At the end of the time frame, I plan to have:
A running OCR application that will:
  At least be able to scan in cereal box (flat) images effectively and convert the labels to usable data.
  Have minimal bugs (although some will definitely exist).
  Have an accuracy rate of at least 95%.
  Begin to identify effective ways to manage images with curved (jars, bottles, etc.) and wrinkled (bags, packaging, etc.) nutrition facts labels.
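
The slides do not say how the 95% accuracy rate would be measured. One common choice, assumed here, is character-level accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below computes that figure; the class name `OcrAccuracy` and the sample strings are hypothetical.

```java
/** Sketch: character-level accuracy of OCR output against a reference text. */
final class OcrAccuracy {

    /** Levenshtein distance: minimum insertions, deletions and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed definition: 1 - editDistance / reference length, floored at 0. */
    static double characterAccuracy(String reference, String ocrOutput) {
        if (reference.isEmpty()) {
            return ocrOutput.isEmpty() ? 1.0 : 0.0;
        }
        double accuracy = 1.0 - (double) editDistance(reference, ocrOutput) / reference.length();
        return Math.max(0.0, accuracy);
    }

    /** True when the output reaches the target, e.g. 0.95 for the 95% goal. */
    static boolean meetsTarget(String reference, String ocrOutput, double target) {
        return characterAccuracy(reference, ocrOutput) >= target;
    }
}
```

For example, reading "Sodium 160mg" as "S0dium 16Omg" gives two wrong characters out of twelve, so that line alone would fall short of the 95% target. Averaging over a set of test labels would give a fairer overall figure.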
