Optical Character Recognition Image Character Recognition Intelligent Recognition

UN Workshop on Data Capture, Minsk
Session 15
Data Capture Process with
Optical Character Recognition
Image Character Recognition
Intelligent Recognition
Christoph Steinl
Vice Director International
Enterprise Content Management
© Beta Systems Software AG 2008
Agenda
• OCR: Optical Character Recognition
• ICR: Image Character Recognition
• DFR: Dynamic Form Recognition
7/23/2016
OCR = Optical Character Recognition
• The technology was first invented in 1929
  • Gustav Tauschek obtained a patent on OCR in Germany
  • Mechanical device that used templates
• The first commercial system was installed at Reader's Digest in 1955
  • Years later it was donated to the Smithsonian Institution
• Today
  • Recognition of machine-written text is now considered largely a solved problem
  • Accuracy rates exceed 99%
OCR
• Beta Systems is well experienced with these recognition engines in banks
  • in Germany: OCR-A, with the special symbols ⑁ (Chair), ⑀ (Hook) and ⑂ (Fork)
  • in Austria: OCR-B, with the special symbol + (Plus)
ICR Image Character Recognition
• The technique is far ahead of OCR because of the ongoing development of ICR
• Handwriting recognition system
• Allows different styles of handwriting to be learned by a computer during / before processing, to improve accuracy and recognition rates
ICR Process:
• Capturing the image with scanners
• Processing by ICR and/or OCR
• Segmentation is a very important step
  • Decision whether regions that meet the homogeneity criteria belong to the foreground or to the background
  • Human editors can do that depending on the context
  • Comparable to computer tomography: from the different results of radio waves reflected from different angles, the computer can reconstruct the picture
  • With the first step, only a suitable starting point (sets of pixels) is possible
  • The growing process then links all nearby pixels (computation of valleys and peaks with a high degree of confidence)
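The seed-and-grow idea in the segmentation step can be sketched roughly as follows. This is only an illustration, not the engine's actual algorithm: the function name `grow_region`, the 4-neighbour connectivity and the grey-value `tolerance` are all assumptions made for the example.

```python
from collections import deque

def grow_region(image, seed, tolerance):
    """Collect all pixels connected to `seed` whose grey value lies
    within `tolerance` of the seed's grey value (4-neighbourhood)."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region

# Hypothetical 3x3 grey image: dark ink (low values) on bright paper.
page = [
    [250, 250, 250],
    [250,  20,  30],
    [250,  25, 250],
]
ink = grow_region(page, seed=(1, 1), tolerance=20)
```

Starting from one dark pixel, the region grows over the neighbouring dark pixels and stops at the bright background, which is the foreground/background decision described on the slide.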
ICR Process:
• Pre-processing
  • Deskew
  • Shift, rotate
  • Stretch
Recognition – Image Pre-processing
[Figure: skewed document ... and the same document after alignment]
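A minimal sketch of the deskew step, assuming the skew is estimated from the pixel coordinates of one printed line (for example a form border). The least-squares fit and the function names are assumptions for illustration; a real engine may estimate the angle differently.

```python
import math

def estimate_skew_degrees(points):
    """Fit a straight line through (x, y) points by least squares
    and return its angle against the horizontal, in degrees."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))

def rotate(points, degrees):
    """Rotate points about the origin by the given angle."""
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]

def deskew(points):
    """Rotate the points back by the estimated skew angle."""
    return rotate(points, -estimate_skew_degrees(points))

# Hypothetical line that was scanned with a 5 degree skew.
skewed_line = [(x, x * math.tan(math.radians(5))) for x in range(0, 101, 10)]
aligned_line = deskew(skewed_line)
```

After `deskew`, all points of the line share the same y coordinate, i.e. the line is horizontal again.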
ICR Process:
• Enhance
  • Less / more contrast
  • Clean up (de-noise, halftone removal)
  • to enable the recognition engine to give the best results
Recognition: Noise and box removal
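As a rough illustration of noise removal, the sketch below clears isolated ink pixels ("specks") from a binary image. The rule used here (a pixel with no ink among its eight neighbours is noise) and the name `remove_specks` are assumptions for the example, not the product's actual filter.

```python
def remove_specks(image):
    """image: list of rows with 0 (paper) and 1 (ink).
    Returns a copy in which ink pixels without any ink pixel
    among their 8 neighbours are cleared."""
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            has_ink_neighbour = any(
                image[nr][nc] == 1
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            )
            if not has_ink_neighbour:
                cleaned[r][c] = 0
    return cleaned

# Hypothetical scan: one speck at the top left, a short stroke on the right.
noisy = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
clean = remove_specks(noisy)
```

The speck disappears while the two connected stroke pixels are kept.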
ICR Process:
• Classification
  A "1" was written; the engine returns:
  90% = 1
  8% = 7
  2% = 4
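The classification result above can be turned into a decision as in the following sketch. The acceptance threshold `min_confidence` and the function name `decide` are assumptions; the slide only gives the three confidence values.

```python
def decide(candidates, min_confidence=0.80):
    """candidates maps each candidate character to its confidence.
    Returns (character, confidence); the character is None when
    even the best candidate is below the threshold (a reject)."""
    best_char, best_conf = max(candidates.items(), key=lambda item: item[1])
    if best_conf < min_confidence:
        return None, best_conf
    return best_char, best_conf

# Confidence values from the slide: a handwritten "1".
accepted = decide({"1": 0.90, "7": 0.08, "4": 0.02})

# Hypothetical unclear digit: no candidate is convincing, so it is rejected.
rejected = decide({"1": 0.50, "7": 0.45, "4": 0.05})
```

A rejected character would then go to manual correction instead of being stored as a guess.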
ICR Algorithm:
• Using
  • Neural Network
  • kNN: k-Nearest Neighbour
  • SVM: Support Vector Machine
    SVMs simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers
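Of the three classifiers named above, k-Nearest Neighbour is the easiest to sketch in a few lines. The two-dimensional feature vectors and the labels below are invented for illustration; a real ICR engine would use many more features per character.

```python
import math
from collections import Counter

def knn_classify(samples, query, k=3):
    """samples: list of (feature_vector, label) pairs.
    Returns the majority label among the k samples closest to query."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: two made-up features per handwritten digit.
training = [
    ((0.10, 0.90), "1"),
    ((0.20, 0.80), "1"),
    ((0.15, 0.85), "1"),
    ((0.90, 0.20), "7"),
    ((0.80, 0.30), "7"),
]
label_a = knn_classify(training, (0.20, 0.90))
label_b = knn_classify(training, (0.85, 0.25))
```

The share of votes for the winning label could also serve as a simple confidence value.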
ICR Process:
• For the different classification alternatives, the corresponding confidence is provided
• Recognition can be limited to the most probable characters
  e.g. if only the characters 3, 6 and 0 are possible, the engine can be limited to this set and the results are much better
• Voting machine
• Usability:
  • security,
  • efficiency and
  • accuracy
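Limiting recognition to a known character set can be sketched as a filter over the candidate list. The renormalisation of the remaining confidences and the name `restrict` are assumptions for this example; the confidence values are invented.

```python
def restrict(candidates, allowed):
    """Keep only candidates from the allowed character set and
    rescale their confidences so that they sum to 1 again."""
    kept = {ch: conf for ch, conf in candidates.items() if ch in allowed}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {ch: conf / total for ch, conf in kept.items()}

# Hypothetical engine output: "8" is slightly ahead of "3".
raw = {"8": 0.45, "3": 0.40, "0": 0.15}

# The field may only contain 3, 6 or 0, so "8" is ruled out.
limited = restrict(raw, allowed={"3", "6", "0"})
best = max(limited, key=limited.get)
```

Without the restriction the engine would have answered "8"; with it, the answer is "3" with a clearly higher confidence.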
Dynamic Field Recognition
• No fixed position is required
• If only ½ of the form is available, that ½ is still readable
• No special forms are required
• No timing tracks are necessary on the forms for OMR, but the OMR results are still available at the same time
  no cleaning of LEDs in the scanner is necessary
• Robust against vertical / horizontal stretching, shrinking and displacement (e.g. variation in printing)
Dynamic Field Recognition
• Recognizes:
  • features (word as pixel cloud)
  • boxes,
  • lines and
  • symbols
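One way to picture the robustness against stretching and displacement: if two recognized features (for example two printed boxes) are found on the scan, every field position from the form definition can be mapped onto the scan. The sketch below assumes independent scale and offset per axis and two reference features; this is an illustration only, not a description of how the product locates fields.

```python
def fit_axis(template_a, template_b, found_a, found_b):
    """Return (scale, offset) so that found = scale * template + offset."""
    scale = (found_b - found_a) / (template_b - template_a)
    offset = found_a - scale * template_a
    return scale, offset

def locate_field(field_box, anchors_template, anchors_found):
    """Map a (left, top, right, bottom) box from form-definition
    coordinates to scan coordinates using two reference features."""
    (tx1, ty1), (tx2, ty2) = anchors_template
    (fx1, fy1), (fx2, fy2) = anchors_found
    sx, ox = fit_axis(tx1, tx2, fx1, fx2)
    sy, oy = fit_axis(ty1, ty2, fy1, fy2)
    left, top, right, bottom = field_box
    return (sx * left + ox, sy * top + oy, sx * right + ox, sy * bottom + oy)

# Hypothetical form: two features at known positions in the definition,
# found shifted and stretched by 2% on the scanned page.
template_anchors = ((100, 100), (900, 1300))
scanned_anchors = ((110, 95), (926, 1319))
age_field = locate_field((200, 300, 400, 350), template_anchors, scanned_anchors)
```

Because the mapping is derived from what is found on the page, no fixed field position and no timing tracks are needed in this sketch.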
Hardware / Software Requirements
• Hardware
  • Scanner
  • PC
  • Network
  • Disc storage, necessary for re-processing and if images are needed for audit purposes
• Software
  • Scan software
  • One recognition and voting software for OMR, OCR, ICR, Barcode
OMR: Cost Comparatives in general

                               OMR/ICR from image    OMR/ICR from dedicated OMR scanner
Forms design                   Same                  Same
Forms production               -                     Up to 50% more
Enumerator training            -                     Up to double the cost
Scanners                       -                     Up to double the cost
PC                             Low cost PC           Low cost PC
PC operators                   Same                  Same
Servers                        Same                  Same
Cost of more/new flexibility   low                   high
ICR Advantages
• Better than:
  • Manual keying
    • 90% (plus) correct keys
      Manual = higher substitution rate than automated recognition
    • Time consuming
    • Deliberate manipulation possible
  • OMR, because OMR is space consuming
  • OCR, because OCR is machine written and therefore of limited use
ICR Advantages
• Clear accuracy for OMR because of dirt removal by software, depending on the mark size and figure
• Can detect line and can ignore dirt
• Clear result
ICR Advantages
• Barcode, OCR, OMR and ICR recognition with one software
ICR Advantages
• Pro:
  • Only rejected characters/fields need correction; the rest of the form remains untouched
  • Open for new technologies in the future: faster, better quality
  • With standardized correction mode
  • Handwriting of the corresponding country will be recognized
  • The previously mentioned advantages do not have to be repeated here again
ICR Advantages: Capture Process
• SORM: Scan Once Read Multiple
  Images are scanned once and stored for re-processing (disk space is cheap).
• In several serial sessions, parts of the data are collected from the image (important fields first).
• Example:
  SORM Session 1: Fields Age, Sex and Nationality -> provisional partial results
  SORM Session 2: All other numeric fields
  SORM Session 3: Alphanumeric fields that need more manual coding (Occupation -> Occupation Code)
• Each session updates the data files / database until all data is captured.
• Faster preliminary results. Less political stress.
• Faster data for PES planning
• Analysis of Session 1 results is possible in parallel to recognition, coding and editing of Session 2
• Data lifting on different batching levels is possible (EA, settlement).
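The SORM idea can be sketched as a loop over stored images that reads a different subset of fields in each session. Everything concrete here is an assumption for illustration: the field names beyond Age, Sex, Nationality and Occupation, the in-memory `database` dict and the `read_field` callback that stands in for the recognition engine.

```python
# Field groups per session; only session 1 and "occupation" come from the slide.
SESSIONS = [
    ["age", "sex", "nationality"],
    ["household_size", "year_of_birth"],  # hypothetical numeric fields
    ["occupation"],
]

def run_session(stored_images, fields, database, read_field):
    """Read the given fields from every stored image and
    update the record of each form in the database."""
    for form_id, image in stored_images.items():
        record = database.setdefault(form_id, {})
        for field in fields:
            record[field] = read_field(image, field)
    return database

# Stand-in for scanned images: each "image" is a dict of what is written on it.
stored_images = {
    "F1": {"age": "34", "sex": "F", "nationality": "BY",
           "household_size": "4", "year_of_birth": "1974",
           "occupation": "teacher"},
}

def fake_read(image, field):
    """Stand-in for recognition; a real engine would read the pixels."""
    return image[field]

database = {}
run_session(stored_images, SESSIONS[0], database, fake_read)
provisional = dict(database["F1"])  # available before later sessions run

for fields in SESSIONS[1:]:
    run_session(stored_images, fields, database, fake_read)
```

The forms are scanned only once; each later session re-reads the stored images and adds its fields to the same records.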
Process Stages
of Census Surveys
Christoph J. Steinl, Vice Director int. ECM
December 2008, Minsk
Capture Process
• Store in (EA Batch Header Creation – EA Paper store Database)
• Scanning
• Recognition
• Verifying Processes
• The solution
• Data capture
• Census Process internal
• Census data flow
• Quality assurance
Scanning
[Figure: Kleindienst SC80HC]
Scanning
• Simultaneous creation of up to six images
• Optical lens > 10 mm, depth of field 3 mm
• Optical and ultrasonic double feed control
• Energy saving / life cycle extending mode
• Consistent jam handling: no document is lost or double captured due to physical jams
• Cleanliness check program: detects white and black dirt spots
Scanning
• Pockets 2 – 12, why? Documents can be sorted depending on:
  • if the document is scanned skewed or de-skewed
  • if the very important questions are filled / are readable (if OMR, OCR, Barcode)
  • if there are fingerprints on the questionnaire or not
  • if the Barcode/OCR/OMR numbers (not ICR) are in the given range
  • if there are double entries: we check the unique number
  • if there are (colour) copies used
  • if there are mismatches in quantity: batch header shows 50 and only 40 are scanned
• Transport stop can be programmed to clarify the issue
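The sorting criteria above amount to a small set of routing rules. The sketch below is hypothetical: the pocket numbers, the dictionary keys and the use of pocket 1 for accepted documents are assumptions, and only three of the listed criteria plus the quantity check are shown.

```python
def choose_pocket(doc, seen_numbers, valid_range):
    """Return the pocket for one scanned document.
    seen_numbers is updated with every accepted form number."""
    low, high = valid_range
    if not doc["key_fields_readable"]:
        return 2  # very important questions missing or unreadable
    if not low <= doc["form_number"] <= high:
        return 3  # number outside the given range
    if doc["form_number"] in seen_numbers:
        return 4  # double entry: unique number already scanned
    seen_numbers.add(doc["form_number"])
    return 1      # accepted

def batch_is_complete(header_count, scanned_count):
    """Compare the quantity on the batch header with the scanned quantity."""
    return header_count == scanned_count

seen = set()
first = choose_pocket({"form_number": 1005, "key_fields_readable": True},
                      seen, (1000, 1999))
again = choose_pocket({"form_number": 1005, "key_fields_readable": True},
                      seen, (1000, 1999))
out_of_range = choose_pocket({"form_number": 2500, "key_fields_readable": True},
                             seen, (1000, 1999))
complete = batch_is_complete(header_count=50, scanned_count=40)
```

In the quantity example from the slide (50 on the header, 40 scanned) the check fails, which is the kind of case where a programmed transport stop would be used.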
Scanning
• Customer:
  "I have a printer and have been printing my own questionnaire for a long time"
  "… I learnt from the internet that it is just a matter of software …"
• Before printing we should be consulted so that we can give the best advice; we will test and optimize.
• Single-sided printing or higher-grammage paper is necessary (shine-through factor = opacity)
• Paper should be white without any spots inside
• Discuss different methods before making big investments
Census Process internal
[Diagram with the boxes: Form Type Analysis, Structure Analysis, ICR voting, Batch Job Processing, ICR 1, ICR 2, Editing / Coding, Output Assembly, Logical Result Analysis, ICR Result Analysis]
The Path to Recognition
• Analyze the structure of documents for identification
The Path to Recognition
• Perform proper clean-up and image pre-processing
• Analyze individual page layout
• Dynamically locate fields of interest
• Character recognition: numeric handwriting, voting of two ICR engines and also with OMR
• Compile results
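Voting between two ICR engines can be sketched as follows, assuming a simple rule: a character is accepted only if both engines agree and both are reasonably confident. The rule, the threshold and the name `vote_field` are assumptions; the actual voting logic is not described on the slides.

```python
def vote_field(engine_1, engine_2, min_confidence=0.7):
    """engine_1 / engine_2: lists of (character, confidence), one
    entry per character position. Returns the accepted characters,
    with None for every position that must be corrected manually."""
    result = []
    for (char_1, conf_1), (char_2, conf_2) in zip(engine_1, engine_2):
        if char_1 == char_2 and min(conf_1, conf_2) >= min_confidence:
            result.append(char_1)
        else:
            result.append(None)
    return result

# Hypothetical results of two engines for a four-digit numeric field.
icr_1 = [("1", 0.95), ("9", 0.90), ("7", 0.60), ("4", 0.92)]
icr_2 = [("1", 0.97), ("9", 0.88), ("1", 0.55), ("4", 0.65)]
voted = vote_field(icr_1, icr_2)
```

Here the third position is rejected because the engines disagree, and the fourth because one engine is not confident enough; only these two positions would need correction.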
Verifying Processes
• Unique number: double scan check
• Double feed check
• Check if copy
• Trace of all editing work
• Logical checks
• Completeness checks
• Reports
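Completeness and logical checks could look like the sketch below, which produces a simple report. The field names, the age range of 0 to 120 and the report format are invented for illustration and are not taken from the product.

```python
def verify_records(records, required_fields):
    """Return a list of (form_number, problem) entries for a report."""
    report = []
    for record in records:
        number = record.get("form_number")
        # Completeness check: every required field must be filled.
        for field in required_fields:
            if not record.get(field):
                report.append((number, f"missing {field}"))
        # Logical check (hypothetical rule): the age must be plausible.
        age = record.get("age")
        if age and not 0 <= int(age) <= 120:
            report.append((number, "implausible age"))
    return report

# Hypothetical captured records.
records = [
    {"form_number": 1001, "age": "34", "sex": "F"},
    {"form_number": 1002, "age": "240", "sex": ""},
]
report = verify_records(records, required_fields=["age", "sex"])
```

Only the second record appears in the report, once for the empty field and once for the implausible value.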
The Solution: SC80HC + FC Census
[Diagram with the boxes: FC Census recognition, DevInfo, Data + Images, Work Data Storage & DB, Preparation: cut & jog, CSPro, Batch Header, Archive, Paper Archive, Data Storage & DB, Redatam, Form x, TAPE, Editing, Local reports]
Data capture
• Data Processing Centres in different locations
• Peak period 3 shifts, average 1-2 shifts
• Local operators trained by our supervisors
• Supervisors local
• Central support from Lab
• Training & documentation realised in advance
• Help to design the documents
Thank you for your attention