UN Workshop on Data Capture, Bangkok Session 7 Data Capture Richard Lang

UN Workshop on Data Capture, Bangkok
Session 7
Data Capture
Richard Lang
International Manager
© Beta Systems Software AG 2008
Agenda
OCR: Optical Character Recognition
ICR: Intelligent Character Recognition
DFR: Dynamic Form Recognition
7/23/2016
OCR = optical character recognition
• Technology was first invented in 1929
  • Gustav Tauschek obtained a patent on OCR in Germany
  • Mechanical device that used templates
• First commercial system was installed at Reader's Digest in 1955
  • Years later donated to the Smithsonian Institution
• Today
  • Recognition of machine-written text is now considered largely a solved problem
  • Accuracy rates exceed 99%
OCR
• Beta Systems is well experienced with these recognition engines in banks
  • in Germany: OCR-A, with the special symbols ⑁ (Chair), ⑀ (Hook), ⑂ (Fork)
  • in Austria: OCR-B, with the special symbol + (Plus)
ICR Intelligent Character Recognition
• The technique is far ahead of OCR because of ongoing development of ICR
• Handwriting recognition system
• Allows different styles of handwriting to be learned by a computer during / before processing to improve accuracy and recognition rates
ICR Process:
• Capturing the image with scanners
• Processing by (ICR) and/or (OCR)
• Segmentation is a very important step
  • Decision whether pixels that meet a homogeneity criterion belong to the foreground or to the background
  • Human editors can do that depending on the context
  • Compare also computer tomography: from X-ray projections taken at different angles the computer can reconstruct the picture
  • With the first step only a suitable starting point (sets of pixels) is possible
  • The growing process then links all neighbouring pixels (computation of valleys and peaks with a high degree of confidence)
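
The growing step described above can be sketched as follows. This is a minimal illustration, assuming a greyscale image stored as a list of rows and a plain grey-value tolerance as the homogeneity criterion; the names `grow_region` and `tolerance` are hypothetical and not taken from any Beta Systems product.

```python
from collections import deque

def grow_region(image, seed, tolerance):
    """Collect all pixels connected to `seed` whose grey value
    differs from the seed's value by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Look at the four direct neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region

# Assumed convention: low values are dark ink, 255 is white paper.
image = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255,  15, 255, 255],
    [255, 255, 255,  12],
]
stroke = grow_region(image, seed=(1, 1), tolerance=30)
```

The isolated dark pixel in the bottom-right corner is not linked to the stroke, because no chain of similar neighbours connects it to the starting point.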
ICR Process:
• Pre-processing
  • Deskew
  • Shift, rotate
  • Stretch
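
Two of the pre-processing steps listed above, shift and stretch, can be sketched in a few lines. The sketch assumes a binary glyph image (1 = ink, 0 = paper) stored as a list of rows; `crop_to_ink` and `resize_nearest` are hypothetical helper names, and deskewing and rotation are left out for brevity.

```python
def crop_to_ink(image):
    """Shift: cut away empty margins so the ink touches all four borders."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows:
        return image  # blank image, nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]

def resize_nearest(image, out_rows, out_cols):
    """Stretch: rescale to a fixed size with nearest-neighbour sampling."""
    in_rows, in_cols = len(image), len(image[0])
    return [[image[r * in_rows // out_rows][c * in_cols // out_cols]
             for c in range(out_cols)]
            for r in range(out_rows)]

glyph = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
cropped = crop_to_ink(glyph)
normalized = resize_nearest(cropped, 4, 4)
```

Bringing every character to the same position and size means the later recognition steps always see input in a comparable form.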
ICR Process:
• Enhance
  • Less / More Contrast
  • Clean up (de-noise, halftone removal)
  • to enable the recognition engine to give best results
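
The enhancement steps above can be illustrated with two classic filters. This is a sketch under stated assumptions: a greyscale image as a list of rows with values 0 to 255, a linear contrast stretch for "more contrast" and a 3x3 median filter for de-noising. The function names are hypothetical, and halftone removal is not covered.

```python
from statistics import median_low

def stretch_contrast(image):
    """Map the darkest pixel to 0 and the brightest to 255."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:
        return [row[:] for row in image]  # flat image, nothing to stretch
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in image]

def median_filter(image):
    """Replace each pixel by the median of its 3x3 neighbourhood
    (the window is clipped at the image border)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [image[nr][nc]
                      for nr in range(max(0, r - 1), min(rows, r + 2))
                      for nc in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(median_low(window))
        out.append(out_row)
    return out

faint = [[100, 150], [120, 150]]
stretched = stretch_contrast(faint)

speckled = [[200, 200, 200], [200, 0, 200], [200, 200, 200]]
cleaned = median_filter(speckled)
```

In the second example the single dark speck disappears, because it is outvoted by its bright neighbours in every window.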
ICR Process:
Feature extraction
Data reduction
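
One common way to combine feature extraction with data reduction is zoning: the character image is divided into a grid and only the ink density of each zone is kept. The sketch below assumes a binary image (1 = ink) that is at least `zones` pixels wide and high; `zone_features` is a hypothetical name, and zoning is chosen here for illustration, not because the slides name it.

```python
def zone_features(image, zones=2):
    """Reduce a binary image to zones*zones ink densities (0.0 to 1.0)."""
    rows, cols = len(image), len(image[0])
    features = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * rows // zones, (zr + 1) * rows // zones
            c0, c1 = zc * cols // zones, (zc + 1) * cols // zones
            cells = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cells) / len(cells))
    return features

# A crude handwritten "1" on a 4x4 grid.
one = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
features = zone_features(one)
```

Here 16 pixel values are reduced to 4 numbers, which is the kind of compact description a classifier works on.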
ICR Process:
Classification
A one was written:
90% = 1
8% = 7
2% = 4
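
The percentages in this example can be read as normalized confidences over the candidate characters. A minimal sketch, assuming the engine produces some non-negative raw score per candidate (the scores below are invented so that they reproduce the 90 / 8 / 2 split); `rank_candidates` is a hypothetical name.

```python
def rank_candidates(scores):
    """Turn raw scores into confidences that sum to 1, best candidate first."""
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, score / total) for label, score in ranked]

raw_scores = {"7": 4, "1": 45, "4": 1}  # illustrative values only
candidates = rank_candidates(raw_scores)
for label, confidence in candidates:
    print(f"{confidence:.0%} = {label}")
```

The top entry is the recognition result; the remaining entries are the alternatives that a later plausibility check or voting step can fall back on.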
ICR Algorithm:
• Using
  • Neural Network
  • kNN: k-Nearest Neighbour
  • SVM: Support Vector Machine
    SVMs simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers
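
Of the three approaches listed, k-Nearest Neighbour is the simplest to show in full. The sketch below assumes each character has already been reduced to a small feature vector; the training samples and the name `knn_classify` are invented for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(sample, training_set, k=3):
    """Label `sample` by majority vote among its k nearest training samples."""
    neighbours = sorted(training_set, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional feature vectors with known labels.
training = [
    ((0.25, 0.50), "1"), ((0.20, 0.55), "1"), ((0.30, 0.45), "1"),
    ((0.80, 0.20), "7"), ((0.75, 0.25), "7"), ((0.85, 0.15), "7"),
]
label_a = knn_classify((0.28, 0.50), training)
label_b = knn_classify((0.78, 0.20), training)
```

kNN needs no training phase, but every classification compares the sample against the whole training set, which is one reason neural networks and SVMs are listed alongside it.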
ICR Process:
• For each classification alternative, the corresponding confidence is provided
• Recognition can be limited to the most probable characters only
  e.g. if only the characters 3, 6, 0 are possible, the engine can be limited to this set and the results are much better
• Voting Machine
• Usability:
  • security,
  • efficiency and
  • accuracy
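
Both ideas above, limiting the character set and voting, can be sketched on top of per-character confidences. The sketch assumes each engine returns a dictionary of candidate characters and confidences; `restrict` and `vote` are hypothetical names, and summing confidences is only one of several possible voting rules.

```python
from collections import defaultdict

def restrict(candidates, allowed):
    """Drop candidates outside the allowed set and renormalize the rest."""
    kept = {ch: conf for ch, conf in candidates.items() if ch in allowed}
    total = sum(kept.values())
    return {ch: conf / total for ch, conf in kept.items()} if total else {}

def vote(engine_results):
    """Add up the confidences of all engines and return the winner
    together with its mean confidence."""
    totals = defaultdict(float)
    for result in engine_results:
        for ch, conf in result.items():
            totals[ch] += conf
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / len(engine_results)

# Unrestricted, the engine prefers "8"; the field only allows 3, 6 or 0.
engine_a = restrict({"8": 0.5, "3": 0.3, "6": 0.2}, allowed={"3", "6", "0"})
engine_b = {"3": 0.7, "0": 0.3}
engine_c = {"6": 0.55, "3": 0.45}
winner, confidence = vote([engine_a, engine_b, engine_c])
```

Restricting the set removes the impossible "8" before voting, so the remaining engines only have to agree among valid characters.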
Dynamic Field Recognition
• No fixed position is required
• If only ½ of the form is available, that ½ is still readable
• No special forms are required
• No timing tracks are necessary on the forms for OMR, but OMR results are still available at the same time
  • no cleaning of LEDs in the scanner necessary
• Robust against vertical / horizontal stretching or shrinking (e.g. different printers)
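
One way such robustness against shifting and stretching could work is to locate two known features on the scanned page and derive a separate scale and offset for each axis. This is an assumption made for illustration, not a description of the vendor's algorithm; `fit_axis`, `locate_field` and all coordinates are hypothetical.

```python
def fit_axis(template_a, template_b, found_a, found_b):
    """Return (scale, offset) such that found = scale * template + offset."""
    scale = (found_b - found_a) / (template_b - template_a)
    return scale, found_a - scale * template_a

def locate_field(field, anchors_template, anchors_found):
    """Map a field position from the form template onto the scanned page,
    using two anchor features that were found on both."""
    (tx1, ty1), (tx2, ty2) = anchors_template
    (fx1, fy1), (fx2, fy2) = anchors_found
    sx, ox = fit_axis(tx1, tx2, fx1, fx2)
    sy, oy = fit_axis(ty1, ty2, fy1, fy2)
    x, y = field
    return sx * x + ox, sy * y + oy

# The scan is shifted 40 pixels to the left and stretched vertically by 25%.
template_anchors = [(100, 100), (900, 1300)]
scanned_anchors = [(60, 135), (860, 1635)]
position = locate_field((500, 700), template_anchors, scanned_anchors)
```

Because the mapping is computed from features on the page itself, the form needs no timing tracks and no fixed position in the scanner.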
Dynamic Field Recognition
• Recognizes:
  • features (word as pixel cloud)
  • boxes,
  • lines and
  • symbols
Hardware- / Software - Requirement
• Hardware
  • Scanner
  • PC
  • Network
  • Disc storage only necessary if images are needed for audit purposes
• Software
  • Scan software
  • One recognition and voting software for OMR, OCR, ICR, Barcode
OMR
Cost Comparatives in general

                                OMR from image   Dedicated OMR Scanner
Forms Design                    Same             Same
Forms Production                -                Up to 50% more
Enumerator Training             -                Up to double the cost
Scanners                        -                Up to double the cost
PC                              Low cost PC      Low cost PC
PC Operators                    Same             Same
Servers                         Same             Same
Cost of more/new flexibility    low              high
ICR Advantages
• Better than:
  • Manual keying
    • 90% (plus) correct keys: manual = higher substitution rate than automated recognition
    • Time consuming
    • Deliberate manipulation possible
  • OMR, because OMR is space consuming
  • OCR, because OCR is machine written and therefore of limited use
ICR Advantages
• Clear accuracy for OMR because of dirt removal by software, depending on the mark size and figure
• Can detect line and can ignore dirt
• Clear result
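
The idea of ignoring dirt based on mark size can be sketched with a simple fill-ratio test on each answer box. This assumes a binary crop of the box (1 = ink); the name `is_marked` and the 40% threshold are illustrative assumptions, not values from the presentation.

```python
def is_marked(box, min_fill=0.4):
    """Treat a box as marked only if enough of its area is covered by ink."""
    ink = sum(sum(row) for row in box)
    area = len(box) * len(box[0])
    return ink / area >= min_fill

dirt = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
mark = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
dirt_result = is_marked(dirt)
mark_result = is_marked(mark)
```

A single speck covers far less than the threshold and is ignored, while a deliberately filled box is accepted.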
ICR Advantages
• Barcode,
• OCR,
• OMR,
• and ICR
recognition with one software
ICR Advantages
• Pro:
  • Only rejected characters/fields need correction; rest of the form untouched
  • With new technologies open for future: faster, better quality
  • With standardized correction mode
  • Handwriting of the corresponding country will be recognized
• The previously mentioned advantages do not have to be repeated here again
Thank you for your attention