
How did we get to where we are in DIA and OCR, and where are we now?
George Nagy
Professor Emeritus, RPI
How did we get to where we are in DIA and OCR, and where are we now?
George Nagy, DocLab†, RPI
Abstract
Some properties of pattern classifiers used for character recognition are presented.
The improvements in accuracy resulting from exploiting various types of context –
language, data, shape – are illustrated. Further improvements are proposed through
adaptation and field classification. The presentation is intended to be accessible,
though not necessarily interesting, to those without a background in digital image
analysis and optical character recognition.
Last week (SPIE - DRR)
• Interactive verification of table ground truth
• Calligraphic style recognition
• Asymptotic cost of document processing
Today
• Classifiers
– pattern recognition and machine learning
• Context
– Data
– Language
• Style (shape context)
– Intra-class (adaptation)
– Inter-class (field classification)
Tomorrow
“Three problems that, if solved, would make an
impact on the conversion of documents to
symbolic electronic form”
1. Feature design for OCR
2. Integrated segmentation and recognition
3. Green interaction
The beginning
(almost) 1953
Shepard’s 9 features
(slits on a Nipkow disk)
CLASSIFIERS
A pattern is a vector whose elements represent numerical observations of some object.
Each object can be assigned to a category or class.
The label of the pattern is the name of the object's category.
Classifiers are machines or algorithms.
Input: one or more patterns
Output: label(s) of the pattern(s)
Example (average RGB, perimeter)

Object    Pattern             Label
(image)   (22, 75, 12, 33)    aspen
(image)   (14, 62, 24, 49)    sycamore
(image)   (40, 55, 17, 98)    maple
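As a concrete illustration of these definitions (a minimal sketch, not part of the talk), the following Python fragment labels a new pattern with the label of its nearest training pattern, using the feature values from the table above; nearest-neighbor is just one of the classifiers surveyed later.

import math

# Training patterns: (average R, average G, average B, perimeter) -> label.
# Feature values are the ones on the slide; the objects themselves were images.
training = [
    ((22, 75, 12, 33), "aspen"),
    ((14, 62, 24, 49), "sycamore"),
    ((40, 55, 17, 98), "maple"),
]

def classify(pattern):
    """Assign the label of the nearest training pattern (1-NN)."""
    return min(training, key=lambda item: math.dist(item[0], pattern))[1]

print(classify((20, 70, 15, 40)))   # -> 'aspen'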
Traditional open-loop OCR System
[Flowchart: a training set supplies patterns and labels to parameter estimation, governed by
meta-parameters (e.g. regularization, estimators), which produces the classifier parameters.
The CLASSIFIER applies these parameters to operational data (features); its output labels form
the transcript, and rejects are routed to correction and reject entry.]
Classifiers and Features
• OCR classifiers operate on feature vectors that are
either the pixels of the bitmaps to be classified, or
values computed by a class-independent
transformation of the pixels.
• The dimensionality of the feature space is usually
lower than the dimensionality of the pixel space.
• Features should be invariant to aspects of the
patterns that don’t affect classification
(position, size, stroke-width, color, noise).
Representation
[Figure: a feature space of two features (x1, x2), with samples of two classes (X and O),
equiprobability contours, and a decision boundary between the classes.]
Some classifiers
• Gaussian (linear, quadratic)
• Bayes
• Multilayer neural network
• Simple perceptron
• Nearest neighbor
• Support vector machine
Nonlinear vs. Linear classifiers
x and y are features; A and B are classes
2-D quadratic classifier:
Iff ax + by + cx^2 + dy^2 + exy + f > 0, then (x, y) ∈ A

5-D linear classifier, with s = x, t = y, u = x^2, v = y^2, w = xy:
Iff as + bt + cu + dv + ew + f > 0, then (x, y) ∈ A
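The point of the 5-D rewrite can be checked mechanically. The sketch below (with arbitrary illustrative coefficients, not from the talk) shows that the quadratic rule in (x, y) and the linear rule in the lifted features (s, t, u, v, w) always agree.

# A quadratic classifier in (x, y) is a linear classifier in the expanded
# feature space (s, t, u, v, w) = (x, y, x^2, y^2, xy).
a, b, c, d, e, f = 1.0, -2.0, 0.5, 0.5, 0.0, -1.0   # illustrative values

def quadratic_decision(x, y):
    return a*x + b*y + c*x**2 + d*y**2 + e*x*y + f > 0   # True -> class A

def lift(x, y):
    return (x, y, x**2, y**2, x*y)

def linear_decision(features):
    s, t, u, v, w = features
    return a*s + b*t + c*u + d*v + e*w + f > 0           # same rule, now linear

x, y = 1.5, 0.3
assert quadratic_decision(x, y) == linear_decision(lift(x, y))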
Quadratic to linear decision boundary
Map u = x, v = x^2.
[Figure: along the original feature x, the quadratic boundary ax^2 > c separates the X samples
from the O samples; in the (u, v) plane the same boundary, bv > c, is linear.]
SVMs, like many other classifiers, need only dot products
to compute the decision boundary from training samples.
SUPPORT VECTOR MACHINE (V. Vapnik)
SVMs, like many other classifiers, need only dot products to compute the decision boundary
from training samples, via the “Kernel Trick”.
[Figure: a kernel-induced transformation x → v = (y, z) maps the X and 0 samples into a
space where they are linearly separable.]
The transformation is only implicit: max min { f(vi · vj) } by QP.
Mercer's theorem: vi · vj = K(xi, xj), i.e., compute dot products in the
high-dimensional space via kernels in the low-dimensional space.
Resists over-training.
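A small numeric check of the kernel trick (an illustrative sketch, not the talk's example): for the map phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2), the dot product phi(a)·phi(b) equals the kernel K(a, b) = (a·b)^2, so a classifier that needs only dot products never has to form phi explicitly.

import numpy as np

def phi(x):
    # Explicit map into the higher-dimensional space.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(a, b):
    # Polynomial kernel computed entirely in the low-dimensional space.
    return np.dot(a, b) ** 2

a = np.array([0.4, -1.2])
b = np.array([2.0, 0.5])
print(np.dot(phi(a), phi(b)))   # explicit high-dimensional dot product
print(K(a, b))                  # same value (0.04), via the kernel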
Some classifier considerations
• Linear decision functions are easier to optimize
• Classes may not be linearly separable
• Estimation of decision boundary parameters in higher-dimensional spaces needs more training samples
• The training set is often not representative of the test set
• The size of the training set is always too small
• Outliers that don’t belong to any class cause problems
Bayes error
is unreachable without an infinite training set;
is zero unless some patterns belong to more than one class.
CLASSIFIER BIAS AND VARIANCE
[Figure: two different training sets of X and 0 samples yield different estimated decision
boundaries around the true decision boundary.]
[Plot: error vs. number of training samples for a complex classifier and a simple classifier,
each decomposed into bias (dashed) and variance (dotted) components.]
CLASSIFIER BIAS AND VARIANCE DON'T ADD!
Any classifier can be shown to be better than any other.
CONTEXT
Assigned label, part-of-speech, meaning, value,
shape, style, position, color, ...
of other character/word/phrase patterns
Data context
There is no February 30
Legal amount = courtesy amount (bank checks)
Total = sum of values
Date of birth < date of marriage ≤ date of death
Data frames:
email addresses, postal addresses, telephone numbers,
$ amounts, chemical formulas (C20H14O4), dates, units
(m/s2, m·s−2, or m s−2), library catalog numbers (978-0-306-40615-7,
LB2395.C65 1991, 823.914), copyrights, license plates, ...
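A sketch of how data context can be enforced in software (the regular expressions and field formats below are illustrative assumptions, not a production data-frame library): OCR output that violates the format or the semantics of its field is rejected for correction.

import re
from datetime import datetime

DATE = re.compile(r"^\d{2}/\d{2}/\d{4}$")            # e.g. 30/02/2012
AMOUNT = re.compile(r"^\$\d{1,3}(,\d{3})*\.\d{2}$")  # e.g. $1,234.56

def valid_date(s):
    """Check both the format and the calendar: there is no February 30."""
    if not DATE.match(s):
        return False
    try:
        datetime.strptime(s, "%d/%m/%Y")
        return True
    except ValueError:
        return False

print(valid_date("30/02/2012"))          # False: the day does not exist
print(bool(AMOUNT.match("$1,234.56")))   # True: plausible courtesy amount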
Language models
Letter n-grams
unigrams: e (12.7%); z (0.07%)
bigrams: th (3.9%), he, in, nt, ed, en (1.4%)
trigrams: the (3.5%), and, ing, her, hat, his, you, are
Lexicon (stored as a trie, hash table, or n-grams)
(20K-100K words – three times as many for Italian as for English)
domain-specific: chemicals, drugs, family/first names,
geographic gazetteers, business directories,
abbreviations and acronyms, ...
Syntax – probabilistic rather than rule-based grammars
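A letter n-gram model of the kind listed above can be estimated in a few lines. In the sketch below the corpus is only a placeholder; any large English text would stand in for it.

from collections import Counter

corpus = "the quick brown fox jumped over the tired dog".replace(" ", "")

unigrams = Counter(corpus)
bigrams = Counter(corpus[i:i+2] for i in range(len(corpus) - 1))

total = sum(unigrams.values())
print(unigrams["e"] / total)     # relative frequency of 'e' in this tiny sample
print(bigrams.most_common(3))    # most frequent letter pairs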
Post-processing
Hanson, Riseman, & Fisher 1975:
Contextual postprocessor (based on n-grams)
• Current OCR systems generate multiple candidates for each word.
• They use a lexicon (with or without frequencies) or n-grams to select the top candidate word.
• Language-independent OCR engines have higher error rates. Context beyond the word level is seldom used.
• Entropy of English text, estimated from letters, bigrams, ..., 100-grams:
  Shannon 1951: 4.16, 3.56, 3.3, ... → 1.3 bits/char
  Brown et al. 1992: < 1.75 bits/char (used 5 x 10^8 tokens)
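A minimal sketch of lexicon-based post-processing (the candidate words and frequencies are invented for illustration): the OCR engine proposes several strings per word and the language model picks the most plausible one. Real systems also weight the classifier's own confidences and n-gram scores.

lexicon = {"modern": 5.1e-5, "modem": 9.0e-6, "modexn": 0.0}   # word frequencies

def pick(candidates):
    """Return the candidate with the highest lexicon frequency."""
    return max(candidates, key=lambda w: lexicon.get(w, 0.0))

print(pick(["modexn", "modem", "modern"]))   # -> 'modern'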
Probabilistic context algorithms
• Markov Chains
• Hidden Markov Models (HMM)
• Markov Random Fields (MRF)
• Bayesian Networks
• Cryptanalysis
Raviv 1967
No context:        P(C | x) ~ P(x | C)
No features:       P(C | x) ~ P(C)
0th-order Markov:  P(C | x) ~ P(x | C) P(C)   (prior)
1st-order:         P(C | x1, ..., xn) ~ P(xn | C) P(C | x1, ..., xn-1)
(This is an iterative calculation.)
Raviv estimated letter bigram and trigram
probabilities from 8 million characters.
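The iterative calculation can be sketched as a forward recursion over the text (a toy illustration, not Raviv's exact formulation; the classes, bigram transition probabilities, and likelihoods below are placeholders):

# P(C_n | x_1..x_n)  ~  P(x_n | C_n) * sum_C' P(C_n | C') P(C' | x_1..x_{n-1})
classes = ["a", "b"]
bigram = {("a", "a"): 0.3, ("a", "b"): 0.7,    # P(next class | previous class)
          ("b", "a"): 0.6, ("b", "b"): 0.4}

def update(prev_posterior, likelihood):
    """One character: combine shape evidence with the bigram-propagated prior."""
    post = {}
    for c in classes:
        prior_c = sum(bigram[(p, c)] * prev_posterior[p] for p in classes)
        post[c] = likelihood[c] * prior_c
    z = sum(post.values())
    return {c: v / z for c, v in post.items()}

posterior = {"a": 0.5, "b": 0.5}                                  # uniform start
for likelihood in [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]:   # P(x_n | C)
    posterior = update(posterior, likelihood)
print(posterior)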
Error–Reject curves show improvement in
classification accuracy
Estimating rare n-grams (like ...rta...)
(smoothing, regularization)
The quick brown fox jumped over the tired dog

P(e) = 5/39 = 13%   (English: 12.7%)
P(u) = 2/39 =  5%   ( 2.8%)
P(a) = 0/39 =  0%   ( 8.2%)

Laplace's Law of Succession for
k instances from N items of m classes:
P^(x) = (k+1) / (N+m)  →  P^(a) = (0+1) / (39+26) = 1.5%

Next approximation: (k+a) / (N+ma), which also sums to unity
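The two estimators above in code (a direct transcription of the formulas, with the slide's numbers as the test case):

def laplace(k, N, m):
    """Laplace's Law of Succession: (k+1) / (N+m)."""
    return (k + 1) / (N + m)

def additive(k, N, m, a):
    """Generalized additive smoothing: (k+a) / (N+m*a); a=1 gives Laplace."""
    return (k + a) / (N + m * a)

print(round(laplace(0, 39, 26), 4))        # 0.0154 -> about 1.5 %
print(round(additive(0, 39, 26, 0.5), 4))  # 0.0096 -> smaller pseudo-count, smaller estimate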
HIDDEN MARKOV MODELS: A or B? (3 states, 2 features)
MODEL A: three states (1, 2, 3) with per-state feature parameters (0.2, 0.3), (0.3, 0.6), (0.7, 0.8)
MODEL B: three states (1, 2, 3) with per-state feature parameters (0.3, 0.1), (0.5, 0.4), (0.4, 0.8)
Observed feature sequence over time: (0,1) (0,0) (0,1) (1,1), scored under each model.
TRAINING: joint probabilities via Baum-Welch Forward-Backward (EM)
HMM situated between Markov chains & Bayesian networks
Find joint posterior probability of sequence of feature vectors
States are invisible; only features can be observed
Transitions may be unidirectional or not, skip or not.
States may represent partial characters, characters, words,
or separators and merge-indicators
Observable features are window-based vectors
HMMs can be nested (e.g. character HMM in word HMM)
Training: estimate state-conditional feature distributions and
state transition probabilities
via the (complex) Baum-Welch forward-backward EM algorithm
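The "Model A or Model B?" question two slides back is answered by the forward algorithm, which computes P(observations | model) for each candidate model. The sketch below uses a discrete-output HMM with invented parameters (the slide's emission parameters are feature means, which would call for Gaussian emissions instead):

def forward(obs, pi, A, B):
    """P(obs | model): pi = initial probs, A = transitions, B = emissions."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * A[p][s] for p in range(n)) * B[s][o]
                 for s in range(n)]
    return sum(alpha)

obs = [0, 1, 0, 0]                        # observation symbols
pi = [1.0, 0.0]                           # always start in state 0
model_A = ([[0.7, 0.3], [0.0, 1.0]],      # state transition probabilities
           [[0.9, 0.1], [0.2, 0.8]])      # P(symbol | state)
model_B = ([[0.3, 0.7], [0.5, 0.5]],
           [[0.5, 0.5], [0.5, 0.5]])

for name, (A, B) in [("A", model_A), ("B", model_B)]:
    print(name, forward(obs, pi, A, B))   # choose the model with the higher likelihood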
Widely used for script
E.g.:
Using a Hidden-Markov Model in Semi-Automatic
Indexing of Historical Handwritten Records
Thomas Packer, Oliver Nina, Ilya Raykhel
Computer Science, Brigham Young University
FHTW 2009
Markov Random Fields ( ~ 2000 for OCR)
• Maximize an energy function
• Models more complex relationships within
cliques of features
• Applied so far mainly to on- and off-line Chinese
and Japanese handwriting
• More widely used in computer vision
Bayesian Networks (J. Pearl, > 1980)
• Directed acyclic graphical model
• Models cliques of dependencies between variables
• Nodes are variables, edges represent dependencies
• Often designed by intuition instead of structure learning
• Learning and inference algorithms (message passing)
• Applied to document classification
Inter-pattern Class Dependence
(Linguistic Context)
[Figure: the same feature sequence labeled with the class sequences
G E O N G E  vs.  G E O R G E – linguistic context selects the correct labels.]
Inter-pattern Class-Feature Dependence
Inter-pattern Feature Dependence
(Order-dependent: Ligatures, Co-articulation)
Inter-pattern Feature Dependence
(order-independent: Style)
The shape of the ‘w’ depends on the shape of the ‘v’
OCR via decoding a substitution cipher
Cluster the bitmaps of an unknown sentence; each cluster receives an arbitrary ID (1, 2, 5, ...).
Cipher text: 1 2 . 2 . 2 . . 2 5 2 . . 5 2 . 5
A DECODER driven by a language model (n-gram frequencies, lexicon, ...) maps cluster IDs to
letters: 1 → a, 2 → n, 5 → e.
[Nagy & Casey, 1966; Nagy & Seth, 1987; Ho & Nagy, 2000. Thanks to J.J. Hull.]
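A crude version of the cluster-and-decode idea (only a frequency-based first guess; the decoders cited above search with n-gram and lexicon constraints): assign letters to cluster IDs by matching frequency ranks.

from collections import Counter

cipher = [1, 2, 7, 2, 3, 2, 4, 9, 2, 5, 2, 6, 8, 5, 2, 3, 5]   # cluster IDs (illustrative)
english_by_frequency = "etaoinshrdlcumwfgypbvkjxqz"

ranked_ids = [cid for cid, _ in Counter(cipher).most_common()]
key = {cid: english_by_frequency[rank] for rank, cid in enumerate(ranked_ids)}

print("".join(key[cid] for cid in cipher))   # zeroth-order decode of the cipher text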
An unusual typeface
[Ho, Nagy 2000]
Text printed with Spitz glyphs
Decoded text
chapter i _ bee_inds
_all me ishmaels some years ago__never mind
how long precisely __having little or no money
in my purses and nothing particular to interest
me on shores i thought i would sail about a
little and see the watery part of the worlds it is
a way i have ...
Ground truth (GT):
chapter I 2 LOOMINGS
Call me Ishmael. Some years ago – never mind
how long precisely – having little or no money in
my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see
the watery part of the world. It is a way I have ...
STYLE-CONSTRAINED
CLASSIFICATION
(AKA style-conscious or style-consistent classification)
Inter-pattern Feature Dependence
(Style)
Style-constrained field classification
Intra-class and inter-class style
(aka weak and strong style)
INTRA-CLASS
INTER-CLASS
With thanks to Prof. C-L Liu
Adaptation for intra-class style, field classification for inter-class style
Adaptation and Style
[Diagram of training/test set relationships:
(1) representative training set
(2) adaptable (long test fields)
(3) discrete styles
(4) continuous styles (short test fields)
(5) weakly constrained]
Adaptive algorithms
cf.:
stochastic approximation (Robbins-Monro algorithm)
self-training, self-adaptation, self-correction,
unsupervised / semi-supervised learning,
transfer learning, inductive transfer,
co-training, decision-directed learning, ...
Traditional open-loop OCR System
[Flowchart, as in the earlier open-loop diagram: a training set of patterns and labels feeds
parameter estimation (with meta-parameters), which produces the classifier parameters; the
CLASSIFIER labels operational data (bitmaps) to produce the transcript, with rejects routed
to correction and reject entry.]
Supervised learning
Generic OCR System that makes use of
post-processed rejects and errors
[Flowchart: the same open-loop system, except that keyboarded labels of the rejects and
errors found during correction and reject entry are fed back into the training set for
parameter estimation.]
Adaptation (Jain, PAMI 00: “Decision directed classifier”)
Field estimation, singlet classification
[Flowchart: the same system, except that classifier-assigned labels are fed back into the
training set for parameter estimation.]
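The feedback loop of the last two slides can be sketched in a few lines (synthetic one-dimensional data and nearest-mean classification, purely for illustration): classify the operational data with the initial omnifont means, re-estimate the class means from the classifier-assigned labels, then classify again.

import numpy as np

rng = np.random.default_rng(0)
omnifont_means = {"1": np.array([0.0]), "7": np.array([3.0])}

# Single-font operational data: both classes shifted by the same style offset.
data = np.concatenate([rng.normal(1.0, 0.4, 50),
                       rng.normal(4.0, 0.4, 50)])[:, None]

def classify(x, means):
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

labels = [classify(x, omnifont_means) for x in data]            # initial pass

adapted_means = {c: np.mean([x for x, l in zip(data, labels) if l == c], axis=0)
                 for c in omnifont_means}                       # decision-directed update
labels = [classify(x, adapted_means) for x in data]             # adapted pass
print(adapted_means)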
Self-corrective recognition (@IBM, IEEE IT 1966)
(hardwired features and reference comparisons)
[Flowchart: source document → scanner → feature extractor → categorizer. The categorizer
compares the extracted features against the initial references; accepted characters drive a
reference generator that supplies new references back to the categorizer, and the remaining
characters are rejected.]
Results: self-corrective recognition
(Shelton & Nagy 1966)
22,500 patterns
Training set:  9 fonts,  500 characters/font, U/C
Test set:     12 fonts, 1500 characters/font, U/C
96 n-tuple features, ternary reference vectors

                         Error   Reject
Initial:                 3.5%    15.2%
After self-correction:   0.7%     3.7%
Self-corrective classification
[Figure: feature-space plot of '1' and '7' samples from a single font; the omnifont
classifier's original decision boundary shifts as the classifier adapts to that font.]
Adaptive classification
Baird skeptical! Results (DR&R 1994)
100 fonts, 80 symbols each from Baird’s defect model (6,400,000 characters)
Size (pt)   Error reduction   % fonts improved   Best     Worst
6           x 1.4             100                x 4      x 1.0
10          x 2.5             93                 x 11     x 0.8
12          x 4.4             98                 x 34     x 0.9
16          x 7.2             98                 x 141    x 0.8

His conclusion: a good investment – large potential for gain, low downside risk.
Results: adapting both means and variances
(Harsha Veeramachaneni 2003; IJDAR 2004)
NIST Hand-printed digit classes, with 50 “Hitachi features”

                  % Error
Train      Test   Before   Adapt means   Adapt variance
SD3        SD3    1.1      0.7           0.6
           SD7    5.0      2.6           2.2
SD7        SD3    1.7      0.9           0.8
           SD7    2.4      1.6           1.7
SD3+SD7    SD3    0.9      0.6           0.6
           SD7    3.2      1.9           1.8
Examples from NIST dataset
Writer Adaptation by Style Transfer
for many-class problems (Chinese script)
• Patterns of single-writer field transformed to style-free
space using style transfer matrix (STM)
• Supervised adaptation: learning STM from labeled samples
• Unsupervised adaptation: learning STM from test samples
Zhang & Liu, Style transfer matrix learning for writer adaptation, CVPR, 2011
• STM Formulation
– Source point set and target point set
– Objective and solution (given in Zhang & Liu, CVPR 2011)
• Application to Writer Adaptation
– Source point set: writer-specific data
– Target point set: parameters in the basic classifier
Field classifiers and Adaptive classifiers
• A field classifier classifies consecutive finite-length
fields of the test set.
– Subsequent patterns do not benefit from knowledge
gained in earlier patterns. Exploits inter-class style.
• An adaptive classifier is a field classifier with a field
that encompasses an entire (isogenous) test set.
– The last pattern benefits from information from the first
pattern. The first pattern benefits from information from
the last pattern. Exploits only intra-class style.
Field Classifier for Discrete Styles
(Prateek Sarkar ICPR 2000, PAMI `05)
Optimal for multimodal feature distributions from
weighted Gaussians
m-dimensional feature vector, n-pattern field, s styles
field-class c* = (c1, c2, ......, cn)
= f (style means and covariances)
Parameters estimated via Expectation Maximization
For a field of n characters from m classes there are m^n ordered and
(m+n-1)! / (n! (m-1)!) unordered field classes (see the sketch below).
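The two counts are easy to compute; for example, for digit fields (m = 10) of length n = 4 (illustrative values, not from the talk):

from math import comb

m, n = 10, 4
print(m ** n)              # 10000 ordered field classes
print(comb(m + n - 1, n))  # 715 unordered field classes, = (m+n-1)!/(n!(m-1)!)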
Example: 2 classes, 2 styles, field length = 2
[Figure: singlet and style-constrained field classification boundaries
(2 classes, 2 styles, 1 Gaussian feature per pattern), plotted against the
first and second pattern of the field.]
Top-style: a computable approximation
(with style-unsupervised estimation via EM)
[Figure compares the style-conscious, top-style, and singlet classifier decision boundaries.]

Experiments:
• Digits of six fonts
• Moment features
• Training fields of L=13; 14,430 patterns
• Test fields of L=2, 4
Field Classifier for Continuous Styles
(H. Veeramachaneni IJDAR `04, PAMI `05, `07)
With Gaussian features, feature distribution for any field length
can be computed from class means and
class-pair-conditional feature cross-covariance matrices
Style-conscious quadratic discriminant field classifier
• Class-style means are normally distributed about the class means
• Σk is the singlet class-conditional feature covariance matrix
• Cij is the class-pair-conditional cross-covariance matrix,
estimated from pairs of same-source singlet class means
• SQDF approximates the optimal discrete-style field classifier
well when the inter-class distance >> style variation
• Inter-class style is order-independent:
P(5 7 9 | [x1 x2 x3]) = P(7 5 9 | [x2 x1 x3])
+ SQDF avoids the Expectation Maximization of the discrete-style method
- Supralinear in the number of features and classes, because
the N-pattern field covariance matrix is of size (N x d)^2
and for M classes there are (M+N-1)! / (N! (M-1)!) matrices
Example of continuous-style feature distributions
Two classes, one feature
Results: style-constrained classification - short fields
Continuous style-constrained classifier,
trained on ~ 17,000 characters and tested on ~17,000 characters.
25 top principal component “Hitachi” blurred directional features.
Field error rate (%)

Test data   L=2, w/o style   L=2, with style   L=5, w/o style   L=5, with style
SD3         1.4              1.3               3.0              2.5
SD7         2.7              2.4               5.3              4.5
Field-trained (i.e. word) classification vs.
style-constrained classification
Field length = 4

Training set for field classification:  0000, 0001, 0010, ..., 9998, 9999   (10^4 classes)
Training set for style classification:  00, 01, 02, ..., 98, 99   (10^2 classes, with order)

Classifier parameters for longer field lengths are computed from the pair parameters
(because Gaussian variables are completely defined by their covariances).
Style context versus Linguistic context
Two digits in an isogenous field:  ... 5 6 ...
with feature vectors x, y and class labels 5, 6

Language context:    P(x y | 5 6) ≠ P(y x | 6 5)
Style:
  Intra-class style: P(x y | 5 5) ≠ P(x | 5) P(y | 5)
  Inter-class style: P(x y | 5 6) = P(y x | 6 5) ≠ P(x | 5) P(y | 6)
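A toy numeric check of the inter-class style relations above (unit-variance Gaussian features, two equally likely styles; the class-and-style means are arbitrary illustrative values): the style-consistent field density is symmetric under swapping the (pattern, label) pairs, but it does not factor into singlet densities.

from math import exp, pi, sqrt

def g(x, mu):                        # unit-variance Gaussian density
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2 * pi)

mu = {("5", "A"): 0.0, ("5", "B"): 2.0,    # class-and-style means
      ("6", "A"): 5.0, ("6", "B"): 7.0}

def p_field(x, y, cx, cy):           # style-consistent field density
    return 0.5 * (g(x, mu[(cx, "A")]) * g(y, mu[(cy, "A")]) +
                  g(x, mu[(cx, "B")]) * g(y, mu[(cy, "B")]))

def p_singlet(x, c):                 # style-marginalized singlet density
    return 0.5 * (g(x, mu[(c, "A")]) + g(x, mu[(c, "B")]))

x, y = 1.0, 6.0
print(p_field(x, y, "5", "6"), p_field(y, x, "6", "5"))                 # equal
print(p_field(x, y, "5", "6"), p_singlet(x, "5") * p_singlet(y, "6"))   # not equal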
Weakly-constrained data
Given p(x), find p(y), where y = g(x)
[Figure: 3 classes, 4 multi-class styles; training vs. test distributions.]
Recommendations for OCR systems that improve with use
Never let the machine rest:
design it so that it puts every coffee-break to good use.
Don’t throw away edits (corrected labels): use them.
Classify style-consistent fields, not characters:
adapt on long fields, exploit inter-class style in short fields.
Use order rather than position.
Let the machine guess: lazy decisions.
Make use of all possible contexts: style, language, shape,
layout, structure, and function.
Please help to increase computer literacy!
Thank you!
http://www.ecse.rpi.edu/~nagy/
Classification of style-constrained pattern-fields
Prateek Sarkar (sarkap@rpi.edu) and George Nagy (nagy@ecse.rpi.edu)
Rensselaer Polytechnic Institute, U.S.A.
Introduction
Style is a manner of rendering patterns. Patterns are rendered in many different styles.
a a a a a a a a a
A field is a group of patterns with a common origin (isogenous patterns).
Style consistency constraint: Patterns in a field are rendered in the same style.
Objective
Modeling style consistency can help improve classification accuracy.

Classifier:  (c1*, c2*) = argmax over all (c1 c2) of  p(x1 x2 | c1 c2) P[c1 c2]

Singlet model (two classes, e.g. 1 and 7, and two equally likely styles A and B):
p(x1 x2 | 1 7) = p(x1 | 1) × p(x2 | 7)
  = [ .5 p(x1 | 1,A) + .5 p(x1 | 1,B) ] × [ .5 p(x2 | 7,A) + .5 p(x2 | 7,B) ]
  = .25 [ p(x1 | 1,A) p(x2 | 7,A) + p(x1 | 1,A) p(x2 | 7,B)
        + p(x1 | 1,B) p(x2 | 7,A) + p(x1 | 1,B) p(x2 | 7,B) ]

Style consistency model:
p(x1 x2 | 1 7) = .5 p(x1 x2 | 1 7, A) + .5 p(x1 x2 | 1 7, B)
  = .5 [ p(x1 | 1,A) p(x2 | 7,A) + p(x1 | 1,B) p(x2 | 7,B) ]

Unsupervised style estimation: The patterns in a field and their class labels are observed,
but the style of the field is unobserved. Apply the EM algorithm to estimate the model
parameters.

Results
Application to recognition of digit fields: a 15-25% relative reduction in errors was
observed in laboratory experiments on handprinted digit recognition.
Improvement in accuracy was greater for longer fields.
[Figure: simulation of a two-class, two-style problem with unit-variance Gaussian
distributions; decision regions for the field labels 11, 17, 71, 77 under the singlet
model and under the style consistency model.]

Reference: Prateek Sarkar. Style consistency in pattern fields. PhD thesis,
Rensselaer Polytechnic Institute, U.S.A., 2000.
Prateek Sarkar, August 2000 for ICPR, Barcelona, September 2000