How did we get to where we are in DIA and OCR, and where are we now?
George Nagy, Professor Emeritus, DocLab, RPI
3/2/2012

Abstract
Some properties of pattern classifiers used for character recognition are presented. The improvements in accuracy that result from exploiting various types of context – language, data, shape – are illustrated. Further improvements are proposed through adaptation and field classification. The presentation is intended to be accessible, though not necessarily interesting, to those without a background in digital image analysis and optical character recognition.

Last week (SPIE – DRR)
• Interactive verification of table ground truth
• Calligraphic style recognition
• Asymptotic cost of document processing

Today
• Classifiers – pattern recognition and machine learning
• Context
  – Data
  – Language
• Style (shape context)
  – Intra-class (adaptation)
  – Inter-class (field classification)

Tomorrow
"Three problems that, if solved, would make an impact on the conversion of documents to symbolic electronic form"
1. Feature design for OCR
2. Integrated segmentation and recognition
3. Green interaction

The beginning (almost): 1953

Shepard's 9 features (slits on a Nipkow disk)
[Figure: Shepard's reading machine.]

CLASSIFIERS
A pattern is a vector whose elements represent numerical observations of some object. Each object can be assigned to a category or class. The label of the pattern is the name of the object's category.
Classifiers are machines or algorithms:
  Input: one or more patterns
  Output: label(s) of the pattern(s)

Example: features = (average R, G, B, perimeter)
  Object (leaf image)   Pattern            Label
  [image]               (22, 75, 12, 33)   aspen
  [image]               (14, 62, 24, 49)   sycamore
  [image]               (40, 55, 17, 98)   maple

Traditional open-loop OCR system
[Block diagram: the training set (patterns and labels) and meta-parameters (e.g., regularization, estimators) feed parameter estimation, which sets the classifier parameters; operational data (features) enter the CLASSIFIER; accepted labels go to the transcript, and rejects go to correction / reject entry.]

Classifiers and features
• OCR classifiers operate on feature vectors that are either the pixels of the bitmaps to be classified, or values computed by a class-independent transformation of the pixels.
• The dimensionality of the feature space is usually lower than the dimensionality of the pixel space.
• Features should be invariant to aspects of the patterns that don't affect classification (position, size, stroke width, color, noise).

Representation
[Figure: a two-feature space (x1, x2) with samples of two classes (X and O), equiprobability contours, and the decision boundary.]

Some classifiers
Gaussian, linear, quadratic, Bayes, multilayer neural network, simple perceptron, nearest neighbor, support vector machine.

Nonlinear vs. linear classifiers
x and y are features; A and B are classes.
2-D quadratic classifier: iff ax + by + cx² + dy² + exy + f > 0, then (x, y) ∈ A.
5-D linear classifier with s = x, t = y, u = x², v = y², w = xy: iff as + bt + cu + dv + ew + f > 0, then (x, y) ∈ A.

Quadratic to linear decision boundary
With the mapping u = x, v = x², the quadratic boundary ax² > c in x becomes the linear boundary bv > c in (u, v); a worked sketch follows.
[Figure: one-dimensional samples of two classes, separable only by a quadratic boundary in x, become linearly separable in the (u, v) plane.]
SVMs, like many other classifiers, need only dot products to compute the decision boundary from the training samples.
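The change of variables on the slide above can be made concrete with a minimal Python sketch. This is an editorial illustration, not part of the original deck: the sample values, class assignments, and the boundary v = 2 are invented, and numpy is assumed to be available.

```python
import numpy as np

# The 1-D problem ax^2 > c is not linearly separable in x, but becomes linearly
# separable after the explicit map x -> (u, v) = (x, x^2). Applied implicitly,
# the same idea is the SVM "kernel trick" on the next slide.

x = np.array([-3.0, -2.5, 2.6, 3.1,   # class X: large |x|
              -0.8, -0.2, 0.5, 0.9])  # class O: small |x|
labels = np.array([1, 1, 1, 1, -1, -1, -1, -1])

phi = np.column_stack([x, x**2])      # explicit feature map (u, v) = (x, x^2)

# In (u, v) space the boundary v = 2 (i.e., x^2 = 2) is linear and separates the classes.
w, b = np.array([0.0, 1.0]), -2.0
predictions = np.sign(phi @ w + b)
print(np.all(predictions == labels))  # True

# Kernel view: the dot product in the mapped space can be computed from x alone,
# phi(x1).phi(x2) = x1*x2 + (x1*x2)^2, so the map never has to be formed explicitly.
k = lambda a, b: a * b + (a * b) ** 2
print(np.isclose(k(x[0], x[1]), phi[0] @ phi[1]))  # True
```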
SUPPORT VECTOR MACHINE (V. Vapnik)
SVMs, like many other classifiers, need only dot products to compute the decision boundary from the training samples, via the "kernel trick."
[Figure: a kernel-induced transformation x → v = (y, z) maps samples that are not linearly separable in x into a higher-dimensional space where they are.]
The transformation is only implicit: maximize the minimum margin, max min { f(vi·vj) }, by quadratic programming.
Mercer's theorem: vi·vj = K(xi, xj), i.e., dot products in the high-dimensional space are computed via kernels in the low-dimensional space.
The maximum-margin formulation resists over-training.

Some classifier considerations
• Linear decision functions are easier to optimize.
• Classes may not be linearly separable.
• Estimating decision-boundary parameters in higher-dimensional spaces needs more training samples.
• The training set is often not representative of the test set.
• The size of the training set is always too small.
• Outliers that don't belong to any class cause problems.

Bayes error
The Bayes error is unreachable without an infinite training set; it is zero unless some patterns belong to more than one class.
[Figure: overlapping class-conditional distributions; the overlap is the Bayes error.]

CLASSIFIER BIAS AND VARIANCE
[Figure: two training sets drawn from the same classes with the true decision boundary, and plots of error, bias, and variance against the number of training samples for a complex and a simple classifier.]
Classifier bias and variance don't add! Any classifier can be shown to be better than any other.

CONTEXT
Assigned label, part of speech, meaning, value, shape, style, position, color, ... of other character / word / phrase patterns.

Data context
There is no February 30.
Legal amount = courtesy amount (bank checks).
Total = sum of values.
Date of birth < date of marriage ≤ date of death.
Data frames: email addresses, postal addresses, telephone numbers, dollar amounts, chemical formulas (C₂₀H₁₄O₄), dates, units (m/s², m·s⁻², or m s⁻²), library catalog numbers (978-0-306-40615-7, LB2395.C65 1991, 823.914), copyrights, license plates, ...

Language models
Letter n-grams
  unigrams: e (12.7%); z (0.07%)
  bigrams: th (3.9%), he, in, nt, ed, en (1.4%)
  trigrams: the (3.5%), and, ing, her, hat, his, you, are
Lexicon (stored as a trie, hash table, or n-grams)
  20K–100K words – about 3× as many for Italian as for English
  domain-specific: chemicals, drugs, family/first names, geographic gazetteers, business directories, abbreviations and acronyms, ...
Syntax – probabilistic rather than rule-based grammars

Post-processing
Hanson, Riseman & Fisher 1975: contextual postprocessor (based on n-grams).
• Current OCR systems generate multiple candidates for each word.
• They use a lexicon (with or without frequencies) or n-grams to select the top candidate word.
• Language-independent OCR engines have higher error rates. Context beyond the word level is seldom used.
• Entropy of English text over letters, bigrams, ..., 100-grams: 4.16, 3.56, 3.3, ..., 1.3 bits/char (Shannon 1951).
  Brown et al. 1992: < 1.75 bits/char (using 5 × 10⁸ tokens).

Probabilistic context algorithms
• Markov chains
• Hidden Markov models (HMM)
• Markov random fields (MRF)
• Bayesian networks
• Cryptanalysis

Raviv 1967
No context:       P(C|x) ~ P(x|C)
No features:      P(C|x) ~ P(C)
0th-order Markov: P(C|x) ~ P(x|C) P(C)  (prior)
1st order:        P(C|x1, ..., xn) ~ P(xn|C) P(C|x1, ..., xn−1)  (an iterative calculation; a sketch follows below)
Raviv estimated letter bigram and trigram probabilities from 8 million characters.

Error–reject curves show the improvement in classification accuracy.
[Figure: error–reject curves.]
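The first-order recursion on the "Raviv 1967" slide can be made concrete with a small sketch (an editorial addition, not Raviv's implementation): the two-letter alphabet, the shape likelihoods, and the bigram table are invented, the bigram transition term that the slide abbreviates is written out explicitly, and numpy is assumed.

```python
import numpy as np

# First-order Markov context: combine per-character shape likelihoods P(x_i | C)
# with letter bigram probabilities, updating the posterior one character at a time.

classes = ['a', 'n']
bigram = np.array([[0.1, 0.9],   # P(next | prev): 'an' is much likelier than 'aa'
                   [0.7, 0.3]])
prior = np.array([0.5, 0.5])

# Shape-only likelihoods for three characters of a noisy field; the middle one
# is ambiguous on its own.
likelihoods = np.array([[0.8, 0.2],
                        [0.5, 0.5],
                        [0.3, 0.7]])

posterior = prior * likelihoods[0]
posterior /= posterior.sum()
for lik in likelihoods[1:]:
    # P(C | x_1..x_i) ~ P(x_i | C) * sum_prev P(prev | x_1..x_{i-1}) P(C | prev)
    posterior = lik * (bigram.T @ posterior)
    posterior /= posterior.sum()
    print(dict(zip(classes, np.round(posterior, 3))))
```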
Estimating rare n-grams (like ...rta...) – smoothing, regularization
"The quick brown fox jumped over the tired dog"
  P(e) = 5/39 = 13%  (12.7% in English)
  P(u) = 2/39 = 5%   (2.8%)
  P(a) = 0/39 = 0%   (8.2%)
Laplace's Law of Succession for k instances from N items of m classes: P̂(x) = (k+1) / (N+m)
  P̂(a) = (0+1) / (39+26) = 1.5%
Next approximation: (k+a) / (N+ma), which also sums to unity.

HIDDEN MARKOV MODELS: A or B? (3 states, 2 features)
[Figure: two three-state models. Model A has state feature means (0.2, 0.3), (0.3, 0.6), (0.7, 0.8); Model B has (0.3, 0.1), (0.5, 0.4), (0.4, 0.8). Which model better explains the observed feature sequence (0,1), (0,0), (0,1), (1,1)?]
TRAINING: joint probabilities via the Baum–Welch forward–backward (EM) algorithm.

HMMs sit between Markov chains and Bayesian networks
• Find the joint posterior probability of a sequence of feature vectors.
• States are invisible; only features can be observed.
• Transitions may be unidirectional or not, skip or not.
• States may represent partial characters, characters, words, or separators and merge indicators.
• Observable features are window-based vectors.
• HMMs can be nested (e.g., a character HMM within a word HMM).
• Training: estimate the state-conditional feature distributions and the state transition probabilities with the complex Baum–Welch forward–backward EM algorithm.

Widely used for script recognition
E.g.: "Using a Hidden Markov Model in Semi-Automatic Indexing of Historical Handwritten Records," Thomas Packer, Oliver Nina, Ilya Raykhel, Computer Science, Brigham Young University, FHTW 2009.

Markov random fields (~2000 for OCR)
• Maximize an energy function.
• Model more complex relationships within cliques of features.
• Applied so far mainly to on-line and off-line Chinese and Japanese handwriting.
• More widely used in computer vision.

Bayesian networks (J. Pearl, >1980)
• Directed acyclic graphical model.
• Models cliques of dependencies between variables.
• Nodes are variables; edges represent dependencies.
• Often designed by intuition instead of structure learning.
• Learning and inference algorithms (message passing).
• Applied to document classification.

Inter-pattern class dependence (linguistic context)
[Figure: class and feature nodes for the letters of GEORGE; dependence links connect the classes of neighboring patterns.]

Inter-pattern class–feature dependence
[Figure.]

Inter-pattern feature dependence (order-dependent: ligatures, co-articulation)
[Figure.]

Inter-pattern feature dependence (order-independent: style)
The shape of the 'w' depends on the shape of the 'v'.
[Figure.]

OCR via decoding a substitution cipher
Cluster the bitmaps of an unknown sentence and replace each bitmap by its cluster index, producing a cipher text (e.g., 1 2 . 2 . 2 . . 2 5 2 . . 5 2 . 5). A DECODER driven by a LANGUAGE MODEL (n-gram frequencies, lexicon, ...) then assigns a letter to each cluster (e.g., 1 → a, 2 → n, 5 → e); a toy sketch appears after the decoded-text example below.
[Nagy & Casey 1966; Nagy & Seth 1987; Ho & Nagy 2000. Thanks to J. J. Hull.]

An unusual typeface [Ho, Nagy 2000]
[Figure.]

Text printed with Spitz glyphs
[Figure.]

Decoded text
chapter i _ bee_inds _all me ishmaels some years ago__never mind how long precisely __having little or no money in my purses and nothing particular to interest me on shores i thought i would sail about a little and see the watery part of the worlds it is a way i have ...

GT (ground truth)
chapter I 2 LOOMINGS Call me Ishmael. Some years ago – never mind how long precisely – having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have ...
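Here is the toy sketch of the cipher-decoding idea promised above (an editorial addition): cluster indices stand in for letters and a language model proposes an assignment. The cipher text and the frequency table are toy data, a real decoder scores candidates with bigram/trigram frequencies or a lexicon rather than unigram rank alone, and Python is assumed.

```python
from collections import Counter

# Substitution-cipher view of OCR: each bitmap is replaced by its cluster label;
# the decoder maps cluster labels to letters using only language statistics.

english_by_frequency = "etaoinshrdlcumwfgypbvkjxqz"   # most to least frequent

cipher = [1, 2, 5, 5, 3, 1, 2, 4, 1, 2, 5]            # cluster label of each bitmap
rank = [cid for cid, _ in Counter(cipher).most_common()]

# Map the most frequent cluster to 'e', the next to 't', and so on.
assignment = {cid: english_by_frequency[i] for i, cid in enumerate(rank)}
decoded = "".join(assignment[cid] for cid in cipher)
print(assignment, decoded)
```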
STYLE-CONSTRAINED CLASSIFICATION (a.k.a. style-conscious or style-consistent classification)

Inter-pattern feature dependence (style)
[Figure.]

Style-constrained field classification
[Figure.]

Intra-class and inter-class style (a.k.a. weak and strong style)
[Figure: examples of intra-class and inter-class style. With thanks to Prof. C.-L. Liu.]
Adaptation for intra-class style, field classification for inter-class style.

Adaptation and style
[Figure: taxonomy of training–test relationships: (1) representative training set; (2) adaptable (long test fields); (3) discrete styles; (4) continuous styles (short test fields); (5) weakly constrained.]

Adaptive algorithms
cf. stochastic approximation (the Robbins–Monro algorithm), self-training, self-adaptation, self-correction, unsupervised / semi-supervised learning, transfer learning, inductive transfer, co-training, decision-directed learning, ...

Traditional open-loop OCR system (repeated for comparison)
[Block diagram: the training set (patterns and labels) and meta-parameters (e.g., regularization, estimators) feed parameter estimation; the CLASSIFIER labels the operational data (bitmaps); accepted labels go to the transcript, and rejects go to correction / reject entry.]

Supervised learning: a generic OCR system that makes use of post-processed rejects and errors
[Block diagram: keyboarded labels of rejects and errors are fed back, together with the training set, into parameter estimation.]

Adaptation (Jain, PAMI 2000: "decision-directed classifier") – field estimation, singlet classification
[Block diagram: the classifier's own assigned labels are fed back, together with the training set, into parameter estimation.]

Self-corrective recognition (IBM, IEEE Trans. Information Theory, 1966) – hard-wired features and reference comparisons
[Block diagram: scanner → feature extractor → categorizer; accepted characters drive a reference generator that replaces the initial references with new references; rejected characters are set aside.]

Results: self-corrective recognition (Shelton & Nagy 1966)
22,500 patterns
  Training set: 9 fonts, 500 characters/font, upper case
  Test set: 12 fonts, 1,500 characters/font, upper case
96 n-tuple features, ternary reference vectors
                          Error   Reject
  Initial:                3.5%    15.2%
  After self-correction:  0.7%     3.7%

Self-corrective classification
[Figure: the original decision boundary of an omnifont '1' vs. '7' classifier shifts as the classifier adapts to a single font.]

Adaptive classification
[Figure.]

Baird skeptical! Results (DR&R 1994)
100 fonts, 80 symbols each, from Baird's defect model (6,400,000 characters)
  Size (pt)   Error reduction   % fonts improved   Best    Worst
  6           ×1.4              100                ×4      ×1.0
  10          ×2.5              93                 ×11     ×0.8
  12          ×4.4              98                 ×34     ×0.9
  16          ×7.2              98                 ×141    ×0.8
His conclusion: a good investment – large potential for gain, low downside risk.

Results: adapting both means and variances (Harsha Veeramachaneni 2003, IJDAR 2004)
NIST hand-printed digit classes, with 50 "Hitachi features"
  Train      Test   % error: before   adapt means   adapt variances
  SD3        SD3    1.1               0.7           0.6
  SD3        SD7    5.0               2.6           2.2
  SD7        SD3    1.7               0.9           0.8
  SD7        SD7    2.4               1.6           1.7
  SD3+SD7    SD3    0.9               0.6           0.6
  SD3+SD7    SD7    3.2               1.9           1.8

Examples from NIST dataset
[Figure: sample hand-printed digits.]
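A minimal sketch of the decision-directed adaptation idea behind the "Self-corrective recognition" and "Adaptive classification" slides (an editorial addition, not the IBM system): a nearest-mean classifier re-estimates its class means from its own decisions on an isogenous test set. All data are synthetic, the shift plays the role of a new font, and numpy is assumed.

```python
import numpy as np

# Decision-directed adaptation: classify the test set, then move each class mean
# to the centroid of the patterns assigned to it, and repeat.

rng = np.random.default_rng(0)
train_means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # one mean per class

# The test set comes from a "new font": same classes, but every mean is shifted.
shift = np.array([1.2, -1.0])
labels = rng.integers(0, 3, size=600)
test = train_means[labels] + shift + 0.3 * rng.normal(size=(600, 2))

def classify(x, means):
    return np.argmin(((x[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)

means = train_means.copy()
for _ in range(5):                            # a few adaptation passes
    assigned = classify(test, means)
    for c in range(3):                        # move each mean to its assigned patterns
        if np.any(assigned == c):
            means[c] = test[assigned == c].mean(axis=0)

print("error with original means:", np.mean(classify(test, train_means) != labels))
print("error after adaptation:   ", np.mean(classify(test, means) != labels))
```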
Writer adaptation by style transfer for many-class problems (Chinese script)
• Patterns of a single-writer field are transformed to a style-free space using a style transfer matrix (STM).
• Supervised adaptation: learn the STM from labeled samples.
• Unsupervised adaptation: learn the STM from test samples.
Zhang & Liu, "Style transfer matrix learning for writer adaptation," CVPR 2011.
• STM formulation
  – Source point set: writer-specific data
  – Target point set: parameters of the basic classifier
  – Objective: [equation not reproduced here]
  – Solution: [equation not reproduced here]
• Application to writer adaptation

Field classifiers and adaptive classifiers
• A field classifier classifies consecutive finite-length fields of the test set.
  – Subsequent patterns do not benefit from knowledge gained from earlier patterns. Exploits inter-class style.
• An adaptive classifier is a field classifier whose field encompasses an entire (isogenous) test set.
  – The last pattern benefits from information from the first pattern, and the first pattern benefits from information from the last pattern. Exploits only intra-class style.

Field classifier for discrete styles (Prateek Sarkar, ICPR 2000, PAMI '05)
Optimal for multimodal feature distributions formed from weighted Gaussians.
m-dimensional feature vector, n-pattern field, s styles
field class c* = (c1, c2, ..., cn) = f(style means and covariances)
Parameters are estimated via Expectation Maximization.
For a field of n characters from m classes there are mⁿ ordered and (m+n−1)! / (n!(m−1)!) unordered field classes.

Example: 2 classes, 2 styles, field length = 2
[Figure.]

Singlet and style-constrained field classification boundaries (2 classes, 2 styles, 1 Gaussian feature per pattern)
[Figure: decision regions plotted against the first-pattern and second-pattern features.]

Top-style: a computable approximation (with style-unsupervised estimation via EM)
[Figure: decision boundaries of the style-conscious, top-style, and singlet classifiers.]

Experiments
– Digits of six fonts
– Moment features
– Training fields of L = 13; 14,430 patterns
– Test fields of L = 2, 4

Field classifier for continuous styles (H. Veeramachaneni, IJDAR '04, PAMI '05, '07)
With Gaussian features, the feature distribution for any field length can be computed from the class means and the class-pair-conditional feature cross-covariance matrices.

Style-conscious quadratic discriminant field classifier (SQDF)
• Class-style means are normally distributed about the class means.
• Σk is the singlet class-conditional feature covariance matrix.
• Cij is the class-pair-conditional cross-covariance matrix, estimated from pairs of same-source singlet class means.
• SQDF approximates the optimal discrete-style field classifier well when the inter-class distance >> the style variation.
• Inter-class style is order-independent: P(5 7 9 | [x1 x2 x3]) = P(7 5 9 | [x2 x1 x3]).
+ SQDF avoids the Expectation Maximization of the discrete-style method.
− Supralinear in the number of features and classes, because the size of the N-pattern field covariance matrix is (N×d)² and for M classes there are (M+N−1)! / (N!(M−1)!) matrices.

Example of continuous-style feature distributions
Two classes, one feature
[Figure.]
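A minimal sketch of the discrete-style field classifier described on the "Field classifier for discrete styles" and "Singlet and style-constrained field classification boundaries" slides (an editorial addition): 2 classes, 2 styles, 1 Gaussian feature per pattern, field length 2. The means, sigma, and test field are invented, real parameters would be estimated by EM from style-unlabeled training fields, and numpy/scipy are assumed.

```python
import numpy as np
from itertools import product
from scipy.stats import norm

# Score every field class (c1, c2) under the style-consistency constraint:
# both patterns of the field share one (unknown) style.

mean = {('1', 's1'): 0.0, ('1', 's2'): 2.0,   # a style-2 '1' looks exactly like
        ('7', 's1'): 2.0, ('7', 's2'): 4.0}   # a style-1 '7'
sigma, p_style = 0.5, {'s1': 0.5, 's2': 0.5}

def field_posterior(x1, x2):
    scores = {}
    for c1, c2 in product('17', repeat=2):
        scores[(c1, c2)] = sum(
            p_style[s] * norm.pdf(x1, mean[(c1, s)], sigma)
                       * norm.pdf(x2, mean[(c2, s)], sigma)
            for s in ('s1', 's2'))
    total = sum(scores.values())
    return {fc: round(v / total, 3) for fc, v in scores.items()}

# x1 = 2.0 is a dead heat for a singlet classifier, but a style-consistent field
# with x2 = 3.8 (clearly a style-2 '7') resolves it as a '1'.
print(field_posterior(2.0, 3.8))   # ('1', '7') gets nearly all the probability
```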
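A separate sketch for the continuous-style case (again an editorial addition, with invented numbers, one feature per pattern, and scipy assumed): as the "Field classifier for continuous styles" slide states, the Gaussian field distribution is assembled from the class means, the singlet covariances Σk, and the class-pair cross-covariances Cij, so no field-length-specific training is needed.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Build the field covariance block-wise from singlet and pair parameters and
# evaluate the field density for each hypothesized label sequence.

mu    = {'5': np.array([0.0]), '7': np.array([3.0])}
Sigma = {'5': np.array([[1.0]]), '7': np.array([[1.0]])}
C     = {('5', '5'): np.array([[0.6]]),      # same-source shapes co-vary
         ('5', '7'): np.array([[0.4]]),
         ('7', '7'): np.array([[0.6]])}
C[('7', '5')] = C[('5', '7')].T

def field_density(x, labels):
    mean = np.concatenate([mu[c] for c in labels])
    cov = np.block([[Sigma[ci] if i == j else C[(ci, cj)]
                     for j, cj in enumerate(labels)]
                    for i, ci in enumerate(labels)])
    return multivariate_normal(mean, cov).pdf(np.concatenate(x))

x = [np.array([1.4]), np.array([1.6])]           # two ambiguous patterns from one source
for labels in [('5', '5'), ('5', '7'), ('7', '5'), ('7', '7')]:
    print(labels, round(field_density(x, labels), 4))
```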
Results: style-constrained classification – short fields
Continuous style-constrained classifier, trained on ~17,000 characters and tested on ~17,000 characters. 25 top principal components of "Hitachi" blurred directional features.
Field error rate (%):
               L = 2                      L = 5
  Test data    w/o style    with style    w/o style    with style
  SD3          1.4          1.3           3.0          2.5
  SD7          2.7          2.4           5.3          4.5

Field-trained (i.e., word) classification vs. style-constrained classification
Field length = 4
  Training set for field classification: 0000, 0001, 0010, ..., 9998, 9999 (10⁴ classes)
  Training set for style classification: 00, 01, 02, ..., 98, 99 (10² classes, with order)
Classifier parameters for longer field lengths are computed from the pair parameters (because Gaussian variables are defined completely by their covariances).

Style context versus linguistic context
Two digits in an isogenous field, ... 5 6 ..., with feature vectors x, y and class labels 5, 6:
  Language context:  P(x y | 5, 6) ≠ P(y x | 6, 5)
  Intra-class style: P(x y | 5, 5) ≠ P(x | 5) P(y | 5)
  Inter-class style: P(x y | 5, 6) = P(y x | 6, 5) ≠ P(x | 5) P(y | 6)

Weakly constrained data
Given p(x), find p(y), where y = g(x).
3 classes, 4 multi-class styles
[Figure: training and test distributions.]

Recommendations for OCR systems that improve with use
Never let the machine rest: design it so that it puts every coffee break to good use.
Don't throw away edits (corrected labels): use them.
Classify style-consistent fields, not characters: adapt on long fields, exploit inter-class style in short fields.
Use order rather than position.
Let the machine guess: lazy decisions.
Make use of all possible contexts: style, language, shape, layout, structure, and function.
Please help to increase computer literacy!

Thank you!
http://www.ecse.rpi.edu/~nagy/

Classification of style-constrained pattern-fields (ICPR 2000 poster)
Prateek Sarkar (sarkap@rpi.edu), George Nagy (nagy@ecse.rpi.edu), Rensselaer Polytechnic Institute, U.S.A.

Introduction
Style is a manner of rendering patterns. Patterns are rendered in many different styles.
[Figure: the letter 'a' rendered in several different styles.]
A field is a group of patterns with a common origin (isogenous patterns).
Style consistency constraint: patterns in a field are rendered in the same style.

Objective
Modeling style consistency can help improve classification accuracy.

Classifier: (c1*, c2*) = arg max over all (c1, c2) of p(x1 x2 | c1 c2) P[c1 c2]

Singlet model (styles mixed independently for each pattern):
  p(x1 x2 | 17) = p(x1 | 1) × p(x2 | 7)
                = [0.5 p(x1 | 1, style A) + 0.5 p(x1 | 1, style B)] × [0.5 p(x2 | 7, style A) + 0.5 p(x2 | 7, style B)]
                = 0.25 [p(x1 | 1, A) p(x2 | 7, A) + p(x1 | 1, A) p(x2 | 7, B) + p(x1 | 1, B) p(x2 | 7, A) + p(x1 | 1, B) p(x2 | 7, B)]

Style consistency model (both patterns share one style):
  p(x1 x2 | 17) = 0.5 p(x1 x2 | 17, style A) + 0.5 p(x1 x2 | 17, style B)
                = 0.5 [p(x1 | 1, A) p(x2 | 7, A) + p(x1 | 1, B) p(x2 | 7, B)]

Unsupervised style estimation: the patterns in a field and their class labels are observed, but the style of the field is unobserved. The EM algorithm is applied to estimate the model parameters.
[Figure: decision regions for the field classes 11, 17, 71, 77 in the (x1, x2) plane, simulated for a two-class, two-style problem with unit-variance Gaussian distributions; d is the class separation.]

Results
Application to the recognition of digit fields: a 15–25% relative reduction in errors was observed in laboratory experiments on handprinted digit recognition. The improvement in accuracy was greater for longer fields.

Reference: Prateek Sarkar, "Style consistency in pattern fields," PhD thesis, Rensselaer Polytechnic Institute, U.S.A., 2000.
Prateek Sarkar, August 2000, for ICPR, Barcelona, September 2000.