Lukáš Neumann and Jiří Matas
Centre for Machine Perception, Department of Cybernetics
Czech Technical University, Prague

End-to-End Scene Text Recognition

Neumann, Matas, ICDAR 2013

1. Problem Introduction
2. The TextSpotter System
3. Character Detection as Extremal Region (ER) Selection
4. Line formation & Character Recognition
5. Character Ordering
6. Optimal Sequence Selection
7. Experiments

Input: Digital image (BMP, JPG, PNG) / video (AVI)
Output: Set of words in the image
◦ word = (horizontal) rectangular bounding box, text content
◦ e.g. Bounding Box=[240;1428;391;1770], Content="TESCO"
Lexicon-free method

1. Multi-scale Character Detection [1] with Gaussian Pyramid (new)
2. Text Line Formation [2]
3. Character Recognition [3]
4. Optimal Sequence Selection (new)

[1] L. Neumann, J. Matas, “Real-time scene text localization and recognition”, CVPR 2012
[2] L. Neumann, J. Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, ICDAR 2011
[3] L. Neumann, J. Matas, “A method for text localization and recognition in real-world images”, ACCV 2010

[Figure: input image (PNG, JPEG, BMP); 1D projection <0;255> (grey scale, hue, …); extremal regions with threshold θ = 50, 100, 150, 200]

Let image I be a mapping I: Z² → S
Let S be a totally ordered set, e.g. <0, 255>
Let A be an adjacency relation (e.g. 4-neighbourhood)
Region Q is a contiguous subset w.r.t. A
(Outer) Region Boundary δQ is the set of pixels adjacent to but not belonging to Q
Extremal Region is a region for which there exists a threshold θ that separates the region and its boundary:
$\exists\,\theta : \forall p \in Q,\ \forall q \in \delta Q : I(p) < \theta \le I(q)$
[Figure: example with θ = 32]
Assuming a character is an ER, 3 parameters still have to be determined:
1. Threshold
2. Mapping to a totally ordered set (colour space projection)
3. Adjacency relation

Character boundaries are often fuzzy
It is very difficult to determine the threshold value locally; the typical document processing pipeline (image → binarization → OCR) leads to inferior results
Thresholds that most probably correspond to a character segmentation are selected using a CSER classifier [1]; multiple hypotheses are generated for each character

[1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012

p(r|character) is estimated at each threshold for each region
Only regions corresponding to local maxima are selected by the detector
Incrementally computed descriptors are used for classification [1]
◦ Aspect ratio
◦ Compactness
◦ Number of holes
◦ Horizontal crossings
Trained AdaBoost classifier with decision trees, calibrated to output probabilities
Linear complexity, real-time performance (300 ms on an 800x600 px image)

[1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012

Color space projection maps a color image into a totally ordered set
Trade-off between recall and speed (although it can be easily parallelized)
Standard channels (R, G, B, H, S, I) of the RGB / HSI color spaces
85.6% of characters are detected in the Intensity channel; combining all channels increases the recall to 94.8%
[Figure: Source Image; Intensity Channel (no threshold exists for the letter “A”); Red Channel]

Pre-processing with a Gaussian pyramid alters the adjacency relation
At each level of the pyramid only a certain interval of character stroke widths is amplified
Not a major overhead, as each level is 4 times faster than the previous one; total processing takes ~ 4/3 of the first level (1 + 1/4 + 1/4² + …)
[Figure: Characters formed of multiple small regions; Multiple characters joined together]

Regions are agglomerated into text line hypotheses by exhaustive search [1]
Each segmentation (region) is labeled by a FLANN classifier trained on synthetic data [2]
Multiple mutually exclusive segmentations with different label(s) are present in each text line hypothesis
[Figure: segmentations labeled P, A, ilI, n, m, f, f, n]

[1] Neumann, Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, ICDAR 2011
[2] Neumann, Matas, “A method for text localization and recognition in real-world images”, ACCV 2010

Region A is a predecessor of a region B if A immediately precedes B in a text line
Approximated by a heuristic function based on text direction and mutual overlap
The relation induces a directed graph for each text line

The final region sequence of each text line is selected as an optimal path in the graph, maximizing the total score
Unary terms
◦ Text line positioning (prefers regions which “sit nicely” in the text line)
◦ Character recognition confidence
Binary terms (region pair compatibility score)
◦ Threshold interval overlap (prefers that neighboring regions have similar thresholds)
◦ Language model transition probability (2nd order character model)
[Figure: example word “Accommodation”]

ICDAR 2011 Dataset – Text Localization

pipeline   recall   precision   f      time /
SM+SS      45.9     69.8        55.4   1.87s
SM+MS      55.5     75.2        63.8   2.35s
SWT+SS     38.0     66.0        48.0   0.60s
SWT+MS     41.0     80.0        54.0   0.84s
MLM+SS     62.1     85.9        72.0   2.52s
MLM+MS     67.5     85.4        75.4   3.10s

SM = Single Maximum: segmentation with the highest CSER score
MLM = Multiple Local Maxima: segmentations which correspond to maxima of the CSER score
SWT = Stroke Width Transform: reimplementation of character detection based on Epshtein et al. [1]
SS = Single Scale
MS = Multiple Scales (Gaussian pyramid)

[1] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform”, CVPR 2010

ICDAR 2011 Dataset – Text Localization

pipeline                               recall   precision   f
Proposed method                        67.5     85.4        75.4
Shi’s method [1]                       63.1     83.3        71.8
Kim’s method [2] (ICDAR 2011 winner)   62.5     83.0        71.3
Neumann & Matas [3]                    64.7     73.1        68.7
Yi’s Method [4]                        58.1     67.2        62.3
TH-TextLoc System                      57.7     67.0        62.0

[1] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, “Scene text detection using graph model built upon maximally stable extremal regions”, Pattern Recognition Letters, 2013
[2] A. Shahab, F. Shafait, and A. Dengel, “ICDAR 2011 robust reading competition challenge 2: Reading text in scene images”, ICDAR 2011
[3] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012
[4] C. Yi and Y. Tian, “Text string detection from natural scenes by structure-based partition and grouping”, Image Processing, 2011
[5] S. M. Hanif and L. Prevost, “Text detection and localization in complex scene images using constrained adaboost algorithm”, ICDAR 2009

ICDAR 2011 Dataset – End-to-End Text Recognition

pipeline                     recall   precision   f
Proposed method              37.8     39.4        38.5
Neumann & Matas (2012) [1]   37.2     37.1        36.5

Percentage of words correctly recognized without any error, using case-sensitive comparison (ICDAR 2003 protocol)

[1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012

[Figure: example recognition outputs: “chips”, “cut”, “CABOT”, “PLACF”, “FREEDON”]

Multi-scale processing / Gaussian Pyramid improves text localization results without a significant impact on speed
Combining several channels and postponing the decision about character detection parameters (e.g. binarization threshold) to a later stage improves localization and OCR accuracy
Current state of the method
◦ The method placed second in the ICDAR 2013 Text Localization competition, 1.4% worse than the winner (f-measure) (unfortunately, end-to-end text recognition is not part of the competition)
◦ Online demo available at http://www.textspotter.org/
◦ OpenCV implementation of the character detector in progress by the open source community
Future work
◦ OCR accuracy improvement
◦ Overcoming limitations of CC-based methods (e.g. nonlinearity, non-robustness caused by a single pixel)
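
The extremal region definition from the character detection slides (a region Q whose pixels all lie below a threshold that its outer boundary reaches or exceeds) can be illustrated with a minimal sketch. It assumes dark regions on a lighter background, a 4-neighbourhood, and a toy grey-scale image; the function names and the image are hypothetical and are not taken from TextSpotter.

```python
from collections import deque

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-neighbourhood (assumed)


def extremal_regions(image, theta):
    """Return the 4-connected components of {p : image[p] < theta}."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= theta:
                continue
            # Breadth-first flood fill from an unvisited below-threshold pixel.
            region, queue = set(), deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for dy, dx in NEIGHBOURS:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] < theta:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def is_extremal(image, region, theta):
    """Check I(p) < theta <= I(q) for all p in Q and q in the outer boundary."""
    rows, cols = len(image), len(image[0])
    boundary = set()
    for y, x in region:
        for dy, dx in NEIGHBOURS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols and (ny, nx) not in region:
                boundary.add((ny, nx))
    return (all(image[y][x] < theta for y, x in region)
            and all(image[y][x] >= theta for y, x in boundary))


# Toy image (hypothetical values): one dark blob and one mid-grey blob.
image = [
    [200, 200, 200, 200, 200],
    [200,  10,  20, 200,  90],
    [200,  15, 200, 200,  95],
    [200, 200, 200, 200, 200],
]
regions_32 = extremal_regions(image, 32)    # only the dark blob
regions_100 = extremal_regions(image, 100)  # both blobs
```

Raising the threshold from 32 to 100 makes the second blob appear, which is why the threshold is one of the parameters that has to be determined per character.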
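
The optimal sequence selection step (an optimal path in the directed graph induced by the predecessor relation, scored by unary and binary terms) can be sketched as a dynamic program over a directed acyclic graph. This is only an illustration under stated assumptions: the scores below are made up, a path may start and end at any region, and the names `best_sequence`, `unary` and `binary` are hypothetical rather than the authors' implementation.

```python
from functools import lru_cache


def best_sequence(unary, binary):
    """Return (score, path) of the highest-scoring path in a DAG.

    unary:  {region: score}, e.g. recognition confidence (assumed values)
    binary: {(a, b): score} for each edge "a is a predecessor of b"
    """
    successors = {region: [] for region in unary}
    for a, b in binary:
        successors[a].append(b)

    @lru_cache(maxsize=None)
    def best_from(region):
        # Best path starting at `region`: stop here or extend via a successor.
        best_score, best_path = unary[region], (region,)
        for nxt in successors[region]:
            tail_score, tail_path = best_from(nxt)
            score = unary[region] + binary[(region, nxt)] + tail_score
            if score > best_score:
                best_score, best_path = score, (region,) + tail_path
        return best_score, best_path

    return max((best_from(region) for region in unary), key=lambda sp: sp[0])


# Hypothetical text line: "C" and "G" are mutually exclusive segmentations
# of the same character, so a path can contain only one of them.
unary = {"T": 0.9, "E": 0.8, "S": 0.85, "C": 0.7, "G": 0.4, "O": 0.9}
binary = {
    ("T", "E"): 0.5, ("E", "S"): 0.6,
    ("S", "C"): 0.5, ("S", "G"): 0.1,
    ("C", "O"): 0.6, ("G", "O"): 0.2,
}
score, path = best_sequence(unary, binary)
```

Because the mutually exclusive segmentations sit on alternative branches of the graph, choosing the best path also resolves which segmentation of each character is kept.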