UN Workshop on Data Capture, Bangkok
Session 7
Data Capture
Richard Lang, International Manager
© Beta Systems Software AG 2008

Agenda
  OCR: Optical Character Recognition
  ICR: Intelligent Character Recognition
  DFR: Dynamic Form Recognition
7/23/2016

OCR = Optical Character Recognition
  The technology was first invented in 1929
    Gustav Tauschek obtained a patent on OCR in Germany
    Mechanical device that used templates
  The first commercial system was installed at Reader's Digest in 1955
    Years later it was donated to the Smithsonian Institution
  Today
    Recognition of machine-written text is now considered largely a solved problem
    Accuracy rates exceed 99%

OCR
  Beta Systems has extensive experience with these recognition engines at banks in Germany
  OCR A: ⑁ Chair, ⑀ Hook, ⑂ Fork
  OCR B (Austria): + Plus

ICR: Intelligent Character Recognition
  The technique goes well beyond OCR because ICR is under ongoing development
  Handwriting recognition system
  Allows different styles of handwriting to be learned by a computer during or before processing, to improve accuracy and recognition rates

ICR Process: Capturing and Processing
  Capturing the image with scanners
  Processing by ICR and/or OCR
  Segmentation is a very important step
    Decision whether areas meeting the homogeneity criteria belong to the foreground or to the background
    Human editors can decide that from the context
    Compare computed tomography: from X-ray measurements taken at different angles, the computer can reconstruct the picture
    The first step can only provide a suitable starting point (sets of pixels)
    The growing process then links all neighbouring pixels (computation of valleys and peaks with a high degree of confidence)

ICR Process: Pre-processing
  Deskew
  Shift, rotate
  Stretch

ICR Process: Enhance
  Less / more contrast
  Clean up (de-noise, halftone removal) so that the recognition engine can give the best results

ICR Process: Feature extraction
  Data reduction

ICR Process: Classification
  A one was written
    90 % = 1
    8 % = 7
    2 % = 4

ICR Algorithm
  Using:
    Neural Network
    kNN: k-Nearest Neighbour
    SVM: Support Vector Machine
      SVMs simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers

ICR Process
  For the different classification alternatives, the corresponding confidence is provided
  Recognition can be limited to the most probable characters
    e.g. if only the characters 3, 6, 0 are possible, the engine can be limited to this set and the results are much better
  Voting Machine
  Usability: security, efficiency and accuracy

Dynamic Field Recognition
  No fixed position is required
  If only ½ of the form is available, that ½ is still readable
  No special forms are required
  No timing tracks are necessary on the forms for OMR, but the results are still available at the same time
  No cleaning of LEDs in the scanner is necessary
  Robust against vertical / horizontal stretching or shrinking (e.g. different printers)

Dynamic Field Recognition
  Recognizes:
    features (word as pixel cloud)
    boxes, lines and symbols

Hardware / Software Requirements
  Hardware
    Scanner
    PC
    Network
    Disc storage (only necessary if images are needed for audit purposes)
  Software
    Scan software
    One recognition and voting software for OMR, OCR, ICR, Barcode

OMR Cost Comparatives in general
  OMR from image / Dedicated OMR Scanner
    Forms Design: same
    Forms Production: up to 50% more
    Enumerator Training: up to double the cost
    Scanners: up to double the cost
    PC: low cost PC
    PC Operators: same
    Servers: same
    Cost of more/new flexibility: low / high

ICR Advantages
  Better than:
    Manual keying
      90 % (plus) correct keys
      Manual = higher substitution rate than automated recognition
      Time consuming
      Deliberate manipulation possible
    OMR, because OMR is space consuming
    OCR, because OCR is limited to machine-written text and therefore of limited use

ICR Advantages
  Clear accuracy for OMR because of dirt removal by software, depending on the mark size and figure
  Can detect a line and can ignore dirt
  Clear result

ICR Advantages
  Barcode, OCR, OMR and ICR recognition with one software

ICR Advantages
  Pro:
    Only rejected characters/fields need correction
    The rest of the form stays untouched
    Open to new technologies in future: faster, better quality
    Standardized correction mode
    Handwriting of the corresponding country will be recognized
    The previously mentioned advantages do not have to be repeated here

Thank you for your attention
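
Supplementary sketch for the ICR Process slides on classification, recognition limitation and voting. This is a minimal illustration in Python and not the Beta Systems engine. The function names (restrict, vote, decide), the averaging rule used for voting, the thresholds and all confidence values other than the 90 % / 8 % / 2 % example from the Classification slide are assumptions made for illustration only.

```python
from collections import defaultdict


def restrict(scores, allowed):
    """Drop candidates outside `allowed` and renormalise the rest to sum to 1."""
    kept = {ch: p for ch, p in scores.items() if ch in allowed}
    total = sum(kept.values())
    return {ch: p / total for ch, p in kept.items()} if total > 0 else {}


def vote(engine_outputs):
    """Average the confidence each engine assigns to each character.

    Simple averaging is an assumed voting rule, chosen for illustration.
    """
    totals = defaultdict(float)
    for scores in engine_outputs:
        for ch, p in scores.items():
            totals[ch] += p
    return {ch: t / len(engine_outputs) for ch, t in totals.items()}


def decide(scores, threshold):
    """Return the most confident character, or None to reject the field."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Confidences from the Classification slide: a "1" was written.
slide_scores = {"1": 0.90, "7": 0.08, "4": 0.02}

# Hypothetical engine output for a field where only 3, 6, 0 can occur.
raw = {"8": 0.40, "3": 0.35, "6": 0.15, "0": 0.10}
limited = restrict(raw, allowed={"3", "6", "0"})

# Two hypothetical engines reading the same character.
combined = vote([{"1": 0.9, "7": 0.1}, {"1": 0.6, "7": 0.4}])

if __name__ == "__main__":
    print(decide(slide_scores, threshold=0.8))
    print(decide(raw, threshold=0.0), decide(limited, threshold=0.0))
    print(decide(limited, threshold=0.8))
    print(combined)
```

In this sketch, limiting the candidate set removes the impossible "8" before the decision is made, and a result of None stands for a rejected field that would go to manual correction, in line with the point that only rejected characters/fields need correction.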