UN Workshop on Data Capture, Minsk
Session 15
Data Capture Process with Optical Character Recognition, Image Character Recognition, Intelligent Recognition
Christoph Steinl, Vice Director International Enterprise Content Management
© Beta Systems Software AG 2008

Agenda
- OCR: Optical Character Recognition
- ICR: Image Character Recognition
- DFR: Dynamic Form Recognition

OCR = optical character recognition
- Technology was first invented in 1929
  - Gustav Tauschek obtained a patent on OCR in Germany
  - Mechanical device that used templates
- First commercial system was installed at Reader's Digest in 1955
  - Years later donated to the Smithsonian Institution
- Today
  - Recognition of machine-written text is now considered largely a solved problem
  - Accuracy rates exceed 99%

OCR
- Beta Systems is well experienced with these recognition engines in banks in Germany
- OCR-A: ⑁ Chair, ⑀ Hook, ⑂ Fork
- Austria: OCR-B, + Plus

ICR Image Character Recognition
- The technique is far ahead of OCR because of the ongoing development of ICR
- Handwriting recognition system
- Allows different styles of handwriting to be learned by a computer during / before processing to improve accuracy and recognition rates

ICR Process: Capturing
- Capturing the image with scanners
- Processing by ICR and/or OCR
- Segmentation is a very important step
  - Decision whether the homogeneous criteria belong to the foreground or to the background
  - Human editors can do that depending on the context
  - Comparable to computer tomography: from the different results obtained at different angles, the computer can reconstruct the picture
  - With the first step only a suitable starting point (sets of pixels) is possible
  - The growing process then links all neighbouring pixels (computation of valleys and peaks with a high degree of confidence)

ICR Process:
Pre-processing
- Deskew
- Shift, rotate
- Stretch

Recognition – Image Pre-processing
- (Figure: skewed document, and the same document after alignment)

ICR Process: Enhance
- Less / more contrast
- Clean up (de-noise, halftone removal) to enable the recognition engine to give best results

Recognition: Noise and box removal

ICR Process: Classification
- A one was written: 90% = 1, 8% = 7, 2% = 4

ICR Algorithm: Using
- Neural Network
- kNN: k-Nearest Neighbour
- SVM: Support Vector Machine
  - SVMs simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers

ICR Process: Recognition
- After the different classification alternatives, the appropriate confidence is provided
- Limitation to the most probable characters only, e.g. if only the characters 3, 6, 0 are possible, the engine can be limited to this set and the results are much better
- Voting Machine
- Usability: security, efficiency and accuracy

Dynamic Field Recognition
- No fixed position is required
- If the form is only ½ available, it is still ½ readable
- No special forms are required
- No timing tracks are necessary on the forms for OMR, but results are still available at the same time
- No cleaning of LEDs in the scanner necessary
- Robust against vertical / horizontal stretching, shrinking and displacement (e.g.
variation in printing)

Dynamic Field Recognition
- Recognizes:
  - features (word as pixel cloud)
  - boxes, lines and symbols

Hardware- / Software-Requirement
- Hardware
  - Scanner
  - PC
  - Network
  - Disc storage, necessary for re-processing and if images are needed for audit purposes
- Software
  - Scan software
  - One recognition and voting software for OMR, OCR, ICR, Barcode

OMR Cost Comparatives in general

  Item                             OMR/ICR from image    OMR/ICR from dedicated OMR scanner
  Forms Design                     Same
  Forms Production                 -                     Up to 50% more
  Enumerator Training              -                     Up to double the cost
  Scanners                         -                     Up to double the cost
  PC                               Low cost PC
  PC Operators                     Same
  Servers                          Same
  Cost of more/new flexibility     low                   high

ICR Advantages
Better than:
- Manual keying
  - 90% (plus) correct keys
  - Manual = higher substitution rate than automated recognition
  - Time consuming
  - Deliberate manipulation possible
- OMR, because OMR is space consuming
- OCR, because OCR is machine written and therefore of limited use

ICR Advantages
- Clear accuracy for OMR because of dirt removal by software, depending on the mark size and figure
- Can detect a line and can ignore dirt
- Clear result

ICR Advantages
- Barcode, OMR, OCR and ICR recognition with one software

ICR Advantages
Pro:
- Only rejected characters/fields need correction
- Rest of the form untouched
- With new technologies, open for the future: faster, better quality
- With standardized correction mode
- Handwriting of the corresponding country will be recognized
- The previously mentioned advantages do not have to be repeated here

ICR Advantages: Capture Process
SORM: Scan Once Read Multiple
- Images are scanned once and stored for
re-processing (disk space is cheap).
- In several serial sessions, parts of the data are collected from the image (important fields first).
- Example:
  - SORM Session 1: Fields Age, Sex and Nationality -> provisional partial results
  - SORM Session 2: All other numeric fields
  - SORM Session 3: Alphanumeric fields that need more manual coding (Occupation -> Occupation Code)
- Each session updates the data files / database until all data is captured.
- Faster preliminary results. Less political stress. Faster data for PES planning.
- Analysis of Session 1 results is possible in parallel to recognition, coding and editing of Session 2.
- Data lifting on different batching levels is possible (EA, settlement).

Process Stages of Census Surveys
Christoph J. Steinl, Vice Director int. ECM
December 2008, Minsk

Capture Process
- Store In (EA Batch Header Creation – EA Paper store Database)
- Scanning
- Recognition
- Verifying Processes
- The solution
- Data capture
- Census Process internal
- Census data flow
- Quality assurance

Scanning
- Kleindienst SC80HC

Scanning
- Simultaneous creation of up to six images
- Optical lens > 10 mm, depth of field 3 mm
- Optical and ultrasonic double feed control
- Energy-saving / life-cycle-extending mode
- Consistent jam handling: no document is lost or captured twice due to physical jams
- Cleanness check program: detects white and black dirt spots

Scanning
Pockets 2 – 12. Why:
- if the document is scanned skewed or de-skewed
- if the very important questions are filled / are readable (if OMR, OCR, Barcode)
- if there are fingerprints on the questionnaire or not
- if the Barcode/OCR/OMR numbers (not ICR) are in the given range
- if there are double entries: we check the unique number
- if there are (colour) copies used
- if there are mismatches in quantity: the batch header shows 50
and only 40 are scanned
- Transport stop can be programmed to clarify the issue

Scanning
- Customer: "I have a printer and have printed my own questionnaire for a long time … I learnt from the internet that it is just a matter of software …"
- We should be consulted before printing to give the best advice; we will test and optimize.
- Single-sided printing or higher-grade paper is necessary (shine-through factor = opacity)
- Paper should be white without any spots inside
- Discuss different methods before making big investments

Census Process internal
- (Diagram elements: Form Type Analysis, Structure Analysis, ICR voting, Batch Job Processing, ICR 1, ICR 2, Editing / Coding, Output Assembly, Logical Result Analysis, ICR Result Analysis)

The Path to Recognition
- Analyze the structure of documents for identification

The Path to Recognition
- Perform proper clean-up and image pre-processing
- Analyze individual page layout
- Dynamically locate fields of interest
- Character recognition: numeric handwriting, voting of two ICR engines and also with OMR
- Compile results

Verifying Processes
- Unique number: double scan check
- Double feed check
- Check if copy
- Trace of all editing work
- Logical checks
- Completeness checks
- Reports

The Solution: SC80HC + FC Census
- (Diagram elements: FC Census recognition, DevInfo, Data + Images, Work Data Storage & DB, Preparation: cut & jog, CSPro, Batch Header, Archive, Paper Archive, Data Storage & DB, Redatam, Form x, TAPE, Editing, Local reports)

Data capture
- Data Processing Centres in different locations
- Peak period 3 shifts, average 1-2 shifts
- Local operators trained by our supervisors
- Supervisors
- Central support from Lab
- Training
- Local help and documentation, realised in advance to design the documents

Thank you for your attention
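
Appendix sketch (not part of the original slides): the ICR Process slides describe per-character confidences, limiting the engine to a known character set such as 3, 6, 0, and voting between two ICR engines. The Python sketch below illustrates those three ideas under stated assumptions. The function names `restrict` and `vote`, the 0.8 acceptance threshold, the agreement rule and the second engine's scores are all hypothetical; the slides do not say how the product actually combines results. Only the 90% / 8% / 2% example comes from the Classification slide.

```python
from typing import Dict, Optional, Set


def restrict(scores: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Keep only the allowed characters and renormalise confidences to sum to 1."""
    kept = {ch: s for ch, s in scores.items() if ch in allowed}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing plausible left: treat the character as rejected
    return {ch: s / total for ch, s in kept.items()}


def vote(engine_a: Dict[str, float],
         engine_b: Dict[str, float],
         threshold: float = 0.8) -> Optional[str]:
    """Hypothetical voting rule: accept a character only if both engines
    agree on the top candidate and their mean confidence reaches the
    threshold; otherwise return None (field goes to manual correction)."""
    if not engine_a or not engine_b:
        return None
    top_a = max(engine_a, key=engine_a.get)
    top_b = max(engine_b, key=engine_b.get)
    if top_a != top_b:
        return None
    mean_conf = (engine_a[top_a] + engine_b[top_b]) / 2
    return top_a if mean_conf >= threshold else None


# Confidences from the Classification slide ("a one was written").
icr_1 = {"1": 0.90, "7": 0.08, "4": 0.02}
# Assumed output of a second engine, for illustration only.
icr_2 = {"1": 0.85, "7": 0.15}
print(vote(icr_1, icr_2))

# A field where only 3, 6 and 0 are valid: the implausible "8" is dropped.
raw = {"3": 0.40, "8": 0.35, "6": 0.20, "0": 0.05}
print(restrict(raw, {"3", "6", "0"}))
```

Returning `None` instead of a best guess mirrors the point on the ICR Advantages slide that only rejected characters or fields need manual correction while the rest of the form stays untouched.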