DATA CAPTURE IN CENSUS OF INDIA
Registrar General & Census Commissioner, India
Visit Our Website at www.censusindia.gov.in

FEATURES OF INDIAN CENSUS
• India – a large country with a population of more than a billion
• The Census is thus one of the world's largest administrative and statistical exercises
• Diversity in languages – schedules filled in 16 languages
• 2 million enumerators deployed in the 2001 Census – likely to increase further in the 2011 Census

FEATURES OF INDIAN CENSUS (Contd..)
• The Census is conducted using the 'canvasser' method in two phases:
  – House-listing
  – Population Enumeration
• The Census Organization has experimented with new IT innovations since the beginning
• Technology is required particularly for data capture/processing – mainly due to the large volume and for speedier tabulation and release of Census results

MODE FOR DATA CAPTURE & PROCESSING SINCE 1961
Census   Population       Collection %   Capture %   Mode            Time taken
1961     439 million      100            5           Hand Punch      8-9 years
1971     548 million      100            15          Key Punch       8-9 years
1981     683 million      100            25          Data Entry      8-9 years
1991     846 million      100            45          Data Entry      7-8 years
2001     1,028 million    100            100         Scanning/ICR    3-5 years

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Important Considerations
• Conventional data entry not suitable for the large volume of data (228 million schedules for a population of 1,028 million)
• Availability of advanced IT tools and techniques
• Capture and process all the collected information
• Complexities in data entry due to the multiplicity of languages/responses and the size (A3) of the Census Schedule

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Important Considerations (Contd..)
• Retrieval of original documents for correction is labour-intensive
• Reduce the time span from 5-8 years to 3-5 years
• Compact, reliable and efficient archival system
• Better workflow management

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Selection and Consequent Action
• Evaluation of various available technologies (OMR/OCR/ICR)
  – Trial run with NCS and DRS OMR
  – Trial run with various ICR vendors
• Opted for ICR technology (TIS eFlow)
• IT infrastructure in all the 15 Data Centers upgraded to meet the new requirement

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Model Conceived for Implementation
• Services of a System Integrator (SI) hired to guide and assist in the implementation of ICR technology
• A unique model for outsourcing:
  – SI to work in our premises for better communication and control
  – Maintain data security, safety and confidentiality
  – Capacity building (training and guidance for IT staff)
  – Production-linked payment to SI

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Work Flow of ORGI (TIS eFlow characteristics)
• Designs the data capture workflow
• Presents a graphical view of the system
• Monitors the processing and workflow in real time
• Enables customization of applications and addition of custom features

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Work Flow Modules
• Scan Portal, File Portal, Controller
• FormID, Manual FormID
• RC
• Processing [OCR/ICR]
• Tile, Completion, CAC & Exception
• Export

DATA CAPTURE & PROCESSING IN 2001 CENSUS
ORGI Workflow Stages
Prepare Batch → Scanning → Recognition → Tiling → Completion → Exception → Export/Archival (Server)

DATA CAPTURE & PROCESSING IN 2001 CENSUS
LAN SETUP - ORGI DATA CENTERS
• Scanning station: forms are fed through the scanner(s) batch by batch
• Server: form images are stored on the network disk
• Recognition stations: field-by-field character images are automatically recognised
• Tiling & Completion stations (Tile/Correction station): unrecognised characters are corrected by operators
• Exception stations: supervisors handle exceptional cases referred by operators
• Controller station: supervisor monitors the workflow and balances the load at different stages of operation
• Export station: supervisor exports completed batches as ASCII files for further processing

DATA CAPTURE & PROCESSING IN 2001 CENSUS
eFlow Customization
• Customization of scanning software for batching the images
• Optimization of batch size for network movement of images and data
• Customization of workflow management to reduce the workload on the Manual Identification station

DATA CAPTURE & PROCESSING IN 2001 CENSUS
eFlow Customization (Contd..)
• Development of new Management Information tools for operators, daily production status etc.
• Creation of JUSTICR.mdb to recognize the writing patterns of Indian enumerators
• Creation and implementation of various static and dynamic dictionaries for CAC

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Results Achieved
• For the first time, 100% of the data was captured, processed and released within five years of the Census
• Auto recognition rate of 90% and false positives < 2%
• Considerable financial saving
• Assimilation of IT skills internally in the organisation

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Results Achieved (Contd..)
• Manual coding was replaced by Computer Assisted Coding (CAC) for:
  – Scheduled Caste / Scheduled Tribe
  – Languages spoken, education level
  – Migration particulars, NIC and NCO
• Indigenous data capture for other projects:
  – Economic Census
  – Sample Registration System
  – Verbal Autopsy

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Difficulties Experienced
• Unable to use color drop-out at the scanning stage
• Difficult to handle bad images during the scanning stages
• Bad/back images due to variation in paper/print quality
• Overwriting and use of whitener; grid lines recognised as 1
• Limitations in recognizing Indian languages affected the throughput

DATA CAPTURE & PROCESSING IN 2001 CENSUS
Difficulties Experienced (Contd..)
• Operational constraints in Manual Identification
• No powerful tools for online load balancing among the various stages of eFlow
• Lack of concurrent quality checks at each stage of eFlow
• Lack of auto-coding features for textual responses
• Non-recognition of even a single image means the whole batch has to be redone

LESSONS LEARNT FOR FUTURE
• Outsourcing in a controlled environment is beneficial and cost-effective
• Good quality of paper
• ICR-friendly form design
• Use of bar codes for better workflow and inventory management
• Good quality printing

LESSONS LEARNT FOR FUTURE (Contd..)
• Special training for enumerators in filling the forms
• For CAC, use knowledge-based dictionaries to increase throughput
• Use of concurrent quality check procedures on the lines of the USA and UK

DATA CAPTURE & PROCESSING
Technology for 2011 Census
• Continuation of ICR technology
  – International and national experience shows that, as of date, there is no better substitute for scanning and ICR technology
  – Expertise and competence gained in using ICR technology are available in the organization

DATA CAPTURE & PROCESSING
Technology for 2011 Census (Contd..)
• Use of more efficient scanners with facilities for image enhancement, noise removal, color drop-out, better throughput, and on-the-spot detection and correction of bad images (through in-built software)
• Use of an improved version of the ICR software with better recognition and built-in enhanced workflow management capability
• Use of new features for Auto/Computer Assisted Coding in the ICR software

Thank you.

Steps involved in e-Flow Process
• Intelligent Character Recognition (ICR) technology is used to extract the handwritten/machine-printed (typeset) characters from the scanned images to generate a computer-processable data file. In brief, the following steps are involved in using ICR technology:
• Scanning: Paper-based forms are scanned to create bitmap image files.
• File Portal: The image file registration module in eFlow; it registers the scanned image files as input to the next activity.
• Form Identification: Automatically identifies the images of the various schedules based on the Empty Form Image (EFI) templates created during the design stage.

Steps involved in e-Flow Process (Contd..)
• Manual Identification: Forms left unidentified because of bad images are matched manually by the operator on the computer with the help of the EFIs.
• Processing: This module is the heart and brain of the ICR technology. It automatically recognizes the data (numerals/alphabets) from the images with the help of various engines (CGK, AEG, KADMOS, TISICR etc.).
• Tile: This module displays the images of similar digits in one place so that characters wrongly recognized by the system can be identified and corrected, which enhances the accuracy and quality of the data.

Steps involved in e-Flow Process (Contd..)
• Completion: Unrecognized characters, and recognized characters marked as wrong in Tiling, are presented for correction with their images displayed alongside.
• Exception: If a character image cannot be understood by the operator at the Completion station (module), it is corrected at the Exception station by an officer competent to make the decision.
• Export: The system exports the data generated in the above steps to the server for further processing such as editing, aggregation and tabulation.

eFLOW CONTROLLER
e-FLOW WORKFLOW FOR ORGI
EXAMPLE – BACK IMAGE
EXAMPLE – IMPROPER GRID LINES
EXAMPLE – USE OF WHITENER
EXAMPLE – CASUAL WRITING PATTERN
CAC OF MOTHER TONGUE
CAC OF HIGHEST EDUCATION LEVEL ATTAINED
CAC OF NATIONAL INDUSTRIAL CLASSIFICATION (NIC)
HOUSEHOLD SCHEDULE – IMAGE OF SIDE A
HOUSEHOLD SCHEDULE – IMAGE OF SIDE B
FORM-ID STATION
MANUAL-ID STATION
IMAGE AFTER FORMOUT IN PROCESSING
SEGMENTATION OF A FIELD IN PROCESSING

VOTING IN PROCESSING
ICR 1: 3    ICR 2: 3    ICR 3: 8    ICR 4: 3
Majority = 3    Unanimous = ?
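The voting slide can be illustrated with a small sketch. The source does not describe how eFlow actually combines its engines, so the two rules below (strict majority and unanimity) and the function names `majority_vote` and `unanimous_vote` are assumptions made for illustration only. The input is the slide's own example: four engines reading 3, 3, 8 and 3.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(readings: Sequence[str]) -> Optional[str]:
    """Return the character read by more than half of the engines, else None."""
    if not readings:
        return None
    char, count = Counter(readings).most_common(1)[0]
    return char if count * 2 > len(readings) else None


def unanimous_vote(readings: Sequence[str]) -> Optional[str]:
    """Return the character only if every engine agrees, else None."""
    if readings and all(r == readings[0] for r in readings):
        return readings[0]
    return None


# Readings of one character image by four engines, as on the slide.
readings = ["3", "3", "8", "3"]
print(majority_vote(readings))   # 3
print(unanimous_vote(readings))  # None (the slide's "?")
```

Under a unanimity rule the character stays unresolved and would have to go to an operator, while a majority rule accepts it automatically. A stricter rule would be expected to lower false positives at the cost of more manual work.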
FINAL RESULT IN PROCESSING
TILING STATION
COMPLETION STATION [Field mode display]
EXCEPTION STATION
  – Form Field Date
  – Original Form Image Viewer
  – Exception Area
EXPORT STATION

HOUSEHOLD SCHEDULE – SIDE A
  – Religion
  – Name of SC/ST
  – Mother Tongue & Other languages
  – Education

HOUSEHOLD SCHEDULE – SIDE B
  – NCO
  – NCO
  – NIC
  – Place of Birth & Last residence

DATA CAPTURE & PROCESSING
Selection of technology (OMR/OCR/ICR) in 2001
Recognition of handwritten descriptive entries in different languages is beyond the capabilities of the known ICR software. Hence a conscious decision was taken to go in for the recognition of numeric characters only, leaving the rest to be handled through image-enabled Computer Assisted Coding (CAC). The following key features were introduced in the data capture solution.

Parameters for selecting the ICR Software
• Highest recognition rate and lowest percentage of false positives, with customization and assured support and training
• Facility for an organized workflow in a LAN environment with centralized controls and a Computer Assisted Coding facility
• In-built quality enhancement tools to trap wrongly recognized characters so as to facilitate corrective action
• Use of multiple engines with a voting algorithm
• Ability to incorporate validation rules to trap inconsistent entries/wrong recognition
• Learning capabilities of the engines
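The idea of validation rules that trap inconsistent entries or wrong recognition can be sketched as follows. The field names (`age`, `sex`, `age_at_marriage`), the code values and the three rules are hypothetical and chosen only for illustration; the source does not list the rules actually configured in the ICR software.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, int]
Rule = Tuple[str, Callable[[Record], bool]]

# Hypothetical rules: each pairs a description with a check that must hold.
RULES: List[Rule] = [
    ("age must be between 0 and 120", lambda r: 0 <= r["age"] <= 120),
    ("sex code must be 1 or 2", lambda r: r["sex"] in (1, 2)),
    ("age at marriage cannot exceed age",
     lambda r: r["age_at_marriage"] <= r["age"]),
]


def failed_rules(record: Record) -> List[str]:
    """Return the descriptions of all rules that the record violates."""
    return [name for name, check in RULES if not check(record)]


# A record in which an age was read as 172, for example because of a
# stray stroke or a grid line recognised as an extra digit.
suspect = {"age": 172, "sex": 1, "age_at_marriage": 20}
print(failed_rules(suspect))  # ['age must be between 0 and 120']
```

A record that fails any rule would be routed to an operator for correction against the form image instead of being exported as is.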
DATA CAPTURE & PROCESSING
• Parameters for selecting the scanner
  – Speed to match our volume
  – Duty cycle (life and production tolerance)
  – Must support duplex scanning
  – Minimum resolution of 200 dpi
  – Image enhancement facilities such as noise removal, skew correction, cropping and contrast
  – Hopper size and scanning path (U, J or flat belt)
  – Maintenance & training services

DATA CAPTURE & PROCESSING
Selection of Scanner/Hardware/ICR software
• A high-level technical committee evaluated and selected the above items on the basis of the capabilities demonstrated by various vendors
• As a result, CMC was selected as System Integrator, and ACER and HP for computer hardware with Windows NT 4.0 as OS
• Kodak Model 7520 scanner; TIS for ICR software
• National Informatics Centre carried out the LAN cabling and inspection of hardware
• Upgradation of 15 Data Centers

SETUP AT D.P. DIVISION (HQ)
HARDWARE
  – Server (P-III, 800 MHz, 512 MB, 6*36 GB HDD, CD & 1.44 MB Floppy Drive)
  – 40/80 GB DLT Drives
  – 100 MB Zip Drives
  – CD Writer
  – Local Area Network
  – Intelligent Workstations (P-III, 800 MHz, 128 MB, 9 GB HDD, CD & 1.44 MB Floppy Drive)
  – Laser & Line Matrix Printer
SOFTWARE
  – Operating Systems: Windows 98, Windows NT
  – Latest Software Packages: IMPS, MS-Office, MS Visual Studio, MS SQL Server, ISM Publisher (Hindi, English), Adobe Publishing Collection

SETUP AT D.D.E. CENTRES
15 Locations (State Capitals)
HARDWARE
  – High Speed Scanner – 24 (Nos.)
  – Server (45 No.): (P-III, 800 MHz, 512 MB, 6*36 GB HDD, CD & 1.44 MB Floppy Drive)
  – 40/80 GB DLT Drives
  – 100 MB Zip Drives, CD Writer
  – Local Area Network
  – 24 Workstations with each Server
  – Intelligent Workstations (P-III, 800 MHz, 128 MB, 9 GB HDD)
  – Laser & Line Matrix Printer
SOFTWARE
  – Operating Systems: Windows NT, Windows 98
  – Latest Software Packages: E-FLOW, MS-OFFICE, Software Package for Computer Assisted Coding

SNAPSHOTS OF HARDWARE RESOURCES
Distribution of PCs for various stages of Form Processing using e-FLOW - HHOLD PROJECT
Locations (Sl. no. 1-15): Ahmedabad; Bangalore; Bhopal (eflow1, eflow2, eflow3); Bhubaneswar; Chandigarh (eflow1, eflow2); Chennai (eflow1, eflow2); Delhi (eflow1, eflow2, eflow3, eflow4); Guwahati; Hyderabad (eflow1, eflow2); Jaipur; Kolkatta (eflow1, eflow2); Lucknow (eflow1, eflow2, eflow3); Mumbai (eflow1, eflow2, eflow3); Patna (eflow1, eflow2); Trivandrum

Stage                     Total PCs (all locations)
Un-manned PCs
  File Portal             28
  FormID                  54
  Processing              114
  Export                  28
  MERGE                   28
  Controller              28
  Sub-total               280
Supervisory staff PCs
  Scan                    24
  RC                      78
  Manual ID               31
  Exception               101
  Sub-total               234
Operators PCs
  Tile                    128
  Completion              480
  Sub-total               608
Total PCs                 1122

DATA CAPTURE & PROCESSING
Role of the Integrator
• Supply, installation and on-site maintenance of scanners
• Supply and installation of form processing software
• Manage the LAN and load balancing from one stage to another
• Provide a software core team centrally at ORGI HQ
• Impart operational training to the staff at each location
• Provide software personnel at each site
• Provide scanner operators and carry out scanning operations
• Achieve > 90% recognition rate and < 2% false positives
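The two performance targets set for the integrator can be checked with simple arithmetic. The sketch below is illustrative only: the source does not define the denominators, so it assumes that the recognition rate is the share of all characters recognised automatically and that the false-positive rate is the share of automatically recognised characters later found to be wrong. The class `BatchCounts` and the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchCounts:
    total_chars: int          # characters presented to the ICR engines
    auto_recognised: int      # characters accepted without operator help
    wrongly_recognised: int   # accepted characters later found to be wrong


def recognition_rate(c: BatchCounts) -> float:
    """Percentage of all characters that were recognised automatically."""
    return 100.0 * c.auto_recognised / c.total_chars if c.total_chars else 0.0


def false_positive_rate(c: BatchCounts) -> float:
    """Percentage of auto-recognised characters that were wrong."""
    if not c.auto_recognised:
        return 0.0
    return 100.0 * c.wrongly_recognised / c.auto_recognised


def meets_targets(c: BatchCounts,
                  min_recognition: float = 90.0,
                  max_false_positive: float = 2.0) -> bool:
    """Check the targets: > 90% recognition and < 2% false positives."""
    return (recognition_rate(c) > min_recognition
            and false_positive_rate(c) < max_false_positive)


batch = BatchCounts(total_chars=10_000, auto_recognised=9_300,
                    wrongly_recognised=93)
print(recognition_rate(batch))     # 93.0
print(false_positive_rate(batch))  # 1.0
print(meets_targets(batch))        # True
```

If the false-positive rate were instead measured against all characters rather than only the recognised ones, the same counts would give a slightly lower figure, so the definition matters when a payment is linked to the target.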