The Best Practices in Census Data Processing Operation: Case of 2009 Census
By Cleophas Kiio, Director, ICT
15-sep-10

Overview
• Data processing activities review
• Planning for data processing
• Setting up the data processing site
• Implementation
• Data capture
• Analysis
• Dissemination
• Archival

Data Processing Activities Review
Data processing (DP) follows the completion of field data collection and entails the following:
• Capture
• Cleaning/Editing
• Tabulation
• Analysis
• Dissemination
• Archival

Planning for Data Processing (DP)
1. Identification of methodology/technology:
– Keying From Paper (KFP): manual data entry, largely used in KNBS for small surveys
– Keying From Image (KFI): scanning
– Optical Mark Reading (OMR): scanning
– Optical/Intelligent Character Recognition (OCR/ICR): scanning
– Online data capture: use of PC
– Use of mobile devices (PDA)
• For the 2009 Census, KNBS chose scanning technology with OCR/ICR, having used the same in the 1999 Census.
• A study tour of the US Census Bureau was conducted to understudy the best practices.
• Major considerations were the budget and the availability of technical know-how.

Planning for Data Processing (DP) cont’d
2. Selection of tools and equipment:
• Computers: 125 high-capacity computers with dual screens were acquired.
• Servers: 3 high-end servers handled the census processing (32 GB memory, multiple processors and 1 terabyte of secondary storage each).
• Storage: 3 high-capacity Storage Area Networks (SANs) were procured, initially 5 terabytes (TB) each but later upgraded to 14 TB each.
• Software:
– Capture software: given the challenges faced in the 1999 Census, where the Bureau used AFPS Pro from Top Image Systems (TIS), the Bureau chose the iCADE system (integrated Computer Assisted Data Entry system) developed by the US Census Bureau.
– Cleaning/tabulation: CSPro (Census and Survey Processing System)
• Scanners: 3 new Kodak 1860 high-volume scanners were acquired in addition to the 2 existing Kodak 1900 scanners used during the 1999 Census. Capable of scanning over 200 pages per minute (ppm).
• Network infrastructure: all computers, scanners, servers and SANs were connected in a wide area network (WAN).

Planning for Data Processing (DP) cont’d
3. Design of questionnaires
• As standard practice, questionnaires are developed and designed with the technology to be used in mind.
• The 2009 Census questionnaires were designed by highly trained Bureau staff.
• Technical support was offered by the US Census Bureau.
• Precision in design was critical for compatibility with the iCADE system.

Setting Up the DP Site
1. Planning the layout (library, KFI, OCR/manual registration, server room, editing)
2. Installing the computer network
3. Installing the power supply system and provisioning a power backup system: UPS and generators
4. Installing the furniture, lifts and air conditioning
5. Procuring high-bandwidth internet
6. Securing a warehouse for storage
7. Recruiting staff

Implementation
– Installation and testing of systems was completed after census enumeration.
– Training on the integrated Computer Assisted Data Entry (iCADE) system was conducted.
– In 2009 we had approximately 12 million A3 questionnaires.
– Close to 500 personnel were engaged for the processing.
– Processing took less than one (1) year to complete.

a) 2009 Data Capture Processes
• Questionnaires were tracked with a custom-made tracking system with an inbuilt geocode list to ascertain completeness and control flow.
• Guillotining: trimming/cutting off the spirals
• iCADE system processes:
o Batching: registering books from each EA in iCADE
o Scanning
o Auto and manual registration
o Exception review
o OCR review
o Key From Image (KFI)

iCADE Process Flow
[Diagram. Stages shown: Check-in and Guillotining; Library (questionnaires holding area); Batching; Scanning; Auto and Manual Registration; Exception Review; OCR Review; Key From Image (KFI); Images and script files database (Server/SAN); Output Data (Server/SAN). Legend: Process Flow; Minimum Interaction.]

Capture Output
– Captured data was output to a text file, then auto-formatted as input to the CSPro software.
– OCR characters read: 2,485,008,272, with an accuracy rate of 99.86% (0.14% error)
– KFI characters keyed: 228,771,647, with an accuracy rate of 99.94% (0.055% error)
– This means the OCR read over 90% of the characters with a very high accuracy rate (OCR review definitely helped achieve this accuracy, but customization algorithms also had to be added to improve quality).
– 22,326,373 images from the census questionnaires
– 273,201 books in 144,098 batches
– 10,602 batches went to exception review, while 133,496 batches bypassed exception review altogether and went straight into OCR.

b) 2009 Data Analysis
– KNBS used CSPro, freeware from the US Census Bureau.
– This process required:
o Subject matter specialists to provide editing rules
o Programmers to implement the editing rules through programs
– The team developed the editing program with which the data was cleaned.

Editing/cleaning and Imputation
• Editing is the systematic inspection of invalid and inconsistent responses, and their subsequent manual or automatic correction according to predetermined rules (edit specs).
• Imputation is the procedure of assigning values to missing, invalid or inconsistent data using a set of predefined criteria embedded in an editing program.

Why Edit and Impute?
• Clean up data to facilitate analysis
• Identify types and sources of error
• Improve the quality of census data
• Errors must be detected and their causes identified
• Appropriate corrective measures are taken to improve the overall data quality

Graphic flow of Editing and Imputation
[Diagram. Elements shown: Codes book (Dictionary); Editing and Imputation (Edit Specs); iCADE Output Data; Data Cleaning Program; Clean Data.]

c) Data Tabulation
– Process of producing data outputs (tables, frequencies, cross-tabulations, …)
– Requires subject matter specialists, supported by programmers, to prepare dummy output layouts.
– Data is then presented in these tabular layouts.

Graphic flow of Tabulation
[Diagram. Elements shown: Codes book (Dictionary); Area Names; Preferred Presentation (Tabulation Specs); Clean Data; Tabulation Program; Reports: Volume IA, Volume IB, Volume IC, Volume II, …]

d) Data Dissemination
– Providing the public with information through census books, fliers, CDs, DVDs and online databases (Census info, IMIS, SMS service)

e) Data Archival
– Documentation for permanent storage for further and future analysis

Challenges
– The warehouse was located about 10 km from the processing centre.
– Processing space was inadequate.
– Printing was not perfect, which affected the OCR.
– The limited number and constant breakdown of the KNBS dedicated lift slowed down processing.
– Power outages posed a major challenge.
– Because the system was new, its acceptance was cautious and slow.

Best practices: Lessons learnt
– A comprehensive DP plan should be developed with clearly defined objectives:
1. Efficiency and effectiveness, to process in the shortest time possible.
2. Control of processing costs, to avoid budget overruns.
3. Quality data output.
– Carry out risk analysis beforehand to identify potential pitfalls and put mitigation measures in place.

Best practices: Lessons learnt cont’d
– Cartographic mapping should be completed 1 year before the census.
– Geographical codes and related documentation (geo-codes) should be ready 6 months before enumeration.
– Census tools and equipment should be acquired in good time.
– The DP site should be ready 6 months before the enumeration date for test runs.
– Technical and maintenance support measures must be instituted and enforced.

Best practices: Lessons learnt cont’d
– Questionnaires and manuals should be ready 5 months before the census date to allow for logistics and pretesting.
– Total quality control at the printing press must be ensured for precision printing.
– Recruitment and training of staff should be done before the census date.
– The DP site should be located in close proximity to the questionnaire warehouse.

Conclusion
• Despite the challenges, it was possible to complete DP in less than a year after the census.
• However, with better planning and organization, it is possible to complete the exercise within 6 months after enumeration.
• The lessons learnt may form the recommendations which, if adopted, would allow this to be attained.

Thank You!
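
Note on the Capture Output figures
The counts reported under Capture Output can be cross-checked with a little arithmetic. The sketch below is only an illustration in Python and is not part of the original presentation; the variable names are hypothetical, and only the quoted counts come from the slides.

```python
# Counts quoted on the "Capture Output" slide (variable names are illustrative).
ocr_chars = 2_485_008_272      # characters read by OCR
kfi_chars = 228_771_647        # characters keyed from image (KFI)
total_batches = 144_098
exception_batches = 10_602     # batches sent to exception review
bypass_batches = 133_496       # batches that went straight into OCR

# Share of all captured characters that were read by OCR rather than keyed.
ocr_share = ocr_chars / (ocr_chars + kfi_chars)

# Share of batches that needed exception review.
exception_share = exception_batches / total_batches

# The two batch routes should add up to the reported total.
batches_consistent = (exception_batches + bypass_batches) == total_batches

print(f"OCR share of captured characters: {ocr_share:.1%}")
print(f"Batches sent to exception review: {exception_share:.1%}")
print(f"Batch counts consistent: {batches_consistent}")
```

With these counts, the OCR share comes out between 91% and 92%, which agrees with the "over 90%" stated on the slide, and the two batch routes sum to the reported 144,098 batches.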