United Nations Regional Workshop on Census Data Processing
Contemporary Technology for Census Data Capturing and Editing: A Perspective of the South African Data Processing System
A presentation by the South African Data Processing Team
Dar-es-Salaam, Tanzania, 9-13 June 2008
Census 1996, 2001 & Community Survey (CS)

The presentation layout
• Introduction
• Data processing goal
• Planning phase
• Design of Data Processing System
• System Development & Testing
• Implementation & Operations
– Process flow
– Document Management System
– Progress reporting
– Tool of scanning
– Exceptions
– Quality Assurance (QA)
• Accounting or Balancing process
• Data validation & Editing
• Tabulation and output products

Introduction
• Data processing is considered part of the survey operations value chain (with a properly defined accountability structure);
• There are defined inter-dependency links with other Census sections (e.g. questionnaire design, data collection, …);
• It depends heavily on the information technology support available around the country (outsourcing of the management of the system);
• Tight project management principles: checking timelines, resources and detailed production lines;
• Obliged to adapt to ever-changing technology (1996 KFP, 2001 Census scanning, 2007 scanning with old scanners, 2011 Census scanning with upgraded scanners).

Goal of Data Processing
To accurately process or convert the statistical information from different collection tools, such as the questionnaire, into comprehensive electronic data that is clean, accurate, consistent and reliable.

Planning phase
• Going through the lessons learned from previous censuses and surveys (1996 Census, 2001 Census, 2007 Community Survey)
– Preparation of processing site
   • In the 1996 Census, distributed data processing centres in 9 provinces
   • In the 2001 Census and 2007 CS, a centralized data processing centre
– Mode of data capturing
   • In the 1996 Census: manual capturing (key from paper) running on an SQL database with an interface developed in Visual Basic
   • In the 2001 Census and 2007 CS: use of proprietary scanning technology linked to an Oracle database
– Census budget
   • The 1996 Census budget was estimated at 500 million Rand
   • The 2001 Census budget was estimated at 1.2 billion Rand
   • The 2007 CS budget was estimated at 600 million Rand
– Human resources
   • The 1996 Census had more staff, for key from paper (an option considered for job creation across the country)
   • The 2001 Census and 2007 CS had a reduced number of staff supporting the scanning technology, working in shifts
– Duration
   • The 1996 Census data capturing was planned for 12 months
   • The 2001 Census was planned for 6 months.
     However, the period was extended 18 months because the new technology had not been tested
   • The 2007 CS took only 3 months, as planned
– Systems design and specifications
   • In the 2001 Census, system specification and development were reviewed during implementation
   • In the 2007 CS, most of the system specification and development was completed and tested before production

Planning phase
• Strategic plan
– There is a policy on standard procedures for documentation, process flow, metadata and concepts, managed by the DMID (Data Management and Information Delivery) project;
– Common strategy across the survey programme: scanning technology with control of transactions in a database;
– Moving toward a centralised corporate data processing centre (store management, …);
– Accounting of production transactions, tracking each questionnaire using a barcode;
– Measurement of quality at each process of production;
– Having a permanent team of data processors in order to retain experience while building capacity;
– Acceptance of any system or module into production only after it has gone through a testing phase, to avoid repeating the 2001 Census experience of an untested system.

Planning phase
• Operational plan & Budget
– Since the 2001 Census, there has been a detailed list of activities, sub-activities and tasks with timelines (start and end dates) and responsible persons;
– Since the 2007 CS, each activity has been linked to the budget in what is called activity/task-based costing;
– Since the 2007 CS, there has been an independent and dedicated team in charge of project management and monitoring of activities;
– A list of documents and other deliverables is submitted to the project management team (PMO) to keep track of progress;
– Development of performance indicators for the PMO to track, giving the daily production counts per process;
– Based on activity costing, the budget has never been an issue, except in the 2001 Census when the project went beyond
  the planned period.

Design of Data Processing
• The data processing team gets the user requirements from the questionnaire design team and the data collection team;
• The team, comprising data processors, a system analyst (1 person), programmers, statisticians and data technologists (IT technicians), prepares the overall design specifications;
• The data processing team is supplemented by the Data Collection team in the management of production and of staff on the flow;
• The scanning module of the system is outsourced (in 2001 to a consortium of companies, but in the 2007 CS one company was accountable);
• In the 2007 CS, data processing project management was controlled in-house to avoid the lack of accountability observed in the 2001 Census, where it was done by an external party (PROCON);
• Since the workflow kept changing in the 2001 Census, an approved workflow with the operational procedure manual was ready in the 2007 CS before the start of production;
• The functional specifications were done only in the 2007 CS, as part of the overall system specification;
• In the 2001 Census the technical specifications were completed for the as-built system, whereas the 2007 CS specifications were done before any implementation.
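
The performance indicators mentioned under the operational plan (daily production counts per process) could be derived along the following lines. This is only a minimal sketch and not the Stats SA system: the log layout, the function names and all figures are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from math import ceil

# Hypothetical production log: (process, production day, questionnaires completed).
production_log = [
    ("Scanning", "2007-04-12", 6000),
    ("Scanning", "2007-04-13", 8000),
    ("Tiling", "2007-04-13", 5000),
    ("Scanning", "2007-04-13", 1000),  # second entry for the same day
]


def daily_counts(log):
    """Sum completed questionnaires per process and per day."""
    counts = defaultdict(lambda: defaultdict(int))
    for process, day, completed in log:
        counts[process][day] += completed
    return counts


def average_per_day(counts, process):
    """Average daily production over the days on which the process ran."""
    per_day = counts[process]
    return sum(per_day.values()) / len(per_day)


def estimated_days_to_finish(outstanding, average):
    """Whole production days still needed at the current average."""
    return ceil(outstanding / average)


counts = daily_counts(production_log)
scanning_average = average_per_day(counts, "Scanning")
print(dict(counts["Scanning"]))
print(scanning_average)
print(estimated_days_to_finish(30000, scanning_average))
```

Keeping the counts keyed by process and day means the same structure can feed both a daily report and an estimate of the remaining production days.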

System Development & Testing
• In the 1996 Census, system development was done by an in-house team supported by Swedish consultants;
• In the 2001 Census, system development was outsourced to a locally based company that put together a consortium of service providers for project management, system development, scanner specialists/maintenance, and image and recognition software;
• In the 2007 CS, system development and project management were done in-house, outsourcing only the scanning software and scanner maintenance;
• In the 1996 Census, only unit tests were conducted, whereas in the 2001 Census most of the tests (unit tests, production load test, …) were conducted while already in production;
• In the 2007 CS, all tests were done before production:
– For instance, background colour drop-out was tested in the 2007 CS, whereas the blue background in the 2001 Census required a blue light in the scanner (tested after months of production);
– The decision on exception handling (rescan or transcription) was made during production in the 2001 Census, whereas in the 2007 CS the questionnaires were sent to Key From Paper (KFP) or Key From Image (KFI);
– In the 2007 CS, false-positive readings were reduced by introducing voting rules between two different recognition engines, whereas in the 2001 Census all false-positive readings were sent to the verification stage (Tiling and Completion/Key correction).

Implementation & Operations
• Operational procedures
– In the 2001 Census, the operational procedure manual was prepared during production;
– In the 2007 CS, the operational procedure was in place before training;
– Every day a production account is produced (extracted from the Oracle database).
• Recruitment
– In the 1996 Census, the production staff were selected based on keying speed only;
– In the 2001 Census, the production staff were recruited based on the requirements of each process;
– In the 2007 CS, the production staff have versatile skills as data processors and
  can move between processes depending on needs, as determined by the flow manager;
– In the 2001 Census, staff worked 24 hours a day, 7 days a week, in 3 shifts. In the 2007 CS, only one shift was run to meet the deadline.
• Training
– In the 2001 Census, training was conducted by the service provider (PROCON), whereas in the 1996 Census and 2007 CS the training was given by the senior data processors, system developers and statisticians who were part of the design team.
• Preparation of work environment
– The 1996 Census used 9 sites, the 2001 Census one warehouse site, and the 2007 CS two sites (one for main storage and the other for production);
– Site preparation, including partitioning, hardware and networking, was completed one month before the end of Census field operations.

Operations cont…
High Level Process Flow
[Figure: CS Data Processing Process Flow. Stages shown include: Document Management System (DMS) box barcode checking; CSAS receiving, barcode matching and content verification; primary preparation; guillotine; secondary preparation; scanning; post-scanning box check; image validations and form identification; recognition/interpretation; verify/tiling/completion; automated coding and on-screen coding (resolution); normalisation; accuracy check on a sample (95%); Key from Image; Key from Paper (1st and 2nd capture) for non-scannable (damaged) and unrecognisable cases; transfer/export.]
[Process flow figure, DMS Ver 3.0. Later stages shown: determine cause; manual editing; accuracy check (95%); transfer to the output database only if >= 95%; sign-off (balancing). Legend: Exception Path; Data Movement Path; Physical Movement Path.]

Operations cont…
Document Management System
• Tracking of document movement across processes;
• Accounting of all transactions, including production staff logins;
• Database driven (SyBase in 1996, Oracle in 2001 and 2007);
• Progress reporting per user, per function and per process;
• Reporting supports performance management (speed, time, production unit, …).

Operations cont…
Progress reporting
CS 2007 Data Processing Progress Report
Date: Monday 02 Jul 2007

Constants
Number of boxes to be processed: 17,387
Number of questionnaires in boxes (Final Result Code 0-9): 284,244
Number of questionnaires in boxes (Final Result Code 1,4), processable: 251,775
Number of questionnaires in boxes, not processable: 32,469

Overall Progress - Boxes
Process                                                  Completed     Total   Outstanding   % Complete
Check Out DPC store                                         17,387    17,387             -      100.00%
Check In OM store                                           17,387    17,387             -      100.00%
Store audit verification                                    17,387    17,387             -      100.00%
Primary preparation and Guillotine                          17,304    17,304             -      100.00%
KFP (Manual Coding, First Capture and Second Capture)           61        83            22       73.49%
Secondary preparation                                       17,304    17,304             -      100.00%
Scanning                                                    17,015    17,015             -      100.00%
Post Scan Checkout                                          17,015    17,015             -      100.00%

Overall Progress - Questionnaires
Process                                                  Completed     Total   Outstanding   % Complete
Store Audit Verification                                   284,244   284,244             -      100.00%
Primary preparation and Guillotine                         281,840   281,840             -      100.00%
KFP (Manual Coding, First Capture and Second Capture)        1,902     2,404           502       79.12%
Secondary Preparation                                      281,840   281,840             -      100.00%
Scanning                                                   247,746   247,746             -      100.00%
Character Recognition                                      246,605   247,746         1,141       99.54%
Tiling                                                     246,387   247,746         1,359       99.45%
Completion                                                 244,782   247,746         2,964       98.80%
Data export                                                246,522   247,746         1,224       99.51%
Sampled for QA                                             246,200   247,746         1,546       99.38%
Quality Assurance                                          242,700   247,746         5,046       97.96%
Coding                                                           -   247,746       247,746        0.00%
KFI (Exception resolution)                                  49,390    49,976           586       98.83%
Sign-off                                                   240,423   247,746         7,323       97.04%

Further columns in the report: Buffer; Work in Progress; planned and actual start and end dates; required target per day; average production per day; estimated days for completion based on current average; (lead) or delay in days; new end date. A Daily Progress section lists the same processes per production day (Fri 22 Jun to Mon 02 Jul).

Operations cont…
Progress reporting
[Charts: Overall Progress - Boxes and Overall Progress - Questionnaires (share completed, work in progress, buffer and outstanding per process, with targets for manual and automated processes); Daily Progress - Boxes and Daily Progress - Questionnaires (Fri 22 Jun to Mon 02 Jul); Cumulative Progress - Scanning, Planned vs Actual (number of questionnaires by production day, Thu 12 Apr to Fri 08 Jun).]

Operation cont…
Tool of scanning
• Kodak 9520D
– Used in the 2001 Census;
– Used in the 2007 CS;
• Differential scanner feeding (page by page and/or in batches);
• Barcode recognition at scanning time.

Operation cont…
Exceptions
• Questionnaire transcription:
– Poor image quality
– Faint writing
– Missing pages
– Wrong unique identifier (Enumerator Area, Dwelling Unit & Household Number)
• Key From Paper (KFP):
– Damaged
– Unscannable
– Inconsistent page numbering
– Unique identifier (barcode)
• False-positive reading:
– Poor software recognition
– Poor image quality
– Incomplete text (character)
– Unrecognized mark or character
• Failed quality checks:
– Quality rate below the threshold (95% accuracy rate)

Operation cont…
Quality Assurance (QA)
• In the 1996 Census, quality control was implemented as part of double keying, without any measurement attached to it;
• In the 2001 Census,
  the quality was measured at scanning time (image quality check) and after data capturing (Key from Image of the sampled batches; the threshold was 97%);
• In the 2007 CS, a sample of captured questionnaires was subjected to a second capture, which was compared with the first capture to determine an agreement rate (the threshold was reduced to 95% because of good image quality):
– For scanned cases: sample keyed from image and calculation of an agreement rate;
– For exceptional cases: 100% double keyed from paper and calculation of an agreement rate.

Accounting or Balancing process
• After capturing, each questionnaire is accounted for, linked to its geographical area (EA) and checked for the correct data structure (household, persons, …) before any export;
• In the 1996 Census, captured data were exported into SAS/ASCII for post-capture processing (editing and tabulation);
• In the 2001 Census, the balancing process took longer because postal (self-enumeration) questionnaires lacked a reference link to the EA;
• In the 2007 CS, a Census and Administration System (CSAS) assisted in getting the full account of the questionnaires linked to their referenced geography.

Data validation & Editing
• In 1996, the adopted strategy was not to impute any derived value. Only manual editing was allowed;
• In the 2001 Census, based on editing specifications and with the assistance of the US Bureau of the Census, automated editing was implemented using IMPS/CSPro. The 2007 CS follows the same approach used in the 2001 Census.
• Different editing reports with imputation rates were produced for an editing committee, which came up with the rules to apply for correction;
• In the 2001 Census and 2007 CS, limited manual editing was implemented;
• One of the key editing rules is the removal of minimal processable cases caused by poor recognition or false-positive readings;
• Though the editing has been done in ASCII, the output database is exported in different formats (i.e. user driven: ASCII, Oracle, SAS, …) linked with the metadata.

Data validation & editing Cont…

Tabulation and output products
• Since Stats SA policy is to give access to data users, the strategy is to put the Census data in different formats to increase accessibility and promote data use;
• In the 1996 Census, the output database was packaged in a SuperCross database, and a set of aggregated databases was put on CD for the users;
• In the 2001 Census, access to the data was increased by adding online tabulation tools (PX-Web), SuperCross, a reduced ASCII file, …;
• In the 2007 CS, the data is also available in different formats (SuperCross, ASCII file, PX-Web and other map/chart-linked tools);
• The traditional reports are still produced based on the tabulation plan/output reports.

Benefit of scanning Technology
• Improve the quality of the data
• Save time
• Reduce costs

THANK YOU!
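
As a supplementary illustration of the agreement-rate check described under Quality Assurance (QA), the comparison of a first and a second capture can be sketched as follows. This is a minimal sketch under stated assumptions, not the production system: the field-by-field exact comparison, the function names and the sample values are hypothetical, and only the 95% threshold comes from the presentation. A batch is assumed to pass when its rate is at or above the threshold.

```python
from typing import Sequence

THRESHOLD = 0.95  # 2007 CS threshold quoted in the QA slide


def agreement_rate(first: Sequence[str], second: Sequence[str]) -> float:
    """Share of fields whose first and second capture agree exactly."""
    if len(first) != len(second):
        raise ValueError("both captures must cover the same fields")
    if not first:
        raise ValueError("no fields to compare")
    matches = sum(1 for a, b in zip(first, second) if a == b)
    return matches / len(first)


def batch_passes(first: Sequence[str], second: Sequence[str],
                 threshold: float = THRESHOLD) -> bool:
    """Assumed rule: a batch passes at or above the threshold."""
    return agreement_rate(first, second) >= threshold


# Hypothetical batch of 20 captured fields.
first_capture = [str(i) for i in range(20)]
one_mismatch = first_capture[:19] + ["X"]
two_mismatches = first_capture[:18] + ["X", "Y"]

print(agreement_rate(first_capture, one_mismatch))
print(batch_passes(first_capture, one_mismatch))
print(batch_passes(first_capture, two_mismatches))
```

A batch that fails would follow the exception path (for example Key from Image or Key from Paper) rather than being exported.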