The Best Practices in Census Data Processing Operation: Case of 2009 Census

By Cleophas Kiio
Director, ICT
15-sep-10
Overview
• Data Processing Activities Review
• Planning for Data Processing
• Setting Up the Data Processing Site
• Implementation
• Data Capture
• Analysis
• Dissemination
• Archival
Data Processing Activities Review
DP follows the completion of field data
collection and entails the following:
• Capture
• Cleaning/Editing
• Tabulation
• Analysis
• Dissemination
• Archival
Planning for Data Processing (DP)
1. Identification of Methodology/technology:
– Keying From Paper (KFP): manual data entry, largely used in KNBS for
small surveys
– Keying From Image (KFI): scanning
– Optical Mark Reading (OMR): scanning
– Optical/Intelligent Character Recognition (OCR/ICR): scanning
– Online data capture: use of PC
– Use of mobile devices (PDA)
• For the 2009 Census, KNBS chose scanning technology with
OCR/ICR, having used the same in the 1999 Census.
• A study tour of the US Census Bureau was conducted to
study its best practices.
• Major considerations were the budget and the availability of
technical know-how.
Planning for Data Processing (DP) cont’d
2. Selection of Tools and Equipment:
• Computers: acquired 125 high-capacity computers with dual screens.
• Servers: 3 high-end servers processed the census (32 GB memory, multiple
processors, 1 terabyte of secondary storage each).
• Storage: 3 high-capacity Storage Area Networks (SANs) were procured,
initially 5 terabytes (TB) each but later upgraded to 14 TB each.
• Software:
– Capture software: given the challenges faced in the 1999 Census, where the
Bureau used AFPS Pro from Top Image Systems (TIS), the Bureau chose
to use the iCADE system (integrated Computer Assisted Data Entry
System) developed by the US Census Bureau.
– Cleaning/tabulation: CSPro (Census and Survey Processing System)
• Scanners: 3 new Kodak 1860 high-volume scanners were acquired in
addition to the 2 existing Kodak 1900 scanners used during the 1999
Census. Capable of scanning over 200 ppm.
• Network infrastructure: all computers, scanners, servers and SANs were
connected in a wide area network (WAN)
Planning for Data Processing (DP) cont’d
3. Design of Questionnaires
• As standard practice, questionnaires are developed
and designed with the technology to be used in mind.
• The 2009 Census questionnaires were designed by
highly trained Bureau staff.
• Technical support was offered by the US Census
Bureau.
• Precision in design was critical for compatibility
with the iCADE system.
Setting Up the DP Site
1. Planning the layout (library, KFI, OCR/Manual
registration, server room, editing )
2. Installing the computer network
3. Installing the power supply system and
provisioning for power backup system: UPS and
generators
4. Installing the furniture, lifts and air conditioning
5. Procuring high-bandwidth internet
6. Securing a warehouse for storage
7. Recruitment of staff
Implementation
– Systems installation and testing were completed
after census enumeration
– Integrated Computer Assisted Data Entry (iCADE)
system training
– In 2009 we had approximately 12 million A3
questionnaires.
– Engaged close to 500 personnel for the
processing.
– Processing took less than one (1) year to
complete
a) 2009 Data Capture Processes
• Questionnaires were tracked with a custom-made
tracking system with an inbuilt geocode list to ascertain
completeness and control flow
• Guillotining- trimming/cutting off the spirals
• iCADE system processes
o Batching: registering books from each enumeration area (EA) in iCADE
o Scanning
o Auto and Manual registration
o Exception review
o OCR review
o Key From Image (KFI)
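
The completeness check that a tracking system performs against a geocode list can be pictured with a small sketch. This is an illustration in Python, not the KNBS tracking system itself; the geocode strings and the function name `check_completeness` are hypothetical.

```python
def check_completeness(expected_geocodes, checked_in_geocodes):
    """Compare checked-in EA geocodes against the master geocode list."""
    expected = set(expected_geocodes)
    received = set(checked_in_geocodes)
    return {
        # EAs on the master list whose books have not arrived yet
        "missing": sorted(expected - received),
        # Codes that were checked in but do not exist on the master list
        "unexpected": sorted(received - expected),
        # True only when every expected EA has been checked in
        "complete": expected <= received,
    }


# Hypothetical geocodes, for illustration only
master_list = ["101-01-001", "101-01-002", "101-01-003"]
checked_in = ["101-01-001", "101-01-003", "999-99-999"]

report = check_completeness(master_list, checked_in)
print("Missing:", report["missing"])
print("Unexpected:", report["unexpected"])
print("Complete:", report["complete"])
```

Keeping "missing" and "unexpected" separate matters in practice: the first points to books still in the warehouse or in transit, the second to a labelling or keying mistake at check-in.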
iCADE Processes flow
[Figure: iCADE process flow. Boxes: Check-in and Guillotining; Library (Questionnaires Holding area); Batching; Scanning; Auto and Manual Registration; Exception Review; OCR Review; Key From Image (KFI); Images and Script files database (Server/SAN); Output Data (Server/SAN). Legend: Minimum Interaction; Process Flow.]
Capture Output
– Captured data was output to a text file, then auto-formatted
as input to the CSPro software
– OCR characters read: 2,485,008,272 with an accuracy
rate of 99.86% (0.14% error)
– KFI characters keyed: 228,771,647 with a 99.94% accuracy
rate (0.055% error)
– This means the OCR read over 90% of the characters with a
very high accuracy rate (OCR review definitely helped achieve
this accuracy rate, but customized algorithms also had to be
added to improve quality).
– 22,326,373 images from the census questionnaires
– 273,201 books in 144,098 batches
– 10,602 batches went to exception review and 133,496
batches bypassed Exception Review altogether and went
straight into OCR.
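
The "over 90%" figure and the batch split can be checked directly from the counts above. The short Python calculation below uses only the numbers reported on this slide; the variable names are chosen for the example.

```python
# Character counts reported for the two capture routes
ocr_chars = 2_485_008_272
kfi_chars = 228_771_647
total_chars = ocr_chars + kfi_chars

# Share of all captured characters that were read by OCR
ocr_share = ocr_chars / total_chars

# Batch counts reported for exception review
batches_total = 144_098
batches_exception = 10_602
batches_direct = 133_496

# The two routes should account for every batch
batches_reconcile = (batches_exception + batches_direct) == batches_total
exception_share = batches_exception / batches_total

print(f"OCR share of characters: {ocr_share:.1%}")
print(f"Batches reconcile: {batches_reconcile}")
print(f"Batches sent to exception review: {exception_share:.1%}")
```

OCR accounts for roughly 91.6% of the characters, the two batch counts add up to the reported total, and only about 7.4% of batches needed exception review.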
b) 2009 Data Analysis
– KNBS used CSPro, freeware from the US
Census Bureau.
– This process required:
o Subject matter specialists to provide
editing rules
o Programmers to implement the editing rules
through programs
– The team developed the editing
program with which the data was cleaned.
Editing/cleaning and Imputation
• Systematic inspection of invalid and inconsistent
responses, and subsequent manual or automatic
correction according to predetermined rules (edit
specs).
• Imputation is the procedure of assigning values to
missing, invalid, or inconsistent data using a set of
predefined criteria embedded in an editing program.
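
A minimal sketch of how an edit rule and an imputation step might fit together, written in Python for illustration rather than in CSPro's own language. The variables (`age`, `sex`), the valid ranges, the starting donor values and the sequential hot-deck approach (reusing the last valid value seen) are assumptions chosen for the example, not the edit specs used in the 2009 Census.

```python
# Hypothetical edit specs: each variable maps to a validity check.
EDIT_SPECS = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "sex": lambda v: v in (1, 2),  # assumed codes: 1 = male, 2 = female
}


def edit_and_impute(records, edit_specs, initial_donor):
    """Return cleaned copies of the records plus a log of every imputation.

    Invalid or missing values are replaced with the most recent valid value
    seen for that variable (a simple sequential hot deck).
    """
    donor = dict(initial_donor)  # values used before any valid record is seen
    cleaned, log = [], []
    for index, record in enumerate(records):
        fixed = dict(record)  # never modify the raw capture output
        for variable, is_valid in edit_specs.items():
            value = fixed.get(variable)
            if is_valid(value):
                donor[variable] = value  # remember as donor for later records
            else:
                fixed[variable] = donor[variable]
                log.append((index, variable, value, donor[variable]))
        cleaned.append(fixed)
    return cleaned, log


raw = [
    {"age": 25, "sex": 2},
    {"age": 250, "sex": None},  # age out of range, sex missing
    {"age": 40, "sex": 1},
]
cleaned, log = edit_and_impute(raw, EDIT_SPECS, {"age": 30, "sex": 1})
for entry in log:
    print(entry)  # (record index, variable, original value, imputed value)
```

Logging each change alongside the original value supports the goal of identifying the types and sources of error, not only correcting them.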
Why Edit and Impute?
• Clean up data to facilitate analysis
• Identify types and sources of error
• Improve quality of census data
• Errors must be detected and their causes
identified
• Appropriate corrective measures are taken to
improve the overall data quality.
Graphic flow of Editing and imputation
[Figure: flow of editing and imputation. Boxes: Codes book (Dictionary); Editing and Imputation (Edit Specs); iCADE Output Data; Data Cleaning Program; Clean Data.]
c) Data Tabulation
– Process of producing data outputs (tables,
frequencies, cross-tabulations, …)
– Requires subject matter specialists to prepare
dummy output layouts, supported by programmers
– Data is then presented in these tabular layouts.
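
To illustrate what a tabulation program produces from clean data, the sketch below builds a small cross-tabulation in Python. The records, the area names and the column layout are hypothetical; CSPro has its own tabulation facilities, and this only mirrors the idea of filling a predefined (dummy) table layout with counts.

```python
from collections import Counter


def cross_tabulate(records, row_var, col_var):
    """Count records for every (row value, column value) pair."""
    return Counter((rec[row_var], rec[col_var]) for rec in records)


def format_table(counts, row_labels, col_labels, row_title):
    """Fill a predefined layout (fixed rows and columns) with the counts."""
    header = (row_title.ljust(10)
              + "".join(col.rjust(8) for col in col_labels)
              + "Total".rjust(8))
    lines = [header]
    for row in row_labels:
        # A Counter returns 0 for pairs that never occurred in the data
        cells = [counts[(row, col)] for col in col_labels]
        lines.append(row.ljust(10)
                     + "".join(str(n).rjust(8) for n in cells)
                     + str(sum(cells)).rjust(8))
    return "\n".join(lines)


# Hypothetical clean records, for illustration only
clean_data = [
    {"area": "Area A", "sex": "Male"},
    {"area": "Area A", "sex": "Female"},
    {"area": "Area A", "sex": "Male"},
    {"area": "Area B", "sex": "Female"},
]

counts = cross_tabulate(clean_data, "area", "sex")
table = format_table(counts, ["Area A", "Area B"], ["Male", "Female"], "Area")
print(table)
```

Passing the row and column labels explicitly plays the role of the dummy layout: cells appear in the order the subject matter specialists specified, including cells whose count is zero.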
Graphic flow of Tabulation
[Figure: flow of tabulation. Boxes: Codes book (Dictionary); Area Names; Preferred Presentation (Tabulation Specs); Clean Data; Tabulation Program; Reports (Volume IA, Volume IB, Volume IC, Volume II, …).]
d) Data Dissemination
– Providing the public with information through census
books, fliers, CDs, DVDs, online databases (Census
info, IMIS, SMS service)
e) Data Archival
– Documentation for permanent storage for further
and future analysis
Challenges
– The warehouse was located about 10 km from the
processing centre
– Inadequate processing space
– Printing was not perfect, which affected the OCR
– The limited number and constant breakdown of
the KNBS dedicated lifts slowed down
processing.
– Power outages posed a major challenge
– Being a new system, there was a cautious and
slow acceptance of the system.
Best practices: Lessons learnt
– A comprehensive DP plan should be developed with
clearly defined objectives:
1. Efficiency and effectiveness to process in the
shortest time possible.
2. Control cost of processing to avoid budget
overruns.
3. Quality data output
– Carry out risk analysis beforehand to identify
potential pitfalls and put in place mitigation
measures.
Best practices: Lessons learnt cont’d
– Cartographic mapping should be completed 1 year before
the census
– Geographical codes and related documentation
(geo-codes) should be ready 6 months before
enumeration.
– Timely acquisition of census tools and equipment
– The DP site should be ready 6 months before the enumeration
date for test runs.
– Technical and maintenance support measures
must be instituted and enforced.
Best practices: Lessons learnt cont’d
– Questionnaires and manuals should be ready 5
months before the census date to allow for
logistics and pretesting.
– Total quality control at the printing press must
be ensured for precision printing.
– Recruitment and training of staff should be done
before the census date.
– The DP site should be located in close proximity to the
questionnaire warehouse.
Conclusion
• Despite the challenges, it was possible to
complete DP in less than a year after the census.
• However, with better planning and organization,
it is possible to complete the
exercise within 6 months after enumeration.
• The lessons learnt may form
recommendations that, if adopted, would make
this attainable.
Thank You!