Ethiopian 2007 CENSUS DATA CAPTURING AND PROCESSING (CSA)

Ethiopian 2007 CENSUS DATA
CAPTURING AND PROCESSING
CENTRAL STATISTICAL AGENCY
(CSA)
APRIL, 2008
Background Information
 Population and Housing Census process is the
largest data capturing exercise a country can undertake.
 It involves capturing of millions of forms
 The Central Statistical Agency (CSA) started using old
techniques such as punched card readers as early as the 1960s.
 Two Population and Housing Censuses have so
far been conducted in Ethiopia.
 The first Population and Housing Census was
carried out in 1984.
Background Information Cont’d . . .
 During the 1984 Census:
 Data capture was done on manual keyboard
based entry on a mainframe computer
 FORMSPEC data entry system was used
 It took more than 2 years to capture the data
for about 42 million people.
 In the case of the 1994 Census:
 Data capture was again done on manual
keyboard entry basis using PCs
 CENTRY data entry system (IMPS) was used
Background Information Cont’d . . .
 It took about 18 months to capture the data for the
population of about 53 million.
 About 180 data entry clerks were involved
 Around 90 PCs were used
 The entry work was done on 2-shift basis
Some Limitations of the Manual Keyboard
Entry Method
 Time consuming
 Does not allow the availability of timely data
 The data will be weaker in representing the
current or existing situation
 Subject to additional non-sampling errors
 Human error due to manual keying
 Due to the volume of the data, a 100% verification,
as in the case of sample surveys, is difficult.
Limitations Cont . . .
 Involves a great deal of human resource
management.
 Large number of data entry operators
and equipment required
The Need for Alternative Solutions
 The need to have timely census results and
the limitations discussed above forced the
Agency to look for other alternatives
 This is obviously very important with regards
to large volumes of data, as in a census.
 Hence the need to use the Scanning Technology
The Scanning Technology
 The Scanning Technology in general implements
two basic techniques
 Mark recognition, like the Optical Mark
Reader (OMR)
 Character recognition, like the Optical
Character Recognition (OCR), and the
Intelligent Character Recognition (ICR)
Scanning Technology Cont . . .
 OMR is the recognition of shaded marks (blobs) on the
forms
 The positioning of these blobs on a form
determines the alphanumeric characters
they represent
 The character recognition is the recognition of
alphanumeric characters on forms and they are
of 2 types:
 OCR which is the recognition of machine printed
characters, and
Scanning Technology Cont . . .
 ICR which refers to the capture of
hand-printed characters from a form
 For scanning of the 2007 Census the Optical
Mark Reader (OMR) technique has been selected
 The Scanning Technology we use:
PhotoScribe Series PS900 Scanners
(DRS Scanning Technology Product)
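The idea that the position of a shaded blob determines the character it represents can be illustrated with a minimal sketch. It assumes a hypothetical form layout in which each digit of a numeric field has its own column of ten bubbles (0 to 9); the function names and the layout are illustrative only and are not taken from the DRS software.

```python
from typing import List, Optional

DIGITS = "0123456789"


def decode_column(marks: List[bool], symbols: str = DIGITS) -> Optional[str]:
    """Return the symbol for a column with exactly one shaded bubble, else None."""
    shaded = [i for i, is_shaded in enumerate(marks) if is_shaded]
    if len(shaded) != 1:
        return None  # blank or multiply marked column: left for key-correction
    return symbols[shaded[0]]


def decode_field(columns: List[List[bool]]) -> str:
    """Decode a multi-column field; undecodable columns are shown as '?'."""
    return "".join(decode_column(col) or "?" for col in columns)


def column_for(digit: int) -> List[bool]:
    """Helper for the example: a ten-bubble column with only `digit` shaded."""
    return [i == digit for i in range(10)]


age = decode_field([column_for(3), column_for(7)])
print(age)  # 37
```

A column with no shaded bubble, or with more than one, is not decoded here; in practice such cases would be passed on to validation and key-correction.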
DRS PhotoScribe Series PS900
 High speed Imaging Mark Reader
 Windows XP professional
 CD-R/RW drive
 Network connectivity
 A TFT monitor, Keyboard, mouse
 Speed: up to 8,500 forms / hour
The Scanning Process in General
It mainly involves:
 Scanning / Data Capture – including IMAGE capturing
 Validation and Key-correction of scanned data
 Exporting the scanned and key-corrected data
into ASCII or Text format
 The format suitable for electronic processing
Learning from Experiences of
Other Countries
 Study tour made to two African countries
 Tanzania
 To learn from their successes
 Data capture of the 2002 Census of Tanzania
was done in about 26 days
 General report tables were produced within
3 months from the start of the scanning
Experiences of Other Countries . . .
 Ghana
 To learn from their difficulties
 Data capture of the 2000 Census took about
6 months (forms from 29,000 EAs)
 3 Scanners were used (Kodak, Fujitsu)
 The larger scanner was Kodak 500D
 Speed: About 500 forms/min
 Power failure was one of the major problems
 Loss of some data occurred as a result
 A large generator was installed to minimize
the effect of the frequent power cuts
Major Benefits of the Scanning Technology
 Significant decrease in time required to capture
the data
 This helps to get timely data
 Users’ need satisfied (policy makers, planners,
researchers, etc.)
 No need to worry to store millions of forms for
a long time in the future
 Scanning captures the whole content of a
questionnaire in an electronic image format
Requirements for Effective Scanning
 Proper training
 Both on Hardware and Software
 This helps to “own” the technology
 Being able to use the technology after the
departure of the trainers / technical advisors
 A reliable Network System
 A well organized space for forms and data flow
is required
STRUCTURED SPACE FOR FILE FLOW
[Diagram: file flow in eight numbered steps (1 to 8). Labels: Data Processing Center; Warehouse; Receiving the Questionnaires; Registering & Organizing EAs Received from the Field; Store; Retrieval; Registering EAs for Scanning; Waiting Room; Scanning Room; Key-Correction Room; Processing Center.]
Requirements for Effective Scanning - -
 Proper file management and care
 Checking Batch (EA) IDs and orientation of
forms
 Ensuring the EA code on each box is the
same as the one on the questionnaires
 Proper recording of the incoming and outgoing questionnaires
 Close attention in detecting errors in the
scanning process is required
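The batch checks listed above can be sketched as a small routine. This is an illustration under stated assumptions, not the Agency's procedure: the `Batch` structure, the `check_batch` function and the idea of one questionnaire per household are all hypothetical, and the expected household counts stand in for whatever the batch header database actually holds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Batch:
    box_ea_code: str           # EA code written on the box
    form_ea_codes: List[str]   # EA code read from each questionnaire in the box


def check_batch(batch: Batch, expected_households: Dict[str, int]) -> List[str]:
    """Return a list of problems found in one EA batch (empty list means clean)."""
    problems = []
    mismatched = [c for c in batch.form_ea_codes if c != batch.box_ea_code]
    if mismatched:
        problems.append(f"{len(mismatched)} form(s) carry a different EA code")
    expected = expected_households.get(batch.box_ea_code)
    if expected is None:
        problems.append("EA code not found in batch header database")
    elif expected != len(batch.form_ea_codes):
        # Assumes one questionnaire per household, for illustration only.
        problems.append(
            f"expected {expected} forms, found {len(batch.form_ea_codes)}"
        )
    return problems


headers = {"EA001": 3}
print(check_batch(Batch("EA001", ["EA001", "EA001", "EA002"]), headers))
```

Running checks of this kind before a box reaches the scanner is one way to catch misplaced forms early rather than during key-correction.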
Requirements for Effective Scanning - -
 Ensuring the proper paper throughput
through the scanner
 Ensuring smooth running of the scanning machines
 Maintenance
 Cleaning (daily)
 An arrangement to minimize the effect of Power
Interruption is required
Major Activities Accomplished in the
Course of the Census Taking
 Data from the Pilot Census was successfully scanned
(OMR), key-corrected, exported to text format,
tabulated and tested.
 One scanner (PS900 PhotoScribe) was used to
capture the pilot data
 Technical experts from the DRS company assisted in
capturing, validating and exporting the pilot data
 Training in scanning technology was given :
 16 professionals were trained
Major Activities Accomplished - -
 Hardware and Software training conducted
 The training in general took about 7 working days
 SOSKITW for Windows: a DRS software package
for scanning was introduced
 Components of the SOSKITW Software :
 SOSGen: used to generate scanning
decodes for completed OMR forms (how
marks on forms are interpreted and stored)
 SOSInp: used to scan, validate and export
scanned data.
Major Activities Accomplished - -
 Equipment purchased and installed
 10 additional PS900 iM2 DRS Scanners
 16 high capacity PCs for key-correction
 Census data processing work plan prepared
 Recruitment of temporary staff
 Staff training (scanning technology, CSPro)
 Retrieval and organization of completed forms
 Scanning and validation
 Computer editing and tabulation
(For each activity: duration and responsible body are indicated)
Major Activities Accomplished - -
 Census data processing teams organized
 Batch header database group
 Scanning and validation team
 Technical desk heads
 Shift supervisors
 Two senior programmers responsible for
the overall scanning process
 Other sub-professional staff assigned
 4 batch header scanning technicians
 16 data validation workers
Major Activities Accomplished - -
 The scanning room organized
 An air conditioner for the scanning room installed
 A high capacity automatic generator installed to
ensure uninterrupted power supply
 Batch Header Database organized
 EA Control Forms completed in 2 parts during dispatch
 Same EA ID on both parts of the control form
 Same Enumerator Number on each part
 No. of Households in the EA filled-in
 The scannable part detached and scanned in office
Completed Census Forms
 Completed forms retrieved from the field
(about 90,000 EAs)
 Reception and organization of filled-in forms
completed
 About 33 teams for registering and
organizing forms were formed
 3 persons assigned per team
 Retrieval of each EA checked and registered
 Presence of all form types checked (each EA)
 Control forms are also used to check the
completeness of EAs
Completed Census Forms - -
 Types of the 2007 Census Forms
 Short questionnaires
 Long questionnaires
 Household Listing Forms
 Summary Forms
 Community Level Forms
 EA Control Forms
(Batch Header Forms)
 EA ID’s and no. of households filled-in
 Unique Enumerator No. assigned

Scanned to create EA Database
[Form images: Long Questionnaire, Batch Control Form, Summary Form]
Actual Scanning Process - Census Forms
 Organized forms taken from store to the waiting room
 Batch Header information printed and associated with
its respective EA box
 The existence of each EA verified
 Checked EAs sent to the scanning room
 Scanned forms are finally sent back to the stores
 Captured data are validated and key-corrected
 Key-correction involved checking and correcting:
 Missing marks
 Multi-marks
 Partial marks
Actual Scanning Process - -
 Scanned and validated data is exported to TEXT format
 Format suitable for computer editing and tabulation
 Backup of the scanned / captured data is taken :
 on the Database Server
 externally, on high capacity tape cartridges
(HP Ultrium Data Cartridge, 400 GB)
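The export of validated data to a text format suitable for editing and tabulation can be pictured as writing each record as one fixed-width line. The sketch below is only an illustration: the field names and widths in `LAYOUT` are hypothetical and do not reproduce the actual census record layout.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (field name, width in columns).
LAYOUT: List[Tuple[str, int]] = [
    ("ea_id", 8),
    ("household", 3),
    ("age", 2),
    ("sex", 1),
]


def to_fixed_width(record: Dict[str, int],
                   layout: List[Tuple[str, int]] = LAYOUT) -> str:
    """Render one record as a zero-padded fixed-width text line."""
    parts = []
    for name, width in layout:
        text = str(record[name]).zfill(width)
        if len(text) > width:
            raise ValueError(f"{name}={record[name]!r} does not fit in {width} columns")
        parts.append(text)
    return "".join(parts)


line = to_fixed_width({"ea_id": 1020304, "household": 12, "age": 7, "sex": 2})
print(line)  # 01020304012072
```

Because every field sits at a known column position, a downstream editing or tabulation program can read the file with a simple data dictionary.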
Actual Scanning Process - -
 All Census forms have been scanned:
 The scanning of the 10 sedentary Regions
was carried out from mid Aug. 2007 to
mid Dec. 2007
 The scanning for Affar and Somali Regions
took about one month including checking
(mid Jan - mid Feb 2008)
 44 scanning operators were assigned
 11 scanners used
 2 shifts per day, 7 days per week
 Validation and key-correction of the scanned
data is done
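The three conditions handled during key-correction (missing marks, multi-marks and partial marks) can be sketched as a simple classification of one question's bubbles. The darkness readings and the two thresholds are assumptions made for this illustration; they are not values from the DRS scanners or software.

```python
from typing import List


def classify_response(darkness: List[float],
                      full: float = 0.6,
                      faint: float = 0.25) -> str:
    """Classify one question from per-bubble darkness readings in [0, 1].

    The thresholds `full` and `faint` are illustrative assumptions.
    """
    full_marks = sum(1 for d in darkness if d >= full)
    partial_marks = sum(1 for d in darkness if faint <= d < full)
    if full_marks > 1:
        return "multi-mark"
    if full_marks == 1 and partial_marks == 0:
        return "ok"
    if full_marks == 0 and partial_marks == 0:
        return "missing mark"
    return "partial mark"  # a faint mark is present: needs an operator's decision


print(classify_response([0.9, 0.0, 0.1]))  # ok
print(classify_response([0.4, 0.0, 0.0]))  # partial mark
```

Only responses that are not classified as "ok" would need to be shown to a key-correction operator together with the scanned image.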
Census Forms Scanning Process
[Photos: Scanning, Key-Correction]
Data Cleaning / Computer Editing
 Scanned, key-corrected and exported data
 Batch Edit Program based on Edit Specs provided by
subject matter specialists was developed and run on the data.
 The software to be used in editing the data is the Census
and Survey Processing System (CSPro)
 The Batch Edit Application (.bch) is the component of
CSPro used to clean the data through editing and
imputation processes
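To illustrate what editing and imputation mean in practice, here is a minimal sketch written in Python rather than in the CSPro language. It assumes a single hypothetical edit rule (age must lie between 0 and 120) and a simple hot-deck style imputation that reuses the last valid value; the actual edit specifications prepared by the subject matter specialists are not shown in the source.

```python
from typing import Dict, List, Optional


def edit_ages(records: List[Dict[str, Optional[int]]], max_age: int = 120) -> int:
    """Impute missing or out-of-range ages in place from the last valid age seen.

    Returns the number of imputed values. Records before the first valid
    age are left unchanged because no donor value exists yet.
    """
    donor: Optional[int] = None
    imputed = 0
    for rec in records:
        age = rec.get("age")
        if age is not None and 0 <= age <= max_age:
            donor = age                # valid value becomes the new donor
        elif donor is not None:
            rec["age"] = donor         # replace the failing value
            rec["age_imputed"] = 1     # flag so imputations can be counted later
            imputed += 1
    return imputed


people = [{"age": 30}, {"age": None}, {"age": 250}, {"age": 5}]
print(edit_ages(people))  # 2
```

Flagging each imputed value makes it possible to report imputation rates alongside the tables.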
Report Generation / Tabulation
 Raising factors attached to the edited long
questionnaire data
 Tabulation programs (in CSPro) are prepared and
tested
 Tables in accordance with the Tabulation Plan will be
produced
 Final data will be organized in various formats
(ASCII, SPSS)
 Final data will be sent to the Central Databank for
archiving and dissemination purposes.
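The role of the raising factors attached to the long questionnaire data can be shown with a short sketch: each sampled record counts for as many units as its raising factor when tables are produced. The categories and factor values below are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_counts(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum raising factors per category to estimate population totals."""
    totals: Dict[str, float] = defaultdict(float)
    for category, raising_factor in records:
        totals[category] += raising_factor
    return dict(totals)


# Illustrative values only: (category, raising factor) for three sampled records.
sample = [("urban", 5.0), ("rural", 5.0), ("rural", 4.0)]
print(weighted_counts(sample))  # {'urban': 5.0, 'rural': 9.0}
```

An unweighted count of the same records would give 1 urban and 2 rural, which is why the factors must be attached before tabulation.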
Problems Encountered
I. Scanning :
 A batch might slip through unscanned during data capture
 A batch might also be scanned in parts only
 Misplacement of scanned forms in wrong boxes
 Limited storage space on the scanning machines
 Scanners become full, which makes scanning difficult
 Scanned images should constantly be moved to the
storage server
 The location of scanned images on the storage server
may sometimes not be found
Problems Encountered - -
II. Key Correction:
 Problems in retrieving scanned images for key
correction were encountered
 Key correction took a long time as it is done
manually
 The key correction process, as stated earlier, was
based on fixing:
 Missing marks
 Multi-marks
 Partial marks
Problems Encountered - -
III. Processing the data:
 Large volume of data – takes long time (8 hrs)
 Frequent power failure highly affects the processing
sessions
 The tabulation component of CSPro software
sometimes fails unpredictably
(It is a newly developed tabulation system)
In summary :
 Registration and organization of all completed Census
Forms done
 The scanning and key correction of the Census
questionnaires completed
 The scanning of the Household Listing forms is done
 Draft Census preliminary results have been produced
Additional Comment:
 Quick manual review (editing and coding) of the
filled-in forms might be needed prior to the scanning
process