Using OCR for Census Data Capture in China

National Bureau of Statistics of China
Background
 Five population censuses have been conducted, in 1953, 1964, 1982, 1990 and 2000
 1953 and 1964 censuses: manual tabulation
 Since the 1982 census, computers have been used for data processing
 1982 and 1990 censuses: manual data entry
 2000 census: OCR used for data capture
 2006, the second agricultural census: OCR also used for data capture
The 2000 population census and the 2006 agricultural census are the two cases
of OCR use for large-volume data capture
Two cases of OCR for large-volume Census Data
Capture
 Data capture for the 2000 Population Census
 Census reference time: Nov. 1, 2000
 Data capture cycle: Jan. to June 2001 (6 months)
 Scale:
Types of census form: 4
 Short Form: 49 census items, 90% of households (HH), 360 million A4 double sheets in total
 Long Form: 95 census items, 10% of households, 40 million A3 double sheets in total
 Other Forms: death population, temporary residents; 10 million A4 double sheets in total
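The sheet counts above can be added up to get a rough sense of the workload. The short Python sketch below only restates the slide's figures; the variable names are illustrative, and the monthly average assumes the work was spread evenly over the six months, which the slides do not state.

```python
# Sheet counts in millions, as given on the slide for the 2000 census.
sheets_millions = {
    "short_form_A4": 360,
    "long_form_A3": 40,
    "other_forms_A4": 10,
}
CAPTURE_MONTHS = 6  # Jan. to June 2001

total_millions = sum(sheets_millions.values())
# Assumption: an even spread over the capture cycle.
per_month_millions = total_millions / CAPTURE_MONTHS

print(f"total double sheets: {total_millions} million")
print(f"average per month: {per_month_millions:.1f} million")
for name, count in sheets_millions.items():
    print(f"{name}: {count / total_millions:.1%} of all sheets")
```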
Two cases of OCR for large-volume Census Data
Capture
 The Second National Agricultural Census
 Census reference time: end of 2006
 Data capture cycle: April to mid-July 2007 (100 days)
 Scale:
Types of census form: 8
Total census items: 541
Total agricultural families: 250 million
Total census forms: about 500 million pieces of paper
Original census data: about 300 GB
Image data: 40 TB
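A back-of-the-envelope calculation gives a feel for these figures. The sketch below assumes decimal units (1 TB = 10^12 bytes), one stored image per piece of paper, and an even spread over the 100 days; none of these assumptions comes from the slides, so the results are rough orders of magnitude only.

```python
# Figures from the slide for the 2006 agricultural census.
FORMS = 500_000_000          # "about 500 million pieces of paper"
CAPTURE_DAYS = 100
IMAGE_BYTES = 40 * 10**12    # 40 TB, assuming decimal units
DATA_BYTES = 300 * 10**9     # about 300 GB, assuming decimal units

# Assumption: an even spread over the capture cycle.
sheets_per_day = FORMS / CAPTURE_DAYS
# Assumption: one stored image per piece of paper.
avg_image_kb = IMAGE_BYTES / FORMS / 1000
avg_data_bytes = DATA_BYTES / FORMS

print(f"sheets per day, nationwide: {sheets_per_day:,.0f}")
print(f"average image size: {avg_image_kb:.0f} KB")
print(f"average captured data per sheet: {avg_data_bytes:.0f} bytes")
```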
Organizational Structure for data
process
[Diagram: administrative levels, number of units at each level, and census tasks]
NBS: 1
Province: 31
Prefecture: 340
County: 2847
Town: 40000
Village: 0.9 million
EA: 5 million
Tasks shown in the diagram: coding; checked & packed; data capture and editing; data process
Data capture was decentralized at prefecture offices
Function framework of OCR data capture
(2006 agriculture census)
[Diagram: System Functions]
Modules: system management, task management, process management, scanning, OCR, editing and checkup, data management, image management
Functions shown under these modules:
 Management: user management, system initialization, log management, space management, address base, archiving management, client management, progress monitoring, QA
 Scanning: scanner self-inspection, batch form scan, add scan, repeat scan, alternative scan, check scan, forms check
 OCR: numeric data, Chinese character, English character, special character
 Editing and checkup: numeric data edit, Chinese character edit and checkup, English character edit, special character edit
 Data and image handling: backup, restore, delete, browse, enquiry, input, output, import, export, file receiving, image merger, image reporting, information display, statistics summary, generation of census form management IDs, generation of image management IDs
The Process of OCR data capture
 The scanning module generates image files and transmits them to the
image management module; it also transmits status information to the
task management module.
 The task management module distributes tasks according to which OCR
clients are idle.
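The distribution step can be pictured with a small scheduler. The Python sketch below is only an illustration of assigning work to idle OCR clients; the function name `distribute`, the client and batch names, and the first-come-first-served policy are assumptions, not details of the actual system.

```python
from collections import deque


def distribute(batches, clients):
    """Assign pending batches to idle OCR clients, first come first served.

    `clients` maps a client name to True (busy) or False (idle) and is
    updated in place. Returns the assignments and the batches left waiting.
    """
    idle = deque(name for name, busy in clients.items() if not busy)
    pending = deque(batches)
    assignments = {}
    while pending and idle:
        client = idle.popleft()
        batch = pending.popleft()
        assignments[batch] = client
        clients[client] = True  # the client is now busy
    return assignments, list(pending)


# Hypothetical example: two of three clients are idle.
clients = {"ocr-1": False, "ocr-2": True, "ocr-3": False}
assignments, waiting = distribute(["batch-A", "batch-B", "batch-C"], clients)
print(assignments)
print(waiting)
```

Batches that cannot be placed stay in the waiting list until a client reports that it is idle again.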
The Process of OCR data capture
 The OCR module recognizes numeric data and Chinese characters,
transmits the recognized results to the data management module, and
transmits status information to the task management module.
The Process of OCR data capture
 The task management module dispatches the data to the edit module for
editing. If the original image is needed, the image management module
fetches the corresponding image for comparison. After editing, the
cleansed data are returned to the data management module. When all
data capture work is finished, the data are reported upward.
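The editing step described above, where the image is fetched only when it is needed, can be sketched as follows. All names here (`Record`, `ImageStore`, `edit_record`, the `operator` callback) are hypothetical and stand in for the real modules, whose interfaces the slides do not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    form_id: str
    values: dict
    suspect_fields: list = field(default_factory=list)


class ImageStore:
    """Stand-in for the image management module."""

    def __init__(self, images):
        self._images = images

    def fetch(self, form_id):
        return self._images[form_id]


def edit_record(record, image_store, operator):
    """Clean one recognized record, fetching the image only if needed."""
    if not record.suspect_fields:
        return record  # nothing to compare, no image fetched
    image = image_store.fetch(record.form_id)
    for name in record.suspect_fields:
        # The operator compares the recognized value with the image
        # and returns the corrected value.
        record.values[name] = operator(name, record.values[name], image)
    record.suspect_fields = []
    return record


# Hypothetical example: one suspect field corrected by an operator.
store = ImageStore({"F001": "scan-of-F001"})
rec = Record("F001", {"age": "4?"}, ["age"])
cleaned = edit_record(rec, store, lambda name, value, image: "47")
print(cleaned.values)
```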
Quality Control
To ensure the quality of captured data, quality
control is executed in three stages: scanning,
recognition and data editing.
 During scanning, the system recognizes the batch cover data and reads
the scanner count, then checks whether the total page count and total
household count for each batch are consistent with the scanning
results. It also compares each actual address code with the address
code repository to ensure that address codes are valid, unique and
correct.
 During recognition, the system collects real-time statistics on the
rejection ratio and the suspect ratio. If either ratio is too high, the
task administrator investigates the cause.
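The scanning-stage and recognition-stage checks above can be illustrated with a few small functions. This is a minimal sketch: the function names, the sample address codes and the threshold values are assumptions chosen for the example, since the slides give no actual thresholds or data layouts.

```python
def check_batch(cover, scanned_pages, scanned_households):
    """Compare the batch cover totals with what was actually scanned."""
    problems = []
    if scanned_pages != cover["pages"]:
        problems.append("page count differs from batch cover")
    if scanned_households != cover["households"]:
        problems.append("household count differs from batch cover")
    return problems


def check_address_codes(codes, repository):
    """Return codes missing from the repository and codes seen twice."""
    unknown = sorted({c for c in codes if c not in repository})
    seen, repeated = set(), set()
    for code in codes:
        if code in seen:
            repeated.add(code)
        seen.add(code)
    return unknown, sorted(repeated)


def ratios_too_high(total_chars, rejected, suspect,
                    max_reject=0.02, max_suspect=0.05):
    """Flag a task for review; both thresholds are made-up examples."""
    if total_chars == 0:
        return False
    return (rejected / total_chars > max_reject
            or suspect / total_chars > max_suspect)


cover = {"pages": 200, "households": 100}
batch_problems = check_batch(cover, scanned_pages=198, scanned_households=100)
unknown, repeated = check_address_codes(
    ["110101001", "110101002", "110101002", "999999999"],
    {"110101001", "110101002"},
)
flagged = ratios_too_high(10_000, rejected=350, suspect=200)
print(batch_problems, unknown, repeated, flagged)
```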
Quality Control
 During editing, the system checks that the
recognized record count is consistent with the
record count in the control document, and
checks basic logic relationships and value
ranges. Items with errors in logic relationships
or value ranges are flagged; the recognition
results and the corresponding items from the
original scanned images are displayed side by
side in parallel windows, and convenient means
of modification are provided for items that
need correction.
 After the whole set of data has been
captured, quality is assured through
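
The editing-stage checks described earlier in this section (record count, value range and logic relationship) can be sketched in a few lines of Python. The census items, ranges and the single logic rule below are invented for illustration; the real edit rules are not given in the slides.

```python
# Hypothetical value ranges for two census items.
RANGES = {"age": (0, 120), "household_size": (1, 30)}


def count_matches(records, control_count):
    """Recognized record count must equal the count in the control document."""
    return len(records) == control_count


def find_errors(record):
    """Return descriptions of range and logic errors in one record."""
    errors = []
    for item, (low, high) in RANGES.items():
        if not low <= record[item] <= high:
            errors.append(f"{item} out of range")
    # Hypothetical logic rule relating two items.
    if record["marital_status"] == "married" and record["age"] < 15:
        errors.append("married but age below 15")
    return errors


records = [
    {"age": 34, "household_size": 4, "marital_status": "married"},
    {"age": 7, "household_size": 0, "marital_status": "married"},
]
print(count_matches(records, control_count=2))
for rec in records:
    print(find_errors(rec))
```

Records with a non-empty error list are the ones an editor would review against the scanned image.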
Main Problems and Solutions
In large-scale census data capture projects,
we regard three kinds of problems as the most
prominent: 1. How to enhance the OCR's
recognition capability. 2. Availability and
reliability of the system. 3. Project
management. What we have done:
1. Improve the capability of recognizing numeric
characters
Two recognition algorithms, and two recognition
engines based on them, were developed. After a
series of on-site tests, the one that better suits
the census project was chosen.
2. Improve the recognition capability for Chinese characters
Main Problems and Solutions
3. Improve orientation capability
To address print deviation and filling deviation,
a smart locating algorithm was developed that
minimizes their impact.
4. Enhance efficiency of recognition
Improve the scanner's underlying software to
achieve the best match between the hardware
drivers and the OCR software, and so improve
recognition efficiency.
5. Improve the quality of form filling
Prescribe standards for form filling so that both
the OCR error rate and the rejection rate are
reduced.
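The slides do not describe how the smart locating algorithm works. As a generic illustration of the idea of correcting for print deviation, the Python sketch below estimates a simple translation from reference marks and shifts a field box accordingly. It is a swapped-in, translation-only technique with hypothetical names and coordinates, not the algorithm developed for the census, and it does not address filling deviation.

```python
def estimate_offset(expected_marks, detected_marks):
    """Mean (dx, dy) shift between expected and detected reference marks."""
    n = len(expected_marks)
    dx = sum(d[0] - e[0] for e, d in zip(expected_marks, detected_marks)) / n
    dy = sum(d[1] - e[1] for e, d in zip(expected_marks, detected_marks)) / n
    return dx, dy


def shift_box(box, offset):
    """Move a field box (x, y, width, height) by the estimated offset."""
    x, y, w, h = box
    dx, dy = offset
    return (x + dx, y + dy, w, h)


# Hypothetical coordinates: the sheet was printed 3 units right, 2 units up.
expected = [(10, 10), (200, 10), (10, 280)]
detected = [(13, 8), (203, 8), (13, 278)]
offset = estimate_offset(expected, detected)
field_box = shift_box((50, 40, 20, 10), offset)
print(offset, field_box)
```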
Main Problems and Solutions
6. Establish regulations, working guidance
and processes so that every data entry
site works to uniform regulations,
processes and standards.
7. Strengthen training. We organized
centralized training and on-site training
for users. Centralized training combined
lectures with hands-on operation, which
deepened users' familiarity with the
system.
8. Organize multi-target pilot. We organized
multiple pilots in many locations aiming
Lessons Learned
 Use advanced technology to raise efficiency
 Combine technical and administrative methods to resolve quality
problems and security issues
 Choose partners with strong capability in system development and
service
 Prepare the project early
 Manage the project together with partners
 Training, pilot projects and management are the keys to success
 Control the printing quality of the census forms and the quality of
form filling
 Control project changes
Prospect of the 2010 Population
Census
Census time: Nov. 1, 2010
 Short form, long form and death population form
 Enumerating foreigners living in China is under consideration
Data capture in 2011
 OCR data capture will be the main data entry method
 Modify the existing agricultural census system and make some
innovations
 Add more OCR equipment
THANKS