DATA CAPTURE IN CENSUS OF INDIA Registrar General & Census Commissioner, India

Visit Our Website at
www.censusindia.gov.in
FEATURES OF INDIAN CENSUS
• India is a large country with a population of more than a billion, so its Census is one of the world's largest administrative and statistical exercises
• Diversity in languages – schedules filled in 16 languages
• 2 million enumerators deployed in the 2001 Census – likely to increase further in the 2011 Census
FEATURES OF INDIAN CENSUS (Contd..)
• The Census is conducted using the ‘canvasser’ method in two phases:
– House-listing
– Population Enumeration
• The Census Organization has experimented with new IT innovations since the beginning
• Technology is required particularly for data capture/processing – mainly due to the large volume and for speedier tabulation and release of Census results
MODE FOR DATA CAPTURE & PROCESSING SINCE 1961
Census                1961        1971        1981        1991        2001
Population (million)  439         548         683         846         1,028
Collection %          100         100         100         100         100
Capture %             5           15          25          45          100
Mode                  Hand Punch  Key Punch   Data Entry  Data Entry  Scanning/ICR
Time taken            8-9 Years   8-9 Years   8-9 Years   7-8 Years   3-5 Years
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Important Considerations
• Conventional data entry not suitable for such a large volume of data (228 million schedules for a population of 1,028 million).
• Availability of advanced IT tools and techniques.
• Capture and process all the collected information.
• Complexities in data entry due to the multiplicity of languages/responses and the size (A3) of the Census Schedule.
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Important Considerations (Contd..)
• Retrieval of original documents for correction is labor-intensive.
• Reduce the time span from 5-8 years to 3-5 years.
• Compact, reliable and efficient archival system.
• Better workflow management.
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Selection and Consequent Action
• Evaluation of various available technologies (OMR/OCR/ICR).
– Trial run with NCS and DRS OMR.
– Trial run with various ICR vendors.
• Opted for ICR technology (TIS eFlow).
• IT infrastructure in all the 15 Data Centers upgraded to meet the new requirement.
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Model Conceived for implementation
• Services of a System Integrator (SI) hired to guide and assist in the implementation of ICR technology.
• A unique model for outsourcing
– SI to work on our premises for better communication and control
– Maintain data security, safety and confidentiality
– Capacity building (training and guiding IT staff)
– Production-linked payment to SI
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Work Flow of ORGI (TIS eFlow characteristics)
• Designs the data capture workflow
• Presents a graphical view of the system
• Monitors the processing and workflow in real time
• Enables applications to be customized and custom features to be added
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Work flow Modules
• Scan Portal, File Portal, Controller
• FormID, Manual FormID, RC
• Processing [OCR/ICR]
• Tile, Completion, CAC & Exception
• Export
DATA CAPTURE & PROCESSING IN 2001 CENSUS
ORGI Workflow Stages
Stages, in processing order (all connected to the Server):
1. Prepare Batch
2. Scanning
3. Recognition
4. Tiling
5. Completion
6. Exception
7. Export/Archival
DATA CAPTURE & PROCESSING IN 2001 CENSUS
LAN SETUP - ORGI DATA CENTERs
• Scanning station – forms are fed through the scanner(s) batch by batch
• Server – form images are stored on the network disk
• Recognition stations – character images are automatically recognised, field by field
• Tiling & Completion stations (Tile/Correction) – unrecognised characters are corrected by operators
• Exception stations – supervisors handle exceptional cases referred by operators
• Controller station – supervisor monitors the workflow and balances the load at different stages of operation
• Export station – supervisor exports completed batches as ASCII files for further processing
DATA CAPTURE & PROCESSING IN 2001 CENSUS
eFlow customization
• Customization of scanning software for batching the images
• Optimization of batch size for network movement of images and data
• Customization of workflow management to reduce the workload on the Manual Identification station
DATA CAPTURE & PROCESSING IN 2001 CENSUS
eFlow customization (Contd..)
• Development of new Management Information tools for operators, daily production status, etc.
• Creation of JUSTICR.mdb to recognize Indian enumerators' writing patterns
• Creation and implementation of various static and dynamic dictionaries for CAC
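The dictionary idea behind CAC can be sketched in a few lines. This is only an illustration, not the eFlow implementation: the class `CodingAssistant`, the codes `L01`-`L03` and `L99`, and the reading of "static" as a fixed list and "dynamic" as a list that grows when an operator confirms a new spelling are all assumptions made for the example.

```python
import difflib
from typing import Dict, Optional

# Hypothetical static dictionary: response text -> code (NOT real Census codes).
STATIC_DICTIONARY: Dict[str, str] = {"HINDI": "L01", "BENGALI": "L02", "TAMIL": "L03"}


class CodingAssistant:
    """Suggests a code for a textual response keyed in from the form image."""

    def __init__(self, static: Dict[str, str]) -> None:
        self.static = dict(static)         # fixed entries
        self.dynamic: Dict[str, str] = {}  # entries confirmed by operators

    def suggest(self, text: str, cutoff: float = 0.7) -> Optional[str]:
        key = text.strip().upper()
        known = {**self.static, **self.dynamic}
        if key in known:                   # exact hit
            return known[key]
        # Otherwise offer the closest known spelling, if it is close enough.
        close = difflib.get_close_matches(key, list(known), n=1, cutoff=cutoff)
        return known[close[0]] if close else None

    def confirm(self, text: str, code: str) -> None:
        """Operator confirms a code; remember the spelling for next time."""
        self.dynamic[text.strip().upper()] = code


assistant = CodingAssistant(STATIC_DICTIONARY)
print(assistant.suggest("hindi"))      # exact match
print(assistant.suggest("Bengalee"))   # close spelling of BENGALI
print(assistant.suggest("Kosli"))      # unknown: left to the operator
assistant.confirm("Kosli", "L99")
print(assistant.suggest("kosli"))      # now answered from the dynamic dictionary
```

When no suggestion is returned, the response would simply stay with the operator for manual coding.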
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Results Achieved
• For the first time, 100% of the data was captured, processed and released within five years of the Census
• Auto recognition rate of 90% and false positives below 2%
• Considerable financial saving
• Assimilation of IT skills internally in the organisation
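The two quality figures above can be made concrete with a small calculation. The definitions used here are assumptions, since the slides do not spell them out: recognition rate is taken as the share of characters accepted without operator keying, and the false positive rate as the share of those auto-accepted characters that were in fact wrong. The sample counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaptureCounts:
    total_chars: int          # all character images sent to the ICR engines
    auto_accepted: int        # characters accepted without operator keying
    auto_accepted_wrong: int  # auto-accepted characters later found to be wrong


def recognition_rate(c: CaptureCounts) -> float:
    return c.auto_accepted / c.total_chars


def false_positive_rate(c: CaptureCounts) -> float:
    return c.auto_accepted_wrong / c.auto_accepted


def meets_targets(c: CaptureCounts,
                  min_recognition: float = 0.90,
                  max_false_positive: float = 0.02) -> bool:
    """Check a batch against the 90% / below-2% figures quoted on the slide."""
    return (recognition_rate(c) >= min_recognition
            and false_positive_rate(c) < max_false_positive)


sample = CaptureCounts(total_chars=10_000, auto_accepted=9_100, auto_accepted_wrong=91)
print(f"recognition {recognition_rate(sample):.1%}, "
      f"false positives {false_positive_rate(sample):.1%}, "
      f"targets met: {meets_targets(sample)}")
```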
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Results Achieved (Contd..)
• Manual coding was replaced by Computer Assisted Coding (CAC)
– Scheduled Caste/Scheduled Tribe
– Languages spoken, education level
– Migration particulars, NIC and NCO
• Indigenous data capture for other projects
– Economic Census
– Sample Registration System
– Verbal Autopsy
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Difficulties Experienced
• Unable to use color drop-out at the scanning stage
• Difficult to handle bad images during the scanning stages
• Bad/back images due to variation in paper/print quality
• Overwriting, use of whitener, and grid lines recognized as the digit 1
• Limitations in recognizing Indian languages affected the throughput
DATA CAPTURE & PROCESSING IN 2001 CENSUS
Difficulties Experienced (Contd..)
• Operational constraints in Manual Identification
• No powerful tools for online load balancing among the various stages of eFlow
• Lack of concurrent quality checks at each stage of eFlow
• Lack of auto-coding features for textual responses
• Non-recognition of even a single image forces the whole batch to be redone
LESSONS LEARNT FOR FUTURE
• Outsourcing in a controlled environment is beneficial and cost-effective
• Good quality of paper
• ICR-friendly form design
• Use of bar codes for better workflow and inventory management
• Good quality printing
LESSONS LEARNT FOR FUTURE
(Contd..)
• Special training for enumerators in filling in the forms
• For CAC, use knowledge-based dictionaries to increase throughput
• Use of concurrent quality check procedures on the lines of the USA and UK
DATA CAPTURE & PROCESSING
Technology for 2011 Census
• Continuation of ICR Technology
– International and national experience shows that, as of date, there is no better substitute for scanning and ICR technology
– Expertise and competence gained in using ICR technology are available in the organization
DATA CAPTURE & PROCESSING
Technology for 2011 Census (contd..)
• Use of more efficient scanners with facilities for image enhancement, noise removal, color drop-out, better throughput, and on-the-spot detection and correction of bad images (through built-in software)
• Use of an improved version of ICR software with better recognition and built-in enhanced workflow management capability
• Use of new features for Auto/Computer Assisted Coding in the ICR software
Thank you.
Steps involved in e-Flow Process
• Intelligent Character Recognition (ICR) technology is used to extract handwritten or machine-printed (typeset) characters from scanned images and generate a computer-processable data file. In brief, the following steps are involved in using ICR technology.
• Scanning: Paper-based forms are scanned to create bitmap image files.
• File Portal: The image file registration module in eFlow; it registers the scanned images as input to the next activity.
• Form Identification: Automatically identifies the images of the various schedules based on the Empty Form Image (EFI) templates created during the design stage.
Steps involved in e-Flow Process
• Manual Identification: Forms left unidentified because of bad images are matched manually by the operator on the computer with the help of the EFIs.
• Processing: This module is the heart and brain of the ICR technology. It automatically recognizes the data (numerals/alphabetic characters) from the images with the help of various engines (CGK, AEG, KADMOS, TISICR etc.).
• Tile: This module displays the images of similar digits in one place so that characters wrongly recognized by the system can be identified for correction, which enhances the accuracy and quality of the data.
STEPS INVOLVED IN eFLOW PROCESS
• Completion: Characters that are unrecognized, or that were marked as wrongly recognized in Tiling, are presented for correction alongside their images.
• Exception: Any character image that the operator at the Completion station (module) cannot make out is corrected at the Exception station by an officer competent to make the decision.
• Export: The system exports the data generated in the above steps to the server for further processing such as editing, aggregation and tabulation.
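The hand-off between Processing, Tile, Completion and Exception described in these steps can be pictured as a simple routing rule per character. The sketch below is a simplified model for illustration only; the `CharImage` fields and the `resolve` and `export_field` functions are hypothetical names and do not come from eFlow.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CharImage:
    recognized: Optional[str]             # engine result, None if unrecognized
    flagged_in_tiling: bool = False       # operator marked the result as wrong
    operator_value: Optional[str] = None  # keyed at Completion, None if unreadable
    officer_value: Optional[str] = None   # decided at the Exception station


def resolve(ch: CharImage) -> Tuple[str, str]:
    """Return (final character, stage that settled it)."""
    if ch.recognized is not None and not ch.flagged_in_tiling:
        return ch.recognized, "Processing"
    if ch.operator_value is not None:
        return ch.operator_value, "Completion"
    if ch.officer_value is not None:
        return ch.officer_value, "Exception"
    raise ValueError("character is still unresolved")


def export_field(chars: List[CharImage]) -> str:
    """Join the settled characters of one field into the exported text."""
    return "".join(resolve(ch)[0] for ch in chars)


field_chars = [
    CharImage(recognized="4"),                                               # accepted as read
    CharImage(recognized="1", flagged_in_tiling=True, operator_value="7"),   # fixed at Completion
    CharImage(recognized=None, officer_value="2"),                           # settled at Exception
]
print([resolve(ch)[1] for ch in field_chars])
print(export_field(field_chars))
```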
eFLOW CONTROLLER
e-FLOW WORKFLOW FOR ORGI
EXAMPLE – BACK IMAGE
EXAMPLE – IMPROPER GRID LINES
EXAMPLE – USE OF WHITENER
Casual writing pattern
CAC Of MOTHER TONGUE
CAC OF HIGHEST EDUCATION LEVEL ATTAINED
CAC OF NATIONAL INDUSTRIAL CLASSIFICATION (NIC)
HOUSEHOLD SCHEDULE IMAGE OF SIDE A
HOUSEHOLD SCHEDULE IMAGE OF SIDE B
FORM-ID STATION
MANUAL-ID STATION
IMAGE AFTER FORMOUT IN PROCESSING
SEGMENTATION OF A FIELD IN PROCESSING
VOTING IN PROCESSING
ICR 1 = 3
ICR 2 = 3
ICR 3 = 8
ICR 4 = 3
Majority = 3
Unanimous = ?
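The voting slide can be reproduced with a short sketch. It assumes that each engine returns one character (or nothing) per image, that "majority" means more than half of the engines agree, and that "?" stands for "no agreed result"; the function name `vote` is illustrative and not part of eFlow.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(readings: Sequence[Optional[str]], rule: str = "majority") -> Optional[str]:
    """Combine the readings of several engines for one character image.

    A reading of None means the engine gave no answer. The function returns
    the agreed character, or None when the rule is not satisfied.
    """
    answers = [r for r in readings if r is not None]
    if not answers:
        return None
    winner, count = Counter(answers).most_common(1)[0]
    if rule == "majority":
        return winner if count > len(readings) / 2 else None
    if rule == "unanimous":
        return winner if count == len(readings) else None
    raise ValueError(f"unknown rule: {rule}")


slide_readings = ["3", "3", "8", "3"]      # ICR 1 to ICR 4 on the slide
print(vote(slide_readings, "majority"))    # 3
print(vote(slide_readings, "unanimous"))   # None, shown as "?" on the slide
```

A character without an agreed result would then go to the operators rather than being accepted automatically.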
FINAL RESULT IN PROCESSING
TILING STATION
COMPLETION STATION
[Field mode display]
EXCEPTION STATION
[Screen areas: Form, Field, Date, Original Form Image Viewer, Exception Area]
EXPORT STATION
HOUSEHOLD SCHEDULE- SIDE A
[Labelled fields: Religion; Name of SC/ST; Mother Tongue & Other languages; Education]
HOUSEHOLD SCHEDULE- SIDE B
[Labelled fields: NCO; NIC; Place of Birth & Last residence]
DATA CAPTURE & PROCESSING
Selection of technology (OMR/OCR/ICR) in 2001
Recognition of handwritten descriptive entries in different languages was beyond the capabilities of the known ICR software. A conscious decision was therefore taken to recognize only numeric characters and to handle the rest through image-enabled Computer Assisted Coding (CAC). The following key features were introduced in the data capture solution.
Parameters for selecting the ICR Software
• Highest recognition rate and lowest percentage of false positives, with customization and assured support and training.
• Facility for an organized workflow in a LAN environment with centralized controls and a Computer Assisted Coding facility.
• Built-in quality enhancement tools to trap wrongly recognized characters so as to facilitate corrective action.
• Use of multiple engines with a voting algorithm. Ability to incorporate validation rules to trap inconsistent entries/wrong recognition. Learning capabilities of the engines.
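A validation rule of the kind mentioned above is a consistency check on the recognized values of one record. The sketch below shows the idea only; the field names (`age`, `sex`, `worker`), the code values and the three rules are hypothetical and are not taken from the Census schedule or from eFlow.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, int]
Rule = Tuple[str, Callable[[Record], bool]]  # (message, check that must hold)

# Hypothetical rules on hypothetical fields, for illustration only.
RULES: List[Rule] = [
    ("age out of range", lambda r: 0 <= r["age"] <= 120),
    ("sex code not 1 or 2", lambda r: r["sex"] in (1, 2)),
    ("child recorded as worker", lambda r: not (r["age"] < 5 and r["worker"] == 1)),
]


def failed_rules(record: Record) -> List[str]:
    """Return the messages of all rules that the record violates."""
    return [message for message, holds in RULES if not holds(record)]


# An age of 8 misread as 3 is caught because the person is coded as a worker.
print(failed_rules({"age": 3, "sex": 1, "worker": 1}))
print(failed_rules({"age": 30, "sex": 2, "worker": 1}))
```

A record with any failed rule would be sent back for checking against the form image instead of being exported as it stands.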
DATA CAPTURE & PROCESSING
• Parameters for selecting the scanner
– Speed to match our volume
– Duty cycle (life and production tolerance)
– Duplex scanning (mandatory)
– Minimum resolution of 200 dpi
– Image enhancement facilities such as noise removal, skew correction, cropping and contrast adjustment
– Hopper size and scanning path (U, J or flat belt)
– Maintenance and training services
DATA CAPTURE & PROCESSING
Selection of Scanner/Hardware/ICR software
• A high-level technical committee evaluated and selected the above items on the basis of the capabilities demonstrated by the various vendors
• As a result, CMC was selected as System Integrator, and ACER and HP for computer hardware with Windows NT 4.0 as the OS
• Kodak Module 7520 Scanner; TIS for ICR software
• National Informatics Centre did the LAN cabling and inspection of hardware
• Upgradation of 15 Data Centers
SETUP AT D.P. DIVISION (HQ)
HARDWARE
• Server: (P-III, 800 MHz, 512 MB, 6*36 GB HDD, CD & 1.44 MB Floppy Drive)
• 40/80 GB DLT Drives
• 100 MB Zip Drives
• CD Writer
• Local Area Network
• Intelligent Workstations (P-III): 800 MHz, 128 MB, 9 GB HDD, CD & 1.44 MB Floppy Drive
• Laser & Line Matrix Printer
SOFTWARE
• Operating Systems: Windows 98, Windows NT
• Latest Software Packages: IMPS, MS-Office, MS Visual Studio, MS SQL Server, ISM Publisher (Hindi, English), Adobe Publishing Collection
SETUP AT D.D.E. CENTRES
15 Locations (State Capitals)
HARDWARE
• High Speed Scanner – 24 (Nos.)
• Server (45 No.): (P-III, 800 MHz, 512 MB, 6*36 GB HDD, CD & 1.44 MB Floppy Drive)
• 40/80 GB DLT Drives
• 100 MB Zip Drives, CD Writer
• Local Area Network
• 24 Workstations with each Server
• Intelligent Workstations (P-III): 800 MHz, 128 MB, 9 GB HDD
• Laser & Line Matrix Printer
SOFTWARE
• Operating Systems: Windows NT, Windows 98
• Latest Software Packages: E-FLOW, MS-OFFICE, Software Package for Computer Assisted Coding
SNAPSHOTS OF HARDWARE RESOURCES
Distribution of PCs for various stages of Form Processing using e-FLOW - HHOLD PROJECT
[Table, Sl. no. 1–8. Locations: Ahmedabad, Bangalore, Bhopal (eflow1–eflow3), Bhubaneswar, Chandigarh (eflow1–eflow2), Chennai (eflow1–eflow2), Delhi (eflow1–eflow4), Guwahati. Stages: File Portal, FormID, Processing, Export, MERGE, Controller, Sub-total; Scan, RC, Manual ID, Exception, Sub-Total; Tile, Completion, Sub total; Total PCs. Legend: Un-manned PC, Supervisory staff PC, Operators PC.]
SNAPSHOTS OF HARDWARE RESOURCES
Distribution of PCs for various stages of Form Processing using e-FLOW - HHOLD PROJECT
Legend: Un-manned PC, Supervisory staff PC, Operators PC
Locations (Sl. no.): 9 Hyderabad, 10 Jaipur, 11 Kolkatta, 12 Lucknow, 13 Mumbai, 14 Patna, 15 Trivandrum
Column heads give Sl. no. and eflow instance (9-1 = Hyderabad eflow1); Total covers all 15 locations.

Stage          9-1   9-2    10  11-1  11-2  12-1  12-2  12-3  13-1  13-2  13-3  14-1  14-2    15 Total
File Portal      1     1     1     1     1     1     1     1     1     1     1     1     1     1    28
FormID           2     2     3     1     1     2     2     2     2     2     2     2     2     2    54
Processing       5     4     7     3     3     4     4     5     3     3     4     4     4     4   114
Export           1     1     1     1     1     1     1     1     1     1     1     1     1     1    28
MERGE            1     1     1     1     1     1     1     1     1     1     1     1     1     1    28
Controller       1     1     1     1     1     1     1     1     1     1     1     1     1     1    28
Sub-total       11    10    14     8     8    10    10    11     9     9    10    10    10    10   280
Scan             1     1     1     1     1     1     1     -     1     1     -     1     1     1    24
RC               3     3     3     3     3     3     3     3     3     3     3     3     3     3    78
Manual ID        1     1     2     1     1     1     1     1     1     1     1     1     1     1    31
Exception        4     4     6     3     3     4     4     4     3     3     3     4     4     3   101
Sub-Total        9     9    12     8     8     9     9     8     8     8     7     9     9     8   234
Tile             5     5     7     4     4     5     5     5     4     4     4     5     5     4   128
Completion      19    19    28    14    13    18    18    19    15    15    15    18    18    16   480
Sub total       24    24    35    18    17    23    23    24    19    19    19    23    23    20   608
Total PCs       44    43    61    34    33    42    42    43    36    36    36    42    42    38  1122
DATA CAPTURE & PROCESSING
Role of the Integrator
• Supply, installation and on-site maintenance of scanners.
• Supply and installation of Form Processing Software.
• Manage the LAN and load balancing from one stage to another.
• Provide a Software Core-Team centrally at ORGI HQ.
• Impart operational training to the staff at each location.
• Provide software personnel at each site.
• Provide scanner operators and carry out scanning operations.
• Achieve > 90% recognition rate and < 2% false positives.