United Nations Regional Workshop on Census Data Processing:
Contemporary Technology for Census Data Capturing and Editing
A perspective on the South African data processing system
A presentation by the South African Data Processing Team
Dar-es-Salaam, Tanzania, 9-13 June 2008
Census 1996, 2001 & Community Survey (CS)
The presentation layout
• Introduction
• Data processing goal
• Planning phase
• Design of Data Processing System
• System Development & Testing
• Implementation & Operations
  – Process flow
  – Document Management System
  – Progress reporting
  – Tool of scanning
  – Exceptions
  – Quality Assurance (QA)
• Accounting or Balancing process
• Data validation & Editing
• Tabulation and output products
Introduction
• Data processing is treated as part of the survey operations value chain, with a properly defined accountability structure;
• There are defined inter-dependencies with other Census sections (e.g. questionnaire design, data collection, …);
• It depends heavily on the information technology support available around the country (management of the system is outsourced);
• Tight project management principles are applied, checking timelines, resources and detailed production lines;
• It must adapt to ever-changing technology (1996: key from paper (KFP); 2001: Census scanning; 2007: scanning with the old scanners; 2011: Census scanning with upgraded scanners).
Goal of Data Processing
To accurately process or convert the statistical information from different collection tools, such as the questionnaire, into comprehensive electronic data that is clean, accurate, consistent and reliable.
Planning phase
• Going through the lessons learned from previous censuses and surveys (1996 Census, 2001 Census, 2007 Community Survey)
– Preparation of processing site
  • In the 1996 Census, data processing was distributed across centres in the 9 provinces
  • In the 2001 Census and 2007 CS, there was a centralized data processing centre
– Mode of data capturing
  • In the 1996 Census: manual capturing (key from paper) running on an SQL database with an interface developed in Visual Basic
  • In the 2001 Census and 2007 CS: proprietary scanning technology linked to an Oracle database
– Census budget
  • The 1996 Census budget was estimated at 500 million Rand
  • The 2001 Census budget was estimated at 1.2 billion Rand
  • The 2007 CS budget was estimated at 600 million Rand
– Human resources
  • The 1996 Census had more staff, for key from paper (an option considered for job creation across the country)
  • The 2001 Census and 2007 CS had a reduced number of staff, working in shifts to support the scanning technology
– Duration
  • The 1996 Census data capturing was planned for 12 months
  • The 2001 Census was planned for 6 months. However, the period was extended (18 months) because the new technology had not been tested
  • The 2007 CS took only 3 months, as planned
– Systems design and specifications
  • In the 2001 Census, system specification and development were reviewed during implementation
  • In the 2007 CS, most of the system specification and development was completed and tested before production
Planning phase
• Strategic plan
– There is a policy on standard procedures for documentation, process flow, metadata and concepts, managed by the DMID (Data Management and Information Delivery) project;
– A common strategy across the survey programme, using scanning technology with transactions controlled in a database;
– Moving toward a centralised corporate data processing centre (store management, …);
– Accounting of production transactions, tracking each questionnaire by barcode;
– Measurement of quality at each process of the production;
– A permanent team of data processors, in order to retain experience while building capacity;
– Any system or module is accepted into production only after it has gone through a testing phase, to avoid repeating the 2001 Census experience of an untested system.
Planning phase
• Operational plan & Budget
– Since the 2001 Census, there has been a detailed list of activities, sub-activities and tasks with timelines (start and end dates) and responsible persons;
– Since the 2007 CS, each activity has been linked to the budget in what is called activity/task-based costing;
– Since the 2007 CS, there has been an independent and dedicated team in charge of project management and monitoring of activities;
– A list of documents and other deliverables is submitted to the project management team (PMO) to keep track of progress;
– Performance indicators were developed for the PMO to track, giving the daily production counts per process;
– With activity-based costing, the budget has never been an issue, except in the 2001 Census when the project went beyond the planned period.
Design of Data Processing
• The data processing team gets the user requirements from the questionnaire design team and the data collection team;
• The team, comprising data processors, a system analyst (1 person), programmers, statisticians and data technologists (IT technicians), prepares the overall design specifications;
• The data processing team is supplemented by the Data Collection team in the management of production and of staff on the flow;
• The scanning module of the system is outsourced (in 2001 to a consortium of companies, but in the 2007 CS one company was accountable);
• In the 2007 CS, data processing project management was controlled in-house to avoid the lack of accountability observed in the 2001 Census, where it was done by an external party (PROCON);
• Because the workflow kept changing during the 2001 Census, an approved workflow with the operational procedure manual was ready before the start of production in the 2007 CS;
• The functional specifications were done only in the 2007 CS, as part of the overall system specification;
• In the 2001 Census, the technical specifications were completed for the as-built system, whereas the 2007 CS specifications were done before any implementation.
System Development & Testing
• In the 1996 Census, system development was done by an in-house team supported by Swedish consultants;
• In the 2001 Census, system development was outsourced to a locally based company that put together a consortium of service providers in project management, system development, scanner specialist/maintenance, and image and recognition software;
• In the 2007 CS, system development and project management were done in-house, outsourcing only the scanning software and scanner maintenance;
• In the 1996 Census, only unit tests were conducted, whereas in the 2001 Census most of the tests (unit tests, production load test, …) were conducted while already in production;
• In the 2007 CS, all tests were done before production:
– For instance, background colour drop-out was tested in the 2007 CS, whereas the blue background in the 2001 Census required a blue light in the scanner (tested after months of production);
– The decision on exception handling (rescan or transcription) was made during production in the 2001 Census, whereas in the 2007 CS the questionnaires were sent to Key From Paper (KFP) or Key From Image (KFI);
– In the 2007 CS, false-positive readings were reduced by introducing voting rules between two different recognition engines, whereas in the 2001 Census all false-positive readings were sent to the verification stage (Tiling and Completion/Key correction).
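The voting idea between two recognition engines can be illustrated with a minimal sketch. The slides do not describe the actual voting rules, so this assumes the simplest one (a character is accepted only when both engines agree); the function names `vote` and `vote_field` are hypothetical.

```python
from typing import List, Optional


def vote(engine_a: Optional[str], engine_b: Optional[str]) -> Optional[str]:
    """Accept a character only when both engines return the same value.

    Assumption: simple unanimity. None means "not accepted", i.e. the
    character would go on to a verification step.
    """
    if engine_a is not None and engine_a == engine_b:
        return engine_a
    return None


def vote_field(reads_a: List[Optional[str]],
               reads_b: List[Optional[str]]) -> List[Optional[str]]:
    """Apply the voting rule character by character across one field."""
    if len(reads_a) != len(reads_b):
        raise ValueError("both engines must return the same number of characters")
    return [vote(a, b) for a, b in zip(reads_a, reads_b)]


# Hypothetical reads of a four-digit field: the engines disagree on the
# third character, so only that position is left unresolved.
result = vote_field(list("1987"), list("1937"))
print(result)
```

Under this rule a single engine's confident but wrong read is never accepted on its own, which is how voting can cut false positives at the cost of sending more characters to verification.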
Implementation & Operations
• Operational procedures
– In the 2001 Census, the operational procedure manual was prepared during production;
– In the 2007 CS, the operational procedure was in place before training;
– A production account is produced every day (extracted from the Oracle database).
• Recruitment
– In the 1996 Census, production staff were selected on keying speed only;
– In the 2001 Census, production staff were recruited according to the requirements of each process;
– In the 2007 CS, production staff had versatile skills as data processors and could move between processes depending on needs, as determined by the flow manager;
– In the 2001 Census, staff worked 24 hours a day, 7 days a week, in 3 shifts. In the 2007 CS, a single shift was enough to meet the deadline.
• Training
– In the 2001 Census, training was conducted by the service provider (PROCON), whereas in the 1996 Census and 2007 CS training was given by the senior data processors, system developers and statisticians who were part of the design team.
• Preparation of work environment
– The 1996 Census used 9 sites. The 2001 Census used one warehouse site, and the 2007 CS used two sites (one for main storage and the other for production);
– Site preparation, including partitioning and the installation of hardware and networking, was done one month before the end of Census field operations.
Operations cont…
High Level Process Flow
[Figure: CS Data Processing Process Flow (DMS Ver 3.0). The flowchart runs from the Document Management System and the CSAS steps (receiving, content verification, barcode matching) through primary preparation, guillotine, secondary preparation, scanning, post-scanning box check, image validations and form identification, recognition/interpret, verify/tiling/completion, automated coding and on-screen coding (resolution), normalisation, transfer/export and an accuracy check (95%) on a sample, to sign-off (balancing) and the output database. Exceptions (non-scannable or damaged questionnaires, unrecognisable cases, validation failures, accuracy below 95%) are routed to Key from Paper (1st and 2nd capture), Key from Image, cause determination or manual editing. Legend: Physical Movement Path, Data Movement Path, Exception Path, QA & Validation + Key from Image.]
Operations cont…
Document Management System
• Tracks document movement across processes;
• Accounts for all transactions, including production staff logins;
• Database driven (SyBase in 1996, Oracle in 2001 and 2007);
• Progress reporting per user, per function and per process;
• Reporting supports performance management (speed, time, production units, …).
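As an illustration of progress reporting per user and per process, the sketch below aggregates a transaction log in memory. The production system was database driven; the `Transaction` record, its field names and the sample barcodes are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Transaction:
    """One logged movement of a box through a process (hypothetical schema)."""
    box_barcode: str
    process: str
    user: str


def progress_counts(transactions: Iterable[Transaction]) -> Tuple[Counter, Counter]:
    """Return box counts per process and per (user, process) pair."""
    rows = list(transactions)
    per_process = Counter(t.process for t in rows)
    per_user = Counter((t.user, t.process) for t in rows)
    return per_process, per_user


# Hypothetical log entries.
log = [
    Transaction("BOX0001", "Scanning", "user_a"),
    Transaction("BOX0002", "Scanning", "user_b"),
    Transaction("BOX0001", "Tiling", "user_a"),
]
per_process, per_user = progress_counts(log)
print(per_process["Scanning"], per_user[("user_a", "Tiling")])
```

In a database-backed system the same counts would typically come from a grouped query over the transaction table rather than from Python objects.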
Operations cont…
Progress reporting
CS 2007 Data Processing Progress Report
Date: Monday 02 Jul 2007

Constants
Number of boxes to be processed: 17,387
Number of questionnaires in boxes (Final Result Code 0-9): 284,244
Number of questionnaires in boxes (Final Result Code 1,4), processable: 251,775
Number of questionnaires in boxes, not processable: 32,469

Overall Progress: Boxes
Process                                                Completed  Total   Outstanding  % Complete  Planned start  Planned end  Actual start  Actual end
Check Out DPC store                                    17,387     17,387  -            100.00%     03 Apr 07      01 Jun 07    02 Apr 07     11 Jun 07
Check In OM store                                      17,387     17,387  -            100.00%     03 Apr 07      01 Jun 07    02 Apr 07     11 Jun 07
Store audit verification                               17,387     17,387  -            100.00%     04 Apr 07      01 Jun 07    04 Apr 07     11 Jun 07
Primary preparation and Guillotine                     17,304     17,304  -            100.00%     11 Apr 07      01 Jun 07    11 Apr 07     11 Jun 07
KFP (Manual Coding, First Capture and Second Capture)  61         83      22           73.49%      28 May 07      29 Jun 07    12 Jun 07     -
Secondary preparation                                  17,304     17,304  -            100.00%     12 Apr 07      08 Jun 07    12 Apr 07     11 Jun 07
Scanning                                               17,015     17,015  -            100.00%     12 Apr 07      08 Jun 07    12 Apr 07     11 Jun 07
Post Scan Checkout                                     17,015     17,015  -            100.00%     12 Apr 07      08 Jun 07    12 Apr 07     11 Jun 07

Overall Progress: Questionnaires
Process                                                Completed  Total    Outstanding  % Complete
Store Audit Verification                               284,244    284,244  -            100.00%
Primary preparation and Guillotine                     281,840    281,840  -            100.00%
KFP (Manual Coding, First Capture and Second Capture)  1,902      2,404    502          79.12%
Secondary Preparation                                  281,840    281,840  -            100.00%
Scanning                                               247,746    247,746  -            100.00%
Character Recognition                                  246,605    247,746  1,141        99.54%
Tiling                                                 246,387    247,746  1,359        99.45%
Completion                                             244,782    247,746  2,964        98.80%
Data export                                            246,522    247,746  1,224        99.51%
Sampled for QA                                         246,200    247,746  1,546        99.38%
Quality Assurance                                      242,700    247,746  5,046        97.96%
Coding                                                 -          247,746  247,746      0.00%
KFI (Exception resolution)                             49,390     49,976   586          98.83%
Sign-off                                               240,423    247,746  7,323        97.04%

For each process the report also carries buffer and work-in-progress counts, planned and actual start and end dates, the required target per day, the average production per day, the estimated days for completion based on the current average, a new end date, and the lead or delay in days. A Daily Progress panel gives daily counts per process from Fri 22 Jun to Mon 02 Jul.
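The report's derived columns can be reproduced with simple arithmetic. The sketch below is a guess at how "Percentage Complete" and "Estimated days for completion based on current average" are derived; the report does not state its formulas, and the function names are hypothetical. The KFP box figures (61 completed out of 83) are taken from the report and do reproduce its 73.49%.

```python
import math


def percent_complete(completed: int, total: int) -> float:
    """Completed units as a percentage of total, rounded to two decimals."""
    if total <= 0:
        return 0.0
    return round(100.0 * completed / total, 2)


def estimated_days(outstanding: int, average_per_day: float) -> int:
    """Whole days needed to clear the outstanding units at the current average.

    Assumption: the estimate is rounded up to the next full production day.
    """
    if outstanding <= 0:
        return 0
    if average_per_day <= 0:
        raise ValueError("average production per day must be positive")
    return math.ceil(outstanding / average_per_day)


# KFP boxes from the report: 61 of 83 completed.
print(percent_complete(61, 83))
# Hypothetical backlog and daily rate, for illustration only.
print(estimated_days(1000, 300))
```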
Operations cont…
Progress reporting
[Charts: "Overall Progress - Boxes" and "Overall Progress - Questionnaires" show, per process, the percentage Completed, Work in Progress, Buffer and Outstanding. "Daily Progress - Boxes" and "Daily Progress - Questionnaires" show daily counts per process from Fri 22 Jun to Mon 02 Jul against the targets for manual and automated processes. "Cumulative Progress - Scanning - Planned vs Actual" plots the number of questionnaires against production days from Thu 12 Apr to Fri 08 Jun.]
Operations cont…
Tool of scanning
• Kodak 9520D
– Used in 2001 Census;
– Used in 2007 CS;
• Differential scanner feeding (page by page and/or in batches);
• Barcode recognition at scanning time
Operations cont…
Exceptions
• Questionnaire transcription:
– Damaged
– Unscannable
– Inconsistent page numbering
– Unique identifier (barcode)
• Key From Paper (KFP):
– Poor image quality
– Faint writing
– Missing pages
– Wrong unique identifier (Enumerator Area, Dwelling Unit & Household Number)
• False-positive reading:
– Poor software recognition
– Poor image quality
– Incomplete text (character)
– Unrecognized mark or character
• Failed quality checks:
– Quality rate below the threshold (95% accuracy rate)
Operations cont…
Quality Assurance (QA)
• In the 1996 Census, quality control was implemented as part of double keying, without any measurement attached to it;
• In the 2001 Census, quality was measured at scanning time (image quality check) and after data capturing (Key from Image of the sampled batches; the threshold was 97%);
• In the 2007 CS, a sample of the captured data was subjected to a second capture and compared with the first capture to determine the agreement rate (the threshold was reduced to 95% because of good image quality):
– For scanned cases: a sample was keyed from image and an agreement rate calculated;
– For exceptional cases: 100% were double keyed from paper and an agreement rate calculated.
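The agreement-rate check can be sketched as follows. This assumes the rate is the share of fields on which the first and second capture match, and that a batch passes when the rate is at least 95%; the slides do not spell out either detail, and the function names are hypothetical.

```python
from typing import Sequence


def agreement_rate(first: Sequence[str], second: Sequence[str]) -> float:
    """Share of fields where the first and second capture agree."""
    if len(first) != len(second):
        raise ValueError("captures must cover the same fields")
    if not first:
        raise ValueError("at least one field is required")
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def passes_qa(first: Sequence[str], second: Sequence[str],
              threshold: float = 0.95) -> bool:
    """True when the agreement rate meets the threshold (95% assumed)."""
    return agreement_rate(first, second) >= threshold


# Hypothetical batch of 20 fields with one disagreement: 19/20 agreement.
first_capture = ["1"] * 20
second_capture = ["1"] * 19 + ["7"]
print(agreement_rate(first_capture, second_capture))
print(passes_qa(first_capture, second_capture))
```

A batch that fails the check would follow the exception path (for example Key from Paper or manual editing) rather than being exported.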
Accounting or Balancing process
• After capturing, each questionnaire is accounted for, linked to its geographical area (EA), and checked for the correct data structure (household, persons, …) before any export;
• In the 1996 Census, captured data was exported to SAS/ASCII for the post-capture processes (editing and tabulation);
• In the 2001 Census, the balancing process took longer because postal (self-enumeration) questionnaires lacked a reference link to the EA;
• In the 2007 CS, a Census and Administration System (CSAS) assisted in getting the full account of the questionnaires linked to their referenced geography.
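A balancing check of this kind can be sketched as a comparison between the questionnaires expected from the field and those actually captured. The data layout (barcode mapped to EA code), the sample codes and the function name `balance` are assumptions for illustration, not the CSAS design.

```python
from typing import Dict, List


def balance(expected: Dict[str, str],
            captured: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare expected and captured questionnaires, keyed by barcode.

    Both arguments map a questionnaire barcode to its EA code
    (hypothetical layout).
    """
    missing = sorted(set(expected) - set(captured))
    unexpected = sorted(set(captured) - set(expected))
    ea_mismatch = sorted(
        barcode for barcode in expected.keys() & captured.keys()
        if expected[barcode] != captured[barcode]
    )
    return {"missing": missing, "unexpected": unexpected, "ea_mismatch": ea_mismatch}


def is_balanced(report: Dict[str, List[str]]) -> bool:
    """A batch balances when no questionnaire is missing, extra or misplaced."""
    return not any(report.values())


# Hypothetical barcodes and EA codes.
expected = {"Q001": "EA-10", "Q002": "EA-10", "Q003": "EA-11"}
captured = {"Q001": "EA-10", "Q003": "EA-12", "Q004": "EA-11"}
report = balance(expected, captured)
print(report, is_balanced(report))
```

Export would go ahead only for batches where `is_balanced` holds; the listed barcodes identify which questionnaires need follow-up.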
Data validation & Editing
• In 1996, the adopted strategy was not to impute any derived value; only manual editing was allowed;
• In the 2001 Census, automated editing was implemented using IMPS/CSPro, based on editing specifications and with the assistance of the US Bureau of the Census. The 2007 CS followed the same approach;
• Editing reports with imputation rates were produced for an editing committee, which decided on the rules to apply for correction;
• In the 2001 Census and 2007 CS, limited manual editing was implemented;
• One of the key editing rules is the removal of minimal processable cases caused by poor recognition or false-positive readings;
• Although editing is done in ASCII, the output database is exported in different user-driven formats (e.g. ASCII, Oracle, SAS, …) linked with the metadata.
Data validation & editing Cont…
Tabulation and output products
• Since Stats SA policy is to give data users access, the strategy is to provide the Census data in different formats to increase accessibility and promote data use;
• In the 1996 Census, the output database was packaged as a SuperCross database, and a set of aggregated databases was put on CD for users;
• In the 2001 Census, access to the data was widened by adding the online tabulation tool (PX-Web) to SuperCross, a reduced ASCII file, …;
• In the 2007 CS, the data is also available in different formats (SuperCross, ASCII file, PX-Web and other map/chart-linked tools);
• The traditional reports are still produced, based on the tabulation plan/output reports.
Benefit of scanning Technology
• Improve the Quality of the Data
• Save Time
• Reduce Costs
THANK YOU!