Data Editing: Introduction

UNSD Regional Workshop on Census Data Processing for the English speaking African Countries:
Contemporary technologies for data capture, methodology and practice of data editing
Dar es Salaam, Tanzania, 9-13 June 2008
Objectives of Session
- Editing is the procedure for detecting and correcting errors in data.
- Imputation is the procedure for assigning values to missing or inconsistent data.
- The objective of the session is to present an overview of the concepts and definitions, and to discuss their application and related issues.
Summary
- Types of Errors in the census process
- Objectives of Editing: Why do it?
- How to and Why Edit? Some illustrative examples
- Principles of Editing: How to do it
- Fatal versus Query Edits
- Micro-editing versus Macro-editing
- Manual versus automatic editing
- Impact of capture mode on editing
- Pitfalls of Over-editing
- Other considerations
Types of Errors in the Census Process
Coverage Errors
- Incomplete/inaccurate maps or EAs
- Incomplete canvassing of all units
- Duplicate counting
- Omission of persons unwilling to be enumerated
- Erroneous treatment of visitors or non-resident aliens (especially in relation to de jure versus de facto methods)
- Loss or destruction of census records after enumeration
- ……
Types of Errors in the Census Process
Content Errors
- Errors in questionnaire design
- Enumerator errors
- Respondent errors
- Coding errors
- Data entry errors
- Errors in computer editing
- Errors in tabulation
Types of Errors in the Census Process
- Two types of errors at the processing stage:
  - Those that block further processing, and
  - Those that produce invalid/inconsistent results without interrupting the logical flow of subsequent processing operations
- ALL errors of the first kind must be corrected, and as many of the second kind as possible
Objectives of Editing: Why do it?
- Objectives of editing (Granquist, 1984)
  - “Tidy up data” so as to facilitate analysis (creation of a complete file)
  - Identify types and sources of errors (for reporting on data quality)
  - Improve quality of census data (for current and future censuses)
- It is important not only to detect errors but also to identify their causes, in order to take appropriate corrective measures and improve overall quality
How to Edit?
TABLE 1: 2010 Population by Age and Sex, Unedited and Edited

                      |          Unedited data             |      Edited data
Age group             | Total   Male   Female  Sex not     | Total   Male   Female
                      |                        reported    |
Total                 | 4,147   2,033  2,091   23          | 4,147   2,043  2,104
Less than 15 years    | 1,639     799    825   15          | 1,646     809    837
15 to 29 years        | 1,256     612    643    1          | 1,260     614    646
30 to 44 years        |   727     356    369    2          |   729     358    371
45 to 59 years        |   360     194    166    0          |   362     195    167
60 to 74 years        |   116      54     59    3          |   116      55     61
75 years and over     |    34      12     22    0          |    34      12     22
Age not reported      |    15       6      7    2          |
How to Edit?
TABLE 1: Population by Age and Sex, Unedited and Edited
How to deal with data “not reported”?
- Distribute the age unknowns and the sex unknowns in the same proportion as the corresponding known values
- For example, for the 23 sex unknowns, distribute (2,033/4,147) × 23 ≈ 11 to males (and the remaining 12 to females by subtraction); see the edited columns of Table 1
- Similarly, distribute the 15 age unknowns across the 6 age groups in proportion to the known values; see the edited columns of Table 1
- This method could give biased results if the number of unknowns (non-responses) is high, since the distributions of knowns and unknowns may be very different
- An improved strategy would be to use multivariate distributions involving other variables, such as the relationship between spouses, having a positive entry for number of children born, etc.
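The proportional distribution described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Table 1: the function name `distribute_unknowns` is hypothetical, the shares are computed over known values only, and largest-remainder rounding is an assumption made here so that the allocations always sum to the number of unknowns. Its results may therefore differ by a case or two from the edited columns of Table 1.

```python
def distribute_unknowns(known, n_unknown):
    """Split n_unknown cases across categories in proportion to known counts.

    Largest-remainder rounding (an assumption of this sketch) guarantees
    that the allocations add up to n_unknown.
    """
    total_known = sum(known.values())
    exact = {k: n_unknown * v / total_known for k, v in known.items()}
    alloc = {k: int(x) for k, x in exact.items()}  # floor of each exact share
    leftover = n_unknown - sum(alloc.values())
    # Give the remaining cases to the categories with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc


# Known sex counts and the 23 sex unknowns (Total row of Table 1, unedited).
sex_alloc = distribute_unknowns({"Male": 2033, "Female": 2091}, 23)

# Known age counts and the 15 age unknowns (Total column of Table 1, unedited).
age_known = {
    "Less than 15": 1639, "15 to 29": 1256, "30 to 44": 727,
    "45 to 59": 360, "60 to 74": 116, "75 and over": 34,
}
age_alloc = distribute_unknowns(age_known, 15)

print(sex_alloc)
print(age_alloc)
```

Because the split depends only on the marginal distribution of known values, it inherits the bias risk noted above when non-response is high.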
Why Edit?
TABLE 2: Population by Age with Unknowns for 2000 and 2010

                      |    Numbers      |    Percent
Age group             | 2010    2000    | 2010    2000
Total                 | 4,147   3,319   | 100     100
Less than 15 years    | 1,639   1,348   | 39.5    40.6
15 to 29 years        | 1,256     902   | 30.3    27.2
30 to 44 years        |   727     538   | 17.5    16.2
45 to 59 years        |   360     200   |  8.7     6.0
60 to 74 years        |   116      89   |  2.8     2.7
75 years and over     |    34      25   |  0.8     0.8
Age not reported      |    15     217   |  0.4     6.5
Why Edit?
TABLES 2 and 3: Population by Age with and without Unknowns for 2000 and 2010
- Another problem is that unknowns may affect the analysis of trends
- In Table 2, if unknowns are not taken into account, the percentage of persons aged 15-29 years appears to increase from 27.2% in 2000 to 30.3% in 2010
- Redistributing unknowns may change this trend
- In Table 3, after distributing unknowns, there is only an increase from 28.7% in 2000 to 29.3% in 2010
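The effect of unknowns on an apparent trend can be checked with a short calculation. The sketch below uses the 15 to 29 figures from Table 2 and compares each share computed over the full total with the share computed over known ages only, which is equivalent to distributing the unknowns pro rata. The helper `share` is a hypothetical name, and this pro rata calculation is an assumption: it does not reproduce the figures in Table 3, which were evidently produced by a different redistribution, but it shows the same narrowing of the gap between the two censuses.

```python
def share(count, total, unknown=0):
    """Percentage share of `count`, optionally excluding unknowns from the base."""
    return round(100 * count / (total - unknown), 1)


# Ages 15 to 29, totals and "Age not reported" counts from Table 2.
census = {
    2000: {"count": 902, "total": 3319, "unknown": 217},
    2010: {"count": 1256, "total": 4147, "unknown": 15},
}

results = {}
for year, c in census.items():
    with_unknowns = share(c["count"], c["total"])
    known_only = share(c["count"], c["total"], c["unknown"])
    results[year] = (with_unknowns, known_only)
    print(year, with_unknowns, known_only)
```

With unknowns left in the base the share rises by about three percentage points between the two years; over known ages only, the rise is closer to one point.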
Why Edit?
TABLE 3: Population by Age without Unknowns for 2000 and 2010

                      |    Numbers      |    Percent
Age group             | 2010    2000    | 2010    2000
Total                 | 4,147   3,319   | 100     100
Less than 15 years    | 1,743   1,408   | 42.0    42.4
15 to 29 years        | 1,217     952   | 29.3    28.7
30 to 44 years        |   695     578   | 16.8    17.4
45 to 59 years        |   341     230   |  8.2     6.9
60 to 74 years        |   114     109   |  2.7     3.3
75 years and over     |    37      42   |  0.9     1.3
Principles of Editing: How to do it
- In general the editing system should be:
  - Minimalist (change only obvious errors and as few as possible)
  - Automated (as much as possible, for both detection and correction)
  - Systematic
  - Consistent with other NSO statistical collections
  - Compliant with UN or other international standards
Fatal versus Query Edits
Types of edits:
- Fatal edits: identify errors with certainty
- Query edits: identify suspected errors
- Fatal edits identify fatal errors, which include invalid or missing entries as well as errors due to inconsistencies
- Query edits identify data items that fall outside subjective data bounds, or items that are relatively high or low compared with other data on the same questionnaire
- Fatal edits must be resolved. Query edits are more difficult to resolve, bring fewer benefits than the detection and resolution of fatal edits, and add more to the cost of the process
- For query edits, subject-matter specialists should review the edits developed for pilot censuses and those developed during processing to make sure that each individual edit is worth its cost (e.g., look at hit rates, or the share of flags that result in changes to the original data)
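The distinction between the two kinds of edit can be illustrated with a small sketch. Everything specific here is hypothetical: the field names, the sex codes, the hard age limit of 120, the query bound of 95, and the consistency rule are illustrative assumptions, not rules taken from any census specification.

```python
VALID_SEX = {"M", "F"}  # hypothetical coding scheme
MAX_AGE = 120           # hypothetical hard limit for a valid age
QUERY_AGE = 95          # hypothetical subjective bound that triggers a query


def fatal_edits(rec):
    """Return failures that are errors with certainty and must be resolved."""
    failures = []
    if rec.get("sex") not in VALID_SEX:
        failures.append("sex missing or invalid")
    age = rec.get("age")
    if not isinstance(age, int) or not 0 <= age <= MAX_AGE:
        failures.append("age missing or invalid")
    # Inconsistency between items on the same record (illustrative rule).
    if rec.get("sex") == "M" and (rec.get("children_born") or 0) > 0:
        failures.append("male with children ever born")
    return failures


def query_edits(rec):
    """Return flags for values that are suspicious but may be correct."""
    flags = []
    age = rec.get("age")
    if isinstance(age, int) and QUERY_AGE < age <= MAX_AGE:
        flags.append("age unusually high")
    return flags


def hit_rate(n_flags, n_changed):
    """Share of flags that led to a change in the original data."""
    return n_changed / n_flags if n_flags else 0.0


records = [
    {"sex": "F", "age": 34, "children_born": 2},
    {"sex": "M", "age": 41, "children_born": 3},
    {"sex": "F", "age": 99},
    {"sex": None, "age": 150},
]
for rec in records:
    print(fatal_edits(rec), query_edits(rec))
```

A query edit whose hit rate stays low over many flags is a candidate for removal, since each flag still costs review time.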
Micro-editing versus Macro-editing
- Micro-editing: concerns ways to ensure the validity and consistency of individual data records and of relationships between records in a household
- Macro-editing: checks aggregated data to make sure that they are reasonable
- Example: if census results show a large percentage of persons without a reported age, imputing age (at the micro level) will produce a complete data set
- BUT it is far more essential to make checks at the macro (aggregate) level to ensure that imputation does not skew the overall age distribution
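A macro-level check of this kind can be as simple as comparing the age distribution before and after imputation. The sketch below uses the reported and edited age totals from Table 1; the function names and the tolerance of 0.5 percentage points are illustrative assumptions, not recommended values.

```python
def shares(counts):
    """Percentage distribution of a dict of counts."""
    total = sum(counts.values())
    return {k: 100 * v / total for k, v in counts.items()}


def macro_check(before, after, tolerance_points=0.5):
    """Return groups whose percentage share moved by more than the tolerance."""
    s_before, s_after = shares(before), shares(after)
    return {
        k: round(s_after[k] - s_before[k], 2)
        for k in before
        if abs(s_after[k] - s_before[k]) > tolerance_points
    }


# Age totals from Table 1: reported ages only, then after editing.
reported = {"<15": 1639, "15-29": 1256, "30-44": 727,
            "45-59": 360, "60-74": 116, "75+": 34}
edited = {"<15": 1646, "15-29": 1260, "30-44": 729,
          "45-59": 362, "60-74": 116, "75+": 34}

shifted = macro_check(reported, edited)
print(shifted)  # an empty dict means no group moved beyond the tolerance
```

For the Table 1 figures no age group's share moves by more than the assumed tolerance; a non-empty result would be a signal to review the imputation rules before publishing.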
Impact of Capture Mode on Editing
- Types of capture modes typically used: manual (key-entry), OMR, OCR/ICR, PDA, Internet
- For key-entry, PDA and Internet capture, some limited detection and correction of errors can be done in “real time”
- This is not possible for OMR or OCR/ICR (scanning of paper questionnaires), which are limited to “batch editing” after the fact
Manual versus Automated Editing
- Manual edits may be done in several places along the editing chain: by the enumerator, supervisor, field office worker, coder, key entry clerk, etc.
- The disadvantage is that manual editing consumes an enormous amount of time (months or years), energy (human resources) and money
- If the data set is small, timing is not crucial and a work force is available, then manual editing may be feasible
- Automated editing reduces the time required, decreases the introduction of human error, and allows for the creation of an edit trail (and is therefore reproducible)
- Unlike manual editing, automated editing makes it feasible and efficient to impute responses based on other information in the questionnaire or on reported information for a unit with similar characteristics
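As an illustration of imputing from a unit with similar characteristics while keeping an edit trail, the sketch below uses a simple sequential hot deck. The slides do not name a specific imputation method, so the hot deck is an assumption chosen for illustration, as are the field names and the choice of age group as the matching characteristic.

```python
def hot_deck_impute_sex(records):
    """Fill missing 'sex' from the most recent donor in the same age group.

    Returns edited copies of the records plus an edit trail; the input
    records are left untouched so the unedited file can still be archived.
    """
    last_donor = {}  # age_group -> most recently reported sex
    edited, trail = [], []
    for i, rec in enumerate(records):
        new = dict(rec)
        group = rec["age_group"]
        if rec["sex"] is None:
            donor = last_donor.get(group)
            if donor is not None:
                new["sex"] = donor
                trail.append({"record": i, "field": "sex", "old": None,
                              "new": donor, "rule": "hot_deck_by_age_group"})
        else:
            last_donor[group] = rec["sex"]
        edited.append(new)
    return edited, trail


raw = [
    {"age_group": "15-29", "sex": "F"},
    {"age_group": "15-29", "sex": None},  # imputed from record 0
    {"age_group": "30-44", "sex": None},  # no donor yet, left as missing
]
edited, trail = hot_deck_impute_sex(raw)
print(edited)
print(trail)
```

Because every change is logged with the rule that produced it, rerunning the same code on the same unedited file reproduces the same edited file.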
Pitfalls of Over-editing
- Reduced timeliness
- Increased costs
- Potential distortion of true values
- False sense of security
Other Considerations
Determination of tolerance levels for error detection
- For most items in a census, some small percentage of respondents will not give “acceptable” responses, for whatever reason
- Not every failure is pervasive, so not every failure is worth remedial action (see Pitfalls of Over-editing)
- Tolerance levels indicate the number of invalid and inconsistent responses allowed before editing teams take remedial action
- Tolerance levels are decided by an editing team that includes both subject-matter and data processing specialists
- For key items such as age and sex, tolerance is typically low (1%-2%), whereas for less critical items such as literacy and disability it is typically higher (5%-10%)
- Correction may occur by returning enumerators to the field, conducting telephone re-interviews, or applying specific knowledge of an area
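A tolerance check reduces to comparing a failure rate with a threshold. In the sketch below the thresholds are hypothetical values picked from within the ranges quoted above, and the function name is illustrative; the counts are the "Age not reported" figures from Table 2.

```python
# Hypothetical tolerance levels, chosen within the ranges quoted above.
TOLERANCE = {"age": 0.02, "sex": 0.02, "literacy": 0.10, "disability": 0.10}


def exceeds_tolerance(item, n_failed, n_total, tolerance=TOLERANCE):
    """True if the share of invalid or inconsistent responses is above tolerance."""
    return n_failed / n_total > tolerance[item]


age_2010 = exceeds_tolerance("age", 15, 4147)   # 0.4% not reported
age_2000 = exceeds_tolerance("age", 217, 3319)  # 6.5% not reported
print(age_2010, age_2000)  # False True
```

Under these assumed thresholds the 2010 age item would pass without remedial action, while the 2000 item would be referred to the editing team.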

Learning from the editing process / quality assurance systems
- Positive and negative feedback loops need to be recorded to improve the quality of both the current census and future censuses and surveys
- Audit trails, performance measures and diagnostic statistics are crucial
- This is often the most important outcome of editing
Other Considerations
Cost of editing
- The cost of editing has not decreased much in the last 20 years, although the process has been rationalized by continuous exploitation of technological developments
- In general, editing activities take a disproportionate amount of time (and therefore staff costs) relative to other activities
- Excessive editing can delay census results
Archiving
- Both edited and unedited data files should be preserved for later analysis, and in several places
- Documentation should be complete enough for census planners to be able to reconstruct the same processes at a later date
THANK YOU!