Census Data Editing: Structure and Within Record Editing

UNSD-ESCWA Regional Workshop on Census Data Processing in the ESCWA region:
Contemporary technologies for data capture, methodology and practice of data editing
Doha, State of Qatar, 18-22 May 2008
Part I: Structure Editing
Summary
• Part I: Structure Edits
   - What are structure edits?
   - Geography edits
   - Hierarchy of records
   - Correspondence between housing and population records
   - Editing relationships in a household
   - Family nuclei
What are structure edits?

Structure edits check coverage and the relationships between different units: persons, households, housing units, enumeration areas, etc. Specifically, they check that:
• all household and collective living quarter records within an enumeration area are present and in the proper order;
• all occupied housing units have person records, and vacant units have none;
• households have neither duplicate nor missing person records;
• enumeration areas have neither duplicate nor missing housing records.
Geography edits
• Each EA must carry the correct geographic codes (city, province, region, ...)
• Every housing unit in an EA should be entered, and every record must have a valid EA code
• The capture process must check this before data editing commences
• If errors remain, it is best to find the right code, for example by returning to the enumeration documents and correcting the code manually
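The geography checks above could be automated along the following lines. This is only a sketch: the `VALID_EAS` lookup, its codes, and the record field names are assumptions made for illustration, not part of the workshop material.

```python
# Hypothetical reference list: each valid EA code with the higher-level
# geographic codes it should carry (all codes invented for illustration).
VALID_EAS = {
    "EA001": {"region": "01", "province": "011"},
    "EA002": {"region": "01", "province": "012"},
}


def check_geography(record):
    """Return the geography errors found in one record (empty list if none)."""
    expected = VALID_EAS.get(record.get("ea"))
    if expected is None:
        # Unknown EA: the higher-level codes cannot be verified either.
        return ["invalid EA code"]
    return [
        f"{level} code does not match EA"
        for level, code in expected.items()
        if record.get(level) != code
    ]
```

Records that return a non-empty list would be set aside for manual correction against the enumeration documents rather than changed automatically.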
Hierarchy of records
[Diagram: example hierarchy of records in three enumeration areas (EA1, EA2, EA3; record Type 1). Each EA contains housing units (Type 2; category dwelling or vacant dwelling) and collective living quarters (Type 3; categories hospital and hotel). Each occupied housing unit or collective living quarter is followed by its person records (Type 4), numbered from Person 1 within the unit; the vacant dwelling has no person records.]
Hierarchy of records
1_EA
  2_Housing unit
    4_Individual
    4_Individual
  2_Housing unit
  3_Collective living quarter
    4_Individual
    4_Individual
1_EA
Hierarchy of records
• Type 1 (EA) is followed by a new Type 1 (if the original EA is empty), or by Type 2 (Housing Unit) or Type 3 (Collective Living Quarter)
   - Particular case of homeless people: create a dummy housing record to make structural checking easier
• Type 2 (Housing Unit) is followed by Type 1, 2 or 3 (if the original dwelling is vacant) or by Type 4 (if the original dwelling is occupied)
• Type 3 (Collective Living Quarter, CLQ) is followed by Type 4 (Individual)
   - If not occupied, is an empty CLQ allowed?
• Type 4 (Individual) is followed by Type 4 (another individual in the same dwelling or collective living quarter), or by Type 2 or 3 (another dwelling or CLQ), or by Type 1 (new EA)
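The sequencing rules above amount to a table of allowed transitions between record types, which might be checked as sketched below. Two points are assumptions rather than statements from the slides: an empty CLQ is treated as not allowed (the slide leaves this open), and the file is required to start with an EA record. The function and constant names are hypothetical.

```python
# Allowed record type after each record type (1=EA, 2=Housing unit,
# 3=Collective living quarter, 4=Individual).
# Assumption: a CLQ must be followed by at least one individual.
ALLOWED_NEXT = {
    1: {1, 2, 3},
    2: {1, 2, 3, 4},
    3: {4},
    4: {1, 2, 3, 4},
}


def find_sequence_errors(record_types):
    """Return (position, message) pairs for every illegal transition."""
    errors = []
    # Assumption: a well-formed file opens with an EA record.
    if record_types and record_types[0] != 1:
        errors.append((0, "file must start with an EA record"))
    for i in range(1, len(record_types)):
        prev, cur = record_types[i - 1], record_types[i]
        if cur not in ALLOWED_NEXT.get(prev, set()):
            errors.append((i, f"Type {cur} cannot follow Type {prev}"))
    return errors
```

Whether a Type 2 record followed by another Type 1, 2 or 3 record really is vacant depends on its category, so that condition belongs to the housing and population correspondence edit rather than to this type-only check.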
Correspondence between housing and population records
• An occupied unit should have at least one person and a vacant unit should have none: if a Type 2 (Housing Unit) record with category “vacant” is followed by a Type 4 (Individual) record, change the category to occupied
• The number of occupants recorded on the Housing Unit form should equal the number of individual records in the household. If not, change the number on the Housing Unit form
• Population records should be sequenced (numbered)
• If a Type 3 (CLQ) record with category “Hospital” is followed by multiple Type 4 (Individual) records of category “Retirement home”, change the category of the CLQ to “Retirement home”
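The first two rules above could be applied to one housing unit and its person records roughly as follows. The field names (`category`, `occupants`) are hypothetical, and the handling of an occupied unit without any person records (raising a query instead of editing) is an assumption, since the slides do not say how that case is resolved.

```python
def edit_housing_unit(unit, persons):
    """Reconcile a housing record with the person records that follow it.

    unit    -- dict with 'category' ("vacant"/"occupied") and 'occupants'
    persons -- list of the person records belonging to the unit
    Returns the edited copy of the unit and a list of changes or queries.
    """
    edited = dict(unit)
    changes = []
    n = len(persons)
    if edited["category"] == "occupied" and n == 0:
        # Assumption: this case is queried, not edited automatically.
        changes.append("query: occupied unit has no person records")
        return edited, changes
    if edited["category"] == "vacant" and n > 0:
        edited["category"] = "occupied"
        changes.append("category: vacant -> occupied")
    if edited["occupants"] != n:
        changes.append(f"occupants: {edited['occupants']} -> {n}")
        edited["occupants"] = n
    return edited, changes
```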
Editing relationships in a household
• Each individual has a relation to the first person:
   - 1st person (or Head, or reference person)
   - Spouse
   - Child of the 1st person or of his/her spouse
   - Parent
   - Other relative
   - Friend
   - Lodger
   - ...
Editing relationships in a household
[Figure: household with potential inconsistencies in age reporting]
Family nuclei
• Father:
   - Sex should be male and age should be > minimum age
• Mother:
   - Sex should be female and age should be > minimum age
• Child:
   - Age under a maximum limit?
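A family nucleus check along these lines might look like the sketch below. The slides do not give the minimum parental age or the maximum child age, so both are parameters here; the default of 15 and the field names are assumptions for illustration only.

```python
def check_nucleus(father=None, mother=None, children=(),
                  min_parent_age=15, max_child_age=None):
    """Return the errors found in one family nucleus (empty list if none)."""
    errors = []
    for role, person, sex in (("father", father, "Male"),
                              ("mother", mother, "Female")):
        if person is None:
            continue
        if person["sex"] != sex:
            errors.append(f"{role}: sex should be {sex.lower()}")
        if person["age"] <= min_parent_age:
            errors.append(f"{role}: age should be above {min_parent_age}")
    # The upper age limit for a child is left open on the slide,
    # so it is only checked when a limit is supplied.
    if max_child_age is not None:
        for number, child in enumerate(children, start=1):
            if child["age"] >= max_child_age:
                errors.append(f"child {number}: age should be under {max_child_age}")
    return errors
```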
Part II: Within Record Editing
Summary
• Part II: Within Record Edits
   - Validity and Consistency Checks
   - Top-down Editing versus Multiple-Variable Editing
   - Example of Multiple-Variable Editing
   - Methods of Correcting and Imputing Data
   - Example of Hot Deck for Sample Household (Sex Only)
   - Example of Hot Deck for Sample Household (Sex and Age)
   - Issues Related to Hot Deck
   - Methods of Correcting and Imputing Data: General Principles
   - Edit Trails and the Use of Imputation Flags
Validity and Consistency Checks
• Validity checks are performed to see if the values of individual variables are plausible or lie within a reasonable range
• Examples:
   - 0<=AGE<=110
   - SEX=Female or SEX=Male
• Consistency checks are performed to ensure that there is coherence between two or more variables
• Examples:
   - Head of household should have AGE>=15
   - A child should be younger than the head of household
   - A person with AGE<15 should have marital status “never married”
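The example checks above translate almost directly into code. The sketch below assumes hypothetical field names (`age`, `sex`, `relationship`, `marital_status`) and category labels; validity checks look at one variable at a time, consistency checks at two or more.

```python
def validity_errors(person):
    """Single-variable checks: is each value within its allowed range?"""
    errors = []
    if not 0 <= person["age"] <= 110:
        errors.append("AGE outside 0-110")
    if person["sex"] not in ("Male", "Female"):
        errors.append("SEX not Male or Female")
    return errors


def consistency_errors(person, head_age):
    """Multi-variable checks: do the values agree with each other?"""
    errors = []
    if person["relationship"] == "Head" and person["age"] < 15:
        errors.append("head of household under 15")
    if person["relationship"] == "Child" and person["age"] >= head_age:
        errors.append("child not younger than head")
    if person["age"] < 15 and person["marital_status"] != "Never married":
        errors.append("person under 15 not never married")
    return errors
```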
Top-down Editing versus Multiple-Variable Editing
• The top-down editing approach starts with the top-priority variable (not necessarily the first variable on the questionnaire) and moves sequentially through all items in decreasing priority
• During the editing process, some edits change the value of an item more than once; this can introduce one or more errors into the dataset
   - Example: a child’s age is first imputed on the basis of the mother’s age. Later the child’s age is re-imputed on the basis of reported years of schooling, which might be inconsistent with the mother’s age
   - In this case, the child’s age should keep being re-imputed until it is consistent
• It is important to avoid circular editing!
Top-down Editing versus Multiple-Variable Editing
• The multiple-variable editing approach uses a set of rules that state the relationships between variables
• Each statement is tested against the data to see if it is true
• The edit system keeps track of all false statements relating to invalid entries or inconsistencies
• An assessment is then made of how to change the record so that it passes all edits, and a decision is then made
• The Fellegi-Holt principle of “minimum change” should be used
Example of Multiple-Variable Editing
TABLE 1: Head of household and spouse have same sex

Unedited data
Person  Relationship       Sex     Children ever born
1       Head of household  Male    3
2       Spouse             Male    BLANK

Data after editing for sex
Person  Relationship       Sex     Children ever born
1       Head of household  Female  3
2       Spouse             Male    BLANK
Example of Multiple-Variable Editing
TABLE 2: Head of household and spouse have same sex

No.  Rule                                                                    Relationship  Sex  Age  Marital status  Fertility
1    Head of household should be 15 years or older
2    Spouse should be 15 years or older
3    A spouse should be married
4    If spouse present, head of household and spouse should be opposite sex  1             1
5    Person less than 15 years old should be never married
6    Male should have no fertility                                                         1                         1
7    For female 15 years or older fertility entry should not be blank
     Totals                                                                  1             2                         1
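To illustrate how a multiple-variable edit system tallies failed rules per variable, the sketch below applies the seven rules to the unedited household of Table 1. Several things are assumptions made only for this illustration: the ages (40 and 38) and marital status ("Married") are not given on the slides, the assignment of variables to each rule is an interpretation, and all names are hypothetical. It is a simplified tally, not an implementation of the full Fellegi-Holt method.

```python
from collections import Counter


def failed_edits(household):
    """Return (rule number, variables involved) for every failed rule."""
    failures = []
    head = next(p for p in household if p["rel"] == "Head")
    spouse = next((p for p in household if p["rel"] == "Spouse"), None)
    if head["age"] < 15:
        failures.append((1, ("relationship", "age")))
    if spouse is not None:
        if spouse["age"] < 15:
            failures.append((2, ("relationship", "age")))
        if spouse["marital"] != "Married":
            failures.append((3, ("relationship", "marital status")))
        if spouse["sex"] == head["sex"]:
            failures.append((4, ("relationship", "sex")))
    for p in household:
        if p["age"] < 15 and p["marital"] != "Never married":
            failures.append((5, ("age", "marital status")))
        if p["sex"] == "Male" and p["ceb"] is not None:
            failures.append((6, ("sex", "fertility")))
        if p["sex"] == "Female" and p["age"] >= 15 and p["ceb"] is None:
            failures.append((7, ("sex", "age", "fertility")))
    return failures


def tally_by_variable(failures):
    """Count how many failed rules involve each variable."""
    counts = Counter()
    for _, variables in failures:
        counts.update(variables)
    return counts


# Unedited household of Table 1; ages and marital status are assumed.
unedited = [
    {"rel": "Head", "sex": "Male", "age": 40, "marital": "Married", "ceb": 3},
    {"rel": "Spouse", "sex": "Male", "age": 38, "marital": "Married", "ceb": None},
]
tallies = tally_by_variable(failed_edits(unedited))
```

Under these assumptions rules 4 and 6 fail, and sex is involved in both. Changing the single value with the highest tally (the head's sex) clears every failed edit, which is the kind of minimum change the approach aims for.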
Methods of Correcting and Imputing Data
• The process of imputation changes one or more responses or missing values in a record, or in several records, so that the resulting records are internally coherent
• Before using any imputation method, the best strategy is to start with a manual study of the responses or to contact the respondents to resolve some of the problems; imputation can then handle the remaining unresolved edit failures
• Two methods of imputation: Cold Deck and Hot Deck
• Cold Deck Imputation:
   - Used mainly for missing or unknown values (not for inconsistent/invalid values)
   - Values are imputed on a proportional basis from a distribution of valid responses (e.g., from the previous census)
   - The set of valid “donor” responses does not change and is not updated as imputation proceeds; i.e., the original values provide imputations for any missing data
   - In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values
   - Example: suppose the previous census (the cold deck) gives this distribution for males aged 33 employed in agriculture: 25% worked 50 hours/week; 40% worked 60 hours/week; 35% worked 70 hours/week
   - Example (cont’d): in the cold deck method, missing values in the current census for males aged 33 employed in agriculture are imputed according to the above distribution
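The cold deck example above can be sketched as a weighted random draw from the fixed distribution. The `COLD_DECK` structure, the field names, and the use of a seeded random generator are assumptions for illustration; the percentages are the ones given in the example.

```python
import random

# Fixed donor distribution taken from the slide's example:
# (sex, age, industry) -> (hours per week, proportions).
COLD_DECK = {
    ("Male", 33, "Agriculture"): ([50, 60, 70], [0.25, 0.40, 0.35]),
}


def impute_hours(person, rng):
    """Return reported hours, or a proportional draw if hours are missing."""
    if person["hours"] is not None:
        return person["hours"]
    key = (person["sex"], person["age"], person["industry"])
    values, weights = COLD_DECK[key]
    # The cold deck itself is never updated by the current data.
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(2008)  # seeded only to make runs repeatable
person = {"sex": "Male", "age": 33, "industry": "Agriculture", "hours": None}
imputed = impute_hours(person, rng)
```

Over many records with missing hours, the imputed values would approach the 25/40/35 split of the previous census, regardless of what the current census shows.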
Methods of Correcting and Imputing Data
• Hot Deck or Dynamic Imputation:
   - Used for both missing data and inconsistent/invalid items
   - Uses one or more variables to estimate the likely response based on data about individuals with similar characteristics
   - The “donor set” (or imputation matrix) constantly changes through updating; therefore, imputations change dynamically during the process of editing all the records
   - Thus, hot deck draws from a distribution that changes dynamically with each imputation and eventually (through modifications) “approaches” the distribution of the current data set
   - Caution: if two different items in a particular record have unknown values, hot deck may not use the same “donor” to impute both missing values; in this case, it is preferable to use the same donor for both items
Example of Hot Deck for Sample Household (Sex Only)

ID number  Relationship  Sex  Sex (imputed)  Age  Dynamic Imputation Matrix
1          1             1                   39   1
2          2             2                   35   2
3          3             1                   13   1
4          3             9    1              10   1
5          4             2                   40   2
6          4             1                   99*  1
7          4             2                   13   2
8          5             9    2              99*  2
9          5             1                   44   1
10         5             2                   36   2

Missing Information: 9, 99
Relationship: 1=Head; 2=Spouse; 3=Child; 4=Other Relative; 5=Non-Relative
Sex: 1=Male; 2=Female
Example of Hot Deck for Sample Household (Sex and Age)

ID number  Relationship  Sex  Sex (imputed)  Age  Age (imputed)
1          1             1                   39
2          2             2                   35
3          3             1                   13
4          3             9    1              10
5          4             2                   40
6          4             1                   99   40
7          4             2                   13
8          5             9    2              99   37
9          5             1                   44
10         5             2                   36

Missing Information: 9, 99
Relationship: 1=Head; 2=Spouse; 3=Child; 4=Other Relative; 5=Non-Relative
Sex: 1=Male; 2=Female
Example of Hot Deck for Sample Household (Sex and Age), cont’d

Initial Imputation Matrix for Age Based on Sex and Relationship
Sex         Head of Household (1)  Spouse (2)  Son/Daughter (3)  Other Relative (4)  Non-Relative (5)
Male (1)    35                     35          12                40                  40
Female (2)  32                     32          12                37                  37

Dynamic Imputation Matrix After Multiple Changes
Sex         Head of Household (1)  Spouse (2)  Son/Daughter (3)  Other Relative (4)  Non-Relative (5)
Male (1)    39*                    35          13*               40                  44*
Female (2)  32                     35*         12                13*                 36*
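The sample household can be run through a small hot deck routine to see how the imputation matrix evolves. The sketch below is one possible reading of the example, with the following assumptions: a missing sex takes the most recently reported sex, a missing age takes the current matrix cell for the person's sex and relationship, and only records with both sex and age reported update the matrix. The starting value for sex and all names are hypothetical.

```python
MISSING_SEX, MISSING_AGE = 9, 99

# (ID, relationship, sex, age) for the sample household.
HOUSEHOLD = [
    (1, 1, 1, 39), (2, 2, 2, 35), (3, 3, 1, 13), (4, 3, 9, 10),
    (5, 4, 2, 40), (6, 4, 1, 99), (7, 4, 2, 13), (8, 5, 9, 99),
    (9, 5, 1, 44), (10, 5, 2, 36),
]


def initial_age_matrix():
    """Initial ages keyed by (sex, relationship), as in the initial matrix."""
    male = [35, 35, 12, 40, 40]
    female = [32, 32, 12, 37, 37]
    matrix = {}
    for rel in range(1, 6):
        matrix[(1, rel)] = male[rel - 1]
        matrix[(2, rel)] = female[rel - 1]
    return matrix


def hot_deck(records, last_sex=1):
    """Impute missing sex and age; return edited records and final matrix."""
    matrix = initial_age_matrix()
    edited = []
    for pid, rel, sex, age in records:
        sex_reported = sex != MISSING_SEX
        age_reported = age != MISSING_AGE
        if sex_reported:
            last_sex = sex            # this person becomes the sex donor
        else:
            sex = last_sex            # impute from the latest donor
        if age_reported:
            if sex_reported:          # assumption: only fully reported
                matrix[(sex, rel)] = age  # records update the matrix
        else:
            age = matrix[(sex, rel)]  # impute from the current cell
        edited.append((pid, rel, sex, age))
    return edited, matrix


edited, final_matrix = hot_deck(HOUSEHOLD)
```

Under these assumptions, person 6 receives age 40, person 8 receives sex 2 and age 37, and the final matrix has the same values as the dynamic imputation matrix shown after multiple changes.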
Issues Related to Hot Deck
• An attempt should be made to devise dynamic imputation matrices based on people living in the same small geographic area, since they tend to be homogeneous with respect to many characteristics; i.e., different imputation matrices should be created for different geographic areas
• Sometimes the simplest approaches are best: for example, for a missing housing attribute, it may be preferable to use the value of a neighboring household rather than a complex imputation matrix that may result in the assignment of a value from outside the neighborhood
• Before using dynamic imputation, an effort should be made to use related items instead. For example, if marital status is missing for an individual and there exists a spouse for that individual, then the value “married” should be assigned
• One should edit key items such as age and sex first so that these can be used in other imputation matrices for lower priority items
Issues Related to Hot Deck
• Subject-matter and data processing staff should construct imputation matrices based on research from administrative sources or previous censuses and surveys
• Standardized imputation matrices (i.e., matrices with standard dimensions, such as age and sex, e.g., for language) can streamline the process since they can be tested and applied quickly
• BUT if language is missing, first look to the language of others in the same household, or to race, ethnicity or birthplace, before using dynamic imputation; i.e., an attempt should be made to use related information to assign values before resorting to imputation
• Some editing teams keep more than one value per cell in imputation matrices to protect against the same value being imputed multiple times; e.g., in the case of 4 male children in a household, all with ages unknown, different values will be assigned
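Keeping more than one value per cell could be done with a small rotating buffer of recent donors, as in the sketch below. The class name, the buffer size of 4, and the rotation scheme are assumptions for illustration; teams may well organise multi-value cells differently.

```python
from collections import deque


class MultiValueCell:
    """One imputation-matrix cell that stores several recent donor values."""

    def __init__(self, initial, size=4):
        # Keeps at most `size` values; older values drop off on update.
        self.values = deque(initial, maxlen=size)

    def update(self, value):
        """Store a newly reported value as the most recent donor."""
        self.values.append(value)

    def draw(self):
        """Return the next donor value, rotating so draws do not repeat."""
        value = self.values[0]
        self.values.rotate(-1)
        return value


# Cell for male children, seeded with four hypothetical starting ages.
cell = MultiValueCell([12, 9, 6, 3])
ages = [cell.draw() for _ in range(4)]
```

With four stored values, four male children with unknown ages in the same household receive four different ages instead of the same one four times.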
Issues Related to Hot Deck
• Imputation matrices that are too big (with too many dimensions) cannot be updated thoroughly, leading to inefficiencies and inaccuracies
• Imputation matrices that are too small (with too few dimensions or too few groupings within dimensions) may lead to the same donor value being used repeatedly in imputation before the matrix is updated
• Some items, such as occupation and industry, are notoriously difficult to edit since the large number of categories can make dynamic imputation very cumbersome; in such cases, it may be counter-productive to impute and preferable to use “not stated”
Methods of Correcting and Imputing Data: General Principles
• The imputed record should closely resemble the failed edit record; impute for a minimum number of variables
• The imputed record should satisfy all edits
• All imputed values should be flagged, and the methods and sources of imputation should be clearly specified
• Both un-imputed and imputed values should be stored to allow for evaluation of the degree and effects of imputation
Edit Trails and the Use of Imputation Flags
• It is important to generate an edit trail showing all data changes and substituted values, with their tallies
• In terms of tallies, counters of several types are essential to process planning and management: i) number of cases of each type of error; ii) non-response rates for each item; iii) imputation rates for each item; ...
• Imputation flags are binary flags that change from an initial value of 0 to 1 if the original value of the data is changed in any way; a flag should be added to each item that is imputed
• Although a separate file with imputation flags takes up considerable space, this information is critical for planning future censuses; e.g., as a means to investigate the age threshold below which a female with “children ever born” triggers a query edit, and to decide whether the threshold should be modified for future rounds
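A minimal way to combine imputation flags, stored original values, an edit trail and per-item tallies is sketched below. The record layout (`flags`, `original`) and the function names are assumptions for illustration, not a prescribed file format.

```python
from collections import Counter


def impute(record, item, new_value, method, trail, tallies):
    """Change one item, keeping its original value, a flag and a trail entry."""
    old_value = record[item]
    if old_value == new_value:
        return  # nothing changed, so no flag is set
    record.setdefault("original", {})[item] = old_value  # un-imputed value
    record.setdefault("flags", {})[item] = 1             # flag goes 0 -> 1
    record[item] = new_value
    trail.append({"id": record["id"], "item": item, "old": old_value,
                  "new": new_value, "method": method})
    tallies[item] += 1


def imputation_rates(tallies, n_records):
    """Share of records in which each item was imputed."""
    return {item: count / n_records for item, count in tallies.items()}


trail, tallies = [], Counter()
person = {"id": 8, "sex": 9, "age": 99}
impute(person, "sex", 2, "hot deck", trail, tallies)
impute(person, "age", 37, "hot deck", trail, tallies)
rates = imputation_rates(tallies, n_records=10)
```

Because the original values are kept next to the imputed ones, the degree and effect of imputation can be evaluated afterwards from the same file.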
THANK YOU!