Census Data Editing: Structure and Within Record Editing

UNSD-ESCWA Regional Workshop on Census Data Processing in the ESCWA region: Contemporary technologies for data capture, methodology and practice of data editing
Doha, State of Qatar, 18-22 May 2008

Part I: Structure Editing

Summary Part I: Structure Edits
- What are structure edits?
- Geography edits
- Hierarchy of records
- Correspondence between housing and population records
- Editing relationships in a household
- Family nuclei

What are structure edits?
- Structure edits check coverage and relationships between different units: persons, households, housing units, enumeration areas, etc.
- Specifically, they check that:
  - all household and collective quarters records within an enumeration area are present and in the proper order;
  - all occupied housing units have person records, and vacant units have no person records;
  - households have neither duplicate nor missing person records;
  - enumeration areas have neither duplicate nor missing housing records.

Geography edits
- Each enumeration area (EA) must have the right geographic codes (city, province, region...)
- Every housing unit in an EA should be entered, and every record must have a valid EA code.
- The capture process must check this before editing of the data commences.
- If errors remain, it is best to find the right code by returning to the enumeration documents and correcting the record manually.

Hierarchy of records
[Diagram: example hierarchy of records. Three enumeration areas (EA1, EA2, EA3; Type 1) contain housing units (Type 2; categories: dwelling, vacant dwelling) and collective living quarters (Type 3; categories: Hospital, Hotel). Each occupied unit is followed by its person records (Type 4).]

Hierarchy of records
1_EA
  2_Housing unit
    4_Individual
    4_Individual
  2_Housing unit
  3_Collective living quarter
    4_Individual
    4_Individual
1_EA

Hierarchy of records
- Type 1 (EA) is followed by a new Type 1 (if the original EA is empty), or by Type 2 (Housing Unit) or Type 3 (Collective Living Quarter).
  - Particular case of homeless people: create a dummy housing record to make structural checking easier.
- Type 2 (Housing Unit) is followed by Type 1, 2 or 3 (if the original dwelling is vacant) or by Type 4 (if the original dwelling is occupied).
- Type 3 (Collective Living Quarter) is followed by Type 4 (Individual).
  - If not occupied, is an empty CLQ allowed?
- Type 4 (Individual) is followed by Type 4 (another individual in the same dwelling or collective living quarter), by Type 2 or 3 (another dwelling or CLQ), or by Type 1 (new EA).

Correspondence between housing and population records
- An occupied unit should have at least one person and a vacant unit should have no people: if a Type 2 (Housing Unit) record with category "vacant" is followed by a Type 4 (Individual) record, change the category to occupied.
- The number of occupants recorded on the Housing Unit form should be exactly the same as the number of individual records in the household. If not, change the number on the Housing Unit form.
- Population records should be sequenced (numbered).
- If a Type 3 (CLQ) record with category "Hospital" is followed by multiple Type 4 (Individual) records of category "Retirement home", change the category of the CLQ to "Retirement home".

Editing relationships in a household
- Each individual has a relation to the first person:
  - 1st person (or Head, or reference person)
  - Spouse
  - Child of the 1st person or of his/her spouse
  - Parent
  - Other relative
  - Friend
  - Lodger
  - ...
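The record-sequence rules under "Hierarchy of records" and the vacant/occupied rule under "Correspondence between housing and population records" can be sketched in a few lines of Python. This is only an illustration, not the workshop's implementation: the `(record_type, category)` tuple layout, the category string `"vacant"` and the function name `check_structure` are assumptions made for the example. An empty collective living quarter is flagged here, although the slides leave that question open.

```python
# Hypothetical record layout: (record_type, category).
# Types follow the slides: 1 = EA, 2 = housing unit,
# 3 = collective living quarter (CLQ), 4 = individual.
ALLOWED_NEXT = {
    1: {1, 2, 3},     # an empty EA may be followed directly by a new EA
    2: {1, 2, 3, 4},  # which of these is valid depends on the category
    3: {4},           # assumption: an empty CLQ is treated as an error
    4: {1, 2, 3, 4},
}


def check_structure(records):
    """Return a list of (index, message) for each sequence problem found."""
    problems = []
    for i, (rtype, category) in enumerate(records):
        # Treat the end of the file like the start of a new EA (type 1).
        next_type = records[i + 1][0] if i + 1 < len(records) else 1
        if next_type not in ALLOWED_NEXT[rtype]:
            problems.append((i, f"type {rtype} followed by type {next_type}"))
        elif rtype == 2 and category == "vacant" and next_type == 4:
            problems.append((i, "vacant housing unit followed by a person record"))
        elif rtype == 2 and category != "vacant" and next_type != 4:
            problems.append((i, "occupied housing unit has no person records"))
    return problems


example = [
    (1, None), (2, "dwelling"), (4, None), (4, None),
    (2, "vacant"), (3, "hospital"), (4, None), (1, None),
]
print(check_structure(example))
```

A real edit would go on to correct the record (for example, by changing the category of the housing unit to occupied) rather than only report it.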
Editing relationships in a household
- Household with potential inconsistencies in age reporting

Family nuclei
- Father: sex should be male and age should be > minimum age
- Mother: sex should be female and age should be > minimum age
- Child: age under a maximum limit?

Part II: Within Record Editing

Summary Part II: Within Record Edits
- Validity and Consistency Checks
- Top-down Editing versus Multiple-variable Editing
- Example of Multiple-Variable Editing
- Methods of Correcting and Imputing Data
- Example of Hot Deck for Sample Household (Sex Only)
- Example of Hot Deck for Sample Household (Sex and Age)
- Issues Related to Hot Deck
- Methods of Correcting and Imputing Data: General Principles
- Edit Trails and the Use of Imputation Flags

Validity and Consistency Checks
- Validity checks are performed to see whether the values of individual variables are plausible or lie within a reasonable range.
  - Examples: 0<=AGE<=110; SEX=Female or SEX=Male
- Consistency checks are performed to ensure that there is coherence between two or more variables.
  - Examples:
    - Head of household should have AGE>=15
    - A child should be younger than the head of household
    - A person with AGE<15 should have marital status "never married"

Top-down Editing versus Multiple-Variable Editing
- The top-down editing approach starts by editing the top-priority variable (not necessarily the first variable on the questionnaire) and moves sequentially through all items in decreasing priority.
- During the editing process, some edits change the value of an item more than once; this can introduce one or more errors into the dataset.
- Example: a child's age is first imputed on the basis of the mother's age. Later, the child's age is re-imputed on the basis of reported years of schooling, which might be inconsistent with the mother's age.
- In this case, the child's age should keep being re-imputed until it is consistent.
- Important to avoid circular editing!
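The examples listed under "Validity and Consistency Checks" translate directly into small test functions. The sketch below is illustrative only: the dictionary keys (`age`, `sex`, `relationship`, `marital_status`), the category labels and the function names are assumptions, while the thresholds (0 to 110, 15 years) are the ones given on the slide.

```python
def validity_errors(person):
    """Single-variable checks: is each value within its allowed range?"""
    errors = []
    if not 0 <= person["age"] <= 110:
        errors.append("AGE out of range")
    if person["sex"] not in ("Male", "Female"):
        errors.append("SEX invalid")
    return errors


def consistency_errors(person, head_age):
    """Checks that involve two or more variables."""
    errors = []
    if person["relationship"] == "Head" and person["age"] < 15:
        errors.append("head of household under 15")
    if person["relationship"] == "Child" and person["age"] >= head_age:
        errors.append("child not younger than head of household")
    if person["age"] < 15 and person["marital_status"] != "Never married":
        errors.append("person under 15 not never married")
    return errors


child = {"age": 12, "sex": "Male", "relationship": "Child",
         "marital_status": "Married"}
print(validity_errors(child))
print(consistency_errors(child, head_age=40))
```

Keeping the two kinds of check separate mirrors the distinction on the slide: a record can pass every validity check and still fail a consistency check, as the sample child record does.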
Top-down Editing versus Multiple-Variable Editing
- The multiple-variable editing approach uses a set of rules that state the relationships between variables.
- Each statement is tested against the data to see if it is true.
- The edit system keeps track of all false statements relating to invalid entries or inconsistencies.
- An assessment is then made of how to change the record so that it will pass all edits, and then a decision is made.
- The Fellegi-Holt principle of "minimum change" should be used.

Example of Multiple-Variable Editing
TABLE 1: Head of household and spouse have same sex

Unedited data
Person  Relationship       Sex     Children ever born
1       Head of household  Male    3
2       Spouse             Male    BLANK

Data after editing for sex
Person  Relationship       Sex     Children ever born
1       Head of household  Female  3
2       Spouse             Male    BLANK

Example of Multiple-Variable Editing
TABLE 2: Head of household and spouse have same sex

No.  Rule
1    Head of household should be 15 years or older
2    Spouse should be 15 years or older
3    A spouse should be married
4    If spouse present, head of household and spouse should be of opposite sex
5    Person less than 15 years old should be never married
6    Male should have no fertility
7    For a female 15 years or older, the fertility entry should not be blank

[Tally of failed rules by variable (columns: Relationship, Sex, Age, Marital status, Fertility) with a Totals row; cell positions not recoverable.]

Methods of Correcting and Imputing Data
- The process of imputation changes one or more responses or missing values in a record, or in several records, to ensure that internally coherent records result.
- Before using any imputation method, the best strategy is to start with a manual study of the responses or to contact the respondents to resolve some of the problems; imputation can then handle the remaining unresolved edit failures.
- Two methods of imputation: Cold Deck and Hot Deck
- Cold Deck Imputation:
  - Used mainly for missing or unknown values (not for inconsistent/invalid values)
  - Values are imputed on a proportional basis from a distribution of valid responses (e.g., from a previous census)
  - The set of valid "donor" responses does not change and is not updated as imputation proceeds; i.e., the original values provide imputations for any missing data
  - In doing so, cold deck draws values from a fixed (but possibly outdated) distribution of values
  - Example: Suppose the previous census (the cold deck) gives the distribution of males aged 33 employed in agriculture: 25% worked 50 hours/week; 40% worked 60 hours/week; 35% worked 70 hours/week
  - Example (cont'd): In the cold deck method, missing values in the current census for males aged 33 employed in agriculture are imputed according to the above distribution

Methods of Correcting and Imputing Data
- Hot Deck or Dynamic Imputation:
  - Used for both missing data and inconsistent/invalid items
  - Uses one or more variables to estimate the likely response based on data about individuals with similar characteristics
  - The "donor set" (or imputation matrix) constantly changes through updating; therefore, imputations dynamically change during the process of editing all the records
  - Thus, hot deck draws from a distribution that dynamically changes with each imputation and eventually (through modifications) "approaches" the distribution of the current data set
  - Caution: if two different items for a particular record have unknown values, hot deck may not use the same "donor" to impute both missing values; in this case, it is preferable to use the same donor for both items

Example of Hot Deck for Sample Household (Sex Only)

ID number  Relationship  Sex  Age  Dynamic Imputation Matrix
1          1             1    39   1
2          2             2    35   2
3          3             1    13   1
4          3             9    10   1
5          4             2    40   2
6          4             1    99*  1
7          4             2    13   2
8          5             9    99*  2
9          5             1    44   1
10         5             2    36   2

Missing Information: 9, 99
Relationship: 1=Head; 2=Spouse; 3=Child; 4=Other Relative; 5=Non-Relative
Sex: 1=Male; 2=Female

Example of Hot Deck for Sample Household (Sex and Age)

ID number  Relationship  Sex     Age
1          1             1       39
2          2             2       35
3          3             1       13
4          3             9 -> 1  10
5          4             2       40
6          4             1       99 -> 40
7          4             2       13
8          5             9 -> 2  99 -> 37
9          5             1       44
10         5             2       36

(reported value -> imputed value)
Missing Information: 9, 99
Relationship: 1=Head; 2=Spouse; 3=Child; 4=Other Relative; 5=Non-Relative
Sex: 1=Male; 2=Female

Example of Hot Deck for Sample Household (Sex and Age)-cont'd

Initial Imputation Matrix For Age Based on Sex and Relationship
Relationship            Male (1)  Female (2)
Head of Household (1)   35        32
Spouse (2)              35        32
Son/Daughter (3)        12        12
Other Relative (4)      40        37
Non-Relative (5)        40        37

Dynamic Imputation Matrix After Multiple Changes
Relationship            Male (1)  Female (2)
Head of Household (1)   39*       32
Spouse (2)              35        35*
Son/Daughter (3)        13*       12
Other Relative (4)      40        13*
Non-Relative (5)        44*       36*

Issues Related to Hot Deck
- An attempt should be made to devise dynamic imputation matrices based on people living in the same small geographic area, since they tend to be homogeneous with respect to many characteristics; i.e., different imputation matrices should be created for different geographic areas.
- Sometimes the simplest approaches are best: for example, for a missing housing attribute, it may be preferable to use the value of a neighboring household rather than a complex imputation matrix that may result in the assignment of a value from outside the neighborhood.
- Before using dynamic imputation, an effort should be made to use related items instead.
  For example, if marital status is missing for an individual and there exists a spouse for that individual, then the value "married" should be assigned.
- One should edit key items such as age and sex first so that these can be used in other imputation matrices for lower priority items.

Issues Related to Hot Deck
- Subject-matter and data processing staff should construct imputation matrices based on research from administrative sources or previous censuses and surveys.
- Standardized imputation matrices (i.e., matrices with standard dimensions such as age and sex, used, e.g., for language) can streamline the process since they can be tested and applied quickly.
- BUT if language is missing, first look to the language of others in the same household, or to race, ethnicity or birthplace, before using dynamic imputation; i.e., an attempt should be made to use related information to assign values before resorting to imputation.
- Some editing teams keep more than one value per cell in imputation matrices to protect against the same value being imputed multiple times; e.g., in the case of 4 male children in a household, all with ages unknown, different values will be assigned.

Issues Related to Hot Deck
- Imputation matrices that are too big (with too many dimensions) cannot be updated thoroughly, leading to inefficiencies and inaccuracies.
- Imputation matrices that are too small (with too few dimensions or too few groupings within dimensions) may lead to the same donor value being used repeatedly in imputation before the matrix is updated.
- Some items such as occupation and industry are notoriously difficult to edit since the large number of categories can make dynamic imputation very cumbersome; in such cases, it may be counter-productive to impute and preferable to use "not stated".

Methods of Correcting and Imputing Data: General Principles
- The imputed record should closely resemble the failed edit record; impute for a minimum number of variables.
- The imputed record should satisfy all edits.
- All imputed values should be flagged, and the methods and sources of imputation should be clearly specified.
- Both un-imputed and imputed values should be stored to allow for evaluation of the degree and effects of imputation.

Edit Trails and the Use of Imputation Flags
- It is important to generate an edit trail showing all data changes and substituted values with their tallies.
- In terms of tallies, counters of several types are essential to process planning and management: i) number of cases of each type of error; ii) non-response rates for each item; iii) imputation rates for each item, …
- Imputation flags are binary flags that change from an initial value of 0 to 1 if the original value of the data is changed in any way; flags should be added to each item that is imputed.
- Although a separate file with imputation flags takes up considerable space, this information is critical for the planning of future censuses; e.g., as a means to investigate the age threshold below which a female with "children ever born" triggers a query edit, and to decide whether the threshold should be modified for future rounds.

THANK YOU!
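As a worked illustration of the hot deck example for the sample household (sex and age) and of imputation flags, the sketch below replays the ten records against the initial age matrix from the slides. It is a minimal sketch under stated assumptions, not the workshop's procedure: the function name `hot_deck`, the starting sex donor `last_sex=1`, and the rule that only records with a reported sex and age update the age matrix are all assumptions. That update rule was chosen because it reproduces the "Dynamic Imputation Matrix After Multiple Changes" shown in the slides.

```python
MISSING_SEX, MISSING_AGE = 9, 99

# Sample household from the slides: (id, relationship, sex, age).
household = [
    (1, 1, 1, 39), (2, 2, 2, 35), (3, 3, 1, 13), (4, 3, 9, 10),
    (5, 4, 2, 40), (6, 4, 1, 99), (7, 4, 2, 13), (8, 5, 9, 99),
    (9, 5, 1, 44), (10, 5, 2, 36),
]

# Initial age matrix from the slides, keyed by (sex, relationship).
initial_age = {
    (1, 1): 35, (1, 2): 35, (1, 3): 12, (1, 4): 40, (1, 5): 40,
    (2, 1): 32, (2, 2): 32, (2, 3): 12, (2, 4): 37, (2, 5): 37,
}


def hot_deck(records, age_matrix, last_sex=1):
    """Impute missing sex and age; return edited records, flags, final matrix.

    last_sex is a hypothetical starting donor for sex (not given in the slides).
    """
    age_matrix = dict(age_matrix)  # work on a copy; the input stays unchanged
    edited, flags = [], []
    for rec_id, rel, sex, age in records:
        sex_flag = age_flag = 0
        if sex == MISSING_SEX:
            sex, sex_flag = last_sex, 1      # take the sex of the last donor
        else:
            last_sex = sex                   # this record becomes the donor
        if age == MISSING_AGE:
            age, age_flag = age_matrix[(sex, rel)], 1
        elif not sex_flag:
            # Assumption: only fully reported records update the age matrix.
            age_matrix[(sex, rel)] = age
        edited.append((rec_id, rel, sex, age))
        flags.append((rec_id, sex_flag, age_flag))  # 0 = unchanged, 1 = imputed
    return edited, flags, age_matrix


edited, flags, final_age = hot_deck(household, initial_age)
sex_imputations = sum(f[1] for f in flags)
age_imputations = sum(f[2] for f in flags)
print(edited[3], edited[5], edited[7])
print(sex_imputations, age_imputations)
```

The original `household` list is left untouched, so both un-imputed and imputed values remain available, and the per-item flags provide the imputation tallies described under "Edit Trails and the Use of Imputation Flags".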