Data Editing: Introduction

UNSD Regional Workshop on Census Data Processing for the English speaking African Countries: Contemporary technologies for data capture, methodology and practice of data editing
Dar es Salaam, Tanzania, 9-13 June 2008

Objectives of Session

- Editing is the procedure for detecting and correcting errors in data.
- Imputation is the procedure for assigning values to missing or inconsistent data.
- The objective of the session is to present an overview of the concepts and definitions, and to discuss their application and related issues.

Summary

- Types of Errors in the census process
- Objectives of Editing: Why do it?
- How to and Why Edit? Some illustrative examples
- Principles of Editing: How to do it
- Fatal versus Query Edits
- Micro-editing versus Macro-editing
- Manual versus automatic editing
- Impact of capture mode on editing
- Pitfalls of Over-editing
- Other considerations

Types of Errors in the Census Process

Coverage Errors:
- Incomplete or inaccurate maps or enumeration areas (EAs)
- Incomplete canvassing of all units
- Duplicate counting
- Omission of persons unwilling to be enumerated
- Erroneous treatment of visitors or non-resident aliens (especially in relation to de jure versus de facto methods)
- Loss or destruction of census records after enumeration
- ……

Content Errors:
- Errors in questionnaire design
- Enumerator errors
- Respondent errors
- Coding errors
- Data entry errors
- Errors in computer editing
- Errors in tabulation

Two types of errors at the processing stage:
- Those that block further processing
- Those that produce invalid or inconsistent results without interrupting the logical flow of subsequent processing operations

ALL errors of the first kind must be corrected, and as many of the second kind as possible.

Objectives of Editing: Why do it?

Objectives of editing (Granquist, 1984):
- “Tidy up” the data so as to facilitate analysis (creation of a complete file)
- Identify types and sources of errors (for reporting on data quality)
- Improve the quality of census data (for current and future censuses)

It is important not only to detect errors but also to identify their causes, in order to take appropriate corrective measures and improve overall quality.

How to Edit?
TABLE 1: 2010 Population by Age and Sex, Unedited and Edited

                          Unedited data                              Edited data
Age group                 Total    Male   Female   Sex not reported   Total    Male   Female
Total                     4,147   2,033    2,091                 23   4,147   2,043    2,104
Less than 15 years        1,639     799      825                 15   1,646     809      837
15 to 29 years            1,256     612      643                  1   1,260     614      646
30 to 44 years              727     356      369                  2     729     358      371
45 to 59 years              360     194      166                  0     362     195      167
60 to 74 years              116      54       59                  3     116      55       61
75 years and over            34      12       22                  0      34      12       22
Age not reported             15       6        7                  2

How to deal with data “not reported”?

- Distribute the age unknowns and the sex unknowns in the same proportion as the corresponding known values.
- For example, for the 23 sex unknowns: (2,033/4,147) × 23 ≈ 11.3, taken here as 12 males (with the remaining 11 going to females by subtraction); see the right-hand side of Table 1.
- Similarly, distribute the 15 age unknowns across the 6 age groups in proportion to the known values; see the right-hand side of Table 1.
- This method could give biased results if the number of unknowns (non-responses) is high, since the distributions of knowns and unknowns may be very different.
- An improved strategy would be to use multivariate distributions involving other variables, such as the relationship between spouses, a positive entry for the number of children born, etc.

Why Edit?
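As a hedged illustration of the proportional method described for Table 1, the sketch below allocates a number of unknowns across categories in proportion to their known counts. The function name `allocate_unknowns` and the largest-remainder rounding rule are assumptions made for this example, not part of the workshop material; the counts are the unedited figures from Table 1. Because the sketch divides by the total of known cases and rounds by largest remainder, its whole-number split can differ by one from a hand calculation that rounds differently.

```python
def allocate_unknowns(known, n_unknown):
    """Split n_unknown cases across categories in proportion to known counts.

    Uses integer arithmetic and largest-remainder rounding (an assumption of
    this sketch) so that the allocated counts always sum to n_unknown.
    """
    total_known = sum(known.values())
    if total_known <= 0:
        raise ValueError("need at least one known case to allocate against")

    # Whole-number part of each proportional share, plus the fraction that was
    # rounded away, kept as an integer remainder to avoid floating-point ties.
    quotas = {k: (v * n_unknown) // total_known for k, v in known.items()}
    remainders = {k: (v * n_unknown) % total_known for k, v in known.items()}

    # Give the cases lost to rounding down to the largest remainders first.
    leftover = n_unknown - sum(quotas.values())
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[k] += 1
    return quotas


# Unedited counts from Table 1.
sex_known = {"male": 2033, "female": 2091}
age_known = {
    "under 15": 1639, "15-29": 1256, "30-44": 727,
    "45-59": 360, "60-74": 116, "75+": 34,
}

sex_added = allocate_unknowns(sex_known, 23)  # 23 persons with sex not reported
age_added = allocate_unknowns(age_known, 15)  # 15 persons with age not reported
print(sex_added)
print(age_added)
```

The allocated counts would then be added to the known counts of each category. This univariate split carries the bias risk noted above when unknowns are numerous.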
TABLE 2: Population by Age with Unknowns for 2000 and 2010

                          Numbers          Percent
Age group                  2010    2000     2010    2000
Total                     4,147   3,319      100     100
Less than 15 years        1,639   1,348     39.5    40.6
15 to 29 years            1,256     902     30.3    27.2
30 to 44 years              727     538     17.5    16.2
45 to 59 years              360     200      8.7     6.0
60 to 74 years              116      89      2.8     2.7
75 years and over            34      25      0.8     0.8
Age not reported             15     217      0.4     6.5

TABLES 2 and 3: Population by Age with and without Unknowns for 2000 and 2010

- Another problem is that unknowns may affect the analysis of trends
- In Table 2, if unknowns not taken into account, percentage of persons aged 15-29 years appears to increase from 27.2% in 2000 to 30.3% in 2010
- Redistributing unknowns may change this trend
- In Table 3, after distributing unknowns, there is only an increase from 28.7% in 2000 to 29.3% in 2010

Why Edit?
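The trend comparison between Tables 2 and 3 can be reproduced with a few lines of arithmetic. The sketch below is an illustration only: the helper name `share` and the dictionary layout are assumptions, while the counts are those reported in Tables 2 and 3 for persons aged 15 to 29 years.

```python
def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


totals = {2000: 3319, 2010: 4147}

# Persons aged 15 to 29 years: Table 2 leaves age unknowns as a separate row,
# Table 3 shows the counts after the unknowns have been distributed.
with_unknowns = {2000: 902, 2010: 1256}
after_distribution = {2000: 952, 2010: 1217}

for label, counts in [("Table 2", with_unknowns), ("Table 3", after_distribution)]:
    pct = {year: share(counts[year], totals[year]) for year in (2000, 2010)}
    change = pct[2010] - pct[2000]
    print(f"{label}: {pct[2000]:.1f}% -> {pct[2010]:.1f}% ({change:+.1f} points)")
```

With these counts, the apparent rise of roughly 3 percentage points in Table 2 shrinks to under 1 point once the unknowns are distributed, mainly because 6.5% of the 2000 population had no reported age.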
TABLE 3: Population by Age without Unknowns for 2000 and 2010

                          Numbers          Percent
Age group                  2010    2000     2010    2000
Total                     4,147   3,319      100     100
Less than 15 years        1,743   1,408     42.0    42.4
15 to 29 years            1,217     952     29.3    28.7
30 to 44 years              695     578     16.8    17.4
45 to 59 years              341     230      8.2     6.9
60 to 74 years              114     109      2.7     3.3
75 years and over            37      42      0.9     1.3

Principles of Editing: How to do it

In general, the editing system should be:
- Minimalist (change only obvious errors, and as few as possible)
- Automated (as much as possible, for both detection and correction)
- Systematic
- Consistent with other NSO statistical collections
- Compliant with UN or other international standards

Fatal versus Query Edits

Types of edits:
- Fatal edits identify errors with certainty
- Query edits identify suspected errors

- Fatal edits identify fatal errors, which include invalid or missing entries as well as errors due to inconsistencies.
- Query edits identify data items that fall outside subjective data bounds, or items that are relatively high or low compared with other data on the same questionnaire.
- Fatal edits must be resolved. Query edits are more difficult to correct, bring fewer benefits than the detection and resolution of fatal edits, and add more to the cost of the process.
- For query edits, subject-matter specialists should investigate the edits developed for pilot censuses and those developed during processing, to make sure that each individual edit has the expected cost of census evaluation (e.g., look at hit rates, or the share of flags that result in changes to the original data).

Micro-editing versus Macro-editing

- Micro-editing concerns ways to ensure the validity and consistency of individual data records and of relationships between records in a household.
- Macro-editing checks aggregated data to make sure that they are reasonable.
- Example: if census results show a large percentage of persons without a reported age, imputing age (at the micro level) will produce a complete data set. BUT it is far more essential to make checks at the macro (aggregate) level to ensure that the imputation does not skew the overall age distribution.

Impact of Capture Mode on Editing

- Types of capture modes typically used: manual (key-entry), OMR (optical mark recognition), OCR/ICR (optical/intelligent character recognition), PDA, Internet
- For key-entry, PDA and Internet capture, some limited detection and correction of errors can be done in “real time”.
- This is not possible for OMR or OCR/ICR, where paper questionnaires are scanned; editing is limited to “batch editing” after the fact.

Manual versus Automated Editing

- Manual edits may be done in several places along the editing chain: by the enumerator, supervisor, field office worker, coder, key entry clerk, etc.
- The disadvantage is that manual editing consumes an enormous amount of time (months or years), energy (human resources) and money.
- If the data set is small, timing is not crucial and a work force is available, then manual editing may be feasible.
- Automated editing reduces the time required, decreases the introduction of human error, and allows for the creation of an edit trail (and is therefore reproducible).
- Unlike manual editing, automated editing makes it feasible and efficient to impute responses based on other information in the questionnaire or on reported information for a unit with similar characteristics.

Pitfalls of Over-editing

- Reduced timeliness
- Increased costs
- Potential distortion of true values
- False sense of security

Other Considerations

Determination of tolerance levels for error detection:
- For most items in a census, some small percentage of respondents will not give “acceptable” responses, for whatever reason.
- Not every failure is pervasive, so not every failure is worthy of remedial action (see Pitfalls of Over-editing).
- Tolerance levels indicate the number of invalid and inconsistent responses allowed before editing teams take remedial action.
- They are decided by the editing team, which includes both subject-matter and data processing specialists.
- For key items such as age and sex, tolerance levels are typically low (1%-2%); for less key items such as literacy and disability, they are typically higher (5%-10%).
- Correction may occur by returning enumerators to the field, conducting telephone re-interviews or applying specific knowledge of an area.

Learning from the editing process / quality assurance systems:
- Positive and negative feedback loops need to be recorded to improve the quality of both the current census and future censuses and surveys.
- Audit trails, performance measures and diagnostic statistics are crucial.
- This is often the most important outcome of editing.

Cost of editing:
- The cost of editing has not decreased much in the last 20 years, although the process has been rationalized by continuous exploitation of technological developments.
- In general, editing activities take a disproportionate amount of time (and therefore staff costs) relative to other activities.
- Excessive editing can delay census results.

Archiving:
- Both edited and unedited data files should be preserved for later analysis, with copies kept in several places.
- Documentation should be complete enough for census planners to be able to reconstruct the same processes at a later date.
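The tolerance levels described under Other Considerations can be expressed as a simple check of each item's failure rate against an agreed threshold. The sketch below is illustrative only: the function name `items_needing_action`, the example failure counts and the specific thresholds (the upper ends of the 1%-2% and 5%-10% ranges quoted in the text) are assumptions chosen for the example, not values prescribed by the workshop.

```python
# Hypothetical tolerance levels: the percentage of responses that may be
# invalid or inconsistent before remedial action is taken. These values are
# assumptions taken from the upper end of the ranges quoted in the text.
TOLERANCE_PCT = {"age": 2.0, "sex": 2.0, "literacy": 10.0, "disability": 10.0}


def items_needing_action(failures, n_records, tolerance_pct=TOLERANCE_PCT):
    """Return {item: failure rate in %} for items above their tolerance level."""
    if n_records <= 0:
        raise ValueError("n_records must be positive")
    flagged = {}
    for item, n_failed in failures.items():
        rate = 100.0 * n_failed / n_records
        if rate > tolerance_pct[item]:
            flagged[item] = round(rate, 2)
    return flagged


# Made-up failure counts for a batch of 5,000 records.
example_failures = {"age": 140, "sex": 35, "literacy": 300, "disability": 620}
print(items_needing_action(example_failures, 5000))
```

In practice the thresholds would be set by the editing team, and items flagged this way would be candidates for the remedial actions listed above rather than for automatic correction.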