Data Editing, Coding, and Just a Little Imputation Katherine (Jenny) Thompson Office of Statistical Methods and Research for Economic Programs Katherine.J.Thompson@census.gov (301) 763-4941 1 The Basics: What is Editing? Editing (procedures) review reported/keyed data for errors and pinpoints “inconsistent” values • For “industry” • For respondent Editing does not change the data. Items that fail edits are • referred to an analyst; or • automatically imputed (replaced with consistent values) 2 The Basics: What Is Imputation? Imputation is the replacement of a missing or incorrectly reported item using logical edits or statistical procedures. In other words, Imputation replaces a missing or incorrect data item with an “educated guess.” 3 The Basics: What is Coding? Coding is the assignment of recognizable values to flags that describe key characteristics of the unit or item, such as • • • • Industry (unit level) Response status (unit or item level) Source of data correction (item level) Imputation model (item level) 4 We Begin With Coding Before we can evaluate whether a response is reasonable, we have to know where it comes from: • Classification variable(s) value, e.g., industry, state • Frame information may be erroneous or • unit may have changed classification value Each unit must be assigned classification code(s) before editing/imputation 5 We End With Coding At the end of the processing cycle, we want to know • • • • How the data were changed, Where the data were changed, Why (if possible) data were changed, and The final status of the reporting unit (respondent, non-respondent). 6 Some Edit Definitions Editing: Procedures for detecting “incorrect” keyed or respondent data. Micro-Editing: Editing at the individual record (questionnaire) level Macro-Editing: Editing at the tabulated value level 7 “Typical” Editing Processing Flow • Micro-editing (static) • • • • • Outlier detection (dynamic) • • • • Performed on a flow basis Predetermined edit tests and edit parameters (historic data) Administered by machine Resolved by machine and human Performed after “close-out” Administered by machine Often resolved by human Macro-editing (dynamic) • See above 8 Micro-Edits are Either: Fatal Must be resolved before subsequent editing • Unit is Out-of-Scope for Survey • Unit is missing classification variable value • Required data item not reported Query Can be corrected “automatically” • Detail items do not add to reported total • Ratio of two items is outside (userdetermined) limits 9 Where Do Micro-Edits Come From? • Questionnaire • Reality • Subject-Matter Expert Rules • (Enforced) Statistical Relationships 10 Fictional Sample Questionnaire Instructions: Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Millions Thousands Item 1. ANNUAL PAYROLL Item 2. 1ST QUARTER PAYROLL Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Item 4. TOTAL HOURS WORKED Item 5. EMPLOYMENT 11 Edit Sources: Questionnaire Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Balance Edit Item 3.a. Value + Item 3.b. Value = Item 3.c. Value Things have to add up! 12 Edits Sources: Questionnaire/Reality Millions Thousands Item 1. ANNUAL PAYROLL Item 2. 1ST QUARTER PAYROLL Ratio Edit ANNUAL PAYROLL/1ST QUARTER PAYROLL 1 Can’t spend more on payroll in one quarter than for the entire year! 13 Edit Sources: Questionnaire/Reality Millions Thousands Item 4. TOTAL HOURS WORKED Item 5. EMPLOYMENT Ratio Edit 0.96 < TOTAL HOURS WORKED/EMPLOYMENT < 8.76 ( 20 hours week ) ( 48 weeks year ) 1000 (reporting unit) ( 24 hours day) (365 days year ) 1000 (reporting unit) 0.96 8.76 14 Edits Sources: Questionnaire/Reality Item 5. EMPLOYMENT Range Edit 0 EMPLOYMENT 5,615,727 A unit can’t have more employees than the population of the resident state (or negatively-value employees!) 15 Edit Sources: Subject-Matter Rules Millions Thousands Item 1. ANNUAL PAYROLL Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Ratio Edit TOTAL SALES/ANNUAL PAYROLL > 1 “Full-year reporters should operate at a profit!” 16 Edit Sources: Statistical Relationships Millions Thousands Item 1. ANNUAL PAYROLL Item 5. EMPLOYMENT Ratio Edit A ANNUAL PAYROLL/EMPLOYMENT B Wage per employee should be within the (industry) range. 17 Examples of Fatal Micro-edits • Classification Edits • Required Data Item Tests 18 Examples of Query Micro-edits • • • • List Directed (Verification) Edits Skip Pattern Validation Edits Range Edits (Including negative tests) Ratio Edits • • • • Within same questionnaire Current to prior period Balance Edits Subject-matter rules 19 List Directed/Verification Edits Purpose: of To compare the reported value of a data field to a pre-determined list legal values. • Machine edits, but highly dependent on dataquality of list • Human (manual) correction of edit failures 20 Skip Pattern Validation Edits Purpose: To verify that values of skip items are consistent with the skip instructions provided on the questionnaire. Machine edits that CAN be resolved by machineimputation • Subject-matter rules (if..then..logic) • Operations Research approach 21 Range Edits Purpose: To check the reported value of a data item to see if it is within specified minimum and maximum values. Form of edit: lower bound data item upper bound • Upper and lower bounds are tolerances. • If data item is not contained within the bounds, then it fails the range edit (“out of tolerance”). • Negative tests are a special case of range edits. • Can be used to define an imputation region. 22 Range Edits Examples: 0 Employment 301,064,982 (2006 U.S. Population) 0 Sales 12,455.8 billion (2005 Gross Domestic Product) 0 Percent of work done in category 100% 23 Ratio Edits Purpose: To compare two “related” items in a questionnaire to see if reported values are consistent. Form of Ratio Edit: X1 L U X2 • Upper and lower bounds are known as tolerances. • Tolerances generally developed from prior period data. • If ratio is not contained within the bounds, then it fails the ratio edit (“out of tolerance”). 24 Some Reasons for Ratio Editing One data item is a function of another. Annual Payroll = 1st Quarter Payroll + Payroll for Remaining 3 Quarters Ratio Edit: 1 ANNUAL PAYROLL 4.4 1ST QUARTER PAYROLL 25 Some Reasons for Ratio Editing One data item can only be evaluated in comparison with another item 0.96 TOTAL HOURS WORKED 8.76 NUMBER OF EMPLOYEES ( 20 hours week ) ( 48 weeks year ) 1000 (reporting unit) ( 24 hours day) (365 days year ) 1000 (reporting unit) 0.96 8.76 (reasonable lower bound) (reasonable upper bound) 26 Some Reasons for Ratio Editing One data item is a good predictor of another. Annual Payroll = factor Total Employment 6000 Annual Payroll 5000 4000 3000 2000 1000 0 0 20 40 60 80 100 120 Total Employment 27 Plot of a Typical Ratio Edit 8000 Annual Payroll 7000 6000 5000 4000 3000 2000 1000 0 0 20 40 60 80 100 120 Total Employment Census Data Lower Tolerance Upper Tolerance 28 Advantages of Ratio Edits • Useful for detecting systematic and random errors • Reasonable comparisons for quantitative data • Verifiable assumptions • Often insensitive to changes in economy when both items are in the same units • Imply certain imputation models • Can be solved simultaneously • imputation region implications 29 Disadvantages of Ratio Edits • Edit failure identifies a pair of potentially incorrect data fields • Need to have a “tie-breaker” • Often work best when combined with other edits (can be ratio edits) • Very dependent on the distribution of ratios • Highly correlated • Goes through origin 30 “Best” Practices for Ratio Edits • Incorporate unit size categories as well as classification variables in editing cells • Perform preliminary data analysis to determine validity of edit model • Incorporate tests to prior data from same unit and item when reasonable • Use non-parametric outlier-resistant methods for setting ratio edit tolerances • Audit edits • An edit test that has a high rate of failure could indicate problems with the tolerances or the test itself 31 Periodic Data and Ratio Edits (Caution) Current Month's Number of Employees 80 70 60 50 40 30 20 10 0 0 5 10 15 20 25 30 35 Prior Month's Number of Employees 32 Brief Digression on Imputation Situation: Missing item or item marked for imputation (replacement) due to edit failure(s) We would like the machine to automatically replace the “inconsistent” item with a consistent value. 33 The ideal “imputations” find replacement values that are still considered reported (from the same respondent) Examples • divide reported data by correct reporting unit • replace reported total with sum of details 34 Link Between Imputation and Program • Published tabulations (macro-data) • Ratio imputation models • Regression imputation models • Published micro-data • Hot deck imputation 35 Commonly-Used Imputation Methods (Economic Data) Rounding/Data Slides (systematic error) • Respondent data divided by unit conversion factor (e.g., imputed value = reported value/1,000) Direct Substitution • Another data item (same questionnaire) • Absolute value of reported/keyed item • Sum of Reported Details (logical edit) • Derived value from other reported/keyed item • Previously reported value (historic) from same respondent • Administrative data value (same respondent) 36 Ratio Imputation (Model Imputation) imputed item = (factor) (another data field) • Same reporting unit/questionnaire • Edit-passing item Industry (Category) Average Ratio (use average ratio of two items in industry/category) e.g., factor = industry wage/employee ratio Historic Imputation (Auxiliary Trend) (use ratio of prior data to current data for same respondent) e.g. factor = previous tabulated value of edit-failing item previous tabulated value of auxiliary data field 37 Balance Edits Purpose: To determine if detail items add to associated reported total. Form of Edit: TOTAL = DETAIL1 + DETAIL2 + ... + DETAILn • Developed from questionnaire • A set of details along with their associated total is called a balance complex. • More complicated balance complexes • Nested 1-Dimensional • 2-Dimensional 38 Sample Questionnaire Example Instructions: Report all dollar figures in thousands. Report all hours in thousands. Report employment in units. Millions Thousands Item 1. ANNUAL PAYROLL Item 2. 1ST QUARTER PAYROLL Item 3. SALES 3.a. ON SITE MANUFACTURES 3.b. REPACKAGED MANUFACTURES 3.c. TOTAL (3.a. + 3.b.) Item 4. TOTAL HOURS WORKED Item 5. EMPLOYMENT 39 “Fixing” a Failed Balance Edit Editing generally integrated with imputation: Editor decides which is more believable: TOTAL or SUM OF DETAILS Only change one side of balance complex (TOTAL or SUM OF DETAILS) 40 Balance Edit Definitions Residual: TOTAL - SUM OF DETAILS Failed edit solution can depend on • SIZE of residual (absolute tolerance) • RATIO of residual to total (relative tolerance) 41 A Few Balance Edit Fixes RAKE* YSUMX* Rake all detail items to TOTAL Replace TOTAL with the SUM OF DETAILS ROUND Divide all details by 1000 or divide TOTAL by 1000 RESIDUAL Set one missing DETAIL to the RESIDUAL IMPUTE* Replace all DETAILS with imputed values *Briefly discussed… 42 Raking Adjust each detail item as TOTAL DETAILi *= DETAILi SUM OF DETAILS Conditions: • Reported TOTAL must be “acceptable.” • Relative tolerance is “small” (e.g., within 5%). 43 Raking -- Considerations • Is not considered imputation • Preserves reported distribution of the detail items 44 YSUMX Set TOTAL equal to SUM OF DETAILS Conditions: • TOTAL can be changed by edit (not “fixed”); • (Optional, but preferable) SUM OF DETAILS is “reasonable” (e.g., verify with ratio test or range test) 45 YSUMX -- Considerations • Not (considered: imputation; • logical edit or deductive imputation • Useful when TOTAL is missing (and details are not); • Can be imputation solution to ratio edit 1 TOTAL 1 SUM OF DETAILS 46 Impute Replace ALL reported DETAILS with imputed values Imputed DETAILi for reporting unit c is given by factori TOTAL Conditions • TOTAL > 0 (and value of TOTAL “acceptable”) • No restriction on SUM OF DETAILS (all DETAILS are replaced...] • Difference between TOTAL and SUM OF DETAILS too large for raking 47 Macro-Editing (Brief Comments) • Systematic review of tabulations (estimates) • Tendency to rely on ratio comparisons to identify outlying estimates • Hidiroglou-Berthelot edit • Ratio Edits • Need to analyze micro-data in outlying cells 48 Back to Coding… Throughout the editing and imputation process, what do we need to keep track of? 49 Back to Coding… Original source of data item • • • Reported from respondent Elicited by analyst/subject-matter expert Missing/not reported 50 Back to Coding… Final source of data item value • • • • Unchanged (respondent data) Raked/considered reported Rounded detail item (rescaled) Substitution • • • • • Administrative data Other item from same questionnaire Prior period value from same respondent (can indicate bad survey practice) Model imputation (+ auxiliary data) Other imputation 51 Back to Coding… Why was data item value changed? • • Not reported? Edit failure (automatic) • • • Macro or micro level? Which edit/edit module Analyst change (manual) • Should be documented in notes 52 Back to Coding… What is the final disposition the data item? • • • Reported data? Equivalent in quality to reported data? Imputed data? What is the final disposition of the entire reporting unit? 53 Wrap Up • Talked in GREAT detail on • • • • • Methods and sources of micro-edits Common imputation models Talked in SEMI-GREAT detail on imputation methods Brought up the idea of macro-editing Mentioned a few coding concerns here and there… 54 For me, 40 minutes is barely a warm-up. Contact information: Katherine.J.Thompson@census.gov (301) 763-4941 55