Thompson - American Statistical Association

Data Editing, Coding,
and Just a Little Imputation
Katherine (Jenny) Thompson
Office of Statistical Methods and Research
for Economic Programs
Katherine.J.Thompson@census.gov
(301) 763-4941
1
The Basics: What is Editing?
Editing (procedures) review reported/keyed
data for errors and pinpoints “inconsistent”
values
• For “industry”
• For respondent
Editing does not change the data. Items
that fail edits are
• referred to an analyst; or
• automatically imputed (replaced with consistent
values)
2
The Basics: What Is Imputation?
Imputation is the replacement of a missing or
incorrectly reported item using logical edits or
statistical procedures.
In other words,
Imputation replaces a missing or incorrect data
item with an “educated guess.”
3
The Basics: What is Coding?
Coding is the assignment of recognizable
values to flags that describe key
characteristics of the unit or item, such as
• Industry (unit level)
• Response status (unit or item level)
• Source of data correction (item level)
• Imputation model (item level)
4
We Begin With Coding
Before we can evaluate whether a response is
reasonable, we have to know where it comes
from:
• Classification variable(s) value, e.g., industry, state
• Frame information may be erroneous, or the unit may have changed classification value
Each unit must be assigned classification code(s) before editing/imputation
5
We End With Coding
At the end of the processing cycle, we
want to know
• How the data were changed,
• Where the data were changed,
• Why (if possible) data were changed, and
• The final status of the reporting unit (respondent, non-respondent).
6
Some Edit Definitions
Editing: Procedures for detecting “incorrect” keyed or respondent data
Micro-Editing: Editing at the individual record (questionnaire) level
Macro-Editing: Editing at the tabulated value level
7
“Typical” Editing Processing Flow
• Micro-editing (static)
  • Performed on a flow basis
  • Predetermined edit tests and edit parameters (historic data)
  • Administered by machine
  • Resolved by machine and human
• Outlier detection (dynamic)
  • Performed after “close-out”
  • Administered by machine
  • Often resolved by human
• Macro-editing (dynamic)
  • See above
8
Micro-Edits are Either:
Fatal
Must be resolved before subsequent
editing
• Unit is Out-of-Scope for Survey
• Unit is missing classification variable value
• Required data item not reported
Query
Can be corrected “automatically”
• Detail items do not add to reported total
• Ratio of two items is outside (user-determined) limits
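The fatal/query distinction can be sketched as a two-stage check: fatal edits run first, and query edits run only for records that pass them. This is a minimal illustration, not a production edit system. The record layout, the field names (`in_scope`, `industry`, `sales_total`, and so on), and the ratio limits are all hypothetical, chosen only to mirror the examples on this slide.

```python
# Hypothetical limits for ANNUAL PAYROLL / EMPLOYMENT; real limits are user-determined.
RATIO_LIMITS = (10.0, 200.0)


def fatal_failures(rec):
    """Fatal edits: these must be resolved before any subsequent editing."""
    failures = []
    if not rec.get("in_scope", False):
        failures.append("out_of_scope")
    if rec.get("industry") is None:
        failures.append("missing_classification")
    if rec.get("employment") is None:
        failures.append("required_item_missing")
    return failures


def query_failures(rec):
    """Query edits: failures that may be corrected automatically later."""
    failures = []
    if rec["sales_a"] + rec["sales_b"] != rec["sales_total"]:
        failures.append("details_do_not_add")
    low, high = RATIO_LIMITS
    emp = rec["employment"]
    if emp <= 0 or not low <= rec["annual_payroll"] / emp <= high:
        failures.append("ratio_out_of_limits")
    return failures


def micro_edit(rec):
    """Run fatal edits first; run query edits only if none of them fail."""
    fatal = fatal_failures(rec)
    query = [] if fatal else query_failures(rec)
    return {"fatal": fatal, "query": query}
```

A record with a missing industry code stops at the fatal stage, so its balance and ratio tests are never evaluated on unclassified data.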
9
Where Do Micro-Edits Come From?
• Questionnaire
• Reality
• Subject-Matter Expert Rules
• (Enforced) Statistical Relationships
10
Fictional Sample Questionnaire
Instructions: Report all dollar figures in thousands. Report all hours in thousands.
Report employment in units.
Millions
Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
11
Edit Sources: Questionnaire
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Balance Edit
Item 3.a. Value + Item 3.b. Value = Item 3.c. Value
Things have to add up!
12
Edit Sources: Questionnaire/Reality
Millions Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Ratio Edit
ANNUAL PAYROLL/1ST QUARTER PAYROLL ≥ 1
Can’t spend more on payroll in one quarter than for the
entire year!
13
Edit Sources: Questionnaire/Reality
Millions
Thousands
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
Ratio Edit
0.96 < TOTAL HOURS WORKED/EMPLOYMENT < 8.76
Lower bound: (20 hours/week × 48 weeks/year) / 1000 (reporting unit) = 0.96
Upper bound: (24 hours/day × 365 days/year) / 1000 (reporting unit) = 8.76
14
Edit Sources: Questionnaire/Reality
Item 5. EMPLOYMENT
Range Edit
0 ≤ EMPLOYMENT ≤ 5,615,727
A unit can’t have more employees than the population
of the resident state (or a negative number of employees!)
15
Edit Sources: Subject-Matter Rules
Millions
Thousands
Item 1. ANNUAL PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Ratio Edit
TOTAL SALES/ANNUAL PAYROLL > 1
“Full-year reporters should operate at a profit!”
16
Edit Sources: Statistical Relationships
Millions
Thousands
Item 1. ANNUAL PAYROLL
Item 5. EMPLOYMENT
Ratio Edit
A ≤ ANNUAL PAYROLL/EMPLOYMENT ≤ B
Wage per employee should be within the (industry) range.
17
Examples of Fatal Micro-edits
• Classification Edits
• Required Data Item Tests
18
Examples of Query Micro-edits
• List Directed (Verification) Edits
• Skip Pattern Validation Edits
• Range Edits (Including negative tests)
• Ratio Edits
  • Within same questionnaire
  • Current to prior period
• Balance Edits
• Subject-matter rules
19
List Directed/Verification Edits
Purpose: To compare the reported value of a data field to a pre-determined list of legal values.
• Machine edits, but highly dependent on data quality of list
• Human (manual) correction of edit failures
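A list-directed edit reduces to a membership test, which the following sketch illustrates. The legal codes and the `industry` field name are hypothetical; a real list would come from the survey's classification system, and its quality drives the quality of the edit.

```python
# Hypothetical list of legal industry codes (illustrative values only).
LEGAL_INDUSTRY_CODES = {"311", "312", "313"}


def list_directed_edit(value, legal_values):
    """Return True if the reported value appears in the list of legal values."""
    return value in legal_values


def refer_failures(records, field, legal_values):
    """Collect the records that fail the edit so an analyst can review them."""
    return [r for r in records
            if not list_directed_edit(r.get(field), legal_values)]
```

Failing records are returned rather than changed, matching the slide's point that these failures are corrected manually.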
20
Skip Pattern Validation Edits
Purpose:
To verify that values of skip items are
consistent with the skip instructions
provided on the questionnaire.
Machine edits that CAN be resolved by machine imputation
• Subject-matter rules (if...then logic)
• Operations Research approach
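The if...then approach to skip patterns can be sketched as follows. The questionnaire logic is invented for illustration: a hypothetical filter item `has_employees`, where an answer of "N" instructs the respondent to skip `employment` and `annual_payroll`. The resolution rule (reported positive values outrank the filter answer) is likewise an assumption, standing in for whatever subject-matter rule a real survey would specify.

```python
SKIPPED_ITEMS = ("employment", "annual_payroll")  # hypothetical skip targets


def skip_pattern_edit(rec):
    """Pass if the skipped items are blank whenever the filter answer is 'N'."""
    if rec["has_employees"] == "N":
        return all(rec.get(item) is None for item in SKIPPED_ITEMS)
    return True


def resolve_skip_failure(rec):
    """Apply a hypothetical if...then rule to a record that fails the edit."""
    fixed = dict(rec)
    if skip_pattern_edit(rec):
        return fixed
    reported = [rec.get(item) for item in SKIPPED_ITEMS]
    if any(v is not None and v > 0 for v in reported):
        # Positive data were reported, so treat the filter answer as the error.
        fixed["has_employees"] = "Y"
    else:
        # Nothing positive was reported, so trust the filter and blank the items.
        for item in SKIPPED_ITEMS:
            fixed[item] = None
    return fixed
```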
21
Range Edits
Purpose:
To check the reported value of a data
item to see if it is within specified
minimum and maximum values.
Form of edit: lower bound ≤ data item ≤ upper bound
• Upper and lower bounds are tolerances.
• If data item is not contained within the bounds, then it
fails the range edit (“out of tolerance”).
• Negative tests are a special case of range edits.
• Can be used to define an imputation region.
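A range edit is a two-sided comparison, and a negative test is the same comparison with a lower bound of zero and no upper bound. The function names below are illustrative, not taken from any particular edit system.

```python
def range_edit(value, lower, upper):
    """Return True if lower <= value <= upper (the item is within tolerance)."""
    return lower <= value <= upper


def negative_test(value):
    """Special case of a range edit: the value must not be negative."""
    return range_edit(value, 0, float("inf"))
```

For example, `range_edit(250, 0, 5_615_727)` applies the employment bounds from the earlier questionnaire example.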
22
Range Edits
Examples:
0 ≤ Employment ≤ 301,064,982 (2006 U.S. Population)
0 ≤ Sales ≤ 12,455.8 billion (2005 Gross Domestic Product)
0 ≤ Percent of work done in category ≤ 100%
23
Ratio Edits
Purpose:
To compare two “related” items in a
questionnaire to see if reported values
are consistent.
Form of Ratio Edit: L ≤ X1/X2 ≤ U
• Upper and lower bounds are known as tolerances.
• Tolerances generally developed from prior period
data.
• If ratio is not contained within the bounds, then it
fails the ratio edit (“out of tolerance”).
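The ratio edit form can be written as a short function. Treating a zero denominator as an edit failure is an assumption made here for the sketch; a real system might instead route such records to a separate test.

```python
def ratio_edit(x1, x2, lower, upper):
    """Return True if lower <= x1 / x2 <= upper.

    A zero denominator is treated as a failure (an assumption of this sketch).
    """
    if x2 == 0:
        return False
    return lower <= x1 / x2 <= upper
```

With the hours-per-employee tolerances from the earlier example, `ratio_edit(20, 10, 0.96, 8.76)` passes (ratio 2.0) and `ratio_edit(100, 10, 0.96, 8.76)` fails (ratio 10.0).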
24
Some Reasons for Ratio Editing
One data item is a function of another.
Annual Payroll = 1st Quarter Payroll + Payroll for Remaining 3 Quarters
Ratio Edit: 1 ≤ ANNUAL PAYROLL / 1ST QUARTER PAYROLL ≤ 4.4
25
Some Reasons for Ratio Editing
One data item can only be evaluated in comparison
with another item
0.96 ≤ TOTAL HOURS WORKED / NUMBER OF EMPLOYEES ≤ 8.76
Reasonable lower bound: (20 hours/week × 48 weeks/year) / 1000 (reporting unit) = 0.96
Reasonable upper bound: (24 hours/day × 365 days/year) / 1000 (reporting unit) = 8.76
26
Some Reasons for Ratio Editing
One data item is a good predictor of another.
Annual Payroll = factor × Total Employment
[Plot: Annual Payroll (vertical axis, 0 to 6000) against Total Employment (horizontal axis, 0 to 120)]
27
Plot of a Typical Ratio Edit
[Plot: Annual Payroll (vertical axis, 0 to 8000) against Total Employment (horizontal axis, 0 to 120); series: Census Data, Lower Tolerance, Upper Tolerance]
28
Advantages of Ratio Edits
• Useful for detecting systematic and random
errors
• Reasonable comparisons for quantitative data
• Verifiable assumptions
• Often insensitive to changes in economy when
both items are in the same units
• Imply certain imputation models
• Can be solved simultaneously
• imputation region implications
29
Disadvantages of Ratio Edits
• Edit failure identifies a pair of potentially
incorrect data fields
• Need to have a “tie-breaker”
• Often work best when combined with other
edits (can be ratio edits)
• Very dependent on the distribution of ratios
• Highly correlated
• Goes through origin
30
“Best” Practices for Ratio Edits
• Incorporate unit size categories as well as
classification variables in editing cells
• Perform preliminary data analysis to determine
validity of edit model
• Incorporate tests to prior data from same unit and
item when reasonable
• Use non-parametric outlier-resistant methods for
setting ratio edit tolerances
• Audit edits
• An edit test that has a high rate of failure could indicate
problems with the tolerances or the test itself
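One outlier-resistant way to set ratio edit tolerances is to place "fences" around the quartiles of prior-period ratios. The sketch below uses that approach as an example only; the slides do not say which method is intended, and the multiplier `k = 1.5` is a conventional default rather than a recommendation from the source.

```python
import statistics


def resistant_tolerances(ratios, k=1.5):
    """Quartile-based tolerances: (Q1 - k*IQR, Q3 + k*IQR), floored at zero.

    Quartiles barely move when a few extreme ratios are present, unlike
    tolerances based on the mean and standard deviation.
    """
    q1, _median, q3 = statistics.quantiles(ratios, n=4)
    iqr = q3 - q1
    lower = max(0.0, q1 - k * iqr)  # ratios of non-negative items cannot be < 0
    upper = q3 + k * iqr
    return lower, upper
```

Applied to prior-period ratios that contain one gross error, the upper tolerance stays close to the bulk of the data, so the erroneous ratio would still fail the edit.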
31
Periodic Data and Ratio Edits (Caution)
[Plot: Current Month's Number of Employees (vertical axis, 0 to 80) against Prior Month's Number of Employees (horizontal axis, 0 to 35)]
32
Brief Digression on Imputation
Situation: Missing item or item marked for
imputation (replacement) due to
edit failure(s)
We would like the machine to automatically
replace the “inconsistent” item with a
consistent value.
33
The ideal “imputations” find replacement
values that are still considered reported
(from the same respondent)
Examples
• divide reported data by correct reporting unit
• replace reported total with sum of details
34
Link Between Imputation and Program
• Published tabulations (macro-data)
• Ratio imputation models
• Regression imputation models
• Published micro-data
• Hot deck imputation
35
Commonly-Used Imputation Methods
(Economic Data)
Rounding/Data Slides (systematic error)
• Respondent data divided by unit conversion factor
(e.g., imputed value = reported value/1,000)
Direct Substitution
• Another data item (same questionnaire)
• Absolute value of reported/keyed item
• Sum of Reported Details (logical edit)
• Derived value from other reported/keyed item
• Previously reported value (historic) from same
respondent
• Administrative data value (same respondent)
36
Ratio Imputation (Model Imputation)
imputed item = (factor) × (another data field)
• Same reporting unit/questionnaire
• Edit-passing item
Industry (Category) Average Ratio
(use average ratio of two items in industry/category)
e.g., factor = industry wage/employee ratio
Historic Imputation (Auxiliary Trend)
(use ratio of prior data to current data for same respondent)
e.g., factor = (previous tabulated value of edit-failing item) / (previous tabulated value of auxiliary data field)
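Both ratio imputation variants multiply an edit-passing auxiliary item by a factor; they differ only in where the factor comes from. In this sketch the industry factor is computed as a ratio of totals over edit-passing units, which is one possible reading of "average ratio" and is an assumption. The field names `payroll` and `emp` are hypothetical.

```python
def industry_average_ratio(donors, item, auxiliary):
    """Ratio of totals, sum(item) / sum(auxiliary), over edit-passing units."""
    total_aux = sum(d[auxiliary] for d in donors)
    if total_aux == 0:
        raise ValueError("auxiliary total is zero; no ratio can be formed")
    return sum(d[item] for d in donors) / total_aux


def historic_ratio(prior_item, prior_auxiliary):
    """Factor from the same respondent's prior-period values."""
    return prior_item / prior_auxiliary


def ratio_impute(factor, auxiliary_value):
    """imputed item = factor x (another, edit-passing data field)."""
    return factor * auxiliary_value
```

For instance, with donors reporting payroll 500 and 1500 for employment 10 and 30, the industry factor is 50 per employee, and a unit with 8 employees would be imputed a payroll of 400.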
37
Balance Edits
Purpose:
To determine if detail items add to associated
reported total.
Form of Edit: TOTAL = DETAIL1 + DETAIL2 + ... + DETAILn
• Developed from questionnaire
• A set of details along with their associated total is
called a balance complex.
• More complicated balance complexes
• Nested 1-Dimensional
• 2-Dimensional
38
Sample Questionnaire Example
Instructions: Report all dollar figures in thousands. Report all hours
in thousands. Report employment in units.
Millions
Thousands
Item 1. ANNUAL PAYROLL
Item 2. 1ST QUARTER PAYROLL
Item 3. SALES
3.a. ON SITE MANUFACTURES
3.b. REPACKAGED MANUFACTURES
3.c. TOTAL (3.a. + 3.b.)
Item 4. TOTAL HOURS WORKED
Item 5. EMPLOYMENT
39
“Fixing” a Failed Balance Edit
Editing generally integrated with imputation:
Editor decides which is more believable:
TOTAL or SUM OF DETAILS
Only change one side of balance complex
(TOTAL or SUM OF DETAILS)
40
Balance Edit Definitions
Residual:
TOTAL - SUM OF DETAILS
Failed edit solution can depend on
• SIZE of residual (absolute tolerance)
• RATIO of residual to total (relative tolerance)
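The residual and the two kinds of tolerance can be sketched as below. How the absolute and relative tolerances combine is an assumption: here a balance complex passes if either tolerance is met, but a real edit could require both.

```python
def balance_residual(total, details):
    """Residual = TOTAL - SUM OF DETAILS."""
    return total - sum(details)


def balance_edit(total, details, abs_tol=0.0, rel_tol=0.0):
    """Pass if the residual is within the absolute OR the relative tolerance."""
    residual = abs(balance_residual(total, details))
    if residual <= abs_tol:
        return True
    # The relative tolerance compares the residual to the reported total.
    return total != 0 and residual / abs(total) <= rel_tol
```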
41
A Few Balance Edit Fixes
RAKE*: Rake all detail items to TOTAL
YSUMX*: Replace TOTAL with the SUM OF DETAILS
ROUND: Divide all details by 1000 or divide TOTAL by 1000
RESIDUAL: Set one missing DETAIL to the RESIDUAL
IMPUTE*: Replace all DETAILS with imputed values
*Briefly discussed…
42
Raking
Adjust each detail item as
DETAIL_i* = (TOTAL / SUM OF DETAILS) × DETAIL_i
Conditions:
• Reported TOTAL must be “acceptable.”
• Relative tolerance is “small” (e.g., within 5%).
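Raking multiplies every detail by the same factor, TOTAL divided by the sum of details, so the details add to the total while keeping their reported proportions. The sketch below enforces the relative tolerance condition with the 5% example value from this slide; checking that the reported TOTAL is "acceptable" is reduced here to requiring a positive value, which is a simplification.

```python
def rake(total, details, rel_tol=0.05):
    """Scale each detail by TOTAL / SUM OF DETAILS.

    Raises ValueError when the conditions for raking are not met.
    """
    detail_sum = sum(details)
    if total <= 0 or detail_sum <= 0:
        raise ValueError("raking needs a positive TOTAL and SUM OF DETAILS")
    if abs(total - detail_sum) / total > rel_tol:
        raise ValueError("residual too large for raking")
    factor = total / detail_sum
    return [d * factor for d in details]
```

Raking details of 588 and 392 to a total of 1000 gives approximately 600 and 400, preserving the reported 3:2 split.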
43
Raking -- Considerations
• Is not considered imputation
• Preserves reported distribution of the detail
items
44
YSUMX
Set TOTAL equal to SUM OF DETAILS
Conditions:
• TOTAL can be changed by edit (not
“fixed”);
• (Optional, but preferable) SUM OF
DETAILS is “reasonable” (e.g., verify with
ratio test or range test)
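The YSUMX conditions translate into two guards before the total is overwritten. The function name, the `total_is_fixed` flag, and the optional `sum_check` callback are hypothetical names introduced for this sketch.

```python
def ysumx(total, details, total_is_fixed=False, sum_check=None):
    """Set TOTAL equal to SUM OF DETAILS when the conditions allow it.

    Returns (total, changed). `sum_check`, if given, is a function that
    returns True when the sum of details is reasonable (for example, a
    range or ratio test).
    """
    detail_sum = sum(details)
    if total_is_fixed:
        return total, False  # TOTAL may not be changed by the edit
    if sum_check is not None and not sum_check(detail_sum):
        return total, False  # SUM OF DETAILS failed its own test
    return detail_sum, True
```

Because `total` is never used in the arithmetic, the same call also handles a missing total, for example `ysumx(None, [300, 200])`.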
45
YSUMX -- Considerations
• Not considered imputation: treated as a logical edit or deductive imputation
• Useful when TOTAL is missing (and details are not)
• Can be the imputation solution to the ratio edit 1 ≤ TOTAL / SUM OF DETAILS ≤ 1
46
Impute
Replace ALL reported DETAILS with imputed values
Imputed DETAIL_i for reporting unit c is given by factor_i × TOTAL
Conditions
• TOTAL > 0 (and value of TOTAL “acceptable”)
• No restriction on SUM OF DETAILS (all DETAILS are replaced)
• Difference between TOTAL and SUM OF DETAILS too
large for raking
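Replacing all details amounts to distributing the reported TOTAL according to a set of factors. The slides do not say where the factors come from; this sketch assumes they are proportions (for example, an industry-average distribution) that sum to one, so that the imputed details balance to the total.

```python
def impute_details(total, factors):
    """Imputed DETAIL_i = factor_i x TOTAL for every detail item."""
    if total <= 0:
        raise ValueError("TOTAL must be positive")
    if abs(sum(factors) - 1.0) > 1e-9:
        # Assumption of this sketch: factors are proportions summing to one.
        raise ValueError("factors must sum to one")
    return [f * total for f in factors]
```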
47
Macro-Editing (Brief Comments)
• Systematic review of tabulations
(estimates)
• Tendency to rely on ratio comparisons
to identify outlying estimates
• Hidiroglou-Berthelot edit
• Ratio Edits
• Need to analyze micro-data in outlying
cells
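The Hidiroglou-Berthelot edit named above compares current-to-prior ratios after a symmetrizing transformation and a size adjustment, then flags values far from the median in quartile-based units. The sketch below follows the commonly published form of the method. The parameter defaults (`U`, `A`, `C`) are illustrative assumptions, not values from these slides, and all inputs are assumed to be strictly positive.

```python
import statistics


def hb_outliers(prior, current, U=0.5, A=0.05, C=4.0):
    """Flag outlying period-to-period changes (Hidiroglou-Berthelot edit).

    `prior` and `current` are equal-length lists of positive values,
    for example cell estimates from two periods.
    """
    ratios = [c / p for p, c in zip(prior, current)]
    r_med = statistics.median(ratios)
    # Transform ratios so that increases and decreases are treated symmetrically.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Size adjustment: changes in larger values carry more weight.
    e = [si * max(p, c) ** U for si, p, c in zip(s, prior, current)]
    e_q1, e_med, e_q3 = statistics.quantiles(e, n=4)
    # The |A * e_med| floor keeps the interval from collapsing when ratios cluster.
    d_low = max(e_med - e_q1, abs(A * e_med))
    d_high = max(e_q3 - e_med, abs(A * e_med))
    lower, upper = e_med - C * d_low, e_med + C * d_high
    return [not (lower <= ei <= upper) for ei in e]
```

Flagged cells are candidates for review rather than automatic correction, in line with the slide's note that micro-data in outlying cells still need to be analyzed.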
48
Back to Coding…
Throughout the editing and
imputation process, what do we
need to keep track of?
49
Back to Coding…
Original source of data item
• Reported from respondent
• Elicited by analyst/subject-matter expert
• Missing/not reported
50
Back to Coding…
Final source of data item value
• Unchanged (respondent data)
• Raked/considered reported
• Rounded detail item (rescaled)
• Substitution
  • Administrative data
  • Other item from same questionnaire
  • Prior period value from same respondent (can indicate bad survey practice)
• Model imputation (+ auxiliary data)
• Other imputation
51
Back to Coding…
Why was data item value changed?
• Not reported?
• Edit failure (automatic)
  • Macro or micro level?
  • Which edit/edit module
• Analyst change (manual)
  • Should be documented in notes
52
Back to Coding…
What is the final disposition of the data item?
• Reported data?
• Equivalent in quality to reported data?
• Imputed data?
What is the final disposition of the entire
reporting unit?
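The questions on the "Back to Coding" slides suggest a small set of item-level flags. The sketch below is one hypothetical way to record them; the class names, flag values, and in particular the mapping from final source to disposition are assumptions made for illustration, not a coding scheme from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OriginalSource(Enum):
    REPORTED = "reported from respondent"
    ELICITED = "elicited by analyst/subject-matter expert"
    MISSING = "missing/not reported"


class FinalSource(Enum):
    UNCHANGED = "unchanged (respondent data)"
    RAKED = "raked/considered reported"
    ROUNDED = "rounded detail item (rescaled)"
    SUBSTITUTED = "substitution"
    MODEL_IMPUTED = "model imputation"
    OTHER_IMPUTED = "other imputation"


@dataclass
class ItemFlags:
    original_source: OriginalSource
    final_source: FinalSource
    change_reason: Optional[str] = None  # e.g. failed edit name, or "analyst"

    def disposition(self) -> str:
        """Map the final source to a disposition (hypothetical mapping)."""
        if self.final_source is FinalSource.UNCHANGED:
            return "reported"
        if self.final_source in (FinalSource.RAKED, FinalSource.ROUNDED,
                                 FinalSource.SUBSTITUTED):
            return "equivalent to reported"
        return "imputed"
```

Keeping the original source, final source, and reason as separate fields lets the end-of-cycle questions (how, where, and why data changed) be answered from the flags alone.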
53
Wrap Up
• Talked in GREAT detail on
  • Methods and sources of micro-edits
  • Common imputation models
• Talked in SEMI-GREAT detail on imputation methods
• Brought up the idea of macro-editing
• Mentioned a few coding concerns here and there…
54
For me, 40 minutes is barely a warm-up.
Contact information:
Katherine.J.Thompson@census.gov
(301) 763-4941
55