1
Stata Users Group Meeting
Rodrigo Alfaro – Marcelo Fuenzalida
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Multiple Imputation (MI)
Methods for MI
Chilean Household Surveys
Application of MI to EFH
Comments and Conclusions
Appendix
2 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
3
In empirical applications researchers must work with incomplete data sets.
A “solution” to the problem above is known as
Multiple Imputation (MI) procedure.
MI relies on the assumption Missing At Random:
“…the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis.” (Allison, 2002)
In our empirical application we assume MAR.
Validity of assumption is beyond the scope of this analysis.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
4
MI is based on the assumption that we have a good proxies of distributions of missing observations in the sample. Given this:
We could “fill the blanks” taking random realizations.
We create m-versions of the complete datasets in order to reflect randomness of the procedure.
So, we change our incomplete data set for a set of complete ones. We do not “solve” missing data problem we just measure it.
Analysis of multiple data sets could be combine using Rubin’s rules (Rubin, 1987).
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
5
We divide methods into 2 categories
Conditional: Hot-Deck and Univariate.
Multivariate: Normal and Chained Equations.
Hot-Deck (hotdeck.ado)
Replace missing observations with an observed one taken randomly within a specific group: males with college.
Informal conversations: std. dev.’s are still small.
Univariate (uvis.ado)
Regress variable with missing observations on exogenous variables with no missing.
Draw posterior of estimators (beta & sigma)
“Predict” missing values.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
6
Chained Equation (ice.ado)
Based on Univariate method, but with the possibility of having missing values in exogenous variables.
Using reverse equation missing values of exogenous variables are replaced.
Loop over previous steps.
Normal
Assume Multivariate Normal. Estimate parameters using
EM algorithm (or other initial value), and draw imputations using Data Augmentation procedure.
Theory relies on the convergence of EM.
No implemented in Stata. Schafer’s stand-alone package.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
7 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
8 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
9 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
10 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
11
We have households surveys with a few number of waves: CASEN, and EPS.
CASEN was created to measure poverty.
EPS was created to evaluate pension system.
At the Central Bank we have been using these surveys to analyze financial fragility of households.
However, CASEN and EPS were not created for this purpose. We need new sample designs.
In 2007, we started a new survey designed for our purposes: EFH.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
12
Our surveys have different levels of information
Personal information of each member of the household.
For example: age, year of education, labor income, etc.
Aggregate information of the household. For example: value of assets (cars, house, financial instruments, etc.), debts (mortgage, consumer loans, educational loans, etc.)
Our variables of interest could be irrelevant for some households in the sample.
Many households have loans with retails-companies instead of borrowing the money directly from banks.
Few households have personal savings invested in financial instruments such as stocks, bonds, etc.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
13
Using conditional methods, we could attach the constraints to the imputation procedure.
We are able to impute labor income for each member of the household, considering only individual level vars.
At the household level, we could impute “banks loans” in a sub-sample of households that declared to have that kind of debt. We use as exogenous variables age, years of education, and gender of interviewee.
We impute “debt in retails-companies” with a different sub-sample but with the same exogenous variables.
But, we cannot impute “debt in retails-companies” with
“banks loans” because sub-samples may be different.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
14
Multivariate methods imply groups of households.
Suppose that a household without a house, we could pretend that the value of its house is “zero”. However, that will be affect the correlation between value of the house and total amount of debts.
Our first round includes 3 groups defined by the credit access.
First group includes households without debts in financial institutions and without any kind of assets.
Second group includes households with real assets: cars, and primary house.
Third group adds households with debts in financial institutions.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Missing Information
Assets Primary house
Vehicules
Debt Mortgage
Debt in retail companies
Bank loan
Missing values Total observations % of missing
80
158
2637
2003
3%
8%
137
320
47
769
1901
670
Total Number of Households in EFH 2007
18%
17%
7%
Number of observations with data
2557
1845
632
1581
623
4021
Low missing rate of information.
However, combining variables reduces sample size.
We will concentrate our analysis in the second group.
We use logit transformation to avoid unbounded results.
Source: EFH 2007.
15 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Primary House
HD_sex
HD_seq uvis_sex uvis_seq
Vehicules
HD_sex
HD_seq uvis_sex uvis_seq
Mortgage
HD_sex
HD_seq uvis_sex uvis_seq
Debt in retail companies
HD_sex
HD_seq uvis_sex uvis_seq
Source: EFH 2007.
16
N mean sd mean sd
124 41,300,000 51,400,000 41,300,000 51,400,000
133
121
133 m=5 m = 10
42,021,053 53,054,393 41,363,910 51,875,497
40,804,511 50,395,197 41,843,008 53,564,263
45,799,319 61,270,298 45,192,579 60,678,326
45,343,722 61,377,351 45,508,348 61,755,670
109 14,500,000 10,800,000 14,500,000 10,800,000
15,683,299 13,673,049 15,580,981 13,076,041
133
15,753,395 12,699,439 15,778,602 13,257,939
16,540,712 14,795,948 16,521,443 15,674,534
16,703,439 16,311,423 16,876,838 16,434,770
118
133
5,430,165 14,000,000
5,515,865 13,545,393
6,398,767 21,334,829
5,732,371 13,955,271
5,861,634 16,442,425
564,897
540,023
543,912
542,185
565,666
1,076,239
1,025,211
1,029,810
1,037,191
1,033,445
5,430,165
5,606,466
5,985,748
5,678,109
6,043,486
564,897
549,424
548,536
568,213
563,049
14,000,000
14,084,994
17,893,250
13,815,379
15,476,726
1,076,239
1,056,475
1,032,109
1,067,170
1,034,414
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Primary House
Norm
Ice
Vehicules
Norm
Ice
Mortgage
Norm
Ice
Debt in retail companies
Norm
Ice m=5 m = 10
N mean sd mean sd
124 41,300,000 51,400,000 41,300,000 51,400,000
133
45,666,537 63,217,837 45,696,699 63,810,076
42,870,280 57,188,693 42,981,100 57,419,325
121
133
5,430,165 14,000,000 5,430,165 14,000,000
6,498,768 17,089,963 6,446,325 16,600,325
7,282,372 19,242,160 6,558,340 17,211,530
109 14,500,000 10,800,000 14,500,000 10,800,000
133
15,734,073 13,055,818 15,686,541 12,946,747
15,181,216 12,585,790 15,431,263 12,719,416
118
133
564,897
635,011
707,768
1,076,239
1,242,141
1,427,360
564,897
654,637
672,833
1,076,239
1,285,391
1,357,371
Households with debt in retails companies.
UVIS and HD could have lower variances than raw data.
We note that ICE and NORM are consistent in sd’s.
Source: EFH.
17 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
18
In our empirical application Hot-Deck as well as
Univariate imputation have smaller variances than multivariate methods.
Under a multivariate imputation we are able to have a reasonable standard deviation that reflects the uncertainty of complete data sets.
Moving to multivariate methods we have ICE and
NORM. Both have advantages and disadvantages.
ICE is implemented in Stata with many features available to accommodate several models.
In that respect is more general than NORM.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
19
NORM relies in EM algorithm in theoretical terms and DA for imputation method. For that we observed that we need convergence and
“reasonable” positive definite matrix.
In practical terms we observed that a high rate of missing data is associated with non-converge of
EM algorithm and/or some problems with DA.
In the case of ICE we observed that the algorithm does not have “convergence problem”. We were able to impute data with a high rate of incomplete information.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
20
We think that NORM provides a useful information about the stability of the model.
For that its implementation in Stata would be a good complement for ICE. Don’t you think?
We found 2 versions of NORM code: miss.sas by
Paul Allison and norm.R by Alvaro Novo.
We translated miss SAS routine into Stata-ado programming. Allison used SAS package IML
(Interactive Matrix Language). We observed that IML is similar to Mata.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
21
However, original code was not “optimized” as
Schafer suggested in his book.
A month ago we move to R-code. We found that original code in Fortran was included in R routine.
So, norm.R allows to use Fortran code directly in R.
Speedy up for this meeting, we translated 800 lines of Fortran into Mata in a week.
However, our translation from R to Mata is not good enough… yet.
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
22
Besides technical issues on MI we have an unsolved topic for which your opinion is crucial: Aggregation
We want to work at the household level, then aggregation of individual information must be done somehow.
In order to deal with missing observation at individual level we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value.
Alternative, we could code “missing” to the household if any member has missing observation. Because we lost information we discarded it.
Any feasible mixture?
Is it possible to add variables in order to account for incomplete information at individual level?
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
23
Stata Users Group Meeting
Rodrigo Alfaro – Marcelo Fuenzalida
S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Primary house
Vehicules
Mortgage
HD_sex
HD_seq uvis_sex uvis_seq
N
2,557
2,637
1,845
HD_sex
HD_seq uvis_sex uvis_seq
2,003
HD_sex
HD_seq uvis_sex uvis_seq
N
632
769
46,742,044
46,519,877
51,023,261
50,809,129
Assets mean
46,467,118 m=5 sd
63,446,185
64,178,915
64,309,665
72,903,540
72,860,751
11,839,545
12,020,063
11,964,703
11,636,439
11,904,938
46,851,535
47,783,641
47,559,264
45,164,884
45,975,034
21,201,238
21,244,300
21,832,867
21,799,370
Debt mean
21,299,214 m=5 sd
19,204,692
19,271,787
19,184,499
20,593,237
20,704,561 mean
46,467,118 m=10
46,634,739
46,690,809
51,095,700
50,543,847 sd
63,446,185
64,084,809
64,581,258
72,962,069
72,175,917
11,839,545
11,926,336
11,870,948
11,689,710
11,823,094
46,851,535
47,152,707
47,067,930
45,217,017
45,687,387 mean
21,299,214 m=10
21,143,672
21,202,081
21,852,518
21,926,977 sd
19,204,692
19,104,947
19,246,894
20,790,692
20,889,226
Source: EFH 2007.
24 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008
Debt in retail companies
HD_sex
HD_seq uvis_sex uvis_seq
Bank Loan
HD_sex
HD_seq uvis_sex uvis_seq
N
1,581
1,901
623
670 mean
431,153
426,891
422,734
445,489
444,923 m=5
Debt (continue) sd
622,756
607,481
600,901
658,403
676,724 mean
431,153
427,533
426,400
447,371
446,381 m=10 sd
622,756
615,761
614,459
664,727
683,083
6,175,111
6,132,041
6,108,118
6,294,591
6,350,864
9,850,386
9,726,635
9,742,051
10,090,469
10,055,060
6,175,111
6,118,279
6,163,441
6,314,019
6,311,301
9,850,386
9,700,285
9,800,157
10,149,870
10,046,143
Source: EFH 2007.
25 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008