Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting

1

Multiple Imputation for households surveys

A comparison of methods

Stata Users Group Meeting

Rodrigo Alfaro – Marcelo Fuenzalida

S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008

Outline













Multiple Imputation (MI)

Methods for MI

Chilean Household Surveys

Application of MI to EFH

Comments and Conclusions

Appendix

2 S T A T A U S E R S G R O U P M E E T I N G SEPTEMBER 09 2008

Multiple Imputation

3











In empirical applications researchers must work with incomplete data sets.

A “solution” to the problem above is known as

Multiple Imputation (MI) procedure.

MI relies on the assumption Missing At Random:



“…the probability of missing data on Y is unrelated to the value of Y, after controlling for other variables in the analysis.” (Allison, 2002)

In our empirical application we assume MAR.

Validity of assumption is beyond the scope of this analysis.


4

Multiple Imputation







MI is based on the assumption that we have a good proxies of distributions of missing observations in the sample. Given this:



We could “fill the blanks” taking random realizations.



We create m-versions of the complete datasets in order to reflect randomness of the procedure.

So, we change our incomplete data set for a set of complete ones. We do not “solve” missing data problem we just measure it.

Analysis of multiple data sets could be combine using Rubin’s rules (Rubin, 1987).


Methods for MI

5







We divide methods into 2 categories





Conditional: Hot-Deck and Univariate.

Multivariate: Normal and Chained Equations.

Hot-Deck (hotdeck.ado)



Replace missing observations with an observed one taken randomly within a specific group: males with college.



Informal conversations: std. dev.’s are still small.

Univariate (uvis.ado)







Regress variable with missing observations on exogenous variables with no missing.

Draw posterior of estimators (beta & sigma)

“Predict” missing values.


Methods for MI

6





Chained Equation (ice.ado)







Based on Univariate method, but with the possibility of having missing values in exogenous variables.

Using reverse equation missing values of exogenous variables are replaced.

Loop over previous steps.

Normal







Assume Multivariate Normal. Estimate parameters using

EM algorithm (or other initial value), and draw imputations using Data Augmentation procedure.

Theory relies on the convergence of EM.

No implemented in Stata. Schafer’s stand-alone package.


Methods for MI



Schafer’s stand-alone package: Data.


Methods for MI



Schafer’s stand-alone package: EM.


Methods for MI



Schafer’s stand-alone package: DA.


Methods for MI



Schafer’s stand-alone package: DA.


11

Chilean Households Surveys









We have households surveys with a few number of waves: CASEN, and EPS.





CASEN was created to measure poverty.

EPS was created to evaluate pension system.

At the Central Bank we have been using these surveys to analyze financial fragility of households.

However, CASEN and EPS were not created for this purpose. We need new sample designs.

In 2007, we started a new survey designed for our purposes: EFH.


Chilean Households Surveys

12





Our surveys have different levels of information





Personal information of each member of the household.

For example: age, year of education, labor income, etc.

Aggregate information of the household. For example: value of assets (cars, house, financial instruments, etc.), debts (mortgage, consumer loans, educational loans, etc.)

Our variables of interest could be irrelevant for some households in the sample.





Many households have loans with retails-companies instead of borrowing the money directly from banks.

Few households have personal savings invested in financial instruments such as stocks, bonds, etc.


13




Using conditional methods, we could attach the constraints to the imputation procedure.









We are able to impute labor income for each member of the household, considering only individual level vars.

At the household level, we could impute “banks loans” in a sub-sample of households that declared to have that kind of debt. We use as exogenous variables age, years of education, and gender of interviewee.

We impute “debt in retails-companies” with a different sub-sample but with the same exogenous variables.

But, we cannot impute “debt in retails-companies” with

“banks loans” because sub-samples may be different.


14






Multivariate methods imply groups of households.



Suppose that a household without a house, we could pretend that the value of its house is “zero”. However, that will be affect the correlation between value of the house and total amount of debts.

Our first round includes 3 groups defined by the credit access.







First group includes households without debts in financial institutions and without any kind of assets.

Second group includes households with real assets: cars, and primary house.

Third group adds households with debts in financial institutions.


Results for EFH

Missing Information

Assets Primary house

Vehicules

Debt Mortgage

Debt in retail companies

Bank loan

Missing values Total observations % of missing

80

158

2637

2003

3%

8%

137

320

47

769

1901

670

Total Number of Households in EFH 2007

18%

17%

7%

Number of observations with data

2557

1845

632

1581

623

4021



Low missing rate of information.

 However, combining variables reduces sample size.

 We will concentrate our analysis in the second group.

 We use logit transformation to avoid unbounded results.

Source: EFH 2007.


Group 2: Conditional

Primary House

HD_sex

HD_seq uvis_sex uvis_seq

Vehicules

HD_sex


Mortgage

HD_sex



HD_sex


Source: EFH 2007.

16

N mean sd mean sd

124 41,300,000 51,400,000 41,300,000 51,400,000

133

121

133 m=5 m = 10

42,021,053 53,054,393 41,363,910 51,875,497

40,804,511 50,395,197 41,843,008 53,564,263

45,799,319 61,270,298 45,192,579 60,678,326

45,343,722 61,377,351 45,508,348 61,755,670

109 14,500,000 10,800,000 14,500,000 10,800,000

15,683,299 13,673,049 15,580,981 13,076,041

133

15,753,395 12,699,439 15,778,602 13,257,939

16,540,712 14,795,948 16,521,443 15,674,534

16,703,439 16,311,423 16,876,838 16,434,770

118

133

5,430,165 14,000,000

5,515,865 13,545,393

6,398,767 21,334,829

5,732,371 13,955,271

5,861,634 16,442,425

564,897

540,023

543,912

542,185

565,666

1,076,239

1,025,211

1,029,810

1,037,191

1,033,445

5,430,165

5,606,466

5,985,748

5,678,109

6,043,486

564,897

549,424

548,536

568,213

563,049

14,000,000

14,084,994

17,893,250

13,815,379

15,476,726

1,076,239

1,056,475

1,032,109

1,067,170

1,034,414


Group 2: Multivariate

Primary House

Norm

Ice

Vehicules

Norm

Ice

Mortgage

Norm

Ice


Norm

Ice m=5 m = 10

N mean sd mean sd

124 41,300,000 51,400,000 41,300,000 51,400,000

133

45,666,537 63,217,837 45,696,699 63,810,076

42,870,280 57,188,693 42,981,100 57,419,325

121

133

5,430,165 14,000,000 5,430,165 14,000,000

6,498,768 17,089,963 6,446,325 16,600,325

7,282,372 19,242,160 6,558,340 17,211,530

109 14,500,000 10,800,000 14,500,000 10,800,000

133

15,734,073 13,055,818 15,686,541 12,946,747

15,181,216 12,585,790 15,431,263 12,719,416

118

133

564,897

635,011

707,768

1,076,239

1,242,141

1,427,360

564,897

654,637

672,833

1,076,239

1,285,391

1,357,371

 Households with debt in retails companies.

 UVIS and HD could have lower variances than raw data.

 We note that ICE and NORM are consistent in sd’s.

Source: EFH.



18











In our empirical application Hot-Deck as well as

Univariate imputation have smaller variances than multivariate methods.

Under a multivariate imputation we are able to have a reasonable standard deviation that reflects the uncertainty of complete data sets.

Moving to multivariate methods we have ICE and

NORM. Both have advantages and disadvantages.

ICE is implemented in Stata with many features available to accommodate several models.

In that respect is more general than NORM.



19







NORM relies in EM algorithm in theoretical terms and DA for imputation method. For that we observed that we need convergence and

“reasonable” positive definite matrix.

In practical terms we observed that a high rate of missing data is associated with non-converge of

EM algorithm and/or some problems with DA.

In the case of ICE we observed that the algorithm does not have “convergence problem”. We were able to impute data with a high rate of incomplete information.


20










We think that NORM provides a useful information about the stability of the model.

For that its implementation in Stata would be a good complement for ICE. Don’t you think?

We found 2 versions of NORM code: miss.sas by

Paul Allison and norm.R by Alvaro Novo.

We translated miss SAS routine into Stata-ado programming. Allison used SAS package IML

(Interactive Matrix Language). We observed that IML is similar to Mata.


21












However, original code was not “optimized” as

Schafer suggested in his book.

A month ago we move to R-code. We found that original code in Fortran was included in R routine.

So, norm.R allows to use Fortran code directly in R.

Speedy up for this meeting, we translated 800 lines of Fortran into Mata in a week.

However, our translation from R to Mata is not good enough… yet.


Problem

22



Besides technical issues on MI we have an unsolved topic for which your opinion is crucial: Aggregation











We want to work at the household level, then aggregation of individual information must be done somehow.

In order to deal with missing observation at individual level we could apply “improper imputations”: (1) replacing with zeros, or (2) replacing with a predicted value.

Alternative, we could code “missing” to the household if any member has missing observation. Because we lost information we discarded it.

Any feasible mixture?

Is it possible to add variables in order to account for incomplete information at individual level?


23

Multiple Imputation for households surveys

A comparison of methods

Stata Users Group Meeting

Rodrigo Alfaro – Marcelo Fuenzalida


All Households: Conditional

Primary house

Vehicules

Mortgage

HD_sex


N

2,557

2,637

1,845

HD_sex


2,003

HD_sex


N

632

769

46,742,044

46,519,877

51,023,261

50,809,129

Assets mean

46,467,118 m=5 sd

63,446,185

64,178,915

64,309,665

72,903,540

72,860,751

11,839,545

12,020,063

11,964,703

11,636,439

11,904,938

46,851,535

47,783,641

47,559,264

45,164,884

45,975,034

21,201,238

21,244,300

21,832,867

21,799,370

Debt mean

21,299,214 m=5 sd

19,204,692

19,271,787

19,184,499

20,593,237

20,704,561 mean

46,467,118 m=10

46,634,739

46,690,809

51,095,700

50,543,847 sd

63,446,185

64,084,809

64,581,258

72,962,069

72,175,917

11,839,545

11,926,336

11,870,948

11,689,710

11,823,094

46,851,535

47,152,707

47,067,930

45,217,017

45,687,387 mean

21,299,214 m=10

21,143,672

21,202,081

21,852,518

21,926,977 sd

19,204,692

19,104,947

19,246,894

20,790,692

20,889,226

Source: EFH 2007.


All Households: Conditional


HD_sex


Bank Loan

HD_sex


N

1,581

1,901

623

670 mean

431,153

426,891

422,734

445,489

444,923 m=5

Debt (continue) sd

622,756

607,481

600,901

658,403

676,724 mean

431,153

427,533

426,400

447,371

446,381 m=10 sd

622,756

615,761

614,459

664,727

683,083

6,175,111

6,132,041

6,108,118

6,294,591

6,350,864

9,850,386

9,726,635

9,742,051

10,090,469

10,055,060

6,175,111

6,118,279

6,163,441

6,314,019

6,311,301

9,850,386

9,700,285

9,800,157

10,149,870

10,046,143

Source: EFH 2007.


Multiple Imputation for households surveys A comparison of methods Stata Users Group Meeting

Multiple Imputation for households surveys

A comparison of methods

Outline