
IAOS 2014 Conference – Meeting the Demands of a Changing World

Da Nang, Vietnam, 8-10 October 2014

Diagnosing the Imputation of Missing Values in Official Economic Statistics via Multiple Imputation: Unveiling the Invisible Missing Values

National Statistics Center (Japan)

Masayoshi Takahashi

Note: The views and opinions expressed in this presentation are the author's own and not necessarily those of the institution.

Outline

1. Problems of Missing Values and Imputation
2. Theory of MI and the EMB Algorithm
3. Mechanism Behind the Diagnostic Algorithm
4. Data and Missing Mechanism
5. Assessment of the Diagnostic Algorithm
6. Conclusions and Future Work


1. Problems of Missing Values and Imputation

Problems of Missing Values

Prevalence of missing values

Effects of missing values

Reduction in efficiency

Introduction of bias

Assumptions and solution

Missing At Random (MAR)

Imputation


Problematic Nature of Single Imputation (SI)

Deterministic SI

$\hat{Y}_{ij} = Y_{i,-j}\hat{\beta}$, where the hat (^) denotes an OLS estimate

Stochastic SI

$\hat{Y}_{ij} = Y_{i,-j}\hat{\beta} + \hat{\varepsilon}_i$

There is only one set of regression coefficients ($\hat{\beta}$).

$\hat{\varepsilon}_i$ = random noise
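The contrast between the two forms of SI can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names `ols_fit` and `single_impute`, the single predictor, the toy data, and the every-fifth-value missing pattern are all assumptions made for the example.

```python
import random
import statistics

def ols_fit(x, y):
    """Simple OLS of y on x: returns (intercept, slope, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, resid_sd

def single_impute(x, y, stochastic, rng):
    """Fill None entries of y using ONE set of OLS coefficients.

    stochastic=False: prediction only (deterministic SI).
    stochastic=True:  prediction plus one random noise draw (stochastic SI).
    """
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b, s = ols_fit([p[0] for p in obs], [p[1] for p in obs])
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            noise = rng.gauss(0.0, s) if stochastic else 0.0
            yi = a + b * xi + noise
        filled.append(yi)
    return filled

# Toy data (assumed): 200 units, every fifth value of y set to missing.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
y_miss = [None if i % 5 == 0 else yi for i, yi in enumerate(y)]

det = single_impute(x, y_miss, stochastic=False, rng=rng)
sto = single_impute(x, y_miss, stochastic=True, rng=rng)
```

In both calls the coefficients are estimated once, so neither version carries any information about how uncertain those coefficients are.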


2. Theory of Multiple Imputation and the EMB Algorithm

Multiple Imputation (MI) Comes to the Rescue

$\tilde{Y}_{ij} = Y_{i,-j}\tilde{\beta} + \tilde{\varepsilon}_i$

The tilde (~) denotes random sampling from a posterior distribution

Multiple sets of regression coefficients

Need multiple values of the parameters


Likelihood of Observed Data

$L(\mu, \Sigma \mid Y_{obs}) \propto \prod_{i=1}^{n} N(Y_{i,obs} \mid \mu_{i,obs}, \Sigma_{i,obs})$

Random sampling from the observed-data likelihood → Not easy!

Solution → Various computational algorithms


Computational Algorithms

EMB algorithm

Expectation-Maximization

Bootstrapping

Most computationally efficient

Other MI algorithms

MCMC

FCS
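The bootstrap half of the EMB idea can be sketched for the simplest case of one incomplete variable and one predictor. This is a rough sketch under stated assumptions, not the Amelia II implementation: the names `ols_fit` and `bootstrap_mi` and the toy data are invented for illustration, and the EM step is replaced by complete-case OLS on each bootstrap sample purely to keep the example short.

```python
import random
import statistics

def ols_fit(x, y):
    """Simple OLS of y on x: returns (intercept, slope, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    resid_sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, resid_sd

def bootstrap_mi(x, y, m, rng):
    """Return m imputed copies of y (None marks a missing value)."""
    rows = list(zip(x, y))
    n = len(rows)
    imputations = []
    for _ in range(m):
        # Bootstrap step: resample whole rows with replacement,
        # incomplete rows included.
        boot = [rows[rng.randrange(n)] for _ in range(n)]
        obs = [(xi, yi) for xi, yi in boot if yi is not None]
        # ASSUMPTION: complete-case OLS stands in for the EM step here.
        a, b, s = ols_fit([p[0] for p in obs], [p[1] for p in obs])
        # Impute the ORIGINAL data with this sample's coefficients plus noise.
        imputations.append([
            yi if yi is not None else a + b * xi + rng.gauss(0.0, s)
            for xi, yi in rows
        ])
    return imputations

# Toy data (assumed): 200 units, every fifth value of y set to missing.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
y_miss = [None if i % 5 == 0 else yi for i, yi in enumerate(y)]

imps = bootstrap_mi(x, y_miss, m=20, rng=rng)
```

Each bootstrap sample yields a different set of coefficients, which is what distinguishes this from stochastic SI: observed cells are identical across the M copies, while imputed cells vary.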


Graphical Presentation of the EMB Algorithm


3. Mechanism Behind the Diagnostic Algorithm

Paradox in Imputation

Imputed values

Estimates, not true values

Diagnosis

True values

Always missing

Cannot compare the imputed values with the truth

How do we go about imputation diagnostics?


Solution to the Paradox

Indirect diagnostics of imputation

Abayomi, Gelman, and Levy (2008)

MI

Honaker and King (2010)

Within-imputation variance

Between-imputation variance
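For reference, the standard definitions of these two quantities (Rubin 1987) can be written as follows. The symbols are introduced here and do not appear on the slides: $\hat{\theta}_m$ is the estimate from the $m$-th imputed dataset, $\bar{\theta}$ is the mean of the $M$ estimates, $\bar{W}$ is the within-imputation variance, $B$ is the between-imputation variance, and $T$ is the total variance.

```latex
\bar{W} = \frac{1}{M}\sum_{m=1}^{M} \widehat{\mathrm{Var}}(\hat{\theta}_m), \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M} \left(\hat{\theta}_m - \bar{\theta}\right)^2, \qquad
T = \bar{W} + \left(1 + \frac{1}{M}\right) B
```

The between-imputation term $B$ is the part that reflects uncertainty due to the missing values.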


Disadvantage of multiple imputation

Dozens of imputed datasets

Computational burden

Multiple values for one cell

Unrealistic to directly use in official statistics


Proposal in this Research

Two-step procedure

Imputation step: Stochastic SI

Diagnostic step: MI (new)

Advantages

Only one imputed value per cell (advantage of SI)

A measure of confidence in each imputed value (advantage of MI)


Multiple Imputation as a Diagnostic Tool

Variation among M imputed datasets

Estimation uncertainty in imputation

Our diagnostic algorithm

Utilizes this variability

Can examine the stability & confidence of imputation models

What does this mean?

See the next slide for illustration


Illustration: Two Cases of Variation in Imputations


Mathematical Representation

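Imputation step: stochastic SI

$\hat{Y}_{ij} = Y_{i,-j}\hat{\beta} + \hat{\varepsilon}_i$

If $\hat{\beta} \approx \tilde{\beta}$, then there are no uncertainties

Diagnostic step: MI

$\tilde{Y}_{ij} = Y_{i,-j}\tilde{\beta} + \tilde{\varepsilon}_i$

What we actually check is whether $\mathrm{sd}(\tilde{Y}_{ij}) \approx 0$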
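The check on the spread of the M imputed values can be sketched as a per-cell standard deviation. This is a minimal illustration with made-up numbers; the function name `imputation_sd` and the toy values are assumptions, not part of the presentation.

```python
import statistics

def imputation_sd(imputations, missing_idx):
    """Standard deviation of each missing cell across the M imputed copies."""
    return {
        i: statistics.stdev([imp[i] for imp in imputations])
        for i in missing_idx
    }

# Toy example (assumed values): M = 5 imputed copies of a 4-cell column.
# Cells 0 and 2 were observed; cells 1 and 3 were missing.
imputations = [
    [10.0, 20.1, 30.0, 41.0],
    [10.0, 19.9, 30.0, 55.0],
    [10.0, 20.0, 30.0, 28.0],
    [10.0, 20.2, 30.0, 47.0],
    [10.0, 19.8, 30.0, 35.0],
]
sds = imputation_sd(imputations, missing_idx=[1, 3])
```

Here cell 1 has a small spread, so its imputed value is stable across models, while cell 3 has a large spread and would be a candidate for closer inspection.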


4. Data and Missing Mechanism

Data

Multivariate log-normal distribution

Mean vector & variance-covariance matrix

Simulated dataset

Manufacturing Sector

2012 Japanese Economic Census

Number of observations: 1,000

Variables: turnover, capital, worker


Missing Mechanism

Target variable: turnover

Missing rate: 20%

Missing mechanism: MAR

A logistic regression to estimate the probability of missingness according to the values of the explanatory variables (capital and worker)
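A missing mechanism of this kind can be sketched as follows. This is only an illustration under assumptions: the function name `mar_mask`, the coefficients `b_cap` and `b_wrk`, the use of logged covariates, the independent log-normal toy covariates, and the bisection used to calibrate the intercept to the 20% rate are all choices made for this example, not details taken from the presentation.

```python
import math
import random

def mar_mask(capital, worker, target_rate, rng, b_cap=0.5, b_wrk=0.5):
    """Missingness indicators for turnover under a logistic MAR model.

    The probability of missingness depends only on the observed covariates
    (capital, worker), never on turnover itself.
    """
    score = [b_cap * math.log(c) + b_wrk * math.log(w)
             for c, w in zip(capital, worker)]

    def mean_prob(intercept):
        return sum(1.0 / (1.0 + math.exp(-(intercept + s)))
                   for s in score) / len(score)

    # Bisection on the intercept so the average probability hits target_rate.
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if mean_prob(mid) < target_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0

    probs = [1.0 / (1.0 + math.exp(-(intercept + s))) for s in score]
    mask = [rng.random() < p for p in probs]  # True = turnover is missing
    return mask, probs

# Toy covariates (assumed): independent log-normal draws for 1,000 units.
rng = random.Random(2014)
capital = [rng.lognormvariate(3.0, 1.0) for _ in range(1000)]
worker = [rng.lognormvariate(2.0, 1.0) for _ in range(1000)]

mask, probs = mar_mask(capital, worker, target_rate=0.20, rng=rng)
missing_rate = sum(mask) / len(mask)
```

With positive coefficients, larger units are more likely to have missing turnover; flipping the signs would reverse that, and the slides do not say which direction was used.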


5. Assessment of the Diagnostic Algorithm

R-Function diagimpute

New function developed in R

Graphical detection of problematic imputations as outliers

Graphical presentation of the stability of imputation via control chart

Not yet publicly available

A work in progress

Once finalized, planning to make it publicly available
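Since diagimpute is not publicly available, the following is only a generic sketch of the control-chart idea, not the author's function and not the qcc package. It applies a plain three-sigma rule to per-cell imputation standard deviations; the function name `three_sigma_flags`, the use of the sample standard deviation for the limits, and the toy values are assumptions.

```python
import statistics

def three_sigma_flags(values):
    """Center line, control limits, and indices of points outside the limits."""
    center = statistics.fmean(values)
    sigma = statistics.stdev(values)
    ucl = center + 3.0 * sigma            # upper control limit
    lcl = max(0.0, center - 3.0 * sigma)  # a standard deviation cannot be negative
    flagged = [i for i, v in enumerate(values) if v > ucl or v < lcl]
    return center, lcl, ucl, flagged

# Toy input (assumed): spread of the M imputed values for 20 missing cells;
# the last cell is far less stable than the others.
cell_sd = [0.9, 1.0, 1.1] * 6 + [1.0, 5.0]
center, lcl, ucl, flagged = three_sigma_flags(cell_sd)
```

In this toy input only the last cell falls above the upper limit. With few points, a single extreme value inflates the limits enough to hide itself, so this simple rule needs a reasonable number of imputed cells to be useful.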


Preliminary Result 1


Preliminary Result 2


6. Conclusions and Future Work

Conclusions

MI as a diagnostic tool

A novel way

Diagnostic algorithm

Still a work in progress

A preliminary assessment given

Useful to detect problematic imputations

Helps strengthen the validity of official economic statistics.


Future Work

Intend to further refine the algorithm

Test it against a variety of real datasets

Use several imputation models


References 1

1. Abayomi, Kobi, Andrew Gelman, and Marc Levy. (2008). “Diagnostics for Multivariate Imputations,” Applied Statistics vol.57, no.3, pp.273-291.
2. Allison, Paul D. (2002). Missing Data. CA: Sage Publications.
3. Congdon, Peter. (2006). Bayesian Statistical Modelling, Second Edition. West Sussex: John Wiley & Sons Ltd.
4. de Waal, Ton, Jeroen Pannekoek, and Sander Scholtus. (2011). Handbook of Statistical Data Editing and Imputation. Hoboken, NJ: John Wiley & Sons.
5. Honaker, James and Gary King. (2010). “What to do About Missing Values in Time Series Cross-Section Data,” American Journal of Political Science vol.54, no.2, pp.561-581.
6. Honaker, James, Gary King, and Matthew Blackwell. (2011). “Amelia II: A Program for Missing Data,” Journal of Statistical Software vol.45, no.7.
7. King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. (2001). “Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation,” American Political Science Review vol.95, no.1, pp.49-69.
8. Little, Roderick J. A. and Donald B. Rubin. (2002). Statistical Analysis with Missing Data, Second Edition. New Jersey: John Wiley & Sons.


References 2

9. Oakland, John S. and Roy F. Followell. (1990). Statistical Process Control: A Practical Guide. Oxford: Heinemann Newnes.
10. Rubin, Donald B. (1978). “Multiple Imputations in Sample Surveys - A Phenomenological Bayesian Approach to Nonresponse,” Proceedings of the Survey Research Methods Section, American Statistical Association, pp.20-34.
11. Rubin, Donald B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
12. Schafer, Joseph L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall/CRC.
13. Scrucca, Luca. (2014). “Package qcc: Quality Control Charts,” http://cran.r-project.org/web/packages/qcc/qcc.pdf .
14. Statistics Bureau of Japan. (2012). “Economic Census for Business Activity,” http://www.stat.go.jp/english/data/e-census/2012/index.htm .
15. Takahashi, Masayoshi and Takayuki Ito. (2012). “Multiple Imputation of Turnover in EDINET Data: Toward the Improvement of Imputation for the Economic Census,” Work Session on Statistical Data Editing, UNECE, Oslo, Norway, September 24-26, 2012.


References 3

16. Takahashi, Masayoshi and Takayuki Ito. (2013). “Multiple Imputation of Missing Values in Economic Surveys: Comparison of Competing Algorithms,” Proceedings of the 59th World Statistics Congress of the International Statistical Institute, Hong Kong, China, August 25-30, 2013, pp.3240-3245.
17. Takahashi, Masayoshi. (2014a). “An Assessment of Automatic Editing via the Contamination Model and Multiple Imputation,” Work Session on Statistical Data Editing, United Nations Economic Commission for Europe, Paris, France, April 28-30, 2014.
18. Takahashi, Masayoshi. (2014b). “Keiryouchi Data no Kanrizu (Control Chart for Continuous Data),” Excel de Hajimeru Keizai Toukei Data no Bunseki (Statistical Data Analysis for Economists Using Excel), 3rd edition. Tokyo: Zaidan Houjin Nihon Toukei Kyoukai.
19. van Buuren, Stef. (2012). Flexible Imputation of Missing Data. London: Chapman & Hall/CRC.


Thank you
