Missing data imputation

CountrySTAT Team-I
10-13 November 2014, ECO Secretariat, Teheran
MISSING DATA IMPUTATION
SUMMARY
• Introduction
• Origin of missing data
• Nature of missing data
• Implemented methodologies
• Proposed methodologies
• Results
• Conclusion

INTRODUCTION

The objective of this presentation is to introduce
basic tools for handling missing data in the CountrySTAT
and FAOSTAT domains. They rely on a simple,
user-friendly approach.

The CountrySTAT agricultural production domain
was used as a basis to develop and test
imputation and validation methodologies that
could assist in standardisation across the different
statistical domains present at FAO level.
ORIGIN OF MISSING DATA

Data are missing for different reasons:
1) The value has not been measured (forgotten, etc.);
2) The value was measured but lost;
3) The value was measured but is considered unusable (outliers, etc.);
4) The value was measured but is unavailable.
DATA ARE ESSENTIAL TO RESEARCH, BUT ANY EXPERIENCED RESEARCHER
KNOWS THAT IT'S NEARLY IMPOSSIBLE TO COLLECT DATA WITHOUT HOLES,
BIASES, OR FLAWS
NATURE OF MISSING DATA

In a dataset, data can be:
1) Missing completely at random (MCAR): when the events that
lead to any particular data item being missing are independent both of observable
variables and of unobservable parameters of interest, and occur entirely at
random.
$$P(r \mid Y_{\text{observed}}, Y_{\text{missing}}) = P(r)$$
2) Missing at random (MAR): when the missingness is related to a
particular observed variable, but not to the value of the variable that has missing
data.
$$P(r \mid Y_{\text{observed}}, Y_{\text{missing}}) = P(r \mid Y_{\text{observed}})$$
3) Not missing at random (NMAR): when data are neither MCAR nor MAR, so the
probability of missingness cannot be simplified.
$$P(r \mid Y_{\text{observed}}, Y_{\text{missing}}) = P(r \mid Y_{\text{observed}}, Y_{\text{missing}})$$
4) Censored and truncated data.
Here r denotes the missingness indicator.
Data are usually MCAR or MAR.
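To make the difference between MCAR and MAR concrete, the following Python sketch simulates both mechanisms. The function names, the probabilities and the use of a fully observed `area` covariate are assumptions chosen for illustration; they do not come from the presentation.

```python
import random


def make_mcar_mask(n, p, rng):
    """MCAR: every item is dropped with the same probability p (True = missing)."""
    return [rng.random() < p for _ in range(n)]


def make_mar_mask(covariate, p_low, p_high, threshold, rng):
    """MAR: the chance of losing Y depends only on an observed covariate,
    never on the value of Y itself (True = missing)."""
    return [rng.random() < (p_high if x > threshold else p_low) for x in covariate]


rng = random.Random(0)  # fixed seed so the illustration is reproducible
area = [rng.uniform(100, 200) for _ in range(10000)]  # hypothetical observed covariate

mcar = make_mcar_mask(len(area), 0.2, rng)
mar = make_mar_mask(area, 0.05, 0.35, 150, rng)

print("MCAR missing rate:", sum(mcar) / len(mcar))
```

Under MCAR the missing rate is the same everywhere, whereas under MAR it differs between records with large and small `area`, which is what an imputation model conditioning on `area` can exploit.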
OVERVIEW OF DIFFERENT METHODOLOGIES
A) Deductive or logical imputation;
B) Mean imputation;
C) Ratio imputation;
D) Regression imputation;
E) Donor imputation (hot-deck, cold-deck, nearest neighbor);
F) Multiple imputation: because it is not deterministic, it is not applicable to official statistics.
IMPLEMENTED METHODOLOGIES: IMPUTATION
METHODS IN FAOSTAT

Already applied:
• expert judgment
• last observation carried forward
• linear interpolation
• growth-rate benchmarking
Tested but not applied:
• trend smoothing
Under development:
• yield estimation
• multivariate approach

These imputations are based on deductive or logical imputation, ratio
imputation and donor imputation.
The selected method is based on regression imputation.
WHY?
IMPLEMENTED METHODOLOGIES: MOVING
AVERAGE
• $y_t^*$ is the value to be imputed.
• We consider the time series $(y_t)$: $y_1, y_2, \ldots, y_n$.

$$y_t^* = \frac{y_{t-l} + \cdots + y_{t-1} + y_{t+1} + \cdots + y_{t+m}}{m + l}$$

• If m = 0, $y_t^*$ is the estimate for the current year (only past values are used).
• If m = 0 and l = 1, the last observation is carried forward.
IMPLEMENTED METHODOLOGIES: MOVING AVERAGE. EXAMPLE
Year  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Area   135   ---   195   160   ---   170   190   208   210   205   ---
Area production for Afghanistan (in thousand ha.)

m = 2, l = 1:
$$y_{2004}^* = \frac{135 + 195 + 160}{3} = 163.333$$
$$y_{2007}^* = \frac{160 + 170 + 190}{3} = 173.333$$
m = 0, l = 1:
$$y_{2013}^* = \frac{205}{1} = 205$$
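The moving-average rule can be sketched in a few lines of Python. This is a minimal illustration, not the FAOSTAT implementation: the function name `moving_average_impute` and the dictionary layout are assumptions, and only the values that enter the worked calculations are listed.

```python
def moving_average_impute(values, t, l, m):
    """Average the l years before t and the m years after t.

    `values` maps year -> observation; a missing year is absent or None.
    """
    if l + m == 0:
        raise ValueError("at least one neighbouring year is required")
    years = list(range(t - l, t)) + list(range(t + 1, t + m + 1))
    unavailable = [y for y in years if values.get(y) is None]
    if unavailable:
        raise ValueError(f"neighbouring years without data: {unavailable}")
    return sum(values[y] for y in years) / (l + m)


# Observations used in the worked example (thousand ha)
area = {2003: 135, 2005: 195, 2006: 160, 2008: 170, 2009: 190, 2012: 205}

print(round(moving_average_impute(area, 2004, l=1, m=2), 3))  # 163.333
print(round(moving_average_impute(area, 2007, l=1, m=2), 3))  # 173.333
print(moving_average_impute(area, 2013, l=1, m=0))            # 205.0
```

With m = 0 and l = 1 the function reduces to last observation carried forward, as in the 2013 case.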
IMPLEMENTED METHODOLOGIES: LINEAR
INTERPOLATION
A linear trend is assumed to exist between the start- and endpoints of gaps in the time series.
Let $y_0, y_1, \ldots, y_{t-l}$ denote the data points with values
obtained from official sources before the gap and
$y_{t+r}, y_{t+r+1}, \ldots, y_m$ denote the data points with official
values after the gap. The imputed values are
calculated as:

$$\hat{y}_t = y_{t-l} + l \cdot \frac{y_{t+r} - y_{t-l}}{l + r}$$
IMPLEMENTED METHODOLOGIES: LINEAR INTERPOLATION. EXAMPLE
Year  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Area   135   195   195   160   ---   ---   ---   208   210   205   205
Area production for Afghanistan (in thousand ha.)

$$y_{2007}^* = 160 + 1 \cdot \frac{208 - 160}{1 + 3} = 172$$
$$y_{2008}^* = 160 + 2 \cdot \frac{208 - 160}{2 + 2} = 184$$
$$y_{2009}^* = 160 + 3 \cdot \frac{208 - 160}{3 + 1} = 196$$
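The interpolation of the 2007-2009 gap can be reproduced with a short Python sketch. The function name `linear_interpolate_gap` is an assumption made for illustration; `start` and `end` are the last official year before the gap and the first official year after it.

```python
def linear_interpolate_gap(values, start, end):
    """Fill the years strictly between `start` and `end` on a straight line.

    For a year t in the gap, l = t - start and r = end - t, so l + r = end - start.
    """
    y_start, y_end = values[start], values[end]
    width = end - start
    return {
        year: y_start + (year - start) * (y_end - y_start) / width
        for year in range(start + 1, end)
    }


# Endpoints of the gap in the worked example (thousand ha)
area = {2006: 160, 2010: 208}
print(linear_interpolate_gap(area, 2006, 2010))
# {2007: 172.0, 2008: 184.0, 2009: 196.0}
```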
IMPLEMENTED METHODOLOGIES: ESTIMATION
BASED ON AVERAGE YIELD (1)
An estimate of the yield in data point 0 is calculated by
taking the average of the ratio between agricultural
output (y) and agricultural input (x) observed at the three
data points with valid observations in both y and x which
are nearest to the imputable value in terms of years.
$$r^* = \frac{1}{3}\left(\frac{y_1}{x_1} + \frac{y_2}{x_2} + \frac{y_3}{x_3}\right)$$
IMPLEMENTED METHODOLOGIES: ESTIMATION
BASED ON AVERAGE YIELD (2)
If a valid value for agricultural input exists in the current
year, $x_0$, then the corresponding value of agricultural
output is estimated as:
$$y^* = r^* \times x_0$$
If a valid value for agricultural output exists in the
current year, $y_0$, then the corresponding value of
agricultural input is estimated as:
$$x^* = \frac{y_0}{r^*}$$
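Year        2005  2006  2007  2008  2009  2010  2011  2012  2013
Area         125   135   141    --   133   125   144    --   160
Production  1125  2002  2695  2200  1982  1001  2725    --  2820

$$r^*_{2008} = \frac{1}{3}\left(\frac{1125}{125} + \frac{2002}{135} + \frac{2695}{141}\right) = 14.31$$
$$\text{Area}_{2008} = \frac{2200}{14.31} = 153.73$$
$$r^*_{2012} = \frac{1}{3}\left(\frac{1982}{133} + \frac{1001}{125} + \frac{2725}{144}\right) = 13.94$$
Neither area nor production is available for 2012, so the area is first obtained by linear interpolation:
$$\text{Area}_{2012} = 144 + \frac{160 - 144}{1 + 1} = 152$$
$$\text{Production}_{2012} = 152 \times r^*_{2012} \approx 2119.58$$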
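The yield-based calculation can be checked with the Python sketch below. The function name `average_yield` is hypothetical. Note one assumption: the worked example averages the three most recent valid years before the imputed year, so the sketch does the same, although "nearest" could also be read as including later years.

```python
def average_yield(output, inputs, year, k=3):
    """Mean output/input ratio over the k most recent valid years before `year`."""
    valid = [
        y for y in sorted(output)
        if y < year and output[y] is not None and inputs.get(y) is not None
    ]
    if len(valid) < k:
        raise ValueError("not enough valid years before the target year")
    return sum(output[y] / inputs[y] for y in valid[-k:]) / k


area = {2005: 125, 2006: 135, 2007: 141, 2008: None, 2009: 133,
        2010: 125, 2011: 144, 2012: None, 2013: 160}
production = {2005: 1125, 2006: 2002, 2007: 2695, 2008: 2200, 2009: 1982,
              2010: 1001, 2011: 2725, 2012: None, 2013: 2820}

# 2008: production is known, so area = production / yield
r_2008 = average_yield(production, area, 2008)
area_2008 = production[2008] / r_2008

# 2012: both are missing; area is interpolated first, then production = area * yield
r_2012 = average_yield(production, area, 2012)
area_2012 = area[2011] + (area[2013] - area[2011]) / 2
production_2012 = area_2012 * r_2012

print(round(r_2008, 2), round(r_2012, 2), area_2012, round(production_2012, 2))
```

Because the sketch keeps the yield unrounded, the 2008 area comes out slightly below the slide's figure, which divides by the rounded value 14.31.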
IMPLEMENTED METHODOLOGIES: TREND
REGRESSION
A polynomial regression is run based on the model:
$$y_t = \alpha + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \beta_4 t^4 + \rho\, u_{t-1}$$
where $y_t$ is a valid value observed for year t and $u_t$ is the residual
in that year.
PROPOSED METHODOLOGIES: REGRESSION
IMPUTATION
The methods used are based on regression
imputation and use the EM algorithm:
1) Yield estimation: estimate the yield using an
ARIMA model;
2) Linear regression: use a linear regression
between Pt and At, including a trend;
3) ARIMA model: estimate Pt and At using an
ARIMA model;
4) Spline regression: estimate Pt and At using a
spline.
PROPOSED METHODOLOGIES: LINEAR
REGRESSION
EXPECTATION-MAXIMIZATION ALGORITHM (EM)

How does it work?
PROPOSED METHODOLOGIES: YIELD
ESTIMATION (EM: EXPECTATION-MAXIMISATION)

• Compute a yield time series Yt containing missing data:
Yt = Pt/At, where Pt is the production and At is the area harvested at
time t;
• Use the linear interpolation method to obtain starting values;
• Fit an ARIMA(0,1,1) model: Yt = Yt-1 + α + εt - θ1*εt-1;
• Apply the EM algorithm.
Use the yield estimate to impute production and area harvested.
Where both Pt and At are missing, the last-observation-carried-forward
method is used to impute area harvested.


PROPOSED METHODOLOGIES: LINEAR
REGRESSION (EM: EXPECTATION-MAXIMISATION)

The model assumes a linear relationship between production and area
harvested:
Pt = Yt * At
where
Pt = production in year t;
At = area harvested in year t;
Yt = yield in year t.
Algorithm:
1) Linear interpolation of area to obtain starting values;
2) Repeat and update until the predicted values converge:
Pt = α + β1*Trend + β2*At + εt (EM algorithm to impute Pt)
At = α + β1*Trend + β2*Pt + εt (EM algorithm to impute At)
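The two-step algorithm above can be sketched in plain Python as an alternating regression imputation. This is a simplified illustration under stated assumptions: it only alternates least-squares fits and predictions (a full EM algorithm would also update variance estimates), and the function names and the series are hypothetical.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fill_linear(series):
    """Starting values: interpolate interior gaps, carry the nearest value at the edges."""
    known = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            filled[i] = series[lo] + (i - lo) * (series[hi] - series[lo]) / (hi - lo)
        else:
            filled[i] = series[before[-1]] if before else series[after[0]]
    return filled


def fit_predict(target, covariate, observed):
    """Regress target on intercept, trend and covariate over `observed` rows; predict all rows."""
    rows = [[1.0, float(t), covariate[t]] for t in range(len(target))]
    xs = [rows[t] for t in observed]
    ys = [target[t] for t in observed]
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(xs, ys)) for i in range(3)]
    beta = solve(xtx, xty)
    return [sum(b * v for b, v in zip(beta, row)) for row in rows]


def impute_production_area(production, area, max_iter=100, tol=1e-8):
    n = len(production)
    p_obs = [t for t in range(n) if production[t] is not None]
    a_obs = [t for t in range(n) if area[t] is not None]
    a = fill_linear(area)  # step 1: starting values for area
    p = list(production)
    for _ in range(max_iter):  # step 2: alternate the two regressions
        p_hat = fit_predict(production, a, p_obs)
        p = [production[t] if production[t] is not None else p_hat[t] for t in range(n)]
        a_hat = fit_predict(area, p, a_obs)
        new_a = [area[t] if area[t] is not None else a_hat[t] for t in range(n)]
        change = max(abs(x - y) for x, y in zip(new_a, a))
        a = new_a
        if change < tol:
            break
    return p, a


# Hypothetical series: production missing in one year, area in another
area = [100, 104, 103, 110, 108, 115, 113, None, 119, 125]
production = [305, 319, 318, 341, None, 360, 356, 379, 378, 398]
p_filled, a_filled = impute_production_area(production, area)
```

Observed values are never overwritten; only the gaps are refreshed at each pass, and the loop stops once the imputed areas no longer change.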
PROPOSED METHODOLOGIES: ARIMA MODEL

• The ARIMA model must be identified;
• ARIMA(0,1,1): Yt = Yt-1 + α + εt - θ1*εt-1;
• Use the relation between production and area;
• Treat these variables as time series and impute using the EM algorithm;
• R package mtsdi;
• Impute Pt and At using the ARIMA model.
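To show what the ARIMA(0,1,1) equation implies for prediction, the Python sketch below runs the one-step-ahead recursion for given parameters. It is an illustration only: in practice α and θ1 are estimated from the data, the innovation before the first observation is assumed to be zero, and the function name and numbers are hypothetical.

```python
def arima011_forecasts(series, alpha, theta):
    """One-step-ahead forecasts for Y_t = Y_{t-1} + alpha + eps_t - theta * eps_{t-1}."""
    eps_prev = 0.0  # assumption: no innovation before the first observation
    forecasts = []
    for t in range(1, len(series)):
        forecast = series[t - 1] + alpha - theta * eps_prev
        forecasts.append(forecast)
        eps_prev = series[t] - forecast  # innovation realised at time t
    next_value = series[-1] + alpha - theta * eps_prev
    return forecasts, next_value


# Hypothetical yield series and parameters
forecasts, next_value = arima011_forecasts([10.0, 12.0, 13.0], alpha=1.0, theta=0.5)
print(forecasts, next_value)  # [11.0, 12.5] 13.75
```

Each forecast is the previous value plus the drift, corrected by a share θ1 of the last forecast error; a missing year would be filled with such a prediction.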
PROPOSED METHODOLOGIES: SPLINE MODEL


• A form of interpolation where the interpolant is a special type
of piecewise polynomial called a spline.
• For each interval, a polynomial function that fits the data
well is estimated.
• Spline interpolation is preferred over polynomial
interpolation because the interpolation error can be made
small even when using low-degree polynomials for the
spline.
• R package mtsdi.
• Impute Pt and At using spline regression.

RESULTS

• We use real data to test the proposed methodologies:
yield estimation, linear regression, ARIMA and spline.
• Linear interpolation is also included for comparison.
• Data are from the CountrySTAT-Mali website.
• Missing data are generated randomly.
• Data cover 1984 to 2012.
RESULTS: TEST CASE

Test case: Maize.

10% of the data missing.
RESULTS: TEST CASES

The methods are applied again to the same
dataset at different percentages of missing
data.
RESULTS: RELATIVE ERRORS (MAIZE)
% Missing  Method       Min    Max    Mean   Std.Dev
10         Linear.Int.  0.092  0.558  0.262  0.202
10         Yield        0.107  0.354  0.205  0.102
10         Linear Reg.  0.063  0.308  0.191  0.096
10         ARIMA        0.008  0.270  0.136  0.097
10         Spline       0.098  0.332  0.190  0.079
20         Linear.Int.  0.011  1.142  0.249  0.303
20         Yield        0.061  0.540  0.238  0.126
20         Linear Reg.  0.014  0.758  0.312  0.253
20         ARIMA        0.050  0.517  0.231  0.142
20         Spline       0.034  0.281  0.142  0.076
40         Linear.Int.  0.011  0.011  0.198  0.160
40         Yield        0.003  0.003  0.174  0.098
40         Linear Reg.  0.184  0.184  0.235  0.181
40         ARIMA        0.026  0.026  0.182  0.106
40         Spline       0.013  0.013  0.154  0.096
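The presentation does not spell out how the relative errors are computed. A plausible reading, sketched below in Python, is the absolute difference between imputed and true value divided by the true value, summarised over all artificially removed points. The function names and the numbers are made up for illustration.

```python
from statistics import mean, stdev


def relative_errors(true_values, imputed_values):
    """Relative error |imputed - true| / |true| for each removed data point."""
    return [abs(est - true) / abs(true) for true, est in zip(true_values, imputed_values)]


def summarise(errors):
    """Min, max, mean and sample standard deviation, as in the results table."""
    return {"min": min(errors), "max": max(errors),
            "mean": mean(errors), "std": stdev(errors)}


# Hypothetical true values that were removed, and their imputed replacements
true_values = [100.0, 200.0, 400.0]
imputed_values = [110.0, 190.0, 400.0]
summary = summarise(relative_errors(true_values, imputed_values))
print(summary)
```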
CONCLUSION
• Across the three test cases, the spline method gives the lowest
relative errors in most cases when more than 10% of the
data are missing.
• The ARIMA method is better suited when 10% or less of the
data are missing.
• The tests above use only two variables for the
same crop (area and production). If the share of
missing data exceeds 40%, it would be appropriate
to use a third, correlated control variable.

THANK YOU