CountrySTAT Team-I
10-13 November 2014, ECO Secretariat, Teheran

MISSING DATA IMPUTATION

SUMMARY
Introduction
Origin of missing data
Nature of missing data
Implemented methodologies
Proposed methodologies
Results
Conclusion

INTRODUCTION
The objective of this presentation is to introduce basic tools for handling missing data in the CountrySTAT and FAOSTAT domains. The tools follow a simple, user-friendly approach. The CountrySTAT agricultural production domain was used as a basis to develop and test imputation and validation methodologies that could assist in standardisation across the different statistical domains present at FAO level.

ORIGIN OF MISSING DATA
Data are missing for different reasons:
1) The value has not been measured (forgotten, etc.);
2) The value was measured but lost;
3) The value was measured but is considered unusable (outliers, etc.);
4) The value was measured but is unavailable.

DATA ARE ESSENTIAL TO RESEARCH, BUT ANY EXPERIENCED RESEARCHER KNOWS THAT IT'S NEARLY IMPOSSIBLE TO COLLECT DATA WITHOUT HOLES, BIASES, OR FLAWS.

NATURE OF MISSING DATA
In a dataset, data can be classified as follows, where $r$ is the missingness indicator, $Y_{observed}$ the observed values and $Y_{missing}$ the missing values:
1) Missing completely at random (MCAR): the events that lead to any particular data item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random.
$$P(r \mid Y_{observed}, Y_{missing}) = P(r)$$
2) Missing at random (MAR): the missingness is related to a particular observed variable, but not to the value of the variable that has missing data.
$$P(r \mid Y_{observed}, Y_{missing}) = P(r \mid Y_{observed})$$
3) Not missing at random (NMAR): the data are neither MCAR nor MAR, so the missingness depends on the missing values themselves and the probability does not simplify.
$$P(r \mid Y_{observed}, Y_{missing}) = P(r \mid Y_{observed}, Y_{missing})$$
4) Censored and truncated data.
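The difference between MCAR and MAR can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables x and y, the missingness probabilities (20%, 5%, 35%) and the helper name missing_rate_by_group are hypothetical and chosen for illustration; they do not come from CountrySTAT or FAOSTAT data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: x is always observed, y will receive the gaps.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# MCAR: every y value has the same 20% chance of being missing.
r_mcar = rng.random(n) < 0.20

# MAR: the chance that y is missing depends only on the observed x
# (assumed here: 35% when x > 0, 5% otherwise).
r_mar = rng.random(n) < np.where(x > 0, 0.35, 0.05)

y_mcar = np.where(r_mcar, np.nan, y)
y_mar = np.where(r_mar, np.nan, y)


def missing_rate_by_group(r, x):
    """Share of missing y values where x <= 0 and where x > 0."""
    return r[x <= 0].mean(), r[x > 0].mean()


print("missing rate (x<=0, x>0) under MCAR:", missing_rate_by_group(r_mcar, x))
print("missing rate (x<=0, x>0) under MAR: ", missing_rate_by_group(r_mar, x))
print("mean of y, complete data:   ", y.mean())
print("mean of observed y, MCAR:   ", np.nanmean(y_mcar))
print("mean of observed y, MAR:    ", np.nanmean(y_mar))
```

Under MCAR the missing rate is about the same in both groups and the mean of the observed values stays close to the complete-data mean. Under MAR the gaps concentrate where x is large, so the mean of the observed y values is biased unless x is taken into account.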
In practice, data are usually MCAR or MAR.

OVERVIEW OF DIFFERENT METHODOLOGIES
A) Deductive or logical imputation;
B) Mean imputation;
C) Ratio imputation;
D) Regression imputation;
E) Donor imputation (hot-deck, cold-deck, nearest neighbour);
F) Multiple imputation: because it is not deterministic, it is not applicable to official statistics.

IMPLEMENTED METHODOLOGIES: IMPUTATION METHODS IN FAOSTAT
• expert judgment
• last observation carried forward
• linear interpolation
• growth-rate benchmarking (already applied)
• trend smoothing (tested but not applied)
• yield estimation
• multivariate approach (under development)
These imputations are based on deductive or logical imputation, ratio imputation and donor imputation. The selected method is based on regression imputation. WHY?

IMPLEMENTED METHODOLOGIES: MOVING AVERAGE
We consider the time series $(y_t)$: $y_1, y_2, \ldots, y_n$, where $y_t^*$ is the value to be imputed.
$$y_t^* = \frac{y_{t-l} + \cdots + y_{t-1} + y_{t+1} + \cdots + y_{t+m}}{m + l}$$
If $m = 0$, $y_t^*$ is the estimate for the current year. If $m = 0$ and $l = 1$, the last observation is carried forward.

IMPLEMENTED METHODOLOGIES: MOVING AVERAGE. EXAMPLE
Year   2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Area    135   195   160   210   ---   ---   170   190   208   205   ---
Area for Afghanistan (in thousand ha)

m = 2, l = 1:
$$y_{2004}^* = \frac{135 + 195 + 160}{3} = 163.333$$
$$y_{2007}^* = \frac{160 + 170 + 190}{3} = 173.333$$
m = 0, l = 1:
$$y_{2013}^* = \frac{205}{1} = 205$$

IMPLEMENTED METHODOLOGIES: LINEAR INTERPOLATION
A linear trend is assumed to exist between the start and end points of gaps in the time series. Let $y_0, y_1, \ldots, y_{t-l}$ denote the data points with values obtained from official sources before the gap, and $y_{t+r}, y_{t+r+1}, \ldots, y_m$ the data points with official values after the gap. The imputed values are calculated as:
$$\hat{y}_t = y_{t-l} + l \cdot \frac{y_{t+r} - y_{t-l}}{l + r}$$

IMPLEMENTED METHODOLOGIES: LINEAR INTERPOLATION. EXAMPLE
Year   2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013
Area    135   195   160   210   195   ---   ---   ---   208   205   205
Area for Afghanistan (in thousand ha)

$$\hat{y}_{2007} = 160 + 1 \times \frac{208 - 160}{1 + 3} = 172$$
$$\hat{y}_{2008} = 160 + 2 \times \frac{208 - 160}{2 + 2} = 184$$
$$\hat{y}_{2009} = 160 + 3 \times \frac{208 - 160}{3 + 1} = 196$$

IMPLEMENTED METHODOLOGIES: ESTIMATION BASED ON AVERAGE YIELD (1)
An estimate of the yield at data point 0 is calculated as the average of the ratio between agricultural output ($y$) and agricultural input ($x$), observed at the three data points with valid observations of both $y$ and $x$ that are nearest in time to the value to be imputed.
$$r^* = \frac{1}{3}\left(\frac{y_1}{x_1} + \frac{y_2}{x_2} + \frac{y_3}{x_3}\right)$$

IMPLEMENTED METHODOLOGIES: ESTIMATION BASED ON AVERAGE YIELD (2)
If a valid value for agricultural input exists in the current year, $x_0$, then the corresponding value of agricultural output is estimated as:
$$y_0^* = r^* x_0$$
If a valid value for agricultural output exists in the current year, $y_0$, then the corresponding value of agricultural input is estimated as:
$$x_0^* = \frac{y_0}{r^*}$$

Year        2005  2006  2007  2008  2009  2010  2011  2012  2013
Area         125   135   141    --   133   125   144    --   160
Production  1125  2002  2695  2200  1982  1001  2725    --  2820

$$r^*_{2008} = \frac{1}{3}\left(\frac{1125}{125} + \frac{2002}{135} + \frac{2695}{141}\right) = 14.31$$
$$\text{Area}_{2008} = \frac{2200}{14.31} \approx 153.74$$
$$r^*_{2012} = \frac{1}{3}\left(\frac{1982}{133} + \frac{1001}{125} + \frac{2725}{144}\right) = 13.94$$
In 2012 both values are missing, so the area is first obtained by linear interpolation and the production then follows from the yield (computed with the unrounded value of $r^*_{2012}$):
$$\text{Area}_{2012} = 144 + \frac{160 - 144}{1 + 1} = 152$$
$$\text{Production}_{2012} = 152 \times r^*_{2012} \approx 2119.58$$

IMPLEMENTED METHODOLOGIES: TREND REGRESSION
A polynomial regression is run based on the model:
$$y_t = \alpha + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \beta_4 t^4 + \rho\, u_{t-1}$$
where $y_t$ is a valid value observed for year $t$ and $u_t$ is the residual in that year.

PROPOSED METHODOLOGIES: REGRESSION IMPUTATION
The proposed methods are based on regression imputation and use the EM algorithm ($P_t$: production, $A_t$: area harvested at time $t$):
1) Yield estimation: estimate the yield using an ARIMA model;
2) Linear regression: use a linear regression between $P_t$ and $A_t$, including a trend;
3) ARIMA model: estimate $P_t$ and $A_t$ using an ARIMA model;
4) Spline regression: estimate $P_t$ and $A_t$ using splines.

PROPOSED METHODOLOGIES: LINEAR REGRESSION
EXPECTATION-MAXIMIZATION ALGORITHM (EM)
How does it work?
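As a rough illustration of the linear-regression variant listed above, the following Python sketch alternates two trend regressions, one for $P_t$ on $A_t$ and one for $A_t$ on $P_t$, and replaces the missing values with the fitted predictions until they stop changing. It is a simplified, EM-style iteration and not the implementation used in the presentation (which relies on R): a full EM step would also account for the residual variance. The function name iterative_regression_impute, the tolerance, the iteration cap and the synthetic series are all assumptions made for this example.

```python
import numpy as np


def iterative_regression_impute(P, A, tol=1e-6, max_iter=200):
    """Fill NaNs in production P and area A by alternating trend regressions."""
    P = np.asarray(P, dtype=float).copy()
    A = np.asarray(A, dtype=float).copy()
    t = np.arange(len(P), dtype=float)          # linear trend 0, 1, 2, ...
    miss_P, miss_A = np.isnan(P), np.isnan(A)

    # Starting values: linear interpolation over time for both series.
    P[miss_P] = np.interp(t[miss_P], t[~miss_P], P[~miss_P])
    A[miss_A] = np.interp(t[miss_A], t[~miss_A], A[~miss_A])

    for _ in range(max_iter):
        P_old, A_old = P.copy(), A.copy()

        # Fit P_t = a + b1 * trend + b2 * A_t, then update the missing P_t.
        X = np.column_stack([np.ones_like(t), t, A])
        beta, *_ = np.linalg.lstsq(X, P, rcond=None)
        P[miss_P] = (X @ beta)[miss_P]

        # Fit A_t = a + b1 * trend + b2 * P_t, then update the missing A_t.
        Z = np.column_stack([np.ones_like(t), t, P])
        gamma, *_ = np.linalg.lstsq(Z, A, rcond=None)
        A[miss_A] = (Z @ gamma)[miss_A]

        # Stop once the imputed values no longer move.
        change = max(np.abs(P - P_old).max(), np.abs(A - A_old).max())
        if change < tol:
            break
    return P, A


# Synthetic example (not real CountrySTAT data).
t = np.arange(12)
area = 100 + 3 * t + 5 * np.sin(t)
prod = 10 + 2 * t + 4 * area
P_gap, A_gap = prod.copy(), area.copy()
P_gap[[3, 8]] = np.nan
A_gap[5] = np.nan

P_hat, A_hat = iterative_regression_impute(P_gap, A_gap)
print("imputed P:", np.round(P_hat[[3, 8]], 2), "true P:", np.round(prod[[3, 8]], 2))
print("imputed A:", np.round(A_hat[5], 2), "true A:", np.round(area[5], 2))
```

Observed values are never modified; only the positions that were missing are updated at each pass.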
PROPOSED METHODOLOGIES: YIELD ESTIMATION (EM: EXPECTATION-MAXIMISATION)
Compute a yield time series $Y_t$ containing missing data: $Y_t = P_t / A_t$, where $P_t$ is the production and $A_t$ is the area harvested at time $t$;
Use the linear interpolation method to obtain starting values;
Fit an ARIMA(0,1,1) model:
$$Y_t = Y_{t-1} + \alpha + \varepsilon_t - \theta_1 \varepsilon_{t-1}$$
Run the EM algorithm;
Use the yield estimate to impute production and area harvested.
Where both $P_t$ and $A_t$ are missing, the last observation carried forward method is used to impute the area harvested.

PROPOSED METHODOLOGIES: LINEAR REGRESSION (EM: EXPECTATION-MAXIMISATION)
The model assumes a linear relationship between production and area harvested:
$$P_t = Y_t \times A_t$$
where $P_t$ is the production in year $t$, $A_t$ the area harvested in year $t$ and $Y_t$ the yield in year $t$.
Algorithm:
1) Use linear interpolation of the area to obtain starting values;
2) Repeat and update until the predicted values converge:
$$P_t = \alpha + \beta_1 \cdot \text{Trend} + \beta_2 A_t + \varepsilon_t \quad \text{(EM algorithm to impute } P_t\text{)}$$
$$A_t = \alpha + \beta_1 \cdot \text{Trend} + \beta_2 P_t + \varepsilon_t \quad \text{(EM algorithm to impute } A_t\text{)}$$

PROPOSED METHODOLOGIES: ARIMA MODEL
The ARIMA model must first be identified. Here an ARIMA(0,1,1) is used:
$$Y_t = Y_{t-1} + \alpha + \varepsilon_t - \theta_1 \varepsilon_{t-1}$$
Use the relation between production and area: treat both variables as time series and impute them using the EM algorithm.
R package mtsdi: impute $P_t$ and $A_t$ using the ARIMA model.

PROPOSED METHODOLOGIES: SPLINE MODEL
A form of interpolation where the interpolant is a special type of piecewise polynomial called a spline. For each interval, a polynomial function that fits the data well is estimated. Spline interpolation is preferred over polynomial interpolation because the interpolation error can be made small even when low-degree polynomials are used for the spline.
R package mtsdi: impute $P_t$ and $A_t$ using spline regression.

RESULTS
Real data are used to test the proposed methodologies: yield estimation, linear regression, ARIMA and spline. Linear interpolation is also included for comparison.
The data are from the CountrySTAT-Mali website and cover 1984 to 2012.
Missing data are generated randomly.
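The test procedure described above (remove values at random, impute them, compare with the true values) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the series is synthetic rather than the CountrySTAT-Mali data, only linear interpolation is evaluated, the relative error is assumed to be |imputed - true| / |true|, the standard deviation is the sample one, and all function names are hypothetical.

```python
import numpy as np


def mask_at_random(y, share, rng):
    """Return a copy of y with about `share` of its points set to NaN, plus their indices."""
    y = np.asarray(y, dtype=float).copy()
    interior = np.arange(1, len(y) - 1)   # keep both endpoints observed
    k = int(round(share * len(y)))
    idx = np.sort(rng.choice(interior, size=k, replace=False))
    y[idx] = np.nan
    return y, idx


def linear_interpolation(y):
    """Fill NaNs by linear interpolation between the neighbouring observed points."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    obs = ~np.isnan(y)
    return np.interp(t, t[obs], y[obs])


def relative_error_summary(truth, imputed, idx):
    """Min, max, mean and sample standard deviation of the relative errors."""
    rel = np.abs(imputed[idx] - truth[idx]) / np.abs(truth[idx])
    return {"min": rel.min(), "max": rel.max(),
            "mean": rel.mean(), "std": rel.std(ddof=1)}


rng = np.random.default_rng(42)
years = np.arange(1984, 2013)   # 29 annual observations, 1984 to 2012
# Synthetic "true" series: linear trend plus noise (illustrative values only).
truth = 300 + 8 * (years - 1984) + rng.normal(0, 20, size=len(years))

for share in (0.10, 0.20, 0.40):
    masked, idx = mask_at_random(truth, share, rng)
    summary = relative_error_summary(truth, linear_interpolation(masked), idx)
    print(f"{int(share * 100)}% missing:",
          {name: round(float(value), 3) for name, value in summary.items()})
```

As a check against the earlier worked example, linear_interpolation applied to the values 160, missing, missing, missing, 208 returns 172, 184 and 196 for the three gaps. Other imputation methods can be compared by swapping the interpolation function for another one with the same input and output.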

RESULTS: TEST CASE
Test case: maize, with 10% of the data missing.

RESULTS: TEST CASES
The same methods are applied again to the same dataset at different percentages of missing data.

RESULTS: RELATIVE ERRORS (MAIZE)
% Missing  Method        Min    Max    Mean   Std.Dev
10         Linear Int.   0.092  0.558  0.262  0.202
10         Yield         0.107  0.354  0.205  0.102
10         Linear Reg.   0.063  0.308  0.191  0.096
10         ARIMA         0.008  0.270  0.136  0.097
10         Spline        0.098  0.332  0.190  0.079
20         Linear Int.   0.011  1.142  0.249  0.303
20         Yield         0.061  0.540  0.238  0.126
20         Linear Reg.   0.014  0.758  0.312  0.253
20         ARIMA         0.050  0.517  0.231  0.142
20         Spline        0.034  0.281  0.142  0.076
40         Linear Int.   0.011  0.011  0.198  0.160
40         Yield         0.003  0.003  0.174  0.098
40         Linear Reg.   0.184  0.184  0.235  0.181
40         ARIMA         0.026  0.026  0.182  0.106
40         Spline        0.013  0.013  0.154  0.096

CONCLUSION
Across the 3 test cases, the spline method gives the smallest relative errors in most cases when the percentage of missing data is more than 10%.
The ARIMA method is better suited when less than 10% of the data in the dataset are missing.
The above tests use only two variables for the same crop (area and production). If more than 40% of the data are missing, it would be appropriate to use a third, correlated control variable.

THANK YOU