Lecture #5: MAPS WITH GAPS -- Small geographic area estimation, kriging, and kernel smoothing

Spatial statistics in practice
Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden

Topics for today's lecture
• The E-M algorithm
• The spatial E-M algorithm
• Kriging in ArcGIS
• Geographically weighted regression (GWR)
• Approaches to map smoothing

THEOREM 1
When missing values occur only in a response variable, Y, the iterative solution to the EM algorithm produces the regression coefficients calculated with only the complete data.

PF: Let b denote the vector of regression coefficients that is converged upon. Then if $\hat{Y}_m = X_m b$,

$$
b = \left[\begin{pmatrix} X_o \\ X_m \end{pmatrix}^{T} \begin{pmatrix} X_o \\ X_m \end{pmatrix}\right]^{-1} \begin{pmatrix} X_o \\ X_m \end{pmatrix}^{T} \begin{pmatrix} Y_o \\ X_m b \end{pmatrix}
  = (X_o^T X_o + X_m^T X_m)^{-1} (X_o^T Y_o + X_m^T X_m b)
$$

Premultiplying both sides by $(X_o^T X_o + X_m^T X_m)$ and cancelling $X_m^T X_m b$ gives

$$
X_o^T X_o \, b = X_o^T Y_o \;\Rightarrow\; b = (X_o^T X_o)^{-1} X_o^T Y_o = b_o
$$

THEOREM 2
When missing values occur only in a response variable, Y, then by replacing the missing values with zeros and introducing a binary 0/-1 indicator variable covariate $-I_m$ for each missing value m, such that $I_m$ is 0 for all but missing value observation m and 1 for missing value observation m, the estimated regression coefficient $b_m$ is equivalent to the point estimate for a new observation, and hence furnishes EM algorithm imputations.
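Theorems 1 and 2 can be checked numerically. The following is a minimal sketch, not part of the original lecture: the simulated data, the variable names, and the use of NumPy least squares are all assumptions made for illustration. It iterates the EM fill-in described in Theorem 1, then fits the augmented regression of Theorem 2 with zeros in Y and one $-I_m$ column per missing value.

```python
import numpy as np

# Simulated data (an assumption for illustration; not from the lecture).
rng = np.random.default_rng(0)
n, n_miss = 30, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

miss = np.zeros(n, dtype=bool)
miss[-n_miss:] = True                      # treat the last n_miss responses as missing
Xo, yo, Xm = X[~miss], y[~miss], X[miss]

# Complete-data OLS coefficients b_o.
b_o = np.linalg.lstsq(Xo, yo, rcond=None)[0]

# Theorem 1: EM iteration (refit on all n rows, re-impute X_m b, repeat).
y_fill = np.where(miss, yo.mean(), y)      # start from the observed mean
for _ in range(2000):
    b_em = np.linalg.lstsq(X, y_fill, rcond=None)[0]
    y_fill[miss] = Xm @ b_em

# Theorem 2: zeros in Y plus one -I_m indicator column per missing value.
y_zero = np.where(miss, 0.0, y)
neg_ind = -np.eye(n)[:, miss]              # n x n_miss block of -I_m columns
Z = np.column_stack([X, neg_ind])
coef = np.linalg.lstsq(Z, y_zero, rcond=None)[0]
b_aug, b_m = coef[:X.shape[1]], coef[X.shape[1]:]

# Each comparison should hold if the theorems apply to this setup.
print(np.allclose(b_em, b_o), np.allclose(b_aug, b_o), np.allclose(b_m, Xm @ b_o))
```

The single augmented fit replaces the whole iteration, which is the practical appeal of the indicator-variable formulation: any regression routine returns the imputations as ordinary coefficients.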
PF: Let $b_m$ denote the vector of regression coefficients for the missing values, and partition the data matrices such that

$$
\begin{pmatrix} b_o \\ b_m \end{pmatrix}
= \left[\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T} \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}\right]^{-1}
  \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T} \begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \begin{pmatrix} (X_o^T X_o)^{-1} & (X_o^T X_o)^{-1} X_m^T \\ X_m (X_o^T X_o)^{-1} & I_{mm} + X_m (X_o^T X_o)^{-1} X_m^T \end{pmatrix}
  \begin{pmatrix} X_o^T Y_o \\ 0_m \end{pmatrix}
$$

$$
\Rightarrow\; b_o = (X_o^T X_o)^{-1} X_o^T Y_o , \quad \text{and} \quad b_m = X_m b_o
$$

The EM algorithm solution

$$
Y = X\beta + \sum_{m=1}^{M} y_m(-I_m) + \varepsilon
$$

$$
\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \begin{pmatrix} 1_o \\ 1_m \end{pmatrix}\alpha
+ \begin{pmatrix} X_o \\ X_m \end{pmatrix}\beta
- \begin{pmatrix} 0_{o,m} \\ I_{m,m} \end{pmatrix} y_m
+ \begin{pmatrix} \varepsilon_o \\ 0_m \end{pmatrix}
$$

where: the missing values are replaced by 0 in Y, and $I_m$ is an indicator variable for missing value m that contains n - 1 0s and a single 1.

THEOREM 3
For imputations computed based upon Theorem 2, each standard error of the estimated regression coefficients $b_m$ is equivalent to the conventional standard deviation used to construct a prediction interval for a new observation, and as such furnishes the corresponding EM algorithm imputation standard error.

PF:

$$
\left[\begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}^{T} \begin{pmatrix} X_o & 0_{om} \\ X_m & -I_{mm} \end{pmatrix}\right]^{-1} \hat{\sigma}_\varepsilon^2
= \begin{pmatrix} (X_o^T X_o)^{-1} & (X_o^T X_o)^{-1} X_m^T \\ X_m (X_o^T X_o)^{-1} & I_{mm} + X_m (X_o^T X_o)^{-1} X_m^T \end{pmatrix} \hat{\sigma}_\varepsilon^2
$$

$$
\Rightarrow\; s_{b_m}^2 = \mathrm{diag}\left[ I_{mm} + X_m (X_o^T X_o)^{-1} X_m^T \right] \hat{\sigma}_\varepsilon^2
$$

What is the set of equations for the following case?

Data: 10, 7, 7, y4 = ?

$$
Y = 1\alpha + \sum_{m=1}^{M} y_m(-I_m) + \varepsilon
$$

$$
\begin{pmatrix} 10 \\ 7 \\ 7 \\ 0 \end{pmatrix}
= 8 \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}
- y_4 \begin{pmatrix} 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}
+ \begin{pmatrix} 2 \\ -1 \\ -1 \\ 0 \end{pmatrix}
$$

The last row, 0 = 8 - y4 + 0, gives the imputation ŷ4 = 8, the mean of the three observed values.

Some preliminary assessments

Calculations from ANCOVA regression and the EM algorithm

Data source                   Page    Quantity   Reported value   OLS/NLS estimate
McLachlan & Krishnan (1997)   p. 49   μ̂_2        14.61523         14.61523
                                      σ̂_12       20.88516         208.85156/10
                                      σ̂_22       26.75405         230.875/8 + 0.51953^2 (402/10 - 384/8) = 26.75407
                              p. 53   ŷ(1,1)     429.6978         429.69767
                                      ŷ(0,1)     324.0233         324.02326
                              p. 54   ŷ_23       4.73             4.73030
                                      ŷ_51       3.598            3.59697
Little & Rubin (1987)         p. 31   u1 estimate  7.8549         7.85492
                                      u2 estimate  7.9206         7.92063
                              p. 101  μ̂_2        49.3333          49.33333
                              p. 118  μ̂(x1)      6.655            6.65518
                                      μ̂(x2)      49.965           49.96524
                                      μ̂(x4)      27.047           27.03739

Calculations from ANCOVA regression and the EM algorithm

Data source      Page                 Quantity   Reported value           OLS/NLS estimate
Schafer (1997)   p. 43                           48.1000                  48.10000
                                                 59.4260                  59.42600
                 p. 195 simulations   ŷ_3,2      226.2 (average, n = 5)   228.0 (se = 32.86)
                                      ŷ_3,4      146.8 (average, n = 5)   146.2 (se = 38.37)
                                      ŷ_3,5      190.8 (average, n = 5)   192.5 (se = 34.11)
                                      ŷ_3,10     250.2 (average, n = 5)   271.7 (se = 36.20)
                                      ŷ_3,13     234.2 (average, n = 5)   241.3 (se = 35.18)
                                      ŷ_3,16     269.2 (average, n = 5)   269.9 (se = 34.53)
                                      ŷ_3,18     192.4 (average, n = 5)   201.9 (se = 32.91)
                                      ŷ_3,23     215.6 (average, n = 5)   207.4 (se = 33.09)
                                      ŷ_3,25     250.0 (average, n = 5)   255.7 (se = 33.39)

Simulated imputations: regressing Ŷ_reported on Y_imputed gives the coefficients 0.044 (intercept) and 0.987 (slope), with R^2 = 0.99.

EM algorithm solution for aggregated georeferenced data: vandalized turnip plots

    MTB > regress c4 8 c7-c14

    Regression Analysis: C4 versus C7, C8, C9, C10, C11, C12, C13, C14

    The regression equation is
    C4 = 28.9 - 6.32 C7 - 18.2 C8 - 1.10 C9 - 11.4 C10 - 10.1 C11
         + 28.9 C12 + 18.8 C13 + 27.8 C14

    Predictor            Coef   SE Coef       T       P
    Constant           28.900     2.404   12.02   0.000
    C7  [I1-I6]        -6.317     3.254   -1.94   0.063
    C8  [I2-I6]       -18.200     3.254   -5.59   0.000
    C9  [I3-I6]        -1.100     3.399   -0.32   0.749
    C10 [I4-I6]       -11.400     3.254   -3.50   0.002
    C11 [I5-I6]       -10.100     3.399   -2.97   0.006
    C12 [plot(6,5)]    28.900     5.887    4.91   0.000
    C13 [plot(5,6)]    18.800     5.887    3.19   0.004
    C14 [plot(6,6)]    27.800     5.887    4.72   0.000

    Analysis of Variance for C4
    Source   DF       SS      MS      F       P
    C5        5   1289.0   257.8   8.92   0.000
    Error    27    779.9    28.9
    Total    32   2068.9

    Level   N     Mean   StDev
    1       5   28.900   4.407
    2       6   22.583   6.391
    3       6   10.700   2.585
    4       5   27.800   5.082
    5       6   17.500   6.648
    6       5   18.800   5.922

    Pooled StDev = 5.375

    [Plot: individual 95% CIs for each level mean, based on pooled StDev; axis from 8.0 to 32.0]

Residual spatial autocorrelation
What does this mean?
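The standard errors in the Minitab output for the vandalized turnip plots can be cross-checked against Theorem 3. The sketch below is an illustration rather than part of the lecture. It assumes the model is a one-way classification, so that the prediction variance for a new observation in a group with N observed plots is the error variance times (1 + 1/N), and it takes N = 5 from the level table (levels 1, 6, and 4, whose means 28.9, 18.8, and 27.8 match the three imputations).

```python
import math

# Values read from the Minitab analysis-of-variance table.
ss_error, df_error = 779.9, 27
n_group = 5        # observed plots in each group containing a vandalized plot

mse = ss_error / df_error
sigma_hat = math.sqrt(mse)

# Standard error of a group mean (compare with SE Coef for Constant).
se_mean = sigma_hat * math.sqrt(1 / n_group)

# Standard deviation for predicting a new observation in that group
# (compare with SE Coef for C12, C13, and C14, the imputation coefficients).
se_prediction = sigma_hat * math.sqrt(1 + 1 / n_group)

print(round(se_mean, 3), round(se_prediction, 3))
```

Up to rounding of the printed sums of squares, the two values agree with the reported SE Coef entries of 2.404 and 5.887, which is what Theorem 3 asserts for the indicator-variable coefficients.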
SAR-based missing data estimation

$$
Y = \rho W Y + (I - \rho W) X \beta - \sum_{m=1}^{M} y_m \left( I_m - \rho W^{*}_{om} \right) + \varepsilon
$$

where $y_m$ is a missing value (replaced by 0 in Y), $I_m$ is an indicator variable for $y_m$, and $W^{*}_{om}$ is the mth column of geographic weights matrix W.

The Jacobian term, based on the determinant of the partitioned matrix $\begin{pmatrix} V_{oo} & V_{om} \\ V_{mo} & V_{mm} \end{pmatrix}$:

$$
J = -\frac{2}{n - n_m} \left[ \sum_{i=1}^{n} \mathrm{LN}(1 - \rho\lambda_i) - \sum_{k=1}^{n_m} \mathrm{LN}(1 - \rho\omega_k) \right]
$$

NOTE: denominator becomes (n - n_m)

What is the set of equations for the following case?

Data: 7, y2 = ?, 10

$$
Y = \rho W Y + (I - \rho W) 1 \mu - \sum_{m=1}^{M} y_m \left( I_m - \rho W^{*}_{om} \right) + \varepsilon
$$

$$
\begin{aligned}
7  &= \hat{\rho} \cdot 0 + \hat{\mu}(1 - \hat{\rho}) + \hat{\rho}\, y_2 + e_1 \\
0  &= \hat{\rho}\, \frac{y_1 + y_3}{2} + \hat{\mu}(1 - \hat{\rho}) - y_2 + e_2 \\
10 &= \hat{\rho} \cdot 0 + \hat{\mu}(1 - \hat{\rho}) + \hat{\rho}\, y_2 + e_3
\end{aligned}
$$

with y1 = 7 and y3 = 10.

spatial autoregressive (AR)

$$
\begin{pmatrix} Y_o \\ 0_m \end{pmatrix}
= \rho \begin{pmatrix} W_{oo} & W_{om} \\ W_{mo} & W_{mm} \end{pmatrix} \begin{pmatrix} Y_o \\ Y_m \end{pmatrix}
- \begin{pmatrix} 0_o \\ I_m \end{pmatrix} Y_m
+ (1 - \rho)\alpha \begin{pmatrix} 1_o \\ 1_m \end{pmatrix}
+ \begin{pmatrix} X_o \\ X_m \end{pmatrix} \beta
+ \begin{pmatrix} \varepsilon \\ 0 \end{pmatrix}
$$

kriging

$$
\hat{Y}_m = X_m \hat{\beta} + \hat{\Sigma}_{mo} \hat{\Sigma}_{oo}^{-1} (Y_o - X_o \hat{\beta})
$$

Annotations: estimate $\hat{\Sigma}_{mo}$ and $\hat{\Sigma}_{oo}$ with a semivariogram model; fit the semivariogram model with the residuals $Y_o - X_o \hat{\beta}$.

The pure spatial autocorrelation CAR model

$$
\hat{Y}_m = 1_m \hat{\beta}_0 + \hat{\rho} (I - \hat{\rho} C_{mm})^{-1} C_{mo} (Y_o - 1_o \hat{\beta}_0)
$$

NOTE: exactly the same algebraic structure as the kriging equation

Dispersed missing values:

$$
\hat{Y}_m = 1_m \hat{\beta}_0 + \hat{\rho}\, C_{mo} (Y_o - 1_o \hat{\beta}_0)
$$

Imputation = the observed mean plus a weighted average of the surrounding residuals

Employing rook's adjacency and a CAR model, what is the equation for the following imputation?

    10     3      7
     6   y5 = ?   4
     9     5      5

$$
\hat{y}_5 = b_0 + \hat{\rho} \left[ (3 - b_0) + (6 - b_0) + (4 - b_0) + (5 - b_0) \right]
$$

The spatial filter EM algorithm solution

$$
Y = X \beta_X - \sum_{m=1}^{M} y_m I_m + E_k \beta_{E_k} + \varepsilon
$$

where: the missing values are replaced by 0 in Y, and $I_m$ is an indicator variable for missing value m that contains n - 1 0s and a single 1.

Imputation of turnip production in 3 vandalized field plots

Field plot   Conventional EM   Spatial SAR-EM estimate   Spatial filter:
             estimate          (ρ̂_SAR = 0.443)           3 selected eigenvectors
(6,5)        28.9              29.99                     24.31
(5,6)        18.8              17.66                     13.62
(6,6)        27.8              28.26                     23.93

Cressie's PA coal ash model

Observed data:   min 7.00   mean 9.78   max 17.61

Source           Estimate
Cressie          10.27%
Spherical        10.62%
Gaussian         10.18%
exponential      10.12%
SAR              10.17%
spatial filter   10.71%

Unconstrained and constrained missing value estimates for the Little and Rubin (1987, p. 118) example

Variable &     Unconstrained   Non-negative   Non-negative &       Reported
observation                    constraint     totals constraints   values
x1,10          12.9            12.8           15.9                 21
x1,11          -0.5             0.0            1.2                  1
x1,12          10.0            10.0           13.0                 11
x1,13          10.1            10.1           12.9                 10
x2,10          65.8            66.0           59.0                 47
x2,11          48.2            46.9           44.4                 40
x2,12          68.1            68.1           61.4                 66
x2,13          62.4            62.5           56.2                 68
x4,7            0.8             0.8            6.7                  6
x4,8           37.9            37.9           44.0                 44
x4,9           20.0            20.0           24.4                 22
x4,10          14.5            14.4           17.9                 26
x4,11          20.8            21.6           29.4                 34
x4,12           8.2             8.2           13.6                 12
x4,13          14.5            15.4           20.0                 12

Missing 1992 georeferenced density of milk production in Puerto Rico: constrained (total = 1918)

Predicted from 1991 DMILK       235   1,339   344
Predicted from spatial filter    70   1,848     0
Predicted from both             385   1,065   468

USDA-NASS estimation of Pennsylvania crop production
[Figure panels: predictions; Moran scatterplot; covariate; total constraints; map gaps]

USDA-NASS estimation of Michigan crop production
[Figure panels: different response variable specifications; Michigan imputations]

If this is 2% milk, how much am I paying for the other 98%?

USDA-NASS estimation of Tennessee crop production
[Figure panel: Tennessee imputations]

An EM specification when some data for both Y and the Xs are missing

$$
\begin{pmatrix} Y_o \\ Y_{x,m} \\ 0_{y,m} \end{pmatrix}
= \begin{pmatrix} 1_o \\ 1_{x,m} \\ 1_{y,m} \end{pmatrix} \alpha_Y
+ \begin{pmatrix} X_o \\ 0_{x,m} \\ X_{y,m} \end{pmatrix} \beta_Y
+ \begin{pmatrix} 0_{o,x,m} \\ I_{x,m} \\ 0_{y,x,m} \end{pmatrix} X_{x,m} \beta_Y
- \begin{pmatrix} 0_{o,y,m} \\ 0_{x,y,m} \\ I_{y,m} \end{pmatrix} Y_{y,m}
+ \varepsilon_Y
$$

$$
\begin{pmatrix} X_o \\ 0_{x,m} \\ X_{y,m} \end{pmatrix}
= \begin{pmatrix} 1_o \\ 1_{x,m} \\ 1_{y,m} \end{pmatrix} \alpha_X
+ \begin{pmatrix} Y_o \\ Y_{x,m} \\ 0_{y,m} \end{pmatrix} \beta_X
- \begin{pmatrix} 0_{o,x,m} \\ I_{x,m} \\ 0_{y,x,m} \end{pmatrix} X_{x,m}
+ \begin{pmatrix} 0_{o,y,m} \\ 0_{x,y,m} \\ I_{y,m} \end{pmatrix} Y_{y,m} \beta_X
+ \varepsilon_X
$$

Concatenation results: [equations eq1 and eq2]

Calculations from ANCOVA regression and the EM algorithm

Data source                   Page    Quantity                     Reported value   OLS/NLS estimate
McLachlan & Krishnan (1997)   p. 91   saddle point: σ_11, σ_22, ρ  5/2, 5/2, 0      5/2, 5/2, 0
                                      maxima: σ_11, σ_22, ρ        8/3, 8/3, 0.5    2.87977, 2.87977, 0.88817
Schafer (1997)                p. 54   σ̂_11 = σ̂_22                  1.80             18/10
                                      ρ̂                            -1               -1
                                      μ̂_1 = μ̂_2                    0                0

The spatial model
[Equations: spatial model specifications for yield, acres, and production. Components labelled on the slides: covariate, spatial autocorrelation, totals constraints, power transformation. Residuals: residual_yield, residual_acres, residual_production.]

Imputation of turnip production in 3 vandalized field plots

Field plot   Spatial filter: 3 selected eigenvectors
(6,5)        24.31
(5,6)        13.62
(6,6)        23.93

Cross-validation of spatial filter for observed turnip data

Kriging: best linear unbiased spatial interpolator (i.e., predictor)

The accompanying table contains a test set of sixteen random samples (#17-32) used to evaluate three maps. The "Actual" column lists the measured values for the test locations identified by "Col, Row" coordinates. The differences between these values and those predicted by the three interpolation techniques form the residuals shown in parentheses. The "Average" column gives the comparison against the whole-field arithmetic mean of 23 (guess 23 everywhere) for each test location.

ArcGIS: Geostatistical Wizard

density of German workers
• anisotropy check
• Cross-validation check of kriged values. This is one use of the missing spatial data imputation methods.
• Unclipped kriged surface (exponential semivariogram model; extrapolation): kriged (mean response) surface, values increase with darkness of brown; prediction error surface
• Clipped kriged surface: kriged (mean response) surface, values increase with darkness of brown; prediction error surface

Detrended population density across China
• anisotropy check
• Cross-validation check of kriged values. This is one use of the missing spatial data imputation methods.
• Unclipped kriged surface (exponential semivariogram model; extrapolation): kriged (mean response) surface, values increase with darkness of brown; prediction error surface
• Clipped kriged surface: kriged (mean response) surface, values increase with darkness of brown; prediction error surface

THEOREM 4
The maximum likelihood estimate for missing georeferenced values described by a spatial autoregressive model specification is equivalent to the best linear unbiased predictor kriging equation of geostatistics.
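The equivalence in Theorem 4 can be illustrated numerically with the 3-by-3 rook's adjacency example from earlier in the lecture. This is a hedged sketch, not the lecture's own computation. It uses the CAR form of the imputation equation rather than a full SAR maximum likelihood fit, treats the covariance matrix as proportional to (I - ρC)^{-1}, and plugs in an assumed ρ = 0.2 and the observed mean for b0. The helper `rook_adjacency` is a hypothetical name introduced here.

```python
import numpy as np

def rook_adjacency(nrow, ncol):
    """Binary rook adjacency matrix C for an nrow x ncol grid (row-major order)."""
    n = nrow * ncol
    C = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            if r + 1 < nrow:                  # neighbour below
                C[i, i + ncol] = C[i + ncol, i] = 1.0
            if c + 1 < ncol:                  # neighbour to the right
                C[i, i + 1] = C[i + 1, i] = 1.0
    return C

# Grid values from the rook's adjacency example; the centre value y5 is missing.
y = np.array([10, 3, 7, 6, np.nan, 4, 9, 5, 5], dtype=float)
miss = np.isnan(y)
rho = 0.2                   # assumed value, inside the feasible range for this C
C = rook_adjacency(3, 3)
b0 = y[~miss].mean()        # assumed plug-in for the estimated mean
resid_o = y[~miss] - b0

# CAR form: b0 + rho (I - rho C_mm)^{-1} C_mo (y_o - b0)
C_mm = C[np.ix_(miss, miss)]
C_mo = C[np.ix_(miss, ~miss)]
car = b0 + rho * np.linalg.solve(np.eye(miss.sum()) - rho * C_mm, C_mo @ resid_o)

# Kriging form: b0 + Sigma_mo Sigma_oo^{-1} (y_o - b0), with Sigma = (I - rho C)^{-1}
Sigma = np.linalg.inv(np.eye(len(y)) - rho * C)
S_mo = Sigma[np.ix_(miss, ~miss)]
S_oo = Sigma[np.ix_(~miss, ~miss)]
krig = b0 + S_mo @ np.linalg.solve(S_oo, resid_o)

print(car, krig)            # the two imputations of y5
```

The agreement follows from the partitioned-inverse identity relating the blocks of a covariance matrix to those of its inverse, so the same comparison should hold for any ρ that keeps I - ρC positive definite.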
Geographically weighted regression: GWR

Spatial filtering enables easier implementation of GWR, as well as proper assessment of its dfs.
• Step #1: compute the eigenvectors of a geographic connectivity matrix, say C
• Step #2: compute all of the interaction terms $X_j E_k$ for the P covariates times the K candidate eigenvectors (e.g., with MC > 0.25)
• Step #3: select from the total set, including the individual eigenvectors, with stepwise regression
• Step #4: the geographically varying intercept term is given by:

$$
a_i = a + \sum_{k=1}^{K} E_{i,k}\, b_{E_k}
$$

• Step #5: the geographically varying covariate coefficient is given by factoring $X_j$ out of its appropriate selected interaction terms:

$$
b_{i,j} X_j = \left( b_j + \sum_{k=1}^{K} E_{i,k}\, b_{X_j E_k} \right) X_j
$$

A Puerto Rico DEM example

Mean elevation (Y) is a function of: standard deviation of elevation (X), eigenvectors E1-E18, and 18 interaction terms (XE)

Results
intercept: 1, E2, E5-E7, E9, E11-E13, E15, E18
slope: 1, E4, E6, E9, E10
R^2 increases from 0.576 (with X only) to 0.911 (with geographically varying coefficients)
P(S-W) = 0.52 for the final model

[Map: GWR-spatial filter intercept (MC = 0.692)]
[Map: GWR-spatial filter slope (MC = 0.721)]

Spatial moving averages

Local smoothing of attribute values

$$
\hat{\mu}_i = \frac{\sum_{j=1}^{n} w_{ij}\, y_j}{\sum_{j=1}^{n} w_{ij}}, \qquad i = 1, 2, \ldots, n
$$

where: $w_{ij}$ is an element of a spatial weights matrix
$y_j$ is the attribute value for areal unit j
n is the number of areal units

A summary: what have we learned during the 5 lectures?

Lecture #1
• The nature of data and its information content.
• What is spatial autocorrelation?
• Visualizing spatial autocorrelation: Moran scatterplots, semivariogram plots, and maps.
• Defining and articulating spatial structure: topology and distance perspectives; contagion and hierarchy concepts.
• Necessary concepts from multivariate statistics.
• An example of the elusive negative spatial autocorrelation.
• Some comments about spatial sampling.
• Implications about space-time data structure.
Lecture #2
• Multivariate grouping, and location-allocation modeling.
• Going from the global to the local: variability and heterogeneity.
• Impacts of spatial autocorrelation on histograms.
• The LISA and Getis-Ord statistics.
• Cluster analysis: multivariate analysis, cluster detection, and spider diagrams.
  – An overview of geographic and space-time clusters.
• Regression diagnostics and geographic clusters

Lecture #3
• Autoregressive specifications and normal curve theory (PROC NLIN).
• Auto-binomial and auto-Poisson models: the need for MCMC.
• Relationships between spatial autoregressive and geostatistical models
• Spatial filtering specifications and linear and generalized linear models (PROC GENMOD).
• Autoregressive specifications and linear mixed models (PROC MIXED).
• Implications for space-time datasets (PROC NLMIXED)

Lecture #4
• Frequentist versus Bayesian perspectives.
• Implementing random effects models in GeoBUGS.
• Spatially structured and unstructured random effects: the CAR, the ICAR, and the spatial filter specifications

Lecture #5
• The E-M algorithm
• The spatial E-M algorithm
• Kriging in ArcGIS
• Approaches to map smoothing