Geographically Weighted Regression

A Stewart Fotheringham, Martin Charlton, Chris Brunsdon
Stewart.Fotheringham@nuim.ie
http://ncg.nuim.ie/GWR

Regression
In a typical linear regression model applied to spatial data we assume a stationary process, in which the same stimulus provokes the same response in all parts of the study region:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_n x_{ni} + \varepsilon_i$$
so that...
The parameter estimates obtained in the calibration of such a model are constant over space:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

The assumption of stationarity in regression
$$y_i = \alpha + \beta x_i$$
[Figure: labels β1 and β2]
The assumption is that the values of β are the same everywhere.

Why might measured relationships vary spatially?
• Sampling variation
• Relationships intrinsically different across space, e.g. differences in attitudes, preferences or different administrative, political or other contextual effects produce different responses to the same stimuli
• Model misspecification: suppose a global statement can ultimately be made, but the models are not properly specified to allow us to make it. Local models are a good indicator of how the model is misspecified.
Can all contextual effects ever be removed? Can all significant variations in local relationships be removed?

Consequently… if there is spatial nonstationarity, we only see it through the residuals
• We might map the residuals from the regression to determine whether there are any spatial patterns.
• Or compute an autocorrelation statistic for the residuals.
• We might even try to ‘model’ the error dependency with various types of spatial regression models.

However...
Why not address the issue of spatial nonstationarity directly and allow the relationships we are measuring to vary over space?
This is the essence of GWR:
$$y_i = \beta_0(i) + \beta_1(i) x_{1i} + \beta_2(i) x_{2i} + \dots + \beta_n(i) x_{ni} + \varepsilon_i$$
with the estimator
$$\hat{\beta}(i) = (X^T W(i) X)^{-1} X^T W(i) Y$$
where W(i) is a matrix of weights specific to location i such that observations nearer to i are given greater weight than observations further away.
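As a rough illustration of the GWR estimator above, the sketch below computes the weighted least-squares estimate at a single location with NumPy. The function name `gwr_local_estimate`, the synthetic data and the weight vector `w` (standing for the diagonal of W(i)) are assumptions made for this example and are not part of the GWR software. When every weight is equal, the local estimate should coincide with the global one.

```python
import numpy as np

def gwr_local_estimate(X, y, w):
    """Return (X^T W X)^{-1} X^T W y for W = diag(w) (hypothetical helper)."""
    Xw = X * w[:, None]                      # rows of X scaled by their weights, i.e. W X
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Synthetic data (assumed for illustration): an intercept plus one predictor.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

beta_global = np.linalg.solve(X.T @ X, X.T @ y)      # ordinary least squares
beta_equal = gwr_local_estimate(X, y, np.ones(n))    # all weights equal to 1
beta_local = gwr_local_estimate(X, y, rng.uniform(0.1, 1.0, size=n))  # unequal weights

print(beta_global, beta_equal, beta_local)
```

In a full GWR run the weight vector would be recomputed for every regression point from its distances to the data points.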
$$W(i) = \begin{pmatrix} w_{i1} & 0 & 0 & \cdots & 0 \\ 0 & w_{i2} & 0 & \cdots & 0 \\ 0 & 0 & w_{i3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & w_{in} \end{pmatrix}$$
where $w_{in}$ is the weight given to data point n for the estimate of the local parameters at location i.

A Typical Spatial Weighting Function

Weighting schemes
Numerous weighting schemes can be used, although they tend to be Gaussian or ‘Gaussian-like’, reflecting the type of dependency found in most spatial processes.
Weighting schemes can be either fixed or adaptive.

A Fixed Weighting Scheme
For each location i at which the local regression model is calibrated,
$$w_{ij} = \exp\left[-\tfrac{1}{2} (d_{ij} / h)^2\right]$$
where $d_{ij}$ is the distance between locations i and j, and h is the bandwidth: as h increases, the gradient of the kernel becomes less steep and more data points are included in the local calibration.
We need to find the optimal value of h in the GWR routine.

A Spatially Adaptive Weighting Function
$$w_{ij} = \begin{cases} \left[1 - (d_{ij}^2 / h^2)\right]^2 & \text{if } j \text{ is one of the } N \text{ nearest neighbours of } i \\ 0 & \text{otherwise} \end{cases}$$
with h the distance from i to its Nth nearest neighbour.
Here, we find the optimal value of N in the GWR routine.

Calibration
The results of GWR appear to be relatively insensitive to the choice of weighting function, as long as it is a continuous distance-based function.
Whichever weighting function is used, the results will, however, be sensitive to the degree of distance-decay.
Therefore an optimal value of either h or N has to be obtained. This can be found by minimising a cross-validation score (CV) or the corrected Akaike Information Criterion (AICc), where...
$$CV = \sum_i \left[y_i - \hat{y}_{\neq i}(h)\right]^2$$
where $\hat{y}_{\neq i}(h)$ is the fitted value of $y_i$ with data from point i omitted from the calibration. Lower values of CV indicate better model fits.
$$AICc = \text{Deviance} + 2k \left[n / (n - k - 1)\right]$$
where n is the number of data points and k is the number of parameters in the model. Lower values of AICc indicate better model fits.
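The two kernels and the cross-validation score above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the synthetic data and the candidate bandwidths are invented for the example, the adaptive kernel takes h as the distance to the Nth nearest point, and the bandwidth is chosen by a plain grid search, which is not necessarily the optimiser used inside GWR 3.0.

```python
import numpy as np

def gaussian_weights(d, h):
    """Fixed kernel: w_ij = exp[-0.5 * (d_ij / h)^2]."""
    return np.exp(-0.5 * (d / h) ** 2)

def bisquare_weights(d, N):
    """Adaptive kernel; h is taken as the distance to the Nth nearest point (assumption)."""
    h = np.sort(d)[N - 1]
    return np.where(d < h, (1.0 - (d / h) ** 2) ** 2, 0.0)

def cv_score(coords, X, y, h):
    """Leave-one-out CV score for a fixed Gaussian kernel with bandwidth h."""
    total = 0.0
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = gaussian_weights(d, h)
        w[i] = 0.0                                   # omit point i from its own calibration
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)   # local estimate without point i
        total += (y[i] - X[i] @ beta) ** 2
    return total

# Synthetic data (assumed): the slope drifts from 0 to 5 across the study region.
rng = np.random.default_rng(1)
n = 300
coords = rng.uniform(0.0, 10.0, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * coords[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=n)

candidates = [1.0, 2.0, 4.0, 100.0]                  # hypothetical bandwidths to compare
scores = {h: cv_score(coords, X, y, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(scores, best_h)
```

The largest candidate behaves almost like a global regression, so on data with a drifting slope its CV score should be clearly worse than that of the smaller bandwidths.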
Bandwidth Selection
Optimal bandwidth selection is a trade-off between bias and variance:
• Too small a bandwidth leads to large variance in the local estimates
• Too large a bandwidth leads to large bias in the local estimates

Bias-Variance Trade-Off

Output from GWR
The main output from GWR is a set of location-specific parameter estimates which can be mapped and analysed to provide information on spatial nonstationarity in relationships.

An Example using Educational Attainment Data in Georgia

In GWR, we can also ...
• estimate local standard errors
• derive local t statistics
• calculate local goodness-of-fit measures
• perform tests to assess the significance of the spatial variation in the local parameter estimates
• perform tests to determine if the local model performs better than the global one, accounting for differences in degrees of freedom

A Simulation Experiment
$$Y_i = \alpha_i + \beta_{1i} X_{1i} + \beta_{2i} X_{2i}$$
Data on X1 and X2 are drawn randomly for 2500 locations on a 50 x 50 matrix such that r(X1, X2) is controlled. Results are shown to be independent of r(X1, X2).

Experiment 1 (parameters spatially invariant)
αi = 10 for all i
β1i = 3 for all i
β2i = -5 for all i
Yi obtained from above
Data used to calibrate model by global regression and by GWR

Results…
Global: Adj. R2 = 1.0; AIC = -59,390; K = 3
        α (est.) = 10; β1 (est.) = 3; β2 (est.) = -5
GWR:    Adj. R2 = 1.0; AIC = -59,386; N = 2,434; K = 6.5
        αi (est.) = 10 for all i; β1i (est.) = 3 for all i; β2i (est.) = -5 for all i
Conclusion: GWR does NOT appear to suggest any spurious nonstationarity when relationships are constant

Experiment 2 (parameters spatially variant)
With grid coordinates 0 ≤ i ≤ 50 and 0 ≤ j ≤ 50:
αi  = 0 + 0.2i + 0.2j     (0 to 20)
β1i = -5 + 0.1i + 0.1j    (-5 to 5)
β2i = -5 + 0.2i + 0.2j    (-5 to 15)
Yi obtained in same way
Data used to calibrate model by global regression and by GWR

Results…
Global: Adj. R2 = 0.04; AIC = 17,046; K = 3
        α (est.) = 10.26; β1 (est.) = -0.1; β2 (est.) = 5.28
        These are close to the averages of the local estimates (10; 0; 5)
GWR:    Adj. R2 = 0.997; AIC = 2,218; K = 167; N = 129
        αi (est.) range = 2 to 18.6
        β1i (est.) range = -4.3 to 4.7
        β2i (est.) range = -3.9 to 13.6
Conclusion: GWR identifies spatial nonstationarity in relationships; the global model fails completely.
[Figure panels: 0 ≤ α(i) ≤ 20; -5 ≤ β1(i) ≤ 5; -5 ≤ β2(i) ≤ 15]

An Empirical Example: House Prices in London
• 1990 sales price data for 12,493 houses in London, along with various attributes of each property and a postcode, so that locations down to 100m can be obtained via the Central Postcode Directory
• neighbourhood data obtained for enumeration districts (via a postcode-to-ED look-up table)
[Figure: Locations of house sales in data set]

Hedonic Price Modelling
• Very common tool to examine determinants of house prices
• House prices are related to determinants (usually) in the form of a linear-in-parameters model, generally calibrated by some form of regression
• Problem: a global model is almost always assumed, where “one size fits all”

Explanatory Variables
• Floor area
• House type (detached; semi-d; flat etc)
• Date built
• Garage
• Central heating
• 2+ bathrooms
• % professionals in neighbourhood
• % unemployed in neighbourhood
• Distance to centre of London

Global Regression Parameter Estimates
Variable      Parameter Estimate   T value
Intercept                 58,900      23.3
FLRAREA                      697      49.3
FLRDETACH*                   205       7.5
FLRFLAT*                    -123      -5.6
FLRBNGLW*                    -87      -1.4
FLRTRRCD*                   -119      -6.2
BLDPWW1**                 -2,340      -3.9
BLDPOSTW**                -2,786      -3.1
BLD60S**                  -5,177      -5.0
BLD70S**                  -2,421      -2.1
BLD80S**                   6,315       6.9
GARAGE                     5,956      10.6
CENHEAT                    7,777      12.4
BATH2+                    22,297      19.1
PROF                          72       3.0
UNEMPLOY                    -211      -5.5
ln(DISTCL)               -18,137     -30.1
R2 = 0.60
* Excluded house type is Semi-detached
** Excluded age is Inter-war 1914-1939

Using GWR
In this case an adaptive kernel is used: a bisquare function.
Calibration yielded an optimal number of nearest neighbours = 931.
Results are presented in a series of parameter surfaces; those shown all have significant spatial variation.
[Figure: Value of terraced property, £/m2 (global estimate = £578)]
[Figure: Pre-1914 housing compared to inter-war (global estimate = -£2,340)]
[Figure: 1960s housing compared to inter-war (global estimate = -£5,177)]

Residuals from GWR are generally much lower and are not spatially autocorrelated
• GWR models give much better fits to data, even accounting for increases in the number of parameters
• GWR residuals are generally not spatially autocorrelated, so reducing/removing the need for spatial regression models
[Figure: Residuals from Global Model]
[Figure: Residuals from GWR Model]

Assessing whether the spatial variation in measured relationships might be important (i.e. the variation is unlikely to be just a product of sampling variation)
1. Monte-Carlo tests
2. Local t values
3. Variability of local parameter estimates

1. Monte-Carlo Tests
1. Obtain local parameter estimates and calculate the variance of the estimates
2. Rearrange the data randomly across the zones (keeping Yi, X1i, X2i, … Xni together)
3. Compute a new set of local parameter estimates based on the rearranged data
4. Repeat steps 2 and 3 LOTS of times, each time computing the variance of the local estimates
5. Compare the variance of the local parameter estimates in step 1 with those from steps 2 and 3
6. The p value associated with step 1 is then the proportion of variances that lie above that for step 1 in a list of variances sorted high to low.
Can do this very easily within GWR3.0!
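The six Monte-Carlo steps can be sketched as follows. This is an illustrative reading of the procedure, not the GWR 3.0 implementation: the function names, the synthetic data, the fixed Gaussian bandwidth of 2 units and the 49 permutations are all assumptions, and the p value is computed as the number of permuted variances strictly above the observed one divided by the length of the list (permutations plus the observed value).

```python
import numpy as np

def local_estimates(W, X, y):
    """Row i is the weighted least-squares estimate at location i, using weights W[i]."""
    betas = np.empty((len(y), X.shape[1]))
    for i in range(len(y)):
        Xw = X * W[i][:, None]
        betas[i] = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return betas

def monte_carlo_pvalues(W, X, y, n_perm=49, seed=0):
    """Permutation p value for the variance of each local parameter estimate."""
    rng = np.random.default_rng(seed)
    observed = local_estimates(W, X, y).var(axis=0)           # step 1
    n_above = np.zeros(X.shape[1])
    for _ in range(n_perm):                                   # step 4
        perm = rng.permutation(len(y))                        # step 2: y and X rows move together
        shuffled = local_estimates(W, X[perm], y[perm]).var(axis=0)  # step 3
        n_above += shuffled > observed                        # step 5
    return n_above / (n_perm + 1)                             # step 6

# Synthetic data (assumed): constant intercept, slope drifting from 0 to 5.
rng = np.random.default_rng(2)
n = 200
coords = rng.uniform(0.0, 10.0, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 0.5 * coords[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fixed Gaussian kernel; W[i, j] is the weight of data point j at location i.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
W = np.exp(-0.5 * (d / 2.0) ** 2)

p_values = monte_carlo_pvalues(W, X, y)
print(p_values)
```

Because the locations stay fixed and only the data rows are shuffled, the weight matrix `W` needs to be computed once.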
An example from the Georgia data

*************************************************
*                                               *
*  Test for spatial variability of parameters   *
*                                               *
*************************************************
Tests based on the Monte Carlo significance test
procedure due to Hope [1968,JRSB,30(3),582-598]

Parameter     P-value
----------    -------------
Intercept     0.29000
TotPop90      0.10000
PctRural      0.24000
PctEld        0.75000
PctFB         0.00000
PctPov        0.59000
PctBlack      0.02000

[Figure: Spatial variation in the % FB local parameter estimate]
[Figure: Spatial variation in the % black local parameter]

2. Local t values
$$t_i = \hat{\beta}_i / SE(\hat{\beta}_i)$$
Map these values. Look for areas on the map with $t_i$ values > 2 and/or $t_i$ values < -2.
Local t values are provided automatically in the GWR3.0 output file and can be mapped in ArcGIS.

3. Variability of local estimates
To examine the ‘importance’ of the spatial variability of any relationship, compare the variation of the local parameter estimates from GWR with the SE of the global parameter estimate. This can be done in two ways.

3.1 Calculate the following index:
I = S.D. of local estimates / S.E. of global estimate

An example from the Georgia data set
Par.    S.D. local   S.E. global   I      M-C p value
Int     1.744        1.753         0.99   0.42
%Rur    0.012        0.014         0.86   0.41
%Eld    0.095        0.130         0.73   0.60
%FB     0.985        0.307         3.21   0.00
%Pov    0.099        0.072         1.38   0.50
%Bla    0.045        0.026         1.73   0.01
Very rough ‘Rule of Thumb’: potentially interesting spatial variation if I > 1.5

3.2 Use the inter-quartile range in the listing file
• 50% of the local parameter estimates will lie within the inter-quartile range.
• Approx. 68% of the values in a Normal distribution lie within ± 1 SD of the mean.
• The global parameter estimate is the mean of a Normal distribution.
Therefore if the inter-quartile range of the local estimates is greater than 2 x S.E. of the global estimate (the width of the ± 1 SD band), this is indicative of a possible non-stationary process.

An example from the Georgia data set
Parameter   2 x S.E. global   Inter-quartile range (local)   IQR > 2 x S.E.?
Int         3.506             3.400                          X
%Rur        0.028             0.019                          X
%Eld        0.260             0.109                          X
%FB         0.614             2.034                          √
%Pov        0.144             0.094                          X
%Bla        0.052             0.080                          √
Same interpretations as M-C tests

Can Use GWR as a ‘Spatial Microscope’
Instead of determining an optimal bandwidth during the calibration of a GWR model, a bandwidth can be input a priori.
A series of bandwidths can be selected and the resulting parameter surfaces examined at different levels of smoothing.
For example, consider a very simple model of house prices regressed on floor area for 570 houses in Tyne & Wear, North East England.
Surfaces of the local floorspace parameter are derived for bandwidths corresponding to 400, 350, 300, 250, 200, 150, 100 and 50 nearest neighbours (NN).

OK, so how do you do all this?

Running the GWR software: GWR 3.0
First, create a data file…
• File type xxxx.dat (xxxx.csv)
• First line is a comma separated list of variable names (<= 8 characters)
• Data lines have numeric items only, terminated by a carriage return
• One line of data per location
• Space or comma delimited (easily created in Excel)

Example of a data file: Georgia.dat
The first 10 lines of the file…
ID,Lat,Lon,TotPop90,PctRural,PctBach,PctEld,PctFB,PctPov,PctBlack
13001,31.753389,-82.285580,15744,75.6,8.2,11.43,0.635,19.9,20.76
13003,31.294857,-82.874736,6213,100.0,6.4,11.77,1.577,26.0,26.86
13005,31.556775,-82.451152,9566,61.7,6.6,11.11,0.272,24.1,15.42
13007,31.330837,-84.454013,3615,100.0,9.4,13.17,0.111,24.8,51.67
13009,33.071932,-83.250851,39530,42.7,13.3,8.64,1.432,17.5,42.39
13011,34.352696,-83.500539,10308,100.0,6.4,11.37,0.340,15.1,3.49
13013,33.993471,-83.711811,29721,64.6,9.2,10.63,0.922,14.7,11.44
13015,34.238402,-84.839182,55911,75.2,9.0,9.66,0.816,10.7,9.21
13017,31.759395,-83.219755,16245,47.0,7.6,12.81,0.332,22.0,31.33

Starting GWR
The software is usually stored in the C:\GWR3 folder and the program is called GWR30.exe. It can be started from:
• Start/Programs/Geographically Weighted Regression
• Desktop icon
• Explorer
This brings up the GWR Wizard.
You have a number of options to choose from in creating and running a GWR model.
The job of the Wizard is to provide suitable guidance in making the right choices.
• If you want to create and run a new GWR model, click on the option ‘Create a new model’
• If you created and saved a GWR model in a previous session and you want to access this, click on ‘Open an existing model using the GWR model editor’

Inputting data: click and drag the .dat or .csv file from the appropriate folder

Regression Points
• Do you want to run GWR at the data point locations or some other set of locations?

Name the output file…
Three types of output file are possible:
.e00   ArcInfo export file
.mif   MapInfo file
.csv   comma separated variable file

The Model Editor
To specify a dependent variable, highlight it in the list on the left and click on the [->] symbol.
To specify independent variables, highlight them individually in the list on the left and click on the [->] symbol.
To specify location variables, highlight them individually in the list on the left and click on the corresponding [->] symbols.
Next you specify the type of kernel: this can be fixed (Gaussian) or adaptive (bisquare).
You can either preset the bandwidth in the units that the location variables are measured in (for example, metres) or, if you want the program to determine the optimal bandwidth, specify either cross-validation or AIC minimisation.
For large files, there is a sampling option to speed the process.
Select the type of coordinate system you are using for your location variables: the choice is either Cartesian or spherical.
The type of output in the printed listing can also be controlled by clicking on Model Options.

Bandwidth selection
Bandwidth            AICc
56.043532255000      913.159190588348
84.500000000000      885.119969660068
112.956467745000     872.910381423844
130.543532046749     868.887720190066
141.412935569545     869.149708997055
123.825871267796     870.450868861077
134.695274741431     869.114420384913
127.977613962479     869.551269557617
** Convergence after 8 function calls
** Convergence: Local Sample Size= 131
Useful if you want to plot the relationship to see how steep or flat it is.

List predictions…
Predictions from this model...
Obs   Y(i)     Yhat(i)   Res(i)    X(i)      Y(i)
  1    8.200    9.006    -0.806   -82.286   31.753   F
  2    6.400    6.958    -0.558   -82.875   31.295   F
  3    6.600    8.524    -1.924   -82.451   31.557   F
  4    9.400    8.308     1.092   -84.454   31.331   F
  5   13.300   13.835    -0.535   -83.251   33.072   F
  6    6.400    8.910    -2.510   -83.501   34.353   F
  7    9.200   11.760    -2.560   -83.712   33.993   F
  8    9.000   11.446    -2.446   -84.839   34.238   F
  9    7.600   10.231    -2.631   -83.220   31.759   F
 10    7.500    9.104    -1.604   -83.232   31.274   F
If you have requested an output file, this information and the diagnostics are also written to this file.
List Pointwise Diagnostics
Obs   Observed   Predicted   Residual   Std Resid   R-Square   Influence   Cook's D
----- ---------- ----------- ---------- ----------- ---------- ----------- ----------
  1    8.20000    8.84819    -0.64819   -0.182251   0.784156   0.032346    0.000073
  2    6.40000    6.39738     0.00262    0.000759   0.775286   0.085459    0.000000
  3    6.60000    8.48954    -1.88954   -0.549871   0.782090   0.096687    0.002126
  4    9.40000    8.35258     1.04742    0.302776   0.808351   0.084519    0.000556
  5   13.30000   14.60358    -1.30358   -0.377493   0.834522   0.087768    0.000901
  6    6.40000    8.29036    -1.89036   -0.538081   0.839070   0.055846    0.001125
  7    9.20000   12.02529    -2.82529   -0.794934   0.841344   0.033697    0.001448
  8    9.00000   10.97210    -1.97210   -0.554246   0.846250   0.031492    0.000656
  9    7.60000   10.73917    -3.13917   -0.892715   0.778960   0.054083    0.002994
 10    7.50000    9.04908    -1.54908   -0.444086   0.778295   0.069182    0.000963

The Model Editor
Once the model specification is completed, give the file a title and save it before you run it. The saved file can be used in later sessions.
Specify where the listing of the output will be saved, and then click on ‘Run’.
The local parameter estimates will be saved in your named output file, e.g. georgia.e00
This can be used for subsequent mapping in ArcGIS…

Summary
• GWR appears to be a useful method to investigate spatial non-stationarity: simply assuming relationships are stationary over space is no longer tenable.
• GWR can be likened to a ‘spatial microscope’: it allows us to see variations in relationships that were previously unobservable.
• Can use GWR as a model diagnostic or to identify interesting locations for investigation.
• Windows-based software makes it easy to apply to any spatial data set.

Things to watch out for...
• Local collinearity: occurs sometimes, esp. with binary variables
• Inference in GWR: be careful about multiple hypothesis testing issues
• Software limits: max. no. of explanatory variables = 50; max. no. of observations = 80,000
• Running time: running a ‘full’ GWR with a large data set and a large model can be time consuming!
• Think about your model: GWR is unlikely to be able to rescue a poor global model

End of presentation

Bandwidth and Effective Numbers of Parameters
As the bandwidth → ∞, the local model will tend to the global model with number of parameters = k.
As the bandwidth → 0, the local model ‘wraps itself around the data’ so the number of parameters = n.
The number of parameters in local models therefore ranges between k and n and depends on the bandwidth.
This number need not be an integer and we refer to it as the effective number of parameters in the model.

An Example from the Georgia Data
[Figure: Bandwidth and the Effective Number of Parameters; y-axis: effective number of parameters (0 to 160), x-axis: bandwidth (0 to 200)]

The reason…
Suppose we have a non-stationary process that can be modelled by:
$$y_i = \alpha + \beta_i x_i$$
but we model it incorrectly with a global model of the form:
$$y_i = \alpha + \beta x_i$$

Real values of βi
.9 .8 .7 .6 .5
.8 .7 .6 .5 .4
.8 .6 .5 .4 .3
.7 .5 .4 .3 .2
.5 .4 .4 .2 .1

Estimated value of βi from global model
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5
.5 .5 .5 .5 .5

Residuals ($y_i - \hat{y}_i$)
+ + + + 0
+ + + 0 -
+ + 0 - -
+ 0 - - -
0 - - - -
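The residual pattern in this last illustration can be checked with a few lines of arithmetic. The sketch below copies the grid of real βi values and the global estimate of 0.5 from the slide; it assumes, purely for illustration, that x_i = 1 everywhere (any positive x gives the same signs) and that the global model recovers the true α, so that α cancels in the residual.

```python
import numpy as np

# Real values of beta_i, copied from the 5 x 5 grid on the slide.
beta_real = np.array([
    [.9, .8, .7, .6, .5],
    [.8, .7, .6, .5, .4],
    [.8, .6, .5, .4, .3],
    [.7, .5, .4, .3, .2],
    [.5, .4, .4, .2, .1],
])
beta_global = 0.5                 # single slope estimated by the global model (from the slide)
x = np.ones_like(beta_real)       # assumption: x_i = 1 at every location

# alpha cancels: y_i - yhat_i = (alpha + beta_i * x_i) - (alpha + beta * x_i)
residuals = (beta_real - beta_global) * x
signs = np.sign(residuals).astype(int)

symbols = {1: "+", 0: "0", -1: "-"}
for row in signs:
    print(" ".join(symbols[int(s)] for s in row))
```

The signs change systematically from one corner of the grid to the other, which is the kind of spatial pattern in residuals that a global model leaves behind when the underlying relationship is non-stationary.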