Geographically Weighted Regression

A Stewart Fotheringham
Martin Charlton
Chris Brunsdon
Stewart.Fotheringham@nuim.ie
http://ncg.nuim.ie/GWR
Regression
In a typical linear regression model applied to spatial data we assume a stationary process (the same stimulus provokes the same response in all parts of the study region):
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_n x_{ni} + \varepsilon_i$
so that...
The parameter estimates obtained in
the calibration of such a model are
constant over space:
$\hat{\beta} = (X^T X)^{-1} X^T Y$
The assumption of stationarity in regression
$y_i = \alpha + \beta x_i$
[Figure labels: β1, β2]
Assumption is that the values of β are the same everywhere.
Why might measured relationships
vary spatially?
Sampling variation
Relationships intrinsically different across
space e.g. differences in attitudes, preferences or
different administrative, political or other contextual
effects produce different responses to the same
stimuli
Model misspecification: suppose a global statement can ultimately be made, but the models are not properly specified to allow us to make it. Local models are a good indicator of how the model is misspecified.
Can all contextual effects ever be removed?
Can all significant variations in local
relationships be removed?
Consequently…if there is spatial nonstationarity, we only see it through the
residuals
We might map the residuals from the
regression to determine whether
there are any spatial patterns.
Or compute an autocorrelation
statistic for the residuals
We might even try to ‘model’ the
error dependency with various types
of spatial regression models.
However...
Why not address the issue of spatial nonstationarity directly and allow the
relationships we are measuring to vary
over space? This is the essence of GWR
$y_i = \beta_0(i) + \beta_1(i) x_{1i} + \beta_2(i) x_{2i} + \dots + \beta_n(i) x_{ni} + \varepsilon_i$
… with the estimator
$\hat{\beta}(i) = (X^T W(i) X)^{-1} X^T W(i) Y$
where W(i) is a matrix of weights specific
to location i such that observations
nearer to i are given greater weight
than observations further away.
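$W(i) = \begin{pmatrix} w_{i1} & 0 & 0 & \cdots & 0 \\ 0 & w_{i2} & 0 & \cdots & 0 \\ 0 & 0 & w_{i3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & w_{in} \end{pmatrix}$
where $w_{in}$ is the weight given to data point n for the estimate of the local parameters at location i.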
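The local estimator can be sketched in a few lines of NumPy. This is a minimal illustration, not the GWR 3.0 implementation: the function name `gwr_local_fit` and the toy data are hypothetical, and the weights are simply supplied as a vector holding the diagonal of W(i).

```python
import numpy as np

def gwr_local_fit(X, y, w):
    """Weighted least squares at one location: solves (X'WX) b = X'Wy.

    w holds the diagonal of W(i), i.e. the weights w_i1 ... w_in.
    """
    XtW = X.T * w  # equivalent to X.T @ np.diag(w) without building the full matrix
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical toy data: an intercept column plus one predictor, with y = 1 + 2x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
w = np.array([1.0, 0.8, 0.5, 0.2, 0.1])  # illustrative weights for one location i
beta_i = gwr_local_fit(X, y, w)
print(beta_i)
```

Because the toy relationship is exact and the same everywhere, any set of positive weights returns the same coefficients; with real data the estimates change as the weights change from location to location.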
A Typical Spatial Weighting Function
Weighting schemes
Numerous weighting schemes can be used
although they tend to be Gaussian or
‘Gaussian-like’ reflecting the type of
dependency found in most spatial processes.
Weighting schemes can be either fixed or
adaptive.
Fixed Weighting Scheme
A Fixed Weighting Scheme
For each location i at which the local regression model
is calibrated,
$w_{ij} = \exp[-\tfrac{1}{2} (d_{ij} / h)^2]$
where
$d_{ij}$ is the distance between locations i and j
h is the bandwidth – as h increases, the gradient of the
kernel becomes less steep and more data points are
included in the local calibration. We need to find the
optimal value of h in the GWR routine.
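A minimal sketch of the fixed Gaussian kernel, assuming planar coordinates and Euclidean distance. The function name `gaussian_weights` and the three sample points are hypothetical.

```python
import numpy as np

def gaussian_weights(coords, i, h):
    """Fixed Gaussian kernel: w_ij = exp(-0.5 * (d_ij / h)^2) for every point j."""
    d = np.linalg.norm(coords - coords[i], axis=1)  # distances from location i
    return np.exp(-0.5 * (d / h) ** 2)

# Three hypothetical points at distances 0, 5 and 10 from the first one.
coords = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
w_narrow = gaussian_weights(coords, 0, h=5.0)
w_wide = gaussian_weights(coords, 0, h=50.0)
print(w_narrow, w_wide)
```

With the larger bandwidth the weights of the distant points move towards 1, which is the flatter kernel described above.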
Spatially Adaptive Weighting Scheme
A Spatially Adaptive Weighting Function
$w_{ij} = [1 - (d_{ij}^2 / h^2)]^2$ if j is one of the Nth nearest neighbours of i
$w_{ij} = 0$ otherwise
Here, we find the optimal value of N
in the GWR routine
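A minimal sketch of the adaptive bisquare kernel. The slide does not say how h is set in the adaptive case; the sketch assumes h is the distance from i to its N-th nearest point, counting i itself as the first. The function name and sample points are hypothetical.

```python
import numpy as np

def bisquare_adaptive_weights(coords, i, n_neighbours):
    """Bisquare kernel whose bandwidth adapts to the local density of points."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    # Assumption: h is the distance to the N-th nearest point (i itself is the 1st).
    h = np.sort(d)[n_neighbours - 1]
    w = (1.0 - (d / h) ** 2) ** 2
    w[d > h] = 0.0  # points beyond the N nearest neighbours get zero weight
    return w

# Five hypothetical points on a line; the last one is far from the rest.
coords = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
w = bisquare_adaptive_weights(coords, 0, n_neighbours=4)
print(w)
```

Under this convention the N-th neighbour sits exactly at the kernel edge and receives weight 0, so N controls how many points effectively enter each local calibration.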
Calibration
The results of GWR appear to be relatively
insensitive to the choice of weighting function as
long as it is a continuous distance-based
function
Whichever weighting function is used, the results
will, however, be sensitive to the degree of
distance-decay.
Therefore an optimal value of either h or N has to be obtained. This can be found by minimising a cross-validation score (CV) or the corrected Akaike Information Criterion (AICc), where...
$CV = \sum_i [y_i - \hat{y}_{\neq i}(h)]^2$
where $\hat{y}_{\neq i}(h)$ is the fitted value of $y_i$ with data from point i omitted from the calibration
Lower values of CV indicate better model fits.
$AICc = \text{Deviance} + 2k [n / (n - k - 1)]$
where n is the number of data points and k is the
number of parameters in the model
Lower values of AICc indicate better model fits.
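The cross-validation score can be sketched as a leave-one-out loop. This is an illustration only: it assumes a fixed Gaussian kernel, uses simulated data with a slope that drifts across the study area, and picks the best bandwidth from a short hypothetical candidate list rather than by the search routine GWR 3.0 uses.

```python
import numpy as np

def cv_score(X, y, coords, h):
    """Leave-one-out CV score for a fixed Gaussian kernel with bandwidth h."""
    score = 0.0
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)
        w[i] = 0.0  # omit point i from its own calibration
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        score += (y[i] - X[i] @ beta) ** 2
    return score

# Simulated (hypothetical) data: the slope increases from west to east.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(100, 2))
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 2.0 + (1.0 + 0.3 * coords[:, 0]) * x + rng.normal(scale=0.1, size=100)

candidates = [2.0, 4.0, 8.0, 1000.0]  # the last one is effectively a global model
scores = {h: cv_score(X, y, coords, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(scores, best_h)
```

Setting `w[i] = 0` is what stops the score from being minimised trivially by a bandwidth so small that each point predicts only itself.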
Bandwidth Selection
Optimal bandwidth selection is a tradeoff between bias and variance
Too small a bandwidth leads to large
variance in the local estimates
Too large a bandwidth leads to large
bias in the local estimates
Bias-Variance Trade-Off
Output from GWR
Main output from GWR is a set
of location-specific parameter
estimates which can be mapped
and analysed to provide
information on spatial nonstationarity in relationships.
An Example using Educational
Attainment Data in Georgia
In GWR, we can also ...
estimate local standard errors
derive local t statistics
calculate local goodness-of-fit
measures
perform tests to assess the
significance of the spatial variation in
the local parameter estimates
perform tests to determine if the local
model performs better than the global
one, accounting for differences in
degrees of freedom
A Simulation Experiment
$Y_i = \alpha_i + \beta_{1i} X_{1i} + \beta_{2i} X_{2i}$
Data on X1 and X2 drawn randomly for 2500 locations on a 50 x 50 matrix such that r(X1, X2) is controlled. Results shown to be independent of r(X1, X2).
Experiment 1: (parameters spatially invariant)
$\alpha_i = 10$ for all i
$\beta_{1i} = 3$ for all i
$\beta_{2i} = -5$ for all i
Yi obtained from above
Data used to calibrate model by global regression and by
GWR
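Experiment 1 can be approximated with a short simulation. The sketch below makes assumptions the slides do not state: X1 and X2 are independent standard normal draws, the local fit uses a fixed Gaussian kernel with a hypothetical bandwidth of 5 grid units, and only one location (a corner cell) is calibrated locally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_side = 50
n = n_side * n_side

# Assumption: X1 and X2 are independent standard normal draws.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 10.0 + 3.0 * x1 - 5.0 * x2  # spatially invariant parameters, no error term
X = np.column_stack([np.ones(n), x1, x2])

# Global regression.
beta_global, *_ = np.linalg.lstsq(X, y, rcond=None)

# One local (GWR-style) calibration at the corner cell of the 50 x 50 grid.
rows, cols = np.divmod(np.arange(n), n_side)
coords = np.column_stack([rows, cols]).astype(float)
d = np.linalg.norm(coords - coords[0], axis=1)
w = np.exp(-0.5 * (d / 5.0) ** 2)  # hypothetical bandwidth of 5 grid units
XtW = X.T * w
beta_local = np.linalg.solve(XtW @ X, XtW @ y)
print(beta_global, beta_local)
```

Both fits return the generating parameters, since a relationship that is constant over space looks the same however the observations are weighted.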
Results…
Global:
Adj. R2 = 1.0; AIC = -59,390; K = 3
α (est.) = 10; β1 (est.) = 3; β2 (est.) = -5
GWR:
Adj. R2 = 1.0; AIC = -59,386; K = 6.5; N = 2,434
αi (est.) = 10 for all i
β1i (est.) = 3 for all i
β2i (est.) = -5 for all i
Conclusion:
GWR does NOT appear to suggest any spurious
nonstationarity when relationships are constant
Experiment 2: (parameters spatially
variant)
0 ≤ i ≤ 50, 0 ≤ j ≤ 50
αi = 0 + 0.2i + 0.2j (range 0 to 20)
β1i = -5 + 0.1i + 0.1j (range -5 to 5)
β2i = -5 + 0.2i + 0.2j (range -5 to 15)
Yi obtained in same way
Data used to calibrate model by global regression
and by GWR
Results…
Global:
Adj. R2 = 0.04; AIC = 17,046; K = 3
α (est.) = 10.26; β1 (est.) = -0.1; β2 (est.) = 5.28
These are close to the averages of the local estimates (10; 0; 5)
GWR:
Adj. R2 = 0.997; AIC = 2,218; K = 167; N = 129
αi (est.) range = 2 to 18.6
β1i (est.) range = -4.3 to 4.7
β2i (est.) range = -3.9 to 13.6
Conclusion:
GWR identifies spatial nonstationarity in
relationships; global model fails completely.
0 ≤ α(i) ≤ 20
-5 ≤ β1(i) ≤ 5
-5 ≤ β2(i) ≤ 15
An Empirical Example: House Prices in London
1990 sales price data for 12,493 houses in
London
along with various attributes of each
property and a postcode so locations
down to 100m can be obtained via the
Central Postcode Directory
neighbourhood data obtained for
enumeration districts (via postcode-to-ED
LUT)
Locations of house sales in data set
Hedonic Price Modelling
Very common tool to examine
determinants of house prices
House prices related to determinants (usually) in the form of a linear-in-parameters model, generally calibrated by some form of regression
Problem: A global model is almost always
assumed where “one size fits all”
Explanatory Variables
Floor area
House type (detached; semi-d; flat etc)
Date built
Garage
Central Heating
2+ bathrooms
% professionals in neighbourhood
% unemployed in neighbourhood
distance to centre of London
Global Regression Parameter Estimates
Variable       Parameter Estimate    T value
Intercept                  58,900       23.3
FLRAREA                       697       49.3
FLRDETACH*                    205        7.5
FLRFLAT*                     -123       -5.6
FLRBNGLW*                     -87       -1.4
FLRTRRCD*                    -119       -6.2
BLDPWW1**                  -2,340       -3.9
BLDPOSTW**                 -2,786       -3.1
BLD60S**                   -5,177       -5.0
BLD70S**                   -2,421       -2.1
BLD80S**                    6,315        6.9
GARAGE                      5,956       10.6
CENHEAT                     7,777       12.4
BATH2+                     22,297       19.1
PROF                           72        3.0
UNEMPLOY                     -211       -5.5
ln(DISTCL)                -18,137      -30.1
R2 = 0.60
* Excluded house type is Semi-detached
** Excluded age is Inter-war 1914-1939
Using GWR
In this case an adaptive kernel is used: a bisquare function
Calibration yielded an optimal number of
nearest neighbours = 931
Results presented in a series of
parameter surfaces - those shown all
have significant spatial variation
Value of terraced property £/m2
(global estimate = £578)
Pre-1914 housing compared to inter-war
(global estimate = £-2,340)
1960s housing compared to inter-war
(global estimate = £-5,177)
Residuals from GWR are generally
much lower and are not spatially
autocorrelated
GWR models give much better fits to data,
even accounting for increases in number
of parameters
GWR residuals are generally not spatially
autocorrelated so reducing/removing the
need for spatial regression models
Residuals from Global Model
Residuals from GWR Model
Assessing whether the spatial variation in
measured relationships might be important
(i.e. the variation is unlikely to be just a
product of sampling variation)
1. Monte-Carlo tests
2. Local t values
3. Variability of local parameter estimates
1. Monte Carlo Tests
1. Obtain local parameter estimates and calculate variance of estimates
2. Rearrange data randomly across the zones (keeping Yi, X1i, X2i, …, Xni together)
3. Compute new set of local parameter estimates based on rearranged data
4. Repeat steps 2 and 3 LOTS of times, each time computing the variance of the local estimates
5. Compare variance of local parameter estimates in step 1 with those from steps 2 and 3
6. p value associated with 1 is then the proportion of variances that lie above that for 1 in a list of variances sorted high to low.
Can do this very easily within GWR3.0!
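The six steps can be sketched as a permutation loop. This is an illustration under stated assumptions, not the GWR 3.0 routine: it uses a fixed Gaussian kernel with a preset bandwidth, 99 permutations, simulated data, and takes the p value as the share of simulated variances strictly above the observed one in a list of 100 (99 simulated plus the observed). All function names are hypothetical.

```python
import numpy as np

def local_estimates(X, y, coords, h):
    """Local parameter estimates at every data point (fixed Gaussian kernel)."""
    betas = np.empty((len(y), X.shape[1]))
    for i in range(len(y)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)
        XtW = X.T * w
        betas[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return betas

def monte_carlo_pvalues(X, y, coords, h, n_perm=99, seed=0):
    rng = np.random.default_rng(seed)
    observed = local_estimates(X, y, coords, h).var(axis=0)       # step 1
    n_above = np.zeros(X.shape[1])
    for _ in range(n_perm):                                        # step 4
        # Step 2: reassign locations at random; each (y, x) row stays together.
        shuffled = coords[rng.permutation(len(y))]
        simulated = local_estimates(X, y, shuffled, h).var(axis=0) # step 3
        n_above += simulated > observed                            # step 5
    return n_above / (n_perm + 1)                                  # step 6

# Simulated (hypothetical) data: constant intercept, slope drifting west to east.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(80, 2))
x = rng.normal(size=80)
X = np.column_stack([np.ones(80), x])
y = 2.0 + (1.0 + 0.3 * coords[:, 0]) * x + rng.normal(scale=0.1, size=80)
p_values = monte_carlo_pvalues(X, y, coords, h=2.0)
print(p_values)
```

Shuffling the coordinates rather than the data rows is equivalent to rearranging the observations across zones while keeping each observation's variables together.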
An example from the Georgia data
*************************************************
*                                               *
*  Test for spatial variability of parameters   *
*                                               *
*************************************************
Tests based on the Monte Carlo significance test
procedure due to Hope [1968,JRSB,30(3),582-598]
Parameter    P-value
---------    -------
Intercept    0.29000
TotPop90     0.10000
PctRural     0.24000
PctEld       0.75000
PctFB        0.00000
PctPov       0.59000
PctBlack     0.02000
Spatial variation in the % FB local
parameter estimate
Spatial variation in the % black local
parameter
2. Local t values
$t_i = \beta_i^* / SE(\beta_i^*)$
Map these values.
Look for areas on the map with ti values > 2
and/or ti values < -2
Local t values are provided automatically in the
GWR3.0 output file and can be mapped in
ArcGIS
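As a small worked example of the screening rule, the sketch below computes local t values and flags those beyond ±2. The estimates and standard errors are hypothetical numbers, not output from GWR 3.0.

```python
import numpy as np

# Hypothetical local estimates and local standard errors at four locations.
beta_local = np.array([0.8, 0.1, -0.6, 0.3])
se_local = np.array([0.2, 0.2, 0.25, 0.1])

t_local = beta_local / se_local
flagged = np.abs(t_local) > 2  # locations with t > 2 or t < -2
print(t_local, flagged)
```

The multiple-testing caveat raised later in the presentation applies here: with many locations, some t values will exceed ±2 by chance alone.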
3. Variability of local estimates
To examine the ‘importance’ of the spatial
variability of any relationship, compare the
variation of the local parameter estimates
from GWR with the SE of the global
parameter estimate. This can be done in two
ways.
3.1 Calculate the following index:
I = S.D. of local estimates / S.E. global estimate
An example from the Georgia data set
Par.    S.D. local    S.E. global    I       M-C p value
Int     1.744         1.753          0.99    0.42
%Rur    0.012         0.014          0.86    0.41
%Eld    0.095         0.130          0.73    0.60
%FB     0.985         0.307          3.21    0.00
%Pov    0.099         0.072          1.38    0.50
%Bla    0.045         0.026          1.73    0.01
Very rough ‘Rule of Thumb’: Potentially interesting
spatial variation if I > 1.5
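The index is easy to reproduce from the tabulated values. The sketch below recomputes I for the Georgia example and applies the I > 1.5 rule of thumb; the variable names are illustrative.

```python
# Values copied from the Georgia table: S.D. of local estimates and global S.E.
sd_local = {"Int": 1.744, "%Rur": 0.012, "%Eld": 0.095,
            "%FB": 0.985, "%Pov": 0.099, "%Bla": 0.045}
se_global = {"Int": 1.753, "%Rur": 0.014, "%Eld": 0.130,
             "%FB": 0.307, "%Pov": 0.072, "%Bla": 0.026}

index = {name: sd_local[name] / se_global[name] for name in sd_local}
interesting = [name for name, value in index.items() if value > 1.5]
print({name: round(value, 2) for name, value in index.items()})
print(interesting)
```

The two parameters picked out by the rule of thumb are the same two with small Monte Carlo p values in the table.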
3.2 Use the inter-quartile range in the listing
file
50% of the local parameter estimates will lie within
the inter-quartile range
Approx. 68% of the values in a Normal distribution lie within ± 1 SD of the mean. The global parameter estimate can be treated as the mean of a Normal distribution whose SD is its standard error. Therefore if the inter-quartile range of the local estimates is greater than 2 S.E. of the global estimate, this is indicative of a possible non-stationary process.
An example from the Georgia data set
Parameter    2 x S.E. global    Inter-quartile range (local)    Flag
Int          3.506              3.400                           X
%Rur         0.028              0.019                           X
%Eld         0.260              0.109                           X
%FB          0.614              2.034                           √
%Pov         0.144              0.094                           X
%Bla         0.052              0.080                           √
Same interpretations as M-C tests
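When only the local estimates are to hand, the inter-quartile range can be computed directly. The sketch below uses hypothetical local estimates and a hypothetical global standard error, and relies on NumPy's default (linear interpolation) percentile definition, which may differ slightly from the quartiles printed in the GWR 3.0 listing file.

```python
import numpy as np

# Hypothetical local estimates of one parameter and its global standard error.
local_estimates = np.linspace(0.0, 4.0, 9)
se_global = 0.5

q25, q75 = np.percentile(local_estimates, [25, 75])
iqr = q75 - q25
possibly_nonstationary = iqr > 2 * se_global
print(iqr, possibly_nonstationary)
```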
Can Use GWR as a ‘Spatial
Microscope’
Instead of determining an optimal bandwidth
during the calibration of a GWR model, a
bandwidth can be input a priori.
A series of bandwidths can be selected and
the resulting parameter surfaces examined at
different levels of smoothing
For example, consider a very simple model of
house prices regressed on floor area for 570
houses in Tyne & Wear, North East England.
Surfaces of the local floorspace parameter are
derived for bandwidths corresponding to 400,
350, 300, 250, 200, 150, 100 and 50 NN
OK, so how do you do all this?
Running the GWR software:
GWR 3.0
First, create a data file…
File type xxxx.dat (xxxx.csv)
First line is a comma separated list of
variable names (<= 8 characters)
Data lines have numeric items only
terminated by a carriage return
One line of data per location
Space or comma delimited (easily
created in Excel)
Example of a data file
Georgia.dat
The first 10 lines of the file…
ID,Lat,Lon,TotPop90,PctRural,PctBach,PctEld,PctFB,PctPov,PctBlack
13001,31.753389,-82.285580,15744,75.6,8.2,11.43,0.635,19.9,20.76
13003,31.294857,-82.874736,6213,100.0,6.4,11.77,1.577,26.0,26.86
13005,31.556775,-82.451152,9566,61.7,6.6,11.11,0.272,24.1,15.42
13007,31.330837,-84.454013,3615,100.0,9.4,13.17,0.111,24.8,51.67
13009,33.071932,-83.250851,39530,42.7,13.3,8.64,1.432,17.5,42.39
13011,34.352696,-83.500539,10308,100.0,6.4,11.37,0.340,15.1,3.49
13013,33.993471,-83.711811,29721,64.6,9.2,10.63,0.922,14.7,11.44
13015,34.238402,-84.839182,55911,75.2,9.0,9.66,0.816,10.7,9.21
13017,31.759395,-83.219755,16245,47.0,7.6,12.81,0.332,22.0,31.33
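Before loading a file into GWR 3.0 it can help to check it against the rules above. The sketch below is a hypothetical pre-check written with Python's standard `csv` module; it parses the first two data lines of the Georgia sample, confirms every item is numeric, and looks for variable names longer than 8 characters.

```python
import csv
import io

# Header and first two data lines copied from the Georgia.dat listing.
sample = """ID,Lat,Lon,TotPop90,PctRural,PctBach,PctEld,PctFB,PctPov,PctBlack
13001,31.753389,-82.285580,15744,75.6,8.2,11.43,0.635,19.9,20.76
13003,31.294857,-82.874736,6213,100.0,6.4,11.77,1.577,26.0,26.86
"""

reader = csv.DictReader(io.StringIO(sample))
# float() raises ValueError if any data item is not numeric.
rows = [{name: float(value) for name, value in row.items()} for row in reader]
too_long = [name for name in reader.fieldnames if len(name) > 8]
print(len(rows), too_long)
```

For a real file, `io.StringIO(sample)` would be replaced by an opened file handle.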
Starting GWR
The software is usually stored in the
C:\GWR3 folder and the program is
called GWR30.exe
Start/Programs/Geographically Weighted
Regression
Desktop icon
Explorer
This brings up the GWR
Wizard
You have a number of options to
choose from in creating and running a
GWR model
The job of the Wizard is to provide
suitable guidance in making the right
choices
If you want to create
and run a new GWR
model, click on the
option ‘Create a new
model’
If you created and
saved a GWR model in a
previous session and
you want to access this,
click on ‘Open an
existing model
using the GWR model
editor’
Inputting data – click and drag
.dat or .csv file from appropriate
folder
Regression Points
• Do you want to run GWR at the data point
locations or some other set of locations?
Name the output file…
Three types of output file are possible:
.e00 ArcInfo export file
.mif Map Info file
.csv comma separated variable file
The Model Editor
To specify a dependent variable, highlight it in the list on the left and click on the [->] symbol.
To specify independent variables, highlight them individually in the list on the left and click on the [->] symbol.
To specify location variables, highlight them individually in the list on the left and click on the corresponding [->] symbols.
Next you specify the type of kernel: this can be fixed (Gaussian) or adaptive (bisquare).
You can either preset the bandwidth in the units that the location variables are measured in (for example, metres), or, if you want the program to determine the optimal bandwidth, specify either cross-validation or AIC minimisation. For large files, there is a sampling option to speed up the process.
Select the type of coordinate system you are using for your location variables: the choice is either Cartesian or spherical.
The type of output in the printed listing can also be controlled by clicking on Model Options.
Bandwidth selection
Bandwidth            AICc
56.043532255000      913.159190588348
84.500000000000      885.119969660068
112.956467745000     872.910381423844
130.543532046749     868.887720190066
141.412935569545     869.149708997055
123.825871267796     870.450868861077
134.695274741431     869.114420384913
127.977613962479     869.551269557617
** Convergence after 8 function calls
** Convergence: Local Sample Size= 131
Useful if you want to plot the relationship to see how steep
or flat it is
List predictions…
Predictions from this model...
Obs    Y(i)      Yhat(i)    Res(i)    X(i)       Y(i)
1      8.200     9.006      -0.806    -82.286    31.753    F
2      6.400     6.958      -0.558    -82.875    31.295    F
3      6.600     8.524      -1.924    -82.451    31.557    F
4      9.400     8.308      1.092     -84.454    31.331    F
5      13.300    13.835     -0.535    -83.251    33.072    F
6      6.400     8.910      -2.510    -83.501    34.353    F
7      9.200     11.760     -2.560    -83.712    33.993    F
8      9.000     11.446     -2.446    -84.839    34.238    F
9      7.600     10.231     -2.631    -83.220    31.759    F
10     7.500     9.104      -1.604    -83.232    31.274    F
If you have requested an output file, this
information and the diagnostics are also written
to this file.
List Pointwise Diagnostics
Obs    Observed    Predicted    Residual    Std Resid    R-Square    Influence    Cook's D
---    --------    ---------    --------    ---------    --------    ---------    --------
1      8.20000     8.84819      -0.64819    -0.182251    0.784156    0.032346     0.000073
2      6.40000     6.39738      0.00262     0.000759     0.775286    0.085459     0.000000
3      6.60000     8.48954      -1.88954    -0.549871    0.782090    0.096687     0.002126
4      9.40000     8.35258      1.04742     0.302776     0.808351    0.084519     0.000556
5      13.30000    14.60358     -1.30358    -0.377493    0.834522    0.087768     0.000901
6      6.40000     8.29036      -1.89036    -0.538081    0.839070    0.055846     0.001125
7      9.20000     12.02529     -2.82529    -0.794934    0.841344    0.033697     0.001448
8      9.00000     10.97210     -1.97210    -0.554246    0.846250    0.031492     0.000656
9      7.60000     10.73917     -3.13917    -0.892715    0.778960    0.054083     0.002994
10     7.50000     9.04908      -1.54908    -0.444086    0.778295    0.069182     0.000963
The Model Editor
Once the model specification is completed, give the file a title and save it before you run it. The saved file can be used in later sessions.
The listing of the output will be saved in the file named here; then click on ‘Run’.
The local parameter estimates will be
saved in your named output file e.g.
georgia.e00
This can be used for subsequent
mapping in ArcGIS…
Summary
GWR appears to be a useful method to
investigate spatial non-stationarity - simply
assuming relationships are stationary over
space is no longer tenable
GWR can be likened to a ‘spatial microscope’: it allows us to see variations in relationships that were previously unobservable
Can use GWR as a model diagnostic or to
identify interesting locations for investigation.
Windows-based software makes it easy to apply
to any spatial data set.
Things to watch out for...
Local collinearity
occurs sometimes, esp. with binary variables
Inference in GWR
be careful about multiple hypothesis testing issues
Software limits
max. no. of explanatory variables = 50
max. no. of observations = 80,000
Running time
running a ‘full’ GWR with a large data set and a large
model can be time consuming!
Think about your model
GWR is unlikely to be able to rescue a poor global
model
End of presentation
Bandwidth and Effective
Numbers of Parameters
As the bandwidth → ∞, the local model will
tend to the global model with number of
parameters = k.
As the bandwidth → 0, the local model ‘wraps
itself around the data’ so the number of
parameters = n
The number of parameters in local models
therefore ranges between k and n and
depends on the bandwidth. This number
need not be an integer and we refer to it as
the effective number of parameters in the
model
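The two limits can be checked numerically. The sketch below assumes the effective number of parameters is measured as the trace of the hat matrix S (the matrix that maps observed y to fitted values); the slides do not state this definition, so treat it as an assumption. It also assumes a fixed Gaussian kernel and simulated data, and the function name is hypothetical.

```python
import numpy as np

def effective_parameters(X, coords, h):
    """Trace of the GWR hat matrix S for a fixed Gaussian kernel of bandwidth h."""
    trace = 0.0
    for i in range(len(X)):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / h) ** 2)
        XtW = X.T * w
        # Row i of S is x_i (X'W(i)X)^{-1} X'W(i); only its i-th entry is needed.
        s_row = X[i] @ np.linalg.solve(XtW @ X, XtW)
        trace += s_row[i]
    return trace

# Simulated (hypothetical) design: 60 points in the unit square, k = 2 columns.
rng = np.random.default_rng(3)
n, k = 60, 2
coords = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])

enp_small = effective_parameters(X, coords, h=0.15)
enp_large = effective_parameters(X, coords, h=1e6)
print(enp_small, enp_large)
```

With the very large bandwidth every weight is essentially 1 and the trace equals k; with the small bandwidth it rises towards n, and it is generally not an integer.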
An Example from the Georgia Data
[Figure: Bandwidth and the Effective Number of Parameters. x-axis: Bandwidth (0 to 200); y-axis: Effective Number of Parameters (0 to 160)]
The reason…
Suppose we have a non-stationary
process that can be modelled by:
yi = α + βi xi
but we model it incorrectly with a
global model of the form:
yi = α + β xi
Real values of βi
.9  .8  .7  .6  .5
.8  .7  .6  .5  .4
.8  .6  .5  .4  .3
.7  .5  .4  .3  .2
.5  .4  .4  .2  .1
Estimated value of βi from global model
.5  .5  .5  .5  .5
.5  .5  .5  .5  .5
.5  .5  .5  .5  .5
.5  .5  .5  .5  .5
.5  .5  .5  .5  .5
Residuals (yi - yi’)
+  +  +  +  0
+  +  +  0  -
+  +  0  -
+  0  -
0  -