Spatial Statistics - The University of Texas at Dallas

Regression in geoDA
Example regression analyses for Illiteracy Rate (ILLITERACY)
ChinaData.shp (n=35)
1. Simple regression with URBAN_POP_
ChinaData_29 (n=29)
2. Simple regression with URBAN_POP
3. Multiple regression with URBAN_POP
and RMB_PC_UR_
4. Spatial lag and error multiple regression
5. Multiple regression with log of Illiteracy
Briggs Henan University 2010
Running Regression in geoDA: I
File>Open Shape File
ChinaData
Tools>Weights>
Open or Create
Need weights to test for
spatial autocorrelation.
Generally, always use a
weights file.
If you have a large
number of observations,
do not
Need this for
Moran’s I for
residuals
Methods>Regress
Place
as below
You can begin with
Method>Regress if
--very large number
of observations (over
1,000)
--no spatial weights
--data only in a .dbf
file
Running Regression in geoDA: II
Select one dependent variable
One or more independent variables
Select type of regression:
Classic or Lag or Error
Warning-bug!
Use Suggested name.
The names are
reversed here!
Click OK
to save
these.
Click RUN, then Click SAVE
Saves values for Predicted Y and
Residuals in the table
--use Table>>Promotion to see
them in table.
--you can map them or draw graphs
--use Table >> Save to Shapefile
if you want to keep them
permanently
Running Regression in geoDA: III
The results
Results are saved in this text file.
It is saved in the same folder as the
shapefile.
You can rename it and change
location.
Click OK to see the results.
(You can also open the file later with
a program such as Notepad)
--scroll to end of file since results are
added to end if file already exists
Warning: if you want the residuals
(see previous slide) you must click
Save before clicking OK
Click Reset to run a different
regression
Summary: Running Regression in geoDA
1. File>Open Shape File: ChinaData
2. Tools>Weights>Open or Create
(need weights to test for spatial autocorrelation in residuals)
3. Methods>Regress
Place as below
4. Select variables as below.
5. Select type of regression: Classic, Lag or Error
6. Click RUN, then Click SAVE
Warning-bug! Use Suggested name. The names are reversed here!
Click OK to save these.
Use Table>Promotion to see them in table.
7. Click OK in Regression window to see results
--scroll to end of file since results are added to end if file exists already
Regression for Provinces: n = 35
• Next slide shows results from running a simple
regression with ChinaData.shp
Y = Illiteracy rate (ILLITERACY)
X = % of population urban (URBAN_POP_)
• All provinces included
• Note problems with
– Extreme value for Xizang/Tibet
– Zeros (0) for missing data on X variable
(Taiwan, Macau, Hong Kong, P’eng-hu)
• Solution: Reduced data set to 29 using ArcGIS
– (do not know how to do this in geoDA!)
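The data reduction described above can also be done outside a GIS. The following is a minimal sketch in Python with pandas, assuming the attribute table has been loaded into a DataFrame. Only the field names ILLITERACY and URBAN_POP_ come from the slides; the NAME column, the helper reduce_dataset and all values are hypothetical and made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; these are NOT the ChinaData values.
provinces = pd.DataFrame({
    "NAME": ["Province A", "Province B", "Province C", "Province D"],
    "ILLITERACY": [5.0, 8.0, 40.0, 3.0],
    "URBAN_POP_": [0.60, 0.35, 0.20, 0.0],   # 0 stands in for missing data
})

def reduce_dataset(df, x_field, drop_names=()):
    """Drop rows whose X value is 0 (missing) and rows listed by name (extreme values)."""
    keep = (df[x_field] != 0) & ~df["NAME"].isin(list(drop_names))
    return df[keep].reset_index(drop=True)

# Province C plays the role of the extreme value; Province D has a zero on X.
reduced = reduce_dataset(provinces, "URBAN_POP_", drop_names=["Province C"])
print(len(provinces), "->", len(reduced))
```

Dropping by name keeps the decision about extreme values explicit, rather than hiding it behind a numeric cutoff.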
Results for simple regression
Display table: Table > Promotion
Plot using: Explore > ScatterPlot
Note: mean of residuals is always zero
[Scatter plots: Total Variation (Illiteracy v. Urban Pop%); Predicted by Regression (OLS_Predict v. Urban Pop%); Residual Variation (OLS_Resid v. Urban Pop%)]
Extreme value identified by linking: Xizang/Tibet
Partitioning the Variance on Y
[Scatter plots: Total Variation (Illiteracy v. Urban Pop%); Predicted by Regression (OLS_Predict v. Urban Pop%); Residual Variation (OLS_Resid v. Urban Pop%)]
$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2$$
SS Total (Total Sum of Squares) = SS Regression (Explained Sum of Squares) + SS Residual (Error Sum of Squares)
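The partition of the variation in Y can be checked numerically. The sketch below fits a simple least-squares line with NumPy and computes the three sums of squares. The function name partition_variance and the data are made up for illustration; they are not the ChinaData values.

```python
import numpy as np

def partition_variance(x, y):
    """Fit y = a + b*x by ordinary least squares and split the variation in y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x, y, 1)                 # polyfit returns slope, then intercept
    y_hat = a + b * x                          # predicted values
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_regression = np.sum((y_hat - y.mean()) ** 2)
    ss_residual = np.sum((y - y_hat) ** 2)
    return {
        "ss_total": ss_total,
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "r_squared": ss_regression / ss_total,
        "mean_residual": np.mean(y - y_hat),
    }

# Made-up illustration values
x = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.8])
y = np.array([14.0, 11.0, 9.5, 8.0, 7.5, 4.0])
parts = partition_variance(x, y)
print(parts)
```

With an intercept in the model, the total equals the regression part plus the residual part, and the mean of the residuals is zero up to rounding error.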
Simple Regression Results from GeoDA:
Statistics for dependent variable
general
n = 35
Results for overall regression
explains only 4.6% of variance in Y
Not statistically
significant
Sigma-square= Variance of the estimate = 1368.89/33=41.4816
SE of regression=standard error of the estimate=√41.4816=6.44062
Results for each regression coefficient
Y= 11.3146 - 6.578X
Identical in
simple
regression
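The arithmetic on this slide can be reproduced in a few lines. The sketch below assumes that the divisor 33 is n minus the two estimated coefficients (intercept and slope), and that the "Identical in simple regression" note refers to the overall F test and the slope's t test, since with one X variable F equals t squared. The t value of -1.263 is taken from the results summary table.

```python
import math

ss_residual = 1368.89    # residual sum of squares, from the slide
n, k = 35, 2             # observations; coefficients estimated (intercept + slope)

sigma_square = ss_residual / (n - k)      # variance of the estimate
se_regression = math.sqrt(sigma_square)   # standard error of the estimate

t_slope = -1.263         # test statistic for URBAN_POP_, from the summary table
f_from_t = t_slope ** 2  # in simple regression, F = t squared

print(sigma_square, se_regression, f_from_t)
```

Small differences in the last digit against the GeoDA output are expected, because the slide's inputs are already rounded.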
Simple Regression Results from GeoDA:
spatial
n = 35
Moran’s I for regression residuals
--not statistically significant (p=.09)
Space > Univariate Moran
for variable: OLS_Resid
Same results!
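For readers who want to see what the Moran's I value measures, here is a minimal sketch of the statistic itself, computed with NumPy on a row-standardized weights matrix. The four-region layout and the residual values are made up. The sketch computes only the statistic, not the significance test that geoDA reports.

```python
import numpy as np

def morans_i(values, w):
    """Moran's I for a vector of values and a spatial weights matrix w."""
    values = np.asarray(values, dtype=float)
    z = values - values.mean()            # deviations from the mean
    n = len(values)
    s0 = w.sum()                          # sum of all weights
    return (n / s0) * (z @ w @ z) / (z @ z)

# Four regions in a row: each is a neighbor of the regions next to it
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
w = w / w.sum(axis=1, keepdims=True)      # row-standardize

resid_clustered = np.array([2.0, 1.0, -1.0, -2.0])    # similar values adjacent
resid_alternating = np.array([1.0, -1.0, 1.0, -1.0])  # opposite values adjacent

i_clustered = morans_i(resid_clustered, w)
i_alternating = morans_i(resid_alternating, w)
print(i_clustered, i_alternating)
```

Positive values indicate that neighboring residuals are similar; negative values indicate that neighbors tend to have opposite residuals.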
Results with omitted observations:
much better!
Now explains 33.41%
But probably non-linear
Statistically
significant
Spatial autocorrelation
not a problem
Data for China Provinces 29:
excludes Xizang/Tibet, Macao, Hong Kong, Hainan, Taiwan, P'eng-hu
Multiple Regression Results n = 29
Illiteracy with % Pop Urban and Urban Income
Overall Results
Results for each variable
significant
Not significant
Spatial Results
Not significant
Residual Analysis:
Illiteracy v. Urban Pop % and UrbanIncomePerCapita
Moran’s I = .0226
p = 0.5520
Not statistically significant
No Spatial autocorrelation
in residuals
Spatial Error Model Results
illustrative only: not needed
Spatial
error not
significant
Spatial Lag Model Results
illustrative only: not needed
Spatial lag not significant
Regression Results Summary

Overall and Urban Pop
Model           R2      Adj R2   Akaike   F       F-prob   Urban Pop coeff   Test Stat   prob
Simple-35       0.046   0.017    231.65   1.60    0.215    -6.58             -1.263      0.215
Simple-29       0.334   0.309    155.42   13.55   0.001    -16.15            -3.681      0.001
Multiple        0.384   0.337    155.16   8.11    0.002    -26.80            -3.151      0.000
Spatial Error   0.385            155.13                    -27.02            -3.411      0.001
Spatial Lag     0.387            157.05                    -26.00            -3.128      0.006

Urban Income and *Spatial Term
Model           Urban Income coeff   Test Stat   prob    *Spatial Term coeff   Test Stat   prob
Simple-35                                                0.1636                1.678       0.0934
Simple-29                                                0.0272                0.578       0.5631
Multiple        0.00041              1.452       0.159   -0.0226               0.383       0.7015
Spatial Error   0.00041              1.572       0.116   -0.0389               -0.162      0.8716
Spatial Lag     0.00040              1.486       0.137   0.0720                0.340       0.7339

*Spatial Term
OLS: for Moran's I
Lag: for W_Illiteracy
Error: for Lambda

For Multiple Regression (n = 29)
Robust LM (lag): 1.312, prob 0.2520
Robust LM (error): 1.220, prob 0.2693
Note on:
Variables Saved for Spatial Models
Again, labels are reversed. Use suggested
variable names.
ERR_ indicates use of Spatial Error model.
LAG_indicates use of Spatial Lag Model
OLS_ indicates use of classic model
For the spatial lag model, there is a distinction between the residual and the
prediction error. The latter is the difference between the observed value and
the predicted value that uses only exogenous variables, rather than treating
the spatial lag Wy as observed. (Documentation for 905i, page 53)
Prediction error (xxx_PRDERR): calculated without including spatial term.
Residual error (xxx_RESIDU): calculated including spatial term
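The difference between the two saved quantities can be illustrated numerically. The sketch below assumes the usual spatial lag form y = rho*W*y + X*beta + e and takes the "prediction using only exogenous variables" to be the reduced form (I - rho*W)^-1 X*beta. That reading, the function name lag_model_outputs and all values are assumptions for illustration, not output from geoDA.

```python
import numpy as np

def lag_model_outputs(y, X, W, rho, beta):
    """Return predicted values, prediction error and residual for a spatial lag model."""
    n = len(y)
    xb = X @ beta
    # Residual: treats the spatial lag W @ y as observed
    residual = y - rho * (W @ y) - xb
    # Predicted value from exogenous variables only (reduced form)
    predicted = np.linalg.solve(np.eye(n) - rho * W, xb)
    prediction_error = y - predicted
    return predicted, prediction_error, residual

# Made-up example: three regions in a row, row-standardized weights
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
X = np.column_stack([np.ones(3), [0.2, 0.5, 0.8]])   # intercept and one X variable
beta = np.array([10.0, -5.0])
rho = 0.3
noise = np.array([0.4, -0.1, -0.3])

# Generate y so that it satisfies the lag model exactly
y = np.linalg.solve(np.eye(3) - rho * W, X @ beta + noise)
predicted, prediction_error, residual = lag_model_outputs(y, X, W, rho, beta)
print(residual, prediction_error)
```

In this construction the residual recovers the noise term exactly, while the prediction error also carries the noise that is passed on through neighboring regions.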
Table >> Add Column
Table >> Field Calculator
Improving the model
Relationship is Non-linear
Use log of Illiteracy
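The effect of the log transformation can be sketched outside geoDA as well. The example below uses made-up data that follow an exponential decline and compares the linear correlation before and after taking the log of Y. Using base 10 is an assumption; the slides do not state which base was used in the Field Calculator.

```python
import numpy as np

# Made-up data: Y declines exponentially as X rises (a non-linear relationship)
x = np.linspace(0.1, 0.9, 9)
y = 20.0 * np.exp(-3.0 * x)

log_y = np.log10(y)   # assumed base 10; values must be greater than zero

r_raw = np.corrcoef(x, y)[0, 1]       # correlation with the original Y
r_log = np.corrcoef(x, log_y)[0, 1]   # correlation with log of Y
print(r_raw, r_log)
```

The log is undefined for zero, so observations with a zero on the variable being transformed must be handled before this step.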
The same plots using Excel
Relationship is Non-linear
[Scatter plots: Illiteracy v. Urban pop %; Log of Illiteracy v. Urban pop %]
Y = Log of Illiteracy
R2 increases
from
38% to 83% !
Urban Income now significant and Urban Population is not!
Log of Illiteracy:
makes relationship linear

Overall and Urban Pop
Model            R2      Adj R2   Akaike   F       F-prob   Urban Pop coeff   Test Stat   prob
Simple-35        0.046   0.017    231.65   1.60    0.215    -6.58             -1.263      0.215
Simple-29        0.334   0.309    155.42   13.55   0.001    -16.15            -3.681      0.001
Multiple         0.384   0.337    155.16   8.11    0.002    -26.80            -3.151      0.000
Multiple Log Y   0.837   0.824    560.07   66.69   0.000    -3962.73          -1.800      0.083

Urban Income and *Spatial Term
Model            Urban Income coeff   Test Stat   prob    *Spatial Term coeff   Test Stat   prob
Simple-35                                                 0.1636                1.678       0.0934
Simple-29                                                 0.0272                0.578       0.5631
Multiple         0.00041              1.452       0.159   -0.0226               0.383       0.7015
Multiple Log Y   -6446.67             -2.975      0.006   -0.1192               -0.548      0.5839

*Spatial Term
OLS: for Moran's I
Urban Income now significant, and % urban not significant.
--these two variables are highly intercorrelated
--see next slide
Inter-Correlation between Urban
Population and Urban Income
[Scatter plot: Urban Population v. Urban Income, N=29]
R2 for Urban Pop versus Urban Income: 0.84
R is .92
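A quick check for inter-correlation between two independent variables can be sketched as follows. The helper predictor_correlation and the data values are made up for illustration. The last line only checks the slide's arithmetic, that an R2 of 0.84 corresponds to an R of about .92.

```python
import numpy as np

def predictor_correlation(x1, x2):
    """Return the correlation r and r squared between two independent variables."""
    r = np.corrcoef(x1, x2)[0, 1]
    return r, r ** 2

# Made-up values standing in for % urban population and urban income
urban_pop = np.array([0.25, 0.30, 0.40, 0.45, 0.60, 0.85])
urban_income = np.array([7000.0, 7400.0, 8200.0, 8000.0, 10500.0, 15000.0])

r, r_squared = predictor_correlation(urban_pop, urban_income)
print(r, r_squared)

# Arithmetic check of the slide: R is the square root of R2
r_from_slide = np.sqrt(0.84)
print(r_from_slide)
```

When two independent variables are this strongly correlated, the regression has difficulty separating their effects, which is consistent with significance switching from one variable to the other.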
Table >> Add Column
then use Table >> Field Calculator
Creating a better model
• Transforming dependent
and/or independent variables
can often improve the
predictive capability of
regression models
• geoDA has several
capabilities to support this.
Other software options for multiple regression
• Multiple regression of the type discussed here is not available in
ArcGIS
– Only geographically weighted
regression available
(there is a multiple regression for raster data
but it is only in ArcInfo Workstation—difficult to use)
• Use geoDA to create spatial lag variables, then use standard statistical
packages such as SAS, SPSS or STATA
• Use R
– Free open source software, but difficult to use
– http://cran.r-project.org/web/views/Spatial.html
• CrimeStat III has some support for spatial regression
http://www.icpsr.umich.edu/NACJD/crimestat.html
• For a good list of spatial software sources, go to:
http://en.wikipedia.org/wiki/List_of_spatial_analysis_software
What have we learned today?
• How to use geoDA to run
– classic regression models
– Spatial Lag models
– Spatial Error Models
• Importance of examining data for “problems”
– Can have a very large effect on results
– Missing data and zeros
– Extreme values can dominate results
• Using transformations to create a better model
Geographically Weighted Regression
Geographically Weighted Regression
• The idea of Local Indicators can also be applied to
regression
• It's called geographically weighted regression
• It calculates a separate regression
for each polygon and its neighbors,
– then maps the parameters from the model, such as the regression
coefficient (b) and/or its significance value
• Mathematically, this is done by applying the spatial weights
matrix (Wij) to the standard formulae for regression
See Fotheringham, Brunsdon and Charlton, Geographically Weighted
Regression, Wiley, 2002
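The idea of a separate, locally weighted regression for each location can be sketched in a few lines of NumPy. This is an illustration only, not the ArcGIS implementation: the Gaussian distance kernel, the fixed bandwidth, the function name gwr_coefficients and the data are all assumptions.

```python
import numpy as np

def gwr_coefficients(coords, x, y, bandwidth):
    """Fit one weighted least-squares regression per location.

    Returns an array with one row per location: [intercept, slope].
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])
    betas = np.empty((len(y), 2))
    for i, center in enumerate(coords):
        d = np.linalg.norm(coords - center, axis=1)   # distance to location i
        w = np.exp(-0.5 * (d / bandwidth) ** 2)       # nearer observations weigh more
        xtw = X.T * w                                 # apply weights to each observation
        betas[i] = np.linalg.solve(xtw @ X, xtw @ y)  # weighted least squares
    return betas

# Made-up data on a 5 x 5 grid; the slope of y on x increases from west to east
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x = rng.uniform(0.1, 0.9, size=len(coords))
y = 1.0 + (2.0 + 0.5 * coords[:, 0]) * x

betas = gwr_coefficients(coords, x, y, bandwidth=2.0)
print(betas[:5])
```

The local slopes in betas[:, 1] are the values that would be mapped. The bandwidth controls how far out the "local effect" reaches, and its choice strongly affects the results.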
Problems with Geographically Weighted Regression
• Each regression is based on few observations
– the estimates of the regression
parameters (b) are unreliable
• Need to use more observations than just those
with shared border, but
– how far out do we go?
– How far out is the “local effect”?
• Need strong theory to explain why the regression
parameters are different at different places
• Serious questions about validity of statistical
inference tests since observations not independent
GWR in ArcGIS
• Requires ArcInfo, Spatial Analyst or Geostatistical Analyst license
• Shapefile is created:
– Open its table to see results
– for each polygon there are standard regression results
– Condition variable: indicates when the results are
unstable due to local multicollinearity
• Results not good if condition > 30, Null, or -1.79e+308
– Use source_ID to join with FID of original data to
identify observations
Usage Tips from ArcGIS Help
• Use projected data
• Observations included in each regression depend on kernel type,
bandwidth method and bandwidth distance parameters set by user
– Max of 1,000 observations in any one local regression
• Multicollinearity can be a problem
– if variables cluster spatially
– if use binary/nominal/categorical variables
– Never use dummy variables (1/0) to index spatial regions
• (Multicollinearity: intercorrelation between independent variables)
• Not appropriate for small data sets: need several hundred observations
• Shapefiles cannot store “null” values: treated as zero. Be sure there is
no missing data
Running GWR in ArcGIS
Execution Dialog for GWR in ArcGIS
Results presumably for global regression?
--R2 value does not agree with results from geoDA?
Mapping Results from GWR in ArcGIS
(Default) standardized residuals
--the bigger the absolute value the
poorer the prediction?
Regression coefficient
for % Urban Pop
--larger impact of urban
pop in south east China.
Join with the original shapefile using FID
and Source_Id in order to identify provinces
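The same join can be sketched with pandas once both attribute tables are available as DataFrames. The field names FID and Source_ID follow the slides. The NAME, Cond and LocalR2 columns and all values are hypothetical, and the threshold of 30 is the condition-number rule of thumb from the "GWR in ArcGIS" slide.

```python
import pandas as pd

# Made-up stand-ins for the original attribute table and the GWR output table
original = pd.DataFrame({
    "FID": [0, 1, 2],
    "NAME": ["Province A", "Province B", "Province C"],
})
gwr_output = pd.DataFrame({
    "Source_ID": [2, 0, 1],
    "Cond": [12.5, 35.0, 8.1],      # condition number for each local regression
    "LocalR2": [0.41, 0.22, 0.37],
})

# Join the output rows back to the original observations
joined = gwr_output.merge(original, left_on="Source_ID", right_on="FID", how="left")

# Flag local regressions that are unstable due to local multicollinearity
unstable = joined[joined["Cond"] > 30]
print(joined[["NAME", "Cond", "LocalR2"]])
print(unstable["NAME"].tolist())
```

A left join keeps every row of the GWR output, so an observation that fails to match shows up with a missing NAME instead of disappearing.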
GWR output: R2 and Y values
Output table
(part)
(Columns reordered.
Highlighted columns
obtained from join with
original data.)
Observed: values
on the dependent
variable Y
Predicted values
and residuals are
based upon each
local regression
and are not the
same as those for a
global regression.
GWR output: regression coefficients and standard errors
Regression
coefficients (b)
Standard error
of the estimate
Standard error of the coefficients
No statistical
significance results
provided
--statistical
significance tests in
GWR have been
severely criticized.