Spatial Regression Lecture

advertisement
Introduction to
Spatial Regression
Glen Johnson, PhD
Lehman College / CUNY School of Public Health
glen.johnson@lehman.cuny.edu
Typical scenario:
•Have a health outcome and covariables
aggregated at a common geographic level,
such as counties, census tracts, ZIP codes …
•Want to measure association between the
outcome and the covariables.
•Specific Question is: Are there variables that
co-vary spatially with the outcome variable ?
Benzene in ambient air
Smoking rate
+
Lung Cancer Rates (observed)
+…+?
=
+ residual
Consider the linear model:
For i  1,
,n
y i  x i β   i
or, m ore generally, g ( E [ y i ])  x β   i
and  i ~ iid (0,  )
2
w ith cov(  i ,  j )  0 for all i , j
w here  i  y i  yˆ i
This is the point of departure.
When applying regression modeling to spatial
units that are connected in space (lattice data),
the critical assumption that residuals are
independently distributed with constant variance
is typically violated.
Tobler’s First Law of Geography:
Things closer in space tend to be more similar
than things further apart
When we model the expected value, E[y], as a
function of spatially-varying covariates, it is possible
that we may explain all of the spatial variation of the
observed response, y, with the covariates, leaving
uncorrelated residuals.
When this is not the case, as is typical, the assumption
of iid residuals is violated and we will obtained biased
estimates of the variance – typically biased
downward, leading to underestimating our standard
errors and concluding that some covariates are
significant when in fact they are not.
Tests for spatial autocorrelation should be
applied to residuals if a “conventional”
regression model is applied.
This may be done with various software
packages or GIS add-ons.
A common statistic is Moran’s I, which equals
n
n
 w
I 
i 1
1
ij
(Yi  Y )(Y j  Y )
j 1
n
(Y

n
i 1
n
i
Y )
2
n
 w
i 1
j 1
ij
When residual spatial autocorrelation is present,
several approaches may be taken to adjust for it.
The simplest is to add a fixed effect dummy
variable to allow the model intercept to change
with spatial location. For example, an adjustment
is made that depends on a county of membership.
y i  x i β   c   i
This is essentially stratifying the analysis by location
And can be done with any statistical software
Since spatial location is a proxy for unobserved
randomly varying covariables, it is more correctly
treated as a random effect in a mixed effect model,
such as
E [ y i | S ( i )]  x i β  S ( i )
w here S(i) ~ N (0,  )
2
s
Which can be solved for through pseudolikelihood methods, using software like
PROC GLIMMIX or PROC MIXED in SAS, or
R with appropriate library (?)
Illustration:
Community Teen Pregnancy Rates vs. Socioeconomic
Status and Demographic Composition
For each ZIP code:
Response (i.e. Teen Pregnancy cases)
 Predictors:
•% pop. > age 24 w/ 4-year or greater
college degree
•% single-parent households out of
households w/ at least one child < 18 years
old
•% of tot. pop. that is Black Alone
•% of tot. pop. that is Hispanic, regardless of
race
•% of tot. pop. that is a foreign-born
naturalized citizen
•% of tot. pop. with income below poverty
Population at Risk
County
(crude indicator of neighborhood effect)
Teen Preg rate vs Single-Parent Households
Teen Preg rate vs Education
350
Pregnancies per 1000 females
age 15-19
300
250
200
150
100
50
0
0
20
40
60
80
100
300
250
200
150
100
50
0
0
10
% adults with 4-yr college degree
20
30
40
% households with one parent at home
Teen Preg rate vs % Immigrants
350
Pregnancies per 1000 females
age 15-19
Pregnancies per 1000 females
age 15-19
350
300
250
...
200
150
100
50
0
0
5
10
15
20
25
% forein-born naturalized citizens
30
35
50
Pregnancies per 1000 females
age 15-19
Teen Preg rate vs Race
350
300
250
200
150
100
50
0
0
20
40
60
80
100
% black alone (regardless of hispanic ethnicity)
Pregnancies per 1000 females
age 15-19
Teen Preg rate vs % Hispanic
350
300
250
200
150
100
50
0
0
20
40
60
% hispanic (regardless of race)
80
100
The Model …
For i = 1, …,n ZIP codes, let
yi = observed caseload
ni = population at risk
{x1, …, xp}i = community predictors
{β1, …, βp} = coefficients
Li = location effect, arising from a random
process such that Li ~ N(0, σL2)
Then, the expected value of yi, given {x1, …, xp, L}i =
E[yi| {x1, …, xp, L}i ] = niexp(β1x1i + … + βp xpi + Li)
• Values for the unknown coefficients {β1, …, βp, σL2 } are
estimated with SAS PROC GLIMMIX, assuming yi arose from a
Poisson random process, conditional on location.
• … thus allowing risk adjusted estimates of caseload for each ZIP
code.
• Incorporating the “location effect”
- adjusts for unidentified covariables that co-vary spatially with
the response, thus reducing residual spatial autocorrelation and
potential confounding
- also provides a “smoothing” effect, in that the predicted
caseload is adjusted towards a common local value
Teen Pregnancy Association with Select Covariables
No Spatial Effect
coefficient name
estimate
p-value
with Spatial Effect
estimate
p-value
intercept
-3.423
<0.0001
-3.262
<0.0001
% adults w/ Bachelors
-0.016
<0.0001
-0.018
<0.0001
% Black Alone
0.008
<0.0001
0.01
<0.0001
% Hispanic
0.009
<0.0001
0.012
<0.0001
% Foreign Born
0.003
0.2884
0.002
0.5906
0.04
<0.0001
0.027
<0.0001
% single-parent households
model parameters
scale
0.166
0.009
chi-square / d.f.
1.13
0.91
-2 log likelihood
7934.5
2706.6
0.92
0.31
Residual Spatial Autocorrelation
(Moran's I)
Deviation of Observed from Model-Predicted
Teen Pregnancy Rates
(3-Year Average for the Year 2005)
No Spatial Correction
Moran's I = 0.92
county boundaries
New York City
Pearson Residuals
No Spatial Correction
9.6 - 23
3.8 - 9.5
0.74 - 3.7
-1.2 - 0.73
-4.4 - -1.3
April, 2009
Deviation of Observed from Model-Predicted
Teen Pregnancy Rates
(3-Year Average for the Year 2005)
with Spatial Random Effect
Moran's I = 0.31
county boundaries
New York City
Pearson Residuals
with Spatial Random Effect
2.3 - 8.0
0.83 - 2.2
0.0093 - 0.82
-0.62 - 0.0092
-2.1 - -0.63
April, 2009
Other approaches include …
A spatial lag model, where
y i    w ij y j  x i β   i
j
and a spatial error model, where
y i    w ij  j  x i β   i
j
for a spatial autoregressive coefficient ρ.
These two models differ by whether the adjustment
is made by a weighted sum of the response variable
or the residuals.
The spatial lag and spatial error models can be
solved for in Geoda, a simple, well supported
freeware found at
http://geodacenter.asu.edu/
… but only for gaussian responses.
For generalized linear models (i.e. Poisson and
logistic regression),
see R with appropriate libraries
http://www.r-project.org/
Another approach is hierarchical modelling,
which treats the response as conditional on the
weighted average of local neighborhood
errors.
Frequentist solutions exist, but these hierarchical models lend
themselves well to a fully Bayesian solution, as used by many
geographic epidemiologists
Main advantages include
* flexibility offered by Generalized Linear Mixed Models
* obtain full distribution of possible outcomes
- allows many ways to view the outcome
(mean, median, percentiles)
- inference based on actual probability distributions,
instead of confidence intervals
Main limitation is level of conceptual difficulty; however,
implementation is accessible through free software …
WINBUGS (Bayesian Inference Using the Gibbs Sampler)
A Hierarchical Model
1 . D efin e a lik elih o o d fo r th e o b servatio n s y , w h ere i  1,..., n reg io n s :
i
y i ~ P o isso n (  i ), w h ere  i  E i i
E i is th e calcu lated e x p ected valu e a n d  i is th e relative risk
(n o te: th e m ax . lik elih o o d estim ate o f  i is ˆi  y i / E i , th e S IR )
2. Link the P oisson expectation to both fixed and random effects:
log( i )  log( E i )   
k
  j xij   i
u
  is
j 1
for a com m on m ean  , fixed effect covariates x ij w ith coefficients  j ,
and random effects (com ponents of variance) due to
unstructured  iu and spatially structured  is sources of variation
3. Assign prior probability distribution s to param eters in the linear m odel
 ~ N (0, 10 4 ),  j ~ N (0, 10 4 ) for all j ,  iu ~ N (0,  u2 )
and [ is |   ] ~ N ( is , s2i ) for spatial neighbor hood  i
i
4. A ssign hyperprior distributions to th e hyperparam eters
2
2
 u  1 /  u ~ G am m a(a,b) and  s  1 /  s ~ G am m a(c ,d)
Focus on the random effect that captures
local spatial autocorrelation
 is is distributed conditionally on location, such that
[ is |   ] ~ N ( i ,  2i )
i 

j i
w ij 

j i
w ij
j
and  2i 
 2

j i
w ij
A Directed Acyclic Graph of the Bayesian Model
beta
tau.u
X(covariate)
epsilon.u
alpha
tau.s
m u[i]
E[i]
y[i]
for(i IN 1 : n)
epsilon.s [i]
Gibbs sampling basic procedure
-
All stochastic parameters in the model are assigned an initial value
(somewhat arbitrarily).
-
The values for each parameter are updated by random simulation from a
conditional probability distribution, given all other parameters in the
model.
-
After all terms have been updated, completing one cycle
(of what is called a Markov Chain), the cycle is repeated.
-
After many iterations, the simulated values for each term converge to a
stationary posterior distribution
(further iterations don’t change the distribution)
Estimation and inference can then be made from these posterior distributions
For example, a simulated sample of 1000 fitted SIR values (μi / Ei)
can be used to yield a point estimate (typically the median)
and an interval estimate, such as the 95 %-tile range (credible set)
0.12
0.10
0.08
0.06
0.04
0.02
0.00
5th %-tile
50th %-tile
S IR
95th %-tile
An illustration for geospatial analysis
of prostate cancer incidence
in New York State, USA …
Prostate Cancer Incidence by ZIP code
adjusted for age and race
New York State
1994-1998
Example Output:
Posterior Kernel Densities of Prostate Cancer Incidence
(`94-`98) for Some Manhattan ZIP Codes
SIRhat[27] sample: 1000
SIRhat[29] sample: 1000
8.0
6.0
4.0
2.0
0.0
6.0
4.0
2.0
0.0
0.8
1.0
1.2
1.4
0.6
SIRhat[26] sample: 1000
0.8
1.0
1.2
SIRhat[28] sample: 1000
8.0
6.0
4.0
2.0
0.0
6.0
4.0
2.0
0.0
0.6
0.7
0.8
0.9
1.0
0.8
SIRhat[25] sample: 1000
6.0
4.0
4.0
2.0
2.0
0.0
0.0
0.6
0.8
1.2
SIRhat[30] sample: 1000
6.0
0.4
1.0
1.0
0.6
0.8
1.0
1.2
some references
• Waller, L.A. and Gotway, C.A. 2004. Applied Spatial Statistics for Public Health
Data. Wiley. 494 pp.
• Johnson, G.D. 2004. Smoothing Small Area Maps of Prostate Cancer Incidence in
New York State (USA) using Fully Bayesian Hierarchical Modelling. International
Journal of Health Geographics 2004, 3:29
( http://www.ij-healthgeographics.com/content/3/1/29 )
• Elliot, P., Wakefield, J.C., Best, N.G. and Briggs, D.J. 2000. Spatial
Epidemiology: Methods and Applications. Oxford. 475 pp.
• Statistics in Medicine. 2000. Vol. 19 (special issue on disease mapping)
• Lawson, A. et al. 1999. Disease Mapping and Risk Assessment for Public
Health. Wiley. 482 pp.
Method and Software Sources
GeoDa
http://geodacenter.asu.edu/
(with links to R and R-Geo)
WINBUGS for Bayesian Modeling
http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml
Both of these freewares are supported by large
international community with active listserves
Download