GIS Thesis - December 2005: Geographic Weighted Regression

I. Introduction
This project attempts to identify spatial heterogeneities in regression models
of georeferenced Medicaid claims for diabetes within the state of Texas. Spatial
variability of estimated local regression coefficients is examined to determine
localized influences on the accumulation of diabetes claims. Results from this
study will demonstrate the value of linking Medicaid claims databases to geographic
maps. Numerous products result which can assist in more efficient allocation of
resources needed to treat diabetes. This project uses an innovative technique known as
“Geographic Weighted Regression” (GWR) to expose demographic trends, as well as
geographic regions, where diabetes claims are more concentrated. GWR models
are constructed that show the dependencies of per capita expenditures on
Medicaid diabetes claims, as well as diabetes population density, on
demographic attributes such as race, age, poverty status, and geographic
location, where the latter four variables are considered “independent” parameters
in ensuing regression models. An assessment is conducted evaluating the
“truthfulness” of using GWR as opposed to more traditional statistical analysis
approaches. GWR studies have typically excluded examinations of multicollinearity
considerations among local regression coefficients. This project implements an
innovative approach for analyzing degrees of correlation among local regression
coefficients. Finally, this project presents automation tools which
extend the work performed on diabetes to any type of pathology or other demographic
attribute. A “geographic analysis machine” is proposed that links Medicaid claims
databases to spatial statistics processors which result in the production of analytic
maps that can be used for policy planning.
II. Problem Statement
A principal motivation for conducting this research was the recent emergence of
state Medicaid claims data which had previously not been disclosed. The state’s data
had been used mainly for fraud detection, in that Medicaid claims with anomalous
costs could be investigated. Other potential uses of the data, such as traditional
medical geography applications, were generally not pursued. The Texas Department of
Health and Human Services database contains all records of Medicaid recipients in the
state, covering in excess of 7000 distinct types of medical afflictions attributed to the claimants.
Using appropriate georeferencing, claimants’ locations could be determined to the county and
census tract level within the state, implying that spatial patterns of disease pathology
could be easily mapped and studied. Concomitant with such map
construction would be extensions of spatial analysis for probing potential geographic and
demographic contributions to a particular disease. Geographic weighted regression is a
technique for investigating those influences on disease propagation.
A second motivation for conducting this research was the opportunity to harness
an emerging integration of Geographic Information Science technologies with
computational statistics software, an integration which has been underutilized in geoscience
investigation. This project has extended the research potential of ArcGIS through its
integration with SAS. This integration afforded the development of a toolset used by this
project for the development of computational medical geography utilities. Such toolsets
are extensible enough that their applications can include augmenting
administration functions of state health services, minimizing excessive claims costs, and
providing for efficient delivery of services. Public Medicaid policy decisions and
planning can now be based on quantified results rather than legislative debates.
The foundation for the analysis used by this project is the application of geographic
weighted regression (GWR) techniques to demonstrate spatial heterogeneities in
Medicaid claims for selected demographic groups. Applications for GWR have mostly
been theorized but are now finding a growing user base in the GIS community. Currently,
there is no GWR functionality available in any popular GIS turnkey system so
implementation of its procedures has to be largely solved programmatically by the
investigator. This project implements an interface between GWR and ESRI’s ArcGIS.
The calculations involve computational processing using a considerable amount of linear
algebra and statistical processing in the form of principal component analysis (PCA).
This project bundles all of these functionalities into a single ArcObjects interface.
A fourth essential contribution of this project is to question the validity of GWR
results, a practice which has not been embraced by users of this protocol. Most GWR
demonstrations do not extend results beyond simple t-statistics. However, any formal use
of spatial statistics should include an analytical treatment of the validity of derived
models. With GWR models, analysts should view the results with a fair degree of
skepticism. GWR coefficients should be examined for potential multicollinearity problems
and unacceptable levels of coefficient correlation. This project programmatically
implements a graphical tool for principal component analysis, a technique
used to examine spatial structural effects in GWR coefficients.
In summary, this project attempts to implement three principal objectives. The
first goal is to conduct GWR analysis on state-wide diabetes claims. The principal
dependencies explored will be claims counts and per capita expenditures at the county
level. These dependencies will be modeled in terms of demographic variables such as the
percentages of elderly, rural, foreign-born, black, and Hispanic populations within the
state’s counties. The geographic parameter (regression points) will consist of county
centroids. A second objective is to develop automation tools to be used for the production
of exploratory GWR maps. The tools will be authored using ESRI’s ArcObjects and
SAS. The final objective of this research is to analyze the validity of GWR results. Tools
are developed to quantify structural effects among GWR model coefficients.
III. Literature Review
A vast amount of literature is available on applying GWR to social science research. This
project does, however, address a specialized area of research which has not previously
been studied: the analysis of potential spatial non-stationarity in state
Medicaid claims for diabetes. Only recently has access to state Medicaid databases been
allowed for researchers. The Schools of Social Sciences and Engineering are bidding on
becoming the state repository for Medicaid data to be accessed by social science
researchers. In addition, this project takes progressive steps toward integrating GWR software
with the popular GIS tool, ArcEditor, from ESRI and with an established statistical
processing tool available from SAS. To this end, the integration of GWR processing,
SAS tools, and ESRI ArcGIS processing presents a unique package for exploratory spatial
data analysis and statistical processing. Key literature resources are listed in the
References section of this paper.
IV. Data Sources
The data for this study originated from three principal sources. The sources are:
1. The United States Census Bureau (http://www.census.gov),
2. The Texas Department of Health and Human Services, Office of the Inspector
General (http://www.hhsc.state.tx.us), and
3. Environmental Systems Research Institute (ESRI) (http://www.esri.com).
The Census Bureau provided detailed demographic tabular data organized by county and
census tract levels. ESRI furnished similar data along with the shapefiles used for GWR
analysis. The Department of Health and Human Services (HHS) provided the database
schema by which Medicaid records are defined. HHS data originate from medical care
providers’ claims for reimbursement. The HHS data used in this study originate from
Medicaid claims received in August, 2000.
Conducting the research requires an understanding of the HHS database schema.
For this project’s purposes, two relevant HHS data tables were identified that
would serve as inputs into GWR processing. The first table consisted of a client address
table. In this table, the fields include the Texas street address, city, and zip code of a
Medicaid claimant. The primary key for this table consists of an HHS-assigned unique
client number.
The second table of interest was the client records table, where the primary key was
the unique client number, and this key served as a foreign key for the client address table.
One of the principal fields of interest of this table consisted of standardized ICD9 codes
that characterized a treatment for a particular claimant. For instance, the ICD9 code
range for diabetes used in this study was 25000 to 25003. ICD9 codes are readily
available through internet search engines. Every medical pathology treated by physicians
has been assigned an ICD9 code. In the database used by this project, over 7000 distinct
ICD9 codes were processed in the time-frame for which the data was collected. The
client records table is updated every month by HHS. Additionally, the client records table
contained financial information regarding the costs of medical treatment. The financial
information was organized generally into charges incurred by medical providers and
payments authorized by the state. Between the two principal tables, a one-to-many
relationship was structured between the client address and the client claims. A client
could be treated for multiple afflictions, and thus numerous ICD9 codes could be
associated with a claim, which is not surprising since, for example, diabetes could give
rise to other medical problems for patients such as stroke or blindness.
Two critical issues arose during the preprocessing of the data. One issue concerned
the public policy of disclosing Medicaid data in urban counties, and the other issue
concerned the georeferencing of Medicaid claims data. In the former case, a considerable
amount of claims data were not available owing to the privatization of Medicaid claims
processing by Health Maintenance Organizations (HMOs). Privatization contracts did not
require HMOs to disclose claimant address information. Thus, a potential for the
introduction of artifacts into GWR processing arises. One note however, is that a
significant quantity of urban claims data were still being reported and inserted into HHS
data tables at the time the data were collected. Initiatives are now under way to provide
linkages between HMO and HHS databases.
A second critical preprocessing concern regarded the georeferencing of claims data.
Two distinct problems were posed when using ESRI’s geocoding tools. The first
problem encountered was the limited ability of the geocoding tool to perform address matching.
Due to the implementation of the address matching filters, only 70% of the records
formatted for address matching could be correctly geocoded. Thus, of the two million
Medicaid claims records posted in August, 2000, only about 1.4 million records were
address matched. The author of this paper did not pursue trying to “patch” the
remaining 600,000 records. The utility of geocoding all Medicaid records becomes
evident later in this paper, when it is demonstrated that the geoprocessing
of diabetes claims can be extended to any of the 7000 ICD9 codes without having to
reprocess the data or reengineer the tool set. Unfortunately, however, when a spatial query
was conducted to select diabetes claims from the georeferenced data for August, 2000,
only 8000 claims out of approximately 16,000 claims were geocoded.
One of the value-added propositions this project proposes is its unique processing of
the geocoded records. The interface between ESRI software and external data sources is
not well implemented for automation. For instance, a construction of a Medicaid point
shapefile for diabetes claims would involve an analyst having to exit the ESRI
environment and having to interface with a database management system to perform a
complex structured query language (SQL) extraction of an ICD9 code under
investigation. A typical SQL statement would be written as follows:
SELECT T.*
FROM medpoints INNER JOIN [
    SELECT Client_Num,
           SUM(netbilledamount) AS NBA, SUM(papaidamount) AS PPA,
           SUM(totalallowedamnt) AS TAA, SUM(totalpayableamt) AS TPA
    FROM ClientRecords
    WHERE ClientRecords.PRIMDIAGCD >= 25000 AND
          ClientRecords.PRIMDIAGCD <= 25003
    GROUP BY Client_Num
] AS T ON VAL(medpoints.Client_Num) = T.Client_Num
Posing this query to a relational database management system, or to something like
Microsoft Access 2000, is routine. Such a query is hardly routine for ESRI ArcEdit, or
even for ESRI’s ArcSDE (Spatial Database Engine). The author has searched ESRI user
forums for an ArcObjects or programmable interface that would accommodate such a
query, and no satisfactory response was available. However, with SAS bridged software,
automation of large data set processing is readily available. A SQL request is passed from
an ArcObjects class to SAS, which constructs an XY claims event table (using PROC
SQL) and passes it back to ArcObjects, which subsequently renders the table into a
point shapefile. An ArcObjects cursor is implemented that traverses counties, census
tracts, or other types of polygons. Using the “esrirelcontains” enumerated data type for
a spatial query, the claims points falling within a particular polygon are readily extracted,
and the resulting records are attached to a census shapefile using another ArcObjects
class, whose source is in the appendix. A literature search found this approach to be
unique at the time this paper was written. A flow chart summarizes the data
processing:
V. Analysis and Methodology
GWR Model Basics
This project employs GWR techniques to assess how state Medicaid diabetes claims
are distributed both spatially and demographically within the state of Texas. A GWR
model is essentially an extension of a global regression model, which is defined as
follows:
$$y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i$$
where
• $\{x_{ik}\}$ are observations for i = 1,..,n cases and k = 1,..,m explanatory variables,
• $\{y_i\}$ are the dependent variables,
• the $\beta$’s are the estimates of the coefficients,
• and the $\varepsilon$’s are normally distributed error terms.
For this project, the observations (regression points) are centered on the State’s county
centroids, so n = 254. The explanatory (independent) variables are typically
demographic variables obtained from census data, such as the percentage of the population
who are black, Hispanic, foreign born, elderly, and so on. The dependent variable, as
stated in the introduction, may be the absolute value of the logarithm of the count of
diabetes claims within each county. The computation of the latter value poses
programmatic challenges in that county location is not a field in Medicaid records.
Thus, a containment relationship has to be programmed that “counts” the number of
claims within each county.
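The containment count described above can be sketched in a few lines. This is only an illustration: the project itself performs the query with ArcObjects, whereas the sketch below uses a plain ray-casting test, and the county outlines, claim coordinates, and function names are all hypothetical.

```python
from collections import Counter

def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon `ring`,
    given as a list of (x, y) vertices (the closing vertex is optional)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def count_claims_by_county(claims, counties):
    """Count claim points falling inside each county polygon."""
    counts = Counter({name: 0 for name in counties})
    for x, y in claims:
        for name, ring in counties.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # a claim belongs to at most one county
    return dict(counts)

# Hypothetical square "counties" and geocoded claim points.
counties = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
claims = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 5.0)]
claim_counts = count_claims_by_county(claims, counties)
print(claim_counts)
```

The last claim point lies outside both polygons and is simply not counted, which mirrors the treatment of records that cannot be placed in any county.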
In GWR,
$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$
where
• $(u_i, v_i)$ are the coordinates of the ith point in space and
• the $\beta_k(u_i, v_i)$ are the values of spatially varying, continuous functions at point i.
In this project, as mentioned above, the coordinates of the ith point correspond to a
particular county centroid. The computation of county centroids in this project is
executed through the use of Visual Basic for Applications classes developed under
ESRI’s ArcObjects.
In GWR, estimates can be made for $\beta$:
$$\hat{\beta}(u_i, v_i) = (\mathbf{x}^T \mathbf{W}(u_i, v_i)\, \mathbf{x})^{-1} \mathbf{x}^T \mathbf{W}(u_i, v_i)\, \mathbf{y}$$
where $\mathbf{W}(u_i, v_i)$ is an n × n matrix whose off-diagonal elements are 0 and whose diagonal
elements denote the geographical weighting of each of the n observed data for a given
regression point.
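As a minimal sketch of this estimator, the following NumPy function solves the weighted normal equations for a single regression point. The function name and the toy data are assumptions made for illustration; the project itself carries out this computation in SAS Proc IML.

```python
import numpy as np

def gwr_local_beta(X, y, w):
    """Weighted least-squares estimate at a single regression point.

    X : (n, k+1) design matrix whose first column is all 1s
    y : (n,) dependent variable
    w : (n,) diagonal elements of W(u_i, v_i)
    """
    XtW = X.T * w  # same result as X.T @ np.diag(w), without forming W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical toy data: y is exactly 1 + 2x, so any positive weights recover (1, 2).
x = np.arange(6, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
w = np.array([1.0, 0.8, 0.5, 0.3, 0.1, 0.05])
beta_hat = gwr_local_beta(X, y, w)
print(beta_hat)
```

Solving the linear system directly avoids forming the explicit inverse, which is usually the more stable choice numerically.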
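In global regression, a regression matrix takes the form:
$$\mathbf{y} = \mathbf{x}\beta + \varepsilon \quad \text{and} \quad \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$$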
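In GWR,
$$\mathbf{y} = (\boldsymbol{\beta} \otimes \mathbf{x})\,\mathbf{1} + \varepsilon$$
where the $\otimes$ operator means that each element of $\boldsymbol{\beta}$ is multiplied by the corresponding
element of $\mathbf{x}$. For n data points and k explanatory variables, $\dim(\mathbf{x}) = n \times (k+1)$ and $\mathbf{1}$ is a
$(k+1) \times 1$ vector of 1’s. The matrix of local coefficients is
$$\boldsymbol{\beta} = \begin{bmatrix}
\beta_0(u_1, v_1) & \beta_1(u_1, v_1) & \cdots & \beta_k(u_1, v_1) \\
\beta_0(u_2, v_2) & \beta_1(u_2, v_2) & \cdots & \beta_k(u_2, v_2) \\
\vdots & \vdots & & \vdots \\
\beta_0(u_n, v_n) & \beta_1(u_n, v_n) & \cdots & \beta_k(u_n, v_n)
\end{bmatrix}$$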
The estimated parameters in each row are obtained:
$$\hat{\beta}(i) = (\mathbf{x}^T \mathbf{W}(i)\, \mathbf{x})^{-1} \mathbf{x}^T \mathbf{W}(i)\, \mathbf{y}$$
where $\hat{\beta}(i)$ is a location-based weighted least squares estimator and i is a matrix row.
W(i) is an n × n spatial weighting matrix:
$$\mathbf{W}(i) = \begin{bmatrix}
w_{i1} & 0 & \cdots & 0 \\
0 & w_{i2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & w_{in}
\end{bmatrix}$$
where $w_{in}$ is the weight given to data point “n” in the calibration of the model for location “i.”
In this project, this “weights” matrix is calculated using a bi-square function. The data for
this computation is gathered from computations of the distances between the centroids of
the counties. A note must be said about using Arc Edit to perform these computations.
This function was available until recently and has since been deprecated by recent Arc
Edit updates. The author of this paper has resurrected this method. A discussion about the
calculations of the elements of the weights matrix follows.
In GWR, local standard errors account for variations in data used to compute
estimates. In some cases, local parameters might be a function of relatively few data
points, or data points might have low weights in a local regression because they lie far
from a regression point. Thus, GWR takes into account the analysis of the variance of the
parameter estimates.
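$$\operatorname{Var}[\hat{\beta}(u_i, v_i)] = \mathbf{C}\mathbf{C}^T \sigma^2$$
where
• $\mathbf{C} = (\mathbf{x}^T \mathbf{W}(u_i, v_i)\, \mathbf{x})^{-1} \mathbf{x}^T \mathbf{W}(u_i, v_i)$ and
• $\sigma^2$ is the normalized residual sum-of-squares from a local regression:
$$\sigma^2 = \sum_i (y_i - \hat{y}_i)^2 / (n - 2v_1 + v_2)$$
$$v_1 = \operatorname{tr}(\mathbf{S}), \quad v_2 = \operatorname{tr}(\mathbf{S}^T \mathbf{S}), \quad \hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$$
The rows of S are $\mathbf{r}_i = \mathbf{x}_i (\mathbf{x}^T \mathbf{W}(u_i, v_i)\, \mathbf{x})^{-1} \mathbf{x}^T \mathbf{W}(u_i, v_i)$.
In GWR, $n - 2v_1 + v_2$ equals the effective degrees of freedom for the residual, and standard
errors are given by
$$SE(\hat{\beta}_i) = \sqrt{\operatorname{Var}(\hat{\beta}_i)}$$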
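The quantities above can be assembled as in the following NumPy sketch. It is not the project's SAS Proc IML code: the Gaussian kernel, the grid of toy coordinates, and all names are assumptions chosen only to make the example self-contained.

```python
import numpy as np

def gwr_standard_errors(X, y, coords, b):
    """Local standard errors from Var = C C^T sigma^2 at every regression point."""
    n, p = X.shape
    S = np.empty((n, n))
    C_list = []
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / b) ** 2)      # Gaussian kernel (an assumption here)
        XtW = X.T * w
        C = np.linalg.solve(XtW @ X, XtW)    # C = (x'Wx)^-1 x'W, shape (p, n)
        C_list.append(C)
        S[i] = X[i] @ C                      # row i of the hat matrix S
    resid = y - S @ y
    v1, v2 = np.trace(S), np.trace(S.T @ S)
    sigma2 = resid @ resid / (n - 2.0 * v1 + v2)
    se = np.array([np.sqrt(np.diag(C @ C.T) * sigma2) for C in C_list])
    return se, sigma2, v1, v2

# Hypothetical data on a 5 x 5 grid of regression points.
n = 25
coords = np.column_stack([np.arange(n) % 5, np.arange(n) // 5]).astype(float)
x = np.cos(np.arange(n))
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + 0.1 * np.sin(3.0 * np.arange(n))
se, sigma2, v1, v2 = gwr_standard_errors(X, y, coords, b=2.0)
print(se.shape, sigma2, v1, v2)
```

As a sanity check on the design, letting the bandwidth grow very large makes every weight approach 1, so the local standard errors should collapse to the ordinary least squares standard errors and $v_1$ should approach the number of parameters.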
In this project, the C matrix is calculated using SAS 9.1. At this point, a discussion must
be presented on the automation techniques used by this project. The C matrix is easily
calculated using SAS Proc IML, whose code is included in the appendix of this paper.
What is of most interest is the recent integration of SAS with ArcObjects, a “bridged”
software tool authored by SAS that provides an object-oriented interface between
SAS modules and ESRI’s ArcObjects. This bridge provides a means for ArcObjects
classes to instantiate SAS objects and a means for bidirectional exchange of data
sets. Thus, shapefile data are easily exported to SAS programs, and SAS can compute
data for shapefiles that traditionally are not available within ESRI tools. Further discussion
of this unique tool integration follows. As mentioned previously, $\mathbf{W}(u_i, v_i)$ is a weighting
scheme based on the proximity of a regression point “i” to data points around i, without
an explicit relationship. In a global regression,
$$y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i, \qquad w_{ij} = 1 \quad \forall\, i, j,$$
noting that j is a specific point in space at which data are observed and i is a point in
space for which parameters are estimated. For GWR, three choices exist for the local
weighting function:
I. Use a moving window weighting function:
wij = 1 if dij < d
wij = 0 otherwise.
For every regression point, only a subset of the points are used to calibrate the model.
This weighting scheme introduces discontinuities which ultimately lead to sharp contour
lines in maps.
II. Use an exponential (Gaussian) smoothing function:
$$w_{ij} = \exp\left[-\tfrac{1}{2}(d_{ij}/b)^2\right]$$
where b is a kernel bandwidth.
III. Use a bi-square function:
$w_{ij} = [1 - (d_{ij}/b)^2]^2$ if $d_{ij} < b$
$w_{ij} = 0$ otherwise.
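The three fixed-bandwidth weighting functions can be written directly from their definitions. The sketch below is illustrative only; the function names and the sample distances are assumptions.

```python
import math

def moving_window(d, threshold):
    """Weight 1 inside the window, 0 outside."""
    return 1.0 if d < threshold else 0.0

def gaussian(d, b):
    """Exponential (Gaussian) smoothing function with bandwidth b."""
    return math.exp(-0.5 * (d / b) ** 2)

def bisquare(d, b):
    """Bi-square function: decays smoothly to 0 at distance b."""
    return (1.0 - (d / b) ** 2) ** 2 if d < b else 0.0

# Hypothetical centroid distances (same units as the bandwidth).
distances = [0.0, 5.0, 10.0, 20.0]
b = 10.0
for d in distances:
    print(d, moving_window(d, b), gaussian(d, b), bisquare(d, b))
```

Unlike the moving window, the Gaussian and bi-square functions change continuously with distance, which avoids the sharp contour lines mentioned above.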
The preceding three methods for calculating geographic weights assumed that the
bandwidth, b, was fixed. GWR also employs a technique of using spatially varying
kernels. There are three such techniques.
I. The first method uses a technique where data points are ranked in terms of their
distances from each regression point “i” such that Rij is the rank of the jth point from
point i in terms of the distance j is from i. The weights decrease as the rank increases:
$$w_{ij} = \exp(-R_{ij}/b)$$
The effect is that the bandwidth of the kernels is reduced in regions with large amounts of
data.
II. The second method ensures that the sum of the weights for any regression point “i” is
a constant “C”:
$$\sum_j w_{ij} = C \quad \forall\, i$$
Compute the optimal value of C as follows:
1. Select an initial value for C.
2. Calibrate the weighting function with the selected C as a constraint and calculate the
“goodness-of-fit” statistic for the model.
3. If the fit is not optimal, select a new value for C and return to step 2. If the fit is
optimal, proceed with GWR.
III. The third method involves constructing a function that is related to the N nearest
neighbors of point “i”:
$w_{ij} = [1 - (d_{ij}/b)^2]^2$ if j is one of the N nearest neighbors of i,
$w_{ij} = 0$ otherwise.
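Two of the spatially varying kernels can be sketched as follows. The text does not say how ranks are numbered or how b is set in the nearest-neighbor case, so the sketch assumes that ranks start at 1 and that b equals the distance to the Nth nearest neighbor; both choices, like the function names, are illustrative assumptions.

```python
import math

def rank_weights(distances, b):
    """Rank-based kernel w_ij = exp(-R_ij / b); ranks start at 1 (an assumption)."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])
    ranks = [0] * len(distances)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return [math.exp(-r / b) for r in ranks]

def nearest_neighbour_bisquare(distances, n_neighbours):
    """Bi-square kernel restricted to the N nearest neighbours of point i.
    The bandwidth is taken as the distance to the Nth nearest neighbour
    (an assumption), so that neighbour itself receives weight 0."""
    b = sorted(distances)[n_neighbours - 1]
    if b == 0.0:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    return [(1.0 - (d / b) ** 2) ** 2 if d < b else 0.0 for d in distances]

# Hypothetical distances from regression point i to five data points.
d_i = [0.0, 4.0, 1.0, 3.0, 2.0]
w_rank = rank_weights(d_i, b=2.0)
w_nn = nearest_neighbour_bisquare(d_i, n_neighbours=3)
print(w_rank)
print(w_nn)
```

In both cases the effective reach of the kernel shrinks where data are dense and widens where data are sparse, which is the purpose of a spatially varying kernel.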
One issue with GWR analysis regards the task of calibrating the spatial weighting
function. The larger the bandwidth becomes, the closer the model comes to a global OLS model.
Furthermore, smaller bandwidths lead to increased variance in parameter estimates, which
then depend only on observations in close proximity to a regression point. One way to calibrate the
kernel function is to minimize z:
$$z = \sum_{i=1}^{n} [y_i - \hat{y}_i(b)]^2$$
where the $\hat{y}_i$’s are fitted values for a given bandwidth b. The problem with this approach is
that an obvious minimum occurs when b = 0. This approach is remedied by using cross-validation (CV):
$$CV = \sum_{i=1}^{n} [y_i - \hat{y}_{\neq i}(b)]^2$$
where $\hat{y}_{\neq i}(b)$ is the fitted value of $y_i$ with the observation at point i omitted from the calibration.
This procedure involves plotting the CV versus the bandwidth for a given weighting
function. The optimal value occurs at the minimum point.
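A minimal sketch of the cross-validation search follows, assuming a fixed Gaussian kernel and a small grid of candidate bandwidths. The toy data, the candidate grid, and the function names are hypothetical; the point is only that observation i is given zero weight when its own fitted value is computed.

```python
import numpy as np

def gaussian_weights(coords, i, b):
    """Gaussian kernel weights of every data point relative to regression point i."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    return np.exp(-0.5 * (d / b) ** 2)

def cv_score(X, y, coords, b):
    """Leave-one-out cross-validation score for bandwidth b."""
    total = 0.0
    for i in range(len(y)):
        w = gaussian_weights(coords, i, b)
        w[i] = 0.0                       # omit observation i from its own calibration
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        total += (y[i] - X[i] @ beta) ** 2
    return total

# Hypothetical data: the slope drifts from west to east, so small bandwidths should fit better.
n = 30
coords = np.column_stack([np.linspace(0.0, 10.0, n), np.zeros(n)])
x = np.cos(np.arange(n))
X = np.column_stack([np.ones(n), x])
y = 2.0 + (1.0 + 0.3 * coords[:, 0]) * x

bandwidths = [0.5, 1.0, 2.0, 5.0, 20.0]
scores = {b: cv_score(X, y, coords, b) for b in bandwidths}
best_b = min(scores, key=scores.get)
print(scores)
print("bandwidth with minimum CV:", best_b)
```

Plotting `scores` against `bandwidths` gives the CV curve described above; a finer grid or a one-dimensional optimizer could replace the coarse candidate list.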
A second approach used to calibrate the spatial weighting function is to minimize the
Akaike Information Criterion (AIC):
$$AIC = 2n \log_e(\hat{\sigma}) + n \log_e(2\pi) + n \left\{ \frac{n + \operatorname{tr}(\mathbf{S})}{n - 2 - \operatorname{tr}(\mathbf{S})} \right\}$$
where n is the sample size and $\hat{\sigma}$ is the estimated standard deviation of the error term.
The AIC is used to assess whether GWR is better than global regression.
For this project, both calibration techniques are employed for different reasons. During
the initial exploratory phase of GWR map construction, the AIC method is used
along with a variable bandwidth approach. This approach is available through GWR
software that is integrated easily into ArcObjects. An ArcObjects interface is
implemented that calls GWR 3.0, a software tool built by Stewart Fotheringham. One
consequence of using GWR software is that the weights matrix is not revealed in the
GWR reports. This poses problems for analyzing multicollinearity among the regression
coefficients. As a consequence, this project calibrates the kernel using the generalized
cross-validation method, principally because it is easier to program. Thus, for each
regression point, a unique bandwidth is easily found. With the matrix processing power
of SAS Proc IML, the S matrix is easily calculated. The elements of the weights matrix
were easily found using the bi-square function. The regression coefficients (“betas”) were
obtained by simply multiplying the C matrix by the y-vector using SAS. These
computations would have been very difficult to implement using ESRI tools. The
advantage gained was that a SAS dataset containing only the betas was generated,
exported, and joined to a county shapefile, all within a single ArcObjects class
implementation. Thus, a high degree of automation was achieved and hence a significant
value proposition offered by this project. The ArcObjects class source code is in the
appendix of this paper. The principal motivation of “regenerating” the betas was alluded
to in the introduction of this paper. As presented in the introduction, any correct
utilization of GWR must be accompanied by an examination of potential
multicollinearity between the local regression coefficients, which are derived from the
weights matrix. Thus, an “observer” or “estimator” had to be constructed to estimate the
W matrix from GWR “black box” processing. From the estimates of W, the following
equation is derived:
This equation is the representation of the covariance matrix of the localized regression
coefficients, and it was also calculated using SAS Proc IML. This matrix is an
intermediate step in the calculation of the local correlation matrix:
which is
written here in SAS notation. The former computations serve multiple purposes. First and
foremost it allows analysts (geographers) to assess structural effects in GWR regression
coefficients. In this project, the proposed assessment method uses scree plots and factor
analysis arising from traditional principal component analysis. Such techniques are
clearly unavailable in ESRI functionality but are optimized in SAS tools. With the
ArcObjects-SAS bridge interface, PCA (principal component analysis) is readily
available for geographic research, and thus detailed analyses of variances are easily
attached to ESRI maps. Secondly, with the integration of ESRI shapefiles and SAS data
sets, the “R” matrix is readily joined to the county shapefile and choropleth maps can be
automatically generated showing the geographic variation of local coefficient
correlations.
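The step from a covariance matrix to a correlation matrix, and from there to the eigenvalues used in a scree plot, can be sketched as follows. This is not the project's SAS Proc IML code: the sketch assumes the local covariance is proportional to $\mathbf{C}\mathbf{C}^T$ as defined earlier, and the function names and the 2 × 2 demonstration matrix are hypothetical.

```python
import numpy as np

def cov_to_corr(cov):
    """Scale a covariance matrix to a correlation matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def local_coefficient_corr(X, w):
    """Correlation among local coefficient estimates at one regression point,
    assuming their covariance is proportional to C C^T with C = (x'Wx)^-1 x'W."""
    XtW = X.T * w
    C = np.linalg.solve(XtW @ X, XtW)
    return cov_to_corr(C @ C.T)          # sigma^2 cancels in the scaling

def scree(R):
    """Eigenvalues of R in descending order and each one's share of the total."""
    eig = np.linalg.eigvalsh(R)[::-1]
    return eig, eig / eig.sum()

# Hypothetical covariance matrix for two coefficients.
R_demo = cov_to_corr(np.array([[4.0, 2.0], [2.0, 9.0]]))
eig, share = scree(R_demo)
print(R_demo)
print(eig, share)
```

A dominant first eigenvalue indicates that the local coefficients move together, which is the structural effect a scree plot is meant to reveal.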
Inference and GWR
Statistical inference is concerned with the process of inferring information from the
analysis of statistical data sets. Statistical inference answers three kinds of questions:
• Is some fact true on the basis of the data?
• Within what interval does the model coefficient lie?
• Which one of a series of potential mathematical models is the best?
Exploratory data analysis (EDA) and visualization are used to expose the structure of
data and to identify potential spatial patterns. The issues not addressed in EDA are:
• To what extent are the observed representations of the data the result of random
variation in data collection?
• Are the observed patterns attributable to geographic trends?
A significance test assesses how likely it is that some fact is true on the basis of the given data.
The approach is to formulate a null hypothesis, H0. In GWR, the question posed is:
• How likely is an observed pattern if H0 is true, that is, if the data are generated by
a global model? If the observed pattern is unlikely under the hypothesis, H0 is
rejected.
The “p-value” of a test is the probability of obtaining a pattern at least as extreme as the
one observed, given that H0 is true. A significance test is evaluated as to whether the p-value
falls below a threshold, usually 0.01 or 0.05. These thresholds are the probabilities of incorrectly
rejecting the hypothesis given that it is actually true.
GWR also assesses the intervals within which coefficients lie. In global regression,
standard errors can be derived for regression coefficients. These are the sampling standard
deviations of the coefficient estimates. For example, when the sample size n is large, an
interval defined by a coefficient estimate +/- 1.96 times the standard error will contain
the true coefficient value 95% of the time. GWR estimates coefficient surfaces:
regression coefficient values are estimated for a set of geographical locations.
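As a small worked example of the interval rule, the sketch below applies it to the intercept estimate and standard error reported in the global regression output of Section VI. The function name is hypothetical, and the large-sample normal approximation (z = 1.96) is assumed.

```python
def confidence_interval(estimate, std_err, z=1.96):
    """Large-sample confidence interval: estimate +/- z * standard error."""
    half_width = z * std_err
    return estimate - half_width, estimate + half_width

# Intercept and its standard error from the global regression output.
lower, upper = confidence_interval(4.274859935176, 0.520198013714)
print(f"95% interval for the intercept: ({lower:.4f}, {upper:.4f})")
```

The same function applies to a local GWR coefficient once its local standard error has been computed, giving one interval per regression point.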
The third consideration for GWR modeling is to determine analytically which model
is the best. An AIC estimate makes such a determination by showing how close a
proposed model is to a true model. Recall that a GWR statistical model is specified by:
$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$
where
• $\{x_{ij}\}$ for i = 1,..,n cases and j = 1,..,k explanatory variables,
• $\{y_i\}$ dependent variables,
• $\{(u_i, v_i)\}$ location coordinates for each case,
• $\{\beta_0(u, v), \beta_1(u, v), .., \beta_k(u, v)\}$ are k+1 continuous functions at the location (u,v)
in a geographic study area, and
• the $\varepsilon_i$’s are random error terms that are independent, normally distributed with a
mean of zero and a variance of $\sigma^2$.
GWR seeks to provide non-parametric estimates for the betas using kernel-based
methods. A log-likelihood for any set of estimates is written as:
$$L(\beta_0(u, v) \ldots \beta_k(u, v) \mid D) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big[ y_i - \beta_0(u_i, v_i) - \sum_{j=1}^{k} x_{ij}\, \beta_j(u_i, v_i) \Big]^2$$
where
$$D = (\{x_{ij}\}, \{y_i\}, \{(u_i, v_i)\})$$
and where the functions are chosen to maximize L. Because the error terms are normally
distributed, this maximum likelihood approach is equivalent to selecting the betas using a
least-squares method. Since the functions are
arbitrary, they can always be chosen to obtain a residual sum of squares of zero, which
results in a non-unique solution. The solution for this problem is to
• make the betas functions of (u,v), that is, use a parametric representation, and
• employ a calibration procedure.
In calibrating GWR, estimate $\{\beta_0(u, v), \beta_1(u, v), .., \beta_k(u, v)\}$ on a point-wise basis.
Given a specific point in geographic space, $(u_0, v_0)$, estimate
$\{\beta_0(u, v), \beta_1(u, v), .., \beta_k(u, v)\}$, where the point is arbitrary. For smooth functions,
$$y_i = \gamma_0 + \sum_{j=1}^{k} x_{ij}\, \gamma_j + \varepsilon_i$$
is a close approximation at $(u_0, v_0)$. The gammas are constants.
In the calibration, the goal is to minimize the weighted least squares criterion, WLS:
$$WLS = \sum_{i=1}^{n} w(d_{0i}) \Big[ y_i - \gamma_0 - \sum_{j=1}^{k} x_{ij}\, \gamma_j \Big]^2$$
where
• $d_{0i}$ = distance between $(u_0, v_0)$ and $(u_i, v_i)$ and
• $\gamma_j = \hat{\beta}_j(u_0, v_0)$
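For assessing GWR inference and GWR hypothesis testing, recall that $\hat{\mathbf{y}} = \mathbf{S}\mathbf{y}$ where
S is an n x n matrix. The fitted residuals are (I – S)y, where I is the identity matrix:
$$RSS = \mathbf{y}^T (\mathbf{I} - \mathbf{S})^T (\mathbf{I} - \mathbf{S})\, \mathbf{y}$$ and
$$E(RSS) = (n - [2\operatorname{tr}(\mathbf{S}) - \operatorname{tr}(\mathbf{S}^T \mathbf{S})])\, \sigma^2 + E(\mathbf{y})^T (\mathbf{I} - \mathbf{S})^T (\mathbf{I} - \mathbf{S})\, E(\mathbf{y})$$
where the first term relates to the variance of the fitted values and the second term is the
bias, which is zero.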
To test for GWR parameter stationarity, computations are performed for the variance
for parameter k,
$$V_k = \frac{1}{n} \sum_{i=1}^{n} \Big( \hat{\beta}_{ik} - \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_{ik} \Big)^2$$
The question to be answered is whether the observed variation is sufficient to reject the
hypothesis that the parameter is globally fixed. To provide an estimate for this value, a Monte
Carlo approach is adopted. For a given number of repetitions, the geographical coordinates
of the observations are randomly permuted against the variables. Each repetition yields a value of
the variance of the coefficient of interest, and together these values are used as an experimental distribution.
The actual value of the variance is compared to this list to obtain an experimental
significance level.
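The permutation procedure can be sketched as below, again assuming a fixed Gaussian kernel. The toy data, the seed, the number of permutations, and the rule used to turn the simulated variances into a significance level (the share of simulated values at least as large as the observed one, counting the observed value itself) are assumptions made for illustration.

```python
import numpy as np

def local_betas(X, y, coords, b):
    """Local coefficient estimates at every regression point (Gaussian kernel)."""
    n = len(y)
    betas = np.empty((n, X.shape[1]))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / b) ** 2)
        XtW = X.T * w
        betas[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return betas

def stationarity_test(X, y, coords, b, k, n_perm=99, seed=0):
    """Monte Carlo test of whether coefficient k varies more than chance allows."""
    rng = np.random.default_rng(seed)
    observed = local_betas(X, y, coords, b)[:, k].var()   # V_k with divisor n
    sims = np.array([
        # Permute the coordinates against the variables and recompute V_k.
        local_betas(X, y, coords[rng.permutation(len(y))], b)[:, k].var()
        for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(sims >= observed)) / (n_perm + 1)
    return observed, p_value

# Hypothetical data on a 6 x 6 grid; the slope drifts from west to east.
n = 36
coords = np.column_stack([np.arange(n) % 6, np.arange(n) // 6]).astype(float)
x = np.cos(np.arange(n))
X = np.column_stack([np.ones(n), x])
y = 2.0 + (1.0 + 0.5 * coords[:, 0]) * x
observed_var, p_value = stationarity_test(X, y, coords, b=1.5, k=1, n_perm=49)
print(observed_var, p_value)
```

A small experimental significance level suggests that the spatial variation in the coefficient is unlikely to have arisen under a globally fixed parameter.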
GWR confidence intervals are analyzed for estimated values rather than in terms of
significance tests. To establish point-wise confidence intervals for the regression
coefficients, the GWR asymptotic variance-covariance matrix is the inverse of the information matrix:
$$I(\beta_0 \ldots \beta_k) = \operatorname{outer}\left( E\left\{ \frac{\partial L(\beta_0 \ldots \beta_k)}{\partial \beta_i} \,\Big|\, u_0, v_0 \right\} \right)$$
where
• outer() is the multiplicative outer product,
• $L(\beta_0 \ldots \beta_k)$ is the global likelihood of $\beta_0 \ldots \beta_k$ at $(u_0, v_0)$, and
• I is the information matrix associated with the estimates at $(u_0, v_0)$.
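The true values of the partial derivatives are not known, but we have
$\hat{\beta} = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y} = \mathbf{C}\mathbf{y}$ and the $y_i$’s are independently distributed with the same
variance. We have:
$$\operatorname{var}(\mathbf{y}) = \sigma^2 \mathbf{I}$$
and the point-wise variance:
$$\operatorname{var}(\hat{\beta}) = \mathbf{C}\mathbf{C}^T \sigma^2$$
Thus, a means is available for obtaining point-wise confidence intervals for the surface
estimates.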
Akaike Information Criterion (AIC)
A useful approach to GWR model selection is to use the Akaike Information
Criterion (AIC). The AIC is an estimate for:
$$\int f(y) \log_e\big( f(y) / g(y) \big)\, dy$$
which measures the information distance between the model distribution g and the true
distribution f. This quantity should be compared for a number of competing models
g1,..gl. This equation can be approximated by:
$$AIC = 2n \log_e(\hat{\sigma}) + n \log_e(2\pi) + n \left\{ \frac{n + \operatorname{tr}(\mathbf{S})}{n - 2 - \operatorname{tr}(\mathbf{S})} \right\}, \qquad \hat{\sigma}^2 = \frac{RSS}{n}$$
To choose a model, compute the RSS and S for each model and then compute the AIC.
The model with the smallest AIC is the best model. Recall that the AIC depends on the selections of the
bandwidth and the explanatory variables.
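The AIC expression can be evaluated directly from the residual sum of squares and the trace of S. As a check, the sketch below applies it to the global regression diagnostics reported in Section VI (RSS = 272.264048, 254 cases, 8 effective parameters), for which the software reports an AIC of 757.195755. The function name is hypothetical.

```python
import math

def aic(rss, n, trace_s):
    """AIC as defined above, with sigma_hat^2 = RSS / n."""
    sigma_hat = math.sqrt(rss / n)
    return (2.0 * n * math.log(sigma_hat)
            + n * math.log(2.0 * math.pi)
            + n * (n + trace_s) / (n - 2.0 - trace_s))

# Global regression: tr(S) equals the number of parameters (8).
aic_global = aic(rss=272.264048, n=254, trace_s=8.0)
print(round(aic_global, 4))
```

The same function can be applied to a GWR model once its RSS and tr(S) are known, and the two values compared as described above.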
VI. Results and Discussion
***************************************************************
*                                                             *
*        GEOGRAPHICALLY WEIGHTED GAUSSIAN REGRESSION          *
*                                                             *
***************************************************************
Number of data cases read: 254
Observation points read...
Dependent mean= 2.92190647
Number of observations, nobs= 254
Number of predictors, nvar= 7
Observation Easting extent: 1187098.88
Observation Northing extent: 1123500.63
*Calibration will be based on 254 cases
*Adaptive kernel sample size limits: 12 254
*AICc minimisation begins...
       Bandwidth               AICc
     86.782112790000     767.597187632216
    133.000000000000     754.972216424732
    179.217887210000     753.551620115680
    207.782112451766     752.192471238334
    225.435774549194     751.640938037997
    236.346337773379     751.731106620642
    218.692675675951     751.850312919249
    229.603238850788     751.499404470137
** Convergence after 8 function calls
** Convergence: Local Sample Size= 230
**********************************************************
*              GLOBAL REGRESSION PARAMETERS              *
**********************************************************
Diagnostic information...
Residual sum of squares.........   272.264048
Effective number of parameters..     8.000000
Sigma...........................     1.052029
Akaike Information Criterion....   757.195755
Coefficient of Determination....     0.259583
Adjusted r-square...............     0.235406
Parameter       Estimate          Std Err              T
---------   ---------------   ---------------   ---------------
Intercept    4.274859935176    0.520198013714    8.217755317688
POP2000     -0.000000266707    0.000000256527   -1.039681553841
PHISP       -0.008354039480    0.004410888479   -1.893958449364
PBLACK       0.036178413000    0.011251201157    3.215515613556
PFOREIGN    -0.006623232921    0.015815420478   -0.418783247471
PRURAL      -0.017983713899    0.002806731410   -6.407351016998
PELDERLY    -0.006212336505    0.014791697775   -0.419988065958
PPOV        -0.022799333827    0.013675052253   -1.667220950127
**********************************************************
*                     GWR ESTIMATION                     *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 254
Number of independent variables... 8
(Intercept is variable 1)
Number of nearest neighbours...... 230
Number of locations to fit model.. 254
Diagnostic information...
Residual sum of squares.........   246.610900
Effective number of parameters..    16.269298
Sigma...........................     1.018506
Akaike Information Criterion....   750.537428
Coefficient of Determination....     0.329346
Adjusted r-square...............     0.283256
**********************************************************
*
PARAMETER 5-NUMBER SUMMARIES
*
**********************************************************
Label
Minimum Lwr Quartile
Median Upr Quartile
Maximum
-------- ------------- ------------- ------------- ------------- ------------Intrcept
3.317199
POP2000
PHISP
3.694232
0.000000
-0.018075
4.512109
0.000000
0.000000
-0.013787
-0.013355
PBLACK
0.012528
0.016086
PFOREIGN
-0.011653
-0.007855
PRURAL
-0.022410
PELDERLY
PPOV
-0.037078
-0.021990
0.000000
-0.006948
0.000040
0.041738
-0.004262
0.014880
-0.014865
-0.003240
-0.015035
0.064501
0.005832
-0.018830
-0.016861
4.927451
0.000000
0.021614
-0.021173
-0.030341
4.752016
-0.009875
0.008302
-0.011095
0.018373
-0.007864
<------------------ LOWER -----------------><------------------ UPPER ----------------->
Label Far Out Outer Fence Outside Inner Fence Inner Fence Outside Outer Fence Far Out
-------- ------- ------------- ------- ------------- ------------- ------- ------------- ------Intrcept
0
POP2000
PHISP
0
-0.034305
0
PFOREIGN
0
PELDERLY
0
0
0
0.003311
-0.028384
0
0
0
0.046891
0
0
0.004058
0
0
0
0.118695
0
0.046046
0.005247
0.013569
0
-0.005404
0
0.000000
0
0.026361
-0.054605
-0.038333
7.925371
0.080217
-0.030634
0
0
0.000000
-0.022392
0
-0.092349
-0.054675
-0.024046
0
-0.048914
6.338694
-0.000001
0
-0.040095
0
0
2.107554
-0.060870
0
PRURAL
0
-0.000001
0
PBLACK
PPOV
0.520877
0
0.083790
0.021590
0
0
*************************************************
*   Test for spatial variability of parameters  *
*************************************************
Tests based on the Monte Carlo significance test
procedure due to Hope [1968,JRSB,30(3),582-598]
Parameter     P-value
----------    ------------------
Intercept     0.14000 n/s
POP2000       0.58000 n/s
PHISP         0.07000 n/s
PBLACK        0.04000 *
PFOREIGN      0.67000 n/s
PRURAL        0.06000 n/s
PELDERLY      0.17000 n/s
PPOV          0.60000 n/s
*** = significant at .1% level
** = significant at 1% level
* = significant at 5% level
Interpretation of Maps
The maps are contour plots of the GWR model coefficients. The localized model
equations are written in the upper left-hand corners of the maps. The “BetaX” coefficients,
where “X” is a number, correspond to the localized GWR coefficients. The contour
maps illustrate spatial variation for county-wide population-normed diabetes claims
counts. The absolute values of the logarithms of the count percentages were computed
locally to adjust for small population percentages. The five-number summary of the
distributions presents the median, upper and lower quartiles, and the minimum and
maximum values of the local parameter estimates. This information helps to gauge the
degree of spatial non-stationarity by comparing the ranges of the local parameter estimates
with confidence intervals around the global estimates of the equivalent parameters.
Approximately 50% of the local parameter values lie between the lower and upper
quartiles, and approximately 68% of values in a normal distribution lie within
+/- 1 standard deviation of the mean. The procedure is therefore to compare the range of
the local estimates between the lower and upper quartiles with the range of +/- 1
standard deviation around the global estimate. Because 68% of the values would be
expected to lie within the latter interval, compared to 50% within the inter-quartile range,
an inter-quartile range of local estimates that is wider than the +/- 1 standard deviation
range of the global estimate suggests a significant degree of spatial non-stationarity for
that parameter. To this end, the calculations produced the following:
Parameter   Global Standard   Global Mean   Global -1SD   Global +1SD   Local LQ    Local UQ
            Deviation
PCTHISP     0.00441           -0.008354     -0.012764     -0.003944     -0.013787   -0.006948
PCTBLACK    0.0113             0.03619       0.02489       0.04749       0.016086    0.041738
On the basis of this evidence, with the global model explaining roughly 26% of the
variance, there exists almost no local variation for the PCTHISP parameter whereas there
may exist some slight local variation in the PCTBLACK parameter. Nevertheless, for this
model, there is very little spatial variability, resulting in a negative conclusion for this
project’s hypothesis that spatial heterogeneities exist.
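The comparison between the local inter-quartile range and the global +/- 1 standard deviation band can be written out as a short Python sketch. The function and dictionary names are hypothetical; the numbers are the PCTHISP and PCTBLACK rows of the table above.

```python
def compare_spread(global_estimate, global_se, local_lq, local_uq):
    """Compare the local inter-quartile range with the global +/- 1 SD band."""
    band_width = 2.0 * global_se
    iqr_width = local_uq - local_lq
    return {
        "band": (global_estimate - global_se, global_estimate + global_se),
        "band_width": band_width,
        "iqr_width": iqr_width,
        "iqr_wider": iqr_width > band_width,   # True hints at spatial non-stationarity
    }


# (global estimate, global standard deviation, local LQ, local UQ)
rows = {
    "PCTHISP": (-0.008354, 0.00441, -0.013787, -0.006948),
    "PCTBLACK": (0.03619, 0.0113, 0.016086, 0.041738),
}
results = {name: compare_spread(*values) for name, values in rows.items()}
```

With these inputs the PCTHISP inter-quartile width (0.006839) is narrower than its global band (0.00882), while the PCTBLACK width (0.025652) slightly exceeds its band (0.0226), which matches the reading given in the text.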
The Monte Carlo significance test is a method used to examine the significance of
spatial variability. From the Monte Carlo significance test generated by this GWR run,
there is only a small indication of significant spatial variation regarding the parameter
estimates for the PCTBLACK element (p = 0.04). For all of the other cases, there exists a
reasonably high probability that variations occurred by chance. A criterion for
constructing GWR maps has as its foundations the p-values returned by this significance
test. GWR maps reveal correct information about spatial heterogeneities, i.e., spatial
non-stationarity, when the Monte Carlo significance tests return p-values of less than five
percent.
Suppose, for instance, that the inter-quartile range for a parameter contained the
interval of the global regression +/-1 standard deviation range. The implication would be
that spatial heterogeneities are likely for the given parameter. When this
happens, the focus of the research switches to an examination of the cause of the
geographic variation. Possible causes to investigate include environmental issues, fraud,
or economic dislocation.
VII.Conclusion
Assessment
As stated in the introduction of this paper, any rigorous treatment involving GWR
analysis should involve an investigation of multicollinearity at the various regression
points as well as an investigation of correlation among the local regression coefficients.
Analytic tools were developed in this project to reveal potential dependencies among the
local coefficients. SAS programs were authored to analyze the correlation between pairs
of local regression coefficients at one location as well as to analyze the correlation
between two overall sets of local coefficient estimates associated with two exogenous
variables. The value proposition stemming from this project was the development of
ArcObjects tools which support ad hoc investigation of any selected battery of
independent variables. According to Tiefelsdorf and Wheeler [1], “weak dependencies
[among GWR coefficients] impede substantive interpretation of local GWR estimates,
whereas strong dependencies induce artifacts that invalidate meaningful [GWR]
interpretation…” Exploratory tools used to expose multicollinearity among the
exogenous variables include bivariate correlation coefficients and bivariate scatter plots
of pairs of exogenous variables.
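The correlation between overall sets of local coefficient estimates can be illustrated with a few lines of Python. This is a stand-in sketch, not the SAS programs written for the project; the function name and the layout of `local_betas` (one row per regression point, one column per coefficient) are assumptions made for the example.

```python
import numpy as np


def coefficient_correlations(local_betas, names):
    """Pairwise Pearson correlations between columns of local coefficient estimates."""
    corr = np.corrcoef(local_betas, rowvar=False)   # columns are treated as variables
    return {
        (names[i], names[j]): float(corr[i, j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
```

Pairs with correlations close to +1 or -1 are the ones that would call for caution before the corresponding coefficient maps are interpreted separately.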
Figure
Model Equation:
Diabetes Count = PARM_1 + PARM_2(lat,long)*Elderly + PARM_3(lat,long)*Rural +
PARM_4(lat,long)*Foreign Born + PARM_5(lat,long)*Black + PARM_6(lat,long)*Hispanic.
Explanation of GWR Parameters:
PARM_1 : Intercept,
PARM_2: Percent of county population who are elderly,
PARM_3: Percent of county population who live in rural areas,
PARM_4: Percent of county population who are foreign born,
PARM_5: Percent of county population who are black, and
PARM_6: Percent of county population who are Hispanic.
In this project, SAS graphics routines were interfaced to ArcObjects classes to produce
bivariate scatter plots and histograms of the distributions of coefficient correlations, the
latter of which are readily available for choropleth maps.
Note: “COL1” corresponds to model eigenvalues.
Note: “Beta” is the same as “PARM”
According to Tiefelsdorf and Wheeler, a signal that multicollinearity is present in
GWR processing is a large change, or even a reversal in sign, in one regression
coefficient after another exogenous variable is added to the model or specific
observations are excluded from the analysis. At this juncture, automation is
essential because rebuilding GWR models by hand is tedious and prone to error. A further
technique used in the assessment of spatial structural effects in GWR regression
coefficients involved developing SAS tools for principal component analysis (PCA).
PCA was used to evaluate the interdependencies among the sets of local regression
coefficients. Using SAS Proc IML, scree plots were constructed which exposed breaks in
component eigenvector-observation plots. A value proposition introduced by this project
was the development of ArcObjects routines to generate scree plots on any ad hoc
selection of a battery of independent variables. In furtherance of this analysis, SAS Proc
Factor was invoked from an ArcObjects class to generate reports for the explanation of
variances in GWR model constructions and to show a corresponding reduction in model
dimensionality. An interesting result of PCA was that, for some models, the results
contradicted Fotheringham’s GWR3.0 Monte Carlo simulation for variance explanation.
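The scree-plot idea can be sketched in Python as a stand-in for the SAS Proc IML routines, which are not reproduced here. The function name is hypothetical, and the sketch assumes that the PCA is run on the correlation matrix of the local coefficient estimates (one row per regression point, one column per coefficient).

```python
import numpy as np


def scree_values(local_betas):
    """Eigenvalues of the coefficient correlation matrix, largest first, with variance shares."""
    corr = np.corrcoef(local_betas, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # eigvalsh returns ascending order
    explained = eigenvalues / eigenvalues.sum()    # share of total variance per component
    return eigenvalues, explained
```

Plotting the eigenvalues against their rank gives the scree plot; a sharp break after the first few components indicates that the local coefficients vary together along only a few dimensions.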
Contributions
A number of contributions have arisen from this project. They are enumerated as
follows:
1. At the conclusion of Fotheringham’s benchmark study of GWR analysis (Chapter 9),
he proposed a future initiative to integrate GWR processing with a geospatial processing
engine like ArcEditor (Chapter 10). This project implements such an interface, which is
included in the appendix.
2. The capabilities of using ESRI tools as a research engine were greatly expanded with
the development of ArcObjects classes that invoke SAS procedures. A clear example
developed by this project was the expansion of database query functionality for
ArcObjects using SAS bridged software. Complex data requests can be assembled with
ArcObjects and passed to SAS Proc SQL, and the resulting dataset is joined to a shapefile.
This application requires the user only to be knowledgeable of the database
schema and SQL formulation. No other intervention is required.
3. The capabilities of ArcEdit spatial statistics were broadened substantially. Routines
not available with ESRI processing such as PCA and GWR computations are now readily
available.
4. Aside from the technical advances proposed by this project, some value was obtained
from GWR contour plots and choropleth maps. The maps can provide a means for
systematic policy planning for diabetes remediation. Spatial heterogeneities of diabetes
within the state are clearly shown. The presence of geographic influences as a
contributing factor to diabetes claims, as well as costs, was demonstrated.
5. Implicit in the maps is the depiction of a non-uniform cost structure for diabetes claims
throughout the state. The treatment of diabetes appears to be more costly in certain parts
of the state, and this runs counter to policy which essentially requires costs to be uniform.
An investigation is warranted as to why this is happening. Is the cause due to fraud or
local economic conditions?
Future Research
A number of considerations for future enhancements of this study come to mind. Many of
them have to do with the tool set developed for this study. Yet other extensions should
treat GWR modeling itself.
Items for future consideration for the tool can include:
1. Automate the production of contour maps,
2. Port the ArcObjects application to a compiled platform such as .NET, and
3. Find a less expensive alternative to SAS, and develop the IML
and SQL routines for a stand-alone package.
Insofar as GWR processing goes, the computations should incorporate techniques for
handling strongly correlated regression coefficients. Furthermore, further investigation of
GWR and spatial autocorrelation is warranted. Analysis should extend the GWR
framework to provide for the following:
1. local measures of spatial dependency,
2. spatial regression modeling, and
3. the combination of the previous two.
VIII. References
Fotheringham, Stewart, et al. (2002), Geographically Weighted Regression: The
Analysis of Spatially Varying Relationships (John Wiley & Sons, Ltd.). Chapters
2, 4, 9, and 10.
Hatcher, Larry (1994), A Step-by-Step Approach to Using the SAS System for
Factor Analysis and Structural Equation Modeling (SAS Institute). Chapter 1.
Tiefelsdorf and Wheeler (2005), Multicollinearity and Correlation among Local Regression
Coefficients in Geographically Weighted Regression, Journal of Geographical
Systems, 7: 161-187.
Griffith, D. (2003), Spatial Autocorrelation and Spatial Filtering (Springer).
IX. Appendix 1
Part A: Spatial Variation Analysis for Medicaid Costs.
Part B: GWR Run Summary
**********************************************************
*              GLOBAL REGRESSION PARAMETERS              *
**********************************************************
Diagnostic information...
Residual sum of squares......... 81884009272.125061
Effective number of parameters..           5.000000
Sigma...........................       18134.261575
Akaike Information Criterion....        5709.334621
Coefficient of Determination....           0.561128
Adjusted r-square...............           0.552280
Parameter        Estimate               Std Err                T
---------   --------------------   -------------------   ---------------
Intercept   -5375.459519020552     7856.063345608583     -0.684243381023
POP2000         0.059695633705        0.004245905363     14.059576988220
PBLACK        -69.931190419192      171.099291662407     -0.408717006445
PFOREIGN     1028.007265356411      220.286679398173      4.666679382324
PELDERLY       40.459773052502      210.042542516290      0.192626565695
**********************************************************
*                     GWR ESTIMATION                     *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 254
Number of independent variables... 5
(Intercept is variable 1)
Number of nearest neighbours...... 151
Number of locations to fit model.. 254
Diagnostic information...
Residual sum of squares......... 29965965147.482555
Effective number of parameters..          18.008976
Sigma...........................       11268.507362
Akaike Information Criterion....        5482.932136
Coefficient of Determination....           0.839392
Adjusted r-square...............           0.827084
**********************************************************
*              PARAMETER 5-NUMBER SUMMARIES              *
**********************************************************
Label       Minimum         Lwr Quartile    Median          Upr Quartile    Maximum
--------    -------------   -------------   -------------   -------------   -------------
Intrcept    -21849.003946    -6709.180633    -1219.567984     2852.985820    12662.905580
POP2000          0.040852        0.046887        0.056349        0.073981        0.297833
PBLACK        -871.869490     -182.249711      127.609526      211.578930      327.130387
PFOREIGN     -1203.012415     -191.063457       22.814268      409.529703     1479.955342
PELDERLY      -439.820900      -64.597318       48.271116      165.448307      511.439933
            <------------------ LOWER -----------------><------------------ UPPER ----------------->
Label       Far Out   Outer Fence     Outside   Inner Fence     Inner Fence    Outside   Outer Fence    Far Out
--------    -------   -------------   -------   -------------   -------------  -------   -------------  -------
Intrcept       0      -35395.679993      1      -21052.430313    17196.235500     0       31539.485180     0
POP2000        0          -0.034397      0           0.006245        0.114623     6           0.155265    17
PBLACK         0       -1363.735632      5        -772.992671      802.321890     0        1393.064851     0
PFOREIGN       0       -1992.842937      3       -1091.953197     1310.419443    11        2211.309183     0
PELDERLY       0        -754.734194      3        -409.665756      510.516745     1         855.585183     0
*************************************************
*   Test for spatial variability of parameters  *
*************************************************
Tests based on the Monte Carlo significance test
procedure due to Hope [1968,JRSB,30(3),582-598]
Parameter     P-value
----------    ------------------
Intercept     0.85000 n/s
POP2000       0.17000 n/s
PBLACK        0.00000 ***
PFOREIGN      0.61000 n/s
PELDERLY      0.62000 n/s
*** = significant at .1% level
** = significant at 1% level
* = significant at 5% level
Part C: GWR Confidence Intervals
Parameter   Global Standard   Global Mean   Global -1SD   Global +1SD   Local LQ   Local UQ
            Deviation
PCTBLACK    171               -70           -240          101           -182.25    211.6
Appendix 2: Guide to Using Geographic Analysis Machine.
Appendix 3: Project Source Code