GIS Thesis - December 2005: Geographic Weighted Regression

I. Introduction

This project attempts to identify spatial heterogeneities in regression models of georeferenced Medicaid claims for diabetes within the state of Texas. Spatial variability of estimated local regression coefficients is examined to determine localized influences on the accumulation of diabetes claims. Results from this study demonstrate the value of linking Medicaid claims databases to geographic maps. The resulting products can assist in a more efficient allocation of the resources needed to treat diabetes.

This project uses an innovative technique known as “Geographic Weighted Regression” (GWR) to expose demographic trends, as well as geographic regions, where diabetes claims are more concentrated. GWR models are constructed that show the dependencies of per capita expenditures on Medicaid diabetes claims, as well as diabetes population density, on demographic attributes such as race, age, poverty status, and geographic location, where the latter four variables are considered “independent” parameters in the ensuing regression models. An assessment is conducted evaluating the “truthfulness” of using GWR as opposed to more traditional statistical analysis approaches. GWR studies have typically not examined multicollinearity among local regression coefficients. This project implements an innovative approach for analyzing degrees of correlation among local regression coefficients.

Finally, this project presents automation tools which extend the work performed on diabetes to any type of pathology or other demographic attribute. A “geographic analysis machine” is proposed that links Medicaid claims databases to spatial statistics processors, which in turn produce analytic maps that can be used for policy planning.

II. Problem Statement

A principal motivation for conducting this research was the recent emergence of state Medicaid claims data which had previously not been disclosed.
The principal use of the state’s data was fraud detection, in that Medicaid claims with anomalous costs could be investigated. The data were not generally put to other uses, such as traditional medical geography applications. The Texas Department of Health and Human Services database contains all records of Medicaid recipients in the state, covering in excess of 7000 distinct types of medical afflictions attributed to the claimants. Using appropriate georeferencing, claimants’ locations could be determined to the county and census tract level within the state, implying that spatial patterns of a given disease pathology could be easily mapped and studied. Concomitant with such map constructions would be extensions of spatial analysis for probing potential geographic and demographic contributions to a particular disease. Geographic weighted regression is a technique for investigating those influences on disease propagation.

A second motivation for conducting this research was the opportunity to harness an emerging integration of Geographic Information Sciences technologies with computational statistics software, an integration which has been underutilized in geoscience investigation. This project has extended the research potential of ArcGIS through its integration with SAS. This integration afforded the development of a toolset used by this project to build computational medical geography utilities. Such toolsets are extensible enough that their applications can include augmenting administration functions of state health services, minimizing excessive claims costs, and providing for efficient delivery of services. Public Medicaid policy decisions and planning can now be based on quantified results rather than legislative debates.
The foundation for the analysis used by this project is the application of geographic weighted regression (GWR) techniques to demonstrate spatial heterogeneities in Medicaid claims for selected demographic groups. Applications for GWR have mostly been theorized but are now finding a growing user base in the GIS community. Currently, there is no GWR functionality available in any popular GIS turnkey system, so implementation of its procedures has to be largely solved programmatically by the investigator. This project implements an interface between GWR and ESRI’s ArcGIS. The calculations involve a considerable amount of linear algebra and statistical processing in the form of principal component analysis (PCA). This project bundles all of these functionalities into a single ArcObjects interface.

A further essential contribution of this project is to question the validity of GWR results, a step which has not been embraced by users of this protocol. Most GWR demonstrations do not extend results beyond simple t-statistics. However, any formal use of spatial statistics should include an analytical treatment of the validity of the derived models. With GWR models, analysts should view the results with a fair degree of skepticism. GWR coefficients should be examined for potential multicollinearity problems and unacceptable levels of coefficient correlation. This project programmatically implements a graphical interface for principal component analysis, a technique used to examine spatial structural effects in GWR coefficients.

In summary, this project pursues three principal objectives. The first goal is to conduct GWR analysis on state-wide diabetes claims. The principal dependencies explored will be claims counts and per capita expenditures at the county level.
These dependencies will be modeled in terms of demographic variables such as the percentages of elderly, rural, foreign-born, black, and Hispanic populations within the state’s counties. The geographic parameters (regression points) will consist of county centroids. A second objective is to develop automation tools to be used for the production of exploratory GWR maps. The tools will be authored using ESRI’s ArcObjects and SAS. The final objective of this research is to analyze the validity of GWR results. Tools are developed to quantify structural effects among GWR model coefficients.

III. Literature Review

A vast amount of literature is available on applying GWR to social science research. This project does, however, address a specialized area of research which has not previously been studied: the analysis of potential spatial non-stationarity in state Medicaid claims for diabetes. Only recently has access to state Medicaid databases been allowed for researchers. The Schools of Social Sciences and Engineering are bidding on becoming the state repository for Medicaid data to be accessed by social science researchers. In addition, this project takes progressive steps toward integrating GWR software with the popular GIS tool, ArcEditor, from ESRI and with an established statistical processing tool available from SAS. To this end, the integration of GWR processing, SAS tools, and ESRI ArcGIS processing presents a unique package for exploratory spatial data analysis and statistical processing. Key literature resources are listed in the References section of this paper.

IV. Data Sources

The data for this study originated from three principal sources:

1. The United States Census Bureau (http://www.census.gov),
2. The Texas Department of Health and Human Services, Office of the Inspector General (http://www.hhsc.state.tx.us), and
3. Environmental Systems Research Institute (ESRI) (http://www.esri.com).
The Census Bureau provided detailed demographic tabular data organized by county and census tract levels. ESRI furnished similar data along with the shapefiles used for GWR analysis. The Department of Health and Human Services (HHS) provided the database schema by which Medicaid records are defined. HHS data originate from medical care providers’ claims for reimbursement. The HHS data used in this study originate from Medicaid claims received in August, 2000.

Conducting the research requires an understanding of the HHS database schema. For this project’s purposes, two relevant HHS data tables were identified that would serve as inputs into GWR processing. The first table was a client address table. Its fields include the Texas street address, city, and zip code of a Medicaid claimant. The primary key for this table is an HHS-assigned unique client number. The second table of interest was the client records table, where the primary key was the unique client number, and this key served as a foreign key for the client address table. One of the principal fields of interest in this table held the standardized ICD9 codes that characterize a treatment for a particular claimant. For instance, the ICD9 code range for diabetes used in this study was 25000 to 25003. ICD9 codes are readily available through internet search engines. Every medical pathology treated by physicians has been assigned an ICD9 code. In the database used by this project, over 7000 distinct ICD9 codes were processed in the time frame for which the data were collected. The client records table is updated every month by HHS. Additionally, the client records table contained financial information regarding the costs of medical treatment. The financial information was organized generally into charges incurred by medical providers and payments authorized by the state.
Between the two principal tables, a one-to-many relationship was structured between the client address and the client claims. A client could be treated for multiple afflictions, and thus numerous ICD9 codes could be associated with a claim. This is not surprising since, for example, diabetes can give rise to other medical problems such as stroke or blindness.

Two critical issues arose during the preprocessing of the data. One issue concerned the public policy of disclosing Medicaid data in urban counties, and the other concerned the georeferencing of Medicaid claims data. In the former case, a considerable amount of claims data was not available owing to the privatization of Medicaid claims processing by Health Maintenance Organizations (HMOs). Privatization contracts did not require HMOs to disclose claimant address information. Thus, a potential arises for the introduction of artifacts into GWR processing. It should be noted, however, that a significant quantity of urban claims data was still being reported and inserted into HHS data tables at the time the data were collected. Initiatives are now under way to provide linkages between HMO and HHS databases.

A second critical preprocessing concern regarded the georeferencing of claims data. Two distinct problems were posed when using ESRI’s geocoding tools. The first problem encountered was the ability of the geocoding tool to perform address matching. Due to the implementation of the address matching filters, only 70% of the records formatted for address matching could be correctly geocoded. Thus, of the two million Medicaid claims records posted in August, 2000, only about 1.4 million records were address matched. The author of this paper did not attempt to “patch” the remaining 600,000 records.
The utility of geocoding all Medicaid records becomes evident later in this paper, when it is demonstrated that the geoprocessing of diabetes claims can be extended to any of the 7000 ICD9 codes without having to reprocess the data or reengineer the tool set. Unfortunately, however, when a spatial query was conducted to select diabetes claims from the georeferenced data for August, 2000, only 8000 claims out of approximately 16,000 claims were geocoded.

One of the value-added propositions of this project is its unique processing of the geocoded records. The interface between ESRI software and external data sources is not well implemented for automation. For instance, constructing a Medicaid point shapefile for diabetes claims would require an analyst to exit the ESRI environment and interface with a database management system to perform a complex structured query language (SQL) extraction of the ICD9 code under investigation. A typical SQL statement would be written as follows:

    SELECT T.*
    FROM medpoints INNER JOIN (
        SELECT Client_Num,
               SUM(netbilledamount) AS NBA,
               SUM(papaidamount) AS PPA,
               SUM(totalallowedamnt) AS TAA,
               SUM(totalpayableamt) AS TPA
        FROM ClientRecords
        WHERE ClientRecords.PRIMDIAGCD >= 25000 AND ClientRecords.PRIMDIAGCD <= 25003
        GROUP BY Client_Num
    ) AS T
    ON VAL(medpoints.Client_Num) = T.Client_Num

Posing this query to a relational database management system, or to something like Microsoft Access 2000, is routine. Such a query is hardly routine for ESRI ArcEdit, or even for ESRI’s ArcSDE (Spatial Database Engine). The author has searched ESRI user forums for an ArcObjects or other programmable interface that would accommodate such a query, and no satisfactory response was available. However, with SAS bridged software, automation of large data set processing is readily available.
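The aggregate-then-join pattern of the SQL statement quoted above can be tried out on a small scale. The following is a minimal sketch, assuming Python's built-in sqlite3 module rather than the Access or SAS PROC SQL environment used by this project. The table and field names follow the quoted statement, but the `x` and `y` coordinate columns, the helper name `diabetes_event_table`, and all rows are invented for illustration; SQLite's CAST stands in for VAL.

```python
import sqlite3

def diabetes_event_table(conn, low=25000, high=25003):
    """Sum claim amounts per client for a diagnosis-code range, then keep
    only the clients that also appear in the geocoded point table."""
    query = """
        SELECT m.Client_Num, m.x, m.y, t.NBA, t.TPA
        FROM medpoints AS m
        INNER JOIN (
            SELECT Client_Num,
                   SUM(netbilledamount) AS NBA,
                   SUM(totalpayableamt) AS TPA
            FROM ClientRecords
            WHERE PRIMDIAGCD BETWEEN ? AND ?
            GROUP BY Client_Num
        ) AS t
        ON CAST(m.Client_Num AS INTEGER) = t.Client_Num
        ORDER BY t.Client_Num
    """
    return conn.execute(query, (low, high)).fetchall()

# Toy data, invented for illustration (not real claims).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ClientRecords (Client_Num INTEGER, PRIMDIAGCD INTEGER,
                                netbilledamount REAL, totalpayableamt REAL);
    CREATE TABLE medpoints (Client_Num TEXT, x REAL, y REAL);
    INSERT INTO ClientRecords VALUES (1, 25000, 100.0, 80.0);
    INSERT INTO ClientRecords VALUES (1, 25002, 50.0, 40.0);
    INSERT INTO ClientRecords VALUES (2, 40100, 75.0, 60.0);
    INSERT INTO ClientRecords VALUES (3, 25001, 20.0, 10.0);
    INSERT INTO medpoints VALUES ('1', -97.7, 30.3);
    INSERT INTO medpoints VALUES ('2', -96.8, 32.8);
""")

rows = diabetes_event_table(conn)
print(rows)
```

In this toy data set, client 2 has no diabetes code and client 3 was never geocoded, so only client 1 survives the join, with its two diabetes claims summed into one event record.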
A SQL request is passed from an ArcObjects class to SAS, which constructs an XY claims event table (using PROC SQL). The table is passed back to ArcObjects, which subsequently renders it into a point shapefile. An ArcObjects cursor is implemented that traverses counties, census tracts, or other types of polygons. Using the “esrirelcontains” enumerated data type for a spatial query, the claims points falling within a particular polygon are extracted, and the resulting records are attached to a census shapefile using another ArcObjects class, the source of which is in the appendix. A literature search found this approach to be unique at the time this paper was written. A flow chart summarizes the data processing:

[Figure: flow chart of the data processing]

V. Analysis and Methodology

GWR Model Basics

This project employs GWR techniques to assess how state Medicaid diabetes claims are distributed both spatially and demographically within the state of Texas. A GWR model is essentially an extension of a global regression model, which is defined as follows:

$$y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i$$

where
• $\{x_{ik}\}$ are observations for i = 1,..,n cases and k = 1,..,m explanatory variables,
• $\{y_i\}$ are the dependent variables,
• the $\beta$’s are the estimates of the coefficients, and
• the $\varepsilon$’s are normally distributed error terms.

For this project, the observations (regression points) are centered on the State’s county centroids, so that n is 254. The explanatory (independent) variables are typically demographic variables obtained from census data, such as the percentage of the population who are black, Hispanic, foreign born, elderly, and so on. The dependent variable, as stated in the introduction, may be the absolute value of the logarithm of the counts of diabetes claims within each county. The computation of the latter value poses programming challenges in that county location is not a field in Medicaid records. Thus, a containment relationship has to be programmed that “counts” the number of claims within each county.
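The containment count described above can be illustrated outside of ArcObjects. The sketch below is an assumption-laden stand-in, not the project's ArcObjects class: it uses a plain ray-casting point-in-polygon test in pure Python, treats each county as a single ring of (x, y) vertices, and uses two invented square "counties" and four invented claim points. The function names are hypothetical.

```python
import math

def point_in_polygon(x, y, ring):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) toward +infinity crosses; an odd count means inside."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        if (yi > y) != (yj > y):  # edge straddles the ray's y level
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def count_claims_by_county(points, counties):
    """Return {county name: number of claim points contained}."""
    counts = {name: 0 for name in counties}
    for x, y in points:
        for name, ring in counties.items():
            if point_in_polygon(x, y, ring):
                counts[name] += 1
                break  # counties do not overlap
    return counts

# Invented toy geometry: two adjacent unit squares.
counties = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
claims = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5), (3.0, 3.0)]

counts = count_claims_by_county(claims, counties)
# Dependent variable as described in the text: |log(count)| per county.
log_counts = {k: abs(math.log(v)) for k, v in counts.items() if v > 0}
print(counts, log_counts)
```

The point at (3.0, 3.0) falls in neither polygon and is simply not counted, which mirrors what happens to claims that geocode outside every county boundary.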
In GWR,

$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$

where
• $(u_i, v_i)$ are the coordinates of the ith point in space and
• the $\beta_k(u_i, v_i)$ are spatially varying, continuous functions evaluated at point i.

In this project, as mentioned above, the coordinates of the ith point correspond to a particular county centroid. The computation of county centroids in this project is executed through Visual Basic for Applications classes developed under ESRI’s Arc Objects. In GWR, estimates can be made for $\beta$:

$$\hat{\beta}(u_i, v_i) = (x^T W(u_i, v_i)\, x)^{-1} x^T W(u_i, v_i)\, y$$

where $W(u_i, v_i)$ is an $n \times n$ matrix whose off-diagonal elements are 0 and whose diagonal elements denote the geographical weighting of each of the n observed data for a given regression point.

In global regression, the regression in matrix form is:

$$y = x\beta + \varepsilon \quad \text{and} \quad \hat{\beta} = (x^T x)^{-1} x^T y$$

In GWR,

$$y = (\beta \otimes x)\,\mathbf{1} + \varepsilon$$

where the $\otimes$ operator means that each element of $\beta$ is multiplied by the corresponding element of x. For n data points and k explanatory variables, $\dim(x) = n \times (k+1)$ and $\mathbf{1}$ is a $(k+1) \times 1$ vector of 1’s.

$$\beta = \begin{pmatrix} \beta_0(u_1, v_1) & \beta_1(u_1, v_1) & \cdots & \beta_k(u_1, v_1) \\ \beta_0(u_2, v_2) & \beta_1(u_2, v_2) & \cdots & \beta_k(u_2, v_2) \\ \vdots & \vdots & & \vdots \\ \beta_0(u_n, v_n) & \beta_1(u_n, v_n) & \cdots & \beta_k(u_n, v_n) \end{pmatrix}$$

The estimated parameters in each row are obtained from:

$$\hat{\beta}(i) = (x^T W(i)\, x)^{-1} x^T W(i)\, y$$

which is a location-based weighted least squares estimator, and i is a matrix row. W(i) is an $n \times n$ spatial weighting matrix:

$$W(i) = \begin{pmatrix} w_{i1} & 0 & \cdots & 0 \\ 0 & w_{i2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_{in} \end{pmatrix}$$

where $w_{in}$ is the weight given to data point “n” in the calibration of the model for location “i.” In this project, this “weights” matrix is calculated using a bi-square function. The data for this computation are gathered from computations of the distances between the centroids of the counties. A note must be said about using Arc Edit to perform these computations.
The centroid-distance function was available until recently and has since been deprecated by recent Arc Edit updates. The author of this paper has resurrected this method. A discussion of the calculation of the elements of the weights matrix follows.

In GWR, local standard errors account for variations in the data used to compute the estimates. In some cases, local parameters might be a function of relatively few data points, or data points might have low weights in a local regression because they lie far from a regression point. Thus, GWR takes into account the variance of the parameter estimates:

$$\mathrm{Var}[\hat{\beta}(u_i, v_i)] = C C^T \sigma^2$$

where

$$C = (x^T W(u_i, v_i)\, x)^{-1} x^T W(u_i, v_i)$$

and $\sigma^2$ is the normalized residual sum-of-squares from the local regression:

$$\sigma^2 = \sum_i (y_i - \hat{y}_i)^2 / (n - 2v_1 + v_2), \qquad v_1 = \mathrm{tr}(S), \quad v_2 = \mathrm{tr}(S^T S), \quad \hat{y} = S y$$

The rows of S are

$$r_i = x_i\, (x^T W(u_i, v_i)\, x)^{-1} x^T W(u_i, v_i)$$

In GWR, $n - 2v_1 + v_2$ equals the effective degrees of freedom of the residual, and standard errors are given by

$$SE(\hat{\beta}_i) = \sqrt{\mathrm{Var}(\hat{\beta}_i)}$$

In this project, the C matrix is calculated using SAS 9.1. At this point, a discussion must be presented on the automation techniques used by this project. The C matrix is easily calculated using SAS Proc IML; the code is included in the appendix of this paper. Of most interest is the recent integration of SAS with Arc Objects, a “bridged” software tool authored by SAS that provides an object-oriented interface between SAS modules and ESRI’s ArcObjects. This bridge provides a means for ArcObjects classes to instantiate SAS objects and a means for bidirectional exchange of data sets. Thus, shapefile data are easily exported to SAS programs, and SAS can compute data for shapefiles that traditionally are not available within ESRI tools. Further discussion of this unique tool integration follows.
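The estimator, hat matrix, and standard-error formulas above can be collected into a few lines of matrix code. The following is a minimal sketch, assuming NumPy in place of the SAS Proc IML code in the appendix (which is not reproduced here). The function names `bisquare` and `gwr_fit`, the fixed bandwidth, and the 5 x 5 grid of invented "centroids" are all assumptions made for illustration.

```python
import numpy as np

def bisquare(d, b):
    """Bi-square kernel: [1 - (d/b)^2]^2 inside the bandwidth b, else 0."""
    return np.where(d < b, (1.0 - (d / b) ** 2) ** 2, 0.0)

def gwr_fit(coords, X, y, b):
    """Local weighted least squares at every regression point.

    X must already contain a leading column of ones for the intercept.
    Returns local betas, their standard errors, the hat matrix S, sigma^2.
    """
    n, p = X.shape
    betas = np.empty((n, p))
    cct_diag = np.empty((n, p))
    S = np.empty((n, n))
    for i in range(n):
        d = np.hypot(coords[:, 0] - coords[i, 0], coords[:, 1] - coords[i, 1])
        w = bisquare(d, b)                 # diagonal of W(u_i, v_i)
        XtW = X.T * w                      # x^T W, exploiting diagonal W
        C = np.linalg.solve(XtW @ X, XtW)  # C = (x^T W x)^{-1} x^T W
        betas[i] = C @ y                   # beta_hat(u_i, v_i) = C y
        S[i] = X[i] @ C                    # row r_i of the hat matrix
        cct_diag[i] = np.sum(C * C, axis=1)  # diag(C C^T)
    resid = y - S @ y
    v1 = np.trace(S)
    v2 = np.trace(S.T @ S)
    sigma2 = (resid @ resid) / (n - 2.0 * v1 + v2)
    se = np.sqrt(cct_diag * sigma2)        # SE = sqrt(diag(C C^T) sigma^2)
    return betas, se, S, sigma2

# Invented toy data: an exactly linear, spatially constant relationship.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(5.0), np.arange(5.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.normal(size=len(coords))
X = np.column_stack([np.ones(len(coords)), x1])
y = 1.0 + 2.0 * x1

betas, se, S, sigma2 = gwr_fit(coords, X, y, b=4.0)
print(betas[:3])
```

Because the toy relationship is exactly linear and the same everywhere, every local regression recovers the same intercept and slope; real claims data would instead produce a surface of varying coefficients.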
As mentioned previously, $W(u_i, v_i)$ is a weighting scheme based on the proximity of a regression point “i” to the data points around i, without an explicit relationship being specified. In a global regression,

$$y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i, \qquad w_{ij} = 1 \;\; \forall\, i, j$$

noting that j is a specific point in space at which data are observed and i is a point in space for which parameters are estimated. For GWR, three choices exist for the local weighting function:

I. Use a moving window weighting function:

$$w_{ij} = 1 \text{ if } d_{ij} < d, \qquad w_{ij} = 0 \text{ otherwise.}$$

For every regression point, only a subset of the points is used to calibrate the model. This weighting scheme introduces discontinuities which ultimately lead to sharp contour lines in maps.

II. Use an exponential (Gaussian) smoothing function:

$$w_{ij} = \exp\left[-\tfrac{1}{2}(d_{ij}/b)^2\right]$$

where b is a kernel bandwidth.

III. Use a bi-square function:

$$w_{ij} = [1 - (d_{ij}/b)^2]^2 \text{ if } d_{ij} < b, \qquad w_{ij} = 0 \text{ otherwise.}$$

The preceding three methods for calculating geographic weights assume that the bandwidth, b, is fixed. GWR also employs spatially varying kernels. There are three such techniques.

I. The first method ranks data points in terms of their distances from each regression point “i,” such that $R_{ij}$ is the rank of the jth point from point i in terms of the distance of j from i. The weights decrease as the rank increases:

$$w_{ij} = \exp(-R_{ij}/b)$$

The effect is that the bandwidth of the kernels is reduced in regions with large amounts of data.

II. The second method ensures that the sum of the weights for any regression point “i” is a constant “C”:

$$\sum_j w_{ij} = C \quad \forall\, i$$

The optimal value of C is computed as follows:

1. Select an initial value for C.
2. Calibrate the weighting function with the selected C as a constraint and calculate the “goodness-of-fit” statistic for the model.
3. If the fit is optimal, proceed with GWR; otherwise, return to step 1 with a new value of C.

III. The third method involves constructing a function that is related to the N nearest neighbors of point “i”:

$$w_{ij} = [1 - (d_{ij}/b)^2]^2 \text{ if j is one of the N nearest neighbors of i}, \qquad w_{ij} = 0 \text{ otherwise.}$$

One issue with GWR analysis regards the task of calibrating the spatial weighting function. The larger the bandwidth becomes, the closer the model comes to a global OLS model. Conversely, smaller bandwidths lead to increased variance in the parameter estimates, which then depend only on observations in close proximity to a regression point. One way to calibrate the kernel function is to minimize z:

$$z = \sum_{i=1}^{n} [y_i - \hat{y}_i(b)]^2$$

where the $\hat{y}_i$’s are fitted values for a given bandwidth. The problem with this approach is that an obvious minimum occurs when b = 0, because each fitted value then depends only on its own observation and z shrinks to zero. This is remedied by using cross-validation (CV):

$$CV = \sum_{i=1}^{n} [y_i - \hat{y}_{\neq i}(b)]^2$$

where $\hat{y}_{\neq i}(b)$ is the fitted value for point i with the observation at point i omitted from the calibration. This procedure involves plotting the CV score versus the bandwidth for a given weighting function. The optimal value occurs at the minimum point.

A second approach used to calibrate the spatial weighting function is to minimize the Akaike Information Criterion (AIC):

$$AIC = 2n \log_e(\hat{\sigma}) + n \log_e(2\pi) + n \left\{ \frac{n + \mathrm{tr}(S)}{n - 2 - \mathrm{tr}(S)} \right\}$$

where n is the sample size and $\hat{\sigma}$ is the estimated standard deviation of the error term. The AIC is also used to assess whether GWR is better than global regression.

For this project, both calibration techniques are employed for different reasons. During the initial exploratory phase of GWR map construction, the AIC method is used along with a variable bandwidth approach. This approach is available through GWR software that is integrated easily into ArcObjects. An ArcObjects interface is implemented that calls GWR 3.0, a software tool built by Stewart Fotheringham. One consequence of using GWR software is that the weights matrix is not revealed in the GWR reports. This poses problems for analyzing multicollinearity among the regression coefficients.
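The cross-validation calibration described above can be sketched as a search over candidate bandwidths. This is an illustrative sketch only, assuming NumPy, a fixed-bandwidth bi-square kernel, and a simple grid of candidates instead of the golden-section style search that the GWR 3.0 listing in the results suggests. The helper names and the synthetic data (a slope that drifts from west to east) are invented.

```python
import numpy as np

def bisquare(d, b):
    """Bi-square kernel: [1 - (d/b)^2]^2 inside the bandwidth b, else 0."""
    return np.where(d < b, (1.0 - (d / b) ** 2) ** 2, 0.0)

def cv_score(coords, X, y, b):
    """Leave-one-out CV: sum over i of [y_i - yhat_{!=i}(b)]^2."""
    score = 0.0
    for i in range(len(y)):
        d = np.hypot(coords[:, 0] - coords[i, 0], coords[:, 1] - coords[i, 1])
        w = bisquare(d, b)
        w[i] = 0.0  # omit observation i from its own calibration
        XtW = X.T * w
        # lstsq tolerates a near-singular system at very small bandwidths
        beta_i = np.linalg.lstsq(XtW @ X, XtW @ y, rcond=None)[0]
        score += (y[i] - X[i] @ beta_i) ** 2
    return score

def select_bandwidth(coords, X, y, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    scores = [cv_score(coords, X, y, b) for b in candidates]
    return candidates[int(np.argmin(scores))], scores

# Invented toy data: the slope on x1 increases with the easting u.
rng = np.random.default_rng(1)
gx, gy = np.meshgrid(np.arange(8.0), np.arange(8.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.normal(size=len(coords))
X = np.column_stack([np.ones(len(coords)), x1])
y = 1.0 + (0.5 + 0.3 * coords[:, 0]) * x1 + 0.1 * rng.normal(size=len(coords))

candidates = [1.5, 2.5, 4.0, 8.0, 50.0]
best_b, scores = select_bandwidth(coords, X, y, candidates)
print(best_b, scores)
```

The largest candidate (50 grid units) gives every observation nearly equal weight and therefore approximates the global model, so comparing its score with the minimum gives a rough sense of how much the local model gains on data with a drifting coefficient.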
Because the GWR software does not reveal the weights matrix, this project calibrates the kernel using the generalized cross-validation method, principally because it is easier to program. Thus, for each regression point, a unique bandwidth is easily found. With the matrix processing power of SAS Proc IML, the S matrix is easily calculated. The elements of the weights matrix were found using the bi-square function. The regression coefficients (“betas”) were obtained by simply multiplying the C matrix by the y-vector using SAS. These computations would have been very difficult to implement using ESRI tools. The advantage gained was that a SAS dataset containing only the betas was generated, exported, and joined to a county shapefile, all within a single ArcObjects class implementation. Thus, a high degree of automation was achieved, and hence a significant value proposition is offered by this project. The ArcObjects class source code is in the appendix of this paper.

The principal motivation for “regenerating” the betas was alluded to in the introduction of this paper. As presented there, any correct utilization of GWR must be accompanied by an examination of potential multicollinearity between the local regression coefficients, which are derived from the weights matrix. Thus, an “observer” or “estimator” had to be constructed to estimate the W matrix from GWR “black box” processing. From the estimates of W, the following equation is derived:

$$\mathrm{Var}[\hat{\beta}(u_i, v_i)] = C C^T \sigma^2$$

This equation is the representation of the covariance matrix of the localized regression coefficients, and it was also calculated using SAS Proc IML. This matrix is an intermediate step in the calculation of the local correlation matrix (the “R” matrix), which was written in SAS notation.

These computations serve multiple purposes. First and foremost, they allow analysts (geographers) to assess structural effects in GWR regression coefficients.
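The step from the local covariance matrix to a local correlation matrix, and from there to the eigenvalues plotted in a scree plot, can be sketched as follows. This is a hedged illustration, assuming NumPy instead of SAS Proc IML and assuming the usual covariance-to-correlation scaling (dividing each covariance by the product of the two standard deviations); the paper's own SAS expression for the correlation matrix is not reproduced here. The function names, the random design matrix, and the weights are invented.

```python
import numpy as np

def local_coef_correlation(X, w, sigma2=1.0):
    """Covariance C C^T sigma^2 of the local coefficients at one regression
    point, and the correlation matrix obtained by rescaling it."""
    XtW = X.T * w                          # x^T W for diagonal weights w
    C = np.linalg.solve(XtW @ X, XtW)      # C = (x^T W x)^{-1} x^T W
    cov = (C @ C.T) * sigma2
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)          # assumed scaling: cov_jk / (sd_j sd_k)
    return cov, corr

def scree(corr):
    """Eigenvalues of the correlation matrix, largest first, and the share
    of total variance attributed to each principal component."""
    eig = np.linalg.eigvalsh(corr)[::-1]
    return eig, eig / eig.sum()

# Invented toy inputs: 30 observations, intercept plus two predictors.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
w = rng.uniform(0.1, 1.0, size=30)         # stand-in for one row of weights

cov, corr = local_coef_correlation(X, w, sigma2=1.2)
eig, share = scree(corr)
print(np.round(corr, 3), np.round(share, 3))
```

Large off-diagonal entries in `corr`, or one eigenvalue that dominates the scree, would be the kind of structural effect among local coefficients that the text proposes to map county by county.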
In this project, the proposed assessment method uses scree plots and factor analysis arising from traditional principal component analysis. Such techniques are unavailable in ESRI functionality but are optimized in SAS tools. With the ArcObjects-SAS bridge interface, PCA (principal component analysis) is readily available for geographic research, and thus detailed analyses of variances are easily attached to ESRI maps. Secondly, with the integration of ESRI shapefiles and SAS data sets, the “R” matrix is readily joined to the county shapefile, and choropleth maps can be automatically generated showing the geographic variation of local coefficient correlations.

Inference and GWR

Statistical inference is concerned with the process of inferring information from the analysis of statistical data sets. Statistical inference answers three kinds of questions:
• Is some fact true on the basis of the data?
• Within what interval does the model coefficient lie?
• Which one of a series of potential mathematical models is the best?

Exploratory data analysis (EDA) and visualization are used to expose the structure of data and to identify potential spatial patterns. The issues not addressed in EDA are:
• whether random variation in data collection accounts for the observed representations of the data, and
• whether the observed patterns are attributable to geographic trends.

A significance test assesses how likely it is that some fact is true on the basis of the given data. The approach is to formulate a null hypothesis, H0. In GWR, the question posed is:
• How likely is an observed pattern if H0 is true, that is, if the data are generated by a global model?

If the observed pattern is unlikely under the hypothesis, H0 is rejected. The “p-value” of a test is the probability of obtaining a pattern at least as extreme as the one observed, given that H0 is true. A significance test is evaluated as to whether the p-value falls below a threshold, usually 0.01 or 0.05.
These are the probabilities of incorrectly rejecting the hypothesis given that it is actually true.

GWR also assesses the intervals within which coefficients lie. In global regression, standard errors can be derived for the regression coefficients. These are the sampling standard deviations of the coefficient estimates. For example, when the sample size n is large, an interval defined by a coefficient estimate +/- 1.96 times the standard error will contain the true coefficient value 95% of the time. GWR estimates coefficient surfaces: regression coefficient values are estimated for a set of geographical locations.

The third consideration for GWR modeling is to determine analytically which model is the best. An AIC estimate makes such a determination by showing how close a proposed model is to a true model. Recall that a GWR statistical model is specified by:

$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$

where
• $\{x_{ij}\}$ for i = 1,..,n cases and j = 1,..,k are the explanatory variables,
• $\{y_i\}$ are the dependent variables,
• $\{(u_i, v_i)\}$ are the location coordinates for each case,
• $\{\beta_0(u,v), \beta_1(u,v), .., \beta_k(u,v)\}$ are k+1 continuous functions of the location (u,v) in a geographic study area, and
• the $\varepsilon_i$’s are random error terms that are independent and normally distributed with a mean of zero and a variance of $\sigma^2$.

GWR seeks to provide non-parametric estimates for the betas using kernel-based methods. Up to an additive constant, the log-likelihood for any set of estimates is written as:

$$L(\beta_0(u,v) \ldots \beta_k(u,v) \mid D) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \beta_0(u_i, v_i) - \sum_{j=1}^{k} x_{ij}\, \beta_j(u_i, v_i) \Big)^2$$

where $D = (\{x_{ij}\}, \{y_i\}, \{u_i, v_i\})$ and where the functions are chosen to maximize L, which is equivalent to minimizing the residual sum of squares. Because the error terms are normally distributed, this maximum likelihood approach amounts to selecting the betas by a least-squares method. Since the functions are arbitrary, they can always be chosen to obtain a residual sum of squares of zero. This can result in a non-unique solution.
The solution for this problem is to
• make the betas functions of (u,v), that is, use a parametric representation, and
• employ a calibration procedure.

In calibrating GWR, $\{\beta_0(u,v), \beta_1(u,v), .., \beta_k(u,v)\}$ are estimated on a point-wise basis. Given a specific point in geographic space, $(u_0, v_0)$, estimate $\{\beta_0(u,v), \beta_1(u,v), .., \beta_k(u,v)\}$, where the point is arbitrary. For smooth functions,

$$y_i = \gamma_0 + \sum_{j=1}^{k} x_{ij}\, \gamma_j + \varepsilon_i$$

is a close approximation at $(u_0, v_0)$. The gammas are constants. In the calibration, the goal is to minimize the weighted least squares criterion, WLS:

$$WLS = \sum_{i=1}^{n} w(d_{0i}) \Big( y_i - \gamma_0 - \sum_{j=1}^{k} x_{ij}\, \gamma_j \Big)^2$$

where
• $d_{0i}$ = distance between $(u_0, v_0)$ and $(u_i, v_i)$ and
• $\gamma_j = \hat{\beta}_j(u_0, v_0)$.

For assessing GWR inference and GWR hypothesis testing, recall that $\hat{y} = S y$, where S is an n x n matrix. The fitted residuals are $(I - S)y$, where I is the identity matrix:

$$RSS = y^T (I - S)^T (I - S)\, y$$

and

$$E(RSS) = (n - [2\,\mathrm{tr}(S) - \mathrm{tr}(S^T S)])\, \sigma^2 + E(y)^T (I - S)^T (I - S)\, E(y)$$

where the first term relates to the variance of the fitted values and the second term is the bias, which is taken to be zero.

To test for GWR parameter stationarity, the variance of the local estimates of parameter k is computed:

$$V_k = \frac{1}{n} \sum_{i=1}^{n} \Big( \hat{\beta}_{ik} - \frac{1}{n} \sum_{i=1}^{n} \hat{\beta}_{ik} \Big)^2$$

The question to be answered is whether the observed variation is sufficient to reject the hypothesis that the parameter is globally fixed. To answer it, a Monte Carlo approach is adopted. For a given number of repetitions, the geographical coordinates of the observations are randomly permuted against the variables. Each repetition yields one value of the variance of the coefficient of interest, and the collection of these values is used as an experimental distribution. The actual value of the variance is compared to this list to obtain an experimental significance level.

GWR confidence intervals are analyzed for estimated values rather than in terms of significance tests.
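The Monte Carlo stationarity test described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the GWR 3.0 implementation: it assumes NumPy, a fixed-bandwidth bi-square kernel, a rank-based p-value that counts the observed configuration as one of the simulations, and a small invented data set. The names `local_betas` and `monte_carlo_test` are hypothetical.

```python
import numpy as np

def bisquare(d, b):
    """Bi-square kernel: [1 - (d/b)^2]^2 inside the bandwidth b, else 0."""
    return np.where(d < b, (1.0 - (d / b) ** 2) ** 2, 0.0)

def local_betas(coords, X, y, b):
    """Local coefficient estimates, one row per regression point."""
    out = np.empty(X.shape)
    for i in range(len(y)):
        d = np.hypot(coords[:, 0] - coords[i, 0], coords[:, 1] - coords[i, 1])
        XtW = X.T * bisquare(d, b)
        out[i] = np.linalg.solve(XtW @ X, XtW @ y)
    return out

def monte_carlo_test(coords, X, y, b, n_perm=99, seed=0):
    """Permute locations against the data and compare the variance V_k of
    each local coefficient with its permutation distribution."""
    rng = np.random.default_rng(seed)
    observed = local_betas(coords, X, y, b).var(axis=0)  # V_k, 1/n form
    at_least = np.ones_like(observed)      # the observed layout counts once
    for _ in range(n_perm):
        perm = rng.permutation(len(y))
        sim = local_betas(coords[perm], X, y, b).var(axis=0)
        at_least += sim >= observed
    return observed, at_least / (n_perm + 1)

# Invented toy data: the slope drifts with the easting, the intercept does not.
rng = np.random.default_rng(3)
gx, gy = np.meshgrid(np.arange(6.0), np.arange(6.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
x1 = rng.normal(size=len(coords))
X = np.column_stack([np.ones(len(coords)), x1])
y = 1.0 + (0.5 + 0.4 * coords[:, 0]) * x1 + 0.1 * rng.normal(size=len(coords))

v_obs, p_values = monte_carlo_test(coords, X, y, b=3.0, n_perm=19)
print(v_obs, p_values)
```

With `n_perm` permutations the smallest attainable p-value is 1 / (n_perm + 1), so the number of repetitions sets the resolution of the experimental significance level.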
To establish point-wise confidence intervals for the regression coefficients, the GWR asymptotic variance-covariance matrix is given as:

$$I(\gamma_0 \ldots \gamma_k) = \mathrm{outer}\left( E\left[ \frac{\partial L(\gamma_0 \ldots \gamma_k)}{\partial \gamma_i} \,\Big|\, u_0, v_0 \right] \right)$$

where
• outer() is the multiplicative outer product,
• $L(\gamma_0 \ldots \gamma_k)$ is the global likelihood of $\gamma_0 \ldots \gamma_k$ at $(u_0, v_0)$, and
• I is the information matrix associated with the estimates at $(u_0, v_0)$.

The true values of the partial derivatives are not known, but we have

$$\hat{\beta} = (X^T W X)^{-1} X^T W y = C y$$

and the $y_i$’s are independently distributed with the same variance. We have:

$$\mathrm{var}(y) = \sigma^2 I$$

and the point-wise variance:

$$\mathrm{var}(\hat{\beta}) = C C^T \sigma^2$$

Thus, a means is available for obtaining point-wise confidence intervals for the surface estimates.

Akaike Information Criterion (AIC)

A useful approach to GWR model selection is to use the Akaike Information Criterion (AIC). The AIC is an estimate for:

$$\int f(y) \log_e\big( f(y) / g(y) \big)\, dy$$

which measures the information distance between the model distribution g and the true distribution f. This quantity should be compared for a number of competing models $g_1, .., g_l$. It can be approximated by:

$$AIC = 2n \log_e(\hat{\sigma}) + n \log_e(2\pi) + n \left\{ \frac{n + \mathrm{tr}(S)}{n - 2 - \mathrm{tr}(S)} \right\}, \qquad \hat{\sigma}^2 = \frac{RSS}{n}$$

To choose a model, compute the RSS and S for each model and then compute the AIC. The model with the smallest AIC is the best model. Recall that the AIC depends on the selections of the bandwidth and the explanatory variables.

VI. Results and Discussion

    ***************************************************************
    *                                                             *
    *        GEOGRAPHICALLY WEIGHTED GAUSSIAN REGRESSION          *
    *                                                             *
    ***************************************************************
    Number of data cases read: 254
    Observation points read...
    Dependent mean=               2.92190647
    Number of observations, nobs= 254
    Number of predictors, nvar=   7
    Observation Easting extent:   1187098.88
    Observation Northing extent:  1123500.63
    *Calibration will be based on 254 cases
    *Adaptive kernel sample size limits: 12 254
    *AICc minimisation begins...
           Bandwidth               AICc
        86.782112790000    767.597187632216
       133.000000000000    754.972216424732
       179.217887210000    753.551620115680
       207.782112451766    752.192471238334
       225.435774549194    751.640938037997
       236.346337773379    751.731106620642
       218.692675675951    751.850312919249
       229.603238850788    751.499404470137
    ** Convergence after 8 function calls
    ** Convergence: Local Sample Size= 230

    **********************************************************
    *             GLOBAL REGRESSION PARAMETERS               *
    **********************************************************
    Diagnostic information...
    Residual sum of squares.........   272.264048
    Effective number of parameters..     8.000000
    Sigma...........................     1.052029
    Akaike Information Criterion....   757.195755
    Coefficient of Determination....     0.259583
    Adjusted r-square...............     0.235406

    Parameter        Estimate           Std Err              T
    ---------    ---------------    ---------------    ---------------
    Intercept     4.274859935176     0.520198013714     8.217755317688
    POP2000      -0.000000266707     0.000000256527    -1.039681553841
    PHISP        -0.008354039480     0.004410888479    -1.893958449364
    PBLACK        0.036178413000     0.011251201157     3.215515613556
    PFOREIGN     -0.006623232921     0.015815420478    -0.418783247471
    PRURAL       -0.017983713899     0.002806731410    -6.407351016998
    PELDERLY     -0.006212336505     0.014791697775    -0.419988065958
    PPOV         -0.022799333827     0.013675052253    -1.667220950127

    **********************************************************
    *                    GWR ESTIMATION                      *
    **********************************************************
    Fitting Geographically Weighted Regression Model...
    Number of observations............ 254
    Number of independent variables... 8 (Intercept is variable 1)
    Number of nearest neighbours...... 230
    Number of locations to fit model.. 254
    Diagnostic information...
    Residual sum of squares.........   246.610900
    Effective number of parameters..    16.269298
    Sigma...........................     1.018506
    Akaike Information Criterion....   750.537428
    Coefficient of Determination....     0.329346
    Adjusted r-square...............     0.283256

    **********************************************************
    *             PARAMETER 5-NUMBER SUMMARIES               *
    **********************************************************
    Label Minimum Lwr Quartile Median Upr Quartile Maximum
    Intrcept 3.317199 POP2000 PHISP 3.694232 0.000000 -0.018075 4.512109 0.000000 0.000000 -0.013787 -0.013355
    PBLACK 0.012528 0.016086 PFOREIGN -0.011653 -0.007855 PRURAL -0.022410 PELDERLY PPOV -0.037078 -0.021990
    0.000000 -0.006948 0.000040 0.041738 -0.004262 0.014880 -0.014865 -0.003240 -0.015035 0.064501 0.005832
    -0.018830 -0.016861 4.927451 0.000000 0.021614 -0.021173 -0.030341 4.752016 -0.009875 0.008302 -0.011095
    0.018373 -0.007864

    <------------------ LOWER -----------------><------------------ UPPER ----------------->
    Label Far Out Outer Fence Outside Inner Fence Inner Fence Outside Outer Fence Far Out
    Intrcept 0 POP2000 PHISP 0 -0.034305 0 PFOREIGN 0 PELDERLY 0 0 0 0.003311 -0.028384 0 0 0 0.046891 0 0
    0.004058 0 0 0 0.118695 0 0.046046 0.005247 0.013569 0 -0.005404 0 0.000000 0 0.026361 -0.054605
    -0.038333 7.925371 0.080217 -0.030634 0 0 0.000000 -0.022392 0 -0.092349 -0.054675 -0.024046 0
    -0.048914 6.338694 -0.000001 0 -0.040095 0 0 2.107554 -0.060870 0 PRURAL 0 -0.000001 0 PBLACK PPOV
    0.520877 0 0.083790 0.021590 0 0

    *************************************************
    *                                               *
    *  Test for spatial variability of parameters   *
    *                                               *
    *************************************************
    Tests based on the Monte Carlo significance test
    procedure due to Hope [1968,JRSB,30(3),582-598]

    Parameter     P-value
    ----------    ------------------
    Intercept     0.14000 n/s
    POP2000       0.58000 n/s
    PHISP         0.07000 n/s
    PBLACK        0.04000 *
    PFOREIGN      0.67000 n/s
    PRURAL        0.06000 n/s
    PELDERLY      0.17000 n/s
    PPOV          0.60000 n/s

    *** = significant at .1% level
    **  = significant at 1% level
    *   = significant at 5% level

Interpretation of Maps

The maps are contour plots of the GWR model coefficients. The localized model equations are written in the upper left-hand corners of the maps. The “BetaX” coefficients, where “X” is a number, correspond to the localized GWR coefficients. The contour maps illustrate spatial variation in county-level, population-normed diabetes claims counts. The absolute values of the logarithms of the count percentages were computed locally to adjust for small population percentages.

The five-number summary of the distributions presents the minimum, lower quartile, median, upper quartile, and maximum of the local estimates for each parameter. This information is helpful for getting a “feel” for the degree of spatial non-stationarity by comparing the range of the local parameter estimates with a confidence interval around the global estimate of the equivalent parameter. Approximately 50% of the local parameter values lie between the lower and upper quartiles, and approximately 68% of the values in a normal distribution lie within +/- 1 standard deviation of the mean. The procedure is therefore to compare the range of the local estimates between the lower and upper quartiles with the range spanned by +/- 1 standard deviation of the global estimate. Given that 68% of the values would be expected to lie within the latter interval, compared to 50% within the inter-quartile range, if the inter-quartile range of the local estimates is greater than the range of +/- 1 standard deviation of the global estimate, then one can infer a significant degree of spatial non-stationarity for that parameter.
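The comparison just described can be sketched in a few lines of Python. This is a minimal illustration rather than part of the project's tool set. The function name `compare_spread` is hypothetical, and the inputs are the global estimates, standard errors, and local quartiles reported for PHISP and PBLACK in the run summary above.

```python
def compare_spread(name, global_estimate, global_sd, local_lq, local_uq):
    """Compare the local inter-quartile range with the global +/- 1 SD range."""
    local_iqr = local_uq - local_lq
    global_range = (global_estimate + global_sd) - (global_estimate - global_sd)
    return {
        "parameter": name,
        "local_iqr": local_iqr,
        "global_range": global_range,
        # True suggests spatial non-stationarity under the rule of thumb above.
        "iqr_exceeds_global_range": local_iqr > global_range,
    }


# Values copied from the global regression table and the local quartiles of this run.
rows = [
    ("PHISP", -0.008354039480, 0.004410888479, -0.013787, -0.006948),
    ("PBLACK", 0.036178413000, 0.011251201157, 0.016086, 0.041738),
]

results = [compare_spread(*row) for row in rows]
for result in results:
    print(result)
```

The rule is only a heuristic screen. It does not replace the Monte Carlo significance test reported in the run summary.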
Applying this comparison to the present run, the calculations produced the following:

    Parameter   Global Standard   Global       Global       Global       Local        Local
                Deviation         Mean         -1SD         +1SD         LQ           UQ
    PCTHISP     0.00441           -0.008354    -0.012764    -0.003944    -0.013787    -0.006948
    PCTBLACK    0.0113            0.03619      0.02489      0.04749      0.016086     0.041738

On the basis of this evidence, and with roughly 25% of the variance explained by the model, there is almost no local variation in the PCTHISP parameter: its local inter-quartile range (about 0.0068) is narrower than the global +/- 1 standard deviation range (about 0.0088). There may be some slight local variation in the PCTBLACK parameter, whose local inter-quartile range (about 0.0257) is somewhat wider than the global range (about 0.0226). Nevertheless, for this model, there is very little spatial variability, resulting in a negative conclusion for this project’s hypothesis that spatial heterogeneities exist.

The Monte Carlo significance test is a method used to examine the significance of spatial variability. In the Monte Carlo significance test generated by this GWR run, there is only a small indication of significant spatial variation, and only in the parameter estimates for PCTBLACK. For all of the other parameters, there is a reasonably high probability that the variation occurred by chance. The p-values returned by this significance test are the basis of a criterion for constructing GWR maps: the maps are taken to reveal reliable information about spatial heterogeneities, i.e., spatial non-stationarity, when the Monte Carlo significance test returns a p-value of less than five percent.

Suppose, for instance, that the inter-quartile range for a parameter contained the +/- 1 standard deviation interval of the global regression estimate. The implication would be that spatial heterogeneities are likely for that parameter. When this happens, the focus of the research switches to examining the cause of the geographic variation. Possible causes to investigate include environmental issues, fraud, or economic dislocation.
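As a consistency check on the run summary, the reported AIC values can be recomputed from the approximation given in the AIC discussion. The sketch below assumes that $\hat{\sigma}^{2} = \mathrm{RSS}/n$ and that the “Effective number of parameters” printed by the software is tr(S). The function name `aicc` is hypothetical.

```python
import math


def aicc(rss, n, trace_s):
    """AIC approximation: 2n ln(sigma_hat) + n ln(2 pi) + n (n + tr S) / (n - 2 - tr S)."""
    sigma_hat = math.sqrt(rss / n)  # assumes sigma_hat^2 = RSS / n
    return (
        2 * n * math.log(sigma_hat)
        + n * math.log(2 * math.pi)
        + n * (n + trace_s) / (n - 2 - trace_s)
    )


# Inputs copied from the diagnostic blocks of the diabetes claims run (n = 254).
global_aic = aicc(rss=272.264048, n=254, trace_s=8.0)
gwr_aic = aicc(rss=246.610900, n=254, trace_s=16.269298)

print(round(global_aic, 6), round(gwr_aic, 6))
print("GWR preferred" if gwr_aic < global_aic else "Global model preferred")
```

Under these assumptions the two results should agree closely with the values of 757.195755 and 750.537428 printed in the run summary. The GWR model has the smaller AIC, although the difference between the two models is modest.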
VII. Conclusion

Assessment

As stated in the introduction of this paper, any rigorous treatment involving GWR analysis should include an investigation of multicollinearity at the various regression points as well as an investigation of correlation among the local regression coefficients. Analytic tools were developed in this project to reveal potential dependencies among the local coefficients. SAS programs were authored to analyze the correlation between pairs of local regression coefficients at one location, as well as the correlation between two overall sets of local coefficient estimates associated with two exogenous variables. A key contribution of this project was the development of ArcObjects tools that support ad hoc investigation of any selected battery of independent variables.

According to Tiefelsdorf and Wheeler [1], “weak dependencies [among GWR coefficients] impede substantive interpretation of local GWR estimates, whereas strong dependencies induce artifacts that invalidate meaningful [GWR] interpretation…” Exploratory tools used to expose multicollinearity among the exogenous variables include bivariate correlation coefficients and bivariate scatter plots of pairs of exogenous variables.

Figure Model Equation: Diabetes Count = PARM_1 + PARM_2(lat,long)*Elderly + PARM_3(lat,long)*Rural + PARM_4(lat,long)*Foreign Born + PARM_5(lat,long)*Black + PARM_6(lat,long)*Hispanic.

Explanation of GWR Parameters: PARM_1: Intercept, PARM_2: Percent of county population who are elderly, PARM_3: Percent of county population who live in rural areas, PARM_4: Percent of county population who are foreign born, PARM_5: Percent of county population who are black, and PARM_6: Percent of county population who are Hispanic.
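The correlation between two overall sets of local coefficient estimates, as described in the assessment above, can be illustrated with a short Python sketch. The project itself used SAS for this step, so the code below is a stand-in under stated assumptions. The function name `pearson` is hypothetical, and the coefficient values are invented for illustration and are not results from this study.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long sets of local coefficients."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical local estimates at five regression points (illustrative only).
beta_elderly = [-0.021, -0.015, -0.009, -0.004, 0.002]
beta_rural = [-0.015, -0.017, -0.018, -0.020, -0.022]

r = pearson(beta_elderly, beta_rural)
print(round(r, 3))
```

A correlation close to +1 or -1 between two coefficient surfaces would be the kind of strong dependency that Tiefelsdorf and Wheeler warn can invalidate interpretation of the local estimates.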
In this project, SAS graphics routines were interfaced to ArcObjects classes to produce bivariate scatter plots and histograms of the distributions of coefficient correlations, the latter of which are readily available for choropleth maps.

Note: “COL1” corresponds to model eigenvalues.
Note: “Beta” is the same as “PARM”.

According to Tiefelsdorf and Wheeler, a signal that multicollinearity is present in GWR processing is a large change, or even a reversal in sign, in one regression coefficient after another exogenous variable is added to the model or after specific observations are excluded from the analysis. At this juncture, automation is essential because rebuilding GWR models by hand is tedious and prone to error.

A further technique used in the assessment of spatial structural effects in GWR regression coefficients involved developing SAS tools for principal component analysis (PCA). PCA was used to evaluate the interdependencies among the sets of local regression coefficients. Using SAS Proc IML, scree plots were constructed which exposed breaks in component eigenvector-observation plots. A further contribution of this project was the development of ArcObjects routines to generate scree plots for any ad hoc selection of a battery of independent variables. In furtherance of this analysis, SAS Proc Factor was invoked from an ArcObjects class to generate reports explaining the variances in GWR model constructions and to show a corresponding reduction in model dimensionality. An interesting result of PCA was that, for some models, the results contradicted Fotheringham’s GWR3.0 Monte Carlo simulation for variance explanation.

Contributions

A number of contributions have arisen from this project. They are enumerated as follows:

1. At the conclusion of Fotheringham’s benchmark study of GWR analysis (Chapter 9), he proposed a future initiative to integrate GWR processing with a geospatial processing engine like ArcEditor (Chapter 10). This project implements such an interface, which is included in the appendix.

2. The capabilities of ESRI tools as a research engine were greatly expanded with the development of ArcObjects classes that invoke SAS procedures. A clear example developed by this project was the expansion of database query functionality for ArcObjects using SAS bridging software. Complex data requests can be assembled with ArcObjects and passed to SAS Proc SQL, and the resulting dataset is joined to a shapefile. This application requires the user only to be knowledgeable of the database schema and SQL formulation. No other intervention is required.

3. The capabilities of ArcEdit spatial statistics were broadened substantially. Routines not available with ESRI processing, such as PCA and GWR computations, are now readily available.

4. Aside from the technical advances proposed by this project, some value was obtained from GWR contour plots and choropleth maps. The maps can provide a means for systematic policy planning for diabetes remediation. Spatial heterogeneities of diabetes within the state are clearly shown. The presence of geographic influences as a contributing factor to diabetes claims, as well as costs, was demonstrated.

5. Implicit in the maps is the depiction of a non-uniform cost structure for diabetes claims throughout the state. The treatment of diabetes appears to be more costly in certain parts of the state, and this runs counter to policy, which essentially requires costs to be uniform. An investigation is warranted as to why this is happening. Is the cause fraud or local economic conditions?

Future Research

A number of considerations for future enhancements of this study come to mind. Many of them have to do with the tool set developed for this study.
Yet other extensions should treat GWR modeling itself. Items for future consideration for the tool include:

1. Automate the production of contour maps.
2. Port the ArcObjects application to a compiled platform such as .NET.
3. Find a less expensive statistical package than SAS, and develop the IML and SQL routines for a stand-alone package.

Insofar as GWR processing goes, the computations should incorporate techniques for handling strongly correlated regression coefficients. Further investigation of GWR and spatial autocorrelation is also warranted. Analysis should extend the GWR framework to provide for the following:

1. local measures of spatial dependency,
2. spatial regression modeling, and
3. the combination of the previous two.

VIII. References

Fotheringham, Stewart, et al. (2002), Geographically Weighted Regression: The Analysis of Spatially Varying Relationships (John Wiley & Sons, Ltd.). Chapters 2, 4, 9, and 10.
Hatcher, Larry (1994), A Step-by-Step Approach to Using the SAS System for Factor Analysis and Structural Equation Modeling (SAS Institute). Chapter 1.
Tiefelsdorf and Wheeler (2005), Multicollinearity and Correlation among Local Regression Coefficients in Geographically Weighted Regression, Journal of Geographical Systems, 7: 161-187.
Griffith, D. (2003), Spatial Autocorrelation and Spatial Filtering (Springer).

IX. Appendix 1

Part A: Spatial Variation Analysis for Medicaid Costs.

Part B: GWR Run Summary

    **********************************************************
    *             GLOBAL REGRESSION PARAMETERS               *
    **********************************************************
    Diagnostic information...
    Residual sum of squares.........   81884009272.125061
    Effective number of parameters..   5.000000
    Sigma...........................   18134.261575
    Akaike Information Criterion....   5709.334621
    Coefficient of Determination....   0.561128
    Adjusted r-square...............   0.552280

    Parameter         Estimate               Std Err                T
    ---------    -------------------    -----------------    ---------------
    Intercept    -5375.459519020552     7856.063345608583    -0.684243381023
    POP2000          0.059695633705        0.004245905363    14.059576988220
    PBLACK         -69.931190419192      171.099291662407    -0.408717006445
    PFOREIGN      1028.007265356411      220.286679398173     4.666679382324
    PELDERLY        40.459773052502      210.042542516290     0.192626565695

    **********************************************************
    *                    GWR ESTIMATION                      *
    **********************************************************
    Fitting Geographically Weighted Regression Model...
    Number of observations............ 254
    Number of independent variables... 5 (Intercept is variable 1)
    Number of nearest neighbours...... 151
    Number of locations to fit model.. 254
    Diagnostic information...
    Residual sum of squares.........   29965965147.482555
    Effective number of parameters..   18.008976
    Sigma...........................   11268.507362
    Akaike Information Criterion....   5482.932136
    Coefficient of Determination....   0.839392
    Adjusted r-square...............   0.827084

    **********************************************************
    *             PARAMETER 5-NUMBER SUMMARIES               *
    **********************************************************
    Label          Minimum     Lwr Quartile         Median     Upr Quartile        Maximum
    --------  -------------   -------------  -------------   -------------  -------------
    Intrcept  -21849.003946    -6709.180633   -1219.567984     2852.985820   12662.905580
    POP2000        0.040852        0.046887       0.056349        0.073981       0.297833
    PBLACK      -871.869490     -182.249711     127.609526      211.578930     327.130387
    PFOREIGN   -1203.012415     -191.063457      22.814268      409.529703    1479.955342
    PELDERLY    -439.820900      -64.597318      48.271116      165.448307     511.439933

    <------------------ LOWER -----------------><------------------ UPPER ----------------->
    Label Far Out Outer Fence Outside Inner Fence Inner Fence Outside Outer Fence Far Out
    Intrcept POP2000 PBLACK 0 -35395.679993 0 -0.034397 0 -1363.735632 1 -21052.430313 17196.235500 0
    31539.485180 0 0.155265 0.006245 0.114623 5 -772.992671 6 802.321890 0 17 0 1393.064851 0
    PFOREIGN 0 -1992.842937 3 -1091.953197 1310.419443 11 2211.309183
    PELDERLY 0 -754.734194 3 -409.665756 1 510.516745 855.585183 0 0

    *************************************************
    *                                               *
    *  Test for spatial variability of parameters   *
    *                                               *
    *************************************************
    Tests based on the Monte Carlo significance test
    procedure due to Hope [1968,JRSB,30(3),582-598]

    Parameter     P-value
    ----------    ------------------
    Intercept     0.85000 n/s
    POP2000       0.17000 n/s
    PBLACK        0.00000 ***
    PFOREIGN      0.61000 n/s
    PELDERLY      0.62000 n/s

    *** = significant at .1% level
    **  = significant at 1% level
    *   = significant at 5% level

Part C: GWR Confidence Intervals

    Parameter   Global Standard   Global    Global    Global    Local      Local
                Deviation         Mean      -1SD      +1SD      LQ         UQ
    PCTBLACK    171               -70       -240      101       -182.25    211.6

Appendix 2: Guide to Using Geographic Analysis Machine.

Appendix 3: Project Source Code
