• Statistical options tend to be limited in most GIS applications.
• This is likely to be redressed in the future.
• We will look at spatial statistics in general terms, and conclude with a review of the software available.
• Spatial statistics differ from ‘ordinary’ statistics by the inclusion of locational properties.
• This makes spatial statistics more complex.
• The book by Bailey and Gatrell (1995) provides an accessible introduction. They identify four categories:
– Point pattern data;
– Spatially continuous data;
– Areal data; and
– Interaction data.
• Obvious correspondence with conceptual models.
• Attribute data can be classified by measurement scale:
– Nominal: e.g. 1=females, 2=males.
– Ordinal: e.g. 1=good, 2=medium, 3=poor.
– Interval (+ ratio): e.g. degrees Centigrade, percentages.
• Bailey and Gatrell classify techniques by purpose:
– Visualisation
– Exploration
– Modelling (this is involved in all statistical inference and hypothesis testing)
• Statistical models deal with phenomena that are stochastic (i.e. are subject to uncertainty).
• A random variable Y has values that are subject to uncertainty (but may not necessarily be random).
• The distribution of possible values is referred to as the probability distribution.
• Represented by a function $f_Y(y)$.
• Random variables may be discrete or continuous.
• The probability that Y lies between a and b is given by:
– $P(a \le Y \le b) = \sum_{y=a}^{b} f_Y(y)$ if Y is discrete;
– $P(a \le Y \le b) = \int_a^b f_Y(y)\,dy$ if Y is continuous (probability density).
• The cumulative probability (or distribution function) $F_Y$ is given by:
– $F_Y(y) = \sum_{u \le y} f_Y(u)$ if Y is discrete;
– $F_Y(y) = \int_{-\infty}^{y} f_Y(u)\,du$ if Y is continuous.
• The expected value of Y is its mean E(Y):
– $E(Y) = \sum_y y\, f_Y(y)$ if Y is discrete; or
– $E(Y) = \int y\, f_Y(y)\,dy$ if Y is continuous.
• The expected value of a function of Y, say g(Y), is:
– $E(g(Y)) = \sum_y g(y)\, f_Y(y)$ if Y is discrete; or
– $E(g(Y)) = \int g(y)\, f_Y(y)\,dy$ if Y is continuous.
• Variance is: $\mathrm{VAR}(Y) = E([Y - E(Y)]^2)$
• The square root of this is the standard deviation ($\sigma_Y$).
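As a small illustration of the definitions above, the following Python sketch computes the mean, variance and standard deviation of a discrete random variable directly from its probability function. The fair die used as the example and the helper names `expected_value` and `variance` are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction
import math

def expected_value(pmf, g=lambda y: y):
    """E(g(Y)) = sum over y of g(y) * f_Y(y) for a discrete random variable."""
    return sum(g(y) * p for y, p in pmf.items())

def variance(pmf):
    """VAR(Y) = E([Y - E(Y)]^2)."""
    mean = expected_value(pmf)
    return expected_value(pmf, lambda y: (y - mean) ** 2)

# Hypothetical example: a fair six-sided die, f_Y(y) = 1/6 for y = 1..6.
die = {y: Fraction(1, 6) for y in range(1, 7)}

mean_die = expected_value(die)   # 7/2
var_die = variance(die)          # 35/12
sd_die = math.sqrt(var_die)      # about 1.708

# Probability that Y lies between 2 and 4 (inclusive) for this discrete Y.
prob_2_to_4 = sum(p for y, p in die.items() if 2 <= y <= 4)   # 1/2
```

Exact fractions are used here only so that the results can be checked by hand; ordinary floating-point probabilities would work equally well.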
• Can generalise to situations where there is more than one random variable.
• Joint probability distribution (or density): $f_{XY}(x,y)$
• Covariance: $\mathrm{COV}(X,Y) = E((X - E(X))(Y - E(Y)))$
• Correlation: $\rho_{X,Y} = \mathrm{COV}(X,Y) / (\sigma_X \sigma_Y)$
• Independence: Neither variable affects the other. The joint probability is the product of the individual probabilities: $f_{XY}(x,y) = f_X(x)\, f_Y(y)$
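The following Python sketch shows how covariance, correlation and the independence condition could be checked for a pair of discrete random variables given their joint probability table. The two 2x2 tables and all function names are made up for illustration.

```python
import math

def marginals(joint):
    """Marginal distributions f_X and f_Y from a joint table joint[i][j]."""
    fx = [sum(row) for row in joint]
    fy = [sum(col) for col in zip(*joint)]
    return fx, fy

def covariance_and_correlation(x_vals, y_vals, joint):
    """COV(X,Y) and the correlation for a discrete joint distribution.

    joint[i][j] is the probability that X = x_vals[i] and Y = y_vals[j].
    """
    fx, fy = marginals(joint)
    ex = sum(x * p for x, p in zip(x_vals, fx))
    ey = sum(y * p for y, p in zip(y_vals, fy))
    cov = sum((x - ex) * (y - ey) * joint[i][j]
              for i, x in enumerate(x_vals)
              for j, y in enumerate(y_vals))
    sd_x = math.sqrt(sum((x - ex) ** 2 * p for x, p in zip(x_vals, fx)))
    sd_y = math.sqrt(sum((y - ey) ** 2 * p for y, p in zip(y_vals, fy)))
    return cov, cov / (sd_x * sd_y)

def is_independent(joint, tol=1e-12):
    """True if every joint probability equals the product of its marginals."""
    fx, fy = marginals(joint)
    return all(abs(joint[i][j] - fx[i] * fy[j]) <= tol
               for i in range(len(fx)) for j in range(len(fy)))

# Hypothetical joint distributions for two binary (0/1) variables.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

cov_dep, corr_dep = covariance_and_correlation([0, 1], [0, 1], dependent)      # about 0.15, 0.6
cov_ind, corr_ind = covariance_and_correlation([0, 1], [0, 1], independent)    # 0.0, 0.0
```

Note that zero covariance does not in general imply independence; the second table happens to satisfy both.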
• A statistical model specifies the probability distribution for the phenomenon being modelled.
• If modelling ozone levels in a region R, we would have a random variable Y(s), and hence a probability distribution, for each location s (where s is a 2×1 vector of (x, y) coordinates). Individual points can be referred to as $s_1$, $s_2$, etc.
• The complete set of random variables may be referred to as a spatial stochastic process.
• The probability distribution for near points will probably be more similar than for distant points, so our random variables will probably not be independent.
• To specify a model we need to specify its probability distribution. For the ozone model we would need to specify the joint distribution of every possible subset of random variables.
• For a fair die: $f_Y(y) = 1/6$ for $y = 1, \dots, 6$.
• For more complex models (e.g. ozone) we can use observed data: $(y_1, y_2, \dots)$
• These data are a realisation, i.e. one outcome from the joint probability distribution of $\{Y_1, Y_2, \dots\}$.
• One set of data does not get us very far. Even with more data observations we must make reasonable assumptions, based either on theory or prior observations.
• Assumptions may be expressed in general terms (e.g. a Normal distribution, a regression model) with unspecified parameters.
• The model can be fitted using observed data to estimate the parameters.
• After evaluating the model we may decide to change its general form.
• To illustrate, to model our ozone data we might make the following assumptions:
– The random variables $\{Y(s), s \in R\}$ are independent;
– They have the same distribution, but different means;
– Their means are a simple linear function of location, say $E(Y(s)) = \beta_0 + \beta_1 s_1 + \beta_2 s_2$, where $s_1$ and $s_2$ here denote the two coordinates of s;
– Each Y(s) has a normal distribution about this mean with the same variance $\sigma^2$.
• These assumptions would enable us to estimate the parameters from the available data.
• The most frequently used method is maximum likelihood.
• We can write down the general form of the joint probability distribution $f(y_1, y_2, \dots, y_n; \theta)$, where $\theta$ is a vector of parameters, e.g. $(\beta_0, \beta_1, \beta_2, \sigma^2)$ in our regression model.
• Given that we have actual values for $y_1, \dots, y_n$, this joint probability distribution is the probability of getting these actual values. This is referred to as the likelihood and would usually be denoted $L(y_1, y_2, \dots, y_n; \theta)$.
• Our objective is to identify the parameter values $\theta$ that maximise L. In practice we usually maximise the logarithm of L (the log likelihood), denoted $l(y_1, y_2, \dots, y_n; \theta)$.
• This is the basic approach, but the actual estimation may be complicated.
• Under the assumptions of independence, normal distributions and equal variance, maximum likelihood estimation of the parameters of our multiple linear regression reduces to the method of ordinary least squares.
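To make the link between maximum likelihood and least squares concrete, the Python sketch below fits the linear trend $E(Y(s)) = \beta_0 + \beta_1 s_1 + \beta_2 s_2$ by solving the normal equations and then evaluates the normal log likelihood at the fitted values. The nine observations are invented, the function names are hypothetical, and only the standard library is used to keep the example self-contained; in practice a numerical library would normally do this.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_trend_surface(coords, y):
    """Least-squares fit of E(Y(s)) = b0 + b1*s1 + b2*s2.

    Under independence, normality and equal variance these are also the
    maximum likelihood estimates of b0, b1 and b2.
    """
    X = [[1.0, s1, s2] for s1, s2 in coords]          # design matrix
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    beta = solve(xtx, xty)                            # normal equations
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2_ml = sum(e * e for e in resid) / len(y)    # ML estimate divides by n
    return beta, sigma2_ml

def log_likelihood(coords, y, beta, sigma2):
    """Normal log likelihood for parameters theta = (b0, b1, b2, sigma2)."""
    n = len(y)
    rss = sum((yi - (beta[0] + beta[1] * s1 + beta[2] * s2)) ** 2
              for (s1, s2), yi in zip(coords, y))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

# Invented observations on a 3x3 grid: 10 + 2*s1 - s2 plus small errors.
coords = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 2), (1, 2), (2, 2)]
noise = [0.1, -0.2, 0.1, 0.0, 0.2, -0.1, -0.1, 0.0, 0.0]
y = [10 + 2 * s1 - s2 + e for (s1, s2), e in zip(coords, noise)]

beta_hat, sigma2_hat = fit_trend_surface(coords, y)
ll_hat = log_likelihood(coords, y, beta_hat, sigma2_hat)
```

Any other choice of coefficients gives a larger residual sum of squares and therefore a lower log likelihood for the same variance, which is why the two methods agree here.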
• If we relax the independence and equal variance assumptions, we can still use generalised least squares.
• Standard errors provide a measure of the reliability of each parameter estimate.
• Likelihood ratios can be used to compare alternative models.
• Hypothesis testing entails comparing the fit of two models, one of which incorporates assumptions which reflect the hypothesis, the other incorporating a less specific set of assumptions.
• All modelling inevitably involves some assumptions about the phenomenon under study; hence hypothesis testing will always involve comparison of the fit of a hypothesised model with that of an alternative which also incorporates assumptions, albeit of a more general nature.
• Spatial data often exhibit spatial correlation (or autocorrelation). Assumptions of independence may therefore be unrealistic.
• Can make a distinction between:
– First order effects: variation in the mean due to global trend;
– Second order effects: caused by spatial correlation.
• Can illustrate using analogy of iron filings and magnets.
• Real-world patterns are often an outcome of a mix of first and second order effects.
• To allow for second order effects, spatial models may need to assume a covariance structure.
• The second order effects may be modelled as a stationary spatial process – i.e.
– Its statistical properties (mean, variance) are independent of absolute location;
– Covariance depends only on relative location.
• A process is said to be isotropic if it is stationary, and covariance depends only on distance and not direction.
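A minimal Python sketch of what a stationary, isotropic covariance structure looks like in practice is given below. The exponential form $C(h) = \text{sill} \cdot e^{-h/\text{range}}$, the parameter names and the example coordinates are all assumptions chosen for illustration; the notes do not prescribe any particular covariance function.

```python
import math

def exponential_covariance(s_i, s_j, sill=1.0, range_param=1.0):
    """Covariance between locations s_i and s_j under an isotropic model.

    Depends only on the distance h between the two points:
        C(h) = sill * exp(-h / range_param)
    The exponential form and both parameters are illustrative assumptions.
    """
    h = math.dist(s_i, s_j)
    return sill * math.exp(-h / range_param)

def covariance_matrix(points, **kwargs):
    """Covariance matrix for a set of locations under the model above."""
    return [[exponential_covariance(p, q, **kwargs) for q in points] for p in points]

# Same separation vector at a different absolute location (stationarity).
c_origin = exponential_covariance((0, 0), (3, 4))
c_shifted = exponential_covariance((10, 10), (13, 14))

# Same distance in a different direction (isotropy).
c_rotated = exponential_covariance((0, 0), (5, 0))

cov = covariance_matrix([(0, 0), (1, 0), (0, 2)], sill=2.0, range_param=3.0)
```

All three covariances above are equal because each pair of points is five units apart; an anisotropic model would instead let the value depend on direction as well as distance.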
• If the mean, variance or covariance ‘drifts’ over the study area, then the process exhibits non-stationarity or heterogeneity.
• Heterogeneity in the mean, combined with stationarity in second order effects, is a useful spatial modelling assumption.
• The modelling of a spatial process often tends to proceed by first identifying any heterogeneous 'trend' in mean value and then modelling the 'residuals', or deviations from this 'trend', as a stationary process.
• Covariates are often incorporated in a multiple regression model taking the general form: $y_i = \beta_0 + \sum_k \beta_k x_{ik} + \varepsilon_i$
• The model assumes the coefficients are homogeneous or stationary.
• Fotheringham et al. proposed an alternative model: $y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$, where $(u_i, v_i)$ are the coordinates of observation i.
• To allow the model to be fitted, it is assumed the parameters are non-stationary but are a function of location.
• Parameters can be mapped.
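The idea of location-specific coefficients can be sketched in Python as a kernel-weighted regression fitted separately at each location. This is a simplified illustration with a single covariate; the Gaussian kernel, the fixed bandwidth, the function name `gwr_local_fit` and the transect data are all assumptions, not details taken from Fotheringham et al.

```python
import math

def gwr_local_fit(target, coords, x, y, bandwidth):
    """Local intercept and slope at `target` from a kernel-weighted regression.

    Observations are weighted by a Gaussian kernel of their distance to the
    target location, so nearby observations influence the fit most.
    """
    w = [math.exp(-0.5 * (math.dist(target, c) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Invented data along a west-east transect: y = 1*x in the west, y = 3*x in the east.
coords = [(float(u), 0.0) for u in range(10)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi * (1.0 if u < 5 else 3.0) for (u, _), xi in zip(coords, x)]

# One (intercept, slope) pair per location; these are the values one would map.
local = [gwr_local_fit(c, coords, x, y, bandwidth=1.5) for c in coords]
west_slope = local[0][1]    # close to 1
east_slope = local[-1][1]   # close to 3
```

A single global regression on these data would return one compromise slope, whereas the local fits recover the change in the relationship across the study area. The choice of bandwidth controls how smooth the mapped coefficients are.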
• Bailey and Gatrell discuss various techniques, organised by data type.
• Point pattern techniques include:
– Quadrat analysis
– Kernel estimation
– Nearest neighbour analysis
– K-functions
• Normally used to test the null hypothesis of complete spatial randomness (i.e. a homogeneous Poisson process), but can also examine heterogeneous Poisson processes.
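As an illustration of how quadrat analysis relates to complete spatial randomness, the Python sketch below counts points in a grid of quadrats and computes the variance-to-mean ratio of the counts. The unit-square study area, the two artificial point patterns and the function names are assumptions made for the example.

```python
def quadrat_counts(points, nx, ny, xmin=0.0, xmax=1.0, ymin=0.0, ymax=1.0):
    """Count the points falling in each cell of an nx-by-ny grid of quadrats."""
    counts = [0] * (nx * ny)
    for x, y in points:
        col = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        row = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[row * nx + col] += 1
    return counts

def index_of_dispersion(counts):
    """Variance-to-mean ratio of quadrat counts.

    Under complete spatial randomness the counts are approximately Poisson,
    so the ratio should be close to 1; values well above 1 suggest
    clustering and values well below 1 suggest regularity.
    """
    m = len(counts)
    mean = sum(counts) / m
    var = sum((c - mean) ** 2 for c in counts) / (m - 1)
    return var / mean

# Artificial patterns of 16 points on the unit square.
regular = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]
clustered = [(0.1 + 0.01 * i, 0.1 + 0.01 * j) for i in range(4) for j in range(4)]

vmr_regular = index_of_dispersion(quadrat_counts(regular, 2, 2))      # 0.0
vmr_clustered = index_of_dispersion(quadrat_counts(clustered, 2, 2))  # 16.0
```

For a formal test, the ratio multiplied by the number of quadrats minus one is commonly compared with a chi-squared distribution on that many degrees of freedom.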
• Techniques used to explore field data (sometimes referred to as geostatistics) include:
– Spatial moving averages
– Trend surface analysis
– Delaunay triangulation / Thiessen polygons / TINs
– Kernel estimation (for the values at sample points)
– Variograms / covariograms / kriging
– Principal components analysis / factor analysis
– Procrustes analysis
– Cluster analysis
– Canonical correlation
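Of the techniques listed above, the variogram is perhaps the easiest to sketch in code. The Python example below computes an empirical semivariogram by averaging half the squared differences between pairs of sample values, grouped by separation distance. The binning scheme, the function name and the five-point transect are assumptions for illustration only.

```python
import math
from itertools import combinations

def empirical_semivariogram(coords, values, bin_width, n_bins):
    """Average of 0.5 * (z_i - z_j)^2 over pairs of points, grouped by distance.

    Bin k collects pairs whose separation lies in [k*bin_width, (k+1)*bin_width).
    Bins with no pairs are returned as None.
    """
    sums = [0.0] * n_bins
    pairs = [0] * n_bins
    for (i, p), (j, q) in combinations(enumerate(coords), 2):
        b = int(math.dist(p, q) / bin_width)
        if b < n_bins:
            sums[b] += 0.5 * (values[i] - values[j]) ** 2
            pairs[b] += 1
    return [s / n if n else None for s, n in zip(sums, pairs)]

# Invented samples along a transect, with values rising steadily eastwards.
coords = [(float(i), 0.0) for i in range(5)]
values = [0, 1, 2, 3, 4]

gamma = empirical_semivariogram(coords, values, bin_width=1.0, n_bins=5)
# gamma == [None, 0.5, 2.0, 4.5, 8.0]
```

Semivariance that keeps increasing with distance, as here, is typical of data dominated by a trend; for a stationary process it would be expected to level off.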
• Techniques for analysing areal data (i.e. polygon attributes) include:
– Spatial moving averages
– Kernel estimation
– Spatial autocorrelation (Moran’s I, Geary’s c)
– Spatial correlation and regression
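To show how a spatial autocorrelation measure for areal data is computed, the Python sketch below implements Moran's I in its usual form, $I = \frac{n}{S_0} \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$, where $S_0$ is the sum of all weights. The four-zone example and its binary contiguity weights are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I for areal values and a spatial weights matrix weights[i][j]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)          # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

# Hypothetical example: four zones in a row, each adjacent to its neighbours.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

i_trend = morans_i([1, 2, 3, 4], w)         # about 0.333: similar values adjoin
i_alternating = morans_i([1, 4, 1, 4], w)   # -1.0: dissimilar values adjoin
```

Positive values indicate that neighbouring zones tend to have similar values, negative values that they tend to differ. The result depends on how the weights matrix is defined, which is a modelling choice.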
• Generalised linear models provide a family of techniques for dealing with special types of data: e.g. counts (Poisson regression), proportions (logistic regression).
• Bayesian techniques are often used to model rates based on small numbers.
• Techniques for modelling spatial interactions are mostly based on some variant of the gravity model.
• This postulates that the amount of interaction between two places is a function of their sizes (measured using an appropriate metric) and is inversely related to the distance between them.
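One common way of writing this down, assumed here purely for illustration, is $T_{ij} = k\, P_i P_j / d_{ij}^{\beta}$. The Python sketch below evaluates that form; the parameter names `k` and `beta`, their default values and the example sizes and distances are hypothetical, and in practice the parameters would be estimated from observed flows.

```python
def gravity_flow(size_i, size_j, distance, k=1.0, beta=2.0):
    """Predicted interaction between two places under a simple gravity model.

    T_ij = k * size_i * size_j / distance**beta
    k and beta are placeholder values, not estimates.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * size_i * size_j / distance ** beta

# Two places of size 1000 and 500, first 10 and then 20 distance units apart.
flow_near = gravity_flow(1000, 500, 10)   # 5000.0
flow_far = gravity_flow(1000, 500, 20)    # 1250.0
```

With `beta=2`, doubling the distance cuts the predicted interaction to a quarter; other variants of the model constrain the totals leaving each origin or arriving at each destination.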
• ArcGIS. Geostatistical Analyst is a step forward.
• Idrisi. The GIS Analysis | Statistics menu has a lot of options.
• S-Plus. The S+SpatialStats add-on provides a lot of options.
• R. R is an open-source implementation of the S language, which S-Plus also implements. There are a number of projects currently developing tools for spatial statistics (e.g. sp, spatstat, DCluster, spgwr).
• BUGS. Software for Bayesian statistics. There is a free version for Windows (WinBUGS), which includes a spatial module called GeoBUGS.