Assignment Two - GISnatural.net

ASSIGNMENT 2
SPATIAL STATISTICS: ANALYSING THREE POINT PATTERNS
COVERSHEET - STUDENTS MUST COMPLETE ALL **** AREAS.
Student Name: Bruce Mitchell
Assignment: 2
Tutor: Dave Unwin
Date of Submission: 2 March 2008
Due Date: 3 March 2008
---------------------------------------------------------------------------------------------------------------
TO BE COMPLETED BY THE TUTORS
Overall Agreed Mark (NB: A = 70+, B=60-69, C = 50-59):
Markers Comments:
First Marker Name: ***********
Second Marker Name: **********
Alpha Mark:
Alpha Mark:
Moderation and comments if needed:
---------------------------------------------------------------------------------------------------------------
TO BE COMPLETED BY THE STUDENT
Submitting and Assessing Assignments
In preparation for completing and submitting this assignment, I have read and understood the Assignment and Project Guide
and the Programme Handbook available in the Reference Materials folder on the WebCT Home Page of each module.
Initial that you’ve read: BJWM
Plagiarism Statement
The presentation of another person’s thoughts or words as though they were your own, must be avoided, with particular care
in coursework, projects and dissertations. Sources should always be acknowledged, and direct quotations from the published
or unpublished work of others, should be placed inside quotation marks and clearly referenced in the proper form.
I testify that, unless otherwise acknowledged, the work submitted is entirely my own.
Student Name: Bruce Mitchell
Any Student Comments

Any comments for the tutor: Professor. There is a contradiction in the instructions here: For each of the three
patterns of events ... carry out ... analysis on your selected data set ...
ASSIGNMENT - COMPLETE YOUR ASSIGNMENT AT THE BOTTOM OF THIS DOCUMENT, RENAME AND SAVE AS EXPLAINED
IN THE ASSIGNMENT GUIDE (e.g. Smith-J-Assign2.doc (Surname-FirstInitial-Assign#.doc, where #
is the assignment number, which is the same as the module number)).
BRIEFING
This assignment gives you an opportunity to consolidate your knowledge of basic spatial statistical analysis using some
deliberately contrasting data. You will carry out analysis on three spatial data sets. These data were used by Diggle, P.J. (1991)
Statistical Analysis of Spatial Point Patterns (London: Academic) and have been analyzed many times. You will carry out some
statistical tests on the data similar to that covered in the topics. The aim of these tests will be to describe all data sets using simple
measures and then to carry out a more sophisticated test on one of the data sets to test for independent random process (IRP).
You will be asked to interpret your findings and write a short report on them.
TASK
In order to complete this assignment you will need to download assign2datafiles2008.zip also available from the Assignment
tool in WebCT (in the attachment section where you also download the text file of this assignment). The zip file contains the
following data sets:
Pine.txt (x, y) values for 65 Japanese Pine tree seedlings in a 5.7x5.7m square
Redwood.txt (x, y) values for 62 Redwood tree seedlings in a 23x23m square
Cells.txt (x, y) values for 42 biological cells from a microscope slide with no scale
In each case all you have are the (x, y) co-ordinate pairs for some enumerated point events. To facilitate comparisons, both X and Y
co-ordinates are for a 'unit square', that is, the co-ordinates on both X and Y have been scaled to the range 0 – 1. Think about this
carefully when computing the real areal densities and distances.
Each file is saved as a tab-delimited text file that can easily be converted for most software systems.
You are required to do the following:
1. Download the data listed above;
2. Produce dot maps of all of the data sets.
Use ArcGIS ™ or another appropriate software to produce the maps. As explained in the Homework to the first Topic
on spatial statistical analysis, Excel™ will do this. In each case, and in a single word or short phrase, how would you
describe the pattern?
Estimate the overall density (or intensity), λ, of events together with the basic centrographic measures of mean
centre and standard distance. In doing this take care to ensure that you understand how the co-ordinate system
used relates to the actual physical dimensions of the study regions. I have no measure for the area occupied by the
biological cells data.
Discuss and interpret the analyses you have carried out (be critical!);
This section should be no more than 500 words
For each of the three patterns of events, using a spreadsheet, statistical analysis package or Crimestat™, carry out a
standard nearest neighbour analysis on your selected data set to test against the CSR hypothesis. Be sure to state
whether you accept or reject the null hypothesis and why;
Using just one of the data sets, show how the use of alternative methods can shed more light on the patterning. In
this there is no set sequence, just experiment with the tools that you have available which can include, for example,
Crimestat III.
Present the results of your analysis as a discussion and interpretation of your results. For example, how appropriate
were the tests used to analyse the data? To what extent do the numerical results accord with the evidence of your
eyes? Are there any particular strengths or weaknesses that you can identify in analysis of this type? This section
should also be not more than 500 words.
You should submit your maps, calculations and results from your analysis, and the discussion and interpretation of your results.
ASSESSMENT CRITERIA
Good marks will be obtained by submissions that:
Produce appropriate maps of the distributions;
Carry out the nearest neighbour test and centrographic measures correctly;
Clearly show the method of data analysis, by stating the formulae used and showing all calculations;
Discuss the results from each stage of the analysis (i.e. producing the point map, calculating density, carrying out
nearest neighbour analysis) and interpret the findings;
Show some initiative in the application of alternative tests;
Comment on the nature of the data provided and identify potential limitations with regards the analysis undertaken.
Don’t be afraid to be critical – you may also suggest further tests that could be carried out;
Draw on academic literature and research into the nature of spatial statistics;
Stick within the two 500 word limits.
© Birkbeck College & Steven Musson, 7 February 2005
Revised Dave Unwin, 14th December 2005
Revised Dave Unwin, 2nd January 2007
Revised Dave Unwin, 11th Nov. 2007
START YOUR ASSIGNMENT HERE
Introduction
Spatial analysis is the science of making use of where things are – the incorporation of positions in space and the
relationships between objects in space into the analysis framework. It can cover such different topics as bacterial
infection, migration of reindeer or the distribution of dust in the universe. As such, the datasets involved can be
colossal.
By applying statistical methods to such data you can reduce them to a small array of summary statistics. Statistical
methods can support exploratory data analysis and the generation of hypotheses about the data. They can also make
it possible to examine the role of randomness in producing patterns and to test hypotheses about how such patterns
may have come about. Armed with such information, one can then make predictions about the data’s future
behaviour with statistically definable confidence intervals, or create optimised statistical models. We can also
generate expected values under different hypotheses, and these can then be checked against reality.
A spatial pattern in data may comprise any or all of three basic components: spatial pattern, spatial
dependence and spatial heterogeneity. All may be termed causal relationships across space.
Spatial pattern is the quality of distribution, whether dispersed, random or clustered. Spatial dependence is an
expression of Tobler’s First Law, whereby things are likely to be similar to nearby others. Spatial heterogeneity
covers systematic spatial variation in, say, climate, or, in survey work, the attitudes and effectiveness of individual
interviewers.
Here, we will be looking at spatial pattern, seeking to identify whether three datasets are dispersed, randomly
distributed or clustered. We will examine each dataset in CrimeStat, ArcGIS, Excel and CorelDraw and extract a
number of key statistics.
We will address each of these statistics in turn and see what they have to tell us about these datasets, and about
spatial data in general.
Dot Maps of the Distributions
Figure 1: Dot maps of the data sets with Mean Centre
The Pattern for Redwood would appear to be clustered, but broken, as though a road had been driven through
the stands NW-SE and N-SW.
Pine looks like a classic ‘random’ scatter, with a couple of close pairings, a number of loose groups and some open
spaces.
Cells are pretty evenly disposed across the field.
Overall Density of Events (λ), Mean Centre and Standard Distance
Given the different areas under consideration, we have two options. One is to reduce all three areas to a common
scale to ease comparison of results. The second is to keep each set of data in its own absolute dimensions, and to
compare the results later.
I have produced both relative (for all three datasets) and absolute (for pine and redwood only) figures.
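The two options can be illustrated with a minimal sketch. It assumes only the event counts and plot sizes given in the brief; the function names crude_density and to_absolute are illustrative and are not part of CrimeStat or ArcGIS.

```python
# Event counts and plot side lengths as stated in the assignment brief.
# Cells has no physical scale, so it stays in unit-square terms (side = 1).
DATASETS = {
    "cells": {"n": 42, "side": 1.0},
    "pine": {"n": 65, "side": 5.7},
    "redwood": {"n": 62, "side": 23.0},
}


def crude_density(n_events, side):
    """Areal density (lambda): events per unit area of a square plot."""
    return n_events / (side * side)


def to_absolute(points, side):
    """Rescale unit-square (x, y) pairs to plot units, e.g. metres."""
    return [(x * side, y * side) for x, y in points]


for name, d in DATASETS.items():
    relative = crude_density(d["n"], 1.0)        # events per unit square
    absolute = crude_density(d["n"], d["side"])  # events per square metre
    print(f"{name}: relative={relative:.3f}, absolute={absolute:.4f}")
```

In relative terms the density is simply the event count, since the unit square has area 1; the absolute figure divides that count by the real plot area.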
Application: Crimestat
Settings for Cells, and for Pine (relative) and Redwood (relative)
Primary File - Variables: X Y - Column - X Y
Type of Coordinate System: Projected - Euclidean – Data Units - metres
Reference File – Create Grid – x:0 to 1; y: 0 to 1
Measurement Parameters – Coverage - 1 sq metre
Spatial Description - Spatial Distribution – Mean centre and standard distance
Settings for Pine (absolute)
Primary File - Variables: X Y - Column - X Y
Type of Coordinate System: Projected - Euclidean – Data Units - metres
Reference File – Create Grid – x:0 to 5.7; y: 0 to 5.7
Measurement Parameters – Coverage – 32.49 sq metre
Spatial Description - Spatial Distribution – Mean centre and standard distance
Settings for redwood (absolute)
Primary File - Variables: X Y - Column - X Y
Type of Coordinate System: Projected - Euclidean – Data Units - metres
Reference File – Create Grid – x: 0 to 23; y: 0 to 23
Measurement Parameters – Coverage - 529 sq metres
Spatial Description - Spatial Distribution – Mean centre and standard distance
The following table displays the results for both relative and absolute datasets, including the Areal Density (λ),
Mean Centre and Standard Distance Deviation:
It is to be noted that, as expected, the Mean Centre measure differs between the relative and absolute datasets.
However, the Standard Distance Deviation is also different between the absolute and relative data, where this might
have been expected to remain the same. It is after all a measure of dispersion - the greater the standard distance
deviation, the more dispersed the eventsiii, and this should be the same regardless of which scale is used, as long
as both data and enclosing frame are scaled by the same amount. I suspect an error in my work.
Figure 2: Crude Density, Mean Centre and SDD
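The centrographic measures can be sketched in a few lines. This is a minimal sketch, not CrimeStat's or ArcGIS's own routine: it assumes the standard distance is the root mean squared distance from the mean centre with a divisor of n (packages differ on the divisor), and the sample points are made up for illustration.

```python
import math


def mean_centre(points):
    """Arithmetic mean of the x and of the y coordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root mean squared distance of events from the mean centre.

    Assumption: divisor n; some packages use a different divisor.
    """
    cx, cy = mean_centre(points)
    n = len(points)
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


# Hypothetical unit-square events, then the same events rescaled to a 23 m plot.
relative = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (0.6, 0.6)]
absolute = [(x * 23.0, y * 23.0) for x, y in relative]
print(standard_distance(relative), standard_distance(absolute))
```

Because the standard distance is measured in coordinate units, multiplying every coordinate by the plot side length multiplies it by the same factor. On that reading, relative and absolute values that differ by the side length (5.7 or 23) would be expected rather than a sign of a calculation error.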
Quadrat Censusing
I applied quadrat censusing to determine the distribution of points across the three study areas.
Considerations of Distance
In all cases I used Euclidean distance: given the small areas involved in all three cases, there is no need to bother
with the curvature of the earth. In any case, no locality given, so it would not be possible to determine local
curvature.
For each of the three datasets:
I created a dot-map in ArcGIS in a 10 cm x 10 cm Dataframe
In CorelDraw, I created a 10 cm x 10 cm array of 1cm quadrats.
Imported the dot-map into CorelDraw and constrained it to 10 cm x 10 cm.
Overlaid and centred the dot-map and the array.
Calculated:
Area occupied by each quadrat under each dataset.
Crude areal density.
Values for possible point-counts in each quadrat cell (x).
Number of points in each cell, totalled to obtain the number of cells with 1, 2, 3, 4 points (f) [‘Cell-point-census’].
Product of cell-point-census and x [ f × x ].
x squared [ x² ].
Mean count of points per cell, xbar [ Σ(f × x) / Gridcells ].
Variance, s2 [ (Σ(f × x²) – (Σ(f × x))² / Gridcells) / (Gridcells – 1) ].
VMR [ s2 / xbar ].
T-statistic [ (VMR – 1) / √(2 / (Gridcells – 1)) ].
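The calculation steps listed above can be sketched as follows. This is a minimal sketch, assuming events are given as unit-square (x, y) pairs and a 10 x 10 grid as used here; quadrat_counts and quadrat_stats are illustrative names, not functions from any of the packages mentioned.

```python
import math
from collections import Counter


def quadrat_counts(points, k=10):
    """Count events per cell of a k x k grid laid over the unit square."""
    counts = Counter()
    for x, y in points:
        col = min(int(x * k), k - 1)  # events on the upper edge go in the last cell
        row = min(int(y * k), k - 1)
        counts[(row, col)] += 1
    return [counts.get((r, c), 0) for r in range(k) for c in range(k)]


def quadrat_stats(cell_counts):
    """Mean, variance, VMR and t-statistic from per-cell event counts."""
    m = len(cell_counts)                 # number of grid cells
    freq = Counter(cell_counts)          # f: number of cells holding x events
    sum_fx = sum(f * x for x, f in freq.items())
    sum_fx2 = sum(f * x * x for x, f in freq.items())
    mean = sum_fx / m
    variance = (sum_fx2 - sum_fx ** 2 / m) / (m - 1)
    vmr = variance / mean
    t = (vmr - 1) / math.sqrt(2 / (m - 1))
    return {"mean": mean, "variance": variance, "vmr": vmr, "t": t}
```

Passing the output of quadrat_counts to quadrat_stats gives the figures for one dataset; cells with no events are included in the frequency table, which the variance needs.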
Cells
42 events
Total area = 1 x 1 = 1 unit square
Scale - 1 quadrat cell = 1/100 unit = 0.01
Figure 3: Quadrat Census for CELLS
Pine
65 events
Total area (5.7 m x 5.7 m)= 32.49 sq m
Scale – 1 quadrat cell = 0.57 m x 0.57 m = 0.3249 sq m
Figure 4: Quadrat Census for PINE
Redwood
62 events
Total area (23m x 23m) = 529 sq m
Scale – 1 quadrat cell = 2.3 m x 2.3 m = 5.29 sq m
Figure 5: Quadrat Census for REDWOOD
Discuss and interpret analyses
Overall Density, Mean Centre, Standard Distance Deviation

The Overall Density, λ, is not a very good index of distribution. It is akin to population density, and this itself is
a pretty poor index, as may be deduced from Wikipedia.iii All it really tells you is that high densities occur in
small nations: there are obvious reasons for this – many of these small nations consist simply of one large city
with no hinterland. 22 of the top 30 in the list have a total area of less than half the size of Wales.

The location of the Mean Centre, in all our cases, is not significantly different from dead centre of the square
plot, so this also is shown to be a poor index of distribution.

The Standard Distance also appears to be very weak, as the results for all three (relative) datasets were in the
range 0.39 to 0.43. In fact, both Cells and Redwoods were identical at 0.39.
Quadrat Censusing
Results here were more helpful. The s2 statistic is a measure of the variance of the cell frequencies, and the
variance-to-mean ratio, or VMR, “standardizes the degree of variability in the cell frequencies in relation to mean cell
frequency”.iv
Our three values for s2 are 0.246 (cells); 0.614 (pines) and 1.389 (redwood). These correspond well to dispersed,
random and clustered. This is because where your data are dispersed, a lot of grid cells will have closely similar
values, producing a low variance. In clustered data, on the other hand, a few Gridcells will contain a lot of data
points while others will have few or none. This will produce a high variance. The variance of randomly distributed
data will fall between these extremes.
The VMR is useful in that a VMR value of 1 is what results from a Poisson distribution, because this has an equal
variance and mean. VMRs either side of 1 will correspond to a negative binomial distribution (> 1) or to a binomial
distribution (< 1). v
However, the quadrat method has its own weaknesses. It still tells us nothing about how the individual points
relate to each other. Certainly it does give us more information than the methods tried above, but all we end up
with is a single measure of dispersion, not pattern, across the whole dataset. It is also affected by the choice of
grid size.
397 words
Nearest Neighbour Analysis
For this exercise, I treated the cells data
Figure 6: Nearest Neighbour Results from Crimestat
What Crimestat calls the “Nearest Neighbour Index” is in Arc referred to as “Observed Mean Distance / Expected
Mean Distance”, but it is the same thing, and has identical values across all groups. In the literature, this
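conventionally appears as the ‘R’ statistic:
R = observed mean nearest neighbour distance / expected mean nearest neighbour distance
where the expected mean distance under complete spatial randomness is 1 / (2√λ).vi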
The Z-score is also identical for both ArcGIS and Crimestat.
The Z-score is critical when we consider the NULL HYPOTHESIS. It measures, in standard deviations, how far the
observed pattern departs from a purely random distribution. If you are operating at a 95% confidence level and you
obtain a Z-score between -1.96 and +1.96, you cannot reject the NULL HYPOTHESIS, as there is a significant likelihood
that the observed pattern was generated by random chance. If your Z-score is outside these values, however, you should reject it.vii
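The R statistic and its Z-score can be reproduced outside CrimeStat or ArcGIS with a short routine. This is a minimal sketch under stated assumptions: it uses the usual textbook expected distance 1 / (2√λ) and standard error 0.26136 / √(nλ), applies no edge correction, and may therefore differ slightly from either package's output; nearest_neighbour_test is an illustrative name.

```python
import math


def nearest_neighbour_test(points, area):
    """Nearest neighbour index R and Z-score against CSR (no edge correction)."""
    n = len(points)
    # Brute-force nearest-neighbour distance for each event.
    nn = []
    for i, (xi, yi) in enumerate(points):
        nn.append(min(math.hypot(xi - xj, yi - yj)
                      for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nn) / n
    density = n / area                            # lambda
    expected = 1 / (2 * math.sqrt(density))       # mean distance under CSR
    std_error = 0.26136 / math.sqrt(n * density)
    z = (observed - expected) / std_error
    return {"R": observed / expected, "z": z, "reject_csr": abs(z) > 1.96}
```

For relative data the area is 1; for the absolute Pine and Redwood data it would be 32.49 and 529 sq m. R is a ratio of two distances, so it should come out the same on either scale.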
Analysis – CELLS
ArcGIS’ ‘Average Nearest Neighbour Distance’ dialog makes it very clear that the CELLS dataset is dispersed, with a
Nearest Neighbour Index (R) above 1.67. An R-value of 1 indicates a purely random distribution, whereas 2.149
denotes perfect dispersion. An R-value of 0 means that the distribution is perfectly clustered – i.e. that all the
observations occur in a single location. http://www.geog.ubc.ca/courses/geog516/presentations/sunny1.htm
ArcGIS announces that there is only a 1% chance that the observed pattern could be the result of random chance - in
other words, there exists a 99% degree of confidence in the assertion that it is dispersed. In this case, the NULL
HYPOTHESIS, that the observed pattern in the data is the result of random chance, must be rejected.
Figure 7: Nearest Neighbour Output from ArcGIS - CELLS
Further grounds for rejecting the NULL HYPOTHESIS are supplied by Crimestat. In our case the z-score for CELLS is
8.2812, which, in terms of standard deviations, is practically off the scale. Rowntree (1981), p.139, states that Z-test
scores are only suitable for large samples, and that t-tests are better for small ones. However, he then defines the
threshold for small/large as 30 observations, so our three datasets may, on these grounds, adequately be dealt with
by the Z-statistic. There is, moreover, no obvious way of generating a t-statistic in Crimestat.
Analysis – PINE
In the case of PINE, ArcGIS places the relative dataset squarely in ‘random’. The R statistic of 1.06 and the Z score
of 0.99 are as central as could be desired. The same key results are true also of the absolute dataset.
Figure 8: Nearest Neighbour Output from ArcGIS - PINES
Analysis – REDWOOD
In the case of REDWOOD, ArcGIS places the relative dataset at the far end of ‘clustered’. The R statistic of 0.61
conforms to the range of values you would expect for clustered data, and the large negative Z-score of -5.92 is
well outside the -1.96 required to reject the null hypothesis. The same results hold for the redwood data in
absolute values.
Figure 9: Nearest Neighbour Output from ArcGIS - Redwood
Nearest Neighbour Analysis offers more useful spatial information than either of the above methods, takes
account of distances, and escapes the problems of arbitrary Gridcell sizes. However, it is constrained by the data
boundary – and this may be just as arbitrary. You are likely to encounter edge effects (see below), and these may
be severe.
Alternative methodologies
Kernel Density Functions
I turn now to Kernel Density Functions in the Spatial Analyst Tools section of ArcGIS’ ArcToolbox. We will be looking at
the Redwood (absolute) dataset.
If we accept all default settings here, the output map is barely more informative than the original distribution.
Default settings
Output cell size 0.080684
Search radius 0.672366666666667
General Settings - Extent - default
Equal intervals (9)
Figure 10: Kernel Density Output from ArcGIS - Default Settings
But we can do better. If we replace those defaults with the following values, we obtain a far better map:
Output cell size 0.04
Search radius 4
Square meters
General settings - Extent – as specified below - 0-23, 0-23
Equal intervals (9)
Figure 11: Kernel Density Output from ArcGIS - Custom Settings
We can extend this further by replacing ‘equal intervals’ with 20 quantiles. In this case (see Figure 12) there is a
far higher degree of graduation and a smoother effect, but it may be questioned what benefit we really obtain from this
added definition.
It may indeed be argued that, visually, and given the sparsity of the original dataset, the map with nine equal
intervals is better.
Figure 12: Kernel Density Output from ArcGIS - 20 Quantiles
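The effect of search radius and cell size can be explored outside ArcGIS with a small kernel density routine. This is a minimal sketch, not ArcGIS's implementation: it assumes a quartic kernel, which is my understanding of what the ArcGIS tool is based on, and the function name and argument layout are illustrative.

```python
import math


def quartic_kernel_density(points, radius, extent, cell_size):
    """Return a grid (list of rows) of density values, in events per unit area.

    extent is (xmin, ymin, xmax, ymax); each event spreads a total weight of 1
    over a disc of the given search radius.
    """
    xmin, ymin, xmax, ymax = extent
    ncols = int(round((xmax - xmin) / cell_size))
    nrows = int(round((ymax - ymin) / cell_size))
    norm = 3.0 / (math.pi * radius ** 2)  # makes the kernel integrate to 1
    grid = []
    for r in range(nrows):
        cy = ymin + (r + 0.5) * cell_size      # cell centre, y
        row = []
        for c in range(ncols):
            cx = xmin + (c + 0.5) * cell_size  # cell centre, x
            total = 0.0
            for x, y in points:
                d2 = ((x - cx) ** 2 + (y - cy) ** 2) / radius ** 2
                if d2 < 1.0:                   # event lies within the search radius
                    total += norm * (1.0 - d2) ** 2
            row.append(total)
        grid.append(row)
    return grid
```

With the custom settings above the call would use radius 4 and extent (0, 0, 23, 23); in pure Python a cell size nearer 0.5 than 0.04 keeps the run time short. A larger radius smooths the surface, while the classification scheme (equal intervals or quantiles) only changes how the resulting values are coloured.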
Thiessen Polygons
An alternative approach is to create Thiessen or Voronoi polygons, whose edges are the perpendicular bisectors of the
lines connecting neighbouring points. Comparing the area of Thiessen polygons across the plot will give a
reasonable indication of clustering or dispersion. In a process analogous to that of the quadrat counts, if we
determine the variance and VMR of Thiessen polygon areas, we have another method for quantifying the distribution.
A low variance will once again indicate that there is little variation in the size of the polygons, and that their
centroids are consequently dispersed. Similarly, a high variance will indicate clustering.
This is an opportunity to touch lightly on the problem of edge effects, a serious problem which I have skirted around
up to this point. The diagram below results from a Third-Party VB routine available from ESRIviii. This includes a setting
for “set boundary extent by buffering point area by n%”. I have done this for 0%, 25%, 50% and 100%. It will be clear
that polygons around the edge of a body will either be severely distorted, or truncated. In the case of a real island, this
will not matter, as the edge is the coast, and you can clip the data area to that genuine physical barrier. In the case of
a landlocked county however, the nearest neighbour of a particular data point may in fact lie just across the border.
Figure 13: Thiessen Polygons around Redwood data
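The variance of Thiessen polygon areas can be approximated without the VB routine. This is a minimal sketch of a swapped-in technique, not the method used for Figure 13: instead of constructing the polygons, it lays a fine grid over the square plot, assigns each grid cell to its nearest event, and sums cell areas. This corresponds to clipping the polygons at the plot boundary (the 0% buffer case), and the function names are illustrative.

```python
def thiessen_areas(points, side=1.0, resolution=200):
    """Approximate the area of each event's Thiessen polygon, clipped to the plot."""
    cell = side / resolution
    counts = [0] * len(points)
    for r in range(resolution):
        cy = (r + 0.5) * cell
        for c in range(resolution):
            cx = (c + 0.5) * cell
            # Index of the event nearest to this grid-cell centre.
            nearest = min(
                range(len(points)),
                key=lambda i: (points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2,
            )
            counts[nearest] += 1
    return [k * cell * cell for k in counts]


def variance(values):
    """Sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)
```

The areas always sum to the plot area, so the mean is fixed at area / n and the variance carries the information: evenly spaced events give near-equal areas, while clustered events give many small polygons and a few very large ones. Edge polygons are truncated by the plot boundary here, so the edge effects discussed above apply to this approximation as well.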
Discussion and interpretation of results
In sum, we undertook a range of basic spatial analysis tests on our three sets of data and on the basis of their
combined evidence, we:
rejected the null hypothesis for CELLS, as these data are DISPERSED;
failed to reject it for PINES, as these data are RANDOMLY distributed;
rejected it also for REDWOOD, as these data are CLUSTERED.
Some of the tools were, to be honest, pretty useless – mean centre in particular, which was as good as identical for all
three of the datasets. Standard Distance Deviation also seemed pretty poor.
Quadrat analysis, with its associated s2, VMR and t-statistic, was an improvement, but the z-score of the nearest
neighbour index and the R statistic were more conclusive.
As far as these tests went, the evidence might seem absolutely conclusive – after all, we were right up at and even
beyond the 99% confidence level with our assertions.
However, we did not take several things into account.
The suggestion with Pines and Redwoods was that the locations of the saplings were subject purely to natural
considerations. This might not be so. We were not given an underlying geography, and this might – in the Redwood
case – have provided evidence of roads or railways and associated developments cutting through the stands, features
which would have prevented the growth of Redwood saplings along those ribbons. There might also be zones of
higher or lower fertility, or marsh, ravines or the fringes of lakes within the plot. In any of these cases, the
clustering that we observed might be no more than we would expect.
According to David Strauss’ original article (1975)ix, where the Redwood dataset was first published, and Diggle’s
(1979) continuationx, no such permanent man-made intrusions are present on the site, but the saplings were
observed to cluster around felled mature redwoods. This is a factor which has not been taken into account in our
analysis.
Edge effects are also present and have not been dealt with in this paper. It is perfectly possible that the three square
plots as drawn up ignored significant areas of data just beyond the boundaries. This might result in ‘stragglers’ from
beyond the boundary being isolated within the study zone. This would affect the statistics for the study zone and
diminish the degree of clustering recorded. This, of course, could also work the other way round, too, with small
‘empty’ patches within the plot being parts of huge ‘empty’ areas beyond.
Word count 408
Notes
i http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=mean_center_(spatial_statistics)
ii David Ebdon, Statistics in Geography. Basil Blackwell, 1985. http://books.google.co.uk/books?id=IQ7LnlAPE_gC&pg=PA134&lpg=PA134&dq=%22standard+distance+deviation%22&source=web&ots=RQj24AxwfJ&sig=WpdBeg3IShVl9A8btWQMNgEUUPc&hl=en
iii http://en.wikipedia.org/wiki/List_of_countries_by_population_density
iv http://instruct.uwo.ca/geog/500/Quadrat_by_6.pdf
v http://en.wikipedia.org/wiki/Variance-to-mean_ratio
vi http://www.geog.ubc.ca/courses/geog516/presentations/sunny1.htm
vii http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=What%20is%20a%20Z%20Score
http://upload.wikimedia.org/wikipedia/en/8/89/Normal_distribution_and_scales.svg
viii http://arcscripts.esri.com/details.asp?dbid=11958
ix David J. Strauss, A Model for Clustering. Biometrika, Vol. 62, No. 2 (Aug., 1975), pp. 467-475.
x Peter J. Diggle, On Parameter Estimation and Goodness-of-Fit Testing for Spatial Point Patterns. Biometrics, Vol. 35, No. 1, Perspectives in Biometry (Mar., 1979), pp. 87-101.
Further reading
Michael J de Smith, Michael F Goodchild, Paul A Longley (2007). Geospatial Analysis – A Comprehensive Guide to Principles,
Techniques and Software Tools. (Matador, Leicester). Also, maintained web version: http://www.spatialanalysisonline.com/
Fotheringham, A.S., Charlton, M.E. and Brunsdon, C. (1996). Quantitative Geography: Perspectives on Spatial Data Analysis.
Sage, London.