ASSIGNMENT 2 SPATIAL STATISTICS: ANALYSING THREE POINT PATTERNS

COVERSHEET - STUDENTS MUST COMPLETE ALL **** AREAS.

Student Name: Bruce Mitchell
Assignment: 2
Tutor: Dave Unwin
Date of Submission: 2 March 2008
Due Date: 3 March 2008

---------------------------------------------------------------------------------------------------------------
TO BE COMPLETED BY THE TUTORS

Overall Agreed Mark (NB: A = 70+, B = 60-69, C = 50-59):
Markers' Comments:
First Marker Name: ***********   Alpha Mark:
Second Marker Name: **********   Alpha Mark:
Moderation and comments if needed:

---------------------------------------------------------------------------------------------------------------
TO BE COMPLETED BY THE STUDENT

Submitting and Assessing Assignments
In preparation for completing and submitting this assignment, I have read and understood the Assignment and Project Guide and the Programme Handbook available in the Reference Materials folder on the WebCT Home Page of each module.
Initial that you've read: BJWM

Plagiarism Statement
The presentation of another person's thoughts or words as though they were your own must be avoided, with particular care in coursework, projects and dissertations. Sources should always be acknowledged, and direct quotations from the published or unpublished work of others should be placed inside quotation marks and clearly referenced in the proper form. I testify that, unless otherwise acknowledged, the work submitted is entirely my own.
Student Name: Bruce Mitchell

Any Student Comments
Any comments for the tutor: Professor, there is a contradiction in the instructions here: "For each of the three patterns of events ... carry out ... analysis on your selected data set ..."

ASSIGNMENT - COMPLETE YOUR ASSIGNMENT AT THE BOTTOM OF THIS DOCUMENT, RENAME AND SAVE AS EXPLAINED IN THE ASSIGNMENT GUIDE (e.g. Smith-J-Assign2.doc (Surname-FirstInitial-Assign#.doc, where # is the assignment number, which is the same as the module number)).

BRIEFING
This assignment gives you an opportunity to consolidate your knowledge of basic spatial statistical analysis using some deliberately contrasting data. You will carry out analysis on three spatial data sets. These data were used by Diggle, P.J. (1991) Statistical Analysis of Spatial Point Patterns (London: Academic) and have been analyzed many times. You will carry out some statistical tests on the data similar to those covered in the topics. The aim of these tests will be to describe all data sets using simple measures and then to carry out a more sophisticated test on one of the data sets to test for an independent random process (IRP). You will be asked to interpret your findings and write a short report on them.

TASK
In order to complete this assignment you will need to download assign2datafiles2008.zip, also available from the Assignment tool in WebCT (in the attachment section where you also download the text file of this assignment). The zip file contains the following data sets:

Pine.txt - (x, y) values for 65 Japanese Pine tree seedlings in a 5.7 x 5.7 m square
Redwood.txt - (x, y) values for 62 Redwood tree seedlings in a 23 x 23 m square
Cells.txt - (x, y) values for 42 biological cells from a microscope slide with no scale

In each case all you have are the (x, y) co-ordinate pairs for some enumerated point events. To facilitate comparisons, both X and Y co-ordinates are for a 'unit square', that is, the co-ordinates on both X and Y have been scaled to the range 0-1. Think about this carefully when computing the real areal densities and distances.
Each file is saved as a tab-delimited text file that can easily be converted for most software systems.

You are required to do the following:
1. Download the data listed above;
2. Produce dot maps of all of the data sets. Use ArcGIS™ or other appropriate software to produce the maps; as explained in the Homework to the first Topic on spatial statistical analysis, Excel™ will do this. In each case, and in a single word or short phrase, how would you describe the pattern?
3. Estimate the overall density (or intensity), λ, of events together with the basic centrographic measures of mean centre and standard distance. In doing this take care to ensure that you understand how the co-ordinate system used relates to the actual physical dimensions of the study regions. I have no measure for the area occupied by the biological cells data;
4. Discuss and interpret the analyses you have carried out (be critical!). This section should be no more than 500 words;
5. For each of the three patterns of events, using a spreadsheet, statistical analysis package or Crimestat™, carry out a standard nearest neighbour analysis on your selected data set to test against the CSR hypothesis. Be sure to state whether you accept or reject the null hypothesis and why;
6. Using just one of the data sets, show how the use of alternative methods can shed more light on the patterning. In this there is no set sequence; just experiment with the tools that you have available, which can include, for example, Crimestat III;
7. Present the results of your analysis as a discussion and interpretation of your results. For example, how appropriate were the tests used to analyse the data? To what extent do the numerical results accord with the evidence of your eyes? Are there any particular strengths or weaknesses that you can identify in analysis of this type? This section should also be not more than 500 words.

You should submit your maps, calculations and results from your analysis, and the discussion and interpretation of your results.

ASSESSMENT CRITERIA
Good marks will be obtained by submissions that:
- produce appropriate maps of the distributions;
- carry out the nearest neighbour test and centrographic measures correctly;
- clearly show the method of data analysis, by stating the formulae used and showing all calculations;
- discuss the results from each stage of the analysis (i.e. producing the point map, calculating density, carrying out nearest neighbour analysis) and interpret the findings;
- show some initiative in the application of alternative tests;
- comment on the nature of the data provided and identify potential limitations with regard to the analysis undertaken. Don't be afraid to be critical - you may also suggest further tests that could be carried out;
- draw on academic literature and research into the nature of spatial statistics;
- stick within the two 500-word limits.

© Birkbeck College & Steven Musson, 7 February 2005
Revised Dave Unwin, 14th December 2005
Revised Dave Unwin, 2nd January 2007
Revised Dave Unwin, 11th Nov. 2007

START YOUR ASSIGNMENT HERE

Introduction

Spatial analysis is the science of making use of where things are - the incorporation of positions in space, and the relationships between objects in space, into the analysis framework. It can cover such different topics as bacterial infection, the migration of reindeer or the distribution of dust in the universe. As such, the datasets involved can be colossal. By applying statistical methods to such data you can reduce them to a small array of summary statistics.
Statistical methods can support exploratory data analysis and the generation of hypotheses about the data. They can also make it possible to examine the role of randomness in producing patterns and to test hypotheses about how such patterns may have come about. Armed with such information, one can then make predictions about the data's future behaviour with statistically definable confidence intervals, or create optimised statistical models. We can also generate expected values under different hypotheses, and these can then be checked against reality.

Spatial structure in data may comprise any or all of three basic components: spatial pattern, spatial dependence and spatial heterogeneity. All may be termed causal relationships across space. Spatial pattern is the quality of the distribution - whether it is clustered, random or dispersed. Spatial dependence is an expression of Tobler's First Law, whereby things are likely to be similar to nearby things. Spatial heterogeneity covers systematic spatial variation in, say, climate or, in survey work, the attitudes and effectiveness of individual interviewers.

Here we will be looking at spatial pattern, seeking to identify whether three datasets are dispersed, randomly distributed or clustered. We will examine each dataset in CrimeStat, ArcGIS, Excel and CorelDraw and extract a number of key statistics. We will address each of these statistics in turn and see what they have to tell us about these datasets, and about spatial data in general.

Dot Maps of the Distributions

Figure 1: Dot maps of the data sets with Mean Centre

The pattern for Redwood would appear to be clustered, but broken, as though a road had been driven through the stands NW-SE and NE-SW. Pine looks like a classic 'random' scatter, with a couple of close pairings, a number of loose groups and some open spaces. Cells are pretty evenly disposed across the field.

Overall Density of Events (λ), Mean Centre and Standard Distance

Given the different areas under consideration, we have two options. One is to reduce all three areas to a common scale to ease comparison of results. The second is to keep each set of data in its own absolute dimensions, and to compare the results later. I have produced both relative (for all three datasets) and absolute (for pine and redwood only) figures.
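The calculations behind these figures are simple enough to script as a cross-check on the CrimeStat output. The sketch below is illustrative only: the file name is one of those supplied with the assignment, and the standard distance deviation follows Ebdon's definition, so CrimeStat's own denominators may differ slightly.

```python
# Minimal sketch of the crude density, mean centre and standard distance deviation,
# assuming a tab-delimited file of (x, y) pairs and a known study-area size.
import numpy as np

def centrography(xy, area):
    """Return crude density (lambda), mean centre and standard distance deviation."""
    n = len(xy)
    density = n / area                      # events per unit area
    mean_centre = xy.mean(axis=0)           # (mean x, mean y)
    # Standard distance deviation: root mean squared distance from the mean centre.
    sdd = np.sqrt(np.sum((xy - mean_centre) ** 2) / n)
    return density, mean_centre, sdd

# Relative (unit-square) and absolute (metres) versions of the pine data.
xy_rel = np.loadtxt("Pine.txt")
print(centrography(xy_rel, area=1.0))            # unit square
print(centrography(xy_rel * 5.7, area=5.7 ** 2)) # rescaled to the 5.7 m x 5.7 m plot
```

Because the standard distance deviation is itself a distance, rescaling the unit-square co-ordinates to metres simply multiplies it by the scale factor (5.7 for pine, 23 for redwood), while the crude density is divided by the square of that factor.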
Application: Crimestat

Settings for Cells, and for Pine (relative) and Redwood (relative):
Primary File - Variables: X, Y - Column: X, Y
Type of Coordinate System: Projected - Euclidean - Data Units: metres
Reference File - Create Grid - x: 0 to 1; y: 0 to 1
Measurement Parameters - Coverage: 1 sq metre
Spatial Description - Spatial Distribution - Mean centre and standard distance

Settings for Pine (absolute):
Primary File - Variables: X, Y - Column: X, Y
Type of Coordinate System: Projected - Euclidean - Data Units: metres
Reference File - Create Grid - x: 0 to 5.7; y: 0 to 5.7
Measurement Parameters - Coverage: 32.49 sq metres
Spatial Description - Spatial Distribution - Mean centre and standard distance

Settings for Redwood (absolute):
Primary File - Variables: X, Y - Column: X, Y
Type of Coordinate System: Projected - Euclidean - Data Units: metres
Reference File - Create Grid - x: 0 to 23; y: 0 to 23
Measurement Parameters - Coverage: 529 sq metres
Spatial Description - Spatial Distribution - Mean centre and standard distance

The following table displays the results for both relative and absolute datasets, including the areal density (λ), mean centre and standard distance deviation.

It is to be noted that, as expected, the mean centre differs between the relative and absolute datasets. However, the standard distance deviation is also different between the absolute and relative data, where this might have been expected to remain the same. It is, after all, a measure of dispersion - the greater the standard distance deviation, the more dispersed the events[iii] - though being a distance it should scale in direct proportion when both data and enclosing frame are scaled by the same amount. I suspect an error in my work.

Figure 2: Crude Density, Mean Centre and SDD

Quadrat Censusing

I applied quadrat censusing to determine the distribution of points across the three study areas.

Considerations of Distance

In all cases I used Euclidean distance: given the small areas involved in all three cases, there is no need to bother with the curvature of the earth. In any case, no locality is given, so it would not be possible to determine the local curvature.

For each of the three datasets I:
- created a dot-map in ArcGIS in a 10 cm x 10 cm data frame;
- created a 10 cm x 10 cm array of 1 cm quadrats in CorelDraw;
- imported the dot-map into CorelDraw and constrained it to 10 cm x 10 cm;
- overlaid and centred the dot-map and the array;
- calculated the area occupied by each quadrat under each dataset, the crude areal density, the possible point-count values per quadrat cell (x), and how many points fell in each cell;
- totalled these to obtain:
  - the number of cells containing 1, 2, 3, 4 ... points (f) ['cell-point-census'];
  - the product of the cell-point-census and x (f·x), and x²;
  - the mean count of points per cell, x̄ = Σ(f·x) / n, where n is the number of grid cells;
  - the variance, s² = [Σ(f·x²) − (Σ(f·x))² / n] / (n − 1);
  - the variance-mean ratio, VMR = s² / x̄;
  - the t-statistic, t = (VMR − 1) / √(2 / (n − 1)).
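As a check on the spreadsheet arithmetic, the same census can be scripted. The following is a minimal sketch rather than the workflow used above: it assumes unit-square co-ordinates read from the supplied files and the same 10 x 10 grid, and it works directly from the per-cell counts rather than the frequency table, which yields the same x̄, s², VMR and t.

```python
# Minimal sketch of the quadrat census statistics for a 10 x 10 grid,
# assuming a text file of unit-square (x, y) pairs.
import numpy as np

def quadrat_census(xy, n_side=10):
    """Return mean count per cell, variance, VMR and t-statistic."""
    # Bin each point into a grid cell; clip so points at exactly 1.0 fall in the last cell.
    cols = np.clip((xy[:, 0] * n_side).astype(int), 0, n_side - 1)
    rows = np.clip((xy[:, 1] * n_side).astype(int), 0, n_side - 1)
    counts = np.zeros((n_side, n_side), dtype=int)
    np.add.at(counts, (rows, cols), 1)

    n = n_side * n_side                     # number of grid cells
    x = counts.ravel()                      # points per cell
    xbar = x.sum() / n                      # mean count per cell
    s2 = (np.sum(x ** 2) - x.sum() ** 2 / n) / (n - 1)   # variance of cell counts
    vmr = s2 / xbar                         # variance-mean ratio
    t = (vmr - 1) / np.sqrt(2 / (n - 1))    # t-statistic for departure from VMR = 1
    return xbar, s2, vmr, t

print(quadrat_census(np.loadtxt("Pine.txt")))   # file name as supplied with the assignment
```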
Cells
42 events. Total area = 1 x 1 = 1 unit square. Scale: 1 quadrat cell = 1/100 unit square = 0.01.

Figure 3: Quadrat Census for CELLS

Pine
65 events. Total area (5.7 m x 5.7 m) = 32.49 sq m. Scale: 1 quadrat cell = 0.57 m x 0.57 m = 0.3249 sq m.

Figure 4: Quadrat Census for PINE

Redwood
62 events. Total area (23 m x 23 m) = 529 sq m. Scale: 1 quadrat cell = 2.3 m x 2.3 m = 5.29 sq m.

Figure 5: Quadrat Census for REDWOOD

Discuss and interpret analyses

Overall Density, Mean Centre, Standard Distance Deviation

The overall density, λ, is not a very good index of distribution. It is akin to population density, and this itself is a pretty poor index, as may be deduced from Wikipedia.[iii] All it really tells you is that high densities occur in small nations; there are obvious reasons for this - many of these small nations consist simply of one large city with no hinterland. Twenty-two of the top 30 in the list have a total area of less than half the size of Wales.

The location of the mean centre, in all our cases, is not significantly different from the dead centre of the square plot, so this also is shown to be a poor index of distribution. The standard distance also appears to be very weak, as the results for all three (relative) datasets were in the range 0.39 to 0.43; in fact, both Cells and Redwood were identical at 0.39.

Quadrat Censusing

Results here were more helpful. The s² statistic is a measure of the variance of the cell frequencies, and the variance-mean ratio, or VMR, "standardizes the degree of variability in the cell frequencies in relation to mean cell frequency".[iv] Our three values for s² are 0.246 (cells), 0.614 (pine) and 1.389 (redwood). These correspond well to dispersed, random and clustered. Where data are dispersed, a lot of grid cells will have closely similar values, producing a low variance. In clustered data, on the other hand, a few grid cells will contain a lot of data points while others will have few or none, producing a high variance. The variance of randomly distributed data will fall between these extremes. The VMR is useful in that a value of 1 is what results from a Poisson distribution, because this has equal variance and mean. VMRs either side of 1 correspond to a negative binomial distribution (> 1) or to a binomial distribution (< 1).[v]

However, the quadrat method has its own weaknesses. It still tells us nothing about how the individual points relate to each other. Certainly it gives us more information than the methods tried above, but all we end up with is a single measure of dispersion, not pattern, across the whole dataset. It is also affected by the choice of grid size.

397 words

Nearest Neighbour Analysis

For this exercise I began with the cells data.

Figure 6: Nearest Neighbour Results from Crimestat

What Crimestat calls the "Nearest Neighbour Index" is in ArcGIS referred to as "Observed Mean Distance / Expected Mean Distance", but it is the same thing, and has identical values across all groups. In the literature this conventionally appears as the R statistic,

R = d̄(obs) / d̄(exp), where d̄(obs) is the observed mean nearest neighbour distance and d̄(exp) = 1 / (2√λ) is that expected under complete spatial randomness.[vi]

The Z-score is also identical for both ArcGIS and Crimestat.
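For readers without either package, the R statistic and Z-score can be reproduced from the standard Clark and Evans formulae. This is a minimal sketch under the usual assumptions (Euclidean distances, no edge correction, study area supplied by the user); the file name is the one provided with the assignment.

```python
# Minimal sketch of the Clark-Evans nearest neighbour index (R) and Z-score,
# assuming a text file of (x, y) pairs and a known study area.
import numpy as np

def nearest_neighbour_test(xy, area):
    n = len(xy)
    # Pairwise Euclidean distances; the diagonal is masked so a point is not its own neighbour.
    d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    d_obs = d.min(axis=1).mean()            # observed mean nearest neighbour distance
    lam = n / area                          # density (lambda)
    d_exp = 1 / (2 * np.sqrt(lam))          # expected mean distance under CSR
    r = d_obs / d_exp                       # nearest neighbour index (R)
    se = 0.26136 / np.sqrt(n * lam)         # standard error of the expected distance
    z = (d_obs - d_exp) / se                # Z-score
    return r, z

print(nearest_neighbour_test(np.loadtxt("Cells.txt"), area=1.0))
```

The brute-force distance matrix is perfectly adequate for a few dozen points; for larger datasets a spatial index such as a k-d tree would be the better choice.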
The Z-score is critical when we consider the NULL HYPOTHESIS. It measures, in standard deviations, how far the observed pattern departs from a purely random distribution. If you are operating at a 95% confidence level and you obtain a Z-score of between -1.96 and +1.96, you cannot reject the NULL HYPOTHESIS, as there remains a significant likelihood that the observed pattern was generated by random chance. If your Z-score falls outside these values, however, you should reject it.[vii]

Analysis - CELLS

ArcGIS's 'Average Nearest Neighbour Distance' dialog makes it very clear that the CELLS dataset is dispersed, with a Nearest Neighbour Index (R) above 1.67. An R-value of 1 indicates a purely random distribution, whereas 2.149 denotes perfect dispersion; an R-value of 0 means that the distribution is perfectly clustered, i.e. that all the observations occur at a single location (http://www.geog.ubc.ca/courses/geog516/presentations/sunny1.htm).

ArcGIS announces that there is only a 1% chance that the observed pattern could be the result of random chance - in other words, there is a 99% degree of confidence in the assertion that it is dispersed. In this case the NULL HYPOTHESIS, that the observed pattern in the data is the result of random chance, must be rejected.

Figure 7: Nearest Neighbour Output from ArcGIS - CELLS

Further grounds for rejecting the NULL HYPOTHESIS are supplied by Crimestat. In our case the Z-score for CELLS is 8.2812, which, in terms of standard deviations, is practically off the scale. Rowntree (1981, p. 139) states that Z-tests are only suitable for large samples and that t-tests are better for small ones. However, he then defines the threshold between small and large as 30 observations, so our three datasets may, on these grounds, adequately be dealt with by the Z-statistic. There is, moreover, no obvious way of generating a t-statistic in Crimestat.

Analysis - PINE

In the case of PINE, ArcGIS places the relative dataset squarely in 'random'. The R statistic of 1.06 and the Z-score of 0.99 are as central as could be desired. The same key results hold for the absolute dataset.

Figure 8: Nearest Neighbour Output from ArcGIS - PINES

Analysis - REDWOOD

In the case of REDWOOD, ArcGIS places the relative dataset at the far end of 'clustered'. The R statistic of 0.61 falls within the range of values you would expect for clustered data, and the large negative Z-score of -5.92 is well beyond the -1.96 required to reject the null hypothesis. The same is true of the redwood data on absolute values.

Figure 9: Nearest Neighbour Output from ArcGIS - REDWOOD

Nearest neighbour analysis offers more useful spatial information than either of the above methods, takes account of distances, and escapes the problem of arbitrary grid-cell sizes. However, it is constrained by the data boundary - and this may be just as arbitrary. You are likely to encounter edge effects (see below), and these may be severe.

Alternative methodologies

Kernel Density Functions

I turn now to Kernel Density Functions in the Spatial Analyst Tools section of ArcGIS's ArcToolbox. We will be looking at the Redwood (absolute) dataset. If we accept all the default settings here, the output map is barely more informative than the original distribution.

Default settings:
Output cell size: 0.080684
Search radius: 0.672366666666667
General settings - Extent: default
Equal intervals (9)

Figure 10: Kernel Density Output from ArcGIS - Default Settings

But we can do better.
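Before tuning these parameters it helps to be clear about what the tool computes: every output cell receives the summed contribution of a kernel centred on each event lying within the search radius, so the cell size controls the resolution of the surface and the radius controls its smoothness. The sketch below is a rough stand-in for that calculation (ArcGIS uses a quartic kernel, approximated here), not the ArcGIS implementation itself; the cell size and search radius are my own illustrative choices, and the unit-square file is rescaled to the 23 m plot.

```python
# Rough sketch of a quartic kernel density surface, assuming redwood (x, y) pairs
# rescaled from the unit square to the 23 m x 23 m plot.
import numpy as np

def quartic_kde(xy, cell_size, search_radius, extent):
    xmin, xmax, ymin, ymax = extent
    gx = np.arange(xmin, xmax, cell_size) + cell_size / 2    # cell-centre x co-ordinates
    gy = np.arange(ymin, ymax, cell_size) + cell_size / 2    # cell-centre y co-ordinates
    xx, yy = np.meshgrid(gx, gy)
    density = np.zeros_like(xx)
    for px, py in xy:
        d2 = (xx - px) ** 2 + (yy - py) ** 2
        # Quartic kernel: falls to zero at the search radius and integrates to one.
        k = 3 / (np.pi * search_radius ** 2) * (1 - d2 / search_radius ** 2) ** 2
        density += np.where(d2 <= search_radius ** 2, k, 0.0)
    return gx, gy, density                                   # events per square metre

xy = np.loadtxt("Redwood.txt") * 23                          # rescale unit square to metres
gx, gy, surface = quartic_kde(xy, cell_size=0.25, search_radius=4.0, extent=(0, 23, 0, 23))
```

Varying the search radius is the main lever: a small radius sharpens individual clusters, while a large one merges them into a single broad bump, which is essentially the trade-off being explored with the ArcGIS settings here.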
If we replace those defaults with the following values, we obtain a far better map:

Output cell size: 0.04
Search radius: 4 square metres
General settings - Extent: as specified below - 0-23, 0-23
Equal intervals (9)

Figure 11: Kernel Density Output from ArcGIS - Custom Settings

We can extend this further by replacing 'equal intervals' with 20 quantiles. In this case (see Figure 12) there is a far higher degree of graduation and a smoother effect, but it may be questioned what benefit we really obtain from this added definition. It may indeed be argued that, visually, and given the sparsity of the original dataset, the nine equal intervals are better.

Figure 12: Kernel Density Output from ArcGIS - 20 Quantiles

Thiessen Polygons

An alternative approach is to create Thiessen or Voronoi polygons, which are formed from lines perpendicular to, and bisecting the mid-points of, the lines connecting neighbouring points. Comparing the areas of the Thiessen polygons across the plot will give a reasonable indication of clustering or dispersion. In a process analogous to that of the quadrat counts, if we determine the variance and VMR of the Thiessen polygon areas, we have another method for quantifying the distribution: a low variance will once again indicate that there is little variation in the size of the polygons, and that their centroids are consequently dispersed; similarly, a high variance will indicate clustering. (A scripted sketch of this calculation follows Figure 13 below.)

This is an opportunity to touch lightly on the problem of edge effects, a serious problem which I have skirted around up to this point. The diagram below results from a third-party VB routine available from ESRI.[viii] This includes a setting to "set boundary extent by buffering point area by n%", and I have done this for 0%, 25%, 50% and 100%. It will be clear that polygons around the edge of a plot will be either severely distorted or truncated. In the case of a real island this will not matter, as the edge is the coast, and you can clip the data area to that genuine physical barrier. In the case of a landlocked county, however, the nearest neighbour of a particular data point may in fact lie just across the border.

Figure 13: Thiessen Polygons around Redwood data
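The area comparison described above can be scripted with standard tools. The sketch below is illustrative only: it assumes the unit-square redwood file and scipy's Voronoi routine, and it simply discards the unbounded polygons along the border - a crude way of side-stepping the edge problem just described rather than solving it (the buffering used by the ESRI routine is the more defensible treatment).

```python
# Minimal sketch of a Thiessen/Voronoi area comparison, assuming unit-square (x, y) pairs.
# Unbounded border polygons are dropped, so the edge effect is dodged, not corrected.
import numpy as np
from scipy.spatial import Voronoi

def polygon_area(pts):
    """Area of a convex polygon: order the vertices by angle, then apply the shoelace formula."""
    centre = pts.mean(axis=0)
    order = np.argsort(np.arctan2(pts[:, 1] - centre[1], pts[:, 0] - centre[0]))
    x, y = pts[order, 0], pts[order, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

xy = np.loadtxt("Redwood.txt")
vor = Voronoi(xy)
areas = np.array([polygon_area(vor.vertices[region])
                  for region in vor.regions
                  if region and -1 not in region])   # keep only the bounded polygons
vmr = areas.var(ddof=1) / areas.mean()               # variance-mean ratio of polygon areas
print(len(areas), areas.mean(), vmr)
```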
Discussion and interpretation of results

In sum, we undertook a range of basic spatial analysis tests on our three sets of data and, on the basis of their combined evidence, we:
- rejected the null hypothesis for CELLS, as these data are DISPERSED;
- accepted it for PINES, as these data are RANDOMLY distributed;
- rejected it also for REDWOOD, as these data are CLUSTERED.

Some of the tools were, to be honest, pretty useless - the mean centre in particular, which was as good as identical for all three of the datasets. Standard Distance Deviation also seemed pretty poor. Quadrat analysis, with its associated s², VMR and t-statistic, was an improvement, but the Z-score of the nearest neighbour index and the R statistic were more conclusive.

As far as these tests went, the evidence might seem absolutely conclusive - after all, we were right up at and even beyond the 99% confidence level with our assertions. However, we did not take several things into account. The suggestion with the pines and redwoods was that the locations of the saplings were subject purely to natural considerations. This might not be so. We were not given an underlying geography, and this might - in the Redwood case - have provided evidence of roads or railways and associated developments cutting through the stands, features which would have prevented the growth of redwood saplings along those ribbons. There might also be zones of higher or lower fertility, or marsh, ravines or the fringes of lakes within the plot. In any of these cases the clustering that we observed might be no more than we would expect. On referring to David Strauss's original article (1975),[ix] where the Redwood dataset was first published, and Diggle's (1979) continuation,[x] no such permanent man-made intrusions are present on the site, but the saplings were observed to cluster around felled mature redwoods. This is a factor which has not been taken into account in our analysis.

Edge effects are also present and have not been dealt with in this paper. It is perfectly possible that the three square plots as drawn up ignore significant areas of data just beyond their boundaries. This might result in 'stragglers' from beyond the boundary appearing isolated within the study zone, which would affect the statistics for the study zone and diminish the degree of clustering recorded. This could, of course, also work the other way round, with small 'empty' patches within the plot being parts of much larger 'empty' areas beyond it.

Word count 408

Notes

i http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=mean_center_(spatial_statistics)
ii Ebdon, D. (1985). Statistics in Geography. Basil Blackwell. http://books.google.co.uk/books?id=IQ7LnlAPE_gC&pg=PA134&lpg=PA134&dq=%22standard+distance+deviation%22&source=web&ots=RQj24AxwfJ&sig=WpdBeg3IShVl9A8btWQMNgEUUPc&hl=en
iii http://en.wikipedia.org/wiki/List_of_countries_by_population_density
iv http://instruct.uwo.ca/geog/500/Quadrat_by_6.pdf
v http://en.wikipedia.org/wiki/Variance-to-mean_ratio
vi http://www.geog.ubc.ca/courses/geog516/presentations/sunny1.htm
vii http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=What%20is%20a%20Z%20Score and http://upload.wikimedia.org/wikipedia/en/8/89/Normal_distribution_and_scales.svg
viii http://arcscripts.esri.com/details.asp?dbid=11958
ix Strauss, D.J. (1975). A Model for Clustering. Biometrika, Vol. 62, No. 2, pp. 467-475.
x Diggle, P.J. (1979). On Parameter Estimation and Goodness-of-Fit Testing for Spatial Point Patterns. Biometrics, Vol. 35, No. 1 (Perspectives in Biometry), pp. 87-101.

Further reading

de Smith, M.J., Goodchild, M.F. and Longley, P.A. (2007). Geospatial Analysis - A Comprehensive Guide to Principles, Techniques and Software Tools. Matador, Leicester. Maintained web version: http://www.spatialanalysisonline.com/
Fotheringham, A.S., Charlton, M.E. and Brunsdon, C. (1996). Quantitative Geography: Perspectives on Spatial Data Analysis. Sage, London.