Inferential Spatial Statistics: Introduction to Concepts
Briggs, Henan University, 2010

[Figure: a sample drawn from a population; we infer from the sample back to the population]

Today:
• Review standard statistical inference.
• Examine the concept of spatial randomness.
• Define a random point pattern.
Next time:
• Using inferential spatial statistics to analyze point patterns

Spatial Analysis: successive levels of sophistication
1. Spatial data description: classic GIS capabilities
   – Spatial queries and measurement, buffering, map layer overlay
2. Exploratory Spatial Data Analysis (ESDA): searching for patterns and possible explanations
   – GeoVisualization through data graphing and mapping
   – Descriptive spatial statistics
3. Spatial statistical analysis and hypothesis testing
   – Are data “to be expected” or are they “unexpected” relative to some statistical model, usually of a random process?
4. Spatial modeling or prediction
   – Constructing models (of processes) to predict spatial outcomes (patterns)

Descriptive & Inferential Statistical Analysis
Last time we discussed descriptive statistics for spatial analysis
• Concerned with obtaining summary measures to describe a set of data
• For example, the mean and the standard deviation, the centroid and the standard distance
This time we will discuss inferential statistics
• We begin by reviewing standard (non-spatial) inferential statistics
• Then we look at inferential spatial statistics

Standard Statistical Inference
Inferential statistics is concerned with making inferences:
• from a sample(s) about a population(s)
• from observed patterns about underlying processes
I hope you are already familiar with standard (non-spatial) inferential statistics. I will quickly review the main ideas.

Populations and Samples
Population: all occurrences of a particular phenomenon
Sample: a part (subset) of the population for which we have data.
You are a sample of the population of all people in the world.
The sample is used to make inferences about the population. We draw conclusions about the population from the sample.

Process, Pattern and Analysis (from Lecture #2 on Spatial Analysis)
• Often, we cannot observe the process, so we have to infer the process by observing the pattern
• From the sample, we infer the process in the population.
[Figure: processes in the population create patterns; from the sample we infer the process]

The Importance of the Sample
How “good” (or “accurate” or “true”) are our inferences or conclusions? It depends upon the sample!
• If we get a good sample, the conclusions are good. The sample is representative of the population.
• If we get a bad sample, the conclusions are not good. The sample is not representative of the population.

The Requirement of a Random Sample
• All statistical inference is based on the assumption (requirement) that you have a random sample
• What is a random sample?
• A sample chosen such that every member of the population has an equal chance (probability) of being included
• This doesn’t guarantee a representative sample
• We could be really unlucky and get an unrepresentative sample

Some Definitions
• Population – all occurrences
• Parameters – numbers calculated from the population
• Sample – subset of the population for which we have data
• Statistics – numbers calculated from the sample
Statistics are estimates of parameters.
We can calculate the statistic because we have data for samples. We cannot calculate the parameter because we do not have data for the entire population.

Example: Are girls more intelligent than boys?
• Sample of boys – IQ* = 115
• Sample of girls – IQ* = 130
*IQ = Intelligence Quotient
“Ha! Ha! Girls are more intelligent than boys. Here is the proof!”
“No! No! It depends on the samples we have. The sample statistics are different, but the population parameters may be the same!”
Who is correct?

How do we decide who is correct?
The Null Hypothesis and the Alternative Hypothesis
Assume that in the population the average (mean) IQ of girls is the same as the average IQ of boys: $\mu_g = \mu_b$
This is called the Null Hypothesis: $\mu_g = \mu_b$ (there is no difference between girls and boys in the population)
The Alternative Hypothesis: $\mu_g > \mu_b$ (in the population, girls are smarter than boys)

Choosing between Null and Alternative
• In our two samples: $\bar{X}_g - \bar{X}_b = 130 - 115 = 15$
  – The difference between the sample means was 15
• Ask the question: if the population means are the same, how probable is it that, from sampling variation alone, I would get a difference of 15 points between sample means?
• If this is reasonably probable (or likely), accept the Null Hypothesis
• If this is highly improbable (highly unlikely), reject the Null and accept the Alternative Hypothesis

How do I calculate the probability of getting a difference of 15?
We use the sampling distribution. What is this?

[Figure: random samples are drawn repeatedly from all girls (the population of girls) and from all boys (the population of boys), giving sample means $\bar{X}_g$ and $\bar{X}_b$]
For every pair of samples, calculate the mean of each, and then the difference between these means.

The Sampling Distribution
If we have a thousand sample pairs, we have a thousand values for $\bar{X}_g - \bar{X}_b$.
We can draw a frequency distribution showing how often different values occur.
[Figure: frequency distribution of $\bar{X}_g - \bar{X}_b$, centered at 0, with 2.5% of the area in each tail beyond -1.96 and 1.96]
The sampling distribution is simply the frequency distribution for some value calculated each time from many, many, many samples. The calculated value is called the test statistic.

Using the Sampling Distribution
[Figure: sampling distribution of $\bar{X}_g - \bar{X}_b$ with 2.5% in each tail beyond -1.96 and 1.96; the observed difference of 15 falls inside the central region]
Here, a sample difference of 15 is quite likely. Conclusion: accept the Null. Boys and girls are the same.
The probability should be less than 5% (.05) to reject the null hypothesis.
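The idea of building a sampling distribution from many sample pairs can be sketched in Python. This is a minimal illustration, not part of the original lecture: the population mean of 100, the standard deviation of 15, the sample size of 36 and the function names are all assumptions chosen for the example. Both populations are given the same mean, so the simulation shows what the Null Hypothesis predicts.

```python
import random
import statistics


def simulate_mean_differences(n_pairs=1000, sample_size=36,
                              pop_mean=100.0, pop_sd=15.0, seed=42):
    """Draw n_pairs of random samples from two populations with the SAME
    mean (the null hypothesis) and return the difference between the two
    sample means for every pair. All parameter values are assumptions."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_pairs):
        girls = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        boys = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        diffs.append(statistics.fmean(girls) - statistics.fmean(boys))
    return diffs


def share_at_least(diffs, observed):
    """Fraction of simulated differences at least as far from zero as the
    observed difference (both tails counted)."""
    return sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)


diffs = simulate_mean_differences()
print("mean of simulated differences:", statistics.fmean(diffs))
print("share of differences >= 15:", share_at_least(diffs, 15.0))
```

Under these assumed values a difference of 15 is far out in the tail of the simulated distribution; with smaller samples or a larger population standard deviation the same difference would be much less unusual.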
This probability is called the statistical significance of the test.
[Figure: the same sampling distribution; the observed difference of 15 falls beyond 1.96, in the upper tail]
Here, a sample difference of 15 is very unlikely. Conclusion: reject the Null, accept the Alternative. Girls are smarter than boys.

Calculating a Test Statistic
• To find the exact probability of getting a difference of 15 between the girls and boys we calculate a test statistic
• A test statistic is a number, calculated from a sample statistic, whose sampling distribution is known
  – That is, we know the shape of the frequency distribution of the test statistic when multiple samples are taken
• In the case of the difference between two sample means the test statistic is:
  $z = \dfrac{\bar{X}_g - \bar{X}_b}{\sqrt{\dfrac{s_g^2}{n_g} + \dfrac{s_b^2}{n_b}}}$
  where $s_g^2$ = variance for girls, $s_b^2$ = variance for boys, and $n_g$, $n_b$ are the two sample sizes.
  Its sampling distribution is a Normal frequency distribution if the sample sizes are greater than 30.
• Note: test statistics generally have “degrees of freedom”, which are calculated from the sample size (N)

Test Statistic for Normal Frequency Distribution
[Figure: Normal sampling distribution of $\bar{X}_g - \bar{X}_b$, centered at 0, with 2.5% of the area in each tail beyond -1.96 and 1.96]
To reject the Null Hypothesis, the z test statistic should have a value greater than 1.96 (or less than -1.96). If the population means were the same, there would be less than a 5% chance of obtaining a value this extreme.
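The test statistic for the difference between two means can be worked through in a short Python sketch. The sample means 130 and 115 come from the example; the variances (225, i.e. a standard deviation of 15) and the sample sizes (36 each) are not given in the lecture and are assumed here purely for illustration, as is the function name.

```python
import math


def two_sample_z(mean_g, mean_b, var_g, var_b, n_g, n_b):
    """Return the z statistic for the difference between two sample means
    and its two-tailed probability under the Normal distribution."""
    standard_error = math.sqrt(var_g / n_g + var_b / n_b)
    z = (mean_g - mean_b) / standard_error
    # Two-tailed probability P(|Z| >= |z|) for a standard Normal variable
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed


# Means from the lecture example; variances and sample sizes are assumed.
z, p_value = two_sample_z(mean_g=130, mean_b=115,
                          var_g=225.0, var_b=225.0, n_g=36, n_b=36)
print(f"z = {z:.3f}, two-tailed p = {p_value:.6f}")
print("reject the Null" if abs(z) > 1.96 else "accept the Null")
```

With these assumed inputs z is well beyond 1.96. Larger variances or smaller samples would increase the denominator and could bring the same difference of 15 back inside the acceptance region.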
Conclusion: reject the Null, accept the Alternative. Girls are smarter than boys.

Standard Error: Standard Deviation of the Sampling Distribution
Test statistic for the difference between two means:
  $z = \dfrac{\bar{X}_g - \bar{X}_b}{\sqrt{\dfrac{s_g^2}{n_g} + \dfrac{s_b^2}{n_b}}}$
The denominator is the standard error for the difference between two means.
[Figure: two sampling distributions of $\bar{X}_g - \bar{X}_b$, one narrow (smaller standard error) and one wide (larger standard error), each with 2.5% of the area in each tail beyond -1.96 and 1.96]
• The standard error is very important
• Approximately, it tells you how far, on average, the sample statistic is away from the population parameter
  – Thus, it is a measure of sampling variability or error
• The larger the standard error, the more difficult it is to reject the Null Hypothesis

Reporting the Results of a Statistical Significance Test: many ways to say the same thing!
• When we use a test statistic and its sampling distribution we say that we are conducting a statistical significance test
• We reject the null hypothesis if, were it true, there would be less than 5 chances in 100 of getting a result as extreme as ours
• We say the results are “statistically significant at the 5% level”
• Or we say the results are “significant at the 95% confidence level”

The Normal or Gaussian Probability Distribution
[Figure: Normal curve of $\bar{X}_g - \bar{X}_b$, centered at 0, with 2.5% of the area in each tail beyond -1.96 and 1.96]
This is the sampling distribution for tests involving differences between means. Why is it this shape?
If the null hypothesis is true:
– What would be the average value of the differences between the sample means?
  • It would be zero (0)
– We expect many small difference values and few big differences
  • Values would be concentrated around the mean
– We expect as many negative differences as positive differences
  • Symmetrical: same on each side of the mean

How do we find the Sampling Distribution and Test Statistic?
Two methods:
1.
By mathematical theory:
   • test statistics and sampling distributions are already known through theory
   • common distributions are Z (Normal), Chi-square, and F distributions
2. By computer simulation
   • The computer is used to “simulate” multiple samples, and we use these to draw a frequency distribution
     – As with our “boys and girls” example
   • Very common in spatial statistics

Spatial Statistical Inference

Spatial Statistical Inference: Null and Alternative Hypotheses
• Null Hypothesis:
  – The spatial pattern is random
  – IRP/CSR: independent random process/complete spatial randomness
• Alternative Hypothesis:
  – The spatial pattern is not random
  – It may be clustered or dispersed

What do we mean by spatially random?
[Figure: three point patterns labelled RANDOM, UNIFORM/DISPERSED and CLUSTERED]
– Random: a point is equally likely to occur at any location, and the position of a point is not affected by the position of any other point.
– Uniform: every point is as far from other points as possible: “likely to be distant”
– Clustered: every point is close to other points: “likely to be close”

Is it Spatially Random? Difficult to know!
• Fact: two times as many people sit “on the corners” rather than opposite at tables in a restaurant
  – Conclusion: psychological preference for nearness
• In actuality: an outcome to be expected from a random process: two ways to sit opposite, but four ways to sit on the corners
From O’Sullivan and Unwin p.69

[Figure: High Peak district biomass index (ratio of remotely sensed spectral bands B3 and B4), shown as a spatially clustered pattern and as a geographically random pattern]

Why Processes differ from Random
Processes differ from random in two primary ways
• Variation in the study area
  – Diseases cluster because people cluster (e.g.
cancer)
  – Cancer cases cluster because chemical plants cluster
  – First order effect
• Interdependence of the points themselves
  – Diseases cluster because people catch them from others who have the disease (colds)
  – Second order effect
In practice, it is very difficult to distinguish these two effects merely by the analysis of spatial data.

Bank Robberies: First Order or Second Order effect?
– Bank robberies are clustered
– First order: because banks are clustered
[Figure: map of bank robberies and banks]
In the lecture on Spatial Analysis we called this the effect of “non-uniformity of space”.
Could there also be a second order effect?

Remember our data on software and telecommunications industries in Dallas?
We can think of this data as a sample.
We can use statistical inference to test if the spatial pattern is clustered, or “random” (no pattern).
We will look at the actual tests later.

Spatial Statistical Hypothesis Testing: Simulation Approach
• Because of the complexity of spatial processes, it is often difficult to derive theoretically a test statistic with a known probability distribution
• Instead, we often use computer simulations
• We take multiple samples from a random spatial pattern, the spatial statistic we are using is calculated for each sample, and then a frequency distribution is drawn
• This simulated sampling distribution is used to measure the probability of obtaining our actual observed spatial statistic
[Figure: empirical frequency distribution from 500 random patterns (“samples”), with our observed value marked far out in the tail]
Our observed value:
– is highly unlikely to have occurred if the process was random
– so we conclude that the process is not random

Software for Spatial Statistics
• ArcGIS 9: the most common GIS software, but $$$$!
  – Spatial Statistics Tools for point and polygon analysis
  – Spatial Analyst tools for kernel density
  – GeoStatistical Analyst tools for interpolation of continuous surface data
• CrimeStat III: download from http://www.icpsr.umich.edu/NACJD/crimestat.html
  – Standalone package, free for government and education use
  – Calculates values for spatial statistics but has no GIS graphics
  – Good documentation and explanation of measures and concepts
• OpenGeoDA, Geographic Data Analysis, by Luc Anselin, now at Arizona State
  – Download from: http://geodacenter.asu.edu/
  – Runs on Vista and Windows 7 (also Mac and UNIX)
  – Earlier version called GeoDA (0.9.5i_6) runs only on XP
  – Easy to use and has good graphic capabilities
• R: open source statistical package
  – Originally on UNIX but now has an MS Windows version
  – Has the most extensive set of spatial statistical analyses
  – Difficult to use
  – You need to learn it if you are going to do major work in this area
• S-Plus: the only commercial statistical package with extensive support for spatial statistics
  – www.insightful.com

References
• O’Sullivan and Unwin, Geographic Information Analysis. New York: John Wiley, 1st ed. 2003, 2nd ed. 2010
• Jay Lee and David Wong, Statistical Analysis with ArcView GIS. New York: Wiley, 1st ed. 2001 (all page references are to this book), 2nd ed. 2005
  – Unfortunately, both editions are based on old software (Avenue scripts used with ArcView 3.x), which no longer works in the current versions of ArcGIS (9 or 10).
• Ned Levine and Associates, CrimeStat III. Washington: National Institute of Justice, 2010
  – Available as pdf; download from: http://www.icpsr.umich.edu/NACJD/crimestat.html
• Arthur J.
Lembo at http://www.css.cornell.edu/courses/620/css620.html (no longer active)

Next time: Inferential Statistics for Point Pattern Analysis

Software for Spatial Statistics: Examples
Planned as a separate lecture, but we couldn’t meet last Friday, so I will look at some examples after today’s lecture, and again after the next lecture.

1. Using ArcGIS to find the Population Centroid of China
• Open ArcGIS
• Add data files: China.shp and ChinaProvinceData.xls
• Join ChinaProvinceData.xls to China.shp
  – Right click China and select Joins ..
  – Use GMI_Admin as join field
• Open ArcToolbox by clicking on its toolbar icon
• Go to Spatial Statistics Tools>Measuring Geographic Distribution>Mean Center
  – Input Feature Class: China
  – Output: China_MeanCenter.shp
  – Weight Field: Population 2008
• Note the warning: we should have projected the data first!
  WARNING 000916: The input feature class does not appear to contain projected data.
• The result: it is in south Henan province!

2. Calculate Population Centroid using a Spreadsheet Program (e.g. Excel)
• Make a copy of ChinaProvinceData.xls and open this copy: ChinaProvinceData Copy.xls
• It contains centroids for each province obtained from GeoDA. (You need the very expensive ArcInfo version to get centroids for all polygons from ArcGIS and I do not have it!)
• Calculate:
  – XCentroid * Weight (Population 2008), and then sum
  – YCentroid * Weight (Population 2008), and then sum
• Divide each sum by the sum of the weights (Total Population 2008).
• These are the X and Y coordinates for the China Population Centroid: 113.4696704, 32.3797596
• Copy these values into a new worksheet, and create a very simple data table:
  ID   X          Y
  1    113.4697   32.3798
• Save the spreadsheet and close Excel.
• Read this table into ArcGIS
• Right click on the table name and select Display XY Data. This displays X, Y coordinates from a table on the map.
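The spreadsheet calculation in example 2 can be sketched in Python. This is a minimal sketch of the weighted mean center formula only, not the ArcGIS Mean Center tool itself; the function name is hypothetical and the three (x, y, population) rows are made-up placeholders, not values from ChinaProvinceData.xls.

```python
def weighted_mean_center(points):
    """Return the weighted mean center (x, y) of (x, y, weight) rows:
    sum(x * w) / sum(w) and sum(y * w) / sum(w)."""
    rows = list(points)
    total_weight = sum(w for _, _, w in rows)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    center_x = sum(px * w for px, _, w in rows) / total_weight
    center_y = sum(py * w for _, py, w in rows) / total_weight
    return center_x, center_y


# Hypothetical province centroids (longitude, latitude) with made-up
# population weights in millions; replace with the real table to use it.
provinces = [
    (113.0, 34.0, 90.0),
    (116.0, 40.0, 20.0),
    (104.0, 30.0, 80.0),
]
cx, cy = weighted_mean_center(provinces)
print(f"weighted mean center: {cx:.4f}, {cy:.4f}")
```

Because the placeholder rows are invented, the printed coordinates are not comparable to the centroid reported in the lecture. As the ArcGIS warning notes, averaging unprojected longitude and latitude values is only an approximation.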
The spreadsheet results are very similar to the value calculated by ArcGIS itself!

3. Use ArcGIS to Calculate Standard Deviation Ellipse for Population and for Illiterate Population
SDE for Population
• Go to Spatial Statistics Tools>Measuring Geographic Distribution>Directional Distribution
  – Input Feature Class: China
  – Output: SDE_Population.shp
  – Weight Field: Data$.Pop2008
Mean Center for Illiterate Percent
• Go to Spatial Statistics Tools>Measuring Geographic Distribution>Mean Center
  – Input Feature Class: China
  – Output: MC_Illit_PerCent.shp
  – Weight Field: Data$.Illiterate_Prcnt
SDE for Illiterate Percent
• Go to Spatial Statistics Tools>Measuring Geographic Distribution>Directional Distribution
  – Input Feature Class: China
  – Output: SDE_Illit_PerCent.shp
  – Weight Field: Data$.Illiterate_Prcnt

4. Use GeoDA to find the Centroids of the Provinces of China
(You need ArcInfo to do this in ArcGIS, which is expensive. GeoDA is free.)
--The GeoDA program is on my Web site at www.utdallas.edu/~briggs, or go to http://geodacenter.asu.edu/
--Download, unzip, and click the file OpenGeoDA.exe to start the software
--It does have some “bugs”, so some things may not work or it may crash!
--Input the provinces shapefile: File>Open Shape File China.shp
--Open the data table: Table>Promotion to see what is there
--Create centroids for each province: Options>Add Centroids to Table. Place a check mark in the X coordinates and Y coordinates boxes, then click OK
--Go to Table>Promotion to open the table: it has the X and Y centroid coordinates
--Save as a new shapefile: Table>Save to Shapefile as China_Centroids.shp
I then opened the China_Centroids.dbf file (part of the shapefile) with Excel and copied the centroid values into the ChinaProvinceData.xls spreadsheet.
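For readers curious what a polygon centroid calculation involves, here is a minimal Python sketch of the standard planar area-weighted centroid formula for a single ring. The lecture does not say which method GeoDA uses, so this is an illustration of the general technique and not a description of GeoDA. The function name and the rectangle used as input are assumptions, and multi-part provinces or polygons with holes are not handled.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as a list of
    (x, y) vertices in order (the ring does not need to be closed)."""
    n = len(vertices)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")
    twice_area = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        sum_x += (x0 + x1) * cross
        sum_y += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("polygon has zero area")
    # Centroid = sum / (6 * area), and area = twice_area / 2
    return sum_x / (3.0 * twice_area), sum_y / (3.0 * twice_area)


# Hypothetical 4 x 2 rectangle used only to check the formula.
rectangle = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(polygon_centroid(rectangle))
```

The formula works for either vertex order because the sign of the area cancels in the division.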