CrimeStat Version 3.3 Update Notes:
Part I: Fixes, Getis-Ord "G", Bayesian Journey-to-Crime

Ned Levine(1)
Ned Levine & Associates
Houston, TX

July 2010

This is Part I of the update notes for version 3.3. They provide information on some of the changes to CrimeStat III since the release of version 3.0 in March 2005. They incorporate the changes that were included in version 3.1, which was released in March 2007, and in version 3.2, which was released in June 2009 and re-released in September 2009. The notes proceed by, first, discussing changes that were made to the existing routines from version 3.0 and, second, discussing some new routines and organizational changes that were made since version 3.0 (in version 3.1 or version 3.2). For all existing routines, the chapters of the CrimeStat manual should be consulted. Part II of the update notes to version 3.3 discusses the new regression module.

(1) The author would like to thank Ms. Haiyan Teng and Mr. Pradeep Mohan for the programming and Dr. Dick Block of Loyola University for reading and editing the update notes. Mr. Ron Wilson of the National Institute of Justice deserves thanks for overseeing the project and Dr. Shashi Shekhar of the University of Minnesota is thanked for supervising some of the programming. Additional thanks should be given to Dr. David Wong for help with the Getis-Ord "G" and local Getis-Ord and to Dr. Wim Bernasco, Dr. Michael Leitner, Dr. Craig Bennell, Dr. Brent Snook, Dr. Paul Taylor, Dr. Josh Kent, and Ms. Patsy Lee for extensively testing the Bayesian Journey to Crime module.

Table of Contents

Known Problems with Version 3.2
    Accessing the Help Menu in Windows Vista
    Running CrimeStat with MapInfo Open
Fixes and Improvements to Existing Routines from Version 3.0
    Paths
    MapInfo Output
    Geometric Mean
        Uses
    Harmonic Mean
        Uses
    Linear Nearest Neighbor Index
    Risk-adjusted Nearest Neighbor Hierarchical Clustering
    Crime Travel Demand Module
    Crime Travel Demand Project Directory Utility
    Moran Correlogram
        Calculate for Individual Intervals
    Simulation of Confidence Intervals for Anselin's Local Moran
        Example of Simulated Confidence Intervals for Local Moran Statistic
        Simulated vs. Theoretical Confidence Intervals
        Potential Problem with Significance Tests of Local "I"
New Routines Added in Versions 3.1 through 3.3
    Spatial Autocorrelation Tab
    Getis-Ord "G" Statistic
        Testing the Significance of G
        Simulating Confidence Intervals for G
        Running the Getis-Ord "G"
        Search distance
        Output
        Getis-Ord simulation of confidence intervals
        Example 1: Testing Simulated Data with the Getis-Ord "G"
        Example 2: Testing Houston Burglaries with the Getis-Ord "G"
        Uses and Limitations of the Getis-Ord "G"
    Geary Correlogram
        Adjust for Small Distances
        Calculate for Individual Intervals
        Geary Correlogram Simulation of Confidence Intervals
        Output
        Graphing the "C" Values by Distance
        Example: Testing Houston Burglaries with the Geary Correlogram
        Uses of the Geary Correlogram
    Getis-Ord Correlogram
        Getis-Ord Simulation of Confidence Intervals
        Output
        Graphing the "G" Values by Distance
        Example: Testing Houston Burglaries with the Getis-Ord Correlogram
        Uses of the Getis-Ord Correlogram
    Getis-Ord Local "G"
        ID Field
        Search Distance
        Getis-Ord Local "G" Simulation of Confidence Intervals
        Output for Each Zone
        Example: Testing Houston Burglaries with the Getis-Ord Local "G"
        Uses of the Getis-Ord Local "G"
        Limitations of the Getis-Ord Local "G"
    Interpolation I and II Tabs
    Head Bang
        Rates and Volumes
        Decision Rules
        Example to Illustrate Decision Rules
        Setup
        Output
        Example 1: Using the Head Bang for Mapping Houston Burglaries
        Example 2: Using the Head Bang for Mapping Houston Burglary Rates
        Example 3: Using the Head Bang for Creating Burglary Rates
        Uses of the Head Bang Routine
        Limitations of the Head Bang Routine
    Interpolated Head Bang
        Method of Interpolation
        Choice of Bandwidth
        Output (areal) Units
        Calculate Densities or Probabilities
        Output
        Example: Using the Interpolated Head Bang to Visualize Houston Burglaries
        Advantages and Disadvantages of the Interpolated Head Bang
    Bayesian Journey to Crime Module
        Bayesian Probability
        Bayesian Inference
        Application of Bayesian Inference to Journey to Crime Analysis
        The Bayesian Journey to Crime Estimation Module
        Data Preparation for Bayesian Journey to Crime Estimation
        Logic of the Routine
        Bayesian Journey to Crime Diagnostics
        Data Input
        Methods Tested
        Interpolated Grid Output
        Which is the Most Accurate and Precise Journey to Crime Estimation Method?
        Measures of Accuracy and Precision
        Testing the Routine with Serial Offenders from Baltimore County
        Conclusion of the Evaluation
        Tests with Other Data Sets
        Estimate Likely Origin of a Serial Offender
        Data Input
        Selected Method
        Interpolated Grid Output
        Accumulator Matrix
        Two Examples of Using the Bayesian Journey to Crime Routine
        Potential to Add more Information to Improve the Methodology
        Probability Filters
    Summary
    References

Known Problems with Version 3.2a

There are several known problems with version 3.2a.

Accessing the Help Menu in Windows Vista

CrimeStat III works with the Windows Vista operating system. There are several problems that have been identified with Vista, however.

First, Vista does not recognize the help menu. If a user clicks on the help menu button in CrimeStat, there will be no response. However, Microsoft has developed a special file that allows help menus to be viewed in Vista.
It will be necessary for Vista users to obtain the file and install it according to the instructions provided by Microsoft. The URL is:

http://support.microsoft.com/kb/917607

Second, version 3.2 has problems running multiple Monte Carlo simulations in Vista. CrimeStat is a multi-threading program, which means that it will run separate calculations as unique 'threads'. In general, this capability works with Vista. However, if multiple Monte Carlo simulations are run, "irrecoverable error" messages are produced, with some of the results not being visible on the output screen. Since version 3.2a added a number of new Monte Carlo simulations (Getis-Ord "G", Geary Correlogram, Getis-Ord Correlogram, Anselin's Local Moran, and Getis-Ord Local "G"), there is a potential for this error to become more prominent. This is a Vista problem only and involves conflicts over access to the graphics device interface. The output will not be affected and the user can access the 'graph' button for those routines where it is available. We suggest that users run only one simulation at a time. This problem does not occur when the program is run in Windows XP. We have tested it in the Windows 7 Release Candidate and the problem appears to have been solved.

Running CrimeStat with MapInfo Open

The same 'dbf' or 'tab' file should not be opened simultaneously in MapInfo® and CrimeStat. This causes a file conflict error which may cause CrimeStat to crash. This is not a problem with ArcGIS®.

Fixes and Improvements to Version 3.0

The following fixes and improvements to version 3.0 have been made.

Paths

For any output file, the program now checks that a path which is defined actually exists.

MapInfo Output

The output format for MapInfo MIF/MID files has been updated. The user can access a variety of common projections and their parameters. MapInfo uses a file called MAPINFOW.PRJ, located in the MapInfo application folder, that lists many projections and their parameters. New projections can also be added to that file; users should consult the MapInfo Interchange documentation file. To use the projections in CrimeStat, copy the file (MAPINFOW.PRJ) to the same directory in which CrimeStat resides. When this is done, CrimeStat will allow the user to scroll down and select a particular projection that will then be saved in MIF/MID format for graphical output.

The user can also choose to define a custom projection by filling in the required parameter fields: name of projection (optional), projection number, datum number, units, origin longitude, origin latitude, scale factor, false easting, and false northing. We suggest that any custom projection be added to the MAPINFOW.PRJ file. Note that the first projection listed in the file ("--- Longitude / Latitude ---") has one too many zeros and won't be read. Use the second definition or remove one zero from that first line.

Geometric Mean

The Geometric Mean output in the "Mean center and standard distance" routine under Spatial Description now allows weighted values. It is defined as (Wikipedia, 2007a):

Geometric mean of X = GM(X) = \left[ \prod_{i=1}^{N} X_i^{W_i} \right]^{1/\sum_i W_i}     (Up. 1.1)

Geometric mean of Y = GM(Y) = \left[ \prod_{i=1}^{N} Y_i^{W_i} \right]^{1/\sum_i W_i}     (Up. 1.2)

where \prod is the product term over each point value, i (i.e., the values of X or Y are multiplied by each other), W_i is the weight used (default = 1), and N is the sample size (Everitt, 1995). The weights have to be defined on the Primary File page, either in the Weights field or in the Intensity field (but not both together).
The equation can be evaluated by logarithms:

Ln[GM(X)] = \frac{1}{\sum_i W_i} [W_1 Ln(X_1) + W_2 Ln(X_2) + \ldots + W_N Ln(X_N)] = \frac{\sum_i W_i Ln(X_i)}{\sum_i W_i}     (Up. 1.3)

Ln[GM(Y)] = \frac{1}{\sum_i W_i} [W_1 Ln(Y_1) + W_2 Ln(Y_2) + \ldots + W_N Ln(Y_N)] = \frac{\sum_i W_i Ln(Y_i)}{\sum_i W_i}     (Up. 1.4)

GM(X) = e^{Ln[GM(X)]}     (Up. 1.5)

GM(Y) = e^{Ln[GM(Y)]}     (Up. 1.6)

The geometric mean is the anti-log of the mean of the logarithms. If weights are used, then the logarithm of each X or Y value is weighted and the sum of the weighted logarithms is divided by the sum of the weights. If weights are not used, then the default weight is 1 and the sum of the weights will equal the sample size. The geometric mean is output as part of the Mcsd routine and has a 'Gm' prefix before the user-defined name.

Uses

The geometric mean is used when units are multiplied by each other (e.g., a stock's value increases by 10% one year, 15% the next, and 12% the next) (Wikipedia, 2007a). One can't just take the simple mean because there is a cumulative change in the units. In most cases, this is not relevant to point (incident) locations since the coordinates of each incident are independent and are not multiplied by each other. However, the geometric mean can be useful because it first converts all X and Y coordinates into logarithms and, thus, has the effect of discounting extreme values.

Harmonic Mean

Also, the Harmonic Mean output in the "Mean center and standard distance" routine under Spatial Description now allows weighted values. It is defined as (Wikipedia, 2007b):

Harmonic mean of X = HM(X) = \frac{\sum_i W_i}{\sum_i (W_i / X_i)}     (Up. 1.7)

Harmonic mean of Y = HM(Y) = \frac{\sum_i W_i}{\sum_i (W_i / Y_i)}     (Up. 1.8)

where W_i is the weight used (default = 1) and N is the sample size. The weights have to be defined on the Primary File page, either in the Weights field or in the Intensity field (but not both together).

The harmonic mean of X and Y is the inverse of the mean of the inverse of X and Y respectively (i.e., take the inverse; take the mean of the inverse; and invert the mean of the inverse). If weights are used, then each X or Y value is weighted by its inverse while the numerator is the sum of the weights. If weights are not used, then the default weight is 1 and the sum of weights will equal the sample size. The harmonic mean is output as part of the Mcsd routine and has an 'Hm' prefix before the user-defined name.

Uses

Typically, harmonic means are used in calculating the average of rates, or quantities whose values are changing over time (Wikipedia, 2007b). For example, in calculating the average speed over multiple segments of equal length (see Chapter 16 on Network Assignment), the harmonic mean should be used, not the arithmetic mean. If there are two adjacent road segments, each one mile in length, and a car travels over the first segment at 20 miles per hour (mph) but over the second segment at 40 mph, the average speed is not 30 mph (the arithmetic mean) but 26.7 mph (the harmonic mean). The car takes 3 minutes to travel the first segment (60 minutes per hour times 1 mile divided by 20 mph) and 1.5 minutes to travel the second segment (60 minutes per hour times 1 mile divided by 40 mph). Thus, the total time to travel the two miles is 4.5 minutes and the average speed is 26.7 mph. Again, for point (incident) locations, the harmonic mean would normally not be relevant since the coordinates of each of the incidents are independent. However, since the harmonic mean is weighted more heavily by the smaller values, it can be useful to discount cases which have outlying coordinates.
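To make the weighting concrete, the following is a minimal Python sketch (not CrimeStat code) of the weighted geometric and harmonic means as defined in equations Up. 1.1 through Up. 1.8; the function names are illustrative assumptions.

    import math

    def weighted_geometric_mean(values, weights=None):
        """Weighted geometric mean via logarithms (anti-log of the weighted mean of logs)."""
        weights = weights or [1.0] * len(values)
        log_sum = sum(w * math.log(x) for x, w in zip(values, weights))
        return math.exp(log_sum / sum(weights))

    def weighted_harmonic_mean(values, weights=None):
        """Weighted harmonic mean: sum of weights divided by the weighted sum of inverses."""
        weights = weights or [1.0] * len(values)
        return sum(weights) / sum(w / x for x, w in zip(values, weights))

    # The two-segment speed example from the text: 20 mph and 40 mph over equal lengths.
    print(weighted_harmonic_mean([20.0, 40.0]))         # about 26.7, not the arithmetic 30
    print(weighted_geometric_mean([1.10, 1.15, 1.12]))  # average multiplicative growth factor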
Linear Nearest Neighbor Index

The test statistic for the Linear Nearest Neighbor index on the Distance Analysis I page now gives the correct probability level.

Risk-adjusted Nearest Neighbor Hierarchical Clustering (Rnnh)

The checkbox for using the Intensity variable on the Primary File in calculating the baseline variable for the risk-adjusted nearest neighbor hierarchical clustering (Rnnh) has been brought from the risk parameters dialogue to the main interface. Some users had forgotten to check this box to utilize the intensity variable in the calculation.

Crime Travel Demand Module

Several fixes have been made to the Crime Travel Demand module routines:

1. In the "Make prediction" routine under the Trip Generation module of the Crime Travel Demand model, the output variable has been changed from "Prediction" to "ADJORIGINS" for the origin model and "ADJDEST" for the destination model.

2. In the "Calculate observed origin-destination trips" routine under the "Describe origin-destination trips" page of the Trip Distribution module of the Crime Travel Demand model, the output variable is now called "FREQ".

3. Under the "Setup origin-destination model" page of the Trip Distribution module of the Crime Travel Demand model, there is a new parameter defining the minimum number of trips per cell. Typically, in the gravity model, many cells will have small predicted values (e.g., 0.004). In order to concentrate the predicted values, the user can set a minimum level. If the predicted value is below this minimum, the routine automatically sets a zero (0) value, with the remaining predicted values being re-scaled so that the total number of predicted trips remains constant (see the sketch after this list). The default value is 0.05. This parameter should be used cautiously, however, as extreme concentration can occur by merely raising this value. Because the number of predicted trips remains constant, setting a minimum that is too high will have the effect of increasing all values greater than the minimum substantially. For example, in one run where the minimum was set at 5, the re-scaled minimum value for a line became 13.3.

4. For the Network Assignment routine, the prefix for the network load output is now VOL.

5. In defining a travel network, either on the Measurement Parameters page or on the Network Assignment page, if the network is defined as single directional, then the "From one way flag" and "To one way flag" options are blanked out.
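As a rough illustration of the thresholding and re-scaling described in item 3 above, the following Python sketch zeroes out small cell predictions and re-scales the remainder so the trip total is preserved. It is one reading of that description, not CrimeStat's actual implementation; the function name and example values are assumptions.

    def apply_minimum_trips(predicted, minimum=0.05):
        """Zero out cell predictions below a minimum, then re-scale the survivors
        so the total number of predicted trips is unchanged (an interpretation of
        the behavior described above; not CrimeStat's code)."""
        total = sum(predicted)
        kept = [p if p >= minimum else 0.0 for p in predicted]
        kept_total = sum(kept)
        if kept_total == 0:
            return kept
        scale = total / kept_total
        return [p * scale for p in kept]

    # Small illustration: cells below 0.05 are dropped and the rest are scaled up.
    cells = [0.004, 0.02, 0.30, 1.25, 3.10]
    print(apply_minimum_trips(cells))   # totals still sum to sum(cells) = 4.674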
Crime Travel Demand Project Directory Utility

The Crime Travel Demand module is a complex model that involves many different files. Because of this, we recommend that the separate steps in the model be stored in separate directories under a main project directory. While the user can save any file to any directory within the module, keeping the input and output files in separate directories can make it easier to identify files as well as to examine files that have already been used at some later time.

A new project directory utility tab under the Crime Travel Demand module allows the creation of a master directory for a project and four separate sub-directories under the master directory that correspond to the four modeling stages. The user puts in the name of a project in the dialogue box and points it to a particular drive and directory location (depending on the number of drives available to the user). For example, a project directory might be called "Robberies 2003" or "Bank robberies 2005". The utility then creates this directory if it does not already exist and creates four sub-directories underneath the project directory:

    Trip generation
    Trip distribution
    Mode split
    Network assignment

The user can then save the different output files into the appropriate directories. Further, for each sequential step in the crime travel demand model, the user can easily find the output file from the previous step, which would then become the input file for the next step.

Moran Correlogram

Calculate for Individual Intervals

Currently, the Moran Correlogram calculates a cumulative value for the interval from a distance of 0 up to the mid-point of the interval. If the option to calculate for individual intervals is checked, the "I" value will be calculated only for those pairs of points that are separated by a distance between the minimum and maximum distances of the interval (i.e., excluding distances that are shorter than the minimum value of the interval). This can be useful for checking the spatial autocorrelation for a specific interval or checking whether some distances don't have sufficient numbers of points (in which case the "I" value will be unreliable).

Simulation of Confidence Intervals for Anselin's Local Moran

In previous versions of CrimeStat, the Anselin's Local Moran routine had an option to calculate the variance and a standardized "I" score (essentially, a Z-test of the significance of the "I" value). One problem with this test is that "I" may not actually follow a normal standard error. That is, if "I" is calculated for all zones with random data, the distribution of the statistic may not be normally distributed. This would be especially true if the variable of interest, X, is a skewed variable with some zones having very high values while the majority have low values, as is typically true of crime distributions.

Consequently, the user can estimate the confidence intervals using a Monte Carlo simulation. In this case, a permutation-type simulation is run whereby the original values of the intensity variable, Z, are maintained but are randomly re-assigned for each simulation run. This will maintain the distribution of the variable Z but will estimate the value of "I" for each zone under random assignment of this variable. Note: a simulation may take time to run, especially if the data set is large or if a large number of simulation runs are requested.

If a permutation Monte Carlo simulation is run to estimate confidence intervals, specify the number of simulations to be run (e.g., 1,000, 5,000, 10,000). In addition to the above statistics, the output includes the results that were obtained by the simulation for:

1. The minimum "I" value
2. The maximum "I" value
3. The 0.5 percentile of "I"
4. The 2.5 percentile of "I"
5. The 97.5 percentile of "I"
6. The 99.5 percentile of "I"

The two pairs of percentiles (2.5 and 97.5; 0.5 and 99.5) create approximate 95% and 99% confidence intervals respectively. The minimum and maximum "I" values create an 'envelope' around each zone. It is important to run enough simulations to produce reliable estimates.

The tabular results can be printed, saved to a text file, or saved as a '.dbf' file with an LMoran<root name> prefix, with the root name being provided by the user. For the latter, specify a file name in the "Save result to" box in the dialogue. The 'dbf' file can then be linked to the input 'dbf' file by using the ID field as a matching variable. This would be done if the user wants to map the "I" variable, the Z-test, or those zones for which the "I" value is either higher than the 97.5 or 99.5 percentiles or lower than the 2.5 or 0.5 percentiles of the simulation results.
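The permutation logic described above is generic. The Python sketch below shows the idea under stated assumptions: the observed intensity values are kept but shuffled across zones on each run, the statistic is recomputed, and the 0.5, 2.5, 97.5, and 99.5 percentiles of the simulated values form the approximate 99% and 95% envelopes. The statistic function and variable names are placeholders, not CrimeStat's local Moran code.

    import random
    import statistics

    def permutation_envelope(values, stat_fn, n_sims=1000,
                             percentiles=(0.5, 2.5, 97.5, 99.5)):
        """Permutation simulation of the kind described above: re-assign the
        observed values to zones at random on each run and recompute the
        statistic. Returns the requested percentiles of the simulated values."""
        sims = []
        shuffled = list(values)
        for _ in range(n_sims):
            random.shuffle(shuffled)          # random re-assignment of Z to zones
            sims.append(stat_fn(shuffled))
        sims.sort()
        n = len(sims)
        return {p: sims[min(n - 1, int(round(p / 100.0 * (n - 1))))] for p in percentiles}

    # Usage: compare an observed statistic against the simulated envelope.
    # Here the 'statistic' is just the mean of the first three zones, for illustration.
    values = [0, 0, 1, 2, 2, 3, 5, 8, 13, 40]          # skewed, like crime counts
    observed = statistics.mean(values[:3])
    envelope = permutation_envelope(values, lambda v: statistics.mean(v[:3]))
    print(observed, envelope)   # an observed value outside the 2.5-97.5 range suggests significance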
Example of Simulated Confidence Intervals for Local Moran Statistic

To illustrate the simulated confidence intervals, we apply Anselin's Local Moran to an analysis of 2006 burglaries in the City of Houston. The data are 26,480 burglaries that have been aggregated to 1,179 traffic analysis zones (TAZ). These are, essentially, census blocks or aggregations of census blocks. Figure Up. 1.1 shows a map of burglaries in the City of Houston in 2006 by TAZ.

Figure Up. 1.1: (map of 2006 Houston burglaries by traffic analysis zone)

Anselin's Local Moran statistic was calculated for each of the 1,179 traffic analysis zones, with 1,000 Monte Carlo simulations being run. Figure Up. 1.2 shows a map of the calculated local "I" values. It can be seen that there are many more zones of negative spatial autocorrelation, where the zones are different from their neighbors. In most of these cases, the zone has no burglaries whereas it is surrounded by zones that have some burglaries. A few zones have positive spatial autocorrelation. In most of those cases, the zones have many burglaries and are surrounded by other zones with many burglaries.

Figure Up. 1.2: (map of the calculated local "I" values)

Confidence intervals were calculated in two ways. First, the theoretical variance was calculated and a Z-test computed. This is done in CrimeStat by checking the 'theoretical variance' box. The test assumes that "I" is normally distributed. Second, a Monte Carlo simulation was used to estimate the 99% confidence intervals (i.e., outside the 0.5 and 99.5 percentiles). Table Up. 1.1 shows the results for four records.

Table Up. 1.1:
Anselin's Local Moran 99% Confidence Intervals Estimated from
Theoretical Variance and from Monte Carlo Simulation

TAZ    X          Y           "I"         Expected "I"   0.5 %       99.5 %     Z-test    p
532    3193470    13953400     0.000036   -0.000008      -0.000886   0.000599     0.22    n.s.
530    3172640    13943300    -0.001033   -0.000009      -0.000558   0.000440    -7.20    0.001
1608   3089820    13887600     0.000953   -0.000019      -0.002210   0.000953     2.07    0.05
1622   3102450    13884000     0.001993   -0.000022      -0.002621   0.002984     3.11    0.01

The four records illustrate different combinations. In the first, the "I" value is 0.000036. Comparing it to the 99% confidence interval, it is between the 0.5 percentile and the 99.5 percentile. In other words, the simulation shows that it is not significant. The Z-test is 0.22, which is also not significant. Thus, both the simulated confidence intervals and the theoretical confidence interval indicate that the "I" for this zone is not significant. Keep in mind that crime data are rarely normally distributed and are usually very skewed. Therefore, the theoretical distribution should be used with caution. The best mapping solution may be to map only those zones that are highly significant with the theoretical Z-test (with probability values smaller than 0.01) or else map only those zones that are significant with the Monte Carlo simulation.

In the second record (TAZ 530), the "I" value is -0.001033. This is smaller than the 0.5 percentile value. Thus, the simulation indicates that it is lower than what would be expected 99% of the time; TAZ 530 has values that are dissimilar from its neighbors. Similarly, the theoretical Z-test gives a value smaller than the .001 probability. Thus, both the simulated confidence intervals and the theoretical confidence intervals indicate that the "I" for this zone has negative spatial autocorrelation, namely that its value is different from its neighbors.

The third record (TAZ 1608) shows a discrepancy between the simulated confidence intervals and the theoretical confidence intervals. In this case, the "I" value (0.000953) is equal to the 99.5 percentile while the theoretical Z-test is significant at only the .05 level. Finally, the fourth record (TAZ 1622) shows the opposite condition, where the theoretical confidence interval (Z-test) is significant while the simulated confidence interval is not.

Simulated vs. Theoretical Confidence Intervals

In general, the simulated confidence intervals will be similar to the theoretical ones most of the time. But there will be discrepancies. The reason is that the sampling distribution of "I" may not be (and probably isn't) normally distributed. In these 1,179 traffic analysis zones, 520 of the zones showed significant "I" values according to the simulated 99% confidence intervals (i.e., either equal to or smaller than the 0.5 percentile or equal to or greater than the 99.5 percentile) while 631 of the zones showed significant "I" values according to the theoretical Z-test at the 99% level (i.e., having a Z-value equal to or less than -2.58 or equal to or greater than 2.58). It would behoove the user to estimate the number of zones that are significant according to the simulated and theoretical confidence intervals before making a decision as to which criterion to use.

Potential Problem with Significance Tests of Local "I"

Also, one has to be suspect of a technique that finds significance in more than half the cases. It would probably be more conservative to use the 99% confidence intervals as a test for identifying zones that show positive or negative spatial autocorrelation rather than using the 95% confidence intervals or, better yet, to choose only those zones that have very negative or very positive "I" values. Unfortunately, this characteristic of the Anselin's Local Moran is also true of the local Getis-Ord routine (see below). The significance tests, whether simulated or theoretical, are not strict enough and, thereby, increase the likelihood of a Type I (false positive) error. A user must be careful in interpreting the "I" values for individual zones and would be better served choosing only the very highest or very lowest.

New Routines Added in Versions 3.1 through 3.2

New routines were added in versions 3.1 and 3.2. The second update chapter describes the regression routines that were added in version 3.3.

Spatial Autocorrelation Tab

Spatial autocorrelation tests have now been separated from the spatial distribution routines. This section now includes six tests for global spatial autocorrelation:

1. Moran's "I" statistic
2. Geary's "C" statistic
3. Getis-Ord "G" statistic (NEW)
4. Moran Correlogram
5. Geary Correlogram (NEW)
6. Getis-Ord Correlogram (NEW)

These indices would typically be applied to zonal data where an attribute value can be assigned to each zone. All six spatial autocorrelation indices require an intensity variable in the Primary File.

Getis-Ord "G" Statistic

The Getis-Ord "G" statistic is an index of global spatial autocorrelation for values that fall within a specified distance of each other (Getis and Ord, 1992).
When compared to an expected value of "G" under the assumption of no spatial association, it has the advantage over the other global spatial autocorrelation measures (Moran, Geary) in that it can distinguish between 'hot spots' and 'cold spots', which neither Moran's "I" nor Geary's "C" can do.

The "G" statistic calculates the spatial interaction of the value of a particular variable in a zone with the values of that same variable in nearby zones, similar to Moran's "I" and Geary's "C". Thus, it is also a measure of spatial association or interaction. Unlike the other two measures, it only identifies positive spatial autocorrelation, that is, where zones have values similar to their neighbors. It cannot detect negative spatial autocorrelation, where zones have values different from their neighbors. But, unlike the other two global measures, it can distinguish positive spatial autocorrelation in which zones with high values are near to other zones with high values (high positive spatial autocorrelation) from positive spatial autocorrelation which results from zones with low values being near to other zones also with low values (low positive spatial autocorrelation). Further, the "G" value is calculated with respect to a specified search distance (defined by the user) rather than to an inverse distance, as with Moran's "I" or Geary's "C".

The formulation of the general "G" statistic presented here is taken from Lee and Wong (2001). It is defined as:

G(d) = \frac{\sum_i \sum_j w_{ij}(d) X_i X_j}{\sum_i \sum_j X_i X_j}     (Up. 1.9)

for a variable, X. This formula indicates that the cross-product of the value of X at location "i" and at another zone "j" is weighted by a distance weight, w_{ij}(d), which is defined as 1 if the two zones are equal to or closer than a threshold distance, d, and 0 otherwise. The cross-product is summed over all other zones, j, and over all zones, i. Thus, the numerator is a sub-set of the denominator and G can vary between 0 and 1. If the distance selected is so small that no other zones are closer than this distance, then the weight will be 0 for all cross-products of variable X and the value of G(d) will be 0. Similarly, if the distance selected is so large that all other zones are closer than this distance, then the weight will be 1 for all cross-products of variable X and the value of G(d) will be 1.

There are actually two G statistics. The first one, G*, includes the interaction of a zone with itself; that is, zone "i" and zone "j" can be the same zone. The second one, G, does not include the interaction of a zone with itself. In CrimeStat, we only include the G statistic (i.e., there is no interaction of a zone with itself) because, first, the two measures produce almost identical results and, second, the interpretation of G is more straightforward than with G*. Essentially, with G, the statistic measures the interaction of a zone with nearby zones (a 'neighborhood'). See the articles by Getis and Ord (1992) and by Khan, Qin and Noyce (2006) for a discussion of the use of G*.

Testing the Significance of G

By itself, the G statistic is not very meaningful. Since it can vary between 0 and 1, the statistic will always approach 1.0 as the threshold distance increases. Consequently, G is compared to an expected value of G under no significant spatial association. The expected G for a threshold distance, d, is defined as:

E[G(d)] = \frac{W}{N(N-1)}     (Up. 1.10)

where W is the sum of weights for all pairs and N is the number of cases.
The sum of the weights is based on symmetrical distances for each zone "i". That is, if zone 1 is within the threshold distance of zone 2, then zone 1 has a weight of 1 with zone 2. In counting the total number of weights for zone 1, the weight of zone 2 is counted. Similarly, zone 2 has a weight of 1 with zone 1. So, in counting the total number of weights for zone 2, the weight of zone 1 is counted, too. In other words, if two zones are within the threshold (search) distance, then together they contribute 2 to the total weight.

Note that, since the expected value of G is a function of the sample size and the sum of weights which, in turn, is a function of the search distance, it will be the same for all variables of a single data set in which the same search distance is specified. However, as the search distance changes, so will the expected G.

Theoretically, the G statistic is assumed to have a normally distributed standard error. If this is the case (and we often don't know if it is), then the standard error of G can be calculated and a simple significance test based on the normal distribution can be constructed. The variance of G(d) is defined as:

Var[G(d)] = E(G^2) - E(G)^2     (Up. 1.11)

E(G^2) = \frac{1}{(m_1^2 - m_2)^2 \, n^{(4)}} [B_0 m_2^2 + B_1 m_4 + B_2 m_1^2 m_2 + B_3 m_1 m_3 + B_4 m_1^4]     (Up. 1.12)

where

m_1 = \sum_i X_i     (Up. 1.13)
m_2 = \sum_i X_i^2     (Up. 1.14)
m_3 = \sum_i X_i^3     (Up. 1.15)
m_4 = \sum_i X_i^4     (Up. 1.16)
n^{(4)} = n(n-1)(n-2)(n-3)     (Up. 1.17)
S_1 = 0.5 \sum_i \sum_j (w_{ij} + w_{ji})^2     (Up. 1.18)
S_2 = \sum_i (\sum_j w_{ij} + \sum_j w_{ji})^2     (Up. 1.19)
B_0 = (n^2 - 3n + 3) S_1 - n S_2 + 3 W^2     (Up. 1.20)
B_1 = -[(n^2 - n) S_1 - 2n S_2 + 3 W^2]     (Up. 1.21)
B_2 = -[2n S_1 - (n+3) S_2 + 6 W^2]     (Up. 1.22)
B_3 = 4(n-1) S_1 - 2(n+1) S_2 + 8 W^2     (Up. 1.23)
B_4 = S_1 - S_2 + W^2     (Up. 1.24)

and where i is the zone being calculated, j indexes all other zones, and n is the number of cases (Lee and Wong, 2001). Note that this formula is different from that written in other sources (e.g., see Lees, 2006) but is consistent with the formulation by Getis and Ord (1992).

The standard error of G(d) is the square root of the variance of G. Consequently, a Z-test can be constructed by:

S.E.[G(d)] = \sqrt{Var[G(d)]}     (Up. 1.25)

Z[G(d)] = \frac{G(d) - E[G(d)]}{S.E.[G(d)]}     (Up. 1.26)

Relative to the expected value of G, a positive Z-value indicates spatial clustering of high values (high positive spatial autocorrelation or 'hot spots') while a negative Z-value indicates spatial clustering of low values (low positive spatial autocorrelation or 'cold spots'). A Z-value around 0 typically indicates either no positive spatial autocorrelation, negative spatial autocorrelation (which the Getis-Ord cannot detect), or that the number of 'hot spots' more or less balances the number of 'cold spots'. Note that the value of this test will vary with the search distance selected. One search distance may yield a significant spatial association for G whereas another may not. Thus, the statistic is useful for identifying distances at which spatial autocorrelation exists (see the Getis-Ord Correlogram below). Also, and this is an important point, the expected value of G as calculated in equation Up. 1.10 is only meaningful if the variable is positive. For variables with negative values, such as residual errors from a regression model, one cannot use equation Up. 1.10 but, instead, must use a simulation to estimate confidence intervals.
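As an illustration of equations Up. 1.9 and Up. 1.10, the following Python sketch computes G(d) and its expected value with binary distance weights and a brute-force double loop. It is not CrimeStat's implementation; the function name and loop structure are assumptions, and inference would then use either the theoretical variance above or, preferably, a permutation simulation.

    import math

    def getis_ord_g(xs, ys, values, d):
        """Global Getis-Ord G(d) with binary distance weights (a sketch of equations
        Up. 1.9 and Up. 1.10, not CrimeStat's code). Returns G(d), E[G(d)], and the
        sum of weights W. Self-pairs (i == j) are excluded, as in the G (not G*) form."""
        n = len(values)
        num = den = weight_sum = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                cross = values[i] * values[j]
                den += cross
                if math.hypot(xs[i] - xs[j], ys[i] - ys[j]) <= d:
                    num += cross
                    weight_sum += 1.0      # symmetric pairs contribute 2 in total
        g = num / den
        expected_g = weight_sum / (n * (n - 1))
        return g, expected_g, weight_sum

    # Usage note: a G(d) well above E[G(d)] suggests clustering of high values ('hot
    # spots'); well below suggests clustering of low values ('cold spots').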
That is, if G was calculated for a specific distance, d, with random data, the distribution of the statistic may not be normally distributed. This would be especially true if the variable of interest, X, is a skewed variable with some zones having very high values while the majority of zones having low values. Consequently, the user has an alternative for estimating the confidence intervals using a Monte Carlo simulation. In this case, a permutation type simulation is run whereby the original values of the intensity variable, Z, are maintained but are randomly re-assigned for each simulation run. This will maintain the distribution of the variable Z but will estimate the value of G under random assignment of this variable. The user can take the usual 95% or 99% confidence intervals based on the simulation. Keep in mind that a simulation may take time to run especially if the data set is large or if a large number of simulation runs are requested. Running the Getis-Ord “G” Routine The Getis-Ord global “G” routine is found on the new Spatial Autocorrelation tab under the main Spatial Description heading. The variable that will be used in the calculation is the intensity variable which is defined on the Primary File page. By choosing different intensity variables, the user can estimate G for different variables (e.g., number of assaults, robbery rate). Search distance The user must specify a search distance for the test and indicate the distance units (miles, nautical miles, feet, kilometers, or meters,). Output The Getis-Ord “G” routine calculates 7 statistics: 1. 2. The sample size Getis-Ord “G” 17 3. 4. 5. 6. 7. 8. The spatially random (expected) "G" The difference between “G” and the expected “G” The standard error of "G" A Z-test of "G" under the assumption of normality (Z-test) The one-tail probability level The two-tail probability level Getis-Ord Simulation of Confidence Intervals If a permutation Monte Carlo simulation is run to estimate confidence intervals, specify the number of simulations to be run (e.g., 100, 1000, 10000). In addition to the above statistics, the output of includes the results that were obtained by the simulation for: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. The minimum “G” value The maximum “G” value The 0.5 percentile of “G” The 2.5 percentile of “G” The 5 percentile of “G” The 10 percentile of “G” The 90 percentile of “G” The 95 percentile of “G” The 97.5 percentile of “G” The 99.5 percentile of “G” The four pairs of percentiles (10 and 90; 5 and 95; 2.5 and 97.5; 0.5 and 99.5) create approximate 80%, 90%, 95% and 99% confidence intervals respectively. The tabular results can be printed, saved to a text file or saved as a '.dbf' file. For the latter, specify a file name in the “Save result to” in the dialogue box. Example 1: Testing Simulated Data with the Getis-Ord “G” To understand how the Getis-Ord “G” works and how it compares to the other two global spatial autocorrelation measures - Moran’s “I” and Geary’s “C”, three simulated data sets were created. In the first, a random pattern was created (Figure Up. 1.3). In the second, a data set of extreme positive spatial autocorrelated was created (Figure Up. 1.4) and, in the third, a data set of extreme negative spatial autocorrelation was created (Figure Up. 1.5); the latter is essentially a checkerboard pattern. Table Up. 1.2 compares the three global spatial autocorrelation statistics on the three distributions. For the Getis-Ord “G”, both the actual “G” and the expected “G” are shown. 
A one mile search distance was used for the Getis-Ord “G”. The random pattern is not significant with all three measures. That is, neither Moran’s “I”, Geary”C”, nor the Getis-Ord “G” are significantly different than the expected values under a random distribution. This is what would be expected since the data were assigned randomly. 18 Figure Up. 1.3: Figure Up. 1.4: Figure Up. 1.5: Table Up. 1.2: Global Spatial Autocorrelation Statistics for Simulated Data Sets N = 100 Grid Cells Pattern Moran’s “I” Geary’s “C” Getis-Ord “G” Expected “G” (1 mile search) (1 mile search) Random -0.007162 n.s. 0.965278 n.s 0.151059 n.s 0.159596 Positive spatial autocorrelation 0.292008*** 0.700912*** 0.241015*** 0.159596 1.049471* 0.140803n.s. 0.159596 Negative spatial autocorrelation -0.060071*** _____________________ n.s * ** *** not significant p#.05 p#.01 p#.001 For the extreme positive spatial autocorrelation pattern, on the other hand, all three measures show highly significant differences with a random simulation. Moran’s “I” is highly positive. Geary’s “C is below 1.0, indicating positive spatial autocorrelation and the Getis-Ord “G” has a “G” value that is significantly higher than the expected “G” according to the Z-test based on the theoretical standard error. The Getis-Ord “G”, therefore, indicates that the type of spatial autocorrelation is high positive. Finally, the extreme negative spatial autocorrelation pattern (Figure Up. 1.5 above) shows different results for the three measures. Moran’s “I” shows negative spatial autocorrelation and is highly significant. Geary’s “C also shows negative spatial autocorrelation but it is significant only at the .05 level. Finally, the GetisOrd “G” is not significant which is not surprising since the statistic cannot detect negative spatial autocorrelation. The “G” is slightly smaller than the expected “G”, which indicates low positive spatial autocorrelation, but it is not significant. In other words, all three statistics can identify positive spatial correlation. Of these, Moran’s “I” is a more powerful (sensitive) test than either Geary’s “C” or the Getis-Ord “G”. For the negative spatial autocorrelation pattern, only Moran’s “I” and Geary’s “C” is able to detect it and, again, Moran’s “I” is more powerful than Geary’s “C”. On the other hand, only the Getis-Ord “G” can distinguish between high positive and low positive spatial autocorrelation. The Moran and Geary tests would show these conditions to be identical, as the example below shows. Example 2: Testing Houston Burglaries with the Getis-Ord “G” Now, let’s take a real data set, the 26,480 burglaries in the City of Houston in 2006 aggregated to 1,179 traffic analysis zones (Figure Up. 1.1 above). To compare the Getis-Ord “G” statistic with the Moran’s “I” and Geary’s “C”, the three spatial autocorrelation tests were run on this data set. The GetisOrd “G” was tested with a search distance of 2 miles and 1000 simulation runs were made on the “G”. Table Up. 1.3 shows the three global spatial autocorrelation statistics for these data. 22 Table Up. 1.3: Global Spatial Autocorrelation Statistics for City of Houston Burglaries: 2001 N = 1,179 Traffic Analysis Zones Moran’s “I” Geary’s “C” Getis-Ord “G” (2 mile search) Observed 0.25179 0.397080 0.028816 Expected -.000849 1.000000 0.107760 Observed -Expected 0.25265 -0.60292 -0.07894 Standard Error 0.002796 0.035138 0.010355 Z-test 90.35 -17.158851 -7.623948 p-value *** *** *** Based on simulation: 2.5 percentile n.a. n.a. 0.088162 97.5 percentile n.a. 
_____________________ n.a. 0.129304 n.s * ** *** not significant p#.05 p#.01 p#.001 The Moran and Geary tests show that the Houston burglaries have significant positive spatial autocorrelation (zones have values that are similar to their neighbors). Moran’s “I” is significantly higher than the expected “I”. Geary’s “C” is significantly lower than the expected “C” (1.0), which means positive spatial autocorrelation. However, the Getis-Ord “G” is lower than the expected “G value and is significant whether using the theoretical Z-test or the simulated confidence intervals (notice how the “G” is lower than the 2.5 percentile). This indicates that, in general, zones having low values are nearby other zones with low values. In other words, there is low positive spatial autocorrelation, suggesting a number of ‘cold spots’. Note also that the expected “G” is between the 2.5 percentile and 97.5 percentile of the simulated confidence intervals. Uses and Limitations of the Getis-Ord “G” The advantage of the “G” statistic over the other two spatial autocorrelation measures is that it can definitely indicate ‘hot spots’ or ‘cold spots’. The Moran and Geary measures cannot determine whether the positive spatial autocorrelation is ‘high positive’ or ‘low positive’. With Moran’s “I” or Geary’s “C”, an indicator of positive spatial autocorrelation means that zones have values that are similar to their neighbors. However, the positive spatial autocorrelation could be caused by many zones with low values being concentrated, too. In other words, one cannot tell from those two indices whether the concentration is a hot spot or a cold spot. The Getis-Ord “G” can do this. 23 The main limitation of the Getis-Ord “G” is that it cannot detect negative spatial autocorrelation, a condition that, while rare, does occur. With the checkerboard pattern above (Figure Up. 1.5), this test could not detect that there was negative spatial autocorrelation. For this condition (which is rare), Moran’s “I” or Geary’s “C” would be more appropriate tests. Geary Correlogram The Geary “C” statistic is already part of CrimeStat (see chapter 4 in the manual). This statistic typically varies between 0 and 2 with a value of 1 indicating no spatial autocorrelation. Values less than 1 indicate positive spatial autocorrelation (zones have similar values to their neighbors) while values greater than 1 indicate negative spatial autocorrelation (zones have different values from their neighbors). The Geary Correlogram requires an intensity variable in the primary file and calculates the Geary “C” index for different distance intervals/bins. The user can select any number of distance intervals. The default is 10 distance intervals. The size of each interval is determined by the maximum distance between zones and the number of intervals selected. Adjust for Small Distances If checked, small distances are adjusted so that the maximum weighting is 1 (see documentation for details.) This ensures that the “C” values for individual distances won't become excessively large or excessively small for points that are close together. The default value is no adjustment. Calculate for Individual Intervals The Geary Correlogram normally calculates a cumulative value for the interval/bin from a distance of 0 up to the mid-point of the interval/bin. If this option is checked, the “C” value will be calculated only for those pairs of points that are separated by a distance between the minimum and maximum distances of the interval. 
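To illustrate the cumulative versus individual-interval choice, here is a minimal Python sketch that computes a standard Geary's "C" with binary weights for one distance bin. It is an illustrative assumption of how a bin can be formed, not CrimeStat's code, and it omits the small-distance adjustment described above.

    import math

    def geary_c_for_bin(xs, ys, values, lower, upper, cumulative=True):
        """Geary's "C" with binary weights for one distance bin (a sketch of the
        binning choice described above, not CrimeStat's code). With cumulative=True,
        all pairs from 0 up to 'upper' are used; otherwise only pairs whose
        separation falls between 'lower' and 'upper'."""
        n = len(values)
        mean = sum(values) / n
        ss = sum((v - mean) ** 2 for v in values)
        num = 0.0
        w_total = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.hypot(xs[i] - xs[j], ys[i] - ys[j])
                in_bin = d <= upper if cumulative else lower < d <= upper
                if in_bin:
                    num += (values[i] - values[j]) ** 2
                    w_total += 1.0
        if w_total == 0 or ss == 0:
            return None                  # too few pairs in this bin; "C" is unreliable
        return ((n - 1) * num) / (2.0 * w_total * ss)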
Geary Correlogram Simulation of Confidence Intervals

Since the Geary "C" statistic may not be normally distributed, the significance test is frequently inaccurate. Instead, a permutation-type Monte Carlo simulation is run whereby the original values of the variable, Z, are maintained but are randomly re-assigned for each simulation run. This will maintain the distribution of the variable Z but will estimate the value of "C" under random assignment of this variable. Specify the number of simulations to be run (e.g., 1000, 5000, 10000). Note, a simulation may take time to run especially if the data set is large or if a large number of simulation runs are requested.

Output

The output includes:

1. The sample size
2. The maximum distance
3. The bin (interval) number
4. The midpoint of the distance bin
5. The "C" value for the distance bin

and if a simulation is run:

6. The minimum "C" value for the distance bin
7. The maximum "C" value for the distance bin
8. The 0.5 percentile of "C" for the distance bin
9. The 2.5 percentile of "C" for the distance bin
10. The 97.5 percentile of "C" for the distance bin
11. The 99.5 percentile of "C" for the distance bin

The two pairs of percentiles (2.5 and 97.5; 0.5 and 99.5) create approximate 95% and 99% confidence intervals respectively. The minimum and maximum simulated "C" values create an 'envelope' for each interval. However, unless a large number of simulations are run, the actual "C" value may fall outside the envelope for any one interval. The tabular results can be printed, saved to a text file or saved as a '.dbf' file with a GearyCorr<root name> prefix with the root name being provided by the user. For the latter, specify a file name in the "Save result to" box in the dialogue.

Graphing the "C" Values by Distance

A graph can be shown with the "C" value on the Y-axis by the distance bin on the X-axis. Click on the "Graph" button. If a simulation is run, the 2.5 and 97.5 percentiles of the simulated "C" values are also shown on the graph. The graph displays the reduction in spatial autocorrelation with distance. The graph is useful for selecting the type of kernel in the single- and dual-kernel interpolation routines when the primary variable is weighted (see Chapter 8 on Kernel Density Interpolation). For a presentation quality graph, however, the output file should be brought into Excel or another graphics program in order to display the change in "C" values and label the axes properly.

Example: Testing Houston Burglaries with the Geary Correlogram

Using the same data set on the Houston burglaries as above, the Geary Correlogram was run with 100 intervals (bins). The routine was also run with 1000 simulations to estimate confidence intervals around the "C" value. Figure Up. 1.6 illustrates the distance decay of "C" as a function of distance along with the simulated 95% confidence interval. The theoretical "C" under random conditions is also shown. As seen, the "C" values are below 1.0 for all distances tested and are also below the 2.5 percentile simulated "C" for all intervals. Thus, it is clear that there is substantial positive spatial autocorrelation with the Houston burglaries. Looking at the 'distance decay', the "C" values decrease with distance, indicating that they approach a random distribution.
However, they level off at around 15 miles at the global "C" value (0.397). In other words, spatial autocorrelation is substantial in these data up to a separation of about 15 miles, whereupon the general positive spatial autocorrelation holds. This distance can be used to set limits for search distances in other routines (e.g., kernel density interpolation).

Figure Up. 1.6: ("C" of Houston burglaries by distance, with the theoretical random "C" and the 2.5 and 97.5 percentiles of the simulated "C")

Uses of the Geary Correlogram

Similar to the Moran Correlogram and the Getis-Ord Correlogram (see below), the Geary Correlogram is useful in order to determine the degree of spatial autocorrelation and how far away from each zone it typically lasts. Since it is an average over all zones, it is a general indicator of the spread of the spatial autocorrelation. This can be useful for defining limits to search distances in other routines, such as the single kernel density interpolation routine, where a fixed bandwidth would be defined to capture the majority of the spatial autocorrelation.

Getis-Ord Correlogram

The Getis-Ord Correlogram requires an intensity variable in the primary file and calculates the Getis-Ord "G" index for different distance intervals/bins. The user can select any number of distance intervals. The default is 10 distance intervals. The size of each interval is determined by the maximum distance between zones and the number of intervals selected.

Getis-Ord Correlogram Simulation of Confidence Intervals

Since the Getis-Ord "G" statistic may not be normally distributed, the significance test is frequently inaccurate. Instead, a permutation-type Monte Carlo simulation is run whereby the original values of the intensity variable, Z, are maintained but are randomly re-assigned for each simulation run. This will maintain the distribution of the variable Z but will estimate the value of G under random assignment of this variable. Specify the number of simulations to be run (e.g., 100, 1000, 10000). Note, a simulation may take time to run, especially if the data set is large or if a large number of simulation runs are requested.

Output

The output includes:

1. The sample size
2. The maximum distance
3. The bin (interval) number
4. The midpoint of the distance bin
5. The "G" value for the distance bin

and if a simulation is run, the simulated results under the assumption of random re-assignment for:

6. The minimum "G" value
7. The maximum "G" value
8. The 0.5 percentile of "G"
9. The 2.5 percentile of "G"
10. The 97.5 percentile of "G"
11. The 99.5 percentile of "G"

The two pairs of percentiles (2.5 and 97.5; 0.5 and 99.5) create approximate 95% and 99% confidence intervals respectively. The minimum and maximum "G" values create an 'envelope' for each interval. However, unless a large number of simulations are run, the actual "G" value for any interval may fall outside the envelope. The tabular results can be printed, saved to a text file, or saved as a '.dbf' file with a Getis-OrdCorr<root name> prefix, with the root name being provided by the user. For the latter, specify a file name in the "Save result to" box in the dialogue.

Graphing the "G" Values by Distance

A graph can be shown that displays the "G" and expected "G" values on the Y-axis by the distance bin on the X-axis. Click on the "Graph" button.
If a simulation is run, the 2.5 and 97.5 percentile "G" values are also shown on the graph along with the "G"; the expected "G" is not shown in this case. The graph displays the reduction in spatial autocorrelation with distance. Note that the "G" and expected "G" approach 1.0 as the search distance increases, that is, as the pairs included within the search distance approximate the number of pairs in the entire data set. Graphing the Getis-Ord correlogram is useful for selecting the type of kernel in the single- and dual-kernel interpolation routines when the primary variable is weighted with an intensity value (see Chapter 8 on Kernel Density Interpolation). For a presentation-quality graph, however, the output file should be brought into Excel or another graphics program in order to display the change in "G" values and label the axes properly.

Example: Testing Houston Burglaries with the Getis-Ord Correlogram

Using the same data set on the Houston burglaries as above, the Getis-Ord Correlogram was run. The routine was run with 100 intervals and 1,000 Monte Carlo simulations in order to simulate 95% confidence intervals around the "G" value. The output was then brought into Excel to produce a graph. Figure Up. 1.7 illustrates the distance decay of the "G", the expected "G", and the 2.5 and 97.5 percentile "G" values from the simulation. As can be seen, the "G" value increases with distance, from close to 0 to close to 1 at the largest distance, around 44 miles. The expected "G" is higher than the "G" up to a distance of 20 miles, indicating that there is consistent low positive spatial autocorrelation in the data set. Since the Getis-Ord can distinguish a hot spot from a cold spot, the deficit of "G" relative to the expected "G" indicates that there is a concentration of zones all with smaller numbers of burglaries. This means that, overall, there are more 'cold spots' than 'hot spots'. Notice how the expected "G" falls between the 2.5 and 97.5 percentiles, the approximate 95% confidence interval. In other words, for zones that are separated by as much as 20 miles, zones with low burglary numbers have similar values, mostly low ones.

Figure Up. 1.7: ("G" of Houston burglaries by distance, with the theoretical random "G" and the 2.5 and 97.5 percentiles of the simulated "G")

Uses of the Getis-Ord Correlogram

Similar to the Moran Correlogram and the Geary Correlogram, the Getis-Ord Correlogram is useful in order to determine the degree of spatial autocorrelation and how far away from each zone it typically lasts. Since it is an average over all zones, it is a general indicator of the spread of the spatial autocorrelation. This can be useful for defining limits to search distances in other routines, such as the single kernel density interpolation routine or the MCMC spatial regression module (to be released in version 4.0). Unlike the other two correlograms, however, it can distinguish hot spots from cold spots. In the example above, there are more cold spots than hot spots since the "G" is smaller than the expected "G" for most of the distances.
The biggest limitation of the Getis-Ord Correlogram is that it cannot detect negative spatial autocorrelation, where zones have values different from their neighbors. For that condition, which is rare, the other two correlograms should be used.

Getis-Ord Local "G"

The Getis-Ord Local "G" statistic applies the Getis-Ord "G" statistic to individual zones to assess whether particular points are spatially related to the nearby points. Unlike the global Getis-Ord "G", the Getis-Ord Local "G" is applied to each individual zone. The formulation presented here is taken from Lee and Wong (2001). The "G" value is calculated with respect to a specified search distance (defined by the user), namely:

G(d)_i = \frac{\sum_j w_j(d) X_j}{\sum_j X_j}     (Up. 1.27)

E[G(d)_i] = \frac{W_i}{N-1}     (Up. 1.28)

Var[G(d)_i] = \frac{W_i (n-1-W_i) \sum_j X_j^2}{(\sum_j X_j)^2 (n-1)(n-2)} + \frac{W_i (W_i - 1)}{(n-1)(n-2)} - E[G(d)_i]^2     (Up. 1.29)

where w_j is the weight of zone "j" from zone "i", W_i is the sum of weights for zone "i", and n is the number of cases. The standard error of G(d)_i is the square root of the variance of G(d)_i. Consequently, a Z-test can be constructed by:

S.E.[G(d)_i] = \sqrt{Var[G(d)_i]}     (Up. 1.30)

Z[G(d)_i] = \frac{G(d)_i - E[G(d)_i]}{S.E.[G(d)_i]}     (Up. 1.31)

A good example of using the Getis-Ord Local "G" statistic in crime mapping is found in Chainey and Ratcliffe (2005, pp. 164-172).
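A minimal Python sketch of equations Up. 1.27 through Up. 1.31 for a single zone is shown below, assuming binary distance weights and the variance as reconstructed above. It is illustrative only and is not CrimeStat's implementation; the function name is an assumption.

    import math

    def local_getis_ord(xs, ys, values, i, d):
        """Local Getis-Ord G for zone i with binary weights within search distance d
        (a sketch of equations Up. 1.27-1.31, not CrimeStat's code). The focal zone
        is excluded from its own neighborhood, as in the G (not G*) form."""
        n = len(values)
        others = [j for j in range(n) if j != i]
        w = [1.0 if math.hypot(xs[i] - xs[j], ys[i] - ys[j]) <= d else 0.0 for j in others]
        xj = [values[j] for j in others]
        w_i = sum(w)
        sum_x = sum(xj)
        sum_x2 = sum(v * v for v in xj)
        g_i = sum(wj * v for wj, v in zip(w, xj)) / sum_x
        e_g = w_i / (n - 1)
        var_g = (w_i * (n - 1 - w_i) * sum_x2) / (sum_x ** 2 * (n - 1) * (n - 2)) \
                + (w_i * (w_i - 1)) / ((n - 1) * (n - 2)) - e_g ** 2
        z = (g_i - e_g) / math.sqrt(var_g) if var_g > 0 else 0.0
        return g_i, e_g, z

    # A large positive Z suggests a 'hot spot' around zone i and a large negative Z a
    # 'cold spot'; the permutation simulation described in the subsections below is
    # the more robust alternative for skewed data.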
The file is saved with a LGetis-Ord<root name> prefix, with the root name being provided by the user. The 'dbf' output file can be linked to the Primary File by using the ID field as a matching variable. This would be done if the user wants to map the "G" variable, the expected "G", the Z-test, or those zones for which the "G" value is either higher than the 97.5 or 99.5 percentiles or lower than the 2.5 or 0.5 percentiles of the simulation results respectively (95% or 99% confidence intervals).

Example: Testing Houston Burglaries with the Getis-Ord Local "G"

Using the same data set on the Houston burglaries as above, the Getis-Ord Local "G" was run with a search radius of 2 miles and with 1,000 simulations being run to produce 95% confidence intervals around the "G" value. The output file was then linked to the input file using the ID field to allow the mapping of the local "G" values. Figure Up. 1.8 illustrates the local Getis-Ord "G" for different zones. The map displays the difference between the "G" and the expected "G" ("G" minus expected "G") with the Z-test being applied to the difference. Zones with a Z-test of +1.96 or higher are shown in red (hot spots). Zones with Z-tests of -1.96 or smaller are shown in blue (cold spots), while zones with a Z-test between -1.96 and +1.96 are shown in yellow (no pattern). As seen, there are some very distinct patterns of zones with high positive spatial autocorrelation and low positive spatial autocorrelation. Examining the original map of burglaries by TAZ (Figure Up. 1.1 above), it can be seen that where there are a lot of burglaries, the zones show high positive spatial autocorrelation in Figure Up. 1.8. Conversely, where there are few burglaries, the zones show low positive spatial autocorrelation in Figure Up. 1.8.

Uses of the Getis-Ord Local "G"

The Getis-Ord Local "G" is very good at identifying hot spots and also good at identifying cold spots. As mentioned, Anselin's Local Moran can only identify positive or negative spatial autocorrelation. Those zones with positive spatial autocorrelation could occur because zones with high values are nearby other zones with high values or because zones with low values are nearby other zones with low values. The Getis-Ord Local "G" can distinguish those two types.

Limitations of the Getis-Ord Local "G"

The biggest limitation with the Getis-Ord Local "G", which applies to all the Getis-Ord routines, is that it cannot detect negative spatial autocorrelation, where a zone is surrounded by neighbors that are different (either having a high value and being surrounded by zones with low values or having a low value and being surrounded by zones with high values). In actual use, both Anselin's Local Moran and the Getis-Ord Local "G" should be used to produce a full interpretation of the results. Another limitation is that the significance tests are too weak, allowing too many zones to show significance. In the data shown in Figure Up. 1.8, more than half the zones (727) were statistically significant, either by the Z-test or by the simulated 99% confidence intervals. Thus, there is a substantial Type I error with this statistic (false positives), a similarity it shares with Anselin's Local Moran. A user should be careful in interpreting zones with significant "G" values and would probably be better served by choosing only those zones with the highest or lowest "G" values.

Figure Up. 1.8: Getis-Ord Local "G" of Houston burglaries
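As a companion to the description above, here is a minimal sketch (in Python, not the CrimeStat implementation) of the local "G" calculation with a permutation-based envelope. The binary distance weights, the treatment of the denominator as the sum over all zones, and the names are assumptions made for illustration.

import numpy as np

def local_g(coords, x, d):
    """Local G(d)_i = sum_j w_ij(d) * x_j / sum_j x_j with binary weights;
    the diagonal is zeroed so a zone is not counted as its own neighbor."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = (dist <= d).astype(float)
    np.fill_diagonal(w, 0.0)
    return (w @ x) / x.sum()

def local_g_envelope(coords, x, d, n_sims=1000, seed=0):
    """2.5 and 97.5 percentile envelope for each zone's G(d)_i, obtained by
    randomly re-assigning the intensity values to zones (permutation test)."""
    rng = np.random.default_rng(seed)
    sims = np.empty((n_sims, len(x)))
    for s in range(n_sims):
        sims[s] = local_g(coords, rng.permutation(x), d)
    lo, hi = np.percentile(sims, [2.5, 97.5], axis=0)
    return lo, hi

# Example use (assumed inputs): flag zones whose observed G falls outside
# the simulated envelope for a 2-mile search radius.
# g_obs = local_g(coords, x, d=2.0)
# lo, hi = local_g_envelope(coords, x, d=2.0)
# hot = g_obs > hi
# cold = g_obs < lo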
Interpolation I and Interpolation II Tabs

The Interpolation tab, under Spatial Modeling, has now been separated into two tabs: Interpolation I and Interpolation II. The Interpolation I tab includes the single and dual kernel density routines that have been part of CrimeStat since version 1.0. The Interpolation II tab includes the Head Bang routine and the Interpolated Head Bang routine.

Head Bang

The Head Bang statistic is a weighted two-dimensional smoothing algorithm that is applied to zonal data. It was developed at the National Cancer Institute in order to smooth out 'peaks' and 'valleys' in health data that occur because of small numbers of events (Mungiole, Pickle, and Simonson, 2002; Pickle and Su, 2002). For example, with lung cancer rates (relative to population), counties with small populations could show extremely high lung cancer rates with only an increase of a couple of cases in a year or, conversely, very low rates if there was a decrease of a couple of cases. On the other hand, counties with large populations will show stable estimates because their numbers are larger. The aim of the Head Bang, therefore, is to smooth out the values for smaller geographical zones while generally keeping the values of larger geographical zones. The methodology is based on the idea of a median-based head-banging smoother proposed by Tukey and Tukey (1981) and later implemented by Hansen (1991) in two dimensions. Mean smoothing functions tend to over-smooth in the presence of edges while median smoothing functions tend to preserve the edges. The Head Bang routine applies the concept of a median smoothing function to a three-dimensional plane. The Head Bang algorithm used in CrimeStat is a simplification of the methodology proposed by Mungiole, Pickle and Simonson (2002) but similar to that used by Pickle and Su (2002).2

Consider a set of zones with a variable being displayed. In a raw map, the value of the variable for any one zone is independent of the values for nearby zones. However, in a Head Bang smoothing, the value of any one zone becomes a function of the values of nearby zones. It is useful for eliminating extreme values in a distribution and adjusting the values of zones to be similar to their neighbors. A set of neighbors is defined for a particular zone (the central zone). In CrimeStat, the user can choose any number of neighbors, with the default being 6. Mungiole and Pickle (1999) found that 6 nearest neighbors generally produced small errors between the actual values and the smoothed values, and that increasing the number did not reduce the error substantially. On the other hand, they found that choosing fewer than 6 neighbors could sometimes produce unusual results.

The values of the neighbors are sorted from high to low and divided into two groups, called the 'high screen' and the 'low screen'. If the number of neighbors is even, then the two groups are mutually exclusive; on the other hand, if the number of neighbors is odd, then the middle record is counted twice, once with the high screen and once with the low screen. For each sub-group, the median value is calculated. Thus, the median of the high screen group is the 'high median' and the median of the low screen group is the 'low median'. The value of the central zone is then compared to these two medians.

2 The Head Bang statistic is sometimes written as Head-Bang or even Headbang. We prefer to use the term without the hyphen.

Rates and Volumes
Figure Up. 1.9 shows the graphical interface for the Interpolation II page, which includes the Head Bang and the Interpolated Head Bang routines. The original Head Bang statistic was applied to rates (e.g., the number of lung cancer cases relative to population). In the CrimeStat implementation, the routine can be applied to volumes (counts) or rates, or it can even be used to estimate a rate from volumes. Volumes have no weighting (i.e., they are self-weighted). In the case of rates, though, they should be weighted (e.g., by population). The most plausible weighting variable for a rate is the same baseline variable used in the denominator of the rate (e.g., population, number of households) because the rate variance is proportional to 1/baseline (Pickle and Su, 2002).

Figure Up. 1.9: Interpolation II page

Decision Rules

Depending on whether the intensity variable is a volume (count) or a rate variable, slightly different decision rules apply.

Smoothed Median for Volume Variable

With a volume variable, there is only a volume (the number of events). There is no weighting of the volume since it is self-weighting (i.e., the number equals its weight). In CrimeStat, the volume variable is defined as the Intensity variable on the Primary File page. For a volume variable, if the value of the central zone falls between the two medians ('low screen' and 'high screen'), then the central zone retains its value. On the other hand, if the value of the central zone is higher than the high median, then it takes the high median as its smoothed value, whereas if it is lower than the low median, then it takes the low median as its smoothed value.

Smoothed Median for Rate Variable

With a rate, there is both a value (the rate) and a weight. The value of the variable is its rate. However, there is a separate weight that must be applied to this rate to distinguish a large zone from a small zone. In CrimeStat, the weight variable is always defined on the Primary File page under the Weight field. Depending on whether the rate is input as part of the original data set or created out of two variables from the original data set, it will be defined slightly differently. If the rate is a variable in the Primary File data set, it must be defined as the Intensity variable. If the rate is created out of two variables from the Primary File data set, it is defined on the Head Bang interface under 'Create rate'.

Irrespective of how a rate is defined, if the value of the central zone falls between the two medians, then the central zone retains its value. However, if it is either higher than the high median or lower than the low median, then its weight determines whether it is adjusted or not. First, it is compared to the screen to which it is closest (high or low). Second, if it has a weight that is greater than all the weights of its closest screen, then it maintains its value. For example, if the central zone has a rate value greater than the high median but also a weight greater than any of the weights of the high screen zones, then it still maintains its value. On the other hand, if its weight is smaller than any of the weights in the high screen, then it takes the high median as its value. The same logic applies if its value is lower than the low median. This logic ensures that if a central zone is large relative to its neighbors, then its rate is most likely an accurate indicator of risk. However, if it is smaller than its neighbors, then its value is adjusted to be like its neighbors. In this case, extreme rates, either high or low, are reduced to moderate levels (smoothed), thereby minimizing the potential for 'peaks' or 'valleys' while maintaining sensitivity where there are real edges in the data.
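Before the worked example that follows, the decision rules can be summarized in a short sketch (in Python, not the CrimeStat code). It uses plain medians of the two screens; CrimeStat's rate case uses a weighted median of each screen, as the example below shows, so the helper names and the simple medians are assumptions made to keep the sketch small.

import numpy as np

def screens(values):
    """Sort the neighbor values and split them into a low and a high screen.
    With an odd count the middle value is counted in both screens."""
    v = np.sort(np.asarray(values, dtype=float))
    half = len(v) // 2
    if len(v) % 2 == 0:
        return v[:half], v[half:]
    return v[:half + 1], v[half:]

def head_bang_volume(center, neighbor_values):
    """Decision rule for a volume (count): clamp the central value to the
    low or high screen median if it falls outside them."""
    low, high = screens(neighbor_values)
    low_med, high_med = np.median(low), np.median(high)
    return min(max(center, low_med), high_med)

def head_bang_rate(center_rate, center_weight, neighbor_rates, neighbor_weights):
    """Decision rule for a rate: a central zone outside the medians keeps its
    rate only if its weight exceeds every weight in the closest screen."""
    order = np.argsort(neighbor_rates)
    rates = np.asarray(neighbor_rates, dtype=float)[order]
    weights = np.asarray(neighbor_weights, dtype=float)[order]
    half = len(rates) // 2
    lo = slice(0, half + len(rates) % 2)
    hi = slice(half, len(rates))
    low_med, high_med = np.median(rates[lo]), np.median(rates[hi])
    if low_med <= center_rate <= high_med:
        return center_rate
    if center_rate > high_med:
        return center_rate if center_weight > weights[hi].max() else high_med
    return center_rate if center_weight > weights[lo].max() else low_med

# Example: a central count of 50 with neighbors [10, 15, 12, 7, 14, 16] is
# pulled down to the high-screen median (15.0).
# print(head_bang_volume(50, [10, 15, 12, 7, 14, 16]))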
Example to Illustrate Decision Rules

A simple example will illustrate this process. Suppose the intensity variable is a rate (as opposed to a volume - a count). For each point, the eight nearest neighbors are examined. Suppose that the eight nearest neighbors of zone A have the following values (Table Up. 1.4):

Table Up. 1.4
Example: Nearest Neighbors of Zone "A"

Neighbor    Intensity Value    Weight
1           10                 1000
2           15                 3000
3           12                 4000
4            7                 1500
5           14                 2300
6           16                 1200
7           10                 2000
8           12                 2500

Note that the value at the central point (zone A) is not included in this list. These are the nearest neighbors only. Next, the 8 neighbors are sorted from the lowest rate to the highest (Table Up. 1.5). The record number (neighbor) and weight value are also sorted.

Table Up. 1.5
Sorted Nearest Neighbors of Zone "A"

Neighbor    Rate    Weight
4            7      1500
1           10      1000
7           10      2000
3           12      4000
8           12      2500
5           14      2300
2           15      3000
6           16      1200

Third, a cumulative sum of the weights is calculated starting with the lowest intensity value (Table Up. 1.6):

Table Up. 1.6
Cumulative Weights for Nearest Neighbors of Zone "A"

Neighbor    Rate    Weight    Cumulative Weight
4            7      1500       1,500
1           10      1000       2,500
7           10      2000       4,500
3           12      4000       8,500
8           12      2500      11,000
5           14      2300      13,300
2           15      3000      16,300
6           16      1200      17,500

Fourth, the neighbors are then divided into two groups at the median. Since the number of records is even, the "low screen" are records 4, 1, 7, and 3 and the "high screen" are records 8, 5, 2, and 6. The weighted medians of the "low screen" and "high screen" are calculated (Table Up. 1.7). Since these are rates, the "low screen" median is calculated from the first four records while the "high screen" median is calculated from the second four records. The calculations are as follows (assume the baseline is 'per 10,000'). The intensity value is multiplied by the weight and divided by the baseline (for example, 7 * 1500/10000 = 1.05). This is called the "score"; it is an estimate of the volume (number) of events in that zone.

Table Up. 1.7
Cumulative Scores by Screens for Nearest Neighbors of Zone "A"

"Low screen"
Neighbor    Rate    Weight    "Score" (I*Weight/Baseline)    Cumulative Score
4            7      1500      1.05                            1.05
1           10      1000      1.00                            2.05
7           10      2000      2.00                            4.05
3           12      4000      4.80                            8.85

"High screen"
Neighbor    Rate    Weight    "Score" (I*Weight/Baseline)    Cumulative Score
8           12      2500      3.00                            3.00
5           14      2300      3.22                            6.22
2           15      3000      4.50                           10.72
6           16      1200      1.92                           12.64

For the "low screen", the median score is 8.85/2 = 4.425. This falls between records 7 and 3. To estimate the rate associated with this median score, the gap in scores between records 7 and 3 is interpolated, and then converted to rates. The gap between records 7 and 3 is 4.80 (8.85-4.05). The "low screen" median score, 4.425, is (4.425-4.05)/4.80 = 0.0781 of that gap. The gap between the rates of records 7 and 3 is 2 (12-10). Thus, 0.0781 of that gap is 0.1563. This is added to the rate of record 7 to yield a low median rate of 10.1563.

For the "high screen", the median score is 12.64/2 = 6.32. This falls between records 5 and 2. To estimate the rate associated with this median score, the gap in scores between records 5 and 2 is interpolated, and then converted to rates. The gap between records 5 and 2 is 4.50 (10.72-6.22). The "high screen" median score, 6.32, is (6.32-6.22)/4.50 = 0.0222 of that gap.
The gap between the rates of records 5 and 2 is 1 (15-14). Thus, 0.0222 of that gap is 0.0222. This is added to the rate of record 5 to yield a high median rate of 14.0222.

Finally, the rate associated with the central zone (zone A in our example) is compared to these two medians. If its rate falls between these medians, then it keeps its value. For example, if the rate of zone A is 13, then that falls between the two medians (10.1563 and 14.0222). On the other hand, if its rate falls outside this range (either lower than the low median or higher than the high median), its value is determined by its weight relative to the screen to which it is closest. For example, suppose zone A has a rate of 15 with a weight of 1700. In this case, its rate is higher than the high median (14.0222) but its weight is smaller than three of the weights in the high screen. Therefore, it takes the high median as its new smoothed value. Relative to its neighbors, it is smaller than three of them, so its value is probably too high. But suppose it has a rate of 15 and a weight of 3000? Even though its rate is higher than the high median, its weight is as high as or higher than the weights of the four neighbors making up the high screen. Consequently, it keeps its value. Relative to its neighbors, it is a large zone and its value is probably accurate.

For volumes, the comparison is simpler because all weights are equal. Consequently, the volume of the central zone is compared directly to the two medians. If it falls between the medians, it keeps its value. If it falls outside the medians, then it takes the median to which it is closest (the high median if it has a higher value or the low median if it has a lower value).

Setup

For either a rate or a volume, the statistic requires an intensity variable in the Primary File. The user must specify whether the variable to be smoothed is a rate variable, a volume variable, or two variables that are to be combined into a rate. If a weight is to be used (for either a rate or the creation of a rate from two volume variables), it must be defined as a Weight on the Primary File page. Note that if the variable is a rate, it probably should be weighted. A typical weighting variable is the population size of the zone. The user has to complete the following steps to run the routine:

1. Define the input file and coordinates on the Primary File page.

2. Define an intensity variable, Z(intensity), on the Primary File page.

3. OPTIONAL: Define a weighting variable in the weight field on the Primary File page (for rates and for the creation of rates from two volume variables).

4. Define an ID variable to identify each zone.

5. Select the data type:

   A. Rate: the variable to be smoothed is a rate variable, which calculates the number of events (the numerator) relative to a baseline variable (the denominator).

      a. The baseline units should be defined, which is an assumed multiplier in powers of 10. The default is 'per 100' (percentages) but other choices are 1 (no multiplier used), 'per 10' (rate is multiplied by 10), 'per 1000', 'per 10,000', 'per 100,000', and 'per 1,000,000'. This is not used in the calculation but for reference only.

      b. If a weight is to be used, the 'Use weight variable' box should be checked.

   B. Volume: the variable to be smoothed is a raw count of the number of events. There is no baseline used.

   C. Create Rate: a rate is to be calculated by dividing one variable by another.

      a. The user must define the numerator variable and the denominator variable.
      b. The baseline rate must be defined, which is an assumed multiplier in powers of 10. The default is 'per 100' (percentages) but other choices are 1 (no multiplier used), 'per 10' (rate is multiplied by 10), 'per 1000', 'per 10,000', 'per 100,000', and 'per 1,000,000'. This is used in the calculation of the rate.

      c. If a weight is to be used, the 'Use weight variable' box should be checked.

6. Select the number of neighbors. The number of neighbors can run from 4 through 40. The default is 6. If the number of neighbors selected is even, the routine divides the data set into two equal-sized groups. If the number of neighbors selected is odd, then the middle zone is used in calculating both the low median and the high median. It is recommended that an even number of neighbors be used (e.g., 4, 6, 8, 10, 12).

7. Select the output file. The output can be saved as a dBase 'dbf' file. If the output file is a rate, then the prefix RateHB is used. If the output is a volume, then the prefix VolHB is used. If the output is a created rate, then the prefix CrateHB is used.

8. Run the routine by clicking 'Compute'.

Output

The Head Bang routine creates a 'dbf' file with the following variables:

1. The ID field
2. The X coordinate
3. The Y coordinate
4. The smoothed intensity variable (called 'Z_MEDIAN'). Note that this is not a Z score but a smoothed intensity (Z) value.
5. The weight applied to the smoothed intensity variable. This will automatically be 1 if no weighting is applied.

The 'dbf' file can then be linked to the input 'dbf' file by using the ID field as a matching variable. This would be done if the user wants to map the smoothed intensity variable.

Example 1: Using the Head Bang Routine for Mapping Houston Burglaries

Earlier, Figure Up. 1.1 showed a map of Houston burglaries by traffic analysis zones; the mapped variable is the number of burglaries committed in 2006. On the Head Bang interface, the 'Volume' box is checked, indicating that the number of burglaries will be estimated. The number of neighbors is left at the default 6. The output 'dbf' file was then linked to the input 'dbf' file using the ID field to allow the smoothed intensity values to be mapped. Figure Up. 1.10 shows a smoothed map of the number of burglaries produced by the Head Bang routine. With both maps, the number of intervals that are mapped is 5.

Comparing this map with Figure Up. 1.1, it can be seen that there are fewer zones in the lowest interval/bin (in yellow). The actual counts are 528 zones with scores of less than 10 in Figure Up. 1.1 compared to 498 zones in Figure Up. 1.10. Also, there are fewer zones in the highest interval/bin (in black) as well. The actual counts are 215 zones with scores of 40 or more in Figure Up. 1.1 compared to 181 zones in Figure Up. 1.10. In other words, the Head Bang routine has eliminated many of the highest values by assigning them to the median values of their neighbors, either those of the 'high screen' or the 'low screen'.

Example 2: Using the Head Bang Routine for Mapping Burglary Rates

The second example shows how the Head Bang routine can smooth rates. In the Houston burglary data base, a rate variable was created which divided the number of burglaries in 2006 by the number of households in 2006. This variable was then multiplied by 1,000 to minimize the effects of decimal places (the baseline unit). Figure Up. 1.11 shows the raw burglary rate (burglaries per 1,000 households) for the City of Houston in 2006.
The Head Bang routine was set up to estimate a rate for this variable (Burglaries Per 1000 Households). On the Primary File page, the intensity variable was defined as the calculated rate (burglaries per 1,000 households) because the Head Bang will smooth the rate. Also, a weight variable is selected on the Primary File page. In this example, the weight variable was the number of households.

Figure Up. 1.10: Head Bang smoothed number of Houston burglaries
Figure Up. 1.11: Raw Houston burglary rate (burglaries per 1,000 households)

On the Head Bang interface, the 'Rate' box is checked (see Figure Up. 1.9). The ID variable is selected (which is also TAZ03). The baseline number of units was set to 'Per 1000'; this is for information purposes only and will not affect the calculation. With any rate, there is always the potential of a small zone producing a very high rate. Consequently, the estimates are weighted to ensure that the values of each zone are proportional to their size. Zones with larger numbers of households will keep their values whereas zones with small numbers of households will most likely change their values to be closer to their neighbors. On the Primary File page, the number of households is chosen as the weight variable, and the 'Use weight variable' box is checked under the Head Bang routine. The number of neighbors was left at the default 6. Finally, an output 'dbf' file is defined in the 'Save Head Bang' dialogue.

The output 'dbf' file was linked to the input 'dbf' file using the ID field to allow the smoothed rates to be mapped. Figure Up. 1.12 shows the result of smoothing the burglary rate. As can be seen, the rates are more moderate than the raw rates (comparing Figure Up. 1.12 with Figure Up. 1.11). There are fewer zones in the highest rate category (100 or more burglaries per 1,000 households) for the Head Bang estimate compared to the raw data (64 compared to 185) but there are also more zones in the lowest rate category (0-24 burglaries per 1,000 households) for the Head Bang compared to the raw data (585 compared to 520). In short, the Head Bang has reduced the rates throughout the map.

Example 3: Using the Head Bang Routine for Creating Burglary Rates

The third example illustrates using the Head Bang routine to create smoothed rates. In the Houston burglary data set, there are two variables that can be used to create a rate. First, there is the number of burglaries per traffic analysis zone. Second, there is the number of households that live in each zone. By dividing the number of burglaries by the number of households, an exposure index can be calculated. Of course, this index is not perfect because some of the burglaries occur on commercial properties, rather than residential units. But, without separating residential from non-residential burglaries, this index can be considered a rough exposure measure.

On the Head Bang interface, the 'Create Rate' box is checked. The ID variable is selected (which is TAZ03 in the example - see Figure Up. 1.9). The numerator variable is selected. In this example, the numerator variable is the number of burglaries. Next, the denominator variable is selected. In the example, the denominator variable is the number of households. The baseline units must be chosen and, unlike the rate routine, are used in the calculations. For the example, the rate is 'per 1,000', which means that the routine will calculate the rate (burglaries divided by households) and then multiply it by 1,000. On the Head Bang page, the 'Use weight variable' box under the 'Create rate' column is checked.
Next, the number of neighbors is chosen, both for the numerator and for the denominator. To avoid dividing by a small number, we generally recommend using a larger number of neighbors for the denominator than for the numerator. In the example, the default 6 neighbors is chosen for the numerator variable (burglaries) while 8 neighbors is chosen for the denominator variable (households). Finally, a 'dbf' output file is defined and the routine is run.

The output 'dbf' file was then linked to the input 'dbf' file using the ID field to allow the smoothed rates to be mapped. Figure Up. 1.13 shows the results. Compared to the raw burglary rate (Figure Up. 1.11), there are fewer zones in the highest category (36 compared to 185) but also more zones in the lowest category (607 compared to 520). Like the rate smoother, the created rate has reduced the rates throughout the map.

Figure Up. 1.12: Head Bang smoothed burglary rate
Figure Up. 1.13: Head Bang created (smoothed) burglary rate

Uses of the Head Bang Routine

The Head Bang routine is useful for several purposes. First, it eliminates extreme measures, particularly very high ones ('peaks'). For a rate, in particular, it will produce more stable estimates. For zones with small baseline numbers, a few events can cause dramatic increases in the rates if just calculated as such. The Head Bang smoother will eliminate those extreme fluctuations. The use of population weights for estimating rates ensures that unusually high or low proportions that are reliable due to large populations are not modified, whereas values based on small base populations are modified to be more like those of the surrounding zones. Similarly, for volumes (counts), the method will produce values that are more moderate.

Limitations of the Head Bang Routine

On the other hand, the Head Bang methodology does distort data. Because the extreme values are eliminated, the routine aims for more moderate estimates. However, those extremes may be real. Consequently, the Head Bang routine should not be used to interpret the results for any one zone but more for the general pattern within the area. If used carefully, the Head Bang can be a powerful tool for examining risk within a study area and, especially, for examining changes in risk over time.

Interpolated Head Bang

The Head Bang calculations can be interpolated to a grid. If the user checks this box, then the routine will also interpolate the calculations to a grid using kernel density estimation. An output file from the Head Bang routine is required. Also, a reference file is required to be defined on the Reference File page. Essentially, the routine takes a Head Bang output and interpolates it to a grid using a kernel density function. The same results can be obtained by inputting the Head Bang output on the Primary File page and using the single kernel density routine on the Interpolation I page. However, there is no intensity variable in the Interpolated Head Bang because the intensity has already been incorporated in the Head Bang output. Also, there is no weighting of the Head Bang estimate. The user must then define the parameters of the interpolation.

Method of Interpolation

There are five types of kernel distributions that can be used to interpolate the Head Bang to the grid:

1. The normal kernel overlays a three-dimensional normal distribution over each point that then extends over the area defined by the reference file. This is the default kernel function. However, the normal kernel tends to over-smooth; one of the other kernel functions may produce a more differentiated map.
2. The uniform kernel overlays a uniform function (disk) over each point that only extends for a limited distance.

3. The quartic kernel overlays a quartic function (inverse sphere) over each point that only extends for a limited distance.

4. The triangular kernel overlays a three-dimensional triangle (cone) over each point that only extends for a limited distance.

5. The negative exponential kernel overlays a three-dimensional negative exponential function over each point that only extends for a limited distance.

The different kernel functions produce similar results, though the normal is generally smoother for any given bandwidth.

Choice of Bandwidth

The kernels are applied to a limited search distance, called the 'bandwidth'. For the normal kernel, the bandwidth is the standard deviation of the normal distribution. For the uniform, quartic, triangular and negative exponential kernels, the bandwidth is the radius of the circle defined by the surface. For all types, a larger bandwidth will produce smoother density estimates, and both adaptive and fixed bandwidth intervals can be selected.

Adaptive bandwidth

An adaptive bandwidth distance is identified by the minimum number of other points found within a circle drawn around a single point. A circle is placed around each point, in turn, and the radius is increased until the minimum sample size is reached. Thus, each point has a different bandwidth interval. The user can modify the minimum sample size. The default is 100 points. If there is a small sample size (e.g., fewer than 500 points), then a smaller minimum sample size would be more appropriate.

Fixed bandwidth

A fixed bandwidth distance is a fixed interval for each point. The user must define the interval and the distance units by which it is calculated (miles, nautical miles, feet, kilometers, or meters).

Output (areal) units

Specify the areal density units as points per square mile, per square nautical mile, per square foot, per square kilometer, or per square meter. The default is points per square mile.

Calculate Densities or Probabilities

The density estimate for each cell can be calculated in one of three ways:

Absolute densities. This is the number of points per grid cell and is scaled so that the sum of all grid cells equals the sample size. This is the default.

Relative densities. For each grid cell, this is the absolute density divided by the grid cell area and is expressed in the output units (e.g., points per square mile).

Probabilities. This is the proportion of all incidents that occur in each grid cell. The sum of all grid cells equals 1.

Select whether absolute densities, relative densities, or probabilities are to be output for each cell. The default is absolute densities.

Output

The results can be output as a Surfer for Windows file (for both an external or generated reference file) or as an ArcView '.shp', MapInfo '.mif', Atlas*GIS '.bna', ArcView Spatial Analyst 'asc', or ASCII grid 'grd' file (the latter only if the reference file is generated by CrimeStat). The output file is saved as IHB<root name> with the root name being provided by the user.
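As an illustration of the interpolation step, the following is a minimal sketch (in Python, not the CrimeStat implementation) of overlaying a normal kernel with a fixed bandwidth on smoothed zone values and converting the grid to probabilities. The grid resolution, the use of projected coordinates in miles, and the variable names are assumptions.

import numpy as np

def normal_kernel_grid(points, values, grid_x, grid_y, bandwidth=1.0):
    """Interpolate smoothed zone values (e.g., Head Bang output) to a grid
    using a normal (Gaussian) kernel with a fixed bandwidth. The 1/(2*pi*h^2)
    constant is omitted because the grid is re-scaled afterwards."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    density = np.zeros_like(gx, dtype=float)
    for (px, py), v in zip(points, values):
        d2 = (gx - px) ** 2 + (gy - py) ** 2
        density += v * np.exp(-d2 / (2.0 * bandwidth ** 2))
    return density

def to_probabilities(density):
    """Re-scale the grid so that the sum of all cells equals 1.0."""
    return density / density.sum()

# Example use (assumed inputs): zone centroids and Head Bang Z_MEDIAN values
# interpolated to a 100 x 100 grid with a 1-mile fixed bandwidth.
# grid_x = np.linspace(x_min, x_max, 100)
# grid_y = np.linspace(y_min, y_max, 100)
# prob = to_probabilities(normal_kernel_grid(centroids, z_median, grid_x, grid_y))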
Example: Using the Interpolated Head Bang to Visualize Houston Burglaries

The Houston burglary data set was, first, smoothed using the Head Bang routine (Figure Up. 1.10 above) and, second, interpolated to a grid using the Interpolated Head Bang routine. The kernel chosen was the default normal distribution but with a fixed bandwidth of 1 mile. Figure Up. 1.14 shows the results of the interpolation. To compare this to an interpolation of the original data, the raw number of burglaries in each zone was interpolated using the single kernel density routine. The kernel used was also the normal distribution with a fixed bandwidth of 1 mile. Figure Up. 1.15 shows the results of interpolating the raw burglary numbers.

An inspection of these two figures shows that they both capture the areas with the highest burglary density. However, the Interpolated Head Bang produces fewer high density cells which, in turn, allows the moderately high cells to stand out. For example, in southwest Houston, the Interpolated Head Bang shows two small areas of moderately high density of burglaries whereas the raw interpolation merges these together.

Figure Up. 1.14: Interpolated Head Bang of Houston burglaries
Figure Up. 1.15: Single kernel density interpolation of the raw burglary numbers

Advantages and Disadvantages of the Interpolated Head Bang

The Interpolated Head Bang routine has the same advantages and disadvantages as the Head Bang routine. Its advantages are that it captures the strongest tendencies by eliminating 'peaks' and 'valleys'. But it also does this by distorting the data. The user has to determine whether areas with very high or very low density values are real or just due to a small number of events. For law enforcement applications, this may or may not be an advantage. Some hot spots, for example, are small areas where there are many crime events. Smoothing the data may eliminate the visibility of these. On the other hand, large hot spots will generally survive the smoothing process because the number of events is large and will usually spread to adjacent grid cells. As usual, the user has to be aware of the advantages and disadvantages in order to decide whether a particular tool, such as the Interpolated Head Bang, is useful or not.

Bayesian Journey to Crime Module

The Bayesian Journey to Crime module (Bayesian Jtc) is a set of tools for estimating the likely residence location of a serial offender. It is an extension of the distance-based Journey to Crime routine (Jtc), which uses a typical travel distance function to make guesses about the likely residence location. The extension involves the use of an origin-destination matrix, which provides information about the particular origins of offenders who committed crimes in particular destinations. First, the theory behind the Bayesian Jtc routine will be described. Then, the data requirements will be discussed. Finally, the routine will be illustrated with some data from Baltimore County.

Bayesian Probability

Bayes Theorem is a formulation that relates the conditional and marginal probability distributions of random variables. The marginal probability distribution is a probability independent of any other conditions. Hence, P(A) and P(B) are the marginal probabilities (or just plain probabilities) of A and B respectively. The conditional probability is the probability of an event given that some other event has occurred. It is written in the form P(A|B) (i.e., event A given that event B has occurred). In probability theory, it is defined as:

P(A|B) = P(A and B) / P(B)    (Up. 1.32)

Conditional probabilities can best be seen in contingency tables. Table Up. 1.8 below shows a possible sequence of counts for two variables (e.g., taking a sample of persons and counting their gender - male = 1; female = 0 - and their age - older than 30 = 1; 30 or younger = 0).
The probabilities can be obtained just by counting:

P(A) = 30/50 = 0.6
P(B) = 35/50 = 0.7
P(A and B) = 25/50 = 0.5
P(A or B) = (30+35-25)/50 = 0.8
P(A|B) = 25/35 = 0.71
P(B|A) = 25/30 = 0.83

Table Up. 1.8:
Example of Determining Probabilities by Counting

                       A has NOT Occurred    A has Occurred    TOTAL
B has NOT Occurred             10                    5           15
B has Occurred                 10                   25           35
TOTAL                          20                   30           50

However, if four of these six calculations are known, Bayes Theorem can be used to solve for the other two. Two logical terms in probability are the 'and' and 'or' conditions. Usually, the symbol ∪ is used for 'or' and ∩ is used for 'and', but writing them in words makes it easier to understand. The following two theorems define these.

1. The probability that either A or B will occur is:

P(A or B) = P(A) + P(B) - P(A and B)    (Up. 1.33)

2. The probability that both A and B will occur is:

P(A and B) = P(A) * P(B|A) = P(B) * P(A|B)    (Up. 1.34)

Bayes Theorem relates the two equivalents of the 'and' condition together:

P(B) * P(A|B) = P(A) * P(B|A)    (Up. 1.35)

P(A|B) = [P(A) * P(B|A)] / P(B)    (Up. 1.36)

The theorem is sometimes called the 'inverse probability' in that it can invert two conditional probabilities:

P(B|A) = [P(B) * P(A|B)] / P(A)    (Up. 1.37)

By plugging in the values from the example in Table Up. 1.8, the reader can verify that Bayes Theorem produces the correct results (e.g., P(B|A) = 0.7 * 0.71/0.6 = 0.83).

Bayesian Inference

In the statistical interpretation of Bayes Theorem, the probabilities are estimates of a random variable. Let θ be a parameter of interest and let X be some data. Thus, Bayes Theorem can be expressed as:

P(θ|X) = [P(X|θ) * P(θ)] / P(X)    (Up. 1.38)

Interpreting this equation, P(θ|X) is the probability of θ given the data, X. P(θ) is the probability that θ has a certain distribution and is often called the prior probability. P(X|θ) is the probability that the data would be obtained given that θ is true and is often called the likelihood function (i.e., it is the likelihood that the data will be obtained given the distribution of θ). Finally, P(X) is the marginal probability of the data, the probability of obtaining the data under all possible scenarios; essentially, it is the data. The equation can be rephrased in logical terms:

Posterior probability that θ is true given the data, X = [Likelihood of obtaining the data given θ is true * Prior probability of θ] / [Marginal probability of X]    (Up. 1.39)

In other words, this formulation allows an estimate of the probability of a particular parameter, θ, to be updated given new information. Since θ is the prior probability of an event, given some new data, X, Bayes Theorem can be used to update the estimate of θ. The prior probability of θ can come from prior studies, an assumption of no difference between any of the conditions affecting θ, or an assumed mathematical distribution. The likelihood function can also come from empirical studies or an assumed mathematical function. Irrespective of how these are interpreted, the result is an estimate of the parameter, θ, given the evidence, X. This is called the posterior probability (or posterior distribution).

A point that is often made is that the probability of obtaining the data (the denominator of the above equation) is not known or cannot easily be evaluated. The data are what was obtained from some data gathering exercise (either experimental or from observations). Thus, it is not easy to estimate it.
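The counting example above can be used to make this last point concrete: computing a posterior with the full denominator and computing it from the numerator alone and then re-scaling give the same answer. This is a minimal illustrative sketch (Python), not CrimeStat code.

# Counts from Table Up. 1.8 (rows: B no/yes; columns: A no/yes).
counts = [[10, 5],
          [10, 25]]
total = sum(map(sum, counts))                       # 50

p_b = sum(counts[1]) / total                        # prior P(B) = 0.7
p_not_b = sum(counts[0]) / total                    # 0.3
p_a_given_b = counts[1][1] / sum(counts[1])         # likelihood P(A|B) = 25/35
p_a_given_not_b = counts[0][1] / sum(counts[0])     # 5/15
p_a = (counts[0][1] + counts[1][1]) / total         # marginal P(A) = 0.6

# Full Bayes Theorem (Up. 1.37 written for B): P(B|A) = P(B)*P(A|B) / P(A)
posterior_full = p_b * p_a_given_b / p_a

# Numerator only: compute values proportional to the posterior for B and
# not-B, then re-scale them so they sum to 1, never using P(A) directly.
num_b = p_b * p_a_given_b
num_not_b = p_not_b * p_a_given_not_b
posterior_prop = num_b / (num_b + num_not_b)

assert abs(posterior_full - posterior_prop) < 1e-9  # both equal 25/30 = 0.833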
Consequently, often only the numerator is used to estimate the posterior probability since

P(θ|X) ∝ P(X|θ) * P(θ)    (Up. 1.40)

where ∝ means 'proportional to'. In some statistical methods (e.g., the Markov Chain Monte Carlo simulation, or MCMC), the parameter of interest is estimated by thousands of random simulations using approximations to P(X|θ) and P(θ) respectively. The key point behind this logic is that an estimate of a parameter can be updated by additional new information systematically. The formula requires that a prior probability value for the estimate be given, with new information being added which is conditional on the prior estimate, meaning that it factors in information from the prior. Bayesian approaches are increasingly being used to provide estimates for complex calculations that previously were intractable (Denison, Holmes, Mallick, and Smith, 2002; Lee, 2004; Gelman, Carlin, Stern, and Rubin, 2004).

Application of Bayesian Inference to Journey to Crime Analysis

Bayes Theorem can be applied to the journey to crime methodology. In the Journey to Crime (Jtc) method, an estimate is made about where a serial offender is living. The Jtc method produces a probability estimate based on an assumed travel distance function (or, in more refined uses of the method, travel time). That is, it is assumed that an offender follows a typical travel distance function. This function can be estimated from prior studies (Canter and Gregory, 1994; Canter, 2003), from creating a sample of known offenders - a calibration sample (Levine, 2004) - or from assuming that every offender follows a particular mathematical function (Rossmo, 1995; 2000). Essentially, it is a prior probability for a particular location, P(θ). That is, it is a guess about where the offender lives on the assumption that the offender of interest is following an existing travel distance model.

However, additional information from a sample of known offenders where both the crime location and the residence location are known can be added. This information would be obtained from arrest records, each of which will have a crime location defined (a 'destination') and a residence location (an 'origin'). If these locations are then assigned to a set of zones, a matrix that relates the origin zones to the destination zones can be created (Figure Up. 1.16). This is called an origin-destination matrix (or a trip distribution matrix or an O-D matrix, for short). In this figure, the numbers indicate crimes that were committed in each destination zone that originated (i.e., the offender lived) in each origin zone. For example, taking the first row in Figure Up. 1.16, there were 37 crimes that were committed in zone 1 and in which the offender also lived in zone 1; there were 15 crimes committed in zone 2 in which the offender lived in zone 1; however, there were only 7 crimes committed in zone 1 in which the offender lived in zone 2; and so forth. Note two things about the matrix. First, the number of origin zones can be (and usually is) greater than the number of destination zones because crimes can originate outside the study area. Second, the marginal totals have to be equal. That is, the number of crimes committed in all destination zones has to equal the number of crimes originating in all origin zones. This information can be treated as the likelihood estimate for the Journey to Crime framework.
That is, if a certain distribution of incidents committed by a particular serial offender is known, then this matrix can be used to estimate the likely origin zones from which offenders came, independent of any assumption about travel distance. In other words, this matrix is equivalent to the likelihood function in equation Up. 1.38, which is repeated below:

P(θ|X) = [P(X|θ) * P(θ)] / P(X)    repeat (Up. 1.38)

Thus, the estimate of the likely location of a serial offender can be improved by updating the estimate from the Jtc method, P(θ), with information from an empirically-derived likelihood estimate, P(X|θ).

Figure Up. 1.17 illustrates how this process works. Suppose one serial offender committed crimes in three zones. These are shown in terms of grid cell zones. In reality, most zones are not grid cell shaped but are irregular. However, illustrating it with grid cells makes it more understandable. Using an O-D matrix based on those cells, only the destination zones corresponding to those cells are selected (Figure Up. 1.18). This process is repeated for all serial offenders in the calibration file, which results in marginal totals that correspond to frequencies for those serial offenders who committed crimes in the selected zones. In other words, the distribution of crimes is conditioned on the locations that correspond to where the serial offender of interest committed his or her crimes. It is a conditional probability.

Figure Up. 1.16: Crime origin-destination matrix
Figure Up. 1.17: Incidents committed by one serial offender shown as grid cell zones

But what about the denominator, P(X)? Essentially, it is the spatial distribution of all crimes irrespective of which particular model or scenario we're exploring. In practice, it is very difficult, if not impossible, to estimate the probability of obtaining the data under all circumstances. I'm going to change the symbols at this point so that Jtc represents the distance-based Journey to Crime estimate, O represents an estimate based on an origin-destination matrix, and O|Jtc represents the particular origins associated with crimes committed in the same zones as those identified in the Jtc estimate. Therefore, there are three different probability estimates of where an offender lives:

1. A probability estimate of the residence location of a single offender based on the location of the incidents that this person committed and an assumed travel distance function, P(Jtc);

2. A probability estimate of the residence location of a single offender based on a general distribution of all offenders, irrespective of any particular destinations for incidents, P(O). Essentially, this is the distribution of origins irrespective of the destinations; and

3. A probability estimate of the residence location of a single offender based on the distribution of offenders given the distribution of incidents committed by other offenders who committed crimes in the same locations, P(O|Jtc).

Therefore, Bayes Theorem can be used to create an estimate that combines information both from a travel distance function and an origin-destination matrix (equation Up. 1.38):

P(Jtc|O) ∝ P(O|Jtc) * P(Jtc)    (Up. 1.41)

in which the posterior probability of the journey to crime location conditional on the origin-destination matrix is proportional to the product of the prior probability of the journey to crime function, P(Jtc), and the conditional probability of the origins for other offenders who committed crimes in the same locations. This will be called the product probability.
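To illustrate the combination in equation Up. 1.41, the following is a minimal sketch (Python, not CrimeStat output) of forming the product probability on a grid from a Jtc prior surface and an O-D based conditional surface, together with the related risk measure that is defined in the next paragraph. The array names and the tiny example grids are assumptions.

import numpy as np

def product_probability(p_jtc, p_o_given_jtc, p_o=None):
    """Combine a Jtc prior grid with an O-D likelihood grid (Up. 1.41).
    Each input is a grid over the study area whose cells sum to 1.0.
    If p_o (the general origin distribution) is supplied, also return a
    re-scaled product / p_o surface as a relative risk measure."""
    product = p_jtc * p_o_given_jtc
    product = product / product.sum()            # re-scale to sum to 1.0
    if p_o is None:
        return product
    risk = product / np.where(p_o > 0, p_o, np.nan)
    risk = risk / np.nansum(risk)
    return product, risk

# Example with made-up 2 x 2 grids:
# p_jtc = np.array([[0.4, 0.3], [0.2, 0.1]])
# p_o_given_jtc = np.array([[0.1, 0.4], [0.4, 0.1]])
# p_o = np.array([[0.25, 0.25], [0.25, 0.25]])
# product, risk = product_probability(p_jtc, p_o_given_jtc, p_o)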
As mentioned above, it is very difficult, if not impossible, to determine the probability of obtaining the data under any circumstance. Consequently, the Bayesian estimate is usually calculated only with respect to the numerator, the product of the prior probability and the likelihood function. A very rough approximation to the full Bayesian probability can be obtained by taking the product probability and dividing it by the general probability:

P(Jtc|O) = [P(O|Jtc) * P(Jtc)] / P(O)    (Up. 1.42)

It relates the product term (the numerator) to the general distribution of crimes. This will produce a relative risk measure, which is called the Bayesian risk. In this case, the product probability is being compared to the general distribution of the origins of all offenders irrespective of where they committed their crimes. Note that this measure will correlate with the product term because they both have the same numerator.

Figure Up. 1.18: Conditional origin-destination matrix (the origins associated with the destination zones where the serial offender committed crimes, with marginal totals for the selected zones only)

The Bayesian Journey to Crime Estimation Module

The Bayesian Journey to Crime estimation module is made up of two routines, one for diagnosing which Journey to Crime method is best and one for applying that method to a particular serial offender. Figure Up. 1.19 shows the layout of the module.

Figure Up. 1.19: Bayesian Journey to Crime page

Data Preparation for Bayesian Journey to Crime Estimation

There are four data sets that are required:

1. The incidents committed by a single offender for which an estimate will be made of where that individual lives;
2. A Journey to Crime travel distance function that estimates the likelihood of an offender committing crimes at a certain distance (or travel time if a network is used);
3. An origin-destination matrix; and
4. A diagnostics file of multiple known serial offenders for which both their residence and crime locations are known.

Serial offender data

For each serial offender for whom an estimate will be made of where that person lives, the data set should include the location of the incidents committed by the offender. The data are set up as a series of records in which each record represents a single event. On each data set, there are X and Y coordinates identifying the location of the incidents this person has committed (Table Up. 1.9).

Table Up. 1.9:
Minimum Information Required for Serial Offenders:
Example for Offender Who Committed Seven Incidents

ID      UCR       INCIDX        INCIDY
TS7C    430.00    -76.494300    39.2846
TS7C    440.00    -76.450900    39.3185
TS7C    630.00    -76.460600    39.3157
TS7C    430.00    -76.450700    39.3181
TS7C    311.00    -76.449700    39.3162
TS7C    440.00    -76.450300    39.3178
TS7C    341.00    -76.448200    39.3123

Journey to Crime travel function

The Journey to Crime travel function (Jtc) is an estimate of the likelihood of an offender traveling a certain distance. Typically, it represents a frequency distribution of distances traveled, though it could be a frequency distribution of travel times if a network was used to calibrate the function with the Journey to Crime estimation routine. It can come from an a priori assumption about travel distances, prior research, or a calibration data set of offenders who have already been caught.
The "Calibrate Journey to Crime function" routine (on the Journey to Crime page under Spatial Modeling) can be used to estimate this function. Details are found in Chapter 10 of the CrimeStat manual. The Bayesian Jtc routine can use two different travel distance functions: 1) an already-calibrated distance function; and 2) a mathematical formula. Either direct or indirect (Manhattan) distances can be used, though the default is direct (see Measurement Parameters). In practice, an empirically-derived travel function is often as accurate, if not better, than a mathematically-defined one. Given that an origin-destination matrix is also needed, it is easy for the user to estimate the travel function using the "Calibrate Journey to Crime function" routine.

Origin-destination matrix

The origin-destination matrix relates the number of offenders who commit crimes in one of N zones who live (originate) in one of M zones, similar to Figure Up. 1.16 above. It can be created from the "Calculate observed origin-destination trips" routine (on the 'Describe origin-destination trips' page under the Trip Distribution module of the Crime Travel Demand model).

How many incidents are needed where the origin and destination location are known? While there is no simple answer to this, the numbers ideally should be in the tens of thousands. If there are N destination zones and M origin zones, ideally one would want an average of 30 cases for each cell to produce a reliable estimate. Obviously, that is a huge amount of data and not easily found with any real database. For example, if there are 325 destination zones and 532 origin zones (the Baltimore County example given below), that would be 172,900 individual cells. If the 30-cases-or-more rule is applied, then that would require 5,187,000 records or more to produce a barely reliable estimate for most cells. The task becomes even more daunting when it is realized that many of these links (cells) have few or no cases in them, as offenders typically travel along certain pathways. Obviously, such a demand for data is impractical even in the largest jurisdictions. Therefore, we recommend that as much data as possible be used to produce the origin-destination (O-D) matrix, at least several years' worth. The matrix can be built with what data is available and then periodically updated to produce better estimates.

Diagnostics file for the Bayesian Jtc routine

The fourth data set is used for estimating which of several parameters is best at predicting the residence location of serial offenders in a particular jurisdiction. Essentially, it is a set of serial offenders, each record of which has information on the X and Y coordinates of the residence location as well as the crime location. For example, offender T7B committed seven incidents while offender S8A committed eight incidents. The records of both offenders are placed in the same file along with the records for all other offenders in the diagnostics file. The diagnostics file provides information about which of several parameters (to be described below) are best at guessing where an offender lives. The assumption is that if a particular parameter was best with the K offenders in a diagnostics file in which the residence location was known, then the same parameter will also be best for a serial offender for whom the residence location is not known. How many serial offenders are needed to make up a diagnostics file? Again, there is no simple answer to this, though the number is much less than for the O-D matrix.
Clearly, the more, the better, since the aim is to identify which parameter is most sensitive with a certain level of precision and accuracy. I used 88 offenders in my diagnostics file (see below). Certainly, a minimum of 10 would be necessary, but more would certainly be more accurate. Further, the offender records used in the diagnostics file should be similar in other dimensions to the offender that is being tracked. However, this may be impractical. In the example data set, I combined offenders who committed different types of crimes. The results may be different if offenders who had committed only one type of crime were tested.

Once the data sets have been collected, they need to be placed in an appended file, with one serial offender on top of another. Each record has to represent a single incident. Further, the records have to be arranged sequentially, with all the records for a single offender being grouped together. The routine automatically sorts the data by the offender ID. But, to be sure that the result is consistent, the data should be prepared in this way. The structure of the records is similar to the example in Table Up. 1.10 below. At the minimum, there is a need for an ID field and the X and Y coordinates of both the crime location and the residence location. Thus, in the example, all the records for the first offender (Num 1) are together; all the records for the second offender (Num 2) are together; and so forth. The ID field is any string variable. In Table Up. 1.10, the ID field is labeled "OffenderID", but any label would be acceptable as long as it is consistent (i.e., all the records of a single offender are together).

Table Up. 1.10:
Example Records in Bayesian Journey to Crime Diagnostics File

OffenderID    HomeX       HomeY      IncidX      IncidY
Num 1         -77.1496    39.3762    -76.6101    39.3729
Num 1         -77.1496    39.3762    -76.5385    39.3790
Num 1         -77.1496    39.3762    -76.5240    39.3944
Num 2         -76.3098    39.4696    -76.5427    39.3989
Num 2         -76.3098    39.4696    -76.5140    39.2940
Num 2         -76.3098    39.4696    -76.4710    39.3741
Num 3         -76.7104    39.3619    -76.7195    39.3704
Num 3         -76.7104    39.3619    -76.8091    39.4428
Num 3         -76.7104    39.3619    -76.7114    39.3625
Num 4         -76.5179    39.2501    -76.5144    39.3177
Num 4         -76.5179    39.2501    -76.4804    39.2609
Num 4         -76.5179    39.2501    -76.5099    39.2952
Num 5         -76.3793    39.3524    -76.4684    39.3526
Num 5         -76.3793    39.3524    -76.4579    39.3590
Num 5         -76.3793    39.3524    -76.4576    39.3590
Num 5         -76.3793    39.3524    -76.4512    39.3347
Num 6         -76.5920    39.3719    -76.5867    39.3745
Num 6         -76.5920    39.3719    -76.5879    39.3730
Num 6         -76.5920    39.3719    -76.7166    39.2757
Num 6         -76.5920    39.3719    -76.6015    39.4042
Num 7         -76.7152    39.3468    -76.7542    39.2815
Num 7         -76.7152    39.3468    -76.7516    39.2832
Num 7         -76.7152    39.3468    -76.7331    39.2878
Num 7         -76.7152    39.3468    -76.7281    39.2889
. . . .
Num Last      -76.4297    39.3172    -76.4320    39.3182
Num Last      -76.4297    39.3172    -76.4880    39.3372
Num Last      -76.4297    39.3172    -76.4437    39.3300
Num Last      -76.4297    39.3172    -76.4085    39.3342
Num Last      -76.4297    39.3172    -76.4083    39.3332
Num Last      -76.4297    39.3172    -76.4082    39.3324
Num Last      -76.4297    39.3172    -76.4081    39.3335

In addition to the ID field, the X and Y coordinates of both the crime and residence location must be included on each record. In the example (Table Up. 1.10), the ID variable is called OffenderID, the crime location coordinates are called IncidX and IncidY while the residence location coordinates are called HomeX and HomeY. Again, any label is acceptable as long as the column locations in each record are consistent. As with the Journey to Crime calibration file, other fields can be included.
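As a small illustration of how such a diagnostics file can be checked outside CrimeStat before it is used, here is a minimal sketch (Python) that reads the table exported to CSV and verifies that each offender's records are grouped under one ID and carry a single residence location. The file name and column names follow the example above but are otherwise assumptions.

import csv
from collections import defaultdict

def load_diagnostics(path):
    """Group diagnostics records by OffenderID and check that every offender
    has exactly one residence location and at least one incident."""
    offenders = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            offenders[row["OffenderID"]].append(
                (float(row["HomeX"]), float(row["HomeY"]),
                 float(row["IncidX"]), float(row["IncidY"])))
    for offender_id, records in offenders.items():
        homes = {(hx, hy) for hx, hy, _, _ in records}
        if len(homes) != 1:
            raise ValueError(f"{offender_id} has more than one residence location")
    return offenders

# Example use: offenders = load_diagnostics("diagnostics.csv")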
Logic of the Routine

The module is divided into two parts (under the "Bayesian Journey to Crime Estimation" page of "Spatial Modeling"):

1. Diagnostics for Journey to Crime methods; and
2. Estimate likely origin location of a serial offender.

The "diagnostics" routine takes the diagnostics calibration file, estimates a number of methods for each serial offender in the file, and tests the accuracy of each parameter against the known residence location. The result is a comparison of the different methods in terms of accuracy in predicting both where the offender lives as well as minimizing the distance between where the method predicts the most likely location for the offender and where the offender actually lives. The "estimate" routine allows the user to choose one method and to apply it to the data for a single serial offender. The result is a probability surface showing the results of the method in predicting where the offender is liable to be living.

Bayesian Journey to Crime Diagnostics

The following applies to the "diagnostics" routine only.

Data Input

The user inputs the four required data sets along with a reference grid:

1. Any primary file with an X and Y location. A suggestion is to use one of the files for the serial offender, but this is not essential;
2. A grid that will be overlaid on the study area. Use the Reference File under Data Setup to define the X and Y coordinates of the lower-left and upper-right corners of the grid as well as the number of columns;
3. A Journey to Crime travel function (Jtc) that estimates the likelihood of an offender committing crimes at a certain distance (or travel time if a network is used);
4. An origin-destination matrix; and
5. The diagnostics file of known serial offenders in which both their residence and crime locations are known.

Methods Tested

The "diagnostics" routine compares six methods for estimating the likely location of a serial offender:

1. The Jtc distance method, P(Jtc);
2. The general crime distribution based on the origin-destination matrix, P(O). Essentially, this is the distribution of origins irrespective of the destinations;
3. The distribution of origins in the O-D matrix based only on the incidents in zones that are identical to those committed by the serial offender, P(O|Jtc);
4. The product of the Jtc estimate (1 above) and the distribution of origins based only on those incidents committed in zones identical to those by the serial offender (3 above), P(Jtc)*P(O|Jtc). This is the numerator of the Bayesian function (equation Up. 1.38), the product of the prior probability times the likelihood estimate;
5. The Bayesian risk estimate (method 4 above divided by method 2 above), P(Bayesian). This is a rough approximation to the Bayesian function in equation Up. 1.42 above; and
6. The center of minimum distance, Cmd. Previous research has indicated that the center of minimum distance produces the least error in minimizing the distance between where the method predicts the most likely location for the offender and where the offender actually lives (Levine, 2004; Snook, Zito, Bennell, and Taylor, 2005).

Interpolated Grid

For each serial offender in turn and for each method, the routine overlays a grid over the study area. The grid is defined by the Reference File parameters (under Data Setup; see Chapter 3). The routine then interpolates each input data set into a probability estimate for each grid cell, with the sum of the cells equaling 1.0 (within three decimal places).
The manner in which the interpolation is done varies by the method:

1. For the Jtc method, P(Jtc), the routine interpolates the selected distance function to each grid cell to produce a density estimate. The densities are then re-scaled so that the sum of the grid cells equals 1.0 (see chapter 10);

2. For the general crime distribution method, P(O), the routine sums up the incidents by each origin zone from the origin-destination matrix and interpolates that using the normal distribution method of the single kernel density routine (see chapter 9). The density estimates are converted to probabilities so that the sum of the grid cells equals 1.0;

3. For the distribution of origins based only on the incidents committed by the serial offender, the routine identifies, from the origin-destination matrix, the zones in which the incidents occur and reads only those origins associated with those destination zones. Multiple incidents committed in the same origin zone are counted multiple times. The routine adds up the number of incidents counted for each zone and uses the single kernel density routine to interpolate the distribution to the grid (see chapter 9). The density estimates are converted to probabilities so that the sum of the grid cells equals 1.0;

4. For the product of the Jtc estimate and the distribution of origins based only on the incidents committed by the serial offender, the routine multiplies the probability estimate obtained in 1 above by the probability estimate obtained in 3 above. The probabilities are then re-scaled so that the sum of the grid cells equals 1.0;

5. For the Bayesian risk estimate, the routine takes the product estimate (4 above) and divides it by the general crime distribution estimate (2 above). The resulting probabilities are then re-scaled so that the sum of the grid cells equals 1.0; and

6. Finally, for the center of minimum distance estimate, the routine calculates the center of minimum distance for each serial offender in the "diagnostics" file and calculates the distance between this statistic and the location where the offender actually resides. This is used only for the distance error comparisons.

Note that in all of the probability estimates (excluding 6), the cells are converted to probabilities prior to any multiplication or division. The results are then re-scaled so that the resulting grid is a probability (i.e., all cells sum to 1.0).

Output

For each offender in the "diagnostics" file, the routine calculates three different statistics for each of the methods (a sketch of how they can be computed from a probability grid follows this list):

1. The estimated probability in the cell where the offender actually lives. It does this by, first, identifying the grid cell in which the offender lives (i.e., the grid cell where the offender's residence X and Y coordinate is found) and, second, by noting the probability associated with that grid cell;

2. The percentile of all grid cells in the entire grid that have to be searched to find the cell where the offender lives based on the probability estimate from 1, ranked from those with the highest probability to the lowest. Obviously, this percentile will vary by how large a reference grid is used (e.g., with a very large reference grid, the percentile where the offender actually lives will be small whereas with a small reference grid, the percentile will be larger). But, since the purpose is to compare methods, the actual percentage should be treated as a relative index. The result is sorted from low to high so that the smaller the percentile, the better. For example, a percentile of 1% indicates that the probability estimate for the cell where the offender lives is within the top 1% of all grid cells. Conversely, a percentile of 30% indicates that the probability estimate for the cell where the offender lives is within the top 30% of all grid cells; and

3. The distance between the cell with the highest probability and the cell where the offender lives.
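A minimal sketch of these three statistics for one offender and one method (Python, with a toy probability surface and an assumed cell size; CrimeStat itself works in the units of the coordinate system):

import numpy as np

rng = np.random.default_rng(1)
prob_grid = rng.random((100, 100))
prob_grid /= prob_grid.sum()        # probability surface for one method (cells sum to 1.0)

home_cell = (37, 62)                # grid cell containing the offender's residence
cell_size_miles = 0.25              # assumed width of a grid cell

# 1. Probability in the cell where the offender actually lives.
p_home = prob_grid[home_cell]

# 2. Percentile of all grid cells that have to be searched, from the highest
#    probability downward, before the offender's cell is reached.
percentile_searched = 100.0 * (prob_grid >= p_home).sum() / prob_grid.size

# 3. Distance between the highest-probability cell and the offender's cell.
peak_cell = np.unravel_index(np.argmax(prob_grid), prob_grid.shape)
error_miles = cell_size_miles * np.hypot(peak_cell[0] - home_cell[0],
                                         peak_cell[1] - home_cell[1])

print(round(p_home, 6), round(percentile_searched, 2), round(error_miles, 2))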
Table Up. 1.11 illustrates a typical probability output for four of the methods (there are too many to display in a single table). Only five serial offenders are shown in the table.

Table Up. 1.11:
Sample Output of Probability Matrix

Offender   P(Jtc)     Percentile   P(O|Jtc)   Percentile     P(O)       Percentile   P(Jtc)*     Percentile for
                      for P(Jtc)              for P(O|Jtc)              for P(O)     P(O|Jtc)    P(Jtc)*P(O|Jtc)
1          0.001169    0.01%       0.000663    0.01%         0.0003     11.38%       0.002587     0.01%
2          0.000292    5.68%       0.000483    0.12%         0.000377    0.33%       0.000673     0.40%
3          0.000838    0.14%       0.000409    0.18%         0.0002     30.28%       0.00172      0.10%
4          0.000611    1.56%       0.000525    1.47%         0.0004      2.37%       0.000993     1.37%
5          0.001619    0.04%       0.000943    0.03%         0.000266   11.98%       0.004286     0.04%

Table Up. 1.12 illustrates a typical distance output for four of the methods. Only five serial offenders are shown in the table.

Table Up. 1.12:
Sample Output of Distance Matrix

Offender   Distance(Jtc)   Distance(O|Jtc)   Distance(O)   Distance for P(Jtc)*P(O|Jtc)
1            0.060644         0.060644         7.510158        0.060644
2            6.406375         0.673807         2.23202         0.840291
3            0.906104         0.407762        11.53447         0.407762
4            3.694369         3.672257         2.20705         3.672257
5            0.423577         0.405526         6.772228        0.423577

Thus, these three indices provide information about the accuracy and precision of the method.

Output matrices

The "diagnostics" routine outputs two separate matrices. The probability estimates (numbers 1 and 2 above) are presented in a separate matrix from the distance estimates (number 3 above). The user can save the total output as a text file or can copy and paste each of the two output matrices into a spreadsheet separately. We recommend copying and pasting into a spreadsheet, as it will be difficult to line up the differing column widths of the two matrices and summary tables in a text file.

Which is the Most Accurate and Precise Journey to Crime Estimation Method?

Accuracy and precision are two different criteria for evaluating a method. With accuracy, one wants to know how close a method comes to a target. The target can be an exact location (e.g., the residence of a serial offender) or it can be a zone (e.g., a high probability area within which the serial offender lives). Precision, on the other hand, refers to the consistency of the method, irrespective of how accurate it is. A more precise measure is one in which the method has limited variability in estimating the central location whereas a less precise measure may have a high degree of variability. These two criteria, accuracy and precision, can often conflict.

The following example is from Jessen (1978). Consider a target that one is trying to 'hit' (Figure Up. 1.20). The target can be a physical target, such as a dart board, or it can be a location in space, such as the residence of a serial offender. One can think of three different 'throwers', or methods, attempting to hit the center of the target, the Bull's Eye. The throwers make repeated attempts to hit the target and the 'throws' (or estimates from the method) can be evaluated in terms of accuracy and precision.

In Figure Up. 1.21, the thrower is all over the dartboard. There is no consistency at all.
However, if the center of minimum distance (Cmd) is calculated, it is very close to the actual center of the target, the Bull's Eye. In this case, the thrower is accurate but not precise. That is, there is no systematic bias in the thrower's throws, but they are not reliable. This thrower is accurate (or unbiased) but not precise.

In Figure Up. 1.22, there is the opposite condition. In this case, the thrower is precise but not accurate. That is, there is a systematic bias in the throws even though the throws (or method) are relatively consistent. Finally, in Figure Up. 1.23, the thrower is both relatively precise and accurate, as the Cmd of the throws is almost exactly on the Bull's Eye.

One can apply this analogy to a method. A method produces estimates from a sample. For each sample, one can evaluate how accurate the method is (i.e., how close to the target it came) and how consistent it is (how much variability it produces). Perhaps the analogy is not perfect because the thrower makes multiple throws whereas the method produces a single estimate. But, clearly, we want a method that is both accurate and precise.

Measures of Accuracy and Precision

Much of the debate in the area of journey to crime estimation has revolved around arguments about the accuracy and precision of the method. Levine (2004) first raised the issue of accuracy by proposing the distance from the location with the highest probability to the location where the offender lived as a measure of accuracy, and suggested that simple, centrographic measures were as accurate as more precise journey to crime methods in estimating this. Snook and colleagues confirmed this conclusion and showed that human subjects could do as well as any of the algorithms (Snook, Zito, Bennell, and Taylor, 2005; Snook, Taylor and Bennell, 2004). Canter, Coffey and Missen (2000), Canter (2003), and Rossmo (2000) have argued for an area of highest probability being the criterion for evaluating accuracy, indicating a 'search cost' or a 'hit score' with the aim being to narrow the search area to as small as possible. Rich and Shively (2004) compared different journey to crime/geographic profiling software packages and concluded that there were at least five different criteria for evaluating accuracy and precision: error distance, search cost/hit score, profile error distance, top profile area, and profile accuracy. Rossmo (2005a; b) and Rossmo and Filer (2005) have critiqued these measures as being too simple and have rejected error distance. Levine (2005) justified the use of error distance as being fundamental to statistical error while acknowledging that an area measure is necessary, too. Paulsen (2007; 2006a; 2006b) compared different journey to crime/geographic profiling methods and argued that they were more or less comparable in terms of several criteria of accuracy, both error distance and search cost/hit score.

Figure Up. 1.20: From Raymond J. Jessen, Statistical Survey Techniques, J. Wiley, 1978
Figure Up. 1.21:
Figure Up. 1.22:
Figure Up. 1.23:

While the debate continues to develop, practically a distinction can be made in terms of measures of accuracy and measures of precision. Accuracy is measured by how close the estimate is to the target while precision refers to how large or small an area the method produces. The two become identical when the precision is extremely small, similar to a variance converging into a mean in statistics as the distance between the observations and the mean approaches zero.
In evaluating the methods, five different measures are used:

Accuracy

1. True accuracy - the probability in the cell where the offender actually lives. The Bayesian Jtc diagnostics routine evaluates the six above-mentioned methods on a sample of serial offenders with known residence addresses. Each of the methods (except for the center of minimum distance, Cmd) has a probability distribution. That method which has the highest probability in the cell where the offender lives is the most accurate.

2. Diagnostic accuracy - the distance between the cell with the highest probability estimate and the cell where the offender lives. Each of the probability methods produces probability estimates for each cell. The cell with the highest probability is the best guess of the method for where the offender lives. The error from this location to where the offender lives is an indicator of the diagnostic accuracy of the method.

3. Neighborhood accuracy - the percent of offenders who reside within the cell with the highest probability. Since the grid cell is the smallest unit of resolution, this measures the percent of all offenders who live in the highest probability cell. This was estimated by those cases where the error distance was smaller than half the grid cell size.

Precision

4. Search cost/hit score - the percent of the total study area that has to be searched to find the cell where the offender actually lived, after having sorted the output cells from the highest probability to the lowest.

5. Potential search cost - the percent of offenders who live within a specified distance of the cell with the highest probability. In this evaluation, two distances are used, though others can certainly be used:

   A. The percent of offenders who live within one mile of the cell with the highest probability.
   B. The percent of offenders who live within one-half mile of the cell with the highest probability ("Probable search area in miles").

Summary Statistics

The "diagnostics" routine will also provide summary information at the bottom of each matrix. There are summary measures and counts of the number of times a method had the highest probability or the closest distance from the cell with the highest probability to the cell where the offender actually lived; ties between methods are counted as fractions (e.g., two tied methods are given 0.5 each; three tied methods are given 0.33 each). A sketch of this tie-splitting count follows the two lists below.

For the probability matrix, these statistics include:

1. The mean (probability or percentile);
2. The median (probability or percentile);
3. The standard deviation (probability or percentile);
4. The number of times the P(Jtc) estimate produces the highest probability;
5. The number of times the P(O|Jtc) estimate produces the highest probability;
6. The number of times the P(O) estimate produces the highest probability;
7. The number of times the product term estimate produces the highest probability; and
8. The number of times the Bayesian estimate produces the highest probability.

For the distance matrix, these statistics include:

1. The mean distance;
2. The median distance;
3. The standard deviation of the distances;
4. The number of times the P(Jtc) estimate produces the closest distance;
5. The number of times the P(O|Jtc) estimate produces the closest distance;
6. The number of times the P(O) estimate produces the closest distance;
7. The number of times the product term estimate produces the closest distance;
8. The number of times the Bayesian estimate produces the closest distance; and
9. The number of times the center of minimum distance produces the closest distance.
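As a minimal sketch of the tie-splitting count (toy numbers; for the distance matrix the selection would use the minimum rather than the maximum):

import numpy as np

methods = ["P(Jtc)", "P(O|Jtc)", "P(O)", "Product", "Bayesian risk"]
# Rows are offenders, columns are the probability each method assigns to the
# offender's home cell (values are illustrative only).
probs = np.array([
    [0.0012, 0.0007, 0.0003, 0.0026, 0.0020],
    [0.0003, 0.0005, 0.0004, 0.0007, 0.0007],   # tie between the last two methods
    [0.0008, 0.0004, 0.0002, 0.0017, 0.0010],
])

counts = dict.fromkeys(methods, 0.0)
for row in probs:
    best = row.max()
    winners = [m for m, v in zip(methods, row) if v == best]
    for m in winners:                  # two tied methods get 0.5 each, three get 0.33, etc.
        counts[m] += 1.0 / len(winners)

print(counts)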
Testing the Routine with Serial Offenders from Baltimore County

To illustrate the use of the Bayesian Jtc diagnostics routine, the records of 88 serial offenders who had committed crimes in Baltimore County, MD, between 1993 and 1997 were compiled into a diagnostics file. The number of incidents committed by these offenders varied from 3 to 33 and included a range of different crime types (larceny, burglary, robbery, vehicle theft, arson, bank robbery). The following are the results for the three measures of accuracy and the three measures of precision.

Because the methods are interdependent, traditional parametric statistical tests cannot be used. Instead, non-parametric tests have been applied. For the probability and distance measures, two tests were used. First, the Friedman two-way analysis of variance test examines differences in the overall rank orders of multiple measures (treatments) for a group of subjects (Kanji, 1993, 115; Siegel, 1956). This is a chi-square test and measures whether there are significant differences in the rank orders across all measures (treatments). Second, differences between specific pairs of measures can be tested using the Wilcoxon matched pairs signed-ranks test (Siegel, 1956, 75-83). This examines pairs of methods not only by their rank, but also by the difference in the values of the measurements.

For the percentage of offenders who live in the same grid cell, within one mile, and within one-half mile of the cell with the peak likelihood, the Cochran Q test for k related samples was used to test differences among the methods (Kanji, 1993, 74; Siegel, 1956, 161-166). This is a chi-square test of whether there are overall differences among the methods in the percentages, but it cannot indicate whether any one method has a statistically higher percentage. Consequently, we then tested the method with the highest percentage against the method with the second highest percentage with the Q test in order to see whether the best method stood out.
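The following minimal sketch shows how tests of this kind can be reproduced outside CrimeStat with scipy and numpy; the error distances and the one-mile "hit" definition are toy assumptions, not the Baltimore County data:

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(2)
# Simulated error distances (miles) for 88 offenders under three methods.
jtc = rng.gamma(2.0, 1.5, size=88)
product = jtc * rng.uniform(0.6, 1.1, size=88)
general = jtc + rng.gamma(2.0, 2.0, size=88)

# Friedman two-way analysis of variance by ranks across all methods.
chi2, p = friedmanchisquare(jtc, product, general)

# Wilcoxon matched pairs signed-ranks test for one specific pair of methods.
w_stat, w_p = wilcoxon(product, jtc)

# Cochran's Q for k related dichotomous outcomes, e.g. whether the offender
# lives within one mile of each method's highest-probability cell.
hits = np.column_stack([jtc < 1.0, product < 1.0, general < 1.0]).astype(int)
k = hits.shape[1]
col_totals = hits.sum(axis=0)
row_totals = hits.sum(axis=1)
N = hits.sum()
q = (k - 1) * (k * (col_totals ** 2).sum() - N ** 2) / (k * N - (row_totals ** 2).sum())

print(chi2, p, w_stat, w_p, q)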
Results: Accuracy

Table Up. 1.13 presents the results for the three accuracy measures. For the first measure, the probability estimate in the cell where the offender actually lived, the product probability is far superior to any of the others. It has the highest mean probability of any of the measures and is more than double that of the journey to crime. The Friedman test indicates that these differences are significant and the Wilcoxon matched pairs test indicates that the product has a significantly higher probability than the second best measure, the Bayesian risk, which in turn is significantly higher than the journey to crime measure. At the low end, the general probability has the lowest average and is significantly lower than the other measures.

In terms of the individual offenders, the product probability had the highest probability for 74 of the 88 offenders. The Bayesian risk measure, which is correlated with the product term, had the highest probability for 10 of the offenders. The journey to crime measure, on the other hand, had the highest probability for only one of the offenders. The conditional probability had the highest probability for two of the offenders and the general probability was highest for one offender.

Table Up. 1.13:
Accuracy Measures of Total Sample

                       Mean probability   Mean distance from highest     Percent of offenders whose
                       in offender's      probability cell to            residence is in highest
Method                 cell (a)           offender's cell (mi) (b)       prob. cell (c)
Journey to crime          0.00082                  2.78                        12.5%
General                   0.00025                  8.21                         0.0%
Conditional               0.00052                  3.22                         3.4%
Product                   0.00170                  2.65                        13.6%
Bayesian risk             0.00131                  3.15                        10.2%
Cmd                         n.a.                   2.62                        18.2%
_____________________________________________________________________
(a) Friedman χ² = 236.0; d.f. = 4; p ≤ .001; Wilcoxon signed-ranks test at p ≤ .05:
    Product > Bayesian risk > JTC = Conditional > General
(b) Friedman χ² = 114.2; d.f. = 5; p ≤ .001; Wilcoxon signed-ranks test at p ≤ .05:
    Cmd = Product = JTC < Bayesian risk = Conditional < General
(c) Cochran Q = 33.9, d.f. = 5, p ≤ .001; Cochran Q of difference between best and second best = 1.14, n.s.

In other words, the product term produced the highest estimate in the actual cell where the offender lived. The other two accuracy measures are less discriminating but are still indicative of the improvement gained from the Bayesian approach. For the measure of diagnostic accuracy (the distance from the cell with the highest probability to the cell where the offender lived), the center of minimum distance (Cmd) had the lowest error distance, followed closely by the product term. The journey to crime method had a slightly larger error. Again, the general probability had the greatest error, as might be expected. The Friedman test indicates that there are overall differences in the mean distance among the six measures. The Wilcoxon signed-ranks test, however, showed that the Cmd, the product, and the journey to crime estimates are not significantly different, though all three are significantly lower than the Bayesian risk measure and the conditional probability which, in turn, are significantly lower than the general probability.

In terms of individual cases, the Cmd produced the lowest average error for 30 of the 88 cases while the conditional term (O|Jtc) had the lowest error in 17.9 cases (including ties). The product term produced a lower average distance error for 9.5 cases (including ties) and the Jtc estimate produced lower average distance errors in 8.2 cases (again, including ties). In other words, the Cmd will either be very accurate or very inaccurate, which is not surprising given that it is only a point estimate.

Finally, for the third accuracy measure, the percent of offenders residing in the area covered by the cell with the highest probability estimate, the Cmd has the highest percentage (18.2%), followed by the product probability and the journey to crime probability. The Cochran Q shows that there are significant differences over all these measures. However, the difference between the measure with the highest percentage in the same grid cell (the Cmd) and the measure with the second highest percentage (the product probability) is not significant.

For accuracy, the product probability appears to be better than the journey to crime estimate and almost as accurate as the Cmd. It has the highest probability in the cell where the offender lived and a lower error distance than the journey to crime (though not significantly so).
Finally, it had a slightly higher percentage of offenders living in the area covered by the cell with the highest probability than the journey to crime. The Cmd, on the other hand, which had been shown to be the most accurate in previous studies (Levine, 2004; Snook, Zito, Bennell, and Taylor, 2005; Snook, Taylor and Bennell, 2004; Paulsen, 2006a; 2006b), does not appear to be more accurate than the product probability. It has only a slightly lower error distance and a slightly higher percentage of offenders residing in the area covered by the cell with the highest probability. Thus, the product term has equaled the Cmd in terms of accuracy. Both, however, are more accurate than the journey to crime estimate.

Results: Precision

Table Up. 1.14 presents the three precision measures used to evaluate the six different measures. For the first measure, the mean percent of the study area with a higher probability (what Canter calls 'search cost' and Rossmo calls 'hit score'; Canter, 2003; Rossmo, 2005a, 2005b), the Bayesian risk measure had the lowest percentage, followed closely by the product term. The conditional probability was third, followed by the journey to crime probability and then the general probability. The Friedman test indicates that these differences are significant overall and the Wilcoxon test shows that the Bayesian risk, product term, conditional probability and journey to crime estimates are not significantly different from each other. The general probability estimate, however, is much worse.

In terms of individual cases, the product probability had either the lowest percentage or was tied with other measures for the lowest percentage in 36 of the 88 cases. The Bayesian risk and journey to crime measures had the lowest percentage or were tied with other measures for the lowest percentage in 34 of the 88 cases. The conditional probability had the lowest percentage or was tied with other measures for the lowest percentage in 23 of the cases. Finally, the general probability had the lowest percentage or was tied with other measures for the lowest percentage in only 7 of the cases.

Similar results are seen for the percent of offenders living within one mile of the cell with the highest probability and also for the percent living within a half mile. For the percent within one mile, the product term had the highest percentage, followed closely by the journey to crime measure and the Cmd. Again, at the low end is the general probability. The Cochran Q test indicates that these differences are significant over all measures though the difference between the best method (the product) and the second best (the journey to crime) is not significant.

Table Up. 1.14:
Precision Measures of Total Sample

                       Mean percent of          Percent of offenders living within
                       study area with          distance of highest probability cell:
Method                 higher probability (a)      1 mile (b)     0.5 miles (c)
Journey to crime            4.7%                     56.8%           44.3%
General                    16.8%                      2.3%            0.0%
Conditional                 4.6%                     47.7%           31.8%
Product                     4.2%                     59.1%           48.9%
Bayesian risk               4.1%                     51.1%           42.0%
Cmd                         n.a.                     54.5%           42.0%
________________________________________________________________
(a) Friedman χ² = 115.4; d.f. = 4; p ≤ .001; Wilcoxon signed-ranks test at p ≤ .05:
    Bayesian risk = Product = JTC = Conditional > General
(b) Cochran Q = 141.0, d.f. = 5, p ≤ .001; Cochran Q of difference between best and second best = 0.7, n.s.
(c) Cochran Q = 112.2, d.f. = 5, p ≤ .001; Cochran Q of difference between best and second best = 2.0, n.s.
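A minimal sketch (with made-up error distances, not the Baltimore County results) of how the "percent of offenders living within one mile and within one-half mile" columns of a table like Up. 1.14 can be tabulated:

import numpy as np

# Error distances in miles from each method's peak cell to the true residence,
# one row per offender, one column per method (illustrative numbers only).
methods = ["Jtc", "Conditional", "Product", "Bayesian risk"]
errors = np.array([
    [0.34, 0.18, 0.26, 0.34],
    [1.89, 2.39, 0.47, 1.95],
    [0.06, 0.06, 0.06, 0.84],
    [6.41, 0.67, 0.84, 2.23],
])

for threshold in (1.0, 0.5):
    pct = 100.0 * (errors <= threshold).mean(axis=0)   # percent of offenders within threshold
    for name, value in zip(methods, pct):
        print(f"within {threshold} mi  {name:13s} {value:5.1f}%")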
Conclusion of the Evaluation

In conclusion, the product method appears to be an improvement over the journey to crime method, at least with these data from Baltimore County. It is substantially more accurate and about as precise. Further, the product probability appears to be, on average, as accurate as the Cmd, though the Cmd is still more accurate for a small proportion of the cases (about one-sixth). That is, the Cmd will identify about one-sixth of all offenders exactly.

For a single guess of where a serial offender is living, the center of minimum distance produced the lowest distance error. But, since it is only a point estimate, it cannot point to a search area where the offender might be living. The product term, on the other hand, produced an average distance error almost as small as the center of minimum distance, but produced estimates for other grid cells too. Among all the measures, it had the highest probability in the cell where the offender lived and was among the most efficient in terms of reducing the search area.

In other words, using information about the origin location of other offenders appears to improve the accuracy of the Jtc method. The result is an index (the product term) that is almost as good as the center of minimum distance, but one that is more useful since the center of minimum distance is only a single point.

Of course, each jurisdiction should re-run these diagnostics to determine the most appropriate measure. It is very possible that other jurisdictions will have different results due to the uniqueness of their land uses, street layout, and location in relation to the center of the metropolitan area. Baltimore County is a suburb and the conclusions in a central city or in a rural area might be different.

Tests with Other Data Sets

The Bayesian Journey-to-crime model has been tested over the last few years in several jurisdictions:

1. In Baltimore County, with 850 serial offenders, by Michael Leitner and Joshua Kent of Louisiana State University (Leitner and Kent, 2009);
2. In the Hague, Netherlands, with 62 serial burglars, by Dick Block of Loyola University in Chicago and Wim Bernasco of the Netherlands Institute for the Study of Crime and Law Enforcement (Block and Bernasco, 2009);
3. In Chicago, with 103 serial robbers, by Dick Block of Loyola University (Levine and Block, 2010); and
4. In Manchester, England, with 171 serial offenders, by Patsy Lee of the Greater Manchester Police Department and myself (Levine and Lee, 2009).

In all cases, the product probability measure was both more accurate and more precise than the Journey to Crime measure. In two of the studies (Chicago and the Hague), the product term was also more accurate than the Center of Minimum Distance, the previously most accurate measure. In the other two studies (Baltimore County and Manchester), the Center of Minimum Distance was slightly more accurate than the product term. However, the product term has been more precise than the Center of Minimum Distance in all four study comparisons. The mathematics of these models has been explored by O'Leary (2009).

These studies are presented in a special issue of the Journal of Investigative Psychology and Offender Profiling. Introductions are provided by Canter (2009) and Levine (2009).

Estimate Likely Origin Location of a Serial Offender

The following applies to the Bayesian Jtc "Estimate likely origin of a serial offender" routine.
Once the "diagnostics" routine has been run and a preferred method selected, the next routine allows the application of that method to a single serial offender.

Data Input

The user inputs the three required data sets and a reference file grid:

1. The incidents committed by a single offender whom we are interested in catching. This must be the Primary File;
2. A Jtc function that estimates the likelihood of an offender committing crimes at a certain distance (or travel time if a network is used). This can be either a mathematically-defined function or an empirically-derived one (see Chapter 10 on Journey to Crime Estimation). In general, the empirically-derived function is slightly more accurate than the mathematically-defined one (though the differences are not large);
3. An origin-destination matrix; and
4. The reference file, which also needs to be defined and should include all locations where crimes have been committed (see Reference File).

Selected Method

The Bayesian Jtc "Estimate" routine interpolates the incidents committed by a serial offender to a grid, yielding an estimate of where the offender is liable to live. There are five different methods that can be used; the user has to choose one of them:

1. The Jtc distance method, P(Jtc);
2. The general crime distribution based on the origin-destination matrix, P(O). Essentially, this is the distribution of origins irrespective of the destinations;
3. The conditional Jtc distance. This is the distribution of origins based only on the incidents committed by other offenders in the same zones as those committed by the serial offender, P(O|Jtc). This is extracted from the O-D matrix;
4. The product of the Jtc estimate (1 above) and the distribution of origins based only on the incidents committed by the serial offender (3 above), P(Jtc)*P(O|Jtc). This is the numerator of the Bayesian function (equation Up. 1.38), the product of the prior probability times the likelihood estimate; and
5. The Bayesian risk estimate as indicated in equation Up. 1.42 above (method 4 above divided by method 2 above), P(Bayesian).

Interpolated Grid

For the method that is selected, the routine overlays a grid on the study area. The grid is defined by the reference file parameters (see chapter 3). The routine then interpolates the input data set (the primary file) into a probability estimate for each grid cell with the sum of the cells equaling 1.0 (within three decimal places). The manner in which the interpolation is done varies by the method chosen:

1. For the Jtc method, P(Jtc), the routine interpolates the selected distance function to each grid cell to produce a density estimate. The density estimates are converted to probabilities so that the sum of the grid cells equals 1.0 (see chapter 10);

2. For the general crime distribution method, P(O), the routine sums up the incidents by each origin zone and interpolates this to the grid using the normal distribution method of the single kernel density routine (see chapter 9). The density estimates are converted to probabilities so that the sum of the grid cells equals 1.0;

3. For the distribution of origins based only on the incidents committed by the serial offender, the routine identifies the zones in which the incidents occur and reads only those origins associated with those destination zones in the origin-destination matrix. Multiple incidents committed in the same origin zone are counted multiple times.
The routine then uses the single kernel density routine to interpolate the distribution to the grid (see chapter 9). The density estimates are converted to probabilities so that the sum of the grid cells equals 1.0;

4. For the product of the Jtc estimate and the distribution of origins based only on the incidents committed by the serial offender, the routine multiplies the probability estimate obtained in 1 above by the probability estimate obtained in 3 above. The product probabilities are then re-scaled so that the sum of the grid cells equals 1.0; and

5. For the full Bayesian estimate as indicated in equation Up. 1.38 above, the routine takes the product estimate (4 above) and divides it by the general crime distribution estimate (2 above). The resulting density estimates are converted to probabilities so that the sum of the grid cells equals 1.0.

Note that in all estimates, the results are re-scaled so that the resulting grid is a probability (i.e., all cells sum to 1.0).

Output

Once the method has been selected, the routine interpolates the data to the grid cells and outputs the result as a 'shp', 'mif/mid', or ASCII file for display in a GIS program. The tabular output shows the probability values for each cell in the matrix and also indicates which grid cell has the highest probability estimate.

Accumulator Matrix

There is also an intermediate output, called the accumulator matrix, which the user can save. This lists the number of origins identified in each origin zone for the specific pattern of incidents committed by the offender, prior to the interpolation to grid cells. That is, in reading the origin-destination file, the routine first identifies which zone each incident committed by the offender falls within. Second, it reads the origin-destination matrix and identifies which origin zones are associated with incidents committed in the particular destination zones. Finally, it sums up the number of origins by zone ID associated with the incident distribution of the offender. This can be useful for examining the distribution of origins by zones prior to interpolating these to the grid.
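A minimal sketch of the accumulator step (the zone numbers, field names, and trip counts are hypothetical; CrimeStat reads these from the origin-destination matrix defined on the Data Setup pages):

import pandas as pd

# Origin-destination matrix: one row per origin zone / destination zone pair
# with the number of offenders making that trip.
od = pd.DataFrame({
    "origin_zone":      [101, 102, 103, 101, 104],
    "destination_zone": [201, 201, 202, 202, 203],
    "trips":            [12,   5,   8,   3,   6],
})

# Destination zones of the incidents committed by the serial offender; a zone
# is repeated when several incidents fall in it, and its origins are then
# counted once per incident.
incident_zones = pd.Series([201, 202, 202])

counts = incident_zones.value_counts()
subset = od[od["destination_zone"].isin(counts.index)].copy()
subset["weighted"] = subset["trips"] * subset["destination_zone"].map(counts)

# The accumulator matrix: likely origins summed by origin zone ID.
accumulator = subset.groupby("origin_zone")["weighted"].sum()
print(accumulator)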
Two Examples of Using the Bayesian Journey to Crime Routine

Two examples will illustrate the routines. Figure Up. 1.24 presents the probability output for the general origin model, that is, for the origins of all offenders irrespective of where they committed their crimes. This will be true for any serial offender. It is a probability surface in that all the grid cells sum to 1.0. The map is scaled so that each bin covers a probability of 0.0001. The cell with the highest probability is highlighted in light blue. As seen, the distribution is heavily weighted towards the center of the metropolitan area, particularly in the City of Baltimore. For the crimes committed in Baltimore County between 1993 and 1997 in which both the crime location and the residence location were known, about 40% of the offenders resided within the City of Baltimore and the bulk of those living within Baltimore County lived close to the border with the City. In other words, as a general condition, most offenders in Baltimore County live relatively close to the center.

Offender S14A

The general probability output does not take into consideration information about the particular pattern of an offender. Therefore, we will examine a particular offender specifically. Figure Up. 1.25 presents the distribution of an offender who committed 14 offenses between 1993 and 1997 before being caught and the residence location where the individual lived when arrested (offender S14A). Of the 14 offenses, seven were thefts (larceny), four were assaults, two were robberies, and one was a burglary. As seen, the incidents all occurred in the southeast corner of Baltimore County in a fairly concentrated pattern, though two incidents were committed more than five miles away from the offender's residence.

The general probability model is not very precise since it assigns the same location to all offenders. In the case of offender S14A, the distance error between the cell with the highest probability and the cell where the offender actually lived is 7.4 miles. On the other hand, the Jtc method uses the distribution of the incidents committed by a particular offender and a model of a typical travel distance distribution to estimate the likely origin of the offender's residence. A travel distance estimate based on the distribution of 41,424 offenders from Baltimore County was created using the CrimeStat journey to crime calibration routine (see Chapter 10 of the CrimeStat manual).

Figure Up. 1.26 shows the results of the Jtc probability output. In this map and the following maps, the bins represent probability ranges of 0.0001. The cell with the highest likelihood is highlighted in light blue. As seen, this cell is very close to the cell where the actual offender lived. The distance between the two cells was 0.34 miles. With the Jtc probability estimate, however, the area with a higher probability (dark red) covers a fairly large area, indicating the relative lack of precision of this method. Still, the precision of the Jtc estimate is good since only 0.03% of the cells have higher probabilities than the cell associated with the area where the offender lived. In other words, the Jtc estimate has produced a very good estimate of the location of the offender, as might be expected given the concentration of the incidents committed by this person.

Figure Up. 1.24:
Figure Up. 1.25:
Figure Up. 1.26:

For this same offender, Figure Up. 1.27 shows the results of the conditional probability estimate of the offender's residence location, that is, the distribution of the likely origin based on the origins of offenders who committed crimes in the same locations as those committed by S14A. Again, the cell with the highest probability is highlighted (in light green). As seen, this method has also produced a fairly close estimate, with the distance between the cell with the highest probability and the cell where the offender actually lived being 0.18 miles, about half the error distance of the Jtc method. Further, the conditional estimate is more precise than the Jtc, with only 0.01% of the cells having a higher probability than the cell associated with the residence of the offender. Thus, the conditional probability estimate is not only more accurate than the Jtc method, but also more precise (i.e., more efficient in terms of search area).

For this same offender, Figure Up. 1.28 shows the results of the Bayesian product estimate, the product of the Jtc probability and the conditional probability re-scaled to be a single probability (i.e., with the sum of the grid cells equal to 1.0). It is a Bayesian estimate because it updates the Jtc probability estimate with the information on the likely origins of offenders who committed crimes in the same locations (the conditional estimate).
Again, the cell with the highest probability is highlighted (in dark tan). The distance error for this method is 0.26 miles, not as accurate as the conditional probability estimate but more accurate than the Jtc estimate. Further, this method is about as precise as the Jtc since 0.03% of the cells have probabilities higher than that associated with the location where the offender lived.

Figure Up. 1.29 shows the results of the Bayesian risk probability estimate. This method takes the Bayesian product estimate and divides it by the general origin probability estimate. It is analogous to a risk measure that relates the number of events to a baseline population. In this case, it is the estimate of the probability of the updated Jtc estimate relative to the probability of where offenders live in general. Again, the cell with the highest likelihood is highlighted (in dark yellow). The Bayesian risk estimate produces an error of 0.34 miles, the same as the Jtc estimate, with 0.04% of the cells having probabilities higher than that associated with the residence of the offender.

Finally, the center of minimum distance (Cmd) is indicated on each of the maps with a grey cross. In this case, the Cmd is not as accurate as any of the other methods since it has an error distance of 0.58 miles.

In summary, all of the Journey to Crime estimation methods produce relatively accurate estimates of the location of the offender (S14A). Given that the incidents committed by this person were within a fairly concentrated pattern, it is not surprising that each of the methods produces reasonable accuracy.

Offender TS15A

But what happens if an offender who did not commit crimes in the same part of town is selected? Figure Up. 1.30 shows the distribution of an offender who committed 15 offenses (TS15A). Of the 15 offenses committed by this individual, there were six larceny thefts, two assaults, two vehicle thefts, one robbery, one burglary, and three incidents of arson. While the distribution of 13 of the offenses is within about a three-mile radius, two of the incidents are more than eight miles away.

Only three of the estimates will be shown. The general method produces an error of 4.6 miles. Figure Up. 1.31 shows the results of the Jtc method. Again, the map bins are in ranges of 0.0001 and the cell with the highest probability is highlighted. As seen, the cell with the highest probability is located north and west of the actual offender's residence. The distance error is 1.89 miles. The precision of this estimate is good, with only 0.08% of the cells having higher probabilities than the cell where the offender lived.

Figure Up. 1.27:
Figure Up. 1.28:
Figure Up. 1.29:
Figure Up. 1.30:
Figure Up. 1.31:

Figure Up. 1.32 shows the result of the conditional probability estimate for this offender. In this case, the conditional probability method is less accurate than the Jtc method, with the distance between the cell with the highest probability and the cell where the offender lived being 2.39 miles. This method is also less precise than the Jtc method, with 1.6% of the study area having probabilities higher than that in the cell where the offender lived.

Finally, Figure Up. 1.33 shows the results of the product probability estimate. For this method, the error distance is only 0.47 miles, much less than the Jtc method. Further, it is smaller than that of the center of minimum distance, which has a distance error of 1.33 miles.
Again, updating the Jtc estimate with information from the conditional estimate produces a more accurate guess of where the offender lives. Further, the product estimate is more precise, with only 0.02% of the study area having probabilities higher than the cell covering the area where the offender lived.

In other words, the "Estimate likely origin of a serial offender" routine allows the estimation of a probability grid based on a single selected method. The user must decide which probability method to select and the routine then calculates that estimate and assigns it to a grid. As mentioned above, the "diagnostics" routine should first be run to decide which method is most appropriate for your jurisdiction. In these 88 cases, the Bayesian product estimate was the most accurate of all the probability methods. But it is not known whether it will be the most accurate for other jurisdictions. Differences in the balance between central city and suburbs, the road network, and land uses may change the travel patterns of offenders. However, so far, as mentioned above, in tests in four cities (Baltimore County, Chicago, the Hague, Manchester), the product estimate has consistently been better than the journey to crime estimate and almost as good, if not better, than the center of minimum distance. Further, the product term appears to be more precise than the journey to crime method. The center of minimum distance, while generally more accurate than the other methods, has no probability distribution; it is simply a point. Consequently, one cannot select a search area from the estimate.

Potential to Add More Information to Improve the Methodology

Further, it should be possible to add more information to this framework to improve the accuracy and precision of the estimates. One obvious dimension that should be added is an opportunity matrix, a distribution of targets that are crime attractions for offenders. Among these are convenience stores, shopping malls, parking lots, and other types of land uses that attract offenders. It will be necessary to create a probability matrix for quantifying these attractions. Further, the opportunity matrix would have to be conditional on the distribution of the crimes and on the distribution of origins of offenders who committed crimes in the same location. The Bayesian framework is a conditional one where factors are added to the framework but conditioned on the distribution of earlier factors, namely

P(Jtc|O) ∝ P(Jtc)*P(O|Jtc)*P(A|O,Jtc)     (Up. 1.43)

where A is the attractions (or opportunities), Jtc is the distribution of incidents, and O is the distribution of other offender origins. It will not be an easy task to estimate an opportunity matrix that is conditioned (dependent) upon both the distribution of offences (Jtc) and the origins of other offenders who committed crimes in the same location (O|Jtc), and it may be necessary to approximate this through a series of filters.

Figure Up. 1.32:
Figure Up. 1.33:

Probability Filters

A filter is a probability matrix that is applied to the estimate but is not conditioned on the existing variables in the model. For example, an opportunity matrix that was independent of the distribution of offences by a single serial offender or the origins of other offenders who committed crimes in the same locations could be applied as an alternative to equation Up. 1.43:

P(Jtc|O) ∝ P(Jtc)*P(O|Jtc)*P(A)     (Up. 1.44)

In this case, P(A) is an independent matrix. Another filter that could be applied is a residential land use filter. The vast majority of offenders are going to live in residential areas. Thus, a residential land use filter that estimates the probability of a residential land use for every cell, P(Rs), could be applied to screen out cells that are not residential:

P(Jtc|O) ∝ P(Jtc)*P(O|Jtc)*P(A)*P(Rs)     (Up. 1.45)
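A minimal sketch (toy surfaces, not CrimeStat code) of applying independent filters such as P(A) and P(Rs) to the product estimate, in the spirit of equations Up. 1.44 and Up. 1.45:

import numpy as np

rng = np.random.default_rng(3)

p_product = rng.random((100, 100))
p_product /= p_product.sum()                     # P(Jtc)*P(O|Jtc), re-scaled to sum to 1.0

p_attractions = rng.random((100, 100))
p_attractions /= p_attractions.sum()             # independent opportunity filter P(A)

residential = rng.random((100, 100)) > 0.4       # cells flagged as residential
p_residential = residential / residential.sum()  # residential land use filter P(Rs)

filtered = p_product * p_attractions * p_residential
filtered /= filtered.sum()                       # re-scale so the cells again sum to 1.0

Because the filters are independent of P(Jtc) and P(O|Jtc), they can simply be multiplied in and the grid re-scaled, which is what equations Up. 1.44 and Up. 1.45 express.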
In this way, additional information can be integrated into the journey to crime methodology to improve the accuracy and precision of the estimates. Clearly, having additional variables be conditioned upon the existing variables in the model would be ideal since that would fit the true Bayesian approach. But, even if independent filters were brought in, the model could be improved.

Summary

In sum, the Bayesian Jtc methodology presented here is an improvement over the current journey to crime method and appears to be as good as, and more useful than, the center of minimum distance. First, it adds new information to the journey to crime function to yield a more accurate and more precise estimate. Second, it can sometimes predict the origin of 'commuter'-type serial offenders, those individuals who do not commit crimes in their neighborhoods (Paulsen, 2007). The traditional journey to crime function cannot predict the origin location of a 'commuter'-type offender. Of course, this will only work if there are prior offenders who lived in the same location as the serial offender of interest. If the offender lives in a neighborhood where there have been no previous serial offenders documented in the origin-destination matrix, the Bayesian approach cannot detect that location, either.

A caveat should be noted, however, in that the Bayesian method still has a substantial amount of error. Much of this error reflects, I believe, the inherent mobility of offenders, especially those living in a suburb such as Baltimore County. While adolescent offenders tend to commit crimes within a more circumscribed area, the ability of an adult to own an automobile and to travel outside the residential neighborhood is turning crime into a much more mobile phenomenon than it was, say, 50 years ago when only about half of American households owned an automobile. Thus, the Bayesian approach to Journey to Crime estimation must be seen as a tool which produces an incremental improvement in accuracy and precision. Geographic profiling is but one tool in the arsenal of methods that police must use to catch serial offenders.

References

Anselin, Luc (2008). "Personal note on the testing of significance of the local Moran values".

Block, Richard and Wim Bernasco (2009). "Finding a serial burglar's home using distance decay and conditional origin-destination patterns: A test of empirical Bayes journey to crime estimation in The Hague". Journal of Investigative Psychology & Offender Profiling, 6(3), 187-211.

Canter, David (2009). "Developments in geographical offender profiling: Commentary on Bayesian journey-to-crime modeling". Journal of Investigative Psychology & Offender Profiling, 6(3), 161-166.

Canter, David (2003). Dragnet: A Geographical Prioritisation Package. Center for Investigative Psychology, Department of Psychology, The University of Liverpool: Liverpool, UK. http://www.i-psy.com/publications/publications_dragnet.php.

Canter, D., Coffey, T., Huntley, M., & Missen, C. (2000). "Predicting serial killers' home base using a decision support system". Journal of Quantitative Criminology, 16(4), 457-478.

Canter, D. and A. Gregory (1994). "Identifying the residential location of rapists", Journal of the Forensic Science Society, 34(3), 169-175.
Chainey, Spencer and Jerry Ratcliffe (2005). GIS and Crime Mapping. John Wiley & Sons, Inc.: Chichester, Sussex, England.

Denison, D.G.T., C.C. Holmes, B.K. Mallick, and A.F.M. Smith (2002). Bayesian Methods for Nonlinear Classification and Regression. John Wiley & Sons, Ltd: New York.

Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin (2004). Bayesian Data Analysis (second edition). Chapman & Hall/CRC: Boca Raton, FL.

Getis, A. and J.K. Ord (1992). "The analysis of spatial association by use of distance statistics", Geographical Analysis, 24, 189-206.

Hansen, Katherine (1991). "Head-banging: robust smoothing in the plane". IEEE Transactions on Geoscience and Remote Sensing, 29(3), 369-378.

Kanji, Gopal K. (1993). 100 Statistical Tests. Sage Publications: Thousand Oaks, CA.

Khan, Ghazan, Xiao Qin, and David A. Noyce (2006). "Spatial analysis of weather crash patterns in Wisconsin". 85th Annual Meeting of the Transportation Research Board: Washington, DC.

Lee, Jay and David W. S. Wong (2001). Statistical Analysis with ArcView GIS. J. Wiley & Sons, Inc.: New York.

Lee, Peter M. (2004). Bayesian Statistics: An Introduction (third edition). Hodder Arnold: London.

Lees, Brian (2006). "The spatial analysis of spectral data: Extracting the neglected data", Applied GIS, 2(2), 14.1-14.13.

Leitner, Michael and Joshua Kent (2009). "Bayesian journey to crime modeling of single- and multiple crime type series in Baltimore County, MD". Journal of Investigative Psychology & Offender Profiling, 6(3), 213-236.

Levine, Ned and Richard Block (2010). "Bayesian Journey-to-Crime estimation: An improvement in geographic profiling methodology". The Professional Geographer. In press.

Levine, Ned and Patsy Lee (2009). "Bayesian journey to crime modeling of juvenile and adult offenders by gender in Manchester". Journal of Investigative Psychology & Offender Profiling, 6(3), 237-251.

Levine, Ned (2009). "Introduction to the special issue on Bayesian Journey-to-crime modeling". Journal of Investigative Psychology & Offender Profiling, 6(3), 167-185.

Levine, Ned (2005). "The evaluation of geographic profiling software: Response to Kim Rossmo's critique of the NIJ methodology". http://www.nedlevine.com/Response to Kim Rossmo Critique of the GP Evaluation Methodology.May 8 2005.doc

Levine, Ned (2004). "Journey to Crime Estimation". Chapter 10 of Ned Levine (ed), CrimeStat III: A Spatial Statistics Program for the Analysis of Crime Incident Locations (version 3.0). Ned Levine & Associates, Houston, TX; National Institute of Justice, Washington, DC. November. http://www.icpsr.umich.edu/crimestat. Originally published August 2000.

Mungiole, Michael, Linda W. Pickle, and Katherine H. Simonson (2002). "Application of a weighted Head-Banging algorithm to mortality data maps", Statistics in Medicine, 18, 3201-3209.

Mungiole, Michael and Linda Williams Pickle (1999). "Determining the optimal degree of smoothing using the weighted head-banging algorithm on mapped mortality data", in ASC '99 - Leading Survey & Statistical Computing into the New Millennium, Proceedings of the ASC International Conference, September. Available at http://srab.cancer.gov/headbang.

O'Leary, Mike (2009). "The mathematics of geographical profiling". Journal of Investigative Psychology & Offender Profiling, 6(3), 253-265.

Ord, J.K. and A. Getis (1995). "Local spatial autocorrelation statistics: Distributional issues and an application". Geographical Analysis, 27, 286-306.
Paulsen, Derek (2007). "Improving geographic profiling through commuter/marauder prediction". Police Practice and Research, 8, 347-357.

Paulsen, Derek (2006a). "Connecting the dots: assessing the accuracy of geographic profiling software". Policing: An International Journal of Police Strategies and Management, 20(2), 306-334.

Paulsen, Derek (2006b). "Human versus machine: A comparison of the accuracy of geographic profiling methods". Journal of Investigative Psychology and Offender Profiling, 3, 77-89.

Pickle, Linda W. and Yuchen Su (2002). "Within-State geographic patterns of health insurance coverage and health risk factors in the United States", American Journal of Preventive Medicine, 22(2), 75-83.

Pickle, Linda Williams, Michael Mungiole, Gretchen K. Jones, and Andrew A. White (1996). Atlas of United States Mortality. National Center for Health Statistics: Hyattsville, MD.

Rich, T. and M. Shively (2004). A Methodology for Evaluating Geographic Profiling Software. Final Report for the National Institute of Justice, Abt Associates: Cambridge, MA. http://www.ojp.usdoj.gov/nij/maps/gp.pdf

Rossmo, D. Kim (2005a). "Geographic heuristics or shortcuts to failure?: Response to Snook et al.". Applied Cognitive Psychology, 19, 651-654.

Rossmo, D. Kim (2005b). "Response to NIJ's methodology for evaluating geographic profiling software". http://www.ojp.usdoj.gov/nij/maps/gp.htm.

Rossmo, D. Kim and S. Filer (2005). "Analysis versus guesswork". Blue Line Magazine, August/September, 24-26.

Rossmo, D. Kim (2000). Geographic Profiling. CRC Press: Boca Raton, FL.

Rossmo, D. Kim (1995). "Overview: multivariate spatial profiles as a tool in crime investigation". In Carolyn Rebecca Block, Margaret Dabdoub and Suzanne Fregly (eds), Crime Analysis Through Computer Mapping. Police Executive Research Forum: Washington, DC, 65-97.

Siegel, Sidney (1956). Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill: New York.

Snook, Brent, Michele Zito, Craig Bennell, and Paul J. Taylor (2005). "On the complexity and accuracy of geographic profiling strategies". Journal of Quantitative Criminology, 21(1), 1-26.

Snook, Brent, Paul Taylor and Craig Bennell (2004). "Geographic profiling: the fast, frugal and accurate way". Applied Cognitive Psychology, 18, 105-121.

Tukey, P. A. and J.W. Tukey (1981). "Graphical display of data sets in 3 or more dimensions". In V. Barnett (ed), Interpreting Multivariate Data. John Wiley & Sons: New York.

Wikipedia (2007a). "Geometric mean", http://en.wikipedia.org/wiki/Geometric_mean, and "Weighted geometric mean", http://en.wikipedia.org/wiki/Weighted_geometric_mean.

Wikipedia (2007b). "Harmonic mean", http://en.wikipedia.org/wiki/Harmonic_mean, and "Weighted harmonic mean", http://en.wikipedia.org/wiki/Weighted_harmonic_mean.