Spatial structure for multidimensional spatial lattice data

Fumio Ishioka1 and Koji Kurihara2

1 Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-naka Okayama 700-8530, Japan fishioka@ems.okayama-u.ac.jp
2 Graduate School of Environmental Science, Okayama University, 3-1-1 Tsushima-naka Okayama 700-8530, Japan kurihara@ems.okayama-u.ac.jp

Summary. We address the problem of detecting areas with significantly high values (hotspots) in spatial lattice data. The spatial scan statistic [Kul97] is an effective tool for hotspot detection, and many techniques have been proposed for scanning the areas. We apply the echelon technique to find hotspot candidates. Echelon analysis [MPJ97] is a method for investigating the phase structure of spatial data systematically and objectively. Based on the hierarchical structure that it produces, we scan the areas from the upper echelon to the bottom. In this paper, we explore the structure of spatial lattice data using echelon analysis and extend it to 3-dimensional and 4-dimensional spatial lattice data. In addition, we detect a hotspot based on the echelon structure using the spatial scan statistic.

Key words: lattice data, echelon analysis, spatial scan statistic, hotspot detection

1 Introduction

Interest in the statistical analysis of spatial data has grown in many scientific fields. Spatial data consist of measurements or observations taken at specific locations or within specific areas. Lattice data are observations associated with spatial areas, and neighbor information for these areas is generally available. An example of regular spatial lattice data is remote sensing data, which divide the area into a series of small rectangles (cells). Cancer rates by county are an example of irregular lattice data. There are few approaches to the structural analysis of such spatial lattice data.
Echelon analysis [MPJ97] is a method for investigating the phase structure of spatial data systematically and objectively, based on the neighbor information between cells. It is useful for identifying areas of interest in the regional monitoring of a surface variable. We explore the structure of lattice data of various dimensions based on echelon analysis. In addition, we detect areas with significantly higher values in these spatial lattice data. These areas are called hotspots.

Many methods have been proposed for detecting hotspot areas. Naus [Nau65] studied the detection of hotspot areas in random data using a fixed rectangle. Turnbull et al. [TIB90] and Kulldorff and Nagarwalla [KN95] detected hotspots by scanning the area with a circle of predetermined shape and with circles of various types, respectively. Recently, Patil and Taillie [PT04] proposed an approach that detects hotspots using a tree structure built on the data. Among these methods, the spatial scan statistic [Kul97] is the most useful tool for detecting hotspots. It is a method of detection and inference for areas with significantly high or low rates, based on the likelihood ratio. We use the echelon technique to calculate the spatial scan statistic.

In this paper, we explore the structure of spatial lattice data using echelon analysis. We treat not only two-dimensional but also 3-dimensional and 4-dimensional spatial lattice data by defining the neighbor information. Their spatial structure is demonstrated by hierarchical graphical representations with some examples. In addition, we detect a hotspot based on the echelon structure using the spatial scan statistic.
2 Echelon analysis

2.1 Echelon analysis for one-dimensional spatial lattice data

Echelon analysis is based on the areas of relatively high and low values of the response variable in spatial lattice data. The echelon approach aggregates the areas in which the values have the same topological structure and builds a hierarchical structure of these areas, based on the connective (neighbor) information between cells.

One-dimensional spatial lattice data have the position $i$ on the horizontal axis and the value $h_i$ on the vertical axis. For lattice (interval) data divided into $D_1$ intervals, data are taken on the intervals $l_1(i) = (i-1, i]$, $i = 1, 2, \ldots, D_1$. We denote the neighbor information of $l_1(i)$ by $NB(i)$. For one-dimensional spatial lattice data, $NB(i)$ is given by

\[
NB(i) =
\begin{cases}
\{i+1\}, & i = 1 \\
\{i-1,\ i+1\}, & 1 < i < D_1 \\
\{i-1\}, & i = D_1
\end{cases}
\]

Table 1. One-dimensional spatial interval data.

i     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
ID    A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y
h_i   1  2  3  4  3  4  5  4  3  2  3  4  5  6  5  6  7  6  5  4  3  2  1  2  1

Table 1 shows the 25 intervals, named A to Y in order, and their values (e.g., A = 1 and Q = 7). Using $NB(i)$, we draw the cross-sectional view of a topographical map, as in Figure 1.

Fig. 1. The hypothetical set of hills in one-dimensional spatial lattice data.

These hills contain nine numbered parts, each with the same topological structure. These parts are called echelons. The echelons consist of peaks, foundations of peaks, and foundations of foundations. Echelons 1, 2, 3, 4 and 5 are the peaks of the hills. Echelons 6 and 7 are foundations of two peaks. Echelon 8 is the foundation of two foundations. Echelon 9 is the foundation of a foundation and a peak, and is also called the root. The graphical representation is given by the dendrogram shown in Figure 2. In the dendrogram, the symbol "×" shows the position of the specified interval.
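As a rough illustration of how $NB(i)$ is used, the following Python sketch applies it to the values of Table 1 and lists the intervals whose value exceeds that of every neighbor. The names `nb`, `local_maxima`, `H` and `LABELS` are chosen for this illustration and do not come from the paper. The sketch only locates the tops of the five peak echelons; it does not build the foundations or the dendrogram.

```python
# Values h_i of Table 1 for i = 1..25, with interval labels A..Y.
H = [1, 2, 3, 4, 3, 4, 5, 4, 3, 2, 3, 4, 5, 6, 5, 6, 7, 6, 5, 4, 3, 2, 1, 2, 1]
LABELS = "ABCDEFGHIJKLMNOPQRSTUVWXY"


def nb(i, d1):
    """NB(i) for a 1-based position i on a lattice of d1 intervals."""
    return [j for j in (i - 1, i + 1) if 1 <= j <= d1]


def local_maxima(h):
    """1-based positions whose value is strictly above every neighbor's value."""
    d1 = len(h)
    return [i for i in range(1, d1 + 1)
            if all(h[i - 1] > h[j - 1] for j in nb(i, d1))]


peaks = [LABELS[i - 1] for i in local_maxima(H)]
print(peaks)  # ['D', 'G', 'N', 'Q', 'X']
```

The strict comparison is sufficient here because Table 1 contains no plateau of equal neighboring values; data with ties would need an explicit tie rule.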
We can form spatial clusters using the neighbor information, based on echelon analysis. In this figure, D and T belong to different clusters although they have the same value.

Fig. 2. The echelon dendrogram for one-dimensional spatial lattice data.

2.2 Echelon analysis for two-dimensional spatial lattice data

Two-dimensional spatial lattice data have the value $h_{i,j}$ of the response variable within the area $l_2(i, j)$. Remote sensing data and mesh data are typical examples. These spatial data are given on a $D_1 \times D_2$ lattice area, and the neighbor information of cell $l_2(i, j)$ for $i = 1, 2, \ldots, D_1$, $j = 1, 2, \ldots, D_2$ is given by

\[
NB(l_2(i,j)) = \{(a,b) \mid i-1 \le a \le i+1,\ j-1 \le b \le j+1\} \cap \{(a,b) \mid 1 \le a \le D_1,\ 1 \le b \le D_2\} - \{(i,j)\}
\]

where $A - B = A \cap B^c$ for sets $A$ and $B$, and $B^c$ denotes the complement of $B$.

For illustration, we use the digital values over a 5 × 5 array shown in Table 2. Each cell holds a response value $h_{i,j}$, and the values run from 1 to 25. The graphical representation of these array data is shown as the dendrogram in Figure 3. We find that the array data are composed of 7 echelons, consisting of 4 peaks and 3 foundations. The dendrogram expresses the hierarchical structure of the 5 × 5 array data shown in Table 2. In addition, each cell is shown by an abbreviated label in the dendrogram.

Table 2. The digital values over a 5 × 5 array.

      1   2   3   4   5
A     2  10   4  20  16
B    24   1  13  21   6
C     8  14  19  12   9
D    15  22  23  11  18
E     3   5  25  17   7

2.3 Echelon analysis for 3 or 4-dimensional spatial lattice data

3-dimensional spatial lattice data consist of regularly stacked two-dimensional spatial lattice data, as shown in Figure 4. Each cell has the value $h_{i,j,k}$ of the response variable within the area $l_3(i, j, k)$.
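The neighbor rule of Section 2.2 lets every index vary by at most one, and the same rule applies unchanged when two-dimensional lattices are stacked. The Python sketch below writes that rule for any number of dimensions and then sweeps the Table 2 values from high to low, counting how many groups start (peaks) and how many times groups join. Treating each join as the start of one foundation is an assumption of this sketch that relies on all values being distinct; it is not the authors' echelon algorithm, and the names `moore_neighbors` and `count_peaks_and_merges` are hypothetical.

```python
from itertools import product

# Table 2: rows A..E, columns 1..5 (0-based indices in the code).
TABLE2 = [
    [2, 10, 4, 20, 16],
    [24, 1, 13, 21, 6],
    [8, 14, 19, 12, 9],
    [15, 22, 23, 11, 18],
    [3, 5, 25, 17, 7],
]


def moore_neighbors(cell, shape):
    """Cells inside the lattice whose every index differs from `cell` by at most one."""
    ranges = [range(max(c - 1, 0), min(c + 2, n)) for c, n in zip(cell, shape)]
    return [nbr for nbr in product(*ranges) if nbr != cell]


def count_peaks_and_merges(values, shape):
    """Visit cells from the highest value down, tracking connected groups.

    A cell with no visited neighbor starts a new group (a peak). A cell that
    touches two or more distinct groups joins them (counted as one merge).
    Assumes that all values are distinct.
    """
    parent = {}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    peaks = merges = 0
    for cell in sorted(values, key=values.get, reverse=True):
        roots = {find(n) for n in moore_neighbors(cell, shape) if n in parent}
        parent[cell] = cell
        if not roots:
            peaks += 1
        elif len(roots) > 1:
            merges += 1
        for r in roots:
            parent[r] = cell  # attach every touched group to the new cell
    return peaks, merges


values = {(i, j): TABLE2[i][j] for i in range(5) for j in range(5)}
print(count_peaks_and_merges(values, (5, 5)))        # (4, 3)
print(len(moore_neighbors((0, 0, 0), (3, 3, 3))))    # 7
print(len(moore_neighbors((1, 1, 1), (3, 3, 3))))    # 26
```

For Table 2 the sweep gives 4 peaks and 3 merges, which agrees with the 4 peaks and 3 foundations reported in Section 2.2. The last two lines show the rule in three dimensions: a corner cell of a 3 × 3 × 3 lattice has 7 neighbors and the central cell has 26.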
Then, the neighbor information of cell $l_3(i, j, k)$ is given by

\[
\begin{aligned}
NB(l_3(i,j,k)) ={}& \{(a,b,c) \mid i-1 \le a \le i+1,\ j-1 \le b \le j+1,\ k-1 \le c \le k+1\} \\
& \cap \{(a,b,c) \mid 1 \le a \le D_1,\ 1 \le b \le D_2,\ 1 \le c \le D_3\} - \{(i,j,k)\}
\end{aligned}
\tag{1}
\]

where $i = 1, 2, \ldots, D_1$, $j = 1, 2, \ldots, D_2$ and $k = 1, 2, \ldots, D_3$. These data can also be regarded as cube data consisting of $D_1 \times D_2 \times D_3$ cubes.

Fig. 3. The echelon dendrogram for the digital values over a 5 × 5 array.

Fig. 4. The structure of 3-dimensional spatial lattice data.

4-dimensional spatial lattice data are constructed as $T \times l_3(i, j, k)$, as shown in Figure 5; they have the value $h_{i,j,k,t}$, $t = 1, 2, \ldots, T$, of the response variable within the area $l_4(i, j, k, t)$. In the time direction, we define the neighbors of the area $l_4(i, j, k, t)$ to be the areas $l_4(i, j, k, t-1)$ and $l_4(i, j, k, t+1)$, whenever these lie within $1 \le t \le T$. The neighbor information of area $l_4(i, j, k, t)$ is defined by

\[
\begin{aligned}
NB(l_4(i,j,k,t)) ={}& \{(a,b,c,d) \mid i-1 \le a \le i+1,\ j-1 \le b \le j+1,\ k-1 \le c \le k+1,\ d = t\} \\
& \cap \{(a,b,c,d) \mid 1 \le a \le D_1,\ 1 \le b \le D_2,\ 1 \le c \le D_3,\ d = t\} \\
& \cup \{(i,j,k,t-1)\} \cup \{(i,j,k,t+1)\} - \{(i,j,k,t)\}
\end{aligned}
\tag{2}
\]

where $i = 1, 2, \ldots, D_1$, $j = 1, 2, \ldots, D_2$, $k = 1, 2, \ldots, D_3$ and $t = 1, 2, \ldots, T$.

Fig. 5. The structure of 4-dimensional spatial data.

For illustration, we use the digital values over a 3 × 3 array with $D_3 = 3$ and $T = 3$ shown in Table 3. The graphical representation of these array data is shown as the dendrogram in Figure 6. The label "A1-2-3" in the dendrogram denotes the position (A1, $D3_2$, $T_3$) of the spatial lattice data shown in Table 3. We find that these array data are composed of 12 echelons, consisting of 7 peaks and 5 foundations.

Table 3. The digital values over a 3 × 3 array with $D_3 = 3$ and $T = 3$.

                 T1                T2                T3
             A    B    C       A    B    C       A    B    C
D3_1   1    97   78  143      23  216   41      28   70   56
       2    60   42   49      30  172   28      10  217  236
       3    46   40   16      86  256   61      67   68   42
D3_2   1    39   65   45     266   49   25      14   35  278
       2     8   38   11     147   29   30      64    9    6
       3    56   86   28     241  191   32      59   56   73
D3_3   1    27   54   84      35   65  232      38   24   25
       2    71   10   15     300   46   24     289   19   45
       3    36  100   73       1   54   26      20   62   40

Fig. 6. The echelon dendrogram for the 4-dimensional spatial lattice data.

3 Detection of hotspot for spatial lattice data

3.1 Spatial scan statistics

The spatial scan statistic is used to detect areas with significantly high or low rates and to find the features of the data. These areas are called hotspots. Suppose that the hotspot candidate area $Z$ lies within the whole area $G$. Each individual within the area $Z$ has the attribute with probability $p_1$, while each individual outside the area $Z$ has it with probability $p_2$. Individuals are mutually independent. The null hypothesis is $H_0: p_1 = p_2$, and the alternative hypothesis for detecting a high rate is $H_1: p_1 > p_2$. Let $n(G)$ be the total population in the whole area $G$ and $n(Z)$ the population within the area $Z$. Let $c(G)$ be the total number of individuals with the attribute in the whole area $G$ and $c(Z)$ the number within the area $Z$.

Here, we consider the model based on the Poisson distribution. The likelihood function can then be written as

\[
L(Z, p_1, p_2) = \frac{\exp[-p_1 n(Z) - p_2 (n(G) - n(Z))]}{c(G)!}\; p_1^{c(Z)}\, p_2^{c(G)-c(Z)} \prod_{x_i} n(x_i)
\]

To maximize the likelihood function, we first maximize it conditionally on the area $Z$. Substituting the maximum likelihood estimators $\hat{p}_1 = c(Z)/n(Z)$ and $\hat{p}_2 = (c(G) - c(Z))/(n(G) - n(Z))$ gives

\[
L(Z) = \frac{\exp[-c(G)]}{c(G)!} \left(\frac{c(Z)}{n(Z)}\right)^{c(Z)} \left(\frac{c(G) - c(Z)}{n(G) - n(Z)}\right)^{c(G)-c(Z)} \prod_{x_i} n(x_i)
\]

To detect hotspots, the likelihood ratio $\lambda$ is maximized over all subset areas of the whole area:

\[
\lambda = \frac{\max_Z L(Z)}{L_0}
= \frac{\max_Z \left(\dfrac{c(Z)}{n(Z)}\right)^{c(Z)} \left(\dfrac{c(G) - c(Z)}{n(G) - n(Z)}\right)^{c(G)-c(Z)}}{\left(\dfrac{c(G)}{n(G)}\right)^{c(G)}}
\]

Here, $L_0$ is the following likelihood function under the null hypothesis.

\[
L_0 = \frac{\exp[-c(G)]}{c(G)!} \left(\frac{c(G)}{n(G)}\right)^{c(G)} \prod_{x_i} n(x_i)
\]
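Because the powers in the likelihood ratio overflow quickly, it is usually easier to work with its logarithm. The Python sketch below evaluates $\log(L(Z)/L_0)$ for a single candidate area and keeps the best candidate among those with a higher rate inside than outside, as $H_1$ requires. The function names and the candidate counts are hypothetical and serve only as an illustration; the sketch assumes that the candidate areas are supplied by some other procedure.

```python
import math


def log_likelihood_ratio(c_z, n_z, c_g, n_g):
    """log(L(Z) / L0) under the Poisson model, for one candidate area Z.

    c_z, c_g: attribute counts in Z and in G; n_z, n_g: populations.
    Requires 0 < c_z < c_g and 0 < n_z < n_g.
    """
    inside = c_z * math.log(c_z / n_z)
    outside = (c_g - c_z) * math.log((c_g - c_z) / (n_g - n_z))
    null = c_g * math.log(c_g / n_g)
    return inside + outside - null


def scan(candidates, c_g, n_g):
    """Return (best log ratio, name) over candidates with a high rate inside."""
    best = None
    for name, (c_z, n_z) in candidates.items():
        if c_z / n_z <= (c_g - c_z) / (n_g - n_z):
            continue  # H1 asks for p1 > p2, so skip low-rate areas
        score = log_likelihood_ratio(c_z, n_z, c_g, n_g)
        if best is None or score > best[0]:
            best = (score, name)
    return best


# Hypothetical candidates: name -> (c(Z), n(Z)), with c(G) = 100 and n(G) = 1000.
candidates = {"Z1": (30, 100), "Z2": (45, 200), "Z3": (12, 150)}
print(scan(candidates, c_g=100, n_g=1000))
```

The factors $\exp[-c(G)]$, $c(G)!$ and $\prod n(x_i)$ are common to $L(Z)$ and $L_0$, so they cancel and do not appear in the code.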
The test statistic $\lambda$ is also written as

\[
\lambda = \max_Z \left(\frac{c(Z)}{e(Z)}\right)^{c(Z)} \left(\frac{c(G) - c(Z)}{e(G) - e(Z)}\right)^{c(G)-c(Z)}
\]

where $e(Z)$ is the expected number with the attribute within the area $Z$ under the null hypothesis, and $e(G) = c(G)$.

3.2 Detection of hotspot

We detect a hotspot for the spatial lattice data shown in Table 3. For this purpose, a scan method based on the echelon structure is effective. The hotspot candidate is expected to be located in the top echelon of the dendrogram. We calculate the value of the spatial scan statistic for the aggregated areas, from the cell at the top position to the cell at the bottom position, in the first peak of the echelon dendrogram. Here, the expected value $e(z)$ of each cell $z$ is given by the mean of $h_{i,j,k,t}$:

\[
e(z) = \sum_i \sum_j \sum_k \sum_t h_{i,j,k,t} / 81
\]

As a result, the hotspot candidate consists of 12 cells, as shown in Figure 7, and the value of the statistic, expressed as $\log \lambda$, is 1694.84. Using echelon analysis and the spatial scan statistic, we find that the cell (C1, $D3_3$, $T_2$) is not part of the hotspot candidate, although its value is higher than that of cell (B2, $D3_1$, $T_2$) or (A2, $D3_2$, $T_2$).

Fig. 7. The hotspot candidate of 4-dimensional spatial lattice data.

4 Conclusions

In this paper, we applied echelon analysis to spatial lattice data of various dimensions. We also showed the topological structure of complex data such as 4-dimensional spatial lattice data. In addition, the hotspot candidate was obtained from the echelon structure by using the spatial scan statistic. This method can be applied to any spatial lattice data for which the neighbor information of each cell can be defined.

References

[KN95] Kulldorff, M., Nagarwalla, N.: Spatial disease clusters: Detection and inference. Statistics in Medicine, 14, 799–810 (1995)
[Kul97] Kulldorff, M.: A spatial scan statistic. Communications in Statistics, Theory and Methods, 26, 1481–1496 (1997)
[Kur04] Kurihara, K.: Classification of geospatial lattice data and their graphical representation. Classification, Clustering, and Data Mining Applications, Springer, 251–258 (2004)
[MPJ97] Myers, W.L., Patil, G.P., Joly, K.: Echelon approach to areas of concern in synoptic regional monitoring. Environmental and Ecological Statistics, 4, 131–152 (1997)
[Nau65] Naus, J.I.: Clustering of random points in two dimensions. Biometrika, 52, 263–267 (1965)
[PT04] Patil, G.P., Taillie, C.: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11, 183–197 (2004)
[TIB90] Turnbull, B.W., Iwano, E., Burnett, W., Howe, H., Clark, L.: Monitoring for clusters of disease: Application to leukemia incidence in upstate New York. American Journal of Epidemiology, 132, 136–143 (1990)