Spatial structure for multidimensional spatial lattice data
Fumio Ishioka^1 and Koji Kurihara^2
^1 Graduate School of Natural Science and Technology, Okayama University, 3-1-1 Tsushima-naka, Okayama 700-8530, Japan, fishioka@ems.okayama-u.ac.jp
^2 Graduate School of Environmental Science, Okayama University, 3-1-1 Tsushima-naka, Okayama 700-8530, Japan, kurihara@ems.okayama-u.ac.jp
Summary. We address the problem of detecting areas with significantly high values (hotspots) in spatial lattice data. The spatial scan statistic [Kul97] is one of the effective tools for detecting hotspots, and many techniques have been proposed for scanning the candidate areas. We apply the echelon technique to find hotspot candidates. Echelon analysis [MPJ97] is a method for investigating the phase structure of spatial data systematically and objectively. Based on the hierarchical structure it produces, we scan the areas from the upper echelons to the bottom. In this paper, we explore the structure of spatial lattice data using echelon analysis and extend it to 3-dimensional and 4-dimensional spatial lattice data. In addition, we detect a hotspot based on the echelon structure using the spatial scan statistic.
Key words: lattice data, echelon analysis, spatial scan statistic, hotspot detection
1 Introduction
Interest in the statistical analysis of spatial data has grown in many scientific fields. Spatial data consist of measurements or observations taken at specific locations or within specific areas. Lattice data are observations associated with spatial areas, and neighbor information for these areas is generally available. An example of regular spatial lattice data is remote sensing data, which divide an area into a series of small rectangles (cells). Cancer rates by county are an example of irregular lattice data. There are few approaches to analyzing the structure of such spatial lattice data. Echelon analysis [MPJ97] is a method for investigating the phase structure of spatial data systematically and objectively, based on neighbor information between cells. It is useful for identifying areas of interest in regional monitoring of a surface variable. We explore the structure of lattice data of various dimensions based on echelon analysis.
In addition, we detect areas with significantly high values in these spatial lattice data. These areas are called hotspots. Many methods have been proposed for detecting hotspots. Naus [Nau65] studied the detection of hotspot areas using a fixed rectangular window over random data. Turnbull et al. [TIB90] and Kulldorff and Nagarwalla [KN95] detected hotspots by scanning the area with a circle of fixed form and with circles of various sizes, respectively. Recently, Patil and Taillie [PT04] proposed an approach to hotspot detection that uses a tree structure built on the data. Among these methods, the spatial scan statistic [Kul97] is among the most useful tools for detecting hotspots. The spatial scan statistic is a method of detection and inference for areas of significantly high or low rates, based on the likelihood ratio. We use the echelon technique to calculate the spatial scan statistic.
In this paper, we explore the structure of spatial lattice data using echelon analysis. By defining the neighbor information, we treat not only two-dimensional but also 3-dimensional and 4-dimensional spatial lattice data. Their spatial structure is demonstrated with examples through a hierarchical graphical representation. In addition, we detect a hotspot based on the echelon structure using the spatial scan statistic.
2 Echelon analysis
2.1 Echelon analysis for one-dimensional spatial lattice data
Echelon analysis is based on the areas of relatively high and low values of the response variable in spatial lattice data. The echelon approach aggregates the areas whose values have the same topological structure and builds a hierarchy of these areas, based on connectivity (neighbor) information between cells. One-dimensional spatial lattice data have the position i and the value h_i on the horizontal and vertical axes, respectively. For lattice (interval) data divided into D_1 intervals, data are taken on the intervals l_1(i) = (i − 1, i], i = 1, 2, ..., D_1. We denote the neighbor information of l_1(i) by NB(i).
For one-dimensional spatial lattice data, NB(i) is given by
\[
NB(i) =
\begin{cases}
\{i + 1\}, & i = 1 \\
\{i - 1,\ i + 1\}, & 1 < i < D_1 \\
\{i - 1\}, & i = D_1
\end{cases}
\]
Table 1. One-dimensional spatial interval data.
i     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
ID    A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y
h_i   1  2  3  4  3  4  5  4  3  2  3  4  5  6  5  6  7  6  5  4  3  2  1  2  1
Table 1 shows the 25 intervals, named A to Y in order, and their values (e.g., A = 1 and Q = 7). Using NB(i), we can draw the data as a cross-sectional view of a topographical map, as in Figure 1.
Fig. 1. The hypothetical set of hills in one-dimensional spatial lattice data.
There are nine numbered parts with the same topological structure in these hills. These parts are called echelons. Echelons consist of peaks, foundations of peaks, and foundations of foundations. Echelons 1, 2, 3, 4 and 5 are the peaks of the hills. Echelons 6 and 7 are each the foundation of two peaks. Echelon 8 is the foundation of two foundations. Echelon 9 is the foundation of a foundation and a peak, and is also called the root. The corresponding graphical representation is the dendrogram shown in Figure 2. In the dendrogram, the symbol "×" marks the position of each interval. Echelon analysis lets us form spatial clusters using the neighbor information. In this figure, D and T belong to different clusters, although they have the same value.
Fig. 2. The echelon dendrogram for one-dimensional spatial lattice data
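As a small illustration of how NB(i) is used, the following Python sketch finds the intervals of Table 1 whose value exceeds that of every neighbor. The function names `nb_1d` and `peaks_1d` are hypothetical and not from the paper, and the sketch only locates the tops of the hills; it does not build the full echelon dendrogram.

```python
# Values h_i from Table 1; intervals are labelled A..Y for i = 1..25.
H = [1, 2, 3, 4, 3, 4, 5, 4, 3, 2, 3, 4, 5, 6, 5, 6, 7, 6, 5, 4, 3, 2, 1, 2, 1]
IDS = "ABCDEFGHIJKLMNOPQRSTUVWXY"


def nb_1d(i, d1):
    """Neighbor set NB(i) of interval i (1-based) on a lattice of d1 intervals."""
    return {j for j in (i - 1, i + 1) if 1 <= j <= d1}


def peaks_1d(h):
    """1-based positions whose value is strictly above every neighbor's value."""
    d1 = len(h)
    return [i for i in range(1, d1 + 1)
            if all(h[i - 1] > h[j - 1] for j in nb_1d(i, d1))]


peak_ids = [IDS[i - 1] for i in peaks_1d(H)]
print(peak_ids)  # ['D', 'G', 'N', 'Q', 'X']
```

The five intervals found this way are the tops of the five hills, which agrees with the five peak echelons described above.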
2.2 Echelon analysis for two-dimensional spatial lattice data
Two-dimensional spatial lattice data have the value h_{i,j} of the response variable within the area l_2(i, j). Remote sensing data and mesh data are typical examples. Such data are given on a D_1 × D_2 lattice, and the neighbor information of cell l_2(i, j) for i = 1, 2, ..., D_1, j = 1, 2, ..., D_2 is given by
\[
NB(l_2(i,j)) = \{(a,b) \mid i-1 \le a \le i+1,\ j-1 \le b \le j+1\} \cap \{(a,b) \mid 1 \le a \le D_1,\ 1 \le b \le D_2\} - \{(i,j)\}
\]
where A − B = A ∩ B^c for sets A and B, and B^c denotes the complement of B.
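The set expression above can be read as "take the 3 × 3 window around (i, j), clip it to the lattice, and drop the cell itself". A minimal Python sketch of that reading follows; the function name `nb_2d` is an assumption made for illustration.

```python
from itertools import product


def nb_2d(i, j, d1, d2):
    """Neighbor set NB(l_2(i, j)) on a d1 x d2 lattice, with 1-based indices."""
    window = product(range(i - 1, i + 2), range(j - 1, j + 2))
    return {(a, b) for a, b in window
            if 1 <= a <= d1 and 1 <= b <= d2 and (a, b) != (i, j)}


print(len(nb_2d(1, 1, 5, 5)))  # corner cell: 3
print(len(nb_2d(1, 3, 5, 5)))  # edge cell: 5
print(len(nb_2d(3, 3, 5, 5)))  # interior cell: 8
```

On a 5 × 5 lattice a corner cell therefore has 3 neighbors, an edge cell 5, and an interior cell 8.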
For illustration, we use the digital values over the 5 × 5 array shown in Table 2. Each cell holds a response value h_{i,j}, and the values run from 1 to 25. The graphical representation of these array data is the dendrogram in Figure 3. The array data consist of 7 echelons: 4 peaks and 3 foundations. The dendrogram expresses the hierarchical structure of the 5 × 5 array shown in Table 2. In addition, each cell is identified in the dendrogram by an abbreviated label.
Table 2. The digital values over a 5 × 5 array.
      A   B   C   D   E
1     2  24   8  15   3
2    10   1  14  22   5
3     4  13  19  23  25
4    20  21  12  11  17
5    16   6   9  18   7
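The counts of peaks and foundations for Table 2 can be checked with a short Python sketch. It is a simplification, not the authors' echelon algorithm: it sweeps the cells from the highest value to the lowest, keeps track of connected components with a union-find structure, counts a peak whenever a cell has no higher neighbor, and counts a foundation whenever a cell joins two or more existing components. It assumes all values are distinct; the function names are hypothetical.

```python
from itertools import product

# Table 2, read with rows 1..5 and columns A..E (an assumed orientation;
# the peak and foundation counts are the same for the transposed array).
GRID = [
    [2, 24, 8, 15, 3],
    [10, 1, 14, 22, 5],
    [4, 13, 19, 23, 25],
    [20, 21, 12, 11, 17],
    [16, 6, 9, 18, 7],
]


def neighbors(cell, shape):
    """All cells within one step in every coordinate, clipped to the lattice."""
    ranges = [range(max(0, x - 1), min(n, x + 2)) for x, n in zip(cell, shape)]
    return [c for c in product(*ranges) if c != cell]


def count_peaks_and_merges(grid):
    """Sweep cells from high to low value, tracking connected components."""
    shape = (len(grid), len(grid[0]))
    parent = {}  # union-find forest over the cells processed so far

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    peaks = merges = 0
    cells = sorted(product(*map(range, shape)), key=lambda c: -grid[c[0]][c[1]])
    for cell in cells:
        roots = {find(n) for n in neighbors(cell, shape) if n in parent}
        parent[cell] = cell
        if not roots:
            peaks += 1   # no higher neighbor: a new peak starts here
        elif len(roots) > 1:
            merges += 1  # several components meet: a foundation starts here
        for r in roots:
            parent[r] = cell
    return peaks, merges


print(count_peaks_and_merges(GRID))  # (4, 3)
```

The result, 4 peaks and 3 merge points, matches the 4 peaks and 3 foundations (7 echelons) reported for Figure 3.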
2.3 Echelon analysis for 3 or 4-dimensional spatial lattice data
3-dimensional spatial lattice data consist of regularly stacked two-dimensional spatial lattice data, as shown in Figure 4. Each cell has the value h_{i,j,k} of the response variable within the area l_3(i, j, k). The neighbor information of cell l_3(i, j, k) is given by
\[
NB(l_3(i,j,k)) = \{(a,b,c) \mid i-1 \le a \le i+1,\ j-1 \le b \le j+1,\ k-1 \le c \le k+1\} \cap \{(a,b,c) \mid 1 \le a \le D_1,\ 1 \le b \le D_2,\ 1 \le c \le D_3\} - \{(i,j,k)\} \tag{1}
\]
where i = 1, 2, ..., D_1, j = 1, 2, ..., D_2 and k = 1, 2, ..., D_3. These data can also be regarded as cube data consisting of D_1 × D_2 × D_3 cubes.
Fig. 3. The echelon dendrogram for the digital values over a 5 × 5 array.
Fig. 4. The structure of 3-dimensional spatial lattice data.
4-dimensional spatial lattice data are constructed as T × l_3(i, j, k), that is, as T successive 3-dimensional lattices, as shown in Figure 5. They have the value h_{i,j,k,t}, t = 1, 2, ..., T, of the response variable within the area l_4(i, j, k, t). In addition to the spatial neighbors at the same t, we define the areas l_4(i, j, k, t − 1) and l_4(i, j, k, t + 1) as neighbors of the area l_4(i, j, k, t) whenever they lie within 1 ≤ t ≤ T. The neighbor information of area l_4(i, j, k, t) is defined by
\[
NB(l_4(i,j,k,t)) = \{(a,b,c,d) \mid i-1 \le a \le i+1,\ j-1 \le b \le j+1,\ k-1 \le c \le k+1,\ d = t\} \cap \{(a,b,c,d) \mid 1 \le a \le D_1,\ 1 \le b \le D_2,\ 1 \le c \le D_3,\ d = t\} \cup \{(i,j,k,t-1)\} \cup \{(i,j,k,t+1)\} - \{(i,j,k,t)\} \tag{2}
\]
where i = 1, 2, ..., D_1, j = 1, 2, ..., D_2, k = 1, 2, ..., D_3 and t = 1, 2, ..., T.
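Equation (2) combines a clipped 3 × 3 × 3 spatial window at time t with the same cell at the previous and next time steps. A minimal Python sketch is given below; the name `nb_4d` is hypothetical, and dropping temporal neighbors that fall outside 1, ..., T is an assumed boundary rule.

```python
from itertools import product


def nb_4d(i, j, k, t, d1, d2, d3, big_t):
    """Neighbor set NB(l_4(i, j, k, t)) following equation (2), 1-based indices."""
    window = product(range(i - 1, i + 2), range(j - 1, j + 2), range(k - 1, k + 2))
    spatial = {(a, b, c, t) for a, b, c in window
               if 1 <= a <= d1 and 1 <= b <= d2 and 1 <= c <= d3}
    # Temporal neighbors: the same cell one step earlier and one step later,
    # kept only when they fall inside 1..T (assumed boundary rule).
    temporal = {(i, j, k, s) for s in (t - 1, t + 1) if 1 <= s <= big_t}
    return (spatial | temporal) - {(i, j, k, t)}


print(len(nb_4d(2, 2, 2, 2, 3, 3, 3, 3)))  # 26 spatial + 2 temporal = 28
print(len(nb_4d(1, 1, 1, 1, 3, 3, 3, 3)))  # 7 spatial + 1 temporal = 8
```

Note that cells at different times are neighbors only when they share the same (i, j, k), so the temporal dimension is connected more sparsely than the three spatial ones.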
Fig. 5. The structure of 4-dimensional spatial data.
For illustration, we use the digital values over the 3 × 3 array with D_3 = 3 and T = 3 shown in Table 3. The graphical representation of these array data is the dendrogram in Figure 6. The label "A1-2-3" in the dendrogram denotes the position (A1, D3_2, T_3) of the spatial lattice data shown in Table 3. The array data consist of 12 echelons: 7 peaks and 5 foundations.
Table 3. The digital values over a 3×3, D3 = 3 and T = 3 array.
                T1              T2              T3
            A    B    C     A    B    C     A    B    C
D3_1  1    97   78  143    23  216   41    28   70   56
      2    60   42   49    30  172   28    10  217  236
      3    46   40   16    86  256   61    67   68   42
D3_2  1    39   65   45   266   49   25    14   35  278
      2     8   38   11   147   29   30    64    9    6
      3    56   86   28   241  191   32    59   56   73
D3_3  1    27   54   84    35   65  232    38   24   25
      2    71   10   15   300   46   24   289   19   45
      3    36  100   73     1   54   26    20   62   40
3 Detection of hotspot for spatial lattice data
3.1 Spatial scan statistics
Fig. 6. The echelon dendrogram for the 4-dimensional spatial lattice data.
The spatial scan statistic is used to detect areas with significantly high or low rates and to characterize the data. Such areas are called hotspots. Suppose that the hotspot candidate area Z lies within the whole area G. Each individual within Z has the attribute with probability p_1, while each individual outside Z has it with probability p_2, independently of all other individuals. The null hypothesis is H_0: p_1 = p_2. The alternative hypothesis for detecting a high rate is H_1: p_1 > p_2. Let n(G) be the total population in the whole area G and n(Z) the population within the area Z. Let c(G) be the total number of occurrences of the attribute in G and c(Z) the number within Z. Here, we consider the model based on the Poisson distribution. The likelihood function can then be written as
\[
L(Z, p_1, p_2) = \frac{\exp[-p_1 n(Z) - p_2 (n(G) - n(Z))]}{c(G)!}\, p_1^{c(Z)}\, p_2^{c(G) - c(Z)} \prod_{x_i} n(x_i)
\]
To maximize the likelihood function, we first maximize it conditional on the area Z by substituting the maximum likelihood estimators \hat{p}_1 = c(Z)/n(Z) and \hat{p}_2 = (c(G) − c(Z))/(n(G) − n(Z)):
\[
L(Z) = \frac{\exp[-c(G)]}{c(G)!} \left(\frac{c(Z)}{n(Z)}\right)^{c(Z)} \left(\frac{c(G) - c(Z)}{n(G) - n(Z)}\right)^{c(G) - c(Z)} \prod_{x_i} n(x_i)
\]
To detect the hotspots, the likelihood ratio λ is maximized over all subset areas Z of the whole area:
\[
\lambda = \frac{\max_{Z} L(Z)}{L_0} = \frac{\left(\dfrac{c(Z)}{n(Z)}\right)^{c(Z)} \left(\dfrac{c(G) - c(Z)}{n(G) - n(Z)}\right)^{c(G) - c(Z)}}{\left(\dfrac{c(G)}{n(G)}\right)^{c(G)}}
\]
Here, L_0 is the likelihood function under the null hypothesis. The test statistic λ can also be written as
\[
\lambda = \left(\frac{c(Z)}{e(Z)}\right)^{c(Z)} \left(\frac{c(G) - c(Z)}{e(G) - e(Z)}\right)^{c(G) - c(Z)}
\]
where e(Z) is the expected number of occurrences of the attribute within the area Z, and e(G) = c(G).
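The step from the population form of λ to the expected-value form can be made explicit. The sketch below assumes that the expected number under the null hypothesis is e(Z) = c(G) n(Z)/n(G); this definition is not stated in the text, but it is consistent with e(G) = c(G).

```latex
% Assumption: e(Z) = c(G) n(Z) / n(G), so that e(G) - e(Z) = c(G) (n(G) - n(Z)) / n(G).
\begin{align*}
\frac{c(Z)}{e(Z)} &= \frac{c(Z)/n(Z)}{c(G)/n(G)}, &
\frac{c(G) - c(Z)}{e(G) - e(Z)} &= \frac{(c(G) - c(Z))/(n(G) - n(Z))}{c(G)/n(G)} .
\end{align*}
% Raising the two ratios to the powers c(Z) and c(G) - c(Z) and multiplying,
% the denominators combine into (c(G)/n(G))^{c(Z) + c(G) - c(Z)} = (c(G)/n(G))^{c(G)}:
\begin{align*}
\left(\frac{c(Z)}{e(Z)}\right)^{c(Z)}
\left(\frac{c(G) - c(Z)}{e(G) - e(Z)}\right)^{c(G) - c(Z)}
&= \frac{\left(\dfrac{c(Z)}{n(Z)}\right)^{c(Z)}
         \left(\dfrac{c(G) - c(Z)}{n(G) - n(Z)}\right)^{c(G) - c(Z)}}
        {\left(\dfrac{c(G)}{n(G)}\right)^{c(G)}} = \lambda , \\
\log \lambda &= c(Z) \log \frac{c(Z)}{e(Z)}
  + \bigl(c(G) - c(Z)\bigr) \log \frac{c(G) - c(Z)}{e(G) - e(Z)} .
\end{align*}
```

The logarithmic form in the last line is the one that is convenient to evaluate numerically, since the powers c(Z) and c(G) − c(Z) can be large.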
3.2 Detection of hotspot
We detect a hotspot for the spatial lattice data shown in Table 3. For hotspot detection, a scan based on the echelon structure is effective, because the hotspot candidate would be located in the top echelon of the dendrogram. We calculate the value of the spatial scan statistic for the areas obtained by aggregating cells from the top position to the bottom position in the first peak of the echelon dendrogram. Here, the expected value e(z) of a cell z is given by the mean of h_{i,j,k,t}:
\[
e(z) = \sum_i \sum_j \sum_k \sum_t h_{i,j,k,t} / 81
\]
As a result, the hotspot candidate consists of the 12 cells shown in Figure 7, and the statistic is 1694.84. Using echelon analysis and the spatial scan statistic, we find that the cell (C1, D3_3, T_2) is not part of the hotspot candidate, although its value is higher than that of cell (B2, D3_1, T_2) or (A2, D3_2, T_2).
Fig. 7. The hotspot candidate of 4-dimensional spatial lattice data.
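The scan can be imitated on Table 3 with a short Python sketch. It is a simplified stand-in for the echelon scan, not the authors' procedure: for each threshold it takes the connected set of cells at or above the threshold that contains the maximum cell, using the neighbors of equation (2), and keeps the set with the largest log λ. It assumes that e(Z) is the number of cells in Z times the overall mean and that the reported statistic is log λ; all names are hypothetical.

```python
from itertools import product
from math import log

# Table 3 as DATA[t][k][row][col]: t = T1..T3, k = D3 layer 1..3,
# rows 1..3 and columns A..C (all stored 0-based).
DATA = [
    [[[97, 78, 143], [60, 42, 49], [46, 40, 16]],
     [[39, 65, 45], [8, 38, 11], [56, 86, 28]],
     [[27, 54, 84], [71, 10, 15], [36, 100, 73]]],
    [[[23, 216, 41], [30, 172, 28], [86, 256, 61]],
     [[266, 49, 25], [147, 29, 30], [241, 191, 32]],
     [[35, 65, 232], [300, 46, 24], [1, 54, 26]]],
    [[[28, 70, 56], [10, 217, 236], [67, 68, 42]],
     [[14, 35, 278], [64, 9, 6], [59, 56, 73]],
     [[38, 24, 25], [289, 19, 45], [20, 62, 40]]],
]
CELLS = list(product(range(3), repeat=4))  # (t, k, row, col)
VALUE = {c: DATA[c[0]][c[1]][c[2]][c[3]] for c in CELLS}
TOTAL = sum(VALUE.values())  # c(G) = e(G)


def neighbors(cell):
    """Equation (2): spatial neighbors at the same t, plus t - 1 and t + 1."""
    t, k, r, c = cell
    window = product(range(k - 1, k + 2), range(r - 1, r + 2), range(c - 1, c + 2))
    out = {(t, kk, rr, cc) for kk, rr, cc in window if (t, kk, rr, cc) in VALUE}
    out |= {(s, k, r, c) for s in (t - 1, t + 1) if (s, k, r, c) in VALUE}
    return out - {cell}


def log_lambda(zone):
    """log of the scan statistic, with e(Z) = |Z| * mean(h) (assumed)."""
    c_z = sum(VALUE[c] for c in zone)
    e_z = len(zone) * TOTAL / len(CELLS)
    if c_z <= e_z:  # not a high-rate zone (also covers the whole lattice)
        return 0.0
    return c_z * log(c_z / e_z) + (TOTAL - c_z) * log((TOTAL - c_z) / (TOTAL - e_z))


def component_of_max(threshold):
    """Connected set of cells with value >= threshold containing the maximum."""
    start = max(VALUE, key=VALUE.get)
    seen, stack = {start}, [start]
    while stack:
        for n in neighbors(stack.pop()):
            if n not in seen and VALUE[n] >= threshold:
                seen.add(n)
                stack.append(n)
    return seen


zones = [component_of_max(v) for v in sorted(set(VALUE.values()), reverse=True)]
best = max(zones, key=log_lambda)
print(len(best), round(log_lambda(best), 2))  # 12 1694.84
```

Under these assumptions the sketch selects a 12-cell zone with log λ ≈ 1694.84, in agreement with the result reported above. The cell with value 232 at (C1, D3_3, T_2) is left out because none of its neighbors under equation (2) reaches the threshold, so it is not connected to the zone.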
4 Conclusions
In this paper, we applied echelon analysis to spatial lattice data of various dimensions. We also showed the topological structure of complex data such as 4-dimensional spatial lattice data. In addition, a hotspot candidate was obtained from the echelon structure by using the spatial scan statistic. This method can be applied to any spatial lattice data for which neighbor information can be defined for each cell of the lattice.
References
[KN95] Kulldorff, M., Nagarwalla, N.: Spatial disease clusters: Detection and inference. Statistics in Medicine, 14, 799–810 (1995)
[Kul97] Kulldorff, M.: A spatial scan statistic. Communications in Statistics, Theory and Methods, 26, 1481–1496 (1997)
[Kur04] Kurihara, K.: Classification of geospatial lattice data and their graphical representation. Classification, Clustering, and Data Mining Applications, Springer, 251–258 (2004)
[MPJ97] Myers, W.L., Patil, G.P., Joly, K.: Echelon approach to areas of concern in synoptic regional monitoring. Environmental and Ecological Statistics, 4, 131–152 (1997)
[Nau65] Naus, J.I.: Clustering of random points in two dimensions. Biometrika, 52, 263–267 (1965)
[PT04] Patil, G.P., Taillie, C.: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics, 11, 183–197 (2004)
[TIB90] Turnbull, B.W., Iwano, E., Burnett, W., Howe, H., Clark, L.: Monitoring for clusters of disease: Application to leukemia incidence in upstate New York. American Journal of Epidemiology, 132, 136–143 (1990)