Cluster Analysis of Spatial Data Using Peano Count Tree 1, 2
Qiang Ding, William Perrizo
Department of Computer Science
North Dakota State University
Fargo, ND 58105-5164
{Qiang.Ding, William.Perrizo}@ndsu.nodak.edu
Abstract
Cluster Analysis of spatial data is a very important field due to the very large
quantities of spatial data collected in various application areas. In most cases the spatial
data sizes are too large to be mined in a reasonable amount of time using existing
methods. In this paper, we restructure the data using a data-mining-ready data structure,
the Peano Count Tree (P-tree), and then, based on this new data format, we present a new
scalable clustering algorithm that can also handle noise and outliers effectively. We
achieve this by introducing a new similarity function and adjusting the partitioning
methods. Our analysis shows that with the assistance of P-trees, very large data sets can
be handled efficiently and easily.
Keywords
Data mining, Cluster Analysis, Spatial Data, Peano ordering, k-Medoids
1 INTRODUCTION
Data mining in general is the search for hidden patterns that may exist in large
databases. Spatial data mining in particular is the discovery of interesting relationships
and characteristics that may exist implicitly in spatial databases. In the past 30 years,
_____________________________________________
1
The P-tree and bSQ technologies discussed in this paper are patent pending by North
Dakota State University.
2
This work was partially supported by GSA grant number ACT# K96130308.
cluster analysis has been widely applied to many areas. Spatial data is a promising area
for clustering [3]. Due to the large size of spatial data, such as satellite images, the
existing methods are not very suitable. In this paper, we propose a new method to
perform clustering on spatial data. The application focus of this paper is the clustering of
productivity levels in agricultural fields.
The lossless data structure, the Peano Count Tree (P-tree) [4], is used in the model. P-trees represent spatial data bit by bit in a recursive quadrant-by-quadrant arrangement.
With the information in P-trees, we can do the clustering much more easily and much more
efficiently. The rest of the paper is organized as follows. In Section 2, we review the data
formats of spatial data and describe the P-tree data structure (and its variants) and
algebra. In Section 3, we introduce the current clustering algorithms based on
partitioning. In Section 4, we present the HOBBit distance function and detail our
clustering method to show the advantage of P-trees. This section also includes a simple
example to demonstrate the basic idea. Section 5 concludes with a summary and some
directions for future research.
2 DATA STRUCTURES
A spatial image can be viewed as a 2-dimensional array of pixels. Associated
with each pixel are various descriptive attributes, called “bands”. For example, visible
reflectance bands (Blue, Green and Red), infrared reflectance bands (e.g., NIR, MIR1,
MIR2 and TIR) and possibly some bands of data gathered from ground sensors (e.g.,
yield quantity, yield quality, and soil attributes such as moisture and nitrate levels, etc.).
All the values have been scaled to the range 0 to 255 for simplicity. The pixel
coordinates in raster order constitute the key attribute. One can view such data as a table in
relational form, where each pixel is a tuple and each band is an attribute.
There are several formats used for spatial data, such as Band Sequential (BSQ),
Band Interleaved by Line (BIL) and Band Interleaved by Pixel (BIP). In our previous
work [4, 11], we proposed a new format called bit Sequential Organization (bSQ). Since
each intensity value ranges from 0 to 255 and can be represented as a byte, we split
each band into eight separate files, one for each bit position, called bSQ files. Each bSQ file can be
reorganized into a quadrant-based tree (P-tree). The example in Figure 1 shows a bSQ
file and its P-tree.
bSQ file (8 x 8):
1111 1100
1111 1000
1111 1100
1111 1110
1111 1111
1111 1111
1111 1111
0111 1111

P-tree:
level 3 (root count):                        55
level 2 (quadrant counts):                   16   8   15   16
level 1 (under the 8):                       3  0  4  1
level 1 (under the 15):                      4  4  3  4
level 0 (under the 3 in the second quadrant):  1 1 1 0
level 0 (under the 1 in the second quadrant):  0 0 1 0
level 0 (under the 3 in the third quadrant):   1 1 0 1

Figure 1. 8-by-8 image and its P-tree.
In this example, 55 is the count of 1's in the entire image (called the root count); the
numbers at the next level, 16, 8, 15 and 16, are the 1-bit counts for the four major
quadrants. Since the first and last quadrants are made up entirely of 1-bits (called pure-1
quadrants), we do not need sub-trees for them. Similarly, quadrants made up entirely of
0-bits are called pure-0 quadrants. This pattern continues recursively. Recursive raster
ordering is called Peano or Z-ordering in the literature. The process terminates at the leaf
level (level 0), where each quadrant is a 1-row-by-1-column quadrant. If we were to expand
all sub-trees, including the pure quadrants, the leaf sequence would be just the Peano
space-filling curve for the original raster image.
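As a concrete illustration of this quadrant-by-quadrant construction, the following sketch builds the counts of Figure 1 from the bSQ bit matrix. This is not the authors' implementation; the class and function names (PTreeNode, build_ptree) are illustrative choices.

```python
# Minimal sketch: build a P-tree (recursive quadrant 1-bit counts) from a bSQ bit matrix.

class PTreeNode:
    def __init__(self, count, children=None):
        self.count = count          # number of 1-bits in this quadrant
        self.children = children    # None for pure-0 / pure-1 quadrants

def build_ptree(bits, row=0, col=0, size=None):
    """Recursively count 1-bits quadrant by quadrant (Peano/Z order)."""
    if size is None:
        size = len(bits)
    if size == 1:
        return PTreeNode(bits[row][col])
    half = size // 2
    children = [
        build_ptree(bits, row,        col,        half),   # upper-left
        build_ptree(bits, row,        col + half, half),   # upper-right
        build_ptree(bits, row + half, col,        half),   # lower-left
        build_ptree(bits, row + half, col + half, half),   # lower-right
    ]
    count = sum(c.count for c in children)
    if count == 0 or count == size * size:                 # pure quadrant:
        return PTreeNode(count)                            # prune its sub-trees
    return PTreeNode(count, children)

# The 8x8 bSQ file of Figure 1; the root count should be 55.
image = [[int(b) for b in row] for row in [
    "11111100", "11111000", "11111100", "11111110",
    "11111111", "11111111", "11111111", "01111111"]]
root = build_ptree(image)
print(root.count)                        # 55
print([c.count for c in root.children])  # [16, 8, 15, 16]
```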
For each band (assuming 8-bit data values), we get 8 basic P-trees, one for each
bit position. For band Bi, we label the basic P-trees Pi,1, Pi,2, …, Pi,8; thus, Pi,j is a
lossless representation of the jth bits of the values from the ith band. However, Pi,j
provides more information and is structured to facilitate data mining processes. Some
of the useful features of P-trees can be found later in this paper or in our earlier work [4,
11].
The basic P-trees defined above can be combined using simple logical operations
(AND, OR and COMPLEMENT) to produce P-trees for the original values (at any level
of precision, 1-bit precision, 2-bit precision, etc.). We let Pb,v denote the Peano Count
Tree for band, b, and value, v, where v can be expressed in 1-bit, 2-bit,.., or 8-bit
precision. For example, Pb,110 can be constructed from the basic P-trees as:
Pb,110 = Pb,1 AND Pb,2 AND Pb,3’
where ’ indicates the bit-complement (which is simply the count complement in each
quadrant). This is called a value P-tree. The AND operation is simply the pixel-wise
AND of the bits.
The data in the relational format can also be represented as P-trees. For any
combination of values, (v1,v2,…,vn), where vi is from band-i, the quadrant-wise count of
occurrences of this tuple of values is given by:
P(v1,v2,…,vn) = P1,v1 AND P2,v2 AND … AND Pn,vn
This is called a tuple P-tree.
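To make the algebra concrete, here is a minimal sketch in which each P-tree is simplified to a flat bit vector in Peano order (the actual structure stores quadrant counts, but the AND/COMPLEMENT semantics and the resulting root counts are the same); the function names are illustrative assumptions, not the authors' code.

```python
# Sketch of value and tuple P-trees via AND/COMPLEMENT over flat bit vectors.

def ptree_and(*ptrees):
    """Bit-wise AND of P-trees represented as lists of 0/1."""
    return [int(all(bits)) for bits in zip(*ptrees)]

def ptree_complement(ptree):
    return [1 - b for b in ptree]

def value_ptree(basic_ptrees, value_bits):
    """P_b,v: AND the basic P-trees, complementing where the value bit is 0."""
    operands = [p if bit == "1" else ptree_complement(p)
                for p, bit in zip(basic_ptrees, value_bits)]
    return ptree_and(*operands)

def tuple_ptree(bands, values):
    """P(v1,...,vn): AND the value P-trees of all bands."""
    return ptree_and(*(value_ptree(b, v) for b, v in zip(bands, values)))

# Example: 2-bit values of one band over 4 pixels.
band1 = [[1, 0, 1, 1],   # P1,1 (most significant bit of each pixel)
         [1, 1, 0, 1]]   # P1,2
p_10 = value_ptree(band1, "10")   # pixels whose band-1 value is binary 10
print(p_10, sum(p_10))            # [0, 0, 1, 0] 1
```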
Finally, we note that the basic P-trees can be generated quickly and it is only a
one-time cost. The logical operations are also very fast [5]. So this structure can be
viewed as a “data mining ready” and lossless format for storing spatial data.
3 CLUSTERING ALGORITHMS BASED ON PARTITIONING
Given a database of n objects and k, the number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k≤n), where each partition represents a
cluster [7].
K-means, k-medoids and their variations are the most commonly used partitioning
methods.
The k-means algorithm [9] first randomly selects k of the objects, which initially
each represent a cluster mean or center. Each of the remaining objects is assigned to the
cluster to which it is the most similar, based on the distance between the object and the
cluster mean. The algorithm then computes the new mean for each cluster. This
process iterates until the criterion function converges.
The k-means method, however, can be applied only when the mean of a cluster is
defined. It is also sensitive to noisy data and outliers, since a small number of such data
points can substantially influence the mean value. For these reasons, we will not use this
method in our clustering.
The basic strategy of k-medoids clustering algorithms is to find k clusters among n
objects by first arbitrarily choosing a representative object (the medoid) for each cluster.
Each remaining object is clustered with the medoid to which it is the most similar. The
strategy then iteratively replaces one of the medoids by one of the non-medoids as long
as the quality of the resulting clustering is improved.
PAM [8] is one of the well known k-medoids algorithms. After an initial random
selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids.
All of the possible pairs of objects are analyzed, where one object in each pair is
considered a medoid and the other is not. The algorithm proceeds in two steps:
• BUILD step: sequentially select k "centrally located" objects to be used as the
initial medoids.
• SWAP step: if the objective function can be reduced by swapping a selected
object with an unselected object, the swap is carried out. This is continued
until the objective function can no longer be decreased.
Experimental results show that PAM works satisfactorily for small data sets. But
it is not efficient in dealing with medium and large data sets.
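For reference, the following is a compact PAM-style k-medoids sketch over a precomputed dissimilarity matrix (the setting in which we later apply it to dense-unit P-trees). It is not the original PAM code: the BUILD step is replaced by a random start for brevity, and all names are illustrative.

```python
# Simplified PAM-style k-medoids over a dissimilarity matrix (list of lists).
import random

def total_cost(dissim, medoids):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dissim[i][m] for m in medoids) for i in range(len(dissim)))

def pam(dissim, k, seed=0):
    """Random start, then greedy medoid/non-medoid swaps (SWAP step)."""
    random.seed(seed)
    n = len(dissim)
    medoids = set(random.sample(range(n), k))
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = (medoids - {m}) | {o}
                if total_cost(dissim, candidate) < total_cost(dissim, medoids):
                    medoids, improved = candidate, True
                    break
            if improved:
                break
    clusters = {m: [] for m in medoids}       # assign every object to its closest medoid
    for i in range(n):
        clusters[min(medoids, key=lambda m: dissim[i][m])].append(i)
    return clusters
```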
Instead of finding representative objects for the entire data set, CLARA [8] draws
a sample of the data set, applies PAM on the sample, and finds the medoids of the
sample. However, a good clustering based on samples will not necessarily represent a
good clustering of the whole data set if the sample is biased. CLARANS was therefore
proposed [2]; it does not confine itself to any sample at any given time, but draws a
sample with some randomness in each step of the search.
4 OUR ALGORITHM
4.1 Clustering Similarity Functions
In our algorithm, the clustering similarity function is based on the closeness of
feature attribute values. Various similarity functions have been proposed. For two data
points, X = <x1, x2, x3, …, xn-1> and Y = <y1, y2, y3, …, yn-1>, the Euclidean similarity
function is defined as

    d_2(X, Y) = \sqrt{\sum_{i=1}^{n-1} (x_i - y_i)^2}.

It can be generalized to the Minkowski similarity function,

    d_q(X, Y) = \left( \sum_{i=1}^{n-1} w_i \, |x_i - y_i|^q \right)^{1/q}.

If q = 2, this gives the Euclidean function. If q = 1, it gives the Manhattan distance,
which is

    d_1(X, Y) = \sum_{i=1}^{n-1} |x_i - y_i|.

If q = \infty, it gives the max function

    d_\infty(X, Y) = \max_{i=1}^{n-1} |x_i - y_i|.
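For completeness, these standard distances can be computed with a short sketch like the following (weights default to 1; this is illustrative code, not code from the paper):

```python
# Minkowski distance family: Euclidean (q=2), Manhattan (q=1), max (q=inf).
import math

def minkowski(x, y, q, weights=None):
    w = weights or [1.0] * len(x)
    if math.isinf(q):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)

x, y = [3, 7, 2], [1, 4, 6]
print(minkowski(x, y, 2))         # Euclidean: sqrt(4 + 9 + 16) ~ 5.39
print(minkowski(x, y, 1))         # Manhattan: 2 + 3 + 4 = 9
print(minkowski(x, y, math.inf))  # max: 4
```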
We proposed a metric using P-trees, called HOBBit [10]. The HOBBit metric
measures distance based on the most significant consecutive bit positions, starting from
the left (the highest-order bit). Values that share the same high-order bits are close to
each other at a certain precision, and this precision depends on the number of high-order
bits considered. We can also partition each dimension into several intervals by using
this higher-order-bits concept hierarchy.
The HOBBit similarity between two integers A and B is defined by
    S_H(A, B) = \max\{\, s \mid 0 \le i \le s \Rightarrow a_i = b_i \,\}
where ai and bi are the ith bits of A and B respectively.
The HOBBit distance between two tuples X and Y is defined by
    d_H(X, Y) = \max_{i=1}^{n} \left( m - S_H(x_i, y_i) \right)
where m is the number of bits in binary representations of the values; n is the number of
attributes used for measuring distance; and xi and yi are the ith attributes of tuples X and Y.
The HOBBit distance between two tuples is a function of the least similar pairing of
attribute values in them.
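A minimal sketch of these two definitions for m-bit integer attributes follows; the function names are our own, and the code is an illustration rather than the implementation used in the experiments.

```python
# HOBBit similarity (matching leading bits) and HOBBit distance between tuples.

def hobbit_similarity(a, b, m=8):
    """Number of consecutive matching bits, starting from the highest-order bit."""
    s = 0
    for i in range(m - 1, -1, -1):          # bit m-1 is the most significant
        if (a >> i) & 1 != (b >> i) & 1:
            break
        s += 1
    return s

def hobbit_distance(x, y, m=8):
    """max over attributes of (m - SH(xi, yi)); the least similar attribute dominates."""
    return max(m - hobbit_similarity(xi, yi, m) for xi, yi in zip(x, y))

# 4-bit example: 0011 vs 0111 share only the leading 0, so SH = 1.
print(hobbit_similarity(0b0011, 0b0111, m=4))                     # 1
print(hobbit_distance([0b0011, 0b1000], [0b0111, 0b1000], m=4))   # max(3, 0) = 3
```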
In this paper, we define the HOBBit distance between two tuples X and Y to be
the HOBBit distance between tuple P-trees P(X) and P(Y).
From the experiments, we found that HOBBit distance is more natural for spatial
data than other distance metrics.
4.2 Our Clustering Method
Before giving a formal description of the clustering algorithm, we first review the
concept of dense units [1].
Let S = B1 × B2 × ... × Bd be a d-dimensional numerical space, where B1, B2, ... , Bd are
the dimensions of S. We consider each pixel of the spatial image data as a d-dimensional
point v = {v1, v2, ... , vd}. If we partition every dimension into several intervals, then the
data space S can be partitioned into non-overlapping rectangular units. Each unit u is the
intersection of one interval from each attribute; it has the form {u1, u2, ... , ud}, where ui =
[li, hi) is a right-open interval in the partitioning of Bi. We say that a point v = {v1, v2, ... ,
vd} is contained in a unit u = {u1, u2, ... , ud} if li ≤ vi < hi for all i. The selectivity of a
unit is defined to be the fraction of the total data points contained in the unit. We call a unit u
dense if selectivity(u) is greater than the density threshold r. A cluster is a maximal set of
connected dense units.
By using the higher-order-bits concept hierarchy, we partition each dimension into
several intervals. Each of these intervals can be represented as a P-tree. In this way, all
the units can also be represented as P-trees by ANDing those interval P-trees.
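As a rough sketch of this step: pixels whose band values agree on the leading `prec` bits fall into the same unit, and a unit is kept if its selectivity exceeds the threshold r. The function and parameter names (dense_units, prec, r) are illustrative assumptions, not the paper's implementation.

```python
# Sketch: units from the high-order-bits concept hierarchy, filtered by selectivity.
from collections import Counter

def dense_units(pixels, prec, m, r):
    """pixels: list of tuples of m-bit band values; returns {unit: count}."""
    shift = m - prec
    units = Counter(tuple(v >> shift for v in p) for p in pixels)
    n = len(pixels)
    return {u: c for u, c in units.items() if c / n > r}

# Two 4-bit bands, 1-bit precision: four possible units.
pixels = [(0b0011, 0b1000), (0b0010, 0b1011), (0b0111, 0b0100), (0b1111, 0b1011)]
print(dense_units(pixels, prec=1, m=4, r=0.4))   # {(0, 1): 2} -> B1 in [0,8), B2 in [8,16)
```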
Furthermore, to deal with outliers, we simply disregard the sparse values. If
we have an estimate of the percentage of outliers, we can use the algorithm in Figure 2
to prune them out. The clustering is then based on the remaining dense units.
Input: total number of objects (N), all tuple P-trees, percentage of outliers (t)
Output: tuple P-trees after pruning
(1) Choose the tuple P-tree with the smallest root count (Pv)
(2) outliers := outliers + RootCount(Pv)
(3) if (outliers / N <= t) then remove Pv and repeat steps (1) and (2)
Figure 2. Algorithm to prune out outliers.
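A runnable sketch of this pruning loop is given below, with each tuple P-tree reduced to its root count (keyed by its value combination). The counts used for the test call are the tuple P-tree root counts from the worked example that follows; the function name is an assumption.

```python
# Sketch of Figure 2: drop the smallest-count tuple P-trees while staying within t.

def prune_outliers(tuple_ptrees, n_objects, t):
    """tuple_ptrees: dict {value_combination: root_count}; returns the dense units."""
    remaining = dict(tuple_ptrees)
    outliers = 0
    while remaining:
        smallest = min(remaining, key=remaining.get)
        if (outliers + remaining[smallest]) / n_objects > t:
            break                       # removing it would exceed the outlier budget
        outliers += remaining.pop(smallest)
    return remaining

counts = {"0010,1011,1000,1111": 3, "1010,1010,0100,1011": 1,
          "1010,1011,1000,1111": 1, "0011,0011,1000,1011": 1,
          "0011,0011,1000,1111": 1, "0011,0111,1000,1011": 2,
          "0111,0010,0101,1011": 2, "0111,0011,0100,1011": 2,
          "1111,1010,0100,1011": 3}
print(prune_outliers(counts, n_objects=16, t=0.25))   # keeps the five dense units
```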
Finally, we use the PAM method to partition all the P-trees representing these
dense units into k clusters. We consider each tuple P-tree as an object and use the
HOBBit similarity function between tuple P-trees to calculate the dissimilarity matrix.
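Combining the earlier sketches, the dissimilarity matrix and the final partitioning could be obtained along these lines; hobbit_distance and pam refer to the sketches above, and dense_units would hold the tuples of band values of the dense units (all names are illustrative).

```python
# Continuation of the earlier sketches (hobbit_distance and pam assumed in scope).

def dissimilarity_matrix(units, m=4):
    """Pairwise HOBBit distances between the dense units."""
    return [[hobbit_distance(a, b, m) for b in units] for a in units]

# clusters = pam(dissimilarity_matrix(dense_units), k=4)
```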
EXAMPLE
The following relation contains 4 bands of 4-bit data values (expressed in decimal
and binary) (BSQ format would consist of the 4 projections of this relation, R[YIELD],
R[Blue], R[Green], R[Red] ).
To keep the example clear, we generate the dense units using all the bits instead of only
some of the higher-order bits here.
FIELD     CLASS    REMOTELY SENSED
COORDS    LABEL    REFLECTANCES
(X, Y)    B1       B2      B3      B4
0,0       0011     0111    1000    1011
0,1       0011     0011    1000    1111
0,2       0111     0011    0100    1011
0,3       0111     0010    0101    1011
1,0       0011     0111    1000    1011
1,1       0011     0011    1000    1011
1,2       0111     0011    0100    1011
1,3       0111     0010    0101    1011
2,0       0010     1011    1000    1111
2,1       0010     1011    1000    1111
2,2       1010     1010    0100    1011
2,3       1111     1010    0100    1011
3,0       0010     1011    1000    1111
3,1       1010     1011    1000    1111
3,2       1111     1010    0100    1011
3,3       1111     1010    0100    1011
Figure 3. Learning dataset.
This dataset is converted to bSQ format. We display the bSQ bit-band values in
their spatial positions, rather than displaying them in 1-column files. The Band-1 bit-bands are:
B11 = 0000    B12 = 0011    B13 = 1111    B14 = 1111
      0000          0011          1111          1111
      0011          0001          1111          0001
      0111          0011          1111          0011
Thus, the Band-1 Basic P-trees are as follows (tree pointers are omitted).
P1,1:  5         P1,2:  7         P1,3:  16        P1,4:  11
       0014             0403                             4403
       0001             0111                             0111
We can use the AND and COMPLEMENT operations to calculate all the value P-trees
of Band-1 as shown below (e.g., P1,0011 = P1,1’ AND P1,2’ AND P1,3 AND P1,4).
P1,0010:  3    0030    1110
P1,0011:  4    4000
P1,0111:  4    0400
P1,1010:  2    0011    0001 1000
P1,1111:  3    0003    0111

(All other value P-trees of Band-1, i.e., P1,0000, P1,0001, P1,0100, P1,0101, P1,0110,
P1,1000, P1,1001, P1,1011, P1,1100, P1,1101 and P1,1110, have root count 0.)
Then we generate basic P-trees and value P-trees similarly for B2, B3 and B4.
From the value P-trees, we can generate all the tuple P-trees. Here we give all the
non-zero trees:
(root count; level-1 quadrant counts; level-0 leaf)
P-0010,1011,1000,1111   3   0030   1110
P-1010,1010,0100,1011   1   0001   1000
P-1010,1011,1000,1111   1   0010   0001
P-0011,0011,1000,1011   1   1000   0001
P-0011,0011,1000,1111   1   1000   0100
P-0011,0111,1000,1011   2   2000   1010
P-0111,0010,0101,1011   2   0200   0101
P-0111,0011,0100,1011   2   0200   1010
P-1111,1010,0100,1011   3   0003   0111
If the noise and outliers are estimated at about 25%, then the dense units are:
1) P-0010,1011,1000,1111   3   0030   1110
2) P-0011,0111,1000,1011   2   2000   1010
3) P-0111,0010,0101,1011   2   0200   0101
4) P-0111,0011,0100,1011   2   0200   1010
5) P-1111,1010,0100,1011   3   0003   0111
Now, the dissimilarity matrix can be calculated as in Figure 4.

        1    2    3    4    5
   1    0
   2    4    0
   3    4    4    0
   4    4    4    3    0
   5    4    4    4    4    0

Figure 4. Dissimilarity matrix.
Finally, we use the PAM method to partition these tuple P-trees into k clusters (let
k = 4):

cluster1:  P-0010,1011,1000,1111   3   0030   1110
cluster2:  P-0011,0111,1000,1011   2   2000   1010
cluster3:  P-0111,0010,0101,1011   2   0200   0101
           P-0111,0011,0100,1011   2   0200   1010
cluster4:  P-1111,1010,0100,1011   3   0003   0111
5 CONCLUSION
In this paper, we propose a new approach to cluster analysis that is especially useful
for clustering spatial data. We use the bit Sequential data organization (bSQ) and a
lossless, data-mining-ready data structure, the Peano Count Tree (P-tree), to represent the
information needed for clustering in an efficient and ready-to-use form. The rich and
efficient P-tree storage structure and fast P-tree algebra facilitate the development of a
very fast clustering method.
We also discussed clustering methods based on partitioning. As noted, PAM
is not efficient in dealing with medium and large data sets. CLARA and CLARANS first
draw samples from the original data and then apply the PAM method. In our algorithm, we
do not draw samples; instead, we group the data first (each tuple P-tree can be viewed as a group
of data) and then use the PAM method on these groups. The number of these P-trees is much
smaller than the size of the original data set, and even smaller than what CLARA or
CLARANS needs to deal with. We also present a data pruning method that works
effectively.
6 REFERENCES
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace
clustering of high dimensional data for data mining applications”, Proc. ACM SIGMOD, Washington, 1998.
[2] R. Ng and J. Han. “Efficient and effective clustering method for spatial data
mining”, Proc. VLDB, pp. 144--155, Santiago, Chile, 1994.
[3] M.S. Chen, J. Han, and P.S. Yu, “Data Mining: An Overview from a Database
Perspective”, IEEE Transactions on Knowledge and Data Engineering, 8(6):
866-883, 1996.
[4] William Perrizo, Qin Ding, Qiang Ding, Amlendu Roy, “Deriving High
Confidence Rules from Spatial Data using Peano Count Trees”, Springer-Verlag, LNCS 2118, 2001.
[5] William Perrizo, “Peano Count Tree Technology”, Technical Report NDSU-CSOR-TR-01-1, 2001.
[6] Ester, M., Kriegel, H.-P., Sander, J. and Xu, X., “A Density-Based Algorithm
for Discovering Clusters in Large Spatial Databases with Noise”, Proc. KDD,
pp. 226-231, Portland, Oregon, 1996.
[7] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”,
Morgan Kaufmann, 2001.
[8] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: an Introduction to
Cluster Analysis”. John Wiley & Sons, 1990.
[9] J. MacQueen, “Some methods for classification and analysis of multivariate
observations”, L. Le Cam and J. Neyman, editors, 5th Berkeley Symposium on
Mathematical Statistics and Probability, 1967.
[10] Maleq Khan, Qin Ding, William Perrizo, “k-Nearest Neighbor Classification on
Spatial Data Streams Using P-Trees”, PAKDD 2002, Springer-Verlag, LNAI
2336, 2002, pp. 517-528.
[11] Q. Ding, M. Khan, A. Roy, and W. Perrizo, “The P-tree algebra”, Proc. ACM
Symposium Applied Computing (SAC 2002), pp.426-431, Madrid, Spain, 2002.