Efficient Density Clustering For Large Spatial Data Using Hawaiian
Metrics
Fei Pan, Baoying Wang, Yi Zhang, Dongmei Ren, Xin Hu, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105
Abstract
Data mining for spatial data has become increasingly important as more and more organizations are
exposed to spatial data from such sources as remote sensing, geographical information systems (GIS),
astronomy, computer cartography, environmental assessment and planning, bioinformatics, etc.
Recently, density clustering methods such as DENCLUE, DBSCAN, and OPTICS have been published
and recognized as powerful clustering methods for data mining. These approaches typically regard
clusters as dense regions of objects in the data space that are separated by regions of low density.
However, these methods are known to lack scalability with respect to dimensionality. In this paper, we
develop a new efficient density-clustering algorithm using Hawaiian metrics and P-trees [1]. The fast
P-tree ANDing operation facilitates the calculation of the density function within Hawaiian rings. The
average time complexity of our algorithm for spatial data in d dimensions is $O(dn\sqrt{n})$. Our proposed
method has cardinality scalability comparable to that of other density methods for small and medium-sized
data, but superior dimensional scalability.
Keywords: Density Clustering. Hawaiian Metrics. Peano Count Trees. Spatial Data
1. Introduction
With the rapid growth of large quantities of spatial data collected in various application areas, such as
remote sensing, geographical information systems (GIS), astronomy, computer cartography,
environmental assessment and planning, efficient spatial data mining methods are in great demand.
Density based clustering algorithms have been recognized as a powerful approach capable of
discovering clusters of arbitrary shape as well as dealing with noise and outliers, and are widely used in
the mining of large spatial data.
There are two major approaches for density-based methods. The first approach is represented by
DENCLUE [3]. It exploits a density function, e.g., step function or Gaussian function to measure the
density in attribute metric space. Clusters are identified by determining density attractors. Thus,
clusters of arbitrary shape can be easily determined by overall density functions. This algorithm scales
well, with run time complexity $O(n \log n)$, by means of grid cell techniques. However, it requires
careful selection of the density parameter $\sigma$ and noise threshold $\xi$, which may significantly influence
the quality of the clustering results [10].
The second approach calculates the density of all data points and groups them based on density
connectivity. Typical algorithms in this approach include DBSCAN [6] and OPTICS [8]. DBSCAN first
defines a core object as a set of neighbor points consisting of more than a specified number of data
points. All the data points reachable within a chain of overlapping core objects define a cluster. The run
time complexity of DBSCAN is $O(n \log n)$ for spatial data when a spatial index is used; otherwise, it
is $O(n^2)$ [10]. OPTICS can be considered an extension of DBSCAN that does not assume a single global
density. It assumes each cluster has its own density parameter and uses a random variable to learn its
probability distribution. It has the same run time complexity as DBSCAN, that is, $O(n \log n)$ if a
spatial index is used and $O(n^2)$ otherwise.
The spatial index and grid cell techniques used in these two methods are only suitable for
low-dimensional data sets []. In high dimensions, the volume of spatial index nodes and grid cells
increases exponentially, which degrades accuracy and performance dramatically.
Recently, a new distance metric, the Hawaiian Metric, has been proposed for data mining [2]. It
exploits a new lossless data structure, called the Peano Count Tree (P-tree) [1]. The performance of
Hawaiian metric data mining using P-trees is shown to be fast and accurate [2].
In this paper, we propose an efficient density clustering algorithm using Hawaiian metrics and show
that the method scales well with respect to dimension. The basic idea is to make use of P-trees and
Hawaiian metrics to calculate the density function in $O(\sqrt{n})$ time, on average. The fast P-tree
ANDing operation is used to get density functions within certain Hawaiian ring neighborhoods. Furthermore,
we adopt a look around pruning method to combine the density calculation and a hill climbing
technique. The overall run time complexity is $O(dn\sqrt{n})$ for a d-dimensional data set, on average.
Experimental results show that the algorithm works efficiently on large-scale, high-dimensional spatial
data, outperforming other density methods significantly.
This paper is organized as follows. In section 2, the Hawaiian metrics and P-tree techniques are briefly
reviewed. In section 3, we introduce the new efficient density clustering method using Hawaiian metrics,
and then analyze its efficiency in terms of time complexity. Finally, we compare our method with other
density methods experimentally in section 4 and conclude the paper in section 5.
The symbols used in this paper are given in Table 1.
Table 1. Symbols and Notations
Symbol    Definition
X         Spatial pixel, X = {x1, x2, …, xn}, where n is the number of attributes
m         Maximal bit length of attributes
r         Radius of Hawaiian ring
Pi,j      Basic P-tree for bit j of attribute i
Pi,j'     Complement of Pi,j
bi,j      The jth bit of the ith attribute of x
Pxi,j     Operator P-tree of the jth bit of the ith attribute of x
Pvi,r     Value P-tree within ring r
Px,r      Tuple P-tree within ring r
Qid       Quadrant identification
2. Review of Hawaiian Metrics and P-trees
Distance metrics (or similarity functions) are key elements of clustering algorithms and therefore play
an important role in data mining. A number of distance metrics have been utilized so far; some
of the common ones are Euclidean distance, Manhattan distance, Max distance and Lp-distance. In this
section, we briefly review the Hawaiian metrics, the Peano Count Tree (P-tree) [1] data
structure, and the related P-tree algebra.
2.1. Hawaiian Metrics
Many distance metrics have been used in clustering algorithms. Representative metrics include
Euclidean distance, Manhattan distance, and Max distance (Minkowski or Lq-metrics with q = 2, 1 and
$\infty$, respectively). For two data points, X = (x1, x2, x3, …, xn-1) and Y = (y1, y2, y3, …, yn-1), the Euclidean
distance function is defined as

$$d_2(X, Y) = \sqrt{\sum_{i=1}^{n-1} (x_i - y_i)^2}.$$

It can be generalized to the Lq or Minkowski distance function

$$d_q(X, Y) = \sqrt[q]{\sum_{i=1}^{n-1} w_i \, |x_i - y_i|^q},$$

where q is a natural number. With unit weights, q = 2 gives the Euclidean function. If q = 1,

$$d_1(X, Y) = \sum_{i=1}^{n-1} |x_i - y_i|$$

gives the Manhattan distance. The Max function is defined as

$$d_\infty(X, Y) = \max_{i=1}^{n-1} |x_i - y_i|.$$
The Hawaiian metric, also called the HOBBit metric [2], is a bitwise distance function. It measures
distance based on the most significant consecutive matching bit positions starting from the left
(the Position Of Inequality, or POI, which leads to the Hawaiian terminology). Bits at different positions
contribute differently to the similarity measurement. Hawaiian metric difference measurements are
based on the following observation: when comparing two values bitwise from left to right, once a
difference is found, the position of that first difference reveals much about the magnitude of the
difference between the two values. Let Ai be a non-negative fixed-point attribute in a tabular data set,
R(A1, A2, ..., An). Each Ai is represented as a fixed-point binary number, x, i.e.,
x = x(m)x(m-1)---x(1)x(0).x(-1)---x(-n). Let X and Y be two values of Ai. The position of inequality (POI),
or Hawaiian similarity, between X and Y is defined by

$$m(X, Y) = \max\{\, i \mid x_i \oplus y_i = 1 \,\}$$

where $x_i$ and $y_i$ are the i-th bits of X and Y respectively, and $\oplus$ denotes the XOR (exclusive OR)
operation. In other words, m is the leftmost position at which X and Y differ. The Hawaiian distance
between two values, X and Y, is defined by

$$d(X, Y) = 2^{m(X, Y)}.$$

For two values X and Y of a signed fixed-point binary attribute Ai, the Hawaiian distance between X and Y is
the same as above if X and Y have the same sign. If X and Y have opposite signs, then the distance is
$d(X, Y) = d(X, 0) + d(Y, 0)$. Hawaiian metric data mining uses a data structure, called a Peano
Count Tree, to facilitate its computation for spatial data. Some details about P-trees are described in
the next sub-section.
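The definition above can be illustrated for non-negative integers with a minimal sketch. The function names are hypothetical, bit 0 is taken to be the least significant bit, and returning 0 for identical values is an assumption, since the text defines m(X, Y) only when the values differ.

```python
from typing import Optional


def position_of_inequality(x: int, y: int) -> Optional[int]:
    """Index of the leftmost bit at which x and y differ (bit 0 = least significant)."""
    diff = x ^ y  # XOR sets exactly the bits where x and y differ
    return diff.bit_length() - 1 if diff else None


def hawaiian_distance(x: int, y: int) -> int:
    """d(X, Y) = 2^m(X, Y); identical values get distance 0 (an assumption)."""
    m = position_of_inequality(x, y)
    return 0 if m is None else 2 ** m


print(hawaiian_distance(0b1101, 0b1001))  # values first differ at bit 2
print(hawaiian_distance(8, 7))            # values first differ at bit 3
```

Because only the leftmost differing bit matters, 8 and 7 are further apart under this metric than 13 and 9, even though their numeric difference is smaller.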
2.2. Peano Count Trees (P-trees)
The Peano Count Tree (P-tree) is a tree structure organizing any tabular data set with fixed-point
numerical values (categorical attributes can be coded to fixed-point numeric values, and floating-point
attributes can be intervalized using their exponents). Each attribute is split into separate files, one for
each bit position. A basic P-tree, Pi,j, is then the P-tree for the jth bit of the ith attribute. Given a
fixed-point attribute of m bits, there are m basic P-trees, one for each bit position. The complement of a basic
P-tree, Pi,j, is a P-tree associated with the column of bit complements, and is denoted Pi,j'. Figure 1
shows an example of basic P-tree construction and its complement.
a) 8x8 bSQ file

11 11 11 00
11 11 00 00
11 11 11 00
11 11 11 10
11 11 00 00
11 11 00 00
00 11 00 00
01 11 00 00

b) Basic P-tree (children listed in Peano order; bit strings are the leaf bits of mixed 2x2 quadrants)

36
  16
  7
    2  (1100)
    0
    4
    1  (0010)
  13
    4
    4
    1  (0001)
    4
  0

c) Complement P-tree

28
  0
  9
    2  (0011)
    4
    0
    3  (1101)
  3
    0
    0
    3  (1110)
    0
  16

Figure 1. Example of P-tree Construction
In Figure 1, we assume the original table is a table in which each row is a pixel in an image and
each attribute is a reflectance value (e.g., of Red, Green, Blue, etc.) ranging from 0 to 255. The
8×8 bit array in Figure 1(a) is some bit of some attribute (a bit-Sequential or bSQ file). We
have arranged the bits spatially (rather than as a single column of bits) so that their pixel of origin is
clear.
The corresponding Peano Count tree (P-tree), showing the hierarchy of quadrant 1-bit counts, is given in
Figure 1(b). The root count is 36, and the counts at the next level, 16, 7, 13, 0, are the 1-bit counts for the
four major quadrants (in Peano or Z order). Since the first and last quadrants are made up entirely of
1-bits and 0-bits respectively, we do not need sub-trees for these two quadrants. The complement is
shown in Figure 1(c). It provides the 0-bit counts for each quadrant.
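The construction just described can be sketched recursively. The bit array below is our row-wise reading of the 8×8 bSQ file in Figure 1(a); the function name and the (count, children) tuple representation are assumptions made for illustration, not the authors' implementation.

```python
# 8x8 bSQ file, one string per row (our reading of Figure 1(a)).
BITS = [[int(ch) for ch in row] for row in (
    "11111100",
    "11110000",
    "11111100",
    "11111110",
    "11110000",
    "11110000",
    "00110000",
    "01110000",
)]


def build_ptree(bits, row=0, col=0, size=None):
    """Return (count, children) for the size x size quadrant at (row, col)."""
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    # Pure-0 and pure-1 quadrants (including single cells) need no sub-trees.
    if count == 0 or count == size * size:
        return (count, [])
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    children = [build_ptree(bits, row + dr, col + dc, half)
                for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    return (count, children)


root_count, quadrants = build_ptree(BITS)
print(root_count, [count for count, _ in quadrants])
```

The root count and the four quadrant counts printed here can be compared directly with the values quoted in the text (36 and 16, 7, 13, 0).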
P-tree ANDing is one of the most important and frequently used algebraic operations on P-trees. The
ANDing operation is executed using Peano Mask trees (PM-trees), a lossless, compressed variant of
Peano Count trees that uses simple masks instead of root counts at each internal node. In PM-trees,
three-valued logic, i.e., 0, 1, and m (for mixed), is used to represent pure-0, pure-1 and non-pure (or mixed)
quadrants, respectively. The bitwise AND of two bit columns can be done efficiently using PM-trees, as
illustrated in Figure 2.
Figure 2. P-tree ANDing Operation: a) P-tree-1, b) P-tree-2, c) AND-Result
Figure 2 shows the PM-tree result (c) of ANDing P-tree-1 (a) and P-tree-2 (b). There are several
ways to perform P-tree ANDing. The basic way is to perform ANDing level by level, starting from
the root level. The rules are summarized in Table 2.
Table 2. P-tree AND rules
Operand 1   Operand 2   Result
0           0           0
0           1           0
0           m           0
1           1           1
1           m           m
m           m           0 if all four sub-quadrants result in 0; otherwise m
In Table 2, operand 1 and operand 2 are two P-trees (or sub-trees). ANDing a pure-0 tree with any
P-tree results in a pure-0 tree. ANDing a pure-1 tree with any P-tree, P2, results in P2. ANDing two
mixed trees results in a mixed tree or a pure-0 tree.
By using the P-tree logical AND and complement operations, Hawaiian distance can be computed very
quickly. The detailed algorithm for Hawaiian ring based P-tree operations is discussed in the
following section.
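The rules of Table 2 can be sketched as a short recursive function. The representation is an assumption chosen for illustration: a PM-tree node is the integer 0 (pure-0), the integer 1 (pure-1), or a list of four child nodes (mixed); the example trees are made up and are not those of Figure 2.

```python
def pm_and(a, b):
    """AND two PM-trees following the rules of Table 2."""
    if a == 0 or b == 0:      # pure-0 AND anything -> pure-0
        return 0
    if a == 1:                # pure-1 AND P2 -> P2
        return b
    if b == 1:
        return a
    # Both mixed: AND the four sub-quadrants pairwise.
    children = [pm_and(x, y) for x, y in zip(a, b)]
    # The result is pure-0 if all four sub-quadrants are pure-0, otherwise mixed.
    return 0 if all(c == 0 for c in children) else children


def root_count(node, cells):
    """Number of 1-bits under a node that covers `cells` pixels."""
    if node == 0:
        return 0
    if node == 1:
        return cells
    return sum(root_count(child, cells // 4) for child in node)


# Two hypothetical 4x4 PM-trees (16 pixels each).
tree1 = [1, [1, 0, 1, 0], 0, [0, 0, 1, 1]]
tree2 = [[1, 1, 0, 0], 1, 1, [1, 1, 0, 0]]
result = pm_and(tree1, tree2)
print(result, root_count(result, 16))
```

Pure quadrants short-circuit the recursion, which is why ANDing compressed trees can touch far fewer nodes than a bit-by-bit AND of the underlying columns.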
3. The P-tree Hawaiian Density Based Clustering Algorithm
Generally speaking, density based clustering algorithms group the attribute objects into a set of connected
dense components separated by regions of low density. A cluster is regarded as a connected dense
region of objects, which grows in any direction that density leads. Therefore, density based algorithms are
capable of discovering arbitrarily shaped clusters and deal well with noise and outliers.
The main drawback of existing density based algorithms is slowness and lack of scalability. Typical
density based algorithms, such as DBSCAN, OPTICS and DENCLUE, exploit different approaches to
improve speed and scalability. In this paper, we propose a P-tree Hawaiian ring based density
clustering algorithm, which we will refer to as PHD-Clustering (P-tree, Hawaiian, Density Clustering).
The basic idea is to exploit Hawaiian rings (all points lying between an inner Hawaiian distance
threshold and an outer Hawaiian distance threshold from a center point) and P-trees to get the density
function in one step. The fast P-tree ANDing operation is used to get the density function within any
specified Hawaiian ring neighborhood. We also adopt a look around pruning method to combine the density
calculation and hill climbing.
In section 3.1, we describe the calculation of the density function using P-trees and Hawaiian rings. In
section 3.2, the algorithm for finding density attractors is discussed. Finally, in section 3.3, the
efficiency of our algorithm is analyzed in terms of time complexity.
3.1. Calculation of the Density Function Using P-trees and Hawaiian Rings
Definition 3.1.1. Hawaiian Ring and Hawaiian Ball. The Hawaiian ring of radii $r_1$ and $r_2$,
centered at c, is defined as $R(c, r_1, r_2) = \{x \in X \mid r_2 \ge d_h(c, x) \ge r_1\}$. Letting $r_1 = 0$, a
Hawaiian ring becomes a Hawaiian ball.
Definition 3.1.2. Point Sets within Hawaiian Balls (PSHB). Given a Hawaiian ball, R = R(c, 0, r),
centered at c, if there exists $x \in X$ such that $x \in R$, then return the P-tree $P_{c,r}$; otherwise return the
pure-0 tree. The size of the point set is defined as $|PSHB(c, r)| = RootCount(P_{c,r})$.
Algorithm 3.1.1. Calculation of PSHB. Given a Hawaiian ball R(c, 0, r) centered at c, PSHB, or
$P_{c,r}$, is implemented as follows. The operator P-tree, $Pc_{i,j}$, is defined based on $b_{i,j}$ of point c:

$$Pc_{i,j} = \begin{cases} P_{i,j} & \text{if } b_{i,j} = 1 \\ P'_{i,j} & \text{otherwise} \end{cases}$$

The value P-trees on dimension i (i = 1, 2, 3, …, n) within Hawaiian ball r are calculated by

$$Pv_{i,r} = Pc_{i,1} \,\&\, Pc_{i,2} \,\&\, Pc_{i,3} \,\&\, \dots \,\&\, Pc_{i,r}$$

The P-tree over all dimensions is calculated by ANDing the value P-trees:

$$P_{c,r} = Pv_{1,r} \,\&\, Pv_{2,r} \,\&\, Pv_{3,r} \,\&\, \dots \,\&\, Pv_{n,r}$$

According to Definition 3.1.2, $|PSHB(c, r)| = RootCount(P_{c,r})$.
Definition 3.1.3. Point Set within Hawaiian Rings. Given a Hawaiian ring $R(c, r_1, r_2)$ centered at
c, the point set size in the Hawaiian ring is defined as
$|PSHB(c, r_2)| - |PSHB(c, r_1)| = RootCount(P_{c,r_2}) - RootCount(P_{c,r_1})$.
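Algorithm 3.1.1 can be sketched with plain Python integers standing in for P-trees: each basic "tree" is an uncompressed bitmap with one bit per data point, so `&` plays the role of P-tree ANDing and a population count plays the role of RootCount. The function names, the sample points, and the convention that j = 1 is the most significant bit are assumptions made for this illustration.

```python
def basic_bitmaps(points, m):
    """bitmaps[i][j]: bit k is set if bit j (0 = most significant) of attribute i of point k is 1."""
    n_attrs = len(points[0])
    bitmaps = [[0] * m for _ in range(n_attrs)]
    for k, point in enumerate(points):
        for i, value in enumerate(point):
            for j in range(m):
                if (value >> (m - 1 - j)) & 1:
                    bitmaps[i][j] |= 1 << k
    return bitmaps


def pshb(bitmaps, c, r, m, n_points):
    """Bitmap of points agreeing with c on the first r bit positions of every attribute."""
    full = (1 << n_points) - 1          # stand-in for the pure-1 tree
    result = full
    for i, value in enumerate(c):
        for j in range(r):
            p = bitmaps[i][j]
            if (value >> (m - 1 - j)) & 1:
                result &= p             # operator tree is P_{i,j}
            else:
                result &= full & ~p     # operator tree is the complement P'_{i,j}
    return result


def root_count(bitmap):
    return bin(bitmap).count("1")


points = [(5, 5), (4, 7), (1, 1), (7, 0)]   # hypothetical 3-bit, 2-attribute data
bitmaps = basic_bitmaps(points, m=3)
counts = [root_count(pshb(bitmaps, points[0], r, 3, len(points))) for r in range(4)]
print(counts)
```

In this sketch, ANDing in more bit positions can only shrink the point set, so the counts are non-increasing in r; a ring count is then the difference of two such counts, as in Definition 3.1.3.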
Definition 3.1.4. Hawaiian Density Function. Given m Hawaiian rings centered at c
with radii $r_1, r_2, \dots, r_m$, the Hawaiian density function of c is defined as

$$D_c = \sum_{i=1}^{m} w_i \left( |PSHB(c, r_i)| - |PSHB(c, r_{i-1})| \right) = \sum_{i=1}^{m} w_i \left( RootCount(P_{c,r_i}) - RootCount(P_{c,r_{i-1}}) \right)$$

where $w_i$ is the weight of the i-th ring (for i = 1 the subtracted count is taken as 0, as in the
initialization of PrevRC in Figure 4).
Note that we define the Hawaiian density function as the weighted summation of the point set sizes within each
consecutive Hawaiian ring. The weight $w_i$ is determined such that nearby neighbors contribute more to the
density function than those far away from the point. Giving more voting weight to closer point sets than to
distant ones increases clustering accuracy.
There are many weighting functions that can be used to adjust the distance, e.g., Gaussian weighting,
Kriging and radial basis function weighting, Podium weighting [12], etc.
Here we define $w_i = r_i \cdot 2^{r_i d}$, where $r_i$ is the radius of the Hawaiian ring and d is the dimension
of the spatial data set. The definition is based on the following analysis. First, since the Hawaiian metric
is a bitwise metric, the spatial radius actually increases exponentially (at the rate of $2^r$) as the Hawaiian
radius r increases, i.e., 1, 2, 4, 8, …, $2^r$. Secondly, the data point set size increases exponentially as
the spatial dimension increases. Suppose there is a data point in each spatial unit. For d-dimensional space
with radius $2^r$, the data set size is $(2^r)^d$, or $2^{rd}$. For d-dimensional space, the data set within
Hawaiian ring R(c, r, r+1) is $2^{(r+1)d} - 2^{rd} = (2^d - 1)\,2^{rd}$, which is proportional to $2^{rd}$. If $w_i$ is
defined as $2^{rd}$, the data set in each Hawaiian ring will make the same contribution to the density
function.
Since we want closer neighbors to have a greater contribution, we define $w_i = r_i \cdot 2^{r_i d}$.
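The weighted sum of Definition 3.1.4 with the weight defined above can be sketched as follows. The function name is hypothetical, the ring radii are assumed to be r_i = i, and the cumulative counts are made-up numbers standing in for RootCount values.

```python
def hawaiian_density(cumulative_counts, d):
    """Weighted sum of per-ring point counts with w_i = r_i * 2^(r_i * d), r_i = i.

    cumulative_counts[i - 1] stands for RootCount(P_{c, r_i}).
    """
    density = 0
    previous = 0  # count subtracted for the first ring
    for r, count in enumerate(cumulative_counts, start=1):
        weight = r * 2 ** (r * d)
        density += weight * (count - previous)
        previous = count
    return density


# Hypothetical counts for three rings.
print(hawaiian_density([1, 3, 6], d=1))
print(hawaiian_density([1, 3, 6], d=2))
```

Because the weight grows exponentially in both the ring index and the dimension, a few points in a heavily weighted ring can dominate the density value, so the choice of weighting function matters considerably in practice.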
3.2. Finding Density Attractors Using Look Around Pruning Technique
Now that we know the density of each data point, we need to find the density attractors, i.e., local
maxima of the overall density function. The density function based algorithm can be summarized in two
steps: 1) get the density function for each point, 2) get the density attractors. Having a high density
doesn't necessarily make a point a density attractor; it must have the highest density among its
neighbors. Instead of using hill climbing techniques as is done in DENCLUE [3], we adopt look
around pruning. We first define a neighborhood as a ball of radius $\varepsilon$. The number $\varepsilon$ determines the
final number of density attractors. $\varepsilon$ defines a fixed number of Hawaiian rings. It ranges from 0 to the
maximal bit length of attributes. After finding the density, $D_x$, of a point x, we compare that density with
that of any density attractor within its neighborhood. If it is greater than the density of all its neighbors,
x is labeled as the new density attractor, i.e., cluster center. The old density attractors are de-labeled and
will not be considered as attractors further.
For example, suppose the QID of data point x is 0.3.2 and $D_x = 250$. The tuple P-tree $P_{x,\varepsilon}$ within
neighborhood $\varepsilon$ is shown in Figure 3. We need to compare $D_x$ with the neighbors' densities. From $P_{x,\varepsilon}$,
x has four neighbors with QIDs of 0.0.2, 0.3.1, 2.3.0 and 2.3.3. Suppose the densities of these points are
300, 0, 220 and 0, respectively, and that 0.0.2 and 2.3.0 are labeled as density attractors. Comparing $D_x$
with the maximal density of 0.0.2 and 2.3.0 gives 250 < max(300, 220), so x is not
a density attractor. If instead $D_x = 350$, then 350 > max(300, 220) and x is labeled as the new density
attractor. The old density attractors 0.0.2 and 2.3.0 are de-labeled and will not be considered later.
Figure 3. Tuple P-tree within neighborhood $\varepsilon$ of x
After all the data points have gone through the process above, the final density attractors, i.e., the cluster
centers, are determined. The algorithm is summarized in Figure 4.
INPUT: P-tree set Pi,j for bit j of attribute i; neighborhood ε
OUTPUT: Density attractors
// Pi,j - P-tree for attribute i and bit j; N - # of data points; n - # of attributes; Pε - neighborhood P-tree
// m - maximal bit length of attributes; flag[i] - label of cluster center of data point i
// w[h] - density weight of Hawaiian ring h
BEGIN
FOR i = 1 TO N DO
    flag[i] ← 0
    P[i] ← Pure1 P-tree, DENS[i] ← 0, PrevRC ← 0
    FOR j = 1 TO n DO
        FOR h = 1 TO m DO
            GET bjh[i]
            IF bjh[i] = 1
                PXjh ← Pj,h
            ELSE
                PXjh ← P'j,h
            P[i] ← P[i] & PXjh
            w[h] ← h * pow(2, h*n)
            DENS[i] ← DENS[i] + w[h] * (RootCount(P[i]) - PrevRC)
            PrevRC ← RootCount(P[i])
            IF h = m - ε
                Pε ← P[i]
    IF DENS[i] > the largest density within neighborhood Pε
        flag[i] ← 1, clear the flags of its neighborhood points
END
Figure 4. PHDClustering Algorithm
The look around pruning algorithm is robust, which means the clustering results are independent of
the order in which training data points are fed. Suppose there are three points x[1], x[2] and x[3] in the
neighborhood $\varepsilon$. Let Den[1] > Den[2] > Den[3]. There are 6 possible feeding orders of these three data
points. The training processes for the different feeding orders are shown in Table 3. From the table, the
density attractor after step 3 is x[1] in every case. Therefore our look around pruning algorithm for finding
density attractors is robust, independent of the data feeding order.
Table 3. Density Attractor during the Training Process for Different Data Feeding Orders
Feeding Order       Step 1   Step 2   Step 3
x[1], x[2], x[3]    x[1]     x[1]     x[1]
x[1], x[3], x[2]    x[1]     x[1]     x[1]
x[2], x[1], x[3]    x[2]     x[1]     x[1]
x[2], x[3], x[1]    x[2]     x[2]     x[1]
x[3], x[1], x[2]    x[3]     x[1]     x[1]
x[3], x[2], x[1]    x[3]     x[2]     x[1]
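The three-point scenario of Table 3 can be replayed with a small sketch of look around pruning. The function name, the density values, and the dictionary-based neighborhood are hypothetical; the sketch covers only the case discussed in the text, where all three points lie in one another's neighborhood.

```python
from itertools import permutations


def look_around(order, density, neighbors):
    """Return the set of points still flagged as density attractors."""
    attractors = set()
    for x in order:
        # Only previously labeled attractors inside x's neighborhood are compared.
        rivals = {a for a in attractors if a in neighbors[x]}
        if all(density[x] > density[a] for a in rivals):
            attractors -= rivals   # de-label the old attractors
            attractors.add(x)      # x becomes the new attractor
    return attractors


# Den[1] > Den[2] > Den[3]; every point is a neighbor of the other two.
density = {1: 300, 2: 250, 3: 220}
neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
results = {order: look_around(order, density, neighbors)
           for order in permutations(density)}
print(results)
```

All six feeding orders end with x[1] as the only attractor, matching the last column of Table 3.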
3.3. Time Complexity Analysis
Let $f$ be the fan-out of a P-tree and let n be the number of data points it represents. We first present
some lemmas on P-trees, and then derive the average run time complexity $O(n\sqrt{n})$.
Lemma 3.3.1. The number of levels of a P-tree is $k = \log_f n$.
Proof Sketch: The numbers of nodes in successive levels of a P-tree are $1, f, f^2, f^3, \dots, f^k$. The
leaf level k is n bits long, i.e., $f^k = n$. Thus $k = \log_f n$.
Lemma 3.3.2. The maximum number of nodes in a P-tree in the worst case is $T = (n - 1)/(f - 1)$.
Proof Sketch: Without compression, the total number of nodes is
$T = 1 + f + f^2 + f^3 + \dots + f^{k-1} = (f^k - 1)/(f - 1)$. According to Lemma 3.3.1, $f^k = n$, so we get
$T = (n - 1)/(f - 1)$.
Lemma 3.3.3. The total number of nodes in a P-tree with compression ratio $\lambda$ ($\lambda < 1$) is
$T = 1 + (\lambda^{k-1} n - f)/(f\lambda - 1)$, where k is the number of levels of the P-tree.
Proof Sketch: The number of nodes at level i of a P-tree with compression ratio $\lambda$ is $f^i \lambda^{i-1}$,
where i ranges from 1 to k-1. For example, at level 2 there are $(f\lambda) \cdot f = f^2 \lambda$ nodes. The
total number of nodes with compression ratio $\lambda$ is

$$\begin{aligned}
T &= 1 + f + f^2 \lambda + f^3 \lambda^2 + \dots + f^{k-1} \lambda^{k-2} \\
  &= 1 + f \left( f^{k-1} \lambda^{k-1} - 1 \right) / (f\lambda - 1) \\
  &= 1 + \left( f^k \lambda^{k-1} - f \right) / (f\lambda - 1) \\
  &= 1 + \left( \lambda^{k-1} n - f \right) / (f\lambda - 1)
\end{aligned}$$

Corollary 3.3.1. When $\lambda = 0$, the formula gives $1 + f$ nodes (the root and its f children); when
$\lambda = 1$, the total number of nodes is $(n - f)/(f - 1) + 1$, in agreement with Lemma 3.3.2. When
$\lambda = 0.5$ and $f = 4$, we have $f\lambda = 2$ and $4^k / 2^k = 2^k = \sqrt{n}$, so the total number of nodes is

$$T = 1 + \left( 4^k / 2^{k-1} - 4 \right) / (2 - 1) = 1 + \left( 2\sqrt{n} - 4 \right) = 2\sqrt{n} - 3$$

Theorem 3.3.1. The average run time complexity of PHDClustering with compression ratio 0.5 and
fan-out 4 is $O(dn\sqrt{n})$, where d is the number of dimensions.
Proof Sketch: The P-tree ANDing operation is executed node by node to calculate the density. Each node
ANDing is counted as one operation. For n data points in d dimensions, there are d*m basic P-trees,
where m is the maximal bit length of each dimension. The total run time to get the density P-trees is d*m*n*T,
where T is the total number of nodes of a P-tree.
For data sets with fan-out $f = 4$ and average compression ratio $\lambda = 0.5$, according to Corollary 3.3.1, the
total number of nodes of a P-tree is $T = 2\sqrt{n} - 3$. Therefore, the total time to get the density for n
data points in d dimensions is $d \cdot m \cdot n \cdot (2\sqrt{n} - 3)$.
Since we adopt the look around pruning technique and only compare against previous density attractors, the run
time of finding density attractors can be neglected. Thus, treating the bit length m as a constant, the
average time complexity of density based clustering using P-trees with compression ratio 0.5 and fan-out 4 is
$O(dn\sqrt{n})$.
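As a rough numerical check of the growth rate, the per-level node counts described in the proof sketch of Lemma 3.3.3 can be summed directly. The function name is hypothetical, and the sketch assumes the series runs over levels 1 to k-1 with the stated per-level count (fan-out to the power i times compression ratio to the power i-1).

```python
import math


def ptree_node_count(fan_out, ratio, levels):
    """Root plus the per-level node counts for levels 1 .. levels - 1."""
    total = 1.0
    for i in range(1, levels):
        total += fan_out ** i * ratio ** (i - 1)
    return total


for k in (3, 5, 8, 10):
    n = 4 ** k                      # number of data points for fan-out 4
    nodes = ptree_node_count(4, 0.5, k)
    print(k, n, nodes, nodes / math.sqrt(n))
```

For fan-out 4 and compression ratio 0.5, the ratio of the node count to the square root of n stays bounded as k grows, which is the behavior the $O(dn\sqrt{n})$ bound relies on.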
4. Experiment Evaluation
Our experiments were implemented in C++ on a 1 GHz Pentium PC with 1 GB of main
memory, running Debian Linux 4.0. The test data includes an aerial TIFF image (with Red, Green
and Blue band reflectance values), moisture, and nitrate maps of the Oaks area in North Dakota. The
data is prepared in five sizes, that is, 128x128, 256x256, 512x512, 1024x1024, and 2048x2048. The data
sets are available at [4]. We evaluate our proposed P-tree Hawaiian ring based density clustering
algorithm, PHDClustering, with respect to scalability, quality and parameter sensitivity.
In this experiment, we compare our proposed PHDClustering with several Density Function based
Clustering (DFC) methods using different distance metrics, including Manhattan distance based DFC
(DFC-Manhattan), Euclidean distance based DFC (DFC-Euclidean), and Max distance based DFC
(DFC-Max). The experiment was performed on the five different sizes of data sets. The average CPU
run time of 30 runs is shown in Figure 5.
Figure 5. Running Time Comparison of PHDCluster with other Density Clustering using Different Metrics
(x-axis: data size in number of tuples, 128x128 to 2048x2048; y-axis: average run time in seconds, 0 to 1800;
series: DFC-Manhattan, DFC-Max, DFC-Euclidean, PHDCluster)
From Figure 5, we see that DFC-Manhattan is faster than DFC-Max and DFC-Euclidean, but the PHDCluster
method is much faster than all of them on these five data sets. In particular, as the data set size
increases, the run time of PHDCluster increases at a much lower rate than that of the other methods. The
experimental results show that the PHDCluster method is more scalable for large spatial data sets.
5. Conclusion
In this paper, we propose an efficient P-tree Hawaiian density based clustering algorithm (PHDCluster),
with average time complexity $O(dn\sqrt{n})$, for spatial data sets. PHDCluster exploits a new
distance metric, the Hawaiian metric, to calculate density functions using Peano Count Trees (P-trees). The
Hawaiian metric is natural for spatial data, and the calculation of Hawaiian metrics using P-trees is
extremely fast. Our proposed method has cardinality scalability comparable to that of other density methods
for small and medium-sized data, but is shown to be superior regarding dimensional scalability.
Our method is particularly useful for data streams. In data streams, such as large sets of transactions,
remotely sensed images, multimedia video, etc., new data keeps arriving continually. Therefore both
speed and accuracy are critical issues. Achieving high speed using P-trees, and high accuracy using the
weighted Hawaiian metrics, provides a density based clustering method that is well suited to the
clustering of stream data. Besides spatial data, our method also has potential applications in other areas,
such as DNA microarray and medical image analysis.
References
1. Perrizo, W. 2001. Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.
2. Khan, M., Ding, Q., and Perrizo, W. 2002. k-Nearest Neighbor Classification on Spatial Data
Streams Using P-Trees. PAKDD 2002, Springer-Verlag, LNAI 2336, pp. 517-528.
3. Hinneburg, A. and Keim, D. A. 1998. An Efficient Approach to Clustering in Large Multimedia
Databases with Noise. Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press.
4. TIFF image data sets. Available at http://midas-10cs.ndsu.nodak.edu/data/images/.
5. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1997. Density-Connected Sets and their Application
for Trend Detection in Spatial Databases. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining,
AAAI Press.
6. Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. 1996. A density-based algorithm for
discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM
SIGKDD, 226-231, Portland, Oregon.
7. Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. 1998. Density-based clustering in spatial
databases: the algorithm GDBSCAN and its applications. In Data Mining and Knowledge
Discovery, 2, 2, 169-194.
8. Ankerst, M., Breunig, M., Kriegel, H.-P., and Sander, J. 1999. OPTICS: Ordering
points to identify clustering structure. In Proceedings of the ACM SIGMOD Conference, 49-60,
Philadelphia, PA.
9. Xu, X., Ester, M., Kriegel, H.-P., and Sander, J. 1998. A distribution-based clustering
algorithm for mining in large spatial databases. In Proceedings of the 14th ICDE, 324-331,
Orlando, FL.
10. Han, J. and Kamber, M. 2001. Data Mining. Morgan Kaufmann Publishers.
11. Han, J., Kamber, M., and Tung, A. K. H. 2001. Spatial clustering methods in data mining: A
survey. In Miller, H. and Han, J. (Eds.) Geographic Data Mining and Knowledge Discovery, Taylor
and Francis.
12. Perrizo, W., Ding, Q., Denton, A., Scott, K., Ding, Q., and Khan, M. 2003. PINE: Podium
Incremental Neighbor Evaluator for Classifying Spatial Data. SAC 2003, Melbourne, Florida, USA.
13. Samet, H. 1989. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA.
14. Sellis, T., Roussopoulos, N., and Faloutsos, C. 1997. Multidimensional Access Methods: Trees Have
Grown Everywhere. Proceedings of the 23rd International Conference on Very Large Data Bases
(VLDB), pp. 13-15.