Spatial Data Mining

Efficient and Effective Clustering Methods for Spatial Data Mining
Raymond T. Ng, Jiawei Han
Pavan Podila
COSC 6341, Fall ‘04
1
Overview
• Spatial Data Mining
• Clustering techniques
• CLARANS
• Spatial and Non-Spatial dominant CLARANS
• Observations
• Summary
2
Spatial Data Mining
• Identifying interesting relationships and characteristics that may exist implicitly in spatial databases
• Different from relational databases
  - Spatial objects store both spatial and non-spatial attributes
  - Spatial queries ("All Walmart stores within 10 miles of UH")
  - Spatial joins work on spatial indexes (R-tree)
  - Huge sizes (terabytes)
• GIS is a classic example
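A minimal sketch of the kind of range query quoted above, assuming planar coordinates measured in miles and a plain linear scan instead of an R-tree index. The helper `within_radius`, the store names and the coordinates are all hypothetical and only illustrate the idea.

```python
import math

def within_radius(points, center, radius):
    """Return the names of all points whose Euclidean distance to center is <= radius."""
    cx, cy = center
    return [name for name, (x, y) in points.items()
            if math.hypot(x - cx, y - cy) <= radius]

# Hypothetical planar coordinates in miles (not real locations).
stores = {"store_a": (3.0, 4.0), "store_b": (12.0, 0.0), "store_c": (-6.0, 7.0)}
campus = (0.0, 0.0)

nearby = within_radius(stores, campus, 10.0)
print(nearby)
```

A spatial database would answer the same question through a spatial index so that it does not have to inspect every object.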
4
Partitioning Methods
Given K, the number of partitions to create, a partitioning method constructs initial partitions. It then iteratively refines the quality of these clusters so as to maximize intra-cluster similarity and inter-cluster dissimilarity.
[Quality of Clustering]: Average dissimilarity of objects from their cluster centers (medoids)
Selected algorithms:
1. K-medoids
2. PAM
3. CLARA
4. CLARANS
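The quality measure above can be sketched as follows. This is an illustrative helper, not code from the paper: the function names, the Manhattan distance and the sample points are assumptions.

```python
def average_dissimilarity(points, medoids, dist):
    """Average distance from each object to its nearest medoid (lower is better)."""
    total = sum(min(dist(p, m) for m in medoids) for p in points)
    return total / len(points)

def manhattan(a, b):
    """Manhattan distance, used here as an example dissimilarity."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Made-up 2-D objects forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]

good = average_dissimilarity(points, [(0, 0), (10, 10)], manhattan)
bad = average_dissimilarity(points, [(0, 0), (0, 1)], manhattan)
print(good, bad)
```

Choosing one medoid per group gives a much lower average dissimilarity than placing both medoids in the same group, which is what the partitioning methods try to exploit.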
6
K-Medoids
• Partition-based clustering (K partitions)
• Effective, why?
  - Resistant to outliers
  - Does not depend on the order in which data points are examined
  - Cluster center is part of the dataset, unlike k-means, where the cluster center is the center of gravity (mean) of the cluster
  - Experiments show that large data sets are handled efficiently
[Figure: two scatter plots on 0 to 10 axes, one panel titled "K-medoids" and one titled "K-means"]
7
PAM (Partitioning Around Medoids)

[Goal]: Find K representative objects of
the data set. Each of the K objects is
called a Medoid, the most centrally
located object within a cluster.
8
PAM (2)
• Start with K data points designated as medoids. Create a cluster around each medoid by assigning every data point to its closest medoid:
  Oj belongs to the cluster of Oi if d(Oj, Oi) = min over all medoids Oe of d(Oj, Oe)
• Iteratively replace Oi with Oh if the quality of the clustering improves.
• A swapping cost, Cijh, is associated with replacing a selected object Oi with a non-selected object Oh. Summing it over all objects Oj gives the total swapping cost TCih.
[Figure: objects Oi, Oh and Oj]
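As a rough illustration of the swapping cost, the sketch below computes the total cost change TCih directly as "cost after the swap minus cost before the swap", which equals the sum of the per-object contributions. The function names and the 1-D sample data are assumptions made for the example.

```python
def total_cost(points, medoids, dist):
    """Total dissimilarity of all objects to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, o_i, o_h, dist):
    """TC_ih: change in total cost if medoid o_i is replaced by non-medoid o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(points, swapped, dist) - total_cost(points, medoids, dist)

# Made-up 1-D objects; absolute difference as the dissimilarity.
points = [1, 2, 3, 10, 11, 12]
dist = lambda a, b: abs(a - b)

tc = swap_cost(points, [1, 10], 1, 2, dist)
print(tc)
```

A negative value means the swap improves the clustering; here replacing medoid 1 with object 2 lowers the total cost from 6 to 5.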
9
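PAM (3)
1. Select K representative objects arbitrarily.
2. Compute TCih for all pairs (Oi, Oh), where Oi is a selected object and Oh is a non-selected object.
3. Select the pair (Oi, Oh) with the minimum TCih.
4. If the minimum TCih < 0 (Yes), replace Oi with Oh and go back to step 2.
5. Otherwise (No), for every Oj find the most similar representative object and stop.
* O(k(n-k)^2) for each iteration
* Good for small data sets (n=100, k=5)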
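The PAM loop can be sketched in a few lines. This is a naive illustration that assumes distinct, hashable objects and recomputes the full cost for every candidate swap, so it does not reach the per-iteration bound quoted on the slide. The names `pam` and `total_cost` and the sample data are assumptions.

```python
def total_cost(points, medoids, dist):
    """Total dissimilarity of all objects to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])  # arbitrary initial choice: the first k objects
    while True:
        current = total_cost(points, medoids, dist)
        best_delta, best_swap = 0, None
        for i in range(k):
            for o_h in points:
                if o_h in medoids:
                    continue
                candidate = medoids[:i] + [o_h] + medoids[i + 1:]
                delta = total_cost(points, candidate, dist) - current  # TC_ih
                if delta < best_delta:
                    best_delta, best_swap = delta, candidate
        if best_swap is None:  # minimum TC_ih is not negative: stop
            break
        medoids = best_swap
    # Final step: assign every object to its most similar medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return medoids, clusters

points = [1, 2, 3, 10, 11, 12]  # made-up 1-D objects
medoids, clusters = pam(points, 2, lambda a, b: abs(a - b))
print(sorted(medoids), clusters)
```

On this toy input the loop settles on the central object of each group as the medoid.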
10
CLARA (Clustering LARge Applications)
• Improvement over PAM
• Finds medoids in a sample from the dataset
• [Idea]: If the samples are sufficiently random, the medoids of the sample approximate the medoids of the dataset
• [Heuristics]: 5 samples of size 40+2k give satisfactory results
• Works well for large datasets (n=1000, k=10)
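A simplified sketch of the sampling idea: draw 5 samples of size 40+2k, run a naive PAM on each sample, and keep the medoids that score best on the whole dataset. The helper names, the fixed random seed and the synthetic data are assumptions, and refinements of the original CLARA procedure are left out.

```python
import random

def total_cost(points, medoids, dist):
    """Total dissimilarity of all objects to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    """Naive PAM: apply the best improving swap until none remains."""
    medoids = list(points[:k])
    while True:
        current, best = total_cost(points, medoids, dist), None
        for i in range(k):
            for o_h in points:
                if o_h in medoids:
                    continue
                cand = medoids[:i] + [o_h] + medoids[i + 1:]
                cost = total_cost(points, cand, dist)
                if cost < current:
                    current, best = cost, cand
        if best is None:
            return medoids
        medoids = best

def clara(points, k, dist, num_samples=5, rng=None):
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    sample_size = min(len(points), 40 + 2 * k)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        medoids = pam(sample, k, dist)
        cost = total_cost(points, medoids, dist)  # judged on the full dataset
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Synthetic 1-D data: two well-separated groups of 100 objects each.
points = list(range(100)) + list(range(1000, 1100))
dist = lambda a, b: abs(a - b)
medoids, cost = clara(points, 2, dist)
print(sorted(medoids), cost)
```

Only the 44 sampled objects are searched in each run, which is why the method scales better than PAM, at the price of depending on how representative the samples are.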
11
CLARANS (Clustering Large Applications based on RANdomized Search)
• A graph abstraction, Gn,k
• Each vertex is a collection of k medoids
• Two nodes S1 and S2 are neighbors if |S1 ∩ S2| = k - 1
• Each node has k(n-k) neighbors
• Cost of each node is the total dissimilarity of objects to their medoids
• PAM searches the whole graph
• CLARA searches a subgraph
[Figure: graph Gn,k with neighboring nodes S1 and S2; nodes are sets of k medoids such as {Om1, ..., Omk}, {Oa1, ..., Oak}, {Ob1, ..., Obk}, {Oc1, ..., Ock}, {Od1, ..., Odk}]
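To make the graph abstraction concrete, the sketch below represents a node as a frozenset of k medoids and builds its neighbors by swapping exactly one medoid for one non-medoid. The function names and the tiny object set are assumptions for illustration.

```python
import random

def neighbors(node, objects):
    """All nodes that share exactly k - 1 medoids with `node`."""
    result = []
    for out in node:                  # medoid to drop
        for inc in objects:           # non-medoid to bring in
            if inc not in node:
                result.append((node - {out}) | {inc})
    return result

def random_neighbor(node, objects, rng):
    """Pick one neighbor at random without enumerating all of them."""
    out = rng.choice(sorted(node))
    inc = rng.choice(sorted(set(objects) - node))
    return (node - {out}) | {inc}

objects = list(range(6))              # n = 6 made-up object ids
node = frozenset({0, 1})              # k = 2 medoids
all_neighbors = neighbors(node, objects)
one_neighbor = random_neighbor(node, objects, random.Random(0))
print(len(all_neighbors), sorted(one_neighbor))
```

With n = 6 and k = 2 the node has k(n-k) = 8 neighbors, and each of them overlaps the node in exactly k - 1 = 1 medoid.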
13
CLARANS (2)
1. Input maxNeighbors and numLocal. Set i = 1, minCost = ∞, bestNode = -1.
2. Set current to a random node of Gn,k.
3. Set j = 1.
4. Pick a random neighbor S of current.
5. If TCS < TCcurrent, that is, S has a lower cost than current (Yes), set current = S and go to step 3.
6. Otherwise (No), j++. If j ≤ maxNeighbors, go to step 4.
7. Otherwise, if TCcurrent < minCost, set minCost = TCcurrent and bestNode = current.
8. i++. If i > numLocal, output bestNode and stop. Otherwise, go to step 2.
Experimental values
• numLocal = 2
• maxNeighbors = max(1.25% of k(n-k), 250)
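The CLARANS search can be sketched as follows, using the experimental values quoted on the slide as defaults. This is an illustrative version under simple assumptions: objects are distinct, the full cost is recomputed for every neighbor, and the function names, seed and sample data are made up for the example.

```python
import random

def total_cost(points, medoids, dist):
    """Cost of a node: total dissimilarity of objects to their medoids."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clarans(points, k, dist, num_local=2, max_neighbors=None, rng=None):
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    n = len(points)
    if max_neighbors is None:
        max_neighbors = max(int(0.0125 * k * (n - k)), 250)
    best_node, min_cost = None, float("inf")
    for _ in range(num_local):                    # i = 1 .. numLocal
        current = rng.sample(points, k)           # random node of G_{n,k}
        current_cost = total_cost(points, current, dist)
        j = 1
        while j <= max_neighbors:
            # Random neighbor: swap one medoid for one non-medoid.
            i_out = rng.randrange(k)
            o_h = rng.choice([p for p in points if p not in current])
            neighbor = current[:i_out] + [o_h] + current[i_out + 1:]
            neighbor_cost = total_cost(points, neighbor, dist)
            if neighbor_cost < current_cost:      # move and restart the count
                current, current_cost = neighbor, neighbor_cost
                j = 1
            else:
                j += 1
        if current_cost < min_cost:               # keep the best local minimum
            best_node, min_cost = current, current_cost
    return best_node, min_cost

points = [1, 2, 3, 10, 11, 12]                    # made-up 1-D objects
best, cost = clarans(points, 2, lambda a, b: abs(a - b))
print(sorted(best), cost)
```

Unlike PAM, the search never examines all neighbors of a node: it accepts the first random neighbor that improves the cost and declares a local minimum after maxNeighbors unsuccessful tries.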
14
CLARANS (3)


Outperforms PAM and CLARA in terms
of running time and quality of
clustering
O(n^2) for each iteration
[Figures: CLARANS vs CLARA; CLARANS vs PAM]
15
Generalization
• Useful to mine non-spatial attributes
• Process of merging tuples based on a concept hierarchy
• DBLearn: takes an SQL query, generalization hierarchies and a threshold
[Figure: concept hierarchies for the relation Sphere(color, diameter), used to turn the initial relation into a generalized relation. Color hierarchy nodes: color, reddish, red, orange, yellowish, yellow, bluish, green, blue, indigo, violet. Diameter hierarchy: small (1...20), large (21...40).]
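A small sketch of generalization by climbing a concept hierarchy and merging identical tuples. It is not DBLearn itself: the threshold that controls how far to generalize is left out, the parent assigned to each color is an assumption (the slide figure does not survive well enough to confirm it), and the sample tuples are made up.

```python
from collections import Counter

# Assumed one-level color hierarchy; the exact parent of each color is a guess.
COLOR_PARENT = {"red": "reddish", "orange": "reddish", "yellow": "yellowish",
                "blue": "bluish", "indigo": "bluish", "violet": "bluish"}

def generalize_diameter(d):
    """Map a diameter to the ranges shown on the slide: small 1...20, large 21...40."""
    if 1 <= d <= 20:
        return "small"
    if 21 <= d <= 40:
        return "large"
    raise ValueError(f"diameter {d} is outside the hierarchy")

def generalize(relation):
    """Replace each value by its parent concept and merge identical tuples, keeping counts."""
    return Counter((COLOR_PARENT[c], generalize_diameter(d)) for c, d in relation)

# Made-up initial relation Sphere(color, diameter).
spheres = [("red", 5), ("orange", 12), ("blue", 30), ("violet", 35), ("indigo", 8)]
generalized = generalize(spheres)
print(generalized)
```

Five initial tuples collapse into three generalized tuples, each carrying the number of original tuples it covers.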
17
Silhouette
• Silhouette of object Oj
  - Determines how much Oj belongs to its cluster
  - Between -1 and 1
  - 1 indicates a high degree of membership
• Silhouette width of a cluster
  - Average silhouette of all objects in the cluster
• Silhouette coefficient
  - Average of the silhouette widths of the k clusters

Silhouette width    Interpretation
0.71 – 1            Strong cluster
0.51 – 0.7          Reasonable cluster
0.26 – 0.5          Weak or artificial cluster
≤ 0.25              No cluster found
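The slide does not give the formula, so the sketch below assumes the standard definition by Kaufman and Rousseeuw: s = (b - a) / max(a, b), where a is the average dissimilarity of an object to the other objects in its own cluster and b is the smallest average dissimilarity to any other cluster. Function names and sample data are assumptions, and objects are assumed to be distinct.

```python
def silhouette(obj, own_cluster, other_clusters, dist):
    """Silhouette of one object: (b - a) / max(a, b)."""
    others_in_own = [p for p in own_cluster if p != obj]
    if not others_in_own:
        return 0.0  # common convention for a cluster with a single object
    a = sum(dist(obj, p) for p in others_in_own) / len(others_in_own)
    b = min(sum(dist(obj, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

def silhouette_width(cluster, other_clusters, dist):
    """Average silhouette of all objects in one cluster."""
    return sum(silhouette(o, cluster, other_clusters, dist) for o in cluster) / len(cluster)

def silhouette_coefficient(clusters, dist):
    """Average of the silhouette widths of all clusters."""
    widths = [silhouette_width(c, clusters[:i] + clusters[i + 1:], dist)
              for i, c in enumerate(clusters)]
    return sum(widths) / len(widths)

dist = lambda a, b: abs(a - b)
clusters = [[1, 2, 3], [10, 11, 12]]  # made-up 1-D clustering
s_first = silhouette(1, clusters[0], [clusters[1]], dist)
coefficient = silhouette_coefficient(clusters, dist)
print(s_first, coefficient)
```

For this toy clustering the coefficient is about 0.85, which falls in the "strong cluster" band of the table.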
18
SD and NSD approach
• SD – Spatial Dominant
• NSD – Non-Spatial Dominant
• Clustering for spatial attributes, generalization for non-spatial attributes
• Dominance is decided by what is carried out first (clustering or generalization)
• Second phase works on tuples from the previous stage
19
SD(CLARANS)
1. Specify the learning request in the form of an SQL query to select tuples from the data.
2. Run CLARANS on the spatial attributes of the tuples to find Knat clusters.
3. For every cluster, collect the non-spatial components and apply DBLearn.
• Finds non-spatial generalizations from spatial clustering
• Value for Knat is determined through heuristics using the silhouette coefficients
• Clustering phase can be treated as finding a spatial generalization hierarchy
20
NSD(CLARANS)
1. Specify the learning request as an SQL query to select tuples from the data.
2. Apply DBLearn to the non-spatial attributes to obtain generalized tuples.
3. For every generalized tuple, collect the spatial components and run CLARANS to find Knat clusters.
4. Check if any clusters overlap. Merge them.
• Finds spatial clusters from non-spatial generalizations
• Clusters may overlap
21
Observations
• In all previous methods, the quality of mining depends on the SQL query
• CLARANS assumes that the entire dataset is in memory. This is not always the case for large data sets.
• Quality of results cannot be guaranteed when N is very large, because of the randomized search
23
Observations (2)
• Other clustering algorithms proposed for Spatial Data Mining
  - Hierarchical: BIRCH
  - Density based: DBSCAN, GDBSCAN, DBRS
  - Grid based: STING
24
Summary
• A seminal paper on the use of clustering for spatial data mining
• CLARANS is an effective clustering technique for large datasets
• SD(CLARANS)/NSD(CLARANS) are effective spatial data mining algorithms
25
References
Primary
• Efficient and Effective Clustering Methods for Spatial Data Mining (1994) - Raymond T. Ng, Jiawei Han
Secondary
• CLARANS: A Method for Clustering Objects for Spatial Data Mining - Raymond T. Ng, Jiawei Han
• Clustering for Mining in Large Spatial Databases - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
• An Introduction to Spatial Database Systems - Ralf Hartmut Güting
26