CP5605/CP3300: Knowledge Discovery and Data Mining

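DBSCAN: Core Point w.r.t. Radius Eps and MinPts
- Eps = d
- MinPts = 3
- NEps(x) is the set of points within distance Eps of x (x included); x is a core point if |NEps(x)| >= MinPts
- |NEps(p)| = 3, so p is a core point
- |NEps(q)| = 2, so q is not a core point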
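
The core-point test can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names, the sample coordinates, and the choice of Euclidean distance are assumptions, and a point is counted as a member of its own neighborhood.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points whose Eps-neighborhood holds at least min_pts points."""
    return [i for i in range(len(points))
            if len(eps_neighborhood(points, i, eps)) >= min_pts]

# Hypothetical sample data, chosen only for illustration.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.4, 0.0), (5.0, 5.0)]
eps, min_pts = 1.0, 3

sizes = [len(eps_neighborhood(points, i, eps)) for i in range(len(points))]
cores = core_points(points, eps, min_pts)
print(sizes)  # neighborhood size of each point
print(cores)  # indices of the core points
```

With MinPts = 3, only the points whose neighborhood size reaches 3 are reported as core points; the isolated point has a neighborhood containing only itself.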
DBSCAN: Eps = d, MinPts = 4
- |NEps(a)| = 3
- |NEps(b)| = 5
- |NEps(c)| = 4
Which are core points?
DBSCAN: Directly Density Reachable (DDR)
Eps = d, MinPts = 3
- p1 is a core point
- p2 is a core point
- p2 belongs to NEps(p1), so p2 is DDR from p1
- |NEps(p1)| = _   |NEps(p2)| = __
DBSCAN: Density-Reachable (DR)
Eps = d, MinPts = 3
- p1 and p2 are DDR
- p2 and p3 are DDR
- So p1 and p3 are DR
- |NEps(p3)| = 4
- |NEps(p1)| = _   |NEps(p2)| = __
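
The difference between DDR and DR can be checked with a short sketch. It assumes Euclidean distance and hypothetical points spaced one unit apart on a line; the function names are illustrative. A chain of DDR steps can only be extended from core points, which is why DR is not symmetric in general.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_ddr(points, q, p, eps, min_pts):
    """q is directly density-reachable from p: p is a core point and q is in NEps(p)."""
    n = neighbors(points, p, eps)
    return len(n) >= min_pts and q in n

def is_dr(points, q, p, eps, min_pts):
    """q is density-reachable from p: a chain of DDR steps leads from p to q."""
    reached, frontier = set(), deque([p])
    while frontier:
        cur = frontier.popleft()
        n = neighbors(points, cur, eps)
        if len(n) < min_pts:
            continue  # a non-core point cannot extend the chain
        for j in n:
            if j not in reached:
                reached.add(j)
                frontier.append(j)
    return q in reached

# Hypothetical data: five points on a line plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
eps, min_pts = 1.0, 3
p1, p2, p3 = 1, 2, 3  # indices playing the roles of p1, p2, p3

print(is_ddr(points, p2, p1, eps, min_pts))  # p2 lies in NEps(p1)
print(is_ddr(points, p3, p1, eps, min_pts))  # p3 is too far for a single step
print(is_dr(points, p3, p1, eps, min_pts))   # but reachable through p2
print(is_dr(points, p1, 0, eps, min_pts))    # point 0 is not core, so nothing is DR from it
```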
DBSCAN: Density-Connected (DC)
Eps = d, MinPts = 4
- p1 is DR from o
- p2 is DR from o
- So p1 and p2 are DC (both are DR from the same point o)
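
Density-connectedness can be tested by searching for a common point o from which both points are DR. The sketch below is illustrative only: the function names and the data are assumptions, and it uses MinPts = 3 on hypothetical points rather than the slide's figure.

```python
import math

def reachable_from(points, o, eps, min_pts):
    """Indices density-reachable from o (empty set if o is not a core point)."""
    reached, stack = set(), [o]
    while stack:
        cur = stack.pop()
        n = [j for j, q in enumerate(points) if math.dist(points[cur], q) <= eps]
        if len(n) < min_pts:
            continue  # only core points extend the chain
        for j in n:
            if j not in reached:
                reached.add(j)
                stack.append(j)
    return reached

def is_dc(points, a, b, eps, min_pts):
    """a and b are density-connected if both are DR from some common point o."""
    return any({a, b} <= reachable_from(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical data: five points on a line plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
eps, min_pts = 1.0, 3

print(is_dc(points, 0, 4, eps, min_pts))  # both ends are DR from the middle point
print(is_dc(points, 0, 5, eps, min_pts))  # the far point is DR from nothing
```

The two end points are border points: neither is DR from the other, yet they are DC because both are DR from the middle of the chain.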
DBSCAN Algorithm
1. Randomly select a point o.
2. If o is a non-core point, label it as noise.
3. If o is a core point, create a new cluster Ci.
4. Retrieve all points that are DR from o.
5. Add the points to Ci.
6. Repeat until no more core points are found.
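
The steps above can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not a reference implementation: points are visited in index order rather than at random, Euclidean distance is used, and a point first labeled as noise is relabeled when it later turns out to be DR from a core point (a detail the slide does not spell out). The names `dbscan`, `region`, and `NOISE` are hypothetical.

```python
import math

NOISE = -1

def region(points, i, eps):
    """Indices within distance eps of points[i], including i."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for o in range(len(points)):          # index order stands in for random selection
        if labels[o] is not None:
            continue
        n = region(points, o, eps)
        if len(n) < min_pts:
            labels[o] = NOISE             # may be relabeled as a border point later
            continue
        cluster += 1                      # o is a core point: start a new cluster
        labels[o] = cluster
        seeds = list(n)
        while seeds:                      # collect everything DR from o
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = region(points, j, eps)
            if len(nj) >= min_pts:        # only core points extend the cluster
                seeds.extend(nj)
    return labels

# Hypothetical data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```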
The STING Clustering Method
- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical information for each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells
  - count, mean, s (standard deviation), min, max
  - type of distribution: normal, uniform, etc.
- Use a top-down approach to answer spatial data queries
  - Start from a pre-selected layer, typically one with a small number of cells
  - For each cell in the current level, compute the confidence interval
  - Remove the irrelevant cells from further consideration
  - When finished examining the current layer, proceed to the next lower level
  - Repeat this process until the bottom layer is reached
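
To illustrate how a higher-level cell's parameters follow from its children, here is a minimal sketch. It assumes that s is the population standard deviation and that at least one child cell is non-empty; the class and function names are hypothetical, and the distribution type is left out.

```python
import math
import statistics
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int
    mean: float
    s: float        # population standard deviation (assumed meaning of "s")
    minimum: float
    maximum: float

def leaf_stats(values):
    """Statistics of a bottom-level cell computed directly from its values."""
    return CellStats(len(values), statistics.fmean(values),
                     statistics.pstdev(values), min(values), max(values))

def merge_children(children):
    """Statistics of a parent cell computed only from its children's statistics."""
    children = [c for c in children if c.count > 0]   # skip empty cells
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # The mean of squares of each child is s^2 + mean^2.
    mean_sq = sum(c.count * (c.s ** 2 + c.mean ** 2) for c in children) / n
    s = math.sqrt(max(mean_sq - mean ** 2, 0.0))
    return CellStats(n, mean, s,
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))

# Hypothetical attribute values in two child cells.
parent = merge_children([leaf_stats([1, 3]), leaf_stats([5, 7, 9])])
direct = leaf_stats([1, 3, 5, 7, 9])
print(parent)
print(direct)
```

The parent computed from the two children matches the statistics computed directly from all five values, which is what lets the hierarchy be built bottom-up without revisiting the raw data.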
Comments on STING
Advantages:
- Query-independent
- Easy to parallelize, incremental update
- O(K), where K is the number of grid cells at the lowest level
Disadvantages:
- All cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
WaveCluster: Clustering by Wavelet Analysis (1998)
- Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
- A multi-resolution clustering approach that applies a wavelet transform to the feature space
- How to apply the wavelet transform to find clusters:
  - Summarize the data by imposing a multidimensional grid structure onto the data space
  - These multidimensional spatial data objects are represented in an n-dimensional feature space
  - Apply the wavelet transform to the feature space to find the dense regions in the feature space
  - Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse
Quantization & Transformation
- First, quantize the data into an m-D grid structure, then apply the wavelet transform
  - a) scale 1: high resolution
  - b) scale 2: medium resolution
  - c) scale 3: low resolution
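
The quantize-then-transform idea can be sketched with NumPy. This is a simplified illustration, not the WaveCluster implementation: it uses 2x2 block averaging, which equals the Haar low-pass (approximation) band up to a constant factor, and it stops at thresholding instead of extracting connected components. The grid size, the threshold, and the sample points are assumptions.

```python
import numpy as np

def quantize(points, bins, lo, hi):
    """Count the points falling in each cell of a bins x bins grid over [lo, hi]^2."""
    grid, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                bins=bins, range=[[lo, hi], [lo, hi]])
    return grid

def haar_approx(grid):
    """One level of 2-D Haar smoothing: average each 2x2 block, halving the resolution."""
    r, c = grid.shape
    return grid.reshape(r // 2, 2, c // 2, 2).mean(axis=(1, 3))

# Hypothetical 2-D data: two small groups and one stray point.
points = np.array([[1.2, 1.3], [1.4, 1.1], [1.6, 1.7], [1.1, 1.8],
                   [6.2, 6.5], [6.7, 6.1], [6.4, 6.4],
                   [3.5, 4.5]])

scale1 = quantize(points, bins=8, lo=0.0, hi=8.0)  # high resolution, 8x8
scale2 = haar_approx(scale1)                       # medium resolution, 4x4
scale3 = haar_approx(scale2)                       # low resolution, 2x2

dense = np.argwhere(scale2 >= 0.5)                 # assumed density threshold
print(scale2)
print(dense)
```

At the medium scale the stray point is averaged below the threshold while the two groups remain as dense cells.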
The EM (Expectation Maximization) Algorithm
- Initially, randomly assign k cluster centers
- Iteratively refine the clusters based on two steps
  - Expectation step: assign each data point $X_i$ to cluster $C_k$ with probability $P(X_i \in C_k) = p(C_k \mid X_i) = \frac{p(C_k)\, p(X_i \mid C_k)}{p(X_i)}$
  - Maximization step: estimation of model parameters
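
As a concrete instance of the two steps, the sketch below runs EM on a one-dimensional mixture of Gaussians. The Gaussian model, the function name `em_gmm_1d`, the synthetic data, the iteration count, and the small variance floor are all assumptions made for illustration; the slide does not fix a particular model.

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, seed=0):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, responsibilities."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False).astype(float)  # random initial centers
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # Expectation step: probability of each cluster given each point (Bayes' rule).
        diff = x[:, None] - means
        dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
        joint = weights * dens
        resp = joint / joint.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate the model parameters from the soft assignments.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances, resp

# Hypothetical data: two groups of 200 points each.
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(0.0, 1.0, 200), data_rng.normal(10.0, 1.0, 200)])

weights, means, variances, resp = em_gmm_1d(x, k=2)
print(weights, means, variances)
```

Each row of `resp` holds the soft cluster memberships of one point and sums to 1. Like k-means, the result depends on the random initial centers.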
Self-Organizing Feature Map (SOM)
- Initialize $w_{ij}$
- Distance: $d_i = \sum_j (w_{ij} - x_j)^2$
- Input vector $X = [x_1, x_2] = [1, 2]$, weights $W_1 = [w_{11}, w_{12}] = [3, 4]$
- $d_1 = (1-3)^2 + (2-4)^2 = 8$
- Weight update: $w_{ij} = w_{ij} + \eta (x_j - w_{ij})$
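
The worked example can be reproduced in a few lines. The learning rate value 0.5 (named `eta` here) is an assumption chosen for illustration, as are the function names; the input vector and W1 are the ones from the slide.

```python
def squared_distance(w, x):
    """d_i: sum over j of (w_ij - x_j)^2."""
    return sum((wj - xj) ** 2 for wj, xj in zip(w, x))

def update(w, x, eta):
    """Move each weight toward the input: w_ij + eta * (x_j - w_ij)."""
    return [wj + eta * (xj - wj) for wj, xj in zip(w, x)]

x = [1, 2]     # input vector X from the slide
w1 = [3, 4]    # weight vector W1 from the slide
eta = 0.5      # assumed learning rate

d1 = squared_distance(w1, x)
w1_new = update(w1, x, eta)
d1_new = squared_distance(w1_new, x)
print(d1, w1_new, d1_new)
```

After one update the weight vector has moved halfway toward the input, so its distance to X shrinks.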