CSE 634 Data Mining Techniques
CLUSTERING
Part 2 (Group No. 1)
By: Anushree Shibani Shivaprakash & Fatima Zarinni
Spring 2006
Professor Anita Wasilewska
SUNY Stony Brook
References
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm
Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
Margaret H. Dunham. Data Mining: Introductory and Advanced Topics.
http://cs.sunysb.edu/~cse634/ Presentation 9 – Cluster Analysis
Introduction
Major clustering methods:
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Hierarchical methods
Here we group data objects into a tree of clusters.
There are two types of hierarchical clustering:
1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering
Agglomerative hierarchical clustering
Groups data objects in a bottom-up fashion.
Initially, each data object is in its own cluster.
We then merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
A user can specify the desired number of clusters as a termination condition.
Divisive hierarchical clustering
Groups data objects in a top-down fashion.
Initially, all data objects are in one cluster.
We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained.
AGNES & DIANA
Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects, {a, b, c, d, e}.
[Figure: AGNES merges the singletons bottom-up into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e} over steps 0-4, while DIANA performs the same splits top-down in the reverse order.]
AGNES-Explored
Given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
AGNES
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering.
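To make these steps concrete, here is a minimal Python sketch of the procedure (NumPy assumed; the function and parameter names are illustrative, and `cluster_distance` stands for whichever linkage criterion is chosen, as described on the next slide):

```python
import numpy as np

def agglomerative(points, cluster_distance, n_clusters=1):
    """Johnson's procedure: start with singleton clusters and
    repeatedly merge the closest pair until n_clusters remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]   # step 1: one item per cluster
    merges = []
    while len(clusters) > n_clusters:
        # step 2: find the closest (most similar) pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(points[clusters[a]], points[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        # steps 3-4: merge the pair; distances are recomputed on the next pass
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges
```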
Similarity/Distance metrics
Single-link clustering: distance = shortest distance.
Complete-link clustering: distance = longest distance.
Average-link clustering: distance = average distance.
In each case the distance is taken from any member of one cluster to any member of the other cluster.
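As an illustration only, the three linkage criteria can be written as small functions that plug into the sketch above (Euclidean distance between individual points is assumed):

```python
import numpy as np

def single_link(A, B):      # shortest pairwise distance between the two clusters
    return min(np.linalg.norm(a - b) for a in A for b in B)

def complete_link(A, B):    # longest pairwise distance between the two clusters
    return max(np.linalg.norm(a - b) for a in A for b in B)

def average_link(A, B):     # average pairwise distance between the two clusters
    return float(np.mean([np.linalg.norm(a - b) for a in A for b in B]))

# e.g. agglomerative(data, cluster_distance=single_link, n_clusters=3)
```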
Single Linkage Hierarchical Clustering
1. Say "every point is its own cluster."
2. Find the "most similar" pair of clusters.
3. Merge them into a parent cluster.
4. Repeat.
DIANA (Divisive ANAlysis)
Introduced in Kaufman and Rousseeuw (1990).
Inverse order of AGNES.
Eventually each node forms a cluster on its own.
[Figure: three scatter plots on 0-10 axes showing DIANA progressively splitting the single initial cluster into smaller clusters.]
Overview
Divisive clustering starts by placing all objects into a single group.
Before we start the procedure, we need to decide on a threshold distance.
The procedure is as follows:
1. The distance between all pairs of objects within the same group is determined, and the pair with the largest distance is selected.
Overview (contd.)
2. This maximum distance is compared to the threshold distance.
If it is larger than the threshold, the group is divided in two. This is done by placing the selected pair into different groups and using them as seed points. All other objects in this group are examined and placed into the new group with the closest seed point. The procedure then returns to Step 1.
If the distance between the selected objects is less than the threshold, the divisive clustering stops.
To run a divisive clustering, you simply need to decide upon a method of measuring the distance between two objects.
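A minimal Python sketch of this threshold-based divisive procedure (NumPy assumed; the names and structure are illustrative, not taken from a particular implementation):

```python
import numpy as np

def divisive_step(groups, threshold):
    """One pass of the procedure above: split the first group whose
    largest within-group distance exceeds the threshold."""
    for gi, G in enumerate(groups):
        if len(G) < 2:
            continue
        # Step 1: find the pair with the largest within-group distance
        dists = np.linalg.norm(G[:, None, :] - G[None, :, :], axis=-1)
        i, j = np.unravel_index(np.argmax(dists), dists.shape)
        if dists[i, j] <= threshold:          # not larger than threshold: no split
            continue
        # Step 2: use the selected pair as seed points and reassign the rest
        to_i = np.linalg.norm(G - G[i], axis=1) <= np.linalg.norm(G - G[j], axis=1)
        groups[gi] = G[to_i]
        groups.append(G[~to_i])
        return groups, True
    return groups, False

def divisive_clustering(points, threshold):
    groups, changed = [np.asarray(points, dtype=float)], True
    while changed:                             # repeat until no group can be split
        groups, changed = divisive_step(groups, threshold)
    return groups
```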
DIANA-Explored
In DIANA, a divisive hierarchical clustering method, all of the objects initially form one cluster.
The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
The cluster-splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.
Difficulties with Hierarchical clustering
It encounters difficulties regarding the selection of merge and split points.
Such a decision is critical, because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
It will not undo what was done previously.
Thus, split or merge decisions that are not well chosen at some step may lead to low-quality clusters.
Solutions to improve Hierarchical clustering
One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:
1. BIRCH
2. CURE
3. Chameleon
BIRCH: An Efficient Data Clustering Method for Very Large Databases
Paper by: Tian Zhang, Raghu Ramakrishnan, Miron Livny
Computer Sciences Dept., University of Wisconsin-Madison
zhang@cs.wisc.edu, raghu@cs.wisc.edu, miron@cs.wisc.edu
In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996.
Reference for paper
www2.informatik.hu-berlin.de/wm/mldm2004/zhang96birch.pdf
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
A hierarchical clustering method.
It introduces two concepts:
1. Clustering feature
2. Clustering feature tree (CF tree)
These structures help the clustering method achieve good speed and scalability in large databases.
Clustering Feature Definition
Given N d-dimensional data points {Oi}, i = 1, 2, …, N, in a cluster, the clustering feature is
CF = (N, LS, SS)
where N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
Clustering feature concepts
Each record (data object) is a tuple of attribute values and is here called a vector.
Given a database of such objects, we define Oi = (Vi1, …, Vid).
Definitions: Linear Sum and Square Sum
Linear sum:   LS = ∑_{i=1..N} Oi = ( ∑_{i=1..N} Vi1, ∑_{i=1..N} Vi2, …, ∑_{i=1..N} Vid )
Square sum:   SS = ∑_{i=1..N} Oi² = ( ∑_{i=1..N} Vi1², ∑_{i=1..N} Vi2², …, ∑_{i=1..N} Vid² )
Example of a case
Assume N = 5 and d = 2.
Linear sum:   LS = ∑_{i=1..5} Oi = ( ∑_{i=1..5} Vi1, ∑_{i=1..5} Vi2 )
Square sum:   SS = ( ∑_{i=1..5} Vi1², ∑_{i=1..5} Vi2² )
Example 2
Object  Attribute1  Attribute2
O1      3           4
O2      2           6
O3      4           5
O4      4           7
O5      3           8

Clustering feature: CF = (N, LS, SS)
N = 5
LS = (16, 30)
SS = (54, 190)
CF = (5, (16, 30), (54, 190))
[Figure: the five objects plotted on a 0-10 by 0-10 grid.]
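A small Python sketch (NumPy assumed) that reproduces the CF of this example from the five objects above:

```python
import numpy as np

def clustering_feature(X):
    """CF = (N, LS, SS) for a set of d-dimensional points X."""
    X = np.asarray(X, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

# The five objects from the example above
objects = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
N, LS, SS = clustering_feature(objects)
print(N, LS, SS)   # 5 [16. 30.] [ 54. 190.]
```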
CF-Tree
A CF-tree is a height-balanced tree with two parameters: the branching factor (B for nonleaf nodes and L for leaf nodes) and the threshold T.
Each entry in a nonleaf node has the form [CFi, childi].
Each entry in a leaf node is a CF; each leaf node also has two pointers, `prev' and `next'.
The CF tree is basically a tree used to store all the clustering features.
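A minimal Python sketch of the node layout just described (field names are illustrative; the insertion and splitting logic that maintains B, L and T is omitted):

```python
from dataclasses import dataclass, field

@dataclass
class CF:                    # clustering feature (N, LS, SS)
    n: int
    ls: tuple
    ss: tuple

@dataclass
class NonLeafNode:           # holds at most B entries of the form [CF_i, child_i]
    entries: list = field(default_factory=list)

@dataclass
class LeafNode:              # holds at most L CF entries, each within threshold T
    entries: list = field(default_factory=list)
    prev: "LeafNode | None" = None
    next: "LeafNode | None" = None
```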
CF Tree
[Figure: an example CF tree. The root holds entries CF1 … CF6, each with a child pointer; a nonleaf node holds entries CF1 … CF5 with child pointers; leaf nodes hold their CF entries and are chained together through `prev' and `next' pointers.]
BIRCH Clustering
Phase 1: Scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
Phase 2: Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.
BIRCH Algorithm Overview
Summary of BIRCH
Scales linearly: with a single scan you get good clustering, and the quality of the clustering improves with a few additional scans.
It handles noise (data points that are not part of the underlying pattern) effectively.
Density-Based Clustering Methods
Clustering based on density (density-connected points) instead of a distance metric.
A cluster is a set of "density-connected" points.
Major features:
Discovers clusters of arbitrary shape
Handles noise
Needs "density parameters" as a termination condition (i.e., when no new objects can be added to the cluster)
Examples:
DBSCAN (Ester et al., 1996)
OPTICS (Ankerst et al., 1999)
DENCLUE (Hinneburg & Keim, 1998)
Density-Based Clustering: Background
Eps-neighborhood: the neighborhood within a radius Eps of a given object.
MinPts: the minimum number of points in an Eps-neighborhood of that object.
Core object: if the Eps-neighborhood of an object contains at least a minimum number of points MinPts, then the object is a core object.
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
1) p is within the Eps-neighborhood of q, and
2) q is a core object.
[Figure: p is directly density-reachable from the core object q; MinPts = 5, Eps = 1.]
Figure showing density-reachability and density-connectivity in density-based clustering
M, P, O, R and S are core objects, since each has an Eps-neighborhood containing at least 3 points (MinPts = 3, Eps = the radius of the circles).
Directly density-reachable
Q is directly density-reachable from M. M is directly density-reachable from P, and vice versa.
Indirectly density-reachable
Q is indirectly density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.
Core, border, and noise points
DBSCAN is a density-based algorithm.
Density = number of points within a specified radius (Eps).
A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is neither a core point nor a border point.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
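A compact Python sketch of this procedure (NumPy assumed; it uses a naive O(n²) neighborhood search, whereas the original algorithm relies on a spatial index such as an R*-tree for the region queries):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Sketch of DBSCAN: label each point with a cluster id, or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):                   # Eps-neighborhood of point i (includes i)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = list(neighbors(p))
        if len(seeds) < min_pts:        # p is not a core point; it may later become a border point
            continue
        labels[p] = cluster_id          # start a new cluster from the core point p
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id  # q is density-reachable from p
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighbors(q)
                if len(q_neighbors) >= min_pts:   # q is also a core point: expand through it
                    seeds.extend(q_neighbors)
        cluster_id += 1
    return labels
```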
Conclusions
We discussed two hierarchical clustering methods: agglomerative and divisive.
We also discussed BIRCH, a hierarchical clustering method that produces good clustering with a single scan and better clustering with a few additional scans.
DBSCAN is a density-based clustering algorithm; through it we discover clusters of arbitrary shapes. Clusters are defined by density rather than by a distance metric, unlike the case of hierarchical methods.
GRID-BASED CLUSTERING METHODS
This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
So, for example, assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters there.
[Figure: our "space" is the plane of salary (×$10,000, from 0 to 8) versus age (from 20 to 60), on which the grid is formed.]
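As a small illustration of this quantization step, here is a Python sketch (the attribute bounds and the grid resolution below are made-up values, not from the slides):

```python
import numpy as np

def grid_cell(record, lower, upper, cells_per_dim):
    """Map a record (e.g. (age, salary)) to the index of its grid cell.
    `lower`/`upper` bound the space; `cells_per_dim` sets the grid resolution."""
    record = np.asarray(record, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    idx = np.floor((record - lower) / (upper - lower) * np.asarray(cells_per_dim)).astype(int)
    return tuple(int(i) for i in np.clip(idx, 0, np.asarray(cells_per_dim) - 1))

# hypothetical bounds: age in [20, 60], salary in [0, 80000], a 5 x 8 grid
print(grid_cell((35, 42000), (20, 0), (60, 80000), (5, 8)))   # -> (1, 4)
```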
Techniques for Grid-Based Clustering
The following are some techniques that are used to perform grid-based clustering:
CLIQUE (CLustering In QUEst)
STING (STatistical INformation Grid)
WaveCluster
Looking at CLIQUE as an Example
CLIQUE is used for clustering high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.
CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.
Definitions That Need to Be Known
Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
Dense: a unit is dense if the fraction of the total data points contained in the unit exceeds the input model parameter.
Cluster: a cluster is defined as a maximal set of connected dense units.
How Does CLIQUE Work?
Let us say that we have a set of records that we would like to cluster in terms of n attributes.
So, we are dealing with an n-dimensional space.
MAJOR STEPS:
CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals.
Using this as its basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
CLIQUE: Major Steps (Cont.)
Now CLIQUE's goal is to identify the dense n-dimensional units. It does this in the following way:
CLIQUE finds dense units of higher dimensionality by first finding the dense units in the subspaces.
For example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality can exist.
CLIQUE: Major Steps (Cont.)
Each maximal set of connected dense units is considered a cluster.
Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
The information from the subspaces is then used to find clusters in the n-dimensional space.
It must be noted that all cluster boundaries are either horizontal or vertical. This is due to the nature of the rectangular grid cells.
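A rough Python sketch of the first two passes, finding dense 1-D units and generating 2-D candidates in the Apriori style (NumPy assumed; the interval count and density threshold are input parameters, and the names are illustrative):

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units_1d(X, n_intervals, density_threshold):
    """For each dimension, return the intervals (units) whose fraction of
    points exceeds the density threshold."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    dense = set()
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        cells = np.minimum(((X[:, dim] - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                           n_intervals - 1)
        for cell, count in Counter(cells).items():
            if count / n > density_threshold:
                dense.add((dim, int(cell)))     # a dense 1-D unit: (dimension, interval)
    return dense

def candidate_2d_units(dense_1d):
    """Apriori-style candidate generation: a 2-D unit can only be dense if
    both of its 1-D projections are dense."""
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2) if u[0] != v[0]}
```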
Example for CLIQUE
Let us say that we want to cluster a set of records that have three attributes: salary, vacation and age.
The data space for this data would be 3-dimensional.
[Figure: a 3-dimensional space with axes salary, vacation and age.]
Example (Cont.)
After plotting the data objects, each dimension (i.e., salary, vacation and age) is split into intervals of equal length.
Then we form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle.
Now, our goal is to find the dense 3-D rectangular units.
Example (Cont.)
To do this, we find the dense units of the subspaces of this 3-D space.
So, we find the dense units with respect to age for salary. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
We also find the dense 2-D rectangular units for the vacation-age plane.
[Figure: the salary (×$10,000) vs. age plane and the vacation (weeks) vs. age plane, each gridded over age 20-60, with their dense 2-D units highlighted.]
Example (Cont.)
Now let us try to visualize the dense units of the two planes on the following 3-D figure:
[Figure: the dense 2-D units of the vacation-age plane (around vacation = 3) and of the salary-age plane (around age 30-50) shown on the 3-D salary-vacation-age space.]
Example (Cont.)
We can extend the dense areas in the vacation-age plane inwards.
We can extend the dense areas in the salary-age plane upwards.
The intersection of these two extensions gives us a candidate search space in which 3-dimensional dense units can exist.
We then find the dense units in the salary-vacation plane and form an extension of the subspace that represents these dense units.
Example (Cont.)
Now, we intersect the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-D dense units.
So, what was the main idea? We used the dense units in subspaces in order to find the dense units in the 3-dimensional space.
After finding the dense units, it is very easy to find the clusters.
Reflecting upon CLIQUE
Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
Because the Apriori property employs prior knowledge of the items in the search space, so that portions of the space can be pruned.
The property for CLIQUE says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
Strengths and Weaknesses of CLIQUE
Strengths:
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is quite efficient.
It is insensitive to the order of records in the input and does not presume any canonical data distribution.
It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
Weakness:
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
STING: A Statistical Information Grid Approach to Spatial Data Mining
Paper by: Wei Wang, Jiong Yang and Richard Muntz
Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A.
weiwang@cs.ucla.edu, jyang@cs.ucla.edu, muntz@cs.ucla.edu
VLDB Conference, Athens, Greece, 1997
Reference for paper
http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF
Definitions That Need to Be Known
Spatial data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.
Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.
STING (Introduction)
STING is used for performing clustering on spatial data.
STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.
STING's big benefit is that it processes many common "region-oriented" queries on a set of points efficiently.
We want to cluster the records that are in a spatial table in terms of location.
Placement of a record in a grid cell is completely determined by its physical location.
Hierarchical Structure of Each Grid Cell
The spatial area is divided into rectangular cells (using latitude and longitude).
The cells form a hierarchical structure: each cell at a higher level is further partitioned into 4 smaller cells at the lower level.
In other words, each cell at the ith level (except the leaves) has 4 children at the (i+1)th level.
The union of the 4 children cells gives back the parent cell at the level above them.
Hierarchical Structure of Cells (Cont.)
The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants.
So, why do we have a hierarchical structure for cells? We have it in order to provide better granularity, or higher resolution.
A Hierarchical Structure for STING Clustering
Statistical Parameters Stored in Each Cell
For each cell in each layer we have attribute-dependent and attribute-independent parameters.
Attribute-independent parameter:
Count: the number of records in this cell.
Attribute-dependent parameters:
(We are assuming that our attribute values are real numbers.)
Statistical Parameters (Cont.)
For each attribute of each cell we store the following parameters:
M: the mean of all values of the attribute in this cell.
S: the standard deviation of all values of the attribute in this cell.
Min: the minimum value of the attribute in this cell.
Max: the maximum value of the attribute in this cell.
Distribution: the type of distribution that the attribute values in this cell follow (e.g. normal, exponential, etc.). "None" is assigned to Distribution if the distribution is unknown.
Storing of Statistical Parameters
Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.
The statistical parameters for the cells in all the other levels are computed from their respective children cells in the lower level.
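A minimal Python sketch of that bottom-up computation for one attribute (the distribution parameter is omitted, and the standard deviations are assumed to be population standard deviations; field names are illustrative):

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:                 # per-attribute statistics of one grid cell
    count: int
    mean: float
    std: float
    min: float
    max: float

def merge_children(children):
    """Compute a parent cell's statistics from its (typically 4) children."""
    n = sum(c.count for c in children)
    if n == 0:
        return CellStats(0, 0.0, 0.0, float("inf"), float("-inf"))
    mean = sum(c.count * c.mean for c in children) / n
    # parent variance via E[x^2] reconstructed from the children's moments
    ex2 = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    return CellStats(n, mean, math.sqrt(max(ex2 - mean ** 2, 0.0)),
                     min(c.min for c in children), max(c.max for c in children))
```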
How are Queries Processed?
STING can answer many queries (especially region queries) efficiently, because we don't have to access the full database.
How are spatial data queries processed? We use a top-down approach:
Start from a pre-selected layer, typically one with a small number of cells. The pre-selected layer does not have to be the topmost layer.
For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
Query Processing (Cont.)
The confidence interval is calculated by using the statistical parameters of each cell.
Remove irrelevant cells from further consideration.
When finished with the current layer, proceed to the next lower level.
Processing of the next lower level examines only the remaining relevant cells.
Repeat this process until the bottom layer is reached.
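A schematic Python sketch of this top-down traversal (the `is_relevant` callback stands in for the confidence-interval test, and the cell objects are assumed to carry a `children` list; both are illustrative):

```python
def sting_query(start_layer, is_relevant):
    """Keep only the relevant cells at each layer and descend into their
    children until the bottom layer is reached."""
    cells = list(start_layer)
    while True:
        relevant = [c for c in cells if is_relevant(c)]
        children = [child for c in relevant for child in getattr(c, "children", [])]
        if not children:              # bottom layer reached (leaf cells have no children)
            return relevant
        cells = children              # examine only the children of relevant cells
```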
Different Grid Levels during Query Processing
Sample Query Examples
Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens.
Our records represent apartments that are present throughout the above region.
Query: "Find all the apartments that are for rent near Stony Brook University and have a rent in the range $800 to $1000."
The above query depends upon the parameter "near." For our example, near means within 15 miles of Stony Brook University.
Advantages and Disadvantages of STING
ADVANTAGES:
Very efficient.
The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
STING is a query-independent approach, since the statistical information exists independently of queries.
Incremental update.
DISADVANTAGES:
All cluster boundaries are either horizontal or vertical; no diagonal boundary is selected.
Thank you!