Clustering revision (Falguni Negandhi)

advertisement
CLUSTERING
Revision
Spring 2007
SJSU
Falguni Negandhi
Overview
Definition
 Main Features
 Types of Clustering
 Some Clustering Approaches
 Why we use Clustering
 Some methods of Clustering
 Conclusion
 References

Definition


Clustering is “the process of organizing objects into
groups whose members are similar in some way”.
A cluster is therefore a collection of objects which are
“similar” between them and are “dissimilar” to the
objects belonging to other clusters.
Clustering Main Features
Clustering – a data mining technique
 Usage:

Statistical Data Analysis
 Machine Learning
 Data Mining
 Pattern Recognition
 Image Analysis
 Bioinformatics

Types of Clustering

Hierarchical
 Finding

new clusters using previously found ones
Partitional
 Finding
all clusters at once
Self-Organizing Maps
 Hybrids (incremental)

Some Clustering Approaches
Clustering
Hierarchical
Agglomerative
Partitional
Divisive
Categorical
Sampling
Large DB
Compression
Why clustering?
A few good reasons ...
Simplifications
 Pattern detection
 Useful in data concept construction
 Unsupervised learning process

Where to use clustering?
Data mining
 Information retrieval
 text mining
 Web analysis
 marketing
 medical diagnostic

Major Existing clustering methods
Distance-based
 Hierarchical
 Partitioning
 Probabilistic

Distance based method
• In this case we easily identify the 4 clusters into which the data can be divided; the
similarity criterion is distance: two or more objects belong to the same cluster if they
are “close” according to a given distance. This is called distance-based clustering.
Hierarchical clustering
Agglomerative (bottom up)
1.
2.
3.
start with 1 point
(singleton)
recursively add two or
more appropriate
clusters
Stop when k number of
clusters is achieved.
Divisive (top dow)
1.
2.
3.
Start with a big cluster
Recursively divide into
smaller clusters
Stop when k number of
clusters is achieved.
Partitioning clustering
1.
2.
Divide data into proper subset
recursively go through each subset and
relocate points between clusters
(opposite to visit-once approach in
Hierarchical approach)
This recursive relocation= higher quality cluster
Steps of hierarchical clustering
Given a set of N items to be clustered, and an N*N
distance (or similarity) matrix, the basic process of
hierarchical clustering (defined by S.C. Johnson in 1967)
is this:



Start by assigning each item to a cluster, so that if you
have N items, you now have N clusters, each containing
just one item. Let the distances (similarities) between
the clusters the same as the distances (similarities)
between the items they contain.
Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that now you have one
cluster less.
Compute distances (similarities) between the new cluster
and each of the old clusters.
Repeat steps 2 and 3 until all items are clustered into K
number of clusters
Hierarchical algorithms
Agglomerative algorithms begin with each
element as a separate cluster and merge
them into successively larger clusters.
 Divisive algorithms begin with the whole
set and proceed to divide it into
successively smaller clusters.

Hierarchical agglomerative general
algorithm
1.
2.
3.
Find the 2 closest objects
and merge them into a
cluster
Find and merge the next two
closest points, where a point
is either an individual object
or a cluster of objects.
If more than one cluster
remains, return to step 2
Partitioning clustering
1.
2.
Divide data into proper subset
recursively go through each subset and
relocate points between clusters
(opposite to visit-once approach in
Hierarchical approach)
Probabilistic clustering
1.
2.
3.
Data are picked from mixture of
probability distribution.
Use the mean, variance of each
distribution as parameters for cluster
Single cluster membership
K-Clustering

K-clustering algorithm



Result: Given the input set S and a fixed integer k, a
partition of S into k subsets must be returned.
K-means clustering is the most common partitioning
algorithm.
K-Means re-assigns each record in the dataset to
only one of the new clusters formed. A record or data
point is assigned to the nearest cluster (the cluster
which it is most similar to) using a measure of
distance or similarity
K-clustering algo cont'd
1. Select k initial cluster centroids, c1, c2,
c3..., ck.
 2. Assign each instance x in S to the
cluster whose centroid is the nearest to x.
 3. For each cluster, re-compute its centroid
based on which elements are contained in.
 4. Go to (2) until convergence is achieved

K-Means Clustering



Separate the objects (data points) into K clusters.
Cluster center (centroid) = the average of all the data
points in the cluster.
Assigns each data point to the cluster whose centroid is
nearest (using distance function.)
K-Means Algorithm
1. Place K points into the space of the objects
being clustered. They represent the initial
group centroids.
2. Assign each object to the group that has the
closest centroid.
3. Recalculate the positions of the K centroids.
4. Repeat Steps 2 & 3 until the group centroids
no longer move.
K-Means
Algorithm:
Example
Output
Conclusion
very useful in data mining
 applicable for both text and graphical
based data
 Help simplify data complexity
 classification
 detect hidden pattern in data

References

Dr. M.H. Dunham - http://engr.smu.edu/~mhd/dmbook/part2.ppt.

Dr. Lee, Sin-Min – San Jose State University
Mu-Yu Lu, SJSU

Database System Concepts, Silberschatz, Korth, Sudarshan

Wikipedia:
http://en.wikipedia.org/wiki/Data_clustering#Types_of_clustering

Enrique Blanco Garcia
http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html
#hierarchy

Data Clustering - Wikipedia
http://en.wikipedia.org/wiki/Data_clustering
A Tutorial on Clustering Algorithms
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/
The K-Means Data Clustering Problem
http://people.scs.fsu.edu/~burkardt/f_src/kmeans/kmeans.html


Download