Clustering - anuradhasrinivas

Definition
 Finding groups of objects such that the objects in a
group are similar (or related) to one another and
different from (or unrelated to) the objects in other
groups.
 A good clustering therefore minimizes intra-cluster
distances and maximizes inter-cluster distances.
Applications
• Group related documents for browsing
• Group genes and proteins that have similar
functionality
• Group stocks with similar price fluctuations
• Reduce the size of large data sets
• Group users with similar buying behaviour
Clustering is ambiguous
 There is no correct or incorrect solution for clustering;
how many clusters a data set contains is itself ambiguous.
[Figure: the same set of points grouped into two, four,
and six clusters]
Challenges faced
 Scalability
 Ability to deal with different types of attributes
 Noise & Outliers
 Complex shapes and types of data
 Incremental clustering and insensitivity to the order of
input records
 High dimensionality
 Constraint-based clustering
 Interpretability and usability
Types of Data
 Data Matrix
 n objects with p variables.
 The structure is in the form of a relational table, an
n x p matrix.
 Dissimilarity Matrix
 Object-by-object structure. Stores a collection of
proximities that are available for all pairs of the n
objects.
 d(i, j) is the dissimilarity between objects i and j.
 d(i, j) = d(j, i) and d(i, i) = 0
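As a concrete illustration, here is a minimal NumPy sketch (not from the original slides; Euclidean distance is assumed as the proximity measure) that builds the n x n dissimilarity matrix from an n x p data matrix:

```python
import numpy as np

# Data matrix: n objects (rows) x p variables (columns).
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [4.0, 3.0],
              [5.0, 4.0]])

n = len(X)
D = np.zeros((n, n))   # dissimilarity matrix: symmetric, zero diagonal
for i in range(n):
    for j in range(i + 1, n):
        d = np.sqrt(((X[i] - X[j]) ** 2).sum())  # Euclidean distance
        D[i, j] = D[j, i] = d                    # d(i, j) = d(j, i)

print(np.round(D, 2))
```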
Types of Data
 Interval-Scaled Variables
 Binary Variables
 Nominal
 Ordinal
 Ratio-Scaled Variables
 Variables of Mixed Types
Interval-Scaled Variables
 Continuous measurements on a roughly linear scale,
e.g. weight, height, temperature.
 To avoid dependence on the measurement unit, the values
of a variable f are typically standardized first:
 mean absolute deviation:
s_f = (1/n)(|x_1f - m_f| + … + |x_nf - m_f|),
where m_f is the mean value of f
 z-score: z_if = (x_if - m_f) / s_f
 Dissimilarity is then computed with a distance measure
such as the Euclidean, Manhattan, or Minkowski distance.
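The standardization and distance measures can be sketched as follows (an illustrative example, not from the slides; the data values are invented):

```python
import numpy as np

def standardize(x):
    """Z-score a variable using the mean absolute deviation,
    which is less sensitive to outliers than the standard deviation."""
    m = x.mean()
    s = np.abs(x - m).mean()          # mean absolute deviation
    return (x - m) / s

def minkowski(a, b, q):
    """Minkowski distance; q = 1 is Manhattan, q = 2 is Euclidean."""
    return (np.abs(a - b) ** q).sum() ** (1.0 / q)

print(standardize(np.array([150.0, 160.0, 170.0, 180.0])))  # [-1.5 -0.5 0.5 1.5]
a, b = np.array([1.0, 1.0]), np.array([4.0, 3.0])
print(minkowski(a, b, 1))   # 5.0   (Manhattan)
print(minkowski(a, b, 2))   # ~3.61 (Euclidean)
```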
Binary variables
 A binary variable has only two states, 0 and 1.
 The dissimilarity between two objects described by binary
variables is computed from a 2 x 2 contingency table:

              OBJ j
              1      0      sum
  OBJ i  1    q      r      q+r
         0    s      t      s+t
         sum  q+s    r+t    p

 For symmetric binary variables:
d(i, j) = (r + s) / (q + r + s + t)
 For asymmetric binary variables (negative matches t are
ignored): d(i, j) = (r + s) / (q + r + s)
Dissimilarity between binary variables
 Example (Y and P are coded as 1, N as 0; the attributes
are asymmetric binary, and Gender is ignored):

  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack   M       Y      N      P       N       N       N
  Mary   F       Y      N      P       N       P       N
  Jim    M       Y      Y      N       N       N       N

 d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
 d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
 d(Mary, Jim) = (2 + 1) / (1 + 2 + 1) = 0.75
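These three values can be checked with a short sketch (my code, not the slides'; it hard-codes the Y/P -> 1, N -> 0 coding above):

```python
# Asymmetric binary dissimilarity for the patient table above.
records = {
    "Jack": "Y N P N N N".split(),
    "Mary": "Y N P N P N".split(),
    "Jim":  "Y Y N N N N".split(),
}

def to_bits(values):
    return [1 if v in ("Y", "P") else 0 for v in values]

def d_asym(a, b):
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))  # 1-1 matches
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))  # i = 1, j = 0
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))  # i = 0, j = 1
    return (r + s) / (q + r + s)   # 0-0 matches (t) carry no weight

for i, j in [("Jack", "Mary"), ("Jack", "Jim"), ("Mary", "Jim")]:
    d = d_asym(to_bits(records[i]), to_bits(records[j]))
    print(f"d({i}, {j}) = {d:.2f}")   # 0.33, 0.67, 0.75
```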
Categorical Variables
 A generalization of the binary variable: it can take more
than two states, e.g. colour = {red, green, blue}.
 Dissimilarity: d(i, j) = (p - m) / p, where m is the
number of variables on which objects i and j match and p
is the total number of variables.
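A one-function sketch of this simple-matching dissimilarity (with invented illustrative values):

```python
def d_nominal(a, b):
    """Simple-matching dissimilarity d(i, j) = (p - m) / p."""
    p = len(a)
    m = sum(x == y for x, y in zip(a, b))   # number of matching variables
    return (p - m) / p

print(d_nominal(["red", "round", "small"],
                ["red", "square", "small"]))   # 1 mismatch of 3 -> 0.33
```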
Other types of data
 Ordinal
 Similar to nominal variables, but the values are ordered
in some meaningful sequence,
e.g. the rank of an employee can be assistant, associate, full.
 Ratio-Scaled variables
 A positive measurement on a non-linear scale,
e.g. growth of a bacteria population, radioactivity.
 Both are usually mapped onto an interval scale before
computing distances (see the sketch below).
 Variables of Mixed Types
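A short sketch of those standard mappings (my illustration; the rank scheme z = (r - 1)/(M - 1) and the log transform are the usual textbook treatments):

```python
import math

# Ordinal: replace each value by its rank r in 1..M, then map it to
# [0, 1] via z = (r - 1) / (M - 1); afterwards treat it as interval-scaled.
ranks = {"assistant": 1, "associate": 2, "full": 3}
M = len(ranks)
print({k: (r - 1) / (M - 1) for k, r in ranks.items()})
# {'assistant': 0.0, 'associate': 0.5, 'full': 1.0}

# Ratio-scaled: apply a log transform, then treat the result as
# interval-scaled.
bacteria_counts = [100, 1000, 10000]
print([math.log10(x) for x in bacteria_counts])   # [2.0, 3.0, 4.0]
```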
Types of clustering
 Hierarchical clustering (BIRCH)
 A set of nested clusters organized as a hierarchical tree
 Partitional clustering (k-means, k-medoids)
 A division of the data objects into non-overlapping
(distinct) subsets (i.e., clusters) such that each data
object is in exactly one subset
 Density-based (DBSCAN)
 Based on density functions
 Grid-based (STING)
 Based on a multiple-level granularity structure
 Model-based (SOM)
 Hypothesize a model for each of the clusters and find
the best fit of the data to the given model
 Several of these families are illustrated in the sketch
below.
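Most of these families have off-the-shelf implementations; for instance, in scikit-learn (a sketch with assumed toy data; STING and SOM have no scikit-learn counterpart and are omitted):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch, DBSCAN, KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

print(KMeans(n_clusters=2, n_init=10).fit_predict(X))        # partitional
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))  # hierarchical
print(Birch(n_clusters=2).fit_predict(X))                    # hierarchical (CF-tree)
print(DBSCAN(eps=1.5, min_samples=2).fit_predict(X))         # density-based
```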
Partitional Clustering
[Figure: a set of original points and a partitional
clustering of them]
Hierarchical Clustering
[Figure: points p1-p4 grouped by a traditional hierarchical
clustering with its dendrogram, and by a non-traditional
hierarchical clustering with its dendrogram]
Clustering Algorithms
 Partitional
 K-means
 K-medoids
 Hierarchical
 Agglomerative
 Divisive
K-Means Algorithm
 Each cluster is represented by the mean value of the
objects in the cluster.
 Input: a set of n objects, the number of clusters k
 Output: a set of k clusters
 Algorithm (a sketch follows below)
 Randomly select k samples and mark them as the initial
cluster means
 Repeat
 Assign/reassign each sample to the cluster to which it
is most similar, based on the cluster means
 Update each cluster's mean
 Until no change
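A compact, self-contained sketch of this loop (my illustration; the random initialization and squared-Euclidean similarity are assumptions, as the slides leave the distance measure open):

```python
import random

def kmeans(points, k, seed=0):
    """Plain k-means for points given as tuples of numbers."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # initial cluster means
    while True:
        # Assignment step: each point joins the nearest mean's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[d.index(min(d))].append(p)
        # Update step: recompute means (an empty cluster keeps its old mean).
        new_means = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        if new_means == means:             # no change: converged
            return clusters, means
        means = new_means

print(kmeans([(1, 1), (2, 1), (4, 3), (5, 4)], k=2))
```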
K-Means (Array)
 Step 1: Randomly assign objects to k clusters
 Step 2: Find the mean of each cluster
 Step 3: Re-assign objects to the cluster with the
closest mean
 Step 4: Go to step 2; repeat until no change
Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22}. Assume k = 3.
 Solution:
 Randomly partition the given data set:
 K1 = {2, 8, 15}  mean = 8.33
 K2 = {3, 9, 18}  mean = 10
 K3 = {6, 12, 22}  mean = 13.33
 Reassign each object to the cluster with the nearest mean:
 K1 = {2, 3, 6, 8, 9}  mean = 5.6
 K2 = {}  mean = 0
 K3 = {12, 15, 18, 22}  mean = 16.75
 Reassign:
 K1 = {3, 6, 8, 9}  mean = 6.5
 K2 = {2}  mean = 2
 K3 = {12, 15, 18, 22}  mean = 16.75
 Reassign:
 K1 = {6, 8, 9}  mean = 7.67
 K2 = {2, 3}  mean = 2.5
 K3 = {12, 15, 18, 22}  mean = 16.75
 Reassign:
 K1 = {6, 8, 9}  mean = 7.67
 K2 = {2, 3}  mean = 2.5
 K3 = {12, 15, 18, 22}  mean = 16.75
 No change in the clusters, so STOP.
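The trace can be reproduced with a tiny 1-D k-means loop, starting from the same initial partition (a sketch for checking the hand computation; the empty-cluster mean of 0 mirrors the slide):

```python
data = [2, 3, 6, 8, 9, 12, 15, 18, 22]
clusters = [[2, 8, 15], [3, 9, 18], [6, 12, 22]]   # initial random partition

while True:
    means = [sum(c) / len(c) if c else 0.0 for c in clusters]
    new = [[] for _ in clusters]
    for x in data:                       # reassign to the nearest mean
        dists = [abs(x - m) for m in means]
        new[dists.index(min(dists))].append(x)
    if new == clusters:                  # no change: stop
        break
    clusters = new

print(clusters)   # [[6, 8, 9], [2, 3], [12, 15, 18, 22]]
```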
Example 2
Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}. Assume k = 2.
 Solution:
 K1 = {2, 3, 4, 10, 11, 12}
 K2 = {20, 25, 30}
Advantages
• K-means is relatively scalable and efficient in
processing large data sets
• The computational complexity of the algorithm is O(nkt),
where n is the total number of objects, k is the number
of clusters, and t is the number of iterations;
normally k << n and t << n
Disadvantages
• Can be applied only when the mean of a cluster is defined
• Users need to specify k in advance
• K-means is not suitable for discovering clusters with
non-convex shapes or clusters of very different sizes
• It is sensitive to noise and outlier data points, which
can strongly influence the mean value
K-Means (Graph)
 Step 1: Form k centroids, randomly
 Step 2: Calculate the distance between the centroids and
each object
 Use the Euclidean distance to determine the minimum
distance:
d(A, B) = √((x2 - x1)² + (y2 - y1)²)
 Step 3: Assign each object to the cluster of its
minimum-distance centroid
 Step 4: Calculate the centroid of each cluster as
C = ((x1 + x2 + … + xn) / n , (y1 + y2 + … + yn) / n)
 Go to step 2; repeat until there is no change in the
centroids.
Example 1
 There are four types of medicines, each with two
attributes, as shown below. Find a way to group them
into 2 groups based on their features.

  Medicine  Weight  pH
  A         1       1
  B         2       1
  C         4       3
  D         5       4
Solution
 Plot the values on a graph.
 Mark any k centroids; here c1 = A(1, 1) and c2 = B(2, 1).
 Calculate the Euclidean distance of each point from the
centroids:

        A     B     C      D
  D = [ 0     1     3.61   5    ]   <- distances from c1
      [ 1     0     2.83   4.24 ]   <- distances from c2

 Based on minimum distance, we assign points to clusters:
 K1 = {A}
 K2 = {B, C, D}
 Calculate the new centroids:
 c1 = (1, 1)
 c2 = ((2 + 4 + 5) / 3 , (1 + 3 + 4) / 3) = (11/3, 8/3)
 Mark the new centroids and continue the iteration until
there is no change in the centroids or clusters.
Final solution
 The iterations converge to:
 K1 = {A, B}  centroid (1.5, 1)
 K2 = {C, D}  centroid (4.5, 3.5)
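The whole example can be replayed with a short script (a sketch; it starts from the same initial centroids c1 = A and c2 = B as the worked solution above):

```python
# K-means on the medicine data, initial centroids c1 = A, c2 = B.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [points["A"], points["B"]]

while True:
    clusters = [[] for _ in centroids]
    for name, (x, y) in points.items():   # assign to nearest centroid
        d = [((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for cx, cy in centroids]
        clusters[d.index(min(d))].append(name)
    new = [                                # recompute each centroid
        (sum(points[n][0] for n in cl) / len(cl),
         sum(points[n][1] for n in cl) / len(cl))
        for cl in clusters
    ]
    if new == centroids:                   # centroids unchanged: done
        break
    centroids = new

print(clusters, centroids)
# [['A', 'B'], ['C', 'D']] [(1.5, 1.0), (4.5, 3.5)]
```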
Example 2
 Use the K-means algorithm to create two clusters.
[Data given in a figure not reproduced here]
Example 3
 Group the given points into 3 clusters.
[Points given in a figure not reproduced here]