
Machine Learning
Clustering
Untrained Methods
• So far we have analyzed TRAINED machine learning methods: given a set of categories and some labelled training examples, the learner acquires a categorization MODEL that can be applied to classify unseen examples.
• The last 3 methods cover UNTRAINED machine learning: no labeled examples are available.
Clustering: the task
• Partition unlabeled examples into disjoint subsets, or clusters, such that:
– Examples within a cluster are “very similar”
– Examples in different clusters are “very different”
• Discover new categories in an unsupervised manner (no sample category labels provided).
• Notice that clusters define a category extensionally (by enumeration) rather than intensionally (through some label or general property of the category).
• Automated labeling of clusters is a sub-problem of clustering.
Clustering Example
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
Applications
• Understanding
– Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization
– Reduce the size of large data sets
[Figure: clustering precipitation in Australia]
Notion of a Cluster Can Be Ambiguous
[Figure: the same set of points grouped into two, four, or six clusters. How many clusters?]
Clustering types
• Hierarchical clustering
– Clusters are created iteratively, using the clusters created in the previous step
• Partitional clustering
– A single partition is created, minimizing some cost function
Partitional Clustering
[Figure: original points and a partitional clustering of them]
Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1–p4 and the corresponding dendrogram]
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples, e.g.:
– animal: vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)
• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples into clusters immediately.
Direct Clustering Method
• Direct clustering methods require a
specification of the number of clusters, k,
desired.
• A clustering evaluation function assigns a
real-value quality measure to a clustering.
• The number of clusters can be determined
automatically by explicitly generating
clusterings for multiple values of k and
choosing the best result according to a
clustering evaluation function.
Hierarchical Agglomerative Clustering
(HAC)
• Assumes a similarity function for determining
the similarity of two instances.
• Starts with all instances in a separate cluster
and then repeatedly joins the two clusters that
are most similar until there is only one cluster.
• The history of merging forms a binary tree or
hierarchy.
HAC Algorithm

Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.

So the main issue is the similarity function.
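Below is a minimal sketch of this HAC loop in Python, assuming an instance-level similarity function `sim(x, y)` and a cluster-level similarity `cluster_sim` (such as one of the linkage criteria on the next slide); the names are illustrative, not part of the slides.

```python
# A minimal HAC sketch, assuming an instance similarity sim(x, y) and a
# cluster-level similarity cluster_sim(ci, cj, sim); names are illustrative.
def hac(instances, sim, cluster_sim):
    # Start with every instance in its own cluster.
    clusters = [[x] for x in instances]
    merges = []  # merge history; this forms the binary tree / hierarchy
    while len(clusters) > 1:
        # Find the pair of current clusters that are most similar.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j], sim)
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        # Replace ci and cj with the single cluster ci ∪ cj.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```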
Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x,y).
– Cosine similarity of document vectors.
• How to compute the similarity of two clusters, each possibly containing multiple instances? (A sketch of these criteria follows below.)
– Single Link: similarity of the two most similar members.
– Complete Link: similarity of the two least similar members.
– Group Average: average similarity between members.
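As a hedged illustration of the three criteria, here is one way they might be written given an instance similarity `sim(x, y)` (e.g. cosine similarity); these helpers pair with the `hac` sketch above.

```python
# Illustrative linkage criteria; sim(x, y) is an instance-level similarity.
def single_link(ci, cj, sim):
    # Similarity of the two MOST similar members.
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj, sim):
    # Similarity of the two LEAST similar members.
    return min(sim(x, y) for x in ci for y in cj)

def group_average(ci, cj, sim):
    # Average similarity over all ordered pairs of distinct members of the
    # merged cluster ci ∪ cj (the definition used on the Group Average slide).
    merged = list(ci) + list(cj)
    n = len(merged)
    total = sum(sim(merged[a], merged[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

# Example use with the hac() sketch above: hac(docs, cosine_sim, single_link)
```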
Example
[Figure]
Single Link Agglomerative Clustering
• Use the maximum similarity of pairs:

$$\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$$

• Can result in “straggly” (long and thin) clusters due to the chaining effect.
– Appropriate in some domains, such as clustering islands.
Single Link Example
[Figure]
Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:

$$\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$$

• Makes “tighter,” more spherical clusters that are typically preferable.
Complete Link Example
[Figure]
Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
• In each of the subsequent n − 2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.
Computing Cluster Similarity
• After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
– Single Link:

$$\mathrm{sim}((c_i \cup c_j), c_k) = \max(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$$

– Complete Link:

$$\mathrm{sim}((c_i \cup c_j), c_k) = \min(\mathrm{sim}(c_i, c_k),\, \mathrm{sim}(c_j, c_k))$$
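A minimal sketch of this constant-time update, assuming pairwise cluster similarities are cached in a dictionary keyed by frozensets of cluster ids (the cache name and layout are assumptions, not part of the slides):

```python
# Constant-time single/complete link update after merging ci and cj.
def update_after_merge(cluster_sims, merged_id, ci, cj, other_ids, linkage="single"):
    # cluster_sims: dict mapping frozenset({id_a, id_b}) -> similarity.
    combine = max if linkage == "single" else min
    for ck in other_ids:
        cluster_sims[frozenset((merged_id, ck))] = combine(
            cluster_sims[frozenset((ci, ck))],
            cluster_sims[frozenset((cj, ck))],
        )
```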
Group Average Agglomerative Clustering
• Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:

$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{\substack{y \in (c_i \cup c_j) \\ y \neq x}} \mathrm{sim}(x, y)$$

• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters, to encourage tight clusters.
Computing Group Average Similarity
• Assume cosine similarity and vectors normalized to unit length.
• Always maintain the sum of vectors in each cluster:

$$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$$

• Compute the similarity of two clusters in constant time:

$$\mathrm{sim}(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$$
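A small sketch of this constant-time computation, assuming unit-length instance vectors and that each cluster's vector sum and size are maintained incrementally:

```python
import numpy as np

def group_average_sim(sum_i, size_i, sum_j, size_j):
    # sum_i, sum_j: maintained vector sums s(ci), s(cj) of unit-length vectors.
    s = sum_i + sum_j
    n = size_i + size_j
    # s·s - n equals the sum of cosine similarities over all ordered pairs of
    # distinct vectors in the merged cluster (each x·x = 1 is subtracted out).
    return (np.dot(s, s) - n) / (n * (n - 1))
```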
Non-Hierarchical Clustering
• Typically must provide the number of desired
clusters, k.
• Randomly choose k instances as seeds, one per
cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to
different clusters to improve the overall clustering.
• Stop when clustering converges or after a fixed
number of iterations.
K-Means
• Assumes instances are real-valued vectors.
• Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):

$$\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$

• Reassignment of instances to clusters is based on distance to the current cluster centroids.
Distance Metrics
• Euclidean distance (L2 norm):

$$L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

• L1 norm:

$$L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$$

• Cosine similarity (transform to a distance by subtracting from 1):

$$1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| \cdot |\vec{y}|}$$
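Illustrative implementations of these three measures (a sketch, using NumPy):

```python
import numpy as np

def l2_distance(x, y):
    # Euclidean (L2) distance.
    return np.sqrt(np.sum((x - y) ** 2))

def l1_distance(x, y):
    # L1 (Manhattan) distance.
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # 1 - cosine similarity, so identical directions give distance 0.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```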
K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)
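A minimal sketch of this loop (random seeds, reassignment by Euclidean distance, centroid update); `X` is an (n, m) array of instances and the function names are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    rng = np.random.default_rng(rng)
    # Select k random instances as seeds.
    seeds = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each instance to the cluster with the nearest seed.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each seed to the centroid of its cluster
        # (empty clusters keep their old seed).
        new_seeds = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
            for j in range(k)
        ])
        if np.allclose(new_seeds, seeds):
            break  # converged
        seeds = new_seeds
    return labels, seeds
```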
K-Means Example (k = 2)
[Figure: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged]
Time Complexity
• Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, or O(knm).
• Computing centroids: each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors, assuming a fixed number of iterations, so more efficient than O(n²) HAC.
K-Means Objective
• The objective of k-means is to minimize the total sum of squared distances of every point to its corresponding cluster centroid:

$$\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$$

• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to converge to a local optimum.
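For reference, the objective can be computed directly from an assignment, e.g. from the `labels, seeds` pair returned by the `kmeans` sketch above (an illustration, not part of the slides):

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    # Total squared distance of every point to its assigned centroid.
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))
```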
Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic or the results of another method.
• K-means++: an improved seed-selection procedure (David Arthur and Sergei Vassilvitskii, 2007).
• With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive with the optimal k-means solution.
K-means++
1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
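A minimal sketch of this seeding procedure; `X` is an (n, m) array of data points and the helper name is an assumption:

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    rng = np.random.default_rng(rng)
    # Step 1: choose the first center uniformly at random.
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # Step 2: D(x) = distance from each point to its nearest chosen center.
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        # Step 3: sample the next center with probability proportional to D(x)^2.
        probs = d ** 2 / np.sum(d ** 2)
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# The returned centers can then seed standard k-means, e.g. the kmeans() sketch above.
```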
Soft Clustering
• Clustering typically assumes that each instance is
given a “hard” assignment to exactly one cluster.
• Does not allow uncertainty in class membership
or for an instance to belong to more than one
cluster.
• Soft clustering gives probabilities that an instance
belongs to each of a set of clusters.
• Each instance is assigned a probability distribution
across a set of discovered categories (probabilities
of all categories must sum to 1).
Learning from Probabilistically Labeled Data
• Instead of training data labeled with “hard”
category labels, training data is labeled with “soft”
probabilistic category labels.
• When estimating model parameters  from training
data, weight counts by the corresponding
probability of the given category label.
• For example, if P(c1 | E) = 0.8 and P(c2 | E) = 0.2,
each item xj in E contributes only 0.8 towards the
counts n1 and n1j, and 0.2 towards the counts n2 and
n2j .
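As a small illustration of weighting counts by label probabilities (the data layout here is an assumption): each example contributes to each category's counts in proportion to that category's probability.

```python
from collections import defaultdict

def accumulate_counts(examples, counts, totals):
    # examples: list of (items, category_probs), e.g.
    #   (["x1", "x2"], {"c1": 0.8, "c2": 0.2})
    for items, category_probs in examples:
        for category, p in category_probs.items():
            totals[category] += p * len(items)   # fractional category count
            for item in items:
                counts[category][item] += p      # fractional per-item count

counts = defaultdict(lambda: defaultdict(float))
totals = defaultdict(float)
accumulate_counts([(["x1", "x2"], {"c1": 0.8, "c2": 0.2})], counts, totals)
```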
Expectation Maximization (EM)
• Probabilistic method for soft clustering.
• Direct method that assumes k clusters: {c1, c2, …, ck}.
• Soft version of k-means.
• Assumes a probabilistic model of categories that allows computing P(ci | E) for each category ci, for a given example E.
EM Algorithm
• Iterative method for learning probabilistic
categorization model from unsupervised data.
• Initially assume random assignment of examples
to categories.
• Learn an initial probabilistic model by estimating
model parameters  from this randomly labeled data.
• Iterate following two steps until convergence:
– Expectation (E-step): Compute P(ci | E) for each example
given the current model, and probabilistically re-label the
examples based on these posterior probability estimates.
– Maximization (M-step): Re-estimate the model
parameters, , from the probabilistically re-labeled data.
EM
Initialize: assign random probabilistic labels to the unlabeled data.
[Figure: unlabeled examples, each with a soft +/− label]
EM
Initialize: give the soft-labeled training data to a probabilistic learner.
[Figure: soft-labeled examples are fed to the probabilistic learner]
EM
Initialize: produce a probabilistic classifier.
[Figure: the probabilistic learner outputs a probabilistic classifier]
EM
E step: relabel the unlabeled data using the trained classifier.
[Figure: the probabilistic classifier assigns new soft labels to the unlabeled examples]
EM
M step: retrain the classifier on the relabeled data.
[Figure: the probabilistic learner is retrained on the relabeled examples]
Continue EM iterations until the probabilistic labels on the unlabeled data converge.
A 2-Cluster Example
• In this example, it is assumed that there are two clusters and that the attribute-value distribution in each cluster is a normal distribution, N(μ1, σ1²) and N(μ2, σ2²) (μ, σ = mean and standard deviation).
• Assume that we have the following 51 samples, and it is believed that these samples are drawn from two normal distributions:

51, 43, 62, 64, 45, 42, 46, 45, 45, 62, 47, 52, 64, 51, 65, 48, 49, 46, 64, 51, 52, 62, 49, 48, 62, 43, 40, 48, 64, 51, 63, 43, 65, 66, 65, 46, 39, 62, 64, 52, 63, 64, 48, 64, 48, 51, 48, 64, 42, 48, 41

• Examples have just one attribute, whose values are shown here (e.g. age at myocardial infarction). We suspect the ages vary between countryside and city, and want to verify this.
Operation of the EM Algorithm
• The EM algorithm is used to figure out the parameters of the model.
• Let {s1, s2, …, sn} denote the set of (51) samples.
• In this example, we need to figure out the following 5 parameters: μ1, σ1, μ2, σ2, P(C1).
Operation of the EM Algorithm (2)
• For a general 1-dimensional case with k clusters, we need to figure out a total of 2k + (k − 1) parameters.
• The EM algorithm begins with an initial guess of the parameter values.
Example (2)
• For our example, we initially assume that P(C1) = 0.5. Accordingly, we divide the dataset into two subsets (at random):
– {51(×2.5), 52(×3), 62(×5), 63(×2), 64(×8), 65(×3), 66(×1), 41(×1)};
– {51(×2.5), 49(×2), 48(×7), 47(×1), 46(×3), 45(×3), 43(×3), 42(×2), 40(×1), 39(×1)};
– μ1 = 47.63; σ1 = 3.79;
– μ2 = 58.53; σ2 = 5.57.
(×n = number of occurrences of a value assigned to a cluster; the occurrence counts in each subset sum to 25.5.)
Example (3)
• E step: The probabilities that sample si belongs to the two clusters are then recomputed as follows (remember the hypothesis of normal distributions):

$$P(C_1 \mid x_i) = \frac{P(x_i \mid C_1)\, P(C_1)}{P(x_i)} = \frac{P(C_1)}{P(x_i)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_1} \, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}}$$

$$P(C_2 \mid x_i) = \frac{P(x_i \mid C_2)\, P(C_2)}{P(x_i)} = \frac{P(C_2)}{P(x_i)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} \, e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}}$$

$$p_i = \frac{P(C_1 \mid x_i)}{P(C_1 \mid x_i) + P(C_2 \mid x_i)},$$

where xi is the attribute value of sample si.
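A minimal sketch of this E step for the 1-D two-Gaussian example (function names are illustrative):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def responsibility_c1(x, mu1, sigma1, mu2, sigma2, p_c1=0.5):
    # Unnormalized posteriors; P(x_i) cancels when taking the ratio.
    a = p_c1 * normal_pdf(x, mu1, sigma1)
    b = (1 - p_c1) * normal_pdf(x, mu2, sigma2)
    return a / (a + b)

# With the initial estimates from Example (2):
# responsibility_c1(52, 47.63, 3.79, 58.53, 5.57) ≈ 0.600, matching Example (4).
```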
Example (4)
• For example, for an object with value 52 (3 occurrences), we have:

$$P(C_1 \mid 52) = \frac{0.5}{P(x_{52})} \cdot \frac{1}{\sqrt{2\pi}\,(3.79)} \, e^{-\frac{(52 - 47.63)^2}{2(3.79)^2}} = \frac{0.5}{P(x_{52})\,\sqrt{2\pi}} \times 0.1357$$

$$P(C_2 \mid 52) = \frac{0.5}{P(x_{52})} \cdot \frac{1}{\sqrt{2\pi}\,(5.57)} \, e^{-\frac{(52 - 58.53)^2}{2(5.57)^2}} = \frac{0.5}{P(x_{52})\,\sqrt{2\pi}} \times 0.0903$$

$$p_i = \frac{P(C_1 \mid x_{52})}{P(C_1 \mid x_{52}) + P(C_2 \mid x_{52})} = \frac{0.1357}{0.1357 + 0.0903} = 0.600$$
Example (5)
• M step: The new estimated values of the parameters are computed as follows:

$$\mu_1 = \frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} p_i}; \qquad \mu_2 = \frac{\sum_{i=1}^{n} (1 - p_i)\, x_i}{\sum_{i=1}^{n} (1 - p_i)}$$

$$\sigma_1^2 = \frac{\sum_{i=1}^{n} p_i (x_i - \mu_1)^2}{\sum_{i=1}^{n} p_i}; \qquad \sigma_2^2 = \frac{\sum_{i=1}^{n} (1 - p_i)(x_i - \mu_2)^2}{\sum_{i=1}^{n} (1 - p_i)}$$

$$P(C_1) = \frac{\sum_{i=1}^{n} p_i}{n}$$
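A minimal sketch of this M step, given the sample values `xs` and the responsibilities `ps` from the E step:

```python
def m_step(xs, ps):
    n = len(xs)
    w1 = sum(ps)        # total weight assigned to C1
    w2 = n - w1         # total weight assigned to C2
    mu1 = sum(p * x for p, x in zip(ps, xs)) / w1
    mu2 = sum((1 - p) * x for p, x in zip(ps, xs)) / w2
    var1 = sum(p * (x - mu1) ** 2 for p, x in zip(ps, xs)) / w1
    var2 = sum((1 - p) * (x - mu2) ** 2 for p, x in zip(ps, xs)) / w2
    p_c1 = w1 / n
    return mu1, var1 ** 0.5, mu2, var2 ** 0.5, p_c1
```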
Example (6)
• The process is repeated until the clustering results converge.
• Generally, we attempt to maximize the following likelihood function:

$$\prod_{i} \Big[ P(C_1)\, P(x_i \mid C_1) + P(C_2)\, P(x_i \mid C_2) \Big]$$

(Lecture notes of Dr. Yen-Jen Oyang)
Example (7)
• Once we have figured out the approximate parameter values, we assign sample si to C1 if

$$p_i = \frac{P(C_1 \mid x_i)}{P(C_1 \mid x_i) + P(C_2 \mid x_i)} \geq 0.5,$$

where

$$P(C_1 \mid x_i) = \frac{P(x_i \mid C_1)\, P(C_1)}{P(x_i)} = \frac{P(C_1)}{P(x_i)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_1} \, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}}$$

$$P(C_2 \mid x_i) = \frac{P(x_i \mid C_2)\, P(C_2)}{P(x_i)} = \frac{P(C_2)}{P(x_i)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} \, e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}}$$

• Otherwise, si is assigned to C2.
Semi-Supervised Learning
• For supervised categorization, generating labeled
training data is expensive.
• Idea: Use unlabeled data to aid supervised
categorization.
• Use EM in a semi-supervised mode by training
EM on both labeled and unlabeled data.
– Train initial probabilistic model on user-labeled subset
of data instead of randomly labeled unsupervised data.
– Labels of user-labeled examples are “frozen” and never
relabeled during EM iterations.
– Labels of unsupervised data are constantly
probabilistically relabeled by EM.
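A minimal sketch of this semi-supervised EM loop, assuming a probabilistic learner object with hypothetical `fit(items, soft_labels)` and `predict_proba(items)` methods (the interface is an assumption, not a specific library API):

```python
def semi_supervised_em(learner, labeled, labels, unlabeled, n_iters=20):
    categories = sorted(set(labels))
    # User-supplied labels are "frozen" as one-hot probability vectors and
    # never changed during the iterations.
    frozen = [{c: 1.0 if c == y else 0.0 for c in categories} for y in labels]
    # Train the initial model on the user-labeled subset only.
    learner.fit(labeled, frozen)
    for _ in range(n_iters):
        # E step: probabilistically relabel only the unlabeled data
        # (assumed to return one probability dict per example).
        soft = learner.predict_proba(unlabeled)
        # M step: retrain on frozen labeled data plus soft-labeled data.
        learner.fit(labeled + unlabeled, frozen + soft)
    return learner
```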
Semi-Supervised EM
[Figure sequence: a small set of user-labeled training examples trains the probabilistic learner; the resulting probabilistic classifier soft-labels the unlabeled examples; the learner is retrained on both the labeled and the soft-labeled data; the cycle repeats.]
Continue retraining iterations until the probabilistic labels on the unlabeled data converge.
Issues in Unsupervised Learning
• How to evaluate clustering?
– Internal:
  • Tightness and separation of clusters (e.g. the k-means objective)
  • Fit of a probabilistic model to the data
– External:
  • Compare to known class labels on benchmark data
• Improving search to converge faster and avoid local minima.
• Overlapping clustering.
• Ensemble clustering.
• Clustering structured relational data.
• Semi-supervised methods other than EM:
– Co-training
– Transductive SVMs
– Semi-supervised clustering (must-link, cannot-link constraints)
Conclusions
• Unsupervised learning induces categories
from unlabeled data.
• There are a variety of approaches, including:
– HAC
– k-means
– EM
• Semi-supervised learning uses both labeled
and unlabeled data to improve results.