CS315-L17-Clustering

Basic Machine Learning:
Clustering
CS 315 – Web Search and Data Mining
Supervised vs. Unsupervised Learning
Two Fundamental Methods in Machine Learning
Supervised Learning ("learn from my example")
- Goal: a program that performs a task as well as humans.
- TASK – well defined (the target function)
- EXPERIENCE – training data provided by a human
- PERFORMANCE – error/accuracy on the task

Unsupervised Learning ("see what you can find")
- Goal: to find some kind of structure in the data.
- TASK – vaguely defined
- No EXPERIENCE
- No PERFORMANCE (but there are some evaluation metrics)
What is Clustering?
The most common form of Unsupervised Learning
Clustering is the process of
grouping a set of physical or abstract objects
into classes (“clusters”) of similar objects
It can be used in IR:
- To improve recall in search
- For better navigation of search results

[Figure: documents plotted as points in a 2-D vector space, axes 0 to 1.2]
Ex1: Cluster to Improve Recall
Cluster hypothesis:
Documents with similar text are related
Thus, when a query matches a document D,
also return other documents in the cluster containing D.
Ex2: Cluster for Better Navigation
Clustering Characteristics
Flat Clustering vs Hierarchical Clustering
- Flat: just dividing objects into groups (clusters)
- Hierarchical: organize clusters in a hierarchy

Evaluating Clustering
- Internal criteria
  - The intra-cluster similarity is high (tightness)
  - The inter-cluster similarity is low (separateness)
- External criteria
  - Did we discover the hidden classes?
    (we need gold-standard data for this evaluation)
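The external criterion can be made concrete with a small metric. One common choice (not named on the slide, so treat this as an illustrative sketch) is purity: each cluster votes for the majority gold class among its members, and purity is the fraction of points matching their cluster's vote.

```python
from collections import Counter

def purity(assignments, gold):
    """Purity: each cluster votes for its majority gold class; purity is
    the fraction of points that match their cluster's vote.
    `assignments` maps point id -> cluster id, `gold` maps point id -> class."""
    by_cluster = {}
    for point, cluster in assignments.items():
        by_cluster.setdefault(cluster, []).append(gold[point])
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in by_cluster.values())
    return majority_total / len(gold)

# toy example: two clusters over six points with gold classes A/B
assignments = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
gold = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B", 5: "A"}
print(purity(assignments, gold))  # 4 of 6 points match their cluster's majority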
Clustering for Web IR
Representation for clustering


- Document representation
- Need a notion of similarity/distance

How many clusters?
- Fixed a priori?
- Completely data-driven?
- Avoid "trivial" clusters – too large or too small
Recall: Documents as vectors
Each doc j is a vector of tf.idf values,
one component for each term.

Can normalize to unit length:

$$\hat{d}_j = \frac{\vec{d}_j}{|\vec{d}_j|}, \qquad |\vec{d}_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2}, \qquad \text{where } w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$
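The weighting and normalization step can be sketched in a few lines of Python (the function name and the toy counts below are made up for illustration):

```python
import math

def unit_tfidf(tf, df, n_docs):
    """w_{i,j} = tf_{i,j} * idf_i with idf_i = log(N / df_i),
    then scale the vector to unit length (L2 norm).
    `tf` maps term -> count in this doc; `df` maps term -> document frequency."""
    w = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}

# hypothetical document in a 1000-document collection
vec = unit_tfidf({"jaguar": 3, "car": 1}, {"jaguar": 50, "car": 400}, 1000)
print(sum(v * v for v in vec.values()))  # squared norm is 1.0 after scaling
```

Note how the rarer term ("jaguar", df = 50) ends up with a much larger weight than the common one ("car", df = 400), which is exactly the effect idf is meant to produce.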
Vector space
- terms are axes (aka features)
- N docs live in this space
- even with stemming, may have 20,000+ dimensions

What makes documents related?
Intuition for relatedness
[Figure: documents D1–D4 plotted against term axes t1 and t2]

Documents that are "close together"
in vector space talk about the same things.
What makes documents related?
Ideal: semantic similarity.
Practical: statistical similarity
- We will use cosine similarity.
- We will describe algorithms in terms of cosine similarity.

Cosine similarity of normalized $\vec{d}_j$, $\vec{d}_k$:

$$sim(\vec{d}_j, \vec{d}_k) = \sum_{i=1}^{n} w_{i,j} \cdot w_{i,k}$$

This is known as the "normalized inner product".
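For unit-length vectors the cosine is just a dot product over shared terms. A minimal sparse-vector version (the dict representation is my own choice, not the slides'):

```python
def cosine_sim(dj, dk):
    """Normalized inner product: sum_i w_{i,j} * w_{i,k}.
    Vectors are sparse dicts (term -> weight), assumed unit length."""
    if len(dj) > len(dk):   # iterate over the shorter vector
        dj, dk = dk, dj
    return sum(w * dk.get(term, 0.0) for term, w in dj.items())

# two already-normalized toy vectors
d1 = {"jaguar": 0.6, "car": 0.8}
d2 = {"jaguar": 0.8, "zoo": 0.6}
print(cosine_sim(d1, d2))  # only "jaguar" is shared: 0.6 * 0.8 = 0.48
```

Missing terms contribute zero, so only the overlap of the two vocabularies matters, which keeps the computation cheap for sparse document vectors.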
Clustering Algorithms
Hierarchical algorithms
- Bottom-up, agglomerative clustering

Partitioning ("flat") algorithms
- Usually start with a random (partial) partitioning
- Refine it iteratively

The famous k-means partitioning algorithm:
- Given: a set of n documents and the number k
- Compute: a partition of k clusters that optimizes the chosen partitioning criterion
K-means
Assumes documents are real-valued vectors.
Clusters are based on the centroids of the points in a cluster c
(= the center of gravity or mean):

$$\mu(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$$

Reassignment of instances to clusters is based on distance
to the current cluster centroids.
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, ..., sk} as seeds.
Until clustering converges (or another stopping criterion is met):
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster:)
  For each cluster cj:
    sj = μ(cj)
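The loop above can be rendered in plain Python (squared Euclidean distance on dense vectors; the function name and toy points are mine):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means: pick k random instances as seeds, then alternate
    (1) assigning each point to its nearest centroid and
    (2) moving each centroid to the mean of its cluster,
    until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:   # converged: no reassignments
            break
        assignment = new_assignment
        for j in range(k):                 # recompute centroid mu(c_j)
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# two well-separated toy clusters in the plane
pts = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]]
assignment, centroids = kmeans(pts, 2)
```

On well-separated data like this the split is recovered regardless of which two instances the seeding picks; on harder data the result depends on the seeds, as the next slide illustrates.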
K-means: Different Issues
When to stop?
- When a fixed number of iterations is reached
- When centroid positions do not change

Seed choice
- Results can vary based on random seed selection.
- Try out multiple starting points.

[Figure: six points A–F illustrating sensitivity to seeds: starting with
centroids B and E converges to one clustering; starting with centroids
D and F converges to a different one.]
Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram)
from a set of unlabeled examples.
Example taxonomy:
  animal
  ├─ vertebrate: fish, reptile, amphibian, mammal
  └─ invertebrate: worm, insect, crustacean
Hierarchical Agglomerative Clustering
We assume there is a similarity function that determines
the similarity of two instances.

Algorithm:
  Start with all instances in their own cluster.
  Until there is only one cluster:
    Among the current clusters, determine the two
    clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
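A direct (and O(n³)) rendering of this loop, with the cluster-level similarity passed in as a function; single-link is shown as one instance (all names here are illustrative):

```python
def hac(items, pair_sim, cluster_sim):
    """Agglomerative clustering: start with singletons, repeatedly merge
    the two most similar clusters until one remains. Returns the merge
    history as a list of (cluster, cluster) pairs, most similar first."""
    clusters = [[x] for x in items]
    history = []
    while len(clusters) > 1:
        # find the most similar pair of current clusters
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]],
                                              pair_sim))
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

def single_link(ci, cj, pair_sim):
    """Single-link: similarity of the most similar cross-cluster pair."""
    return max(pair_sim(x, y) for x in ci for y in cj)

# 1-D toy points; use negative distance as a similarity
history = hac([0.0, 0.1, 5.0], lambda a, b: -abs(a - b), single_link)
```

Swapping `single_link` for a min-based function gives complete-link, and an average over the cross-cluster pairs gives group-average clustering; the outer loop is unchanged.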
What is the most similar cluster?
Single-link
- Similarity of the most cosine-similar pair of points

Complete-link
- Similarity of the "furthest" points, i.e. the least cosine-similar pair

Group-average agglomerative clustering
- Average cosine between pairs of elements

Centroid clustering
- Similarity of the clusters' centroids
Single link clustering
1) Use the maximum similarity of pairs:

$$sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$$

2) After merging ci and cj, the similarity of the resulting cluster to
another cluster, ck, is:

$$sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$$
Complete link clustering
1) Use the minimum similarity of pairs:

$$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$$

2) After merging ci and cj, the similarity of the resulting cluster to
another cluster, ck, is:

$$sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$$
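The two merge-update rules side by side, as tiny functions (the names are mine, for illustration only):

```python
def single_link_update(sim_i_k, sim_j_k):
    """Single-link: sim(ci ∪ cj, ck) = max(sim(ci, ck), sim(cj, ck))."""
    return max(sim_i_k, sim_j_k)

def complete_link_update(sim_i_k, sim_j_k):
    """Complete-link: sim(ci ∪ cj, ck) = min(sim(ci, ck), sim(cj, ck))."""
    return min(sim_i_k, sim_j_k)

# with sim(ci, ck) = 0.9 and sim(cj, ck) = 0.2:
print(single_link_update(0.9, 0.2))    # the optimistic view: 0.9
print(complete_link_update(0.9, 0.2))  # the pessimistic view: 0.2
```

This is why single-link tends to produce long "chained" clusters (one close pair is enough to merge), while complete-link favors compact clusters (every pair must be close).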
Major issue - labeling
After the clustering algorithm finds clusters, how can they be
made useful to the end user?
We need a concise label for each cluster:
- In search results, say "Animal" or "Car" in the jaguar example.
- In topic trees (Yahoo), we need navigational cues.
  - Often done by hand, a posteriori.
How to Label Clusters
Show titles of typical documents
- Titles are easy to scan
- Authors create them for quick scanning!
- But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in the cluster
- More likely to fully represent the cluster
- Use distinguishing words/phrases
- But harder to scan
Further issues
Complexity:
- Clustering is computationally expensive. Implementations need
  careful balancing of needs.
- How do we decide how many clusters are best?

Evaluating the "goodness" of clustering:
- There are many techniques; some focus on implementation issues
  (complexity/time), some on the quality of the resulting clusters.