Supervised vs. Unsupervised Learning
Dr. Yousef Qawqzeh
Supervised vs. Unsupervised Learning
Examples of clustering in Web IR
Characteristics of clustering
Clustering algorithms
Cluster Labeling
Supervised Learning
Goal: A program that performs a task as well as humans do.
TASK – well defined (the target function)
EXPERIENCE – training data provided by a human
PERFORMANCE – error/accuracy on the task
Unsupervised Learning
Goal: To find some kind of structure in the data.
TASK – vaguely defined
No EXPERIENCE
No PERFORMANCE (but there are some evaluation metrics)
Clustering is the most common form of Unsupervised Learning
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects
It can be used in IR:
To improve recall in search applications
For better navigation of search results
Cluster hypothesis - Documents with similar text are related
Thus, when a query matches a document D, also return other documents in the cluster containing D.
Flat versus Hierarchical Clustering
Flat means dividing objects into groups (clusters)
Hierarchical means organizing clusters into a subsuming hierarchy
Evaluating Clustering
Internal Criteria
The intra-cluster similarity is high (tightness)
The inter-cluster similarity is low (separateness)
External Criteria
Did we discover the hidden classes? (we need gold standard data for this evaluation)
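A common external criterion is purity: each cluster is assigned its majority gold class, and purity is the fraction of points that agree with their cluster's majority. A minimal sketch with made-up gold labels:

```python
from collections import Counter

def purity(clusters, gold):
    """Each cluster votes for its majority gold class; purity is the
    fraction of all points that match their cluster's majority."""
    total = sum(len(c) for c in clusters)
    correct = sum(Counter(gold[i] for i in c).most_common(1)[0][1]
                  for c in clusters)
    return correct / total

# clusters hold indices into the gold-label list (toy data)
gold = ["x", "x", "x", "o", "o", "o"]
clusters = [[0, 1, 3], [2, 4, 5]]
print(purity(clusters, gold))  # 4/6: each cluster gets 2 of 3 right
```

Purity is 1.0 when every cluster is pure, but it rewards many tiny clusters, so it is usually reported alongside the number of clusters.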
Representation for clustering
Document representation
Vector space? Normalization?
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data driven?
Avoid “trivial” clusters - too large or small
Each doc j is a vector of tf·idf values, one component for each term.
Can normalize to unit length:
    d_j ← d_j / |d_j|,  where |d_j| = sqrt( Σ_{i=1..n} w_{i,j}² )  and  w_{i,j} = tf_{i,j} · idf_i
So we have a vector space
terms are axes - aka features
n docs live in this space; even with stemming, it may have 20,000+ dimensions
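A minimal sketch of building unit-length tf·idf vectors, assuming the plain log(N/df) idf variant and a made-up toy corpus:

```python
import math
from collections import Counter

docs = [["cat", "sat", "mat"],
        ["cat", "cat", "dog"],
        ["dog", "bit", "mat"]]  # toy corpus
vocab = sorted({t for d in docs for t in d})
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency
idf = {t: math.log(N / df[t]) for t in vocab}        # one common idf variant

def tfidf_vector(doc):
    """w_ij = tf_ij * idf_i, then scale the vector to unit length."""
    tf = Counter(doc)
    w = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm else w

vec = tfidf_vector(docs[0])
print(round(sum(x * x for x in vec), 6))  # 1.0: unit length
```

One vector component per vocabulary term; terms that occur in every document get idf 0 and drop out of the representation.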
Ideal: semantic similarity.
Practical: statistical similarity
We will use cosine similarity.
Documents as vectors.
We will describe algorithms in terms of cosine similarity.
Cosine similarity of normalized d_j, d_k:
    sim(d_j, d_k) = Σ_{i=1..n} w_{i,j} · w_{i,k}
This is known as the normalized inner product.
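For unit-length vectors, cosine similarity reduces to the inner product above; a small sketch with toy vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos_sim(dj, dk):
    # normalized inner product: sum_i w_ij * w_ik
    return sum(a * b for a, b in zip(dj, dk))

d1 = normalize([1.0, 2.0, 0.0])
d2 = normalize([2.0, 1.0, 1.0])
print(round(cos_sim(d1, d1), 4))  # 1.0: identical documents
print(round(cos_sim(d1, d2), 4))  # 0.7303
```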
[Figure: documents D1–D4 plotted in a two-dimensional term space with axes t1 and t2.]
Documents that are “close together” in vector space talk about the same things.
Partitioning “flat” algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
k-means clustering
Model based clustering (we will not cover it)
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive (we will not cover it)
Partitioning method: Construct a partition of n documents into a set of k clusters
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen partitioning criterion
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
    μ(c) = (1/|c|) Σ_{x∈c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids.
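The centroid formula above, sketched componentwise on toy 2-D points:

```python
def centroid(cluster):
    """mu(c): componentwise mean of the vectors in cluster c."""
    n = len(cluster)
    dims = len(cluster[0])
    return [sum(v[i] for v in cluster) / n for i in range(dims)]

print(centroid([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]))  # [1.0, 1.0]
```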
Let d be the distance measure between instances.
Select k random instances { s_1, s_2, …, s_k } as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster:)
    For each cluster c_j:  s_j = μ(c_j)
When to stop?
When a fixed number of iterations is reached
When centroid positions do not change
Seed Choice
Results can vary based on random seed selection.
Try out multiple starting points.
Example showing sensitivity to seeds (six points A–F):
If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
[Figure: the six points A–F.]
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Figure: example dendrogram. animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
Start with each instance in its own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters c_i and c_j that are most similar.
    Replace c_i and c_j with a single cluster c_i ∪ c_j.
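The loop above, sketched naively in O(n³) time; `sim` here scores whole clusters, so any of the linkage criteria can be plugged in (names are my own):

```python
def hac(items, sim):
    """Naive agglomerative clustering: returns the merge history."""
    clusters = [[x] for x in items]         # every instance starts alone
    history = []
    while len(clusters) > 1:
        # find the most similar pair (c_i, c_j)
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]  # c_i union c_j
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# toy 1-D points with a single-link style similarity (negated distance)
points = [1.0, 1.1, 5.0]
closest = lambda a, b: -min(abs(x - y) for x in a for y in b)
history = hac(points, closest)
print(history[0])  # [1.0, 1.1]: the closest pair merges first
```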
Single-link
Similarity of the most cosine-similar (single-link)
Complete-link
Similarity of the “furthest” points, the least cosine-similar
Group-average agglomerative clustering
Average cosine between pairs of elements
Centroid clustering
Similarity of clusters’ centroids
Single-link:
1) Use the maximum similarity of pairs:
    sim(c_i, c_j) = max_{x∈c_i, y∈c_j} sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
    sim((c_i ∪ c_j), c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
Complete-link:
1) Use the minimum similarity of pairs:
    sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
    sim((c_i ∪ c_j), c_k) = min( sim(c_i, c_k), sim(c_j, c_k) )
After the clustering algorithm finds clusters, how can they be useful to the end user?
Need a concise label for each cluster
In search results, say “Animal” or “Car” in the jaguar example.
In topic trees (Yahoo), need navigational cues.
Often done by hand, a posteriori.
Show titles of typical documents
Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles, which may not fully represent the cluster
Show words/phrases prominent in cluster
More likely to fully represent the cluster
Use distinguishing words/phrases
But harder to scan
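A simple distinguishing-word heuristic, as a sketch: score each word by its in-cluster count divided by one plus its out-of-cluster count (this particular scoring, and the toy corpus, are made-up illustrations; many weightings are possible):

```python
from collections import Counter

def label_words(cluster_docs, all_docs, top=3):
    """Rank words frequent inside the cluster but rare elsewhere."""
    inside = Counter(w for d in cluster_docs for w in d)
    outside_docs = [d for d in all_docs if d not in cluster_docs]
    outside = Counter(w for d in outside_docs for w in d)
    score = {w: inside[w] / (1 + outside[w]) for w in inside}
    return sorted(score, key=score.get, reverse=True)[:top]

docs = [["jaguar", "car", "engine"],
        ["jaguar", "speed", "car"],
        ["jaguar", "jungle", "cat"],
        ["cat", "jungle", "prey"]]
animal_cluster = docs[2:]
print(label_words(animal_cluster, docs))  # animal words, not "jaguar"
```

Note that "jaguar" scores low because it is frequent in both clusters, so the label highlights what makes this cluster different.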
Complexity:
Clustering is computationally expensive; implementations need to balance competing requirements carefully.
How to decide how many clusters is best?
Evaluating the “goodness” of clustering:
There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.