Supervised vs. Unsupervised Learning
Dr. Yousef Qawqzeh
Supervised vs. Unsupervised Learning
Examples of clustering in Web IR
Characteristics of clustering
Clustering algorithms
Cluster Labeling
Supervised Learning
Goal: A program that performs a task as well as humans do.
TASK – well defined (the target function)
EXPERIENCE – training data provided by a human
PERFORMANCE – error/accuracy on the task
Unsupervised Learning
Goal: To find some kind of structure in the data.
TASK – vaguely defined
No EXPERIENCE
No PERFORMANCE (but there are some evaluation metrics)
Clustering is the most common form of Unsupervised Learning
Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects
It can be used in IR:
To improve recall in search applications
For better navigation of search results
Cluster hypothesis - Documents with similar text are related
Thus, when a query matches a document D, also return other documents in the cluster containing D.
Flat versus Hierarchical Clustering
Flat means dividing objects into groups (clusters)
Hierarchical means organizing clusters into a subsuming hierarchy
Evaluating Clustering
Internal Criteria
The intra-cluster similarity is high (tightness)
The inter-cluster similarity is low (separateness)
External Criteria
Did we discover the hidden classes? (we need gold standard data for this evaluation)
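A common external criterion is purity: each cluster is assigned its majority gold class, and purity is the fraction of points that agree with their cluster's majority. A minimal sketch with made-up gold labels:

```python
from collections import Counter

def purity(clusters, gold):
    """Each cluster votes for its majority gold class; purity is the
    fraction of all points that match their cluster's majority."""
    total = sum(len(c) for c in clusters)
    correct = sum(Counter(gold[i] for i in c).most_common(1)[0][1]
                  for c in clusters)
    return correct / total

# clusters hold indices into the gold-label list (toy data)
gold = ["x", "x", "x", "o", "o", "o"]
clusters = [[0, 1, 3], [2, 4, 5]]
print(purity(clusters, gold))  # 4/6: each cluster gets 2 of 3 right
```

Purity is 1.0 when every cluster is pure, but it rewards many tiny clusters, so it is usually reported alongside the number of clusters.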
Representation for clustering
Document representation
Vector space? Normalization?
Need a notion of similarity/distance
How many clusters?
Fixed a priori?
Completely data driven?
Avoid “trivial” clusters - too large or small
Each doc j is a vector of tf·idf values, one component for each term.
Can normalize to unit length:
    d_j ← d_j / |d_j|,  where |d_j| = sqrt( Σ_{i=1..n} w_{i,j}² )  and  w_{i,j} = tf_{i,j} · idf_i
So we have a vector space
terms are axes - aka features
n docs live in this space; even with stemming, it may have 20,000+ dimensions
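A minimal sketch of building unit-length tf·idf vectors, assuming the plain log(N/df) idf variant and a made-up toy corpus:

```python
import math
from collections import Counter

docs = [["cat", "sat", "mat"],
        ["cat", "cat", "dog"],
        ["dog", "bit", "mat"]]  # toy corpus
vocab = sorted({t for d in docs for t in d})
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency
idf = {t: math.log(N / df[t]) for t in vocab}        # one common idf variant

def tfidf_vector(doc):
    """w_ij = tf_ij * idf_i, then scale the vector to unit length."""
    tf = Counter(doc)
    w = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm else w

vec = tfidf_vector(docs[0])
print(round(sum(x * x for x in vec), 6))  # 1.0: unit length
```

One vector component per vocabulary term; terms that occur in every document get idf 0 and drop out of the representation.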
Ideal: semantic similarity.
Practical: statistical similarity
We will use cosine similarity.
Documents as vectors.
We will describe algorithms in terms of cosine similarity.
Cosine similarity of normalized d_j, d_k:
    sim(d_j, d_k) = Σ_{i=1..n} w_{i,j} · w_{i,k}
This is known as the normalized inner product.
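For unit-length vectors, cosine similarity reduces to the inner product above; a small sketch with toy vectors:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos_sim(dj, dk):
    # normalized inner product: sum_i w_ij * w_ik
    return sum(a * b for a, b in zip(dj, dk))

d1 = normalize([1.0, 2.0, 0.0])
d2 = normalize([2.0, 1.0, 1.0])
print(round(cos_sim(d1, d1), 4))  # 1.0: identical documents
print(round(cos_sim(d1, d2), 4))  # 0.7303
```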
[Figure: documents D1–D4 plotted in a two-dimensional term space with axes t1 and t2.]
Documents that are “close together” in vector space talk about the same things.
Partitioning “flat” algorithms
Usually start with a random (partial) partitioning
Refine it iteratively
k-means clustering
Model based clustering (we will not cover it)
Hierarchical algorithms
Bottom-up, agglomerative
Top-down, divisive (we will not cover it)
Partitioning method: Construct a partition of n documents into a set of k clusters
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen partitioning criterion
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:
    μ(c) = (1/|c|) Σ_{x∈c} x
Reassignment of instances to clusters is based on distance to the current cluster centroids.
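The centroid formula above, sketched componentwise on toy 2-D points:

```python
def centroid(cluster):
    """mu(c): componentwise mean of the vectors in cluster c."""
    n = len(cluster)
    dims = len(cluster[0])
    return [sum(v[i] for v in cluster) / n for i in range(dims)]

print(centroid([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]))  # [1.0, 1.0]
```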
Let d be the distance measure between instances.
Select k random instances { s_1, s_2, …, s_k } as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance x_i:
        Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.
    (Update the seeds to the centroid of each cluster:)
    For each cluster c_j:  s_j = μ(c_j)
When to stop?
When a fixed number of iterations is reached
When centroid positions do not change
Seed Choice
Results can vary based on random seed selection.
Try out multiple starting points.
Example showing sensitivity to seeds (six points A–F):
If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
If you start with D and F, you converge to {A,B,D,E} and {C,F}.
[Figure: the six points A–F.]
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Figure: example dendrogram. animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).]
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
Start with each instance in its own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters c_i and c_j that are most similar.
    Replace c_i and c_j with a single cluster c_i ∪ c_j.
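The loop above, sketched naively in O(n³) time; `sim` here scores whole clusters, so any of the linkage criteria can be plugged in (names are my own):

```python
def hac(items, sim):
    """Naive agglomerative clustering: returns the merge history."""
    clusters = [[x] for x in items]         # every instance starts alone
    history = []
    while len(clusters) > 1:
        # find the most similar pair (c_i, c_j)
        i, j = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]  # c_i union c_j
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# toy 1-D points with a single-link style similarity (negated distance)
points = [1.0, 1.1, 5.0]
closest = lambda a, b: -min(abs(x - y) for x in a for y in b)
history = hac(points, closest)
print(history[0])  # [1.0, 1.1]: the closest pair merges first
```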
Single-link
Similarity of the most cosine-similar (single-link)
Complete-link
Similarity of the “furthest” points, the least cosine-similar
Group-average agglomerative clustering
Average cosine between pairs of elements
Centroid clustering
Similarity of clusters’ centroids
Single-link:
1) Use the maximum similarity of pairs:
    sim(c_i, c_j) = max_{x∈c_i, y∈c_j} sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
    sim((c_i ∪ c_j), c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
Complete-link:
1) Use the minimum similarity of pairs:
    sim(c_i, c_j) = min_{x∈c_i, y∈c_j} sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is:
    sim((c_i ∪ c_j), c_k) = min( sim(c_i, c_k), sim(c_j, c_k) )
After the clustering algorithm finds clusters, how can they be useful to the end user?
Need a concise label for each cluster
In search results, say “Animal” or “Car” in the jaguar example.
In topic trees (Yahoo), need navigational cues.
Often done by hand, a posteriori.
Show titles of typical documents
Titles are easy to scan
Authors create them for quick scanning!
But you can only show a few titles, which may not fully represent the cluster
Show words/phrases prominent in cluster
More likely to fully represent the cluster
Use distinguishing words/phrases
But harder to scan
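A simple distinguishing-word heuristic, as a sketch: score each word by its in-cluster count divided by one plus its out-of-cluster count (this particular scoring, and the toy corpus, are made-up illustrations; many weightings are possible):

```python
from collections import Counter

def label_words(cluster_docs, all_docs, top=3):
    """Rank words frequent inside the cluster but rare elsewhere."""
    inside = Counter(w for d in cluster_docs for w in d)
    outside_docs = [d for d in all_docs if d not in cluster_docs]
    outside = Counter(w for d in outside_docs for w in d)
    score = {w: inside[w] / (1 + outside[w]) for w in inside}
    return sorted(score, key=score.get, reverse=True)[:top]

docs = [["jaguar", "car", "engine"],
        ["jaguar", "speed", "car"],
        ["jaguar", "jungle", "cat"],
        ["cat", "jungle", "prey"]]
animal_cluster = docs[2:]
print(label_words(animal_cluster, docs))  # animal words, not "jaguar"
```

Note that "jaguar" scores low because it is frequent in both clusters, so the label highlights what makes this cluster different.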
Complexity:
Clustering is computationally expensive; implementations need to balance competing requirements carefully.
How to decide how many clusters is best?
Evaluating the “goodness” of clustering:
There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.