
Lect 8

Supervised vs. Unsupervised Learning

Dr. Yousef Qawqzeh

1

Clustering

Supervised vs. Unsupervised Learning

Examples of clustering in Web IR

Characteristics of clustering

Clustering algorithms

Cluster Labeling

2

Supervised vs. Unsupervised Learning

Supervised Learning

Goal: A program that performs a task as well as humans do.

TASK – well defined (the target function)

EXPERIENCE – training data provided by a human

PERFORMANCE – error/accuracy on the task

Unsupervised Learning

Goal: To find some kind of structure in the data.

TASK – vaguely defined

No EXPERIENCE

No PERFORMANCE (but there are some evaluation metrics)

3

What is Clustering?

Clustering is the most common form of Unsupervised Learning

Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects

It can be used in IR:

To improve recall in search applications

For better navigation of search results

4

Example 1: Improving Recall

Cluster hypothesis - Documents with similar text are related

Thus, when a query matches a document D, also return other documents in the cluster containing D.
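
A minimal Python sketch of this expansion step, assuming a precomputed mapping from documents to clusters (all identifiers below are invented for illustration):

```python
# Hypothetical cluster assignments: doc id -> cluster id, and cluster id -> member docs.
doc_cluster = {"d1": 0, "d2": 0, "d3": 1}
clusters = {0: ["d1", "d2"], 1: ["d3"]}

def expand_with_cluster(matched_docs):
    """Return the matched docs plus all other docs in their clusters."""
    expanded = set(matched_docs)
    for d in matched_docs:
        expanded.update(clusters[doc_cluster[d]])
    return expanded

print(expand_with_cluster({"d1"}))   # {'d1', 'd2'}
```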

5

Example 2: Better Navigation

6

Clustering Characteristics

Flat versus Hierarchical Clustering

Flat means dividing objects into groups (clusters)

Hierarchical means organizing clusters into a subsuming hierarchy

Evaluating Clustering

Internal Criteria

 The intra-cluster similarity is high (tightness)

 The inter-cluster similarity is low (separateness)

External Criteria

 Did we discover the hidden classes? (we need gold standard data for this evaluation)
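
As a concrete example of the external criterion, a minimal sketch of cluster purity against gold-standard labels (the document IDs and labels below are invented for illustration):

```python
from collections import Counter

def purity(cluster_assignments, gold_labels):
    """Fraction of docs whose cluster's majority gold label matches their own label."""
    clusters = {}
    for doc, c in cluster_assignments.items():
        clusters.setdefault(c, []).append(gold_labels[doc])
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in clusters.values())
    return majority_total / len(cluster_assignments)

assignments = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
gold        = {"d1": "sports", "d2": "sports", "d3": "politics", "d4": "sports"}
print(purity(assignments, gold))   # 0.75
```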

7

Clustering for Web IR

Representation for clustering

Document representation

 Vector space? Normalization?

Need a notion of similarity/distance

How many clusters?

Fixed a priori?

Completely data driven?

 Avoid “trivial” clusters - too large or small

8

Recall documents as vectors

Each doc j is a vector of tf-idf values, one component for each term.

Can normalize to unit length:

d_j → d_j / ||d_j|| , with components w_{i,j} / sqrt( Σ_{k=1}^{n} w_{k,j}^2 ) , where w_{i,j} = tf_{i,j} · idf_i

So we have a vector space

 terms are axes - aka features

n docs live in this space; even with stemming, it may have 20,000+ dimensions
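
A minimal sketch of building such vectors, assuming raw term frequency and a simple log idf (the three toy documents are made up):

```python
import math
from collections import Counter

docs = {"d1": "jaguar car speed", "d2": "jaguar cat jungle", "d3": "car engine speed"}
terms = sorted({t for text in docs.values() for t in text.split()})          # the axes
df = Counter(t for text in docs.values() for t in set(text.split()))         # document frequency
N = len(docs)

def tfidf_vector(text):
    tf = Counter(text.split())
    weights = [tf[t] * math.log(N / df[t]) for t in terms]   # w_{i,j} = tf_{i,j} * idf_i
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0     # guard against all-zero vectors
    return [w / norm for w in weights]                        # normalize to unit length

vectors = {d: tfidf_vector(text) for d, text in docs.items()}
```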

9

What makes documents related?

Ideal: semantic similarity.

Practical: statistical similarity

We will use cosine similarity.

Documents as vectors.

We will describe algorithms in terms of cosine similarity.

Cosine similarity of normalized d_j, d_k:

sim(d_j, d_k) = Σ_{i=1}^{n} w_{i,j} · w_{i,k}

This is known as the normalized inner product.
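
Given unit-length vectors like those built above, cosine similarity reduces to the inner product; a minimal sketch:

```python
def cosine(d_j, d_k):
    """Cosine similarity of two unit-length vectors = their inner product."""
    return sum(w_j * w_k for w_j, w_k in zip(d_j, d_k))

print(cosine([1.0, 0.0], [0.6, 0.8]))   # 0.6
```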

10

Intuition for relatedness

[Figure: documents D1–D4 plotted in a two-term vector space with axes t1 and t2]

Documents that are “close together” in vector space talk about the same things.

11

Clustering Algorithms

Partitioning “flat” algorithms

Usually start with a random (partial) partitioning

Refine it iteratively

k-means clustering

 Model based clustering (we will not cover it)

Hierarchical algorithms

Bottom-up, agglomerative

Top-down, divisive (we will not cover it)

12

Partitioning “flat” algorithms

Partitioning method: Construct a partition of n documents into a set of k clusters

Given: a set of documents and the number k

Find: a partition of k clusters that optimizes the chosen partitioning criterion

Watch animation of k-means

13

K-means

Assumes documents are real-valued vectors.

Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, c:

μ(c) = (1/|c|) · Σ_{x ∈ c} x

Reassignment of instances to clusters is based on distance to the current cluster centroids.
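
A minimal sketch of the centroid computation, with a cluster represented as a list of equal-length vectors:

```python
def centroid(cluster):
    """Mean vector mu(c) = (1/|c|) * sum of the vectors in cluster c."""
    n = len(cluster)
    return [sum(component) / n for component in zip(*cluster)]

print(centroid([[1.0, 2.0], [3.0, 4.0]]))   # [2.0, 3.0]
```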

14

K-Means Algorithm

Let d be the distance measure between instances.

Select k random instances {s_1, s_2, …, s_k} as seeds.

Until clustering converges or other stopping criterion:

For each instance x_i:

Assign x_i to the cluster c_j such that d(x_i, s_j) is minimal.

(Update the seeds to the centroid of each cluster)

For each cluster c_j: s_j = μ(c_j)
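
A minimal sketch of the loop above, assuming Euclidean distance and plain Python lists as vectors (the function and variable names are ours, not a fixed API):

```python
import random

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def k_means(instances, k, max_iters=100):
    seeds = random.sample(instances, k)                      # k random instances as seeds
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in instances:                                  # assign x to the nearest seed
            j = min(range(k), key=lambda j: euclidean(x, seeds[j]))
            clusters[j].append(x)
        new_seeds = [[sum(vals) / len(c) for vals in zip(*c)] if c else seeds[j]
                     for j, c in enumerate(clusters)]        # update seeds to centroids
        if new_seeds == seeds:                               # stop when centroids stop moving
            break
        seeds = new_seeds
    return clusters, seeds

# e.g. k_means([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]], k=2)
```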

15

K-means: Different Issues

When to stop?

When a fixed number of iterations is reached

When centroid positions do not change

Seed Choice

Results can vary based on random seed selection.

Try out multiple starting points

Example showing sensitivity to seeds

If you start with B and E as centroids you converge to {A,B,C} and {D,E,F}

If you start with D and F you converge to {A,B,D,E} and {C,F}

[Figure: six points A–F illustrating how different seed choices lead to different clusterings]
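
One common remedy is to run k-means from several random seedings and keep the best clustering; a minimal sketch using scikit-learn (the six 2-D points are invented and not the exact A–F points from the figure):

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0, 1], [0, 0], [0, -1],    # A, B, C (left column)
                   [4, 1], [4, 0], [4, -1]])   # D, E, F (right column)

# n_init=10 runs k-means from 10 different random seedings and keeps the
# result with the lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_, km.inertia_)
```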

16

Hierarchical clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.

[Figure: dendrogram — animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean)]

17

Hierarchical Agglomerative Clustering

We assume there is a similarity function that determines the similarity of two instances.

Algorithm:

Start with all instances in their own cluster.

Until there is only one cluster:

Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.

Replace c_i and c_j with a single cluster c_i ∪ c_j.

Watch animation of HAC
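
A minimal sketch of this bottom-up loop, assuming a pairwise similarity function sim(x, y) and single-link similarity between clusters (the cluster-similarity options are discussed on the next slide):

```python
def hac(instances, sim):
    """Agglomerative clustering; returns the sequence of merged clusters, bottom-up."""
    clusters = [[x] for x in instances]          # start with each instance in its own cluster
    merges = []
    while len(clusters) > 1:
        # most similar pair of clusters, using single-link (max pairwise similarity)
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: max(sim(x, y)
                                            for x in clusters[p[0]] for y in clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# e.g. hac(list_of_unit_vectors, sim=cosine)  with cosine as on slide 10
```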

18

What is the most similar cluster?

Single-link

Similarity of the “closest” points, the most cosine-similar

Complete-link

Similarity of the “furthest” points, the least cosine-similar

Group-average agglomerative clustering

Average cosine between pairs of elements

Centroid clustering

Similarity of clusters’ centroids
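
A minimal sketch of the four options as cluster-similarity functions, given clusters as lists of vectors and a pairwise sim(x, y):

```python
def single_link(ci, cj, sim):
    return max(sim(x, y) for x in ci for y in cj)        # closest pair

def complete_link(ci, cj, sim):
    return min(sim(x, y) for x in ci for y in cj)        # furthest pair

def group_average(ci, cj, sim):
    return sum(sim(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

def centroid_sim(ci, cj, sim):
    mean = lambda c: [sum(vals) / len(c) for vals in zip(*c)]   # cluster centroid
    return sim(mean(ci), mean(cj))
```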

19

Single link clustering

1) Use maximum similarity of pairs:

sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim((c_i ∪ c_j), c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )

20

Complete link clustering

1) Use minimum similarity of pairs:

sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)

2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

sim((c_i ∪ c_j), c_k) = min( sim(c_i, c_k), sim(c_j, c_k) )
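
The point of rule (2) on this slide and the previous one is that after a merge the similarities can be updated from the old values rather than recomputed over all pairs; a minimal sketch:

```python
def updated_sim(sim_ik, sim_jk, linkage="single"):
    """Similarity of (c_i ∪ c_j) to c_k, given sim(c_i, c_k) and sim(c_j, c_k)."""
    return max(sim_ik, sim_jk) if linkage == "single" else min(sim_ik, sim_jk)

print(updated_sim(0.3, 0.8, "single"))     # 0.8
print(updated_sim(0.3, 0.8, "complete"))   # 0.3
```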

21

Major issue - labeling

After the clustering algorithm finds clusters, how can they be made useful to the end user?

Need a concise label for each cluster

In search results, say “Animal” or “Car” in the jaguar example.

In topic trees (Yahoo), need navigational cues.

 Often done by hand, a posteriori.

22

How to Label Clusters

Show titles of typical documents

Titles are easy to scan

Authors create them for quick scanning!

But you can only show a few titles, which may not fully represent the cluster

Show words/phrases prominent in cluster

More likely to fully represent the cluster

Use distinguishing words/phrases

But harder to scan
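
A minimal sketch of the words/phrases approach, assuming tf-idf document vectors as built earlier; it scores each term by how much its average weight inside the cluster exceeds its average weight over the whole collection (one simple scoring rule, not the only option):

```python
def label_terms(cluster_vectors, all_vectors, terms, top_n=3):
    """Pick distinguishing terms: high weight in the cluster, lower weight overall."""
    def avg(vectors, i):
        return sum(v[i] for v in vectors) / len(vectors)
    scores = [(avg(cluster_vectors, i) - avg(all_vectors, i), term)
              for i, term in enumerate(terms)]
    return [term for _, term in sorted(scores, reverse=True)[:top_n]]
```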

23

Not covered in this lecture

Complexity:

Clustering is computationally expensive; implementations must balance clustering quality against time and memory cost.

How to decide how many clusters are best?

Evaluating the “goodness” of clustering

There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters.

24
