Clustering Techniques and IR
CSC 575
Intelligent Information Retrieval
Today
Clustering Problem and Applications
Clustering Methodologies and Techniques
Applications of Clustering in IR
What is Clustering?
Clustering is the process of partitioning a set of data (or objects)
into a set of meaningful sub-classes, called clusters
Helps users understand the natural grouping or structure in a data set
 Cluster:
 a collection of data objects that are “similar” to one another and thus can be treated collectively as one group
 but, as a collection, they are sufficiently different from other groups
Clustering in IR
 Objective of Clustering
 assign items to automatically created groups based on similarity or
association between items and groups
 also called “automatic classification”
 “The art of finding groups in data.” -- Kaufman and Rousseeuw
 Clustering in IR
 automatic thesaurus generation by clustering related terms
 automatic concept indexing (concepts are clusters of terms)
 automatic categorization of documents
 information presentation and browsing
 query generation and search refinement
Applications of Clustering
 Clustering has wide applications, including:
 Pattern Recognition
 Spatial Data Analysis:
 create thematic maps in GIS by clustering feature spaces
 detect spatial clusters and explain them in spatial data mining
 Image Processing
 Market Research
 Information Retrieval
 Document or term categorization
 Information visualization and IR interfaces
 Web Mining
 Cluster Web usage data to discover groups of similar access patterns
 Web Personalization
Clustering Methodologies
Two general methodologies
 Partitioning Based Algorithms
 Hierarchical Algorithms
 Partitioning Based
 divide a set of N items into K clusters (top-down)
 Hierarchical
 agglomerative: pairs of items or clusters are successively linked to
produce larger clusters
 divisive: start with the whole set as a cluster and successively
divide sets into smaller partitions
Clustering Algorithms
 Similarity Measures and Features
 most clustering algorithms are based on some measure of similarity
(or distance) between items
in IR these measures could be based on co-occurrence of terms,
citations, or hyperlinks in documents
terms can be clustered based on documents in which they co-occur, or
based on lexical or semantic similarity measures
 clustering requires the selection of features over which similarity
among items is computed
in document clustering, features are generally some or all of the terms
in the collection
often a small number of features must be selected because many
clustering algorithms break down in a “high-dimensional” space
 similarity measures among the items can be represented as a
symmetric similarity matrix, in which each entry is the similarity
value between two items
Distance or Similarity Measures
 Measuring Distance
 In order to group similar items, we need a way to measure the
distance between objects (e.g., records)
 Note: distance = inverse of similarity
 Often based on the representation of objects as “feature vectors”
An Employee DB

ID   Gender   Age   Salary
1    F        27    19,000
2    M        51    64,000
3    M        52    100,000
4    F        33    55,000
5    M        45    45,000
Term Frequencies for Documents

        T1   T2   T3   T4   T5   T6
Doc1     0    4    0    0    0    2
Doc2     3    1    4    3    1    2
Doc3     3    0    0    0    3    0
Doc4     0    1    0    3    0    0
Doc5     2    2    2    3    1    4
Which objects are more similar?
Distance or Similarity Measures
 Pearson Correlation
 Works well for user ratings (where there is at least a range of values, e.g., 1-5)
 Not always applicable (in some situations we may only have implicit binary
values, e.g., whether a user did or did not select a document)
 Alternatively, a variety of distance or similarity measures can be used
 Common Distance Measures:
 Manhattan distance: dist(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
 Euclidean distance: dist(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
 Cosine similarity: sim(X, Y) = (X . Y) / (|X| |Y|), converted to a distance via dist(X, Y) = 1 - sim(X, Y)
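As a concrete illustration (not from the slides), a minimal Python sketch of the three measures, applied to two of the document vectors from the table above:

```python
# Minimal sketch: Manhattan distance, Euclidean distance, and cosine similarity
# for two equal-length feature vectors.
import math

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

doc2 = [3, 1, 4, 3, 1, 2]   # Doc2 from the term-frequency table above
doc5 = [2, 2, 2, 3, 1, 4]   # Doc5
print(manhattan(doc2, doc5), euclidean(doc2, doc5), 1 - cosine_sim(doc2, doc5))
```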
Clustering Similarity Measures
 In the vector-space model, any of the similarity measures
discussed earlier can be used for clustering
Simple Matching: sim(X, Y) = |X ∩ Y|
Dice’s Coefficient: sim(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)
Jaccard’s Coefficient: sim(X, Y) = |X ∩ Y| / |X ∪ Y|
Cosine Coefficient: sim(X, Y) = |X ∩ Y| / sqrt(|X| × |Y|)
(shown here for binary term sets X and Y; weighted versions replace set sizes with sums of term weights)
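A minimal sketch (for illustration; the term sets X and Y are made up) of these coefficients in their binary, set-based form:

```python
# Set-based similarity coefficients for two documents represented as term sets.
def simple_matching(X, Y):
    return len(X & Y)

def dice(X, Y):
    return 2 * len(X & Y) / (len(X) + len(Y))

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def cosine(X, Y):
    return len(X & Y) / ((len(X) * len(Y)) ** 0.5)

X = {"web", "data", "mining"}
Y = {"data", "mining", "retrieval"}
print(simple_matching(X, Y), dice(X, Y), jaccard(X, Y), cosine(X, Y))
```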
Distance (Similarity) Matrix
 Similarity (Distance) Matrix
 based on the distance or similarity measure we can construct a symmetric
matrix of distance (or similarity values)
 (i, j) entry in the matrix is the distance (similarity) between items i and j
        I1     I2     ...    In
I1      -      d12    ...    d1n
I2      d21    -      ...    d2n
...
In      dn1    dn2    ...    -

dij = similarity (or distance) of Di to Dj

Note that dij = dji (i.e., the matrix is symmetric), so we only need the
lower triangle of the matrix. The diagonal is all 1’s (similarity) or all
0’s (distance).
Example: Term Similarities in Documents
 Suppose we want to cluster terms that appear in a collection of
documents with different frequencies
Each term can be viewed
as a vector of term
frequencies (weights)
        T1   T2   T3   T4   T5   T6   T7   T8
Doc1     0    4    0    0    0    2    1    3
Doc2     3    1    4    3    1    2    0    1
Doc3     3    0    0    0    3    0    3    0
Doc4     0    1    0    3    0    0    2    0
Doc5     2    2    2    3    1    4    0    2
 We need to compute a term-term similarity matrix
 For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity)
sim(Ti, Tj) = Σ (wik × wjk), summed over k = 1 ... N

N = total number of dimensions (in this case, documents)
wik = weight of term i in document k

 Example:
sim(T1, T2) = <0,3,3,0,2> · <4,1,0,1,2> = 0×4 + 3×1 + 3×0 + 0×1 + 2×2 = 7
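The same computation for all term pairs can be expressed in a few lines; this sketch (an illustration, not part of the original example) builds the term-term similarity matrix from the document-term table above:

```python
# Rows are documents Doc1..Doc5; columns are terms T1..T8.
doc_term = [
    [0, 4, 0, 0, 0, 2, 1, 3],  # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],  # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],  # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],  # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],  # Doc5
]
num_terms = len(doc_term[0])

def term_vector(i):
    """Weights of term i across all documents (column i of the matrix)."""
    return [row[i] for row in doc_term]

# Non-normalized dot product between every pair of term vectors
sim = [[sum(a * b for a, b in zip(term_vector(i), term_vector(j)))
        for j in range(num_terms)] for i in range(num_terms)]

print(sim[0][1])  # sim(T1, T2) = 7, as in the example above
```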
Similarity Matrix - Example
        T1   T2   T3   T4   T5   T6   T7   T8
Doc1     0    4    0    0    0    2    1    3
Doc2     3    1    4    3    1    2    0    1
Doc3     3    0    0    0    3    0    3    0
Doc4     0    1    0    3    0    0    2    0
Doc5     2    2    2    3    1    4    0    2

Term-Term Similarity Matrix (lower triangle):

        T1   T2   T3   T4   T5   T6   T7
T2       7
T3      16    8
T4      15   12   18
T5      14    3    6    6
T6      14   18   16   18    6
T7       9    6    0    6    9    2
T8       7   17    8    9    3   16    3
Similarity Thresholds
 A similarity threshold is used to mark pairs that are “sufficiently” similar
 The threshold value is application and collection dependent
Term-Term Similarity Matrix:

        T1   T2   T3   T4   T5   T6   T7
T2       7
T3      16    8
T4      15   12   18
T5      14    3    6    6
T6      14   18   16   18    6
T7       9    6    0    6    9    2
T8       7   17    8    9    3   16    3

Using a threshold value of 10 in the previous example:

        T1   T2   T3   T4   T5   T6   T7
T2       0
T3       1    0
T4       1    1    1
T5       1    0    0    0
T6       1    1    1    1    0
T7       0    0    0    0    0    0
T8       0    1    0    0    0    1    0
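A small sketch of the thresholding step (for illustration): the matrix literal below is the term-term similarity matrix from the example, with the diagonal set to 0 for convenience.

```python
SIM = [  # term-term dot products for T1..T8, as computed above
    [ 0,  7, 16, 15, 14, 14,  9,  7],   # T1
    [ 7,  0,  8, 12,  3, 18,  6, 17],   # T2
    [16,  8,  0, 18,  6, 16,  0,  8],   # T3
    [15, 12, 18,  0,  6, 18,  6,  9],   # T4
    [14,  3,  6,  6,  0,  6,  9,  3],   # T5
    [14, 18, 16, 18,  6,  0,  2, 16],   # T6
    [ 9,  6,  0,  6,  9,  2,  0,  3],   # T7
    [ 7, 17,  8,  9,  3, 16,  3,  0],   # T8
]

def apply_threshold(sim, threshold):
    """Mark pairs whose similarity meets the threshold with 1, all others with 0."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]

adj = apply_threshold(SIM, threshold=10)
# With a threshold of 10 this reproduces the 0/1 matrix above
# (e.g., adj[0][2] == 1 marks T1 and T3 as sufficiently similar).
```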
Graph Representation
 The similarity matrix can be visualized as an undirected graph
each item is represented by a node, and an edge connects two nodes if the
corresponding items are similar (a 1 in the thresholded similarity matrix)
        T1   T2   T3   T4   T5   T6   T7
T2       0
T3       1    0
T4       1    1    1
T5       1    0    0    0
T6       1    1    1    1    0
T7       0    0    0    0    0    0
T8       0    1    0    0    0    1    0
If no threshold is used, the matrix can be represented as a weighted graph
[Figure: the similarity graph over terms T1-T8 at threshold 10]
Graph-Based Clustering Algorithms
 If we are interested only in whether items exceed the threshold (and not in the
degree of similarity or distance), we can use the graph directly for clustering
 Clique Method (complete link)
 all items within a cluster must be within the similarity threshold of all other
items in that cluster
 clusters may overlap
 generally produces small but very tight clusters
 Single Link Method
 any item in a cluster must be within the similarity threshold of at least one
other item in that cluster
 produces larger but weaker clusters
 Other methods
 star method - start with an item and place all related items in that cluster
 string method - start with an item; place one related item in that cluster; then
place another item related to the last item entered, and so on
Graph-Based Clustering Algorithms
 Clique Method
 a clique is a completely connected subgraph of a graph
 in the clique method, each maximal clique in the graph becomes a cluster
[Figure: the similarity graph over terms T1-T8]

Maximal cliques (and therefore the clusters) in the previous example are:
{T1, T3, T4, T6}
{T2, T4, T6}
{T2, T6, T8}
{T1, T5}
{T7}

Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal.
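A minimal sketch of the clique method, assuming the third-party networkx library; the edge list encodes the threshold-10 graph from the previous slides.

```python
import networkx as nx

terms = ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"]
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]

G = nx.Graph()
G.add_nodes_from(terms)        # keep isolated nodes such as T7
G.add_edges_from(edges)

clusters = [sorted(c) for c in nx.find_cliques(G)]   # maximal cliques
print(clusters)
# expected clusters (order may vary):
# [['T1','T3','T4','T6'], ['T2','T4','T6'], ['T2','T6','T8'], ['T1','T5'], ['T7']]
```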
Graph-Based Clustering Algorithms
 Single Link Method
 1. Select an item not yet in a cluster and place it in a new cluster
 2. Place all other items similar to it in that cluster
 3. Repeat step 2 for each item added to the cluster until nothing more can be added
 4. Repeat steps 1-3 for each item that remains unclustered
[Figure: the similarity graph over terms T1-T8]

In this case the single link method produces only two clusters:
{T1, T3, T4, T5, T6, T2, T8}
{T7}

Note that the single link method does not allow overlapping clusters, thus
partitioning the set of items.
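A minimal, self-contained sketch of the single link method at a fixed threshold: the clusters are the connected components of the threshold graph, found here by breadth-first search over an adjacency-list version of the threshold-10 example.

```python
from collections import deque

graph = {  # adjacency lists of the threshold-10 similarity graph
    "T1": {"T3", "T4", "T5", "T6"}, "T2": {"T4", "T6", "T8"},
    "T3": {"T1", "T4", "T6"},       "T4": {"T1", "T2", "T3", "T6"},
    "T5": {"T1"},                   "T6": {"T1", "T2", "T3", "T4", "T8"},
    "T7": set(),                    "T8": {"T2", "T6"},
}

def single_link_clusters(graph):
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        cluster, queue = set(), deque([start])
        while queue:                       # BFS collects everything reachable
            node = queue.popleft()
            if node in cluster:
                continue
            cluster.add(node)
            queue.extend(graph[node] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters

print(single_link_clusters(graph))
# [['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T8'], ['T7']]
```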
Clustering with Existing Clusters
 The notion of comparing item similarities can be extended to clusters themselves,
by focusing on a representative vector for each cluster
 cluster representatives can be actual items in the cluster or other “virtual”
representatives such as the centroid
 this methodology reduces the number of similarity computations in clustering
 clusters are revised successively until a stopping condition is satisfied, or until no
more changes to clusters can be made
 Partitioning Methods
 reallocation method - start with an initial assignment of items to clusters and then
move items from cluster to cluster to obtain an improved partitioning
 Single pass method - simple and efficient, but produces large clusters, and depends
on order in which items are processed
 Hierarchical Agglomerative Methods
 starts with individual items and combines into clusters
 then successively combine smaller clusters to form larger ones
 grouping of individual items can be based on any of the methods discussed earlier
Partitioning Algorithms:
Basic Concept
 Partitioning method: Construct a partition of a
database D of n objects into a set of k clusters
 Given k, find a partition into k clusters that
optimizes the chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67)
 Each cluster is represented by the center of the cluster
 k-medoids (Kaufman & Rousseeuw’87):
 Each cluster is represented by one of the objects in the
cluster
K-Means Algorithm
 The basic algorithm (based on reallocation method):
1. Select K initial clusters by (possibly) random assignment of some items to clusters and
compute each of the cluster centroids.
2. Compute the similarity of each item xi to each cluster centroid and (re-)assign each item
to the cluster whose centroid is most similar to xi.
3. Re-compute the cluster centroids based on the new assignments.
4. Repeat steps 2 and 3 until there is no change in clusters from one iteration to the next.
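A minimal Python sketch of the reallocation-based K-Means above, using the dot product as the similarity measure (as in the worked example that follows). Tie-breaking and empty-cluster handling here are simplistic assumptions, so exact assignments can differ from the hand-worked iterations.

```python
def centroid(cluster, data):
    members = [data[i] for i in cluster]
    if not members:                       # guard against an empty cluster
        return [0.0] * len(data[0])
    return [sum(col) / len(members) for col in zip(*members)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def kmeans(data, initial_clusters, max_iter=100):
    clusters = [sorted(c) for c in initial_clusters]
    for _ in range(max_iter):
        centroids = [centroid(c, data) for c in clusters]      # steps 1 and 3
        new_clusters = [[] for _ in clusters]
        for i, item in enumerate(data):                        # step 2
            best = max(range(len(centroids)),
                       key=lambda k: dot(item, centroids[k]))
            new_clusters[best].append(i)
        if new_clusters == clusters:                           # step 4: converged
            break
        clusters = new_clusters
    return clusters

# Documents D1..D8 over terms T1..T5 (see the example below); initial
# clusters C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6} as 0-based indices.
docs = [[0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2], [0, 3, 0, 3, 3],
        [0, 1, 3, 0, 1], [2, 2, 0, 0, 4], [1, 0, 3, 2, 0], [3, 1, 0, 0, 2]]
print(kmeans(docs, [[0, 1], [2, 3], [4, 5]]))
```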
Example: Clustering Documents
Initial (arbitrary) assignment:
C1 = {D1,D2}, C2 = {D3,D4}, C3 = {D5,D6}

        T1    T2    T3    T4    T5
D1       0     3     3     0     2
D2       4     1     0     1     2
D3       0     4     0     0     2
D4       0     3     0     3     3
D5       0     1     3     0     1
D6       2     2     0     0     4
D7       1     0     3     2     0
D8       3     1     0     0     2

Cluster Centroids:
C1     4/2   4/2   3/2   1/2   4/2
C2     0/2   7/2   0/2   3/2   5/2
C3     2/2   3/2   3/2   0/2   5/2
Example: K-Means
Now compute the similarity (or distance) of each item with each cluster centroid, resulting in a
cluster-document similarity matrix (here we use the dot product as the similarity measure).
        D1     D2     D3     D4     D5     D6     D7     D8
C1    29/2   29/2   24/2   27/2   17/2   32/2   15/2   24/2
C2    31/2   20/2   38/2   45/2   12/2   34/2    6/2   17/2
C3    28/2   21/2   22/2   24/2   17/2   30/2   11/2   19/2
For each document, reallocate the document to the cluster to which it has the highest
similarity (the largest value in its column above). After the reallocation we have the following
new clusters. Note that the previously unassigned D7 and D8 have now been assigned, and that
D1 and D6 have been reallocated from their original assignment.
C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}
This is the end of the first iteration (i.e., the first reallocation).
Next, we repeat the process for another reallocation…
Example: K-Means
Now compute new cluster centroids using the original document-term matrix:

C1 = {D2,D7,D8}, C2 = {D1,D3,D4,D6}, C3 = {D5}

        T1     T2     T3     T4     T5
D1       0      3      3      0      2
D2       4      1      0      1      2
D3       0      4      0      0      2
D4       0      3      0      3      3
D5       0      1      3      0      1
D6       2      2      0      0      4
D7       1      0      3      2      0
D8       3      1      0      0      2
C1     8/3    2/3    3/3    3/3    4/3
C2     2/4   12/4    3/4    3/4   11/4
C3     0/1    1/1    3/1    0/1    1/1

This leads to a new cluster-document similarity matrix, similar to the one on the previous
slide. Again, each item is reallocated to the cluster with which it has the highest similarity:

        D1      D2      D3      D4      D5      D6      D7      D8
C1    7.67   15.01    5.34    9.00    5.00   12.00    7.67   11.34
C2   16.75   11.25   17.50   19.50    8.00    6.68    4.25   10.00
C3   14.00    3.00    6.00    6.00   11.00    9.34    9.00    3.00

New assignment:  C1 = {D2,D6,D8}, C2 = {D1,D3,D4}, C3 = {D5,D7}
Note: This process is now repeated with the new clusters. However, the next iteration in this example
will show no change to the clusters, thus terminating the algorithm.
K-Means Algorithm
 Strength of the k-means:
 Relatively efficient: O(tkn), where n is # of objects, k is # of
clusters, and t is # of iterations. Normally, k, t << n
 Often terminates at a local optimum
 Weakness of the k-means:
 Applicable only when mean is defined; what about
categorical data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Variations of K-Means usually differ in:
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
Single Pass Method
 The basic algorithm:
1. Assign the first item T1 as the representative (centroid) for cluster C1
2. For each subsequent item Ti, calculate its similarity S with the centroid of each existing cluster
3. If the maximum similarity Smax is greater than a threshold value, add the item to the
corresponding cluster and recalculate the centroid; otherwise, use the item to initiate a new cluster
4. If any item remains unclustered, go to step 2
See: Example of Single Pass Clustering Technique
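A minimal sketch of the single pass method, assuming a dot-product similarity and a caller-supplied threshold (both assumptions made for illustration):

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def single_pass(items, threshold):
    clusters = []   # each cluster is {"members": [...], "centroid": [...]}
    for item in items:
        best, best_sim = None, float("-inf")
        for c in clusters:                       # step 2: compare with centroids
            s = dot(item, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim > threshold:   # step 3: join best cluster
            best["members"].append(item)
            n = len(best["members"])
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:                                    # otherwise start a new cluster
            clusters.append({"members": [item], "centroid": list(item)})
    return clusters

# e.g. single_pass([[0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2]], threshold=10)
```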
 This algorithm is simple and efficient, but has some
problems
 generally does not produce optimum clusters
 order dependent - using a different order of processing items will result in
a different clustering
Hierarchical Clustering Algorithms
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster
(or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there
are k clusters)
• Traditional hierarchical algorithms use a similarity or
distance matrix
– Merge or split one cluster at a time
Hierarchical Algorithms
 Use distance matrix as clustering criteria
 does not require the no. of clusters as input, but needs a termination
condition
[Figure: agglomerative clustering proceeds left to right (steps 0-4), merging a, b into ab; c, d into cd; cd, e into cde; and ab, cde into abcde. Divisive clustering reads the same hierarchy right to left (steps 4-0).]
Hierarchical Agglomerative Clustering
 HAC starts with unclustered data and performs successive pairwise joins
among items (or previous clusters) to form larger ones
 this results in a hierarchy of clusters which can be viewed as a dendrogram
 useful in pruning search in a clustered item set, or in browsing clustering results
[Figure: dendrogram over items A-I]
Hierarchical Agglomerative Clustering
 Some commonly used HACM methods
 Single Link: at each step join most similar pairs of objects that are not yet
in the same cluster
 Complete Link: use least similar pair between each cluster pair to
determine inter-cluster similarity - all items within one cluster are linked
to each other within a similarity threshold
 Group Average (Mean): use average value of pairwise links within a
cluster to determine inter-cluster similarity (i.e., all objects contribute to
inter-cluster similarity)
 Ward’s method: at each step join cluster pair whose merger minimizes the
increase in total within-group error sum of squares (based on distance
between centroids) - also called the minimum variance method
Hierarchical Agglomerative Clustering
 Basic procedure
 1. Place each of N documents into a class of its own.
 2. Compute all pairwise document-document similarity
coefficients
Total of N(N-1)/2 coefficients
 3. Form a new cluster by combining the most similar pair of
current clusters i and j
(use one of the methods described in the previous slide, e.g., complete
link, Ward’s, etc.);
update similarity matrix by deleting the rows and columns
corresponding to i and j;
calculate the entries in the row corresponding to the new cluster i+j.
 4. Repeat step 3 if the number of clusters left is greater than 1.
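As a practical illustration (an assumption, not part of the slides), SciPy's hierarchical clustering implements the HACM variants above; the sketch below clusters the D1-D8 example documents from the K-Means slides.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = [[0, 3, 3, 0, 2], [4, 1, 0, 1, 2], [0, 4, 0, 0, 2], [0, 3, 0, 3, 3],
        [0, 1, 3, 0, 1], [2, 2, 0, 0, 4], [1, 0, 3, 2, 0], [3, 1, 0, 0, 2]]

dists = pdist(docs, metric="cosine")      # condensed pairwise distance matrix
Z = linkage(dists, method="complete")     # or "single" / "average"; "ward" expects Euclidean
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the cluster hierarchy
```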
Clustering Application:
Discovery of Content Profiles
 Content Profiles
 Goal: automatically group together pages which partially deal with
similar concepts
 Method:
identify concepts by clustering features (keywords) based on their
common occurrences among pages (can also be done using
association discovery or correlation analysis)
cluster centroids represent pages in which features in the cluster
appear frequently
 Content profiles are derived from centroids after filtering out low-weight pages in each centroid
 The weight of a page in a profile represents the degree to which
features in the corresponding cluster appear in that page.
Keyword-Based Representation
Pages x terms:

      w1   w2   w3
P1     1    0    1
P2     1    0    0
P3     0    1    1
P4     1    0    0
P5     1    1    1
P6     1    1    0
P7     0    1    0
P8     0    1    0

Keyword weights can be:
- Binary (as in this example)
- Raw (or normalized) term frequency
- TF x IDF

Terms x pages (transpose):

      P1   P2   P3   P4   P5   P6   P7   …
w1     1    1    0    1    1    1    0
w2     0    0    1    0    1    1    1
w3     1    0    1    0    1    0    0

Mining tasks can be performed on either of these matrices…
Content Profiles – An Example
Filtering threshold = 0.5

PROFILE 0 (Cluster Size = 3)
--------------------------------------------
1.00   C.html   (web, data, mining)
1.00   D.html   (web, data, mining)
0.67   B.html   (data, mining)

PROFILE 1 (Cluster Size = 4)
--------------------------------------------
1.00   B.html   (business, intelligence, marketing, ecommerce)
1.00   F.html   (business, intelligence, marketing, ecommerce)
0.75   A.html   (business, intelligence, marketing)
0.50   C.html   (marketing, ecommerce)
0.50   E.html   (intelligence, marketing)

PROFILE 2 (Cluster Size = 3)
--------------------------------------------
1.00   A.html   (search, information, retrieval)
1.00   E.html   (search, information, retrieval)
0.67   C.html   (information, retrieval)
0.67   D.html   (information, retrieval)
Example: Assoc. for Consumer Research (ACR)
How Content Profiles Are Generated
1. Extract important features
(e.g., word stems) from each
document:
icmd.html:
Feature    Freq
confer     12
market     9
develop    9
intern     5
ghana      3
ismd       3
contact    3
…          …

jcp.html:
Feature      Freq
psychologi   11
consum       9
journal      6
manuscript   5
cultur       5
special      4
issu         4
paper        4
…            …
2. Build a global dictionary of all features
(words) along with relevant statistics
Total Documents = 41
Feature-id   Feature       Doc-freq   Total-freq
0            1997          6          44
1            1998          12         59
2            1999          13         76
3            2000          8          41
…            …             …          …
123          confer        26         271
124          consid        9          24
125          consum        23         165
…            …             …          …
439          psychologi    7          45
440          public        14         78
441          publish       11         61
…            …             …          …
549          vision        1          6
550          volunt        3          8
551          vot           1          9
552          vote          4          23
553          web           3          17
…            …             …          …
How Content Profiles Are Generated
3. Construct a document-word matrix with normalized tf-idf weights
doc-id \ feature-id      0      1      2      3      4      5    …
0                     0.27   0.43   0.00   0.00   0.00   0.00   …
1                     0.07   0.10   0.00   0.00   0.00   0.00   …
2                     0.00   0.06   0.07   0.03   0.00   0.00   …
3                     0.00   0.00   0.00   0.00   0.00   0.00   …
4                     0.00   0.00   0.00   0.00   0.00   0.00   …
5                     0.00   0.00   0.05   0.06   0.00   0.00   …
6                     0.17   0.10   0.07   0.03   0.03   0.00   …
7                     0.14   0.09   0.08   0.02   0.02   0.00   …
8                     0.00   0.00   0.10   0.00   0.00   0.00   …
9                     0.00   0.07   0.00   0.00   0.00   0.00   …
10                    0.02   0.02   0.00   0.00   0.00   0.00   …
11                    0.00   0.00   0.00   0.00   0.00   0.00   …
12                    0.00   0.00   0.00   0.00   0.00   0.00   …
13                    0.00   0.00   0.00   0.00   0.00   0.00   …
14                    0.00   0.00   0.00   0.00   0.00   0.00   …
15                    0.00   0.00   0.32   0.38   0.00   0.00   …
…                      …      …      …      …      …      …
4. Now we can perform clustering on words (or documents) using one of the
techniques described earlier (e.g., k-means clustering on features).
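A sketch of steps 1-4 assuming scikit-learn (the slides do not prescribe a library): build a TF-IDF document-word matrix and run k-means over the word vectors, i.e., over the columns of the matrix. The document strings are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["conference on marketing and development ...",   # one string per page
             "journal of consumer psychology ...",
             "call for papers web data mining ..."]

vectorizer = TfidfVectorizer(stop_words="english")
doc_word = vectorizer.fit_transform(documents)    # docs x words, normalized tf-idf

word_vectors = doc_word.T                         # words x docs: cluster the features
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(word_vectors)

for word, label in zip(vectorizer.get_feature_names_out(), km.labels_):
    print(label, word)
```

Each feature cluster's centroid then gives the page weights from which a content profile is filtered, as described on the earlier slides.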
How Content Profiles Are Generated
Examples of feature (word) clusters obtained using k-means:
CLUSTER 0:  anthropologi, anthropologist, appropri, associ, behavior, ...
CLUSTER 4:  consum, issu, journal, market, psychologi, special
CLUSTER 10: ballot, result, vot, vote, ...
CLUSTER 11: advisori, appoint, committe, council, ...
5. Content profiles are now generated from feature clusters based on centroids of
each cluster (similar to usage profiles, but we have words instead of users/sessions).
Weight   Pageview ID                                        Significant Features (stems)
1.00     CFP: One World One Market                          world challeng busi co manag global
0.63     CFP: Int'l Conf. on Marketing & Development        challeng co contact develop intern
0.35     CFP: Journal of Global Marketing                   busi global
0.32     CFP: Journal of Consumer Psychology                busi manag global

Weight   Pageview ID                                        Significant Features (stems)
1.00     CFP: Journal of Psych. & Marketing                 psychologi consum special market
1.00     CFP: Journal of Consumer Psychology I              psychologi journal consum special market
0.72     CFP: Journal of Global Marketing                   journal special market
0.61     CFP: Journal of Consumer Psychology II             psychologi journal consum special
0.50     CFP: Society for Consumer Psychology               psychologi consum special
0.50     CFP: Conf. on Gender, Market., Consumer Behavior   journal consum market
Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95
 Cluster-based browsing technique for large text collections
 Cluster sets of documents into general “themes”, like a table of contents
 Display the contents of the clusters by showing topical terms and typical titles
 The user may then select (gather) clusters that seem interesting
 These clusters can then be re-clustered (scattered) to reveal more fine-grained
clusters of documents
 With each successive iteration of scattering and gathering, the clusters become
smaller and more detailed, eventually bottoming out at the level of individual
documents
 Clustering and re-clustering is entirely automated
 Originally used to give collection overview
 Evidence suggests it is more appropriate for displaying retrieval results in
context
Scatter/Gather Interface
Scatter/Gather Clusters
Clustering and Collaborative Filtering
:: clustering based on ratings: movielens
Clustering and Collaborative Filtering
:: tag clustering example
Hierarchical Clustering
:: example – clustered search results
Can drill down within clusters to view subtopics or to view the relevant subset of results.