Clustering Techniques and IR
CSC 575: Intelligent Information Retrieval

Today
- The clustering problem and its applications
- Clustering methodologies and techniques
- Applications of clustering in IR

What is Clustering?
- Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters
- Helps users understand the natural grouping or structure in a data set
- Cluster: a collection of data objects that are "similar" to one another and thus can be treated collectively as one group, but that, taken as a collection, are sufficiently different from other groups

Clustering in IR
- Objective of clustering
  - assign items to automatically created groups based on similarity or association between items and groups
  - also called "automatic classification"
  - "The art of finding groups in data." (Kaufman and Rousseeuw)
- Clustering in IR
  - automatic thesaurus generation by clustering related terms
  - automatic concept indexing (concepts are clusters of terms)
  - automatic categorization of documents
  - information presentation and browsing
  - query generation and search refinement

Applications of Clustering
- Clustering has wide applications in:
  - Pattern recognition
  - Spatial data analysis
    - create thematic maps in GIS by clustering feature spaces
    - detect spatial clusters and explain them in spatial data mining
  - Image processing
  - Market research
  - Information retrieval
    - document or term categorization
    - information visualization and IR interfaces
  - Web mining
    - cluster Web usage data to discover groups of similar access patterns
    - Web personalization

Clustering Methodologies
- Two general methodologies:
  - Partitioning-based algorithms
  - Hierarchical algorithms
- Partitioning-based: divide a set of N items into K clusters (top-down)
- Hierarchical:
  - agglomerative: pairs of items or clusters are successively linked to produce larger clusters
  - divisive: start with the whole set as one cluster and successively divide sets into smaller partitions

Clustering Algorithms
- Similarity measures and features
  - most clustering algorithms are based on some measure of similarity (or distance) between items
  - in IR these measures can be based on co-occurrence of terms, citations, or hyperlinks in documents
  - terms can be clustered based on the documents in which they co-occur, or based on lexical or semantic similarity measures
  - clustering requires the selection of features over which similarity among items is computed
  - in document clustering, the features are generally some or all of the terms in the collection
  - often a small number of features must be selected, because many clustering algorithms break down in a "high-dimensional" space
  - similarity measures among the items can be represented as a symmetric similarity matrix, in which each entry is the similarity value between two items

Distance or Similarity Measures
- Measuring distance
  - in order to group similar items, we need a way to measure the distance between objects (e.g., records)
  - note: distance is the inverse of similarity
  - often based on the representation of objects as "feature vectors"

An employee DB:

  ID  Gender  Age   Salary
   1    F      27    19,000
   2    M      51    64,000
   3    M      52   100,000
   4    F      33    55,000
   5    M      45    45,000

Term frequencies for documents:

        Doc1  Doc2  Doc3  Doc4  Doc5
  T1      0     3     3     0     2
  T2      4     1     0     1     2
  T3      0     4     0     0     2
  T4      0     3     0     3     3
  T5      0     1     3     0     1
  T6      2     2     0     0     4

Which objects are more similar?
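As a concrete (and purely illustrative) way to answer that question, the sketch below compares a few document columns from the term-frequency table using Euclidean distance and cosine similarity, both of which are defined formally on the next slides; the choice of which documents to compare is ours, not part of the original slides.

```python
# Illustrative sketch (not from the slides): comparing document vectors from the
# term-frequency table above with Euclidean distance and cosine similarity.
import math

# Columns of the term-frequency table, read as vectors over terms T1..T6.
doc2 = [3, 1, 4, 3, 1, 2]
doc3 = [3, 0, 0, 0, 3, 0]
doc5 = [2, 2, 2, 3, 1, 4]

def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine similarity: dot product of the two vectors, length-normalized."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

print(euclidean(doc2, doc5), cosine(doc2, doc5))   # Doc2 vs Doc5
print(euclidean(doc2, doc3), cosine(doc2, doc3))   # Doc2 vs Doc3
```

On these particular vectors, both measures agree that Doc2 is closer to Doc5 than it is to Doc3.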
Distance or Similarity Measures
- Pearson correlation
  - works well for user ratings (where there is at least a range of values, e.g., 1-5)
  - not always possible (in some situations we may only have implicit binary values, e.g., whether a user did or did not select a document)
- Alternatively, a variety of distance or similarity measures can be used
- Common distance measures:
  - Manhattan distance: dist(X, Y) = sum_i |x_i - y_i|
  - Euclidean distance: dist(X, Y) = sqrt( sum_i (x_i - y_i)^2 )
  - Cosine similarity: sim(X, Y) = (X . Y) / (|X| |Y|), with dist(X, Y) = 1 - sim(X, Y)

Clustering Similarity Measures
- In the vector-space model, any of the similarity measures discussed before can be used in clustering:
  - Simple matching
  - Cosine coefficient
  - Dice's coefficient
  - Jaccard's coefficient

Distance (Similarity) Matrix
- Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
- The (i, j) entry in the matrix is the distance (similarity) between items i and j:

        I1    I2   ...   In
  I1     -   d12   ...  d1n
  I2   d21     -   ...  d2n
  ...
  In   dn1   dn2   ...    -

  where d_ij = similarity (or distance) of item I_i to item I_j

- Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower-triangle part of the matrix
- The diagonal is all 1's (similarity) or all 0's (distance)

Example: Term Similarities in Documents
- Suppose we want to cluster terms that appear in a collection of documents with different frequencies
- Each term can be viewed as a vector of term frequencies (weights):

        Doc1  Doc2  Doc3  Doc4  Doc5
  T1      0     3     3     0     2
  T2      4     1     0     1     2
  T3      0     4     0     0     2
  T4      0     3     0     3     3
  T5      0     1     3     0     1
  T6      2     2     0     0     4
  T7      1     0     3     2     0
  T8      3     1     0     0     2

- We need to compute a term-term similarity matrix
- For simplicity we use the dot product as the similarity measure (note that this is the non-normalized version of cosine similarity):

  sim(T_i, T_j) = sum_{k=1..N} (w_ik * w_jk)

  where N = total number of dimensions (in this case, documents) and w_ik = weight of term i in document k.
Example:
  sim(T1, T2) = <0,3,3,0,2> . <4,1,0,1,2> = 0x4 + 3x1 + 3x0 + 0x1 + 2x2 = 7

Similarity Matrix - Example
- The full term-term similarity matrix (lower triangle):

        T1   T2   T3   T4   T5   T6   T7
  T2     7
  T3    16    8
  T4    15   12   18
  T5    14    3    6    6
  T6    14   18   16   18    6
  T7     9    6    0    6    9    2
  T8     7   17    8    9    3   16    3

Similarity Thresholds
- A similarity threshold is used to mark pairs that are "sufficiently" similar
- The threshold value is application and collection dependent
- Using a threshold value of 10 in the previous example:

        T1   T2   T3   T4   T5   T6   T7
  T2     0
  T3     1    0
  T4     1    1    1
  T5     1    0    0    0
  T6     1    1    1    1    0
  T7     0    0    0    0    0    0
  T8     0    1    0    0    0    1    0

Graph Representation
- The similarity matrix can be visualized as an undirected graph
  - each item is represented by a node, and edges represent the fact that two items are similar (a one in the thresholded similarity matrix)
  - if no threshold is used, the matrix can be represented as a weighted graph
- In the example, the threshold graph has edges T1-T3, T1-T4, T1-T5, T1-T6, T2-T4, T2-T6, T2-T8, T3-T4, T3-T6, T4-T6, and T6-T8, while T7 is an isolated node

Graph-Based Clustering Algorithms
- If we are interested only in the threshold (and not in the degree of similarity or distance), we can use the graph directly for clustering
- Clique method (complete link)
  - all items within a cluster must be within the similarity threshold of all other items in that cluster
  - clusters may overlap
  - generally produces small but very tight clusters
- Single link method
  - any item in a cluster must be within the similarity threshold of at least one other item in that cluster
  - produces larger but weaker clusters
- Other methods
  - star method: start with an item and place all related items in that cluster
  - string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on

Graph-Based Clustering Algorithms: Clique Method
- A clique is a completely connected subgraph of a graph
- In the clique method, each maximal clique in the graph becomes a cluster
- The maximal cliques (and therefore the clusters) in the previous example are:
  {T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}
- Note that, for example, {T1, T3, T4} is also a clique, but it is not maximal

Graph-Based Clustering Algorithms: Single Link Method
1. Select an item not yet in a cluster and place it in a new cluster
2. Place all other items similar to it in that cluster
3. Repeat step 2 for each item in the cluster until nothing more can be added
4. Repeat steps 1-3 for each item that remains unclustered
- In this case the single link method produces only two clusters:
  {T1, T3, T4, T5, T6, T2, T8} and {T7}
- Note that the single link method does not allow overlapping clusters, and thus partitions the set of items
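The sketch below is an illustrative implementation (not from the slides) of this pipeline on the example data: it computes the dot-product term-term similarities, applies the threshold of 10 to build the graph, and recovers the single-link clusters as connected components.

```python
# Illustrative sketch (not from the slides): threshold graph over the example
# terms, with single-link clusters found as connected components.
from itertools import combinations

terms = {
    "T1": [0, 3, 3, 0, 2], "T2": [4, 1, 0, 1, 2], "T3": [0, 4, 0, 0, 2],
    "T4": [0, 3, 0, 3, 3], "T5": [0, 1, 3, 0, 1], "T6": [2, 2, 0, 0, 4],
    "T7": [1, 0, 3, 2, 0], "T8": [3, 1, 0, 0, 2],
}

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

THRESHOLD = 10
edges = {t: set() for t in terms}
for ti, tj in combinations(terms, 2):
    if dot(terms[ti], terms[tj]) >= THRESHOLD:
        edges[ti].add(tj)
        edges[tj].add(ti)

# Single-link clusters = connected components of the threshold graph.
unvisited = set(terms)
clusters = []
while unvisited:
    seed = unvisited.pop()
    cluster, frontier = {seed}, [seed]
    while frontier:
        node = frontier.pop()
        for nbr in edges[node] & unvisited:
            unvisited.discard(nbr)
            cluster.add(nbr)
            frontier.append(nbr)
    clusters.append(sorted(cluster))

print(clusters)   # e.g. [['T1','T2','T3','T4','T5','T6','T8'], ['T7']]
```

The clique method would instead enumerate the maximal cliques of the same graph (for instance with networkx.find_cliques), producing the overlapping clusters listed above.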
Clustering with Existing Clusters
- The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
  - cluster representatives can be actual items in the cluster or other "virtual" representatives such as the centroid
  - this methodology reduces the number of similarity computations in clustering
  - clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made
- Partitioning methods
  - reallocation method: start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
  - single pass method: simple and efficient, but produces large clusters and depends on the order in which items are processed
- Hierarchical agglomerative methods
  - start with individual items and combine them into clusters
  - then successively combine smaller clusters to form larger ones
  - grouping of individual items can be based on any of the methods discussed earlier

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - global optimum: exhaustively enumerate all partitions
  - heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen '67): each cluster is represented by the center of the cluster
  - k-medoids (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

K-Means Algorithm
- The basic algorithm (based on the reallocation method):
  1. Select K initial clusters by (possibly random) assignment of some items to clusters and compute each of the cluster centroids.
  2. Compute the similarity of each item x_i to each cluster centroid and (re-)assign each item to the cluster whose centroid is most similar to x_i.
  3. Re-compute the cluster centroids based on the new assignments.
  4. Repeat steps 2 and 3 until there is no change in clusters from one iteration to the next.

Example: Clustering Documents
- Initial (arbitrary) assignment: C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}
- Document-term matrix and cluster centroids:

        D1  D2  D3  D4  D5  D6  D7  D8 |  C1   C2   C3
  T1     0   4   0   0   0   2   1   3 | 4/2  0/2  2/2
  T2     3   1   4   3   1   2   0   1 | 4/2  7/2  3/2
  T3     3   0   0   0   3   0   3   0 | 3/2  0/2  3/2
  T4     0   1   0   3   0   0   2   0 | 1/2  3/2  0/2
  T5     2   2   2   3   1   4   0   2 | 4/2  5/2  5/2

Example: K-Means
- Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix (here we use the dot product as the similarity measure):

         C1     C2     C3
  D1    29/2   31/2   28/2
  D2    29/2   20/2   21/2
  D3    24/2   38/2   22/2
  D4    27/2   45/2   24/2
  D5    17/2   12/2   17/2
  D6    32/2   34/2   30/2
  D7    15/2    6/2   11/2
  D8    24/2   17/2   19/2

- For each document, reallocate the document to the cluster to which it has the highest similarity (the largest value in its row). After the reallocation we have the following new clusters:
  C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}
- Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been reallocated from their original assignment
- This is the end of the first iteration (i.e., the first reallocation)
Next, we repeat the process for another reallocation.

Example: K-Means (continued)
- Compute new cluster centroids using the original document-term matrix and the new clusters C1 = {D2, D7, D8}, C2 = {D1, D3, D4, D6}, C3 = {D5}:

        D1  D2  D3  D4  D5  D6  D7  D8 |  C1    C2    C3
  T1     0   4   0   0   0   2   1   3 | 8/3   2/4   0/1
  T2     3   1   4   3   1   2   0   1 | 2/3  12/4   1/1
  T3     3   0   0   0   3   0   3   0 | 3/3   3/4   3/1
  T4     0   1   0   3   0   0   2   0 | 3/3   3/4   0/1
  T5     2   2   2   3   1   4   0   2 | 4/3  11/4   1/1

- This leads to a new cluster-document similarity matrix, analogous to the previous slide. Again, the items are reallocated to the clusters with the highest similarity:

          C1      C2      C3
  D1     7.67   16.75   14.00
  D2    15.01   11.25    3.00
  D3     5.34   17.50    6.00
  D4     9.00   19.50    6.00
  D5     5.00    8.00   11.00
  D6    12.00    6.68    9.34
  D7     7.67    4.25    9.00
  D8    11.34   10.00    3.00

- New assignment: C1 = {D2, D6, D8}, C2 = {D1, D3, D4}, C3 = {D5, D7}
- Note: this process is now repeated with the new clusters. However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.

K-Means Algorithm
- Strengths of k-means:
  - relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
  - often terminates at a local optimum
- Weaknesses of k-means:
  - applicable only when the mean is defined; what about categorical data?
  - need to specify k, the number of clusters, in advance
  - unable to handle noisy data and outliers
- Variations of k-means usually differ in:
  - selection of the initial k means
  - dissimilarity calculations
  - strategies to calculate cluster means
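The sketch below is a minimal, illustrative implementation (not from the slides) of the reallocation-style k-means traced by hand above, assuming dot-product similarity and mean centroids; the random initial assignment and the handling of empty clusters are our own simplifications.

```python
# Illustrative sketch (not from the slides): reallocation-based k-means with
# dot-product similarity, applied to the example document vectors.
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(items, k, max_iter=100, seed=0):
    """items: dict mapping item id -> feature vector."""
    random.seed(seed)
    ids = list(items)
    # Start from an arbitrary initial assignment of items to k clusters.
    assignment = {i: random.randrange(k) for i in ids}
    for _ in range(max_iter):
        # Recompute each cluster centroid from its current members
        # (an empty cluster is re-seeded with a random item's vector).
        centroids = []
        for c in range(k):
            members = [items[i] for i in ids if assignment[i] == c]
            centroids.append(centroid(members) if members
                             else items[random.choice(ids)])
        # Reassign every item to the most similar centroid.
        new_assignment = {
            i: max(range(k), key=lambda c: dot(items[i], centroids[c]))
            for i in ids
        }
        # Stop when no item changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

docs = {"D1": [0, 3, 3, 0, 2], "D2": [4, 1, 0, 1, 2], "D3": [0, 4, 0, 0, 2],
        "D4": [0, 3, 0, 3, 3], "D5": [0, 1, 3, 0, 1], "D6": [2, 2, 0, 0, 4],
        "D7": [1, 0, 3, 2, 0], "D8": [3, 1, 0, 0, 2]}
print(kmeans(docs, k=3))
```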
Single Pass Method
- The basic algorithm:
  1. Assign the first item T1 as the representative for cluster C1.
  2. For the next item Ti, calculate its similarity S with the centroid of each existing cluster.
  3. If Smax is greater than a threshold value, add the item to the corresponding cluster and recalculate the centroid; otherwise use the item to initiate a new cluster.
  4. If any item remains unclustered, go to step 2.
- See: Example of Single Pass Clustering Technique
- This algorithm is simple and efficient, but has some problems:
  - it generally does not produce optimum clusters
  - it is order dependent: using a different order of processing items will result in a different clustering

Hierarchical Clustering Algorithms
- Two main types of hierarchical clustering:
  - Agglomerative:
    - start with the points as individual clusters
    - at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  - Divisive:
    - start with one, all-inclusive cluster
    - at each step, split a cluster until each cluster contains a single point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
  - merge or split one cluster at a time

Hierarchical Algorithms
- Use the distance matrix as the clustering criterion
- Do not require the number of clusters as input, but need a termination condition
- [Figure: agglomerative clustering merges items a, b, c, d, e step by step (a+b -> ab, c+d -> cd, cd+e -> cde, ab+cde -> abcde); divisive clustering runs the same steps in reverse]

Hierarchical Agglomerative Clustering
- HAC starts with unclustered data and performs successive pairwise joins among items (or previously formed clusters) to form larger ones
- This results in a hierarchy of clusters which can be viewed as a dendrogram [Figure: dendrogram over items A through I]
- Useful in pruning search in a clustered item set, or in browsing clustering results

Hierarchical Agglomerative Clustering
- Some commonly used HACM methods:
  - Single link: at each step join the most similar pair of objects that are not yet in the same cluster
  - Complete link: use the least similar pair between each cluster pair to determine inter-cluster similarity; all items within one cluster are linked to each other within a similarity threshold
  - Group average (mean): use the average value of the pairwise links within a cluster to determine inter-cluster similarity (i.e., all objects contribute to inter-cluster similarity)
  - Ward's method: at each step join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares (based on the distance between centroids); also called the minimum variance method

Hierarchical Agglomerative Clustering
- Basic procedure:
  1. Place each of the N documents into a class of its own.
  2. Compute all pairwise document-document similarity coefficients (a total of N(N-1)/2 coefficients).
  3. Form a new cluster by combining the most similar pair of current clusters i and j (using one of the methods described on the previous slide, e.g., complete link, Ward's, etc.); update the similarity matrix by deleting the rows and columns corresponding to i and j; calculate the entries in the row corresponding to the new cluster i+j.
  4. Repeat step 3 while the number of clusters left is greater than 1.
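For illustration only (not from the slides), the sketch below runs hierarchical agglomerative clustering with SciPy on a small document-term matrix, assuming cosine distances; the linkage method parameter corresponds to the single link, complete link, and group average variants listed above.

```python
# Illustrative sketch (not from the slides): HAC over a toy document-term matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy document-term matrix (rows = documents, columns = terms).
X = np.array([
    [0, 3, 3, 0, 2],
    [4, 1, 0, 1, 2],
    [0, 4, 0, 0, 2],
    [0, 3, 0, 3, 3],
    [0, 1, 3, 0, 1],
    [2, 2, 0, 0, 4],
])

# Pairwise cosine distances (1 - cosine similarity), then the HAC merge tree.
distances = pdist(X, metric="cosine")
Z = linkage(distances, method="average")   # or "single" / "complete";
                                           # Ward's method expects Euclidean distances

# Cut the dendrogram to obtain a flat partition into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```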
Clustering Application: Discovery of Content Profiles
- Content profiles
  - Goal: automatically group together pages that partially deal with similar concepts
  - Method:
    - identify concepts by clustering features (keywords) based on their common occurrences among pages (this can also be done using association discovery or correlation analysis)
    - cluster centroids represent pages in which the features in the cluster appear frequently
  - Content profiles are derived from the centroids after filtering out low-weight pages in each centroid
  - The weight of a page in a profile represents the degree to which the features in the corresponding cluster appear in that page

Keyword-Based Representation
- A term-page matrix (and its transpose, a page-term matrix) records which keywords appear in which pages:

        P1  P2  P3  P4  P5  P6  P7  P8
  w1     1   1   0   1   1   1   0   0
  w2     0   0   1   0   1   1   1   1
  w3     1   0   1   0   1   0   0   0

- Keyword weights can be:
  - binary (as in this example)
  - raw (or normalized) term frequency
  - tf x idf
- Mining tasks can be performed on either of these matrices

Content Profiles - An Example
- Filtering threshold = 0.5

  PROFILE 0 (Cluster Size = 3)
    1.00  C.html  (web, data, mining)
    1.00  D.html  (web, data, mining)
    0.67  B.html  (data, mining)

  PROFILE 1 (Cluster Size = 4)
    1.00  B.html  (business, intelligence, marketing, ecommerce)
    1.00  F.html  (business, intelligence, marketing, ecommerce)
    0.75  A.html  (business, intelligence, marketing)
    0.50  C.html  (marketing, ecommerce)
    0.50  E.html  (intelligence, marketing)

  PROFILE 2 (Cluster Size = 3)
    1.00  A.html  (search, information, retrieval)
    1.00  E.html  (search, information, retrieval)
    0.67  C.html  (information, retrieval)
    0.67  D.html  (information, retrieval)

Example: Association for Consumer Research (ACR)
- [Screenshot of the ACR web site used in the following example]

How Content Profiles Are Generated
1. Extract important features (e.g., word stems) from each document. For example:

  icmd.html: confer (12), market (9), develop (9), intern (5), ghana (3), ismd (3), contact (3), ...
  jcp.html: psychologi (11), consum (9), journal (6), manuscript (5), cultur (5), special (4), issu (4), paper (4), ...

2. Build a global dictionary of all features (words) along with relevant statistics. For example, with 41 total documents:

  Feature-id  Doc-freq  Total-freq  Feature
       0          6         44      1997
       1         12         59      1998
       2         13         76      1999
       3          8         41      2000
     ...
     123         26        271      confer
     124          9         24      consid
     125         23        165      consum
     ...
     439          7         45      psychologi
     440         14         78      public
     441         11         61      publish
     ...
     549          1          6      vision
     550          3          8      volunt
     551          1          9      vot
     552          4         23      vote
     553          3         17      web
     ...

3. Construct a document-word matrix with normalized tf-idf weights (rows are documents, columns are features; most entries are 0).

4. Perform clustering on the words (or on the documents) using one of the techniques described earlier (e.g., k-means clustering on features). Examples of feature (word) clusters obtained using k-means:

  CLUSTER 0: anthropologi, anthropologist, appropri, associ, behavior, ...
  CLUSTER 4: consum, issu, journal, market, psychologi, special
  CLUSTER 10: ballot, result, vot, vote, ...
  CLUSTER 11: advisori, appoint, committe, council, ...

5. Content profiles are now generated from the feature clusters based on the centroid of each cluster (similar to usage profiles, but with words instead of users/sessions). Two example profiles:

  Profile (significant feature stems include: world, challeng, busi, co, contact, develop, intern, manag, global)
    1.00  CFP: One World One Market
    0.63  CFP: Int'l Conf. on Marketing & Development
    0.35  CFP: Journal of Global Marketing
    0.32  CFP: Journal of Consumer Psychology

  Profile (significant feature stems include: psychologi, consum, special, market, journal)
    1.00  CFP: Journal of Psych. & Marketing
    1.00  CFP: Journal of Consumer Psychology I
    0.72  CFP: Journal of Global Marketing
    0.61  CFP: Journal of Consumer Psychology II
    0.50  CFP: Society for Consumer Psychology
    0.50  CFP: Conf. on Gender, Market., Consumer Behavior
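The sketch below is an illustrative rendering (not from the slides) of steps 3-4 above with scikit-learn: build a tf-idf document-word matrix and then run k-means on the word vectors (the transpose). The document texts are placeholders invented for the example; only icmd.html and jcp.html are file names mentioned in the slides.

```python
# Illustrative sketch (not from the slides): tf-idf matrix + k-means on features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = {
    "icmd.html": "conference on international marketing and development ...",
    "jcp.html":  "journal of consumer psychology call for manuscripts ...",
    "acr.html":  "association for consumer research annual conference ...",
}

# Step 3: document-word matrix with (normalized) tf-idf weights.
vectorizer = TfidfVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(docs.values())   # rows = documents
term_doc = doc_term.T.tocsr()                        # rows = words (features)

# Step 4: cluster the words by the documents in which they occur.
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(term_doc)

# Print the word membership of each feature cluster.
words = vectorizer.get_feature_names_out()
for c in range(k):
    print(c, [w for w, label in zip(words, km.labels_) if label == c])
```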
Scatter/Gather
(Cutting, Pedersen, Tukey & Karger '92, '93; Hearst & Pedersen '95)
- A cluster-based browsing technique for large text collections
  - cluster sets of documents into general "themes", like a table of contents
  - display the contents of the clusters by showing topical terms and typical titles
  - the user may then select (gather) clusters that seem interesting
  - these clusters can then be re-clustered (scattered) to reveal more fine-grained clusters of documents
  - with each successive iteration of scattering and gathering, the clusters become smaller and more detailed, eventually bottoming out at the level of individual documents
  - clustering and re-clustering is entirely automated
- Originally used to give a collection overview
- Evidence suggests it is more appropriate for displaying retrieval results in context

Scatter/Gather Interface
- [Screenshot of the Scatter/Gather interface]

Scatter/Gather Clusters
- [Screenshot of example Scatter/Gather clusters]

Clustering and Collaborative Filtering
- Clustering based on ratings: MovieLens
- Tag clustering example

Hierarchical Clustering
- Example: clustered search results
- Users can drill down within clusters to view subtopics or to view the relevant subset of results
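As a closing illustration (not from the slides), the sketch below implements one scatter/gather round under the assumption that documents are represented with tf-idf vectors and clustered with k-means; the function names scatter and gather, and all parameters, are our own.

```python
# Illustrative sketch (not from the slides): one scatter/gather iteration.
# Assumes `docs` is a list of (title, text) pairs.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def scatter(docs, k, n_terms=5, n_titles=3):
    """Cluster docs into k themes; summarize each with topical terms and typical titles."""
    titles, texts = zip(*docs)
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(texts)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    summaries = []
    for c in range(k):
        members = [i for i, label in enumerate(km.labels_) if label == c]
        center = km.cluster_centers_[c]
        top_terms = [terms[i] for i in np.argsort(center)[::-1][:n_terms]]
        # "Typical" titles: documents closest to the cluster centroid.
        dists = {i: np.linalg.norm(X[i].toarray().ravel() - center) for i in members}
        typical = [titles[i] for i in sorted(dists, key=dists.get)[:n_titles]]
        summaries.append((members, top_terms, typical))
    return summaries

def gather(docs, summaries, selected):
    """Collect the documents of the clusters the user selected, for re-scattering."""
    chosen = {i for c in selected for i in summaries[c][0]}
    return [docs[i] for i in sorted(chosen)]
```

In each round the user inspects the term and title summaries, picks the interesting clusters, gather collects their documents, and scatter is applied again to the smaller set until individual documents are reached.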