The Failure of Clustering in Search Interfaces
… or When/How/Why Clustering can be Successful in Search Interfaces
Marti Hearst
UC Berkeley
Oct 6, 2004
http://www.sims.berkeley.edu/~hearst

Main Points
• Grouping search results is desirable
• However, getting good groups is difficult
• Furthermore, incorporation of groups into interfaces has not been done well
• Good news: improvements are happening

Talk Outline
• Why search interfaces are difficult to define
• Definition of categories and clusters
• Studies showing failure of clustering in interfaces
• A new development in clustering in web search
• How to remedy these problems

Clustering Interface Problems
• Big problem:
  – Clusters used primarily as part of a visualization
• This just doesn’t work
  – Every usability study says so
  – Lots of dots scattered about the screen are meaningless to users
  – There is no inherent spatial relationship among the documents
  – Need text to understand content
• Another big problem:
  – Clustering images according to an approximation of visual similarity
• This just doesn’t work
  – What limited studies have been done say so
  – Instead: group according to textual categories

Search interfaces are difficult to design
• Content and queries are hugely varying
  – The scope of what people search for is all of human knowledge and experience (!)
  – Interfaces must accommodate human differences in
    • Knowledge / life experience
    • Cultural background and expectations
    • Reading / scanning ability and style
    • Methods of looking for things (pilers vs. filers)

Abstractions Are Difficult to Represent
• Text describes abstract concepts
  – Difficult to show the contents of text in a visual or compact manner
• Exercise:
  – How would you show the preamble of the US Constitution visually?
  – How would you show the contents of Joyce’s Ulysses visually? How would you distinguish it from Homer’s The Odyssey or McCourt’s Angela’s Ashes?
• The point: it is difficult to show text without using text

Lack of Technical Understanding
• Most people don’t understand the underlying methods by which search engines work.
  – Without appropriate explanations, most of the 14 participants had strong misconceptions about:
    • ANDing vs ORing of search terms
      – Some assumed an ANDing search engine indexed a smaller collection; most had no explanation at all
    • Empty results for the query “to be or not to be”
      – 9 of 14 could not give an explanation that remotely resembled stop word removal
    • Term order variation: “boat fire” vs. “fire boat”
      – Only 5 out of 14 expected different results
Muramatsu & Pratt, “Transparent Queries: Investigating Users’ Mental Models of Search Engines,” SIGIR 2001.

Other Issues
• Vocabulary Disconnect
  – If you ask a set of people to describe a set of things there is little overlap in the results.
    • If one person assigns a name, the probability of it NOT matching with another person’s is about 80%
• It is difficult to represent content compactly
• Small details matter
• People are reluctant to change search interfaces
Furnas, et al: The Vocabulary Problem in Human-System Communication. Commun.
ACM 30(11): 964-971 (1987)

The Need to Group
• Interviews with lay users often reveal a desire for better organization of retrieval results
• Useful for suggesting where to look next
  – People prefer links over generating search terms
  – But only when the links are for what they want
• Two main approaches for text and images:
  – Group items according to pre-defined categories
  – Group items into automatically-created clusters
Ojakaar and Spool, Users Continue After Category Links, UIETips Newsletter, http://world.std.com/~uieweb/Articles/, 2001

Categories
• Human-created
  – But often automatically assigned to items
• Arranged in hierarchy, network, or facets
  – Can assign multiple categories to items
  – Or place items within categories
• Usually restricted to a fixed set
  – So help reduce the space of concepts
• Intended to be readily understandable
  – To those who know the underlying domain
  – Provide a novice with a conceptual structure
• There are many already made up!
• However, until recently, their use in interfaces has
  – Been under-investigated
  – Not met its promise

Category System Examples
[Screenshot slides; two of the examples are from eat.epicurious.com]

Example of Faceted Metadata: Medical Subject Headings (MeSH)
Facets
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]

Each Facet Has Hierarchy
1. Anatomy [A]
     Body Regions [A01]
     Musculoskeletal System [A02]
     Digestive System [A03]
     Respiratory System [A04]
     Urogenital System [A05]
     ……
2. [B]  3. [C]  4. [D]  5. [E]  6. [F]  7. [G]  8. Physical Sciences [H]  9. [I]  10.
[J]  11. [K]  12. [L]  13. [M]

Clustering
• “The art of finding groups in data” – Kaufman and Rousseeuw
• Groups are formed according to associations and commonalities among the data’s features.
  – There are dozens of algorithms, more all the time
  – Most need a way of determining similarity or difference between a pair of items
  – In text clustering, documents are usually represented as a vector of weighted features which are some transformation on the words
  – Similarity between documents is a weighted measure of feature overlap

Clustering
• Potential benefits:
  – Find the main themes in a set of documents
    • Potentially useful if the user wants a summary of the main themes in the subcollection
    • Potentially harmful if the user is interested in less dominant themes
  – More flexible than pre-defined categories
    • There may be important themes that have not been anticipated
  – Disambiguate ambiguous terms
    • ACL
  – Clustering retrieved documents tends to group those relevant to a complex query together
Hearst, Pedersen, Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, SIGIR’96

Scatter/Gather Clustering
• Developed at PARC in the late 80’s/early 90’s
• Top-down approach
  – Start with k seeds (documents) to represent k clusters
  – Each document is assigned to the cluster with the most similar seed
• To choose the seeds:
  – Cluster in a bottom-up manner
  – Hierarchical agglomerative clustering
    • Start with n documents, compare all by pairwise similarity, combine the two most similar documents to make a cluster
    • Now compare both clusters and individual documents to find the most similar pair to combine
    • Continue until k clusters remain
    • Use the centroid of each of these as seeds
  – Centroid: average of the weighted vectors
• Can recluster a cluster to produce a hierarchy of clusters
Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR 1992

[Diagram labels: query, Collection, Rank, Cluster]

The Scatter/Gather Interface

S/G Example: query on “star”
Encyclopedia text
  Top-level clusters:
    8 symbols
    68 film, tv (p)
    97 astrophysics
    67 astronomy (p)
    10 flora/fauna
  Reclustered (counts sum to 68):
    14 sports
    47 film, tv
    7 music
  Reclustered (counts sum to 97):
    12 stellar phenomena
    49 galaxies, stars
    29 constellations
    7 miscellaneous
Clustering and re-clustering is entirely automated

S/G Example: query on “star”
Newspaper/Magazine text
  Top-level clusters:
    22 products / business
    41 software / computers
    58 restaurants / food (reviews)
    98 movies / tv (reviews)
    31 wall street / finance
  Reclustered (counts sum to 98):
    35 hollywood
    54 astronomers/movies
    9 film mini-reviews
Topics quite different from encyclopedia text

Two Queries: Two Clusterings
AUTO, CAR, ELECTRIC
  8 control drive accident …
  25 battery california technology …
  48 import j. rate honda toyota …
  16 export international unit japan
  3 service employee automatic …
AUTO, CAR, SAFETY
  6 control inventory integrate …
  10 investigation washington …
  12 study fuel death bag air …
  61 sale domestic truck import …
  11 japan export defect unite …
The main differences are the clusters that are central to the query

Clustering Example: Medical Text
• Query: “mastectomy” on a breast cancer collection
• 250 documents retrieved
• Summary of cluster themes (subjective):
  – prophylactic mastectomy (preventative)
  – prostheses and reconstruction
  – conservative vs radical surgery
  – side effects of surgery
  – psychological effects of surgery
• The first two clusters found themes for which there was no corresponding MeSH category
Hearst, The Use of Categories and Clusters for Organizing Retrieval Results, in Natural Language Information Retrieval, Kluwer, 1999

A Clustering Failure
• Query: “implant” and “prosthesis”
• Four clusters returned:
  – use of implants to administer radiation dosages
  – complications resulting from breast implants
  – other issues surrounding breast implants
  – other kinds of prostheses
• Reclustering clusters 2 and 3 does not find cohesive subgroups
  – An examination of the documents indicates that a valid subdivision was possible
    • type of surgical procedure
    • risk factors
  – This seems to happen
    when there are too many features in common
  – Perhaps a better clustering algorithm can help in this case

Clustering Algorithm Problems
• Doesn’t work well if data is too homogeneous or too heterogeneous
• Often is difficult to interpret quickly
  – Automatically generated labels are unintuitive and occur at different levels of description
• Often the top-level can be ok, but the subsequent levels are very poor
• Need a better way to handle items that fall into more than one cluster

Visualizing Clustering Results
• Use clustering to map the entire huge multidimensional document space into a huge number of small clusters.
• Use dimension reduction and then project these onto a 2D/3D graphical representation

Clustering Multi-Dimensional Document Space
(two images from Wise et al 95)

Kohonen Feature Maps on Text
(from Chen et al., JASIS 49(7))

Is it useful?
• 4 Clustering Visualization Usability Studies

Clustering for Search Study 1
• This study compared
  – a system with 2D graphical clusters
  – a system with 3D graphical clusters
  – a system that shows textual clusters
• Novice users
• Only textual clusters were helpful (and they were difficult to use well)
Kleiboemer, Lazear, and Pedersen. Tailoring a retrieval system for naive users. SDAIR’96

Clustering Study 2: Kohonen Feature Maps
• Comparison: Kohonen Map and Yahoo
• Task:
  – “Window shop” for interesting home page
  – Repeat with other interface
• Results:
  – Starting with map could repeat in Yahoo (8/11)
  – Starting with Yahoo unable to repeat in map (2/14)
Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): 582-603 (1998)

Kohonen Feature Maps
(Lin 92, Chen et al. 97)

Study 2 (cont.)
• Participants liked:
  – Correspondence of region size to # documents
  – Overview (but also wanted zoom)
  – Ease of jumping from one topic to another
  – Multiple routes to topics
  – Use of category and subcategory labels
Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): 582-603 (1998)

Study 2 (cont.)
• Participants wanted:
  – hierarchical organization
  – other ordering of concepts (alphabetical)
  – integration of browsing and search
  – correspondence of color to meaning
  – more meaningful labels
  – labels at same level of abstraction
  – fit more labels in the given space
  – combined keyword and category search
  – multiple category assignment (sports+entertain)
• (These can all be addressed with faceted hierarchical categories)
Chen, Houston, Sewell, Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques. JASIS 49(7): 582-603 (1998)

Clustering Study 3: NIRVE
Each rectangle is a cluster. Larger clusters closer to the “pole”. Similar clusters near one another. Opening a cluster causes a projection that shows the titles.

Study 3
• This study compared:
  – 3D graphical clusters
  – 2D graphical clusters
  – textual clusters
• 15 participants, between-subject design
• Tasks
  – Locate a particular document
  – Locate and mark a particular document
  – Locate a previously marked document
  – Locate all clusters that discuss some topic
  – List more frequently represented topics
Sebrechts, Cugini, Laskowski, Vasilakis and Miller, Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces, SIGIR ’99.
Study 3
• Results (time to locate targets)
  – Text clusters fastest
  – 2D next
  – 3D last
  – With practice (6 sessions) 2D neared text results; 3D still slower
  – Computer experts were just as fast with 3D
• Certain tasks equally fast with 2D & text
  – Find particular cluster
  – Find an already-marked document
• But anything involving text (e.g., find title) much faster with text.
  – Spatial location rotated, so users lost context
• Helpful viz features
  – Color coding (helped text too)
  – Relative vertical locations
Sebrechts, Cugini, Laskowski, Vasilakis and Miller, Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces, SIGIR ’99.

Clustering Study 4
• Compared several factors
• Findings:
  – Topic effects dominate (this is a common finding)
  – Strong difference in results based on spatial ability
  – No difference between librarians and other people
  – No evidence of usefulness for the cluster visualization
Swan and Allan, Aspect windows, 3-D visualizations, and indirect comparisons of information retrieval systems, SIGIR 1998.

Summary: Visualizing for Search Using Clusters
• Huge 2D maps may be an inappropriate focus for information retrieval
  – cannot see what the documents are about
  – space is difficult to browse for IR purposes
  – (tough to visualize abstract concepts)
• Perhaps more suited for pattern discovery and gist-like overviews

How do people want to search and browse images?
• Ethnographic studies of people who use images intensely find:
  – Finding specific objects is easy
    • Find images of the Empire State Building
  – Browsing is hard
• In a usability study with architects, to our surprise we found their response to an image-browsing interface mock-up was that they wanted to see more text (categories).
Elliott, A. (2001). "Flamenco Image Browser: Using Metadata to Improve Image Search During Architectural Design," in the Proceedings of CHI 2001.
Clustering in Image Search
• Using Visual “Content”
  – Extract color, texture, shape
    • QBIC (Flickner et al. ’95)
    • Blobworld (Carson et al. ’99)
    • Body Plans (Forsyth & Fleck ’00)
    • Piction: images + text (Srihari et al. ’91 ’99)
  – Two uses:
    • Show a clustered similarity space
    • Show those images similar to a selected one

[Three figure slides from:]
K. Rodden, Evaluating Similarity-Based Visualisations as Interfaces for Image Browsing, PhD thesis, 2001
K. Rodden, W. Basalaj, D. Sinclair, and K. Wood, Does Organisation by Similarity Assist Image Browsing?, CHI 2001

Image Clustering Study Results
• Searching was faster with the random arrangement
• Preference for the clustered arrangement was not overwhelmingly stronger than for random
  – 2 out of 10 participants preferred random and 3 had no preference
  – Median satisfaction for clustered was 4.5 and for random was 4.0
K. Rodden, Evaluating Similarity-Based Visualisations as Interfaces for Image Browsing, PhD thesis, 2001
K. Rodden, W. Basalaj, D. Sinclair, and K. Wood, Does Organisation by Similarity Assist Image Browsing?, CHI 2001

An Alternative
• In the Flamenco project, we have shown that hierarchical faceted metadata, paired with a good interface, is highly effective for browsing image collections
  – Flamenco.berkeley.edu
• (But that’s a different talk)

Study 5: Comparing Textual Cluster Interfaces to Category Interfaces
• DynaCat system
• Decide on important question types in advance
  – What are the adverse effects of drug D?
  – What is the prognosis for treatment T?
• Make use of MeSH categories
• Retain only those types of categories known to be useful for this type of query.
Pratt, W., Hearst, M, and Fagan, L. A Knowledge-Based Approach to Organizing Retrieved Documents. AAAI-99

DynaCat Interface
Pratt, W., Hearst, M, and Fagan, L. A Knowledge-Based Approach to Organizing Retrieved Documents. AAAI-99

DynaCat Study
• Design
  – Three queries
  – 24 cancer patients
  – Compared three interfaces
    • ranked list, clusters, categories
• Results
  – Participants strongly preferred categories
  – Participants found more answers using categories
  – Participants took same amount of time with all three interfaces
Pratt, W., Hearst, M, and Fagan, L. A Knowledge-Based Approach to Organizing Retrieved Documents. AAAI-99

Study 6: Categories vs. Lists
• One study found users preferred one level of categories over lists, and were faster at finding answers
  – Only 13 top-level categories shown
  – Secondary-level categories not very accurate
• However, the queries appeared to be somewhat set up to optimize the usefulness of the clusters
  – Example:
    • Query word: “indian”
    • Task: find indian motorcycles
    • Query: “alaska”
    • Task: find yachting adventures in alaska
Chen, Dumais, Bringing order to the web: Automatically categorizing search results. CHI 2000

What about Textual Displays of Clusters?
• Text-based clustering is more promising
• Text-based clustering on the Web
  – In the early days, Excite had a mockup on about 10 documents that pretended to do Scatter/Gather (when it was called Architext)
    • Quickly removed it and started providing standard search
  – For a while NorthernLight had a clustering interface
    • Didn’t really get anywhere
  – The latest entry is Vivisimo
    • Has a lot of problems
    • BUT … there’s a new development from Vivisimo called Clusty
    • Seems to have much improved clustering and interface

An Analysis of Vivisimo
• Query: barcelona
• Query: dog pregnancy

An Analysis of Vivisimo
• Query: barcelona
  – Hotels and Travel Guide are both at top level
  – Also, Barcelona City
  – But Travel Guide contains
    • Hotels
    • Spain, Spanish
  – Not really helping to make useful distinctions

An Analysis of Vivisimo
• Query: pregnant dog
  – What does the category pregnant mean here?
  – Why does it have a subcategory of whelping, when there is also a main category of whelping?
  – And what is the relationship to Pregnancy and Birth?
  – The pages shown don’t seem strongly related to one another
• How to follow up?
  – There is a “find in clusters” box, but it is not very helpful because there are no hints about which words might work

Search within Results

Then along came Clusty …
• Announced less than a week ago
• Produced by Vivisimo
• Much better interface
• Much better clusters

Clusty Improvements
• Labels tend to be more at the same level of description
• Subcategories are more cautious, reflecting groups of very similar documents
  – Do a better job of really showing subcategories
• Nice interface touches
  – Better use of color for distinguishing
  – Small icons are inviting
  – Incorporation of encyclopedia results high up
• Search results are better
  – (Not always; pregnant dog not much better)
  – Using metasearch
  – May be throwing out some docs to get more distribution in the types of results found
  – Looks like they are focusing on term proximity to get more meaningful grouping
  – Don’t allow very many results

Clusty Improvements
• Doing sense disambiguation for abbreviations like ACL
  – However, no good followup for how to make use of this
  – E.g., to search on ACL (meaning comp ling) plus some other concepts
  – On the other hand, using multiple terms is how most disambiguation is done now
    • ACL + disambiguation
    • Jaguar + prey
  – So not clear if there is a net benefit
• Trying to approximate faceted queries
  – Under the Jaguar query, for history, shows history of the band together with history of the car and the video game

Analysis
• Is it really helping? Or are the categories now too general and overlapping?
• The main effect seems to be that the search results are better due to the metasearch and term proximity

More Analysis
• Reflects the frequency of topics in the data
  – So no discussion of nukes in the Spain categories
  – No discussion of hotels in the North Korea categories
  – Is this good or bad? It depends.
More analysis
• Adding a related term (Degas, Cezanne) brings up relations between the two that don’t appear with the general term Degas alone
  – Impressionists
  – Pissarro, in particular (should be under impressionists)
• Also leads to messier results

Summary
• Grouping search results is desirable
  – Often requested by lay users
  – Very positive results for category interface
• However, getting good groups is difficult
  – Two main approaches:
    • Predefined category sets
    • Automatically created clusters
• Furthermore, incorporation of groups into interfaces has not been done well
  – Notable Failures in Search Interfaces:
    • Visualization of clusters
    • Unintuitive clusters and labels
    • Clustering of images according to visual attributes
    • Poor incorporation of categories into search interfaces (not covered)
• Good news: improvements are happening
  – Improved clustering that takes better account of good display principles as seen in Clusty
  – Flamenco: Flexible search and navigation via faceted category hierarchies (not discussed here)

A Promising Direction: Combining Categories and Clusters
• Mehran Sahami’s work on combining categories and clusters
• Ray Larson’s work on clustering results of categorization
• Would be interesting to cluster MeSH category labels
  – Work using UMLS to select subsets of MeSH has been successful for analysis tasks

Conclusions
• In order to use clustering in an interface, must pay attention to what makes the groupings intuitive
• Much work has been too much of a “science project”
• Up to now, clustering hasn’t succeeded on web search results, but Clusty shows marked improvements that are promising

Thank you!
Marti Hearst
www.sims.berkeley.edu/~hearst

More Recent Attempts
• Analyzing retrieval results
  – KartOO http://www.kartoo.com/
  – Grokker http://www.groxis.com/service/grok

References
Chen, Houston, Sewell, and Schatz, Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques, JASIS 49(7): 582-603 (1998)
Chen and Yu, Empirical studies of information visualization: a meta-analysis, IJHCS 53(5), 2000
Dumais, Cutrell, Cadiz, Jancke, Sarin and Robbins, Stuff I've Seen: A system for personal information retrieval and re-use, SIGIR 2003.
Hearst, English, Sinha, Swearingen, Yee, Finding the Flow in Web Site Search, CACM 45(9), 2002.
Hearst, User Interfaces and Visualization, Chapter 10 of Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds), Addison-Wesley 1999.
Johnson, Manning, Hagen, and Dorsey, Specialize Your Site's Search, Forrester Research (Dec. 2001), Cambridge, MA
Sebrechts, Cugini, Laskowski, Vasilakis and Miller, Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces, SIGIR ’99.
Swan and Allan, Aspect windows, 3-D visualizations, and indirect comparisons of information retrieval systems, SIGIR 1998.
Yee, Swearingen, Li, Hearst, Faceted Metadata for Image Search and Browsing, Proceedings of CHI 2003
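
Backup: A Sketch of Scatter/Gather Seeding
The Scatter/Gather Clustering slide describes choosing k seeds by clustering bottom-up until k groups remain, taking each group's centroid as a seed, and then assigning every document to its most similar seed. The Python below is only a minimal illustration of those steps, not the PARC implementation. Several things are assumptions made for the sketch: raw term counts stand in for the "weighted features", cosine stands in for the "weighted measure of feature overlap", groups are compared by their centroids, and all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term counts; an assumed stand-in for the slide's weighted features."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict-like term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Centroid: the average of the weighted vectors in a group."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds weights term by term
    return {t: w / len(vectors) for t, w in total.items()}


def choose_seeds(vectors, k):
    """Bottom-up agglomerative clustering until k groups remain.

    Groups are compared by centroid similarity (an assumption; the slide
    does not say which linkage is used). Returns the k group centroids.
    """
    groups = [[v] for v in vectors]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(centroid(groups[i]), centroid(groups[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]  # combine the most similar pair
        del groups[j]
    return [centroid(g) for g in groups]


def assign(vectors, seeds):
    """Assign each document to the index of its most similar seed."""
    return [max(range(len(seeds)), key=lambda c: cosine(v, seeds[c]))
            for v in vectors]


# Toy documents loosely echoing the "star" example (hypothetical data).
docs = [
    "star galaxy astronomy telescope",
    "galaxy star constellation astronomy",
    "film star hollywood movie",
    "movie film tv star",
]
vecs = [tf_vector(d) for d in docs]
seeds = choose_seeds(vecs, k=2)
labels = assign(vecs, seeds)
print(labels)  # [0, 0, 1, 1]
```

The all-pairs loop recomputes centroids on every pass, which is fine for a handful of documents but far too slow for a real collection. Reclustering, as mentioned on the slide, would amount to calling choose_seeds and assign again on the documents of a single cluster.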