SIMS 202: Information Organization and Retrieval
Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am, Fall 2000

Today
Review basic human-computer interaction principles
Starting points for search

UI and Viz in IA: Chapter Contents

Human-Computer Interaction (HCI)
Human
– the end-user of a program
– the others in the organization
Computer
– the machine the program runs on
Interaction
– the user tells the computer what they want
– the computer communicates results
Slide by James Landay

What is HCI?
[Diagram: HCI relates humans, technology, design, the task, and organizational & social issues]
Slide by James Landay

Shneiderman on HCI
Well-designed interactive computer systems:
– promote positive feelings of success, competence, and mastery
– allow users to concentrate on their work, rather than on the system

Usability Design Goals
Ease of learning
– faster the second time and so on...
Recall
– remember how from one session to the next
Productivity
– perform tasks quickly and efficiently
Minimal error rates
– if errors occur, good feedback so the user can recover
High user satisfaction
– confident of success
Slide by James Landay

Usability Slogans (from Nielsen's Usability Engineering)
Your best guess is not good enough
The user is always right
The user is not always right
Users are not designers
Designers are not users
Less is more
Details matter
Adapted from slide by James Landay

Design Guidelines
Set of design rules to follow
Apply at multiple levels of design
Are neither complete nor orthogonal
Have psychological underpinnings (ideally)
Adapted from slide by James Landay

Who builds UIs?
A team of specialists (ideally):
– graphic designers
– interaction / interface designers
– technical writers
– marketers
– test engineers
– software engineers
Slide by James Landay

How to Design and Build UIs
Task analysis
Rapid prototyping
Evaluation
Implementation
Iterate at every stage!
[Diagram: Design–Prototype–Evaluate iteration cycle]
Adapted from slide by James Landay

Task Analysis
Observe existing work practices
Create examples and scenarios of actual use
Try out new ideas before building software
Slide by James Landay

Rapid Prototyping
Build a mock-up of the design
Low-fidelity techniques
– paper sketches
– cut, copy, paste
– video segments
Interactive prototyping tools
– Visual Basic, HyperCard, Director, etc.
UI builders
– NeXT, etc.
Slide by James Landay

Evaluation
Test with real users (participants)
Build models
Low-cost techniques
– expert evaluation
– walkthroughs
Slide by James Landay

Information Seeking Behavior
Two parts of a process:
» search and retrieval
» analysis and synthesis of search results
This is a fuzzy area; we will look at several different working theories.

Standard Model
Assumptions:
– Maximizing precision and recall simultaneously
– The information need remains static
– The value is in the resulting document set
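The precision and recall that the standard model assumes we are maximizing can be computed directly from a retrieved set and a set of known relevant documents. A minimal sketch in Python; the document IDs and sets are hypothetical, purely for illustration:

    def precision_recall(retrieved, relevant):
        """Precision: fraction of retrieved docs that are relevant.
        Recall: fraction of relevant docs that were retrieved."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical run: 3 of the 5 retrieved docs are relevant,
    # but only 3 of the 6 relevant docs were retrieved.
    p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                            ["d1", "d3", "d5", "d7", "d8", "d9"])
    print(p, r)   # 0.6 0.5

The tension the standard model glosses over is that narrowing a query usually raises precision at the expense of recall, while broadening it does the reverse.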
Problem with Standard Model
Users learn during the search process:
– Scanning titles of retrieved documents
– Reading retrieved documents
– Viewing lists of related topics / thesaurus terms
– Navigating hyperlinks
Some users don't like long, disorganized lists of documents

"Berry-Picking" as an Information Seeking Strategy (Bates 90)
Standard IR model
– assumes the information need remains the same throughout the search process
Berry-picking model
– interesting information is scattered like berries among bushes
– the query is continually shifting

A sketch of a searcher…
"moving through many actions towards a general goal of satisfactory completion of research related to an information need." (after Bates 89)
[Diagram: a search path moving through shifting queries Q0, Q1, Q2, Q3, Q4, Q5]

Implications
Interfaces should make it easy to store intermediate results
Interfaces should make it easy to follow trails with unanticipated results
Makes evaluation more difficult
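Following the first two implications, the interface needs a record of the whole trail, not just the most recent result set. A minimal sketch, assuming a hypothetical SearchTrail class (the class, its methods, and the example session are illustrative, not taken from any system described in these slides):

    class SearchTrail:
        """Record each query issued and the items 'picked' at that step,
        so intermediate results survive as the query keeps shifting."""

        def __init__(self):
            self.steps = []  # list of (query, picked_items) pairs

        def add_step(self, query, picked_items):
            self.steps.append((query, list(picked_items)))

        def all_picked(self):
            # Everything gathered so far, across the whole berry-picking path
            return [item for _, items in self.steps for item in items]

    # Hypothetical session: the query shifts at each step (Q0, Q1, ...)
    trail = SearchTrail()
    trail.add_step("information seeking behavior", ["Bates 1989"])
    trail.add_step("berrypicking online search", ["Bates 1990", "O'Day & Jeffries 1993"])
    print(trail.all_picked())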
Search Tactics and Strategies
Search Tactics
– Bates 79
Search Strategies
– Bates 89
– O'Day and Jeffries 93

Tactics vs. Strategies
Tactic: short-term goals and maneuvers
– operators, actions
Strategy: overall planning
– link a sequence of operators together to achieve some end

Information Search Tactics (after Bates 79)
Source-level tactics
– navigate to and within sources
Term and search formulation tactics
– designing the search formulation
– selection and revision of specific terms within the search formulation
Monitoring tactics
– keep the search on track
– (should really be called a strategy)

Term Tactics
Move around a thesaurus
– (more on this in the 2nd half of class)

Source-level Tactics
"Bibble":
– look for a pre-defined result set
– e.g., a good link page on the web
Survey:
– look ahead, review available options
– e.g., don't simply use the first term or first source that comes to mind
Cut:
– eliminate a large proportion of the search domain
– e.g., search on the rarest term first

Source-level Tactics (cont.)
Stretch
– use a source in an unintended way
– e.g., use patents to find addresses
Scaffold
– take an indirect route to the goal
– e.g., when looking for references to an obscure poet, look up contemporaries

Monitoring Tactics (strategy-level)
Check
– compare the original goal with the current state
Weigh
– make a cost/benefit analysis of current or anticipated actions
Pattern
– recognize common strategies
Correct Errors
Record
– keep track of (incomplete) paths

Additional Considerations (Bates 79)
Need a Sort tactic
More detail is needed about short-term cost/benefit decision rule strategies
When to stop?
– How to judge when enough information has been gathered?
– How to decide when to give up an unsuccessful search?
– When to stop searching in one source and move to another?

After the Search
How to synthesize information is part of the information use process
One "theory" is called sensemaking
– Russell et al. paper
– Dan Russell is speaking today at 4pm! Room 110. Different topic.

Post-Search Analysis Types (O'Day & Jeffries 93)
Trends
Comparisons
Aggregation and Scaling
Identifying a Critical Subset
Assessing
Interpreting
The rest:
» cross-reference
» summarize
» find evocative visualizations
» miscellaneous

SenseMaking (Russell et al. 93)
The process of encoding retrieved information to answer task-specific questions
Combine
– internal cognitive resources
– external retrieved resources
Create a good representation
– an iterative process
– contend with a cost/benefit tradeoff

[Figure: The SenseMaking Loop, from Russell et al. 93]
[Figure: Observed activities of business analysts working, from Russell et al. 93]
[Figure: The SenseMaking Process, from Russell et al., InterCHI 93]

Sensemaking (Russell et al. 93)
An anytime activity
– At any point a workable solution is available
– Usually more time -> better solution
– Usually more properties -> better solution

Sensemaking (Russell et al. 93)
A good strategy
– Maximizes long-term rate of gain
– Example:
» new technology brings more information faster
» this causes a uniform increase in useful and useless information
» best strategy: throw out the bad stuff faster

Sensemaking (Russell et al. 93)
Most of the effort is in the synthesis of a good representation
– covers the data
– increases usability
– decreases cost-of-use

UI and Viz in IA: Chapter Contents

Starting Points for Search
Types:
– Lists
– Overviews
» Categories
» Clusters
» Links/Hyperlinks
– Examples, Wizards, Guided Tours

Starting Points for Search
Faced with a prompt or an empty entry form … how to start?
– Lists of sources
– Overviews
» Clusters
» Category Hierarchies / Subject Codes
» Co-citation links
– Examples, Wizards, and Guided Tours
– Automatic source selection

List of Sources
Have to guess based on the name
Requires prior exposure/experience
[Figure: dialog box for choosing sources in the old Lexis-Nexis interface]

Overviews in the User Interface
Supervised (manual) category overviews
– Yahoo!
– HiBrowse
– MeSHBrowse
Unsupervised (automated) groupings
– Clustering
– Kohonen feature maps

Incorporating Categories into the Interface
Yahoo is the standard method
Problems:
– Hard to search, meant to be navigated
– Only one category per document (usually)

Evidence
Web search engines are heavily using
– Link analysis
– Page popularity
– Interwoven categories
These all find dominant home pages

More Complex Example: MeSH and MedLine
MeSH Category Hierarchy
– Medical Subject Headings
– ~18,000 labels
– manually assigned
– ~8 labels/article on average
– avg depth: 4.5, max depth: 9
Top-level categories: anatomy, animals, disease, drugs, diagnosis, psych, biology, physics, related disc., technology, humanities

Category Labels
Advantages:
– Interpretable
– Capture summary information
– Describe multiple facets of content
– Domain dependent, and so descriptive
Disadvantages:
– Do not scale well (for organizing documents)
– Domain dependent, so costly to acquire
– May mismatch users' interests

MeSHBrowse (Korn & Shneiderman 95)
Grow the category structure gradually and in response to semantic similarity

HiBrowse (Pollitt 97)
Show combinations of categories, given that some categories have already been seen

Large Category Sets
Problems for user interfaces:
» Too many categories to browse
» Too many docs per category
» Docs belong to multiple categories
» Need to integrate search
» Need to show the documents

Text Clustering
Finds overall similarities among groups of documents
Finds overall similarities among groups of tokens
Picks out some themes, ignores others

Scatter/Gather
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
How it works
– Cluster sets of documents into general "themes", like a table of contents
– Display the contents of the clusters by showing topical terms and typical titles
– User chooses subsets of the clusters and re-clusters the documents within
– Resulting new groups have different "themes"
Originally used to give a collection overview
Evidence suggests it is more appropriate for displaying retrieval results in context
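A rough sketch of this cluster / summarize / re-cluster cycle, using off-the-shelf TF-IDF vectors and k-means in place of the actual Scatter/Gather clustering algorithms (Buckshot and Fractionation); the toy documents and the choice of k are hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def scatter(docs, k=3):
        """Cluster docs into k 'themes'; summarize each by its topical terms."""
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(docs)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        terms = np.array(vec.get_feature_names_out())
        clusters = []
        for c in range(k):
            members = [i for i, label in enumerate(km.labels_) if label == c]
            top_terms = terms[np.argsort(km.cluster_centers_[c])[::-1][:5]]
            clusters.append((members, list(top_terms)))
        return clusters

    docs = ["star formation in distant galaxies", "film star wins award",
            "constellations visible in winter", "tv star in new sitcom",
            "stellar phenomena and supernovae", "sports star signs contract"]
    themes = scatter(docs, k=3)
    # "Gather": the user selects interesting clusters, and just those
    # documents are re-clustered into new, finer-grained themes.
    chosen = themes[0][0] + themes[1][0]
    finer = scatter([docs[i] for i in chosen], k=2)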
S/G Example: query on "star"
Encyclopedia text
– 8 symbols
– 68 film, tv (p)
– 97 astrophysics
– 67 astronomy (p)
– 10 flora/fauna
– 14 sports
– 47 film, tv
– 7 music
– 12 stellar phenomena
– 49 galaxies, stars
– 29 constellations
– 7 miscellaneous
Clustering and re-clustering is entirely automated

Using Clustering in Document Ranking
Cluster the entire collection
Find the cluster centroid that best matches the query
This has been explored extensively
– it is expensive
– it doesn't work well

Two Queries: Two Clusterings
Query: AUTO, CAR, ELECTRIC
– 8 control drive accident …
– 25 battery california technology …
– 48 import j. rate honda toyota …
– 16 export international unit japan
– 3 service employee automatic …
Query: AUTO, CAR, SAFETY
– 6 control inventory integrate …
– 10 investigation washington …
– 12 study fuel death bag air …
– 61 sale domestic truck import …
– 11 japan export defect unite …
The main differences are the clusters that are central to the query

Another use of clustering
Use clustering to map the entire huge multi-dimensional document space into a huge number of small clusters
"Project" these onto a 2D graphical representation
– Group by doc: SPIRE / Kohonen maps
– Group by words: Galaxy of News / HotSauce / Semio
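To make the "group by doc" idea concrete, here is a toy self-organizing (Kohonen) map that pulls TF-IDF document vectors onto a small 2D grid, so that similar documents land in nearby cells. This is a minimal sketch for illustration only, not the SPIRE system or the Chen et al. map; the documents, grid size, and training parameters are made up:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def train_som(X, grid=(3, 3), iters=500, lr=0.5, sigma=1.5, seed=0):
        """Fit a tiny Kohonen map: weights[i, j] is the prototype of grid cell (i, j)."""
        rng = np.random.default_rng(seed)
        rows, cols = grid
        weights = rng.random((rows, cols, X.shape[1]))
        coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
        for t in range(iters):
            x = X[rng.integers(len(X))]
            # Best-matching unit: the cell whose prototype is closest to x
            bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)), grid)
            # Pull the BMU and its grid neighbours toward x, with decaying rate and radius
            decay = np.exp(-t / iters)
            dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-dist2 / (2 * (sigma * decay) ** 2))
            weights += (lr * decay) * h[..., None] * (x - weights)
        return weights

    docs = ["star formation in galaxies", "stellar phenomena and supernovae",
            "film star wins award", "tv star in new sitcom"]
    X = TfidfVectorizer().fit_transform(docs).toarray()
    W = train_som(X)
    # Each document is assigned to one grid cell; astronomy and entertainment
    # documents should end up in different regions of the map.
    for i, x in enumerate(X):
        cell = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)), (3, 3))
        print(docs[i], "->", cell)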
[Figure: clustering a multi-dimensional document space (image from Wise et al. 95)]
[Figure: Kohonen feature map of a text collection (from Chen et al., JASIS 49(7))]

Study of Kohonen Feature Maps
H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7)

Comparison: Kohonen Map and Yahoo
Task:
– "Window shop" for an interesting home page
– Repeat with the other interface
Results:
– Starting with the map, could repeat in Yahoo (8/11)
– Starting with Yahoo, unable to repeat in the map (2/14)

Study (cont.)
Participants liked:
– Correspondence of region size to # of documents
– Overview (but also wanted zoom)
– Ease of jumping from one topic to another
– Multiple routes to topics
– Use of category and subcategory labels

Study (cont.)
Participants wanted:
– hierarchical organization
– other ordering of concepts (alphabetical)
– integration of browsing and search
– correspondence of color to meaning
– more meaningful labels
– labels at the same level of abstraction
– fit more labels in the given space
– combined keyword and category search
– multiple category assignment (sports + entertainment)

Visualization of Clusters
Huge 2D maps may be an inappropriate focus for information retrieval
» Can't see what documents are about
» Documents are forced into one position in semantic space
» The space is difficult to use for IR purposes
» Hard to view titles
Perhaps more suited for pattern discovery
» problem: often only one view on the space

Summary: Clustering
Advantages:
– Get an overview of the main themes
– Domain independent
Disadvantages:
– Many of the ways documents could group together are not shown
– Not always easy to understand what the clusters mean
– Different levels of granularity

Next Time
Interfaces for Query Specification