Information Retrieval: Handout #1
January 6, 2003
(C) 2003, The University of Michigan

Course Information
• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: TBA
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall

Introduction

Demos
• Google
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG

Syllabus (Part I): CLASSIC IR
• Week 1: The concept of information need; IR models; vector models; Boolean models
• Week 2: Retrieval evaluation; precision and recall; F-measure; reference collections; the TREC conferences
• Week 3: Queries and documents; query languages; natural language querying; relevance feedback
• Week 4: Indexing and searching; inverted indexes
• Week 5: XML retrieval
• Week 6: Language modeling approaches

Syllabus (Part II): WEB-BASED IR
• Week 7: Crawling the Web; hyperlink analysis; measuring the Web
• Week 8: Similarity and clustering; bottom-up and top-down paradigms
• Week 9: Social network analysis for IR; hubs and authorities; PageRank and HITS
• Week 10: Focused crawling; resource discovery; discovering communities
• Week 11: Question answering
• Week 12: Additional topics, e.g., relevance transfer
• Week 13: Project presentations

Readings

BOOKS
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. http://www.sims.berkeley.edu/~hearst/irbook/
• Soumen Chakrabarti, Mining the Web, Morgan Kaufmann, 2002. http://www.cse.iitb.ac.in/~soumen/

PAPERS
• Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
• Barabási and Albert, "Emergence of scaling in random networks", Science (286), 509-512, 1999
• Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
• Davison, "Topical locality on the Web", SIGIR 2000
• Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
• Albert, Jeong, and Barabási, "Diameter of the World-Wide Web", Nature (401), 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
• Haveliwala, "Topic-sensitive PageRank", WWW 2002
• Lawrence and Giles, "Accessibility of information on the Web", Nature (400), 107-109, 1999
• Lawrence and Giles, "Searching the World-Wide Web", Science (280), 98-100, 1998
• Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
• Menczer, "Growing and Navigating the Small World Web by Local Content", Proc. Natl. Acad. Sci. USA 99(22), 2002
• Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
• Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
• Radev et al., "Content Diffusion on the Web Graph"

CASE STUDIES (IR SYSTEMS)
• Lemur, MG, Google, AskJeeves, NSIR

Assignments
Homeworks: The course will have three homework assignments in the form of problem sets. Each problem set will include essay-type questions, questions designed to show understanding of specific concepts, and hands-on exercises involving existing IR engines.
Project: The final course project can be done in three different formats: (1) a programming project implementing a challenging and novel information retrieval application, (2) an extensive survey-style research paper providing an exhaustive look at an area of IR, or (3) a SIGIR-style experimental IR paper.

Grading
• Three HW assignments (30%)
• Project (30%)
• Final (40%)

Topics
• IR systems
• Evaluation methods
• Indexing, search, and retrieval

Need for IR
• Advent of the WWW: more than 3 billion documents indexed on Google
• How much information? http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• The user's information need

Some definitions of Information Retrieval (IR)
Salton (1989): "Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests."
Kowalski (1997): "An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects."

Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, ...)
• Question answering systems (AskJeeves, NSIR, AnswerBus): search in (restricted) natural language

Types of queries (AltaVista)
Including or excluding words: to make sure that a specific word is always included in your search topic, place the plus (+) symbol before the keyword in the search box; to make sure that a specific word is always excluded, place a minus (-) sign before the keyword. Example: to find recipes for cookies with oatmeal but without raisins, try recipe cookie +oatmeal -raisin.
Expanding a search with wildcards (*): by typing an * at the end of a keyword, you can search for the word with multiple endings. Example: try wish* to find wish, wishes, wishful, wishbone, and wishy-washy.

Types of queries (Boolean operators)
AND (&): finds only documents containing all of the specified words or phrases. Mary AND lamb finds documents with both the word Mary and the word lamb.
OR (|): finds documents containing at least one of the specified words or phrases. Mary OR lamb finds documents containing either Mary or lamb; the found documents could contain both, but do not have to.
NOT (!): excludes documents containing the specified word or phrase. Mary AND NOT lamb finds documents with Mary but not containing lamb. NOT cannot stand alone; use it with another operator, like AND.
NEAR (~): finds documents containing both specified words or phrases within 10 words of each other. Mary NEAR lamb would find the nursery rhyme, but likely not religious or Christmas-related documents.
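To make these operator semantics concrete, here is a minimal Python sketch of Boolean and proximity matching over a toy three-document collection. The collection, the helper names, and the word-offset reading of NEAR are illustrative assumptions, not AltaVista's actual implementation.

```python
# Toy collection; documents and helper names are invented for this example.
docs = {
    1: "mary had a little lamb its fleece was white as snow",
    2: "mary queen of scots",
    3: "the lamb of god",
}

def positions(doc_id, term):
    """Word offsets at which term occurs in the document."""
    return [i for i, w in enumerate(docs[doc_id].split()) if w == term]

def matching(term):
    """Set of document ids containing the term."""
    return {d for d in docs if positions(d, term)}

def near(t1, t2, k=10):
    """Documents where t1 and t2 occur within k words of each other."""
    return {d for d in docs
            if any(abs(i - j) <= k
                   for i in positions(d, t1) for j in positions(d, t2))}

print(matching("mary") & matching("lamb"))   # AND:     {1}
print(matching("mary") | matching("lamb"))   # OR:      {1, 2, 3}
print(matching("mary") - matching("lamb"))   # AND NOT: {2}
print(near("mary", "lamb"))                  # NEAR:    {1}
```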
Mappings and abstractions
• Reality → Data
• Information need → Query
[From Korfhage's book]

Documents
• Not just printed paper
• Collections vs. documents
• Data structures: representations
• Document surrogates: keywords, summaries
• Encoding: ASCII, Unicode, etc.

Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface

Sample queries (from Excite; spelling and phrasing as submitted by users)
• In what year did baseball become an offical sport?
• play station codes . com
• birth control and depression
• government "WorkAbility I"+conference
• kitchen appliances
• where can I find a chines rosewood tiger electronics
• 58 Plymouth Fury
• How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
• emeril Lagasse
• Hubble
• M.S Subalaksmi
• running

Size matters
• Typical document surrogate: 200 to 2000 bytes
• Book: up to 3 MB of data
• Stemming: computer, computational, computing

Key terms used in IR
• QUERY: a representation of what the user is looking for; can be a list of words or a phrase
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: a word or concept that appears in a document or a query

Other important terms
• Classification
• Cluster
• Similarity
• Information Extraction
• Term Frequency
• Inverse Document Frequency
• Precision
• Recall
• Inverted File
• Query Expansion
• Relevance
• Relevance Feedback
• Stemming
• Stopword
• Vector Space Model
• Weighting
• TREC/TIPSTER/MUC

Query structures
• Query viewed as a document?
  – Length
  – Repetitions
  – Syntactic differences
• Types of matches:
  – Exact
  – Range
  – Approximate

Additional references on IR
• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries

Related courses elsewhere
• Berkeley (Marti Hearst and Ray Larson): http://www.sims.berkeley.edu/courses/is202/f00/
• Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze): http://www.stanford.edu/class/cs276a/
• Cornell (Jon Kleinberg): http://www.cs.cornell.edu/Courses/cs685/2002fa/
• CMU (Yiming Yang and Jamie Callan): http://la.lti.cs.cmu.edu/classes/11-741/

Readings for weeks 1-3
MIR (Modern Information Retrieval):
• Week 1: Chapter 1 "Introduction"; Chapter 2 "Modeling"; Chapter 3 "Evaluation"
• Week 2: Chapter 4 "Query languages"; Chapter 5 "Query operations"
• Week 3: Chapter 6 "Text and multimedia languages"; Chapter 7 "Text operations"; Chapter 8 "Indexing and searching"

IR models

Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy
• Latent semantic indexing

Major IR tasks
• Ad hoc retrieval
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval

Venn diagrams
[Figure: Venn diagram of two documents D1 and D2, with regions labeled w, x, y, z]

Boolean model
[Figure: Venn diagram of two term sets A and B]

Boolean queries
restaurants AND (Mideastern OR vegetarian) AND inexpensive
• What types of documents are returned?
• Stemming and thesaurus expansion
• Inclusive vs. exclusive OR
• Confusing uses of AND and OR: dinner AND sports AND symphony
• Quorum matching: 4 OF (Pentium, printer, cache, PC, monitor, computer, personal)

Boolean queries (continued)
• Weighting: (Beethoven AND sonatas)
• Precedence: coffee AND croissant OR muffin; raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and disjunctive normal forms
• Full CNF and DNF

Transformations
• De Morgan's Laws:
  NOT (A AND B) = (NOT A) OR (NOT B)
  NOT (A OR B) = (NOT A) AND (NOT B)
• CNF or DNF?
  – Reference librarians prefer CNF. Why?

Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses

Exercise
• D1 = "computer information retrieval"
• D2 = "computer retrieval"
• D3 = "information"
• D4 = "computer information"
• Q1 = "information retrieval"
• Q2 = "information ¬computer"

Exercise
Each of the 16 regions below corresponds to one combination of the four authors (binary numbering: Swift = 1, Shakespeare = 2, Milton = 4, Chaucer = 8):
0:  (none)
1:  Swift
2:  Shakespeare
3:  Shakespeare, Swift
4:  Milton
5:  Milton, Swift
6:  Milton, Shakespeare
7:  Milton, Shakespeare, Swift
8:  Chaucer
9:  Chaucer, Swift
10: Chaucer, Shakespeare
11: Chaucer, Shakespeare, Swift
12: Chaucer, Milton
13: Chaucer, Milton, Swift
14: Chaucer, Milton, Shakespeare
15: Chaucer, Milton, Shakespeare, Swift
Which regions satisfy ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))? A worked sketch follows.
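Here is a worked Python sketch of the two exercises above, modeling each document as a set of terms so that Boolean operators become set operations. The conjunctive reading of Q1 and the helper names are my assumptions.

```python
# Exercise 1: documents as term sets; Boolean queries as set operations.
docs = {
    "D1": {"computer", "information", "retrieval"},
    "D2": {"computer", "retrieval"},
    "D3": {"information"},
    "D4": {"computer", "information"},
}

# Q1 = "information retrieval", read conjunctively: information AND retrieval.
q1 = [d for d, t in docs.items() if {"information", "retrieval"} <= t]
# Q2 = "information ¬computer": information AND NOT computer.
q2 = [d for d, t in docs.items() if "information" in t and "computer" not in t]
print(q1, q2)  # ['D1'] ['D3']

# Exercise 2: enumerate the 16 author regions and test the Boolean query.
authors = ["swift", "shakespeare", "milton", "chaucer"]  # bit values 1, 2, 4, 8
for region in range(16):
    present = {a for bit, a in enumerate(authors) if region >> bit & 1}
    c, m, s, h = ("chaucer" in present, "milton" in present,
                  "swift" in present, "shakespeare" in present)
    if ((c or m) and not s) or ((not c) and (s or h)):
        print(region, sorted(present))  # matches: regions 1-8, 10, 12, 14
```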
Stop lists
• The 250-300 most common words in English account for 50% or more of a given text.
• Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" account for another 10%; the next 12 words, another 10%.
• Moby Dick, Chapter 1: 859 unique words (types) and 2,256 word occurrences (tokens); the top 65 types cover 1,132 tokens (over 50%).
• Token/type ratio: 2256/859 = 2.63
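These statistics are easy to reproduce for any text. Below is a short sketch with a deliberately crude lowercase-and-split tokenizer; the exact figures for Moby Dick depend on the tokenizer used.

```python
from collections import Counter

def stoplist_stats(text, top_n):
    """Type/token counts and coverage of the top_n most frequent types."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(top_n))
    print(f"types={len(counts)}  tokens={len(tokens)}  "
          f"token/type={len(tokens) / len(counts):.2f}  "
          f"top-{top_n} coverage={100 * covered / len(tokens):.1f}%")

# Tiny illustration; feed in the real chapter text to approximate the slide's numbers.
stoplist_stats("the cat sat on the mat and the dog sat on the log", top_n=3)
```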
Vector-based representation
[Figure: documents Doc 1, Doc 2, and Doc 3 plotted as points in a space whose axes are Term 1, Term 2, and Term 3]

Vector queries
• Each document is represented as a vector
• Inefficient representations (bit vectors)
• Dimensional compatibility
[Figure: a document as a vector of weights W1 ... W10 over components C1 ... C10]

The matching process
• Matching is done between a document and a query: topicality
• Document space
• Characteristic function F(d) = {0,1}
• Distance vs. similarity: mapping functions
• Euclidean distance, Manhattan distance, word overlap

Vector-based matching
• The cosine measure:
  sim(D,Q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2} \sqrt{\sum_i q_i^2}}
• Intrinsic vs. extrinsic measures

Exercise
• Compute the cosine measures sim(D1,D2) and sim(D1,D3) for the documents D1 = <1,3>, D2 = <100,300>, and D3 = <3,1>.
• Compute the corresponding Euclidean distances. (A worked sketch follows.)
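A worked sketch of this exercise; it also shows why cosine is attractive as a topical measure: it ignores vector length, while Euclidean distance does not.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(cosine(D1, D2))     # 1.0: D2 = 100 * D1, so the directions coincide
print(cosine(D1, D3))     # 0.6: (1*3 + 3*1) / (sqrt(10) * sqrt(10))
print(euclidean(D1, D2))  # ~313.07: large, despite the identical direction
print(euclidean(D1, D3))  # ~2.83
```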
Matrix representations
• Term-document matrix (m × n)
• Term-term matrix (m × m × n)
• Document-document matrix (n × n)
• Example: 3,000,000 documents (n) with 50,000 terms (m)
• Sparse matrices
• Boolean vs. integer matrices

Zipf's law
Rank × frequency ≈ constant:

Rank  Term  Freq.    Z      |  Rank  Term  Freq.    Z
1     the   69,971   0.070  |  6     in    21,341   0.128
2     of    36,411   0.073  |  7     that  10,595   0.074
3     and   28,852   0.086  |  8     is    10,099   0.081
4     to    26,149   0.104  |  9     was    9,816   0.088
5     a     23,237   0.116  |  10    he     9,543   0.095

Evaluation

Contingency table

                 retrieved    not retrieved
relevant         w            x                n1 = w + x
not relevant     y            z
                 n2 = w + y                    N

Precision and recall
• Recall = w / (w + x)
• Precision = w / (w + y)

Exercise
Go to Google (www.google.com) and search for documents on Tolkien's "Lord of the Rings". Try different ways of phrasing the query, e.g., Tolkien, "JRR Tolkien", +"JRR Tolkien" +"Lord of the Rings", etc. For each query, compute the precision (P) based on the first 10 documents returned. Note: before starting the exercise, have a clear idea of what a relevant document for your query should look like.

[From Salton's book]
n   Doc. no  Relevant?  Recall  Precision
1   588      x          0.2     1.00
2   589      x          0.4     1.00
3   576                 0.4     0.67
4   590      x          0.6     0.75
5   986                 0.6     0.60
6   592      x          0.8     0.67
7   984                 0.8     0.57
8   988                 0.8     0.50
9   578                 0.8     0.44
10  985                 0.8     0.40
11  103                 0.8     0.36
12  591                 0.8     0.33
13  772      x          1.0     0.38
14  990                 1.0     0.36

P/R graph
[Figure: precision (y-axis, 0 to 1.2) plotted against recall (x-axis, 0 to 1.2) for the ranked list above]

Issues
• Standard levels for P&R (0-100%)
• Interpolation
• Average P&R
• Average P at given "document cutoff values"
• F-measure: F = 2 / (1/R + 1/P)
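As a check on the table above, here is a sketch that recomputes the recall and precision columns from the ranked list and adds the F-measure at each rank; the relevant set is read off the x-marked rows.

```python
ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}  # the x-marked documents above

hits = 0
for n, doc in enumerate(ranked, start=1):
    hits += doc in relevant
    r, p = hits / len(relevant), hits / n
    f = 2 / (1 / r + 1 / p) if hits else 0.0  # harmonic mean of R and P
    print(f"{n:2d}  {doc}  R={r:.1f}  P={p:.2f}  F={f:.2f}")
```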
Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-100 GB

Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

Relevant documents for topic 305:
LA031689-0177  FT922-1008     LA090190-0126  LA101190-0218  LA082690-0158
LA112590-0109  FT944-136      LA020590-0119  FT944-5300     LA052190-0048
LA051689-0139  FT944-9371     LA032390-0172  LA042790-0172  LA021790-0136
LA092289-0167  LA111189-0013  LA120189-0179  LA020490-0021  LA122989-0063
LA091389-0119  LA072189-0048  FT944-15615    LA091589-0101  LA021289-0208

Sample TREC document

<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition</P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk</P></SECTION>
<LENGTH><P>586 words</P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS</P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer</P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents.</P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws.</P>
<P>Several Fatalities</P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said.</P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai.</P>
...
</TEXT>
<GRAPHIC><P>Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said.</P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY</P></SUBJECT>
</DOC>

TREC results
• http://trec.nist.gov/presentations/presentations.html
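To connect the TREC topic and its relevant set back to the precision/recall definitions above, here is a closing sketch that scores a hypothetical ranked run for topic 305 against a prefix of the relevant document list shown earlier. The run itself is invented; real system output would come from an engine such as Lemur or MG.

```python
# Score a made-up ranked run for topic 305 against part of its relevant set.
relevant = {"LA031689-0177", "FT922-1008", "LA090190-0126",
            "LA101190-0218", "LA082690-0158"}  # first 5 IDs from the list above

# Hypothetical system output (invented DOCNOs mixed with real relevant ones).
run = ["LA031689-0177", "LA999999-0001", "FT922-1008",
       "LA888888-0002", "LA090190-0126"]

hits = 0
for k, doc in enumerate(run, start=1):
    hits += doc in relevant
    print(f"P@{k} = {hits / k:.2f}   R@{k} = {hits / len(relevant):.2f}")
```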