INFORMATION RETRIEVAL
Yu Hong and Heng Ji
jih@rpi.edu
October 15, 2014

Outline
• Introduction
• IR Approaches and Ranking
• Query Construction
• Document Indexing
• IR Evaluation
• Web Search
• INDRI

Information

Basic Function of Information
• Information = transmission of thought
[Figure: encoding and decoding chain. Thoughts → Words → Writing / Sounds (Speech) on the encoding side, and the reverse on the decoding side; “Telepathy?” labels the direct link between thoughts.]

Information Theory
• Better called “communication theory”
• Developed by Claude Shannon in the 1940s
• Concerned with the transmission of electrical signals over wires
  • How do we send information quickly and reliably?
• Underlies modern electronic communication:
  • Voice and data traffic…
  • Over copper, fiber optic, wireless, etc.
• Famous result: Channel Capacity Theorem
• Formal measure of information in terms of entropy
  • Information = “reduction in surprise”

The Noisy Channel Model
• Information transmission = producing the same message at the destination as was sent at the source
• The message must be encoded for transmission across a medium (called the channel)
• But the channel is noisy and can distort the message
[Figure: Source → message → Transmitter → channel (with noise) → Receiver → message → Destination]

A Synthesis
• Information retrieval as communication over time and space, across a noisy channel
[Figure: the noisy channel diagram relabeled for IR. Sender → message → Encoding (indexing/writing) → storage (with noise) → Decoding (acquisition/reading) → message → Recipient]

What is Information Retrieval (IR)?
• Most people equate IR with web-search
  • highly visible, commercially successful endeavors
  • leverage 3+ decades of academic research
• IR: finding any kind of relevant information
  • web-pages, news events, answers, images, …
  • “relevance” is a key notion

Interesting Examples
• Google image search http://images.google.com/
• Google video search http://video.google.com/
• People Search
  • http://www.intelius.com
• Social Network Search
  • http://arnetminer.org/

[Figure: the same diagram with the IR system in the place of storage. A document corpus and a query string go into the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

The IR Black Box
[Figure: Query and Documents go into a black box; Results come out.]

Inside The IR Black Box
[Figure: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; a Comparison Function compares the query representation with the index and produces Results.]

Building the IR Black Box
• Fetching model
• Comparison model
• Representation Model
• Indexing Model

Building the IR Black Box
• Fetching models
  • Crawling model
  • Gentle Crawling model
• Comparison models
  • Boolean model
  • Vector space model
  • Probabilistic models
  • Language models
  • PageRank
• Representation Models
  • How do we capture the meaning of documents?
  • Is meaning just the sum of all terms?
• Indexing Models
  • How do we actually store all those words?
  • How do we access indexed terms quickly?
Fetching model: Crawling
[Figure: a Fetching Function (crawling) brings web pages from the World Wide Web into the search engine as Documents, which feed the IR black box shown earlier.]

Fetching model: Crawling
• Q1: How many web pages should we fetch?
  • As many as we can. More web pages = richer knowledge = a more intelligent search engine
  • The fetching model enriches the knowledge in the “brain” of the search engine
[Figure: the IR system after fetching: “I know everything now, hahahahaha!”]

Fetching model: Crawling
• Q2: How to fetch the web pages?
  • First, we should know the basic network structure of the web
  • Basic structure: nodes and links (hyperlinks)
  • A crawling program (crawler) visits each node in the web by following hyperlinks.
[Figure: the basic network structure of the World Wide Web, with the IR system attached]

Fetching model: Crawling
• Q2: How to fetch the web pages?
• Q2-1: What are the known nodes?
  • These are the nodes whose addresses the crawler knows
  • The nodes are web pages
  • So the addresses are the URLs (URL: Uniform Resource Locator)
  • Such as: www.yahoo.com, www.sohu.com, www.sina.com, etc.
• Q2-2: What are the unknown nodes?
  • These are the nodes whose addresses the crawler does not know
• The seed nodes are the known ones
  • Before dispatching the crawler, a search engine gives it the addresses of some web pages. These web pages are the earliest known nodes (the so-called seeds)

Fetching model: Crawling
• Q2: How to fetch the web pages?
• Q2-3: How can the crawler find the unknown nodes?
[Figure, shown in several animation steps: the crawler (“I can do this. Believe me.”) starts from one known node (Doc.) surrounded by unknown nodes. After a PARSER processes the page source, the neighbouring nodes change from Unknown to Known (“Good news for me.”).]

Fetching model: Crawling
• Q2: How to fetch the web pages?
• Q2-3: How can the crawler find the unknown nodes?
  • If you introduce a web page to the crawler (let it know the web address), the crawler uses a source-code parser to mine many new web pages from it. Of course, the crawler then knows their addresses.
  • But if you don’t tell the crawler anything, it will be on strike because it can do nothing.
  • That is why we need the seed nodes (seed web pages) to awaken the crawler.
[Figure: the crawler: “Give me some seeds.”]

Fetching model: Crawling
• Q2: How to fetch the web pages?
• To traverse the whole network of the web, the crawler needs some auxiliary equipment:
  • A register with a FIFO (First In, First Out) data structure, such as a QUEUE
  • An Access Control Program (ACP)
  • A Source Code Parser (SCP)
  • Seed nodes
[Figure: the crawler (“I need some equipment.”) with its FIFO register, ACP and SCP]

Fetching model: Crawling
• Q2: How to fetch the web pages?
• Robotic crawling procedure (only five steps)
  • Initialization: push the seed nodes (known web pages) into the empty queue
  • Step 1: Take a node out of the queue (FIFO) and visit it (ACP)
  • Step 2: Steal the necessary information from the source code of the node (SCP)
  • Step 3: Send the stolen text information (title, text body, keywords and language) back to the search engine for storage (ACP)
  • Step 4: Push the newly found nodes into the queue
  • Step 5: Return to Step 1 and repeat
[Figure: the crawler: “I am working now.”]

Fetching model: Crawling
• Q2: How to fetch the web pages?
• Through these steps, the number of known nodes grows continuously
• This is the underlying reason why the crawler can traverse the whole web
[Figure: the register (“I control this.”) holds three seeds followed by a long row of slots for newly found nodes]
• The crawler keeps working until the register is empty
• Once the register is empty, the information of all nodes in the web has been stolen and stored on the server of the search engine.

Fetching model: Crawling
• Problems
• 1) Actually, the crawler cannot traverse the whole web.
  • For example, it can fall into an infinite loop when it enters a partially closed-circle network (a snare) in the web
[Figure: a group of nodes whose links form a closed circle]

Fetching model: Crawling
• Problems
• 2) Crude crawling.
  • A portal website puts a long series of homologous nodes (pages on the same site) into the register. Under the FIFO rule, the crawler then visits these nodes one after another, hitting the server they share again and again. This is crude crawling.
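The five-step procedure above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) instead of real HTTP fetches, and the `seen` set and `max_pages` cap are assumptions added so the sketch terminates on the cyclic graph; they are one simple safeguard, not part of the slide's procedure.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("page A", ["b.example", "c.example"]),
    "b.example": ("page B", ["a.example", "d.example"]),  # links back to A: a cycle
    "c.example": ("page C", ["d.example"]),
    "d.example": ("page D", []),
}

def crawl(seeds, web, max_pages=100):
    """Breadth-first crawl following the slide's steps; returns {url: text}."""
    queue = deque(seeds)       # Initialization: seeds go into the FIFO register
    seen = set(seeds)          # assumption: remember known nodes to avoid loops
    store = {}
    while queue and len(store) < max_pages:
        url = queue.popleft()  # Step 1: take the oldest node (FIFO) and visit it
        if url not in web:     # unreachable page: skip it
            continue
        text, links = web[url]  # Step 2: "parse" the page source
        store[url] = text       # Step 3: send the text back for storage
        for link in links:      # Step 4: push newly found nodes into the queue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store               # Step 5 is the while loop itself

pages = crawl(["a.example"], TOY_WEB)
print(list(pages))  # ['a.example', 'b.example', 'c.example', 'd.example']
```

Starting from a single seed, the queue discovers the other three pages in breadth-first order, which mirrors how the number of known nodes grows from the seeds.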
A class of homologous web pages linked from a portal site, https://www.yahoo.com:
• https://screen.yahoo.com/live/
• https://games.yahoo.com/
• https://mobile.yahoo.com/
• https://groups.yahoo.com/neo
• https://answers.yahoo.com/
• http://finance.yahoo.com/
• https://weather.yahoo.com/
• https://autos.yahoo.com/
• https://shopping.yahoo.com/
• https://www.yahoo.com/health
• https://www.yahoo.com/food
• https://www.yahoo.com/style
[Figure: the network of the web next to the register, whose slots are filled with nodes from the same portal]

Fetching model: Crawling
• Homework
  • 1) How can we overcome the infinite loop caused by a partially closed-circle network in the web?
  • 2) Please find a way to crawl the web like a gentleman (not crudely).
• Please select one of the problems as the topic of your homework. A short paper is required, no more than 500 words, but please include at least your idea and a methodology. The methodology can be described in natural language, as a flow diagram, or as an algorithm.
• Send it to me. Email: tianxianer@gmail.com
• Thanks.
[Figure: the “Inside the IR black box” diagram again; the Index component is marked “Ignore Now” so that the focus is on the Comparison Function.]

A heuristic formula for IR (Boolean model)
• Rank docs by similarity to the query
  • suppose the query is “spiderman film”
• Relevance = # query words in the doc
  • favors documents with both “spiderman” and “film”
  • mathematically:
\[ sim(D,Q) = \sum_{q \in Q} 1_{q \in D} \]
• Logical variations (set-based), with \( O(q,D) = 1_{q \in D} \)
  • Boolean AND (require all words):
\[ AND(D,Q) = \prod_{q \in Q} O(q,D) \]
  • Boolean OR (any of the words):
\[ OR(D,Q) = 1 - \prod_{q \in Q} \bigl(1 - O(q,D)\bigr) \]

Term Frequency (TF)
• Observation:
  • key words tend to be repeated in a document
• Modify our similarity measure:
  • give more weight if word occurs multiple times
\[ sim(D,Q) = \sum_{q \in Q} tf_D(q) \]
• Problem:
  • biased towards long documents
  • spurious occurrences
  • normalize by length:
\[ sim(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \]

Inverse Document Frequency (IDF)
• Observation:
  • rare words carry more meaning: cryogenic, apollo
  • frequent words are linguistic glue: of, the, said, went
• Modify our similarity measure:
  • give more weight to rare words … but don’t be too aggressive (why?)
\[ sim(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \cdot \log \frac{|C|}{df(q)} \]
  • |C| … total number of documents
  • df(q) … total number of documents that contain q

TF normalization
• Observation:
  • D1 = {cryogenic, labs}, D2 = {cryogenic, cryogenic}
  • which document is more relevant?
  • which one is ranked higher? (df(labs) > df(cryogenic))
• Correction:
  • first occurrence more important than a repeat (why?)
  • “squash” the linearity of TF:
\[ tf(q) \rightarrow \frac{tf(q)}{tf(q) + K} \]
[Figure: the squashed value plotted against tf = 1, 2, 3, ...]

State-of-the-art Formula
\[ sim(D,Q) = \sum_{q \in Q} \frac{tf_D(q)}{tf_D(q) + K \cdot |D|} \cdot \log \frac{|C|}{df(q)} \]
• Sum over \( q \in Q \): more query words good
• \( tf_D(q) \) in the numerator: repetitions of query words good
• \( K \cdot |D| \) in the denominator: penalize very long documents
• \( \log(|C| / df(q)) \): common words less important

Strengths and Weaknesses
• Strengths
  • Precise, if you know the right strategies
  • Precise, if you have an idea of what you’re looking for
  • Implementations are fast and efficient
• Weaknesses
  • Users must learn Boolean logic
  • Boolean logic insufficient to capture the richness of language
  • No control over size of result set: either too many hits or none
  • When do you stop reading? All documents in the result set are considered “equally good”
  • What about partial matches? Documents that “don’t quite match” the query may be useful also

Vector-space approach to IR
[Figure: documents made of the words cat, pig and dog plotted as vectors in a space with one axis per word; θ is the angle between two vectors.]
Assumption: Documents that are “close together” in vector space “talk about” the same things
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

Some formulas for Similarity
Here \( a_i \) and \( b_i \) are the weights of term \( t_i \) in D and Q.
[Figure: vectors D and Q in the plane spanned by terms t1 and t2]
• Dot product:
\[ Sim(D,Q) = \sum_i a_i b_i \]
• Cosine:
\[ Sim(D,Q) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \cdot \sqrt{\sum_i b_i^2}} \]
• Dice:
\[ Sim(D,Q) = \frac{2 \sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2} \]
• Jaccard:
\[ Sim(D,Q) = \frac{\sum_i a_i b_i}{\sum_i a_i^2 + \sum_i b_i^2 - \sum_i a_i b_i} \]

An Example
• A document space is defined by three terms:
  • hardware, software, users
  • the vocabulary
• A set of documents is defined as:
  • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
  • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
  • A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is “hardware and software”
  • what documents should be retrieved?

An Example (cont.)
• In Boolean query matching:
  • documents A4, A7 will be retrieved (“AND”)
  • retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
  • q=(1, 1, 0)
  • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
  • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
  • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
  • Document retrieved set (with ranking) =
    • {A4, A7, A1, A2, A5, A6, A8, A9}

Probabilistic model
• Given D, estimate P(R|D) and P(NR|D)
• \( P(R|D) = P(D|R) \cdot P(R) / P(D) \propto P(D|R) \)   (P(D), P(R) constant)
• \( D = \{t_1 = x_1, t_2 = x_2, \dots\} \), where \( x_i = 1 \) if the term is present and \( x_i = 0 \) if it is absent
\[ P(D|R) = \prod_{(t_i = x_i) \in D} P(t_i = x_i | R) = \prod_{t_i} P(t_i = 1 | R)^{x_i} P(t_i = 0 | R)^{(1 - x_i)} = \prod_{t_i} p_i^{x_i} (1 - p_i)^{(1 - x_i)} \]
\[ P(D|NR) = \prod_{t_i} P(t_i = 1 | NR)^{x_i} P(t_i = 0 | NR)^{(1 - x_i)} = \prod_{t_i} q_i^{x_i} (1 - q_i)^{(1 - x_i)} \]

Prob. model (cont’d)
For document ranking:
\[ Odd(D) = \log \frac{P(D|R)}{P(D|NR)} = \log \frac{\prod_{t_i} p_i^{x_i} (1 - p_i)^{(1 - x_i)}}{\prod_{t_i} q_i^{x_i} (1 - q_i)^{(1 - x_i)}} = \sum_{t_i} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_{t_i} \log \frac{1 - p_i}{1 - q_i} \propto \sum_{t_i} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \]

Prob. model (cont’d)
• How to estimate \( p_i \) and \( q_i \)?
• A set of N relevant and irrelevant samples:
\[ p_i = \frac{r_i}{R_i}, \qquad q_i = \frac{n_i - r_i}{N - R_i} \]

                 Rel. doc.      Irrel. doc.              Doc.
with t_i         r_i            n_i - r_i                n_i
without t_i      R_i - r_i      N - R_i - n_i + r_i      N - n_i
total            R_i            N - R_i                  N (samples)

Prob. model (cont’d)
\[ Odd(D) = \sum_{t_i} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \sum_{t_i} x_i \log \frac{r_i (N - R_i - n_i + r_i)}{(R_i - r_i)(n_i - r_i)} \]
• Smoothing (Robertson-Sparck-Jones formula)
\[ Odd(D) = \sum_{t_i} x_i \log \frac{(r_i + 0.5)(N - R_i - n_i + r_i + 0.5)}{(R_i - r_i + 0.5)(n_i - r_i + 0.5)} = \sum_{t_i \in D} w_i \]
• When no sample is available: \( p_i = 0.5 \), \( q_i = (n_i + 0.5)/(N + 0.5) \approx n_i / N \)
• May be implemented as VSM

An Appraisal of Probabilistic Models
• Among the oldest formal models in IR
  • Maron & Kuhns, 1960: Since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities
• Assumptions for getting reasonable approximations of the needed probabilities:
  • Boolean representation of documents/queries/relevance
  • Term independence
  • Out-of-query terms do not affect retrieval
  • Document relevance values are independent

An Appraisal of Probabilistic Models
• The difference between ‘vector space’ and ‘probabilistic’ IR is not that great:
  • In either case you build an information retrieval scheme in the exact same way.
  • Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Language-modeling Approach
• query is a random sample from a “perfect” document
• words are “sampled” independently of each other
• rank documents by the probability of generating the query
[Figure: document D drawn as a bag of 9 tokens, with the four query tokens shown as pictures]
P(query) = P( ) P( ) P( ) P( ) = 4/9 * 2/9 * 4/9 * 3/9

Naive Bayes and LM generative models
• Naive Bayes:
  • We want to classify document d.
  • Classes: geographical regions like China, UK, Kenya.
  • Assume that d was generated by the generative model.
  • Key question: Which of the classes is most likely to have generated the document?
  • Or: for which class do we have the most evidence?
• LM:
  • We want to classify a query q.
  • Each document in the collection is a different class.
  • Assume that q was generated by a generative model.
  • Key question: Which document (=class) is most likely to have generated the query q?
  • Or: for which document (as the source of the query) do we have the most evidence?
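Returning to the smoothed Robertson-Sparck-Jones formula from the probabilistic model slides, the term weight w_i and the document score can be sketched as follows. The function names and the sample counts are illustrative assumptions; only the formula itself comes from the slides.

```python
import math

def rsj_weight(r, R, n, N):
    """Smoothed Robertson-Sparck-Jones weight w_i for one term.

    r: relevant sample docs containing the term (r_i)
    R: relevant sample docs (R_i)
    n: sample docs containing the term (n_i)
    N: all sample docs
    """
    numerator = (r + 0.5) * (N - R - n + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)

def rsj_score(doc_terms, weights):
    """Odd(D): sum of w_i over the weighted terms present in the document."""
    return sum(w for term, w in weights.items() if term in doc_terms)

# Hypothetical counts: 100 sample docs, 10 of them relevant.
weights = {
    "cryogenic": rsj_weight(r=8, R=10, n=20, N=100),  # concentrated in relevant docs
    "said": rsj_weight(r=5, R=10, n=50, N=100),       # spread evenly: weight 0
}
print(rsj_score({"cryogenic", "labs"}, weights) > rsj_score({"said"}, weights))
```

A term that occurs mostly in relevant samples gets a positive weight, while a term that is equally common in relevant and irrelevant samples contributes nothing, which is the behaviour the odds ratio is meant to capture.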
Using language models (LMs) for IR
❶ LM = language model
❷ We view the document as a generative model that generates the query.
❸ What we need to do:
❹ Define the precise generative model we want to use
❺ Estimate parameters (different parameters for each document’s model)
❻ Smooth to avoid zeros
❼ Apply to query and find document most likely to have generated the query
❽ Present most likely document(s) to user
❾ Note that x – y is pretty much what we did in Naive Bayes.

What is a language model?
We can view a finite state automaton as a deterministic language model.
Generates: “I wish I wish I wish I wish . . .”
Cannot generate: “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

A probabilistic language model
This is a one-state probabilistic finite-state automaton (a unigram language model) and the state emission distribution for its one state q1.
STOP is not a word, but a special symbol indicating that the automaton stops.
String: frog said that toad likes frog STOP
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048

A different language model for each document
String: frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048 = 4.8 · 10^-12
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 0.0000000000120 = 12 · 10^-12
P(string|Md1) < P(string|Md2)
Thus, document d2 is “more relevant” to the string “frog said that toad likes frog STOP” than d1 is.

Using language models in IR
• Each document is treated as (the basis for) a language model.
• Given a query q, rank documents based on P(d|q):
\[ P(d|q) = \frac{P(q|d) P(d)}{P(q)} \]
• P(q) is the same for all documents, so ignore
• P(d) is the prior, often treated as the same for all d
  • But we can give a prior to “high-quality” documents, e.g., those with high PageRank.
• P(q|d) is the probability of q given d.
• So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

Where we are
• In the LM approach to IR, we attempt to model the query generation process.
• Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
• That is, we rank according to P(q|d).
• Next: how do we compute P(q|d)?

How to compute P(q|d)
• We will make the same conditional independence assumption as for Naive Bayes.
\[ P(q|M_d) = P(\langle t_1, \dots, t_{|q|} \rangle | M_d) = \prod_{1 \le k \le |q|} P(t_k | M_d) \]
(|q|: length of q; t_k: the token occurring at position k in q)
• This is equivalent to:
\[ P(q|M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t|M_d)^{tf_{t,q}} \]
• tf_{t,q}: term frequency (# occurrences) of t in q
• Multinomial model (omitting constant factor)

Parameter estimation
• Missing piece: Where do the parameters P(t|Md) come from?
• Start with maximum likelihood estimates (as we did for Naive Bayes)
\[ \hat{P}(t|M_d) = \frac{tf_{t,d}}{|d|} \]
(|d|: length of d; tf_{t,d}: # occurrences of t in d)
• As in Naive Bayes, we have a problem with zeros.
• A single t with P(t|Md) = 0 will make \( P(q|M_d) = \prod P(t|M_d) \) zero.
• We would give a single term “veto power”.
• For example, for query [Michael Jackson top hits] a document about “top songs” (but not using the word “hits”) would have P(t|Md) = 0. That’s bad.
• We need to smooth the estimates to avoid zeros.

Smoothing
• Key intuition: A nonoccurring term is possible (even though it didn’t occur), . . .
• . . . but no more likely than would be expected by chance in the collection.
• Notation: Mc: the collection model; cf_t: the number of occurrences of t in the collection; \( T = \sum_t cf_t \): the total number of tokens in the collection.
\[ \hat{P}(t|M_c) = \frac{cf_t}{T} \]
• We will use \( \hat{P}(t|M_c) \) to “smooth” P(t|d) away from zero.

Mixture model
• P(t|d) = λP(t|Md) + (1 - λ)P(t|Mc)
• Mixes the probability from the document with the general collection frequency of the word.
• High value of λ: “conjunctive-like” search, which tends to retrieve documents containing all query words.
• Low value of λ: more disjunctive, suitable for long queries
• Correctly setting λ is very important for good performance.
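The mixture model can be tried out on a tiny collection. The sketch below assumes documents and queries are already tokenized into lists of words; the function name and the cat/dog/pig toy collection are made up for illustration, while the estimates (maximum likelihood for the document model, collection frequency for Mc) follow the slides.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) under the mixture model: product over query tokens of
    lam * P(t|Md) + (1 - lam) * P(t|Mc), with maximum likelihood estimates."""
    doc_tf = Counter(doc)
    coll_cf = Counter(token for d in collection for token in d)
    total_tokens = sum(coll_cf.values())
    prob = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)        # tf_{t,d} / |d|
        p_coll = coll_cf[t] / total_tokens  # cf_t / T
        prob *= lam * p_doc + (1 - lam) * p_coll
    return prob

# Hypothetical toy collection.
d1 = ["cat", "cat", "dog"]
d2 = ["dog", "pig"]
collection = [d1, d2]
query = ["cat", "dog"]

p1 = query_likelihood(query, d1, collection)
p2 = query_likelihood(query, d2, collection)
print(p1 > p2)  # True: d1 ranks above d2
```

Note that d2 does not contain "cat" at all, yet its score is not zero: the collection model supplies a small probability for the missing term, which is exactly what smoothing is for.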
Mixture model: Summary
\[ P(q|d) \propto \prod_{1 \le k \le |q|} \bigl( \lambda P(t_k|M_d) + (1 - \lambda) P(t_k|M_c) \bigr) \]
• What we model: The user has a document in mind and generates the query from this document.
• The equation represents the probability that the document that the user had in mind was in fact this one.

Example
• Collection: d1 and d2
• d1: Jackson was one of the most talented entertainers of all time
• d2: Michael Jackson anointed himself King of Pop
• Query q: Michael Jackson
• Use mixture model with λ = 1/2
• P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
• P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
• Ranking: d2 > d1

Exercise: Compute ranking
• Collection: d1 and d2
• d1: Xerox reports a profit but revenue is down
• d2: Lucene narrows quarter loss but revenue decreases further
• Query q: revenue down
• Use mixture model with λ = 1/2
• P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
• P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
• Ranking: d1 > d2

LMs vs. vector space model (1)
• LMs have some things in common with vector space models.
• Term frequency is directly in the model.
  • But it is not scaled in LMs.
• Probabilities are inherently “length-normalized”.
  • Cosine normalization does something similar for vector space.
• Mixing document and collection frequencies has an effect similar to idf.
  • Terms rare in the general collection, but common in some documents will have a greater influence on the ranking.

LMs vs. vector space model (2)
• LMs vs. vector space model: commonalities
  • Term frequency is directly in the model.
  • Probabilities are inherently “length-normalized”.
  • Mixing document and collection frequencies has an effect similar to idf.
• LMs vs. vector space model: differences
  • LMs: based on probability theory
  • Vector space: based on similarity, a geometric/linear algebra notion
  • Collection frequency vs. document frequency
  • Details of term frequency, length normalization etc.

Language models for IR: Assumptions
• Simplifying assumption: Queries and documents are objects of same type. Not true!
  • There are other LMs for IR that do not make this assumption.
  • The vector space model makes the same assumption.
• Simplifying assumption: Terms are conditionally independent.
  • Again, vector space model (and Naive Bayes) makes the same assumption.
• Cleaner statement of assumptions than vector space
• Thus, better theoretical foundation than vector space
  • … but “pure” LMs perform much worse than “tuned” LMs.

Relevance Using Hyperlinks
• Number of documents relevant to a query can be enormous if only term frequencies are taken into account
• Using term frequencies makes “spamming” easy
  • E.g., a travel agency can add many occurrences of the word “travel” to its page to make its rank very high
• Most of the time people are looking for pages from popular sites
• Idea: use popularity of Web site (e.g., how many people visit it) to rank site pages that match given keywords
• Problem: hard to find actual popularity of site
  • Solution: next slide

Relevance Using Hyperlinks (Cont.)
• Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site
  • Count only one hyperlink from each site (why? - see previous slide)
  • Popularity measure is for site, not for individual page
    • But, most hyperlinks are to root of site
    • Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity
• Refinements
  • When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige
    • Definition is circular
    • Set up and solve system of simultaneous linear equations
  • Above idea is basis of the Google PageRank ranking mechanism

PageRank in Google
[Figure: link graph with pages I1, I2, A and B]
\[ PR(A) = (1 - d) + d \sum_i \frac{PR(I_i)}{C(I_i)} \]
where the \( I_i \) are the pages that link to A and \( C(I_i) \) is the number of outgoing links of \( I_i \).
• Assign a numeric value to each page
• The more a page is referred to by important pages, the more important this page is
• d: damping factor (0.85)
• Many other criteria: e.g. proximity of query words
  • “…information retrieval …” better than “… information … retrieval …”

Relevance Using Hyperlinks (Cont.)
• Connections to social networking theories that ranked prestige of people
  • E.g., the president of the U.S.A. has high prestige since many people know him
  • Someone known by multiple prestigious people has high prestige
• Hub and authority based ranking
  • A hub is a page that stores links to many pages (on a topic)
  • An authority is a page that contains actual information on a topic
  • Each page gets a hub prestige based on prestige of authorities that it points to
  • Each page gets an authority prestige based on prestige of hubs that point to it
  • Again, prestige definitions are cyclic, and can be obtained by solving linear equations
  • Use authority prestige when ranking answers to a query

HITS: Hubs and authorities

HITS update rules
• A: link matrix
• h: vector of hub scores
• a: vector of authority scores
• HITS algorithm:
  • Compute h = Aa
  • Compute a = A^T h
  • Iterate until convergence
  • Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score

Keyword Search
• Simplest notion of relevance is that the query string appears verbatim in the document.
• Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  • “restaurant” vs. “café”
  • “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
  • “bat” (baseball vs. mammal)
  • “Apple” (company vs. fruit)
  • “bit” (unit of data vs.
act of eating)

Query Expansion
• http://www.lemurproject.org/lemur/IndriQueryLanguage.php
• Most errors caused by vocabulary mismatch
  • query: “cars”, document: “automobiles”
  • solution: automatically add highly-related words
• Thesaurus / WordNet lookup:
  • add semantically-related words (synonyms)
  • cannot take context into account:
    • “rail car” vs. “race car” vs. “car and cdr”
• Statistical Expansion:
  • add statistically-related words (co-occurrence)
  • very successful

Indri Query Examples
<parameters><query>#combine(
  #weight( 0.063356 #1(explosion) 0.187417 #1(blast) 0.411817 #1(wounded) 0.101370 #1(injured) 0.161191 #1(death) 0.074849 #1(deaths))
  #weight( 0.311760 #1(Davao Cityinternational airport) 0.311760 #1(Tuesday) 0.103044 #1(DAVAO) 0.195505 #1(Philippines) 0.019817 #1(DXDC) 0.058113 #1(Davao Medical Center)))
</query></parameters>

Synonyms and Homonyms
• Synonyms
  • E.g., document: “motorcycle repair”, query: “motorcycle maintenance”
    • Need to realize that “maintenance” and “repair” are synonyms
  • System can extend query as “motorcycle and (repair or maintenance)”
• Homonyms
  • E.g., “object” has different meanings as noun/verb
  • Can disambiguate meanings (to some extent) from the context
• Extending queries automatically using synonyms can be problematic
  • Need to understand intended meaning in order to infer synonyms
    • Or verify synonyms with user
  • Synonyms may have other meanings as well

Concept-Based Querying
• Approach
  • For each word, determine the concept it represents from context
  • Use one or more ontologies:
    • Hierarchical structure showing relationship between concepts
    • E.g., the ISA relationship that we saw in the E-R model
• This approach can be used to standardize terminology in a specific field
• Ontologies can link multiple languages
• Foundation of the Semantic Web (not covered here)

Indexing of Documents
• An inverted index maps each keyword Ki to a set of documents Si that contain the keyword
  • Documents identified by identifiers
• Inverted index may record
  • Keyword locations within document to allow proximity based ranking
  • Counts of number of occurrences of keyword to compute TF
• and operation: finds documents that contain all of K1, K2, ..., Kn.
  • Intersection S1 ∩ S2 ∩ ..... ∩ Sn
• or operation: documents that contain at least one of K1, K2, …, Kn
  • Union S1 ∪ S2 ∪ ..... ∪ Sn
• Each Si is kept sorted to allow efficient intersection/union by merging
  • “not” can also be efficiently implemented by merging of sorted lists

Indexing of Documents
• Goal = Find the important meanings and create an internal representation
• Factors to consider:
  • Accuracy to represent meanings (semantics)
  • Exhaustiveness (cover all the contents)
  • Facility for computer to manipulate
• What is the best representation of contents?
  • Char. string (char trigrams): not precise enough
  • Word: good coverage, not precise
  • Phrase: poor coverage, more precise
  • Concept: poor coverage, precise
[Figure: String, Word, Phrase and Concept placed along two axes, Coverage (Recall) and Accuracy (Precision)]

Indexer steps
• Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
(Term, Doc #) pairs in order of occurrence:
(I, 1) (did, 1) (enact, 1) (julius, 1) (caesar, 1) (I, 1) (was, 1) (killed, 1) (i', 1) (the, 1) (capitol, 1) (brutus, 1) (killed, 1) (me, 1)
(so, 2) (let, 2) (it, 2) (be, 2) (with, 2) (caesar, 2) (the, 2) (noble, 2) (brutus, 2) (hath, 2) (told, 2) (you, 2) (caesar, 2) (was, 2) (ambitious, 2)
• Multiple term entries in a single document are merged.
• Frequency information is added.
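The indexer steps can be sketched as follows, using the two documents above. The tokenizer is an assumption (the slide does not spell out its normalization): it lowercases every token and strips trailing ";" and ".", so "I" becomes "i" while "i'" stays a separate term.

```python
from collections import defaultdict

DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Assumed normalization: lowercase, strip ';' and '.' from token edges.
    return [token.strip(";.").lower() for token in text.split()]

def build_index(docs):
    """Return {term: {doc_id: term frequency}} built from sorted (term, doc) pairs."""
    # Steps 1 and 2: emit (term, doc_id) pairs, then sort by term and doc_id.
    pairs = sorted(
        (term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)
    )
    # Step 3: merge repeated entries of a term in one document and count them.
    index = defaultdict(dict)
    for term, doc_id in pairs:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

index = build_index(DOCS)
print(index["caesar"])  # {1: 1, 2: 2}
print(index["killed"])  # {1: 2}
```

Each entry of `index` is a postings list with frequencies: "caesar" occurs once in Doc 1 and twice in Doc 2, and "killed" occurs twice in Doc 1.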
(Term, Doc #) pairs after sorting by term:
(ambitious, 2) (be, 2) (brutus, 1) (brutus, 2) (capitol, 1) (caesar, 1) (caesar, 2) (caesar, 2) (did, 1) (enact, 1) (hath, 2) (I, 1) (I, 1) (i', 1) (it, 2) (julius, 1) (killed, 1) (killed, 1) (let, 2) (me, 1) (noble, 2) (so, 2) (the, 1) (the, 2) (told, 2) (you, 2) (was, 1) (was, 2) (with, 2)

After merging, with term frequencies:
Term        Doc #   Term freq
ambitious   2       1
be          2       1
brutus      1       1
brutus      2       1
capitol     1       1
caesar      1       1
caesar      2       2
did         1       1
enact       1       1
hath        2       1
I           1       2
i'          1       1
it          2       1
julius      1       1
killed      1       2
let         2       1
me          1       1
noble       2       1
so          2       1
the         1       1
the         2       1
told        2       1
you         2       1
was         1       1
was         2       1
with        2       1

Stopwords / Stoplist
• function words do not bear useful information for IR: of, in, about, with, I, although, …
• Stoplist: contains stopwords, not to be used as index terms
  • Prepositions
  • Articles
  • Pronouns
  • Some adverbs and adjectives
  • Some frequent words (e.g. document)
• The removal of stopwords usually improves IR effectiveness
• A few “standard” stoplists are commonly used.

Stemming
• Reason:
  • Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
  • Removing some endings of word
  • computer, compute, computes, computing, computed, computation → comput

Lemmatization
• transform to standard form according to syntactic category. E.g.
  • verb + ing → verb
  • noun + s → noun
• Need POS tagging
• More accurate than stemming, but needs more resources
• crucial to choose stemming/lemmatization rules: noise vs. recognition rate
• compromise between precision and recall
  • light/no stemming: -recall +precision
  • severe stemming: +recall -precision

Simple conjunctive query (two terms)
Consider the query: BRUTUS AND CALPURNIA
To find all matching documents using inverted index:
❶ Locate BRUTUS in the dictionary
❷ Retrieve its postings list from the postings file
❸ Locate CALPURNIA in the dictionary
❹ Retrieve its postings list from the postings file
❺ Intersect the two postings lists
❻ Return intersection to user

Intersecting two postings lists
• This is linear in the length of the postings lists.
• Note: This only works if postings lists are sorted.
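The linear-time intersection of two sorted postings lists can be sketched as a two-pointer merge. The document IDs in the example lists are hypothetical; the slides do not give the actual postings for BRUTUS and CALPURNIA.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by ascending document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:    # document contains both terms
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:   # advance the pointer behind the smaller ID
            i += 1
        else:
            j += 1
    return answer

# Hypothetical postings lists.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each step advances at least one pointer, so the loop runs at most len(p1) + len(p2) times. If either list were unsorted, skipping past the smaller ID could miss matches, which is why the postings must be kept sorted.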
98 Does Google use the Boolean model? On Google, the default interpretation of a query [w1 w2 . . .wn] is w1 AND w2 AND . . .AND wn Cases where you get hits that do not contain one of the wi : anchor text page contains variant of wi (morphology, spelling correction, synonym) long queries (n large) boolean expression generates very few hits Simple Boolean vs. Ranking of result set Simple Boolean retrieval returns matching documents in no particular order. Google (and most well designed Boolean engines) rank the result set – they rank good hits (according to some estimator of relevance) higher than bad hits. 99 Outline • Introduction • IR Approaches and Ranking • Query Construction • Document Indexing • IR Evaluation • Web Search • INDRI IR Evaluation • Efficiency: time, space • Effectiveness: • How is a system capable of retrieving relevant documents? • Is a system better than another one? • Metrics often used (together): • Precision = retrieved relevant docs / retrieved docs • Recall = retrieved relevant docs / relevant docs relevant retrieved retrieved relevant IR Evaluation (Cont’) • Information-retrieval systems save space by using index structures that support only approximate retrieval. May result in: • false negative (false drop) - some relevant documents may not be retrieved. • false positive - some irrelevant documents may be retrieved. • For many applications a good index should not permit any false drops, but may permit a few false positives. • Relevant performance metrics: • precision - what percentage of the retrieved documents are relevant to the query. • recall - what percentage of the documents relevant to the query were retrieved. IR Evaluation (Cont’) • Recall vs. 
precision tradeoff:
  • Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
• Measures of retrieval effectiveness:
  • Recall as a function of number of documents fetched, or
  • Precision as a function of recall
    • Equivalently, as a function of number of documents fetched
  • E.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%”
• Problem: which documents are actually relevant, and which are not

General form of precision/recall
[Figure: precision (y-axis, 0 to 1.0) plotted against recall (x-axis, 0 to 1.0)]
• Precision changes w.r.t. recall (it is not a fixed point)
• Systems cannot be compared at a single precision/recall point
• Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)

An illustration of P/R calculation
Assume: 5 relevant docs.
List | Rel? | (Recall, Precision)
Doc1 | Y | (0.2, 1.0)
Doc2 |   | (0.2, 0.5)
Doc3 | Y | (0.4, 0.67)
Doc4 | Y | (0.6, 0.75)
Doc5 |   | (0.6, 0.6)
…
[Figure: the five points plotted with recall on the x-axis and precision on the y-axis]

MAP (Mean Average Precision)
\[ \mathrm{MAP} = \frac{1}{n} \sum_{Q_i} \frac{1}{|R_i|} \sum_{D_j \in R_i} \frac{j}{r_{ij}} \]
• r_ij = rank of the j-th relevant document for Q_i
• |R_i| = number of relevant documents for Q_i
• n = number of test queries
• E.g.
  Rank | Q1 | Q2
  1st rel. doc. | 1 | 4
  2nd rel. doc. | 5 | 8
  3rd rel. doc. | 10 |
\[ \mathrm{MAP} = \frac{1}{2}\left[ \frac{1}{3}\left(\frac{1}{1} + \frac{2}{5} + \frac{3}{10}\right) + \frac{1}{2}\left(\frac{1}{4} + \frac{2}{8}\right) \right] \]

Some other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence = non-retrieved relevant docs / relevant docs
  • Noise = 1 – Precision; Silence = 1 – Recall
• Fallout = retrieved irrelevant docs / irrelevant docs
• Single value measures:
  • F-measure = 2 · P · R / (P + R)
  • Average precision = average at 11 points of recall
  • Precision at n documents (often used for Web IR)
  • Expected search length (number of irrelevant documents to read before obtaining n relevant documents)
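The measures above can be checked with a short sketch. It recomputes the (recall, precision) points of the P/R illustration and the MAP example (relevant documents at ranks 1, 5, 10 for the first query and 4, 8 for the second). The function names and the placeholder IDs `DocX` and `DocY` (standing for the two relevant documents that are never retrieved) are assumptions made for this example; this is not the code of an evaluation tool such as the one used at TREC.

```python
from typing import List, Sequence, Set, Tuple


def pr_points(ranked: Sequence[str], relevant: Set[str]) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each retrieved document."""
    points = []
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


def f_measure(p: float, r: float) -> float:
    """F-measure = 2PR / (P + R), defined as 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def average_precision(relevant_ranks: Sequence[int], num_relevant: int) -> float:
    """Average of j / r_j over the relevant documents of one query.

    relevant_ranks holds the ranks at which relevant documents were
    retrieved; relevant documents never retrieved contribute 0.
    """
    ranks = sorted(relevant_ranks)
    return sum(j / r for j, r in enumerate(ranks, start=1)) / num_relevant


def mean_average_precision(queries: Sequence[Tuple[Sequence[int], int]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


# P/R illustration: 5 relevant docs, of which Doc1, Doc3, Doc4 are retrieved.
relevant = {"Doc1", "Doc3", "Doc4", "DocX", "DocY"}
ranked = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"]
for recall, precision in pr_points(ranked, relevant):
    print(f"recall={recall:.2f} precision={precision:.2f}")

# MAP example: ranks (1, 5, 10) with 3 relevant docs, ranks (4, 8) with 2.
print(round(mean_average_precision([([1, 5, 10], 3), ([4, 8], 2)]), 4))  # 0.4083
```

Passing the number of relevant documents separately from the ranks keeps the two cases apart: a query whose relevant documents were not all retrieved is simply divided by the larger count.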
Interactive system’s evaluation
• Definition: Evaluation = the process of systematically collecting data that informs us about what it is like for a particular user or group of users to use a product/system for a particular task in a certain type of environment.

Problems
• Attitudes:
  • Designers assume that if they and their colleagues can use the system and find it attractive, others will too
  • Features vs. usability or security
  • Executives want the product on the market yesterday
  • Problems “can” be addressed in versions 1.x
  • Consumers accept low levels of usability
    • “I’m so silly”

Two main types of evaluation
• Formative evaluation is done at different stages of development to check that the product meets users’ needs.
  • Part of the user-centered design approach
  • Supports design decisions at various stages
  • May test parts of the system or alternative designs
• Summative evaluation assesses the quality of a finished product.
  • May test the usability or the output quality
  • May compare competing systems

What to evaluate
Iterative design & evaluation is a continuous process that examines:
• Early ideas for conceptual model
• Early prototypes of the new system
• Later, more complete prototypes
Designers need to check that they understand users’ requirements and that the design assumptions hold.

Four evaluation paradigms
• ‘quick and dirty’
• usability testing
• field studies
• predictive evaluation

Quick and dirty
• ‘Quick & dirty’ evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in line with users’ needs and are liked.
• Quick & dirty evaluations can be done at any time.
• The emphasis is on fast input to the design process rather than carefully documented findings.

Usability testing
• Usability testing involves recording typical users’ performance on typical tasks in controlled settings. Field observations may also be used.
• As the users perform these tasks they are watched & recorded on video & their key presses are logged.
• This data is used to calculate performance times, identify errors & help explain why the users did what they did.
• User satisfaction questionnaires & interviews are used to elicit users’ opinions.

Usability testing
• It is very time consuming to conduct and analyze
  • Explain the system, do some training
  • Explain the task, do a mock task
  • Questionnaires before and after the test & after each task
• A pilot test is usually needed
• Insufficient number of subjects for ‘proper’ statistical analysis
• In laboratory conditions, subjects do not behave exactly as they would in a normal environment

Field studies
• Field studies are done in natural settings
• The aim is to understand what users do naturally and how technology impacts them.
• In product design, field studies can be used to:
  - identify opportunities for new technology
  - determine design requirements
  - decide how best to introduce new technology
  - evaluate technology in use

Predictive evaluation
• Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems.
• Another approach involves theoretically based models.
• A key feature of predictive evaluation is that users need not be present
• Relatively quick & inexpensive

The TREC experiments
• Once per year
• A set of documents and queries is distributed to the participants (the standard answers are unknown) (April)
• Participants work (very hard) to construct and fine-tune their systems, and submit the answers (1000/query) at the deadline (July)
• NIST people manually evaluate the answers and provide correct answers (and classification of IR systems) (July – August)
• TREC conference (November)

TREC evaluation methodology
• Known document collection (>100K) and query set (50)
• Submission of 1000 documents for each query by each participant
• Merge the first 100 documents of each participant -> global pool
• Human relevance judgment of the global pool
• The other documents are assumed to be irrelevant
• Evaluation of each system (with 1000 answers)
• Partial relevance judgments
  • But stable for system ranking

Tracks (tasks)
• Ad Hoc track: given document collection, different topics
• Routing (filtering): stable interests (user profile), incoming document flow
• CLIR: Ad Hoc, but with queries in a different language
• Web: a large set of Web pages
• Question-Answering: When did Nixon visit China?
• Interactive: put users into action with system
• Spoken document retrieval
• Image and video retrieval
• Information tracking: new topic / follow up

CLEF and NTCIR
• CLEF = Cross-Language Evaluation Forum
  • for European languages
  • organized by Europeans
  • Once per year (March – Oct.)
• NTCIR:
  • Organized by NII (Japan)
  • For Asian languages
  • 1.5-year cycle

Impact of TREC
• Provide large collections for further experiments
• Compare different systems/techniques on realistic data
• Develop new methodology for system evaluation
• Similar experiments are organized in other areas (NLP, Machine translation, Summarization, …)

IR on the Web
• No stable document collection (spider, crawler)
• Invalid documents, duplication, etc.
• Huge number of documents (partial collection)
• Multimedia documents
• Great variation of document quality
• Multilingual problem
• …

Web Search
• Application of IR to HTML documents on the World Wide Web.
• Differences:
  • Must assemble document corpus by spidering the web.
  • Can exploit the structural layout information in HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.

Web Search System
[Figure: the Web Spider collects pages from the Web into the document corpus; the IR System takes the corpus and a query string and returns ranked documents (1. Page1, 2. Page2, 3. Page3, …)]

Challenges
• Scale, distribution of documents
• Controversy over the unit of indexing
  • What is a document? (hypertext)
  • What does the user expect to be retrieved?
• High heterogeneity
  • Document structure, size, quality, level of abstraction / specialization
  • User search or domain expertise, expectations
• Retrieval strategies
  • What do people want?
• Evaluation

Web documents / data
• No traditional collection
  • Huge
    • Time and space needed to crawl and index
    • IRSs cannot store copies of documents
  • Dynamic, volatile, anarchic, uncontrolled
  • Homogeneous sub-collections
• Structure
  • In documents (un-/semi-/fully-structured)
  • Between docs: network of inter-connected nodes
  • Hyper-links - conceptual vs.
physical documents

Web documents / data
• Mark-up
  • HTML – look & feel
  • XML – structure, semantics
  • Dublin Core Metadata
  • Can webpage authors be trusted to correctly mark up / index their pages?
• Multi-lingual documents
• Multi-media

Theoretical models for indexing / searching
• Content-based weighting
  • As in traditional IRS, but trying to incorporate
    • hyperlinks
    • the dynamic nature of the Web (page validity, page caching)
• Link-based weighting
  • Quality of webpages
    • Hubs & authorities
    • Bookmarked pages
  • Iterative estimation of quality

Architecture
• Centralized
  • Main server contains the index, built by an indexer, searched by a query engine
  • Advantage: control, easy update
  • Disadvantage: system requirements (memory, disk, safety/recovery)
• Distributed
  • Brokers & gatherers
  • Advantage: flexibility, load balancing, redundancy
  • Disadvantage: software complexity, update

User variability
• Power and flexibility for expert users vs. intuitiveness and ease of use for novice users
• Multi-modal user interface
  • Distinguish between experts and beginners, offer distinct interfaces (functionality)
  • Advantage: can make assumptions on users
  • Disadvantage: habit formation, cognitive shift
• Uni-modal interface
  • Make essential functionality obvious
  • Make advanced functionality accessible

Search strategies
• Web directories
• Query-based searching
• Link-based browsing (provided by the browser, not the IRS)
• “More like this”
• Known site (bookmarking)
• A combination of the above

Support for Relevance Feedback
• RF can improve search effectiveness … but is rarely used
• Voluntary vs. forced feedback
• At document vs. word level
• “Magic” vs.
control

Some techniques to improve IR effectiveness
• Interaction with user (relevance feedback)
  - Keywords only cover part of the contents
  - User can help by indicating relevant/irrelevant documents
• The use of relevance feedback
  • To improve the query expression:
    $Q_{new} = \alpha\, Q_{old} + \beta\, Rel_d - \gamma\, NRel_d$
    where Rel_d = centroid of relevant documents
    NRel_d = centroid of non-relevant documents

Modified relevance feedback
• Users usually do not cooperate (e.g. AltaVista in early years)
• Pseudo-relevance feedback (Blind RF)
  • Using the top-ranked documents as if they are relevant:
    • Select m terms from n top-ranked documents
  • One can usually obtain about 10% improvement

Term clustering
• Based on ‘similarity’ between terms
  • Collocation in documents, paragraphs, sentences
• Based on document clustering
  • Terms specific for bottom-level document clusters are assumed to represent a topic
• Use
  • Thesauri
  • Query expansion

User modelling
• Build a model / profile of the user by recording
  • the ‘context’
  • topics of interest
  • preferences
  based on interpreting his/her actions:
  • Implicit or explicit relevance feedback
  • Recommendations from ‘peers’
• Customization of the environment

Personalised systems
• Information filtering
  • Ex: in a TV guide only show programs of interest
• Use user model to disambiguate queries
  • Query expansion
  • Update the model continuously
• Customize the functionality and the look-and-feel of the system
  • Ex: skins; remember the levels of the user interface

Autonomous agents
• Purpose: find relevant information on behalf of the user
• Input: the user profile
• Output: pull vs.
push
• Positive aspects:
  • Can work in the background, implicitly
  • Can update the master with new, relevant info
• Negative aspects: control
• Integration with collaborative systems

Document Representation
<html>
<head> <title>Department Descriptions</title> </head>
<body>
The following list describes …
<h1>Agriculture</h1> …
<h1>Chemistry</h1> …
<h1>Computer Science</h1> …
<h1>Electrical Engineering</h1> …
…
<h1>Zoology</h1>
</body>
</html>

<title> context: <title>department descriptions</title>
<title> extents: 1. department descriptions
<body> context: <body>the following list describes … <h1>agriculture</h1> … </body>
<body> extents: 1. the following list describes <h1>agriculture</h1> …
<h1> context: <h1>agriculture</h1> <h1>chemistry</h1> … <h1>zoology</h1>
<h1> extents: 1. agriculture 2. chemistry … 36. zoology
. . .

Model
• Based on original inference network retrieval framework [Turtle and Croft ’91]
• Casts retrieval as inference in simple graphical model
• Extensions made to original model
  • Incorporation of probabilities based on language modeling rather than tf.idf
  • Multiple language models allowed in the network (one per indexed context)

Model
[Figure: inference network. Document node D (observed); model hyperparameters α,β for the title, body and h1 contexts (observed); context language models θtitle, θbody, θh1; representation nodes r1 … rN (terms, phrases, etc.) under each context; belief nodes q1, q2 (#combine, #not, #max); information need node I (a belief node)]

Model
P( r | θ )
• Probability of observing a term, phrase, or “concept” given a context language model
  • ri nodes are binary
• Assume r ~ Bernoulli( θ )
  • “Model B” [Metzler, Lavrenko, Croft ’04]
• Nearly any model may be used here
  • tf.idf-based estimates (INQUERY)
  • Mixture models

Model
P( θ | α, β, D )
• Prior over context language model determined by α, β
• Assume P( θ | α, β ) ~ Beta( α, β )
  • Bernoulli’s conjugate prior
  • αw = μP( w | C ) + 1
  • βw = μP( ¬ w | C ) + 1
  • μ is a free parameter
\[ P(r_i \mid \alpha, \beta, D) = \int_{\theta} P(r_i \mid \theta)\, P(\theta \mid \alpha, \beta, D)\, d\theta = \frac{tf_{w,D} + \mu P(w \mid C)}{|D| + \mu} \]

Model
P( q | r ) and P( I | r )
• Belief nodes are created dynamically based on query
• Belief node CPTs are derived from standard link matrices
  • Combine evidence from parents in various ways
  • Allows fast inference by making marginalization computationally tractable
• Information need node is simply a belief node that combines all network evidence into a single value
• Documents are ranked according to: P( I | α, β, D )

Example: #AND
[Figure: nodes A and B are the parents of belief node Q]
A | B | P(Q=true|a,b)
false | false | 0
false | true | 0
true | false | 0
true | true | 1
\[
\begin{aligned}
P_{\#and}(Q = true) &= \sum_{a,b} P(Q = true \mid A = a, B = b)\, P(A = a)\, P(B = b) \\
&= P(t \mid f, f)(1 - p_A)(1 - p_B) + P(t \mid f, t)(1 - p_A)\, p_B + P(t \mid t, f)\, p_A (1 - p_B) + P(t \mid t, t)\, p_A p_B \\
&= 0 \cdot (1 - p_A)(1 - p_B) + 0 \cdot (1 - p_A)\, p_B + 0 \cdot p_A (1 - p_B) + 1 \cdot p_A p_B \\
&= p_A p_B
\end{aligned}
\]

Query Language
• Extension of INQUERY query language
  • Structured query language
  • Term weighting
  • Ordered / unordered windows
  • Synonyms
• Additional features
  • Language modeling motivated constructs
  • Added flexibility to deal with fields via contexts
  • Generalization of passage retrieval (extent retrieval)
• Robust query language that handles many current language modeling tasks

Terms
Type | Example | Matches
Stemmed term | dog | All occurrences of dog (and its stems)
Surface term | "dogs" | Exact occurrences of dogs (without stemming)
Term group (synonym group) | <"dogs" canine> | All occurrences of dogs (without stemming) or canine (and its stems)
Extent match | #any:person | Any occurrence of an extent of type person

Date / Numeric Fields
Type | Example | Matches
#less | #less(URLDEPTH 3) | Any URLDEPTH numeric field extent with value less than 3
#greater | #greater(READINGLEVEL 3) | Any READINGLEVEL numeric
field extent with value greater than 3
#between | #between(SENTIMENT 0 2) | Any SENTIMENT numeric field extent with value between 0 and 2
#equals | #equals(VERSION 5) | Any VERSION numeric field extent with value equal to 5
#date:before | #date:before(1 Jan 1900) | Any DATE field before 1900
#date:after | #date:after(June 1 2004) | Any DATE field after June 1, 2004
#date:between | #date:between(1 Jun 2000 1 Sep 2001) | Any DATE field between 1 Jun 2000 and 1 Sep 2001

Proximity
Type | Example | Matches
#odN(e1 … em) or #N(e1 … em) | #od5(saddam hussein) or #5(saddam hussein) | All occurrences of saddam and hussein appearing ordered within 5 words of each other
#uwN(e1 … em) | #uw5(information retrieval) | All occurrences of information and retrieval that appear in any order within a window of 5 words
#uw(e1 … em) | #uw(john kerry) | All occurrences of john and kerry that appear in any order within any sized window
#phrase(e1 … em) | #phrase(#1(willy wonka) #uw3(chocolate factory)) | System dependent implementation (defaults to #odm)

Context Restriction
Example | Matches
yahoo.title | All occurrences of yahoo appearing in the title context
yahoo.title,paragraph | All occurrences of yahoo appearing in both a title and a paragraph context (may not be possible)
<yahoo.title yahoo.paragraph> | All occurrences of yahoo appearing in either a title context or a paragraph context
#5(apple ipod).title | All matching windows contained within a title context

Context Evaluation
Example | Evaluated
google.(title) | The term google evaluated using the title context as the document
google.(title, paragraph) | The term google evaluated using the concatenation of the title and paragraph contexts as the document
google.figure(paragraph) | The term google restricted to figure tags within the paragraph context.
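To make context evaluation concrete: evaluating google.(title) means the text of the title context plays the role of the document D in the smoothed estimate (tf_{w,D} + μ P(w | C)) / (|D| + μ) from the Model slides. The sketch below computes that estimate for a term against two different contexts. It is only an illustration under stated assumptions: `smoothed_belief` is a hypothetical helper and not part of the INDRI API, tokens are plain lowercase strings, P(w | C) is estimated from one flat token list standing in for the whole collection, and the value of μ is arbitrary.

```python
from typing import Sequence


def smoothed_belief(term: str, context_tokens: Sequence[str],
                    collection_tokens: Sequence[str], mu: float) -> float:
    """Estimate (tf + mu * P(w|C)) / (|D| + mu) for one term.

    context_tokens is the text of the context being treated as the
    document D (e.g. the title extent); collection_tokens stands in
    for the collection C. Both are assumptions of this sketch.
    """
    tf = sum(1 for t in context_tokens if t == term)
    p_w_given_c = sum(1 for t in collection_tokens if t == term) / len(collection_tokens)
    return (tf + mu * p_w_given_c) / (len(context_tokens) + mu)


# Toy data, invented for illustration.
title = ["google", "search"]
paragraph = ["the", "search", "engine", "returns", "pages"]
collection = title + paragraph + ["web", "pages", "index"]  # 10 tokens

# "google.(title)": score the term against the title context only.
print(smoothed_belief("google", title, collection, mu=10.0))
# Same term against the paragraph context: tf is 0, so only the
# smoothing mass mu * P(w|C) contributes.
print(smoothed_belief("google", paragraph, collection, mu=10.0))
```

The term never gets probability zero in a context where it is absent, which is the practical reason for the Beta prior: a larger μ pulls every context's estimate toward the collection frequency.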
Belief Operators
INQUERY | INDRI
#sum / #and | #combine
#wsum* | #weight
#or | #or
#not | #not
#max | #max
* #wsum is still available in INDRI, but should be used with discretion

Extent / Passage Retrieval
Example | Evaluated
#combine[section](dog canine) | Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine) | Same as previous, except it is evaluated for each extent associated with either the title context or the section context
#combine[passage100:50](white house) | Evaluates #combine(white house) on 100-word passages, treating every 50 words as the beginning of a new passage
#sum(#sum[section](dog)) | Returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog)) | Same as previous, except it returns the maximum score

Extent Retrieval Example
<document>
<section><head>Introduction</head> Statistical language modeling allows formal methods to be applied to information retrieval. ... </section>
<section><head>Multinomial Model</head> Here we provide a quick review of multinomial language models. ... </section>
<section><head>Multiple-Bernoulli Model</head> We now examine two formal methods for statistically modeling documents and queries based on the multiple-Bernoulli distribution. ... </section>
…
</document>

Query: #combine[section]( dirichlet smoothing )
1. Treat each section extent as a “document”
2. Score each “document” according to #combine( … ) (scores shown for the three sections above: 0.15, 0.50, 0.05)
3. Return a ranked list of extents.
SCORE | DOCID | BEGIN | END
0.50 | IR-352 | 51 | 205
0.35 | IR-352 | 405 | 548
0.15 | IR-352 | 0 | 50
… | … | … | …

Other Operators
Type | Example | Description
Filter require | #filreq( #less(READINGLEVEL 10) ben franklin ) | Requires that documents have a reading level less than 10. Documents then ranked by query ben franklin
Filter reject | #filrej( #greater(URLDEPTH 1) microsoft ) | Rejects (does not score) documents with a URL depth greater than 1.
Documents then ranked by query microsoft
Prior | #prior( DATE ) | Applies the document prior specified for the DATE field

System Overview
• Indexing
  • Inverted lists for terms and fields
  • Repository consists of inverted lists, parsed documents, and document vectors
• Query processing
  • Local or distributed
  • Computing local / global statistics
• Features