Information Retrieval
PengBo
Oct 28, 2010

Outline of this lecture
- Introduction to Information Retrieval
- Index techniques
- Scoring and ranking
- Performance evaluation

Basic Index Techniques

Document collection: site:pkunews.pku.edu.cn. Baidu reports 12,800 pages; Google reports 6,820 pages.

User Information Need
Within this news site, find articles that talk about the culture of China and Japan, and do not talk about students studying abroad.
QUERY: 中国 日本 文化 -留学生 site:pkunews.pku.edu.cn. Baidu reports 38 results; Google reports 361 results.

How to do it?
String matching, e.g., grep over all web pages: find the pages containing "中国", "文化" and "日本", then remove the pages containing "留学生"?
- Slow (for large corpora)
- NOT "留学生" is non-trivial
- Other operations (e.g., find "中国" NEAR "日本") are not feasible

Document Representation
Bag of words model: build a document-term incidence matrix (关联矩阵), with a 1 if the page contains the word and a 0 otherwise.

Incidence Vector
Transposing the document-term matrix gives the term-document incidence matrix: each term corresponds to a 0/1 incidence vector over the documents.

          D1  D2  D3  D4  D5  D6  ...
中国       1   0   1   1   1   0
文化       1   1   0   0   1   0
日本       0   1   1   0   1   1
留学生     0   1   1   1   0   0
教育       1   1   0   0   0   0
北京       1   0   0   1   0   1
...

Retrieval
Information need: within this news site, find articles that talk about the culture of China and Japan, and do not talk about students abroad. To answer the query: read the term vectors for "中国", "文化", "日本", and "留学生" (complemented), then bitwise AND them:
101110 AND 110010 AND 011011 AND 100011 = 000010
The answer is D5.
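A minimal sketch of this incidence-vector retrieval; the document count and 0/1 vectors are the toy values from the matrix above, packed into Python integers as bit masks:

```python
# Boolean retrieval over term incidence vectors (bit i from the left = document Di+1).
incidence = {
    "中国":   0b101110,
    "文化":   0b110010,
    "日本":   0b011011,
    "留学生": 0b011100,
}
ALL_DOCS = 0b111111  # mask covering the six documents D1..D6

def boolean_and(terms, not_terms=()):
    """AND the incidence vectors of `terms`, then AND the complement
    of each vector in `not_terms` (the NOT part of the query)."""
    result = ALL_DOCS
    for t in terms:
        result &= incidence[t]
    for t in not_terms:
        result &= ALL_DOCS & ~incidence[t]
    return result

hits = boolean_and(["中国", "文化", "日本"], not_terms=["留学生"])
print([f"D{i + 1}" for i in range(6) if hits & (1 << (5 - i))])  # ['D5']
```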
Let's build a search system!
Consider the scale: N = 1 million documents, each with about 1K terms; at an average of 6 bytes/term, that is about 6GB of data in the documents. Number of distinct terms: M = 500K.
How big is the matrix? 500K x 1M, and extremely sparse: it contains no more than one billion 1's. What's a better representation?

In 1875, Mary Cowden Clarke compiled a concordance to the works of Shakespeare. In the book's preface she proudly wrote that she had "contributed a reliable guide to the treasure-house of wisdom...", hoping that sixteen years of hard work had not fallen short of that ideal. In 1911, Professor Lane Cooper published a concordance to the poems of William Wordsworth; it took 7 months and 67 people, working with tools such as index cards, scissors, glue, and stamps. By 1965, a computer could compile such a resource in a few days, and do it better...

Inverted index
For each term T: store the list of (the IDs of) the documents containing T.
中国   → 2 4 8 16 32 64 128
文化   → 1 2 3 5 8 13 21 34
留学生 → 13 16
The terms form the Dictionary; the lists are the Postings, sorted by docID (more later on why).

Inverted index construction
Documents to be indexed ("Friends, Romans, countrymen.")
→ Tokenizer → token stream (Friends Romans Countrymen)
→ Linguistic modules → modified tokens (friend roman countryman)
→ Indexer → inverted index (friend → 2, 4; roman → 1, 2; countryman → 13, 16)

Indexer steps
Output: a sequence of <modified token, document ID> pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
- Emit a (term, doc#) pair for every token, e.g., (I, 1), (did, 1), (enact, 1), ..., (caesar, 2), (was, 2), (ambitious, 2).
- Sort by term, then by doc# (the core indexing step).
- Merge multiple occurrences of a term within a document, adding term frequency information: e.g., caesar → (doc 1, freq 1), (doc 2, freq 2); brutus → (doc 1, freq 1), (doc 2, freq 1).
- Split the result into a Dictionary file (term, number of docs, total frequency) and a Postings file (doc#, freq).

Why split? The dictionary is small and can stay in memory, while the postings, which dominate the space, stay on disk.

Boolean Query processing
Query: 中国 AND 文化.
- Look up 中国 in the Dictionary and retrieve its postings.
- Look up 文化 in the Dictionary and retrieve its postings.
- "Merge" (intersect) the two postings lists:
中国 → 2 4 8 16 32 64 128
文化 → 1 2 3 5 8 13 21 34
Result: 2 → 8

The merge
Walk through the two postings lists simultaneously. If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: the postings must be sorted by docID.
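A minimal sketch of these indexer steps, with the tokenizer and linguistic modules reduced to punctuation-stripping and lowercasing, over the two Caesar documents above:

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: emit <modified token, doc#> pairs (crude normalization: lowercase, strip . and ;).
pairs = []
for doc_id, text in docs.items():
    for token in text.replace(".", " ").replace(";", " ").split():
        pairs.append((token.lower(), doc_id))

# Step 2: sort by term, then by doc# (the core indexing step).
pairs.sort()

# Steps 3-4: merge repeated occurrences within a document into (doc#, freq) postings;
# the dictionary keeps per-term statistics (here: df, the number of docs with the term).
postings = defaultdict(list)
for term, doc_id in pairs:
    if postings[term] and postings[term][-1][0] == doc_id:
        postings[term][-1] = (doc_id, postings[term][-1][1] + 1)
    else:
        postings[term].append((doc_id, 1))
dictionary = {term: len(plist) for term, plist in postings.items()}

print(postings["caesar"])    # [(1, 1), (2, 2)]
print(dictionary["brutus"])  # 2
```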
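And the merge itself: the standard two-pointer intersection. Each step advances at least one pointer, hence O(x+y), and it only works because both lists are sorted by docID.

```python
def intersect(p1, p2):
    """Intersect two postings lists of ascending docIDs."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128],    # 中国
                [1, 2, 3, 5, 8, 13, 21, 34]))  # 文化  ->  [2, 8]
```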
Boolean queries: Exact match
Queries use AND, OR, and NOT together with query terms. This was the primary commercial retrieval tool for 3 decades, and professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.

Example: WestLaw, the largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). About 7 terabytes of data; 700,000 users; the majority of users still use Boolean queries.
Example query (What is the statute of limitations in cases involving the federal tort claims act?):
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
Characteristics: long, precise queries; proximity operators; incrementally developed; not like web search.

Beyond Boolean term search
- Phrases: find "Bill Gates", not "Bill and Gates".
- Proximity: find Gates NEAR Microsoft.
- Restricting to regions of a document: find documents with (author = Ullman) AND (text contains automata).
Solution: record each term's field property, and record each term's position information within the documents.

LAST COURSE REVIEW

Bag of words model
The vector representation doesn't consider the ordering of words in a document: "John is quicker than Mary" and "Mary is quicker than John" have the same vectors. This is called the bag of words model. In a sense, this is a step back: the positional index was able to distinguish these two documents. We will look at "recovering" positional information later in this course. For now: bag of words model.

Inverted index
For each term T: store the list of (the IDs of) the documents containing T, as a Dictionary plus Postings sorted by docID (中国 → 2 4 8 16 32 64 128; 文化 → 1 2 3 5 8 13 21 34; 留学生 → 13 16).
- Simple inverted index
- Inverted index with counts: supports better ranking algorithms
- Inverted index with positions: supports proximity matches

Query Processing
- Document-at-a-time: calculates complete scores for documents by processing all term lists, one document at a time.
- Term-at-a-time: accumulates scores for documents by processing term lists one at a time.
Both approaches have optimization techniques that significantly reduce the time required to generate scores.

Scoring and Ranking

Beyond Boolean Search
For most users: a typical user will type bill rights or bill of rights as a query, not LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM. How should such full-text queries be interpreted and processed?
- There are no Boolean connectives (AND, OR, NOT).
- A query term need not appear in every result document.
- Users expect the results in some order, with the documents most likely to be useful at the front.

Scoring: density-based
Score each document against the query, and sort by score. Idea: if a document talks about a topic more, then it is a better match; a document containing many occurrences of the query terms is relevant. This leads to term weighting.

Term frequency vectors
Let tf_{t,d} denote the number of occurrences of term t in document d. For a free-text query q:
Score(q, d) = Σ_{t∈q} tf_{t,d}

        D1  D2  D3  D4  D5  D6
中国    11   0   7  13   4   0
文化     2   2   0   0   6   0
日本     0   5   2   0   1   9
留学生   0   1   2   6   0   0
教育     3   0   0   2   0   0
北京    17   0   0   0  11   8
...

Problems of tf scoring
- Word order is not distinguished (→ positional information index).
- Long documents have an advantage. Normalize for document length: wf_{t,d} = tf_{t,d} / |d|.
- The importance of an occurrence is not proportional to the count: going from 0 occurrences to 1 means far more than going from 100 to 101. Smooth with a logarithm:
  wf_{t,d} = 0 if tf_{t,d} = 0, and 1 + log tf_{t,d} otherwise.
- Different terms differ in importance: consider the query 日本 的 汉字 丼. We need to discriminate among terms.

Discrimination of terms
How do we measure how common a term is?
- collection frequency (cf): total number of occurrences of the term in the collection
- document frequency (df): number of documents in the collection that contain the term

Word        cf     df
try        10422  8760
insurance  10440  3997

tf x idf term weights
The tf x idf weighting combines:
- term frequency (tf, or wf): some measure of term density in a doc
- inverse document frequency (idf): a measure of the term's importance (rarity). The raw value is idf_t = 1/df_t; as with tf, it is usually smoothed: idf_t = log(N/df_t).
Compute a tf.idf weight for every term in every document:
w_{t,d} = tf_{t,d} × log(N/df_t)

Documents as vectors
        D1    D2   D3   D4    D5    D6
中国    4.1   0.0  3.7  5.9   3.1   0.0
文化    4.5   4.5  0    0    11.6   0
日本    0     3.5  2.9  0     2.1   3.9
留学生  0     3.1  5.1  12.8  0     0
教育    2.9   0    0    2.2   0     0
北京    7.1   0    0    0     4.4   3.8
...

Each document j can now be viewed as a vector with one dimension per term and tf.idf values as coordinates. So we have a vector space: terms are axes, and docs live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.

Intuition
[Figure: documents d1-d5 plotted in a vector space with term axes t1, t2, t3, and angles θ, φ between vectors.]
Postulate: documents that are "close together" in the vector space talk about the same things. Use cases: query-by-example, and treating a free-text query as a vector.

Formalizing vector space proximity (Sec. 6.3)
First cut: the distance between two points (= the distance between the end points of the two vectors). Euclidean distance? Euclidean distance is a bad idea... because it is large for vectors of different lengths: the Euclidean distance between q and d2 can be large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Cosine similarity
The "closeness" of vectors d1 and d2 can instead be measured by the angle between them; concretely, use the cosine of that angle as the vector similarity, which normalizes the vectors by length:

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1}^{M} w_{i,j} w_{i,k} / (√(Σ_{i=1}^{M} w_{i,j}²) √(Σ_{i=1}^{M} w_{i,k}²))

Example
Docs: Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH).

Raw counts:          Length-normalized:
           SaS  PaP  WH            SaS    PaP    WH
affection  115   58  20  affection 0.996  0.993  0.847
jealous     10    7  11  jealous   0.087  0.120  0.466
gossip       2    0   6  gossip    0.017  0.000  0.254

cos(SaS, PaP) = 0.996 x 0.993 + 0.087 x 0.120 + 0.017 x 0.0 ≈ 0.999
cos(SaS, WH) = 0.996 x 0.847 + 0.087 x 0.466 + 0.017 x 0.254 ≈ 0.889
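A minimal sketch reproducing this example. The weights here are raw term counts, matching the table above; the tf_idf helper shows the w_{t,d} = tf × log(N/df) weighting from the slides, which would be used instead when df statistics are available:

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = tf * log(N / df); zero when the term is absent."""
    return tf * math.log10(N / df) if tf > 0 else 0.0

def cosine(v1, v2):
    """Cosine similarity: dot product divided by the vector lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Raw counts of affection, jealous, gossip from the slides.
SaS, PaP, WH = [115, 10, 2], [58, 7, 0], [20, 11, 6]
print(round(cosine(SaS, PaP), 3))  # 0.999
print(round(cosine(SaS, WH), 3))   # 0.889
```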
Notes on Index Structure
How should the normalized tf-idf values be stored? In every postings entry? Storing tf divided by the normalization means storing floats, causing a space blowup. Usually: tf is stored as an integer (which also enables index compression), while the document length and the idf are stored only once per document and per term.

tf-idf weighting has many variants (Sec. 6.4)
[SMART weight-scheme table; columns headed 'n' are acronyms for weight schemes.] Why is the base of the log in idf immaterial?

Weighting may differ in queries vs documents (Sec. 6.4)
Many search engines allow different weightings for queries vs. documents. SMART notation denotes the combination in use in an engine as ddd.qqq, using the acronyms from the table. A very standard weighting scheme is lnc.ltc:
- Document: logarithmic tf (l as first character), no idf, and cosine normalization. (No idf for documents: a bad idea?)
- Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization...

tf-idf example: lnc.ltc (Sec. 6.4)
Document: "car insurance auto insurance". Query: "best car insurance".

           Query                                    Document                     Prod
Term       tf-raw  tf-wt  df     idf  wt   n'lize   tf-raw  tf-wt  wt   n'lize
auto       0       0      5000   2.3  0    0        1       1      1    0.52    0
best       1       1      50000  1.3  1.3  0.34     0       0      0    0       0
car        1       1      10000  2.0  2.0  0.52     1       1      1    0.52    0.27
insurance  1       1      1000   3.0  3.0  0.78     2       1.3    1.3  0.68    0.53

Exercise: what is N, the number of docs?
Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8

Thus far
We can build an information retrieval system that supports Boolean queries, free-text queries, and ranked results.

IR Evaluation

Measures for a search engine
- Speed of index construction: number of documents/hour; document size.
- Search speed: latency as a function of index size; throughput as a function of index size.
- Expressiveness of the query language: the ability to express complex information needs; speed on complex queries.
These criteria are measurable, but the more essential measure is user happiness. How can that be measured quantitatively?

Measuring user happiness
Issue: who is the user?
- Web engine: the user finds what they want and returns to the engine. Can measure the rate of return users.
- eCommerce site: the user finds what they want and makes a purchase. Is it the end-user, or the eCommerce site, whose happiness we measure? Measure time to purchase, or the fraction of searchers who become buyers?
- Enterprise (company/govt/academic): care about "user productivity": how much time do my users save when looking for information? Many other criteria have to do with breadth of access, secure access, etc.

Happiness: elusive to measure
The commonest proxy: relevance of search results. But how do you measure relevance? Methodology: a test collection, consisting of
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
There is some work on more-than-binary assessments, but binary is the standard.

Evaluating an IR system
Note: the information need is translated into a query, and relevance is assessed relative to the information need, not the query. E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. Query: wine red white heart attack effective. You evaluate whether the doc addresses the information need, not whether it has those words.

Standard Test Collections
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years. Others include Reuters, CWT100G/CWT200G, etc. Human experts mark, for each query and for each doc, Relevant or Irrelevant, or at least do so for the subset of docs that some system returned for that query.

Unranked retrieval evaluation: Precision and Recall
Precision: the fraction of retrieved documents that are relevant = P(relevant|retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved|relevant)

                Relevant  Not Relevant
Retrieved       tp        fp
Not Retrieved   fn        tn

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

Accuracy
Given a query, a search engine classifies each document as "Relevant" or "Irrelevant". The accuracy of the engine is the fraction of these classifications that are correct:
Accuracy = (tp + tn)/(tp + fp + tn + fn)
Is this a very useful evaluation measure in IR?

Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget: answer every search with "0 matching results found". Since almost all documents are irrelevant to any query, this classification is almost always correct. People doing information retrieval want to find something, and have a certain tolerance for junk.

Precision and recall when ranked
Extend the set-based definitions to a ranked list: at each document in the ranked list, compute a P/R point. Which of these computed values are useful?
- Consider a P/R point for each relevant document.
- Consider the value only at fixed rank cutoffs, e.g., precision at rank 20.
- Consider the value only at fixed recall points, e.g., precision at 20% recall. (There may be more than one precision value at a recall point.)

Average precision of a query
We often want a single-number effectiveness measure. Average precision is widely used in IR: calculate it by averaging the precision values at the points where recall increases, i.e., at each relevant document retrieved. (See the sketch below.)

Recall/precision graphs
Average precision vs. the P/R graph: AP hides information, while a recall/precision graph has an odd saw-tooth shape if plotted directly. But P/R graphs are hard to compare across queries, which leads from precision and recall toward averaging.

Averaging graphs: a false start
How can graphs be averaged? Different queries have different recall values; what, then, is the precision at 25% recall? We have to interpolate. But how?

Interpolation of graphs
Possible interpolation methods: no interpolation (not very useful); connect the dots (not a function); connect the maxima; connect the minima; connect the averages; ... And what about 0% recall? Assume 0? Assume the best value? A constant start?

How to choose?
A good retrieval system has the property that, on average, its precision drops as recall increases. This has been verified time and time again. So interpolate in a way that makes the function monotonically decreasing: for example, going from left to right, take the largest precision value to the right as the interpolated value,
P_interp(R) = max { P : (R', P) ∈ S, R' ≥ R }
where S is the set of observed (R,P) points. The result is a step function.

Our example, interpolated this way, is monotonically decreasing and handles 0% recall smoothly.

Averaging graphs: using interpolation
Asked what the precision at 25% recall is, we can now interpolate the values. (A sketch of both computations follows.)
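A minimal sketch of these ranked-list measures; the ranking and the relevance judgments are hypothetical:

```python
def precision_recall_points(ranking, relevant):
    """(recall, precision) point at every rank of a ranked list of docIDs."""
    points, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

def average_precision(ranking, relevant):
    """Average the precision values at the ranks where recall increases,
    i.e. at each relevant document retrieved."""
    precisions, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant)

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]  # hypothetical system output
relevant = {"d3", "d1", "d2"}                   # hypothetical judgments
for r, p in precision_recall_points(ranking, relevant):
    print(f"R={r:.2f}  P={p:.2f}")              # one P/R point per rank
print(average_precision(ranking, relevant))     # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722
```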
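And the monotone interpolation just described: the interpolated precision at recall level R is the maximum observed precision at any recall ≥ R, yielding a decreasing step function with a well-defined value at 0% recall.

```python
def interpolated_precision(points, r):
    """Max precision over observed (recall, precision) points with recall >= r."""
    candidates = [p for rec, p in points if rec >= r]
    return max(candidates) if candidates else 0.0

# The (recall, precision) points produced by the previous sketch.
points = [(1/3, 1.0), (1/3, 0.5), (2/3, 2/3), (2/3, 0.5), (2/3, 0.4), (1.0, 0.5)]
for level in (0.0, 0.25, 0.5, 0.75, 1.0):  # a few standard recall levels
    print(level, round(interpolated_precision(points, level), 2))
```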
Averaging across queries
- Micro-average: each relevant document is a point used to compute the average.
- Macro-average: each query is a point used to compute the average, i.e., the average of many queries' average precision values. This is called mean average precision (MAP), since "average average precision" sounds weird. It is the most common measure.

Interpolated average precision
Average precision at standard recall points: for a given query, compute a P/R point for every relevant doc, then interpolate the precision at standard recall levels. 11-pt is usually 100%, 90%, 80%, ..., 10%, 0% (yes, 0% recall); 3-pt is usually 75%, 50%, 25%. Average over all queries to get the average precision at each recall level, then average the interpolated recall levels to get a single result, called "interpolated average precision". It is not used much anymore (MAP, "mean average precision", is more common), but values at specific interpolated points are still commonly used.

A combined measure: F
The F measure is a combined P/R metric (a weighted harmonic mean):
F = 1 / (α(1/P) + (1-α)(1/R)) = ((β²+1) P R) / (β² P + R)
The balanced F1 measure (β = 1, i.e., α = ½) is typically used. The harmonic mean is a conservative average: it heavily penalizes low values of P or R.

Averaging F, example
Q-bad has 1 relevant document; Q-perfect has 10 relevant documents.
- Q-perfect: relevant documents retrieved at ranks 1-10, so (R,P) = (.1,1), (.2,1), ..., (1,1); F values of 18%, 33%, ..., 100%, so AvgF = 66.2%.
- Q-bad: relevant document retrieved at rank 1000, so (R,P) = (1, 0.001); F value of 0.2%, so AvgF = 0.2%.
Macro average: (0.2% + 66.2%) / 2 = 33.2%
Micro average: (0.2% + 18% + ... + 100%) / 11 = 60.2%

Summary of this lecture
- Basic index techniques: inverted index; dictionary & postings
- Scoring and ranking: term weighting (tf·idf); vector space model; cosine similarity
- IR evaluation: precision, recall, F; interpolation; MAP, interpolated AP

Thank You! Q&A

Reading materials
[1] IIR, Ch. 1, 6.2, 6.3, 8.1-8.4.
[2] J. M. Ponte and W. B. Croft, "A language modeling approach to information retrieval," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia: ACM Press, 1998.