Information Retrieval to Knowledge Retrieval, One More Step
Xiaozhong Liu, Assistant Professor, School of Library and Information Science, Indiana University Bloomington

What is Information? What is Retrieval? What is Information Retrieval? (I am a Retriever!)

How do you find this book in the library? You search for something based on the user's information need!

How do you express your information need? With a query.

What is a good query? What is a bad query?
• Good query: query ≈ information need
• Bad query: query ≠ information need
Wait!!! The user NEVER makes mistakes!!! It's OUR job!!!

Task 1: Given a user's information need, how do we help the user (or automatically help to) propose a better query?
If there is a query…
• Perfect query: the query that exactly expresses the information need
• User input query: the query the user actually types

What are good results? What are bad results? Given a query, how do we retrieve results?

Task 2: Given a (not perfect) query, how do we retrieve documents from the collection?
F(query, doc) over very large, unstructured text data!!!

Can you give me an example?
• If a query term exists in the doc: yes, this is a result.
• If a query term does NOT exist in the doc: no, this is not a result.
Is there any problem with this function? Brainstorm…
Query: Obama's wife
Doc 1: My wife supports Obama's new policy on…
Doc 2: Michelle, as the first lady of the United States…
Yes, this is a very challenging task!

Another problem: collection size 5 billion, matching docs 5. "My algorithm successfully finds all 5 docs!" — in 3 billion results…

Task 3: Given the retrieved results, how do we help the user find what they need?
If the retrieval algorithm retrieved 1 billion results from the collection, what would you do? Search with Google and keep clicking "next"???
Yes, we can help the user find what they need!
Query: Indiana University Bloomington. Can you read the results one by one? Is that how you use it?

User information need → (1) query → (2) system → (3) results. They are not independent!

Information retrieval covers text, maps, images, music, … across many sources: web pages, scholarly papers, documents, blogs, news.

Index
Documents vs. database records
• Relational database records are typically made up of well-defined fields:
  Select * from students where GPA > 2.5
• Can we treat text the same way? Find all the docs including "Xiaozhong":
  Select * from documents where text like '%xiaozhong%'
We need a more effective way to index the text!

Notation
• Collection C: doc1, doc2, doc3, …, docN
• Vocabulary V: w1, w2, w3, …, wn
• Document doci: di1, di2, di3, …, dim
• Query q: q1, q2, q3, …, qt
All dij ∈ V, and each qx is a query term.

A term-document matrix represents each document (and the query) as a vector over V: first as 0/1 occurrence indicators, then as raw term counts, and finally as normalized weights. Normalization is very important!

Term weighting: TF * IDF
• Term frequency: freq(w, doc) / |doc|, or a variant
• Inverse document frequency: 1 + log(N/k), where N is the total number of docs in the collection and k is the number of docs containing word w
An effective way to weight each word in a document.

Retrieval model? Ranking? Semantics? Speed? Space?
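To make the indexing and TF*IDF weighting above concrete, here is a minimal sketch in Python. It is an illustration, not code from the lecture; the two toy documents and the whitespace tokenization are assumptions.

```python
# Minimal sketch: TF*IDF weights for a toy two-document collection.
# tf = freq(w, doc) / |doc|, idf = 1 + log(N / k), as defined above.
import math
from collections import Counter

docs = {
    "doc1": "my wife supports obama 's new policy",
    "doc2": "michelle as the first lady of the united states",
}

def tf(tokens):
    # term frequency normalized by document length
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def idf(all_token_lists):
    # 1 + log(N / k): N documents in total, k documents containing the word
    n_docs = len(all_token_lists)
    df = Counter(w for tokens in all_token_lists for w in set(tokens))
    return {w: 1 + math.log(n_docs / k) for w, k in df.items()}

token_lists = {d: text.split() for d, text in docs.items()}
idf_table = idf(list(token_lists.values()))

# TF * IDF weight for every word in every document
weights = {
    d: {w: tf_w * idf_table[w] for w, tf_w in tf(tokens).items()}
    for d, tokens in token_lists.items()
}
print(weights["doc1"])
```

A word that occurs in every document gets the minimum IDF of 1, so common, non-discriminative terms are down-weighted relative to rare ones.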
Document representation has to meet the requirements of the retrieval system.

Stemming
Education, Educate, Educational, Educating, Educations → Educat
Very effective for improving system performance, but with some risk, e.g. LA Lakers = LA Lake?

Inverted index
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
After stop-word removal and stemming the vocabulary is: i, love, cat, thi, yellow, white.
A first version maps each term to the documents that contain it:
i → 1; love → 1, 2; thi → 2; cat → 1, 2, 3; yellow → 3; white → 3
Do we lose something?
Add term frequencies (doc:freq):
i → 1:1; love → 1:1, 2:1; thi → 2:1; cat → 1:1, 2:1, 3:2; yellow → 3:1; white → 3:1
Do we still lose something?
Add term positions (doc:position):
i → 1:1; love → 1:2, 2:4; thi → 2:1; cat → 1:4, 2:2, 3:2, 3:5; yellow → 3:1; white → 3:4
(A short sketch at the end of this section rebuilds this positional index.)

Why do you need position info? Proximity of the query terms.
Query: information retrieval
Doc 1: information retrieval is important for digital library.
Doc 2: I need some information about the dogs, my favorite is golden retriever.

Index – bag of words
With the same query and documents: what is the limitation of bag-of-words? Can we make it better?
n-grams (here, bigrams): Doc 1 → information retrieval, retrieval is, is important, important for, ……
Better semantic representation! What's the limitation?

Index – bag of "phrases"?
Doc 1: …… big apple ……
Doc 2: …… apple ……
More precision, less ambiguity. How do we identify phrases in documents?
• Identify syntactic phrases using POS tagging
• Take n-grams from existing resources

Noise detection
What is the noise of a web page? Non-informative content…

Web crawler – freshness
The web is changing, but we cannot constantly check all the pages… We need to find the most important pages and how frequently they change: www.nba.com, www.iub.edu, www.restaurant????.com.
Sitemap: a list of URLs for each host, with modification time and change frequency.

Retrieval Model
"Mathematical modeling is frequently used with the objective to understand, explain, reason and predict behavior or phenomenon in the real world" (Hiemstra, 2001) — i.e., some models help you predict tomorrow's stock price…

Vector Space Model
Hypothesis: the retrieval and ranking problem = a similarity problem!
Is that a good hypothesis? Why?
Retrieval function: Similarity(query, document) returns a score!!! So we can rank the documents!!!

In the vector space model the query is just a short document: query and documents are vectors of term weights over the same vocabulary.

Doc 1: ……cat……dog……cat……
Doc 2: ……cat……dog
Doc 3: ……snake……
Query: dog cat
In the (cat, dog) plane, doc 1 sits at (2, 1), doc 2 at (1, 1) — pointing in the same direction as the query — and doc 3 contains neither term.
F(q, doc) = cosine similarity(q, doc), the cosine of the angle θ between the query vector and the document vector. Why cosine?

Vector Space Model
• Vocabulary V: w1, w2, w3, …, wn; dimension = n = vocabulary size
• Document doci: di1, di2, di3, …, din
• Query q: q1, q2, q3, …, qn
All dij ∈ V — the same dimensional space!!!
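The positional inverted index above can be rebuilt in a few lines of Python. This is a minimal sketch: the tiny stop-word list and the lookup-table "stemmer" are assumptions standing in for a real stop list and a real stemmer such as Porter's.

```python
# Minimal sketch: a positional inverted index for the three example documents.
from collections import defaultdict

docs = {
    1: "I love my cat",
    2: "This cat is lovely",
    3: "Yellow cat and white cat",
}
stop_words = {"my", "is", "and"}
stem = {"lovely": "love", "this": "thi"}  # toy stand-in for a real stemmer

# term -> list of (doc_id, position); positions counted before stop-word removal
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split(), start=1):
        if token in stop_words:
            continue
        term = stem.get(token, token)
        index[term].append((doc_id, pos))

for term, postings in sorted(index.items()):
    print(term, postings)
# e.g. cat -> [(1, 4), (2, 2), (3, 2), (3, 5)]
```

With positions stored, checking whether "information" and "retrieval" appear next to each other in a document is just a lookup and a comparison of adjacent positions.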
Try it!
Doc 1: ……cat……dog……cat……
Doc 2: ……cat……dog
Doc 3: ……snake……
Query: dog cat
(A worked sketch appears at the end of this section.)

Term weighting
A document vector such as [0.42, 0.11, 0.34, 0.13] needs weights — how?
TF * IDF
• Term frequency: freq(w, doc) / |doc|, or a variant
• Inverse document frequency: 1 + log(N/k), where N is the total number of docs in the collection and k is the number of docs containing word w

More on TF
Weighting is very important for the retrieval model! We can improve TF by using, e.g., freq(term, doc), log[freq(term, doc)], or BM25.

Vector Space Model, but…
• Bag-of-words assumption = words are independent!
• Query = document? Maybe not true!
• Vectors and SEO (Search Engine Optimization)…
• Synonyms? Semantically related words?

How about these?
• Pivoted Normalization Method
• Dirichlet Prior Method
Both combine TF, IDF, and length normalization, plus a parameter.

Language model
A probability distribution over words:
P(I love you) = 0.01
P(you love I) = 0.00001
P(love you I) = 0.0000001
If we have this information, we can build a generative model: P(text | θ).

Language model – unigram
Generate text under the bag-of-words assumption (words independent):
P(w1, w2, …, wn) = P(w1) P(w2) … P(wn)

Which topic generated this document?
Doc: I'm using a Mac computer… remote access another computer… share some USB device…
Compare P(Doc | topic 1) vs. P(Doc | topic 2), where topic 1 is a word distribution over food, orange, milk, yogurt, …; topic 2 over desk, USB, computer, Apple, Unix, iPad, iPhone 4S, play store, …; another topic over NBA, sport, superbowl, NHL, score, …; another over king, ghost, hamlet, romeo, juliet, play, …

θ_topicX: e.g. P("computer" | topic X) = 1000/10000, P("orange" | topic X) = 10/10000, P("sport" | topic X) = 30/10000. How do we estimate these? From enough data, i.e. docs about topic X.

Query: sport game watch
Compare P(query | doc 1) vs. P(query | doc 2): each document is treated as its own model θ_doc, and doc 2 (NBA, sport, superbowl, NHL, score, …) is the better match.
Retrieval problem ≈ query likelihood ≈ query term likelihood P(qi | doc).
But a document is only a small sample of a topic… Smoothing!

Smoothing
What if qi is not observed in doc? P(qi | doc) = 0? We want to give this a non-zero score!!! We can make it better!
• First, smoothing addresses the data-sparseness problem: because a document is only a very small sample, P(qi | doc) could be zero for unseen words (Zhai & Lafferty, 2004).
• Second, smoothing helps to model the background (non-discriminative) words in the query.
Smoothing improves language-model estimation.

One smoothing method interpolates the document model with a collection language model:
P(w | θ_doc) = (1 − λ) · P(w | doc) + λ · P(w | collection)
This brings in term frequency, IDF-like effects, and document-length normalization: TF-IDF is closely related to the language model and other retrieval models.

Language model
• Solid statistical foundation
• Flexible parameter setting
• Different smoothing methods

Language model in the library?
If we have a paper and a query, the vector space model computes Similarity(paper, query); if a query word is not in the paper, its score contribution is 0. With a language model, the likelihood of the query given a paper can be estimated by:
P(query | θ) = α·P(query | paper) + β·P(query | author) + γ·P(query | journal) + ……
i.e., the likelihood of the query given the paper & author & journal & ……

E.g., what's the difference between web and document retrieval?
F(doc, query) vs. F(web page, query)
web page = doc + hyperlinks + domain info + anchor text + metadata + …
Can you use those to improve system performance?
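One possible worked answer to the "Try!" exercise above, using raw term counts in a two-dimensional (cat, dog) space; a real system would use TF*IDF weights rather than raw counts.

```python
# Minimal sketch: cosine similarity between the query and the three toy documents.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = {                  # dimensions: (cat, dog)
    "doc1": (2, 1),          # ...cat...dog...cat...
    "doc2": (1, 1),          # ...cat...dog
    "doc3": (0, 0),          # ...snake... (no cat/dog terms)
}
query = (1, 1)               # "dog cat"

for doc, vec in vectors.items():
    print(doc, round(cosine(query, vec), 3))
# doc2 (1.0) ranks above doc1 (~0.949); doc3 scores 0
```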
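The smoothed query-likelihood scoring can be sketched as follows, using the Jelinek-Mercer interpolation above, P_smoothed(w | doc) = (1 − λ)·P(w | doc) + λ·P(w | collection). The choice λ = 0.5 and the two example documents are assumptions for illustration, not the lecture's code.

```python
# Minimal sketch: query likelihood with Jelinek-Mercer (collection) smoothing.
import math
from collections import Counter

docs = {
    "doc1": "information retrieval is important for digital library".split(),
    "doc2": "i need some information about the dogs my favorite is golden retriever".split(),
}
collection = [w for tokens in docs.values() for w in tokens]
coll_counts, coll_len = Counter(collection), len(collection)
lam = 0.5  # interpolation weight, an assumed value

def query_log_likelihood(query, tokens):
    counts, length = Counter(tokens), len(tokens)
    score = 0.0
    for q in query:
        p_doc = counts[q] / length          # maximum-likelihood estimate from the document
        p_coll = coll_counts[q] / coll_len  # collection (background) language model
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

query = "information retrieval".split()
for doc, tokens in docs.items():
    print(doc, round(query_log_likelihood(query, tokens), 3))
# doc1 scores higher: both query terms are seen there, and smoothing keeps
# doc2's score finite even though "retrieval" never occurs in it.
```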
Knowledge
Score each topic by its level of interest (e.g. Topic 1 vs. Topic 2): historical interest vs. current interest — hot topic, diminishing topic, regular topic.

Score(Topic_n) =
• a · P(Day_today | Z_n) / mean[P(Day_i | Z_n)]   if P(Day_today | Z_n) > mean[P(Day_i | Z_n)] + STD[P(Day_i | Z_n)]   (hot topic)
• b · P(Day_today | Z_n) / mean[P(Day_i | Z_n)]   else if P(Day_today | Z_n) < mean[P(Day_i | Z_n)] − STD[P(Day_i | Z_n)]   (diminishing topic)
• P(Day_today | Z_n)   otherwise   (regular topic)

"Obama", Nov 5th, 2008 — after the election: 1. win, 2. create history, 3. first black president.
[Figure: daily CIV for "Obama" over the 30 days following Nov 5th, 2008 (CIV ≈ 5 on Nov 5th); peaks are annotated with terms and entities such as Wiki:Barack_Obama, Wiki:Election, win, success, Wiki:President_of_the_United_States, Wiki:African_American, first black president, 44th, Wiki:Victory_Records, Wiki:Sarah_Palin, Wiki:Hillary_Rodham_Clinton, newsweek, Wiki:Colin_Powell, Wiki:Secretary_of_State, Wiki:United_States.]

Google web results (NDCG3, NDCG5, NDCG10, t-test):
CIV: 0.35909366, 0.399970894, 0.479302401
CILM: 0.356652652, 0.387120299, 0.483420045
Google: 0.230423817, 0.318737414, 0.388792379, **
TFIDF: 0.27596245, 0.333012091, 0.437831859, *
BM25: 0.284599431, 0.336961764, 0.436466778, *
LM (linear): 0.32558799, 0.382113457, 0.473992963
LM (dirichlet): 0.34665084, 0.358128576, 0.45150825
LM (two-stage): 0.349735965, 0.358725227, 0.450046444
BEST1 per column: CIV, CIV, CILM; BEST2: CILM, CILM, CIV
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15

Yahoo web results (NDCG3, NDCG5, NDCG10):
CIV: 0.351765133, 0.38207777, 0.475506721
CILM: 0.391807685, 0.40623334, 0.482464858
Yahoo: 0.288059321, 0.326373542, 0.410969176
TFIDF: 0.24320988, 0.282799657, 0.404092457
BM25: 0.245263974, 0.277579262, 0.395953269
LM (linear): 0.276208943, 0.316889107, 0.432428784
LM (dirichlet): 0.223253393, 0.270017519, 0.385936078
LM (two-stage): 0.219225991, 0.266537146, 0.384349848
BEST1 per column: CILM, CILM, CILM; BEST2: CIV, CIV, CIV
t-test: ***, ***, *, ***, ***
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15

Knowledge Retrieval System
• How do we help users propose knowledge-based queries?
• How do we represent the knowledge within the scientific literature?
• How do we match between the two — the knowledge-based information need and the knowledge representation?
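Read as pseudocode, the topic-interest score above boosts a topic whose probability today spikes above its historical mean (a hot topic) and discounts one that falls below it (a diminishing topic). A minimal sketch; the values of a and b and the example history are assumptions, not numbers from the slides.

```python
# Hedged sketch of the piecewise topic-interest score reconstructed above.
import statistics

def topic_score(p_today, p_history, a=2.0, b=0.5):
    mean = statistics.mean(p_history)
    std = statistics.pstdev(p_history)
    if p_today > mean + std:          # hot topic: boost by a
        return a * p_today / mean
    if p_today < mean - std:          # diminishing topic: discount by b
        return b * p_today / mean
    return p_today                    # regular topic: today's probability as-is

history = [0.01, 0.012, 0.011, 0.009, 0.013]   # P(Day_i | Z_n) over past days
print(topic_score(0.05, history))               # spikes today -> hot-topic branch
```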
Academic Knowledge Query Recommendation & Feedback
• Query recommendation
• Query feedback

Structural Keyword Generation – Features

Keyword Content features:
• Text: content of the keyword, stemmed, case insensitive, stop words removed
• Content_Of_Keyword: a vector of all the tokens in the keyword
• CAP: whether the keyword is capitalized
• Contain_Digit: whether the keyword contains digits, e.g. TREC2002 → true
• Character_Length_Of_Keyword: number of characters in the target keyword
• Token_Length_Of_Keyword: number of tokens in the keyword
• Category_Length_Of_Keyword: number of tokens in the keyword; if the length is more than four, we use four to represent its category length
• Exist_In_Title: whether the keyword exists in the title (stemmed, case insensitive, stop words removed)

Title Context features:
• Location_In_Title: the position where the keyword appears in the title
• Title_Text_POS: unigram and its part of speech in the title (in a text window)
• Title_Unigram: unigram of the keyword in the title (in a text window)
• Title_Bigram: bigram of the keyword in the title (in a text window)

Abstract Context features:
• Location_In_Abstract: which sentence of the abstract the keyword appears in
• Keyword_Position_In_Sentence_Of_Abstract: the keyword's position in the sentence (beginning, middle, or end)
• Abstract_Freq: how many times the keyword appears in the abstract
• Abstract_Text_POS: unigram and its part of speech in the abstract (in a text window)
• Abstract_Unigram: unigram of the keyword in the abstract (in a text window)
• Abstract_Bigram: bigram of the keyword in the abstract (in a text window)
(A short sketch at the end of this section illustrates a few of these features.)

Evaluation – Domain Knowledge Generation (F1, Supervised / Semi-supervised):
Keyword-based features — Research Question: 0.637 / 0.662; Methodology: 0.479 / 0.516; Dataset: 0.824 / 0.816; Evaluation: 0.571 / 0.571
Keyword + Title-based features — Research Question: 0.633 / 0.667; Methodology: 0.498 / 0.534; Dataset: 0.824 / 0.816; Evaluation: 0.571 / 0.571
Keyword + Title + Abstract-based features — Research Question: 0.642 / 0.663; Methodology: 0.420 / 0.542; Dataset: 0.831 / 0.823; Evaluation: 0.621 / 0.662
F-measure comparison for supervised learning vs. semi-supervised learning. GOOD! but not PERFECT…

Where does the knowledge come from?
• The system? Machine learning, but… modest performance…
• The user? No way! Very high cost! Authors won't contribute…
• System + user? Possible!

WikiBackyard trigger chain: 1. a wiki page edit improves the page; 2. the machine learning model improves; 3. all other wiki pages improve; 4. the KR index improves! ScholarWiki + user — YES! It helps!!!

Machine learning is powerful…
• Knowledge retrieval for scholarly publications…
• Knowledge from the paper
• Knowledge from the user: knowledge feedback, knowledge recommendation
• Knowledge from the user vs. from machine learning
• ScholarWiki (user) + WikiBackyard (machine)

Knowledge via Social Network and Text Mining
Citation? Co-occurrence? Co-authorship?

Full-text citation analysis
"With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article's influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis."
What is the content of each node? What is the motivation of each citation?
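A few of the structural keyword features listed earlier in this section can be illustrated with a small sketch. The helper below is hypothetical (simple whitespace tokenization, no stemming) and only paraphrases the feature descriptions above; it is not the system's implementation.

```python
# Illustrative sketch: computing a handful of the structural keyword features
# for one candidate keyword, under simplified tokenization assumptions.
def keyword_features(keyword, title, abstract):
    tokens = keyword.lower().split()
    title_tokens = title.lower().split()
    abstract_lower = abstract.lower()
    return {
        "CAP": keyword[0].isupper(),                         # is the keyword capitalized?
        "Contain_Digit": any(c.isdigit() for c in keyword),  # e.g. TREC2002 -> True
        "Character_Length_Of_Keyword": len(keyword),
        "Token_Length_Of_Keyword": len(tokens),
        "Category_Length_Of_Keyword": min(len(tokens), 4),   # lengths > 4 collapsed to 4
        "Exist_In_Title": all(t in title_tokens for t in tokens),
        "Abstract_Freq": abstract_lower.count(keyword.lower()),
    }

print(keyword_features(
    "citation analysis",
    "Full text citation analysis for scholarly impact",
    "We study full text citation analysis and show that citation analysis benefits from full text",
))
```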
Every word in the citation context will VOTE!! Motivation? Topic? Reason???
How wide a context? The left and right N words?? N = ??????????
In the passage above, word effectiveness decays with distance from the citation: closer words make a more significant contribution!!

How about a language model? Each node and each edge represented by a language model? That is a very high-dimensional space, and word differences get in the way.
Topic modeling: each node is represented by a topic distribution (a prior distribution); each edge is represented by a topic distribution (a transitioning probability distribution).
Supervised topic modeling:
1. Each topic has a label (YES! we can interpret each topic)
2. We DO KNOW the total number of topics
Each paper is a mixture distribution over the author-given keywords:
p_zkeyi(paper) = p(z_keyi | abstract, title)

Paper importance
Suppose the domain holds 100 credits and there are 3 topics (keywords): key1, key2, key3.
[Diagram: four publications (pub 1–4), each initially holding 25 credits; pub 1 is cited through two citation contexts weighted 0.8 and 0.2 for key1; for pub 1's text, P(key1 | text) = 0.6, P(key2 | text) = 0.15, P(key3 | text) = 0.25, and for the first citation context P(key1 | citation) = 0.8, P(key2 | citation) = 0.1, P(key3 | citation) = 0.1.]
• Key1-Pub1 credit: 25 * 0.6
• Key1-Citation1 credit: 25 * 0.6 * [0.8 / (0.8 + 0.2)]
Should we simply share the credits evenly? A citation is important if 1. the citation focuses on an important topic, and 2. the other citations focus on other topics.
(A small numeric sketch at the end of this section works through these numbers.)
After propagation, each publication carries a per-topic credit vector, e.g. from an even [25, 25, 25] to updated vectors such as [29, 26, 28] or [27, 27, 26]. This yields a domain publication ranking, a domain keyword (topical) ranking, and a topical citation tree. The number of citations between a paper pair is IMPORTANT! Different citations make different contributions to different topics (keywords) of the citing publication.
• Citation transitioning topic prior
• Publication/venue/author topic prior

NDCG for review citation recommendation (nDCG@10 / 30 / 50 / 100 / 300 / 500 / 1000 / 3000 / 5000 / ALL):
PageRank: 0.0093, 0.0076, 0.0084, 0.0107, 0.0198, 0.0261, 0.0392, 0.0719, 0.0917, 0.1904
TFIDF: 0.0674, 0.0741, 0.0833, 0.0975, 0.1266, 0.1391, 0.1541, 0.1827, 0.1932, 0.2141
BM25: 0.0689, 0.0738, 0.0832, 0.0957, 0.126, 0.138, 0.1525, 0.1808, 0.1895, 0.213
LM: 0.0713, 0.0757, 0.0861, 0.1006, 0.1329, 0.1446, 0.1616, 0.1872, 0.1987, 0.2174
PageRank_LM: 0.1023, 0.113, 0.132, 0.1563, 0.1943, 0.21, 0.2258, 0.24, 0.2458, 0.2593
Prior: 0.1229, 0.1451, 0.1621, 0.1847, 0.2319, 0.2514, 0.2738, 0.3074, 0.3152, 0.3445
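Returning to the paper-importance example above: the arithmetic on the slide can be traced in a few lines. The four-way even split of the 100 domain credits and the slide's probabilities are taken as given; everything else is illustrative.

```python
# Hedged numeric sketch of the credit-propagation example: 100 domain credits
# split evenly across four publications, then pub 1's key1 credit shared among
# its citations in proportion to each citation's P(key1 | citation).
domain_credit = 100
pubs = ["pub1", "pub2", "pub3", "pub4"]
base = domain_credit / len(pubs)                  # 25 credits per publication

p_key1_given_text = 0.6                           # P(key1 | text of pub 1)
p_key1_given_citation = {"cite1": 0.8, "cite2": 0.2}

key1_pub1 = base * p_key1_given_text              # 25 * 0.6 = 15
total = sum(p_key1_given_citation.values())
key1_per_citation = {
    c: key1_pub1 * p / total for c, p in p_key1_given_citation.items()
}
print(key1_pub1, key1_per_citation)               # 15.0 {'cite1': 12.0, 'cite2': 3.0}
```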
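The NDCG figures reported above (and the MAP/NDCG evaluation in the next section) rest on discounted cumulative gain over graded relevance. A minimal sketch, with illustrative grades matching the small ranked-output example that follows:

```python
# Minimal sketch: NDCG@k over graded relevance judgments.
import math

def dcg(gains):
    # gains are graded relevance values in rank order, discounted by log2(rank + 1)
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(gains, k):
    ideal_dcg = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

ranked_gains = [3, 2, 0, 0, 1, 0]      # grades of the returned items, in rank order
print(round(ndcg_at_k(ranked_gains, 5), 3))   # ~0.976
```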
Literature review citation recommendation
Input: a paper abstract. Output: a list of ranked citations. Evaluated with MAP and NDCG.
Given a paper abstract:
1. Word-level match (language model)
2. Topic-level match (KL-divergence)
3. Topic importance
Use an inference network to integrate each hypothesis.

Inference network: content match + topic match + publication topical prior → citation recommendation.
Publication topical prior options:
1. PageRank
2. Full-text PageRank (greedy match)
3. Full-text PageRank (topic modeling)

Example output for an input abstract: 1. [3] YES (3), 2. [2] YES (2), 3. [6] NO (0), 4. [8] NO (0), 5. [10] YES (1), 6. [1] NO (0), ……
MAP asks "cite or not?"; NDCG asks "how important is the citation?"

NDCG for citation recommendation based on the abstract (nDCG@10 / 30 / 50 / 100 / 300 / 500 / 1000 / 3000 / 5000 / all):
abstract_LM: 0.1183, 0.1424, 0.1586, 0.1774, 0.2022, 0.21, 0.2202, 0.2326, 0.2375, 0.2647
LM + PageRank: 0.174, 0.1951, 0.2091, 0.2297, 0.2539, 0.2615, 0.2711, 0.2847, 0.289, 0.3157
LM + Fulltext: 0.1712, 0.1929, 0.2088, 0.229, 0.252, 0.2607, 0.2704, 0.2832, 0.2886, 0.3148
LM + Fulltext + smoothing: 0.2015, 0.2281, 0.2447, 0.267, 0.2897, 0.2989, 0.3078, 0.3199, 0.3236, 0.3445
LM + Fulltext + LLDA: 0.2032, 0.2317, 0.246, 0.2648, 0.2907, 0.3007, 0.3108, 0.3241, 0.329, 0.347
LM + Fulltext + LLDA + smoothing: 0.2137, 0.2411, 0.2563, 0.2756, 0.3035, 0.3109, 0.3208, 0.3335, 0.3374, 0.355
(Based on topic inference: about 30 seconds; based on greedy match: about 1 second.)

Summary
• Information Retrieval
  • Index
  • Retrieval model
  • Ranking
  • User feedback
  • Evaluation
• Knowledge Retrieval
  • Machine learning
  • User knowledge integration
  • Social network analysis
Thank you!