Overview of Information Retrieval and Our Solutions
Qiang Yang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology, Hong Kong

Why Do We Need Information Retrieval (IR)?
- More and more information is available online (information overload)
- Many tasks rely on the effective management and exploitation of information
- Textual information plays an important role in our lives
- Effective text management directly improves productivity

What is IR?
- Narrow sense: IR = search-engine technologies (Google / Yahoo! / Live Search), or IR = text matching/classification
- Broad sense: IR = text information management:
  - How to find useful information? (information retrieval; e.g., Yahoo!)
  - How to organize information? (text classification; e.g., automatically assigning email to different folders)
  - How to discover knowledge from text? (text mining; e.g., discovering correlations among events)

Difficulties
- Huge amount of online data: Yahoo! had nearly 20 billion pages in its index as of the beginning of 2005
- Many different types of data: Web pages, emails, blogs, chat-room messages
- Ambiguous queries: short (2-4 words) and ambiguous ("apple", "bank", ...)

Our Solutions
- Query classification: champion of KDDCUP'05; TOIS (Vol. 24); SIGIR'06; SIGKDD Explorations (Vol. 7)
- Query expansion/suggestion: SIGIR'04; CIKM'04; ICDM'04; ICDE'06; WWW'06; IPM (2007); DMKD (Vol. 12)
- Document summarization: submission to SIGIR'07
- Web-page classification/clustering: submissions to SIGIR'07, AAAI'07, and KDD'07
- Entity resolution: SIGIR'05; IJCAI'07
- Analysis of blogs, emails, and chat-room messages: SIGIR'06; ICDM'06 (two papers); IJCAI'07

Outline
- Query classification (QC): introduction; Solution 1, query/category enrichment; Solution 2, bridging classifiers
- Entity resolution
- Summary of other work

Query Classification

Introduction
- Web queries are difficult to manage: they are short, ambiguous, and evolving
- Query classification (QC) helps us understand queries better, which benefits vertical search, re-ranking of search results, and online advertising
- Difficulties of QC (unlike ordinary text classification): how to represent queries; the target taxonomy is dynamic (e.g., an online-ads taxonomy); training data are difficult to collect

Problem Definition
- Inspired by the KDDCUP'05 competition
- Classify a query into a ranked list of categories
- Queries are collected from real search engines
- Target categories are organized in a tree, with each node being a category
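To make the task concrete before turning to related work, here is a toy sketch in Python. It is illustrative only: it is not the enrichment-based system described in these slides, and the taxonomy and keyword sets are invented for the example.

```python
def classify_query(query, taxonomy):
    """Rank taxonomy categories for a query by simple word overlap.

    `taxonomy` maps a category name to a set of descriptive keywords;
    both the names and the keyword sets are made up for this example.
    """
    query_words = set(query.lower().split())
    scored = [(name, len(query_words & keywords))
              for name, keywords in taxonomy.items()]
    # Keep only matching categories; KDDCUP'05 allowed up to five labels.
    return sorted([p for p in scored if p[1] > 0],
                  key=lambda pair: -pair[1])[:5]

toy_taxonomy = {
    "Computers\\Hardware": {"apple", "mac", "laptop", "hardware"},
    "Living\\Food": {"apple", "fruit", "recipe"},
}
print(classify_query("apple", toy_taxonomy))         # a tie: the query is ambiguous
print(classify_query("apple laptop", toy_taxonomy))  # the extra word disambiguates
```

Even this trivial baseline exhibits the difficulties listed above: a one-word query matches several categories equally well, and the overlap is empty whenever the query's words do not appear in the category descriptions, which is what the enrichment methods below are meant to address.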
Related Work: Document Classification
- Feature selection [Yang et al. 1997]
- Feature generation [Cai et al. 2003]
- Classification algorithms: Naive Bayes [McCallum and Nigam 1998], kNN [Yang 1999], SVM [Joachims 1999], and others
- An overall survey appears in [Sebastiani 2002]

Related Work: Query Classification/Clustering
- Classifying Web queries by geographical locality [Gravano 2003]
- Classifying queries according to their functional types [Kang 2003]
- Beitzel et al. studied topical classification, as we do, but relied on manually classified data [Beitzel 2005]
- Beeferman and Wen each worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001]

Related Work: Document/Query Expansion
- Borrowing text from extra data sources: hyperlinks [Glover 2002]; implicit links from query logs [Shen 2006]; existing taxonomies [Gabrilovich 2005]
- Query expansion [Manning 2007]: global methods are independent of the query; local methods use relevance feedback or pseudo-relevance feedback

Solutions
[Figure: Solution 1, query/category enrichment, connects queries directly to the target categories; Solution 2, the bridging classifier, connects them through an intermediate taxonomy.]

Solution 1: Query/Category Enrichment
- Assumptions and architecture
- Query enrichment
- Classifiers: synonym-based classifiers and statistical classifiers
- Experiments

Assumptions and Architecture
- Assumption 1: the intended meanings of Web queries are reflected by the Web
- Assumption 2: a set of objects exists that covers the target categories
[Figure: the architecture of our approach. Phase I (training): a search engine returns pages; the labels of the returned pages drive the construction of the synonym-based classifiers, and the text of the returned pages drives the construction of the statistical classifier. Phase II (testing): a query is classified by both kinds of classifiers, and their classified results are combined into the final results.]

Query Enrichment
- Textual information: title, snippet, full text
- Category information

Synonym-Based Classifiers
[Figure: a query retrieves pages 1-4, each carrying an intermediate category (C1^I-C4^I); category mapping takes these to target categories (C1^T-C3^T), yielding the final category C*.]

Synonym-Based Classifiers (cont.)
- Map by word matching: direct matching and extended matching
- Extended matching expands category terms with WordNet, e.g., "Hardware" → "Hardware; Device; Equipment"
- High precision, but low recall

Statistical Classifiers: SVM
- Apply the synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy
- Use the resulting <page, target category> pairs as training data
- Train SVM classifiers for the target categories

Statistical Classifiers: SVM — Advantages and Disadvantages
[Figure: circles and triangles denote crawled pages; the black ones are mapped to the two categories successfully, while the white ones fail to map.]
- If a query happens to be represented by the unmapped (white) pages, the synonym-based method cannot classify it, but the SVM can
- Recall can be higher, but precision may suffer
- Once the target taxonomy changes, the classifiers must be trained again

Putting Them Together: an Ensemble of Classifiers
- Why an ensemble? The two kinds of classifiers rest on different mechanisms, so they can complement each other, and a proper combination improves performance
- Combination strategies: EV (uses validation data) and EN (uses no validation data)

Experiments: Data Sets and Evaluation Criteria
- Queries are from KDDCUP 2005: 800,000 queries, 800 of them labeled by three human labelers
- Let A_i be the number of queries correctly tagged as c_i, B_i the number of queries tagged as c_i, and C_i the number of queries whose true category is c_i:

  Precision = (Σ_i A_i) / (Σ_i B_i)
  Recall = (Σ_i A_i) / (Σ_i C_i)
  F1 = 2 · Precision · Recall / (Precision + Recall)
  Overall F1 = (1/3) Σ_{i=1..3} (F1 against human labeler i)

Experiments: Quality of the Data Sets
- Consistency between labelers: the performance of each labeler is measured against the other labelers
[Figure: the distribution of the labels assigned by the three labelers.]
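The evaluation above is straightforward to express in code. The sketch below is my reading of the A/B/C definitions on the previous slide, not the official KDDCUP'05 scoring script; it assumes multi-label predictions stored as sets per query.

```python
def f1_against_labeler(predicted, gold):
    """F1 of multi-label predictions against one labeler.

    `predicted` and `gold` map each query id to a set of category
    labels and are assumed to cover the same query ids.
    """
    A = sum(len(predicted[q] & gold[q]) for q in gold)  # correctly tagged as c_i
    B = sum(len(predicted[q]) for q in gold)            # tagged as c_i
    C = sum(len(gold[q]) for q in gold)                 # truly of category c_i
    precision = A / B if B else 0.0
    recall = A / C if C else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def overall_f1(predicted, gold_by_labeler):
    """Average F1 against each of the three human labelers."""
    scores = [f1_against_labeler(predicted, gold) for gold in gold_by_labeler]
    return sum(scores) / len(scores)
```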
Experiment Results: Direct vs. Extended Matching
[Figure: the number of pages collected for training under the two mapping methods, and the F1 of the synonym-based classifier and of the SVM.]

Experiment Results: the Number of Assigned Labels
[Figure: precision, recall, and F1 of S1, S2, S3, SVM, EN, and EDP as the number of guessed labels varies from 1 to 6.]

Experiment Results: Effect of Base Classifiers
[Figure: the effect of the base classifiers.]

Solutions
[Figure: the same two-solution diagram as before, now highlighting Solution 2, the bridging classifier.]

Solution 2: Bridging Classifiers
- Our algorithm: the bridging classifier; category selection
- Experiments: data set and evaluation criteria; results and analysis

Algorithm: the Bridging Classifier
- Problem with Solution 1: the target taxonomy is fixed into the training, so training must be repeated whenever the taxonomy changes
- Goal: connect the target taxonomy and the queries by taking an intermediate taxonomy as a bridge

Algorithm: the Bridging Classifier (cont.)
How to connect? A query q is related to a target category C_i^T through the intermediate categories C_j^I:

  p(C_i^T | q) ∝ Σ_j p(C_i^T | C_j^I) · p(q | C_j^I) · p(C_j^I)

Here p(C_i^T | C_j^I) captures the relation between C_i^T and C_j^I, p(q | C_j^I) the relation between C_j^I and q, and p(C_j^I) is the prior probability of C_j^I; the left-hand side is the desired relation between q and C_i^T.

Understanding the Bridging Classifier
- Given q and C_i^T, the factors p(C_i^T | C_j^I) and p(q | C_j^I) are fixed, and p(C_j^I), which reflects the size of C_j^I, acts as a weighting factor
- The sum tends to be larger when q and C_i^T tend to belong to the same, smaller intermediate categories

Algorithm: Category Selection
- Category selection reduces the complexity of using the full intermediate taxonomy
- Two criteria: total probability (TP) and mutual information (MI)

Experiments: Data Sets and Evaluation Criteria
- Intermediate taxonomy: the ODP, with 1.5M Web pages in 172,565 categories
[Tables: the number of categories on different levels, and statistics of the numbers of documents in the categories on different levels.]

Experiments: Results of the Bridging Classifier
- All intermediate categories are used, with snippets only; the best result is achieved at n = 60
- Precision and F1 improve by 10.4% and 7.1%, respectively, compared with the two previous approaches

Experiments: Results of the Bridging Classifier (cont.)
- Performance of the bridging classifier with different granularities of the intermediate taxonomy: the best results are obtained when all intermediate categories are used
- Reason: a category of larger granularity may be a mixture of several target categories, so it cannot be used to distinguish among them

Experiments: Effect of Category Selection
- When around 18,000 categories are selected, the bridging classifier is comparable to, if not better than, the previous approaches
- MI works better than TP because it favors categories that better discriminate among the target categories
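For concreteness, the bridging score introduced above fits in a few lines of Python. All the numbers below are toy values; in the real system the three factors would be estimated from the intermediate taxonomy's documents (the estimation itself is not specified on these slides), so this is only a sketch of the scoring rule.

```python
import numpy as np

# p(C_i^T | C_j^I): rows = 5 intermediate categories, cols = 3 target ones.
p_t_given_i = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
])

# p(q | C_j^I): how strongly the query relates to each intermediate category.
p_q_given_i = np.array([0.02, 0.30, 0.01, 0.40, 0.05])

# p(C_j^I): prior of each intermediate category, reflecting its size.
p_i = np.array([0.1, 0.3, 0.2, 0.3, 0.1])

# p(C_i^T | q) ∝ sum_j p(C_i^T | C_j^I) * p(q | C_j^I) * p(C_j^I),
# computed for all target categories at once.
scores = (p_q_given_i * p_i) @ p_t_given_i
print(np.argsort(-scores), scores / scores.sum())  # ranked targets, normalized scores
```

Because the target taxonomy enters only through p(C_i^T | C_j^I), changing it means re-estimating that one matrix rather than retraining against the queries, which is the point of using the intermediate taxonomy as a bridge.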
Entity Resolution

Definition: Reference and Entity
- In the citation "Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006", "Tsz-Chiu Au" and "Dana S. Nau" are name references to author entities, and "ECAI 2006" is a venue reference to a journal/conference entity
- The same entities recur across citations, e.g., "Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006"
- Current author search: DBLP, CiteSeer, Google

Graphical Model
- We convert entity resolution into a graph-partition problem
- Each node denotes a reference; each edge denotes the relation between two references
- The key question is how to measure the relation between references, e.g., between "Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006" and "Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005"
[Figure: signals relating two references: plain-text similarity, shared authors, coauthors, research community, and research area.]

Features
- F1: title similarity
- F2: coauthor similarity
- F3: venue similarity
- F4: research community overlap
- F5: research area overlap

Research Community Overlap (A1 and A2 are two author-name references):
- F4.1: Similarity(A1, A2) = |Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))|
- F4.2: Similarity(A1, A2) = |Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2))|
- Coauthors(X) returns the coauthor-name set of each author in set X; Venues(Y) returns the venue-name set of each author in set Y

Research Area Overlap (V1 and V2 are two venue references):
- F5.1: Similarity(V1, V2) = |Authors(Articles(V1)) ∩ Authors(Articles(V2))|
- F5.2: Similarity(V1, V2) = |Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2)))|
- Authors(X) returns the author-name set of each article in set X; Articles(Y) returns the set of articles holding a reference to each element of set Y

System Framework
[Figure: the system framework maps feature similarities to probabilities.]

Experiment Results
- Our data set: 1,000 references to 20 author entities from DBLP
- Getoor's data sets: CiteSeer (2,892 author references to 1,165 author entities) and arXiv (58,515 references to 9,200 author entities)
- F1 = 97.0%
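To make the set-valued features concrete, here is a minimal sketch of F4.1 and F4.2 following the definitions above. The `coauthors_of` and `venues_of` dictionaries are hypothetical stand-ins for lookups into a citation database such as DBLP; they are not part of the slides.

```python
def coauthors(authors, coauthors_of):
    """Coauthors(X): union of the coauthor-name sets of each author in X."""
    result = set()
    for a in authors:
        result |= coauthors_of.get(a, set())
    return result

def venues(authors, venues_of):
    """Venues(Y): union of the venue-name sets of each author in Y."""
    result = set()
    for a in authors:
        result |= venues_of.get(a, set())
    return result

def f4_1(a1, a2, coauthors_of):
    """|Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))|."""
    c1 = coauthors(coauthors({a1}, coauthors_of), coauthors_of)
    c2 = coauthors(coauthors({a2}, coauthors_of), coauthors_of)
    return len(c1 & c2)

def f4_2(a1, a2, coauthors_of, venues_of):
    """|Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2))|."""
    v1 = venues(coauthors({a1}, coauthors_of), venues_of)
    v2 = venues(coauthors({a2}, coauthors_of), venues_of)
    return len(v1 & v2)

toy_coauthors = {
    "Dana S. Nau": {"Tsz-Chiu Au", "Ugur Kuter"},
    "Tsz-Chiu Au": {"Dana S. Nau"},
    "Ugur Kuter": {"Dana S. Nau"},
}
print(f4_1("Tsz-Chiu Au", "Ugur Kuter", toy_coauthors))
# -> 2: the two authors share a research community through Dana S. Nau.
```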
Summary of Other Work
- Summarization using conditional random fields (IJCAI'07)
- Thread detection in dynamic text-message streams (SIGIR'06)
- Implicit links for Web-page classification (WWW'06)
- Text classification improved by multigram models (CIKM'06)
- Latent friend mining from blog data (ICDM'06)
- Web-page classification through summarization (SIGIR'04)

Summarization Using Conditional Random Fields (IJCAI'07)
- Motivation/observation: a summary is produced by labeling the sentence sequence step by step
[Figure: in each of steps 1-3, sentences 1-6 are scanned and labeled.]
- Summarization is therefore cast as sequence labeling; the solution is a CRF: sentences x_{t-1}, x_t, x_{t+1} are observed, their labels y_{t-1}, y_t, y_{t+1} are unobserved, and the model is defined by feature functions and their parameters

Thread Detection in Dynamic Text-Message Streams (SIGIR'06)
- Representation: content-based and structure-based (sentence type; personal pronouns)
- Clustering over this representation

Implicit Links for Web-Page Classification (WWW'06)
- Implicit link 1 (LI1). Assumption: a user tends to click pages related to the issued query. Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query
- Implicit link 2 (LI2). Assumption: users tend to click related pages for the same query. Definition: there is an LI2 between d1 and d2 if they are clicked according to the same query

Text Classification Improved by Multigram Models (CIKM'06)
- Training stage: for each category, train an n-multigram model, then train an n-gram model on the segmented sequences
- Test stage: for each category, segment the test document and compute its probability under the corresponding n-gram model; assign the document the category under which its probability is largest

Latent Friend Mining from Blog Data (ICDM'06)
- Objective: one way to build Web communities is to find the people who share interests with a target person
- "Interest" is reflected by people's writings, and the writings come from their blogs
- These people may not know each other, so they are not explicitly linked as in previous studies

Latent Friend Mining from Blog Data (cont.)
Solutions:
- Cosine-similarity-based method: compute the cosine similarity between the contents of the blogs
- Topic-model-based method: find latent topics in the blogs using latent topic models and compute similarity at the topic level
- Two-level similarity-based method: first use an existing topic hierarchy to get the topic distribution of a blogger's posts, then apply a detailed similarity comparison

Web-Page Classification through Summarization (SIGIR'04)
- Summarizers: LUHN, LSA, page-layout analysis, a supervised summarizer, and their combination
[Figure: both the training set and the test set are summarized first; the classifier is trained on the training summaries and applied to the test summaries to produce the result.]

Thanks