Machine Learning for Information Retrieval
Rong Jin (Michigan State University) and Yi Zhang (University of California, Santa Cruz)

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions

Roadmap of Information Retrieval
- Retrieval applications (information access): search, filtering, summarization, visualization
- Mining/learning applications (knowledge acquisition): extraction, categorization, clustering, mining, data analysis
- Both kinds of applications are built on the same underlying data, which is why machine learning is important

Text Categorization
- Open Directory Project: the largest human-edited directory of the Web
- Manual classification of over 4 million sites into roughly 590K categories
- Need to automate the process

Document Clustering

Question Answering
- Classify questions, identify answers, and match questions with answers

Image Retrieval
- Image segmentation by data clustering

Image Retrieval by Key Points
- Key features ("visual words" b1, b2, ..., b8, ...) are obtained by data clustering of local image features

Image Retrieval by Text Query
- Automatically annotate images with textual words, then retrieve images with textual queries
- Key technique: classification, with each keyword treated as a different category

Information Extraction
- From free-style web-page text to a relational DB (Title: J2EE Developer; Length: 4 months; Salary; Location; Reference; ...)
- Structured prediction by Hidden Markov Models and Markov Random Fields

Citation/Link Analysis

Recommender Systems
- Example rating matrix (rows are users, columns are items, "?" marks a missing rating):

    User 1:  ?  5  3  4  2
    User 2:  4  1  5  ?  5
    User 3:  5  ?  4  2  5
    User 4:  1  5  3  5  ?

- Sparse data problem: a lot of missing values
- Fill in the sparse matrix by data clustering: group users into classes and movies into types, and describe each (user class, movie type) cell by a rating distribution such as p(4)=1/4, p(5)=3/4 or p(1)=1/2, p(2)=1/2

One More Reason for ML
- The $1,000,000 Netflix Prize award

Review of Basic Probability Concepts
- Probability Pr(A): "the fraction of possible worlds in which A is true"; the event space of all possible worlds has area 1
- Examples: A = your paper will be accepted by SIGIR 2008; A = it rains in Singapore; A = a document contains the word "IR"

Conditional Probability
- SIGIR2008 = "a document contains the phrase SIGIR 2008"; SINGAPORE = "a document contains the word Singapore"
- P(SINGAPORE) = 0.000001, P(SIGIR2008) = 0.00000001, P(SINGAPORE | SIGIR2008) = 1/2
- "Singapore" is rare and "SIGIR 2008" is rarer, but if a document contains "SIGIR 2008", there is a 50-50 chance it also contains the word "Singapore"

Conditional Probability
- Definition: Pr(A|B) = Pr(A, B) / Pr(B)
- Chain rule: Pr(A, B) = Pr(B) Pr(A|B)
- Independent variables: Pr(A|B) = Pr(A), so Pr(A, B) = Pr(B) Pr(A)
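Where the conditional-probability definition above is applied to documents, it reduces to ratios of document counts. A minimal sketch, using a made-up toy corpus purely for illustration:

```python
# Minimal sketch: estimating P(SINGAPORE | SIGIR2008) from document counts.
# The corpus below is a toy, hypothetical collection; the point is only that
# Pr(A|B) = Pr(A,B) / Pr(B) reduces to a ratio of document counts.
docs = [
    "sigir 2008 will be held in singapore",
    "sigir 2008 accepted papers announced",
    "it rains a lot in singapore",
    "machine learning for information retrieval",
]

def pr(predicate):
    """Fraction of documents for which the predicate is true."""
    return sum(1 for d in docs if predicate(d)) / len(docs)

p_b = pr(lambda d: "sigir 2008" in d)                        # Pr(SIGIR2008)
p_ab = pr(lambda d: "sigir 2008" in d and "singapore" in d)  # Pr(SIGIR2008, SINGAPORE)
print("Pr(SINGAPORE | SIGIR2008) =", p_ab / p_b)             # 0.5 on this toy corpus
```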
Conditional Probability
- Marginal probability: Pr(B) = sum_{j=1}^{k} Pr(B, A = a_j)

Bayes' Rule
- Posterior ∝ Prior × Likelihood: Pr(H|E) ∝ Pr(H) × Pr(E|H)
- Information: Pr(E|H); inference: Pr(H|E); H is the hypothesis, E is the evidence

Bayes' Rule (example)
- R: it rains; W: the grass is wet
- The information is Pr(W|R); the inference is Pr(R|W)

    Pr(W|R)    R      not R
    W          0.7    0.4
    not W      0.3    0.6

Statistical Inference
- Pr(H|E) ∝ Pr(H) × Pr(E|H)
- Learning stage: fit a parametric model for Pr(E|H)
- Inference stage: for a given observation E, compute Pr(H|E) for each hypothesis H and choose the hypothesis with the largest Pr(H|E)

Example: Language Model (LM) for IR
- Evidence E: the query q = 'Singapore SIGIR'; hypotheses H: the documents d1, ..., d1000, with prior Pr(H)
- Learning: estimate statistics (a language model) for each document
- Inference: estimate the likelihood p(q|d), i.e., Pr(E|H), and combine it with the prior to obtain Pr(H|E)

Probability Distributions
- Binomial distribution and its conjugate prior, the Beta distribution
- Multinomial distribution and its conjugate prior, the Dirichlet distribution (language models, LM smoothing)
- Gaussian distribution and Laplacian distribution (a Laplacian prior yields sparse solutions, i.e., the L1 regularizer)

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions

Supervised Learning: Basic Setting
- Given training data {(x1, y1), (x2, y2), ..., (xN, yN)}
- Learning: infer a function f(x) from the training data
- Inference: predict future outcomes y = f(x) given x
- Regression (continuous y): e.g., f(x) = ax - b

Supervised Learning: Basic Setting
- Classification (discrete y): x = (x1, x2), y = +1 or -1
- Decision boundary w^T x - b = 0; classifier f(x) = sign(w^T x - b)

Examples
- Text categorization: input x is a word histogram; output y is a document category (e.g., 1 for "domestic economics", 2 for "politics", 3 for "sports", and 4 for "others")
- Question answering (classifying question types): input x is a parse tree of a question; output y is a question type (e.g., when, where, ...)

K Nearest-Neighbor (KNN) Classifiers
- For an unknown record: compute its distance to the training documents, identify the k nearest neighbors, and determine the class of the unknown point from the class labels of its closest neighbors
- (Based on Tan, Steinbach, Kumar)

K Nearest-Neighbor (KNN) Classifiers
- Distance between two points: Euclidean distance, cosine distance, Kullback-Leibler distance, Bregman distance (generated by a convex function), ...
- The distance function can also be learned from data (distance metric learning)
- Determining the class: majority vote or weighted majority vote

K Nearest-Neighbor (KNN) Classifiers
- Deciding K (the number of nearest neighbors) is a bias-variance tradeoff
- Use cross validation (or leave-one-out) on training and validation sets (e.g., k=1 vs. k=4)

K Nearest-Neighbor (KNN) Classifiers
- Curse of dimensionality: many attributes are irrelevant, and in high dimensions distances become less informative
- (Figure: distribution of squared distances for 1000 random data points in 1000 dimensions)
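To make the KNN procedure above concrete, here is a minimal NumPy sketch of a cosine-distance KNN classifier with majority vote; the toy word-histogram data and the choice of k are assumptions for illustration only.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    under cosine distance (1 - cosine similarity)."""
    X = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    q = x / np.linalg.norm(x)
    dist = 1.0 - X @ q                      # cosine distance to every training point
    neighbors = np.argsort(dist)[:k]        # indices of the k closest points
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return labels[np.argmax(counts)]        # majority vote

# Toy word-histogram data: two classes over a 4-word vocabulary.
X_train = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 3, 1, 2]], dtype=float)
y_train = np.array([+1, +1, -1, -1])
print(knn_predict(X_train, y_train, np.array([1, 0, 2, 0], dtype=float), k=3))
```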
KNN for Collaborative Filtering
- Collaborative filtering: will user u like item b?
- Assumption: users with similar tastes are likely to have similar preferences on items
- Make filtering decisions for one user based on the feedback from other users who are similar to this user
- Example rating matrix (predict User 2's missing rating from similar users; here the prediction is 5):

    User 1:  1  5  3  4  3
    User 2:  4  2  1  ?  5
    User 3:  3  2  5  5  4

- The similarity measure of user interests can itself be learned

Paradigm for Supervised Learning
- Gather training data
- Determine the input features, i.e., what is x? (e.g., for text categorization, bags of words); feature engineering is very, very, very important
- Determine the functional form of f(x): linear or nonlinear, probabilistic or non-probabilistic (what is the functional form of KNN?)
- Determine the learning algorithm and learn the optimal parameters (optimization, cross validation)
- Test on a test set

Bayesian Learning
- Bayes' rule: Pr(H|E) ∝ Pr(H) × Pr(E|H)
- Hypothesis space H = {Y1, Y2, ...}
- MAP learning (Maximum A Posteriori): Y* = argmax_{Y in H} Pr(Y|X) = argmax_{Y in H} Pr(Y) Pr(X|Y)
- MLE learning (Maximum Likelihood Estimation): drop the prior and maximize the likelihood Pr(X|Y) alone

Bayesian Learning: Conjugate Prior
- A prior Pr(Y) is conjugate when the posterior Pr(Y|X) has the same form as the prior
- E.g., the Dirichlet distribution is the conjugate prior of the multinomial distribution (widely used in language models)

Example: Text Categorization
- Y* = argmax_{Y in H} Pr(Y) Pr(X|Y): is this web page a professor's or a student's?
- What is Y? What is the feature X? How do we estimate Pr(Y=Student) or Pr(Y=Prof.)? How do we estimate Pr(w|Y)?
- Counting! Counting = MLE; counting + pseudo-counts = MAP

Naive Bayes
- Vocabulary [w1, w2, ..., wV]; document X = (x1, x2, ..., xV), where x_i is the count of word w_i
- From word probabilities Pr(w|Y) to the document likelihood: Pr(X|Y) ≈ [Pr(w1|Y)]^{x1} ··· [Pr(wV|Y)]^{xV}
- Decision function (P = professor, S = student):
    f(X) = log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)]
         = log Pr(Y=P)/Pr(Y=S) + x1 log Pr(w1|Y=P)/Pr(w1|Y=S) + ... + xV log Pr(wV|Y=P)/Pr(wV|Y=S)
- The first term acts as a threshold constant; the log ratios act as weights for the words

Naive Bayes: A Linear Classifier
- The decision function above is linear in the word counts: f(x) = sign(w^T x - b)
- Logistic regression instead models f(x), i.e., Pr(Y|X), directly

Logistic Regression (LR)
- Replace the Naive Bayes log odds by a linear function with free parameters:
    log [Pr(X|Y=P) Pr(Y=P)] / [Pr(X|Y=S) Pr(Y=S)] = b + t1 x1 + ... + tV xV
- t1, ..., tV are unknown weights learned from data by maximum likelihood estimation (MLE)
- Pr(y = ±1 | X) = 1 / (1 + exp[-y(t1 x1 + ... + tV xV + b)])

Logistic Regression (LR)
- Learning the parameters b, t1, ..., tV by MLE:
    (t*, b*) = argmax_{t, b} sum_{i=1}^{N} log Pr(y_i | X_i; t, b)

Logistic Regression (LR)
- Overfitting (why only word weights?): plain MLE can give worse performance
- Adding a prior over the weights turns Maximum Likelihood Estimation into Maximum A Posteriori estimation:
    (t*, b*) = argmax_{t, b} sum_{i=1}^{N} log Pr(y_i | X_i; t, b) + log Pr(t)
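A minimal sketch of the counting view of Naive Bayes described above: raw word counts give the MLE, and adding pseudo-counts gives the MAP estimate. The two-class toy data and the value of the pseudo-count are assumptions for illustration.

```python
import numpy as np

def train_nb(X, y, pseudo=1.0):
    """Multinomial Naive Bayes by counting.
    pseudo=0 is plain MLE; pseudo>0 adds pseudo-counts (MAP with a Dirichlet prior)."""
    classes = np.unique(y)
    priors, word_probs = {}, {}
    for c in classes:
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)                  # Pr(Y=c)
        counts = Xc.sum(axis=0) + pseudo              # word counts + pseudo-counts
        word_probs[c] = counts / counts.sum()         # Pr(w|Y=c)
    return classes, priors, word_probs

def predict_nb(model, x):
    classes, priors, word_probs = model
    # log Pr(Y=c) + sum_i x_i log Pr(w_i|Y=c): linear in the word counts x
    scores = [np.log(priors[c]) + x @ np.log(word_probs[c]) for c in classes]
    return classes[int(np.argmax(scores))]

# Toy word-count matrix over a 4-word vocabulary; labels: +1 = professor, -1 = student.
X = np.array([[4, 1, 0, 0], [3, 2, 1, 0], [0, 1, 3, 2], [0, 0, 2, 4]], dtype=float)
y = np.array([+1, +1, -1, -1])
model = train_nb(X, y, pseudo=1.0)
print(predict_nb(model, np.array([2, 1, 0, 1], dtype=float)))
```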
Learning Logistic Regression
- Equivalently, minimize the negative log-likelihood:
    (t*, b*) = argmin_{t, b} sum_{i=1}^{N} -log Pr(y_i | X_i; t, b)
- With Pr(y = ±1 | X) = 1 / (1 + exp[-y f(X)]), the term -log Pr(y|X) = log(1 + exp[-y f(X)]) is a loss function measuring the mismatch between y and f(X)
- Other loss functions are possible

Logistic Regression (LR)
- Closely related to Maximum Entropy (ME): logistic regression is the dual of maximum entropy
- Advantages of LR: a Bayesian approach, convenient for incorporating prior knowledge, and useful for semi-supervised learning, transfer learning, ...

Comparison of Classifiers (Li & Yang, ICML 2003)

    Classifier            Micro F1   Macro F1
    KNN                   0.8557     0.5975
    Naive Bayes           0.8009     0.4737
    Logistic Regression   0.8748     0.6084

Comparison of Classifiers
- Logistic regression: models Pr(Y|X), i.e., the decision boundary; requires a numerical solution; needs a large number of training examples and converges slowly
- Naive Bayes: models Pr(X|Y) and Pr(Y), i.e., the input patterns X; has a simple solution; works with a small number of training examples and converges quickly
- Naive Bayes is a special case of logistic regression

Comparison of Classifiers: Rule of Thumb
- Discriminative models: model Pr(Y|X), i.e., the decision boundary; broader model assumptions; require a numerical solution; need a large number of training examples, slower convergence
- Generative models: model Pr(X|Y) and Pr(Y), i.e., the input patterns; simple solutions; work with a small number of training examples, faster convergence
- Use a discriminative model if there are enough training examples, enough computational power, and classification accuracy is important
- Use a generative model if training examples are scarce, computational power is limited, training time matters more, or you need a quick test

Comparison of Classifiers
- What about KNN? The same discriminative vs. generative criteria apply: which side does KNN fall on?
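To ground the loss-function and MLE-vs-MAP discussion above, here is a minimal NumPy sketch of logistic regression trained by gradient descent on the log loss, with an optional L2 penalty standing in for the prior Pr(t); the learning rate, penalty strength, and toy data are illustrative assumptions.

```python
import numpy as np

def train_lr(X, y, reg=0.1, lr=0.1, iters=500):
    """Logistic regression: minimize mean_i log(1 + exp(-y_i f(x_i))) + reg * ||t||^2,
    where f(x) = t.x + b.  reg=0 is MLE; reg>0 is the MAP / regularized version."""
    n, d = X.shape
    t, b = np.zeros(d), 0.0
    for _ in range(iters):
        margin = y * (X @ t + b)
        g = -y / (1.0 + np.exp(margin))          # d loss / d f(x_i)
        t -= lr * (X.T @ g / n + 2 * reg * t)    # gradient step on weights (+ L2 term)
        b -= lr * g.mean()                       # gradient step on bias
    return t, b

def predict_lr(t, b, X):
    return np.where(X @ t + b > 0, +1, -1)

X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 3, 1, 2]], dtype=float)
y = np.array([+1, +1, -1, -1])
t, b = train_lr(X, y, reg=0.1)
print(predict_lr(t, b, X))   # recovers the training labels on this separable toy set
```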
Other Discriminative Classifiers
- Decision tree: an aggregation of decision rules organized as a tree; easy to interpret
- Support vector machine: a maximum-margin classifier; among the best text classifiers

Comparison of Classifiers (Li & Yang, ICML 2003)

    Classifier               Micro F1   Macro F1
    KNN                      0.8557     0.5975
    Naive Bayes              0.8009     0.4737
    Logistic Regression      0.8748     0.6084
    Support Vector Machine   0.8857     0.5975

Ensemble Learning
- Generate multiple classifiers and classify by (weighted) majority vote
- Bagging and boosting: train each classifier h1, ..., hk on a different sampling D1, ..., Dk of the training data D

Ensemble Learning
- Bias-variance tradeoff: bagging reduces variance, boosting reduces bias
- (Figure: majority vote over 50 decision trees; error decomposed into a variance component and a bias component)

Multi-Class Classification
- More than 2 classes; multiple labels may be assigned to each example
- Approaches: one-against-all (one binary classifier f_k(X) per class c_k), ECOC coding (each class is assigned a binary codeword, and the number of binary classifiers equals the number of coding bits), and transfer learning

Beyond Vector Inputs
- Sequences: gene sequence classification
- Trees: question type classification
- Graphs: character recognition

Beyond Vector Inputs: Kernels
- A kernel function k(x1, x2) assesses the similarity between two objects x1 and x2 without having to represent them as vectors
- Vector representation via the kernel: given training examples x1, ..., xN, any example x can be represented by the vector [k(x1, x), k(x2, x), ..., k(xN, x)] (related to the representer theorem)
- Sequences: string kernels; trees: tree kernels; graphs: graph kernels

Kernels for Nonlinear Classifiers

Keywords Associated with Kernels
- Reproducing Kernel Hilbert Space (RKHS), Mercer's conditions, vector representation, good kernels, representer theorem, kernel learning (e.g., multiple kernel learning)
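As a concrete illustration of the kernel-based vector representation mentioned above, the sketch below evaluates a Gaussian (RBF) kernel between word-count vectors and represents a new example by its kernel values against the training set; the kernel choice and bandwidth are assumptions for illustration, not something prescribed by the tutorial.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian (RBF) kernel: similarity decays with squared Euclidean distance."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_representation(X_train, x, kernel=rbf_kernel):
    """Represent an arbitrary example x by [k(x_1, x), ..., k(x_N, x)],
    i.e., by its similarities to the N training examples."""
    return np.array([kernel(xi, x) for xi in X_train])

X_train = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 3, 1, 2]], dtype=float)
x_new = np.array([1, 0, 2, 0], dtype=float)
print(kernel_representation(X_train, x_new))   # a length-4 feature vector for x_new
```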
Sequence Prediction
- Part-of-speech tagging: [He] [reckons] [the] [current] [account] [deficit] maps to [PRP] [VBZ] [DT] [JJ] [NN] [NN]
- The taggings are all related: instead of Pr(NN | "account") alone, we want Pr(NN | "account", tag-for-"current")
- Models: Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Maximum Margin Markov Networks (M3N)

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to IR
- Semi-supervised learning and its application to IR
- Emerging research directions

Topics of Semi-supervised Learning
- Introduction to semi-supervised learning
- Basics of semi-supervised learning
- Semi-supervised classification algorithms: label propagation, graph-partitioning-based approaches, Transductive Support Vector Machines (TSVM), co-training
- Semi-supervised data clustering

Spectrum of Learning Problems

What is Semi-supervised Learning
- Learning from a mixture of labeled and unlabeled examples
- Labeled data L = {(x1, y1), ..., (x_{n_l}, y_{n_l})}; unlabeled data U = {x1, ..., x_{n_u}}
- Total number of examples N = n_l + n_u; goal: learn f(x): X -> Y

Why Semi-supervised Learning?
- Labeling is expensive and difficult, and can be unreliable (e.g., segmentation applications may need multiple experts)
- Unlabeled examples are easy to obtain in large numbers (web pages, text documents, etc.)

Semi-supervised Learning Problems
- Classification: transductive (predict labels of the unlabeled data) vs. inductive (learn a classification function)
- Clustering (constrained clustering)
- Ranking (semi-supervised ranking)
- Almost every learning problem has a semi-supervised counterpart

Next topic: Basics of semi-supervised learning

Why Unlabeled Data Could Be Helpful
- Clustering assumption: unlabeled data help decide the decision boundary f(X) = 0
- Manifold assumption: unlabeled data help decide the decision function f(X)

Clustering Assumption
- Points with the same label are connected through high-density regions, thereby defining a cluster
- Clusters are separated by low-density regions
- This suggests a simple algorithm for semi-supervised learning
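The graph-based semi-supervised methods in the following sections all start from a similarity graph built over the labeled and unlabeled points together. A minimal sketch of one common construction, assuming a Gaussian similarity and k-nearest-neighbor sparsification (neither is the only possible choice):

```python
import numpy as np

def knn_similarity_graph(X, k=5, gamma=1.0):
    """Build a symmetric similarity matrix W and degree matrix D over all points
    (labeled and unlabeled).  W_ij = exp(-gamma * ||x_i - x_j||^2) if j is among
    the k most similar points to i (or vice versa), and 0 otherwise; W_ii = 0."""
    n = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-gamma * sq_dist)
    np.fill_diagonal(S, 0.0)
    W = np.zeros_like(S)
    for i in range(n):
        nbrs = np.argsort(-S[i])[:k]        # indices of the k most similar points
        W[i, nbrs] = S[i, nbrs]
    W = np.maximum(W, W.T)                  # symmetrize
    D = np.diag(W.sum(axis=1))              # degree matrix, d_i = sum_j W_ij
    return W, D

X = np.random.RandomState(0).randn(20, 2)   # 20 toy points in 2-D
W, D = knn_similarity_graph(X, k=3)
```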
Manifold Assumption
- Graph representation: each vertex is a training example (labeled or unlabeled); edges connect similar examples
- Regularize the classification function f(x): if x1 and x2 are connected, then |f(x1) - f(x2)| should be small
- Manifold assumption: the data lie on a low-dimensional manifold, and the classification function f(x) should "follow" the data manifold

Statistical View
- Generative model for classification: Pr(X, Y | θ, η) = Pr(X | Y; θ) Pr(Y | η)
- Unlabeled data help estimate Pr(X | Y; θ): the clustering assumption
- Discriminative model for classification: Pr(X, Y | θ, μ) = Pr(X | μ) Pr(Y | X; θ)
- Unlabeled data help regularize θ via a prior Pr(θ | X): the manifold assumption

Next topic: Label Propagation

Label Propagation: Key Idea
- A decision boundary based on the labeled examples alone cannot take into account the layout of the data points
- How can we incorporate the data distribution into the prediction of class labels?
- Connect the data points that are close to each other, and propagate the class labels over the connected graph
- This is different from K nearest neighbors

Label Propagation: Representation
- Adjacency matrix W in {0,1}^{N×N}: W_ij = 1 if x_i and x_j are connected, 0 otherwise
- Similarity matrix W in R_+^{N×N}: W_ij is the similarity between x_i and x_j
- Degree matrix D = diag(d1, ..., dN), with d_i = sum_{j≠i} W_ij

Label Propagation: Representation
- Given label information y_l = (y1, ..., y_{n_l}) in {-1, +1}^{n_l} for the labeled data and unknown y_u = (y1, ..., y_{n_u}) in {-1, +1}^{n_u} for the unlabeled data; y = (y_l, y_u)

Label Propagation
- Initial class assignments ŷ in {-1, 0, +1}^N: ŷ_i = ±1 if x_i is labeled, 0 if x_i is unlabeled
- Predicted confidence scores f = (f1, ..., fN) in R^N; predicted class assignments y_i = +1 if f_i > 0, and -1 if f_i ≤ 0

Label Propagation (II)
- One round of propagation (a weighted-KNN-like step), with α the weight of each propagation step:
    f_i = ŷ_i                        if x_i is labeled
    f_i = α sum_j W_ij ŷ_j           otherwise
- In matrix form: f^1 = ŷ + α W ŷ
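A minimal sketch of the propagation step just described, iterated for a fixed number of rounds; the similarity matrix W is assumed to come from a construction like the graph sketch earlier, and the values of α and the number of rounds are illustrative choices.

```python
import numpy as np

def label_propagation(W, y_hat, alpha=0.1, rounds=20):
    """Iterative label propagation: f^k = y_hat + sum_{i=1..k} alpha^i W^i y_hat.
    y_hat holds +1/-1 for labeled points and 0 for unlabeled points."""
    f = y_hat.astype(float)
    spread = y_hat.astype(float)
    for _ in range(rounds):
        spread = alpha * (W @ spread)    # one more round of propagation
        f = f + spread
    return np.where(f > 0, +1, -1)       # threshold confidence scores into labels

# Example usage (W from the earlier graph sketch; two labeled points, rest unlabeled):
# y_hat = np.array([+1, -1] + [0] * (W.shape[0] - 2))
# labels = label_propagation(W, y_hat)
```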
Label Propagation (II)
- Two rounds of propagation: f^2 = f^1 + α W f^1 = ŷ + α W ŷ + α^2 W^2 ŷ
- After k rounds: f^k = ŷ + sum_{i=1}^{k} α^i W^i ŷ
- In the limit of infinitely many rounds, the series sums to a matrix inverse:
    f = (I - α W_norm)^{-1} ŷ
  where W_norm = D^{-1/2} W D^{-1/2} is the normalized similarity matrix

Local and Global Consistency (Zhou et al., NIPS 2003)
- Local consistency: like KNN
- Global consistency: beyond KNN

Summary: Label Propagation
- Construct a graph using pairwise similarities and propagate class labels along the graph
- Key parameters: α (the decay of propagation) and W (the similarity matrix)
- Computational complexity: the matrix inverse in f = (I - α W_norm)^{-1} ŷ is O(n^3); speedups include Cholesky decomposition and clustering

Questions
- Does label propagation rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?

Application: Text Classification (Zhou et al., NIPS 2003)
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec
- Pre-processing: stemming, removal of stopwords and rare words, skipping headers; 3970 documents, 8014 words
- Propagation compared against SVM and KNN

Application: Image Retrieval (Wang et al., ACM MM 2004)
- 5,000 images; relevance feedback on the top 20 ranked images
- Cast as a classification problem (relevant or not?), with f(x) the degree of relevance
- Learning the relevance function f(x): supervised learning (SVM) vs. label propagation

Next topic: Graph Partitioning Based Approaches

Graph Partitioning
- Classification as graph partitioning: search for a classification boundary that is consistent with the labeled examples and yields a partition with a small graph cut (e.g., prefer a partition with cut 1 over one with cut 2)

Min-cuts for Semi-supervised Learning (Blum and Chawla, ICML 2001)
- Add two extra nodes, a source V+ and a sink V-, connected to the labeled examples with infinite weights
- Compute the minimum cut; high computational cost

Harmonic Function (Zhu et al., ICML 2003)
- Weight matrix W, with w_ij ≥ 0 the similarity between x_i and x_j
- Membership vector f = (f1, ..., fN), with f_i = +1 if x_i is in cluster A and f_i = -1 if x_i is in cluster B

Harmonic Function (cont'd)
- Graph cut for a ±1 membership vector f:
    C(f) = (1/4) f^T (D - W) f = (1/4) f^T L f
  (each cross-cluster edge contributes its weight, since (f_i - f_j)^2 / 4 indicates that x_i and x_j fall on different sides)
- D = diag(d1, ..., dN) is the degree matrix, with diagonal elements d_i = sum_{j≠i} W_ij
- Graph Laplacian L = D - W: captures the pairwise relationships among data points and the manifold geometry of the data
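A minimal sketch of the graph-cut quantity just defined: given a similarity matrix W and a candidate ±1 labeling f, it forms the Laplacian L = D - W and evaluates (1/4) f^T L f, which equals the total weight of the edges cut. The toy graph below is an assumption for illustration.

```python
import numpy as np

def graph_cut_value(W, f):
    """Return C(f) = (1/4) f^T (D - W) f for a +1/-1 membership vector f."""
    D = np.diag(W.sum(axis=1))           # degree matrix
    L = D - W                            # graph Laplacian
    return 0.25 * f @ L @ f

# Toy graph: points 0,1 form one cluster, points 2,3 the other, one weak cross edge.
W = np.array([[0, 2, 0, 0],
              [2, 0, 1, 0],
              [0, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)
f = np.array([+1, +1, -1, -1], dtype=float)
print(graph_cut_value(W, f))             # 1.0: only the weight-1 edge (1, 2) is cut
```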
Harmonic Function
- Objective: min_{f in {-1,+1}^N} C(f) = (1/4) f^T L f, subject to f_i = y_i for 1 ≤ i ≤ n_l
- The objective enforces consistency with the graph structure; the constraint enforces consistency with the labeled data
- Challenge: the discrete space makes this a combinatorial optimization problem

Harmonic Function
- Relaxation: replace {-1, +1} by continuous real numbers:
    min_{f in R^N} C(f) = (1/4) f^T L f, subject to f_i = y_i, 1 ≤ i ≤ n_l
- Afterwards, convert the continuous f back to binary labels

Harmonic Function
- Partition the Laplacian and the solution into labeled and unlabeled blocks:
    L = [ L_ll  L_lu ]
        [ L_ul  L_uu ],   f = (f_l, f_u)
- The minimizer is f_u = -L_uu^{-1} L_ul y_l

Harmonic Function
- The solution can be computed by local propagation or, equivalently, viewed as a global propagation over the graph. Sound familiar? Compare with label propagation.

Spectral Graph Transducer (Joachims, 2003)
- Soften the hard labeling constraints into a penalty:
    min_{f in R^N} C(f) = (1/4) f^T L f + α sum_{i=1}^{n_l} (f_i - y_i)^2
- Solved as a constrained eigenvector problem under the normalization constraint sum_{i=1}^{N} f_i^2 = N

Manifold Regularization (Belkin et al., 2006)
- Replace the squared error by a general misclassification loss l(f(x_i), y_i) and regularize the norm of the classifier:
    min_{f in R^N}  f^T L f + α sum_{i=1}^{n_l} l(f(x_i), y_i) + γ |f|^2_{H_K}
- f^T L f is the manifold regularizer; the last term regularizes the norm of the classifier in the RKHS

Summary: Graph Partitioning Approaches
- Construct a graph using pairwise similarity; the key quantity is the graph Laplacian, which captures the geometry of the graph
- The decision boundary is made consistent with both the graph structure and the labeled examples
- Parameters: α, γ, and the similarity measure

Questions
- Do these methods rely on the cluster assumption or the manifold assumption?
- Are they transductive (predict classes for the unlabeled data) or inductive (learn a classification function)?

Application: Text Classification
- 20-newsgroups: autos, motorcycles, baseball, and hockey under rec; stemming, removal of stopwords and rare words, skipping headers; 3970 documents, 8014 words
- Methods compared: SVM, KNN, propagation, and the harmonic function
- Evaluation by PRBEP (precision-recall break-even point); the spectral graph transducer (SGT) shows improvements in PRBEP
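A minimal NumPy sketch of the closed-form harmonic solution above: partition the Laplacian into labeled and unlabeled blocks and solve for the scores of the unlabeled points. It assumes the labeled points are ordered first and that W comes from a construction like the earlier graph sketch.

```python
import numpy as np

def harmonic_solution(W, y_l):
    """Harmonic function: f_u = -L_uu^{-1} L_ul y_l, with labeled points first.
    W: (N, N) similarity matrix; y_l: (n_l,) vector of +1/-1 labels."""
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W
    L_uu = L[n_l:, n_l:]
    L_ul = L[n_l:, :n_l]
    f_u = -np.linalg.solve(L_uu, L_ul @ y_l)  # continuous scores on unlabeled points
    return np.where(f_u > 0, +1, -1), f_u     # thresholded labels and raw scores

# Toy graph: 2 labeled points (one per class) and 2 unlabeled points.
W = np.array([[0.0, 0.1, 2.0, 0.1],
              [0.1, 0.0, 0.1, 2.0],
              [2.0, 0.1, 0.0, 0.1],
              [0.1, 2.0, 0.1, 0.0]])
labels, scores = harmonic_solution(W, np.array([+1.0, -1.0]))
print(labels)   # [+1, -1]: each unlabeled point follows its strongly connected label
```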
Next topic: Transductive Support Vector Machine (TSVM)

Transductive SVM
- A support vector machine maximizes the classification margin; with only a small number of labeled examples, the decision boundary is determined by those few points
- How should the decision boundary change when both labeled and unlabeled examples are given? Move the decision boundary to a region of low local density

Transductive SVM
- Let ω(X, y; f) denote the classification margin of the classification function f(x)
- Supervised learning: f* = argmax_{f in H_K} ω(X, y; f)
- Semi-supervised learning: optimize over both f(x) and the unknown labels y_u:
    f* = argmax_{f in H_K, y_u in {-1,+1}^{n_u}} ω(X, y_l, y_u; f)

Transductive SVM: Formulation
- Original SVM (labeled examples only):
    {w*, b*} = argmin_{w, b} w^T w
    s.t. y_i (w^T x_i - b) ≥ 1,  i = 1, ..., n   (labeled examples)
- Transductive SVM: introduce a binary label variable for each unlabeled example and add the corresponding constraints:
    {w*, b*} = argmin_{y_{n+1}, ..., y_{n+m}} argmin_{w, b} w^T w
    s.t. y_i (w^T x_i - b) ≥ 1,            i = 1, ..., n   (labeled examples)
         y_{n+j} (w^T x_{n+j} - b) ≥ 1,    j = 1, ..., m   (unlabeled examples)

Computational Issue
- With slack variables for the labeled and unlabeled examples, the objective becomes w^T w plus weighted sums of the slacks, but the problem is no longer a convex optimization problem
- A common strategy is alternating optimization: fix the labels y_u and solve for (w, b), then fix (w, b) and re-assign y_u

Summary: Transductive SVM
- Based on the maximum margin principle; the classification margin is decided by both the labeled examples and the class labels assigned to the unlabeled data
- High computational cost
- Variants: Low Density Separation (LDS), Semi-Supervised Support Vector Machines (S3VM), TSVM

Questions
- Does TSVM rely on the cluster assumption or the manifold assumption?
- Is it transductive (predicts classes for the unlabeled data) or inductive (learns a classification function)?

Text Classification by TSVM
- 10 categories from the Reuters collection; 3299 test documents; 1000 informative words selected by the mutual information (MI) criterion
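The sketch below illustrates only the alternating-optimization idea mentioned above, not the full TSVM algorithm (which additionally swaps pairs of labels and enforces a class-balance constraint): alternate between fitting an SVM on the labeled data plus the current guesses for the unlabeled labels, and re-assigning those guesses from the decision function. The use of scikit-learn's LinearSVC and the weighting scheme are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_sketch(X_l, y_l, X_u, rounds=10, C_l=1.0, C_u=0.1):
    """Heavily simplified alternating optimization in the spirit of a transductive SVM:
    (a) fit an SVM on labeled data plus currently guessed unlabeled labels, with the
    unlabeled examples down-weighted; (b) re-assign the unlabeled labels from the
    decision function; repeat until the guesses stop changing."""
    clf = LinearSVC(C=C_l).fit(X_l, y_l)
    y_u = np.where(clf.decision_function(X_u) > 0, +1, -1)   # initial guess
    X_all = np.vstack([X_l, X_u])
    weights = np.concatenate([np.full(len(X_l), C_l), np.full(len(X_u), C_u)])
    for _ in range(rounds):
        y_all = np.concatenate([y_l, y_u])
        clf = LinearSVC(C=1.0).fit(X_all, y_all, sample_weight=weights)
        y_new = np.where(clf.decision_function(X_u) > 0, +1, -1)
        if np.array_equal(y_new, y_u):        # guesses have stabilized
            break
        y_u = y_new
    return clf, y_u
```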
Next topic: Co-training

Co-training (Blum & Mitchell, 1998)
- Task: classify web pages into a category for students and a category for professors
- Two views of a web page: content ("I am currently the second year Ph.D. student ...") and hyperlinks ("My advisor is ...", "Students: ...")
- Some pages are easier to classify from their content, others from their hyperlinks

Co-training
- Two representations for each web page:
  - Content representation: (doctoral, student, computer, university, ...)
  - Hyperlink representation: inlinks (e.g., Prof. Cheng) and outlinks (e.g., Prof. Cheng)

Co-training
- Train a content-based classifier using the labeled examples
- Label the unlabeled examples that are confidently classified
- Train a hyperlink-based classifier on the enlarged labeled set
- Label the unlabeled examples that are confidently classified
- Repeat

Co-training: Summary
- Assumes two views of the objects, each a sufficient representation
- Key idea: augment the training examples of one view by exploiting the classifier of the other view (see the sketch after the active-learning aside below)
- Extends to multiple views; the practical problem is how to find equivalent views

A Few Words about Active Learning
- Active learning selects the most informative examples to label, in contrast to passive learning
- Key question: which examples are informative?
- Uncertainty principle: the most informative example is the one that is most uncertain to classify, so measure classification uncertainty

A Few Words about Active Learning
- Query by committee (QBC): construct an ensemble of classifiers; classification uncertainty is the degree of disagreement among them
- SVM-based approach: classification uncertainty is the distance to the decision boundary
- Simple but very effective approaches
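As promised above, a simplified sketch of the co-training loop: two Naive Bayes classifiers, one per view, take turns labeling their most confidently classified unlabeled examples and adding them to a shared labeled pool. The confidence threshold, the number of examples added per round, and the use of scikit-learn's MultinomialNB are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(Xc_l, Xh_l, y_l, Xc_u, Xh_u, rounds=5, per_round=2, threshold=0.9):
    """Simplified co-training with two views per example: content features Xc_*
    and hyperlink features Xh_* (both nonnegative count matrices)."""
    labeled_u, pseudo_y = [], []              # pseudo-labeled unlabeled indices + labels
    unlabeled = list(range(len(Xc_u)))
    for _ in range(rounds):
        for X_l, X_u in ((Xc_l, Xc_u), (Xh_l, Xh_u)):
            if not unlabeled:
                break
            # current labeled pool for this view = original labels + pseudo-labels
            X_pool = np.vstack([X_l, X_u[labeled_u]]) if labeled_u else X_l
            y_pool = np.concatenate([y_l, pseudo_y]) if labeled_u else y_l
            clf = MultinomialNB().fit(X_pool, y_pool)
            conf = clf.predict_proba(X_u[unlabeled]).max(axis=1)
            order = np.argsort(-conf)[:per_round]
            picks = [unlabeled[i] for i in order if conf[i] >= threshold]
            if not picks:
                continue
            labeled_u += picks
            pseudo_y = list(pseudo_y) + list(clf.predict(X_u[picks]))
            unlabeled = [i for i in unlabeled if i not in picks]
    # final classifiers, one per view, trained on the augmented pool
    y_final = np.concatenate([y_l, pseudo_y]) if labeled_u else y_l
    final_c = MultinomialNB().fit(np.vstack([Xc_l, Xc_u[labeled_u]]) if labeled_u else Xc_l, y_final)
    final_h = MultinomialNB().fit(np.vstack([Xh_l, Xh_u[labeled_u]]) if labeled_u else Xh_l, y_final)
    return final_c, final_h
```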
Next topic: Semi-supervised Clustering

Semi-supervised Clustering
- Cluster the data into two clusters, with side information: must-links vs. cannot-links
- Also called constrained clustering
- Two types of approaches: restricted data partitions, and distance metric learning

Restricted Data Partition
- Require the data partition to be consistent with the given links
- Links as hard constraints, e.g., constrained K-means (Wagstaff et al., 2001): cluster memberships must obey the link constraints, so a partition is either allowed or not
- Links as soft constraints, e.g., Metric Pairwise Constrained K-means (Basu et al., 2004): penalize a clustering for every link it violates (penalty 0 if all links are satisfied, and the penalty grows with each violated must-link or cannot-link)

Distance Metric Learning
- Learn a distance metric from the pairwise links: enlarge the distance for a cannot-link, shorten the distance for a must-link
- Then apply K-means with pairwise distances measured by the learned metric
- (Example figure: a 2-D data projection under the Euclidean metric vs. under the learned metric; solid lines are must-links, dotted lines are cannot-links)

BoostCluster (Liu, Jin & Jain, 2007)
- A general framework for semi-supervised clustering: it improves any given unsupervised clustering algorithm with pairwise constraints
- Key challenges: how to influence an arbitrary clustering algorithm with side information (encode the constraints into the data representation), and how to take into account the performance of the underlying clustering algorithm (iteratively improve the clustering)
- Given pairwise constraints, data examples, and a clustering algorithm, iterate:
  - Find the data representation that best encodes the currently unsatisfied pairwise constraints
  - Run the clustering algorithm on the new data representation
  - Update the kernel matrix with the clustering results
- After the iterations, compute the final clustering result

Summary: Semi-supervised Clustering
- Clustering data under given pairwise constraints (must-links vs. cannot-links)
- Two types of approaches: restricted data partitions (either soft or hard) and distance metric learning
- How are links/constraints acquired? Manual assignment, or derived from side information such as hyperlinks, citations, and user logs, which may be noisy and unreliable
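A minimal sketch of the soft-constraint idea above: a K-means-style assignment step that adds a fixed penalty whenever a must-link or cannot-link would be violated. The penalty weight and the simple update loop are illustrative assumptions, not the exact algorithm of Basu et al. (2004).

```python
import numpy as np

def constrained_kmeans(X, k, must, cannot, penalty=10.0, iters=20, seed=0):
    """K-means with soft pairwise constraints: assigning point i to cluster c costs
    ||x_i - center_c||^2, plus `penalty` for each must-link partner currently in a
    different cluster and each cannot-link partner currently in the same cluster."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = rng.randint(k, size=len(X))
    for _ in range(iters):
        for i in range(len(X)):
            costs = np.sum((centers - X[i]) ** 2, axis=1)
            for (a, b) in must:
                if i in (a, b):
                    j = b if i == a else a
                    costs += penalty * (np.arange(k) != labels[j])   # want same cluster
            for (a, b) in cannot:
                if i in (a, b):
                    j = b if i == a else a
                    costs += penalty * (np.arange(k) == labels[j])   # want different clusters
            labels[i] = int(np.argmin(costs))
        for c in range(k):                      # recompute the cluster centers
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Toy data: two loose 2-D blobs, one must-link and one cannot-link constraint.
X = np.vstack([np.random.RandomState(1).randn(10, 2), np.random.RandomState(2).randn(10, 2) + 3])
labels = constrained_kmeans(X, k=2, must=[(0, 1)], cannot=[(0, 10)])
```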
Application: Document Clustering (Basu et al., 2004)
- 300 documents from three topics (atheism, baseball, space) of 20-newsgroups
- 3251 unique words after removal of stopwords and rare words, and stemming
- Evaluation metric: Normalized Mutual Information (NMI)
- KMeans-x-x denotes different variants of constrained clustering algorithms

Outline
- Introduction to information retrieval, statistical inference and machine learning
- Supervised learning and its application to text classification, adaptive filtering, collaborative filtering and ranking
- Semi-supervised learning and its application to text classification
- Emerging research directions

Efficient Learning
- In IR we have massive amounts of data, but most learning algorithms are relatively slow: it is difficult to handle millions of documents
- How to improve scalability? Sampling (use only part of the data); stochastic optimization (update the model one example at a time, related to online learning)
- More interestingly, more examples may mean more efficient training (Srebro, ICML 2008)

Kernel Learning
- Kernels play a central role in machine learning, and kernel functions can be learned from data: kernel alignment, multiple kernel learning, nonparametric kernel learning, ...
- Kernel learning suits IR well: similarity measures are key to IR, and kernel learning allows us to identify the optimal similarity measure automatically

Transfer Learning
- Different document categories are correlated, so we should be able to borrow information from one class when training another class
- Key question: what to transfer between classes? Representations, model priors, similarity measures, ...

Active Learning: IR Applications
- Relevance feedback (text retrieval or image retrieval), text classification, adaptive information filtering, collaborative filtering, query rewriting

Discriminative Language Models
- Language models have been shown to be effective for information retrieval, but most language models are generative and thus miss discriminative power
- Key difficulty for discriminative language models: there are no outputs!
- Possible directions: side information; mixtures of generative and discriminative models

References
- A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 1998.
- T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 2001.
- F. Li and Y. Yang. A loss function analysis for classification methods in text categorization. ICML 2003.
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 2004.
- A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. COLT 1998.
- D. Blei and M. Jordan. Variational methods for the Dirichlet process. ICML 2004.
- T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2), 2001.
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. NIPS 2002.
- R. Jin, C. Ding, and F. Kang. A probabilistic approach for optimizing spectral clustering. NIPS 2005.
- D. Zhou, B. Scholkopf, and T. Hofmann. Semi-supervised learning on directed graphs. NIPS 2005.
- X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.
- T. Joachims. Transductive learning via spectral graph partitioning. ICML 2003.
References
- A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. ICML 1998.
- D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 1996.
- S. Tong and E. Chang. Support vector machine active learning for image retrieval. ACM Multimedia, 2001.
- X. Shen and C. Zhai. Active feedback in ad hoc information retrieval. SIGIR 2005.
- J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 1997.
- X.-J. Wang, W.-Y. Ma, G.-R. Xue, and X. Li. Multi-model similarity propagation and its application for web image retrieval. ACM Multimedia, 2004.
- M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization. Technical report, University of Chicago, 2006.
- K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. ICML 2001.
- S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. SIGKDD 2004.

References
- X. He, B. Rey, W. V. Zhang, and R. Jones. Query rewriting using active learning for sponsored search. SIGIR 2007.
- Y. Zhang, W. Xu, and J. Callan. Exploration and exploitation in adaptive filtering based on Bayesian active learning. ICML 2003.
- Z. Xu and R. Akella. A Bayesian logistic regression model for active relevance feedback. SIGIR 2008.
- G. Schohn and D. Cohn. Less is more: active learning with support vector machines. ICML 2000.
- M. Saar-Tsechansky and F. Provost. Active sampling for class probability estimation and ranking. Machine Learning, 2004.
- J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, 1971.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. COLT 1992.
- Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- D. A. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 1994.
- R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. SIGKDD Explorations, 2008.
- T.-Y. Liu. Learning to rank (tutorial).
- S. Chakrabarti. Learning to rank in vector spaces and social networks. WWW 2007.

Thank You
God, it is finally over!