Techniques for Exploiting Unlabeled Data
Thesis Defense
Mugizi Rwebangira
September 8, 2008

Committee:
Avrim Blum, CMU (Co-Chair)
John Lafferty, CMU (Co-Chair)
William Cohen, CMU
Xiaojin (Jerry) Zhu, Wisconsin

Motivation
Supervised machine learning: labeled examples {(xi, yi)} → (induction) → model x → y
Problems: document classification, image classification, protein sequence determination.
Algorithms: SVMs, neural nets, decision trees, etc.

Motivation
In recent years there has been growing interest in techniques for using unlabeled data:
More data is being collected than ever before.
Labeling examples can be expensive and/or require human intervention.

Examples
Images: abundantly available (digital cameras); labeling requires humans (CAPTCHAs).
Web pages: easily crawled; labeling requires human intervention.
Proteins: the sequence can be determined easily; structure determination is a hard problem.

Motivation
Semi-supervised machine learning: labeled examples {(xi, yi)} and unlabeled examples {xi} → model x → y
[Figure: a few labeled (+/−) points among many unlabeled points]

However…
Techniques are not as well developed as supervised techniques:
Best practices for using unlabeled data.
Techniques for adapting supervised algorithms to semi-supervised algorithms.

Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Conclusion and Questions

Graph Mincut (Blum & Chawla, 2001)
Construct an (unweighted) graph. [figure]
Add auxiliary "super-nodes" tied to the positive (+) and negative (−) labeled examples. [figure]
Obtain the s-t mincut. [figure]
Classification: examples on the + side of the cut are labeled positive, those on the − side negative. [figure]

Problem
Plain mincut can give very unbalanced cuts. [figure]

Solution
Add random weights to the edges.
Run plain mincut and obtain a classification.
Repeat the above process several times.
For each unlabeled example, take a majority vote.

Before adding random weights [figure: mincut]
After adding random weights [figure: mincut]

PAC-Bayes
• PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.
• In this case each distinct cut corresponds to a different hypothesis.
• Hence the average of these cuts will be less likely to overfit than any single cut.

Markov Random Fields
• Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.
• This corresponds to a Markov Random Field model.
• We don't know how to do this efficiently, but we can view randomized mincut as an approximation.

How to construct the graph?
• k-NN
  – The graph may not have small balanced cuts.
  – How to learn k?
• Connect all points within distance δ
  – Can have disconnected components.
  – How to learn δ?
• Minimum spanning tree
  – No parameters to learn.
  – Gives a connected, sparse graph.
  – Seems to work well on most datasets.

Experiments
• ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance).
• ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance).
• PC vs. MAC: 1943 examples (20 Newsgroups dataset, TFIDF distance).

ONE vs. TWO [results figure]
ODD vs. EVEN [results figure]
PC vs. MAC [results figure]

Summary
Randomization helps plain mincut achieve performance comparable to Gaussian Fields.
We can apply a PAC sample-complexity analysis and interpret it in terms of Markov Random Fields.
There is an intuitive interpretation of the confidence of a prediction in terms of the "margin" of the vote.
"Semi-supervised Learning Using Randomized Mincuts", A. Blum, J. Lafferty, M. R. Rwebangira, R. Reddy, ICML 2004
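To make the procedure concrete, here is a minimal sketch of randomized mincut using networkx, assuming the similarity graph `G` has already been built (e.g., from a minimum spanning tree) and that `pos`/`neg` hold the labeled node ids; the names (`G`, `pos`, `neg`, `n_rounds`, `noise`) are illustrative, not the thesis implementation.

```python
# A minimal sketch of randomized mincut, assuming `G` is a networkx graph
# over the examples and `pos`/`neg` are lists of labeled node ids.
import random
import networkx as nx

def randomized_mincut(G, pos, neg, n_rounds=20, noise=1.0):
    votes = {v: 0 for v in G.nodes}
    for _ in range(n_rounds):
        H = nx.Graph()
        # Copy the (unweighted) graph, perturbing each edge with a random weight.
        for u, v in G.edges:
            H.add_edge(u, v, capacity=1.0 + noise * random.random())
        # Auxiliary "super-nodes" tied to the labeled examples by heavy edges.
        for v in pos:
            H.add_edge('+', v, capacity=float('inf'))
        for v in neg:
            H.add_edge('-', v, capacity=float('inf'))
        _, (plus_side, _) = nx.minimum_cut(H, '+', '-')
        for v in G.nodes:
            votes[v] += 1 if v in plus_side else -1
    # Majority vote over the random cuts; |votes[v]| is the vote "margin"
    # that serves as a confidence measure for the prediction.
    return {v: (1 if votes[v] > 0 else -1) for v in G.nodes}
```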
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Timeline

(Supervised) Linear Regression [figure: line fitted through labeled points]

Semi-Supervised Regression [figure: labeled (*) and unlabeled (+) points]

Smoothness Assumption
Things that are close together should have similar values.
One way of doing this: minimize ξ(f) = ∑ij wij (fi − fj)²
where wij is the similarity between examples i and j, and fi and fj are the predictions for examples i and j.
Gaussian Fields (Zhu, Ghahramani & Lafferty)

Local Constancy
The predictions made by Gaussian Fields are locally constant. [figure]
More formally: m(u + Δ) ≈ m(u)

Local Linearity
For many regression tasks we would prefer predictions to be locally linear. [figure]
More formally: m(u + Δ) ≈ m(u) + m′(u)Δ

Problem
Develop a version of Gaussian Fields which is locally linear, or equivalently a semi-supervised version of linear regression: Local Linear Semi-supervised Regression (LLSR).

Local Linear Semi-supervised Regression
By analogy with ∑ wij (fi − fj)², we fit a local linear model βj at each example xj and penalize the squared disagreement (βio − Xjiᵀβj)², where βio is the local estimate at xi and Xjiᵀβj is the value the neighboring model βj predicts at xi. [figure: local fits βi, βj at xi, xj]

Local Linear Semi-supervised Regression
So we find β to minimize the following objective function:
ξ(β) = ∑ij wij (βio − Xjiᵀβj)²
where wij is the similarity between xi and xj.

Synthetic Data: Gong
Gong function: y = (1/x) sin(15/x), with noise σ² = 0.1. [figure]

Experimental Results: GONG
Weighted Kernel Regression, MSE = 25.7 [plot]
Local Linear Regression, MSE = 14.4 [plot]
LLSR, MSE = 7.99 [plot]

PROBLEM: RUNNING TIME
If we have n examples of dimension d, then to compute a closed-form solution we have to invert an n(d+1) × n(d+1) matrix.
This is prohibitively expensive, especially if d is large.
For example, if n = 1500 and d = 199 we have to invert a matrix that takes 720 GB in Matlab's double-precision format.

SOLUTION: ITERATION
It turns out that because of the form of the equation we can start from an arbitrary initial guess and do an iterative computation that provably converges to the desired solution.
In the case of n = 1500 and d = 199, instead of dealing with a 720 GB matrix we only have to store 2.4 MB in memory, which makes the algorithm much more practical.

Experiments on Real Data
We do model selection using leave-one-out cross-validation.
We compare:
Weighted Kernel Regression (WKR) – a purely supervised method.
Local Linear Regression (LLR) – another purely supervised method.
Local Learning Regularization (LL-Reg) – a recent semi-supervised method.
Local Linear Semi-supervised Regression (LLSR).
For each algorithm and dataset we give:
1. The mean and standard deviation of 10 runs.
2. The results of an OPTIMAL choice of parameters.

Experimental Results

Dataset      n    d  nl   LLSR     LLSR-OPT  WKR      WKR-OPT
Carbon       58   1  10   27±25    19±11     70±36    37±11
Alligators   25   1  10   288±176  209±162   336±210  324±211
Smoke        25   1  10   82±13    79±13     83±19    80±15
Autompg      392  7  100  50±2     49±1      57±3     57±3

Dataset      n    d  nl   LLR      LLR-OPT   LL-Reg   LL-Reg-OPT
Carbon       58   1  10   57±16    54±10     162±199  74±22
Alligators   25   1  10   207±140  207±140   289±222  248±157
Smoke        25   1  10   82±12    80±13     82±14    70±6
Autompg      392  7  100  53±3     52±3      53±4     51±2

Summary
LLSR is a natural semi-supervised generalization of linear regression.
While the analysis is not as clear as with semi-supervised classification, semi-supervised regression can perform better than supervised regression if the function has a smooth manifold structure, as the GONG function does.
FUTURE WORK: Carefully analyzing the assumptions under which unlabeled data can be useful in regression.
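To illustrate the iteration idea from the SOLUTION: ITERATION slide above on the simpler Gaussian Fields objective (not full LLSR), here is a minimal sketch: minimizing ξ(f) = ∑ wij (fi − fj)² with the labeled values clamped has the fixed point fi = ∑j wij fj / ∑j wij, which Jacobi-style updates reach without ever forming a matrix inverse. `W`, `labels`, and `n_iters` are illustrative names, not the thesis code.

```python
# A minimal sketch of iterative minimization of the Gaussian Fields
# objective, assuming W is a symmetric similarity matrix with zero diagonal.
import numpy as np

def gaussian_fields_iterative(W, labels, n_iters=200):
    """W: (n, n) similarity matrix; labels: dict {index: target value}."""
    n = W.shape[0]
    f = np.zeros(n)
    for i, y in labels.items():
        f[i] = y
    unlabeled = [i for i in range(n) if i not in labels]
    for _ in range(n_iters):
        for i in unlabeled:
            # Each unlabeled prediction moves to the weighted mean of its
            # neighbors; labeled predictions stay clamped to their targets.
            f[i] = W[i] @ f / W[i].sum()
    return f
    # (LLSR iterates analogously on the local coefficients beta_i, storing
    # only the n(d+1)-vector of coefficients -- the 2.4 MB in the example
    # above -- rather than the full n(d+1) x n(d+1) matrix.)
```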
Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Timeline

Kernels
K(x,y): informally considered as a measure of similarity between x and y.
Kernel trick: K(x,y) = Φ(x)·Φ(y) (Mercer's theorem).
This allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical definitions:
1. Continuous
2. Symmetric
3. Positive semi-definite

Problems with Kernels
There is a conceptual disconnect between the notion of kernels as similarity functions and the notion of finding max-margin separators in possibly infinite-dimensional Hilbert spaces.
The properties of kernels, such as being positive semi-definite, are rather restrictive, and in particular similarity functions used in certain domains, such as the Smith-Waterman score in molecular biology, do not fit in this framework.
WANTED: A method for using similarity functions that is both easy and general.

The Balcan-Blum Approach
An approach fitting these requirements was recently proposed by Balcan and Blum:
Gave a general definition of a good similarity function for learning.
Showed that kernels are a special case of their definition.
Gave an algorithm for learning with good similarity functions.

The Balcan-Blum Approach
Suppose S(x,y) ∈ (−1,+1) is our similarity function. Then:
1. Draw d examples {x1, x2, x3, …, xd} uniformly at random from the data set.
2. For each example x compute the mapping x → {S(x,x1), S(x,x2), S(x,x3), …, S(x,xd)}
KEY POINT: This method can make use of UNLABELED DATA.

Combining Feature-Based and Graph-Based Methods
Feature-based methods operate directly on the native features, e.g., decision trees, MaxEnt, Winnow, Perceptron.
Graph-based methods operate on the graph of similarities between examples, e.g., kernel methods, Gaussian Fields, graph mincut, and most semi-supervised learning methods.
These methods can work well on different datasets; we want to find a way to COMBINE these approaches into one algorithm.

SOLUTION: Similarity Functions + Winnow
Use the Balcan-Blum approach to generate extra features.
Append the extra features to the original features: x → {x, S(x,x1), S(x,x2), S(x,x3), …, S(x,xd)}
Run the Winnow algorithm on the combined features (Winnow is known to be resistant to irrelevant features).

Our Contributions
Practical techniques for using similarity functions.
Combining graph-based and feature-based learning.

How to define a good similarity function?
By modifying a distance metric: K(x,y) = 1/(D(x,y)+1)
Problem: We can end up with all similarities close to ZERO (not good).
Solution: Scale the similarities as follows:
Sort the similarities for example x from most similar to least.
Give the most similar example similarity +1 and the least similar example similarity −1, and interpolate the remaining examples in between.
VERY IMPORTANT: The ranked similarity may not be symmetric, which is a big difference from kernels.

Evaluating a Similarity Function
K is a strongly (ε,γ)-good similarity function for a learning problem P if at least a (1−ε) probability mass of examples x satisfy:
E_{x′~P}[K(x,x′) | l(x′) = l(x)] ≥ E_{x′~P}[K(x,x′) | l(x′) ≠ l(x)] + γ
For a particular similarity function and dataset we can compute the margin γ for each example and then plot the examples by decreasing margin.
If the margin is large for most examples, this is an indication that the similarity function may perform well on that dataset.
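Here is a minimal sketch of the pieces just described, assuming Euclidean data: the naïve similarity K(x,y) = 1/(D(x,y)+1), the ranked rescaling to (−1,+1), and the Balcan-Blum style feature augmentation whose output would be fed to Winnow. All function and variable names are illustrative.

```python
# A minimal sketch of the ranked similarity mapping; `landmarks` are the
# d examples drawn at random from the (unlabeled) data pool.
import numpy as np

def naive_similarity(x, y):
    # Similarity from a distance metric: K(x, y) = 1 / (D(x, y) + 1).
    return 1.0 / (np.linalg.norm(x - y) + 1.0)

def ranked_similarities(x, landmarks):
    # Raw similarities may all be close to zero, so rank them instead:
    # most similar landmark -> +1, least similar -> -1, linear in between.
    sims = np.array([naive_similarity(x, z) for z in landmarks])
    order = np.argsort(sims)               # least to most similar
    ranked = np.empty_like(sims)
    ranked[order] = np.linspace(-1.0, 1.0, len(landmarks))
    return ranked                          # note: not symmetric in x and z

def augment(X, landmarks):
    # x -> {x, S(x, x1), ..., S(x, xd)}: append the similarity features
    # to the native ones before running Winnow on the combined vector.
    sim_feats = np.vstack([ranked_similarities(x, landmarks) for x in X])
    return np.hstack([X, sim_feats])
```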
Compatibility of the naïve similarity function on Digits1 [plot]
Compatibility of the ranked similarity function on Digits1 [plot]

Experimental Results
We'll look at some experimental results on both real and synthetic datasets.

Synthetic Data: Circle [figure]
Experimental Results: Circle [results figure]

Synthetic Data: Blobs and Lines
Can we create a data set that needs BOTH the original and the new features to do well?
To answer this we create a data set we call "Blobs and Lines".
We generate the data in the following way:
1. We select k points to be the centers of our "blobs" and assign them labels in {−1,+1}.
2. We flip a coin.
3. If heads, we set x to be a random Boolean vector of dimension d and set the label to be the first coordinate of x.
4. If tails, we pick one of the centers, flip r bits, set x equal to that, and set the label to the label of the center.

Synthetic Data: Blobs and Lines [figure: labeled (+/−) blob and line points]

Experimental Results: Blobs and Lines [results figure]

Experimental Results: Real Data

Dataset    n     d     nl   Winnow  SVM    NN    SIM    Winnow+SVM
Congress   435   16    100  93.79   94.93  90.8  90.90  92.24
Webmaster  582   1406  100  81.97   71.78  72.5  69.90  81.20
Credit     653   46    100  78.50   55.52  61.5  59.10  77.36
Wisc       683   89    100  95.03   94.51  95.3  93.65  94.49
Digit1     1500  241   100  73.26   88.79  94.0  94.21  91.31
USPS       1500  241   100  71.85   74.21  92.0  86.72  88.57

Experimental Results: Concatenation
What if we did something halfway between synthetic and real, by concatenating two different datasets?
This can be viewed as simulating a dataset that has two different kinds of data.
We concatenated the datasets by padding each of them with a block of ZEROS:

Credit  (653 × 46)   Padding (653 × 241)
Padding (653 × 46)   Digit1  (653 × 241)

Dataset          n     d    nl   Winnow  SVM    NN     SIM    Winnow+SVM
Credit + Digit1  1306  287  100  72.41   75.46  74.25  51.74  83.95

Conclusions
Generic similarity functions have a lot of potential to be applied in practical applications.
By combining feature-based and graph-based methods we can often get the "best of both worlds".
FUTURE WORK
Designing similarity functions suited to particular domains.
Theoretically provable guarantees on the quality of a similarity function.

QUESTIONS?

Back-Up Slides

References
"Semi-supervised Learning Using Randomized Mincuts", A. Blum, J. Lafferty, M. R. Rwebangira, R. Reddy, ICML 2004.

My Work
Techniques for improving graph mincut algorithms for semi-supervised classification.
Techniques for extending Local Linear Regression to the semi-supervised setting.
Practical techniques for using unlabeled data and generic similarity functions to "kernelize" the Winnow algorithm.

Problem
There may be several minimum cuts in the graph. [figure]
Indeed, there are potentially exponentially many minimum cuts in the graph.

Real Data: CO2
Carbon dioxide concentration in the atmosphere over the last two centuries. Source: World Watch Institute. [figure]

Experimental Results: CO2
Local Linear Regression, MSE = 144 [plot]
Weighted Kernel Regression, MSE = 660 [plot]
LLSR, MSE = 97.4 [plot]

Winnow
A linear separator algorithm, first proposed by Littlestone.
We are particularly interested in Winnow because:
1. It is known to be able to learn effectively in the presence of irrelevant attributes. Since we will be creating many new features, we expect many of them will be irrelevant.
2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration.
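Since Winnow does the final learning in the combined approach, here is a minimal sketch of the classic Littlestone formulation on {0,1} features; the thesis experiments may well use a different variant (e.g., balanced Winnow on real-valued similarity features), so the threshold n and update factor α here are just the textbook choices.

```python
# A minimal sketch of Littlestone's Winnow with multiplicative updates.
import numpy as np

def winnow_train(X, y, alpha=2.0, n_epochs=10):
    """X: (m, n) binary feature matrix; y: labels in {-1, +1}."""
    m, n = X.shape
    w = np.ones(n)                     # weights all start at 1
    for _ in range(n_epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= n else -1   # threshold at n
            if pred != label:          # update on mistakes only
                if label == 1:
                    w[x == 1] *= alpha       # promote active features
                else:
                    w[x == 1] /= alpha       # demote active features
    return w

def winnow_predict(w, X):
    return np.where(X @ w >= X.shape[1], 1, -1)
```

Because the updates are multiplicative, the weights of irrelevant features decay quickly, which is why Winnow tolerates the many extra similarity features.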
PROPOSED WORK: Improving Running Time
Sparsification: Ignore examples which are far away so as to get a sparser matrix to invert.
Iterative methods for solving linear systems: For a matrix equation Ax = b we can obtain successive approximations x1, x2, …, xk. This can be significantly faster if the matrix A is sparse.

PROPOSED WORK: Improving Running Time
Power series: Use the identity (I − A)⁻¹ = I + A + A² + A³ + …
y′ = (Q + γΔ)⁻¹Py = Q⁻¹Py + (−γQ⁻¹Δ)Q⁻¹Py + (−γQ⁻¹Δ)²Q⁻¹Py + …
A few terms may be sufficient to get a good approximation: compute the supervised answer first, then "smooth" it to get the semi-supervised solution.
This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.

PROPOSED WORK: Experimental Evaluation
Comparison against other proposed semi-supervised regression algorithms.
Evaluation on a large variety of data sets, especially high-dimensional ones.

PROPOSED WORK
Overall goal: Investigate the practical applicability of this theory and find out what is needed to make it work on real problems.
Two main application areas:
1. Domains which have expert-defined similarity functions that are not kernels (protein homology).
2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).

PROPOSED WORK: Protein Homology
The Smith-Waterman score is the best-performing measure of similarity, but it does not satisfy the kernel properties.
Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.
Can we achieve better performance by using the SW score directly?

PROPOSED WORK: Text Classification
The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times each word occurred.
The vectors tend to be sparse and there will be many irrelevant features; hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.
Within this framework we have strong motivation for investigating "domain specific" similarity functions, e.g., "edit distance" between documents instead of cosine similarity.
Can we achieve better performance than current techniques using "domain specific" similarity functions?

PROPOSED WORK: Domain-Specific Similarity Functions
As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.
What are the "best practice" principles for designing domain-specific similarity functions?
In what circumstances are domain-specific similarity functions likely to be most useful?
We will answer these questions by generalizing from several different datasets and systematically noting what seems to work best.

Proposed Work and Timeline
Summer 2007: (1) Speeding up LLSR. (2) Learning with similarity in the protein homology and text classification domains.
Fall 2007: (1) Comparison of LLSR with other semi-supervised regression algorithms. (2) Investigate principles of domain-specific similarity functions.
Spring 2008: Start writing thesis.
Summer 2008: Finish writing thesis.

Kernels
K(x,y) = Φ(x)·Φ(y)
Allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical definitions:
1. Continuous
2. Symmetric
3. Positive semi-definite
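As a quick empirical companion to the three kernel conditions above: on any finite sample, a valid kernel's Gram matrix must be symmetric and positive semi-definite, which the minimal sketch below checks. Names are illustrative; `sim` is any candidate similarity function. The ranked similarity from earlier typically fails the symmetry test, which motivates the question on the next slide.

```python
# A minimal sketch of testing the kernel conditions on a finite sample.
import numpy as np

def gram_matrix(sim, X):
    # Pairwise similarity matrix K[i, j] = sim(X[i], X[j]).
    return np.array([[sim(a, b) for b in X] for a in X])

def looks_like_kernel(sim, X, tol=1e-8):
    K = gram_matrix(sim, X)
    symmetric = np.allclose(K, K.T, atol=tol)
    # A symmetric PSD matrix has only non-negative eigenvalues; eigvalsh
    # is only meaningful on symmetric matrices, so check symmetry first.
    psd = symmetric and np.linalg.eigvalsh(K).min() >= -tol
    return symmetric, psd
```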
Generic Similarity Functions
What if the best similarity function in a given domain does not satisfy the properties of a kernel?
Two options:
1. Use a kernel with inferior performance.
2. Try to "coerce" the similarity function into a kernel by building a kernel that has similar behavior.
There is another way …