Boosting and Semi-supervised Learning
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

String kernels
Compare two sequences for similarity, e.g.
K(ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTG, GCATGACGCCATTGACCTGCTGGTCCTA) = 0.7
- Exact matching kernel: counts all matching substrings
- Flexible weighting scheme
- Does not work well in the noisy case
- Successful applications in bioinformatics
- Linear-time algorithm using suffix trees

Exact matching string kernels
- Bag of characters: count single characters, i.e. set w_s = 0 for |s| > 1
- Bag of words: s is bounded by whitespace
- Limited-range correlations: set w_s = 0 for all |s| > n, for a fixed n
- k-spectrum kernel: account only for matching substrings of length k, i.e. set w_s = 0 for all |s| ≠ k (a code sketch appears at the end of this section)

Suffix trees
- Definition: the compact tree built from all the suffixes of a string; e.g. the suffix tree of ababc is denoted S(ababc)
- Node label = the unique path from the root to that node
- Suffix links are used to speed up parsing of strings: if we are at node ax, a suffix link lets us jump directly to node x
- A suffix tree represents all the substrings of a given string
- It can be constructed in linear time and stored in linear space
- Each leaf corresponds to a unique suffix
- The number of leaves in the subtree below a node gives the number of occurrences of that node's substring

Graph kernels
Each data point itself is a graph, and the kernel is a similarity measure between graphs.

Using a graph isomorphism test to design a kernel
- The similarity score should be high for identical graphs and low for very different graphs
- Computation must be efficient
- The resulting similarity measure has to be positive semidefinite
- A graph isomorphism test checks whether two graphs are identical (e.g. the Weisfeiler-Lehman algorithm); we modify such a test to design a kernel

Weisfeiler-Lehman algorithm
- Use multiset labels to capture the neighborhood structure in a graph
- If two graphs are the same, then their multiset labels should be the same as well

Weisfeiler-Lehman algorithm II
- Relabel the graph and construct new multiset labels; check whether the new labels are the same
- If the new multiset labels are not the same, stop the algorithm and declare that the two graphs are not the same
- Effectively, this unrolls the graph into trees rooted at each node and uses the multiset labels to identify these trees

Designing a kernel from the isomorphism test
- In each iteration a feature vector is produced, and the per-iteration vectors are concatenated into one long feature vector
- Each feature vector counts the occurrences of the unique labels
- Relabel the graph to obtain the next feature vector
- When the maximum iteration is reached, take the inner product of the two concatenated feature vectors to obtain the kernel value (see the sketch at the end of this section)
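To make the k-spectrum kernel concrete, here is a minimal Python sketch (not from the lecture). It assumes unit weights w_s = 1 for |s| = k and materializes substring counts in hash maps; the linear-time suffix-tree algorithm mentioned above computes the same value without enumerating the counts explicitly.

```python
from collections import Counter

def k_spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of counts of length-k substrings.

    Equivalent to the weighting scheme w_s = 0 for all |s| != k
    (with w_s = 1 for |s| = k, an assumption of this sketch).
    """
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Sum over substrings that occur in both sequences
    return sum(cs[sub] * ct[sub] for sub in cs if sub in ct)

print(k_spectrum_kernel("ACAAGATGCCATTG", "GCATGACGCCATTG", k=3))
```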
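And a hedged sketch of the Weisfeiler-Lehman subtree kernel described above. The adjacency-list representation and function names are illustrative choices; nested tuples stand in for the compressed relabeling (two nodes receive the same tuple exactly when their label-and-neighborhood multisets match, so the naming is automatically consistent across graphs).

```python
from collections import Counter

def wl_features(adj, labels, iterations=2):
    """Per-graph feature map: counts of node labels, concatenated over
    iterations of Weisfeiler-Lehman relabeling.

    adj: {node: [neighbors]}; labels: {node: initial label}, where the
    initial labels are assumed mutually comparable (e.g. all strings).
    """
    feats = Counter(labels.values())                    # iteration-0 counts
    for _ in range(iterations):
        # New label = (own label, sorted multiset of neighbor labels)
        labels = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
                  for v in adj}
        feats.update(labels.values())
    return feats

def wl_kernel(adj1, lab1, adj2, lab2, iterations=2):
    f1 = wl_features(adj1, lab1, iterations)
    f2 = wl_features(adj2, lab2, iterations)
    # Inner product of the concatenated count vectors
    return sum(f1[key] * f2[key] for key in f1 if key in f2)

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}    # a labeled triangle
lab = {0: 'A', 1: 'A', 2: 'B'}
print(wl_kernel(adj, lab, adj, lab))       # a graph compared with itself
```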
Combining classifiers
- Ensemble learning: a machine learning paradigm where multiple learners are combined to solve a problem
- [Diagram: previously, one learner per problem; with ensembles, many learners are combined on one problem]
- The generalization ability of the ensemble is usually significantly better than that of an individual learner
- Boosting is one of the most important families of ensemble methods

Why combine classifiers?
- Average results from several different models: bagging, stacking (meta-learning), boosting
- Why? Better classification performance than individual classifiers, and more resilience to noise
- Concerns: it takes more time to obtain the final model, and there is a risk of overfitting

Bagging
Bagging = bootstrap aggregating (a code sketch follows the boosting example below):
- Generate B bootstrap samples of the training data: uniformly random sampling with replacement
- Train a classifier or a regression function using each bootstrap sample
- For classification: majority vote on the classification results
- For regression: average the predicted values

Example bootstrap samples (indices into the original training set):
Original:       1 2 3 4 5 6 7 8
Training set 1: 2 7 8 3 7 6 3 1
Training set 2: 7 8 5 6 4 2 7 1
Training set 3: 3 6 2 7 5 6 2 2
Training set 4: 4 5 1 4 6 4 3 8

Stacking classifiers
- Level-0 models are based on different learning models and use the original data (level-0 data)
- Level-1 models are based on the results of the level-0 models (level-1 data are the outputs of the level-0 models); the level-1 model is also called the "generalizer"
- If you have lots of models, you can stack them into deeper hierarchies

Boosting
- Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule
- A family of methods which produce a sequence of classifiers
- Each classifier depends on the previous one and focuses on the previous one's errors
- Examples that are incorrectly predicted by the previous classifiers are chosen more often, or weighted more heavily, when estimating a new classifier
- Questions: how do we choose the "hardest" examples, and how do we combine these classifiers?

AdaBoost
Flow chart: from the original training set, build data set 1 and train Learner 1; training instances that are wrongly predicted by Learner 1 play a more important role in the training of Learner 2, and so on through data set T and Learner T; the final prediction is a weighted combination of all the learners. (A code sketch follows the toy example below.)

Toy example
- Weak classifier (rule of thumb): vertical or horizontal half-planes
- Uniform weights on all examples

Boosting round 1
- Choose a rule of thumb (weak classifier)
- Some data points obtain higher weights because they are classified incorrectly

Boosting round 2
- Choose a new rule of thumb
- Reweight again: the weights of incorrectly classified examples increase

Boosting round 3
- Repeat the same process; now we have 3 classifiers

Boosting aggregate classifier
- The final classifier is a weighted combination of the weak classifiers
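As promised in the bagging slide, a minimal sketch of bootstrap aggregating for classification. `train_fn` is a placeholder for any base learner that returns a model exposing `.predict()`, and non-negative integer class labels are assumed.

```python
import numpy as np

def bagging_predict(train_fn, X, y, X_test, B=50, seed=0):
    """Train B classifiers on bootstrap samples and majority-vote them
    (for regression, the votes would be averaged instead)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        model = train_fn(X[idx], y[idx])
        votes.append(model.predict(X_test))
    votes = np.asarray(votes, dtype=int)   # shape (B, n_test)
    # Majority vote per test point across the B classifiers
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```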
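And the AdaBoost sketch referenced in the flow chart, for labels in {−1, +1}. The `weak_learn(X, y, w)` interface is an assumption of this sketch: any routine that fits a weak classifier under example weights w and returns it as a callable. The weight update divides by the normalizer Z_t that appears in the training-error bound below.

```python
import numpy as np

def adaboost(X, y, weak_learn, T=50):
    """AdaBoost sketch: reweight examples so mistakes count more,
    then output a weighted combination of the weak classifiers."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform initial weights
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, w)
        pred = h(X)
        eps = np.sum(w * (pred != y))          # weighted training error
        if eps >= 0.5:                         # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        w *= np.exp(-alpha * y * pred)         # upweight misclassified points
        w /= w.sum()                           # normalize (divide by Z_t)
        hs.append(h)
        alphas.append(alpha)

    def H(Xq):
        # Final classifier: sign of the weighted vote of the weak classifiers
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
    return H
```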
Theoretical properties
Y. Freund and R. Schapire [JCSS97] proved that the training error of AdaBoost is bounded by

  (1/m) Σ_i 1[H(x_i) ≠ y_i] ≤ ∏_{t=1}^{T} Z_t = ∏_{t=1}^{T} 2√(ε_t(1 − ε_t)) = ∏_{t=1}^{T} √(1 − 4γ_t²),

where ε_t is the weighted error of the t-th weak classifier and γ_t = 1/2 − ε_t. Thus, if each base classifier is slightly better than random, so that γ_t ≥ γ for some γ > 0, then the training error drops exponentially fast in T, since the above bound is at most exp(−2γ²T).

How will train/test error behave?
- Expect: training error to continue to drop (or reach zero)
- First guess: test error increases once the combined classifier becomes too complex
- Overfitting: hard to know when to stop training

Actual experimental observation
- Test error does not increase, even after 1000 rounds
- Test error continues to drop even after the training error is zero!

Theoretical properties (cont'd)
Y. Freund and R. Schapire [JCSS97] bounded the generalization error as

  Pr[f(x) ≠ y] ≤ (training error) + Õ(√(Td/m)),

where m is the number of samples and d is the VC dimension of the weak learner. This suggests AdaBoost will overfit if T is large; however, empirical studies show that AdaBoost often does not overfit. Classification error only measures whether a classification is right or wrong; we also need to consider the confidence of the classifier. Margin-based bound: for any θ > 0, with high probability,

  Pr_D[margin_f(x, y) ≤ 0] ≤ Pr_S[margin_f(x, y) ≤ θ] + Õ(√(d/(mθ²))),

where Pr_S is the empirical probability on the m training samples; note that this bound no longer depends on T.

Applications of boosting
- AdaBoost and its variants have been applied to diverse domains with great success; here is just one example
- P. Viola & M. Jones [CVPR'01] combined AdaBoost with a cascade process for face detection, regarding rectangular features as the weak classifiers
- By using AdaBoost to weight the weak classifiers, they obtained two very intuitive features for face detection
- The result is a very strong face detector: on a 466 MHz Sun machine, a 384×288 image requires only 0.067 seconds (on average, only 8 features need to be evaluated per image)

Boosting for face detection
[Figure]

Semi-supervised learning
- Supervised learning = learning from labeled data; this is the dominant paradigm in machine learning
- E.g., say you want to train an email classifier to distinguish spam from important messages: take a sample S of data, labeled according to whether or not the messages were spam
- This basic paradigm has many successes: recognizing speech, steering a car, classifying documents, classifying proteins, recognizing faces and objects in images, ...
- But labeled data can be rare or expensive: you need to pay someone to label it, it may require special testing, ...
- Unlabeled data is much cheaper. Can we make use of cheap unlabeled data?
- Unlabeled data is missing the most important information, but maybe it still has useful regularities that we can use
- Three semi-supervised methods: co-training, semi-supervised (transductive) SVM [Joachims98], and graph-based methods

Co-training
- Many problems have two different sources of information you can use to determine the label; e.g., when classifying webpages you can use the words on the page or the words on the links pointing to the page (for the page "Prof. Avrim Blum, my advisor": x1 = link info, x2 = text info)
- Idea: use a small labeled sample to learn initial rules; e.g., "my advisor" pointing to a page is a good indicator it is a faculty home page, and "I am teaching" on a page is a good indicator it is a faculty home page
- Then look for unlabeled examples where one rule is confident and the other is not, and have the confident rule label the example for the other
- We train 2 classifiers, one on each type of information, and use each to help train the other
- A number of problems can be set up this way; e.g., [Levin-Viola-Freund03] identify objects in images with two different kinds of preprocessing, and [Collins&Singer99] do named-entity extraction ("I arrived in London yesterday")
- Setting: each example is x = (x1, x2), where x1 and x2 are two "views" of the data; separate algorithms run on each view, each helping to train the other
- Basic hope: the two views are consistent, so agreement can be used as a proxy for labeled data (a code sketch follows)
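The code sketch promised above: a schematic co-training loop. The `train1`/`train2` interfaces, the `confidence` scores, and the budget k of examples labeled per round are illustrative assumptions, not a faithful transcription of any particular published variant.

```python
def co_train(L, U, train1, train2, rounds=10, k=5):
    """L: labeled triples (x1, x2, y); U: unlabeled pairs (x1, x2).
    train1/train2 fit a classifier on one view and are assumed to return
    objects with predict(x) -> label and confidence(x) -> float."""
    U = list(U)
    h1 = h2 = None
    for _ in range(rounds):
        h1 = train1([(x1, y) for x1, _, y in L])   # view-1 classifier
        h2 = train2([(x2, y) for _, x2, y in L])   # view-2 classifier
        if not U:
            break
        # Pick the k unlabeled examples on which some view is most confident
        U.sort(key=lambda x: max(h1.confidence(x[0]), h2.confidence(x[1])),
               reverse=True)
        confident, U = U[:k], U[k:]
        # The more confident view labels each example for the other view
        for x1, x2 in confident:
            if h1.confidence(x1) >= h2.confidence(x2):
                L.append((x1, x2, h1.predict(x1)))
            else:
                L.append((x1, x2, h2.predict(x2)))
    return h1, h2
```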
Toy example: intervals
As a simple example, suppose x1, x2 ∈ ℝ and the target function is some interval [a, b] in each view.
[Figure: positive examples fall where x1 ∈ [a1, b1] and x2 ∈ [a2, b2]]

Webpage classification
- 12 labeled examples, 1000 unlabeled (sample run)

Image classification
- Visual detectors with different kinds of preprocessing [Levin-Viola-Freund '03]
- Images with 50 labeled cars, 22,000 unlabeled images
- Factor 2-3+ improvement

Semi-supervised SVM (S3VM)
- Suppose we believe the decision boundary goes through low-density regions of the space / has large margin
- Aim for classifiers with large margin with respect to both the labeled and the unlabeled data (L + U)
[Figure: an SVM trained on the labeled data only vs. the S3VM boundary, which also respects the unlabeled points]
- Unfortunately, the optimization problem is now NP-hard
- Algorithms instead do local optimization: start with a large margin over the labeled data, which induces labels on U, then try flipping labels in a greedy fashion (a sketch appears at the end of this section)
- Or use branch-and-bound and other methods (Chapelle et al., 2006)
- Quite successful on text data

Graph-based methods
- Suppose that very similar examples probably have the same label
- If you have a lot of labeled data, this suggests a nearest-neighbor type of algorithm
- If you have a lot of unlabeled data, perhaps we can use the unlabeled points as "stepping stones", e.g. for handwritten digits [Zhu07]
- Idea: construct a graph with edges between very similar examples; unlabeled data can help "glue" the objects of the same class together
- Suppose there are just two labels, 0 and 1. Solve for labels f(x) on the unlabeled examples x to minimize one of:
  - minimum cut: Σ_{e=(u,v)} |f(u) − f(v)|
  - minimum "soft cut": Σ_{e=(u,v)} (f(u) − f(v))², whose solution is label propagation, i.e. each f(x) becomes the average of its neighbors' labels (see the sketch below)
  - spectral partitioning
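A compact sketch of label propagation as a solver for the soft-cut objective above: labeled nodes are clamped, and each unlabeled value is repeatedly replaced by the weighted average of its neighbors' values. The dense similarity matrix W and the fixed iteration count are simplifying assumptions (and every node is assumed to have at least one edge).

```python
import numpy as np

def label_propagation(W, labeled_idx, labeled_f, iters=200):
    """Approximately minimize the sum over edges (u,v) of
    W[u,v] * (f(u) - f(v))^2 with f fixed on the labeled nodes;
    returns soft labels f in [0, 1] for all nodes."""
    n = W.shape[0]
    f = np.full(n, 0.5)                       # neutral initial guess
    f[labeled_idx] = labeled_f                # clamp the labeled nodes
    unlabeled = np.setdiff1d(np.arange(n), labeled_idx)
    deg = W.sum(axis=1)                       # weighted degrees
    for _ in range(iters):
        # Each unlabeled f becomes the average of its neighbors' labels
        f[unlabeled] = (W @ f)[unlabeled] / deg[unlabeled]
    return f
```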
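Finally, the greedy label-flipping local search promised in the S3VM section. This is a toy rendition of the idea, not Joachims' actual algorithm: `svm_fit` is a placeholder for any large-margin trainer whose models expose `decision_function` and `predict`, and the one-flip-per-refit schedule is a simplification.

```python
import numpy as np

def s3vm_local_search(svm_fit, X_l, y_l, X_u, max_flips=100):
    """Start from a large margin over the labeled data, let it induce
    labels on U, then greedily flip the worst-violating unlabeled label
    and refit, until no unlabeled point sits on the wrong side."""
    model = svm_fit(X_l, y_l)                 # labeled data only
    y_u = model.predict(X_u)                  # induced labels on U
    X = np.vstack([X_l, X_u])
    for _ in range(max_flips):
        model = svm_fit(X, np.concatenate([y_l, y_u]))
        margins = y_u * model.decision_function(X_u)
        i = int(np.argmin(margins))           # most-violated unlabeled point
        if margins[i] >= 0:                   # all consistent: local optimum
            break
        y_u[i] = -y_u[i]                      # greedy label flip
    return model, y_u
```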