Data Mining: Practical Machine Learning Tools and Techniques
by I. H. Witten, E. Frank and M. A. Hall

6.9: Semi-Supervised Learning
Rodney Nielsen
Many/most of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Implementation: Real Machine Learning Schemes
• Decision trees: from ID3 to C4.5 (pruning, numeric attributes, ...)
• Classification rules: from PRISM to RIPPER and PART (pruning, numeric data, ...)
• Association rules: frequent-pattern trees
• Extending linear models: support vector machines and neural networks
• Instance-based learning: pruning examples, generalized exemplars, distance functions
• Numeric prediction: regression/model trees, locally weighted regression
• Bayesian networks: learning and prediction, fast data structures for learning
• Clustering: hierarchical, incremental, probabilistic, Bayesian
• Semi-supervised learning: clustering for classification, co-training

Semi-supervised Learning
• Semi-supervised learning attempts to use unlabeled data as well as labeled data
• The aim is to improve classification performance
• Why try to do this? Unlabeled data is often plentiful, and labeling data can be expensive
  • Web mining: classifying web pages
  • Text mining: identifying names in text
  • Video mining: classifying people in the news
• Leveraging the large pool of unlabeled examples would be very attractive

Clustering for Classification
• Idea: use Naïve Bayes on the labeled examples and then apply EM (a code sketch follows the EM and Co-training slide below)
  • Build a Naïve Bayes model on the labeled data
  • Until convergence:
    • Label the unlabeled data based on class probabilities ("Expectation" step)
    • Train a new Naïve Bayes model on all the data ("Maximization" step)
• Essentially the same as EM for clustering, with fixed cluster-membership probabilities for the labeled data and #clusters = #classes

Comments
• Has been applied successfully to document classification
  • Certain phrases are indicative of classes
  • Some of these phrases occur only in the unlabeled data, some in both sets
  • EM can generalize the model by taking advantage of co-occurrence of these phrases
• Refinement 1: reduce the weight of the unlabeled data
• Refinement 2: allow multiple clusters per class

Co-training
• A method for learning from multiple views (multiple sets of attributes), e.g.:
  • The first set of attributes describes the content of a web page
  • The second set of attributes describes the links that point to the web page
• Until stopping criteria:
  • Step 1: build a model from each view
  • Step 2: use the models to assign labels to the unlabeled data
  • Step 3: select the unlabeled examples that were most confidently predicted (often preserving the ratio of classes)
  • Step 4: add those examples to the training set
• Assumption: the views are independent

EM and Co-training
• Like EM for semi-supervised learning, but the view is switched in each iteration of EM
• Uses all the unlabeled data (probabilistically labeled) for training
• Has also been used successfully with support vector machines
  • Logistic models fit to the SVM outputs estimate a class probability distribution
• Co-training sometimes also seems to work when the views are chosen randomly!
  • Why? Maybe the co-trained classifier is more robust
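A minimal sketch of the clustering-for-classification loop above, under some assumptions not in the slides: scikit-learn's MultinomialNB as the Naïve Bayes model (so features should be non-negative counts), hard rather than soft labels in the E-step, and an unlabeled_weight parameter standing in for Refinement 1. The function and parameter names are illustrative, not the book's implementation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_l, y_l, X_u, rounds=20, unlabeled_weight=1.0):
    """Hard-EM sketch of clustering for classification: fit Naive Bayes on
    the labeled data, then alternate between labeling the unlabeled pool
    (E-step) and refitting on all of the data (M-step)."""
    nb = MultinomialNB().fit(X_l, y_l)          # initial model: labeled data only
    X_all = np.vstack([X_l, X_u])
    # Refinement 1: downweight the unlabeled examples relative to labeled ones
    w = np.concatenate([np.ones(len(y_l)),
                        np.full(len(X_u), unlabeled_weight)])
    prev = None
    for _ in range(rounds):
        y_u = nb.predict(X_u)                   # E-step: label the unlabeled data
        if prev is not None and np.array_equal(y_u, prev):
            break                               # converged: labels stopped changing
        prev = y_u
        y_all = np.concatenate([y_l, y_u])      # M-step: refit on all of the data
        nb = MultinomialNB().fit(X_all, y_all, sample_weight=w)
    return nb
```

With unlabeled_weight below 1, the unlabeled pool nudges the class-conditional statistics without overwhelming the labeled evidence, which is exactly the motivation for Refinement 1.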
Self-Training
• L ← L0 = <X(0), Y(0)>
• Until stopping criteria:
  • h ← f(L)  (train a classifier on the current labeled set)
  • U* ← select(U, h)  (use the classifier to choose unlabeled examples)
  • L ← L0 + <U*, h(U*)>  (add them with their predicted labels)
• (A runnable sketch of the throttled, indelible variant of this loop follows the Preselection slide below.)

Example Selection
• Probability
• Probability ratio or probability margin
• Entropy
• Or several other possibilities (e.g., search for Burr Settles' Active Learning Tutorial)

Stopping Criteria
• T rounds
• Repeat until convergence
• Use held-out validation data, or
• k-fold cross-validation

Seed
• Seed data vs. seed classifier
• Training on seed data does not necessarily result in a classifier that perfectly labels the seed data
• Training on data output by a seed classifier does not necessarily result in the same classifier

The following variants each change one line of the self-training loop:

Indelibility
• Original (Y(U) can change): each round rebuilds the labeled set from the seed data, L ← L0 + <U*, h(U*)>, so earlier pseudo-labels can be revised
• Indelible: pseudo-labeled examples are added permanently, L ← L + <U*, h(U*)> and U ← U - U*, so a label never changes once assigned

Persistence
• Indelible: as above; selected examples keep the label they were first assigned
• Persistent (X(L) can't change): the selected set accumulates, U* ← U* + select(U, h) and U ← U - U*, but labels are re-predicted each round via L ← L0 + <U*, h(U*)>

Throttling
• Original (threshold): U* ← select(U, h, θ) takes all examples from U with confidence > θ
• Throttled: U* ← select(U, h, k) takes the k examples from U with the greatest confidence

Balanced
• Throttled: select the k examples from U with the greatest confidence
• Balanced (and throttled): select k+ positive and k- negative examples; often k+ = k-, or they are proportional to N+ and N-

Preselection
• Original (test all of U): U* ← select(U, h, θ) scores every example in U
• Preselected: U' ← select(U, φ) first draws a subset of U (typically at random); then U* ← select(U', h, θ) scores only U'
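Here is the promised sketch of the indelible, throttled self-training loop. The scikit-learn-style fit/predict_proba interface is an assumption, and select(U, h, k) is realized as "take the k most confident predictions".

```python
import numpy as np

def self_train(clf, X_l, y_l, X_u, k=10, rounds=50):
    """Indelible, throttled self-training: h <- f(L); U* <- select(U, h, k);
    L <- L + <U*, h(U*)>; U <- U - U*."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(rounds):                     # stopping criterion: T rounds
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                       # h <- f(L)
        probs = clf.predict_proba(X_u)
        conf = probs.max(axis=1)
        top = np.argsort(conf)[-k:]             # U* <- select(U, h, k): throttled
        y_new = clf.classes_[probs[top].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[top]])        # L <- L + <U*, h(U*)>: indelible
        y_l = np.concatenate([y_l, y_new])
        X_u = np.delete(X_u, top, axis=0)       # U <- U - U*
    return clf
```

The balanced variant would replace the single argsort with separate top-k selections per predicted class; the threshold variant would replace it with a confidence cutoff.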
Co-training
• X = X1 × X2: two different views of the data
• x = (x1, x2): each instance is composed of two distinct sets of features and values
• Assume each view is sufficient for correct classification

Co-Training Algorithm
• [Table 1: the co-training algorithm of Blum and Mitchell, 1998]

Companionbots
• Perceptive, emotive, conversational healthcare companion robots
• NSF CISE Smart Health & Wellbeing, PI
• Collaborators: UC Denver, CU Boulder, BLT, U Denver
• Consultants: Columbia, Worcester Polytechnic Institute, UCD Depression Center

Elderly and Depression
• Stats for adults 65+: they will double in number by 2030; 12-20% experience depression
• Depression is the leading cause of disability, M/F, all ages, worldwide (WHO)
• 50-58% of hospital patients
• 36-50% of healthcare expenditures
• Doubles the cost of care for chronic diseases

Companionbots Architecture
• [Architecture diagram: sensory inputs (audio, vision, radar/IR, force/touch, location, time) feed interpretation modules (speech recognition, natural language understanding, object recognition and tracking, emotion/scenario/environment understanding), which feed user modeling and history tracking, prediction modules (situation, dialogue, scenario, emotion, environment), and goal managers, which in turn drive natural language, behavior, expression, and movement generation and the mechatronic outputs (text-to-speech, visual displays, motor controls)]
• Instance selection for co-training is used in emotion recognition

Multimodal Emotion Recognition
• Vision
• Speech
• Language (e.g., the utterance "Why does this always have to happen to me")

Co-Training Emotion Recognition (after Blum & Mitchell, 1998; a two-view code sketch follows)
• Given a set L of labeled training examples and a set U of unlabeled training examples
• Create a pool U' of examples by choosing u examples at random from U
• Loop for k iterations:
  • Use L to train a classifier h1 that considers only the vision features
  • Use L to train a classifier h2 that considers only the speech features
  • Use L to train a classifier h3 that considers only the language features
  • Allow h1 to label p1 positive and n1 negative examples from U'
  • Allow h2 to label p2 positive and n2 negative examples from U'
  • Allow h3 to label p3 positive and n3 negative examples from U'
  • Add these self-labeled examples to L
  • Randomly choose examples from U to replenish U'
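A two-view sketch of the Blum and Mitchell loop above, under assumptions not in the slides: binary 0/1 labels, scikit-learn-style classifiers produced by a make_clf factory, and one numpy feature matrix per view. All names (co_train, views_l, views_u, p, n, pool_size) are illustrative.

```python
import numpy as np

def co_train(make_clf, views_l, y_l, views_u, p=1, n=3,
             pool_size=75, rounds=30, seed=0):
    """Two-view co-training: each round, train one classifier per view on L,
    let each label its p most confident positives and n most confident
    negatives from the pool U', add those to L, and replenish U' from U."""
    rng = np.random.default_rng(seed)
    views_l, y_l = [v.copy() for v in views_l], y_l.copy()
    n_u = len(views_u[0])
    pool = list(rng.choice(n_u, size=min(pool_size, n_u), replace=False))
    rest = [i for i in range(n_u) if i not in pool]   # the remainder of U
    for _ in range(rounds):
        clfs = [make_clf().fit(views_l[v], y_l) for v in (0, 1)]
        chosen = set()
        for v in (0, 1):
            probs = clfs[v].predict_proba(views_u[v][pool])[:, 1]
            ranked = np.argsort(probs)
            picks = [(j, 1) for j in ranked[-p:]] + [(j, 0) for j in ranked[:n]]
            for j, label in picks:
                idx = pool[j]
                if idx in chosen:                     # already taken this round
                    continue
                chosen.add(idx)
                for w in (0, 1):                      # add to L in both views
                    views_l[w] = np.vstack([views_l[w], views_u[w][idx:idx+1]])
                y_l = np.concatenate([y_l, [label]])
        pool = [i for i in pool if i not in chosen]
        while rest and len(pool) < pool_size:         # replenish U' from U
            pool.append(rest.pop())
        if not pool:
            break
    return clfs
```

Blum and Mitchell's web-page experiments used p = 1, n = 3, a pool of u = 75, and 30 iterations; the three-view emotion-recognition variant above simply adds a third (view, classifier) pair.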
Semi-supervised & Active Learning
• The most common strategy for instance selection is based on class probability estimates
• Semi-supervised learning: select the k instances with the highest class probabilities (the model's most confident predictions become training labels)
• Active learning: select the k instances with the lowest class probabilities (the model's least confident predictions are sent to an annotator)

Active Learning
• There is usually an abundance of unlabeled data
• How much should you label? Which instances should you label? Does it matter?
• Can the learner benefit from selective labeling?
• Active learning: incrementally request labels for key instances

Learning Paradigms
• [Figure: supervised learning labels randomly chosen instances; active learning queries the instances it selects]

Active Learning Applications
• Speech recognition
  • 10 minutes to annotate the words in 1 minute of speech
  • 7 hours to annotate the phonemes of 1 minute of speech
• Named entity recognition
  • Half an hour for a simple newswire article
  • A PhD for a bioinformatics article
• Image annotation

Face/Pedestrian/Object Detection
• [Image examples]

Heuristic Active Learning Algorithm
• Start with unlabeled data
• Randomly pick a small number of examples to have labeled
• Repeat:
  • Train a classifier on the labeled data
  • Query the unlabeled example that is closest to the decision boundary, has the least certainty, or minimizes overall uncertainty

Two Gaussians
• a) Two classes with Gaussian distributions
• b) Logistic regression on 30 randomly chosen labeled examples: 70% accuracy
• c) Logistic regression on 30 examples chosen by active learning: 90% accuracy

Space of Active Learning
• Query types: membership query synthesis, stream-based selective sampling, pool-based sampling
• Sampling methods: uncertainty sampling, query by committee, expected model change, variance reduction, estimated error reduction, density-weighted methods

Active Learning Query Types
• [Figure from Burr Settles, 2009, AL Tutorial]

Membership Query Synthesis
• Dynamically construct query instances based on expected informativeness
• Applications:
  • Character recognition
  • Robot scientist: find the optimal growth medium for a yeast
    • 3× cost decrease vs. always running the cheapest experiment next
    • 100× cost decrease vs. random selection

Stream-based Selective Sampling
• Informativeness measure: region of uncertainty / version space
• Applications: part-of-speech tagging, sensor scheduling, IR ranking, word sense disambiguation

Pool-based Active Learning
• Informativeness measure computed over a pool of unlabeled data
• Applications: cancer diagnosis, text classification, information extraction, image classification and retrieval, video classification and retrieval, speech recognition

Pool-based Active Learning Loop
• [Figure: the pool-based active learning cycle, from Burr Settles, 2009, AL Tutorial; a code sketch follows]
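A minimal sketch of the pool-based loop in the figure above, assuming a scikit-learn-style classifier and a hypothetical oracle function that stands in for the human annotator (neither is part of any fixed API).

```python
import numpy as np

def pool_based_al(clf, X_l, y_l, X_pool, oracle, budget=100):
    """Pool-based active learning: train on L, score the whole pool by
    uncertainty, query the oracle for the least confident instance's label,
    move it into L, and repeat until the labeling budget is spent."""
    for _ in range(budget):
        if len(X_pool) == 0:
            break
        clf.fit(X_l, y_l)
        conf = clf.predict_proba(X_pool).max(axis=1)
        q = int(np.argmin(conf))                # least confident instance
        y_q = oracle(X_pool[q])                 # ask the annotator for its label
        X_l = np.vstack([X_l, X_pool[q:q+1]])
        y_l = np.concatenate([y_l, [y_q]])
        X_pool = np.delete(X_pool, q, axis=0)
    return clf.fit(X_l, y_l)
```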
Questions
• Questions???

Instance Sampling in Active Learning
• The sampling methods in the space of active learning: uncertainty sampling, query by committee, expected model change, variance reduction, estimated error reduction, density-weighted methods

Uncertainty Sampling
• Select examples based on confidence in the prediction; scoring functions for all three measures appear in the sketch below
• Least confident: query the instance whose most probable label is least probable
  $\hat{y} = \operatorname{argmax}_y P(y \mid x; \theta)$, $\quad x^*_{LC} = \operatorname{argmin}_x P(\hat{y} \mid x; \theta)$
• Margin sampling: query the instance with the smallest gap between the two most likely labels
  $x^*_M = \operatorname{argmin}_x \big( P(\hat{y} \mid x; \theta) - P(\tilde{y} \mid x; \theta) \big)$, where $\tilde{y} = \operatorname{argmax}_{y \neq \hat{y}} P(y \mid x; \theta)$
• Entropy-based: query the instance with the highest label entropy
  $x^*_H = \operatorname{argmin}_x \sum_k P(y_k \mid x; \theta) \log P(y_k \mid x; \theta)$
  (minimizing $\sum_k p \log p$ maximizes entropy)
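The three uncertainty measures are one-liners over a matrix of class probabilities. A sketch, matching the argmin forms above (each function returns a per-instance score whose argmin identifies the query):

```python
import numpy as np

def least_confident(probs):
    """P(y_hat | x): a low max-probability means an informative instance."""
    return probs.max(axis=1)

def margin(probs):
    """P(y_hat | x) - P(y_tilde | x): the gap between the top two labels."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def neg_entropy(probs, eps=1e-12):
    """sum_k P(y_k | x) log P(y_k | x): minimizing this maximizes entropy."""
    return (probs * np.log(probs + eps)).sum(axis=1)

# Usage with a fitted probabilistic classifier clf and unlabeled pool X_pool:
#   probs = clf.predict_proba(X_pool)
#   query_index = np.argmin(margin(probs))
```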
Query by Committee
• Train a committee of hypotheses $C = \{h_1, h_2, \ldots, h_m\}$ representing different regions of the version space
• Obtain some measure of (dis)agreement on the instances in the dataset, e.g., vote entropy:
  $x^*_{VE} = \operatorname{argmin}_x \sum_k \frac{V(y_k)}{|C|} \log \frac{V(y_k)}{|C|}$
  where $V(y_k)$ is the number of committee members voting for label $y_k$
• Assume the most informative instance is the one on which the committee has the most disagreement
• Goal: minimize the version space
• There is no agreement on committee size, but even 2-3 members provide good results

Competing Hypotheses
• [Figure from Burr Settles, 2009, AL Tutorial]

Expected Model Change
• Query the instance that would result in the largest expected change in the model, where the expectation is over the possible labels under the current model
  • E.g., the instance that would produce the largest gradient in the model parameters
• Prefer the instance x that leads to the most significant change in the model

Expected Model Change (cont.)
• What learning algorithms does this work for?
• What are the issues?
  • Can be computationally expensive for large datasets and feature spaces
  • Can be led astray if features aren't properly scaled
  • How do you properly scale the features?

Estimated Error Reduction
• Other methods approximate the goal of minimizing future error through surrogates (e.g., uncertainty, ...)
• Estimated error reduction attempts to minimize E[error] directly:
  $x^*_{0/1} = \operatorname{argmin}_x \sum_k P(y_k \mid x; \theta)\, E\big[\mathrm{Error}_{\theta^{+\langle x, y_k \rangle}}\big]$
  $\qquad = \operatorname{argmin}_x \sum_k P(y_k \mid x; \theta) \sum_{u=1}^{U} \Big( 1 - P\big(\hat{y} \mid x^{(u)}; \theta^{+\langle x, y_k \rangle}\big) \Big)$
  where $\theta^{+\langle x, y_k \rangle}$ is the model retrained with $\langle x, y_k \rangle$ added, and $\hat{y}$ is its most likely label for $x^{(u)}$

Estimated Error Reduction (cont.)
• Often computationally prohibitive
• Binary logistic regression would be O(|U||L|G), where G is the number of gradient-descent iterations to convergence
• Conditional random fields would be O(T|Y|^{T+2}|U||L|G), where T is the number of instances in the sequence

Variance Reduction
• For regression problems, E[error²] = noise + bias + variance:
  $E\big[(\hat{y} - y)^2 \mid x\big] = E\big[(y - E[y \mid x])^2\big] + \big(E_L[\hat{y}] - E[y \mid x]\big)^2 + E_L\big[(\hat{y} - E_L[\hat{y}])^2\big]$
  i.e., noise, bias, and variance, respectively
• The learner can't change the noise or the bias, so minimize the variance
• The Fisher information ratio is used for the classification analogue

Outlier Phenomenon
• Uncertainty sampling and query by committee might be hindered by querying many outliers

Density-Weighted Methods
• Uncertainty sampling and query by committee might be hindered by querying many outliers
• Density-weighted methods overcome this potential problem by also considering whether the example is representative of the input distribution:
  $x^* = \operatorname{argmax}_x\; \phi_A(x) \cdot \Big( \frac{1}{U} \sum_{u=1}^{U} \mathrm{sim}\big(x, x^{(u)}\big) \Big)^{\beta}$
  where $\phi_A(x)$ is the informativeness of $x$ under base measure $A$
• Tends to work better than any of the base measures on their own

Diversity
• Naïve selection by the earlier methods results in selecting examples that are very similar to one another
• Must factor this in and look for diversity among the queries

Active Learning Empirical Results
• Appears to work well, barring publication bias (From Burr Settles, 2009, AL Tutorial)

Labeling Costs
• Are all labels created equal?
  • Generating labels by experiments
  • Some instances are easier to label (e.g., shorter sentences)
  • Can pre-label data for a small savings
  • Experimental problems
• Value of information (VOI): considers labeling costs and estimated misclassification costs
• Cost is critical to the goal of active learning
• Divide informativeness by cost?

Batch Mode Active Learning
• Query a batch of instances per round rather than one at a time

Active Learning Evaluation
• Learning curves for text classification (baseball vs. hockey) plot classification accuracy as a function of the number of documents queried, for two selection strategies: uncertainty sampling (active learning) and random sampling (passive learning). The active learning approach is superior here because its learning curve dominates that of random sampling. (From Burr Settles, 2009, AL Tutorial)

Active Learning Evaluation (cont.)
• We can conclude that an active learning algorithm is superior to some other approach (e.g., a random baseline like traditional passive supervised learning) if it dominates the other for most or all of the points along their learning curves. (From Burr Settles, 2009, AL Tutorial)
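A sketch of how such learning curves can be produced, under assumptions: logistic regression as the learner, and the held-out pool labels y_pool simulating the oracle. Run it once per strategy and plot the two accuracy curves against the number of queries.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learning_curve(X_l, y_l, X_pool, y_pool, X_test, y_test,
                   strategy="uncertainty", budget=100, seed=0):
    """Accuracy after each query for one selection strategy:
    'uncertainty' (active) or 'random' (passive baseline)."""
    rng = np.random.default_rng(seed)
    X_l, y_l = X_l.copy(), y_l.copy()
    X_pool, y_pool = X_pool.copy(), y_pool.copy()
    accuracies = []
    for _ in range(budget):
        clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)
        accuracies.append(clf.score(X_test, y_test))
        if strategy == "uncertainty":           # query the least confident
            q = int(np.argmin(clf.predict_proba(X_pool).max(axis=1)))
        else:                                   # passive: query at random
            q = int(rng.integers(len(X_pool)))
        X_l = np.vstack([X_l, X_pool[q:q+1]])   # simulated oracle: y_pool[q]
        y_l = np.concatenate([y_l, y_pool[q:q+1]])
        X_pool = np.delete(X_pool, q, axis=0)
        y_pool = np.delete(y_pool, q)
    return accuracies
```

If the uncertainty curve sits above the random curve at most budgets, the dominance criterion on the final slide is met.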