Machine Learning Overview
Tamara Berg
Recognizing People, Objects, and Actions

Today
• The schedule has been adjusted slightly due to Monday's cancellation
  – Today: overview of machine learning algorithms (other than deep learning)
  – We will cover a quick intro to deep learning on day 2 of the object recognition topic
• The Topic Presentation groups have been posted to the class webpage
  – Group 1 (Feb 15/17) should meet with me early next week to go over the presentation outline and proposed paper list (Adam, Zherong, Jae-Sung, Cheng-Yang)

For next class
• Read the assigned object recognition papers (posted later today)
• Before class, turn in a hard-copy ½-page summary for each assigned paper outlining: 1) the goal of the paper, 2) the approach, 3) what was novel, and 4) what you thought of the paper (summary template on the class webpage)

To Do – prepping for projects
– Install your favorite machine learning tool (e.g. CNNs, SVMs, etc.)
– Download your favorite image dataset (ImageNet subset, LFW face dataset, Zappos shoe dataset, …)
– Run a simple image classification experiment: split your dataset into training/testing sets and train a classifier to recognize images from each category (may or may not require extracting features)
– Useful code/data/etc.: https://github.com/jbhuang0604/awesome-computervision
– Deep learning: http://caffe.berkeleyvision.org/, http://torch.ch/docs/cvpr15.html, https://www.tensorflow.org/

Types of ML algorithms
• Unsupervised – algorithms operate on unlabeled examples
• Supervised – algorithms operate on labeled examples
• Semi/partially-supervised – algorithms combine both labeled and unlabeled examples

Unsupervised Learning, e.g. clustering

K-means clustering
• Goal: minimize the sum of squared Euclidean distances between points $x_i$ and their nearest cluster centers $m_k$:
  $D(X, M) = \sum_{k} \sum_{i \in \text{cluster } k} (x_i - m_k)^2$
• Algorithm:
  – Randomly initialize K cluster centers
  – Iterate until convergence:
    • Assign each data point to the nearest center
    • Recompute each cluster center as the mean of all points assigned to it
(source: Svetlana Lazebnik)
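The two-step k-means update above is easy to prototype. A minimal NumPy sketch (illustrative only, not the course's reference code; function and variable names are placeholders):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means: X is (N, D); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # Randomly initialize K cluster centers from the data points
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it
        centers_new = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(centers_new, centers):
            break  # converged: assignments no longer move the centers
        centers = centers_new
    return centers, assign
```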
Supervised Learning, e.g. nearest neighbor, decision trees, SVMs, boosting

[Introductory classification slides from Dan Klein]

Example: Image classification
• Input: images; desired output: category labels (apple, pear, tomato, cow, dog, horse)
• MNIST digits: http://yann.lecun.com/exdb/mnist/index.html
(Slide credit: Svetlana Lazebnik)

Example: Seismic data
• Plotting body wave magnitude vs. surface wave magnitude separates earthquakes from nuclear explosions
(Slide credit: Svetlana Lazebnik)

The basic classification framework
• y = f(x), where x is the input, f is the classification function, and y is the output
• Learning: given a training set of labeled examples {(x1, y1), …, (xN, yN)}, estimate the parameters of the prediction function f
• Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)
(Slide credit: Svetlana Lazebnik)

Some ML classification methods
• Nearest neighbor: Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …
• Neural networks (10^6 examples): LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
• Support Vector Machines and Kernels: Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
• Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …
(Slide credit: Antonio Torralba)

Example: Training and testing
• Training set (labels known); test set (labels unknown)
• Key challenge: generalization to unseen examples
(Slide credit: Svetlana Lazebnik)

Classification by Nearest Neighbor
• Word-vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?
• Classify the test document as the class of the document "nearest" to the query document (use vector similarity to find the most similar document)
(Slide credit: Dan Klein)

Classification by kNN
• Classify the test document as the majority class of the k nearest documents
• What are the features? What's the training data? Testing data? Parameters?

[Worked kNN example slides from Min-Yen Kan]

NN (examples from computer vision)

NN for pose estimation
• Fast Pose Estimation with Parameter Sensitive Hashing – Shakhnarovich, Viola, Darrell
• Algorithm flow: input query → processed query (representation) → match against a database of examples → output

NN for vision: IM2GPS
• J. Hays and A. Efros, IM2GPS: estimating geographic information from a single image, CVPR 2008
• Where? What can you say about where these photos were taken?
• How? Collect a large collection of geo-tagged photos: 6.5 million images with both GPS coordinates and geographic keywords, removing images with keywords like birthday, concert, abstract, …
• Test set: 400 images randomly sampled from this collection; abstract photos and photos with recognizable people were manually removed, leaving 237 test photos

Nearest Neighbor Matching
• For each input image, compute features (color, texture, shape)
• Compute the distance in feature space to all 6 million images in the database (each feature contributes equally)
• Label the image with the GPS coordinates of:
  – its 1 nearest neighbor, or
  – its k = 120 nearest neighbors – a probability map over the entire globe

[Example results]
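A bare-bones k-nearest-neighbor classifier in the spirit of the slides above can be sketched in a few lines of NumPy. This is a minimal illustration (not the IM2GPS system); the toy data and k value are made up, and feature extraction is assumed to have already happened:

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query_x, k=1):
    """Label a query by the majority class of its k nearest training examples.

    train_X: (N, D) feature vectors; train_y: length-N list of labels;
    query_x: (D,) feature vector for the test example.
    """
    # Euclidean distance from the query to every training example
    dists = np.linalg.norm(train_X - query_x, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels (k = 1 is plain nearest neighbor)
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy usage: three 2-D "documents" with two classes
train_X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]])
train_y = ["sports", "sports", "politics"]
print(knn_classify(train_X, train_y, np.array([0.2, 0.8]), k=3))
```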
Decision tree classifier
• Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:
  1. Alternate: is there an alternative restaurant nearby?
  2. Bar: is there a comfortable bar area to wait in?
  3. Fri/Sat: is today Friday or Saturday?
  4. Hungry: are we hungry?
  5. Patrons: number of people in the restaurant (None, Some, Full)
  6. Price: price range ($, $$, $$$)
  7. Raining: is it raining outside?
  8. Reservation: have we made a reservation?
  9. Type: kind of restaurant (French, Italian, Thai, Burger)
  10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
• Another example: shall I play tennis today?

Choosing the next attribute for splitting
• Internal nodes test attributes; leaf nodes assign class labels
• How do we choose the best attribute?

Criterion for attribute selection
• Which is the best attribute?
  – The one which will result in the smallest tree
  – Heuristic: choose the attribute that produces the "purest" child nodes
• Need a good measure of purity!

Information Gain
• Which test is more informative: splitting on Humidity (> 75% vs. <= 75%) or on Wind (> 20 vs. <= 20)?
• Impurity/entropy (informal): measures the level of impurity in a group of examples, from a very impure group (classes mixed) to minimum impurity (a single class)

Entropy: a common way to measure impurity
• $\text{Entropy} = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of class i, computed as the proportion of class i in the set

2-class cases
• What is the entropy of a group in which all examples belong to the same class?
  – Entropy = $-1 \log_2 1 = 0$ (minimum impurity)
• What is the entropy of a group with 50% in either class?
  – Entropy = $-0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$ (maximum impurity)

Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned
• Information gain tells us how useful a given attribute of the feature vectors is
• We can use it to decide the ordering of attributes in the nodes of a decision tree

Calculating Information Gain
• Information Gain = entropy(parent) − [weighted average entropy(children)]
• Entire population (parent, 30 instances split 14/16):
  entropy = $-\frac{14}{30}\log_2\frac{14}{30} - \frac{16}{30}\log_2\frac{16}{30} = 0.996$
• Child 1 (17 instances split 13/4):
  entropy = $-\frac{13}{17}\log_2\frac{13}{17} - \frac{4}{17}\log_2\frac{4}{17} = 0.787$
• Child 2 (13 instances split 1/12):
  entropy = $-\frac{1}{13}\log_2\frac{1}{13} - \frac{12}{13}\log_2\frac{12}{13} = 0.391$
• (Weighted) average entropy of children = $\frac{17}{30}(0.787) + \frac{13}{30}(0.391) = 0.615$
• Information Gain = 0.996 − 0.615 = 0.38
• At each node, choose the next attribute for splitting, e.g. based on information gain
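The numbers on the information-gain slide are easy to verify with a short sketch (assuming the two-class counts from the example above; the function names are placeholders):

```python
import math

def entropy(counts):
    """Entropy of a group given per-class counts: -sum_i p_i log2 p_i."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """entropy(parent) minus the weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Counts from the slide: parent split 14/16, children split 13/4 and 1/12
print(entropy([14, 16]))                               # ~0.996
print(entropy([13, 4]), entropy([1, 12]))              # ~0.787, ~0.391
print(information_gain([14, 16], [[13, 4], [1, 12]]))  # ~0.38
```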
Linear classifier
• Find a linear function to separate the classes:
  $f(x) = \text{sgn}(w_1 x_1 + w_2 x_2 + \dots + w_D x_D) = \text{sgn}(w \cdot x)$
(Slide credit: Svetlana Lazebnik)

Discriminant Function
• The discriminant can be an arbitrary function of x, such as:
  – Nearest neighbor
  – Decision tree
  – Linear functions: $g(x) = w^T x + b$
(Slide credit: Jinwei Gu)

Linear Discriminant Function
• g(x) is a linear function: $g(x) = w^T x + b$, a hyperplane in the feature space
• Points with $w^T x + b > 0$ are labeled +1; points with $w^T x + b < 0$ are labeled −1
• How would you classify these points using a linear discriminant function in order to minimize the error rate? There are an infinite number of answers – which one is the best?
(Slide credit: Jinwei Gu)

Large Margin Linear Classifier
• The linear discriminant function (classifier) with the maximum margin is the best
• Margin is defined as the width by which the boundary could be increased before hitting a data point
• Why is it the best? The margin acts as a "safe zone" and gives strong generalization ability – this is the linear SVM
• The training points that lie on the margin are the support vectors
(Slide credit: Jinwei Gu)

Discriminating between classes
• The linear discriminant function is:
  $g(x) = w^T x + b = \sum_{i \in SV} \alpha_i x_i^T x + b$
• Notice it relies on a dot product between the test point x and the support vectors $x_i$

Non-linear SVMs: Feature Space
• The classes are not always linearly separable in the original input space
• General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: x \rightarrow \phi(x)$
(Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)

Nonlinear SVMs: The Kernel Trick
• With this mapping, our discriminant function becomes:
  $g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i \phi(x_i)^T \phi(x) + b$
• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
  $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
• Examples of commonly used kernel functions:
  – Linear: $K(x_i, x_j) = x_i^T x_j$
  – Polynomial: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
  – Gaussian (Radial Basis Function, RBF): $K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
  – Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$

Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C and any other parameters (e.g. σ)
3. Solve the quadratic programming problem (many software packages available)
4. Classify held-out validation instances using the learned model
5. Select the best learned model based on validation accuracy
6. Classify test instances using the final selected model

SVMs in Computer Vision: Detection
• We slide a window over the image
• Extract features x for each window
• Classify each window into positive (+1) or negative (−1) with F(x) = y

[Example detection results]

Summary: Support Vector Machine
1. Large margin classifier
   – Better generalization ability & less over-fitting
2. The kernel trick
   – Map data points to a higher-dimensional space in order to make them linearly separable
   – Since only the dot product is needed, we do not need to represent the mapping explicitly
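As a concrete sketch of steps 1–6 above, the snippet below trains an RBF-kernel SVM with scikit-learn and selects C and the kernel width on a validation split. The toy dataset and the candidate parameter grids are placeholders, not values from the lecture; in scikit-learn the RBF width is controlled by gamma rather than σ directly:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy non-linearly-separable data standing in for image features
X, y = make_moons(n_samples=600, noise=0.2, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_model, best_acc = None, 0.0
for C in [0.1, 1, 10]:             # step 2: candidate values of C
    for gamma in [0.1, 1, 10]:     # gamma plays the role of 1/(2*sigma^2) in the RBF kernel
        model = SVC(kernel="rbf", C=C, gamma=gamma)    # steps 1 & 3
        model.fit(X_train, y_train)
        acc = model.score(X_val, y_val)                # step 4: validation accuracy
        if acc > best_acc:
            best_model, best_acc = model, acc          # step 5: keep the best model
print("test accuracy:", best_model.score(X_test, y_test))  # step 6
```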
Model Ensembles: Random Forests
• A variant of bagging proposed by Breiman
• The classifier consists of a collection of decision-tree-structured classifiers
• Each tree casts a vote for the class of the input x

Boosting
• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement, doesn't require external optimization tools; used for many real problems in AI

Boosting
• Defines a classifier using an additive model: the strong classifier is a weighted sum of weak classifiers applied to the input feature vector
• We need to define a family of weak classifiers; each round selects one weak classifier from that family

Adaboost
• Input: training samples
• Initialize weights on the samples
• For T iterations:
  – Select the best weak classifier based on the weighted error
  – Update the sample weights
• Output: final strong classifier (a combination of the selected weak classifiers' predictions)

Boosting: toy example
• Boosting is a sequential procedure; each data point $x_t$ has a class label $y_t = +1$ or $-1$ and a weight, initially $w_t = 1$
• The weak learners come from the family of lines; a line h with p(error) = 0.5 is at chance
• The line that seems to be the best is selected – this is a "weak classifier": it performs only slightly better than chance
• After each round we update the weights, $w_t \leftarrow w_t \exp\{-y_t H_t(x_t)\}$, so that misclassified points receive more weight in the next round
• Repeating this over several rounds (f1, f2, f3, f4), the strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers
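A compact, illustrative AdaBoost loop following the recipe above, with one-dimensional threshold stumps as the weak learners. The stump search and the exact α and weight-update formulas are the textbook discrete-AdaBoost choices, not code from the slides:

```python
import numpy as np

def best_stump(X, y, w):
    """Weak learner: the stump (feature, threshold, polarity) with lowest weighted error."""
    best = (0, 0.0, 1, np.inf)                    # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(X[:, j] > thr, pol, -pol)
                err = w[pred != y].sum()          # weighted error of this stump
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost(X, y, T=20):
    """y in {-1, +1}. Returns a list of (alpha, feature, threshold, polarity)."""
    w = np.full(len(y), 1.0 / len(y))             # initialize sample weights
    ensemble = []
    for _ in range(T):
        j, thr, pol, err = best_stump(X, y, w)    # select best weak classifier
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak classifier
        pred = np.where(X[:, j] > thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)            # boost weights of misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def strong_classify(ensemble, X):
    """Sign of the weighted sum of weak classifier outputs (the additive model)."""
    score = sum(a * np.where(X[:, j] > thr, pol, -pol) for a, j, thr, pol in ensemble)
    return np.sign(score)
```

With stumps over Haar-like rectangle features, this is essentially the classifier structure used for face detection in the next section (minus the attentional cascade).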
Boosting for Face Detection

Face detection
• We slide a window over the image
• Extract features x for each window
• Classify each window into face (+1) or non-face (−1) with F(x) = y

What is a face?
• Eyes are dark (eyebrows + shadows)
• Cheeks and forehead are bright
• Nose is bright
(Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04)

Basic feature extraction
• Information type: intensity
• Sum over gray and white rectangles
• Output: (sum over gray) − (sum over white)
• A separate output value for each feature type, each scale, and each position in the window (e.g. x120, x357, x629, x834)
• FEX(im) = x = [x1, x2, …, xn]
(Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04)

Decision stumps
• Stump: a decision tree with 1 root and 2 leaves
• If $x_i > a$ then positive, else negative
• Very simple – a "weak classifier"
(Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04)

Summary: Face detection
• Use decision stumps (e.g. thresholding a single rectangle feature, "is x234 > 1.3?") as weak classifiers
• Use boosting to build a strong classifier
• Use a sliding window to detect the face

Semi-Supervised Learning

Supervised learning has many successes
• Recognize speech, steer a car
• Classify documents, classify proteins
• Recognize faces, objects in images
• …
(Slide credit: Avrim Blum)

However, for many problems, labeled data can be rare or expensive
• You need to pay someone to label it, it requires special testing, …
• Unlabeled data is much cheaper
• Examples: speech, customer modeling, images, protein sequences, medical outcomes, web pages [examples from Jerry Zhu]
• Can we make use of cheap unlabeled data?
(Slide credit: Avrim Blum)

Semi-Supervised Learning
• Can we use unlabeled data to augment a small labeled sample to improve learning?
• But unlabeled data is missing the most important info (the labels)!
• But maybe it still has useful regularities that we can use.
(Slide credit: Avrim Blum)

Method 1: EM

How to use unlabeled data
• One way is to use the EM algorithm (EM: Expectation Maximization)
• The EM algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data
• The EM algorithm consists of two steps:
  – Expectation step: filling in the missing data
  – Maximization step: calculate a new maximum a posteriori estimate for the parameters

Example Algorithm
1. Train a classifier with only the labeled documents.
2. Use it to probabilistically classify the unlabeled documents.
3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 to convergence.
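A minimal sketch of the example algorithm above, using a Naive Bayes classifier from scikit-learn as the probabilistic model. The synthetic dataset, the 20-example labeled set, the confidence-weighted retraining, and the fixed iteration count are all placeholder choices, not details from the lecture:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Toy data: a small labeled set plus a large unlabeled pool
X, y = make_blobs(n_samples=500, centers=2, random_state=0)
labeled = np.arange(20)              # pretend only 20 points come with labels
unlabeled = np.arange(20, 500)

clf = GaussianNB()
clf.fit(X[labeled], y[labeled])      # 1. train on the labeled data only

for _ in range(10):                  # 4. iterate steps 2 and 3
    # 2. probabilistically classify the unlabeled examples
    proba = clf.predict_proba(X[unlabeled])
    pseudo = clf.classes_[proba.argmax(axis=1)]
    confidence = proba.max(axis=1)
    # 3. retrain on ALL the data, weighting pseudo-labels by their confidence
    X_all = np.vstack([X[labeled], X[unlabeled]])
    y_all = np.concatenate([y[labeled], pseudo])
    w_all = np.concatenate([np.ones(len(labeled)), confidence])
    clf = GaussianNB()
    clf.fit(X_all, y_all, sample_weight=w_all)

print("accuracy on the unlabeled pool:",
      (clf.predict(X[unlabeled]) == y[unlabeled]).mean())
```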
Method 2: Co-Training

Co-training [Blum & Mitchell '98]
• Many problems have two different sources of info ("features/views") you can use to determine the label
• E.g., classifying faculty webpages: you can use the words on the page or the words on the links pointing to the page
• Split the example x into x1 (link info) and x2 (text info)
(Slide credit: Avrim Blum)

Co-training idea
• Use a small labeled sample to learn initial rules
  – E.g., "my advisor" pointing to a page is a good indicator that it is a faculty home page
  – E.g., "I am teaching" on a page is a good indicator that it is a faculty home page
• Then look for unlabeled examples <x1, x2> where one view is confident and the other is not; have the confident view label the example for the other
• In other words: train 2 classifiers, one on each type of info, using each to help train the other
(Slide credit: Avrim Blum)

Co-training vs. EM
• Co-training splits the features; EM does not
• Co-training uses the unlabeled data incrementally
• EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data
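A rough sketch of the co-training loop under placeholder assumptions: two synthetic feature views (the two halves of a made-up feature vector), Naive Bayes base learners, and hand-picked confidence thresholds. None of these choices come from Blum & Mitchell; they only illustrate the "one confident view labels for the other" idea:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Two "views": split a synthetic feature vector into x1 (first half) and x2 (second half)
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_redundant=0, random_state=0)
X1, X2 = X[:, :4], X[:, 4:]

labels = np.full(len(y), -1)        # -1 means "unlabeled"
labels[:30] = y[:30]                # small labeled sample; the rest stays unlabeled

clf1, clf2 = GaussianNB(), GaussianNB()
for _ in range(10):
    known = labels != -1
    # Train one classifier per view on everything labeled (or pseudo-labeled) so far
    clf1.fit(X1[known], labels[known])
    clf2.fit(X2[known], labels[known])
    added = 0
    for i in np.where(~known)[0]:
        p1 = clf1.predict_proba(X1[[i]])[0]
        p2 = clf2.predict_proba(X2[[i]])[0]
        # When one view is confident and the other is not, it labels the example for the other
        if abs(max(p1) - max(p2)) > 0.3 and max(max(p1), max(p2)) > 0.9:
            winner = p1 if max(p1) > max(p2) else p2
            labels[i] = clf1.classes_[winner.argmax()]
            added += 1
    if added == 0:
        break                       # no confident new labels: stop

print("accuracy on originally unlabeled data:",
      (clf1.predict(X1[30:]) == y[30:]).mean())
```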