CZ4041/CE4041: Machine Learning
Lecture 1b: Overview of Supervised Learning
Sinno Jialin Pan
School of Computer Science and Engineering, NTU, Singapore
Homepage: http://www3.ntu.edu.sg/home/sinnopan/

Supervised Learning
Recall: in supervised learning, the examples presented to the computer are pairs of inputs and their corresponding outputs (i.e., labeled training data). The goal is to "learn" a map, or model, from inputs to outputs.
[Illustration: face images (inputs) mapped to the labels Female / Male (outputs).]

Supervised Learning (cont.)
In mathematical terms:
Given: labeled training data (𝒙𝑖, 𝑦𝑖) for 𝑖 = 1, …, 𝑁, where 𝒙𝑖 = [𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑚] is a vector of features and 𝑦𝑖 is a scalar output.
Goal: to learn a mapping (classifier) 𝑓: 𝒙 → 𝑦 by requiring 𝑓(𝒙𝑖) = 𝑦𝑖.
The learned mapping 𝑓 is expected to make precise predictions on any unseen 𝒙∗ as 𝑓(𝒙∗).

Example 1: Sentiment Classification
Review 1 (𝑦1 = +1): "Compact; easy to operate; very good picture quality; looks sharp!"
Review 2 (𝑦2 = −1): "It is also quite blurry in very dark settings. I will never_buy HP again."
Dictionary: F1 = compact, F2 = easy, F3 = quite, F4 = blurry, F5 = good, F6 = never_buy, …
Bag-of-words representation:
      F1  F2  F3  F4  F5  F6  …     y
𝒙1    1   1   0   0   1   0   …    +1
𝒙2    0   0   1   1   0   1   …    −1

Other Examples
Gender classification from face images: image processing and computer vision techniques turn each image into a feature vector
𝒙1 = [𝑥11, 𝑥12, …, 𝑥1𝑚], 𝒙2 = [𝑥21, 𝑥22, …, 𝑥2𝑚], 𝒙3 = [𝑥31, 𝑥32, …, 𝑥3𝑚], 𝒙4 = [𝑥41, 𝑥42, …, 𝑥4𝑚],
with labels −1 for Female and +1 for Male.
Activity recognition from wearable sensors: each reading is a feature vector 𝒙𝑡 = [𝑥𝑡1, 𝑥𝑡2, …, 𝑥𝑡𝑚] with label 𝑦𝑡 = 𝑙, where 𝑙 ∈ {1, …, 𝐿}, e.g., 1 – working, 2 – sleeping, etc.

Other Examples (cont.)
Categorical records can be converted into feature vectors with one-hot encoding.
Original table:
Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Divorced        95K             Yes
After one-hot encoding:
      HO_Yes  HO_No  MS_S  MS_M  MS_D  TaxInc      Cheat
𝒙1    1       0      1     0     0     125K     𝑦1 = −1
𝒙2    0       1      0     1     0     100K     𝑦2 = −1
𝒙3    0       1      0     0     1     95K      𝑦3 = +1

Classification vs. Regression
For classification, 𝑦 is discrete:
if 𝑦 is binary, the task is binary classification;
if 𝑦 is nominal but not binary, the task is multi-class classification.
For regression, 𝑦 is continuous.
Question: can an example be assigned to more than one category?

Typical Learning Procedure
Two phases:
Training phase: given labeled training data (𝒙𝑖, 𝑦𝑖) for 𝑖 = 1, …, 𝑁, apply a supervised learning algorithm to learn a model 𝑓 such that 𝑓(𝒙𝑖) = 𝑦𝑖.
Testing or prediction phase: given unseen test data 𝒙𝑖∗ for 𝑖 = 1, …, 𝑇, use the trained model 𝑓 to make the predictions 𝑓(𝒙𝑖∗).

Inductive Learning
[Illustration: labeled training data (𝒙1, 𝑦1), …, (𝒙𝑁, 𝑦𝑁) is fed to a classification algorithm, which learns a model 𝑓: 𝒙 → 𝑦; the model is then applied to unseen test data 𝒙∗ to produce the predicted label 𝑦∗ = 𝑓(𝒙∗).]
How to learn a predictive model 𝑓(∙):
Bayesian classifiers
Decision trees
Artificial neural networks
Support vector machines
Regularized regression models

Bayesian Classifiers: Naïve Bayes Classifier
[Illustration with variables Marital Status (MS), Home Owner (HO), Taxable Income (TI) and class Cheat (C), e.g.:
P(MS=Married) = 0.5, P(MS=Single) = 0.4, P(HO=Yes) = 0.4, P(TI > 80K) = 0.4,
P(C=Yes | MS=Married, HO=Yes, TI>80K) = 0.8,
P(C=Yes | MS=Married, HO=No, TI>80K) = 0.6,
P(C=Yes | MS=Married, HO=Yes, TI≤80K) = 0.7,
P(C=Yes | MS=Married, HO=No, TI≤80K) = 0.5, …]

Bayesian Classifiers: Bayesian Belief Networks
[Illustration of a Bayesian belief network.]

Decision Tree
[Example tree for the Cheat data:]
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES

Artificial Neural Networks
[Illustration: an input layer (𝑥1, 𝑥2, …, 𝑥𝑚), one or more hidden layers, and an output layer (𝑦1, 𝑦2, …, 𝑦𝑘).]

Support Vector Machines (SVMs)
Linear SVMs [illustration: a linear decision boundary B2 separating two classes].
Nonlinear SVMs: kernel trick!
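To make the pipeline above concrete, here is a minimal Python sketch of the bag-of-words representation from the sentiment example together with the two-phase train-then-predict procedure. The tiny dictionary and the two reviews come from the slides; the nearest-centroid rule is only an illustrative stand-in for the inductive learners listed above, not an algorithm from the course.

```python
# Minimal sketch (not from the slides): bag-of-words features + a stand-in
# inductive learner (nearest class centroid) used in two phases: train, then predict.

dictionary = ["compact", "easy", "quite", "blurry", "good", "never_buy"]  # F1..F6

def bag_of_words(text):
    """Binary feature vector x = [x1, ..., xm] over the dictionary."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in dictionary]

# Labeled training data (x_i, y_i), i = 1..N; +1 = positive, -1 = negative sentiment.
docs = [("compact easy to operate very good picture quality looks sharp", +1),
        ("quite blurry in very dark settings i will never_buy hp again", -1)]
X_train = [bag_of_words(d) for d, _ in docs]
y_train = [label for _, label in docs]

def train(X, y):
    """Training phase: the learned 'model' f is one mean feature vector per class."""
    model = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict(model, x_star):
    """Prediction phase: f(x*) = label of the closest class centroid."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: dist(model[label], x_star))

model = train(X_train, y_train)
print(predict(model, bag_of_words("good picture quality and easy to use")))  # prints 1 (positive)
```

The same two-phase structure applies whichever of the learning algorithms listed above replaces the centroid rule.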
Regularized Regression Models
[Illustration: fitting a target t as a function of x. Linear regression fits a straight line; nonlinear regression fits a curve, again via the kernel trick.]

Lazy Learning
[Illustration: the labeled training data (𝒙1, 𝑦1), …, (𝒙𝑁, 𝑦𝑁) is simply stored; the predicted label 𝑦∗ for a test instance 𝒙∗ is computed directly from the stored data, without first building an explicit model.]
E.g., the 𝐾-nearest-neighbor classifier.

Evaluation of Supervised Learning
Recall: the learned predictive model should make predictions on previously unseen data as accurately as possible.
Solution: the whole training data set is divided into a "training" set and a "test" (validation) set; the "training" set is used to learn the predictive model, and the "test" set is used to validate the performance of the learned model.
[Illustration: Original Training Data → Training Data + Validation Data.]
[Illustration: of all the training data (𝒙1, 𝑦1), …, (𝒙𝑁, 𝑦𝑁), the first 𝐿 instances are used for training to learn 𝑓: 𝒙 → 𝑦; the remaining instances are used for evaluation, comparing the predictions 𝑦̂𝑖 = 𝑓(𝒙𝑖) for 𝑖 = 𝐿 + 1, …, 𝑁 against the true labels 𝑦𝐿+1, …, 𝑦𝑁 with some evaluation metric.]

Common Evaluation Metrics
Classification: accuracy, error rate, F-score, etc.
Example:
            𝒙1   𝒙2   𝒙3   𝒙4   𝒙5   𝒙6   𝒙7   𝒙8
Actual      +1   −1   +1   −1   −1   +1   −1   +1
Prediction  −1   −1   +1   +1   −1   +1   −1   +1
Accuracy = 5/8, Error rate = 3/8
Regression: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE).

Split of Training and Evaluation Sets
Random subsampling
Cross-validation
Bootstrap

Random Subsampling
In each round, the original data is randomly split (sampled without replacement) into a training set and a validation set; the model is trained on the training set and its accuracy acc𝑖 is measured on the validation set.
Repeating this 𝑘 times and averaging gives acc_avg = (1/𝑘)(acc1 + acc2 + … + acc𝑘).

Cross-Validation
𝑘-fold cross-validation: partition the data into 𝑘 subsets (folds) of the same size.
In each iteration, hold aside one fold for testing and use the rest to build the model.

5-fold Cross-Validation
Partition the data into 5 subsets.
Hold aside 1 subset for testing and use the remaining 4 for training:
Iteration 1: test on fold 1, train on folds 2, 3, 4, 5.
Iteration 2: test on fold 2, train on folds 1, 3, 4, 5.
Iteration 3: test on fold 3, train on folds 1, 2, 4, 5.
Iteration 4: test on fold 4, train on folds 1, 2, 3, 5.
Iteration 5: test on fold 5, train on folds 1, 2, 3, 4.
Evaluate performance over the 5 test folds.

Leave-One-Out
A special case of 𝑘-fold cross-validation: leave-one-out, with 𝑘 = 𝑛, where 𝑛 is the number of training data instances.

Bootstrap
In each round, a training set is formed by sampling from the original data with replacement, and the remaining instances form the validation set. As with random subsampling, this is repeated over several rounds.

Next Week …
We start by introducing Bayesian classifiers.
Thank you!
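As a closing illustration of the evaluation procedures above, here is a minimal Python sketch of 𝑘-fold cross-validation with accuracy as the metric. The `train` and `predict` arguments are assumed placeholders for whatever learning algorithm is being evaluated (for example, the centroid sketch shown earlier); 𝑘 = 5 and the shuffle seed are illustrative choices, not part of the slides.

```python
# Minimal sketch of k-fold cross-validation with accuracy as the evaluation metric.
# Assumptions: train(X, y) returns a model and predict(model, x) returns a label.

import random

def k_fold_cross_validation(X, y, train, predict, k=5, seed=0):
    """In round i, fold i is held out for evaluation; the other k-1 folds train the model."""
    idx = list(range(len(X)))            # assumes len(X) >= k
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of (nearly) equal size

    accuracies = []
    for i in range(k):
        held_out = set(folds[i])
        X_tr = [X[j] for j in idx if j not in held_out]
        y_tr = [y[j] for j in idx if j not in held_out]
        model = train(X_tr, y_tr)

        # Compare predictions y_hat = f(x) with the true labels on the held-out fold.
        correct = sum(1 for j in folds[i] if predict(model, X[j]) == y[j])
        accuracies.append(correct / len(folds[i]))

    return sum(accuracies) / k           # acc_avg = (1/k)(acc_1 + ... + acc_k)
```

Setting 𝑘 equal to the number of training instances recovers leave-one-out; drawing the training indices with replacement instead of partitioning into folds would give a bootstrap-style estimate instead.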