CSC242: Intro to AI
Lecture 20: Learning (from Examples)

Learning
• "... agents that can improve their behavior through diligent study of their own experiences"
• Improving one's performance on future tasks after making observations about the world

Why Learn?
• Can't anticipate all possible situations that the agent might find itself in
• Can't anticipate all changes that might occur over time
• Don't know how to program it other than by learning!

What To Learn?
• Inferring properties of the world (state) from perceptions
• Choosing which action to do in a given state
• Results of possible actions
• How the world evolves in time
• Utilities of states

What To Learn?
• Numbers (e.g., probabilities)
• Functions (e.g., continuous distributions, functional relations between variables)
• Rules (e.g., "All (most) X's are Y's", "Under these conditions, do this action", discrete distributions)

Types of Learning
• Unsupervised (no feedback)
• Semi-supervised
• Supervised (labelled examples)
• Reinforcement (feedback is reward)

Learning Functions

Function Learning
• There is some function y = f(x)
• We don't know f
• We want to learn a function h (a hypothesis) that approximates the true function f
• Learning is a search through the space of possible hypotheses for one that will perform well

Supervised Learning
• Given a training set of N example input-output pairs (x1, y1), (x2, y2), ..., (xN, yN), where each yj = f(xj)
• Discover a function h that approximates f

Training Data
x: 1  2  4  5  7
y: 3  6  12 15 21
h(x) = ?
[Plot: the five training points, x axis 0-10, y axis 0-25]

Evaluating Accuracy
• Training set: (x1, y1), (x2, y2), ..., (xN, yN)
• Test set: additional (xj, yj) pairs distinct from the training set
• Test the accuracy of h by comparing h(xj) to yj for (xj, yj) pairs from the test set

Training Data
x: 1  2  4  5  7
y: 3  6  12 15 21
h(x) = ?

Testing Data
x: 3  4  6
y: 9  12 18
f(x) = y; does h(x) = y?
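As a minimal sketch (not part of the slides), we can fit a straight-line hypothesis h(x) = mx + b to the training data above by ordinary least squares. On this data the fit recovers h(x) = 3x, which also predicts the held-out test pairs correctly:

```python
# Least-squares fit of h(x) = m*x + b to the training data from the slides,
# then evaluation on the held-out testing data.
train_x = [1, 2, 4, 5, 7]
train_y = [3, 6, 12, 15, 21]

x_bar = sum(train_x) / len(train_x)
y_bar = sum(train_y) / len(train_y)

# Slope and intercept from the standard least-squares formulas.
m = sum((x - x_bar) * (y - y_bar) for x, y in zip(train_x, train_y)) \
    / sum((x - x_bar) ** 2 for x in train_x)
b = y_bar - m * x_bar

def h(x):
    return m * x + b

# The hypothesis should generalize: h on the test inputs should match
# the test outputs (here, close to [9, 12, 18]).
test_x = [3, 4, 6]
test_y = [9, 12, 18]
print(m, b, [h(x) for x in test_x])
```

Because the training data is exactly linear, the fitted hypothesis coincides with the true f; on noisy data the fit would only approximate it.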
Generalization
• A hypothesis (function) generalizes well if it correctly predicts the value of y for novel examples x

Supervised Function Learning
• Search through the space of possible hypotheses (functions)
• For a hypothesis that generalizes well
• Using the training data for guidance
• Evaluating performance using the testing data

Hypothesis Space
• The class of functions that are acceptable as solutions, e.g.:
• Linear functions: y = mx + b
• Polynomials (of some degree), e.g. y = c7 x^7 + c6 x^6 + ... + c1 x + c0 = Σ_{i=0}^{7} ci x^i
• Java programs?
• Turing machines?
[Plots (a)-(d): the same data points fit by (a) a line y = -0.4x + 3; (b) a degree-7 polynomial y = c7 x^7 + ... + c1 x + c0; (c) a degree-6 polynomial y = c6 x^6 + c5 x^5 + ... + c1 x + c0 vs. a line y = mx + b; (d) y = ax + b + c sin(x)]

Occam's Razor
• William of Occam (or Ockham), 14th c.

Hypotheses
• There is a tradeoff between complex hypotheses that fit the training data and simpler hypotheses that may generalize better
• There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within it

Learning Decision Trees

Classification
• Output y = f(x) is one of a finite set of values (classes, categories, ...)
• Boolean classification: yes/no or true/false
• Input is a vector x of values for attributes
• Factored representation

Example
• Going out to dinner with Stuart Russell
• Restaurants are often busy in SF; sometimes you have to wait for a table
• Decision: Do we wait, or do something else?

Attributes
• Alternate: is there a suitable alternative nearby?
• Bar: does it have a comfy bar?
• Fri/Sat: is it a Friday or Saturday?
• Hungry: are we hungry?
• Patrons: None, Some, Full
• Price: $, $$, $$$
• Raining: is it raining outside?
• Reservation: do we have a reservation?
• Type: French, Italian, Thai, burger, ...
• WaitEstimate: 0-10, 10-30, 30-60, >60 (minutes)

Decision Making
• If the host/hostess says you'll have to wait:
• Then if there's no one in the restaurant, you don't want to be there either;
• But if there are a few people and it's not full, then you should wait;
• Otherwise you need to consider how long he/she told you the wait would be
• ...

Decision Tree
• Each node in the tree represents a test of a single attribute
• Children of the node are labelled with the possible values of the attribute
• Each path represents a series of tests, and the leaf node gives the value of the function when the input passes those tests

[Decision tree figure, repeated across several slides: root tests Patrons? (None → No; Some → Yes; Full → test WaitEstimate?). WaitEstimate?: >60 → No; 30-60 → test Alternate? (No → test Reservation?: No → test Bar? (No → No; Yes → Yes), Yes → Yes; Yes → test Fri/Sat?: No → No, Yes → Yes); 10-30 → test Hungry? (No → Yes; Yes → test Alternate?: No → Yes, Yes → test Raining?: No → No, Yes → Yes); 0-10 → Yes]
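A sketch of the path semantics (not from the slides): the decision tree above is just a nested conditional, where each test reads one attribute of the input and each return is a leaf.

```python
# A minimal sketch of the restaurant decision tree as nested conditionals.
# Attribute names and values follow the slides; the input `ex` is a dict
# from attribute name to value, and the result is True ("wait") or False.
def will_wait(ex):
    if ex["Patrons"] == "None":
        return False
    if ex["Patrons"] == "Some":
        return True
    # Patrons == "Full": consult the wait estimate.
    est = ex["WaitEstimate"]
    if est == ">60":
        return False
    if est == "0-10":
        return True
    if est == "10-30":
        if ex["Hungry"] == "No":
            return True
        if ex["Alternate"] == "No":
            return True
        return ex["Raining"] == "Yes"
    # est == "30-60"
    if ex["Alternate"] == "No":
        if ex["Reservation"] == "Yes":
            return True
        return ex["Bar"] == "Yes"
    return ex["FriSat"] == "Yes"

# A few paths through the tree:
print(will_wait({"Patrons": "Some"}))                          # wait
print(will_wait({"Patrons": "Full", "WaitEstimate": ">60"}))   # don't wait
```

Note that only the attributes on the path taken are ever examined, which is why a shallow tree classifies with few tests.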
Input Data

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
x1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | y1 = Yes
x2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | y2 = No
x3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | y3 = Yes
x4      | Yes | No  | Yes | Yes | Full | $     | Yes  | No  | Thai    | 10-30 | y4 = Yes
x5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | y5 = No
x6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | y6 = Yes
x7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | y7 = No
x8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | y8 = Yes
x9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | y9 = No
x10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | y10 = No
x11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | y11 = No
x12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | y12 = Yes

Inducing Decision Trees From Examples
• Examples: (x, y), where x is a vector of values for the input attributes and y is a single Boolean value (yes/no, true/false)

Inducing Decision Trees From Examples
• Examples: (x, y)
• Want a shallow tree (short paths, fewer tests)
• Greedy algorithm (AIMA Fig. 18.5)
• Always test the most important attribute first
• The one that makes the most difference to the classification of an example

[Figure (AIMA Fig. 18.4a): splitting the 12 examples on Type? French → {1, 5}; Italian → {6, 10}; Thai → {4, 8} vs. {2, 11}; Burger → {3, 12} vs. {7, 9}. Every branch is still an even mix of yes and no examples, so Type is a poor first test.]

[Figure (AIMA Fig. 18.4b): splitting on Patrons? None → {7, 11}, all No; Some → {1, 3, 6, 8}, all Yes; Full → {4, 12} vs. {2, 5, 9, 10}, then test Hungry? (No → {5, 9}; Yes → {4, 12} vs. {2, 10}). Patrons separates the examples well, so it is a good first test.]

Uncertainty
• Examples can have the same description in terms of the input attributes but different classifications
• Error or noise in the data
• Nondeterministic domain
• We can't observe the attribute that would distinguish the examples

[Figures: the tree learned from the 12 examples — Patrons? None → No; Some → Yes; Full → Hungry? (No → No; Yes → Type?: French → Yes, Italian → No, Burger → Yes, Thai → Fri/Sat?: No → No, Yes → Yes) — shown beside the original hand-built tree. The learned tree is substantially simpler.]

Extensions
• Missing data
• Multi-valued attributes
• Continuous and integer-valued input attributes
• Continuous-valued output attributes (regression)
• AIMA 18.3.6

Learning Decision Trees
• Each node tests a single attribute
• Child nodes for each possible value
• Each path represents a series of tests, and the leaf node gives the value of the function when the input passes those tests
• Test the most discriminating attributes first
• Gives a shallower tree

Evaluating Learning Mechanisms

Training Data
x: 1  2  4  5  7
y: 3  6  12 15 21
h(x) = ?

Testing Data
x: 3  4  6
y: 9  12 18
f(x) = y; does h(x) = y?
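The greedy tree-growing procedure described earlier (always split on the most important attribute, then recurse; AIMA Fig. 18.5 gives the full algorithm) can be sketched in Python. This is a sketch, not the book's code: it measures importance by information gain and, to keep the tree small, uses a reduced copy of the restaurant data with only the Patrons and Hungry attributes.

```python
import math

def entropy(examples):
    """Entropy (in bits) of the Boolean output over (attrs, y) pairs."""
    p = sum(1 for _, y in examples if y) / len(examples)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def importance(attr, examples):
    """Information gain from splitting the examples on attr."""
    remainder = 0.0
    for v in {a[attr] for a, _ in examples}:
        subset = [(a, y) for a, y in examples if a[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def plurality(examples):
    """Majority classification; ties break toward True in this sketch."""
    return sum(1 for _, y in examples if y) >= len(examples) / 2

def learn_tree(examples, attrs):
    """Greedy induction: leaf if pure or out of attributes, else best split.
    (Unlike AIMA's version, this sketch ignores empty branches.)"""
    ys = {y for _, y in examples}
    if len(ys) == 1:
        return ys.pop()
    if not attrs:
        return plurality(examples)
    best = max(attrs, key=lambda a: importance(a, examples))
    branches = {}
    for v in {a[best] for a, _ in examples}:
        subset = [(a, y) for a, y in examples if a[best] == v]
        branches[v] = learn_tree(subset, [a for a in attrs if a != best])
    return (best, branches)

# The 12 restaurant examples, reduced to Patrons and Hungry.
data = [
    ({"Patrons": "Some", "Hungry": "Yes"}, True),   # x1
    ({"Patrons": "Full", "Hungry": "Yes"}, False),  # x2
    ({"Patrons": "Some", "Hungry": "No"},  True),   # x3
    ({"Patrons": "Full", "Hungry": "Yes"}, True),   # x4
    ({"Patrons": "Full", "Hungry": "No"},  False),  # x5
    ({"Patrons": "Some", "Hungry": "Yes"}, True),   # x6
    ({"Patrons": "None", "Hungry": "No"},  False),  # x7
    ({"Patrons": "Some", "Hungry": "Yes"}, True),   # x8
    ({"Patrons": "Full", "Hungry": "No"},  False),  # x9
    ({"Patrons": "Full", "Hungry": "Yes"}, False),  # x10
    ({"Patrons": "None", "Hungry": "No"},  False),  # x11
    ({"Patrons": "Full", "Hungry": "Yes"}, True),   # x12
]

tree = learn_tree(data, ["Patrons", "Hungry"])
print(tree)
```

On this data the root test chosen is Patrons (its information gain beats Hungry's), with None → No, Some → Yes, and Full → a further Hungry test, matching the split shown in the figure.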
Evaluating Learning
• Split the data into a training set and a testing set
• Learn a hypothesis h using the training set and evaluate it on the testing set
• Repeat with training sets of size 1 up to size N-1

[Learning curve figure: proportion correct on the test set (y axis, 0.4-1.0) vs. training set size (x axis, 0-100)]

Error Rate
• Error rate: proportion of examples (x, y) for which h(x) ≠ y
• The complement of the proportion correct (accuracy): error rate = 1 - accuracy
• Need to evaluate the error rate on examples not used in training

Cross-Validation
• Randomly split the data into training and testing sets (in some proportion)
• Hold out the test data during training
• Doesn't use all the data for training

k-Fold Cross-Validation
• Divide the data into k equal subsets
• Perform k rounds of learning
• Each round, leave out one subset (1/k of the data) and use it for testing that round
• Average the test scores over the k rounds

When Learning Goes Wrong

Example
• Goal: learn a function that predicts whether a dice roll comes up 6
• Attributes: color of the die, its weight, the time the roll was performed, whether the roller had fingers crossed, ...
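The k-fold procedure above can be sketched generically (not from the slides): `learn` stands for any function from a training set to a hypothesis h, and the score is the 0/1 error rate just defined.

```python
# A sketch of k-fold cross-validation using the 0/1 error rate.

def error_rate(h, examples):
    """Proportion of examples (x, y) with h(x) != y; accuracy is 1 minus this."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def k_fold_cv(learn, examples, k):
    """Average test error over k rounds, each holding out one fold (1/k of data)."""
    folds = [examples[i::k] for i in range(k)]   # simple interleaved split
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        total += error_rate(learn(train), test)
    return total / k

# Toy check: data from f(x) = 3x, and a "learner" that ignores its training
# set and always returns that rule, so cross-validated error is 0.
data = [(x, 3 * x) for x in range(12)]
print(k_fold_cv(lambda train: (lambda x: 3 * x), data, k=4))  # -> 0.0
```

In practice each fold should be a random partition of the data; the interleaved split here is just the shortest thing that makes every example appear in exactly one test fold.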
• Data: attributes + the value shown on the die

Overfitting
• A learned model adjusts to random features of the input that have no causal relation to the target function
• Usually happens when a model has too many parameters relative to the number of training examples

[Plots (a)-(d): the hypothesis-space figures again — a line, a degree-7 polynomial, a degree-6 polynomial vs. a line, and ax + b + c sin(x)]

Overfitting
• Becomes more likely as the hypothesis space and the number of input attributes grow
• Becomes less likely as the number of training examples increases
• Overfitting ⇒ poor generalization

Learning (from Examples): Summary

Learning
• Improving one's performance on future tasks after making observations about the world

Types of Learning
• Unsupervised (no feedback)
• Semi-supervised
• Supervised (labelled examples)
• Reinforcement (feedback is reward)

Supervised Learning
• Given a training set of N example input-output pairs (x1, y1), (x2, y2), ..., (xN, yN), where each yj = f(xj)
• Discover a function h that approximates f

Evaluating Accuracy
• Training set: (x1, y1), (x2, y2), ..., (xN, yN)
• Test set: additional (xj, yj) pairs distinct from the training set
• Test the accuracy of h by comparing h(xj) to yj for (xj, yj) pairs from the test set

[Figures: the hand-built restaurant decision tree and the simpler tree learned from the examples]

• Learning lets the data speak for itself

Generalization and Overfitting
[Plots (a)-(d) again]

Hypotheses
• There is a tradeoff between complex hypotheses that fit the training data and simpler hypotheses that may generalize better
• There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within it

Project 4: Learning
• No programming!
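A sketch of overfitting on the dice example (the data below is hand-made for illustration, not from the slides): a rule fitted to an irrelevant attribute (the die's color) can score perfectly on a small training sample yet generalize worse than the trivial hypothesis "never a six," which is right about 5/6 of the time.

```python
from collections import Counter

# Overfitting sketch: the die's color has no causal relation to rolling a
# six, but in a small sample it can look predictive by chance.
COLORS = ["red", "green", "blue", "white"]

# Training rolls: (color, came up six?). By chance, both "red" rolls are sixes.
train = [("red", True), ("red", True), ("green", False), ("green", False),
         ("blue", False), ("blue", False), ("white", False), ("white", False)]

# A representative test sample: exactly one six per six rolls of each color.
test = [(c, i == 0) for c in COLORS for i in range(6)]

def fit_color_rule(examples):
    """'Learn' a per-color majority rule from the irrelevant color attribute."""
    by_color = {c: [y for x, y in examples if x == c] for c in COLORS}
    return lambda x: Counter(by_color[x]).most_common(1)[0][0]

def never_six(x):
    """The simple hypothesis: sixes are unlikely, so always predict False."""
    return False

def accuracy(h, data):
    return sum(h(x) == y for x, y in data) / len(data)

rule = fit_color_rule(train)
print("color rule: train", accuracy(rule, train), " test", accuracy(rule, test))
print("never six:  train", accuracy(never_six, train), " test", accuracy(never_six, test))
```

On this data the color rule fits the training rolls perfectly (accuracy 1.0 vs. 0.75 for "never six") but scores only 2/3 on the representative test sample, below the 5/6 of the simple hypothesis: the extra "fit" was to noise, not to the target function.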
• Do Homeworks 18 & 19 (this class & next)
• Due in BB Mon Apr 16, 11:59 PM
• Then work on posters for Tue Apr 24

For Next Time: AIMA 18.6; 18.7-18.9 (FYI)