CSC242: Intro to AI Lecture 20

Learning
(from Examples)
Learning
• “... agents that can improve their behavior
through diligent study of their own
experiences”
• Improving one’s performance on future
tasks after making observations about the
world
Why Learn?
• Can’t anticipate all possible situations that
the agent might find itself in
• Cannot anticipate all changes that might
occur over time
• Don’t know how to program it other than
by learning!
What To Learn?
• Inferring properties of the world (state)
from perceptions
• Choosing which action to do in a given
state
• Results of possible actions
• How the world evolves in time
• Utilities of states
What To Learn?
• Numbers (e.g., probabilities)
• Functions (e.g., continuous distributions,
functional relations between variables)
• Rules (e.g., “All (most) X’s are Y’s”, “Under
these conditions, do this action”, discrete
distributions)
Types of Learning
Unsupervised
(no feedback)
Semi-supervised
Supervised
(labelled examples)
Reinforcement
(feedback is reward)
Learning Functions
Function Learning
• There is some function y = f(x)
• We don’t know f
• We want to learn a function h that
approximates the true function f
• Learning is a search through the space of
possible hypotheses for one that will
perform well
Supervised Learning
• Given a training set of N example input-output pairs:
(x1, y1), (x2, y2), ..., (xN, yN)
where each yj = f(xj)
• Discover function h that approximates f
Training Data

  x:  1   2   4   5   7
  y:  3   6  12  15  21

h(x) = ?

[Scatter plot of the five training points; x axis 0-10, y axis 0-25]
Evaluating Accuracy
• Training set: (x1, y1), (x2, y2), ..., (xN, yN)
• Test set: additional (xj, yj) pairs distinct
from the training set
• Test the accuracy of h by comparing h(xj) to
yj for (xj, yj) pairs from the test set
Training Data

  x:  1   2   4   5   7
  y:  3   6  12  15  21

h(x) = ?

Testing Data

  x:  3   4   6
  y:  9  12  18

f(x) = y
h(x) = y?
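The tables above can be made concrete with a small sketch (the hypothesis h(x) = 3x is a guess read off the table, not given in the slides):

```python
# Training and testing pairs from the slides; each pair is (x, y) with y = f(x).
training = [(1, 3), (2, 6), (4, 12), (5, 15), (7, 21)]
testing = [(3, 9), (4, 12), (6, 18)]

def h(x):
    """Candidate hypothesis: every training y is 3 times its x."""
    return 3 * x

# h reproduces every training example ...
assert all(h(x) == y for x, y in training)
# ... and also predicts the held-out testing examples correctly,
# so it generalizes well on this data.
print(all(h(x) == y for x, y in testing))  # True
```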
Generalization
• A hypothesis (function) generalizes well if it
correctly predicts the value of y for novel
examples x
Supervised Function
Learning
• Search through the space of possible
hypotheses (functions)
• For a hypothesis that generalizes well
• Using the training data for guidance
• Evaluating performance using the testing
data
Hypothesis Space
• The class of functions that are acceptable
as solutions, e.g.
• Linear functions y = mx + b
• Polynomials (of some degree)
• Java programs?
• Turing machines?
[Curve fitting (AIMA Fig. 18.1), four panels over axes x, f(x):
(a) a linear fit y = -0.4x + 3
(b) a degree-7 polynomial y = c7 x^7 + c6 x^6 + ... + c1 x + c0 = Σ_{i=0}^{7} ci x^i fitting the same points exactly
(c) a degree-6 polynomial y = c6 x^6 + c5 x^5 + ... + c1 x + c0
(d) y = ax + b + c sin(x), alongside the line y = mx + b]
Occam’s Razor
William of Occam (or Ockham)
14th c.
Hypotheses
• There is a tradeoff between complex
hypotheses that fit the training data and
simpler hypotheses that may generalize
better
• There is a tradeoff between the
expressiveness of a hypothesis space and
the complexity of finding a good hypothesis
within it
Learning Decision Trees
Classification
• Output y = f(x) is one of a finite set of
values (classes, categories, ...)
• Boolean classification: yes/no or true/false
• Input is vector x of values for attributes
• Factored representation
Example
• Going out to dinner with Stuart Russell
• Restaurants often busy in SF; sometimes
have to wait for a table
• Decision: Do we wait or do something
else?
Attributes
Alternate: is there a suitable alternative nearby?
Bar: does it have a comfy bar?
FriSat: is it a Friday or Saturday?
Hungry: are we hungry?
Patrons: None, Some, Full
Price: $, $$, $$$
Raining: is it raining outside?
Reservation: do we have a reservation?
Type: French, Italian, Thai, burger, ...
WaitEstimate: 0-10, 10-30, 30-60, >60
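A factored input over these attributes might be written as follows (a sketch; the Python names are mine, the values are example x1 from the table later in the lecture):

```python
# One restaurant situation as a factored representation:
# a mapping from attribute names to values.
example_x1 = {
    "Alternate": "Yes", "Bar": "No", "FriSat": "No", "Hungry": "Yes",
    "Patrons": "Some", "Price": "$$$", "Raining": "No",
    "Reservation": "Yes", "Type": "French", "WaitEstimate": "0-10",
}
label_y1 = "Yes"  # WillWait: the output value we want to predict

print(example_x1["Patrons"])  # Some
```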
Decision Making
• If the host/hostess says you’ll have to wait:
• Then if there’s no one in the restaurant
you don’t want to be there either;
• But if there are a few people but it’s not
full, then you should wait
• Otherwise you need to consider how
long he/she told you the wait would be
• ...
[Decision tree for the restaurant problem (AIMA Fig. 18.2):
Patrons?
  None → No
  Some → Yes
  Full → WaitEstimate?
    >60 → No
    30-60 → Alternate?
      No → Reservation?
        No → Bar? (No → No; Yes → Yes)
        Yes → Yes
      Yes → Fri/Sat? (No → No; Yes → Yes)
    10-30 → Hungry?
      No → Yes
      Yes → Alternate?
        No → Yes
        Yes → Raining? (No → No; Yes → Yes)
    0-10 → Yes]
Decision Tree
• Each node in the tree represents a test on
a single attribute
• Children of the node are labelled with the
possible values of the attribute
• Each path represents a series of tests, and
the leaf node gives the value of the function
when the input passes those tests
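One minimal way to realize this structure in code (a sketch, not AIMA's implementation; only a fragment of the restaurant tree, with the Full branch collapsed to a single Hungry? test for brevity):

```python
# A decision tree as nested data: an internal node is a pair
# (attribute, {value: subtree}); a leaf is the classification itself.
tree = ("Patrons", {
    "None": "No",
    "Some": "Yes",
    "Full": ("Hungry", {
        "No": "No",
        "Yes": "Yes",   # abbreviated: the full tree tests more attributes here
    }),
})

def classify(tree, example):
    """Follow one path of attribute tests from the root down to a leaf."""
    if isinstance(tree, str):        # leaf: the value of the function
        return tree
    attribute, branches = tree
    return classify(branches[example[attribute]], example)

print(classify(tree, {"Patrons": "Some"}))                  # Yes
print(classify(tree, {"Patrons": "Full", "Hungry": "No"}))  # No
```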
Input Attributes

       Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
  x1   Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   y1 = yes
  x2   Yes  No   No   Yes  Full  $      No    No   Thai     30-60  y2 = no
  x3   No   Yes  No   No   Some  $      No    No   Burger   0-10   y3 = yes
  x4   Yes  No   Yes  Yes  Full  $      Yes   No   Thai     10-30  y4 = yes
  x5   Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    y5 = no
  x6   No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   y6 = yes
  x7   No   Yes  No   No   None  $      Yes   No   Burger   0-10   y7 = no
  x8   No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   y8 = yes
  x9   No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    y9 = no
  x10  Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  y10 = no
  x11  No   No   No   No   None  $      No    No   Thai     0-10   y11 = no
  x12  Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  y12 = yes
Inducing Decision Trees
From Examples
• Examples: (x,y) where x is a vector of
values for the input attributes and y is a
single Boolean value (yes/no, true/false)
Inducing Decision Trees
From Examples
• Examples: (x,y)
• Want a shallow tree (short paths, fewer tests)
• Greedy algorithm (AIMA Fig 18.5)
• Always test the most important attribute
first
• Makes the most difference to classification of
an example
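"Most important" is standardly made precise as the attribute with the highest information gain (AIMA 18.3.4). A sketch of that computation, using only the Patrons and Type attributes of the 12 restaurant examples:

```python
from math import log2

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(examples, attribute):
    """Expected entropy reduction from testing `attribute`.
    `examples` is a list of (attribute-dict, boolean-label) pairs."""
    p = sum(1 for _, y in examples if y)
    total = B(p / len(examples))
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == v]
        pk = sum(1 for _, y in subset if y)
        remainder += len(subset) / len(examples) * B(pk / len(subset))
    return total - remainder

# The 12 examples, reduced to the two attributes compared here.
examples = [
    ({"Patrons": "Some", "Type": "French"}, True),   # x1
    ({"Patrons": "Full", "Type": "Thai"}, False),    # x2
    ({"Patrons": "Some", "Type": "Burger"}, True),   # x3
    ({"Patrons": "Full", "Type": "Thai"}, True),     # x4
    ({"Patrons": "Full", "Type": "French"}, False),  # x5
    ({"Patrons": "Some", "Type": "Italian"}, True),  # x6
    ({"Patrons": "None", "Type": "Burger"}, False),  # x7
    ({"Patrons": "Some", "Type": "Thai"}, True),     # x8
    ({"Patrons": "Full", "Type": "Burger"}, False),  # x9
    ({"Patrons": "Full", "Type": "Italian"}, False), # x10
    ({"Patrons": "None", "Type": "Thai"}, False),    # x11
    ({"Patrons": "Full", "Type": "Burger"}, True),   # x12
]

print(round(gain(examples, "Patrons"), 3))  # 0.541 — very informative
print(round(gain(examples, "Type"), 3))     # 0.0   — tells us nothing
```

This is why the greedy algorithm tests Patrons first: splitting on Type leaves every branch as mixed as the original set.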
[Splitting on Type? (AIMA Fig. 18.4(a)): the branches French {1, 5}, Italian {6, 10}, Thai {2, 4, 8, 11}, and Burger {3, 7, 9, 12} each contain a mix of positive and negative examples, so Type? is a poor attribute to test first.]
[Splitting on Patrons? (AIMA Fig. 18.4(b)): None {7, 11} → all No; Some {1, 3, 6, 8} → all Yes; Full {2, 4, 5, 9, 10, 12} is mixed, so only that branch needs a further test, e.g. Hungry? (No → {5, 9}, all No; Yes → {4, 12} positive vs. {2, 10} negative, still mixed).]
Uncertainty
• Examples have same description in terms of
input attributes but different classification
results
• Error or noise in the data
• Nondeterministic domain
• We can’t observe the attribute that
would distinguish the examples
[Tree learned from the 12 examples (AIMA Fig. 18.6):
Patrons?
  None → No
  Some → Yes
  Full → Hungry?
    No → No
    Yes → Type?
      French → Yes
      Italian → No
      Thai → Fri/Sat? (No → No; Yes → Yes)
      Burger → Yes
The learned tree is noticeably simpler than the original hand-built tree, yet it is consistent with all 12 training examples.]
Extensions
• Missing data
• Multi-valued attributes
• Continuous and integer-valued attributes
• Continuous-valued output attributes
• Regression
• AIMA 18.3.6
Learning Decision Trees
• Each node tests a single attribute
• Child nodes for each possible value
• Each path represents a series of tests, and
the leaf node gives the value of the function
when the input passes those tests
• Test most discriminating attributes first
• Shallower tree
Evaluating Learning
Mechanisms
Training Data

  x:  1   2   4   5   7
  y:  3   6  12  15  21

h(x) = ?

Testing Data

  x:  3   4   6
  y:  9  12  18

f(x) = y
h(x) = y?
Evaluating Learning
• Split data into training set and testing set
• Learn a hypothesis h using the training set
and evaluate it on the testing set
• Start with training set of size 1 up to size
N-1
Learning Curve

[Plot: proportion correct on the test set (0.4-1.0) vs. training set size (0-100); accuracy climbs as the training set grows]
Error Rate
• Error rate: proportion of examples (x, y)
for which h(x) ≠ y
• Complement of the proportion correct:
error rate = 1 − accuracy
• Need to evaluate error rate on examples
not used in training
Cross-Validation
• Randomly split data into training and
testing (in some proportion)
• Hold out test data during training
• Doesn’t use all data for training
k-Fold Cross-Validation
• Divide data into k equal subsets
• Perform k rounds of learning
• Leave out 1 subset (1/k of the data) each
round; use for testing that round
• Average test scores over k rounds
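The rounds above can be sketched as follows (the majority-class "learner" is a toy stand-in, not from the slides, just to make the procedure runnable):

```python
import random

def k_fold_scores(data, k, train, test_score):
    """Split `data` into k folds; each round trains on k-1 folds and
    scores on the held-out fold. Returns the k per-round test scores."""
    data = data[:]                  # don't mutate the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, f in enumerate(folds) if j != i for ex in f]
        h = train(training)
        scores.append(test_score(h, held_out))
    return scores

# Toy learner: always predict the majority label of the training set.
def train_majority(examples):
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

def accuracy(h, examples):
    return sum(1 for _, y in examples if y == h) / len(examples)

random.seed(0)  # deterministic shuffle for this demo
data = [(i, "yes") for i in range(8)] + [(i, "no") for i in range(4)]
scores = k_fold_scores(data, 3, train_majority, accuracy)
print(sum(scores) / len(scores))  # average accuracy over the 3 rounds
```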
When Learning Goes
Wrong
Example
• Goal: Learn a function that will predict
whether a die roll comes up 6
• Attributes: color of die, weight, time when
roll was performed, whether roller had
fingers crossed, ...
• Data: attributes + value on die
Overfitting
• A learned model adjusts to random
features of the input that have no causal
relation to the target function
• Usually happens when a model has too
many parameters relative to the number of
training examples
[The four curve-fitting panels (a)-(d) over axes x, f(x) from AIMA Fig. 18.1 again: more complex hypotheses fit the training points more closely]
Overfitting
• Becomes more likely as the hypothesis
space and the number of input attributes grow
• Becomes less likely as the number of
training examples increases
• Overfitting => Poor generalization
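A self-contained illustration with invented data (roughly y = 2x plus noise; all numbers below are mine, not from the slides): a degree-4 polynomial that interpolates all five training points exactly, versus a simple least-squares line.

```python
# Invented noisy data: underlying function is roughly y = 2x.
xs = [0, 1, 2, 3, 4]
ys = [0.0, 2.5, 3.5, 6.5, 7.5]

def line_fit(xs, ys):
    """Least-squares line: the 'simple' hypothesis."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - m * mx
    return lambda x: m * x + b

def interpolant(xs, ys):
    """Degree-4 Lagrange interpolant: fits all five points exactly."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

simple, complex_h = line_fit(xs, ys), interpolant(xs, ys)

# The complex hypothesis has zero training error ...
assert all(abs(complex_h(x) - y) < 1e-9 for x, y in zip(xs, ys))
# ... but at the unseen point x = 6 (true value about 12) it has chased
# the noise and is wildly wrong, while the line generalizes well.
print(round(simple(6), 1))     # 11.6
print(round(complex_h(6), 1))  # -50.0
```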
Learning
(from Examples)
Summary
Learning
• Improving one’s performance on future
tasks after making observations about the
world
Types of Learning
Unsupervised
(no feedback)
Semi-supervised
Supervised
(labelled examples)
Reinforcement
(feedback is reward)
Supervised Learning
• Given a training set of N example input-output pairs:
(x1, y1), (x2, y2), ..., (xN, yN)
where each yj = f(xj)
• Discover function h that approximates f
Evaluating Accuracy
• Training set: (x1, y1), (x2, y2), ..., (xN, yN)
• Test set: additional (xj, yj) pairs distinct
from the training set
• Test the accuracy of h by comparing h(xj) to
yj for (xj, yj) pairs from the test set
[The hand-built restaurant tree (AIMA Fig. 18.2) and the simpler learned tree (AIMA Fig. 18.6), repeated from earlier slides]
Learning lets the data speak for itself
Generalization and
Overfitting

[Curve-fitting panels (a)-(d) from AIMA Fig. 18.1, repeated]
Hypotheses
• There is a tradeoff between complex
hypotheses that fit the training data and
simpler hypotheses that may generalize
better
• There is a tradeoff between the
expressiveness of a hypothesis space and
the complexity of finding a good hypothesis
within it
Project 4: Learning
No programming!
Do homeworks 18 & 19 (this class & next)
Due in BB Mon Apr 16 11:59PM
Then work on posters for Tue Apr 24
For Next Time:
AIMA 18.6; 18.7-18.9 fyi