Artificial Intelligence: Representation and Problem Solving
15-381
April 12, 2007
Decision Trees 2
20 questions
• Consider this game of 20 questions on the web: 20Q.net Inc.
Pick your poison
• How do you decide if a mushroom is edible?
• What's the best identification strategy?
• Let's try decision trees.

[Image: a "Death Cap" mushroom]
Some mushroom data (from the UCI machine learning repository)
 #   EDIBLE?     CAP-SHAPE  CAP-SURFACE  CAP-COLOR  ODOR   STALK-SHAPE  POPULATION  HABITAT
 1   edible      flat       fibrous      red        none   tapering     several     woods
 2   poisonous   convex     smooth       red        foul   tapering     several     paths
 3   edible      flat       fibrous      brown      none   tapering     abundant    grasses
 4   edible      convex     scaly        gray       none   tapering     several     woods
 5   poisonous   convex     smooth       red        foul   tapering     several     woods
 6   edible      convex     fibrous      gray       none   tapering     several     woods
 7   poisonous   flat       scaly        brown      fishy  tapering     several     leaves
 8   poisonous   flat       scaly        brown      spicy  tapering     several     leaves
 9   poisonous   convex     fibrous      yellow     foul   enlarging    several     paths
10   poisonous   convex     fibrous      yellow     foul   enlarging    several     woods
11   poisonous   flat       smooth       brown      spicy  tapering     several     woods
12   edible      convex     smooth       yellow     anise  tapering     several     woods
13   poisonous   knobbed    scaly        red        foul   tapering     several     leaves
14   poisonous   flat       smooth       brown      foul   tapering     several     leaves
15   poisonous   flat       fibrous      gray       foul   enlarging    several     woods
16   edible      sunken     fibrous      brown      none   enlarging    solitary    urban
17   poisonous   flat       smooth       brown      foul   tapering     several     woods
18   poisonous   convex     smooth       white      foul   tapering     scattered   urban
19   poisonous   flat       scaly        yellow     foul   enlarging    solitary    paths
20   edible      convex     fibrous      gray       none   tapering     several     woods

(... additional attributes and rows not shown)
An easy problem: two attributes provide most of the information
Poisonous: 44, Edible: 46

ODOR is almond, anise, or none?
  no  -> Poisonous: 43, Edible: 0   => POISONOUS
  yes -> Poisonous: 1,  Edible: 46
         SPORE-PRINT-COLOR is green?
           yes -> Poisonous: 1, Edible: 0   => POISONOUS
           no  -> Poisonous: 0, Edible: 46  => EDIBLE

100% classification accuracy on 100 examples.
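As a concrete illustration (not from the slides), the two questions in this tree can be written as a tiny Python function. The function name and the dictionary-per-example representation are assumptions made for the example.

```python
def classify_mushroom(example):
    """Apply the two-question decision tree above (illustrative only)."""
    # Split 1: anything without an almond, anise, or absent odor is poisonous.
    if example["ODOR"] not in ("almond", "anise", "none"):
        return "poisonous"
    # Split 2: a green spore print is poisonous even when the odor is benign.
    if example["SPORE-PRINT-COLOR"] == "green":
        return "poisonous"
    return "edible"

# Hypothetical example (spore-print color is not listed in the table above):
print(classify_mushroom({"ODOR": "foul", "SPORE-PRINT-COLOR": "brown"}))  # poisonous
```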
Same problem with no odor or spore-print-color
[Figure: a much larger decision tree learned from the remaining attributes.]

100% classification accuracy on 100 examples.

Pretty good, right? What if we go off hunting with this decision tree?

Performance on another set of 100 mushrooms: 80%. Why?
Not enough examples?
[Plot: % correct on the training set ("training error") and on another set of the same size ("testing error"), as a function of the number of training examples, from 0 to 2000.]

• Why is performance on the test set always lower than on the training set?

• Side example with both discrete and continuous attributes: predicting MPG ('GOOD' or 'BAD') from the attributes Cylinders, Horsepower, Acceleration, Maker (discrete), and Displacement.
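A hedged sketch of how such a learning curve could be generated with scikit-learn. The synthetic dataset stands in for the mushroom data; none of this code is from the slides.

```python
# Train/test accuracy as a function of training-set size (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for n in (100, 200, 400, 800, 1600, 2000):
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X_train[:n], y_train[:n])
    print(f"n={n:5d}  train={tree.score(X_train[:n], y_train[:n]):.2f}  "
          f"test={tree.score(X_test, y_test):.2f}")
# Training accuracy stays at (or near) 100%, while test accuracy levels off below it.
```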
The Overfitting Problem: Example

[Figure: 2D training data with attributes X1 and X2; class A points lie below X2 = 0.5, class B points above.]

• Suppose that, in an ideal world, class B is everything such that X2 >= 0.5 and class A is everything with X2 < 0.5.
• Note that attribute X1 is irrelevant.
• Seems like generating a decision tree would be trivial, right?
The following examples are from Prof Hebert.
The Overfitting Problem: Example

• But in the real world, our observations have variability; they can also be corrupted by noise. Thus, the observed pattern is more complex than it appears.
• However, we collect training examples from the perfect world through some imperfect observation device.
• As a result, the training data is corrupted by noise.
The Overfitting Problem: Example

• Because of the noise, the resulting decision tree is far more complicated than it should be.
• This is because the learning algorithm tries to classify all of the training set perfectly.
• This is a fundamental problem in learning and is called overfitting.
The Overfitting Problem: Example

• Callout on the figure: the tree classifies this point as 'A', but it won't generalize to new examples.
The Overfitting Problem: Example

• Callout on the figure: the problem started here. X1 is irrelevant to the underlying structure.
The Overfitting Problem: Example

• Is there a way to identify that splitting this node is not helpful?
• Idea: don't split when doing so would result in a tree that is too "complex".
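To make this concrete, here is an illustrative sketch (not from the lecture) of the toy problem: only X2 matters, a fraction of the labels are flipped by noise, and an unpruned decision tree fits the training set perfectly while generalizing worse.

```python
# Sketch: an unpruned tree memorizes label noise in the X1/X2 toy problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((400, 2))                    # columns: X1 (irrelevant) and X2
y = (X[:, 1] >= 0.5).astype(int)            # ideal world: class depends only on X2
noise = rng.random(400) < 0.1               # imperfect observation device
y_noisy = np.where(noise, 1 - y, y)

X_train, y_train = X[:200], y_noisy[:200]
X_test, y_test = X[200:], y_noisy[200:]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))   # 1.0 -- it fits the noise
print("test accuracy: ", tree.score(X_test, y_test))     # noticeably lower
print("leaves:        ", tree.get_n_leaves())             # far more than the 2 needed
```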
Addressing overfitting

• Grow a tree based on the training data; this yields an unpruned tree.
• Then prune nodes from the tree that are unhelpful.
• How do we know when this is the case? (A sketch follows below.)
  - Use additional data not used in training, i.e. test data.
  - Use a statistical significance test to see if extra nodes are different from noise.
  - Penalize the complexity of the tree.
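One way to combine the "held-out data" and "penalize complexity" ideas with an off-the-shelf library is scikit-learn's cost-complexity pruning. The sketch below is an illustration under those assumptions, not the lecture's exact procedure; the dataset is synthetic.

```python
# Sketch: grow an unpruned tree, then choose a complexity penalty (ccp_alpha)
# by accuracy on held-out validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate penalties come from the pruning path of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    acc = tree.score(X_val, y_val)          # evaluate on data not used for training
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"best ccp_alpha={best_alpha:.4f}, validation accuracy={best_acc:.3f}")
```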
Training Data

[Figure: the training data and the unpruned decision tree learned from it.]

[Figure: the training data with the partitions induced by the decision tree, alongside the unpruned decision tree from the training data. Notice the tiny regions at the top necessary to correctly classify the 'A' outliers!]
[Figure: the training data and the test data, with the unpruned decision tree learned from the training data. Performance (% correctly classified): Training: 100%, Test: 77.5%.]

[Figure: the training data and the test data, with a pruned decision tree learned from the training data. Performance (% correctly classified): Training: 95%, Test: 80%.]
[Figure: the training data and the test data, with a further-pruned decision tree learned from the training data. Performance (% correctly classified): Training: 80%, Test: 97.5%.]

[Plot: % of data correctly classified vs. size of decision tree, for the training set and the test set; the highlighted point is the tree with the best performance on the test set.]
General principle
• As its complexity increases, the model is able to better classify the training data.
• Performance on the test data initially increases, but then falls as the model overfits, i.e. becomes specialized for classifying the noise in the training data.
• The complexity of a decision tree is its number of free parameters, i.e. the number of nodes.
• General principle: as the complexity of the classifier increases (e.g. the depth of the decision tree), performance on the training data increases while performance on the test data eventually decreases, once the classifier overfits the training data.

[Plot: classification rate (% correct) vs. complexity of the model (e.g. size of tree). Classification performance on the training data keeps increasing, while classification performance on the test data peaks and then drops. In the marked region the tree overfits the training data (including the noise!) and starts doing poorly on the test data.]
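A brief sketch of this principle (illustrative; the dataset and the depth values swept are assumptions): limit the maximum tree depth and watch training and test accuracy diverge as depth grows.

```python
# Training vs. test accuracy as model complexity (tree depth) increases.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           flip_y=0.15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for depth in (1, 2, 4, 8, 16, None):        # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    print(f"max_depth={str(depth):>4}  "
          f"train={tree.score(X_train, y_train):.2f}  "
          f"test={tree.score(X_test, y_test):.2f}")
# Typically: training accuracy climbs toward 1.0, while test accuracy peaks at a
# moderate depth and then degrades as the tree starts fitting the noise.
```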
Strategies for avoiding overfitting: Pruning

• Avoiding overfitting is equivalent to achieving good generalization.
• All strategies need some way to control the complexity of the model.
• Pruning (a sketch follows below):
  - Construct the entire (standard) decision tree as before, but keep a test data set on which the model is not trained (also called validation data, or a hold-out data set).
  - Starting at the leaves, recursively eliminate splits: evaluate the performance of the tree on the test data, and prune a split if classification performance on the test data increases by removing it.

[Figure: tree (1) with a candidate split and tree (2) with that split removed; prune the node if classification performance on the test set is greater for (2) than for (1).]
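A minimal sketch of this reduced-error pruning procedure. The nested-dict tree representation (leaves carrying a 'label'; internal nodes carrying 'attr', 'thresh', 'left', 'right', and a stored 'majority' class) is an assumption made for the example, not a format used in the lecture.

```python
# Reduced-error pruning sketch on a toy tree representation (assumed format).
def classify(node, x):
    while "label" not in node:                       # walk down to a leaf
        node = node["left"] if x[node["attr"]] < node["thresh"] else node["right"]
    return node["label"]

def accuracy(root, data):
    return sum(classify(root, x) == y for x, y in data) / len(data)

def prune(node, root, val_data):
    """Collapse a split into its majority-class leaf whenever that improves
    accuracy on the held-out validation data; recurse bottom-up."""
    if "label" in node:
        return
    prune(node["left"], root, val_data)
    prune(node["right"], root, val_data)
    before = accuracy(root, val_data)
    saved = dict(node)                               # remember the split
    node.clear()
    node["label"] = saved["majority"]                # tentatively make it a leaf
    if accuracy(root, val_data) <= before:           # no improvement: keep the split
        node.clear()
        node.update(saved)
```

The split is removed only when doing so strictly improves accuracy on the held-out data, which matches the rule in the figure above.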
Strategies for avoiding overfitting: Statistical significance tests
• For each split, ask if there is a significant increase in the information gain. The problem is that the information gain always increases when we split; we want to know whether the change in entropy is significant.
• If we're splitting noise, then the data are random.
• Reasoning: what proportion of the data go to the left node?

  pL = (NAL + NBL) / (NA + NB) = 5/9

• Suppose the data at the node were randomly distributed; would it make sense to split? If the data were random, how many would we expect to go to the left?

  Expected counts: N'AL = NA x pL = 10/9 and N'BL = NB x pL = 35/9

• In the example: the number of class A in the root node is NA = 2, the number of class B in the root node is NB = 7, the number of class A in the left node is NAL = 1, and the number of class B in the left node is NBL = 4.
• Question: are the observed counts (NAL, NBL) statistically significantly different from the expected counts (N'AL, N'BL)? If not, the children are not significantly different from what we would get from a random distribution at the root, so don't split!
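A quick numeric check of these quantities (plain Python; the variable names are just for the example):

```python
# Observed counts from the slide's example.
N_A, N_B = 2, 7            # class A / class B at the node being split
N_AL, N_BL = 1, 4          # class A / class B that went to the left child

p_L = (N_AL + N_BL) / (N_A + N_B)   # proportion of data going left: 5/9
exp_AL = N_A * p_L                  # expected class-A count on the left: 10/9
exp_BL = N_B * p_L                  # expected class-B count on the left: 35/9
print(p_L, exp_AL, exp_BL)          # 0.5555..., 1.1111..., 3.8888...
```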
Detecting statistically significant splits
• A measure of statistical significance:

  K = (NAL - N'AL)^2 / N'AL + (NBL - N'BL)^2 / N'BL + (NAR - N'AR)^2 / N'AR + (NBR - N'BR)^2 / N'BR

• K measures how much the split deviates from what we would expect from random data.
• If K is small, the information gain from the split is not significant.
• Here (recall NA = 2, NB = 7, NAL = 1, NBL = 4):

  K = (1 - 10/9)^2 / (10/9) + (4 - 35/9)^2 / (35/9) + ... = 0.0321
"χ² criterion": general case

• Small chi-square values imply low statistical significance.
• Nodes that have K smaller than a threshold are pruned.
• The threshold regulates the complexity of the model:
  - Low thresholds allow larger trees and more overfitting.
  - High thresholds keep trees small but may sacrifice performance.

[Figure: a node with N data points is split into a left child with NL data points (proportion pL) and a right child with NR data points (proportion pR).]

  K = sum over all classes i and all children j of (Nij - N'ij)^2 / N'ij

  where Nij  = number of points from class i in child j,
        N'ij = number of points from class i in child j assuming random selection, i.e. N'ij = Ni x pj.
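A sketch of this computation in general form. The function name, the counts-as-nested-lists layout, and the use of the 95% χ² value with (classes - 1) x (children - 1) = 1 degree of freedom for the two-class, two-child example are assumptions made for the illustration.

```python
# Chi-square-style statistic K for a candidate split, following the formula above.
from scipy.stats import chi2

def split_statistic(counts):
    """counts[i][j] = number of points from class i in child j."""
    class_totals = [sum(row) for row in counts]            # N_i
    child_totals = [sum(col) for col in zip(*counts)]      # points in each child
    n = sum(class_totals)
    k = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            expected = class_totals[i] * child_totals[j] / n   # N'_ij = N_i * p_j
            k += (n_ij - expected) ** 2 / expected
    return k

# Toy example from the previous slides: class A -> [left=1, right=1],
# class B -> [left=4, right=3].
K = split_statistic([[1, 1], [4, 3]])
print(round(K, 4))                      # ~0.0321, as on the slide

# One possible threshold: the 95% point of the chi-square distribution with
# 1 degree of freedom (roughly 3.84) -- an assumption for this 2x2 case.
print(K > chi2.ppf(0.95, df=1))         # False -> the split is not significant
```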
Illustration on our toy problem
[Figure: the unpruned tree for the toy problem, annotated with the K value of each split: K = 10.58, K = 0.0321, K = 0.83. The gains obtained by the low-K splits are not significant.]
Illustration on our toy problem
• By thresholding K we end up with the decision tree that we would expect, i.e. only one split, which does not overfit the data.
• Note: the approach is presented here with continuous attributes, but it works just as well with discrete attributes.

χ² Pruning
• The test on K is a version of a standard statistical test, the χ² ('chi-square') test.
• The threshold t is retrieved from statistical tables. For example, K > t means that, with confidence 95%, the information gain due to the split is significant.
• If K < t, then with high confidence the information gain will be 0 over very large training samples.
• Reduces overfitting.
• Eliminates irrelevant attributes.

A real example: Fisher's Iris data
• three classes of irises
• four attributes
Class        Sepal Length (SL)   Sepal Width (SW)   Petal Length (PL)   Petal Width (PW)
Setosa       5.1                 3.5                1.4                 0.2
Setosa       4.9                 3.0                1.4                 0.2
Setosa       5.4                 3.9                1.7                 0.4
Versicolor   5.2                 2.7                3.9                 1.4
Versicolor   5.0                 2.0                3.5                 1.0
Versicolor   6.0                 2.2                4.0                 1.0
Virginica    6.4                 2.8                5.6                 2.1
Virginica    7.2                 3.0                5.8                 1.6

50 examples from each class.
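For reference, a minimal scikit-learn sketch on the same dataset (not the lecture's code): it fits an unpruned tree and a depth-limited one and compares them on a held-out split.

```python
# Decision trees on Fisher's Iris data: unpruned vs. depth-limited.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_train, y_train)
small = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                               random_state=0).fit(X_train, y_train)

for name, tree in (("unpruned", full), ("depth 2", small)):
    print(f"{name:9s} leaves={tree.get_n_leaves():2d}  "
          f"train={tree.score(X_train, y_train):.2f}  "
          f"test={tree.score(X_test, y_test):.2f}")
```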
Full (unpruned) decision tree

[Figure: the full decision tree learned from the iris data.]
The scatter plot of the data with decision boundaries

[Plot: petal width (PW) vs. petal length (PL) for the iris data, with the decision boundaries induced by the full tree.]
Tree statistics
Pruning one level

[Figure: the iris decision tree after pruning one level.]
Pruning two levels

[Figure: the iris decision tree after pruning two levels.]
The tree with pruned decision boundaries

[Plot: petal width (PW) vs. petal length (PL) for the iris data, with the decision boundaries of the pruned tree.]
Recap: What you should understand
• Learning is fitting models (estimating their parameters) from data.
• The complexity of a model (related but not identical to the number of parameters) determines how well it can fit the data.
• If there are insufficient data relative to the complexity, the model will exhibit poor generalization, i.e. it will overfit the data.
• To avoid this, learning algorithms divide examples into training and testing data.
• The goal of learning is to achieve good predictions/classifications for novel data, i.e. good generalization.
• Decision Trees:
  - a simple hierarchical approach to classification
  - the goal is to achieve the best classification with a minimal number of decisions
  - works with binary, categorical, or continuous data
  - information gain is a useful splitting strategy (there are many others)
  - the decision tree is built recursively
  - it can be pruned to reduce the problem of overfitting