Decision tree learning

CS B551: DECISION TREES
AGENDA

• Decision trees
• Complexity
• Learning curves
• Combatting overfitting
• Boosting
RECAP

• Still in supervised setting with logical attributes
• Find a representation of CONCEPT in the form:
      CONCEPT(x) ⇔ S(A,B,…)
  where S(A,B,…) is a sentence built with the observable attributes, e.g.:
      CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))
PREDICATE AS A DECISION TREE
The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the
following decision tree:

    A?
    ├─ False: False
    └─ True:  B?
              ├─ False: True
              └─ True:  C?
                        ├─ False: False
                        └─ True:  True

Example:
• A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
PREDICATE AS A DECISION TREE
The same predicate, CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)), is represented by the same
decision tree; the example now includes two additional observable attributes that
do not appear in the concept:
• D = FUNNEL-CAP
• E = BULKY
TRAINING SET
Ex.#   A      B      C      D      E      CONCEPT
 1     False  False  True   False  True   False
 2     False  True   False  False  False  False
 3     False  True   True   True   True   False
 4     False  False  True   False  False  False
 5     False  False  False  True   True   False
 6     True   False  True   False  False  True
 7     True   False  False  True   False  True
 8     True   False  True   False  True   True
 9     True   True   True   False  True   True
10     True   True   True   True   True   True
11     True   True   False  False  False  False
12     True   True   False  False  True   False
13     True   False  True   True   True   True
POSSIBLE DECISION TREE
[Figure: a decision tree consistent with all 13 training examples, with root D and
further tests on E, A, C, and B along its branches (the corresponding formula is
spelled out on the next slide). The training set table is repeated beside the tree.]
POSSIBLE DECISION TREE
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

compared with the target concept

CONCEPT ⇔ A ∧ (¬B ∨ C)

[Figure: the large tree rooted at D shown next to the small tree rooted at A.]
POSSIBLE DECISION TREE
CONCEPT ⇔ (D ∧ (¬E ∨ A)) ∨ (¬D ∧ (C ∧ (B ∨ (¬B ∧ ((E ∧ A) ∨ (¬E ∧ A))))))

CONCEPT ⇔ A ∧ (¬B ∨ C)

KIS bias ⇒ build the smallest decision tree

Computationally intractable problem ⇒ greedy algorithm
TOP-DOWN INDUCTION OF A DT

DTL(D, Predicates)
1. If all examples in D are positive then return True
2. If all examples in D are negative then return False
3. If Predicates is empty then return majority rule
4. A ← error-minimizing predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(D+A, Predicates−A),
   - right branch is DTL(D−A, Predicates−A)
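A minimal Python sketch of DTL, under a few assumptions not in the slides: examples
are (attribute-dict, boolean label) pairs, predicates are attribute names, and the
"error-minimizing predicate" of step 4 is scored by counting the misclassifications
of a majority-rule split.

    from collections import Counter

    def majority(examples):
        """Majority label (True/False) among the examples."""
        return Counter(label for _, label in examples).most_common(1)[0][0]

    def split_errors(examples, attr):
        """# of examples misclassified if we split on attr and predict the
        majority label on each branch."""
        errors = 0
        for wanted in (True, False):
            branch = [ex for ex in examples if ex[0][attr] == wanted]
            if branch:
                maj = majority(branch)
                errors += sum(1 for _, label in branch if label != maj)
        return errors

    def DTL(examples, predicates, default=False):
        """Top-down induction of a decision tree.
        Leaves are True/False; inner nodes are tuples (attr, true_branch, false_branch)."""
        if not examples:
            return default
        labels = {label for _, label in examples}
        if labels == {True}:                 # 1. all examples positive
            return True
        if labels == {False}:                # 2. all examples negative
            return False
        if not predicates:                   # 3. no predicates left -> majority rule
            return majority(examples)
        # 4. pick the error-minimizing predicate
        A = min(predicates, key=lambda a: split_errors(examples, a))
        rest = [p for p in predicates if p != A]
        maj = majority(examples)
        # 5. recurse on the examples where A holds (D+A) and where it does not (D-A)
        true_branch  = DTL([ex for ex in examples if ex[0][A]], rest, maj)
        false_branch = DTL([ex for ex in examples if not ex[0][A]], rest, maj)
        return (A, true_branch, false_branch)

Later slides replace the error count in step 4 with information gain.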
LEARNABLE CONCEPTS

• Some simple concepts cannot be represented compactly in DTs
  • Parity(x) = X1 xor X2 xor … xor Xn
  • Majority(x) = 1 if most of the Xi's are 1, 0 otherwise
• Exponential size in the # of attributes
• Need an exponential # of examples to learn exactly
• The ease of learning is dependent on shrewdly (or luckily) chosen attributes
  that correlate with CONCEPT
PERFORMANCE ISSUES
• Assessing performance:
  • Training set and test set
  • Learning curve: % correct on the test set as a function of the size of the
    training set
    [Figure: typical learning curve, rising toward 100% as the training set grows]
  • Some concepts are unrealizable within a machine's capacity
• Overfitting
  • Risk of using irrelevant observable predicates to generate a hypothesis that
    agrees with all examples in the training set
  • Tree pruning: terminate recursion when # errors / information gain is small
  • The resulting decision tree + majority rule may not classify correctly all
    examples in the training set
• Incorrect examples
• Missing data
• Multi-valued and continuous attributes
USING INFORMATION THEORY
• Rather than minimizing the probability of error, minimize the expected number
  of questions needed to decide whether an object x satisfies CONCEPT
• Use the information-theoretic quantity known as information gain
• Split on the variable with the highest information gain (a sketch follows the
  next slide)
ENTROPY / INFORMATION GAIN

• Entropy: encodes the quantity of uncertainty in a random variable
  • H(X) = −Σ_{x∈Val(X)} P(x) log P(x)
  • Properties:
    • H(X) = 0 if X is known, i.e. P(x) = 1 for some value x
    • H(X) > 0 if X is not known with certainty
    • H(X) is maximal if P(X) is the uniform distribution
• Information gain: measures the reduction in uncertainty in X given knowledge
  of Y
  • I(X,Y) = E_y[H(X) − H(X|Y)] = Σ_y P(y) Σ_x [P(x|y) log P(x|y) − P(x) log P(x)]
  • Properties:
    • Always nonnegative
    • = 0 if X and Y are independent
    • If Y is a choice, maximizing IG ⇒ minimizing E_y[H(X|Y)]
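A small Python sketch of these two quantities for boolean attributes and labels,
using the same (attribute-dict, label) example representation assumed in the DTL
sketch above (log base 2, so entropy is measured in bits):

    import math

    def entropy(probs):
        """H(X) = -sum_x P(x) log2 P(x); terms with P(x) = 0 contribute 0."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def label_distribution(examples):
        """[P(CONCEPT=True), P(CONCEPT=False)] over the given examples."""
        if not examples:
            return []
        pos = sum(1 for _, label in examples if label)
        return [pos / len(examples), 1 - pos / len(examples)]

    def information_gain(examples, attr):
        """I(CONCEPT, attr) = H(CONCEPT) - E_attr[H(CONCEPT | attr)]."""
        h_before = entropy(label_distribution(examples))
        h_after = 0.0
        for wanted in (True, False):
            branch = [ex for ex in examples if ex[0][attr] == wanted]
            h_after += len(branch) / len(examples) * entropy(label_distribution(branch))
        return h_before - h_after

With this in hand, step 4 of DTL becomes
A = max(predicates, key=lambda a: information_gain(examples, a)).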
MAXIMIZING IG / MINIMIZING CONDITIONAL ENTROPY IN DECISION TREES

E_y[H(X|Y)] = −Σ_y P(y) Σ_x P(x|y) log P(x|y)

• Let n be the # of examples
• Let n+, n− be the # of examples on the True/False branches of Y
• Let p+, p− be the accuracy on the True/False branches of Y
• P(correct) = (p+ n+ + p− n−) / n
• P(correct|Y) = p+, P(correct|¬Y) = p−

E_y[H(X|Y)] ∝ −( n+ [p+ log p+ + (1−p+) log(1−p+)] + n− [p− log p− + (1−p−) log(1−p−)] )
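A quick sketch of this quantity computed from the branch counts and accuracies
(h2 is binary entropy; dividing by n does not change which attribute minimizes it,
so it is omitted):

    import math

    def h2(p):
        """Binary entropy H(p) in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def split_conditional_entropy(n_plus, p_plus, n_minus, p_minus):
        """n * E_y[H(X|Y)] for a boolean test Y with n_plus / n_minus examples
        and accuracies p_plus / p_minus on its True / False branches."""
        return n_plus * h2(p_plus) + n_minus * h2(p_minus)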
CONTINUOUS ATTRIBUTES

• Continuous attributes can be converted into logical ones via thresholds
  • X => X < a
• When considering splitting on X, pick the threshold a that minimizes the # of
  errors / the entropy (a sketch follows below)
  [Figure: examples laid out along the X axis, with the error count shown for
  each candidate threshold]
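A Python sketch of threshold selection, assuming we scan midpoints between
consecutive distinct sorted values of X and score each candidate by the number of
errors under majority labeling (entropy could be substituted as the score):

    def best_threshold(values, labels):
        """Pick a threshold a for the test X < a that minimizes the number of
        errors when each side of the split predicts its majority label.
        values: list of numbers, labels: matching list of booleans."""
        def branch_errors(branch_labels):
            pos = sum(branch_labels)
            return min(pos, len(branch_labels) - pos)   # errors under majority rule

        pairs = sorted(zip(values, labels))
        best_a, best_errors = None, float('inf')
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                                # no threshold between equal values
            a = (pairs[i - 1][0] + pairs[i][0]) / 2
            left  = [lab for v, lab in pairs if v < a]
            right = [lab for v, lab in pairs if v >= a]
            errors = branch_errors(left) + branch_errors(right)
            if errors < best_errors:
                best_a, best_errors = a, errors
        return best_a, best_errors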
MULTI-VALUED ATTRIBUTES
• Simple change: consider splits on all values A can take on
• Caveat: the more values A can take on, the more important it may appear to be,
  even if it is irrelevant
  • More values => the dataset is split into smaller example sets when picking
    attributes
  • Smaller example sets => more likely to fit well to spurious noise
STATISTICAL METHODS FOR ADDRESSING OVERFITTING / NOISE

• There may be few training examples that match the path leading to a deep node
  in the decision tree
  • More susceptible to choosing irrelevant/incorrect attributes when the sample
    is small
• Idea:
  • Make a statistical estimate of predictive power (which increases with larger
    samples)
  • Prune branches with low predictive power
  • Chi-squared pruning
TOP-DOWN DT PRUNING
• Consider an inner node X that by itself (majority rule) predicts p examples
  correctly and n examples incorrectly
• At its k leaf nodes, the numbers of correct/incorrect examples are
  p1/n1, …, pk/nk
• Chi-squared statistical significance test:
  • Null hypothesis: example labels are randomly chosen with distribution
    p/(p+n) (X is irrelevant)
  • Alternative hypothesis: examples are not randomly chosen (X is relevant)
• Prune X if testing X is not statistically significant
CHI-SQUARED TEST

• Let Z = Σ_i [ (p_i − p_i′)² / p_i′ + (n_i − n_i′)² / n_i′ ]
  • where p_i′ = p (p_i + n_i)/(p + n) and n_i′ = n (p_i + n_i)/(p + n) are the
    expected numbers of true/false examples at leaf node i if the null
    hypothesis holds
• Z is a statistic that is approximately drawn from the chi-squared distribution
  with k − 1 degrees of freedom
• Look up the p-value of Z from a table; prune if the p-value > α for some α
  (usually ≈ .05); a sketch follows below
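A sketch of this test in Python, using scipy's chi-squared survival function;
leaf_counts and alpha are illustrative names, and k − 1 degrees of freedom are
used as in a standard 2×k contingency test:

    from scipy.stats import chi2

    def chi_squared_prune(leaf_counts, alpha=0.05):
        """Decide whether to prune the split below node X.
        leaf_counts: list of (p_i, n_i) pairs, the correctly / incorrectly
        predicted example counts at each of the k leaves under X."""
        p = sum(pi for pi, _ in leaf_counts)
        n = sum(ni for _, ni in leaf_counts)
        z = 0.0
        for pi, ni in leaf_counts:
            size = pi + ni
            pi_exp = p * size / (p + n)      # expected "correct" count under the null
            ni_exp = n * size / (p + n)      # expected "incorrect" count under the null
            if pi_exp > 0:
                z += (pi - pi_exp) ** 2 / pi_exp
            if ni_exp > 0:
                z += (ni - ni_exp) ** 2 / ni_exp
        p_value = chi2.sf(z, len(leaf_counts) - 1)   # P(chi2_{k-1} >= z)
        return p_value > alpha               # prune if the split is not significant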
ENSEMBLE LEARNING
(BOOSTING)
IDEA
• It may be difficult to search for a single hypothesis that explains the data
• Construct multiple hypotheses (an ensemble), and combine their predictions
• "Can a set of weak learners construct a single strong learner?" – Michael
  Kearns, 1988
MOTIVATION
• 5 classifiers, each with 60% accuracy
• On a new example, run them all and pick the prediction by majority vote
• If errors are independent, the majority vote is correct about 68% of the time,
  and the advantage grows as more classifiers are added
• (In reality errors will not be independent, but we hope they will be mostly
  uncorrelated)
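A quick check of that arithmetic under the independence assumption: the majority
of m classifiers is correct exactly when more than half of them are.

    from math import comb

    def majority_vote_accuracy(m, acc):
        """P(majority of m independent classifiers is correct), each correct w.p. acc."""
        need = m // 2 + 1
        return sum(comb(m, k) * acc**k * (1 - acc)**(m - k) for k in range(need, m + 1))

    print(majority_vote_accuracy(5, 0.6))     # ~0.683
    print(majority_vote_accuracy(101, 0.6))   # ~0.98 -- the gain grows with ensemble size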
BOOSTING

• Main idea:
  • If learner 1 fails to learn an example correctly, this example is more
    important for learner 2
  • If learners 1 and 2 fail to learn an example correctly, this example is more
    important for learner 3
  • …
• Weighted training set
  • Weights encode importance
BOOSTING

• Weighted training set

Ex.#  Weight  A      B      C      D      E      CONCEPT
 1    w1      False  False  True   False  True   False
 2    w2      False  True   False  False  False  False
 3    w3      False  True   True   True   True   False
 4    w4      False  False  True   False  False  False
 5    w5      False  False  False  True   True   False
 6    w6      True   False  True   False  False  True
 7    w7      True   False  False  True   False  True
 8    w8      True   False  True   False  True   True
 9    w9      True   True   True   False  True   True
10    w10     True   True   True   True   True   True
11    w11     True   True   False  False  False  False
12    w12     True   True   False  False  True   False
13    w13     True   False  True   True   True   True
BOOSTING
• Start with uniform weights wi = 1/N
• Use learner 1 to generate hypothesis h1
• Adjust weights to give higher importance to misclassified examples
• Use learner 2 to generate hypothesis h2
• …
• Weight hypotheses according to performance, and return the weighted majority
  (a sketch follows below)
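A minimal AdaBoost-style sketch of this loop, following the R&N weighting scheme
cited later in the deck; learn_stump stands for whichever weak learner is used and
is assumed to accept example weights.

    import math

    def adaboost(examples, learn_stump, rounds):
        """examples: list of (x, label) pairs with boolean labels.
        learn_stump(examples, weights) -> hypothesis h, where h(x) -> bool.
        Returns a weighted-majority classifier."""
        N = len(examples)
        w = [1.0 / N] * N
        hypotheses, h_weights = [], []
        for _ in range(rounds):
            h = learn_stump(examples, w)
            error = sum(wi for wi, (x, y) in zip(w, examples) if h(x) != y)
            if error == 0 or error >= 0.5:
                break                        # guard: perfect or worse-than-chance stump
            # Shrink the weights of correctly classified examples...
            w = [wi * error / (1 - error) if h(x) == y else wi
                 for wi, (x, y) in zip(w, examples)]
            # ...and renormalize, which raises the relative weight of the mistakes.
            total = sum(w)
            w = [wi / total for wi in w]
            hypotheses.append(h)
            h_weights.append(math.log((1 - error) / error))   # hypothesis weight
        def classify(x):
            vote = sum(hw if h(x) else -hw for h, hw in zip(hypotheses, h_weights))
            return vote > 0
        return classify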
MUSHROOM EXAMPLE

• "Decision stumps" – single-attribute DTs

Ex.#  Weight  A      B      C      D      E      CONCEPT
 1    1/13    False  False  True   False  True   False
 2    1/13    False  True   False  False  False  False
 3    1/13    False  True   True   True   True   False
 4    1/13    False  False  True   False  False  False
 5    1/13    False  False  False  True   True   False
 6    1/13    True   False  True   False  False  True
 7    1/13    True   False  False  True   False  True
 8    1/13    True   False  True   False  True   True
 9    1/13    True   True   True   False  True   True
10    1/13    True   True   True   True   True   True
11    1/13    True   True   False  False  False  False
12    1/13    True   True   False  False  True   False
13    1/13    True   False  True   True   True   True
MUSHROOM EXAMPLE

• Pick C first; learner 1 learns CONCEPT = C
  (weights still uniform at 1/13; C misclassifies examples 1, 3, 4, and 7)
MUSHROOM EXAMPLE

• Update weights (precise formula given in R&N): the examples misclassified by C
  (1, 3, 4, 7) rise to .125, the correctly classified ones drop to .056

Ex.#:     1     2     3     4     5     6     7     8     9    10    11    12    13
Weight: .125  .056  .125  .125  .056  .056  .125  .056  .056  .056  .056  .056  .056

(attribute values and CONCEPT as in the table above)
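A quick check of this first update, assuming the R&N rule (multiply the weights of
correctly classified examples by error/(1 − error), then renormalize) and that
h1 = C misclassifies examples 1, 3, 4, and 7:

    w = [1 / 13] * 13                      # uniform initial weights
    wrong = {1, 3, 4, 7}                   # examples misclassified by h1 = C
    error = sum(w[i - 1] for i in wrong)   # 4/13
    # Shrink the correctly classified examples, then renormalize.
    w = [wi if (i + 1) in wrong else wi * error / (1 - error) for i, wi in enumerate(w)]
    total = sum(w)
    w = [wi / total for wi in w]
    # Misclassified examples end up at 1/8 = .125, the rest at 1/18 ~ .056.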
MUSHROOM EXAMPLE

• Next try A; learner 2 learns CONCEPT = A
  (weights as above; A misclassifies only examples 11 and 12)
MUSHROOM EXAMPLE

• Update weights: examples 11 and 12 (misclassified by A) rise to 0.25, while the
  rest shrink

Ex.#:     1     2     3     4     5     6     7     8     9    10    11    12    13
Weight: 0.07  0.03  0.07  0.07  0.03  0.03  0.07  0.03  0.03  0.03  0.25  0.25  0.03

(attribute values and CONCEPT as in the table above)
MUSHROOM EXAMPLE

• Next try E; learner 3 learns CONCEPT = E
  (weights as above)
MUSHROOM EXAMPLE

• Update weights… (the weight table shown is unchanged from the previous slide)
MUSHROOM EXAMPLE

• Final classifier, built in the order C, A, E, D, B
• Weights on hypotheses are determined by their overall error
• Weighted-majority weights: A = 2.1, B = 0.9, C = 0.8, D = 1.4, E = 0.09
• 100% accuracy on the training set
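A sketch of the final weighted-majority combination; the stumps are passed in as
boolean functions, so the particular direction each stump predicts (learned from
the weighted data) is left to the caller:

    def weighted_majority(stumps, weights, x):
        """stumps: list of functions x -> bool; weights: matching hypothesis weights.
        Returns the ensemble's prediction for example x."""
        vote = sum(w if h(x) else -w for h, w in zip(stumps, weights))
        return vote > 0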
BOOSTING STRATEGIES
• The weighting strategy just described is the popular AdaBoost algorithm
  • See R&N pp. 667
• Many other strategies exist
• Typically, as the number of hypotheses increases, accuracy increases as well
  • Does this conflict with Occam's razor?
ANNOUNCEMENTS

• Next class: neural networks & function learning
  • R&N 18.6–7