

CS 559: Machine Learning Fundamentals & Applications
Week 10: Combined Models I (Decision Tree and Stacking Method)
9.1. Decision Trees (DTs)
9.1.1. Introduction
• The decision tree (DT) algorithm builds its hypothesis by iterative, top-down decision making and can be visualized as a hierarchy of decisions that forks the dataset into subspaces.
• The DT algorithm can use any Boolean function to search the entire training dataset and split it according to the decisions made.
• A total of $2^m$ potential splits will be evaluated given $m$ features.
• The hypothesis space is searched greedily (make the choice that looks best at each step).
• The challenge is whether the algorithm can learn the right function or a good approximation of it.
9.1.2. Terminology
§ Node (Vertex): holds fields or key values.
§ Branch (Edge): connects nodes.
§ Root: the node at the top of the tree – no incoming edge, zero or more outgoing edges.
§ Internal node: one incoming edge, two or more outgoing edges.
§ Leaf: a node at the end of the tree – no outgoing edges; it assigns the class label. If a leaf is not pure, it takes the majority vote.
§ Parent and child nodes: of two connected nodes, the one at the smaller depth is the parent and the one at the greater depth is the child.
§ Depth: the top-down height of the tree, measured from the root.
§ Height: the bottom-up height of the tree, measured from the leaves.
9.1.2. Rule-based Learning
§ We can think of a DT as nested "if-else" rules, where each rule is a simple conjunction of conditions. For example,
$\mathrm{Rule}_1 = (\text{if } x = 1) \cap (\text{if } y = 2) \cap \cdots$
Eq. 9 - 1
§ Multiple rules can then be joined into a set of rules and applied to predict the target value of training data. For example,
$\mathrm{Class}_1 = (\mathrm{Rule}_1 = \mathrm{True}) \cup (\mathrm{Rule}_2 = \mathrm{True}) \cup \cdots$
Eq. 9 - 2
§ Each leaf node in a DT represents a set of rules. For example,
$(\text{Work to do?} = \text{No}) \cap (\text{Outlook?} = \text{Rainy}) \cap (\text{Friends busy?} = \text{No})$
Eq. 9 - 3
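As a small illustration (a sketch in Python with the hypothetical feature names of Eq. 9-3, not code from the course), the rule reaching one leaf is just a chain of nested if-statements:

def classify(work_to_do, outlook, friends_busy):
    # Each root-to-leaf path is a conjunction of conditions (Eq. 9-1);
    # this particular path encodes the rule in Eq. 9-3.
    if work_to_do == "No":
        if outlook == "Rainy":
            if friends_busy == "No":
                return "Class 1"   # leaf reached when all three conditions hold
    return "some other class"      # any other path leads to a different leaf

print(classify("No", "Rainy", "No"))   # -> "Class 1"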
9.1.3. Things to Consider
Even though rules can be constructed from a DT easily, there are some concerns:
§ It is not always possible to build a DT from a given set of rules, and it may take time to determine how.
§ Evaluating a rule set is much more expensive than evaluating a tree.
§ Multiple rule sets are possible for the same tree.
§ Because rules are so flexible, rule-based models are more prone to overfitting; the rule hypothesis space is larger than that of DTs.
9.1.4. Different DT algorithms - ID3, C4.5, and CART
§ There are multiple DT algorithms.
§ Most DT algorithms differ in the following ways:
o Splitting criteria – information gain (Shannon entropy, Gini impurity, misclassification error), use of statistical tests, objective function, etc.
o Binary vs. multi-way splits
o Discrete vs. continuous variables
o Pre- vs. post-pruning
o Bottom-up vs. top-down pruning
9.1.4. Different DT algorithms - ID3, C4.5, and CART
Iterative Dichotomizer 3
§ Described in Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
§ One of the earliest DT algorithms
§ Can only handle discrete features
§ Multi-category splits
§ No pruning, prone to overfitting
§ Short and wide trees compared to CART
§ Maximizes information gain and minimizes entropy
9.1.4. Different DT algorithms - ID3, C4.5, and CART
C4.5
§ Described in Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
§ Handles both continuous and discrete features. Continuous feature splitting is very expensive.
§ The splitting criterion is computed via the gain ratio.
§ Handles missing attributes by ignoring them in the computation.
§ Performs post-pruning (bottom-up pruning).
9.1.4. Different DT algorithms - ID3, C4.5, and CART
CART – Classification and Regression Trees
§ Described in Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
§ Handles continuous and discrete features.
§ Strictly binary splits (the resulting trees are deeper than ID3 and C4.5 trees)
o Generates better trees than C4.5, but they tend to be larger and harder to interpret.
o For a categorical attribute with k values, there are $2^{k-1} - 1$ ways to create a binary partition.
§ Variance reduction in regression trees.
§ Uses Gini impurity in classification trees.
§ Performs cost-complexity pruning.
9.1.5. Information Gain (IG)
• IG is the standard criterion for choosing the split in a DT.
o The higher the IG, the better the split.
• IG relies on the concept of mutual information – the reduction of the entropy of one variable obtained by knowing the other.
• We want to maximize the mutual information when defining the splitting criterion.
$\mathrm{Gain}(D, x_j) = H(D) - \sum_{v \in \mathrm{values}(x_j)} \frac{|D_v|}{|D|} H(D_v)$
Eq. 9 - 4
where $D$ is the training set at the parent node, $D_v$ is the subset reaching the child node created by value $v$ of the split, and $H(\cdot)$ is the entropy.
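As a small illustration of Eq. 9-4 (a sketch in Python, not the course notebook; the helper names are made up), the gain of a candidate split can be computed directly from the class labels:

import numpy as np
from collections import Counter

def entropy(labels):
    """H(D): Shannon entropy of a collection of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent_labels, child_label_lists):
    """Gain(D, x_j) = H(D) - sum_v |D_v|/|D| * H(D_v)   (Eq. 9-4)."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - weighted

# Example: a parent node with 12 positives and 8 negatives, split into two children
# (the same 3/5 vs. 9/3 split that feature x3 produces in the later classifier example).
parent = [1] * 12 + [0] * 8
children = [[1] * 3 + [0] * 5, [1] * 9 + [0] * 3]
print(round(information_gain(parent, children), 3))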
9.1.6. Information Theory and Entropy
• In ID3, Shannon entropy is used to measure improvement in a DT; i.e., we use it as an optimization metric (or impurity measure).
• It was originally proposed in the context of encoding digital information in the form of bits (0s and 1s).
• Consider it a measure of the amount of information of a discrete random variable (two outcomes, Bernoulli distribution).
9.1.6. Information Theory and Entropy
• Shannon information:
o Shannon information defines the information content of an event as the number of bits needed to encode $1/p$, where $p$ is the probability of the event ($1/p$ is the uncertainty).
o The number of bits needed for the encoding is $\log_2 \frac{1}{p}$.
o $-\log_2 p$ ranges from $\infty$ (as $p \to 0$) down to $0$ (at $p = 1$): if we are 100% certain about an event, we gain 0 information.
o E.g., assume two soccer teams each have a win probability of 50%. Transmitting the information that team 1 wins takes 1 bit: $\log_2 \frac{1}{0.5} = \log_2(2) = 1$.
9.1.6. Information Theory and Entropy
• Shannon entropy is then the "average information":
o Entropy: $H(p) = \sum_i p_i \log_2 \frac{1}{p_i} = -\sum_i p_i \log_2 p_i$
Eq. 9 - 5
o E.g., assume teams 1 and 2 have win probabilities of 75% and 25%, respectively. The average information content is
$H(p) = -0.75 \log_2 0.75 - 0.25 \log_2 0.25 = 0.81$
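A quick sanity check of Eq. 9-5 in Python (a minimal sketch using NumPy):

import numpy as np

p = np.array([0.75, 0.25])          # win probabilities of teams 1 and 2
H = float(-(p * np.log2(p)).sum())  # Shannon entropy, Eq. 9-5
print(round(H, 2))                  # 0.81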
9.1.7. Growing DTs via Entropy or Gini Impurity Rather than Misclassification Error
Consider the general form of the information gain,
$G(D, x_j) = I(D) - \sum_{v} \frac{|D_v|}{|D|} I(D_v)$
Eq. 9 - 6
where $I(\cdot)$ is an impurity measure (e.g., the entropy), $D$ is the training set at the parent node, and $D_v$ is the subset reaching a child node after the split.
9.1.7. Growing DTs via Entropy or Gini Impurity Rather than Misclassification Error
Another choice: let the misclassification error be
$E(D) = \frac{1}{N} \sum_{i=1}^{N} L(\hat{y}_i, y_i)$
Eq. 9 - 7
with the 0-1 loss
$L(\hat{y}, y) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{otherwise.} \end{cases}$
Eq. 9 - 8
9.1.8. Gini Impurity
• Gini impurity is the measurement used in CART as opposed to entropy:
$\mathrm{Gini}(t) = 1 - \sum_i p(c = i)^2$
Eq. 9 - 9
• In practice, the choice between Gini impurity and entropy makes little difference; both are concave functions of the class probabilities, which is what matters.
• Gini is slightly cheaper to compute than entropy (no logarithm), which can make code negligibly more efficient.
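A small sketch comparing the two impurity measures on the same node (the class proportions below are illustrative):

import numpy as np

def gini(p):
    """Gini(t) = 1 - sum_i p_i^2   (Eq. 9-9)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - (p ** 2).sum())

def entropy(p):
    """H(p) = -sum_i p_i log2 p_i   (Eq. 9-5)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log2(p)).sum())

for node in ([0.5, 0.5], [0.75, 0.25], [0.9, 0.1]):
    print(node, round(gini(node), 3), round(entropy(node), 3))
# Both measures peak at p = 0.5 and fall as the node becomes purer;
# only the scale differs, so the chosen splits are usually the same.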
9.1.8. Gini Impurity
Gain Ratio
• The gain ratio, introduced by Quinlan, penalizes splitting categorical attributes with many values via
$\mathrm{GainRatio}(D, x_j) = \frac{\mathrm{Gain}(D, x_j)}{\mathrm{SplitInfo}(D, x_j)}$
Eq. 9 - 10
• SplitInfo measures the entropy of the attribute itself:
$\mathrm{SplitInfo}(D, x_j) = -\sum_{v \in \mathrm{values}(x_j)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$
Eq. 9 - 11
Gini Impurity
Overfitting
• If DTs are not pruned, they run a high risk of overfitting the training data.
• Overfitting occurs when a model picks up noise or errors in the training dataset; it shows up as a performance gap between training and test data.
• DT pruning is the general approach for minimizing overfitting.
• Pre-Pruning (see the scikit-learn sketch below):
o Set a depth cut-off (maximum tree depth) at the beginning.
o Cost-complexity pruning: minimize $I + \alpha |N|$, where $I$ is an impurity measure, $\alpha$ is a tuning parameter, and $|N|$ is the total number of nodes.
o Stop growing if a split is not statistically significant.
o Set a minimum number of data points for each node.
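In scikit-learn these knobs map onto DecisionTreeClassifier hyperparameters (a sketch on synthetic data; the parameter values are arbitrary, and note that scikit-learn applies the cost-complexity term as post-pruning):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,            # depth cut-off (maximum tree depth)
    min_samples_leaf=10,    # minimum number of data points per leaf
    ccp_alpha=0.01,         # cost-complexity parameter alpha
    random_state=0,
).fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())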
9.1.8. Gini Impurity
Overfitting
• Post-Pruning:
o Grow a full tree and then remove nodes.
o Reduced-error pruning: remove nodes whose removal does not hurt performance on a validation set.
o Convert the tree to rules first and then prune the rules:
§ There is one rule per leaf node.
§ If the rules are not sorted, rule sets are costly to evaluate but more expressive.
§ In contrast to pruned rule sets, rules derived from a DT are mutually exclusive.
9.1.9. DT for Regression
• Grow the tree through variance reduction at each node.
• Use a metric that compares the continuous target values to the predictions, such as the mean squared error at a given node $t$:
$\mathrm{MSE}_t = \frac{1}{N_t} \sum_{i \in D_t} \left(y_i - h(\boldsymbol{x}_i)\right)^2$
Eq. 9 - 12
• This is often referred to as the "within-node variance", and the splitting criterion is called "variance reduction".
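A sketch of the within-node MSE of Eq. 9-12 and the resulting variance-reduction criterion, with the node mean used as the prediction h(x) (the target values below are illustrative):

import numpy as np

def node_mse(y):
    """Within-node variance (Eq. 9-12), using the node mean as the prediction h(x)."""
    y = np.asarray(y, dtype=float)
    return float(((y - y.mean()) ** 2).mean())

def variance_reduction(y_parent, y_children):
    """MSE of the parent minus the weighted MSE of the children; larger is a better split."""
    n = len(y_parent)
    weighted = sum(len(child) / n * node_mse(child) for child in y_children)
    return node_mse(y_parent) - weighted

parent = [23, 30, 45, 46, 52]
print(round(variance_reduction(parent, [[23, 30], [45, 46, 52]]), 2))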
9.1.9. Conclusion
Pros:
• Easy to interpret and communicate
• Independent of feature scaling
Cons:
• Easy to overfit
• Elaborate pruning required
• Output range is bounded in regression trees.
9.1.10. DT Classifier Example
[Example dataset shown on the slide: 20 instances with binary features x1, x2, x3 and class label Y; 12 instances have Y = 1 and 8 have Y = 0.]
9.1.10. DT Classifier Example
• $p(y = 1) = \frac{12}{20} = 0.6$
• $p(y = 0) = \frac{8}{20} = 0.4$
• $H(Y) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4$
• $H(Y \mid x_1 = 1) = -P(x_1 = 1)\left[P(x_1 = 1 \mid Y = 1)\log_2 P(x_1 = 1 \mid Y = 1) + P(x_1 = 1 \mid Y = 0)\log_2 P(x_1 = 1 \mid Y = 0)\right]$
• $H(Y \mid x_1) = H(Y \mid x_1 = 1) + H(Y \mid x_1 = 0)$
• $IG(x_1) = H(Y) - H(Y \mid x_1)$
9.1.10. DT Classifier Example
For the child node with $x_3 = 1$ (3 instances with $Y = 1$, 5 with $Y = 0$; entropy $H = 0.811$), compare splitting on $x_1$ and on $x_2$:
$H(Y \mid x_1, x_3 = 1) = \sum_{i \in \{0,1\}} H(Y \mid x_1 = i, x_3 = 1)$
$IG(x_j) = H(x_3) - H(Y \mid x_j, x_3 = 1)$
Computed values:
H(Y | X1=1, X3=1) = 0.459,  H(Y | X1=0, X3=1) = 0.361,  H(Y | X1, X3=1) = 0.820
H(Y | X2=1, X3=1) = 0.000,  H(Y | X2=0, X3=1) = 0.271,  H(Y | X2, X3=1) = 0.271
IG(X1) = −0.009,  IG(X2) = 0.541
• Therefore, the node should split on X2!
9.1.10. DT Classifier Example
Branch $x_3 = 1$ (3 instances with $Y = 1$, 5 with $Y = 0$; $H = 0.811$):
H(Y | X1, X3=1) = 0.820,  H(Y | X2, X3=1) = 0.271
IG(X1) = −0.009,  IG(X2) = 0.541  →  split this child on X2.
Branch $x_3 = 0$ (9 instances with $Y = 1$, 3 with $Y = 0$; $H = 0.954$):
H(Y | X1, X3=0) = 0.496,  H(Y | X2, X3=0) = 0.904
IG(X1) = 0.459,  IG(X2) = 0.050  →  split this child on X1.
9.1.11. DT Regression Example

Day | Outlook  | Temp. | Humidity | Wind   | Golf Players
  7 | Overcast | Cool  | Normal   | Strong | 43
  3 | Overcast | Hot   | High     | Weak   | 46
 13 | Overcast | Hot   | Normal   | Weak   | 44
 12 | Overcast | Mild  | High     | Strong | 52
  6 | Rain     | Cool  | Normal   | Strong | 23
  5 | Rain     | Cool  | Normal   | Weak   | 52
 14 | Rain     | Mild  | High     | Strong | 30
  4 | Rain     | Mild  | High     | Weak   | 45
 10 | Rain     | Mild  | Normal   | Weak   | 46
  9 | Sunny    | Cool  | Normal   | Weak   | 38
  2 | Sunny    | Hot   | High     | Strong | 30
  1 | Sunny    | Hot   | High     | Weak   | 25
  8 | Sunny    | Mild  | High     | Weak   | 35
 11 | Sunny    | Mild  | Normal   | Strong | 48
9.1.11. DT Regression Example

Overall: average = 39.79, stdev = 9.32

Outlook split:
  Overcast: stdev 3.49, 4 instances, average 46.25
  Rain:     stdev 10.87, 5 instances, average 39.2
  Sunny:    stdev 7.78, 5 instances, average 35.2
  weighted stdev = 7.66, stdev reduction = 1.66

Temperature split:
  Hot:  stdev 8.95, 4 instances
  Cool: stdev 10.51, 4 instances
  Mild: stdev 7.65, 6 instances
  weighted stdev = 8.84, stdev reduction = 0.48

Humidity split:
  High:   stdev 9.36, 7 instances
  Normal: stdev 8.73, 7 instances
  weighted stdev = 9.05, stdev reduction = 0.28

Wind split:
  Strong: stdev 10.59, 6 instances
  Weak:   stdev 7.87, 8 instances
  weighted stdev = 9.04, stdev reduction = 0.29
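Outlook gives the largest standard-deviation reduction, so it is chosen for the root split. The Outlook numbers above can be reproduced with a few lines of NumPy (a sketch; the other attributes are handled the same way):

import numpy as np

# Golf Players grouped by Outlook (values from the table above).
players = {
    "Overcast": [46, 43, 52, 44],
    "Rain":     [45, 52, 23, 46, 30],
    "Sunny":    [25, 30, 35, 38, 48],
}
all_players = np.concatenate([np.asarray(v, dtype=float) for v in players.values()])

total_std = all_players.std()   # population stdev of all 14 targets: 9.32
n = len(all_players)
weighted_std = sum(len(v) / n * np.std(v) for v in players.values())   # 7.66

print(round(total_std, 2), round(weighted_std, 2), round(total_std - weighted_std, 2))
# -> 9.32 7.66 1.66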
9.1.11. DT Regression Example (branch: Outlook = Overcast)

Day | Temp. | Humidity | Wind   | Golf Players
  3 | Hot   | High     | Weak   | 46
  7 | Cool  | Normal   | Strong | 43
 12 | Mild  | High     | Strong | 52
 13 | Hot   | Normal   | Weak   | 44

average = 46.25, stdev = 3.49

Temp.:    Hot stdev 1.00 (2 instances), Cool 0.00 (1), Mild 0.00 (1); weighted stdev = 0.50, stdev reduction = 2.99
Humidity: High stdev 3.00 (2), Normal 0.50 (2); weighted stdev = 1.75, stdev reduction = 1.74
Wind:     Weak stdev 1.00 (2), Strong 4.50 (2); weighted stdev = 2.75, stdev reduction = 0.74
9.1.11. DT Regression Example (branch: Outlook = Rain)

Day | Temp. | Humidity | Wind   | Golf Players
  4 | Mild  | High     | Weak   | 45
  5 | Cool  | Normal   | Weak   | 52
  6 | Cool  | Normal   | Strong | 23
 10 | Mild  | Normal   | Weak   | 46
 14 | Mild  | High     | Strong | 30

average = 39.2, stdev = 10.87

Temp.:    Hot stdev 0.00 (0 instances), Cool 14.50 (2), Mild 7.32 (3); weighted stdev = 10.19, stdev reduction = 0.68
Humidity: High stdev 7.50 (2), Normal 12.50 (3); weighted stdev = 10.50, stdev reduction = 0.37
Wind:     Weak stdev 3.09 (3), Strong 3.50 (2); weighted stdev = 3.25, stdev reduction = 7.62
9.1.11. DT Regression Example (branch: Outlook = Sunny)

Day | Temp. | Humidity | Wind   | Golf Players
  1 | Hot   | High     | Weak   | 25
  2 | Hot   | High     | Strong | 30
  8 | Mild  | High     | Weak   | 35
  9 | Cool  | Normal   | Weak   | 38
 11 | Mild  | Normal   | Strong | 48

average = 35.2, stdev = 7.78

Temp.:    Hot stdev 2.50 (2 instances), Cool 0.00 (1), Mild 6.50 (2); weighted stdev = 3.60, stdev reduction = 4.18
Humidity: High stdev 4.08 (3), Normal 5.00 (2); weighted stdev = 4.45, stdev reduction = 3.33
Wind:     Weak stdev 5.56 (3), Strong 9.00 (2); weighted stdev = 6.94, stdev reduction = 0.84
9.1.12. DT Classifier Implementation
Building Decision Trees (see the notebook file)
1. Assign all training instances to the root of the tree. Set the current node to the root node.
2. For each attribute:
a. Partition all data instances at the node by the value of the attribute.
b. Compute the information gain ratio from the partitioning.
3. Identify the feature that results in the greatest information gain ratio. Set this feature to be the splitting criterion at the current node.
• If the best information gain ratio is 0, tag the current node as a leaf and return.
4. Partition all instances according to the attribute value of the best feature.
5. Denote each partition as a child node of the current node.
6. For each child node:
a. If the child node is "pure" (has instances from only one class), tag it as a leaf and return.
b. If not, set the child node as the current node and recurse to step 2.
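A compact recursive sketch of steps 1–6 for categorical features (this is not the course notebook; the tiny dataset at the bottom is made up):

from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain ratio of splitting (rows, labels) on attribute index attr."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0

def build(rows, labels, attrs):
    """Steps 2-6: pick the attribute with the best gain ratio, partition, and recurse."""
    if len(set(labels)) == 1 or not attrs:              # pure node, or no attribute left
        return Counter(labels).most_common(1)[0][0]     # leaf: majority (or only) label
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) == 0:             # step 3: no useful split -> leaf
        return Counter(labels).most_common(1)[0][0]
    node = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):        # steps 4-5: one child per value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build([rows[i] for i in idx],
                                        [labels[i] for i in idx],
                                        [a for a in attrs if a != best])
    return node

rows = [("a", "p"), ("a", "q"), ("b", "p"), ("b", "q")]   # two categorical features
labels = ["yes", "yes", "no", "yes"]
print(build(rows, labels, attrs=[0, 1]))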
9.1.12. DT Classifier Implementation
Evaluating an instance using a decision tree
Pruning by Information Gain
1. Catalog all twigs in the tree
2. Count the total number of leaves in the tree
3. While the number of leaves in the tree exceeds the desired number:
a. Find the twig with the least Information Gain.
b. Remove all child nodes of the twig.
c. Relabel twig as a leaf.
d. Update the leaf count.
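A sketch of this pruning loop on a toy tree structure (the Node class is hypothetical; each internal node is assumed to store the information gain of its split and the majority label of the data reaching it):

from dataclasses import dataclass, field

@dataclass
class Node:
    gain: float = 0.0       # information gain of the split at this node (0 for leaves)
    majority: str = ""      # majority class of the data reaching the node
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def twigs(node):
    """Internal nodes whose children are all leaves."""
    if node.is_leaf():
        return []
    if all(child.is_leaf() for child in node.children):
        return [node]
    return [t for child in node.children for t in twigs(child)]

def n_leaves(node):
    return 1 if node.is_leaf() else sum(n_leaves(child) for child in node.children)

def prune(root, max_leaves):
    # While the tree has too many leaves, collapse the twig with the least information gain.
    while n_leaves(root) > max_leaves:
        twig = min(twigs(root), key=lambda t: t.gain)
        twig.children = []   # remove the twig's children; it becomes a leaf
        # (its prediction is now twig.majority)

root = Node(gain=0.9, majority="yes", children=[
    Node(majority="yes"),
    Node(gain=0.05, majority="no",
         children=[Node(majority="no"), Node(majority="yes")]),
])
prune(root, max_leaves=2)
print(n_leaves(root))   # 2: the low-gain twig was collapsed into a leaf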
9.3. Ensemble Methods II – Stacking Method
9.3.1 Stacking - Overview
• Stacking is a special case of the ensemble approach in which we combine an ensemble of models through a so-called meta-classifier.
• "Base learners" are trained on the initial training set; the resulting models then make predictions that serve as input features to a "meta-learner".
9.3.1. Naïve Stacking
• Naïve stacking has a strong tendency to suffer from extensive overfitting.
o The meta-learner relies entirely on the base learners' predictions over their own training data.
• We can use k-fold or leave-one-out cross-validation to avoid this overfitting.
9.3.2. Stacking with Cross-Validation
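One common way to implement this (a sketch; the data and model choices are arbitrary) is to build the meta-features from out-of-fold predictions with cross_val_predict, so the meta-learner never sees predictions that a base learner made on its own training data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

base_learners = [DecisionTreeClassifier(max_depth=4, random_state=0),
                 KNeighborsClassifier(n_neighbors=5)]

# Out-of-fold class-1 probabilities from each base learner become the meta-features.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_learners
])
meta_learner = LogisticRegression().fit(meta_features, y)
print(meta_learner.score(meta_features, y))
# At prediction time, the base learners are refit on all of X and their
# predictions on new data are fed to the meta-learner.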
9.3.3. Scikit-Learn Stacking Method
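Scikit-learn wraps the whole procedure in StackingClassifier (a minimal sketch; the estimators are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression(),   # the meta-learner
    cv=5,                                   # internal CV used to build the meta-features
)
print(stack.fit(X_train, y_train).score(X_test, y_test))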
9.4. Summary
• Decision Trees: need to reduce variance. How?
• Bagging: bootstrap (random subsampling with replacement)
• Random Forest
o Bagging with fully grown decision trees
o Easy, built-in feature selection, less data pre-processing
• But… how do we reduce bias?
• Boosting
o Gradient Boosting
§ Good for classification & regression
§ Simple when we use the squared loss function
§ Constant small step size
§ Works with any convex, differentiable loss function
o AdaBoost
§ Only for classification
§ Invented first, but it turned out to be a form of gradient boosting (exponential loss function)
§ Needs to compute a weight and step size at every iteration
o XGBoost (Extreme Gradient Boosting)
• Stacking Method
o Hard to interpret the result, since it ensembles the predictions of the base models.
o The computational cost can be high (mainly from training the base models).