CS 559: Machine Learning Fundamentals & Applications
Week 10: Combined Models I (Decision Trees and the Stacking Method)

9.1. Decision Trees (DTs)

9.1.1. Introduction
• The decision tree (DT) algorithm builds its hypothesis by iterative, top-down construction and can be visualized as a hierarchy of decisions that fork the dataset into subspaces.
• A DT can represent Boolean functions of the features; the algorithm searches the training dataset and splits it according to the decisions it makes.
• Given m features, a large number of potential splits must be evaluated at every node.
• The hypothesis space is therefore searched greedily (at each step, make the choice that looks best locally).
• The challenge is whether the algorithm can learn the right function, or at least a good approximation of it.

9.1.2. Terminology
§ Node (vertex): holds fields or key values.
§ Branch (edge): connects nodes.
§ Root: the node at the top of the tree – no incoming edge, zero or more outgoing edges.
§ Internal node: one incoming edge, two or more outgoing edges.
§ Leaf: a node at the end of the tree – no outgoing edges; it assigns the class label. If a leaf is not pure, it takes the majority vote.
§ Parent and child nodes: of two connected nodes, the one at the lower depth (closer to the root) is the parent, and the one at the higher depth is the child.
§ Depth: the top-down distance of a node from the root.
§ Height: the bottom-up distance of a node from its deepest leaf.

9.1.2. Rule-based Learning
§ We can think of a DT as nested "if-else" rules, where each rule is a simple conjunction of conditions. For example,
  $\text{Rule 1} = (x = 1) \wedge (y = 2) \wedge \cdots$   (Eq. 9-1)
§ Multiple rules can then be joined into a rule set and applied to predict the target value of a training example. For example,
  $\text{Class 1} = (\text{Rule 1} = \text{True}) \vee (\text{Rule 2} = \text{True}) \vee \cdots$   (Eq. 9-2)
§ Each leaf node in a DT corresponds to one such rule, i.e., the conjunction of conditions along the path from the root. For example,
  $(\text{Work to do?} = \text{No}) \wedge (\text{Outlook?} = \text{Windy}) \wedge (\text{Friends busy?} = \text{No})$   (Eq. 9-3)

9.1.3. Things to Consider
Even though rules can be constructed from a DT easily, there are caveats:
§ It is not always possible to build a DT from a given set of rules, and it may take time to determine how.
§ Evaluating a rule set is much more expensive than evaluating a tree.
§ Multiple rule sets are possible for the same problem.
§ Because the rules are so flexible, overfitting is a concern: the hypothesis space of rule sets is even larger than that of DTs.

9.1.4. Different DT Algorithms – ID3, C4.5, and CART
§ There are multiple DT algorithms.
§ Most DT algorithms differ in the following ways:
  o Splitting criteria – information gain (Shannon entropy, Gini impurity, misclassification error), use of statistical tests, objective function, etc.
  o Binary splits vs. multi-way splits
  o Discrete vs. continuous variables
  o Pre- vs. post-pruning
  o Bottom-up vs. top-down pruning

ID3 – Iterative Dichotomizer 3
§ Described in Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
§ One of the earliest DT algorithms.
§ Can only handle discrete features.
§ Multi-way (multi-category) splits.
§ No pruning; prone to overfitting.
§ Produces short, wide trees compared to CART.
§ Maximizes information gain / minimizes entropy.
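Returning to the rule-based view in Section 9.1.2: a decision tree such as the one behind Eq. 9-3 can be written directly as nested "if-else" statements, where each root-to-leaf path is one rule. Below is a minimal sketch in Python; the exact attribute values and leaf labels are illustrative assumptions, not taken from the lecture figure.

```python
def decide(work_to_do: bool, outlook: str, friends_busy: bool) -> str:
    """Each root-to-leaf path is a conjunction of conditions, i.e., one rule."""
    if work_to_do:                 # (Work to do? = Yes)
        return "stay in"
    if outlook == "sunny":         # (Work to do? = No) AND (Outlook? = Sunny)
        return "go to the beach"
    if outlook == "windy":
        if friends_busy:           # ... AND (Outlook? = Windy) AND (Friends busy? = Yes)
            return "stay in"
        return "go running"        # ... AND (Friends busy? = No), cf. Eq. 9-3
    return "go to the movies"      # remaining leaf: (Outlook? = Rainy)


print(decide(work_to_do=False, outlook="windy", friends_busy=False))  # -> go running
```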
C4.5
§ Described in Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
§ Handles both continuous and discrete features; splitting on continuous features is very expensive.
§ The splitting criterion is computed via the gain ratio.
§ Handles missing attribute values by ignoring them in the computation.
§ Performs post-pruning (bottom-up pruning).

CART – Classification and Regression Trees
§ Described in Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
§ Handles continuous and discrete features.
§ Strictly binary splits (the resulting trees are taller than those of ID3 and C4.5).
  o Often generates better trees than C4.5, but they tend to be larger and harder to interpret.
  o For an attribute with k values, there are $2^{k-1} - 1$ ways to create a binary partition.
§ Uses variance reduction in regression trees.
§ Uses Gini impurity in classification trees.
§ Performs cost-complexity pruning.

9.1.5. Information Gain (IG)
• IG is the standard criterion for choosing splits in a DT.
  o The higher the IG, the better the split.
• IG relies on the concept of mutual information – the reduction in the entropy of one variable obtained by knowing the other.
• We want to maximize mutual information when defining the splitting criterion:
  $\mathrm{Gain}(D, x_j) = H(D) - \sum_{v \in \mathrm{values}(x_j)} \frac{|D_v|}{|D|}\, H(D_v)$   (Eq. 9-4)
where $D$ is the training set at the parent node, $D_v$ is the dataset at the child node reached via value $v$ of feature $x_j$, and $H(\cdot)$ is the entropy.

9.1.6. Information Theory and Entropy
• In ID3, Shannon entropy is used to measure improvement in a DT; i.e., it serves as the optimization metric (or impurity measure).
• Entropy was originally proposed in the context of encoding digital information as bits (0s and 1s).
• It can be viewed as a measure of the amount of information carried by a discrete random variable (e.g., two outcomes, a Bernoulli distribution).
• Shannon information:
  o Shannon information defines the information of an event as the number of bits needed to encode $1/p$, where $p$ is the probability of the event ($1/p$ is the uncertainty).
  o The number of bits needed for the encoding is $\log_2 \frac{1}{p} = -\log_2 p$.
  o $-\log_2 p$ ranges from $\infty$ (as $p \to 0$) down to $0$ (at $p = 1$): if we are 100% certain about an event, observing it gives us 0 information.
  o E.g., assume two soccer teams each have a win probability of 50%. Transmitting the message "team 1 wins" then requires $\log_2 \frac{1}{0.5} = \log_2(2) = 1$ bit.
• Shannon entropy is then the "average information":
  o Entropy: $H(P) = \sum_i p_i \log_2 \frac{1}{p_i} = -\sum_i p_i \log_2 p_i$   (Eq. 9-5)
  o E.g., assume teams 1 and 2 have win probabilities of 75% and 25%, respectively. The average information content is $H(P) = -0.75 \log_2 0.75 - 0.25 \log_2 0.25 \approx 0.81$ bits.

9.1.7. Growing a DT via Entropy or Gini Impurity rather than Misclassification Error
Consider the general form of the information gain,
  $G(D, x_j) = I(D) - \sum_{v} \frac{|D_v|}{|D|}\, I(D_v)$   (Eq. 9-6)
where $I$ is an impurity measure (e.g., entropy), $D$ is the training set at the parent node, and $D_v$ is the dataset at a child node after the split.

Another choice of impurity is the misclassification error,
  $E(D) = \frac{1}{n} \sum_{i=1}^{n} L\!\left(\hat{y}^{(i)}, y^{(i)}\right)$   (Eq. 9-7)
with the 0-1 loss
  $L(\hat{y}, y) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{otherwise.} \end{cases}$   (Eq. 9-8)
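To make Eq. 9-4 and Eq. 9-5 concrete, here is a small sketch that computes the Shannon entropy of a label vector and the information gain of a candidate split. The toy feature column and class labels are made up for illustration.

```python
import numpy as np


def entropy(labels: np.ndarray) -> float:
    """Shannon entropy H(P) = -sum_i p_i * log2(p_i)  (Eq. 9-5)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(feature: np.ndarray, labels: np.ndarray) -> float:
    """Gain(D, x_j) = H(D) - sum_v |D_v|/|D| * H(D_v)  (Eq. 9-4)."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])   # |D_v|/|D| * H(D_v)
    return gain


# The 75% / 25% example from Section 9.1.6 gives H ~ 0.81 bits.
print(entropy(np.array([1, 1, 1, 0])))                 # ~0.811

# Hypothetical feature column and class labels for one candidate split.
x = np.array(["a", "a", "b", "b", "b", "a"])
y = np.array([1, 1, 0, 0, 1, 1])
print(information_gain(x, y))                          # ~0.459
```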
9.1.8. Gini Impurity
• Gini impurity is the measure used in CART, as opposed to entropy:
  $\mathrm{Gini}(t) = 1 - \sum_{i} p(i \mid t)^2$   (Eq. 9-9)
• In practice, the choice between Gini impurity and entropy rarely matters: both are concave functions of the class proportions, which is the essential property.
• Gini is slightly cheaper to compute than entropy (no logarithm), which can make the code negligibly more efficient.

Gain Ratio
• The gain ratio, introduced by Quinlan, penalizes splits on categorical attributes with many values:
  $\mathrm{GainRatio}(D, x_j) = \dfrac{\mathrm{Gain}(D, x_j)}{\mathrm{SplitInfo}(D, x_j)}$   (Eq. 9-10)
• SplitInfo measures the entropy of the attribute itself:
  $\mathrm{SplitInfo}(D, x_j) = -\sum_{v \in x_j} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$   (Eq. 9-11)

Overfitting
• If DTs are not pruned, they have a high risk of overfitting the training data.
• Overfitting occurs when a model picks up noise or errors in the training dataset; it shows up as a performance gap between training and test data.
• DT pruning is the general approach for minimizing overfitting.
• Pre-pruning:
  o Set a depth cut-off (maximum tree depth) before training.
  o Cost-complexity pruning: minimize $I + \alpha |N|$, where $I$ is an impurity measure, $\alpha$ is a tuning parameter, and $|N|$ is the total number of nodes.
  o Stop growing if a split is not statistically significant.
  o Set a minimum number of data points per node.
• Post-pruning:
  o Grow a full tree, then remove nodes.
  o Reduced-error pruning: remove nodes whose removal does not increase the error on a validation set.
  o Convert the tree to rules first and then prune the rules.
    § There is one rule per leaf node.
    § If rules are not sorted, rule sets are costly to evaluate but more expressive.
    § In contrast to pruned rule sets, the rules derived from a DT are mutually exclusive.

9.1.9. DT for Regression
• Grow the tree through variance reduction at each node.
• Use a metric that compares the continuous target values to the predictions, such as the mean squared error at a given node $t$:
  $\mathrm{MSE}(t) = \frac{1}{N_t} \sum_{i \in D_t} \left( y^{(i)} - h_t \right)^2$   (Eq. 9-12)
  where $N_t$ is the number of training examples at node $t$, $D_t$ is their index set, and $h_t$ is the node's prediction (the mean target value at $t$).
• This quantity is often referred to as the "within-node variance", and the corresponding splitting criterion is called "variance reduction".

9.1.9. Conclusion
Pros:
• Easy to interpret and communicate.
• Independent of feature scaling.
Cons:
• Easy to overfit.
• Elaborate pruning required.
• The output range is bounded in regression trees.
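The pre-pruning controls from Section 9.1.8 and the regression criterion from Section 9.1.9 map directly onto scikit-learn's tree estimators. Below is a minimal sketch on synthetic data, assuming a recent scikit-learn (>= 1.0); the particular parameter values are arbitrary illustrations, not recommendations.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: Gini or entropy criterion, pre-pruning via a depth cut-off
# and a minimum node size, plus cost-complexity pruning via ccp_alpha
# (the alpha in the I + alpha*|N| objective).
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = DecisionTreeClassifier(
    criterion="gini",        # or "entropy"
    max_depth=4,             # pre-pruning: maximum tree depth
    min_samples_leaf=5,      # pre-pruning: minimum number of data points per leaf
    ccp_alpha=0.01,          # cost-complexity pruning strength
    random_state=0,
)
clf.fit(Xc_tr, yc_tr)
print("classification test accuracy:", clf.score(Xc_te, yc_te))

# Regression tree: splits chosen by minimizing the within-node MSE (Eq. 9-12),
# i.e., by variance reduction.
Xr, yr = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, random_state=0)
reg.fit(Xr, yr)
print("regression R^2 (training data):", reg.score(Xr, yr))
```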
9.1.10. DT Classifier Example
• Consider a toy dataset of 20 examples with three binary features x1, x2, x3 and a binary label y: 12 examples have y = 1 and 8 have y = 0 (see the accompanying figure/notebook for the full table).
• $p(y=1) = \frac{12}{20} = 0.6$, $p(y=0) = \frac{8}{20} = 0.4$
• $H(Y) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.97$
• For a candidate feature $x_1$, compute the entropy of the label within each branch,
  $H(Y \mid x_1 = v) = -\sum_{c \in \{0,1\}} p(y = c \mid x_1 = v) \log_2 p(y = c \mid x_1 = v)$,
  the weighted conditional entropy
  $H(Y \mid x_1) = p(x_1 = 1)\, H(Y \mid x_1 = 1) + p(x_1 = 0)\, H(Y \mid x_1 = 0)$,
  and the information gain
  $IG(x_1) = H(Y) - H(Y \mid x_1)$.
• The same computation is applied inside a branch that has already been split; e.g., conditioning on $x_3 = 1$,
  $IG(x_1 \mid x_3 = 1) = H(Y \mid x_3 = 1) - H(Y \mid x_1, x_3 = 1)$.
• In the worked spreadsheet example, the root has already been split on $x_3$:
  o Within the $x_3 = 1$ branch (3 examples with y = 1, 5 with y = 0), the spreadsheet gives IG(x1) ≈ −0.01 (essentially zero) and IG(x2) ≈ 0.54. Therefore, this node should split on x2.
  o Within the $x_3 = 0$ branch (9 examples with y = 1, 3 with y = 0), it gives IG(x1) ≈ 0.46 and IG(x2) ≈ 0.05, so this node should split on x1.

9.1.11. DT Regression Example
• The regression example uses a 14-day golf dataset; the target is the number of golf players.

Day  Outlook   Temp.  Humidity  Wind    Golf Players
 7   Overcast  Cool   Normal    Strong  43
 3   Overcast  Hot    High      Weak    46
13   Overcast  Hot    Normal    Weak    44
12   Overcast  Mild   High      Strong  52
 6   Rain      Cool   Normal    Strong  23
 5   Rain      Cool   Normal    Weak    52
14   Rain      Mild   High      Strong  30
 4   Rain      Mild   High      Weak    45
10   Rain      Mild   Normal    Weak    46
 9   Sunny     Cool   Normal    Weak    38
 2   Sunny     Hot    High      Strong  30
 1   Sunny     Hot    High      Weak    25
 8   Sunny     Mild   High      Weak    35
11   Sunny     Mild   Normal    Strong  48

Root node: overall average = 39.79, overall standard deviation = 9.32.
Candidate splits at the root (per-value standard deviation of Golf Players with instance counts; weighted stdev; stdev reduction = 9.32 − weighted stdev):
• Outlook: Overcast 3.49 (4, avg 46.25), Rain 10.87 (5, avg 39.2), Sunny 7.78 (5, avg 35.2) → weighted stdev 7.66, reduction 1.66
• Temperature: Hot 8.95 (4), Cool 10.51 (4), Mild 7.65 (6) → weighted stdev 8.84, reduction 0.48
• Humidity: High 9.36 (7), Normal 8.73 (7) → weighted stdev 9.05, reduction 0.28
• Wind: Strong 10.59 (6), Weak 7.87 (8) → weighted stdev 9.04, reduction 0.29
Outlook gives the largest standard-deviation reduction, so the root splits on Outlook. (A short code sketch that reproduces these root-level numbers follows below; the branch-level computations continue after it.)
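A minimal sketch (assuming pandas is available) that reproduces the root-level standard-deviation-reduction computation above; only the column and function names are my own.

```python
import pandas as pd

# The 14-day golf dataset from Section 9.1.11 (rows ordered by day).
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temp":     ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                 "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Players":  [25, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30],
})


def stdev_reduction(df: pd.DataFrame, attribute: str, target: str = "Players") -> float:
    """SDR = stdev(target) - sum_v (n_v / n) * stdev(target | attribute = v)."""
    total_std = df[target].std(ddof=0)        # population stdev, as used on the slides
    weighted = sum(len(g) / len(df) * g[target].std(ddof=0)
                   for _, g in df.groupby(attribute))
    return total_std - weighted


for col in ["Outlook", "Temp", "Humidity", "Wind"]:
    print(f"{col:9s} SDR = {stdev_reduction(data, col):.2f}")
# Outlook has the largest reduction (~1.66), so the root splits on Outlook.
```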
9.1.11. DT Regression Example (continued)
The same standard-deviation-reduction computation is now repeated inside each Outlook branch.

Overcast branch (days 3, 7, 12, 13; Golf Players 46, 43, 52, 44): average = 46.25, stdev = 3.49.
• Temp.: Hot 1.00 (2), Cool 0.00 (1), Mild 0.00 (1) → weighted stdev 0.50, reduction 2.99
• Humidity: High 3.00 (2), Normal 0.50 (2) → weighted stdev 1.75, reduction 1.74
• Wind: Weak 1.00 (2), Strong 4.50 (2) → weighted stdev 2.75, reduction 0.74
→ The Overcast branch splits on Temperature.

Rain branch (days 4, 5, 6, 10, 14; Golf Players 45, 52, 23, 46, 30): average = 39.2, stdev = 10.87.
• Temp.: Hot – (0), Cool 14.50 (2), Mild 7.32 (3) → weighted stdev 10.19, reduction 0.68
• Humidity: High 7.50 (2), Normal 12.50 (3) → weighted stdev 10.50, reduction 0.37
• Wind: Weak 3.09 (3), Strong 3.50 (2) → weighted stdev 3.25, reduction 7.62
→ The Rain branch splits on Wind.

Sunny branch (days 1, 2, 8, 9, 11; Golf Players 25, 30, 35, 38, 48): average = 35.2, stdev = 7.78.
• Temp.: Hot 2.50 (2), Cool 0.00 (1), Mild 6.50 (2) → weighted stdev 3.60, reduction 4.18
• Humidity: High 4.08 (3), Normal 5.00 (2) → weighted stdev 4.45, reduction 3.33
• Wind: Weak 5.56 (3), Strong 9.00 (2) → weighted stdev 6.94, reduction 0.84
→ The Sunny branch splits on Temperature.

9.1.12. DT Classifier Implementation
Building decision trees (see the notebook file):
1. Assign all training instances to the root of the tree. Set the current node to the root node.
2. For each attribute:
   a. Partition all data instances at the node by the value of the attribute.
   b. Compute the information gain ratio from the partitioning.
3. Identify the feature that results in the greatest information gain ratio. Set this feature to be the splitting criterion at the current node.
   • If the best information gain ratio is 0, tag the current node as a leaf and return.
4. Partition all instances according to the attribute value of the best feature.
5. Denote each partition as a child node of the current node.
6. For each child node:
   a. If the child node is "pure" (has instances from only one class), tag it as a leaf and return.
   b. If not, set the child node as the current node and recurse to step 2.

Evaluating an instance using a decision tree: start at the root and follow, at each node, the branch that matches the instance's attribute value until a leaf is reached; the leaf's label is the prediction.

Pruning by information gain:
1. Catalog all twigs (nodes whose children are all leaves) in the tree.
2. Count the total number of leaves in the tree.
3. While the number of leaves in the tree exceeds the desired number:
   a. Find the twig with the least information gain.
   b. Remove all child nodes of the twig.
   c. Relabel the twig as a leaf.
   d. Update the leaf count.

9.3. Ensemble Methods II – Stacking Method

9.3.1. Stacking – Overview
• Stacking is a special case of the ensemble method in which we combine an ensemble of models through a so-called meta-classifier.
• "Base learners" are trained on the initial training set; the resulting models then make predictions that serve as input features to a "meta-learner".

9.3.1. Naïve Stacking
• Naïve stacking has a high tendency to suffer from extensive overfitting.
  o The meta-learner relies entirely on the base learners, and in the naïve setup it is trained on predictions the base learners made on the same data they were fit to.
• We can use k-fold or leave-one-out cross-validation to avoid this overfitting.

9.3.2. Stacking with Cross-Validation
• With k-fold cross-validation, each base learner produces out-of-fold predictions, and the meta-learner is trained on those held-out predictions rather than on predictions for the base learners' own training data.

9.3.3. Scikit-Learn Stacking Method
• See the notebook file; a minimal sketch is shown below.
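A minimal sketch of stacking with scikit-learn, as referenced in Section 9.3.3; the base learners, the logistic-regression meta-learner, and the synthetic dataset are illustrative choices, not necessarily those used in the lecture notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("svm", SVC(probability=True, random_state=42)),
]

# cv=5 gives the cross-validated variant of Section 9.3.2: the meta-learner
# is trained on out-of-fold predictions of the base learners.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)
print("stacking test accuracy:", stack.score(X_test, y_test))
```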
9.4. Summary
• Decision trees: we need to reduce their variance. How?
  o Bagging: bootstrap sampling (random subsampling with replacement).
  o Random Forest:
    - a bagging method built on full decision trees;
    - easy to use, provides feature selection, and requires less data pre-processing.
• But… how do we reduce bias?
  o Boosting:
    - Gradient Boosting: good for classification and regression; simple when we use the squared loss function; uses a constant small step size; works with any convex, differentiable loss function.
    - AdaBoost: classification only; invented first, but it turned out to be a form of gradient boosting (with the exponential loss function); needs to compute a weight and step size at every iteration.
    - XGBoost: Extreme Gradient Boosting.
• Stacking Method:
  o The result is hard to interpret, since it ensembles the predictions from the base models.
  o The computational cost can be high (mainly from training the base models).
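All of the ensemble families summarized above are available in scikit-learn. Below is a minimal sketch that compares them with 5-fold cross-validation on a synthetic dataset; the dataset and hyperparameters are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name:18s} mean CV accuracy = {scores.mean():.3f}")
```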