Boosting
Trevor Hastie, Stanford University

Trees, Bagging, Random Forests and Boosting
• Classification Trees
• Bagging: Averaging Trees
• Random Forests: Cleverer Averaging of Trees
• Boosting: Cleverest Averaging of Trees
Methods for improving the performance of weak learners such as trees. Classification trees are adaptive and robust, but do not generalize well. The techniques discussed here enhance their performance considerably.

Two-class Classification
• Observations are classified into two or more classes, coded by a response variable Y taking values 1, 2, . . . , K.
• We have a feature vector X = (X1, X2, . . . , Xp), and we hope to build a classification rule C(X) to assign a class label to an individual with feature X.
• We have a sample of pairs (yi, xi), i = 1, . . . , N. Note that each xi is a vector: xi = (xi1, xi2, . . . , xip).
• Example: Y indicates whether an email is spam or not; X represents the relative frequencies of a subset of specially chosen words in the email message.
• The technology described here estimates C(X) directly, or via the probability function P(C = k | X).

Classification Trees
• Represented by a series of binary splits.
• Each internal node represents a query on one of the variables, e.g. "Is X3 > 0.4?". If the answer is "Yes", go right, else go left.
• The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.
• The tree is grown on the training data by recursive splitting.
• The tree is often pruned to an optimal size, evaluated by cross-validation.
• New observations are classified by passing their X down the tree to a terminal node, and then using a majority vote.

Classification Tree
[Figure: a small classification tree. The root (10/30) splits on x.2 < 0.39; the left child (3/21) splits on x.3 < −1.575. Each node is labeled with its majority class and its misclassification count.]

Properties of Trees
✔ Can handle huge datasets
✔ Can handle mixed predictors: quantitative and qualitative
✔ Easily ignore redundant variables
✔ Handle missing data elegantly
✔ Small trees are easy to interpret
✖ Large trees are hard to interpret
✖ Often prediction performance is poor

Example: Predicting e-mail spam
• Data from 4601 email messages.
• Goal: predict whether an email message is spam (junk email) or good.
• Input features: relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks across all the training email messages.
• For this problem not all errors are equal: we want to avoid filtering out good email, while letting spam get through is undesirable but less serious in its consequences.
• We coded spam as 1 and email as 0.
• A system like this would be trained for each user separately (e.g. their word lists would be different).

Predictors
• 48 quantitative predictors: the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.
• 6 quantitative predictors: the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
• The average length of uninterrupted sequences of capital letters: CAPAVE.
• The length of the longest uninterrupted sequence of capital letters: CAPMAX.
• The sum of the lengths of uninterrupted sequences of capital letters: CAPTOT.
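These 57 predictors correspond to the columns of the publicly available spambase data. As a concrete starting point for the analysis on the following slides, here is a minimal R sketch of loading a copy of the data and making a random train/test split; the use of the spam data frame from the kernlab package (predictors plus a factor type coded "nonspam"/"spam") is an assumption, not part of the original slides.

```r
## A sketch, assuming the spam data shipped with the 'kernlab' package
## (4601 messages, 57 predictors, and a factor 'type' coded
## "nonspam"/"spam"); any copy of the spambase data would do.
library(kernlab)
data(spam)

set.seed(1)
n       <- nrow(spam)            # 4601 messages
test.id <- sample(n, 1536)       # random test set of size 1536
train   <- spam[-test.id, ]      # 3065 training observations
test    <- spam[test.id, ]

mean(train$type == "spam")       # fraction of spam, roughly 0.39
```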
Details
• A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.
• A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
• This bushy tree was pruned back using cost-complexity pruning, and the tree size was chosen by 10-fold cross-validation.
• We then compute the test error and ROC curve on the test data (a code sketch follows the ROC comparisons below).

Some important features
39% of the training data were spam. The table shows the average percentage of words or characters in an email message equal to the indicated word or character; we have chosen the words and characters showing the largest difference between spam and email.

          george   you   your    hp   free   hpl
  spam      0.00  2.26   1.38  0.02   0.52  0.01
  email     1.27  1.27   0.44  0.90   0.07  0.43

               !   our     re   edu  remove
  spam      0.51  0.51   0.13  0.01    0.28
  email     0.11  0.18   0.42  0.29    0.01

[Figure: the pruned classification tree for the SPAM data. The root (labeled email, 600/1536) splits on ch$ < 0.0555; subsequent splits use remove, hp, ch!, george, CAPAVE, 1999, free, business, CAPMAX, receive, edu, and our. Each node shows its majority class (email or spam) and its misclassification count.]

ROC curve for pruned tree on SPAM data
[Figure: ROC curve (sensitivity versus specificity) for the pruned tree on the SPAM test data; overall error 8.7%.]
Overall error rate on test data: 8.7%. The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P̂(+1 | X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.
We may want specificity to be high, and suffer some spam: Specificity 95% =⇒ Sensitivity 79%.

ROC curve for TREE vs SVM on SPAM data
[Figure: ROC curves on the SPAM test data for the pruned tree (error 8.7%) and an SVM (error 6.7%).]
Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.
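A minimal R sketch of the tree analysis just described, using the rpart package listed on the Software slide at the end; the train and test objects from the earlier sketch are assumed, and the particular threshold used for the sensitivity/specificity calculation is illustrative.

```r
## Sketch of the pruned-tree fit, assuming 'train' and 'test' data
## frames with a factor response 'type' as set up earlier.
library(rpart)

## Grow a bushy tree (minimum bucket size 5); rpart performs 10-fold
## cross-validation over the cost-complexity path as it grows.
fit <- rpart(type ~ ., data = train, method = "class",
             control = rpart.control(minbucket = 5, cp = 0, xval = 10))

## Cost-complexity pruning: keep the subtree whose complexity
## parameter minimizes the cross-validated error.
cp.best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = cp.best)

## Test error, and one (sensitivity, specificity) point of the ROC
## curve obtained by moving the threshold c0 away from 0.5.
p.spam <- predict(pruned, newdata = test, type = "prob")[, "spam"]
mean(ifelse(p.spam > 0.5, "spam", "nonspam") != test$type)  # overall error

c0   <- 0.8                                         # stricter spam threshold
sens <- mean(p.spam[test$type == "spam"]    > c0)   # true spam identified
spec <- mean(p.spam[test$type == "nonspam"] <= c0)  # true email identified
```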
Toy Classification Problem
• Data X and Y, with Y taking values +1 or −1.
• Here X = (X1, X2).
[Figure: scatter plot of the two classes in the (X1, X2) plane, with the Bayes decision boundary drawn in black. Bayes error rate: 0.25.]
• The black boundary is the Bayes decision boundary, the best one can do.
• Goal: given N training pairs (Xi, Yi), produce a classifier Ĉ(X) ∈ {−1, 1}.
• Also estimate the probability of the class labels, P(Y = +1 | X).

Toy Example: No Noise
[Figure: scatter plot of the two classes in the (X1, X2) plane. Bayes error rate: 0.]
• Deterministic problem; the noise comes from the sampling distribution of X.
• Use a training sample of size 200.
• Here the Bayes error is 0%.
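The later slides use a 10-dimensional version of this nested-spheres example. A common way to simulate it (an assumption; the slides do not spell out the recipe) is to draw standard Gaussian features and label each point by whether its squared radius exceeds the chi-squared median, so that the classes are balanced and the Bayes error is 0:

```r
## One construction of the nested-spheres problem: X ~ N(0, I_p), and
## Y = +1 exactly when the squared radius exceeds the chi^2_p median.
make.spheres <- function(n, p = 10) {
  x <- matrix(rnorm(n * p), n, p)
  y <- ifelse(rowSums(x^2) > qchisq(0.5, df = p), 1, -1)
  data.frame(y = factor(y), x)
}

set.seed(2)
train10 <- make.spheres(2000)    # training sample used in later examples
test10  <- make.spheres(10000)   # large test sample for error estimates
```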
Classification Tree
[Figure: the classification tree grown on the 200 toy training points. The root (94/200) splits on x.2 < −1.06711, then on x.2 < 1.14988, with further splits on x.1 and x.2. Each node shows its majority class and misclassification count.]

Decision Boundary: Tree
[Figure: the decision boundary produced by the tree on the toy data; error rate 0.073.]
When the nested spheres are in 10 dimensions, a classification tree produces a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.

Model Averaging
Classification trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
• Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
• Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
• Random Forests (Breiman, 1999): Fancier version of bagging.
In general: Boosting ≻ Random Forests ≻ Bagging ≻ Single Tree.

Bagging
Bagging, or bootstrap aggregation, averages a given procedure over many samples to reduce its variance (a poor man's Bayes; see pp. 246).
Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x. To bag C, we draw bootstrap samples S*1, . . . , S*B, each of size N with replacement from the training data. Then
  Ĉbag(x) = Majority Vote { C(S*b, x) : b = 1, . . . , B }.
Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple structure in C (e.g. a tree) is lost.

[Figure: the original tree and five trees grown on bootstrap samples of the toy data. The bootstrap trees split on different variables and cutpoints (x.2, x.1, x.3, x.4), illustrating the instability of a single tree.]

Decision Boundary: Bagging
[Figure: the decision boundary produced by bagging on the toy data; error rate 0.032.]
Bagging averages many trees, and produces smoother decision boundaries.
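A minimal sketch of bagging by hand, following the recipe above: draw B bootstrap samples, grow a deep tree on each, and classify by majority vote. It assumes the train10/test10 nested-spheres data simulated earlier and uses rpart for the individual trees.

```r
## Hand-rolled bagging of classification trees.  Assumes 'train10' and
## 'test10' from the nested-spheres sketch (factor response 'y').
library(rpart)

bag.trees <- function(train, test, B = 200) {
  votes <- matrix(0, nrow(test), nlevels(train$y))   # vote counts per class
  for (b in 1:B) {
    idx  <- sample(nrow(train), replace = TRUE)      # bootstrap sample S*b
    tree <- rpart(y ~ ., data = train[idx, ], method = "class",
                  control = rpart.control(cp = 0, minsplit = 5, xval = 0))
    cls  <- as.integer(predict(tree, newdata = test, type = "class"))
    ind  <- cbind(seq_len(nrow(test)), cls)
    votes[ind] <- votes[ind] + 1                     # accumulate the votes
  }
  levels(train$y)[max.col(votes)]                    # majority vote
}

pred <- bag.trees(train10, test10)
mean(pred != test10$y)                               # bagged test error
```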
Random Forests
• A refinement of bagged trees; quite popular.
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or m = log2(p), where p is the number of features.
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" error rate.
• Random forests try to improve on bagging by "de-correlating" the trees. Each tree has the same expectation.

ROC curve for TREE, SVM and Random Forest on SPAM data
[Figure: ROC curves on the SPAM test data for the pruned tree (error 8.7%), the SVM (error 6.7%), and a random forest (error 5.0%).]
Random Forest dominates both other methods on the SPAM data, with 5.0% error. We used 500 trees with the default settings of the randomForest package in R.

Boosting
[Figure: schematic of boosting. Classifiers C1(x), C2(x), . . . , CM(x) are grown on the training sample and on successively re-weighted versions of it.]
• Average many trees, each grown on a re-weighted version of the training data.
• The final classifier is a weighted average of the classifiers:
  C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ].

Boosting vs Bagging
[Figure: test error versus number of terms for bagging and AdaBoost, both with 100-node trees; AdaBoost reaches a much lower test error.]
• 2000 points from nested spheres in R^10.
• Bayes error rate is 0%.
• Trees are grown best-first without pruning.
• The leftmost point corresponds to a single tree.

AdaBoost (Freund & Schapire, 1996)
1. Initialize the observation weights wi = 1/N, i = 1, 2, . . . , N.
2. For m = 1 to M repeat steps (a)–(d):
 (a) Fit a classifier Cm(x) to the training data using weights wi.
 (b) Compute the weighted error of the newest classifier:
   errm = Σ_{i=1}^{N} wi I(yi ≠ Cm(xi)) / Σ_{i=1}^{N} wi.
 (c) Compute αm = log[(1 − errm)/errm].
 (d) Update the weights for i = 1, . . . , N:
   wi ← wi · exp[αm · I(yi ≠ Cm(xi))],
  and renormalize the wi to sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^{M} αm Cm(x) ].
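A direct transcription of this algorithm into R, using rpart stumps (maxdepth = 1) as the weak learner; the nested-spheres data simulated earlier, with labels coded ±1, are assumed.

```r
## AdaBoost with stumps, a line-by-line transcription of the algorithm
## above.  Assumes data frames with a factor response 'y' whose levels
## are "-1" and "1", e.g. the nested-spheres data.
library(rpart)

adaboost.stumps <- function(train, test, M = 400) {
  N   <- nrow(train)
  w   <- rep(1 / N, N)                        # step 1: equal weights
  ytr <- as.numeric(as.character(train$y))    # labels in {-1, +1}
  Fte <- rep(0, nrow(test))                   # running weighted vote on test set
  for (m in 1:M) {
    stump <- rpart(y ~ ., data = train, weights = w, method = "class",
                   control = rpart.control(maxdepth = 1, cp = 0,
                                           minsplit = 2, xval = 0))
    pr.tr <- as.numeric(as.character(predict(stump, train, type = "class")))
    err   <- sum(w * (pr.tr != ytr)) / sum(w)          # step 2(b)
    alpha <- log((1 - err) / err)                      # step 2(c)
    w     <- w * exp(alpha * (pr.tr != ytr))           # step 2(d)
    w     <- w / sum(w)
    Fte   <- Fte + alpha *
      as.numeric(as.character(predict(stump, test, type = "class")))
  }
  sign(Fte)                                   # step 3: weighted majority vote
}

pred <- adaboost.stumps(train10, test10)
mean(pred != as.numeric(as.character(test10$y)))       # test error
```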
Boosting Stumps
A stump is a two-node tree, after a single split. Boosting stumps works remarkably well on the nested-spheres problem.
[Figure: test error versus boosting iterations; boosted stumps reach a far lower error than either a single stump or a single 400-node tree.]

Training Error
• Nested spheres in 10 dimensions.
• Bayes error is 0%.
• Boosting drives the training error to zero.
• Further iterations continue to improve the test error in many examples.
[Figure: training and test error versus number of terms.]

Noisy Problems
• Nested Gaussians in 10 dimensions.
• Bayes error is 25%.
• Boosting with stumps.
• Here the test error does increase, but quite slowly.
[Figure: training and test error versus number of terms, with the Bayes error shown as a horizontal line.]

Stagewise Additive Modeling
Boosting builds an additive model
  f(x) = Σ_{m=1}^{M} βm b(x; γm),
where b(x; γm) is a tree and γm parametrizes its splits. We do things like that in statistics all the time!
• GAMs: f(x) = Σ_j fj(xj)
• Basis expansions: f(x) = Σ_{m=1}^{M} θm hm(x)
Traditionally the parameters fm, θm are fit jointly (i.e. by least squares or maximum likelihood). With boosting, the parameters (βm, γm) are fit in a stagewise fashion. This slows the process down, and the model overfits less quickly.

Additive Trees
• Simple example: stagewise least squares.
• Fix the past M − 1 functions, and update the Mth using a tree:
  min_{fM ∈ Tree(x)} E( Y − Σ_{m=1}^{M−1} fm(x) − fM(x) )².
• If we define the current residuals to be
  R = Y − Σ_{m=1}^{M−1} fm(x),
 then at each stage we fit a tree to the residuals:
  min_{fM ∈ Tree(x)} E( R − fM(x) )².

Stagewise Least Squares
Suppose we have available a basis family b(x; γ), parametrized by γ.
• After m − 1 steps, suppose we have the model
  fm−1(x) = Σ_{j=1}^{m−1} βj b(x; γj).
• At the mth step we solve
  min_{β,γ} Σ_{i=1}^{N} ( yi − fm−1(xi) − β b(xi; γ) )².
• Denoting the residuals at the mth stage by rim = yi − fm−1(xi), the previous step amounts to
  min_{β,γ} Σ_{i=1}^{N} ( rim − β b(xi; γ) )².
• Thus the term βm b(x; γm) that best fits the current residuals is added to the expansion at each step.

AdaBoost: Stagewise Modeling
• AdaBoost builds an additive logistic regression model
  f(x) = log[ Pr(Y = 1 | x) / Pr(Y = −1 | x) ] = Σ_{m=1}^{M} αm Gm(x)
 by stagewise fitting, using the loss function L(y, f(x)) = exp(−y f(x)).
• Given the current fm−1(x), our solution for (βm, Gm) is
  arg min_{β,G} Σ_{i=1}^{N} exp[ −yi ( fm−1(xi) + β G(xi) ) ],
 where Gm(x) ∈ {−1, 1} is a tree classifier and βm is a coefficient.
• With wi^(m) = exp(−yi fm−1(xi)), this can be re-expressed as
  arg min_{β,G} Σ_{i=1}^{N} wi^(m) exp(−β yi G(xi)).
• We can show that this leads to the AdaBoost algorithm; see pp. 305.

Why Exponential Loss?
[Figure: several losses (misclassification, exponential, binomial deviance, squared error, support vector) plotted as functions of the margin y · f.]
• e^{−yF(x)} is a monotone, smooth upper bound on misclassification loss at x.
• Leads to a simple reweighting scheme.
• Has the logit transform as its population minimizer:
  f*(x) = (1/2) log[ Pr(Y = 1 | x) / Pr(Y = −1 | x) ].
• There are other, more robust loss functions, like binomial deviance.

General Stagewise Algorithm
We can do the same for more general loss functions, not only least squares.
1. Initialize f0(x) = 0.
2. For m = 1 to M:
 (a) Compute (βm, γm) = arg min_{β,γ} Σ_{i=1}^{N} L( yi, fm−1(xi) + β b(xi; γ) ).
 (b) Set fm(x) = fm−1(x) + βm b(x; γm).
Sometimes we replace step (b) by
 (b*) Set fm(x) = fm−1(x) + ν βm b(x; γm).
Here ν is a shrinkage factor, and often ν < 0.1. Shrinkage slows the stagewise model-building even more, and typically leads to better performance (a code sketch follows the Tree Size slide below).

Gradient Boosting
• A general boosting algorithm that works with a variety of different loss functions. Models include regression, resistant regression, K-class classification and risk modeling.
• Gradient Boosting builds additive tree models, for example, for representing the logits in logistic regression.
• Tree size is a parameter that determines the order of interaction (next slide).
• Gradient Boosting inherits all the good features of trees (variable selection, missing data, mixed predictors), and improves on the weak features, such as prediction performance.
• Gradient Boosting is described in detail in the text, section 10.10.

Tree Size
The tree size J determines the interaction order of the model:
  η(X) = Σ_j ηj(Xj) + Σ_{jk} ηjk(Xj, Xk) + Σ_{jkl} ηjkl(Xj, Xk, Xl) + · · ·
[Figure: test error versus number of terms for AdaBoost with stumps, 10-node trees, and 100-node trees on the nested-spheres problem; the smaller trees do best.]
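For squared-error loss, the general stagewise algorithm above is easy to write down directly: repeatedly fit a small regression tree to the current residuals and add a shrunken copy of it to the model. A minimal R sketch (the nested-spheres data, ν = 0.1, and two-level trees are illustrative assumptions):

```r
## Stagewise least-squares boosting with shrinkage, as in step (b*)
## above: at each step a small regression tree is fit to the current
## residuals.  Settings here are illustrative only.
library(rpart)

ls.boost <- function(x, y, M = 400, nu = 0.1, depth = 2) {
  f0    <- mean(y)
  fhat  <- rep(f0, length(y))                    # f_0(x): a constant
  trees <- vector("list", M)
  for (m in 1:M) {
    r <- y - fhat                                # current residuals
    trees[[m]] <- rpart(r ~ ., data = data.frame(r = r, x),
                        control = rpart.control(maxdepth = depth,
                                                cp = 0, xval = 0))
    fhat <- fhat + nu * predict(trees[[m]], x)   # shrunken update (b*)
  }
  list(f0 = f0, nu = nu, trees = trees)
}

boost.predict <- function(fit, newx) {
  f <- rep(fit$f0, nrow(newx))
  for (tr in fit$trees) f <- f + fit$nu * predict(tr, newx)
  f
}

## Example: boost the ±1 labels with squared error, classify by sign.
x   <- train10[, -1]
y   <- as.numeric(as.character(train10$y))
fit <- ls.boost(x, y)
mean(sign(boost.predict(fit, test10[, -1])) !=
       as.numeric(as.character(test10$y)))       # test error
```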
Stumps win!
Since the true decision boundary is the surface of a sphere, the function that describes it has the form
  f(X) = X1² + X2² + . . . + Xp² − c = 0.
Boosted stumps via gradient boosting return reasonable approximations to these quadratic functions.
[Figure: coordinate functions for the additive logistic trees; the fitted functions f1(x1), . . . , f10(x10) are each roughly quadratic.]

Spam Example Results
With 3000 training and 1500 test observations, gradient boosting fits an additive logistic model
  f(x) = log[ Pr(spam | x) / Pr(email | x) ]
using trees with J = 6 terminal nodes. Gradient boosting achieves a test error of 4%, compared to 5.3% for an additive GAM, 5.0% for random forests, and 8.7% for CART.

Spam: Variable Importance
[Figure: relative importance (0–100) of the predictors in the boosted model. The most important variables include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu and you; variables such as 3d, addresses, labs, telnet, 857, 415, direct, cs and table matter least.]

Spam: Partial Dependence
[Figure: partial dependence of the fitted log-odds of spam on !, remove, edu and hp. The log-odds increase with the frequency of ! and remove, and decrease with edu and hp.]

Comparison of Learning Methods
[Table: characteristics of different learning methods (neural nets, SVM, CART, GAM, KNN/kernel methods, gradient boosting), each rated good, fair, or poor on: natural handling of data of "mixed" type; handling of missing values; robustness to outliers in input space; insensitivity to monotone transformations of the inputs; computational scalability (large N); ability to deal with irrelevant inputs; ability to extract linear combinations of features; interpretability; and predictive power.]

Software
• R: free GPL statistical computing environment available from CRAN; implements the S language. Includes:
 – randomForest: implementation of Leo Breiman's algorithms.
 – rpart: Terry Therneau's implementation of classification and regression trees.
 – gbm: Greg Ridgeway's implementation of Friedman's gradient boosting algorithm.
• Salford Systems: commercial implementation of trees, random forests and gradient boosting.
• S-PLUS (Insightful): commercial version of S.
• Weka: GPL software from the University of Waikato, New Zealand. Includes trees, random forests and many other procedures.
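Tying the packages above to the SPAM results quoted earlier, a rough R sketch of the random forest and gradient boosting fits; the train/test split from the first sketch is assumed, and the gbm tuning values are illustrative rather than the ones used for the slides.

```r
## Sketches of the randomForest and gbm fits, assuming the
## 'train'/'test' split of the spam data built earlier.
library(randomForest)
library(gbm)

## Random forest: 500 trees with default settings, as on the ROC slide.
rf <- randomForest(type ~ ., data = train, ntree = 500)
mean(predict(rf, newdata = test) != test$type)     # test error (about 5% reported)

## Gradient boosting of the logit with J = 6 terminal-node trees
## (interaction.depth = 5 splits) and a small shrinkage factor.
train.gb <- data.frame(spam01 = as.numeric(train$type == "spam"),
                       train[, names(train) != "type"])
gb <- gbm(spam01 ~ ., data = train.gb, distribution = "bernoulli",
          n.trees = 2500, interaction.depth = 5, shrinkage = 0.05,
          cv.folds = 5)
best   <- gbm.perf(gb, method = "cv")              # CV choice of number of trees
p.spam <- predict(gb, newdata = test, n.trees = best, type = "response")
mean((p.spam > 0.5) != (test$type == "spam"))      # test error
```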