Classification & Regression Trees (CART)

I. Introduction and Motivation

Tree-based modelling is an exploratory technique for uncovering structure in data. Specifically, the technique is useful for classification and regression problems where one has a set of predictor or classification variables X and a single response y. When y is a factor, classification rules are determined from the data, e.g.

    If (x1 < 5.6) and (x2 ∈ {…}) then y is most likely "Died"

When the response y is numeric, regression tree rules for prediction are of the form:

    If (x1 > 100) and (x2 = true), etc. … then the predicted value of y is 5.89

Statistical inference for tree-based models is in its infancy and there is a definite lack of formal procedures for inference. However, the method is rapidly gaining widespread popularity as a means of devising prediction rules for rapid and repeated evaluation, as a screening method for variables, as a diagnostic technique to assess the adequacy of other types of models, and simply for summarizing large multivariate data sets. Some reasons for its recent popularity are that:

1. In certain applications, especially where the set of predictors contains a mix of numeric variables and factors, tree-based models are sometimes easier to interpret and discuss than linear models.

2. Tree-based models are invariant to monotone transformations of the predictor variables, so the precise form in which these appear in the model is irrelevant. As we have seen in several earlier examples, this is a particularly appealing property.

3. Tree-based models are more adept at capturing nonadditive behavior; the standard linear model does not allow interactions between variables unless they are pre-specified and of a particular multiplicative form. (MARS can capture such interactions automatically by specifying a degree = 2 fit.)

Regression Trees

The form of the fitted surface or smooth obtained from a regression tree is

    f(x) = Σ_{m=1}^{M} c_m I(x ∈ R_m)

where the c_m are constants and the R_m are regions defined by a series of binary splits. If all the predictors are numeric, these regions form a set of disjoint hyper-rectangles with sides parallel to the axes such that

    ∪_{m=1}^{M} R_m = R^p

Regardless of how the neighborhoods are defined, if we use the least squares criterion

    Σ_{i=1}^{n} (y_i − f(x_i))²

for each region, the best estimator of the response, ĉ_m, is just the average of the y_i in region R_m, i.e.

    ĉ_m = ave(y_i | x_i ∈ R_m)

Thus to obtain a regression tree we need to somehow obtain the neighborhoods R_m. This is accomplished by an algorithm called recursive partitioning; see Breiman et al. (1984). We present the basic idea below through an example for the case where the number of neighborhoods M = 2 and the number of predictor variables p = 2.

The task of determining the neighborhoods R_m is solved by determining a split coordinate j, i.e. which variable to split on, and a split point s. A split coordinate and split point define the rectangles R_1 and R_2 as

    R_1(j,s) = {x | x_j ≤ s}   and   R_2(j,s) = {x | x_j > s}

The residual sum of squares (RSS) for a split determined by (j,s) is

    RSS(j,s) = min_{c_1} Σ_{x_i ∈ R_1(j,s)} (y_i − c_1)²  +  min_{c_2} Σ_{x_i ∈ R_2(j,s)} (y_i − c_2)²

The goal at any given stage is to find the pair (j,s) such that RSS(j,s) is minimal, i.e. the overall RSS is maximally reduced.
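To make this search concrete, here is a minimal sketch (my own illustration, not a course function) of the exhaustive search for the best (j, s) pair; X is a hypothetical data frame of numeric predictors and y a numeric response:

    # Minimal sketch of the exhaustive search for the best split (j, s).
    best.split <- function(X, y) {
      best <- list(rss = Inf, j = NA, s = NA)
      for (j in 1:ncol(X)) {
        # only the (at most n-1) midpoints between sorted unique values matter
        xs <- sort(unique(X[[j]]))
        for (s in (xs[-1] + xs[-length(xs)]) / 2) {
          left <- X[[j]] <= s
          # RSS when each region is fit by its own mean
          rss <- sum((y[left] - mean(y[left]))^2) +
                 sum((y[!left] - mean(y[!left]))^2)
          if (rss < best$rss) best <- list(rss = rss, j = j, s = s)
        }
      }
      best
    }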
This may seem overwhelming; however, it only requires examining at most (n − 1) splits for each variable, because the points in a neighborhood only change when the split point s crosses an observed value. If we wish to split into three neighborhoods, i.e. split R_1(j,s) or R_2(j,s) after the first split, we have (n − 1)p possibilities for the first split and (n − 2)p possibilities for the second split, given the first split. In total we have (n − 1)(n − 2)p² operations to find the best splits for M = 3 neighborhoods. In general, for M neighborhoods we have

    (n − 1)(n − 2) ⋯ (n − M + 1) p^(M−1)

possibilities if all predictors are numeric! This gets too big for an exhaustive search, so instead we use the technique for M = 2 recursively. This is the basic idea of recursive partitioning. One starts with the first split and obtains R_1 and R_2 as explained above. This split stays fixed, and the same splitting procedure is applied recursively to the two regions R_1 and R_2. This procedure is repeated until we reach some stopping criterion, such as the nodes becoming homogeneous or containing very few observations. The rpart function uses two such stopping criteria: a node will not be split if it contains fewer than minsplit observations (default = 20), and we can specify the minimum number of observations in a terminal node via minbucket (default = ⌊minsplit/3⌋).

The figures below, from pg. 306 of Elements of Statistical Learning, show a hypothetical tree fit based on two numeric predictors X1 and X2.

Let's examine these ideas using the ozone pollution data for the Los Angeles Basin discussed earlier in the course. For simplicity we consider the case where p = 2. Here we will develop a regression tree using rpart for predicting upper ozone concentration (upoz) from the inversion base height (inbh) and the temperature at Sandburg Air Force Base (safb), the two predictors used in the call below.

> library(rpart)
> attach(Ozdata)
> oz.rpart <- rpart(upoz ~ inbh + safb)
> summary(oz.rpart)
> plot(oz.rpart)
> text(oz.rpart)
> post(oz.rpart,"Regression Tree for Upper Ozone Concentration")

Plot the fitted surface:

> x1 = seq(min(inbh),max(inbh),length=100)
> x2 = seq(min(safb),max(safb),length=100)
> x = expand.grid(inbh=x1,safb=x2)
> ypred = predict(oz.rpart,newdata=x)
> persp(x1,x2,z=matrix(ypred,100,100),theta=45,xlab="INBH",
+       ylab="SAFB",zlab="UPOZ")
> plot(oz.rpart,uniform=T,branch=1,compress=T,margin=0.05,cex=.5)
> text(oz.rpart,all=T,use.n=T,fancy=T,cex=.7)
> title(main="Regression Tree for Upper Ozone Concentration")

Example: Infant Mortality Rates for 77 Largest U.S. Cities in 2000

In this example we will examine how to build regression trees using functions in the packages rpart and tree. We will also examine use of the maptree package to plot the results.
> infmort.rpart = rpart(infmort~.,data=City,control=rpart.control(minsplit=10))
> summary(infmort.rpart)
Call: rpart(formula = infmort ~ ., data = City, minsplit = 10)
  n= 77

           CP nsplit  rel error    xerror       xstd
1  0.53569704      0 1.00000000 1.0108944 0.18722505
2  0.10310955      1 0.46430296 0.5912858 0.08956209
3  0.08865804      2 0.36119341 0.6386809 0.09834998
4  0.03838630      3 0.27253537 0.5959633 0.09376897
5  0.03645758      4 0.23414907 0.6205958 0.11162033
6  0.02532618      5 0.19769149 0.6432091 0.11543351
7  0.02242248      6 0.17236531 0.6792245 0.11551694
8  0.01968056      7 0.14994283 0.7060502 0.11773100
9  0.01322338      8 0.13026228 0.6949660 0.11671223
10 0.01040108      9 0.11703890 0.6661967 0.11526389
11 0.01019740     10 0.10663782 0.6749224 0.11583334
12 0.01000000     11 0.09644043 0.6749224 0.11583334

Node number 1: 77 observations,   complexity param=0.535697
  mean=12.03896, MSE=12.31978
  left son=2 (52 obs) right son=3 (25 obs)
  Primary splits:
      pct.black < 29.55 to the left,  improve=0.5356970, (0 missing)
      growth    < -5.55 to the right, improve=0.4818361, (0 missing)
      pct1par   < 31.25 to the left,  improve=0.4493385, (0 missing)
      precip    < 36.45 to the left,  improve=0.3765841, (0 missing)
      laborchg  < 2.85  to the right, improve=0.3481261, (0 missing)
  Surrogate splits:
      growth    < -2.6  to the right, agree=0.896, adj=0.68, (0 split)
      pct1par   < 31.25 to the left,  agree=0.896, adj=0.68, (0 split)
      laborchg  < 2.85  to the right, agree=0.857, adj=0.56, (0 split)
      poverty   < 21.5  to the left,  agree=0.844, adj=0.52, (0 split)
      income    < 24711 to the right, agree=0.818, adj=0.44, (0 split)

Node number 2: 52 observations,   complexity param=0.08865804
  mean=10.25769, MSE=4.433595
  left son=4 (34 obs) right son=5 (18 obs)
  Primary splits:
      precip    < 36.2  to the left,  improve=0.3647980, (0 missing)
      pct.black < 12.6  to the left,  improve=0.3395304, (0 missing)
      pct.hisp  < 4.15  to the right, improve=0.3325635, (0 missing)
      pct1hous  < 29.05 to the left,  improve=0.3058060, (0 missing)
      hisp.pop  < 14321 to the right, improve=0.2745090, (0 missing)
  Surrogate splits:
      pct.black < 22.3   to the left,  agree=0.865, adj=0.611, (0 split)
      pct.hisp  < 3.8    to the right, agree=0.865, adj=0.611, (0 split)
      hisp.pop  < 7790.5 to the right, agree=0.808, adj=0.444, (0 split)
      growth    < 6.95   to the right, agree=0.769, adj=0.333, (0 split)
      taxes     < 427    to the left,  agree=0.769, adj=0.333, (0 split)

Node number 3: 25 observations,   complexity param=0.1031095
  mean=15.744, MSE=8.396064
  left son=6 (20 obs) right son=7 (5 obs)
  Primary splits:
      pct1par   < 45.05    to the left,  improve=0.4659903, (0 missing)
      growth    < -5.55    to the right, improve=0.4215004, (0 missing)
      pct.black < 62.6     to the left,  improve=0.4061168, (0 missing)
      pop2      < 637364.5 to the left,  improve=0.3398599, (0 missing)
      black.pop < 321232.5 to the left,  improve=0.3398599, (0 missing)
  Surrogate splits:
      pct.black < 56.85    to the left,  agree=0.92, adj=0.6, (0 split)
      growth    < -15.6    to the right, agree=0.88, adj=0.4, (0 split)
      welfare   < 22.05    to the left,  agree=0.88, adj=0.4, (0 split)
      unemprt   < 11.3     to the left,  agree=0.88, adj=0.4, (0 split)
      black.pop < 367170.5 to the left,  agree=0.84, adj=0.2, (0 split)

Node number 4: 34 observations,   complexity param=0.0383863
  mean=9.332353, MSE=2.830424
  left son=8 (5 obs) right son=9 (29 obs)
  Primary splits:
      medrent   < 614    to the right, improve=0.3783900, (0 missing)
      black.pop < 9584.5 to the left,  improve=0.3326486, (0 missing)
      pct.black < 2.8    to the left,  improve=0.3326486, (0 missing)
      income    < 29782  to the right, improve=0.2974224, (0 missing)
      medv      < 160100 to the right, improve=0.2594415, (0 missing)
  Surrogate splits:
      area      < 48.35  to the left,  agree=0.941, adj=0.6, (0 split)
      income    < 35105  to the right, agree=0.941, adj=0.6, (0 split)
      medv      < 251800 to the right, agree=0.941, adj=0.6, (0 split)
      popdens   < 9701.5 to the right, agree=0.912, adj=0.4, (0 split)
      black.pop < 9584.5 to the left,  agree=0.912, adj=0.4, (0 split)

Node number 5: 18 observations,   complexity param=0.02242248
  mean=12.00556, MSE=2.789414
  left son=10 (7 obs) right son=11 (11 obs)
  Primary splits:
      july    < 78.95 to the right, improve=0.4236351, (0 missing)
      pctmanu < 11.65 to the right, improve=0.3333122, (0 missing)
      pct.AIP < 1.5   to the left,  improve=0.3293758, (0 missing)
      pctdeg  < 18.25 to the left,  improve=0.3078859, (0 missing)
      precip  < 49.7  to the right, improve=0.3078859, (0 missing)
  Surrogate splits:
      oldhous < 15.05 to the left,  agree=0.833, adj=0.571, (0 split)
      precip  < 46.15 to the right, agree=0.833, adj=0.571, (0 split)
      area    < 417.5 to the right, agree=0.778, adj=0.429, (0 split)
      pctdeg  < 19.4  to the left,  agree=0.778, adj=0.429, (0 split)
      unemprt < 5.7   to the right, agree=0.778, adj=0.429, (0 split)

Node number 6: 20 observations,   complexity param=0.03645758
  mean=14.755, MSE=4.250475
  left son=12 (10 obs) right son=13 (10 obs)
  Primary splits:
      growth  < -5.55 to the right, improve=0.4068310, (0 missing)
      medv    < 50050 to the right, improve=0.4023256, (0 missing)
      pct.AIP < 0.85  to the right, improve=0.4019953, (0 missing)
      pctrent < 54.3  to the right, improve=0.3764815, (0 missing)
      pctold  < 13.95 to the left,  improve=0.3670365, (0 missing)
  Surrogate splits:
      pctold    < 13.95  to the left,  agree=0.85, adj=0.7, (0 split)
      laborchg  < -1.8   to the right, agree=0.85, adj=0.7, (0 split)
      black.pop < 165806 to the left,  agree=0.80, adj=0.6, (0 split)
      pct.black < 45.25  to the left,  agree=0.80, adj=0.6, (0 split)
      pct.AIP   < 1.05   to the right, agree=0.80, adj=0.6, (0 split)

Node number 7: 5 observations
  mean=19.7, MSE=5.416

Node number 8: 5 observations
  mean=6.84, MSE=1.5784

Node number 9: 29 observations,   complexity param=0.01968056
  mean=9.762069, MSE=1.79063
  left son=18 (3 obs) right son=19 (26 obs)
  Primary splits:
      laborchg < 55.9  to the right, improve=0.3595234, (0 missing)
      growth   < 61.7  to the right, improve=0.3439875, (0 missing)
      taxes    < 281.5 to the left,  improve=0.3185654, (0 missing)
      july     < 82.15 to the right, improve=0.2644400, (0 missing)
      pct.hisp < 5.8   to the right, improve=0.2537809, (0 missing)
  Surrogate splits:
      growth  < 61.7     to the right, agree=0.966, adj=0.667, (0 split)
      welfare < 3.85     to the left,  agree=0.966, adj=0.667, (0 split)
      pct1par < 17.95    to the left,  agree=0.966, adj=0.667, (0 split)
      ptrans  < 1        to the left,  agree=0.966, adj=0.667, (0 split)
      pop2    < 167632.5 to the left,  agree=0.931, adj=0.333, (0 split)

Node number 10: 7 observations
  mean=10.64286, MSE=2.21102

Node number 11: 11 observations,   complexity param=0.0101974
  mean=12.87273, MSE=1.223802
  left son=22 (6 obs) right son=23 (5 obs)
  Primary splits:
      pctmanu   < 11.95    to the right, improve=0.7185868, (0 missing)
      july      < 72       to the left,  improve=0.5226933, (0 missing)
      black.pop < 54663.5  to the left,  improve=0.5125632, (0 missing)
      pop2      < 396053.5 to the left,  improve=0.3858185, (0 missing)
      pctenr    < 88.65    to the right, improve=0.3858185, (0 missing)
  Surrogate splits:
      popdens   < 6395.5  to the left, agree=0.818, adj=0.6, (0 split)
      black.pop < 54663.5 to the left, agree=0.818, adj=0.6, (0 split)
      taxes     < 591     to the left, agree=0.818, adj=0.6, (0 split)
      welfare   < 8.15    to the left, agree=0.818, adj=0.6, (0 split)
      poverty   < 15.85   to the left, agree=0.818, adj=0.6, (0 split)

Node number 12: 10 observations,   complexity param=0.01322338
  mean=13.44, MSE=1.8084
  left son=24 (5 obs) right son=25 (5 obs)
  Primary splits:
      pct1hous < 30.1 to the right, improve=0.6936518, (0 missing)
      precip   < 41.9 to the left,  improve=0.6936518, (0 missing)
      laborchg < 1.05 to the left,  improve=0.6902813, (0 missing)
      welfare  < 13.2 to the right, improve=0.6619479, (0 missing)
      ptrans   < 5.25 to the right, improve=0.6087241, (0 missing)
  Surrogate splits:
      precip    < 41.9     to the left,  agree=1.0, adj=1.0, (0 split)
      pop2      < 327405.5 to the right, agree=0.9, adj=0.8, (0 split)
      black.pop < 127297.5 to the right, agree=0.9, adj=0.8, (0 split)
      pctold    < 11.75    to the right, agree=0.9, adj=0.8, (0 split)
      welfare   < 13.2     to the right, agree=0.9, adj=0.8, (0 split)

Node number 13: 10 observations,   complexity param=0.02532618
  mean=16.07, MSE=3.2341
  left son=26 (5 obs) right son=27 (5 obs)
  Primary splits:
      pctrent  < 52.9  to the right, improve=0.7428651, (0 missing)
      pct1hous < 32    to the right, improve=0.5460870, (0 missing)
      pct1par  < 39.55 to the right, improve=0.4378652, (0 missing)
      pct.hisp < 0.8   to the right, improve=0.4277646, (0 missing)
      pct.AIP  < 0.85  to the right, improve=0.4277646, (0 missing)
  Surrogate splits:
      area     < 62   to the left,  agree=0.8, adj=0.6, (0 split)
      pct.hisp < 0.8  to the right, agree=0.8, adj=0.6, (0 split)
      pct.AIP  < 0.85 to the right, agree=0.8, adj=0.6, (0 split)
      pctdeg   < 18.5 to the right, agree=0.8, adj=0.6, (0 split)
      taxes    < 560  to the right, agree=0.8, adj=0.6, (0 split)

Node number 18: 3 observations
  mean=7.4, MSE=1.886667

Node number 19: 26 observations,   complexity param=0.01040108
  mean=10.03462, MSE=1.061494
  left son=38 (14 obs) right son=39 (12 obs)
  Primary splits:
      pct.hisp < 20.35   to the right, improve=0.3575042, (0 missing)
      hisp.pop < 55739.5 to the right, improve=0.3013295, (0 missing)
      pctold   < 11.55   to the left,  improve=0.3007143, (0 missing)
      pctrent  < 41      to the right, improve=0.2742615, (0 missing)
      taxes    < 375.5   to the left,  improve=0.2577731, (0 missing)
  Surrogate splits:
      hisp.pop < 39157.5 to the right, agree=0.885, adj=0.75, (0 split)
      pctold   < 11.15   to the left,  agree=0.769, adj=0.50, (0 split)
      pct1par  < 23      to the right, agree=0.769, adj=0.50, (0 split)
      pctrent  < 41.85   to the right, agree=0.769, adj=0.50, (0 split)
      precip   < 15.1    to the left,  agree=0.769, adj=0.50, (0 split)

Node number 22: 6 observations
  mean=12.01667, MSE=0.4013889

Node number 23: 5 observations
  mean=13.9, MSE=0.276

Node number 24: 5 observations
  mean=12.32, MSE=0.6376

Node number 25: 5 observations
  mean=14.56, MSE=0.4704

Node number 26: 5 observations
  mean=14.52, MSE=0.9496

Node number 27: 5 observations
  mean=17.62, MSE=0.7136

Node number 38: 14 observations
  mean=9.464286, MSE=0.7994388

Node number 39: 12 observations
  mean=10.7, MSE=0.545

> plot(infmort.rpart)
> text(infmort.rpart)
> path.rpart(infmort.rpart)   ← click on the 3 leftmost terminal nodes, right-click to stop

 node number: 8
   root
   pct.black< 29.55
   precip< 36.2
   medrent>=614

 node number: 18
   root
   pct.black< 29.55
   precip< 36.2
   medrent< 614
   laborchg>=55.9

 node number: 38
   root
   pct.black< 29.55
   precip< 36.2
   medrent< 614
   laborchg< 55.9
   pct.hisp>=20.35

The package maptree has a function draw.tree() that plots trees slightly differently.
> draw.tree(infmort.rpart)

Examining cross-validation results

> printcp(infmort.rpart)

Regression tree:
rpart(formula = infmort ~ ., data = City, minsplit = 10)

Variables actually used in tree construction:
 [1] growth    july      laborchg  medrent   pct.black pct.hisp
 [7] pct1hous  pct1par   pctmanu   pctrent   precip

Root node error: 948.62/77 = 12.32

n= 77

         CP nsplit rel error  xerror     xstd
1  0.535697      0   1.00000 1.01089 0.187225
2  0.103110      1   0.46430 0.59129 0.089562
3  0.088658      2   0.36119 0.63868 0.098350
4  0.038386      3   0.27254 0.59596 0.093769
5  0.036458      4   0.23415 0.62060 0.111620
6  0.025326      5   0.19769 0.64321 0.115434
7  0.022422      6   0.17237 0.67922 0.115517
8  0.019681      7   0.14994 0.70605 0.117731
9  0.013223      8   0.13026 0.69497 0.116712
10 0.010401      9   0.11704 0.66620 0.115264
11 0.010197     10   0.10664 0.67492 0.115833
12 0.010000     11   0.09644 0.67492 0.115833

> plotcp(infmort.rpart)

The 1-SE rule for choosing a tree size:
1. Find the smallest xerror and add the corresponding xstd to it.
2. Choose the first tree size that has an xerror smaller than the result from step 1.

A very small tree (2 splits or 3 terminal nodes) is suggested by cross-validation, but the larger trees cross-validate reasonably well, so we might choose a larger tree simply because it is more interesting from a practical standpoint.
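The 1-SE rule is easy to automate from the fit's cptable; here is a minimal sketch (my own helper, not a course function):

    # Minimal sketch of the 1-SE rule applied to an rpart cptable.
    cp.1se <- function(fit) {
      tab <- fit$cptable
      thresh <- min(tab[, "xerror"]) + tab[which.min(tab[, "xerror"]), "xstd"]
      # first (smallest) tree whose xerror falls below min(xerror) + 1 SE
      tab[which(tab[, "xerror"] < thresh)[1], "CP"]
    }
    # prune with the selected complexity parameter, e.g.
    # prune(infmort.rpart, cp = cp.1se(infmort.rpart))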
> plot(City$infmort,predict(infmort.rpart))
> row.names(City)
 [1] "New.York.NY" [6] "San.Diego.CA" [11] "San.Jose.CA" [16] "Columbus.OH" [21] "El.Paso.TX"
[26] "New.Orleans.LA" [31] "Long.Beach.CA" [36] "Albuquerque.NM" [41] "Tulsa.OK" [46] "Cincinnati.OH"
[51] "Wichita.KS" [56] "Tampa.FL" [61] "Newark.NJ" [66] "Aurora.CO" "Lexington.Fayette.KY"
[71] "Jersey.City.NJ" [76] "Richmond.VA" "Los.Angeles.CA" "Dallas.TX" "Indianapolis.IN" "Milwaukee.WI"
"Seattle.WA" "Denver.CO" "Kansas.City.MO" "Atlanta.GA" "Oakland.CA" "Minneapolis.MN" "Mesa.AZ"
"Arlington.TX" "Corpus.Christi.TX" "Riverside.CA" "Chicago.IL" "Phoenix.AZ" "San.Francisco.CA"
"Memphis.TN" "Cleveland.OH" "Fort.Worth.TX" "Virginia.Beach.VA" "St.Louis.MO" "Honolulu.CDP.HI"
"Omaha.NE" "Colorado.Springs.CO" "Anaheim.CA" "Birmingham.AL" "St.Petersburg.FL" "Houston.TX"
"Detroit.MI" "Baltimore.MD" "Washington.DC" "Nashville.Davidson.TN" "Oklahoma.City.OK" "Charlotte.NC"
"Sacramento.CA" "Miami.FL" "Toledo.OH" "Las.Vegas.NV" "Louisville.KY" "Norfolk.VA" "Rochester.NY"
"Philadelphia.PA" "San.Antonio.TX" "Jacksonville.FL" "Boston.MA" "Austin.TX" "Portland.OR"
"Tucson.AZ" "Fresno.CA" "Pittsburgh.PA" "Buffalo.NY" "Santa.Ana.CA" "St.Paul.MN" "Anchorage.AK"
"Baton.Rouge.LA" "Mobile.AL" "Akron.OH" "Raleigh.NC" "Stockton.CA"

> identify(City$infmort,predict(infmort.rpart),labels=row.names(City))
[1] 14 19 37 44 61 63          ← identify some interesting points
> abline(0,1)                  ← adds the line ŷ = y to the plot
> post(infmort.rpart)          ← creates a postscript version of the tree

You will need to download a postscript viewer add-on for Adobe Reader to open these. Google "Postscript Viewer" and grab the one off of cnet (http://download.cnet.com/Postscript-Viewer/3000-2094_4-10845650.html).

Postscript version of the infant mortality regression tree.

Using the draw.tree function from the maptree package we can produce the following display of the full infant mortality regression tree.

> draw.tree(infmort.rpart)

Another function in the maptree library is the group.tree command, which labels the observations according to the terminal nodes they fall in. This can be particularly interesting when the observations have meaningful labels or are spatially distributed.

> infmort.groups = group.tree(infmort.rpart)
> infmort.groups

Here is a little function to display groups of observations in a data set given the group identifier.

groups = function(g,dframe) {
  ng <- length(unique(g))
  for (i in 1:ng) {
    cat(paste("GROUP ",i))
    cat("\n")
    cat("=========================================================\n")
    cat(row.names(dframe)[g==i])
    cat("\n\n")
  }
  cat(" \n\n")
}

> groups(infmort.groups,City)

GROUP 1
=========================================================
San.Jose.CA San.Francisco.CA Honolulu.CDP.HI Santa.Ana.CA Anaheim.CA

GROUP 2
=========================================================
Mesa.AZ Las.Vegas.NV Arlington.TX

GROUP 3
=========================================================
Los.Angeles.CA San.Diego.CA Dallas.TX San.Antonio.TX El.Paso.TX Austin.TX
Denver.CO Long.Beach.CA Tucson.AZ Albuquerque.NM Fresno.CA Corpus.Christi.TX
Riverside.CA Stockton.CA

GROUP 4
=========================================================
Phoenix.AZ Fort.Worth.TX Oklahoma.City.OK Sacramento.CA Minneapolis.MN Omaha.NE
Toledo.OH Wichita.KS Colorado.Springs.CO St.Paul.MN Anchorage.AK Aurora.CO

GROUP 5
=========================================================
Houston.TX Jacksonville.FL Nashville.Davidson.TN Tulsa.OK Miami.FL Tampa.FL
St.Petersburg.FL

GROUP 6
=========================================================
Indianapolis.IN Seattle.WA Portland.OR Lexington.Fayette.KY Akron.OH Raleigh.NC

GROUP 7
=========================================================
New.York.NY Columbus.OH Boston.MA Virginia.Beach.VA Pittsburgh.PA

GROUP 8
=========================================================
Milwaukee.WI Kansas.City.MO Oakland.CA Cincinnati.OH Rochester.NY

GROUP 9
=========================================================
Charlotte.NC Norfolk.VA Jersey.City.NJ Baton.Rouge.LA Mobile.AL

GROUP 10
=========================================================
Chicago.IL New.Orleans.LA St.Louis.MO Buffalo.NY Richmond.VA

GROUP 11
=========================================================
Philadelphia.PA Memphis.TN Cleveland.OH Louisville.KY Birmingham.AL

GROUP 12
=========================================================
Detroit.MI Baltimore.MD Washington.DC Atlanta.GA Newark.NJ

The groups of cities certainly make sense intuitively.

What's next?
(1) More examples
(2) Recent advances in tree-based regression models, namely bagging and random forests.

Example 2: Predicting/Modeling CPU Performance

> head(cpus)
            name syct mmin  mmax cach chmin chmax perf estperf
1  ADVISOR 32/60  125  256  6000  256    16   128  198     199
2  AMDAHL 470V/7   29 8000 32000   32     8    32  269     253
3  AMDAHL 470/7A   29 8000 32000   32     8    32  220     253
4 AMDAHL 470V/7B   29 8000 32000   32     8    32  172     253
5 AMDAHL 470V/7C   29 8000 16000   32     8    16  132     132
6  AMDAHL 470V/8   26 8000 32000   64     8    32  318     290

> Performance = cpus$perf
> Statplot(Performance)
> Statplot(log(Performance))
> cpus.tree = rpart(log(Performance)~.,data=cpus[,2:7],cp=.001)

By default rpart() uses a complexity penalty of cp = .01, which will prune off more terminal nodes than we might want to consider initially. I will generally use a smaller value of cp (e.g. .001) to grow a tree that is larger but will likely overfit the data. Also, if you really want a large tree you can use the arguments below when calling rpart:

    control=rpart.control(minsplit=##,minbucket=##)
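For example, a hypothetical call forcing a still larger tree might look like the following (the minsplit and minbucket values here are arbitrary, for illustration only):

    cpus.big = rpart(log(Performance)~.,data=cpus[,2:7],
                     control=rpart.control(cp=.001,minsplit=5,minbucket=2))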
> printcp(cpus.tree)

Regression tree:
rpart(formula = log(Performance) ~ ., data = cpus[, 2:7], cp = 0.001)

Variables actually used in tree construction:
[1] cach  chmax chmin mmax  syct

Root node error: 228.59/209 = 1.0938

n= 209

          CP nsplit rel error  xerror     xstd
1  0.5492697      0   1.00000 1.02344 0.098997
2  0.0893390      1   0.45073 0.48514 0.049317
3  0.0876332      2   0.36139 0.43673 0.043209
4  0.0328159      3   0.27376 0.33004 0.033541
5  0.0269220      4   0.24094 0.34662 0.034437
6  0.0185561      5   0.21402 0.32769 0.034732
7  0.0167992      6   0.19546 0.31008 0.031878
8  0.0157908      7   0.17866 0.29809 0.030863
9  0.0094604      9   0.14708 0.27080 0.028558
10 0.0054766     10   0.13762 0.24297 0.026055   ← within 1 SE (xstd) of the minimum
11 0.0052307     11   0.13215 0.24232 0.026039
12 0.0043985     12   0.12692 0.23530 0.025449
13 0.0022883     13   0.12252 0.23783 0.025427
14 0.0022704     14   0.12023 0.23683 0.025407
15 0.0014131     15   0.11796 0.23703 0.025453
16 0.0010000     16   0.11655 0.23455 0.025286

> plotcp(cpus.tree)

[plotcp display: X-val relative error vs. cp, with the corresponding tree sizes (1-17) along the top axis]

> plot(cpus.tree,uniform=T)
> text(cpus.tree,cex=.7)

Prune the tree back to a 10 split, 11 terminal node tree using cp = .0055.

> cpus.tree = rpart(log(Performance)~.,data=cpus[,2:7],cp=.0055)
> post(cpus.tree)

> plot(log(Performance),predict(cpus.tree),ylab="Fitted Values",
+      main="Fitted Values vs. log(Performance)")
> abline(0,1)
> plot(predict(cpus.tree),resid(cpus.tree),xlab="Fitted Values",
+      ylab="Residuals",main="Residuals vs. Fitted (cpus.tree)")
> abline(h=0)

Bagging for Regression Trees

Suppose we are interested in predicting a numeric response variable

    Y_x = y|x = the response value for a given set of predictor values x,

and let

    f(x) = the prediction from a regression model based on x.

For example, f(x) might come from an OLS model selected using Mallows' C_p with all potential predictors, or from a CART model using x with a complexity parameter cp = .005.

Letting μ_f denote E(f(x)), where the expectation is with respect to the distribution underlying the training sample (since, viewed as a random variable, f(x) is a function of the training sample, which can be viewed as a high-dimensional random variable) and not x (which is considered fixed), we have that:

    MSE(prediction) = E[(Y_x − f(x))²]
                    = E[(Y_x − μ_f + μ_f − f(x))²]
                    = E[(Y_x − μ_f)²] + E[(f(x) − μ_f)²]
                    = E[(Y_x − μ_f)²] + Var(f(x))
                    ≥ E[(Y_x − μ_f)²]

Thus in theory, if our prediction could be based on μ_f instead of f(x), we would have a smaller mean squared error for prediction (and a smaller RMSEP as well).

How can we approximate μ_f = E(f(x))? We could take a large number of samples of size n from the population, fit the specified model to each, and then average across these samples to get an average model μ_f — more specifically, the average prediction from the different models for a given set of predictor values x. Of course, this is silly, as we generally take only one sample of size n when conducting a study. However, we can approximate this via the bootstrap. The bootstrap, remember, involves taking B random samples of size n drawn with replacement from our original sample. For each bootstrap sample b we obtain an estimated model f*_b(x) and average these to obtain an estimate of μ_f, i.e.

    f*(x) = (1/B) Σ_{b=1}^{B} f*_b(x)

This estimator for Y_x = y|x should in theory be better than the one obtained from the training data.
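A minimal sketch of this bootstrap averaging with rpart (my own illustration; the data frame dat, formula form, and prediction set newx are placeholders):

    library(rpart)

    # Minimal sketch of bagging regression trees "by hand".
    # dat contains the response; newx holds the x's at which to predict.
    bag.rpart <- function(form, dat, newx, B = 100, cp = 0.01) {
      preds <- matrix(0, nrow(newx), B)
      for (b in 1:B) {
        boot <- dat[sample(nrow(dat), replace = TRUE), ]   # bootstrap sample
        fit.b <- rpart(form, data = boot, cp = cp)         # f*_b
        preds[, b] <- predict(fit.b, newdata = newx)
      }
      rowMeans(preds)   # average of the B bootstrap predictions
    }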
This process of averaging the predicted values for a given x across bootstrap fits is called bagging. Bagging works best when the fitted models vary substantially from one bootstrap sample to the next. Modeling schemes that are complicated and involve the effective estimation of a large number of parameters will benefit from bagging the most. Projection pursuit, CART, and MARS are examples of algorithms where this is likely to be the case.

Example 3: Bagging with the Upper Ozone Concentration

> Statplot(Ozdata$upoz)
> tupoz = Ozdata$upoz^.333
> Statplot(tupoz)
> names(Ozdata)
 [1] "day"  "v500" "wind" "hum"  "safb" "inbh" "dagg" "inbt" "vis"  "upoz"
> Ozdata2 = data.frame(tupoz,Ozdata[,-10])
> oz.rpart = rpart(tupoz~.,data=Ozdata2,cp=.001)
> plotcp(oz.rpart)

The complexity parameter cp = .007 looks like a good choice.

> oz.rpart2 = rpart(tupoz~.,data=Ozdata2,cp=.007)
> post(oz.rpart2)
> badRMSE = sqrt(sum(resid(oz.rpart2)^2)/330)
> badRMSE
[1] 0.2333304
> results = rpart.cv(oz.rpart2,data=Ozdata2,cp=.007)
> mean(results)
[1] 0.3279498

MCCV code for RPART

rpart.cv = function(fit,data,p=.667,B=100,cp=.01) {
  n <- dim(data)[1]
  cv <- rep(0,B)
  y = fit$y
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- rpart(formula(fit),data=data[-sam,],cp=cp)
    ynew <- predict(fit2,newdata=data[sam,])
    cv[i] <- sqrt(sum((y[sam]-ynew)^2)/ss)
  }
  cv
}

We now consider using bagging to improve the prediction using CART.

> oz.bag = bagging(tupoz~.,data=Ozdata2,cp=.007,nbagg=10,coob=T)
> print(oz.bag)
Bagging regression trees with 10 bootstrap replications
Call: bagging.data.frame(formula = tupoz ~ ., data = Ozdata2, cp = 0.007, nbagg = 10, coob = T)
Out-of-bag estimate of root mean squared error: 0.2745

> oz.bag = bagging(tupoz~.,data=Ozdata2,cp=.007,nbagg=25,coob=T)
> print(oz.bag)
Bagging regression trees with 25 bootstrap replications
Call: bagging.data.frame(formula = tupoz ~ ., data = Ozdata2, cp = 0.007, nbagg = 25, coob = T)
Out-of-bag estimate of root mean squared error: 0.2685

> oz.bag = bagging(tupoz~.,data=Ozdata2,cp=.007,nbagg=100,coob=T)
> print(oz.bag)
Bagging regression trees with 100 bootstrap replications
Call: bagging.data.frame(formula = tupoz ~ ., data = Ozdata2, cp = 0.007, nbagg = 100, coob = T)
Out-of-bag estimate of root mean squared error: 0.2614

MCCV code for Bagging

bag.cv = function(fit,data,p=.667,B=100,cp=.01) {
  n <- dim(data)[1]
  y = fit$y
  cv <- rep(0,B)
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- bagging(formula(fit),data=data[-sam,],
                    control=rpart.control(cp=cp))
    ynew <- predict(fit2,newdata=data[sam,])
    cv[i] <- sqrt(sum((y[sam]-ynew)^2)/ss)
  }
  cv
}

> results = bag.cv(oz.rpart2,data=Ozdata2,cp=.007)
> mean(results)
[1] 0.2869069

The RMSEP estimate is lower for the bagged estimator.

Random Forests

Random forests are another bootstrap-based method for building trees. In addition to using bootstrap samples to build multiple trees that are then averaged to obtain predictions, random forests also inject randomness into the tree-building process for a given bootstrap sample: at each split, only a random subset of the predictors is considered. The algorithm for random forests from Elements of Statistical Learning is presented below; since that algorithm box does not reproduce in these notes, a rough sketch of the core idea in R follows the list of advantages.

The advantages of random forests are:
•  They handle a very large number of input variables (e.g. QSAR data).
•  They estimate the importance of input variables in the model.
•  They return proximities between cases, which can be useful for clustering, detecting outliers, and visualizing the data (unsupervised learning).
•  Learning is faster than using the full set of potential predictors.
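The following is a rough, simplified sketch of my own (not the ESL algorithm verbatim): each tree sees a bootstrap sample and a random subset of m predictors. A true random forest re-draws the predictor subset at every split, which rpart cannot do, so drawing once per tree is only an approximation of the idea.

    library(rpart)

    # Rough sketch of the random forest idea: each tree is grown on a
    # bootstrap sample using a random subset of m predictors.
    # (True random forests re-draw the subset at EVERY split.)
    toy.forest <- function(form, dat, newx, B = 100, m = 3) {
      yname <- all.vars(form)[1]
      xvars <- setdiff(names(dat), yname)
      preds <- matrix(0, nrow(newx), B)
      for (b in 1:B) {
        vars <- sample(xvars, m)                 # random feature subset
        boot <- dat[sample(nrow(dat), replace = TRUE), c(yname, vars)]
        fit.b <- rpart(reformulate(vars, response = yname), data = boot)
        preds[, b] <- predict(fit.b, newdata = newx)
      }
      rowMeans(preds)                            # average the trees
    }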
Variable Importance

To measure variable importance, do the following. For each bootstrap sample we first compute the Out-of-Bag (OOB) error rate, PE_b(OOB). Next we randomly permute the OOB values on the j-th variable X_j while leaving the data on all other variables unchanged. If X_j is important, permuting its values will reduce our ability to predict the response successfully for the OOB observations. We then make predictions using the permuted X_j values (all other predictors unchanged) to obtain PE_b(OOB_j), which should be larger than the error rate on the unaltered data. The raw score for X_j from the b-th tree is the difference between these two OOB error rates,

    raw_b(j) = PE_b(OOB_j) − PE_b(OOB),   b = 1, …, B.

Finally, we average the raw scores over all B trees in the forest,

    imp(j) = (1/B) Σ_{b=1}^{B} raw_b(j)

to obtain an overall measure of the importance of X_j. This measure is called the raw permutation accuracy importance score for the j-th variable. Assuming the B raw scores are independent from tree to tree, we can compute a straightforward estimate of the standard error by computing the standard deviation of the raw_b(j) values. Dividing the average raw importance score by this standard error gives what is called the mean decrease in accuracy for the j-th variable.
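randomForest does this OOB bookkeeping internally; the permutation idea itself is easy to sketch for a single fitted model and a holdout set (the function and argument names here are hypothetical, not part of any package):

    # Sketch of permutation importance: permute one column of a validation
    # set, re-predict, and record the increase in mean squared error.
    perm.imp <- function(fit, valid, yname) {
      y <- valid[[yname]]
      base <- mean((y - predict(fit, newdata = valid))^2)   # unpermuted error
      sapply(setdiff(names(valid), yname), function(v) {
        shuffled <- valid
        shuffled[[v]] <- sample(shuffled[[v]])              # break the x-y link
        mean((y - predict(fit, newdata = shuffled))^2) - base
      })
    }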
Example 1: L.A. Ozone Levels

> oz.rf = randomForest(tupoz~.,data=Ozdata2,importance=T)
> oz.rf
Call:
 randomForest(formula = tupoz ~ ., data = Ozdata2, importance = T)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 0.05886756
                    % Var explained: 78.11

> plot(tupoz,predict(oz.rf),xlab="Y",ylab="Fitted Values (Y-hat)")

Short function to display the mean decrease in accuracy from a random forest fit (a horiz argument is included so the later calls rfimp(..., horiz=F) work):

rfimp = function(rffit,horiz=T) {
  barplot(sort(rffit$importance[,1]),horiz=horiz,
          xlab="Mean Decrease in Accuracy",main="Variable Importance")
}

The temperature variables, inversion base temperature and Sandburg AFB temperature, are clearly the most important predictors in the random forest.

The randomForest command with options is shown below. It should be noted that x and y can be replaced by the usual formula-building nomenclature, namely

    randomForest(y ~ . , data=Mydata, etc…)

From the R html help file, some important options (highlighted in the generic function call above) are summarized below:
•  ntree = number of bootstrap trees in the forest
•  mtry = number of variables to randomly pick at each stage (m in the notes)
•  maxnodes = maximum number of terminal nodes to use in the bootstrap trees
•  importance = if T, variable importances will be computed
•  proximity = should proximity measures among rows be computed?
•  do.trace = report the OOB MSE and OOB R-square for each of the ntree trees

Estimating the RMSEP for random forest models can be done using the errorest function in the ipred package (i.e. the bagging one) as shown below. It is rather slow, so doing more than 10 replicates is not advisable. By default, each replicate does a 10-fold cross-validation B = 25 times per RMSEP estimate. It can be used with a variety of modeling methods; see the errorest help file for examples.

> error.RF = numeric(10)
> for (i in 1:10) error.RF[i] = errorest(tupoz~.,data=Ozdata2,model=randomForest)$error   ← this is one line
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2395  0.2424  0.2444  0.2448  0.2466  0.2506

Example 2: Predicting San Francisco List Prices of Single-Family Homes

Using www.redfin.com it is easy to create interesting data sets regarding the list price of homes and different characteristics of the home. The map below shows all the single-family homes listed in San Francisco just south of downtown and the Golden Gate Bridge. Once you drill down to this level of detail you can download the prices and home characteristics in an Excel file. From there it is easy to load the data into JMP, edit it, and then read it into R.

> names(SFhomes)
[1] "ListPrice" "BEDS"      "BATHS"     "SQFT"      "YrBuilt"   "ParkSpots" "Garage"    "LATITUDE"  "LONGITUDE"

> str(SFhomes)
'data.frame':   263 obs. of  9 variables:
 $ ListPrice: int  749000 499900 579000 1295000 688000 224500 378000 140000 530000 399000 ...
 $ BEDS     : int  2 3 4 4 3 2 5 1 2 3 ...
 $ BATHS    : num  1 1 1 3.25 2 1 2 1 1 2 ...
 $ SQFT     : int  1150 1341 1429 2628 1889 995 1400 772 1240 1702 ...
 $ YrBuilt  : int  1931 1927 1937 1937 1939 1944 1923 1915 1925 1908 ...
 $ ParkSpots: int  1 1 1 3 1 1 1 1 1 1 ...
 $ Garage   : Factor w/ 2 levels "Garage","No": 1 1 1 1 1 1 1 2 1 2 ...
 $ LATITUDE : num  37.8 37.7 37.7 37.8 37.8 ...
 $ LONGITUDE: num  -122 -122 -122 -122 -122 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:66] 7 11 23 30 32 34 36 43 55 62 ...
  .. ..- attr(*, "names")= chr [1:66] "7" "11" "23" "30" ...

Note: I omitted missing values by using the command na.omit.

> SFhomes = na.omit(SFhomes)
> head(SFhomes)
  ListPrice BEDS BATHS SQFT YrBuilt ParkSpots Garage LATITUDE LONGITUDE
1    749000    2  1.00 1150    1931         1 Garage 37.76329 -122.4029
2    499900    3  1.00 1341    1927         1 Garage 37.72211 -122.4037
3    579000    4  1.00 1429    1937         1 Garage 37.72326 -122.4606
4   1295000    4  3.25 2628    1937         3 Garage 37.77747 -122.4502
5    688000    3  2.00 1889    1939         1 Garage 37.75287 -122.4857
6    224500    2  1.00  995    1944         1 Garage 37.72778 -122.3838

We will now develop a random forest model for the list price of the home as a function of the number of bedrooms, number of bathrooms, square footage, year built, number of parking spots, garage (Garage or No), latitude, and longitude.

> sf.rf = randomForest(ListPrice~.,data=SFhomes,importance=T)
> rfimp(sf.rf,horiz=F)
> plot(SFhomes$ListPrice,predict(sf.rf),xlab="Y",
+      ylab="Y-hat",main="Predict vs. Actual List Price")
> abline(0,1)
> plot(predict(sf.rf),SFhomes$ListPrice - predict(sf.rf),
+      xlab="Fitted Values",ylab="Residuals")
> abline(h=0)

Not good: severe heteroscedasticity and outliers.

> Statplot(SFhomes$ListPrice)
> logList = log(SFhomes$ListPrice)
> Statplot(logList)
> SFhomes.log = data.frame(logList,SFhomes[,-1])
> names(SFhomes.log)
[1] "logList"   "BEDS"      "BATHS"     "SQFT"      "YrBuilt"   "ParkSpots" "Garage"    "LATITUDE"  "LONGITUDE"

We now resume the model-building process using the log of the list price as the response. To develop models we consider different choices for m, the number of predictors chosen at random for each split.

> attributes(sf.rf)
$names
 [1] "call"            "type"            "predicted"       "mse"             "rsq"             "oob.times"
 [7] "importance"      "importanceSD"    "localImportance" "proximity"       "ntree"           "mtry"
[13] "forest"          "coefs"           "y"               "test"            "inbag"           "terms"

$class
[1] "randomForest.formula" "randomForest"

> sf.rf$mtry
[1] 2

> myforest = function(formula,data) {randomForest(formula,data,mtry=2)}
> error.RF = numeric(10)
> for (i in 1:10) error.RF[i] = errorest(logList~.,data=SFhomes.log,model=myforest)$error
> mean(error.RF)
[1] 0.2600688
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2544  0.2577  0.2603  0.2601  0.2606  0.2664
Now consider using different values for m.

m=3
> myforest = function(formula,data) {randomForest(formula,data,mtry=3)}
> for (i in 1:10) error.RF[i] =
+     errorest(logList~.,data=SFhomes.log,model=myforest)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2441  0.2465  0.2479  0.2486  0.2490  0.2590

m=4
> myforest = function(formula,data) {randomForest(formula,data,mtry=4)}
> for (i in 1:10) error.RF[i] =
+     errorest(logList~.,data=SFhomes.log,model=myforest)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2359  0.2419  0.2425  0.2422  0.2445  0.2477

m=5 *
> myforest = function(formula,data) {randomForest(formula,data,mtry=5)}
> for (i in 1:10) error.RF[i] =
+     errorest(logList~.,data=SFhomes.log,model=myforest)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2374  0.2391  0.2403  0.2418  0.2433  0.2500

m=6
> myforest = function(formula,data) {randomForest(formula,data,mtry=6)}
> for (i in 1:10) error.RF[i] =
+     errorest(logList~.,data=SFhomes.log,model=myforest)$error
> summary(error.RF)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.2362  0.2403  0.2430  0.2439  0.2462  0.2548

> sf.final = randomForest(logList~.,data=SFhomes.log,importance=T,mtry=5)
> rfimp(sf.final,horiz=F)
> plot(logList,predict(sf.final),xlab="log(List Price)",
+      ylab="Predicted log(List Price)")

> plot(predict(sf.final),logList - predict(sf.final),
+      xlab="Fitted Values",ylab="Residuals")
> abline(h=0)

As with earlier methods, it is not hard to write our own MCCV function.

rf.cv = function(fit,data,p=.667,B=100,mtry=fit$mtry,ntree=fit$ntree) {
  n <- dim(data)[1]
  cv <- rep(0,B)
  y = fit$y
  for (i in 1:B) {
    ss <- floor(n*p)
    sam <- sample(1:n,ss,replace=F)
    fit2 <- randomForest(formula(fit),data=data[-sam,],mtry=mtry,ntree=ntree)
    ynew <- predict(fit2,newdata=data[sam,])
    cv[i] <- sqrt(sum((y[sam]-ynew)^2)/ss)
  }
  cv
}

This function also allows you to experiment with different values for mtry and ntree.

> results = rf.cv(sf.final,data=SFhomes.log,mtry=2)
> mean(results)
[1] 0.3059704
> results = rf.cv(sf.final,data=SFhomes.log,mtry=3)
> mean(results)
[1] 0.2960605
> results = rf.cv(sf.final,data=SFhomes.log,mtry=4)
> mean(results)
[1] 0.2945276
> results = rf.cv(sf.final,data=SFhomes.log,mtry=5)   ← m = 5 is optimal
> mean(results)
[1] 0.2889211
> results = rf.cv(sf.final,data=SFhomes.log,mtry=6)
> mean(results)
[1] 0.289072

Partial Plots to Visualize Predictor Effects

> names(SFhomes.log)
[1] "logList"   "BEDS"      "BATHS"     "SQFT"      "YrBuilt"   "ParkSpots" "Garage"    "LATITUDE"  "LONGITUDE"

> partialPlot(sf.final,SFhomes.log,SQFT)
> partialPlot(sf.final,SFhomes.log,LONGITUDE)
> partialPlot(sf.final,SFhomes.log,LATITUDE)
> partialPlot(sf.final,SFhomes.log,BEDS)
> partialPlot(sf.final,SFhomes.log,BATHS)
> partialPlot(sf.final,SFhomes.log,YrBuilt)
> partialPlot(sf.final,SFhomes.log,ParkSpots)

> par(mfrow=c(2,3))   ← sets up a plot array with 2 rows and 3 columns
> par(op)             ← returns to the default layout

Boosting Regression Trees

Boosting, like bagging, is a way to combine or "average" the results of multiple trees in order to improve their predictive ability. Boosting, however, does not simply average trees constructed from bootstrap samples of the original data; rather, it creates a sequence of trees where the next tree in the sequence essentially uses the residuals from the previous trees as the response. This type of approach is referred to as gradient boosting. Using squared error as the measure of fit, the Gradient Tree Boosting Algorithm is given below.

Gradient Tree Boosting Algorithm (Squared Error)

1. Initialize f_0(x) = ȳ.
2. For m = 1, …, M:
   a) For i = 1, …, n compute
          r_im = y_i − f_{m−1}(x_i),
      which are simply the residuals from the previous fit.
   b) Fit a regression tree using the r_im as the response, giving terminal node regions R_jm, j = 1, …, J_m.
   c) For j = 1, 2, …, J_m compute the mean of the residuals in each of the terminal nodes; call these γ_jm.
   d) Update the model as follows:
          f_m(x) = f_{m−1}(x) + ν Σ_{j=1}^{J_m} γ_jm I(x ∈ R_jm)
      The parameter ν is a shrinkage parameter, which can be tweaked along with J_m and M to improve cross-validated predictive performance.
3. Output f̂(x) = f_M(x).

Stochastic Gradient Boosting uses the same algorithm as above, but takes a random subsample of the training data (without replacement) and grows the next tree using only those observations. A typical fraction for the subsamples would be 1/2, but smaller values could be used when n is large.

The authors of Elements of Statistical Learning recommend using 4 ≤ J_m ≤ 8 for the number of terminal nodes in the regression trees grown at each of the M steps. For classification trees smaller values of J_m are used, with 2 being optimal in many cases. (A two-terminal-node tree is called a stump.) Small values of ν (ν < 0.1) have been found to produce superior results for regression problems; however, this generally requires a large value of M. For example, for shrinkage values between .001 and .01 it is recommended that the number of iterations be between 3,000 and 10,000. Thus in terms of model development one needs to consider various combinations of J_m, ν, and M.
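Before turning to gbm, here is a minimal hand-rolled version of the squared-error algorithm above, of my own construction; X, y, and the tuning values are placeholders, and maxdepth stands in for J_m:

    library(rpart)

    # Minimal sketch of gradient tree boosting with squared error.
    # X = data frame of predictors, y = numeric response.
    toy.boost = function(X,y,M=1000,nu=0.05,depth=2) {
      f <- rep(mean(y),length(y))          # step 1: f_0(x) = ybar
      fits <- vector("list",M)
      for (m in 1:M) {
        r <- y - f                         # step 2a: residuals from current fit
        # steps 2b/2c: a small tree fit to the residuals; its terminal-node
        # means are the gamma_jm (maxdepth plays the role of J_m)
        fits[[m]] <- rpart(r~.,data=data.frame(X,r=r),
                           control=rpart.control(maxdepth=depth,cp=0))
        f <- f + nu*predict(fits[[m]])     # step 2d: shrunken update
      }
      # predictions at new x: f0 + nu * sum of the M tree predictions
      list(f0=mean(y),fits=fits,nu=nu)
    }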
Let's now consider some examples.

Example 1: Boston Housing Data

> bos.gbm = gbm(logCMEDV~.,data=Boston3,
+               distribution="gaussian",
+               n.trees=3000,
+               shrinkage=.005,
+               interaction.depth=3,   ← small Jm
+               bag.fraction=0.5,
+               train.fraction=0.5,
+               n.minobsinnode=10,
+               cv.folds=5,
+               keep.data=T,
+               verbose=T)
CV: 1
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
   1          0.1677          0.1608     0.0050    0.0011
   2          0.1664          0.1597     0.0050    0.0011
[verbose output condensed: for each of the 5 CV folds and the final fit, gbm reports TrainDeviance, ValidDeviance, StepSize, and Improve for the first 10 iterations and every 100th iteration thereafter up to 3000; StepSize is 0.0050 throughout and Improve shrinks to roughly zero after a few hundred iterations]

Verbose obviously provides a lot of detail about the fit of the model. We can plot the CV performance results using gbm.perf().

> gbm.perf(bos.gbm,method="OOB")
[1] 619
> title(main="Out of Bag and Training Prediction Errors")

[plot: cross-validation error and training error vs. iteration]

Changing the shrinkage to ν = .001 requires a larger number of iterations.
It seems odd to me that the optimal numbers of iterations based on 5-fold CV and a test-set approach are so close to the maximum number of iterations allowed (10,000).

> gbm.perf(bos.gbm,method="OOB")
[1] 2970
> gbm.perf(bos.gbm,method="cv")
[1] 9967
> gbm.perf(bos.gbm,method="test")
[1] 9996

> gbm.perf(bos.gbm,method="test")
[1] 2861

[plot: test error and training error vs. iteration]

> gbm.perf(bos.gbm,method="cv")
[1] 1444

[plot: 5-fold CV error, test error, and training error vs. iteration]

> best.iter = gbm.perf(bos.gbm,method="cv")
> summary(bos.gbm,n.trees=best.iter)
       var     rel.inf
1    LSTAT 51.10286714
2       RM 18.19975510
3     CRIM  8.47714802
4      DIS  7.94065688
5        B  3.27853742
6      NOX  3.13832287
7  PTRATIO  2.32268005
8      TAX  2.02538364
9      AGE  1.80179087
10     RAD  0.69996505
11    CHAS  0.57827463
12   INDUS  0.42289890
13      ZN  0.01171941

> yhat = predict(bos.gbm,n.trees=best.iter)
> plot(logCMEDV,yhat)
> cor(logCMEDV,yhat)^2
[1] 0.8884372

Classification trees using the rpart package

In classification trees the goal is to build a tree-based model that will classify observations into one of g predetermined classes. The end result of a tree model can be viewed as a series of conditional probabilities (posterior probabilities) of class membership given a set of covariate values. For each terminal node we essentially have a probability distribution for class membership, where the probabilities are of the form

    P(class i | x ∈ N_A)   such that   Σ_{i=1}^{g} P(class i | x ∈ N_A) = 1.

Here N_A is a neighborhood defined by a set of covariates/predictors x. The neighborhoods are found by a series of binary splits chosen to minimize the overall "loss" of the resulting tree.

For classification problems, measuring overall "loss" can be a bit complicated. One obvious method is to construct classification trees so that the overall misclassification rate is minimized; in fact, this is precisely what the RPART algorithm does by default. However, in classification problems it is often the case that we wish to incorporate prior knowledge about likely class membership. This knowledge is represented by prior probabilities of an observation being from class i, which we denote by π_i. Naturally the priors must be chosen in such a way that they sum to 1. Other information we might want to incorporate into the modeling process is the cost or loss incurred by classifying an object from class i as being from class j, and vice versa. With this information provided, we would expect the resulting tree to avoid making the most costly misclassifications on our training data set.

Some notation used by Therneau & Atkinson (1999) for determining the loss for a given node A:

    n_iA   = number of observations in node A from class i
    n_i    = number of observations in the training set from class i
    n      = total number of observations in the training set
    n_A    = number of observations in node A
    L(i,j) = loss incurred by classifying a class i object as being from class j
             (i.e. c(j|i) from our discussion of discriminant analysis)
    τ(A)   = predicted class for node A
    π_i    = prior probability of being from class i (by default π_i = n_i / n)

In general, the loss is specified as a matrix:
Using the notation and concepts presented above the risk or loss at a node A is given by C ๏ฆn ๏ถ ๏ฆ n ๏ถ R( A) ๏ฝ ๏ฅ ๏ฐ i L(i,๏ด ( A)) ๏ ๏ง๏ง iA ๏ท๏ท ๏ ๏ง๏ง ๏ท๏ท ๏ n A i ๏ฝ1 ๏จ ni ๏ธ ๏จ n A ๏ธ Example: Kyphosis Data > library(rpart) > data(kyphosis) > attach(kyphosis) > names(kyphosis) [1] "Kyphosis" "Age" "Number" "Start" > k.default <- rpart(Kyphosis~.,data=kyphosis) > k.default n= 81 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 81 17 absent (0.7901235 0.2098765) 2) Start>=8.5 62 6 absent (0.9032258 0.0967742) 4) Start>=14.5 29 0 absent (1.0000000 0.0000000) * 5) Start< 14.5 33 6 absent (0.8181818 0.1818182) 10) Age< 55 12 0 absent (1.0000000 0.0000000) * 11) Age>=55 21 6 absent (0.7142857 0.2857143) 22) Age>=111 14 2 absent (0.8571429 0.1428571) * 23) Age< 111 7 3 present (0.4285714 0.5714286) * 3) Start< 8.5 19 8 present (0.4210526 0.5789474) * Sample Loss Calculations: Root Node: Terminal Node 3: R(root) = .2098765*1*(17/17)*(81/81)*81 = 17 R(A) = .7901235*1*(8/64)*(81/19)*19 = 8 The actual tree is shown on the following page. 43 Suppose now we have prior beliefs that 65% of patients will not have Kyphosis (absent) and 35% of patients will have Kyphosis (present). > k.priors <-rpart(Kyphosis~.,data=kyphosis,parms=list(prior=c(.65,.35))) > k.priors n= 81 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 81 28.350000 absent (0.65000000 0.35000000) 2) Start>=12.5 46 3.335294 absent (0.91563089 0.08436911) * 3) Start< 12.5 35 16.453130 present (0.39676840 0.60323160) 6) Age< 34.5 10 1.667647 absent (0.81616742 0.18383258) * 7) Age>=34.5 25 9.049219 present (0.27932897 0.72067103) * Sample Loss Calculations: R(root) = .35*1*(17/17)*(81/81)*81 = 28.35 R(node 7) = .65*1*(11/64)*(81/25)*25 = 9.049219 The resulting tree is shown on the following page. 44 Finally suppose we wish to incorporate the following loss information. Classification L(i,j) present absent Actual present 0 4 absent 1 0 Note: In R the category orderings for the loss matrix are Z ๏ A in both dimensions. This says that it is 4 times more serious to misclassify a child that actually has kyphosis (present) as not having it (absent). Again we will use the priors from the previous model (65% - absent, 35% - present). > lmat <- matrix(c(0,4,1,0),nrow=2,ncol=2,byrow=T) > k.priorloss <- rpart(Kyphosis~.,data=kyphosis,parms=list(prior=c(.65,.35),loss=lmat)) > k.priorloss n= 81 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root 81 52.650000 present (0.6500000 0.3500000) 2) Start>=14.5 29 0.000000 absent (1.0000000 0.0000000) * 3) Start< 14.5 52 28.792970 present (0.5038760 0.4961240) 6) Age< 39 15 6.670588 absent (0.8735178 0.1264822) * 7) Age>=39 37 17.275780 present (0.3930053 0.6069947) * Sample Loss Calculations: R(root) = .65*1*(64/64)*(81/81)*81 = 52.650000 R(node 7) = .65*1*(21/64)*(81/37)*37 = 17.275780 R(node 6) = .35*4*(1/17)*(81/15)*15 = 6.67058 The resulting tree is shown below on the following page. 
Example 2 – Owl Diet

> owl.tree <- rpart(species~.,data=OwlDiet)
> owl.tree
n= 179

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 179 140 tiomanicus (0.14 0.21 0.13 0.089 0.056 0.22 0.16)
   2) teeth_row>=56.09075 127 88 tiomanicus (0.2 0.29 0 0.13 0.079 0.31 0)
     4) teeth_row>=72.85422 26 2 annandalfi (0.92 0.077 0 0 0 0 0) *
     5) teeth_row< 72.85422 101 62 tiomanicus (0.0099 0.35 0 0.16 0.099 0.39 0)
      10) teeth_row>=64.09744 49 18 argentiventer (0.02 0.63 0 0.27 0.02 0.061 0)
        20) palatine_foramen>=61.23703 31 4 argentiventer (0.032 0.87 0 0.065 0 0.032 0) *
        21) palatine_foramen< 61.23703 18 7 rajah (0 0.22 0 0.61 0.056 0.11 0) *
      11) teeth_row< 64.09744 52 16 tiomanicus (0 0.077 0 0.058 0.17 0.69 0)
        22) skull_length>=424.5134 7 1 surifer (0 0.14 0 0 0.86 0 0) *
        23) skull_length< 424.5134 45 9 tiomanicus (0 0.067 0 0.067 0.067 0.8 0) *
   3) teeth_row< 56.09075 52 24 whiteheadi (0 0 0.46 0 0 0 0.54)
     6) palatine_foramen>=47.15662 25 1 exulans (0 0 0.96 0 0 0 0.04) *
     7) palatine_foramen< 47.15662 27 0 whiteheadi (0 0 0 0 0 0 1) *

The function misclass.rpart constructs a confusion matrix for a classification tree.

> misclass.rpart(owl.tree)
Table of Misclassification
(row = predicted, col = actual)

                 1  2  3  4  5  6  7
annandalfi      24  2  0  0  0  0  0
argentiventer    1 27  0  2  0  1  0
exulans          0  0 24  0  0  0  1
rajah            0  4  0 11  1  2  0
surifer          0  1  0  0  6  0  0
tiomanicus       0  3  0  3  3 36  0
whiteheadi       0  0  0  0  0  0 27

Misclassification Rate = 0.134

Using equal prior probabilities for each of the groups:

> owl.tree2 <- rpart(species~.,data=OwlDiet,parms=list(prior=c(1,1,1,1,1,1,1)/7))
> misclass.rpart(owl.tree2)
Table of Misclassification
(row = predicted, col = actual)

                 1  2  3  4  5  6  7
annandalfi      24  2  0  0  0  0  0
argentiventer    1 27  0  2  0  1  0
exulans          0  0 24  0  0  0  1
rajah            0  4  0 11  1  2  0
surifer          0  1  0  1  9  5  0
tiomanicus       0  3  0  2  0 31  0
whiteheadi       0  0  0  0  0  0 27

Misclassification Rate = 0.145

> plot(owl.tree)
> text(owl.tree,cex=.6)

We can see that teeth row and palatine foramen figure prominently in the classification rule. Given the analyses we have done previously, this is not surprising.

Cross-validation is done in the usual fashion: we leave out a certain percentage of the observations, develop a model from the remaining data, predict back the class of the observations we set aside, calculate the misclassification rate, and then repeat this process a number of times. The function tree.cv3 leaves out (1/k)×100% of the data at a time to perform the cross-validation; its use is shown after the sketch below.
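The source of tree.cv3 is not listed in these notes; a minimal sketch of what such a function might look like, patterned after the rpart.cv function shown earlier (the name and arguments are my assumptions):

    # Sketch of a tree.cv3-style function: leave out (1/k)*100% of the data,
    # refit, and record the misclassification rate, repeated `reps` times.
    tree.cv3.sketch = function(fit,k=10,data,y,reps=100) {
      n <- nrow(data)
      out <- numeric(reps)
      for (i in 1:reps) {
        test <- sample(1:n,floor(n/k))
        fit2 <- rpart(formula(fit),data=data[-test,],method="class")
        pred <- predict(fit2,newdata=data[test,],type="class")
        out[i] <- mean(pred != y[test])
      }
      out
    }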
> tree.cv3(owl.tree,k=10,data=OwlDiet,y=species,reps=100)
  [1] 0.17647059 0.11764706 0.11764706 0.29411765 0.23529412 0.29411765 0.29411765 0.05882353
  [9] 0.23529412 0.11764706 0.23529412 0.11764706 0.11764706 0.29411765 0.11764706 0.29411765
 [17] 0.17647059 0.17647059 0.05882353 0.35294118 0.17647059 0.35294118 0.00000000 0.11764706
 [25] 0.17647059 0.23529412 0.29411765 0.11764706 0.11764706 0.11764706 0.17647059 0.05882353
 [33] 0.17647059 0.23529412 0.11764706 0.00000000 0.29411765 0.17647059 0.11764706 0.05882353
 [41] 0.11764706 0.11764706 0.29411765 0.35294118 0.17647059 0.17647059 0.29411765 0.29411765
 [49] 0.17647059 0.35294118 0.00000000 0.17647059 0.17647059 0.23529412 0.11764706 0.35294118
 [57] 0.05882353 0.23529412 0.23529412 0.11764706 0.05882353 0.17647059 0.35294118 0.29411765
 [65] 0.00000000 0.11764706 0.29411765 0.41176471 0.17647059 0.11764706 0.35294118 0.17647059
 [73] 0.23529412 0.23529412 0.23529412 0.05882353 0.11764706 0.17647059 0.29411765 0.17647059
 [81] 0.23529412 0.17647059 0.17647059 0.23529412 0.17647059 0.05882353 0.11764706 0.41176471
 [89] 0.17647059 0.05882353 0.23529412 0.11764706 0.23529412 0.17647059 0.11764706 0.17647059
 [97] 0.17647059 0.23529412 0.11764706 0.35294118

> results <- tree.cv3(owl.tree,k=10,data=OwlDiet,y=species,reps=100)
> mean(results)
[1] 0.2147059