Classification and Regression Trees (CART)

• CART is one of the simplest and now one of the most widely used techniques for classification.
• The CART algorithm generates a classification tree by sequentially performing binary splits on the data.
• The simplest case is when splits are made on individual variables.

Basic Details of CART

• The basic form of the model is $s(x) = \sum_{l=1}^{k} c_l \, 1[x \in N_l]$, which means the response is piecewise constant over $k$ disjoint hyper-rectangles $N_l$ with $\bigcup_{l=1}^{k} N_l = \mathbb{R}^p$.
• For the classification case, $c_l$ represents a class indicator, while for the regression case it is a numerical value.
• To minimize the RSS in regression, we set $\hat{c}_l = \operatorname{mean}(y_j \mid x_j \in N_l)$. For classification, we minimize the misclassification rate by setting $\hat{c}_l = \arg\max_k \#\{y_j \in \text{class } k \mid x_j \in N_l\}$.
• The key question is how to determine $k$ and $N_1, N_2, \ldots, N_k$.

Special Case: k = 2 regions

• Set $N_0 = \mathbb{R}^p$. Split it into two axis-parallel hyper-rectangles $N_1$ and $N_2$ characterized by the split coordinate $i$ and the split point $s$.
• We thus get $N_1(i, s) = \{x \mid x_i < s\}$ and $N_2(i, s) = \{x \mid x_i \ge s\}$, with the corresponding index subsets $S_l(i, s) = \{j \mid x_j \in N_l(i, s)\}$. The $c_l$'s are estimated as above.
• For the regression case, $\mathrm{RSS}(i, s) = \sum_{l=1}^{2} (n_l - 1)\,\mathrm{Var}(y_j \mid j \in S_l(i, s))$. Our goal is to find $(i, s)$ minimizing this; the classification case is handled similarly with the misclassification rate.
• Note that the RSS changes only when $S_1$ and $S_2$ change. Note also that for each $i$ we have to try at most $n - 1$ split points, and fewer if there are ties. This means that means and variances can be updated cheaply, which solves the problem for $k = 2$.

Extension: k = 3 regions

• There are $(n - 1)p$ possibilities for the first split (as before) and then $(n - 2)p$ possibilities for the second split, given the first, for a total of $(n - 1)(n - 2)p^2$ possible pairs of splits. For general $k$ we therefore have $(n - 1)(n - 2) \cdots (n - k + 1)\, p^{k-1}$ possible splits. An exhaustive search is computationally infeasible, so the idea is to apply the splitting procedure recursively.
• This is a greedy search: we give up any hope of finding an optimal partitioning, but hope to find one that is not much worse. Once we have the first split, we fix it and apply the splitting procedure to each of the two rectangles. As a result, we will not get the optimal partitioning of the space into three rectangles for $k = 3$, but we will get three rectangles, two of which are contained in one of the rectangles of the first split. We thus get a tree, with each node associated with a rectangular region $N$ of the predictor space, the subset $S$ of the training sample consisting of those observations whose predictor vectors lie in $N$, and the corresponding (regression) constant or (class) indicator. (A minimal code sketch of this greedy procedure is given after the example below.)

Example 11.11 (CART on GMAT Data)

[Figure: fitted classification tree on the GMAT data, with splits on GPA (2.725, 2.875, 3.17) and GMAT (415, 472, 474.5); terminal nodes are labeled with classes 1, 2 and 3.]

• To get the prediction or fitted value for a given $x$, we simply run the observation down the tree.
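The following is a minimal sketch of the greedy split search and recursive partitioning described above, for the regression case. It assumes the data are held in NumPy arrays X (n × p) and y; the function and parameter names (best_split, grow, predict, max_depth, min_node) are illustrative and not part of the notes.

```python
import numpy as np

def rss(y):
    # (n_l - 1) * Var(y): the sum of squared deviations of y about its mean
    return 0.0 if len(y) < 2 else (len(y) - 1) * np.var(y, ddof=1)

def best_split(X, y):
    """Try every coordinate i and at most n - 1 split points s (fewer under ties),
    returning the (i, s) minimizing RSS(i, s) = rss(left) + rss(right)."""
    best_i, best_s, best_rss = None, None, np.inf
    for i in range(X.shape[1]):
        for s in np.unique(X[:, i])[1:]:       # candidate split points for coordinate i
            left = X[:, i] < s
            score = rss(y[left]) + rss(y[~left])
            if score < best_rss:
                best_i, best_s, best_rss = i, s, score
    return best_i, best_s, best_rss

def grow(X, y, depth=0, max_depth=3, min_node=5):
    """Greedy recursive partitioning: fix the best split, then recurse on
    the two resulting rectangles."""
    if depth == max_depth or len(y) < min_node:
        return {"c": float(np.mean(y))}        # terminal node: c_l = mean(y_j | x_j in N_l)
    i, s, _ = best_split(X, y)
    if i is None:                              # no admissible split (all x identical)
        return {"c": float(np.mean(y))}
    left = X[:, i] < s
    return {"i": i, "s": s,
            "left":  grow(X[left],  y[left],  depth + 1, max_depth, min_node),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_node)}

def predict(tree, x):
    """Run an observation down the tree to get its fitted value."""
    while "c" not in tree:
        tree = tree["left"] if x[tree["i"]] < tree["s"] else tree["right"]
    return tree["c"]
```

For classification one would instead store the majority class at each terminal node and replace the RSS criterion by the misclassification count (or another impurity measure).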
Terminating the Tree

• How do we terminate the recursive partitioning algorithm? Obviously, we could split all the way down until each terminal node contains a single observation, but that would not be good. (Note that the tree becomes more and more unstable as we go down toward the terminal nodes.)
• One option is to start with the complete tree and use bottom-up recombination. To do that, denote the left and right daughters of a node $t$ by $l(t)$ and $r(t)$, and define the cost of a node $t$ by $c(t) = \mathrm{RSS}(t)$ if $t$ is terminal, and $c(t) = c(l(t)) + c(r(t)) + \lambda$ if $t$ is non-terminal. We consider splitting $t$ worthwhile if the cost incurred in making $t$ terminal is larger than the cost of making $t$ non-terminal. The parameter $\lambda$ is called the complexity parameter and specifies the price we pay for splitting a node and thereby creating a more complex model. (A code sketch of this recombination rule is given at the end of the section.)
• Properties of the complexity parameter: if $\lambda \ge \lambda_0$, then the tree built using $\lambda$ is contained in the tree built using $\lambda_0$; for $\lambda \to \infty$, the tree consists of only one node; and if the full tree has $k$ terminal nodes, there are at most $k$ different subtrees that can be obtained by choosing different values of $\lambda$.

Cross-validation to determine tree size

[Figure: plot of cross-validated deviance against tree size (1 to 7), together with the pruned classification tree with splits GPA < 2.725, GMAT < 415 and GPA < 3.17.]

• Our prediction using the best cross-validated tree is that the new applicant belongs to the first category.
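Below is a minimal sketch of the bottom-up recombination rule. It operates on the tree dictionaries produced by grow in the previous sketch, with X and y the training data reaching the root, and lam playing the role of the complexity parameter λ; the name prune and the return convention are illustrative assumptions, not part of the notes.

```python
import numpy as np

def rss(y):
    # sum of squared deviations of y about its mean (same helper as above)
    return 0.0 if len(y) < 2 else (len(y) - 1) * np.var(y, ddof=1)

def prune(tree, X, y, lam):
    """Bottom-up recombination with complexity parameter lam:
    a node t is collapsed to a terminal node whenever
    RSS(t) <= c(l(t)) + c(r(t)) + lam.  Returns (pruned tree, cost c(t))."""
    terminal_cost = rss(y)                     # cost of making t terminal
    if "c" in tree:                            # t is already a terminal node
        return tree, terminal_cost
    left = X[:, tree["i"]] < tree["s"]
    l_tree, l_cost = prune(tree["left"],  X[left],  y[left],  lam)
    r_tree, r_cost = prune(tree["right"], X[~left], y[~left], lam)
    split_cost = l_cost + r_cost + lam         # cost of keeping t non-terminal
    if terminal_cost <= split_cost:            # splitting t is not worthwhile
        return {"c": float(np.mean(y))}, terminal_cost
    return ({"i": tree["i"], "s": tree["s"], "left": l_tree, "right": r_tree},
            split_cost)
```

Increasing lam collapses more nodes, and in the limit only the root survives, in line with the properties of the complexity parameter listed above; in practice the tree size (equivalently λ) is chosen by cross-validation, as in the figure.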