Classification and Regression Trees (CART)

• CART is one of the simplest and most widely used techniques for classification.
• The CART algorithm generates a classification tree by sequentially doing binary splits on the data.
• The simplest case is when splits are made on individual variables.
Basic Details of CART
• The basic form of the model is s(x) = Σ_{l=1}^{k} c_l 1[x ∈ N_l], which implies that the response is piecewise constant over disjoint hyper-rectangles N_l with ∪_{l=1}^{k} N_l = ℝ^p.
• For the classification case, c_l represents a class indicator, while for the regression case it is a numerical value.
• In order to minimize the RSS in regression, we set ĉ_l = mean(y_j | x_j ∈ N_l). For classification, we minimize the misclassification rate by ĉ_l = argmax_k #{y_j ∈ class k | x_j ∈ N_l} (a small sketch follows this list).
• The key question is how to determine k and N_1, N_2, . . . , N_k.
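• As a small illustration of this piecewise-constant form, the following sketch (Python, with made-up rectangles and data; all names are hypothetical) evaluates s(x) over axis-parallel rectangles and estimates the constants ĉ_l by the within-rectangle means, as in the regression case above.

    import numpy as np

    # Hypothetical axis-parallel rectangles N_l, each given as (lower, upper) bounds
    # per coordinate; together they cover the region of interest (a subset of R^2).
    rectangles = [
        (np.array([-np.inf, -np.inf]), np.array([0.5, np.inf])),    # N_1: x1 < 0.5
        (np.array([0.5, -np.inf]),     np.array([np.inf, 1.0])),    # N_2: x1 >= 0.5, x2 < 1.0
        (np.array([0.5, 1.0]),         np.array([np.inf, np.inf])), # N_3: x1 >= 0.5, x2 >= 1.0
    ]

    def leaf_index(x, rectangles):
        """Return the l such that x lies in N_l (lower-inclusive, upper-exclusive)."""
        for l, (lo, hi) in enumerate(rectangles):
            if np.all(x >= lo) and np.all(x < hi):
                return l
        raise ValueError("x is not covered by any rectangle")

    # Made-up training data.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 2.0, size=(50, 2))
    y = X[:, 0] + 2.0 * (X[:, 1] > 1.0) + rng.normal(scale=0.1, size=50)

    # Regression: c_hat_l = mean(y_j | x_j in N_l), which minimizes the RSS leaf by leaf.
    labels = np.array([leaf_index(x, rectangles) for x in X])
    c_hat = np.array([y[labels == l].mean() for l in range(len(rectangles))])

    def s(x):
        """Piecewise-constant fit s(x) = sum_l c_hat_l * 1[x in N_l]."""
        return c_hat[leaf_index(x, rectangles)]

    print(c_hat, s(np.array([0.2, 0.3])))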
Special Case: k = 2 rectangles
• Set N_0 = ℝ^p. Split it into two axis-parallel hyper-rectangles N_1 and N_2, characterized by the split coordinate i and the split point s.
• We thus get N_1(i, s) = {x | x_i < s} and N_2(i, s) = {x | x_i ≥ s}, with the corresponding subsets S_l(i, s) = {j | x_j ∈ N_l(i, s)}. The c_l's are estimated as above.
• For the regression case, RSS(i, s) = Σ_{l=1}^{2} (n_l − 1) Var(y_j | j ∈ S_l(i, s)), and our goal is to find the (i, s) minimizing this. The classification case proceeds similarly with the misclassification rate.
• Note that the RSS changes only when S_1 and S_2 change. Note also that for each coordinate i, we have to try at most n − 1 split points, and fewer if there are ties. This means that the means and variances can be updated incrementally as the split point moves, which solves the problem for k = 2.
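• A minimal sketch of this split search under the above setup (Python; the function name and the toy data are hypothetical): for each coordinate the data are sorted once and the at most n − 1 candidate split points are swept, with prefix sums giving constant-time updates of the left and right RSS.

    import numpy as np

    def best_split(X, y):
        """Find the (coordinate i, split point s) minimizing
        RSS(i, s) = within-rectangle sum of squares on the two sides."""
        n, p = X.shape
        best = (None, None, np.inf)               # (i, s, RSS)
        for i in range(p):
            order = np.argsort(X[:, i])
            xs, ys = X[order, i], y[order]
            csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)
            tot, tot2 = csum[-1], csum2[-1]
            for m in range(1, n):                 # m points go left, n - m go right
                if xs[m] == xs[m - 1]:            # tie: not a distinct split point
                    continue
                left_rss = csum2[m - 1] - csum[m - 1] ** 2 / m
                right_rss = (tot2 - csum2[m - 1]) - (tot - csum[m - 1]) ** 2 / (n - m)
                if left_rss + right_rss < best[2]:
                    best = (i, 0.5 * (xs[m - 1] + xs[m]), left_rss + right_rss)
        return best

    # Made-up data: the response jumps at coordinate 1 (zero-based), near 0.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = np.where(X[:, 1] > 0, 5.0, -5.0) + rng.normal(scale=0.5, size=200)
    print(best_split(X, y))                       # expect i = 1 and s close to 0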
Extension: k = 3 rectangles
• Note that there are (n − 1)p possibilities for the first split (as before) and then (n − 2)p possibilities for the second split, given the first split. This means that there are a total of (n − 1)(n − 2)p^2 possible split points. For general k, therefore, we have (n − 1)(n − 2) · · · (n − k + 1)p^{k−1} possible splits. This quickly becomes computationally infeasible, so the idea is to apply the splitting procedure recursively.
• The idea behind this is a greedy search, which means that we have no hope of finding an optimal partitioning, but hopefully one that is not too much worse. So, once we have the first split, we fix it and apply the splitting procedure to the two rectangles. As a result, we will not get the optimal partitioning of the space into three rectangles for k = 3, but we will get three rectangles, two of which are contained in one of the rectangles from the first split. We thus get a tree, with each node being associated with a rectangular region N of the predictor space, a subset S of the training sample consisting of those observations with predictor vectors in N, and the corresponding (regression) constant or (class) indicator.
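• The recursive greedy procedure might look roughly as follows (Python; a simplified sketch with hypothetical names, using an exhaustive split search and a depth limit rather than a proper stopping rule). It also shows how an observation is run down the fitted tree to its terminal rectangle.

    import numpy as np

    def best_split(X, y):
        """Exhaustive search over (coordinate i, split point s) minimizing the two-sided RSS."""
        best = (None, None, np.inf)
        for i in range(X.shape[1]):
            for s in np.unique(X[:, i])[1:]:               # at most n - 1 distinct splits
                left, right = y[X[:, i] < s], y[X[:, i] >= s]
                rss = left.size * left.var() + right.size * right.var()
                if rss < best[2]:
                    best = (i, s, rss)
        return best

    def grow(X, y, depth=0, max_depth=3, min_size=5):
        """Greedy recursive partitioning: fix the best split, then recurse on the
        two rectangles. Returns the tree as a nested dict."""
        i, s, _ = best_split(X, y) if y.size >= 2 * min_size else (None, None, None)
        if i is None or depth == max_depth:
            return {"leaf": True, "c": y.mean()}           # c_hat = leaf mean (regression)
        mask = X[:, i] < s
        return {"leaf": False, "i": i, "s": s,
                "left": grow(X[mask], y[mask], depth + 1, max_depth, min_size),
                "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_size)}

    def predict(tree, x):
        """Run an observation down the tree to its terminal rectangle."""
        while not tree["leaf"]:
            tree = tree["left"] if x[tree["i"]] < tree["s"] else tree["right"]
        return tree["c"]

    # Made-up data with a two-split structure.
    rng = np.random.default_rng(2)
    X = rng.uniform(-1.0, 1.0, size=(300, 2))
    y = 1.0 * (X[:, 0] > 0) + 2.0 * (X[:, 1] > 0.5) + rng.normal(scale=0.1, size=300)
    print(predict(grow(X, y), np.array([0.5, 0.9])))       # close to 3.0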
Example 11.11 (CART on GMAT Data)
[Figure: fitted classification tree for the GMAT data. Splits: GPA < 2.725, GPA < 3.17, GMAT < 415, GMAT < 474.5, GMAT < 472, GPA < 2.875; terminal nodes are labelled with classes 1, 2 and 3.]
• To get the prediction or fitted value for a given x, we simply run the observation down the tree.
Terminating the Tree
• How do we terminate our recursive partitioning algorithm? Obviously, we could keep splitting all the way down to single-observation nodes, but that would not be good. (Note that the tree becomes more and more unstable as we go down towards the terminal nodes.)
• One option is to start with the complete tree and use bottom-up recombination. To do that, denote the left and right daughters of a node t by l(t) and r(t). Then define the cost of a node t by c(t) = RSS(t) if the node t is terminal, and c(t) = c(l(t)) + c(r(t)) + λ if the node t is non-terminal. We consider splitting t worthwhile if the cost incurred in making t terminal is larger than the cost of making t non-terminal (a small sketch follows this list). The parameter λ is called the complexity parameter and specifies the price we pay for splitting a node and thereby creating a more complex model.
• Note that among the properties of the complexity parameter, if λ ≥ λ0, then the tree built using λ is contained in the tree built using λ0. Further, for λ → ∞ the tree consists of only one node. Finally, if the full tree has k terminal nodes, there are at most k different subtrees that can be obtained by choosing different values of λ.
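• A sketch of the bottom-up recombination described above (Python; hypothetical names, operating on a small hand-built tree in which every node stores the responses reaching it): c(t) = RSS(t) at terminal nodes, c(t) = c(l(t)) + c(r(t)) + λ otherwise, and a node is collapsed whenever making it terminal is no more costly than keeping the split.

    import numpy as np

    def rss(y):
        return float(np.sum((y - y.mean()) ** 2))

    def prune(node, lam):
        """Bottom-up recombination: return (pruned node, cost c(node)), where
        c(t) = RSS(t) if t is terminal and c(l(t)) + c(r(t)) + lam otherwise."""
        if node["leaf"]:
            return node, rss(node["y"])
        left, c_left = prune(node["left"], lam)
        right, c_right = prune(node["right"], lam)
        cost_split = c_left + c_right + lam
        cost_terminal = rss(node["y"])
        if cost_terminal <= cost_split:                   # splitting t is not worthwhile
            return {"leaf": True, "y": node["y"], "c": node["y"].mean()}, cost_terminal
        return dict(node, left=left, right=right), cost_split

    # Hand-built toy tree; every node stores the responses y that reach it.
    yL = np.array([1.0, 1.1, 0.9, 1.0])
    yR = np.array([1.05, 0.95, 5.0, 5.2])
    tree = {"leaf": False, "y": np.concatenate([yL, yR]),
            "left": {"leaf": True, "y": yL, "c": yL.mean()},
            "right": {"leaf": False, "y": yR,
                      "left": {"leaf": True, "y": yR[:2], "c": yR[:2].mean()},
                      "right": {"leaf": True, "y": yR[2:], "c": yR[2:].mean()}}}

    for lam in (0.01, 0.5, 100.0):
        pruned, cost = prune(tree, lam)
        print(lam, "root collapsed to a single node:", pruned["leaf"])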
Cross-validation to determine tree size
[Figure: cross-validated deviance (roughly 100-200) versus tree size (1-7 terminal nodes) for the GMAT data, with the corresponding complexity-parameter values (90.0, 48.0, 8.6, 8.0, 5.9, 5.5, −Inf) on the top axis; alongside, the pruned tree with splits GPA < 2.725, GPA < 3.17 and GMAT < 415 and terminal classes 1, 2 and 3.]
• Our prediction using the best cross-validated tree is that the
new applicant belongs to the first category.
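• In practice the cross-validated choice of tree size is usually delegated to an existing implementation. The sketch below uses scikit-learn's cost-complexity pruning (an assumption about tooling, not the implementation used for the GMAT example; the iris data stand in because the GMAT data are not reproduced here): the candidate complexity parameters are the breakpoints of the pruning path, and cross-validation picks among them.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in data (the GMAT data are not reproduced here).
    X, y = load_iris(return_X_y=True)

    # Candidate complexity parameters: the breakpoints of the cost-complexity pruning path.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
    alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

    # Cross-validate over the complexity parameter and keep the best pruned tree.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"ccp_alpha": alphas}, cv=5)
    search.fit(X, y)

    best = search.best_estimator_
    print("chosen alpha:", search.best_params_["ccp_alpha"],
          "| terminal nodes:", best.get_n_leaves())

    # Prediction for a new observation: run it down the cross-validated tree.
    print("predicted class:", best.predict(X[:1])[0])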