Combining tree results using several trees:

[Figure: three trees, Tree 1 (weight .2), Tree 2 (weight .3), and Tree 3 (weight .5), each splitting on a few of the features X1-X5. A blue dot marks the leaf in which observation 1 lands in each tree; those leaves give Pr{Y1=1} = .6, .3, and .8.]

Data, one row per observation:

Yi   X1i  X2i  X3i  X4i  X5i
 1    18    3    5    7    4
 0     4   16   12    3    8
 1     3    8    9   12    2
 1    11   10    2    7   12

Leaf probabilities from each tree and the combined results (tree weights .2, .3, .5):

p1   p2   p3   combined   Decision   Error (Y-p)   Misclass
.6   .3   .8     .61         1         1-.61          0
.4   .4   .5     .45         0         0-.45          0
.2   .1   .3     .22         0         1-.22          1
.9   .8   .7     .77         1         1-.77          0
The error sum of squares for this combination is .39² + … + .23² and the misclassification rate is ¼. The blue dot in each tree indicates the leaf in which observation 1 lands, based on its X values (features). In tree 1, it lands in a leaf with 60% 1s, so the probability of a 1 is 0.6. The combined result for observation Y1 is .2(.6)+.3(.3)+.5(.8)=.61. The weight for each tree is related to the tree's accuracy, with more accurate trees getting more weight; trees with larger error sums of squares or misclassification rates would get less weight. Bagging and random forests use equal weights, but boosting uses weights based on accuracy.
A program to compute the above table:
data a;
   input p1 p2 p3 Y;                   /* one probability from each tree plus the target */
   combinedp = .2*p1 + .3*p2 + .5*p3;  /* weighted combination of the three trees        */
   D = (combinedp > 1/2);              /* decision: classify as 1 when combined p > 1/2  */
   misclass = 0; if D ne Y then misclass = 1;
   error = (Y - combinedp);            /* residual used for the error sum of squares     */
   Sq = error*error;
   example + 1;                        /* running observation counter                    */
datalines;
.6 .3 .8 1
.4 .4 .5 0
.2 .1 .3 1
.9 .8 .7 1
;
proc print; sum misclass Sq;
   var example p1 p2 p3 combinedp Y D misclass error Sq;
   title "Tree weights are .2, .3, and .5";
run;
Methods for obtaining multiple trees: BOOSTING, BAGGING, and RANDOM FORESTS.
Bagging (Bootstrap AGGregation): Sample the data with replacement, so some observations will enter more than once and some, most likely, not at all - almost certainly so in very large training data sets. We sample all variables (columns), but in sample 1 below, observations 2, 5, and 7 are "out of bag" and act something like a validation set, though the assessments tend to be pessimistic when computed on the out-of-bag data.
[Illustration: Sample 1, Sample 2, etc., each shown as a grid of X's with one column per variable and one row per selected observation; every sample keeps all of the columns, but the rows drawn differ from sample to sample.]
Do this sampling multiple times resulting in multiple trees.
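As a rough illustration of the resampling step (not something the bagging tools require you to do by hand), one bootstrap sample can be drawn with PROC SURVEYSELECT; the data set name TRAIN and the seed are placeholders:

proc surveyselect data=train out=bag1
                  method=urs      /* unrestricted random sampling = with replacement     */
                  samprate=1      /* sample the same size as the original data           */
                  outhits         /* one output record per draw, so repeats appear twice */
                  seed=1234;
run;

Observations that are never drawn are the "out of bag" cases for this sample.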
Random Forests – This is perhaps the most popular of these methods. Recall that observations are the rows of our data set, one row for each item (e.g. each credit card applicant), and variables are the columns, so age, gender, salary, etc. are variables. The idea is to take a random sample of both the variables and the observations and grow a tree on this subset. Notice that this is a sort of bagging, but using a different subset of the variables for each "bag." Do this sampling multiple times, resulting in several trees.
[Illustration: a grid of observations by variables with an X marking each randomly selected cell; each random-forest sample uses only a subset of the rows and a subset of the columns.]
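To sketch the extra step (again just an illustration; the random forest tools handle this internally), one random-forest style sample could be built by bootstrapping the rows as in the bagging sketch above and then keeping only a randomly chosen subset of the columns. The variable names x2, x4, x5 and target y below are hypothetical:

data rf_sample1;
   set bag1;          /* bag1 = one bootstrap sample of the rows                  */
   keep x2 x4 x5 y;   /* a randomly chosen subset of the inputs, plus the target  */
run;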
Boosting (basic idea): For decisions, run a sequence of trees, each time giving increased weight to the observations that were previously misclassified, in hopes of getting them right this time. For estimation, you take the deviations from the previous model's predicted probabilities (1-p or 0-p) for each point. Perhaps there is a pattern to these residuals, so you then grow a tree on the residuals to improve on the original prediction. In either case the sequence of trees is combined in a weighted combination to classify a point or estimate its probability of yielding an event, with larger weights given to more accurate trees. These methods try to sequentially improve on the previous decisions or predictions. There are some variations on this; one variant, gradient boosting, is available from the Model subtab. A more detailed reference is Chapter 10 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman.
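To make the residual-fitting idea concrete, here is a minimal sketch assuming a data set SCORED that already holds the target Y and a first model's predicted probability P (both names hypothetical):

data resid1;
   set scored;
   resid = y - p;   /* 1-p for events, 0-p for non-events */
run;

The next tree would then be grown with RESID as its target, and its predictions added (usually after shrinking) to P to improve the estimate.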
Notice that for all of these methods, you end up with multiple trees. How do you then predict observations? The idea of combining multiple models is called ensemble modelling, and the models can even be of different types; this will be discussed in a more general setting later. For trees, if you are interested in decisions, one method is to run an observation through all of the trees, as in the book illustration with the yellow and blue dots, and then decide based on the majority decision. This is called the "voting" method of combining models. For estimates, you can just average the probabilities from the models. You can also do a weighted combination with weights computed from the accuracy of each tree, so that trees with a larger error mean square or misclassification rate are given less weight; this is typically how boosting combines its trees. See the optional weights section below for details. Because there are many trees, these methods will not produce some of the output that we are used to, such as a tree diagram (there isn't just one tree). In random forests, you will also see statistics labelled as "out of bag." Recall that each sample most likely does not include all the observations. The observations from the training data that are not included in the sample (the bag) are called out-of-bag observations, and thus they form a sort of validation data set, a different one for each sample. For boosting, a weighted average of the tree results, based on tree accuracy, is used as noted previously.
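As a small sketch of the two combination rules, assume a data set TREEPROBS with hypothetical columns p1-p3 holding each tree's estimated event probability for an observation:

data combined;
   set treeprobs;
   array p{3} p1-p3;
   votes = 0;
   do m = 1 to 3;
      votes + (p{m} > 0.5);         /* each tree casts a 0/1 vote        */
   end;
   vote_decision = (votes >= 2);    /* "voting": majority of the 3 trees */
   avg_p = mean(of p1-p3);          /* "averaging": unweighted mean      */
   avg_decision = (avg_p > 0.5);
run;

A weighted version, as in the program earlier, simply replaces the unweighted mean with accuracy-based weights.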
Enterprise Miner provides several HP programs where HP stands for high performance. These have
been sped up with programming and by taking advantage of distributed processing when available. If
you have several processors, each sample can be assigned to a different processor so that the time for
processing all of the samples is the same as for growing a single tree. The procedure HPFOREST is one
such program for random forests. Here is a quick random forest demo for our veterans data.
(1) Find the HPDM subtab and select the HP forest icon.
(2) Inspect the properties panel and note that some of the tree defaults are a bit different from those for the ordinary tree node.
(3) Connect the HP forest node to the data partition node in your veterans diagram.
(4) Run the node and inspect the results. Did it take a long time to run?
(5) Later we will see how to export the score code to score future data sets. The gain is in the accuracy of the results, but we also lose the simple interpretability of a single tree.
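For reference, outside of the diagram interface a similar model can be requested by calling PROC HPFOREST directly. This is only a bare-bones sketch; the data set and variable names are placeholders:

proc hpforest data=train maxtrees=100
              vars_to_try=4;                    /* variables sampled as split candidates */
   target bad / level=binary;
   input age income debtratio / level=interval;
   input gender region / level=nominal;
run;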
Optional: Similarly, try a gradient boosting approach (Model subtab) on the veterans data. In this demo you will see at least one variable (demcluster) that is character with 54 levels. When you split, one part will involve level A (let's say) of this class variable. For that part, each of the 53 other levels can be included or not, which gives 2⁵³ ways of constructing the part of the split containing level A. This is also the total number of possible splits, except that it counts the "split" in which all of the levels are included with A, so the total number of feasible splits is 2⁵³ – 1. This variable would have lots of chances to give a high Chi-square (and logworth). You will see that demcluster has importance 1 in the training data and 0 in the validation data for this reason. Use the Metadata node to change the role of demcluster to "rejected."
Optional: Weights for individual trees in a boosting sequence using misclassification:
Let wi be the weight for observation i in tree m. Start with weight 1/N, where N is the number of observations. Relabel the two classes using Y=1 for an event and Y=-1 for a non-event, and do the same for the classification (classify as 1 or -1). This way Y times the classification is 1 if correct and -1 if incorrect.
1. Sum the weights wi over all misclassified observations and divide by the sum of all N weights. This we will call err_m (for tree m out of a list of M trees). It is between 0 and 1 and thus can be thought of as a weighted probability of misclassification in tree m. You would prune to minimize err_m as you build tree m.
2. Compute the logit, α_m, of the weighted probability of correct classification: α_m = ln((1-err_m)/err_m) > ln(1) = 0. It is nonnegative because you would never misclassify more than half of the observations.
3. For the next tree, leave wi the same for correctly classified observations and multiply the old wi by e^(α_m) for observations that were incorrectly classified. This focuses the next tree on the observations its predecessor misclassified. Repeat for m=1,2,…,M to get M trees.
4. At the end, consider a point and its set of features. Run that point through all M trees to get your decision. Multiply each tree's decision (1 or -1) by the tree's α_m weight and add. If the weighted sum is > 0, call it an event (1), and if not, call it a nonevent (-1).
Boosting for estimates follows a similar process.
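A minimal numeric sketch of steps 1-3 in PROC IML, using a made-up set of four observations (the labels and the single tree's decisions are hypothetical):

proc iml;
   y     = {1, -1, 1, 1};              /* true classes, coded +1 / -1            */
   pred  = {1, -1, -1, 1};             /* tree m's decisions for the same points */
   w     = j(nrow(y), 1, 1/nrow(y));   /* start with weights 1/N                 */

   miss  = (y ^= pred);                /* 1 where tree m is wrong                */
   err_m = sum(w # miss) / sum(w);     /* step 1: weighted misclassification     */
   alpha = log((1 - err_m)/err_m);     /* step 2: the tree's weight, alpha_m     */

   w = w # exp(alpha # miss);          /* step 3: inflate misclassified weights  */
   print err_m alpha w;
quit;

Here one of the four points is misclassified, so err_m = .25, α_m = ln(3) ≈ 1.1, and the misclassified point's weight is tripled (e^ln(3) = 3) before the next tree is grown.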