Classification and Regression Trees
Stat 430
Outline
• More theoretical aspects of
  • tree algorithms
  • random forests
• Intro to Bootstrapping
Construction of Tree
• Starting with the root, find the best split at each node using an exhaustive search: for each variable X_i, generate all possible splits and compute the homogeneity of the resulting subsets, then select the best split of the best variable.
Gini Index
• Probabilistic view: for each node i (with sample size n_i) we have class probabilities p_ik
• Definition:
  G(i) = \sum_{k=1}^{K} \hat{p}_{ik} (1 - \hat{p}_{ik}), with \hat{p}_{ik} = \frac{1}{n_i} \sum_{j=1}^{n_i} I(y_j = k)
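A minimal R sketch of this definition, assuming y holds the class labels of the observations in one node:

# Gini index of a node: sum over classes of p_hat * (1 - p_hat)
gini <- function(y) {
  p <- table(y) / length(y)   # estimated class probabilities p_hat_ik
  sum(p * (1 - p))
}

gini(factor(c("a", "a", "b", "b")))   # maximally mixed two-class node: 0.5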
Deviance / Entropy / Information Criterion
For node i:
• Entropy: E(i) = -\sum_{k=1}^{K} \hat{p}_{ik} \log \hat{p}_{ik}
• Deviance: D(i) = -2 \sum_{k=1}^{K} n_{ik} \log \hat{p}_{ik}
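The same two measures as a short R sketch (hypothetical helper names; y is again the vector of class labels in a node):

entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # drop empty classes so log(0) never occurs
  -sum(p * log(p))
}

deviance_node <- function(y) {
  n_k  <- table(y)              # class counts n_ik
  p    <- n_k / length(y)       # class proportions p_hat_ik
  keep <- n_k > 0
  -2 * sum(n_k[keep] * log(p[keep]))
}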
Find best split
• For a split at X = x_0 we get subsets Y_1 and Y_2 of lengths n_1 and n_2, respectively. For a homogeneity measure g, e.g. the Gini index g(Y) = \sum_{k=1}^{K} \hat{p}_k (1 - \hat{p}_k):
  g(Y \mid X = x_0) = \frac{1}{n_1 + n_2} \left( n_1 \, g(Y_1) + n_2 \, g(Y_2) \right)
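A sketch of the exhaustive search over one numeric predictor, scoring each candidate cut by this weighted Gini (best_split is a hypothetical name, base R only):

gini <- function(y) { p <- table(y) / length(y); sum(p * (1 - p)) }

best_split <- function(x, y) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2     # midpoints between observed values
  score <- vapply(cuts, function(cc) {
    left <- y[x < cc]; right <- y[x >= cc]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  }, numeric(1))
  list(cut = cuts[which.min(score)], gini = min(score))
}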
[Figure: marginal distributions f(happy) and f(sex) compared with the conditional distribution f(happy | sex).]
[Figure: classification tree for flight delays (Delayed TRUE/FALSE) with splits on Distance at 2459, 728, and 4228; bar chart of Delayed counts by Distance; Gini index of homogeneity plotted against candidate split points on Distance.]
Split the data at the maximum, then repeat with each subset.
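This "split, then recurse" step can be sketched in a few lines of R, reusing best_split() from the sketch above (grow is a hypothetical name, one numeric predictor only):

grow <- function(x, y, min_n = 20) {
  # stop when the node is small, pure, or cannot be split further
  if (length(y) < min_n || length(unique(y)) == 1 || length(unique(x)) == 1)
    return(list(leaf = TRUE, pred = names(which.max(table(y)))))
  s    <- best_split(x, y)          # best split of this node's data
  left <- x < s$cut
  list(leaf = FALSE, cut = s$cut,
       left  = grow(x[left],  y[left],  min_n),
       right = grow(x[!left], y[!left], min_n))
}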
Some Stopping Rules
• Nodes are homogeneous "enough"
• Minimize error (e.g. via cross-validation)
• Nodes are small (n_i < 20 in rpart, n_i < 10 in tree)
• "Elbow" criterion: gain in homogeneity levels out
• Minimize a cost-complexity measure:
  R_a = R + a \cdot \text{size}
  (R: homogeneity measure evaluated at the leaves; a > 0: a real-valued penalty term for complexity)
Diagnostics
• Prediction error (in a training/testing scheme)
• Misclassification matrix (for a categorical response)
• Loss matrix: adjust misclassification costs according to risk
  - Are all types of misclassification equally bad?
  - In a binary situation: is a false positive as bad as a false negative?
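One way to encode such asymmetric costs is rpart's loss matrix; a sketch, again assuming the hypothetical flights data, where a false negative is made five times as costly as a false positive:

library(rpart)
# rows = true class (FALSE, TRUE), columns = predicted class; diagonal is 0
loss <- matrix(c(0, 1,    # true FALSE predicted TRUE (false positive) costs 1
                 5, 0),   # true TRUE predicted FALSE (false negative) costs 5
               nrow = 2, byrow = TRUE)
fit <- rpart(Delayed ~ Distance, data = flights, method = "class",
             parms = list(loss = loss))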
Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Each case is classified once by each tree in the ensemble
• Overall values are determined by 'voting' for a categorical response, or by (weighted) averaging for a continuous response
• Random forests apply a bootstrap aggregating technique
• The tree ensemble is built by randomly sampling cases and variables
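A minimal sketch with the randomForest package (hypothetical flights data with Delayed as a factor):

library(randomForest)
rf <- randomForest(Delayed ~ ., data = flights, ntree = 500)
rf              # prints the out-of-bag error estimate and confusion matrix
varImpPlot(rf)  # which variables matter most across the ensemble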
Bootstrapping
• Efron 1982
• 'Pull ourselves out of the swamp by our shoe laces (= bootstraps)'
• We have one dataset which gives
us one specific statistic of interest.
We do not know the distribution
of the statistic.
• The idea is that we use the data to
‘create’ a distribution.
Bootstrapping
• Resampling Technique:
from a dataset D of size n
sample n times with replacement
to get D1, and again to get D2,
D3, ..., DM for some fairly large M.
• Compute the statistic of interest for each D_i; this yields a distribution against which we can compare the original value.
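A minimal sketch of this resampling loop in R, with hypothetical data and the median as the statistic of interest:

# draw M bootstrap resamples and compute the statistic on each
boot_dist <- function(d, stat, M = 1000) {
  replicate(M, stat(sample(d, replace = TRUE)))  # n draws with replacement each time
}

d    <- rnorm(100)              # hypothetical data
meds <- boot_dist(d, median)    # bootstrap distribution of the median
# compare the observed median(d) against this distribution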
Example: Law Schools
• Average GPA and LSAT scores for admission from 15 law schools
• What is the correlation between GPA and LSAT?
• cor(LSAT, GPA) = 0.78
• What would be a confidence interval for this?
[Figure: scatter plot of GPA against LSAT for the 15 schools.]
Percentile Bootstrap CI
(1) Sample with replacement from the data
(2) Compute the correlation
(3) Repeat M = 1000 times

> summary(cors)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.02534  0.68990  0.79350  0.76910  0.88130  0.99390 

(4) Get a (1 - a)·100% confidence interval by excluding the top a/2 and bottom a/2 of the values

> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5% 
0.4478862 0.9605759 
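A sketch of how cors could be produced, using the 15-school law data shipped with the bootstrap package (seed chosen here only for reproducibility):

data(law, package = "bootstrap")   # LSAT and GPA for the 15 schools
set.seed(430)                      # hypothetical seed
M <- 1000
cors <- replicate(M, {
  idx <- sample(nrow(law), replace = TRUE)   # resample rows with replacement
  cor(law$LSAT[idx], law$GPA[idx])
})
summary(cors)
quantile(cors, probs = c(0.025, 0.975))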
Bootstrap Results
• M = 1000:
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5% 
0.4478862 0.9605759 
• M = 5000:
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5% 
0.4654083 0.9629974 
[Figure: histograms of the bootstrap correlations cors for M = 1000 and M = 5000.]
Influence of size of M
[Figure: bootstrap estimate of the correlation plotted against M, for M from 10,000 to 50,000.]
• For M ≥ 1000 the estimates of the correlation coefficient look reasonable.
Compare to all 82 Schools
• Unique situation: here we have data on the whole population (of all 82 law schools)
• The actual population value of the correlation is 0.76
[Figure: scatter plot of 100 * GPA against LSAT for all 82 law schools.]
Limitations of Bootstrap
• Bootstrap approaches are not good in boundary situations, e.g. finding a min or max:
• Assume u ~ U[0, β]; estimate of β: max(u)

> summary(bhat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.867   1.978   1.995   1.986   1.995   1.995 

> quantile(bhat, probs=c(0.025, 0.975))
    2.5%    97.5% 
1.956715 1.994739 

• A bootstrap replicate of max(u) can never exceed the observed maximum, so the interval sits entirely below the true β (apparently 2 here).
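A sketch that reproduces this failure mode (hypothetical sample size and seed):

set.seed(430)                  # hypothetical seed
beta <- 2
u    <- runif(50, 0, beta)     # observed sample from U[0, beta]
bhat <- replicate(1000, max(sample(u, replace = TRUE)))
summary(bhat)
quantile(bhat, probs = c(0.025, 0.975))
# every resampled max is <= max(u) < beta, so the CI cannot cover beta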
Bootstrap Estimates
• Percentile CI: works well if the bootstrap distribution is symmetric and centered on the observed statistic; if not, it underestimates variability (the same happens if the sample is very small, n < 50)
• Other approaches: Basic Bootstrap, Studentized Bootstrap, Bias-Corrected Bootstrap, Accelerated Bootstrap
Bootstrapping in R
• packages boot, bootstrap
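The same percentile CI via the boot package, a sketch reusing the law data from the example above:

library(boot)
data(law, package = "bootstrap")
# boot() expects a statistic of the form function(data, indices)
cor_stat <- function(d, idx) cor(d$LSAT[idx], d$GPA[idx])
b <- boot(law, cor_stat, R = 1000)
boot.ci(b, type = "perc")      # percentile bootstrap confidence interval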