Classification And Regression Trees
Stat 430

Outline
• More theoretical aspects of
  • tree algorithms
  • random forests
• Intro to Bootstrapping

Construction of Tree
• Starting with the root, find the best split at each node using an exhaustive search, i.e. for each variable Xi generate all possible splits and compute homogeneity; select the best split for the best variable

Gini Index
• Probabilistic view: for each node i we have class probabilities p_ik (with sample size n_i)
• Definition:
  G(i) = Σ_{k=1}^{K} p̂_ik (1 − p̂_ik),  with  p̂_ik = (1/n_i) Σ_{j=1}^{n_i} I(Y_j = k)

Deviance/Entropy/Information Criterion
For node i:
• Entropy: E(i) = − Σ_{k=1}^{K} p̂_ik log p̂_ik
• Deviance: D(i) = −2 Σ_{k=1}^{K} n_ik log p̂_ik

Find best split
• For a split at X = x0, we get subsets Y1 and Y2 of lengths n1 and n2, respectively:
  g(Y | X = x0) = (n1 g(Y1) + n2 g(Y2)) / (n1 + n2)
  where g is the homogeneity measure (e.g. the Gini index)
[Figure: classification tree for flight delays (splits Distance < 2459, Distance >= 728, Distance >= 4228), barcharts of Delayed by Distance, and the Gini index of homogeneity plotted against candidate split points on Distance]
• Split the data at the maximum, then repeat with each subset

Some Stopping Rules
• Nodes are homogeneous "enough"
• Nodes are small (n_i < 20 in rpart, n_i < 10 in tree)
• "Elbow" criterion: gain in homogeneity levels out
• Minimize error (e.g. cross-validation)
• Minimize cost-complexity measure: R_a = R + a · size (R: homogeneity measure evaluated at the leaves; a > 0: real-valued penalty term for complexity)

Diagnostics
• Prediction error (in a training/testing scheme)
• Misclassification matrix (for a categorical response)
• Loss matrix: adjust according to risk. Are all types of misclassification equally bad? In the binary situation: is a false positive as bad as a false negative?
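The exhaustive split search above can be sketched in a few lines. The lecture's examples use R; this is a Python sketch (function names `gini` and `best_split` are illustrative, not from the lecture) of the weighted measure g(Y | X = x0) with g taken to be the Gini index:

```python
# Sketch: exhaustive best-split search on one variable, using the
# weighted Gini index g(Y | X = x0) = (n1*g(Y1) + n2*g(Y2)) / (n1+n2).

def gini(labels):
    """Gini index G = sum_k p_k * (1 - p_k) for one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((labels.count(k) / n) * (1 - labels.count(k) / n)
               for k in set(labels))

def best_split(x, y):
    """Try every midpoint between consecutive sorted x-values and
    return (split point, weighted Gini) minimizing the weighted
    Gini of the two child nodes."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(ys)
    best_x0, best_g = None, float("inf")
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no split point between tied x-values
        x0 = (xs[i] + xs[i - 1]) / 2
        g = (i * gini(ys[:i]) + (n - i) * gini(ys[i:])) / n
        if g < best_g:
            best_x0, best_g = x0, g
    return best_x0, best_g

# Perfectly separable toy data: the split lands between 3 and 10,
# and both children are pure (weighted Gini 0).
x0, g = best_split([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])
```

A full tree algorithm would then repeat this search over every variable, split the data at the winning point, and recurse on each subset until a stopping rule fires.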
Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Random forests apply a bootstrap aggregating technique: the tree ensemble is built by randomly sampling cases and variables
• Each case is classified once for each tree in the ensemble
• Overall values are determined by 'voting' for a categorical response, or by (weighted) averaging in the case of a continuous response

Bootstrapping
• Efron 1982
• 'Pull ourselves out of the swamp by our shoe laces (= bootstraps)'
• We have one dataset, which gives us one specific statistic of interest. We do not know the distribution of that statistic.
• The idea is to use the data to 'create' a distribution.

Bootstrapping
• Resampling technique: from a dataset D of size n, sample n times with replacement to get D1, and again to get D2, D3, ..., DM for some fairly large M.
• Compute the statistic of interest for each Di. This yields a distribution against which we can compare the original value.

Example: Law Schools
• Average GPA and LSAT for admission from 15 law schools
• What is the correlation between GPA and LSAT? cor(LSAT, GPA) = 0.78
• What would be a confidence interval for this?
[Figure: scatterplot of GPA (280-340) against LSAT (560-660) for the 15 schools]

Percentile Bootstrap CI
(1) Sample with replacement from the data
(2) Compute the correlation
(3) Repeat M = 1000 times
(4) Get a (1 − a)·100% confidence interval by excluding the top a/2 and bottom a/2 percent of values

> summary(cors)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.02534  0.68990  0.79350  0.76910  0.88130  0.99390
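The four percentile-bootstrap steps above are easy to implement directly. The lecture works in R; here is a Python sketch (the helper names `pearson` and `percentile_ci` and the toy data are illustrative, not the law-school data):

```python
# Sketch of the percentile bootstrap CI, steps (1)-(4) above.
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def percentile_ci(x, y, m=1000, alpha=0.05, seed=1):
    """(1) resample (x, y) pairs with replacement, (2) compute the
    correlation, (3) repeat m times, (4) take the alpha/2 and
    1 - alpha/2 empirical quantiles of the bootstrap replicates."""
    rng = random.Random(seed)
    n = len(x)
    cors = []
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]
        cors.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    cors.sort()
    lo = cors[int(m * alpha / 2)]
    hi = cors[int(m * (1 - alpha / 2)) - 1]
    return lo, hi

# Toy data: 15 strongly correlated points standing in for the
# 15-school LSAT/GPA sample.
x = [float(i) for i in range(15)]
y = [xi + (3 * i) % 4 for i, xi in enumerate(x)]
lo, hi = percentile_ci(x, y)
```

In R the same computation is a one-liner per step: resample with `sample(n, replace=TRUE)`, collect the correlations, then call `quantile(cors, probs=c(0.025, 0.975))` as on the next slide.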
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4478862 0.9605759

Bootstrap Results
• M = 1000:
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4478862 0.9605759
• M = 5000:
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4654083 0.9629974
[Figure: histograms of the bootstrap correlations cors for M = 1000 and M = 5000]

Influence of size of M
[Figure: bootstrap estimates of the correlation coefficient (0.72-0.80) plotted against M from 10,000 to 50,000]
• For M ≥ 1000 the estimates of the correlation coefficient look reasonable.

Compare to all 82 Schools
• Unique situation: here we have data on the whole population (of all 82 law schools)
• The actual population value of the correlation is 0.76
[Figure: scatterplot of GPA against LSAT for all 82 schools]

Limitations of Bootstrap
• Bootstrap approaches are not good in boundary situations, e.g. finding a min or max:
• Assume u ~ U[0, ß]; the estimate of ß is max(u)

> summary(bhat)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.867   1.978   1.995   1.986   1.995   1.995
> quantile(bhat, probs=c(0.025, 0.975))
    2.5%    97.5%
1.956715 1.994739

Bootstrap Estimates
• Percentile CI: works well if the bootstrap distribution is symmetric and centered on the observed statistic; if not, it underestimates variability (the same happens if the sample is very small, n < 50)
• Other approaches: Basic Bootstrap, Studentized Bootstrap, Bias-Corrected Bootstrap, Accelerated Bootstrap

Bootstrapping in R
• packages boot, bootstrap
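The boundary problem with max(u) can be seen without any statistics package: a bootstrap resample is drawn from the observed data, so a resampled maximum can never exceed the observed maximum, which itself sits below ß. A Python sketch (names `bootstrap_max`, `u`, `bhat` are illustrative; the lecture's `bhat` output came from R):

```python
# Sketch: why the percentile bootstrap fails for the maximum of
# u ~ U[0, beta]. Every bootstrap replicate of max(u) is <= the
# observed max(u), which is strictly below beta, so the percentile
# interval can never cover the true beta.
import random

def bootstrap_max(u, m=1000, seed=1):
    """Resample u with replacement m times; return the m maxima."""
    rng = random.Random(seed)
    n = len(u)
    return [max(rng.choice(u) for _ in range(n)) for _ in range(m)]

rng = random.Random(0)
beta = 2.0
u = [rng.uniform(0, beta) for _ in range(100)]
bhat = bootstrap_max(u)
# max(bhat) <= max(u) < beta always holds, so the 97.5% quantile of
# bhat lies below beta: the CI is one-sidedly wrong at the boundary.
```

This is also why the `summary(bhat)` output above piles up at the observed maximum: each resample contains the observed maximum with probability about 1 − (1 − 1/n)^n ≈ 63%, so the median, 3rd quartile, and maximum of the bootstrap distribution all equal max(u).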