Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 12a, April 15, 2014
1
• Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
• The goal of cross-validation is to define a dataset to "test" the model during the training phase (i.e., the validation dataset), in order to limit problems such as overfitting.
2
• The original sample is randomly partitioned into k equal-size subsamples.
• Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.
• The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data.
– The k results from the folds can then be averaged (usually) to produce a single estimation.
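A minimal sketch (not from the slides) of k-fold cross-validation in R; the built-in mtcars data, the mpg ~ wt + hp model, and k = 5 are illustrative assumptions:
# assign each observation to one of k folds at random
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
errs <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])   # train on k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == i, ])     # predict the held-out fold
  mean((mtcars$mpg[folds == i] - pred)^2)                  # fold-level squared error
})
mean(errs)   # average the k fold results into a single estimate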
3
• As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data.
• i.e., K = n (n-fold cross-validation)
• Leaving out more than one observation at a time relates to bootstrapping and jackknifing
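A minimal LOOCV sketch under the same illustrative assumptions (built-in mtcars data, mpg ~ wt + hp model), i.e., K = n:
n <- nrow(mtcars)
loo.err <- sapply(1:n, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])       # train on the n-1 remaining rows
  (mtcars$mpg[i] - predict(fit, newdata = mtcars[i, , drop = FALSE]))^2  # error on the held-out row
})
mean(loo.err)   # leave-one-out estimate of prediction error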
4
• Generate replicates of a statistic applied to data (parametric and nonparametric).
– For the nonparametric bootstrap, possible methods are: the ordinary bootstrap, the balanced bootstrap, antithetic resampling, and permutation.
• For nonparametric multi-sample problems, stratified resampling is used:
– this is specified by including a vector of strata in the call to boot.
– importance resampling weights may be specified.
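A minimal sketch (not from the slides) of a nonparametric bootstrap with the boot package; the data vector and the choice of the median as the statistic are illustrative assumptions:
library(boot)
set.seed(1)
x <- rnorm(100)                                            # hypothetical data
med.fun <- function(data, indices) median(data[indices])   # statistic(data, resampled indices)
b <- boot(data = x, statistic = med.fun, R = 1000)         # 1000 ordinary bootstrap replicates
b                           # prints the original estimate, bias, and standard error
boot.ci(b, type = "perc")   # percentile bootstrap confidence interval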
5
• Systematically recompute the statistic estimate, leaving out one or more observations at a time from the sample set
• From this new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of the statistic can be calculated.
• Often the log(variance) is used [instead of the variance], especially for non-normal distributions
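A minimal sketch (not from the slides) of a leave-one-out jackknife for the mean; the data vector is an illustrative assumption:
set.seed(1)
x <- rnorm(50)                                              # hypothetical data
n <- length(x)
theta.hat  <- mean(x)
theta.jack <- sapply(seq_len(n), function(i) mean(x[-i]))   # recompute, leaving one observation out at a time
bias.jack  <- (n - 1) * (mean(theta.jack) - theta.hat)      # jackknife bias estimate
var.jack   <- (n - 1) / n * sum((theta.jack - mean(theta.jack))^2)   # jackknife variance estimate
c(bias = bias.jack, variance = var.jack)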
6
• Random split of the dataset into training and validation data.
– For each such split, the model is fit to the training data, and predictive accuracy is assessed using the validation data.
• Results are then averaged over the splits.
• Note: for this method the results will vary if the analysis is repeated with different random splits.
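A minimal sketch (not from the slides) of repeated random sub-sampling validation; the mtcars data, the model, the 2/3 split, and 20 repetitions are illustrative assumptions:
set.seed(42)
R <- 20                                  # number of random splits
err <- numeric(R)
for (r in seq_len(R)) {
  idx   <- sample(nrow(mtcars), size = round(2/3 * nrow(mtcars)))   # random training split
  fit   <- lm(mpg ~ wt + hp, data = mtcars[idx, ])                  # fit to the training data
  test  <- mtcars[-idx, ]
  err[r] <- mean((test$mpg - predict(fit, newdata = test))^2)       # accuracy on the validation data
}
mean(err)                                # average over the splits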
7
• The advantage of K-fold over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once.
– 10-fold cross-validation is commonly used
• The advantage of rep-random over k-fold cross-validation is that the proportion of the training/validation split is not dependent on the number of iterations (folds).
8
• The disadvantage of rep-random is that some observations may never be selected in the validation subsample, whereas others may be selected more than once.
– i.e., validation subsets may overlap.
9
> head(coleman)
  salaryP fatherWc sstatus teacherSc motherLev     Y
1    3.83    28.87    7.20      26.6      6.19 37.01
2    2.89    20.10  -11.71      24.4      5.17 26.51
3    2.86    69.05   12.32      25.7      7.04 36.51
4    2.92    65.40   14.28      25.7      7.10 40.70
5    3.06    29.59    6.31      25.4      6.15 37.10
6    2.07    44.82    6.16      21.6      6.41 33.90
10
> library(robustbase)   # lmrob() and the coleman data
> library(cvTools)      # cvFolds(), cvTool(), rtmspe()
> call <- call("lmrob", formula = Y ~ .)
> # set up folds for cross-validation
> folds <- cvFolds(nrow(coleman), K = 5, R = 10)
> # perform cross-validation
> cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe,
+   folds = folds, costArgs = list(trim = 0.1))
          CV
 [1,] 0.9880672
 [2,] 0.9525881
 [3,] 0.8989264
 [4,] 1.0177694
 [5,] 0.9860661
 [6,] 1.8369717
 [7,] 0.9550428
 [8,] 1.0698466
 [9,] 1.3568537
[10,] 0.8313474
Warning messages:
1: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
2: In lmrob.S(x, y, control = control) :
  S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps
3: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
4: In lmrob.S(x, y, control = control) :
  find_scale() did not converge in 'maxit.scale' (= 200) iterations
11
> cvFits
5-fold CV results:
Fit CV
1 LS 1.674485
2 MM 1.147130
3 LTS 1.291797
Best model:
CV
"MM"
12
# 50% subsets
fitLts50 <- ltsReg(Y ~ ., data = coleman, alpha = 0.5)
cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# 75% subsets
fitLts75 <- ltsReg(Y ~ ., data = coleman, alpha = 0.75)
cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds,
                    fit = "both", trim = 0.1)
# combine and plot results
cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75)
13
> cvFitsLts
5-fold CV results:
   Fit reweighted      raw
1  0.5   1.291797 1.640922
2 0.75   1.065495 1.232691

Best model:
reweighted        raw
    "0.75"     "0.75"
14
# grid of tuning constants for the lmrob() psi function
tuning <- list(tuning.psi = c(3.14, 3.44, 3.88, 4.68))
# perform cross-validation over the tuning grid
cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y,
                        tuning = tuning, cost = rtmspe, folds = folds,
                        costArgs = list(trim = 0.1))
15
5-fold CV results:
  tuning.psi       CV
1       3.14 1.179620
2       3.44 1.156674
3       3.88 1.169436
4       4.68 1.133975

Optimal tuning parameter:
   tuning.psi
CV       4.68
16
# requires the boot package; the mammals data set is in MASS
library(boot)
data(mammals, package = "MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
# leave-one-out cross-validation (the default)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
[1] 0.4918650 0.4916571
# 6-fold cross-validation
> (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
[1] 0.4967271 0.4938003
# As this is a linear model we could calculate the leave-one-out
# cross-validation estimate without any extra model-fitting.
muhat <- fitted(mammals.glm)
mammals.diag <- glm.diag(mammals.glm)
(cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2))
[1] 0.491865
17
# leave-one-out and 11-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable
# an appropriate cost function is
> cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta)
[1] 0.1886792 0.1886792
> (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta)
[1] 0.2264151 0.2228551
18
• Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
• Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests); a sketch follows below
• Validating models by using random subsets (bootstrapping, cross-validation)
19
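A minimal sketch (not from the slides) of the permutation-test idea, exchanging labels on data points to test a difference in means; the data and the number of relabelings are illustrative assumptions:
set.seed(7)
x <- rnorm(20, mean = 0)            # hypothetical sample 1
y <- rnorm(20, mean = 0.5)          # hypothetical sample 2
obs <- mean(x) - mean(y)            # observed test statistic
pooled <- c(x, y)
perm <- replicate(2000, {
  idx <- sample(length(pooled), length(x))   # exchange the group labels at random
  mean(pooled[idx]) - mean(pooled[-idx])     # statistic under relabeling
})
mean(abs(perm) >= abs(obs))         # two-sided permutation p-value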
• Bagging improves the stability and accuracy of machine learning algorithms used in statistical classification and regression.
• Also reduces variance and helps to avoid overfitting.
• Usually applied to decision tree methods, but can be used with any type of method.
– Bagging is a special case of the model averaging approach.
20
[Figure: 10 of 100 bootstrap samples and their average]
21
• Shows improvements for unstable procedures (Breiman, 1996): e.g. neural nets, classification and regression trees, and subset selection in linear regression
• … can mildly degrade the performance of stable methods such as K-nearest neighbors
22
library(adabag)    # bagging() and predict.bagging()
library(rpart)     # rpart.control()
library(mlbench)   # BreastCancer data
data(BreastCancer)
l <- length(BreastCancer[,1])
sub <- sample(1:l, 2*l/3)
BC.bagging <- bagging(Class ~ ., data=BreastCancer[,-1], mfinal=20,
                      control=rpart.control(maxdepth=3))
BC.bagging.pred <- predict.bagging(BC.bagging, newdata=BreastCancer[-sub,-1])
BC.bagging.pred$confusion
               Observed Class
Predicted Class benign malignant
      benign       142         2
      malignant      8        81
BC.bagging.pred$error
[1] 0.04291845
23
> data(BreastCancer)
> l <- length(BreastCancer[,1])
> sub <- sample(1:l,2*l/3)
> BC.bagging <- bagging(Class ~.,data=BreastCancer[,-1],mfinal=20,
+ control=rpart.control(maxdepth=3))
> BC.bagging.pred <- predict.bagging(BC.bagging,newdata=BreastCancer[sub,-1])
> BC.bagging.pred$confusion
               Observed Class
Predicted Class benign malignant
      benign       147         1
      malignant      7        78
> BC.bagging.pred$error
[1] 0.03433476
24
> data(Vehicle)
> l <- length(Vehicle[,1])
> sub <- sample(1:l,2*l/3)
> Vehicle.bagging <- bagging(Class ~.,data=Vehicle[sub, ],mfinal=40,
+ control=rpart.control(maxdepth=5))
> Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging, newdata=Vehicle[-sub, ])
> Vehicle.bagging.pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   63   10    8   0
           opel   1   42   27   0
           saab   0   18   30   0
           van    5    7    9  62
> Vehicle.bagging.pred$error
[1] 0.3014184
25
• A weak learner: a classifier which is only slightly correlated with the true classification
(it can label examples better than random guessing)
• A strong learner: a classifier that is arbitrarily well-correlated with the true classification.
• Can a set of weak learners create a single strong learner?
26
• Boosting is primarily aimed at reducing bias in supervised learning
• Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier.
– typically weighted in some way that is usually related to the weak learners' accuracy.
• After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight
• Thus, future weak learners focus more on the examples that previous weak learners misclassified.
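A minimal sketch (not from the slides) of AdaBoost-style boosting with the adabag package, reusing the Vehicle data from the bagging example a few slides earlier; the mfinal and maxdepth values are illustrative assumptions:
library(adabag)    # boosting() and predict.boosting()
library(rpart)     # rpart.control()
library(mlbench)   # Vehicle data
data(Vehicle)
sub <- sample(nrow(Vehicle), 2*nrow(Vehicle)/3)
Vehicle.adaboost <- boosting(Class ~ ., data = Vehicle[sub, ], mfinal = 20,
                             control = rpart.control(maxdepth = 5))
Vehicle.adaboost.pred <- predict.boosting(Vehicle.adaboost, newdata = Vehicle[-sub, ])
Vehicle.adaboost.pred$confusion   # confusion matrix on the held-out data
Vehicle.adaboost.pred$error       # misclassification error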
27
require(ggplot2) # or load the package first
data(diamonds)
head(diamonds)   # look at the data!
# ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut)
ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000)
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
28
ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))
29
30
> library(mboost)   # provides glmboost() and Binomial()
> # 'Expensive' is assumed to be a binary variable added to diamonds earlier (not shown here)
> mglmboost <- glmboost(as.factor(Expensive) ~ ., data=diamonds, family=Binomial(link="logit"))
> summary(mglmboost)
Generalized Linear Models Fitted via Gradient Boosting
Call: glmboost.formula(formula = as.factor(Expensive) ~ ., data = diamonds, family = Binomial(link = "logit"))
Negative Binomial Likelihood
Loss function: {
    f <- pmin(abs(f), 36) * sign(f)
    p <- exp(f)/(exp(f) + exp(-f))
    y <- (y + 1)/2
    -y * log(p) - (1 - y) * log(1 - p)
}
31
> summary(mglmboost) #continued
Number of boosting iterations: mstop = 100
Step size: 0.1
Offset: -1.339537
Coefficients:
NOTE: Coefficients from a Binomial model are half the size of coefficients from a model fitted via glm(... , family = 'binomial').
See Warning section in ?coef.mboost
(Intercept)       carat   clarity.L
 -1.5156330   1.5388715   0.1823241
attr(,"offset")
[1] -1.339537

Selection frequencies:
      carat (Intercept)   clarity.L
       0.50        0.42        0.08
32
cars.gb <- blackboost(dist ~ speed, data = cars,
                      control = boost_control(mstop = 50))
### plot fit
plot(dist ~ speed, data = cars)
lines(cars$speed, predict(cars.gb), col = "red")
33
Gradient boosting for optimizing arbitrary loss functions where regression trees are utilized as base-learners.
> cars.gb
Model-based Boosting
Call: blackboost(formula = dist ~ speed, data = cars, control = boost_control(mstop = 50))
Squared Error (Regression)
Loss function: (y - f)^2
Number of boosting iterations: mstop = 50
Step size: 0.1
Offset: 42.98
Number of baselearners: 1
34
• Later today
35
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A announced by email)
• TA: Lakshmi Chenicheri chenil@rpi.edu
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.
36