
4.4
(a) 0.1
(b) 0.01
(c) 0.1^100
(d) Because of the curse of dimensionality, observations are far from one another in high
dimensions. As a result, there are very few training observations near any given test
observation.
(e) The hypercube must contain on average 10% of the observations, so its side length is 0.1^(1/p):
When p = 1, length = 0.1
When p = 2, length = 0.1^(1/2) ≈ 0.316
When p = 100, length = 0.1^(1/100) ≈ 0.977
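A quick numerical check of these side lengths, as a minimal Python sketch (the values follow directly from requiring the hypercube to have volume 0.1):

    # Side length of a p-dimensional hypercube covering, on average, 10% of the data.
    for p in (1, 2, 100):
        print(p, round(0.1 ** (1 / p), 3))
    # prints: 1 0.1 / 2 0.316 / 100 0.977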
4.8
For K = 1, the training error of KNN is zero, since each training observation's nearest
neighbor is itself. The 18% figure is the average of the training and test errors, so the
test error must be 2 × 18% − 0% = 36%, which is higher than the 30% test error of logistic
regression. Consequently, we choose logistic regression.
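As a minimal sketch (with made-up data, not the data from the exercise), the following confirms that the training error of 1-nearest-neighbor is always zero:

    # With K = 1, each training point is its own nearest neighbor, so training error is 0.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))          # illustrative features
    y = rng.integers(0, 2, size=100)       # illustrative labels

    knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print((knn.predict(X) != y).mean())    # 0.0
    # Since (training error + test error) / 2 = 18% and training error = 0%, test error = 36%.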
4.10
(a) Compute the correlation matrix. Year and Volume appear to be strongly correlated.
(b) Only the intercept and Lag2 are statistically significant.
(c) According to the confusion matrix, when the market goes down, the logistic regression
tends to make the wrong prediction most of the time (see the sketch after this list).
(d) Logistic regression with Lag2 only: 62.5% of the held-out predictions are correct.
(e) LDA: 62.5%.
(f) QDA: 58.65%.
(g) KNN with K = 1: 50%.
(h) Logistic regression and LDA provide the best results on this data.
(i) Answer may vary.
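A minimal sketch for parts (b)-(d), assuming the Weekly data set from the textbook has been exported to a file named Weekly.csv with columns Year, Lag1-Lag5, Volume, and Direction (the file name is an assumption, not part of the exercise):

    import pandas as pd
    import statsmodels.api as sm

    weekly = pd.read_csv("Weekly.csv")
    weekly["Up"] = (weekly["Direction"] == "Up").astype(int)

    # (b) Logistic regression of Direction on the five lags and Volume.
    X_full = sm.add_constant(weekly[["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]])
    fit_full = sm.Logit(weekly["Up"], X_full).fit(disp=0)
    print(fit_full.summary())              # only the intercept and Lag2 have small p-values

    # (c) Confusion matrix on the training data.
    pred_full = (fit_full.predict(X_full) > 0.5).astype(int)
    print(pd.crosstab(pred_full, weekly["Up"]))

    # (d) Lag2 only, trained on 1990-2008 and evaluated on 2009-2010.
    train = weekly["Year"] <= 2008
    X_lag2 = sm.add_constant(weekly[["Lag2"]])
    fit_lag2 = sm.Logit(weekly.loc[train, "Up"], X_lag2[train]).fit(disp=0)
    pred_test = (fit_lag2.predict(X_lag2[~train]) > 0.5).astype(int)
    print((pred_test == weekly.loc[~train, "Up"]).mean())   # about 0.625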
5.3
(a) As explained on page 181 of the textbook, k-fold cross validation involves
randomly dividing the set of observations into k groups, or folds, of approximately
equal size. The first fold is treated as a validation set, and the method is fit on the
remaining k − 1 folds. The mean squared error, MSE1, is then computed on the
observations in the held-out fold. This procedure is repeated k times; each time, a
different group of observations is treated as a validation set. This process results in
k estimates of the test error, MSE_1, MSE_2, ..., MSE_k. The k-fold CV estimate is
computed by averaging these values.
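A minimal Python sketch of the procedure just described, using scikit-learn on placeholder data (the data and the linear model are illustrative, not part of the exercise):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

    mses = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])     # fit on k-1 folds
        mses.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    print(np.mean(mses))    # the k-fold CV estimate: the average of MSE_1, ..., MSE_k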
(b) i) Advantage of k-fold cross validation relative to the validation set approach: As
explained on page 178, the validation estimate of the test error rate can be highly
variable, depending on precisely which observations are included in the training set
and which observations are included in the validation set. Moreover, the validation set
error rate tends to overestimate the test error rate for the model fit on the entire
data set, because the model is trained on only a subset of the observations.
Disadvantage of k-fold cross validation relative to the validation set approach: As
explained on page 177, the validation set approach is conceptually simpler and easier to
implement, whereas k-fold CV requires fitting the model k separate times.
ii) As noted on page 181, LOOCV is a special case of k-fold CV with k = n.
Advantage of k-fold cross validation relative to LOOCV: LOOCV requires fitting the
statistical learning method n times, which can be computationally expensive. Moreover, as
explained on page 183, k-fold CV often gives more accurate estimates of the test error
rate than does LOOCV.
Disadvantage of k-fold cross validation relative to LOOCV: if the main purpose is bias
reduction, LOOCV should be preferred to k-fold CV, since it tends to have less bias.
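A small illustration, again with placeholder data, of the fact that LOOCV is k-fold CV with k = n and of its computational cost:

    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

    loo = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                          scoring="neg_mean_squared_error")
    kfn = cross_val_score(LinearRegression(), X, y, cv=KFold(n_splits=len(X)),
                          scoring="neg_mean_squared_error")
    print(np.allclose(loo, kfn))   # True: identical splits, but each needs n model fits,
                                   # whereas 10-fold CV needs only 10 fits.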
5.4
We can use the bootstrap approach. The bootstrap works by repeatedly sampling
observations from the original data set with replacement, B times for some large value
of B, each time fitting a new model and obtaining a prediction for Y at the given value
of X. The standard deviation of these B predictions is our estimate of the standard
deviation of the prediction.
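A minimal sketch of this bootstrap procedure, with placeholder data and an arbitrary prediction point x0 (neither comes from the exercise):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([2.0, -1.0]) + rng.normal(size=200)
    x0 = np.array([[0.5, 1.0]])            # the point at which we want a prediction

    B = 1000
    preds = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # resample rows with replacement
        preds[b] = LinearRegression().fit(X[idx], y[idx]).predict(x0)[0]
    print(preds.std(ddof=1))   # bootstrap estimate of the standard deviation of the prediction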
5.5
(a)-(c) Answers may vary; one possible implementation is sketched after this exercise.
(d) Including a dummy variable for student does not appear to lead to a reduction in the
test error rate.
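A minimal sketch of the validation set approach for parts (a)-(c), assuming the Default data set has been exported to a file named Default.csv with columns default, student, balance, and income (the file name is an assumption):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    default = pd.read_csv("Default.csv")
    default["y"] = (default["default"] == "Yes").astype(int)
    X = sm.add_constant(default[["income", "balance"]])

    for seed in (1, 2, 3):                       # parts (b)-(c): repeat with different splits
        rng = np.random.default_rng(seed)
        train = rng.random(len(default)) < 0.5   # random half for training, half for validation
        fit = sm.Logit(default.loc[train, "y"], X[train]).fit(disp=0)
        pred = (fit.predict(X[~train]) > 0.5).astype(int)
        print(seed, (pred != default.loc[~train, "y"]).mean())   # validation error varies by split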