
Stat 502X Homework
Assignment #1 (Due 1/31/14)
Regarding the curse of dimensionality/necessary sparsity of data sets in $\mathbb{R}^p$ for moderate to large $p$:
1. For each of $p = 10, 20, 50, 100, 500, 1000$ make $n = 10{,}000$ draws of distances between pairs of independent points uniform in the cube $[0,1]^p$. Use these to make 95% confidence limits for the ratio
$$\frac{\text{mean distance between two random points in the cube}}{\text{maximum distance between two points in the cube}}$$
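A minimal Monte Carlo sketch of the intended computation (the maximum possible distance in $[0,1]^p$ is $\sqrt{p}$, and a normal-theory interval is one reasonable choice of confidence limits):

    # Sketch for problem 1: simulated distance ratios and 95% confidence limits
    set.seed(502)
    n <- 10000
    for (p in c(10, 20, 50, 100, 500, 1000)) {
      d <- sqrt(rowSums((matrix(runif(n * p), n, p) - matrix(runif(n * p), n, p))^2))
      ratio <- d / sqrt(p)                 # maximum possible distance in [0,1]^p is sqrt(p)
      ci <- mean(ratio) + c(-1, 1) * qnorm(.975) * sd(ratio) / sqrt(n)
      cat(p, round(ci, 4), "\n")
    }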
2. For each of $p = 10, 20, 50$ make $n = 10{,}000$ random draws of $N = 100$ independent points uniform in the cube $[0,1]^p$. Find, for each sample of 100 points, the distance from the first point drawn to the 5th closest point of the other 99. Use these to make 95% confidence limits for the ratio
$$\frac{\text{mean diameter of a 5-nearest-neighbor neighborhood if } N = 100}{\text{maximum distance between two points in the cube}}$$
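A sketch along the same lines for this part:

    # Sketch for problem 2: distance from point 1 to its 5th closest neighbor among the other 99
    set.seed(502)
    n <- 10000
    for (p in c(10, 20, 50)) {
      d5 <- replicate(n, {
        X <- matrix(runif(100 * p), 100, p)
        sort(sqrt(colSums((t(X[-1, ]) - X[1, ])^2)))[5]
      })
      ratio <- d5 / sqrt(p)
      ci <- mean(ratio) + c(-1, 1) * qnorm(.975) * sd(ratio) / sqrt(n)
      cat(p, round(ci, 4), "\n")
    }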
3. What fraction of random draws uniform from the unit cube $[0,1]^p$ lie in the "middle part" of the cube, $[\epsilon, 1-\epsilon]^p$, for a small positive number $\epsilon$?
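The fraction can be checked empirically by simulation; the values of $p$ and $\epsilon$ below are illustrative choices:

    # Empirical check sketch for problem 3
    set.seed(502)
    p <- 50; eps <- 0.05
    U <- matrix(runif(1e5 * p), ncol = p)
    mean(apply(U > eps & U < 1 - eps, 1, all))   # fraction of draws in the "middle part"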
Regarding decomposition of Err:
4. Suppose that (unknown to a statistician) a mechanism generates iid data pairs $(x, y)$ according to the following model:
$$x \sim \mathrm{U}(-\pi, \pi)$$
$$y \mid x \sim \mathrm{N}\!\left(\sin(x),\ .25\,(x+1)^2\right)$$
(The conditional variance is $.25\,(x+1)^2$.)
a) What is the absolute minimum value of Err possible, regardless of what training set size $N$ is available and what fitting method is employed?
b) What linear function of $x$ (say, $g(x) = a + bx$) has the smallest "average squared bias" as a predictor for $y$? What cubic function of $x$ (say, $g(x) = a + bx + cx^2 + dx^3$) has the smallest average squared bias as a predictor for $y$? Is the set of cubic functions big enough to eliminate model bias in this problem?
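To make the mechanism concrete, a sketch of simulating pairs from this model (the data set used in problem 5 is generated this way):

    # Simulate N pairs from the problem-4 model
    set.seed(502)
    N <- 100
    x <- runif(N, -pi, pi)
    y <- rnorm(N, mean = sin(x), sd = .5 * abs(x + 1))   # sd .5|x+1| gives variance .25(x+1)^2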
Regarding cross-validation as a method of choosing a predictor:
5. Vardeman will send out an $N = 100$ data set generated by the model of problem 4. Use ten-fold cross-validation (use the 1st ten points as the first test set, the 2nd ten points as the second, etc.) based on the data set to choose among the following methods of prediction for this scenario:
• polynomial regressions of orders 0, 1, 2, 3, 4, and 5
• regressions using the sets of predictors $\{1, \sin x, \cos x\}$ and $\{1, \sin x, \cos x, \sin 2x, \cos 2x\}$
• a regression with the set of predictors $\{1, x, x^2, x^3, x^4, x^5, \sin x, \cos x, \sin 2x, \cos 2x\}$
(Use ordinary least squares fitting.) Which predictor looks best on an empirical basis? Knowing how the data were generated (an unrealistic luxury), which methods here are without model bias?
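A sketch of the prescribed cross-validation for a single candidate method (the degree-3 polynomial is only an illustration; x and y stand for the values in the distributed data set):

    # 10-fold CV with consecutive blocks of 10 as the test sets
    df <- data.frame(x = x, y = y)
    cv_sse <- 0
    for (k in 1:10) {
      test <- ((k - 1) * 10 + 1):(k * 10)
      fit <- lm(y ~ poly(x, 3, raw = TRUE), data = df[-test, ])
      cv_sse <- cv_sse + sum((df$y[test] - predict(fit, newdata = df[test, ]))^2)
    }
    cv_sse / 100   # cross-validation mean squared prediction error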
Regarding some linear algebra and principal components analysis:
6. Consider the small ($7 \times 3$) fake $\mathbf{X}$ matrix below.
10 10 .1 


 11 11 .1 
9 9
0 


X   11 9 2.1 
 9 11 2.1 


12 8 4.0 
 8 12 4.0 


(Note, by the way, that $x_3 \approx |x_2 - x_1|$.)
a) Find the QR and singular value decompositions of $\mathbf{X}$. Use the latter to give the best rank 1 and rank 2 approximations to $\mathbf{X}$.
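A sketch of the relevant computations:

    # 6a): QR and singular value decompositions of X, and low-rank approximations
    X <- matrix(c(10, 10, .1,
                  11, 11, .1,
                   9,  9,  0,
                  11,  9, 2.1,
                   9, 11, 2.1,
                  12,  8, 4.0,
                   8, 12, 4.0), ncol = 3, byrow = TRUE)
    qr.Q(qr(X)); qr.R(qr(X))                      # QR decomposition
    s <- svd(X)                                   # X = U D V'
    X1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])       # best rank-1 approximation
    X2 <- X1 + s$d[2] * s$u[, 2] %*% t(s$v[, 2])  # best rank-2 approximation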
b) Subtract column means from the columns of X to make a centered data matrix. Find the
singular value decomposition of this matrix. Is it approximately the same as that in part a)?
Give the 3 vectors of the principal component scores. What are the principal components for
case 1?
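A sketch for part b), continuing from the X defined in the sketch above:

    # 6b): centered data matrix, its SVD, and principal component scores
    Xc <- scale(X, center = TRUE, scale = FALSE)
    sc <- svd(Xc)
    scores <- Xc %*% sc$v   # the 3 vectors of principal component scores
    scores[1, ]             # principal components for case 1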
Henceforth consider only the centered data matrix of b).
c) What are the singular values? How do you interpret their relative sizes in this context? What are the first two principal component directions? What are the loadings of the first two principal component directions on $x_3$? What is the third principal component direction? Make scatterplots of the 7 points $(x_1, x_2)$ and then of the 7 points with first coordinate the 1st principal component score and second coordinate the 2nd principal component score. How do these compare? Do you expect them to be similar in light of the sizes of the singular values?
c') Find the matrices $\mathbf{X}\mathbf{v}_j\mathbf{v}_j'$ for $j = 1, 2, 3$ and the best rank 1 and rank 2 approximations to $\mathbf{X}$. How are the latter related to the former?
d) Compute the ($N$ divisor) $3 \times 3$ sample covariance matrix for the 7 cases. Then find its singular value decomposition and its eigenvalue decomposition. Are the eigenvectors of the sample covariance matrix related to the principal component directions of the (centered) data matrix? If so, how? Are the eigenvalues/singular values of the sample covariance matrix related to the singular values of the (centered) data matrix? If so, how?
e) The functions
$$K_1(\mathbf{x}, \mathbf{z}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{z}\|^2\right) \quad \text{and} \quad K_2(\mathbf{x}, \mathbf{z}) = \left(1 + \langle \mathbf{x}, \mathbf{z}\rangle\right)^d$$
are legitimate kernel functions for choice of $\gamma > 0$ and positive integer $d$. Find the first two kernel principal component vectors for $\mathbf{X}$ for each of the cases
1. $K_1$ with two different values of $\gamma$ (of your choosing), and
2. $K_2$ for $d = 1, 2$.
If there is anything to interpret (and there may not be), give interpretations of the pairs of vectors for each of the 4 cases. (Be sure to use the vectors for "centered versions" of latent feature vectors.)
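A sketch of kernel principal components for the $K_1$ case; the value of $\gamma$ is an illustrative choice, and the double-centering of the Gram matrix corresponds to using "centered versions" of the latent feature vectors:

    # Kernel PCA sketch for 6e), radial kernel, using the X from problem 6
    gamma <- 0.1
    K <- exp(-gamma * as.matrix(dist(X))^2)       # 7 x 7 Gram matrix
    n <- nrow(K)
    J <- matrix(1 / n, n, n)
    Kc <- K - J %*% K - K %*% J + J %*% K %*% J   # double-centered Gram matrix
    e <- eigen(Kc, symmetric = TRUE)
    pc1 <- sqrt(e$values[1]) * e$vectors[, 1]     # first kernel principal component vector
    pc2 <- sqrt(e$values[2]) * e$vectors[, 2]     # second kernel principal component vector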
Assignment #2, (Due 2/21/14)
7. Return to the context of problem 5 and the last/largest set of predictors. Center the $y$ vector to produce (say) $\mathbf{Y}^*$, remove the column of 1's from the $\mathbf{X}$ matrix (giving a $100 \times 9$ matrix), and standardize the columns of the resulting matrix to produce (say) $\mathbf{X}^*$.
a) If one somehow produces a coefficient vector $\boldsymbol{\beta}^*$ for the centered and standardized version of the problem, so that
$$\hat{y}^* = \beta_1^* x_1^* + \beta_2^* x_2^* + \cdots + \beta_9^* x_9^*,$$
what is the corresponding predictor for $y$ in terms of $1, x, x^2, x^3, x^4, x^5, \sin x, \cos x, \sin 2x, \cos 2x$?
b) Do the transformations and fit the equation in a) by OLS. How do the fitted coefficients and
error sum of squares obtained here compare to what you got in problem 5?
c) Augment $\mathbf{Y}^*$ to $\mathbf{Y}^{**}$ by adding 9 values 0 at the end of the vector (to produce a $109 \times 1$ vector), and for the value $\lambda = 4$ augment $\mathbf{X}^*$ to $\mathbf{X}^{**}$ (a $109 \times 9$ matrix) by adding 9 rows at the bottom of the matrix in the form of $\sqrt{\lambda}\,\mathbf{I}_{9 \times 9}$. What quantity does OLS based on these augmented data seek to optimize? What is the relationship of this to a ridge regression objective?
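A sketch of the augmentation; Xs and ys are illustrative names for the standardized $100 \times 9$ matrix and the centered response:

    # 7c): ridge regression for lambda = 4 via an augmented OLS fit
    lambda <- 4
    Xaug <- rbind(Xs, sqrt(lambda) * diag(9))
    yaug <- c(ys, rep(0, 9))
    b_ridge <- coef(lm(yaug ~ Xaug - 1))   # equals solve(t(Xs) %*% Xs + lambda * diag(9), t(Xs) %*% ys)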
d) Use trial and error and matrix calculations based on the explicit form of $\hat{\boldsymbol{\beta}}^{\text{ridge}}$ given in the slides for Module 5 to identify a value of $\lambda$ for which the error sum of squares for ridge regression is about 1.5 times that of OLS in this problem. Then make a series of at least 5 values from 0 to $\lambda$ to use as candidates for $\lambda$. Choose one of these as an "optimal" ridge parameter $\lambda_{\text{opt}}$ here based on 10-fold cross-validation (as was done in problem 5). Compute the corresponding predictions $\hat{y}_i^{\text{ridge}}$ and plot both them and the OLS predictions as functions of $x$ (connect successive $(x, \hat{y})$ points with line segments). How do the "optimal" ridge predictions based on the 9 predictors compare to the OLS predictions based on the same 9 predictors?
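A sketch of the trial-and-error calculation based on the explicit ridge formula (same illustrative names as above; the candidate lambda values are arbitrary):

    # 7d): training error sum of squares for ridge regression as a function of lambda
    sse_ridge <- function(lambda, X, y) {
      b <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
      sum((y - X %*% b)^2)
    }
    sapply(c(0, 1, 2, 4, 8, 16), sse_ridge, X = Xs, y = ys)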
8. In light of the idea in part c) of problem 7, if you had software capable of doing lasso fitting of a linear predictor for a penalty coefficient $\lambda$, how can you use that routine to do elastic net fitting of a linear predictor for penalty coefficients $\lambda_1$ and $\lambda_2$ in
$$\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^{p} |\hat{\beta}_j| + \lambda_2 \sum_{j=1}^{p} \hat{\beta}_j^2\ ?$$
9. (There is a solution to much of this problem on the Stat 602X page for the 2011 course HW. That solution uses the $N - 1$ divisor for sample variances. For the following, please make your own version that uses the $N$ divisor.)
Beginning in Section 5.6, Izenman's book uses an example where PET yarn density is to be predicted from its NIR spectrum. This is a problem where $N = 21$ data vectors $\mathbf{x}_i$ of length $p = 268$ are used to predict the corresponding outputs $y_i$. Izenman points out that the yarn data are to be found in the pls package in R. (The package actually has $N = 28$ cases. Use all of them in the following.) Get those data and make sure that all inputs are standardized and the output is centered.
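A sketch of getting and preprocessing the data (the yarn data frame in pls has components NIR and density):

    # Problem 9 setup
    library(pls)
    data(yarn)
    X <- scale(yarn$NIR)                       # standardize the 268 inputs
    y <- yarn$density - mean(yarn$density)     # center the output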
a) Using the pls package, find the 1, 2, 3, and 4 component PCR and PLS $\hat{\boldsymbol{\beta}}$ vectors.
b) Find the singular values for the matrix $\mathbf{X}$ and use them to plot the function $df(\lambda)$ for ridge regression. Identify values of $\lambda$ corresponding to effective degrees of freedom 1, 2, 3, and 4. Find the corresponding ridge $\hat{\boldsymbol{\beta}}$ vectors.
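A sketch for parts a) and b); the range of $\lambda$ values plotted is an arbitrary illustrative choice:

    # 9a)-b): PCR/PLS coefficient vectors and the ridge df(lambda) curve
    pcr_fit <- pcr(y ~ X, ncomp = 4)
    pls_fit <- plsr(y ~ X, ncomp = 4)
    coef(pcr_fit, ncomp = 1:4)                 # PCR beta-hat vectors for 1-4 components
    coef(pls_fit, ncomp = 1:4)                 # PLS beta-hat vectors for 1-4 components
    d <- svd(X)$d                              # singular values of X
    df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
    lambdas <- seq(0, 1000, length = 200)
    plot(lambdas, sapply(lambdas, df_ridge), type = "l", xlab = "lambda", ylab = "df")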
c) Plot on the same set of axes $\hat{\beta}_j$ versus $j$ for the PCR, PLS, and ridge vectors for number of components/degrees of freedom 1. (Plot them as "functions," connecting consecutive plotted $(j, \hat{\beta}_j)$ points with line segments.) Then do the same for 2, 3, and 4 numbers of components/degrees of freedom.
d) It is (barely) possible to find that the best (in terms of $R^2$) subsets of $M = 1, 2, 3,$ and 4 predictors are respectively $\{x_{40}\}$, $\{x_{212}, x_{246}\}$, $\{x_{25}, x_{160}, x_{215}\}$, and $\{x_{160}, x_{169}, x_{231}, x_{243}\}$. Find their corresponding coefficient vectors. Use the lars package in R and find the lasso coefficient vectors $\hat{\boldsymbol{\beta}}$ with exactly $M = 1, 2, 3,$ and 4 non-zero entries with the largest possible $\sum_{j=1}^{268} |\hat{\beta}_j^{\text{lasso}}|$ (for the given counts of non-zero entries).
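A sketch of extracting lasso coefficient vectors from a lars path; along the path, the last step with exactly $M$ non-zero entries has the largest $\sum_j |\hat{\beta}_j^{\text{lasso}}|$ among the path solutions with that count:

    # 9d): lasso path from the lars package
    library(lars)
    lasso_fit <- lars(X, y, type = "lasso")
    B <- coef(lasso_fit)                       # one row of coefficients per step of the path
    nz <- apply(B != 0, 1, sum)                # number of non-zero entries at each step
    B[max(which(nz == 2)), ]                   # e.g., the M = 2 coefficient vector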
e) If necessary, re-order/sort the cases by their values of $y_i$ (from smallest to largest) to get a new indexing. Then plot on the same set of axes $y_i$ versus $i$ and $\hat{y}_i$ versus $i$ for ridge, PCR, PLS, best subset, and lasso regressions for number of components/degrees of freedom/number of non-zero coefficients equal to 1. (Plot them as "functions," connecting consecutive plotted $(i, y_i)$ or $(i, \hat{y}_i)$ points with line segments.) Then do the same for 2, 3, and 4 numbers of components/degrees of freedom/counts of non-zero coefficients.
f) Section 6.6 of JWHT uses the glmnet package for R to do ridge regression and the lasso. Use that package in the following. Find the value of $\lambda$ for which your lasso coefficient vector in d) for $M = 2$ optimizes the quantity
$$\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{268} |\hat{\beta}_j|$$
(by matching the error sums of squares). Then, by using the trick of problem 8, employ the package to find coefficient vectors $\boldsymbol{\beta}$ optimizing
$$\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \left( (1 - \alpha) \sum_{j=1}^{268} |\hat{\beta}_j| + \alpha \sum_{j=1}^{268} \hat{\beta}_j^2 \right)$$
for $\alpha = 0, .1, .2, \ldots, 1.0$. What effective degrees of freedom are associated with the $\alpha = 1$ version of this? How many of the coefficients $\beta_j$ are non-zero for each of the values of $\alpha$? Compare error sums of squares for the raw elastic net predictors to the linear predictors using the modified elastic net coefficients $(1 + \lambda\alpha)\,\hat{\boldsymbol{\beta}}^{\text{elastic net}}$.
10. Here is a small fake data set with $p = 4$ and $N = 8$.

[Data table: $y$ and $x_1, x_2, x_3, x_4$ values for the 8 cases.]
Notice that the $y$ is centered and the $x$'s are orthogonal (and can easily be made orthonormal by dividing by $\sqrt{8}$). Use the explicit formulas for fitted coefficients in the orthonormal-features context to make plots (on a single set of axes for each fitting method, 5 plots in total) of
1. ˆ , ˆ , ˆ , and ˆ versus M for best subset (of size M ) regression,
1
2
3
4
2. ˆ1 , ˆ2 , ˆ3 , and ˆ4 versus  for ridge regression,
3. ˆ1 , ˆ2 , ˆ3 , and ˆ4 versus  for lasso,
4. ˆ1 , ˆ2 , ˆ3 , and ˆ4 versus  for   .5 in the elastic net penalty of part d) of problem 9,
and
5. ˆ1 , ˆ2 , ˆ3 , and ˆ4 versus  for the non-negative garrote.
Make 5 corresponding plots of the error sum of squares versus the corresponding parameter.
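For reference, in the orthonormal-features case the usual closed forms (as in, e.g., ESL Table 3.4 and Breiman's non-negative garrote; the exact constants depend on how the penalties are scaled in the Module slides) are, with $\hat{\beta}_j$ the OLS coefficient,
$$\hat{\beta}_j^{\text{best subset}} = \hat{\beta}_j \cdot I\!\left[\,|\hat{\beta}_j| \text{ is among the } M \text{ largest}\,\right], \qquad \hat{\beta}_j^{\text{ridge}} = \frac{\hat{\beta}_j}{1 + \lambda},$$
$$\hat{\beta}_j^{\text{lasso}} = \operatorname{sign}(\hat{\beta}_j)\left(|\hat{\beta}_j| - \lambda\right)_+, \qquad \hat{\beta}_j^{\text{garrote}} = \left(1 - \frac{\lambda}{\hat{\beta}_j^{\,2}}\right)_+ \hat{\beta}_j.$$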
11. For the data set of problem 5 make up a matrix of inputs based on $x$ consisting of the values of Haar basis functions up through order $m = 4$. (You will need to take the functions defined on $[0,1]$ and rescale their arguments to $[-\pi, \pi]$. For a function $g : [0,1] \to \mathbb{R}$ this is the function $g^* : [-\pi, \pi] \to \mathbb{R}$ defined by $g^*(x) = g\!\left(\frac{x}{2\pi} + .5\right)$.) This will produce a $100 \times 16$ matrix $\mathbf{X}_h$.
a) Find $\hat{\boldsymbol{\beta}}^{\text{OLS}}$ and plot the corresponding $\hat{y}$ as a function of $x$, with the data also plotted in scatterplot form.
b) Center $y$ and standardize the columns of $\mathbf{X}_h$. Find the lasso coefficient vectors $\hat{\boldsymbol{\beta}}$ with exactly $M = 2, 4,$ and 8 non-zero entries with the largest possible $\sum_{j=1}^{16} |\hat{\beta}_j^{\text{lasso}}|$ (for the given counts of non-zero entries). Plot the corresponding $\hat{y}$'s as functions of $x$ on the same set of axes, with the data also plotted in scatterplot form.
12. For the data set of problem 5 make up a $100 \times 7$ matrix $\mathbf{X}_h$ of inputs based on $x$ consisting of the values of the functions on slide 7 of Module 9 for the 7 knot values
$$\xi_1 = -3.0,\ \xi_2 = -2.0,\ \xi_3 = -1.0,\ \xi_4 = 0.0,\ \xi_5 = 1.0,\ \xi_6 = 2.0,\ \xi_7 = 3.0$$
Find $\hat{\boldsymbol{\beta}}^{\text{OLS}}$ and plot the corresponding natural cubic regression spline, with the data also plotted in scatterplot form.
Assignment #3, (Not to be collected)
13. Again using the data set of problem 5,
a) Fit with approximately 5 and then 9 effective degrees of freedom
i) a cubic smoothing spline (using smooth.spline()), and
ii) a locally weighted linear regression smoother based on a tricube kernel (using loess(…, span = , degree = 1)).
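A minimal sketch of obtaining the two fits near a target effective degrees of freedom (x and y stand for the problem-5 data; the loess span shown is only a starting value to be adjusted):

    # 13a): smoothing spline and loess fits at roughly 5 effective degrees of freedom
    ss5 <- smooth.spline(x, y, df = 5)
    lo5 <- loess(y ~ x, span = 0.65, degree = 1)   # adjust span until lo5$trace.hat is near 5
    ord <- order(x)
    plot(x, y)
    lines(x[ord], predict(ss5, x[ord])$y)
    lines(x[ord], predict(lo5)[ord], lty = 2)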
Plot, for approximately 5 effective degrees of freedom, all of the $y_i$ and the 2 sets of smoothed values against $x_i$. Connect the consecutive $(x_i, \hat{y}_i)$ points for each fit with line segments so that they plot as "functions." Then redo the plotting for 9 effective degrees of freedom.
b) Produce a single hidden layer neural net fit with an error sum of squares about like those for
the 9 degrees of freedom fits using nnet(). You may need to vary the number of hidden nodes
for a single-hidden-layer architecture and vary the weight for a penalty made from a sum of
squares of coefficients in order to achieve this. For the function that you ultimately fit, extract
the coefficients and plot the fitted mean function. How does it compare to the plots made in a)?
c) Each run of nnet() begins from a different random start and can produce a different fitted function. Make 5 runs using the architecture and penalty parameter (the "decay" parameter) you settle on for part b) and save the 100 predicted values from each run into 5 vectors. Make a scatterplot matrix of pairs of these sets of predicted values. How big are the correlations between the different runs?
d) Use the avNNet() function from the caret package to average 20 neural nets with your
parameters from part b).
e) Fit radial basis function networks based on the standard normal pdf $\phi$,
$$f_\lambda(x) = \beta_0 + \sum_{i=1}^{81} \beta_i K_\lambda(x, x_i) \quad \text{for} \quad K_\lambda(x, x_i) = \phi\!\left(\frac{x - x_i}{\lambda}\right),$$
to these data for two different values of $\lambda$. Define normalized versions of the radial basis functions as
$$N_i(x) = \frac{K_\lambda(x, x_i)}{\sum_{m=1}^{81} K_\lambda(x, x_m)}$$
and redo the problem using the normalized versions of the basis functions.
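A sketch of fitting such a network by OLS; for a clean illustration the kernels below are centered at a small equally spaced set of points rather than the 81 centers specified above, and $\lambda$ is an arbitrary choice:

    # 13e): radial basis function network fit by OLS, plus the normalized version
    lambda <- 0.5
    centers <- seq(-pi, pi, length = 15)
    K <- outer(x, centers, function(u, v) dnorm((u - v) / lambda))   # basis-function matrix
    rbf_fit <- lm(y ~ K)                      # beta0 plus one coefficient per basis function
    Nmat <- K / rowSums(K)                    # normalized radial basis functions
    rbf_fit_norm <- lm(y ~ Nmat)
    ord <- order(x); plot(x, y); lines(x[ord], fitted(rbf_fit)[ord])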
14. Use all of MARS, thin plate splines, local kernel-weighted linear regression, and neural nets
to fit predictors to both the noiseless and the noisy Mexican hat data in R. For those methods for
which it's easy to make contour or surface plots, do so. Which methods seem most effective on
this particular data set/function?
Assignment #4 (Due 4/14/14)
15. Problem 7, page 333 of James, Witten, Hastie, and Tibshirani.
16. On their page 218 in Problem 8.1a, Kuhn and Johnson refer to a "Friedman1" simulated data
set. Make that ($N = 200$) data set exactly as indicated on their page 218 and use it below. For
computing here refer to James, Witten, Hastie and Tibshirani Section 8.3. For purposes of
assessing test error, simulate a size 5000 test set exactly as indicated on page 169 of K&J.
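A sketch of the simulation, assuming (as K&J indicate) the mlbench package:

    # Problem 16: "Friedman1" training and test sets
    library(mlbench)
    set.seed(200)
    sim <- mlbench.friedman1(200, sd = 1)
    train <- data.frame(sim$x, y = sim$y)
    tst <- mlbench.friedman1(5000, sd = 1)
    test <- data.frame(tst$x, y = tst$y)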
a) Fit a single regression tree to the data set. Prune the tree to get the best sub-trees of all sizes from 1 final node to the maximum number of nodes produced by the tree routine. Compute and plot cost-complexities $C_\alpha(T)$ as a function of $\alpha$ as indicated in Module 17.
b) Evaluate a "test error" (based on the size 5000 test set) for each sub-tree identified in a).
What size sub-tree looks best based on these values?
c) Do 5-fold cross-validation on single regression trees to pick an appropriate tree complexity
(to pick one of the sub-trees from a)). How does that choice compare to what you got in b)
based on test error for the large test set?
d) Use the randomForest package to make a bagged version of a regression tree (based on,
say, $B = 500$). What is its OOB error? How does that compare to a test error based on the size
5000 test set?
e) Make a random forest predictor using randomForest (use again $B = 500$). What is its
OOB error? How does that compare to a test error based on the size 5000 test set?
f) Use the gbm package to fit several boosted regression trees to the training set (use at least 2
different values of tree depth with at least 2 different values of learning rate). What are the values of training error and then test error based on the size 5000 test set for these?
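A sketch of the fits for parts d)-f); all tuning values shown are illustrative:

    # 16d)-f): bagging, random forest, and boosting on the Friedman1 training set
    library(randomForest)
    library(gbm)
    bag <- randomForest(y ~ ., data = train, mtry = 10, ntree = 500)   # mtry = p gives bagging
    rf  <- randomForest(y ~ ., data = train, ntree = 500)              # default mtry for regression is p/3
    boost <- gbm(y ~ ., data = train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
    mean((predict(rf, newdata = test) - test$y)^2)   # test MSE; OOB errors are in rf$mse and bag$mse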
g) How do predictors in c), d), e), and f) compare in terms of test error? Evaluate ŷ for each of
the first 5 cases in your test set for all predictors and list all of the inputs and each of the
predictions in a small table.
h) Find the (test set) biases, variances, and correlation between your predictor from d) and one
of your predictors from f). Identify a linear combination of the two predictors that has minimum
(test set) prediction error.
17. Consider 4 different continuous distributions on the 2-dimensional unit square $[0,1]^2$ with densities on that space
$$g_1((x_1, x_2)) = 1,\quad g_2((x_1, x_2)) = x_1 + x_2,\quad g_3((x_1, x_2)) = 2x_1,\quad \text{and}\quad g_4((x_1, x_2)) = x_1 - x_2 + 1$$
a) For a 0-1 loss $K = 4$ classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier $f$ has $f(\mathbf{x}) = k$ (for $k = 1, 2, 3, 4$), first if $\boldsymbol{\pi} = (\pi_1, \pi_2, \pi_3, \pi_4)$ is $(.25, .25, .25, .25)$ and then if it is $(.2, .2, .3, .3)$.
b) Find the marginal densities for all of the $g_k$. Define 4 new densities $g_k^*$ on the unit square by the products of the 2 marginals for the corresponding $g_k$. Consider a 0-1 loss $K = 4$ classification problem "approximating" the one in a) by using the $g_k^*$ in place of the $g_k$ for the $\boldsymbol{\pi} = (.25, .25, .25, .25)$ case. Make a $101 \times 101$ grid of points of the form $(i/100, j/100)$ for integers $0 \le i, j \le 100$ and for each such point determine the value of the optimal classifier. Using these values, make a plot (using a different plotting color and/or symbol for each value of $\hat{y}$) showing the regions in $[0,1]^2$ where the optimal classifier classifies to each class.
c) Find the $g_k$ conditional densities for $x_2 \mid x_1$. Note that based on these and the marginals in part b) you can simulate pairs from any of the 4 joint distributions by first using the inverse probability transform of a uniform variable to simulate from the $x_1$ marginal and then using the inverse probability transform to simulate from the conditional of $x_2 \mid x_1$. (It's also easy to use a rejection algorithm based on $(x_1, x_2)$ pairs uniform on $[0,1]^2$.)
d) Generate 2 data sets consisting of multiple independent pairs $(\mathbf{x}, y)$ where $y$ is uniform on $\{1, 2, 3, 4\}$ and, conditioned on $y = k$, the variable $\mathbf{x}$ has density $g_k$. Make first a small training set with $N = 400$ pairs (to be used below). Then make a larger test set of 10,000 pairs. Use the test set to evaluate the error rates of the optimal rule from a) and then the "naïve" rule from b).
e) See Section 4.6.3 of JWHT for use of a nearest neighbor classifier. Based on the $N = 400$ training set from d), for several different numbers of neighbors (say 1, 3, 5, 10) make a plot like
that required in b) showing the regions where the nearest neighbor classifiers classify to each of
the 4 classes. Then evaluate the test error rate for the nearest neighbor rules based on the small
training set.
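A sketch of the nearest neighbor calculations; train17 and test17 are illustrative names for data frames holding the problem-17 training and test sets with columns x1, x2, and y:

    # 17e): k-nearest-neighbor classification over the plotting grid and the test set
    library(class)
    grid <- expand.grid(x1 = (0:100) / 100, x2 = (0:100) / 100)
    yhat_grid <- knn(train = train17[, c("x1", "x2")], test = grid, cl = train17$y, k = 5)
    plot(grid$x1, grid$x2, col = as.numeric(yhat_grid), pch = 20, cex = 0.4)
    yhat_test <- knn(train = train17[, c("x1", "x2")], test = test17[, c("x1", "x2")],
                     cl = train17$y, k = 5)
    mean(yhat_test != test17$y)   # test error rate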
f) This surely won't work very well, but give the following a try for fun. Naively/unthinkingly apply linear discriminant analysis to the training set as in Section 4.6.3 of JWHT. Identify the regions in $[0,1]^2$ corresponding to the values of $\hat{y} = 1, 2, 3, 4$. Evaluate the test error rate for LDA based on this training set.
g) Following the outline of Section 8.3.1 of JWHT, fit a classification tree to the data set using
5-fold cross-validation to choose tree size. Make a plot like that required in b) showing the
regions where the tree classifies to each of the 4 classes. Evaluate the test error rate for this tree.
h) Based on the training set, one can make estimates of the 2-d densities $g_k$ as
$$\hat{g}_k(\mathbf{x}) = \frac{1}{\#\{i \text{ with } y_i = k\}} \sum_{i \text{ with } y_i = k} h(\mathbf{x} \mid \mathbf{x}_i, \sigma^2)$$
for $h(\cdot \mid \mathbf{u}, \sigma^2)$ the bivariate normal density with mean vector $\mathbf{u}$ and covariance matrix $\sigma^2 \mathbf{I}$. (Try perhaps $\sigma = .1$.) Using these estimates and the relative frequencies of the possible values of $y$ in the training set,
$$\hat{\pi}_k = \frac{\#\{i \text{ with } y_i = k\}}{N},$$
an approximation of the optimal classifier is
$$\hat{f}(\mathbf{x}) = \arg\max_k\ \hat{\pi}_k\, \hat{g}_k(\mathbf{x}) = \arg\max_k \sum_{i \text{ with } y_i = k} h(\mathbf{x} \mid \mathbf{x}_i, \sigma^2).$$
Make a plot like that required in b) showing the regions where this classifies to each of the 4 classes. Then evaluate the test error rate for this rule based on the small training set.
18. Carefully review (and make yourself your own copy of the R code and output for) Problems
21 and 22 of the Spring 2013 Stat 602X homework. There is nothing to be turned in here, but
you need to do this work.
Assignment #5 (Due 4/28/14)
19. Return to the context of Problem 17. (Try all of the following. Some of these things are
direct applications of code in JWHT or KJ and so I'm sure they can be done fairly painlessly.
Other things may not be so easy or could even be essentially impossible without a lot of new
coding. If that turns out to be the case, do only what you can in a finite amount of time.)
a) Use logistic regression (e.g. as implemented in glm() or glmnet()) on the training data you generated in 17d) to find 6 classifiers with linear boundaries for choice between all pairs of classes. Then consider an OVO classifier that classifies $\mathbf{x}$ to the class with the largest sum (of 3) of estimated probabilities coming from these logistic regressions. Make a plot like the one required in 17b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4. Use the large test set to evaluate the error rate of this classifier.
b) It seems from the glmnet() documentation that using family="multinomial" one can fit multivariate versions of logistic regression models. Try this using the training set. Consider the classifier that classifies $\mathbf{x}$ to the class with the largest estimated probability. Make a plot like the one required in 17b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4. Use the large test set to evaluate the error rate of this classifier.
c) Pages 360-361 of K&J indicate that upon converting output $y$ taking values in $\{1, 2, 3, 4\}$ to 4 binary indicator variables, one can use nnet with the 4 binary outputs (and the option linout = FALSE) to fit a single hidden layer neural network to the training data with predicted output values between 0 and 1 for each output variable. Try several different numbers of hidden nodes and "decay" values to get fitted neural nets. From each of these, define a classifier that classifies $\mathbf{x}$ to the class with the largest predicted response. Use the large test set to evaluate the error rates of these classifiers and pick the one with the smallest error rate. Make a plot like the one required in 17b) showing the regions in $[0,1]^2$ where your best neural net classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4.
d) Use svm() in package e1071 to fit SVMs to the $y = 1$ and $y = 2$ training data for the
• "linear" kernel,
• "polynomial" kernel (with default order 3),
• "radial basis" kernel (with default gamma, half that gamma value, and twice that gamma value).
Compute as in Section 9.6 of JWHT and use the plot() function to investigate the nature of
the 5 classifiers. If it's possible, put the training data pairs on the plot using different symbols or
colors for classes 1 and 2, and also identify the support vectors.
e) Compute as in JWHT Section 9.6.4 to find SVMs (using the kernels indicated in d)) for the $K = 4$ class problem. Again, use the plot() function to investigate the nature of the 5
classifiers. Use the large test set to evaluate the error rates for these 5 classifiers.
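A sketch of the svm() calls for d) and e), using the same illustrative train17/test17 names as above; cost, degree, and gamma values are defaults or arbitrary choices:

    # 19d)-e): SVMs from e1071
    library(e1071)
    train17f <- transform(train17, y = factor(y))
    dat12 <- droplevels(subset(train17f, y %in% c(1, 2)))
    svm_lin  <- svm(y ~ ., data = dat12, kernel = "linear", cost = 1)
    svm_poly <- svm(y ~ ., data = dat12, kernel = "polynomial", degree = 3)
    svm_rad  <- svm(y ~ ., data = dat12, kernel = "radial")   # also try half and twice the default gamma
    plot(svm_rad, dat12)                                      # decision regions and support vectors
    svm_all <- svm(y ~ ., data = train17f, kernel = "radial")
    mean(predict(svm_all, newdata = test17) != test17$y)      # test error rate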
f) Use either the ada package or the adabag package and fit an AdaBoost classifier to the $y = 1$ and $y = 2$ training data. Make a plot like the one required in 17b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1$ and 2. Use the large test set to evaluate the error rate of this classifier. How does this error rate compare to the best possible one for comparing classes 1 and 2 with equal weights on the two? (You should be able to get the latter analytically.)
g) It appears from the paper "ada: An R Package for Stochastic Boosting" by Mark Culp, Kjell Johnson, and George Michailidis, which appeared in the Journal of Statistical Software in 2006, that ada implements an OVA version of a $K$-class AdaBoost classifier. If so, use this and find the corresponding classifier. Make a plot like the one required in 17b) showing the regions in $[0,1]^2$ where this classifier has $\hat{f}(\mathbf{x}) = 1, 2, 3,$ and 4. Use the large test set to evaluate the error rate of this classifier.
20. Do Exercises 2, 3, 4, and 9 from Ch. 10 of JWHT.
Assignment #6 (Not to be Collected)
21. Try out model-based clustering on the “USArrests” data you used in Problem 9 of JWHT in
HW5.
22. Do Problems 7 and 9 from the Stat 602X Spring 13 Final Exam.