Stat 502X Homework Assignment #1 (Due 1/31/14)

Regarding the curse of dimensionality/necessary sparsity of data sets in p dimensions for moderate to large p:

1. For each of p = 10, 20, 50, 100, 500, 1000 make n = 10,000 draws of distances between pairs of independent points uniform in the cube [0,1]^p. Use these to make 95% confidence limits for the ratio

   (mean distance between two random points in the cube) / (maximum distance between two points in the cube)

2. For each of p = 10, 20, 50 make n = 10,000 random draws of N = 100 independent points uniform in the cube [0,1]^p. Find, for each sample of 100 points, the distance from the first point drawn to the 5th closest point of the other 99. Use these to make 95% confidence limits for the ratio

   (mean diameter of a 5-nearest-neighbor neighborhood if N = 100) / (maximum distance between two points in the cube)

3. What fraction of random draws uniform from the unit cube [0,1]^p lies in the "middle part" of the cube, [ε, 1 − ε]^p, for ε a small positive number?

Regarding decomposition of Err:

4. Suppose that (unknown to a statistician) a mechanism generates iid data pairs (x, y) according to the following model:

   x ~ U(−π, π)  and  y | x ~ N(sin(x), .25(x + 1)²)

(The conditional variance is .25(x + 1)².)

a) What is the absolute minimum value of Err possible regardless of what training set size, N, is available and what fitting method is employed?

b) What linear function of x (of the form g(x) = a + bx) has the smallest "average squared bias" as a predictor for y? What cubic function of x (of the form g(x) = a + bx + cx² + dx³) has the smallest average squared bias as a predictor for y? Is the set of cubic functions big enough to eliminate model bias in this problem?

Regarding cross-validation as a method of choosing a predictor:

5. Vardeman will send out an N = 100 data set generated by the model of problem 4. Use tenfold cross-validation (use the 1st ten points as the first test set, the 2nd 10 points as the second, etc.)
based on the data set to choose among the following methods of prediction for this scenario:

- polynomial regressions of orders 0, 1, 2, 3, 4, and 5,
- regressions using the sets of predictors {1, sin x, cos x} and {1, sin x, cos x, sin 2x, cos 2x}, and
- a regression with the set of predictors {1, x, x², x³, x⁴, x⁵, sin x, cos x, sin 2x, cos 2x}.

(Use ordinary least squares fitting.) Which predictor looks best on an empirical basis? Knowing how the data were generated (an unrealistic luxury), which methods here are without model bias?

Regarding some linear algebra and principal components analysis:

6. Consider the small (7 × 3) fake X matrix below.

        10  10    .1
        11  11    .1
         9   9    0
   X =  11   9  −2.1
         9  11   2.1
        12   8  −4.0
         8  12   4.0

(Note, by the way, that x3 ≈ x2 − x1.)

a) Find the QR and singular value decompositions of X. Use the latter and give best rank 1 and rank 2 approximations to X.

b) Subtract column means from the columns of X to make a centered data matrix. Find the singular value decomposition of this matrix. Is it approximately the same as that in part a)? Give the 3 vectors of principal component scores. What are the principal components for case 1?

Henceforth consider only the centered data matrix of b).

c) What are the singular values? How do you interpret their relative sizes in this context? What are the first two principal component directions? What are the loadings of the first two principal component directions on x3? What is the third principal component direction? Make scatterplots of the 7 points (x1, x2) and then of the 7 points with first coordinate the 1st principal component score and second coordinate the 2nd principal component score. How do these compare? Do you expect them to be similar in light of the sizes of the singular values?

c') Find the matrices X vj vj' for j = 1, 2, 3 and the best rank 1 and rank 2 approximations to X. How are the latter related to the former?

d) Compute the (N divisor) 3 × 3 sample covariance matrix for the 7 cases.
Then find its singular value decomposition and its eigenvalue decomposition. Are the eigenvectors of the sample covariance matrix related to the principal component directions of the (centered) data matrix? If so, how? Are the eigenvalues/singular values of the sample covariance matrix related to the singular values of the (centered) data matrix? If so, how?

e) The functions

   K1(x, z) = exp(−γ ‖x − z‖²)  and  K2(x, z) = (1 + ⟨x, z⟩)^d

are legitimate kernel functions for choice of γ > 0 and positive integer d. Find the first two kernel principal component vectors for X for each of the cases

1. K1 with two different values of γ (of your choosing), and
2. K2 for d = 1, 2.

If there is anything to interpret (and there may not be), give interpretations of the pairs of vectors for each of the 4 cases. (Be sure to use the vectors for "centered versions" of latent feature vectors.)

Assignment #2 (Due 2/21/14)

7. Return to the context of problem 5 and the last/largest set of predictors. Center the y vector to produce (say) Y*, remove the column of 1's from the X matrix (giving a 100 × 9 matrix), and standardize the columns of the resulting matrix to produce (say) X*.

a) If one somehow produces a coefficient vector β* for the centered and standardized version of the problem, so that

   ŷ* = β1* x1* + β2* x2* + ⋯ + β9* x9*,

what is the corresponding predictor for y in terms of 1, x, x², x³, x⁴, x⁵, sin x, cos x, sin 2x, cos 2x?

b) Do the transformations and fit the equation in a) by OLS. How do the fitted coefficients and error sum of squares obtained here compare to what you got in problem 5?

c) Augment Y* to Y** by adding 9 values 0 at the end of the vector (to produce a 109 × 1 vector) and, for a value λ > 0, augment X* to X** (a 109 × 9 matrix) by adding 9 rows at the bottom of the matrix in the form of √λ I9×9. What quantity does OLS based on these augmented data seek to optimize? What is the relationship of this to a ridge regression objective?
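As a numerical sanity check on the augmentation idea in part c): OLS applied to data augmented with √λ rows reproduces the ridge solution exactly. A minimal sketch (in Python rather than the course's R, and with a made-up single centered predictor in place of the 9-column X*, so the closed forms stay one-line):

```python
import random

random.seed(0)

# Made-up centered single-predictor data (a stand-in for one column of X*).
x = [random.gauss(0.0, 1.0) for _ in range(100)]
y = [2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

lam = 3.0  # an arbitrary ridge penalty for the check

# Closed-form ridge coefficient for one centered predictor:
# minimize sum (y_i - b x_i)^2 + lam b^2  =>  b = sum(x y) / (sum(x^2) + lam)
b_ridge = sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

# Augmented-data OLS: append the extra "observation" (sqrt(lam), 0) and fit by
# ordinary least squares (slope = sum(x y) / sum(x^2) for the augmented data).
x_aug = x + [lam ** 0.5]
y_aug = y + [0.0]
b_ols_aug = sum(xi * yi for xi, yi in zip(x_aug, y_aug)) / sum(xi * xi for xi in x_aug)

print(abs(b_ridge - b_ols_aug) < 1e-12)  # prints True
```

The augmented residual sum of squares is Σ(yi − b xi)² + (0 − b√λ)² = Σ(yi − b xi)² + λb², which is precisely the ridge criterion; the same bookkeeping extends column-by-column to the 109 × 9 version in part c).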
d) Use trial and error and matrix calculations based on the explicit form of β̂ridge given in the slides for Module 5 to identify a value of λ for which the error sum of squares for ridge regression is about 1.5 times that of OLS in this problem. Then make a series of at least 5 values of λ from 0 to that value to use as candidates for λ. Choose one of these as an "optimal" ridge parameter λopt here based on 10-fold cross-validation (as was done in problem 5). Compute the corresponding ridge predictions ŷi and plot both them and the OLS predictions as functions of x (connect successive (x, ŷ) points with line segments). How do the "optimal" ridge predictions based on the 9 predictors compare to the OLS predictions based on the same 9 predictors?

8. In light of the idea in part c) of problem 7, if you had software capable of doing lasso fitting of a linear predictor for a penalty coefficient λ, how could you use that routine to do elastic net fitting of a linear predictor for penalty coefficients λ1 and λ2 in

   Σi=1..N (yi − ŷi)² + λ1 Σj=1..p |β̂j| + λ2 Σj=1..p β̂j² ?

9. (There is a solution to much of this problem on the Stat 602X page for the 2011 course HW. That solution uses the N − 1 divisor for sample variances. For the following, please make your own version that uses the N divisor.) Beginning in Section 5.6, Izenman's book uses an example where PET yarn density is to be predicted from its NIR spectrum. This is a problem where N = 21 data vectors xi of length p = 268 are used to predict the corresponding outputs yi. Izenman points out that the yarn data are to be found in the pls package in R. (The package actually has N = 28 cases. Use all of them in the following.) Get those data and make sure that all inputs are standardized and the output is centered.

a) Using the pls package, find the 1, 2, 3, and 4 component PCR and PLS β̂ vectors.

b) Find the singular values for the matrix X and use them to plot the function df(λ) for ridge regression.
Identify values of λ corresponding to effective degrees of freedom 1, 2, 3, and 4. Find the corresponding ridge β̂ vectors.

c) Plot on the same set of axes β̂j versus j for the PCR, PLS, and ridge β̂ vectors for number of components/degrees of freedom 1. (Plot them as "functions," connecting consecutive plotted (j, β̂j) points with line segments.) Then do the same for 2, 3, and 4 numbers of components/degrees of freedom.

d) It is (barely) possible to find that the best (in terms of R²) subsets of M = 1, 2, 3, and 4 predictors are respectively {x40}, {x212, x246}, {x25, x160, x215}, and {x160, x169, x231, x243}. Find their corresponding coefficient vectors. Use the lars package in R and find the lasso coefficient vectors β̂ with exactly M = 1, 2, 3, and 4 non-zero entries with the largest possible Σj=1..268 |β̂j_lasso| (for the given counts of non-zero entries).

e) If necessary, re-order/sort the cases by their values of yi (from smallest to largest) to get a new indexing. Then plot on the same set of axes yi versus i and ŷi versus i for the ridge, PCR, PLS, best subset, and lasso regressions for number of components/degrees of freedom/number of non-zero coefficients equal to 1. (Plot them as "functions," connecting consecutive plotted (i, yi) or (i, ŷi) points with line segments.) Then do the same for 2, 3, and 4 numbers of components/degrees of freedom/counts of non-zero coefficients.

f) Section 6.6 of JWHT uses the glmnet package for R to do ridge regression and the lasso. Use that package in the following. Find the value of λ for which your lasso coefficient vector in d) for M = 2 optimizes the quantity

   Σi=1..N (yi − ŷi)² + λ Σj=1..268 |β̂j|

(by matching the error sums of squares). Then, by using the trick of problem 8, employ the package to find coefficient vectors β̂ optimizing

   Σi=1..N (yi − ŷi)² + λ [ (1 − α) Σj=1..268 |β̂j| + α Σj=1..268 β̂j² ]

for α = 0, .1, .2, ..., 1.0. What effective degrees of freedom are associated with the α = 1 version of this? How many of the coefficients β̂j are non-zero for each of the values of α?
Compare error sums of squares for the raw elastic net predictors to those for the linear predictors using the modified elastic net coefficients (1 + λ2) β̂elastic net.

10. Here is a small fake data set with p = 4 and N = 8.

   y:   1  3  5  13  9  3  11  1
   x1:  1  1  1   1  1  1   1  1
   x2:  1  1  1   1  1  1   1  1
   x3:  1  1  1   1  1  1   1  1
   x4:  1  1  1   1  1  1   1  1

Notice that the y is centered and the x's are orthogonal (and can easily be made orthonormal by dividing by √8). Use the explicit formulas for fitted coefficients in the orthonormal features context to make plots (on a single set of axes for each fitting method, 5 plots in total) of

1. β̂1, β̂2, β̂3, and β̂4 versus M for best subset (of size M) regression,
2. β̂1, β̂2, β̂3, and β̂4 versus λ for ridge regression,
3. β̂1, β̂2, β̂3, and β̂4 versus λ for the lasso,
4. β̂1, β̂2, β̂3, and β̂4 versus λ for α = .5 in the elastic net penalty of part f) of problem 9, and
5. β̂1, β̂2, β̂3, and β̂4 versus λ for the non-negative garrote.

Make 5 corresponding plots of the error sum of squares versus the corresponding parameter.

11. For the data set of problem 5 make up a matrix of inputs based on x consisting of the values of Haar basis functions up through order m = 4. (You will need to take the functions defined on [0,1] and rescale their arguments to (−π, π). For a function g: [0,1] → ℝ this is the function g*: (−π, π) → ℝ defined by g*(x) = g(x/(2π) + .5).) This will produce a 100 × 16 matrix Xh.

a) Find β̂OLS and plot the corresponding ŷ as a function of x with the data also plotted in scatterplot form.

b) Center y and standardize the columns of Xh. Find the lasso coefficient vectors β̂ with exactly M = 2, 4, and 8 non-zero entries with the largest possible Σj=1..16 |β̂j_lasso| (for the given counts of non-zero entries). Plot the corresponding ŷ's as functions of x on the same set of axes, with the data also plotted in scatterplot form.

12.
For the data set of problem 5 make up a 100 × 7 matrix Xh of inputs based on x consisting of the values of the functions on slide 7 of Module 9 for the 7 knot values

   ξ1 = −3.0, ξ2 = −2.0, ξ3 = −1.0, ξ4 = 0.0, ξ5 = 1.0, ξ6 = 2.0, ξ7 = 3.0

Find β̂OLS and plot the corresponding natural cubic regression spline, with the data also plotted in scatterplot form.

Assignment #3 (Not to be collected)

13. Again using the data set of problem 5,

a) Fit with approximately 5 and then 9 effective degrees of freedom
   i) a cubic smoothing spline (using smooth.spline()), and
   ii) a locally weighted linear regression smoother based on a tricube kernel (using loess(..., span = , degree = 1)).

Plot for approximately 5 effective degrees of freedom all of the yi and the 2 sets of smoothed values against xi. Connect the consecutive (xi, ŷi) for each fit with line segments so that they plot as "functions." Then redo the plotting for 9 effective degrees of freedom.

b) Produce a single hidden layer neural net fit with an error sum of squares about like those for the 9 degrees of freedom fits using nnet(). You may need to vary the number of hidden nodes for a single-hidden-layer architecture and vary the weight for a penalty made from a sum of squares of coefficients in order to achieve this. For the function that you ultimately fit, extract the coefficients and plot the fitted mean function. How does it compare to the plots made in a)?

c) Each run of nnet() begins from a different random start and can produce a different fitted function. Make 10 runs using the architecture and penalty parameter (the "decay" parameter) you settle on for part b) and save the 100 predicted values from the 10 runs into 10 vectors. Make a scatterplot matrix of pairs of these sets of predicted values. How big are the correlations between the different runs?

d) Use the avNNet() function from the caret package to average 20 neural nets with your parameters from part b).
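Averaging the fits as avNNet() does in d) can never hurt squared error on a fixed data set: at each case, ((1/B) Σ ŷb − y)² ≤ (1/B) Σ (ŷb − y)² by convexity. A small stand-alone illustration of that point (in Python rather than R, with the run-to-run variability of random restarts mimicked by adding independent noise to a common made-up fit instead of refitting neural nets):

```python
import random

random.seed(1)
B, n = 10, 100  # number of "runs" and number of cases (made up)

# Made-up target values and a common underlying fit.
y = [random.gauss(0.0, 1.0) for _ in range(n)]
base_fit = [yi + random.gauss(0.0, 0.3) for yi in y]

# B "runs": the common fit perturbed by independent run-to-run noise
# (a stand-in for the random restarts of nnet()).
runs = [[fi + random.gauss(0.0, 0.5) for fi in base_fit] for _ in range(B)]

def mse(pred):
    return sum((p - yi) ** 2 for p, yi in zip(pred, y)) / len(y)

# Case-by-case average of the B sets of predictions (the avNNet() idea).
avg_pred = [sum(run[i] for run in runs) / B for i in range(n)]

mean_individual_mse = sum(mse(run) for run in runs) / B
averaged_mse = mse(avg_pred)

# Jensen's inequality guarantees the averaged predictor is no worse.
print(averaged_mse <= mean_individual_mse)  # prints True
```

The size of the improvement is governed by how correlated the runs are, which is exactly what the scatterplot matrix in part c) is meant to reveal: highly correlated runs gain little from averaging.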
e) Fit radial basis function networks based on the standard normal pdf φ,

   f(x) = β0 + Σi=1..81 βi Kλ(x, xi)  for  Kλ(x, xi) = φ((x − xi)/λ),

to these data for two different values of λ. Then define normalized versions of the radial basis functions as

   Ni(x) = Kλ(x, xi) / Σm=1..81 Kλ(x, xm)

and redo the problem using the normalized versions of the basis functions.

14. Use all of MARS, thin plate splines, local kernel-weighted linear regression, and neural nets to fit predictors to both the noiseless and the noisy Mexican hat data in R. For those methods for which it's easy to make contour or surface plots, do so. Which methods seem most effective on this particular data set/function?

Assignment #4 (Due 4/14/14)

15. Problem 7, page 333 of James, Witten, Hastie, and Tibshirani.

16. On their page 218 in Problem 8.1a, Kuhn and Johnson refer to a "Friedman1" simulated data set. Make that (N = 200) data set exactly as indicated on their page 218 and use it below. For computing here refer to James, Witten, Hastie and Tibshirani Section 8.3. For purposes of assessing test error, simulate a size 5000 test set exactly as indicated on page 169 of K&J.

a) Fit a single regression tree to the data set. Prune the tree to get the best sub-trees of all sizes from 1 final node to the maximum number of nodes produced by the tree routine. Compute and plot the cost-complexities Cα(T) as a function of α as indicated in Module 17.

b) Evaluate a "test error" (based on the size 5000 test set) for each sub-tree identified in a). What size of sub-tree looks best based on these values?

c) Do 5-fold cross-validation on single regression trees to pick an appropriate tree complexity (to pick one of the sub-trees from a)). How does that choice compare to what you got in b) based on test error for the large test set?

d) Use the randomForest package to make a bagged version of a regression tree (based on, say, B = 500). What is its OOB error? How does that compare to a test error based on the size 5000 test set?
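The OOB error in d) rests on the fact that a bootstrap sample of size N leaves out, on average, a fraction (1 − 1/N)^N ≈ e⁻¹ ≈ 36.8% of the training cases, so each case has many trees for which it was "out of bag." A quick check of that fraction (in Python rather than R, with made-up values of N and B):

```python
import math
import random

random.seed(2)
N, B = 200, 1000  # training-set size and number of bootstrap samples (made up)

out_fracs = []
for _ in range(B):
    # One bootstrap sample: N draws with replacement from {0, ..., N-1}.
    in_bag = {random.randrange(N) for _ in range(N)}
    out_fracs.append((N - len(in_bag)) / N)

avg_out = sum(out_fracs) / B

print(round((1 - 1 / N) ** N, 4))          # exact expected out-of-bag fraction
print(abs(avg_out - math.exp(-1)) < 0.01)  # close to 1/e = 0.3679...
```

The OOB error then averages, for each case, only the predictions of trees whose bootstrap samples excluded that case, giving an honest error estimate without a separate test set.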
e) Make a random forest predictor using randomForest (again use B = 500). What is its OOB error? How does that compare to a test error based on the size 5000 test set?

f) Use the gbm package to fit several boosted regression trees to the training set (use at least 2 different values of tree depth with at least 2 different values of learning rate). What are the values of training error and then test error based on the size 5000 test set for these?

g) How do the predictors in c), d), e), and f) compare in terms of test error? Evaluate ŷ for each of the first 5 cases in your test set for all predictors and list all of the inputs and each of the predictions in a small table.

h) Find the (test set) biases, variances, and correlation between your predictor from d) and one of your predictors from f). Identify a linear combination of the two predictors that has minimum (test set) prediction error.

17. Consider 4 different continuous distributions on the 2-dimensional unit square [0,1]² with densities on that space

   g1(x1, x2) = 1,  g2(x1, x2) = x1 + x2,  g3(x1, x2) = 2x1,  and  g4(x1, x2) = x1 − x2 + 1

a) For a 0-1 loss K = 4 classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier f has f(x) = k (for k = 1, 2, 3, 4), first if π = (π1, π2, π3, π4) is (.25, .25, .25, .25) and then if it is (.2, .2, .3, .3).

b) Find the marginal densities for all of the gk. Define 4 new densities gk* on the unit square by the products of the 2 marginals for the corresponding gk. Consider a 0-1 loss K = 4 classification problem "approximating" the one in a) by using the gk* in place of the gk for the π = (.25, .25, .25, .25) case. Make a 101 × 101 grid of points of the form (i/100, j/100) for integers 0 ≤ i, j ≤ 100 and for each such point determine the value of the optimal classifier. Using these values, make a plot (using a different plotting color and/or symbol for each value of ŷ) showing the regions in [0,1]² where the optimal classifier classifies to each class.
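The grid evaluation in b) is a direct argmax computation: the optimal 0-1 loss classifier is f(x) = arg max over k of πk gk(x). A sketch (in Python rather than R), taking the four class densities to be g1 = 1, g2 = x1 + x2, g3 = 2x1, and g4 = x1 − x2 + 1, each of which integrates to 1 over the unit square, with equal priors:

```python
# Optimal (Bayes) classifier on a grid for the K = 4 problem of 17a)-b).
# Class densities on the unit square (each integrates to 1 over [0,1]^2):
g = [
    lambda x1, x2: 1.0,
    lambda x1, x2: x1 + x2,
    lambda x1, x2: 2.0 * x1,
    lambda x1, x2: x1 - x2 + 1.0,
]
pi = [0.25, 0.25, 0.25, 0.25]  # equal-priors case

def f_opt(x1, x2):
    """Class k (1-4) maximizing pi_k * g_k(x1, x2)."""
    scores = [p * gk(x1, x2) for p, gk in zip(pi, g)]
    return 1 + scores.index(max(scores))

# 101 x 101 grid of points (i/100, j/100) as in part b).
grid = {(i, j): f_opt(i / 100, j / 100) for i in range(101) for j in range(101)}

# A few spot checks, one point landing in each class's region:
print(f_opt(0.2, 0.6), f_opt(0.1, 0.95), f_opt(0.9, 0.3), f_opt(0.3, 0.1))  # prints 1 2 3 4
```

Plotting `grid` with one color per class value then reveals the four regions; swapping in the products of marginals gk* (or the priors (.2, .2, .3, .3)) only changes the entries of `g` and `pi`.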
c) Find the gk conditional densities for x2 | x1. Note that based on these and the marginals in part b) you can simulate pairs from any of the 4 joint distributions by first using the inverse probability transform of a uniform variable to simulate from the x1 marginal and then using the inverse probability transform to simulate from the conditional of x2 | x1. (It's also easy to use a rejection algorithm based on (x1, x2) pairs uniform on [0,1]².)

d) Generate 2 data sets consisting of multiple independent pairs (x, y) where y is uniform on {1, 2, 3, 4} and, conditioned on y = k, the variable x has density gk. Make first a small training set with N = 400 pairs (to be used below). Then make a larger test set of 10,000 pairs. Use the test set to evaluate the error rates of the optimal rule from a) and then the "naïve" rule from b).

e) See Section 4.6.3 of JWHT for use of a nearest neighbor classifier. Based on the N = 400 training set from d), for several different numbers of neighbors (say 1, 3, 5, 10) make a plot like that required in b) showing the regions where the nearest neighbor classifiers classify to each of the 4 classes. Then evaluate the test error rates for the nearest neighbor rules based on the small training set.

f) This surely won't work very well, but give the following a try for fun. Naively/unthinkingly apply linear discriminant analysis to the training set as in Section 4.6.3 of JWHT. Identify the regions in [0,1]² corresponding to the values ŷ = 1, 2, 3, 4. Evaluate the test error rate for LDA based on this training set.

g) Following the outline of Section 8.3.1 of JWHT, fit a classification tree to the data set using 5-fold cross-validation to choose tree size. Make a plot like that required in b) showing the regions where the tree classifies to each of the 4 classes. Evaluate the test error rate for this tree.
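The two-stage inverse-CDF simulation described in c) can be made concrete for g2(x1, x2) = x1 + x2: the x1 marginal is x1 + 1/2 with CDF (x1² + x1)/2, inverted by x1 = (−1 + √(1 + 8u))/2; given x1, the conditional CDF of x2 is (x1 x2 + x2²/2)/(x1 + 1/2), inverted by x2 = −x1 + √(x1² + v(2x1 + 1)). A sketch (in Python rather than R) checked against the exact means E x1 = E x2 = 7/12:

```python
import random

random.seed(3)

def draw_from_g2():
    # x1 from marginal x1 + 1/2: solve (x1^2 + x1)/2 = u  =>  x1 = (-1 + sqrt(1+8u))/2
    u = random.random()
    x1 = (-1.0 + (1.0 + 8.0 * u) ** 0.5) / 2.0
    # x2 | x1 has density (x1 + x2)/(x1 + 1/2):
    # solve (x1*x2 + x2^2/2)/(x1 + 1/2) = v  =>  x2 = -x1 + sqrt(x1^2 + v(2*x1 + 1))
    v = random.random()
    x2 = -x1 + (x1 * x1 + v * (2.0 * x1 + 1.0)) ** 0.5
    return x1, x2

n = 200_000
draws = [draw_from_g2() for _ in range(n)]
mean_x1 = sum(d[0] for d in draws) / n
mean_x2 = sum(d[1] for d in draws) / n

# Under g2, E x1 = E x2 = integral of x(x + 1/2) over [0,1] = 7/12 = 0.5833...
print(abs(mean_x1 - 7 / 12) < 0.01, abs(mean_x2 - 7 / 12) < 0.01)  # prints True True
```

The same pattern, with each density's own marginal and conditional CDF inversions, produces the training and test sets required in d).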
h) Based on the training set, one can make estimates of the 2-d densities gk as

   ĝk(x) = (1 / #{i with yi = k}) Σ{i with yi = k} h(x | xi, σ²)

for h(· | u, σ²) the bivariate normal density with mean vector u and covariance matrix σ²I. (Try perhaps σ = .1.) Using these estimates and the relative frequencies of the possible values of y in the training set,

   π̂k = #{i with yi = k} / N,

an approximation of the optimal classifier is

   f̂(x) = arg maxk π̂k ĝk(x) = arg maxk Σ{i with yi = k} h(x | xi, σ²)

Make a plot like that required in b) showing the regions where this rule classifies to each of the 4 classes. Then evaluate the test error rate for this rule based on the small training set.

18. Carefully review (and make yourself your own copy of the R code and output for) Problems 21 and 22 of the Spring 2013 Stat 602X homework. There is nothing to be turned in here, but you need to do this work.

Assignment #5 (Due 4/28/14)

19. Return to the context of Problem 17. (Try all of the following. Some of these things are direct applications of code in JWHT or KJ and so I'm sure they can be done fairly painlessly. Other things may not be so easy or could even be essentially impossible without a lot of new coding. If that turns out to be the case, do only what you can in a finite amount of time.)

a) Use logistic regression (e.g. as implemented in glm() or glmnet()) on the training data you generated in 17d) to find 6 classifiers with linear boundaries for choice between all pairs of classes. Then consider an OVO classifier that classifies x to the class with the largest sum (of 3) estimated probabilities coming from these logistic regressions. Make a plot like the one required in 17b) showing the regions in [0,1]² where this classifier has f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the error rate of this classifier.

b) It seems from the glmnet() documentation that using family="multinomial" one can fit multivariate versions of logistic regression models. Try this using the training set.
Consider the classifier that classifies x to the class with the largest estimated probability. Make a plot like the one required in 17b) showing the regions in [0,1]² where this classifier has f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the error rate of this classifier.

c) Pages 360-361 of K&J indicate that upon converting output y taking values in {1, 2, 3, 4} to 4 binary indicator variables, one can use nnet with the 4 binary outputs (and the option linout = FALSE) to fit a single hidden layer neural network to the training data with predicted output values between 0 and 1 for each output variable. Try several different numbers of hidden nodes and "decay" values to get fitted neural nets. From each of these, define a classifier that classifies x to the class with the largest predicted response. Use the large test set to evaluate the error rates of these classifiers and pick the one with the smallest error rate. Make a plot like the one required in 17b) showing the regions in [0,1]² where your best neural net classifier has f̂(x) = 1, 2, 3, and 4.

d) Use svm() in package e1071 to fit SVMs to the y = 1 and y = 2 training data for the

- "linear" kernel,
- "polynomial" kernel (with default order 3), and
- "radial basis" kernel (with default gamma, half that gamma value, and twice that gamma value).

Compute as in Section 9.6 of JWHT and use the plot() function to investigate the nature of the 5 classifiers. If it's possible, put the training data pairs on the plot using different symbols or colors for classes 1 and 2, and also identify the support vectors.

e) Compute as in JWHT Section 9.6.4 to find SVMs (using the kernels indicated in d)) for the K = 4 class problem. Again, use the plot() function to investigate the nature of the 5 classifiers. Use the large test set to evaluate the error rates for these 5 classifiers.

f) Use either the ada package or the adabag package and fit an AdaBoost classifier to the y = 1 and y = 2 training data.
Make a plot like the one required in 17b) showing the regions in [0,1]² where this classifier has f̂(x) = 1 and 2. Use the large test set to evaluate the error rate of this classifier. How does this error rate compare to the best possible one for comparing classes 1 and 2 with equal weights on the two? (You should be able to get the latter analytically.)

g) It appears from the paper "ada: An R Package for Stochastic Boosting" by Mark Culp, Kjell Johnson, and George Michailidis that appeared in the Journal of Statistical Software in 2006 that ada implements an OVA version of a K-class AdaBoost classifier. If so, use this and find the corresponding classifier. Make a plot like the one required in 17b) showing the regions in [0,1]² where this classifier has f̂(x) = 1, 2, 3, and 4. Use the large test set to evaluate the error rate of this classifier.

20. Do Exercises 2, 3, 4, and 9 from Ch 10 of JWHT.

Assignment #6 (Not to be Collected)

21. Try out model-based clustering on the "USArrests" data you used in Problem 9 of JWHT in HW5.

22. Do Problems 7 and 9 from the Stat 602X Spring 13 Final Exam.
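Model-based clustering as in problem 21 fits a Gaussian mixture by EM and assigns each case to its most probable component (in R, e.g., the mclust package automates this, including selection among covariance structures). A minimal one-dimensional, two-component sketch of the underlying EM iteration (in Python, with made-up well-separated data in place of USArrests):

```python
import math
import random

random.seed(4)

# Made-up 1-d data: two well-separated normal groups.
data = [random.gauss(0.0, 1.0) for _ in range(150)] + \
       [random.gauss(6.0, 1.0) for _ in range(150)]

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# EM for the two-component mixture p N(mu1, var1) + (1 - p) N(mu2, var2).
p, mu, var = 0.5, [1.0, 5.0], [1.0, 1.0]  # crude starting values
for _ in range(50):
    # E step: responsibility of component 1 for each point.
    r = []
    for x in data:
        a = p * normal_pdf(x, mu[0], var[0])
        b = (1.0 - p) * normal_pdf(x, mu[1], var[1])
        r.append(a / (a + b))
    # M step: responsibility-weighted means, variances, and mixing weight.
    n1 = sum(r)
    n2 = len(data) - n1
    mu[0] = sum(ri * x for ri, x in zip(r, data)) / n1
    mu[1] = sum((1.0 - ri) * x for ri, x in zip(r, data)) / n2
    var[0] = sum(ri * (x - mu[0]) ** 2 for ri, x in zip(r, data)) / n1
    var[1] = sum((1.0 - ri) * (x - mu[1]) ** 2 for ri, x in zip(r, data)) / n2
    p = n1 / len(data)

# Cluster assignment: most probable component for each case.
clusters = [1 if ri > 0.5 else 2 for ri in r]
print(abs(mu[0] - 0.0) < 0.5, abs(mu[1] - 6.0) < 0.5)  # component means recovered
```

The multivariate version used on USArrests replaces the scalar means and variances with mean vectors and covariance matrices; the soft-assignment/argmax structure is the same, which is what distinguishes model-based clustering from, say, k-means.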