Stat 602 HW Sets (2015) HW#1 (Due 1/27/15) There are a variety of ways that one can quantitatively demonstrate the qualitative realities that p is "huge" and for p at all large "filling up" even a small part of it with data points is effectively impossible and our intuition about distributions in p is very poor. The first 3 problems below are based on nice ideas in this direction taken from Giraud's book. 1. For p 2,10,100, and 1000 draw samples of size n 100 from the uniform distributions on 0,1 p . Then for every xi , x j pair with i j in one of these samples, compute the Euclidean 100 distance between the two points, xi x j . Make a histogram (one p at a time) of these 2 2 distances. What do these suggest about how well "local" prediction methods (that rely only on data points xi , yi with xi "near" x to make predictions about y at x ) can be expected to work? 2. Consider finding a lower bound on the number of points xi (for i 1, 2, , n ) required to "fill up" 0,1 in the sense that no point of 0,1 is Euclidean distance more than away from p p some xi . The p-dimensional volume of a ball of radius r in p is Vp r p /2 p / 2 1 and Giraud notes that it can be shown that as p Vp r 2 er 2 p p /2 p rp 1 1/2 Then, if n points can be found with -balls covering the unit cube in p , the total volume of those balls must be at least 1. That is nV p 1 What then are approximate lower bounds on the number of points required to fill up 0,1 to p within for p 20,50, and 200 and 1,.1, and .01 ? (Giraud notes that the p 200 and 1 lower bound is larger than the estimated number of particles in the universe.) 1 3. Giraud points out that for large p , most of MVN p 0, Ι probability is "in the tails." For f p x the MVN p 0, Ι pdf. For 0 1 let B p x | f p x f p 0 x | x 2 ln 1 2 be the "central"/"large density" part of the multivariate standard normal distribution. a) Using the Markov inequality, show that the probability assigned by the multivariate standard normal distribution to the region B p is no more than 1/ 2 p /2 . b) What then is a lower bound on the radius (call it r p ) of a ball at the origin required so that the multivariate standard normal distribution places probability .5 in that ball? What is an upper bound on the ratio f p x / f p 0 outside the ball with radius that lower bound? Plot these bounds as functions of p for p 1,500 . 4. Consider Section 1.4 of the typed outline (concerning the variance-bias trade-off in prediction). Suppose that in a very simple problem with p 1 , the distribution P for the random pair x, y is specified by x U 0,1 and y | x N 3 x 1.5 , 3 x 1.5 .2 2 2 ( 3x 1.5 .2 is the conditional variance of the output). Further, consider two possible sets of 2 functions S g for use in creating predictors of y , namely 1. S1 g | g x a bx for real numbers a, b , and 10 j j 1 2. S2 g | g x a j I x for real numbers a j 10 10 j 1 Training data are N pairs xi , yi iid P . Suppose that the fitting of elements of these sets is done by 1. OLS (simple linear regression) in the case of S1 , and 2. according to j 1 j if no xi , y 10 10 aˆ j 1 yi otherwise j 1 j i with , # xi j 1 j xi 10 , 10 10 10 in the case of S2 to produce predictors fˆ and fˆ . 1 2 2 a) Find (analytically) the functions g * for the two cases. Use them to find the two expected squared model biases E x E y | x g * x . How do these compare for the two cases? 2 b) For the second case, find an analytical form for E T fˆ2 and then for the average squared 2 estimation bias E x E T fˆ2 x g 2* x . (Hints: What is the conditional distribution of the yi j 1 j j 1 j , ? What is the conditional mean of y given that x , ?) given that no xi 10 10 10 10 c) For the first case, simulate at least 1000 training data sets of size N 100 and do OLS on each one to get corresponding fˆ 's. Average those to get an approximation for E T fˆ1 . (If you can do this analytically, so much the better!) Use this approximation and analytical calculation to 2 find the average squared estimation bias E x E T fˆ x g * x for this case. 1 1 d) How do your answers for b) and c) compare for a training set of size N 100 ? e) Use whatever combination of analytical calculation, numerical analysis, and simulation you need to use (at every turn preferring analytics to numerics to simulation) to find the expected prediction variances E x Var T fˆ x for the two cases for training set size N 100 . f) In sum, which of the two predictors here has the best value of Err for N 100 ? 5. Two files sent out by Vardeman with respectively 100 and then 1000 pairs xi , yi were generated according to P in Problem 4. Use 10-fold cross-validation to see which of the two predictors in Problem 4 appears most likely to be effective. (The data sets are not sorted, so you may treat successively numbered groups of 1/10 th of the training cases as your K 10 randomly created pieces of the training set.) 6. Consider the 5 4 data matrix 2 4 X 3 5 1 4 3 4 2 3 7 5 6 4 4 2 5 1 2 4 a) Use R and find the QR and singular value decompositions of X . What are the two corresponding bases for C X ? 3 b) Use the singular value decomposition of X to find the eigen (spectral) decompositions of XX and XX (what are eigenvalues and eigenvectors?). c) Find the best rank 1 and rank 2 approximations to X . . Now center the columns of X to make the centered data matrix X . What are the principal component directions d) Find the singular value decomposition of X and principal components for the data matrix? What are the "loadings" of the first principal component? . e) Find the best rank 1 and rank 2 approximations to X f) Find the eigen decomposition of the sample covariance matrix 1 X X . Find best 1 and 2 5 component approximations to this covariance matrix. . Repeat parts d), e), and f) using this Now standardize the columns of X to make the matrix X matrix X . 7. Consider the linear space of functions on , of the form f t a bt c sin t d cos t Equip this space with the inner-product f , g f t g t dt and norm f f, f 1/2 (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions 1, t ,sin t , cos t and produce an orthonormal basis for the space. Some Other Problems (NOT to be Turned in … Most Have Solutions on the 2011 602X Page) The next two calculations others intended to provide intuition in the direction that all data in p are necessarily sparse. 8. Let Fp t and f p t be respectively the p2 cdf and pdf. Consider the MVN p 0, I distribution and Z1 , Z 2 , , Z N iid with this distribution. With M min Zi | i 1, 2, , N Write out a one-dimensional integral involving Fp t and f p t giving EM . Evaluate this mean for N 100 and p 1,5,10, and 20 either numerically or using simulation. 4 9. For each of p 1,5,10, and 20 generate at least 1000 realizations of pairs of points x and z as iid uniform over the p- dimensional unit ball (the set of x with x p 1 ). Compute (for each p ) the sample average distance between x and z . (For Z MVN p 0, I independent of U U 0,1 , x U 1/ p / Z p Z is uniformly distributed on the unit ball in p .) 10. Consider the linear space of functions on 0,1 of the form f t a bt ct 2 dt 3 1 Equip this space with the inner-product f , g f t g t dt and norm f f , f 1/2 0 (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions 1, t , t 2 , t 3 and produce an orthonormal basis for the space. 11. Consider the linear space of functions on 0,1 of the form 2 f t , s a bt cs dt 2 es 2 fts Equip this space with the inner-product f , g f t , s g t , s dtds and norm f f , f 1/2 0,12 (to create a small Hilbert space). Use the Gram-Schmidt process to orthogonalize the set of functions 1, t , s, t 2 , s 2 , ts and produce an orthonormal basis for the space. 12. For a c 0 , consider the function K : 2 2 defined by K x, z exp c x z 2 a) Use the facts about "kernel functions" in Module 48 to argue that K is a kernel function. 2 (Note that x z x, x z, z 2 x, z .) b) Argue that there is a φ : 2 so that with (infinite-dimensional) feature vector φ x the kernel function is a "regular inner-product" K x, z φ x , φ z l x l z l 1 (You will want to consider the Taylor series expansion of the exponential function about 0 and co-ordinate functions of φ that are multiples of all possible products of the form x1p x2q for nonnegative integers p and q . It is not necessary to find explicit forms for the multipliers, though that can probably be done. You do need to argue carefully though, that such a representation is possible.) 5 13. Show the equivalence of the two forms of the optimization used to produce the fitted ridge regression parameter. (That is, show that there is a t such that βˆ ridge βˆ tridge and a t such that βˆ tridge βˆ ridge t .) 14. Consider "data augmentation" methods of penalized least squares fitting. a) Augment a centered X matrix with p new rows I and Y by adding p new entries 0 . p p Argue that OLS fitting with the augmented data set returns βˆ ridge as a fitted coefficient vector. net b) Show how the "elastic net" fitted coefficient vector βˆ elastic could be found using lasso 1 ,2 software and an appropriate augmented data set. HW#2 (Due 2/10/15) 15. a) Argue carefully that the function K : p p defined by K s, t c s , t for c 0 , d a non-negative integer, and d the usual p inner product is a "kernel" function. b) For the "kernel principle components" R example used in class, replace the kernel used there with 3 different versions of this kernel, ones with c 0, c 1, and c 4 for the choice d 2 . Find the 3 sets of (centered) kernel principle components for these 3 choices using the methodology provided in Module 4. If you find anything interesting in these analyses, comment on it. c) Consider the c 1 version in part b). Write out K s, t in terms of the components of s and t and observe that K s, t can be viewed and an inner product in a 6-dimensional feature space. What are those 6 functions producing features of an observed x x1 , x2 ? Compute and center columns of the 10 6 transformed data matrix. Compute ordinary principle components of this new matrix. How do they compare with what you found in b)? d) Consider the c 4 version in part b). For this case, what 6 functions produce observed features of an observed x x1 , x2 ? (Note that these did not produce the same answers as did the choice c 1 in part b) despite their similarities to the features noted in c).) 16. This question concerns analysis of a set of sale price data obtained from the Ames City Assessor’s Office. There is an Ames Home Price Data file on the course web page 6 containing information on sales May 2002 through June 2003 of 1 1 2 and 2 story homes built 1945 and before, with (above grade) size of 2500 sq ft or less and lot size 20,000 sq ft or less, located in Low- and Medium-Density Residential zoning areas. n 88 different homes fitting this description were sold in Ames during this period. (2 were actually sold twice, but only the second sales prices of these were included in our data set.) (The rows of the file have been shuffled randomly, so that you may use 8 successive sets of 11 rows as folds for cross-validation purposes if you end up programming you own cross-validations.) For each home, the value of the response variable Price recorded sale price of the home and the values of 14 potential explanatory variables were obtained. These variables are Size Land Bedrooms Central Air Fireplace Full Bath Half Bath Basement Finished Bsmnt Bsmnt Bath Garage Multiple Car Style (2 Story ) the floor area of the home above grade in sq ft, the area of the lot the home occupies in sq ft, a count of the number of bedrooms in the home a dummy variable that is 1 if the home has central air conditioning and is 0 if it does not, a count of the number of fireplaces in the home, a count of the number of full bathrooms above grade, a count of the number of half bathrooms above grade, the floor area of the home's basement (including both finished and unfinished parts) in sq ft, the area of any finished part of the home's basement in sq ft, a dummy variable that is 1 if there is a bathroom of any sort (full or half) in the home's basement and is 0 otherwise, a dummy variable that is 1 if the home has a garage of any sort and is 0 otherwise, a dummy variable that is 1 if the home has a garage that holds more than one vehicle and is 0 otherwise, a dummy variable that is 1 if the home is a 2 story (or a 2 1 2 story) home and is 0 otherwise, and Zone (Town Center ) a dummy variable that is 1 if the home is in an area zoned as "Urban Core Medium Density" and 0 otherwise. a) In preparation for analysis, standardize all explanatory variables that are not dummy variables (those we'll leave in raw form), and center the price variable making a data frame with 15 columns. Say clearly how one goes from a particular new set of home characteristics to a corresponding set of predictors. Then say clearly how a prediction for the centered price to a prediction for the actual dollar price. 7 b) Find linear predictors for centered price of all the following forms: OLS Lasso (choose by 8-fold cross-validation) Ridge (choose by 8-fold cross-validation) Elastic Net with .5 (choose by 8-foldcross-validation) PCR (choose the number of components by 8-fold cross-validation) PLS (choose the number of components by 8-fold cross-validation) For each predictor that you have a way to do so, evaluate the effective degrees of freedom. c) Plot on the same set of axes as a function of index (1 through 15) the values of co-ordinates ˆ j of βˆ for each predictor in b). (Connect successive coordinates of a given β̂ with line segments so that you can track the different methods across the plot. Use different plotting symbols and colors for the 6 different methods.) If you see anything interesting in the plot comment on it. d) Plot on the same set of axes as a function of index (1 through 88) the values of co-ordinates ˆ for each predictor in b). (Connect successive coordinates of a given Ŷ with line yˆi of Y segments so that you can track the different methods across the plot. Use different plotting symbols and colors for the 6 different methods.) If you see anything interesting in the plot comment on it. Some Other Problems (NOT to be Turned in … Some Have Solutions on the 2011 602X Page) 17. For a c 0 , consider the function K : 2 2 defined by K x, z exp c x z 2 a) Use the facts about "kernel functions" in the last section of the course outline to argue that K 2 is a kernel function. (Note that x z x, x z, z 2 x, z .) b) Argue that there is a φ : 2 so that with (infinite-dimensional) feature vector φ x the kernel function is a "regular inner-product" K x, z φ x , φ z l x l z l 1 (You will want to consider the Taylor series expansion of the exponential function about 0 and co-ordinate functions of φ that are multiples of all possible products of the form x1p x2q for nonnegative integers p and q . It is not necessary to find explicit forms for the multipliers, though that can probably be done. You do need to argue carefully though, that such a representation is possible.) 8 18. (Exactly as in 2011 HW.) Beginning in Section 5.6, Izenman's book uses an example where PET yarn density is to be predicted from its NIR spectrum. This is a problem where N 21 data vectors x j of length p 268 are used to predict the corresponding outputs yi . Izenman points out that the yarn data are to be found in the pls package in R. (The package actually has N 28 cases. Use all of them in the following.) Get those data and make sure that all inputs are standardized and the output is centered. a) Using the pls package, find the 1, 2,3, and 4 component PCR and PLS β̂ vectors. b) Find the singular values for the matrix X and use them to plot the function df for ridge regression. Identify values of corresponding to effective degrees of freedom 1, 2,3, and 4 . Find corresponding ridge β̂ vector. c) Plot on the same set of axes ˆ j versus j for the PCR, PLS and ridge vectors for number of components/degrees of freedom 1. (Plot them as "functions," connecting consecutive plotted j , ˆ j points with line segments.) Then do the same for 2,3, and 4 numbers of components/degrees of freedom. d) It is (barely) possible to find that the best (in terms of R 2 ) subsets of M 1, 2,3, and 4 predictors are respectively, x40 , x212 , x246 , x25 , x160 , x215 and x160 , x169 , x231 , x243 . Find their corresponding coefficient vectors. Use the lars package in R and find the lasso coefficient vectors β̂ with exactly M 1, 2,3, and 4 non-zero entries with the largest possible 268 ˆ j 1 lasso j (for the counts of non-zero entries). e) If necessary, re-order/sort the cases by their values of yi (from smallest to largest) to get a new indexing. Then plot on the same set of axes yi versus i and yˆi versus i for ridge, PCR, PLS, best subset, and lasso regressions for number of components/degrees of freedom/number of nonzero coefficients equal to 1. (Plot them as "functions," connecting consecutive plotted i, yi or i, yˆi points with line segments.) Then do the same for 2,3, and 4 numbers of components/degrees of freedom/counts of non-zero coefficients. 19. a) Find a set of basis functions for continuous functions on 0,1 0,1 that are linear (i.e. are of the form y a bx1 cx2 ) on each of the squares S1 0,.5 0,.5 , S 2 0,.5 .5,1 , S3 .5,1 0,.5 , and S 4 .5,1 .5,1 Vardeman will send you a data set generated from a model with 9 E y | x1 , x2 2 x1 x2 Find the best fitting linear combination of these basis functions according to least squares. b) Describe a set of basis functions for all continuous functions on 0,1 0,1 that for 0 0 1 2 K 1 K 1 and 0 0 1 M 1 M 1 are linear on each rectangle S km k 1 , k m 1 , m . How many such basis functions are needed to represent these functions? HW#3 (Due 2/27/15) 20. For the N 100 data set of Problem 5 make up a matrix of inputs based on x consisting of the values of Haar basis functions up through order m 3 . This will produce a 100 16 matrix Xh . a) Find β̂ OLS and plot the corresponding ŷ as a function of x with the data also plotted in scatterplot form. b) Center y and standardize the columns of Xh . Find the lasso coefficient vectors β̂ with exactly M 2, 4, and 8 non-zero entries with the largest possible 16 ˆ j 1 lasso j (for the counts of non-zero entries). Plot the corresponding yˆ's as functions of x on the same set of axes, with the data also plotted in scatterplot form. 21. For the data set of Problem 5 make up a 100 7 matrix Xh of inputs based on x consisting of the values of functions on slide 7 of Module 9 for the 7 knot values 1 0, 2 .1, 3 .3, 4 .5, 5 .7, 6 .9, 7 1.0 Find β̂ OLS and plot the corresponding natural cubic regression spline, with the data also plotted in scatterplot form. 22. Consider the function optimization problem solved by the cubic smoothing splines over the interval 0,1 . Suppose that N 11 training data pairs xi , yi have x1 0, x2 .1, x3 .2, , x11 1.0 and that the basis functions on slide 7 of Module 9 are employed. 10 a) Find the matrices H, Ω, and K associated with the smoothing spline development of Module 11. (You should be able to compute Ω in closed form. Do so and give the numerical versions of all 3 of these matrices.) b) Do an eigen analysis of the matrix K and then plot as functions of row index i the coordinates of the eigenvectors of K . Put these plots on the same set of axes, or stack plots of them one above the other in descending order of the size of eigenvalue. (Connect consecutive plotted points for a given eigenvector with line segments and use different plotting symbols, colors, and/or line weights and types so that you can make qualitative comparisons of the nature of these.) If possible, compare the "shapes" of the plots for the two largest and two smallest corresponding eigenvalues. Qualitatively speaking, what kinds of components of a vector of observed values, Y get "most suppressed" in the smoothing? c) Plot effective degrees of freedom for the spline smoother as a function of the parameter . Do simple numerical searches to identify values of corresponding to degrees of freedom 2.5,3,4, and 5. 23. In a p 1 smoothing context like that in Problem 22, where N 11 training data pairs xi , yi have x1 0, x2 .1, x3 .2, , x11 1.0 , consider locally weighted regression as on slides 5-7 of Module 14 based on a Gaussian kernel. a) Compute and plot effective degrees of freedom as a function of the bandwidth, . (It may be most effective to make the plot with on a log scale or some such.) Do simple numerical searches to identify values of corresponding to effective degrees of freedom 2.5,3,4, and 5. b) Compare the smoothing matrix " S " for Problem 22 producing 4 effective degrees of freedom to the matrix " L " in the present context also producing 4 effective degrees of freedom. What is the 1111 matrix difference? Plot, as a function of column index, the values in the 1st, 3rd, and 5th rows of the two matrices, connecting with line segments successive values from a given row. (Connect consecutive plotted points for a given row of a given matrix and use different plotting symbols, colors, and/or line weights and types so that you can make qualitative comparisons of the nature of these.) 24. Again use the data set of Problem 5. a) Fit with approximately 5 and then 9 effective degrees of freedom i) a cubic smoothing spline (using smooth.spline()) , and ii) a locally weighted linear regression smoother based on a tricube kernel (using loess(…,span= ,degree=1)). 11 Plot for approximately 5 effective degrees of freedom all of yi and the 2 sets of smoothed values against xi . Connect the consecutive xi , yˆ i for each fit with line segments so that they plot as "functions." Then redo the plotting for 9 effective degrees of freedom. b) Produce a single hidden layer neural net fit with an error sum of squares about like those for the 9 degrees of freedom fits using nnet(). You may need to vary the number of hidden nodes for a single-hidden-layer architecture and vary the weight for a penalty made from a sum of squares of coefficients in order to achieve this. For the function that you ultimately fit, extract the coefficients and plot the fitted mean function. How does it compare to the plots made in a)? c) Each run of nnet()begins from a different random start and can produce a different fitted function. Make 5 runs using the architecture and penalty parameter (the "decay" parameter) you settle on for part b) and save the 100 predicted values for the 10 runs into 10 vectors. Make a scatterplot matrix of pairs of these sets of predicted values. How big are the correlations between the different runs? d) Use the avNNet() function from the caret package to average 20 neural nets with your parameters from part b) . e) Fit radial basis function networks based on the standard normal pdf , 51 xz m 1 f x 0 i K x, for K x, z 50 m 1 to these data for two different fixed values of . Define normalized versions of the radial basis functions as i 1 K x, 50 N i x 51 m 1 K x, 50 m 1 and redo the problem using the normalized versions of the basis functions. 25. On their page 218 in Problem 8.1a, Kuhn and Johnson refer to a "Friedman1" simulated data set. Make that ( N 200 ) data set exactly as indicated on their page 218 and use it below. For computing here refer to James, Witten, Hastie and Tibshirani Section 8.3. For purposes of assessing test error, simulate a size 5000 test set exactly as indicated on page 169 of K&J. 12 a) Fit a single regression tree to the data set. Prune the tree to get the best sub-trees of all sizes from 1 final node to the maximum number of nodes produced by the tree routine. Compute and plot cost-complexities C T as a function of as indicated in Module 17. b) Evaluate a "test error" (based on the size 5000 test set) for each sub-tree identified in a). What size sub-tree looks best based on these values? c) Do 5-fold cross-validation on single regression trees to pick an appropriate tree complexity (to pick one of the sub-trees from a)). How does that choice compare to what you got in b) based on test error for the large test set? d) Use the randomForest package to make a bagged version of a regression tree (based on, say, B 500 ). What is its OOB error? How does that compare to a test error based on the size 5000 test set? e) Make a random forest predictor using randomForest (use again B 500 ). What is its OOB error? How does that compare to a test error based on the size 5000 test set? f) Use the gbm package to fit several boosted regression trees to the training set (use at least 2 different values of tree depth with at least 2 different values of learning rate). What are values of training error and then test error based on the size 5000 test set for these? g) How do predictors in c), d), e), and f) compare in terms of test error? Evaluate ŷ for each of the first 5 cases in your test set for all predictors and list all of the inputs and each of the predictions in a small table. h) Call your predictor from d) fˆ1 and pick one of your predictors from f) to call fˆ2 . Use your test set and approximate E y fˆ1 x , E y fˆ2 x , Var y fˆ1 x , Var y fˆ2 x , and Corr y fˆ1 x , y fˆ2 x (these expectations are across the joint distribution of x, y for the fixed training set (and randomization for the random forest). Identify an approximately optimizing E y fˆ1 x 1 fˆ2 x 2 Is the optimizer 0 or 1? 13 26. See pages 328-331 of JWHT and consider again the Ames Home Price Data of Problem 16. a) Fit bagged regression trees of two different depths to the data. b) Fit a "default" random forest to the data. c) Fit two different versions of boosted regression trees to the data. d) Portray the predictions from part c) as in part d) of Problem 16. e) On the basis of error sums of squares and sample correlations between predictions produced here and in Problem 16, if you were looking for two sets of predictions to "stack" which seem to be the most promising? Some Other Problems (NOT to be Turned in … Some Have Solutions on the 2011 602X Page) 27. (Exactly as in 2011 HW.) Find a set of basis functions for the natural (linear outside the interval 1 , K ) quadratic regression splines with knots at 1 2 K . 28. (Exactly as in 2011 HW.) (B-Splines) For a 1 2 K b consider the B-spline bases of order m , Bi ,m x defined recursively as follows. For j 1 define j a , and for j K let j b . Define Bi ,1 x I i x i 1 (in case i i 1 take Bi ,1 x 0 ) and then Bi ,m x x i x Bi , m 1 x i m B x i m 1 i i m i 1 i 1, m 1 (where we understand that if Bi ,l x 0 its term drops out of the expression above). For a 0.1 and b 1.1 and i i 1 /10 for i 1, 2, ,11 , plot the non-zero Bi ,3 x . Consider all linear combinations of these functions. Argue that any such linear combination is piecewise quadratic with first derivatives at every i . If it is possible to do so, identify one or more linear constraints on the coefficients (call them ci ) that will make qc x ci B3,i x linear to the left of 1 (but otherwise minimally constrain the form of qc x ). i 29. (Exactly as in 2011 HW. This is an argument for the reduction of the cubic spline function optimization problem to the matrix calculus problem.) 14 a) Suppose that a x1 x2 xN b and s x is a natural cubic spline with knots at the xi interpolating the points xi , yi (i.e. s xi yi ). Let z x be any twice continuously differentiable function on a, b also interpolating the points xi , yi . Show that s x b 2 a 2 b a (Hint: Consider d x z x s x , write 2 dx z x dx 2 2 d x dx z x dx s x dx 2 s x d x dx b b a b a b a a and use integration by parts and the fact that s x is piecewise constant.) b) Use a) and prove that the minimizer of N y h x i 1 i i 2 h x dx over the set of twice b 2 a continuously differentiable functions on a, b is a natural cubic spline with knots at the xi . 30. Use all of MARS, thin plate splines, local kernel-weighted linear regression, and neural nets to fit predictors to both the noiseless and the noisy Hat data in R. For those methods for which it's easy to make contour or surface plots, do so. Which methods seem most effective on this particular data set/function? 31. A p 2 data set consists of N 81 training vectors x1i , x2i , yi for pairs x1i , x2i in the set 2.0, 1.5, ,1.5, 2.0 for iid N 0, .1 2 where the yi were generated as yi exp x12i x22i 1 2 x12i x22i i 2 variables . Vardeman will send you such a data file. Use it in the i following. a) Why should you expect MARS to be ineffective in producing a predictor in this context? (You may want to experiment with the earth package in R trying out MARS.) b) Try 2-D locally weighted regression smoothing on this data set using the loess function in R. "Surface plot" this for 2 different choices of smoothing parameters along with both the raw data and the mean function. (If nothing else, JMP will do this under its "Graph" menu.) c) Try fitting a thin plate spline to these data. There is an old tutorial at http://www.stat.wisc.edu/~xie/thin_plate_spline_tutorial.html for using the Tps() function in the fields package that might make this simple. d) Use the neural network and random forest routines in JMP to fit predictors to these data with error sums of square like those you get in b). Surface plot your fits. 15 e) Center the response variable and then fit radial basis function networks based on the standard normal pdf , x xi f x i K x, xi for K x, xi i 1 81 to these data for two different values of . Define normalized versions of the radial basis functions as K x, xi N i x 81 K x, xm m 1 and redo the problem using the normalized versions of the basis functions. 32. (Exactly as in 2011 HW.) Consider radial basis functions built from kernels. In particular, consider the choice D t t , the standard normal pdf. a) For p 1 , plot on the same set of axes the 11 functions x j j 1 for j K x, j D j 1, 2, ,11 10 first for .1 and then (in a separate plot) for .01 . Then make plots on the a single set of axes the 11 normalized functions K x, j N j x 11 K x, l l 1 first for .1 , then in a separate plot for .01 . b) For p 2 , consider the 121 basis functions x ξ ij i 1 j 1 for ξ ij K x, ξ ij D , i 1, 2, ,11 and j 1, 2, ,11 10 10 Make contour plots for K.1 x, ξ 6,6 and K.01 x, ξ 6,6 . Then define N ij x K x, ξ ij 11 11 K x, ξ m 1 l 1 lm Make contour plots for N.1,6,6 x and N.01,6,6 x . HW #4 Covered on Exam 1 (Due 3/10/15) 16 33. Consider 4 different continuous distributions on the 2-dimensional unit square 0,1 with 2 densities on that space g1 x1 , x2 1, g 2 x1 , x2 x1 x2 , g 3 x1 , x2 2 x1 , and g 4 x1 , x2 x1 x2 1 a) For a 0-1 loss K 4 classification problem, find explicitly and make a plot showing the 4 regions in the unit square where an optimal classifier f has f x k (for k 1, 2,3, 4 ) first if π 1 , 2 , 3 , 4 is .25,.25,.25,.25 and then if it is .2,.2,.3,.3 . b) Find the marginal densities for all of the g k . Define 4 new densities g k* on the unit square by the products of the 2 marginals for the corresponding g k . Consider a 0-1 loss K 4 classification problem ?approximating? the one in a) by using the g k* in place of the g k for the π .25,.25,.25,.25 case. Make a 101101 grid of points of the form i / 100, j / 100 for integers 0 i, j 100 and for each such point determine the value of the optimal classifier. Using these values, make a plot (using a different plotting color and/or symbol for each value of ŷ ) showing the regions in 0,1 where the optimal classifier classifies to each class. 2 c) Find the g k conditional densities for x2 | x1 . Note that based on these and the marginals in part b) you can simulate pairs from any of the 4 joint distributions by first using the inverse probability transform of a uniform variable to simulate from the x1 marginal and then using the inverse probability transform to simulate from the conditional of x2 | x1 . x2 | x1 . (It's also easy to use a rejection algorithm based on x1 , x2 pairs uniform on 0,1 .) 2 d) Generate 2 data sets consisting of multiple independent pairs x, y where y is uniform on 1, 2, 3, 4 and conditioned on y k , the variable x has density gk . Make first a small training set with N 400 pairs (to be used below). Then make a larger test set of 10,000 pairs. Use the test set to evaluate the error rates of the optimal rule from a) and then the "naïve" rule from b). e) See Section 4.6.3 of JWHT for use of a nearest neighbor classifier. Based on the N 400 training set from d), for several different numbers of neighbors (say 1,3,5,10) make a plot like that required in b) showing the regions where the nearest neighbor classifiers classify to each of the 4 classes. Then evaluate the test error rate for the nearest neighbor rules based on the small training set. 17 f) This surely won't work very well, but give the following a try for fun. Naively/unthinkingly apply linear discriminant analysis to the training set as in Section 4.6.3 of JWHT. Identify the regions in 0,1 corresponding to the values of yˆ 1, 2,3, 4 . Evaluate the test error rate for 2 LDA based on this training set. g) Following the outline of Section 8.3.1 of JWHT, fit a classification tree to the data set using 5-fold cross-validation to choose tree size. Make a plot like that required in b) showing the regions where the tree classifies to each of the 4 classes. Evaluate the test error rate for this tree. h) Fit a default random forest to the training set. Make a plot like that required in b) showing the regions where the tree classifies to each of the 4 classes. Evaluate the test error rate for this tree. i) Based on the training set, one can make estimates of the 2-d densities g k as gˆ k x 1 # i with yi k i with yi k h x | xi , 2 for h | u, 2 the bivariate normal density with mean vector u and covariance matrix 2 I . (Try perhaps .1 .) Using these estimates and the relative frequencies of the possible values of y in the training set ˆ k # i with yi k N an approximation of the optimal classifier is fˆ x arg max ˆ gˆ x arg max k k k k i with yi k h x | xi , 2 Make a plot like that required in b) showing the regions where this classifies to each of the 4 classes. Then evaluate the test error rate for the approximately optimal rule based on the small training set. 34. (Not to be turned in. Exactly as in 2011 HW.) Figure 4.4 of HTF gives a 2-dimensional plot of the "vowel training data" (available on the book’s website at http://wwwstat.stanford.edu/~tibs/ElemStatLearn/index.html or from http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Vowel+Recognition++Deterding+Data%29 ). The ordered pairs of first 2 canonical variates are plotted to give a "best" reduced rank LDA picture of the data like that below (lacking the decision boundaries). a) Use Section 14.1 of the course outline to reproduce Figure 4.4 of HTF (color-coded by group, with group means clearly indicated). Keep in mind that you may need to multiply 18 one or both of your coordinates by −1 to get the exact picture. Use the R function lda (MASS package) to obtain the group means and coefficients of linear discriminants for the vowel training data. Save the lda object by a command such as LDA=lda(insert formula, data=vowel). b) Reproduce a version of Figure 1a) above. You will need to plot the first two canonical coordinates as in a). Decision boundaries for this figure are determined by classifying to the nearest group mean. Do the classification for a fine grid of points covering the entire area of the plot. If you wish, you may plot the points of the grid, color coded according to their classification, instead of drawing in the black lines. c) Make a version of Figure 1b) above with decision boundaries now determined by using logistic regression as applied to the first two canonical variates. You will need to create a data frame with columns y, canonical variate 1, and canonical variate 2. Use the vglm function (VGAM package) with family=multinomial() to do the logistic regression. Save the lda object by a command such as LR=vglm(insert formula, family=multinomial(), data=data set). A set of observations can now be classified to groups by using the command predict(LR, newdata, type=‘‘response’’), where newdata contains the observations to be classified. The outcome of the predict function will be a matrix of probabilities. Each row contains the probabilities that a corresponding observation belongs to each of the groups (and thus sums to 1). We classify to the group with maximum probability. As in b), do the classification for a fine 19 grid of points covering the entire area of the plot. You may again plot the points of the grid, color-coded according to their classification, instead of drawing in the black lines. 35. (Not to be turned in. Exactly as in 2011 HW.) Continue use of the vowel training data from Problem 34. a) So that you can see and plot results, first use the 2 canonical variates employed in Problem 34 and use rpart in R to find a classification tree with empirical error rate comparable to the reduced rank LDA classifier pictured in Figure 1a. Make a plot showing the partition of the region into pieces associated with the 11 different classes. (The intention here is that you show rectangular regions indicating which classes are assigned to each rectangle, in a plot that might be compared to the earlier plots.) b) Beginning with the original data set (rather than with the first 2 canonical variates) and use rpart in R to find a classification tree with empirical error rate comparable to the classifier in a). How much (if any) simpler/smaller is the tree here than in a)? HW #5 (Due //15) 36. Return to the context of Problem 33. (Some of these things are direct applications of code in JWHT or KJ and so I'm sure they can be done fairly painlessly. Other things may not be so easy or could even be essentially impossible without a lot of new coding. If that turns out to be the case, do only what you can in a finite amount of time.) a) Use logistic regression (e.g. as implemented in glm() or glmnet()) on the training data you generated in 33d) to find 6 classifiers with linear boundaries for choice between all pairs of classes. Then consider an OVO classifier that classifies x to the class with the largest sum (of 3) estimated probabilities coming from these logistic regressions. Make a plot like the one required 2 in 33b) showing the regions in 0,1 where this classifier has fˆ x 1, 2,3, and 4 . Use the large test set to evaluate error rate of this classifier. b) It seems from the glmnet() documentation that using family="multinomial" one can fit multivariate versions of logistic regression models. Try this using the training set. Consider the classifier that classifies x to the class with the largest estimated probability. Make a plot like the 2 one required in 33b) showing the regions in 0,1 where this classifier has fˆ x 1, 2,3, and 4 . Use the large test set to evaluate error rate of this classifier. 20 c) Pages 360-361 of K&J indicate that upon converting output y taking values in 1, 2, 3, 4 to 4 binary indicator variables, one can use nnet with the 4 binary outputs (and the option linout = FALSE) to fit a single hidden layer neural network to the training data with predicted output values between 0 and 1 for each output variable. Try several different numbers of hidden nodes and "decay" values to get fitted neural nets. From each of these, define a classifier that classifies x to the class with the largest predicted response. Use the large test set to evaluate error rates of these classifiers and pick the one with the smallest error rate. Make a plot like the one required in 33b) showing the regions in 0,1 where your best neural net classifier has 2 fˆ x 1, 2,3, and 4 . d) Use svm() in package e1071 to fit SVM's to the y 1 and y 2 training data for the "linear" kernel, "polynomial" kernel (with default order 3), "radial basis" kernel (with default gamma, half that gamma value, and twice that gamma value) Compute as in Section 9.6 of JWHT and use the plot() function to investigate the nature of the 4 classifiers. If it's possible, put the training data pairs on the plot using different symbols or colors for classes 1 and 2, and also identify the support vectors. e) Compute as in JWHT Section 9.6.4 to find SVMs (using the kernels indicated in d)) for the K 4 class problem. Again, use the plot() function to investigate the nature of the 4 classifiers. Use the large test set to evaluate the error rates for these 4 classifiers. f) Use either the ada package or the adabag package and fit an AdaBoost classifier to the y 1 and y 2 training data. Make a plot like the one required in 33b) showing the regions in 0,1 2 where this classifier has fˆ x 1 and 2 . Use the large test set to evaluate the error rate of this classifier. How does this error rate compare to the best one possible for comparing classes 1 and 2 with equal weights on the two? (You should be able to get the latter analytically.) g) It appears from the paper "ada: An R Package for Stochastic Boosting" by Mark Culp, Kjell Johnson, and George Michailidis that appeared in the Journal of Statistical Software in 2006 that ada implements a OVA version of a K -class Adaboost classifier. If so, use this and find the corresponding classifier. Make a plot like the one required in 33b) showing the regions 2 in 0,1 where this classifier has fˆ x 1, 2,3, and 4 . Use the large test set to evaluate error rate of this classifier. 21 37. Consider the famous Swiss Bank Note data set that can be found (among other places) on Izenman's book web page http://astro.ocis.temple.edu/~alan/MMST/datasets.html a) Compare an appropriate SVM based on the original input variables to a classifier based on logistic regression with the same variables. (BTW, there is a nice paper "Support Vector Machines in R" by Karatzoglou, Meyer, and Hornik that appeared in Journal of Statistical Software that you can get to through ISU. And there is an R SVM tutorial at http://planatscher.net/svmtut/svmtut.html .) b) Use both an adaBoost algorithm and a random forest to identify appropriate classifiers based on the Bank Note data set. 38. This problem concerns the "Seeds" data set at the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/seeds# . A text version of the 3-class data set is posted on the course web page. Standardize all p 7 input variables before beginning the analyses indicated below. a) Consider first the problem of classification where only varieties 1 and 3 are considered (temporarily code variety 1 as 1 and variety 3 as 1 ) and only predictors x1 and x6 . Using several different values of and constants c , find g HK and 0 minimizing 1 y N i i 1 0 g xi for the (Gaussian) kernel K x, z exp c x z 2 g 2 HK . Make contour plots for those functions 0 g x , and, in particular, show the contour that separates 3,3 3,3 into the regions where a corresponding SVM classifies to classes 1 and 1 . (You may use the kind of plotting of color-coded grid points suggested in Problem 34 to make the "contour plots" is you wish.) b) Use both an adaBoost algorithm and a random forest to identify appropriate classifiers based on variety 1 versus variety 3 classification problem of part a). For a fine grid of points in 3,3 3,3 , indicate on a 2-D plot which points get classified to classes 1 and 1 so that you can make visual comparisons to the SVM classifiers referred to in a). c) Find both an appropriate random forest classifier and an adaBoost.mh classifier for the 3-class problem, (still with p 2 ) and once more show how the classifiers break the 2-D space up into regions (this time associated with one of the 3 wheat varieties). d) How much better can you do at the classification task using a random forest classifier based on all p 7 input variables than you are able to do in part c)? 22 e) Use the k-means method in the R stats package (or some other equivalent method) to find k (7-dimensional) prototypes for representing each of the 3 wheat varieties for each of k 5, 7,10 . Then compare training error rates for classifiers that classify to the variety with the nearest prototype for these values of k . 39. Work again with the "Seeds" data of Problem 38. Begin by again standardizing all p 7 measured variables. a) JMP will do clustering for you. (Look under the Analyze-> Multivariate menu.) In particular, it will do both hierarchical and k -means clustering, and even self-organizing mapping as an option for the latter. Consider i) several different k -means clusterings (say with k 9,12,15, 21 ), ii) several different hierarchical clusterings based on 7-d Euclidean distance (say, again with k 9,12,15, 21 final clusters), and iii) SOM's for several different grids (say 3 3,3 5, 4 4, and 5 5 ). Make some comparisons of how these methods break up the 210 data cases into groups. You can save the clusters into the JMP worksheet and use the GraphBuilder to quickly make plots. If you "jitter" the cluster numbers and use "variety" for both size and color of plotted points, you can quickly get a sense as to how the groups of data points match up method-tomethod and number-of-clusters-to-number-of-clusters (and how the clusters are or are not related to seed variety). Also make some comparisons of the sums squared Euclidean distances to cluster centers. b) Use appropriate R packages/functions and do multi-dimensional scaling on the 210 data cases, mapping from 7 to 2 using Euclidean distances. Plot the 210 vectors z i 2 using different plotting symbols for the 3 different varieties. 40. There is a "Spectral" data set with 200 cases of x, y on the course web page. Plot these data and see that there are somehow 4 different kinds of "structure" in the data set. Apply the "spectral clustering idea (use w d exp d 2 / c ) and see if you can "find" the 4 "clusters" in the data set by appropriate choice of c . 41. (There is a variant of this problem included in the 2011 HW.) Let H be the set of absolutely continuous functions on 0,1 with square integrable first derivatives (that exist except possibly at a set of measure 0 ). Equip H with an inner product f ,g 1 H f 0 g 0 f x g x dx 0 a) Show that 23 R x, z 1 min x, z is a reproducing kernel for this Hilbert space. Now consider the small fake data set in the table below, y x 1.1 0 1.5 .1 2.4 .2 2.2 .3 1.7 .4 1.3 .5 .3 .6 .1 .7 .1 .8 .5 .9 .1.0 1.0 b) For two different values of 0 , find and h H optimizing N y h x i 1 i 2 i h x dx 1 2 0 c) ) For two different values of 0 , find and h H optimizing N i 1 yi h t dt xi 0 h x dx 2 1 2 0 42. Center the outputs for the n 100 data set of Problem 5. Then derive sets of predictions yˆi based on x 0 Gaussian process priors for f x . Plot several of those as functions on the same set of axes (along with centered original data pairs) as follows: a) Make one plot for cases with what appear to you to be a sensible choice of 2 , for exp c 2 , 2 2 , 4 2 , and c 1, 4 . b) Make one plot for cases with exp c and the choices of parameters you made in a). c) Make one plot for cases with 2 one fourth of your choice in a), but where otherwise the parameters of a) are used. 43. (Izenman Problem 14.4.) (Not to be turned in. Exactly as in 2011 HW.) Consider the following 10 two-dimensional points, the first five points, (1, 4),(3.5,6.5),(4.5,7.5),(6,6),(1.5,1.5) belong to Class 1 and the second five points (8, 6.5), (3, 4.5), (4.5, 4), (8,1.5), (2.5, 0) , belong to Class 2. Plot these points on a scatterplot using different symbols or colors to distinguish the two classes. Carry through by hand the AdaBoost algorithm on these points, showing the weights at each step of the process. (Presumably Izenman's intention is that you use simple 2-node trees as your base classifiers.) Determine the final classifier and calculate its empirical misclassification rate. 24