Stat 602 Exam 1
Spring 2015

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed                                        Date

_________________________________________________________
Name Printed

This is a long exam. You probably won't be able to finish all of it. Don't let yourself get hung up on some part and miss ones that will go (much) faster than others. Point values are indicated.

1. Consider a SEL prediction problem where $p = 1$ and the class of functions used for prediction is the set of constant functions $\mathcal{S} = \{ h \mid h(x) = c \ \forall x \text{ and some } c \in \mathbb{R} \}$. Suppose that in fact $x \sim \mathrm{U}(0,1)$, $\mathrm{E}[y \mid x] = ax + b$, and $\mathrm{Var}[y \mid x] = d x^2$ for some $d > 0$.

6 pts  a) Under this model, what is the best element of $\mathcal{S}$, say $g^*$, for predicting $y$? Use this to find the average squared model bias in this problem.

6 pts  b) Suppose that based on an iid sample of $N$ points $(x_i, y_i)$, fitting is done by least squares (and thus the predictor $\hat{f}(x) = \bar{y}$ is employed). What is the average squared fitting bias in this case?

6 pts  c) What is the average prediction error, $\mathrm{Err}$, when the predictor in b) is employed?

2. Consider two probability densities on the unit disk in $\mathbb{R}^2$ (i.e. on $\{ (x_1, x_2) \mid x_1^2 + x_2^2 \le 1 \}$),

$$g_1(x_1, x_2) = \frac{1}{\pi} \qquad \text{and} \qquad g_2(x_1, x_2) = \frac{3}{2\pi} \sqrt{1 - x_1^2 - x_2^2},$$

and a 2-class 0-1 loss classification problem with prior probabilities $\pi_1 = \pi_2 = .5$.

6 pts  a) Give a formula for a best-possible single feature $T(x_1, x_2)$.

10 pts  b) Give an explicit form for the theoretically optimal classifier in this problem.

4 pts  c) Suppose that one uses features $x_1, x_2, x_1^2, x_2^2$, and $x_1 x_2$ to do 2-class classification based on a moderate number of iid training cases from this model. Would you expect better classification performance for 1) a classifier based on logistic regression using these features or 2) a random forest using these features? Explain.

Likely Better Classifier: ____________

Your Reasoning:

3. Below are tables specifying two discrete joint distributions for $(x, y)$ that we'll call Model #1 and Model #2. Suppose that $N = 2$ training cases (drawn iid from one of the models) are $(x_1, y_1) = (2, 2)$ and $(x_2, y_2) = (3, 3)$.

          Model #1                           Model #2
y\x    0     1     2     3         y\x    0     1     2     3
 1   .125  .125    0     0          1    .1    .1    .1     0
 2   .125  .125  .125  .125         2     0    .2    .2     0
 3     0     0   .125  .125         3     0    .1    .1    .1

Suppose further that prior probabilities for the two models are $\pi_1 = .3$ and $\pi_2 = .7$.

6 pts  a) Find the posterior probabilities of Models #1 and #2.

6 pts  b) Find the "Bayes model averaging" SEL predictor of $y$ based on $x$ for these training data. (Give values $\hat{f}(1)$, $\hat{f}(2)$, and $\hat{f}(3)$. You don't need to complete the arithmetic here.)

4. Consider the $p = 2$ prediction problem based on $N = 9$ training points as below.

$$\mathbf{Y} = \begin{pmatrix} 8 \\ 3 \\ 3 \\ 5 \\ 6 \\ 6 \\ 5 \\ 3 \\ 5 \end{pmatrix} \qquad \text{and} \qquad \mathbf{X} = \left( \mathbf{x}_1, \mathbf{x}_2 \right) = \begin{pmatrix} -1 & -1 \\ -1 & 0 \\ -1 & 1 \\ 0 & -1 \\ 0 & 0 \\ 0 & 1 \\ 1 & -1 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}$$

6 pts  a) Find the SEL Lasso coefficient vector $\hat{\boldsymbol{\beta}}^{\mathrm{Lasso}}$ optimizing $\mathrm{SSE} + 8 \left( |\hat{\beta}_1^{\mathrm{Lasso}}| + |\hat{\beta}_2^{\mathrm{Lasso}}| \right)$ and give the corresponding $\hat{\mathbf{Y}}^{\mathrm{Lasso}}$.

8 pts  b) "Boost" your Lasso SEL predictor from a) using ridge regression with $\lambda = 1$ and a learning rate of $.1$. Give the resulting vector of predictions $\hat{\mathbf{Y}}^{\mathrm{boost1}}$.

4 pts  c) Is the predictor in b) a linear predictor? If not, argue that it is not. If it is, what is $\hat{\boldsymbol{\beta}}$ such that $\hat{\mathbf{Y}}^{\mathrm{boost1}} = \mathbf{X} \hat{\boldsymbol{\beta}}$?

8 pts  d) Now "boost" your SEL Lasso predictor from a) using a best "stump" regression tree predictor (one that makes only a single split) and a learning rate of $.1$. Give the resulting vector of predictions $\hat{\mathbf{Y}}^{\mathrm{boost2}}$.

5. Here is (a rounded version of) a smoother matrix $\mathbf{S}$ for an N-W smoother with Gaussian kernel for data with $x = 0, 0.1, 0.2, \ldots, 0.8, 0.9, 1.0$.
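One way such an N-W smoother matrix can be computed is sketched in the R code below (a minimal sketch; the bandwidth bw is only a placeholder value, since part a) asks what bandwidth actually lies behind the S printed after the code):

x <- seq(0, 1, by = 0.1)
# Nadaraya-Watson smoother matrix: row i holds the normalized Gaussian kernel
# weights that the smoother applies to y1,...,y11 to produce yhat_i
nw_smoother_matrix <- function(x, bw) {
  K <- outer(x, x, function(a, b) dnorm((a - b) / bw))
  K / rowSums(K)  # rescale each row to sum to 1
}
S <- nw_smoother_matrix(x, bw = 0.1)  # bw = 0.1 is only an illustrative choice
round(S, 2)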
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,] 0.47 0.35 0.14 0.03 0.00 0.00 0.00 0.00 0.00  0.00  0.00
 [2,] 0.26 0.35 0.26 0.11 0.02 0.00 0.00 0.00 0.00  0.00  0.00
 [3,] 0.10 0.23 0.31 0.23 0.10 0.02 0.00 0.00 0.00  0.00  0.00
 [4,] 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00 0.00  0.00  0.00
 [5,] 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00  0.00  0.00
 [6,] 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02  0.00  0.00
 [7,] 0.00 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09  0.02  0.00
 [8,] 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.31 0.23  0.09  0.02
 [9,] 0.00 0.00 0.00 0.00 0.00 0.02 0.10 0.23 0.31  0.23  0.10
[10,] 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.11 0.26  0.35  0.26
[11,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.14  0.26  0.47

6 pts  a) Approximately what bandwidth ($\lambda$) and effective degrees of freedom are associated with this?

$\lambda \approx$ ____________        effective df $\approx$ ____________

6 pts  b) A rounded version of the matrix product $\mathbf{S}\mathbf{S}$ is below. Thinking of this product as itself a smoother matrix, what might you think of as "an equivalent kernel"? (Give values of weights $w_{ij}$ for $i, j$ indices from 1 to 11 so that $\hat{y}_j = \sum_{i=1}^{11} w_{ij} y_i$.)

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,] 0.33 0.32 0.21 0.10 0.03 0.01 0.00 0.00 0.00  0.00  0.00
 [2,] 0.24 0.28 0.24 0.14 0.07 0.02 0.01 0.00 0.00  0.00  0.00
 [3,] 0.14 0.21 0.24 0.20 0.12 0.06 0.02 0.01 0.00  0.00  0.00
 [4,] 0.06 0.13 0.19 0.22 0.19 0.12 0.06 0.02 0.01  0.00  0.00
 [5,] 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06 0.02  0.01  0.00
 [6,] 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06  0.02  0.01
 [7,] 0.00 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.12  0.06  0.02
 [8,] 0.00 0.00 0.01 0.02 0.06 0.12 0.19 0.22 0.19  0.13  0.06
 [9,] 0.00 0.00 0.00 0.01 0.02 0.06 0.12 0.20 0.24  0.21  0.14
[10,] 0.00 0.00 0.00 0.00 0.01 0.02 0.07 0.14 0.24  0.28  0.24
[11,] 0.00 0.00 0.00 0.00 0.00 0.01 0.03 0.10 0.21  0.32  0.33

Here is a bit of R code and more output for this problem.

> round(eigen(S)$values,3)
 [1] 1.000 0.921 0.730 0.509 0.317 0.176 0.087 0.038 0.015 0.005 0.001
> round(eigen(S)$vectors[,1],3)
 [1] -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302

(While $\mathbf{S}$ is not symmetric, it is non-singular and has 11 real eigenvalues $1 = d_1 > d_2 > \cdots > d_{11} > 0$ with corresponding linearly independent unit eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{11}$ such that $\mathbf{S} \mathbf{u}_j = d_j \mathbf{u}_j$. So with $\mathbf{U} = \left( \mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{11} \right)$ and $\mathbf{D} = \mathrm{diag}\left( d_1, d_2, \ldots, d_{11} \right)$ we have $\mathbf{S}\mathbf{U} = \mathbf{U}\mathbf{D}$ and $\mathbf{S} = \mathbf{U}\mathbf{D}\mathbf{U}^{-1}$. The output above provides the eigenvalues and $\mathbf{u}_1$.)

6 pts  c) The $n$th power of $\mathbf{S}$, call it $\mathbf{S}^n$, has a limit. What is it? Argue that your answer is correct. If you can't see a tight argument, make an intuitive one based on the nature of smoothing. What are the corresponding limits of $\mathbf{S}^n \mathbf{Y}$ and of the effective degrees of freedom of $\mathbf{S}^n$?

6 pts  6. Consider the small space of functions on $[-1,1]^2$ that are linear combinations of the 4 functions $1, x_1, x_2$, and $x_1 x_2$, with inner product defined by

$$\langle f, g \rangle = \int_{[-1,1]^2} f(x_1, x_2)\, g(x_1, x_2)\, dx_1\, dx_2 .$$

Find the element of this space closest to $h(x_1, x_2) = x_1^2 x_2^2$ (in the $L^2([-1,1]^2)$ function space norm $\|h\| = \langle h, h \rangle^{1/2}$). (Note that the functions $1, x_1, x_2$, and $x_1 x_2$ are orthogonal in this space.)
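For problem 4a), a brute-force numerical check is easy to sketch in R (a sketch only, assuming an unpenalized intercept b0 in the Lasso criterion; Y and X are the training data displayed in problem 4):

# Sketch for problem 4a): minimize SSE + 8(|b1| + |b2|) directly
Y <- c(8, 3, 3, 5, 6, 6, 5, 3, 5)
X <- cbind(x1 = rep(-1:1, each = 3), x2 = rep(-1:1, times = 3))
lasso_crit <- function(b) sum((Y - b[1] - X %*% b[2:3])^2) + 8 * sum(abs(b[2:3]))
opt <- optim(c(mean(Y), 0, 0), lasso_crit)
round(opt$par, 3)  # (b0, b1, b2); fitted values are then b0 + X %*% opt$par[2:3]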
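For problem 5c), the claimed limit of $\mathbf{S}^n$ can be checked numerically along the lines below (again a sketch, with bw = 0.1 standing in for whatever bandwidth produced the printed S):

# Sketch for problem 5c): iterate the smoother and watch S^n stabilize
x <- seq(0, 1, by = 0.1)
K <- outer(x, x, function(a, b) dnorm((a - b) / 0.1))
S <- K / rowSums(K)             # row-stochastic N-W smoother matrix
Sn <- S
for (k in 1:6) Sn <- Sn %*% Sn  # Sn is now S^64, essentially the limit
round(Sn, 3)                    # rows become (nearly) identical, so S^n Y tends
                                # to a common weighted average of the y values
sum(diag(Sn))                   # effective degrees of freedom shrink toward 1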
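And for problem 6, the coefficients of the closest element can be checked by numerical integration (a sketch, taking h(x1, x2) = x1^2 x2^2 as above and exploiting the stated orthogonality of the basis, so each coefficient is <h, g>/<g, g>):

# Sketch for problem 6: projection coefficients onto the orthogonal basis
ip <- function(f, g)  # inner product <f,g>: integral of f*g over [-1,1]^2
  integrate(Vectorize(function(a)
    integrate(function(b) f(a, b) * g(a, b), -1, 1)$value), -1, 1)$value
h <- function(x1, x2) x1^2 * x2^2
basis <- list(one  = function(x1, x2) 0 * x1 * x2 + 1,
              x1   = function(x1, x2) x1 + 0 * x2,
              x2   = function(x1, x2) 0 * x1 + x2,
              x1x2 = function(x1, x2) x1 * x2)
sapply(basis, function(g) ip(h, g) / ip(g, g))  # coefficients on 1, x1, x2, x1*x2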