Stat 602X Exam 1
Spring 2013

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed                                                      Date

_________________________________________________________
Name Printed

This is a very long exam consisting of 12 parts. I'll score it at 10 points per problem/part and add your best 8 scores to get an exam score (out of 80 points possible). Some parts will go (much) faster than others, and you'll be wise to do them first.

1. Consider the $p=1$ prediction problem with $N=8$ and training data as below

y:    8     4     4     0     2     3     6     5
x:  .125  .250  .375  .500  .625  .750  .875  1.000

where use of the order $M=2$ Haar basis functions on the unit interval produces

$$
\boldsymbol{X}=\begin{pmatrix}
1 & 1 & \sqrt{2} & 0 & 2 & 0 & 0 & 0\\
1 & 1 & \sqrt{2} & 0 & -2 & 0 & 0 & 0\\
1 & 1 & -\sqrt{2} & 0 & 0 & 2 & 0 & 0\\
1 & 1 & -\sqrt{2} & 0 & 0 & -2 & 0 & 0\\
1 & -1 & 0 & \sqrt{2} & 0 & 0 & 2 & 0\\
1 & -1 & 0 & \sqrt{2} & 0 & 0 & -2 & 0\\
1 & -1 & 0 & -\sqrt{2} & 0 & 0 & 0 & 2\\
1 & -1 & 0 & -\sqrt{2} & 0 & 0 & 0 & -2
\end{pmatrix}
$$

Use the notation that the $j$th column of $\boldsymbol{X}$ is $\boldsymbol{x}_j$.

a) Find the fitted OLS coefficient vector $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ for a model including only $\boldsymbol{x}_1,\boldsymbol{x}_2,\boldsymbol{x}_3,\boldsymbol{x}_4$ as predictors.

b) Center $\boldsymbol{Y}$ to create $\boldsymbol{Y}^*$ and let $\boldsymbol{x}_j^*=\frac{1}{2\sqrt{2}}\,\boldsymbol{x}_j$ for each $j$. Find $\hat{\boldsymbol{\beta}}^{\mathrm{lasso}}\in\mathbb{R}^7$ optimizing

$$
\sum_{i=1}^{8}\left(y_i^*-\sum_{j=2}^{8}b_j x_{ij}^*\right)^2+5\sum_{j=2}^{8}\left|b_j\right|
$$

over choices of $\boldsymbol{b}\in\mathbb{R}^7$.

c) The LAR algorithm applied to $\boldsymbol{Y}^*$ and the set of predictors $\boldsymbol{x}_j^*$ for $j=2,3,\ldots,8$ begins at $\hat{\boldsymbol{Y}}^*=\boldsymbol{0}$ and takes a piecewise linear path through $\mathbb{R}^8$ to $\hat{\boldsymbol{Y}}^{*\mathrm{OLS}}$. Identify the first two points in $\mathbb{R}^8$ at which the direction of the path changes, call them $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$. (Here you may well wish to use both the connection between the LAR path and the lasso path and explicit formulas for the lasso coefficients.)

d) Find $\hat{\boldsymbol{Y}}^{\mathrm{penalty}}\in\mathbb{R}^8$ optimizing

$$
\left\|\boldsymbol{Y}-\boldsymbol{v}\right\|^2+2\langle\boldsymbol{v},\boldsymbol{x}_2^*\rangle^2+2\langle\boldsymbol{v},\boldsymbol{x}_3^*\rangle^2+2\langle\boldsymbol{v},\boldsymbol{x}_4^*\rangle^2+4\sum_{j=5}^{8}\langle\boldsymbol{v},\boldsymbol{x}_j^*\rangle^2
$$

over choices of $\boldsymbol{v}\in\mathbb{R}^8$.

e) Find an $8\times 8$ smoother matrix $\boldsymbol{S}$ corresponding to the penalty in d) (a matrix so that for any $\boldsymbol{Y}\in\mathbb{R}^8$ a $\hat{\boldsymbol{Y}}^{\mathrm{penalty}}$ optimizing the form in part d) is $\boldsymbol{S}\boldsymbol{Y}$) and plot the values in the 4th row of this matrix (the row corresponding to $x=.500$) against $x$ below.

f) If one accepts the statistical conventional wisdom that (generalized) "spline" smoothing is nearly equivalent to kernel smoothing, in light of your plot in e) identify a kernel that might provide smoothed values similar to those for the penalty used in d). (Name a kernel and choose a bandwidth.)

2. Consider the $p=1$ prediction problem with $N=6$ and training data as below.

y:  1.6   .4  3.5  1.5    5    6
x:    1    2    3    4    5    6

Forward selection of binary trees for SEL prediction produces the sequence of trees represented below. If one determines to prune back from the final tree in optimal fashion, there is a nested sequence of sub-trees that are the only possible optimizers of $C_\alpha(T)=\alpha\left|T\right|+\mathrm{SSE}(T)$ for positive $\alpha$. Identify that nested sequence of sub-trees of Tree 5 below.

Tree Number   Subsets of values of x           SSE
0             {1,2,3,4,5,6}                    24.22
1             {1,2,3,4} {5,6}                   5.47
2             {1,2} {3,4} {5,6}                 3.22
3             {1,2} {3} {4} {5,6}               1.22
4             {1} {2} {3} {4} {5,6}              .50
5             {1} {2} {3} {4} {5} {6}              0

3. Consider a $p=1$ prediction problem for $x\in[0,1]$ and random forest predictor $\hat{f}_B^*$ based on a training set of size $N=101$ with $x_i=(i-1)/100$ for $i=1,2,\ldots,101$ and $n_{\max}=5$ (so no split is made in creating a single tree predictor $\hat{f}^{*b}$ that would produce a leaf representing fewer than 5 training points). Consider the bias of prediction at $x=1.00$, namely $E\hat{f}_B^*(1.00)-1.00$, under a model where $Ey_i=x_i$. Say/prove what you can about this bias. (Is it 0? Is it positive? Is it negative? How big is it?)
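For intuition only (no part of the exam requires code), here is a minimal simulation sketch for the bias question in problem 3. It assumes a hypothetical noise model $y_i=x_i+\epsilon_i$ with $\epsilon_i\sim\mathrm{N}(0,.1^2)$ and uses scikit-learn's RandomForestRegressor with min_samples_leaf = 5 as a rough stand-in for the $n_{\max}=5$ rule; it is a sketch under those assumptions, not the exam's required derivation.

```python
# Illustrative sketch only: a small simulation to explore the problem-3
# boundary bias numerically.  Assumptions (not from the exam): noise sd 0.1,
# and scikit-learn's min_samples_leaf = 5 standing in for "no leaf with
# fewer than 5 training points".
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = (np.arange(1, 102) - 1) / 100.0              # x_i = (i-1)/100, i = 1,...,101
X = x.reshape(-1, 1)

preds = []
for _ in range(100):                             # repeat over simulated training sets
    y = x + rng.normal(0.0, 0.1, size=x.size)    # E[y_i] = x_i
    rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                               bootstrap=True, random_state=0)
    rf.fit(X, y)
    preds.append(rf.predict(np.array([[1.00]]))[0])

print("estimated E f_B*(1.00) - 1.00 ~", np.mean(preds) - 1.00)
```

The sign and rough magnitude of the printed average give a numerical feel for what the exam question is asking about.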
4. Consider a small fake data set consisting of $N=6$ data vectors in $\mathbb{R}^2$ and use of a kernel function (mapping $\mathbb{R}^2\times\mathbb{R}^2\to\mathbb{R}$) $K(\boldsymbol{x},\boldsymbol{z})=\exp\left(-3\left\|\boldsymbol{x}-\boldsymbol{z}\right\|^2\right)$. The data and Gram matrices are (to the accuracy shown, with a 0/1 approximate form for the Gram matrix)

$$
\boldsymbol{X}=\begin{pmatrix}
1.01 & .99\\
.99 & 1.01\\
1.00 & 1.00\\
.01 & .00\\
.00 & .01\\
2.00 & 2.00
\end{pmatrix}
\quad\text{and}\quad
\boldsymbol{K}=\begin{pmatrix}
1 & .998 & .999 & .003 & .003 & .003\\
.998 & 1 & .999 & .003 & .003 & .003\\
.999 & .999 & 1 & .003 & .003 & .003\\
.003 & .003 & .003 & 1 & .999 & .000\\
.003 & .003 & .003 & .999 & 1 & .000\\
.003 & .003 & .003 & .000 & .000 & 1
\end{pmatrix}
\approx\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0\\
1 & 1 & 1 & 0 & 0 & 0\\
0 & 0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

As it turns out (using the approximate form for $\boldsymbol{K}$)

$$
\boldsymbol{K}-\frac{1}{6}\boldsymbol{J}\boldsymbol{K}-\frac{1}{6}\boldsymbol{K}\boldsymbol{J}+\frac{1}{36}\boldsymbol{J}\boldsymbol{K}\boldsymbol{J}
$$

(where $\boldsymbol{J}$ is the $6\times 6$ matrix of 1's) has (approximately) an SVD with two non-zero singular values (namely 2.43 and 1.43) and corresponding vectors of principal components

$$
\boldsymbol{u}_1=\left(.39,\,.39,\,.39,\,-.51,\,-.51,\,-.15\right)' \quad\text{and}\quad \boldsymbol{u}_2=\left(.12,\,.12,\,.12,\,.27,\,.27,\,-.9\right)'
$$

Say what both principal components analysis on the raw data and kernel principal components indicate about these data.

Raw PCA:

Kernel PCA:

5. Consider here prediction of a 0-1 (binary) response using a model that says that for two (standardized) predictors $z_1$ and $z_2$

$$
P\left[y_i=1\mid z_{1i},z_{2i}\right]=\frac{\exp\left(\alpha+\beta_1 z_{1i}+\beta_2 z_{2i}\right)}{1+\exp\left(\alpha+\beta_1 z_{1i}+\beta_2 z_{2i}\right)}
$$

(Training data are $N$ vectors $\left(z_{1i},z_{2i},y_i\right)$.) For this problem, one might define a (log-likelihood-based) training error as

$$
TE\left(a,b_1,b_2\right)=\sum_{i=1}^{N}\ln\left(1+\exp\left(a+b_1 z_{1i}+b_2 z_{2i}\right)\right)-\sum_{i=1}^{N}y_i\left(a+b_1 z_{1i}+b_2 z_{2i}\right)
$$

How would you regularize fitting of this model in "ridge-regression" style (penalizing only $b_1$ and $b_2$ and not $a$)? Derive 3 equations that you would need to solve simultaneously to carry out regularized fitting.

6. Consider a simple Bayes model averaging prediction problem. Training data are $\left(x_i,y_i\right)$ where $x_i\in\{0,1\}$ and we assume that these are independent with $y_i=\beta_{x_i}+\epsilon_i$ for $\epsilon_i$ iid $\mathrm{N}(0,1)$. Two models are contemplated. Model 1 says that $\beta_0=\beta_1$ and that a priori this common value is $\mathrm{N}\left(0,10^2\right)$. Model 2 says that $\beta_0$ and $\beta_1$ are a priori independent with both $\beta_0\sim\mathrm{N}\left(0,10^2\right)$ and $\beta_1\sim\mathrm{N}\left(0,10^2\right)$. Assume that a priori the two models are equally likely. Training pairs $\left(x_i,y_i\right)$ are $(0,5),(0,7),(0,6),(1,12)$. Find an appropriate predicted value of $y$ if $x=1$.

HELPFUL FACT (you need NOT prove): If conditioned on $\theta$, observations $z_1,\ldots,z_n$ are iid $\mathrm{N}\left(\theta,1\right)$ and $\theta$ is itself $\mathrm{N}\left(0,\Gamma^2\right)$, then conditioned on $z_1,\ldots,z_n$, $\theta$ is

$$
\mathrm{N}\left(\left(\frac{n}{n+\Gamma^{-2}}\right)\bar{z},\ \left(n+\Gamma^{-2}\right)^{-1}\right).
$$

7. Consider approximations to simple functions using single layer feed-forward neural network forms. First say how you might produce an approximation of a function on $\mathbb{R}^1$ that is an indicator function of any interval $(a,b)$ (finite or infinite), say $I\left[a<x<b\right]$. Then argue that it's possible to approximate any function of the form

$$
g(x)=\sum_{l=1}^{M}c_l\, I\left[a_l<x<b_l\right]
$$

on $\mathbb{R}^1$ using a neural network form.
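For intuition only (no part of the exam requires code), here is a minimal numerical sketch of the idea behind problem 7, assuming logistic sigmoid hidden units and a hypothetical steepness constant k (both my choices, not the exam's): a difference of two steep sigmoids approximates an interval indicator, and a weighted sum of such differences is a single-hidden-layer network form.

```python
# Illustrative sketch only: approximating I(a < x < b) by a difference of
# two steep logistic sigmoids, then summing such differences to approximate
# g(x) = sum_l c_l I(a_l < x < b_l).  The steepness k is a hypothetical
# choice; larger k gives a sharper approximation.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def interval_approx(x, a, b, k=100.0):
    # sigma(k(x - a)) - sigma(k(x - b)) -> I(a < x < b) pointwise as k -> infinity
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))

def g_approx(x, c, a, b, k=100.0):
    # sum_l c_l [sigma(k(x - a_l)) - sigma(k(x - b_l))]: a single-hidden-layer
    # network with 2M logistic units
    return sum(cl * interval_approx(x, al, bl, k) for cl, al, bl in zip(c, a, b))

x = np.linspace(0.0, 1.0, 11)
print(np.round(interval_approx(x, 0.25, 0.75), 3))
print(np.round(g_approx(x, c=[2.0, -1.0], a=[0.1, 0.5], b=[0.4, 0.9]), 3))
```

Increasing k sharpens the approximation near the interval endpoints while leaving it essentially exact elsewhere.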