Stat 602 Exam 2 Spring 2015

I have neither given nor received unauthorized assistance on this exam.

________________________________________________________
Name Signed                                        Date

_________________________________________________________
Name Printed

This is a long exam. You probably won't be able to finish all of it. It has 19 parts. I'll score every part out of 5 points except for Problem 4, which I'll score out of 10 points. I'll drop either 5 5-point parts, or Problem 4 and 3 5-point parts, to get an exam score. That is, this is a 75-point exam. Don't let yourself get hung up on some part and miss ones that will go faster than others.

1. A 3-class classification model has class probabilities $\pi_k = P[y = k]$ for $k = 1, 2, 3$ and densities $g_k(x)$ for the 3 conditional distributions of $x \mid y = k$, $k = 1, 2, 3$. For some pair of features $T_1(x)$ and $T_2(x)$ show that:

5 pts a) optimal classification for each of the 3 pairs of classes is linear classification based on the features $t_1$ and $t_2$. (Show the classification boundaries on the axes below. Indicate the scales for the features.)

$T_1(x) = $ _____________________          $T_2(x) = $ _____________________

5 pts b) the optimal 3-class classifier can be realized as an "OVO" (one-versus-one) combination of the three 2-class classifiers. (Show the classification boundaries in terms of these features and indicate which regions correspond to which classification decisions.)

2. In class it was said that the AdaBoost.M1 classification algorithm is derivable by applying the general gradient boosting algorithm of Friedman to exponential loss and basic function updates that are simple "binary stumps." This problem concerns applying the algorithm with hinge loss, $L(y, \hat{y}) = \left[1 - y\hat{y}\right]_+$ (for the $-1$ versus $1$ coding for $y$ and $\hat{y}$), and linear functions of the predictor $x \in \mathbb{R}^p$, say $\beta_0 + x'\beta$, as basic function updates.

5 pts a) What starting function $f_0(x)$ would be used?

5 pts b) With the $(m-1)$st iterate $f_{m-1}(x)$ in hand, each $\tilde{y}_{im}$ is in $\{-1, 1\}$. Using appropriate indicator functions, give an explicit formula for $\tilde{y}_{im}$ in terms of $y_i$ and $\hat{y}_{i,m-1} = f_{m-1}(x_i)$.

5 pts c) Describe in words how you would use standard statistical software to produce $\beta_{0m}$ and $\beta_m$ so that all $\beta_{0m} + x_i'\beta_m$ approximate the values $\tilde{y}_{im}$.

5 pts d) Why does optimization of

$$\sum_{i=1}^{N} \left[1 - y_i\left(f_{m-1}(x_i) + \lambda\left(\beta_{0m} + x_i'\beta_m\right)\right)\right]_+$$

over choices of $\lambda$ involve comparison of this quantity for at most $N$ values of $\lambda$? Give a formula for values of $\lambda$ that you might have to check.

5 pts e) After $M$ iterations you probably won't have an $f_M(x)$ taking values $-1$ or $1$ at every $x_i$. How would you use $f_M(x)$ to do classification?

5 pts 3. Murphy mentions the possibility of "kernelizing" nearest-neighbor classification. ("Kernelization" amounts to mapping $x \in \mathbb{R}^p$ to $K(x, \cdot)$ in a RKHS with kernel $K(\cdot, \cdot)$, $\mathcal{H}_K$, and using inner products and corresponding distances in that space.) Using the Gaussian kernel $K(x, z) = \exp\left(-\|x - z\|^2\right)$, what is the $\mathcal{H}_K$ distance between $K(x, \cdot)$ and $K(z, \cdot)$? Describe the set of training cases $x_i \in \mathbb{R}^p$ with $K(x_i, \cdot)$ in the $\mathcal{H}_K$ $k$-nearest neighborhood of $K(x, \cdot)$.

10 pts 4. Consider the Gaussian kernel $K(x, z) = \exp\left(-(x - z)^2\right)$ for $x$ and $z$ in $[-2, 4]$ and a corresponding RKHS, $\mathcal{H}_K$. Based on the very small $(x, y)$ data set below, we wish to fit a function of the form $f(x) = \beta_0 + \beta_1 x + h(x)$ for $h \in \mathcal{H}_K$ to the data set under the penalty

$$\sum_{i=1}^{5} \left(y_i - f(x_i)\right)^2 + 2\|h\|^2_{\mathcal{H}_K}$$

You may use the fact that the least squares line through these data is $\hat{y} = 3.7 - 0.5x$. In as much numerical detail as you can sensibly provide in the available time, show how to find the optimizing $f(x)$.

x   -1   0   1   2   3
y    4   4   3   3   2
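For Problem 4, here is a minimal numerical sketch in R of one way the optimization can be carried out, assuming (as a representer-theorem-style argument suggests) that the optimizer can be written as $f(x) = \beta_0 + \beta_1 x + \sum_{i=1}^{5} \alpha_i K(x_i, x)$ with $\|h\|^2_{\mathcal{H}_K} = \alpha' K \alpha$. The object names and the block linear system are illustrative, not the required route to an answer.

## Sketch only: penalized fit of f(x) = b0 + b1*x + sum_i alpha_i K(x_i, x)
## minimizing sum_i (y_i - f(x_i))^2 + 2 * alpha' K alpha
x <- c(-1, 0, 1, 2, 3)
y <- c(4, 4, 3, 3, 2)
Kmat <- outer(x, x, function(a, b) exp(-(a - b)^2))  # Gaussian kernel Gram matrix
Tmat <- cbind(1, x)                                  # unpenalized (intercept, slope) part
lam  <- 2                                            # multiplier on ||h||^2_HK

## Setting gradients to zero gives the block linear system
##   (Kmat + lam*I) alpha + Tmat beta = y   and   t(Tmat) alpha = 0
A   <- rbind(cbind(Kmat + lam * diag(5), Tmat),
             cbind(t(Tmat), matrix(0, 2, 2)))
sol <- solve(A, c(y, 0, 0))
alpha <- sol[1:5]
beta  <- sol[6:7]

## Fitted function over [-2, 4]
fhat <- function(x0) beta[1] + beta[2] * x0 +
  as.vector(t(alpha) %*% outer(x, x0, function(a, b) exp(-(a - b)^2)))

The fitted intercept and slope can be compared to the least squares line $\hat{y} = 3.7 - 0.5x$ as a rough sanity check on the computation.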
5. Here is some simple R code and output for a small $N = 5$ and $p = 4$ data set.

> X
     [,1] [,2] [,3] [,4]
[1,]  0.4    2 -0.5    0
[2,] -0.1    0 -0.3    1
[3,]  0.4    0 -0.1    0
[4,]  0.4    0  0.0   -1
[5,]  0.1    2  0.7    0
>
> svd(X)
$d
[1] 2.8551157 1.4762267 0.9397253 0.3549439

$u
            [,1]        [,2]       [,3]       [,4]
[1,]  0.70256076  0.06562895  0.6458656 -0.2618499
[2,] -0.01458943  0.69768837  0.1798028  0.2661822
[3,]  0.01628552 -0.05282808  0.2689008  0.8815301
[4,]  0.02268773 -0.71093125  0.2403923  0.1625961
[5,]  0.71092586 -0.02664090 -0.6484076  0.2388488

$v
            [,1]        [,2]         [,3]       [,4]
[1,]  0.12929953 -0.23823242  0.403567340  0.8738766
[2,]  0.99014314  0.05282123 -0.005410155 -0.1296041
[3,]  0.05222766 -0.17306746 -0.912659300  0.3665691
[4,] -0.01305627  0.95420275 -0.064475843  0.2918382

5 pts a) What is the best rank 2 approximation to the $5 \times 4$ data matrix (in terms of the "Frobenius norm" of the difference between X and the approximation)? (You may identify/name vectors and matrices above and write formulas involving them rather than copying numerical values from above.)

5 pts b) Interpret the fact that by far the largest (in absolute value) number in the first column of the "v" matrix is .99014314.

5 pts 6. Below is a small fake $p = 2$ data set and a scatterplot for it. Consider (graphical) spectral clustering for the data set, using the symmetric set of index pairs $\mathcal{N}$ (based on 3-nearest-neighbor neighborhoods, a neighborhood including the point itself) and weight function $w(d) = \exp(-d^2)$. Set up an appropriate adjacency matrix and give the 8 node degrees.

Index (i)    1   2   3    4     5    6   7   8
x1i          1   2   2   3.5   3.5   3   3   4
x2i          1   0   1    1     2    3   5   5

7. Below is a small $p = 2$ classification training set (for 2 classes) displayed in graphical and tabular forms (circles are class $-1$ and squares are class $1$). A bootstrap sample is made from this data set and is indicated in the table and by counts next to plotted points for those points represented in the sample other than once. This sample is used to create a tree in a random forest with 4 end nodes (accomplished by 3 binary splits). A random choice is made for which of the 2 variables to split on at each opportunity and turns out to produce the sequence "x1 then x1 then x2."

 y   x1   x2   Frequency in Bootstrap Sample
 1    1    2   0
 1    1   12   0
 1    3    5   2
 1    7    0   2
 1    7   13   2
 1    8    6   0
 1   15    4   0
 1    7   15   1
 1    7   17   0
 1   11   15   1
 1   12   16   0
 1   13    7   2
 1   13   20   2
 1   15   11   1
 1   17   16   1
 1   17   19   2

5 pts a) Identify the resulting tree by rectangles on the plot and provide the value of $\hat{y}$ for each rectangle.

5 pts b) What is the numerical value of the out-of-bag (0-1 loss) error for this particular tree?

8. Consider a $p = 3$ predictor 2-class neural net classifier, with a single hidden layer having only 2 nodes.

5 pts a) Provide the network diagram for this situation and a corresponding likelihood term that might be associated with a training vector $(x_{1i}, x_{2i}, x_{3i}, y_i)$, where $y$ has the $-1$ versus $1$ coding.

5 pts b) Suppose that the inputs have been standardized, and completely specify a lasso-motivated jointly continuous prior distribution for the model parameters that might be expected to promote posterior sparsity/near-sparsity for the model parameters.
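For Problem 8a, here is a minimal R sketch of one common form such a network and its per-case likelihood contribution might take: a single hidden layer of 2 logistic nodes feeding a logistic output node that gives $P[y = 1 \mid x]$. The logistic activations and the parameter names are assumptions for illustration, not the only acceptable answer.

## Sketch only: p = 3 inputs, one hidden layer with 2 logistic nodes, logistic output
sigm <- function(u) 1 / (1 + exp(-u))

## hypothetical parameters: A is 2 x 4 (each hidden node's intercept and 3 input weights),
## b is length 3 (output intercept and weights on the 2 hidden-node outputs)
p1 <- function(x, A, b) {
  z <- sigm(A %*% c(1, x))                   # the 2 hidden-node outputs for input x (length 3)
  as.numeric(sigm(b[1] + sum(b[-1] * z)))    # P[y = 1 | x]
}

## likelihood term for one training case (x_i, y_i) with y_i coded -1 / 1
lik_term <- function(x, y, A, b) {
  p <- p1(x, A, b)
  if (y == 1) p else 1 - p
}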
9. On the next pages there is some R code and output for the problem of prediction of (log base 10) "salary" for NFL Football running backs in 1990 based on $p = 6$ inputs. ("Draft" is the round of the NFL draft in which the player was chosen. "played" and "started" refer to participation numbers for 1989 regular season games.)

According to the Glmnet vignette, after standardizing the input variables, the precise objective function used in the program is

$$\frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \beta_0 - x_i'\beta\right)^2 + \lambda\left[\frac{(1-\alpha)}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right]$$

but reported coefficients are in their raw/unstandardized form.

5 pts a) For $\alpha = 1$ the value of $\lambda$ ultimately used in making predictions was $\lambda_{1se} = .0622$ rather than $\lambda_{\min} = .0269$. Both of these values lead to fits with 3 non-zero fitted entries in $\hat{\beta}$. So in what sense does the first lead to a less-complex predictor?

5 pts b) If you had to rank the predictors in some rational order of "importance," how would you do so based on the information provided? Give an order of importance from most to least. Remember that the printed coefficients seem to not be for standardized predictors, so just looking at magnitudes may not be adequate.

5 pts c) Compare CV-chosen lasso and ridge predictors in terms of their coefficients and their predictions on this training set. How different are these?

        b0    bYRS_EXP    bPLAYED    bSTARTED    bCITYPOP    b1/Draft    bPercentStarted
Lasso
Ridge

Predictions and overall difference:

R Output for Problem 9

> FOOTBALL
   Log10Salary YRS_EXP PLAYED STARTED  CITYPOP    1/Draft PercentStarted
1     5.372912       2     16      16  2737000 1.00000000     1.00000000
2     5.397940       5     16       5  2737000 0.16666667     0.31250000
3     5.267172       4     16      16  4620000 0.10000000     1.00000000
4     5.217484       2      6       0  4620000 0.07692308     0.00000000
5     5.397940       3     16       4 13770000 1.00000000     0.25000000
6     5.477121       7     16      13 13770000 0.09090909     0.81250000
7     5.477121       7     11       8  2388000 0.33333333     0.72727273
8     6.000000      10     14      12  2388000 0.12500000     0.85714286
9     5.352183       5     11       7  1307000 0.07692308     0.63636364
10    5.676694       6     16      15  1307000 0.14285714     0.93750000
11    5.628389       7     16       1 18120000 0.33333333     0.06250000
12    5.491362       6     16       0 18120000 0.12500000     0.00000000
13    5.458638       4     13      10 18120000 0.25000000     0.76923077
14    5.845098       5     16      15  5963000 1.00000000     0.93750000
15    6.105510       6     16      16  5963000 0.50000000     1.00000000
16    5.267172       5     15       1  2030000 0.08333333     0.06666667
17    5.845098       2      2       1  2030000 1.00000000     0.50000000
18    5.511883       6     16       1  2030000 0.25000000     0.06250000
19    5.190332       2      7       0  6042000 0.33333333     0.00000000
20    5.698970       2      8       6  1995000 0.33333333     0.75000000
21    5.309630       2     13       1  1995000 0.50000000     0.07692308
22    6.135673       4     14      14  1995000 1.00000000     1.00000000
23    5.204120       2     14       0  1176000 0.33333333     0.00000000
24    6.021189      10     16      14  1728000 1.00000000     0.87500000
25    4.991226       2     11       0  1728000 0.14285714     0.00000000
26    5.568202       2     10       1  3641000 0.33333333     0.10000000
27    5.653213       6     16       8  1237000 0.50000000     0.50000000
28    5.290035       1      1       0  1575000 0.50000000     0.00000000
29    6.176091       8     16      16  3001000 1.00000000     1.00000000
30    5.623249      13     14       0  4110000 0.12500000     0.00000000
>
> y<-as.matrix(FOOTBALL[,1])
> x<-as.matrix(FOOTBALL[,2:7])
> FOOTBALL.out1<-cv.glmnet(x,y,alpha=1)
> plot(FOOTBALL.out1)
> FOOTBALL.out1$lambda.1se
[1] 0.06220122
> FOOTBALL.out1$lambda.min
[1] 0.02692543
>
> FOOTBALL.out2<-glmnet(x,y,alpha=1)
> plot(FOOTBALL.out2)
> print(FOOTBALL.out2)

Call:  glmnet(x = x, y = y, alpha = 1)

      Df    %Dev    Lambda
 [1,]  0 0.00000 0.1900000
 [2,]  1 0.06564 0.1731000
 [3,]  1 0.12010 0.1577000
 [4,]  2 0.17000 0.1437000
 [5,]  3 0.24940 0.1309000
 [6,]  3 0.32280 0.1193000
 [7,]  3 0.38370 0.1087000
[11,]  3 0.54020 0.0749200
[12,]  3 0.56420 0.0682700
[13,]  3 0.58420 0.0622000
[14,]  3 0.60080 0.0566800
[15,]  3 0.61450 0.0516400
[21,]  3 0.65980 0.0295500
[22,]  3 0.66350 0.0269300
[23,]  3 0.66660 0.0245300
[24,]  4 0.67050 0.0223500
[25,]  4 0.67630 0.0203700
[41,]  4 0.70350 0.0045970
[42,]  4 0.70380 0.0041890
[43,]  5 0.70470 0.0038170
[44,]  5 0.70660 0.0034780
[45,]  5 0.70810 0.0031690
[46,]  5 0.70950 0.0028870
[47,]  5 0.71060 0.0026310
[48,]  6 0.71150 0.0023970
[49,]  6 0.71220 0.0021840
[50,]  6 0.71290 0.0019900
[74,]  6 0.71600 0.0002134
[75,]  6 0.71600 0.0001944

> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se)
7 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)    5.22814277
YRS_EXP        0.02922959
PLAYED         .
STARTED        .
CITYPOP        .
1/Draft        0.21817477
PercentStarted 0.19369103
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se/10)
7 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept)     5.123419442
YRS_EXP         0.055751346
PLAYED         -0.009761796
STARTED         .
CITYPOP         .
1/Draft         0.374395453
PercentStarted  0.268474122
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se/20)
7 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept)     5.109819480
YRS_EXP         0.057854878
PLAYED         -0.010033403
STARTED        -0.005071788
CITYPOP         .
1/Draft         0.383974812
PercentStarted  0.346002433
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se/25)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.096819e+00
YRS_EXP         5.813910e-02
PLAYED         -9.248198e-03
STARTED        -8.867456e-03
CITYPOP         2.375342e-11
1/Draft         3.864012e-01
PercentStarted  4.002561e-01
>
> fitted<-predict(FOOTBALL.out2,newx=x,s=FOOTBALL.out1$lambda.1se)
> plot(y,fitted,asp=1)
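In connection with part b) of Problem 9, one way to make the raw-scale lasso coefficients above comparable across predictors is to rescale each by its predictor's sample standard deviation. A minimal R sketch of that bookkeeping follows; the object names match the output above, but the exact scaling convention (for example, the divisor used in the standard deviation) is an assumption.

## Sketch only: put the lambda.1se lasso coefficients on a per-standard-deviation-of-x basis
b.raw <- coef(FOOTBALL.out2, s = FOOTBALL.out1$lambda.1se)  # 7 x 1 sparse matrix of raw coefficients
sds   <- apply(x, 2, sd)                                    # sample SDs of the 6 inputs
b.std <- as.vector(b.raw)[-1] * sds                         # drop the intercept and rescale
names(b.std) <- colnames(x)
sort(abs(b.std), decreasing = TRUE)                         # one possible "importance" ordering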
> #Now an alpha=0 Version
> FOOTBALL.out3<-cv.glmnet(x,y,alpha=0)
> plot(FOOTBALL.out3)
> FOOTBALL.out3$lambda.1se
[1] 0.3397603
> FOOTBALL.out3$lambda.min
[1] 0.02511074
>
> FOOTBALL.out4<-glmnet(x,y,alpha=0)
> plot(FOOTBALL.out4)
> print(FOOTBALL.out4)

Call:  glmnet(x = x, y = y, alpha = 0)

        Df      %Dev    Lambda
  [1,]   6 2.656e-36 190.00000
  [2,]   6 4.613e-03 173.10000
  [3,]   6 5.059e-03 157.70000
  [4,]   6 5.549e-03 143.70000
  [5,]   6 6.086e-03 130.90000
 [11,]   6 1.057e-02  74.92000
 [53,]   6 3.063e-01   1.50500
 [59,]   6 4.043e-01   0.86140
 [68,]   6 5.342e-01   0.37290
 [69,]   6 5.464e-01   0.33980
 [70,]   6 5.580e-01   0.30960
 [81,]   6 6.512e-01   0.11130
 [95,]   6 6.968e-01   0.03025
 [96,]   6 6.983e-01   0.02756
 [97,]   6 6.997e-01   0.02511
 [98,]   6 7.009e-01   0.02288
 [99,]   6 7.021e-01   0.02085
[100,]   6 7.031e-01   0.01900

> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.262813e+00
YRS_EXP         2.294492e-02
PLAYED          4.224220e-04
STARTED         6.212700e-03
CITYPOP        -9.462227e-10
1/Draft         1.800913e-01
PercentStarted  1.300834e-01
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se/5)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.169024e+00
YRS_EXP         4.428290e-02
PLAYED         -6.669709e-03
STARTED         4.625864e-03
CITYPOP        -3.169742e-10
1/Draft         3.122715e-01
PercentStarted  1.994410e-01
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se/10)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.148346e+00
YRS_EXP         5.103829e-02
PLAYED         -9.173812e-03
STARTED         2.046326e-03
CITYPOP         1.265347e-10
1/Draft         3.485644e-01
PercentStarted  2.411498e-01
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se/15)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.137331e+00
YRS_EXP         5.381037e-02
PLAYED         -9.986680e-03
STARTED        -7.499388e-05
CITYPOP         3.251923e-10
1/Draft         3.630724e-01
PercentStarted  2.729196e-01
>
> fitted<-predict(FOOTBALL.out4,newx=x,s=FOOTBALL.out3$lambda.1se)
> plot(y,fitted,asp=1)

fitted1<-predict(FOOTBALL.out2,newx=x,s=FOOTBALL.out1$lambda.1se)
fitted0<-predict(FOOTBALL.out4,newx=x,s=FOOTBALL.out3$lambda.1se)
plot(fitted1,fitted0,asp=1)
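For part c) of Problem 9, a simple way to quantify how different the two CV-chosen sets of fitted values are is to summarize the differences between the fitted1 and fitted0 vectors computed above. The particular summaries below are illustrative choices, not required ones.

## Sketch only: numerical comparison of the lambda.1se lasso and ridge fits above
d <- as.vector(fitted1) - as.vector(fitted0)  # per-case differences in fitted Log10Salary
summary(d)                                    # typical size and sign of the differences
max(abs(d))                                   # largest disagreement for any training case
sqrt(mean(d^2))                               # root mean square difference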