Stat 602 Exam 2
Spring 2015
I have neither given nor received unauthorized assistance on this exam.
________________________________________________________
Name Signed
Date
_________________________________________________________
Name Printed
This is a long exam. You probably won't be able to finish all of it. It has 19 parts. I'll score every part
out of 5 points except for Problem 4 on page 5, which I'll score out of 10 points. I'll drop either five
5-point parts, or Problem 4 and three 5-point parts, to get an exam score. That is, this is a 75-point exam.
Don't let yourself get hung up on some part and miss ones that will go faster than others.
1. A 3-class classification model has $\pi_k = P[y = k] = \frac{1}{3}$ for $k = 1, 2, 3$, and densities $g_k(x)$ for the
conditional distributions of $x \mid y = k$, $k = 1, 2, 3$. For some pair of features $T_1(x)$ and $T_2(x)$ show that:
5 pts
a) optimal classification for each of the 3 pairs of classes is linear classification based on the features
$t_1$ and $t_2$. (Show the classification boundaries on the axes below. Indicate the scales for the features.)

$T_1(x) =$ _____________________

$T_2(x) =$ _____________________

5 pts
b) the optimal 3-class classifier can be realized as an "OVO" (one-versus-one) combination of the three
2-class classifiers. (Show the classification boundaries in terms of these features and indicate which
regions correspond to which classification decisions.)
2. In class it was said that the AdaBoost.M1 classification algorithm is derivable by applying the
general gradient boosting algorithm of Friedman to exponential loss and basic function updates that are
simple "binary stumps." This problem concerns applying the algorithm with hinge loss,
$L(y, \hat{y}) = \left[ 1 - y\hat{y} \right]_+$ (for the $-1$ versus $1$ coding for $y$ and $\hat{y}$), and linear functions of the predictor
$x \in \mathbb{R}^p$, say $\beta_0 + x\beta$, as basic function updates.
5 pts
a) What starting function $f_0(x)$ would be used?
5 pts
b) With the $(m-1)$ iterate $f_{m-1}(x)$ in hand, each $y_{im}$ is in $\{-1, 1\}$. Using appropriate indicator
functions, give an explicit formula for $y_{im}$ in terms of $y_i$ and $\hat{y}_{i,m-1} = f_{m-1}(x_i)$.
5 pts
c) Describe in words how you would use standard statistical software to produce $\beta_{0m}$ and $\beta_m$ so that
all $\beta_{0m} + x_i\beta_m$ approximate the values $y_{im}$.
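(For concreteness, a minimal R sketch of the kind of fit part c) has in mind, using ordinary least squares as the "standard statistical software." The objects ytilde and X below are hypothetical placeholders for the current pseudo-responses $y_{im}$ and the $N \times p$ predictor matrix; they are not part of the problem.)

# placeholders standing in for the pseudo-responses y_im and the predictor matrix
set.seed(1)
X      <- matrix(rnorm(20 * 3), 20, 3)
ytilde <- sample(c(-1, 1), 20, replace = TRUE)
# ordinary least squares of the pseudo-responses on the inputs
update.fit <- lm(ytilde ~ X)
beta0m <- coef(update.fit)[1]     # plays the role of beta_0m
betam  <- coef(update.fit)[-1]    # plays the role of beta_m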
5 pts
d) Why does optimization of
$$\sum_{i=1}^{N} \left[ 1 - y_i \left( f_{m-1}(x_i) + \lambda \left( \beta_{0m} + x_i\beta_m \right) \right) \right]_+$$
over choices of $\lambda$ involve comparison of this quantity for at most $N$ values of $\lambda$? Give a formula for
values of $\lambda$ that you might have to check.
5 pts
e) After $M$ iterations you probably won't have an $f_M(x)$ taking values $-1$ or $1$ at every $x_i$. How
would you use $f_M(x)$ to do classification?
5 pts
3. Murphy mentions the possibility of "kernelizing" nearest-neighbor classification. ("Kernelization"
amounts to mapping $x \in \mathbb{R}^p$ to $K(x,\cdot)$ in a RKHS with kernel $K(\cdot,\cdot)$, $H_K$, and using inner products
and corresponding distances in that space.) Using the Gaussian kernel $K(x,z) = \exp\left( -\left\| x - z \right\|^2 \right)$,
what is the $H_K$ distance between $K(x,\cdot)$ and $K(z,\cdot)$? Describe the set of training cases $x_i \in \mathbb{R}^p$ with
$K(x_i,\cdot)$ in the $H_K$ $k$-nearest neighborhood of $K(x,\cdot)$.
10 pts
4. Consider the Gaussian kernel $K(x,z) = \exp\left( -(x-z)^2 \right)$ for $x$ and $z$ in $[-2, 4]$ and a corresponding
RKHS, $H_K$. Based on the very small $(x, y)$ data set below, we wish to fit a function of the form
$f(x) = \beta_0 + \beta_1 x + h(x)$ for $h \in H_K$ to the data set under the penalty
$$\sum_{i=1}^{5} \left( y_i - f(x_i) \right)^2 + 2\left\| h \right\|_{H_K}^2$$
You may use the fact that the least squares line through these data is $\hat{y} = 3.7 - 0.5x$. In as much
numerical detail as you can sensibly provide in the available time, show how to find the optimizing
$f(x)$.
y    4    4    3    3    2
x   -1    0    1    2    3
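(A minimal numerical sketch of one route to the optimizer, assuming the usual representer-theorem form $f(x) = \beta_0 + \beta_1 x + \sum_j \alpha_j K(x_j, x)$ with only $h$ penalized and $K$ invertible; the object names and the elimination used below are my choices, not the problem's.)

x <- c(-1, 0, 1, 2, 3)
y <- c( 4, 4, 3, 3, 2)
K    <- outer(x, x, function(a, b) exp(-(a - b)^2))   # 5 x 5 Gram matrix
Tmat <- cbind(1, x)                                   # unpenalized part beta0 + beta1*x
# stationarity conditions for ||y - Tmat b - K a||^2 + 2 a'Ka:
#   (K + 2 I) a = y - Tmat b   and   t(Tmat) (y - Tmat b - K a) = 0
A    <- solve(K + 2 * diag(5))
M    <- diag(5) - K %*% A
b    <- solve(t(Tmat) %*% M %*% Tmat, t(Tmat) %*% M %*% y)   # (beta0, beta1)
a    <- A %*% (y - Tmat %*% b)                               # alpha coefficients
fhat <- function(xnew)                                       # the fitted f(x)
  b[1] + b[2] * xnew + colSums(as.vector(a) * exp(-outer(x, xnew, "-")^2))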
5. Here is some simple R code and output for a small $N = 5$ and $p = 4$ data set.
> X
     [,1] [,2] [,3] [,4]
[1,]  0.4    2 -0.5    0
[2,] -0.1    0 -0.3    1
[3,]  0.4    0 -0.1    0
[4,]  0.4    0  0.0   -1
[5,]  0.1    2  0.7    0
>
> svd(X)
$d
[1] 2.8551157 1.4762267 0.9397253 0.3549439

$u
            [,1]        [,2]       [,3]       [,4]
[1,]  0.70256076  0.06562895  0.6458656 -0.2618499
[2,] -0.01458943  0.69768837  0.1798028  0.2661822
[3,]  0.01628552 -0.05282808  0.2689008  0.8815301
[4,]  0.02268773 -0.71093125  0.2403923  0.1625961
[5,]  0.71092586 -0.02664090 -0.6484076  0.2388488

$v
            [,1]        [,2]         [,3]       [,4]
[1,]  0.12929953 -0.23823242  0.403567340  0.8738766
[2,]  0.99014314  0.05282123 -0.005410155 -0.1296041
[3,]  0.05222766 -0.17306746 -0.912659300  0.3665691
[4,] -0.01305627  0.95420275 -0.064475843  0.2918382
5 pts
a) What is the best rank 2 approximation to the $5 \times 4$ data matrix (in terms of the "Frobenius norm" of
the difference between X and the approximation)? (You may identify/name vectors and matrices
above and write formulas involving them rather than copying numerical values from above.)
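(For reference, a short R sketch of the kind of formula part a) is asking for, written in terms of the svd pieces printed above.)

X  <- matrix(c( 0.4, 2, -0.5,  0,
               -0.1, 0, -0.3,  1,
                0.4, 0, -0.1,  0,
                0.4, 0,  0.0, -1,
                0.1, 2,  0.7,  0), nrow = 5, byrow = TRUE)
s  <- svd(X)
X2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])   # best rank-2 approximation in Frobenius norm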
5 pts
b) Interpret the fact that by far the largest (in absolute value) number in the first column of the "v"
matrix is .99014314.
5 pts
6. Below is a small fake $p = 2$ data set and a scatterplot for it. Consider (graphical) spectral
clustering for the data set, using the symmetric set of index pairs $N_2$ (based on 3-nearest-neighbor
neighborhoods, a neighborhood including the point itself) and weight function $w(d) = \exp(-d^2)$. Set
up an appropriate adjacency matrix and give the 8 node degrees.
Index (i)    1    2    3    4      5      6    7    8
x1i          1    2    2    3.5    3.5    3    3    4
x2i          1    0    1    1      2      3    5    5
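(A minimal R sketch of the graph construction just described, under the convention that the diagonal is excluded from the adjacency matrix; that convention, the tie-breaking rule, and the object names are assumptions of the sketch, not part of the problem.)

X8 <- cbind(c(1, 2, 2, 3.5, 3.5, 3, 3, 4),
            c(1, 0, 1, 1,   2,   3, 5, 5))
D  <- as.matrix(dist(X8))                        # Euclidean distances between the 8 points
NN <- t(apply(D, 1, function(d) rank(d, ties.method = "first") <= 3))  # 3-NN, self included
A  <- (NN | t(NN)) & (row(D) != col(D))          # symmetric set of index pairs, no self-loops
W  <- exp(-D^2) * A                              # adjacency matrix with weights w(d) = exp(-d^2)
degrees <- rowSums(W)                            # the 8 node degrees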
7. Below is a small $p = 2$ classification training set (for 2 classes) displayed in graphical and tabular
forms (circles are class $-1$ and squares are class $1$). A bootstrap sample is made from this data set and
is indicated in the table and by counts next to plotted points for those points represented in the sample
other than once. This sample is used to create a tree in a random forest with 4 end nodes
(accomplished by 3 binary splits). A random choice is made for which of the 2 variables to split on at
each opportunity and turns out to produce the sequence "x1, then x1, then x2."
y    x1   x2   Frequency in Bootstrap Sample
1     1    2          0
1     1   12          0
1     3    5          2
1     7    0          2
1     7   13          2
1     8    6          0
1    15    4          0
1     7   15          1
1     7   17          0
1    11   15          1
1    12   16          0
1    13    7          2
1    13   20          2
1    15   11          1
1    17   16          1
1    17   19          2
5 pts
a) Identify the resulting tree by rectangles on the plot and provide the value of ŷ for each rectangle.
5 pts
b) What is the numerical value of the out-of-bag (0-1 loss) error for this particular tree?
8. Consider a $p = 3$ predictor 2-class neural net classifier, with a single hidden layer having only 2
nodes.
5 pts
a) Provide the network diagram for this situation and a corresponding likelihood term that might be
associated with a training vector $(x_{1i}, x_{2i}, x_{3i}, y_i)$ where $y$ has the $-1$ versus $1$ coding.
5 pts
b) Suppose that the inputs have been standardized, and completely specify a lasso-motivated jointly
continuous prior distribution for the model parameters that might be expected to promote posterior
sparsity/near-sparsity for the model parameters.
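(For reference, a sketch of the kind of "lasso-motivated" density usually meant here: independent double-exponential (Laplace) components for the $r$ model parameters. The common rate $\lambda > 0$ and the decision to treat all parameters symmetrically are assumptions of the sketch, not part of the problem.)
$$g(\theta_1, \ldots, \theta_r) = \prod_{j=1}^{r} \frac{\lambda}{2} \exp\left( -\lambda \left| \theta_j \right| \right)$$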
9. On the next pages there is some R code and output for the problem of prediction of (log base 10)
"salary" for NFL football running backs in 1990 based on $p = 6$ inputs. ("Draft" is the round of
the NFL draft in which the player was chosen. "played" and "started" refer to participation
numbers for 1989 regular season games.) According to the glmnet vignette, after standardizing the
input variables, the precise objective function used in the program is
$$\frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \left( \beta_0 + x_i\beta \right) \right)^2 + \lambda \left( \frac{(1-\alpha)}{2} \left\| \beta \right\|_2^2 + \alpha \left\| \beta \right\|_1 \right)$$
but reported coefficients are in their raw/unstandardized form.
5 pts
a) For $\alpha = 1$ the value of $\lambda$ ultimately used in making predictions was $\lambda_{1se} = .0622$ rather than
$\lambda_{min} = .0269$. Both of these values lead to fits with 3 non-zero fitted entries in $\beta$. So in what sense
does the first lead to a less-complex predictor?
5 pts
b) If you had to rank the predictors in some rational order of "importance," how would you do so
based on the information provided? Give an order of importance from most to least. Remember that
the printed coefficients seem not to be for standardized predictors, so just looking at magnitudes may
not be adequate.
5 pts
c) Compare CV-chosen lasso and ridge predictors in terms of their coefficients and their predictions
on this training set. How different are these?

        b0   bYRS_EXP   bPLAYED   bSTARTED   bCITYPOP   b1/Draft   bPercentStarted
Lasso
Ridge

Predictions and overall difference:
R Output for Problem 9
> FOOTBALL
   Log10Salary YRS_EXP PLAYED STARTED  CITYPOP    1/Draft PercentStarted
 1     5.372912       2     16      16  2737000 1.00000000     1.00000000
 2     5.397940       5     16       5  2737000 0.16666667     0.31250000
 3     5.267172       4     16      16  4620000 0.10000000     1.00000000
 4     5.217484       2      6       0  4620000 0.07692308     0.00000000
 5     5.397940       3     16       4 13770000 1.00000000     0.25000000
 6     5.477121       7     16      13 13770000 0.09090909     0.81250000
 7     5.477121       7     11       8  2388000 0.33333333     0.72727273
 8     6.000000      10     14      12  2388000 0.12500000     0.85714286
 9     5.352183       5     11       7  1307000 0.07692308     0.63636364
10     5.676694       6     16      15  1307000 0.14285714     0.93750000
11     5.628389       7     16       1 18120000 0.33333333     0.06250000
12     5.491362       6     16       0 18120000 0.12500000     0.00000000
13     5.458638       4     13      10 18120000 0.25000000     0.76923077
14     5.845098       5     16      15  5963000 1.00000000     0.93750000
15     6.105510       6     16      16  5963000 0.50000000     1.00000000
16     5.267172       5     15       1  2030000 0.08333333     0.06666667
17     5.845098       2      2       1  2030000 1.00000000     0.50000000
18     5.511883       6     16       1  2030000 0.25000000     0.06250000
19     5.190332       2      7       0  6042000 0.33333333     0.00000000
20     5.698970       2      8       6  1995000 0.33333333     0.75000000
21     5.309630       2     13       1  1995000 0.50000000     0.07692308
22     6.135673       4     14      14  1995000 1.00000000     1.00000000
23     5.204120       2     14       0  1176000 0.33333333     0.00000000
24     6.021189      10     16      14  1728000 1.00000000     0.87500000
25     4.991226       2     11       0  1728000 0.14285714     0.00000000
26     5.568202       2     10       1  3641000 0.33333333     0.10000000
27     5.653213       6     16       8  1237000 0.50000000     0.50000000
28     5.290035       1      1       0  1575000 0.50000000     0.00000000
29     6.176091       8     16      16  3001000 1.00000000     1.00000000
30     5.623249      13     14       0  4110000 0.12500000     0.00000000
>
> y<-as.matrix(FOOTBALL[,1])
> x<-as.matrix(FOOTBALL[,2:7])
> FOOTBALL.out1<-cv.glmnet(x,y,alpha=1)
> plot(FOOTBALL.out1)
> FOOTBALL.out1$lambda.1se
[1] 0.06220122
> FOOTBALL.out1$lambda.min
[1] 0.02692543
>
> FOOTBALL.out2<-glmnet(x,y,alpha=1)
> plot(FOOTBALL.out2)
> print(FOOTBALL.out2)

Call: glmnet(x = x, y = y, alpha = 1)

       Df    %Dev    Lambda
 [1,]   0 0.00000 0.1900000
 [2,]   1 0.06564 0.1731000
 [3,]   1 0.12010 0.1577000
 [4,]   2 0.17000 0.1437000
 [5,]   3 0.24940 0.1309000
 [6,]   3 0.32280 0.1193000
 [7,]   3 0.38370 0.1087000
 ...
[11,]   3 0.54020 0.0749200
[12,]   3 0.56420 0.0682700
[13,]   3 0.58420 0.0622000
[14,]   3 0.60080 0.0566800
[15,]   3 0.61450 0.0516400
 ...
[21,]   3 0.65980 0.0295500
[22,]   3 0.66350 0.0269300
[23,]   3 0.66660 0.0245300
[24,]   4 0.67050 0.0223500
[25,]   4 0.67630 0.0203700
 ...
[41,]   4 0.70350 0.0045970
[42,]   4 0.70380 0.0041890
[43,]   5 0.70470 0.0038170
[44,]   5 0.70660 0.0034780
[45,]   5 0.70810 0.0031690
[46,]   5 0.70950 0.0028870
[47,]   5 0.71060 0.0026310
[48,]   6 0.71150 0.0023970
[49,]   6 0.71220 0.0021840
[50,]   6 0.71290 0.0019900
 ...
[74,]   6 0.71600 0.0002134
[75,]   6 0.71600 0.0001944
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se)
7 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)    5.22814277
YRS_EXP        0.02922959
PLAYED         .
STARTED        .
CITYPOP        .
1/Draft        0.21817477
PercentStarted 0.19369103
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se/10)
7 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept)     5.123419442
YRS_EXP         0.055751346
PLAYED         -0.009761796
STARTED         .
CITYPOP         .
1/Draft         0.374395453
PercentStarted  0.268474122
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se/20)
7 x 1 sparse Matrix of class "dgCMatrix"
                          1
(Intercept)     5.109819480
YRS_EXP         0.057854878
PLAYED         -0.010033403
STARTED        -0.005071788
CITYPOP         .
1/Draft         0.383974812
PercentStarted  0.346002433
> coef(FOOTBALL.out2,s=FOOTBALL.out1$lambda.1se/25)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.096819e+00
YRS_EXP         5.813910e-02
PLAYED         -9.248198e-03
STARTED        -8.867456e-03
CITYPOP         2.375342e-11
1/Draft         3.864012e-01
PercentStarted  4.002561e-01
>
> fitted<-predict(FOOTBALL.out2,newx=x,s=FOOTBALL.out1$lambda.1se)
> plot(y,fitted,asp=1)
> #Now an alpha=0 Version
> FOOTBALL.out3<-cv.glmnet(x,y,alpha=0)
> plot(FOOTBALL.out3)
> FOOTBALL.out3$lambda.1se
[1] 0.3397603
> FOOTBALL.out3$lambda.min
[1] 0.02511074
>
> FOOTBALL.out4<-glmnet(x,y,alpha=0)
> plot(FOOTBALL.out4)
> print(FOOTBALL.out4)

Call:
glmnet(x = x, y = y, alpha = 0)

        Df      %Dev    Lambda
  [1,]   6 2.656e-36 190.00000
  [2,]   6 4.613e-03 173.10000
  [3,]   6 5.059e-03 157.70000
  [4,]   6 5.549e-03 143.70000
  [5,]   6 6.086e-03 130.90000
  ...
 [11,]   6 1.057e-02  74.92000
  ...
 [53,]   6 3.063e-01   1.50500
  ...
 [59,]   6 4.043e-01   0.86140
  ...
 [68,]   6 5.342e-01   0.37290
 [69,]   6 5.464e-01   0.33980
 [70,]   6 5.580e-01   0.30960
  ...
 [81,]   6 6.512e-01   0.11130
  ...
 [95,]   6 6.968e-01   0.03025
 [96,]   6 6.983e-01   0.02756
 [97,]   6 6.997e-01   0.02511
 [98,]   6 7.009e-01   0.02288
 [99,]   6 7.021e-01   0.02085
[100,]   6 7.031e-01   0.01900
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.262813e+00
YRS_EXP         2.294492e-02
PLAYED          4.224220e-04
STARTED         6.212700e-03
CITYPOP        -9.462227e-10
1/Draft         1.800913e-01
PercentStarted  1.300834e-01
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se/5)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.169024e+00
YRS_EXP         4.428290e-02
PLAYED         -6.669709e-03
STARTED         4.625864e-03
CITYPOP        -3.169742e-10
1/Draft         3.122715e-01
PercentStarted  1.994410e-01
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se/10)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.148346e+00
YRS_EXP         5.103829e-02
PLAYED         -9.173812e-03
STARTED         2.046326e-03
CITYPOP         1.265347e-10
1/Draft         3.485644e-01
PercentStarted  2.411498e-01
> coef(FOOTBALL.out4,s=FOOTBALL.out3$lambda.1se/15)
7 x 1 sparse Matrix of class "dgCMatrix"
                           1
(Intercept)     5.137331e+00
YRS_EXP         5.381037e-02
PLAYED         -9.986680e-03
STARTED        -7.499388e-05
CITYPOP         3.251923e-10
1/Draft         3.630724e-01
PercentStarted  2.729196e-01
>
> fitted<-predict(FOOTBALL.out4,newx=x,s=FOOTBALL.out3$lambda.1se)
> plot(y,fitted,asp=1)
> fitted1<-predict(FOOTBALL.out2,newx=x,s=FOOTBALL.out1$lambda.1se)
> fitted0<-predict(FOOTBALL.out4,newx=x,s=FOOTBALL.out3$lambda.1se)
> plot(fitted1,fitted0,asp=1)
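(One possible way to summarize the "overall difference" asked for in part c) of Problem 9, comparing the two sets of fitted values computed just above; the particular summaries are my choice, not the exam's.)

mean(abs(fitted1 - fitted0))          # average absolute difference in fitted values
sqrt(mean((fitted1 - fitted0)^2))     # root mean squared difference in fitted values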