Stat 602 Exam 1
Spring 2015
I have neither given nor received unauthorized assistance on this exam.
________________________________________________________
Name Signed
Date
_________________________________________________________
Name Printed
This is a long exam. You probably won't be able to finish all of it. Don't let yourself get hung up
on one part and miss others that will go (much) faster. Point values are indicated.
1. Consider an SEL prediction problem where p = 1, and the class of functions used for prediction is
the set of constant functions S = { h | h(x) = c ∀x and some c ∈ ℜ }. Suppose that in fact

x ∼ U(0,1), E[y | x] = ax + b, and Var[y | x] = dx² for some d > 0.
6 pts
a) Under this model, what is the best element of S, say g*, for predicting y? Use this to find the
average squared model bias in this problem.
6 pts
b) Suppose that based on an iid sample of N points (xi, yi), fitting is done by least squares (and
thus the predictor f̂(x) = ȳ is employed). What is the average squared fitting bias in this case?
6 pts
c) What is the average prediction error, Err, when the predictor in b) is employed?
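
A quick Monte Carlo sketch in R that one might use to check answers to a)-c) numerically; the values of a, b, d, and N below are illustrative stand-ins, not part of the problem.

## Monte Carlo check of Err for the fitted constant predictor ybar
a <- 2; b <- 1; d <- 4; N <- 10; reps <- 20000    # illustrative values only
draw_xy <- function(n) {
  x <- runif(n)                                   # x ~ U(0,1)
  y <- a * x + b + sqrt(d) * x * rnorm(n)         # E[y|x] = ax + b, Var[y|x] = d x^2
  cbind(x = x, y = y)
}
err <- mean(replicate(reps, {
  train <- draw_xy(N)                             # iid training sample
  test  <- draw_xy(1)                             # independent test case
  (test[, "y"] - mean(train[, "y"]))^2            # SEL of fhat(x) = ybar
}))
err                                               # compare with the closed form from c)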
2. Consider two probability densities on the unit disk in ℜ² (i.e. on { (x1, x2) | x1² + x2² ≤ 1 }),

g1(x1, x2) = 1/π  and  g2(x1, x2) = (3/2π) √(1 − (x1² + x2²)),

and a 2-class 0-1 loss classification problem with prior probabilities π1 = π2 = .5.
6 pts
a) Give a formula for a best-possible single feature T(x1, x2).
10 pts
b) Give an explicit form for the theoretically optimal classifier in this problem.
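
A sketch in R touching a) and b), assuming the density forms above: one natural candidate feature is the likelihood ratio g2/g1 (any monotone function of it works equally well), and with equal priors and 0-1 loss the optimal rule thresholds that ratio at 1.

## Likelihood ratio on the unit disk as a single feature
g1 <- function(x1, x2) 1 / pi
g2 <- function(x1, x2) (3 / (2 * pi)) * sqrt(pmax(1 - (x1^2 + x2^2), 0))
T_feat <- function(x1, x2) g2(x1, x2) / g1(x1, x2)

## With pi1 = pi2 = .5 and 0-1 loss, classify to class 2 exactly where T_feat > 1
grid <- expand.grid(x1 = seq(-1, 1, by = 0.05), x2 = seq(-1, 1, by = 0.05))
grid <- subset(grid, x1^2 + x2^2 <= 1)            # both densities live on the unit disk
grid$class <- ifelse(T_feat(grid$x1, grid$x2) > 1, 2, 1)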
4 pts
c) Suppose that one uses features x1, x2, x1², x2², and x1x2 to do 2-class classification based on a
moderate number of iid training cases from this model. Would you expect better classification
performance for 1) a classifier based on logistic regression using these features or 2) a random
forest using these features? Explain.
Likely Better Classifier:
Your Reasoning:
3. Below are tables specifying two discrete joint distributions for (x, y) that we'll call Model #1
and Model #2. Suppose that N = 2 training cases (drawn iid from one of the models) are
(x1, y1) = (2, 2) and (x2, y2) = (3, 3).

Model #1
y\x      1       2       3
 3       0      .125    .125
 2       0      .125    .125
 1      .125    .125     0
 0      .125    .125     0

Model #2
y\x      1       2       3
 3       0       0      .1
 2      .1      .2      .1
 1      .1      .2      .1
 0      .1       0       0

Suppose further that prior probabilities for the two models are π1 = .3 and π2 = .7.
6 pts
a) Find the posterior probabilities of Models #1 and #2.
6 pts
b) Find the "Bayes model averaging" SEL predictor of y based on x for these training data. (Give
values f̂(1), f̂(2), and f̂(3). You don't need to complete the arithmetic here.)
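
A sketch in R of the arithmetic, using the tables exactly as printed above (rows y = 3, 2, 1, 0, columns x = 1, 2, 3). The mixing in b) weights the two conditional means by the posterior model probabilities from a); treat it as one common form of the calculation, a sketch rather than the unique intended layout.

## Joint tables, rows y = 3,2,1,0 and columns x = 1,2,3
p1 <- rbind(c(0, .125, .125), c(0, .125, .125), c(.125, .125, 0), c(.125, .125, 0))
p2 <- rbind(c(0, 0, .1), c(.1, .2, .1), c(.1, .2, .1), c(.1, 0, 0))
rownames(p1) <- rownames(p2) <- c("3", "2", "1", "0")
colnames(p1) <- colnames(p2) <- c("1", "2", "3")

## iid likelihood of (2,2) and (3,3) under each model
lik1 <- p1["2", "2"] * p1["3", "3"]
lik2 <- p2["2", "2"] * p2["3", "3"]

## a) posterior model probabilities with priors .3 and .7
post <- c(.3 * lik1, .7 * lik2) / (.3 * lik1 + .7 * lik2)

## b) posterior-weighted mix of the two conditional means E[y|x]
ys  <- as.numeric(rownames(p1))
Ey1 <- colSums(ys * p1) / colSums(p1)             # E[y|x] under Model #1
Ey2 <- colSums(ys * p2) / colSums(p2)             # E[y|x] under Model #2
fhat <- post[1] * Ey1 + post[2] * Ey2             # fhat(1), fhat(2), fhat(3)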
4. Consider the p = 2 prediction problem based on N = 9 training points as below.
8
1 1
 


 3
1 0
 −3 
 1 −1
 


5
0 1
1  
1 
Y=
0 0
−1 and X = ( x1 , x 2 ) =

6 
6
 −5 
 0 −1
1
 −1 1 
 


 −3 
 −1 0 
 −5 
 −1 −1
 


6 pts
a) Find the SEL Lasso coefficient vector β̂ optimizing SSE + 8( |β̂1| + |β̂2| ) and give the
corresponding Ŷ^Lasso.
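
A sketch in R of one way to organize this calculation: the two columns of X are orthogonal with x′x = 6, so the penalized criterion separates by coordinate and each Lasso coefficient is a soft-thresholded least squares coefficient.

## Lasso by coordinate-wise soft thresholding (orthogonal design)
X <- cbind(x1 = rep(c(1, 0, -1), each = 3), x2 = rep(c(1, 0, -1), times = 3))
Y <- c(8, 3, -3, 5, -1, -5, 1, -3, -5)            # as tabulated above

ols <- crossprod(X, Y) / 6                        # least squares, since X'X = 6 I
b_lasso <- sign(ols) * pmax(abs(ols) - 8 / (2 * 6), 0)   # threshold lambda/(2 x'x)
Yhat_lasso <- X %*% b_lasso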
8 pts
b) "Boost" your Lasso SEL predictor from a) using ridge regression with λ = 1 and a learning rate
of ν = .1 . Give the resulting vector of predictions Ŷ boost1 .
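
A sketch in R of one reading of this step (continuing from the sketch in a)): fit ridge regression with λ = 1 to the current residuals, then move a fraction ν = .1 of the way toward that fit.

## One ridge-boosting step on the Lasso residuals
nu <- 0.1
r <- Y - Yhat_lasso                               # current residuals
b_ridge <- solve(crossprod(X) + diag(2), crossprod(X, r))   # (X'X + I)^(-1) X'r
Yhat_boost1 <- Yhat_lasso + nu * X %*% b_ridge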
4 pts
c) Is the predictor in b) a linear predictor? If not, argue that it is not. If it is, what is β̂ such that
Ŷ^boost1 = Xβ̂?
8 pts
d) Now "boost" your SEL Lasso predictor from a) using a best "stump" regression tree predictor
(one that makes only a single split) and a learning rate of ν = .1. Give the resulting vector of
predictions Ŷ^boost2.
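
A sketch in R under one reading of "best stump" (continuing from the earlier sketches): search all single axis-aligned splits of x1 or x2, fit the SEL-optimal cell means to the current residuals, and step by ν = .1.

## One stump-boosting step on the Lasso residuals
nu <- 0.1
r <- Y - Yhat_lasso
best <- list(sse = Inf)
for (j in 1:2) for (cut in c(-0.5, 0.5)) {        # all splits between grid values -1, 0, 1
  left <- X[, j] <= cut
  fit  <- ifelse(left, mean(r[left]), mean(r[!left]))   # cell means of residuals
  sse  <- sum((r - fit)^2)
  if (sse < best$sse) best <- list(sse = sse, fit = fit)
}
Yhat_boost2 <- Yhat_lasso + nu * best$fit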
5. Here is (a rounded version of) a smoother matrix Sλ for an N-W smoother with Gaussian kernel
for data with x′ = (0, 0.1, 0.2, …, 0.8, 0.9, 1.0).
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,] 0.47 0.35 0.14 0.03 0.00 0.00 0.00 0.00 0.00 0.00  0.00
 [2,] 0.26 0.35 0.26 0.11 0.02 0.00 0.00 0.00 0.00 0.00  0.00
 [3,] 0.10 0.23 0.31 0.23 0.10 0.02 0.00 0.00 0.00 0.00  0.00
 [4,] 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00 0.00 0.00  0.00
 [5,] 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00 0.00  0.00
 [6,] 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02 0.00  0.00
 [7,] 0.00 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09 0.02  0.00
 [8,] 0.00 0.00 0.00 0.00 0.02 0.09 0.23 0.31 0.23 0.09  0.02
 [9,] 0.00 0.00 0.00 0.00 0.00 0.02 0.10 0.23 0.31 0.23  0.10
[10,] 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.11 0.26 0.35  0.26
[11,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.14 0.35  0.47

6 pts
a) Approximately what bandwidth ( λ ) and effective degrees of freedom are associated with this?

λ ≈ ____________                 effective df = ____________
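
A sketch in R of how one might fill in the blanks, assuming the matrix above has been entered as S. The effective degrees of freedom of a linear smoother is commonly taken to be the trace; the bandwidth line assumes N-W weights proportional to dnorm((xi − xj)/λ), which is one convention among several.

sum(diag(S))                                      # effective df (trace of the smoother)

## Under the assumed convention, a middle row satisfies
## exp(-0.1^2 / (2 * lambda^2)) = 0.23 / 0.31 (neighbor weight / diagonal)
sqrt(-0.1^2 / (2 * log(0.23 / 0.31)))             # implied lambda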
6 pts
b) A rounded version of the matrix product Sλ ⋅ Sλ is below. Thinking of this product as itself a
smoother matrix, what might you think of as "an equivalent kernel"? (Give values of weights
w(i − j) for i, j indices from 1 to 11 so that ŷj ≈ Σᵢ₌₁¹¹ w(i − j) yi.)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
 [1,] 0.33 0.32 0.21 0.10 0.03 0.01 0.00 0.00 0.00 0.00  0.00
 [2,] 0.24 0.28 0.24 0.14 0.07 0.02 0.01 0.00 0.00 0.00  0.00
 [3,] 0.14 0.21 0.24 0.20 0.12 0.06 0.02 0.01 0.00 0.00  0.00
 [4,] 0.06 0.13 0.19 0.22 0.19 0.12 0.06 0.02 0.01 0.00  0.00
 [5,] 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06 0.02 0.01  0.00
 [6,] 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06 0.02  0.01
 [7,] 0.00 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.12 0.06  0.02
 [8,] 0.00 0.00 0.01 0.02 0.06 0.12 0.19 0.22 0.19 0.13  0.06
 [9,] 0.00 0.00 0.00 0.01 0.02 0.06 0.12 0.20 0.24 0.21  0.14
[10,] 0.00 0.00 0.00 0.00 0.01 0.02 0.07 0.14 0.24 0.28  0.24
[11,] 0.00 0.00 0.00 0.00 0.00 0.01 0.03 0.10 0.21 0.32  0.33
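
A sketch in R for reading off an equivalent kernel, assuming S has been entered as above: away from the boundary, the rows of S %*% S are (nearly) translation invariant, so the middle row supplies the weights.

S2 <- S %*% S
w <- S2[6, ]                                      # centered row: w(i - 6) = S2[6, i]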
Here is some R code and more output for this problem.
> round(eigen(S)$values,3)
[1] 1.000 0.921 0.730 0.509 0.317 0.176 0.087 0.038 0.015 0.005 0.001
> round(eigen(S)$vectors[,1],3)
[1] -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302 -0.302
(While Sλ is not symmetric, it is non-singular and has 11 real eigenvalues 1 = d1 > d2 > ⋯ > d11 > 0
with corresponding linearly independent unit eigenvectors u1, u2, …, u11 such that Sλuj = djuj. So
with U = (u1, u2, …, u11) and D = diag(d1, d2, …, d11) we have SλU = UD and Sλ = UDU⁻¹. The
output above provides the eigenvalues and u1.)
6 pts
c) The nth power of Sλ, call it Sλⁿ, has a limit. What is it? Argue that your answer is correct. (A
tight argument can be based on the information above. If you can't see that one, make an intuitive
one based on the nature of smoothing.) What are the corresponding limits of SλⁿY and of the
effective degrees of freedom of Sλⁿ?
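
A numerical sketch in R of the eigenvalue argument, assuming S is entered as above. Writing S = UDU⁻¹ with d1 = 1 and all other eigenvalues strictly inside (0, 1), Dⁿ kills every eigencomponent except the first, so Sⁿ converges to a rank-one matrix whose rows are identical; SⁿY then approaches a constant vector and the trace (effective df) approaches 1.

e <- eigen(S)
U <- e$vectors
Sn <- function(n) Re(U %*% diag(e$values^n) %*% solve(U))   # S^n via S = U D U^(-1)
round(Sn(200), 3)                                 # rows (nearly) identical in the limit
sum(diag(Sn(200)))                                # trace tends to 1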
6 pts
6. Consider the small space of functions on [−1,1]² that are linear combinations of the 4 functions
1, x1, x2, and x1x2, with inner product defined by

⟨f, g⟩ = ∫ over [−1,1]² of f(x1, x2) g(x1, x2) dx1 dx2.

Find the element of this space closest to h(x1, x2) = x1² + x2² (in the L2([−1,1]²) function space
norm ‖h‖ = ⟨h, h⟩^(1/2)). (Note that the functions 1, x1, x2, and x1x2 are orthogonal in this space.)
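
A numerical sketch in R: since the basis functions are orthogonal in this inner product, the closest element is the sum of projections ⟨h, ej⟩/⟨ej, ej⟩ · ej, and the coefficients can be checked by iterated numerical integration over [−1, 1]². (The basis functions below are written with dummy terms so they vectorize in both arguments.)

## Inner product on [-1,1]^2 by iterated numerical integration
ip <- function(f, g)
  integrate(Vectorize(function(x1)
    integrate(function(x2) f(x1, x2) * g(x1, x2), -1, 1)$value
  ), -1, 1)$value

h  <- function(x1, x2) x1^2 + x2^2
e1 <- function(x1, x2) 1 + 0 * x1 + 0 * x2        # the constant function 1
e2 <- function(x1, x2) x1 + 0 * x2
e3 <- function(x1, x2) x2 + 0 * x1
e4 <- function(x1, x2) x1 * x2
basis <- list(e1, e2, e3, e4)

sapply(basis, function(e) ip(h, e) / ip(e, e))    # projection coefficients on 1, x1, x2, x1*x2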