240F Solutions to Final Exam Spring 2016
1.(a) Let $\hat f$ denote $\hat f(x)$.
$E[(y - \hat f)^2] = E[(\hat f - y)^2]$
$= E[(\hat f - f - \varepsilon)^2]$ as $y = f + \varepsilon$
$= E[\{\hat f - f\}^2] + E[\varepsilon^2]$ as $\varepsilon \perp X$ and $E[\varepsilon] = 0$
$= E[\{(\hat f - E[\hat f]) + (E[\hat f] - f)\}^2] + E[\varepsilon^2]$
$= E[(\hat f - E[\hat f])^2] + (E[\hat f] - f)^2 + E[\varepsilon^2]$ as cross term $= 0$
$= Var[\hat f] + \{Bias(\hat f)\}^2 + Var(\varepsilon)$.
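A small simulation check of this decomposition at a fixed point $x_0$, with an illustrative true function, estimator, and noise level (none of these come from the exam):

```python
import numpy as np

rng = np.random.default_rng(10)

# Average over many training samples the squared prediction error at x0 of a
# simple estimator fhat (a degree-1 OLS fit, chosen only for illustration) and
# compare with variance + bias^2 + Var(eps).
def f(x):
    return np.sin(2 * x)

x0, sigma_eps, n, R = 0.5, 0.3, 50, 5000
fhat_x0, y0 = np.empty(R), np.empty(R)
for r in range(R):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(scale=sigma_eps, size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fhat_x0[r] = beta[0] + beta[1] * x0            # fitted value at x0
    y0[r] = f(x0) + rng.normal(scale=sigma_eps)    # fresh test observation at x0

lhs = np.mean((y0 - fhat_x0) ** 2)
rhs = fhat_x0.var() + (fhat_x0.mean() - f(x0)) ** 2 + sigma_eps ** 2
print(lhs, rhs)   # approximately equal
```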
(b) 10-fold cross-validation. Randomly divide the data into 10 folds of approximately equal size. In turn drop one fold, estimate on the other nine folds, and then predict and compute the MSE in the dropped fold. Then $CV_{(10)} = \frac{1}{10}\sum_{k=1}^{10} \text{MSE}_{(k)}$. The model is OLS regression of $y$ on $x, x^2, \ldots, x^p$. For $p = 1, 2, \ldots$ compute $CV_{(10)}^{(p)}$ and choose the degree $p$ for which $CV_{(10)}^{(p)}$ is lowest.
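A minimal sketch of this procedure (the simulated data and candidate degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def cv10_mse(x, y, degree, n_folds=10):
    """10-fold CV estimate of test MSE for an OLS polynomial of given degree."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    mses = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        # Design matrix with columns 1, x, x^2, ..., x^degree
        X_train = np.vander(x[train], degree + 1, increasing=True)
        X_test = np.vander(x[fold], degree + 1, increasing=True)
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        mses.append(np.mean((y[fold] - X_test @ beta) ** 2))
    return np.mean(mses)

# Simulated data purely for illustration
x = rng.uniform(-2, 2, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
cv = {p: cv10_mse(x, y, p) for p in range(1, 8)}
best_p = min(cv, key=cv.get)   # degree with lowest CV(10)
print(cv, best_p)
```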
(c) The lasso estimator $\hat\beta$ of $\beta$ minimizes $\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS + \lambda\|\beta\|_1$, where $\lambda \ge 0$ is a tuning parameter and $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the $L_1$ norm.
Equivalently, the lasso estimator minimizes $\sum_{i=1}^n (y_i - x_i'\beta)^2$ subject to $\sum_{j=1}^p |\beta_j| \le s$.
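One standard way to compute the lasso is coordinate descent with soft-thresholding; the sketch below implements it directly for the objective $RSS + \lambda\|\beta\|_1$ above (the simulated data are illustrative, and this is not the only possible algorithm):

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator used in the coordinate-wise lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b sum_i (y_i - x_i'b)^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # sum_i x_ij^2 for each j
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_ss[j]
    return beta

# Illustration on simulated data: a larger lambda shrinks coefficients toward zero
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=100)
print(lasso_cd(X, y, lam=0.0).round(2))    # ~ OLS
print(lasso_cd(X, y, lam=50.0).round(2))   # sparse solution
```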
(d) Splines are one example. For example, a cubic spline estimated by OLS: $f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - c_1)_+^3 + \cdots + \beta_{3+K}(x - c_K)_+^3$, where $(x - c)_+ = (x - c)$ if $(x - c) > 0$ and equals $0$ otherwise.
Or a smoothing spline: minimize $\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int_a^b g''(t)^2\, dt$ where $a \le$ all $x_i \le b$.
Local polynomial regression and neural networks are also possible.
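A sketch of the truncated-power-basis construction and the OLS fit (the knot locations and data are illustrative):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis 1, x, x^2, x^3, (x-c_1)_+^3, ..., (x-c_K)_+^3."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - c, 0.0, None) ** 3 for c in knots]
    return np.column_stack(cols)

# Fit by OLS on the spline basis (simulated data and knot choices are illustrative)
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.2, size=300)
knots = [2.5, 5.0, 7.5]
B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ beta
```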
(e) First cut. Choose $y^*$ for which $\sum_{i \in R_1}(y_i - \bar y_{R_1})^2 + \sum_{i \in R_2}(y_i - \bar y_{R_2})^2$ is smallest, where $R_1 = \{i : y_i \le y^*\}$ and $R_2 = \{i : y_i > y^*\}$. Second cut: now do splits on $R_1$ and $R_2$ to minimize $\sum_{j=1}^3 \sum_{i \in R_j}(y_i - \bar y_{R_j})^2$.
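A sketch of the exhaustive threshold search described above, applied to illustrative simulated data (the greedy second cut simply repeats the search within each region):

```python
import numpy as np

def best_threshold(y):
    """Search over candidate thresholds y* for the cut minimizing within-region RSS,
    following the rule in (e): R1 = {i : y_i <= y*}, R2 = {i : y_i > y*}."""
    best_rss, best_c = np.inf, None
    for c in np.unique(y)[:-1]:
        r1, r2 = y[y <= c], y[y > c]
        rss = ((r1 - r1.mean()) ** 2).sum() + ((r2 - r2.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss, best_c = rss, c
    return best_c, best_rss

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
cut1, _ = best_threshold(y)
# Second cut: repeat the same search separately within R1 and within R2
cut_r1, _ = best_threshold(y[y <= cut1])
cut_r2, _ = best_threshold(y[y > cut1])
```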
(f) Logistic regression specifies a model for $\Pr[y|X = x]$, i.e. conditional on $x$.
Linear discriminant analysis specifies a model for $f(X|y)$ and for $\Pr[y]$, i.e. the joint distribution of $X$ and $y$; it then backs out $\Pr[y|X = x]$ using Bayes rule.
(g) Principal components. Several ways to explain. The simplest is that the first principal component has the largest sample variance among all normalized linear combinations of the columns of $X$. The second principal component has the largest variance subject to being orthogonal to the first, and so on.
K-Means splits into K distinct clusters where within cluster variation (measured using Euclidean
distance, for example) is minimized.
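A minimal sketch of the first claim, computing principal components from the eigendecomposition of the sample covariance matrix of the centered columns of X (the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.2]])
Xc = X - X.mean(axis=0)                      # center the columns

# Principal components: eigenvectors of the sample covariance matrix, ordered by
# eigenvalue; the first loading vector maximizes the variance of Xc @ v over all
# unit-norm v, the second does so subject to orthogonality, and so on.
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order]
scores = Xc @ loadings                       # principal component scores
print(scores.var(axis=0, ddof=1))            # variances in decreasing order
```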
2.(a) $L(y|\theta) = \prod_{i=1}^n e^{-\theta}\theta^{y_i}/y_i! = e^{-n\theta}\theta^{n\bar y}/(\prod_{i=1}^n y_i!)$ and $\pi(\theta) = 1$.
Posterior density $\pi(\theta|y) = e^{-n\theta}\theta^{n\bar y}/(\prod_{i=1}^n y_i!) \times 1 \propto e^{-n\theta}\theta^{n\bar y}$.
(b) $\pi(\theta|y) \propto e^{-n\theta}\theta^{(n\bar y + 1) - 1}$, which is the gamma density with $a = n\bar y + 1$ and $b = n$, so the posterior mean equals $a/b = (n\bar y + 1)/n = \bar y + 1/n$.
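A quick check of (a)-(b) using scipy's gamma distribution (the data vector is illustrative, not from the exam):

```python
import numpy as np
from scipy import stats

# Posterior for theta under a Poisson likelihood and flat prior pi(theta) = 1:
# Gamma with shape a = n*ybar + 1 and rate b = n, so posterior mean = ybar + 1/n.
y = np.array([2, 0, 3, 1, 4, 2, 1])               # illustrative data
n, ybar = len(y), y.mean()
post = stats.gamma(a=n * ybar + 1, scale=1 / n)   # scipy uses scale = 1/rate
print(post.mean(), ybar + 1 / n)                  # the two agree
```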
(c) Use Gibbs sampling. At the $s$th round draw $\theta_1^{(s)}$ from $\pi(\theta_1 \mid \theta_2^{(s-1)})$ and then $\theta_2^{(s)}$ from $\pi(\theta_2 \mid \theta_1^{(s)})$.
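A minimal sketch of this two-block scheme; the bivariate-normal target below is purely an illustrative choice (the exam does not specify the model), picked because both full conditionals are known normals:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative target: (theta1, theta2) bivariate normal with correlation rho.
rho, S = 0.8, 10_000
draws = np.empty((S, 2))
theta1, theta2 = 0.0, 0.0
for s in range(S):
    # Draw theta1^(s) from p(theta1 | theta2^(s-1)), then theta2^(s) from p(theta2 | theta1^(s))
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho ** 2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho ** 2))
    draws[s] = theta1, theta2

print(np.corrcoef(draws[S // 2:].T))   # close to rho after discarding burn-in
```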
(d) For latent variable models where, if the latent variable were fully observed, analysis would be straightforward. Examples are logit and probit and, for missing data problems, the multivariate normal.
(e) Use Metropolis-Hastings when it is difficult to make draws directly from the posterior density. Instead make draws from a candidate density and keep or reject each draw according to the MH acceptance rule.
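A sketch of random-walk Metropolis (a special case of MH with a symmetric candidate density), here targeting the Poisson posterior kernel from 2(a); the data values and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def log_post(theta, n, ybar):
    """Log posterior kernel from 2(a): -n*theta + n*ybar*log(theta), theta > 0."""
    return -np.inf if theta <= 0 else -n * theta + n * ybar * np.log(theta)

# Candidate = current + normal step; accept with probability min(1, posterior ratio).
n, ybar, S = 7, 1.857, 20_000
theta = ybar
draws = np.empty(S)
for s in range(S):
    cand = theta + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_post(cand, n, ybar) - log_post(theta, n, ybar):
        theta = cand                      # keep the candidate draw
    draws[s] = theta                      # otherwise keep the current value

print(draws[S // 2:].mean(), ybar + 1 / n)  # should be close
```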
(f) In Stata, generate expb = exp(b) and then sum expb.
In Stata, centile expb, centile(2.5 97.5). Since exp(·) is a monotonically increasing transformation, this will just yield $(e^{0.1706708}, e^{1.019916})$.
(g) The autocorrelations are very slow to die out; the first 100 draws after burn-in have only one new value accepted; the trace has some big movements; the density is bimodal (though the first and second halves are similar).
I would be concerned that the chain has not converged.
3.(a) $\hat\beta = \left[\sum_{g=1}^G \sum_{i=1}^{N_g} x_g x_g'\right]^{-1} \sum_{g=1}^G \sum_{i=1}^{N_g} x_g y_{ig}$
$= \left[\sum_{g=1}^G x_g x_g' \sum_{i=1}^{N_g} 1\right]^{-1} \sum_{g=1}^G x_g \sum_{i=1}^{N_g} y_{ig}$
$= \left[\sum_{g=1}^G N_g x_g x_g'\right]^{-1} \sum_{g=1}^G x_g N_g \bar y_g$, where $\bar y_g = \frac{1}{N_g}\sum_{i=1}^{N_g} y_{ig}$
$= \left[\sum_{g=1}^G x_g x_g'\right]^{-1} \sum_{g=1}^G x_g \bar y_g$ (since $N_g = N$ for all $g$, the common $N$ cancels), which is OLS regression of $\bar y_g$ on $x_g$.
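A quick numerical check of this equivalence, under the illustrative assumptions of equal cluster sizes and regressors constant within each cluster:

```python
import numpy as np

rng = np.random.default_rng(7)

# OLS on individual data with regressors constant within each of G equal-sized
# clusters equals OLS on the G cluster means (ybar_g on x_g).
G, N, k = 50, 10, 3
xg = rng.normal(size=(G, k))                       # one regressor vector per cluster
X = np.repeat(xg, N, axis=0)                       # x_ig = x_g for all i in cluster g
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=G * N)
ybar = y.reshape(G, N).mean(axis=1)                # cluster means ybar_g

b_micro, *_ = np.linalg.lstsq(X, y, rcond=None)
b_means, *_ = np.linalg.lstsq(xg, ybar, rcond=None)
print(np.allclose(b_micro, b_means))               # True
```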
(b) $Var(\bar u_g) = Var\left(\frac{1}{N_g}\sum_{i=1}^{N_g} u_{ig}\right) = \frac{1}{N_g^2} Var\left(\sum_{i=1}^{N_g} u_{ig}\right)$
$= \frac{1}{N_g^2}\left[\sum_{i=1}^{N_g} Var(u_{ig}) + \sum_{i=1}^{N_g}\sum_{j=1, j\ne i}^{N_g} Cov(u_{ig}, u_{jg})\right] = \frac{1}{N_g^2}\left(N_g \sigma_u^2 + N_g(N_g - 1)\rho\sigma_u^2\right) = \frac{\sigma_u^2}{N_g}\left(1 + (N_g - 1)\rho\right)$.
(c) $Var\left(\sum_{g=1}^G x_g \bar u_g\right) = \sum_{g=1}^G Var(x_g \bar u_g) = \sum_{g=1}^G Var(\bar u_g)\, x_g x_g'$
$= \sum_{g=1}^G \frac{\sigma_u^2}{N}\left(1 + (N-1)\rho\right) x_g x_g' = \frac{\sigma_u^2}{N}\left(1 + (N-1)\rho\right)\sum_{g=1}^G x_g x_g'$.
So $Var(\hat\beta) = \left[\sum_{g=1}^G x_g x_g'\right]^{-1} Var\left(\sum_{g=1}^G x_g \bar u_g\right)\left[\sum_{g=1}^G x_g x_g'\right]^{-1} = \frac{\sigma_u^2}{N}\left(1 + (N-1)\rho\right)\left[\sum_{g=1}^G x_g x_g'\right]^{-1}$.
(d) The default uses $Var(\bar u_g) = \frac{\sigma_u^2}{N}$, so $Var(\hat\beta) = \frac{\sigma_u^2}{N}\left[\sum_{g=1}^G x_g x_g'\right]^{-1}$.
The true variance is $(1 + (N-1)\rho)$ times this.
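A small Monte Carlo sketch checking (b)-(d) numerically; all simulation settings here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)

# With regressors constant within clusters and errors equicorrelated within clusters,
# Var(beta_hat) is (1 + (N-1)*rho) times the default OLS formula.
G, N, rho, sigma2_u, R = 100, 10, 0.3, 1.0, 2000
xg = rng.normal(size=G)
X = np.repeat(xg, N)
betas = np.empty(R)
for r in range(R):
    # u_ig = sqrt(rho)*a_g + sqrt(1-rho)*e_ig has Var = 1 and within-cluster corr = rho
    a = np.repeat(rng.normal(size=G), N)
    e = rng.normal(size=G * N)
    u = np.sqrt(rho) * a + np.sqrt(1 - rho) * e
    y = 2.0 * X + np.sqrt(sigma2_u) * u
    betas[r] = (X @ y) / (X @ X)

default_var = sigma2_u / (X @ X)                      # sigma_u^2 * (sum_i x_i^2)^(-1)
print(betas.var() / default_var, 1 + (N - 1) * rho)   # both approximately 3.7
```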
(e) No. The key is that $\frac{1}{G}\sum_{g=1}^G X_g'\hat u_g \hat u_g' X_g - \frac{1}{G}\sum_{g=1}^G X_g' u_g u_g' X_g \stackrel{p}{\to} 0$.
(f) $Var(\hat\beta) = \left[\sum_g \sum_i \frac{\partial m(y_{ig}, x_{ig}; \beta)'}{\partial \beta}\right]^{-1} \left[\sum_{g=1}^G \sum_{i=1}^{N_g}\sum_{j=1}^{N_g} E\left[m(y_{ig}, x_{ig}; \beta)\, m(y_{jg}, x_{jg}; \beta)'\right]\right] \left[\sum_g \sum_i \frac{\partial m(y_{ig}, x_{ig}; \beta)}{\partial \beta'}\right]^{-1}$.
(g) Various methods. The cluster pairs bootstrap resamples the clusters with replacement $B$ times and computes $B$ estimates $\hat\beta^{(b)}$. Then the estimated variance is $\frac{1}{B}\sum_{b=1}^B (\hat\beta^{(b)} - \tilde\beta)(\hat\beta^{(b)} - \tilde\beta)'$, where $\tilde\beta = \frac{1}{B}\sum_{b=1}^B \hat\beta^{(b)}$. Take the square root of the diagonal entries to get standard errors. With a small number of clusters this does no better than the usual cluster-robust estimate.
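A sketch of the pairs cluster bootstrap for OLS; the data-generating process and the choice of B are illustrative, and this is not the only way to implement it:

```python
import numpy as np

rng = np.random.default_rng(9)

def pairs_cluster_bootstrap(X, y, cluster, B=400):
    """Resample whole clusters with replacement, re-estimate OLS each time, and
    use the variance of the B estimates, as described in (g)."""
    ids = np.unique(cluster)
    G, k = len(ids), X.shape[1]
    betas = np.empty((B, k))
    for b in range(B):
        draw = rng.choice(ids, size=G, replace=True)        # resample clusters
        rows = np.concatenate([np.flatnonzero(cluster == g) for g in draw])
        betas[b], *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
    centered = betas - betas.mean(axis=0)
    V = centered.T @ centered / B               # (1/B) sum_b (b_hat - b_bar)(b_hat - b_bar)'
    return np.sqrt(np.diag(V))                  # bootstrap standard errors

# Illustrative clustered data with a cluster random effect
G, N = 30, 8
cluster = np.repeat(np.arange(G), N)
X = np.column_stack([np.ones(G * N), np.repeat(rng.normal(size=G), N)])
y = X @ np.array([1.0, 0.5]) + np.repeat(rng.normal(size=G), N) + rng.normal(size=G * N)
print(pairs_cluster_bootstrap(X, y, cluster))
```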
(h) Denote the two ways of clustering by G and H. OLS with one-way clustering on G gives variance matrix $\hat V_G$. Similarly, clustering on H gives $\hat V_H$, and clustering on $G \cap H$ gives $\hat V_{G \cap H}$. Then $\hat V_{2way} = \hat V_G + \hat V_H - \hat V_{G \cap H}$.
Exam / 40
75th percentile  31 (78%)
Median           27.5 (69%)
25th percentile  24.5 (61%)

Grade cutoffs:
A   30 and above
A-  25 and above
B+  20 and above
B   17 and above
B-  14 and above