240F Solutions to Final Exam Spring 2016
1.(a) Let $\hat f$ denote $\hat f(x)$.
$E[(y - \hat f)^2] = E[(\hat f - y)^2]$
$= E[(\hat f - f - \varepsilon)^2]$ as $y = f + \varepsilon$
$= E[\{\hat f - f\}^2] + E[\varepsilon^2]$ as $\varepsilon \perp X$ and $E[\varepsilon] = 0$
$= E[\{(\hat f - E[\hat f]) + (E[\hat f] - f)\}^2] + E[\varepsilon^2]$
$= E[(\hat f - E[\hat f])^2] + (E[\hat f] - f)^2 + E[\varepsilon^2]$ as cross term $= 0$
$= Var[\hat f] + \{Bias(\hat f)\}^2 + Var(\varepsilon)$.
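A small simulation check of this decomposition at a fixed point $x_0$, with an illustrative true function, estimator, and noise level (none of these come from the exam):

```python
import numpy as np

rng = np.random.default_rng(10)

# Average over many training samples the squared prediction error at x0 of a
# simple estimator fhat (a degree-1 OLS fit, chosen only for illustration) and
# compare with variance + bias^2 + Var(eps).
def f(x):
    return np.sin(2 * x)

x0, sigma_eps, n, R = 0.5, 0.3, 50, 5000
fhat_x0, y0 = np.empty(R), np.empty(R)
for r in range(R):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(scale=sigma_eps, size=n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fhat_x0[r] = beta[0] + beta[1] * x0            # fitted value at x0
    y0[r] = f(x0) + rng.normal(scale=sigma_eps)    # fresh test observation at x0

lhs = np.mean((y0 - fhat_x0) ** 2)
rhs = fhat_x0.var() + (fhat_x0.mean() - f(x0)) ** 2 + sigma_eps ** 2
print(lhs, rhs)   # approximately equal
```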
(b) 10-fold cross-validation. Randomly divide the data into 10 folds of approximately equal size. In turn drop one fold, estimate on the other nine folds, and then predict and compute the MSE in the dropped fold. Then $CV_{(10)} = \frac{1}{10}\sum_{k=1}^{10} \text{MSE}_{(k)}$. The model is OLS regression of $y$ on $x, x^2, \ldots, x^p$. For $p = 1, 2, \ldots$ compute $CV_{(10)}^{(p)}$ and choose the degree $p$ for which $CV_{(10)}^{(p)}$ is lowest.
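A minimal sketch of this procedure (the simulated data and candidate degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def cv10_mse(x, y, degree, n_folds=10):
    """10-fold CV estimate of test MSE for an OLS polynomial of given degree."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    mses = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        # Design matrix with columns 1, x, x^2, ..., x^degree
        X_train = np.vander(x[train], degree + 1, increasing=True)
        X_test = np.vander(x[fold], degree + 1, increasing=True)
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        mses.append(np.mean((y[fold] - X_test @ beta) ** 2))
    return np.mean(mses)

# Simulated data purely for illustration
x = rng.uniform(-2, 2, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
cv = {p: cv10_mse(x, y, p) for p in range(1, 8)}
best_p = min(cv, key=cv.get)   # degree with lowest CV(10)
print(cv, best_p)
```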
(c) The lasso estimator $\hat\beta$ of $\beta$ minimizes $\sum_{i=1}^n (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^p |\beta_j| = RSS + \lambda\|\beta\|_1$, where $\lambda \ge 0$ is a tuning parameter and $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the $L_1$ norm.
Equivalently, the lasso estimator minimizes $\sum_{i=1}^n (y_i - x_i'\beta)^2$ subject to $\sum_{j=1}^p |\beta_j| \le s$.
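One standard way to compute the lasso is coordinate descent with soft-thresholding; the sketch below implements it directly for the objective $RSS + \lambda\|\beta\|_1$ above (the simulated data are illustrative, and this is not the only possible algorithm):

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator used in the coordinate-wise lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b sum_i (y_i - x_i'b)^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # sum_i x_ij^2 for each j
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_ss[j]
    return beta

# Illustration on simulated data: a larger lambda shrinks coefficients toward zero
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=100)
print(lasso_cd(X, y, lam=0.0).round(2))    # ~ OLS
print(lasso_cd(X, y, lam=50.0).round(2))   # sparse solution
```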
(d) Splines are one example. For example, a cubic spline estimated by OLS: $f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x - c_1)_+^3 + \cdots + \beta_{3+K}(x - c_K)_+^3$, where $(x - c)_+ = (x - c)$ if $(x - c) > 0$ and equals $0$ otherwise.
Or a smoothing spline: minimize $\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int_a^b g''(t)^2\, dt$ where $a \le$ all $x_i \le b$.
Local polynomial regression and neural networks are also possible.
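A sketch of the truncated-power-basis construction and the OLS fit (the knot locations and data are illustrative):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis 1, x, x^2, x^3, (x-c_1)_+^3, ..., (x-c_K)_+^3."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - c, 0.0, None) ** 3 for c in knots]
    return np.column_stack(cols)

# Fit by OLS on the spline basis (simulated data and knot choices are illustrative)
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.2, size=300)
knots = [2.5, 5.0, 7.5]
B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
fitted = B @ beta
```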
(e) First cut. Choose $y^*$ for which $\sum_{i \in R_1}(y_i - \bar y_{R_1})^2 + \sum_{i \in R_2}(y_i - \bar y_{R_2})^2$ is smallest, where $R_1 = \{i : y_i \le y^*\}$ and $R_2 = \{i : y_i > y^*\}$. Second cut: now do splits on $R_1$ and $R_2$ to minimize $\sum_{j=1}^3 \sum_{i \in R_j}(y_i - \bar y_{R_j})^2$.
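A sketch of the exhaustive threshold search described above, applied to illustrative simulated data (the greedy second cut simply repeats the search within each region):

```python
import numpy as np

def best_threshold(y):
    """Search over candidate thresholds y* for the cut minimizing within-region RSS,
    following the rule in (e): R1 = {i : y_i <= y*}, R2 = {i : y_i > y*}."""
    best_rss, best_c = np.inf, None
    for c in np.unique(y)[:-1]:
        r1, r2 = y[y <= c], y[y > c]
        rss = ((r1 - r1.mean()) ** 2).sum() + ((r2 - r2.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss, best_c = rss, c
    return best_c, best_rss

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
cut1, _ = best_threshold(y)
# Second cut: repeat the same search separately within R1 and within R2
cut_r1, _ = best_threshold(y[y <= cut1])
cut_r2, _ = best_threshold(y[y > cut1])
```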
(f) Logistic regression specifies a model for $\Pr[y|X = x]$, i.e. conditional on $x$.
Linear discriminant analysis specifies a model for $f(X|y)$ and for $\Pr[y]$, i.e. the joint distribution of $X$ and $y$; it then backs out $\Pr[y|X = x]$ using Bayes rule.
(g) Principal components. Several ways to explain. The simplest is that the first principal component has the largest sample variance among all normalized linear combinations of the columns of $X$. The second principal component has the largest variance subject to being orthogonal to the first, and so on.
K-Means splits into K distinct clusters where within cluster variation (measured using Euclidean
distance, for example) is minimized.
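A minimal sketch of the first claim, computing principal components from the eigendecomposition of the sample covariance matrix of the centered columns of X (the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.2]])
Xc = X - X.mean(axis=0)                      # center the columns

# Principal components: eigenvectors of the sample covariance matrix, ordered by
# eigenvalue; the first loading vector maximizes the variance of Xc @ v over all
# unit-norm v, the second does so subject to orthogonality, and so on.
eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order]
scores = Xc @ loadings                       # principal component scores
print(scores.var(axis=0, ddof=1))            # variances in decreasing order
```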
2.(a) $L(y|\theta) = \prod_{i=1}^n e^{-\theta}\theta^{y_i}/y_i! = e^{-n\theta}\theta^{n\bar y}/(\prod_{i=1}^n y_i!)$ and $\pi(\theta) = 1$.
Posterior density $\pi(\theta|y) = e^{-n\theta}\theta^{n\bar y}/(\prod_{i=1}^n y_i!) \times 1 \propto e^{-n\theta}\theta^{n\bar y}$.
(b) $\pi(\theta|y) \propto e^{-n\theta}\theta^{(n\bar y + 1) - 1}$, which is the gamma density with $a = n\bar y + 1$ and $b = n$, so the posterior mean equals $a/b = (n\bar y + 1)/n = \bar y + 1/n$.
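A quick check of (a)-(b) using scipy's gamma distribution (the data vector is illustrative, not from the exam):

```python
import numpy as np
from scipy import stats

# Posterior for theta under a Poisson likelihood and flat prior pi(theta) = 1:
# Gamma with shape a = n*ybar + 1 and rate b = n, so posterior mean = ybar + 1/n.
y = np.array([2, 0, 3, 1, 4, 2, 1])               # illustrative data
n, ybar = len(y), y.mean()
post = stats.gamma(a=n * ybar + 1, scale=1 / n)   # scipy uses scale = 1/rate
print(post.mean(), ybar + 1 / n)                  # the two agree
```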
(c) Use Gibbs sampling. At the $s$th round draw $\theta_1^{(s)}$ from $\pi(\theta_1 \mid \theta_2^{(s-1)})$ and then $\theta_2^{(s)}$ from $\pi(\theta_2 \mid \theta_1^{(s)})$.
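A minimal sketch of this two-block scheme; the bivariate-normal target below is purely an illustrative choice (the exam does not specify the model), picked because both full conditionals are known normals:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative target: (theta1, theta2) bivariate normal with correlation rho.
rho, S = 0.8, 10_000
draws = np.empty((S, 2))
theta1, theta2 = 0.0, 0.0
for s in range(S):
    # Draw theta1^(s) from p(theta1 | theta2^(s-1)), then theta2^(s) from p(theta2 | theta1^(s))
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho ** 2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho ** 2))
    draws[s] = theta1, theta2

print(np.corrcoef(draws[S // 2:].T))   # close to rho after discarding burn-in
```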
(d) For latent variable models where, if the latent variable were fully observed, analysis would be straightforward. Examples are logit and probit and, for missing data problems, the multivariate normal.
(e) Use Metropolis-Hastings when it is difficult to make draws directly from the posterior density. Instead make draws from a candidate density and keep or reject each draw according to the MH acceptance rule.
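A sketch of random-walk Metropolis (a special case of MH with a symmetric candidate density), here targeting the Poisson posterior kernel from 2(a); the data values and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def log_post(theta, n, ybar):
    """Log posterior kernel from 2(a): -n*theta + n*ybar*log(theta), theta > 0."""
    return -np.inf if theta <= 0 else -n * theta + n * ybar * np.log(theta)

# Candidate = current + normal step; accept with probability min(1, posterior ratio).
n, ybar, S = 7, 1.857, 20_000
theta = ybar
draws = np.empty(S)
for s in range(S):
    cand = theta + rng.normal(scale=0.5)
    if np.log(rng.uniform()) < log_post(cand, n, ybar) - log_post(theta, n, ybar):
        theta = cand                      # keep the candidate draw
    draws[s] = theta                      # otherwise keep the current value

print(draws[S // 2:].mean(), ybar + 1 / n)  # should be close
```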
(f) In Stata, generate expb = exp(b) and then sum expb.
In Stata, centile expb, centile(2.5 97.5). Since exp(·) is a monotonically increasing transformation, this will just yield $(e^{0.1706708}, e^{1.019916})$.
(g) The autocorrelations are very slow to die out; the first 100 draws after burn-in have only one new value accepted; the trace has some big movements; the density is bimodal (though the first and second halves are similar).
I would be concerned that the chain has not converged.
3.(a) $\hat\beta = \left[\sum_{g=1}^G \sum_{i=1}^{N_g} x_g x_g'\right]^{-1} \sum_{g=1}^G \sum_{i=1}^{N_g} x_g y_{ig}$
$= \left[\sum_{g=1}^G x_g x_g' \sum_{i=1}^{N_g} 1\right]^{-1} \sum_{g=1}^G x_g \sum_{i=1}^{N_g} y_{ig}$
$= \left[\sum_{g=1}^G N_g x_g x_g'\right]^{-1} \sum_{g=1}^G x_g N_g \bar y_g$, where $\bar y_g = \frac{1}{N_g}\sum_{i=1}^{N_g} y_{ig}$
$= \left[\sum_{g=1}^G x_g x_g'\right]^{-1} \sum_{g=1}^G x_g \bar y_g$ (since $N_g = N$ for all $g$, the common $N$ cancels), which is OLS regression of $\bar y_g$ on $x_g$.
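A quick numerical check of this equivalence, under the illustrative assumptions of equal cluster sizes and regressors constant within each cluster:

```python
import numpy as np

rng = np.random.default_rng(7)

# OLS on individual data with regressors constant within each of G equal-sized
# clusters equals OLS on the G cluster means (ybar_g on x_g).
G, N, k = 50, 10, 3
xg = rng.normal(size=(G, k))                       # one regressor vector per cluster
X = np.repeat(xg, N, axis=0)                       # x_ig = x_g for all i in cluster g
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=G * N)
ybar = y.reshape(G, N).mean(axis=1)                # cluster means ybar_g

b_micro, *_ = np.linalg.lstsq(X, y, rcond=None)
b_means, *_ = np.linalg.lstsq(xg, ybar, rcond=None)
print(np.allclose(b_micro, b_means))               # True
```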
(b) $Var(\bar u_g) = Var\left(\frac{1}{N_g}\sum_{i=1}^{N_g} u_{ig}\right) = \frac{1}{N_g^2} Var\left(\sum_{i=1}^{N_g} u_{ig}\right)$
$= \frac{1}{N_g^2}\left[\sum_{i=1}^{N_g} Var(u_{ig}) + \sum_{i=1}^{N_g}\sum_{j=1, j\ne i}^{N_g} Cov(u_{ig}, u_{jg})\right] = \frac{1}{N_g^2}\left(N_g \sigma_u^2 + N_g(N_g - 1)\rho\sigma_u^2\right) = \frac{\sigma_u^2}{N_g}\left(1 + (N_g - 1)\rho\right)$.
(c) $Var\left(\sum_{g=1}^G x_g \bar u_g\right) = \sum_{g=1}^G Var(x_g \bar u_g) = \sum_{g=1}^G Var(\bar u_g)\, x_g x_g'$
$= \sum_{g=1}^G \frac{\sigma_u^2}{N}\left(1 + (N-1)\rho\right) x_g x_g' = \frac{\sigma_u^2}{N}\left(1 + (N-1)\rho\right)\sum_{g=1}^G x_g x_g'$.
So $Var(\hat\beta) = \left[\sum_{g=1}^G x_g x_g'\right]^{-1} Var\left(\sum_{g=1}^G x_g \bar u_g\right)\left[\sum_{g=1}^G x_g x_g'\right]^{-1} = \frac{\sigma_u^2}{N}\left(1 + (N-1)\rho\right)\left[\sum_{g=1}^G x_g x_g'\right]^{-1}$.
(d) The default uses $Var(\bar u_g) = \frac{\sigma_u^2}{N}$, so $Var(\hat\beta) = \frac{\sigma_u^2}{N}\left[\sum_{g=1}^G x_g x_g'\right]^{-1}$.
The true variance is $(1 + (N-1)\rho)$ times this.
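A small Monte Carlo sketch checking (b)-(d) numerically; all simulation settings here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)

# With regressors constant within clusters and errors equicorrelated within clusters,
# Var(beta_hat) is (1 + (N-1)*rho) times the default OLS formula.
G, N, rho, sigma2_u, R = 100, 10, 0.3, 1.0, 2000
xg = rng.normal(size=G)
X = np.repeat(xg, N)
betas = np.empty(R)
for r in range(R):
    # u_ig = sqrt(rho)*a_g + sqrt(1-rho)*e_ig has Var = 1 and within-cluster corr = rho
    a = np.repeat(rng.normal(size=G), N)
    e = rng.normal(size=G * N)
    u = np.sqrt(rho) * a + np.sqrt(1 - rho) * e
    y = 2.0 * X + np.sqrt(sigma2_u) * u
    betas[r] = (X @ y) / (X @ X)

default_var = sigma2_u / (X @ X)                      # sigma_u^2 * (sum_i x_i^2)^(-1)
print(betas.var() / default_var, 1 + (N - 1) * rho)   # both approximately 3.7
```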
(e) No. The key is that $\frac{1}{G}\sum_{g=1}^G X_g'\hat u_g \hat u_g' X_g - \frac{1}{G}\sum_{g=1}^G X_g' u_g u_g' X_g \stackrel{p}{\to} 0$.
(f) $Var(\hat\beta) = \left[\sum_g \sum_i \frac{\partial m(y_{ig}, x_{ig}; \beta)'}{\partial \beta}\right]^{-1} \left[\sum_{g=1}^G \sum_{i=1}^{N_g}\sum_{j=1}^{N_g} E\left[m(y_{ig}, x_{ig}; \beta)\, m(y_{jg}, x_{jg}; \beta)'\right]\right] \left[\sum_g \sum_i \frac{\partial m(y_{ig}, x_{ig}; \beta)}{\partial \beta'}\right]^{-1}$.
(g) Various methods. The cluster pairs bootstrap resamples the clusters with replacement $B$ times and computes $B$ estimates $\hat\beta^{(b)}$. Then the estimated variance is $\frac{1}{B}\sum_{b=1}^B (\hat\beta^{(b)} - \tilde\beta)(\hat\beta^{(b)} - \tilde\beta)'$, where $\tilde\beta = \frac{1}{B}\sum_{b=1}^B \hat\beta^{(b)}$. Take the square root of the diagonal entries to get standard errors. With a small number of clusters this does no better than the usual cluster-robust estimate.
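A sketch of the pairs cluster bootstrap for OLS; the data-generating process and the choice of B are illustrative, and this is not the only way to implement it:

```python
import numpy as np

rng = np.random.default_rng(9)

def pairs_cluster_bootstrap(X, y, cluster, B=400):
    """Resample whole clusters with replacement, re-estimate OLS each time, and
    use the variance of the B estimates, as described in (g)."""
    ids = np.unique(cluster)
    G, k = len(ids), X.shape[1]
    betas = np.empty((B, k))
    for b in range(B):
        draw = rng.choice(ids, size=G, replace=True)        # resample clusters
        rows = np.concatenate([np.flatnonzero(cluster == g) for g in draw])
        betas[b], *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
    centered = betas - betas.mean(axis=0)
    V = centered.T @ centered / B               # (1/B) sum_b (b_hat - b_bar)(b_hat - b_bar)'
    return np.sqrt(np.diag(V))                  # bootstrap standard errors

# Illustrative clustered data with a cluster random effect
G, N = 30, 8
cluster = np.repeat(np.arange(G), N)
X = np.column_stack([np.ones(G * N), np.repeat(rng.normal(size=G), N)])
y = X @ np.array([1.0, 0.5]) + np.repeat(rng.normal(size=G), N) + rng.normal(size=G * N)
print(pairs_cluster_bootstrap(X, y, cluster))
```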
(h) Denote the two ways of clustering by G and H. OLS with one-way clustering on G gives variance matrix $\hat V_G$. Similarly, clustering on H gives $\hat V_H$, and clustering on $G \cap H$ gives $\hat V_{G \cap H}$. Then $\hat V_{2way} = \hat V_G + \hat V_H - \hat V_{G \cap H}$.
Exam / 40
75th percentile  31 (78%)
Median           27.5 (69%)
25th percentile  24.5 (61%)

Grade cutoffs:
A   30 and above
A-  25 and above
B+  20 and above
B   17 and above
B-  14 and above