Lecture Note 1: Some Probability & Statistics

1 Basics

Joint Distribution For discrete random variables,
$$f_{XY}(x,y) = \Pr(X = x \text{ and } Y = y).$$

Marginal Distribution Sums/integrates out all but one variable:
$$f_X(x) = \sum_y f_{XY}(x,y) \quad\text{and}\quad f_Y(y) = \sum_x f_{XY}(x,y), \qquad \sum_x \sum_y f_{XY}(x,y) = 1.$$

Conditional Distribution
$$f_{X|Y}(x|y) = \frac{f_{XY}(x,y)}{f_Y(y)} \quad\text{for } f_Y(y) \neq 0.$$
Note that for a fixed $y$, the conditional probability must sum to 1:
$$\sum_x f_{X|Y}(x|y) = 1.$$

2 Measures of Central Tendency

Mean
$$E[X] = \sum_x x\, f_X(x).$$

Conditional Mean Mean of $X$ given a value of $Y$:
$$E[X|Y=y] = \sum_x x\, f_{X|Y}(x|y).$$

3 Properties of Conditional Expectations

Law of Iterated Expectations
$$E[X] = E\big[E[X|Y]\big] = \sum_y E[X|Y=y]\, f_Y(y).$$
Noting that $E[X|Y=y]$ is just a function of $Y$, say $g(Y)$, we can write the above as
$$E[X] = E[g(Y)] = \sum_y g(y)\, f_Y(y).$$
In general,
$$E[X|Y] = E\big[E[X|Y,Z]\,\big|\,Y\big].$$

Example 1
$$E[Y^2|Y=y] = y^2, \qquad E[XY^2|Y=y] = y^2\, E[X|Y=y],$$
$$E[g(X,Y)|Y=y] = E[g(X,y)|Y=y] = \sum_x g(x,y)\, f_{X|Y}(x|y).$$

4 Best Constant Predictor

Our objective is to find
$$\operatorname*{argmin}_a\; E\big[(Y-a)^2\big].$$
Let $\mu \equiv E[Y]$ and note that
$$E\big[(Y-a)^2\big] = E\big[((Y-\mu)+(\mu-a))^2\big] = E\big[(Y-\mu)^2\big] + 2(\mu-a)\,E[Y-\mu] + (\mu-a)^2$$
$$= E\big[(Y-\mu)^2\big] + 2(\mu-a)\,(E[Y]-\mu) + (\mu-a)^2 = E\big[(Y-\mu)^2\big] + 2(\mu-a)\cdot 0 + (\mu-a)^2$$
$$= \operatorname{Var}(Y) + (\mu-a)^2,$$
which is minimized when $a = \mu = E[Y]$, so the mean is the best constant predictor.

4.1 Special Case

Identification Suppose that $Y \sim N(\theta_0, \sigma^2)$ with $\sigma^2$ known. Is $\theta_0$ identified? Let
$$\varphi(y;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y-\theta)^2}{2\sigma^2}\right)$$
denote the PDF. Now, let
$$L(\theta) = \log \varphi(Y;\theta) = C - \frac{(Y-\theta)^2}{2\sigma^2},$$
where $C$ denotes a generic constant. If we know the exact distribution $F$ of $Y$, we can calculate the expected value of $L(\theta)$:
$$E_F[L(\theta)] = E_F\!\left[C - \frac{(Y-\theta)^2}{2\sigma^2}\right] = C - \frac{E_F\big[(Y-\theta)^2\big]}{2\sigma^2}.$$
But
$$E_F\big[(Y-\theta)^2\big] = E_F\big[((Y-\theta_0)+(\theta_0-\theta))^2\big] = E_F\big[(Y-\theta_0)^2\big] + 2(\theta_0-\theta)\,E_F[Y-\theta_0] + (\theta_0-\theta)^2 = E_F\big[(Y-\theta_0)^2\big] + (\theta_0-\theta)^2,$$
because $E_F[Y-\theta_0] = 0$. Therefore,
$$E_F[L(\theta)] = C - \frac{E_F\big[(Y-\theta_0)^2\big] + (\theta_0-\theta)^2}{2\sigma^2}$$
is maximized at $\theta = \theta_0$. It follows that $\theta_0$ is identified.

Estimation Now suppose that $Y_1,\ldots,Y_n$ are iid $N(\theta_0,\sigma^2)$ with $\sigma^2$ known. We can estimate $\theta_0$ by
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i.$$
Because
$$E[\bar{Y}] = \frac{1}{n}\sum_{i=1}^n E[Y_i] = \theta_0,$$
it is an unbiased estimator of $\theta_0$.

Confidence Interval We need two properties of the normal distribution:

Lemma 1 If $Y \sim N(\mu,\sigma^2)$, then $\dfrac{Y-\mu}{\sigma} \sim N(0,1)$.

Lemma 2 If $Y \sim N(0,1)$, then $\Pr[|Y| \leq 1.96] = 95\%$.

Corollary 1 If $Y \sim N(\mu,\sigma^2)$, then $\Pr\!\left[\left|\dfrac{Y-\mu}{\sigma}\right| \leq 1.96\right] = 95\%$.

Now, note that $\bar{Y}$ is a linear combination of $Y_1,\ldots,Y_n$ iid $N(\theta_0,\sigma^2)$, so we have
$$\bar{Y} \sim N\!\left(\theta_0, \frac{\sigma^2}{n}\right).$$
It follows that
$$\Pr\!\left[\left|\frac{\bar{Y}-\theta_0}{\sigma/\sqrt{n}}\right| \leq 1.96\right] = 95\%.$$
But
$$\Pr\!\left[\left|\frac{\bar{Y}-\theta_0}{\sigma/\sqrt{n}}\right| \leq 1.96\right] = \Pr\!\left[-1.96 \leq \frac{\bar{Y}-\theta_0}{\sigma/\sqrt{n}} \leq 1.96\right] = \Pr\!\left[-1.96\frac{\sigma}{\sqrt{n}} \leq \bar{Y}-\theta_0 \leq 1.96\frac{\sigma}{\sqrt{n}}\right]$$
$$= \Pr\!\left[\bar{Y} - 1.96\frac{\sigma}{\sqrt{n}} \leq \theta_0 \leq \bar{Y} + 1.96\frac{\sigma}{\sqrt{n}}\right] = \Pr\!\left[\theta_0 \in \left[\bar{Y} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{Y} + 1.96\frac{\sigma}{\sqrt{n}}\right]\right].$$
We conclude that
$$\Pr\!\left[\theta_0 \in \left[\bar{Y} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{Y} + 1.96\frac{\sigma}{\sqrt{n}}\right]\right] = 95\%,$$
and $\left[\bar{Y} - 1.96\,\sigma/\sqrt{n},\; \bar{Y} + 1.96\,\sigma/\sqrt{n}\right]$ is the 95% confidence interval for $\theta_0$.

5 Digression

5.1 Conditional Expectation

Our objective is to find some function $\mu(X)$ such that
$$\mu(X) = \operatorname*{argmin}_{g(\cdot)}\; E\big[(Y-g(X))^2\big].$$
Let $\mu(X) \equiv E[Y|X]$ and note that
$$(Y-g(X))^2 = \big((Y-\mu(X)) + (\mu(X)-g(X))\big)^2 = (Y-\mu(X))^2 + 2(\mu(X)-g(X))(Y-\mu(X)) + (\mu(X)-g(X))^2,$$
and therefore,
$$E\big[(Y-g(X))^2 \mid X\big] = E\big[(Y-\mu(X))^2 \mid X\big] + E\big[2(\mu(X)-g(X))(Y-\mu(X)) \mid X\big] + E\big[(\mu(X)-g(X))^2 \mid X\big].$$
But
$$E\big[2(\mu(X)-g(X))(Y-\mu(X)) \mid X\big] = 2(\mu(X)-g(X))\, E[Y-\mu(X) \mid X] = 2(\mu(X)-g(X))\big(E[Y|X] - E[\mu(X)|X]\big)$$
$$= 2(\mu(X)-g(X))\big(E[Y|X] - \mu(X)\big) = 2(\mu(X)-g(X))\cdot 0 = 0,$$
and
$$E\big[(\mu(X)-g(X))^2 \mid X\big] = (\mu(X)-g(X))^2.$$
It follows that
$$E\big[(Y-g(X))^2 \mid X\big] = E\big[(Y-\mu(X))^2 \mid X\big] + (\mu(X)-g(X))^2,$$
and therefore
$$E\big[(Y-g(X))^2\big] = E\big[(Y-\mu(X))^2\big] + E\big[(\mu(X)-g(X))^2\big],$$
which is minimized when $(\mu(X)-g(X))^2 = 0$, i.e., $g(X) = \mu(X)$. Therefore, the best predictor is $\mu(X) = E[Y|X]$.

5.2 Best Linear Predictor

Predict $Y$ using linear functions of $X$.
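Before working through the best linear predictor analytically, here is a small MATLAB sketch (MATLAB being the language used in the problem sets) that previews the result derived below: the least-squares fit of Y on (1, X) reproduces, in sample form, the coefficients beta = Cov(X,Y)/Var(X) and alpha = E[Y] - beta*E[X]. The data-generating process and all numerical values are arbitrary illustrations, not part of the notes.

% Sample analogues of the best linear predictor coefficients (derived below).
rng(1);                                 % fix the seed for reproducibility
n = 100000;
X = randn(n,1);
Y = 2 + 3*X + 0.5*X.^2 + randn(n,1);    % a deliberately nonlinear relationship

C     = cov(X,Y);                       % 2x2 sample covariance matrix
b_blp = C(1,2)/var(X);                  % sample analogue of Cov(X,Y)/Var(X)
a_blp = mean(Y) - b_blp*mean(X);        % sample analogue of E[Y] - beta*E[X]

ab_ls = [ones(n,1), X]\Y;               % least-squares fit of Y on (1, X)
disp([a_blp, b_blp; ab_ls'])            % the two rows coincide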
Find ( ; ) and argmin E (Y such that (a + bX))2 a;b The minimization can be solved in two steps. Step 1 We …x b, and …nd (b) that solves (b) argmin E (Y a 5 (a + bX))2 g (X))2 ; Step 2 We …nd that solves ( (b) + bX))2 argmin E (Y b and recognize that = ( ). It is not di¢ cult to see that (b) = E [Y bX] = E [Y ] bE [X] = b Y X: Therefore, Step 2 optimization can be rewritten as min E (Y (( b b Y X) + bX))2 = min E b where Ye Now, let Y h i e Ye E X h i e2 E X b0 and note that E We therefore have E Ye e bX 2 =E =E =E Ye Ye Ye h Ye e b0 X X + 2 (b0 e b) X 2 + (b0 E [X] 2 b) E 2 b) h h Ye i 2 e E X ; e b0 X which is minimized when b = b0 . Therefore, we have = b0 = Cov (X; Y ) ; Var (X) and = ( )= Cov (X; Y ) Var (X) Y 6 2 i e =0 X e b0 X 2 e bX Cov (X; Y ) Var (X) = e + (b0 b0 X e b0 X e X E [Y ] ; Ye X: i e + (b0 X h i e2 b)2 E X Lecture Note 2: Computational Issues of OLS 6 Some Notation Data For each individual i, we observe (yi ; xi1 ; : : : ; xik ). We observe n individuals. Objective We want to “predict”y by xs: yi = b1 xi1 + + bk xik + ri : i = 1; : : : ; n Here, ri denotes the residual. Vector Writing 0 1 b1 B C b = @ ... A ; k 1 bk 0 1 xi1 B C xi = @ ... A ; k 1 xik we may compactly write yi = x0i b + ri : i = 1; : : : ; n Matrix Writing 0 1 x01 B C X = @ ... A ; n k x0n 0 1 y1 B C y = @ ... A ; n 1 yn we may more compactly write 0 1 r1 B C r = @ ... A : n 1 rn y = X b + r : n 1 7 n kk 1 n 1 n 1 OLS It seems natural to seek b that solves min b n X (yi (b1 xi1 + + bk xik ))2 = min b i=1 n X 7 2 x0i b) i=1 = min (y b (yi Xb)0 (y Xb) Basic Matrix Calculus De…nition 1 For a real valued function f : t = (t1 ; : : : ; tn ) ! f (t), we de…ne 2 @f 3 @t1 @f 7 6 = 4 ... 5 ; @t @f @f = @t0 @f @f ;:::; @t1 @tn : @tn Lemma 3 Let f (t) = a0 t. Then, @f (t)/ @t = a. Lemma 4 Let f (t) = t0 At, where A is symmetric. Then, @f (t)/ @t = 2At. Back to OLS Let S (b) Xb)0 (y (y Xb) = y 0 y b0 X 0 y y 0 Xb + b0 X 0 Xb 0 = y0y 2 (X 0 y) b + b0 X 0 Xb A necessary condition for minimum is @S (b) =0 @b Now using the two lemmas above,weobtain @S (b) = @b 2X 0 y + 2X 0 Xb from which we obtain Xb)0 (y argmin (y Xb) = (X 0 X) 1 X 0y b Theorem 1 Let e y X (X 0 X) 1 X 0 y. Then,X 0 e = 0. Proof. X 0e = X 0y Remark 1 Let b (X 0 X) 1 X 0 X (X 0 X) X 0 y. We then have e 1 X 0y = 0 y X b. Corollary 2 If the …rst column of X consists of ones, we have Theorem 2 (X 0 X) 1 X 0 y = argminb (y Xb)0 (y 8 Xb). P i ei = 0. Proof. Write y Xb+ Xb Xb = y =y X (X 0 X) =e X b Therefore, we have Xb) = e = e0 e 0 b X b 0 b = e0 e + X b e0 e This gives a proof as to why b minimizes (y 8 e b X 0e b X 0y b X b b =0 e0 X b Xb)0 (y 1 b and note that (y Xb b X b e0 X b 0 b X b Xb)0 (y 0 b + b b X 0X b b Xb). Digression: Restricted Least Squares Theorem 3 The solution to the problem min (y b Xb)0 (y is given by b where Here, R is m b (X 0 X) 1 Xb) s:t: h R0 R (X 0 X) b = (X 0 X) 1 1 Rb = q R0 i 1 Rb X 0 y: q ; k matrix. Both R and q are known. We assume that R has the full row rank. Proof. It is not di¢ cult to see that Rb = R b = Rb = q: R (X 0 X) Rb q 1 h R0 R (X 0 X) Suppose that some b satis…es Rb = q. Let d Rd = R (b b b ) = Rb 9 1 R0 i 1 Rb q b . 
We should then have Rb = q q = 0: We now show that e0 Xd = 0; where e y Xb : Note that X 0 e = X 0 (y Xb ) = X 0 e X 0X b =0 b X b b = X 0e h 1 b = X 0X (X 0 X) R0 R (X 0 X) h i 1 1 1 = X 0 X (X 0 X) R0 R (X 0 X) R0 Rb q h i 1 1 = R0 R (X 0 X) R0 Rb q : = X 0X b b X 0X b 1 It therefore follows that h i 1 h 1 d0 X 0 e = d0 R0 R (X 0 X) R0 R b q = (Rd)0 R (X 0 X) h i 1 1 = 00 R (X 0 X) R0 Rb q = 0 R0 1 i R0 1 i and 0 e0 Xd = (d0 X 0 e ) = 0: Therefore, we have (y Xb)0 (y Xb) = (e X (b = (e 0 Xd) (e = e0 e d0 X 0 e b ))0 (e X (b b )) Xd) e0 Xd + d0 X 0 Xd = e0 e + d0 X 0 Xd e0 e and the minimum is achieved if and only if d = 0 or b = b . Theorem 4 Let e y Then, e0 e = e0 e + R b q 0 h Xb : R (X 0 X) 10 1 R0 i 1 Rb q : Rb 1 Rb q q Proof. It is immediate from h i 1 R R (X X) R Rb h i 1 1 1 Rb = y X b + X (X 0 X) R0 R (X 0 X) R0 h i 1 1 1 = e + X (X 0 X) R0 R (X 0 X) R0 Rb q e =y and X b 0 (X X) 1 0 0 1 0 q q X 0 e = 0: 9 Projection Algebra Let X (X 0 X) P 1 X 0; M In P Theorem 5 P and M are symmetric and idempotent, i.e., P 0 = P; M 0 = M; P 2 = P; M 2 = M: Proof. Symmetry is immediate. P 2 = X (X 0 X) M 2 = (In = In 1 X 0 X (X 0 X) P ) (In P 1 X = X (X 0 X) P ) = In P + P = In P 1 X 0 = P; P + PP P = M: Theorem 6 P X = X; M X = 0: Proof. P X = X (X 0 X) M X = (In P)X = X 1 X 0 X = X; PX = X X = 0: Theorem 7 X b = P y; e = M y: Note that X b is the “predicted value” of y given X and the estimator b. 11 Proof. X b = X (X 0 X) X 0 y = P y: e = y X b = y P y = [In P ] y = M y: 1 Theorem 8 e0 e = y 0 M y: Proof. e0 e = (M y)0 M y = y 0 M 0 M y = y 0 M M y = y 0 M y: Theorem 9 (Analysis of Variance) 0 y 0 y = X b X b + e0 e: Proof. 0 y 0 y = y 0 (P + M ) y = y 0 P y + y 0 M y = y 0 P 0 P y + e0 e = (P y)0 P y + e0 e = X b X b + e0 e: 10 Problem Set 1. It can be shown that trace (AB) = trace (BA). Using such property of the trace operator, prove that trace (P ) = k if X has k columns. (Hint: trace (P ) = trace X (X 0 X) 1 X 0 = trace X (X 0 X) = trace (X 0 X) 1 1 X0 X 0 X = trace (X 0 X) 1 X 0 X = trace (I) : What is the dimension of the identity matrix?) 2. This question is taken from Goldberger. Let 2 3 1 2 6 1 4 7 6 7 6 7 X = 6 1 3 7; 6 7 4 1 5 5 1 2 Using Matlab, calculate the following: X 0 X; (X 0 X) P = X (X 0 X) 1 1 ; (X 0 X) X 0; P y; 12 2 6 6 6 y=6 6 4 1 14 17 8 16 3 3 7 7 7 7: 7 5 X 0; (X 0 X) M =I P; 1 X 0 y; e = M y: 11 Application of Projection Algebra: Partitioned Regression X = X1 n k1 n k Note that k1 + k2 = k. Partition b accordingly: Note that b1 is (k1 1), and b2 is (k2 b= b1 b2 X2 n k2 ! 1). Thus we have y = X1 b1 + X2 b1 + e We characterize of b1 as resulting from two step regression: Theorem 10 We have where 11.1 ye Discussion b1 = X e10 X e1 e1 X M2 y; 1 e10 ye; X M2 X1 : Theorem 11 Write m = k1 , and X1 = [x1 ; : : : ; xm ] ; X2 = [xm+1 ; : : : ; xk ] : Then, ye = M2 y is the residual when y is regressed on X2 , and where e1 = M2 X1 = M2 [x1 ; : : : ; xm ] = [M2 x1 ; : : : ; M2 xm ] ; X M2 x1 is the residual when x1 is regressed on X2 . We can thus summarize the characterization of b1 in the following algorithm: Regress y on xm+1 ; : : : ; xk . Get the residual, and call it ye. (e y captures that portion of y not correlated with X2 , i.e., ye is y partialled out with respect to X2 ) 13 e1 . (e Regress x1 on xm+1 ; : : : ; xk . Get the residual, and call it x x1 captures that portion of x1 not correlated with X2 ) e2 ; : : : ; x em . Repeat second step for x2 ; : : : ; xm . Obtain x e1 ; : : : ; x em . The coe¢ cient estimate is numerically equal to b1 . 
Regress ye on x You may wonder why we need to do the regression in two steps when it can be done in one single step. The reason is mainly computational. Notice that, in order to compute b, a k k matrix X 0 X has to be inverted. If the number of regressors k is big, then your computer cannot implement this matrix inversion. This point will be useful in the application to the panel data. 11.2 Proof Lemma 5 Let M2 I 1 X2 (X20 X2 ) X20 : Then, M2 e = e: Proof. Observe that X20 e = 0 * X 0e = X10 X20 e= X10 e X20 e = 0: Therefore, M2 e = In e X2 (X20 X2 ) 1 X20 e = e: Lemma 6 Proof. Premultiply by M2 , and obtain M2 y = M2 X1 b1 + e: y = X1 b1 + X2 b1 + e M2 y = M2 X1 b1 + M2 X2 b1 + M2 e = M2 X1 b1 + e; where the second equality follows from the projection algebra and the previous lemma. Lemma 7 X10 M2 y = X10 M2 X1 b1 : 14 Proof. Premultiply M2 y = M2 X1 b1 + e by X10 , and obtain X10 M2 y = X10 M2 X1 b1 + X10 e = X 0 M2 X1 b1 ; 1 where the second equality follows from the normal equation. 12 R2 Let X1 = [x1 ; : : : ; xm ] ; X2 = `: Then, Remark 2 Note that ye = y e 1 = [x1 X y `; X2 (X20 X2 ) 1 x1 `; : : : ; xm 1 0 X20 = ` (`0 `) ` = ` (n) xm `] : 1 0 ` = 1 0 `` n and therefore X2 (X20 X2 ) 1 X20 y = 1 0 `` y = ` n 1 0 `y n = `y Because e 1 b1 + e; ye = X we obtain e 1 b1 + e ye0 ye = X 0 e 0 e = 0; and X 1 0 b0 e 0 e 1 b1 + e = b0 X e0 e b X 1 1 X1 1 + 2 1 X1 e + e e 0 e0 X e b = b10 X 1 1 1 + e e: Here, e0 e denotes the portation of ye0 ye “unexplained” by the variation of X1 (from its own sample average). The smaller it is relative to ye0 ye, the more of ye0 ye is explained by X1 . From this intuitive idea, we develop e0 e R2 = 1 ye0 ye as a measure of goodness of …t. Theorem 12 Suppose that X = [X1 ; X2 ] : Then, we have min (y b1 X1 b1 )0 (y X1 b1 ) 15 min (y b Xb)0 (y Xb) : Proof. We can write min (y b1 X1 b1 )0 (y Xb)0 (y X1 b1 ) = min (y b Xb) s:t: b2 = 0: The latter is bounded below by min (y b Xb)0 (y Xb) Corollary 3 1 13 minb1 (y Problem Set X1 b1 )0 (y ye0 ye X1 b1 ) 1 minb (y Xb)0 (y ye0 ye Xb) : 1. You are given 0 B B B B B y=B B B B @ 1 3 5 3 7 9 4 1 C C C C C C; C C C A 2 6 6 6 6 6 X1 = 6 6 6 6 4 X = [X1 ; X2 ] ; 1 1 1 1 1 1 1 3 5 3 2 8 4 7 3 7 7 7 7 7 7; 7 7 7 5 2 6 6 6 6 6 X2 = 6 6 6 6 4 2 4 6 1 7 3 1 3 7 7 7 7 7 7: 7 7 7 5 (a) Using Matlab, compute b = (X 0 X) 1 X 0 y; 0 e=y X b; e0 e; R2 = 1 e0 e= X (yi y)2 : i (b) Write b = b1 ; b2 ; b3 . Try computing b1 ; b2 by using the partitioned regression technique: First, using Matlab, compute e 1 = M2 X1 ; X ye = M2 y where M2 = I X2 (X02 X2 ) 1 X02 . (In Matlab, you would have to specify the dimension of the identity matrix. What is the dimension?) Second, using Matlab again, compute 1 e0 X e1 e 0 ye X X 1 1 Is the result of your two step calculation equal to the …rst two component of b that you computed before? 16 (c) Using Matlab, compute b Xb)0 (y argmin (y s.t. Rb = q; Xb) b where R (1; 1; 1) and q = [1]. (d) Using Matlab, compute Xb )0 (y e0 e = (y Xb ) : (e) Using Matlab, compute Rb Is it equal to e0 e e0 e? q 0 h 0 R (X X) 1 R 0 i 1 Rb q : 2. You are given 0 B B B B B y=B B B B @ 1 3 5 3 7 9 4 1 C C C C C C; C C C A 2 X = [X1 ; X2 ] ; 6 6 6 6 6 X1 = 6 6 6 6 4 1 1 1 1 1 1 1 3 5 3 2 8 4 7 3 7 7 7 7 7 7; 7 7 7 5 2 6 6 6 6 6 X2 = 6 6 6 6 4 2 4 6 1 7 3 1 3 7 7 7 7 7 7: 7 7 7 5 Using Matlab, compute b = (X 0 X) Now, compute M2 = I X2 (X02 X2 ) 1 X02 ; 1 X 0 y; e 1 = M2 X1 ; X e=y ye = M2 y; X b: e0 X e b1 = X 1 1 1 e 0 ye; X 1 e 1b : e = ye X 1 Verify that b1 is numerically identical to the …rst two components of b. 
Also verify that e = e. 17 Lecture Note 3: Stochastic Properties of OLS 14 Note on Variances For scalar random variable Z: Var (Z) E (Z E [Z])2 = E Z 2 (E [Z])2 For random vector Z = (Z1 ; Z2 ; : : : ; Zk )0 ; k 1 we have a variance - covariance matrix: Var (Z) E (Z E [Z]) (Z E [Z])0 2 E [(Z1 E [Z1 ]) (Z1 E [Z1 ])] E [(Z1 6 E [(Z2 E [Z2 ]) (Z1 E [Z1 ])] E [(Z2 6 =6 .. 4 . 2 E [(Zk E [Zk ]) (Z1 E [(Z1 E [(Z2 E [Z1 ]) (Zk E [Z2 ]) (Zk .. . E [Z2 ])] E [(Zk E [Zk ]) (Zk E [(Z1 E [Z1 ]) (Z2 E [Z2 ])] E (Z2 E [Z2 ])2 .. . E [(Z1 E [(Z2 E [Z1 ]) (Zk E [Z2 ]) (Zk .. . E [Z1 ])] E [(Zk 2 E (Z1 E [Z1 ]) 6 E [(Z2 E [Z2 ]) (Z1 E [Z1 ])] 6 =6 .. 4 . E [(Zk E [Zk ]) (Z1 E [Z1 ])] 2 Var (Z1 ) Cov (Z1 ; Z2 ) 6 Cov (Z2 ; Z1 ) Var (Z2 ) 6 =6 .. .. ... 4 . . It is useful to note that 15 15.1 E [Z2 ])] E [Z2 ])] E [Zk ]) (Z2 E [(Zk Cov (Zk ; Z1 ) Cov (Zk ; Z2 ) E (Z E [Z1 ]) (Z2 E [Z2 ]) (Z2 .. . E [Z]) (Z E [Zk ]) (Z2 3 Cov (Z1 ; Zk ) Cov (Z2 ; Zk ) 7 7 7 .. 5 . .. .. . . E [Z2 ])] Var (Zk ) E [Z])0 = E [ZZ 0 ] E [Z] E [Z]0 Classical Linear Regression Model I Model y =X +" X is a nonstochastic matrix 18 E (Zk 3 E [Zk ])] E [Zk ])] 7 7 7 5 E [Zk ])] 3 E [Zk ])] E [Zk ])] 7 7 7 5 E [Zk ])2 X has a full column rank (Columns of X are linearly independent.) E ["] = 0 E [""0 ] = 15.2 2 2 In for some unknown positive number Discussion We will later discuss the case where X is stochastic such that (i) y = X + "; (ii) X has a full column rank; (iii) E ["j X] = 0; (iv) E [""0 j X] = 2 In . The third assumptions amounts to the identi…ability condition on . Suppose X does not have a full column rank. Then, by de…nition, we can …nd some 6= 0 such that X = 0. Now, even if " = 0, we would have y =X =X +X =X( + ) so that we would not be able to di¤erentiate from + from the data E ["] = 0 is a harmless assumption if we believe that E ["i ] are the same regardless of i. Write this common value as 0 . Then, we may rewrite the model as yi = 0 xi0 + xi1 + 1 + k xik + ui for u i = "i 0 and xi0 = 1: E [""0 ] = 2 In consists of two parts. First, it says that all the diagonal elements, i.e., variances of the error terms are equal. This is called the homoscedasticity. Second, it says all the o¤-diagonal elements are zeros. (This will not be satis…ed in the time series environment.) 15.3 Properties of OLS Lemma 8 b = + (X 0 X) 1 X 0" Proof. b = (X 0 X) = (X 0 X) 1 X 0 y = (X 0 X) 1 1 X 0 X + (X 0 X) 19 X 0 (X + ") 1 X 0" = + (X 0 X) 1 X 0 ": h i Theorem 13 Under the Classical Linear Regression Model I, we have E b = 2 (X 0 X) 1 Proof. h i E b = and + (X 0 X) Var b = E 1 h X 0 E ["] = b 1 0 = E (X X) 16 1 + (X 0 X) b 0 X 00 = 0 0 1 0 X "" X (X X) i = (X 0 X) 1 X 0 E [""0 ] X (X 0 X) 1 = (X 0 X) 1 X0 1 2 and Var b = I X (X 0 X) = 2 (X 0 X) 1 X 0 X (X 0 X) = 2 (X 0 X) 1 : 1 Estimation of Variance Lemma 9 e = M" Proof. e = M y = M (X + ") = M X + M " = M " Lemma 10 trace (M ) = n k; trace (P ) = k: Proof. trace (P ) = trace X (X 0 X) 1 X 0 = trace (X 0 X) = k; trace (M ) = trace (In =n P ) = trace (In ) k: 20 trace (P ) 1 X 0 X = trace (Ik ) Theorem 14 Let s2 = e0 e n k : Then, E s2 = 2 Proof. Observe that e0 e = "0 M 0 M " = "0 M " = trace ("0 M ") = trace (M ""0 ) ; and hence, E [e0 e] = E [trace (M ""0 )] = trace (M E [""0 ]) = trace M = (n 2 k) Corollary 4 In other words, s2 (X 0 X) 17 1 2 I = 2 trace (M ) : h i 1 E s2 (X 0 X) = Var b is an unbiased estimator of Var b . 
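To make the formulas of this lecture note concrete, the following MATLAB sketch computes the OLS estimator, the residuals, s^2 = e'e/(n-k), and the estimated variance matrix s^2 (X'X)^(-1) on simulated data, and checks trace(P) = k, trace(M) = n - k, and the normal equations X'e = 0. The design, sample size, and parameter values are arbitrary assumptions for illustration only.

% OLS, residuals, s^2, and the unbiased variance estimator on simulated data.
rng(2);
n = 200;  k = 3;  sigma = 1.5;
beta = [1; 0.5; -2];
X = [ones(n,1), randn(n,2)];            % first column of ones
y = X*beta + sigma*randn(n,1);

b  = (X'*X)\(X'*y);                     % OLS estimator
e  = y - X*b;                           % residuals
s2 = (e'*e)/(n - k);                    % unbiased estimator of sigma^2
Vb = s2*inv(X'*X);                      % estimated Var(b-hat)

P = X*((X'*X)\X');                      % projection matrix
M = eye(n) - P;
disp([trace(P), trace(M)])              % k and n - k
disp(max(abs(X'*e)))                    % normal equations: X'e = 0 up to rounding
disp(diag(Vb)')                         % estimated variances of the components of b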
Basic Asymptotic Theory: Convergence in Probability Remark 3 A sequence of nonstochastic real numbers an converges to a, if for any > 0, there exists N = N ( ) such that jan aj < for all n N . De…nition 2 A sequence of random variables fzn g converges in probabilityto c, a deterministic number, if lim Pr [jzn cj ]=0 n!1 for any > 0. We sometimes write it as plimn!1 zn = c. For a sequence of random matrices An , we have the convergence in probability to a deterministic matrix A if every element of An convergences in probability to the corresponding element of A. Remark 4 If an is a sequence of nonstochastic real numbers converging to a, and if g ( ) is continuous at a, we have limn!1 g (an ) = g (a). Theorem 15 (Slutzky) Suppose that plimn!1 zn = c. Also suppose that g ( ) is continuous at c. We then have plimn!1 g (zn ) = g (c). 21 Corollary 5 If plim z1n = c1 ; plim z2n = c2 ; n!1 n!1 then plim (z1n + z2n ) = c1 + c2 ; plim z1n z2n = c1 c2 ; n!1 n!1 and if c2 6= 0, z1n c1 = : c2 n!1 z2n plim Theorem 16 (LLN) Given a sequence fzi g of i.i.d. random variables such that E [jzi j] < 1, P we have plimn!1 z n = E [zi ], where z n = n1 ni=1 zi . Corollary 6 Given a sequence fzi g of i.i.d. random variables such that E [jg (zi )j] < 1, we have n 1X plim g (zi ) = E [g (zi )] : n!1 n i=1 18 18.1 Classical Linear Regression Model III Model yi = x0i + "i (x0i ; "i ) i = 1; 2; : : : is i.i.d. xi ?"i E ["i ] = 0, Var ("i ) = 2 E [xi x0i ] is positive de…nite. Furthermore, all of its elements are …nite. 18.2 Some Auxiliary Lemmas Lemma 11 1X xi x0i n i=1 n b= Proof. b = (X 0 X) 1 X 0 y = (X 0 X) = + 1 + ! 1 ! n 1X x i "i : n i=1 1 X 0 (X + ") = + (X 0 X) X 0 " ! 1 ! ! n n n X X X 1 xi x0i x i "i = + xi x0i n i=1 i=1 i=1 22 1 ! n 1X x i "i : n i=1 Lemma 12 s2 = 1 n k Proof. 1 2 s = n k = n X 2 n X e2i = i=1 k n X n "2i k n X Lemma 14 plimn!1 1 n Proof. n X 2 "i x0i i=1 i=1 n Pn i=1 Pn i=1 k ! x0i b yi 1 i=1 1 n n X "i x0i i=1 b + b 1 2 = ! n k b 1 0 n n X k i=1 1 0 n 1 n Pn i=1 k xi x0i = E [xi x0i ] x i "i = 0 n "2i = 2 1X 2 plim "i = E "2i = n!1 n i=1 n 2 : Large Sample Property of OLS De…nition 3 An estimator b is consistent for Theorem 17 b is consistent for . Proof. 2 if plimn!1 b = . ! n X 1 plim b = plim 4 + xi x0i n n!1 n!1 i=1 ! 1 n X 1 xi x0i = + plim n!1 n i=1 = + E [xi x0i ] 1 0 = : 23 1 xi x0i i=1 x0i b x0i + "i + b n X 1X plim xi "i = E [xi "i ] = E [xi ] E ["i ] = 0 n!1 n i=1 Lemma 15 plimn!1 18.3 k 1 Lemma 13 plimn!1 Proof. n i=1 1 n "2i 1 !3 n X 1 x i "i 5 n i=1 ! n X 1 plim x i "i n!1 n i=1 ! b : 2 n X i=1 xi x0i ! b Theorem 18 plimn!1 s2 = 2 Proof. plim n!1 n 1X 2 s = plim "i n n!1 n i=1 k n 2 1X 0 "i x i n i=1 n plim 2 n!1 n!1 = 2 1X 0 "i x i n i=1 n 2 plim n!1 0 + plim b n!1 = 19 b 1X xi x0i n i=1 0 + plim b ! n ! plim n!1 2 ! b plim b n!1 ! n 1X xi x0i plim b n i=1 n!1 Large Sample Property of OLS with Model I? Lemma 16 If a sequence of random variables fzn g is such that E (zn plimn!1 zn = c. c)2 ! 0, then Proof. Let > 0 be given. Note that 2 E (zn c) = Z c)2 dFn (z) (z where Fn denotes the CDF of zn . But Z Z Z 2 2 (z c) dFn (z) = (z c) dFn (z) + (z c)2 dFn (z) jz cj jz cj< Z (z c)2 dFn (z) jz cj Z Z 2 2 dFn (z) = dFn (z) = 2 Pr [jz cj jz cj jz cj It follows that Pr [jz cj ] c)2 E (zn 2 24 !0 ] h We i will now use the fact that, under the Classical Linear Regression Model I, we have b E = and Var b = 2 (X 0 X) 1 . We …rst note that E b 2 0 b =E = E trace b = trace E 2 b b 0 1 b b 0 = trace Var b = 2 0 trace @ i=1 n 1X xi x0i n i=1 1 n ! 11 0 n 1X @ trace xi x0i n i=1 ! 
11 A for all n, we can conclude that from which we can conclude that 1 n 0 n X i=1 xi x0i ! 11 n X 1 A= xi x0i trace @ n n i=1 Therefore, IF we assume that E 0 b 1 (X 0 X) = 2 trace (X 0 X) 0 0 ! 11 n X A = 2 trace @ n = 2 trace @ xi x0i = trace b = E trace 2 A ! 11 A B<1 2 b !0 plim b = n!1 P The only question is how we can make sure that trace n1 ni=1 xi x0i textbooks adopt di¤erent assumptions to ensure this property. 25 1 B < 1. Di¤erent Lecture Note 4: Statistical Inference with Normality Assumption 20 Review of Multivariate Normal Distribution De…nition 4 A square matrix A is called positive de…nite if t0 At > 0 8t 6= 0: Theorem 19 A symmetric positive de…nite matrix is nonsingular. De…nition 5 (Multivariate Normal Distribution) An n-dimensional random (column) vector Z has a multivariate normal distribution if its joint pdf equals (2 )n=2 for some 1 s det and positive de…nite Z Theorem 20 Let L be an m 0 z n 1 1 (z ) n 1 n n . We then write Z N( ; ) Z 1 2 exp ) N ( ; ). It can be shown that E [Z] = ; Var (Z) = : n deterministic matrix. Then, N( ; ) ) LZ Z= Z1 Z2 N (L ; L L0 ) : Theorem 21 Assume that has a multivariate normal distribution. Assume that E [(Z1 and Z2 are independent. 1 )(Z2 0 2) ] = 0. Then, Z1 Theorem 22 Suppose that Z1 ; : : : ; Zn are i.i.d. N (0; 1). Then, Z = (Z1 ; : : : ; Zn )0 = N (0; In ). Theorem 23 Z N (0; In ) ) Z 0Z 2 (n) : n is called the degrees of freedom. Theorem 24 Suppose that an n 2 Z 0 1Z (n) n matrix is positive de…nite, and Z N (0; ). Then, Theorem 25 Suppose that an n n matrix A is symmetric and idempotent, and Z 2 Then, Z 0 AZ (trace (A)) 26 N (0; In ). 21 Classical Linear Regression Model II In addition to the assumptions of the classical linear regression model I, we now assume that " has a multivariate normal distribution 21.1 Sampling Property of OLS Estimator Theorem 26 Proof. Recall that b N h i E b = ; 2 ; 1 (X 0 X) Var b = 2 : (X 0 X) 1 even without the normality assumption. It thus su¢ ces to establish normality. Note that b = (X 0 X) 1 X 0 "; a linear combination of the multivariate normal vector ". Thus, b normal distribution. De…nition 6 For simplicity of notation, write V 22 2 (X 0 X) 1 b V ; s2 (X 0 X) , and hence b has a 1 Con…dence Interval: Known Variance For a given k 1 vector r, we are interested in the inference on r0 . Speci…cally, we want to construct a 95% con…dence interval. Theorem 27 (95% Con…dence Interval) The random interval p p r0 b 1:96 r0 Vr; r0 b + 1:96 r0 Vr contains r0 with 95% probability: h p Pr r0 b 1:96 r0 Vr i p r0 b + 1:96 r0 Vr = :95 r0 Proof. Because r0 b we have and Pr " r0 b p N (r0 ; r0 Vr) ; r0 r0 Vr r0 b p N (0; 1) # r0 r0 Vr 1:96 = :95 27 Corollary 7 Write = ( 1; : : : ; b = b1 ; : : : ; bk 0 k) ; Note that bj = a0 b; j 0 : = a0j ; j where aj is the k-dimensional vector whose jth component equals 1 and the remaining elements all equal 0. The preceding theorem implies that bj p 1:96 Vjj ; p bj + 1:96 Vjj ; where Vjj is the (j; j)-element of V, is a valid 95% con…dence interval for j . Noting that Vjj is the variance of bj , we come back to the undergraduate con…dence interval: p estimator 1:96 variance! 
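A minimal MATLAB sketch of the known-variance interval of Theorem 27, r'b plus or minus 1.96*sqrt(r'Vr) with V = sigma^2 (X'X)^(-1), on simulated data; sigma^2 is treated as known, and every numerical choice below is an illustrative assumption.

% Known-variance 95% confidence interval for r'beta.
rng(3);
n = 100;  sigma = 2;  beta = [1; -0.7];
X = [ones(n,1), randn(n,1)];
y = X*beta + sigma*randn(n,1);

b = (X'*X)\(X'*y);
V = sigma^2*inv(X'*X);                  % known-variance case: V = sigma^2 (X'X)^{-1}
r = [0; 1];                             % inference on the slope, r'beta
ci = [r'*b - 1.96*sqrt(r'*V*r),  r'*b + 1.96*sqrt(r'*V*r)];
disp(ci)                                % contains r'beta = -0.7 with approx. 95% probability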
23 Hypothesis Test: Known Variance Theorem 28 (Single Hypothesis: 5% Signi…cance Level) Given H0 : r 0 = q vs: HA : r0 6= q; the test which rejects the null i¤ r0 b q p r0 Vr has a size equal to 5%: Pr " 1:96 # r0 b q p r0 Vr 1:96 = :05 Corollary 8 Suppose we want to test H0 : j = 0 vs: HA : j 6= 0 The preceding theorem suggests that we reject the null i¤ b pj Vjj 1:96: This con…rms our undergraduate training based procedure rejecting the null i¤ q bj variance of bj 28 1:96: Theorem 29 (Multiple Hypotheses: 5% Signi…cance Level) Given H0 : R m k =q HA : R 6= q; vs: the test which rejects the null i¤ has a size equal to 5%. Rb 0 q Rb 1 (RVR0 ) 2 :05 q (m) Proof. It follows easily from and Rb 24 Rb q 0 N (0; RVR0 ) R 1 (RVR0 ) Rb 2 q (m) Problem Set 1. (In this question, you are expected to verify the theorems discussed in the class using MATLAB. Turn in your MATLAB program along with the result.) Consider the linear model given by yi = 1 + 2 xi2 + i ; i = 1; : : : ; n; where N (0; i 2 ) i:i:d: Let b1 and b2 denote the OLS estimators of 1 and 2 . Suppose that and n = 9. Note that the above model can be compactly written as 2 6 6 X=6 4 1 1 .. . 1 2 .. . 1 9 3 7 7 7 5 and 1 = 2 (a) Suppose that xi2 = i. Show that b2 29 N 1; = 2 = 2 =1 (1) y=X + for 1 1 60 : : (b) Show that P b2 1 1:96 p 60 2 b2 + 1:96 p1 = :95 60 Let u(1) ; : : : ; u(1000) denote 1,000 independent N (0; 2 In ) random vectors. Notice that y in (1) has the same distribution as y (j) given by y (j) X + u(j) : (j) (j) Let b2 denote the OLS estimator of 2 of the above model. b2 are i.i.d. random variables which has the same distribution as b2 . Let ( (j) b(j) + 1:96 p1 1 if b2 1:96 p160 2 2 60 D(j) = 0 otherwise Show that we have E[D(j) ] = :95 Argue that 1 X D(j) 1000 j=1 1000 :95 Verify that this indeed is the case by generating 1,000 independent uj from the computer. (This type of experiment is called the Monte Carlo.) 25 Con…dence Interval: Unknown Variance Lemma 17 e0 e 2 2 Proof. Let u Then, e0 e 2 = u0 M u " 2 (n k) : N (0; In ) : (trace (M )) = Lemma 18 b and e0 e are independently distributed. 2 (n k) : Proof. It su¢ ces to prove that b and e are independently distributed. But ! b (X 0 X) 1 X 0 " (X 0 X) 1 X 0 = = " M" M e 30 has a multivariate normal distribution because it is a linear combination of ". It thus su¢ ces to prove that b and e are uncorrelated. Their covariance equals h i h i 1 1 0 0 0 0 0 b E e = E (X X) X "" M = (X 0 X) X 0 E [""0 ] M 1 = (X 0 X) Theorem 30 r0 b r0 p b r0 Vr Proof. Notice that r 0 We also have b 2 0 N 0; 0 r (X X) 1 t (n e0 e 2 2 Now observe that b (X 0 X) 1 (X 0 M ) = 0: r0 b q r0 (X 0 X) N (0; 1) : 1 r k) and e0 e are independently distributed so that It thus follows that . q 0 b r0 (X 0 X) r q e0 e (n k) 2 Because p (n 2 I M= k) ) r 2 X0 e0 e/ (n r0 b q r0 (X 0 X) 1 r =p r0 b e0 e/ (n q k) r0 (X 0 X) we obtain the desired conclusion. 1 r ? e0 e 2 : r0 b r0 b q = q 1 0 0 k) r (X X) r s r0 (X 0 X) 1 r =q r0 b r0 s2 (X 0 X) 1 r t (n 1 r r0 b = p r0 b r0 Vr Theorem 31 (95% Con…dence Interval) The random interval p p 0b 0b 0 b b ; r t:025 (n k) r Vr; r + t:025 (n k) r0 Vr where t:025 (n k) denotes the upper 2.5 percentile of the t (n with 95% probability: p h 0b b Pr r t:025 (n k) r0 Vr r0 r0 b + t:025 (n 31 k) : k) distribution, contains r0 k) p b r0 Vr i = :95 Proof. It follows easily from r0 b p Theorem 32 1 Rb m Proof. 
Observe that Rb 0 R 0 R n h R r0 b r0 Vr 1 b 0 RVR 2 1 (X 0 X) being a function of b, is independent of k) s2 (n t (n = 2 i k) : Rb R0 R o e0 e 1 F (m; n Rb 2 2 R (n 2 k) (m) ; k) : It thus follows that Rb 0 R R (X 0 X) 1 1 R0 Rb 2 m, R 1 k) s2 (n n k 1 (n 2 F (m; n k) Because Rb = Rb 0 R R (X 0 X) 1 1 R0 Rb 2 R 0 R (X 0 X) 1 1 R0 s2 0 1 1 b 0 Rb R RVR Rb m we obtain the desired conclusion. m, R Rb R R ; = n m = k 1 Rb m k) s2 2 R 0 Rs2 (X 0 X) Corollary 9 Pr 1 Rb m R 0 b 0 RVR 1 Rb Remark 5 Suppose you want to test H0 : R = q vs: R F:05 (m; n k) = :05 HA : R 6= q We would then reject the null under 5% signi…cance level i¤ 1 Rb m q 0 b 0 RVR 1 Rb 32 q F:05 (m; n k) 1 R0 1 Rb R Remark 6 Observe that the statistics in Theorems 29 and 32 are identical except that replaced by s2 in Theorem 32. Theorem 33 1 Rb m 0 q 1 b 0 RVR Rb q = 2 is e0 e)/ m (e0 e ; e0 e/ (n k) where e is the residual from the restricted least squares Xb)0 (y min (y b subject to Xb) Rb = q Proof. Note that 1 Rb m q 0 b RVR Recall that 1 0 Rb q = e e = e e + Rb 0 Therefore, 1 Rb m q 0 0 b 0 RVR Rb 1 0 q Rb h 0 q 1 1 R0 s2 0 R (X X) q = R (X 0 X) 1 (e0 e R 0 i 1 Rb e0 e)/ m s2 Rb q m q : (e0 e e0 e)/ m e0 e/ (n k) = Corollary 10 Suppose that the …rst column of X consists of 1s. You would want to test H0 : 2 = = = 0: k (What are R and q?) We then have 1 Rb m q 0 b 0 RVR 1 Rb q = R2 / (k 1) : (1 R2 )/ (n k) Proof. We need to obtain e0 e …rst. Note that the constrained least squares problem can be written as X X min (yi b1 xi1 0 xi2 0 xik )2 = min (yi b1 )2 : b1 b1 i i We know that the solution is given by c1 = y. In other words, we have X b = (y; 0; : : : ; 0) ; e0 e = (yi y)2 = ye0 ye: i Our test statistic thus equals (e y 0 ye e0 e)/ (k 1) : e0 e/ (n k) 33 Note now that e0 e ye0 ye R2 = 1 The test statistic is thus equal to R2 / (k 1) : (1 R2 )/ (n k) Remark 7 The statistic R2 / (k 1) (1 R2 )/ (n k) is the “F -statistic” reported by many popular softwares. 26 Con…dence Interval for Mean: To be read, not to be taught in class Consider the estimation of from n i.i.d. N ( ; 2 ) random variables U1 ; : : : ; Un . It has been argued before that this is a special example of the classical linear regression model: y=X + where 0 1 U1 B C y = @ ... A ; Un The m.l.e. b equals 0 1 1 B C X = @ ... A ; 1 (`0 `) 0 B =@ = ; 1 0 1 `y =n X U1 Un Ui = U i the sample average! We also know that the distribution of b equals N ; Thus, the 95% con…dence interval for or 2 (`0 `) 1 =N ; 2 n is b 1:96 p ; b + 1:96 p n n U 1:96 p ; U + 1:96 p n n 34 1 .. . 1 C A What if we are not fortunate enough to know 2 ? Here, we can make use of the fact that U P 2 and i Ui U are independent of each other. Because of this independence, we know that U r p P i n 2 (Ui = U) 2 (n 1) p U n p It thus follows that we can use U as the 95% con…dence interval for 27 t:975 (n t(n s2 1): s 1) p n in this case. Problem Set In this problem set, you are expected to read and replicate results of Mankiw, Romer, and Weil (1992, QJE ). Use MATLAB. 1. Select observations such that the “nonoil” variable is equal to 1 and discard the rest of the observation. (How many observations do you have now?) For each country i, create yi = ln (GDP per working-age person in 1985) xi1 = 1 xi2 = ln (I / GDP) xi3 = ln (growth rate of the working age population between 1960 and 1985 + g + ) assuming that g + = :05. 2. Assume that yi = xi1 where "i i:i:d: N (0; 2 1 + xi2 2 + xi3 3 + "i ). Compute the OLS estimator b = b1 ; b2 ; b3 0 for = ( 1; 2; 0 3) . Compute the sum of squared residuals e0 e. 
Compute an unbiased estimator s2 of 2 . Compute an estimator of the variance-covariance matrix of b. Compute the standard deviations of b1 , b2 , and b3 . Present 95% con…dence intervals for 1 , 2 , and 3 . 3. Because R2 monotonically increases as more regressors are added to the model, some other measure has been developed. The adjusted R2 is computed as 2 R =1 sample size 1 1 sample size number of regressors 2 Compute R2 and R . 35 R2 4. You want to estimate the OLS estimator under the restriction that possibility is to rewrite the model as yi = xi1 1 + xi2 xi3 2 2 + "i = xi1 1 + (xi2 xi3 ) 2 3 2. = One + "i and consider the OLS of yi on xi1 and xi2 xi3 . Compute the OLS estimator b1 ; b2 of ( 1 ; 2 ) this way. (Note that this trick is an attempt to compute the restricted least squares of ( 1 ; 2 ; 3 ) = ( 1 ; 2 ; 2 ) as b1 ; b2 ; b2 .) Also compute the sum of squared residuals e0 e for this restricted model. 5. The restricted least squares problem in the previous question can be written as min (y c Xc)0 (y subject to Xc) Rc = q What is R? What is q? In your class, it was argued that the solution to such problem is given by h i 1 b (X 0 X) 1 R0 R (X 0 X) 1 R0 Rb q ; where b is the OLS estimator for the unrestricted model. Is it equal to b1 ; b2 ; which you obtained in the previous question? We also learned in class that i 1 0h 1 e0 e e0 e = R b q R (X 0 X) R0 Rb q : Subtract e0 e from e0 e . Is the di¤erence equal to R b 6. You can test the restriction that 3 = 2 q 0 R (X 0 X) or r0 = q for r0 = (0; 1; 1) 1 R0 1 Rb b2 0 , q ? q=0 by computing the t-statistic q r0 b q s2 r0 (X 0 X) (2) 1 r Compute the t-statistic in (2) by MATLAB, and implement the t-test under 5% signi…cance level. 7. You can test the restriction that 3 = 2 or R = q for R = (0; 1; 1) q=0 by computing the F -statistic Rb q 0 R (X 0 X) 1 R0 1 Rb q m (3) s2 What is m in this case? Compute the F -statistic in (3) by MATLAB, and implement the F -test under 5% signi…cance level. 36 8. Compute the square of the t-statistic in (6) and compare it with the F -statistic in (7). Are they equal to each other? In fact, if R consists of a single row so that r0 = R, then the square of the t-statistic as in (2) is numerically equivalent to the F -statistic as in (3). Provide a theoretical proof to such equality. 9. It was shown that the F -statistic computed as in (3) is numerically equal to e0 e/ (sample size e0 e)/ m (e0 e number of regressors in the unrestricted model) Compute the F -statistic this way, and see if it is indeed equal to the value you computed in (h). 10. You can test the restriction in a slightly di¤erent way. Rewrite the model as yi = xi1 1 + (xi2 xi3 ) 2 + xi3 ( 3 + 2) + "i This could be understood as a regression of yi on (1; xi2 xi3 ; xi3 ). If the restriction is correct, the coe¢ cient of xi3 in this model should be equal to zero. Therefore, if the estimated coe¢ cient of xi3 is signi…cantly di¤erent from zero, we can understand it as an evidence against the restriction, and reject it. Compute the regression coe¢ cient. Compute the t-statistic. Would you accept the restriction or reject it (under 5% signi…cance level)? 11. Based on their knowledge of capital’s share in income, Mankiw, Romer and Weil (1992) entertained the multiple hypothesis that ( 2 ; 3 ) = (0:5; 0:5). This could be written as 0 1 0 0 0 1 2 4 1 2 3 3 5= 0:5 0:5 Compute the corresponding F -statistic. Would you accept the null under 5% signi…cance level? 
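The following MATLAB sketch runs through the main computations the problem set asks for, but on simulated data rather than the Mankiw-Romer-Weil data set (which is not reproduced here). It uses the single restriction r' = (0, 1, 1), q = 0 from the problem set, computes the restricted least squares estimator of Theorem 3, the F-statistic two ways, and checks that the squared t-statistic equals the F-statistic; all variable names and parameter values are illustrative assumptions.

% Restricted least squares, F-test, and the t^2 = F check on simulated data.
rng(4);
n = 100;  k = 3;
X = [ones(n,1), randn(n,2)];
beta = [1; 0.8; -0.8];                  % the restriction beta2 + beta3 = 0 holds here
y = X*beta + randn(n,1);

R = [0, 1, 1];  q = 0;  m = 1;          % H0: beta2 + beta3 = 0

b   = (X'*X)\(X'*y);                                % unrestricted OLS
e   = y - X*b;   s2 = (e'*e)/(n - k);
XXi = inv(X'*X);
bst = b - XXi*R'*((R*XXi*R')\(R*b - q));            % restricted LS (Theorem 3)
est = y - X*bst;                                    % restricted residuals

Fstat = ((R*b - q)'*((R*XXi*R')\(R*b - q))/m) / s2;
Falt  = ((est'*est - e'*e)/m) / (e'*e/(n - k));     % same number, from the two SSRs
tstat = (R*b - q) / sqrt(s2*R*XXi*R');
disp([Fstat, Falt, tstat^2])            % F equals the squared t for a single restriction
% compare Fstat with finv(0.95, m, n-k) if the Statistics Toolbox is available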
37 Lecture Note 5: Large Sample Theory I 28 Basic Asymptotic Theory: Convergence in Distribution De…nition 7 A sequence of distributions Fn (t) converges in distribution to a distribution F (t) if lim Fn (t) = F (t) n!1 at all points of continuity of F (t). With some abuse of terminology, if a sequence of random vectors zn , whose cdf are Fn (t), converges in distribution to a random variable z with cdf F (t) if Fn (t) converges in distribution to a distribution F (t). We will sometimes write d zn ! z and call F (t) the limiting distribution of zn . Theorem 34 A sequence of random variables Yn converges to a constant c if and only if it converges in distribution to a limiting distribution degenerate at c. Proof. Suppose that plimn!1 Yn = c. Let Fn ( ) denote the c.d.f. of Yn . It su¢ ces to show that lim Fn (y) = 0 if y < c; = 1 if y > c n Assume that y < c. Let = c Fn (y) = Pr [Yn y > 0 We then have y] = Pr [Yn Now assume that y > c. Let = y 1 Fn (y) = Pr [Yn > y] = Pr [Yn c y c] = Pr [Yn c ] Pr [jYn cj ]!0 c. c>y c] = Pr [Yn c> ] Pr [jYn cj > ] ! 0 so that lim F (y) = 1 n Now suppose that lim Fn (y) = 0 if y < c; n We have Pr [jYn cj > ] proving the theorem. Pr [Yn c h ] + Pr Yn 38 c> = 1 if y > c 2 i = Fn (c )+1 Fn c + 2 !0 d Theorem 35 (Continuous Mapping Theorem) If zn ! z and if g ( ) is a continuous funcd tion, then g (zn ) ! g (z) d Theorem 36 (Transformation Theorem) If an ! a and plim bn = b with b constant, then d d d an + bn ! a + b, an bn ! ab. If b 6= 0, thenan / bn ! a/ b. Theorem 37 (Central Limit Theorem) Given an i.i.d. sequence zi of random vectors with E [zi ] = …nite and Var (zi ) = …nite positive de…nite, we have p d n (z ) ! N (0; ) : Remark 8 Consider n i.i.d. random variables with unknown mean 2 . By the law of large numbers, we have X n 1 Xi2 ! E Xi2 ; X ! E [Xi ] and unknown variance i It follows that ^2 = n 1 X Xi2 i With the Slutsky Theorem, we obtain p X n 2 2 X ! ! N (0; 1) ^ In particular, we obtain Pr 1:96 < n1=2 X ^ < 1:96 ! :95 so that we can use ^ ^ 1:96 p ; X + 1:96 p n n as the asymptotic 95% con…dence interval for even when Xi does not necessarily have a normal distribution. X Theorem 38 If d and z2;n ! z; plim z1;n = c; n!1 then d z1;n + z2;n ! c + z; d z1;n z2;n ! c z: Theorem 39 (Delta Method) Suppose that p d n (zn c) ! N (0; 2 ) Also suppose that g ( ) is continuously di¤erentiable at c. We then have p d n (g (zn ) g (c)) ! N (0; g 0 (c)2 2 ) 39 Sketch of Proof. We have g (c) = g 0 (e c) (zn g (zn ) c) for some e c between zn and c. Because zn ! c in probability, and because g 0 ( ) is continuous, we have g 0 (e c) ! g 0 (c). Writing p p n (g (zn ) g (c)) = g 0 (e c) n (zn c) we obtain the desired conclusion. Theorem 40 (Multivariate Delta Method) Suppose that p d c) ! N (0; ) n (zn Also suppose that g ( ) is continuously di¤erentiable at c. We then have p d g (c)) ! N (0; G G0 ) n (g (zn ) where @g (c) @z 0 G 29 Classical Linear Regression Model III 29.1 Model yi = x0i + "i (x0i ; "i ) i = 1; 2; : : : is i.i.d. xi ?"i 2 E ["i ] = 0, Var ("i ) = E [xi x0i ] is positive de…nite. Furthermore, all of its elements are …nite. 29.2 Some Auxiliary Lemmas Lemma 19 1X xi x0i n i=1 n b= Lemma 20 s2 = 1 n k n X i=1 "2i 2 1 n k + n X i=1 "i x0i ! ! b 1 ! n 1X x i "i : n i=1 + b 40 1 0 n k n X i=1 xi x0i ! b : Lemma 21 plimn!1 1 n Lemma 22 plimn!1 1 n Lemma 23 plimn!1 1 n Lemma 24 Pn i=1 Pn i=1 Pn i=1 xi x0i = E [xi x0i ] x i "i = 0 "2i = p 2 1X d n xi "i ! N 0; n i=1 n 2 E [xi x0i ] : Proof. Let zi = xi "i . 
We have E [zi ] = E [xi "i ] = E [xi ] E ["i ] = 0; Var (zi ) = E "2i xi x0i = E "2i E [xi x0i ] = 2 E [xi x0i ] : Lemma 25 limn!1 t:975 (n) = 1:96 30 Large Sample Property of OLS if plimn!1 b = . De…nition 8 An estimator b is consistent for Theorem 41 b is consistent for . Theorem 42 plimn!1 s2 = Theorem 43 p Proof. We have p But we have 2 d n b ! N 0; 1X xi x0i n i=1 n n b = 1X xi x0i n i=1 n plim n!1 and ! 2 1 (E [xi x0i ]) ! 1 ! n 1 X p x i "i : n i=1 1 1 X d p xi "i ! N 0; n i=1 = [Exi x0i ] 1 n 41 2 : Exi x0i : Theorem 44 Suppose that g : Rk ! R is continuously di¤erentiable at @g (c) : @c0 (c) Then, p n g b such that d 2 g ( ) ! N 0; 1 ( ) (E [xi x0i ]) ( )0 : Proof. Delta Method. p b is a valid approximate 95% con…dence interval for r0 . Theorem 45 r0 b 1:96 r0 Vr Proof. We have p Now, notice that n r0 b r0 d ! N 0; v " # u n u X 1 plim tr0 xi x0i n i=1 n!1 and 2 r0 [Exi x0i ] 1 r : q 1 1 r r0 [Exi x0i ] 1 r0 [Exi x0i ] r= p plim s2 = : n!1 Thus, we have It thus follows that p n r 0b v " # u n u X 1 plim s tr0 xi x0i n n!1 i=1 r 0 v " # u n . u X 1 str0 xi x0i n i=1 1 r= 1 r=q and lim Pr r n!1 0b q 1:96 s2 r0 (X 0 X) 1 r r q 0 r 0b r0 b r: r0 s2 r0 (X 0 X) d 1 r ! N (0; 1) q + 1:96 s2 r0 (X 0 X) 1 r = :95 Theorem 46 Suppose that g : Rk ! R is continuously di¤erentiable at . Then, r 0 b V b b g b 1:96 is a valid approximate 95% con…dence interval for g ( ). 42 Proof. We have p n g b d 2 g ( ) ! N 0; and ( ) (E [xi x0i ]) 1 ( )0 ; p plim s2 = : n!1 Because is continuous and plim b = , we should have b = plim n!1 from which we obtain v " n # u u X 1 b xi x0i plim t n n!1 i=1 1 b = It thus follows that p n g b and " lim Pr g b n!1 v u . u g( ) st 31 b r t:025 (n 1X xi x0i n i=1 n b (X 0 X) 1:96 s2 Theorem 47 r0 b " k) Problem Set # ( ) q ( ) (E [xi x0i ]) 1 b =r b 1 g( ) s2 1 g b ( )0 ; g( ) b (X 0 X) r g b + 1:96 s2 1 d b 0 ! N (0; 1) b (X 0 X) p b is a valid approximate 95% con…dence interval for r0 . r0 Vr Mankiw, Romer, and Weil (1992, QJE ) considered the regression yi = xi1 with the restriction 3 = 2, 1 + xi2 2 + xi3 3 + "i where yi = ln (GDP per working-age person in 1985) xi1 = 1 xi2 = ln (I / GDP) xi3 = ln (growth rate of the working age population between 1960 and 1985 + g + ) In other words, they regressed yi on xi1 and xi2 xi3 . They noted that the coe¢ cient of xi2 in this restricted regression is an estimator of 1 , where is capital’s share in income. 43 xi3 1 b # = :9 1. In Table 1, Mankiw, Romer, and Weil (1992) report an estimator b of implied by the OLS coe¢ cient of xi2 xi3 . Con…rm their …ndings with the data set provided for the three samples. 2. Mankiw, Romer, and Weil (1992) also report the standard deviation of b. Using deltamethod, con…rm their results. 44 Lecture Note 6: Large Sample Theory II 32 32.1 Linear Regression with Heteroscedasticity: Classical Linear Regression Model IV Model yi = x0i + "i (x0i ; "i ) i = 1; 2; : : : is i.i.d. E ["i j xi ] = 0 2 Var ( "i j xi ) (xi ) not known E [xi x0i ] is positive de…nite. We are going to consider the large sample property of ! ! 1 n n X X 1 1 0 b= + xi xi x i "i n i=1 n i=1 and the related inference. 32.2 Some Useful Results Lemma 26 1X x i "i = 0 plim n!1 n i=1 n Proof. 1X xi "i = E [xi "i ] = E [E [xi "i j xi ]] = E [xi E ["i j xi ]] = E [xi 0] = 0: n!1 n i=1 n plim Lemma 27 1X 2 0 plim "i xi xi = E xi x0i "2i : n n!1 i=1 n 45 Lemma 28 p 1X d n xi "i ! N 0; E "2i xi x0i n i=1 n : Proof. Let zi x i "i : We have Var (zi ) = E "2i xi x0i : E [zi ] = 0; Apply CLT. 
Lemma 29 Under reasonable conditions, we have 1X 2 0 ei xi xi = E "2i xi x0i n!1 n i=1 n plim Proof. For simplicity, I will assume that xi is a scalar. It su¢ ces to prove that 1X 2 2 xi ei plim n!1 n i=1 n x2i "2i = 0: We have 2"i xi b = "2i so that e2i "2i = 2"i xi b 1X 2 2 x e n i=1 i i n 2 2 "i x i b 2 + jxi j2 b 1X 2 2 x e n i=1 i i 2 + x2i b + x2i b 2 j"i j jxi j b Thus, 2 xi b e2i = "i : 2 + x2i b n "2i "2i 2X j"i j jxi j3 n i=1 n ! ! 0: 1X jxi j4 n i=1 n b + if E j"i j jxi j3 < 1; 46 E jxi j4 < 1: ! b 2 32.3 Large Sample Property Theorem 48 plim b = : n!1 Proof. 1X xi x0i n i=1 n plim n!1 ! 1 1X x i "i n i=1 n Theorem 49 ! 1X xi x0i n i=1 n = plim n!1 p where n b [Exi x0i ] But we have n b plim n!1 and n plim n!1 1 1 E "2i xi x0i [Exi x0i ] n 1X xi x0i n i=1 ! ! 1 : ! n X 1 p x i "i : n i=1 1 = [Exi x0i ] 1 X d p xi "i ! N 0; E "2i xi x0i n i=1 1 n Theorem 50 Let Then, 1X x i "i n i=1 d n 1X xi x0i n i=1 = 1 ! N (0; ) ; Proof. We have p ! " n X bn = 1 xi x0i n i=1 # 1 " n 1X 2 0 e xi xi n i=1 i #" plim b n = : : n 1X xi x0i n i=1 # 1 : n!1 Theorem 51 Let We then have c W " n # X 1b xi x0i n = n i=1 r0 b r0 p r0 c Wr 1 " n X i=1 e2i xi x0i #" ! N (0; 1) 47 n X i=1 xi x0i # 1 ! = E [xi x0i ] 1 0 = 0: Proof. Corollary 11 r0 b r0 b r0 p r0 c Wr r0 b = q p r0 1 0b r nr n = n r0 b r0 q r0 b n r ! N (0; 1) p 1:96 r0 c Wr is a valid approximate 95% con…dence interval for r0 . Remark 9 For many practical purposes, we may understand c W as equal to Var b 33 Problem Set Mankiw, Romer, and Weil (1992, QJE ) considered the regression yi = xi1 with the restriction 3 = 2, 1 + xi2 2 + xi3 3 + "i where yi = ln (GDP per working-age person in 1985) xi1 = 1 xi2 = ln (I / GDP) xi3 = ln (growth rate of the working age population between 1960 and 1985 + g + ) In other words, they regressed yi on xi1 and xi2 xi3 . They noted that the coe¢ cient of xi2 in this restricted regression is an estimator of 1 , where is capital’s share in income. xi3 1. Using White’s formula, construct a 95% con…dence intervals for the coe¢ cients of xi1 and xi2 xi3 . 2. In Table 1, Mankiw, Romer, and Weil (1992) report an estimator b of implied by the OLS coe¢ cient of xi2 xi3 . Con…rm their …ndings with the data set provided for the three samples. 3. Mankiw, Romer, and Weil (1992) also report the standard deviation of b. Combining White’s formula with delta-method, construct a 95% con…dence interval for . 48 Lecture Note 7: IV 34 Omitted Variable Bias Suppose that yi = xi + wi + "i : We do not observe wi . Our object of interest is . 34.1 Bias of OLS What will happen if we regress yi on xi alone? P P P b = Pi xi yi = + Pi xi wi + Pi xi "i = + 2 2 2 i xi i xi i xi 1 n Letting P xi wi Pi 2 + 1 i xi n P x i "i Pi 2 ! i xi 1 n 1 n + E [xi wi ] E [x2i ] E [xi wi ] ; E [x2i ] we have plim b = Because + : P xw Pi i 2 i ! ; i xi we can interpret as the probability limit of the OLS when wi is regressed on xi : Unless wi and xi are uncorrelated, will be nonzero, and OLS will be biased. 34.2 IV Estimation Suppose that we also observe zi such that E [zi wi ] = 0 E [zi "i ] = 0 E [zi xi ] 6= 0 Note that we can regard ui wi + "i the “error term”in the regression of yi on xi . Also note that E [zi ui ] = E [zi wi ] + E [zi "i ] = 0 49 Consider bIV P zy Pi i i ; i zi xi which is obtained by replacing xi by zi in the numerator and the denominator of the OLS formula. Observe that P P z (x + u ) zi ui E [zi ui ] i i i i bIV = P = + Pi ! 
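Before the IV estimator is written out just below, here is a small Monte Carlo sketch of the omitted-variable bias derived in Section 34.1, with the IV estimator of Section 34.2 included for comparison. The data-generating process is an arbitrary illustration with all variables scalar and mean zero.

% Omitted-variable bias of OLS versus the IV estimator.
rng(5);
n = 5000;  beta = 1;  gamma = 2;
z   = randn(n,1);                       % instrument: correlated with x, not with w or eps
w   = randn(n,1);                       % omitted regressor
x   = 0.8*z + 0.6*w + randn(n,1);       % x is correlated with both z and w
eps = randn(n,1);
y   = beta*x + gamma*w + eps;

b_ols = (x'*y)/(x'*x);                  % y regressed on x alone: inconsistent
b_iv  = (z'*y)/(z'*x);                  % IV estimator: consistent for beta
disp([b_ols, b_iv])                     % Section 34.1's formula gives plim b_ols = 1.6 here,
                                        % while b_iv is close to the true beta = 1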
+ = : E [zi xi ] i zi xi i zi xi We thus have 35 plim bIV = : Problem Set Suppose that yi = x0i + wi0 + "i : We do not observe wi . Our object of interest is . We assume that E [xi "i ] = 0; E [wi "i ] = 0 What is the probability limit of the OLS estimator of if we regress yi on xi alone? Hint: ! ! 1 n n X 1 1X xi x0i xi yi bOLS = n i=1 n i=1 ! 1 ! ! 1 ! n n n n X X X X 1 1 1 1 = + xi x0i xi wi0 + xi x0i x i "i n i=1 n i=1 n i=1 n i=1 When is the OLS estimator consistent for ? 36 Errors in Variables For simplicity, we assume that every random variable is a zero mean scalar random variable. Suppose that + "i yi = xi We do not observe xi . Instead, we observe a proxy xi = xi + ui 50 36.1 Bias of OLS What would happen if we regress yi on xi ? Condition 1 E ["i ui ] = 0; E ["i xi ] = 0; E [ui xi ] = 0 Observe that yi = xi + ("i ui ) : We thus have P b = Pi xi yi = 2 i xi But, + P i xi ("i P ui ) 2 i xi ! + E [xi ("i ui )] : 2 E [xi ] E x2i = E xi 2 + E u2i and E [xi ("i We thus have bias toward zero. 36.2 plim b = ui )] = E [(xi + ui ) ("i E [u2i ] = E [xi 2 ] + E [u2i ] ui )] = E u2i : E [xi 2 ] ; E [xi 2 ] + E [u2i ] IV estimation Suppose that we also observe zi = xi + vi ; where Condition 2 vi is independent of xi ; "i ; ui Consider bIV P zy Pi i i ; i zi xi which is obtained by replacing xi by zi in the numerator and the denominator of the OLS formula. Observe that P P z [x + (" u )] zi ("i ui ) E [zi ("i ui )] i i i i bIV = i P = + i P ! + : E [zi xi ] i zi xi i zi xi But, E [zi ("i ui )] = E [(xi + vi ) ("i We thus have plim bIV = : 51 ui )] = 0: 37 37.1 Simultaneous Equation: Identi…cation A Supply-Demand Model qd = 1 p+ qs = 1 p d y+ 2 + (Demand) s x+ 3 (Supply) q = qd = qs 1 1 p q 1 1 1 = 1 p q 1 1 0 2y 2 1y 1 + 3 + 3 d 1 3x 1 + s d + s 1 y x 3x + d y x 0 2 1 1 = 0 1 1 1 0 2 + (Equilibrium) 1 1 s =0 1 1 1 d 1 s 1 : Important observation: Both p; q in equilibrium will be correlated with d ; s . What happens in the “demand”regression? The probability limit will equal E [p2 ] E [py] E [py] E [y 2 ] 1 E [pq] E [yq] = E p E = E [p2 ] E [py] E [py] E [y 2 ] + 2 But d 1 h d 2 i E 1 1 1 E p E y d d d s 6= 0 in general! 37.2 General Notation Individual observation consists of (wi0 ; zi0 ) Here, wi denote the vector of endogenous variables, and zi denote the vector of exogenous variables. We assume that there is a linear relationship: w0 = z 0 B + 0 ; where denotes the vector of “errors”. Our assumption is that z is uncorrelated with : E [z 0 ] = 0: 52 37.3 Identi…cation Suppose that we know the exact population joint distribution of (w0 ; z 0 ). Can we compute and B from this distribution? In many cases, we are not interested in the estimation of the whole system. Rather we are interested in the estimation of just one equation. Assume without loss of generality that it is the …rst equation. 
Write w1 = Here, I assume that 0 w + 0 1 w1 w = @ w A; w0 0 1 z + : z z0 z= : Our restriction that E [z 0 ] = 0 implies that E zw1 = E [zw 0 ] + E [zz 0 ] = [E [zw 0 ] ; E [zz 0 ]] A necessary condition for the identi…cation of and the dimension of x is bigger than that of ( 0 ; 0 )0 : dim (z) from this system of linear equations if dim ( ) + dim ( ) = dim (w ) + dim (z ) But dim (z) = dim (z ) + dim z 0 : We thus have dim z 0 37.4 dim (w ) : General Identi…cation We have y = x0 + u; where the only restriction given to us is that E [zu] = 0: Identi…cation: E [zy] = E [zx0 ] A necessary condition for identi…cation is dim (z) dim ( ) Remark 10 You would like to make sure that the rank of the matrix E [zx0 ] is equal to dim ( ) as well, but it is not very important at the …rst year level. It will become important later. 53 37.5 Estimation with ‘Exact’Identi…cation When dim (z) = dim ( ), we say that the model is exactly identi…ed. We can then see that the matrix E [zx0 ] is square, and that = (E [zx0 ]) 1 E [zy] By exploiting the law of large numbers, we can construct a consistent esitmator of : ! 1 ! ! 1 ! n n n n X X X X 1 1 zi x0i zi yi = zi x0i zi yi = bIV ! n i=1 n i=1 i=1 i=1 The IV estimator is usually written in matrix notations: 38 bIV = (Z 0 X) 1 Z 0y Asymptotic Distribution of IV Estimator Theorem 52 Suppose that zi is independent of "i . Then where 2 " = Var ("i ). p n bIV ! N 0; 2 " (E [zi x0i ]) 1 E [zi zi0 ] (E [xi z Proof. Problem Set. 39 Problem Set All questions here are taken from Greene. ys denote endogenous variables, and xs denote exogenous variables. 1. Consider the following two-equation model: y1 = 1 y2 + 11 x1 + 21 x2 + 31 x3 + "1 y2 = 2 y1 + 12 x1 + 22 x2 + 32 x3 + "2 (a) Verify that, as stated, neither equation is identi…ed. (b) Establish whether or not the following restrictions are su¢ cient to identify (or partially identify) the model: i. 21 = 32 =0 ii. 12 = 22 =0 iii. 1 =0 iv. 2 = v. 21 + 1 and 22 32 =0 =1 54 2. Examine the identi…ability of the following supply and demand model: (Demand) ln Q = 0 + 1 ln P + 2 ln (income) + "1 ln Q = 0 + 1 ln P + 2 ln (input cost) + "2 (Supply) 3. Consider a linear model yi = x0i + "i m 1 with the restriction that E zi "i m 1 Derive the asymptotic distribution of IV ! 1 n X 1 bIV = zi x0i n i=1 =0 1X zi yi n i=1 n under the assumption that E (zi "i ) (zi "i )0 = (b) Show that n bIV = n 1X 0 zi xi n i=1 1X 0 zi xi n i=1 n converges to (E [zi x0i ]) = (Z 0 X) 1 ! ! 1 1 X p zi "i n i 1 1 in probability. (c) Show that 1 X p zi "i n i converges in distribution to N 0; (d) Conclude that p n bIV N 0; 2 "E [zi zi0 ] converges in distribution to 2 " (E [zi x0i ]) 55 1 Z 0y [zi zi0 ]. 2 "E (a) Show that p ! E [zi zi0 ] (E [xi zi0 ]) 1 ! Lecture Note 8: MLE 40 MLE We have a collection of i.i.d. random vectors Zi i = 1; : : : ; n such that Zi f (z; ) Here, f (z; ) denotes the (common) pdf of Zi . Our objective is to estimate . De…nition 9 (MLE) Assume that Z1 ; : : : ; Zn are i.i.d. with p.d.f. f (zi ; maximizes the likelihood: b = argmax c 40.1 n Y f (Zi ; c) = argmax c i=1 n X 0 ). The MLE b log f (Zi ; c) : i=1 Consistency Theorem 53 (Consistency) Assume that Z1 ; : : : ; Zn are i.i.d. with p.d.f. f (zi ; plimn!1 b = under some suitable regularity conditions. 0 ). Then, Below we provide an elementary proof of consistency. Write h (Zi ; c) = log f (Zi ; c) for simplicity of notation. 
We assume the following: Condition 3 There is a unique 0 2 such that max E [h (Zi ; c)] = E [h (Zi ; 0 )] c2 Remark 11 Because log is a concave function, we can use Jensen’s Inequality and conclude that f (Zi ; c) f (Zi ; c) E [log f (Zi ; c)] E [log f (Zi ; 0 )] = E log log E f (Zi ; 0 ) f (Zi ; 0 ) But f (Zi ; c) E = f (Zi ; 0 ) and hence Z f (z; c) f (z; f (z; 0 ) E [log f (Zi ; c)] 0 ) dz E [log f (Zi ; = Z 0 )] f (z; c) dz = 1 log (1) = 0 for all c. In other words, E [log f (Zi ; c)] E [log f (Zi ; for all c. 56 0 )] Condition 4 De…ne B ( ) f 2 :j 0j g. For each max E [h (Zi ; c)] < E [h (Zi ; c2B( ) Condition 5 maxc2 1 n Pn i=1 > 0, 0 )] : E [h (Zi ; c)] ! 0 almost surely. h (Zi ; c) Sketch of Proof. Fix > 0. Let = E [h (Zi ; and note that and 0 )] ; = max E [h (Zi ; c)] : c2B( ) < . But 1X max h (Zi ; c) ! max E [h (Zi ; c)] c2B( ) n c2B( ) i=1 n 1X h (Zi ; c) ! max E [h (Zi ; c)] max c2 c2 n i=1 n almost surely. It follows that b 62 B( ) for n su¢ ciently large. Thus, lim b n!1 0 almost surely < Since the above statement holds for every > 0, we have b ! 40.2 0 almost surely. Fisher Information with One Observation Remark 12 Without loss of generality, we omit the i subscript in this section. Assumption Z f (z; ), 2 De…nition 10 (Score) s (z; ) . @ log f (z; )/ @ Lemma 30 E [s (Z; )] = 0 Proof. Because 1= we have 0= Z @f (z; ) dz = @ Z Z f (z; ) dz @f (z; ) @ f (z; ) f (z; ) dz = Z s (z; ) f (z; ) dz De…nition 11 (Fisher Information) Z I ( ) = s (z; ) s (z; )0 f (z; )dz = E s (Z; ) s (Z; )0 : 57 Theorem 54 I( )= Z @ 2 log f (z; ) f (z; )dz = @ @ 0 Proof. Because 0= we have Z Z @s (z; ) f (z; ) dz + @ 0 @ 2 log f (Z; ) : @ @ 0 E s (z; ) f (z; ) dz Z @f (z; ) dz @ 0 Z Z @f (z; ) @ @ 2 log f (z; ) 0 = f (z; ) dz + s (z; ) @ f (z; ) dz 0 @ @ f (z; ) Z 2 Z @ log f (z; ) = f (z; ) dz + s (z; ) s (z; )0 f (z; ) dz @ @ 0 0= s (z; ) Example 2 Suppose that X N ( ; 2 ). Assume that 2 is known. The Fisher information I ( ) can be calculated in the following way. Notice that # " 1 (x )2 f (x; ) = p exp 2 2 2 so that )2 (x log f (x; ) = C 2 2 where C denotes the part of the log f wich does not depend on . Because s(x; ) = we have I ( ) = E s (X; )2 = Remark 13 In the multivariate case where s (x; ) = @ log f (x; ) @ E 2 1 4 )2 = E (X 1 2 = ( 1 ; : : : ; K ), we let 0 @ log f (x; ) 1 and I ( ) = E s (X; ) s (X; )0 = x B @ @ 1 @ log f (x; ) @ K @ 2 log f (x; ) = @ @ 0 58 .. . 2 6 E4 C A: @ 2 log f (X; ) @ 1@ 1 @ 2 log f (X; ) @ 1@ K @ 2 log f (X; ) @ K@ 1 @ 2 log f (X; ) @ K@ K 3 7 5 Example 3 Suppose that X is from N ( 1 ; log f (x; 1; 2) = Then, 2 ). 1 log (2 2 2 so that s (x; 1; 2) = @ log f =@ @ log f =@ 2 1) (x 2) x 1 = 1 2 2 from which we obtain E [ss0 ] = 2 0 1 2 2 1 2 + (x 2 1) 2 2 2 ! 0 1 2 22 In this calculation, I used the fact that E [Z 2m ] = (2m)!= (2m m!) and E [Z 2m 1 ] = 0 if Z N (0; 1). 40.3 Random Sample Assumption Z1 ; : : : ; Zn are i.i.d. random vectors f (z; ). Proposition 1 Let In ( ) denote the Fisher Information in Z1 ; : : : ; Zn . Then, In ( ) = n I ( ) ; where I ( ) is the Fisher Information in Zi . 40.4 Limiting Distribution of MLE Theorem 55 (Asymptotic Normality of MLE) p n b d ! N (0; I 1 ( )) Sketch of Proof. We will assume that the MLE is consistent. By the FOC, we have 0= n @ log f Z ; b X i @ i=1 0 1 n n @ 2 log f Z ; e X X i @ log f (Zi ; ) @ A b = + 0 @ @ @ i=1 i=1 where the second equality is justi…ed by the mean value theorem. Here, the e is on the line segment adjoining b and . 
It follows that p n b = 0 1 n @ 2 log f Z ; e X i @1 A n i=1 @ @ 0 59 1 n 1 X @ log f (Zi ; ) p @ n i=1 ! It can be shown that, under some regularity conditions, 2 e n 1 X @ log f Zi ; n i=1 @ @ 0 1 X @ 2 log f (Zi ; ) !0 n i=1 @ @ 0 n in probability. Because 1 X @ 2 log f (Zi ; ) @ 2 log f (Zi ; ) ! E n i=1 @ @ 0 @ @ 0 n in probability, we conclude that 2 e n 1 X @ log f Zi ; @ 2 log f (Zi ; ) plim = E = 0 0 @ @ @ @ n!1 n i=1 We also note that I( ) 1 X @ log f (Zi ; ) d p ! N (0; I ( )) @ n i=1 (4) n (5) by the central limit theorem. Combining (4) and (5) with Slutzky Theorem, we obtain the desired conclusion. Remark 14 How do we prove 2 e n 1 X @ log f Zi ; n i=1 @ @ 0 Here’s one way. Assume that 1 X @ 2 log f (Zi ; ) p ! 0? n i=1 @ @ 0 n is a scalar, so what we need to prove is 2 e n 1 X @ log f Zi ; n i=1 @ 2 1 X @ 2 log f (Zi ; ) p !0 n i=1 @ 2 n Note that we have by the mean value theorem e n 1 X @ log f Zi ; n i=1 @ 2 2 0 1 ee 3 @ log f Z ; n n i X C 1 X @ 2 log f (Zi ; ) B B1 C e = @n A n @ 2 @ 3 i=1 i=1 e for some e in between e and . Now assume that sup @ 3 log f (Zi ; ) @ 3 60 M (Zi ) and that E [M (Zi )] < 1. Then we have 2 e n 1 X @ log f Zi ; n i=1 @ 2 n 1 X @ 2 log f (Zi ; ) n i=1 @ 2 ! n X 1 M (Zi ) e n i=1 ! n 1X M (Zi ) b n i=1 where the second inequality used the fact that e is on the line segment adjoining b and and hence the distance between e and should be smaller than that b and . By the law of large numbers, we have n 1X p M (Zi ) ! E [M (Zi )] n i=1 By consistency, we also have b p ! 0. The conclusion then follows by Slutzky. Theorem 56 (Estimation of Asymptotic Variance 1) 0 1 n @ log f Z ; b @ log f Z ; b X i i 1 A plim Vb1 plim @ n i=1 @ @ 0 n!1 n!1 Theorem 57 (Estimation of Asymptotic Variance 2) 1 0 n @ 2 log f Z ; b X i 1 A plim Vb2 plim @ 0 n i=1 @ @ n!1 n!1 1 =I 1 ( ): 1 =I 1 ( ): Theorem 58 (Approximate 95% Con…dence Interval) For simplicity, assume that dim ( ) = 1. We have q q 3 2 Vb1 Vb b + 1:96 p 1 5 = :95 lim Pr 4 b 1:96 p n!1 n n q q 3 2 b V2 Vb b + 1:96 p 2 5 = :95 lim Pr 4 b 1:96 p n!1 n n Remark 15 Approximate 95% con…dence interval may therefore be constructed as q q b V Vb2 b 1:96 p 1 ; b or 1:96 p n n . . Many softwares usually report Vb1 n or Vb2 n, and call it the (estimated) variance. Therefore, p you do not need to make any adjustment for n with such output. 61 41 Latent Utility For simplicity of notation, assume that x0i is nonstochastic. We have U1i = x0i 1 + u1i : Choice 1 U0i = x0i 0 + u0i : Choice 0 Choice 1 is made if and only if Ui U1i U0i = x0i ( 0) 1 + (u1i u0i ) 0: or x0i "i 0: Otherwise, choice 0 is made. Example 4 Suppose (u1 ; u0 ) has a bivariate normal distribution. Then, "i has a normal distribution. Example 5 Suppose u1 and u0 are i.i.d. with the common c.d.f. F (z) = exp [ exp ( x)]. Then, et Pr [" t] = t : e +1 (Proof omitted.) 42 Binary Response Model Two Choices: yi = 1 Choice 1 is made yi = 0 Choice 0 is made Assume that yi = 1 , Ui = x0i "i 0 Let G (t) Pr ["i t] : Then, Pr [yi = 1] = Pr ["i Example 6 G (t) = Example 7 G (t) = (t): Probit Model et et +1 = (t): Logit Model 62 x0i ] = G (x0i ) : Note that individual likelihood equals y 1 yi G (x0i ) i [1 G (x0i )] It follows that the joint log likelihood equals X yi log G (x0i ) + (1 yi ) log [1 G (x0i )] i MLE from FOC: X i G x0i b i g x0i b h 0b 0b 1 G xi G xi yi xi = 0: Proposition 2 The log likelihood of the Probit or Logit model is globally concave. Proof. Exercise. Proposition 3 p n b d ! 
N 0; I 1 ( ) : Proposition 4 The Fisher Information I ( ) from the individual observation equals " # g (x0i )2 E xi x0i : 0 0 G (xi ) [1 G (xi )] Proof. Obvious from @ log f (zi ; ) yi G (x0i ) = g (x0i ) xi @ G (x0i ) [1 G (x0i )] and I( )=E @ log f @ log f : @ @ 0 Proposition 5 2 g x0i b 1X h i xi x0i = I ( ) ; plim n i G x0 b 1 G x0 b i i h i2 yi G x0i b 2 1X 0 plim xi x0i = I ( ) : i2 g xi b 2h n i G x0i b 1 G x0i b 63 43 Tobit Model (Censoring) Suppose yi = x0i + "i "i j x i N 0; 2 We observe (yi ; Di ; xi ),where Di = 1 (yi > 0) y i = Di y i Individual Likelihood: x0i 1 Di 1 yi x0i Di = 1 x0i 1 Di 1 yi x0i Di MLE is not simple because the likelihood is not concave in parameters: We have to deal with local vs. global maxima problem! 43.1 Bias of OLS Consider regressing yi on xi for those observations with Di = 1. For simplicity, assume that is a scalar. In this case, we may write P Di x i y i : b= P Di x2i Lemma 31 plim b = n!1 E [Pr [Di = 1j xi ] E [yi j Di = 1; xi ] xi ] : E [Pr [Di = 1j xi ] x2i ] Proof. Denominator: 1X Di x2i = E Di x2i = E E [Di j xi ] x2i = E Pr [Di = 1j xi ] x2i plim n!1 n Numerator: 1X Di xi yi = E [Di yi xi ] = E [E [Di yi j xi ] xi ] ; n!1 n plim and E [Di yi j xi ] = E [1 yi j Di = 1; xi ] Pr [Di = 1j xi ] + E [0 yi j Di = 0; xi ] Pr [Di = 0j xi ] = E [1 yi j Di = 1; xi ] Pr [Di = 1j xi ] 64 Corollary 12 b is consistent only if E [yi j Di = 1; xi ] = x0i : Lemma 32 Suppose u N (0; 1). Then, E [uj u > t] = (t) (t) 1 Proof. Rs Rs (x) dx (x) dx t Pr [u sj u > t] = R 1 = t : 1 (t) (x) dx t Thus, conditional p.d.f. of u at s given u > t equals Rs (x) dx d d (s) t Pr [u sj u > t] = = : ds ds 1 (t) 1 (t) It follows that E [ uj u > t] = Z t 1 s 1 (s) ds = (t) 1 Z 1 (t) 1 s (s) ds: t But from d (s) = s (s) ; ds we have Z 1 Z 1 d (s) ds = s (s) ds = ds t t from which the conclusion follows. (s)j1 t = Lemma 33 x0i E [yi j xi ; Di = 1] = x0i + x0i : Proof. x0i ] E ["i j xi ; Di = 1] = E ["i j xi ; "i = E "i xi ; "i x0i = x0i 1 x0i = x0i Corollary 13 b is inconsistent. 65 : x0i (t) ; 43.2 Heckman’s Two Step Estimator For notational simplicity, I will drop xi in the conditioning event. We know that E [ yi j Di = 1] = x0i + where x0i ; (s) : (s) (s) Thus, if is known, we can estimate consistently by regressing yi on xi and x0i . Observe that can be consistently estimated by the Probit MLE of D on x: we have Di = 1 i¤ x0i and "i + "i >0 N (0; 1) : This suggests two step estimation: Obtain MLE of d from the Probit model x0i d Regress yi on xi and 44 Sample Selection Model Suppose yi = x0i + ui We observe (yi ; Di ; xi ),where Di = 1 (zi0 + vi > 0) ; yi = Di yi Our goal is to estimate . We assume ui vi xi N 0 0 2 u ; u v 2 v u v : By the same reasoning as in the censoring case, yi regressed on xi for the subsample where Di = 1 will result in an inconsistent estimator. To …x this problem, we can either rely on MLE, or we can use the two step estimation. Lemma 34 E [yi j Di = 1] = x0i + 66 u zi0 : v Proof. Note that E [yi j Di = 1] = E [x0i + ui j zi0 + vi > 0] zi0 ] = x0i + E [ui j vi > Now recall that u wi = ui vi v is independent of vi . We thus have u zi0 ] = E E [ ui j v i > = vi E u vi > v = vi vi E u v = zi0 vi + wi vi > v zi0 u The lemma suggests that if we know v zi0 zi0 > v + E [ wi j vi > zi0 ] + E [wi ] v +0 v , then can be estimated consistently by regressing yi on xi and zi0 v . But v can be consistently estimated by the Probit MLE of Di on zi ! This suggests two step estimation. 
Obtain MLE of [ v from the Probit model Regress yi on xi and 45 zi0 [ v Problem Set 1. Suppose that (u; v) are bivariate normal with mean equal to zero. Let " Cov (u; v) u: Var (u) v Show that " and u are independent of each other. 2. Recall that, if N (0; 1), we have E[ j t] = 1 (t) : (t) Suppose that v u N 0 0 67 ; 2 v v u v u 2 u : Show that t E [ vj u u t] = v : t 1 u Hint: Observe that E [vj u v t] = E " + u u u t =E u t = v u u u E v u u u u u t : u 3. Suppose that yi = x0i + vi ; but we observe yi if and only if zi0 + ui 0: In other words, we observe (yi ; Di ; xi ; zi ) for each individual, where Di = 1 (zi0 + ui and yi = yi Di ; Assume that vi ui N 0 0 2 v ; 0) : v u 2 u v u : Show that E [yi j Di = 1; xi ; zi ] = (a) Suppose that you know x0i + v zi0 u zi0 u : . Show that the OLS of yi on xi and u the subsample where Di = 1 yields a consistent estimator of . (b) Suggest how you would construct a consistent estimator of u (zi0 u ) applied to (zi0 u ) . (c) Suggest a two step method to construct a consistent estimator of . 68 Lecture Note 9: Efficiency 46 46.1 Classical Linear Regression Model I Model y =X +" X is a nonstochastic matrix X has a full column rank (Columns of X are linearly independent.) E ["] = 0 E [""0 ] = 46.2 2 In for some unknown positive number 2 Gauss-Markov Theorem Theorem 59 (Gauss-Markov) Given the Classical Linear Regression Model I, OLS estimator is the minimum variance linear unbiased estimator. (OLS is BLUE) Proof. First note that b is a linear combination of y using (X 0 X) linear estimator Cy = CX + C": If c is to be unbiased, we should have CX = ; or CX = I: Also note that Var (Cy) = 2 CC 0 : Because the di¤erence CC 0 (X 0 X) 1 = CC 0 h =C I = CM C 0 CX (X 0 X) X 0C 0 i 1 0 0 X (X X) X C 0 = CM M 0 C 0 = CM (CM )0 ; is nonnegative de…nite, the result follows. 69 1 1 X 0 . Consider any other 46.3 Digression: GLS Theorem 60 Assume that is a positive de…nite matrix. Then, Xb)0 argmin (y 1 (y Xb) = X 0 1 1 X X0 1 y: b Proof. Because is a positive de…nite matrix, there exists T such that T T 0 = In : Observe that = (T ) 1 (T 0 ) 1 = (T 0 T ) 1 1 ) = T 0 T: We can thus rewrite the objective function as (y Xb)0 T 0 T (y T Xb)0 (T y Xb) = (T y X b)0 (y T Xb) = (y X b) ; which is minimized by (X 0 X ) 1 X 0 y = (X 0 T 0 T X) 1 X 0T 0T y = X 0 1 X 1 X0 1 y: We keep every assumption of the classical linear regression model I except we now assume that Var (") = , some known positive de…nite matrix. Consider the estimator bGLS de…ned by bGLS Xb)0 argmin (y 1 (y Xb) b Theorem 61 Under the new assumption, bGLS is BLUE. Proof. There exists T such that T T 0 = In : Now, consider the transformed model y =X +" ; where y = T y; X = T X; " = T" This transformed model satis…es every assumption of Classical Linear Regression Model I: h i 0 E " " = E [T ""0 T 0 ] = T T 0 = In : The BLUE for the transformed model equals (X 0 X ) 1 X 0 y = (X 0 T 0 T X) 1 X 0T 0T y = X 0 70 1 X 1 X0 1 y: Remark 16 OLS b= remains to be unbiased. 1 X 0" is unknown and has to be estimated. Let b denote some ‘good’ Remark 17 In general, estimator. Then, X0 b is called the feasible GLS (FGLS). 47 + (X 0 X) 1 1 X X0 b 1 y Approximate E¢ ciency of MLE: Cramer-Rao Inequality Theorem 62 (Cramer-Rao Inequality) Suppose that Z ased for , i.e., E [Y ] = . Then, E (Y )0 ) (Y I( ) f (Z; ) and Y = u (Z) is unbi1 : Proof. Assume for simplicity that is a scalar. 
We then have Z = u (z) f (z; ) dz and 1= We can rewrite them as Z @f (z; ) u (z) dz = @ E [Y ] = ; Z u (z) s (z; ) f (z; ) dz E [Y s (Z; )] = 1 Now, recall that E [s (Z; )] = 0, so that E [(Y ) s (Z; )] = E [Y s (Z; )] E [s (Z; )] = 1: De…ne y Y ; s (Z; ) : X Letting E X2 1 E [X y] = I ( ) and " y X ; we may write y=X 71 + ": 1 ; Observations to be made: (1) The coe¢ cient does not depend on Y at all; it only depends on the parameter estimated and the Fisher information I ( ); (2) X and hence U are zero mean random variables and they are not correlated: E [X "] = E [X y] E X2 =0 We thus have E y2 = 2 2 E X2 + E U 2 to be E X2 = 1 I( ) This inequality is known as the Cramer-Rao inequality, and the right hand side of this inequality is sometimes called the Cramer-Rao lower bound. De…nition 12 Let Y be an unbiased estimator of a parameter . Call Y an e¢ cient estimator if and only if the variance of Y equals the Cramer-Rao lower bound. De…nition 13 The ratio of the actual variance of some unbiased estimator and the CramerRao lower bound is called the e¢ ciency of the estimator. Remark 18 Cramer-Rao bound is not sharp. Suppose that Xi i = 1; : : : ; n are from N ( 1 ; Then, ! n 0 2 In ( ) = 0 2n2 2 ). 2 which implies that the Cramer-Rao lower bound for 2 is equal to 2 22 ? n 2 22 . n Does there exist an unbiased estimator of 2 with variance equal to Note that the usual estimator S 2 = P P P 2 n 1 X is unbiased and is a function of su¢ cient statistic ( ni=1 Xi ; ni=1 Xi2 ). i=1 Xi n 1 It can be shown (using Lehman-Sche¤é Theorem, which you can learn from any textbook on mathematical statistics) that S 2 is the unique unbiased minimum variance estimator. Because 2 2 2 2 2 (n 1)S 2 2 = (n 21)2 2 (n 1) = n 21 , which (n 1), we have Var (S 2 ) = (n 21)2 Var (n 1)S 2 2 is strictly larger than the Cramer-Rao bound 2 22 . n Theorem 63 (Asymptotic E¢ ciency of MLE) 48 p n b d ! N (0; I 1 ( )) E¢ cient Estimation with Overidenti…cation: 2SLS We are dealing with yi = x0i + "i ; where the only restriction given to us is that E [zi "i ] = 0: 72 We assume that 0 1 z1i B C zi = @ ... A ; zri such that r = dim (zi ) dim (xi ) = dim ( ) = q: We can estimate in more than one way. For example, if dim ( ) = 1, we can construct r di¤erent IV estimators: P P zri yi z y 1i i i b(1) = P ; : : : ; b(r) = P i : i z1i xi i zri xi How do we combine them e¢ ciently? The answer is given by 2 ! ! 1 !3 1 ! ! X X X X X b2SLS = 4 xi zi0 zi zi0 zi x0i 5 xi zi0 zi zi0 i i h = X 0 Z (Z 0 Z) 1 Z 0X i i 1 X 0 Z (Z 0 Z) i 1 1 X i i zi yi ! Z 0y This estimator is called the two stage least squares estimator because it is numerically equivalent to b = Z (Z 0 Z) 1. Regress xi on zi . Get a …tted value matrix X 2. Regress yi on x bi , and obtain X 0 Z (Z 0 Z) It can be shown that for some , and that bn = where ei yi b V p 1 n b2SLS Z 0X 1 1 Z 0X X 0 Z (Z 0 Z) 1 Z 0 y. ! N (0; ) can be consistently estimated by 2 ! ! P 2 X X e 1 1 i i 4 xi zi0 zi zi0 n n i n i xi b2SLS is the residual. Let 2 ! P 2 X 1b 1 i ei 4 1 xi zi0 n = n n n n i 1X 0 zi zi n i Then a valid 95% asymptotic con…dence interval is given by p b2SLS 1:96 V b 73 !3 1 1X 0 5 zi xi n i ! 1 1 !3 X 1 zi x0i 5 n i 1 Remark 19 In the derivation of the asymptotic variance above, I assumed that zi and "i are independent of each other. Remark 20 Our discussion generalizes to the situation where dim ( ) > 1. We still have h i 1 1 0 1 0 0 X Z (Z Z) Z X X 0 Z (Z 0 Z) Z 0 y as our optimal estimator. 49 Why is 2SLS E¢ cient? Consider the following seemingly unrelated problem. 
Write U = (U1 ; : : : ; Ur )0 . Suppose that P P E [U ] = 0 and Var (U ) = . Consider w0 U = rj=1 wj Uj with rj=1 wj = 1. You would like to minimize the variance. This problem can be written as min w0 w s:t: `0 w = 1 w The Lagrangian of this problem is 1 0 w w + (1 2 `0 w) FOC is given by 0= w ` `0 w 0=1 which can be alternatively written as 1 w= ` and 1 = `0 from which we obtain w= Now, note that 0 1 2 b(1) B .. C 6 @ . A=4 b(r) 2 6 =4 1 n P i z1i 1 `0 1 n i z1i 0 ` 1 1` xi ` 3 0 ... 1 n 0 P 1 xi .. . 1 n 74 P i zri P xi 0 i zri xi 7 5 3 7 5 1 1 2 6 4 1 n P i z1i yi 3 7 .. 5 . P 1 i zri yi n 1X zi yi n i ! so that 0 p B n@ b(1) .. . b(r) 1 2 C 6 A=4 E [z1 x] 3 0 .. 1 7 5 . 6 4 0 E [zr x] X 1 zi "i + op (1) = Q 1p n i ! N 0; Q 2 " 1 1 Q 2 p1 n p1 n = N 0; P i z1i P "i .. . i zri 2 1 "Q Q "i 3 7 5 + op (1) 1 where 2 6 Q=4 E [z1 x] 3 0 .. . 0 E [zr x] = E [zz 0 ] 7 5 According to the preceding analysis, the optimal combination is given by 1 `0 ( 2 1 "Q 1) Q 1 ` 2 1 "Q Q 1 1 `= 1 `0 Q 1 Q` Q 1 Q` Now, note that 2 6 Q` = 4 E [z1 x] 0 so that the optimal combination is 2 3 3 E [z1 x] 6 7 7 .. ... 5` = 4 5 = E [zx] . E [zr x] E [zr x] 0 Q (E [zz 0 ]) 1 E [zx] E [xz 0 ] (E [zz 0 ]) 1 E [zx] We now note that, if plim wn = w, 0 b(1) p B .. wn0 n @ . b(r) then 1 0 p B C 0 A = w n@ b(1) b(r) .. . 1 C A + op (1) so we can use the approximate optimal combination, using 2 1P 3 0 i z1i xi n P P 16 7 .. 1 1 0 0 x z 4 5 . i i i i zi zi n n P 1 0 i zri xi n wn0 = P P P 1 1 1 1 0 0 i xi zi i zi zi i zi xi n n n 75 So the approximately optimal combination is 2 1P i z1i xi n P P 16 ... 1 1 0 0 4 i xi zi i zi zi n n 1 n P i xi zi0 2 6 4 1 n 1 n 0 P P 1 0 i zi zi i z1i xi 1 n .. i zri i zi xi 1 n 0 P 1 n P . 3 0 P i zri xi 3 0 xi 1X zi yi n i P which is 2SLS! 50 1 7 5 1 n 1 n = 7 5 i P xi zi0 i xi zi0 P 1 n 1 n ! 0 i zi zi P 0 i zi zi P 1 1 n 1 1 n i zi P yi i zi xi Problem Set 1. (From Goldberger) You are given a sample produced produced by a simultaneous equations model: y1 = 1 y2 + 2 x1 + "1 y2 = 3 y1 + 4 x2 + "2 You naively regressed the endogenous variables on exogenous variables, and obtained following OLS estimates: y1 = 6x1 + 2x2 y2 = 3x1 + x2 Is ( 1 ; 2 ) identi…ed? Is ( identi…ed. 3; 4) identi…ed? Compute consistent estimates of s that are 2. Consider a linear model yi = x0i + "i m 1 with the restriction that E zi "i k 1 =0 where k > m. Derive the asymptotic distribution of 2SLS 2 ! ! 1 !3 1 ! ! X X X X X b2SLS = 4 xi zi0 zi zi0 zi x0i 5 xi zi0 zi zi0 i h = X 0 Z (Z 0 Z) i 1 Z 0X i i 1 i X 0 Z (Z 0 Z) 76 1 Z 0y i 1 X i zi yi ! under the assumption that E (zi "i ) (zi "i )0 = 2 "E [zi zi0 ]. (a) Show that p 2 n b2SLS =4 1X xi zi0 n i ! !3 1 1X 0 5 zi xi n i ! 1 ! 1X 0 1 X p zi zi zi "i n i n i ! 1 1X 0 zi zi n i ! 1X xi zi0 n i (b) Show that 2 X 4 1 xi zi0 n i ! 1X 0 zi zi n i ! !3 X 1 zi x0i 5 n i 1 converges to in probability. h E [xi zi0 ] (E [zi zi0 ]) (c) Show that 1 i (E [zi x0i ]) N 0; p n b2SLS N 0; 1X xi zi0 n i 2 "E [zi zi0 ] converges in distribution to 2 " h E [xi zi0 ] (E 1 [zi zi0 ]) (E i [zi x0i ]) 3. The “wage.xls”…le contains three variables (from three years): w0 earnings (in dollars), 1990 ed0 education (in years), 1990 a0 age (in years), 1990 w1 earnings (in dollars), 1991 ed1 education (in years), 1991 a1 age (in years), 1991 w2 earnings (in dollars), 1992 ed2 education (in years), 1992 a2 age (in years), 1992 77 ! E [xi zi0 ] (E [zi zi0 ]) 1 X p zi "i n i converges in distribution to (d) Conclude that 1 1 1 1X 0 zi zi n i 1 ! 
1 (a) For the 1992 portion of the data, regress ln (wage)on edu, exp, (exp)2 , and a constant. (b) Ability is an omitted variable which may create an endogeneity problem with the education variable in our usual wage equation. It may be reasonable to assume that lagged education (1990 and 1991) are valid instruments for education in the 1992 regression. Re-estimate the wage regression using 2SLS. Never mind that the education variable hardly changes over time. 51 Method of Moments: Simple Example Suppose in the population we have E Yi i.e., the average 1 n =0 1 1 1 1 is the mean of Yi . Given that the expectation E can be approximated by a sample Pn by b that solves i=1 , it seems reasonable to estimate 1X Yi n i=1 n or 52 b =0 X b= 1 Yi n i=1 n Method of Moments: Generalization Suppose that we are given a model that satis…es E Yi Xi q q q 1 q 1 =0 Here, 0 is the parameter of interest. In order to estimate 0 , it makes sense to recall that sample average provides an analog of population expectation. Therefore, we expect 1X (Yi n i=1 n Xi ) 0 by b that solves It therefore makes sense to estimate 1X Yi n i=1 n or 1X Xi n i=1 n b= ! 1 1X Yi n i=1 n Xi b = 0 ! = n X Xi i=1 Such estimator is called the method of moments estimator. 78 ! 1 n X i=1 Yi ! Theorem 64 Suppose that (Yi ; Xi ) i = 1; 2; : : : is i.i.d. Also suppose that E [Xi ] is nonsingular. Then, b is consistent for . Proof. Let Ui Yi Xi Note that E [Ui ] = 0 by assumption. We then have 1X Xi n i=1 n b= ! 0 ! 1 + (E [Xi ]) 1X Xi n i=1 n 1 1X + Ui 0 n i=1 n (E [Ui ]) = ! 1X Xi n i=1 n = + 0: ! 1 1X Ui n i=1 n ! Theorem 65 Suppose that (Yi ; Xi ) i = 1; 2; : : : is i.i.d. Also suppose that E [Xi ] is nonsingular. Finally, suppose that E [Ui Ui0 ] exists and is …nite. Then, p n b ! N 0; (E [Xi ]) p 1X Xi n i=1 1 E [Ui Ui0 ] (E [Xi0 ]) 1 Proof. Follows easily from 52.1 n n b = ! 1 1 X p Ui n i=1 n ! Estimation of Asymptotic Variance De…nition 14 Given a matrix A, we de…ne kAk = p trace (A0 A). q De…nition 15 If a = (a1 ; : : : ; aq ) is a q-dimensional column vector, kak = a21 + 0 Lemma 35 kA + Bk + a2q . kAk + kBk Lemma 36 kA0 k = kAk Lemma 37 kABk kAk kBk bi = Yi Theorem 66 Suppose that b is consistent for . Let U E [Ui Ui0 ] in probability. 79 Xi b. Then, 1 n Pn i=1 bi U b0 ! U i Proof. Note that Xi b = Xi + Ui bi = Yi U and 0 bi U b 0 = Ui U 0 + Ui Xi b U i i Now note that Xi b = Ui + Xi b + Xi b Ui0 + Xi b 1X 1 X b b0 Ui Ui = Ui Ui0 n i=1 n i=1 n 0 n 1X Ui Xi b n i=1 n + 0 1X Xi b + n i=1 n Because 1X Ui Xi b n i=1 n and 0 Ui Xi b Xi b we have 1X Ui Xi b n i=1 n 1X Ui Xi b n i=1 n 0 0 kUi k kXi k b = kUi k Xi b 1X kUi k kXi k b n i=1 1X kUi k kXi k n i=1 n 0 Ui0 0 Xi b 0 Xi b kUi k 1X Xi b n i=1 n + n = ! b !0 in probability. The remaining two terms also converge to zero by similar reasoning. Theorem 67 probability. 52.2 1 n Pn i=1 Xi 1 1 n Pn i=1 bi U bi0 U 1 n Pn i=1 Xi0 1 ! (E [Xi ]) E¢ ciency by GMM Suppose that we are given a model that satis…es E Yi r 1 with r > q. Because there is no Xi r q q 1 =0 satisfying 1X (Yi n i=1 n Y X = 80 Xi ) = 0 1 E [Ui Ui0 ] (E [Xi0 ]) 1 in we will minimize Qn ( ) = Y X 0 Wn Y X and obtain 1 b = X 0 Wn X 0 X Wn Y 1 0 = + X Wn X 0 X Wn U or p Assume that n b = X Wn X 1 X p Ui n i=1 n 1 0 0 X Wn ! E [Xi ] = G0 plim Wn = W0 E [Ui Ui0 ] = S0 It is straightforward to show that p d n b ! N 0; (G00 W0 G0 ) 1 G00 W0 S0 W0 G0 (G00 W0 G0 ) 1 Given that Wn , hence W0 , can be chosen by the econometrician, we can ask what the optimal choice would be. 
It can be shown that, if we choose W0 = S0 1 , then the asymptotic variance is minimized. If W0 = S0 1 , then the asymptotic variance formula simpli…es to 1 G00 S0 1 G0 Remark 21 If you are curious, here’s the intuition. Consider the linear regression model in a matrix form y = G0 + u where G0 is nonstochastic and E [u] = 0 and E [uu0 ] = S0 . We can minimize G0 b)0 W0 (y (y G0 b) and obtain an estimator Because bb = (G0 W0 G0 ) 0 1 G00 W0 y bb = (G0 W0 G0 ) 1 G0 W0 (G0 + u) = + (G0 W0 G0 ) 0 0 0 we can see that bb is unbiased and has variance equal to (G00 W0 G0 ) 1 G00 W0 S0 W0 G0 (G00 W0 G0 ) 1 G00 W0 u 1 By the Gauss-Markov, this estimator cannot be as good as the GLS estimator that minimizes (y G0 b)0 S0 1 (y 81 G0 b) 53 Method of Moments Interpretation of OLS Suppose that yi = x0i + "i where E E [xi "i ] = 0 Then, we can derive x0i )] = E [xi yi 0 = E [xi "i ] = E [xi (yi xi x0i ] = E [Yi Xi ] with Yi = x i y i Xi = xi x0i Remark 22 It can be shown that ! 1 ! ! n n n 1X 1 X b b0 1X 0 Xi Ui Ui X n i=1 n i=1 n i=1 i 54 1 1X xi x0i n i=1 n = ! 1 1X xi x0i e2i n i=1 n ! 1X xi x0i n i=1 n Method of Moments Interpretation of IV The preceding discussion can be generalized to the following situation. Suppose that yi = xi + "i IF we can …nd zi such that E [zi "i ] = 0; the model can then be written as E [zi yi zi xi ] = 0 Write Yi zi yi ; Xi zi xi The method of moments estimator is then given by ! 1 ! P n n X X zi yi = bIV ; Xi Yi = P i z x i i i i=1 i=1 which is consistent for by the argument in the preceding subsection. Now assume that xi and zi are both q-dimensional, so that the model is given by yi = x0i + "i 82 ! 1 We then have Yi zi yi ; Xi zi x0i Therefore, we obtain n X Xi i=1 54.1 ! 1 n X i=1 Yi ! = n X zi x0i i=1 ! 1 n X zi yi i=1 ! = bIV Asymptotic Distribution As for its asymptotic distribution, we note that E [Ui Ui0 ] = E (zi "i ) (zi "i )0 = 2 "E [zi zi0 ] and E [Xi ] = E [zi x0i ] (if we assume that z and " are independent of each other.) We then have p 54.2 n bIV ! N 0; 2 " (E [zi x0i ]) 1 0 E [zi zi0 ] E (zi x0i ) 1 Overidenti…cation: E¢ ciency Consideration As before, we assume that the model is given by yi = x0i + "i Our restriction is E [zi "i ] = 0 The only di¤erence now is that we assume r = dim (zi ) > dim (xi ) = q This means that we cannot …nd any b such that 1X zi yi n i=1 n x0i b = 0 Given that equality is impossible, we do the next best thing: We minimize !0 ! n n 1X 1X 0 0 Qn ( ) = zi (yi xi ) Wn zi (yi xi ) n i=1 n i=1 for some weighting matrix Wn , which is potentially stochastic. 83 Example 8 Take 1X 0 zi zi n i Wn = ! 1 Then it can be shown that the solution is equal to 2 ! ! 1 X X 1 1 4 xi zi0 zi zi0 n i n i 2 ! ! X X =4 xi zi0 zi zi0 i i !3 X 1 zi x0i 5 n i !3 1 X zi x0i 5 1 1X xi zi0 n i 1 X i xi zi0 i h = X 0 Z (Z 0 Z) 1 ! ! 1X 0 zi zi n i X zi zi0 i Z 0X i 1 ! ! 1 1 1X zi yi n i X zi yi i X 0 Z (Z 0 Z) 1 ! ! Z 0 y = b2SLS According to the previous discussion, if we are to develop an optimal GMM, we need to use the weight matrix that converges to the inverse of S0 = E (zi "i ) (zi "i )0 = E "2i zi zi0 Suppose that "i happens to be independent of zi . If this were the case, we have S0 = E "2i zi zi0 = E "2i E [zi zi0 ] = 2 "E [zi zi0 ] which can be estimated consistently by Therefore, we would want to minimize 1X zi (yi n i=1 n !0 x0i ) 1X 0 b"2 zi zi n i ! = b" 1 1X 0 zi zi n i 1X zi (yi n i=1 !0 1X zi (yi n i=1 n 2 b"2 n x0i ) ! x0i ) 1X 0 zi zi n i ! 1 1X zi (yi n i=1 n x0i ! ) Note that the b" 2 does not a¤ect the minimization. 
Therefore, the GMM estimator e¤ectively minimizes !0 ! 1 ! n n X X 1X 1 1 zi (yi x0i ) zi zi0 zi (yi x0i ) n i=1 n i n i=1 This is 2SLS! 84 55 GMM - Nonlinear Case Suppose that we are given a model E [h (wi ; )] = 0 where h is a r 1 vector, and is a q 1 vector with r > q. Method of moments estimation is P impossible because there is in general no that solves n1 ni=1 h (Wi ; ) = 0. We minimize !0 ! ! ! X X X X 1 1 1 1 Qn ( ) = h (wi ; ) Wn h (wi ; ) = h (wi ; )0 Wn h (wi ; ) n i n i n i n i instead. We will derive the asymptotic distribution under the assumption that b is consistent. 55.1 Asymptotic Distribution The FOC is given by 1 0 0 b C B 1 X @h wi ; 0=@ A Wn n i @ 1X h wi ; b n i ! 10 b @h w ; X i 1 A Wn =@ n i @ 0 0 1X h wi ; b n i ! Using the mean value theorem, write the last term as 1 0 e @h w ; i 1X 1X 1X A b h wi ; b = h (wi ; ) + @ n i n i n i @ 0 We then have 0 1 X @h @ 0= n i 0 1 X @h +@ n i from which we obtain p n b = 0 wi ; b @ 0 wi ; b @ 0 10 A Wn 10 ! 1X h (wi ; ) n i 1 e @h w ; X i A Wn @ 1 A b n i @ 0 0 20 10 0 13 b e X @h wi ; 6@ 1 X @h wi ; A @1 A7 W 4 5 n 0 0 n i @ n i @ 10 b @h w ; X i @1 A Wn n i @ 0 85 ! 1 X p h (wi ; ) n i 1 (6) Assuming that b 1 X @h wi ; n i @ 0 e 1 X @h wi ; n i @ 0 1 X @h (wi ; ) p !0 n i @ 0 1 X @h (wi ; ) p !0 n i @ 0 we can infer that 0 10 1 0 b e @h w ; @h w ; X X i i p @1 A Wn @ 1 A! G00 W0 G0 n i @ 0 n i @ 0 where G0 = E @h (wi ; ) @ 0 W0 = plim Wn It follows that 20 10 0 13 0 10 b e b @h w ; @h w ; @h w ; X X X i i i 1 p 6@ 1 A Wn @ 1 A7 A Wn ! (G00 W0 G0 ) 4 5@ 0 0 0 n i @ n i @ n i @ 1 G00 W0 (7) Because E [h (wi ; )] = 0, we have by the CLT that 1 X d p h (wi ; ) ! N (0; S0 ) n i (8) where S0 = E h (wi ; ) h (wi ; )0 It follows that 0 10 b @h w ; X i @1 A Wn n i @ 0 ! 1 X d p h (wi ; ) ! N (0; S0 ) n i Combining (6) –(8), we conclude that p n b d ! N 0; (G00 W0 G0 ) 86 1 G00 W0 S0 W0 G0 (G00 W0 G0 ) 1 55.2 Optimal Weight Matrix Given that Wn , hence W0 , can be chosen by the econometrician, we can ask what the optimal choice would be. It can be shown that, if we choose W0 = S0 1 , then the asymptotic variance is minimized. If W0 = S0 1 , then the asymptotic variance formula simpli…es to 1 G00 S0 1 G0 55.3 Two Step Estimation How do we actually implement the above idea? The trick is to recognize that the asymptotic variance only depends on the probability limit of Wn and that Wn is allowed to be stochastic. Let’s assume that there is a consistent estimator . We can then see that S0 can be estimated by noting 1X 1X S0 = E h (wi ; ) h (wi ; )0 = plim h (wi ; ) h (wi ; )0 = plim h (wi ; ) h (wi ; )0 n i n i Remark 23 The last equality requires some justi…cation, i.e., we need to show that 1X 1X p h (wi ; ) h (wi ; )0 h (wi ; ) h (wi ; )0 ! 0 n i n i We have seen in the discussion of MLE how this can be done. Therefore, if we choose our weight matrix to be Wn = 1X h (wi ; n i ) h (wi ; 0 ) ! 1 then we are all set. Question is where we …nd such a . We usually …nd it by the preliminary GMM, that minimizes ! ! ! ! X X X 1X 1 1 1 h (wi ; )0 In h (wi ; ) = h (wi ; )0 h (wi ; ) n i n i n i n i although we can choose any other weight matrix. So, here’s the summary: 1. Minimize 1X h (wi ; )0 n i ! An ! 1X h (wi ; ) n i for arbitrary positive de…nite An . Call the minimizer Wn = 1X h (wi ; n i 87 ) h (wi ; , and let ! 1 )0 2. Minimize 55.4 1X h (wi ; )0 n i ! ! 1X h (wi ; ) n i Wn Estimation of Asymptotic Variance Suppose that we estimated the optimal GMM estimator b. How do we estimate the asymptotic variance? 
Noting that the asymptotic variance is G00 S0 1 G0 1 b0 Sb 1 G b G 1 we can estimate it by where b 1 X @h wi ; b G= n i @ 0 1X Sb = h wi ; b h wi ; b n i 88 0 Lecture Note 10: Hypothesis Test 56 Elementary Decision Theory We are given a statistical model with an observation vector X whose distribution depends on a parameter , which would be understood to be a speci…cation of the true state of nature. The ranges over a known parameter space . The decision maker has available a set A of actions, which is called the action space. For example, A = f0; 1g where 0 is “accept H0 ”and 1 is “reject H0 ”. We are given a loss function l ( ; a). In the case of testing, we may take l ( ; a) = 0 if the decision is correct and l ( ; a) = 1 if it is incorect. We de…ne a decision rule to be a mapping from the sample space to A. If X is observed, then we take the action (X). Our loss is then the random variable l ( ; (X)). We de…ne the risk to be Z R ( ; ) = E [l ( ; (X))] = l ( ; (x)) f (xj ) dx where f (xj ) denotes the pdf of X. A procedure improves a procedure 0 i¤ R ( ; ) R ( ; 0 ) for all (with strict inequality for some ). If is admissible, then no other 0 improves . We may want to choose such that the worst possible risk sup R ( ; ) is minimized. If is such that sup R ( ; ) = inf sup R ( ; ) then the is called the minimax procedure. The Bayes risk is Z r( ) = R( ; ) ( )d R starts from weights ( ) such that ( ) d = 1. Viewing ( ), we can write r ( ) = E [R ( ; )] as a random variable with pdf A Bayesian would try to minimize the Bayes risk. Because Z Z Z r( ) = R( ; ) ( )d = l ( ; (x)) f (xj ) ( ) dxd Z Z f (xj ) ( ) = l ( ; (x)) d f (x) dx f (x) where f (x) = Z f (xj ) ( ) d 89 we can write r ( ) = E [r ( j X)] where r ( j x) = Z f (xj ) ( ) d = l ( ; (x)) f (x) = E [ l ( ; (x))j X = x] Z l ( ; (x)) f ( j x) d Therefore, if we choose (x) such that r ( j x) is minimized, then the Bayes risk is minimized. (If is discrete, we should replace the integral by summation.) In a very simple testing context, the Bayes procedure with 0-1 loss would look like the following. Let a0 denote the action of accepting H0 , and let a1 denote the action of accepting H1 . Then the Bayes rule says that we should choose a0 if r (a0 j x) < r (a1 j x), and a1 if r (a1 j x) < r (a0 j x). Let denote f 0 ; 1 g, where 0 is “H0 is correct”and 1 is “H1 is correct”. Because X X f (xj j ) ( j ) r ( j x) = l ( j ; (x)) f ( j j x) = l ( j ; (x)) f (x) j=0;1 j=0;1 = l ( 0 ; (x)) f (xj 0) ( 0 ) + l ( 1 ; (x)) f (xj f (x) 1) ( 1) we have r (a0 j x) < r (a1 j x) if and only if l ( 0 ; a0 ) f (xj 0) ( 0 ) + l ( 1 ; a0 ) f (xj f (x) 0) ( 0 )+l ( 1 ; a0 ) f (xj 1) ( 1) < l ( 0 ; a1 ) f (xj 0) ( 0 ) + l ( 1 ; a1 ) f (xj f (x) 1) ( 1) or l ( 0 ; a0 ) f (xj 1) ( 1 ) < l ( 0 ; a1 ) f (xj 0) ( 0 )+l ( 1 ; a1 ) f (xj 1) ( 1) or f (xj or 57 1) ( 1 ) < f (xj f (xj f (xj 0) ( 0) 1) ( 1) <1 0) ( 0) Tests of Statistical Hypothesis Example 9 Assume that X1 ; : : : ; Xn N ( ; 10). We are going to consider the null hypothesis H0 : 2 M0 against the alternative H1 : 2 M1 , where M0 and M1 are some subsets of the one dimensional Euclidean space. We believe that 2 M0 or 2 M1 . We can either accept the null or reject the null. When we reject the null even when the null is true, we are making a Type I error. When we accept the null even when the alternative is true, we are making a Type II error. Our objective is to make a decision in such a way as to minimize the probability of either error. 90 Our decision will hinge on the realization of X1 ; : : : ; Xn . 
Suppose that we are going to reject the null if (X1 ; : : : ; Xn ) 2 C, which implicitly will lead us to accept the alternative. The set C is sometimes called the critical region. De…nition 16 A Statistical Hypothesis is an assertation about the distribtion of one or more random variables. If the statistical hypothesis completely speci…es the distribution, it is called a simple statistical hypothesis; if it does not, it is called a composite statistical hypothesis. Example 10 Assume that X1 ; : : : ; Xn N ( ; 10). A hypothesis H0 : hypothesis, whereas H0 : 75 is a composite hypothesis. = 75 is a simple De…nition 17 A test of a statistical hypothesis is a rule which, when the experimental sample values have been obtained, leads to a decision to accept or to reject the hypothesis under consideration. De…nition 18 Let C be that subset of the sample space which, in accordance witha prescribed test, leads to the rejection of the hypothesis under consideration. Then C is called the critical region of the test. De…nition 19 The power function of a test of a statistical hypothsis H0 against H1 is the function, de…ned for all distributions under consideration, which yields the probability that the sample point falls in the critical region C of the test, that is, a function that yields the probability of rejecting the hypothesis under consideration. The value of the power function at a parameter point is called the power of the test at that point. De…nition 20 Let H0 denote a hypothesis that is to be tested against an alternative H1 in accordance with a prescribed test. The signi…cance level of the test is the maximum value of the power function of the test when H0 is true. Example 11 Suppose that n = 10. Suppose that H0 : 75 and H1 : > 75. Notice that we are dealing with composite hypotheses. Suppose that we adopted a test procedure in which we reject the null only when X > 76:645 In this test, the critical region C equals (X1 ; : : : ; X10 ) : X > 76:645 The power function of the test is p( ) = P =1 X > 76:645 = Pr [N ( ; 1) > 76:645] = Pr [Z > 76:645 (76:645 ] ); where Z N (0; 1) and is the c.d.f. of N (0; 1). Notice that this power function is increasing in . In the set f : 75g, the power function is maximized when = 75, at which it equals 1 (1:645) = :05 Thus, the signi…cance level of the test equals 5%. 91 58 Additional Comments about Statistical Tests Suppose that H0 : = 75. If the alternative takes the form H1 : > 75 (or H1 : < 75 ), we call such an alternative a one-sided hypothesis. If instead, if takes the form H1 : 6= 75, we call it a two-sided hypothesis. Example 12 Now suppose that the pair of hypotheses is H0 : = 0 vs. H1 : that 2 and n are known. A commonly used test rejects the null if and only if X p 0 = n > 0. Assume c for some c, where c is chosen such that the signi…cance level of the test equals some prechosen value. Assume that it has been decided to set the signi…cance level at 5%, a common choice. Noting that X p 0 N (0; 1) ; = n it su¢ ces to …nd c such that Pr [Z c] = 5% where Z is a standard normal random varaible. From the normal distribution table, it can easily be seen that we want to set c = 1:645: Example 13 What would happen if we now have H1 : 6= 0 . A common test rejects the null if and only if X p 0 c; = n where c is again chosen so as to have the signi…cance level of the test equal to some prechosen value. 
Assuming that we want to have the signi…cance level of the test equal to 5%, it can easily be seen that we want to set c = 1:96: Example 14 What would happen if we now have the same two sided alternative, but do not know 2 ? A common procedure is to reject the null i¤ X p 0 s= n c; where s is the sample standard deviation: s2 = Because the ratio has a t (n t (n 1) distribution. 1 n 1 X Xi X 2 : i 1) distribution, we want to set c equal to 97.5th percentile of the 92 59 Certain Best Tests Remark 24 The term “test” and “critical region” can be used interchangeably: a test speci…es a critical region; but it can also be said that a choice of a critical region de…nes a test. Let f (x; ) denote the p.d.f. of a random vector X. Consider the two hypothesis H0 : vs. H1 : = 1 . We have = f 0 ; 1 g. = 0 De…nition 21 Let C denote the subset of the sample space. Then C is called the best critical region of size for testing H0 against H1 if, for every subset A of the sample space such that Pr [X 2 A; H0 ] = ; (a) Pr [X 2 C; H0 ] = ; (b) Pr [X 2 C; H1 ] Pr [X 2 A; H1 ]. In e¤ect, the best critical region maximizes the power of the test while keeping the signi…cance level of the test equal to . Theorem 68 (Neyman-Pearson Theorem) Let X denote a random vector with p.d.f. f (x; ) = L ( ; x). Assume that H0 : = 0 and H1 : = 1 . Let C L ( 1 ; x) L ( 0 ; x) x: k ; where k is chosen in such a way that P [X 2 C; H0 ] = : Then, C is the best critical region of size for testing H0 against the alternative H1 . Proof. Assume that A is another critical region of size . We want to show that Z Z f (x; 1 ) dx: f (x; 1 ) dx A C By using the familiar indicator function notation, we can rewrite the inequality as Z (IC (x) IA (x)) f (x; 1 ) dx 0: It su¢ ces to show that (IC (x) IA (x)) f (x; 1) k (IC (x) Notice that, if it (9) holds, it follows that Z (IC (x) IA (x)) f (x; 1 ) dx But since Z (IC (x) IA (x)) f (x; 0 ) dx = Z k Z f (x; C IA (x)) f (x; (IC (x) 0 ) dx IA (x)) f (x; Z A 93 f (x; 0 ) dx (9) 0) : = 0 ) dx: = 0; we have the desired conclusion. Notice that IC obviously have (IC (x) When IC IA = 1, we have f (x; (IC (x) When IC IA (x)) f (x; IA = 1) k (IC (x) kf (x; IA (x)) f (x; 1, we have f (x; (IC (x) 1) 1) IA equals 1, 0, or -1. When IC 0) 1) 0) : IA (x)) f (x; 0) : so that k (IC (x) < kf (x; IA (x)) f (x; IA = 0, we so that 0) IA (x)) f (x; 1) k (IC (x) IA (x)) f (x; 1) : IA (x)) f (x; 1) k (IC (x) IA (x)) f (x; 0) : We thus conclude that (IC (x) Example 15 Consider X1 ; : : : ; Xn i.i.d. N ( ; H0 : = 0 vs. H1 : = 1 with 0 < 1 . Now, p 1= 2 L ( 1 ; x1 ; : : : ; x n ) = p L ( 0 ; x1 ; : : : ; x n ) 1= 2 " = exp 2 2 X The best critical region C takes the form " ! X exp xi ( 1 i 2 ). We assume that P n exp[ i P n exp[ ! xi ( 0) = i 2 (xi 2 1) =2 2 ] (xi 2 0) =2 2 ] 0) = 1 n 2 2 0 i 2 n 2 1 =2 2 0 2 # 2 1 is known. We have =2 2 # : k for some k. In other words, C takes the form x c for some c. We can …nd the value of c easily from the standard normal distribution table. 60 Uniformly Most Powerful Test In this section, we consider the problem of testing a simple null against a composite alternative. Note that a composite hypothesis may be viewed as a collection of simple hypotheses. 94 De…nition 22 The critical region C is a uniformly most powerful critical region of size for testing a simple H0 against a composite H1 if the set C is a best critical region of size for testing H0 against each simple hypothesis in H1 . 
A Test de…ned by this critical region is called a uniformly most powerful test with signi…cance level . Example 16 Assume that X1 ; : : : ; Xn are i.i.d. N (0; ) random variables. We want to test H0 : = 0 against H1 : > 0 . We …rst consider a simple alternative H1 : = 00 where 00 > 0 . The best critical region takes the form k p 1= 2 p 1= 2 00 0 P n exp [ i x2i =2 00 ] P n = 2 0 i xi =2 ] exp [ n=2 0 exp 00 " 00 2 0 00 0 X i x2i # P In other words, the best critical region takes the form i x2i c for some c, which is determined by the size of the test. Notice that the same argument holds for any 00 > 0 . It thus follows P that i x2i c is the uniformly most powerful test of H0 against H1 ! Example 17 Let X1 ; : : : ; Xn i.i.d. N ( ; 1). There exists no uniformly most powerful test of H0 : = 0 against H1 : 6= 0 . Consider 00 6= 0 . The best critical region for testing = 0 against = 00 takes the form p P n 00 2 1= 2 exp ) =2 i (xi p k P n 0 )2 =2 (x exp 1= 2 i i or " exp ( Thus, when 00 00 0 ) X 00 2 xi n ( ) 0 2 ( ) # =2 i k > 0 , the best critical region takes the form X xi c; i and when 00 < 0 , it takes the form X xi c: i It thus follows that there exists no uniformly most powerful test. 61 Likelihood Ratio Test We can intuitively modify and extend the notion of using the ratio of the likelihood to provide a method of constructing a test of composite null against composite alternative, or of constructing a test of a simple null against some composite alternative where no uniformly most powerful test exists. 95 Idea: Suppose that X f (x; ) = L ( ; x). The random vector X is not necessarily one dimensional, and the parameter is not necessarily one dimensional, either. Suppose we are given two sets of parameter ! and where ! . We are given 2 !; H0 : 2 H1 : ! The likelihood ratio test is based on the ratio sup sup L ( ; x) 2! L ( ; x) 2 If this ratio is bigger than k, say, the null is rejected. Otherwise, we do not reject the null. The number k is chosen in such a way that the size of the test equals some prechosen value, say . Example 18 Suppose that X1 ; : : : ; Xn are i.i.d. N ( ; 2 ). We do not know is positive. We have H0 : = 0; H1 : 6= 0: 2 except that it Formally, we can write != 2 ; : 2 = 0; >0 and = ; 2 1< : 2 < 1; >0 : Now, we have to calculate sup L ( ; ; 2 ; x1 ; : : : ; xn = sup 2 )2! ( ; 2 )2! 2 2 and sup L ( ; ; 2 ; x1 ; : : : ; x n = 2 )2 sup ( ; 2 )2 exp " P n=2 1 2 exp " P n=2 1 2 i (xi 2 i (xi 2 In these calculations, we can consider maximizing log L = instead. For (10), we set n log (2 ) 2 n log 2 2 2 1 X 2 = 0 and di¤erentiate with respect to n 1 X 2 + x =0 2 2 2 2 i i 96 (xi )2 i 2 obtaining )2 # ; (10) )2 # : (11) 2 2 We thus have sup L 2 ; 1 ; x1 ; : : : ; xn = L 0; n 2 )2! ( ; X x2i ; x1 ; : : : ; xn i = 1 P 2 = i e P 2 i P 2 x P i 2i 2 i xi =n n=2 exp x2i =n n=2 1 : x2i =n For (11), we set the partial derivatives with respect to 1 X (xi ) = 0; 2 2 and ! equal to zero: i and n 1 X + (xi 2 2 2 2 i We then obtain sup L ( ; ; 2 ; x1 ; : : : ; xn = L x; n )2 = 0: 1 2 )2 X (xi x)2 ; x1 ; : : : ; xn i e 1 P 2 x)2 =n i (xi = Thus, the likelihood ratio test would be based on !n=2 P 2 =n x = P i i 2 : x) =n i (xi !n=2 ! : It is important to note that the numerator and the denominator are constrained mle’s of 2 . This important observation will be used often in the classical linear regression hypothesis testing setup. The likelihood ratio test would reject the null i¤ is bigger than certain threshhold. It is equivalent to the rejection when P 2 x =n P i i 2 x) =n i (xi is big. 
Because X X x2i =n = (xi x)2 =n + x2 ; i i the test is equivalent to the rejection when qP i is big, which has a jt (n x (xi x)2 = (n 1)j distribution! 97 1) 62 Asymptotic Tests Suppose that we have Xi i:i:d: f (x; ), where dim ( ) = k. Let b = argmax c n X log f (Xi ; c) i=1 denote the MLE. We would like to test H0 : = against H1 : 6= . The …rst test is Wald Test. Tentatively assume that k = 1. Recall that p It follows that we should have d n b ! N 0; I ( ) n b I( ) 2 d ! 1 2 (1) In most cases, I ( ) is a continuous function in , so I b 1 will be consistent for I ( ) 1 , and 2 n b Under the null, we should have 1 1 I b ! d 2 (1) d 2 (1) 2 n b ! 1 I b When k is an arbitrary number, we have the generalization n b 0 d b I b 2 ! (k) The Likelihood Ratio Test is based on the result that 2 n X i=1 log f Xi ; b n X ! ! d 2 (k) ! ! d 2 (k) ! ! d 2 (k) log f (Xi ; ) i=1 Under the null, we should have 2 n X i=1 log f Xi ; b n X log f (Xi ; i=1 The Score Test is based on the result that !0 n 1 X s (Xi ; ) I ( ) 1 n i=1 98 n X i=1 ) s (Xi ; ) so that n X 1 n !0 s (Xi ; ) i=1 n X 1 I( ) ! s (Xi ; ) i=1 d ! 2 (k) This test is sometimes called the Lagrange Multiplier Test because, if we want to maximize Pn = , and if we use the Lagrangian i=1 log f (Xi ; ) subject to the constraint n X 0 log f (Xi ; ) ( ) i=1 the …rst order conditions are n X s (Xi ; ) = i=1 so = = n X s (Xi ; ) i=1 63 63.1 Some Details Wald Test of Linear Hypothesis p Let b be such that n b We are interested in testing where R is m k with m n Rb r 0 h d ! N (0; V ) for some V , which is consistently estimated by Vb . H0 : R r=0 HA : R r 6= 0 k. The test statistic is then given by RVb R0 i 1 Rb r = Rb Its asymptotic distribution under the null is 63.2 2 r 0 (m). Wald Test of Noninear Hypothesis We are interested in testing H0 : h ( ) = 0 HA : h ( ) 6= 0 99 R 1b V n 1 R0 Rb r where dim (h) = m k. The test statistic is then given by i 1 0h 0 bVb R b0 b 1 Vb nh b R h b =h b R n b is a consistent estimator of where R 1 b0 R h b @h ( ) : @ 0 Its asymptotic distribution under the null is 2 (m). Common sense suggests that we can take R= b= R @h b @ 0 p The idea can be understood by Delta method: Because n b have p d n h b h ( ) ! N (0; RV R0 ) and therefore, we should have p under H0 . 63.3 d ! N (0; V ), we should d nh b ! N (0; RV R0 ) LR Test We now assume that b is MLE, i.e., V = I ( ) 1 . We are interested in testing H0 : h ( ) = 0 vs HA : h ( ) 6= 0, where dim (h) = m k. We assume that H0 can be equivalently written H0 : = g ( ) for some , where dim ( ) = k m = dim ( ) dim (h). P The LR test requires calculation of restricted MLE e, i.e., the maximizer of ni=1 log f (Xi ; c) subject to the restriction h (c) = 0. Given the alternative characterization, it su¢ ces to maxiP mize ni=1 log f (Xi ; g (a)) without any restriction. Let e = arg max e = g (e) We now calculate the LR test statistic by 2 n X i=1 log f Xi ; b We note that n X log f (Xi ; n X i=1 0) log f Xi ; e = i=1 n X i=1 + 1 2 n X ! log f (Xi ; g (a)) i=1 =2 n X i=1 log f Xi ; b n X log f (Xi ; g (e)) i=1 1 0 n @ log f X ; b X i A log f Xi ; b + @ 0 @ i=1 ! n X @ 2 log f (Xi ; ) 0 b b 0 0 0 @ @ i=1 100 ! 0 b for some in between zero. We therefore have 2 n X i=1 0 n X log f Xi ; b and b. Because of the obvious FOC, the second term on the right is log f (Xi ; ! 0) i=1 = b 0 = p = p = p 2 n X log f (Xi ; g (e)) i=1 log f (Xi ; g ( J( 0) ! 
= @ log f (Xi ; g ( @ =E p 0 n (e 0 )) @ b 0) 0 (J ( log f (Xi ; g ( @ 0 0 )) p n (e 0 )) But because @ log f (Xi ; g ( @ 0 0 )) = @ log f (Xi ; g ( @ 0 0 )) @g ( @ 0) 0 @ log f (Xi ; g ( @ 0 we have J ( 0 ) = G0 I ( 0 ) G. Now recall that p Similarly, we have p n (e n b 0) 0 = I ( 0) 1 1 X p s (Xi ; n 0) + op (1) 1 X 0 @ log f (Xi ; g ( 0 )) p G + op (1) @ n 1 X 1 = [G0 I ( 0 ) G] G0 p s (Xi ; 0 ) + op (1) n =J( 0) 1 101 0 ! n p 1 X @ 2 log f (Xi ; ) n b 0 n i=1 @ @ 0 p n b (I ( 0 ) + op (1)) 0 0 p n b + op (1) I ( 0) 0 0 n b ! 0 0 n b 0 )) i=1 where n b ) 0 @ @ i=1 Likewise, we have n X n X @ 2 log f (Xi ; 0 0 )) G 0) +op (1) 0 Therefore, we conclude that 2 n X i=1 = = = where log f Xi ; b 1 X p s (Xi ; 0 ) n 1 X p s (Xi ; 0 ) n 1 X p s (Xi ; 0 ) n 1 X p s (Xi ; 0 ) n n X i=1 0 log f Xi ; e ! 1 X p s (Xi ; 0 ) n 0 1 X 1 G [G0 I ( 0 ) G] G0 p s (Xi ; n I ( 0) 0 0 1 I ( 0) h 1 1 G [G0 G] 1 X p s (Xi ; n and we write 1 G [G0 I ( 0 ) G] 1 0) G0 i G0 0) + op (1) 1 X p s (Xi ; n 1 X p s (Xi ; n 0) 0) + op (1) + op (1) d ! N (0; I ( 0 )) = I ( 0 ) to avoid confusion with an identity matrix. It follows that ! n n X X d ! 2 (m) 2 log f Xi ; b log f Xi ; e i=1 i=1 An intuition can be given in the following way. Now consider u0 1 G [G0 G] 1 G0 u where u N (0; ). Let T be a square matrix such that T T 0 = , and write " Note that we can without loss of generality write u = T ". We then have u0 1 where X = T 0 G is a k G [G0 G] (k 1 G 0 u = "0 T 0 1 G [G0 G] = "0 T 0 1 = "0 Ik X [X 0 X] T 1 G [G0 G] 102 1 G0 u G0 T " T 0 G [G0 T T 0 G] 1 X0 " m) matrix. It follows that u0 1 2 (m) 1 G0 T " N (0; Ik ). 63.4 LM Test The score test is based on 1 X 1 X p s Xi ; e = p s (Xi ; n i=1 n i=1 n 1 X @s (Xi ; n i=1 @ 0 n 1 X s (Xi ; =p n i=1 n 0) + n 1 X =p s (Xi ; n i=1 ! p 0) p I ( 0 ) n (g (e) g( 0) p I ( 0 ) G n (e 0) 0) I ( 0 ) G [G0 I ( 0 ) G] n 1 X s (Xi ; =p n i=1 ) n e 0 )) + op (1) + op (1) 1 X p s (Xi ; n n h G [G0 G] = I 1 G0 i 1 X p s (Xi ; n i=1 n 0 1 G0 ! 0) 0) + op (1) + op (1) It follows that 1 X p s Xi ; e n i=1 n 1 X p s (Xi ; n i=1 n = 1 X p s (Xi ; n i=1 n = !0 1 X p s Xi ; e n i=1 n 1 !0 0) !0 0) h G [G0 G] I h 1 1 G0 G [G0 G] 1 ! i G0 i 1 h 1 G [G0 G] I 1 X p s (Xi ; n i=1 n G0 ! 0) i 1 X p s (Xi ; n i=1 n ! 0) + op (1) It follows that the score test is asymptotically equivalent to the LR test. 64 Problem Set 1. Let X1 ; : : : ; X10 be a random sample from N (0; 2 ). Find a best critical region of size = 5% for testing H0 : 2 = 1 against H1 : 2 = 2. Is this a best critical region for testing against H1 : 2 = 4? Against H1 : 2 = 12 > 1? 2. Let X1 ; : : : ; Xn be a random sample from a distribution with pdf f (x) = x 1 , 0 < x < 1, zero elsewhere. Show that the best critical region for testing H0 : = 1 against H1 : = 2 takes the form ( ) n Y (x1 ; : : : ; xn ) : xi c i=1 3. Let X1 ; : : : ; Xn be a random sample from N ( ; 100). Find a best critical region of size = 5% for testing H0 : = 75 against H1 : = 78. 103 + op (1) 65 Test of Overidentifying Restrictions Suppose that we are given a model E [h (wi ; )] = 0 where h is a r 1 vector, and is a q 1 vector with r > q. We ask if there is such a begin with. To be more precise, our null hypothesis is H0 : There exists some such that E [h (wi ; )] = 0 In order to test this hypothesis, a reasonable test statistic can be based on !0 ! X X 1 1 h wi ; b Wn h wi ; b Qn b = n i n i p where Wn ! S0 1 and b solves min Qn ( ) = !0 1X h (wi ; ) Wn n i ! 1X h (wi ; ) n i for the same Wn . It can be shown that d nQn b ! 
2 r q The proof is rather lengthy, and consists of several steps: Step 1 Because S0 is a positive de…nite matrix, there exists T0 such that T0 S0 T00 = Ir Observe that S0 = (T0 ) 1 (T00 ) 1 = (T00 T0 ) 1 ) S0 1 = T00 T0 Now, write nQn b = = = !0 ! 1 X 1 X p h (wi ; ) Wn p h (wi ; ) n i n i !0 ! 1 X 1 X 0 0 1 1 p h (wi ; ) T0 (T0 ) Wn T0 T0 p h (wi ; ) n i n i !0 ! X 1 X 1 1 T0 p h (wi ; ) (T00 ) Wn T0 1 T0 p h (wi ; ) n i n i 104 to Note that 1 (T00 ) Wn T0 1 p ! (T00 ) 1 S0 1 T0 1 = Ir Therefore, we can see that the nQn b has the same limit distribution as !0 1 X T0 p h (wi ; ) n i ! 1 X T0 p h (wi ; ) n i Step 2 In general, we learned that 20 10 13 0 b e @h w ; @h w ; X i i p 6 1X A Wn @ 1 A7 n b = 4@ 5 0 0 n i @ n i @ 10 b @h w ; X i A Wn @1 n i @ 0 0 1 ! 1 X p h (wi ; ) n i 20 10 0 13 0 10 b e b @h w ; @h w ; @h w ; X X X i i i 1 p 6@ 1 A Wn @ 1 A7 A Wn ! (G00 W0 G0 ) 5@ 4 0 0 0 n i @ n i @ n i @ and p p n n b = n ! 1 X p h (wi ; ) n i 20 10 0 13 0 10 b e b @h w ; @h w ; @h w ; X X X i i i 1 6 1 A Wn @ 1 A7 A Wn = 4@ 5@ 0 0 n i @ n i @ n i @ 0 ! G00 S0 1 G0 G00 W0 1 X d p h (wi ; ) ! N (0; S0 ) n i Using that Wn ! S0 1 , we can conclude that where 1 1 G00 S0 1 in probability. Step 3 We now consider the distribution of 1 0 e @h w ; X X X i 1 1 1 A b p h wi ; b = p h (wi ; ) + @ p @ 0 n i n i n i 0 1 e @h w ; X X i p 1 1 A n b =p h (wi ; ) + @ 0 n i @ n i 105 (12) As before, we will assume that e 1 X @h wi ; n i @ 0 1 X @h (wi ; ) p !0 n i @ 0 which will imply that e 1 X @h wi ; p ! G0 0 n i @ Combined with (12), we can conclude that 1 X p h wi ; b = [I n i ! 1 X p h (wi ; ) n i n] such that 0 0 1 20 10 13 0 10 e b e b @h w ; @h w ; @h w ; @h w ; X X X X i i i i 1 1 @1 A6 A Wn @ 1 A7 A Wn 4@ 5@ n = 0 0 0 n i @ n i @ n i @ n i @ 0 ! G0 G00 S0 1 G0 1 G00 S0 1 in probability. Write 0 1 = G0 G00 S0 1 G0 G00 S0 1 By Slutsky, we conclude that 1 X d h wi ; b ! N (0; T0 (I T0 p n i Because T0 (I 0 ) S0 (I 0 0 0 ) T0 = T0 S0 T00 = Ir T0 0 0 S0 T0 T0 G0 G00 S0 1 G0 0 ) S0 T0 S0 1 + T0 G0 G00 S0 1 G0 = Ir 1 106 0 S0 0 0 0 T0 G00 T00 1 G00 T00 1 T0 G0 G00 S0 1 G0 G00 T00 1 G00 T00 G00 T00 T0 G0 G00 S0 1 G0 we can further conclude that 1 X d h wi ; b ! N 0; Ir T0 p n i + T0 G00 S0 1 S0 S0 1 G0 G00 S0 1 G0 T0 G0 G00 S0 1 G0 + T0 G0 G00 S0 1 G0 = Ir 1 0 0 0 T0 G00 S0 1 S0 T00 1 T0 S0 S0 1 G0 G00 S0 1 G0 0 0 0 ) T0 ) (I 1 G00 T00 T0 G0 G00 S0 1 G0 1 G00 T00 (13) Step 4 This step will be taken care of by analogy. We now imagine a GLS model y = G0 + " where " N (0; S0 ). Because T0 S0 T00 = Ir the transformed model y =G , +" T0 y = T0 G0 + T0 " is such that " N (0; Ir ). The residual vector e = Ir G (G )0 G 1 (G )0 then has the distribution N 0; Ir G 1 (G )0 G (G )0 = N 0; Ir T0 G0 (T0 G0 )0 T0 G0 1 (T0 G0 )0 Because Ir G (G )0 G 1 (G )0 = Ir 1 T0 G0 (T0 G0 )0 T0 G0 (T0 G0 )0 = Ir T0 G0 (G00 T00 T0 G0 ) 1 G00 T00 = Ir T0 G0 G00 S0 1 G0 1 G00 T00 we conclude that e N 0; Ir T0 G0 G00 S0 1 G0 1 G00 T00 Now compare (14) with (13). We can see that the limit distribution of T0 p1n (14) P i h wi ; b is identical to the distribution of e . Also recall that nQn b has the same limit distribution as !0 1 X T0 p h (wi ; ) n i ! 1 X T0 p h (wi ; ) n i which suggests that the limit distribution of nQn b should be identical to the distribution of (e )0 e . But we already know that (e )0 e 66 2 r q Hausman Test of OLS vs IV Suppose yi = xi + "i ; (i = 1; : : : ; n): 107 Also suppose that we observe zi such that E [zi "i ] = 0. Either H0 : E [xi "i ] = 0 or HA : E [xi "i ] 6= 0. 
Under H0 , both OLS bOLS and IV bIV are consistent: H0 : plim bOLS = plim bIV Furthermore, bOLS is e¢ cient. Under HA , only …xed e¤ects estimator are consistent: HA : plim bOLS 6= plim bIV Therefore, a reasonable test of H0 can be based on the di¤erence bOLS bIV . If the di¤erence is large, there is an evidence supporting HA . If small, there is an evidence supporting H0 . For p this purpose, we need to establish the asymptotic distribution of n bOLS bIV under the null. Under H0 , we would have p p In general, because n bOLS p n bIV n bOLS Vara bOLS bIV ! N 0; Vara bOLS bIV ! N 0; Vara bIV ! N 0; Vara bOLS = Vara bOLS + Vara bIV bIV 2 Cov a bOLS ; bIV ; we need to …gure out Cov a bOLS ; bIV , to implement such a test. Hausman’s intuition is that, if bOLS is e¢ cient under H0 , then we must have Cov a bOLS ; bIV = Vara bOLS : Suppose otherwise. Consider an arbitrary linear combination bOLS + (1 ) bIV ; whose asymptotic distribution is given by p n bOLS + (1 ) bIV ! N 0; 2 Vara bOLS + (1 )2 Vara bIV + 2 (1 Consider minimizing the asymptotic variance. First order condition is given by or Vara bOLS 1 = (1 ) Vara bIV + (1 Vara bOLS Cov a bOLS ; bIV Vara bOLS + Vara bIV 108 2 ) Cov a bOLS ; bIV 2 Cov a bOLS ; bIV =0 : ) Cov a bOLS ; bIV In order for bOLS to be e¢ cient, we had better have 1 =0 or Vara bOLS Cov a bOLS ; bIV : Therefore, the asymptotic variance calculation simpli…es substantially: Because bOLS is e¢ cient under the null, we should have Vara bOLS bIV = Vara bOLS +Vara bIV Therefore, a test can be based on r p n bOLS d a bIV Var bIV 2 Cov a bOLS ; bIV d a bOLS Var 109 N (0; 1) : = Vara bIV Vara bOLS :