Econ 388  R. Butler  2014 revisions
Lecture 6  Multivariate 4

A. Other Assumptions about the regression error: part deux, after the beginning

Assumption IV. The error has constant variance, and there is no serial correlation (the error terms are uncorrelated across observations, E(uᵢuⱼ) = 0 when i ≠ j).

This assumption restricts the error "covariance matrix" to have a particularly simple form. Let's review some basics of means and variances with matrices before describing that simple form.

*************************************************************************
Math Moment: Some general notation and results for means and variances of matrices:

Let y denote an n x 1 vector of random variables, i.e., y = (y1, y2, . . ., yn)'.

1. The expected value of y is defined by

\[
E(y) = \begin{bmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_n) \end{bmatrix}
\]

2. The variance of the vector y is defined by Var(y) = E[(y − μ)(y − μ)'] where μ = E(y):

\[
Var(y) = E\left[ \begin{bmatrix} y_1 - \mu_1 \\ \vdots \\ y_n - \mu_n \end{bmatrix} (y_1 - \mu_1, \ldots, y_n - \mu_n) \right]
= \begin{bmatrix}
E(y_1 - \mu_1)^2 & E(y_1 - \mu_1)(y_2 - \mu_2) & \cdots & E(y_1 - \mu_1)(y_n - \mu_n) \\
E(y_2 - \mu_2)(y_1 - \mu_1) & E(y_2 - \mu_2)^2 & \cdots & E(y_2 - \mu_2)(y_n - \mu_n) \\
\vdots & \vdots & \ddots & \vdots \\
E(y_n - \mu_n)(y_1 - \mu_1) & E(y_n - \mu_n)(y_2 - \mu_2) & \cdots & E(y_n - \mu_n)^2
\end{bmatrix}
\]

\[
= \begin{bmatrix}
Var(y_1) & Cov(y_1, y_2) & \cdots & Cov(y_1, y_n) \\
Cov(y_2, y_1) & Var(y_2) & \cdots & Cov(y_2, y_n) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(y_n, y_1) & Cov(y_n, y_2) & \cdots & Var(y_n)
\end{bmatrix}
\]

A Useful Theorem (note where the projections P and M apply, recalling PM = 0):

If y ~ N(μy, Σy), then z = Ay ~ N(μz = Aμy; Σz = AΣyA'), where A is a matrix of constants (and "~" means "distributed as").

Proof:
E(z) = E(Ay) = AE(y) = Aμy
Var(z) = E[(z − E(z))(z − E(z))'] = E[(Ay − Aμy)(Ay − Aμy)'] = E[A(y − μy)(y − μy)'A'] = A E[(y − μy)(y − μy)'] A' = AΣyA'

A few cases where this useful theorem applies:
1) β̂ = (X'X)⁻¹X'Y, so in this problem A = (X'X)⁻¹X'.
2) ŷ_T = x_T β̂ (find the predicted value of a future y given β̂ estimated from past observations), so in this problem A = x_T.

With this terminology, we can describe Assumption IV with the following mathematical expression: Var(u|X) = σ²I, where I is the n x n identity matrix, so that

\[
\sigma^2 I = \begin{bmatrix}
\sigma^2 & 0 & 0 & \cdots & 0 \\
0 & \sigma^2 & 0 & \cdots & 0 \\
0 & 0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sigma^2
\end{bmatrix}
\]

From the math moment above, the diagonal terms represent the variances of the error terms. The first row, first column term in the matrix is the variance of u1; the second row, second column term is the variance of u2, etc. All the diagonal terms are variances, and all are assumed to have the same constant (but unknown) value σ². These variances are the same, so the error variance function is said to be HOMOSKEDASTIC. When this assumption fails to hold, and the diagonal elements are not all equal to the same constant, the errors are said to be HETEROSKEDASTIC.

The off-diagonal elements represent the covariances between the errors. Assuming that all of them are zero means that there is no serial correlation in the model (a topic to which we return in Chapters 10 through 12 of Wooldridge).

With Assumption IV (and the math moment gems), we can establish a few more important results. The first is the variance-covariance matrix for the least squares estimators.
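To make the useful theorem concrete, here is a minimal simulation sketch (not part of the original notes), written in Python with numpy. It draws y ~ N(μy, Σy), forms z = Ay, and checks that the sample mean and covariance of z are close to Aμy and AΣyA'. The particular μy, Σy, and A are made-up illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the notes): a 3 x 1 normal vector y and a 2 x 3 constant matrix A
mu_y = np.array([1.0, 2.0, 3.0])
Sigma_y = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])

# Simulate many draws of y ~ N(mu_y, Sigma_y) and form z = Ay for each draw
y = rng.multivariate_normal(mu_y, Sigma_y, size=200_000)   # shape (200000, 3)
z = y @ A.T                                                 # shape (200000, 2)

# Sample mean and covariance of z should be close to A*mu_y and A*Sigma_y*A'
print(np.round(z.mean(axis=0), 3), np.round(A @ mu_y, 3))
print(np.round(np.cov(z, rowvar=False), 3))
print(np.round(A @ Sigma_y @ A.T, 3))
```

The same mechanics are what we use below with A = (X'X)⁻¹X' to get the variance of the OLS coefficient vector.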
Since β̂ = (X'X)⁻¹X'Y = (X'X)⁻¹X'(Xβ + u) = β + (X'X)⁻¹X'u, we have β̂ − β = (X'X)⁻¹X'u, and

\[
Var(\hat{\beta}|X) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'] = E[(X'X)^{-1}X'uu'X(X'X)^{-1}] = (X'X)^{-1}X'E(uu')X(X'X)^{-1} = (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} = \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} = \sigma^2 (X'X)^{-1}
\]

The Frisch theorem says a coefficient in a multiple regression is just the simple regression of unexplained Y (the residual after Y is regressed on the other regressors, X₋ᵢ, given as M_{X₋ᵢ}Y) on unexplained Xᵢ (the residual after Xᵢ is regressed on the other regressors, X₋ᵢ, given as M_{X₋ᵢ}Xᵢ). So

\[
\hat{\beta}_i = \frac{(M_{X_{-i}}X_i)'(M_{X_{-i}}Y)}{(M_{X_{-i}}X_i)'(M_{X_{-i}}X_i)} = \frac{X_i'M_{X_{-i}}Y}{X_i'M_{X_{-i}}X_i}.
\]

From this we can derive the variance of β̂ᵢ, the ith term in the coefficient vector: it equals the error variance, σ², divided by the sum of squared residuals from the regression of xᵢ on all the rest of the regressors (including the constant), denoted X₋ᵢ:

\[
var(\hat{\beta}_i) = \frac{\sigma^2}{x_i'M_{X_{-i}}x_i}
\]

In the case of a simple regression with just an intercept and one regressor, X₋ᵢ is just the vector of ones making up the constant variable, hence

\[
var(\hat{\beta}_1) = \frac{\sigma^2}{x_1'M_1 x_1} = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}.
\]

Another way to write the general expression for the ith coefficient is to apply result d from the Frisch theorem in the appendix to lecture 4 to get xᵢ'M_{X₋ᵢ}xᵢ = xᵢ'M₁xᵢ − xᵢ'P_{M₁X₋ᵢ}xᵢ, and substitute this into var(β̂ᵢ) = σ²/(xᵢ'M_{X₋ᵢ}xᵢ), which is what Wooldridge does as an alternative expression for the variance of the ith element in the OLS coefficient vector:

\[
Var(\hat{\beta}_i) = \frac{\sigma^2}{x_i'M_{X_{-i}}x_i} = \frac{\sigma^2}{x_i'M_1 x_i - x_i'P_{M_1 X_{-i}}x_i} = \frac{\sigma^2}{x_i'M_1 x_i \left(1 - \dfrac{x_i'P_{M_1 X_{-i}}x_i}{x_i'M_1 x_i}\right)} = \frac{\sigma^2}{SST_i(1 - R_i^2)}.
\]

This shows that the variance of any estimated element in the coefficient vector depends on three things: the error variance σ²; SSTᵢ, the sum of squared deviations of xᵢ from its mean; and Rᵢ², the R-squared from regressing xᵢ on all the other explanatory variables (including the intercept). Why does this expression make sense? [[[SSTᵢ is the amount of variation in xᵢ, indicating the potential precision of the estimate (show with the simple regression model using the slope-intercept diagram). 1 − Rᵢ² is the fraction of variation in xᵢ that is independent of the other r.h.s. regressors.]]]

The error variance is estimated by summing the squares of all the regression residuals and dividing by n − k (where we count the intercept with the other regressors). In matrix notation this is:

\[
s^2 = \hat{\sigma}^2 = \frac{\sum_i \hat{u}_i^2}{n-k} = \frac{\hat{u}'\hat{u}}{n-k} = \frac{Y'MY}{n-k} = \frac{Y'(I - X(X'X)^{-1}X')Y}{n-k}
\]

Note that s² and the coefficient vector are independent because they are constructed from sets of vectors that are orthogonal to each other (s² is constructed from M, and β̂ from P). Since

\[
MY = (I - X(X'X)^{-1}X')(X\beta + u) = X\beta - X(X'X)^{-1}X'X\beta + (I - X(X'X)^{-1}X')u = (I - X(X'X)^{-1}X')u = Mu,
\]

we can substitute this into the s² formula and prove that E(s²) = σ². We just replicate the proof in Wooldridge, appendix E, by taking the expected value of the numerator of the term on the far right-hand side above, noting that Y'MY = u'Mu.
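Here is a short numerical sketch (assuming Python/numpy; the simulated data and variable names are illustrative, not from the notes) that checks two of the results above on one simulated sample: the diagonal element of s²(X'X)⁻¹ for a slope coefficient matches s²/(SSTᵢ(1 − Rᵢ²)), and the Frisch partialling-out regression reproduces the same OLS coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (illustrative): n observations, intercept plus two correlated regressors
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)           # correlated with x1 so R_1^2 > 0
X = np.column_stack([np.ones(n), x1, x2])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=2.0, size=n)

k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                        # OLS: (X'X)^{-1} X'Y
u_hat = y - X @ b
s2 = (u_hat @ u_hat) / (n - k)               # s^2 = u_hat'u_hat / (n - k)
var_b = s2 * XtX_inv                         # estimated Var(b|X) = s^2 (X'X)^{-1}

# Check the scalar formula for the x1 coefficient: s^2 / (SST_1 * (1 - R_1^2))
X_other = np.column_stack([np.ones(n), x2])  # the "other" regressors X_{-i}
M_other = np.eye(n) - X_other @ np.linalg.inv(X_other.T @ X_other) @ X_other.T
SST1 = np.sum((x1 - x1.mean()) ** 2)
R1_sq = 1 - (x1 @ M_other @ x1) / SST1       # R^2 from regressing x1 on X_{-i}
print(var_b[1, 1], s2 / (SST1 * (1 - R1_sq)))   # the two should agree

# Frisch check: regress residualized y on residualized x1
b1_fwl = (x1 @ M_other @ y) / (x1 @ M_other @ x1)
print(b[1], b1_fwl)                              # same coefficient as the multiple regression
```

The two printed pairs agree (up to rounding), which is just the algebra of the partialling-out argument carried out numerically.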
This proof uses traces (the sum of the diagonal elements; review their properties as needed) and the fact that u'Mu is a scalar (i.e., a real number, a 1x1 matrix):

\[
E(u'Mu|X) = E(tr(u'Mu)|X) = E(tr(Muu')|X) = tr(E(Muu'|X)) = tr(M\sigma^2 I) = \sigma^2 tr(M)
\]

But from the properties of traces (i.e., tr(AB) = tr(BA) when conformable, and tr(A + B) = tr(A) + tr(B)) we have

\[
tr(M) = tr(I_n) - tr(X(X'X)^{-1}X') = tr(I_n) - tr((X'X)^{-1}X'X) = tr(I_n) - tr(I_k) = n - k,
\]

hence E(u'Mu|X) = σ²(n − k), which is what we needed to complete the proof.

Gauss-Markov Theorem (BLUE) (see the end of chapter 3). We don't use normality; we only need the second moments defined. The theorem says: in the class of linear, unbiased estimators of the elements of β, the least squares estimator has minimum variance.

β̃ = AY is a linear estimator. Let A = (X'X)⁻¹X' + C', where C is arbitrary. Unbiasedness means

\[
E(\tilde{\beta}) = E(AY) = E[((X'X)^{-1}X' + C')(X\beta + u)] = \beta + C'X\beta = \beta, \quad \text{so } C'X = 0.
\]

Note: β̃ − β = ((X'X)⁻¹X' + C')u. Hence, the minimum variance ("best") estimator is the one that minimizes

\[
V(\tilde{\beta}) = E[(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)'] = \sigma^2\left((X'X)^{-1} + C'C\right).
\]

Note that C'C is a positive semidefinite matrix. A matrix Q is positive semidefinite if x'Qx ≥ 0 for any vector x (if Q is n x n then x is n x 1). C'C is positive semidefinite because x'C'Cx = (Cx)'Cx = ||Cx||² ≥ 0, with equality only when Cx = 0. But we want to minimize the variance V, and since C is arbitrary, the best we can do is set C' = 0. Hence β̃ = (X'X)⁻¹X'Y, the OLS estimator.
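As a rough check of the trace argument and the Gauss-Markov logic, the following sketch (again Python/numpy, with an arbitrary simulated design matrix that is only an illustration) verifies tr(M) = n − k, confirms by Monte Carlo that u'Mu/(n − k) averages to σ², and shows that σ²C'C is positive semidefinite for a C constructed to satisfy C'X = 0.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative design matrix: intercept plus two regressors, held fixed across replications
n, sigma2 = 60, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
k = X.shape[1]

# tr(M) = n - k, where M = I - X(X'X)^{-1}X'
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
print(np.trace(M), n - k)        # both should equal 57 (up to rounding error)

# Monte Carlo check that E(s^2) = sigma^2: average u'Mu/(n - k) over many draws of u
reps = 20_000
s2_draws = np.empty(reps)
for r in range(reps):
    u = rng.normal(scale=np.sqrt(sigma2), size=n)
    s2_draws[r] = (u @ M @ u) / (n - k)
print(s2_draws.mean(), sigma2)   # the average should be close to 4.0

# Gauss-Markov illustration: any C with C'X = 0 adds sigma^2 * C'C to the variance.
# Taking C = M @ D for an arbitrary D guarantees C'X = D'MX = 0 because MX = 0.
D = rng.normal(size=(n, k))
C = M @ D
extra = sigma2 * C.T @ C                              # sigma^2 * C'C
print(np.all(np.linalg.eigvalsh(extra) >= -1e-8))     # eigenvalues nonnegative: positive semidefinite
```

The last check illustrates why OLS is "best": any other linear unbiased estimator adds a positive semidefinite matrix to the OLS variance, so its coefficient variances can only be at least as large.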