Econ 388 R. Butler 2014 revisions Lecture 6 Multivariate 4
A. Other Assumptions about the regression error: part deux, after the beginning
Assumption IV. The error has constant variance, and there is no serial correlation
(the error terms are uncorrelated across observations: E(u_i u_j) = 0 when i ≠ j)
This assumption restricts the error “covariance matrix” to have a particularly simple form.
Let’s review some basics of means and variances with matrices before describing that
simple form.
*************************************************************************
Math Moment: Some general notation and results for means and variances of matrices:
Let y denote an n x 1 vector of random variables, i.e.,
y = (y1, y2, . . ., yn)'.
1. The expected value of y is defined by

   E(y) = \begin{pmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_n) \end{pmatrix}
2. The variance of the vector y is defined by

   Var(y) = E[(y - μ)(y - μ)'], where μ = E(y). Writing this out,

   Var(y) = E\left[ \begin{pmatrix} y_1 - \mu_1 \\ \vdots \\ y_n - \mu_n \end{pmatrix} \begin{pmatrix} y_1 - \mu_1, & \ldots, & y_n - \mu_n \end{pmatrix} \right]

   = \begin{bmatrix}
       E(y_1-\mu_1)^2          & E(y_1-\mu_1)(y_2-\mu_2) & \cdots & E(y_1-\mu_1)(y_n-\mu_n) \\
       E(y_2-\mu_2)(y_1-\mu_1) & E(y_2-\mu_2)^2          & \cdots & E(y_2-\mu_2)(y_n-\mu_n) \\
       \vdots                  &                         & \ddots & \vdots                  \\
       E(y_n-\mu_n)(y_1-\mu_1) & E(y_n-\mu_n)(y_2-\mu_2) & \cdots & E(y_n-\mu_n)^2
     \end{bmatrix}

   = \begin{bmatrix}
       Var(y_1)     & Cov(y_1,y_2) & \cdots & Cov(y_1,y_n) \\
       Cov(y_2,y_1) & Var(y_2)     & \cdots & Cov(y_2,y_n) \\
       \vdots       &              & \ddots & \vdots       \\
       Cov(y_n,y_1) & Cov(y_n,y_2) & \cdots & Var(y_n)
     \end{bmatrix}
A Useful Theorem (Note where the Projections P and M apply recalling PM=0)
If y ~ N(μ_y, Σ_y), then z = Ay ~ N(μ_z = Aμ_y, Σ_z = AΣ_yA'), where A is a matrix of constants
(and ~ means "distributed as").
NOTE: Proof:
E(z) = E(Ay) = AE(y) = Aμ_y
Var(z) = E[(z - E(z))(z - E(z))']
= E[(Ay - Aμ_y)(Ay - Aμ_y)']
= E[A(y - μ_y)(y - μ_y)'A']
= A E[(y - μ_y)(y - μ_y)'] A'
= A Σ_y A'
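As a quick numerical illustration of this theorem (a sketch that is not part of the original notes; the dimensions, mean vector, covariance matrix, and transformation A below are arbitrary choices), the following Python snippet simulates draws of y ~ N(μ_y, Σ_y), forms z = Ay, and checks that the sample mean and covariance of z are close to Aμ_y and AΣ_yA':

import numpy as np

rng = np.random.default_rng(0)

# Population mean and covariance of y (3 x 1), chosen arbitrarily for illustration
mu_y = np.array([1.0, 2.0, 3.0])
Sigma_y = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])

# A matrix of constants mapping y (3 x 1) into z = A y (2 x 1)
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])

# Simulate many draws of y ~ N(mu_y, Sigma_y) and transform each into z
y = rng.multivariate_normal(mu_y, Sigma_y, size=200_000)   # rows are draws
z = y @ A.T

# Sample moments of z should be close to the theoretical A mu_y and A Sigma_y A'
print(z.mean(axis=0))           # ~ A @ mu_y
print(A @ mu_y)
print(np.cov(z, rowvar=False))  # ~ A @ Sigma_y @ A.T
print(A @ Sigma_y @ A.T)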
A few cases where this useful theorem applies:
1) β̂ = (X'X)^{-1}X'Y, so in this problem A = (X'X)^{-1}X'.
2) ŷ_T = x_T β̂ (find the predicted value of a future y given β̂ estimated from past
observations), so in this problem A = x_T. (Both cases are worked out just below.)
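To make these two cases concrete, here is a short worked application (not in the original notes), anticipating Assumption IV below, Var(u|X) = σ²I:

For case 1, A = (X'X)^{-1}X' applied to Y gives
Var(β̂|X) = A Var(Y|X) A' = (X'X)^{-1}X'(σ²I)X(X'X)^{-1} = σ²(X'X)^{-1}.
For case 2, A = x_T applied to β̂ gives
Var(ŷ_T|X) = x_T Var(β̂|X) x_T' = σ² x_T(X'X)^{-1}x_T'.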
With this terminology, we can describe assumption IV with the following mathematical
expression:
Var(u|X) = σ²I

where I is the n x n identity matrix, so that

σ²I = \begin{bmatrix}
        \sigma^2 & 0        & \cdots & 0 \\
        0        & \sigma^2 & \cdots & 0 \\
        \vdots   &          & \ddots & \vdots \\
        0        & 0        & \cdots & \sigma^2
      \end{bmatrix}
From the math moment above, the diagonal terms represent the variances for the error
terms. The first row, first column term in the matrix is the variance of u_1; the second row,
second column is the variance of u_2, and so on. All the diagonal terms are variances, and all
are assumed to have the same constant (but unknown) value σ². Because these variances are
the same, the error variance is said to be HOMOSKEDASTIC. When this assumption fails to
hold, and the diagonal elements are not all equal to the same constant, the errors are said to
be HETEROSKEDASTIC.
The off diagonal elements represent the covariances between the errors. Assuming that all
of them are zero means that there is no serial correlation in the model (a topic to which we
later return in Chapters 10 through 12 of Wooldridge).
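As an illustrative contrast (not in the original notes), when Assumption IV fails the error covariance matrix no longer has this simple form: heteroskedasticity means the diagonal entries differ, and serial correlation means some off-diagonal entries are nonzero, e.g.

Var(u|X) = \begin{bmatrix}
             \sigma_1^2 & E(u_1u_2)  & \cdots & E(u_1u_n) \\
             E(u_2u_1)  & \sigma_2^2 & \cdots & E(u_2u_n) \\
             \vdots     &            & \ddots & \vdots    \\
             E(u_nu_1)  & E(u_nu_2)  & \cdots & \sigma_n^2
           \end{bmatrix}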
With Assumption IV (and the math moment gems), we can establish a few more
important results. The first is the variance covariance matrix for the least squares
estimators. Since
β̂ = (X'X)^{-1}X'Y = (X'X)^{-1}X'(Xβ + u) = β + (X'X)^{-1}X'u

Hence β̂ - β = (X'X)^{-1}X'u, and

Var(β̂|X) = E[(β̂ - β)(β̂ - β)'] = E[(X'X)^{-1}X'uu'X(X'X)^{-1}]
         = (X'X)^{-1}X'E(uu')X(X'X)^{-1} = (X'X)^{-1}X'(σ²I)X(X'X)^{-1}
         = σ²(X'X)^{-1}X'X(X'X)^{-1} = σ²(X'X)^{-1}
The Frisch theorem says a coefficient in a multiple regression is just the simple
regression of unexplained Y (the residual after Y is regressed on the other regressors, X_{-i},
given as M_{X_{-i}}Y) on unexplained X_i (the residual after X_i is regressed on the other
regressors, X_{-i}, given as M_{X_{-i}}X_i). So

β̂_i = (M_{X_{-i}}X_i)'(M_{X_{-i}}Y) / [(M_{X_{-i}}X_i)'(M_{X_{-i}}X_i)] = X_i'M_{X_{-i}}Y / (X_i'M_{X_{-i}}X_i).

From which we can derive:
Variance for β̂_i, the i-th term in the coefficient vector, equals the error variance, σ²,
divided by the sum of squared residuals from the regression of x_i on all the rest of the
regressors (including the constant), denoted X_{-i}:

var(β̂_i) = σ² / (x_i'M_{X_{-i}}x_i)

In the case of a simple regression with just a constant (intercept) and one other regressor,
X_{-i} is just the vector of ones making up the constant term, hence

var(β̂_1) = σ² / (x_1'M_1x_1) = σ² / Σ(x_i - x̄)².

Another way to write the general expression for the i-th coefficient is to apply result d from
the Frisch theorem in the appendix to Lecture 4 to get x_i'M_{X_{-i}}x_i = x_i'M_1x_i - x_i'P_{M_1X_{-i}}x_i,
and substitute this into var(β̂_i) = σ² / (x_i'M_{X_{-i}}x_i), which is what Wooldridge does as an
alternative expression for the variance of the i-th element in the OLS coefficient vector:

Var(β̂_i) = σ² / (x_i'M_{X_{-i}}x_i) = σ² / (x_i'M_1x_i - x_i'P_{M_1X_{-i}}x_i)
         = σ² / [x_i'M_1x_i (1 - x_i'P_{M_1X_{-i}}x_i / (x_i'M_1x_i))] = σ² / [SST_i (1 - R_i²)].
This shows that the variance for any of the estimated elements in the coefficient vector
depends on three things: the error variance σ²; SST_i, the sum of squared deviations of x_i
from its mean; and R_i², the R-squared from regressing x_i on all the other explanatory
variables (including the intercept). Why does this expression make sense? [[[SST is the
amount of variation in X, indicating the potential precision of the estimate (show with the
simple regression slope-intercept diagram). 1-(R-square) is the fraction of variation in x_i
that is independent of the other r.h.s. regressors]]]
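The following Python sketch (illustrative only; the data-generating process and all variable names are invented) checks both pieces numerically: the Frisch/FWL equivalence for a single coefficient, and the equality of three expressions for its estimated variance, the corresponding diagonal element of s²(X'X)^{-1}, s²/(x_i'M_{X_{-i}}x_i), and s²/[SST_i(1 - R_i²)]:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: y regressed on a constant, x1, and x2 (n = 300)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # correlated with x1 on purpose
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=2.0, size=n)

k = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat
s2 = u_hat @ u_hat / (n - k)                 # s^2 = u_hat'u_hat / (n - k)

# Frisch (FWL): the coefficient on x1 equals the slope from regressing
# "unexplained y" on "unexplained x1", both residualized on [1, x2]
X_other = np.column_stack([np.ones(n), x2])
M = np.eye(n) - X_other @ np.linalg.solve(X_other.T @ X_other, X_other.T)
x1_tilde, y_tilde = M @ x1, M @ y
print(beta_hat[1], (x1_tilde @ y_tilde) / (x1_tilde @ x1_tilde))

# Variance of the x1 coefficient three ways: s^2 (X'X)^{-1} diagonal,
# s^2 / (x1' M_{X_{-1}} x1), and s^2 / (SST_1 (1 - R_1^2))
var_matrix = s2 * np.linalg.inv(X.T @ X)[1, 1]
var_fwl = s2 / (x1 @ M @ x1)
SST1 = np.sum((x1 - x1.mean()) ** 2)
R2_1 = 1 - (x1_tilde @ x1_tilde) / SST1      # R^2 from regressing x1 on the others
var_sst = s2 / (SST1 * (1 - R2_1))
print(var_matrix, var_fwl, var_sst)          # all three should agree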
The variance is estimated by summing the squares of all the regression residuals, and
dividing by n-k (where we count the intercept with the other regressors). In matrix
notation this is:
s² = σ̂² = (Σ_i û_i²)/(n - k) = û'û/(n - k) = Y'MY/(n - k) = Y'(I - X(X'X)^{-1}X')Y/(n - k)
Note that s² and the coefficient vector are independent because they are constructed from
sets of vectors that are orthogonal to each other (s² is constructed from M, and β̂ from
P). Since

MY = (I - X(X'X)^{-1}X')(Xβ + u) = Xβ - X(X'X)^{-1}X'Xβ + (I - X(X'X)^{-1}X')u
   = Xβ - Xβ + Mu = Mu
we can substitute this into the s² formula and prove that E(s²) = σ². We just replicate the
proof in Wooldridge, Appendix E, by taking the expected value of the numerator of the
term on the far right-hand side above, noting that Y'MY = u'Mu. This proof uses traces
(the sum of the diagonal elements, properties of which you will want to review as
needed), and the fact that u'Mu is a scalar (i.e., a real number, i.e., a 1x1 matrix):

E(u'Mu|X) = E(tr(u'Mu)|X) = E(tr(Muu')|X) = tr E(Muu'|X) = tr(Mσ²I) = σ² tr(M)
But from the properties of traces (i.e., trace AB = trace BA when conformable, and
trace(A + B) = trace A + trace B) we have tr(M) = tr(I_n) - tr(X(X'X)^{-1}X')
= tr(I_n) - tr((X'X)^{-1}X'X) = tr(I_n) - tr(I_k) = n - k, hence
E(u'Mu|X) = σ²(n - k), which is what we needed to complete the proof.
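As a quick check of both facts (a sketch with an invented design matrix and error variance, not part of the original notes), the Python snippet below verifies tr(M) = n - k directly and approximates E(s²) by averaging s² over many simulated samples:

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical design matrix with a constant and two regressors (n = 100, k = 3)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Residual-maker M = I - X(X'X)^{-1}X'; its trace should equal n - k
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(M), n - k)           # 97.0 and 97

# Monte Carlo check that E(s^2) = sigma^2: average s^2 over many redraws of u
sigma2 = 2.5
s2_draws = []
for _ in range(10_000):
    u = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = X @ np.array([1.0, -1.0, 0.5]) + u
    u_hat = M @ y                   # residuals, since MY = Mu
    s2_draws.append(u_hat @ u_hat / (n - k))
print(np.mean(s2_draws), sigma2)    # close to 2.5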
Gauss-Markov Theorem (BLUE) (see the end of chapter 3).
We don't use normality here; we only need the second moments to be defined. The theorem
says: in the class of linear, unbiased estimators for the elements of β, the least squares
estimator has minimum variance.
β̃ = AY is a linear estimator. Let A = (X'X)^{-1}X' + C', where C is arbitrary.

Unbiasedness means E(β̃) = E(AY) = β. Since

E(AY) = E[((X'X)^{-1}X' + C')(Xβ + u)] = β + C'Xβ,

unbiasedness requires C'X = 0.

Note: β̃ - β = ((X'X)^{-1}X' + C')u.

Hence, the minimum variance ("best") estimator is the one that minimizes

V = E[(β̃ - β)(β̃ - β)'] = σ²((X'X)^{-1} + C'C),

where the cross terms drop out because C'X = 0.
Note that C'C is a positive semidefinite matrix. A matrix Q is positive semidefinite if
x'Qx ≥ 0 for any vector x (if Q is n x n then x is n x 1). C'C is positive semidefinite
because x'C'Cx = (Cx)'(Cx) = ||Cx||² ≥ 0. Since we want to minimize the variance V, and
C'C only adds a nonnegative amount to each coefficient's variance, we set C' = 0. Hence
β̃ = (X'X)^{-1}X'Y, the OLS estimator.
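The theorem can also be seen in a small simulation (an illustrative sketch; the design, the matrix D used to build C', and all numbers are invented): construct an alternative linear unbiased estimator with C'X = 0 and compare its sampling variance with OLS:

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup: n = 150 observations, constant plus one regressor
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
sigma2 = 1.0

# Build an alternative linear unbiased estimator: A = (X'X)^{-1}X' + C'
# with C' chosen so that C'X = 0 (take C' = D'M for an arbitrary D, since MX = 0)
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
D = rng.normal(scale=0.05, size=(n, 2))
C_t = D.T @ M                        # C' with C'X = 0 by construction
A_ols = np.linalg.inv(X.T @ X) @ X.T
A_alt = A_ols + C_t

# Compare sampling variances of the two estimators over repeated samples
b_ols, b_alt = [], []
for _ in range(20_000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b_ols.append(A_ols @ y)
    b_alt.append(A_alt @ y)
b_ols, b_alt = np.array(b_ols), np.array(b_alt)

print(b_ols.mean(axis=0), b_alt.mean(axis=0))   # both unbiased (close to beta)
print(b_ols.var(axis=0), b_alt.var(axis=0))     # alternative has larger variances

Because C' = D'M satisfies C'X = 0 by construction, the alternative estimator remains unbiased; the extra C'C term is what inflates its variance relative to σ²(X'X)^{-1}.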