Econ 388, R. Butler, 2014 revisions. Lecture 9: Law of Large Numbers and Consistency
QUOTE: Infinite in All Directions by Dyson
(These next two lectures use the outline of the Wooldridge text, with some examples, etc.,
from the Kennedy, Johnston, Goldberger, and White econometrics texts.)
I. Introduction
The OLS coefficient vector is unbiased (under the first three assumptions), efficient (the
BLUE estimator if the variance assumption holds: it has the smallest variance among all the
linear estimators that are unbiased), and normal (if the errors are normal). But even if the
errors are not normal (so that the dependent variable, $Y_i$, is not normal either),
whenever the sample is large we can still make asymptotic tests of hypotheses just as
though the $\hat{\beta}_i$ were normal. This magic comes because:
A. the least squares coefficient vector is actually a weighted average of the $Y_i$'s
(White, Asymptotic Theory for Econometricians, chapter 1), and weighted averages from
large samples are normally distributed.
1.) The usual way to write the OLS $\hat{\beta}$ is in matrix form:
\[ \hat{\beta} = (X'X)^{-1} X'y \]
where $X'$ is $k \times n$, $X$ is $n \times k$, and $y$ is $n \times 1$.
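As a quick numerical illustration (a minimal sketch in Python with numpy and made-up data; the variable names are mine, not from the text), the formula above can be computed directly and checked against a library least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # n x k design, with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)          # y = X*beta + epsilon

beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)   # (X'X)^{-1} X'y, as in the formula above
beta_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_check)                     # the two should agree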
Another way is to write it, equivalently, as a function of the vectors of independent
variables. Denote observation $i$ as $x_i$, a $1 \times k$ vector, for each observation (the $i$th row
of the $X$ matrix), and write out $\hat{\beta}$ as a linear function as follows:
\[ \hat{\beta} = \left( \frac{1}{N} \sum_{i=1}^{N} x_i' x_i \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} x_i' y_i \right) \]
where each $x_i' x_i$ is a $(k \times 1)(1 \times k)$ product and each $x_i' y_i$ is a $(k \times 1)(1 \times 1)$ product.
This shows how you can update the estimate of $\hat{\beta}$ when adding one more observation
(the $(N+1)$st observation): simply add $x_{N+1}' x_{N+1}$ to the first term and $x_{N+1}' y_{N+1}$ to the
second term. It also shows that $\hat{\beta}$ is a weighted average of the $y_i$, $\hat{\beta} = \sum_i C_i y_i$, where
\[ C_i = \left( \frac{1}{N} \sum_i x_i' x_i \right)^{-1} \frac{1}{N}\, x_i' . \]
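Here is a minimal sketch of this updating rule (Python with numpy, simulated data; illustrative only): keep running sums of $x_i' x_i$ and $x_i' y_i$, and fold in observation $N+1$ by adding one term to each sum:

import numpy as np

rng = np.random.default_rng(1)
N, k = 100, 2
X = rng.normal(size=(N + 1, k))
y = X @ np.array([0.5, -1.0]) + rng.normal(size=N + 1)

# running sums over the first N observations
XtX = sum(np.outer(X[i], X[i]) for i in range(N))   # sum of x_i' x_i  (k x k)
Xty = sum(X[i] * y[i] for i in range(N))            # sum of x_i' y_i  (k x 1)
b_N = np.linalg.solve(XtX, Xty)                     # beta-hat from N observations

# update with the (N+1)st observation: add x'_{N+1} x_{N+1} and x'_{N+1} y_{N+1}
XtX = XtX + np.outer(X[N], X[N])
Xty = Xty + X[N] * y[N]
b_N1 = np.linalg.solve(XtX, Xty)                    # updated beta-hat
print(b_N, b_N1)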
Finally, substitute $y_i = x_i \beta + \epsilon_i$ (the population model for the $i$th observation) into
this second form to get
\[ \hat{\beta} = \beta + \left( \frac{1}{N} \sum_{i=1}^{N} x_i' x_i \right)^{-1} \left( \frac{1}{N} \sum_{i=1}^{N} x_i' \epsilon_i \right) . \]
To show that $\hat{\beta}$ is consistent (the distribution, not just the mean of the
distribution, converges to $\beta$), we assume that the $\left( \frac{1}{N} \sum_{i=1}^{N} x_i' x_i \right)$ term converges to a finite
positive definite matrix as $N$ gets large, and that $\left( \frac{1}{N} \sum_{i=1}^{N} x_i' \epsilon_i \right)$ (the mean of the $x_i' \epsilon_i$) literally collapses to zero.
First some results and definitions:
B. there are special results for means that come from large samples. In particular, there
are three important results for a random sample from any population ($Y_i$'s normal or not)
with $E(Y) = \mu$ and $Var(Y) = \sigma^2$ (all three are illustrated in the simulation sketch below):
1. Law of Large Numbers: the plim (or probability limit) of $\bar{Y}$ is $\mu$.
2. Central Limit Theorem: $\frac{\bar{Y} - \mu}{\sigma / \sqrt{n}} = \sqrt{n} \left[ (\bar{Y} - \mu)/\sigma \right]$ has a standard normal
distribution in large samples, and
3. the asymptotic distribution of $\bar{Y}$ is $N(\mu, \sigma^2/n)$.
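A minimal simulation sketch of all three results (Python with numpy; the exponential population is my choice of a decidedly non-normal example): draws from an exponential(1) population have $\mu = 1$ and $\sigma = 1$, yet the standardized sample mean behaves like a standard normal in large samples:

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, reps = 1.0, 1.0, 10_000   # exponential(1): mean 1, sd 1, heavily skewed

for n in (10, 100, 10_000):
    Y = rng.exponential(scale=1.0, size=(reps, n))
    ybar = Y.mean(axis=1)
    z = np.sqrt(n) * (ybar - mu) / sigma      # the CLT statistic
    print(n,
          ybar.mean(),                        # LLN: converges to mu = 1
          ybar.var(),                         # shrinks like sigma^2/n (result 3)
          (np.abs(z) < 1.96).mean())          # CLT: approaches 0.95 as n grows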
Now we apply these notions to our model
\[ \hat{\beta} = \beta + \left( \frac{1}{N} \sum_i x_i' x_i \right)^{-1} \left( \frac{1}{N} \sum_i x_i' \epsilon_i \right) . \]
1. Instead of the random variable $Y_i$, consider the random vector $x_i' \epsilon_i$. The essential OLS
assumption is that $E(x_i' \epsilon_i) = 0$ (that $\epsilon$ is uncorrelated with each of the right-hand-side
regressors). A stronger assumption that implies this is $E(\epsilon_i | x_i) = 0$. Then
\[ \text{plim} \left( \frac{1}{N} \sum_i x_i' \epsilon_i \right) = 0 \]
by the Law of Large Numbers.
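A quick sketch of this plim result (Python with numpy; the t-distributed errors are an arbitrary non-normal choice): with $E(x_i \epsilon_i) = 0$, the sample average of $x_i \epsilon_i$ collapses toward zero as $N$ grows:

import numpy as np

rng = np.random.default_rng(3)
for N in (100, 10_000, 1_000_000):
    x = rng.normal(size=N)
    eps = rng.standard_t(df=5, size=N)   # non-normal errors, independent of x
    print(N, np.mean(x * eps))           # (1/N) * sum of x_i*eps_i shrinks toward 0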
2. The (Lindeberg-Levy) Central Limit Theorem: let $w_i$ be a sequence of independent,
identically distributed $k \times 1$ random vectors with finite second moments, $E(w_i) = 0$, and
$V(w_i) = E(w_i w_i') = \Sigma$. Then
\[ \frac{1}{\sqrt{N}} \sum_i w_i \xrightarrow{d} N(0, \Sigma) , \]
where $N(0, \Sigma)$ means normally distributed with mean vector $0$ ($k \times 1$) and covariance
matrix $\Sigma$ (a $k \times k$ matrix). We apply this to
\[ \sqrt{N} (\hat{\beta} - \beta) = \left( \frac{1}{N} \sum_i x_i' x_i \right)^{-1} \left( \frac{1}{\sqrt{N}} \sum_i x_i' \epsilon_i \right) \]
to get
\[ \frac{1}{\sqrt{N}} \sum_i x_i' \epsilon_i \xrightarrow{d} N \left( 0, \; \sigma^2 \frac{X'X}{N} \right) \]
so that
\[ \sqrt{N} (\hat{\beta} - \beta) \xrightarrow{d} N \left( 0, \; \left( \frac{X'X}{N} \right)^{-1} \sigma^2 \left( \frac{X'X}{N} \right) \left( \frac{X'X}{N} \right)^{-1} \right) \]
where $\left( \frac{X'X}{N} \right)^{-1}$ corresponds to $\left( \frac{1}{N} \sum_i x_i' x_i \right)^{-1}$, or
\[ \sqrt{N} (\hat{\beta} - \beta) \xrightarrow{d} N \left( 0, \; \sigma^2 \left( \frac{X'X}{N} \right)^{-1} \right) . \]
So we can treat
\[ \hat{\beta} \sim N \left( \beta, \; \sigma^2 (X'X)^{-1} \right) . \]
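A Monte Carlo sketch of this asymptotic normality (Python with numpy; the design and error choices are mine for illustration): even with skewed, non-normal errors, the usual t-ratio for a slope behaves like a standard normal in large samples, so nominal 95% coverage is roughly attained:

import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 5_000
beta = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.uniform(size=n)])  # treated as fixed in repeated samples

t_stats = []
for _ in range(reps):
    eps = rng.exponential(1.0, size=n) - 1.0            # skewed errors: mean 0, variance 1
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)               # OLS
    e = y - X @ b
    s2 = e @ e / (n - 2)                                # error variance estimate
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])     # std. error of the slope
    t_stats.append((b[1] - beta[1]) / se)

print((np.abs(np.array(t_stats)) < 1.96).mean())        # close to 0.95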
In this section, we define plims and show that under the first four model assumptions the
OLS estimator of the coefficient vector is consistent (so we will need to define
consistency as well). Basically, we focus on the law of large numbers results here; next
lecture, we will apply the central limit theorem to get numerous results.
II. Consistency and the wonderful world of plims (probability limits)
Suppose that we calculated the mean of the distribution from a sample of size $n$, noting
that $E(\bar{Y}_n) = \mu$ and $V(\bar{Y}_n) = \sigma^2/n$ for any sample size $n$. As $n$ gets very large, the
distribution of $\bar{Y}_n$ collapses around $\mu$--that is, the sampling distribution is centered at $\mu$
and its variance goes to zero as the sample gets large. [[show picture of this]] To talk
about this sort of result formally, we discuss the probability intervals of the estimator ($\bar{Y}$) around the true parameter ($\mu$), by looking at the likelihood that $\bar{Y}_n$ is within an
arbitrarily small distance $\epsilon$ around $\mu$:
\[ \text{Prob}(\mu - \epsilon < \bar{Y}_n < \mu + \epsilon) = \text{Prob}(|\bar{Y}_n - \mu| < \epsilon) \]
The random variable $\bar{Y}_n$ is said to converge in probability to the constant $\mu$ if, as the
sample size $n$ increases, the likelihood that $\bar{Y}_n$ lies within an $\epsilon$ distance of $\mu$ goes to
one:
\[ \lim_{n \to \infty} \text{Prob}(|\bar{Y}_n - \mu| < \epsilon) = 1 \]
In other words, the likelihood that $\bar{Y}_n$ lies within a small interval ($\epsilon$-interval) of $\mu$ can be
made as close to one (certainty) as we would like (that is, the $\epsilon$-interval can be
made as small as we like) by letting the sample size get sufficiently large. The shorthand
way of writing this result is
\[ \text{plim } \bar{Y}_n = \mu . \]
If the estimator converges to the true parameter in this way, then it is said to be a
consistent estimator. The law of large numbers says that the sample mean is a consistent
estimator of the true population mean parameter, so that the equation above holds
whenever the $Y_i$'s are independent, identically distributed random variables from a
distribution with mean $\mu$. A simulation sketch of this convergence follows.
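A minimal sketch of the definition (Python with numpy; the tolerance 0.05 is an arbitrary illustrative choice of $\epsilon$): estimate $\text{Prob}(|\bar{Y}_n - \mu| < \epsilon)$ by simulation and watch it go to one as $n$ grows:

import numpy as np

rng = np.random.default_rng(5)
mu, eps_tol, reps = 1.0, 0.05, 5_000   # eps_tol plays the role of epsilon

for n in (10, 100, 1_000, 10_000):
    ybar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    print(n, (np.abs(ybar - mu) < eps_tol).mean())   # Prob(|Ybar - mu| < eps) -> 1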
Plims are wonderful because they can pass right through basic mathematical
manipulations (adding, subtracting, multiplication, and division) and also through
continuous transformations. Plims can do this; taking expected values cannot. We
illustrate this with a graph from Kennedy's A Guide to Econometrics:
[Figure (after Kennedy): a convex function $g(\hat{\beta})$ plotted against $\hat{\beta}$, with the pdf of $\hat{\beta}$ along the horizontal axis and the pdf of $g(\hat{\beta})$ along the vertical axis, marking $E(g(\hat{\beta}))$ and $g(E(\hat{\beta}))$.]
The convex function, $g(\cdot)$, distorts the roughly (given my drawing skills) symmetric
distribution of $\hat{\beta}$ on the bottom (horizontal axis) into the right-skewed distribution
on the vertical axis, namely the distribution of $g(\hat{\beta})$. Given the distortion, we see that
"expectation" does not "pass through" nonlinear functions, in the sense that $E(g(\hat{\beta})) \neq
g(E(\hat{\beta}))$ (in fact, for such convex transformations it is always the case that $E(g(\hat{\beta})) \geq g(E(\hat{\beta}))$, a result known as Jensen's inequality). In other words, the expected value operator
cannot "pass through" nonlinear transformations like $g(\cdot)$.
However, the plim operation can "pass through." How can we use the diagram above to
show that? In particular, how can we show that
\[ \text{plim } g(\hat{\beta}) = g(\text{plim } \hat{\beta}) \; ? \]
This is known as Slutsky's Theorem.
Some other examples of this wonderfulness include:
\[ \text{plim}(\ln(y)) = \ln(\text{plim}(y)) \]
\[ \text{plim}(y^a) = (\text{plim}(y))^a \quad \text{where } a \text{ is nonrandom} \]
\[ \text{plim}(z/y) = \text{plim}(z)/\text{plim}(y) \]
The last result is true whether or not $z$ and $y$ are independently distributed (provided
$\text{plim}(y) \neq 0$). Plim results also extend to matrices and vectors, by taking the probability limit of each element
in the matrix or the vector. This offers great simplifications. For matrices, then, we get
such cool results as
\[ \text{plim}(Z^{-1}) = (\text{plim } Z)^{-1} \]
\[ \text{plim}(XZ) = \text{plim}(X)\,\text{plim}(Z) \]
A simulation illustration of Slutsky's Theorem follows.
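A small sketch contrasting the two operators (Python with numpy; $g = \exp$ is my stand-in for a convex transformation): $E[g(\bar{Y})]$ exceeds $g(\mu)$ in small samples (Jensen), but as $n$ grows, $g(\bar{Y})$ piles up at $g(\text{plim } \bar{Y}) = g(\mu)$, as Slutsky's Theorem says:

import numpy as np

rng = np.random.default_rng(6)
g = np.exp          # a convex transformation g(.)
mu = 1.0

for n in (10, 100, 10_000):
    ybar = rng.normal(mu, 1.0, size=(5_000, n)).mean(axis=1)
    print(n, g(ybar).mean(), g(mu))
# the mean of g(ybar) starts above g(mu) = e (Jensen's inequality)
# and converges to g(mu) as n grows (Slutsky)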
With plims, we can alternatively show the consistency of the OLS regression coefficient
vector as follows:
\[ \hat{\beta} = (X'X)^{-1} X'y = \beta + (X'X)^{-1} X'\epsilon = \beta + \left[ \frac{1}{n}(X'X) \right]^{-1} \left[ \frac{1}{n} X'\epsilon \right] \]
1
We assume that the first matrix in the far right hand term, [( X ' X ) 1 ] , converges to a
n
finite, positive definite matrix (it will also be symmetric, though that will not matter for
our proof of consistency). If we are conditioning on the X (so that the X matrix is
“constant in repeated samples”), then this is no problem and matrix consists of squares
and cross products of the explanatory variables. If the explanatory variables are taken as
stochastic, then it can be shown that it converges in probability to the population
moments under general conditions (White):
1
plim [ ( X ' X )] 1 =  1
n
where  1 is the symmetric, finite, positive definite matrix of population moments.
1
What we have to show is that the plim of the second term, [ ( X '  )] , is the 0 vector.
n
Let’s look at some of the elements in this vector:
1


plim ( n   i )



1

plim (  X 1i  i ) 
1

plim [ ( X '  )] = 
n
n







1
plim (  X ki  i )
n


The topmost element inside the right-hand-side vector is $\text{plim}(\bar{\epsilon})$. By the law of large
numbers, $\text{plim } \bar{\epsilon} = 0$. Another view of this same result is to note that $E(\bar{\epsilon}) = 0$ and
$Var(\bar{\epsilon}) = \sigma^2/n$, so that as $n$ goes to infinity, the variance goes to zero and the distribution of
$\bar{\epsilon}$ collapses around $0$. For the other elements in this vector (say the $j$th element), we
note that
\[ E\left( \frac{1}{n} \sum X_{ji} \epsilon_i \right) = 0 \]
which holds for the case where we condition on $X$ (so that we treat it as fixed), and for
the case where $X$ is stochastic as long as the covariance of $X$ and $\epsilon$ is zero (see bottom
of p. 164 in Wooldridge). Next we show that the variance goes to zero for these terms:
\[ Var\left( \frac{1}{n} \sum X_{ji} \epsilon_i \right) = \frac{\sigma^2}{n} \left( \frac{\sum_{i=1}^{n} X_{ji}^2}{n} \right) \]
By the argument above for the $\left[ \frac{1}{n}(X'X) \right]^{-1}$ matrix, the probability limit of the
$\frac{\sum X_{ji}^2}{n}$ term is a constant, while the probability limit (plim) of $\frac{\sigma^2}{n}$ is zero. Hence, the plim of
$\frac{1}{n} \sum X_{ji} \epsilon_i$ (and all the other terms in the $\left[ \frac{1}{n}(X'\epsilon) \right]$ vector) is zero. This means that
\[ \text{plim} \left[ \frac{1}{n}(X'\epsilon) \right] = 0 \]
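A numerical check of this variance calculation (Python with numpy; the uniform regressor is an arbitrary illustrative choice): simulate many draws of $\frac{1}{n}\sum X_{ji}\epsilon_i$ holding the regressor column fixed, and compare the empirical variance to $\frac{\sigma^2}{n}\left(\frac{\sum X_{ji}^2}{n}\right)$:

import numpy as np

rng = np.random.default_rng(7)
sigma2 = 1.0
for n in (100, 1_000, 10_000):
    x = rng.uniform(0, 2, size=n)        # regressor column, held fixed across draws
    draws = np.array([np.mean(x * rng.normal(0, 1, size=n)) for _ in range(2_000)])
    print(n, draws.var(), (sigma2 / n) * np.mean(x**2))  # both shrink toward zero together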
so that
\[ \text{plim } \hat{\beta} = \beta + \text{plim}\left[ \frac{1}{n}(X'X) \right]^{-1} \text{plim}\left[ \frac{1}{n} X'\epsilon \right] = \beta + Q^{-1} \cdot 0 = \beta . \]
This shows that the OLS estimator is consistent. If the error terms, $\epsilon$, are correlated
with the $X$'s, so that $\text{plim}\left[ \frac{1}{n}(X'\epsilon) \right] \neq 0$, then the OLS estimator will not be consistent (the
extent of the inconsistency will depend on the strength of the correlation between $X$ and
$\epsilon$). This is the problem with simultaneous equations bias, which we will treat towards
the end of the semester. A simulation sketch of this inconsistency follows.
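A sketch of that inconsistency (Python with numpy; the design is mine, with x built to be correlated with the error): the slope estimate stays biased away from the true value of 2.0 no matter how large the sample gets:

import numpy as np

rng = np.random.default_rng(8)
for n in (100, 10_000, 1_000_000):
    u = rng.normal(size=n)
    x = 0.5 * u + rng.normal(size=n)      # x is correlated with the error term
    y = 1.0 + 2.0 * x + u
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(n, b[1])   # settles near 2.4, not 2.0: inconsistent, not just biased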
Note that we can use the plim apparatus to show that omitted variable bias doesn't go away in
large samples. Suppose we omit variable $z$ with coefficient $\gamma$ (see lecture 5):
\[ \text{plim } \tilde{\beta} = \text{plim}\,(X'X)^{-1}X'Y = \text{plim}\,(X'X)^{-1}X'(X\beta + z\gamma + u) \]
\[ = \beta + \text{plim}\left[ \frac{1}{n}(X'X) \right]^{-1} \text{plim}\left[ \frac{1}{n} X'z \right] \gamma + \text{plim}\left[ \frac{1}{n}(X'X) \right]^{-1} \text{plim}\left[ \frac{1}{n} X'u \right] \]
\[ = \beta + Q^{-1}\delta\gamma + Q^{-1} \cdot 0 = \beta + Q^{-1}\delta\gamma \]
where $\delta = \text{plim}\left[ \frac{1}{n} X'z \right]$. The omitted variable bias, $Q^{-1}\delta\gamma$, does not go
away in large samples unless either $\gamma = 0$ or $\text{plim}\left[ \frac{1}{n} X'z \right] = 0$ (so that $X$ and $z$ are
asymptotically "uncorrelated").
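A parallel sketch for the omitted variable case (Python with numpy; the coefficients and the x-z correlation are illustrative): regressing $y$ on $x$ alone, the estimate converges to $\beta + \gamma \cdot \text{cov}(x,z)/\text{var}(x) = 2.6$, not to $\beta = 2.0$:

import numpy as np

rng = np.random.default_rng(9)
beta, gamma = 2.0, 1.0
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    z = 0.6 * x + rng.normal(size=n)          # omitted z is correlated with x
    y = beta * x + gamma * z + rng.normal(size=n)
    b_short = (x @ y) / (x @ x)               # slope from regressing y on x alone
    print(n, b_short)                         # converges to 2.6, not 2.0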