Econ 388    R. Butler, 2014 revisions
Lecture 9: Law of Large Numbers and Consistency

QUOTE: Infinite in All Directions, by Dyson

(These next two lectures use the outline of the Wooldridge text, with some examples etc. from the Kennedy, Johnston, Goldberger, and White econometrics texts.)

I. Introduction

The OLS coefficient vector is unbiased (under the first three assumptions), efficient (the BLUE estimator if the variance assumption holds: it has the smallest variance among all the linear estimators that are unbiased), and normal (if the errors are normal). But even if the errors are not normal (so that the dependent variable, Y_i, is not normal either), whenever the sample is large we can still make asymptotic tests of hypotheses just as though the β̂_i were normal. This magic comes because:

A. The least squares coefficient vector is actually a weighted average, or mean, of the Y_i (White, Asymptotic Theory for Econometricians, chapter 1), and weighted averages from large samples are normally distributed.

1.) The usual way to write the OLS β̂ is in matrix form:

   \hat\beta = (X'X)^{-1} X'y

(here X' is k×n, X is n×k, and y is n×1). Another way is to write it equivalently as a function of the vectors of independent variables. Denote observation i as x_i, a 1×k vector (the ith row of the X matrix), and write out β̂ as a linear function as follows:

   \hat\beta = \Big(\tfrac{1}{N}\sum_{i=1}^{N} x_i'x_i\Big)^{-1} \Big(\tfrac{1}{N}\sum_{i=1}^{N} x_i'y_i\Big)

This shows how you can update the estimate of β̂ when adding one more observation (the N+1st observation): simply add x'_{N+1} x_{N+1} to the sum in the first term and x'_{N+1} y_{N+1} to the sum in the second term (a numerical sketch of these two forms and of the updating rule appears at the end of this section). It also shows that β̂ is a weighted average of the y_i, β̂ = Σ_i C_i y_i, where C_i = (Σ_{i=1}^{N} x_i'x_i)^{-1} x_i'.

Finally, substitute y_i = x_i β + ε_i (the population model for the ith observation) into this second form to get

   \hat\beta = \beta + \Big(\tfrac{1}{N}\sum_{i=1}^{N} x_i'x_i\Big)^{-1} \Big(\tfrac{1}{N}\sum_{i=1}^{N} x_i'\varepsilon_i\Big).

To show that β̂ is consistent (the distribution, not just the mean of the distribution, converges to β), we assume that the (1/N) Σ x_i'x_i term converges to a finite positive definite matrix as N gets large, and that (1/N) Σ x_i'ε_i (the mean of the x_i'ε_i) literally collapses to zero. First, some results and definitions:

B. There are special results for means that come from large samples. In particular, there are three important results for a random sample from any population (Y_i normal or not) with E(Y) = μ and Var(Y) = σ²:

1. Law of Large Numbers: the plim (or probability limit) of Ȳ is μ.
2. Central Limit Theorem: √n (Ȳ − μ)/σ has a standard normal distribution in large samples, and
3. the asymptotic distribution of Ȳ is N(μ, σ²/n).

Now we apply these notions to our model

   \hat\beta = \beta + \Big(\tfrac{1}{N}\sum_i x_i'x_i\Big)^{-1} \Big(\tfrac{1}{N}\sum_i x_i'\varepsilon_i\Big)

1. Instead of the random variable Y_i, consider the random vector x_i'ε_i. The essential OLS assumption is that E(x_i'ε_i) = 0 (that ε is uncorrelated with each of the right hand side regressors). A stronger assumption that implies this is E(ε_i | x_i) = 0. Then plim (1/N) Σ x_i'ε_i = 0 by the Law of Large Numbers.

2. The (Lindeberg-Levy) Central Limit Theorem: let w_i be a sequence of independent, identically distributed k×1 random vectors with finite second moments, E(w_i) = 0 and V(w_i) = E(w_i w_i') = Φ. Then

   \tfrac{1}{\sqrt{N}}\sum_{i=1}^{N} w_i \xrightarrow{d} N(0, \Phi)

where N(0, Φ) means normally distributed with mean vector 0 (k×1) and covariance matrix Φ (a k×k matrix). We apply this to

   \sqrt{N}(\hat\beta - \beta) = \Big(\tfrac{1}{N}\sum_i x_i'x_i\Big)^{-1} \Big(\tfrac{1}{\sqrt{N}}\sum_i x_i'\varepsilon_i\Big)

to get

   \tfrac{1}{\sqrt{N}}\sum_i x_i'\varepsilon_i \xrightarrow{d} N\big(0,\ \sigma^2 \tfrac{X'X}{N}\big)

so that

   \sqrt{N}(\hat\beta - \beta) \xrightarrow{d} N\big(0,\ (\tfrac{X'X}{N})^{-1}\,\sigma^2\,\tfrac{X'X}{N}\,(\tfrac{X'X}{N})^{-1}\big)

where (X'X/N)^{-1} corresponds to ((1/N) Σ x_i'x_i)^{-1}, or

   \sqrt{N}(\hat\beta - \beta) \xrightarrow{d} N\big(0,\ \sigma^2 (\tfrac{X'X}{N})^{-1}\big).

So in large samples we can treat β̂ as N(β, σ²(X'X)^{-1}).
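To make the matrix and summation forms of β̂ and the one-observation update concrete, here is a minimal numpy sketch (the simulated design, coefficient values, and variable names are made up purely for illustration): it checks that (X'X)^{-1}X'y and the summation form agree, then adds an (N+1)st observation by updating the two sums.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design: N observations, k = 3 regressors (a constant plus two covariates)
N, k = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.standard_t(df=5, size=N)            # deliberately non-normal errors
y = X @ beta + eps

# (1) Matrix form: beta_hat = (X'X)^(-1) X'y
beta_hat_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# (2) Summation form: accumulate sum_i x_i'x_i and sum_i x_i'y_i row by row
XtX = np.zeros((k, k))
Xty = np.zeros(k)
for i in range(N):
    xi = X[i:i + 1, :]                        # x_i as a 1 x k row vector
    XtX += xi.T @ xi
    Xty += (xi.T * y[i]).ravel()
beta_hat_sums = np.linalg.solve(XtX, Xty)
print(np.allclose(beta_hat_matrix, beta_hat_sums))    # True: the two forms agree

# (3) Updating with one more observation: add x_{N+1}'x_{N+1} and x_{N+1}'y_{N+1}
x_new = np.array([[1.0, 0.3, -1.2]])
y_new = (x_new @ beta).item() + rng.standard_t(df=5)
XtX += x_new.T @ x_new
Xty += (x_new.T * y_new).ravel()
print(np.linalg.solve(XtX, Xty))              # beta_hat based on all N+1 observations
```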
In this section we define plims and show that, under the first four model assumptions, the OLS estimator of the coefficient vector is consistent (so we will need to define consistency as well). Basically, we focus on the law of large numbers results here; next lecture we will apply the central limit theorem to get numerous results.

II. Consistency and the wonderful world of plims (probability limits)

Suppose that we calculated the mean of the distribution from a sample of size n, noting that E(Ȳ_n) = μ and V(Ȳ_n) = σ²/n for any sample size n. As n gets very large, the distribution of Ȳ_n collapses around μ; that is, the sampling distribution is centered at μ and its variance goes to zero as the sample gets large. [[show picture of this]]

To talk about this sort of result formally, we discuss the probability intervals of the estimator (Ȳ) around the true parameter (μ), by looking at the likelihood that Ȳ_n is within an arbitrarily small distance ε of μ:

   Prob(μ − ε < Ȳ_n < μ + ε) = Prob(|Ȳ_n − μ| < ε)

The random variable Ȳ_n is said to converge in probability to the constant μ if, as the sample size n increases, the likelihood that Ȳ_n lies within an ε distance of μ goes to one:

   \lim_{n \to \infty} \text{Prob}(|\bar{Y}_n - \mu| < \varepsilon) = 1

In other words, the likelihood that Ȳ_n lies within a small interval (an ε-interval) of μ can be made as close to one (certainty) as we would like, for an ε-interval as small as we like, by letting the sample size get sufficiently large. The shorthand way of writing this result is plim Ȳ_n = μ. If an estimator converges to the true parameter in this way, then it is said to be a consistent estimator. The law of large numbers says that the sample mean is a consistent estimator of the true population mean, so the equation above holds whenever the Y_i are independent, identically distributed random variables from a distribution with mean μ.

Plims are wonderful because they pass right through basic mathematical manipulations (adding, subtracting, multiplication, and division) and also through continuous transformations. Plims can do this; taking expected values cannot. We illustrate this with a graph from Kennedy's A Guide to Econometrics:

[Figure (after Kennedy): the roughly symmetric sampling distribution of β̂, drawn along the horizontal axis, is mapped through a convex function g(.) into the distribution of g(β̂) along the vertical axis, which is skewed to the right; E(g(β̂)) and g(E(β̂)) are marked and do not coincide.]

The convex function g(.) distorts the roughly (given my drawing skills) symmetric distribution of β̂ on the bottom (horizontal axis) into the skewed-to-the-right distribution on the vertical axis, namely the distribution of g(β̂). Given the distortion, we see that "expectation" does not "pass through" nonlinear functions, in the sense that E(g(β̂)) ≠ g(E(β̂)) (in fact, for such convex transformations it is always the case that E(g(β̂)) ≥ g(E(β̂)), a result known as Jensen's inequality). In other words, the expected value operator cannot "pass through" nonlinear transformations like g(.). However, the plim operation can "pass through." How can we use the diagram above to show that? In particular, how can we show that

   plim g(β̂) = g(plim β̂) ?

This is known as Slutsky's Theorem. Some other examples of this wonderfulness include:

   plim(ln(y)) = ln(plim(y))
   plim(y^a) = (plim(y))^a, where "a" is nonrandom
   plim(z/y) = plim(z)/plim(y)

The last result is true whether or not z and y are independently distributed.
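A small simulation makes the contrast between expectations and plims concrete. In the sketch below (the normal population, the choice g(.) = exp(.), and the sample sizes are all just illustrative assumptions), the sampling distribution of Ȳ_n is exactly N(μ, σ²/n), so we can draw Ȳ_n directly: Jensen's inequality shows up as E[g(Ȳ_n)] sitting above g(μ) in small samples, while the plim result shows up as the mass of g(Ȳ_n) piling up on g(μ) as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 3.0
g = np.exp                                    # a convex transformation g(.)

# For a normal population, Ybar_n is exactly N(mu, sigma^2/n), so draw it directly
for n in (5, 50, 5_000, 500_000):
    ybar = rng.normal(mu, sigma / np.sqrt(n), size=100_000)
    g_ybar = g(ybar)
    near = np.mean(np.abs(g_ybar - g(mu)) < 0.25)     # mass within 0.25 of g(mu)
    print(f"n={n:7d}  E[g(Ybar)] ~ {g_ybar.mean():8.3f}   g(mu) = {g(mu):.3f}   "
          f"P(|g(Ybar) - g(mu)| < 0.25) ~ {near:.3f}")

# Small n: E[g(Ybar)] sits well above g(mu), i.e. E(g(Ybar)) > g(E(Ybar)) -- Jensen's inequality.
# Large n: the last column climbs toward 1, i.e. plim g(Ybar) = g(plim Ybar) = g(mu) -- Slutsky.
```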
Plim results also extend to matrices and vectors, by taking the probability limit of each element of the matrix or the vector. This offers great simplifications. For matrices, then, we get such cool results as

   plim(Z^{-1}) = (plim Z)^{-1}
   plim(XZ) = plim(X) plim(Z)

With plims, we can alternatively show the consistency of the OLS regression coefficient vector as follows:

   \hat\beta = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon = \beta + \big[\tfrac{1}{n}X'X\big]^{-1}\big[\tfrac{1}{n}X'\varepsilon\big]

We assume that the first matrix in the far right hand term, [(1/n)X'X]^{-1}, converges to a finite, positive definite matrix (it will also be symmetric, though that will not matter for our proof of consistency). If we are conditioning on the X (so that the X matrix is "constant in repeated samples"), then this is no problem, and the matrix consists of averages of squares and cross products of the explanatory variables. If the explanatory variables are taken as stochastic, then it can be shown to converge in probability to the matrix of population moments under general conditions (White):

   \text{plim}\,\big[\tfrac{1}{n}X'X\big]^{-1} = Q^{-1}

where Q is the symmetric, finite, positive definite matrix of population moments.

What we have to show is that the plim of the second term, [(1/n)X'ε], is the 0 vector. Let's look at the elements of this vector:

   \text{plim}\,\big[\tfrac{1}{n}X'\varepsilon\big] =
   \begin{pmatrix}
   \text{plim}\,\tfrac{1}{n}\sum_i \varepsilon_i \\
   \text{plim}\,\tfrac{1}{n}\sum_i X_{1i}\varepsilon_i \\
   \vdots \\
   \text{plim}\,\tfrac{1}{n}\sum_i X_{ki}\varepsilon_i
   \end{pmatrix}

The topmost element inside the right hand side vector is ε̄ (the regression includes a constant, so the first column of X is a column of ones). By the law of large numbers, plim ε̄ = 0. Another view of this same result is to note that E(ε̄) = 0 and Var(ε̄) = σ²/n, so that as n goes to infinity the variance goes to zero and the distribution of ε̄ collapses around 0.

For the other elements of this vector (say the jth element), we note that

   E\big(\tfrac{1}{n}\sum_i X_{ji}\varepsilon_i\big) = 0

which holds for the case where we condition on X (so that we treat it as fixed), and for the case where X is stochastic as long as the covariance of X and ε is zero (see the bottom of p. 164 in Wooldridge). Next we show that the variance goes to zero for these terms:

   \text{Var}\big(\tfrac{1}{n}\sum_i X_{ji}\varepsilon_i\big) = \tfrac{\sigma^2}{n}\big(\tfrac{1}{n}\sum_i X_{ji}^2\big)

By the argument above for the [(1/n)X'X]^{-1} matrix, the probability limit of the (1/n) Σ X²_{ji} term is a constant, while the probability limit (plim) of σ²/n is zero. Hence the plim of (1/n) Σ X_{ji}ε_i (and of all the other terms in the [(1/n)X'ε] vector) is zero. This means that

   \text{plim}\,\big[\tfrac{1}{n}X'\varepsilon\big] = 0

so that

   \text{plim}\,\hat\beta = \beta + \text{plim}\big[\tfrac{1}{n}X'X\big]^{-1}\,\text{plim}\big[\tfrac{1}{n}X'\varepsilon\big] = \beta + Q^{-1}\cdot 0 = \beta

This shows that the OLS estimator is consistent. If the error terms ε are correlated with the Xs, so that plim[(1/n)X'ε] ≠ 0, then the OLS estimator will not be consistent (the extent of the inconsistency will depend on the strength of the correlation between X and ε). This is the problem with simultaneous equations bias, which we will treat towards the end of the semester.

Note that we can also use the plim apparatus to show that omitted variable bias does not go away in large samples. Suppose we omit variable z with coefficient α (see lecture 5):

   \text{plim}\,\tilde\beta = \text{plim}\,(X'X)^{-1}X'y = \text{plim}\,(X'X)^{-1}X'(X\beta + z\alpha + u)
   = \beta + \text{plim}\big[\tfrac{1}{n}X'X\big]^{-1}\text{plim}\big[\tfrac{1}{n}X'z\big]\alpha + \text{plim}\big[\tfrac{1}{n}X'X\big]^{-1}\text{plim}\big[\tfrac{1}{n}X'u\big]
   = \beta + Q^{-1}\delta\alpha + Q^{-1}\cdot 0 = \beta + Q^{-1}\delta\alpha

where δ = plim[(1/n)X'z]. The bias from the omitted variable, Q^{-1}δα, does not go away in large samples unless either α = 0 or plim[(1/n)X'z] = 0 (so that X and z are asymptotically "uncorrelated").
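To see both results at work numerically, here is a minimal numpy sketch (the data generating process, the 0.6 coefficient linking z to x, and the values β = 2 and α = 1.5 are made-up illustrations): when z is omitted and is correlated with x, the OLS slope on x settles down as n grows, but on β + α·0.6 = 2.9 rather than on β = 2, so the bias survives in large samples.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, alpha = 2.0, 1.5            # true coefficients on x and on the omitted variable z

def short_regression_slope(n):
    """OLS slope of y on (1, x) when the true model is y = x*beta + z*alpha + u and z is omitted."""
    x = rng.normal(size=n)
    z = 0.6 * x + rng.normal(size=n)          # z correlated with x: plim (1/n) sum x_i z_i = 0.6
    u = rng.normal(size=n)
    y = beta * x + alpha * z + u
    X = np.column_stack([np.ones(n), x])      # z is left out of the regression
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

for n in (100, 10_000, 1_000_000):
    print(f"n={n:8d}   estimated slope on x ~ {short_regression_slope(n):.4f}")

# The slope converges to beta + alpha*0.6 = 2.9, not to beta = 2.0: the omitted-variable term
# plim[(1/n)X'X]^(-1) plim[(1/n)X'z] alpha is not zero here, so the bias does not vanish as n grows.
```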