ECONG020: Econometrics
Overview of Prerequisites
Martin Weidner, UCL


Matrix Algebra

You are expected to be familiar with the following concepts:
- row vector $v$, column vector $w$, matrix $A$
- transpose $v'$, $w'$, $A'$
- basic operations ($+$, $-$, $\times$)
- inverse matrix: $A^{-1}$
- rank of a matrix: $\mathrm{rank}(A)$
- positive (semi-)definite matrix: $A > 0$, $A \geq 0$
- trace and determinant of a matrix: $\mathrm{tr}(A)$, $\det(A)$

You can study this from any linear algebra textbook. There will also be a math-stats pre-course before the fall term starts, which also includes some matrix algebra, probability theory and basic asymptotic theory (see the requirements below), but it might be useful to start studying this over the summer already.


Probability Theory

- random variable, random vector, random matrix
- probability density function (pdf), cumulative distribution function (cdf)
- expected value $\mathrm{E}$, variance $\mathrm{Var}$
- normal distribution $N(\mu, \sigma^2)$ and its properties
- multivariate normal distribution $N(\mu, \Sigma)$
- conditional distribution, conditional expectation
- independent and identically distributed (iid)
- convergence in probability $\to_p$
- weak law of large numbers
- convergence in distribution $\Rightarrow$
- central limit theorem
- continuous mapping theorem, Slutsky's theorem, delta method

Reference: e.g. "Statistical Inference" by Casella and Berger.


Basic Concepts in Estimation and Statistical Inference

- Estimator
- Unbiasedness
- Consistency
- Asymptotic Normality
- Confidence Intervals
- Hypothesis Testing
- etc.

Literature:
- e.g. Casella and Berger, Statistical Inference
- Many econometrics textbooks (e.g. Greene, Econometric Analysis) have a mathematical appendix covering matrix algebra, statistical inference, etc.


Example: Inference on the Mean of a Sample

- An example to illustrate terms and concepts in probability theory and statistical inference.
- Observed data (sample): $y_1, y_2, \ldots, y_n$.
- We model $y_i$ ($i = 1, \ldots, n$) as random variables, and we think of the observations as one concrete realization of these random variables.
- We assume that $y_i$ and $y_j$ are independent for $i \neq j$ and identically distributed. The distribution of $y_i$ is called the population.
- Question: from the sample, what can we learn about the population mean $\mathrm{E}(y_i)$?


Example: Inference on the Mean of a Sample (cont.)

- For example, we may be interested in the average height of students at UCL. The distribution of student heights at UCL is the population. We could sample all students at UCL to get $\mathrm{E}(y_i)$ exactly. However, given limited time and resources, we decide to only take a random sample of $n = 200$ students and measure their height. From this sample, what do we learn about $\mathrm{E}(y_i)$?


Unbiasedness

- Notation: $y_i = \beta + u_i$, where $\beta = \mathrm{E}(y_i)$ is the parameter of interest and the $u_i$ are iid random shocks with $\mathrm{E}(u_i) = 0$.
- Estimator (sample mean): $\hat\beta = \frac{1}{n} \sum_{i=1}^n y_i$ (this is a random variable).
- We have: $\hat\beta = \beta + \frac{1}{n} \sum_{i=1}^n u_i$.
- Since we "assume" $\mathrm{E}(u_i) = 0$ (you can either view this as an assumption, or as a result of the definition $\beta = \mathrm{E}(y_i)$), we find that the estimator is unbiased, i.e. $\mathrm{E}\hat\beta = \beta$. (A small simulation illustrating this is sketched below.)
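To make unbiasedness concrete, here is a minimal simulation sketch in Python. It is not part of the original slides: the values $\beta = 2$, $\sigma = 1$, the sample sizes, the number of replications, and the use of normal draws are all arbitrary illustrative choices. The sketch draws many samples, computes $\hat\beta$ for each, and checks that the average of $\hat\beta$ across samples is close to $\beta$, while the spread of $\hat\beta$ shrinks as $n$ grows.

```python
# Simulation sketch (illustrative values only): unbiasedness of the sample mean.
import numpy as np

rng = np.random.default_rng(0)
beta, sigma = 2.0, 1.0           # true mean and standard deviation of y_i (assumed values)
n_reps = 10_000                  # number of simulated samples

for n in (10, 200, 5_000):       # sample sizes
    # each row is one sample y_1, ..., y_n; beta_hat is the sample mean of each row
    y = beta + sigma * rng.standard_normal((n_reps, n))
    beta_hat = y.mean(axis=1)
    print(f"n = {n:5d}:  mean of beta_hat = {beta_hat.mean():.4f} "
          f"(unbiasedness: close to beta = {beta}),  "
          f"sd of beta_hat = {beta_hat.std():.4f} (shrinks as n grows)")
```

The shrinking spread of $\hat\beta$ as $n$ increases is exactly what the next slide on consistency formalizes.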
Consistency

- Since we assume that the $y_i$ are iid and (implicitly) that $\mathrm{E}(y_i)$ exists, we can conclude that $\hat\beta$ is consistent, i.e. $\hat\beta \to_p \beta$ as $n \to \infty$.
- What does this mean? If samples get larger and larger ($n \to \infty$), we have a different $\hat\beta = \hat\beta_n$ for each sample size, i.e. a sequence of random variables indexed by $n$. Convergence in probability of $\hat\beta_n$ to $\beta$ is defined by
  $$\forall \varepsilon > 0: \quad \lim_{n \to \infty} P(|\hat\beta_n - \beta| < \varepsilon) = 1,$$
  i.e. as $n$ becomes large, "the probability that $\hat\beta_n$ is arbitrarily close to $\beta$ converges to one". (In this definition $\beta$ can be a random variable; in our case it is just a number.)
- Why is this true? Weak law of large numbers (WLLN).


Weak Law of Large Numbers

Theorem (WLLN): Let $X_1, X_2, X_3, \ldots$ be a sequence of iid random variables, and assume that $\mathrm{E}|X_i| < \infty$. Then $\frac{1}{n} \sum_{i=1}^n X_i \to_p \mathrm{E} X_i$.

Comment: Often it is assumed that $\mathrm{E} X_i^2 < \infty$ (because then the proof is very easy: just apply the Chebyshev inequality $P(|Z| \geq \varepsilon) \leq \mathrm{E}(Z^2)/\varepsilon^2$ to $Z = \frac{1}{n} \sum_{i=1}^n (X_i - \mathrm{E} X_i)$), but the weaker condition $\mathrm{E}|X_i| < \infty$ is also sufficient. We write $\mathrm{E}|X_i| < \infty$, which just means that $\mathrm{E} X_i$ exists.


(Asymptotic) Normality of β̂

Finite sample normality:
- Additional assumption: $u_i \sim N(0, \sigma^2)$.
- Then: $\hat\beta \sim N(\beta, \sigma^2/n)$ (show!).

Asymptotic normality:
- Additional assumption: $\mathrm{E}(u_i^2) = \sigma^2$.
- Then: $\mathrm{Var}(\hat\beta_n) = \frac{1}{n}\sigma^2$ (show!).
- "Natural" rescaling of $\hat\beta$: $\sqrt{n}(\hat\beta - \beta)$.
- The Central Limit Theorem (CLT) implies that as $n \to \infty$ we have $\sqrt{n}(\hat\beta - \beta) \Rightarrow N(0, \sigma^2)$, where "$\Rightarrow$" refers to convergence in distribution.


Convergence in Distribution

- Denote by $F_X$ the cumulative distribution function (cdf) of the random variable $X$, i.e. $F_X(x) \equiv P(X \leq x)$.
- Definition: A sequence of random variables $X_1, X_2, \ldots$ converges in distribution to the random variable $X$ (we write $X_n \Rightarrow X$) if
  $$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$
  for all points $x$ where $F_X(x)$ is continuous.
- We write $X_n \Rightarrow X$, or more often $X_n \Rightarrow D_X$, where $D_X$ is the distribution of $X$, e.g. $D_X = N(\mu, \sigma^2)$.
- Theorem: if $X_n \to_p X$, then also $X_n \Rightarrow X$.


Central Limit Theorem

Lindeberg–Lévy CLT: Let $X_1, X_2, X_3, \ldots$ be a sequence of iid random variables with $\mathrm{E} X_i = \mu$ and $\mathrm{Var}\, X_i = \sigma^2 < \infty$. Let $\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$. Then
$$\sqrt{n}(\bar X_n - \mu) \Rightarrow N(0, \sigma^2).$$

- Later in the lecture we may need other versions of the CLT that hold under weaker assumptions.


Asymptotic Variance of an Estimator

- We saw that $\sqrt{n}(\hat\beta - \beta) \Rightarrow N(0, \sigma^2)$ as $n \to \infty$.
- Thus, for the asymptotic variance of $\hat\beta$ we have $\mathrm{AVar}(\sqrt{n}\,\hat\beta) = \sigma^2$.
- More generally, whenever $\sqrt{n}(\hat\beta - \beta) \Rightarrow X$ for some random variable/vector $X$, we write $\mathrm{AVar}(\sqrt{n}\,\hat\beta) = \mathrm{Var}(X)$.
- (There is a subtlety here: SOMETIMES we have $\mathrm{AVar}(\sqrt{n}\,\hat\beta) = \lim_{n \to \infty} \mathrm{Var}(\sqrt{n}\,\hat\beta) = \lim_{n \to \infty} n\,\mathrm{Var}(\hat\beta)$, but this need not be true.)


Confidence Intervals

- We have a consistent estimator $\hat\beta$, but for a finite sample size $n$ we want to provide a measure of how close $\hat\beta$ is to the true mean $\beta = \mathrm{E} y_i$.
- Since $\sqrt{n}(\hat\beta - \beta) \Rightarrow N(0, \sigma^2)$, we know that
  $$\lim_{n \to \infty} P\big(\sqrt{n}\,|\hat\beta - \beta| \leq 1.96\,\sigma\big) = 0.95.$$
- Thus, for sufficiently large $n$ we have
  $$P\left(\beta \in \left[\hat\beta - \tfrac{1.96\,\sigma}{\sqrt{n}},\; \hat\beta + \tfrac{1.96\,\sigma}{\sqrt{n}}\right]\right) \approx 0.95.$$
- We need an estimator $\hat\sigma$ that satisfies $\hat\sigma \to_p \sigma$ as $n \to \infty$. Then
  $$\lim_{n \to \infty} P\left(\beta \in \underbrace{\left[\hat\beta - \tfrac{1.96\,\hat\sigma}{\sqrt{n}},\; \hat\beta + \tfrac{1.96\,\hat\sigma}{\sqrt{n}}\right]}_{\text{95\% confidence interval for } \beta}\right) = 0.95.$$


Confidence Intervals (cont.)

- Estimator for $\sigma$:
  $$\hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \hat\beta)^2.$$
- Show: $\mathrm{E}\hat\sigma^2 = \sigma^2$ (unbiased). (Useful trick: $\frac{1}{n}\sum_{i=1}^n (y_i - \beta)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat\beta + \hat\beta - \beta)^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat\beta)^2 + (\hat\beta - \beta)^2$, which you could also write in a more familiar form: $\hat{\mathrm{E}}(u^2) - (\hat{\mathrm{E}} u)^2 = \hat{\mathrm{E}}(u - \hat{\mathrm{E}} u)^2$, with $\hat{\mathrm{E}} = \frac{1}{n}\sum_i$ and "$u = u_i$".)
- Using the factor $1/(n-1)$ instead of $1/n$ is often called the "degrees of freedom correction".
- Show: $\hat\sigma \to_p \sigma$ as $n \to \infty$ (consistent). (It is useful to know the "continuous mapping theorem" to show this; see below.)


Hypothesis Testing

- Sometimes one is not interested in estimating $\beta = \mathrm{E} y_i$, but in testing whether $\beta$ has a particular value. For example:
  - Null hypothesis $H_0: \beta = r$
  - Alternative hypothesis $H_a: \beta \neq r$
- The so-called t-test statistic for testing $H_0$ reads
  $$t = \frac{\sqrt{n}(\hat\beta - r)}{\hat\sigma}.$$
- Under $H_0$, as $n \to \infty$ we have $t \Rightarrow N(0, 1)$.
- Therefore we reject $H_0$ at the 95% confidence level if $|t| > 1.96$. (A short sketch of how the confidence interval and the t-test are computed for a simulated sample follows below.)
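As an illustration of the last two slides, the following Python sketch computes the 95% confidence interval and the t-statistic for one simulated sample. It is not part of the original slides: $\beta = 2$, $\sigma = 1$, $n = 200$, the normal data, and the null value $r = 2$ are arbitrary illustrative choices; the formulas are the ones above, with $\hat\sigma$ using the $1/(n-1)$ correction.

```python
# Sketch (illustrative values only): 95% confidence interval and t-test for the mean.
import numpy as np

rng = np.random.default_rng(1)
beta, sigma, n = 2.0, 1.0, 200               # assumed true values and sample size
y = beta + sigma * rng.standard_normal(n)    # one observed sample y_1, ..., y_n

beta_hat = y.mean()                          # sample mean
sigma_hat = y.std(ddof=1)                    # uses the 1/(n-1) degrees-of-freedom correction

half_width = 1.96 * sigma_hat / np.sqrt(n)
ci = (beta_hat - half_width, beta_hat + half_width)

r = 2.0                                      # null hypothesis H0: beta = r (assumed value)
t = np.sqrt(n) * (beta_hat - r) / sigma_hat
reject = abs(t) > 1.96

print(f"beta_hat = {beta_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"t = {t:.3f}, reject H0: {reject}")
```

Since the data are simulated under $H_0$, the interval should cover $\beta = 2$ and the test should not reject in roughly 95% of such simulated samples.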
Continuous Mapping Theorem

Continuous Mapping Theorem: Let $g: \mathbb{R}^k \to \mathbb{R}^\ell$ be a continuous function, and let $X_1, X_2, \ldots$ be a sequence of random $k$-vectors. If $X_n \to_p X$ as $n \to \infty$, then $g(X_n) \to_p g(X)$. If $X_n \Rightarrow X$ as $n \to \infty$, then $g(X_n) \Rightarrow g(X)$.

A direct application is that if $U_n \to_p U$ and $V_n \to_p V$, then $U_n + V_n \to_p U + V$, and analogously for other operations ($-$, $\times$, $/$).


Slutsky's Theorem

Another application of the continuous mapping theorem:

Slutsky's Theorem: Let $U_1, U_2, \ldots$ and $V_1, V_2, \ldots$ be sequences of random variables, vectors or matrices (of appropriate dimension). As $n \to \infty$, let $U_n \to_p U$, where $U$ is a non-random constant scalar, vector or matrix, and $V_n \Rightarrow V$ for some random variable, vector or matrix $V$. Then
- $U_n + V_n \Rightarrow U + V$
- $U_n V_n \Rightarrow U V$
- if also $P(\det(U_n) = 0) = 0$, then $U_n^{-1} V_n \Rightarrow U^{-1} V$.

Using this we can, for example, show that $\sqrt{n}\,\hat\sigma^{-1}(\hat\beta - \beta) \Rightarrow N(0, 1)$ (we used that implicitly before when deriving the confidence interval).


Delta Method

Delta Method: Let $X_1, X_2, X_3, \ldots$ be a sequence of random $k$-vectors (e.g. $\hat\beta$) and let $X$ be a constant $k$-vector such that $\sqrt{n}(X_n - X) \Rightarrow N(0, \Sigma)$ for some $k \times k$ matrix $\Sigma$. Let $g: \mathbb{R}^k \to \mathbb{R}^\ell$ be continuously differentiable at $X$. Then
$$\sqrt{n}\,\big(g(X_n) - g(X)\big) \Rightarrow N(0, G \Sigma G'), \qquad \text{where } G = \left.\frac{\partial g(x)}{\partial x'}\right|_{x = X},$$
i.e. $G$ is the $\ell \times k$ Jacobian matrix.

- For example: we know $\sqrt{n}(\hat\beta - \beta) \Rightarrow N(0, \sigma^2)$. Using the delta method we can show that $\sqrt{n}(\hat\beta^2 - \beta^2) \Rightarrow N(0, 4\beta^2\sigma^2)$. (A small numerical check of this example is sketched at the end of this handout.)


Econometrics

- The course assumes knowledge of matrix algebra, probability theory and basic asymptotic theory, as summarized above.
- The first half of the course covers linear models: OLS estimation, Instrumental Variables, Hypothesis Testing.
- The second half of the course covers Maximum Likelihood Estimation, Generalized Method of Moments, and some basic Time Series Methods.
- There is no required textbook for this course, but we often follow the presentation in Wooldridge, "Econometric Analysis of Cross Section and Panel Data". Most graduate econometrics textbooks cover very similar material, and you can choose your favourite one to study from.
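Finally, the promised numerical check of the delta-method example. This Monte Carlo sketch is not part of the original slides: $\beta = 2$, $\sigma = 1$, $n$ and the number of replications are arbitrary illustrative choices. For $g(\beta) = \beta^2$, the simulated variance of $\sqrt{n}(\hat\beta^2 - \beta^2)$ should be close to the delta-method prediction $4\beta^2\sigma^2$.

```python
# Monte Carlo sketch (illustrative values only): check the delta-method variance
# for g(beta) = beta^2, i.e. sqrt(n)(beta_hat^2 - beta^2) ~ approx N(0, 4 beta^2 sigma^2).
import numpy as np

rng = np.random.default_rng(2)
beta, sigma, n, n_reps = 2.0, 1.0, 1_000, 5_000   # assumed values

y = beta + sigma * rng.standard_normal((n_reps, n))
beta_hat = y.mean(axis=1)
z = np.sqrt(n) * (beta_hat**2 - beta**2)          # sqrt(n) * (g(beta_hat) - g(beta))

print(f"simulated variance of sqrt(n)(beta_hat^2 - beta^2): {z.var():.3f}")
print(f"delta-method prediction 4*beta^2*sigma^2:           {4 * beta**2 * sigma**2:.3f}")
```

The two printed numbers should agree up to simulation noise, which is exactly the content of the delta-method example above.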