Statistics and Inference
Paul Schrimpf
January 14, 2016

Contents
1 Properties of estimators
  1.1 Unbiasedness
  1.2 Variance of estimators
2 Finite sample inference
3 Asymptotics
  3.1 Convergence in probability
  3.2 Convergence in distribution
4 Hypothesis testing
  4.1 Examples
  4.2 Asymptotic interpretation of p-values

References
• Wooldridge (2013) appendix C
• Abbring (2001) section 2.4
• Diez, Barr, and Cetinkaya-Rundel (2012) chapters 4, 5, & 6
• Grinstead and Snell (2003) chapters 8 & 9

Statistics
• Interested in parameters of a population, e.g.
  – Moments, functions of moments
  – Conditional expectation functions
  – Distribution functions
• Learn about parameters by observing a sample from the population
• Use the sample to form estimates of the parameters

1 Properties of estimators

1.1 Unbiasedness

Unbiasedness
• Setup:
  – Sample {y1, ..., yn}
  – Parameter θ
  – Estimator W, some function of the sample
• The bias of W is Bias(W) = E[W] − θ
• W is unbiased if E[W] = θ
• Examples:
  – Sample mean: ȳ = (1/n) ∑_{i=1}^n yi is unbiased
  – Sample variance: s²_y = (1/n) ∑_{i=1}^n (yi − ȳ)² is biased

An example of a parameter is the population mean, E[Y]. An estimator for the population mean is the sample mean, ȳ = (1/n) ∑_{i=1}^n yi. Another (less useful) estimator of the population mean is just the first observation, y1. Samples are random, and estimators are functions of a sample, so estimators are random variables. Therefore, it makes sense to think about the distribution of an estimator, the mean of an estimator, and so on. For an estimator to be useful, it should be related to a parameter in some nice way. One thing we might want to know about an estimator is whether, on average, it equals the parameter we want. If it does, the estimator is unbiased.

1.2 Variance of estimators

Bias is not the only property we should want our estimators to have. Both the sample mean and the first observation are unbiased estimators of the population mean. However, we should expect (and it is true) that the sample mean is more likely to be close to the population mean than the first observation alone. The variance and the mean square error of an estimator are two ways of quantifying how variable an estimator is (a small simulation at the end of this section illustrates the comparison).

Variance of estimators
• The variance of W is Var(W)
• The standard error of W is √Var(W)
• If W1 and W2 are two unbiased estimators, then W1 is relatively efficient if Var(W1) ≤ Var(W2)
• The mean square error of W is MSE(W) = E[(W − θ)²]
  – MSE(W) = Bias(W)² + Var(W)
• Prefer estimators that are relatively efficient and/or have low MSE

Variance of sample mean
• Sample mean: ȳ = (1/n) ∑_{i=1}^n yi
• Variance of sample mean:
    Var(ȳ) = Var( (1/n) ∑_{i=1}^n yi )
           = Cov( (1/n) ∑_{i=1}^n yi , (1/n) ∑_{j=1}^n yj )
           = (1/n²) ∑_{i=1}^n ∑_{j=1}^n Cov(yi, yj)
           = (1/n²) ∑_{i=1}^n Var(yi)      (assuming independent observations)
           = (1/n) Var(Y)
• Standard error of sample mean: SE(ȳ) = √(Var(Y)/n)
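The comparison between the sample mean and the first observation can be made concrete with a small simulation. The R sketch below is a minimal illustration under a purely hypothetical setup: a uniform population on [0, 1], so the true mean is 0.5 and the population variance is 1/12. Both estimators come out (approximately) unbiased, but the sample mean has variance and MSE smaller by roughly a factor of n.

## Minimal simulation (hypothetical uniform population with mean 0.5, variance 1/12)
## comparing two unbiased estimators of the population mean.
set.seed(1)
simulations <- 5000
n <- 30
ybar <- numeric(simulations)
y1   <- numeric(simulations)
for (s in 1:simulations) {
  y <- runif(n)              ## one sample of size n
  ybar[s] <- mean(y)         ## sample mean
  y1[s]   <- y[1]            ## first observation only
}
c(mean(ybar), mean(y1))                        ## both close to 0.5: both unbiased
c(var(ybar), var(y1))                          ## roughly (1/12)/n versus 1/12
c(mean((ybar - 0.5)^2), mean((y1 - 0.5)^2))    ## MSE comparison

Since both estimators are unbiased, the MSE comparison here is driven entirely by the variance, which is why the sample mean is preferred.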
2 Finite sample inference

Knowing estimators’ variances and MSEs is useful for choosing which estimator to use. Once we have chosen an estimator and calculated it for a given sample, we would like to know something about how “close” our estimate is to the parameter we’re interested in. Statistical inference is the tool for doing this. Estimators are functions of a sample of random variables, so estimators themselves are random variables. If we draw different samples, or repeat our experiment, our estimator will take on different values. Therefore, we can think about describing the distribution of an estimator.

Inference
• Inference: (roughly) how close are our sample estimates to the population parameters
• Example: n independent observations distributed Bernoulli(p), Xi = 1 if success, else 0
  – Joint distribution f(x1, ..., xn) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi} = p^{∑i xi} (1 − p)^{n − ∑i xi}
  – Could use this to compute how surprising the observed x̄ is if the true p = p0, e.g. compute P( |(1/n) ∑_{i=1}^n Xi − p0| ≥ |x̄ − p0| )
• Example: xi normally distributed, can do inference on the sample mean using t-tests
• Problem: usually we do not know the distribution of the sample and/or the distribution of the estimator is intractable

In a few cases, such as when our sample consists of Bernoulli or normal random variables, we can calculate the exact distribution of some estimators. Unfortunately, most of the time, calculating the exact distribution of an estimator is either very complicated or requires too many strong assumptions.

3 Asymptotics

When we cannot calculate the exact finite sample distribution of an estimator, we must try to approximate the distribution somehow. Fortunately, in large samples, most estimators have a distribution that can be calculated. We can use this large sample distribution to approximate the finite sample distribution of an estimator.

Asymptotic inference
• Idea: use the limit of the distribution of the estimator as n → ∞ to approximate the finite sample distribution of the estimator
• Notation:
  – Sequence of samples of increasing size n, Sn = {y1, ..., yn}
  – Estimator for each sample, Wn
• References:
  – Wooldridge (2013) appendix C
  – Menzel (2009), especially VI–IX

3.1 Convergence in probability

Loosely speaking, you can think of probability limits as the large sample version of expectations.

Convergence in probability
• Wn converges in probability to θ if for every ε > 0,
    lim_{n→∞} P(|Wn − θ| > ε) = 0,
  denoted plim Wn = θ or Wn →p θ
• Shown using a law of large numbers: if y1, ..., yn are i.i.d. with mean µ, or if y1, ..., yn have finite expectations and ∑_{i=1}^∞ Var(yi)/i² is finite, then ȳ →p µ
• Properties:
  – plim g(Wn) = g(plim Wn) if g is continuous (continuous mapping theorem (CMT))
  – If Wn →p ω and Zn →p ζ, then (Slutsky’s lemma)
    * Wn + Zn →p ω + ζ
    * Wn Zn →p ωζ
    * Wn/Zn →p ω/ζ (provided ζ ≠ 0)
• Wn is a consistent estimate of θ if Wn →p θ

Probability limits are much easier to work with than expectations, mostly because of the continuous mapping theorem and the second and third parts of Slutsky’s lemma. These properties of probability limits do not apply to expectations.
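To see the contrast, the following sketch is a minimal simulation under a hypothetical setup (an exponential population with mean 1) using the continuous function g(x) = x². For any finite n, E[g(ȳ)] = 1 + 1/n, which is not g(E[ȳ]) = 1, but g(ȳ) concentrates around g(plim ȳ) = 1 as n grows, exactly as the continuous mapping theorem says.

## CMT versus expectations: hypothetical exponential population, E[y] = 1, Var(y) = 1
set.seed(2)
g <- function(x) x^2          ## a continuous function
sims <- 10000
for (n in c(5, 50, 500, 5000)) {
  ybar <- replicate(sims, mean(rexp(n, rate = 1)))
  ## E[g(ybar)] = 1 + 1/n here, not g(E[ybar]) = 1, for any finite n,
  ## but g(ybar) piles up near g(plim ybar) = 1 as n grows.
  cat("n =", n,
      " mean of g(ybar):", round(mean(g(ybar)), 3),
      " P(|g(ybar) - 1| > 0.1):", round(mean(abs(g(ybar) - 1) > 0.1), 3), "\n")
}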
Demonstration of LLN

n <- seq(1, 200)   ## sample sizes
sims <- 10         ## number of simulations for each sample size
rand <- function(N) { return(runif(N)) }
## simulation
sampleMean <- matrix(0, nrow=length(n), ncol=sims)
for (i in 1:length(n)) {
  N <- n[i]
  x <- matrix(rand(N*sims), nrow=N, ncol=sims)
  sampleMean[i, ] <- colMeans(x)
}
## plot sample means
library(ggplot2)
library(reshape)
df <- data.frame(sampleMean)
df$n <- n
df <- melt(df, id="n")
llnPlot <- ggplot(data=df, aes(x=n, y=value, colour=variable)) +
  geom_line() +
  scale_colour_brewer(palette="Paired") +
  scale_y_continuous(name="sample mean") +
  guides(colour=FALSE)
llnPlot

[Figure: Demonstration of LLN — sample mean (vertical axis) against sample size n (horizontal axis) for each simulated sequence; the sample means settle down toward the population mean as n grows.]

Convergence in probability: examples
• ȳ →p E[y] by the LLN
• Sample variance:
    σ̂² = (1/n) ∑_{i=1}^n (yi − ȳ)²
       = (1/n) ∑_{i=1}^n yi² − ȳ²
  where (1/n) ∑_{i=1}^n yi² →p E[y²] by the LLN and ȳ² →p E[y]² by the LLN and CMT, so
    σ̂² →p E[y²] − E[y]² = Var(y)
• Mean divided by variance:
    ∑_{i=1}^n yi / ∑_{i=1}^n (yi − ȳ)² = [(1/n) ∑_{i=1}^n yi] / [(1/n) ∑_{i=1}^n (yi − ȳ)²]
      = ȳ/σ̂² →p E[y]/Var(y)
  by the above and Slutsky’s lemma

Convergence in probability: examples
• Sample correlation:
    ρ̂ = [(1/n) ∑_{i=1}^n (xi − x̄)(yi − ȳ)] / √[ (1/n) ∑_{i=1}^n (xi − x̄)² · (1/n) ∑_{i=1}^n (yi − ȳ)² ]
  – (1/n) ∑_{i=1}^n (xi − x̄)² →p Var(X) and (1/n) ∑_{i=1}^n (yi − ȳ)² →p Var(Y) as above
  – (1/n) ∑_{i=1}^n (xi − x̄)² · (1/n) ∑_{i=1}^n (yi − ȳ)² →p Var(X)Var(Y) by Slutsky’s lemma
  – √[ (1/n) ∑_{i=1}^n (xi − x̄)² · (1/n) ∑_{i=1}^n (yi − ȳ)² ] →p √(Var(X)Var(Y)) by the CMT
  – (1/n) ∑_{i=1}^n (xi − x̄)(yi − ȳ) →p Cov(X, Y) by similar reasoning as for the sample variance
  – ρ̂ →p Corr(X, Y) by Slutsky’s lemma

3.2 Convergence in distribution

Convergence in probability tells us that an estimator will eventually be very close to some number (hopefully the parameter we’re trying to estimate), but it does not tell us about the distribution of an estimator. From the law of large numbers, we know that ȳn − µ converges to the constant 0. The central limit theorem tells us that if we scale up the difference by √n, then √n(ȳn − µ) converges to a normal distribution with variance Var(y) = σ².

Convergence in distribution
• Let Fn be the CDF of Wn and W be a random variable with CDF F
• Wn converges in distribution to W, written Wn →d W, if lim_{n→∞} Fn(x) = F(x) for all x where F is continuous
• Central limit theorem: let {y1, ..., yn} be i.i.d. with mean µ and variance σ²; then Zn = √n (ȳn − µ)/σ converges in distribution to a standard normal random variable
  – As with the LLN, the i.i.d. condition can be relaxed if additional moment conditions are added
  – Demonstration below
• Properties:
  – If Wn →d W, then g(Wn) →d g(W) for continuous g (continuous mapping theorem (CMT))
  – Slutsky’s theorem: if Wn →d W and Zn →p c, then (i) Wn + Zn →d W + c, (ii) Wn Zn →d cW, and (iii) Wn/Zn →d W/c (for c ≠ 0)

The central limit theorem can be paraphrased as “scaled and centered sample means converge in distribution to normal random variables.” “Scaling” refers to the fact that we multiply by √n. “Centering” refers to the subtraction of the population mean. If an estimator converges in distribution to a normal random variable, then we often say that the estimator is asymptotically normal.
Demonstration of CLT

N <- c(1, 2, 5, 10, 20, 50, 100)
simulations <- 5000
means <- matrix(0, nrow=simulations, ncol=length(N))
for (i in 1:length(N)) {
  n <- N[i]
  dat <- matrix(rbeta(n*simulations, shape1=0.5, shape2=0.5),
                nrow=simulations, ncol=n)
  means[, i] <- (apply(dat, 1, mean) - 0.5) * sqrt(n)   ## scaled and centered sample means
}

## Plotting (ggplot2 and reshape loaded above)
df <- melt(data.frame(means))   ## "variable" indexes the sample size
cltPlot <- ggplot(data=df, aes(x=value, fill=variable)) +
  geom_histogram(alpha=0.2, position="identity") +
  scale_x_continuous(name=expression(sqrt(n)*(bar(x)-mu))) +
  scale_fill_brewer(type="div", palette="RdYlGn", name="N", labels=N)
cltPlot

[Figure: Demonstration of CLT — histograms of √n(x̄ − µ) for samples from a Beta(0.5, 0.5) distribution, for N = 1, 2, 5, 10, 20, 50, 100; the histograms look increasingly normal as N grows.]

CLT examples
• Mean: √n (ȳ − µ)/σ →d N(0, 1) by the CLT
• What if σ is estimated instead of known? Consider
    √n (ȳ − µ)/σ̂
  – σ̂ →p σ from before
  – √n (ȳ − µ) →d N(0, σ²) by the CLT
  – Slutsky’s theorem with Wn = √n (ȳ − µ) and Zn = σ̂ gives
      √n (ȳ − µ)/σ̂ →d N(0, 1)

CLT examples
• Variance:
  – σ̂² = (1/n) ∑_{i=1}^n (yi − ȳ)²
       = (1/n) ∑_{i=1}^n (yi − µ + µ − ȳ)²
       = [ (1/n) ∑_{i=1}^n (yi − µ)² ] − (ȳ − µ)²
  – So
      √n (σ̂² − σ²) = √n [ (1/n) ∑_{i=1}^n (yi − µ)² − σ² ] − √n (ȳ − µ)(ȳ − µ)
  – √n (ȳ − µ) →d N(0, σ²) and (ȳ − µ) →p 0, so √n (ȳ − µ)(ȳ − µ) →p 0
  – The CLT implies √n [ (1/n) ∑_{i=1}^n (yi − µ)² − σ² ] →d N(0, V), where V = Var((y − µ)²)

4 Hypothesis testing

The main way we will use the central limit theorem is to take the asymptotic distribution of an estimator as an approximation to the estimator’s finite sample distribution. We do this so that we can make (approximate) statements about how uncertain we are about a given estimate. Hypothesis testing is one way of making such statements.

The logic of hypothesis testing is slightly awkward. To understand it, keep in mind that we are trying to learn about a parameter. Parameters are fixed characteristics of a population. They are not random. Therefore, it makes no sense to talk about the probability that a parameter is a certain value or in a certain range. What is random is the sample that we observe and the estimate that we calculate from the sample. Therefore, we can calculate the probability that our estimate takes a value in some range. The probability that an estimate takes on a value depends on the true population parameter. Therefore, we generally quantify the randomness of an estimate by making statements like: assuming the true parameter is θ0, the probability of getting an estimate as far or farther from θ0 than the one we observe is p.

Hypothesis testing
• Null hypothesis, H0
• p-value: if the null hypothesis is true, what is the probability of a sample as or more extreme than what is observed
• From 325: if yi ∼ N(µ, σ²), then √n (ȳ − µ)/σ̂ ∼ t(n − 1)

4.1 Examples

Example: testing mean
• From 325:
  – If yi ∼ N(µ, σ²), then √n (ȳ − µ)/σ̂ ∼ t distribution with n − 1 degrees of freedom
  – p-value for H0: µ = µ0 vs Ha: µ ≠ µ0 is
      p = P( |√n ((1/n) ∑_{i=1}^n yi − µ0)/sy| ≥ |√n (ȳ − µ0)/σ̂| )
        = 2 Ft( −√n |ȳ − µ0|/σ̂ ; n − 1 ),
    where Ft(·; n − 1) is the CDF of the t distribution with n − 1 degrees of freedom (a numerical illustration follows this slide)
• What if the distribution of yi is unknown?
  – Use the CLT to approximate the distribution of the t-statistic
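As a numerical illustration of the formula above, here is a minimal R sketch with made-up data (the sample, its size, and the null value are purely hypothetical). It computes the exact t-based p-value by hand, checks it against R’s built-in t.test, and compares it with the CLT-based normal approximation discussed next.

## Hypothetical data: test H0: mu = 0 against Ha: mu != 0
set.seed(3)
y <- rnorm(20, mean = 0.3, sd = 1)    ## made-up sample, n = 20
n <- length(y)
mu0 <- 0
tstat <- sqrt(n) * (mean(y) - mu0) / sd(y)   ## sd() uses the n-1 divisor, as t.test does
## Exact p-value from the t distribution with n - 1 degrees of freedom
p.exact <- 2 * pt(-abs(tstat), df = n - 1)
t.test(y, mu = mu0)$p.value                  ## built-in check; matches p.exact
## Asymptotic (CLT-based) approximation using the standard normal CDF
p.asym <- 2 * pnorm(-abs(tstat))
c(p.exact, p.asym)                           ## close, but not identical, at n = 20

The gap between the exact and asymptotic p-values shrinks as n grows, which is the sense in which the CLT approximation is justified below.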
Example: testing one mean
• From the earlier slide,
    √n (ȳ − µ)/σ̂ →d N(0, 1)
• So,
    P( |√n ((1/n) ∑_{i=1}^n yi − µ0)/sy| ≥ |√n (ȳ − µ0)/σ̂| ) → 2 Φ( −√n |ȳ − µ0|/σ̂ ),
  where Φ is the standard normal CDF

Data on deworming treatment and school participation from Miguel and Kremer (2003):

                       Treatment   Control
  Sample size                873       352
  Infection                 0.32      0.54
  School attendance        0.808     0.684

Example: testing one mean
• Test H0: E[infected|control] = 0.5 against Ha: E[infected|control] ≠ 0.5
  – Let Ii = 1 if infected, else 0
  – What is the sample variance of Ii?
      σ̂²_I = (1/n) ∑_{i=1}^n (Ii − Ī)²
           = (1/n) ∑_{i=1}^n Ii² − Ī²
           = (1/n) ∑_{i=1}^n Ii − Ī²     (since Ii² = Ii)
           = Ī − Ī² = Ī(1 − Ī)
  – Test statistic
      z = √n (Ī − 0.5)/√(Ī(1 − Ī)) = √352 (0.54 − 0.5)/√(0.54(1 − 0.54)) = 1.506
  – P-value = 2(1 − Φ(z)) = 0.132

Example: testing two means
• Often want to test the difference between two means
  – H0: treatment has no effect, i.e. treatment mean = control mean
• Let µT = treatment mean, µC = control mean
• H0: µT = µC
• CLT implies
    √nT (ĪT − µT) →d N(0, σ²T) and √nC (ĪC − µC) →d N(0, σ²C),
  i.e. ĪT is approximately N(µT, σ²T/nT)
• A sum of two normals is normal
• If the treatment and control groups are independent, then under H0, ĪT − ĪC is approximately N(0, σ²T/nT + σ²C/nC); formally, we could show
    (ĪT − ĪC)/√(σ̂²T/nT + σ̂²C/nC) →d N(0, 1)

Example: testing two means
• Test H0: E[infected|control] = E[infected|treatment] against Ha: E[infected|control] > E[infected|treatment]
  – Test statistic:
      z = (ĪT − ĪC)/√(σ̂²T/nT + σ̂²C/nC)
        = (0.32 − 0.54)/√(0.32(1 − 0.32)/873 + 0.54(1 − 0.54)/352) = −7.12
  – P-value = Φ(z) = 5 × 10⁻¹³
• Test H0: E[school|control] = E[school|treatment] against Ha: E[school|control] < E[school|treatment]
  – Test statistic:
      z = (S̄T − S̄C)/√(σ̂²T/nT + σ̂²C/nC)
        = (0.808 − 0.684)/√(0.808(1 − 0.808)/873 + 0.684(1 − 0.684)/352) = 4.41
  – P-value = 1 − Φ(z) = 5 × 10⁻⁶

The RAND health insurance experiment randomly assigned 3985 Americans to one of 14 different insurance plans from 1974 to 1982. All insurance plans included catastrophic coverage: families’ total healthcare spending beyond a proportion of their income was completely covered. The least generous plan in the experiment had only catastrophic coverage. The most generous plan offered comprehensive care for free. There were also plans with a deductible of $150 per person or $450 per family, after which everything was covered. Finally, there were plans with coinsurance that required families to pay between 25% and 50% of the cost of health care out of pocket, up to the catastrophic limit. You can read more about the RAND health insurance experiment in chapter 1 of Angrist and Pischke (2014) or in Aron-Dine, Einav, and Finkelstein (2013).

Example: RAND health insurance experiment
[Results from the RAND experiment shown on the original slides are not reproduced here.]

p-value pitfalls
• Hypothesis tests and p-values are a tool for quantifying uncertainty, but not the only tool
• Significance thresholds are often over-emphasized
• A difference in significance is not necessarily a significant difference
  – A statistically significantly different from 0 and B not statistically significantly different from 0 does not imply that A − B is statistically significant
• When testing many hypotheses, we are likely to find significant results by chance alone (see the sketch below)
  – Researcher degrees of freedom and the garden of forking paths can introduce many tests inadvertently
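The multiple-testing point is easy to see in a short simulation. The sketch below uses a purely hypothetical setup: 100 independent two-sided tests of true null hypotheses, each at the 5% level, with samples from a standard normal population. Even though every null is true, a handful of tests come out “significant.”

## Hypothetical illustration: 100 independent tests of true nulls at the 5% level
set.seed(4)
tests <- 100
n <- 50
pvals <- replicate(tests, {
  y <- rnorm(n, mean = 0, sd = 1)        ## H0: mu = 0 is true for every test
  z <- sqrt(n) * mean(y) / sd(y)
  2 * pnorm(-abs(z))                     ## asymptotic two-sided p-value
})
sum(pvals < 0.05)   ## typically around 5 "significant" results by chance alone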
These ideas are frequently discussed by the statistician Andrew Gelman on his blog, for example:

http://andrewgelman.com/2005/06/14/the_difference/
http://andrewgelman.com/2011/09/09/the-difference-between-significant-and-not-significant/
http://andrewgelman.com/2016/01/04/plausibility-vs-probability-prior-distributions-garden-fork
http://andrewgelman.com/2013/11/06/marginally-significant/

4.2 Asymptotic interpretation of p-values

When we calculate p-values using the asymptotic distribution of an estimator, then in a given sample, we know that our p-value is not exactly correct. However, we also know that for a sufficiently large sample, the error in the p-value will be arbitrarily small. Exactly how large a sample we need for the error to be negligible depends on the true distribution of our sample and the number of parameters we estimated to calculate the p-value.

Asymptotic interpretation of p-values
• CLT: under H0,
    zn = √n (ȳ − µ)/σ̂ →d N(0, 1),
  or equivalently, lim_{n→∞} P(zn ≤ x) = Φ(x)
• P-values are only correct asymptotically (for large samples)
• In small finite samples, p-values will be somewhat off
• With exact p-values, under H0, p-values are distributed U[0, 1]
• With asymptotic p-values, under H0, p-values are only U[0, 1] in large samples

[Figure: Convergence of P(zn ≤ x) — CDF of zn against z for N = 2, 5, 10, 15, 20, and ∞.]

[Figure: Convergence of p-values — actual size against nominal size for N = 2, 5, 10, 15, 20, and ∞. The code for these graphs appeared on a separate slide and is not reproduced here.]

References

Abbring, Jaap. 2001. “An Introduction to Econometrics: Lecture notes.” URL http://jabbring.home.xs4all.nl/courses/b44old/lect210.pdf.

Angrist, Joshua D and Jörn-Steffen Pischke. 2014. Mastering ’Metrics: The Path from Cause to Effect. Princeton University Press.

Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein. 2013. “The RAND Health Insurance Experiment, Three Decades Later.” Journal of Economic Perspectives 27 (1):197–222. URL http://www.aeaweb.org/articles.php?doi=10.1257/jep.27.1.197.

Diez, David M, Christopher D Barr, and Mine Cetinkaya-Rundel. 2012. OpenIntro Statistics. OpenIntro. URL http://www.openintro.org/stat/textbook.php.

Grinstead, Charles M and J. Laurie Snell. 2003. Introduction to Probability. American Mathematical Society. URL http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html.

Menzel, Konrad. 2009. “14.30 Introduction to Statistical Methods in Economics.” Massachusetts Institute of Technology: MIT OpenCourseWare. URL http://ocw.mit.edu/courses/economics/14-30-introduction-to-statistical-methods-in-economics-spring-2009/lecture-notes/.

Miguel, E. and M. Kremer. 2003. “Worms: identifying impacts on education and health in the presence of treatment externalities.” Econometrica 72 (1):159–217. URL http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0262.2004.00481.x/abstract.

Wooldridge, J.M. 2013. Introductory econometrics: A modern approach. South-Western.