Interval Estimation
Edwin Leuven

Interval estimation

While an estimator may be unbiased or consistent, a given estimate will never equal the true value
▶ the point estimate does not give a sense of "closeness", while
▶ the variance estimate does not give a sense of location

We could try to combine both in statements like:

"We are confident that θ lies somewhere between . . . and . . . "

where we would like to
1. give a specific interval, and
2. be precise about how confident we are

This is the aim of a confidence interval (CI), which is a particular type of probability interval
2/36

Interval estimation

Formally we define a confidence interval as follows

Pr(L̂ < θ < Û) = 1 − α

where we construct estimates of a lower bound L̂ and an upper bound Û such that the interval [L̂, Û] covers the parameter of interest with probability 1 − α

We call 1 − α the confidence level

CI's are random intervals because they differ across random samples

The confidence level is thus a probability relative to the sampling distribution!
3/36

Probability intervals

We will consider probability intervals

Pr(a < X < b)

for a continuous r.v. X with density f(x) and support on the real line; this equals

Pr(a < X < b) = ∫_a^b f(x) dx
             = ∫_{−∞}^b f(x) dx − ∫_{−∞}^a f(x) dx
             = F(b) − F(a)

Note that since X is continuous, Pr(X = x) = 0 and

Pr(a < X < b) = Pr(a ≤ X < b) = Pr(a < X ≤ b) = Pr(a ≤ X ≤ b)
4/36

Probability intervals (X ∼ χ²(3))

[Figure: density of χ²(3); the shaded area between a and b equals Pr(a < X < b) = F(b) − F(a)]
5/36

Probability intervals

We often need to compute either

p = F(a) ≡ Pr(X ≤ a)    or    a = F⁻¹(p)

In R we can do this using the pxxx and qxxx functions, e.g. for the normal distribution:

pnorm(1.96)
## [1] 0.9750021
qnorm(0.975)
## [1] 1.959964
6/36

What is the probability that our estimator is close to θ?

We can write this as:

Pr(|θ̂ − θ| < ε) = Pr(θ − ε < θ̂ < θ + ε)

[Figure: sampling density of θ̂; the shaded area between θ − ε and θ + ε is the probability in question]
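The pnorm/qnorm calls above have standard-library counterparts in Python; as a sketch (assuming Python ≥ 3.8, where statistics.NormalDist is available), the same cdf and quantile computations, plus Pr(a < X < b) = F(b) − F(a):

```python
from statistics import NormalDist

# Standard normal distribution; mirrors R's pnorm/qnorm
Z = NormalDist(mu=0, sigma=1)

p = Z.cdf(1.96)        # like pnorm(1.96)  -> about 0.975
a = Z.inv_cdf(0.975)   # like qnorm(0.975) -> about 1.96

# Pr(a < X < b) = F(b) - F(a) for a continuous r.v.
prob = Z.cdf(1.96) - Z.cdf(-1.96)  # about 0.95

print(round(p, 4), round(a, 2), round(prob, 4))
```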
7/36

Interval estimation

The probability that our estimator is no further than ε from θ equals

Pr(θ − ε < θ̂ < θ + ε)

which is the probability that the r.v. θ̂ lies in a fixed interval with unknown boundaries

Note though that we can rewrite this as follows

Pr(θ − ε < θ̂ < θ + ε) = Pr(θ̂ − ε < θ < θ̂ + ε)

which is the probability that the random interval (θ̂ − ε, θ̂ + ε) covers the fixed number θ

How do we construct such intervals and compute their corresponding confidence levels?
8/36

CI for the mean – Normal data, variance known

Let X ∼ N(µ, σ²); then X̄ ∼ N(µ, σ²/n)

Now consider taking a random sample of size n; then

Pr(µ − ε < X̄ < µ + ε) = Pr(−ε/(σ/√n) < (X̄ − µ)/(σ/√n) < ε/(σ/√n))
                      = Φ(ε/(σ/√n)) − Φ(−ε/(σ/√n))
                      = 2Φ(ε/(σ/√n)) − 1 = 1 − α

For a given confidence level 1 − α we get the following ε

ε = z_{1−α/2} · σ/√n    where z_{1−α/2} = Φ⁻¹(1 − α/2)
9/36

CI for the mean – Normal data, variance known

[Figure: standard normal cdf Φ(x), with quantiles z_{α/2} and z_{1−α/2} cutting off probability mass α/2 in each tail]
10/36

CI for the mean – Normal data, variance known

Since Pr(µ − ε < X̄ < µ + ε) = Pr(X̄ − ε < µ < X̄ + ε), the following is a (1 − α)100% CI:

(X̄ − z_{1−α/2} · σ/√n, X̄ + z_{1−α/2} · σ/√n)

For a given sample, µ will be either inside or outside this interval

But before drawing the sample, there is a (1 − α)100% chance that an interval constructed this way will cover the true parameter µ
11/36

CI for the mean – Normal data, variance known

For example, if we set the confidence level at 1 − α = 0.90, then

z_{1−0.10/2} = Φ⁻¹(0.95) = −Φ⁻¹(0.05) ≈ 1.645

qnorm(.95)
## [1] 1.6448536

With n = 10 random draws from X ∼ N(µ, 1)

mean(rnorm(10))
## [1] -0.38315741

we get the following 90% confidence interval:

(−0.38 − 1.645 · 1/√10, −0.38 + 1.645 · 1/√10) ≈ (−0.90, 0.14)
12/36

CI for the mean – Normal data, variance known

We know that intervals constructed this way need to cover the true parameter 90% of the time:

n = 10; nrep = 1e5; z = qnorm(0.95)
cover = rep(F, nrep)
for(i in 1:nrep) {
  x = rnorm(n, 0, 1)
  m = mean(x); se = 1 / sqrt(n)
  ci0 = m - z * se; ci1 = m + z * se
  cover[i] = ci0 < 0 & 0 < ci1
}
mean(cover)
## [1] 0.90217
13/36

90% CI for the mean, n = 10
[Figure: 50 simulated 90% CIs around the true mean 0; some miss it]
14/36

90% CI for the mean, n = 40
[Figure: 50 simulated 90% CIs; the intervals narrow as n grows]
15/36

90% CI for the mean, n = 160
[Figure: 50 simulated 90% CIs; the intervals narrow further]
16/36

90% CI for the mean, n = 10
[Figure: 50 simulated 90% CIs]
17/36

95% CI for the mean, n = 10
[Figure: 50 simulated 95% CIs; the intervals widen with the confidence level]
18/36

99% CI for the mean, n = 10
[Figure: 50 simulated 99% CIs]
19/36

Computing sample size

Suppose you plan to collect data, and you want to know the sample size you need to achieve a certain level of confidence in your interval estimate

Since

ε = z_{1−α/2} · σ/√n

solving for n, we obtain

n = (z_{1−α/2} · σ/ε)²

Note that we need a larger sample (n increases) if
▶ we require greater precision (ε decreases)
▶ we want to be more confident (α decreases)
▶ there is more dispersion in the population (σ increases)
20/36

CI for the Variance – Normal data

The sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

is our estimator for the population variance

When the Xi follow a normal distribution then (n − 1)S²/σ² follows a so-called Chi-squared distribution with n − 1 degrees of freedom:

Chi-squared distribution
If Zi ∼ N(0, 1) and V = Σ_{i=1}^k Zi² then V ∼ χ²(k) where k are the degrees of freedom. E[V] = k and Var(V) = 2k.
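The moments in the box above, E[V] = k and Var(V) = 2k, are easy to check by simulation; a sketch using only the Python standard library (the seed and the choices k = 3, nrep = 200000 are illustrative):

```python
import random
import statistics

random.seed(1)

# Simulate V = Z1^2 + ... + Zk^2 with Zi ~ N(0, 1), so V ~ chi-squared(k);
# the sample mean and variance of the draws should be close to k and 2k
k, nrep = 3, 200_000
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(nrep)]

print(round(statistics.fmean(draws), 2))     # close to k = 3
print(round(statistics.variance(draws), 2))  # close to 2k = 6
```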
21/36

CI for the Variance – χ² distribution

[Figure: densities of χ²(1), χ²(2), χ²(3), and χ²(9)]
22/36

CI for the Variance – Normal data

Because the Chi-squared distribution is asymmetric we need to make sure that we set the boundaries of the CI such that we have α/2 probability mass on each side:

Pr(c_{.025}^{n−1} < (n − 1)S²/σ² < c_{.975}^{n−1}) = 0.95

where we can compute c_p^{n−1} in R using qchisq(p, k)

We can rewrite the above as

Pr((n − 1)S²/c_{.975}^{n−1} < σ² < (n − 1)S²/c_{.025}^{n−1}) = 0.95
23/36

CI for the Variance – Normal data

[Figure: sampling density of S²]
24/36

CI for the Variance – Normal data

If α = .05 we know that we need to cover the true parameter 95% of the time:

n = 4; nrep = 1e5
cover = rep(F, nrep)
for(i in 1:nrep) {
  v = var(rnorm(n, 1, 10))
  ci0 = (n - 1) * v / qchisq(.975, n - 1)
  ci1 = (n - 1) * v / qchisq(.025, n - 1)
  cover[i] = ci0 < 100 & 100 < ci1
}
mean(cover)
## [1] 0.94955
25/36

CI for the mean – Normal data, variance unknown

We considered X ∼ N(µ, σ²) and assumed we knew σ²

In practice we probably don't know σ², but we have an estimator

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

A simple solution is to replace σ with S:

(x̄ − z_{1−α/2} · S/√n, x̄ + z_{1−α/2} · S/√n)

But how does this work?
26/36

CI for the mean – Normal data, variance unknown

If α = .1 we know that we need to cover the true parameter 90% of the time:

n = 10; nrep = 1e5; z = qnorm(.95)
cover = rep(F, nrep)
for(i in 1:nrep) {
  x = rnorm(n, 1, 10)
  m = mean(x); se = sd(x) / sqrt(n)
  ci0 = m - z * se; ci1 = m + z * se
  cover[i] = ci0 < 1 & 1 < ci1
}
mean(cover)
## [1] 0.86477
27/36

Student's t-distribution

It turns out that

Z = (X̄ − µ)/(S/√n)

does not follow a Normal but a so-called t-distribution with n − 1 degrees of freedom

t distribution
If Z ∼ N(0, 1) and V ∼ χ²(k), Z and V are independent, and

T = Z/√(V/k)

then T ∼ t(k) where k are the degrees of freedom. E[T] = 0 for k > 1 and Var(T) = k/(k − 2) for k > 2.
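The undercoverage of the plug-in z-interval (0.865 instead of 0.90 in the R simulation above) can be reproduced outside R; a sketch using only the Python standard library (the seed and nrep = 20000 are illustrative choices):

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)

# Plugging the estimated S into the z-interval: coverage falls short of
# 90% because S itself varies across samples
n, nrep = 10, 20_000
z = NormalDist().inv_cdf(0.95)  # like qnorm(.95), about 1.645

cover = 0
for _ in range(nrep):
    x = [random.gauss(1, 10) for _ in range(n)]
    m = statistics.fmean(x)
    se = statistics.stdev(x) / n ** 0.5
    cover += (m - z * se) < 1 < (m + z * se)

print(cover / nrep)  # noticeably below 0.90
```

Replacing the normal quantile with the t(n − 1) quantile, as on the next slide, restores the nominal coverage.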
28/36

Student's t-distribution

[Figure: densities of N(0, 1) = t(∞), t(4), and t(1); the t-distribution has heavier tails]
29/36

Student's t-distribution

[Figure: quantiles t_{1−α/2}(k) against the degrees of freedom k, approaching z_{0.95}, z_{0.975}, and z_{0.995} as k grows]
30/36

CI for the mean – Normal data, variance unknown

We know that we need to cover the true parameter 95% of the time:

n = 10; nrep = 1e5; z = qt(.975, n - 1)
cover = rep(FALSE, nrep)
for(i in 1:nrep) {
  x = rnorm(n, 1, 10)
  m = mean(x); se = sd(x) / sqrt(n)
  ci0 = m - z * se; ci1 = m + z * se
  cover[i] = ci0 < 1 & 1 < ci1
}
mean(cover)
## [1] 0.95102
31/36

Computing sample size – unknown variance

We saw that

n = z²_{1−α/2} · σ²/ε²

This means that without knowing the population variance σ² we cannot set the sample size

When X is Binomial we know that Var(X) = p(1 − p); although this depends on the unknown p, we know that 0 ≤ p(1 − p) ≤ 0.25

This means that

n = z²_{1−α/2} · σ²/ε² ≤ z²_{1−α/2} · 0.25/ε²
32/36

CI for the mean – Non-Normal data

Up until now we assumed that our data came from a Normal distribution

It turns out that as long as our sample is large enough the Normal distribution is a good approximation thanks to the Central Limit Theorem

Central Limit Theorem
Let X1, . . . , Xn be i.i.d. random variables with E[Xi] = µ and Var(Xi) = σ² < ∞; then

(X̄ − µ)/(σ/√n) → N(0, 1)

Consider the sampling distribution of X̄ when X ∼ χ²(3)
33/36

CI for the mean – Non-Normal data

[Figure: sampling density of X̄, centered at E[X]]
34/36

CI's for statistics other than the mean

Beyond the scope of this course, but some pointers:
▶ when sampling distributions are known, use these, otherwise CI's when n is large
▶ use the bootstrap CI when n is small
▶ rely on non-parametric or permutation tests
35/36

Summary

Confidence Intervals (CI's) are random intervals that cover the true parameter with a given probability

We call this probability the confidence level, and 0.95 is commonly used

Pay attention to the interpretation!!
▶ Before drawing the sample, there is a (1 − α)100% chance that an interval constructed this way will cover the true parameter

We saw how to construct CI's for the mean

For a given confidence level we can set CI widths by choosing the appropriate sample size
36/36
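The last point can be made concrete with the sample-size formula n = (z_{1−α/2} · σ/ε)² from the slides; a sketch using the Python standard library, where the helper name and the example values (α = 0.05, half-width ε = 0.02, and the binomial worst case σ = 0.5) are illustrative choices:

```python
import math
from statistics import NormalDist

def sample_size(sigma, eps, alpha=0.05):
    """Smallest n giving CI half-width eps at confidence 1 - alpha:
    n = (z_{1-alpha/2} * sigma / eps)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / eps) ** 2)

# Binomial worst case: sigma^2 = p(1 - p) <= 0.25, so sigma <= 0.5
print(sample_size(sigma=0.5, eps=0.02))  # 2401, whatever the true p
```

Halving ε to 0.01 roughly quadruples the required n, matching the ε² in the denominator of the formula.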