Slide set 28
Stat 330 (Spring 2015)
Last update: April 8, 2015

Examples for CI of proportion

Example 1: Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over 35K.

Question: We want a 98% confidence interval and wish to estimate the quantity to within 0.01. How many samples do we need?

♠ To satisfy the desired condition, the total width of the CI must be at most 0.02, i.e. $2e \le 0.02$ (W.L.O.G., we choose the conservative confidence interval for easy computation).

♠ Using the second definition (recall that $P(|\hat\theta - \theta| < e) \ge 1 - \alpha$ is the second definition), with $e = z_{\alpha/2} / (2\sqrt{n})$ and $z_{\alpha/2} = 2.33$, we have

\[ 2e \le 0.02 \iff \frac{z_{\alpha/2}}{2\sqrt{n}} \le 0.01 \iff \sqrt{n} \ge \frac{2.33}{2 \cdot 0.01} = 116.5, \]

so that $n \ge 116.5^2 = 13572.25$, i.e. $n \ge 13573$.

♠ In general, $n \ge 0.25 \left( \frac{z_{\alpha/2}}{\Delta} \right)^2$, where $\Delta$ is half of the desired width of the confidence interval.

Example 2: Suppose that we are interested in the large-time probability $p$ that a server is available. 100 simulations have shown that in 60 of them a server was available at time t = 1000 hrs. What is a 95% confidence interval for this probability?

♠ If 60 out of 100 simulations showed a free server, we use $\hat p = 60/100 = 0.6$ as an estimate for $p$. For the standard error we can either substitute $\hat p$, or use the conservative choice $p = 0.5$, which maximizes $p(1 - p)$.

♠ For a 95% confidence interval, $z_{\alpha/2} = z_{(1 - 0.95)/2} = \Phi^{-1}(0.975) = 1.96$.

The conservative confidence interval is:

\[ \hat p \pm z_{\alpha/2} \frac{1}{2\sqrt{n}} = 0.6 \pm 1.96 \cdot \frac{1}{2\sqrt{100}} = 0.6 \pm 0.098. \]

For the confidence interval using substitution we get:

\[ \hat p \pm z_{\alpha/2} \sqrt{\frac{\hat p (1 - \hat p)}{n}} = 0.6 \pm 1.96 \sqrt{\frac{0.6 \cdot 0.4}{100}} = 0.6 \pm 0.096. \]

Two Populations

CI for the mean difference $\mu_1 - \mu_2$, or the difference of two proportions, $p_1 - p_2$

♠ $\bar X_1 - \bar X_2$ and $\hat p_1 - \hat p_2$ are unbiased estimators for $\mu_1 - \mu_2$ and $p_1 - p_2$, respectively.
♠ Confidence intervals are summarized below:

Large n confidence interval for $\mu_1 - \mu_2$ (based on independent $\bar X_1$ and $\bar X_2$):

\[ \bar x_1 - \bar x_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

(when $\sigma_1^2$, $\sigma_2^2$ are unknown, substitute $s_1^2$, $s_2^2$ respectively)

Large n confidence interval for $p_1 - p_2$ (based on independent $\hat p_1$ and $\hat p_2$):

\[ \hat p_1 - \hat p_2 \pm \frac{z_{\alpha/2}}{2} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \quad \text{(conservative)} \]

or

\[ \hat p_1 - \hat p_2 \pm z_{\alpha/2} \sqrt{\frac{\hat p_1 (1 - \hat p_1)}{n_1} + \frac{\hat p_2 (1 - \hat p_2)}{n_2}} \quad \text{(substitution)} \]

Two Populations: Simple derivation

Derivation: The arguments are very similar in both cases, so we will only discuss the confidence interval for the difference between the two means.

1. $\bar X_1 - \bar X_2$ is approximately normal, since $\bar X_1$ and $\bar X_2$ are approximately normal and independent.

2. With $\bar X_i \sim N(\mu_i, \sigma_i^2 / n_i)$ (approximately) for $i = 1, 2$:

\[ E[\bar X_1 - \bar X_2] = E[\bar X_1] - E[\bar X_2] = \mu_1 - \mu_2 \]

\[ Var[\bar X_1 - \bar X_2] = Var[\bar X_1] + (-1)^2 Var[\bar X_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \]

3. Then we can use arguments similar to those used before and get a C.I. for $\mu_1 - \mu_2$ as shown above.

Two Populations: Example

Example 1: Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                          East Coast       West Coast
# of sampled records:     n1 = 1000        n2 = 2000
mean taxable income:      x̄1 = $37200      x̄2 = $42000
standard deviation:       s1 = $10100      s2 = $15600

♠ We can, for example, compute a two-sided 95% confidence interval for $\mu_1 - \mu_2$, the difference in mean taxable income as reported on 2000 tax returns between East and West Coast, as follows:

\[ 37200 - 42000 \pm z_{\alpha/2} \sqrt{\frac{10100^2}{1000} + \frac{15600^2}{2000}} = -4800 \pm 927 \]

♠ Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the report from 2000). The interval contains only negative numbers.

♠ However, if it contained 0, the message wouldn't be so clear.
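The large n interval for the difference of two means in Example 1 can be checked numerically. What follows is a minimal sketch using only Python's standard library; the function name `two_sample_mean_ci` and its parameter names are illustrative and not part of the slides. It plugs in the tabulated sample sizes, sample means (37200 and 42000), and standard deviations, substituting $s_1, s_2$ for the unknown $\sigma_1, \sigma_2$.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_mean_ci(xbar1, s1, n1, xbar2, s2, n2, conf=0.95):
    """Large-n CI for mu1 - mu2 (hypothetical helper, not from the slides).

    Returns (center, margin) so that the interval is center +/- margin.
    The sample standard deviations s1, s2 stand in for the unknown sigmas.
    """
    alpha = 1.0 - conf
    # z_{alpha/2} is the (1 - alpha/2) quantile of the standard normal.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    center = xbar1 - xbar2
    margin = z * sqrt(s1**2 / n1 + s2**2 / n2)
    return center, margin


# East Coast vs. West Coast values taken from the table in Example 1.
center, margin = two_sample_mean_ci(37200, 10100, 1000, 42000, 15600, 2000)
print(f"{center} +/- {margin:.0f}")
print("interval excludes 0:", center + margin < 0 or center - margin > 0)
```

With these inputs the half-width comes out near 927 and the whole interval lies below zero, which is what supports the conclusion that the West Coast mean is higher.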
Example 2: For two different digital communication systems, we send 100 large messages via each system and determine how many are corrupted in transmission: $\hat p_1 = 0.05$ and $\hat p_2 = 0.10$. What is the difference in the corruption rates? Find a 98% confidence interval.

♥ Use:

\[ 0.05 - 0.10 \pm 2.33 \cdot \sqrt{\frac{0.05 \cdot 0.95}{100} + \frac{0.10 \cdot 0.90}{100}} = -0.05 \pm 0.086 \]

♥ This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of $p_1 - p_2$, i.e. we can't tell which of the two $p_i$ is larger.

Small samples when the standard deviation σ is unknown

Single Population:

• If the standard deviation $\sigma$ is unknown, but the sample $X_1, \ldots, X_n$ can be assumed to come from a Normal distribution, then instead of using $z_{\alpha/2}$ we may use $t_{n-1,\alpha/2}$, which is the corresponding percentile of the t distribution with $n - 1$ degrees of freedom.

• The resulting $100 \times (1 - \alpha)\%$ confidence interval for $\mu$ is

\[ \bar x \pm t_{n-1,\alpha/2} \cdot \frac{s}{\sqrt{n}}. \]

This is helpful when the sample size $n$ is small, since the CLT does not apply.

• See pages below for a note on the t-distribution.

Small samples; σ is unknown: continued...

Two Populations:

• Population variances $\sigma_1^2$ and $\sigma_2^2$ are unknown.

• Assume equal variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$. The $100 \times (1 - \alpha)\%$ confidence interval for $\theta = \mu_1 - \mu_2$ is

\[ \bar x_1 - \bar x_2 \pm t_{n+m-2,\alpha/2} \cdot s_p \sqrt{\frac{1}{n} + \frac{1}{m}}, \]

where $s_p^2$ is the pooled variance, calculated as

\[ s_p^2 = \frac{(n - 1) s_1^2 + (m - 1) s_2^2}{n + m - 2}. \]

• Assume unequal variances, $\sigma_1^2 \ne \sigma_2^2$. The $100 \times (1 - \alpha)\%$ confidence interval for $\theta = \mu_1 - \mu_2$ is

\[ \bar x_1 - \bar x_2 \pm t_{\nu,\alpha/2} \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{m}}, \]

where

\[ \nu = \frac{\left( \frac{s_1^2}{n} + \frac{s_2^2}{m} \right)^2}{\frac{s_1^4}{n^2 (n - 1)} + \frac{s_2^4}{m^2 (m - 1)}}. \]

t distribution

• A random variable $T$ of the form

\[ T = \frac{Z}{\sqrt{V / \nu}} \]

is said to have a t distribution with $\nu$ degrees of freedom, for random variables $Z$ and $V$ such that:

1. $Z \sim N(0, 1)$ and $V \sim \chi^2_\nu$ (a chi-square distribution with $\nu$ degrees of freedom).
2. $Z$ and $V$ are independent.
3. The chi-square distribution is a special case of the Gamma distribution.

• [Figure: probability density function of the t distribution for several different degrees of freedom.]

• Observe that as the degrees of freedom increase, the shape of the t-distribution tends towards that of the Normal distribution.

• Read percentiles of the t distribution from Table A5 of Baron's textbook.
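The two small-sample formulas for two populations (pooled variance with $n + m - 2$ degrees of freedom, and the degrees of freedom $\nu$ for unequal variances) are easy to mis-evaluate by hand. The sketch below spells them out in Python under stated assumptions: the helper names `pooled_sd`, `welch_df`, and `pooled_t_ci` are hypothetical, the sample values are made up purely for illustration, and the t percentile is passed in as a number read from a t table rather than computed.

```python
from math import sqrt


def pooled_sd(s1, n, s2, m):
    """Pooled standard deviation s_p for the equal-variance case."""
    return sqrt(((n - 1) * s1**2 + (m - 1) * s2**2) / (n + m - 2))


def welch_df(s1, n, s2, m):
    """Degrees of freedom nu for the unequal-variance case."""
    a, b = s1**2 / n, s2**2 / m
    # a**2 / (n - 1) equals s1^4 / (n^2 (n - 1)), and likewise for b.
    return (a + b) ** 2 / (a**2 / (n - 1) + b**2 / (m - 1))


def pooled_t_ci(xbar1, s1, n, xbar2, s2, m, t_crit):
    """Equal-variance CI for mu1 - mu2 as (center, margin).

    t_crit is t_{n+m-2, alpha/2}, supplied by the caller from a t table.
    """
    margin = t_crit * pooled_sd(s1, n, s2, m) * sqrt(1 / n + 1 / m)
    return xbar1 - xbar2, margin


# Made-up illustration data (not from the slides).
n, m = 10, 12
s1, s2 = 2.0, 3.5
# 2.086 is the tabulated t_{20, 0.025}, i.e. a 95% interval with n + m - 2 = 20.
center, margin = pooled_t_ci(5.1, s1, n, 4.2, s2, m, t_crit=2.086)
nu = welch_df(s1, n, s2, m)
print(f"pooled interval: {center:.2f} +/- {margin:.2f}")
print(f"unequal-variance degrees of freedom: {nu:.2f}")
```

The value of $\nu$ is generally not an integer; one common convention when reading a printed table is to round it down, which gives a slightly wider interval.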