Stat 330 (Spring 2015): Slide set 28
Last update: April 8, 2015

Examples for CI of proportion

Example 1: Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over 35K.

Question: We want a 98% confidence interval and wish to estimate the quantity to within 0.01. How many samples do we need?

♠ The width of the CI must be at most 0.02 to satisfy the desired condition (W.L.O.G., we choose the conservative confidence interval for easy computation). With half-width \(e = z_{\alpha/2}/(2\sqrt{n})\) and \(z_{\alpha/2} = 2.33\):

\[ 2e \le 0.02 \iff \frac{z_{\alpha/2}}{2\sqrt{n}} \le 0.01 \iff \sqrt{n} \ge \frac{2.33}{2 \cdot 0.01} = 116.5, \]

so that \(n \ge 13573\).

♠ Using the second definition: recall that the second definition is \(P(|\hat{\theta} - \theta| < e) \ge 1 - \alpha\). It gives

\[ n \ge 0.25 \left( \frac{z_{\alpha/2}}{\Delta} \right)^2, \]

where \(\Delta\) is half of the desired width of the confidence interval. With \(\Delta = 0.01\) this again gives \(n \ge 13573\).

Example 2: Suppose that we are interested in the large-time probability \(p\) that a server is available. Running 100 simulations has shown that in 60 of them a server was available at time t = 1000 hrs. What is a 95% confidence interval for this probability?

♠ If 60 out of 100 simulations showed a free server, we can use \(\hat{p} = 60/100 = 0.6\) as an estimate for \(p\) in the standard error (substitution), or use the conservative value \(p = 0.5\) there.

♠ For a 95% confidence interval, \(z_{\alpha/2} = z_{(1-0.95)/2} = \Phi^{-1}(0.975) = 1.96\).

The conservative confidence interval is:

\[ \hat{p} \pm z_{\alpha/2} \frac{1}{2\sqrt{n}} = 0.6 \pm 1.96 \cdot \frac{1}{2\sqrt{100}} = 0.6 \pm 0.098. \]

For the confidence interval using substitution we get:

\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0.6 \pm 1.96 \sqrt{\frac{0.6 \cdot 0.4}{100}} = 0.6 \pm 0.096. \]

Two Populations

CI for the mean difference \(\mu_1 - \mu_2\), or the difference of two proportions, \(p_1 - p_2\)

♠ \(\bar{X}_1 - \bar{X}_2\) and \(\hat{p}_1 - \hat{p}_2\) are unbiased estimators for \(\mu_1 - \mu_2\) and \(p_1 - p_2\), respectively.

♠ Confidence intervals are summarized below:

Large-n confidence interval for \(\mu_1 - \mu_2\) (based on independent \(\bar{X}_1\) and \(\bar{X}_2\)):

\[ \bar{x}_1 - \bar{x}_2 \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]

(when \(\sigma_1^2, \sigma_2^2\) are unknown, substitute \(s_1^2, s_2^2\), respectively)

Large-n confidence interval for \(p_1 - p_2\) (based on independent \(\hat{p}_1\) and \(\hat{p}_2\)):

\[ \hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \quad \text{(substitution)} \]

or

\[ \hat{p}_1 - \hat{p}_2 \pm \frac{z_{\alpha/2}}{2} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \quad \text{(conservative)} \]

Two Populations: Simple derivation

Derivation: The arguments are very similar in both cases, so we will only discuss the confidence interval for the difference between the two means.

1. \(\bar{X}_i \sim N(\mu_i, \sigma_i^2/n_i)\) approximately, for \(i = 1, 2\).

2. \(\bar{X}_1 - \bar{X}_2\) is approximately normal, since \(\bar{X}_1\) and \(\bar{X}_2\) are approximately normal, with (\(\bar{X}_1, \bar{X}_2\) are independent)

\[ E[\bar{X}_1 - \bar{X}_2] = E[\bar{X}_1] - E[\bar{X}_2] = \mu_1 - \mu_2 \]

\[ Var[\bar{X}_1 - \bar{X}_2] = Var[\bar{X}_1] + (-1)^2 Var[\bar{X}_2] = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \]

3. Then we can use arguments similar to those before and get a C.I. for \(\mu_1 - \mu_2\) as shown above.

Two Populations: Example

Example 1: Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                          East Coast       West Coast
# of sampled records:     n1 = 1000        n2 = 2000
mean taxable income:      x̄1 = $37000      x̄2 = $42000
standard deviation:       s1 = $10100      s2 = $15600

♠ We can, for example, compute a two-sided 95% confidence interval for \(\mu_1 - \mu_2\) = difference in mean taxable income as reported from 2000 tax returns between East and West Coast as follows (with \(z_{\alpha/2} = 1.96\)):

\[ 37000 - 42000 \pm z_{\alpha/2} \sqrt{\frac{10100^2}{1000} + \frac{15600^2}{2000}} = -5000 \pm 927 \]

♠ Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the report from 2000). The interval contains only negative numbers.

♠ However, if it contained 0, the message wouldn't be so clear.

Example 2: Two different digital communication systems are compared by sending 100 large messages via each system and determining how many are corrupted in transmission: \(\hat{p}_1 = 0.05\) and \(\hat{p}_2 = 0.10\). What's the difference in the corruption rates? Find a 98% confidence interval.

♥ Use:

\[ 0.05 - 0.1 \pm 2.33 \cdot \sqrt{\frac{0.05 \cdot 0.95}{100} + \frac{0.10 \cdot 0.90}{100}} = -0.05 \pm 0.086 \]

♥ This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of \(p_1 - p_2\), i.e. we can't tell which of the two \(p_i\) is larger.

Single Population: Small samples when the standard deviation σ is unknown

• If the standard deviation \(\sigma\) is unknown, but the sample \(X_1, \ldots, X_n\) can be assumed to come from a Normal distribution, then instead of using \(z_{\alpha/2}\) we may use \(t_{n-1,\,\alpha/2}\), which is the corresponding percentile of the t distribution with \(n - 1\) degrees of freedom.

• The resulting \(100 \times (1 - \alpha)\%\) confidence interval for \(\mu\) is

\[ \bar{x} \pm t_{n-1,\,\alpha/2} \cdot \frac{s}{\sqrt{n}}. \]

This is helpful when the sample size \(n\) is small, since the CLT does not apply.

• See pages below for a note on the t-distribution.

Small samples; σ is unknown: continued...

Two Populations:

• Population variances \(\sigma_1^2\) and \(\sigma_2^2\) are unknown.

• Assume equal variances \(\sigma_1^2 = \sigma_2^2 = \sigma^2\). The \(100 \times (1 - \alpha)\%\) confidence interval for \(\theta = \mu_1 - \mu_2\) is

\[ \bar{x}_1 - \bar{x}_2 \pm t_{n+m-2,\,\alpha/2} \cdot s_p \sqrt{\frac{1}{n} + \frac{1}{m}}, \]

where \(s_p^2\) is the pooled variance calculated as

\[ s_p^2 = \frac{(n - 1)s_1^2 + (m - 1)s_2^2}{n + m - 2}. \]

• Assume unequal variances \(\sigma_1^2 \ne \sigma_2^2\). The \(100 \times (1 - \alpha)\%\) confidence interval for \(\theta = \mu_1 - \mu_2\) is

\[ \bar{x}_1 - \bar{x}_2 \pm t_{\nu,\,\alpha/2} \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{m}}, \]

where

\[ \nu = \frac{\left( \dfrac{s_1^2}{n} + \dfrac{s_2^2}{m} \right)^2}{\dfrac{s_1^4}{n^2(n-1)} + \dfrac{s_2^4}{m^2(m-1)}}. \]

t distribution

• A random variable \(T\) that is of the form

\[ T = \frac{Z}{\sqrt{V/\nu}} \]

is said to have a t distribution with \(\nu\) degrees of freedom, for random variables \(Z\) and \(V\) such that:

1. \(Z \sim N(0, 1)\) and \(V \sim \chi^2_\nu\) (a chi-square distribution with \(\nu\) degrees of freedom).
2. \(Z\) and \(V\) are independent.
3. The chi-square distribution is a special case of the Gamma distribution.

• The diagram shows the probability density function of this distribution for several different degrees of freedom:

[Figure: densities of the t distribution for several degrees of freedom]

• Observe that as the degrees of freedom increase, the shape of the t-distribution tends towards that of the Normal distribution.

• Read percentiles of the t distribution from Table A5 of Baron's textbook.
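The single-proportion examples (the 98% sample-size question and the 60-out-of-100 server simulation) can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names `z_value`, `proportion_ci`, and `conservative_sample_size` are made up for illustration, and the rounded critical values 1.96 and 2.33 are passed in explicitly so that the results match the slide arithmetic.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_value(confidence):
    """Exact two-sided normal critical value z_{alpha/2} (hypothetical helper)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def proportion_ci(p_hat, n, z, conservative=False):
    """Large-n confidence interval (lower, upper) for a proportion p.

    conservative=False: substitution, half-width z * sqrt(p_hat(1 - p_hat)/n)
    conservative=True:  half-width z / (2 sqrt(n)), i.e. p replaced by 0.5
    """
    if conservative:
        half = z / (2 * sqrt(n))
    else:
        half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def conservative_sample_size(delta, z):
    """Smallest integer n with n >= 0.25 * (z / delta)^2."""
    return ceil(0.25 * (z / delta) ** 2)


# Server example: 60 of 100 simulations, 95% interval, z rounded to 1.96.
print(proportion_ci(0.6, 100, 1.96))
print(proportion_ci(0.6, 100, 1.96, conservative=True))

# IRS example: 98% interval, estimate within 0.01, z rounded to 2.33.
print(conservative_sample_size(0.01, 2.33))
```

Passing `z_value(0.98)` instead of the rounded 2.33 gives a slightly smaller required sample size, because the exact quantile is a little below 2.33.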
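The two large-sample two-population examples (East/West Coast mean income, and the two communication systems) follow the same pattern: a point estimate plus or minus a critical value times a standard error. A short Python sketch under the same assumptions as the slides (independent samples, large n) is given here; the function names are hypothetical, and the inputs are the figures used in the worked computations (37000 − 42000 with z = 1.96, and 0.05 − 0.10 with z = 2.33).

```python
from math import sqrt


def diff_means_ci(x1, x2, s1, s2, n1, n2, z):
    """Large-n CI for mu1 - mu2, with s1, s2 substituted for sigma1, sigma2.

    Returns (center, half_width).
    """
    half = z * sqrt(s1**2 / n1 + s2**2 / n2)
    return x1 - x2, half


def diff_props_ci(p1, p2, n1, n2, z, conservative=False):
    """Large-n CI for p1 - p2. Returns (center, half_width)."""
    if conservative:
        half = (z / 2) * sqrt(1 / n1 + 1 / n2)
    else:
        half = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, half


# IRS example: 95% interval for the difference in mean taxable income.
center, half = diff_means_ci(37000, 42000, 10100, 15600, 1000, 2000, 1.96)
print(center, round(half))

# Communication systems: 98% interval for the difference in corruption rates.
center, half = diff_props_ci(0.05, 0.10, 100, 100, 2.33)
print(round(center, 2), round(half, 3))
# The interval contains 0 whenever half > abs(center).
print(half > abs(center))
```

Returning the center and half-width separately makes the sign check explicit: an interval that straddles 0 does not settle which population parameter is larger.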
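The small-sample two-population formulas (pooled variance for equal variances, and the degrees of freedom ν for unequal variances) involve enough arithmetic that a sketch may help. The Python below is an illustration only: the function names are made up, the summary statistics are hypothetical and do not come from the slides, and the t percentile is supplied by hand (2.101 is the tabulated value of t with 18 degrees of freedom at α/2 = 0.025), as one would read it from a t table.

```python
from math import sqrt


def pooled_ci(x1, x2, s1, s2, n, m, t):
    """Equal-variance CI for mu1 - mu2; t is t_{n+m-2, alpha/2} from a table."""
    sp2 = ((n - 1) * s1**2 + (m - 1) * s2**2) / (n + m - 2)  # pooled variance
    half = t * sqrt(sp2) * sqrt(1 / n + 1 / m)
    d = x1 - x2
    return d - half, d + half


def satterthwaite_df(s1, s2, n, m):
    """Degrees of freedom nu for the unequal-variance interval."""
    a, b = s1**2 / n, s2**2 / m
    # a^2/(n-1) equals s1^4 / (n^2 (n-1)), matching the slide formula.
    return (a + b) ** 2 / (a**2 / (n - 1) + b**2 / (m - 1))


def unequal_var_ci(x1, x2, s1, s2, n, m, t):
    """Unequal-variance CI for mu1 - mu2; t is t_{nu, alpha/2} from a table."""
    half = t * sqrt(s1**2 / n + s2**2 / m)
    d = x1 - x2
    return d - half, d + half


# Hypothetical data: two samples of size 10 with equal sample standard deviations.
x1, x2, s1, s2, n, m = 10.0, 8.0, 2.0, 2.0, 10, 10
print(pooled_ci(x1, x2, s1, s2, n, m, t=2.101))
print(satterthwaite_df(s1, s2, n, m))
print(unequal_var_ci(x1, x2, s1, s2, n, m, t=2.101))
```

With equal sample sizes and equal sample standard deviations, ν reduces to n + m − 2 and the two intervals coincide, which is a convenient sanity check on both formulas.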