Stat 330 (Spring 2015): Slide set 15
Last update: February 3, 2015

Central Limit Theorem (CLT)

Main Idea: Sums and averages of random variables from arbitrary distributions have approximate normal distributions for sufficiently large sample sizes.

Suppose X1, X2, . . . , Xn are iid random variables with
  E[Xi] = μ,  Var[Xi] = σ²,  i = 1, . . . , n.

Define
  Sample Sum:      Sn = X1 + X2 + · · · + Xn
  Sample Average:  X̄n = (X1 + X2 + · · · + Xn)/n

For large n,
  X̄n ∼̇ N(μ, σ²/n)  and  Sn ∼̇ N(nμ, nσ²),
where ∼̇ denotes "approximately distributed as."

Central Limit Theorem (CLT) (cont'd)

Use of CLT: Calculate probabilities associated with averages or sums of iid random variables using these approximate distributional statements. For example,
  P(a < X̄n < b) ≈ Φ((b − μ)/(σ/√n)) − Φ((a − μ)/(σ/√n)).

Recall: Φ is the cdf of the standard normal distribution, i.e., of Z ∼ N(0, 1).

Some Example Applications of the Central Limit Theorem

Example 1: The time I spend waiting for the bus in a day has a Uniform distribution between 2 minutes and 5 minutes.

(a) How much time do I expect to spend waiting for the bus in one month (30 days)?

Let Xi = time I wait for the bus on day i. Then X1, X2, . . . , X30 ∼ iid U(2, 5), and it follows that
  E[Xi] = (2 + 5)/2 = 3.5 min,  Var(Xi) = (5 − 2)²/12 = 9/12 = 0.75 min².

Let T ≡ the random variable representing the total waiting time for a month. T is the sum of 30 iid random variables, T = X1 + X2 + · · · + X30, so
  E[T] = E[X1 + X2 + · · · + X30] = 30μ, where μ = E[Xi] = 3.5.
Thus the expected waiting time for a month is E[T] = 30 × 3.5 = 105 min.

CLT Examples (cont'd)

(b) Approximately, find the probability that I spend more than 2 hours waiting for a bus in a month.

From the CLT we have T ∼̇ N(30 × 3.5, 30 × 0.75), i.e., T ∼̇ N(105, 22.5).

We need the probability that T is greater than 120 minutes, i.e., P(T > 120):
  P(T > 120) = 1 − P(T ≤ 120)
             ≈ 1 − P(Z ≤ (120 − 105)/√22.5)   (by the CLT)
             = 1 − Φ(3.16) = 1 − 0.9992112 = 0.00079.
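As a quick check of the part (b) calculation, here is a minimal sketch in Python (an assumption on my part: the slides contain no code, and the helper name `phi`, the seed, and the simulation size are illustrative choices, not course material). It evaluates the CLT approximation and compares it with a Monte Carlo estimate of P(T > 120).

```python
import math
import random

def phi(z):
    """Standard normal cdf Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 1(b): T = X1 + ... + X30 with Xi ~ iid U(2, 5).
n, mu, var = 30, 3.5, 0.75            # per-day mean and variance of U(2, 5)
mean_T, var_T = n * mu, n * var       # E[T] = 105, Var(T) = 22.5

# CLT approximation: P(T > 120) ~= 1 - Phi((120 - 105) / sqrt(22.5))
clt_prob = 1.0 - phi((120 - mean_T) / math.sqrt(var_T))

# Monte Carlo check: simulate the monthly total many times and count exceedances.
random.seed(0)
reps = 200_000
hits = sum(
    sum(random.uniform(2, 5) for _ in range(n)) > 120
    for _ in range(reps)
)

print(f"CLT approximation: {clt_prob:.5f}")    # roughly 0.0008
print(f"Monte Carlo      : {hits / reps:.5f}")
```

Both numbers should land near the .00079 obtained on the slide; the small gap reflects simulation noise and the normal approximation error.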
CLT Example from Baron

Example 4.13 (Allocation of Disk Space): A disk has free space of 330 megabytes. Is it likely to be sufficient for 300 independent images, if each image has expected size of 1 megabyte with a standard deviation of 0.5 megabytes?

We have n = 300, μ = 1 and σ = 0.5. The number of images n is large, so the CLT applies. Then
  P(sufficient space) = P(Sn ≤ 330)
                      = P((Sn − nμ)/(σ√n) ≤ (330 − (300)(1))/(0.5√300))
                      ≈ Φ(3.46) = 0.9997.
Since this probability is very high, the available disk space is very likely to be sufficient.

CLT Big Example

An astronomer wants to measure the distance, d, from the observatory to a star. Due to the variation of atmospheric conditions and imperfections in the measurement method, a single measurement will not produce the exact distance d.

The astronomer takes n measurements of the distance and uses the sample average to estimate the true distance. From past records of these measurements the astronomer knows that the variance of a single measurement is 4 parsec². How many measurements should the astronomer make so that the chance that his estimate differs from d by more than .5 parsecs is at most .05?

Let Xi be the ith measurement. The astronomer assumes that
  X1, X2, . . . , Xn ∼ iid with E[Xi] = d and Var[Xi] = 4.
The estimate of d is X̄n = (X1 + X2 + · · · + Xn)/n. From the CLT we have that
  X̄n ∼̇ N(d, 4/n).
We want to find the number of measurements n so that
  P(|X̄n − d| > .5) ≤ .05.

CLT Big Example (cont'd)

We know that
  P(|X̄n − d| > .5) = P(X̄n − d > .5) + P(X̄n − d < −.5).
We use the CLT to approximate each of the probabilities on the right. Thus
  P(|X̄n − d| > .5) = P((X̄n − d)/√(4/n) > .5/√(4/n)) + P((X̄n − d)/√(4/n) < −.5/√(4/n))
                    ≈ P(Z > .5/√(4/n)) + P(Z < −.5/√(4/n))
                    = 1 − Φ(√n/4) + Φ(−√n/4)
                    = 2(1 − Φ(√n/4)).

CLT Big Example (cont'd)

• We need to find an integer n so that 2(1 − Φ(√n/4)) is just less than or equal to .05.
• We will set 2(1 − Φ(√(n*)/4)) = .05, solve for n*, and take the required number of measurements to be the next integer at or above n*.
• Observe that 2(1 − Φ(√n/4)) = .05 implies that Φ(√n/4) = .975.
• Using the Normal cdf tables, this gives √n/4 = 1.96; thus n* = (1.96 × 4)² = 61.47.
• Thus the astronomer must take at least 62 measurements to have the accuracy specified above.

Normal approximation to the Binomial

For large n, the binomial distribution B(n, p) is approximately normal N(np, np(1 − p)). Why?

Let Y be a variable with a B(n, p) distribution. We know that Y is the number of successes in n independent Bernoulli experiments with P(success) = p. Write Y as the sum of n iid Bernoulli variables, each with μ = E(Xi) = p and σ² = Var(Xi) = p(1 − p):
  Y = X1 + X2 + · · · + Xn.
Applying the CLT result for Sn, we have that Y ∼̇ N(nμ, nσ²) where μ = p and σ² = p(1 − p). That is, Y ∼̇ N(np, np(1 − p)).

Use this approximation only when np and n(1 − p) are both > 5; the approximation is pretty good when np and n(1 − p) are both > 20. When either np or n(1 − p) is < 20, a continuity correction is needed (see Baron p. 94).
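To make the rule of thumb above concrete, here is a small sketch (again an illustrative assumption: plain Python with the standard library; the names `binom_cdf` and `normal_approx` and the chosen n, p, k are hypothetical, not from the slides or Baron) comparing the exact Binomial cdf with its CLT-based normal approximation, with and without the continuity correction.

```python
import math

def phi(z):
    """Standard normal cdf Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_cdf(k, n, p):
    """Exact P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def normal_approx(k, n, p, continuity=False):
    """CLT approximation: Y is approximately N(np, np(1-p))."""
    x = k + 0.5 if continuity else k    # continuity correction shifts the cutoff by 1/2
    return phi((x - n * p) / math.sqrt(n * p * (1 - p)))

# np = 30 and n(1 - p) = 70 are both > 20, so the plain approximation should already be good.
n, p, k = 100, 0.3, 35
print(f"exact binomial cdf   : {binom_cdf(k, n, p):.4f}")
print(f"normal approximation : {normal_approx(k, n, p):.4f}")
print(f"continuity-corrected : {normal_approx(k, n, p, continuity=True):.4f}")
```

Rerunning with smaller np or n(1 − p) (say n = 20, p = 0.1) shows the plain approximation drifting from the exact cdf while the continuity-corrected version stays closer, which is the behavior the guideline on the last slide describes.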