Slide set 15
Stat 330 (Spring 2015)
Last update: February 3, 2015

Central Limit Theorem (CLT)

Main Idea: Sums and averages of random variables from arbitrary distributions have approximately normal distributions for sufficiently large sample sizes.

Suppose X1, X2, . . . , Xn are iid random variables with E[Xi] = µ and Var[Xi] = σ², i = 1, . . . , n. Define

  Sample Average: X̄n = (X1 + X2 + · · · + Xn)/n
  Sample Sum:     Sn = X1 + X2 + · · · + Xn

For large n, X̄n is approximately distributed as N(µ, σ²/n) and Sn is approximately distributed as N(nµ, nσ²).

Central Limit Theorem (CLT) (cont'd)

Use of CLT: calculate probabilities associated with averages or sums of iid random variables via these approximate distributional statements. For example,

  P(a < X̄n < b) ≈ Φ((b − µ)/(σ/√n)) − Φ((a − µ)/(σ/√n))

Recall: Φ is the cdf of the standard normal distribution, i.e., of Z ∼ N(0, 1).

Some Example Applications of the Central Limit Theorem

Example 1: The time I spend waiting for the bus in a day has a Uniform distribution between 2 minutes and 5 minutes.

(a) How much time do I expect to spend waiting for the bus in one month (30 days)?

Let Xi = time I wait for the bus on day i. Then X1, X2, . . . , X30 ∼ iid U(2, 5), and it follows that

  E[Xi] = (2 + 5)/2 = 3.5 min,   Var(Xi) = (5 − 2)²/12 = 9/12 = 0.75 min².

CLT Examples (cont'd)

Let T ≡ the random variable representing the total waiting time for a month. T is the sum of 30 iid random variables: T = X1 + X2 + · · · + X30.

  E[T] = E[X1] + E[X2] + · · · + E[X30] = 30µ, where µ = E[Xi] = 3.5.

Thus the expected waiting time for a month is E[T] = 30 × 3.5 = 105 min.

(b) Approximately, what is the probability that I spend more than 2 hours waiting for a bus in a month?

From the CLT, T is approximately N(30 × 3.5, 30 × 0.75), i.e., T is approximately N(105, 22.5). We need the probability that T is greater than 120 minutes:

  P(T > 120) = 1 − P(T ≤ 120)
             ≈ 1 − P(Z ≤ (120 − 105)/√22.5)   (by the CLT)
             = 1 − Φ(3.16) = 1 − 0.99921 = 0.00079

CLT Example from Baron

Example 4.13 (Allocation of Disk Space): A disk has free space of 330 megabytes. Is this likely to be sufficient for 300 independent images, if each image has an expected size of 1 megabyte with a standard deviation of 0.5 megabytes?

We have n = 300, µ = 1 and σ = 0.5. The number of images n is large, so the CLT applies. Then

  P(sufficient space) = P(Sn ≤ 330)
                      = P( (Sn − nµ)/(σ√n) ≤ (330 − 300 × 1)/(0.5√300) )
                      ≈ Φ(3.46) = 0.9997

Since this probability is very high, the available disk space is very likely to be sufficient.
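As a quick numerical check, a minimal Python sketch that reproduces the two normal-approximation probabilities above from the standard normal cdf; the helper name phi and the script layout are our own illustration:

```python
from math import erf, sqrt

def phi(z):
    # Standard normal cdf: Phi(z) = P(Z <= z)
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 1(b): total bus waiting time T is approximately N(105, 22.5)
mu_T, var_T = 30 * 3.5, 30 * 0.75
p_bus = 1 - phi((120 - mu_T) / sqrt(var_T))
print(f"P(T > 120) ~ {p_bus:.5f}")      # about 0.00079

# Example 4.13: S_n is approximately N(300, 0.5^2 * 300)
n, mu, sigma = 300, 1.0, 0.5
p_disk = phi((330 - n * mu) / (sigma * sqrt(n)))
print(f"P(S_n <= 330) ~ {p_disk:.4f}")  # about 0.9997
```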
CLT Big Example

An astronomer wants to measure the distance, d, from the observatory to a star. Due to the variation of atmospheric conditions and imperfections in the measurement method, a single measurement will not produce the exact distance d. The astronomer takes n measurements of the distance and uses the sample average to estimate the true distance. From past records of these measurements the astronomer knows the variance of a single measurement is 4 parsec². How many measurements should the astronomer make so that the chance that his estimate differs from d by more than 0.5 parsecs is at most 0.05?

Let Xi be the ith measurement. The astronomer assumes that X1, X2, . . . , Xn are iid with E[Xi] = d and Var[Xi] = 4. The estimate of d is X̄n = (X1 + X2 + · · · + Xn)/n. We want to find the number of measurements n so that

  P(|X̄n − d| > 0.5) ≤ 0.05.

CLT Big Example (cont'd)

We know that

  P(|X̄n − d| > 0.5) = P(X̄n − d > 0.5) + P(X̄n − d < −0.5).

We use the CLT to approximate each of the probabilities on the right. From the CLT, X̄n is approximately N(d, 4/n). Thus

  P(|X̄n − d| > 0.5) = P(X̄n − d > 0.5) + P(X̄n − d < −0.5)
                     = P( (X̄n − d)/√(4/n) > 0.5/√(4/n) ) + P( (X̄n − d)/√(4/n) < −0.5/√(4/n) )
                     ≈ P(Z > 0.5/√(4/n)) + P(Z < −0.5/√(4/n))

CLT Big Example (cont'd)

Since 0.5/√(4/n) = √n/4, this is

                     = 1 − Φ(√n/4) + Φ(−√n/4)
                     = 2(1 − Φ(√n/4))

• We need to find an integer n so that 2(1 − Φ(√n/4)) is less than or equal to 0.05.
• We set 2(1 − Φ(√n*/4)) = 0.05, solve for n*, and take the required number of measurements to be ⌈n*⌉.
• Observe that 2(1 − Φ(√n*/4)) = 0.05 implies that Φ(√n*/4) = 0.975.
• Using the normal cdf tables, this gives √n*/4 = 1.96; thus n* = (4 × 1.96)² = 61.47.
• Thus the astronomer must take at least 62 measurements to achieve the accuracy specified above.

Normal approximation to the Binomial

For large n, the Binomial(n, p) distribution is approximately normal, N(np, np(1 − p)).

Why? Let Y be a variable with a Binomial(n, p) distribution. We know that Y is the number of successes in n independent Bernoulli experiments with P(success) = p, so we can write Y as the sum of n iid Bernoulli variables, each with µ = E(Xi) = p and σ² = Var(Xi) = p(1 − p):

  Y = X1 + X2 + · · · + Xn.

Applying the CLT result for Sn, Y is approximately N(nµ, nσ²) with µ = p and σ² = p(1 − p); that is, Y is approximately N(np, np(1 − p)).

Use this approximation only when np and n(1 − p) are both > 5; the approximation is pretty good when np and n(1 − p) are both > 20. When either np or n(1 − p) is < 20, a continuity correction is needed (see Baron p. 94).
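A minimal Python sketch of these last two calculations, using the standard library's NormalDist for the standard normal cdf and its inverse; the Binomial numbers (n = 100, p = 0.3, P(Y ≤ 35)) are an illustrative choice of our own:

```python
from math import ceil, sqrt
from statistics import NormalDist  # standard library, Python 3.8+

std_norm = NormalDist()  # the N(0, 1) distribution

# Astronomer example: smallest n with 2 * (1 - Phi(sqrt(n)/4)) <= 0.05.
z = std_norm.inv_cdf(0.975)      # 97.5th percentile of N(0, 1), about 1.96
n_star = (4 * z) ** 2            # solve sqrt(n*)/4 = z, so n* = (4z)^2
print(ceil(n_star))              # 62 measurements

# Normal approximation to a Binomial probability (illustrative numbers):
# Y ~ Binomial(n = 100, p = 0.3); approximate P(Y <= 35) by
# Phi((35 - np) / sqrt(np(1 - p))). Here np = 30 and n(1 - p) = 70, both > 20.
n, p = 100, 0.3
approx = std_norm.cdf((35 - n * p) / sqrt(n * p * (1 - p)))
print(round(approx, 4))          # about 0.86
```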