Central Limit Theorem (CLT)

• The main idea is that sums and averages of iid random variables, drawn from almost any distribution with finite variance, are approximately normally distributed when the sample size is sufficiently large.

• Suppose $X_1, X_2, \ldots, X_n$ are iid random variables with
$$E[X_i] = \mu, \qquad \operatorname{Var}[X_i] = \sigma^2, \qquad i = 1, \ldots, n.$$

• Define
Sample Average: $$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Sample Sum: $$S_n = X_1 + X_2 + \cdots + X_n$$

• For large $n$,
$$\bar{X}_n \mathrel{\dot\sim} N(\mu, \sigma^2/n), \qquad S_n \mathrel{\dot\sim} N(n\mu, n\sigma^2).$$

• We may use these approximate distributional statements to calculate probabilities associated with averages or sums of iid random variables. For example,
$$P(a < \bar{X}_n < b) \approx P\left(\frac{a-\mu}{\sigma/\sqrt{n}} < Z < \frac{b-\mu}{\sigma/\sqrt{n}}\right) = \Phi\left(\frac{b-\mu}{\sigma/\sqrt{n}}\right) - \Phi\left(\frac{a-\mu}{\sigma/\sqrt{n}}\right).$$
Recall that $\Phi$ is the c.d.f. of the standard normal distribution, i.e., of $Z \sim N(0, 1)$.

Some Example Applications of the Central Limit Theorem

Example 1: The time I spend waiting for the bus in a day has a Uniform distribution between 2 minutes and 5 minutes.
(a) How much time do I expect to spend waiting for the bus in one month (30 days)?
(b) Find the approximate probability that I spend more than 2 hours waiting for the bus in a month.

Solution: Let $X_i$ represent the time I wait for the bus on day $i$, so that
$$X_1, X_2, \ldots, X_{30} \sim \text{iid } U(2, 5).$$
Recall that $E[X_i] = (2 + 5)/2 = 3.5$ min and $\operatorname{Var}(X_i) = (5 - 2)^2/12 = 9/12 = 0.75$ min$^2$.

Let $T$ be the random variable representing the total waiting time for a month. $T$ is the sum of 30 iid random variables:
$$T = X_1 + X_2 + \cdots + X_{30}.$$

(a) We need $E[T]$. The mean of a sum of $n$ iid random variables is $E[T] = n\mu$, where $\mu = E[X_i]$; this is the same mean that appears in the CLT approximation for $S_n$. Thus $E[T] = 30 \times 3.5 = 105$ min. This is an exact result that follows from linearity of expectation, as we can check from scratch:
$$E[T] = E[X_1 + X_2 + \cdots + X_{30}] = E[X_1] + E[X_2] + \cdots + E[X_{30}] = \mu + \mu + \cdots + \mu = 30 \times 3.5 = 105 \text{ min}.$$

(b) From the CLT, we have $T \mathrel{\dot\sim} N(30 \times 3.5,\ 30 \times 0.75)$, i.e.,
$$T \mathrel{\dot\sim} N(105, 22.5).$$
We need the probability that $T$ is greater than 120 minutes, i.e., $P(T > 120)$:
$$\begin{aligned}
P(T > 120) &= 1 - P(T \le 120) \\
&\approx 1 - P\left(Z \le \frac{120 - 105}{\sqrt{22.5}}\right) \quad \text{by CLT} \\
&= 1 - \Phi(3.16) \\
&= 1 - 0.9992112 \\
&= 0.00079.
\end{aligned}$$

Example 2: An astronomer wants to measure the distance, $d$, from the observatory to a star. Due to variation in atmospheric conditions and imperfections in the measurement method, a single measurement will not produce the exact distance $d$. The astronomer takes $n$ measurements of the distance and uses the sample average to estimate the true distance. From past records of these measurements, the astronomer knows that the variance of a single measurement is 4 parsec$^2$. How many measurements should the astronomer make so that the chance that the estimate differs from $d$ by more than 0.5 parsecs is at most 0.05?

Solution: Let $X_i$ be the $i$th measurement. The astronomer assumes that $X_1, X_2, \ldots, X_n$ are iid with $E[X_i] = d$ and $\operatorname{Var}[X_i] = 4$. The estimate of $d$ is
$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
We want to find the number of measurements $n$ so that
$$P(|\bar{X}_n - d| > 0.5) \le 0.05.$$
We know that
$$P(|\bar{X}_n - d| > 0.5) = P(\bar{X}_n - d > 0.5) + P(\bar{X}_n - d < -0.5).$$
We use the CLT to approximate each of the probabilities on the right. From the CLT we have
$$\bar{X}_n \mathrel{\dot\sim} N(d, 4/n).$$
Thus, noting that $0.5/\sqrt{4/n} = \sqrt{n}/4$,
$$\begin{aligned}
P(|\bar{X}_n - d| > 0.5) &= P(\bar{X}_n - d > 0.5) + P(\bar{X}_n - d < -0.5) \\
&= P\left(\frac{\bar{X}_n - d}{\sqrt{4/n}} > \frac{0.5}{\sqrt{4/n}}\right) + P\left(\frac{\bar{X}_n - d}{\sqrt{4/n}} < \frac{-0.5}{\sqrt{4/n}}\right) \\
&\approx P\left(Z > \frac{0.5}{\sqrt{4/n}}\right) + P\left(Z < \frac{-0.5}{\sqrt{4/n}}\right) \\
&= 1 - \Phi(\sqrt{n}/4) + \Phi(-\sqrt{n}/4) \\
&= 2\left(1 - \Phi(\sqrt{n}/4)\right).
\end{aligned}$$
We need to find the smallest integer $n$ such that $2(1 - \Phi(\sqrt{n}/4))$ is less than or equal to 0.05. We will set $2(1 - \Phi(\sqrt{n^*}/4)) = 0.05$, solve for $n^*$, and take the required number of measurements to be $\lceil n^* \rceil$. Observe that $2(1 - \Phi(\sqrt{n^*}/4)) = 0.05$ implies that $\Phi(\sqrt{n^*}/4) = 0.975$. Using the Normal c.d.f. tables, this gives $\sqrt{n^*}/4 = 1.96$; thus $n^* = (4 \times 1.96)^2 = 61.47$. The astronomer must therefore take at least 62 measurements to have the accuracy specified above.
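The two worked examples can be checked numerically. The following is a minimal sketch, not part of the original notes, that uses Python's standard-library `statistics.NormalDist` in place of the Normal c.d.f. tables. The helper names `clt_sum_tail` and `min_sample_size` are hypothetical, introduced here only for illustration. Because the code does not round the z-value to 3.16 or 1.96, the probability in Example 1 may differ slightly in the last digit from the table-based answer.

```python
import math
from statistics import NormalDist

# Standard normal distribution; its cdf plays the role of Phi in the notes.
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def clt_sum_tail(n, mu, var, threshold):
    """Approximate P(S_n > threshold) for a sum of n iid variables,
    using the CLT approximation S_n ~ N(n*mu, n*var)."""
    z = (threshold - n * mu) / math.sqrt(n * var)
    return 1.0 - STD_NORMAL.cdf(z)


def min_sample_size(var, tol, alpha):
    """Smallest integer n with 2 * (1 - Phi(tol * sqrt(n / var))) <= alpha,
    i.e. the CLT-based sample size for P(|mean - d| > tol) <= alpha."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)   # about 1.96 when alpha = 0.05
    n_star = (z * math.sqrt(var) / tol) ** 2    # real-valued solution n*
    return math.ceil(n_star)


# Example 1: X_i ~ U(2, 5), n = 30 days, threshold = 120 minutes.
mu = (2 + 5) / 2            # 3.5 min
var = (5 - 2) ** 2 / 12     # 0.75 min^2
p_bus = clt_sum_tail(n=30, mu=mu, var=var, threshold=120)

# Example 2: Var[X_i] = 4 parsec^2, tolerance 0.5 parsec, alpha = 0.05.
n_needed = min_sample_size(var=4.0, tol=0.5, alpha=0.05)

print(f"Example 1: P(T > 120) is approximately {p_bus:.5f}")
print(f"Example 2: measurements needed = {n_needed}")
```

Run as a script, this should give a probability of roughly 0.0008 for Example 1 and 62 measurements for Example 2, in agreement with the hand calculations.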