Slide set 15
Stat 330 (Spring 2015)
Last update: February 3, 2015
Central Limit Theorem (CLT)
Main Idea: Sums and averages of random variables from arbitrary
distributions have approximate normal distributions for sufficiently large
sample sizes.
Suppose X1, X2, . . . , Xn are iid random variables with
E[Xi] = µ, Var[Xi] = σ², i = 1, . . . , n
Define
Sample Average: X̄n = (X1 + X2 + · · · + Xn)/n
Sample Sum: Sn = X1 + X2 + · · · + Xn
For large n,
X̄n ∼̇ N(µ, σ²/n) and Sn ∼̇ N(nµ, nσ²)
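The statement above can be checked by simulation; the following is a minimal Python sketch (not part of the original slides), assuming numpy is available. It draws repeated samples from a non-normal distribution and compares the mean and variance of the sample averages with the CLT prediction µ and σ²/n.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 100000

# Draw `reps` samples of size n from a (non-normal) Uniform(2, 5) distribution,
# whose mean is 3.5 and variance is (5 - 2)^2 / 12 = 0.75.
samples = rng.uniform(2, 5, size=(reps, n))
xbar = samples.mean(axis=1)          # one sample average per replication

print(xbar.mean())                   # close to mu = 3.5
print(xbar.var())                    # close to sigma^2 / n = 0.75 / 50 = 0.015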
Central Limit Theorem (CLT) (cont’d)
Use of CLT: Calculate probabilities associated with averages or sums of iid
random variables using these approximate distributional statements. For example,
P(a < X̄n < b) ≈ Φ((b − µ)/(σ/√n)) − Φ((a − µ)/(σ/√n))
Recall: Φ is the cdf of the standard normal distribution, i.e., of Z ∼ N(0, 1)
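This approximation could also be evaluated numerically; the following is a small Python sketch (an illustration, not from the slides) that computes Φ from the standard library's math.erf. The function name prob_avg_between and the example values are placeholders.

from math import erf, sqrt

def phi(z):
    # cdf of the standard normal, Phi(z), via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_avg_between(a, b, mu, sigma, n):
    # CLT approximation to P(a < Xbar_n < b)
    se = sigma / sqrt(n)
    return phi((b - mu) / se) - phi((a - mu) / se)

# placeholder values: mu = 3.5, sigma = sqrt(0.75), n = 30
print(prob_avg_between(3.3, 3.7, 3.5, sqrt(0.75), 30))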
Some Example Applications of the Central Limit Theorem
Example 1: The time I spend waiting for the bus in a day has a Uniform
distribution between 2 minutes and 5 minutes.
(a) How much time do I expect to spend waiting for the bus in one month
(30 days)?
Let Xi = time I wait for the bus on day i.
Then X1, X2, . . . , X30 ∼ iid U(2, 5) and it follows that
E[Xi] = (2 + 5)/2 = 3.5 min, Var(Xi) = (5 − 2)²/12 = 9/12 = .75 min²
CLT Examples (cont’d)
Let T ≡ random variable representing the total waiting time for a month.
T is the sum of 30 iid random variables: T = X1 + X2 + . . . + X30
E[T] = E[X1 + X2 + · · · + X30] = 30µ where µ = E[Xi] = 3.5.
Thus the expected waiting time for a month is E[T] = 30 × 3.5 = 105 min.
(b) Approximately, find the probability that I spend more than 2 hours
waiting for a bus in a month.
From the CLT, we have T ∼̇ N(30 × 3.5, 30 × 0.75), i.e., T ∼̇ N(105, 22.5)
We need the probability that T is greater than 120 minutes, i.e., P (T > 120)
P(T > 120) = 1 − P(T ≤ 120)
           ≈ 1 − P(Z ≤ (120 − 105)/√22.5)   (by CLT)
           = 1 − Φ(3.16) = 1 − .9992112 = .00079
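As a check on this calculation (not part of the slides), the Python sketch below evaluates the CLT approximation 1 − Φ((120 − 105)/√22.5) and, assuming numpy is available, also simulates the total of 30 daily Uniform(2, 5) waiting times.

import numpy as np
from math import erf, sqrt

phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal cdf

mu, var, days = 3.5, 0.75, 30
clt_tail = 1.0 - phi((120 - days * mu) / sqrt(days * var))
print(clt_tail)                      # roughly 0.00079

rng = np.random.default_rng(1)
totals = rng.uniform(2, 5, size=(200000, days)).sum(axis=1)
print((totals > 120).mean())         # simulated P(T > 120), close to the CLT value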
CLT Example from Baron
Example 4.13 (Allocation of Disk Space) A disk has free space of 330
megabytes. Is it likely to be sufficient for 300 independent images, if each
image has expected size of 1 megabyte with a standard deviation of 0.5
megabytes?
We have n = 300, µ = 1 and σ = 0.5. The number of images n is large, so
the CLT applies. Then
P(sufficient space) = P(Sn ≤ 330)
                    = P( (Sn − nµ)/(σ√n) ≤ (330 − (300)(1))/(0.5√300) )
                    ≈ Φ(3.46) = .9997
Since this probability is very high, the available disk space is very likely to
be sufficient.
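The same arithmetic can be reproduced directly; here is a brief Python sketch (an illustration, not from Baron) that again uses math.erf for Φ.

from math import erf, sqrt

phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal cdf

n, mu, sigma, capacity = 300, 1.0, 0.5, 330
z = (capacity - n * mu) / (sigma * sqrt(n))        # standardized value, about 3.46
print(z, phi(z))                                   # Phi(3.46) is about 0.9997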
CLT Big Example
An astronomer wants to measure the distance, d, from the observatory to
a star. Due to the variation of atmospheric conditions and imperfections in
the measurement method, a single measurement will not produce the exact
distance d. The astronomer takes n measurements of the distance and uses
the sample average to estimate the true distance. From past records of these
measurements the astronomer knows the variance of a single measurement
is 4 parsec². How many measurements should the astronomer make so that
the chance that his estimate differs from d by more than .5 parsecs is at most
.05?
Let Xi be the ith measurement. The astronomer assumes that
X1, X2, . . . , Xn are iid with E[Xi] = d and Var[Xi] = 4
The estimate of d is X̄n = (X1 + X2 + · · · + Xn)/n
We want to find the number of measurements n so that
P(|X̄n − d| > .5) ≤ .05
CLT Big Example (cont’d)
We know that
P(|X̄n − d| > .5) = P(X̄n − d > .5) + P(X̄n − d < −.5)
We use the CLT to approximate each of the probabilities on the right. From
the CLT we have that
X̄n ∼̇ N(d, 4/n)
Thus
P(|X̄n − d| > .5) = P(X̄n − d > .5) + P(X̄n − d < −.5)
                 = P( (X̄n − d)/√(4/n) > .5/√(4/n) ) + P( (X̄n − d)/√(4/n) < −.5/√(4/n) )
                 ≈ P(Z > .5/√(4/n)) + P(Z < −.5/√(4/n))
CLT Big Example (cont’d)
= 1 − Φ(√n/4) + Φ(−√n/4)    (since .5/√(4/n) = √n/4)
= 2(1 − Φ(√n/4))
• We need to find an integer n so that 2(1 − Φ(√n/4)) is less than or equal to .05.
• We will set 2(1 − Φ(√n*/4)) = .05, solve for n*, and take the required number of measurements to be ⌈n*⌉.
• Observe that 2(1 − Φ(√n*/4)) = .05 implies that Φ(√n*/4) = .975.
• Using the normal cdf tables, this gives √n*/4 = 1.96; thus n* = 61.47.
• Thus the astronomer must take at least 62 measurements to have the accuracy specified above.
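The sample-size step can be reproduced with a few lines of Python (a sketch, not part of the slides); it assumes scipy is installed and uses scipy.stats.norm.ppf for the .975 quantile.

from math import ceil, sqrt
from scipy.stats import norm

sigma, tol, alpha = sqrt(4.0), 0.5, 0.05
z = norm.ppf(1 - alpha / 2)              # 1.96, the .975 normal quantile
# Solving 2*(1 - Phi(sqrt(n)*tol/sigma)) = alpha for n gives n = (z*sigma/tol)^2.
n_star = (z * sigma / tol) ** 2
print(n_star, ceil(n_star))              # about 61.5, so 62 measurements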
Normal approximation to the Binomial
For large n, the binomial distribution B(n, p) is approximately normal
N(np, np(1 − p)). Why?
Let Y be a variable with a B(n, p) distribution. We know that Y is the number
of successes in n independent Bernoulli experiments with P(success) = p.
Write Y as the sum of n iid Bernoulli variables, each with µ = E(Xi) = p
and σ² = Var(Xi) = p(1 − p): Y = X1 + X2 + · · · + Xn
Applying the CLT result for Sn, we have that Y ∼̇ N(nµ, nσ²) where µ = p
and σ² = p(1 − p). That is, Y ∼̇ N(np, np(1 − p)).
Use this approximation only when np and n(1 − p) are both > 5; the
approximation is pretty good when np and n(1 − p) are both > 20.
When either np or n(1 − p) is < 20, a continuity correction is needed
(see Baron p. 94).
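As a quick numerical illustration (not from the slides), assuming scipy is available, the exact Binomial cdf can be compared with the normal approximation, with and without the continuity correction.

from math import sqrt
from scipy.stats import binom, norm

n, p, k = 100, 0.3, 25
sd = sqrt(n * p * (1 - p))

exact = binom.cdf(k, n, p)                       # exact P(Y <= k)
plain = norm.cdf((k - n * p) / sd)               # CLT approximation
corrected = norm.cdf((k + 0.5 - n * p) / sd)     # with continuity correction
print(exact, plain, corrected)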