The central limit theorem

Topic 4-3 The central limit theorem
One of the most important theorems in statistics: it gives the approximate distribution of the sum of a large number of RVs.
IF
$X_1, X_2, \ldots, X_n$ are independent, identically distributed RVs (i.e. they all have the same PMF if discrete, or the
same PDF if continuous), each with mean $\mu$ and variance $\sigma^2$, and $S_n = X_1 + X_2 + \cdots + X_n$,
THEN
$$\lim_{n \to \infty} P\left(c_1 \le \frac{S_n/n - \mu}{\sigma/\sqrt{n}} \le c_2\right) = \int_{c_1}^{c_2} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du .$$
In words, when n gets large, the quantity $Z_n = \frac{S_n/n - \mu}{\sigma/\sqrt{n}}$ behaves like a standard normal RV, i.e.
1. $S_n/n$ looks normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$;
or, equivalently (multiplying $Z_n$ by $n/n$, which gives $Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$),
2. $S_n$ looks normal with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$.
Proof: Beyond the scope of our course. See Spiegel (p.131, example 4.25: Proof of CLT using moment
generating functions) if you are interested.
What is important is that, as n gets very large, although $X_1, X_2, \ldots, X_n$ may not be individually normal (or even
continuous!), $S_n$ is always approximately normal. Also, since $\sigma/\sqrt{n}$ becomes a tiny number, $S_n/n \equiv \bar{X}$ has a
Gaussian distribution that is sharply peaked at $\mu$, as indicated in the following picture:
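[Figure: probability density of $\bar{X}$ plotted against x, showing a sharp, narrow peak; horizontal axis from 2.6 to 3.4, vertical axis from 0 to 40.]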
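The convergence stated in the theorem can be checked numerically even for a discrete RV. The sketch below is an illustration only: the fair six-sided die (mean 3.5, variance 35/12), the limits c1 = -1 and c2 = 1.5, and all function names are assumptions chosen for this example, not part of the notes. It computes the exact distribution of $S_n$ by repeated convolution and compares $P(c_1 \le Z_n \le c_2)$ with the normal integral.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dice_sum_pmf(n):
    """Exact PMF of S_n, the sum of n fair six-sided dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for total, prob in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + prob / 6.0
        pmf = nxt
    return pmf

def prob_standardized_sum(n, c1, c2):
    """Exact P(c1 <= Z_n <= c2), with Z_n = (S_n - n*mu) / (sigma*sqrt(n))."""
    mu = 3.5                        # mean of one die
    sigma = math.sqrt(35.0 / 12.0)  # standard deviation of one die
    total = 0.0
    for s, prob in dice_sum_pmf(n).items():
        z = (s - n * mu) / (sigma * math.sqrt(n))
        if c1 <= z <= c2:
            total += prob
    return total

c1, c2 = -1.0, 1.5  # hypothetical limits for the illustration
clt_limit = normal_cdf(c2) - normal_cdf(c1)
for n in (1, 5, 50, 200):
    print(n, round(prob_standardized_sum(n, c1, c2), 4), round(clt_limit, 4))
```

Because the die is discrete, the exact probability approaches the limit in small jumps rather than smoothly, but the gap shrinks as n grows.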
Solved problems
Problem 4-3-1
The waiting time (T) in minutes to clear the toll station at Lion Rock Tunnel has an exponential probability
density function, $f(t) = 0.2\,e^{-t/5}$ ($t \ge 0$ only). Find the probability that a line of 50 cars waiting to pay toll at this
tunnel can be completely served in less than 3.5 hours.
Hint: the exponential distribution $f(x) = \frac{1}{\lambda} e^{-x/\lambda}$ has $E(X) = \lambda$ and $\mathrm{Var}(X) = \lambda^2$.
Solution:
For any one car, its average waiting time is $\mu_T = 5$ minutes, with standard deviation $\sigma_T = 5$ minutes. Let W be
the total waiting time of 50 cars. W is the sum of 50 i.i.d. random variables, hence, according to the CLT, W is
approximately normal with mean
$\mu_W = 50\,\mu_T = 50 \times 5$ min = 250 min,
and standard deviation
$\sigma_W = \sqrt{50}\,\sigma_T = 5\sqrt{50}$ min.
Hence the probability
$$P(W < 3.5\ \text{hrs}) = P(W < 210\ \text{min}) = P\left(\frac{W - \mu_W}{\sigma_W} < \frac{210 - 250}{5\sqrt{50}}\right) = P(Z < -1.13137085) \approx 0.129$$
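The arithmetic of this solution can be reproduced in a few lines. The sketch below recomputes the CLT estimate and, as an addition that is not part of the original solution, compares it with the exact probability: a sum of k i.i.d. exponentials follows an Erlang distribution, whose CDF is $1 - \sum_{j=0}^{k-1} e^{-x} x^j / j!$ with $x = t/5$. The variable and function names are assumptions made for this example.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def erlang_cdf(t, k, scale):
    """Exact P(W < t) for W = sum of k i.i.d. exponentials with mean `scale`."""
    x = t / scale
    term = math.exp(-x)   # j = 0 term of the Poisson-type sum
    tail = 0.0
    for j in range(k):
        tail += term
        term *= x / (j + 1)
    return 1.0 - tail

n_cars = 50
mean_one = 5.0        # minutes, mean waiting time of one car
sd_one = 5.0          # minutes, standard deviation of one car
limit = 3.5 * 60.0    # 210 minutes

mean_total = n_cars * mean_one            # 250 min
sd_total = math.sqrt(n_cars) * sd_one     # 5*sqrt(50) min
z = (limit - mean_total) / sd_total
p_clt = normal_cdf(z)                     # CLT (normal) approximation
p_exact = erlang_cdf(limit, n_cars, mean_one)

print("z =", z)
print("CLT approximation:", p_clt)
print("Exact (Erlang):   ", p_exact)
```

With n = 50 the two values should be close but not identical, since the exponential distribution is skewed and the normal approximation ignores that skewness.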
Problem 4-3-2
(a) A random variable with mean $\mu$ and variance $\sigma^2$ (but not necessarily normal) is sampled a large number (n)
of times, and the observed values are $X_1, X_2, \ldots, X_n$ (a "sample"). What is the "sampling distribution" (i.e.
the PDF) of the sample mean (i.e. $\bar{X} \equiv \frac{1}{n}\sum_{i=1}^{n} X_i$)? And how is $\sum_{i=1}^{n} X_i$ distributed?
(b) Hence prove the "normal approximation to binomial": if X has a binomial distribution with a large n (yet p
is not so small, hence a Poisson approximation is not suitable), X can be treated as normal with mean $np$ and
standard deviation $\sqrt{np(1-p)}$.
Solution:
(a) For a sample of size n, the $X_i$ are independent and identically distributed with the same mean $\mu$ and variance $\sigma^2$, so
the requirements of the CLT are met. Hence we conclude that the sampling distribution of $\bar{X}$ is
(approximately) normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. $\sum_{i=1}^{n} X_i$ is also (approximately) normal, with mean $n\mu$ and
standard deviation $\sigma\sqrt{n}$.
(b) Recall that a binomial X counts the total number of successes among n independent trials. We can think of
each trial ($X_i$) as a sampling of the same random variable which has two possibilities, namely 1 (success)
with probability p, and 0 (failure) with probability 1 - p, and hence the mean $\mu = 1 \cdot p + 0 \cdot (1 - p) = p$, and
standard deviation
$\sigma = \sqrt{(1-p)^2 \cdot p + (0-p)^2 \cdot (1-p)} = \sqrt{p(1-p)(1-p+p)} = \sqrt{p(1-p)}$. Note that X is
simply $X_1 + X_2 + \cdots + X_n$. Thus, by the CLT, X must be approximately normal with mean $np$ and standard
deviation $\sqrt{n} \cdot \sqrt{p(1-p)} = \sqrt{np(1-p)}$.
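To see how good the approximation is in practice, one can compare an exact binomial probability with its normal estimate. The sketch below is illustrative only: the values n = 100, p = 0.4 and k = 45 are hypothetical, the function names are assumptions, and the optional continuity correction (evaluating the normal CDF at k + 0.5) is a common refinement that the notes above do not discuss.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def binom_cdf_normal(k, n, p, continuity=True):
    """Normal approximation to P(X <= k), with mean np and sd sqrt(np(1-p))."""
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    x = k + 0.5 if continuity else float(k)  # continuity correction is optional
    return normal_cdf((x - mu) / sigma)

n, p, k = 100, 0.4, 45  # hypothetical example values
exact = binom_cdf(k, n, p)
approx_plain = binom_cdf_normal(k, n, p, continuity=False)
approx_cc = binom_cdf_normal(k, n, p, continuity=True)

print("exact:                   ", exact)
print("normal, no correction:   ", approx_plain)
print("normal, with correction: ", approx_cc)
```

The corrected estimate is usually the closer of the two, because it accounts for X taking only integer values.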
Problem 4-3-3
Consider a long cantilever structure shown below, which is loaded by a force F at a distance X from the fixed
end A.
[Figure: cantilever beam fixed at end A, with force F applied at distance X from A.]
Suppose F and X are independent lognormal random variables with mean values of 0.2 kips and 10 ft, and
c.o.v. of 20% and 30%, respectively.
(a) Determine the probability that the induced bending moment at A will exceed 3 kip-ft. (ans. 0.093)
(b) Suppose there are 50 forces, each acting at a random location along the beam. Each force has a lognormal
distribution with a mean of 0.2 kip and c.o.v. of 20%, and the location of each force is also lognormal with a
mean of 10 ft. from A and a c.o.v. of 30%. Assume statistical independence between individual values of
the 50 forces and also between individual locations of the 50 forces. Determine the probability that the total
induced bending moment at A will exceed 120 kip-ft. (ans. $5.5 \times 10^{-5}$)
Solution:
(a) First, let's calculate the lognormal parameters of F and X (with $\delta$ denoting the c.o.v.):
$\zeta_F = [\ln(1 + \delta_F^2)]^{0.5} = [\ln(1 + 0.2^2)]^{0.5} = [\ln(1.04)]^{0.5}$
$\lambda_F = \ln \mu_F - \zeta_F^2/2 = \ln(0.2 / 1.04^{0.5})$, similarly
$\zeta_X = [\ln(1 + \delta_X^2)]^{0.5} = [\ln(1 + 0.3^2)]^{0.5} = [\ln(1.09)]^{0.5}$
$\lambda_X = \ln \mu_X - \zeta_X^2/2 = \ln(10 / 1.09^{0.5})$
Now, let M denote the bending moment at A. Since M = FX,
ln M = ln F + ln X (i.e. normal + normal), hence
M is lognormal (because ln M is normal), with parameters
$\lambda_M = \lambda_F + \lambda_X = \ln\left(\frac{2}{\sqrt{1.04 \times 1.09}}\right)$, and
$\zeta_M = (\zeta_F^2 + \zeta_X^2)^{0.5} = [\ln(1.04 \times 1.09)]^{0.5}$, hence
$$P(M > 3) = P\left(\frac{\ln M - \lambda_M}{\zeta_M} > \frac{\ln 3 - \ln(2/\sqrt{1.04 \times 1.09})}{\sqrt{\ln(1.04 \times 1.09)}}\right) = 1 - \Phi(1.322063427) \approx 0.093$$
(b) Let $M_i$ be the moment at A due to the i-th force, where i = 1, 2, …, 50. Since each $M_i$ is identically distributed
as M in part (a), the lognormal parameters of $M_i$ are $\lambda = 0.630447976$ and $\zeta = 0.354116378$. From these, we
can calculate the mean and standard deviation of $M_i$ as
$\mu_i = \exp(\lambda + \zeta^2/2) = 2$,
$\sigma_i = \mu_i [\exp(\zeta^2) - 1]^{0.5} = 0.731026675$
The total bending moment at A is $T = M_1 + M_2 + \cdots + M_{50}$, the sum of a large number of identically
distributed, independent RVs, hence we may apply the central limit theorem:
$T \sim N(50 \times 2,\ \sqrt{50} \times 0.731026675)$
$$P(T > 120) = P\left(\frac{T - \mu_T}{\sigma_T} > \frac{120 - 50 \times 2}{\sqrt{50} \times 0.731026675}\right) = 1 - \Phi(3.869116163) = 1 - 0.999945364 \approx 0.000055$$
Problem 4-3-4
Salaries of Assistant Engineers (AEs) at a large engineering firm are uniformly distributed between 10000 and
20000 dollars per month.
(a) What is the probability that a randomly chosen AE at this firm has a monthly salary exceeding $16000? (ans.
0.4)
(b) For 50 AEs chosen at random from this firm, find the probability that their average salary exceeds $16000.
(ans. 0.007)
(c) Why do answers from (a) and (b) differ significantly? Elaborate.
Solution:
(a) Let X be the monthly salary (in dollars) of a randomly chosen assistant engineer. X has a uniform distribution
between 10000 and 20000, covering an area of $(20000 - 16000) \times \frac{1}{20000 - 10000} = 0.4$ from x = 16000 to x = 20000.
(b) Using properties of a uniform random variable,
$\mu_X = (10000 + 20000)/2 = 15000$,
$\sigma_X = \frac{20000 - 10000}{\sqrt{12}} = \frac{5000}{\sqrt{3}}$.
Let Y be the mean monthly salary of 50 assistant engineers. By the central limit theorem, Y is approximately
normally distributed with
$\mu_Y = \mu_X = 15000$ and $\sigma_Y = \frac{\sigma_X}{\sqrt{50}} = \frac{5000}{\sqrt{150}}$,
hence the desired probability is
$$P(Y > 16000) = 1 - P(Y \le 16000) = 1 - P\left(Z \le \frac{16000 - 15000}{5000/\sqrt{150}}\right) = 1 - P(Z \le 2.449489743) = 1 - 0.992847057 \approx 0.007$$
(c) An individual’s probability of exceeding $16000 is much higher than that for the mean of a group of people,
as X has a much larger standard deviation than Y, though they have the same mean. Uncertainty is reduced
as sample size increases, and “collective behavior” (i.e. average value) becomes very centralized (i.e. almost
constantly at $15000) around the mean, while individual behavior can differ from the mean significantly.
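The contrast between parts (a) and (b) is easy to reproduce numerically. The sketch below recomputes both probabilities from the uniform distribution's mean and standard deviation; the variable names are assumptions made for this example.

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

low, high = 10000.0, 20000.0   # salary range, dollars per month
threshold = 16000.0
n = 50                         # number of AEs averaged in part (b)

# (a) One individual: area of the uniform PDF above the threshold.
p_individual = (high - threshold) / (high - low)

# (b) Mean of n individuals: approximately normal by the CLT.
mu = (low + high) / 2.0
sigma = (high - low) / math.sqrt(12.0)   # sd of one salary
sigma_mean = sigma / math.sqrt(n)        # sd of the sample mean
z = (threshold - mu) / sigma_mean
p_mean = normal_tail(z)

print("P(X > 16000)    =", p_individual)
print("sd of X, sd of Y:", sigma, sigma_mean)
print("P(Y > 16000)    =", p_mean)
```

Changing `n` shows how quickly the probability for the average collapses as the sample grows, which is the point made in part (c).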
Exercises
Exercise 4-3-1
A roulette wheel has slots numbered 1 to 36; half of them are colored red and the other half are black.
Additionally, there are two slots numbered 0 and 00 which are green. Players can bet $1 that the ball will land in
a red (or black) slot, and win $1 if they made the right guess.
(a) If you place a one-dollar bet on this game once every day, what is the probability of being ahead or just
breaking even after one year? (ans. 0.157)
(b) Repeat the calculation in (a) for 1000 plays. (ans. 0.048)
(c) What is the probability of not losing any money when the number of plays goes to infinity? What does this
say about gambling at a casino? (ans. 0, don’t gamble)
Exercise 4-3-2
Suppose the traffic on a long-span bridge consists of two kinds of vehicles: cars and trucks. The weight of each
car has a mean of 5 kips and standard deviation of 2 kips; whereas the weight of a truck has a mean of 20 kips
and standard deviation of 5 kips. Assume that the weight of each vehicle is lognormally distributed and the
weights between vehicles are statistically independent. There are currently 100 cars and 30 trucks on the bridge.
(a) Determine the mean and c.o.v. of the total vehicle weight on the bridge. (ans. 1100 kips, 0.031)
(b) Estimate the probability that the total vehicle weight exceeds 1200 kips. State any assumption that you
use. (ans. 0.0016)
(c) Suppose the total dead load of the bridge is normally distributed with a mean of 1200 kips and a c.o.v. of
10%.
(i) What is the probability that the total vehicle load will exceed the total dead load? (ans. 0.211)
(ii) What is the probability that the sum of the total dead load and vehicle load will exceed 2500 kips? (ans.
0.054)
Exercise 4-3-3
Five-star brand cement is shipped in batches, each batch containing forty bags. Previous records show that the
weight of a randomly chosen bag of this type of cement has a mean of 2.5 kg and a standard deviation of 0.1 kg,
but its exact probability density function is unknown.
(a) What is the mean weight of one batch of Five-star brand cement? (ans. 100kg)
(b) Suppose the shipping company charges a penalty if a batch exceeds its mean weight by more than 1 kg. What
is the probability that a batch of Five-star brand cement will receive a penalty? (ans. 0.0569)
(c) Suppose the standard deviation of each bag is changed to 1 kg, but all other parameters remain the same.
What is the probability of a penalty now? Hence comment on whether standard deviation is something good
or bad. (ans. 0.437)
Exercise 4-3-4
(a) State the Central Limit Theorem in its precise mathematical form, i.e. in terms of a limit and an integral
which permits immediate table look-up. Define all symbols used.
(b) In fifteen words or less, briefly summarize the essence of the Central Limit Theorem in plain English. Do
not use any mathematical symbol.
Exercise 4-3-5
Consider a one-dimensional atomic lattice, with distance a (“lattice constant”) between neighboring sites. An
atom, initially at the origin, starts to travel randomly between neighboring sites in the following manner:
(i) each step is either to the immediate right or left, with respective probabilities p and $1 - p \equiv q$;
(ii) the steps are independent of each other.
For X, the atom's net displacement after a large number (N) of steps, calculate
(a) the average displacement $\mu_X$ (ans. $N(p - q)a$)
(b) the dispersion $E[(X - \mu_X)^2]$ (ans. $4Npqa^2$)
(c) what is the atom's most likely location after N steps if p = (i) 0.5, (ii) 0.2, (iii) 0.99? Express your answer in
terms of the maximum possible net displacement $L \equiv Na$. (ans. 0, -0.6L, 0.98L)