# Chapter 2: Random Variables
If the value of a numerical variable depends on the outcome of an experiment, we call the variable a random
variable.
Definition 2.0.1 (Random Variable)
A function $X : \Omega \to \mathbb{R}$ is called a random variable: X assigns a real value to each elementary event.
Standard notation: capital letters from the end of the alphabet.
Example 2.0.3 Very simple Dartboard
In the case of three darts on a board, as in the previous example, we are usually not interested in the order in which the darts have been thrown. We only want to count the number of times the red area has been hit. This count is a random variable!
More formally: we define X to be the function that assigns to a sequence of three throws the number of times the red area is hit:
$$X(s) = k \quad \text{if } s \text{ consists of } k \text{ hits to the red area and } 3 - k \text{ hits to the gray area.}$$
X(s) is then an integer between 0 and 3 for every possible sequence.
What, then, is the probability that a player hits the red area exactly two times?
We are now looking for all those elementary events s of our sample space for which X(s) = 2.
Going back to the tree, we find three possibilities for s: rrg, rgr and grr. This is the subset of Ω for which X(s) = 2. Very formally, this set can be written as:
$$\{s \mid X(s) = 2\}$$
We want to know the total probability:
$$P(\{s \mid X(s) = 2\}) = P(rrg \cup rgr \cup grr) = P(rrg) + P(rgr) + P(grr) = \frac{8}{9^3} + \frac{8}{9^3} + \frac{8}{9^3} \approx 0.03.$$
To avoid cumbersome notation, we write $X = x$ for the event $\{\omega \mid \omega \in \Omega \text{ and } X(\omega) = x\}$.
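The dartboard computation can be reproduced by enumerating the sample space. The following is a minimal sketch, assuming that a single dart hits the red area with probability 1/9 and the gray area with probability 8/9, independently of the other throws (these values are inferred from the fractions 8/9³ in the example; the names `P_SINGLE`, `sequence_probability` and `X` are illustrative only).

```python
from fractions import Fraction
from itertools import product

# Assumption: P(red) = 1/9 and P(gray) = 8/9 for a single throw,
# and the three throws are independent.
P_SINGLE = {"r": Fraction(1, 9), "g": Fraction(8, 9)}


def sequence_probability(seq):
    """Probability of one elementary event such as ('r', 'r', 'g')."""
    prob = Fraction(1)
    for outcome in seq:
        prob *= P_SINGLE[outcome]
    return prob


def X(seq):
    """The random variable: number of hits to the red area."""
    return seq.count("r")


# Sample space: all sequences of three throws.
omega = list(product("rg", repeat=3))

# The event {s | X(s) = 2} and its total probability.
event = [s for s in omega if X(s) == 2]
p_two_red = sum(sequence_probability(s) for s in event)

print(["".join(s) for s in event])
print(p_two_red, float(p_two_red))
```

Exact fractions are used so that the result can be compared with the hand computation without rounding effects.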
Example 2.0.4 Communication Channel
Suppose, 8 bits are sent through a communication channel. Each bit has a certain probability to be received
incorrectly.
So this is a Bernoulli experiment, and we can use Ω8 as our sample space.
We are interested in the number of bits that are received incorrectly.
Use random variable X to “count” the number of wrong bits. X assigns a value between 0 and 8 to each
sequence in Ω8 .
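Now it is very easy to write events like:

| | event in words | event | probability |
|---|---|---|---|
| a) | no wrong bit received | $X = 0$ | $P(X = 0)$ |
| b) | at least one wrong bit received | $X \ge 1$ | $P(X \ge 1)$ |
| c) | exactly three bits are wrong | $X = 3$ | $P(X = 3)$ |
| d) | at least 3, but not more than 6 bits wrong | $3 \le X \le 6$ | $P(3 \le X \le 6)$ |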
Definition 2.0.2 (Image of a random variable)
The image of a random variable X is defined as the set of all possible values X can reach:
$$\mathrm{Im}(X) := X(\Omega).$$
Depending on whether or not the image of a random variable is countable, we distinguish between discrete
and continuous random variables.
Example 2.0.5
1. Put a disk drive into service, measure Y = “time till the first major failure”.
Sample space Ω = (0, ∞).
Y has uncountable image → Y is a continuous random variable.
2. Communication channel: X = “# of incorrectly received bits”
Im(X) = {0, 1, 2, 3, 4, 5, 6, 7, 8} is a finite set → X is a discrete random variable.
## 2.1 Discrete Random Variables
Assume X is a discrete random variable. The image of X is therefore countable and can be written as
{x1 , x2 , x3 , . . .}
Very often we are interested in probabilities of the form P (X = x). We can think of this expression as a
function, that yields different probabilities depending on the value of x.
Definition 2.1.1 (Probability Mass Function, PMF)
The function pX (x) := P (X = x) is called the probability mass function of X.
A probability mass function has two main properties: all values must be between 0 and 1, and the sum of all values is 1.
Theorem 2.1.2 (Properties of a pmf)
$p_X$ is the pmf of X if and only if
(i) $0 \le p_X(x) \le 1$ for all $x \in \{x_1, x_2, x_3, \ldots\}$
(ii) $\sum_i p_X(x_i) = 1$
Note: this gives us an easy method to check, whether a function is a probability mass function!
Example 2.1.1
Which of the following functions is a valid probability mass function?
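1.

| x | -3 | -1 | 0 | 5 | 7 |
|---|---|---|---|---|---|
| $p_X(x)$ | 0.1 | 0.45 | 0.15 | 0.25 | 0.05 |

2.

| y | -1 | 0 | 1.5 | 3 | 4.5 |
|---|---|---|---|---|---|
| $p_Y(y)$ | 0.1 | 0.45 | 0.25 | -0.05 | 0.25 |

3.

| z | 0 | 5 | 7 | 1 | 3 |
|---|---|---|---|---|---|
| $p_Z(z)$ | 0.22 | 0.17 | 0.18 | 0.18 | 0.24 |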
We need to check the two properties of a pmf for pX , pY and pZ .
1st property: probabilities between 0 and 1 ?
This eliminates pY from the list of potential probability mass functions, since pY (3) is negative.
The other two functions fulfill the property.
2nd property: sum of all probabilities is 1?
$\sum_i p_X(x_i) = 1$, so $p_X$ is a valid probability mass function.
$\sum_i p_Z(z_i) = 0.99 \neq 1$, so $p_Z$ is not a valid probability mass function.
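The two checks from Theorem 2.1.2 are easy to automate. The sketch below is one possible way to do it; the function name `is_valid_pmf` and the tolerance used for the floating-point comparison are choices made for this illustration.

```python
import math


def is_valid_pmf(pmf, tol=1e-9):
    """Check both pmf properties for a dict {value: probability}."""
    # Property (i): every probability lies between 0 and 1.
    in_range = all(0 <= p <= 1 for p in pmf.values())
    # Property (ii): the probabilities sum to 1 (up to rounding error).
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return in_range and sums_to_one


# The three candidate functions from Example 2.1.1.
p_X = {-3: 0.1, -1: 0.45, 0: 0.15, 5: 0.25, 7: 0.05}
p_Y = {-1: 0.1, 0: 0.45, 1.5: 0.25, 3: -0.05, 4.5: 0.25}
p_Z = {0: 0.22, 5: 0.17, 7: 0.18, 1: 0.18, 3: 0.24}

for name, pmf in [("p_X", p_X), ("p_Y", p_Y), ("p_Z", p_Z)]:
    print(name, is_valid_pmf(pmf))
```

Note that $p_Y$ sums to 1 but still fails, because one of its values is negative; both properties have to hold.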
Example 2.1.2 Probability Mass Functions
1. Very Simple Dartboard
X, the number of times a player hits the red area with three darts, is a value between 0 and 3.
What is the probability mass function for X?
The probability mass function $p_X$ can be given as a list of all possible values:
$$\begin{aligned}
p_X(0) &= P(X = 0) = P(ggg) = \frac{8^3}{9^3} \approx 0.70 \\
p_X(1) &= P(X = 1) = P(rgg) + P(grg) + P(ggr) = 3 \cdot \frac{8^2}{9^3} \approx 0.26 \\
p_X(2) &= P(X = 2) = P(rrg) + P(rgr) + P(grr) = 3 \cdot \frac{8}{9^3} \approx 0.03 \\
p_X(3) &= P(X = 3) = P(rrr) = \frac{1}{9^3} \approx 0.001
\end{aligned}$$
2. Roll of a fair die
Let Y be the number of spots on the upturned face of a die:
Obviously, Y is a random variable with image {1, 2, 3, 4, 5, 6}.
Assuming that the die is fair means that the probability for each side is equal. The probability mass function for Y therefore is $p_Y(i) = \frac{1}{6}$ for all i in {1, 2, 3, 4, 5, 6}.
3. The diagram shows all six faces of a particular die. If Z denotes the number of spots on the upturned face after tossing this die, what is the probability mass function for Z?
Assuming that each face of the die appears with the same probability, we have one possibility each to get a 1 or a 4, and two possibilities each for a 2 or a 3 to appear, which gives a probability mass function of:

| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| p(x) | 1/6 | 1/3 | 1/3 | 1/6 |
### 2.1.1 Expectation and Variance
Example 2.1.3 Game
Suppose we play a "game" in which you toss a die. Let X be the number of spots. Then, if X is
• 1, 3 or 5: I pay you \$ X,
• 2 or 4: you pay me \$ 2 · X,
• 6: no money changes hands.
How much money do I expect to win?
For that, we look at another function, h(x), that counts the money I win with respect to the number of spots:
$$h(x) = \begin{cases} -x & \text{for } x = 1, 3, 5 \\ 2x & \text{for } x = 2, 4 \\ 0 & \text{for } x = 6. \end{cases}$$
Now we make a list:
In 1/6 of all tosses X will be 1, and I will gain -1 dollars
In 1/6 of all tosses X will be 2, and I will gain 4 dollars
In 1/6 of all tosses X will be 3, and I will gain -3 dollars
In 1/6 of all tosses X will be 4, and I will gain 8 dollars
In 1/6 of all tosses X will be 5, and I will gain -5 dollars
In 1/6 of all tosses X will be 6, and I will gain 0 dollars
In total I expect to get
$$\frac{1}{6} \cdot (-1) + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot (-3) + \frac{1}{6} \cdot 8 + \frac{1}{6} \cdot (-5) + \frac{1}{6} \cdot 0 = \frac{3}{6} = 0.5$$
dollars per play.
Assume that instead of a fair die we use the die from Example 2.1.2, part 3. How does that change my expected gain?
h(x) is not affected by the different die, but my expected gain changes. In total I expect to gain
$$\frac{1}{6} \cdot (-1) + \frac{1}{3} \cdot 4 + \frac{1}{3} \cdot (-3) + \frac{1}{6} \cdot 8 + 0 \cdot (-5) + 0 \cdot 0 = \frac{9}{6} = 1.5$$
dollars per play.
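The expected gain is just a weighted sum of h(x) over the pmf, which can be sketched in a few lines. The helper name `expectation` and the dictionary representation of the two dice are assumptions made for this illustration.

```python
from fractions import Fraction


def expectation(h, pmf):
    """E[h(X)]: sum of h(x) * p_X(x) over all values x of X."""
    return sum(h(x) * p for x, p in pmf.items())


def h(x):
    """My gain in dollars when the die shows x spots."""
    if x in (1, 3, 5):
        return -x
    if x in (2, 4):
        return 2 * x
    return 0


# Fair die: each face has probability 1/6.
fair_die = {x: Fraction(1, 6) for x in range(1, 7)}

# Die with faces 1, 2, 2, 3, 3, 4 (each face equally likely).
special_die = {
    1: Fraction(1, 6),
    2: Fraction(1, 3),
    3: Fraction(1, 3),
    4: Fraction(1, 6),
}

gain_fair = expectation(h, fair_die)
gain_special = expectation(h, special_die)
print(gain_fair, gain_special)
```

The same function h is used for both dice; only the pmf changes, which is exactly the point made in the example.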
Definition 2.1.3 (Expectation)
The expected value of a function h(X) is defined as
$$E[h(X)] := \sum_i h(x_i) \cdot p_X(x_i).$$
The most important version of this is h(x) = x:
$$E[X] = \sum_i x_i \cdot p_X(x_i) =: \mu$$
Example 2.1.4 Toss of a Die
Toss a fair die, and denote by X the number of spots on the upturned face.
What is the expected value for X?
Looking at the above definition for E[X], we see that we need to know the probability mass function for the computation.
The probability mass function of X is $p_X(i) = \frac{1}{6}$ for all $i \in \{1, 2, 3, 4, 5, 6\}$.
Therefore
$$E[X] = \sum_{i=1}^{6} i \, p_X(i) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5.$$
A second common measure for describing a random variable is a measure, how far its values are spread out.
We measure, how far we expect values to be away from the expected value:
Definition 2.1.4 (Variance of a random variable)
The variance of a random variable X is defined as:
$$\mathrm{Var}[X] := E[(X - E[X])^2] = \sum_i (x_i - E[X])^2 \cdot p_X(x_i)$$
The variance is measured in squared units of X.
$\sigma := \sqrt{\mathrm{Var}[X]}$ is called the standard deviation of X; its units are the original units of X.
Example 2.1.5 Toss of a Die, continued
Toss a fair die, and denote with X the number of spots on the upturned face.
What is the variance for X?
Looking at the above definition for Var[X], we see that we need to know the probability mass function and E[X] for the computation.
The probability mass function of X is $p_X(i) = \frac{1}{6}$ for all $i \in \{1, 2, 3, 4, 5, 6\}$; E[X] = 3.5.
Therefore
$$\mathrm{Var}[X] = \sum_{i=1}^{6} (i - 3.5)^2 \, p_X(i) = 6.25 \cdot \frac{1}{6} + 2.25 \cdot \frac{1}{6} + 0.25 \cdot \frac{1}{6} + 0.25 \cdot \frac{1}{6} + 2.25 \cdot \frac{1}{6} + 6.25 \cdot \frac{1}{6} \approx 2.917 \ (\text{spots}^2).$$
The standard deviation for X is:
$$\sigma = \sqrt{\mathrm{Var}[X]} \approx 1.71 \ (\text{spots}).$$
### 2.1.2 Some Properties of Expectation and Variance
The following theorems make computations with expected value and variance of random variables easier:
Theorem 2.1.5
For two random variables X and Y and two real numbers a, b:
$$E[aX + bY] = aE[X] + bE[Y].$$
Theorem 2.1.6
For a random variable X and a real number a:
(i) $E[X^2] = \mathrm{Var}[X] + (E[X])^2$
(ii) $\mathrm{Var}[aX] = a^2 \, \mathrm{Var}[X]$
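Both parts of Theorem 2.1.6 can be checked numerically on the fair die. This is only a sanity check on one example, not a proof; the helper names `expectation` and `variance` and the choice a = 3 are illustrative.

```python
from fractions import Fraction


def expectation(h, pmf):
    """E[h(X)] for a pmf given as a dict {value: probability}."""
    return sum(h(x) * p for x, p in pmf.items())


def variance(pmf):
    """Var[X] = E[(X - E[X])^2]."""
    mu = expectation(lambda x: x, pmf)
    return expectation(lambda x: (x - mu) ** 2, pmf)


fair_die = {x: Fraction(1, 6) for x in range(1, 7)}

mean = expectation(lambda x: x, fair_die)
var = variance(fair_die)
second_moment = expectation(lambda x: x * x, fair_die)

# Theorem 2.1.6 (i): E[X^2] = Var[X] + (E[X])^2
identity_i = second_moment == var + mean ** 2

# Theorem 2.1.6 (ii) with a = 3: build the pmf of 3X and compare variances.
a = 3
scaled_die = {a * x: p for x, p in fair_die.items()}
identity_ii = variance(scaled_die) == a ** 2 * var

print(mean, var, float(var) ** 0.5)
print(identity_i, identity_ii)
```

Identity (i) is often the more convenient way to compute a variance by hand: compute E[X²] and subtract (E[X])².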
Theorem 2.1.7 (Chebyshev's Inequality)
For any positive real number k, and random variable X with variance $\sigma^2$:
$$P(|X - E[X]| \le k\sigma) \ge 1 - \frac{1}{k^2}$$
### 2.1.3 Probability Distribution Function
Very often we are interested in the probability of a whole range of values, like P (X ≤ 5) or P (4 ≤ X ≤ 16).
For that we define another function:
Definition 2.1.8 (probability distribution function)
Assume X is a discrete random variable:
The function FX (t) := P (X ≤ t) is called the probability distribution function of X.
Relationship between pX and FX
Since X is a discrete random variable, the image of X can be written as $\{x_1, x_2, x_3, \ldots\}$; we are therefore interested in all $x_i$ with $x_i \le t$:
$$F_X(t) = P(X \le t) = P(\{x_i \mid x_i \le t\}) = \sum_{i:\, x_i \le t} p_X(x_i).$$
Note: in contrast to the probability mass function, FX is defined on R (not only on the image of X).
Example 2.1.6 Roll a fair die
X = # of spots on upturned face
Ω = {1, 2, 3, 4, 5, 6}
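$p_X(1) = p_X(2) = \ldots = p_X(6) = \frac{1}{6}$
$$F_X(t) = \sum_{i \le t} p_X(i) = \sum_{i=1}^{\lfloor t \rfloor} p_X(i) = \frac{\lfloor t \rfloor}{6} \quad \text{for } 1 \le t \le 6,$$
where $\lfloor t \rfloor$ is the truncated value of t (t rounded down to the nearest integer). For t < 1 we have $F_X(t) = 0$, and for t > 6 we have $F_X(t) = 1$.
Properties of $F_X$. The following properties hold for the probability distribution function $F_X$ of a random variable X.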
• $0 \le F_X(t) \le 1$ for all $t \in \mathbb{R}$
• $F_X$ is monotone increasing (i.e. if $x_1 \le x_2$ then $F_X(x_1) \le F_X(x_2)$).
• $\lim_{t \to -\infty} F_X(t) = 0$ and $\lim_{t \to \infty} F_X(t) = 1$.
• $F_X(t)$ has a positive jump equal to $p_X(x_i)$ at each of the points $x_1, x_2, x_3, \ldots$; $F_X$ is constant on the interval $[x_i, x_{i+1})$.
Whenever no confusion arises, we will omit the subscript X.
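The relationship between $p_X$ and $F_X$ translates directly into code: the distribution function sums the pmf over all values not exceeding t. The sketch below uses the fair die; `make_cdf` is a name chosen for this illustration.

```python
from fractions import Fraction


def make_cdf(pmf):
    """Return the distribution function F(t) = P(X <= t) for a discrete pmf."""
    def F(t):
        # Sum p_X(x_i) over all values x_i with x_i <= t.
        return sum(p for x, p in pmf.items() if x <= t)
    return F


fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
F = make_cdf(fair_die)

# F is defined for every real t, not only on the image of X.
for t in (-1, 0.5, 1, 2.7, 6, 10):
    print(t, F(t))
```

Evaluating F at non-integer points such as 2.7 shows the stairstep behaviour: the value only changes when t passes one of the points 1, ..., 6.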
## 2.2 Special Discrete Probability Mass Functions
In many theoretical and practical problems, several probability mass functions occur often enough to be
worth exploring here.
### 2.2.1 Bernoulli pmf
Situation: Bernoulli experiment (only two outcomes: success/ no success)
with P ( success ) = p
We define a random variable X as:
X( success ) = 1
X( no success ) = 0
The probability mass function pX of X is then:
pX (0) = 1 − p
pX (1) = p
This probability mass function is called the Bernoulli mass function.
The distribution function $F_X$ is then:
$$F_X(t) = \begin{cases} 0 & t < 0 \\ 1 - p & 0 \le t < 1 \\ 1 & 1 \le t \end{cases}$$
This distribution function is called the Bernoulli distribution function.
That’s a very simple probability function, and we’ve already seen sequences of Bernoulli experiments. . .
### 2.2.2 Binomial pmf
Situation: n sequential Bernoulli experiments, with success rate p for a single trial. The single trials are independent of each other.
We are only interested in the total number of successes after n trials; therefore we define a random variable X as:
X = "number of successes in n trials"
This leads to an image of X as
im(X) = {0, 1, 2, . . . , n}
We can think of the sample space Ω as the set of sequences of length n that consist only of the letters S and F, for "success" and "failure":
$$\Omega = \{F \ldots FF,\ F \ldots FS,\ \ldots,\ S \ldots SS\}$$
This way, we get $2^n$ different outcomes in the sample space.
Now we want to derive a probability mass function for X, i.e. we want to get to a general expression for $p_X(k)$ for all possible k = 0, ..., n.
$p_X(k) = P(X = k)$, i.e. we want to find the probability that in a sequence of n trials there are exactly k successes.
Think: if s is a sequence with k successes and n − k failures, we already know the probability:
$$P(s) = p^k (1 - p)^{n-k}.$$
Now we need to know how many possibilities there are to have k successes in n trials: think of the n trials as numbers from 1 to n. To have k successes, we need to choose a set of k of these numbers out of the n possible numbers. Do you see it? That is the binomial coefficient again.
$p_X(k)$ is therefore:
$$p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}.$$
This probability mass function is called the Binomial mass function.
The distribution function $F_X$ is:
$$F_X(t) = \sum_{i=0}^{\lfloor t \rfloor} \binom{n}{i} p^i (1 - p)^{n-i} =: B_{n,p}(t)$$
This function is called the Binomial distribution $B_{n,p}$, where n is the number of trials and p is the probability of a success.
It is a bit cumbersome to compute values for the distribution function. Therefore, those values are tabled
with respect to n and p.
Example 2.2.1
Compute the probabilities for the following events:
A box contains 15 components that each have a failure rate of 2%. What is the probability that
1. exactly two out of the fifteen components are defective?
2. at most two components are broken?
3. more than three components are broken?
4. more than 1 but less than 4 are broken?
Let X be the number of broken components. Then X has a B15,0.02 distribution.
1. P(exactly two out of the fifteen components are defective) = $p_X(2) = \binom{15}{2} \, 0.02^2 \cdot 0.98^{13} = 0.0323$.
2. P(at most two components are broken) = $P(X \le 2) = B_{15,0.02}(2) = 0.9970$.
3. P(more than three components are broken) = $P(X > 3) = 1 - P(X \le 3) = 1 - 0.9998 = 0.0002$.
4. P(more than 1 but less than 4 are broken) = $P(1 < X < 4) = P(X \le 3) - P(X \le 1) = 0.99982 - 0.96466 = 0.0352$.
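Instead of looking the values up in a table, the binomial probabilities can be computed directly from the pmf. The following sketch uses n = 15 and p = 0.02 as stated in the example; the function names `binom_pmf` and `binom_cdf` are chosen for this illustration and are not part of any library.

```python
from math import comb, floor


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B_{n,p}."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(t, n, p):
    """B_{n,p}(t) = P(X <= t): sum the pmf from 0 up to floor(t)."""
    upper = min(n, floor(t))
    return sum(binom_pmf(i, n, p) for i in range(0, upper + 1))


n, p = 15, 0.02

p_exactly_two = binom_pmf(2, n, p)
p_at_most_two = binom_cdf(2, n, p)
p_more_than_three = 1 - binom_cdf(3, n, p)
p_between = binom_cdf(3, n, p) - binom_cdf(1, n, p)

for label, value in [
    ("P(X = 2)", p_exactly_two),
    ("P(X <= 2)", p_at_most_two),
    ("P(X > 3)", p_more_than_three),
    ("P(1 < X < 4)", p_between),
]:
    print(f"{label} = {value:.4f}")
```

Changing `p` in this sketch shows how strongly the cumulative values depend on the failure rate, so it is worth checking that a tabled value belongs to the right n and p.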
If we want to say that a random variable has a binomial distribution, we write:
$$X \sim B_{n,p}$$
What are the expected value and variance of $X \sim B_{n,p}$?
$$\begin{aligned}
E[X] &= \sum_{i=0}^{n} i \cdot p_X(i) = \sum_{i=0}^{n} i \cdot \binom{n}{i} p^i (1 - p)^{n-i} = \sum_{i=1}^{n} i \, \frac{n!}{i!(n - i)!} \, p^i (1 - p)^{n-i} \\
&\overset{j := i-1}{=} np \cdot \underbrace{\sum_{j=0}^{n-1} \frac{(n - 1)!}{j!((n - 1) - j)!} \, p^j (1 - p)^{n-1-j}}_{=1} = np \\
\mathrm{Var}[X] &= \ldots = np(1 - p).
\end{aligned}$$
### 2.2.3 Geometric pmf
Assume, we have a single Bernoulli experiment with probability for success p.
Now, we repeat this experiment until we have a first success.
Denote by X the number of repetitions of the experiment until we have the first success.
Note: X = k means, that we have k − 1 failures and the first success in the kth repetition of the experiment.
The sample space Ω is therefore infinite and starts at 1 (we need at least one experiment):
Ω = {1, 2, 3, 4, . . .}
Probability mass function:
$$p_X(k) = P(X = k) = \underbrace{(1 - p)^{k-1}}_{k-1 \text{ failures}} \cdot \underbrace{p}_{\text{success!}}$$
This probability mass function is called the Geometric mass function.
Expected value and variance of X are:
$$E[X] = \sum_{i=1}^{\infty} i (1 - p)^{i-1} p = \ldots = \frac{1}{p},$$
$$\mathrm{Var}[X] = \sum_{i=1}^{\infty} \left(i - \frac{1}{p}\right)^2 (1 - p)^{i-1} p = \ldots = \frac{1 - p}{p^2}.$$
Example 2.2.2 Repeat-until loop
Examine the following programming statement:
Repeat S until B
Assume P(B = true) = 0.1 and let X be the number of times S is executed.
Then X has a geometric distribution,
$$P(X = k) = p_X(k) = 0.9^{k-1} \cdot 0.1$$
How often is S executed on average, i.e. what is E[X]? Using the above formula, we get $E[X] = \frac{1}{p} = 10$.
We still need to compute the distribution function $F_X$. Remember, $F_X(t)$ is the probability for X ≤ t.
Instead of tackling this problem directly, we use a trick and look at the complementary event X > t. If X is greater than t, this means that the first $\lfloor t \rfloor$ trials yield failures. This is easy to compute: it is just $(1 - p)^{\lfloor t \rfloor}$.
Therefore the probability distribution function is:
$$F_X(t) = 1 - (1 - p)^{\lfloor t \rfloor} =: Geo_p(t) \quad \text{for } t \ge 0$$
This function is called the Geometric distribution (function) $Geo_p$.
Example 2.2.3 Time Outs at the Alpha Farm
Watch the input queue at the alpha farm for a job that times out.
The probability that a job times out is 0.05.
Let Y be the number of the first job to time out, then $Y \sim Geo_{0.05}$.
What is then the probability that
• the third job is the first to time out?
$P(Y = 3) = 0.95^2 \cdot 0.05 = 0.045$
• Y is less than 3?
$P(Y < 3) = P(Y \le 2) = 1 - 0.95^2 = 0.0975$
• the first job to time out is between the third and the seventh?
$P(3 \le Y \le 7) = P(Y \le 7) - P(Y \le 2) = 1 - 0.95^7 - (1 - 0.95^2) = 0.204$
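The three probabilities for the time-out example follow from the geometric pmf and distribution function. A minimal sketch, with `geom_pmf` and `geom_cdf` as illustrative names:

```python
import math


def geom_pmf(k, p):
    """P(Y = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p


def geom_cdf(t, p):
    """Geo_p(t) = P(Y <= t) = 1 - (1 - p)^floor(t); zero for t < 1."""
    if t < 1:
        return 0.0
    return 1 - (1 - p) ** math.floor(t)


p = 0.05  # probability that a single job times out

p_third = geom_pmf(3, p)
p_less_than_three = geom_cdf(2, p)
p_third_to_seventh = geom_cdf(7, p) - geom_cdf(2, p)

print(round(p_third, 4), round(p_less_than_three, 4), round(p_third_to_seventh, 4))
```

The last line uses the same subtraction trick as the text: P(3 ≤ Y ≤ 7) is the probability of Y ≤ 7 minus the probability of Y ≤ 2.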
What is the expected value of Y, and what is Var[Y]?
Plugging p = 0.05 into the above formulas gives us:
$$E[Y] = \frac{1}{p} = 20 \quad \text{(we expect the 20th job to be the first to time out)}$$
$$\mathrm{Var}[Y] = \frac{1 - p}{p^2} = 380$$
### 2.2.4 Poisson pmf
The Poisson density follows from a certain set of assumptions about the occurrence of “rare” events in time
or space.
The kind of variables modelled using a Poisson density are e.g.
X = # of alpha particles emitted from a polonium bar in an 8 minute period.
Y = # of flaws on a standard size piece of manufactured product (100m coaxial cable)
Z = # of hits on a web page in a 24h period.
The Poisson probability mass function is defined as:
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, 2, 3, \ldots$$
λ is called the rate parameter.
$Po_\lambda(t) := F_X(t)$ is the Poisson distribution (function).
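We need to check that p(x) as defined above is actually a probability mass function, i.e. we need to check whether the two basic properties (see Theorem 2.1.2) are true:
• Obviously, all values of p(x) are positive for x ≥ 0.
• Do all probabilities sum to 1?
$$\sum_{k=0}^{\infty} p(k) = \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \qquad (*)$$
Now we need to remember from calculus that the exponential function has the series representation
$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}.$$
In our case this simplifies (∗) to:
$$e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} \cdot e^{\lambda} = 1.$$
p(x) is therefore a valid probability mass function. (Since all values are positive and sum to 1, none of them can exceed 1.)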
p(x) is therefore a valid probability mass function.
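Expected value and variance of $X \sim Po_\lambda$ are:
$$E[X] = \sum_{x=0}^{\infty} x \, \frac{e^{-\lambda} \lambda^x}{x!} = \ldots = \lambda, \qquad \mathrm{Var}[X] = \ldots = \lambda$$
Computing E[X] and Var[X] involves some math, but as it is not too hard, we can do the computation for E[X]:
$$\begin{aligned}
E[X] &= \sum_{x=0}^{\infty} x \, \frac{e^{-\lambda} \lambda^x}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} x \, \frac{\lambda^x}{x!} \\
&= e^{-\lambda} \sum_{x=1}^{\infty} x \, \frac{\lambda^x}{x!} = e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^x}{(x - 1)!} && \text{(for } x = 0 \text{ the expression is 0)} \\
&= e^{-\lambda} \lambda \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x - 1)!} \\
&= e^{-\lambda} \lambda \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} \lambda e^{\lambda} = \lambda && \text{(start at } x = 0 \text{ again and change summation index)}
\end{aligned}$$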
How do we choose λ in an example? - look at the expected value!
Example 2.2.4
A manufacturer of chips produces 1% defectives.
What is the probability that in a box of 100 chips no defective is found?
Let X be the number of defective chips found in the box.
So far, we would have modelled X as a Binomial variable with distribution $B_{100,0.01}$.
Then $P(X = 0) = \binom{100}{0} \, 0.99^{100} \cdot 0.01^0 = 0.366$.
On the other hand, a defective chip can be considered to be a rare event, since p is small (p = 0.01). What
else can we do?
We expect 100 · 0.01 = 1 chip out of the box to be defective. If we model X as a Poisson variable, we know that the expected value of X is λ. In this example, therefore, λ = 1.
Then $P(X = 0) = \frac{e^{-1} \, 1^0}{0!} = 0.3679$.
No big difference between the two approaches!
For larger k, however, the binomial coefficient $\binom{n}{k}$ becomes hard to compute, and it is easier to use the Poisson distribution instead of the Binomial distribution.
Poisson approximation of Binomial pmf. For large n, the Binomial distribution is approximated by the Poisson distribution, where λ is given as np:
$$\binom{n}{k} p^k (1 - p)^{n-k} \approx e^{-np} \, \frac{(np)^k}{k!}$$
Rule of thumb: use the Poisson approximation if n ≥ 20 and (at the same time) p ≤ 0.05.
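How close the two pmfs are can be seen by tabulating them side by side. The sketch below does this for the chip example (n = 100, p = 0.01, so λ = np = 1); the function names are illustrative.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k) for X ~ B_{n,p}."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability P(X = k) with rate parameter lam."""
    return exp(-lam) * lam**k / factorial(k)


n, p = 100, 0.01
lam = n * p  # the Poisson rate is chosen to match the expected value np

print(" k   binomial   Poisson")
for k in range(5):
    print(f"{k:2d}   {binom_pmf(k, n, p):.4f}     {poisson_pmf(k, lam):.4f}")
```

Python's integers have arbitrary precision, so `comb` itself is not a problem here; the approximation matters more for hand computation or when tables are only available for one of the two distributions.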
Why does the approximation work? - We will have a closer look at why the Poisson distribution
approximates the Binomial distribution. This also explains why the Poisson is defined as it is.
Example 2.2.5 Typos
Imagine you are supposed to proofread a paper. Let us assume that there are on average 2 typos on a page
and a page has 1000 words. This gives a probability of 0.002 for each word to contain a typo.
The number of typos on a page X is then a Binomial random variable, i.e. X ∼ B1000,0.002 .
Let’s have a closer look at a couple of probabilities:
• the probability for no typo on a page is P(X = 0). We know that
$$P(X = 0) = (1 - 0.002)^{1000} = 0.998^{1000}.$$
We can also write this probability as
$$P(X = 0) = \left(1 - \frac{2}{1000}\right)^{1000} \quad (= 0.13506).$$
From calculus we know that
$$\lim_{n \to \infty} \left(1 - \frac{x}{n}\right)^n = e^{-x}.$$
Therefore the probability for no typo on the page is approximately
$$P(X = 0) \approx e^{-2} \quad (= 0.13534).$$
• the probability for exactly one typo on a page is
$$P(X = 1) = \binom{1000}{1} \, 0.002 \cdot 0.998^{999} \quad (= 0.27067).$$
We can write this as
$$P(X = 1) = 1000 \cdot \frac{2}{1000} \left(1 - \frac{2}{1000}\right)^{999} \approx 2 \cdot e^{-2} \quad (= 0.27067)$$
• the probability for exactly two typos on a page is
$$P(X = 2) = \binom{1000}{2} \, 0.002^2 \cdot 0.998^{998} \quad (= 0.27094),$$
which we again rewrite as
$$P(X = 2) = \frac{1000 \cdot 999}{1000 \cdot 1000} \cdot \frac{2^2}{2} \left(1 - \frac{2}{1000}\right)^{998} \approx 2 \cdot e^{-2} \quad (= 0.27067)$$
• and a last one: the probability for exactly three typos on a page is
$$P(X = 3) = \binom{1000}{3} \, 0.002^3 \cdot 0.998^{997} \quad (= 0.18063),$$
which is
$$P(X = 3) = \frac{1000 \cdot 999 \cdot 998}{1000 \cdot 1000 \cdot 1000} \cdot \frac{2^3}{3 \cdot 2} \left(1 - \frac{2}{1000}\right)^{997} \approx \frac{2^3}{3!} \cdot e^{-2} \quad (= 0.18045)$$
### 2.2.5 Compound Discrete Probability Mass Functions
Real problems very seldom concern a single random variable. As soon as more than 1 variable is involved it
is not sufficient to think of modeling them only individually - their joint behavior is important.
How do we specify probabilities for more than one random variable at a time?
Consider the two-variable case: X, Y are two discrete variables. The joint probability mass function is defined as
$$p_{X,Y}(x, y) := P(X = x \cap Y = y)$$
Again, the individual probabilities must be between 0 and 1 and their sum must be 1.
Example 2.2.6
A box contains 5 unmarked PowerPC G4 processors of different speeds:
• 2 processors with 400 MHz
• 1 processor with 450 MHz
• 2 processors with 500 MHz
Select two processors out of the box (without replacement) and let
X = speed of the first selected processor
Y = speed of the second selected processor
For a sample space we can draw a table of all the possible combinations of processors. We will distinguish
between processors of the same speed by using the subscripts 1 or 2 .
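Sample space Ω (rows: 1st processor, columns: 2nd processor; x marks a possible combination, - an impossible one):

| 1st \ 2nd | $400_1$ | $400_2$ | 450 | $500_1$ | $500_2$ |
|---|---|---|---|---|---|
| $400_1$ | - | x | x | x | x |
| $400_2$ | x | - | x | x | x |
| 450 | x | x | - | x | x |
| $500_1$ | x | x | x | - | x |
| $500_2$ | x | x | x | x | - |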
In total we have 5 &middot; 4 = 20 possible combinations.
Since we draw at random, we assume that each of the above combinations is equally likely. This yields the
following probability mass function:
| 1st proc. \ 2nd processor | 400 | 450 | 500 (MHz) |
|---|---|---|---|
| 400 | 0.1 | 0.1 | 0.2 |
| 450 | 0.1 | 0.0 | 0.1 |
| 500 (MHz) | 0.2 | 0.1 | 0.1 |
What is the probability for X = Y ?
This might be important if we wanted to match the chips to assemble a dual processor machine:
$$P(X = Y) = p_{X,Y}(400, 400) + p_{X,Y}(450, 450) + p_{X,Y}(500, 500) = 0.1 + 0 + 0.1 = 0.2.$$
Another example: What is the probability that the first processor has higher speed than the second?
$$P(X > Y) = p_{X,Y}(450, 400) + p_{X,Y}(500, 400) + p_{X,Y}(500, 450) = 0.1 + 0.2 + 0.1 = 0.4.$$
We can go from joint probability mass functions to individual ("marginal") pmfs:
$$p_X(x) = \sum_y p_{X,Y}(x, y), \qquad p_Y(y) = \sum_x p_{X,Y}(x, y)$$
Example 2.2.7 Continued
For the previous example the marginal probability mass functions are
| x | 400 | 450 | 500 (MHz) |
|---|---|---|---|
| $p_X(x)$ | 0.4 | 0.2 | 0.4 |

| y | 400 | 450 | 500 (MHz) |
|---|---|---|---|
| $p_Y(y)$ | 0.4 | 0.2 | 0.4 |
Just as we had the notion of expected value for functions of a single random variable, there is an expected value for functions of several random variables:
$$E[h(X, Y)] := \sum_{x,y} h(x, y) \, p_{X,Y}(x, y)$$
Example 2.2.8 Continued
Let X, Y be as before.
What is E[|X − Y |] (the average speed difference)?
Here we have the situation E[|X − Y|] = E[h(X, Y)], with h(X, Y) = |X − Y|.
Using the above definition of expected value gives us:
$$\begin{aligned}
E[|X - Y|] &= \sum_{x,y} |x - y| \, p_{X,Y}(x, y) \\
&= |400 - 400| \cdot 0.1 + |400 - 450| \cdot 0.1 + |400 - 500| \cdot 0.2 \\
&\quad + |450 - 400| \cdot 0.1 + |450 - 450| \cdot 0.0 + |450 - 500| \cdot 0.1 \\
&\quad + |500 - 400| \cdot 0.2 + |500 - 450| \cdot 0.1 + |500 - 500| \cdot 0.1 \\
&= 0 + 5 + 20 + 5 + 0 + 5 + 20 + 5 + 0 = 60.
\end{aligned}$$
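The joint pmf, the marginal pmfs and the expected speed difference can all be derived by enumerating the 20 equally likely ordered draws. This is a minimal sketch; the list `speeds` encodes the five processors of the example, and all variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

# The five processors in the box (speeds in MHz).
speeds = [400, 400, 450, 500, 500]

# All ordered draws of two different processors: 5 * 4 = 20 pairs.
pairs = list(permutations(range(len(speeds)), 2))

# Joint pmf: each ordered pair is equally likely.
joint = defaultdict(Fraction)
for first, second in pairs:
    joint[(speeds[first], speeds[second])] += Fraction(1, len(pairs))

# Marginal pmfs: sum the joint pmf over the other variable.
marginal_X = defaultdict(Fraction)
marginal_Y = defaultdict(Fraction)
for (x, y), prob in joint.items():
    marginal_X[x] += prob
    marginal_Y[y] += prob

p_equal = sum(prob for (x, y), prob in joint.items() if x == y)
p_first_faster = sum(prob for (x, y), prob in joint.items() if x > y)
mean_abs_diff = sum(abs(x - y) * prob for (x, y), prob in joint.items())

print(dict(marginal_X))
print(p_equal, p_first_faster, mean_abs_diff)
```

The pair (450, 450) never occurs in the enumeration because there is only one 450 MHz processor, which is why its joint probability is 0.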
The most important cases for h(X, Y ) in this context are linear combinations of X and Y .
For two variables we can measure how “similar” their values are:
Definition 2.2.1 (Covariance)
The covariance between two random variables X and Y is defined as:
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
Note, that this definition looks very much like the definition for the variance of a single random variable. In
fact, if we set Y := X in the above definition, the Cov(X, X) = V ar(X).
Definition 2.2.2 (Correlation)
The (linear) correlation between two variables X and Y is
$$\varrho := \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \cdot \mathrm{Var}(Y)}}$$
• $\varrho$ is between -1 and 1
• if $\varrho$ = 1 or -1, Y is a linear function of X:
$\varrho = 1$ → Y = aX + b with a > 0,
$\varrho = -1$ → Y = aX + b with a < 0.
$\varrho$ is a measure of linear association between X and Y. $\varrho$ near ±1 indicates a strong linear relationship, $\varrho$ near 0 indicates lack of linear association.
Example 2.2.9 Continued
What is $\varrho$ in our box with five chips?
Check (use the marginal pmfs to compute!):
E[X] = E[Y] = 450
Var[X] = Var[Y] = 2000
The covariance between X and Y is:
$$\begin{aligned}
\mathrm{Cov}(X, Y) &= \sum_{x,y} (x - E[X])(y - E[Y]) \, p_{X,Y}(x, y) \\
&= (400 - 450)(400 - 450) \cdot 0.1 + (450 - 450)(400 - 450) \cdot 0.1 + (500 - 450)(400 - 450) \cdot 0.2 \\
&\quad + (400 - 450)(450 - 450) \cdot 0.1 + (450 - 450)(450 - 450) \cdot 0.0 + (500 - 450)(450 - 450) \cdot 0.1 \\
&\quad + (400 - 450)(500 - 450) \cdot 0.2 + (450 - 450)(500 - 450) \cdot 0.1 + (500 - 450)(500 - 450) \cdot 0.1 \\
&= 250 + 0 - 500 + 0 + 0 + 0 - 500 + 0 + 250 = -500.
\end{aligned}$$
$\varrho$ therefore is
$$\varrho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \, \mathrm{Var}(Y)}} = \frac{-500}{2000} = -0.25,$$
$\varrho$ indicates a weak negative (linear) association.
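Covariance and correlation are again expected values of functions h(X, Y), so they can be computed from the joint pmf table with one small helper. The sketch below hard-codes the joint pmf of the processor example; `expect` is an illustrative name.

```python
from math import sqrt

# Joint pmf p_{X,Y}(x, y) of the processor example (speeds in MHz).
joint = {
    (400, 400): 0.1, (400, 450): 0.1, (400, 500): 0.2,
    (450, 400): 0.1, (450, 450): 0.0, (450, 500): 0.1,
    (500, 400): 0.2, (500, 450): 0.1, (500, 500): 0.1,
}


def expect(h):
    """E[h(X, Y)]: sum of h(x, y) * p_{X,Y}(x, y) over all pairs."""
    return sum(h(x, y) * prob for (x, y), prob in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mean_x) ** 2)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
cov_xy = expect(lambda x, y: (x - mean_x) * (y - mean_y))
rho = cov_xy / sqrt(var_x * var_y)

print(mean_x, var_x, cov_xy, rho)
```

Floating-point numbers are used here, so the printed values may differ from the exact ones in the last digits.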
Definition 2.2.3 (Independence)
Two random variables X and Y are independent if their joint probability mass function $p_{X,Y}$ is equal to the product of the marginal pmfs, i.e. $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ for all x and y.
Note: so far, we have had a definition for the independence of two events A and B: A and B are independent if $P(A \cap B) = P(A) \cdot P(B)$.
Random variables are independent if all events of the form X = x and Y = y are independent.
Example 2.2.10 Continued
Let X and Y be defined as previously.
Are X and Y independent?
Check: $p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$ for all possible combinations of x and y.
Trick: whenever there is a zero in the joint probability mass function at a pair (x, y) whose marginal probabilities are both positive, the variables cannot be independent:
$$p_{X,Y}(450, 450) = 0 \neq 0.2 \cdot 0.2 = p_X(450) \cdot p_Y(450).$$
Therefore, X and Y are not independent!
More properties of Variance and Expected Values
Theorem 2.2.4
If two random variables X and Y are independent,
$$E[X \cdot Y] = E[X] \cdot E[Y]$$
$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$$
Theorem 2.2.5
For two random variables X and Y and three real numbers a, b, c:
$$\mathrm{Var}[aX + bY + c] = a^2 \, \mathrm{Var}[X] + b^2 \, \mathrm{Var}[Y] + 2ab \cdot \mathrm{Cov}(X, Y)$$
Note: by comparing the two results, we see that for two independent random variables X and Y , the
covariance Cov(X, Y ) = 0.
Example 2.2.11 Continued
$$E[X - Y] = E[X] - E[Y] = 450 - 450 = 0$$
$$\mathrm{Var}[X - Y] = \mathrm{Var}[X] + (-1)^2 \, \mathrm{Var}[Y] - 2 \, \mathrm{Cov}(X, Y) = 2000 + 2000 + 1000 = 5000$$
## 2.3 Continuous Random Variables
All previous considerations for discrete variables have direct counterparts for continuous variables.
So far, a lot of sums have been involved, e.g. to compute the distribution functions or expected values.
Summing over uncountably many values corresponds to an integral.
The main trick in working with continuous random variables is to substitute all sums by integrals in the
definitions.
As in the case of a discrete random variable, we define a distribution function as the probability that a
random variable has outcome t or a smaller value:
Definition 2.3.1 (probability distribution function)
Assume X is a continuous random variable:
The function FX (t) := P (X ≤ t) is called the probability distribution function of X.
The only difference to the discrete case is that the distribution function of a continuous variable is not a
stairstep function:
Properties of $F_X$. The following properties hold for the probability distribution function $F_X$ of a random variable X.
• $0 \le F_X(t) \le 1$ for all $t \in \mathbb{R}$
• $F_X$ is monotone increasing (i.e. if $x_1 \le x_2$ then $F_X(x_1) \le F_X(x_2)$).
• $\lim_{t \to -\infty} F_X(t) = 0$ and $\lim_{t \to \infty} F_X(t) = 1$.
Now, however, the situation is slightly different from the discrete case:
Definition 2.3.2 (density function)
For a continuous variable X with distribution function $F_X$ the density function of X is defined as:
$$f_X(x) := F_X'(x).$$
Note: f(x) is not a probability! f(x) may be > 1.
Theorem 2.3.3 (Properties of f (x))
A function $f_X$ is a density function of X if
(i) $f_X(x) \ge 0$ for all x,
(ii) $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$.
Relationship between $f_X$ and $F_X$. Since the density function $f_X$ is defined as the derivative of the distribution function, we can regain the distribution function from the density by integrating:
• $F_X(t) = P(X \le t) = \int_{-\infty}^{t} f_X(x) \, dx$
• $P(a \le X \le b) = \int_a^b f_X(x) \, dx$
Therefore,
$$P(X = a) = P(a \le X \le a) = \int_a^a f_X(x) \, dx = 0.$$
Example 2.3.1
Let Y be the time until the first major failure of a new disk drive.
A possible density function for Y is
$$f(y) = \begin{cases} e^{-y} & y > 0 \\ 0 & \text{otherwise} \end{cases}$$
First, we need to check that f(y) is actually a density function. Obviously, f(y) is a non-negative function on the whole of $\mathbb{R}$.
The second condition f must fulfill to be a density of Y is
$$\int_{-\infty}^{\infty} f(y) \, dy = \int_0^{\infty} e^{-y} \, dy = -e^{-y} \Big|_0^{\infty} = 0 - (-1) = 1$$
What is the probability that the first major disk drive failure occurs within the first year?
$$P(Y \le 1) = \int_0^1 e^{-y} \, dy = -e^{-y} \Big|_0^1 = 1 - e^{-1} \approx 0.63.$$
What is the distribution function of Y?
$$F_Y(t) = \int_{-\infty}^{t} f(y) \, dy = \int_0^t e^{-y} \, dy = 1 - e^{-t} \quad \text{for all } t \ge 0.$$
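The integrals in this example can be cross-checked numerically. The sketch below uses a simple midpoint rule written out by hand; the upper limit 50 stands in for infinity, and both this cutoff and the number of steps are assumptions made for the illustration.

```python
import math


def f(y):
    """Density of Y: e^(-y) for y > 0, and 0 otherwise."""
    return math.exp(-y) if y > 0 else 0.0


def F(t):
    """Distribution function of Y obtained by integrating the density."""
    return 1 - math.exp(-t) if t >= 0 else 0.0


def integrate(func, a, b, steps=100_000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


# Condition (ii): the density integrates to 1 (50 approximates infinity here).
total = integrate(f, 0.0, 50.0)

# Probability of a failure within the first year, numerically and in closed form.
p_first_year = integrate(f, 0.0, 1.0)

print(total, p_first_year, F(1.0))
```

The numerical value of P(Y ≤ 1) agrees with the closed form 1 − e⁻¹, which is what the relationship between density and distribution function predicts.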
Figure 2.1: Density and distribution function of random variable Y.
Summary:

| | discrete random variable | continuous random variable |
|---|---|---|
| image | Im(X) finite or countably infinite | Im(X) uncountable |
| probability distribution function | $F_X(t) = P(X \le t) = \sum_{k \le \lfloor t \rfloor} p_X(k)$ | $F_X(t) = P(X \le t) = \int_{-\infty}^{t} f(x) \, dx$ |
| mass / density function | probability mass function: $p_X(x) = P(X = x)$ | probability density function: $f_X(x) = F_X'(x)$ |
| expected value | $E[h(X)] = \sum_x h(x) \cdot p_X(x)$ | $E[h(X)] = \int_x h(x) \cdot f_X(x) \, dx$ |
| variance | $\mathrm{Var}[X] = E[(X - E[X])^2] = \sum_x (x - E[X])^2 \, p_X(x)$ | $\mathrm{Var}[X] = E[(X - E[X])^2] = \int_{-\infty}^{\infty} (x - E[X])^2 f_X(x) \, dx$ |

## 2.4 Some special continuous density functions
### 2.4.1 Uniform Density
One of the most basic cases of a continuous density is the uniform density. On the finite interval (a, b) each value has the same density (cf. Figure 2.2):
$$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a < x < b \\ 0 & \text{otherwise} \end{cases}$$
The distribution function $F_X$ is
$$U_{a,b}(x) := F_X(x) = \begin{cases} 0 & \text{if } x \le a \\ \frac{x - a}{b - a} & \text{if } a < x < b \\ 1 & \text{if } x \ge b. \end{cases}$$
We now know how to compute expected value and variance of a continuous random variable.
Figure 2.2: Density function of a uniform variable X on (a, b).
Assume X has a uniform distribution on (a, b). Then
$$E[X] = \int_a^b x \, \frac{1}{b - a} \, dx = \frac{1}{b - a} \cdot \frac{1}{2} x^2 \Big|_a^b = \frac{b^2 - a^2}{2(b - a)} = \frac{1}{2}(a + b).$$
$$\mathrm{Var}[X] = \int_a^b \left(x - \frac{a + b}{2}\right)^2 \frac{1}{b - a} \, dx = \ldots = \frac{(b - a)^2}{12}.$$
Example 2.4.1
The (pseudo) random number generator on my calculator is supposed to create realizations of U(0, 1) random variables.
Define U as the next random number the calculator produces.
What is the probability that the next number is higher than 0.85?
For that, we want to compute P(U ≥ 0.85). We know the density function of U: $f_U(u) = \frac{1}{1 - 0} = 1$ for 0 < u < 1. Therefore
$$P(U \ge 0.85) = \int_{0.85}^{1} 1 \, du = 1 - 0.85 = 0.15.$$
2.4.2 Exponential distribution
This density is commonly used to model waiting times between occurrences of “rare” events and lifetimes of
electrical or mechanical devices.
Definition 2.4.1 (Exponential density)
A random variable X has exponential density (cf. figure 2.3), if
\[
f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]
λ is called the rate parameter.
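Mean, variance and distribution function are easy to compute. They are:
\[
E[X] = \frac{1}{\lambda}, \qquad Var[X] = \frac{1}{\lambda^2}, \qquad
Exp_\lambda(t) = F_X(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 - e^{-\lambda t} & \text{if } t \ge 0 \end{cases}
\]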
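As a sketch of why these three formulas hold: integration by parts gives the mean and the second moment, and the variance then follows from the identity $Var[X] = E[X^2] - (E[X])^2$, which is assumed here as a known property of the variance.
\[
\begin{aligned}
E[X]   &= \int_0^\infty x \lambda e^{-\lambda x}\,dx = \big[-x e^{-\lambda x}\big]_0^\infty + \int_0^\infty e^{-\lambda x}\,dx = \frac{1}{\lambda}, \\
E[X^2] &= \int_0^\infty x^2 \lambda e^{-\lambda x}\,dx = \big[-x^2 e^{-\lambda x}\big]_0^\infty + 2\int_0^\infty x e^{-\lambda x}\,dx = \frac{2}{\lambda} E[X] = \frac{2}{\lambda^2}, \\
Var[X] &= E[X^2] - (E[X])^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}, \\
F_X(t) &= \int_0^t \lambda e^{-\lambda x}\,dx = 1 - e^{-\lambda t} \quad \text{for } t \ge 0.
\end{aligned}
\]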
Margin note on E[X] for the uniform density in 2.4.1: we expect X to be in the middle between a and b, which makes sense.
Figure 2.3: Density functions $f_{0.5}$, $f_1$ and $f_2$ of exponential variables for the rate parameters 0.5, 1, and 2.
The following example will accompany us throughout the remainder of this class:
Example 2.4.2 Hits on a webpage
On average there are 2 hits per minute on a specific web page.
I start to observe this web page at a certain time point 0, and decide to model the waiting time till the first
hit Y (in min) using an exponential distribution.
What is a sensible value for λ, the rate parameter?
Think: on average there are 2 hits per minute - which makes an average waiting time of 0.5 minutes between
hits.
We will use this value as the expected value for Y : E[Y ] = 0.5.
On the other hand, we know that the expected value for Y is 1/λ. So we are back at λ = 2 as a sensible
choice for the parameter!
λ describes the rate at which this web page is hit!
What is the probability that we have to wait at most 40 seconds to observe the first hit?
We know the rate at which hits come to the web page per minute, so it is advisable to express the 40 s in
minutes as well. The question then becomes:
What is the probability that we have to wait at most 2/3 min to observe the first hit?
This we can compute:
\[
P(Y \le 2/3) = Exp_\lambda(2/3) = 1 - e^{-2 \cdot 2/3} \approx 0.736
\]
How long do we have to wait at most to observe a first hit with a probability of 0.9?
This is a very different approach to what we have looked at so far!
Here, we want to find a t for which $P(Y \le t) = 0.9$:
\[
\begin{aligned}
        & P(Y \le t) = 0.9 \\
\iff\;  & 1 - e^{-2t} = 0.9 \\
\iff\;  & e^{-2t} = 0.1 \\
\iff\;  & t = -0.5 \ln 0.1 \approx 1.15 \text{ (min)}
\end{aligned}
\]
That is approximately 69 s.
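The same steps work for any rate λ and any target probability p with 0 < p < 1. In the sketch below, $t_p$ is notation introduced only for illustration; it is not used elsewhere in these notes.
\[
P(Y \le t_p) = p \iff 1 - e^{-\lambda t_p} = p \iff t_p = -\frac{\ln(1 - p)}{\lambda}.
\]
With λ = 2 and p = 0.9 this reproduces the value above: $t_{0.9} = -\frac{\ln 0.1}{2} \approx 1.15$ min.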
Memoryless property
Example 2.4.3 Hits on a web page
In the previous example I stated that we start to observe the web page at time point 0. Does the choice of
this time point affect our analysis in any way?
Let’s assume that during the first minute after we started to observe the page, there is no hit.
What is the probability that we have to wait at most another 40 seconds for the first hit? Answering this
also tells us what would have happened if we had started our observation of the web page a minute
later: would we still get the same results?
The probability we want to compute is a conditional probability. If we think back - the conditional probability
of A given B was defined as
\[
P(A|B) := \frac{P(A \cap B)}{P(B)}
\]
Now we have to identify what the events A and B are in our case. The information we have is that during
the first minute we did not observe a hit =: B, i.e. B = (Y > 1). The probability we want to know is that
we have to wait at most another 40 s for the first hit: A = the first hit occurs within 1 min and 40 s (= Y ≤ 5/3).
\[
\begin{aligned}
P(\text{first hit within 5/3 min} \mid \text{no hit during 1st min})
  &= P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(Y \le 5/3 \,\cap\, Y > 1)}{P(Y > 1)} \\
  &= \frac{P(1 < Y \le 5/3)}{1 - P(Y \le 1)} = \frac{e^{-2} - e^{-10/3}}{e^{-2}} = 0.736.
\end{aligned}
\]
That’s exactly the same probability as we had before!
The result of this example is no coincidence. We can generalize:
P (Y ≤ t + s|Y ≥ s) = 1 − e−λt = P (Y ≤ t)
This means: a random variable with an exponential distribution “forgets” about its past. This is called the
memoryless property of the exponential distribution.
An electrical or mechanical device whose lifetime we model as an exponential variable therefore “stays as
good as new” until it suddenly breaks, i.e. we assume that there’s no aging process.
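The general statement can be checked directly from the exponential distribution function $F_Y(t) = 1 - e^{-\lambda t}$. A short sketch for arbitrary $s, t \ge 0$:
\[
P(Y \le t + s \mid Y \ge s) = \frac{P(s \le Y \le t + s)}{P(Y \ge s)}
  = \frac{e^{-\lambda s} - e^{-\lambda (t+s)}}{e^{-\lambda s}}
  = 1 - e^{-\lambda t} = P(Y \le t).
\]
With λ = 2, s = 1 and t = 2/3 this is the web page computation from Example 2.4.3: $1 - e^{-4/3} \approx 0.736$.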
2.4.3 Erlang density
Example 2.4.4 Hits on a web page
Remember: we modeled waiting times until the first hit as Exp2 .
How long do we have to wait for the second hit?
In order to get the waiting time for the second hit, we can add the waiting times until the first hit and the
time between the first and the second hit.
For both of these we know the distribution: Y1 , the waiting time until the first hit is an exponential variable
with λ = 2.
After we have observed the first hit, we start the experiment again and wait for the next hit. Since the
exponential distribution is memoryless, this is as good as waiting for the first hit. We therefore can model
Y2 , the time between first and second hit, by another exponential distribution with the same rate λ = 2.
What we are interested in is Y := Y1 + Y2 .
Unfortunately, we don’t know the distribution of Y , yet.
Definition 2.4.2 (Erlang density)
If $Y_1, \ldots, Y_k$ are k independent exponential random variables with parameter λ, their sum X has an Erlang
distribution:
\[
X := \sum_{i=1}^{k} Y_i \text{ is Erlang}(k, \lambda)
\]
The Erlang density $f_{k,\lambda}$ is
\[
f(x) = \begin{cases} \lambda e^{-\lambda x} \cdot \frac{(\lambda x)^{k-1}}{(k-1)!} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}
\]
k is called the stage parameter, λ is the rate parameter.
Expected value and variance of an Erlang distributed variable X can be computed using the properties of
expected value and variance for sums of independent random variables:
\[
\begin{aligned}
E[X]   &= E\Big[\sum_{i=1}^{k} Y_i\Big] = \sum_{i=1}^{k} E[Y_i] = k \cdot \frac{1}{\lambda} \\
Var[X] &= Var\Big[\sum_{i=1}^{k} Y_i\Big] = \sum_{i=1}^{k} Var[Y_i] = k \cdot \frac{1}{\lambda^2}
\end{aligned}
\]
In order to compute the distribution function, we need another result about the relationship between $Po_\lambda$
and $Exp_\lambda$.
Theorem 2.4.3
If $X_1, X_2, X_3, \ldots$ are independent exponential random variables with parameter λ and (cf. fig. 2.4)
\[
W := \text{largest index } j \text{ such that } \sum_{i=1}^{j} X_i \le T
\]
for some fixed T > 0, then $W \sim Po_{\lambda T}$.
Figure 2.4: Occurrence times (*) on a time axis from 0 to T, separated by the waiting times X1, X2, X3, ... W = 3 in this example.
With this theorem, we can derive an expression for the Erlang distribution function. Let X be an $Erlang_{k,\lambda}$
variable:
\[
\begin{aligned}
Erlang_{k,\lambda}(x) &= P(X \le x) \\
  &= 1 - P(X > x) = 1 - P\Big(\underbrace{\sum_i Y_i > x}_{\text{less than } k \text{ hits observed}}\Big) && \text{(1st trick)} \\
  &= 1 - P(\text{a Poisson r.v. with rate } x\lambda \le k - 1) && \text{(above theorem)} \\
  &= 1 - Po_{\lambda x}(k - 1).
\end{aligned}
\]
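Written out explicitly, and assuming the usual Poisson probability mass function $e^{-\lambda x} (\lambda x)^j / j!$ for parameter $\lambda x$, the result above reads as follows (a sketch; the summation index $j$ is introduced only here):
\[
Erlang_{k,\lambda}(x) = 1 - Po_{\lambda x}(k - 1) = 1 - \sum_{j=0}^{k-1} e^{-\lambda x} \frac{(\lambda x)^j}{j!} \quad \text{for } x \ge 0.
\]
As a consistency check, for k = 1 the sum has the single term $e^{-\lambda x}$, so $Erlang_{1,\lambda}(x) = 1 - e^{-\lambda x}$, which is the exponential distribution function.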
Example 2.4.5 Hits on a web page
What is the density of the waiting time until the second hit?
We said that Y, as previously defined, is the sum of two exponential variables, each with rate λ = 2.
Y therefore has an Erlang distribution with stage parameter 2, and the density is given as
\[
f_Y(y) = f_{2,2}(y) = 4y e^{-2y} \quad \text{for } y \ge 0
\]
If we wait for the third hit, what is the probability that we have to wait more than 1 min?
Z := waiting time until the third hit has an Erlang(3,2) distribution.
\[
P(Z > 1) = 1 - Erlang_{3,2}(1) = 1 - (1 - Po_{2 \cdot 1}(3 - 1)) = Po_2(2) = 0.677
\]
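The last value can be verified without a table. Assuming the usual Poisson probability mass function with parameter 2, the distribution function at 2 is the sum of the first three probabilities:
\[
Po_2(2) = \sum_{j=0}^{2} e^{-2} \frac{2^j}{j!} = e^{-2} \Big(1 + 2 + \frac{4}{2}\Big) = 5 e^{-2} \approx 0.677.
\]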
Note:
The exponential distribution is a special case of an Erlang distribution:
Expλ = Erlang(k=1,λ)
Erlang distributions are used to model waiting times of components that are exposed to peak stresses. It is
assumed that they can withstand k − 1 peaks and fail with the kth peak.
We will come across the Erlang distribution again, when modelling the waiting times in queueing systems,
where customers arrive with a Poisson rate and need exponential time to be served.
2.4.4 Gaussian or Normal density
The normal density is the archetypical “bell-shaped” density. The density has two parameters, μ and σ²,
and is defined as
\[
f_{\mu,\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]
The expected value and variance of a normally distributed r.v. X are:
\[
\begin{aligned}
E[X]   &= \int_{-\infty}^{\infty} x f_{\mu,\sigma^2}(x)\,dx = \ldots = \mu \\
Var[X] &= \int_{-\infty}^{\infty} (x - \mu)^2 f_{\mu,\sigma^2}(x)\,dx = \ldots = \sigma^2.
\end{aligned}
\]
Note: the parameters μ and σ² are actually mean and variance of X - and that’s what they are called.
Figure 2.5: Normal densities for several parameters (one panel shows $f_{0,0.5}$, $f_{0,1}$, $f_{0,2}$; the other shows $f_{-1,1}$, $f_{0,1}$, $f_{2,1}$). μ determines the location of the peak on the x-axis,
σ² determines the “width” of the bell.
The distribution function of X is
\[
N_{\mu,\sigma^2}(t) := F_{\mu,\sigma^2}(t) = \int_{-\infty}^{t} f_{\mu,\sigma^2}(x)\,dx
\]
Unfortunately, there does not exist a closed form for this integral: $f_{\mu,\sigma^2}$ does not have a simple antiderivative. However, to get probabilities we need to evaluate this integral. This leaves us with several
choices:
1. personal numerical integration
2. use of statistical software (later)
3. standard tables of normal probabilities
We will mainly use the third option.
First of all: only one special case of the normal distributions is tabled, namely N(0, 1), and only for positive
arguments. N(0, 1) is the normal distribution that has mean 0 and a variance of 1. This is the so-called standard
normal distribution; its distribution function is written as Φ.
A table for this distribution is enough, though. We will use several tricks to get any normal distribution into
the shape of a standard normal distribution:
Basic facts about the normal distribution that allow the use of tables
(i) for $X \sim N(\mu, \sigma^2)$ holds:
\[
Z := \frac{X - \mu}{\sigma} \sim N(0, 1)
\]
This process is called standardizing X.
This is at least plausible, since
\[
E[Z] = \frac{1}{\sigma}(E[X] - \mu) = 0, \qquad Var[Z] = \frac{1}{\sigma^2} Var[X] = 1.
\]
(ii) $\Phi(-z) = 1 - \Phi(z)$, since $f_{0,1}$ is symmetric about 0 (see fig. 2.6 for an explanation).
Figure 2.6: Standard normal density $f_{0,1}$ with the tail areas P(Z ≤ −z) and P(Z ≥ +z) marked. Remember, the area below the graph up to a specified vertical line
represents the probability that the random variable Z is less than this value. It’s easy to see that the areas
in the tails are equal: P(Z ≤ −z) = P(Z ≥ +z). And we already know that P(Z ≥ +z) = 1 − P(Z ≤ z),
which proves the above statement.
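Taken together, facts (i) and (ii) reduce any normal probability to look-ups of Φ at non-negative arguments. A sketch for an interval, where the bounds $a < b$ are symbols introduced only for this illustration:
\[
P(a < X \le b) = P\Big(\frac{a - \mu}{\sigma} < Z \le \frac{b - \mu}{\sigma}\Big) = \Phi\Big(\frac{b - \mu}{\sigma}\Big) - \Phi\Big(\frac{a - \mu}{\sigma}\Big) \quad \text{for } X \sim N(\mu, \sigma^2).
\]
If one of the standardized bounds is negative, (ii) turns it into a positive look-up. For instance, for a standard normal Z, using the table value Φ(1) = 0.8413:
\[
P(-1 < Z \le 1) = \Phi(1) - \Phi(-1) = \Phi(1) - (1 - \Phi(1)) = 2 \cdot 0.8413 - 1 = 0.6826.
\]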
Example 2.4.6
Suppose Z is a standard normal random variable.
1. P(Z < 1) = ?
   P(Z < 1) = Φ(1) = 0.8413 (straight look-up).
2. P(0 < Z < 1) = ?
   P(0 < Z < 1) = P(Z < 1) − P(Z < 0) = Φ(1) − Φ(0) = 0.8413 − 0.5 = 0.3413 (look-up).
3. P(Z < −2.31) = ?
   P(Z < −2.31) = 1 − Φ(2.31) = 1 − 0.9896 = 0.0104 (look-up).
4. P(|Z| > 2) = ?
   P(|Z| > 2) = P(Z < −2) + P(Z > 2) = 2(1 − Φ(2)) = 2(1 − 0.9772) = 0.0456 (look-up).
[Figure: four panels (1) to (4), one per probability above, each showing the standard normal density $f_{0,1}$.]
Example 2.4.7
Suppose X ∼ N(1, 2).
P(1 < X < 2) = ?
A standardization of X gives $Z := \frac{X - 1}{\sqrt{2}}$.
\[
\begin{aligned}
P(1 < X < 2) &= P\Big(\frac{1 - 1}{\sqrt{2}} < \frac{X - 1}{\sqrt{2}} < \frac{2 - 1}{\sqrt{2}}\Big) \\
             &= P(0 < Z < 0.5\sqrt{2}) = \Phi(0.71) - \Phi(0) = 0.7611 - 0.5 = 0.2611.
\end{aligned}
\]
Note that the standard normal table only shows probabilities for z < 3.99. This is all we need, though, since
P(Z ≥ 4) ≤ 0.0001.
Example 2.4.8
Suppose the battery life of a laptop is normally distributed with σ = 20 min.
Engineering design requires, that only 1% of batteries fail to last 300 min.
What mean battery life is required to ensure this condition?
Let X denote the battery life in minutes; then X has a normal distribution with unknown mean μ and
standard deviation σ = 20 min.
What is μ?
The condition that only 1% of batteries are allowed to fail the 300 min limit translates to:
\[
P(X < 300) \le 0.01
\]
We must make sure to choose μ such that this condition holds.
In order to compute the probability, we must standardize X:
\[
Z := \frac{X - \mu}{20}
\]
Then
\[
P(X \le 300) = P\Big(\frac{X - \mu}{20} \le \frac{300 - \mu}{20}\Big) = P\Big(Z \le \frac{300 - \mu}{20}\Big) = \Phi\Big(\frac{300 - \mu}{20}\Big)
\]
The condition requires:
\[
\begin{aligned}
        & P(X \le 300) \le 0.01 \\
\iff\;  & \Phi\Big(\frac{300 - \mu}{20}\Big) \le 0.01 = 1 - 0.99 = 1 - \Phi(2.33) = \Phi(-2.33) \\
\iff\;  & \frac{300 - \mu}{20} \le -2.33 \\
\iff\;  & \mu \ge 346.6.
\end{aligned}
\]
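The same argument gives a general recipe. In the sketch below, the limit $x_0$, the allowed failure fraction α and the table value $z$ with $\Phi(z) = 1 - \alpha$ are symbols introduced only for illustration; the step from the second to the third statement uses that Φ is increasing.
\[
P(X \le x_0) \le \alpha \iff \Phi\Big(\frac{x_0 - \mu}{\sigma}\Big) \le \Phi(-z) \iff \frac{x_0 - \mu}{\sigma} \le -z \iff \mu \ge x_0 + z\sigma.
\]
For the battery example, $x_0 = 300$, $\sigma = 20$ and $z = 2.33$ give $\mu \ge 300 + 2.33 \cdot 20 = 346.6$.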
Normal distributions have a “reproductive property”, i.e. if X and Y are (jointly) normal variables, for example
independent ones, then W := aX + bY is also a normal variable, with:
\[
\begin{aligned}
E[W]   &= aE[X] + bE[Y] \\
Var[W] &= a^2 Var[X] + b^2 Var[Y] + 2ab\,Cov(X, Y)
\end{aligned}
\]
The normal distribution is extremely common and useful for one reason: it approximates
a lot of other distributions. This is the result of one of the most fundamental theorems in Math:
2.5 Central Limit Theorem (CLT)
Theorem 2.5.1 (Central Limit Theorem)
If $X_1, X_2, \ldots, X_n$ are n independent, identically distributed random variables with $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$, then for large n:
the sample mean $\bar{X} := \frac{1}{n} \sum_{i=1}^{n} X_i$ is approximately normally distributed
with $E[\bar{X}] = \mu$ and $Var[\bar{X}] = \sigma^2/n$,
i.e. $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$ or $\sum_i X_i \sim N(n\mu, n\sigma^2)$ (both approximately).
Corollary 2.5.2
(a) for large n the binomial distribution $B_{n,p}$ is approximately normal $N_{np,\,np(1-p)}$.
(b) for large λ the Poisson distribution $Po_\lambda$ is approximately normal $N_{\lambda,\lambda}$.
(c) for large k the Erlang distribution $Erlang_{k,\lambda}$ is approximately normal $N_{k/\lambda,\,k/\lambda^2}$.
Why?
(a) Let X be a variable with a Bn,p distribution.
We know that X is the result of repeating the same Bernoulli experiment n times and looking at
the overall number of successes. We can therefore write X as the sum of n $B_{1,p}$ variables $X_i$:
\[
X := X_1 + X_2 + \ldots + X_n
\]
X is then the sum of n independent, identically distributed random variables. Then the Central
Limit Theorem states that X has an approximate normal distribution with $E[X] = nE[X_i] = np$ and
$Var[X] = nVar[X_i] = np(1 - p)$.
(b) it is enough to show the statement for the case that λ is a large integer:
Let Y be a Poisson variable with rate λ. Then we can think of Y as the number of occurrences in
an experiment that runs for time λ - that is the same as to observe λ experiments that each run
independently for time 1 and add their results:
\[
Y = Y_1 + Y_2 + \ldots + Y_\lambda, \quad \text{with } Y_i \sim Po_1.
\]
Again, Y is the sum of λ independent, identically distributed random variables. Then the Central
Limit Theorem states that Y has an approximate normal distribution with $E[Y] = \lambda \cdot 1 = \lambda$ and $Var[Y] = \lambda Var[Y_i] = \lambda$.
(c) this statement is the easiest to prove, since an $Erlang_{k,\lambda}$ distributed variable Z is by definition the sum
of k independent exponential variables $Z_1, \ldots, Z_k$.
For Z the CLT holds, and we get that Z is approximately normally distributed with $E[Z] = kE[Z_i] = \frac{k}{\lambda}$
and $Var[Z] = kVar[Z_i] = \frac{k}{\lambda^2}$.
Why do we need the central limit theorem at all? First of all, the CLT gives us the distribution of the
sample mean in a very general setting: the only thing we need to know is that all the observed values come
independently from the same distribution, and that the variance of this distribution is not infinite.
A second reason is that most tables only contain the probabilities up to a certain limit - the Poisson table
e.g. only has values for λ ≤ 10, the Binomial distribution is tabled only for n ≤ 20. After that, we can use
the Normal approximation to get probabilities.
Example 2.5.1 Hits on a webpage
Hits occur with a rate of 2 per min.
What is the probability to wait for more than 20 min for the 50th hit?
Let Y be the waiting time until the 50th hit.
We know: Y has an $Erlang_{50,2}$ distribution. Therefore:
\[
\begin{aligned}
P(Y > 20) &= 1 - Erlang_{50,2}(20) = 1 - (1 - Po_{2 \cdot 20}(50 - 1)) = Po_{40}(49) \\
          &\approx N_{40,40}(49) && \text{(CLT!)} \\
          &= \Phi\Big(\frac{49 - 40}{\sqrt{40}}\Big) = \Phi(1.42) = 0.9222 && \text{(table)}
\end{aligned}
\]
Example 2.5.2 Mean of Uniform Variables
Let $U_1, U_2, U_3, U_4$, and $U_5$ be independent standard uniform variables, i.e. $U_i \sim U_{(0,1)}$.
Without the CLT we would have no idea what distribution the sample mean $\bar{U} = \frac{1}{5} \sum_{i=1}^{5} U_i$ has.
With it, we know: $\bar{U} \overset{approx}{\sim} N(0.5, \frac{1}{60})$.
Issue:
Accuracy of approximation
• increases with n
• increases with the amount of symmetry in the distribution of $X_i$
Rule of thumb for the Binomial distribution:
Use the normal approximation for $B_{n,p}$ if np > 5 (if p ≤ 0.5) or nq > 5 (if p ≥ 0.5), where q = 1 − p!
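For Example 2.5.2, the two parameters of the normal approximation follow from the uniform formulas in 2.4.1 with a = 0 and b = 1, combined with the CLT for n = 5. A short worked sketch:
\[
E[U_i] = \frac{0 + 1}{2} = 0.5, \qquad Var[U_i] = \frac{(1 - 0)^2}{12} = \frac{1}{12}, \qquad
E[\bar{U}] = 0.5, \qquad Var[\bar{U}] = \frac{Var[U_i]}{n} = \frac{1/12}{5} = \frac{1}{60}.
\]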