Lecture Notes Stat 330
Spring 2003
Heike Hofmann
May 2, 2003
Contents

1 Introduction
  1.1 Basic Probability
      1.1.1 Examples for sample spaces
      1.1.2 Examples for events
  1.2 Basic Notation of Sets
  1.3 Kolmogorov's Axioms
  1.4 Counting Methods
      1.4.1 "Multiplication Principle"
      1.4.2 Ordered Samples with Replacement
      1.4.3 Ordered Samples without Replacement
      1.4.4 Unordered Samples without Replacement
  1.5 Conditional Probabilities
  1.6 Independence of Events
  1.7 Bayes' Rule
  1.8 Bernoulli Experiments

2 Random Variables
  2.1 Discrete Random Variables
      2.1.1 Expectation and Variance
      2.1.2 Some Properties of Expectation and Variance
      2.1.3 Probability Distribution Function
  2.2 Special Discrete Probability Mass Functions
      2.2.1 Bernoulli pmf
      2.2.2 Binomial pmf
      2.2.3 Geometric pmf
      2.2.4 Poisson pmf
      2.2.5 Compound Discrete Probability Mass Functions
  2.3 Continuous Random Variables
  2.4 Some special continuous density functions
      2.4.1 Uniform Density
      2.4.2 Exponential distribution
      2.4.3 Erlang density
      2.4.4 Gaussian or Normal density
  2.5 Central Limit Theorem (CLT)

3 Elementary Simulation
  3.1 Random Number Generators
      3.1.1 A general method for discrete data
      3.1.2 A general method for continuous densities
            3.1.2.1 Simulating Binomial & Geometric distributions
            3.1.2.2 Simulating a Poisson distribution
            3.1.2.3 Simulating a Normal distribution
  3.2 Basic Problem

4 Stochastic Processes
  4.1 Poisson Process
  4.2 Birth & Death Processes

5 Queuing Systems
  5.1 Little's Law
  5.2 The M/M/1 Queue
  5.3 The M/M/1/K Queue
  5.4 The M/M/c Queue

6 Statistical Inference
  6.1 Parameter Estimation
      6.1.1 Maximum Likelihood Estimation
  6.2 Confidence Intervals
      6.2.1 Large sample C.I. for µ
      6.2.2 Large sample confidence intervals for a proportion p
            6.2.2.1 Conservative Method
            6.2.2.2 Substitution Method
      6.2.3 Related C.I. Methods
  6.3 Hypothesis Testing
  6.4 Regression
      6.4.1 Simple Linear Regression (SLR)
            6.4.1.1 The sample correlation r
            6.4.1.2 Coefficient of determination R²
      6.4.2 Simple Linear Regression Model
Chapter 1
Introduction
Motivation
In every field of human life there are processes that cannot be described exactly (by an algorithm). For example: how fast does a web page respond? When does the bus come? How many cars are in the parking lot at 8:55 am?
By observing these processes or by experimenting we can detect patterns of behavior, such as: "usually, in the first week of the semester the campus network is slow", or "by 8:50 am the parking lot at the Design building is usually full".
Our goal is to analyze these patterns further:
1.1 Basic Probability

Real World                                              Mathematical World
observation, experiment with unknown outcome            random experiment
list of all possible outcomes                           sample space Ω (read: Omega)
individual outcome                                      elementary event A, A ∈ Ω (read: A is an element of Ω)
a collection of individual outcomes                     event A, A ⊂ Ω (read: A is a subset of Ω)
assignment of the likelihood or chance of an outcome    probability of an event A, P(A)

1.1.1 Examples for sample spaces
1. I attempt to sign on to AOL from my home - to do so successfully the local phone number must be
working and AOL’s network must be working.
Ω = { (phone up, network up), (phone up, network down), (phone down, network up), (phone down,
network down) }
2. Online I attempt to access a web page and record the time required to receive and display it (in
seconds).
Ω = (0, ∞) seconds
3. On a network there are two possible routes a message can take to a destination. In order for a message
to get to the recipient, one of the routes and the recipient's computer must be up.
Ω1 in tabular form:

route 1   route 2   recipient's computer
up        up        up
up        up        down
up        down      up
up        down      down
down      up        up
down      up        down
down      down      up
down      down      down
or, alternatively: Ω2 = { successful transmission, no transmission }
Summary 1.1.1
• Sample spaces can be finite, countably infinite, or uncountably infinite.
• There is no such thing as THE sample space for a problem. The complexity of Ω can vary; many sample spaces are possible for a given problem.
1.1.2 Examples for events
With the same examples as before, we can define events in the following way:
1. A = fail to log on
B = AOL network down
then A is a subset of Ω and can be written as a set of elementary events:
A = { (phone up, network down), (phone down, network up), (phone down, network down)}
Similarly:
B = {(phone up, network down), (phone down, network down)}
2. C = at least 10 s are required, C = [10, ∞).
3. D = message gets through
D with the first sample space: D = {(U,U,U), (U,D,U), (D,U,U)}
Once we begin to talk about events in terms of sets, we need to know the standard notation and basic rules
for computation:
1.2 Basic Notation of Sets
For the definitions throughout this section assume that A and B are two events.
Definition 1.2.1 (Union)
A ∪ B is the event consisting of all outcomes in A or in B or in both.
Read: A or B.
Example 1.2.1
2. Time required to retrieve and display a particular web page. Let A, B, C and D be events: A = [100, 200), B = [150, ∞), C = [200, ∞) and D = [50, 75].
Then A ∪ B = [100, ∞), A ∪ C = [100, ∞), and A ∪ D = [50, 75] ∪ [100, 200).
Definition 1.2.2 (Intersection)
A ∩ B is the event consisting of all outcomes simultaneously in A and in B.
Read: A and B.
Example 1.2.2
2. Let A, B, C and D be defined as above. Then

A ∩ B = [100, 200) ∩ [150, ∞) = [150, 200)
A ∩ D = [100, 200) ∩ [50, 75] = ∅
3. Let A be the event “fail to log on” and B = “network down”.
Then
A ∩ B = {(phone up, network down), (phone down, network down)} = B
B is a subset of A.
Definition 1.2.3 (Empty Set)
∅ is the set with no outcomes.
Definition 1.2.4 (Complement)
Ā is the event consisting of all outcomes not in A.
read: not A
Example 1.2.3
3. Message example
Let D be the event that a message gets through.
D̄ = {(U,U,D), (U,D,D), (D,U,D), (D,D,U), (D,D,D)}.
Definition 1.2.5 (disjoint sets)
Two events A and B are called mutually exclusive or disjoint if their intersection is empty:

A ∩ B = ∅
1.3 Kolmogorov's Axioms
Example:
3. From my experience with the network provider, I may decide that the chance that my next message gets through is 90%.
Write: P(D) = 0.9.
To be able to work with probabilities properly - to compute with them - one must lay down a set of postulates:
Kolmogorov's Axioms A system of probabilities (a probability model) is an assignment of numbers P(A) to events A ⊂ Ω in such a manner that

(i) 0 ≤ P(A) ≤ 1 for all A
    (the probability of any event A is a real number between 0 and 1),

(ii) P(Ω) = 1
    (the probability of the whole sample space is 1),

(iii) if A1, A2, ... are (possibly infinitely many) disjoint events (i.e. Ai ∩ Aj = ∅ for all i ≠ j), then

    P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ... = Σi P(Ai)

    (the probability of a disjoint union of events is equal to the sum of the individual probabilities).
These are the basic rules of operation of a probability model:
• every valid model must obey them,
• any system that obeys them is a valid model.
Whether or not a particular model is realistic or appropriate for a specific application is another question.
Example 1.3.1
Draw a single card from a standard deck of playing cards
Ω = { red, black }
Two different, equally valid probability models are:
Model 1:  P(Ω) = 1,  P(red) = 0.5,  P(black) = 0.5
Model 2:  P(Ω) = 1,  P(red) = 0.3,  P(black) = 0.7
Mathematically, both schemes are equally valid.
Beginning from the axioms of probability one can prove a number of useful theorems about how a probability
model must operate.
We start with the probability of Ω and derive others from that.
Theorem 1.3.1
Let A be an event in Ω, then
P (Ā) = 1 − P (A) for all A ⊂ Ω.
For the proof we need to consider three main facts and piece them together appropriately:
1. We know that P (Ω) = 1 because of axiom (ii)
2. Ω can be written as Ω = A ∪ Ā because of the definition of an event’s complement.
3. A and Ā are disjoint and therefore the probability of their union equals the sum of the individual
probabilities (axiom iii).
All together:

1 = P(Ω) = P(A ∪ Ā) = P(A) + P(Ā),

where the first equality holds by (1), the second by (2), and the third by (3).
This yields the statement. □
Example 1.3.2
3. If I believe that the probability that a message gets through is 0.9, I must also believe that it fails with probability 0.1.
Corollary 1.3.2
The probability of the empty set P (∅) is zero.
For a proof of the above statement we exploit that the empty set is the complement of Ω. Then we can apply Theorem 1.3.1:

P(∅) = P(Ω̄) = 1 − P(Ω) = 1 − 1 = 0.   (by Thm 1.3.1) □
Theorem 1.3.3 (Addition Rule of Probability)
Let A and B be two events of Ω, then:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
To see why this makes sense, think of probability as the area in the Venn diagram: By simply adding P (A)
and P (B), P (A ∩ B) gets counted twice and must be subtracted off to get P (A ∪ B).
Example 1.3.3
1. AOL dial-up:
If I judge:
P ( phone up ) = 0.9
P ( network up ) = 0.6
P ( phone up, network up ) = 0.55
then
P ( phone up or network up) = 0.9 + 0.6 − 0.55 = 0.95
In a 2 × 2 table (entries are probabilities, margins are row and column totals):

              network up   network down
phone up         .55           .35        .90
phone down       .05           .05        .10
                 .60           .40       1.00
Example 1.3.4
A box contains 4 chips, 1 of them is defective.
A person draws one chip at random.
What is a suitable probability that the person draws the defective chip?
Common sense tells us that, since one out of the four chips is defective, the person has a 25% chance of drawing the defective chip.
Just for training, we will write this down in terms of probability theory:
One possible sample space Ω is Ω = {g1, g2, g3, d} (i.e. we distinguish the good chips, which may seem a bit artificial; it will become obvious later on why this is a good idea anyway).
The event of drawing the defective chip is then A = {d}.
We can write the probability of drawing the defective chip by comparing the sizes of A and Ω:

P(A) = |A| / |Ω| = |{d}| / |{g1, g2, g3, d}| = 1/4 = 0.25.
Be careful, though! The above method to compute probabilities is only valid in a special case:
Theorem 1.3.4
If all elementary events in a sample space are equally likely (i.e. P({ωi}) = const for all ωi ∈ Ω), the probability of an event A is given by

P(A) = |A| / |Ω|,

where |A| denotes the number of elements in A.
Example 1.3.5 continued
The person now draws two chips. What is the probability that the defective chip is among them?
We need to set up a new sample space containing all possibilities for drawing two chips:

Ω = {{g1, g2}, {g1, g3}, {g1, d}, {g2, g3}, {g2, d}, {g3, d}}
E = "defective chip is among the two chips drawn" = {{g1, d}, {g2, d}, {g3, d}}.

Then

P(E) = |E| / |Ω| = 3/6 = 0.5.
Finding P (E) involves counting the number of outcomes in E. Counting by hand is sometimes not feasible
if Ω is large.
Therefore, we need some standard counting methods.
1.4 Counting Methods

1.4.1 "Multiplication Principle"

If a complex action can be broken down into a series of k components, and these components can be performed in n1, n2, ..., nk ways respectively, then the complex action can be performed in n1 · n2 · ... · nk different ways.
Example 1.4.1
Tossing a coin, then tossing a die, results in 2 · 6 = 12 possible outcomes of the experiment:

          die:   1    2    3    4    5    6
coin  H:        H1   H2   H3   H4   H5   H6
      T:        T1   T2   T3   T4   T5   T6

1.4.2 Ordered Samples with Replacement
Just to make sure we know what we are talking about, here are the definitions that explain this section's title. (Sounds simple, but has enormous impact! Not to be underestimated!)
Definition 1.4.1 (ordered sample)
If r objects are selected from a set of n objects, and if the order of selection is noted, the selected set of r
objects is called an ordered sample.
Definition 1.4.2 (Sampling w/wo replacement)
Sampling with replacement occurs when an object is selected and then replaced before the next object is selected.
Sampling without replacement occurs when an object is not replaced after it has been selected.
Situation:
Imagine a box with n balls in it numbered from 1 to n.
We are interested in the number of ways to sequentially select k balls from the box
when the same ball can be drawn repeatedly (with replacement).
This is our first application of the multiplication principle: instead of looking at the complex action, we break it down into the k single draws. For each draw, we have n different possibilities to draw a ball.
The complex action can therefore be done in n · n · ... · n (k times) = n^k different ways.
The sample space Ω can be written as:

Ω = {(x1, x2, ..., xk) | xi ∈ {1, ..., n}} = {x1 x2 ... xk | xi ∈ {1, ..., n}}

We already know that |Ω| = n^k.
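This count is easy to verify by brute force. Below is a minimal Python sketch (not part of the original notes; the values n = 4, k = 2 are arbitrary choices) that enumerates all ordered samples with replacement:

    from itertools import product

    n, k = 4, 2                               # a box with n balls, draw k times with replacement
    balls = range(1, n + 1)                   # balls numbered 1..n
    samples = list(product(balls, repeat=k))  # all ordered samples
    print(len(samples), n**k)                 # both print 16
    assert len(samples) == n**k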
Example 1.4.2
(a) How many valid five-digit octal numbers (with leading zeros) exist?
In a valid octal number each digit needs to be between 0 and 7. We therefore have 8 choices for each digit, yielding 8^5 different five-digit octal numbers.
(b) What is the probability that a randomly chosen five-digit number is a valid octal number?
One possible sample space for this experiment would be

Ω = {x1 x2 ... x5 | xi ∈ {0, ..., 9}},

yielding |Ω| = 10^5.
Since all numbers in Ω are equally likely, we can apply Thm 1.3.4 and get for the sought probability:

P("randomly chosen five-digit number is a valid octal number") = 8^5 / 10^5 ≈ 0.328.
Example 1.4.3 Pick 3
Pick 3 is a game played daily at the State Lottery. The rules are as follows:
Choose three digits between 0 and 9 and order them.
To win, the numbers must be drawn in the exact order you've chosen.
Clearly, the number of different ways to choose numbers in this way is 10 · 10 · 10 = 1000.
The odds (= probability) of winning: 1/1000.
1.4.3 Ordered Samples without Replacement
Situation:
Same box as before.
We are interested in the number of ways to sequentially draw k balls from the box
when each ball can be drawn only once (without replacement).
Again, we break up the complex action into k single draws and apply the multiplication principle:
Draw:           1st    2nd      3rd      ...   kth
# of choices:    n     n − 1    n − 2    ...   n − k + 1

Total choices:

n · (n − 1) · (n − 2) · ... · (n − k + 1) = n! / (n − k)!

The fraction n!/(n − k)! is important enough to get a name of its own:
Definition 1.4.3 (Permutation number)
P (n, k) := n!/(n − k)! is the number of permutations of n distinct objects taken k at a time.
Example 1.4.4
(a) I only remember that a friend's (4-digit) telephone number consists of the digits 3, 4, 8 and 9.
How many different numbers does that describe?
That's the situation where we take 4 objects out of a set of 4 objects and order them, i.e. P(4, 4):

P(4, 4) = 4! / (4 − 4)! = 4! / 0! = 24 / 1 = 24.
(b) In a survey, you are asked to choose your favorite three out of seven items on a pizza and rank them.
How many different results can the survey have at most? P(7, 3):

P(7, 3) = 7! / (7 − 3)! = 7 · 6 · 5 = 210.

Variation: How many different sets of "top 3" items are there? (I.e. now we do not regard the order of the favorite three items.)
Think: The value P(7, 3) is the result of a two-step action. First, we choose 3 items out of 7. Second, we order them. Therefore (multiplication principle!):

P(7, 3) = X · P(3, 3),

where P(7, 3) is the number of ways to choose 3 from 7 and order them, X is the number of ways to choose 3 out of 7 items, and P(3, 3) is the number of ways to choose 3 out of 3 and order them. So:

X = P(7, 3) / P(3, 3) = 7! / (4! · 3!) = (7 · 6 · 5) / (3 · 2 · 1) = 35.
This example leads us directly to the next section:
1.4.4 Unordered Samples without Replacement
Same box as before. We are interested in the number of ways to choose k balls (at once) out of a box with n balls.
As we've seen in the last example, this can be done in

P(n, k) / P(k, k) = n! / ((n − k)! · k!)

different ways.
Again, this number is interesting enough to get a name:
Again, this number is interesting enough to get a name:
Definition 1.4.4 (Binomial Coefficient)
For two integer numbers n, k with k ≤ n the Binomial coefficient is defined as
n
n!
:=
k
(n − k)!k!
Read: “out of n choose k” or “k out of n”.
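As an aside (not in the original notes): Python 3.8+ ships both counting functions in its math module, math.perm(n, k) = n!/(n − k)! and math.comb(n, k) = C(n, k), so the examples above can be checked directly:

    import math

    print(math.perm(4, 4))    # 24  (telephone number example)
    print(math.perm(7, 3))    # 210 (ranked pizza items)
    print(math.comb(7, 3))    # 35  (unordered "top 3" sets)
    # C(7, 3) = P(7, 3) / P(3, 3):
    assert math.comb(7, 3) == math.perm(7, 3) // math.perm(3, 3)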
Example 1.4.5 Powerball (without the Powerball)
Pick five (different) numbers out of 49; the lottery will also draw five numbers.
You've won if at least three of the numbers are right.
(a) What is the probability of having five matching numbers?
Ω, the sample space, is the set of all possible five-number sets:

Ω = {{x1, x2, x3, x4, x5} | xi ∈ {1, ..., 49}}

|Ω| = C(49, 5) = 49! / (5! · 44!) = 1906884.

The odds of a matching five are 1 : 1906884; they are about the same as those of dying from being struck by lightning.
(b) What is the probability of having exactly three matching numbers?
Answering this question is a bit tricky. But since the order of the five numbers you've chosen doesn't matter, we can assume that we picked the three right numbers first and then picked two wrong numbers.
Do you see it? That's again a complex action that we can split up into two simpler actions.
We first need to figure out how many ways there are to choose 3 numbers out of the 5 right numbers. Obviously, this can be done in C(5, 3) = 10 ways.
Second, the number of ways to choose the remaining 2 numbers out of the 49 − 5 = 44 wrong numbers is C(44, 2) = 946.
In total, we have 10 · 946 = 9460 possible ways to choose exactly three right numbers, which gives a probability of 9460/1906884 ≈ 0.005.
Note: the probability of having exactly three right numbers was given as

P("3 matching numbers") = C(5, 3) · C(49 − 5, 5 − 3) / C(49, 5)

We will come across these probabilities quite a few times from now on.
(c) What is the probability of winning? (I.e. of having at least three matching numbers.)
In order to win, we need to have exactly 3, 4 or 5 matching numbers. We already know the probabilities for exactly 3 or 5 matching numbers. What remains is the probability of exactly 4 matching numbers.
If we use the above formula and substitute the 3 by a 4, we get

P("4 matching numbers") = C(5, 4) · C(49 − 5, 5 − 4) / C(49, 5) = 5 · 44 / 1906884 ≈ 0.000115

In total the probability of winning is:

P("win") = P("3 matching numbers") + P("4 matches") + P("5 matches")
         = (9460 + 220 + 1) / 1906884 = 9681 : 1906884 ≈ 0.0051.
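Counts like these are easy to mis-multiply by hand, so here is a small Python sketch (an addition to the notes, assuming the 5-of-49 rules stated above) that reproduces the probabilities:

    from math import comb

    total = comb(49, 5)            # 1906884 possible tickets

    def p_match(k):
        # k of my 5 numbers among the 5 drawn, 5 - k among the 44 wrong numbers
        return comb(5, k) * comb(44, 5 - k) / total

    print(p_match(3))                          # ~0.00496
    print(p_match(4))                          # ~0.000115
    print(p_match(5))                          # ~5.2e-07 = 1/1906884
    print(sum(p_match(k) for k in (3, 4, 5)))  # P(win) ~0.0051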
Please note: In the previous examples we've used parentheses ( ) to indicate that the order of the elements inside matters. These constructs are called tuples.
If the order of the elements does not matter, we use { }, the usual symbol for sets.
1.5 Conditional Probabilities
Example 1.5.1
A box contains 4 computer chips, two of them defective.
Obviously, the probability of drawing a defective chip in one random draw is 2/4 = 0.5.
We now draw a chip, analyze it, and find out that it is a good one.
If we draw again, what is the probability of drawing a defective chip?
Now the probability of drawing a defective chip has changed to 2/3.
Conclusion: The probability of an event A may change if we know (before we start the experiment for A) the outcome of another event B.
We need to add another term to our mathematical description of probabilities:
Real World                                    Mathematical World
assessment of "chance" given additional,      conditional probability of one event A
partial information                           given another event B; write: P(A|B)
Definition 1.5.1 (conditional probability)
The conditional probability of event A given event B is defined as:

P(A|B) := P(A ∩ B) / P(B),   if P(B) ≠ 0.
Example 1.5.2
A lot of unmarked Pentium III chips in a box is distributed as follows:

             400 MHz   500 MHz   total
Good           480       490      970
Defective       20        10       30
total          500       500     1000
Drawing a chip at random has the following probabilities:

P(D) = 0.03        P(G) = 0.97          (check: these two must sum to 1)
P(400 MHz) = 0.5   P(500 MHz) = 0.5     (check: these two must sum to 1, too)
P(D and 400 MHz) = 20/1000 = 0.02
P(D and 500 MHz) = 10/1000 = 0.01
Suppose now that I have the partial information that the selected chip is a 400 MHz chip.
What is now the probability that it is defective?
Using the above formula, we get

P(chip is D | chip is 400 MHz) = P(chip is D and chip is 400 MHz) / P(chip is 400 MHz) = 0.02 / 0.5 = 0.04,

i.e. knowing the speed of the chip influences our probability assignment to whether the chip is defective or not.
Note: Rewriting the above definition of conditional probability gives:

P(A ∩ B) = P(B) · P(A|B),    (1.1)

i.e. knowing two out of the three probabilities gives us the third for free.
We have seen that the occurrence of an event B may change the probability of an event A. If an event B does not have any influence on the probability of A, we say that the events A and B are independent:

1.6 Independence of Events
Definition 1.6.1 (Independence of Events)
Two events A and B are called independent if

P(A ∩ B) = P(A) · P(B)

(Alternate definition: P(A|B) = P(A).)
Independence is the mathematical counterpart of the everyday notion of "unrelatedness" of two events.
Example 1.6.1 Safety system at a nuclear reactor
Suppose there are two physically separate safety systems A and B in a nuclear reactor. An "incident" can occur only when both of them fail in the event of a problem.
Suppose the probabilities for the systems to fail in a problem are:

P(A fails) = 10^−4    P(B fails) = 10^−8

The probability of an incident is then

P(incident) = P(A and B fail at the same time) = P(A fails and B fails)

Using that A and B are independent of each other, we can compute the probability of the intersection of the events that both systems fail as the product of the probabilities of the individual failures:

P(A fails and B fails) = P(A fails) · P(B fails)    (A, B independent)

Therefore the probability of an incident is:

P(incident) = P(A fails) · P(B fails) = 10^−4 · 10^−8 = 10^−12.
Comments The safety system at a nuclear reactor is an example of a "parallel system".
A parallel system consists of k components c1, ..., ck, arranged as drawn in diagram 1.1.

[Figure 1.1: Parallel system with k components; every component connects node 1 to node 2.]

The system works as long as there is at least one unbroken path between 1 and 2 (= at least one of the components still works).
Under the assumption that all components work independently of each other, it is fairly easy to compute the probability that a parallel system will fail:

P(system fails) = P(all components fail)
                = P(c1 fails ∩ c2 fails ∩ ... ∩ ck fails)
                = P(c1 fails) · P(c2 fails) · ... · P(ck fails)    (components are independent)
A similar kind of calculation can be done for a "series system". A series system, again, consists of k supposedly independent components c1, ..., ck, arranged as shown in diagram 1.2.

[Figure 1.2: Series system with k components; the components connect node 1 to node 2 in a chain.]

This time, the system only works if all of its components are working.
Therefore, we can compute the probability that a series system works as:

P(system works) = P(all components work)
                = P(c1 works ∩ c2 works ∩ ... ∩ ck works)
                = P(c1 works) · P(c2 works) · ... · P(ck works)    (components are independent)
Please note that based on the above probabilities it is easy to compute the probability that a parallel system works and that a series system fails, respectively, as:

P(parallel system works) = 1 − P(parallel system fails)    (Thm 1.3.1)
P(series system fails) = 1 − P(series system works)        (Thm 1.3.1)

The probability that a system works is sometimes called the system's reliability. Note that a parallel system is very reliable, while a series system usually is very unreliable.
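Under the independence assumption, both reliabilities reduce to one-line products. A minimal Python sketch (an addition; the component failure probabilities are made up for illustration):

    from math import prod

    fail = [0.01, 0.001, 0.0001]   # hypothetical failure probabilities of c1..c3

    p_parallel_fails = prod(fail)                # all components fail
    p_series_works = prod(1 - q for q in fail)   # all components work

    print(1 - p_parallel_fails)   # reliability of the parallel system (~0.999999999)
    print(p_series_works)         # reliability of the series system   (~0.9889)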
Warning: independence and disjointness are two very different concepts!
14
CHAPTER 1. INTRODUCTION
Disjointness: If A and B are disjoint, their intersection is empty and therefore has probability 0:

P(A ∩ B) = P(∅) = 0.

Independence: If A and B are independent events, the probability of their intersection can be computed as the product of their individual probabilities:

P(A ∩ B) = P(A) · P(B)

If neither A nor B has probability 0, the probability of the intersection will not be 0 either!
The concept of independence between events can be extended to more than two events:

Definition 1.6.2 (Mutual Independence)
A list of events A1, ..., An is called mutually independent if for any subset {i1, ..., ik} ⊂ {1, ..., n} of indices we have:

P(Ai1 ∩ Ai2 ∩ ... ∩ Aik) = P(Ai1) · P(Ai2) · ... · P(Aik).

Note: for 3 or more events, pairwise independence does not imply mutual independence.
1.7 Bayes' Rule
Example 1.7.1 Treasure Hunt
Suppose that there are three closed boxes. The first box contains two gold coins, the second box contains
one gold coin and one silver coin, and the third box contains two silver coins. Suppose that you select one
of the boxes randomly and then select one of the coins from this box.
What is the probability that the coin you selected is golden?
For a problem like this, which consists of a step-wise procedure, it is often useful to draw a tree (a flow chart) of the choices we can make in each step.
The diagram below shows the tree for the 2 steps of choosing a box first and choosing one of the coins in that box.

[Tree diagram: the root branches to boxes B1, B2, B3, each with probability 1/3. From B1 a single branch leads to "gold" (probability 1); from B2 one branch leads to "gold" and one to "silver" (probability 1/2 each); from B3 a single branch leads to "silver" (probability 1).]
The edges are marked with the probabilities with which each step is taken:
Choosing one box (at random) means that all boxes are equally likely to be chosen: P(Bi) = 1/3 for i = 1, 2, 3.
In the first box are two gold coins: a gold coin is therefore chosen from this box with probability 1.
The second box has one gold and one silver coin. A gold coin is therefore chosen with probability 0.5.
How do we piece this information together?
There are two possible paths in the tree that end in a gold coin. Each path corresponds to one event:

E1 = choose Box 1 and pick one of the two gold coins
E2 = choose Box 2 and pick the gold coin

We need the probabilities of these two events.
Think: use equation (1.1) to get P(Ei)!

P(E1) = P(choose Box 1 and pick one of the two gold coins)
      = P(choose Box 1) · P(pick one of the two gold coins | B1)
      = (1/3) · 1 = 1/3

and

P(E2) = P(choose Box 2 and pick the gold coin)
      = P(choose Box 2) · P(pick the gold coin | B2)
      = (1/3) · (1/2) = 1/6.

The probability of choosing a gold coin is the sum of P(E1) and P(E2) (since those are the only ways to get a gold coin, as we've seen in the tree diagram):

P(gold coin) = 1/3 + 1/6 = 0.5.
There are several things to learn from this example:
1. Instead of trying to tackle the whole problem, we've divided it into several smaller pieces that are more manageable (divide-and-conquer principle).
2. We identified the smaller parts by looking at the description of the problem with the help of a tree. And: if you compare the probabilities on the edges of the tree with the probabilities we used to compute the smaller pieces E1 and E2, you'll see that those correspond closely to the branches of the tree. The probability of E1 is computed as the product of all probabilities on the edges from the root to the leaf for E1.
Definition 1.7.1 (cover)
A set of k events B1, ..., Bk is called a cover of the sample space Ω if
(i) the events are pairwise disjoint, i.e.

Bi ∩ Bj = ∅   for all i ≠ j

(ii) the union of the events contains Ω:

B1 ∪ B2 ∪ ... ∪ Bk = Ω
What is a cover, then? You can think of a cover as several non-overlapping pieces which in total contain every possible case of the sample space, like the pieces of a jigsaw puzzle.
Compare with diagram 1.3.
[Figure 1.3: B1, B2, ..., Bk are a cover of Ω.]
The boxes from the last example, B1, B2, and B3, are a cover of the sample space.
Theorem 1.7.2 (Total Probability)
If the set B1, ..., Bk is a cover of the sample space Ω, we can compute the probability of an event A by (cf. fig. 1.4):

P(A) = Σ_{i=1}^{k} P(Bi) · P(A|Bi).
Note: Instead of writing P(Bi) · P(A|Bi) we could have written P(A ∩ Bi); this is the definition of conditional probability, cf. def. 1.5.1.

[Figure 1.4: The probability of event A is put together as the sum of the probabilities of the smaller pieces (theorem of total probability).]
The challenge in using this theorem is to identify what set of events to use as a cover, i.e. to identify into which parts to dissect the problem.
Very often, the cover B1, B2, ..., Bk has only two elements and looks like E, Ē.
Tree Diagram:

[Tree diagram: the root branches to B1, B2, ..., Bk; from each Bi an edge labeled P(A|Bi) leads to a leaf for A ∩ Bi.]

The probability of each node in the tree can be calculated by multiplying all probabilities on the path from the root to that node (1st rule of tree diagrams).
Summing up all the probabilities in the leaves gives P(A) (2nd rule).
(This is a formal way of doing "divide and conquer".)
Homework: Powerball - with the Powerball
Redo the above analysis under the assumption that besides the five numbers chosen from 1 to 49 you choose an additional number, again between 1 and 49, as the Powerball. The Powerball may be a number you've already chosen or a new one.
You've won if at least the Powerball is the right number or, if the Powerball is wrong, at least three out of the other five numbers match.
• Show that the events "Powerball is right" and "Powerball is wrong" form a cover of the sample space (for that, you need to define a sample space).
• Draw a tree diagram for all possible ways to win, given that the Powerball is right or wrong.
• What is the probability of winning?
Extra Problem (tricky): Seven Lamps

[Diagram: seven lamps, numbered 1 to 7, arranged in a ring.]

A system of seven lamps is given as drawn in the diagram.
Each lamp fails (independently) with probability p = 0.1.
The system works as long as no two lamps next to each other fail.
What is the probability that the system works?
Example 1.7.2 Forensic Analysis
At a crime site the police found traces of DNA (evidence DNA), which could be identified as belonging to the perpetrator. Now the search is done by looking for a DNA match.
The probability that "a man from the street" has the same DNA as the DNA from the crime site (a random match) is approximately 1 : 1,000,000.
To determine whether someone is a DNA match or not, a test is used. The test is not totally reliable: if a person is a true DNA match, the test will be positive with probability 1. If the person is not a DNA match, the test will still be positive with probability 1 : 100,000.
Assume that the police found a man with a positive test result. What is the probability that he actually is a DNA match?
First, we have to translate the above text into probability statements.
The probability of a random match is

P(match) = 1 : 1,000,000 = 10^−6.

Now, the probabilities of a positive test result:

P(test pos | match) = 1
P(test pos | no match) = 1 : 100,000 = 10^−5

The probability asked for in the question is, again, a conditional probability. We already know that the man has a positive test result. We look for the probability that he is a match. This translates to P(match | test pos.).
First, we use the definition of conditional probability to rewrite this probability:

P(match | test pos.) = P(match ∩ test pos.) / P(test pos.)

This doesn't seem to help a lot, since we still don't know a single one of those probabilities. But we apply the same trick once again to the numerator:

P(match ∩ test pos.) = P(test pos. | match) · P(match)

Now we know both of these probabilities and get

P(match ∩ test pos.) = 1 · 10^−6.

The denominator is a bit more tricky. But remember the theorem of total probability: we just need a proper cover to compute this probability.
The way this particular problem is posed, we find a suitable cover in the events "match" and "no match". Using the theorem of total probability gives us:

P(test pos.) = P(match) · P(test pos. | match) + P(no match) · P(test pos. | no match)

We have the numbers for all of these probabilities! Plugging them in gives:

P(test pos.) = 10^−6 · 1 + (1 − 10^−6) · 10^−5 ≈ 1.1 · 10^−5.

In total this gives a probability for the man with the positive test result to be a true match of slightly less than 10%:

P(match | test pos.) = 10^−6 / (1.1 · 10^−5) = 1/11.

Is that result plausible? If you look at the probability of a false positive test result and compare it with the overall probability of a true DNA match, you can see that the test is ten times more likely to give a positive result than there are true matches. This means that if 10 million people are tested, we would expect 10 people to have a true DNA match. On the other hand, the test will yield an additional 100 false positive results, which gives us a total of 110 people with positive test results.
This, by the way, is not a property limited to DNA tests; it's a property of every test where the overall percentage of positives is fairly small, e.g. tuberculosis tests, HIV tests or, in Europe, tests for mad cow disease.
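The entire computation fits in a few lines. A Python sketch (added for illustration) with the numbers from the example:

    p_match = 1e-6          # P(match): a random person matches the evidence DNA
    p_pos_match = 1.0       # P(test pos | match)
    p_pos_nomatch = 1e-5    # P(test pos | no match)

    # total probability of a positive test, using the cover {match, no match}
    p_pos = p_match * p_pos_match + (1 - p_match) * p_pos_nomatch
    # Bayes' rule
    print(p_match * p_pos_match / p_pos)    # 0.0909... = 1/11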
Theorem 1.7.3 (Bayes' Rule)
If B1, B2, ..., Bk is a cover of the sample space Ω, then

P(Bj|A) = P(Bj ∩ A) / P(A) = P(A|Bj) · P(Bj) / ( Σ_{i=1}^{k} P(A|Bi) · P(Bi) )

for all j and all events A ⊂ Ω with A ≠ ∅.
Example 1.7.3
A given lot of chips contains 2% defective chips. Each chip is tested before delivery.
However, the tester is not wholly reliable:

P("tester says chip is good" | "chip is good") = 0.95
P("tester says chip is defective" | "chip is defective") = 0.94

If the test device says the chip is defective, what is the probability that the chip actually is defective?
Write Cd for "chip is defective" and Td for "tester says it's defective". Then, by Bayes' rule with the cover Cd, C̄d:

P(Cd | Td) = P(Td|Cd) · P(Cd) / ( P(Td|Cd) · P(Cd) + P(Td|C̄d) · P(C̄d) )
           = 0.94 · 0.02 / ( 0.94 · 0.02 + (1 − P(T̄d|C̄d)) · 0.98 )
           = 0.94 · 0.02 / ( 0.94 · 0.02 + 0.05 · 0.98 ) ≈ 0.28.
1.8 Bernoulli Experiments
A random experiment with only two outcomes is called a Bernoulli experiment.
Outcomes are e.g. 0 and 1, "success" and "failure", "hit" and "miss", or "good" and "defective".
The probabilities of "success" and "failure" are called p and q, respectively. (Then p + q = 1.)
A compound experiment consisting of a sequence of n independent repetitions is called a sequence of Bernoulli experiments.
Example 1.8.1
Transmit binary digits through a communication channel with success = “digit received correctly”.
Toss a coin repeatedly, success = “head”.
Sample spaces
Let Ωi be the sample space of an experiment involving i Bernoulli experiments:

Ω1 = {0, 1}
Ω2 = {(0, 0), (0, 1), (1, 0), (1, 1)} = {00, 01, 10, 11}   (all two-digit binary numbers)
...
Ωn = {n-digit binary numbers} = {n-tuples of 0s and 1s}
Probability assignment
For Ω1 probabilities are already assigned:

P(0) = q    P(1) = p

For Ω2:

P(00) = q^2    P(01) = qp    P(10) = pq    P(11) = p^2

Generally, for Ωn:

P(s) = p^k · q^(n−k)   if s has exactly k 1s and n − k 0s.
Example 1.8.2 Very simple dartboard
We will assume that only those darts count that actually hit the dartboard.
If a player throws a dart and hits the board at random, the probability of hitting the red zone is directly proportional to the red area. Since out of the nine squares in total 8 are gray and only one is red, the probabilities are:

P(gray) = 8/9    P(red) = 1/9

A player now throws three darts, one after the other.
What are the possible sequences of red and gray hits, and what are their probabilities?
We have, again, a step-wise setup of the problem; we can therefore draw a tree:
[Tree diagram: each throw branches into r (red, probability 1/9) and g (gray, probability 8/9), giving eight possible sequences after three throws.]

sequence    probability
rrr         1/9^3
rrg         8/9^3
rgr         8/9^3
rgg         8^2/9^3
grr         8/9^3
grg         8^2/9^3
ggr         8^2/9^3
ggg         8^3/9^3
Most of the time, however, we are not interested in the exact sequence in which the darts are thrown - but
in the overall result, how many times a player hits the red area.
This leads us to the notion of a random variable.
Chapter 2
Random Variables
If the value of a numerical variable depends on the outcome of an experiment, we call the variable a random variable.

Definition 2.0.1 (Random Variable)
A function X : Ω → R is called a random variable.
(X assigns a real value to each elementary event.)
Standard notation: capital letters from the end of the alphabet.
Example 2.0.3 Very simple dartboard
In the case of three darts on a board as in the previous example, we are usually not interested in the order in which the darts have been thrown. We only want to count the number of times the red area has been hit. This count is a random variable!
More formally: we define X to be the function that assigns to a sequence of three throws the number of times that the red area is hit:

X(s) = k,   if s consists of k hits to the red area and 3 − k hits to the gray area.

X(s) is then an integer between 0 and 3 for every possible sequence.
What then is the probability that a player hits the red area exactly two times?
We are now looking for all those elementary events s of our sample space for which X(s) = 2.
Going back to the tree, we find three possibilities for s: rrg, rgr and grr. This is the subset of Ω for which X(s) = 2. Very formally, this set can be written as:

{s | X(s) = 2}
We want to know the total probability:

P({s | X(s) = 2}) = P(rrg ∪ rgr ∪ grr) = P(rrg) + P(rgr) + P(grr) = 8/9^3 + 8/9^3 + 8/9^3 ≈ 0.03.

To avoid cumbersome notation, we write X = x for the event {ω | ω ∈ Ω and X(ω) = x}.
Example 2.0.4 Communication Channel
Suppose 8 bits are sent through a communication channel. Each bit has a certain probability of being received incorrectly.
So this is a sequence of Bernoulli experiments, and we can use Ω8 as our sample space.
We are interested in the number of bits that are received incorrectly.
Use the random variable X to "count" the number of wrong bits: X assigns a value between 0 and 8 to each sequence in Ω8.
Now it's very easy to write events like:

a) no wrong bit received                         X = 0         P(X = 0)
b) at least one wrong bit received               X ≥ 1         P(X ≥ 1)
c) exactly three bits are wrong                  X = 3         P(X = 3)
d) at least 3, but not more than 6 bits wrong    3 ≤ X ≤ 6     P(3 ≤ X ≤ 6)
Definition 2.0.2 (Image of a random variable)
The image of a random variable X is defined as

Im(X) := X(Ω)

(all possible values X can take).
Depending on whether or not the image of a random variable is countable, we distinguish between discrete and continuous random variables.
Example 2.0.5
1. Put a disk drive into service, measure Y = “time till the first major failure”.
Sample space Ω = (0, ∞).
Y has uncountable image → Y is a continuous random variable.
2. Communication channel: X = “# of incorrectly received bits”
Im(X) = {0, 1, 2, 3, 4, 5, 6, 7, 8} is a finite set → X is a discrete random variable.
2.1 Discrete Random Variables
Assume X is a discrete random variable. The image of X is therefore countable and can be written as {x1, x2, x3, ...}.
Very often we are interested in probabilities of the form P(X = x). We can think of this expression as a function that yields different probabilities depending on the value of x.

Definition 2.1.1 (Probability Mass Function, PMF)
The function pX(x) := P(X = x) is called the probability mass function of X.
A probability mass function has two main properties: all values must be between 0 and 1, and the sum of all values is 1.

Theorem 2.1.2 (Properties of a pmf)
pX is the pmf of X if and only if
(i) 0 ≤ pX(x) ≤ 1 for all x ∈ {x1, x2, x3, ...}
(ii) Σi pX(xi) = 1
Note: this gives us an easy method to check whether a function is a probability mass function!
Example 2.1.1
Which of the following functions is a valid probability mass function?

1.  x       -3     -1     0      5      7
    pX(x)   0.1    0.45   0.15   0.25   0.05

2.  y       -1     0      1.5    3      4.5
    pY(y)   0.1    0.45   0.25   -0.05  0.25

3.  z       0      1      3      5      7
    pZ(z)   0.22   0.18   0.24   0.17   0.18
We need to check the two properties of a pmf for pX, pY and pZ.
1st property: are all probabilities between 0 and 1?
This eliminates pY from the list of potential probability mass functions, since pY(3) is negative.
The other two functions fulfill the property.
2nd property: is the sum of all probabilities 1?

Σi pX(xi) = 1, so pX is a valid probability mass function.
Σi pZ(zi) = 0.99 ≠ 1, so pZ is not a valid probability mass function.
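The two checks are mechanical, so they are easy to automate. A small Python sketch (an addition, representing a pmf as a dict) that validates tables 1 and 3 above:

    def is_valid_pmf(pmf, tol=1e-9):
        values = list(pmf.values())
        in_range = all(0 <= v <= 1 for v in values)   # property (i)
        sums_to_one = abs(sum(values) - 1) < tol      # property (ii)
        return in_range and sums_to_one

    p_x = {-3: 0.1, -1: 0.45, 0: 0.15, 5: 0.25, 7: 0.05}
    p_z = {0: 0.22, 1: 0.18, 3: 0.24, 5: 0.17, 7: 0.18}
    print(is_valid_pmf(p_x))    # True
    print(is_valid_pmf(p_z))    # False: the values sum to 0.99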
Example 2.1.2 Probability Mass Functions
1. Very Simple Dartboard
X, the number of times a player hits the red area with three darts, is a value between 0 and 3.
What is the probability mass function of X?
The probability mass function pX can be given as a list of all possible values:

pX(0) = P(X = 0) = P(ggg) = 8^3/9^3 ≈ 0.70
pX(1) = P(X = 1) = P(rgg) + P(grg) + P(ggr) = 3 · 8^2/9^3 ≈ 0.26
pX(2) = P(X = 2) = P(rrg) + P(rgr) + P(grr) = 3 · 8/9^3 ≈ 0.03
pX(3) = P(X = 3) = P(rrr) = 1/9^3 ≈ 0.001
2. Roll of a fair die
Let Y be the number of spots on the upturned face of a die.
Obviously, Y is a random variable with image {1, 2, 3, 4, 5, 6}.
Assuming the die is a fair die means that the probability of each side is equal. The probability mass function of Y therefore is pY(i) = 1/6 for all i in {1, 2, 3, 4, 5, 6}.
3. The diagram shows all six faces of a particular die. If Z denotes the number of spots on the upturned face after tossing this die, what is the probability mass function of Z?
Assuming that each face of the die appears with the same probability, we have one way to get a 1 or a 4, and two ways each for a 2 or a 3 to appear, which gives a probability mass function of:

x      1     2     3     4
p(x)   1/6   1/3   1/3   1/6

2.1.1 Expectation and Variance
Example 2.1.3 Game
Suppose we play a "game" where you toss a die. Let X be the number of spots; then

if X is 1, 3 or 5:  I pay you $X
if X is 2 or 4:     you pay me $2X
if X is 6:          no money changes hands.

How much money do I expect to win?
For that, we look at another function, h(x), that gives the money I win depending on the number of spots:

h(x) = −x   for x = 1, 3, 5
h(x) = 2x   for x = 2, 4
h(x) = 0    for x = 6.

Now we make a list:
In 1/6 of all tosses X will be 1, and I will gain −1 dollars.
In 1/6 of all tosses X will be 2, and I will gain 4 dollars.
In 1/6 of all tosses X will be 3, and I will gain −3 dollars.
In 1/6 of all tosses X will be 4, and I will gain 8 dollars.
In 1/6 of all tosses X will be 5, and I will gain −5 dollars.
In 1/6 of all tosses X will be 6, and I will gain 0 dollars.
In total I expect to get (1/6)·(−1) + (1/6)·4 + (1/6)·(−3) + (1/6)·8 + (1/6)·(−5) + (1/6)·0 = 3/6 = 0.5 dollars per play.
Assume that instead of a fair die, we use the die from example 3. How does that change my expected gain?
h(x) is not affected by the different die, but my expected gain changes; in total I expect to gain
(1/6)·(−1) + (1/3)·4 + (1/3)·(−3) + (1/6)·8 + 0·(−5) + 0·0 = 9/6 = 1.5 dollars per play.
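This list-and-weight computation is exactly the expectation formula defined next. A Python sketch (an addition to the notes) that reproduces both numbers:

    def h(x):
        # my winnings given x spots
        if x in (1, 3, 5):
            return -x
        if x in (2, 4):
            return 2 * x
        return 0

    fair = {x: 1 / 6 for x in range(1, 7)}
    loaded = {1: 1 / 6, 2: 1 / 3, 3: 1 / 3, 4: 1 / 6}   # die from example 3

    def expectation(pmf, f):
        return sum(f(x) * p for x, p in pmf.items())

    print(expectation(fair, h))      # 0.5
    print(expectation(loaded, h))    # 1.5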
Definition 2.1.3 (Expectation)
The expected value of a function h(X) is defined as

E[h(X)] := Σi h(xi) · pX(xi).

The most important version of this is h(x) = x:

E[X] = Σi xi · pX(xi) =: µ
Example 2.1.4 Toss of a Die
Toss a fair die, and denote by X the number of spots on the upturned face.
What is the expected value of X?
Looking at the above definition of E[X], we see that we need to know the probability mass function for the computation.
The probability mass function of X is pX(i) = 1/6 for all i ∈ {1, 2, 3, 4, 5, 6}. Therefore

E[X] = Σ_{i=1}^{6} i · pX(i) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 3.5.
A second common measure for describing a random variable is a measure of how far its values are spread out. We measure how far we expect values to be away from the expected value:

Definition 2.1.4 (Variance of a random variable)
The variance of a random variable X is defined as:

Var[X] := E[(X − E[X])^2] = Σi (xi − E[X])^2 · pX(xi)

The variance is measured in squared units of X.
σ := √Var[X] is called the standard deviation of X; its units are the original units of X.
Example 2.1.5 Toss of a Die, continued
Toss a fair die, and denote by X the number of spots on the upturned face.
What is the variance of X?
Looking at the above definition of Var[X], we see that we need to know the probability mass function and E[X] for the computation.
The probability mass function of X is pX(i) = 1/6 for all i ∈ {1, 2, 3, 4, 5, 6}; E[X] = 3.5. Therefore

Var[X] = Σ_{i=1}^{6} (i − 3.5)^2 · pX(i)
       = 6.25·(1/6) + 2.25·(1/6) + 0.25·(1/6) + 0.25·(1/6) + 2.25·(1/6) + 6.25·(1/6) = 2.917 (spots^2).

The standard deviation of X is:

σ = √Var[X] = 1.71 (spots).

2.1.2 Some Properties of Expectation and Variance
The following theorems make computations with the expected value and variance of random variables easier:

Theorem 2.1.5
For two random variables X and Y and two real numbers a, b:

E[aX + bY] = aE[X] + bE[Y].

Theorem 2.1.6
For a random variable X and a real number a:

(i) E[X^2] = Var[X] + (E[X])^2
(ii) Var[aX] = a^2 · Var[X]
Theorem 2.1.7 (Chebyshev's Inequality)
For any positive real number k and a random variable X with variance σ^2:

P(|X − E[X]| ≤ kσ) ≥ 1 − 1/k^2

2.1.3 Probability Distribution Function
Very often we are interested in the probability of a whole range of values, like P(X ≤ 5) or P(4 ≤ X ≤ 16). For that we define another function:

Definition 2.1.8 (probability distribution function)
Assume X is a discrete random variable. The function FX(t) := P(X ≤ t) is called the probability distribution function of X.

Relationship between pX and FX
Since X is a discrete random variable, the image of X can be written as {x1, x2, x3, ...}; we are therefore interested in all xi with xi ≤ t:

FX(t) = P(X ≤ t) = P({xi | xi ≤ t}) = Σ_{i: xi ≤ t} pX(xi).
Note: in contrast to the probability mass function, FX is defined on R (not only on the image of X).
Example 2.1.6 Roll a fair die
X = # of spots on the upturned face, Ω = {1, 2, 3, 4, 5, 6}, pX(1) = pX(2) = ... = pX(6) = 1/6.

FX(t) = Σ_{i ≤ t} pX(i) = Σ_{i=1}^{⌊t⌋} pX(i) = ⌊t⌋/6   for 0 ≤ t ≤ 6,

where ⌊t⌋ is the value of t truncated to an integer.

Properties of FX The following properties hold for the probability distribution function FX of a random variable X:
• 0 ≤ FX(t) ≤ 1 for all t ∈ R
• FX is monotone increasing (i.e. if x1 ≤ x2 then FX(x1) ≤ FX(x2)).
• lim_{t→−∞} FX(t) = 0 and lim_{t→∞} FX(t) = 1.
• FX(t) has a positive jump equal to pX(xi) at each xi ∈ {x1, x2, x3, ...}; FX is constant on the interval [xi, xi+1).

Whenever no confusion arises, we will omit the subscript X.
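For a discrete random variable the distribution function is just a cumulative sum of the pmf. A sketch (added here) for the fair die, compared against the closed form ⌊t⌋/6:

    from math import floor

    pmf = {i: 1 / 6 for i in range(1, 7)}   # fair die

    def F(t):
        # P(X <= t): sum pX(xi) over all xi <= t
        return sum(p for x, p in pmf.items() if x <= t)

    for t in (0.5, 1, 2.7, 6):
        print(t, F(t), floor(t) / 6)   # the two columns agree for 0 <= t <= 6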
2.2 Special Discrete Probability Mass Functions

In many theoretical and practical problems, several probability mass functions occur often enough to be worth exploring here.
2.2.1 Bernoulli pmf
Situation: Bernoulli experiment (only two outcomes: success / no success) with P(success) = p.
We define a random variable X as:

X(success) = 1    X(no success) = 0

The probability mass function pX of X is then:

pX(0) = 1 − p    pX(1) = p

This probability mass function is called the Bernoulli mass function.
The distribution function FX is then:

FX(t) =  0        for t < 0
         1 − p    for 0 ≤ t < 1
         1        for 1 ≤ t

This distribution function is called the Bernoulli distribution function.
That's a very simple probability function, and we've already seen sequences of Bernoulli experiments...
2.2.2 Binomial pmf

Situation: n sequential Bernoulli experiments with success rate p for a single trial. The single trials are independent of each other.
We are only interested in the total number of successes after n trials, therefore we define a random variable X as:

X = "number of successes in n trials"

This leads to a sample space of

Ω = {0, 1, 2, ..., n}

Now we want to derive a probability mass function for X, i.e. we want to get a general expression for pX(k) for all possible k = 0, ..., n.
pX(k) = P(X = k), i.e. we want to find the probability that in a sequence of n trials there are exactly k successes.
Think: if s is one particular sequence with k successes and n − k failures, we already know its probability:

P(s) = p^k · (1 − p)^(n−k).

Now we need to know how many possibilities there are to have k successes in n trials: think of the n trials as numbers from 1 to n. To have k successes, we need to choose a set of k of these numbers out of the n possible numbers. Do you see it? That's the binomial coefficient again.
pX(k) is therefore:

pX(k) = C(n, k) · p^k · (1 − p)^(n−k).

This probability mass function is called the Binomial mass function.
The distribution function FX is:

FX(t) = Σ_{i=0}^{⌊t⌋} C(n, i) · p^i · (1 − p)^(n−i) =: Bn,p(t)

This function is called the Binomial distribution Bn,p, where n is the number of trials and p is the probability of a success.
It is a bit cumbersome to compute values of the distribution function. Therefore, those values are tabulated with respect to n and p.
Example 2.2.1
A box contains 15 components that each have a failure rate of 2%. What is the probability that
1. exactly two out of the fifteen components are defective?
2. at most two components are broken?
3. more than three components are broken?
4. more than 1 but less than 4 are broken?
Let X be the number of broken components. Then X has a B15,0.02 distribution.
1. P(exactly two out of the fifteen components are defective) = pX(2) = C(15, 2) · 0.02^2 · 0.98^13 = 0.0323.
2. P(at most two components are broken) = P(X ≤ 2) = B15,0.02(2) = 0.9970.
3. P(more than three components are broken) = P(X > 3) = 1 − P(X ≤ 3) = 1 − 0.9998 = 0.0002.
4. P(more than 1 but less than 4 are broken) = P(1 < X < 4) = P(X ≤ 3) − P(X ≤ 1) = 0.9998 − 0.9647 = 0.0352.
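Without tables, the binomial pmf and distribution function are easy to evaluate directly. A Python sketch (an addition; it reproduces the four answers for n = 15, p = 0.02):

    from math import comb

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    def binom_cdf(t, n, p):
        # B_{n,p}(t): sum the pmf up to floor(t)
        return sum(binom_pmf(k, n, p) for k in range(int(t) + 1))

    n, p = 15, 0.02
    print(binom_pmf(2, n, p))                        # ~0.0323
    print(binom_cdf(2, n, p))                        # ~0.9970
    print(1 - binom_cdf(3, n, p))                    # ~0.0002
    print(binom_cdf(3, n, p) - binom_cdf(1, n, p))   # ~0.0352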
If we want to say that a random variable has a binomial distribution, we write:

X ~ Bn,p

What are the expected value and variance of X ~ Bn,p?

E[X] = Σ_{i=0}^{n} i · pX(i) = Σ_{i=0}^{n} i · C(n, i) · p^i (1 − p)^(n−i)
     = Σ_{i=1}^{n} n!/((i−1)!(n−i)!) · p^i (1 − p)^(n−i)
     = np · Σ_{j=0}^{n−1} (n−1)!/(j!((n−1)−j)!) · p^j (1 − p)^(n−1−j)     (substituting j := i − 1)
     = np,

since the last sum adds up all values of a Bn−1,p pmf and therefore equals 1.

Var[X] = ... = np(1 − p).

2.2.3 Geometric pmf
Assume we have a single Bernoulli experiment with probability of success p.
Now we repeat this experiment until we have a first success.
Denote by X the number of repetitions of the experiment until we have the first success.
Note: X = k means that we have k − 1 failures and the first success in the kth repetition of the experiment.
The sample space Ω is therefore infinite and starts at 1 (we need at least one experiment):

Ω = {1, 2, 3, 4, ...}

Probability mass function:

pX(k) = P(X = k) = (1 − p)^(k−1) · p     (k − 1 failures, then a success)

This probability mass function is called the Geometric mass function.
Expected value and variance of X are:

E[X] = Σ_{i=1}^{∞} i · (1 − p)^(i−1) · p = ... = 1/p,
Var[X] = Σ_{i=1}^{∞} (i − 1/p)^2 · (1 − p)^(i−1) · p = ... = (1 − p)/p^2.
Example 2.2.2 Repeat-until loop
Examine the following programming statement:

Repeat S until B

Assume P(B = true) = 0.1 and let X be the number of times S is executed.
Then X has a geometric distribution:

P(X = k) = pX(k) = 0.9^(k−1) · 0.1

How often is S executed on average? What is E[X]? Using the above formula, we get E[X] = 1/p = 10.
We still need to compute the distribution function FX. Remember, FX(t) is the probability of X ≤ t.
Instead of tackling this problem directly, we use a trick and look at the complementary event X > t. If X is greater than t, this means that the first ⌊t⌋ trials all yield failures. This is easy to compute! It's just (1 − p)^⌊t⌋.
Therefore the probability distribution function is:

FX(t) = 1 − (1 − p)^⌊t⌋ =: Geop(t)

This function is called the Geometric distribution (function) Geop.
Example 2.2.3 Time Outs at the Alpha Farm
Watch the input queue at the alpha farm for a job that times out.
The probability that a job times out is 0.05.
Let Y be the number of the first job to time out; then Y ~ Geo0.05.
What's then the probability that
• the third job times out?
  P(Y = 3) = 0.95^2 · 0.05 = 0.045
• Y is less than 3?
  P(Y < 3) = P(Y ≤ 2) = 1 − 0.95^2 = 0.0975
• the first job to time out is between the third and the seventh?
  P(3 ≤ Y ≤ 7) = P(Y ≤ 7) − P(Y ≤ 2) = (1 − 0.95^7) − (1 − 0.95^2) = 0.204

What is the expected value of Y, and what is Var[Y]?
Plugging p = 0.05 into the above formulas gives us:

E[Y] = 1/p = 20              (we expect the 20th job to be the first to time out)
Var[Y] = (1 − p)/p^2 = 380   (very spread out!)
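With the closed-form cdf these are one-liners. A sketch (added for illustration) with the alpha-farm numbers, p = 0.05:

    p = 0.05

    def geo_pmf(k):
        return (1 - p)**(k - 1) * p

    def geo_cdf(t):
        return 1 - (1 - p)**int(t)

    print(geo_pmf(3))                  # ~0.045
    print(geo_cdf(2))                  # P(Y < 3) = P(Y <= 2) ~0.0975
    print(geo_cdf(7) - geo_cdf(2))     # P(3 <= Y <= 7) ~0.204
    print(1 / p, (1 - p) / p**2)       # E[Y] = 20, Var[Y] = 380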
2.2.4 Poisson pmf
The Poisson density follows from a certain set of assumptions about the occurrence of "rare" events in time or space.
The kinds of variables modelled using a Poisson density are e.g.

X = # of alpha particles emitted from a polonium bar in an 8 minute period.
Y = # of flaws on a standard size piece of manufactured product (100 m coaxial cable).
Z = # of hits on a web page in a 24 h period.

The Poisson probability mass function is defined as:

p(x) = e^(−λ) · λ^x / x!     for x = 0, 1, 2, 3, ...

λ is called the rate parameter.
Poλ(t) := FX(t) is the Poisson distribution (function).
The expected value and variance of X ~ Poλ are:

E[X] = Σ_{x=0}^{∞} x · e^(−λ) λ^x / x! = ... = λ
Var[X] = ... = λ

How do we choose λ in an example? Look at the expected value!
Example 2.2.4
A manufacturer of chips produces 1% defectives. What is the probability that in a box of 100 chips no defective is found?
Let X be the number of defective chips found in the box. So far, we would have modelled X as a Binomial variable with distribution B_{100,0.01}. Then
P(X = 0) = (100 choose 0) · 0.99^100 · 0.01^0 = 0.366.
On the other hand, a defective chip can be considered a rare event, since p is small (p = 0.01). What else can we do?
We expect 100 · 0.01 = 1 chip out of the box to be defective. If we model X as a Poisson variable, we know that the expected value of X is λ. In this example, therefore, λ = 1. Then
P(X = 0) = e^{−1} · 1^0 / 0! = 0.3679.
There is no big difference between the two approaches! For larger k, however, the binomial coefficient (n choose k) becomes hard to compute, and it is easier to use the Poisson distribution instead of the Binomial distribution.
Poisson approximation of the Binomial pmf. For large n, the Binomial distribution is approximated by the Poisson distribution with λ = np:
(n choose k) p^k (1−p)^{n−k} ≈ e^{−np} (np)^k / k!
Rule of thumb: use the Poisson approximation if n ≥ 20 and (at the same time) p ≤ 0.05.
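The quality of this approximation is easy to inspect numerically; a small R sketch with n = 100 and p = 0.01 (the values from the chip example above):

# compare exact binomial probabilities with the Poisson approximation
n <- 100; p <- 0.01
k <- 0:5
round(dbinom(k, size = n, prob = p), 4)   # exact B(100, 0.01) pmf
round(dpois(k, lambda = n * p), 4)        # Poisson approximation, lambda = np = 1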
2.2.5 Compound Discrete Probability Mass Functions

Real problems very seldom concern a single random variable. As soon as more than one variable is involved, it is not sufficient to model them only individually; their joint behavior is important.
Again: how do we specify probabilities for more than one random variable at a time? Consider the two-variable case: X, Y are two discrete variables. The joint probability mass function is defined as
p_{X,Y}(x, y) := P(X = x ∩ Y = y)
(As for a single variable, the individual probabilities must be between 0 and 1 and their sum must be 1.)
Example 2.2.5
A box contains 5 unmarked PowerPC G4 processors of different speeds:
2 × 400 MHz, 1 × 450 MHz, 2 × 500 MHz.
Select two processors out of the box (without replacement) and let
X = speed of the first selected processor
Y = speed of the second selected processor
For a sample space we can draw a table of all the possible combinations of processors. We will distinguish
between processors of the same speed by using the subscripts 1 or 2 .
Ω                        2nd processor
1st proc.    400_1  400_2  450   500_1  500_2
400_1          -      x     x      x      x
400_2          x      -     x      x      x
450            x      x     -      x      x
500_1          x      x     x      -      x
500_2          x      x     x      x      -
In total we have 5 · 4 = 20 possible combinations.
Since we draw at random, we assume that each of the above combinations is equally likely. This yields the
following probability mass function:
                    2nd processor
1st proc.    400    450    500   (MHz)
400          0.1    0.1    0.2
450          0.1    0.0    0.1
500          0.2    0.1    0.1

What is the probability that X = Y? This might be important if we wanted to match the chips to assemble a dual processor machine:
P(X = Y) = p_{X,Y}(400,400) + p_{X,Y}(450,450) + p_{X,Y}(500,500) = 0.1 + 0 + 0.1 = 0.2.
Another example: what is the probability that the first processor has higher speed than the second?
P(X > Y) = p_{X,Y}(450,400) + p_{X,Y}(500,400) + p_{X,Y}(500,450) = 0.1 + 0.2 + 0.1 = 0.4.
We can go from joint probability mass functions to the individual ("marginal") pmfs:
p_X(x) = Σ_y p_{X,Y}(x, y),
p_Y(y) = Σ_x p_{X,Y}(x, y).
Example 2.2.6 Continued
For the previous example the marginal probability mass functions are

x        400   450   500  (MHz)
p_X(x)   0.4   0.2   0.4

y        400   450   500  (MHz)
p_Y(y)   0.4   0.2   0.4

Just as we had the notion of expected value for functions of a single random variable, there is an expected value for functions of several random variables:
E[h(X, Y)] := Σ_{x,y} h(x, y) · p_{X,Y}(x, y)
Example 2.2.7 Continued
Let X, Y be as before. What is E[|X − Y|] (the average speed difference)?
Here we have the situation E[|X − Y|] = E[h(X, Y)] with h(X, Y) = |X − Y|. Using the above definition of expected value gives us:
E[|X − Y|] = Σ_{x,y} |x − y| · p_{X,Y}(x, y) =
= |400−400|·0.1 + |400−450|·0.1 + |400−500|·0.2 +
  |450−400|·0.1 + |450−450|·0.0 + |450−500|·0.1 +
  |500−400|·0.2 + |500−450|·0.1 + |500−500|·0.1 =
= 0 + 5 + 20 + 5 + 0 + 5 + 20 + 5 + 0 = 60.
The most important cases for h(X, Y ) in this context are linear combinations of X and Y .
For two variables we can measure how “similar” their values are:
Definition 2.2.1 (Covariance)
The covariance between two random variables X and Y is defined as:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
Note that this definition looks very much like the definition of the variance of a single random variable. In fact, if we set Y := X in the above definition, then Cov(X, X) = Var(X).
Definition 2.2.2 (Correlation)
The (linear) correlation between two variables X and Y is
ρ := Cov(X, Y) / √(Var(X) · Var(Y))    (read: "rho")
Facts about ρ:
• ρ is between −1 and 1.
• If ρ = 1 or −1, Y is a linear function of X:
  ρ = 1  →  Y = aX + b with a > 0,
  ρ = −1  →  Y = aX + b with a < 0.
ρ is a measure of linear association between X and Y: ρ near ±1 indicates a strong linear relationship, ρ near 0 indicates lack of linear association.
Example 2.2.8 Continued
What is ρ in our box with five chips?
Check (use the marginal pmfs to compute!):
E[X] = E[Y] = 450,    Var[X] = Var[Y] = 2000.
The covariance between X and Y is:
Cov(X, Y) = Σ_{x,y} (x − E[X])(y − E[Y]) · p_{X,Y}(x, y) =
= (400−450)(400−450)·0.1 + (450−450)(400−450)·0.1 + (500−450)(400−450)·0.2 +
  (400−450)(450−450)·0.1 + (450−450)(450−450)·0.0 + (500−450)(450−450)·0.1 +
  (400−450)(500−450)·0.2 + (450−450)(500−450)·0.1 + (500−450)(500−450)·0.1 =
= 250 + 0 − 500 + 0 + 0 + 0 − 500 + 0 + 250 = −500.
ρ therefore is
ρ = Cov(X, Y) / √(Var(X) Var(Y)) = −500/2000 = −0.25,
which indicates a weak negative (linear) association.
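All of these quantities can be computed mechanically from the joint pmf; a sketch in R, with the joint probabilities entered as a matrix (rows = 1st processor, columns = 2nd processor):

speed <- c(400, 450, 500)
p <- matrix(c(0.1, 0.1, 0.2,
              0.1, 0.0, 0.1,
              0.2, 0.1, 0.1),
            nrow = 3, byrow = TRUE, dimnames = list(speed, speed))
px <- rowSums(p); py <- colSums(p)               # marginal pmfs: 0.4 0.2 0.4
EX <- sum(speed * px)                            # E[X] = 450
VX <- sum((speed - EX)^2 * px)                   # Var[X] = 2000
covXY <- sum(outer(speed - EX, speed - EX) * p)  # Cov(X,Y) = -500
covXY / VX   # rho = Cov/sqrt(Var(X)Var(Y)) = -0.25, since Var(X) = Var(Y)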
Definition 2.2.3 (Independence)
Two random variables X and Y are independent if their joint probability mass function p_{X,Y} is equal to the product of the marginal pmfs p_X · p_Y.
Note: so far we have had a definition for the independence of two events A and B: A and B are independent if P(A ∩ B) = P(A) · P(B). Random variables are independent if all events of the form X = x and Y = y are independent.
Example 2.2.9 Continued
Let X and Y be defined as previously. Are X and Y independent?
Check: p_{X,Y}(x, y) = p_X(x) · p_Y(y) for all possible combinations of x and y.
Trick: whenever there is a zero in the joint probability mass function, the variables cannot be independent:
p_{X,Y}(450, 450) = 0 ≠ 0.2 · 0.2 = p_X(450) · p_Y(450).
Therefore, X and Y are not independent!
More properties of Variance and Expected Values
Theorem 2.2.4
If two random variables X and Y are independent, then
E[X · Y] = E[X] · E[Y]
Var[X + Y] = Var[X] + Var[Y]
Theorem 2.2.5
For two random variables X and Y and three real numbers a, b, c:
Var[aX + bY + c] = a² Var[X] + b² Var[Y] + 2ab · Cov(X, Y)
Note: by comparing the two results, we see that for two independent random variables X and Y, the covariance Cov(X, Y) = 0.
Example 2.2.10 Continued
E[X − Y] = E[X] − E[Y] = 450 − 450 = 0
Var[X − Y] = Var[X] + (−1)² Var[Y] − 2 Cov(X, Y) = 2000 + 2000 + 1000 = 5000

2.3 Continuous Random Variables

All previous considerations for discrete variables have direct counterparts for continuous variables. So far, a lot of sums have been involved, e.g. to compute distribution functions or expected values. Summing over uncountably many values corresponds to an integral. The main trick in working with continuous random variables is to replace all sums by integrals in the definitions.
As in the case of a discrete random variable, we define a distribution function as the probability that a
random variable has outcome t or a smaller value:
Definition 2.3.1 (probability distribution function)
Assume X is a continuous random variable:
The function FX (t) := P (X ≤ t) is called the probability distribution function of X.
The only difference from the discrete case is that the distribution function of a continuous variable is not a stairstep function.
Properties of F_X. The following properties hold for the probability distribution function F_X of a random variable X:
• 0 ≤ F_X(t) ≤ 1 for all t ∈ R
• F_X is monotone increasing (i.e. if x1 ≤ x2 then F_X(x1) ≤ F_X(x2))
• lim_{t→−∞} F_X(t) = 0 and lim_{t→∞} F_X(t) = 1.
Now, however, the situation is slightly different from the discrete case:
Definition 2.3.2 (density function)
For a continuous variable X with distribution function F_X the density function of X is defined as:
f_X(x) := F'_X(x)
(Careful: f(x) is not a probability! f(x) may be greater than 1.)
Theorem 2.3.3 (Properties of f(x))
A function f_X is a density function of X if
(i) f_X(x) ≥ 0 for all x,
(ii) ∫_{−∞}^{∞} f_X(x) dx = 1.
Relationship between f_X and F_X. Since the density function f_X is defined as the derivative of the distribution function, we can regain the distribution function from the density by integrating:
• F_X(t) = P(X ≤ t) = ∫_{−∞}^{t} f(x) dx
• P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx
Therefore,
P(X = a) = P(a ≤ X ≤ a) = ∫_{a}^{a} f(x) dx = 0.
Example 2.3.1
Let Y be the time until the first major failure of a new disk drive. A possible density function for Y is
f(y) = e^{−y} for y > 0, and f(y) = 0 otherwise.
First, we need to check that f(y) actually is a density function. Obviously, f(y) is a non-negative function on the whole of R. The second condition f must fulfill to be a density of Y is
∫_{−∞}^{∞} f(y) dy = ∫_{0}^{∞} e^{−y} dy = −e^{−y} |_{0}^{∞} = 0 − (−1) = 1.
What is the probability that the first major disk drive failure occurs within the first year?
P(Y ≤ 1) = ∫_{0}^{1} e^{−y} dy = −e^{−y} |_{0}^{1} = 1 − e^{−1} ≈ 0.63.
What is the distribution function of Y?
F_Y(t) = ∫_{−∞}^{t} f(y) dy = ∫_{0}^{t} e^{−y} dy = 1 − e^{−t} for all t ≥ 0.
[Figure 2.1: Density and distribution function of the random variable Y.]
Summary:

discrete random variable:
  image Im(X) finite or countably infinite
  probability distribution function: F_X(t) = P(X ≤ t) = Σ_{k ≤ ⌊t⌋} p_X(k)
  probability mass function: p_X(x) = P(X = x)
  expected value: E[h(X)] = Σ_x h(x) · p_X(x)
  variance: Var[X] = E[(X − E[X])²] = Σ_x (x − E[X])² p_X(x)

continuous random variable:
  image Im(X) uncountable
  probability distribution function: F_X(t) = P(X ≤ t) = ∫_{−∞}^{t} f(x) dx
  probability density function: f_X(x) = F'_X(x)
  expected value: E[h(X)] = ∫ h(x) · f_X(x) dx
  variance: Var[X] = E[(X − E[X])²] = ∫_{−∞}^{∞} (x − E[X])² f_X(x) dx

2.4 Some special continuous density functions

2.4.1 Uniform Density

One of the most basic cases of a continuous density is the uniform density. On the finite interval (a, b) each value has the same density (cf. figure 2.2):
f(x) = 1/(b−a) if a < x < b, and f(x) = 0 otherwise.
[Figure 2.2: Density function of a uniform variable X on (a, b).]
The distribution function F_X is
U_{a,b}(x) := F_X(x) = 0 if x ≤ a;  (x−a)/(b−a) if a < x < b;  1 if x ≥ b.
We now know how to compute the expected value and variance of a continuous random variable. Assume X has a uniform distribution on (a, b). Then
E[X] = ∫_{a}^{b} x · 1/(b−a) dx = 1/(b−a) · x²/2 |_{a}^{b} = (b² − a²)/(2(b−a)) = (a + b)/2
(we expect X to be in the middle between a and b; makes sense, doesn't it?), and
Var[X] = ∫_{a}^{b} (x − (a+b)/2)² · 1/(b−a) dx = ... = (b−a)²/12.
Example 2.4.1
The (pseudo) random number generator on my calculator is supposed to create realizations of U(0, 1) random variables. Define U as the next random number the calculator produces. What is the probability that the next number is higher than 0.85?
For that, we want to compute P(U ≥ 0.85). We know the density function of U: f_U(u) = 1/(1−0) = 1. Therefore
P(U ≥ 0.85) = ∫_{0.85}^{1} 1 du = 1 − 0.85 = 0.15.

2.4.2 Exponential distribution

This density is commonly used to model waiting times between occurrences of "rare" events and lifetimes of electrical or mechanical devices.
Definition 2.4.1 (Exponential density)
A random variable X has exponential density (cf. figure 2.3) if
f_X(x) = λe^{−λx} if x ≥ 0, and f_X(x) = 0 otherwise.
λ is called the rate parameter.
[Figure 2.3: Density functions of exponential variables for rate parameters 0.5, 1, and 2.]
Mean, variance and distribution function are easy to compute. They are:
E[X] = 1/λ
Var[X] = 1/λ²
Exp_λ(t) = F_X(t) = 0 if t < 0, and 1 − e^{−λt} if t ≥ 0.
The following example will accompany us throughout the remainder of this class:
Example 2.4.2 Hits on a web page
On average there are 2 hits per minute on a specific web page. I start to observe this web page at a certain time point 0 and decide to model the waiting time till the first hit, Y (in min), using an exponential distribution.
What is a sensible value for λ, the rate parameter?
Think: on average there are 2 hits per minute, which makes an average waiting time of 0.5 minutes between hits. We will use this value as the expected value for Y: E[Y] = 0.5. On the other hand, we know that the expected value for Y is 1/λ, so we are back at λ = 2 as a sensible choice for the parameter! λ describes the rate at which this web page is hit.
What is the probability that we have to wait at most 40 seconds to observe the first hit?
We know the rate at which hits come to the web page per minute, so it's advisable to express the 40 s in minutes as well: 40 s = 2/3 min. This we can compute:
P(Y ≤ 2/3) = Exp_λ(2/3) = 1 − e^{−2·2/3} ≈ 0.736
How long do we have to wait, at most, to observe a first hit with a probability of 0.9? This is a very different question from what we have looked at so far! Here we want to find a t for which P(Y ≤ t) = 0.9:
P(Y ≤ t) = 0.9 ⇐⇒ 1 − e^{−2t} = 0.9 ⇐⇒ e^{−2t} = 0.1 ⇐⇒ t = −0.5 ln 0.1 ≈ 1.15 min, that is approx. 69 s.
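Both questions map directly onto R's distribution and quantile functions for the exponential; a quick check with rate λ = 2:

lambda <- 2
pexp(2/3, rate = lambda)        # P(Y <= 2/3) = 0.7364
qexp(0.9, rate = lambda)        # t with P(Y <= t) = 0.9: about 1.151 min
qexp(0.9, rate = lambda) * 60   # the same t in seconds: about 69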
Memoryless property
Example 2.4.3 Hits on a web page
In the previous example I stated that we start to observe the web page at time point 0. Does the choice of this time point affect our analysis in any way?
Let's assume that during the first minute after we started to observe the page, there is no hit. What is the probability that we have to wait another 40 seconds for the first hit? This implies an answer to the question of what would have happened if we had started our observation of the web page a minute later: would we still get the same results?
The probability we want to compute is a conditional probability. If we think back, the conditional probability of A given B was defined as
P(A|B) := P(A ∩ B) / P(B)
Now we have to identify what the events A and B are in our case. The information we have is that during the first minute we did not observe a hit; this is B, i.e. B = (Y > 1). The probability we want to know is that we have to wait another 40 s for the first hit: A = wait 1 min and 40 s for the first hit (= Y ≤ 5/3).
P(first hit within 5/3 min | no hit during 1st min) = P(A|B) = P(A ∩ B)/P(B) = P(Y ≤ 5/3 ∩ Y > 1)/P(Y > 1) =
= P(1 < Y ≤ 5/3)/(1 − P(Y ≤ 1)) = (e^{−2} − e^{−10/3})/e^{−2} = 0.736.
That's exactly the same probability as we had before!
The result of this example is no coincidence. We can generalize:
P(Y ≤ t + s | Y ≥ s) = 1 − e^{−λt} = P(Y ≤ t)
This means: a random variable with an exponential distribution "forgets" about its past. This is called the memoryless property of the exponential distribution. An electrical or mechanical device whose lifetime we model as an exponential variable therefore "stays as good as new" until it suddenly breaks, i.e. we assume that there is no aging process.

2.4.3 Erlang density

Example 2.4.4 Hits on a web page
Remember: we modeled the waiting time until the first hit as Exp_2. How long do we have to wait for the second hit?
In order to get the waiting time for the second hit, we can add the waiting time until the first hit and the time between the first and the second hit. For both of these we know the distribution: Y1, the waiting time until the first hit, is an exponential variable with λ = 2. After we have observed the first hit, we start the experiment again and wait for the next hit. Since the exponential distribution is memoryless, this is as good as waiting for the first hit. We can therefore model Y2, the time between the first and the second hit, by another exponential distribution with the same rate λ = 2.
What we are interested in is Y := Y1 + Y2. Unfortunately, we don't know the distribution of Y yet.
Definition 2.4.2 (Erlang density)
If Y1, ..., Yk are k independent exponential random variables with parameter λ, their sum
X := Σ_{i=1}^{k} Yi
has an Erlang(k, λ) distribution. The Erlang density f_{k,λ} is
f(x) = λe^{−λx} · (λx)^{k−1}/(k−1)! for x ≥ 0, and f(x) = 0 for x < 0.
k is called the stage parameter, λ is the rate parameter.
Expected value and variance of an Erlang distributed variable X can be computed using the properties of expected value and variance for sums of independent random variables:
E[X] = E[Σ_{i=1}^{k} Yi] = Σ_{i=1}^{k} E[Yi] = k · 1/λ
Var[X] = Var[Σ_{i=1}^{k} Yi] = Σ_{i=1}^{k} Var[Yi] = k · 1/λ²
In order to compute the distribution function, we need another result about the relationship between Po_λ and Exp_λ.
Theorem 2.4.3
If X1, X2, X3, ... are independent exponential random variables with parameter λ and (cf. fig. 2.4)
W := largest index j such that Σ_{i=1}^{j} Xi ≤ T
for some fixed T > 0, then W ∼ Po_{λT}.
[Figure 2.4: Occurrence times X1, X2, X3, ... on (0, T); W = 3 in this example.]
With this theorem we can derive an expression for the Erlang distribution function. Let X be an Erlang_{k,λ} variable:
Erlang_{k,λ}(x) = P(X ≤ x) = 1 − P(X > x)    (1st trick)
= 1 − P(Σ_i Yi > x)    (X > x means fewer than k hits are observed in (0, x))
= 1 − P(a Poisson r.v. with rate xλ is ≤ k − 1)    (above theorem)
= 1 − Po_{λx}(k − 1).
Example 2.4.5 Hits on a web page
What is the density of the waiting time until the second hit?
We said that Y, as previously defined, is the sum of two exponential variables, each with rate λ = 2. Y therefore has an Erlang distribution with stage parameter 2, and the density is given as
f_Y(x) = f_{2,2}(x) = 4x e^{−2x} for x ≥ 0
If we wait for the third hit, what is the probability that we have to wait more than 1 min?
Z := waiting time until the third hit has an Erlang(3, 2) distribution.
P(Z > 1) = 1 − Erlang_{3,2}(1) = 1 − (1 − Po_{2·1}(3 − 1)) = Po_2(2) = 0.677
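In R the Erlang distribution is available as the gamma distribution with integer shape parameter k, so both routes to this probability can be checked directly:

1 - pgamma(1, shape = 3, rate = 2)  # P(Z > 1) via the Erlang/gamma cdf: 0.677
ppois(3 - 1, lambda = 2 * 1)        # the same via the Poisson relation: 0.677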
Note: the exponential distribution is a special case of the Erlang distribution: Exp_λ = Erlang(k=1, λ).
Erlang distributions are used to model waiting times of components that are exposed to peak stresses. It is assumed that they can withstand k − 1 peaks and fail with the kth peak. We will come across the Erlang distribution again when modelling waiting times in queueing systems, where customers arrive at a Poisson rate and need exponential time to be served.

2.4.4 Gaussian or Normal density

The normal density is the archetypical "bell-shaped" density. The density has two parameters, µ and σ², and is defined as
f_{µ,σ²}(x) = 1/√(2πσ²) · e^{−(x−µ)²/(2σ²)}
The expected value and variance of a normally distributed r.v. X are:
E[X] = ∫_{−∞}^{∞} x f_{µ,σ²}(x) dx = ... = µ
Var[X] = ∫_{−∞}^{∞} (x − µ)² f_{µ,σ²}(x) dx = ... = σ².
Note: the parameters µ and σ² are actually the mean and variance of X, and that is what they are called.
[Figure 2.5: Normal densities for several parameters. µ determines the location of the peak on the x-axis, σ² determines the "width" of the bell.]
The distribution function of X is
N_{µ,σ²}(t) := F_{µ,σ²}(t) = ∫_{−∞}^{t} f_{µ,σ²}(x) dx
Unfortunately, no closed form exists for this integral: f_{µ,σ²} does not have a simple antiderivative. To get probabilities, however, we need to evaluate this integral. This leaves us with several choices:
1. personal numerical integration (uuuh, bad, bad idea)
2. use of statistical software (later)
3. standard tables of normal probabilities (this is what we are going to do!)
We will mainly use the third option.
First of all, only a special case of the normal distributions is tabled: only positive values of N(0, 1), the normal distribution with mean 0 and variance 1. This is the so-called standard normal distribution, also written as Φ.
A table for this distribution is enough, though. We will use several tricks to get any normal distribution into the shape of a standard normal distribution:
Basic facts about the normal distribution that allow the use of tables
(i) For X ∼ N(µ, σ²):
Z := (X − µ)/σ ∼ N(0, 1)
This process is called standardizing X. (This is at least plausible, since
E[Z] = (1/σ)(E[X] − µ) = 0 and Var[Z] = (1/σ²) Var[X] = 1.)
(ii) Φ(−z) = 1 − Φ(z), since f_{0,1} is symmetric about 0 (see fig. 2.6 for an explanation).
[Figure 2.6: Standard normal density. Remember, the area below the graph up to a specified vertical line represents the probability that the random variable Z is less than this value. It is easy to see that the areas in the tails are equal: P(Z ≤ −z) = P(Z ≥ +z). And we already know that P(Z ≥ +z) = 1 − P(Z ≤ z), which proves the above statement.]
Example 2.4.6
Suppose Z is a standard normal random variable.
1. P(Z < 1) = ?
P(Z < 1) = Φ(1) = 0.8413.    (straight look-up)
2. P(0 < Z < 1) = ?
P(0 < Z < 1) = P(Z < 1) − P(Z < 0) = Φ(1) − Φ(0) = 0.8413 − 0.5 = 0.3413.
3. P(Z < −2.31) = ?
P(Z < −2.31) = 1 − Φ(2.31) = 1 − 0.9896 = 0.0104.
4. P(|Z| > 2) = ?
P(|Z| > 2) = P(Z < −2) + P(Z > 2) = 2(1 − Φ(2)) = 2(1 − 0.9772) = 0.0456.
[Four sketches of the standard normal density, shading the areas corresponding to cases (1) through (4).]
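With software, the same four look-ups are one-liners; in R, for example:

pnorm(1)               # 1. P(Z < 1) = 0.8413
pnorm(1) - pnorm(0)    # 2. P(0 < Z < 1) = 0.3413
pnorm(-2.31)           # 3. P(Z < -2.31) = 0.0104
2 * (1 - pnorm(2))     # 4. P(|Z| > 2) = 0.0456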
Example 2.4.7
Suppose X ∼ N(1, 2). What is P(1 < X < 2)?
Standardizing X gives Z := (X − 1)/√2. Then
P(1 < X < 2) = P((1−1)/√2 < (X−1)/√2 < (2−1)/√2) = P(0 < Z < 0.5·√2) = Φ(0.71) − Φ(0) = 0.7611 − 0.5 = 0.2611.
Note that the standard normal table only shows probabilities for z < 3.99. This is all we need, though, since
P (Z ≥ 4) ≤ 0.0001.
Example 2.4.8
Suppose the battery life of a laptop is normally distributed with σ = 20 min. Engineering design requires that only 1% of batteries fail to last 300 min. What mean battery life is required to ensure this condition?
Let X denote the battery life in minutes; then X has a normal distribution with unknown mean µ and standard deviation σ = 20 min. What is µ?
The condition that only 1% of batteries are allowed to fail the 300 min limit translates to:
P(X < 300) ≤ 0.01
We must choose µ such that this condition holds. In order to compute the probability, we must standardize X:
Z := (X − µ)/20
Then
P(X ≤ 300) = P((X − µ)/20 ≤ (300 − µ)/20) = P(Z ≤ (300 − µ)/20) = Φ((300 − µ)/20)
The condition requires:
P(X ≤ 300) ≤ 0.01
⇐⇒ Φ((300 − µ)/20) ≤ 0.01 = 1 − 0.99 = 1 − Φ(2.33) = Φ(−2.33)
⇐⇒ (300 − µ)/20 ≤ −2.33
⇐⇒ µ ≥ 346.6.
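The same condition can be solved with the normal quantile function instead of a table; a minimal R check (qnorm(0.01) = −2.326 is slightly more precise than the table value 2.33, which gives 346.6):

sigma <- 20
mu <- 300 - sigma * qnorm(0.01)     # mu such that P(X < 300) = 0.01
mu                                  # about 346.5
pnorm(300, mean = mu, sd = sigma)   # check: equals 0.01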
Normal distributions have a "reproductive property": if X and Y are normal variables, then W := aX + bY is also a normal variable, with
E[W] = aE[X] + bE[Y]
Var[W] = a² Var[X] + b² Var[Y] + 2ab Cov(X, Y)
The normal distribution is extremely common and useful for one reason: it approximates a lot of other distributions. This is the result of one of the most fundamental theorems in mathematics:

2.5 Central Limit Theorem (CLT)

Theorem 2.5.1 (Central Limit Theorem)
If X1, X2, ..., Xn are n independent, identically distributed random variables with E[Xi] = µ and Var[Xi] = σ², then the sample mean X̄ := (1/n) Σ_{i=1}^{n} Xi is approximately normally distributed with E[X̄] = µ and Var[X̄] = σ²/n, i.e.
X̄ ∼ N(µ, σ²/n), or equivalently Σ_i Xi ∼ N(nµ, nσ²).
Corollary 2.5.2
(a) For large n, the Binomial distribution B_{n,p} is approximately normal, N_{np, np(1−p)}.
(b) For large λ, the Poisson distribution Po_λ is approximately normal, N_{λ, λ}.
(c) For large k, the Erlang distribution Erlang_{k,λ} is approximately normal, N_{k/λ, k/λ²}.
Why?
(a) Let X be a variable with a B_{n,p} distribution. We know that X is the result of repeating the same Bernoulli experiment n times and looking at the overall number of successes. We can therefore write X as the sum of n B_{1,p} variables Xi:
X := X1 + X2 + ... + Xn
X is then the sum of n independent, identically distributed random variables, and the Central Limit Theorem states that X has an approximate normal distribution with E[X] = nE[Xi] = np and Var[X] = nVar[Xi] = np(1 − p).
(b) It is enough to show the statement for the case that λ is a large integer. Let Y be a Poisson variable with rate λ. Then we can think of Y as the number of occurrences in an experiment that runs for time λ; that is the same as observing λ experiments that each run independently for time 1 and adding their results:
Y = Y1 + Y2 + ... + Yλ, with Yi ∼ Po_1.
Again, Y is the sum of independent, identically distributed random variables, and the Central Limit Theorem states that Y has an approximate normal distribution with E[Y] = λ · 1 = λ and Var[Y] = λ Var[Yi] = λ.
(c) This statement is the easiest to prove, since an Erlang_{k,λ} distributed variable Z is by definition the sum of k independent exponential variables Z1, ..., Zk. For Z the CLT holds, and we get that Z is approximately normally distributed with E[Z] = kE[Zi] = k/λ and Var[Z] = kVar[Zi] = k/λ².
Why do we need the central limit theorem at all? First of all, the CLT gives us the distribution of the sample mean in a very general setting: the only thing we need to know is that all the observed values come from the same distribution, and that the variance of this distribution is not infinite. A second reason is that most tables only contain probabilities up to a certain limit; the Poisson table, e.g., only has values for λ ≤ 10, and the Binomial distribution is tabled only for n ≤ 20. Beyond that, we can use the normal approximation to get probabilities.
Example 2.5.1 Hits on a webpage
Hits occur at a rate of 2 per min. What is the probability of waiting more than 20 min for the 50th hit?
Let Y be the waiting time until the 50th hit. We know: Y has an Erlang_{50,2} distribution. Therefore:
P(Y > 20) = 1 − Erlang_{50,2}(20) = 1 − (1 − Po_{2·20}(50 − 1)) = Po_40(49) ≈ N_{40,40}(49)    (CLT!)
= Φ((49 − 40)/√40) = Φ(1.42) = 0.9222.    (table)
Example 2.5.2 Mean of Uniform Variables
Let U1, U2, U3, U4, and U5 be standard uniform variables, i.e. Ui ∼ U(0,1). Without the CLT we would have no idea what distribution the sample mean Ū = (1/5) Σ_{i=1}^{5} Ui has. With it, we know: Ū ≈ N(0.5, 1/60). (A small simulation check follows below.)
Issue: accuracy of the approximation
• increases with n,
• increases with the amount of symmetry in the distribution of the Xi.
Rule of thumb for the Binomial distribution: use the normal approximation for B_{n,p} if np > 5 (if p ≤ 0.5) or nq > 5 (if p ≥ 0.5)!
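A minimal R sketch of the uniform-means example: simulate many sample means of five standard uniforms and compare them with N(0.5, 1/60).

ubar <- replicate(10000, mean(runif(5)))
mean(ubar)   # close to 0.5
var(ubar)    # close to 1/60 = 0.0167
hist(ubar, freq = FALSE, breaks = 40)          # already quite bell-shaped
curve(dnorm(x, 0.5, sqrt(1/60)), add = TRUE)   # the CLT approximation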
Chapter 3
Elementary Simulation
We want to be able to perform an experiment with a given set of probabilities.
Starting point:
3.1 Random Number Generators

Random number generators (rng) produce a stream of numbers that look like realizations of independent standard uniform variables
U1, U2, U3, ...
Usually these numbers are not completely random but pseudo random. This way, we ensure repeatability of an experiment. (Note: even the trick of linking the system's rand() function to the internal clock gives you only pseudo random numbers, since the same time will give you exactly the same stream of random numbers.)
There are hundreds of methods that have been proposed for doing this - some (most?) are pretty bad.
A good method - and, in fact, current standard in most operating systems, is:
Linear Congruential Method
Definition 3.1.1 (Linear Congruential Sequence)
For integers a, c, and m, a sequence of "random numbers" x_n is defined by:
x_i ≡ (a·x_{i−1} + c) mod m    for i = 1, 2, ...
Note: this sequence still depends on the choice of x_0, the so-called seed of the sequence. Choosing different seeds yields different sequences.
That way, we get a sequence with elements in [0, m − 1]. We define u_i := x_i/m.
The choice of the parameters a, c and m is crucial! Obviously, we want to get as many different numbers as possible; therefore m needs to be as large as possible and preferably prime (that way we get rid of small cycles).
Example 3.1.1 rng examples
The status quo in industry is the so-called Minimal Standard generator. It fulfills the common requirements of an rng and at the same time is very fast. Its parameters are:
c = 0, a = 16807, m = 2^31 − 1
An example of a terrible random number generator is RANDU, with
c = 0, a = 65539, m = 2^31
It was widely used before people discovered how bad it actually is: knowing two successive random numbers gives you the possibility to predict the next number pretty well. That is not how rngs are supposed to work.
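A linear congruential generator takes only a few lines in any language; here is an illustrative R sketch (for real work, use the built-in generators):

# x_i = (a*x_{i-1} + c) mod m, scaled to u_i = x_i/m
lcg <- function(n, seed, a = 16807, c = 0, m = 2^31 - 1) {
  x <- numeric(n)
  x[1] <- seed
  for (i in 2:n) x[i] <- (a * x[i - 1] + c) %% m
  x / m
}
u <- lcg(5, seed = 1)   # first few numbers of the minimal standard rng
u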
For more information about random number generators and different techniques for producing and checking them, look at
http://crypto.mat.sbg.ac.at/results/karl/server/
State of the art at the moment is the Marsaglia-Multicarry-RNG:
#define znew ((z=36969*(z&65535)+(z>>16))<<16)
#define wnew ((w=18000*(w&65535)+(w>>16))&65535)
#define IUNI (znew+wnew)
#define UNI  (znew+wnew)*2.328306e-10
static unsigned long z=362436069, w=521288629;
void setseed(unsigned long i1,unsigned long i2){z=i1; w=i2;}
/*
Whenever you need random integers or random reals in your
C program, just insert those six lines at (near?) the beginning
of the program. In every expression where you want a random real
in [0,1) use UNI, or use IUNI for a random 32-bit integer.
No need to mess with ranf() or ranf(lastI), etc, with their
requisite overheads. Choices for replacing the two multipliers
36969 and 18000 are given below. Thus you can tailor your own
in-line multiply-with-carry random number generator.
This section is expressed as a C comment, in case you want to
keep it filed with your essential six lines:
*/
/* Use of IUNI in an expression will produce a 32-bit unsigned
random integer, while UNI will produce a random real in [0,1).
The static variables z and w can be reassigned to i1 and i2
by setseed(i1,i2);
You may replace the two constants 36969 and 18000 by any
pair of distinct constants from this list:
18000 18030 18273 18513 18879 19074 19098 19164 19215 19584
19599 19950 20088 20508 20544 20664 20814 20970 21153 21243
21423 21723 21954 22125 22188 22293 22860 22938 22965 22974
23109 23124 23163 23208 23508 23520 23553 23658 23865 24114
24219 24660 24699 24864 24948 25023 25308 25443 26004 26088
26154 26550 26679 26838 27183 27258 27753 27795 27810 27834
27960 28320 28380 28689 28710 28794 28854 28959 28980 29013
29379 29889 30135 30345 30459 30714 30903 30963 31059 31083
(or any other 16-bit constants k for which both k*2^16-1
and k*2^15-1 are prime)*/
Armed with a Uniform rng, all kinds of other distributions can be generated:
3.1.1 A general method for discrete data

Consider a discrete pmf with values x1 < x2 < ... < xn and probabilities p(x1), p(x2), ..., p(xn). The distribution function F then is:
F(t) = Σ_{i: xi ≤ t} p(xi)
Suppose we have a sequence of independent standard uniform random variables U1, U2, ... with realizations u1, u2, ... (realizations are real values in (0, 1)). Then we define the ith element of our new sequence to be xj if
F(x_{j−1}) = Σ_{k=1}^{j−1} p(xk) ≤ ui ≤ Σ_{k=1}^{j} p(xk) = F(xj).
Then X has probability mass function p.
This is less complicated than it looks. Have a look at figure 3.1. Getting the right x-value for a specific u is done by drawing a horizontal line from the y-axis to the graph of F and following the graph down to the x-axis. This is how we get the inverse of a function graphically.
[Figure 3.1: Getting the value corresponding to ui is done by drawing a straight line to the right until we hit the graph of F, and following the graph down to xj.]
Example 3.1.2 Simulate the roll of a fair die
Let X be the number of spots on the upturned face. The probability mass function of X is p(i) = 1/6 for all i = 1, ..., 6; the distribution function is F_X(t) = ⌊t⌋/6 for all t ∈ (0, 6).
We therefore get X from a standard uniform variable U by
X = 1 if 0 ≤ U ≤ 1/6,  2 if 1/6 < U ≤ 2/6,  3 if 2/6 < U ≤ 3/6,
    4 if 3/6 < U ≤ 4/6,  5 if 4/6 < U ≤ 5/6,  6 if 5/6 < U ≤ 1.
A faster definition than the one above is X = ⌈6 · U⌉.
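Both constructions are one line in R; a quick check that ⌈6U⌉ produces a fair die:

u <- runif(60000)
x <- ceiling(6 * u)    # the faster definition X = ceiling(6*U)
table(x) / length(x)   # all six relative frequencies should be near 1/6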
3.1.2 A general Method for Continuous Densities

Consider a continuous density f with distribution function F. We know (cf. fig. 3.2) that F : (x0, ∞) → (0, 1) (x0 could be −∞) has an inverse function:
F⁻¹ : (0, 1) → (x0, ∞)
[Figure 3.2: Starting at some value x0, any continuous distribution function has an inverse. In this example, x0 = 1.]
General Method: For a given standard uniform variable U ∼ U(0,1) we define
X := F_X⁻¹(U)
Then X has distribution F_X.
Why? For a proof of the above statement, we must compute the distribution function X has. Remember, the distribution function of X at value x is the probability that X is x or less:
P(X ≤ x) = P(F_X(X) ≤ F_X(x))    (trick: apply F_X to both sides of the inequality)
= P(U ≤ F_X(x))    (definition of X)
= F_X(x).    (U is a standard uniform variable: P(U ≤ t) = t)
Therefore, X has exactly the distribution we wanted it to have.
Example 3.1.3 Simulate from Exp_λ
Suppose we want to simulate a random variable X that has an exponential distribution with rate λ. How do we do this based on a standard uniform variable U ∼ U(0,1)?
The distribution function for Exp_λ is
Exp_λ(x) = 0 for x ≤ 0, and 1 − e^{−λx} for x ≥ 0.
So Exp_λ : (0, ∞) → (0, 1) has an inverse:
Let u be a real number in (0, 1):
u = 1 − e^{−λx}
⇐⇒ 1 − u = e^{−λx}
⇐⇒ ln(1 − u) = −λx
⇐⇒ x = −(1/λ) ln(1 − u) =: F_X⁻¹(u)
Then X := −(1/λ) ln(1 − U) has an exponential distribution with rate λ. In fact, since 1 − U is uniform if U is uniform, we could also have used X := −(1/λ) ln U.
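A sketch of this inversion in R, compared against the built-in exponential generator:

lambda <- 2
u <- runif(10000)
x <- -log(1 - u) / lambda          # inverse-cdf method
c(mean(x), var(x))                 # close to 1/lambda = 0.5 and 1/lambda^2 = 0.25
mean(rexp(10000, rate = lambda))   # built-in generator, for comparison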
For specific densities there are a lot of different special tricks for simulating observations:
For all of the next sections, let’s assume that we have a sequence of independent standard uniform variables
U1 , U2 , U3 , . . .
3.1.2.1 Simulating Binomial & Geometric distributions

Let p be the probability of success for a single Bernoulli trial. Define:
Xi = 1 if ui < p, and Xi = 0 if ui ≥ p.
Then
X := Σ_{i=1}^{n} Xi ∼ B_{n,p},
and W := # of Xi until the first one equals 1 satisfies W ∼ Geometric_p.

3.1.2.2 Simulating a Poisson distribution

With given U, we know that X = −(1/λ) ln U has an exponential distribution with rate λ. Define
Y := largest index j such that Σ_{i=1}^{j} Xi ≤ 1 and Σ_{i=1}^{j+1} Xi > 1.
Then Y ∼ Po_λ.

3.1.2.3 Simulating a Normal distribution

To simulate a normal distribution, we need two sequences of standard uniform variables. Let U1 and U2 be two independent standard uniform variables. Define
Z1 := (−2 ln U1)^{1/2} cos(2πU2)
Z2 := (−2 ln U1)^{1/2} sin(2πU2)
Then both Z1 and Z2 have a standard normal distribution and are independent, Z1, Z2 ∼ N(0, 1), and
X := µ + σZi ∼ N(µ, σ²)
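This is the Box-Muller transformation; a minimal R sketch:

u1 <- runif(5000); u2 <- runif(5000)
z1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)   # standard normal
z2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)   # independent standard normal
c(mean(z1), var(z1))                          # close to 0 and 1
x <- 100 + 2 * z1                             # X = mu + sigma*Z ~ N(100, 4)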
3.2 Basic Problem

Simulation allows us to get approximate results for all kinds of probability problems that we cannot solve analytically. The basic problem is:
X, Y, ..., Z: k independent random variables; we know how to simulate each of these variables.
g(x, y, ..., z): some quite complicated function g of k variables.
V = g(X, Y, ..., Z): a random variable of interest.
We might be interested in some aspects of the density of V, e.g.
P(13 < V < 17) = ?    E[V] = ?    Var[V] = ?
Unless g is simple, k is small, and we are very lucky, we may not be able to solve these problems analytically. Using simulation, we can do the following:
Steps of Simulation:
1. Simulate some large number (say n) of values for each of the k variables X, Y, ..., Z. We then have a set of n k-tuples of the form (Xi, Yi, ..., Zi) for i = 1, ..., n.
2. Plug each (Xi, Yi, ..., Zi) into the function g and compute Vi = g(Xi, Yi, ..., Zi) for i = 1, ..., n.
3. Then approximate
(a) P(a ≤ V ≤ b) by #{Vi : Vi ∈ [a, b]} / n,
(b) E[h(V)] by (1/n) Σ_{i=1}^{n} h(Vi), e.g. E[V] by (1/n) Σ_{i=1}^{n} Vi = V̄,
(c) Var[V] by (1/n) Σ_{i=1}^{n} (Vi − V̄)².
Example 3.2.1 Simple electric circuit
Consider an electric circuit with three resistors as shown in the diagram. Simple physics predicts that R, the overall resistance, is:
R = R1 + (1/R2 + 1/R3)⁻¹ = R1 + R2·R3/(R2 + R3)
Assume the resistors are independent and have a normal distribution with mean 100 Ω and a standard deviation of 2 Ω.
What should we expect for R, the overall resistance?
The following lines are R output from a simulation of 1000 values of R:
# Example: Simple Electric Circuit
#
# Goal: Simulate 1000 random numbers for each of the resistances R1, R2 and R3.
# Compute R, the overall resistance, from those values and get approximations for
# expected value, variance and probabilities:
#
# rnorm (n, mean=0, sd = 1) generates n normal random numbers with the specified
# mean and standard deviation
#
R1 <- rnorm (1000, mean=100, sd = 2)
R2 <- rnorm (1000, mean=100, sd = 2)
R3 <- rnorm (1000, mean=100, sd = 2)
#
# compute R:
R <- R1 + R2*R3/(R2 + R3)
#
# now get the estimates:
mean(R)
> [1] 149.9741
sd(R)
> [1] 2.134474
#
# ... the probability that R is less than 146 is given by the number of values
# that are less than 146 divided by 1000:
sum(R<146)/1000
> [1] 0.04
Example 3.2.2 at MacMall
Assume you have a summer job at MacMall; your responsibility is the Blueberry iMacs in stock. At the start of the day, you have 20 Blueberry iMacs in stock. We know:
X = # of iMacs ordered per day is Poisson with mean 30
Y = # of iMacs received from Apple per day is Poisson with mean 15
Question: what is the probability that at the end of the day you have inventory left in stock?
Let I be the number of Blueberry iMacs in stock at the end of the day.
I = 20 − X + Y
Asked for is the probability P(I ≥ 1).
Again, we use R for simulating I:
# Example: MacMall
#
# Goal: generate 1000 Poisson values with lambda = 30
#
# Remember: 1 Poisson value needs several exponential values
# step 1: produce exponential values
u1 <- runif(33000)
e1 <- -1/30*log(u1)
sum(e1)
[1] 1099.096
#
# sum of the exponential values is > 1000, therefore we have enough values
# to produce 1000 Poisson values
#
# step 2:
# add the exponential values (cumsum is cumulative sum)
E1 <- cumsum(e1)
E1[25:35]
[1] 0.7834028 0.7926534 0.7929962 0.7959631 0.8060001 0.8572329 0.8670336
[8] 0.8947401 1.0182220 1.0831698 1.1001983
E1 <- floor(E1)
E1[25:35]
[1] 0 0 0 0 0 0 0 0 1 1 1
#
# Each time we step over the next integer, we get another Poisson value
# by counting how many exponential values we needed to get there.
#
# step 3:
# The ’table’ command counts, how many values of each integer we have
X <- table(E1)
X[1:10]
0 1 2 3 4 5 6 7 8 9
32 26 31 32 17 27 31 33 32 31
#
# we have 1099 values, we only need 1000
X <- X[1:1000]
#
# check, whether X is a Poisson variable (then, e.g. mean and variance
# must be equal to lambda, which is 30 in our example)
#
mean(X)
[1] 30.013
var(X)
[1] 29.84067
#
# generate another 1000 Poisson values, this time lambda is 15
Y <- rpois(1000,15)
# looks a lot easier!
#
# now compute the variable of interest: I is the number of Blueberry IMacs
# we have in store at the end of the day
I <- 20 - X + Y
#
# and, finally,
# the result we were looking for;
# the (empirical) probability, that at the end of the day there are still
# computers in the store:
sum(I > 0)/1000
[1] 0.753
Using simulation gives us the answer, that with an estimated probability of 0.753 there will be Blueberry
IMacs in stock at the end of the day.
Why does simulation work? On what properties do we rely when simulating?
P(V ∈ [a, b]) is approximated by p̂ = #{Vi : Vi ∈ [a, b]} / n,
E[h(V)] is approximated by h̄ := (1/n) Σ_{i=1}^{n} h(Vi),
Var[V] is approximated by (1/n) Σ_{i=1}^{n} (Vi − V̄)².
Suppose V1 = g(X1, Y1, ..., Z1), V2 = g(X2, Y2, ..., Z2), ..., Vn = g(Xn, Yn, ..., Zn) are i.i.d. Then #{Vi : Vi ∈ [a, b]} ∼ B_{n,p} with p = P(V ∈ [a, b]) and n = # of trials.
So we can compute the expected value and variance of p̂:
E[p̂] = (1/n) · np = p
Var[p̂] = (1/n²) · np(1 − p) = p(1 − p)/n ≤ 1/(4n) → 0 for n → ∞
i.e. we have the picture that for large values of n, p̂ has a density centered at the "true" value of P(V ∈ [a, b]) with small spread: for large n, p̂ is close to p with high probability.
Similarly, for Vi i.i.d., the h(Vi) are also i.i.d. Then
E[h̄] = (1/n) Σ_{i=1}^{n} E[h(Vi)] = E[h(V)]
and
Var[h̄] = (1/n²) Σ_{i=1}^{n} Var[h(Vi)] = Var[h(V)]/n → 0 for n → ∞.
Once again we have the picture that the density of h̄ is centered at E[h(V)] for large n and has small spread.
Chapter 4
Stochastic Processes
Definition 4.0.1 (Stochastic Process)
A stochastic process is a set of random variables indexed by time: X(t).
Modeling requires specifying (in a mathematically consistent way) the joint distribution of
(X(t1), X(t2), X(t3), ..., X(tk))
for any choice of t1 < t2 < t3 < ... < tk.
The values of X(t) are called states; the set of all possible values of X(t) is called the state space.
We have been looking at a Poisson process for some time. Our example "hits on a web page" is a typical example of a Poisson process, so here is a formal definition:

4.1 Poisson Process

Definition 4.1.1 (Poisson Process)
A stochastic process X(t) is called a homogeneous Poisson process with rate λ if
1. for t > 0, X(t) takes values in {0, 1, 2, 3, ...};
2. for any 0 ≤ t1 < t2:
X(t2) − X(t1) ∼ Po_{λ(t2−t1)}
(the distribution depends only on the length of the interval);
3. for any 0 ≤ t1 < t2 ≤ t3 < t4:
X(t2) − X(t1) is independent of X(t4) − X(t3)
(non-overlapping intervals are independent).
Jargon: X(t) is a "counting process" with independent Poisson increments.
Example 4.1.1 Hits on a web page
A counter X(t) of the number of hits on our web page is an example of a Poisson process with rate λ = 2.
[Figure: number of hits X(t) versus time t (in min); in the example, X(t) = 3 for t between 5 and 8 minutes.]
Note:
• X(t) can be thought of as the number of occurrences until time t.
• Similarly, X(t2) − X(t1) is the number of occurrences in the interval (t1, t2].
• With the same argument, X(0) = 0 - ALWAYS!
• The distribution of X(t) is Poisson with rate λt, since X(t) = X(t) − X(0) ∼ Po_{λ(t−0)}.
For a given Poisson process X(t) we define the occurrence times
O0 = 0,
Oj = time of the jth occurrence = the first t for which X(t) ≥ j,
and the inter-arrival times between successive hits:
Ij = Oj − O_{j−1} for j = 1, 2, ...
The time until the kth hit, Ok, is therefore given as the sum of inter-arrival times: Ok = I1 + ... + Ik.
Theorem 4.1.2
X(t) is a Poisson process with rate λ ⇐⇒ the inter-arrival times I1, I2, ... are i.i.d. Exp_λ.
Further: the time until the kth hit, Ok, is an Erlang_{k,λ} distributed variable ⇐⇒ X(t) is a Poisson process with rate λ.
This theorem is very important! It links the Poisson, Exponential, and Erlang distributions tightly together. Consider the following very important example:
Example 4.1.2 Hits on a webpage
Hits on a popular web page occur according to a Poisson process with a rate of 10 hits/min. One begins observation at exactly noon.
1. Evaluate the probability of 2 or fewer hits in the first minute. Let X be the number of hits in the first minute; then X is a Poisson variable with λ = 10:
P(X ≤ 2) = Po_10(2) = e^{−10} + 10·e^{−10} + (10²/2)·e^{−10} = 0.0028 (or table look-up, p. 788).
2. Evaluate the probability that the time till the first hit exceeds 10 seconds. Let Y be the time until the first hit; then Y has an exponential distribution with parameter λ = 10 per minute, or λ = 1/6 per second.
P(Y ≥ 10) = 1 − P(Y ≤ 10) = 1 − (1 − e^{−10·1/6}) = e^{−5/3} = 0.1889.
3. Evaluate the mean and the variance of the time till the 4th hit. Let Z be the time till the 4th hit. Then Z has an Erlang distribution with stage parameter k = 4 and λ = 10 per minute.
E[Z] = k/λ = 4/10 = 0.4 minutes
Var[Z] = k/λ² = 4/100 = 0.04 minutes².
4. Evaluate the probability that the time till the 4th hit exceeds 24 seconds.
P(Z > 24) = 1 − P(Z ≤ 24) = 1 − Erlang_{4,1/6}(24) = 1 − (1 − Po_{(1/6)·24}(4 − 1)) = Po_4(3) = 0.433 (table, p. 786).
5. The number of hits in the first hour is Poisson with mean 600. You would like to know the probability of more than 650 hits. Exact calculation isn't really feasible, so approximate this probability and justify your approximation. A Poisson distribution with large rate λ can be approximated by a normal distribution (corollary of the Central Limit Theorem) with mean µ = λ and variance σ² = λ.
Then X ≈ N(600, 600), so Z := (X − 600)/√600 ≈ N(0, 1). Then:
P(X > 650) = 1 − P(X ≤ 650) = 1 − P(Z ≤ (650 − 600)/√600) ≈ 1 − Φ(2.05) = 1 − 0.9798 = 0.0202 (table, p. 789).
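All five answers can be cross-checked in R (the Erlang again via the gamma distribution):

ppois(2, lambda = 10)                   # 1. P(X <= 2) = 0.0028
1 - pexp(10, rate = 1/6)                # 2. P(Y >= 10 s) = 0.1889
1 - pgamma(24, shape = 4, rate = 1/6)   # 4. P(Z > 24 s) = 0.4335
1 - ppois(650, lambda = 600)            # 5. exact value, about 0.020
1 - pnorm((650 - 600) / sqrt(600))      # 5. normal approximation, 0.0206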
Another interesting property of the Poisson process model, consistent with thinking of it as "random occurrences" in time, is
Theorem 4.1.3
Let X(t) be a Poisson process. Given that X(T) = k, the conditional distribution of the times of the k occurrences O1, ..., Ok is the same as the distribution of k ordered independent standard uniform variables U_(1), U_(2), ..., U_(k) (rescaled to the interval (0, T)).
This gives us a way to simulate a Poisson process with rate λ on the interval (0, T); an R sketch follows below:
• First, draw a Poisson value w from Po_{λT}. This tells us how many uniform values Ui we need to simulate.
• Second, generate w standard uniform values u1, ..., uw.
• Define oi = T · u_(i), where u_(i) is the ith smallest value among u1, ..., uw.
The theorem also tells us that, if we pick k values at random from an interval (0, t) and order them, the distance between two successive values has an exponential distribution with rate λ = k/t.
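A sketch of this recipe in R, for rate λ = 2 on the interval (0, 10):

# simulate a Poisson process with rate lambda on (0, Tmax) via ordered uniforms
lambda <- 2; Tmax <- 10
w <- rpois(1, lambda * Tmax)   # number of occurrences in (0, Tmax)
o <- sort(Tmax * runif(w))     # occurrence times o_1 < ... < o_w
diff(c(0, o))                  # inter-arrival times, approximately Exp(lambda)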
So far we have looked only at arrivals of events. Besides that, we could, for example, look at the number of surfers that are on our web site at the same time. There we have departures as well and, related to that, the time each surfer stays, which we will call the service time (from the perspective of the web server). This leads us to another model:

4.2 Birth & Death Processes

Birth & Death processes (B+D) are a generalization of Poisson processes that allows the modelling of queues, i.e. we assume that arrivals stay in the system for some time and leave again after that.
A B+D process X(t) is a stochastic process that monitors the number of people in a system. If X(t) = k, we assume that at time t there are k people in the system. Again, X(t) is called the state at time t; X(t) is in {0, 1, 2, 3, ...} for all t.
We can visualize (see fig. 4.1) the set-up for a B+D process in a state diagram as movements between consecutive states. Conditional on X(t) = k, we either move to state k + 1 or to k − 1, depending on whether a birth or a death occurs first.
[Figure 4.1: State diagram of a Birth & Death process: states 0, 1, 2, 3, ... with transitions between neighboring states.]
Example 4.2.1 Stat Printer
The "heavy-duty" printer in the Stats department gets 3 jobs per hour. On average, it takes 15 min to complete printing. The printer queue is monitored for a day (8 h total time). Jobs arrive at the following points in time (in h):

job i:           1    2    3    4    5    6    7    8    9   10
arrival time:  0.10 0.40 0.78 1.06 1.36 1.84 1.87 2.04 3.10 4.42

job i:          11   12   13   14   15   16   17   18   19   20
arrival time:  4.46 4.66 4.68 4.89 5.01 5.56 5.56 5.85 6.32 6.99
The printer finishes jobs at:

job i:            1    2    3    4    5    6    7    8    9   10
finishing time: 0.22 0.63 1.61 1.71 1.76 1.90 2.32 2.68 3.42 4.67

job i:           11   12   13   14   15   16   17   18   19   20
finishing time: 5.31 5.54 5.59 5.62 5.84 6.04 6.83 7.10 7.23 7.39
Let X(t) be the number of jobs in the printer and its queue at time t. X(t) is a Birth & Death process.
(a) Draw the graph of X(t) for the values monitored.
[Figure: step function of X(t), the number of jobs in the system, versus time t (in h).]
(b) What is the (empirical) probability that there are 5 jobs in the printer and its queue at some time t?
The empirical probability of 5 jobs in the printer is the time X(t) spends in state 5 divided by the total time:
P̂(X(t) = 5) = ((5.31 − 5.01) + (5.59 − 5.56))/8 = 0.33/8 = 0.04125.
The model for a birth or a death, conditional on X(t) = k, is:
B = time till a potential birth ∼ Exp_{λk}
D = time till a potential death ∼ Exp_{µk}
If B < D, the move is to state k + 1 at time t + B; if B > D, the move is to state k − 1 at time t + D. (Remember: P(B = D) = 0!) B and D are independent for each state k.
This implies that, given the process is in state k, the probability to move to state k + 1 is λk/(µk + λk), and the probability to move to state k − 1 is µk/(µk + λk).
Then Y = min(B, D) is the remaining time in state k until the move. What can we say about the distribution of Y := min(B, D)?
P(Y ≤ y) = P(min(B, D) ≤ y) = P(B ≤ y ∪ D ≤ y)
= P(B ≤ y) + P(D ≤ y) − P(B ≤ y ∩ D ≤ y)    (way, way back, we looked at this kind of probability)
= P(B ≤ y) + P(D ≤ y) − P(B ≤ y) · P(D ≤ y)    (B, D are independent)
= 1 − e^{−λk·y} + 1 − e^{−µk·y} − (1 − e^{−λk·y})(1 − e^{−µk·y})    (B ∼ Exp_{λk}, D ∼ Exp_{µk})
= 1 − e^{−(λk + µk)y} = Exp_{λk + µk}(y),
i.e. Y itself is again an exponential variable; its rate is the sum of the rates of B and D.
Knowing the distribution of Y, the staying time in state k, lets us compute, e.g., the mean staying time in state k: it is the expected value of an exponential distribution with rate λk + µk, and therefore 1/(λk + µk). We will mark this result by (*) and use it below.
Note: a Poisson process with rate λ is a special case of a Birth & Death process where the birth and death rates are constant, λk = λ and µk = 0 for all k.
The analysis of this model for small t is mathematically difficult because of "start-up" effects, but in some cases we can compute the "large t" behaviour. A lot depends on the ratio of births and deaths:
[Figure: three simulated Birth & Death processes, showing the number of jobs in the system over time (in sec).]
In the picture, three different simulations of Birth & Death processes are shown. Only in the first case is the process stable (birth rate < death rate). The other two processes are unstable (birth rate = death rate for the 2nd process, birth rate > death rate for the 3rd process).
Only if the B+D process is stable will it find an equilibrium after some time; this is called the steady state of the B+D process.
Mathematically, the notion of a steady state translates to
lim_{t→∞} P(X(t) = k) = pk for all k,
where the pk are numbers between 0 and 1 with Σ_k pk = 1. The pk are called the steady state probabilities of the B+D process; they form a density function for X.
We can figure out what the pk must be as follows: in the long run,
(# of visits to state k by time t) · (mean stay in state k) / (total time t) → pk,
so, using (*),
(# of visits to state k by time t) / (total time t) → pk (λk + µk).
Let's have a look at the following diagram:
[Diagram: states k − 1, k, k + 1 with the transitions k → k + 1 and k → k − 1.]
A fraction λk/(λk + µk) of the visits to state k results in moves to state k + 1. The probability to be in state k is pk, so
λk/(λk + µk) · pk (λk + µk) = λk pk
is the long run rate of transitions k → k + 1 and, similarly, µk pk is the long run rate of transitions k → k − 1. From the very simple principle that overall everything that flows into state k has to flow out again, we get the so-called balance equations for the steady state probabilities:
Balance equations
The Flow-In = Flow-Out principle provides us with the means to derive equations between the steady state probabilities.
1. For state 0:
µ1 p1 = λ0 p0, i.e. p1 = (λ0/µ1) p0.
2. For state 1:
µ1 p1 + λ1 p1 = λ0 p0 + µ2 p2, i.e. p2 = (λ1/µ2) p1 = (λ0 λ1/(µ1 µ2)) p0.
3. For state 2:
µ2 p2 + λ2 p2 = λ1 p1 + µ3 p3, i.e. p3 = (λ2/µ3) p2 = (λ0 λ1 λ2/(µ1 µ2 µ3)) p0.
4. ... For state k we get:
pk = (λ0 λ1 λ2 · ... · λ_{k−1}) / (µ1 µ2 µ3 · ... · µk) p0.
OK, so now we know all the steady state probabilities in terms of p0. But what use is that if we don't know p0?
Here we need another trick: we know that the steady state probabilities are the density function for the state X. Their sum must therefore be 1! Then
1 = p0 + p1 + p2 + ... = p0 · (1 + λ0/µ1 + λ0 λ1/(µ1 µ2) + ...) =: p0 · S
If this sum S converges, we get p0 = S⁻¹. If it doesn't converge, we know that we don't have any steady state probabilities, i.e. the B+D process never reaches an equilibrium. The analysis of S is crucial!
If S exists, p0 does, and with p0 all pk, which implies that the Birth & Death process is stable.
If S does not exist, then the B & D process is unstable, i.e. it does not have an equilibrium and no steady
state probabilities.
Special case: Birth & Death process with constant birth and death rates
If λk = λ and µk = µ for all k, the ratio between birth and death rates is constant, too:
a := λ/µ
a is called the traffic intensity.
In order to decide whether a specific B&D process is stable or not, we have to look at S. For constant traffic intensities, S can be written as:
S = 1 + λ0/µ1 + λ0 λ1/(µ1 µ2) + ... = 1 + a + a² + a³ + ... = Σ_{k=0}^{∞} a^k
This sum is called a geometric series. If 0 < a < 1, the series converges:
S = 1/(1 − a) for 0 < a < 1.
Then:
p0 = S⁻¹ = 1 − a
pk = a^k · (1 − a) = P(X(t) = k),
i.e. X(t) has a Geometric distribution for large t:
X(t) ∼ Geo_{1−a} for large t and 0 < a < 1.
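A short R sketch of this steady-state distribution, using a = 0.75 for illustration (the traffic intensity of the printer example below):

a <- 0.75
k <- 0:20
pk <- a^k * (1 - a)   # steady state probabilities p_k
pk[1]                 # p_0 = 1 - a = 0.25
sum(pk)               # close to 1; the tail beyond k = 20 is small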
Example 4.2.2 Printer queue (continued)
A certain printer in the Stat Lab gets jobs at a rate of 3 per hour. On average, the printer needs 15 min to finish a job. Let X(t) be the number of jobs in the printer and its queue at time t. X(t) is a Birth & Death process with constant arrival rate λ = 3 and constant death rate µ = 4.
(a) Draw a state diagram for X(t), the (technically possible) number of jobs in the printer (and its queue).
[State diagram: states 0, 1, 2, 3, ...; each arrow to the right (birth) has rate 3, each arrow to the left (death) has rate 4.]
(b) What is the (true) probability that at some time t the printer is idle?
P(X(t) = 0) = p0 = 1 − 3/4 = 0.25.
(c) What is the probability that more than 7 jobs arrive during one hour?
Let Y(t) be the number of arrivals up to time t. Y(t) is a Poisson process with arrival rate λ = 3, so Y(t) ∼ Po_{λ·t}.
P(Y(1) > 7) = 1 − P(Y(1) ≤ 7) = 1 − Po_{3·1}(7) = 1 − 0.988 = 0.012.
(d) What is the probability that the printer is idle for more than 1 hour at a time? (Hint: this is the probability that X(t) = 0 and, at the same time, no job arrives for more than one hour.)
Let Z be the time until the next arrival; then Z ∼ Exp_3. Since X(t) and Z are independent,
P(X(t) = 0 ∩ Z > 1) = P(X(t) = 0) · P(Z > 1) = p0 · (1 − Exp_3(1)) = 0.25 · e^{−3} = 0.0124.
(e) What is the probability that there are 3 jobs in the printer system at time t (including the job printed at the moment)?
P(X(t) = 3) = p3 = 0.75³ · 0.25 = 0.10
(f) What is the difference between the true and the empirical probability of exactly 5 jobs in the printer system?
p5 = 0.75⁵ · 0.25 = 0.0593
p̂5 = 0.04125
The probabilities are close, which means we can assume that this particular printer queue actually behaves like a Birth & Death process.
Two Examples of Birth & Death Processes
Communication System
A communication system has two processors for decoding messages and a buffer that will hold at most two further messages. If the buffer is full, any incoming message is lost. Each processor needs on average 2 min to decode a message. Messages come in at a rate of 1 per min. Assume exponential distributions both for the inter-arrival times between messages and for the time needed to decode a message. Use a Birth & Death process to model the number of messages in the system.
(a) Carefully draw a transition state diagram.
[State diagram: states 0 to 4; birth rate 1 from states 0 through 3; death rate 0.5 from state 1 (one processor busy) and 1 from states 2, 3, and 4 (both processors busy).]
(b) Find the steady state probability that there are no messages in the system.
Since p0 is S⁻¹, we need to compute S first:
S = 1 + λ0/µ1 + λ0 λ1/(µ1 µ2) + λ0 λ1 λ2/(µ1 µ2 µ3) + λ0 λ1 λ2 λ3/(µ1 µ2 µ3 µ4) = 1 + 2 + 2 + 2 + 2 = 9.
Therefore p0 = 1/9.
ICB - International Campus Bank Ames
The ICB Ames employs three tellers. Customers arrive according to a Poisson process with a mean rate
of 1 per minute. If a customer finds all tellers busy, he or she joins a queue that is serviced by all tellers.
Transaction times are independent and have exponential distributions with mean 2 minutes.
(a) Sketch an appropriate state diagram for this queueing system.
[State diagram: states 0, 1, 2, 3, 4, ... with birth rate 1 on every forward arrow; the backward arrows carry death rates 0.5, 1, 1.5, 1.5, ... (each busy teller serves at rate 0.5, and at most three tellers can be busy).]
(b) As it turns out, the large t probability that there are no customers in the system is p0 = 1/9.
What is the probability that a customer entering the bank must enter the queue and wait for service?
A person entering the bank must queue for service if at least three people are in the bank (not including the one who enters at the moment). We are therefore looking for the large t probability that X(t) is at least 3:
P(X(t) ≥ 3) = 1 − P(X(t) < 3) = 1 − P(X(t) ≤ 2) = 1 − (p0 + p1 + p2) = 1 − (1/9 + 2 · 1/9 + 2 · 1/9) = 4/9.
Chapter 5
Queuing systems
[Figure: a generic queueing system - some population of individuals enters the system according to some random mechanism, waits in a queue, is handled by one of the servers 1, ..., c, and exits the system.]
Depending upon the specifics of the application there are many varieties of queuing systems, corresponding to combinations of factors like:
• size & nature of the calling population
is it finite or a (potentially) infinite set? is it homogeneous, i.e. only one type of individual, or several types?
• random mechanism by which the population enters
• nature of the queue
finite or infinite?
• nature of the queuing discipline
FIFO or priority (i.e. different types of individuals get different treatment)
• number and behavior of servers
distribution of service times?
Variety of matters one might want to investigate:
• mean number of individuals in the system
• mean queue length
• fraction of customers turned away (for a finite queue length)
• mean waiting time
• etc.
Notation: FY/FS/c/K

FY  distribution of interarrival times Y
FS  distribution of service times S
c   number of servers
K   maximum number of individuals in the system

Usually, we will assume a FIFO queue.
The distributions FY and FS are chosen from a small set of distributions, denoted by:

M   exponential (Memoryless) distribution
Ek  Erlang k stage
D   deterministic distribution
G   a general, not further specified distribution
Usually, we will be interested in a couple of properties for each queuing system. The main properties are:

L   average length of the system = average number of individuals in the system
W   average waiting time (time in queue plus service time)
Ws  average service time
Wq  average waiting time in queue
Lq  average length of the queue
The main idea of a queuing system is to model the number of individuals in the system (queue and server)
as a Birth & Death Process. This gives us a way to analyze the queuing systems using the methods from
the previous chapter.
X(t) = number of individuals in the system at time t is the Birth & Death Process we'll be interested in.
5.1 Little's Law
The next theorem is based on a simple principle. However, don’t underestimate the theorem’s importance!
- It links waiting times to the number of people in the system and will be very useful in the future:
Theorem 5.1.1 (Little’s Law)
For a queuing system in steady state
L = λ̄ · W
where L is the average number of individuals in the system, W is the average time spent in the system, and
λ̄ is the average rate at which individuals enter the system.
This theorem can also be applied to the queue itself:

Lq = λ̄q · Wq

and to the service center:

Ls = λ̄s · Ws.
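A rough simulation sketch (Python; the function and variable names are our own, not part of the notes) that estimates W for an M/M/1-type system via the Lindley recursion and applies Little's Law, comparing against the steady-state value L = a/(1 − a) derived in the next section:

import random

def mm1_little(lam=3.0, mu=4.0, n=200_000, seed=1):
    # FIFO M/M/1: departure = max(arrival, previous departure) + service;
    # time in system = departure - arrival
    rng = random.Random(seed)
    t_arr = t_free = total = 0.0
    for _ in range(n):
        t_arr += rng.expovariate(lam)
        t_free = max(t_arr, t_free) + rng.expovariate(mu)
        total += t_free - t_arr
    W = total / n                   # simulated mean time in system
    a = lam / mu
    print(W, lam * W, a / (1 - a))  # W, L = λ·W (Little), theory a/(1-a)

mm1_little()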
Relationship between properties

For the properties L, W, Ws, Wq, Lq there are usually two different ways to get a result: an easy one and a difficult one (involving infinite summations and similar nice stuff). To make sure we choose the easy way of computation, here's an overview of the relationships between all these properties:
L  = E[X(t)]
W  = L/λ
Ws = E[S]
Wq = W − Ws
Lq = Wq · λ

5.2 The M/M/1 Queue
Situation: exponential interarrival times with rate λ, exponential service times with rate µ.
Let N(t) denote the number of individuals in the system at time t. N(t) can then be modeled using a Birth & Death process:

λk = birth rate = arrival rate = λ for all k
µk = death rate = service rate = µ for all k
We've already seen that the ratio λ/µ is very important for the analysis of the B&D process. This ratio is called the traffic intensity a. For an M/M/1 queuing system, the traffic intensity is constant for all k.
The previous problem of finding the steady state probabilities of the B&D process is equivalent to finding the steady state probabilities for the number of individuals in the queuing system,

pk = lim_{t→∞} P(N(t) = k).
The B&D balance equations then say that

1 = p0 · (1 + a + a^2 + . . .)

The question whether we have a steady state or not is then reduced to the question whether or not a < 1.
For a < 1, S = Σ_{k=0}^∞ a^k = 1/(1 − a), and then

p0 = 1 − a
p1 = a(1 − a)
p2 = a^2 (1 − a)
p3 = a^3 (1 − a)
...
pk = a^k (1 − a)
N (t) has a geometric distribution for large t!
The mean number of individuals in the queuing system L is lim_{t→∞} E[N(t)]:

L = lim_{t→∞} E[N(t)] = a/(1 − a).
The closer the service rate is to the arrival rate, the larger the expected number of people in the system.
The mean time spent in the system W is then, using Little's Law:

W = L/λ = (1/µ) · 1/(1 − a)
The overall time spent in the system is the sum of the time spent in the queue Wq and the average time spent in service Ws. Since we know that service times are exponentially distributed with rate µ, Ws = 1/µ. For the time spent waiting in the queue we therefore get

Wq = W − Ws = (1/µ) · (1/(1 − a) − 1) = (1/µ) · a/(1 − a).
The average length of the queue is, using Little's Law again, given as

Lq = Wq · λ = a^2/(1 − a)

Further we see that the long run probability that the server is busy (the server utilization rate) is given as:

p := P(server busy) = 1 − P(system empty) = 1 − p0 = a.
Distribution of the time in the queue: denote by q(t) the time that an individual entering the system at time t has to spend waiting in the queue. Clearly, the distribution of the waiting times depends on the number of individuals already in the queue at time t.
Assume that the individual entering the system doesn't have to wait at all in the queue - that happens exactly when the system at time t is empty. For large t we therefore get:

lim_{t→∞} P(q(t) = 0) = p0 = 1 − a.
Think: if there are k individuals in the system, the waiting time q(t) is Erlang_{k,µ} (we're waiting for k departures, and departures occur with a rate of µ). This is a conditional distribution for q(t), since it is based on the assumption about the number of people in the system:

q(t) | X(t) = k ∼ Erlang_{k,µ}    for large t
We can put those pieces together in order to get the large t distribution for q(t) using the theorem of total probability. For large t and x ≥ 0:

F_{q(t)}(x) = P(q(t) ≤ x) = Σ_{k=0}^∞ P(q(t) ≤ x ∩ X(t) = k)
            = Σ_{k=0}^∞ P(q(t) ≤ x | X(t) = k) · pk
            = p0 + Σ_{k=1}^∞ (1 − Po_{µx}(k − 1)) · pk
            = p0 + Σ_{k=1}^∞ ( 1 − Σ_{j=0}^{k−1} e^{−µx} (µx)^j / j! ) · pk = . . .
            = 1 − a e^{−x/W},

where W is the average time spent in the system, W = (1/µ) · 1/(1 − a) = 1/(µ − λ).
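As a sanity check, a Python sketch of our own (not part of the notes) can simulate FIFO waiting times and compare the empirical fraction below x with the closed form 1 − a e^{−(µ−λ)x}:

import random, math

def wait_cdf_check(lam=3.0, mu=4.0, x=1/3, n=200_000, seed=2):
    # fraction of queue-waiting times <= x versus F(x) = 1 - a*e^{-(mu-lam)x}
    rng = random.Random(seed)
    t_arr = t_free = 0.0
    hits = 0
    for _ in range(n):
        t_arr += rng.expovariate(lam)
        wait = max(0.0, t_free - t_arr)      # time spent waiting in the queue
        hits += wait <= x
        t_free = max(t_arr, t_free) + rng.expovariate(mu)
    a = lam / mu
    print(hits / n, 1 - a * math.exp(-(mu - lam) * x))

wait_cdf_check()    # both values ≈ 0.4626 (printer queue, 20 min = 1/3 h)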
Example 5.2.1 Printer Queue (continued)
A certain printer in the Stat Lab gets jobs at a rate of 3 per hour. On average, the printer needs 15 min to finish a job.
Let X(t) be the number of jobs in the printer and its queue at time t.
We know already: X(t) is a Birth & Death Process with constant arrival rate λ = 3 and constant death rate
µ = 4.
The properties of interest for this printer system then are:
L  = E[X(t)] = a/(1 − a) = 0.75/0.25 = 3
Ws = 1/µ = 0.25 hours = 15 min
W  = L/λ = 3/3 = 1 hour
Wq = W − Ws = 0.75 hours = 45 minutes
Lq = Wq · λ = 0.75 · 3 = 2.25
On average, a job has to spend 45 min in the queue. What is the probability that a job has to spend less than 20 min in the queue?
We denoted the waiting time in the queue by q(t); q(t) has distribution function F(y) = 1 − a e^{−(µ−λ)y}. With 20 min = 1/3 hour, the probability asked for is

P(q(t) < 1/3) = 1 − 0.75 · e^{−1/3·(4−3)} = 0.4626.
5.3 The M/M/1/K queue
An M/M/1 queue with limited size K is a lot more realistic than the one with infinite queue. Unfortunately,
it’s computationally slightly harder to deal with.
X(t) is modelled as a Birth & Death Process with states {0, 1, ..., K}. Its state diagram looks like:
[State diagram: states 0, 1, 2, ..., K with birth rate λ on every forward arrow and death rate µ on every backward arrow; there is no forward arrow out of state K.]
Since X(t) has only a finite number of states, it's a stable process independently of the values of λ and µ.
The steady state probabilities pk are:

pk = a^k p0
p0 = S^{-1} = (1 − a)/(1 − a^{K+1})

where a = λ/µ is the traffic intensity and S = 1 + a + a^2 + ... + a^K = (1 − a^{K+1})/(1 − a).
The mean number of individuals in the system L then is:
L = E[X(t)] = 0 · p0 + 1 · p1 + 2 · p2 + ... + K · pK = Σ_{k=0}^K k pk = Σ_{k=0}^K k a^k · p0 = ... = a/(1 − a) − (K + 1)a^{K+1}/(1 − a^{K+1})
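These closed forms are easy to transcribe into code. A Python sketch (names and structure are our own) that computes p0, pK and L for an M/M/1/K queue; the printed values anticipate the convenience store example below:

def mm1k(lam, mu, K):
    # steady-state p0, pK and mean number in system L for M/M/1/K (a != 1)
    a = lam / mu
    p0 = (1 - a) / (1 - a**(K + 1))
    pK = a**K * p0
    L = a/(1 - a) - (K + 1)*a**(K + 1)/(1 - a**(K + 1))
    return p0, pK, L

p0, pK, L = mm1k(0.2, 0.25, 4)                   # convenience store example
print(round(p0, 4), round(pK, 4), round(L, 2))   # 0.2975, 0.1218, 1.56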
Another interesting property of a queuing system with limited size is the number of individuals that get turned away. From a marketing perspective they are the "expensive" ones - they are most likely annoyed and less inclined to return. It's therefore a good strategy to try and minimize this number.
Since an incoming individual is turned away when the system is full, the probability of being turned away is pK. The rate of individuals being turned away is therefore pK · λ.
For the expected total waiting time W, we use Little's theorem:

W = L/λ̄

where λ̄ is the average arrival rate into the system.
At this point we have to be careful when dealing with limited systems: λ̄ is NOT equal to the arrival rate λ. We have to adjust λ by the rate of individuals who are turned away. The adjusted rate λa of individuals entering the system is:

λa = λ − pK λ = (1 − pK)λ.

The expected total waiting time is then W = L/λa and the expected length of the queue Lq = Wq · λa.
Example 5.3.1 Convenience Store
In a small convenience store there’s room for only 4 customers. The owner himself deals with all the customers
- he likes chatting a bit. On average it takes a customer 4 minutes to pay for his/her purchase. Customers
arrive at an average of 1 per 5 minutes. If a customer finds the shop full, he/she will go away immediately.
1. What fraction of time will the owner be in the shop on his own?
The number of customers in the shop can be modelled as a Birth & Death Process with arrival rate λ = 0.2 per minute, death rate µ = 0.25 per minute and upper size K = 4.
The probability (or fraction of time) that the owner will be alone is

p0 = (1 − a)/(1 − a^{K+1}) = 0.2/(1 − 0.8^5) = 0.2975.

2. What is the mean number of customers in the store?

L = a/(1 − a) − (K + 1)a^{K+1}/(1 − a^{K+1}) = 1.56.

3. At what rate are customers turned away?

p4 · λ = 0.8^4 · 0.2975 · 0.2 per minute = 0.0243 per minute = 1.46 per hour

4. What is the average time a customer has to spend for check-out?

W = L/λa = 1.56/(0.2 − 0.0243) = 8.88 minutes.

For limited queueing systems the adjusted arrival rate λa must be used when applying Little's Law.
5.4 The M/M/c queue
Again, X(t), the number of individuals in the queueing system, can be modeled as a Birth & Death process. The transition state diagram for X(t) is:
[State diagram: states 0, 1, 2, ..., c−1, c, ... with birth rate λ on every forward arrow; the backward arrows carry death rates µ, 2µ, 3µ, ..., (c−1)µ, cµ, cµ, ...]
Clearly, the critical thing here in terms of whether or not a steady state exists is whether or not λ/(cµ) < 1.
Let a = λ/µ and ρ = a/c = λ/(cµ).
The balance equations for steady state are:

p1 = a p0
p2 = (a^2/2!) p0
p3 = (a^3/3!) p0
...
pc = (a^c/c!) p0
p_{c+1} = ρ · (a^c/c!) p0
...
pn = ρ^{n−c} · (a^c/c!) p0    for n ≥ c.
In order to get an expression for p0, we use the condition that the overall sum of probabilities must be 1. This gives:

1 = Σ_{k=0}^∞ pk = p0 · ( Σ_{k=0}^{c−1} a^k/k! + (a^c/c!) Σ_{k=c}^∞ ρ^{k−c} ) = p0 · ( Σ_{k=0}^{c−1} a^k/k! + (a^c/c!) · 1/(1 − ρ) ) =: p0 · S

This system has a steady state if ρ < 1; in that case

p0 = S^{-1}.
The other probabilities pn are given as:

pn = (a^n/n!) p0               for 0 ≤ n ≤ c − 1
pn = (a^n/(c! c^{n−c})) p0     for n ≥ c
A key descriptor for the system is the probability with which an entering customer must queue for service - this is equal to the probability that all servers are busy. The formula for this probability is known as Erlang's C formula or Erlang's delay formula and written as C(c, a).
Obviously, in this queueing system an entering individual must queue for service exactly when c or more individuals are already in the system.
lim_{t→∞} P(X(t) ≥ c) = C(c, a) = Σ_{k=c}^∞ pk = 1 − Σ_{k=0}^{c−1} pk = p0 · ( 1/p0 − Σ_{k=0}^{c−1} a^k/k! ) = p0 · a^c/(c!(1 − ρ)).
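Erlang's C formula is straightforward to compute. A hedged Python sketch (the function name erlang_c is ours); the printed value anticipates the bank example below, where c = 3 and a = 2 give C(3, 2) = 4/9:

from math import factorial

def erlang_c(c, a):
    # steady-state probability that an arriving customer must queue
    # in an M/M/c system, with a = lam/mu and rho = a/c < 1
    rho = a / c
    S = sum(a**k / factorial(k) for k in range(c)) + a**c / (factorial(c) * (1 - rho))
    p0 = 1 / S
    return p0 * a**c / (factorial(c) * (1 - rho))

print(erlang_c(3, 2))   # 4/9 ≈ 0.4444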
The steady state mean number of individuals in the queue Lq is
Lq = Σ_{k=c}^∞ (k − c) pk = Σ_{k=c}^∞ (k − c) · (a^k/(c! c^{k−c})) p0 = p0 (a^c/c!) Σ_{k=1}^∞ k ρ^k = p0 (a^c/c!) · ρ/(1 − ρ)^2 = ρ/(1 − ρ) · C(c, a),

using Σ_{k=1}^∞ k ρ^k = ρ · ( Σ_{k=0}^∞ ρ^k )′ = ρ/(1 − ρ)^2.
By Little's Law, the mean waiting time in the queue is

Wq = Lq/λ = (1/λ) · ρ/(1 − ρ) · C(c, a) = 1/(cµ(1 − ρ)) · C(c, a).
Then the overall time in the system is

W = Wq + Ws = Wq + 1/µ,
and the overall number of individuals in the system is on average

L = W · λ = a + ρ/(1 − ρ) · C(c, a).
Example 5.4.1 Bank
A bank has three tellers. Customers arrive at a rate of 1 per minute. Each teller needs on average 2 min to
deal with a customer.
What are the specifications of this queue?
For this queue, λ = 1, µ = 0.5, c = 3, a = λ/µ = 2, and ρ = a/c = 2/3.
The probability that no customer is in the bank then is

p0 = ( Σ_{k=0}^{c−1} a^k/k! + (a^c/c!) · 1/(1 − ρ) )^{-1} = ( 1 + 2 + 4/2 + (2^3/3!) · 1/(1 − ρ) )^{-1} = 1/9.

Lq = p0 · (a^c/c!) · ρ/(1 − ρ)^2 = 8/9.
Wq = Lq/λ = 8/9 minutes.
Ws = 1/µ = 2 minutes.
W  = Ws + Wq = 26/9 minutes.
L  = W · λ = 26/9.
Chapter 6
Statistical Inference
From now on, we will use probability theory only to find answers to the questions arising from specific
problems we are working on.
In this chapter we want to draw inferences about some characteristic of an underlying population - e.g. the
average height of a person. Instead of measuring this characteristic of each individual, we will draw a sample,
i.e. choose a “suitable” subset of the population and measure the characteristic only for those individuals.
Using some probabilistic arguments we can then extend the information we got from that sample and make
an estimate of the characteristic for the whole population. Probability theory will give us the means to find
those estimates and measure how "probable" our estimates are.
Of course, choosing the sample is crucial. We will demand two properties from a sample:
• the sample should be representative - taking only basketball players into the sample would change our estimate about a person's height drastically.
• if the sample is large, our estimate should come close to the "true" value of the characteristic.
The three main areas of statistics are
• estimation of parameters:
point or interval estimates: “my best guess for value x is . . . ”, “my guess is that value x is in interval
(a, b)”
• evaluation of plausibility of values: hypothesis testing
• prediction of future (individual) values
6.1 Parameter Estimation
Statistics are all around us - scores in sports, prices at the grocers, weather reports (and how often they turn out to be close to the actual weather), taxes, evaluations . . .
The most basic form of statistics are descriptive statistics.
But - what exactly is a statistic? - Here is the formal definition:
Definition 6.1.1 (Statistics)
Any function W (x1 , . . . , xk ) of observed values x1 , . . . , xk is called a statistic.
Some statistics you already know are:
Mean (Average)   X̄ = (1/n) Σ_i X_i
Minimum          X_(1) - parentheses indicate that the values are sorted
Maximum          X_(n)
Range            X_(n) − X_(1)
Mode             value(s) that appear(s) most often
Median           "middle value" - that value for which one half of the data is larger, the other half is smaller. If n is odd the median is X_((n+1)/2); if n is even, the median is the average of the two middle values: 0.5 · X_(n/2) + 0.5 · X_(n/2+1).

For this section it is important to distinguish between x_i and X_i properly. If not stated otherwise, any capital letter denotes some random variable, a small letter describes a realization of this random variable, i.e. what we have observed. x_i therefore is a real number, X_i is a function that assigns a real number to an event from the sample space.
Definition 6.1.2 (estimator)
Let X1 , . . . , Xk be k i.i.d. random variables with distribution Fθ with (unknown) parameter θ.
A statistic Θ̂ = Θ̂(X1 , . . . , Xk ) used to estimate the value of θ is called an estimator of θ.
θ̂ = Θ̂(x1 , . . . , xk ) is called an estimate of θ.
Desirable properties of estimates:

• Unbiasedness, i.e. the expected value of the estimator is the true parameter:

E[Θ̂] = θ

• Efficiency: for two estimators Θ̂1 and Θ̂2 of the same parameter θ, Θ̂1 is said to be more efficient than Θ̂2 if

Var[Θ̂1] < Var[Θ̂2]

• Consistency: if we have a larger sample size n, we want the estimate θ̂ to be closer to the true parameter θ:

lim_{n→∞} P(|Θ̂ − θ| > ε) = 0

[Figures: dot plots of estimates around the true value x - an unbiased estimator centers on x; of two estimators of the same parameter, the more efficient one scatters less around x; a consistent estimator concentrates around x as the sample size grows from n = 100 to n = 10000.]
Example 6.1.1
Let X1 , . . . , Xn be n i.i.d. random variables with E[Xi] = µ.
Then X̄ = (1/n) Σ_{i=1}^n Xi is an unbiased estimator of µ, because

E[X̄] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) · n · µ = µ.
OK - so once we have an estimator, we can decide whether it has these properties. But how do we find estimators in the first place?
6.1.1 Maximum Likelihood Estimation
Situation: We have n data values x1 , . . . , xn. The assumption is that these data values are realizations of n i.i.d. random variables X1 , . . . , Xn with distribution Fθ. Unfortunately the value of θ is unknown.
[Figure: observed values x1, x2, x3, ... on the x-axis, with three candidate densities fθ overlaid, for θ = 0, θ = −1.8 and θ = 1.]
By changing the value for θ we can “move the density function fθ around” - in the diagram, the third density
function fits the data best.
Principle: since we do not know the true value θ of the distribution, we take that value θ̂ that most likely
produced the observed values, i.e.
maximize something like

P(X1 = x1 ∩ X2 = x2 ∩ . . . ∩ Xn = xn) = P(X1 = x1) · P(X2 = x2) · . . . · P(Xn = xn) = Π_{i=1}^n P(Xi = xi)    (*)

(using that the Xi are independent).
This is not quite the right way to write the probability if X1 , . . . , Xn are continuous variables. (Remember: P(X = x) = 0 for a continuous variable X; this is still valid.)
We use the above "probability" just as a plausibility argument. To get around the problem that P(X = x) = 0 for a continuous variable, we will write (*) as:

Π_{i=1}^n pθ(xi)   for discrete Xi        and        Π_{i=1}^n fθ(xi)   for continuous Xi

where pθ is the probability mass function of the discrete Xi (all Xi have the same, since they are identically distributed) and fθ is the density function of the continuous Xi.
Both these functions depend on θ. In fact, we can write the above expressions as a function in θ. This
function, which we will denote by L(θ), is called the Likelihood function of X1 , . . . , Xn .
The goal is now to find a value θ̂ that maximizes the Likelihood function (this is what "moves" the density to the right spot, so it fits the observed values well).
How do we get a maximum of L(θ)? - the usual way we maximize a function: differentiate it and set the derivative to zero! (After that, we ought to check with the second derivative whether we've actually found a maximum, but we won't do that unless we've found more than one possible value for θ̂.)
Most of the time, it is difficult to find a derivative of L(θ) - instead we use another trick, and find a maximum
for log L(θ), the Log-Likelihood function.
Note: though its name is “log”, we use the natural logarithm ln.
The plan to find an ML-estimator is:
1. Find Likelihood function L(θ).
2. Get natural log of Likelihood function log L(θ).
3. Differentiate log-Likelihood function with respect to θ.
4. Set derivative to zero.
5. Solve for θ.
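Where the calculus gets awkward, the same plan can be carried out numerically. A small Python illustration of our own (made-up exponential data; the grid search is a device for this sketch, not part of the notes) that maximizes a log-likelihood over a grid and recovers the closed-form answer:

import numpy as np

x = np.array([0.8, 2.1, 0.3, 1.7, 0.9])         # made-up Exp(lam) observations
grid = np.linspace(0.01, 5, 100_000)            # candidate values for lam
loglik = len(x) * np.log(grid) - grid * x.sum() # log L(lam) = n ln(lam) - lam*sum(x)
print(grid[np.argmax(loglik)], 1 / x.mean())    # grid maximum ≈ closed form 1/x̄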
Example 6.1.2 Roll a Die
A die is rolled until its face shows a 6.
Repeating this experiment 100 times gave the following results:
[Histogram: # runs (0-20) versus k, the number of rolls of a die until the first 6.]

k (# rolls)   1   2   3   4   5   6   7   8   9   11   14   15   16   17   20   21   27   29
# runs       18  20   8   9   9   5   8   3   5    3    3    3    1    1    1    1    1    1
We know that k, the number of rolls until a 6 shows up, has a geometric distribution Geo_p. For a fair die, p is 1/6.
The Geometric distribution has probability mass function p(k) = (1 − p)^{k−1} · p.
What is the ML-estimate p̂ for p?
1. Likelihood function L(p):
Since we have observed 100 outcomes k1 , ..., k100, the likelihood function is L(p) = Π_{i=1}^{100} p(ki):

L(p) = Π_{i=1}^{100} (1 − p)^{ki−1} p = p^{100} · Π_{i=1}^{100} (1 − p)^{ki−1} = p^{100} · (1 − p)^{Σ_{i=1}^{100} ki − 100}.

2. Log of Likelihood function log L(p):

log L(p) = log p^{100} + log (1 − p)^{Σ_{i=1}^{100} ki − 100} = 100 log p + ( Σ_{i=1}^{100} ki − 100 ) log(1 − p).
3. Differentiate log-Likelihood with respect to p:

d/dp log L(p) = 100 · 1/p − ( Σ_{i=1}^{100} ki − 100 ) · 1/(1 − p)
             = 1/(p(1 − p)) · ( 100(1 − p) − ( Σ_{i=1}^{100} ki − 100 ) p )
             = 1/(p(1 − p)) · ( 100 − p Σ_{i=1}^{100} ki ).

4. Set derivative to zero.
For the estimate p̂ the derivative must be zero:

d/dp log L(p̂) = 0  ⟺  1/(p̂(1 − p̂)) · ( 100 − p̂ Σ_{i=1}^{100} ki ) = 0

5. Solve for p̂:

100 − p̂ Σ_{i=1}^{100} ki = 0  ⟺  p̂ = 100 / Σ_{i=1}^{100} ki = 1 / ( (1/100) Σ_{i=1}^{100} ki ).

In total, we have an estimate p̂ = 100/568 = 0.1761.
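A quick numeric check of this estimate (Python; the arrays transcribe the frequency table above):

ks     = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 14, 15, 16, 17, 20, 21, 27, 29]
counts = [18, 20, 8, 9, 9, 5, 8, 3, 5, 3, 3, 3, 1, 1, 1, 1, 1, 1]

n = sum(counts)                                  # 100 runs
total = sum(k * c for k, c in zip(ks, counts))   # sum of all k_i = 568
print(n / total)                                 # p̂ ≈ 0.1761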
Example 6.1.3 Red Cars in the Parking Lot
The values 3,2,3,3,4,1,4,2,4,3 have been observed while counting the numbers of red cars pulling into the
parking lot # 22 between 8:30 - 8:40 am Mo to Fr during two weeks.
The assumption is that these values are realizations of ten independent Poisson variables with (the same) rate λ.
What is the Maximum Likelihood estimate of λ?
The probability mass function of a Poisson distribution is pλ(x) = e^{−λ} · λ^x/x!.
We have ten values xi, this gives a Likelihood function:

L(λ) = Π_{i=1}^{10} e^{−λ} · λ^{xi}/xi! = e^{−10λ} · λ^{Σ_{i=1}^{10} xi} · Π_{i=1}^{10} 1/xi!

The log-Likelihood then is

log L(λ) = −10λ + ln(λ) · Σ_{i=1}^{10} xi − Σ_{i=1}^{10} ln(xi!).
Differentiating the log-Likelihood with respect to λ gives:

d/dλ log L(λ) = −10 + (1/λ) · Σ_{i=1}^{10} xi

Setting it to zero:

(1/λ̂) · Σ_{i=1}^{10} xi = 10  ⟺  λ̂ = (1/10) Σ_{i=1}^{10} xi = 29/10 = 2.9

This gives us an estimate for λ - and since λ is also the expected value of the Poisson distribution, we can say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40 am is 2.9.
ML-estimators for µ and σ^2 of a Normal distribution
Let X1 , . . . , Xn be n independent, identically distributed normal variables with E[Xi] = µ and Var[Xi] = σ^2.
µ and σ^2 are unknown.
The normal density function f_{µ,σ^2} is

f_{µ,σ^2}(x) = 1/√(2πσ^2) · e^{−(x−µ)^2/(2σ^2)}

Since we have n independent variables, the Likelihood function is a product of n densities:

L(µ, σ^2) = Π_{i=1}^n 1/√(2πσ^2) · e^{−(xi−µ)^2/(2σ^2)} = (2πσ^2)^{−n/2} · e^{−Σ_{i=1}^n (xi−µ)^2/(2σ^2)}
Log-Likelihood:

log L(µ, σ^2) = −(n/2) ln(2πσ^2) − 1/(2σ^2) Σ_{i=1}^n (xi − µ)^2
Since we have now two parameters, µ and σ^2, we need to get 2 partial derivatives of the log-Likelihood:

d/dµ log L(µ, σ^2) = −1/(2σ^2) · Σ_{i=1}^n 2(xi − µ) · (−1) = (1/σ^2) Σ_{i=1}^n (xi − µ)

d/dσ^2 log L(µ, σ^2) = −(n/2) · 1/σ^2 + 1/(2(σ^2)^2) Σ_{i=1}^n (xi − µ)^2
We now must find values for µ and σ^2 that yield zeros for both derivatives at the same time.
Setting d/dµ log L(µ, σ^2) = 0 gives

µ̂ = (1/n) Σ_{i=1}^n xi,

plugging this value into the derivative for σ^2 and setting d/dσ^2 log L(µ̂, σ^2) = 0 gives

σ̂^2 = (1/n) Σ_{i=1}^n (xi − µ̂)^2
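In code, these closed forms are one-liners. A Python sketch with made-up data (note the ML divisor n, not the n − 1 used for s later on):

import numpy as np

x = np.array([4.1, 5.3, 4.7, 5.9, 5.0, 4.4])   # made-up sample
mu_hat = x.mean()                               # (1/n) * sum(x_i)
sigma2_hat = ((x - mu_hat) ** 2).mean()         # (1/n) * sum((x_i - mu_hat)^2)
print(mu_hat, sigma2_hat)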
6.2 Confidence intervals
The previous section has provided a way to compute point estimates for parameters. Based on that, our next question is: how good is this point estimate? Or: how close is the estimate to the true value of the parameter?
Instead of just looking at the point estimate, we will now try to compute an interval around the estimated parameter value, in which the true parameter is "likely" to fall. An interval like that is called a confidence interval.
Definition 6.2.1 (Confidence Interval)
Let θ̂ be an estimate of θ.
If P(|θ̂ − θ| < e) > α, we say that the interval (θ̂ − e, θ̂ + e) is an α · 100% confidence interval of θ (cf. fig. 6.1).
Usually, α is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.
Note:
• for any given set of values x1 , . . . , xn the value of θ̂ is fixed, as well as the interval (θ̂ − e, θ̂ + e).
• The true value θ is either within the confidence interval or not.
[Figure 6.1: The probability that x̄ falls into an e-interval around µ is α. Vice versa, we know that for all of those x̄, µ is within an e-interval around x̄. That's the idea of a confidence interval.]
!!DON’T DO!!
A lot of people are tempted to reformulate the above probability to:
P (θ̂ − e < θ < θ̂ + e) > α
Though it looks ok, it’s not. Repeat: IT IS NOT OK.
θ is a fixed value - therefore, it does not have a probability to fall into some interval.
The only probability that we have, here, is
P (θ − e < θ̂ < θ + e) > α,
we can therefore say that θ̂ has a probability of at least α to fall into an e-interval around θ. Unfortunately, that doesn't help at all, since we do not know θ!
How do we compute confidence intervals, then? - that’s different for each estimator.
First, we look at estimates of a mean of a distribution:
6.2.1 Large sample C.I. for µ
Situation: we have a large set of observed values (n > 30, usually).
The assumption is that these values are realizations of n i.i.d. random variables X1 , . . . , Xn with E[Xi] = µ and Var[Xi] = σ^2.
We already know from the previous section that X̄ is an unbiased ML-estimator for µ.
But we know more! - The CLT tells us that in exactly the situation we are in, X̄ is an approximately normally distributed random variable with E[X̄] = µ and Var[X̄] = σ^2/n.
We therefore can find the boundary e by using the standard normal distribution. Remember: if X̄ ∼ N(µ, σ^2/n) then Z := (X̄ − µ)/(σ/√n) ∼ N(0, 1) = Φ:
P(|X̄ − µ| ≤ e) ≥ α
⟺ P( |X̄ − µ|/(σ/√n) ≤ e/(σ/√n) ) ≥ α        (standardization)
⟺ P( |Z| ≤ e/(σ/√n) ) ≥ α
⟺ P( −e/(σ/√n) < Z < e/(σ/√n) ) ≥ α
⟺ Φ(e/(σ/√n)) − Φ(−e/(σ/√n)) ≥ α
⟺ Φ(e/(σ/√n)) − (1 − Φ(e/(σ/√n))) ≥ α
⟺ 2Φ(e/(σ/√n)) − 1 ≥ α
⟺ Φ(e/(σ/√n)) ≥ (1 + α)/2
⟺ e/(σ/√n) ≥ Φ^{-1}((1 + α)/2)
⟺ e ≥ Φ^{-1}((1 + α)/2) · σ/√n,   with z := Φ^{-1}((1 + α)/2)
This computation gives an α · 100% confidence interval around µ as:

( X̄ − z · σ/√n , X̄ + z · σ/√n )
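A hedged Python helper (our own wrapper, not from the notes) that evaluates this interval; it uses the exact normal quantile instead of the rounded table values given below:

from statistics import NormalDist

def mean_ci(xbar, sigma, n, alpha=0.95):
    # large-sample interval xbar ± z·sigma/√n with z = Φ^{-1}((1+alpha)/2)
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    e = z * sigma / n ** 0.5
    return xbar - e, xbar + e

print(mean_ci(21543, 3000, 100))   # ISU salary example: about (20955, 22131)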
Now we can do an example:
Example 6.2.1
Suppose we want to find a 95% confidence interval for the mean salary of an ISU employee.
A random sample of 100 ISU employees gives us a sample mean salary of x̄ = $21543.
Suppose the standard deviation of salaries is known to be $3000.
By using the above expression, we get a 95% confidence interval as:

21543 ± Φ^{-1}((1 + 0.95)/2) · 3000/√100 = 21543 ± Φ^{-1}(0.975) · 300

How do we read Φ^{-1}(0.975) from the standard normal table? - We look for which z the probability Φ(z) ≥ 0.975!
This gives us z = 1.96, the 95% confidence interval is then:
21543 ± 588,
i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in 95 out of 100
studies, the true parameter µ falls into a $588 range around x̄.
Critical values for z, depending on α, are:

α                      0.90   0.95   0.98   0.99
z = Φ^{-1}((1+α)/2)    1.65   1.96   2.33   2.58
Problem: Usually, we do not know σ.
Slight generalization: use s = √( (1/(n−1)) Σ_{i=1}^n (Xi − X̄)^2 ) instead of σ!

An α · 100% confidence interval for µ is given as

( X̄ − z · s/√n , X̄ + z · s/√n )

where z = Φ^{-1}((1+α)/2).
Example 6.2.2
Suppose we want to analyze some complicated queueing system, for which we have no formulas and theory.
We are interested in the mean queue length of the system after reaching steady state.
The only thing possible for us is to run simulations of this system and look at the queue length at some large
time t, e.g. t = 1000 hrs.
After 50 simulations, we have got data:
X1 = number in queue at time 1000 hrs in 1st simulation
X2 = number in queue at time 1000 hrs in 2nd simulation
...
X50 = number in queue at time 1000 hrs in 50th simulation
Our observations yield an average queue length of x̄ = 21.5 and s = √( (1/(n−1)) Σ_{i=1}^n (xi − x̄)^2 ) = 15.
A 90% confidence interval is given as

( x̄ − z · s/√n , x̄ + z · s/√n ) = ( 21.5 − 1.65 · 15/√50 , 21.5 + 1.65 · 15/√50 ) = (17.9998, 25.0002)
Example 6.2.3
The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green framed boxes. Each experiment consists of simulating 20 values from a standard normal distribution (these are drawn as the small blue lines). For each of the experiments, the average of the 20 values is computed (that's x̄) as well as a confidence interval for µ - for parts a) and b) it's the 95% confidence interval, for part c) it is the 90% confidence interval, for part d) it is the 99% confidence interval. The upper and the lower confidence bound together with the sample mean are drawn in red next to the sampled observations.
[Figure: four panels of 20 experiments each - a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.]
There are several things to see from this diagram. First of all, we know in this example the “true” value of
the parameter µ - since the observations are sampled from a standard normal distribution, µ = 0. The true
parameter is represented by the straight horizontal line through 0.
We see that each sample yields a different confidence interval, all of them centered around the sample mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we had to use the estimate s instead of the true standard deviation σ = 1. Each sample gave a slightly different standard deviation. Overall, though, the intervals are not very different in length between parts a) and b).
The intervals in c) tend to be slightly smaller - these are 90% confidence intervals - whereas the intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.
Almost all the confidence intervals contain 0 - but not all. And that is what we expect. For a 90% confidence interval we expect that in 10 out of 100 times, the confidence interval does not contain the true parameter.
When we check that, we see that in part c) 4 out of the 20 confidence intervals don't contain the true parameter µ - that's 20%; on average we would expect 10% of the confidence intervals not to contain µ.
Official use of Confidence Intervals:
In an average of 90 out of 100 times the 90% confidence interval of θ does contain
the true value of θ.
6.2.2 Large sample confidence intervals for a proportion p
Let p be a proportion of a large population or a probability.
In order to get an estimate for this proportion, we can take a sample of n individuals from the population and check for each one of them whether or not they fulfill the criterion to be in that proportion of interest.
Mathematically, this corresponds to a Bernoulli-n-sequence, where we are only interested in the number of
“successes”, X, which in our case corresponds to the number of individuals that qualify for the interesting
subgroup.
X then has a Binomial distribution with parameters n and p. We know that X̄ is an estimate for E[X].
Now think: for a Binomial variable X, the expected value is E[X] = n · p. Therefore we get an estimate p̂ for p as p̂ = (1/n) X̄.
Furthermore, we even have a distribution for p̂ for large n: since X is, using the CLT, approximately a normal variable with E[X] = np and Var[X] = np(1 − p), we get that for large n, p̂ is approximately normally distributed with E[p̂] = p and Var[p̂] = p(1 − p)/n.
BTW: this tells us that p̂ is an unbiased estimator of p.
Prepared with the distribution of p̂ we can set up an α · 100% confidence interval as:
(p̂ − e, p̂ + e)
where e is some positive real number with:
P (|p̂ − p| ≤ e) ≥ α
We can derive the expression for e in the same way as in the previous section and we come up with:

e = z · √( p(1 − p)/n )

where z = Φ^{-1}((1+α)/2).
We also run into the problem that e in this form is not ready for use, since we do not know the value for p.
In this situation, we have different options. We can either find a value that maximizes the value p(1 − p) or
we can substitute an appropriate value for p.
6.2.2.1 Conservative Method:

Replace p(1 − p) by something that's guaranteed to be at least as large: the function p(1 − p) has its maximum at p = 0.5, where p(1 − p) = 0.25.
The conservative α · 100% confidence interval for p is

p̂ ± z · 1/(2√n)

where z = Φ^{-1}((1+α)/2).
6.2.2.2 Substitution Method:

Substitute p̂ for p, then:
The α · 100% confidence interval for p by substitution is

p̂ ± z · √( p̂(1 − p̂)/n )

where z = Φ^{-1}((1+α)/2).
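Both interval types in one small Python sketch (the function name and defaults are our own); the printed lines anticipate Example 6.2.4 below:

from statistics import NormalDist

def prop_ci(p_hat, n, alpha=0.95, conservative=False):
    # conservative half-width z/(2√n), or substitution z·√(p̂(1-p̂)/n)
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    e = z / (2 * n**0.5) if conservative else z * (p_hat*(1 - p_hat)/n)**0.5
    return p_hat - e, p_hat + e

print(prop_ci(0.6, 100, conservative=True))    # 0.6 ± 0.098
print(prop_ci(0.6, 100))                       # 0.6 ± 0.096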
What is the difference between the two methods?
• for large n there is almost no difference at all
• if p̂ is close to 0.5, there is also almost no difference
Besides that, conservative confidence intervals (as the name says) are larger than confidence intervals found
by substitution. However, they are at the same time easier to compute.
Example 6.2.4 Complicated queueing system, continued
Suppose that now we are interested in the large t probability p that a server is available.
Doing 100 simulations has shown that in 60 of them a server was available at time t = 1000 hrs.
What is a 95% confidence interval for this probability?
If 60 out of 100 simulations showed a free server, we can use p̂ = 60/100 = 0.6 as an estimate for p.
For a 95% confidence interval, z = Φ^{-1}(0.975) = 1.96.
The conservative confidence interval is:

p̂ ± z · 1/(2√n) = 0.6 ± 1.96 · 1/(2√100) = 0.6 ± 0.098.

For the confidence interval using substitution we get:

p̂ ± z · √( p̂(1 − p̂)/n ) = 0.6 ± 1.96 · √(0.6 · 0.4/100) = 0.6 ± 0.096.
Example 6.2.5 Batting Average
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is the ratio of the number of hits to the times at bat.) Sammy Sosa was at bat 555 times in the 2002 season.
Could the "true" batting average still be 0.300?
Compute a 95% Confidence Interval for the true batting average.
Conservative Method gives:

0.288 ± 1.96 · 1/(2√555) = 0.288 ± 0.042
Substitution Method gives:

0.288 ± 1.96 · √( 0.288(1 − 0.288)/555 ) = 0.288 ± 0.038
The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is
not enough evidence to allow the conclusion that the true average is not 0.3.
Confidence intervals give a way to measure the precision we get from simulations intended to evaluate probabilities. Besides that, they also give us a way to plan how large a sample size has to be to get a desired precision.
Example 6.2.6
Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over $35K.
We want to get a 98% confidence interval and wish to estimate the quantity to within 0.01.
This means that our boundary e needs to be smaller than 0.01 (we'll choose a conservative confidence interval for ease of computation):

e ≤ 0.01
⟺ z · 1/(2√n) ≤ 0.01        (for 98% confidence, z = 2.33)
⟺ 2.33 · 1/(2√n) ≤ 0.01
⟺ √n ≥ 2.33/(2 · 0.01) = 116.5
⟹ n ≥ 13573
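The same computation as a small Python helper (our own; it takes the table value of z as an input so it reproduces the hand computation exactly):

import math

def sample_size(e, z):
    # smallest n with conservative half-width z/(2√n) <= e
    return math.ceil((z / (2 * e)) ** 2)

print(sample_size(0.01, 2.33))   # 13573, using the table value z = 2.33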
6.2.3 Related C.I. Methods

Related to the previous confidence intervals are confidence intervals for the difference between two means, µ1 − µ2, or the difference between two proportions, p1 − p2.
Confidence intervals for these differences are given as:

large n confidence interval for µ1 − µ2 (based on independent X̄1 and X̄2):

X̄1 − X̄2 ± z · √( s1^2/n1 + s2^2/n2 )

large n confidence interval for p1 − p2 (based on independent p̂1 and p̂2):

p̂1 − p̂2 ± z · (1/2) · √( 1/n1 + 1/n2 )    (conservative)
or p̂1 − p̂2 ± z · √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )    (substitution)
Why? The argumentation in both cases is very similar - we will only discuss the confidence interval for
the difference between means.
X̄1 − X̄2 is approximately normal, since X̄1 and X̄2 are approximately normal, with (X̄1, X̄2 are independent):

E[X̄1 − X̄2] = E[X̄1] − E[X̄2] = µ1 − µ2
Var[X̄1 − X̄2] = Var[X̄1] + (−1)^2 Var[X̄2] = σ1^2/n1 + σ2^2/n2
Then we can use the same arguments as before and get a C.I. for µ1 − µ2 as shown above.
Example 6.2.7
Assume we have two parts of the IRS database: East Coast and West Coast.
We want to compare the mean taxable income reported from the two regions in 2000.

                          East Coast     West Coast
# of sampled records:     n1 = 1000      n2 = 2000
mean taxable income:      x̄1 = $37200    x̄2 = $42000
standard deviation:       s1 = $10100    s2 = $15600
We can, for example, compute a 2-sided 95% confidence interval for µ1 − µ2 = difference in mean taxable income as reported in 2000 tax returns between East and West Coast:

37200 − 42000 ± 1.96 · √( 10100^2/1000 + 15600^2/2000 ) = −4800 ± 927

Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the reports from 2000). The interval contains only negative numbers - if it contained 0, the message wouldn't be so clear.
One-sided intervals

Idea: use only one of the end points x̄ ± z · s/√n. This yields confidence intervals for µ of the form

(#, ∞)    (lower bound)        or        (−∞, #)    (upper bound)

However, now we need to adjust z to the new situation. Instead of worrying about two tails of the normal distribution, we use for a one-sided confidence interval only one tail.
[Figure 6.2: One sided (upper bounded) confidence interval for µ (in red): P(µ < x̄ + e) ≥ α.]
Example 6.2.8 Complicated queueing system, continued
What is a 95% upper confidence bound for µ, the parameter for the length of the queue?
x̄ + z · s/√n is the upper confidence bound. Instead of z = Φ^{-1}((α+1)/2) we use z = Φ^{-1}(α) (see fig. 6.2).
This gives 21.5 + 1.65 · 15/√50 = 25.0 as the upper confidence bound. Therefore the one sided upper bounded confidence interval is (−∞, 25.0).
Critical values z = Φ^{-1}(α) for the one sided confidence interval are

α               0.90   0.95   0.98   0.99
z = Φ^{-1}(α)   1.29   1.65   2.06   2.33
Example 6.2.9
Two different digital communication systems each send 100 large messages, and we determine how many are corrupted in transmission:
p̂1 = 0.05 and p̂2 = 0.10.
What's the difference in the corruption rates? Find a 98% confidence interval. Use:

0.05 − 0.10 ± 2.33 · √( 0.05 · 0.95/100 + 0.10 · 0.90/100 ) = −0.05 ± 0.086

This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of p1 − p2, i.e. we can't tell which of the pi's is larger.
So far, we have only considered large sample confidence intervals. The problem with smaller sample sizes is that the normal approximation in the CLT doesn't work if the standard deviation σ is unknown.
What you need to know is that there exist different methods to compute C.I.s for smaller sample sizes.
6.3 Hypothesis Testing
Example 6.3.1 Tea Tasting Lady
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put
in first or the tea was put in first.
To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case
whether the milk went in first or the tea went in first.
To guard against deliberate or accidental communication of information, before pouring each cup of tea a
coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup
of tea to the lady does not know the outcome of the coin toss.
Either the lady has some skill (she can tell to some extent the difference) or she has not, in which case she
is simply guessing.
Suppose the lady tested 10 cups of tea in this manner and got 9 of them right.
This looks rather suspicious - the lady seems to have some skill. But how can we check it?
We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all,
the probability she gives a correct answer for any single cup of tea is 1/2.
The number of cups she gets right therefore has a Binomial distribution with parameters n = 10 and p = 0.5.
The diagram shows the probability mass function of this distribution:
[Figure: probability mass function p(x) of the Bi(10, 0.5) distribution, with the observed value x = 9 marked in the right tail.]
Events that are as unlikely or less likely are that the lady got all 10 cups right or - very different, but nevertheless very rare - that she only got 1 cup or none right (note, this would be evidence of some "anti-skill", but it would certainly be evidence against her guessing).
The total probability for these events is (remember, the binomial probability mass function is p(x) = (n choose x) p^x (1 − p)^{n−x})

p(0) + p(1) + p(9) + p(10) = 0.5^{10} + 10 · 0.5^{10} + 10 · 0.5^{10} + 0.5^{10} = 0.021

i.e. what we have just observed is a fairly rare event under the assumption that the lady is only guessing.
This suggests that the lady may have some skill in detecting which was poured first into the cup.
Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5.
The fact that the p-value is small is evidence against the hypothesis.
Hypothesis testing is a formal procedure to check whether or not some - previously made - assumption can
be rejected based on the data.
We are going to abstract the main elements of the previous example and cook up a standard series of steps
for hypothesis testing:
Example 6.3.2
University CC administrators have historical records that indicate that between August and Oct 2002 the mean time between hits on the ISU homepage was 2 min.
They suspect that in fact the mean time between hits has decreased (i.e. traffic is up) - sampling 50 inter-arrival times from records for November 2002 gives: X̄ = 1.7 min and s = 1.9 min.
Is this strong evidence for an increase in traffic?
Formal Procedure (with the application to the example):

1. State a "null hypothesis" of the form H0: function of parameter(s) = #, meant to embody a status quo / pre-data view.
Here: H0: µ = 2.0 min between hits.

2. State an "alternative hypothesis" of the form Ha: function of parameter(s) >, ≠, or < #, meant to identify a departure from H0.
Here: Ha: µ < 2 (traffic is up).

3. State test criteria - consisting of a test statistic, a "reference distribution" giving the behavior of the test statistic if H0 is true, and the kinds of values of the test statistic that count as evidence against H0.
Here: the test statistic will be Z = (X̄ − 2.0)/(s/√n); the reference density will be standard normal; large negative values for Z count as evidence against H0 in favor of Ha.

4. Show computations.
Here: the sample gives z = (1.7 − 2.0)/(1.9/√50) = −1.12.

5. Report and interpret a p-value = "observed level of significance with which H0 can be rejected". This is the probability of an observed value of the test statistic at least as extreme as the one at hand. The smaller this value is, the less likely it is that H0 is true.
Here: the p-value is P(Z ≤ −1.12) = Φ(−1.12) = 0.1314. This value is not terribly small - the evidence of a decrease in mean time between hits is somewhat weak.

Note aside: a 90% confidence interval for µ is

x̄ ± 1.65 · s/√n = 1.7 ± 0.44

This interval contains the hypothesized value of µ = 2.0.
There are four basic hypothesis tests of this form, testing a mean, a proportion, or differences between two means or two proportions. Depending on the hypothesis, the test statistic will be different. Here's an overview of the tests we are going to use:

Hypothesis           Statistic                                                  Reference Distribution
H0: µ = #            Z = (X̄ − #)/(s/√n)                                         Z is standard normal
H0: p = #            Z = (p̂ − #)/√( #(1 − #)/n )                                Z is standard normal
H0: µ1 − µ2 = #      Z = (X̄1 − X̄2 − #)/√( s1^2/n1 + s2^2/n2 )                   Z is standard normal
H0: p1 − p2 = #      Z = (p̂1 − p̂2 − #)/√( p̂(1 − p̂)(1/n1 + 1/n2) )              Z is standard normal
                     where p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2).
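All four tests reduce to computing a z statistic and reading off a normal tail probability. A hedged Python sketch (the helper name p_value is ours); the computation shown anticipates the tax fraud example below:

from statistics import NormalDist
from math import sqrt

def p_value(z, tails):
    # tails: "left", "right" or "two"
    Phi = NormalDist().cdf
    if tails == "left":
        return Phi(z)
    if tails == "right":
        return 1 - Phi(z)
    return 2 * (1 - Phi(abs(z)))

# H0: p = 0.05 against Ha: p != 0.05, with p̂ = 0.061 and n = 1000
z = (0.061 - 0.05) / sqrt(0.05 * 0.95 / 1000)
print(round(z, 2), round(p_value(z, "two"), 2))   # z ≈ 1.6, p-value ≈ 0.11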
Example 6.3.3 tax fraud
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their
tax returns that invite criminal prosecution.
A sample of n = 1000 tax returns produces p̂ = 0.061 as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of change in taxpayer behavior?
1. state null hypothesis: H0 : p = 0.05
2. alternative hypothesis: Ha : p ≠ 0.05
3. test statistic:

Z = (p̂ − 0.05)/√( 0.05 · 0.95/n )

Under the null hypothesis Z has a standard normal distribution; large values of Z - positive and negative - will count as evidence against H0.

4. computation: z = (0.061 − 0.05)/√( 0.05 · 0.95/1000 ) = 1.59
5. p-value: P(|Z| ≥ 1.59) = P(Z ≤ −1.59) + P(Z ≥ 1.59) = 0.11. This is not a very small value; we therefore have only very weak evidence against H0.
Example 6.3.4 life time of disk drives
n1 = 30 and n2 = 40 disk drives of 2 different designs were tested under conditions of “accelerated” stress
and times to failure recorded:
            Standard Design    New Design
            n1 = 30            n2 = 40
            x̄1 = 1205 hr       x̄2 = 1400 hr
            s1 = 1000 hr       s2 = 900 hr
Does this provide conclusive evidence that the new design has a larger mean time to failure under “accelerated” stress conditions?
1. state null hypothesis: H0: µ1 = µ2 (µ1 − µ2 = 0)
2. alternative hypothesis: Ha: µ1 < µ2 (µ1 − µ2 < 0)
3. test statistic:

Z = (x̄1 − x̄2 − 0)/√( s1^2/n1 + s2^2/n2 )

Under the null hypothesis Z has a standard normal distribution; we will consider large negative values of Z as evidence against H0.

4. computation: z = (1205 − 1400 − 0)/√( 1000^2/30 + 900^2/40 ) = −0.84
5. p-value: P(Z < −0.84) = 0.2005
This is not a very small value; we therefore have only very weak evidence against H0.
Example 6.3.5 queueing systems
We have 2 very complicated queuing systems, and we'd like to know whether there is a difference in the large t probabilities of there being an available server.
We do simulations for each system (each run with a different random seed) and look whether at time t = 2000 there is a server available:

                                         System 1          System 2
# of runs                                n1 = 1000         n2 = 500
server available at time t = 2000?       p̂1 = 551/1000     p̂2 = 303/500
How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?
1. state null hypothesis: H0: p1 = p2 (p1 − p2 = 0)
2. alternative hypothesis: Ha: p1 ≠ p2 (p1 − p2 ≠ 0)
3. Preliminary: note that, if there was no difference between the two systems, a plausible estimate of the availability of a server would be

p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2) = (551 + 303)/(1000 + 500) = 0.569

A test statistic is:

Z = (p̂1 − p̂2 − 0)/√( p̂(1 − p̂) · (1/n1 + 1/n2) )

Under the null hypothesis Z has a standard normal distribution; we will consider large values of |Z| as evidence against H0.

4. computation: z = (0.551 − 0.606)/( √(0.569 · (1 − 0.569)) · √(1/1000 + 1/500) ) = −2.03
5. p-value: P(|Z| > 2.03) = 0.04. This is fairly strong evidence of a real difference in the t = 2000 availabilities of a server between the two systems.
6.4 Regression
A statistical investigation only rarely focusses on the distribution of a single variable. We are often interested
in comparisons among several variables, in changes in a variable over time, or in relationships among several
variables.
The idea of regression is that we have a vector X1 , . . . , Xk and try to approximate the behavior of Y by
finding a function g(X1 , . . . , Xk ) such that Y ≈ g(X1 , . . . , Xk ).
The simplest possible version is:

6.4.1 Simple Linear Regression (SLR)
Situation: k = 1 and Y is approximately linearly related to X, i.e. g(x) = b0 + b1 x.
Notes:
• Scatterplot of Y vs X should show the linear relationship.
• linear relationship may be true only after a transformation of X and/or Y, i.e. one needs to find the "right" scale for the variables:
e.g. if y ≈ c x^b, this is nonlinear in x, but it implies that

ln y ≈ b · ln x + ln c,

so with y′ := ln y and x′ := ln x, i.e. on a log scale for both the x- and the y-axis, one gets a linear relationship.
Example 6.4.1 Mileage vs Weight
Measurements on 38 1978-79 model automobiles. Gas mileage in miles per gallon as measured by Consumers’
Union on a test track. Weight as reported by automobile manufacturer.
A scatterplot of mpg versus weight shows an inversely proportional relationship:
[Scatterplot: MPG (20-35) versus Weight (2.25-3.75), showing a decreasing, curved relationship.]

Transform weight by x ↦ 1/x to weight^{-1}. A scatterplot of mpg versus weight^{-1} reveals a linear relationship:

[Scatterplot: MPG (20-35) versus 1/Wgt (0.300-0.450), now roughly linear.]
Example 6.4.2 Olympics - long jump
Results for the long jump for all olympic games between 1900 and 1996 are:
year   long jump (in m)      year   long jump (in m)
1900   7.19                  1960   8.12
1904   7.34                  1964   8.07
1908   7.48                  1968   8.90
1912   7.60                  1972   8.24
1920   7.15                  1976   8.34
1924   7.45                  1980   8.54
1928   7.74                  1984   8.54
1932   7.64                  1988   8.72
1936   8.06                  1992   8.67
1948   7.82                  1996   8.50
1952   7.57
1956   7.83
A scatterplot of long jump versus year shows:

[Scatterplot: long jump (7.5-8.5 m) versus year (0-96, counted from 1900), roughly linear and increasing.]
The plot shows that it is perhaps reasonable to say that
y ≈ β0 + β1 x
The first issue to be dealt with in this context is: if we accept that y ≈ β0 + β1 x, how do we derive empirical
values of β0 , β1 from n data points (x, y)? The standard answer is the “least squares” principle:
[Figure: scatterplot with a candidate line y = b0 + b1 x and the vertical distances from the points to the line.]
In comparing lines that might be drawn through the plot we look at:

Q(b0, b1) = Σ_{i=1}^n (yi − (b0 + b1 xi))^2

i.e. we look at the sum of squared vertical distances from points to the line and attempt to minimize this sum of squares:

d/db0 Q(b0, b1) = −2 Σ_{i=1}^n (yi − (b0 + b1 xi))
d/db1 Q(b0, b1) = −2 Σ_{i=1}^n xi (yi − (b0 + b1 xi))
Setting the derivatives to zero gives the normal equations:

n b0 + b1 Σ_{i=1}^n xi = Σ_{i=1}^n yi
b0 Σ_{i=1}^n xi + b1 Σ_{i=1}^n xi^2 = Σ_{i=1}^n xi yi
Least squares solutions for b0 and b1 are:

b1 = ( Σ_{i=1}^n xi yi − (1/n) Σ_{i=1}^n xi · Σ_{i=1}^n yi ) / ( Σ_{i=1}^n xi^2 − (1/n)(Σ_{i=1}^n xi)^2 ) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)^2    (slope)

b0 = (1/n) Σ_{i=1}^n yi − b1 · (1/n) Σ_{i=1}^n xi = ȳ − b1 x̄    (y-intercept at x = 0)

These solutions produce the "best fitting line".
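Applied in code, the solutions are two lines of arithmetic. A Python sketch using the long jump data from Example 6.4.2 (x counted from 1900, y taken to three decimals as in the residual table further below):

import numpy as np

x = np.array([0, 4, 8, 12, 20, 24, 28, 32, 36, 48, 52, 56,
              60, 64, 68, 72, 76, 80, 84, 88, 92, 96])
y = np.array([7.185, 7.341, 7.480, 7.601, 7.150, 7.445, 7.741, 7.639,
              8.060, 7.823, 7.569, 7.830, 8.122, 8.071, 8.903, 8.242,
              8.344, 8.541, 8.541, 8.720, 8.670, 8.500])
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
print(b1, b0)       # ≈ 0.0155 and ≈ 7.2037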
Example 6.4.3 Olympics - long jump, continued
X := year (counted from 1900), Y := long jump

Σ_i xi = 1100,  Σ_i xi^2 = 74608,  Σ_i yi = 175.518,  Σ_i yi^2 = 1406.109,  Σ_i xi yi = 9079.584

The parameters for the best fitting line are:

b1 = ( 9079.584 − 1100 · 175.518/22 ) / ( 74608 − 1100^2/22 ) = 0.0155 (in m)
b0 = 175.518/22 − (1100/22) · 0.0155 = 7.2037

The regression equation is

long jump = 7.2037 + 0.0155 · year (in m).
In addition, it is useful to be able to judge how well the line describes the data - i.e. how "linear looking" a plot really is.
There are a couple of means of doing this:
6.4.1.1 The sample correlation r
This is the empirical counterpart of the theoretical correlation ρ we would get if we had the random variables X and Y and their distribution.

r := Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)^2 · Σ_{i=1}^n (yi − ȳ)^2 )
   = ( Σ_{i=1}^n xi yi − (1/n) Σ_{i=1}^n xi · Σ_{i=1}^n yi ) / √( ( Σ_{i=1}^n xi^2 − (1/n)(Σ_{i=1}^n xi)^2 ) · ( Σ_{i=1}^n yi^2 − (1/n)(Σ_{i=1}^n yi)^2 ) )
The numerator is the numerator of b1; one part under the root of the denominator is the denominator of b1.
Because of its connection to ρ, the sample correlation r fulfills (it's not obvious to see, and we won't prove it):
• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1.
Example 6.4.4 Olympics - long jump, continued

r = ( 9079.584 − 1100 · 175.518/22 ) / √( (74608 − 1100^2/22)(1406.109 − 175.518^2/22) ) = 0.8997

Second measure for goodness of fit:
6.4.1.2 Coefficient of determination R^2
This is based on a comparison of the "variation accounted for" by the line versus the "raw variation" of y.
The idea is that

Σ_{i=1}^n (yi − ȳ)^2 = Σ_{i=1}^n yi^2 − (1/n)( Σ_{i=1}^n yi )^2 = SST,   the Total Sum of Squares,

is a measure for the variability of y. (It's (n − 1) · s_y^2.)
[Figure: scatterplot with the horizontal line y = ȳ and the vertical deviations yi − ȳ.]
After fitting the line ŷ = b0 + b1 x, one doesn't predict y as ȳ anymore and suffer the errors of prediction above, but rather only the errors

ei := yi − ŷi.

So, after fitting the line,

Σ_{i=1}^n ei^2 = Σ_{i=1}^n (yi − ŷi)^2 = SSE,   the Sum of Squares of Errors,

is a measure for the remaining/residual error variation.
[Figure: the same scatterplot with the fitted line y = b0 + b1 x and the residuals as vertical distances to the line.]
The fact is that SST ≥ SSE.
So: SSR := SST − SSE ≥ 0.
SSR is taken as a measure of “variation accounted for” in the fitting of the line.
The coefficient of determination R^2 is defined as:

R^2 = SSR/SST

Obviously 0 ≤ R^2 ≤ 1; the closer R^2 is to 1, the better the linear fit.
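A short, self-contained Python check of R^2 for the long jump fit (recomputing the fit so the block stands alone; values as in the tables around it):

import numpy as np

x = np.array([0, 4, 8, 12, 20, 24, 28, 32, 36, 48, 52, 56,
              60, 64, 68, 72, 76, 80, 84, 88, 92, 96])
y = np.array([7.185, 7.341, 7.480, 7.601, 7.150, 7.445, 7.741, 7.639,
              8.060, 7.823, 7.569, 7.830, 8.122, 8.071, 8.903, 8.242,
              8.344, 8.541, 8.541, 8.720, 8.670, 8.500])
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
SST = ((y - y.mean())**2).sum()    # total sum of squares
SSE = ((y - y_hat)**2).sum()       # sum of squared errors
print((SST - SSE) / SST)           # R^2 ≈ 0.81 = r^2 (r ≈ 0.8997)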
Example 6.4.5 Olympics - long jump, continued

SST = Σ_{i=1}^n yi^2 − (1/n)( Σ_{i=1}^n yi )^2 = 1406.109 − 175.518^2/22 = 5.81.

SSE and SSR?
y       x    ŷ       y − ŷ    (y − ŷ)^2
7.185    0   7.204   −0.019    0.000
7.341    4   7.266    0.075    0.006
7.480    8   7.328    0.152    0.023
7.601   12   7.390    0.211    0.045
7.150   20   7.513   −0.363    0.132
7.445   24   7.575   −0.130    0.017
7.741   28   7.637    0.104    0.011
7.639   32   7.699   −0.060    0.004
8.060   36   7.761    0.299    0.089
7.823   48   7.947   −0.124    0.015
7.569   52   8.009   −0.440    0.194
7.830   56   8.071   −0.241    0.058
8.122   60   8.133   −0.011    0.000
8.071   64   8.195   −0.124    0.015
8.903   68   8.257    0.646    0.417
8.242   72   8.319   −0.077    0.006
8.344   76   8.381   −0.037    0.001
8.541   80   8.443    0.098    0.010
8.541   84   8.505    0.036    0.001
8.720   88   8.567    0.153    0.024
8.670   92   8.629    0.041    0.002
8.500   96   8.691   −0.191    0.036
                     SSE =     1.107

So SSR = SST − SSE = 5.810 − 1.107 = 4.703 and R^2 = SSR/SST = 0.8095.
Connection between R^2 and r
R^2 is SSR/SST - that's the squared sample correlation of y and ŷ.
If - and only if! - we use a linear function in x to predict y, i.e. ŷ = b0 + b1 x, the correlation between ŷ and x is 1.
Then, and only then, R^2 is equal to the squared sample correlation between y and x, i.e. r^2:

R^2 = r^2  if and only if  ŷ = b0 + b1 x
Example 6.4.6 Olympics - long jump, continued
R2 = 0.8095 = (0.8997)2 = r2 .
It is possible to go beyond simply fitting a line and summarizing the goodness of fit in terms of r and R2 to
doing inference, i.e. making confidence intervals, predictions, . . . based on the line fitting. But for that, we
need a probability model.
6.4.2 Simple Linear Regression Model
In words: for input x the output y is normally distributed with mean β0 + β1 x = µ_{y|x} and standard deviation σ.
In symbols: yi = β0 + β1 xi + εi with εi i.i.d. normal N(0, σ^2).
β0, β1, and σ^2 are the parameters of the model and have to be estimated from the data (the data pairs (xi, yi)).
Pictorially:
[Figure: for each x, the responses y scatter around the line with the normal density of y given x.]
How do we get estimates for β0, β1, and σ^2?
Point estimates: β̂0 = b0, β̂1 = b1 from the Least Squares fit (which gives β̂0 and β̂1 the name Least Squares Estimates).
And σ^2? σ^2 measures the variation around the "true" line β0 + β1 x - we don't know that line, but only b0 + b1 x. Should we base the estimation of σ^2 on this line?
The "right" estimator for σ^2 turns out to be:

σ̂^2 = (1/(n − 2)) Σ_{i=1}^n (yi − ŷi)^2 = SSE/(n − 2).
Example 6.4.7 Olympics - long jump, continued

β̂0 = b0 = 7.2037 (in m)
β̂1 = b1 = 0.0155 (in m per year)
σ̂^2 = SSE/(n − 2) = 1.107/20 = 0.055.

Overall, we assume a linear regression model of the form:

y = 7.2037 + 0.0155 x + ε,  with ε ∼ N(0, 0.055).
Constructing confidence intervals and tests will require some more probability facts about these estimators.
We will just jump to the results:
b0 and b1 are - in the way we defined them previously - unbiased estimators for β0 and β1 , respectively.
The variance of the errors, σ^2, can be estimated by:

s^2 := SSE/(n − 2)

The distributions of the parameters are:

1. b1 ∼ normal, with E[b1] = β1 and Var[b1] = σ^2 / Σ_i (xi − x̄)^2

2. ŷ = b0 + b1 x* ∼ normal with

E[ŷ] = β0 + β1 x*
Var[ŷ] = σ^2 ( 1/n + (x* − x̄)^2 / Σ_i (xi − x̄)^2 )

Note, we can take x* = 0 above and have a statement about b0.