An introduction to probability
Shiu-Tang Li
Index

1 Basic probability theory
1.1 Definitions and examples
1.2 Conditional probability
1.3 Random variable
1.4 (Cumulative) Distribution functions
1.5 Discrete random variables and continuous random variables
1.6 CDF/PDF/PMF of functions of random variables
1.7 Independence
1.8 Expected value and variance of random variables
1.9 Examples of discrete random variables and continuous random variables
1.10 Joint distributions
1.11 Moment generating functions
1.12 LLN and CLT
1 Basic probability theory
1.1 Definitions and examples
Definition 1.1.1 A space (Ω, F, P ), where Ω is the sample space, F ⊂ 2Ω is the collection of sets that could be assigned probability (F is also called the collection of events), and
P is the probability measure, is called a probability space if it satisfies
(1) For any A ∈ F , 0 ≤ P (A) ≤ 1.
(2) P (Ω) = 1.
(3) If A1, A2, · · · , An, · · · ∈ F, and Ai ∩ Aj = ∅ for all i ≠ j, then P(∪_{j=1}^∞ Aj) = ∑_{j=1}^∞ P(Aj).
Remark 1.1.2 We also require F ⊂ 2Ω to be a σ-algebra, which means F satisfies the
following properties:
(1) ∅, Ω ∈ F .
(2) If A ∈ F , then Ac ∈ F .
(3) If A1, A2, · · · , An, · · · ∈ F, then ∪_{j=1}^∞ Aj ∈ F.
By DeMorgan’s law, when we have both (1) and (2), then (3) is equivalent to (4) below:
(4) If A1, A2, · · · , An, · · · ∈ F, then ∩_{j=1}^∞ Aj ∈ F.
When Ω is a countable set, that is, Ω = {a1 , · · · , aN } or Ω = {a1 , · · · , an , · · · }, and every
one point set is in F (that is, for all j we have {aj } ∈ F ), then F = 2Ω in these cases.
Remark 1.1.3 Definition 1.1.1 implies some easy properties: (a) If A ∈ F , then
P(A^c) = 1 − P(A). (b) If A1, A2, · · · , AN ∈ F, and Ai ∩ Aj = ∅ for all i ≠ j, then P(∪_{j=1}^N Aj) = ∑_{j=1}^N P(Aj). (c) If A, B ∈ F, A ⊂ B, then P(A) ≤ P(B).
Example 1.1.4 Let Ω = {1, 2, 3, 4, 5, 6}, the set of all possible outcomes when rolling a
die. We usually take F = 2Ω , and if it is a fair die we let P ({j}) = 1/6 for 1 ≤ j ≤ 6.
Example 1.1.5 Let Ω = {(1, 1), (1, 2), · · · , (1, 6), (2, 1), · · · , (6, 6)}, the set of all possible
outcomes when rolling two dice. Note that #Ω = 36. Here we take F = 2^Ω, and if the dice are fair we let P({(i, j)}) = 1/36 for 1 ≤ i, j ≤ 6.
Exercise 1.1.6 Let A ∈ F be the event that the first roll is even and the second one is not 5. Calculate #A and P(A).
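The finite sample space of Example 1.1.5 is small enough to enumerate on a computer. The sketch below is a hypothetical helper (not part of the text): it counts the outcomes in an event and divides by #Ω = 36. The event used here, "the two rolls sum to 7", is only an illustration; the same pattern can be adapted to Exercise 1.1.6.

```python
from itertools import product

# Sample space for rolling two fair dice: all ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))
assert len(omega) == 36

# Example event: the two rolls sum to 7.
event = [(i, j) for (i, j) in omega if i + j == 7]

# Under the uniform measure, P(A) = #A / #Omega.
print("#A =", len(event))
print("P(A) =", len(event) / len(omega))   # 6/36 = 1/6
```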
Example 1.1.7 Let Ω = {(♠10, ♣A, ♣7, ♦4, ♥J), (♦3, ♠2, ♥2, ♦2, ♦4), · · · , (♠3, ♥4, ♥2,
♠7, ♣A)} be the set of all possible outcomes when we draw 5 cards from a poker deck, where
the order is considered. It’s not hard to see #Ω = 52 × 51 × 50 × 49 × 48. Now we let F = 2Ω ,
and the probability for each particular hand is 1/#Ω.
Exercise 1.1.8 Let A be the event that your hand is a full house. Calculate #A and
P (A) using the probability space given in Example 1.1.7.
Example 1.1.9 We now consider a similar probability space to the one given in Example
1.1.7. Let Ω = {{♠10, ♣A, ♣7, ♦4, ♥J}, {♦3, ♠2, ♥2, ♦2, ♦4}, · · · , {♠3, ♥4, ♥2, ♠7, ♣A}}
be the set of all possible outcomes when we draw 5 cards from a poker deck, where the order
is NOT considered. We discover that #Ω = C(52, 5) = (52 × 51 × 50 × 49 × 48)/5!. We let F = 2^Ω, and the probability for each particular hand is 1/#Ω.
Exercise 1.1.10 Let A be the event that your hand is a full house. Calculate #A and
P (A) using the probability space given in Example 1.1.9.
Example 1.1.11 Tom is waiting for a bus. He knows that the bus would come every
15 minutes, and there is no fixed bus schedule. Let Ω = [0, 15], which is the collection of all
“times” he could wait. Intuitively, for each exact moment t ∈ [0, 15], P ({t}) = 0, and we
BELIEVE that for any A1 = [a, b], A2 = (a, b], A3 = [a, b), A4 = (a, b) ⊂ [0, 15], P(Aj) = (b − a)/15.
To define a ‘good’ set of events F we need knowledge from measure theory. Warning: in this
case F is not 2Ω .
Example 1.1.12 (Coin tossing) We keep flipping an unfair coin with probability p of getting a head and q of getting a tail, where of course p + q = 1. The sample space Ω is defined to be the set of all possible sequences {(a1, a2, · · · ) : aj ∈ {T, H} ∀j ∈ N}. Besides, we introduce a notation: (T, H, ·, ·, T, · · · ) := {(a1, a2, · · · ) : a1 = T, a2 = H, a5 = T, aj ∈ {T, H} for j ≠ 1, 2, 5}.
Now we define an increasing sequence of σ-algebras {Fi } on Ω as follows. Let F1 =
{∅, Ω, (T, · · · ), (H, · · · )}, F2 := {∅, Ω, (T, · · · ), (H, · · · ), (·, T, · · · ), (·, H, · · · ), (H, T, · · · ),
(T, H, · · · ), (T, T, · · · ), (H, H, · · · ), (T, T, · · · ) ∪ (H, H, · · · ), (T, H, · · · ) ∪ (H, T, · · · ),
(T, · · · )∪(·, T, · · · ), (T, · · · )∪(·, H, · · · ), (H, · · · )∪(·, T, · · · ), (H, · · · )∪(·, H, · · · )}, and so on.
Fj consists of all the events that can be assigned a probability when we toss a coin j times.
For example, we may take F2 as the σ-algebra, and we can define probability on events
in F2 , like P ({(T, H, · · · )}) = pq and P ({(H, H, · · · )}) = p2 .
Actually, the σ-algebra that we use on sequences is ⋁_{j=1}^∞ Fj, which is the smallest σ-algebra that includes every Fj. We'll see this later.
Example 1.1.13 (Standard Birthday Problem) [Saeed Ghahramani] What is the
probability that at least two students of a class of size n have the same birthday? Compute
the numerical values of such probabilities for n = 23, 30, 50, and 60. Assume that the birth
rates are constant throughout the year and that each year has 365 days.
Sol. We may let the sample space be Ω := {(k1, · · · , kn) : kj ∈ {1, · · · , 365}}, F = 2^Ω. And for each (k1, · · · , kn) ∈ Ω, P((k1, · · · , kn)) = 1/365^n. The probability that no two students have the same birthday is P(n) = (365 × 364 × · · · × (365 − (n − 1)))/365^n. When n = 23, 30, 50, and 60, the corresponding values 1 − P(n) are approximately 0.507, 0.706, 0.970, and 0.994 respectively.
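The closed-form product above is easy to evaluate numerically; the following sketch (our own quick script, rounding to three decimals) reproduces the quoted values of 1 − P(n).

```python
def prob_shared_birthday(n, days=365):
    # 1 - P(n), where P(n) = 365 * 364 * ... * (365 - (n-1)) / 365^n.
    p_no_match = 1.0
    for k in range(n):
        p_no_match *= (days - k) / days
    return 1 - p_no_match

for n in (23, 30, 50, 60):
    print(n, round(prob_shared_birthday(n), 3))   # ~0.507, 0.706, 0.970, 0.994
```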
Exercise 1.1.14 Let Ω = N, the collection of all events F = {A : A is finite or A^c is finite}, and P({n}) = 1/2^{n+1} for every n ∈ N. For any finite set An = {a1, · · · , an} ∈ F, where ai ∈ N ∀ 1 ≤ i ≤ n, define P(An) = ∑_{i=1}^n P({ai}). For any set B ∈ F such that B^c is finite, define P(B) = 1 − P(B^c).
1. Show that for any A, B ∈ F , Ac , A ∩ B ∈ F .
2. Show that this model (Ω, F, P ) satisfies (1) and (2) of Definition 1.1.1, and P (A∪B) =
P (A) + P (B) for A, B ∈ F , A ∩ B = ∅, but it fails to possess (3) of Definition 1.1.1, namely
the countable additivity property.
Sol of the last assertion. If we had the countable additivity property, then 1 = P(Ω) = P(∪_{n∈N} {n}) = ∑_{n=1}^∞ P({n}) = 1/2, which is absurd.
1.2 Conditional probability
Definition 1.2.1 Let (Ω, F, P ) be the probability space, and A, B ∈ F be two events such
that P (B) > 0. Then the conditional probability of A conditioned on B, denoted by P (A|B),
is defined to be P (A ∩ B)/P (B). If P (B) = 0, then we define P (A|B) := 0.
Theorem 1.2.2 (Law of total probability) Let {Bn}n be a sequence of mutually disjoint events s.t. ∪_{n=1}^∞ Bn = Ω. Then we have P(A) = ∑_{n=1}^∞ P(A ∩ Bn).
Remark 1.2.3 If we let B1 = B, B2 = B c , Bn = ∅ for n ≥ 3, then we have
P (A) = P (A ∩ B) + P (A ∩ B c ).
Theorem 1.2.4 (Bayes' theorem) Let {Bn}n be a sequence of mutually disjoint events s.t. ∪_{n=1}^∞ Bn = Ω. Then P(Bn|A) = P(A ∩ Bn) / ∑_{j=1}^∞ P(A ∩ Bj) = P(A|Bn)P(Bn) / ∑_{j=1}^∞ P(A|Bj)P(Bj).
The following example reveals how the information given beforehand can “change” the
probability of a certain event.
Example 1.2.5 Draw two cards from a poker deck of 52 cards, without replacement.
Determine the conditional probability that both cards are aces, given that (a) one of the
cards is the ace of spades. (b) the second card is an ace. (c) at least one of the cards is an ace.
Sol. (a)1/17. (b)1/17. (c)1/33.
Example 1.2.6 (An application of Bayes' theorem) In a village, 10% of the population has some disease. A test is administered such that if a person is sick, the test will be positive 95% of the time, and if the person is not sick, the test still has a 20% chance of being positive. If the test result of John is positive, what's the probability that he is infected?
Sol. If John does not take the test, the probability that he is infected is 10%, since we
do not have any information about him. Once we’re given some information, we could make
a more precise judgement.
Let A be the event that John is infected, and B be the event that his test result is positive. We're asked to calculate P(A|B). By Bayes' theorem, it equals P(B|A)P(A) / (P(B|A)P(A) + P(B|A^c)P(A^c)) = (95% · 10%) / (95% · 10% + 20% · 90%) ≈ 34.5%.
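A two-line computation confirms the 34.5% figure; the sketch below simply re-implements Bayes' theorem for this example (the variable names are ours, not the text's).

```python
p_sick = 0.10            # prior P(A)
p_pos_given_sick = 0.95  # P(B | A)
p_pos_given_healthy = 0.20  # P(B | A^c)

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(round(p_sick_given_pos, 4))   # 0.3455
```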
Exercise 1.2.7 [Grinstead & Snell] Prove that if P (A|C) ≥ P (B|C) and P (A|C c ) ≥
P (B|C c ), then P (A) ≥ P (B).
1.3 Random variable
Definition 1.3.1 Let (Ω, F, P ) be the probability space. A random variable, abbreviated as r.v., is a mapping X : Ω → R so that {X ∈ O} := {ω ∈ Ω : X(ω) ∈ O} ∈ F for any
open set O ⊂ R.
Remark 1.3.2 Random variables help us analyze many properties of the original abstract
space (Ω, F, P ). The condition {X ∈ O} ∈ F is the “measurability condition” - the behavior
of a random variable cannot be too wild.
Remark 1.3.3 By the definition of a r.v. X, {X ∈ D} ∈ F for every closed set D of R.
Example 1.3.4 Roll a die twice. Let Ω = {(1, 1), · · · , (1, 6), (2, 1), · · · , (6, 6)}, F = 2Ω ,
P ({(i, j)}) = 1/36 for all i, j. Let X be the total number of 1’s: It means X((1, 2)) = 1,
X((2, 6)) = 0, X((1, 1)) = 2, and so forth. It’s also easy to see {(i, j) : X(i, j) ∈ O} ∈ F for
all open set O in R.
1.4 (Cumulative) Distribution functions
Let (Ω, F, P ) be the probability space, and X be some random variable on this space.
Since {X ≤ x} ∈ F for all x ∈ R, FX (x) = P (X ≤ x) is a well-defined function on R.
Definition 1.4.1 The function FX (x) defined above is called the (cumulative) distribution function of X. It is abbreviated as CDF.
Before we proceed to the properties of CDFs, we introduce the idea of convergence of
sets and derive some more properties of probability measures.
Definition 1.4.2 A sequence of sets An is said to converge to a set A, denoted by
An → A, or A = limn→∞ An , if the following holds:
(1) For any x ∈ A, there exists N = N (x) ∈ N so that x ∈ An for all n ≥ N .
(2) For any x ∉ A, there exists M = M(x) ∈ N so that x ∉ An for all n ≥ M.
By the way, it’s not hard to see the limit limn→∞ An is unique if it exists. (Which means
the above definition is well-defined.)
Theorem 1.4.3 Let An be an increasing sequence of sets. Then lim_{n→∞} An = ∪_{n=1}^∞ An. Similarly, if An is a decreasing sequence of sets, then lim_{n→∞} An = ∩_{n=1}^∞ An.
Definition 1.4.4 Let An be a sequence of sets. We know that {∪_{n=N}^∞ An}_N is a decreasing sequence of sets and {∩_{n=N}^∞ An}_N is an increasing sequence of sets. By the previous theorem, both limits exist, and we define lim sup_n An := lim_{N→∞} ∪_{n=N}^∞ An and lim inf_n An := lim_{N→∞} ∩_{n=N}^∞ An.
Theorem 1.4.5 Let An be a sequence of sets. limn→∞ An exists if and only if lim supn An =
lim inf n An . If any one of these equivalent conditions holds, we have furthermore lim supn An =
lim inf n An = limn→∞ An .
Proof. (⇒) We argue by contraposition. Assume that x ∈ lim sup_n An and x ∉ lim inf_n An (such an x exists whenever the two limits differ, since lim inf_n An ⊂ lim sup_n An). Since x ∈ lim sup_n An, x ∈ An for infinitely many n's. Since x ∉ lim inf_n An, x ∉ ∩_{n=N}^∞ An for any N ∈ N, which means x ∈ ∪_{n=N}^∞ An^c for any N ∈ N. Therefore, x ∈ An^c for infinitely many n's. This proves An ↛ A for any candidate limit A.
(⇐) Assume that lim sup_n An = lim inf_n An. We claim that A = lim_n An exists and A = lim sup_n An = lim inf_n An. If x ∈ A, then x ∈ lim inf_n An, which implies x ∈ ∩_{n=N}^∞ An for some N ∈ N. So (1) of Definition 1.4.2 is satisfied. If x ∉ A, then x ∉ lim sup_n An, which implies x ∉ ∪_{n=N}^∞ An for all N large, which is equivalent to x ∈ ∩_{n=N}^∞ An^c for all N large, and this is exactly (2) of Definition 1.4.2.
Theorem 1.4.6 (Continuity in probability) Let An ∈ F for all n ∈ N , and An → A.
Then P (An ) → P (A).
Proof. We first prove the case An ↑ A. By Theorem 1.4.3, A = ∪_{n=1}^∞ An. We define B1 = A1, and Bn = An \ An−1 for n ≥ 2. We find that {Bn} is a sequence of pairwise disjoint events so that ∪_{j=1}^n Aj = ∪_{j=1}^n Bj. Letting n → ∞, we further have ∪_{j=1}^∞ Aj = ∪_{j=1}^∞ Bj.
By countable additivity of P, we have

lim_{n→∞} P(An) = lim_{n→∞} P(∪_{j=1}^n Aj) = lim_{n→∞} P(∪_{j=1}^n Bj)
= lim_{n→∞} ∑_{j=1}^n P(Bj) = ∑_{j=1}^∞ P(Bj)
= P(∪_{j=1}^∞ Bj) = P(∪_{j=1}^∞ Aj) = P(A).
Next we prove the case An ↓ A. Since Acn ↑ Ac , we have P (Acn ) = 1 − P (An ) → P (Ac ) =
1 − P (A), and therefore P (An ) → P (A).
For the general case, since P(∩_{n=N}^∞ An) → P(lim inf_{n→∞} An) = P(A) as N → ∞ (we've used Theorem 1.4.5 here), P(∪_{n=N}^∞ An) → P(lim sup_{n→∞} An) = P(A) as N → ∞, and P(∩_{n=N}^∞ An) ≤ P(AN) ≤ P(∪_{n=N}^∞ An), by the squeeze theorem we have P(AN) → P(A) as N → ∞.
Now we are able to study some properties of FX .
Theorem 1.4.7 Let FX (x) be the CDF of the random variable X. We have
(1) FX (x) is increasing.
(2) FX (x) is right-continuous for every x ∈ R.
(3) FX (x−), the left limit of FX (x), exists for every x ∈ R.
(4) FX (+∞) := limx→∞ FX (x) = 1.
(5) FX (−∞) := limx→−∞ FX (x) = 0.
(6) P ({X = x}) = FX (x) − FX (x−) for every x ∈ R.
Proof. (1) If x > y, then {X ≤ x} ⊃ {X ≤ y}, and FX (x) = P (X ≤ x) ≥ P (X ≤ y) =
FX (y).
(2) Let xn ↓ x, since {X ≤ xn } → {X ≤ x} (why?), by Theorem 1.4.6 we have
FX (xn ) → FX (x).
(3) First we note that {X < x} ∈ F . For any xn ↑ x, {X ≤ xn } → {X < x}, so
FX (xn ) = P (X ≤ xn ) → P (X < x). FX (x−) is therefore given by P (X < x).
(4) If xn ↑ +∞, then {X ≤ xn } → {X ∈ R} = Ω, and by Theorem 1.4.6 we have
FX (xn ) → P (Ω) = 1.
(5) Exercise.
(6) Since P ({X = x}) = P (X ≤ x) − P (X < x), the result follows from (3).
Now an interesting question arises: if we are given a right-continuous, increasing function F : R → R s.t. F(+∞) = 1 and F(−∞) = 0, can we always find some probability space (Ω, F, P) and a r.v. X so that FX(x) ≡ F(x)?
Luckily, the answer is affirmative. The interested reader may check [Varadhan].
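One standard way to realize a given CDF F, assuming only what is stated above (F increasing, right-continuous, with the correct limits), is to take Ω = (0, 1) with the uniform measure and X(ω) := inf{x : F(x) ≥ ω}, the generalized inverse of F. The sketch below illustrates this idea for the exponential CDF; it is a numerical plausibility check, not the construction carried out in [Varadhan].

```python
import math, random

def exp_cdf(x, lam=1.0):
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_quantile(u, lam=1.0):
    # Generalized inverse of the exponential CDF.
    return -math.log(1 - u) / lam

random.seed(0)
samples = [exp_quantile(random.random()) for _ in range(100_000)]

# Empirical CDF at x = 1.0 should be close to F(1.0) = 1 - e^{-1} ~ 0.632.
x = 1.0
empirical = sum(s <= x for s in samples) / len(samples)
print(round(empirical, 3), round(exp_cdf(x), 3))
```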
1.5 Discrete random variables and continuous random variables
1.5.1 Definitions
Definition 1.5.1 A r.v. X which takes its values in a finite set or a countably infinite set is said to be a discrete random variable. That is, there exist x1, x2, · · · ∈ R s.t. ∑_{i=1}^∞ P(X = xi) = 1. (By the definition of r.v.s, {X = xi} ∈ F for all i ∈ N. (why?))
Definition 1.5.2 Let X be a discrete r.v. Its (probability) mass function pX (x),
or PMF, is defined as pX (x) := P (X = x).
Remark 1.5.3 Therefore, when X is discrete, we have FX(x) = ∑_{y≤x} pX(y), where the sum has only countably many nonzero terms.
Definition 1.5.4 A r.v. X is said to be a continuous random variable with (probability) density function (PDF) fX(x) if FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(y) dy for all x ∈ R, where fX(x) ≥ 0 for all x ∈ R and fX is Riemann integrable on R.
Remark 1.5.5 The PDF fX(x) is not unique. For example, we may change the values of fX(x) at finitely many x's without changing the value of FX(x) for any x ∈ R. Actually, the PDFs of X form an equivalence class, with the equivalence relation defined by f ∼ g if ∫_{−∞}^x f(y) dy = ∫_{−∞}^x g(y) dy for all x ∈ R. Therefore, when we're talking about the PDF fX(x) of X, we're referring to an arbitrary candidate in the equivalence class.
Remarks 1.5.6 (1) ∫_{−∞}^∞ fX(y) dy = 1. (Verify it.) (2) When X is continuous, FX(x) = ∫_{−∞}^x fX(y) dy is a continuous function, and hence P(X = x) = FX(x) − FX(x−) = 0 for every x ∈ R. (3) In a more general setting (requiring knowledge of measure theory), the Riemann integrability of fX may be relaxed to Lebesgue integrability.
Theorem 1.5.7 Let X be a continuous r.v. If FX′(x0) exists for some x0 ∈ R and fX is continuous at x0, then FX′(x0) = fX(x0).
Proof. Use the fundamental theorem of calculus.
The following theorem provides us with a way to “verify” if a r.v. X is a continuous r.v.
by simply looking at its CDF. It also helps us to find a PDF candidate of X.
Theorem 1.5.8 Let X be a r.v. with CDF FX. Assume that FX is continuous on R and FX′ exists at all but finitely many points of R. Then X is a continuous r.v., with a PDF given by fX(x) = FX′(x) wherever FX′(x) exists and fX(x) = 0 wherever FX′(x) does not exist.
Proof. Assume that FX′(x) does not exist on {x1, · · · , xn}, with x1 < · · · < xn. For y′ < y ≤ x1, by the fundamental theorem of calculus we have FX(y) − FX(y′) = ∫_{y′}^y FX′(z) dz; letting y′ → −∞ we have FX(y) = ∫_{−∞}^y FX′(z) dz. For x1 < y ≤ x2, FX(y) = (FX(y) − FX(x1)) + FX(x1) = ∫_{x1}^y FX′(z) dz + ∫_{−∞}^{x1} FX′(z) dz, again by the fundamental theorem of calculus, where the continuity of FX plays a role. The rest of the proof is left to the reader.
We have to “warn” the reader that not every r.v. X is either continuous or discrete. The
reader is invited to think upon the following two examples.
Example 1.5.9 FX (x) = 0 for all x ≤ 0, FX (x) = x/2 for 0 < x < 1, and FX (x) = 1 for
all x ≥ 1.
Even if FX (x) is continuous for every x ∈ R, it may happen that X is not a continuous
r.v., as can be seen in the following example.
Example 1.5.10 Let FX be the Cantor–Lebesgue function. It can be proved that no fX exists so that FX(x) = ∫_{−∞}^x fX(y) dy for every x ∈ R. Again, the proof requires measure theory.
In 1.9 we’ll see many examples of discrete and continuous random variables, and discover
their properties.
1.6 CDF/PDF/PMF of functions of random variables
Let X be a r.v. defined on (Ω, F, P ), and g : R → R be a continuous function. It’s not
hard to see that g(X) is a r.v. by definition, for continuous functions pull open sets back to
open sets.
In this section we’ll see through some examples about how we compute the CDF/PDF/PMF
of Y = g(X).
Theorem 1.6.1 Let X be a discrete r.v. s.t. ∑_{i=1}^∞ P(X = xi) = 1, and g : R → R be some function. Then g(X) is a discrete r.v.
Proof. For each z ∈ R s.t. z = g(xi) for some i ∈ N, we have {g(X) = z} ⊃ ∪_{xi ∈ {y : g(y)=z}} {X = xi}. Summing up all these z's gives us 1 = P(Ω) ≥ ∑_z P(g(X) = z) ≥ ∑_z P(∪_{xi ∈ {y : g(y)=z}} {X = xi}) = ∑_z ∑_{xi ∈ {y : g(y)=z}} P(X = xi) = ∑_{i=1}^∞ P(X = xi) = 1, so ∑_z P(g(X) = z) = 1. We've used the fact that we may sum an absolutely convergent series in any order.
Remark 1.6.2 When X is continuous, it is possible that g(X) is continuous, discrete, or neither. We'll see this in the following examples.
Exercise 1.6.3 Let P (X = 1) = P (X = −1) = 1/6, P (X = 2) = P (X = −2) =
1/4, P (X = 0) = 1/6. Find the PMF of Y = X 2 .
Example 1.6.4 Let X be a continuous r.v. with density fX(x) = (1/√(2π)) e^{−x²/2}. Find the CDF of Y = X².

Sol. First, P(Y ≤ x) = 0 for x < 0. For x ≥ 0, P(Y ≤ x) = P(−√x ≤ X ≤ √x) = ∫_{−∞}^{√x} (1/√(2π)) e^{−y²/2} dy − ∫_{−∞}^{−√x} (1/√(2π)) e^{−y²/2} dy.

We find that FY(x) = P(Y ≤ x) is differentiable for all x ≠ 0. When x < 0, FY′(x) = 0. When x > 0, FY′(x) = (1/(2√x)) · (1/√(2π)) e^{−x/2} − (−1/(2√x)) · (1/√(2π)) e^{−x/2} = (1/√x) · (1/√(2π)) e^{−x/2}. We may therefore define the PDF fY(x) of Y by fY(0) := 0 and fY(x) = FY′(x) for x ≠ 0.
Example 1.6.5 Let X be a continuous r.v. with density fX(x) = (1/√(2π)) e^{−x²/2}. Find the CDF of Y = g(X), where g(x) = x if x > 0 and g(x) = 0 otherwise.

Sol. First, FY(x) = P(Y ≤ x) = 0 for any x < 0. We also observe that P(Y ≤ 0) = P(Y = 0) = P(X ≤ 0) = ∫_{−∞}^0 (1/√(2π)) e^{−y²/2} dy = 1/2. For x > 0, we have FY(x) = P(Y ≤ x) = P(Y ≤ 0) + P(0 < Y ≤ x) = P(Y ≤ 0) + P(0 < X ≤ x) = 1/2 + ∫_0^x (1/√(2π)) e^{−y²/2} dy.

We find that FY(x) is differentiable everywhere except for x = 0, and FY(x) has a jump of size 1/2 at x = 0. Since P(Y = y) > 0 only when y = 0, and P(Y = 0) = 1/2 < 1, Y cannot be a discrete r.v. Besides, if Y were continuous, then P(Y = y) = 0 for all y ∈ R, so Y cannot be a continuous r.v.
Example 1.6.6 Let X be a continuous r.v. with density fX(x) = (1/(b−a)) · 1_(a,b)(x). Find the CDF of Y = e^X. (Rmk. 1_A(x) := 1 if x ∈ A, and 1_A(x) := 0 if x ∉ A. It is called the indicator function of the set A.)

Sol. For x ≤ 0, FY(x) = P(e^X ≤ x) = 0. For x > 0, FY(x) = P(e^X ≤ x) = P(X ≤ ln(x)) = ∫_{−∞}^{ln(x)} (1/(b−a)) 1_(a,b)(y) dy.

For x > 0, x ≠ e^a, e^b, we have FY′(x) = (1/(b−a)) · (1/x) · 1_(a,b)(ln(x)) = (1/(b−a)) · (1/x) · 1_(e^a, e^b)(x) (to apply the chain rule to (f ∘ g)(x) at x0, f must be differentiable at g(x0)). For x < 0, FY′(x) = 0.

A "version" of the PDF of Y is given by fY(x) = FY′(x) for x ≠ 0, e^a, e^b, and fY(x) = 0 otherwise.
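The CDF derived in Example 1.6.4 can be checked by simulation; the sketch below compares the empirical CDF of Y = X² (X standard normal) with the formula P(Y ≤ x) = Φ(√x) − Φ(−√x), written here via math.erf (our own implementation choice, not part of the text).

```python
import math, random

def cdf_Y(x):
    # P(X^2 <= x) = Phi(sqrt(x)) - Phi(-sqrt(x)) for x >= 0,
    # with Phi(t) = (1 + erf(t / sqrt(2))) / 2.
    if x < 0:
        return 0.0
    phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
    return phi(math.sqrt(x)) - phi(-math.sqrt(x))

random.seed(1)
ys = [random.gauss(0, 1) ** 2 for _ in range(200_000)]
for x in (0.5, 1.0, 2.0):
    emp = sum(y <= x for y in ys) / len(ys)
    print(x, round(emp, 3), round(cdf_Y(x), 3))   # empirical vs. formula
```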
1.7 Independence
1.7.1 Independence of events
Definition 1.7.1 Let our probability space be (Ω, F, P). A collection of events {Eα}_{α∈I} ⊂ F is said to be independent if for any finite subcollection E_{αn1}, · · · , E_{αnk}, αni ∈ I, we have P(∩_{1≤i≤k} E_{αni}) = Π_{1≤i≤k} P(E_{αni}).
Remark 1.7.2 If we take the index set I = {1, 2}, the above definition reads E1 and
E2 are independent if P (E1 )P (E2 ) = P (E1 ∩ E2 ); if I = {1, 2, 3}, then the definition reads
E1 , E2 , and E3 are independent if P (E1 )P (E2 )P (E3 ) = P (E1 ∩ E2 ∩ E3 ), P (E1 )P (E2 ) =
P (E1 ∩ E2 ), P (E1 )P (E3 ) = P (E1 ∩ E3 ), and P (E2 )P (E3 ) = P (E2 ∩ E3 ).
Remark 1.7.3 Assume that P (E2 ) > 0, then E1 and E2 are independent if and only if
P (E1 ) = P (E1 |E2 ).
Definition 1.7.4 Let our probability space be (Ω, F, P ). A collection of events {Eα }α∈I ⊂
F is said to be pairwise independent if for any β, γ ∈ I, we have P (Eβ ∩ Eγ ) =
P (Eβ ) · P (Eγ ).
Remark 1.7.5 Obviously, independence implies pairwise independence.
Example 1.7.6 (pairwise independence does not imply independence) Let Ω =
{ω1 , ω2 , ω3 , ω4 }, F = 2Ω , P ({ωi }) = 1/4 for all i. We find that E1 = {ω1 , ω2 }, E2 = {ω2 , ω3 },
and E3 = {ω1 , ω3 } are pairwise independent, but not independent.
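Example 1.7.6 is small enough to check exhaustively; the sketch below (our own quick script) computes all the relevant probabilities under the uniform measure on {ω1, ω2, ω3, ω4}.

```python
from itertools import combinations

omega = {1, 2, 3, 4}                 # stand-ins for w1..w4, each with probability 1/4
P = lambda A: len(A) / len(omega)

E1, E2, E3 = {1, 2}, {2, 3}, {1, 3}
events = {"E1": E1, "E2": E2, "E3": E3}

# Pairwise independence: P(Ei & Ej) == P(Ei) P(Ej) for every pair.
for (a, A), (b, B) in combinations(events.items(), 2):
    print(a, b, P(A & B), P(A) * P(B))          # each line shows 0.25 0.25

# But not (mutually) independent: the triple product condition fails.
print(P(E1 & E2 & E3), P(E1) * P(E2) * P(E3))   # 0.0 vs 0.125
```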
Example 1.7.7 Let us look at another example. Let Ω = {ω1, ω2, ω3, ω4, ω5, ω6, ω7, ω8}, F = 2^Ω, P({ωi}) = 1/8 for all i, and E1 = {ω1, ω2, ω3, ω4}, E2 = {ω2, ω3, ω4, ω5}, and E3 = {ω4, ω5, ω6, ω7}. We have P(E1)P(E2)P(E3) = P(E1 ∩ E2 ∩ E3) = 1/8, but the three events are not independent (for example, P(E1 ∩ E2) = 3/8 ≠ 1/4 = P(E1)P(E2)).
1.7.2 Independence of r.v.s
Definition 1.7.8 Let the probability space be (Ω, F, P ). A collection of random variables
{Xα }α∈I is said to be independent if for any finite subcollection Xαn1 , · · · , Xαnk , αni ∈ I,
we have P(∩_{1≤i≤k} X_{αni}^{−1}(Oi)) = Π_{1≤i≤k} P(X_{αni}^{−1}(Oi)) for arbitrary open sets Oi ⊂ R.
Definition 1.7.9 Let the probability space be (Ω, F, P ). A collection of random variables
{Xα}_{α∈I} is said to be pairwise independent if for any β, γ ∈ I, we have P(Xβ^{−1}(O1) ∩ Xγ^{−1}(O2)) = P(Xβ^{−1}(O1)) · P(Xγ^{−1}(O2)) for any open sets O1, O2 ⊂ R.
Exercise 1.7.10 Show that we may replace all the open sets in Definition 1.7.8 and
Definition 1.7.9 with closed sets.
Exercise 1.7.11 Let X1 , · · · , Xn be independent r.v.s and f1 , · · · , fn : R → R be continuous functions. Show that f1 (X1 ), · · · , fn (Xn ) are independent.
Definition 1.7.12 A collection of random variables {Xα }α∈I is said to be independent
and identically distributed, abbreviated as i.i.d, if they are independent and they have
the same CDF.
Exercise 1.7.13 Roll a fair die twice. Let X1 be the number of the first roll and X2 the number of the second roll. If X1 and X2 are independent, show that P({X1 = i} ∩ {X2 = j}) = 1/36 for all 1 ≤ i, j ≤ 6.
Exercise 1.7.14 Show that A and B are independent if one of them is either ∅ or Ω.
Exercise 1.7.15 Let X(ω) ≡ c, c ∈ R. By the previous exercise, show that X and Y
are independent for arbitrary r.v. Y which is defined on the same probability space as is X.
Exercise 1.7.16 Let Ω = {ω1 , · · · , ωk }, F = 2Ω , and P ({ωj }) > 0 for all 1 ≤ j ≤ k.
Show that if X is independent of itself, then X(ω) ≡ c for some c ∈ R.
Sol. Assume X is not a constant r.v. Then we can take ωi ≠ ωj in Ω such that X(ωi) = a, X(ωj) = b and a ≠ b. Since X is independent of itself, we have 0 = P(∅) = P(X^{−1}({a}) ∩ X^{−1}({b})) = P(X^{−1}({a})) · P(X^{−1}({b})) ≥ P({ωi})P({ωj}) > 0, a contradiction.
1.8 Expected value and variance of random variables
Now that we’ve introduced two types of r.v.s - continuous and discrete, we may define
the expected value and variance for these two types of r.v.s.
1.8.1 Discrete case
Definition 1.8.1 Let X be a discrete r.v. s.t. ∑_{i=1}^∞ P(X = xi) = 1. The expected value of X, denoted by E[X], is defined to be ∑_{i=1}^∞ xi · P(X = xi), provided that ∑_{i=1}^∞ |xi| · P(X = xi) < ∞.
Expected value, as its name suggests, indicates the “average” amount of all possible outcomes, weighted by probabilities.
Example 1.8.2 Let (Ω, F, P ) be given as in Example 1.1.4, and let X be the number
rolled. Compute E[X].
Example 1.8.3 (Double bet strategy) John flips a fair coin, and he can bet any amount of money on a single flip: if he flips a tail, he wins the amount of money he bets; otherwise he loses the bet. He adopts the following strategy: he begins by betting 1 dollar; if he loses, he bets 2 dollars; if he loses again, he bets 4 dollars, and so on. Assume that he plays the game at most n times.

We construct the probability space as Ω = {T, HT, HHT, · · · , H···HT (n−1 H's followed by T), H···H (n H's)}, F = 2^Ω, P(T) = 1/2, P(HT) = 1/4, and so on. Let X be the total amount of money John wins. To compute E[X], by definition, E[X] = P(T) · 1 + P(HT) · 1 + · · · + P(H···HT) · 1 + P(H···H) · (−1 − 2 − · · · − 2^{n−1}) = (2^{n−1} + 2^{n−2} + · · · + 1)/2^n − (2^n − 1)/2^n = 0.
If we assume that John could play infinitely many times and is financially able to do so, then we may construct the probability space as Ω = {T, HT, HHT, · · · , H · · · HT, · · · }. In this case, he wins 1 dollar with probability one, and thus E[X] = 1, in this "impractical" model.
Definition 1.8.4 Let X be a discrete r.v. s.t. ∑_{i=1}^∞ P(X = xi) = 1. The variance of X, denoted by Var(X), is defined to be ∑_{i=1}^∞ (xi − E[X])² · P(X = xi), provided that E[X] exists. Var(X) could be +∞.
Variance measures how tightly a r.v. concentrates around its mean (expected value).
Example 1.8.5 Let Ω = {ω1, ω2, ω3, ω4, ω5}, F = 2^Ω, P({ωj}) = 1/5 for 1 ≤ j ≤ 5. Let X, Y be two r.v.s such that X(ωj) = 4 + j and Y(ωj) = 10 + 3j². Which one is larger, Var(X) or Var(Y)?
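For a finite sample space the definitions of E and Var reduce to finite sums; a direct computation for Example 1.8.5 (our own quick script) settles which variance is larger.

```python
ps = [1 / 5] * 5
X = [4 + j for j in range(1, 6)]            # X(w_j) = 4 + j
Y = [10 + 3 * j ** 2 for j in range(1, 6)]  # Y(w_j) = 10 + 3 j^2

def mean(vals):
    return sum(p * v for p, v in zip(ps, vals))

def var(vals):
    m = mean(vals)
    return sum(p * (v - m) ** 2 for p, v in zip(ps, vals))

print("Var(X) =", var(X))   # 2.0
print("Var(Y) =", var(Y))   # much larger: Y takes the spread-out values 13, 22, 37, 58, 85
```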
1.8.2 Continuous case
Definition 1.8.6 Let X be a continuous r.v. with density fX. The expected value of X, denoted by E[X], is defined to be ∫_{−∞}^∞ x · fX(x) dx, provided that ∫_{−∞}^∞ |x| · fX(x) dx < ∞.
Definition 1.8.7 Let X be a continuous r.v. with density fX. The variance of X, denoted by Var(X), is defined to be ∫_{−∞}^∞ (x − E[X])² · fX(x) dx, provided that E[X] exists. Var(X) could be +∞.
1.8.3 Properties of expected values
Theorem 1.8.8 Let X be a discrete r.v. s.t. ∑_{i=1}^∞ P(X = xi) = 1, and g : R → R be some function. Then E[g(X)] = ∑_{i=1}^∞ g(xi) · P(X = xi), provided ∑_{i=1}^∞ |g(xi)| · P(X = xi) < ∞.
Proof. First we note by Theorem 1.6.1 that g(X) is discrete. It also says ∑_{z∈A} P(g(X) = z) = 1, where A := {y : y = g(xi) for some i ∈ N}. So by definition we have E[g(X)] = ∑_{z∈A} z · P(g(X) = z). If we take a closer look at the proof of Theorem 1.6.1, we'll find that P(g(X) = z) = ∑_{xi ∈ {y : g(y)=z}} P(X = xi). This means z · P(g(X) = z) = ∑_{xi ∈ {y : g(y)=z}} g(xi) P(X = xi). Summing over all such z's gives us E[g(X)] = ∑_{i=1}^∞ g(xi) · P(X = xi).
From 1.6 we know that for some continuous function g : R → R, and some continuous
r.v. X, g(X) could be neither a continuous r.v. nor a discrete r.v. This means E[g(X)]
cannot be defined in such case. To remedy this we need to impose some restrictions on g, as
is seen in the following theorem.
Definition 1.8.9 g : R → R is called a piecewise strictly monotone function if there
exists a1 , · · · , ak ∈ R so that g is strictly monotone on (−∞, a1 ], [ak , ∞), and [ai , ai+1 ] for
all 1 ≤ i ≤ k − 1.
Theorem 1.8.10 Let X be a continuous r.v. with density fX(x), and g : R → R be a continuously differentiable and piecewise strictly monotone function. Then g(X) is a continuous r.v., and E[g(X)] = ∫_{−∞}^∞ g(x) · fX(x) dx, provided ∫_{−∞}^∞ |g(x)| · fX(x) dx < ∞ (to make sure the integral is well-defined).
Proof. For simplicity we consider the case where g is strictly increasing on (−∞, a] and [b, ∞) and strictly decreasing on [a, b]. Let g(a) = c > d = g(b). Besides, there exist a′ > a and b′ < b s.t. g(a′) = g(a) and g(b′) = g(b).

For x < d, P(g(X) ≤ x) = P(X ≤ g^{−1}(x)). Therefore, (d/dx) P(g(X) ≤ x) = (1/g′(g^{−1}(x))) · fX(g^{−1}(x)). For d < x < c, there exist a1 < a2 < a3 s.t. g(a1) = g(a2) = g(a3) = x. We have P(g(X) ≤ x) = P(a2 ≤ X ≤ a3) + P(X ≤ a1). Taking derivatives we have (d/dx) P(g(X) ≤ x) = (1/g′(a3)) · fX(a3) − (1/g′(a2)) · fX(a2) + (1/g′(a1)) · fX(a1). For x > c, (d/dx) P(g(X) ≤ x) = (1/g′(g^{−1}(x))) · fX(g^{−1}(x)).

It follows that E[g(X)] = ∫_{−∞}^d (x/g′(g^{−1}(x))) · fX(g^{−1}(x)) dx + ∫_d^c [ (x/g′(a3)) · fX(a3) − (x/g′(a2)) · fX(a2) + (x/g′(a1)) · fX(a1) ] dx + ∫_c^∞ (x/g′(g^{−1}(x))) · fX(g^{−1}(x)) dx.

First, g is strictly monotone on (−∞, b′] and on [a′, ∞), so the change-of-variable formula is applicable and we have ∫_{−∞}^d (x/g′(g^{−1}(x))) · fX(g^{−1}(x)) dx = ∫_{−∞}^{b′} (g(y)/g′(y)) · fX(y) g′(y) dy = ∫_{−∞}^{b′} g(y) · fX(y) dy. Similarly, ∫_c^∞ (x/g′(g^{−1}(x))) · fX(g^{−1}(x)) dx = ∫_{a′}^∞ g(y) · fX(y) dy.

We now observe that as x goes from d to c, a3 goes from b to a′, a2 goes from b to a, and a1 goes from b′ to a. By the change-of-variable formula again, we have ∫_d^c (x/g′(a3)) · fX(a3) dx = ∫_b^{a′} g(y) · fX(y) dy, ∫_d^c (−x/g′(a2)) · fX(a2) dx = −∫_b^a g(y) · fX(y) dy = ∫_a^b g(y) · fX(y) dy, and ∫_d^c (x/g′(a1)) · fX(a1) dx = ∫_{b′}^a g(y) · fX(y) dy.

Adding up everything, we have E[g(X)] = ∫_{−∞}^∞ g(y) · fX(y) dy.
The reason that Theorem 1.8.10 is powerful is that it provides a quick way to compute E[g(X)] without finding the PDF of g(X). We invite the reader to compute E[g(X)] in Example 1.6.5 and Example 1.6.6, with and without using Theorem 1.8.10.
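As the text invites, E[g(X)] can be checked numerically. The sketch below takes g(x) = e^x with X ∼ Unif(0, 1) (the setting of Example 1.6.6 with a = 0, b = 1) and compares a Monte Carlo estimate of E[e^X] with the integral ∫_0^1 e^x dx = e − 1; it is only a numerical illustration of Theorem 1.8.10, not part of the original text.

```python
import math, random

random.seed(2)
n = 200_000
# Monte Carlo estimate of E[e^X] for X ~ Unif(0, 1).
mc = sum(math.exp(random.random()) for _ in range(n)) / n

# Theorem 1.8.10: E[e^X] = integral of e^x * f_X(x) dx = e - 1 here.
exact = math.e - 1
print(round(mc, 4), round(exact, 4))
```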
We remind the reader that in a more general setting, E[X] is defined even when X is neither continuous nor discrete, and many of the restrictions on g in Theorem 1.8.10 can be removed in that more general framework.
Corollary 1.8.11 Let X be a continuous r.v. with density fX(x). Then E[p(X)] = ∫_{−∞}^∞ p(x) · fX(x) dx, where p(x) is a polynomial with degree ≥ 1, provided that ∫_{−∞}^∞ |p(x)| · fX(x) dx < ∞.
Corollary 1.8.12 Let X be a continuous r.v. with density fX(x). Then E[aX + b] = ∫_{−∞}^∞ (ax + b) · fX(x) dx, where a, b ∈ R, a ≠ 0, provided that ∫_{−∞}^∞ |x| · fX(x) dx < ∞.
Corollary 1.8.13 Let X be a continuous r.v. with density fX(x). Then E[g(X)] = ∫_{−∞}^∞ g(x) · fX(x) dx, where g(x) = e^x or g(x) = |x|, provided that ∫_{−∞}^∞ |g(x)| · fX(x) dx < ∞.
Corollary 1.8.14 Let X be a continuous r.v. with density fX(x), and Var(X) < ∞. Then we have Var(X) = E[(X − E[X])²] = E[X²] − (E[X])².
Proof. Var(X) = E[(X − E[X])²] by Corollary 1.8.11. To see Var(X) = E[X²] − (E[X])², note that it is true when E[X] = 0. When E[X] ≠ 0, Var(X) = ∫_{−∞}^∞ (x − E[X])² · fX(x) dx = ∫_{−∞}^∞ x² · fX(x) dx − 2E[X] ∫_{−∞}^∞ x · fX(x) dx + (E[X])² ∫_{−∞}^∞ fX(x) dx = E[X²] − (E[X])².

We've used the fact that ∞ > ∫_{−∞}^∞ (2(x − E[X])² + 2(E[X])²) · fX(x) dx ≥ ∫_{−∞}^∞ x² · fX(x) dx, so that ∫_{−∞}^∞ x² · fX(x) dx = E[X²] by Corollary 1.8.11.
Exercise 1.8.15 Try to state parallel arguments of Corollary 1.8.11 - 1.8.14, when X
is a discrete r.v.
1.9 Examples of discrete random variables and continuous random variables
1.9.1 Discrete r.v.s
Example 1.9.1 (Bernoulli r.v.) P (X = 1) = p, P (X = 0) = 1 − p, 0 < p < 1.
It means to flip a coin with probability p to get a head. X is the number of heads. We
say X satisfies the Bernoulli distribution, denoted by X ∼ Bernoulli(p).
E[X] = p · 1 + (1 − p) · 0 = p.
Var(X) = E[X²] − (E[X])² = (p · 1² + (1 − p) · 0²) − p² = p − p².
Example 1.9.2 (Binomial r.v.) P(X = k) = C(n, k) p^k (1 − p)^{n−k}, 0 < p < 1, 0 ≤ k ≤ n.

It means to flip n independent coins, each with probability p to get a head. X is the total number of heads. We say X satisfies the Binomial(n, p) distribution, denoted by X ∼ Binomial(n, p).

E[X] = ∑_{0≤k≤n} k · (n!/(k!(n−k)!)) p^k (1 − p)^{n−k} = np ∑_{1≤k≤n} ((n−1)!/((k−1)!(n−k)!)) p^{k−1} (1 − p)^{n−k} = np (p + (1 − p))^{n−1} = np.

Var(X) = E[X²] − (E[X])² = ∑_{0≤k≤n} k² · (n!/(k!(n−k)!)) p^k (1 − p)^{n−k} − (np)²
= ∑_{0≤k≤n} k(k − 1) · (n!/(k!(n−k)!)) p^k (1 − p)^{n−k} + ∑_{0≤k≤n} k · (n!/(k!(n−k)!)) p^k (1 − p)^{n−k} − (np)²
= n(n − 1)p² ∑_{2≤k≤n} ((n−2)!/((k−2)!(n−k)!)) p^{k−2} (1 − p)^{n−k} + np − (np)²
= n(n − 1)p² (p + (1 − p))^{n−2} + np − (np)² = np − np².
Example 1.9.3 (Geometric r.v.) P(X = k) = p(1 − p)^{k−1}, 0 < p < 1, k ∈ N.

It means to flip a coin with probability p to get a head, and X is the first time to get a head. We write X ∼ Geo(p).

E[X] = ∑_{k=1}^∞ k p(1 − p)^{k−1} = ∑_{k=1}^∞ p(1 − p)^{k−1} + (1 − p) · ∑_{k=1}^∞ p(1 − p)^{k−1} + (1 − p)² · ∑_{k=1}^∞ p(1 − p)^{k−1} + · · · = (1/(1 − (1 − p))) · ∑_{k=1}^∞ p(1 − p)^{k−1} = 1/p.
Var(X) = ∑_{k=1}^∞ k² p(1 − p)^{k−1} − (1/p)² = ∑_{k=1}^∞ (k² + k) p(1 − p)^{k−1} − ∑_{k=1}^∞ k p(1 − p)^{k−1} − (1/p)²
= ∑_{k=1}^∞ (k² + k) p(1 − p)^{k−1} − 1/p − (1/p)²
= ∑_{k=1}^∞ p(1 − p)^{k−1} · (2 + 4(1 − p) + 6(1 − p)² + · · · ) − 1/p − (1/p)²
= 2/p² − 1/p − 1/p² = 1/p² − 1/p.
Theorem 1.9.4 (Memoryless property of geometric r.v.s) Let X ∼ Geo(p). Then for any n, m ∈ N, P(X > m + n | X > m) = P(X > n). Conversely, if for any n, m ∈ N, P(X > m + n | X > m) = P(X > n), and ∑_{n∈N} P(X = n) = 1, then X ∼ Geo(p) for some 0 < p < 1, or P(X = 1) = 1.
Proof. The proof of the first assertion is omitted. For the second assertion, we let p = P(X = 1). By the assumption we have P(X > 2) = P(X > 1)² = (1 − p)², which shows P(X = 2) = 1 − p − (1 − p)² = p(1 − p). By induction, P(X = k) = P(X > k − 1) − P(X > k) = P(X > k − 1) − P(X > k − 1)P(X > 1) = p · (1 − p − p(1 − p) − · · · − p(1 − p)^{k−2}) = p(1 − p)^{k−1}. If 0 < p < 1, then X ∼ Geo(p). If p = 1, then P(X = 1) = 1. It cannot happen that p = 0, for by the above deductions this would make P(X = n) = 0 for all n ∈ N.
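A quick simulation makes the memoryless property concrete; the sketch below estimates P(X > m + n | X > m) and P(X > n) for a geometric r.v. (the first head in repeated coin flips), using parameter values of our own choosing.

```python
import random

def geometric(p, rng):
    # Number of flips until the first head (support 1, 2, 3, ...).
    k = 1
    while rng.random() >= p:
        k += 1
    return k

rng = random.Random(3)
p, m, n, trials = 0.3, 2, 3, 200_000
samples = [geometric(p, rng) for _ in range(trials)]

cond = [x for x in samples if x > m]
lhs = sum(x > m + n for x in cond) / len(cond)    # P(X > m+n | X > m)
rhs = sum(x > n for x in samples) / len(samples)  # P(X > n)
print(round(lhs, 3), round(rhs, 3), round((1 - p) ** n, 3))  # all near 0.343
```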
Example 1.9.5 (Poisson r.v.) P(X = k) = e^{−λ} λ^k / k!, λ > 0, k ∈ N ∪ {0}. We write X ∼ Poisson(λ).

E[X] = ∑_{k=0}^∞ k · e^{−λ} λ^k / k! = λ ∑_{k=1}^∞ e^{−λ} λ^{k−1} / (k−1)! = λ.

Var(X) = ∑_{k=0}^∞ k² · e^{−λ} λ^k / k! − λ² = ∑_{k=0}^∞ k(k − 1) · e^{−λ} λ^k / k! + λ − λ² = λ² + λ − λ² = λ.
Theorem 1.9.6 below shows the Poisson r.v. is a kind of "limit" of binomial r.v.s.

Theorem 1.9.6 Let X ∼ Poisson(λ), and Xn ∼ Binomial(n, λ/n) for all n ∈ N. Then for any k ∈ N ∪ {0}, P(X = k) = lim_{n→∞} P(Xn = k).
Proof. P(Xn = k) = (n!/(k!(n − k)!)) (λ^k/n^k) (1 − λ/n)^{n−k} = (λ^k e^{−λ}/k!) · (n(n − 1) · · · (n − k + 1)/n^k) · ((1 − λ/n)^n / e^{−λ}) · (1 − λ/n)^{−k} → λ^k e^{−λ}/k!.
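The limit in Theorem 1.9.6 is easy to observe numerically; the sketch below (with λ = 2 and k ≤ 5, our own choices) prints Binomial(n, λ/n) probabilities for increasing n next to the Poisson(λ) PMF.

```python
from math import comb, exp, factorial

lam = 2.0

def binom_pmf(n, k, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

for n in (10, 100, 1000):
    print(n, [round(binom_pmf(n, k, lam / n), 4) for k in range(6)])
print("Poisson", [round(poisson_pmf(k, lam), 4) for k in range(6)])
```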
1.9.2 Continuous r.v.s
Example 1.9.7 (Uniform r.v.) fX(x) = (1/(b−a)) 1_[a,b](x), b > a. Write X ∼ Unif(a, b).

E[X] = (1/(b−a)) ∫_a^b x dx = ((b² − a²)/2) · (1/(b−a)) = (a + b)/2.

Var(X) = (1/(b−a)) ∫_a^b x² dx − ((a+b)/2)² = (a² + ab + b²)/3 − (a² + 2ab + b²)/4 = (b − a)²/12.
Example 1.9.8 (Exponential r.v.) fX(x) = λe^{−λx} 1_[0,∞)(x), λ > 0. Write X ∼ exp(λ).

E[X] = ∫_0^∞ x λe^{−λx} dx = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 1/λ.

Var(X) = ∫_0^∞ x² λe^{−λx} dx − 1/λ² = [−x² e^{−λx}]_0^∞ + ∫_0^∞ 2x e^{−λx} dx − 1/λ² = 2/λ² − 1/λ² = 1/λ².
Theorem 1.9.9 (Memoryless property of exponential r.v.s) Let X ∼ exp(λ).
Then for any s, t ≥ 0, P (X > s + t|X > s) = P (X > t). Conversely, if for any s, t ≥ 0,
P (X > s + t|X > s) = P (X > t), and P (X > 0) = 1, then X ∼ exp(λ) for some λ > 0.
[Norris]
Proof. (⇒) For s, t ≥ 0, P(X > s + t | X > s) = P(X > s + t)/P(X > s) = e^{−λ(s+t)}/e^{−λs} = e^{−λt} = P(X > t).

(⇐) Since P(X > 0) = 1 and P(X > 1/n) → P(X > 0) (Theorem 1.4.6), we may find some n ∈ N s.t. P(X > 1/n) > 0. Also, by the assumption we have P(X > 1) = P(X > 1/n)^n > 0, and thus we may find some λ > 0 so that P(X > 1) = e^{−λ}.

For any q ∈ Q+, we may write q = m/n, m, n ∈ N. We find that P(X > m/n)^n = P(X > m) = P(X > 1)^m = e^{−λm}, which implies P(X > q) = P(X > m/n) = e^{−λ·m/n} = e^{−λq}.

Now for any t > 0, we may find pn ↑ t and qn ↓ t so that pn, qn ∈ Q+ for all n ∈ N. We have e^{−λpn} = P(X > pn) ≥ P(X > t) ≥ P(X > qn) = e^{−λqn}. Letting n → ∞ we have P(X > t) = e^{−λt} = 1 − FX(t) for t > 0. Therefore, fX(t) = FX′(t) = λe^{−λt} for t > 0.
Example 1.9.10 (Normal r.v.) fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, σ > 0, µ ∈ R. Write X ∼ N(µ, σ²). X is called a standard normal r.v. if X ∼ N(0, 1).

Exercise 1.9.11 Use the fact ∫_{−∞}^∞ e^{−x²} dx = √π to prove ∫_{−∞}^∞ (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx = 1.
E[X] = ∫_{−∞}^∞ x · (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx = ∫_{−∞}^∞ (y + µ) · (1/(σ√(2π))) e^{−y²/(2σ²)} dy
= ∫_{−∞}^∞ y · (1/(σ√(2π))) e^{−y²/(2σ²)} dy + µ ∫_{−∞}^∞ (1/(σ√(2π))) e^{−y²/(2σ²)} dy = µ.
Remarks. (1) ∫_0^∞ y · (1/(σ√(2π))) e^{−y²/(2σ²)} dy < ∞ (use the limit comparison test for integrals). (2) y · e^{−y²/(2σ²)} is an odd function.
Var(X) = ∫_{−∞}^∞ (x − µ)² · (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} dx = ∫_{−∞}^∞ y² · (1/(σ√(2π))) e^{−y²/(2σ²)} dy
= [−σ² · y · (1/(σ√(2π))) e^{−y²/(2σ²)}]_{−∞}^∞ + σ² ∫_{−∞}^∞ (1/(σ√(2π))) e^{−y²/(2σ²)} dy = σ².
Exercise 1.9.12 Let X ∼ N(0, 1). Use the integration by parts formula to show E[X^n] = 0 if n = 1, 3, 5, · · · and E[X^n] = 1 · 3 · · · (2k − 1) if n = 2k, k ∈ N.
Example 1.9.13 (Cauchy r.v.) fX(x) = (1/π) · 1/(1 + x²).

Since ∫_{−∞}^∞ |x| · (1/π) · 1/(1 + x²) dx = ∞ (prove it), E[X] does not exist. Hence Var(X) does not exist.
Example 1.9.14 (Gamma r.v.) fX(x) = (β^α/Γ(α)) x^{α−1} e^{−βx} 1_(0,∞)(x), α, β > 0, where Γ(x) := ∫_0^∞ t^{x−1} e^{−t} dt. Write X ∼ Gamma(α, β).

Remarks 1.9.15 (1) Γ(n) = (n − 1)! for all n ∈ N. E[X] = α/β, Var(X) = α/β², calculations omitted. (2) When α = 1 and β = λ, Gamma(α, β) is exactly the exp(λ) distribution.
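A few of the moments listed in this section are easy to spot-check by simulation; the sketch below does this for Unif(0, 3) and exp(2) using the standard-library generators (note that random.expovariate takes the rate λ); the parameter choices are ours.

```python
import random, statistics

rng = random.Random(4)
n = 200_000

# Unif(0, 3): E[X] = 1.5, Var(X) = (3 - 0)^2 / 12 = 0.75.
u = [rng.uniform(0, 3) for _ in range(n)]
print(round(statistics.fmean(u), 3), round(statistics.pvariance(u), 3))

# exp(2): E[X] = 1/2, Var(X) = 1/4.
e = [rng.expovariate(2.0) for _ in range(n)]
print(round(statistics.fmean(e), 3), round(statistics.pvariance(e), 3))
```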
1.10 Joint distributions
1.10.1 Definitions
Definition 1.10.1 Let X1 , · · · , Xn be r.v.s on (Ω, F, P ). The joint distribution function FX1 ,··· ,Xn (x1 , · · · , xn ) is defined to be P (X1 ≤ x1 , · · · , Xn ≤ xn ).
Definition 1.10.2 Let X1 , · · · , Xn be discrete r.v.s on (Ω, F, P ). The joint probability mass function, or joint P M F , is defined to be pX1 ,··· ,Xn (x1 , · · · , xn ) = P (X1 =
x1 , · · · , Xn = xn ).
Definition 1.10.3 Let X1, · · · , Xn be continuous r.v.s on (Ω, F, P). The joint probability density function, or joint PDF, is defined to be a Riemann integrable function fX1,··· ,Xn : R^n → R s.t. the iterated integral ∫_{−∞}^{xn} · · · ∫_{−∞}^{x1} fX1,··· ,Xn(y1, · · · , yn) dy1 · · · dyn is well-defined and equals P(X1 ≤ x1, · · · , Xn ≤ xn), and fX1,··· ,Xn ≥ 0.
We recall a theorem from advanced calculus, which provides us with a different way to
look at the iterated integral:
Theorem 1.10.4 [Folland] Let R = [a, b] × [c, d], and let f be a Riemann integrable function on R. Suppose that, for each y ∈ [c, d], the function fy defined by fy(x) = f(x, y) is integrable on [a, b], and that the function g(y) = ∫_a^b f(x, y) dx is integrable on [c, d]. Then ∫∫_R f dA = ∫_c^d [∫_a^b f(x, y) dx] dy.
Remark 1.10.5 As what we’ve mentioned in Remark 1.5.5, the joint PDF fX1 ,··· ,Xn is
not unique - they are a collection of “candidate functions” in some equivalence class.
1.10.2 Basic properties
Theorem 1.10.6 P (y1 < X1 ≤ x1 , y2 < X2 ≤ x2 ) = FX1 ,X2 (x1 , x2 ) − FX1 ,X2 (x1 , y2 ) −
FX1 ,X2 (y1 , x2 ) + FX1 ,X2 (y1 , y2 ).
Proof. Exercise. What about the case of n r.v.s?
Theorem 1.10.7 P (X1 ≤ x1 , X2 = x2 ) = FX1 ,X2 (x1 , x2 ) − limz↑x2 FX1 ,X2 (x1 , z).
Proof. Exercise. How to write P (y1 < X1 ≤ x1 , X2 = x2 , X3 > x3 ) in terms of FX1 ,X2 ,X3 ?
Theorem 1.10.8 Assume that the joint PDF of X1 and X2 exists. Then P(y1 < X1 ≤ x1, y2 < X2 ≤ x2) = ∫_{y2}^{x2} ∫_{y1}^{x1} fX1,X2(z1, z2) dz1 dz2.
Proof. The proof follows from Theorem 1.10.6. How do you extend it to multi-dimensional
case? Is it OK if we replace < with ≤ in the identity?
Theorem 1.10.9 Assume that the joint PDF of X1, X2, · · · , Xn exists. Then the PDF of Xk is given by g(yk) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ fX1,··· ,Xn(y1, · · · , yn) dy1 · · · dy_{k−1} dy_{k+1} · · · dyn, 1 ≤ k ≤ n.

Proof. Show that ∫_{−∞}^y g(yk) dyk = P(Xk ≤ y) for all y ∈ R.
Theorem 1.10.10 Assume that the joint PDF of X1 , X2 , · · · , Xn exists. Then
FX1 ,··· ,Xn (x1 , · · · , xn ) is a continuous function on Rn .
Proof. By Theorem 1.4.6, P(xk − δ < Xk ≤ xk + δ) → P(Xk = xk) = 0 as δ ↓ 0. It follows that FX1,··· ,Xn(x1 + δ, · · · , xn + δ) − FX1,··· ,Xn(x1 − δ, · · · , xn − δ) = P(∩_{1≤k≤n} {Xk ≤ xk + δ}) − P(∩_{1≤k≤n} {Xk ≤ xk − δ}) ≤ ∑_{1≤k≤n} P(xk − δ < Xk ≤ xk + δ) → 0 as δ ↓ 0. We may therefore select δ small enough so that FX1,··· ,Xn(x1 + δ, · · · , xn + δ) − FX1,··· ,Xn(x1 − δ, · · · , xn − δ) < ε for any given ε > 0. Therefore, for any (z1, · · · , zn) with ‖(z1, · · · , zn) − (x1, · · · , xn)‖_∞ < δ, FX1,··· ,Xn(x1, · · · , xn) − ε ≤ FX1,··· ,Xn(z1, · · · , zn) ≤ FX1,··· ,Xn(x1, · · · , xn) + ε. This completes the proof.
Remark 1.10.11 When X1 , · · · , Xn are discrete r.v.s of the same probability space
(Ω, F, P ), The joint PMF always exists (exercise). But it is not the case for continuous r.v.s,
as shown in the following example.
Example 1.10.12 Let X ∼ N(0, 1) and Y(ω) = X(ω) for every ω ∈ Ω. Show that the joint PDF of X and Y does not exist.

Sol. Assume the joint PDF f(x, y) of X, Y exists. Let A_{n,j,k} := {j/2^n < X ≤ (j+1)/2^n, k/2^n < Y ≤ (k+1)/2^n}, j, k ∈ Z, n ∈ N. For each n ∈ N, we have ∑_{j,k} (1/2^n)² · sup{f(x, y) : (x, y) ∈ (j/2^n, (j+1)/2^n] × (k/2^n, (k+1)/2^n]} ↓ ∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = 1 as n → ∞, since f(x, y) is Riemann integrable on R².

Therefore, we may select N ∈ N s.t. ∑_{j,k} (1/2^N)² · sup{f(x, y) : (x, y) ∈ (j/2^N, (j+1)/2^N] × (k/2^N, (k+1)/2^N]} ≤ 3/2.

Besides, 1 = P(Ω) = P(∪_{j,k} A_{N+1,j,k}) = P(∪_j A_{N+1,j,j}) = ∑_j P(A_{N+1,j,j}) = ∑_j ∫_{j/2^{N+1}}^{(j+1)/2^{N+1}} ∫_{j/2^{N+1}}^{(j+1)/2^{N+1}} f(x, y) dx dy ≤ ∑_j (1/2^{N+1})² · sup{f(x, y) : (x, y) ∈ (j/2^{N+1}, (j+1)/2^{N+1}] × (j/2^{N+1}, (j+1)/2^{N+1}]} ≤ (1/4) · 2 · ∑_j (1/2^N)² · sup{f(x, y) : (x, y) ∈ (j/2^N, (j+1)/2^N] × (j/2^N, (j+1)/2^N]} ≤ 3/4, a contradiction.
1.10.3 Properties about expected values
The following theorem generalizes Theorem 1.8.8, and the proof is omitted here.
Theorem 1.10.13 Let X1, · · · , Xn be discrete r.v.s s.t. ∑_{j1=1}^∞ · · · ∑_{jn=1}^∞ P(X1 = x_{j1}, · · · , Xn = x_{jn}) = 1, and g : R^n → R be some function. Then E[g(X1, · · · , Xn)] = ∑_{j1=1}^∞ · · · ∑_{jn=1}^∞ g(x_{j1}, · · · , x_{jn}) · P(X1 = x_{j1}, · · · , Xn = x_{jn}), provided ∑_{j1=1}^∞ · · · ∑_{jn=1}^∞ |g(x_{j1}, · · · , x_{jn})| P(X1 = x_{j1}, · · · , Xn = x_{jn}) < ∞.
The following theorem generalizes Theorem 1.8.10. It is extremely complicated, or almost
impossible, to prove it with advanced calculus. When we’ve learned measure theory, it then
becomes tractable. The proof is again omitted here.
Theorem 1.10.14 Let X1, · · · , Xn be continuous r.v.s with joint density fX1,··· ,Xn(x1, · · · , xn), and g : R^n → R be a continuous function. Then g(X1, · · · , Xn) is a r.v., and if g(X1, · · · , Xn) is either continuous or discrete, then

E[g(X1, · · · , Xn)] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ g(x1, · · · , xn) · fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn,

provided that ∫_{−∞}^∞ · · · ∫_{−∞}^∞ |g(x1, · · · , xn)| · fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn < ∞.
Example 1.10.15 Let X, Y be continuous r.v.s with joint PDF fX,Y(x, y). Then by Theorem 1.10.14, for all g : R → R continuous, E[g(X)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x) · fX,Y(x, y) dx dy = ∫_{−∞}^∞ g(x) · (∫_{−∞}^∞ fX,Y(x, y) dy) dx = ∫_{−∞}^∞ g(x) · fX(x) dx. This shows Theorem 1.10.14 is compatible with what we've learned before (see Theorem 1.8.10).
Theorem 1.10.16 Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ).
Then we have E[a1 X1 + · · · + an Xn ] = a1 E[X1 ] + · · · + an E[Xn ], where ak ∈ R for 1 ≤ k ≤ n,
provided that E[|Xk |] < ∞ for all 1 ≤ k ≤ n.
Proof. By Theorem 1.10.14,

E[a1X1 + · · · + anXn] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ (a1x1 + · · · + anxn) · fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∑_{1≤k≤n} ak ∫_{−∞}^∞ · · · ∫_{−∞}^∞ xk · fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= a1E[X1] + · · · + anE[Xn].
Remark 1.10.17 The above theorem also holds for X1 , · · · , Xn discrete (prove it). Actually, with a more general definition of expected values of r.v.s, Theorem 1.10.16 holds for
arbitrary r.v.s that live on the same probability space.
Example 1.10.18 (Coupon collector's problem) Suppose there are n different basketball player cards, and you want to collect all of them. The rule is as follows: each time you buy a card you cannot choose which one you get; you receive a random one, since the cards are all sealed.
What is the expected number of cards that you have to buy to get a full collection of n cards?
Sol. Let Xk be the number of cards you have to buy to raise the number of distinct player cards you own from k − 1 to k. Obviously X1 = 1. It's not hard to see X2 ∼ Geo((n−1)/n), X3 ∼ Geo((n−2)/n), and so on. Therefore, the expected number of cards we have to buy is E[X1 + · · · + Xn] = E[X1] + · · · + E[Xn] (by the previous remark) = 1 + n/(n−1) + n/(n−2) + · · · + n/1, which equals n ln(n) + γ · n + 1/2 + o(1) as n becomes large.
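The harmonic-sum answer is easy to compare with a simulation; the sketch below estimates the expected number of purchases for n = 50 cards and prints the exact sum n(1 + 1/2 + · · · + 1/n) next to it (the sample sizes are our own choices).

```python
import random

def cards_needed(n, rng):
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))   # buy one sealed card, uniformly at random
        count += 1
    return count

rng = random.Random(5)
n, trials = 50, 5_000
sim = sum(cards_needed(n, rng) for _ in range(trials)) / trials
exact = n * sum(1 / k for k in range(1, n + 1))
print(round(sim, 1), round(exact, 1))   # both near 225
```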
Definition 1.10.19 Let X, Y be either both continuous r.v.s with joint density fX,Y (x, y)
or both discrete r.v.s. We define the covariance of X and Y to be E[(X−E[X])·(Y −E[Y ])],
denoted by Cov(X, Y), provided that E[|X|] < ∞, E[|Y|] < ∞, and E[|X − E[X]| · |Y − E[Y]|] < ∞.
Remark 1.10.20 In general, we may define Cov(X, Y ) for any r.v.s X and Y . We’ll
talk about this in the measure-theoretic probability course.
Theorem 1.10.21 Cov(X, Y) = E[XY] − E[X] · E[Y].

Proof. We present the proof when X, Y are continuous with joint density f. By Theorem 1.10.14,

Cov(X, Y) = E[(X − E[X]) · (Y − E[Y])]
= ∫_{−∞}^∞ ∫_{−∞}^∞ (x − E[X])(y − E[Y]) f(x, y) dx dy
= ∫_{−∞}^∞ ∫_{−∞}^∞ (xy − y E[X] − x E[Y] + E[X]E[Y]) f(x, y) dx dy
= E[XY] − E[X] · E[Y].
1.10.4 Independence and joint distributions
Theorem 1.10.22 Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ),
and each Xk has PDF fXk (xk ) for 1 ≤ k ≤ n. If fX1 ,··· ,Xn (x1 , · · · , xn ) = Π1≤k≤n fXk (xk ), then
X1 , · · · , Xn are independent; conversely, if X1 , · · · , Xn are independent continuous r.v.s,
then one “version” of their joint PDF is given by Π1≤k≤n fXk (xk ).
Proof. (⇒) We first notice that (by a slight revision of Theorem 1.10.8)

P(a1 < X1 < b1, · · · , an < Xn < bn) = ∫_{an}^{bn} · · · ∫_{a1}^{b1} fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫_{an}^{bn} · · · ∫_{a1}^{b1} fX1(x1) · · · fXn(xn) dx1 · · · dxn
= Π_{1≤k≤n} P(ak < Xk < bk),
which proves the case P (X1 ∈ O1 , · · · , Xn ∈ On ) = Π1≤k≤n P (Xk ∈ Ok ) for Ok = (ak , bk ).
We then consider O1 to be a general open set in R, and Ok = (ak, bk) for k ≥ 2. Since we may write O1 as a countable disjoint union of open intervals, say O1 = ∪_{j=1}^∞ (a1j, b1j) (see [Munkres]; the union may include infinite open intervals), we thus have
P(X1 ∈ O1, · · · , Xn ∈ On) = ∑_{j=1}^∞ P(X1 ∈ (a1j, b1j), · · · , Xn ∈ On)
= ∑_{j=1}^∞ P(X1 ∈ (a1j, b1j)) × · · · × P(Xn ∈ On)
= P(X2 ∈ O2) · · · P(Xn ∈ On) × ∑_{j=1}^∞ P(X1 ∈ (a1j, b1j))
= P(X1 ∈ O1) · · · P(Xn ∈ On).
Next we consider O1 , O2 to be general open sets in R, and Ok = (ak , bk ) for k ≥ 3, and
then O1 , O2 , O3 to be general open sets in R, and so on, following the same deductions.
(⇐) Since

P(X1 ≤ a1, · · · , Xn ≤ an) = P(X1 < a1, · · · , Xn < an) (why?)
= P(X1 < a1) · · · P(Xn < an)
= ∫_{−∞}^{a1} fX1(x1) dx1 × · · · × ∫_{−∞}^{an} fXn(xn) dxn
= ∫_{−∞}^{an} · · · ∫_{−∞}^{a1} fX1(x1) · · · fXn(xn) dx1 · · · dxn,

it follows that Π_{1≤k≤n} fXk(xk) is a version of their joint PDF.
Corollary 1.10.23 If X1 , · · · , Xn are independent continuous r.v.s, then their joint
PDF exists.
Theorem 1.10.24 (Independence implies zero correlation) Let X, Y be independent r.v.s. If they are both continuous or both discrete, then Cov(X, Y ) = 0.
Proof. (Continuous case) By Theorem 1.10.22, the joint PDF of X, Y is given by fX(x)fY(y). By Theorem 1.10.14 and Theorem 1.10.21,

Cov(X, Y) = E[XY] − E[X]E[Y]
= ∫_{−∞}^∞ ∫_{−∞}^∞ xy fX(x)fY(y) dx dy − ∫_{−∞}^∞ x fX(x) dx · ∫_{−∞}^∞ y fY(y) dy
= 0.

(Discrete case) Let X, Y be discrete r.v.s s.t. ∑_{i=1}^∞ ∑_{j=1}^∞ P(X = xi, Y = yj) = 1. We have

Cov(X, Y) = E[XY] − E[X]E[Y]
= ∑_{i=1}^∞ ∑_{j=1}^∞ xi yj P(X = xi, Y = yj) − ∑_{i=1}^∞ xi P(X = xi) · ∑_{j=1}^∞ yj P(Y = yj)
= ∑_{i=1}^∞ ∑_{j=1}^∞ xi yj P(X = xi)P(Y = yj) − ∑_{i=1}^∞ xi P(X = xi) · ∑_{j=1}^∞ yj P(Y = yj) (by independence)
= 0.
Remark 1.10.25 The above theorem can be generalized to arbitrary independent r.v.s
X, Y .
Example 1.10.26 (Zero correlation does not imply independence) Let Ω = {ω1, ω2, ω3}, P({ωi}) = 1/3 for all i, X(ω1) = −1, X(ω2) = 0, X(ω3) = 1, and Y = X². It's not hard to see Cov(X, Y) = E[XY] − E[X]E[Y] = 0 − 0 · 2/3 = 0, but P(X = 1, Y = 0) = 0 ≠ 1/9 = P(X = 1) · P(Y = 0).
Theorem 1.10.27 Let X1, · · · , Xn be independent continuous r.v.s or independent discrete r.v.s. Then Var(X1 + · · · + Xn) = ∑_{i=1}^n Var(Xi).
Proof. We'll prove the continuous case only. By Theorem 1.10.14,

Var(X1 + · · · + Xn) = E[(X1 + · · · + Xn − E[X1 + · · · + Xn])²]
= ∫_{−∞}^∞ · · · ∫_{−∞}^∞ (x1 + · · · + xn − E[X1 + · · · + Xn])² fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫_{−∞}^∞ · · · ∫_{−∞}^∞ ( ∑_{j=1}^n (xj − E[Xj])² + 2 ∑_{1≤i<j≤n} (xi − E[Xi])(xj − E[Xj]) ) fX1(x1) · · · fXn(xn) dx1 · · · dxn
= ∑_{i=1}^n Var(Xi).
1.10.5 Density transformation formula
Let X1 , · · · , Xn be continuous r.v.s with joint density fX1 ,··· ,Xn (x1 , · · · , xn ). We’re interested in computing the joint PDF of g1 (X1 , · · · , Xn ), · · · , gn (X1 , · · · , Xn ) given that g1 , · · · , gn
are “nice” enough. For example, if X, Y are continuous r.v.s with joint density f , how do
we compute the joint PDF of U = X + Y and V = XY? We will learn the technique in this section.
Let us first recall the inverse mapping theorem (or inverse function theorem) from [Folland], and the change-of-variable formula from [Folland] (with some revisions).
Theorem 1.10.28 (Inverse function theorem) Let U and V be open sets in R^n, a ∈ U, and b = f(a). Suppose that f : U → V is a mapping of class C¹ and the Fréchet derivative f′(a) is invertible (that is, the Jacobian det(∂f/∂x)|_{x=a} is nonzero). Then there exist neighborhoods M ⊂ U and N ⊂ V of a and b, respectively, so that f is a one-to-one map from M to N, and the inverse map f^{−1} from N to M is also of class C¹. Moreover, if y = f(x) ∈ N, then (f^{−1})′(y) = (f′(x))^{−1}.
Theorem 1.10.29 (Change-of-variable formula) Given open sets U and V in R^n, let G : U → V be a one-to-one transformation of class C¹ whose derivative G′(u) is invertible for all u ∈ U. If f is integrable on G(U), then f ∘ G is integrable on U, and

∫ · · · ∫_{G(U)} f(x) d^n x = ∫ · · · ∫_U f(G(u)) · |∂G/∂u| d^n u.
And now it’s time for the main course of this section.
Theorem 1.10.30 (Density transformation formula) Let X1, · · · , Xn be continuous r.v.s with joint density fX1,··· ,Xn(x1, · · · , xn), and let g : R^n → R^n be a C¹-mapping s.t. g′(x) is invertible for all x ∈ R^n, and g is 1-1 with inverse h. Then the joint PDF of (U1, · · · , Un) := g(X1, · · · , Xn) exists, and it is given by fU1,··· ,Un(u1, · · · , un) = fX1,··· ,Xn(h(u1, · · · , un)) 1_{g(R^n)}(u1, · · · , un) · |∂h/∂u|.
Proof. Fix an arbitrary open set O in R^n. By Theorem 1.10.28, g(R^n) is an open set (why?), and h is of class C¹ on g(R^n), so that h′(u) is invertible for all u ∈ g(R^n). Besides, g^{−1}(O) = h(O ∩ g(R^n)) is an open set, since g is a continuous function. We have

P((U1, · · · , Un) ∈ O) = P(g(X1, · · · , Xn) ∈ O)
= P((X1, · · · , Xn) ∈ g^{−1}(O))
= ∫ · · · ∫_{g^{−1}(O)} fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn (why?)
= ∫ · · · ∫_{h(O ∩ g(R^n))} fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫ · · · ∫_{O ∩ g(R^n)} fX1,··· ,Xn(h(u1, · · · , un)) · |∂h/∂u| du1 · · · dun (Theorem 1.10.29)
= ∫ · · · ∫_O fX1,··· ,Xn(h(u1, · · · , un)) 1_{g(R^n)}(u1, · · · , un) · |∂h/∂u| du1 · · · dun.

If we take O = {(u1, · · · , un) : ui < ai}, then it follows that fX1,··· ,Xn(h(u1, · · · , un)) 1_{g(R^n)}(u1, · · · , un) · |∂h/∂u| is a version of the joint PDF of (U1, · · · , Un).
Theorem 1.10.31 (Density transformation formula, a slight generalization) Let X1, · · · , Xn be continuous r.v.s with joint density fX1,··· ,Xn(x1, · · · , xn), and let g : E → R^n be a C¹-mapping s.t. g′(x) is invertible for all x ∈ E, where E is an open set in R^n, and g is 1-1 with inverse h. Besides, we assume P((X1, · · · , Xn) ∉ E) = 0. Then the joint PDF of (U1, · · · , Un) := g(X1, · · · , Xn) exists, and it is given by fU1,··· ,Un(u1, · · · , un) = fX1,··· ,Xn(h(u1, · · · , un)) 1_{g(E)}(u1, · · · , un) · |∂h/∂u|.

Proof. Fix an arbitrary open set O in R^n. By Theorem 1.10.28, g(E) is an open set, and h is of class C¹ on g(E), so that h′(u) is invertible for all u ∈ g(E). Besides, g^{−1}(O) = h(O ∩ g(E)) is an open set, since g is a continuous function.
We have

P((U1, · · · , Un) ∈ O) = P(g(X1, · · · , Xn) ∈ O)
= P((X1, · · · , Xn) ∈ g^{−1}(O))
= ∫ · · · ∫_{g^{−1}(O)} fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫ · · · ∫_{g^{−1}(O) ∩ E} fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫ · · · ∫_{h(O ∩ g(E))} fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫ · · · ∫_{O ∩ g(E)} fX1,··· ,Xn(h(u1, · · · , un)) · |∂h/∂u| du1 · · · dun (Theorem 1.10.29)
= ∫ · · · ∫_O fX1,··· ,Xn(h(u1, · · · , un)) 1_{g(E)}(u1, · · · , un) · |∂h/∂u| du1 · · · dun.

If we take O = {(u1, · · · , un) : ui < ai}, then it follows that fX1,··· ,Xn(h(u1, · · · , un)) 1_{g(E)}(u1, · · · , un) · |∂h/∂u| is a version of the joint PDF of (U1, · · · , Un).
Example 1.10.32 Let X and Y be two independent standard normal random variables. What is the joint density of U = X + Y and V = X − Y?

Sol. Note that g : (x, y) ↦ (x + y, x − y) is a C¹ mapping from R² to R², and g′(x, y) = [[1, 1], [1, −1]] is invertible for all (x, y) ∈ R². To compute the inverse h of g, we let u = x + y, v = x − y, so that x = (u + v)/2 and y = (u − v)/2, and thus we have h : (u, v) ↦ ((u + v)/2, (u − v)/2), where h′(u, v) = [[1/2, 1/2], [1/2, −1/2]]. Also we notice that fX,Y(x, y) = fX(x)fY(y) = (1/√(2π)) e^{−x²/2} · (1/√(2π)) e^{−y²/2}, and g(R²) = R². By Theorem 1.10.30,

fU,V(u, v) = fX,Y(h(u, v)) 1_{g(R²)}(u, v) · |∂h/∂u|
= fX((u + v)/2) fY((u − v)/2) · |det [[1/2, 1/2], [1/2, −1/2]]|
= (1/(2π)) · e^{−(u+v)²/8} · e^{−(u−v)²/8} · (1/2) = (1/(4π)) e^{−u²/4 − v²/4}.

Here is an interesting fact. By Theorem 1.10.9, fU(u) = ∫_{−∞}^∞ (1/(4π)) e^{−u²/4 − v²/4} dv = (1/(2√π)) e^{−u²/4} ∫_{−∞}^∞ (1/(2√π)) e^{−v²/4} dv = (1/(2√π)) e^{−u²/4}, and similarly fV(v) = (1/(2√π)) e^{−v²/4}. It follows that fU,V(u, v) = fU(u) fV(v), which shows U = X + Y and V = X − Y are independent by Theorem 1.10.22.
Example 1.10.33 Let X and Y be continuous random variables with joint PDF fX,Y. Compute the PDF of X + Y.

Sol. Construct a C¹-mapping g : R² → R², (x, y) ↦ (x + y, y), and note that g(R²) = R² and g′(x, y) = [[1, 1], [0, 1]] is invertible for all (x, y) ∈ R². To compute the inverse h of g, we let u = x + y, v = y, so that x = u − v and y = v, and thus we have h : (u, v) ↦ (u − v, v), where h′(u, v) = [[1, −1], [0, 1]]. By Theorem 1.10.30,

fU,V(u, v) = fX,Y(h(u, v)) 1_{g(R²)}(u, v) · |∂h/∂u|
= fX,Y(u − v, v) · |det [[1, −1], [0, 1]]|
= fX,Y(u − v, v),

so the PDF of U = X + Y is given by fU(u) = ∫_{−∞}^∞ fU,V(u, v) dv = ∫_{−∞}^∞ fX,Y(u − v, v) dv.
Remark 1.10.34 (Convolution Theorem) In the previous example, if X, Y are independent, then the PDF of X + Y is given by fU(u) = ∫_{−∞}^∞ fX(u − v) · fY(v) dv = (fX ∗ fY)(u).
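As an illustration of Remark 1.10.34, the sketch below compares a histogram-style empirical density of X + Y for two independent Unif(0, 1) r.v.s with the convolution formula, which in this case gives the triangular density f(u) = u on [0, 1] and 2 − u on [1, 2] (a standard computation, stated here without derivation; the bin width is our own choice).

```python
import random

def triangle_density(u):
    # (f_X * f_Y)(u) for X, Y ~ Unif(0, 1) independent.
    if 0 <= u <= 1:
        return u
    if 1 < u <= 2:
        return 2 - u
    return 0.0

rng = random.Random(6)
n, width = 400_000, 0.1
sums = [rng.random() + rng.random() for _ in range(n)]
for center in (0.25, 0.75, 1.0, 1.5):
    lo, hi = center - width / 2, center + width / 2
    emp = sum(lo < s <= hi for s in sums) / (n * width)   # P(bin) / bin width
    # bin-averaged, so only approximately equal to the density near the peak
    print(center, round(emp, 3), round(triangle_density(center), 3))
```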
Example 1.10.35 Let X and Y be independent, standard normal random variables.
Find the distribution of X/Y .
Sol. Construct a C¹-mapping g : E → R², (x, y) ↦ (x/y, y), where E := {(x, y) ∈ R² : y ≠ 0}, and P((X, Y) ∉ E) = P(Y = 0) = 0. Besides, g(E) = E and g′(x, y) = [[1/y, −x/y²], [0, 1]] is invertible for all (x, y) ∈ E. To compute the inverse h of g, we let u = x/y, v = y, so that x = uv and y = v, and thus we have h : (u, v) ↦ (uv, v), where h′(u, v) = [[v, u], [0, 1]], which is invertible for all (u, v) ∈ g(E) = E. By Theorem 1.10.31,

fU,V(u, v) = fX,Y(h(u, v)) 1_{g(E)}(u, v) · |∂h/∂u|
= fX,Y(h(u, v)) · |∂h/∂u| (why?)
= fX,Y(uv, v) · |det [[v, u], [0, 1]]|
= fX(uv) fY(v) |v|
= (1/(2π)) e^{−u²v²/2} · e^{−v²/2} |v|.
As a result, fU(u) = ∫_R fU,V(u, v) dv = 2 ∫_0^∞ (1/(2π)) v e^{−u²v²/2} e^{−v²/2} dv = (1/π) · [−(1/(1 + u²)) e^{−(1+u²)v²/2}]_{v=0}^{v=∞} = (1/π) · 1/(1 + u²) (X/Y is Cauchy!), and F_{X/Y}(x) = P(X/Y ≤ x) = ∫_{−∞}^x (1/π) · 1/(1 + u²) du = [(1/π) tan^{−1}(u)]_{−∞}^x = (1/π)(tan^{−1}(x) + π/2).
1.11 Moment generating functions
Definition 1.11.1 Let X be a discrete r.v. or a continuous r.v. The moment generating function (abbreviated as MGF) of X is defined to be MX(t) := E[e^{tX}] for t ∈ R.
Remarks 1.11.2 (1) MX(0) = 1. (2) Let t ≠ 0. When X is discrete with ∑_{i=1}^∞ P(X = xi) = 1, then MX(t) = ∑_{i=1}^∞ e^{t·xi} P(X = xi) by Theorem 1.8.8. When X is continuous with PDF fX(x), then MX(t) = ∫_{−∞}^∞ e^{tx} fX(x) dx by Theorem 1.8.10.
We’ll soon see the power of MGF in the next few theorems.
Theorem 1.11.3 (MGF determines CDF) If X and Y are both continuous r.v.s or
both discrete r.v.s with MGF MX and MY respectively, and there exists an open interval I
containing 0 so that MX (t) = MY (t) for all t ∈ I, then the CDF of X and Y are the same,
namely FX (x) ≡ FY (x).
Proof. The proof is omitted. The interested reader could see [Varadhan] for a proof, which
requires the knowledge of characteristic functions.
Theorem 1.11.4 Let X1, · · · , Xn be independent continuous r.v.s or independent discrete r.v.s. If we let Y = X1 + · · · + Xn, then MY(t) = Π_{i=1}^n MXi(t) for all t ∈ R.
Proof. We prove the continuous case only. By Theorem 1.10.14 and Theorem 1.10.22, if t is chosen s.t. E[e^{tXi}] < ∞ for all i, then

MY(t) = E[e^{tY}] = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ e^{t(x1 + ··· + xn)} · fX1,··· ,Xn(x1, · · · , xn) dx1 · · · dxn
= ∫_{−∞}^∞ · · · ∫_{−∞}^∞ e^{t(x1 + ··· + xn)} · fX1(x1) · · · fXn(xn) dx1 · · · dxn
= Π_{i=1}^n ∫_{−∞}^∞ e^{txi} fXi(xi) dxi = Π_{i=1}^n MXi(t).

Otherwise, if t is chosen s.t. E[e^{tXi}] = ∞ for some i, then MY(t) = Π_{i=1}^n MXi(t) = ∞.
Table 1.11.5 Here we list the MGF of some common distributions. The readers are
invited to compute any of them if interested.
Distribution        MGF M(t)
Bernoulli(p)        1 − p + pe^t
Binomial(n, p)      (1 − p + pe^t)^n
Geo(p)              pe^t / (1 − (1 − p)e^t), for t < −ln(1 − p); +∞ otherwise
Poisson(λ)          e^{λ(e^t − 1)}
Unif(a, b)          (e^{tb} − e^{ta}) / (t(b − a))
exp(λ)              1/(1 − t/λ), for t < λ; +∞ otherwise
N(µ, σ²)            e^{tµ + σ²t²/2}
Cauchy              does not exist (+∞ for all t ≠ 0)
Gamma(α, β)         (1 − t/β)^{−α}, for t < β
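As a quick numerical check of one table entry: for X ∼ exp(λ), the integral ∫_0^∞ e^{tx} λe^{−λx} dx should equal 1/(1 − t/λ) when t < λ. The sketch below approximates the integral with a crude Riemann sum (the truncation and step size are our own choices).

```python
import math

lam, t = 2.0, 0.5          # need t < lam
dx, upper = 1e-4, 40.0     # crude truncation of the infinite integral

riemann = sum(math.exp(t * x) * lam * math.exp(-lam * x) * dx
              for x in (i * dx for i in range(int(upper / dx))))
closed_form = 1 / (1 - t / lam)
print(round(riemann, 4), round(closed_form, 4))   # both ~ 1.3333
```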
Example 1.11.6 (Important!) Let X1, · · · , Xn be independent r.v.s, and Y = X1 + · · · + Xn. Based on Theorem 1.11.3, Theorem 1.11.4, and Table 1.11.5, we have the following observations.
(1) If Xi ∼ Bernoulli(p) for all 1 ≤ i ≤ n, then Y ∼ Binomial(n, p).
(2) If Xi ∼ Binomial(Ni, p) for all 1 ≤ i ≤ n, then Y ∼ Binomial(N1 + · · · + Nn, p).
(3) If Xi ∼ Poisson(λi) for all 1 ≤ i ≤ n, then Y ∼ Poisson(λ1 + · · · + λn).
(4) If Xi ∼ exp(λ) for all 1 ≤ i ≤ n, then Y ∼ Gamma(n, λ).
(5) If Xi ∼ N(µi, σi²) for all 1 ≤ i ≤ n, then Y ∼ N(µ1 + · · · + µn, σ1² + · · · + σn²).
(6) If Xi ∼ Gamma(αi, β) for all 1 ≤ i ≤ n, then Y ∼ Gamma(α1 + · · · + αn, β).
Next we’ll see how the “moment generating” function MX (t) generates moments.
Theorem 1.11.7 Let X be either a continuous or a discrete r.v., so that there exists an open interval I containing 0 on which MX(t) < ∞. Then the n-th moment E[X^n] of X exists and we have (d^n/dt^n) MX(t)|_{t=0} = E[X^n].
Proof. (In class). Some details are omitted - we need the dominated convergence theorem
from measure theory.
1.12 LLN and CLT
In this section we state without proof two very important theorems, the law of large numbers and the central limit theorem.
Theorem 1.12.1 (Strong law of large numbers, SLLN) Let X1, · · · , Xn, · · · be a sequence of i.i.d. r.v.s so that E[|X1|] < ∞. Then P({ω : lim_{n→∞} (X1(ω) + · · · + Xn(ω))/n = E[X1]}) = 1.
Theorem 1.12.2 (Weak law of large numbers, WLLN) Let X1, · · · , Xn, · · · be a sequence of i.i.d. r.v.s so that E[|X1|] < ∞. Then for any ε > 0, lim_{n→∞} P({ω : |(X1(ω) + · · · + Xn(ω))/n − E[X1]| > ε}) = 0.
Remark 1.12.3 As the name suggests, SLLN implies WLLN. But we would not give a
proof here.
Theorem 1.12.4 (Central limit theorem, CLT) Let X1, · · · , Xn, · · · be a sequence of i.i.d. r.v.s so that E[X1²] < ∞. Then for any x ∈ R, we have lim_{n→∞} P((X1 + · · · + Xn − nE[X1]) / √(n · Var(X1)) ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−y²/2} dy.
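Both theorems are easy to observe in simulation; the sketch below averages i.i.d. Unif(0, 1) r.v.s to illustrate the LLN, and checks the CLT by estimating P(Z_n ≤ 1) for the standardized sum against Φ(1) ≈ 0.8413 (the parameter choices and sample sizes are ours).

```python
import math, random

rng = random.Random(7)
mu, var = 0.5, 1 / 12          # mean and variance of Unif(0, 1)

# LLN: the sample mean approaches E[X1] = 0.5 as n grows.
for n in (10, 1000, 100_000):
    print(n, round(sum(rng.random() for _ in range(n)) / n, 4))

# CLT: P((S_n - n*mu) / sqrt(n*var) <= 1) should be close to Phi(1).
n, trials, hits = 1_000, 20_000, 0
for _ in range(trials):
    s = sum(rng.random() for _ in range(n))
    if (s - n * mu) / math.sqrt(n * var) <= 1:
        hits += 1
print(round(hits / trials, 3), round(0.5 * (1 + math.erf(1 / math.sqrt(2))), 3))
```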