Lecture 2

B.Sc./Cert./M.Sc. Qualif. - Statistical Theory
2 Random Variables
2.1 Introduction
There are many situations in which we are interested in some numerical value associated with
the outcome of an experiment, rather than the outcome of the experiment itself. For example,
a dealer may be more interested in the value of a particular share, at a certain time, than in
the outcome(s)/set of circumstances giving rise to that value.
Example 2.1.1
A fair coin is tossed twice. In this case, the set of possible outcomes, or the sample space, can
be conveniently described as
Ω = {TT, HT, TH, HH}
with an associated probability measure
P(TT) = P(HT) = P(TH) = P(HH) = 1/4.
For an outcome, ω ∈ Ω, define X(ω) to be the number of heads. So clearly,
X(TT) = 0, X(HT) = X(TH) = 1, X(HH) = 2.
X in this example is called a random variable, and is defined formally as follows.
Definition 2.1.2 (Random variable)
A random variable (r.v.) on a probability space (Ω, F, P) is a mapping X : Ω −→ R with
the property that {ω ∈ Ω : X(ω) ≤ x} ∈ F for each x ∈ R.
The range of X is given by the expression
RX = {x ∈ R : X(ω) = x for some ω ∈ Ω}.
We will eventually need to make statements about the 'chance' or probability that a r.v. takes
values in particular sets or intervals. The following construction takes us some way towards
addressing this issue.
Definition 2.1.3 (Distribution function)
The (cumulative) distribution function of a r.v. X is the function FX : R → [0, 1] given by
FX (x) = P(X ≤ x).
Remarks 2.1.4
(i) P(X ≤ x) is shorthand for P({ω : X(ω) ≤ x}); for this to make any sense, {ω : X(ω) ≤ x}
must be an event, i.e. a member of F, since F is the domain of P.
(ii) FX(x) = P(X ∈ (−∞, x]), and so it 'sweeps up' the probability from −∞ to x.
Example 2.1.5 (Example 2.1.1 continued)
P(X = 0) = P({ω : X(ω) = 0}) = P({TT}) = 1/4
P(X = 1) = P({ω : X(ω) = 1}) = P({HT, TH}) = P({HT}) + P({TH}) = 1/4 + 1/4 = 1/2
P(X = 2) = P({ω : X(ω) = 2}) = P({HH}) = 1/4

Hence the distribution function is

FX(x) = 0     for x < 0
      = 1/4   for 0 ≤ x < 1
      = 3/4   for 1 ≤ x < 2
      = 1     for x ≥ 2.
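To make this concrete, here is a short Python sketch (not from the original notes) that enumerates the sample space of Example 2.1.1, defines X(ω) as the number of heads, and tabulates the resulting mass and distribution functions; fractions.Fraction just keeps the arithmetic exact.

from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))        # outcomes TT, HT, TH, HH
prob = {w: Fraction(1, 4) for w in omega}    # the uniform probability measure

# p.m.f. of X: collect the mass over outcomes with X(w) = x
pmf = {}
for w in omega:
    x = w.count("H")                         # X(w) = number of heads
    pmf[x] = pmf.get(x, Fraction(0)) + prob[w]

# c.d.f. at x: 'sweep up' all mass at values <= x
def cdf(x):
    return sum(p for v, p in pmf.items() if v <= x)

print(pmf)                                   # {2: 1/4, 1: 1/2, 0: 1/4}
print([cdf(x) for x in (-1, 0, 1, 2)])       # [0, 1/4, 3/4, 1]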
We shall consider only certain types of r.v. in the ensuing discussion.
2.2 Discrete Random Variables
A random variable is said to be discrete if it takes values in some countable set {x1, x2, x3, . . .} ⊆ R.
An example of a discrete r.v. was encountered above; jump discontinuities in its distribution function appear at x1, x2, . . ..
It is also possible to characterize the distribution of a discrete r.v. X through its mass function.
Definition 2.2.1 (Probability mass function)
Suppose X is a discrete r.v. on (Ω, F, P). The probability mass function of X is the
function pX : R → [0, 1] given by
pX (x) = P(X = x) for x ∈ R.
Remarks 2.2.2
(i) All of the probability mass sits on RX , and so
pX(x) = 0 for x ∉ RX.
(ii) We should always specify the values taken by the mass function pX(x) for all x; however,
if we only specify the function on RX , it is tacitly assumed that it takes the value 0 everywhere
else.
(iii) It should be clear that

FX(x) = Σ_{y ≤ x} pX(y).
An important characterization of p.m.f.’s can be found from the following result.
Lemma 2.2.3 (Characterization of p.m.f.’s)
A function pX (x) is the p.m.f. of a discrete r.v. if, and only if,
(i) pX(x) ≥ 0 for x ∈ RX,
(ii) Σ_{x ∈ RX} pX(x) = 1.
Given a function g(x), x ∈ R, we can determine whether it is a p.m.f. by checking conditions
(i) and (ii) above. It is also often the case that we only know the functional form of the p.m.f.,
i.e. pX(x) ∝ h(x), or pX(x) = c × h(x) for some constant c; by utilizing part (ii) of the lemma
we can determine the normalizing constant, since

Σ_{x ∈ RX} c h(x) = 1  ⇒  c = 1 / Σ_{x ∈ RX} h(x).
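As an illustration of this normalization trick, here is a hedged Python sketch; the support R_X and the unnormalized form h(x) = 1/x are arbitrary illustrative choices, not taken from the notes.

from fractions import Fraction

R_X = [1, 2, 3, 4]
h = lambda x: Fraction(1, x)       # hypothetical unnormalized mass h(x) = 1/x

c = 1 / sum(h(x) for x in R_X)     # c = 1 / sum over R_X of h(x), by Lemma 2.2.3 (ii)
p = {x: c * h(x) for x in R_X}

assert sum(p.values()) == 1        # condition (ii) now holds exactly
print(c, p)                        # c = 12/25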
Note: In the rest of this course, we will focus attention on those distributions for which
pX (x) > 0 for x ∈ RX .
Next, we shall motivate the use of a number of discrete univariate distributions through some
examples.
Example 2.2.4 (Bernoulli distribution)
A coin is tossed once. Ω = {H, T}. F = {∅, {H}, {T}, Ω}.
P(∅) = 0, P({H}) = p, P({T}) = 1 − p, P(Ω) = 1
for p ∈ (0, 1).
Let
X(H) = 1, X(T) = 0.
It therefore follows that
pX (0) = P(X = 0) = 1 − p; pX (1) = P(X = 1) = p.
We say that X has a Bernoulli distribution with parameter p.
X ∼ Bernoulli(p).
Note that for this distribution, RX = {0, 1}.
Example 2.2.5 (Binomial distribution)
A coin is tossed n times; at each toss, independently, it turns up heads with probability p ∈ (0, 1) and tails with probability 1 − p.

Ω = {H, T} × {H, T} × · · · × {H, T} = {H, T}^n   (n factors)
Let X be the total number of heads. Clearly X takes values in {0, 1, 2, . . . , n}, and so, by
definition, is discrete.
Note that each outcome resulting in precisely k heads and n − k tails occurs with probability p^k (1 − p)^(n−k); the number of such outcomes is equal to the number of ways of choosing k objects from n, which is (n choose k).
So for k ∈ {0, 1, 2, . . . , n},

pX(k) = P(X = k) = P({ω : X(ω) = k}) = Σ_{ω : X(ω) = k} P({ω})
      = Σ_{ω : X(ω) = k} p^k (1 − p)^(n−k) = (n choose k) p^k (1 − p)^(n−k).
X is said to have the Binomial distribution with parameters n and p.
X ∼ Bin(n, p).
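As a quick sanity check of the formula (the values of n, p, and the number of trials below are arbitrary), one can compare the Bin(n, p) mass function with empirical frequencies from simulated coin-tossing:

import random
from math import comb

n, p, trials = 10, 0.3, 100_000

def binom_pmf(k):
    # (n choose k) p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# simulate X = number of heads in n independent biased tosses
counts = [0] * (n + 1)
for _ in range(trials):
    x = sum(random.random() < p for _ in range(n))
    counts[x] += 1

for k in range(n + 1):
    print(k, binom_pmf(k), counts[k] / trials)   # the two columns should be close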
Example 2.2.6 (Poisson distribution)
A r.v. X has a Poisson distribution with parameter λ > 0 if
pX(k) = e^(−λ) λ^k / k!   for k = 0, 1, 2, . . . ,
      = 0                 otherwise.

Another way to specify the distribution is to say that

pX(k) = e^(−λ) λ^k / k!,   k ∈ RX = {0, 1, 2, . . .}.

X ∼ Po(λ).
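Note that the mass function does satisfy Lemma 2.2.3 (ii): Σ_{k ≥ 0} e^(−λ) λ^k / k! = e^(−λ) e^λ = 1, by the exponential series. A short Python check (λ and the truncation point are arbitrary):

from math import exp, factorial

lam = 2.5
partial = sum(exp(-lam) * lam**k / factorial(k) for k in range(100))
print(partial)    # ≈ 1.0, up to floating-point error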
Example 2.2.7 (Geometric distribution)
Consider a sequence of independent trials of an experiment, where each trial has probability
p ∈ (0, 1) of resulting in success and probability 1 − p of resulting in failure.
Let X represent the number of trials until the first success. Let Ai be the event that the
i-th trial is a success; then Aci is the event that the i-th trial is not a success (i.e. a failure).
Then
P(X = n) = P(Ac1 ∩ Ac2 ∩ Ac3 ∩ . . . ∩ Acn−1 ∩ An )
= P(Ac1 )P(Ac2 )P(Ac3 ) . . . P(Acn−1 )P(An ) by independence
= (1 − p)^(n−1) p,   n ∈ RX = {1, 2, . . .}.
We say that X has a Geometric distribution with parameter p.
X ∼ Geometric(p).
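The construction can also be simulated directly. The Python sketch below (p and the replication count are arbitrary choices) repeats independent trials until the first success and compares the empirical frequencies with (1 − p)^(n−1) p:

import random

p, reps = 0.4, 100_000

def trials_to_first_success():
    n = 1
    while random.random() >= p:    # failure, probability 1 - p
        n += 1
    return n

sample = [trials_to_first_success() for _ in range(reps)]
for n in range(1, 6):
    exact = (1 - p) ** (n - 1) * p
    empirical = sample.count(n) / reps
    print(n, exact, round(empirical, 4))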
2.3 Continuous Random Variables
Definition 2.3.1 (Continuous r.v. and p.d.f.)
(i) The r.v. X is said to be continuous if its distribution function can be written as
FX(x) = ∫_(−∞)^x fX(u) du,   x ∈ R,

for some integrable function fX : R → [0, ∞).
(ii) fX(·) is called the probability density function (p.d.f.) of X.
Remarks 2.3.2
(i) X is called continuous because FX (·) is a continuous function.
(ii) Unlike the discrete case, where pX(x) = P(X = x), fX(x) ≠ P(X = x); in fact, in
this case, P(X = x) = 0 for all x ∈ R.
(iii)

P(a ≤ X ≤ b) = ∫_a^b fX(u) du,

which is the area bounded by the graph of fX(·), the lines x = a and x = b, and the x-axis.
(iv) The distribution of a continuous r.v. can be characterized via its p.d.f. rather than
its c.d.f.; indeed

fX(x) = (d/dx) FX(x).
Lemma 2.3.3
A function fX (x) is a p.d.f. for some continuous r.v. X if, and only if,
(i) fX(x) ≥ 0 for x ∈ RX;
(ii) ∫_(−∞)^∞ fX(x) dx = 1.
As in the discrete case, conditions (i) and (ii) can be used to determine whether a given function
is a p.d.f. or not, and to calculate normalizing constants if only the functional form of the p.d.f.
is known. We present some examples of the more well known p.d.f.’s.
Example 2.3.4 (Uniform distribution)
X is said to be uniformly distributed on the interval [a, b], where a, b ∈ R with a < b, if

fX(x) = 0           for x < a
      = 1/(b − a)   for a ≤ x ≤ b
      = 0           for x > b.

X ∼ Uniform(a, b).
Could write the p.d.f. as

fX(x) = 1/(b − a)   for x ∈ RX = [a, b].
Example 2.3.5 (Exponential distribution)
X is said to have an exponential distribution with parameter λ > 0 if

fX(x) = 0          for x < 0
      = λe^(−λx)   for x ≥ 0.

X ∼ Exp(λ).
Let us check the conditions (i) and (ii) of Lemma 2.3.3 for this distribution.
Setting RX = {x : x ≥ 0}, we see that fX(x) > 0 for x ∈ RX and fX(x) = 0 for x ∉ RX.
Also,
∫_(RX) λe^(−λx) dx = ∫_0^∞ λe^(−λx) dx = [−e^(−λx)]_0^∞ = 0 − (−1) = 1.
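A numeric counterpart of this check, using scipy.integrate.quad (the value of λ is arbitrary):

import numpy as np
from scipy.integrate import quad

lam = 1.7
value, err = quad(lambda x: lam * np.exp(-lam * x), 0, np.inf)
print(value)    # ≈ 1.0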
Example 2.3.6 (Normal distribution)
X is said to have a normal distribution with parameters µ, σ² ∈ R, where σ² > 0, if

fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)),   −∞ < x < ∞.

X ∼ N(µ, σ²).
If X ∼ N (0, 1), then it is said to have the standard Normal distribution. The Normal
distribution is also sometimes known as the Gaussian distribution.
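The normal density has no elementary antiderivative, so the analogous check is easiest done numerically. The Python sketch below (µ, σ, and the evaluation point are arbitrary choices) integrates the density over R and compares a partial integral with scipy's normal c.d.f.:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.0, 2.0
f = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

total, _ = quad(f, -np.inf, np.inf)
upto, _ = quad(f, -np.inf, 0.5)
print(total)                             # ≈ 1.0, Lemma 2.3.3 (ii)
print(upto, norm.cdf(0.5, mu, sigma))    # the two should agree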
Remarks 2.3.7 (More on distribution functions)
Here is a summary of some of the properties relating to the distribution function (continuous
and discrete cases):
(i) 0 ≤ FX (x) ≤ 1, −∞ < x < ∞.
lim_{x→−∞} FX(x) = 0;  lim_{x→∞} FX(x) = 1.
(ii) FX (x) is a non-decreasing function of x, i.e. x1 < x2 ⇒ FX (x1 ) ≤ FX (x2 ).
(iii) FX is continuous from the right:

FX(x+) := lim_{h↓0} FX(x + h) = FX(x),   for h > 0.

(iv)

FX(x−) := lim_{h↓0} FX(x − h) = FX(x) − P(X = x),   for h > 0.

(Recall that P(X = x) = 0 in the continuous case.)
(v) fX(x) = F′X(x) if X is continuous;
pX(x) = FX(x) − FX(x−) if X is discrete, which is equal to the jump size at x.
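Property (v) can be seen directly in Example 2.1.5; the sketch below evaluates that c.d.f. just to the left of each support point and recovers the jump sizes 1/4, 1/2, 1/4 (the small h stands in numerically for the limit FX(x−)).

def F(x):                 # the c.d.f. from Example 2.1.5
    if x < 0:  return 0.0
    if x < 1:  return 0.25
    if x < 2:  return 0.75
    return 1.0

h = 1e-9
for x in (0, 1, 2):
    jump = F(x) - F(x - h)    # F_X(x) - F_X(x-)
    print(x, jump)            # 0.25, 0.5, 0.25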
2.4 Moments
Here, we introduce some of the tools that will allow us to measure the location and spread of
probability distributions.
Definition 2.4.1 (Mean/Expectation/Expected Value)
The mean/expectation/expected value of a r.v. X on (Ω, F, P) is
E[X] = Σ_{x ∈ RX} x pX(x)     if X is discrete,
     = ∫_(RX) x fX(x) dx      if X is continuous.
E[X] is often denoted by the symbol µ.
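For instance, for X ∼ Bin(n, p) the definition gives E[X] = Σ_k k (n choose k) p^k (1 − p)^(n−k), which is known to equal np; a short Python check (n and p arbitrary):

from math import comb

n, p = 10, 0.3
mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(mean, n * p)    # both 3.0 (up to floating-point error)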
We are also interested in being able to calculate the mean of a function of a r.v., e.g. Y = g(X),
where the distribution of X is known. We could try to discover the distribution of Y first
and then proceed to compute the mean; however, this can often be very difficult and/or
complicated. The next important result provides a mechanism for bypassing such a task, based on
knowledge of X and the function g.
Lemma 2.4.2 (Law of the Unconscious Statistician)
For a r.v. X and function g : R → R
(i) if X is discrete, then

E[g(X)] = Σ_{x ∈ RX} g(x) pX(x),

provided that the R.H.S. exists;
(ii) if X is continuous, and g is a continuous function, then

E[g(X)] = ∫_(RX) g(x) fX(x) dx.

(In (ii), 'continuous' for X means that X is a continuous r.v., unlike g, which is a continuous function of x.)
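The Python sketch below illustrates the lemma on the coin-tossing r.v. of Example 2.1.5 with g(x) = x²: summing g(x) pX(x) directly agrees with first deriving the distribution of Y = g(X) and then taking its mean.

from fractions import Fraction

pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
g = lambda x: x * x

# (a) Lemma 2.4.2 (i): average g(x) against the p.m.f. of X
lotus = sum(g(x) * p for x, p in pmf.items())

# (b) the long way round: distribution of Y = g(X) first, then its mean
pmf_Y = {}
for x, p in pmf.items():
    pmf_Y[g(x)] = pmf_Y.get(g(x), Fraction(0)) + p
direct = sum(y * p for y, p in pmf_Y.items())

print(lotus, direct)    # both 3/2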
The procedure suggested by Lemma 2.4.2 will be particularly useful in our study of moments:
Definition 2.4.3 (Moment)
If k ∈ Z+ , µ = E[X], then the k-th moment of X about µ is
µk = E[(X − µ)^k]
provided that the R.H.S. exists.
Remarks 2.4.4
(i) By Lemma 2.4.2 (ii),

µk = ∫_(RX) (x − µ)^k fX(x) dx

for X continuous. The discrete case is similar, with integration replaced by summation.
(ii) E[X k ] is called the k-th moment of X about the origin.
The next quantity is used to measure the spread of a distribution.
Definition 2.4.5 (Variance)
For µ = E[X], the variance of a r.v. X is given by
var(X) = E[(X − µ)²]
provided that the R.H.S. exists.
Clearly, var(X) is just the second moment of X about µ, and is often denoted by σ². The
standard deviation of X is defined to be the square root of the variance, denoted by σ, i.e.
σ = √var(X).
We list some important properties of the expectation and variance operators.
Proposition 2.4.6 (Properties of Expectation and Variance)
Suppose X is a r.v. on (Ω, F, P) and let µ = E[X]. Then
(i) if c is a constant, E[c] = c;
(ii) E[aX + b] = aE[X] + b = aµ + b for constants a, b;
(iii) var(X) = E[X²] − (E[X])²;
(iv) var(aX + b) = a² var(X) for constants a, b.
Proof
(i) With g(x) = c for all x ∈ R, we have that

E[c] = ∫_(RX) c fX(x) dx = c ∫_(RX) fX(x) dx = c · 1 = c,

where the final equality follows from part (ii) of Lemma 2.3.3.
(ii)

E[aX + b] = ∫_(RX) (ax + b) fX(x) dx = a ∫_(RX) x fX(x) dx + b ∫_(RX) fX(x) dx
= a E[X] + b · 1 = aµ + b.
(iii)

var(X) = E[(X − µ)²] = E[X² − 2µX + µ²]
= ∫_(RX) (x² − 2µx + µ²) fX(x) dx
= ∫_(RX) x² fX(x) dx − 2µ ∫_(RX) x fX(x) dx + µ² ∫_(RX) fX(x) dx
= E[X²] − 2µ E[X] + µ² · 1
= E[X²] − (E[X])².
(iv) Since E[aX + b] = aµ + b,

var(aX + b) = E[((aX + b) − (aµ + b))²] = E[a²(X − µ)²]
= a² E[(X − µ)²] = a² var(X).
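A simulation-based sanity check of parts (iii) and (iv); the choice X ∼ Exp(1.5) and the constants a, b below are arbitrary.

import random

a, b, reps = 3.0, -2.0, 200_000
xs = [random.expovariate(1.5) for _ in range(reps)]    # X ~ Exp(1.5)

mean_x = sum(xs) / reps
var_x = sum(x * x for x in xs) / reps - mean_x**2      # E[X^2] - E[X]^2, part (iii)

ys = [a * x + b for x in xs]
mean_y = sum(ys) / reps
var_y = sum(y * y for y in ys) / reps - mean_y**2

print(var_y, a**2 * var_x)    # ≈ equal, illustrating part (iv)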