
Statistical Modeling and Methods 1 midterm study guide

Stigler 1-4
Cameron Boulanger
October 2023
Introduction
Chapter 1 [[Conditional Probability]] [[Counting]] [[Random Variables]] [[Binomial Experiments]] [[Probability Distributions]] Probabilities of Events
**Properties of Probability**
scaling: P(S) = 1 and 0 ≤ P(E) for all E in S
additivity: if E and F are mutually exclusive, P(E ∪ F) = P(E) + P(F)
complementarity: P(E) + P(E^c) = 1 for all E in S
general additivity: for any E and F in S,
P(E ∪ F) = P(E) + P(F) − P(E ∩ F)
finite additivity: for any finite collection of mutually exclusive events E1, E2, . . . , En,
P(⋃_{i=1}^n Ei) = Σ_{i=1}^n P(Ei)
countable additivity: for any countably infinite collection of mutually exclusive events E1, E2, . . . in S,
P(⋃_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei)
**Sample Space**: the set of all possible outcomes, denoted S
**An Event:** a set of possible outcomes (say E), which is a subset of S
**Mutually Exclusive:** if E and F have no common elements they are said to be mutually exclusive (E ∩ F = ∅)
Conditional Probability **Conditional probability:** The probability that E occurs given that F has occurred is defined to be
P(E|F) = P(E ∩ F) / P(F)
if P(F) > 0. It can be thought of as relative probability: P(E|F) is the probability of E relative to the reduced sample space consisting of only those outcomes in the event F.
**General Multiplication** For any events E and F in S,
P (E ∩ F ) = P (F )P (E|F )
**Independent events** We say events E and F in S are independent if P(E) = P(E|F). If E and F are independent then P(E|F) = P(E|F^c). If E and F are independent then P(E ∩ F) = P(E) · P(F)
Counting **Permutations** The number of ways of choosing r objects from n distinguishable objects where the order of choice makes a difference; the number of permutations of n objects taken r at a time is given by
Pr,n = n! / (n − r)!
and we also have
Pr,n = (1 · 2 · 3 · … · n) / (1 · 2 · … · (n − r)) = (n − r + 1) · … · (n − 1) · n
**Why divide by (n − r)!** After choosing r items, there are n − r items left, and we are not arranging them. So, we divide by the number of arrangements of these remaining items.
E.g., for arranging 2 letters out of A, B, C, the numerator 3! represents all arrangements of A, B, C, but we only want arrangements of 2 letters, so we divide by the arrangements of the remaining 1 letter, (3 − 2)! = 1!.
**Combinations** The number of ways of choosing r objects from n distinguishable objects where the order of choice makes no difference:
Cr,n = (n choose r) = n! / (r!(n − r)!)
**Why divide by r!** In permutations, different orders of the same set of
items are counted separately. But in combinations, we don’t want to count
them separately. So, we divide by the number of arrangements (permutations)
of those r items, which is r!. E.g., for selecting 2 letters out of A, B, C, we
calculate all arrangements (3!), divide by the arrangements of the chosen 2 (2!),
and divide by the arrangements of the remaining 1 (1!).
We can also derive
Pr,n = (1 · 2 · 3 · … · n) / (1 · 2 · … · (n − r)) = (n − r + 1) · … · (n − 1) · n
and
(n choose r) = Pr,n / r! = ((n − r + 1) · … · (n − 1) · n) / (1 · 2 · 3 · … · (r − 1) · r)
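These counting rules can be checked directly in Python; the standard library already provides them as `math.perm` and `math.comb` (the helper names `P` and `C` below are just for this sketch):

```python
from math import comb, factorial, perm

def P(r, n):
    """Number of ordered choices of r from n: n! / (n - r)!"""
    return factorial(n) // factorial(n - r)

def C(r, n):
    """Number of unordered choices of r from n: P(r, n) / r!"""
    return P(r, n) // factorial(r)

# Arranging 2 letters out of A, B, C: 3!/1! = 6 ordered; 6/2! = 3 unordered
assert P(2, 3) == perm(3, 2) == 6
assert C(2, 3) == comb(3, 2) == 3
```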
Stirling’s Formula **Stirling’s Formula**
log_e(n!) ≈ (1/2) log_e(2π) + (n + 1/2) log_e(n) − n
thus
n! ≈ √(2π) · n^(n+1/2) · e^(−n)
where ≈ means the ratio of the two sides tends to 1 as n increases.
This formula can be used to derive approximations for both Pr,n and Cr,n:
Pr,n ≈ (1 − r/n)^(−(n−r+1/2)) · n^r · e^(−r)
Cr,n ≈ (1/√(2πn)) · (1 − r/n)^(−(n−r+1/2)) · (r/n)^(−(r+1/2))
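A quick numerical check of Stirling's formula (a sketch; the tolerance 1/(10n) reflects that the relative error shrinks roughly like 1/(12n)):

```python
from math import exp, factorial, pi, sqrt

def stirling(n):
    """Stirling's approximation: n! ≈ sqrt(2*pi) * n**(n + 1/2) * exp(-n)."""
    return sqrt(2 * pi) * n ** (n + 0.5) * exp(-n)

# The ratio of approximation to exact value tends to 1 as n grows
for n in (5, 20, 50):
    ratio = stirling(n) / factorial(n)
    assert abs(ratio - 1) < 1 / (10 * n)  # error is roughly 1/(12n)
```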
Random Variables **Random Variables** a function that assigns a numerical value to each outcome in S; a real-valued function defined on S
**Discrete Random Variables** random variables whose values can be listed sequentially (0, 1, 2, 3, . . .)
**Probability distributions of discrete random variables** A list of possible
values of a discrete random variable together with the probabilities of these
values
the probability that random variable X is equal to the possible value x is
denoted by pX (x) or when there is no likely confusion p(x)
the probability distribution can be described in many ways, e.g. pX(x) = (3 choose x)(1/2)^3 for x = 0, 1, 2, 3
**Cumulative distribution function** (CDF)
FX(x) = P(X ≤ x) = Σ_{a ≤ x} pX(a)
A helpful interpretation is that the probability distribution can be thought of as a distribution of unit mass spread across the real line; thus FX(x) gives the cumulative mass starting from the left, up to and inclusive of the point x.
You can derive the probability of x from FX(x) (for integer-valued X) as
pX(x) = FX(x) − FX(x − 1)
Binomial Experiments **The class of Binomial experiments** Binomial experiments are characterized by:
- The experiment consists of a series of n independent trials
- The possible outcomes of a single trial are classified as one of two types: A: success, A^c: failure
- The probability of success on a single trial, P(A) = θ, is the same for all n trials; this probability θ is called the **Parameter of the Experiment**
For binomial experiments we are usually only interested in a numerical summary of the outcome, the random variable X = number of successes = number of A's
**The Binomial (n, θ) Distributions** b(x; n, θ) = (n choose x) θ^x (1 − θ)^(n−x) for x = 0, 1, 2, . . . , n and = 0 otherwise
**Parametric Family of probability distributions** The family of probability distributions where parameters determine the distribution, such that for every possible value of the parameters we have a different distribution. A binomial distribution is an example, with n and θ as parameters.
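A minimal sketch of the Binomial (n, θ) pmf; since the pmf enumerates all possible outcomes, it must sum to 1 over x = 0, . . . , n:

```python
from math import comb

def binom_pmf(x, n, theta):
    """b(x; n, theta) = (n choose x) * theta**x * (1 - theta)**(n - x)."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)

# The pmf over x = 0..n sums to 1
n, theta = 5, 0.3
assert abs(sum(binom_pmf(x, n, theta) for x in range(n + 1)) - 1) < 1e-12
```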
**Bernoulli Trials** Instead of conducting a fixed number n of trials, the trials are conducted until a fixed number r of successes has been observed. Because it is a reversal of the original procedure, it is called a **negative binomial experiment**.
In this type of experiment the random variable of interest is Z = number of "failures" before the rth "success". For r = 1 (failures before the first success) we have pZ(z) = (1 − θ)^z θ for z = 0, 1, 2, . . .
**Negative Binomial Distribution** The probability distribution of the number of failures Z before the rth success, in a series of Bernoulli trials with probability of success θ, is
nb(z; r, θ) = (r + z − 1 choose r − 1) θ^r (1 − θ)^z
for z = 0, 1, 2, . . . and = 0 otherwise
**Relations between the two binomial distributions** Let B(x; n, θ) = P(X ≤ x) and NB(z; r, θ) = P(Z ≤ z) be their respective CDFs. If we are computing X and Z from the same sequence of n trials, we have X ≥ r if and only if Z ≤ n − r. Since P(X ≥ r) = 1 − P(X ≤ r − 1), this means
NB(n − r; r, θ) = 1 − B(r − 1; n, θ)
The binomial distributions have other symmetries, particularly
B(x; n, θ) = 1 − B(n − x − 1; n, 1 − θ)
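The relation between the two CDFs can be verified numerically; the functions `B` and `NB` below are direct translations of the two distributions (the function names are mine):

```python
from math import comb

def B(x, n, theta):
    """Binomial CDF: P(X <= x) for X ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(x + 1))

def NB(z, r, theta):
    """Negative binomial CDF: P(Z <= z), Z = failures before the r-th success."""
    return sum(comb(r + k - 1, r - 1) * theta**r * (1 - theta) ** k for k in range(z + 1))

n, r, theta = 10, 4, 0.3
# NB(n - r; r, theta) = 1 - B(r - 1; n, theta)
assert abs(NB(n - r, r, theta) - (1 - B(r - 1, n, theta))) < 1e-12
# Symmetry: B(x; n, theta) = 1 - B(n - x - 1; n, 1 - theta)
assert abs(B(3, n, theta) - (1 - B(n - 3 - 1, n, 1 - theta))) < 1e-12
```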
Continuous Distributions **Continuous Random Variable** A random variable
whose possible values form an interval
**Probability Density Functions** a non-negative function which gives the probabilities of intervals through the area under the function over the interval. We define fX(x) to be the probability density function of the continuous random variable X if for any numbers c and d with c < d,
P(c < X ≤ d) = ∫_c^d fX(x) dx
The following statements are necessarily true: fX(x) ≥ 0 for all x, and ∫_{−∞}^{∞} fX(x) dx = 1.
We can think of fX(x) dx (height fX(x) times base dx) as the probability that X falls in an infinitesimal interval at x:
P(x < X ≤ x + dx) = fX(x) dx
P (X = c) = 0 for any c, as a consequence of using probability densities to
describe distributions
**Cumulative Distribution Function of a continuous random variable** for a continuous random variable X,
FX(x) = P(X ≤ x) = ∫_{−∞}^x fX(u) du
and we see that FX(x) is the area under fX(x) to the left of x. In this case, although FX(x) is non-decreasing as in the discrete case, it is no longer a jump function. It also gives another way to describe continuous distributions:
(d/dx) ∫_{−∞}^x fX(u) du = fX(x)
so
(d/dx) FX(x) = fX(x)
**Exponential Distribution** Consider an experiment where X is the time until failure; X is a continuous random variable with possible values {x : 0 ≤ x < ∞}. To specify a class of probability distributions for X, we would expect the probability of surviving beyond time t, P(X > t), to be decreasing as t → ∞. One class of decreasing functions which also have P(X > 0) = 1 are the exponentially decreasing functions P(X > t) = C^t where 0 < C < 1; equivalently, writing C^t = e^(−θt) where θ is a fixed parameter, we have
P(X > t) = e^(−θt), (t ≥ 0)
and
FX(t) = P(X ≤ t) = 1 − e^(−θt)
The corresponding probability density function follows from differentiation: fX(t) = θe^(−θt) for t ≥ 0 and fX(t) = 0 for t < 0
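A small sketch of the exponential survival/CDF/density relationships, checking fX = (d/dt)FX by a numerical derivative (θ = 2 is an arbitrary choice):

```python
from math import exp

theta = 2.0  # arbitrary rate parameter for the sketch

def survival(t):
    """P(X > t) = exp(-theta * t)."""
    return exp(-theta * t)

def cdf(t):
    """F_X(t) = 1 - exp(-theta * t)."""
    return 1 - exp(-theta * t)

def pdf(t):
    """f_X(t) = theta * exp(-theta * t), obtained by differentiating F_X."""
    return theta * exp(-theta * t)

# The pdf matches the numerical derivative of the CDF, and survival + CDF = 1
t, h = 0.7, 1e-6
assert abs((cdf(t + h) - cdf(t - h)) / (2 * h) - pdf(t)) < 1e-6
assert abs(survival(t) + cdf(t) - 1) < 1e-12
```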
Transformations of Random Variables
**Strictly monotone transformations** if Y = h(X) is a strictly monotone transformation of X, we can solve for X in terms of Y, that is, find the **inverse transformation** X = g(Y). If Y = h(X) = 2X + 3 then X = g(Y) = (Y − 3)/2; if Y = h(X) = log_e(X) for X > 0 then X = g(Y) = e^Y; if Y = h(X) = X^2 for X > 0 then X = g(Y) = √Y
**Discrete case** if pX(x) is the probability function of X, then the probability function of Y is
pY(y) = P(Y = y) = P(h(X) = y) = P(X = g(y)) = pX(g(y))
Example: X has the previously seen binomial distribution, so we have pX(x) = (3 choose x)(.5)^3 for x = 0, 1, 2, 3, and Y = X^2. What is the distribution of Y? We have g(y) = +√y and thus pY(y) = pX(√y) = (3 choose √y)(.5)^3 for y = 0, 1, 4, 9, which gives: pY(y) = 1/8 for y = 0, 3/8 for y = 1, 3/8 for y = 4, 1/8 for y = 9, and = 0 otherwise
**Continuous Case** we have
fY(y) = fX(g(y)) · |dg(y)/dy|
when |g′(y)| is small, x = g(y) is changing slowly as y changes, and we scale down; when |g′(y)| is large we scale up, as x changes rapidly with y. To show the correctness of this factor in the equation, compute P(Y ≤ a) in two different ways. First,
P(Y ≤ a) = ∫_{−∞}^a fY(y) dy
by definition of fY(y). Second, supposing for a moment that h(x) is monotone increasing, we have
P(Y ≤ a) = P(h(X) ≤ a) = P(X ≤ g(a)) = ∫_{−∞}^{g(a)} fX(x) dx
Now change variables: x = g(y) and dx = g′(y) dy, and we have
P(Y ≤ a) = ∫_{−∞}^a fX(g(y)) g′(y) dy
Differentiating both expressions with respect to a, we have fY(y) = fX(g(y)) g′(y)
**Example** Let X be the time to failure of the first of two lightbulbs, and Y the probability that the second lightbulb lasts longer than the first; we have Y = h(X) = e^(−θX). The random time X has density
fX(x) = θe^(−θx), (x ≥ 0)
Now log_e(Y) = −θX, and the inverse transformation is X = g(Y) = −log_e(Y)/θ. We find g′(y) = −(1/θ) · (1/y) and
|g′(y)| = 1/(θy), (y > 0)
Then
fY(y) = fX(g(y)) |g′(y)|
and, noting that fX(g(y)) = 0 for y ≤ 0 and y > 1, we have
fY(y) = θe^(−θ(−log(y)/θ)) · 1/(θy)
or, since θe^(−θ(−log(y)/θ)) = θy,
fY(y) = 1, (0 < y ≤ 1)
We recognize this as the **Uniform (0,1) distribution**
**Probability Integral transformation** the transformation h(x) = FX(x). To find the distribution of Y = h(X) we need to differentiate g(y) = FX^(−1)(y), the **Inverse Cumulative Distribution Function**: the function that for each y, 0 < y < 1, gives the value of x for which FX(x) = y
Aside: For continuous random variables X with densities, FX(x) is continuous and thus an x will exist for all 0 < y < 1; for more general random variables, FX^(−1)(y) can be defined as FX^(−1)(y) = inf{x : FX(x) ≥ y}. The derivative of g(y) = FX^(−1)(y) can be found by implicit differentiation:
y = FX(x)
⇒ 1 = (d/dy) FX(x) = fX(x) · (dx/dy)
by the chain rule, and so
dx/dy = 1/fX(x)
or, with x = g(y) = FX^(−1)(y),
g′(y) = (d/dy) FX^(−1)(y) = 1/fX(FX^(−1)(y))
But then
fY(y) = fX(g(y)) |g′(y)| = fX(FX^(−1)(y)) · 1/fX(FX^(−1)(y)) = 1, (0 < y < 1)
= 0 otherwise
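The probability integral transformation can be checked numerically for the exponential distribution, whose inverse CDF is FX^(−1)(y) = −log_e(1 − y)/θ; the product fX(g(y)) · |g′(y)| should equal 1 for every y in (0, 1):

```python
from math import exp, log

theta = 2.0  # arbitrary rate for the sketch

def F(x):
    """Exponential(theta) CDF."""
    return 1 - exp(-theta * x)

def F_inv(y):
    """Inverse CDF: solve 1 - exp(-theta * x) = y for x, 0 < y < 1."""
    return -log(1 - y) / theta

def f(x):
    """Exponential(theta) density."""
    return theta * exp(-theta * x)

# g'(y) = 1 / f(F_inv(y)), so f(g(y)) * |g'(y)| = 1: the Uniform (0,1) density
for y in (0.1, 0.5, 0.9):
    h = 1e-6
    g_prime = (F_inv(y + h) - F_inv(y - h)) / (2 * h)  # numerical derivative
    assert abs(f(F_inv(y)) * abs(g_prime) - 1) < 1e-6
    assert abs(F(F_inv(y)) - y) < 1e-12
```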
Chapter 2 [[Expectation]] [[Probability Distributions]] [[Linear Transformations]] [[Transformations]] [[Normal Distributions]] **General** If we state that random variable X has a binomial (5, .3) distribution, we are saying that the probability distribution is given by P(X = x) = (5 choose x)(.3)^x(.7)^(5−x) for x = 0, 1, 2, 3, 4, 5. Analogously, if X has an exponential (2.3) distribution we mean fX(x) = 2.3e^(−2.3x) for x ≥ 0
Expectation The expectation of a random variable is a weighted average of its possible values (weighted by its probability distribution). We denote the expectation of X by E(X).
**Discrete Case**
E(X) = Σ_x x pX(x)
**Continuous Case**
E(X) = ∫_{−∞}^{∞} x fX(x) dx
We will refer to E(X) as both the *mean* and *expected value* of X. The notation suggests that E(X) is a function of X; however, E(X) is a number, and is better described as a function of the probability distribution of X.
The expectation of X summarizes the distribution of X by describing its center.
**Alternate descriptor of center** Another descriptor of center is the mode, which is the most probable value in the discrete case, and the value with the highest density in the continuous case (where it can be found by solving (d/dx) fX(x) = 0).
The Beta Distribution
The Beta (α, β) distribution is a member of a parametric family of distributions that will be useful for inference
**The Gamma Function** A generalization of the factorial. A definite integral that was studied by Euler:
Γ(a) = ∫_0^∞ x^(a−1) e^(−x) dx
If a is an integer n, this integral can be evaluated by repeated integration by parts to give
Γ(n) = (n − 1)!
The function has the properties
Γ(a) = (a − 1)Γ(a − 1), ∀a > 1
Γ(.5) = √π
Stirling's formula also applies here:
Γ(a + 1) ≈ √(2π) · a^(a+1/2) · e^(−a)
The family of Beta Distributions **Probability density function** The probability density function is
fX(x) = [Γ(α+β)/(Γ(α)Γ(β))] · x^(α−1) (1 − x)^(β−1) for 0 ≤ x ≤ 1, and = 0 otherwise
where α, β > 0.
When α = β = 1 we have fX(x) = 1 for 0 ≤ x ≤ 1, which is the Uniform (0,1) distribution.
If α = β the density is symmetric about 1/2; the larger α and β are, the more concentrated the distribution is around its center.
If α, β ∈ Z then we can say that
Γ(α+β)/(Γ(α)Γ(β)) = (α+β−1)!/((α−1)!(β−1)!) = (α+β−2)!(α+β−1)/((α−1)!(β−1)!) = (α+β−1) · (α+β−2 choose α−1)
Thus we can rewrite the density
fX(x) = [Γ(α+β)/(Γ(α)Γ(β))] · x^(α−1) (1 − x)^(β−1) for 0 ≤ x ≤ 1, = 0 otherwise
as
fX(x) = (α+β−1) · (α+β−2 choose α−1) · x^(α−1) (1 − x)^(β−1)
The name for the beta distribution comes from a fact in classical analysis:
B(α, β) = ∫_0^1 x^(α−1) (1 − x)^(β−1) dx
is called the **Beta Function**, and it relates to the gamma function as
B(α, β) = Γ(α)Γ(β)/Γ(α+β)
which is the reciprocal of the coefficient of fX(x). Thus it becomes evident that
∫_0^1 fX(x) dx = 1
**Expectation for the Beta Distribution** the expression for the expectation, for α, β > 0, is
E(X) = α/(α+β)
To show this, utilize ∫_0^1 fX(x) dx = 1. By definition
E(X) = ∫_0^1 x · [Γ(α+β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1) dx = [Γ(α+β)/(Γ(α)Γ(β))] ∫_0^1 x^α (1 − x)^(β−1) dx
Now we manipulate the integrand by rescaling it so it becomes a probability density, and multiplying the rest of the expression by the reciprocal to preserve value:
E(X) = [Γ(α+β)/(Γ(α)Γ(β))] · [Γ(α+1)Γ(β)/Γ(α+β+1)] · ∫_0^1 [Γ(α+β+1)/(Γ(α+1)Γ(β))] x^α (1 − x)^(β−1) dx
The integrand is now the Beta (α+1, β) density, so the integral is 1; thus
E(X) = [Γ(α+β)/(Γ(α)Γ(β))] · [Γ(α+1)Γ(β)/Γ(α+β+1)] = [Γ(α+β)/Γ(α+β+1)] · [Γ(α+1)/Γ(α)] = (1/(α+β)) · α = α/(α+β)
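A numeric sanity check of E(X) = α/(α + β), integrating x · fX(x) with a midpoint rule (the step count is arbitrary):

```python
from math import gamma

def beta_expectation(a, b, steps=100_000):
    """Numerically integrate x * f(x) over (0, 1) for the Beta(a, b) density."""
    coef = gamma(a + b) / (gamma(a) * gamma(b))
    dx = 1 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx  # midpoint of the i-th subinterval
        total += x * coef * x ** (a - 1) * (1 - x) ** (b - 1) * dx
    return total

a, b = 2.0, 3.0
assert abs(beta_expectation(a, b) - a / (a + b)) < 1e-6
```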
Expectations of Transformations If we have the random variable Y = h(X), where X is also a random variable, to find E(Y) we could do the following:
fY(y) = fX(g(y)) |g′(y)|
where g = h^(−1), and then calculate
E(Y) = ∫_{−∞}^{∞} y fY(y) dy
or, combining the steps,
E(Y) = ∫_{−∞}^{∞} y fX(g(y)) |g′(y)| dy
However, this is unnecessary; there is a simpler method for both the discrete and continuous case.
**Discrete Case**
E(h(X)) = Σ_x h(x) pX(x)
**Continuous Case**
E(h(X)) = ∫_{−∞}^{∞} h(x) fX(x) dx
To see why this works, consider that we are making the change of variable x = g(y), y = h(x), dx = g′(y) dy; thus we have
E(Y) = ∫_{−∞}^{∞} y fX(g(y)) |g′(y)| dy = ∫_{−∞}^{∞} h(x) fX(x) dx
In general E(h(X)) ≠ h(E(X)), although equality does hold for linear transformations.
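A sketch of the simpler method in the discrete case, using the Binomial (3, .5) distribution from earlier; it also shows E(h(X)) ≠ h(E(X)) for h(x) = x^2:

```python
from math import comb

# X ~ Binomial(3, 0.5); p(x) = (3 choose x) * 0.5**3
p = {x: comb(3, x) * 0.5**3 for x in range(4)}

# E(h(X)) = sum over x of h(x) * p(x)
E_X = sum(x * p[x] for x in p)
E_X2 = sum(x**2 * p[x] for x in p)

assert E_X == 1.5
assert E_X2 == 3.0       # = (0 + 3 + 12 + 9) / 8
assert E_X2 != E_X**2    # E(h(X)) != h(E(X)) in general
```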
**Example** Suppose X has the standard Normal distribution with density
φ(x) = (1/√(2π)) e^(−x²/2), −∞ < x < ∞
What is E(Y), where Y = X²? The distribution of Y is the Chi-square distribution with density
fY(y) = (1/√(2πy)) e^(−y/2), y > 0
Thus
E(Y) = ∫_0^∞ y · (1/√(2πy)) e^(−y/2) dy = (1/√(2π)) ∫_0^∞ √y e^(−y/2) dy
We can evaluate this integral by making the change of variable z = y/2, dz = (1/2) dy:
∫_0^∞ √y e^(−y/2) dy = 2√2 ∫_0^∞ z^(1/2) e^(−z) dz = 2√2 Γ(1.5) = 2√2 · (.5)Γ(.5) = √2 · √π = √(2π)
Thus
E(Y) = (1/√(2π)) · √(2π) = 1
Alternatively we can obtain this by calculating
E(X²) = ∫_{−∞}^{∞} x² (1/√(2π)) e^(−x²/2) dx
which would have been easier had we not been given fY(y).
Linear Transformations The simplest and most used of the transformations. An example is Y = h(X) = aX + b. **Theorem** For any constants a and b,
E(aX + b) = aE(X) + b
Proof: we have
E(aX + b) = ∫_{−∞}^{∞} (ax + b) fX(x) dx = a ∫_{−∞}^{∞} x fX(x) dx + b ∫_{−∞}^{∞} fX(x) dx = aE(X) + b · 1 = aE(X) + b
If b = 0, E(aX) = aE(X), and if a = 0, E(b) = b
We will adopt the notation E(X) = µX = µ
Variance Expectation is a measure of the center of a probability distribution; variance is another important measure, as it measures the spread of the distribution. It measures this spread by asking how far, on average, we can expect X to be from the center of its distribution. That distance is X − E(X) = X − µX, and since there is no regard to its sign, we have E|X − µX|. Although this measure may seem the most natural way of measuring dispersion, we use a different measure, due both to mathematical convenience and to the fact that the alternate form arises naturally from theoretical considerations.
The measure is called **variance**:
Var(X) = E[(X − µX)²] = σ²X
Variance is difficult to interpret because it is defined in terms of square units, so we define the **standard deviation**
σX = √Var(X) = √(E[(X − µX)²])
This is different from our intuitive measure defined earlier, since E|X − µX| = E(√((X − µX)²)).
The following device simplifies the calculation of variance:
Var(X) = E[(X − µ)²] = E(X² − 2µX + µ²) = E(X²) + E(−2µX) + E(µ²)
Now we have
E(−2µX) = −2µE(X) = −2µ · µ = −2µ², and E(µ²) = µ²
thus
Var(X) = E(X²) − µ² = E(X²) − (E(X))²
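The computational device can be confirmed on the Binomial (3, .5) distribution, where Var(X) = nθ(1 − θ) = 0.75:

```python
from math import comb

# X ~ Binomial(3, 0.5)
p = {x: comb(3, x) * 0.5**3 for x in range(4)}
mu = sum(x * p[x] for x in p)

# Direct definition: Var(X) = E[(X - mu)**2]
var_direct = sum((x - mu) ** 2 * p[x] for x in p)
# Shortcut: Var(X) = E(X**2) - mu**2
var_short = sum(x**2 * p[x] for x in p) - mu**2

assert abs(var_direct - var_short) < 1e-12
assert abs(var_direct - 0.75) < 1e-12  # n * theta * (1 - theta) = 3 * .5 * .5
```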
Linear Change of Scale The most common transformation is a linear change of scale, Y = aX + b. **Variance of Y** Theorem: For any constants a and b,
Var(aX + b) = a² Var(X)
Proof: By definition, Var(aX + b) is the expectation of
[(aX + b) − E(aX + b)]² = [aX + b − (aµX + b)]² = (aX − aµX)² = a²(X − µX)²
so
Var(aX + b) = E[a²(X − µX)²] = a² E[(X − µX)²] = a² Var(X)
and we can immediately deduce σaX+b = |a|σX Something to note is that
neither the variance nor the standard deviation is affected by b, this means that
the spread of the distribution is unaffected by a shift of the origin by b units.
**Standard Form** By a linear change of scale we can arrange to have a random variable expressed with expectation zero and variance one. This is accomplished by transforming X by subtracting its expectation and dividing by its standard deviation:
W = (X − µX)/σX
Note that W = aX + b for the special choices a = 1/σX and b = −µX/σX.
The Normal (µ, σ²) distributions. The standard Normal distribution with continuous random variable X has density
φ(x) = (1/√(2π)) e^(−x²/2), −∞ < x < ∞
And we define the Normal (µ, σ²) distribution as that of Y = σX + µ, which has density
fY(y) = (1/(√(2π)σ)) e^(−(y−µ)²/(2σ²))
The standard Normal has mean and variance µX = 0 and σ²X = 1, so for the distribution of Y we have
E(Y) = σE(X) + µ = µ, Var(Y) = σ² Var(X) = σ²
or µY = µ, σ²Y = σ², σY = σ.
We consider the inverse of the transformation that defines Y:
X = (Y − µ)/σ
And we see that X can be said to be Y expressed in standard form; that is where the name standard Normal comes from. Usually we write Normal (µ, σ²) as N(µ, σ²)
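The same standardizing arithmetic can be applied to a small (arbitrary) data set; after the transformation the mean is 0 and the variance is 1:

```python
# Put values in "standard form": W = (X - mu) / sigma  (data values are arbitrary)
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = sum(xs) / len(xs)
var = sum((x - mu) ** 2 for x in xs) / len(xs)
sigma = var ** 0.5

ws = [(x - mu) / sigma for x in xs]
w_mu = sum(ws) / len(ws)
w_var = sum((w - w_mu) ** 2 for w in ws) / len(ws)

assert abs(w_mu) < 1e-12       # expectation shifted to 0
assert abs(w_var - 1) < 1e-12  # variance rescaled to 1
```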
Chapter 3 [[Covariance]] [[Multivariate Distributions]] [[Bivariate Distributions]] Discrete Bivariate Distributions If X and Y are two random variables on the same sample space S, that means they were defined in reference to the same experiment. We define the **Bivariate Probability Function**:
p(x, y) = P(X = x, Y = y)
p(x, y) may be thought of as describing the distribution of unit mass in the (x, y) plane, with p(x, y) representing the mass assigned to (x, y); p(x, y) is the height of the spike at (x, y). As in the univariate case, the total over all possible points must be one:
Σ_x Σ_y p(x, y) = 1
**Example** Consider the experiment of tossing a fair coin three times, and then independently tossing a second coin three times. Let X = number of heads for the first coin, Y = number of tails for the second coin, and Z = number of tails for the first coin.
The coins are independent, so for any pair (x, y) of X and Y we have, if {X = x} stands for the event X = x,
p(x, y) = P(X = x, Y = y) = P({X = x} ∩ {Y = y}) = P({X = x}) · P({Y = y}) = pX(x) · pY(y)
On the other hand, X and Z refer to the same coin, so we have
p(x, z) = P(X = x, Z = z) = P({X = x} ∩ {Z = z}) = P({X = x}) = pX(x) if z = 3 − x, and = 0 otherwise
This is because we must necessarily have x + z = 3, which means {X = x} and {Z = 3 − x} describe the same event. If z ≠ 3 − x then {X = x} and {Z = z} are mutually exclusive and the probability that both occur is zero.
If we have a bivariate probability function p(x, y), then we can recover the univariate distributions:
pX(x) = Σ_y p(x, y), pY(y) = Σ_x p(x, y)
The intuition is that we can decompose {X = x} into a collection of smaller sets
{X = x} = {X = x, Y = 0} ∪ {X = x, Y = 1} ∪ . . .
The events on the right-hand side run through all possible values of Y, yet all the events are mutually exclusive, so the probability of the right-hand side is the sum Σ_y p(x, y), and the LHS is pX(x)
Univariate distributions in a multivariate context are called **marginal probability functions**; you can always find marginal distributions from the bivariate distribution, but in general you can't go the other way. The marginal distributions tell us about the probability of all possible values of one variable but have no regard for the other variables.
What is needed is information on how knowing one variable's outcome affects another. **Conditional Probability function**
p(y|x) = P(Y = y|X = x)
the probability that Y = y given X = x; this is the same notion of conditional probability we expressed earlier.
p(y|x) = P(Y = y|X = x) = P(X = x, Y = y)/P(X = x) = p(x, y)/pX(x)
as long as pX(x) > 0.
If p(y|x) = pY(y) for all x, then we say Y and X are independent random variables, and thus p(x, y) = pX(x) · pY(y)
X and Y are independent only if all events {X = x} and {Y = y} are independent; if independence fails for a single pair (x₀, y₀), then X and Y are dependent. In the example above, X and Y were independent, but X and Z were dependent: take x = 2, then
p(z|2) = p(2, z)/pX(2) = 1 for z = 1, and = 0 otherwise
so p(z|x) ≠ pZ(z)
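The coin example can be tabulated directly; p(x, y) factors because the coins are independent, while conditioning on X = 2 shows the dependence of Z on X:

```python
from math import comb

# First coin tossed 3 times: X = heads, Z = tails (so Z = 3 - X).
# Second, independent coin tossed 3 times: Y = tails.
pX = {x: comb(3, x) * 0.5**3 for x in range(4)}
pY = {y: comb(3, y) * 0.5**3 for y in range(4)}

# Independent pair: p(x, y) = pX(x) * pY(y)
p_xy = {(x, y): pX[x] * pY[y] for x in range(4) for y in range(4)}
# Same coin: p(x, z) = pX(x) if z == 3 - x, else 0
p_xz = {(x, z): (pX[x] if z == 3 - x else 0.0) for x in range(4) for z in range(4)}

assert abs(sum(p_xy.values()) - 1) < 1e-12
assert abs(sum(p_xz.values()) - 1) < 1e-12
# Conditioning shows the dependence: p(z | x=2) piles all mass on z = 1
p_z_given_2 = {z: p_xz[(2, z)] / pX[2] for z in range(4)}
assert p_z_given_2[1] == 1.0
```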
Continuous Bivariate Distributions **Bivariate probability density function** f(x, y)
The rectangular region with a < X < b and c < Y < d has probability
P(a < X < b, c < Y < d) = ∫_c^d ∫_a^b f(x, y) dx dy
It is always true that f(x, y) ≥ 0 for all x, y, and
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
Any function satisfying these properties describes a continuous bivariate probability distribution: unit mass resting squarely on the plane, with f(x, y) describing the upper surface of the mass.
If we are given f(x, y) we can find the marginal densities
fX(x) = ∫_{−∞}^{∞} f(x, y) dy, ∀x; fY(y) = ∫_{−∞}^{∞} f(x, y) dx, ∀y
Mathematically we can justify this as follows: if we have a < b, the event {a < X ≤ b} is the same as {a < X ≤ b, −∞ < Y < ∞}, so
P(a < X ≤ b) = ∫_a^b ∫_{−∞}^{∞} f(x, y) dy dx, ∀a < b
and thus ∫_{−∞}^{∞} f(x, y) dy fulfills the definition of fX(x): it is a function of x that gives the probabilities as areas under it.
The integral
∫_x^(x+dx) (∫_{−∞}^{∞} f(u, y) dy) du ≈ (∫_{−∞}^{∞} f(x, y) dy) · dx
gives the total mass between x and x + dx.
**Example** Consider the bivariate density function
f(x, y) = y(1/2 − x) + x, for 0 < x < 1, 0 < y < 2,
and = 0 otherwise.
One way of visualizing such a function is to look at cross-sections: the function f(x, 1/2) = (1/2)(1/2 − x) + x = x/2 + 1/4 is the cross-section of the surface, cutting it with a plane at y = 1/2. To show that f(x, y) is a bivariate density we check
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = ∫_0^2 ∫_0^1 [y(1/2 − x) + x] dx dy = ∫_0^2 (1/2) dy = 1
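A numeric check that the example density integrates to 1 over its rectangle (midpoint rule; the grid size is arbitrary):

```python
def f(x, y):
    """Bivariate density from the example: y*(1/2 - x) + x on (0,1) x (0,2)."""
    return y * (0.5 - x) + x

# Midpoint-rule double integral over the rectangle (0,1) x (0,2)
n = 400
dx, dy = 1 / n, 2 / n
total = sum(
    f((i + 0.5) * dx, (j + 0.5) * dy) * dx * dy
    for i in range(n)
    for j in range(n)
)
assert abs(total - 1) < 1e-9

# Cross-section at y = 1/2 matches x/2 + 1/4
assert abs(f(0.3, 0.5) - (0.3 / 2 + 0.25)) < 1e-12
```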
Conditional Probability Densities In the discrete case we defined the distribution of Y given X as
p(y|x) = p(x, y)/pX(x), pX(x) > 0
So for the continuous case we define the conditional probability density of Y given X as
f(y|x) = f(x, y)/fX(x), fX(x) > 0
The discrete case is a re-expression of
P(E|F) = P(E ∩ F)/P(F)
Since f(y|x) is a density, it gives the conditional probabilities as areas of a region; we have
P(a < Y ≤ b|X = x) = ∫_a^b f(y|x) dy, ∀a < b
What does P(a < Y ≤ b|X = x) mean, since P(X = x) = 0? We proceed heuristically: by P(a < Y ≤ b|X = x) we mean something like P(a < Y ≤ b | x ≤ X ≤ x + h) for very small h. If fX(x) > 0 the latter probability is well defined, since P(x ≤ X ≤ x + h) > 0 even though it is small. Thus
P(a < Y ≤ b|X = x) ≈ P(a < Y ≤ b | x ≤ X ≤ x + h) = P(x ≤ X ≤ x + h, a < Y ≤ b)/P(x ≤ X ≤ x + h) = [∫_a^b (∫_x^(x+h) f(u, y) du) dy] / [∫_x^(x+h) fX(u) du]
But if fX(u) doesn't vary greatly for u ∈ [x, x + h], then
∫_x^(x+h) fX(u) du ≈ fX(x) · h
and if, for fixed y, the function f(u, y) doesn't change value greatly for u ∈ [x, x + h], we have
∫_x^(x+h) f(u, y) du ≈ f(x, y) · h
Substituting these into the above expression gives
P(a < Y ≤ b|X = x) ≈ [∫_a^b f(x, y) · h dy]/(fX(x) · h) = ∫_a^b [f(x, y)/fX(x)] dy
f(y|x) is a cross-section of the surface f(x, y) at X = x, rescaled so it has total area 1. Indeed, for fixed x, the denominator of the conditional density is just the right scaling factor so that the area is 1:
∫_{−∞}^{∞} f(y|x) dy = ∫_{−∞}^{∞} [f(x, y)/fX(x)] dy = (1/fX(x)) · ∫_{−∞}^{∞} f(x, y) dy = (1/fX(x)) · fX(x) = 1, ∀x s.t. fX(x) > 0
If any of the following hold:
f(y|x) = fY(y), f(x|y) = fX(x), f(x, y) = fX(x) · fY(y)
then X and Y are independent random variables; any of these conditions is equivalent to
P(a < X < b, c < Y < d) = P(a < X < b) · P(c < Y < d)
If any of these conditions fails to hold, X and Y are dependent.
Using one marginal density and one set of conditional densities that agree (fX and f(y|x), or fY and f(x|y)), we determine the bivariate density by
f(x, y) = fX(x) f(y|x)
Expectations of Transformations of Bivariate Random Variables Examples of transformations: h₁(X, Y) = X + Y, h₂(X, Y) = X − Y, and h₃(X, Y) = X · Y. We can find the expectation by treating Z = h(X, Y) as a random variable; if we find the distribution of Z, the expectation follows:
E(Z) = Σ_z z pZ(z), E(Z) = ∫_{−∞}^{∞} z fZ(z) dz
in the discrete and continuous cases respectively. However, we can find these expectations without finding the distribution of Z, by a generalized change-of-variables argument:
E h(X, Y) = Σ_x Σ_y h(x, y) p(x, y), E h(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy
in the discrete and continuous cases respectively.
Mixed Cases We have encountered X, Y both discrete or both continuous; now we consider situations where X is discrete and Y continuous.
**Example** A ball is rolled on a table marked from 0 to 1 and its position is Y; then we roll a second ball n times, and X is the number of times the second ball lands to the left of Y. Then (X, Y) is a bivariate random variable, X discrete and Y continuous. From our description, Y is uniform:
fY(y) = 1, 0 < y < 1
If we consider each roll independent, then conditional upon Y = y, X is a success count for n Bernoulli trials where the chance of success on a single trial is y; that means p(x|y) is given by the Binomial (n, y) distribution:
p(x|y) = (n choose x) y^x (1 − y)^(n−x)
For mixed cases, as for other cases, we construct the bivariate distribution by multiplication:
f(x, y) = fY(y) · p(x|y) = (n choose x) y^x (1 − y)^(n−x)
This distribution can be pictured as a series of parallel sheets on the x–y plane, concentrated on the lines x = 0, x = 1, . . . , x = n. For dealing with mixed cases we use our earlier results; the marginal distribution of X is
pX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 (n choose x) y^x (1 − y)^(n−x) dy = (n choose x) · B(x + 1, n − x + 1) = (n choose x) · 1/((n + 1)(n choose x)) = 1/(n + 1)
We could also calculate
f(y|x) = f(x, y)/pX(x) = (n choose x) y^x (1 − y)^(n−x) / (1/(n + 1)) = (n + 1) (n choose x) y^x (1 − y)^(n−x)
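A numeric check that the marginal pX(x) is 1/(n + 1) for every x (n = 6 is an arbitrary choice):

```python
from math import comb

def f(x, y, n):
    """Mixed bivariate distribution: f(x, y) = (n choose x) * y**x * (1-y)**(n-x)."""
    return comb(n, x) * y**x * (1 - y) ** (n - x)

# Marginal pX(x): integrate f(x, y) over 0 < y < 1 (midpoint rule)
n, steps = 6, 20_000
dy = 1 / steps
for x in range(n + 1):
    pX = sum(f(x, (j + 0.5) * dy, n) * dy for j in range(steps))
    assert abs(pX - 1 / (n + 1)) < 1e-6
```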
Higher Dimensions We focused on bivariate distributions for their mathematical simplicity, but the same ideas carry over to multiple dimensions. For X₁, X₂, X₃ discrete we have
p(x₁, x₂, x₃) = P(X₁ = x₁, X₂ = x₂, X₃ = x₃)
and
Σ_{x₁,x₂,x₃} p(x₁, x₂, x₃) = 1
For the continuous case we describe the distribution by a density f(x₁, x₂, x₃) where
∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x₁, x₂, x₃) dx₁ dx₂ dx₃ = 1
A collection of random variables is independent if their distribution factors into a product of univariate distributions:
p(x₁, x₂, . . . , xₙ) = pX₁(x₁) · pX₂(x₂) · · · pXₙ(xₙ)
or
f(x₁, x₂, . . . , xₙ) = fX₁(x₁) · fX₂(x₂) · · · fXₙ(xₙ)
The terms on the right are the marginal distributions of the Xᵢ's, and they can be found by
pX₁(x₁) = Σ_{x₂} Σ_{x₃} p(x₁, x₂, x₃)
and
fX₁(x₁) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x₁, x₂, x₃) dx₂ dx₃
There may also be multivariate marginal distributions; for X₁, X₂, X₃, X₄ continuous we have
f(x₁, x₂) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x₁, x₂, x₃, x₄) dx₃ dx₄
Conditional distributions are found analogously to the bivariate case:
f(x₃, x₄|x₁, x₂) = f(x₁, x₂, x₃, x₄)/f(x₁, x₂)
and in the continuous case
E(h(X₁, X₂, . . . , Xₙ)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x₁, . . . , xₙ) f(x₁, . . . , xₙ) dx₁ · · · dxₙ
Measuring Multivariate Distributions A natural starting point for summary measures is linear transformations; in the bivariate case
h(X, Y) = aX + bY
where a, b are constants. **Theorem** For any constants a, b and any bivariate random variable (X, Y),
E(aX + bY) = aE(X) + bE(Y)
We generalize this result by induction:
E(Σ_{i=1}^n aᵢXᵢ) = Σ_{i=1}^n aᵢE(Xᵢ)
It is hopeless to try to capture information about how X and Y vary together by looking at the expectations of linear transformations. To learn this we must go beyond expectations of linear transformations.
Covariance and Correlation The simplest nonlinear transformation of X and Y is XY, and we could consider E(XY) as a summary, since it is affected both by where the distributions are centered and by how the variables vary together.
We start with the product of X − E(X) and Y − E(Y), that is, the expectation of
h(X, Y) = (X − µX)(Y − µY)
which is called the covariance of X and Y, denoted
Cov(X, Y) = E[(X − µX)(Y − µY)]
Expanding, h(X, Y) = XY − µX Y − X µY + µX µY, and using the properties from above we arrive at
Cov(X, Y) = E(XY) − µX E(Y) − µY E(X) + µX µY = E(XY) − µX µY − µY µX + µX µY
Thus
Cov(X, Y) = E(XY) − µX µY
It follows immediately that Cov(X, Y) = Cov(Y, X).
If X and Y are the same random variable, we have
Cov(X, X) = E(X · X) − µX µX = Var(X)
At the other extreme, where Y = −X,
Cov(X, −X) = E(X(−X)) − µX µ₋X = −E(X²) + µ²X = −Var(X)
And lastly, if X and Y are independent,
Cov(X, Y) = 0
Yet Cov(X, Y) = 0 isn't sufficient to guarantee independence: theoretically there can be an exact balance between h(X, Y) in the positive quadrants and the negative quadrants.
In general we refer to X, Y with Cov(X, Y) = 0 as **Uncorrelated**
**Correction Term** The most important use of covariance, and the best means of interpreting it quantitatively, is viewing it as a correction term that arises in calculating the variance of sums. We have
E(X + Y) = E(X) + E(Y)
Now Var(X + Y) is the expectation of
(X + Y − µ_{X+Y})² = [X + Y − (µX + µY)]² = [(X − µX) + (Y − µY)]² = (X − µX)² + (Y − µY)² + 2(X − µX)(Y − µY)
thus we have
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
In the special case where X and Y are independent (or merely uncorrelated),
Var(X + Y) = Var(X) + Var(Y)
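The correction term can be seen in the coin example, where Z = 3 − X: the sum X + Z is the constant 3, so the covariance term must cancel the two variances exactly:

```python
from math import comb

# X = heads of one fair coin in 3 tosses, Z = 3 - X = tails of the same coin
pX = {x: comb(3, x) * 0.5**3 for x in range(4)}

def E(h):
    """Expectation of h(X) under the Binomial(3, .5) distribution."""
    return sum(h(x) * pX[x] for x in pX)

mu_X = E(lambda x: x)
mu_Z = E(lambda x: 3 - x)
var_X = E(lambda x: (x - mu_X) ** 2)
var_Z = E(lambda x: (3 - x - mu_Z) ** 2)
cov_XZ = E(lambda x: (x - mu_X) * (3 - x - mu_Z))

# X + Z = 3 is constant, so its variance must be 0
var_sum = var_X + var_Z + 2 * cov_XZ
assert abs(cov_XZ - (-var_X)) < 1e-12  # Cov(X, c - X) = -Var(X)
assert abs(var_sum) < 1e-12
```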
Large covariance may be due to a high degree of association or dependence
between X and Y , but it may also be due to the choice of scales of measurement.
To eliminate this scale effect we define the correlation of X and Y to be the
covariance of the standardized X and Y
ρXY = E[((X − µX)/σX) · ((Y − µY)/σY)] = Cov((X − µX)/σX, (Y − µY)/σY) = Cov(X, Y) / (σX σY)
If X and Y are independent,
ρXY = 0
Yet being uncorrelated does not imply independence.
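A short sketch (again in Python, illustrative only) checks that the two forms of the correlation agree: the expectation of the product of the standardized variables equals Cov(X, Y)/(σX σY).

```python
import math

# Same illustrative dependent joint pmf as before.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def E(f):
    """Expectation of f(X, Y) under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in pmf.items())

mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
sd_x = math.sqrt(E(lambda x, y: (x - mu_x) ** 2))
sd_y = math.sqrt(E(lambda x, y: (y - mu_y) ** 2))
cov = E(lambda x, y: (x - mu_x) * (y - mu_y))

# Correlation two ways: E of the product of standardized variables,
# and Cov(X, Y) divided by the product of standard deviations.
rho_std = E(lambda x, y: (x - mu_x) / sd_x * (y - mu_y) / sd_y)
rho_cov = cov / (sd_x * sd_y)
assert abs(rho_std - rho_cov) < 1e-12
```

For this pmf both give ρXY = 0.6; rescaling X or Y (changing units of measurement) changes the covariance but leaves ρXY unchanged.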
Chapter 4 Bayes' Theorem The most elementary situation: we have an effect F and a list of mutually exclusive, exhaustive causes E1, E2, . . . , En. We have the **prior probabilities** P(E1), . . . , P(En), and we know the probability of F given each cause,
P(F|E1), . . . , P(F|En)
We want to find the **posterior probabilities**
P(E1|F), . . . , P(En|F)
**Example**
E1 = patient has cancer, E2 = no cancer
Let F be a positive test result; we know that
P(F|E1) = .9, P(F|E2) = .2
We say the patient is randomly selected from a population with a cancer prevalence of θ0, so
P(E1) = θ0, P(E2) = 1 − θ0
There is disagreement about θ0; what is P(E1|F)?
**Bayes' Theorem, Elementary Version** If E1, E2, . . . , En partition a sample space S, then for each Ei and F with P(F) > 0,
P(Ei|F) = P(Ei)P(F|Ei) / Σ_{j=1}^{n} P(Ej)P(F|Ej)
**Proof** Generally we have that for any E and F,
P(E ∩ F) = P(F)P(E|F)
Applying this twice, we have
P(Ei ∩ F) = P(F)P(Ei|F) and P(Ei ∩ F) = P(Ei)P(F|Ei)
and since P(F) > 0, equating these gives
P(Ei|F) = P(Ei)P(F|Ei) / P(F)
This is essentially the theorem; it now remains to show that
P(F) = Σ_{j=1}^{n} P(Ej)P(F|Ej)
which will be done in two ways. First,
1 = Σ_{i=1}^{n} P(Ei|F) = Σ_{i=1}^{n} P(Ei)P(F|Ei)/P(F) = (1/P(F)) Σ_{i=1}^{n} P(Ei)P(F|Ei)
or
P(F) = Σ_{i=1}^{n} P(Ei)P(F|Ei)
Alternatively, we have
F = ∪_{i=1}^{n} (Ei ∩ F)
and finite additivity gives
P(F) = Σ_{i=1}^{n} P(Ei ∩ F)
But for each i, P(Ei ∩ F) = P(Ei)P(F|Ei); thus
P(F) = Σ_{i=1}^{n} P(Ei)P(F|Ei)
**Example, revisited** We have
P(F|E1) = .9, P(F|E2) = .2, P(E1) = θ0
So from above we have
P(E1|F) = θ0(.9) / [θ0(.9) + (1 − θ0)(.2)]
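This formula can be evaluated for any disputed value of θ0; a minimal sketch (function name and the chosen θ0 values are illustrative, not from the notes):

```python
def posterior_cancer(theta0, p_pos_cancer=0.9, p_pos_healthy=0.2):
    """P(E1 | F): probability of cancer given a positive test result,
    via the elementary version of Bayes' theorem."""
    num = theta0 * p_pos_cancer
    return num / (num + (1 - theta0) * p_pos_healthy)

# The posterior depends strongly on the disputed prevalence theta0:
for theta0 in (0.01, 0.1, 0.5):
    print(theta0, round(posterior_cancer(theta0), 3))
```

At a prevalence of 1%, a positive test still leaves only about a 4% posterior probability of cancer, which is why the disagreement about θ0 matters so much.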
Bayes' theorem more generally: suppose we have fY(y) and f(x|y), and we want to find f(y|x). We proceed as follows. First, if fX(x) > 0,
f(y|x) = f(x, y) / fX(x)
Second, we know that f(x, y) = f(x|y)fY(y); together these give
f(y|x) = f(x|y)fY(y) / fX(x)
All that remains is to find fX(x), which we can do in either of two ways. First, since f(y|x) integrates to 1 in y, we must have
∫_{−∞}^{∞} f(x|u)fY(u) du / fX(x) = 1
or
fX(x) = ∫_{−∞}^{∞} f(x|u)fY(u) du
These equations together provide the general form of Bayes' theorem:
f(y|x) = f(x|y)fY(y) / ∫_{−∞}^{∞} f(x|u)fY(u) du
This is a distribution for Y given X = x; that is, for a fixed x it gives the conditional density of Y. Sometimes we write the above as f(y|x) ∝ f(x|y)fY(y), meaning f(y|x) is proportional to the product f(x|y)fY(y), though the constant of proportionality can depend upon x.
Inference about the Binomial Distribution For the case X discrete and Y continuous, Bayes' theorem becomes
f(y|x) = p(x|y)fY(y) / ∫_{−∞}^{∞} p(x|u)fY(u) du
or
f(y|x) ∝ p(x|y)fY(y)
We consider an example to show Bayesian inference. **Example** We select n = 100 names from a list of a million registered Democrats; all n are interviewed and express an opinion. The poll will result in a count X for the incumbent and a count n − X against. Formally,
θ = fraction for the incumbent
If the survey is truly random, we would accept that given θ, X has a Binomial(100, θ) distribution:
p(x|θ) = b(x; n, θ) = (n choose x) θ^x (1 − θ)^(n−x)
But we aren't given θ, and we want to determine it. We suppose our uncertainty about θ is well represented as a uniform distribution
f(θ) = 1, 0 < θ < 1
Now this becomes an application of Bayes' theorem. We observe n = 100, X = 40, and we have f(θ|x) ∝ f(θ)p(x|θ), so the density is
f(θ|40) ∝ p(40|θ) · f(θ) = (100 choose 40) θ^40 (1 − θ)^60 · 1 = (100 choose 40) θ^40 (1 − θ)^60
Now this represents the density of θ given X = 40 up to a constant of proportionality. We could evaluate the constant if we wish, but we recognize that the portion of the density that depends upon θ is exactly the variable part of a Beta(41, 61) density; since the variable parts agree and both integrate to 1, the constants must agree, and we conclude that f(θ|40) is a Beta(41, 61) density. To summarize, we thought (a) θ is Uniform(0, 1) before the survey and (b) given θ, X is Binomial(100, θ);
our conclusion is that (c) θ is Beta(41, 61) a posteriori, given X = 40.
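The "recognize the variable part" argument can be checked numerically; here is a sketch (not from the notes, stdlib Python only) that normalizes θ^40(1 − θ)^60 on a fine grid and compares the resulting mean with the Beta(41, 61) mean 41/102.

```python
# Midpoint-rule normalization of the unnormalized posterior
# theta^40 (1 - theta)^60 on (0, 1).
N = 200_000
grid = [(i + 0.5) / N for i in range(N)]
w = [t ** 40 * (1 - t) ** 60 for t in grid]
total = sum(w)
mean = sum(t * wi for t, wi in zip(grid, w)) / total

# A Beta(41, 61) distribution has mean 41 / (41 + 61) = 41/102.
assert abs(mean - 41 / 102) < 1e-5
```

The agreement (posterior mean ≈ 0.402, slightly above the sample fraction 0.40) confirms that the normalized posterior really is Beta(41, 61).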
A Richer Class of Models for Binomial Inference The previous section was restricted to the hypothesis that the fraction θ is a priori distributed as Uniform(0, 1). Bayes' theorem can be applied to any f(θ), and this flexibility can be exploited to represent any degree of uncertainty, based upon prior experience, that can be summarized as a density.
A particular class is the Beta(α, β) distribution:
f(θ) = [Γ(α + β) / (Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1)
This class includes the Uniform(0, 1) as the special case α = β = 1. If we accept a particular f(θ), the analysis is straightforward: subsequent to observing the number of successes X = x in n independent trials, the conditional density of θ given X = x is
f(θ|x) ∝ p(x|θ)f(θ) = (n choose x) θ^x (1 − θ)^(n−x) · [Γ(α + β)/(Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1)
= C θ^(x+α−1) (1 − θ)^(n−x+β−1)
where C depends on n, x, α, and β but not θ. We recognize that the portion of f(θ|x) that depends on θ is that of a Beta(x + α, n − x + β) density; thus f(θ|x) must be a Beta(x + α, n − x + β) density:
f(θ|x) = [Γ(n + α + β) / (Γ(x + α)Γ(n − x + β))] θ^(x+α−1) (1 − θ)^(n−x+β−1)
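The conjugate update above amounts to bookkeeping on the Beta parameters; a minimal sketch (the function name is illustrative):

```python
def beta_binomial_update(alpha, beta, x, n):
    """Posterior Beta parameters after observing x successes in n
    independent trials, starting from a Beta(alpha, beta) prior:
    Beta(alpha, beta) -> Beta(x + alpha, n - x + beta)."""
    return x + alpha, n - x + beta

# Uniform prior (alpha = beta = 1) with x = 40 of n = 100 recovers
# the Beta(41, 61) posterior of the earlier example.
print(beta_binomial_update(1, 1, 40, 100))  # (41, 61)
```

No integration is ever needed: successes are added to α and failures to β.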
The specification of an a priori density f(θ) from within the class of Beta distributions can be simplified using three facts about Beta(α, β) distributions. First,
µθ = α / (α + β)
and
σθ² = αβ / [(α + β)²(α + β + 1)] = µθ(1 − µθ) / (α + β + 1)
And if α and β are not very small, fθ(y) is approximately like the density of a N(µθ, σθ²) distribution, so
P(|Y − µθ| < σθ) ≈ 2/3
**Returning to the Example** Before conducting the survey, we specify f(θ) by asking what µθ is. If α and β are not very small, f(θ) is approximately symmetric about µθ, and µθ is approximately the median of f(θ). We ask: for what value is it even odds that θ is above or below? Suppose µθ = .5 answers this approximately: prior to the survey we consider θ equally likely to be above or below .5. Next we ask for what interval of values of θ around µθ = .5 the probability is 2/3: for what value σθ is it twice as likely that .5 − σθ < θ < .5 + σθ as not? Suppose we agree that a priori P(.4 < θ < .6) ≈ 2/3, or σθ ≈ .1. Then
σθ² = µθ(1 − µθ) / (α + β + 1)
with µθ = .5 and σθ² = (.1)² gives us
α + β + 1 = µθ(1 − µθ) / σθ² = (.5)² / (.1)² = 25
or α + β = 24; together with µθ = .5 this gives α = β = 12. So we take our a priori distribution for θ to be Beta(12, 12):
f(θ) = (23! / (11!11!)) θ^11 (1 − θ)^11
After the survey, the distribution that describes the uncertainty about θ given n = 100 and X = 40 is the Beta(52, 72) density
f(θ|40) = (123! / (51!71!)) θ^51 (1 − θ)^71
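The elicitation arithmetic above, solving the mean and variance equations for (α, β), can be sketched as follows (function name illustrative, not from the notes):

```python
def beta_params_from_moments(mu, sigma):
    """Solve mu = alpha/(alpha+beta) and
    sigma^2 = mu(1-mu)/(alpha+beta+1) for (alpha, beta)."""
    s = mu * (1 - mu) / sigma ** 2 - 1  # s = alpha + beta
    return mu * s, (1 - mu) * s

alpha, beta = beta_params_from_moments(0.5, 0.1)
print(round(alpha), round(beta))  # 12 12

# Conjugate update after observing X = 40 of n = 100:
print(round(alpha + 40), round(beta + 60))  # Beta(52, 72) posterior
```

The same two questions, a median and a central 2/3 interval, thus pin down the whole prior within the Beta family.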
The effect of the sample information X upon the distribution of θ can be summarized in changes of expectation and variance. Before X is observed,
E(θ) = α / (α + β) = µθ, Var(θ) = µθ(1 − µθ) / (α + β + 1)
After we observe X = x,
E(θ|X = x) = (α + x) / (α + β + n) = [(α + β)/(α + β + n)] µθ + [n/(α + β + n)] (x/n)
Var(θ|X = x) = E(θ|X = x)(1 − E(θ|X = x)) / (α + β + n + 1)
The expectation formula is particularly informative: it gives E(θ|X = x) as a weighted average of the prior expectation and the fraction of the sample that are successes. That is, E(θ|X = x) is a compromise between µθ, our expectation with no data, and the sample fraction x/n; the larger n/(α + β + n) is, the more weight we put on X/n.
The following two situations give the same f(θ|x): (i) a Uniform(0, 1) prior (Beta with α = β = 1) and a sample of n = 10 with X = 5 successes; (ii) a Beta(5, 5) prior and n = 2 with X = 1 success. Both yield a Beta(6, 6) posterior.
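Both the equivalence of the two situations and the weighted-average form of the posterior mean are easy to check (a sketch, with an illustrative helper name):

```python
def posterior_mean(alpha, beta, x, n):
    """E(theta | X = x) for a Beta(alpha, beta) prior and Binomial(n, theta) data."""
    return (alpha + x) / (alpha + beta + n)

# (i) Uniform(0,1) prior, n = 10, x = 5; (ii) Beta(5,5) prior, n = 2, x = 1:
# both posteriors are Beta(6, 6).
assert (1 + 5, 10 - 5 + 1) == (5 + 1, 2 - 1 + 5) == (6, 6)

# Posterior mean = weighted average of prior mean and sample fraction.
a, b, x, n = 12, 12, 40, 100
mu = a / (a + b)
w = (a + b) / (a + b + n)
assert abs(posterior_mean(a, b, x, n) - (w * mu + (1 - w) * x / n)) < 1e-12
```

In this sense a Beta(α, β) prior acts like α + β earlier "pseudo-trials" with α pseudo-successes, which is exactly why situations (i) and (ii) coincide.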