Bayesian Hotelling's T²
Lim, Kyuson
November 4, 2021
STA498
Contents

1 Acknowledgement

2 Multivariate Normal and Hypothesis Testing
  2.1 Basic definitions
    2.1.1 Multivariate Normal distribution
    2.1.2 Distribution of (x − μ)'Σ⁻¹(x − μ)
    2.1.3 MLE of μ and Σ
    2.1.4 The sampling distribution of S and x̄
    2.1.5 Hypothesis testing when σ, Σ is known
    2.1.6 Hypothesis testing when σ, Σ is unknown
  2.2 Confidence region
    2.2.1 Univariate t-interval
    2.2.2 Bonferroni's Simultaneous Confidence interval
    2.2.3 Simultaneous T²-intervals
    2.2.4 Comparison between Simultaneous T²-intervals and Bonferroni's Confidence intervals
    2.2.5 Multivariate Quality-Control (QC)
  2.3 Comparing mean vectors of two populations
    2.3.1 Pooled sample covariance when n₁, n₂ is small and Σ = Σ₁ = Σ₂
    2.3.2 Hypothesis test with small samples when Σ₁ = Σ₂
    2.3.3 Confidence intervals with small samples when Σ₁ = Σ₂
    2.3.4 Behrens-Fisher problem
    2.3.5 Heterogeneous covariance matrices with large sample size
    2.3.6 Box's M test (Bartlett's test)
  2.4 MANOVA (Multivariate Analysis Of Variance)
    2.4.1 Sum of Squares (TSS = SS_tr + SS_res)
    2.4.2 Hypothesis Testing
    2.4.3 Distribution of Wilks' Lambda
    2.4.4 Large Sample property for modification of Λ*
    2.4.5 Simultaneous Confidence Intervals for treatment effect

3 Bayesian Alternative approach
    3.0.1 Overview: Univariate Binomial distribution with known and unknown parameter
  3.1 Conditional distribution of the subset
    3.1.1 Law of total expectation
    3.1.2 Conditional expectation (MMSE)
    3.1.3 Laplace's law of succession
    3.1.4 Bayesian Hypothesis testing
    3.1.5 Bayesian Interval Estimation
  3.2 Prior
    3.2.1 Conjugate Prior
    3.2.2 Univariate Normal distribution Conjugate Prior with known variance
    3.2.3 Non-informative Prior
    3.2.4 Univariate Normal distribution Conjugate Prior with unknown variance
  3.3 Posterior
    3.3.1 Maximum A Posteriori (MAP)
    3.3.2 Multivariate Normal distribution with known Σ
    3.3.3 Multivariate Normal distribution with unknown Σ
    3.3.4 Lindley's Paradox
    3.3.5 Bernstein-von Mises theorem
  3.4 Goodness-of-fit test
    3.4.1 Bayes factor
    3.4.2 Bayes factor: hypothesis testing
    3.4.3 One sample test for equal means

4 Appendix
  4.1 Extension of Bayesian distribution
    4.1.1 EM (expectation-maximizing) algorithm for MLE example
Chapter 1
Acknowledgement
First, the chapter on the multivariate normal distribution and hypothesis testing explains the construction and concepts of the multivariate normal distribution and the statistical inference built on it. The notation and interpretation are given throughout in terms of multivariate random variables.

Second, the basic material begins with the frequentist approach as a stepping stone to Bayesian statistics. The later chapter, however, is mainly Bayesian, and most of the concepts and derivations are stated from the Bayesian point of view: it discusses Bayesian hypothesis testing, the derivation of the posterior distribution for the univariate normal distribution, and the Bayes posterior estimator.

The normal distribution is the main object of Bayesian inference for the posterior distribution, and multivariate concepts are introduced and used to build up the required background. The goal is to extend the results to the bivariate and multivariate normal distributions, including the Wishart distribution. The relative belief ratio and the normal distribution are also discussed for understanding the posterior distribution.
Chapter 2
Multivariate Normal and Hypothesis Testing

2.1 Basic definitions

2.1.1 Multivariate Normal distribution
If x ∼ N_p(μ, Σ), then the PDF of x is
\[
f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\Big\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big\},
\]
where the quadratic form (x − μ)'Σ⁻¹(x − μ) is the squared Mahalanobis distance between x and the population mean vector μ. The contour of constant density is the set C = {x : f(x) = c₀} = {x : (x − μ)'Σ⁻¹(x − μ) = c²}. For an arbitrary pair of points P and Q, the sample Mahalanobis distance is d(P, Q) = √{(x − y)'S⁻¹(x − y)}, where x = [x₁, ..., x_p]', y = [y₁, ..., y_p]' and S is the sample covariance matrix of all measurements on the p variables.

Notice that the PDF does not exist if Σ is singular, that is |Σ| = 0; by the spectral decomposition Σ = QΛQ', Σ is positive definite if and only if every eigenvalue λᵢ > 0. For the Gaussian kernel exp{−½(x − μ)'Σ⁻¹(x − μ)}, the normalizing constant 1/{(2π)^{p/2}|Σ|^{1/2}} is multiplied so that the density integrates to 1.
2.1.2 Distribution of (x − μ)'Σ⁻¹(x − μ)

For x ∼ N_p(μ, Σ), using Σ⁻¹ = Σᵢ₌₁ᵖ (1/λᵢ) eᵢeᵢ',
\[
(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})
= \{\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})\}'\{\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})\}
= \sum_{i=1}^{p}\Big\{\frac{1}{\sqrt{\lambda_i}}\mathbf{e}_i'(\mathbf{x}-\boldsymbol{\mu})\Big\}^2
= \mathbf{z}'\mathbf{z},
\]
where z = Σ^{-1/2}(x − μ) ∼ N_p(0, I), so that
\[
\mathbf{z}'\mathbf{z} = \sum_{i=1}^{p} z_i^2, \qquad z_i \sim N(0,1).
\]
Since χ²_p is defined as the distribution of Σᵢ₌₁ᵖ zᵢ², it follows that (x − μ)'Σ⁻¹(x − μ) ∼ χ²_p.

2.1.3 MLE of μ and Σ
πœ‡ˆ = xΜ„ and 𝚺ˆ =
1
𝑛
Í𝑛
𝑖=1 (x𝑖
− xΜ„)(x𝑖 − xΜ„) 0 = S𝑛 =
𝑛−1
𝑛 S,
where
𝑛
1 ∑︁
S=
(x𝑖 − xΜ„)(x𝑖 − xΜ„) 0
𝑛 − 1 𝑖=1
Now, 𝐸 (S) = 𝚺 and 𝐸 ( xΜ„) = πœ‡5 are unbiased estimators.
𝑛
𝑛
𝑛
∑︁
1 ∑︁ 0
𝑛−1
1 ∑︁ 0
0
0
S=
x𝑖 x𝑖 − xΜ„xΜ„0,
x𝑖 x𝑖 − 2
x𝑖 xΜ„ + 𝑛xΜ„xΜ„ =
𝑛
𝑛 𝑖=1
𝑛 𝑖=1
𝑖=1
where 𝐸 (x𝑖 x0𝑖 ) = 𝚺 + πœ‡πœ‡0 and 𝐸 ( xΜ„xΜ„0) = cov( xΜ„) + 𝐸 ( xΜ„)𝐸 ( xΜ„0) = 𝑛1 𝚺 + πœ‡πœ‡0. Hence, by
taking the expected value
𝑛
𝑛−1
1 ∑︁ 0
1
𝑛−1
0
0
0
𝐸 (S) = 𝐸
x𝑖 x𝑖 − 𝐸 ( xΜ„xΜ„ ) = 𝚺 + πœ‡πœ‡ − 𝚺 + πœ‡πœ‡ =
𝚺
𝑛
𝑛
𝑛
𝑛
𝑖=1
𝑃
𝑃
𝑃
→ 𝚺, S −
→ 𝚺. Asymptotically, S could be replaced by S𝑛
According to LLN, xΜ„ −
→ πœ‡, S𝑛 −
1 Í𝑛
or 𝚺. By definition of S = {π‘ π‘–π‘˜ = 𝑛−1 𝑗=1 (x 𝑗𝑖 − x̄𝑖 )(x 𝑗 π‘˜ − xΜ„ π‘˜ )},
𝑛
π‘ π‘–π‘˜ =
𝑛
1 ∑︁
1 ∑︁
(x 𝑗𝑖 − x̄𝑖 )(x 𝑗 π‘˜ − xΜ„ π‘˜ ) =
(x 𝑗𝑖 − πœ‡π‘– + πœ‡π‘– − x̄𝑖 )(x 𝑗 π‘˜ − πœ‡ π‘˜ + πœ‡ π‘˜ − xΜ„ π‘˜ )
𝑛 − 1 𝑗=1
𝑛 − 1 𝑗=1
𝑛
1 ∑︁
𝑛
(x 𝑗𝑖 − πœ‡π‘– )(x 𝑗 π‘˜ − πœ‡ π‘˜ ) +
( x̄𝑖 − πœ‡π‘– )( xΜ„ π‘˜ − πœ‡ π‘˜ ),
=
𝑛 − 1 𝑗=1
𝑛−1
where the second term converges to 0. By applying LLN,
𝑛
𝑛−1
∑︁
𝑛
1
1
𝑃
(x 𝑗𝑖 −πœ‡π‘– )(x 𝑗 π‘˜ −πœ‡ π‘˜ ) = 1− 𝐸 {(x 𝑗𝑖 −πœ‡π‘– )(x 𝑗 π‘˜ −πœ‡ π‘˜ )} −
→ πœŽπ‘–π‘˜ , as 𝑛 → ∞.
𝑛 𝑗=1
𝑛
Equivalently, S𝑛 is a consistent estimator for 𝚺 which is analogous to univariate case 6.
By CLT where x𝑖 ∼ 𝑁 𝑝 (πœ‡0 , 𝚺) and xΜ„ ∼ 𝑁 𝑝 (πœ‡0 , 𝑛1 𝚺)
√
𝑑
𝑛( xΜ„ − πœ‡0 ) →
− 𝑁 𝑝 (πœ‡0 , 𝚺)
Í𝑛
Í𝑛
Í𝑛
5𝐸 ( xΜ„) = 𝐸 𝑛1 𝑖=1
x𝑖 = 𝑖=1
𝐸 𝑛1 x𝑖 = 𝑛1 𝑖=1
πœ‡=πœ‡
6As 𝑛 → ∞, 𝑆 2𝑛 converges to 𝜎 2 which is the population variance.
10
CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING
Lim, Kyuson
and
STA498
(x − πœ‡0 ) 0𝚺−1 (x − πœ‡0 ) ∼ πœ’2𝑝
such that
𝑑
𝑛( xΜ„ − πœ‡0 ) 0𝚺−1 ( xΜ„ − πœ‡0 ) →
− πœ’2𝑝 ,
for large sample size and 𝑛 relatively larger than 𝑝.
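As a quick numerical illustration of these estimators (a minimal sketch, not part of the original text; it assumes NumPy and SciPy are available and uses simulated data rather than any data from this report), the following computes x̄, the MLE S_n, the unbiased S, and the large-sample statistic n(x̄ − μ₀)'Σ⁻¹(x̄ − μ₀) against a χ²_p critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n = 3, 500
mu0 = np.zeros(p)                              # true (and hypothesized) mean
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
x = rng.multivariate_normal(mu0, Sigma, size=n)

xbar = x.mean(axis=0)                          # MLE of mu
S_n = (x - xbar).T @ (x - xbar) / n            # MLE of Sigma (divides by n)
S = (x - xbar).T @ (x - xbar) / (n - 1)        # unbiased sample covariance

# large-sample chi-square statistic when Sigma is known
stat = n * (xbar - mu0) @ np.linalg.inv(Sigma) @ (xbar - mu0)
print("n (xbar-mu0)' Sigma^{-1} (xbar-mu0) =", round(stat, 3))
print("chi2_p critical value at alpha=0.05 =", round(stats.chi2.ppf(0.95, df=p), 3))
```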
2.1.4 The sampling distribution of S and x̄

Here
\[
\bar{\mathbf{x}} \sim N_p\Big(\boldsymbol{\mu},\frac{1}{n}\boldsymbol{\Sigma}\Big), \qquad \mathrm{Var}(\bar{\mathbf{x}}) = \frac{1}{n}\boldsymbol{\Sigma},
\]
where S and x̄ are independent. As xᵢ ∼ N_p(μ, Σ) and x̄ is a linear combination of the xᵢ, x̄ follows a normal distribution. Also,
\[
(n-1)\mathbf{S} = \sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'
= \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}+\boldsymbol{\mu}-\bar{\mathbf{x}})(\mathbf{x}_i-\boldsymbol{\mu}+\boldsymbol{\mu}-\bar{\mathbf{x}})'
= \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})' - n(\bar{\mathbf{x}}-\boldsymbol{\mu})(\bar{\mathbf{x}}-\boldsymbol{\mu})',
\]
and
\[
\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})' \sim W_n(\boldsymbol{\Sigma}), \qquad
n(\bar{\mathbf{x}}-\boldsymbol{\mu})(\bar{\mathbf{x}}-\boldsymbol{\mu})' \sim W_1(\boldsymbol{\Sigma}),
\]
such that
\[
(n-1)\mathbf{S} \sim W_{n-1}(\boldsymbol{\Sigma}) = \sum_{i=1}^{n-1}\mathbf{z}_i\mathbf{z}_i', \qquad \mathbf{z}_i \sim N_p(\mathbf{0},\boldsymbol{\Sigma}).
\]
The Wishart distribution with n − 1 degrees of freedom has the property E{(n − 1)S} = (n − 1)Σ.
2.1.5 Hypothesis testing when σ, Σ is known

Statistical inference here is based on hypothesis tests and on constructing confidence regions for the parameters of interest. The goal of this chapter is to cover two general ideas: the construction of a likelihood ratio test (LRT) based on the multivariate normal distribution, and the union-intersection approach.

Univariate test statistic (σ known)

If x ∼ N₁(μ, σ²), the hypothesis test is H₀: μ = μ₀ vs Hₐ: μ ≠ μ₀. For a random sample x₁, ..., xₙ from the normal population, the test statistic is
\[
z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \sim N(0,1)
\qquad\text{or}\qquad
z^2 = \frac{(\bar{x}-\mu_0)^2}{\sigma^2/n} \sim \chi^2_1
\]
under H₀.

Multivariate generalization (Σ known)

If x ∼ N_p(μ, Σ) with |Σ| > 0, the hypothesis test is H₀: μ = μ₀ vs Hₐ: μ ≠ μ₀. If x₁, ..., xₙ is a random sample from the normal population, then the test statistic
\[
z^2 = n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) \sim \chi^2_p
\]
under H₀.
2.1.6 Hypothesis testing when σ, Σ is unknown

Univariate test statistic (σ unknown)

With the estimated mean and the hypothesized mean μ₀ defining the distance measure, the hypothesis test is H₀: μ = μ₀ vs Hₐ: μ ≠ μ₀. The test statistic is
\[
t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \sim t_{n-1}
\qquad\text{or}\qquad
t^2 = \frac{(\bar{x}-\mu_0)^2}{s^2/n} = n(\bar{x}-\mu_0)(s^2)^{-1}(\bar{x}-\mu_0) \sim F_{1,n-1}
\]
under H₀, where s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1). Note that t² is the squared distance between the sample mean x̄ and the test value μ₀.

The distribution of t² under H₀

Under H₀,
\[
t^2 = \sqrt{n}(\bar{x}-\mu_0)(s^2)^{-1}\sqrt{n}(\bar{x}-\mu_0)
= \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\Big(\frac{s^2}{\sigma^2}\Big)^{-1}\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}
= \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\left\{\frac{\sum_{i=1}^{n}\{(x_i-\bar{x})/\sigma\}^2}{n-1}\right\}^{-1}\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}
\]
\[
\sim \big(N(0,1)\big)\Big(\frac{\chi^2_{n-1}}{n-1}\Big)^{-1}\big(N(0,1)\big)
\;\Leftrightarrow\; \frac{\chi^2_1/1}{\chi^2_{n-1}/(n-1)}
\;\Leftrightarrow\; F_{1,n-1}.
\]
Multivariate generalization (Σ unknown)

For a p-dimensional vector, H₀: μ = μ₀ vs Hₐ: μ ≠ μ₀. A natural generalization of the univariate t² is the multivariate analog, Hotelling's T² statistic:
\[
T^2 = (\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\Big(\frac{\mathbf{S}}{n}\Big)^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)
= \sqrt{n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\left\{\frac{\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'}{n-1}\right\}^{-1}\sqrt{n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)
\sim \big(N_p(\mathbf{0},\boldsymbol{\Sigma})\big)'\left\{\frac{W_{p,n-1}(\boldsymbol{\Sigma})}{n-1}\right\}^{-1}\big(N_p(\mathbf{0},\boldsymbol{\Sigma})\big),
\]
which has the form (multivariate normal)'(Wishart distribution / df)⁻¹(multivariate normal), analogous to t²_{n−1} = (normal random variable)(chi-square random variable / df)⁻¹(normal random variable). Equivalently,
\[
T^2 \;\Leftrightarrow\; (n-1)\big(N_p(\mathbf{0},\mathbf{I})\big)'\{W_{n-1}(\mathbf{I})\}^{-1}\big(N_p(\mathbf{0},\mathbf{I})\big), \qquad \mathbf{I}=\mathbf{I}_{p\times p}.
\]
If T² is too large, x̄ is too far from μ₀, and H₀ is rejected.
Hotelling's T² distribution

If a vector d follows the multivariate normal distribution N_p(0, I) (here d is √n(x̄ − μ₀), by the CLT) and a random matrix M (here built from S) follows a Wishart distribution with m degrees of freedom, then m(d'M⁻¹d) (here T²) has a Hotelling's T²(p, m) distribution with dimensionality parameter p and m degrees of freedom, based on the observed m and p.

If a random variable x follows the Hotelling's T² distribution, x ∼ T²(p, m), then
\[
\frac{m-p+1}{pm}\,x \sim F_{p,\,m-p+1}.
\]
For hypothesis testing, reject H₀: μ = μ₀ if
\[
T^2 > \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)
\qquad\text{or}\qquad
F = \frac{n-p}{p(n-1)}T^2 > F_{p,n-p}(\alpha),
\]
with m = n − 1 for the observed sample size and p the dimension of Σ.
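The rejection rule above is straightforward to code. A minimal sketch of the one-sample test (assuming NumPy/SciPy; the function name and simulated data are illustrative, not from this report):

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(x, mu0, alpha=0.05):
    """One-sample Hotelling's T^2 test of H0: mu = mu0."""
    n, p = x.shape
    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)                   # unbiased sample covariance
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(S, diff)      # n (xbar-mu0)' S^{-1} (xbar-mu0)
    F = (n - p) / (p * (n - 1)) * T2              # exact F transformation
    crit = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)
    pval = stats.f.sf(F, p, n - p)
    return T2, crit, pval

rng = np.random.default_rng(1)
x = rng.multivariate_normal([100, 100, 99], np.diag([4.0, 4.0, 9.0]), size=40)
T2, crit, pval = hotelling_one_sample(x, np.array([99, 99, 95]), alpha=0.10)
print(f"T^2 = {T2:.3f}, critical value = {crit:.3f}, p-value = {pval:.4f}")
```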
Computational example

A sample of course marks taken by the student Kyuson at UTM is analyzed, with the classification x₁ = MAT, x₂ = STA and x₃ = other courses (for simplicity the numbers of courses in each group are the same).

Question: is μ₀ = (99, 99, 95)' plausible for the population mean vector at α = 0.1?

Equivalently, the problem is to test H₀: μ = (99, 99, 95)' vs. Hₐ: μ ≠ (99, 99, 95)'. At level α = 0.1, reject H₀ if
\[
T^2 > \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) = \frac{3(40-1)}{40-3}F_{3,37}(0.1) = 7.544.
\]
The sample mean is x̄ = (100, 100, 99)' with S computed from the data, and the computed statistic is T² = 40(x̄ − μ₀)'S⁻¹(x̄ − μ₀) = 8.739.

Since 8.739 > 7.544, the critical value, H₀ is rejected: his true average differs in at least one area (μᵢ ≠ μ_{0i} for some i), and we conclude he is not being honest.
Invariance of Hotelling's T² under transformation

Moreover, Hotelling's T² is invariant under transformations of the form y = Cx + b with C a p × p matrix, for the hypothesis test of H₀: E(y) = Cμ₀ + b (instead of H₀: E(x) = μ₀). Since ȳ = Cx̄ + b and S_y = (1/(n−1)) Σᵢ (yᵢ − ȳ)(yᵢ − ȳ)' = CS_xC',
\[
T^2 = n\{\bar{\mathbf{y}}-(\mathbf{C}\boldsymbol{\mu}_0+\mathbf{b})\}'\mathbf{S}_y^{-1}\{\bar{\mathbf{y}}-(\mathbf{C}\boldsymbol{\mu}_0+\mathbf{b})\}
= n\{\mathbf{C}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)\}'(\mathbf{C}\mathbf{S}_x\mathbf{C}')^{-1}\{\mathbf{C}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)\}
\]
\[
= n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}\mathbf{S}_x\mathbf{C}')^{-1}\mathbf{C}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)
= n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{S}_x^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0).
\]
Normality and Hotelling's T²

T² = n(x̄ − μ₀)'S⁻¹(x̄ − μ₀) is approximately chi-square distributed with p df when μ₀ is correct. Note that the exact F-distribution of T² relies on the normality assumption. The critical value satisfies
\[
\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) > \chi^2_p(\alpha),
\]
but the two values are nearly equal when n − p is large. In other words, if n >> p the gap is small, but if n is only moderately larger than p the gap is larger. For example, with n = 3000 and p = 10, {p(n−1)/(n−p)}F_{p,n−p}(α) = 16.057 is close to χ²_p(α) = 15.987, but with n = 30 and p = 5, {p(n−1)/(n−p)}F_{p,n−p}(α) = 12.135 is well above χ²_p(α) = 9.236.
Likelihood Ratio Test (LRT)

The Hotelling's T² test is equivalent to the LRT (and this connects, via the Neyman-Pearson lemma, to uniformly most powerful tests). For the hypothesis test of H₀: μ = μ₀ vs Hₐ: μ ≠ μ₀, the likelihood ratio (Λ) is
\[
\Lambda = \frac{\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0,\boldsymbol{\Sigma})}{\max_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu},\boldsymbol{\Sigma})}
= \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2},
\]
where the maxima of the multivariate normal likelihood are
\[
\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0,\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}_0|^{n/2}}\exp\Big(\frac{-np}{2}\Big),
\qquad
\max_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}|^{n/2}}\exp\Big(\frac{-np}{2}\Big).
\]
Here Σ̂₀ = (1/n) Σᵢ (xᵢ − μ₀)(xᵢ − μ₀)' is restricted under H₀, while Σ̂ = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)' is unrestricted. The LRT rejects H₀ if Λ < c for a cutoff value c.

By the large-sample theory of likelihood ratios, −2 ln(Λ) = n{ln|Σ̂₀| − ln|Σ̂|} approximately follows χ²_df, where df = {p + p(p + 1)/2} − {p(p + 1)/2} = p (the number of parameters without the restriction of H₀ minus the number under H₀; p corresponds to μ and p(p + 1)/2 to the var-cov matrix Σ).
Wilks' Lambda

Equivalently, from the likelihood ratio statistic Λ one works with Λ^{2/n}:
\[
\Lambda^{2/n} = \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}
= \frac{|\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'|}{|\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}_0)(\mathbf{x}_i-\boldsymbol{\mu}_0)'|}
= \Big(1+\frac{T^2}{n-1}\Big)^{-1},
\]
and H₀ is rejected when Λ^{2/n} < c_α. Notice that for large T² the likelihood ratio is small and H₀ is rejected. Hence Hotelling's T², Wilks' Lambda and the LRT are all equivalent.
Inverse of the Wishart distribution

The Wishart distribution of (n − 1)S ∼ W_{n−1}(Σ) is a multivariate analogue of the Gamma distribution (just as the chi-square distribution of z² is a Gamma random variable).

With a reparametrization where x₁, ..., xₙ ∼ N(0, S₀⁻¹), a covariance matrix Σ = (Σᵢ₌₁ⁿ⁻¹ xᵢxᵢ')⁻¹ is sampled from the inverse-Wishart distribution with n − 1 df and parameter S₀⁻¹. Hence
\[
E(\boldsymbol{\Sigma}^{-1}) = (n-1)\mathbf{S}_0^{-1},
\qquad
E(\boldsymbol{\Sigma}) = \frac{1}{(n-1)-p-1}(\mathbf{S}_0^{-1})^{-1} = \frac{1}{n-p-2}\mathbf{S}_0,
\]
by the properties of the Wishart distribution. For large n − 1, S₀ = (n − p − 2)Σ₀ is near the true covariance matrix Σ₀.
Union-intersection derivation of T²

If, for the p-vector a that maximizes the test statistic t²_a of y = a'x ∼ N₁(a'μ, a'Σa), the univariate null hypothesis is not rejected, then none of the univariate null hypotheses H₀,a: a'μ = a'μ₀ (which together are equivalent to μ = μ₀) is rejected.

First, H₀: μ = μ₀ is equivalent to the family of hypotheses H₀,a: a'μ = a'μ₀. The test statistic is
\[
t_a^2 = \left(\frac{\bar{y}-\mathbf{a}'\boldsymbol{\mu}_0}{s_{\bar{y}}}\right)^2
= \left(\frac{\mathbf{a}'\bar{\mathbf{x}}-\mathbf{a}'\boldsymbol{\mu}_0}{\sqrt{\mathbf{a}'(\mathbf{S}/n)\mathbf{a}}}\right)^2
= \frac{\{\mathbf{a}'(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)\}^2}{\frac{1}{n}\mathbf{a}'\mathbf{S}\mathbf{a}}.
\]
Hence if max_a t²_a < c, then t²_a < c for every a. Second, the maximum squared t-statistic is
\[
\max_{\mathbf{a}} t_a^2 = (\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\Big(\frac{\mathbf{S}}{n}\Big)^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)
= n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) = T^2,
\]
which is Hotelling's T². (The maximization lemma, based on the Cauchy-Schwarz inequality, is used to derive this union-intersection form of T².)
2.2 Confidence region

2.2.1 Univariate t-interval

Ignoring the relationships between multivariate components, the univariate t-interval is constructed from
\[
P\left(-t_{n-1}(\alpha/2) \le \frac{\bar{y}-\mu_y}{\sqrt{s_y^2/n}} \le t_{n-1}(\alpha/2)\right) = 1-\alpha,
\]
since (ȳ − μ_y)/√(s²_y/n) ∼ t_{n−1}. In other words, the 100(1 − α)% confidence interval for μ_y is ȳ ± t_{n−1}(α/2)√(s²_y/n), where t_{n−1}(α/2) is the upper percentile.

Problem and Bonferroni's inequality

Notice that p separate 100(1 − α)% intervals do not give joint 100(1 − α)% coverage. Let Rᵢ be the event that the i-th interval covers the corresponding μᵢ, with P(Rᵢ) = 1 − αᵢ. Then
\[
P\{\mu_i \in R_i \text{ for all } i\} = P\Big(\bigcap_{i=1}^{p} R_i\Big) = 1 - P\Big(\bigcup_{i=1}^{p} R_i^c\Big) \ge 1 - \sum_{i=1}^{p}\alpha_i,
\]
which is Bonferroni's inequality. If αᵢ = α for all i, the bound is 1 − pα < 1 − α, so for p > 1 joint coverage of 1 − α is not guaranteed.

Equivalently, if Rᵢᶜ is the event of making a Type I error on the i-th test (the i-th interval fails to cover μᵢ), then
\[
P(\text{at least one Type I error}) = P\Big(\bigcup_{i=1}^{p} R_i^c\Big) \le \sum_{i=1}^{p} P(R_i^c).
\]
If each of the p tests uses significance level α/p, the familywise Type I error rate is at most α and the joint CI coverage is at least 100(1 − α)%. So the probability that at least one test results in a Type I error is at most α, and the chance that at least one CI fails to capture the true mean is at most α.
2.2.2 Bonferroni's Simultaneous Confidence interval

To construct simultaneous confidence intervals for {μ₁, ..., μ_p}, use level α/p for each of the p separate univariate CIs, that is
\[
\bar{x}_i \pm t_{n-1}\big(\alpha/(2p)\big)\sqrt{\frac{s_{ii}}{n}}, \qquad i = 1,\dots,p.
\]
Since each interval covers its μᵢ with probability 1 − α/p, the joint coverage probability is ≥ 1 − p(α/p) = 1 − α, which is now guaranteed to be no smaller than 1 − α. (Note that t_{n−1}(α/(2p)) could be replaced with √{(n − 1)pF_{p,n−p}(α)/(n − p)} for the equivalent T²-based interval, by the property of Hotelling's T².)
2.2.3 Simultaneous T²-intervals

To construct simultaneous confidence intervals for every linear combination a'μ, the expected value of Σᵢ aᵢxᵢ = a'x where x ∼ N_p(μ, Σ) with variance a'Σa, the interval is derived from the union-intersection derivation of T²:
\[
t_a^2 = \frac{(\mathbf{a}'\bar{\mathbf{x}}-\mathbf{a}'\boldsymbol{\mu})^2}{\widehat{\mathrm{var}}(\mathbf{a}'\mathbf{x})/n}
= \frac{\{\mathbf{a}'(\bar{\mathbf{x}}-\boldsymbol{\mu})\}^2}{\frac{1}{n}\mathbf{a}'\mathbf{S}\mathbf{a}}
\quad\Leftrightarrow\quad
\mathbf{a}'\bar{\mathbf{x}} \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\,\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}},
\]
where
\[
\max_{\mathbf{a}} t_a^2 = T^2 \sim \frac{p(n-1)}{n-p}F_{p,n-p}.
\]
For each μᵢ (taking a as the i-th coordinate vector), the interval is
\[
\bar{x}_i \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\,\sqrt{\frac{s_{ii}}{n}}.
\]
Therefore,
\[
P\left(t_a^2 \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\ \text{for all }\mathbf{a}\right)
= P\left(\max_{\mathbf{a}} t_a^2 \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right)
= P\left(T^2 \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right) = 1-\alpha.
\]
The drawback of the simultaneous T²-intervals is their wider range, which makes them less powerful and more conservative.
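A minimal sketch comparing the two kinds of component-wise intervals on simulated data (assuming NumPy/SciPy; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def mean_intervals(x, alpha=0.05):
    """Component-wise simultaneous T^2 and Bonferroni intervals for the mean vector."""
    n, p = x.shape
    xbar = x.mean(axis=0)
    s_ii = x.var(axis=0, ddof=1)                      # diagonal of S
    half_t2 = np.sqrt(p * (n - 1) / (n - p)
                      * stats.f.ppf(1 - alpha, p, n - p) * s_ii / n)
    half_bon = stats.t.ppf(1 - alpha / (2 * p), n - 1) * np.sqrt(s_ii / n)
    return (np.column_stack([xbar - half_t2, xbar + half_t2]),
            np.column_stack([xbar - half_bon, xbar + half_bon]))

rng = np.random.default_rng(2)
x = rng.multivariate_normal([0, 0, 0], np.eye(3), size=30)
t2_ci, bon_ci = mean_intervals(x)
print("simultaneous T^2 intervals:\n", np.round(t2_ci, 3))
print("Bonferroni intervals (narrower):\n", np.round(bon_ci, 3))
```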
2.2.4 Comparison between Simultaneous T²-intervals and Bonferroni's Confidence intervals

Shape: Bonferroni's t-intervals are narrower and more powerful; simultaneous T²-intervals are wider and conservative.
Joint coverage rate: unadjusted t-intervals give < 100(1 − α)% joint coverage; Bonferroni's intervals guarantee ≥ 100(1 − α)% but their width depends on the number of intervals; simultaneous T²-intervals guarantee ≥ 100(1 − α)% and do not depend on the number of intervals.

For each μᵢ, the simultaneous T² confidence interval is computed as
\[
\bar{x}_i \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\,\sqrt{\frac{s_{ii}}{n}},
\]
while Bonferroni's confidence interval for μᵢ is computed as
\[
\bar{x}_i \pm t_{n-1}\Big(\frac{\alpha}{2p}\Big)\sqrt{\frac{s_{ii}}{n}}, \qquad i = 1,\dots,p.
\]
Confidence Region

Denote by R(x) the multivariate extension of the univariate confidence interval (CI), where xᵢ ∼ N_p(μ, Σ) for i = 1, ..., n. Then, for the mean vector μ,
\[
P\left(n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right) = 1-\alpha.
\]
Centered at x̄ and computed from S, the set is
\[
R(\mathbf{x}) = \left\{\boldsymbol{\mu} : n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\}.
\]
This is an ellipsoid; the half-length along the normalized eigenvector eᵢ of S is
\[
\sqrt{\frac{\lambda_i}{n}}\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)} = \sqrt{\frac{\lambda_i}{n}}\sqrt{T^2(\alpha)},
\]
for each eigenvalue λᵢ of S, i = 1, ..., p. (The correlation matrix R gives eigenvalues equivalent to those of the covariance matrix S when the variables are standardized.)
2.2.5 Multivariate Quality-Control (QC)

Univariate paired t-test

For responses x₁ᵢ and x₂ᵢ to two treatments, let dᵢ = x_{i1} − x_{i2} with dᵢ ∼ N(μ_d, σ²_d), and test H₀: μ_d = 0 vs. Hₐ: μ_d ≠ 0. The test statistic is
\[
t = \frac{\bar{d}}{s_d/\sqrt{n}} \sim t_{n-1}
\]
under H₀. If |t| > t_{n−1}(α/2), reject H₀. The confidence interval for μ_d is
\[
\bar{d} \pm t_{n-1}(\alpha/2)\frac{s_d}{\sqrt{n}}.
\]
Multivariate extension: confidence intervals and confidence region

Suppose for p variables there are 2 treatments, with responses x₁ᵢ = (x_{1i1}, ..., x_{1ip})' and x₂ᵢ = (x_{2i1}, ..., x_{2ip})', and differences dᵢ = x₁ᵢ − x₂ᵢ, i.e. d_{ij} = x_{1ij} − x_{2ij} for all j = 1, ..., p. For dᵢ ∼ N_p(μ_d, Σ_d), test H₀: μ_d = 0 vs. Hₐ: μ_d ≠ 0. The test statistic is
\[
T^2 = n\bar{\mathbf{d}}'\mathbf{S}_d^{-1}\bar{\mathbf{d}}
= \sqrt{n}\,\bar{\mathbf{d}}'\left\{\frac{\sum_{i=1}^{n}(\mathbf{d}_i-\bar{\mathbf{d}})(\mathbf{d}_i-\bar{\mathbf{d}})'}{n-1}\right\}^{-1}\sqrt{n}\,\bar{\mathbf{d}}
\sim \frac{p(n-1)}{n-p}F_{p,n-p}
\]
under H₀, where the 100(1 − α)% confidence region for μ_d is
\[
R(\boldsymbol{\mu}_d) = \left\{\boldsymbol{\mu}_d : n(\bar{\mathbf{d}}-\boldsymbol{\mu}_d)'\mathbf{S}_d^{-1}(\bar{\mathbf{d}}-\boldsymbol{\mu}_d) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\},
\]
analogous to the confidence region based on x̄ and S,
\[
\left\{\boldsymbol{\mu} : n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\}.
\]
When T² > {p(n−1)/(n−p)}F_{p,n−p}(α), the critical value, reject H₀.

The 100(1 − α)% simultaneous T² confidence intervals for the individual mean differences {μ_{dj}} are
\[
\bar{d}_j \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{s^2_{d_j}}{n}}, \qquad j = 1,\dots,p,
\]
where d̄ⱼ is the j-th element of d̄ and s²_{dⱼ} is the j-th diagonal element of S_d. (When n − p is large, {p(n−1)/(n−p)}F_{p,n−p}(α) can be replaced by χ²_p(α) by the property of Hotelling's T², and the normality assumption is not necessary.) This is analogous to the interval for μᵢ,
\[
\bar{x}_i \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii}}{n}}.
\]
Also, the 100(1 − α)% Bonferroni confidence intervals for {μ_{dj}} are
\[
\bar{d}_j \pm t_{n-1}\Big(\frac{\alpha}{2p}\Big)\sqrt{\frac{s^2_{d_j}}{n}}, \qquad j = 1,\dots,p,
\]
analogous to
\[
\bar{x}_i \pm t_{n-1}\big(\alpha/(2p)\big)\sqrt{\frac{s_{ii}}{n}}, \qquad i = 1,\dots,p,
\]
shown before.
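Numerically, the paired multivariate test is simply the one-sample T² applied to the differences dᵢ. A minimal simulated sketch (hypothetical data, assuming NumPy/SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 25, 2
x1 = rng.multivariate_normal([5.0, 7.0], [[1.0, 0.3], [0.3, 1.0]], size=n)  # treatment 1
x2 = x1 + rng.multivariate_normal([0.4, 0.0], 0.5 * np.eye(p), size=n)      # treatment 2

d = x1 - x2                                   # paired differences
dbar = d.mean(axis=0)
Sd = np.cov(d, rowvar=False)
T2 = n * dbar @ np.linalg.solve(Sd, dbar)     # one-sample T^2 on the differences
crit = p * (n - 1) / (n - p) * stats.f.ppf(0.95, p, n - p)
print(f"T^2 = {T2:.3f}, critical value = {crit:.3f}, reject H0: {T2 > crit}")
```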
Simple block design

For q treatments observed over successive periods of time, the observation vector is xᵢ = (x_{i1}, ..., x_{iq})' with mean μ = (μ₁, ..., μ_q)'. The goal is to compare the components of μ.

The contrast matrix C can be set up in two ways. First, comparing a control treatment with each other treatment:
\[
\begin{pmatrix}\mu_1-\mu_2\\ \mu_1-\mu_3\\ \vdots\\ \mu_1-\mu_q\end{pmatrix}
= \begin{pmatrix}1 & -1 & 0 & \cdots & 0\\ 1 & 0 & -1 & \cdots & 0\\ \vdots & & & \ddots & \\ 1 & 0 & 0 & \cdots & -1\end{pmatrix}
\begin{pmatrix}\mu_1\\ \vdots\\ \mu_q\end{pmatrix} = \mathbf{C}_1\boldsymbol{\mu}.
\]
The other way compares successive treatments:
\[
\begin{pmatrix}\mu_1-\mu_2\\ \mu_2-\mu_3\\ \vdots\\ \mu_{q-1}-\mu_q\end{pmatrix}
= \begin{pmatrix}1 & -1 & 0 & \cdots & 0\\ 0 & 1 & -1 & \cdots & 0\\ \vdots & & & \ddots & \\ 0 & 0 & \cdots & 1 & -1\end{pmatrix}
\begin{pmatrix}\mu_1\\ \vdots\\ \mu_q\end{pmatrix} = \mathbf{C}_2\boldsymbol{\mu}.
\]
Note that the contrast matrices C₁ and C₂ are (q − 1) × q. To test that there is no difference in treatments, take H₀: Cμ = 0 vs. Hₐ: Cμ ≠ 0. The test statistic is Hotelling's T², since xᵢ ∼ N_q(μ, Σ):
\[
T^2 = n(\mathbf{C}\bar{\mathbf{x}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{x}}) \sim \frac{(q-1)(n-1)}{n-q+1}F_{q-1,\,n-q+1},
\]
and the 100(1 − α)% confidence region for Cμ is
\[
\left\{\mathbf{C}\boldsymbol{\mu} : n(\mathbf{C}\bar{\mathbf{x}}-\mathbf{C}\boldsymbol{\mu})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{x}}-\mathbf{C}\boldsymbol{\mu}) \le \frac{(q-1)(n-1)}{n-q+1}F_{q-1,\,n-q+1}(\alpha)\right\},
\]
compared to the 100(1 − α)% simultaneous T² confidence intervals for the one-dimensional contrasts {c'ⱼμ},
\[
\mathbf{c}_j'\bar{\mathbf{x}} \pm \sqrt{\frac{(q-1)(n-1)}{n-q+1}F_{q-1,\,n-q+1}(\alpha)}\sqrt{\frac{\mathbf{c}_j'\mathbf{S}\mathbf{c}_j}{n}}.
\]
Computational example

A sample of 20 courses was administered under 4 assessment schemes:

Treatment 1: final exam and no term tests
Treatment 2: no exam and no term tests
Treatment 3: final exam and term test
Treatment 4: no exam and term test

The outcome variable is the students' mark in %. The contrasts of interest are

(μ₃ + μ₄) − (μ₁ + μ₂): effect of having a term test
(μ₁ + μ₃) − (μ₂ + μ₄): effect of having a final exam
(μ₁ + μ₄) − (μ₂ + μ₃): interaction between term test and final exam
\[
\mathbf{C} = \begin{pmatrix}-1 & -1 & 1 & 1\\ 1 & -1 & 1 & -1\\ 1 & -1 & -1 & 1\end{pmatrix}
\]
Then test H₀: Cμ = 0 vs. Hₐ: Cμ ≠ 0 at α = 0.05. From the data, T² = n(Cx̄)'(CSC')⁻¹(Cx̄) = 20.5. At α = 0.05 the critical value is {(q − 1)(n − 1)/(n − q + 1)}F_{q−1,n−q+1}(α) = (3 × 19/17)F_{3,17}(0.05) = 10.73. Since T² = 20.5 > 10.73, reject H₀ at level α = 0.05 and conclude that there is a significant difference among the contrasts for the effects of the term test and the final exam on the courses offered.

Within the 95% simultaneous confidence intervals, if a contrast's interval does not contain 0 then there is an effect of the presence of either the term test or the final exam; the interaction effect of the two factors is not significant if its confidence interval does contain 0.
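A minimal sketch of this contrast test (with simulated marks rather than the data behind the numbers above; assuming NumPy/SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, q = 20, 4
x = rng.multivariate_normal([70, 65, 78, 72], 25 * np.eye(q), size=n)  # marks under 4 schemes

C = np.array([[-1, -1,  1,  1],     # term-test effect
              [ 1, -1,  1, -1],     # final-exam effect
              [ 1, -1, -1,  1]])    # interaction

xbar = x.mean(axis=0)
S = np.cov(x, rowvar=False)
Cx = C @ xbar
T2 = n * Cx @ np.linalg.solve(C @ S @ C.T, Cx)        # n (C xbar)'(CSC')^{-1}(C xbar)
crit = (q - 1) * (n - 1) / (n - q + 1) * stats.f.ppf(0.95, q - 1, n - q + 1)
print(f"T^2 = {T2:.2f}, critical value = {crit:.2f}, reject H0: {T2 > crit}")
```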
2.3 Comparing mean vectors of two populations

When x₁ᵢ ∼ N_p(μ₁, Σ₁) and x₂ⱼ ∼ N_p(μ₂, Σ₂), for i = 1, ..., n₁ and j = 1, ..., n₂, are independent samples from two p-variate populations, the goal is to make inference on μ₁ − μ₂.

2.3.1 Pooled sample covariance when n₁, n₂ is small and Σ = Σ₁ = Σ₂

With x̄ₖ = (1/nₖ) Σᵢ xₖᵢ and Sₖ = (1/(nₖ − 1)) Σᵢ (xₖᵢ − x̄ₖ)(xₖᵢ − x̄ₖ)' for k = 1, 2, the pooled sample covariance is the weighted mean of the two sample covariance matrices:
\[
\mathbf{S}_{pooled} = \frac{\sum_{i=1}^{n_1}(\mathbf{x}_{1i}-\bar{\mathbf{x}}_1)(\mathbf{x}_{1i}-\bar{\mathbf{x}}_1)' + \sum_{i=1}^{n_2}(\mathbf{x}_{2i}-\bar{\mathbf{x}}_2)(\mathbf{x}_{2i}-\bar{\mathbf{x}}_2)'}{n_1+n_2-2}
= \frac{n_1-1}{n_1+n_2-2}\mathbf{S}_1 + \frac{n_2-1}{n_1+n_2-2}\mathbf{S}_2.
\]
2.3.2 Hypothesis test with small samples when Σ₁ = Σ₂

For H₀: μ₁ − μ₂ = δ₀ (e.g. δ₀ = 0) vs. Hₐ: μ₁ − μ₂ ≠ δ₀, the Hotelling test statistic under H₀ is
\[
T^2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\delta}_0)'\left\{\Big(\frac{1}{n_1}+\frac{1}{n_2}\Big)\mathbf{S}_{pooled}\right\}^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\delta}_0)
\]
\[
= \Big(\frac{1}{n_1}+\frac{1}{n_2}\Big)^{-1/2}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\mu}_1+\boldsymbol{\mu}_2)'\,\mathbf{S}_{pooled}^{-1}\,\Big(\frac{1}{n_1}+\frac{1}{n_2}\Big)^{-1/2}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\mu}_1+\boldsymbol{\mu}_2),
\]
which again has the form (multivariate normal)'(Wishart / df)⁻¹(multivariate normal), such that
\[
T^2 \sim \big(N_p(\mathbf{0},\boldsymbol{\Sigma})\big)'\left\{\frac{W_{n_1+n_2-2}(\boldsymbol{\Sigma})}{n_1+n_2-2}\right\}^{-1}\big(N_p(\mathbf{0},\boldsymbol{\Sigma})\big)
= \frac{p(n_1+n_2-2)}{n_1+n_2-p-1}F_{p,\,n_1+n_2-p-1}.
\]
The hypothesis test rejects H₀ if
\[
T^2 > \frac{p(n_1+n_2-2)}{n_1+n_2-p-1}F_{p,\,n_1+n_2-p-1}(\alpha) = T^2_{critical}.
\]
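A minimal two-sample sketch with pooled covariance (simulated data, assuming NumPy/SciPy; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(x1, x2, alpha=0.05):
    """Two-sample Hotelling's T^2 with pooled covariance (assumes Sigma1 = Sigma2)."""
    n1, p = x1.shape
    n2, _ = x2.shape
    S_pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
                + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    T2 = diff @ np.linalg.solve((1 / n1 + 1 / n2) * S_pooled, diff)
    crit = (p * (n1 + n2 - 2) / (n1 + n2 - p - 1)
            * stats.f.ppf(1 - alpha, p, n1 + n2 - p - 1))
    return T2, crit

rng = np.random.default_rng(5)
x1 = rng.multivariate_normal([0, 0], np.eye(2), size=20)
x2 = rng.multivariate_normal([1, 0], np.eye(2), size=25)
T2, crit = hotelling_two_sample(x1, x2)
print(f"T^2 = {T2:.3f}, critical value = {crit:.3f}, reject H0: {T2 > crit}")
```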
𝑛1 + 𝑛2 − 𝑝 − 1
Confidence intervals with small samples when Σ1 = Σ2
Confidence region
Analogously, the half-length with axes along e1 , ..., e 𝑝 and ellipsoid centered at x1 − x2
is
√οΈ„ √︁
1
1
𝑝(𝑛1 + 𝑛2 − 2)
+
𝐹𝑝,𝑛1 +𝑛2 −𝑝−1 (𝛼) , 𝑗 = 1, ..., 𝑝.
πœ†π‘—
𝑛1 𝑛2 𝑛1 + 𝑛2 − 𝑝 − 1
Simultaneous 𝑇 2 Confidence intervals a0 (πœ‡1 − πœ‡2 )
√οΈ„ √οΈƒ
1
1
0
2
0
a ( xΜ„1 − xΜ„2 ) ± π‘‡π‘π‘Ÿπ‘–π‘‘π‘–π‘π‘Žπ‘™ a
+
S π‘π‘œπ‘œπ‘™π‘’π‘‘ a .
𝑛1 𝑛2
Notice that this is analogous to each confidence intervals
√οΈ„ √οΈ„
𝑝(𝑛1 + 𝑛2 − 2)
1
1
( π‘₯¯1 𝑗 − π‘₯¯2 𝑗 ) ±
𝐹𝑝,𝑛1 +𝑛2 −𝑝−1 (𝛼)
+
𝑠 𝑗 𝑗,π‘π‘œπ‘œπ‘™π‘’π‘‘ .
𝑛1 + 𝑛2 − 𝑝 − 1
𝑛1 𝑛2
18ie. 𝛿0 = 0
22
CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING
Lim, Kyuson
Bonferroni’s Confidence Intervals
STA498
( π‘₯¯1 𝑗 − π‘₯¯2 𝑗 ) ± 𝑑 𝑛1 +𝑛2 −2
2.3.4
𝛼
2𝑝
√οΈ„ 1
1
+
𝑠 𝑗 𝑗,π‘π‘œπ‘œπ‘™π‘’π‘‘ .
𝑛1 𝑛2
Behrens-Fisher problem
In the case of heterogeneous covariances Σ₁ ≠ Σ₂ with small (moderate) sample sizes n₁, n₂ greater than p, the estimator of the sample mean difference has sample covariance
\[
\mathbf{S}_u = \frac{\mathbf{S}_1}{n_1} + \frac{\mathbf{S}_2}{n_2}
\quad\Big(\Leftrightarrow \frac{(n-1)(\mathbf{S}_1+\mathbf{S}_2)}{2(n-1)}\cdot\frac{2}{n} \text{ when } n_1 = n_2 = n\Big),
\]
such that the test statistic under H₀: μ₁ − μ₂ = 0 vs. Hₐ: μ₁ − μ₂ ≠ 0 is
\[
T^2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\mathbf{S}_u^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \sim \frac{vp}{v-p+1}F_{p,\,v-p+1},
\]
for p variables, where
\[
v = \frac{p+p^2}{\sum_{i=1}^{2}\frac{1}{n_i}\left[\mathrm{tr}\Big\{\Big(\frac{1}{n_i}\mathbf{S}_i\mathbf{S}_u^{-1}\Big)^2\Big\} + \Big\{\mathrm{tr}\Big(\frac{1}{n_i}\mathbf{S}_i\mathbf{S}_u^{-1}\Big)\Big\}^2\right]},
\]
with min(n₁, n₂) ≤ v ≤ n₁ + n₂. Hence, reject H₀ if T² > {vp/(v − p + 1)}F_{p,v−p+1}(α). (The approximation reduces to the Welch t-test in the univariate case p = 1, with t = (x̄₁ − x̄₂)/√(s₁²/N₁ + s₂²/N₂) and v = (s₁²/N₁ + s₂²/N₂)² / {s₁⁴/(N₁²(N₁ − 1)) + s₂⁴/(N₂²(N₂ − 1))}.)
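A minimal sketch of this unequal-covariance test, with the approximate degrees of freedom v computed from the formula above (simulated data, assuming NumPy/SciPy; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def behrens_fisher_T2(x1, x2, alpha=0.05):
    """Two-sample T^2 without assuming equal covariances, approximate df v as above."""
    n1, p = x1.shape
    n2, _ = x2.shape
    S1, S2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    Su = S1 / n1 + S2 / n2
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    T2 = diff @ np.linalg.solve(Su, diff)
    Su_inv = np.linalg.inv(Su)
    denom = 0.0
    for ni, Si in ((n1, S1), (n2, S2)):
        A = (Si / ni) @ Su_inv
        denom += (np.trace(A @ A) + np.trace(A) ** 2) / ni
    v = (p + p ** 2) / denom
    crit = v * p / (v - p + 1) * stats.f.ppf(1 - alpha, p, v - p + 1)
    return T2, v, crit

rng = np.random.default_rng(6)
x1 = rng.multivariate_normal([0, 0], [[1, 0.2], [0.2, 1]], size=18)
x2 = rng.multivariate_normal([0.8, 0], [[3, 0.5], [0.5, 2]], size=22)
T2, v, crit = behrens_fisher_T2(x1, x2)
print(f"T^2 = {T2:.3f}, approx df v = {v:.1f}, critical value = {crit:.3f}")
```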
2.3.5 Heterogeneous covariance matrices with large sample size

The test statistic under H₀, with the same S_u, is
\[
T^2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\mathbf{S}_u^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \sim \chi^2_p,
\]
under the assumption that n₁ − p and n₂ − p are large enough. Hence, reject H₀ if T² > χ²_p(α).
2.3.6 Box's M test (Bartlett's test)

The goal is to test the equality of covariance matrices, H₀: Σ₁ = ⋯ = Σ_g = Σ vs. Hₐ: Σᵢ ≠ Σⱼ for at least one pair i ≠ j, with a chi-square approximation. Under the multivariate normal distribution, the LRT (of the same form as the earlier LRT under H₀: μ = μ₀, Λ = (|Σ̂|/|Σ̂₀|)^{n/2}) is
\[
\Lambda = \prod_{l=1}^{g}\left(\frac{|\mathbf{S}_l|}{|\mathbf{S}_{pooled}|}\right)^{(n_l-1)/2},
\qquad
\mathbf{S}_{pooled} = \frac{\sum_{l=1}^{g}(n_l-1)\mathbf{S}_l}{\sum_{l=1}^{g}(n_l-1)},
\]
where n_l is the sample size of the l-th group with sample covariance S_l. Then
\[
-2\log\Lambda = M = \Big\{\sum_{l=1}^{g}(n_l-1)\Big\}\log|\mathbf{S}_{pooled}| - \sum_{l=1}^{g}(n_l-1)\log|\mathbf{S}_l|,
\]
with correction factor
\[
u = \left[\sum_{l=1}^{g}\frac{1}{n_l-1} - \frac{1}{\sum_{l=1}^{g}(n_l-1)}\right]\frac{2p^2+3p-1}{6(p+1)(g-1)},
\]
where p is the number of variables and g is the number of groups. The test statistic is
\[
C = (1-u)M \sim \chi^2_v, \qquad v = \frac{1}{2}p(p+1)(g-1),
\]
under H₀; reject H₀ if C > χ²_v(α). While Box's M test is sensitive to non-normality, MANOVA tests of means or treatments are robust to non-normality (so even when the M test rejects H₀, the MANOVA conclusions could be inconsistent with it).
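A minimal sketch of Box's M statistic and its chi-square approximation (simulated groups, assuming NumPy/SciPy; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def box_m_test(groups):
    """Box's M test of equal covariance matrices across g groups (chi-square approximation)."""
    g = len(groups)
    p = groups[0].shape[1]
    nl = np.array([x.shape[0] for x in groups])
    Sl = [np.cov(x, rowvar=False) for x in groups]
    S_pooled = sum((n - 1) * S for n, S in zip(nl, Sl)) / (nl - 1).sum()
    M = (nl - 1).sum() * np.log(np.linalg.det(S_pooled)) \
        - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(nl, Sl))
    u = (np.sum(1.0 / (nl - 1)) - 1.0 / (nl - 1).sum()) \
        * (2 * p ** 2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
    C = (1 - u) * M
    v = p * (p + 1) * (g - 1) / 2
    return C, v, stats.chi2.sf(C, v)

rng = np.random.default_rng(7)
groups = [rng.multivariate_normal([0, 0], s * np.eye(2), size=30) for s in (1.0, 1.5, 1.0)]
C, v, pval = box_m_test(groups)
print(f"C = {C:.3f}, df = {v:.0f}, p-value = {pval:.4f}")
```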
2.4 MANOVA (Multivariate Analysis Of Variance)

The one-way MANOVA model for comparing g population mean vectors is
\[
\mathbf{X}_{lj} = \boldsymbol{\mu} + \boldsymbol{\tau}_l + \mathbf{e}_{lj}, \qquad \mathbf{e}_{lj} \sim N_p(\mathbf{0},\boldsymbol{\Sigma}),
\]
that is, random vector = overall mean + l-th population treatment effect + random error, where there are g populations and n_l observations {x_{l1}, ..., x_{l n_l}} for population l with population mean μ_l, l = 1, ..., g. The constraint Σ_{l=1}^{g} n_l τ_l = 0 defines the unique model parameters.

Each observation vector decomposes as
\[
\mathbf{x}_{lj} = \bar{\mathbf{x}} + (\bar{\mathbf{x}}_l - \bar{\mathbf{x}}) + (\mathbf{x}_{lj} - \bar{\mathbf{x}}_l),
\]
that is, observation = overall sample mean μ̂ + estimated treatment effect τ̂_l + residual error ê_{lj}. Note that the normality assumption for the samples can be relaxed when the sample sizes {n_l} are large.
2.4.1 Sum of Squares (TSS = SS_tr + SS_res)

The total (corrected) sum of squares and cross products, TSS, equals the treatment (between-groups) sum of squares and cross products, B, plus the residual (within-group) sum of squares and cross products, W:
\[
\sum_{l=1}^{g}\sum_{j=1}^{n_l}(\mathbf{x}_{lj}-\bar{\mathbf{x}})(\mathbf{x}_{lj}-\bar{\mathbf{x}})'
= \sum_{l=1}^{g} n_l(\bar{\mathbf{x}}_l-\bar{\mathbf{x}})(\bar{\mathbf{x}}_l-\bar{\mathbf{x}})'
+ \sum_{l=1}^{g}\sum_{j=1}^{n_l}(\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)(\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)',
\]
as simplified from
\[
(\mathbf{x}_{lj}-\bar{\mathbf{x}})(\mathbf{x}_{lj}-\bar{\mathbf{x}})'
= (\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)(\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)' + (\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)(\bar{\mathbf{x}}_l-\bar{\mathbf{x}})' + (\bar{\mathbf{x}}_l-\bar{\mathbf{x}})(\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)' + (\bar{\mathbf{x}}_l-\bar{\mathbf{x}})(\bar{\mathbf{x}}_l-\bar{\mathbf{x}})',
\]
where Σ_{j=1}^{n_l}(x_{lj} − x̄_l) = 0, so the cross-product terms vanish when summed over j. (The same decomposition applies term by term in the univariate case, with (x_{lj} − x̄)² in place of the outer products.)

First, with S_l the l-th sample covariance matrix (in the two-sample case the generalized (n₁ + n₂ − 2)S_pooled is the analogous quantity),
\[
\mathbf{W} = \sum_{l=1}^{g}\sum_{j=1}^{n_l}(\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)(\mathbf{x}_{lj}-\bar{\mathbf{x}}_l)'
= (n_1-1)\mathbf{S}_1 + \cdots + (n_g-1)\mathbf{S}_g = (N-g)\mathbf{S}_{pooled},
\]
where N = Σ_{l=1}^{g} n_l; W has a Wishart distribution with N − g df, and hence
\[
E\Big(\frac{\mathbf{W}}{N-g}\Big) = \boldsymbol{\Sigma}.
\]
Second,
\[
\mathbf{B} = \sum_{l=1}^{g} n_l(\bar{\mathbf{x}}_l-\bar{\mathbf{x}})(\bar{\mathbf{x}}_l-\bar{\mathbf{x}})',
\]
with g − 1 df (Wishart distribution). Thus TSS has a total of N − 1 = (N − g) + (g − 1) df.
2.4.2 Hypothesis Testing

The goal is to test for the presence of treatment effects. H₀: μ₁ = μ₂ = ⋯ = μ_g is equivalent to H₀: τ₁ = τ₂ = ⋯ = τ_g = 0 (since τ̂_l = x̄_l − x̄, testing H₀: τ₁ = ⋯ = τ_g corresponds to x̄₁ − x̄_g = 0 and so on), i.e. the treatment effects are all the same. The test statistic (an LRT analogous to |Σ̂|/|Σ̂₀|) uses Wilks' Lambda, as B and W follow Wishart distributions:
\[
\Lambda^{*} = \frac{|\mathbf{W}|}{|\mathbf{B}+\mathbf{W}|} = \frac{1}{\det(\mathbf{I}+\mathbf{W}^{-1}\mathbf{B})} = \prod_{i=1}^{s}\frac{1}{1+\hat{\lambda}_i},
\]
where λ̂₁, ..., λ̂_s are the eigenvalues of W⁻¹B and s = min(p, g − 1) is the rank of B. (Alternatives are Roy's test, max λ(BW⁻¹); the Lawley-Hotelling test, tr(BW⁻¹); and Pillai's test, tr{B(B + W)⁻¹}.)
2.4.3 Distribution of Wilks' Lambda

The exact sampling distribution of Λ* under H₀ in special cases:

Number of variables   Number of groups   Test statistic                                  Distribution under H₀
p = 1                 g ≥ 2              ((N − g)/(g − 1)) · (1 − Λ*)/Λ*                 F_{g−1, N−g}
p = 2                 g ≥ 2              ((N − g − 1)/(g − 1)) · (1 − √Λ*)/√Λ*           F_{2(g−1), 2(N−g−1)}
p ≥ 1                 g = 2              ((N − p − 1)/p) · (1 − Λ*)/Λ*                   F_{p, N−p−1}
p ≥ 1                 g = 3              ((N − p − 2)/p) · (1 − √Λ*)/√Λ*                 F_{2p, 2(N−p−2)}
2.4.4 Large Sample property for modification of Λ*

If H₀ is true and N is large,
\[
-\Big(N-1-\frac{p+g}{2}\Big)\ln(\Lambda^{*}) \sim \chi^2_{p(g-1)}
\]
approximately. Hence, reject H₀ if
\[
-\Big(N-1-\frac{p+g}{2}\Big)\ln(\Lambda^{*}) > \chi^2_{p(g-1)}(\alpha).
\]
2.4.5 Simultaneous Confidence Intervals for treatment effect

Let τ̂_l = x̄_l − x̄ be the estimated l-th treatment effect. Then the treatment difference between the k-th and l-th treatments is
\[
\hat{\boldsymbol{\tau}}_k - \hat{\boldsymbol{\tau}}_l = \bar{\mathbf{x}}_k - \bar{\mathbf{x}} - \bar{\mathbf{x}}_l + \bar{\mathbf{x}} = \bar{\mathbf{x}}_k - \bar{\mathbf{x}}_l = (\bar{x}_{k1}-\bar{x}_{l1},\dots,\bar{x}_{kp}-\bar{x}_{lp})',
\]
and Var(τ̂_{ki} − τ̂_{li}) = Var(x̄_{ki} − x̄_{li}) = (1/n_k + 1/n_l)σ_{ii}, where σ_{ii} is estimated by s_{ii,pooled} = W_{ii}/(N − g).

The 100(1 − α)% simultaneous Bonferroni confidence intervals for {τ_{ki} − τ_{li}} (analogous to the two-sample intervals (x̄_{1j} − x̄_{2j}) ± t_{n₁+n₂−2}(α/2p)√{(1/n₁ + 1/n₂)s_{jj,pooled}} defined before) are
\[
(\bar{x}_{ki}-\bar{x}_{li}) \pm t_{N-g}\Big(\frac{\alpha}{2m}\Big)\sqrt{\Big(\frac{1}{n_k}+\frac{1}{n_l}\Big)s_{ii,pooled}}, \qquad m = \frac{pg(g-1)}{2},
\]
where p is the number of variables and g is the number of populations. Hence, reject H₀: τ_{ki} − τ_{li} = 0 if
\[
t = \frac{|\bar{x}_{ki}-\bar{x}_{li}|}{\sqrt{\big(\frac{1}{n_k}+\frac{1}{n_l}\big)s_{ii,pooled}}} > t_{N-g}\Big(\frac{\alpha}{2m}\Big).
\]
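A minimal sketch of the one-way MANOVA computation of Wilks' Lambda with the large-sample chi-square approximation of Section 2.4.4 (simulated groups, assuming NumPy/SciPy; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def wilks_lambda(groups):
    """One-way MANOVA: Wilks' Lambda with the large-sample chi-square approximation."""
    g = len(groups)
    p = groups[0].shape[1]
    N = sum(x.shape[0] for x in groups)
    grand = np.vstack(groups).mean(axis=0)
    W = sum((x - x.mean(axis=0)).T @ (x - x.mean(axis=0)) for x in groups)
    B = sum(x.shape[0] * np.outer(x.mean(axis=0) - grand, x.mean(axis=0) - grand)
            for x in groups)
    lam = np.linalg.det(W) / np.linalg.det(B + W)
    chi2_stat = -(N - 1 - (p + g) / 2) * np.log(lam)     # large-sample modification
    df = p * (g - 1)
    return lam, chi2_stat, stats.chi2.sf(chi2_stat, df)

rng = np.random.default_rng(8)
groups = [rng.multivariate_normal(m, np.eye(2), size=30) for m in ([0, 0], [0.5, 0], [0, 0.7])]
lam, chi2_stat, pval = wilks_lambda(groups)
print(f"Wilks' Lambda = {lam:.4f}, chi2 = {chi2_stat:.3f}, p-value = {pval:.4f}")
```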
Chapter 3
Bayesian Alternative approach

Let θ be the discrete random variable to be estimated and X = x the observed random variable. Starting from prior information p(θ) about the possible values of the parameter, the approach uses the observed data, through p(x|θ), to update the information into posterior probabilities p(θ|x) (by Bayes' theorem, p(θ|x) = p(x, θ)/p(x) = p(x|θ)p(θ)/p(x)); interval statements then take the form p(θ ∈ C_α | x) = 1 − α.

If a value θ* compatible with the observed data points is known, then estimates x̂ᵢ ∼ p(x|θ*) for the true value of x can be generated to update the information within the compatible space, with no overfitting to be concerned with.

From the likelihood p(x|θ), Bayes' theorem yields the (unnormalized) posterior density on the right-hand side:
\[
p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)} \;\propto\; p(x|\theta)\,p(\theta),
\]
where p(x) is fixed for the observed x and does not depend on θ; p(x) is also referred to as the evidence. Hence

Posterior ∝ Prior × Likelihood.

A parameter of a prior distribution is referred to as a hyperparameter.

For predictive inference on an unknown observable before the data x are considered, the distribution of the not-yet-observed x is
\[
p(x) = \int_{\Theta} p(x,\theta)\,d\theta = \int_{\Theta} p(\theta)\,p(x|\theta)\,d\theta,
\]
the marginal distribution of x, called the prior predictive distribution ("predictive" refers to a distribution for an observable quantity).

For observed data x and unknown θ = (μ, σ²), the distribution of an unknown observable x̃ to be predicted is the posterior predictive distribution
\[
p(\tilde{x}|\mathbf{x}) = \int_{\Theta} p(\tilde{x},\theta|\mathbf{x})\,d\theta
= \int_{\Theta} p(\tilde{x}|\theta,\mathbf{x})\,p(\theta|\mathbf{x})\,d\theta
= \int_{\Theta} p(\tilde{x}|\theta)\,p(\theta|\mathbf{x})\,d\theta,
\]
posterior because it is conditional on the observed x, and predictive because it is for the observable x̃.

The ratio of the posterior density p(θ|x) evaluated at two points θ₁ and θ₂ under the given model is referred to as the posterior odds for θ₁ compared to θ₂:
\[
\frac{p(\theta_1|x)}{p(\theta_2|x)} = \frac{p(\theta_1)\,p(x|\theta_1)/p(x)}{p(\theta_2)\,p(x|\theta_2)/p(x)} = \frac{p(\theta_1)}{p(\theta_2)}\cdot\frac{p(x|\theta_1)}{p(x|\theta_2)},
\]
so the posterior odds p(θ₁|x)/p(θ₂|x) equal the prior odds p(θ₁)/p(θ₂) times the likelihood ratio p(x|θ₁)/p(x|θ₂), under Bayes' rule for discrete parameters.
Random variables and Bayesian statistical inference

For the unknown random variable Θ to be estimated and the observed X = x, Bayes' rule yields
\[
p(\Theta=\theta|X=x) = \frac{p(X=x|\Theta=\theta)\,p(\Theta=\theta)}{p(X=x)}
\;\Leftrightarrow\;
p_{\Theta|X}(\theta|x) = \frac{p_{X|\Theta}(x|\theta)\,p_{\Theta}(\theta)}{p_X(x)}.
\]
If either Θ or X is a continuous random variable, replace the corresponding PMF by a PDF in the formula. Equivalently, the posterior PDF is the prior times the likelihood, with f_X(x) obtained by the law of total probability:
\[
f_{\Theta|X}(\theta|x) = \frac{f_{X|\Theta}(x|\theta)\,f_{\Theta}(\theta)}{f_X(x)}.
\]
In Bayesian statistics the choice of the prior f_Θ(θ) is generally unclear and subjective, differing between analysts. With the unknown variable Θ, the goal is to draw inferences about Θ by observing the related random variable X. The posterior distribution of Θ, f_{Θ|X}(θ|x) (or p_{Θ|X}(θ|x) in the discrete case), contains all the information from which point or interval estimates of Θ are derived.
Comparison between Frequentist and Bayesian methods

For frequentist inference, probabilities are long-run frequencies, and the goal is to create procedures with long-run frequency guarantees. For Bayesian inference, probabilities are subjective degrees of belief to be stated and analyzed. Hence frequentists view a parameter as a fixed constant, while Bayesians treat it as a random variable. The confidence interval illustrates the difference.

For the confidence interval CI = [X̄_n − 1.96/√n, X̄_n + 1.96/√n], the probability statement is p_θ(θ ∈ CI) = 0.95 for all θ ∈ ℝ, and the interval is random because it is a function of the data. With the parameter θ fixed, the CI traps the true value with probability 0.95. Over infinitely many experiments, each with n data points and its own chosen θᵢ, the computed intervals CIᵢ trap the parameter θᵢ 95% of the time; almost surely, for any sequence θᵢ,
\[
\liminf_{n\to\infty}\ \frac{1}{n}\sum_{i=1}^{n} I(\theta_i \in CI_i) \ge 0.95.
\]
On the other hand, to express beliefs, the unknown parameter θ is given a prior distribution π(θ) representing the subjective beliefs about θ. Using Bayes' theorem, the posterior distribution of θ given the observed data X₁, ..., Xₙ is computed with the likelihood function L(θ) = Πᵢ₌₁ⁿ p(Xᵢ|θ):
\[
\pi(\theta|X_1,\dots,X_n) \propto L(\theta)\pi(\theta), \qquad
\int_{CI}\pi(\theta|X_1,\dots,X_n)\,d\theta = 0.95 \;\Leftrightarrow\; p(\theta \in CI \mid X_1,\dots,X_n) = 0.95.
\]
This degree-of-belief probability statement about θ given the observed data is not the same thing; such intervals need not trap the true value 95% of the time.

In summary, the frequentist CI satisfies inf_θ p_θ(θ ∈ CI) = 1 − α, where the coverage probability refers to the random interval CI. A Bayesian (credible) interval satisfies p(θ ∈ CI | X₁, ..., Xₙ) = 1 − α, where the probability refers to θ. While subjective Bayesians interpret probability strictly as personal degrees of belief, objective Bayesians try to find prior distributions for which the resulting posterior is objective (empirical Bayesians estimate the prior distribution from the data). Frequentist Bayesians only use Bayesian methods when the resulting posterior has good frequency behaviour. Likelihoodists, on the other hand, use the likelihood function to measure the strength of the data as evidence.
3.0.1 Overview: Univariate Binomial distribution with known and unknown parameter

Let the probability of a success in a trial be θ, and let X = {x₁, ..., xₙ} be the observation set, where xᵢ ∼ Ber(θ). Then the probability of s = Σᵢ₌₁ⁿ xᵢ successes in the n trials (x₁, ..., xₙ) is the likelihood
\[
p(X = x_1,\dots,x_n \mid \theta, n) = \mathrm{Bin}(s|\theta,n) = \binom{n}{s}\theta^{s}(1-\theta)^{n-s}.
\]
Example: Objective Bayesian approach

Since p(θ|X = x₁, ..., xₙ) ∝ p(X = x₁, ..., xₙ|θ)p(θ), and the unknown θ is given the uniform prior θ ∼ U[0, 1] with p(θ) = 1,
\[
p(\theta|X) \propto \theta^{s}(1-\theta)^{n-s} = \theta^{(s+1)-1}(1-\theta)^{(n-s+1)-1}
\;\Leftrightarrow\;
p(\theta|X) = \frac{\Gamma(n+2)}{\Gamma(s+1)\Gamma(n-s+1)}\theta^{s}(1-\theta)^{n-s} = \frac{\theta^{s}(1-\theta)^{n-s}}{\mathrm{Beta}(s+1,\,n-s+1)},
\]
so that
\[
\theta \mid X, n \sim \mathrm{Beta}(s+1,\,n-s+1),
\]
i.e. the posterior follows a Beta distribution. Since the density integrates to 1, the normalizing constant is
\[
z = \int_0^1 \theta^{s}(1-\theta)^{n-s}\,d\theta = \frac{\Gamma(s+1)\Gamma(n-s+1)}{\Gamma(n+2)}.
\]
The prior predictive distribution for a fixed number s of successes is
\[
p(X) = \int_0^1 p(X|s,\theta)\,p(\theta)\,d\theta = \int_0^1 \binom{n}{s}\theta^{s}(1-\theta)^{n-s}\,d\theta
= \binom{n}{s}\frac{\Gamma(s+1)\Gamma(n-s+1)}{\Gamma(n+2)} = \frac{1}{n+1}.
\]
Hence the prior predictive density p(X) = 1/(n + 1) is uniform over the possible values of s. For a single observation, p(x̃ = 1) = ∫₀¹ p(x̃ = 1|θ)p(θ)dθ = ∫₀¹ θ dθ = 1/2, as an example.

Also, by the mean of the Beta distribution, the Bayes posterior estimator is E(θ|X) = (s + 1)/(n + 2). To exhibit the convexity,
\[
E(\theta|X) = \lambda_n\frac{s}{n} + (1-\lambda_n)\frac{1}{2},
\]
a weighted combination of the maximum likelihood estimate s/n (successes over successes plus failures) and the prior mean 1/2, with weight λ_n = n/(n + 2), which is close to 1 for large n.
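A minimal sketch of this uniform-prior (Beta-Binomial) posterior and the rule-of-succession prediction (assuming SciPy; the n and s values are illustrative):

```python
from scipy import stats

n, s = 10, 7                                 # trials and successes
alpha_post, beta_post = s + 1, n - s + 1     # posterior under the uniform Beta(1,1) prior

posterior = stats.beta(alpha_post, beta_post)
print("posterior mean (s+1)/(n+2) =", posterior.mean())           # Bayes estimator
print("95% equal-tail credible interval:", posterior.interval(0.95))
print("P(next trial succeeds)     =", posterior.mean())           # rule of succession
```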
Example: Subjective Bayesian approach

On the other hand, a subjective Bayesian may replace the uninformative prior with one strongly peaked around 1/2, as a subjective belief about the data. For known s, the posterior by Bayes' rule is
\[
f(\theta|X,s) = \frac{p(X|\theta,s)\,f(\theta|s)}{p(X|s)}, \qquad
p(X|s) = \int_{\Theta} p(X|\theta,s)\,f(\theta|s)\,d\theta.
\]
Setting the prior distribution θ ∼ Beta(α, β), with f(θ|s) = f(θ),
\[
p(\theta|X,s) \propto p(X|\theta)f(\theta) = \binom{n}{s}\theta^{s}(1-\theta)^{n-s}\cdot\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}
\]
\[
= \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+s)\Gamma(\beta+n-s)}\theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1}
\;\propto\; \theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1}
\]
up to the normalizing constant (recall Γ(n) = (n − 1)!). The posterior distribution is θ|X ∼ Beta(α + s, β + n − s), with
\[
E(\theta|X) = \frac{\alpha+s}{\alpha+\beta+n}, \qquad
\mathrm{var}(\theta|X) = \frac{(\alpha+s)(\beta+n-s)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)},
\]
and PDF f_{θ|X}(θ|X = x₁, ..., xₙ).

Moreover, the Bayesian point estimate is the center of the posterior distribution,
\[
\bar{\theta} = \int_0^1 \theta f(\theta|X)\,d\theta = \frac{\alpha+s}{\alpha+\beta+n}
= \frac{n}{\alpha+\beta+n}\cdot\frac{s}{n} + \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta}
\;\Leftrightarrow\; \frac{n}{\alpha+\beta+n}\,\hat{\theta}_{MLE} + \frac{\alpha+\beta}{\alpha+\beta+n}\,E(\theta),
\]
a weighted combination of the MLE and the prior mean E(θ) = α/(α + β).

After the data X have been observed, an unknown observable x̃ is predicted through the posterior predictive distribution. The posterior predictive probability for one new observation x̃ = 1, conditional on the observations X, is
\[
p(\tilde{x}=1|X,s) = \int_0^1 p(\tilde{x}=1|\theta,X,s)\,p(\theta|X,s)\,d\theta
= \int_0^1 \mathrm{Ber}(\tilde{x}=1|\theta)\,\mathrm{Beta}(\theta|\alpha+s,\,\beta+n-s)\,d\theta
\]
\[
= \int_0^1 \theta\,\mathrm{Beta}(\theta|\alpha+s,\,\beta+n-s)\,d\theta = \int_0^1 \theta\,p(\theta|X)\,d\theta = E(\theta|X),
\]
since p(x̃ = 1|θ) = θ for a Bernoulli trial (and p(x̃ = 0|θ) = 1 − θ); hence the posterior predictive probability equals the posterior mean E(θ|X) = (α + s)/(α + β + n). For the purpose of Bayesian inference, the predictive distribution for new observations is derived in this way. Equivalently, in generalized form,
\[
p(\tilde{x}=1|X) = E(\theta|X) = \frac{\alpha+\sum_{i=1}^{n}x_i}{\alpha+\beta+n},
\qquad
p(\tilde{x}=0|X) = 1-E(\theta|X) = \frac{\beta+\sum_{i=1}^{n}(1-x_i)}{\alpha+\beta+n}.
\]
3.1 Conditional distribution of the subset

For the partition x = (x⁽¹⁾', x⁽²⁾')' of the p × 1 vector x, with x⁽¹⁾ ∼ N_q(μ⁽¹⁾, Σ₁₁) and x⁽²⁾ ∼ N_{p−q}(μ⁽²⁾, Σ₂₂), the conditional distribution of x⁽¹⁾ given x⁽²⁾ is N_q(μ_{1.2}, Σ_{11.2}), where
\[
\boldsymbol{\mu}_{1.2} = \boldsymbol{\mu}^{(1)} + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}^{(2)}-\boldsymbol{\mu}^{(2)}),
\qquad
\boldsymbol{\Sigma}_{11.2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}.
\]
Thus the conditional density is x⁽¹⁾|x⁽²⁾ ∼ N(μ⁽¹⁾ + Σ₁₂Σ₂₂⁻¹(x⁽²⁾ − μ⁽²⁾), Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁). (This partition is also used when estimating with the EM algorithm.)

Independence and covariance

For the partition x = (x⁽¹⁾', x⁽²⁾')', x⁽¹⁾ ⊥ x⁽²⁾ ⇔ Σ₁₂ = 0 (this can be shown by checking that f(x) = f(x⁽¹⁾)f(x⁽²⁾) exactly when the off-diagonal block Σ₁₂ is 0). Generally, if x⁽¹⁾ and x⁽²⁾ both follow normal distributions and are independent, then their joint distribution is normally distributed.

Linear transformation

For the linear transformation y = Ax, y follows the distribution N_p(μ*, Σ*), where μ* = Aμ and Σ* = AΣA' (note cov(Ax⁽¹⁾) = A cov(x⁽¹⁾)A' and cov(Ax⁽¹⁾, Bx⁽²⁾) = A cov(x⁽¹⁾, x⁽²⁾)B' when partitioning the vector). Based on the conditional distribution formula, y⁽¹⁾|y⁽²⁾ ∼ N(μ**, Σ**), where
\[
\boldsymbol{\mu}^{**} = \boldsymbol{\mu}^{*(1)} + \boldsymbol{\Sigma}^{*}_{12}\boldsymbol{\Sigma}^{*-1}_{22}(\mathbf{y}^{(2)}-\boldsymbol{\mu}^{*(2)}),
\qquad
\boldsymbol{\Sigma}^{**} = \boldsymbol{\Sigma}^{*}_{11} - \boldsymbol{\Sigma}^{*}_{12}\boldsymbol{\Sigma}^{*-1}_{22}\boldsymbol{\Sigma}^{*}_{21},
\]
with y⁽²⁾ given.
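A minimal numerical sketch of the conditional mean and covariance formulas above (assuming NumPy; the μ, Σ and observed x⁽²⁾ values are illustrative):

```python
import numpy as np

# Partition x = (x1, x2) with x1 of dimension q; compute the law of x1 | x2 = x2_obs.
mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.5, 0.3],
                  [0.5, 0.3, 1.0]])
q = 1                                       # x1 is the first component
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

x2_obs = np.array([2.5, -0.4])
mu_cond = mu[:q] + S12 @ np.linalg.solve(S22, x2_obs - mu[q:])   # mu1 + S12 S22^{-1}(x2 - mu2)
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)               # S11 - S12 S22^{-1} S21
print("conditional mean:", mu_cond, " conditional covariance:", Sigma_cond)
```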
3.1.1 Law of total expectation

Often referred to as the tower property, Adam's law for random variables X and Y is
\[
E(X) = E\big(E(X|Y)\big),
\]
since
\[
E\big(E(X|Y)\big) = E\Big(\sum_x x\,p(X=x|Y)\Big)
= \sum_y\sum_x x\,p(X=x|Y=y)\,p(Y=y)
= \sum_x\sum_y x\,p(X=x,Y=y)
= \sum_x x\,p(X=x) = E(X),
\]
and Eve's law is
\[
\mathrm{var}(X) = E\big(\mathrm{var}(X|Y)\big) + \mathrm{var}\big(E(X|Y)\big),
\]
as
\[
E(X^2) = E\big(E(X^2|Y)\big) = E\Big(\mathrm{var}(X|Y) + \big(E(X|Y)\big)^2\Big),
\]
so that
\[
\mathrm{var}(X) = E(X^2) - E(X)^2
= E\big(\mathrm{var}(X|Y)\big) + E\Big(\big(E(X|Y)\big)^2\Big) - \Big(E\big(E(X|Y)\big)\Big)^2
= E\big(\mathrm{var}(X|Y)\big) + \mathrm{var}\big(E(X|Y)\big).
\]
These are also referred to as the laws of total expectation and total variance. If {Aᵢ} is a partition of the probability space and X takes a finite or countably infinite set of values with E(X) < ∞, the law of total probability in the countable and finite cases guarantees
\[
E(X) = \sum_i E(X|A_i)\,p(A_i).
\]
For Eve's law, notice that the posterior variance is on average smaller than the prior variance: the greater the second term in Eve's law, the more potential there is for reducing our uncertainty about X.

3.1.2 Conditional expectation (MMSE)
For the posterior distribution of an unknown random variable Y, f_{Y|X}(y|x), the point estimate given by the posterior mean is
\[
\hat{y}_M = E(Y|X=x),
\]
which minimizes the mean squared error (MSE) among estimates of Y; it is referred to as the minimum mean squared error (MMSE) estimate, the least mean squares (LMS) estimate, or the Bayes estimate of Y. The posterior density is computed as
\[
f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y)\,f_Y(y)}{f_X(x)}, \qquad
f_X(x) = \int_{-\infty}^{+\infty} f_{X|Y}(x|y)\,f_Y(y)\,dy.
\]
Then the MMSE estimate of Y given X = x is
\[
\hat{y}_M = \int_{-\infty}^{+\infty} y\,f_{Y|X}(y|x)\,dy
\;\Rightarrow\; E(\hat{Y}_M) = E\big(E(Y|X)\big) = E(Y),
\]
by Adam's law. Hence E(Y) − E(Ŷ_M) = 0, i.e. the estimator is unbiased.

Properties of the estimation error

Let Y be the unobserved random variable to be estimated from X = x, let the estimate Ŷ = g(X) be a function of X, and define the estimation error Ỹ = Y − Ŷ = Y − g(X), with MSE E[(Y − g(X))²]. The goal is to express the variance and expectation of Y through the estimator. Let W = E(Ỹ|X), with Ŷ_M = E(Y|X) and Ỹ = Y − Ŷ_M. Then
\[
W = E(\tilde{Y}|X) = E(Y-\hat{Y}_M|X) = E(Y|X) - E(\hat{Y}_M|X) = \hat{Y}_M - \hat{Y}_M = 0.
\]
For any function g(X),
\[
E(\tilde{Y}g(X)|X) = g(X)\,E(\tilde{Y}|X) = g(X)\,W = 0,
\]
and, by Adam's law for iterated expectations,
\[
E(\tilde{Y}g(X)) = E\big[E(\tilde{Y}g(X)|X)\big] = 0.
\]
Moreover, the estimation error Ỹ = Y − Ŷ_M and Ŷ_M = E(Y|X) are uncorrelated, which allows the variance of Y to be decomposed. By the covariance formula,
\[
\mathrm{cov}(\tilde{Y},\hat{Y}_M) = E(\tilde{Y}\hat{Y}_M) - E(\tilde{Y})E(\hat{Y}_M) = E(\tilde{Y}\hat{Y}_M) = 0,
\]
where E(Ỹ) = E(E(Ỹ|X)) = 0 and E(ỸŶ_M) = 0 follow from the previous results (take g(X) = Ŷ_M). Since cov(Ỹ, Ŷ_M) = 0 and Y = Ŷ_M + Ỹ,
\[
\mathrm{var}(Y) = \mathrm{var}(\hat{Y}_M) + \mathrm{var}(\tilde{Y})
\;\Leftrightarrow\;
E(Y^2) - E(Y)^2 = E(\hat{Y}_M^2) - E(\hat{Y}_M)^2 + E(\tilde{Y}^2) - E(\tilde{Y})^2,
\]
where E(Ŷ_M) = E(E(Y|X)) = E(Y) and E(Ỹ) = E(Y − Ŷ_M) = 0, such that
\[
E(Y^2) = E(\hat{Y}_M^2) + E(\tilde{Y}^2).
\]
Also, the MSE of the estimator of Y given X, where Y is the unknown variable, is
\[
\mathrm{MSE} = E\big[\mathrm{var}(Y|X)\big], \qquad
\mathrm{var}(Y|X) = E\big[(Y-E(Y|X))^2 \mid X\big] = E(Y^2|X) - E(Y|X)^2
\]
by definition, so that
\[
E\big[\mathrm{var}(Y|X)\big] = E\Big[E\big[(Y-E(Y|X))^2 \mid X\big]\Big] = E\big[(Y-E(Y|X))^2\big] = E\big[(Y-\hat{Y}_M)^2\big],
\]
which is the MSE of the estimator. The derivations above also cover the convolution of normals and the bivariate normal estimators below.

MSE for the convolution of two normally distributed random variables

For X ∼ N(μ_X, σ²_X) independent of W ∼ N(μ_W, σ²_W), let Y = X + W. The goal is to derive X̂_M = E(X|Y) and E[(X − X̂_M)²], which verifies E(X²) = E(X̂²_M) + E(X̃²). Now cov(X, Y) = cov(X, X + W) = var(X) + cov(X, W) = σ²_X by independence, and ρ_{X,Y} = cov(X, Y)(σ_Xσ_Y)⁻¹ = σ_X/σ_Y, where σ²_Y = σ²_X + σ²_W. Then the MMSE estimate of X given Y is
\[
E(X|Y) = \hat{X}_M = \mu_X + \rho\,\frac{\sigma_X}{\sigma_Y}(Y-\mu_Y) = \mu_X + \frac{\sigma_X^2}{\sigma_Y^2}(Y-\mu_Y),
\]
which is analogous to E(x⁽¹⁾|x⁽²⁾) = μ⁽¹⁾ + Σ₁₂Σ₂₂⁻¹(x⁽²⁾ − μ⁽²⁾). Also, the MSE of X̂_M is E(X̃²) = E[(X − X̂_M)²], obtained by substituting the derived equations; since E(X²) = σ²_X + E(X)², the identity E(X²) = E(X̂²_M) + E(X̃²) can be verified by substitution from the above equations.
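A minimal simulation of this signal-plus-noise MMSE example (assuming NumPy; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
mu_x, sd_x, mu_w, sd_w = 1.0, 2.0, 0.0, 1.0
x = rng.normal(mu_x, sd_x, n)
w = rng.normal(mu_w, sd_w, n)
y = x + w                                         # observed Y = X + W

var_y = sd_x**2 + sd_w**2
x_hat = mu_x + (sd_x**2 / var_y) * (y - (mu_x + mu_w))   # MMSE estimate E(X|Y)

mse = np.mean((x - x_hat) ** 2)
print("empirical MSE        :", round(mse, 4))
print("theoretical var(X|Y) :", round(sd_x**2 * sd_w**2 / var_y, 4))
print("check E(X^2) = E(Xhat^2) + E((X-Xhat)^2):",
      round(np.mean(x**2), 3), "vs", round(np.mean(x_hat**2) + mse, 3))
```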
3.1.3 Laplace's law of succession

The rule of succession is used when there are few observations, or when an event has not been observed to occur at all in the (finite) sample, to assess the probability that the next repetition succeeds. When Bernoulli trials are repeated n times independently with s successes, if X₁, ..., X_{n+1} are conditionally independent random variables, then
\[
p(X_{n+1}=1 \mid X_1+\cdots+X_n = s) = \frac{s+1}{n+2}.
\]
To encode the prior uncertainty about success or failure, let θ ∼ U[0, 1] as a prior probability measure. Let Xᵢ describe the i-th trial, taking values 0 and 1, and let xᵢ be the data actually observed. The likelihood function for θ is
\[
L(\theta) = p(X_1=x_1,\dots,X_n=x_n|\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^{s}(1-\theta)^{n-s},
\]
where s = Σᵢ₌₁ⁿ xᵢ is the number of successes in the n trials. The goal is to derive the posterior distribution
\[
f(\theta|X_1=x_1,\dots,X_n=x_n) = \frac{L(\theta)f(\theta)}{\int_0^1 L(\tilde{\theta})f(\tilde{\theta})\,d\tilde{\theta}}
= \frac{\theta^{s}(1-\theta)^{n-s}}{\int_0^1 \tilde{\theta}^{s}(1-\tilde{\theta})^{n-s}\,d\tilde{\theta}},
\]
where the Beta distribution PDF yields
\[
\int_0^1 \tilde{\theta}^{s}(1-\tilde{\theta})^{n-s}\,d\tilde{\theta} = \frac{\Gamma(s+1)\Gamma(n-s+1)}{\Gamma(n+2)},
\]
so that B(α, β) = Γ(s + 1)Γ(n − s + 1)/Γ(n + 2) with α = s + 1, β = n − s + 1, and
\[
f(\theta|X_1=x_1,\dots,X_n=x_n) = \frac{(n+1)!}{s!\,(n-s)!}\theta^{s}(1-\theta)^{n-s},
\]
which is the Beta distribution with expected value
\[
E(\theta|X_1=x_1,\dots,X_n=x_n) = \int_0^1 \theta f(\theta|X_1=x_1,\dots,X_n=x_n)\,d\theta = \frac{s+1}{n+2};
\]
since θ is a random variable, the law of total probability gives this as the expected probability of success on the next trial.

For the extreme cases s = 0 or s = n, the hypergeometric distribution Hyp(θ|N, n, Θ) can be used, where Θ is the total number of successes in a population of total size N. When N, Θ → ∞ with the ratio θ = Θ/N fixed, the prior proportional to 1/{θ(1 − θ)} is roughly equivalent to 1/{Θ(N − Θ)} with 1 ≤ Θ ≤ N − 1, and the posterior for Θ is proportional to this prior times the hypergeometric likelihood of the observed number of successes.

For the conjugate prior of the multinomial distribution, the Dirichlet distribution gives the posterior: the joint posterior of θ₁, ..., θ_m is
\[
f(\theta_1,\dots,\theta_m \mid n_1,\dots,n_m, I) = \frac{\Gamma\big(\sum_{i=1}^{m}(n_i+1)\big)}{\prod_{i=1}^{m}\Gamma(n_i+1)}\,\theta_1^{n_1}\cdots\theta_m^{n_m},
\qquad \sum_{i=1}^{m}\theta_i = 1.
\]
3.1.4 Bayesian Hypothesis testing

For two hypotheses H₀ and Hₐ, let p(H₀) = p₀ and p(Hₐ) = p₁ with p₀ + p₁ = 1. Also, for the random variable X, the distribution of X under each hypothesis is f_X(x|H₀) and f_X(x|Hₐ). By Bayes' rule, the posterior probabilities of H₀ and Hₐ are
\[
p(H_0|X=x) = \frac{f_X(x|H_0)\,p(H_0)}{f_X(x)}, \qquad
p(H_a|X=x) = \frac{f_X(x|H_a)\,p(H_a)}{f_X(x)}.
\]
The comparison between p(H₀|X = x) and p(Hₐ|X = x) can be used to decide between H₀ and Hₐ, taking the hypothesis with the higher posterior probability.

Maximum A Posteriori (MAP) test

The idea of choosing the hypothesis with the higher posterior probability is referred to as the MAP test. H₀ is chosen if and only if
\[
p(H_0|X=x) \ge p(H_a|X=x) \;\Leftrightarrow\; f_X(x|H_0)\,p(H_0) \ge f_X(x|H_a)\,p(H_a).
\]
The MAP test also generalizes to more than two hypotheses: choose the hypothesis with the highest posterior probability, p(Hᵢ|X = x) ∝ f_X(x|Hᵢ)p(Hᵢ). The average error probability of a test is written as
\[
p_e = p(\text{choose } H_a|H_0)\,p(H_0) + p(\text{choose } H_0|H_a)\,p(H_a),
\]
and the MAP test achieves the minimum possible average error probability.

For either a continuous f_{X|Y}(x|y) or a discrete p_{X|Y}(x|y), the maximum a posteriori (MAP) estimate x̂_MAP can be obtained for point or interval estimates of X by maximizing f_{Y|X}(y|x)f_X(x) over x, since f_Y(y) does not depend on x.

Minimum cost hypothesis test

In testing two hypotheses there are two types of error: accepting H₀ while Hₐ is true, or accepting Hₐ while H₀ is true. Let the costs of these errors be C₁₀ and C₀₁ respectively. Then the average cost is
\[
C = C_{10}\,p(\text{choose } H_a|H_0)\,p(H_0) + C_{01}\,p(\text{choose } H_0|H_a)\,p(H_a)
= p(\text{choose } H_a|H_0)\,[p(H_0)C_{10}] + p(\text{choose } H_0|H_a)\,[p(H_a)C_{01}].
\]
Since p(Hᵢ|X = x) = f_X(x|Hᵢ)p(Hᵢ)/f_X(x), H₀ is chosen if and only if
\[
f_X(x|H_0)\,p(H_0)\,C_{10} \ge f_X(x|H_a)\,p(H_a)\,C_{01}
\;\Leftrightarrow\;
\frac{f_X(x|H_0)}{f_X(x|H_a)} \ge \frac{p(H_a)\,C_{01}}{p(H_0)\,C_{10}}
\;\Leftrightarrow\;
p(H_0|x)\,C_{10} \ge p(H_a|x)\,C_{01},
\]
as the decision rule. Hence the posterior risk of accepting Hₐ is p(H₀|x)C₁₀ (equivalently, the posterior risk of accepting H₀ is p(Hₐ|x)C₀₁), and the minimum cost test accepts the hypothesis with the lowest posterior risk.

Decision rule for cost in hypothesis testing

In the two-hypothesis case with H₀ and H₁, let C_ij be the cost of accepting Hᵢ given that H_j is true. Since the cost of a correct decision is less than that of an incorrect one, C_ii < C_ji for i ≠ j, the average cost is
\[
C = \sum_{i,j\in\{0,1\}} C_{ij}\,p(\text{choose } H_i|H_j)\,p(H_j),
\]
and the goal is to find the decision rule that minimizes the average cost.

First, choosing the correct hypothesis is the complement of choosing the wrong one:
\[
p(\text{choose } H_0|H_0) = 1 - p(\text{choose } H_1|H_0), \qquad
p(\text{choose } H_1|H_1) = 1 - p(\text{choose } H_0|H_1).
\]
Hence
\[
C = C_{00}[1-p(\text{choose } H_1|H_0)]p(H_0) + C_{01}\,p(\text{choose } H_0|H_1)\,p(H_1)
+ C_{10}\,p(\text{choose } H_1|H_0)\,p(H_0) + C_{11}[1-p(\text{choose } H_0|H_1)]p(H_1)
\]
\[
= (C_{10}-C_{00})\,p(\text{choose } H_1|H_0)\,p(H_0) + (C_{01}-C_{11})\,p(\text{choose } H_0|H_1)\,p(H_1) + C_{00}p(H_0) + C_{11}p(H_1),
\]
where C₀₀p(H₀) + C₁₁p(H₁) is constant. Minimizing C is therefore equivalent to minimizing
\[
D = p(\text{choose } H_1|H_0)\,p(H_0)(C_{10}-C_{00}) + p(\text{choose } H_0|H_1)\,p(H_1)(C_{01}-C_{11}).
\]
Applying the same reasoning as in the previous inequality, H₀ is chosen if and only if
\[
f_X(x|H_0)\,p(H_0)(C_{10}-C_{00}) \ge f_X(x|H_1)\,p(H_1)(C_{01}-C_{11})
\;\Leftrightarrow\;
p(H_0|X)(C_{10}-C_{00}) \ge p(H_1|X)(C_{01}-C_{11}).
\]
3.1.5
Bayesian Interval Estimation
Instead of posterior density 𝑓π‘₯1 |π‘₯2 (π‘₯ 1 |π‘₯ 2 ) for unobserved random variable π‘₯ 1 given observed π‘₯ 2 , the (1 − 𝛼)100% credible interval of π‘₯1 being in [π‘Ž, 𝑏] is derived as
𝑝(π‘Ž ≤ π‘₯ 1 ≤ 𝑏|𝑋2 = π‘₯ 2 ) = 1 − 𝛼.
Bivariate normal example
For 𝑋1 ∼ 𝑁 (0, 1) and 𝑋2 ∼ 𝑁 (1, 4) with 𝜌(𝑋1 , 𝑋2 ) = 41 , the goal is to derive a
95% credible interval for 𝑋1 , given 𝑋2 = 2 is observed. Analogous to 𝐸 (x (1) |x (2) ) =
(2) − πœ‡ (2) ),
πœ‡ (1) + 𝚺12 𝚺−1
22 (x
𝐸 (𝑋1 |𝑋2 = π‘₯ 2 ) = πœ‡ 𝑋1 + πœŒπœŽπ‘‹1
π‘₯ 1 − πœ‡π‘₯2
,
πœŽπ‘‹2
15Then, there would be 2 more cases, including 𝐢00 : The cost of choosing 𝐻0 given 𝐻0 is true and
𝐢11 : The cost of choosing 𝐻1 given 𝐻1 is true.
16which is 𝐢00 𝑝(choose 𝐻0 |𝐻0 ) 𝑝(𝐻0 ) + 𝐢01 𝑝(choose 𝐻0 |𝐻1 ) 𝑝(𝐻1 ) + 𝐢10 𝑝(choose 𝐻1 |𝐻0 ) 𝑝(𝐻0 ) +
𝐢11 𝑝(choose 𝐻1 |𝐻1 ) 𝑝(𝐻1 )
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
37
STA498
Lim, Kyuson
where 𝜌 𝑋1 ,𝑋2 πœŽπ‘‹1 πœŽπ‘‹2 = Σ12 ⇔ π‘π‘œπ‘£(𝑋1 , 𝑋2 ) (and πœŽπ‘‹1 𝑋1 = Σ11 ), equivalently. Similar to
π‘£π‘Žπ‘Ÿ (x (1) |x (2) ) = 𝚺11 − 𝚺12 𝚺−1
22 𝚺21 ,
π‘£π‘Žπ‘Ÿ (𝑋1 |𝑋2 = π‘₯ 2 ) = πœŽπ‘‹21 − 𝜌 2 πœŽπ‘‹21 .
Hence, the 𝑋1 |𝑋2 = 2 is normally distributed with mean as 𝐸 (𝑋1 |𝑋2 = π‘₯ 2 ) = 0+ 12 ( 2−1
2 ) =
1
3
1
4 and variance as π‘£π‘Žπ‘Ÿ (𝑋1 |𝑋2 = π‘₯ 2 ) = 1 − 4 = 4 . For 𝛼 = 0.05, the interval is
𝑝(π‘Ž ≤ 𝑋1 ≤ 𝑏|𝑋2 = 2) = 0.95 which is centered around 𝐸 (𝑋1 |𝑋2 = π‘₯ 2 ) = 14 with the
form of [ 14 − 𝑐, 14 + 𝑐].
1
1
𝑐
−𝑐
𝑐
𝑝( − 𝑐 ≤ 𝑋1 ≤ + 𝑐|𝑋2 = 2) = Φ √︁
− Φ √︁
= 2Φ √︁
− 1 = 0.95.
4
4
3/4
3/4
3/4
√οΈ‚
⇔𝑐=
3 −1 1.95
Φ
= 1.7
4
2
Thus, the 95% credible interval for 𝑋1 is
1
1
− 𝑐, + 𝑐 = [−1.45, 1.95]
4
4
3.2
Prior
The prior distribution of an uncertain quantity is to express one’s beliefs about this
quantity before some evidence is taken into account. Based on the unconditional probability, the chosen parameters of the prior distribution are hyperparameters. The prior
for parameter πœƒ is denoted as πœ‹(πœƒ) include conjugate priors with the binomial/beta and
multinomial/Dirichlet families.
In case prior ∝ constant, the Bernoulli example is a representative form of the noninformative prior as 𝑝(πœƒ) = 1 lead to πœƒ|𝑋 ∼ π΅π‘’π‘‘π‘Ž(𝑠 + 1, 𝑛 − 𝑠 + 1) to be seen earlier.
Bayesian Procedure
1. Choose a probability density πœ‹(πœƒ) = 𝑝(πœƒ) that expresses our beliefs about a parameter
πœƒ before any data.
2. Choose a statistical model 𝑝(π‘₯|πœƒ) that reflects our beliefs about π‘₯ given πœƒ.
3. After observing data 𝑋 = {π‘₯1 , ..., π‘₯ 𝑛 }, update beliefs to compute the posterior
distribution 𝑝(πœƒ|𝑋).
3.2.1
Conjugate Prior
Simply, if the prior 𝑝(πœƒ) and posterior 𝑝(πœƒ|π‘₯) have the same form, then the prior and
posterior is conjugate distributions, for the likelihood function 𝑝(π‘₯|πœƒ).
38
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Lim, Kyuson
STA498
For the class of sampling distribution 𝑝(π‘₯|πœƒ), the class of prior distribution 𝑝(πœƒ) which
is the family is conjugate for the class 𝑝(π‘₯|πœƒ), if
𝑝(πœƒ|π‘₯) = ∫
Θ
𝑝(π‘₯|πœƒ)πœ‹(πœƒ)
𝑝(π‘₯|πœƒ)πœ‹(πœƒ)π‘‘πœƒ
is in the class of 𝑝(πœƒ) for all 𝑝(·|πœƒ) that is in the class 𝑝(π‘₯|πœƒ) and 𝑝(·) in the class of 𝑝(πœƒ).
Hence, the prior distribution family is conjugate to the family of sampling distribution
for any posterior distributions.
This include only for exponential family distribution. The class of sampling distribution 𝑝(π‘₯|πœƒ) of exponential family is generalized with its form
𝑝(π‘₯𝑖 |πœƒ) = 𝑓 (π‘₯𝑖 )𝑔(πœƒ)𝑒 πœ™(πœƒ)
𝑇 𝑒(π‘₯
𝑖)
,
where πœ™(πœƒ) and 𝑒(π‘₯𝑖 ) are vectors and πœ™(πœƒ) is the parameters. For x,
𝑛
∑︁
𝑛
𝑇
𝑒(π‘₯𝑖 ) ,
𝑝(x|πœƒ) ∝ 𝑔(πœƒ) exp πœ™(πœƒ)
𝑖=1
Í𝑛
where 𝑖=1
𝑒(π‘₯𝑖 ) is the sufficient statistics for πœƒ as the likelihood for πœƒ depends on the
data x. Hence, the likelihood for x is
𝑛
∑︁
𝑛
𝑛
𝑇
𝑝(x|πœƒ) = Π𝑖=1 𝑓 (π‘₯𝑖 ) 𝑔(πœƒ) exp πœ™(πœƒ)
𝑒(π‘₯𝑖 ) .
𝑖=1
If the prior density is specified as
𝑝(πœƒ) ∝ 𝑔(πœƒ) π‘š exp(πœ™(πœƒ)𝑇 𝑣).
Then, the posterior density is
𝑝(πœƒ|π‘₯) ∝ 𝑔(πœƒ) π‘š+𝑛 exp(πœ™(πœƒ)𝑇 𝑣 +
𝑛
∑︁
𝑒(π‘₯𝑖 )),
𝑖=1
as the prior density is conjugate.
List of Conjugate Models
Likelihood
Binomial
Negative Binomial
Poisson
Geometric
Exponential
Normal (mean unknown)
Normal (variance unknown)
Normal (mean, variance unknown)
Prior
Beta
Beta
Gamma
Beta
Gamma
Normal
Inverse Gamma
Normal / Gamma
Posterior
Beta
Beta
Gamma
Beta
Gamma
Normal
Inverse Gamma
Normal / Gamma
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
39
STA498
3.2.2
Lim, Kyuson
Univariate Normal distribution Conjugate Prior with known
variance
From previous example, Beta(πœƒ|𝛼, 𝛽) ∝ πœƒ 𝛼−1 (1 − πœƒ) 𝛽−1 to be derived. In the univariate
case, the normal distribution of observation 𝑋 with multiple observations π‘₯ 1 , ..., π‘₯ 𝑛 is in
the form of
1
1
2
exp −
𝑓 (𝑋 |πœƒ) = √
(𝑋 − πœƒ) , 𝑋 ∼ 𝑁 (πœƒ, 𝜎 2 ),
2
2
2𝜎
2πœ‹πœŽ
which is part of
𝑓 (πœƒ|𝑋) ∝ 𝑓 (𝑋 |πœƒ) 𝑓 (πœƒ).
Since the variance 𝜎 2 is known, the joint prior is just a prior of 𝑓 (πœƒ) in this case compared
to variance unknown 17. The goal is to ultimately update the unknown quantity of πœƒ,
which is to find the 𝑓 (πœƒ|π‘₯𝑖 ). First, the likelihood function by definition for current data
of multiple observations where 𝑓 (πœƒ) is the prior mean is
(π‘₯𝑖 − πœƒ) 2
1
𝑛
exp −
.
𝑓 (𝑋 |πœƒ) ∝ 𝐿 (πœƒ|𝑋) = Π𝑖=1 √
2𝜎 2
2πœ‹πœŽ 2
On the other hand, the prior is parametrized with known hyperparameters πœ‡0 and 𝜏02
where πœƒ ∼ 𝑁 (πœ‡02 , 𝜏02 ),
1
1
1
2
2
𝑓 (πœƒ) ∝ exp − 2 (πœƒ − πœ‡0 ) , as 𝑓 (πœƒ) = √οΈƒ
exp − 2 (πœƒ − πœ‡0 ) ,
2𝜏0
2𝜏0
2
2πœ‹πœ
0
where 𝜏0 18 is referred to a precision that control how mean can be varied, as the multiple
observations has a standard deviation different from 𝜎 19. Note that πœ‡0 is the prior mean
and 𝜏0 reflect the variation of πœƒ around πœ‡0 .
However, 𝑋 = {π‘₯ 1 , ..., π‘₯ 𝑛 } such that
∑︁
𝑛
1
(π‘₯𝑖 − πœƒ) 2
.
𝑓 (𝑋 |πœƒ) = 𝑓 (π‘₯ 1 , ..., π‘₯ 𝑛 |πœƒ) =
exp −
(2πœ‹) 𝑛/2 𝜎0𝑛
2𝜎02
𝑖=1
Hence, the posterior is prior times the likelihood to yield
Í𝑛
2
1
1
(πœƒ − πœ‡0 ) 2
𝑖=1 (π‘₯𝑖 − πœƒ)
𝑓 (πœƒ|π‘₯) ∝ √οΈƒ
exp −
+
.
2
2
2
𝜎
𝜏
2
2
0
𝜏0 𝜎0
Ignoring constant terms, the posterior is expressed as
Í𝑛 2
¯ + π‘›πœƒ 2 (πœƒ − πœ‡0 ) 2
1
𝑖=1 π‘₯𝑖 − 2𝑛π‘₯πœƒ
𝑓 (πœƒ|π‘₯) ∝ exp −
+
,
2
𝜎2
𝜏02
17The 𝑓 (πœƒ, 𝜎 2 |𝑋) ∝ 𝑓 (𝑋 |πœƒ, 𝜎 2 ) 𝑓 (πœƒ, 𝜎 2 ) is where variance unknown.
18The variable reflect how much each observation π‘₯𝑖 have varied and does not directly reflect the
variability of individual sampling for π‘₯𝑖 to be.
19For example, the class of 30 students has mean grade π‘₯¯ = 75 with sd 𝜎 = 10 but over serval semesters
the overall mean πœƒ = 75 and sd for the class means is 𝜏0 = 5.
40
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Lim, Kyuson
STA498
and drop any terms that does not include πœƒ and arrange terms for πœƒ 2 and πœƒ,
2 2
2
2 ¯ + π‘›πœƒ 2 𝜏 2 2
2 2
2
2 ¯ 1 𝜎 πœƒ − 2𝜎 πœ‡0 πœƒ − 2π‘›πœ0 π‘₯πœƒ
1 (π‘›πœ0 + 𝜎 )πœƒ − 2(𝜎 πœ‡0 + π‘›πœ0 π‘₯)πœƒ
0
⇔ exp −
= exp −
2
2
𝜎 2 𝜏02
𝜎 2 𝜏02
2
2
2¯
2
2 1 πœƒ − 2πœƒ (𝜎0 πœ‡0 + π‘›πœ0 π‘₯)/(π‘›πœ
0 + 𝜎0 )
,
= exp −
2
(𝜎02 𝜏02 )/(π‘›πœ02 + 𝜎02 )
then divide by (π‘›πœ02 + 𝜎 2 ), dropping any constant to simplify for
2
2¯
2
2 2 1 [πœƒ − (𝜎 πœ‡0 + π‘›πœ0 π‘₯)/(π‘›πœ
0 + 𝜎 )]
.
𝑓 (πœƒ|𝑋) ∝ exp −
2
(𝜎 2 𝜏02 )/(π‘›πœ02 + 𝜎 2 )
Therefore, πœƒ|π‘₯ ∼ 𝑁 (πœ‡1 , 𝜏12 ), where
πœ‡1 =
𝜎 2 πœ‡0 + π‘›πœ02 π‘₯¯
π‘›πœ02 + 𝜎 2
,
𝜎12
=
𝜎 2 𝜏02
π‘›πœ02 + 𝜎 2
.
The posterior mean πœ‡1 is expressed as a weighted average 20 of the prior mean and the
observed value π‘₯, with weights proportional to 𝜏02 21.
In case 𝜎02 = 𝜏02 22, the prior mean is only weighted 1/(𝑛 + 1) in the posterior 23.
Posterior predictive distribution
For the future observation π‘₯˜ the posterior predictive observation 𝑝( π‘₯|𝑋)
˜
is
∫
1
2
𝑓 ( π‘₯|𝑋)
˜
=
𝑓 ( π‘₯|πœƒ)
˜ 𝑓 (πœƒ|𝑋)π‘‘πœƒ ⇔ 𝑓 (πœƒ|π‘₯) ∝ exp − 2 (πœƒ − πœ‡1 )
2𝜏1
Θ
such that
1
1
2
2
⇔ 𝑓 ( π‘₯|𝑋)
˜
∝
exp −
( π‘₯˜ − πœƒ) exp − 2 (πœƒ − πœ‡1 ) π‘‘πœƒ.
2𝜎 2
2𝜏1
Θ
∫
Notice that the joint distribution of ( π‘₯,
˜ πœƒ) bivariate normal distribution, where the
marginal posterior distribution of π‘₯˜ is normal with 𝐸 ( π‘₯|πœƒ)
˜
= πœƒ and π‘£π‘Žπ‘Ÿ ( π‘₯|πœƒ)
˜
= 𝜎2.
By the law of total probability,
𝐸 ( π‘₯|𝑋)
˜
= 𝐸 (𝐸 ( π‘₯|πœƒ,
˜ 𝑋)|𝑋) = 𝐸 (πœƒ|𝑋) = πœ‡1 , and
π‘£π‘Žπ‘Ÿ ( π‘₯|𝑋)
˜
= 𝐸 (π‘£π‘Žπ‘Ÿ ( π‘₯|πœƒ)|𝑋)
¯
+ π‘£π‘Žπ‘Ÿ (𝐸 ( π‘₯|πœƒ,
¯ 𝑋)|𝑋) = 𝐸 (𝜎 2 |𝑋) + π‘£π‘Žπ‘Ÿ (πœƒ|𝑋) = 𝜎 2 + 𝜏12 .
This, the posterior predictive distribution of π‘₯˜ has mean equal to the posterior mean of
πœƒ. However, the predictive variance is 𝜎 2 and the second variance 𝜏12 due to posterior
uncertainty in πœƒ.
20If the sample variance is large, then the prior mean has considerable weight in the posterior. if the
prior variance is large, the sample mean has considerable weight in the posterior.
21The posterior precision is 𝜏12 and the prior precision is 𝜏02 .
22That is each observation sd is same as sampling distribution sd
23As it is reduced to (πœ‡0 + 𝑛π‘₯)/(𝑛
¯
+ 1).
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
41
STA498
Lim, Kyuson
Univariate Normal distribution Conjugate Prior with unknown variance
In most cases, the 𝜎 2 is unknown. Note that
𝑓 (πœƒ, 𝜎 2 |𝑋) ∝ 𝑓 (𝑋 |πœƒ, 𝜎 2 ) 𝑓 (πœƒ, 𝜎 2 ).
Hence, the joint prior for both πœƒ and 𝜎 2 should be specified including the prior of πœƒ. If
the two parameters are assumed to be independent 𝜎 and πœƒ, then 𝑝(πœƒ, 𝜎 2 ) = 𝑝(πœƒ) 𝑝(𝜎 2 )
to establish the separate priors for each parameter.
Previously, πœƒ ∼ 𝑁 (πœ‡0 , 𝜏02 ) where 𝜏0 is the measure of belief for πœƒ, the easies prior
for 𝜎 2 is the non-informative prior. This would be discussed in the next chapter.
3.2.3
Non-informative Prior
For determination of the prior, if there is no prior information about the πœƒ, then the
non-informative prior is about the minimal influence on the inference.
However, the uniform prior could not be simply used for non-informative prior as the
reparametrization is invariant. The uniform prior on πœƒ does not correspond to the uniform
prior for 1/πœƒ 24. As a mean of ignorance, there is no unique prior in non-informative,
and the preferable prior is sufficient to use for. On the other hands, the uniform prior is
possible by construction to be invariant non-informative prior using location parameters
and scale parameters.
For the location parameters, the random variable 𝑋 distributed uniformly for 𝑓 (𝑋 − πœƒ)
with location parameter πœƒ. As π‘Œ = 𝑋 + π‘Ž is distributed as 𝑓 (𝑦 − πœ™) with πœ™ = πœƒ + π‘Ž, 𝑋 and
π‘Œ has the same distribution but just different parameters. Hence, the prior distribution
is location invariant: πœ‹(πœƒ) = πœ‹(πœƒ − π‘Ž) ⇔ πœ‹(πœƒ) = 1.
For scale parameters, 𝑋 ∼ 𝜎1 𝑓 ( 𝜎π‘₯ ) with scale parameter 𝜎. Precisely, the distribution is scale invariant as for 𝑐 > 0 as πœ‹(𝜎) = 1𝑐 πœ‹( πœŽπ‘ ) where the invariant non-informative
prior for the scale parameter πœ‹(𝜎) = 𝜎 −1 satisfies the equation.
3.2.4
Univariate Normal distribution Conjugate Prior with unknown
variance
Continuously, both πœƒ and 𝜎 2 need to be specified. Previously, 𝑝(πœƒ, 𝜎 2 ) = 𝑝(πœƒ) 𝑝(𝜎 2 ) to
be separated for each by the independence. Note that full probability model for πœƒ and
𝜎 2 is
𝑓 (πœƒ, 𝜎 2 |π‘₯) = 𝑓 (π‘₯|πœƒ, 𝜎 2 ) 𝑓 (πœƒ, 𝜎 2 )
24For transformation corresponding to 1-1 function 𝑔(πœƒ) is πœ‹(πœƒ) = 1, πœ™ = 𝑔(πœƒ) ⇒ πœ‹(πœƒ) = | π‘‘π‘‘πœ™ 𝑔 −1 (πœ™)|.
42
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Lim, Kyuson
STA498
For 𝜏 2 that is the measure of the uncertainty for πœƒ, the 𝜎 2 will be used to update the
knowledge of 𝜏 25 as to specify the prior of 𝜎 2 .
Now, to develop non-informative priors the first approach is to assign a uniform prior
for πœƒ and log(𝜎 2 ) because 𝜎 2 > 0 and log(𝜎 2 ) ∈ R. For transformation on log(𝜎 2 )
into the density of 𝜎 2 , by the definition of non-informative prior 𝑝(log(𝜎 2 )) ∝ constant
such that by the Jacobian matrix 𝑝(𝜎 2 ) ∝ 𝑑 log(𝜎 2 )/π‘‘πœŽ 2 constant = (1/𝜎 2 ) constant.
The joint prior is also 𝑝(πœƒ, 𝜎 2 ) ∝ 1/𝜎 2 .
Without the log(𝜎 2 ) ∈ R, the other approach is to choose values of πœ‡0 and 𝜏 2 where
πœ‡ ∼ 𝑁 (πœ‡0 , 𝜏 2 ) and non-informative prior for 𝜎 2 . With relative non-informative prior
for 𝜎 2 where 1/𝜎 2 ∼ πΊπ‘Žπ‘šπ‘šπ‘Ž(𝛼, 𝛽), 𝜎 2 ∼ Inverse Gamma (IG) 26 (𝛼, 𝛽) is chosen to
follow where the density function is
𝑓 (𝜎 2 ) =
𝛽𝛼 −2(𝛼+1)
𝜎
exp(−𝛽/𝜎 2 ),
Γ(𝛼)
𝜎 2 > 0.
Hence,
𝑓 (𝜎 2 |𝛼, 𝛽) ∝ (𝜎 2 ) −(𝛼+1) exp(−𝛽/𝜎 2 ).
Notice that 𝜎 2 ∼ 𝐼𝐺 (0, 0) ⇔ 𝑝(𝜎 2 ) ∝ 1/𝜎 2 27.
Unknown variance: posterior density of πœƒ
√
As the prior of 𝑓 (π‘₯|πœƒ, 𝜎 2 ) = 1/ 2πœ‹πœŽ 2 exp − (π‘₯𝑖 −πœƒ) 2 /(2𝜎 2 ) , the posterior distribution
for πœƒ and 𝜎 2 with joint prior of 𝑓 (πœƒ, 𝜎 2 ) = 1/𝜎 2 is 𝑓 (πœƒ, 𝜎 2 |𝑋) = 𝑓 (π‘₯|πœƒ, 𝜎 2 ) 𝑓 (πœƒ, 𝜎 2 )
which is
1 𝑛
1
(π‘₯𝑖 − πœƒ) 2
2
.
𝑓 (πœƒ, 𝜎 |𝑋) ∝ 2 Π𝑖=1 √
exp −
𝜎
2𝜎 2
2πœ‹πœŽ 2
Notice that for two parameters the conditional posterior distribution could generally be
determined by the joint as 𝑓 (πœƒ, 𝜎 2 |𝑋) ∝ 𝑓 (πœƒ, 𝜎 2 ). Now, 𝑋 = {π‘₯ 1 , ..., π‘₯ 𝑛 } such that
∑︁
𝑛
1
(π‘₯𝑖 − πœƒ) 2
𝑓 (𝑋 |πœƒ, 𝜎 ) = 𝑓 (π‘₯ 1 , ..., π‘₯ 𝑛 |πœƒ, 𝜎 ) =
exp −
2𝜎 2
(2πœ‹) 𝑛/2 𝜎 𝑛
𝑖=1
2
2
2
⇔ 𝑓 (πœƒ, 𝜎 |𝑋) ∝
1
(2πœ‹) 𝑛/2 𝜎 𝑛+2
1
exp −
2
2
𝑖=1 π‘₯𝑖
Í𝑛
− 2𝑛π‘₯πœƒ
¯ + π‘›πœƒ 2
𝜎2
.
Hence, the posterior dropping terms that does not contain πœƒ of a parameter of interest
yields
1 −2𝑛π‘₯πœƒ
¯ + π‘›πœƒ 2
2
𝑓 (πœƒ|𝑋, 𝜎 ) ∝ exp −
.
2
𝜎2
25From CLT where for 𝑛 observation, 𝜎 2 and 𝜏 2 is related as π‘₯¯ ∼ 𝑁 (πœƒ, 𝜎 2 /𝑛) and for fixed 𝜎 2 , 𝜎 2 /𝑛
is the estimate of 𝜏 2 to update as it depend heavily on the new sample data.
26The IG distribution is used as a conjugate prior for the variance of the normal distribution model.
27Due to improper prior to be discussed, if both parameters approach 0 then distribution is set as the
prior for 𝜎 2 .
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
43
STA498
Then, for πœƒ divide by 𝑛 and add up constant term to get quadratic term
(πœƒ − π‘₯)
¯ 2
2
𝑓 (πœƒ|𝑋, 𝜎 ) ∝ exp −
.
2𝜎 2 /𝑛
Lim, Kyuson
This result in posterior distribution πœƒ|𝑋, 𝜎 2 ∼ 𝑁 ( π‘₯,
¯ 𝜎 2 /𝑛). Notice that the CLT of the
sampling distribution of π‘₯¯ follows 𝑁 (πœ‡0, 𝜎 2 /𝑛) as well.
Unknown variance approach 1: marginal posterior density of 𝜎 2
However, the posterior distribution of 𝜎 2 is derived by the conditional distribution of
𝜎 2 |πœƒ, 𝑋 or by the joint posterior distribution for πœ‡ and 𝜎 2 . In the first approach, the
terms involving 𝜎 2 is considered to give
1
2
𝑓 (𝜎 |πœƒ) ∝
𝜎 𝑛+2
𝑛
∑︁
(π‘₯𝑖 − πœƒ) 2
⇔ 𝑓 (πœƒ|𝜎 2 , 𝑋) 𝑓 (𝜎 2 )
exp −
2
2𝜎
𝑖=1
which is the Inverse Gamma distribution without the normalizing constant 𝛽𝛼 /Γ(𝛼),
as the πœƒ is considered to be fixed. Equivalently, two parameters 𝛼 = 𝑛/2 and 𝛽 =
Í𝑛
2
2
𝑖=1 (π‘₯𝑖 − πœƒ) /2 for the IG distribution that 𝜎 |πœƒ follows to be.
Unknown variance approach 2: marginal posterior density of 𝜎 2
The second approach is to use the equation for Bayes’ rule of
𝑓 (πœƒ, 𝜎 2 |𝑋) = { 𝑓 (πœƒ, 𝜎 2 , 𝑋)/ 𝑓 (𝜎 2 , 𝑋)}{ 𝑓 (𝜎 2 , 𝑋)/ 𝑓 (π‘₯)}
⇔ 𝑓 (πœƒ|𝜎 2 , 𝑋) 𝑓 (𝜎 2 |𝑋),
1
(2πœ‹) 𝑛/2 𝜎 𝑛
where 𝑓 (𝑋 |πœƒ, 𝜎 2 ) = 𝑓 (π‘₯1 , ..., π‘₯ 𝑛 |πœƒ, 𝜎 2 ) =
exp −
(π‘₯ 𝑖 −πœƒ) 2 to derive to
2𝜎 2
(𝜎 2 |𝑋), which is what
Í𝑛
𝑖=1
separate 𝑓 (πœƒ|𝜎 2 , 𝑋) and the marginal posterior density of 𝜎 2 𝑓
the goal is. Previously,
Í𝑛 2
¯ + π‘›πœƒ 2
1
𝑖=1 π‘₯𝑖 − 2𝑛π‘₯πœƒ
2
𝑓 (πœƒ, 𝜎 |𝑋) ∝
exp −
2𝜎 2
(2πœ‹) 𝑛/2 𝜎 𝑛+2
. Then, rearrange the terms to isolate πœ‡2 and divide by 𝑛 for the equation to get squared
terms.
Í
Í 2
(πœ‡ − π‘₯)
¯ 2 + π‘₯𝑖2 /𝑛 − π‘₯¯ 2
π‘₯𝑖 − 𝑛π‘₯¯ 2
1
1
(πœ‡ − π‘₯)
¯ 2
1
⇔ 𝑛+2 exp −
,
=
−
× π‘›+1 exp
𝜎
𝜎
2𝜎 2 /𝑛
2𝜎 2 /𝑛
𝜎
2𝜎 2
where the first term corresponds to 𝑓 (πœƒ|𝜎 2 , 𝑋) and the second term correspond to
𝑓 (𝜎 2 |𝑋). Notice that for 𝑓 (𝜎 2 |𝑋) the numerator is the sample variance. Similarly,
Í𝑛
Í𝑛
𝜎 2 |∼ 𝐼𝐺 ((𝑛 − 1)/2, (𝑛 − 1) 𝑖=1
(π‘₯𝑖 − π‘₯)
¯ 2 /2) 28, where 𝑖=1
(π‘₯𝑖 − π‘₯)
¯ 2 =var(π‘₯).
28Equivalently, 𝑛 + 1 = −2
44
−𝑛−1
2
= −2
−𝑛+1
2
−
2
2
⇒ 𝜎 −2(−(
𝑛−1
2 )−1)
, where the 𝛼 =
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
𝑛−1
2
to be.
Lim, Kyuson
Unknown variance: connection for the multivariate normal
STA498
For 𝚺 that is 𝑑 dimensions with 𝑛 𝑑𝑓 , X ∼ π‘Š 𝑝 (𝚺) and both X, Σ is positive definite then
1
(𝑛−𝑝−1)/2
−1
𝑓 (X) ∝ |X|
exp − π‘‘π‘Ÿ (Σ X)
2
1
. As the conjugate prior distribution
ignoring the normalizing constant 2𝑛 𝑝 /2|𝚺| 𝑛/2
à 𝑝 (𝑛/2)
of the univariate normal distribution is the IG, the inverse Wishart distribution is the
conjugate prior of the Σ in multivariate normal distribution 29. Similarly, X ∼ π‘Š 𝑝−1 (𝚺−1 )
then
1 𝑛/2 −(𝑛+𝑝+1)/2
1
−1 −1
𝑓 (X) ∝
|X|
exp − π‘‘π‘Ÿ (Σ X ) .
Σ
2
3.3
Posterior
Note that the usual Bayesian inference typically involves (1) establishing a model and
obtaining a posterior distribution for the parameter (πœƒ) of interest, (2) generating samples
from the posterior distribution, and (3) using discrete formulas applied to the samples
from the posterior distribution to summarize our knowledge of the parameters. There are
two sampling methods which include the inversion method of sampling, and rejection
method of sampling, which is for understanding MCMC methods.
Weakly-informative Prior
As for specifying and justifying for the prior distribution, the prior distribution represents
a population of possible parameter values, from which the πœƒ of current interest has been
drawn from the population point of view. However, for subjective interpretation, the
uncertainty about πœƒ as if its value is thought of as a random realization from the prior
distribution.
3.3.1
Maximum A Posteriori (MAP)
For given π‘₯1 , ..., π‘₯ 𝑛 ∼ 𝑁 (πœƒ, 𝜎 2 ) as random variables and 𝑋 = {π‘₯1 , ..., π‘₯ 𝑛 }, the prior
distribution of πœƒ is given as 𝑁 (πœ‡0 , 𝜎02 ). The function is maximized as
𝑓 (𝑋 |πœƒ) 𝑓 (πœƒ) = 𝐿(πœƒ)πœ‹(πœƒ) =
𝑛
Π𝑖=1
√
2 2
1
1 π‘₯𝑖 − πœƒ
1 πœƒ − πœ‡0
exp −
exp −
.
√οΈƒ
2 𝜎
2 𝜎0
2πœ‹πœŽ 2
2πœ‹πœŽ 2
1
0
𝑛 𝑓 (𝑋 = π‘₯ , ..., π‘₯ |πœƒ). However, the πœƒˆ
Notice that the πœƒˆπ‘€ 𝐿𝐸 = Π𝑖=1
1
𝑛
𝑀 𝐴𝑃 is the mode of the
Í𝑛
posterior distribution that is maximized as log( 𝑝(πœƒ)) + 𝑖=1 log( 𝑓 (π‘₯𝑖 |πœƒ)).
29Note that the marginal distribution of mean vector πœ‡ is the multivariate 𝑑 distribution.
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
45
STA498
Lim, Kyuson
For derivation, the
log( 𝑓 (πœƒ|πœƒ)) =
∑︁
𝑛 𝑖=1
√οΈƒ
√︁
2
2
(π‘₯
−
πœƒ)
𝑖
2 − (πœƒ − πœ‡0 ) .
2
− log 2πœ‹πœŽ −
−
log
2πœ‹πœŽ
0
2𝜎 2
2𝜎02
Then, the derivative is
𝑛 𝛼 log( 𝑓 (πœƒ|πœƒ)) ∑︁ (π‘₯𝑖 − πœƒ)
(πœƒ − πœ‡0 )
=
−
=0
2
2
π›Όπœƒ
𝜎
𝜎
0
𝑖=1
𝑛 ∑︁
(π‘₯𝑖 − πœƒ)
𝑛
∑︁ π‘₯𝑖
(πœƒ − πœ‡0 )
π‘›πœ‡ (πœƒ − πœ‡0 )
−
=
⇔
2
2
2
2
𝜎
𝜎
𝜎
𝜎
𝜎02
0
𝑖=1
𝑖=1
Í
Í𝑛
(𝜎 2 + π‘›πœŽ02 )πœƒ 𝜎 2 πœ‡0 + 𝜎02 π‘₯𝑖
𝜎 2 πœ‡0 + 𝜎02 𝑖=1
π‘₯𝑖
ˆπ‘€ 𝐴𝑃 =
⇔
=
⇔
πœƒ
.
𝜎 2 𝜎02
𝜎 2 𝜎02
𝜎 2 + π‘›πœŽ02
⇔
=
Equivalently, to minimize the function of πœƒ for the posterior distribution of 𝑓 (πœƒ|𝑋) from
Í𝑛 π‘₯𝑖 −πœƒ 2 πœƒ−πœ‡0 2
+ 𝜎0 . Hence,
previous prior chapter derivation, the part is to minimize 𝑖=1
𝜎
the MAP estimate of πœ‡ is derived to be
Í
∑︁
𝑛
2
𝜎02 ( π‘₯𝑖 ) + πœ‡0 𝜎 2
π‘›πœŽ02
𝜎
1
π‘₯𝑖 +
πœ‡0 =
,
πœƒˆπ‘€ 𝐴𝑃 =
π‘›πœŽ02 + 𝜎 2 𝑛 𝑖=1
π‘›πœŽ02 + 𝜎 2
π‘›πœŽ02 + 𝜎 2
which is the weighted average for the prior and sample mean.
3.3.2
Multivariate Normal distribution with known Σ
The multivariate normal likelihood is x𝑖 |πœƒ, Σ ∼ 𝑁 (πœƒ, Σ) 30 without the normalizing
constant, 1/2πœ‹ 𝑝/2 , as previously defined. Equivalently, the likelihood function for single
observation model is
1
−1/2
𝑇 −1
𝑓 (x𝑖 |πœƒ, Σ) ∝ |Σ|
exp − (x𝑖 − πœƒ) Σ (x𝑖 − πœƒ) ,
2
and for samples of 𝑛 iid observations which is X = {x1 , ..., x𝑛 } is
𝑓 (X|πœƒ, Σ) = 𝑓 (x1 , ..., x𝑛 |πœƒ, Σ) ∝ |Σ|
−𝑛/2
𝑛
1 ∑︁
𝑇 −1
exp −
(x𝑖 − πœƒ) Σ (x𝑖 − πœƒ) .
2 𝑖=1
Analogous from univariate case with variance known which is 𝑓 (πœƒ|𝑋) ∝ 𝑓 (𝑋 |πœƒ) 𝑓 (πœƒ)
where πœƒ ∼ 𝑁 (πœ‡0 , 𝜏02 ) from 3.2.2, the multivariate posterior distribution is generalization
of a multiple observation for
𝑓 (πœƒ|X) ∝ 𝑓 (X|πœƒ) 𝑓 (πœƒ) with πœƒ ∼ 𝑁 (πœ‡0 , Λ0 )
30Note that Σ is 𝑑 × π‘‘ matrix that is positive definite and πœƒ is a multivariate.
46
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Lim, Kyuson
STA498
𝑛 𝑓 (x |πœƒ) that is,
Equivalently, the posterior distribution is 𝑓 (πœƒ|X) ∝ 𝑓 (πœƒ)Π𝑖=1
𝑖
𝑛
1
1 ∑︁
𝑇 −1
−1/2
𝑇 −1
−𝑛/2
(x𝑖 −πœƒ) Σ (x𝑖 −πœƒ) ,
𝑓 (πœƒ|X) = |Λ0 |
exp − (πœƒ−πœ‡0 ) Λ0 (πœƒ−πœ‡0 )
|Σ|
exp −
2
2 𝑖=1
and dropping the constant terms yield for
𝑛
∑︁
1
𝑇 −1
2
𝑇 −1
⇔ 𝑓 (πœƒ|X) ∝ exp − (πœƒ − πœ‡0 ) Λ0 (πœƒ − πœ‡0 ) +
(x𝑖 − πœƒ) Σ (x𝑖 − πœƒ) .
2
𝑖=1
Now, take natural logarithm (log) to simplify for derivation of log density.
𝑛
1
1 ∑︁
𝑇 −1
𝑇 −1
(x𝑖 − πœƒ) Σ (x𝑖 − πœƒ) − (πœƒ − πœ‡0 ) Λ0 (πœƒ − πœ‡0 )
log( 𝑓 (πœƒ|X)) = −
2 𝑖=1
2
𝑛
∑︁
1
𝑛
𝑇 −1
πœƒ 𝑇 Σ−1 x𝑖 − πœƒ 𝑇 Λ−1
= − πœƒ 𝑇 Σ−1 πœƒ +
0 πœƒ + πœƒ Λ0 πœ‡ 0 ,
2
2
𝑖=1
∑︁
∑︁ 1 𝑇
1 𝑇
−1
−1
𝑇
−1
−1
−1
−1
𝑇
−1
−1
x𝑖 ) = − πœƒ (π‘›Σ +Λ0 )πœƒ−2πœƒ (Λ0 πœ‡0 +Σ
x𝑖 ) .
= − πœƒ (π‘›Σ +Λ0 )πœƒ+πœƒ (Λ0 πœ‡0 +Σ
2
2
Now, copula modeling for matrix multiplication and arrangement is used.
𝑇
𝑛
∑︁
∑︁ 1
−1
−1 −1
−1
−1
−1
−1
−1
−1 −1
−1
−1
⇔ − πœƒ−(π‘›Σ +Λ0 ) (Λ0 πœ‡0 +Σ
x𝑖 ) (π‘›Σ +Λ0 ) πœƒ−(π‘›Σ +Λ0 ) (Λ0 πœ‡0 +Σ + x𝑖 ) ,
2
𝑖=1
which is the log density of a normal distribution
∑︁
−1
−1 −1
−1
−1
−1
−1 −1
πœƒ|X ∼ 𝑁 (π‘›Σ + Λ0 ) (Λ0 πœ‡0 + Σ
x𝑖 ), (π‘›Σ + Λ0 ) .
Equivalently, the mean πœ‡π‘› and inverse of cov-variance matrix Λ−1
𝑛 31 is defined as
−1 −1
−1
−1
πœ‡π‘› = (Λ−1
0 + π‘›Σ ) (Λ0 πœ‡0 + π‘›Σ xΜ„),
−1
−1
Λ−1
𝑛 = Λ0 + π‘›Σ ,
using the Woodbury identity on our expression for the covariance matrix. As the multivariate normal distribution has the conjugate prior for multivariate normal distribution
analogous to univariate case, the Σ is an inverse Wishart distribution to be defined. Note
that the posterior conditional and marginal distributions of subvectors of πœƒ with known
Σ could be also derived.
Posterior predictive distribution for x̃
For new observation xΜƒ ∼ 𝑁 (πœƒ, Σ), the joint distribution is defined as
𝑓 ( xΜƒ, πœƒ|X) ∝ 𝑁 ( xΜƒ|πœƒ, Σ)𝑁 (πœƒ|πœ‡π‘› , Λ𝑛 ).
The joint posterior distribution of xΜƒ is multivariate normal as the Σ is known. By the
Adam’s law,
𝐸 ( xΜƒ|X) = 𝐸 {𝐸 ( xΜƒ|πœƒ, X)|X} = 𝐸 (πœƒ|X) = πœ‡π‘› ,
and also applying for the Eve’s law and MMSE,
π‘£π‘Žπ‘Ÿ ( π‘₯|X)
˜
= 𝐸 {π‘£π‘Žπ‘Ÿ ( xΜƒ|πœƒ, X)|X} + π‘£π‘Žπ‘Ÿ{𝐸 ( xΜƒ|πœƒ, X)|X} = 𝐸 (Σ|X) + π‘£π‘Žπ‘Ÿ (πœƒ|X) = Σ + Λ𝑛 .
31Notice that the posterior precision is the sum of prior and data precisions.
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
47
STA498
Non-informative prior density of πœƒ
Lim, Kyuson
If 𝑓 (πœƒ) ∝ constant by definition as the precision which is the variance of the prior
converge to 0, |Λ−1
0 |, then the prior mean is irrelevant for the posterior density.
3.3.3
Multivariate Normal distribution with unknown Σ
Goal scheme
Previously, the posterior with two parameters πœƒ and 𝜎 2 is defined as 𝑓 (πœƒ, 𝜎 2 |𝑋) ∝
𝑓 (𝑋 |πœƒ, 𝜎 2 ) 𝑓 (πœƒ, 𝜎 2 ). Similarly, as x𝑖 |πœƒ, Σ ∼ 𝑁 (πœƒ, Σ), the posterior distribution is defined
as
𝑓 (πœƒ, Σ|X) ∝ 𝑓 (X|πœƒ, Σ) 𝑓 (πœƒ, Σ).
Analogous from univariate unknown 𝜎 2 : scheme connection
[ 𝑓 (Σ)] : Since the multivariate approach is exactly analogous to univariate approach,
from 3.2.4 the 𝑓 (𝜎 2 ) is derived as Inverse Gamma (IG) (𝛼, 𝛽) where the Inverse Gamma
is a Inverse Wishart distribution in multivariate for prior distribution for the Σ.
[ 𝑓 (πœƒ|Σ)] : Similarly, πœƒ|Σ ∼ 𝑁 (πœ‡0 , Σ/π‘˜) as the univariate in 3.2.4 shown to have
π‘₯)
¯ 2
which is πœƒ|𝜎 2 , 𝑋 ∼ 𝑁 ( π‘₯,
¯ 𝜎 2 /𝑛) as a normal distribu𝑓 (πœƒ|𝜎 2 , 𝑋) ∝ exp − (πœƒ−
2𝜎 2 /𝑛
tion for posterior density function for 𝑓 (πœƒ|Σ).
[ 𝑓 (πœƒ, Σ)] : Also, the posterior density of πœƒ is following to be normal as with respect to
𝑓 (πœƒ, 𝜎 2 |𝑋) ∝ 𝑓 (πœƒ, 𝜎 2 ), where 𝑓 (πœƒ, 𝜎 2 ) = 𝑓 (πœƒ|𝜎 2 ) 𝑓 (𝜎 2 ) analogously such that
1
𝑝/2
−( 𝑝+𝑑+1)/2
−1
𝑓 (Σ) 𝑓 (πœƒ|Σ) = |Λ0 | ×|Σ|
exp − π‘‘π‘Ÿ (Λ0 Σ )
2
1
𝑇
−1
−1/2
× |Σ|
exp − (πœƒ − πœ‡0 ) π‘˜Σ (πœƒ − πœ‡0 )
2
1
π‘˜
−{( 𝑝+𝑑)/2+1}
−1
𝑇 −1
⇔ 𝑓 (πœƒ, Σ) ∝ |Σ|
exp − π‘‘π‘Ÿ (Λ0 Σ ) − (πœƒ − πœ‡0 ) Σ (πœƒ − πœ‡0 ) .
2
2
Therefore, using the inverse-Wishart distribution to describe for the prior distribution
of the Σ, Σ ∼ π‘Š 𝑝−1 (Λ−1
0 ), where the hyperparameters of (πœ‡0 , Λ0 /π‘˜, 𝑝, Λ0 ) is used for
parametrization. Notice πœƒ|Σ ∼ 𝑁 (πœ‡0 , Σ/π‘˜), as 𝑝 and Λ0 controls the 𝑑𝑓 and scale matrix
for the inverse Wishart distribution on Σ.
Posterior with unknown 𝜎 2 : conclusion
[ 𝑓 (X|πœƒ, Σ)]: In 𝑓 (πœƒ, Σ|X) ∝ 𝑓 (X|πœƒ, Σ) 𝑓 (πœƒ, Σ), the likelihood function is normal
distri
1
𝑇
−1
−1/2
bution that is defined as 𝑓 (x𝑖 |πœƒ, Σ) ∝ |Σ|
exp − 2 (x𝑖 − πœƒ) Σ (x𝑖 − πœƒ) when the
48
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Lim, Kyuson
cov-variance is known (which is analogous from univariate case).
STA498
Hence, the
𝑓 (X|πœƒ, Σ) ∝ |Σ|
−𝑛/2
𝑛
1 ∑︁
(x𝑖 − πœƒ)𝑇 Σ−1 (x𝑖 − πœƒ)
exp −
2 𝑖=1
𝑛
1 ∑︁
𝑇 −1
𝑇 −1
⇔ |Σ| exp −
(x𝑖 − xΜ„) Σ (x𝑖 − xΜ„) + 𝑛( xΜ„ − πœƒ) Σ ( xΜ„ − πœƒ)
2 𝑖=1
𝑛
−𝑛
1
1 ∑︁
2
𝑇
−1
2
⇔ |Σ| 2 exp − (𝑛−1)S +𝑛( xΜ„−πœƒ) Σ ( xΜ„−πœƒ) , S =
(x𝑖 − xΜ„)𝑇 Σ−1 (x𝑖 − xΜ„)
2
𝑛 − 1 𝑖=1
−𝑛
2
by definition. Before, 𝑓 (πœƒ|X) ∝ 𝑓 (X|πœƒ) 𝑓 (πœƒ), πœƒ ∼ 𝑁 (πœ‡0 , Λ0 ) in 3.3.2. Thus, the
posterior of two unknown parameters is derived with respect to
𝑓 (πœƒ, Σ|X) ∝ 𝑓 (X|πœƒ, Σ) 𝑓 (πœƒ, Σ).
Notice, few lines before
1
π‘˜
−1
𝑇 −1
𝑓 (πœƒ, Σ) = |Λ0 | |Σ|
exp − π‘‘π‘Ÿ (Λ0 Σ ) − (πœƒ − πœ‡0 ) Σ (πœƒ − πœ‡0 ) .
2
2
π‘˜
1
−1
𝑇 −1
𝑝/2
−{( 𝑝+𝑑)/2+1}
⇔ 𝑓 (πœƒ, Σ|X) ∝ |Λ0 | |Σ|
exp − π‘‘π‘Ÿ (Λ0 Σ ) − (πœƒ − πœ‡0 ) Σ (πœƒ − πœ‡0 )
2
2
−𝑛
1
2
𝑇 −1
2
×|Σ| exp − (𝑛 − 1)S + 𝑛( xΜ„ − πœƒ) Σ ( xΜ„ − πœƒ)
2
𝑝/2
−{( 𝑝+𝑑)/2+1}
= |Λ0 | 𝑝/2 |Σ| −{(𝑛+𝑝+𝑑)/2+1}
1
−1
2
𝑇 −1
𝑇 −1
exp − π‘‘π‘Ÿ (Λ0 Σ ) + (𝑛 − 1)S + 𝑛( xΜ„ − πœƒ) Σ ( xΜ„ − πœƒ) + π‘˜ (πœƒ − πœ‡0 ) Σ (πœƒ − πœ‡0 ) .
2
Posterior with unknown 𝜎 2 : derivation for square and Inverse-Wishart kernel
First, (𝑛 − 1)S2 + 𝑛( xΜ„ − πœƒ)𝑇 Σ−1 ( xΜ„ − πœƒ) + π‘˜ (πœƒ − πœ‡0 )𝑇 Σ−1 (πœƒ − πœ‡0 ) = (𝑛 − 1)S2 + 𝑛x̄𝑇 Σ−1 xΜ„ −
2π‘›πœƒ 𝑇 Σ−1 xΜ„ + π‘›πœƒ 𝑇 Σ−1 πœƒ + π‘˜πœƒ 𝑇 Σ−1 πœƒ − 2π‘˜πœƒ 𝑇 Σ−1 πœ‡0 + π‘˜ πœ‡π‘‡0 Σ−1 πœ‡0 .
Now, for rearrangement later, add (π‘˜ + 𝑛)πœƒ 𝑇𝑛 Σ−1 πœƒ 𝑛 and subtract for balancing out the
equation for posterior distribution parameters such that (𝑛 − 1)S2 + (π‘˜ + 𝑛)πœƒ 𝑇 Σ−1 πœƒ −
2πœƒ 𝑇 Σ−1 (π‘˜ πœ‡0 + 𝑛XΜ„) + (π‘˜ + 𝑛)πœƒ 𝑇𝑛 Σ−1 πœƒ 𝑛 − (π‘˜ + 𝑛)πœƒ 𝑇𝑛 Σ−1 πœƒ 𝑛 + π‘˜ πœ‡π‘‡0 Σ−1 πœ‡0 + 𝑛x̄𝑇 Σ−1 xΜ„ =
π‘˜π‘›
(πœ‡0 − XΜ„)𝑇 Σ−1 (πœ‡0 − XΜ„). Then,
(𝑛 − 1)S2 + (π‘˜ + 𝑛)(πœƒ − πœƒ 𝑛 )𝑇 Σ−1 (πœƒ − πœƒ 𝑛 ) + π‘˜+𝑛
1
𝑝/2
−{( 𝑝+𝑛+𝑑+1)/2}
−1
𝑓 (πœƒ, Σ|X) ∝ |Λ0 | |Σ|
exp − π‘‘π‘Ÿ (Λ0 Σ )
2
−1
1
π‘˜π‘›
2
𝑇 −1
𝑇 −1
2
×|Σ| exp − (𝑛 −1)S + (π‘˜ + 𝑛)(πœƒ − πœƒ 𝑛 ) Σ (πœƒ − πœƒ 𝑛 ) +
(πœ‡0 − XΜ„) Σ (πœ‡0 − XΜ„)
2
π‘˜ +𝑛
1
π‘˜π‘›
𝑝/2
− ( 𝑝+𝑛+𝑑+1)
−1
2
𝑇
−1
2
⇔ |Λ0 | |Σ|
exp − π‘‘π‘Ÿ (Λ0 Σ ) + (𝑛 − 1)S +
(πœ‡0 − XΜ„) Σ (πœ‡0 − XΜ„)
2
π‘˜ +𝑛
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
49
STA498
×|Σ|
−1
2
1
𝑇 −1
exp − (π‘˜ + 𝑛)(πœƒ − πœƒ 𝑛 ) Σ (πœƒ − πœƒ 𝑛 )
2
Lim, Kyuson
The idea is to build up a model that is exactly similar to Normal × Inversed Wishart
and identify the parameters. For identifying the exponent of the normal by inverted
Wishart kernels, the property of adding symmetric matrix and multiplication is used,
ie. π‘‘π‘Ÿ ( 𝐴) + π‘‘π‘Ÿ (𝐡) = π‘‘π‘Ÿ ( 𝐴 + 𝐡), π‘‘π‘Ÿ (𝐷𝐢) = π‘‘π‘Ÿ (𝐢𝐷) and π‘₯𝑇 Σ−1 π‘₯ = π‘‘π‘Ÿ (π‘₯ 𝑑 Σ−1 π‘₯) ⇒
π‘‘π‘Ÿ (π‘₯𝑇 Σ−1 π‘₯) = π‘‘π‘Ÿ (π‘₯π‘₯𝑇 Σ−1 ).
1 Í𝑛
𝑇 −1
Notice that S2 = 𝑛−1
𝑖=1 (x𝑖 − xΜ„) Σ (x𝑖 − xΜ„). Then, the first part of the exponential is simplified to be
1
π‘˜π‘›
−1
2
𝑇 −1
− π‘‘π‘Ÿ (Λ0 Σ ) + (𝑛 − 1)S +
(πœ‡0 − XΜ„) Σ (πœ‡0 − XΜ„)
2
π‘˜ +𝑛
∑︁
π‘˜π‘›
1
𝑇
𝑇
(x𝑖 − XΜ„)(x𝑖 − XΜ„) −
( XΜ„ − πœ‡0 ) ( XΜ„ − πœ‡0 ) Σ−1
= − π‘‘π‘Ÿ Λ0 +
2
π‘˜ +𝑛
Σ
These properties enable the equation to rearrange as 𝑁 πœƒ 𝑛 , π‘˜+𝑛
× Inverted Wishart(πœƒ 𝑛 , Λ𝑛 )
−1
with Σ , which is
𝑛+ 𝑝+𝑑+1
𝑝
π‘˜ +𝑛
1
−1
−1/2
𝑇
−1
−
2
exp −
(πœƒ − πœƒ 𝑛 ) Σ (πœƒ − πœƒ 𝑛 ) ,
exp − π‘‘π‘Ÿ (Λ𝑛 Σ ) |Σ|
|Λ𝑛 | 2 |Σ|
2
2
and det(Λ0 ) = det(Λ𝑛 ) as symmetric matrix. Now, comparing with the equation of the
interest,
1
πœƒπ‘› =
(π‘˜ πœ‡0 + 𝑛XΜ„),
π‘˜ +𝑛
Λ𝑛 = Λ0 +
𝑛
∑︁
(x𝑖 − XΜ„)(x𝑖 − XΜ„)𝑇 +
π‘˜π‘›
( XΜ„ − πœ‡0 )( XΜ„ − πœ‡0 )𝑇 ,
π‘˜ +𝑛
𝑖=1
where the first term of πœƒ 𝑛 matches with the second term in the modelling and the second
term fo Λ𝑛 describes the first term in the modelling for equivalent relationship. Thus,
Σ
πœƒ, Σ|X ∼ 𝑁 πœƒ 𝑛 ,
with πœƒ × Inverse Wishart (πœƒ 𝑛 , Λ𝑛 ) with Σ−1 ,
π‘˜ +𝑛
to follow the modelling. Also,
π‘˜ +𝑛
𝑇 −1
(πœƒ − πœƒ 𝑛 ) Σ (πœƒ − πœƒ 𝑛 ) .
πœƒ|Σ ∼ 𝑁 πœƒ 𝑛 , (π‘˜ + 𝑛) Σ ⇔ 𝑓 (πœƒ|Σ, X) ∝ exp −
2
−1
Posterior with unknown 𝜎 2 : uninformative priors
The joint uninformative prior (with a locally uniform prior for πœƒ) is 𝑓 (πœƒ, Σ) ∝ |Σ| −
and the joint posterior is derived as
𝑛
1
− 𝑑+1
−
2
𝑇
−1
𝑓 (πœƒ, Σ|X) ∝ |Σ| 2 |Σ| 2 exp − (𝑛 − 1)S + 𝑛( XΜ„ − πœƒ) Σ ( XΜ„ − πœƒ)
2
1
− 𝑛+𝑑+1
−1
⇔ |Σ| 2 exp − π‘‘π‘Ÿ (𝑆(πœƒ)Σ ) ,
2
50
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
𝑑+1
2
,
Lim, Kyuson Í
STA498
𝑛
𝑇 + 𝑛( XΜ„ − πœƒ)( XΜ„ − πœƒ) 𝑇 . Then, the conditional posterior
where 𝑆(πœƒ) = 𝑖=1
(
XΜ„
−
x
)(
XΜ„
−
x
)
𝑖
𝑖
for πœƒ|Σ ∼ 𝑁 XΜ„, Σ𝑛 such that
𝑛
𝑇 −1
𝑓 (πœƒ|Σ, X) ∝ exp − (πœƒ − XΜ„) Σ (πœƒ − XΜ„)
2
Multivariate list of Conjugate Models
Parameter
Prior πœ‹(πœƒ)
2
0)
exp − (πœƒ−πœ‡
, 𝜎02 =
2𝜎 2
Normal πœƒ
0
𝑝 𝛼−1 (1
3.3.4
π‘˜
𝛽−1
𝑝)
Beta ∝
−
𝛼−1
Beta ∝ πœƒ
exp(−π‘πœƒ)
Beta-Bin 𝑝
Gamma-exp πœƒ
* πœ‡π‘› =
𝜎2
π‘˜ πœ‡0 +𝑛XΜ„
2
π‘˜+𝑛 , πœŽπ‘›
=
Likelihood 𝑓 (πœƒ|X)
2
𝑛 exp − (x𝑖 −πœƒ)
Π𝑖=1
2𝜎 2
Posterior 𝑓 (πœƒ|X)
2
𝑛)
exp − (πœƒ−πœ‡
*
2𝜎 2
Bin ∝ 𝑝 𝑠 (1 − 𝑝) 𝑛−𝑠
exp ∝ πœƒ 𝑛 exp(−π‘ πœƒ)
Beta ∝ 𝑝 𝛼+𝑠+1 (1 − 𝑝) 𝑏+𝑛−𝑠−1
Gamma ∝ πœƒ π‘Ž+𝑛−1 exp(−(𝑏 + 𝑠)πœƒ)
𝑛
𝜎2
π‘˜+𝑛
Lindley’s Paradox
Based on the different choices of certain prior distribution, the frequentist and Bayesian
give a different result for the hypothesis testing. The paradox occurs for the result of an
experiment where there are two explanations 𝐻0 and π»π‘Ž with some prior distribution πœ‹
to represent the uncertainty that gives which hypothesis is more accurate before taking
into account for the result π‘₯.
Lindley’s paradox occurs as the result of π‘₯ is significant by the frequentist test of 𝐻0 ,
indicating the sufficient evidence to reject 𝐻0 at a given 𝛼 = 0.05. On the other hands,
Bayesian approach examine the posterior probability of 𝐻0 given π‘₯ is high, indicating
strong evidence that 𝐻0 is better than π‘₯ with π»π‘Ž to take the 𝐻0 .
Example for the comparison
In a statistics program around the world, 4900 male and 4700 male is enrolled at a
certain time period. The observed proportion π‘₯ of male student is 4900/9600 = 0.51.
We assume the fraction of male student is a Binomial variable with parameter πœƒ. The
goal is to test whether πœƒ is 0.5 or other value. That is, 𝐻0 : πœƒ = 0.5 and π»π‘Ž : πœƒ ≠ 0.5.
The frequentist approach to testing 𝐻0 is to compute a p-value, the probability of
observing a fraction of male student at least as large as π‘₯ assuming 𝐻0 is true. A
normal approximation for the fraction of the male student is 𝑋 ∼ 𝑁 (πœ‡, 𝜎 2 ) with
πœ‡ = 𝑛𝑝 = π‘›πœƒ = 9600(0.5) = 4800 and 𝜎 2 = π‘›πœƒ (1 − πœƒ) = 9600(0.5)(1 − 0.5) = 2400,
∫ 9600
(𝑒 − πœ‡) 2
1
exp −
𝑑𝑒
𝑝(𝑋 ≥ π‘₯|πœ‡ = 4800) =
√
2𝜎 2
π‘₯=4900 2πœ‹πœŽ 2
∫ 9600
1
(𝑒 − 4800) 2
=
exp −
𝑑𝑒 = 0.020613.
√︁
4800
π‘₯=4900 2πœ‹(2400)
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
51
STA498
Lim, Kyuson
For the two sided test, the p-value is 2(0.020613) = 0.0402 such that < 0.05 to reject
the 𝐻0 and take π»π‘Ž to be different from the observed data for 0.5.
The Bayesian approach assumes for the equal prior probability as there is no favour
in the hypothesis to be, which is πœ‹(𝐻0 ) = πœ‹(π»π‘Ž ) = 0.5. Also, πœƒ ∼ π‘ˆ [0, 1] under π»π‘Ž ,
where the posterior probability under 𝐻0 for π‘˜ = 4900 is described as
𝑝(π‘˜ |𝐻0 ) 𝑝(𝐻0 )
𝑝(𝐻0 |π‘˜) =
,
𝑝(π‘˜ |𝐻0 ) 𝑝(𝐻0 ) + 𝑝(π‘˜ |π»π‘Ž ) 𝑝(π»π‘Ž )
after observing π‘˜/𝑛 = 4900/9600 births, where the posterior probability is computed
from the PMF of the binomial distribution under each hypothesis,
𝑛
𝑝(π‘˜ |𝐻0 ) =
(0.5) π‘˜ (1 − 0.5) 𝑛−π‘˜ = 0.021.
π‘˜
∫ 1 1
𝑛
𝑛
π‘˜
𝑛−π‘˜
𝑝(π‘˜ |π»π‘Ž ) =
(πœƒ) (1 − πœƒ) π‘‘πœƒ =
π΅π‘’π‘‘π‘Ž(π‘˜ + 1, 𝑛 − π‘˜ + 1) =
= 0.001041.
π‘˜
π‘˜
𝑛+1
0
⇒ 𝑝(π‘˜ |𝐻0 ) > 𝑝(π‘˜ |π»π‘Ž )
Hence, the posterior probability for 𝑝(𝐻0 |π‘˜) = 0.95 which strongly favours 𝐻0 over π»π‘Ž .
Thus, the two approaches Bayesian and frequentist appears to be in conflict, as paradoxical which also leads to the goodness-of-fit test.
3.3.5
Bernstein-von Mises theorem
The Bernstein-von Mises theorem is a result that links Bayesian inference with Frequentist inference. In particular, it states that Bayesian credible sets of a certain credibility
level 𝛼 will asymptotically be confidence sets of confidence level 𝛼, which allows for
the interpretation of Bayesian credible sets, under the probabilistic generating process.
In parameter inference, a posterior distribution converges in the limit of infinitely many
data to a multivariate normal distribution centered with the given covariance matrix. Using the posterior distribution from a frequentist, the Bayesian inference is asymptotically
correct.
Bernstein-von Mises theorem: Univariate normal data
For Bayesian approach when 𝑁 is large for the observed data,
√
𝑁 ( π‘₯¯ − πœ‡)|𝑋 = π‘₯ 1 , ..., π‘₯ 𝑁 → 𝑁 (0, 𝜎 2 ),
the prior does not matter for large samples. In frequentist approach for large samples,
√
𝑁 ( π‘₯¯ − πœ‡)|πœ‡ ∼ 𝑁 (0, 𝜎 2 ).
With loss of generality, the Bayesian probability for 95% credible region and the frequentist confidence interval matches for 95% confidence interval as
𝜎
𝜎
𝜎
𝜎
¯
¯
¯
¯
𝑝 πœ‡ ∈ 𝑋−1.96
√ , 𝑋+1.96
√ 𝑋1 , ..., 𝑋𝑁 ≈ 𝑝 πœ‡ ∈ 𝑋−1.96
√ , 𝑋+1.96
√ πœ‡ = 0.95
𝑁
𝑁
𝑁
𝑁
52
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Lim, Kyuson
3.4
STA498
Goodness-of-fit test
Suppose pick 𝑛 samples from a normal distribution, 𝑁 (πœ‡, 𝜎 2 ), with known variance and
the goal is to select the best model that predict the mean of the distribution.
3.4.1
Bayes factor
The Bayes factors is used as a determination for the Bayesian alternative to frequentist
hypothesis testing. The Bayesian model comparison is a method of model selection
based on Bayes factors (𝐡𝐹) among statistical models. Based on the observed data 𝐷,
the relative plausibility of the two different models 𝑀1 and 𝑀2 parametrized by πœƒ 1 and
πœƒ 2 respectively, is assessed by the probability odds of two models,
𝐡𝐹12
∫
𝑝(πœƒ 1 |𝑀1 ) 𝑝(𝐷|πœƒ 1 , 𝑀1 )π‘‘πœƒ 1
𝑝(𝐷|𝑀1 )
=∫
=
=
𝑝(𝐷|𝑀2)
𝑝(πœƒ 2 |𝑀2 ) 𝑝(𝐷|πœƒ 2 , 𝑀2 )π‘‘πœƒ 2
𝑝(𝑀1 |𝐷) 𝑝(𝐷)
𝑝(𝑀1 )
𝑝(𝑀2 |𝐷) 𝑝(𝐷)
𝑝(𝑀2 )
=
𝑝(𝑀1 |𝐷) 𝑝(𝑀2 )
,
𝑝(𝑀2 |𝐷) 𝑝(𝑀1 )
Likelihood Ratio ⇔ Posterior odds × Prior odds−1 ,
unlike the LRT there is no overfitting but a biasedness. Moreover, the Bayes factor is a
relative predictive accuracy of one hypothesis over another, and extent to which the data
sway our relative belief from one hypothesis to the other. Hence, 𝐡𝐹 = 𝑁, 𝑁 ∈ (0, ∞)
means that there is 𝑁 times more evidence for π»π‘Ž than 𝐻0 .
In a case where there is only 2 models, given the Bayes factor 𝐡𝐹 (𝐷), the posterior
probability of the Model 1 is derived as
𝑝(𝐷|𝑀1 ) 𝑝(𝑀2 )
𝑝(𝐷|𝑀2 ) 𝑝(𝑀2 )
=1−
𝑝(𝑀1 |𝐷) = 1 − 𝑝(𝑀2 |𝐷) = 1 −
𝑃(𝐷)
𝐡𝐹 (𝐷) 𝑝(𝐷)
𝑝(𝑀1 |𝐷) 𝑝(𝑀2 )
⇒1−
= 𝑝(𝑀1 |𝐷)
𝐡𝐹 (𝐷) 𝑝(𝑀1 )
1
𝑝(𝑀2 )
⇔1= 1+
𝑝(𝑀1 |𝐷) ⇔ 𝑝(𝑀1 |𝐷) =
𝐡𝐹 (𝐷) 𝑝(𝑀1 )
1+
1
𝑝(𝑀2 ) 1
𝐡𝐹 (𝐷) 𝑝(𝑀1 )
Bayes factor cutoffs
𝐡𝐹10
30 − 100
3 − 10
1
1/3 − 1
1/100 − 1/30
interpretation
Very strong evidence for π»π‘Ž
Moderate evidence for π»π‘Ž
Equal evidence for π»π‘Ž and 𝐻0
Anecdotal evidence for 𝐻0
Very strong evidence for 𝐻0
> 100
10 − 30
1−3
1/3 − 1
1/30 − 1/10
< 1/100
Extreme evidence for π»π‘Ž
Strong Evidence for π»π‘Ž
Anecdotal evidence for π»π‘Ž
Anecdotal evidence for 𝐻0
Strong evidence for 𝐻0
Extreme evidence for 𝐻0
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
53
STA498
Relationship between Bayes factor and p-value
Lim, Kyuson
As the Bayes factor increases quadratically, the p-value is also increased for smaller
values to reject 𝐻0 . On the other hand, the larger p-values correspond with the low Baye
factor numerical values as an inverse relationship.
Note that Bayes factors allow to directly test the null hypothesis (relative to models under
consideration).
3.4.2
Bayes factor: hypothesis testing
For the benefit of Bayes factor that could test multiple hypothesis upon same data of
observation, such regression models could be tested in a way of
𝐡𝐹10 =
𝑝(𝐷|𝐻2 )
𝐡𝐹10 𝑝(𝐷|𝐻1 )
𝑝(𝐷|𝐻1 )
, 𝐡𝐹20 =
⇒
=
= 𝐡𝐹12
𝑝(𝐷|𝐻0 )
𝑝(𝐷|𝐻0 )
𝐡𝐹20 𝑝(𝐷|𝐻2 )
Frequentist: Chi-square Goodness-of-fit test
Previously, 𝐻0 : πœ‡ = πœ‡0 and the CLT 32 guarantees for the sample mean π‘₯¯ =
which is π‘₯¯ ∼ 𝑁 (πœ‡, 𝜎 2 /𝑛). Then, the test statistics is computed as
πœ’2 =
Í𝑛
𝑖=1 π‘₯𝑖 /𝑛
( π‘₯¯ − πœ‡0 ) 2
.
𝜎 2 /𝑛
Bayesian: Bayes factor
The emphasize is in computing for the Bayes’ factor
of the models. For two models, let 𝑀1 : πœ‡ = πœ‡0 and
length 𝐿, including πœ‡0 and 𝐿 > 𝜎. The prior 𝑝(πœ‡) =
of 𝑀1 that is
1
( π‘₯¯ − πœ‡0 ) 2
𝑝(𝑋 |𝑀1 ) = √
exp −
=
2𝜎 2 /𝑛
2π‘›πœ‹πœŽ 2
that determines the relative ratio
𝑀2 : πœ‡ lies inside the interval of
1/𝐿 to calculate for the evidence
√
𝑛
√
2πœ‹πœŽ 2
exp − πœ’2 /2 .
For 𝑀2 , marginalize for the πœ‡ by
∫
∫
1
1
( π‘₯¯ − πœ‡) 2
1
exp −
π‘‘πœ‡ = .
𝑝(𝑋, 𝑀2) =
𝑝(𝑋 |πœ‡, 𝑀2 ) 𝑝(πœ‡|𝑀2 )π‘‘πœ‡ =
√
2
𝐿
𝐿
2𝜎 /𝑛
2πœ‹πœŽ 2 /𝑛
Then, the Bayes’ factor is derived as
√
𝑛𝐿
𝑀1
2
𝐡
=√
exp − πœ’ /2 .
𝑀2
2πœ‹πœŽ 2
For fixed πœ’2 valeu, when 𝑛 → ∞ the Bayes factor favours πœ‡ = πœ‡0 as 𝐡 → ∞.
32From 𝑛 samples of distribution with mean πœ‡ and variance 𝜎 2 , the sample mean π‘₯¯ =
𝑁 (πœ‡, 𝜎 2 /𝑛.
54
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Í𝑛
𝑖=1 π‘₯ 𝑖 /𝑛
∼
Lim, Kyuson
Improper prior and example
STA498
∫
When the prior is a function that is Θ πœ‹(πœƒ)π‘‘πœƒ = ∞, the prior is not a pdf but the poste∫
rior is still a valid pdf as a marginal distribution π‘š(π‘₯) = Θ π‘“ (π‘₯|πœƒ)πœ‹(πœƒ)π‘‘πœƒ is well defined.
For the univariate normal distribution of known variance,
if the uniform prior that
∫
is πœ‹(πœƒ) = 1 where there is no prior information then Θ πœ‹(πœƒ)π‘‘πœƒ = ∞. However, the
∫
corresponding marginal distribution π‘š(𝑋) = Θ π‘“ (𝑋 |πœƒ)π‘‘πœƒ is
Í
∫
1
(𝑛 − 1)𝑠2
(π‘₯𝑖 − πœ‡) 2
2 −𝑛/2
,
π‘‘πœ‡ = √
exp −
(2πœ‹πœŽ )
exp −
2𝜎 2
2𝜎 2
𝑛2πœ‹πœŽ 2
so that the posterior becomes πœ‹(πœ‡|𝑋) = πœ™(πœ‡| π‘₯,
¯ 𝜎 2 /𝑛) as shown before. For the Bayesian
𝑑-test, the Jeffreys prior which is improper priors are used as for the area of the curve to
be 1.
3.4.3
One sample test for equal means
Suppose there are two samples with 𝑛1 and 𝑛2 for 𝑛 = 𝑛1 + 𝑛2 where 𝑋1 𝑗 ∼ 𝑁 (πœ‡1 , 𝜎 2 )
and 𝑋2π‘˜ ∼ 𝑁 (πœ‡2 , 𝜎 2 ) for sample variance, 𝑗 ∈ 𝑛1 , π‘˜ ∈ 𝑛2 . First, the frequentist
approach for the two sided 𝑑-test aim to find whether the mean of the two groups differs,
𝐻0 : πœ‡1 = πœ‡2 ⇔ πœ‡1 − πœ‡2 = 0 vs. π»π‘Ž = πœ‡1 ≠ πœ‡2 ⇔ πœ‡1 − πœ‡2 ≠ 0. Then, the two-sample
𝑑-statistics is
𝑋¯ 1 − 𝑋¯ 2
.
𝑇 = √οΈ‚
2
2
𝑛1 𝑛2 (𝑛1 −1)𝑠1 +(𝑛2 −1)𝑠2
𝑛1 +𝑛2
𝑛1 +𝑛2 −2
)0
Let x = {x1 , x2 } where x1 = (π‘₯1 , ..., π‘₯ 𝑛1 and x2 = (π‘₯ 𝑛1 +1 , ..., π‘₯ 𝑛1 +𝑛2 ) 0. The goal is to test
𝐻0 : π‘₯𝑖 |πœ‡, 𝜎 2 ∼ 𝑁 (πœ‡, 𝜎 2 ), 1 ≥ 𝑖 ≥ 𝑛
against
π»π‘Ž : π‘₯𝑖 |πœ‡1 , 𝜎12 ∼ 𝑁 (πœ‡1 , 𝜎12 ), 1 ≥ 𝑖 ≥ 𝑛1 ,
and
π‘₯𝑖 |πœ‡2 , 𝜎22 ∼ 𝑁 (πœ‡2 , 𝜎22 ), 𝑛1 + 1 ≥ 𝑖 ≥ 𝑛1 + 𝑛2 .
However, the Bayesian approach place the prior on the difference of the standardized
means as 𝑑 = πœ‡1𝜎−πœ‡2 . In the case
Í 2
Í
π‘₯𝑖
(π‘₯𝑖 − πœ‡1 ) 2
2 −𝑛/2
2 −𝑛/2
𝑀0 = (2πœ‹πœŽ )
exp −
, 𝑀1 = (2πœ‹πœŽ )
exp −
,
2𝜎 2
2𝜎 2
then the Bayes factor is derived as
𝑀0
𝑁 πœ‡1
𝐡𝐹01 =
= exp −
(2π‘₯¯ − πœ‡1 ) ,
𝑀1
2𝜎 2
where the prior is normal πœ™ = 𝑁 + 𝜎 2 /𝜎02 .
1/2
𝑁 2 π‘₯¯ 2
𝜎2
𝐡𝐹01 = πœ™ 2
exp −
𝜎 /0
2πœ™πœŽ 2
so that the goal is to derive the Jeffrey’s Bayes factor (JZS).
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
55
STA498
56
Lim, Kyuson
CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH
Chapter 4
Appendix
4.1
4.1.1
Extension of Bayesian distribution
EM (expectation-maximizing) algorithm for MLE example
When missing the data set, the prediction step consists of initial estimate πœ‡˜ and 𝚺˜ to
predict the contribution of missing values to the sufficient statistics.
Algorithm:
Assume that population mean and variance πœ‡ and 𝚺 are unknown and estimated.
1. Prediction: given estimates πœƒ˜ of unknown parameters, predict the contribution of
any missing observation to the complete data for sufficient statistics.
2. Estimation: using predicted sufficient statistics, compute estimates of parameters.
3. Iterate until revised estimates do not differ from estimates obtained previously.
57
Download