Bayesian Hotelling's $T^2$
Lim, Kyuson
November 4, 2021
STA498
Contents
1 Acknowledgement
2 Multivariate Normal and Hypothesis Testing
  2.1 Basic definitions
    2.1.1 Multivariate Normal distribution
    2.1.2 Distribution of $(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$
    2.1.3 MLE of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$
    2.1.4 The sampling distribution of $\mathbf{S}$ and $\bar{\mathbf{x}}$
    2.1.5 Hypothesis testing when $\sigma$, $\boldsymbol{\Sigma}$ is known
    2.1.6 Hypothesis testing when $\sigma$, $\boldsymbol{\Sigma}$ is unknown
  2.2 Confidence region
    2.2.1 Univariate $t$-interval
    2.2.2 Bonferroni's Simultaneous Confidence interval
    2.2.3 Simultaneous $T^2$-intervals
    2.2.4 Comparison between Simultaneous $T^2$-intervals and Bonferroni's Confidence intervals
    2.2.5 Multivariate Quality-Control (QC)
  2.3 Comparing mean vectors of two populations
    2.3.1 Pooled sample covariance when $n_1$, $n_2$ are small and $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$
    2.3.2 Hypothesis test with small samples when $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$
    2.3.3 Confidence intervals with small samples when $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$
    2.3.4 Behrens-Fisher problem
    2.3.5 Heterogeneous covariance matrices with large sample size
    2.3.6 Box's M test (Bartlett's test)
  2.4 MANOVA (Multivariate Analysis Of Variance)
    2.4.1 Sum of Squares (TSS = SS$_{tr}$ + SS$_{res}$)
    2.4.2 Hypothesis Testing
    2.4.3 Distribution of Wilks' Lambda
    2.4.4 Large Sample property for modification of $\Lambda^*$
    2.4.5 Simultaneous Confidence Intervals for treatment effect
3 Bayesian Alternative approach
    3.0.1 Overview: Univariate Binomial distribution with known and unknown parameter
  3.1 Conditional distribution of the subset
    3.1.1 Law of total expectation
    3.1.2 Conditional expectation (MMSE)
    3.1.3 Laplace's law of succession
    3.1.4 Bayesian Hypothesis testing
    3.1.5 Bayesian Interval Estimation
  3.2 Prior
    3.2.1 Conjugate Prior
    3.2.2 Univariate Normal distribution Conjugate Prior with known variance
    3.2.3 Non-informative Prior
    3.2.4 Univariate Normal distribution Conjugate Prior with unknown variance
  3.3 Posterior
    3.3.1 Maximum A Posteriori (MAP)
    3.3.2 Multivariate Normal distribution with known $\boldsymbol{\Sigma}$
    3.3.3 Multivariate Normal distribution with unknown $\boldsymbol{\Sigma}$
    3.3.4 Lindley's Paradox
    3.3.5 Bernstein-von Mises theorem
  3.4 Goodness-of-fit test
    3.4.1 Bayes factor
    3.4.2 Bayes factor: hypothesis testing
    3.4.3 One sample test for equal means
4 Appendix
  4.1 Extension of Bayesian distribution
    4.1.1 EM (expectation-maximization) algorithm for MLE example
Chapter 1
Acknowledgement
First, the chapter on the multivariate Normal and hypothesis testing explains the construction and concepts of the multivariate normal distribution and the statistical inference that applies to it. The notation and interpretation are given throughout in terms of multivariate random variables.
Second, the basic material starts from the frequentist approach as a way into Bayesian statistics. However, the chapter is mainly about the Bayesian approach as distinct from the frequentist one, and the majority of the concepts and derivations rest on the Bayesian approach. The chapter discusses hypothesis testing under the Bayesian approach and the derivation of the posterior distribution for the univariate normal distribution, as well as the Bayes posterior estimator.
The normal distribution is the main topic of Bayesian inference for the posterior distribution, where concepts from multivariate statistics are introduced and used to build up the knowledge. The goal is to extend the results to the bivariate and multivariate normal distributions, including the Wishart distribution. The idea of the relative belief ratio and the role of the normal distribution in understanding the posterior distribution are also discussed.
Chapter 2
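Multivariate Normal and Hypothesis Testing
2.1 Basic definitions
2.1.1 Multivariate Normal distribution
If $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then the PDF of $\mathbf{x}$ [1] is
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\},$$
where $(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance [2] between $\mathbf{x}$ and the population mean vector $\boldsymbol{\mu}$, written as a quadratic form.
Notice that the PDF does not exist if $\boldsymbol{\Sigma}$ is not positive definite [3]: a singular $\boldsymbol{\Sigma}$ has $|\boldsymbol{\Sigma}| = 0$, so neither $\boldsymbol{\Sigma}^{-1}$ nor the normalizing constant is defined.
The Gaussian kernel $\exp\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\}$ is multiplied by the normalizing constant $\frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}}$ so that the density integrates to 1.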
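As an illustration of the density above, the following is a minimal sketch in plain Python (no external libraries). The helper names `quad_form_and_det`, `mahalanobis_sq` and `mvn_pdf` are hypothetical and not part of the original notes; the sketch only checks that the determinant is positive, which is necessary but not sufficient for positive definiteness.

```python
import math

def quad_form_and_det(d, sigma):
    """Return (d' Sigma^{-1} d, det(Sigma)) using Gaussian elimination with partial pivoting."""
    n = len(sigma)
    m = [row[:] + [d[i]] for i, row in enumerate(sigma)]  # augmented matrix [Sigma | d]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("Sigma is singular, so the density does not exist")
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    y = [0.0] * n  # back substitution: y = Sigma^{-1} d
    for i in reversed(range(n)):
        y[i] = (m[i][n] - sum(m[i][j] * y[j] for j in range(i + 1, n))) / m[i][i]
    return sum(di * yi for di, yi in zip(d, y)), det

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu)."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    return quad_form_and_det(d, sigma)[0]

def mvn_pdf(x, mu, sigma):
    """Density of N_p(mu, Sigma) at x; assumes Sigma is symmetric positive definite."""
    p = len(x)
    d = [xi - mi for xi, mi in zip(x, mu)]
    q, det = quad_form_and_det(d, sigma)
    if det <= 0:
        raise ValueError("Sigma is not positive definite")
    return math.exp(-0.5 * q) / ((2 * math.pi) ** (p / 2) * math.sqrt(det))
```

With $\boldsymbol{\Sigma} = \mathbf{I}$ the distance reduces to the squared Euclidean distance, which gives a quick sanity check of the quadratic form.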
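2.1.2 Distribution of $(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$
For $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
$$(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \{\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})\}'\{\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})\} \; [4] \;\Leftrightarrow\; \sum_{i=1}^{p}\left\{\frac{1}{\sqrt{\lambda_i}}\,\mathbf{e}_i'(\mathbf{x}-\boldsymbol{\mu})\right\}^2 = \mathbf{z}'\mathbf{z},$$
where
$$\mathbf{z} \sim N_p(\mathbf{0}, \mathbf{I}), \qquad \mathbf{z}'\mathbf{z} = \sum_{i=1}^{p} z_i^2, \quad z_i \sim N(0, 1).$$
As $\chi^2_p$ is defined as the distribution of $\sum_{i=1}^{p} z_i^2$, it follows that $(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) \sim \chi^2_p$.
[1] The constant probability density contour of the function is defined as $C = \{\mathbf{x} : f(\mathbf{x}) = c_0\} \Leftrightarrow \{\mathbf{x} : (\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = c^2\}$, the set of points of equal density.
[2] For arbitrary points $P$ and $Q$, $d(P, Q) = \sqrt{(\mathbf{x}-\mathbf{y})'\mathbf{S}^{-1}(\mathbf{x}-\mathbf{y})}$, where $\mathbf{x} = [x_1, \ldots, x_p]'$, $\mathbf{y} = [y_1, \ldots, y_p]'$ and $\mathbf{S}$ is the sample covariance matrix of all measurements on the $p$ variables.
[3] By the spectral decomposition, $\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}'$ is positive definite if and only if $\lambda_i > 0$ for all eigenvalues.
[4] Notice that $\boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{p} \frac{1}{\lambda_i}\mathbf{e}_i\mathbf{e}_i'$.
2.1.3 MLE of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$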
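$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} \quad\text{and}\quad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})' = \mathbf{S}_n = \frac{n-1}{n}\mathbf{S},$$
where
$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'.$$
Now, $E(\mathbf{S}) = \boldsymbol{\Sigma}$ and $E(\bar{\mathbf{x}}) = \boldsymbol{\mu}$ [5], so $\mathbf{S}$ and $\bar{\mathbf{x}}$ are unbiased estimators. To see this, write
$$\frac{n-1}{n}\mathbf{S} = \frac{1}{n}\left\{\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i' - 2\sum_{i=1}^{n}\mathbf{x}_i\bar{\mathbf{x}}' + n\bar{\mathbf{x}}\bar{\mathbf{x}}'\right\} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i' - \bar{\mathbf{x}}\bar{\mathbf{x}}',$$
where $E(\mathbf{x}_i\mathbf{x}_i') = \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'$ and $E(\bar{\mathbf{x}}\bar{\mathbf{x}}') = \mathrm{cov}(\bar{\mathbf{x}}) + E(\bar{\mathbf{x}})E(\bar{\mathbf{x}}') = \frac{1}{n}\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'$. Hence, by taking the expected value,
$$E\left(\frac{n-1}{n}\mathbf{S}\right) = E\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i'\right) - E(\bar{\mathbf{x}}\bar{\mathbf{x}}') = \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}' - \left(\frac{1}{n}\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'\right) = \frac{n-1}{n}\boldsymbol{\Sigma}.$$
According to the LLN, $\bar{\mathbf{x}} \xrightarrow{p} \boldsymbol{\mu}$, $\mathbf{S}_n \xrightarrow{p} \boldsymbol{\Sigma}$ and $\mathbf{S} \xrightarrow{p} \boldsymbol{\Sigma}$. Asymptotically, $\mathbf{S}$ could be replaced by $\mathbf{S}_n$ or $\boldsymbol{\Sigma}$. By the definition $\mathbf{S} = \{s_{jk}\}$ with $s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)$,
$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\mu_j+\mu_j-\bar{x}_j)(x_{ik}-\mu_k+\mu_k-\bar{x}_k) = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\mu_j)(x_{ik}-\mu_k) - \frac{n}{n-1}(\bar{x}_j-\mu_j)(\bar{x}_k-\mu_k),$$
where the second term converges to 0. By applying the LLN to the first term,
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\mu_j)(x_{ik}-\mu_k) = \left(1-\frac{1}{n}\right)^{-1}\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\mu_j)(x_{ik}-\mu_k) \xrightarrow{p} E\{(x_{ij}-\mu_j)(x_{ik}-\mu_k)\} = \sigma_{jk}, \quad\text{as } n \to \infty.$$
Equivalently, $\mathbf{S}_n$ is a consistent estimator for $\boldsymbol{\Sigma}$, which is analogous to the univariate case [6].
By the CLT, where $\mathbf{x}_i \sim N_p(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ and $\bar{\mathbf{x}} \sim N_p(\boldsymbol{\mu}_0, \frac{1}{n}\boldsymbol{\Sigma})$,
$$\sqrt{n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) \xrightarrow{d} N_p(\mathbf{0}, \boldsymbol{\Sigma})$$
and
$$(\mathbf{x}-\boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_0) \sim \chi^2_p$$
such that
$$n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) \xrightarrow{d} \chi^2_p,$$
[5] $E(\bar{\mathbf{x}}) = E\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) = \sum_{i=1}^{n}E\left(\frac{1}{n}\mathbf{x}_i\right) = \frac{1}{n}\sum_{i=1}^{n}\boldsymbol{\mu} = \boldsymbol{\mu}$.
[6] As $n \to \infty$, $s_n^2$ converges to $\sigma^2$, which is the population variance.
for large sample size, with $n$ large relative to $p$.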
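The relation $\mathbf{S}_n = \frac{n-1}{n}\mathbf{S}$ between the MLE and the unbiased estimator can be checked numerically. The sketch below uses plain Python lists; the function names and the toy data in the test are hypothetical and only serve as an illustration.

```python
def sample_mean(xs):
    """Sample mean vector of a list of p-dimensional observations."""
    n = len(xs)
    return [sum(x[j] for x in xs) / n for j in range(len(xs[0]))]

def covariance_estimates(xs):
    """Return (S, S_n): the unbiased estimator (divisor n - 1) and the MLE (divisor n)."""
    n = len(xs)
    xbar = sample_mean(xs)
    p = len(xbar)
    # Sum of outer products of the centered observations.
    scatter = [[sum((x[j] - xbar[j]) * (x[k] - xbar[k]) for x in xs)
                for k in range(p)] for j in range(p)]
    s = [[v / (n - 1) for v in row] for row in scatter]
    s_n = [[v / n for v in row] for row in scatter]
    return s, s_n
```

Both estimators share the same scatter matrix, so the only difference is the divisor, which vanishes as $n \to \infty$.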
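2.1.4 The sampling distribution of $\mathbf{S}$ and $\bar{\mathbf{x}}$
$$\bar{\mathbf{x}} \sim N_p\left(\boldsymbol{\mu}, \frac{1}{n}\boldsymbol{\Sigma}\right), \qquad \mathrm{Var}(\bar{\mathbf{x}}) = \frac{1}{n}\boldsymbol{\Sigma},$$
where $\mathbf{S}$ and $\bar{\mathbf{x}}$ are independent.
As $\mathbf{x}_i \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\bar{\mathbf{x}}$ is a linear combination of the $\mathbf{x}_i$, $\bar{\mathbf{x}}$ follows a normal distribution.
$$(n-1)\mathbf{S} = \sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})' = \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}+\boldsymbol{\mu}-\bar{\mathbf{x}})(\mathbf{x}_i-\boldsymbol{\mu}+\boldsymbol{\mu}-\bar{\mathbf{x}})'$$
$$= \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})' + n(\boldsymbol{\mu}-\bar{\mathbf{x}})(\boldsymbol{\mu}-\bar{\mathbf{x}})' - 2n(\boldsymbol{\mu}-\bar{\mathbf{x}})(\boldsymbol{\mu}-\bar{\mathbf{x}})' = \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})' - n(\boldsymbol{\mu}-\bar{\mathbf{x}})(\boldsymbol{\mu}-\bar{\mathbf{x}})',$$
and
$$\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})' \sim W_n(\boldsymbol{\Sigma}), \qquad n(\boldsymbol{\mu}-\bar{\mathbf{x}})(\boldsymbol{\mu}-\bar{\mathbf{x}})' \sim W_1(\boldsymbol{\Sigma}),$$
such that
$$(n-1)\mathbf{S} \sim W_{n-1}(\boldsymbol{\Sigma}), \quad\text{the distribution of}\quad \sum_{i=1}^{n-1}\mathbf{z}_i\mathbf{z}_i', \qquad \mathbf{z}_i \sim N_p(\mathbf{0}, \boldsymbol{\Sigma}).$$
The Wishart distribution with $n-1$ degrees of freedom has the property $E\{(n-1)\mathbf{S}\} = (n-1)\boldsymbol{\Sigma}$.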
2.1.5 Hypothesis testing when $\sigma$, $\boldsymbol{\Sigma}$ is known
Statistical inference here is based on hypothesis tests and on the construction of confidence regions for the parameters of interest.
The goal of this chapter is to present two general ideas: the construction of a likelihood ratio test (LRT) based on the multivariate normal distribution, and the union-intersection approach.
Univariate test statistics ($\sigma$ known)
If $x \sim N_1(\mu, \sigma^2)$, the hypothesis test is $H_0: \mu = \mu_0$ vs $H_a: \mu \neq \mu_0$. For a random sample $x_1, \ldots, x_n$ from the Normal population, the test statistic is
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \sim N_1(0, 1) \quad\text{or}\quad z^2 = \frac{(\bar{x}-\mu_0)^2}{\sigma^2/n} \sim \chi^2_1$$
under $H_0$.
Multivariate generalization ($\boldsymbol{\Sigma}$ known)
If $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ where $|\boldsymbol{\Sigma}| > 0$, then the hypothesis test is $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ vs $H_a: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$.
If $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is a random sample from a normal population, then the test statistic is
$$z^2 = n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) \sim \chi^2_p$$
under $H_0$.
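2.1.6 Hypothesis testing when $\sigma$, $\boldsymbol{\Sigma}$ is unknown
Univariate test statistics ($\sigma$ unknown)
With the sample mean $\bar{x}$ as the estimate and $\mu_0$ as the hypothesized mean, the distance between the two defines the test of $H_0: \mu = \mu_0$ vs $H_a: \mu \neq \mu_0$. The test statistic is
$$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \sim t_{n-1} \quad\text{or}\quad t^2 = \frac{(\bar{x}-\mu_0)^2}{s^2/n} = n(\bar{x}-\mu_0)(s^2)^{-1}(\bar{x}-\mu_0) \sim F_{1,n-1}$$
under $H_0$, where $s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$.
Note that $t^2$ is the squared distance between the sample mean $\bar{x}$ and the test value $\mu_0$.
The distribution of $t^2$ under $H_0$
Under $H_0$,
$$t^2 = \sqrt{n}(\bar{x}-\mu_0)(s^2)^{-1}\sqrt{n}(\bar{x}-\mu_0) = \left(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right)\left(\frac{s^2}{\sigma^2}\right)^{-1}\left(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right)$$
$$= \left(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right)\left(\frac{\sum_{i=1}^{n}\{(x_i-\bar{x})/\sigma\}^2}{n-1}\right)^{-1}\left(\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\right) \sim N(0,1)\left(\frac{\chi^2_{n-1}}{n-1}\right)^{-1}N(0,1) \;\Leftrightarrow\; \frac{\chi^2_1/1}{\chi^2_{n-1}/(n-1)} \;\Leftrightarrow\; F_{1,n-1}.$$
Multivariate generalization ($\boldsymbol{\Sigma}$ unknown)
For a $p$-dimensional vector, $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ vs $H_a: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$. The natural generalization of the univariate $t^2$ is its multivariate analogue, Hotelling's $T^2$ statistic:
$$T^2 = (\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\left(\frac{\mathbf{S}}{n}\right)^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) = \sqrt{n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\left(\frac{\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'}{n-1}\right)^{-1}\sqrt{n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) \sim N_p(\mathbf{0},\boldsymbol{\Sigma})'\left(\frac{W_{p,n-1}(\boldsymbol{\Sigma})}{n-1}\right)^{-1}N_p(\mathbf{0},\boldsymbol{\Sigma}),$$
which is of the form (multivariate normal)$'$ (Wishart distribution / df)$^{-1}$ (multivariate normal) [7],
$$\Leftrightarrow\; (n-1)\,N_p(\mathbf{0},\mathbf{I})'\{W_{n-1}(\mathbf{I})\}^{-1}N_p(\mathbf{0},\mathbf{I}), \quad\text{where } \mathbf{I} = \mathbf{I}_{p\times p}.$$
If $T^2$ is too large, $\bar{\mathbf{x}}$ is too far from $\boldsymbol{\mu}_0$ and $H_0$ is rejected.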
Hotelling's $T^2$ distribution
Suppose a vector $\mathbf{d}$ follows the multivariate normal distribution $N_p(\mathbf{0}, \mathbf{I})$ (here $\sqrt{n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)$, by the CLT), and an independent random matrix $\mathbf{M}$ (here $\mathbf{S}$) follows the Wishart distribution. Then $m(\mathbf{d}'\mathbf{M}^{-1}\mathbf{d})$ (which is $T^2$) has a Hotelling's $T^2(p, m)$ distribution with dimensionality parameter $p$ and $m$ degrees of freedom, determined by the number of variables $p$ and the number of observations $n$.
If a random variable $x$ follows the Hotelling's $T^2$ distribution, $x \sim T^2(p, m)$, then
$$\frac{m-p+1}{mp}\,x \sim F_{p,\,m-p+1}.$$
For hypothesis testing, reject $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ if
$$T^2 > \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) \quad\text{or}\quad F = \frac{n-p}{p(n-1)}T^2 > F_{p,n-p}(\alpha),$$
where $m = n-1$ for sample size $n$ and $p$ is the dimension of $\boldsymbol{\Sigma}$.
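The one-sample statistic and its $F$ conversion can be sketched in plain Python as follows. The names `solve` and `hotelling_t2` are hypothetical, and the critical value $F_{p,n-p}(\alpha)$ is assumed to be read from a table or a statistics package, since the standard library has no $F$ quantile function.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("S is singular; T^2 needs n > p and a full-rank sample")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def hotelling_t2(xs, mu0):
    """Return (T^2, F) with T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0), F = (n - p) T^2 / (p (n - 1))."""
    n, p = len(xs), len(mu0)
    xbar = [sum(x[j] for x in xs) / n for j in range(p)]
    s = [[sum((x[j] - xbar[j]) * (x[k] - xbar[k]) for x in xs) / (n - 1)
          for k in range(p)] for j in range(p)]
    d = [xbar[j] - mu0[j] for j in range(p)]
    y = solve(s, d)  # y = S^{-1} (xbar - mu0)
    t2 = n * sum(dj * yj for dj, yj in zip(d, y))
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat
```

The returned `f_stat` is compared with $F_{p,n-p}(\alpha)$; for $p = 1$ the statistic reduces to the univariate $t^2$.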
Computational example
A sample of 15 course marks that the student Kyuson has taken at UTM was analyzed, classified into $x_1$ = MAT, $x_2$ = STA and $x_3$ = other courses (for simplicity the sample sizes for the course groups are the same).
Question: Is $\boldsymbol{\mu}_0 = (99\ 99\ 95)'$ plausible for the population mean vector at $\alpha = 0.1$?
Equivalently, the problem is to test $H_0: \boldsymbol{\mu} = (99\ 99\ 95)'$ vs. $H_a: \boldsymbol{\mu} \neq (99\ 99\ 95)'$.
At level $\alpha = 0.1$, reject $H_0$ if
$$T^2 > \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) = \frac{3(40-1)}{40-3}F_{3,37}(0.1) = 7.544.$$
The sample mean is $\bar{\mathbf{x}} = (100\ 100\ 99)'$, with $\mathbf{S}$ computed from the data. The computed statistic is
$$T^2 = 40(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) = 8.739.$$
Since $8.739 > 7.544$, the critical value, $H_0$ is rejected: his true average differs in at least one area, $\mu_i \neq \mu_{i0}$, and we conclude he is not being honest.
[7] Notice this is analogous to $t^2_{n-1}$ = (Normal random variable)$'$ (chi-square random variable / df)$^{-1}$ (Normal random variable).
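Invariance under transformation, Hotelling's $T^2$
Moreover, Hotelling's $T^2$ is invariant under transformations of the form $\mathbf{y} = \mathbf{C}\mathbf{x} + \mathbf{b}$, where $\mathbf{C}$ is a nonsingular $p \times p$ matrix, for the hypothesis test of $H_0: E(\mathbf{y}) = \mathbf{C}\boldsymbol{\mu}_0 + \mathbf{b}$ [8].
Since $\bar{\mathbf{y}} = \mathbf{C}\bar{\mathbf{x}} + \mathbf{b}$ and $\mathbf{S}_y = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})' = \mathbf{C}\mathbf{S}_x\mathbf{C}'$,
$$T^2 = n\{\bar{\mathbf{y}}-(\mathbf{C}\boldsymbol{\mu}_0+\mathbf{b})\}'\mathbf{S}_y^{-1}\{\bar{\mathbf{y}}-(\mathbf{C}\boldsymbol{\mu}_0+\mathbf{b})\} = n\{\mathbf{C}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)\}'(\mathbf{C}\mathbf{S}_x\mathbf{C}')^{-1}\{\mathbf{C}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)\}$$
$$= n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}')^{-1}\mathbf{S}_x^{-1}\mathbf{C}^{-1}\mathbf{C}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) = n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{S}_x^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0).$$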
Normality, Hotelling's $T^2$
$T^2 = n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)$ approximately follows a chi-square distribution with $p$ df when $\boldsymbol{\mu}_0$ is correct.
Note that the $F$-distribution of $T^2$ relies on the normality assumption. The critical value satisfies
$$\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) > \chi^2_p(\alpha),$$
but the two values are nearly equal when $n-p$ is large. In other words, if $n \gg p$ the gap is small, but if $n$ is only moderately larger than $p$ the gap is larger [9].
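Likelihood Ratio Test (LRT)
The Hotelling's $T^2$ test is equivalent to the LRT [10]. For the hypothesis test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ vs $H_a: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$, the likelihood ratio $\Lambda$ is
$$\Lambda = \frac{\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})}{\max_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})} = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2},$$
where the maxima of the multivariate normal likelihood are
$$\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}_0|^{n/2}}\exp\left(-\frac{np}{2}\right), \qquad \max_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}|^{n/2}}\exp\left(-\frac{np}{2}\right).$$
Here $\hat{\boldsymbol{\Sigma}}_0$ is restricted under $H_0$, while $\hat{\boldsymbol{\Sigma}}$ is unrestricted [11]. The LRT rejects $H_0$ if $\Lambda < c$ for the cutoff value $c$.
For large $n$ (Wilks' theorem), $-2\ln(\Lambda) = n\{\ln|\hat{\boldsymbol{\Sigma}}_0| - \ln|\hat{\boldsymbol{\Sigma}}|\}$ approximately follows $\chi^2_{df}$, where $df = \{p + p(p+1)/2\} - \{p(p+1)/2\} = p$ [12] (number of parameters without the restriction of $H_0$ minus number of parameters under $H_0$).
[8] Instead of $H_0: E(\mathbf{x}) = \boldsymbol{\mu}_0$.
[9] For example, with $n = 3000$ and $p = 10$, $\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) = 16.057$ is close to $\chi^2_p(\alpha) = 15.987$, but with $n = 30$ and $p = 5$, $\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) = 12.135$ is greater than $\chi^2_p(\alpha) = 9.236$.
[10] Note that this extends the Neyman-Pearson Lemma for the uniformly most powerful test.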
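Wilks' Lambda
Equivalently, from the likelihood ratio statistic $\Lambda$, the quantity $\Lambda^{2/n}$ is
$$\Lambda^{2/n} = \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|} = \frac{\left|\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'\right|}{\left|\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}_0)(\mathbf{x}_i-\boldsymbol{\mu}_0)'\right|} \;\Leftrightarrow\; \left(1+\frac{T^2}{n-1}\right)^{-1} < c_\alpha.$$
Notice that for large $T^2$ the likelihood ratio is small and $H_0$ is rejected. Hence Hotelling's $T^2$, Wilks' Lambda and the LRT are all equivalent.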
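The monotone relation between $T^2$ and the likelihood ratio can be made concrete with a small sketch. The function name `wilks_from_t2` is hypothetical; it simply evaluates $\Lambda^{2/n} = (1 + T^2/(n-1))^{-1}$ and the corresponding $-2\ln\Lambda$.

```python
import math

def wilks_from_t2(t2, n):
    """Return (Lambda^(2/n), Lambda, -2 ln Lambda) implied by a one-sample T^2 with sample size n."""
    ratio = 1.0 / (1.0 + t2 / (n - 1))                   # Lambda^(2/n) = |Sigma_hat| / |Sigma_hat_0|
    lam = ratio ** (n / 2)                                # likelihood ratio Lambda
    neg2_log_lam = n * math.log(1.0 + t2 / (n - 1))      # -2 ln Lambda
    return ratio, lam, neg2_log_lam
```

Because the map $T^2 \mapsto \Lambda$ is strictly decreasing, rejecting for large $T^2$ and rejecting for small $\Lambda$ define the same test.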
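Inverse of Wishart distribution
The Wishart distribution, $(n-1)\mathbf{S} \sim W_{n-1}(\boldsymbol{\Sigma})$, is a multivariate analogue of the Gamma distribution (just as the chi-square distribution of $z^2$ is a gamma distribution).
With a reparametrization where $\mathbf{x}_1, \ldots, \mathbf{x}_{n-1} \sim N(\mathbf{0}, \mathbf{S}_0^{-1})$, a covariance matrix $\boldsymbol{\Sigma} = \left(\sum_{i=1}^{n-1}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$ is sampled from the inverse-Wishart distribution with $n-1$ df and parameter $\mathbf{S}_0^{-1}$.
Hence,
$$E(\boldsymbol{\Sigma}^{-1}) = (n-1)\mathbf{S}_0^{-1}, \qquad E(\boldsymbol{\Sigma}) = \frac{1}{(n-1)-p-1}(\mathbf{S}_0^{-1})^{-1} = \frac{1}{n-p-2}\mathbf{S}_0,$$
by the property of the Wishart distribution.
For large $n-1$, choosing $\mathbf{S}_0 = (n-p-2)\boldsymbol{\Sigma}_0$ gives $E(\boldsymbol{\Sigma}) = \boldsymbol{\Sigma}_0$, so that $\boldsymbol{\Sigma}$ is concentrated near the covariance matrix $\boldsymbol{\Sigma}_0$.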
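Union-Intersection derivation of $T^2$
Let $y = \mathbf{a}'\mathbf{x} \sim N_1(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$. If the null hypothesis is not rejected for the vector $\mathbf{a}$ that maximizes the test statistic $t_{\mathbf{a}}^2$, then none of the univariate null hypotheses $H_{0,\mathbf{a}}: \mathbf{a}'\boldsymbol{\mu} = \mathbf{a}'\boldsymbol{\mu}_0$ is rejected, and so $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is not rejected.
First, $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is equivalent to $H_{0,\mathbf{a}}: \mathbf{a}'\boldsymbol{\mu} = \mathbf{a}'\boldsymbol{\mu}_0$ holding for every $\mathbf{a}$. The test statistic is
$$t_{\mathbf{a}}^2 = \left(\frac{\bar{y}-\mathbf{a}'\boldsymbol{\mu}_0}{s_{\bar{y}}}\right)^2 = \left(\frac{\mathbf{a}'\bar{\mathbf{x}}-\mathbf{a}'\boldsymbol{\mu}_0}{\sqrt{\mathbf{a}'(\mathbf{S}/n)\mathbf{a}}}\right)^2 = \frac{\{\mathbf{a}'(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)\}^2}{\frac{1}{n}\mathbf{a}'\mathbf{S}\mathbf{a}}.$$
Hence, if $\max_{\mathbf{a}} t_{\mathbf{a}}^2 < c$, then $t_{\mathbf{a}}^2 < c$ for any $\mathbf{a}$.
Second, the maximum squared $t$-statistic is
$$\max_{\mathbf{a}} t_{\mathbf{a}}^2 = (\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\left(\frac{\mathbf{S}}{n}\right)^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) = n(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0) = T^2,$$
which is Hotelling's $T^2$ statistic [13].
[11] $\hat{\boldsymbol{\Sigma}}_0 = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu}_0)(\mathbf{x}_i-\boldsymbol{\mu}_0)'$ is the restricted estimator and $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})'$ is the unrestricted estimator.
[12] $p$ corresponds to $\boldsymbol{\mu}$, and $p(p+1)/2$ corresponds to the variance-covariance matrix $\boldsymbol{\Sigma}$.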
2.2 Confidence region
2.2.1 Univariate $t$-interval
Ignoring the relationship between the multivariate components, the univariate $t$-interval is constructed from
$$P\left(-t_{n-1}(\alpha/2) \le \frac{\bar{y}-\mu_y}{\sqrt{s_y^2/n}} < t_{n-1}(\alpha/2)\right) = 1-\alpha,$$
as $\frac{\bar{y}-\mu_y}{\sqrt{s_y^2/n}} \sim t_{n-1}$.
In other words, the $100(1-\alpha)\%$ confidence interval for $\mu_y$ is $\bar{y} \pm t_{n-1}(\alpha/2)\sqrt{s_y^2/n}$, where $t_{n-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile.
Problem and Bonferroni's inequality
Notice that $p$ separate $100(1-\alpha)\%$ intervals do not guarantee $100(1-\alpha)\%$ joint coverage. Let $R_i$ denote the $i$th interval, and also the event that it covers the corresponding $\mu_i$, with $P(R_i) = 1-\alpha_i$. Then
$$P\{\mu_i \in R_i \text{ for all } i\} = P\left(\cap_{i=1}^{p} R_i\right) = 1 - P\left(\cup_{i=1}^{p} R_i^c\right) \ge 1 - \sum_{i=1}^{p}\alpha_i,$$
which is Bonferroni's inequality. In the case $\alpha_i = \alpha$ for all $i$, $1 - \sum_{i=1}^{p}\alpha_i = 1 - p\alpha < 1-\alpha$ whenever $p > 1$, so a joint coverage of $1-\alpha$ is not guaranteed.
Equivalently, if $R_i^c$ [14] is the event of making a Type I error on the $i$th test, then
$$P(\text{at least one Type I error}) = P\left(\cup_{i=1}^{p} R_i^c\right) \le \sum_{i=1}^{p} P(R_i^c).$$
If each of the $p$ tests uses a significance level of $\alpha/p$, then the joint coverage rate is at least $100(1-\alpha)\%$ and the overall Type I error rate is at most $\alpha$.
So the probability that at least one test results in a Type I error is at most $\alpha$; equivalently, the chance that at least one CI does not capture its true mean is at most $\alpha$.
[13] The union-intersection derivation of $T^2$ uses the Maximization Lemma, which is based on the Cauchy-Schwarz inequality.
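A short sketch of the Bonferroni bookkeeping follows. The function names are hypothetical; the code only evaluates the bound $1 - \sum_i \alpha_i$ and the per-interval level $\alpha/p$ discussed above.

```python
def per_interval_levels(alpha, p):
    """Return (alpha / p, alpha / (2 p)): the per-interval level and its two-sided tail area."""
    per_interval = alpha / p
    return per_interval, per_interval / 2

def joint_coverage_lower_bound(alphas):
    """Bonferroni lower bound 1 - sum(alpha_i) on the joint coverage probability."""
    return max(0.0, 1.0 - sum(alphas))
```

For instance, four intervals each at level 0.05 are only guaranteed a joint coverage of 0.80, while four intervals each at level 0.05/4 are guaranteed at least 0.95.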
2.2.2 Bonferroni's Simultaneous Confidence interval
Simultaneous confidence intervals for $\{\mu_1, \ldots, \mu_p\}$ are constructed by using the level $\alpha/p$ for each of the $p$ separate univariate CIs, that is
$$\bar{x}_i \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{ii}}{n}}, \quad i = 1, \ldots, p.$$
Since $P\left(\mu_i \in \bar{x}_i \pm t_{n-1}(\alpha/(2p))\sqrt{s_{ii}/n}\right) = 1-\alpha/p$, the joint coverage probability is $\ge 1 - p\left(\frac{\alpha}{p}\right) = 1-\alpha$, which is now guaranteed to be no smaller than $1-\alpha$ [15].
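2.2.3 Simultaneous $T^2$-intervals
Simultaneous confidence intervals can be constructed for any linear combination $\mathbf{a}'\boldsymbol{\mu}$, the expected value of $\sum_{i=1}^{p} a_i x_i = \mathbf{a}'\mathbf{x}$, where $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\mathbf{a}'\mathbf{x}$ has variance $\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$. The CI is derived from the union-intersection derivation of $T^2$:
$$t_{\mathbf{a}}^2 = \frac{(\mathbf{a}'\bar{\mathbf{x}}-\mathbf{a}'\boldsymbol{\mu})^2}{\widehat{\mathrm{var}}(\mathbf{a}'\mathbf{x})/n} = \left(\frac{\mathbf{a}'\bar{\mathbf{x}}-\mathbf{a}'\boldsymbol{\mu}}{\sqrt{\mathbf{a}'(\mathbf{S}/n)\mathbf{a}}}\right)^2 = \frac{\{\mathbf{a}'(\bar{\mathbf{x}}-\boldsymbol{\mu})\}^2}{\frac{1}{n}\mathbf{a}'\mathbf{S}\mathbf{a}}$$
$$\Leftrightarrow\; \mathbf{a}'\bar{\mathbf{x}} \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{n}},$$
where
$$\max_{\mathbf{a}} t_{\mathbf{a}}^2 = T^2 \sim \frac{p(n-1)}{n-p}F_{p,n-p}.$$
For each $\mu_i$,
$$\bar{x}_i \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii}}{n}}.$$
Therefore,
$$P\left(t_{\mathbf{a}}^2 \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha) \text{ for all } \mathbf{a}\right) = P\left(\max_{\mathbf{a}} t_{\mathbf{a}}^2 \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right) = P\left(T^2 \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right) = 1-\alpha.$$
The drawback of the simultaneous $T^2$-intervals is that they are wider, and therefore conservative and less powerful.
[14] Notice that this is the event that the confidence interval does not cover $\mu_i$.
[15] Note that $t_{n-1}(\alpha/(2p))$ is replaced by $\sqrt{(n-1)pF_{p,n-p}(\alpha)/(n-p)}$ to obtain the corresponding simultaneous $T^2$-interval, by the property of Hotelling's $T^2$.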
2.2.4 Comparison between Simultaneous $T^2$-intervals and Bonferroni's Confidence intervals
| Criteria            | $t$-intervals / Bonferroni's $t$-intervals                | Simultaneous $T^2$-intervals        |
| shape               | narrower, powerful                                        | wider, conservative                 |
| joint coverage rate | $< 100(1-\alpha)\%$; depends on the number of intervals   | $\ge 100(1-\alpha)\%$; does not depend |
For each $\mu_i$, the simultaneous confidence interval is computed as
$$\bar{x}_i \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii}}{n}},$$
but the Bonferroni confidence interval for $\mu_i$ is computed as
$$\bar{x}_i \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{ii}}{n}}, \quad\text{where } i = 1, \ldots, p.$$
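The two half-widths can be compared directly once the quantiles are known. In the sketch below the function names are hypothetical, and the quantiles `f_crit` $= F_{p,n-p}(\alpha)$ and `t_crit` $= t_{n-1}(\alpha/(2p))$ are assumed to be supplied by the caller from a table or a statistics package.

```python
import math

def t2_half_width(s_ii, n, p, f_crit):
    """Half-width of the simultaneous T^2 interval for mu_i; f_crit = F_{p, n-p}(alpha)."""
    return math.sqrt(p * (n - 1) / (n - p) * f_crit) * math.sqrt(s_ii / n)

def bonferroni_half_width(s_ii, n, t_crit):
    """Half-width of the Bonferroni interval for mu_i; t_crit = t_{n-1}(alpha / (2 p))."""
    return t_crit * math.sqrt(s_ii / n)
```

Both intervals share the factor $\sqrt{s_{ii}/n}$, so the comparison reduces to $\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}$ against $t_{n-1}(\alpha/(2p))$.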
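Confidence Region
The confidence region, denoted $R(\mathbf{x})$, is the multivariate extension of the univariate confidence interval (CI), where $\mathbf{x}_i \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ for $i = 1, \ldots, n$. Then, for the mean vector $\boldsymbol{\mu}$,
$$P\left\{n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\} = 1-\alpha.$$
Centering at $\bar{\mathbf{x}}$ and computing $\mathbf{S}$ yields the set
$$R(\mathbf{x}) = \left\{\boldsymbol{\mu} : n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\}.$$
The half-length of the region along the normalized eigenvector $\mathbf{e}_i$ of $\mathbf{S}$ [16] is
$$\sqrt{\frac{\lambda_i}{n}}\sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)} = \sqrt{\frac{\lambda_i}{n}}\sqrt{T^2(\alpha)},$$
for each eigenvalue $\lambda_i$ of $\mathbf{S}$, $i = 1, \ldots, p$, where $T^2(\alpha)$ denotes the critical value $\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)$.
[16] For standardized variables, the correlation matrix $\mathbf{R}$ is equivalent to the covariance matrix $\mathbf{S}$, and its eigenvalues are used instead.
2.2.5 Multivariate Quality-Control (QC)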
Univariate paired t-test
For $x_{1i}$ and $x_{2i}$, the responses to the two treatments, let $d_i = x_{1i} - x_{2i}$ with $d_i \sim N(\mu_d, \sigma_d^2)$, for the hypothesis test of $H_0: \mu_d = 0$ vs. $H_a: \mu_d \neq 0$. Then, the test statistic is
$$t = \frac{\bar{d}}{s_d/\sqrt{n}} \sim t_{n-1},$$
under $H_0$. If $|t| > t_{n-1}(\alpha/2)$, then reject $H_0$. The confidence interval for $\mu_d$ is
$$\bar{d} \pm t_{n-1}(\alpha/2)\frac{s_d}{\sqrt{n}}.$$
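Multivariate extension in comparison of confidence intervals and confidence region
Suppose that for $n$ units there are 2 treatments with responses $\mathbf{x}_{1i} = (x_{1i1}, \ldots, x_{1ip})'$ and $\mathbf{x}_{2i} = (x_{2i1}, \ldots, x_{2ip})'$, and let $\mathbf{d}_i = \mathbf{x}_{1i} - \mathbf{x}_{2i} \Leftrightarrow d_{ij} = x_{1ij} - x_{2ij}$, for all $j = 1, \ldots, p$.
For $\mathbf{d}_i \sim N_p(\boldsymbol{\mu}_d, \boldsymbol{\Sigma}_d)$, test $H_0: \boldsymbol{\mu}_d = \mathbf{0}$ vs. $H_a: \boldsymbol{\mu}_d \neq \mathbf{0}$. Then, the test statistic is
$$T^2 = n\bar{\mathbf{d}}'\mathbf{S}_d^{-1}\bar{\mathbf{d}} = \sqrt{n}\left(\frac{\sum_{i=1}^{n}\mathbf{d}_i}{n}\right)'\left(\frac{\sum_{i=1}^{n}(\mathbf{d}_i-\bar{\mathbf{d}})(\mathbf{d}_i-\bar{\mathbf{d}})'}{n-1}\right)^{-1}\sqrt{n}\left(\frac{\sum_{i=1}^{n}\mathbf{d}_i}{n}\right) \sim \frac{p(n-1)}{n-p}F_{p,n-p},$$
under $H_0$, where the $100(1-\alpha)\%$ confidence region of $\boldsymbol{\mu}_d$ is
$$R(\boldsymbol{\mu}_d) = \left\{\boldsymbol{\mu}_d : n(\bar{\mathbf{d}}-\boldsymbol{\mu}_d)'\mathbf{S}_d^{-1}(\bar{\mathbf{d}}-\boldsymbol{\mu}_d) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\},$$
which is analogous to the confidence region based on $\bar{\mathbf{x}}$ and $\mathbf{S}$,
$$\left\{\boldsymbol{\mu} : n(\bar{\mathbf{x}}-\boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) \le \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)\right\}.$$
When $T^2 > \frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)$, the critical value, reject $H_0$.
The $100(1-\alpha)\%$ simultaneous $T^2$ confidence intervals for the individual mean differences $\{\mu_{d_j}\}$ [17] are given by
$$\bar{d}_j \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{d_j}^2}{n}}, \quad j = 1, \ldots, p.$$
Note that $\bar{d}_j$ is the $j$th element of $\bar{\mathbf{d}}$ and $s_{d_j}^2$ is the $j$th diagonal element of $\mathbf{S}_d$. This is analogous to the interval for $\mu_i$,
$$\bar{x}_i \pm \sqrt{\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{ii}}{n}}.$$
Also, the $100(1-\alpha)\%$ Bonferroni confidence intervals for $\{\mu_{d_j}\}$ are given by
$$\bar{d}_j \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{d_j}^2}{n}}, \quad j = 1, \ldots, p.$$
This is analogous to
$$\bar{x}_i \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{ii}}{n}}, \quad i = 1, \ldots, p,$$
shown before.
[17] Note that when $n-p$ is large, $\frac{p(n-1)}{n-p}F_{p,n-p}(\alpha)$ can be replaced with $\chi^2_p(\alpha)$ by the property of Hotelling's $T^2$, and the normality assumption is not necessary.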
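Simple Block design
For $q$ treatments over successive periods of time, the observed data are denoted $\mathbf{x}_i = (x_{i1}, \ldots, x_{iq})'$ and the mean vector is $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_q)'$. The goal is to compare the components of $\boldsymbol{\mu}$.
The contrast matrix $\mathbf{C}$ can be set up in two ways. First, the contrast matrix compares a control treatment with each of the other treatments:
$$\begin{pmatrix}\mu_1-\mu_2\\ \mu_1-\mu_3\\ \vdots\\ \mu_1-\mu_q\end{pmatrix} = \begin{pmatrix}1 & -1 & 0 & \cdots & 0\\ 1 & 0 & -1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & 0 & 0 & \cdots & -1\end{pmatrix}\begin{pmatrix}\mu_1\\ \vdots\\ \mu_q\end{pmatrix} = \mathbf{C}_1\boldsymbol{\mu}.$$
The other way is a contrast matrix for successive treatments:
$$\begin{pmatrix}\mu_1-\mu_2\\ \mu_2-\mu_3\\ \vdots\\ \mu_{q-1}-\mu_q\end{pmatrix} = \begin{pmatrix}1 & -1 & 0 & \cdots & 0\\ 0 & 1 & -1 & \cdots & 0\\ \vdots & \vdots & \ddots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 & -1\end{pmatrix}\begin{pmatrix}\mu_1\\ \vdots\\ \mu_q\end{pmatrix} = \mathbf{C}_2\boldsymbol{\mu}.$$
Note that the contrast matrices $\mathbf{C}_1$ and $\mathbf{C}_2$ are $(q-1) \times q$ matrices.
To test that there is no difference between treatments, the hypotheses are $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ vs. $H_a: \mathbf{C}\boldsymbol{\mu} \neq \mathbf{0}$.
The test statistic is Hotelling's $T^2$, as $\mathbf{x}_i \sim N_q(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
$$T^2 = n(\mathbf{C}\bar{\mathbf{x}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{x}}) \sim \frac{(n-1)(q-1)}{n-q+1}F_{q-1,n-q+1},$$
and the $100(1-\alpha)\%$ confidence region for $\mathbf{C}\boldsymbol{\mu}$ is
$$\left\{\mathbf{C}\boldsymbol{\mu} : n(\mathbf{C}\bar{\mathbf{x}}-\mathbf{C}\boldsymbol{\mu})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{x}}-\mathbf{C}\boldsymbol{\mu}) \le \frac{(n-1)(q-1)}{n-q+1}F_{q-1,n-q+1}(\alpha)\right\},$$
compared to the $100(1-\alpha)\%$ simultaneous $T^2$ confidence intervals for the one-dimensional contrasts $\{\mathbf{c}_j'\boldsymbol{\mu}\}$,
$$\mathbf{c}_j'\bar{\mathbf{x}} \pm \sqrt{\frac{(n-1)(q-1)}{n-q+1}F_{q-1,n-q+1}(\alpha)}\sqrt{\frac{\mathbf{c}_j'\mathbf{S}\mathbf{c}_j}{n}}.$$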
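The two contrast matrices above can be generated for any $q$. The sketch below uses hypothetical function names and plain nested lists; `critical_multiplier` returns only the factor $\frac{(n-1)(q-1)}{n-q+1}$, which still has to be multiplied by the tabulated $F_{q-1,n-q+1}(\alpha)$.

```python
def control_contrasts(q):
    """(q-1) x q matrix C1: the first (control) treatment compared with each other treatment."""
    return [[1] + [-1 if j == i else 0 for j in range(1, q)] for i in range(1, q)]

def successive_contrasts(q):
    """(q-1) x q matrix C2: each treatment compared with the next one."""
    return [[1 if j == i else -1 if j == i + 1 else 0 for j in range(q)]
            for i in range(q - 1)]

def is_contrast_matrix(c):
    """Every row of a contrast matrix sums to zero."""
    return all(sum(row) == 0 for row in c)

def critical_multiplier(n, q):
    """Factor (n-1)(q-1)/(n-q+1) that scales F_{q-1, n-q+1}(alpha) in the T^2 test of C mu = 0."""
    return (n - 1) * (q - 1) / (n - q + 1)
```

Because every row sums to zero, $\mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ whenever all components of $\boldsymbol{\mu}$ are equal, which is what the null hypothesis states.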
Computational Example
A sample of 20 courses was administered with 4 assessment schemes:
Treatment 1: final exam and no term tests
Treatment 2: no exam and no term tests
Treatment 3: final exam and term test
Treatment 4: no exam and term test
The outcome variable is the students' mark in %.
$(\mu_3+\mu_4) - (\mu_1+\mu_2)$: effect of having a term test
$(\mu_1+\mu_3) - (\mu_2+\mu_4)$: effect of having a final exam
$(\mu_1+\mu_4) - (\mu_2+\mu_3)$: interaction between term test and final exam
$$\mathbf{C} = \begin{pmatrix}-1 & -1 & 1 & 1\\ 1 & -1 & 1 & -1\\ 1 & -1 & -1 & 1\end{pmatrix}$$
Then, test $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ vs. $H_a: \mathbf{C}\boldsymbol{\mu} \neq \mathbf{0}$ at $\alpha = 0.05$. From the data, $T^2 = n(\mathbf{C}\bar{\mathbf{x}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}\mathbf{C}\bar{\mathbf{x}} = 20.5$.
At $\alpha = 0.05$, the critical value is $\frac{(n-1)(q-1)}{n-q+1}F_{q-1,n-q+1}(\alpha) = \frac{3 \times 19}{17}F_{3,17}(0.05) = 10.73$.
Since $T^2 > 10.73$, reject $H_0$ at the level $\alpha = 0.05$ and conclude that there is a significant difference in the contrasts for the effect of the term test and the final exam in the courses offered.
For the 95% simultaneous confidence intervals, if a confidence interval does not contain 0 then there is an effect from the presence of the term test or the final exam, respectively. Note that the interaction effect of the two factors is not significant if its confidence interval contains 0.
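2.3 Comparing mean vectors of two populations
When $\mathbf{x}_{1i} \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathbf{x}_{2j} \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$, for $i = 1, \ldots, n_1$ and $j = 1, \ldots, n_2$, are independent samples from $p$-variate populations, the goal is to make inference on $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$.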
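2.3.1 Pooled sample covariance when $n_1$, $n_2$ are small and $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$
As $\mathbf{x}_{1i} \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathbf{x}_{2j} \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$, let $\bar{\mathbf{x}}_k = \frac{1}{n_k}\sum_{i=1}^{n_k}\mathbf{x}_{ki}$ and $\mathbf{S}_k = \frac{1}{n_k-1}\sum_{i=1}^{n_k}(\mathbf{x}_{ki}-\bar{\mathbf{x}}_k)(\mathbf{x}_{ki}-\bar{\mathbf{x}}_k)'$ for $k = 1, 2$.
The pooled sample covariance is the weighted mean of the two sample covariance matrices:
$$\mathbf{S}_{pooled} = \frac{\sum_{i=1}^{n_1}(\mathbf{x}_{1i}-\bar{\mathbf{x}}_1)(\mathbf{x}_{1i}-\bar{\mathbf{x}}_1)' + \sum_{j=1}^{n_2}(\mathbf{x}_{2j}-\bar{\mathbf{x}}_2)(\mathbf{x}_{2j}-\bar{\mathbf{x}}_2)'}{n_1-1+n_2-1} \;\Leftrightarrow\; \frac{n_1-1}{n_1+n_2-2}\mathbf{S}_1 + \frac{n_2-1}{n_1+n_2-2}\mathbf{S}_2.$$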
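The weighted-mean form of the pooled covariance is straightforward to compute. The function name `pooled_covariance` is hypothetical, and the inputs are assumed to be the two sample covariance matrices as nested lists of equal size.

```python
def pooled_covariance(s1, n1, s2, n2):
    """Pooled covariance: ((n1 - 1) S1 + (n2 - 1) S2) / (n1 + n2 - 2)."""
    w1 = (n1 - 1) / (n1 + n2 - 2)
    w2 = (n2 - 1) / (n1 + n2 - 2)
    return [[w1 * a + w2 * b for a, b in zip(row1, row2)]
            for row1, row2 in zip(s1, s2)]
```

With equal sample sizes the weights are both 1/2, so the pooled matrix is the plain average of $\mathbf{S}_1$ and $\mathbf{S}_2$.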
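2.3.2 Hypothesis test with small samples when $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$
For $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \boldsymbol{\delta}_0$ [18] vs. $H_a: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \neq \boldsymbol{\delta}_0$, the Hotelling's test statistic under $H_0$ is
$$T^2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\delta}_0)'\left\{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\mathbf{S}_{pooled}\right\}^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\delta}_0)$$
$$= \left\{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{-1/2}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\mu}_1+\boldsymbol{\mu}_2)\right\}'\mathbf{S}_{pooled}^{-1}\left\{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{-1/2}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2-\boldsymbol{\mu}_1+\boldsymbol{\mu}_2)\right\},$$
which has the form (multivariate normal)$'$ (Wishart / df)$^{-1}$ (multivariate normal) such that
$$T^2 \;\Leftrightarrow\; N_p(\mathbf{0},\boldsymbol{\Sigma})'\left(\frac{W_{n_1+n_2-2}(\boldsymbol{\Sigma})}{n_1+n_2-2}\right)^{-1}N_p(\mathbf{0},\boldsymbol{\Sigma}) = \frac{p(n_1+n_2-2)}{n_1+n_2-p-1}F_{p,\,n_1+n_2-p-1}.$$
The hypothesis test rejects $H_0$ if
$$T^2 > \frac{p(n_1+n_2-2)}{n_1+n_2-p-1}F_{p,\,n_1+n_2-p-1}(\alpha) = T^2_{critical}.$$
2.3.3 Confidence intervals with small samples when $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$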
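Confidence region
Analogously, for the ellipsoid centered at $\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2$ with axes along the eigenvectors $\mathbf{e}_1, \ldots, \mathbf{e}_p$ of $\mathbf{S}_{pooled}$ and corresponding eigenvalues $\lambda_i$, the half-length is
$$\sqrt{\lambda_i}\sqrt{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\frac{p(n_1+n_2-2)}{n_1+n_2-p-1}F_{p,\,n_1+n_2-p-1}(\alpha)}, \quad i = 1, \ldots, p.$$
Simultaneous $T^2$ Confidence intervals for $\mathbf{a}'(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$
$$\mathbf{a}'(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \pm \sqrt{T^2_{critical}}\sqrt{\mathbf{a}'\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\mathbf{S}_{pooled}\,\mathbf{a}}.$$
Notice that this is analogous to the individual confidence intervals
$$(\bar{x}_{1j}-\bar{x}_{2j}) \pm \sqrt{\frac{p(n_1+n_2-2)}{n_1+n_2-p-1}F_{p,\,n_1+n_2-p-1}(\alpha)}\sqrt{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)s_{jj,pooled}}.$$
Bonferroni's Confidence Intervals
$$(\bar{x}_{1j}-\bar{x}_{2j}) \pm t_{n_1+n_2-2}\left(\frac{\alpha}{2p}\right)\sqrt{\left(\frac{1}{n_1}+\frac{1}{n_2}\right)s_{jj,pooled}}.$$
[18] i.e. $\boldsymbol{\delta}_0 = \mathbf{0}$.
2.3.4 Behrens-Fisher problem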
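In the case of heterogeneous covariances $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$ with small (moderate) sample sizes $n_1, n_2$ greater than $p$, the covariance of the sample mean difference is estimated by
$$\mathbf{S}_e = \frac{\mathbf{S}_1}{n_1}+\frac{\mathbf{S}_2}{n_2} \;\Leftrightarrow\; \frac{(n-1)(\mathbf{S}_1+\mathbf{S}_2)}{2(n-1)}\cdot\frac{2}{n} \quad\text{when } n_1 = n_2 = n,$$
such that the test statistic for $H_0: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = \mathbf{0}$ vs. $H_a: \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 \neq \mathbf{0}$ is
$$T^2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\mathbf{S}_e^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \sim \frac{vp}{v-p+1}F_{p,\,v-p+1},$$
for $p$ variables, where
$$v = \frac{p+p^2}{\sum_{i=1}^{2}\frac{1}{n_i}\left\{\mathrm{tr}\left[\left(\frac{1}{n_i}\mathbf{S}_i\mathbf{S}_e^{-1}\right)^2\right] + \left(\mathrm{tr}\left[\frac{1}{n_i}\mathbf{S}_i\mathbf{S}_e^{-1}\right]\right)^2\right\}},$$
where $\min(n_1, n_2) \le v \le n_1+n_2$ [19]. Hence, reject $H_0$ if $T^2 > \frac{vp}{v-p+1}F_{p,\,v-p+1}(\alpha)$.
2.3.5 Heterogeneous covariance matrices with large sample size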
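The test statistic under $H_0$, with the same $\mathbf{S}_e$, is
$$T^2 = (\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)'\mathbf{S}_e^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2) \sim \chi^2_p,$$
under the assumption that $n_1-p$ and $n_2-p$ are large enough. Hence, reject $H_0$ if $T^2 > \chi^2_p(\alpha)$.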
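For heterogeneous variances in the univariate case ($p = 1$), the Behrens-Fisher approximation reduces to the Welch $t$-test noted in the footnote to that section. The sketch below evaluates that statistic and its approximate degrees of freedom; the function name `welch_t_and_df` is hypothetical and the inputs are summary statistics.

```python
import math

def welch_t_and_df(x1_bar, s1_sq, n1, x2_bar, s2_sq, n2):
    """Welch t statistic and its approximate degrees of freedom v for two samples."""
    a = s1_sq / n1  # estimated variance of the first sample mean
    b = s2_sq / n2  # estimated variance of the second sample mean
    t = (x1_bar - x2_bar) / math.sqrt(a + b)
    # v = (a + b)^2 / (a^2 / (n1 - 1) + b^2 / (n2 - 1))
    v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, v
```

When the two variances and sample sizes are equal, $v = 2(n-1)$, which matches the pooled two-sample $t$-test.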
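2.3.6 Box's M test (Bartlett's test)
The goal is to test the equality of covariance matrices, $H_0: \boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}$ vs. $H_a$: at least one $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$ for some $i \neq j$, with a chi-square approximation.
Under the multivariate normal distribution, the LRT [20] is
$$\Lambda = \prod_{l=1}^{g}\left(\frac{|\mathbf{S}_l|}{|\mathbf{S}_{pooled}|}\right)^{(n_l-1)/2}, \quad\text{where } \mathbf{S}_{pooled} = \frac{\sum_{l=1}^{g}(n_l-1)\mathbf{S}_l}{\sum_{l=1}^{g}(n_l-1)},$$
for $n_l$ the sample size of the $l$th group and $\mathbf{S}_l$ its sample covariance matrix.
$$\Leftrightarrow\; -2\log\Lambda = M = \left\{\sum_{l=1}^{g}(n_l-1)\right\}\log|\mathbf{S}_{pooled}| - \sum_{l=1}^{g}\{(n_l-1)\log|\mathbf{S}_l|\},$$
where
$$u = \left\{\sum_{l=1}^{g}\frac{1}{n_l-1} - \frac{1}{\sum_{l=1}^{g}(n_l-1)}\right\}\frac{2p^2+3p-1}{6(p+1)(g-1)},$$
as $p$ is the number of variables and $g$ is the number of groups. The test statistic is
$$C = (1-u)M \sim \chi^2_v, \qquad v = \frac{1}{2}p(p+1)(g-1),$$
under $H_0$ [21]. While Box's M test is sensitive to non-normality, MANOVA tests of means or treatment effects are robust to non-normality [22].
[19] The approximation reduces to the Welch $t$-test in the univariate case ($p = 1$): $t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$ and $v = \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}}$.
[20] Formerly, under $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, $\frac{\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0,\boldsymbol{\Sigma})}{\max_{\boldsymbol{\mu},\boldsymbol{\Sigma}} L(\boldsymbol{\mu},\boldsymbol{\Sigma})} = \left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|}\right)^{n/2}$.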
2.4 MANOVA (Multivariate Analysis Of Variance)
The one-way MANOVA model for comparing $g$ population mean vectors is
$$\mathbf{X}_{lj} = \boldsymbol{\mu} + \boldsymbol{\tau}_l + \mathbf{e}_{lj}, \qquad \mathbf{e}_{lj} \sim N_p(\mathbf{0}, \boldsymbol{\Sigma}),$$
that is, random vector = overall mean + $l$th population treatment effect + random error, where there are $g$ populations and $n_l$ observations ($\{\mathbf{x}_{l1}, \ldots, \mathbf{x}_{ln_l}\}$) for population $l$ with population mean $\boldsymbol{\mu}_l$, $l = 1, \ldots, g$.
The constraint $\sum_{l=1}^{g} n_l\boldsymbol{\tau}_l = \mathbf{0}$ defines the model parameters uniquely.
A vector of observations decomposes into
$$\mathbf{x}_{lj} = \bar{\mathbf{x}} + (\bar{\mathbf{x}}_l-\bar{\mathbf{x}}) + (\mathbf{x}_{lj}-\bar{\mathbf{x}}_l),$$
which is observation = overall sample mean $\hat{\boldsymbol{\mu}}$ + estimated treatment effect $\hat{\boldsymbol{\tau}}_l$ + residual error $\hat{\mathbf{e}}_{lj}$.
Note that the normality assumption for the samples can be relaxed when the sample sizes $\{n_l\}$ are large.
2.4.1
Sum of Squares (TSS = SSπ‘π +SSπππ )
Total (corrected) sum of squares (and cross products), TSS = treatment (between groups)
sum of squares and cross products, B + residuals (within group) sum of squares and
cross products, W.
π ∑οΈ
ππ
∑οΈ
π=1 π=1
0
(xπ π − xΜ)(xπ π − xΜ) =
π
∑οΈ
π=1
0
ππ ( xΜπ − xΜ)( xΜπ − xΜ) +
π ∑οΈ
ππ
∑οΈ
(xπ π − xΜπ )(xπ π − xΜπ ) 0,
π=1 π=1
21Reject π»0 if πΆ > ππ£2 (πΌ)
22Although M-test reject π»0 , MANOVA test could be inconsistent with.
24
CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING
Lim, Kyuson
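as simplified from
$$(x_{\ell j}-\bar{x})(x_{\ell j}-\bar{x})' = [(x_{\ell j}-\bar{x}_\ell)+(\bar{x}_\ell-\bar{x})]\,[(x_{\ell j}-\bar{x}_\ell)+(\bar{x}_\ell-\bar{x})]'$$
$$= (x_{\ell j}-\bar{x}_\ell)(x_{\ell j}-\bar{x}_\ell)' + (x_{\ell j}-\bar{x}_\ell)(\bar{x}_\ell-\bar{x})' + (\bar{x}_\ell-\bar{x})(x_{\ell j}-\bar{x}_\ell)' + (\bar{x}_\ell-\bar{x})(\bar{x}_\ell-\bar{x})',$$
and $\sum_{j=1}^{n_\ell}(x_{\ell j}-\bar{x}_\ell) = 0$ [23], so that the cross terms vanish after summing over $j$:
$$\sum_{j=1}^{n_\ell}(x_{\ell j}-\bar{x})(x_{\ell j}-\bar{x})' = \sum_{j=1}^{n_\ell}(x_{\ell j}-\bar{x}_\ell)(x_{\ell j}-\bar{x}_\ell)' + n_\ell(\bar{x}_\ell-\bar{x})(\bar{x}_\ell-\bar{x})'.$$
Notice that in the univariate case $(x_{\ell j}-\bar{x})(x_{\ell j}-\bar{x})' = (x_{\ell j}-\bar{x})^2$, and likewise for the other terms.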
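First, with $\mathbf{S}_\ell$ the $\ell$th sample covariance matrix [24],
$$\mathbf{W} = \sum_{\ell=1}^{g}\sum_{j=1}^{n_\ell}(x_{\ell j}-\bar{x}_\ell)(x_{\ell j}-\bar{x}_\ell)' = (n_1-1)\mathbf{S}_1 + \cdots + (n_g-1)\mathbf{S}_g = \sum_{\ell=1}^{g}(n_\ell-1)\mathbf{S}_\ell = (n-g)\mathbf{S}_{pooled},$$
where $n = \sum_{\ell=1}^{g} n_\ell$, with $n - g$ df, Wishart distribution. Hence,
$$E\left(\frac{\mathbf{W}}{n-g}\right) = \boldsymbol\Sigma.$$
Second,
$$\mathbf{B} = \sum_{\ell=1}^{g} n_\ell(\bar{x}_\ell-\bar{x})(\bar{x}_\ell-\bar{x})',$$
with $g - 1$ df, Wishart distribution.
Thus, TSS has in total $n - 1 = (n - g) + (g - 1)$ df, Wishart distribution.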
2.4.2 Hypothesis Testing
The goal is to test for the presence of treatment effects. $H_0: \tau_1 = \tau_2 = \cdots = \tau_g = 0$ is equivalent to $H_0: \mu_1 = \mu_2 = \cdots = \mu_g$ [25], which states that the treatment effects are all the same. The test statistic [26] is Wilks' Lambda [27], built from B and W, which follow Wishart distributions:
$$\Lambda^* = \frac{|\mathbf{W}|}{|\mathbf{B}+\mathbf{W}|} = \frac{1}{\det(\mathbf{I}+\mathbf{W}^{-1}\mathbf{B})} = \prod_{i=1}^{s}\frac{1}{1+\hat\lambda_i},$$
where $\hat\lambda_1, .., \hat\lambda_s$ are the eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$ and $s = \min(p, g-1)$ is the rank of B.
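[23] Note that, as a consequence, $\sum_{j}\,[(x_{\ell j}-\bar{x}_\ell)(\bar{x}_\ell-\bar{x})' + (\bar{x}_\ell-\bar{x})(x_{\ell j}-\bar{x}_\ell)'] = 0$.
[24] Note that this generalizes $(n_1+n_2-2)\mathbf{S}_{pooled}$, which is recommended in the two-sample case.
[25] As $\hat\tau_\ell = \bar{x}_\ell - \bar{x}$, testing $H_0: \tau_1 = \cdots = \tau_g$ is assessed through differences such as $\bar{x}_1 - \bar{x}_g$.
[26] Analogous to the likelihood ratio $|\hat{\boldsymbol\Sigma}| / |\hat{\boldsymbol\Sigma}_0|$.
[27] Alternatives are Roy's test, based on the largest eigenvalue of $\mathbf{B}\mathbf{W}^{-1}$, the Lawley-Hotelling test $tr(\mathbf{B}\mathbf{W}^{-1})$, and Pillai's test $tr\{\mathbf{B}(\mathbf{B}+\mathbf{W})^{-1}\}$.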
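2.4.3 Distribution of Wilks' Lambda

Number of variables | Number of groups | Test statistic | Distribution under $H_0$
$p = 1$ | $g \ge 2$ | $\frac{n-g}{g-1}\cdot\frac{1-\Lambda^*}{\Lambda^*}$ | $F_{g-1,\,n-g}$
$p = 2$ | $g \ge 2$ | $\frac{n-g-1}{g-1}\cdot\frac{1-\sqrt{\Lambda^*}}{\sqrt{\Lambda^*}}$ | $F_{2(g-1),\,2(n-g-1)}$
$p \ge 1$ | $g = 2$ | $\frac{n-p-1}{p}\cdot\frac{1-\Lambda^*}{\Lambda^*}$ | $F_{p,\,n-p-1}$
$p \ge 1$ | $g = 3$ | $\frac{n-p-2}{p}\cdot\frac{1-\sqrt{\Lambda^*}}{\sqrt{\Lambda^*}}$ | $F_{2p,\,2(n-p-2)}$

2.4.4 Large Sample property for modification of $\Lambda^*$
If $H_0$ is true and $n = \sum_\ell n_\ell$ is large, then approximately
$$-\left(n - 1 - \frac{p+g}{2}\right)\ln(\Lambda^*) \sim \chi^2_{p(g-1)}.$$
Hence, reject $H_0$ at level $\alpha$ if
$$-\left(n - 1 - \frac{p+g}{2}\right)\ln(\Lambda^*) > \chi^2_{p(g-1)}(\alpha).$$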
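The decomposition TSS = B + W and Wilks' Lambda can be checked numerically. The following is a minimal sketch in plain Python, restricted to the bivariate case ($p = 2$) so that determinants can be written by hand; the toy data set and all helper names (`mean_vec`, `sscp`, `det2`) are illustrative assumptions, not part of the notes.

```python
from math import sqrt

def mean_vec(rows):
    """Component-wise mean of a list of bivariate observations."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def sscp(rows, center):
    """Sum of (x - center)(x - center)' over rows, as a 2x2 nested list."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - center[0], r[1] - center[1]]
        for a in range(2):
            for b in range(2):
                S[a][b] += d[a] * d[b]
    return S

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

# Toy data (assumed for illustration): g = 3 groups, p = 2 variables.
groups = [
    [(9, 3), (6, 2), (9, 7)],
    [(0, 4), (2, 0)],
    [(3, 8), (1, 9), (2, 7)],
]
all_rows = [r for grp in groups for r in grp]
n, g = len(all_rows), len(groups)
grand = mean_vec(all_rows)

W = [[0.0, 0.0], [0.0, 0.0]]  # within-group SSCP
B = [[0.0, 0.0], [0.0, 0.0]]  # between-group SSCP
for grp in groups:
    m = mean_vec(grp)
    Wl = sscp(grp, m)
    Bl = sscp([m], grand)      # (xbar_l - xbar)(xbar_l - xbar)'
    for a in range(2):
        for b in range(2):
            W[a][b] += Wl[a][b]
            B[a][b] += len(grp) * Bl[a][b]

T = sscp(all_rows, grand)      # total SSCP
BW = [[B[a][b] + W[a][b] for b in range(2)] for a in range(2)]
wilks = det2(W) / det2(BW)

# Exact F transformation for p = 2 (second row of the table above).
f_stat = (n - g - 1) / (g - 1) * (1 - sqrt(wilks)) / sqrt(wilks)
df = (2 * (g - 1), 2 * (n - g - 1))
print(wilks, f_stat, df)
```

A small value of `wilks` (equivalently a large `f_stat`) points against equal mean vectors; comparing `f_stat` with the upper quantile of the F distribution with `df` degrees of freedom completes the test.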
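2.4.5 Simultaneous Confidence Intervals for treatment effect
Let $\hat\tau_\ell = \bar{x}_\ell - \bar{x}$ be the estimated $\ell$th treatment effect. Then the estimated difference between the $k$th and $\ell$th treatments is
$$\hat\tau_k - \hat\tau_\ell = \bar{x}_k - \bar{x} - \bar{x}_\ell + \bar{x} = \bar{x}_k - \bar{x}_\ell = (\bar{x}_{k1} - \bar{x}_{\ell 1}, \cdots, \bar{x}_{kp} - \bar{x}_{\ell p})',$$
and
$$Var(\hat\tau_{ki} - \hat\tau_{\ell i}) = Var(\bar{x}_{ki} - \bar{x}_{\ell i}) = \left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)\sigma_{ii},$$
where $\sigma_{ii}$ is estimated by $s_{ii,pooled} = \frac{w_{ii}}{n-g}$, with $w_{ii}$ the $i$th diagonal element of W.
The $100(1-\alpha)\%$ (for example 95%) simultaneous Bonferroni confidence intervals for $\{\tau_{ki} - \tau_{\ell i}\}$ [28] are
$$(\bar{x}_{ki} - \bar{x}_{\ell i}) \pm t_{n-g}\!\left(\frac{\alpha}{2m}\right)\sqrt{\left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)s_{ii,pooled}}, \quad \text{where } m = pg(g-1)/2,$$
where $p$ is the number of variables and $g$ is the number of populations.
Hence, reject $H_0: \tau_{ki} - \tau_{\ell i} = 0$ if
$$t = \frac{|\bar{x}_{ki} - \bar{x}_{\ell i}|}{\sqrt{\left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)s_{ii,pooled}}} > t_{n-g}\!\left(\frac{\alpha}{2m}\right).$$
[28] Note that this is analogous to the two-sample interval $(\bar{x}_{1i} - \bar{x}_{2i}) \pm t_{n_1+n_2-2}\!\left(\frac{\alpha}{2p}\right)\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)s_{ii,pooled}}$ defined earlier.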
Chapter 3
Bayesian Alternative approach
Let $\theta$ be the discrete random variable to be estimated and $X = x$ the observed random variable. Starting from the prior $p(\theta)$, which carries the information about possible values of the parameter, the approach uses the observed data through $p(x|\theta)$ to update this information into the posterior probabilities $p(\theta|x)$ [1]. Interval statements are then read off the posterior, $p(\theta \in CI\,|\,x) = 1 - \alpha$.
If a value $\theta^*$ is given, replicated data $\hat{x}_i \sim p(x|\theta^*)$ can be generated to stand in for the true value of $x$; the update then takes place in a compatible space, where overfitting is not a concern.
From the likelihood $p(x|\theta)$, Bayes' theorem yields the posterior density, whose unnormalized form is the right-hand side of
$$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)} \propto p(x|\theta)\,p(\theta),$$
where $p(x)$ is fixed once $x$ is observed and does not depend on $\theta$. Note that $p(x)$ is also referred to as the evidence. In short,
Posterior ∝ Prior × Likelihood.
A parameter of a prior distribution is referred to as a hyperparameter.
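For predictive inference on an unknown but observable $x$ before any data are considered, the distribution of $x$ is
$$p(x) = \int_\Theta p(x, \theta)\,d\theta = \int_\Theta p(\theta)\,p(x|\theta)\,d\theta,$$
the marginal distribution of $x$, which is called the prior predictive distribution [2].
For observed data $\mathbf{x}$ and unknown $\theta = (\mu, \sigma^2)$, the distribution of an unknown observable $\tilde{x}$ to be predicted is referred to as the posterior predictive distribution,
$$p(\tilde{x}|\mathbf{x}) = \int_\Theta p(\tilde{x}, \theta|\mathbf{x})\,d\theta = \int_\Theta p(\tilde{x}|\theta, \mathbf{x})\,p(\theta|\mathbf{x})\,d\theta = \int_\Theta p(\tilde{x}|\theta)\,p(\theta|\mathbf{x})\,d\theta,$$
which is posterior because it is conditional on the observed $\mathbf{x}$ and predictive because it concerns the observable $\tilde{x}$. The last equality uses the conditional independence of $\tilde{x}$ and $\mathbf{x}$ given $\theta$.
[1] This follows from Bayes' theorem, $p(\theta|x) = p(x, \theta)/p(x) = p(x|\theta)\,p(\theta)/p(x)$.
[2] Predictive refers to the distribution of a quantity that is observable.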
The ratio of the posterior density $p(\theta|x)$ evaluated at the points $\theta_1$ and $\theta_2$ under the given model is referred to as the posterior odds for $\theta_1$ compared to $\theta_2$:
$$\frac{p(\theta_1|x)}{p(\theta_2|x)} = \frac{p(\theta_1)\,p(x|\theta_1)/p(x)}{p(\theta_2)\,p(x|\theta_2)/p(x)} = \frac{p(\theta_1)}{p(\theta_2)}\cdot\frac{p(x|\theta_1)}{p(x|\theta_2)},$$
so the posterior odds $\frac{p(\theta_1|x)}{p(\theta_2|x)}$ equal the prior odds $\frac{p(\theta_1)}{p(\theta_2)}$ times the likelihood ratio $\frac{p(x|\theta_1)}{p(x|\theta_2)}$, under Bayes' rule for discrete parameters.
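As an illustration of the posterior odds identity, the sketch below updates a three-point discrete prior with a binomial likelihood using exact fractions. The grid of $\theta$ values, the flat prior and the data ($s = 7$ successes in $n = 10$ trials) are assumptions chosen for the example.

```python
from fractions import Fraction
from math import comb

def binom_lik(theta, s, n):
    """Binomial likelihood p(s | theta, n)."""
    return comb(n, s) * theta**s * (1 - theta)**(n - s)

# Assumed discrete parameter grid with a flat prior.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = {t: Fraction(1, 3) for t in thetas}
s, n = 7, 10

# Posterior is proportional to prior times likelihood; the evidence p(x) normalizes it.
unnorm = {t: prior[t] * binom_lik(t, s, n) for t in thetas}
evidence = sum(unnorm.values())
posterior = {t: u / evidence for t, u in unnorm.items()}

t1, t2 = Fraction(3, 4), Fraction(1, 2)
posterior_odds = posterior[t1] / posterior[t2]
prior_odds = prior[t1] / prior[t2]
likelihood_ratio = binom_lik(t1, s, n) / binom_lik(t2, s, n)
print(posterior_odds, prior_odds * likelihood_ratio)
```

Because the evidence cancels in the ratio, the two printed values agree exactly.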
Random variables and Bayesian statistical inference
For the unknown random variable $\Theta$ to be estimated and the observed $X = x$, Bayes' rule yields
$$P(\Theta = \theta|X = x) = \frac{P(X = x|\Theta = \theta)\,P(\Theta = \theta)}{P(X = x)} \Leftrightarrow P_{\Theta|X}(\theta|x) = \frac{P_{X|\Theta}(x|\theta)\,P_\Theta(\theta)}{P_X(x)}.$$
If either $\Theta$ or $X$ is a continuous random variable, replace the corresponding PMF by the PDF in the formula. Equivalently, the posterior PDF is the prior times the likelihood, normalized by $f_X(x)$, which is obtained from the law of total probability:
$$f_{\Theta|X}(\theta|x) = \frac{f_{X|\Theta}(x|\theta)\,f_\Theta(\theta)}{f_X(x)}.$$
In Bayesian statistics the choice of the prior $f_\Theta(\theta)$ is generally not clear-cut and is subjective, so different analysts may choose differently. With the unknown variable $\Theta$, the goal is to draw inferences about $\Theta$ by observing a related random variable $X$. Note that the posterior distribution of $\Theta$, $f_{\Theta|X}(\theta|x)$ or $P_{\Theta|X}(\theta|x)$, contains all the information about $\Theta$; point or interval estimates of $\Theta$ are derived from it.
Comparison between Frequentist and Bayesian methods
For frequentist inference, probabilities are long-run frequencies, and the goal is to create procedures with long-run frequency guarantees. For Bayesian inference, probabilities are subjective degrees of belief, and the goal is to state and analyze those beliefs. Hence, frequentists view a parameter as a fixed constant, while Bayesians treat it as a random variable. The confidence interval serves as an example.
For the confidence interval $CI = [\bar{X}_n - 1.96/\sqrt{n},\ \bar{X}_n + 1.96/\sqrt{n}]$, the probability statement is $P_\theta(\theta \in CI) = 0.95$ for all $\theta \in \mathbb{R}$; the interval is random because it is a function of the data. With the parameter $\theta$ fixed, the CI traps the true value with probability 0.95.
Over infinitely many experiments, each with its own data and its own parameter $\theta_i$, the computed intervals $CI_i$ trap the parameters $\theta_i$ at least 95% of the time; that is, almost surely, for any sequence $\theta_1, \theta_2, \ldots$,
$$\liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} I(\theta_i \in CI_i) \ge 0.95.$$
On the other hand, in the Bayesian approach the unknown parameter $\theta$ is given a prior distribution $\pi(\theta)$ that represents subjective beliefs about $\theta$. Using Bayes' theorem, the posterior distribution of $\theta$ given the observed data $X_1, ..., X_n$ is computed with the likelihood function $L(\theta) = \prod_{i=1}^{n} p(X_i|\theta)$:
$$\pi(\theta|X_1, ..., X_n) \propto L(\theta)\,\pi(\theta), \qquad \int_{CI}\pi(\theta|X_1, ..., X_n)\,d\theta = 0.95 \Leftrightarrow P(\theta \in CI\,|\,X_1, ..., X_n) = 0.95.$$
Hence this is a degree-of-belief probability statement about $\theta$ given the observed data. It is not the same as the frequentist statement: such intervals need not trap the true value 95% of the time.
In summary, the frequentist CI satisfies $\inf_\theta P_\theta(\theta \in CI) = 1 - \alpha$ for the coverage of the interval, and the probability refers to the random interval CI. A Bayesian (credible) interval satisfies $P(\theta \in CI\,|\,X_1, ..., X_n) = 1 - \alpha$, where the probability refers to $\theta$. While subjective Bayesians interpret probability strictly as personal degrees of belief, objective Bayesians try to find prior distributions for which the resulting posterior is objective [3].
Frequentist Bayesians, in turn, use Bayesian methods only when the resulting posterior has good frequency behaviour. Likelihoodists, on the other hand, use the likelihood function to measure the strength of the data as evidence.
3.0.1 Overview: Univariate Binomial distribution with known and unknown parameter
Let the probability of success in a single trial be $\theta$, and let $X = \{x_1, .., x_n\}$ be the set of observations, where $x_i \sim Ber(\theta)$. Then the probability of $s = \sum_{i=1}^{n} x_i$ successes in $n$ trials $(x_1, ..., x_n)$ is
$$p(s|n, \theta) = Bin(s|n, \theta) = \binom{n}{s}\theta^s(1-\theta)^{n-s},$$
which serves as the likelihood for $\theta$.
Example. Objective Bayesian approach
Since $p(\theta|X = x_1, ..., x_n) \propto p(X = x_1, ..., x_n|\theta)\,p(\theta)$ and $\theta$ is unknown, set the uniform prior $\theta \sim U[0, 1]$, $p(\theta) = 1$. Then
$$p(\theta|s) \propto \theta^s(1-\theta)^{n-s} = \theta^{(s+1)-1}(1-\theta)^{(n-s+1)-1}$$
$$\Leftrightarrow p(\theta|s) = \frac{\Gamma(n+2)}{\Gamma(s+1)\Gamma(n-s+1)}\theta^s(1-\theta)^{n-s} = \frac{\theta^s(1-\theta)^{n-s}}{B(s+1, n-s+1)},$$
$$\theta|s, n \sim Beta(s+1, n-s+1),$$
so the posterior follows the Beta distribution. Since the density function integrates to 1, the normalizing constant ($z$) is
$$z = \int_0^1 \theta^s(1-\theta)^{n-s}\,d\theta = \frac{\Gamma(s+1)\Gamma(n-s+1)}{\Gamma(n+2)}.$$
[3] Empirical Bayesians estimate the prior distribution from the data.
The prior predictive distribution for a fixed number of successes $s$ in $n$ trials is
$$p(s) = \int_0^1 p(s|\theta, n)\,p(\theta)\,d\theta = \binom{n}{s}\int_0^1 \theta^s(1-\theta)^{n-s}\,d\theta = \binom{n}{s}\frac{\Gamma(s+1)\Gamma(n-s+1)}{\Gamma(n+2)} = \frac{1}{n+1}.$$
Hence the prior predictive density $p(s) = \frac{1}{n+1}$ is uniform over $s = 0, ..., n$. As an example, for one observation $\tilde{x}$ the outcome probability is $p(\tilde{x} = 1) = \int_0^1 p(\tilde{x} = 1|\theta)\,p(\theta)\,d\theta = \int_0^1 \theta\,d\theta = 1/2$.
Also, by the mean of the Beta distribution, the Bayes posterior estimator is $E(\theta|s) = \frac{s+1}{n+2}$. It is instructive to write it as a convex combination,
$$E(\theta|s) = \lambda_n\frac{s}{n} + (1-\lambda_n)\frac{1}{2},$$
of the prior mean $1/2$ and the maximum likelihood estimate [4] $s/n$. Moreover, the weight is $\lambda_n = \frac{n}{n+2}$, which is close to 1 for large $n$.
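The two closed forms in this example, the uniform prior predictive $1/(n+1)$ and the convex-combination form of the posterior mean, can be verified with exact arithmetic. This is a minimal sketch; the function names and the values $n = 10$, $s = 7$ are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(s, n):
    """Exact integral of theta^s (1 - theta)^(n - s) over [0, 1],
    i.e. Gamma(s+1) Gamma(n-s+1) / Gamma(n+2) for integer s, n."""
    return Fraction(factorial(s) * factorial(n - s), factorial(n + 1))

def posterior_mean_uniform(s, n):
    """Mean of Beta(s + 1, n - s + 1), the posterior under a U[0, 1] prior."""
    return Fraction(s + 1, n + 2)

n, s = 10, 7

# Prior predictive p(k) for every possible number of successes k.
prior_predictive = [comb(n, k) * beta_integral(k, n) for k in range(n + 1)]

# Posterior mean as a weighted average of the MLE s/n and the prior mean 1/2.
lam = Fraction(n, n + 2)
shrunk = lam * Fraction(s, n) + (1 - lam) * Fraction(1, 2)
print(prior_predictive[0], shrunk, posterior_mean_uniform(s, n))
```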
Example. Subjective Bayesian approach
On the other hand, the subjective Bayesian may take the prior to be strongly peaked around 1/2, as a subjective belief held before seeing the data. For known $n$, Bayes' rule yields the posterior
$$p(\theta|s, n) = \frac{p(s|\theta, n)\,p(\theta|n)}{p(s|n)}, \quad \text{where } p(s|n) = \int_\Theta p(s|\theta, n)\,p(\theta|n)\,d\theta.$$
Setting the prior distribution $\theta \sim Beta(\alpha, \beta)$, with $p(\theta|n) = p(\theta)$,
$$p(\theta|s, n) \propto p(s|\theta)\,p(\theta) = \binom{n}{s}\theta^s(1-\theta)^{n-s}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1} \propto \theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1},$$
which, with its normalizing constant [5] restored, is
$$p(\theta|s, n) = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+s)\Gamma(\beta+n-s)}\theta^{\alpha+s-1}(1-\theta)^{\beta+n-s-1}.$$
The posterior distribution is $\theta|s \sim Beta(\alpha+s, \beta+n-s)$, with
$$E(\theta|s) = \frac{\alpha+s}{\alpha+\beta+n}, \qquad var(\theta|s) = \frac{(\alpha+s)(\beta+n-s)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)},$$
and PDF $f_{\theta|X}(\theta|X = x_1, ..., x_n)$.
[4] The estimate is successes over successes plus failures, as obtained from the likelihood $L(\theta) = \prod_i \theta_i^{x_i}$ of the multinomial distribution.
[5] Note that $\Gamma(n) = (n-1)!$
Moreover, the Bayesian point estimate is summarized by the center of the posterior distribution,
$$\bar\theta = \int_0^1 \theta\,p(\theta|s)\,d\theta = \frac{\alpha+s}{\alpha+\beta+n} = \frac{n}{\alpha+\beta+n}\cdot\frac{s}{n} + \frac{\alpha+\beta}{\alpha+\beta+n}\cdot\frac{\alpha}{\alpha+\beta} = \frac{n}{\alpha+\beta+n}\,\hat\theta_{MLE} + \frac{\alpha+\beta}{\alpha+\beta+n}\,E(\theta),$$
where $E(\theta) = \frac{\alpha}{\alpha+\beta}$ is the prior mean.
After the data $X$ have been observed, an unknown observable $\tilde{x}$ is predicted through what is referred to as the posterior predictive distribution. For a single new observation $\tilde{x} = 1$, conditional on the observations $X$, it yields
$$p(\tilde{x} = 1|s, n) = \int_0^1 p(\tilde{x} = 1|\theta, s, n)\,p(\theta|s, n)\,d\theta = \int_0^1 Ber(\tilde{x} = 1|\theta)\,Beta(\theta|\alpha+s, \beta+n-s)\,d\theta$$
$$= \int_0^1 \theta\,Beta(\theta|\alpha+s, \beta+n-s)\,d\theta = \int_0^1 \theta\,p(\theta|s)\,d\theta = E(\theta|s),$$
where $p(\tilde{x} = 1|\theta) = \theta$ [6], so that the predictive probability equals the posterior mean $E(\theta|s) = \frac{\alpha+s}{\alpha+\beta+n}$.
For the purpose of Bayesian inference, the predictive distribution of new observations is derived as in this example.
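The Beta update and the posterior predictive probability can be sketched in a few lines. The prior $Beta(5, 5)$ (peaked around 1/2) and the eight Bernoulli observations are assumptions chosen only to make the arithmetic visible; the function names are hypothetical.

```python
from fractions import Fraction

def beta_update(alpha, beta, data):
    """Posterior Beta parameters after observing a list of 0/1 outcomes."""
    s, n = sum(data), len(data)
    return alpha + s, beta + n - s

def beta_mean(a, b):
    return Fraction(a, a + b)

def beta_var(a, b):
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))

alpha, beta = 5, 5                      # assumed prior, peaked at 1/2
data = [1, 1, 0, 1, 1, 1, 0, 1]         # assumed observations
a_post, b_post = beta_update(alpha, beta, data)

# Posterior predictive probability of a success equals the posterior mean.
p_next_success = beta_mean(a_post, b_post)

# Same number, written as a weighted average of the MLE and the prior mean.
n, s = len(data), sum(data)
w = Fraction(n, alpha + beta + n)
weighted = w * Fraction(s, n) + (1 - w) * beta_mean(alpha, beta)
print(a_post, b_post, p_next_success, beta_var(a_post, b_post))
```

With more data the weight `w` on the MLE grows, so the influence of the prior fades.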
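Equivalently, in terms of the individual observations, the generalized form of the prediction is
$$p(\tilde{x} = 1|X) = E(\theta|X) = \frac{\alpha + \sum_{i=1}^{n} x_i}{\alpha+\beta+n}.$$
On the other hand,
$$p(\tilde{x} = 0|X) = 1 - E(\theta|X) = \frac{\beta + \sum_{i=1}^{n}(1 - x_i)}{\alpha+\beta+n}.$$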
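3.1 Conditional distribution of the subset
Partition the jointly normal $p \times 1$ vector [8] $\mathbf{x}$ into $\mathbf{x}^{(1)}$ ($q \times 1$) and $\mathbf{x}^{(2)}$ ($(p-q) \times 1$), with $\mathbf{x}^{(1)} \sim N_q(\boldsymbol\mu^{(1)}, \boldsymbol\Sigma_{11})$ [7] and $\mathbf{x}^{(2)} \sim N_{p-q}(\boldsymbol\mu^{(2)}, \boldsymbol\Sigma_{22})$. The conditional distribution of $\mathbf{x}^{(1)}$ given $\mathbf{x}^{(2)}$ is $N_q(\boldsymbol\mu_{1.2}, \boldsymbol\Sigma_{11.2})$, where
$$\boldsymbol\mu_{1.2} = \boldsymbol\mu^{(1)} + \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}(\mathbf{x}^{(2)} - \boldsymbol\mu^{(2)}), \qquad \boldsymbol\Sigma_{11.2} = \boldsymbol\Sigma_{11} - \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\Sigma_{21}.$$
Thus, the conditional density is $\mathbf{x}^{(1)}|\mathbf{x}^{(2)} \sim N_q(\boldsymbol\mu^{(1)} + \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}(\mathbf{x}^{(2)} - \boldsymbol\mu^{(2)}),\ \boldsymbol\Sigma_{11} - \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\Sigma_{21})$.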
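The conditional mean and covariance formulas can be evaluated directly for a small case. The sketch below assumes a three-dimensional normal vector in which $\mathbf{x}^{(1)}$ is the first component ($q = 1$) and $\mathbf{x}^{(2)}$ the remaining two, so that only a 2x2 inverse is needed; the numbers in `mu` and `Sigma` are illustrative assumptions.

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def conditional_first_given_rest(mu, Sigma, x2):
    """Mean and variance of x1 | (x2a, x2b) for a trivariate normal."""
    mu1, mu2 = mu[0], mu[1:]
    s11 = Sigma[0][0]
    S12 = Sigma[0][1:]                      # 1 x 2 block Sigma_12
    S22 = [row[1:] for row in Sigma[1:]]    # 2 x 2 block Sigma_22
    S22_inv = inv2(S22)
    # Regression weights Sigma_12 Sigma_22^{-1}
    w = [sum(S12[k] * S22_inv[k][j] for k in range(2)) for j in range(2)]
    mean = mu1 + sum(w[j] * (x2[j] - mu2[j]) for j in range(2))
    var = s11 - sum(w[j] * S12[j] for j in range(2))   # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
    return mean, var

mu = [1.0, 0.0, 2.0]
Sigma = [[4.0, 1.0, 2.0],
         [1.0, 2.0, 0.0],
         [2.0, 0.0, 5.0]]
cond_mean, cond_var = conditional_first_given_rest(mu, Sigma, [1.0, 3.0])
print(cond_mean, cond_var)
```

The conditional variance does not depend on the observed value of $\mathbf{x}^{(2)}$, and it is never larger than $\boldsymbol\Sigma_{11}$.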
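Independence and covariance
For the partition $\mathbf{x} = (\mathbf{x}^{(1)\prime}, \mathbf{x}^{(2)\prime})'$ of a jointly normal vector, $\mathbf{x}^{(1)} \perp \mathbf{x}^{(2)} \Leftrightarrow \boldsymbol\Sigma_{12} = 0$ [9]. Conversely, if $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ each follow a normal distribution and are independent, then their joint distribution is normal.
[6] For a Bernoulli trial, $p(\tilde{x} = 0|\theta) = 1 - \theta$.
[7] Note that this partition is also the one used in the EM algorithm, where one block is to be estimated.
[8] Note that $\mathbf{x} = (\mathbf{x}^{(1)\prime}, \mathbf{x}^{(2)\prime})'$.
[9] This can be proven by showing $f(\mathbf{x}) = f(\mathbf{x}^{(1)})\,f(\mathbf{x}^{(2)})$ when $\boldsymbol\Sigma_{12} = 0$, that is, when all elements outside the diagonal blocks $\boldsymbol\Sigma_{ii}$ are 0.
Linear transformation
For the linear transformation defined by a matrix A as $\mathbf{y} = \mathbf{A}\mathbf{x}$, $\mathbf{y}$ follows the distribution $N_q(\boldsymbol\mu^*, \boldsymbol\Sigma^*)$, where $\boldsymbol\mu^* = \mathbf{A}\boldsymbol\mu$ and $\boldsymbol\Sigma^* = \mathbf{A}\boldsymbol\Sigma\mathbf{A}'$ [10].
By the conditional distribution formula, $\mathbf{y}^{(1)}|\mathbf{y}^{(2)} \sim N(\boldsymbol\mu^{**}, \boldsymbol\Sigma^{**})$, where $\boldsymbol\mu^{**} = \boldsymbol\mu^{*(1)} + \boldsymbol\Sigma^*_{12}\boldsymbol\Sigma^{*-1}_{22}(\mathbf{y}^{(2)} - \boldsymbol\mu^{*(2)})$ and $\boldsymbol\Sigma^{**} = \boldsymbol\Sigma^*_{11} - \boldsymbol\Sigma^*_{12}\boldsymbol\Sigma^{*-1}_{22}\boldsymbol\Sigma^*_{21}$, and $\mathbf{y}^{(2)}$ is the given (observed) vector.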
3.1.1 Law of total expectation
Often referred to as the tower property, Adam's law for random variables $X$ and $Y$ is
$$E(X) = E(E(X|Y)),$$
since, in the discrete case,
$$E(E(X|Y)) = E\Big[\sum_x x\,P(X = x|Y)\Big] = \sum_y\sum_x x\,P(X = x|Y = y)\,P(Y = y) = \sum_y\sum_x x\,P(X = x, Y = y)$$
$$= \sum_x x\sum_y P(X = x, Y = y) = \sum_x x\,P(X = x) = E(X),$$
and Eve's law is
$$var(X) = E(var(X|Y)) + var(E(X|Y)),$$
as
$$E(X^2) = E\big[E(X^2|Y)\big] = E\big[E(X^2|Y) - (E(X|Y))^2 + (E(X|Y))^2\big] = E\big[var(X|Y) + (E(X|Y))^2\big]$$
$$\Leftrightarrow E(X^2) - E(X)^2 = E\big[var(X|Y)\big] + E\big[(E(X|Y))^2\big] - \big(E[E(X|Y)]\big)^2 = E\big[var(X|Y)\big] + var\big(E(X|Y)\big),$$
where $var(E(X|Y)) = E[(E(X|Y))^2] - (E[E(X|Y)])^2$. Eve's law is also referred to as the law of total variance. For the law of total expectation, if $\{A_i\}$ is a finite or countably infinite partition of the probability space and $E|X| < \infty$, then
$$E(X) = \sum_i E(X|A_i)\,P(A_i).$$
For Eve's law, notice that the posterior variance $var(X|Y)$ is on average no larger than the prior variance $var(X)$. The greater the second term $var(E(X|Y))$ in Eve's law, the more potential there is for reducing our uncertainty with regard to $X$.
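Both laws can be checked exactly on a small discrete joint distribution. The joint PMF below is an assumption made up for the check; any valid PMF would do.

```python
from fractions import Fraction

# Assumed joint PMF P(X = x, Y = y), keyed by (x, y).
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(1, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 2),
}

def expect(f):
    """E[f(X, Y)] under the joint PMF."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

# Marginal P(Y = y).
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p

def cond_moment(y, k):
    """E[X^k | Y = y]."""
    return sum(p * x**k for (x, yy), p in joint.items() if yy == y) / p_y[y]

e_x = expect(lambda x, y: x)
var_x = expect(lambda x, y: x * x) - e_x**2

cond_mean = {y: cond_moment(y, 1) for y in p_y}
cond_var = {y: cond_moment(y, 2) - cond_mean[y]**2 for y in p_y}

adam = sum(p_y[y] * cond_mean[y] for y in p_y)                 # E(E(X|Y))
e_cond_var = sum(p_y[y] * cond_var[y] for y in p_y)            # E(var(X|Y))
var_cond_mean = sum(p_y[y] * cond_mean[y]**2 for y in p_y) - adam**2  # var(E(X|Y))
print(e_x, adam, var_x, e_cond_var + var_cond_mean)
```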
[10] Note that $cov(\mathbf{A}\mathbf{x}^{(1)}) = \mathbf{A}\,cov(\mathbf{x}^{(1)})\,\mathbf{A}'$ and also $cov(\mathbf{A}\mathbf{x}^{(1)}, \mathbf{B}\mathbf{x}^{(2)}) = \mathbf{A}\,cov(\mathbf{x}^{(1)}, \mathbf{x}^{(2)})\,\mathbf{B}'$ when the vector is partitioned.
3.1.2 Conditional expectation (MMSE)
For the posterior distribution of an unknown random variable $Y$ given $X = x$, such as $f_{Y|X}(y|x)$, the posterior mean
$$\hat{y}_M = E(Y|X = x)$$
is a point estimate that minimizes the mean squared error among all estimates of $Y$; it is referred to as the minimum mean squared error (MMSE) [11] or Bayes estimate of $Y$.
The posterior density is computed as
$$f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y)\,f_Y(y)}{f_X(x)}, \quad \text{where } f_X(x) = \int_{-\infty}^{+\infty} f_{X|Y}(x|y)\,f_Y(y)\,dy.$$
The MMSE estimate of $Y$ given $X = x$ is then given by
$$\hat{y}_M = \int_{-\infty}^{+\infty} y\,f_{Y|X}(y|x)\,dy \ \Rightarrow\ E(\hat{Y}_M) = E(E(Y|X)) = E(Y),$$
by Adam's law. Hence $E(Y) - E(\hat{Y}_M) = 0$, so $\hat{Y}_M$ is an unbiased estimator.
Properties of estimation error
Let $Y$ be the unobserved random variable, to be estimated from the observation $X = x$. Let the estimate $\hat{Y} = g(X)$ be a function of $X$, and define the estimation error as $\tilde{Y} = Y - \hat{Y} = Y - g(X)$, with MSE $E[(Y - g(X))^2]$. The goal is to decompose the variance and second moment of $Y$. Now let $W = E(\tilde{Y}|X)$ with $\hat{Y}_M = E(Y|X)$ and $\tilde{Y} = Y - \hat{Y}_M$. Then
$$W = E(\tilde{Y}|X) = E(Y - \hat{Y}_M|X) = E(Y|X) - E(\hat{Y}_M|X) = \hat{Y}_M - \hat{Y}_M = 0.$$
For any function $g(X)$,
$$E(\tilde{Y}g(X)|X) = g(X)\,E(\tilde{Y}|X) = g(X)\,W = 0.$$
Similarly, by Adam's law for iterated expectations,
$$E(\tilde{Y}g(X)) = E[E(\tilde{Y}g(X)|X)] = 0.$$
In particular, the estimation error $\tilde{Y} = Y - \hat{Y}_M$ and the estimate $\hat{Y}_M = E(Y|X)$ are uncorrelated, which is used to decompose the variance of $Y$. By the covariance formula,
$$cov(\tilde{Y}, \hat{Y}_M) = E(\tilde{Y}\hat{Y}_M) - E(\tilde{Y})E(\hat{Y}_M) = E(\tilde{Y}\hat{Y}_M) = 0,$$
where $E(\tilde{Y}) = E(E(\tilde{Y}|X)) = 0$ and the last equality takes $g(X) = \hat{Y}_M$ in the previous result.
[11] Also called the least mean squares (LMS) estimate.
Since $Y = \hat{Y}_M + \tilde{Y}$ and $cov(\tilde{Y}, \hat{Y}_M) = 0$,
$$var(Y) = var(\hat{Y}_M) + var(\tilde{Y})$$
$$\Leftrightarrow E(Y^2) - E(Y)^2 = E(\hat{Y}_M^2) - E(\hat{Y}_M)^2 + E(\tilde{Y}^2) - E(\tilde{Y})^2,$$
where $E(\hat{Y}_M) = E(E(Y|X)) = E(Y)$ and $E(\tilde{Y}) = E(Y - \hat{Y}_M) = 0$, such that
$$E(Y^2) = E(\hat{Y}_M^2) + E(\tilde{Y}^2) + (E(Y)^2 - E(\hat{Y}_M)^2) - E(\tilde{Y})^2 = E(\hat{Y}_M^2) + E(\tilde{Y}^2).$$
Also, the MSE of estimating the unknown $Y$ from $X$ equals the expected conditional variance, $MSE = E[var(Y|X)]$:
$$var(Y|X) = E[(Y - E(Y|X))^2|X] = E(Y^2|X) - E(Y|X)^2 \quad \text{by definition}$$
$$\Leftrightarrow E[var(Y|X)] = E[E[(Y - E(Y|X))^2|X]] = E[(Y - E(Y|X))^2] = E[(Y - \hat{Y}_M)^2],$$
which is the MSE of the estimator. These results are applied next to the convolution (sum) of two independent normals and to the bivariate normal.
MSE for convolution of two normally distributed random variables
For $X \sim N(\mu_X, \sigma_X^2)$ independent of $W \sim N(\mu_W, \sigma_W^2)$, let $Y = X + W$, so that $\mu_Y = \mu_X + \mu_W$ and $\sigma_Y^2 = \sigma_X^2 + \sigma_W^2$. The goal is to derive $\hat{X}_M = E(X|Y)$ and $E[(X - \hat{X}_M)^2]$, and to verify $E(X^2) = E(\hat{X}_M^2) + E(\tilde{X}^2)$.
Now, $cov(X, Y) = cov(X, X + W) = var(X) + cov(X, W) = \sigma_X^2$ by independence, and $\rho_{X,Y} = cov(X, Y)(\sigma_X\sigma_Y)^{-1} = \sigma_X\sigma_Y^{-1}$. Then the MMSE estimate of $X$ given $Y$ is
$$E(X|Y) = \hat{X}_M = \mu_X + \rho\,\sigma_X\sigma_Y^{-1}(Y - \mu_Y) = \mu_X + \sigma_X^2(Y - \mu_Y)\,\sigma_Y^{-2},$$
which is analogous to $E(\mathbf{x}^{(1)}|\mathbf{x}^{(2)}) = \boldsymbol\mu^{(1)} + \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}(\mathbf{x}^{(2)} - \boldsymbol\mu^{(2)})$.
Also, the MSE of $\hat{X}_M$ is $E(\tilde{X}^2) = E[(X - \hat{X}_M)^2] = \sigma_X^2(1 - \rho^2) = \sigma_X^2\sigma_W^2/\sigma_Y^2$, obtained by substituting the derived estimate. Since $E(X^2) = \sigma_X^2 + E(X)^2$, the identity $E(X^2) = E(\hat{X}_M^2) + E(\tilde{X}^2)$ can be verified by substitution from the above equation.
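The closed forms for this example can be checked numerically. The sketch below assumes the parameter values $\mu_X = 1$, $\sigma_X^2 = 4$, $\mu_W = 0$, $\sigma_W^2 = 1$ and a hypothetical helper name; it evaluates the gain $\sigma_X^2/\sigma_Y^2$, the MSE, and both sides of $E(X^2) = E(\hat{X}_M^2) + E(\tilde{X}^2)$.

```python
def mmse_sum_of_normals(mu_x, var_x, mu_w, var_w):
    """MMSE estimator of X from Y = X + W for independent normals X and W."""
    var_y = var_x + var_w
    mu_y = mu_x + mu_w
    gain = var_x / var_y              # cov(X, Y) / var(Y)
    mse = var_x * var_w / var_y       # E[(X - X_hat)^2]

    def estimate(y):
        return mu_x + gain * (y - mu_y)

    return estimate, gain, mse

mu_x, var_x, mu_w, var_w = 1.0, 4.0, 0.0, 1.0   # assumed values
estimate, gain, mse = mmse_sum_of_normals(mu_x, var_x, mu_w, var_w)

second_moment_x = var_x + mu_x**2
# E(X_hat^2) = var(X_hat) + E(X_hat)^2, with var(X_hat) = gain^2 * var(Y).
second_moment_est = gain**2 * (var_x + var_w) + mu_x**2
print(estimate(3.0), mse, second_moment_x, second_moment_est + mse)
```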
3.1.3 Laplace's law of succession
The rule of succession [12] states that when a Bernoulli trial is repeated $n$ times independently with $s$ successes, and $X_1, ..., X_{n+1}$ are conditionally independent random variables, then
$$P(X_{n+1} = 1\,|\,X_1 + \cdots + X_n = s) = \frac{s+1}{n+2}.$$
[12] The rule applies when there are few observations, or for events that have not been observed to occur at all in (finite) sample data; it gives the probability that the next repetition succeeds.
Before any success or failure is observed, let $\theta \sim U[0, 1]$ describe the uncertainty about the success probability as a prior probability measure. Let $X_i \in \{0, 1\}$ denote the outcome of the $i$th trial and $x_i$ the data actually observed. Now, the likelihood function for $\theta$ is
$$L(\theta) = P(X_1 = x_1, ..., X_n = x_n|\theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^s(1-\theta)^{n-s},$$
where $s = \sum_{i=1}^{n} x_i$ is the number of successes in $n$ trials.
The goal is to derive the posterior distribution
$$f(\theta|X_1 = x_1, ..., X_n = x_n) = \frac{L(\theta)\,f(\theta)}{\int_0^1 L(\tilde\theta)\,f(\tilde\theta)\,d\tilde\theta} = \frac{\theta^s(1-\theta)^{n-s}}{\int_0^1 \tilde\theta^s(1-\tilde\theta)^{n-s}\,d\tilde\theta},$$
where the fact that the Beta PDF integrates to 1 yields
$$\int_0^1 \tilde\theta^s(1-\tilde\theta)^{n-s}\,d\tilde\theta = \frac{\Gamma(s+1)\Gamma(n-s+1)}{\Gamma(n+2)} = B(\alpha, \beta), \quad \alpha = s+1,\ \beta = n-s+1$$
$$\Leftrightarrow f(\theta|X_1 = x_1, ..., X_n = x_n) = \frac{(n+1)!}{s!\,(n-s)!}\,\theta^s(1-\theta)^{n-s}.$$
This is the $Beta(s+1, n-s+1)$ distribution, with expected value
$$E(\theta|X_1 = x_1, ..., X_n = x_n) = \int_0^1 \theta\,f(\theta|X_1 = x_1, ..., X_n = x_n)\,d\theta = \frac{s+1}{n+2}.$$
Since, conditional on $\theta$, the probability of success is $\theta$, the law of total probability gives the probability of success on the next trial as $E(\theta|X_1 = x_1, ..., X_n = x_n)$, which is the rule of succession.
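The rule can be written as a one-line function and cross-checked against the ratio of Beta integrals from which it was derived. This is a small sketch with hypothetical function names, using exact fractions.

```python
from fractions import Fraction
from math import factorial

def rule_of_succession(s, n):
    """P(next trial succeeds | s successes in n trials) under a U[0, 1] prior."""
    if not 0 <= s <= n:
        raise ValueError("need 0 <= s <= n")
    return Fraction(s + 1, n + 2)

def beta_integral(a, b):
    """Exact integral of theta^a (1 - theta)^b over [0, 1] for integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def posterior_mean_by_integration(s, n):
    """E(theta | data) as a ratio of two Beta integrals."""
    return beta_integral(s + 1, n - s) / beta_integral(s, n - s)

print(rule_of_succession(0, 0), rule_of_succession(9, 10), rule_of_succession(0, 10))
```

Even with no successes in ten trials the estimate stays strictly positive, which is the point of the rule for events not yet observed.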
For the cases $s = 0$ or $s = n$, the hypergeometric distribution $Hyp(s|N, n, \Theta)$ is used, where $\Theta$ is the total number of successes in a population of total size $N$. As $N, \Theta \to \infty$, the ratio $\theta = \Theta/N$ is held fixed. The prior density proportional to $\frac{1}{\theta(1-\theta)}$ is then roughly equivalent to a prior proportional to $\frac{1}{\Theta(N-\Theta)}$ with $1 \le \Theta \le N - 1$. Then the posterior for $\Theta$ is
$$P(\Theta|N, n, s) \propto \frac{1}{\Theta(N-\Theta)}\binom{\Theta}{s}\binom{N-\Theta}{n-s}.$$
For the multinomial distribution the conjugate prior is the Dirichlet distribution, so the posterior distribution is again Dirichlet [13].
3.1.4 Bayesian Hypothesis testing
For two hypotheses $H_0$ and $H_1$ (the alternative, also written $H_a$), let $P(H_0) = p_0$ and $P(H_1) = p_1$ with $p_0 + p_1 = 1$. Also, for a random variable $X$, the distribution of $X$ under each hypothesis is $f_X(x|H_0)$ and $f_X(x|H_1)$. By Bayes' rule, the posterior probabilities of $H_0$ and $H_1$ are obtained:
$$P(H_0|X = x) = \frac{f_X(x|H_0)\,P(H_0)}{f_X(x)}, \qquad P(H_1|X = x) = \frac{f_X(x|H_1)\,P(H_1)}{f_X(x)}.$$
[13] The joint posterior distribution of $p_1, ..., p_m$ is $f(p_1, ..., p_m|n_1, ..., n_m, I) = \frac{\Gamma\left(\sum_{i=1}^{m}(n_i+1)\right)}{\prod_{i=1}^{m}\Gamma(n_i+1)}\,p_1^{n_1}\cdots p_m^{n_m}$, with $\sum_{i=1}^{m} p_i = 1$.
The comparison between the posteriors $P(H_0|X = x)$ and $P(H_1|X = x)$ can be used to decide between $H_0$ and $H_1$, by taking the hypothesis with the higher probability.
Maximum A Posteriori (MAP) test
Choosing the hypothesis with the higher posterior probability is referred to as the MAP test. $H_0$ is chosen if and only if
$$P(H_0|X = x) \ge P(H_1|X = x) \Leftrightarrow f_X(x|H_0)\,P(H_0) \ge f_X(x|H_1)\,P(H_1).$$
Note that the MAP test generalizes to more than 2 hypotheses by taking the hypothesis with the highest posterior probability $P(H_i|X = x)$, equivalently the largest $f_X(x|H_i)\,P(H_i)$.
Then the average error probability for hypothesis testing is written as
$$P_e = P(\text{choose } H_1|H_0)\,P(H_0) + P(\text{choose } H_0|H_1)\,P(H_1),$$
where the MAP test achieves the minimum possible average error probability.
For either a continuous posterior $f_{X|Y}(x|y)$ or a discrete posterior $P_{X|Y}(x|y)$, the maximum a posteriori (MAP) estimate $\hat{x}_{MAP}$ of $X$ can be obtained by maximizing $f_{Y|X}(y|x)\,f_X(x)$ over $x$, since $f_Y(y)$ does not depend on $x$.
Minimum Cost hypothesis test
In testing two hypotheses there are two types of error: choosing $H_1$ while $H_0$ is true, or choosing $H_0$ while $H_1$ is true. Let the costs of these two errors be $C_{10}$ and $C_{01}$ respectively. Then the average cost is
$$C = C_{10}\,P(\text{choose } H_1|H_0)\,P(H_0) + C_{01}\,P(\text{choose } H_0|H_1)\,P(H_1)$$
$$= P(\text{choose } H_1|H_0)\,[P(H_0)\,C_{10}] + P(\text{choose } H_0|H_1)\,[P(H_1)\,C_{01}].$$
Since $P(H_i|X = x) = \frac{f_X(x|H_i)\,P(H_i)}{f_X(x)}$, $H_0$ is chosen if and only if
$$f_X(x|H_0)\,P(H_0)\,C_{10} \ge f_X(x|H_1)\,P(H_1)\,C_{01} \Leftrightarrow \frac{f_X(x|H_0)}{f_X(x|H_1)} \ge \frac{P(H_1)\,C_{01}}{P(H_0)\,C_{10}}$$
$$\Leftrightarrow P(H_0|x)\,C_{10} \ge P(H_1|x)\,C_{01},$$
which is the decision rule. Hence the posterior risk of accepting $H_1$ is $P(H_0|x)\,C_{10}$ [14]. The minimum cost test therefore accepts the hypothesis with the lowest posterior risk.
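The decision rule above can be sketched as a small function; with equal costs it reduces to the MAP test. The two normal models for $H_0$ and $H_1$, the prior $P(H_0) = 0.5$ and the cost values are assumptions chosen for illustration, and `choose_hypothesis` is a hypothetical name.

```python
from statistics import NormalDist

def choose_hypothesis(x, lik0, lik1, p0, c10=1.0, c01=1.0):
    """Return 0 if H0 is chosen and 1 otherwise.

    lik0, lik1: likelihood functions f(x | H0), f(x | H1)
    p0:         prior probability of H0 (P(H1) = 1 - p0)
    c10:        cost of choosing H1 when H0 is true
    c01:        cost of choosing H0 when H1 is true
    """
    p1 = 1.0 - p0
    return 0 if lik0(x) * p0 * c10 >= lik1(x) * p1 * c01 else 1

# Assumed models: X ~ N(0, 1) under H0 and X ~ N(2, 1) under H1.
h0 = NormalDist(mu=0.0, sigma=1.0)
h1 = NormalDist(mu=2.0, sigma=1.0)

map_choice = choose_hypothesis(1.2, h0.pdf, h1.pdf, p0=0.5)            # equal costs: MAP test
plain_choice = choose_hypothesis(0.8, h0.pdf, h1.pdf, p0=0.5)
costly_miss = choose_hypothesis(0.8, h0.pdf, h1.pdf, p0=0.5, c01=10.0)  # missing H1 is expensive
print(map_choice, plain_choice, costly_miss)
```

Raising `c01` moves the decision boundary toward $H_0$'s mean, so observations that the MAP test would assign to $H_0$ can be assigned to $H_1$ instead.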
[14] Equivalently, the posterior risk of accepting $H_0$ is $P(H_1|x)\,C_{01}$.
Decision rule for cost in hypothesis testing
In the two-hypothesis case with $H_0$ and $H_1$, let $C_{ij}$ denote the cost of accepting $H_i$ when $H_j$ is true [15]. Since the cost of a correct decision is less than the cost of an incorrect one, that is $C_{ii} < C_{ji}$ for $i \ne j$, $i, j \in \{0, 1\}$, the average cost is $C = \sum_{i,j \in \{0,1\}} C_{ij}\,P(\text{choose } H_i|H_j)\,P(H_j)$ [16], and the goal is to find the decision rule that minimizes it.
First, choosing the correct hypothesis is the complement of choosing the wrong hypothesis, so that
$$P(\text{choose } H_0|H_0) = 1 - P(\text{choose } H_1|H_0), \qquad P(\text{choose } H_1|H_1) = 1 - P(\text{choose } H_0|H_1).$$
Hence,
$$C = C_{00}[1 - P(\text{choose } H_1|H_0)]\,P(H_0) + C_{01}\,P(\text{choose } H_0|H_1)\,P(H_1) + C_{10}\,P(\text{choose } H_1|H_0)\,P(H_0) + C_{11}[1 - P(\text{choose } H_0|H_1)]\,P(H_1)$$
$$= (C_{10} - C_{00})\,P(\text{choose } H_1|H_0)\,P(H_0) + (C_{01} - C_{11})\,P(\text{choose } H_0|H_1)\,P(H_1) + C_{00}\,P(H_0) + C_{11}\,P(H_1),$$
where $C_{00}\,P(H_0) + C_{11}\,P(H_1)$ is constant. To minimize $C$ it therefore suffices to minimize
$$D = P(\text{choose } H_1|H_0)\,P(H_0)(C_{10} - C_{00}) + P(\text{choose } H_0|H_1)\,P(H_1)(C_{01} - C_{11}).$$
Applying the previous inequality with these cost differences, $H_0$ is chosen if and only if
$$f_X(x|H_0)\,P(H_0)(C_{10} - C_{00}) \ge f_X(x|H_1)\,P(H_1)(C_{01} - C_{11})$$
$$\Leftrightarrow P(H_0|x)(C_{10} - C_{00}) \ge P(H_1|x)(C_{01} - C_{11}).$$
[15] This adds two more cases: $C_{00}$, the cost of choosing $H_0$ given that $H_0$ is true, and $C_{11}$, the cost of choosing $H_1$ given that $H_1$ is true.
[16] That is, $C_{00}\,P(\text{choose } H_0|H_0)\,P(H_0) + C_{01}\,P(\text{choose } H_0|H_1)\,P(H_1) + C_{10}\,P(\text{choose } H_1|H_0)\,P(H_0) + C_{11}\,P(\text{choose } H_1|H_1)\,P(H_1)$.
3.1.5 Bayesian Interval Estimation
Given the posterior density $f_{X_1|X_2}(x_1|x_2)$ of an unobserved random variable $X_1$ given the observed $X_2 = x_2$, a $(1 - \alpha)100\%$ credible interval $[a, b]$ for $X_1$ satisfies
$$P(a \le X_1 \le b\,|\,X_2 = x_2) = 1 - \alpha.$$
Bivariate normal example
For jointly normal $X_1 \sim N(0, 1)$ and $X_2 \sim N(1, 4)$ with $\rho(X_1, X_2) = \frac{1}{2}$, the goal is to derive a 95% credible interval for $X_1$, given that $X_2 = 2$ is observed. Analogous to $E(\mathbf{x}^{(1)}|\mathbf{x}^{(2)}) = \boldsymbol\mu^{(1)} + \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}(\mathbf{x}^{(2)} - \boldsymbol\mu^{(2)})$,
$$E(X_1|X_2 = x_2) = \mu_{X_1} + \rho\,\sigma_{X_1}\frac{x_2 - \mu_{X_2}}{\sigma_{X_2}},$$
where $\rho_{X_1,X_2}\,\sigma_{X_1}\sigma_{X_2} = \Sigma_{12} = cov(X_1, X_2)$ (and $\sigma_{X_1}^2 = \Sigma_{11}$), equivalently. Similar to $var(\mathbf{x}^{(1)}|\mathbf{x}^{(2)}) = \boldsymbol\Sigma_{11} - \boldsymbol\Sigma_{12}\boldsymbol\Sigma_{22}^{-1}\boldsymbol\Sigma_{21}$,
$$var(X_1|X_2 = x_2) = \sigma_{X_1}^2 - \rho^2\sigma_{X_1}^2.$$
Hence $X_1|X_2 = 2$ is normally distributed with mean $E(X_1|X_2 = 2) = 0 + \frac{1}{2}\left(\frac{2-1}{2}\right) = \frac{1}{4}$ and variance $var(X_1|X_2 = 2) = 1 - \frac{1}{4} = \frac{3}{4}$. For $\alpha = 0.05$, the interval satisfying $P(a \le X_1 \le b\,|\,X_2 = 2) = 0.95$ is centered around $E(X_1|X_2 = 2) = \frac{1}{4}$, with the form $[\frac{1}{4} - c, \frac{1}{4} + c]$:
$$P\left(\tfrac{1}{4} - c \le X_1 \le \tfrac{1}{4} + c\,\Big|\,X_2 = 2\right) = \Phi\left(\frac{c}{\sqrt{3/4}}\right) - \Phi\left(\frac{-c}{\sqrt{3/4}}\right) = 2\Phi\left(\frac{c}{\sqrt{3/4}}\right) - 1 = 0.95$$
$$\Leftrightarrow c = \sqrt{\tfrac{3}{4}}\;\Phi^{-1}\left(\frac{1.95}{2}\right) \approx 1.7.$$
Thus, the 95% credible interval for $X_1$ is
$$\left[\tfrac{1}{4} - c,\ \tfrac{1}{4} + c\right] \approx [-1.45,\ 1.95].$$
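The numbers in this example can be reproduced with the standard library. The sketch below assumes the parameters stated above ($\mu_{X_1} = 0$, $\sigma_{X_1} = 1$, $\mu_{X_2} = 1$, $\sigma_{X_2} = 2$, $\rho = 1/2$, $x_2 = 2$); the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def conditional_normal_interval(mu1, sd1, mu2, sd2, rho, x2, level=0.95):
    """Equal-tailed credible interval for X1 given X2 = x2 (bivariate normal)."""
    mean = mu1 + rho * sd1 * (x2 - mu2) / sd2
    var = (1 - rho**2) * sd1**2
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. Phi^{-1}(0.975) for 95%
    half_width = z * sqrt(var)
    return mean - half_width, mean + half_width

lo, hi = conditional_normal_interval(0.0, 1.0, 1.0, 2.0, 0.5, 2.0)

# Check the posterior mass of the interval under N(1/4, 3/4).
posterior = NormalDist(mu=0.25, sigma=sqrt(0.75))
coverage = posterior.cdf(hi) - posterior.cdf(lo)
print(round(lo, 2), round(hi, 2), round(coverage, 4))
```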
3.2 Prior
The prior distribution of an uncertain quantity expresses one's beliefs about this quantity before some evidence is taken into account. It is an unconditional (pre-data) distribution, and its chosen parameters are the hyperparameters. The prior for a parameter $\theta$ is denoted $\pi(\theta)$; examples include conjugate priors such as the binomial/beta and multinomial/Dirichlet families.
When prior ∝ constant, the Bernoulli example seen earlier is a representative form of the non-informative prior: $p(\theta) = 1$ leads to $\theta|s \sim Beta(s+1, n-s+1)$.
Bayesian Procedure
1. Choose a probability density $p(\theta) = \pi(\theta)$ that expresses our beliefs about a parameter $\theta$ before seeing any data.
2. Choose a statistical model $p(x|\theta)$ that reflects our beliefs about $x$ given $\theta$.
3. After observing data $X = \{x_1, ..., x_n\}$, update beliefs by computing the posterior distribution $p(\theta|X)$.
3.2.1 Conjugate Prior
Simply put, if the prior $\pi(\theta)$ and the posterior $\pi(\theta|x)$ have the same form, then the prior and posterior are conjugate distributions, and the prior is a conjugate prior for the likelihood function $p(x|\theta)$.
For a class of sampling distributions $p(x|\theta)$, a class (family) of prior distributions $\pi(\theta)$ is conjugate for the class $p(x|\theta)$ if
$$\pi(\theta|x) = \frac{p(x|\theta)\,\pi(\theta)}{\int_\Theta p(x|\theta)\,\pi(\theta)\,d\theta}$$
is in the class of $\pi(\theta)$ for all $p(\cdot|\theta)$ in the class $p(x|\theta)$ and all $\pi(\cdot)$ in the class of $\pi(\theta)$. Hence the prior family is conjugate to the family of sampling distributions when every resulting posterior stays in the prior family.
Natural conjugate priors of this kind arise for exponential family sampling distributions. A sampling distribution $p(x|\theta)$ in the exponential family has the general form
$$p(x_i|\theta) = f(x_i)\,g(\theta)\,e^{\phi(\theta)^T u(x_i)},$$
where $\phi(\theta)$ and $u(x_i)$ are vectors and $\phi(\theta)$ is the natural parameter. For $\mathbf{x} = (x_1, ..., x_n)$,
$$p(\mathbf{x}|\theta) \propto g(\theta)^n \exp\Big(\phi(\theta)^T\sum_{i=1}^{n} u(x_i)\Big),$$
where $\sum_{i=1}^{n} u(x_i)$ is the sufficient statistic for $\theta$, as the likelihood for $\theta$ depends on the data $\mathbf{x}$ only through it. Hence the likelihood for $\mathbf{x}$ is
$$p(\mathbf{x}|\theta) = \Big[\prod_{i=1}^{n} f(x_i)\Big]\,g(\theta)^n \exp\Big(\phi(\theta)^T\sum_{i=1}^{n} u(x_i)\Big).$$
If the prior density is specified as
$$p(\theta) \propto g(\theta)^\eta \exp(\phi(\theta)^T\nu),$$
then the posterior density is
$$p(\theta|\mathbf{x}) \propto g(\theta)^{\eta+n}\exp\Big(\phi(\theta)^T\Big(\nu + \sum_{i=1}^{n} u(x_i)\Big)\Big),$$
as the prior density is conjugate.
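As a quick check of this general form, the Bernoulli likelihood of Section 3.0.1 can be written as an exponential family. The following worked sketch is not part of the original derivation: the identification of $f$, $g$, $\phi$ and $u$ is an assumption about how the general symbols map onto this case, and $\eta$, $\nu$ denote the two prior hyperparameters.

```latex
\begin{align*}
p(x_i\mid\theta) &= \theta^{x_i}(1-\theta)^{1-x_i}
  = \underbrace{1}_{f(x_i)}\;\underbrace{(1-\theta)}_{g(\theta)}\;
    \exp\Big(\underbrace{\log\tfrac{\theta}{1-\theta}}_{\phi(\theta)}\;\underbrace{x_i}_{u(x_i)}\Big),\\
p(\theta) &\propto (1-\theta)^{\eta}\exp\Big(\nu\log\tfrac{\theta}{1-\theta}\Big)
  = \theta^{\nu}(1-\theta)^{\eta-\nu},\\
p(\theta\mid \mathbf{x}) &\propto (1-\theta)^{\eta+n}\exp\Big((\nu+s)\log\tfrac{\theta}{1-\theta}\Big)
  = \theta^{\nu+s}(1-\theta)^{(\eta+n)-(\nu+s)},
  \qquad s=\sum_{i=1}^{n}x_i .
\end{align*}
```

Under this mapping the prior is $Beta(\nu+1, \eta-\nu+1)$ and the posterior is $Beta(\nu+s+1, \eta+n-\nu-s+1)$; taking $\nu = \eta = 0$ gives the uniform prior and the posterior $Beta(s+1, n-s+1)$ obtained earlier.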
List of Conjugate Models

Likelihood | Prior | Posterior
Binomial | Beta | Beta
Negative Binomial | Beta | Beta
Poisson | Gamma | Gamma
Geometric | Beta | Beta
Exponential | Gamma | Gamma
Normal (mean unknown) | Normal | Normal
Normal (variance unknown) | Inverse Gamma | Inverse Gamma
Normal (mean, variance unknown) | Normal / Gamma | Normal / Gamma
3.2.2 Univariate Normal distribution Conjugate Prior with known variance
In the previous example the prior had the form $Beta(\theta|\alpha, \beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. In the univariate normal case, each of the multiple observations $x_1, ..., x_n$ of $X$ has density
$$f(x|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right), \quad X \sim N(\mu, \sigma^2),$$
which is part of
$$f(\mu|X) \propto f(X|\mu)\,f(\mu).$$
Since the variance $\sigma^2$ is known, the joint prior reduces to a prior $f(\mu)$ on $\mu$ alone, in contrast to the variance-unknown case [17]. The goal is ultimately to update the unknown quantity $\mu$, that is, to find $f(\mu|X)$. First, by definition the likelihood function for the current data of multiple observations is
$$f(X|\mu) \propto L(\mu|X) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).$$
On the other hand, the prior is parametrized by the known hyperparameters $\mu_0$ and $\tau_0^2$, where $\mu \sim N(\mu_0, \tau_0^2)$,
$$f(\mu) \propto \exp\left(-\frac{1}{2\tau_0^2}(\mu-\mu_0)^2\right), \quad \text{as } f(\mu) = \frac{1}{\sqrt{2\pi\tau_0^2}}\exp\left(-\frac{1}{2\tau_0^2}(\mu-\mu_0)^2\right),$$
where $\tau_0$ [18] is the prior standard deviation ($1/\tau_0^2$ is the prior precision), which controls how much the mean may vary; it differs from $\sigma$, the standard deviation of the individual observations [19]. Note that $\mu_0$ is the prior mean and $\tau_0$ reflects the variation of $\mu$ around $\mu_0$.
With $X = \{x_1, ..., x_n\}$,
$$f(X|\mu) = f(x_1, ..., x_n|\mu) = \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right).$$
Hence the posterior is the prior times the likelihood, which yields
$$f(\mu|X) \propto \exp\left(-\frac{1}{2}\left[\frac{(\mu-\mu_0)^2}{\tau_0^2} + \frac{\sum_{i=1}^{n}(x_i-\mu)^2}{\sigma^2}\right]\right).$$
Ignoring constant terms, the posterior is expressed as
$$f(\mu|X) \propto \exp\left(-\frac{1}{2}\left[\frac{\sum_{i=1}^{n}x_i^2 - 2n\mu\bar{x} + n\mu^2}{\sigma^2} + \frac{(\mu-\mu_0)^2}{\tau_0^2}\right]\right),$$
[17] The variance-unknown case is $f(\mu, \sigma^2|X) \propto f(X|\mu, \sigma^2)\,f(\mu, \sigma^2)$.
[18] This quantity reflects how much the mean $\mu$ varies, and does not directly reflect the sampling variability of the individual observations $x_i$.
[19] For example, a class of 30 students has mean grade $\bar{x} = 75$ with sd $\sigma = 10$, but over several semesters the overall mean is $\mu_0 = 75$ and the sd of the class means is $\tau_0 = 5$.
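and dropping any terms that do not include $\mu$, then collecting the terms in $\mu^2$ and $\mu$, the posterior becomes
$$f(\mu|X) \propto \exp\left(-\frac{1}{2}\cdot\frac{n\tau_0^2\mu^2 - 2n\tau_0^2\bar{x}\mu + \sigma^2\mu^2 - 2\sigma^2\mu_0\mu}{\sigma^2\tau_0^2}\right) = \exp\left(-\frac{1}{2}\cdot\frac{(n\tau_0^2+\sigma^2)\mu^2 - 2(\sigma^2\mu_0 + n\tau_0^2\bar{x})\mu}{\sigma^2\tau_0^2}\right)$$
$$= \exp\left(-\frac{1}{2}\cdot\frac{\mu^2 - 2\mu(\sigma^2\mu_0 + n\tau_0^2\bar{x})/(n\tau_0^2+\sigma^2)}{(\sigma^2\tau_0^2)/(n\tau_0^2+\sigma^2)}\right),$$
where the last step divides numerator and denominator by $(n\tau_0^2 + \sigma^2)$. Completing the square and dropping any constant simplifies this to
$$f(\mu|X) \propto \exp\left(-\frac{1}{2}\cdot\frac{[\mu - (\sigma^2\mu_0 + n\tau_0^2\bar{x})/(n\tau_0^2+\sigma^2)]^2}{(\sigma^2\tau_0^2)/(n\tau_0^2+\sigma^2)}\right).$$
Therefore, $\mu|X \sim N(\mu_1, \tau_1^2)$, where
$$\mu_1 = \frac{\sigma^2\mu_0 + n\tau_0^2\bar{x}}{n\tau_0^2+\sigma^2}, \qquad \tau_1^2 = \frac{\sigma^2\tau_0^2}{n\tau_0^2+\sigma^2}.$$
The posterior mean $\mu_1$ is expressed as a weighted average [20] of the prior mean $\mu_0$ and the observed sample mean $\bar{x}$, with weights proportional to the precisions $1/\tau_0^2$ and $n/\sigma^2$ [21].
In case $\tau_0^2 = \sigma^2$ [22], the prior mean receives a weight of only $1/(n+1)$ in the posterior [23].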
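The update formulas for $\mu_1$ and $\tau_1^2$ translate directly into code. The sketch below uses the prior of the class-grade illustration ($\mu_0 = 75$, $\tau_0 = 5$, $\sigma = 10$) together with four made-up observations; the data and the function name are assumptions for illustration.

```python
def normal_posterior(mu0, tau0_sq, sigma_sq, xs):
    """Posterior mean and variance of mu for a N(mu0, tau0_sq) prior
    and observations xs ~ N(mu, sigma_sq) with sigma_sq known."""
    n = len(xs)
    xbar = sum(xs) / n
    denom = n * tau0_sq + sigma_sq
    mu1 = (sigma_sq * mu0 + n * tau0_sq * xbar) / denom
    tau1_sq = sigma_sq * tau0_sq / denom
    return mu1, tau1_sq

xs = [80.0, 70.0, 90.0, 84.0]   # assumed data, sample mean 81
mu1, tau1_sq = normal_posterior(mu0=75.0, tau0_sq=25.0, sigma_sq=100.0, xs=xs)

# Precisions add: 1/tau1^2 = 1/tau0^2 + n/sigma^2.
posterior_precision = 1.0 / 25.0 + len(xs) / 100.0
print(mu1, tau1_sq, 1.0 / posterior_precision)
```

Here the prior precision $1/25$ equals the data precision $4/100$, so the posterior mean lands halfway between the prior mean 75 and the sample mean 81.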
Posterior predictive distribution
For a future observation $\tilde{x}$, the posterior predictive distribution $p(\tilde{x}\mid X)$ is
$$p(\tilde{x}\mid X) = \int_{\Theta} p(\tilde{x}\mid\mu)\,p(\mu\mid X)\,d\mu, \qquad p(\mu\mid X) \propto \exp\left\{-\frac{1}{2\tau_1^2}(\mu-\mu_1)^2\right\},$$
such that
$$p(\tilde{x}\mid X) \propto \int_{\Theta}\exp\left\{-\frac{1}{2\sigma^2}(\tilde{x}-\mu)^2\right\}\exp\left\{-\frac{1}{2\tau_1^2}(\mu-\mu_1)^2\right\}d\mu.$$
Notice that the joint distribution of $(\tilde{x}, \mu)$ given $X$ is a bivariate normal distribution, so the marginal posterior distribution of $\tilde{x}$ is normal, with $E(\tilde{x}\mid\mu) = \mu$ and $var(\tilde{x}\mid\mu) = \sigma^2$.
By the laws of total expectation and total variance,
$$E(\tilde{x}\mid X) = E(E(\tilde{x}\mid\mu, X)\mid X) = E(\mu\mid X) = \mu_1, \text{ and}$$
$$var(\tilde{x}\mid X) = E(var(\tilde{x}\mid\mu, X)\mid X) + var(E(\tilde{x}\mid\mu, X)\mid X) = E(\sigma^2\mid X) + var(\mu\mid X) = \sigma^2 + \tau_1^2.$$
Thus, the posterior predictive distribution of $\tilde{x}$ has mean equal to the posterior mean of $\mu$. The predictive variance, however, has two components: the sampling variance $\sigma^2$ and the additional variance $\tau_1^2$ due to posterior uncertainty in $\mu$.
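The two-stage structure of the predictive distribution can be checked by simulation. The sketch below draws $\mu$ from its posterior and then $\tilde{x}$ given $\mu$; the function name `simulate_predictive`, the seed and the numerical inputs are assumptions made for illustration only.

```python
import random


def simulate_predictive(mu1, tau1_sq, sigma2, draws=200_000, seed=0):
    """Monte Carlo draws from the posterior predictive of a new observation."""
    rng = random.Random(seed)
    xs = []
    for _ in range(draws):
        mu = rng.gauss(mu1, tau1_sq ** 0.5)      # mu | X ~ N(mu1, tau1_sq)
        xs.append(rng.gauss(mu, sigma2 ** 0.5))  # x_new | mu ~ N(mu, sigma2)
    mean = sum(xs) / draws
    var = sum((x - mean) ** 2 for x in xs) / (draws - 1)
    return mean, var


# Hypothetical posterior N(79.4, 2.94) and known sampling variance 100.
mean, var = simulate_predictive(mu1=79.4, tau1_sq=2.94, sigma2=100.0)
print(mean, var)  # expected to be close to mu1 and to sigma2 + tau1_sq
```

The simulated variance should exceed $\sigma^2$ by roughly $\tau_1^2$, which is the extra spread coming from not knowing $\mu$ exactly.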
20 If the sample variance is large, then the prior mean has considerable weight in the posterior. If the prior variance is large, the sample mean has considerable weight in the posterior.
21 The posterior precision is $1/\tau_1^2$ and the prior precision is $1/\tau_0^2$.
22 That is, the sd of each observation is the same as the sd of the distribution of the mean.
23 The posterior mean then reduces to $(\mu_0 + n\bar{x})/(n+1)$.
Univariate Normal distribution Conjugate Prior with unknown variance
In most cases, $\sigma^2$ is unknown. Note that
$$p(\mu,\sigma^2\mid X) \propto p(X\mid\mu,\sigma^2)\,p(\mu,\sigma^2).$$
Hence, a joint prior for both $\mu$ and $\sigma^2$ must be specified, which includes the prior of $\mu$. If the two parameters $\mu$ and $\sigma^2$ are assumed to be independent, then $p(\mu,\sigma^2) = p(\mu)\,p(\sigma^2)$, so that a separate prior can be established for each parameter.
Previously, $\mu \sim N(\mu_0, \tau_0^2)$, where $\mu_0$ is the measure of belief about $\mu$; the easiest prior for $\sigma^2$ is the non-informative prior, which is discussed in the next section.
3.2.3
Non-informative Prior
When determining the prior, if there is no prior information about $\theta$, a non-informative prior is chosen so that it has minimal influence on the inference.
However, the uniform prior cannot simply be used as the non-informative prior, because it is not invariant under reparametrization: a uniform prior on $\sigma$ does not correspond to a uniform prior on $1/\sigma$ 24. As a representation of ignorance, the non-informative prior is therefore not unique, and any reasonable choice is sufficient. On the other hand, invariant non-informative priors can be constructed for location parameters and for scale parameters.
For a location parameter, the random variable $X$ has density $f(x-\theta)$ with location parameter $\theta$. As $Y = X + c$ has density $f(y-\eta)$ with $\eta = \theta + c$, $X$ and $Y$ have the same distribution and differ only in the parameter. Hence, the prior distribution should be location invariant: $p(\theta) = p(\theta - c)$, which gives $p(\theta) \propto 1$.
For a scale parameter, $X \sim \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right)$ with scale parameter $\sigma$. The prior is scale invariant if, for every $c > 0$, $p(\sigma) = \frac{1}{c}\,p\left(\frac{\sigma}{c}\right)$; the invariant non-informative prior for the scale parameter, $p(\sigma) \propto \sigma^{-1}$, satisfies this equation.
3.2.4
Univariate Normal distribution Conjugate Prior with unknown
variance
Continuing, priors for both $\mu$ and $\sigma^2$ need to be specified. Previously, $p(\mu,\sigma^2) = p(\mu)\,p(\sigma^2)$ was separated into one prior per parameter by independence. Note that the full probability model for $\mu$ and $\sigma^2$ is
$$p(\mu,\sigma^2\mid x) \propto p(x\mid\mu,\sigma^2)\,p(\mu,\sigma^2).$$
24 For a one-to-one transformation $\phi = h(\theta)$ with $p(\theta) = 1$, the change of variables gives $p(\phi) = \left|\frac{d}{d\phi}h^{-1}(\phi)\right|$.
For $\tau^2$, which measures the uncertainty about $\mu$, $\sigma^2$ will be used to update the knowledge of $\tau$ 25, so a prior for $\sigma^2$ must be specified.
Now, to develop non-informative priors, the first approach is to assign uniform priors to $\mu$ and $\log(\sigma^2)$, because $\sigma^2 > 0$ while $\log(\sigma^2) \in \mathbb{R}$. To transform the prior on $\log(\sigma^2)$ into a density for $\sigma^2$, note that by the definition of a non-informative prior $p(\log(\sigma^2)) \propto$ constant, so that by the Jacobian $p(\sigma^2) \propto \left|\frac{d\log(\sigma^2)}{d\sigma^2}\right| \times$ constant $= (1/\sigma^2) \times$ constant. The joint prior is therefore $p(\mu,\sigma^2) \propto 1/\sigma^2$.
Without relying on $\log(\sigma^2) \in \mathbb{R}$, the other approach is to choose values $\mu_0$ and $\tau^2$ for $\mu \sim N(\mu_0, \tau^2)$ together with a relatively non-informative prior for $\sigma^2$. Taking $1/\sigma^2 \sim Gamma(\alpha, \beta)$, that is, $\sigma^2 \sim$ Inverse Gamma (IG) 26 $(\alpha, \beta)$, the density function is
$$p(\sigma^2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}\exp(-\beta/\sigma^2), \qquad \sigma^2 > 0.$$
Hence,
$$p(\sigma^2\mid\alpha,\beta) \propto (\sigma^2)^{-(\alpha+1)}\exp(-\beta/\sigma^2).$$
Notice that the limiting case $\alpha, \beta \to 0$, written $\sigma^2 \sim IG(0, 0)$, gives $p(\sigma^2) \propto 1/\sigma^2$ 27.
Unknown variance: posterior density of π
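As the likelihood of one observation is $p(x_i\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-(x_i-\mu)^2/(2\sigma^2)\right\}$, the posterior distribution for $\mu$ and $\sigma^2$ with joint prior $p(\mu,\sigma^2) \propto 1/\sigma^2$ is $p(\mu,\sigma^2\mid X) \propto p(X\mid\mu,\sigma^2)\,p(\mu,\sigma^2)$, which is
$$p(\mu,\sigma^2\mid X) \propto \frac{1}{\sigma^2}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}.$$
Notice that for two parameters the conditional posterior distribution of each parameter can generally be determined from the joint posterior, since $p(\mu\mid\sigma^2, X) \propto p(\mu,\sigma^2\mid X)$. Now, $X = \{x_1, \dots, x_n\}$, such that
$$p(X\mid\mu,\sigma^2) = p(x_1,\dots,x_n\mid\mu,\sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\left\{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right\}$$
$$\Rightarrow\ p(\mu,\sigma^2\mid X) \propto \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\left\{-\frac12\,\frac{\sum_{i=1}^{n}x_i^2-2n\bar{x}\mu+n\mu^2}{\sigma^2}\right\}.$$
Hence, dropping the terms that do not contain the parameter of interest $\mu$, the posterior yields
$$p(\mu\mid X,\sigma^2) \propto \exp\left\{-\frac12\,\frac{-2n\bar{x}\mu+n\mu^2}{\sigma^2}\right\}.$$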
25 By the CLT, for $n$ observations $\tau^2$ and $\sigma^2$ are related through $\bar{x} \sim N(\mu, \sigma^2/n)$; for fixed $\tau^2$, $\sigma^2/n$ serves as the updated estimate of $\tau^2$, since it depends heavily on the new sample data.
26 The IG distribution is used as a conjugate prior for the variance of the normal distribution model.
27 This is an improper prior (discussed later): as both parameters approach 0, the limiting distribution is taken as the prior for $\sigma^2$.
Then, for $\mu$, divide by $n$ and add a constant term to complete the square:
$$p(\mu\mid X,\sigma^2) \propto \exp\left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n}\right\}.$$
This results in the posterior distribution $\mu\mid X,\sigma^2 \sim N(\bar{x}, \sigma^2/n)$. Notice that, by the CLT, the sampling distribution of $\bar{x}$ is $N(\mu, \sigma^2/n)$ as well.
Unknown variance approach 1: marginal posterior density of π 2
The posterior distribution of $\sigma^2$ can be derived either from the conditional distribution of $\sigma^2\mid\mu, X$ or from the joint posterior distribution of $\mu$ and $\sigma^2$. In the first approach, only the terms involving $\sigma^2$ are considered, which gives
$$p(\sigma^2\mid\mu, X) \propto \frac{1}{\sigma^{n+2}}\exp\left\{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right\} \propto p(X\mid\mu,\sigma^2)\,p(\sigma^2),$$
which is the Inverse Gamma distribution without the normalizing constant $\beta^{\alpha}/\Gamma(\alpha)$, as $\mu$ is considered to be fixed. Matching $(\sigma^2)^{-(\alpha+1)} = \sigma^{-(n+2)}$, the two parameters of the IG distribution that $\sigma^2\mid\mu, X$ follows are $\alpha = n/2$ and $\beta = \sum_{i=1}^{n}(x_i-\mu)^2/2$.
Unknown variance approach 2: marginal posterior density of π 2
The second approach uses Bayes' rule in the form
$$p(\mu,\sigma^2\mid X) = \frac{p(\mu,\sigma^2,X)}{p(\sigma^2,X)}\cdot\frac{p(\sigma^2,X)}{p(X)} = p(\mu\mid\sigma^2,X)\,p(\sigma^2\mid X),$$
where $p(X\mid\mu,\sigma^2) = p(x_1,\dots,x_n\mid\mu,\sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\left\{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right\}$, in order to separate $p(\mu\mid\sigma^2,X)$ from the marginal posterior density $p(\sigma^2\mid X)$, which is the goal. Previously,
$$p(\mu,\sigma^2\mid X) \propto \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\left\{-\frac{\sum_{i=1}^{n}x_i^2-2n\bar{x}\mu+n\mu^2}{2\sigma^2}\right\}.$$
Then, rearrange the terms to isolate $\mu^2$ and divide by $n$ to complete the square:
$$\frac{1}{\sigma^{n+2}}\exp\left\{-\frac{(\mu-\bar{x})^2+\sum x_i^2/n-\bar{x}^2}{2\sigma^2/n}\right\} = \frac{1}{\sigma}\exp\left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n}\right\}\times\frac{1}{\sigma^{n+1}}\exp\left\{-\frac{\sum x_i^2-n\bar{x}^2}{2\sigma^2}\right\},$$
where the first term corresponds to $p(\mu\mid\sigma^2,X)$ and the second term corresponds to $p(\sigma^2\mid X)$. Notice that the numerator in $p(\sigma^2\mid X)$ is $\sum x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 = (n-1)s^2$, where $s^2 = var(x)$ is the sample variance. Hence,
$\sigma^2\mid X \sim IG\left((n-1)/2,\ (n-1)s^2/2\right)$ 28.
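The factorization into $p(\mu\mid\sigma^2, X)$ and $p(\sigma^2\mid X)$ suggests a direct way to simulate from the joint posterior: draw $\sigma^2$ from its Inverse Gamma marginal, then $\mu$ from the conditional normal. The following Python sketch assumes the non-informative prior $p(\mu,\sigma^2) \propto 1/\sigma^2$ used above; the function name, the seed and the data values are hypothetical.

```python
import random


def sample_joint_posterior(data, draws=100_000, seed=0):
    """Draw (mu, sigma2) pairs from the joint posterior under p(mu, sigma2) ∝ 1/sigma2."""
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)  # equals (n - 1) * s^2
    alpha, beta = (n - 1) / 2.0, ss / 2.0
    samples = []
    for _ in range(draws):
        # If G ~ Gamma(shape=alpha, rate=beta) then 1/G ~ IG(alpha, beta).
        # random.gammavariate expects a scale parameter, hence 1/beta.
        sigma2 = 1.0 / rng.gammavariate(alpha, 1.0 / beta)
        mu = rng.gauss(xbar, (sigma2 / n) ** 0.5)  # mu | sigma2, X ~ N(xbar, sigma2/n)
        samples.append((mu, sigma2))
    return samples


# Hypothetical data set of ten grades.
data = [72, 75, 78, 81, 69, 74, 77, 80, 73, 76]
samples = sample_joint_posterior(data)
print(sum(m for m, _ in samples) / len(samples))   # posterior mean of mu
print(sum(v for _, v in samples) / len(samples))   # posterior mean of sigma2
```

For an $IG(\alpha, \beta)$ distribution with $\alpha > 1$ the mean is $\beta/(\alpha-1)$, so the simulated average of $\sigma^2$ can be compared with $(n-1)s^2/(n-3)$.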
28 Equivalently, $\sigma^{-(n+1)} = (\sigma^2)^{-\left(\frac{n-1}{2}+1\right)}$, which matches the IG kernel $(\sigma^2)^{-(\alpha+1)}$ with $\alpha = \frac{n-1}{2}$.
Unknown variance: connection for the multivariate normal
For $\Sigma$ of dimension $p \times p$ with $\nu$ df, if $X \sim W_{\nu}(\Sigma)$ and both $X$ and $\Sigma$ are positive definite, then
$$p(X) \propto |X|^{(\nu-p-1)/2}\exp\left\{-\frac12\,tr(\Sigma^{-1}X)\right\},$$
ignoring the normalizing constant $\frac{1}{2^{\nu p/2}|\Sigma|^{\nu/2}\Gamma_p(\nu/2)}$. Just as the conjugate prior for the variance of the univariate normal distribution is the IG, the inverse Wishart distribution is the conjugate prior of $\Sigma$ in the multivariate normal distribution 29. Similarly, if $X \sim W_{\nu}^{-1}(\Sigma^{-1})$, then
$$p(X) \propto |\Sigma^{-1}|^{\nu/2}\,|X|^{-(\nu+p+1)/2}\exp\left\{-\frac12\,tr(\Sigma^{-1}X^{-1})\right\}.$$
3.3
Posterior
Note that Bayesian inference typically involves (1) establishing a model and obtaining a posterior distribution for the parameter ($\theta$) of interest, (2) generating samples from the posterior distribution, and (3) applying discrete formulas to the samples from the posterior distribution to summarize our knowledge of the parameters. Two basic sampling methods, the inversion method and the rejection method, are useful for understanding MCMC methods.
Weakly-informative Prior
When specifying and justifying the prior distribution, the population interpretation treats the prior distribution as a population of possible parameter values from which the $\theta$ of current interest has been drawn. Under the subjective interpretation, the prior instead expresses the uncertainty about $\theta$ as if its value were a random realization from the prior distribution.
3.3.1
Maximum A Posteriori (MAP)
Given $x_1, \dots, x_n \sim N(\mu, \sigma^2)$ as random variables and $X = \{x_1, \dots, x_n\}$, the prior distribution of $\mu$ is $N(\mu_0, \tau_0^2)$. The function to be maximized is
$$p(X\mid\mu)\,p(\mu) = L(\mu)\,p(\mu) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac12\left(\frac{x_i-\mu}{\sigma}\right)^2\right\}\cdot\frac{1}{\sqrt{2\pi\tau_0^2}}\exp\left\{-\frac12\left(\frac{\mu-\mu_0}{\tau_0}\right)^2\right\}.$$
Notice that $\hat{\mu}_{MLE}$ maximizes $\prod_{i=1}^{n}p(x_i\mid\mu)$ alone. The estimate $\hat{\mu}_{MAP}$, however, is the mode of the posterior distribution, which maximizes $\log(p(\mu)) + \sum_{i=1}^{n}\log(p(x_i\mid\mu))$.
29Note that the marginal distribution of mean vector π is the multivariate π‘ distribution.
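For the derivation, up to an additive constant,
$$\log(p(\mu\mid X)) = \sum_{i=1}^{n}\left\{-\log\sqrt{2\pi\sigma^2}-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}-\log\sqrt{2\pi\tau_0^2}-\frac{(\mu-\mu_0)^2}{2\tau_0^2}.$$
Then, the derivative is set to zero:
$$\frac{\partial\log(p(\mu\mid X))}{\partial\mu} = \sum_{i=1}^{n}\frac{(x_i-\mu)}{\sigma^2}-\frac{(\mu-\mu_0)}{\tau_0^2} = 0$$
$$\Leftrightarrow\ \sum_{i=1}^{n}\frac{x_i}{\sigma^2}-\frac{n\mu}{\sigma^2} = \frac{(\mu-\mu_0)}{\tau_0^2}\ \Leftrightarrow\ \frac{(\sigma^2+n\tau_0^2)\mu}{\sigma^2\tau_0^2} = \frac{\sigma^2\mu_0+\tau_0^2\sum_{i=1}^{n}x_i}{\sigma^2\tau_0^2}\ \Leftrightarrow\ \hat{\mu}_{MAP} = \frac{\sigma^2\mu_0+\tau_0^2\sum_{i=1}^{n}x_i}{\sigma^2+n\tau_0^2}.$$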
Equivalently, from the derivation in the previous section on priors, maximizing the posterior distribution $p(\mu\mid X)$ amounts to minimizing $\sum_{i=1}^{n}\left(\frac{x_i-\mu}{\sigma}\right)^2+\left(\frac{\mu-\mu_0}{\tau_0}\right)^2$. Hence, the MAP estimate of $\mu$ is derived to be
$$\hat{\mu}_{MAP} = \frac{\tau_0^2\left(\sum_{i=1}^{n}x_i\right)+\mu_0\sigma^2}{n\tau_0^2+\sigma^2} = \frac{n\tau_0^2}{n\tau_0^2+\sigma^2}\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)+\frac{\sigma^2}{n\tau_0^2+\sigma^2}\,\mu_0,$$
which is the weighted average of the sample mean and the prior mean.
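The closed form for $\hat{\mu}_{MAP}$ can be checked against a brute-force maximization of the log posterior. The sketch below is only an illustration: the function names, the data and the prior settings are hypothetical, and the grid search stands in for any numerical optimizer.

```python
def map_estimate(xs, sigma2, mu0, tau0_sq):
    """Closed-form MAP estimate of a normal mean with a N(mu0, tau0_sq) prior."""
    n = len(xs)
    return (tau0_sq * sum(xs) + mu0 * sigma2) / (n * tau0_sq + sigma2)


def log_posterior(mu, xs, sigma2, mu0, tau0_sq):
    """Log posterior of mu up to an additive constant."""
    log_lik = -sum((x - mu) ** 2 for x in xs) / (2 * sigma2)
    log_prior = -(mu - mu0) ** 2 / (2 * tau0_sq)
    return log_lik + log_prior


xs = [4.8, 5.6, 5.1, 6.0, 5.4]          # hypothetical observations
closed = map_estimate(xs, sigma2=1.0, mu0=0.0, tau0_sq=4.0)

# Grid search over [0, 10] in steps of 0.001 as a crude numerical check.
grid = [i / 1000 for i in range(0, 10001)]
numeric = max(grid, key=lambda m: log_posterior(m, xs, 1.0, 0.0, 4.0))
print(closed, numeric)
```

The two values should agree up to the grid resolution, and both lie between the prior mean and the sample mean.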
3.3.2
Multivariate Normal distribution with known Σ
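The multivariate normal likelihood is $\mathbf{x}_i\mid\boldsymbol{\mu},\Sigma \sim N(\boldsymbol{\mu},\Sigma)$ 30, written without the normalizing constant $1/(2\pi)^{p/2}$, as previously defined. Equivalently, the likelihood function for the single-observation model is
$$p(\mathbf{x}_i\mid\boldsymbol{\mu},\Sigma) \propto |\Sigma|^{-1/2}\exp\left\{-\frac12(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right\},$$
and for a sample of $n$ iid observations $X = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ it is
$$p(X\mid\boldsymbol{\mu},\Sigma) = p(\mathbf{x}_1,\dots,\mathbf{x}_n\mid\boldsymbol{\mu},\Sigma) \propto |\Sigma|^{-n/2}\exp\left\{-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right\}.$$
Analogous to the univariate case with known variance, where $p(\mu\mid X) \propto p(X\mid\mu)\,p(\mu)$ with $\mu \sim N(\mu_0, \tau_0^2)$ from 3.2.2, the multivariate posterior distribution generalizes to
$$p(\boldsymbol{\mu}\mid X) \propto p(X\mid\boldsymbol{\mu})\,p(\boldsymbol{\mu}) \quad\text{with}\quad \boldsymbol{\mu} \sim N(\boldsymbol{\mu}_0, \Lambda_0).$$
30 Note that $\Sigma$ is a $p \times p$ positive definite matrix and $\boldsymbol{\mu}$ is a $p$-dimensional vector.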
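Equivalently, the posterior distribution is $p(\boldsymbol{\mu}\mid X) \propto p(\boldsymbol{\mu})\prod_{i=1}^{n}p(\mathbf{x}_i\mid\boldsymbol{\mu})$, that is,
$$p(\boldsymbol{\mu}\mid X) \propto |\Lambda_0|^{-1/2}\exp\left\{-\frac12(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Lambda_0^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)\right\}|\Sigma|^{-n/2}\exp\left\{-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right\},$$
and dropping the constant terms yields
$$p(\boldsymbol{\mu}\mid X) \propto \exp\left\{-\frac12\left[(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Lambda_0^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)+\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right]\right\}.$$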
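Now, take the natural logarithm (log) to simplify the derivation of the log density. Dropping the terms that do not involve $\boldsymbol{\mu}$,
$$\log(p(\boldsymbol{\mu}\mid X)) = -\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})-\frac12(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Lambda_0^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)$$
$$= -\frac{n}{2}\boldsymbol{\mu}^T\Sigma^{-1}\boldsymbol{\mu}+\boldsymbol{\mu}^T\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i-\frac12\boldsymbol{\mu}^T\Lambda_0^{-1}\boldsymbol{\mu}+\boldsymbol{\mu}^T\Lambda_0^{-1}\boldsymbol{\mu}_0$$
$$= -\frac12\boldsymbol{\mu}^T(n\Sigma^{-1}+\Lambda_0^{-1})\boldsymbol{\mu}+\boldsymbol{\mu}^T\left(\Lambda_0^{-1}\boldsymbol{\mu}_0+\Sigma^{-1}\sum\mathbf{x}_i\right) = -\frac12\left[\boldsymbol{\mu}^T(n\Sigma^{-1}+\Lambda_0^{-1})\boldsymbol{\mu}-2\boldsymbol{\mu}^T\left(\Lambda_0^{-1}\boldsymbol{\mu}_0+\Sigma^{-1}\sum\mathbf{x}_i\right)\right].$$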
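Now, completing the square in $\boldsymbol{\mu}$ by rearranging the matrix products gives, up to an additive constant,
$$-\frac12\left[\boldsymbol{\mu}-(n\Sigma^{-1}+\Lambda_0^{-1})^{-1}\left(\Lambda_0^{-1}\boldsymbol{\mu}_0+\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i\right)\right]^T(n\Sigma^{-1}+\Lambda_0^{-1})\left[\boldsymbol{\mu}-(n\Sigma^{-1}+\Lambda_0^{-1})^{-1}\left(\Lambda_0^{-1}\boldsymbol{\mu}_0+\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i\right)\right],$$
which is the log density of a normal distribution
$$\boldsymbol{\mu}\mid X \sim N\left((n\Sigma^{-1}+\Lambda_0^{-1})^{-1}\left(\Lambda_0^{-1}\boldsymbol{\mu}_0+\Sigma^{-1}\sum\mathbf{x}_i\right),\ (n\Sigma^{-1}+\Lambda_0^{-1})^{-1}\right).$$
Equivalently, the mean $\boldsymbol{\mu}_n$ and the inverse of the covariance matrix $\Lambda_n^{-1}$ 31 are defined as
$$\boldsymbol{\mu}_n = (\Lambda_0^{-1}+n\Sigma^{-1})^{-1}(\Lambda_0^{-1}\boldsymbol{\mu}_0+n\Sigma^{-1}\bar{\mathbf{x}}), \qquad \Lambda_n^{-1} = \Lambda_0^{-1}+n\Sigma^{-1},$$
using the Woodbury identity on the expression for the covariance matrix. Just as the normal prior is conjugate for the mean of the multivariate normal distribution, analogous to the univariate case, the conjugate prior for $\Sigma$ is an inverse Wishart distribution, to be defined below. Note that the posterior conditional and marginal distributions of subvectors of $\boldsymbol{\mu}$ with known $\Sigma$ can also be derived.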
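The update for $\boldsymbol{\mu}_n$ and $\Lambda_n$ can be sketched in a few lines of Python. To stay self-contained, the example below is restricted to the bivariate case ($p = 2$) and uses small hand-written 2×2 helpers instead of a linear algebra library; the helper names and all numerical inputs are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def mvn_mean_posterior(xbar, n, sigma, mu0, lambda0):
    """Posterior N(mu_n, lambda_n) of the mean vector when Sigma is known (p = 2)."""
    sigma_inv, lambda0_inv = inv2(sigma), inv2(lambda0)
    # Posterior precision: Lambda_n^{-1} = Lambda_0^{-1} + n * Sigma^{-1}
    prec_n = [[lambda0_inv[i][j] + n * sigma_inv[i][j] for j in range(2)]
              for i in range(2)]
    lambda_n = inv2(prec_n)
    # mu_n = Lambda_n (Lambda_0^{-1} mu_0 + n Sigma^{-1} xbar)
    prior_part = matvec(lambda0_inv, mu0)
    data_part = matvec(sigma_inv, xbar)
    rhs = [prior_part[i] + n * data_part[i] for i in range(2)]
    return matvec(lambda_n, rhs), lambda_n


# Hypothetical inputs: prior covariance equal to Sigma, four observations.
sigma = [[2.0, 0.5], [0.5, 1.0]]
mu_n, lambda_n = mvn_mean_posterior(xbar=[1.0, 2.0], n=4, sigma=sigma,
                                    mu0=[0.0, 0.0], lambda0=sigma)
print(mu_n, lambda_n)
```

When $\Lambda_0 = \Sigma$, the result reduces to $\boldsymbol{\mu}_n = (\boldsymbol{\mu}_0 + n\bar{\mathbf{x}})/(n+1)$ and $\Lambda_n = \Sigma/(n+1)$, mirroring the univariate special case.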
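Posterior predictive distribution for $\tilde{\mathbf{x}}$
For a new observation $\tilde{\mathbf{x}} \sim N(\boldsymbol{\mu}, \Sigma)$, the joint distribution is defined as
$$p(\tilde{\mathbf{x}}, \boldsymbol{\mu}\mid X) \propto N(\tilde{\mathbf{x}}\mid\boldsymbol{\mu},\Sigma)\,N(\boldsymbol{\mu}\mid\boldsymbol{\mu}_n,\Lambda_n).$$
The posterior predictive distribution of $\tilde{\mathbf{x}}$ is multivariate normal, as $\Sigma$ is known. By Adam's law,
$$E(\tilde{\mathbf{x}}\mid X) = E\{E(\tilde{\mathbf{x}}\mid\boldsymbol{\mu},X)\mid X\} = E(\boldsymbol{\mu}\mid X) = \boldsymbol{\mu}_n,$$
and by Eve's law,
$$var(\tilde{\mathbf{x}}\mid X) = E\{var(\tilde{\mathbf{x}}\mid\boldsymbol{\mu},X)\mid X\}+var\{E(\tilde{\mathbf{x}}\mid\boldsymbol{\mu},X)\mid X\} = E(\Sigma\mid X)+var(\boldsymbol{\mu}\mid X) = \Sigma+\Lambda_n.$$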
Non-informative prior density of $\boldsymbol{\mu}$
If $p(\boldsymbol{\mu}) \propto$ constant, which corresponds to the prior precision $|\Lambda_0^{-1}|$ (the inverse of the prior variance) converging to 0, then the prior mean is irrelevant for the posterior density.
3.3.3
Multivariate Normal distribution with unknown Σ
Goal scheme
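Previously, the posterior with two parameters $\mu$ and $\sigma^2$ was defined as $p(\mu,\sigma^2\mid X) \propto p(X\mid\mu,\sigma^2)\,p(\mu,\sigma^2)$. Similarly, as $\mathbf{x}_i\mid\boldsymbol{\mu},\Sigma \sim N(\boldsymbol{\mu},\Sigma)$, the posterior distribution is defined as
$$p(\boldsymbol{\mu},\Sigma\mid X) \propto p(X\mid\boldsymbol{\mu},\Sigma)\,p(\boldsymbol{\mu},\Sigma).$$
Analogous from univariate unknown $\sigma^2$: scheme connection
[$p(\Sigma)$]: The multivariate approach is exactly analogous to the univariate approach. In 3.2.4 the prior $p(\sigma^2)$ was taken as Inverse Gamma (IG) $(\alpha, \beta)$; its multivariate counterpart, the Inverse Wishart distribution, is the prior distribution for $\Sigma$.
[$p(\boldsymbol{\mu}\mid\Sigma)$]: Similarly, $\boldsymbol{\mu}\mid\Sigma \sim N(\boldsymbol{\mu}_0, \Sigma/k)$, as the univariate case in 3.2.4 was shown to have $p(\mu\mid\sigma^2,X) \propto \exp\left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n}\right\}$, that is, $\mu\mid\sigma^2,X \sim N(\bar{x}, \sigma^2/n)$, a normal distribution; the same normal form is used for the density $p(\boldsymbol{\mu}\mid\Sigma)$.
[$p(\boldsymbol{\mu},\Sigma)$]: Also, the joint prior factorizes as $p(\boldsymbol{\mu},\Sigma) = p(\Sigma)\,p(\boldsymbol{\mu}\mid\Sigma)$, analogously to $p(\mu,\sigma^2) = p(\mu\mid\sigma^2)\,p(\sigma^2)$, such that
$$p(\Sigma)\,p(\boldsymbol{\mu}\mid\Sigma) \propto |\Lambda_0|^{\nu/2}|\Sigma|^{-(\nu+p+1)/2}\exp\left\{-\frac12\,tr(\Lambda_0\Sigma^{-1})\right\}\times|\Sigma|^{-1/2}\exp\left\{-\frac{k}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)\right\}$$
$$\Rightarrow\ p(\boldsymbol{\mu},\Sigma) \propto |\Sigma|^{-\{(\nu+p)/2+1\}}\exp\left\{-\frac12\,tr(\Lambda_0\Sigma^{-1})-\frac{k}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)\right\}.$$
Therefore, the inverse-Wishart distribution describes the prior distribution of $\Sigma$, $\Sigma \sim W_{\nu}^{-1}(\Lambda_0^{-1})$, and the hyperparameters $(\boldsymbol{\mu}_0, \Lambda_0/k; \nu, \Lambda_0)$ are used for the parametrization. Notice that $\boldsymbol{\mu}\mid\Sigma \sim N(\boldsymbol{\mu}_0, \Sigma/k)$, while $\nu$ and $\Lambda_0$ control the df and the scale matrix of the inverse Wishart distribution on $\Sigma$.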
Posterior with unknown π 2 : conclusion
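[$p(X\mid\boldsymbol{\mu},\Sigma)$]: In $p(\boldsymbol{\mu},\Sigma\mid X) \propto p(X\mid\boldsymbol{\mu},\Sigma)\,p(\boldsymbol{\mu},\Sigma)$, the likelihood function is the normal distribution, defined for one observation as $p(\mathbf{x}_i\mid\boldsymbol{\mu},\Sigma) \propto |\Sigma|^{-1/2}\exp\left\{-\frac12(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right\}$, exactly as when the covariance is known (which is analogous to the univariate case). Hence,
$$p(X\mid\boldsymbol{\mu},\Sigma) \propto |\Sigma|^{-n/2}\exp\left\{-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right\}$$
$$= |\Sigma|^{-n/2}\exp\left\{-\frac12\left[\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})^T\Sigma^{-1}(\mathbf{x}_i-\bar{\mathbf{x}})+n(\bar{\mathbf{x}}-\boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})\right]\right\}$$
$$= |\Sigma|^{-n/2}\exp\left\{-\frac12\left[(n-1)S^2+n(\bar{\mathbf{x}}-\boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})\right]\right\}, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})^T\Sigma^{-1}(\mathbf{x}_i-\bar{\mathbf{x}})$$
by definition. Before, $p(\boldsymbol{\mu}\mid X) \propto p(X\mid\boldsymbol{\mu})\,p(\boldsymbol{\mu})$ with $\boldsymbol{\mu} \sim N(\boldsymbol{\mu}_0, \Lambda_0)$ in 3.3.2. Thus, the posterior of the two unknown parameters is derived from
$$p(\boldsymbol{\mu},\Sigma\mid X) \propto p(X\mid\boldsymbol{\mu},\Sigma)\,p(\boldsymbol{\mu},\Sigma).$$
Recall from a few lines before that
$$p(\boldsymbol{\mu},\Sigma) \propto |\Lambda_0|^{\nu/2}|\Sigma|^{-\{(\nu+p)/2+1\}}\exp\left\{-\frac12\,tr(\Lambda_0\Sigma^{-1})-\frac{k}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)\right\},$$
so that
$$p(\boldsymbol{\mu},\Sigma\mid X) \propto |\Lambda_0|^{\nu/2}|\Sigma|^{-\{(\nu+p)/2+1\}}\exp\left\{-\frac12\,tr(\Lambda_0\Sigma^{-1})-\frac{k}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)\right\}\times|\Sigma|^{-n/2}\exp\left\{-\frac12\left[(n-1)S^2+n(\bar{\mathbf{x}}-\boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})\right]\right\}$$
$$= |\Lambda_0|^{\nu/2}|\Sigma|^{-\{(\nu+p+n)/2+1\}}\exp\left\{-\frac12\left[tr(\Lambda_0\Sigma^{-1})+(n-1)S^2+n(\bar{\mathbf{x}}-\boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})+k(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)\right]\right\}.$$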
2
Posterior with unknown π 2 : derivation for square and Inverse-Wishart kernel
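First, expanding the quadratic forms,
$$(n-1)S^2+n(\bar{\mathbf{x}}-\boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})+k(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_0)$$
$$= (n-1)S^2+n\bar{\mathbf{x}}^T\Sigma^{-1}\bar{\mathbf{x}}-2n\boldsymbol{\mu}^T\Sigma^{-1}\bar{\mathbf{x}}+n\boldsymbol{\mu}^T\Sigma^{-1}\boldsymbol{\mu}+k\boldsymbol{\mu}^T\Sigma^{-1}\boldsymbol{\mu}-2k\boldsymbol{\mu}^T\Sigma^{-1}\boldsymbol{\mu}_0+k\boldsymbol{\mu}_0^T\Sigma^{-1}\boldsymbol{\mu}_0.$$
Now, for the later rearrangement, add and subtract $(k+n)\boldsymbol{\mu}_n^T\Sigma^{-1}\boldsymbol{\mu}_n$, with $\boldsymbol{\mu}_n = (k\boldsymbol{\mu}_0+n\bar{\mathbf{x}})/(k+n)$, to balance the equation in terms of the posterior parameters:
$$(n-1)S^2+(k+n)\boldsymbol{\mu}^T\Sigma^{-1}\boldsymbol{\mu}-2\boldsymbol{\mu}^T\Sigma^{-1}(k\boldsymbol{\mu}_0+n\bar{\mathbf{x}})+(k+n)\boldsymbol{\mu}_n^T\Sigma^{-1}\boldsymbol{\mu}_n-(k+n)\boldsymbol{\mu}_n^T\Sigma^{-1}\boldsymbol{\mu}_n+k\boldsymbol{\mu}_0^T\Sigma^{-1}\boldsymbol{\mu}_0+n\bar{\mathbf{x}}^T\Sigma^{-1}\bar{\mathbf{x}}$$
$$= (n-1)S^2+(k+n)(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)+\frac{kn}{k+n}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})^T\Sigma^{-1}(\boldsymbol{\mu}_0-\bar{\mathbf{x}}).$$
Then,
$$p(\boldsymbol{\mu},\Sigma\mid X) \propto |\Lambda_0|^{\nu/2}|\Sigma|^{-(\nu+p+n+1)/2}\exp\left\{-\frac12\,tr(\Lambda_0\Sigma^{-1})\right\}\times|\Sigma|^{-1/2}\exp\left\{-\frac12\left[(n-1)S^2+(k+n)(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)+\frac{kn}{k+n}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})^T\Sigma^{-1}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})\right]\right\}$$
$$= |\Lambda_0|^{\nu/2}|\Sigma|^{-(\nu+p+n+1)/2}\exp\left\{-\frac12\left[tr(\Lambda_0\Sigma^{-1})+(n-1)S^2+\frac{kn}{k+n}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})^T\Sigma^{-1}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})\right]\right\}\times|\Sigma|^{-1/2}\exp\left\{-\frac12(k+n)(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)\right\}.$$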
The idea is to build up a form that is exactly a Normal × Inverse Wishart kernel and to identify its parameters. To identify the exponents of the normal and inverse Wishart kernels, the following trace properties are used: $tr(A)+tr(B) = tr(A+B)$, $tr(DC) = tr(CD)$, and, since a scalar equals its own trace, $\mathbf{x}^T\Sigma^{-1}\mathbf{x} = tr(\mathbf{x}^T\Sigma^{-1}\mathbf{x}) = tr(\mathbf{x}\mathbf{x}^T\Sigma^{-1})$.
Notice that $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})^T\Sigma^{-1}(\mathbf{x}_i-\bar{\mathbf{x}})$. Then, the first part of the exponent simplifies to
$$-\frac12\left[tr(\Lambda_0\Sigma^{-1})+(n-1)S^2+\frac{kn}{k+n}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})^T\Sigma^{-1}(\boldsymbol{\mu}_0-\bar{\mathbf{x}})\right] = -\frac12\,tr\left\{\left[\Lambda_0+\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T+\frac{kn}{k+n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)^T\right]\Sigma^{-1}\right\}.$$
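These properties allow the expression to be arranged as $N\left(\boldsymbol{\mu}_n, \frac{\Sigma}{k+n}\right) \times$ Inverse Wishart$(\nu_n, \Lambda_n)$ in $\Sigma$, which is
$$|\Lambda_n|^{\nu_n/2}|\Sigma|^{-(\nu+n+p+1)/2}\exp\left\{-\frac12\,tr(\Lambda_n\Sigma^{-1})\right\}\,|\Sigma|^{-1/2}\exp\left\{-\frac{k+n}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)\right\},$$
where $\nu_n = \nu + n$, and where the determinants $|\Lambda_0|$ and $|\Lambda_n|$ are constants with respect to $(\boldsymbol{\mu}, \Sigma)$ and are absorbed into the proportionality. Now, comparing with the expression of interest,
$$\boldsymbol{\mu}_n = \frac{1}{k+n}(k\boldsymbol{\mu}_0+n\bar{\mathbf{x}}),$$
$$\Lambda_n = \Lambda_0+\sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T+\frac{kn}{k+n}(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)(\bar{\mathbf{x}}-\boldsymbol{\mu}_0)^T,$$
where $\boldsymbol{\mu}_n$ matches the normal factor (the second term in the model) and $\Lambda_n$ matches the inverse Wishart factor (the first term in the model). Thus,
$$\boldsymbol{\mu},\Sigma\mid X \sim N\left(\boldsymbol{\mu}_n, \frac{\Sigma}{k+n}\right)\times\text{Inverse Wishart}(\nu_n, \Lambda_n)\ \text{in}\ \Sigma,$$
which follows the model. Also,
$$\boldsymbol{\mu}\mid\Sigma, X \sim N\left(\boldsymbol{\mu}_n, (k+n)^{-1}\Sigma\right)\ \Leftrightarrow\ p(\boldsymbol{\mu}\mid\Sigma,X) \propto \exp\left\{-\frac{k+n}{2}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)^T\Sigma^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}_n)\right\}.$$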
Posterior with unknown π 2 : uninformative priors
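The joint uninformative prior (with a locally uniform prior for $\boldsymbol{\mu}$) is $p(\boldsymbol{\mu},\Sigma) \propto |\Sigma|^{-\frac{p+1}{2}}$, and the joint posterior is derived as
$$p(\boldsymbol{\mu},\Sigma\mid X) \propto |\Sigma|^{-\frac{n}{2}}|\Sigma|^{-\frac{p+1}{2}}\exp\left\{-\frac12\left[(n-1)S^2+n(\bar{\mathbf{x}}-\boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})\right]\right\} = |\Sigma|^{-\frac{n+p+1}{2}}\exp\left\{-\frac12\,tr(S(\boldsymbol{\mu})\Sigma^{-1})\right\},$$
where $S(\boldsymbol{\mu}) = \sum_{i=1}^{n}(\bar{\mathbf{x}}-\mathbf{x}_i)(\bar{\mathbf{x}}-\mathbf{x}_i)^T+n(\bar{\mathbf{x}}-\boldsymbol{\mu})(\bar{\mathbf{x}}-\boldsymbol{\mu})^T$. Then, the conditional posterior is $\boldsymbol{\mu}\mid\Sigma, X \sim N\left(\bar{\mathbf{x}}, \frac{\Sigma}{n}\right)$, such that
$$p(\boldsymbol{\mu}\mid\Sigma,X) \propto \exp\left\{-\frac{n}{2}(\boldsymbol{\mu}-\bar{\mathbf{x}})^T\Sigma^{-1}(\boldsymbol{\mu}-\bar{\mathbf{x}})\right\}.$$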
Multivariate list of Conjugate Models
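Model (parameter) | Prior $p(\theta)$ | Likelihood $p(X\mid\theta)$ | Posterior $p(\theta\mid X)$
Normal ($\mu$) | $\exp\left\{-\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right\}$, $\tau_0^2 = \frac{\sigma^2}{k}$ | $\prod_{i=1}^{n}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}$ | $\exp\left\{-\frac{(\mu-\mu_n)^2}{2\tau_n^2}\right\}$ *
Beta-Bin ($\theta$) | Beta $\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ | Bin $\propto \theta^{k}(1-\theta)^{n-k}$ | Beta $\propto \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$
Gamma-exp ($\lambda$) | Gamma $\propto \lambda^{\alpha-1}\exp(-\beta\lambda)$ | exp $\propto \lambda^{n}\exp(-\lambda s)$, $s = \sum_i x_i$ | Gamma $\propto \lambda^{\alpha+n-1}\exp(-(\beta+s)\lambda)$
* $\mu_n = \frac{k\mu_0+n\bar{x}}{k+n}$, $\tau_n^2 = \frac{\sigma^2}{k+n}$
3.3.4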
Lindley’s Paradox
Depending on the choice of prior distribution, the frequentist and Bayesian approaches can give different results for a hypothesis test. The paradox arises for the result $x$ of an experiment with two competing explanations, $H_0$ and $H_a$, and some prior distribution $\pi$ representing the uncertainty about which hypothesis is more accurate before the result $x$ is taken into account.
Lindley's paradox occurs when the result $x$ is significant by the frequentist test of $H_0$, indicating sufficient evidence to reject $H_0$ at a given level such as $\alpha = 0.05$, while the Bayesian posterior probability of $H_0$ given $x$ is high, indicating strong evidence that $H_0$ is in better agreement with $x$ than $H_a$, so that $H_0$ is retained.
Example for the comparison
In statistics programs around the world, 4900 male and 4700 female students are enrolled in a certain time period. The observed proportion $x$ of male students is $4900/9600 \approx 0.51$. We assume the number of male students is a Binomial variable with parameter $\theta$. The goal is to test whether $\theta$ is 0.5 or some other value. That is, $H_0: \theta = 0.5$ and $H_a: \theta \neq 0.5$.
The frequentist approach to testing $H_0$ is to compute a p-value, the probability of observing a number of male students at least as large as the observed $k = 4900$ assuming $H_0$ is true. A normal approximation for the number of male students is $X \sim N(\mu, \sigma^2)$ with $\mu = n\theta = 9600(0.5) = 4800$ and $\sigma^2 = n\theta(1-\theta) = 9600(0.5)(1-0.5) = 2400$,
$$P(X \geq 4900\mid\mu = 4800) = \int_{4900}^{9600}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(u-\mu)^2}{2\sigma^2}\right\}du = \int_{4900}^{9600}\frac{1}{\sqrt{2\pi(2400)}}\exp\left\{-\frac{(u-4800)^2}{4800}\right\}du \approx 0.020613.$$
For the two-sided test, the p-value is $2(0.020613) \approx 0.0412$, which is below 0.05, so $H_0$ is rejected in favour of $H_a$: the observed data differ significantly from $\theta = 0.5$.
The Bayesian approach assumes equal prior probabilities, as there is no reason to favour either hypothesis, which is $P(H_0) = P(H_a) = 0.5$. Also, $\theta \sim U[0, 1]$ under $H_a$. The posterior probability of $H_0$ for $k = 4900$ is described as
$$P(H_0\mid k) = \frac{P(k\mid H_0)\,P(H_0)}{P(k\mid H_0)\,P(H_0)+P(k\mid H_a)\,P(H_a)},$$
after observing $k/n = 4900/9600$ male students, where the terms are computed from the PMF of the binomial distribution under each hypothesis,
$$P(k\mid H_0) = \binom{n}{k}(0.5)^{k}(1-0.5)^{n-k} \approx 0.00101,$$
$$P(k\mid H_a) = \int_0^1\binom{n}{k}\theta^{k}(1-\theta)^{n-k}d\theta = \binom{n}{k}Beta(k+1, n-k+1) = \frac{1}{n+1} \approx 0.000104,$$
$$\Rightarrow\ P(k\mid H_0) > P(k\mid H_a).$$
Hence, the posterior probability is $P(H_0\mid k) \approx 0.91$, which favours $H_0$ over $H_a$.
Thus, the Bayesian and frequentist approaches appear to be in conflict, which is the paradox; this also leads to the goodness-of-fit test.
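The numbers in this example can be reproduced with the Python standard library. The sketch below follows the setup of the example ($n = 9600$, $k = 4900$, equal prior probabilities, uniform prior on $\theta$ under $H_a$); the variable names are illustrative, and the log-gamma route for the binomial coefficient is simply one convenient way to avoid overflow.

```python
import math

n, k = 9600, 4900

# Frequentist side: normal approximation to the upper tail, then doubled.
mu, sd = n * 0.5, math.sqrt(n * 0.5 * 0.5)
z = (k - mu) / sd
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z) for standard normal Z
p_two_sided = 2 * p_one_sided

# Bayesian side: binomial PMF under H0 (theta = 0.5), computed on the log scale.
log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
p_k_h0 = math.exp(log_choose + n * math.log(0.5))
# Under Ha with theta ~ U[0, 1], the marginal probability of any k is 1/(n+1).
p_k_ha = 1.0 / (n + 1)
# Equal prior probabilities P(H0) = P(Ha) = 0.5 cancel in the ratio.
post_h0 = p_k_h0 / (p_k_h0 + p_k_ha)

print(p_one_sided, p_two_sided)
print(p_k_h0, p_k_ha, post_h0)
```

The frequentist p-value falls below 0.05 while the posterior probability of $H_0$ stays high, which is exactly the tension the paradox describes.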
3.3.5
Bernstein-von Mises theorem
The Bernstein-von Mises theorem is a result that links Bayesian inference with frequentist inference. In particular, it states that Bayesian credible sets of a certain credibility level $\alpha$ are asymptotically confidence sets of confidence level $\alpha$, which allows Bayesian credible sets to be interpreted in frequentist terms under the probabilistic data-generating process.
In parameter inference, the posterior distribution converges, in the limit of infinitely many data, to a multivariate normal distribution centered at the maximum likelihood estimator, with covariance matrix given by the inverse Fisher information scaled by $1/n$. From a frequentist point of view, Bayesian inference is therefore asymptotically correct.
Bernstein-von Mises theorem: Univariate normal data
For the Bayesian approach, when $n$ is large, given the observed data,
$$\sqrt{n}(\bar{x}-\mu)\mid X = x_1, \dots, x_n \to N(0, \sigma^2),$$
so the prior does not matter for large samples. In the frequentist approach, for large samples,
$$\sqrt{n}(\bar{x}-\mu)\mid\mu \sim N(0, \sigma^2).$$
Without loss of generality, the Bayesian 95% credible region and the frequentist 95% confidence interval match:
$$P\left(\mu \in \left[\bar{X}-1.96\frac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\frac{\sigma}{\sqrt{n}}\right]\,\Big|\,X_1, \dots, X_n\right) \approx P\left(\mu \in \left[\bar{X}-1.96\frac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\frac{\sigma}{\sqrt{n}}\right]\,\Big|\,\mu\right) = 0.95.$$
3.4
Goodness-of-fit test
Suppose $n$ samples are drawn from a normal distribution, $N(\mu, \sigma^2)$, with known variance, and the goal is to select the model that best predicts the mean of the distribution.
3.4.1
Bayes factor
The Bayes factor is the Bayesian alternative to frequentist hypothesis testing. Bayesian model comparison is a method of model selection among statistical models based on Bayes factors ($BF$). Given the observed data $D$, the relative plausibility of two different models $M_1$ and $M_2$, parametrized by $\theta_1$ and $\theta_2$ respectively, is assessed by the ratio of their marginal likelihoods,
$$BF_{12} = \frac{P(D\mid M_1)}{P(D\mid M_2)} = \frac{\int P(\theta_1\mid M_1)\,P(D\mid\theta_1, M_1)\,d\theta_1}{\int P(\theta_2\mid M_2)\,P(D\mid\theta_2, M_2)\,d\theta_2} = \frac{P(M_1\mid D)\,P(D)/P(M_1)}{P(M_2\mid D)\,P(D)/P(M_2)} = \frac{P(M_1\mid D)}{P(M_2\mid D)}\cdot\frac{P(M_2)}{P(M_1)},$$
that is,
Likelihood ratio $\Leftrightarrow$ Posterior odds $\times$ (Prior odds)$^{-1}$.
Unlike the LRT, which maximizes the likelihood, the Bayes factor averages over the parameters, so it guards against overfitting, at the cost of a dependence on (bias from) the prior. Moreover, the Bayes factor measures the relative predictive accuracy of one hypothesis over another, and the extent to which the data sway our relative belief from one hypothesis to the other. Hence, $BF = b$, $b \in (0, \infty)$, means that there is $b$ times more evidence for $H_a$ than for $H_0$.
In the case where there are only 2 models, given the Bayes factor $BF(D)$, the posterior probability of Model 1 is derived as
$$P(M_1\mid D) = 1-P(M_2\mid D) = 1-\frac{P(D\mid M_2)\,P(M_2)}{P(D)} = 1-\frac{P(D\mid M_1)\,P(M_2)}{BF(D)\,P(D)}$$
$$\Rightarrow\ 1-\frac{P(M_1\mid D)\,P(M_2)}{BF(D)\,P(M_1)} = P(M_1\mid D)$$
$$\Leftrightarrow\ 1 = \left[1+\frac{1}{BF(D)}\frac{P(M_2)}{P(M_1)}\right]P(M_1\mid D)\ \Leftrightarrow\ P(M_1\mid D) = \frac{1}{1+\frac{1}{BF(D)}\frac{P(M_2)}{P(M_1)}}.$$
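The last identity converts a Bayes factor and a prior probability into a posterior model probability. A minimal Python sketch, with a hypothetical function name and illustrative inputs, is:

```python
def posterior_prob_m1(bf12, prior_m1=0.5):
    """Posterior probability of model 1 from the Bayes factor BF_12 (two models only)."""
    prior_m2 = 1.0 - prior_m1
    return 1.0 / (1.0 + (1.0 / bf12) * (prior_m2 / prior_m1))


print(posterior_prob_m1(1.0))        # equal evidence, equal priors -> 0.5
print(posterior_prob_m1(3.0))        # BF of 3 with equal priors -> 0.75
print(posterior_prob_m1(9.0, 0.1))   # BF of 9 against prior odds of 1:9 -> 0.5
```

The third call shows that a Bayes factor of 9 is exactly cancelled by prior odds of 1 to 9 against the model.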
Bayes factor cutoffs
$BF_{10}$ | interpretation
> 100 | Extreme evidence for $H_a$
30 − 100 | Very strong evidence for $H_a$
10 − 30 | Strong evidence for $H_a$
3 − 10 | Moderate evidence for $H_a$
1 − 3 | Anecdotal evidence for $H_a$
1 | Equal evidence for $H_a$ and $H_0$
1/3 − 1 | Anecdotal evidence for $H_0$
1/10 − 1/3 | Moderate evidence for $H_0$
1/30 − 1/10 | Strong evidence for $H_0$
1/100 − 1/30 | Very strong evidence for $H_0$
< 1/100 | Extreme evidence for $H_0$
Relationship between Bayes factor and p-value
As the Bayes factor in favour of $H_a$ increases, the p-value decreases, and smaller p-values lead to rejecting $H_0$. Conversely, larger p-values correspond to low numerical values of the Bayes factor: the two are inversely related.
Note that Bayes factors allow the null hypothesis to be tested directly (relative to the models under consideration).
3.4.2
Bayes factor: hypothesis testing
A benefit of the Bayes factor is that multiple hypotheses can be tested on the same observed data; for example, regression models can be compared through
$$BF_{10} = \frac{P(D\mid H_1)}{P(D\mid H_0)}, \quad BF_{20} = \frac{P(D\mid H_2)}{P(D\mid H_0)}\ \Rightarrow\ \frac{BF_{10}}{BF_{20}} = \frac{P(D\mid H_1)}{P(D\mid H_2)} = BF_{12}.$$
Frequentist: Chi-square Goodness-of-fit test
Previously, $H_0: \mu = \mu_0$, and the CLT 32 guarantees that the sample mean $\bar{x} = \sum_{i=1}^{n}x_i/n$ satisfies $\bar{x} \sim N(\mu, \sigma^2/n)$. Then, the test statistic is computed as
$$\chi^2 = \frac{(\bar{x}-\mu_0)^2}{\sigma^2/n}.$$
Bayesian: Bayes factor
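The emphasis is on computing the Bayes factor, which determines the relative ratio of the evidence for the models. For two models, let $M_1: \mu = \mu_0$ and $M_2: \mu$ lies inside an interval of length $L$ that includes $\mu_0$, with $L > \sigma$. Under $M_2$ the prior is $p(\mu) = 1/L$. The evidence for $M_1$ is
$$P(X\mid M_1) = \frac{1}{\sqrt{2\pi\sigma^2/n}}\exp\left\{-\frac{(\bar{x}-\mu_0)^2}{2\sigma^2/n}\right\} = \frac{\sqrt{n}}{\sqrt{2\pi\sigma^2}}\exp\left\{-\chi^2/2\right\}.$$
For $M_2$, marginalize over $\mu$:
$$P(X\mid M_2) = \int P(X\mid\mu, M_2)\,P(\mu\mid M_2)\,d\mu = \frac{1}{L}\int\frac{1}{\sqrt{2\pi\sigma^2/n}}\exp\left\{-\frac{(\bar{x}-\mu)^2}{2\sigma^2/n}\right\}d\mu \approx \frac{1}{L},$$
where the approximation holds when the interval is wide enough to contain essentially all of the likelihood mass. Then, the Bayes factor is derived as
$$B = \frac{M_1}{M_2} = \frac{\sqrt{n}\,L}{\sqrt{2\pi\sigma^2}}\exp\left\{-\chi^2/2\right\}.$$
For a fixed $\chi^2$ value, as $n \to \infty$ the Bayes factor favours $\mu = \mu_0$, since $B \to \infty$.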
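The dependence of $B$ on $n$ at a fixed $\chi^2$ can be seen numerically. The sketch below evaluates the formula above; the function name, the interval length and the other inputs are hypothetical and serve only to illustrate the growth with $\sqrt{n}$.

```python
import math


def bayes_factor_point_vs_interval(xbar, mu0, sigma2, n, length):
    """B = sqrt(n) * L / sqrt(2*pi*sigma2) * exp(-chi2 / 2) for M1: mu = mu0 vs M2: uniform on an interval."""
    chi2 = (xbar - mu0) ** 2 / (sigma2 / n)
    return math.sqrt(n) * length / math.sqrt(2 * math.pi * sigma2) * math.exp(-chi2 / 2)


# Keep chi2 fixed at 4 (a "two standard errors" result) while n grows.
mu0, sigma2, length = 0.0, 1.0, 10.0
for n in (100, 10_000, 1_000_000):
    xbar = mu0 + 2 * math.sqrt(sigma2 / n)
    print(n, bayes_factor_point_vs_interval(xbar, mu0, sigma2, n, length))
```

Each hundredfold increase in $n$ multiplies $B$ by ten, even though the frequentist statistic is the same in every row; this is the mechanism behind Lindley's paradox.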
32 For $n$ samples from a distribution with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{x} = \sum_{i=1}^{n}x_i/n \sim N(\mu, \sigma^2/n)$.
Improper prior and example
When the prior is a function with $\int_{\Theta}p(\theta)\,d\theta = \infty$, the prior is not a pdf, but the posterior is still a valid pdf as long as the marginal distribution $p(x) = \int_{\Theta}p(x\mid\theta)\,p(\theta)\,d\theta$ is well defined.
For the univariate normal distribution with known variance, if the uniform prior $p(\mu) = 1$ is used when there is no prior information, then $\int_{\Theta}p(\mu)\,d\mu = \infty$. However, the corresponding marginal distribution $p(X) = \int_{\Theta}p(X\mid\mu)\,d\mu$ is
$$\int(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{\sum(x_i-\mu)^2}{2\sigma^2}\right\}d\mu = \frac{(2\pi\sigma^2)^{-(n-1)/2}}{\sqrt{n}}\exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\},$$
so that the posterior becomes $p(\mu\mid X) = N(\mu\mid\bar{x}, \sigma^2/n)$, as shown before. For the Bayesian $t$-test, the Jeffreys prior, which is an improper prior, is used; the resulting posterior still has area 1 under the curve.
3.4.3
One sample test for equal means
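Suppose there are two samples of sizes $n_1$ and $n_2$, with $n = n_1 + n_2$, where $X_{1i} \sim N(\mu_1, \sigma^2)$ and $X_{2j} \sim N(\mu_2, \sigma^2)$ share a common variance, $i = 1, \dots, n_1$, $j = 1, \dots, n_2$. First, the frequentist two-sided $t$-test aims to find whether the means of the two groups differ, $H_0: \mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0$ vs. $H_a: \mu_1 \neq \mu_2 \Leftrightarrow \mu_1 - \mu_2 \neq 0$. Then, the two-sample $t$-statistic is
$$t = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{n_1+n_2}{n_1 n_2}\cdot\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}}.$$
Let $\mathbf{x} = \{\mathbf{x}_1, \mathbf{x}_2\}$ where $\mathbf{x}_1 = (x_1, \dots, x_{n_1})'$ and $\mathbf{x}_2 = (x_{n_1+1}, \dots, x_{n_1+n_2})'$. The goal is to test
$$H_0: x_i\mid\mu,\sigma^2 \sim N(\mu, \sigma^2), \quad 1 \leq i \leq n$$
against
$$H_a: x_i\mid\mu_1,\sigma_1^2 \sim N(\mu_1, \sigma_1^2), \quad 1 \leq i \leq n_1,$$
and
$$x_i\mid\mu_2,\sigma_2^2 \sim N(\mu_2, \sigma_2^2), \quad n_1+1 \leq i \leq n_1+n_2.$$
However, the Bayesian approach places the prior on the standardized difference of the means, $\delta = \frac{\mu_1-\mu_2}{\sigma}$. In the case
$$P_0 = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{\sum x_i^2}{2\sigma^2}\right\}, \qquad P_1 = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{\sum(x_i-\mu_1)^2}{2\sigma^2}\right\},$$
the Bayes factor is derived as
$$BF_{01} = \frac{P_0}{P_1} = \exp\left\{-\frac{n\mu_1}{2\sigma^2}(2\bar{x}-\mu_1)\right\}.$$
When instead $\mu_1$ is given a normal prior with variance $\tau_0^2$ (taken here to be centered at 0, which is what the following formula corresponds to), integrating over $\mu_1$ gives, with $k = n + \sigma^2/\tau_0^2$,
$$BF_{01} = \left(\frac{k}{\sigma^2/\tau_0^2}\right)^{1/2}\exp\left\{-\frac{n^2\bar{x}^2}{2k\sigma^2}\right\},$$
so that the goal is to derive the Jeffreys-Zellner-Siow (JZS) Bayes factor.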
Chapter 4
Appendix
4.1
Extension of Bayesian distribution
4.1.1
EM (expectation-maximization) algorithm for MLE example
When part of the data set is missing, the prediction step starts from initial estimates $\tilde{\boldsymbol{\mu}}$ and $\tilde{\Sigma}$ and predicts the contribution of the missing values to the sufficient statistics.
Algorithm:
Assume that the population mean $\boldsymbol{\mu}$ and covariance $\Sigma$ are unknown and must be estimated.
1. Prediction: given estimates $\tilde{\theta}$ of the unknown parameters, predict the contribution of any missing observation to the complete-data sufficient statistics.
2. Estimation: using the predicted sufficient statistics, compute revised estimates of the parameters.
3. Iterate until the revised estimates do not differ appreciably from the estimates obtained previously.
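The three steps above can be sketched for the simplest possible case. The example below is a deliberately reduced, univariate version (a single normal variable with some values missing completely at random), not the multivariate procedure itself; the function name, starting values and data are hypothetical.

```python
def em_normal_with_missing(observed, n_missing, tol=1e-10, max_iter=10_000):
    """EM estimates (mean, variance) for a univariate normal with n_missing unobserved values."""
    n_obs = len(observed)
    n = n_obs + n_missing
    s1_obs = sum(observed)                  # observed part of sum(x)
    s2_obs = sum(x * x for x in observed)   # observed part of sum(x^2)
    mu, var = 0.0, 1.0                      # arbitrary starting estimates
    for _ in range(max_iter):
        # 1. Prediction: expected contribution of each missing value to the
        #    sufficient statistics, E[x] = mu and E[x^2] = mu^2 + var.
        t1 = s1_obs + n_missing * mu
        t2 = s2_obs + n_missing * (mu * mu + var)
        # 2. Estimation: maximum likelihood estimates from the completed statistics.
        new_mu = t1 / n
        new_var = t2 / n - new_mu ** 2
        # 3. Iterate until the revised estimates stop changing.
        if abs(new_mu - mu) < tol and abs(new_var - var) < tol:
            return new_mu, new_var
        mu, var = new_mu, new_var
    return mu, var


print(em_normal_with_missing([2.0, 4.0, 6.0, 8.0], n_missing=2))
```

In this reduced setting the missing values carry no information, so the iteration converges to the MLE based on the observed values alone (mean 5 and variance 5 for the data shown); with several correlated variables, the prediction step would instead use the conditional distribution of the missing components given the observed ones.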