Bayesian Hotelling’s π 2 Lim, Kyuson November 4, 2021 STA498 2 Lim, Kyuson Contents 1 Acknowledgement 7 2 Multivariate Normal and Hypothesis Testing 9 2.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Multivariate Normal distribution . . . . . . . . . . . . . . . . . 9 2.1.2 Distribution of (x − π) 0πΊ−1 (x − π) . . . . . . . . . . . . . . . . 9 2.1.3 MLE of π and πΊ . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.4 The sampling distribution of S and xΜ . . . . . . . . . . . . . . . 11 2.1.5 Hypothesis testing when π, Σ is known . . . . . . . . . . . . . 11 2.1.6 Hypothesis testing when π, Σ is unknown . . . . . . . . . . . . 12 Confidence region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Univariate π‘-interval . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Bonferroni’s Simultaneous Confidence interval . . . . . . . . . 17 2.2.3 Simultaneous π 2 -intervals . . . . . . . . . . . . . . . . . . . . 17 2.2.4 Comparison between Simultaneous π 2 -intervals and Bonferroni’s Confidence intervals . . . . . . . . . . . . . . . . . . . . 18 Multivariate Quality-Control (QC) . . . . . . . . . . . . . . . . 19 Comparing mean vectors of two population . . . . . . . . . . . . . . . 21 2.3.1 Pooled sample covariance when π1 , π2 is small and Σ = Σ1 = Σ2 21 2.3.2 Hypothesis test with small samples when Σ1 = Σ2 . . . . . . . . 22 2.3.3 Confidence intervals with small samples when Σ1 = Σ2 . . . . . 22 2.3.4 Behrens-Fisher problem . . . . . . . . . . . . . . . . . . . . . 23 2.3.5 Heterogeneous covariance matrices with large sample size . . . 23 2.3.6 Box’s M test (Bartlett’s test) . . . . . . . . . . . . . . . . . . . 23 2.2 2.2.5 2.3 3 STA498 2.4 3 Lim, Kyuson MANOVA (Multivariate Analysis Of Variance) . . . . . . . . . . . . . 24 2.4.1 Sum of Squares (TSS = SSπ‘π +SSπππ ) . . . . . . . . . . . . . . . 24 2.4.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.3 Distribution of Wilk’s Lambda . . . . . . . . . . . . . . . . . . 26 2.4.4 Large Sample property for modification of π²∗ . . . . . . . . . . 26 2.4.5 Simultaneous Confidence Intervals for treatment effect . . . . . 26 Bayesian Alternative approach 3.0.1 3.1 3.2 3.3 3.4 4 27 Overview: Univariate Binomial distribution with known and unknown parameter . . . . . . . . . . . . . . . . . . . . . . . . 29 Conditional distribution of the subset . . . . . . . . . . . . . . . . . . . 31 3.1.1 Law of total expectation . . . . . . . . . . . . . . . . . . . . . 32 3.1.2 Conditional expectation (MMSE) . . . . . . . . . . . . . . . . 33 3.1.3 Laplace’s law of succession . . . . . . . . . . . . . . . . . . . 34 3.1.4 Bayesian Hypothesis testing . . . . . . . . . . . . . . . . . . . 35 3.1.5 Bayesian Interval Estimation . . . . . . . . . . . . . . . . . . . 37 Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.1 Conjugate Prior . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.2 Univariate Normal distribution Conjugate Prior with known variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.3 Non-informative Prior . . . . . . . . . . . . . . . . . . . . . . 42 3.2.4 Univariate Normal distribution Conjugate Prior with unknown variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3.1 Maximum A Posteriori (MAP) . . . . . . . . . . . . . . . . . . 45 3.3.2 Multivariate Normal distribution with known Σ . . 
. . . . . . . 46 3.3.3 Multivariate Normal distribution with unknown Σ . . . . . . . . 48 3.3.4 Lindley’s Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.5 Bernstein-von Mises theorem . . . . . . . . . . . . . . . . . . 52 Goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.1 53 Bayes factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . CONTENTS Lim, Kyuson 3.4.2 Bayes factor: hypothesis testing . . . . . . . . . . . . . . . . . STA498 54 3.4.3 One sample test for equal means . . . . . . . . . . . . . . . . . 55 4 Appendix 57 4.1 Extension of Bayesian distribution . . . . . . . . . . . . . . . . . . . . 57 4.1.1 57 EM (expectation-maximizing) algorithm for MLE example . . . CONTENTS 5 STA498 6 Lim, Kyuson CONTENTS Chapter 1 Acknowledgement First, the chapter of multivariate Normal and Hypothesis Testing explains about the construction and concepts of multivariate normal and relevant statistical inference as to apply. The notation and interpretation is all of multivariate random variables to consider for. Second, the basic information starts with frequentist approach in understanding the Bayesian statistics. However, the chapter is mainly about Bayesian approach apart from frequentist approach for interpretation where majority of concept and derivation lies on Bayesian approach to consider for. The chapter discuss for the hypothesis testing for Bayesian approach and derivation for posterior distribution of the univariate normal distribution as well as Bayes posterior estimator. The normal distribution is the main topic of Bayesian inference for the posterior distribution where multivariate statistics concept is introduced and used to build up the knowledge. The goal is to expand for bivariate and multivariate normal distribution including Wishart distribution. Also, the idea of relative belief ratio and normal distribution for understanding posterior distribution is discussed. 7 STA498 8 Lim, Kyuson CHAPTER 1. ACKNOWLEDGEMENT Chapter 2 Multivariate Normal and Hypothesis Testing 2.1 2.1.1 Basic definitions Multivariate Normal distribution If x ∼ π π (π, πΊ), then the PDF of x 1 is π (x) = 1 0 −1 exp − (x − π) πΊ (x − π) , 2 1 (2π) π/2 |πΊ| 1/2 where (x− π) 0πΊ−1 (x− π) is the squared Mahalanobis distance 2 between x and population mean vector π as a quadratic term. Notice that the PDF does not exists if πΊ is not positive definite 3, which implies |πΊ| = 0. For Gaussian function exp − 21 (x − π), the normalizing constant (2π)1 π/2 is multiplied so the area under the curve is 1. 2.1.2 Distribution of (x − π) 0πΊ−1 (x − π) For x ∼ π π (π, πΊ), 0 −1 (x − π) πΊ (x − π) = {πΊ −1/2 0 (x − π)} {πΊ −1/2 π ∑οΈ 1 0 (x − π)}4 ⇔ √ eπ (x − π) = z0z, ππ π=1 1The constant probability density contour of the function is defined to be C = {x : π (x) = π 0 ⇔ x : (x − π) 0πΊ−1 (x − π) = π2 } for connections of points. √οΈ 2For arbitrary distance of π and π, π (π, π) = (x − y) 0S−1 (x − y), where x = [π₯ 1 , ..., π₯ π ] 0, y = [π¦ 1 , ..., π¦ π ] 0 and π is the sample covariance matrix of all measurements on p variables. 3By the spectral decomposition, πΊ = Qπ²Q is positive definite if and only if ππ ≥ 0 for eigenvalues. Íπ 1 0 4Notice that πΊ−1 = π=1 ππ ei ei . 9 STA498 where Lim, Kyuson z ∼ π π (0, I) = π ∑οΈ π§π2 , where π§π ∼ π (0, 1), π=1 as π(2π) is defined as the distribution of 2.1.3 2 π=1 π§π Íπ such that (x − π) 0πΊ−1 (x − π) ∼ π(2π) . 
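As a quick numerical check of this result, the following minimal sketch (Python, assuming NumPy and SciPy are available; the dimension, mean vector, and covariance are arbitrary illustrative choices, not values from the text) simulates multivariate normal draws and compares the empirical quantiles of the squared Mahalanobis distance (x − μ)'Σ⁻¹(x − μ) with the χ²_p quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative parameters (hypothetical): p = 3 with an arbitrary SPD covariance.
p = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # positive definite by construction

X = rng.multivariate_normal(mu, Sigma, size=100_000)
diff = X - mu
# Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu) for every row.
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)

# The empirical quantiles should track the chi-square(p) quantiles.
for q in (0.50, 0.90, 0.95, 0.99):
    print(q, np.quantile(d2, q), stats.chi2.ppf(q, df=p))
```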
MLE of π and πΊ πˆ = xΜ and πΊˆ = 1 π Íπ π=1 (xπ − xΜ)(xπ − xΜ) 0 = Sπ = π−1 π S, where π 1 ∑οΈ S= (xπ − xΜ)(xπ − xΜ) 0 π − 1 π=1 Now, πΈ (S) = πΊ and πΈ ( xΜ) = π5 are unbiased estimators. π π π ∑οΈ 1 ∑οΈ 0 π−1 1 ∑οΈ 0 0 0 S= xπ xπ − xΜxΜ0, xπ xπ − 2 xπ xΜ + πxΜxΜ = π π π=1 π π=1 π=1 where πΈ (xπ x0π ) = πΊ + ππ0 and πΈ ( xΜxΜ0) = cov( xΜ) + πΈ ( xΜ)πΈ ( xΜ0) = π1 πΊ + ππ0. Hence, by taking the expected value π π−1 1 ∑οΈ 0 1 π−1 0 0 0 πΈ (S) = πΈ xπ xπ − πΈ ( xΜxΜ ) = πΊ + ππ − πΊ + ππ = πΊ π π π π π=1 π π π → πΊ, S − → πΊ. Asymptotically, S could be replaced by Sπ According to LLN, xΜ − → π, Sπ − 1 Íπ or πΊ. By definition of S = {π ππ = π−1 π=1 (x ππ − xΜπ )(x π π − xΜ π )}, π π ππ = π 1 ∑οΈ 1 ∑οΈ (x ππ − xΜπ )(x π π − xΜ π ) = (x ππ − ππ + ππ − xΜπ )(x π π − π π + π π − xΜ π ) π − 1 π=1 π − 1 π=1 π 1 ∑οΈ π (x ππ − ππ )(x π π − π π ) + ( xΜπ − ππ )( xΜ π − π π ), = π − 1 π=1 π−1 where the second term converges to 0. By applying LLN, π π−1 ∑οΈ π 1 1 π (x ππ −ππ )(x π π −π π ) = 1− πΈ {(x ππ −ππ )(x π π −π π )} − → πππ , as π → ∞. π π=1 π Equivalently, Sπ is a consistent estimator for πΊ which is analogous to univariate case 6. By CLT where xπ ∼ π π (π0 , πΊ) and xΜ ∼ π π (π0 , π1 πΊ) √ π π( xΜ − π0 ) → − π π (π0 , πΊ) Íπ Íπ Íπ 5πΈ ( xΜ) = πΈ π1 π=1 xπ = π=1 πΈ π1 xπ = π1 π=1 π=π 6As π → ∞, π 2π converges to π 2 which is the population variance. 10 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson and STA498 (x − π0 ) 0πΊ−1 (x − π0 ) ∼ π2π such that π π( xΜ − π0 ) 0πΊ−1 ( xΜ − π0 ) → − π2π , for large sample size and π relatively larger than π. 2.1.4 The sampling distribution of S and xΜ 1 xΜ ∼ π π (π, πΊ), π 1 Var( xΜ) = πΊ, π where S and xΜ are independent. As xπ ∼ π π (π, πΊ) and xΜ is a linear combination of xπ , xΜ follows a normal distribution. (π − 1)S = π ∑οΈ 0 (xπ − xΜ)(xπ − xΜ) = π=1 π ∑οΈ π ∑οΈ (xπ − π + π − xΜ)(xπ − π + π − xΜ) 0 = π=1 (xπ − xΜ) (xπ − xΜ) 0 +π(ππ − xΜ)(π − xΜ) 0 −2π(π − xΜ)(π − xΜ) 0 = π=1 π ∑οΈ (xπ − π) 0 −π(π − xΜ)(π − xΜ) 0, π=1 and π ∑οΈ (xπ − π) 0 ∼ ππ (πΊ), π(π − xΜ)(π − xΜ) 0 ∼ π1 (πΊ) π=1 such that (π − 1)S ∼ ππ−1 (πΊ) = π−1 ∑οΈ zπ z0π , zπ ∼ π π (0, πΊ) π=1 The Wishart distribution with π − 1 degree of freedom has a property πΈ {(π − 1)S} = (π − 1)πΊ. 2.1.5 Hypothesis testing when π, Σ is known The statistical inference is based upon the hypothesis test and to construct confidence regions for the parameters of interest. The goal of this chapter is to include two general ideas, including construction of a likelihood ratio test (LRT) based on the multivariate normal distribution, and the unionintersection approach. CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 11 STA498 Univariate test statistics (π known) Lim, Kyuson If π₯ ∼ π1 (0, π 2 ), the hypothesis testing is π»0 : π = π0 vs π»π : π ≠ π0 . For random samples of π₯1 , ..., π₯ π from the Normal population, the test statistics is π§= π₯¯ − π0 √ ∼ π1 (0, 1) π/ π or π§2 = ( π₯¯ − π0 ) 2 ∼ π12 π 2 /π under π»0 . Multivariate generalization (Σ known) If x ∼ π π (π, πΊ) where |πΊ| > 0, then the hypothesis test is π»0 : π = π0 vs π»π : π ≠ π0 . If x1 , ..., xπ is a random sample from a normal population, then the test statistics π§ 2 = π( xΜ − π0 ) 0πΊ−1 ( xΜ − π0 ) ∼ π2π under π»0 . 2.1.6 Hypothesis testing when π, Σ is unknown Univariate test statistics (π unknown) As an estimated mean vector and hypothesized mean vector π0 for the distance measure is defined, the hypothesis testing is π»0 : π = π0 vs π»π : π ≠ π0 . 
The test statistics is π‘= π₯¯ − π0 √ ∼ π‘ π−1 π / π under π»0 , where π 2 = or π‘ 2 = Íπ π=1 ( π₯¯ − π0 ) 2 2 = π( π₯¯ − π0 )(π 2 ) −1 ( π₯¯ − π0 ) ∼ πΉ1,π−1 π 2 /π (π₯ π −π₯) ¯ 2 π−1 . Note that π‘ 2 is the square distance between sample mean π₯¯ and the test value π0 . The distribution of π‘ 2 under π»0 Under the π»0 , −1 π₯¯ − π0 π 2 π₯¯ − π0 π‘ = π( π₯¯ − π0 )(π ) π( π₯¯ − π0 ) = √ √ π/ π π 2 π/ π Íπ 2 −1 ¯ π₯¯ − π0 π₯¯ − π0 π=1 {(π₯π − π₯)/π} = √ √ π−1 π/ π π/ π 2 −1 π π2 /1 ∼ (π (0, 1)) π−1 (π (0, 1)) ⇔ 2 1 ⇔ πΉ1,π−1 π−1 ππ−1 /(π − 1) 2 12 √ 2 −1 √ CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson Multivariate generalization (π unknown) STA498 For π-dimensional vector, π»0 : π = π0 vs π»π : π ≠ π0 . A natural generalization of univariate π‘ 2 is a multivariate analog of test statistics for Hotelling’s π 2 distribution. −1 s π = ( xΜ − π0 ) ( xΜ − π0 ) π 0 2 √ = π( xΜ−π0 ) 0 Íπ π=1 (xπ − xΜ)(xπ − xΜ) 0 π−1 −1 √ π π,π−1 (πΊ) π( xΜ−π0 ) ∼ (π π (0, πΊ)) , π−1 0 −1 (π π (0, πΊ)), which is in the form of (multivariate normal)0 (Wishart distribution / ππ ) −1 (multivariate normal)7. ⇔ (π − 1)(π π (0, I)) 0 {ππ−1 (I)}−1 (π π (0, I)), where I = I π×π In case the π 2 is too large, this means xΜ too far from the π0 such that π»0 is rejected. Hotelling’s π 2 distribution In the case vector d follows the multivariate normal distribution π π (0, I) which is √ π( xΜ − π0 ) (by CLT), and another random vector M (which is S) follows the Wishart distribution, then π(d0Md) (which is π 2 ) has a Hotelling’s π 2 ( π, π) distribution with dimensionality parameter π and π degrees of freedom, based on the observation π and π. If a random vector π₯ follows the Hotelling’s π 2 distribution which is π₯ ∼ π 2 ( π, π), then π− π+1 π₯ ∼ πΉπ,π−π+1 ππ For hypothesis testing, reject π»0 : π = π0 if π2 > π(π − 1) πΉπ,π−π (πΌ) π−π or πΉ= π−π 2 π > πΉπ,π−π (πΌ), π(π − 1) when observed π = π − 1 for the sample size and π = π to be the dimension of πΊ. Computational example The student Kyuson from sample of 15 course marks he has taken at UTM was analyzed based on the classification on the π₯ 1 = MAT, π₯ 2 = STA and π₯ 3 = other courses (for simplicity sample numbers for courses are the same). Question: Is π0 = (99 99 95) 0 plausible for the population mean vector at πΌ = 0.1? 2 7Notice this is analogous to π‘ π−1 = (Normal random variable)0 (chi-square random variable/ ππ ) −1 (Normal random variable) CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 13 STA498 Lim, Kyuson Equivalently, the problem actually is to test π»0 : π = (99 99 95) 0 vs. π»π : π ≠ (99 99 95) 0. At level πΌ = 0.1, reject π»0 if π 2 > π(π−1) π−π πΉπ,π−π (πΌ) = 3(40−1) 40−3 πΉ3,37 (0.1) The sample mean xΜ = (100 100 99) 0 with S computed. 40( xΜ − π) 0S−1 ( xΜ − π) = 8.739. = 7.544. The computed π 2 = Since 8.739 > 7.544 which is the critical value, π»0 is rejected where his true average differ at least for one area, ππ ≠ ππ0 and conclude he is not being honest. Invariant under transformation, Hotelling’s π 2 Moreover, Hotelling’s π 2 is invariant under transformation of the form y = Cx+b, where C π×π for the hypothesis testing of π»0 : πΈ (y) = Cπ0 + b 8. Since yΜ = CxΜ + b and Sπ¦ = 1 π−1 Íπ π=1 (yπ − yΜ)(yπ − yΜ) 0 = CSπ₯ C0, 0 −1 π 2 = π{yΜ − (Cπ0 + π)}0S−1 π¦ { yΜ − (Cπ 0 + π)} = π{C( xΜ − π0 )} (CSπ₯ C) {C( xΜ − π0 )} = π( xΜ − π0 ) 0C0 (CSπ₯ C) −1 C( xΜ − π0 ) = π( xΜ − π0 ) 0 (Sπ₯ ) −1 ( xΜ − π0 ) Normality, Hotelling’s π 2 π 2 = π( xΜ − π0 ) 0S−1 ( xΜ − π0 ) is approximately chi-square distribution with π ππ . 
whenπ0 is correct. Note that the πΉ-distribution of π 2 rely on the normality assumption. Then, the critical value π(π − 1) πΉπ,π−π (πΌ) > π2π (πΌ), π−π but the value is nearly equivalent for larger values of π − π of πΉπ,π−π (πΌ) as π > π − π. In other words, if π >> π then the difference is larger but if π > π then the gap is smaller 9. Likelihood Ratio Test (LRT) The Hotelling’s π 2 test is equivalent to the LRT 10. For hypothesis testing of π»0 : π = π0 vs π»πΌ : π ≠ π0 , the likelihood ratio (π²) is ˆ π/2 maxπΊ πΏ (π0 , πΊ) | πΊ| π²= = max π,πΊ πΏ (π, πΊ) | πΊˆ 0 | 8Instead of π»0 : πΈ (x) = π0 9For example π = 3000 and π = 10, π (π−1) π− π πΉ π,π− π (πΌ) = 16.057 is close to π2π (πΌ) = 15.987 but if (π−1) 2 π = 30 and π = 5, ππ− π πΉ π,π− π (πΌ) = 12.135 is greater than π π (πΌ) = 9.236. 10Note that this is extended to Neyman-Pearson Lemma for uniformly most powerful test. 14 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson STA498 , where the maximum of multivariate normal likelihood of π and πΊ is 1 1 −ππ −ππ , max πΏ (π, πΊ) = max πΏ (π0 , πΊ) = ππ₯ π ππ₯ π π,πΊ πΊ 2 2 (2π) ππ/2 | ΣΜ0 | π/2 (2π) ππ/2 | ΣΜ| π/2 However, while ΣΜ0 is restricted under the π»0 , ΣΜ is unrestricted 11. The LRT reject π»0 if π² < π for the cutoff value π. ˆ is approxiUnder the continuous mapping theorem, −2 ln(π²) = π{ln( πΊˆ 0 ) − ln( πΊ)} 2 mately following the πππ , where ππ = {π + π( π + 1)/2} − {π( π + 1)/2} = π 12 (number of parameters without the restriction of π»0 - number of parameters under π»0 ). Wilk’s Lambda Equivalently, based on the likelihood ration statistics of π² it is derived for π²2/π Íπ −1 ˆ | π=1 (xπ − xΜ)(xπ − xΜ) 0 | | πΊ| π2 2/π π² = = Íπ ⇔ 1+ < ππ | π=1 (xπ − π0 )(xπ − π0 ) 0 | π−1 | πΊˆ 0 | Notice that for large π 2 the likelihood ratio is small and will reject π»0 . Also, the Hotelling’s π 2 , Wilk’s Lambda and LRT are all equivalent. Inverse of Wishart distribution The Wishart distribution which is (π − 1)S ∼ ππ−1 (πΊ) is an multivariate analogue of the Gamma distribution (as the chi-square distribution of z2 is gamma random variable as well). Íπ−1 0 −1 With a reparametrization where x1 , ..., xπ ∼ π (0, S−1 π=1 xπ xπ 0 ), a cov-matrix πΊ = is sampled from the inverse-Wishart distribution, which is π − 1 df and parameter S−1 0 . Hence, πΈ (πΊ−1 ) = (π − 1)S−1 0 , πΈ (πΊ) = 1 1 −1 (S−1 = S0 , 0 ) (π − 1) − π − 1 π− π−2 by the property of Wishart distribution. For large π − 1, S0 = (π − π − 2)πΊ0 is near true covariance matrix of πΊ. Union-Intersection derivation of π 2 If the null hypothesis is not rejected for given a π of π¦ = a0x ∼ π1 (a0 π, a0πΊa) that maximize test statistics of π‘ a2 , then any of univariate null hypothesis π»0,a : a0 π = a0 π0 ⇔ Íπ Íπ 11ΣΜ0 = π1 π=1 (xπ − π0 ) (xπ − π0 ) 0 to be restricted and ΣΜ = π1 π=1 (xπ − xΜ) (xπ − xΜ) 0 to be unrestricted 12 π correspond to π, π( π + 1)/2 correspond to var-cov matrix, πΊ CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 15 STA498 π = π0 is not rejected. Lim, Kyuson First, the π»0 : π = π0 is equivalent of the form π»0,a : a0 π = a0 π0 . The test statistics 0 π¦¯ − a0 π0 2 a π₯¯ − a0 π0 2 {a0 ( xΜ − π0 )}2 π‘a = = √οΈ = π π¦¯ a0 (S/π)a 1 0 a Sa π Hence, if maxa π‘ a2 < π, then π‘a2 < π for any a. Second, the maximum squared t-test is max π‘a2 a −1 S = ( xΜ − π0 ) ( xΜ − π0 ) = π( xΜ − π0 ) 0 (S) −1 ( xΜ − π0 ) = π 2 , π 0 which is the Hotelling’s π 2 distribution 13. 
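Before moving to confidence regions, a compact numerical sketch of the one-sample test may help (Python with NumPy/SciPy assumed; the simulated data and parameter values are hypothetical, not the course-mark example above). It computes T² = n(x̄ − μ₀)'S⁻¹(x̄ − μ₀), the equivalent F statistic (n − p)T²/(p(n − 1)) with an F_{p, n−p} reference distribution, and the T² critical value at level α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: n observations on p variables.
n, p, alpha = 40, 3, 0.10
mu0 = np.array([99.0, 99.0, 95.0])                   # hypothesized mean vector
X = rng.multivariate_normal(mu0 + 1.0, np.diag([4.0, 4.0, 9.0]), size=n)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                          # unbiased sample covariance (divisor n - 1)
diff = xbar - mu0

T2 = n * diff @ np.linalg.solve(S, diff)             # Hotelling's T^2 statistic
F = (n - p) / (p * (n - 1)) * T2                     # equivalent F statistic
p_value = stats.f.sf(F, p, n - p)
critical_T2 = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)

print(T2, F, p_value, critical_T2)                   # reject H0 if T2 > critical_T2
```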
2.2 Confidence region 2.2.1 Univariate π‘-interval Without relationship between multivariate components, the univariate π‘-interval method is constructed as π¦¯ − π π¦ < π‘ π−1 (πΌ/2) = 1 − πΌ, π − π‘ π−1 (πΌ/2) ≤ √οΈ 2 π π¦ /π π¦¯ −π π¦ as √ π 2π¦ /π ∼ π‘ π−1 . √οΈ In other words, the confidence interval of 100(1 − πΌ)% for π π¦ is π¦¯ ± π‘ π−1 (πΌ/2) π 2π¦ /π, where π‘ π−1 (πΌ/2) is the upper percentile. Problem and Bonferroni’s inequality Notice that the π 100(1 − πΌ)% does not cover joint CI. For π π , each π π covers the corresponding ππ and π(π π ) = 1 − πΌπ . Then, for each π π independent π{ππ ∈ π π } = π π(∩π=1 π π ) =1− π π(∪π=1 π ππ ) ≥ 1− π ∑οΈ πΌπ , π=1 which is the Bonferroni’s inequality. In the case πΌπ = πΌ for all π, then 1 − 1 − π πΌ < 1 − πΌ such that if π > 1 the inequality is not guaranteed. Íπ π=1 πΌπ = 13Using the Maximization Lemma, the Cauchy-Schwartz inequality is based to derive the UnionIntersection derivation of π 2 . 16 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson STA498 Equivalently, if π ππ 14 is the the event of making a Type 1 error on πth test, then π(at least one Type 1 error) = π π(∪π=1 π ππ ) ≤ π ∑οΈ π(π ππ ). π=1 For each of π tests, use a significance level of πΌ/π, then the CI coverage rate or Type 1 error rate is at most 100(1 − πΌ)% or πΌ. So the probability that at least one test results in a Type I error is at most πΌ or the chance that at least one CI does not capture the true mean difference is at most 100(1 − πΌ)%. 2.2.2 Bonferroni’s Simultaneous Confidence interval To construct the simultaneous confidence interval for {π1 , ..., π π } by the confidence level πΌ/π for each of π separate univariate CI’s that is √οΈ π ππ π₯¯π ± π‘ π−1 (πΌ/(2π)) , π = 1, .., π. π √οΈ Since π( π₯¯π ±(ππ ∈ πΌ/(2π)) π ππ /π) = 1−πΌ/π, the joint coverage probability ≥ 1−π( πΌπ ) = 1 − πΌ, which now guarantee to be no smaller than 1 − πΌ 15. 2.2.3 Simultaneous π 2 -intervals To construct simultaneous confidence intervals for any linear combinations of a0 π that Íπ is the expected value of π=1 ππ xπ = a0x where x ∼ π π (π, πΊ) with variance a0πΊa, the CI is derived from univariate-intersection derivation of π 2 0 (a0xΜ − a0 π0 ) 2 a π₯¯ − a0 π0 2 {a0 ( xΜ − π0 )}2 = √οΈ = π‘a = var(a0x)/π a0 (S/π)a 1 0 a Sa π √οΈ ⇔ a0xΜ ± π(π − 1) πΉπ,π−π (πΌ) π−π where max π‘a2 = π 2 ∼ a √οΈ a0Sa , π π(π − 1) πΉπ,π−π π−π For each ππ , √οΈ π₯¯π ± π(π − 1) πΉπ,π−π (πΌ) π−π √οΈ π ππ π 14Notice that this the confidence interval that is √οΈ not covered for ππ . 15Note that π‘ π−1 (πΌ/2π) could be replaces with (π − 1) ππΉ π,π− π (πΌ)/(π − π) for equivalency, by the property of Hotelling π 2 . CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 17 STA498 Therefore, Lim, Kyuson π π‘a2 π(π − 1) π(π − 1) 2 πΉπ,π−π (πΌ) = π max π‘a ≤ πΉπ,π−π (πΌ) ≤ a π−π π−π π(π − 1) 2 =π π ≤ πΉπ,π−π (πΌ) = 1 − πΌ π−π The drawback of the simultaneous π 2 -intervals is less powerful due to wider range of interval, which lead to be less powerful and conservative. 
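Since the two kinds of intervals differ only in their critical multiplier, a short sketch makes the comparison concrete before the next subsection (Python with NumPy/SciPy assumed; the sample summaries are hypothetical). For each component mean μ_i it builds the simultaneous T² interval x̄_i ± sqrt(p(n−1)/(n−p) · F_{p,n−p}(α)) · sqrt(s_ii/n) and the Bonferroni interval x̄_i ± t_{n−1}(α/(2p)) · sqrt(s_ii/n).

```python
import numpy as np
from scipy import stats

# Hypothetical summaries: sample mean and covariance for p = 3 variables, n = 40.
n, p, alpha = 40, 3, 0.05
xbar = np.array([100.0, 100.0, 99.0])
S = np.array([[16.0, 4.0, 2.0],
              [ 4.0, 9.0, 1.0],
              [ 2.0, 1.0, 4.0]])
se = np.sqrt(np.diag(S) / n)                                   # sqrt(s_ii / n)

c_T2 = np.sqrt(p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p))
c_bonf = stats.t.ppf(1 - alpha / (2 * p), df=n - 1)

for i in range(p):
    t2_int = (xbar[i] - c_T2 * se[i], xbar[i] + c_T2 * se[i])
    bonf_int = (xbar[i] - c_bonf * se[i], xbar[i] + c_bonf * se[i])
    print(i, t2_int, bonf_int)   # the T^2 intervals are wider (conservative), Bonferroni narrower
```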
2.2.4 Comparison between Simultaneous π 2 -intervals and Bonferroni’s Confidence intervals Criteria shape joint coverage rate π‘-intervals Bonferroni’s π‘-intervals narrower, powerful < 100(1 − πΌ)% depends on number of intervals Simultaneous π 2 -intervals winder, conservative ≥ 100(1 − πΌ)% does not depend For each ππ , the simultaneous confidence intervals is computed as √οΈ √οΈ π(π − 1) π ππ π₯¯π ± πΉπ,π−π (πΌ) , π−π π but the Bonferroni’s confidence intervals for ππ is computed as √οΈ πΌ π ππ π₯¯π ± π‘ π−1 , where π = 1, .., π. 2π π Confidence Region Denoted as π (x), which is the multivariate extensions of univariate confidence interval (CI) where xπ ∼ π π (π, πΊ) for π = 1, ..., π. Then, for mean vector π π(π − 1) 0 −1 π π( xΜ − π) S ( xΜ − π) ≤ πΉπ,π−π (πΌ) = 1 − πΌ π−π Cantered at xΜ and computing S for the set yields π(π − 1) 0 −1 πΉπ,π−π (πΌ) π (x) = π : π( xΜ − π) S ( xΜ − π) ≤ π−π However, the on half-length along the normalized eigenvector eπ from S16 gives √οΈ √οΈ √οΈ ππ π(π − 1) ππ √οΈ 2 πΉπ,π−π (πΌ) = π (πΌ), π π−π π for each eigenvalues of ππ from S, π = 1, ..., π. 16Correlation R for eigenvalues are computed to be equivalent to the covariance matrix of S for standardized eigenvalues. 18 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson 2.2.5 Multivariate Quality-Control (QC) STA498 Univariate paired t-test For π₯ 1π and π₯ 2π which is the response to treatments, let ππ = π₯π1 − π₯π2 with ππ ∼ π (π π , ππ2 ), for hypothesis testing of π»0 : π π = 0 vs. π»π : π π ≠ 0. Then, the test statistics is π‘= π¯ √ ∼ π‘ π−1 , π π / π under π»0 . If |π‘| > π‘ π−1 (πΌ/2), then reject π»0 . The confidence interval for π π is π π π ± π‘ π−1 (πΌ/2) √ π Multivariate extension in comparison of confidence intervals and confidence region Suppose for π units that there are 2 treatments to be x1π = (π₯ 1π1 , ..., π₯ 1π π ) 0 and x2π = (π₯ 2π1 , ..., π₯ 2π π ) 0 with dπ = x1π − x2π ⇔ ππ π = π₯ 1π π − π₯ 2π π , for all π = 1, ..., π. For dπ ∼ π π (π π , πΊπ , π»0 : π π = 0 vs. π»0 : π π ≠ 0. Then, the test statistics is π 2 = πdΜ0S−1 π dΜ √ 0 Íπ π=1 dπ π=1 (dπ Íπ = π π Íπ − dΜ)(dπ − dΜ) 0 √ π(π − 1) π=1 dπ π πΉπ,π−π (πΌ), ∼ π−1 π π−π under π»0 where the 100(1 − πΌ)% confidence region of π π is π(π − 1) 0 −1 πΉπ,π−π (πΌ) , π (π π ) = π π : π( dΜ − π π ) Sπ ( dΜ − π π ) ≤ π−π which is analogous to the confidence region for xΜ and S π(π − 1) 0 −1 π : π( xΜ − π) S ( xΜ − π) ≤ πΉπ,π−π (πΌ) π−π When π 2 > π(π−1) π−π πΉπ,π−π (πΌ) for the critical value, then reject π»0 . However, 100(1 − πΌ)% Simultaneous π 2 confidence intervals for individual mean differences {π π π } 17 is given by √οΈ π¯π ± √οΈ π(π − 1) πΉπ,π−π (πΌ) π−π π 2π π , π = 1..., π. π (π−1) 2 17Note that when π − π is large replaced by ππ− π πΉ π,π− π (πΌ) with π π (πΌ) by the property of Hotelling’s 2 π and the normality assumption is not necessary. CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 19 STA498 Lim, Kyuson Note that π¯π is the diagonal element of dΜ and π 2π π is the diagonal element of Sπ . This is analogous to ππ of √οΈ √οΈ π(π − 1) π ππ π₯¯π ± πΉπ,π−π (πΌ) . π−π π Also, 100(1 − πΌ)% Bonferroni’s confidence intervals for {ππ } is given by √οΈ π 2π π π¯π ± π‘ π−1 (πΌ/2π) , π = 1..., π. π This is analogous to √οΈ π₯¯π ± π‘ π−1 (πΌ/(2π)) π ππ , π = 1, .., π. π shown before. Simple Block design For π treatments over successive period of time, observation data is denoted as xπ = π₯π1 , ..., π₯ππ and π = π1 , ..., π π . The goal is to compare the components of π. The contrast matrix C is found for two ways. 
First, the contrast matrix is set for control treatments compared with other treatment, which is π − π2 1 −1 0 · · · 0 © 1 ª π ­ π1 − π3 ® ©­1 0 −1 · · · 0 ª® © .1 ª ­ . ®=­ ­ ® ­ .. ®® =C1 π. ­ .. ® ­ . . . ® ­ ® 1 0 0 · · · −1¬ « π π ¬ « π1 − π π ¬ « The other way is a successive treatments for contrast matrix π − π2 1 −1 0 · · · 0 © 1 ª π ­ π2 − π3 ® ©­0 1 −1 · · · 0 ª® © .1 ª ­ ­ ® =­ ® ­ .. ®® =C2 π. .. ­ ® . . . ­ ® . ­ ® 0 0 · · · 1 −1¬ « π π ¬ « π π−1 − π π ¬ « Note that the contrast matrix C1 and C2 is (π − 1) × π matrix. However, in order to test that there is no difference in treatments π»0 : Cπ = 0 vs. π»π : Cπ ≠ 0. The test statistics is the Hotelling’s π 2 as xπ ∼ π π (π, πΊ), π 2 = π(CxΜ) 0 (CSC0) −1 (CxΜ) ∼ (π − 1)(π − 1) πΉπ−1,π−π+1 (πΌ), π−π+1 and the 100(1 − πΌ)% confidence region for C is (π − 1)(π − 1) 0 0 −1 Cπ : π(CxΜ − Cπ) (CSC ) (CxΜ − Cπ) ≤ πΉπ−1,π−π+1 (πΌ) , π−π+1 20 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson STA498 compared to the 100(1 − πΌ)% simultaneous π 2 confidence intervals for 1-dimensional {cπ π} is √οΈ √οΈ c0π Sc π (π − 1)(π − 1) 0 πΉπ−1,π−π+1 (πΌ) c π xΜ ± π−π+1 π Computational Example A sample of 20 courses were administrated with 4 assessments ways: Treatment 1: final exam and no term tests Treatment 2: no exam and no term tests Treatment 3: final exam and term test Treatment 4: no exam and term test The outcome variable is % for students marks. (π3 + π4 ) − (π1 + π2 ): effect of having term test (π1 + π3 ) − (π2 + π4 ): effect of having final exam (π1 + π4 ) − (π2 + π3 ): interaction between term test and final exam −1 −1 1 1 © ª C = ­ 1 −1 1 −1® « 1 −1 −1 1 ¬ Then, π»0 : Cπ = 0 vs. π»π : Cπ ≠ 0 at πΌ = 0.05. From the data, π 2 = π(CxΜ) 0 (CSC0) −1 CxΜ = 20.5. (π−1) 3×19 At πΌ = 0.05, the critical value is (π−1) π−π+1 πΉπ−1,π−π+1 (πΌ) = 17 πΉ3,17 (0.05) = 10.73. Since π 2 > 10.93, reject π»0 at the level of πΌ = 0.05 and conclude that there is a significant difference in contrast for the effect of midterm and final exam for courses to be offered. Within 95% simultaneous confidence intervals, if the confidence interval does not contain 0 then there is an effect by the presence of either term test or final exam. Note that the interaction effect of two factors is not significant if the confidence interval does contain 0. 2.3 Comparing mean vectors of two population When x1π ∼ π (π1 , Σ1 ) and x2 π ∼ π (π2 , Σ2 ) for π = 1, .., π1 and π = 1, ..., π2 for π-variate population and independent, then the goal is to make inference on π1 − π2 . 2.3.1 Pooled sample covariance when π1 , π2 is small and Σ = Σ1 = Σ2 As x1 ∼ π (π, Σ1 ), x1 ∼ π (π, Σ1 ), let xΜ π = xΜ) 0. 1 ππ Íπ π π=1 x ππ and S π = 1 π π −1 Íπ π π=1 (x ππ − xΜ)(x ππ − CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 21 STA498 Lim, Kyuson The pooled sample covariance is the weighted mean of two samples. Íπ 1 0 Íπ2 (x − xΜ )(x − xΜ ) 0 π1 − 1 π2 − 1 2π 2 2π 2 π=1 (x1π − xΜ1 )(x1π − xΜ1 ) S ππππππ = + π=1 ⇔ S1 + S2 π1 − 1 + π2 − 1 π1 − 1 + π2 − 1 π1 + π2 − 2 π1 + π2 − 2 Hypothesis test with small samples when Σ1 = Σ2 2.3.2 For π»0 : π1 − π2 = πΏ0 18 vs. π»π : π1 − π2 ≠ πΏ0 , test statistics of Hotelling’s under π»0 2 π = (x1 − x2 − πΏ0 ) 1 1 = + π1 π2 0 −1/2 (x1 − x2 − π1 + −1 1 1 + S ππππππ (x1 − x2 − πΏ0 ) π1 π2 π2 ) 0S−1 ππππππ 1 1 + π1 π2 −1/2 (x1 − x2 − π1 + π2 ), which follows (multivariate normal)(Wishart / ππ )(multivariate normal) such that ππ1 +π2 −2 ⇔ π π (0, πΊ) π1 + π2 − 2 0 −1 π π (0, πΊ) = π(π1 + π2 − 2) πΉπ,π1 +π2 −π−1 . 
π1 + π2 − π − 1 The hypothesis testing reject π»0 if π2 > 2.3.3 π(π1 + π2 − 2) 2 πΉπ,π1 +π2 −π−1 (πΌ) = πππππ‘ππππ . π1 + π2 − π − 1 Confidence intervals with small samples when Σ1 = Σ2 Confidence region Analogously, the half-length with axes along e1 , ..., e π and ellipsoid centered at x1 − x2 is √οΈ √οΈ 1 1 π(π1 + π2 − 2) + πΉπ,π1 +π2 −π−1 (πΌ) , π = 1, ..., π. ππ π1 π2 π1 + π2 − π − 1 Simultaneous π 2 Confidence intervals a0 (π1 − π2 ) √οΈ √οΈ 1 1 0 2 0 a ( xΜ1 − xΜ2 ) ± πππππ‘ππππ a + S ππππππ a . π1 π2 Notice that this is analogous to each confidence intervals √οΈ √οΈ π(π1 + π2 − 2) 1 1 ( π₯¯1 π − π₯¯2 π ) ± πΉπ,π1 +π2 −π−1 (πΌ) + π π π,ππππππ . π1 + π2 − π − 1 π1 π2 18ie. πΏ0 = 0 22 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson Bonferroni’s Confidence Intervals STA498 ( π₯¯1 π − π₯¯2 π ) ± π‘ π1 +π2 −2 2.3.4 πΌ 2π √οΈ 1 1 + π π π,ππππππ . π1 π2 Behrens-Fisher problem In the case of heterogeneous covariance πΊ1 ≠ πΊ2 with small (moderate) sample sizes π1 , π2 greater than π, the estimator of sample mean difference yields sample covariance S1 S2 (π − 1)(S1 + S2 ) 2 Sπ = + ⇔ π1 π2 2(π − 1) π such that the test statistics under π»0 : π1 − π2 = 0 vs. π»π : π1 − π2 ≠ 0 is π 2 = ( xΜ1 − xΜ2 ) 0S−1 π ( xΜ1 − xΜ2 ) ∼ π£π πΉπ,π£−π+1 , π£− π+1 for π number of variables where π£=Í 2 π=1 π + π2 1 1 −1 2 + tr ππ tr ππ Sπ Sπ 1 −1 2 ππ Sπ Sπ where min(π1 , π2 ) ≤ π£ ≤ π1 + π2 19. Hence, reject π»0 if π 2 > 2.3.5 , π£π π£−π+1 πΉπ,π£−π+1 (πΌ). Heterogeneous covariance matrices with large sample size The test statistics under π»0 with same Sπ is 2 π 2 = ( xΜ1 − xΜ2 ) 0S−1 π ( xΜ1 − xΜ2 ) ∼ π π with the assumption that π1 − π and π2 − π is large enough. Hence, reject π»0 if π 2 > π2π (πΌ). 2.3.6 Box’s M test (Bartlett’s test) The goal is to hypothesis test for the equality of covariance matrices, π»0 : πΊ1 = · · · = πΊπ = πΊ vs. π»π : at least one πΊπ ≠ πΊ π , for some π ≠ π with chi-square approximation. Under multivariate normal distribution, the LRT20 Íπ (ππ −1)/2 (ππ − 1)Sπ |Sπ | , where S ππππππ = Íπ=1 Λ = Ππ π |S ππππππ | π=1 (ππ − 1) 19The approximation reduces to Welch π‘-test in univariate (π = 1), π‘ = 20Formerly, under π»0 : π = π0 maxπΊ πΏ ( π0 ,πΊ) max π,πΊ πΏ ( π,πΊ) = | πΊˆ | π/2 . | πΊˆ 0 | π₯¯ 1 − π₯¯ 2 π 2 1 π1 π 2 + π2 2 and π£ = π 2 1 π1 π 2 2 + π2 π 4 1 π 2 ( π1 −1) 1 2 π 4 2 2 ( π2 −1) +π CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 23 STA498 Lim, Kyuson for ππ that is the sample size for the πth group of Sπ sample covariance. ∑οΈ π π ∑οΈ ⇔ −2 πππΛ = π = (ππ − 1) πππ|S ππππππ | − {(ππ − 1) πππ|Sπ |}, π=1 where π=1 π 1 2π 2 + 3π − 1 ∑οΈ 1 , − Íπ π’= 6( π + 1)(π − 1) π=1 ππ − 1 π=1 (ππ − 1) as π is the number of variables and π is the number of groups. The test statistics is 1 π( π + 1)(π − 1) 2 under π»0 21. While Box’s M test is sensitive to non-normality, MANOVA test of means or treatments are robust to non-normality 22. πΆ = (1 − π’)π ∼ ππ£2 , π£ = 2.4 MANOVA (Multivariate Analysis Of Variance) The one-way MANOVA model for comparing π population mean vector is illustrated as Xπ π = π + ππ + eπ π , eπ π ∼ π π (0, πΊ), which is random vector = overall mean+ πth population treatment effect +random error, where there are π populations and ππ observations ({xπ1 , ..., xπππ }) for population π with the population mean ππ , π = 1, .., π, which follows Wishart distribution. Íπ Constraint on π=1 ππ ππ = 0 define the unique model parameters. 
For vector of observations, decomposes into xπ π = xΜ + ( xΜπ − xΜ) + (xπ π − xΜπ ), which is also observation = overall sample mean π+ ˆ estimated treatment effect πˆπ + residual error, eΜπ π . Note that the normality assumption for samples can be relaxed when the sample size {ππ } is large. 2.4.1 Sum of Squares (TSS = SSπ‘π +SSπππ ) Total (corrected) sum of squares (and cross products), TSS = treatment (between groups) sum of squares and cross products, B + residuals (within group) sum of squares and cross products, W. π ∑οΈ ππ ∑οΈ π=1 π=1 0 (xπ π − xΜ)(xπ π − xΜ) = π ∑οΈ π=1 0 ππ ( xΜπ − xΜ)( xΜπ − xΜ) + π ∑οΈ ππ ∑οΈ (xπ π − xΜπ )(xπ π − xΜπ ) 0, π=1 π=1 21Reject π»0 if πΆ > ππ£2 (πΌ) 22Although M-test reject π»0 , MANOVA test could be inconsistent with. 24 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson as simplified from STA498 (xπ π − xΜ)(xπ π − xΜ) 0 = [(xπ π − xΜ + xΜπ − xΜ) [(xπ π − xΜ + xΜπ − xΜ)] 0 = (xπ π − xΜπ )(xπ π − xΜπ ) 0 + (xπ π − xΜπ )( xΜπ − xΜ) 0 + ( xΜπ − xΜ)(xπ π − xΜπ ) + ( xΜπ − xΜ)( xΜπ − xΜ) 0, Íπ (xπ π − xΜπ ) = 023 such that and ππ=1 ⇔ (xπ π − xΜ)(xπ π − xΜ) 0 = (xπ π − xΜπ )(xπ π − xΜπ ) 0 + ( xΜπ − xΜ)( xΜπ − xΜ) 0 . Notice that (xπ π − xΜ)(xπ π − xΜ) 0 = (xπ π − xΜ) 2 applies for other terms. First, for Sπ of πth sample covariance matrix 24 W= π ∑οΈ ππ ∑οΈ 0 (xπ π − xΜπ )(xπ π − xΜπ ) = (π1 −1)S1 +· · ·+(ππ −1)Sπ ⇔ (ππ −1)S = (π −π)S, π=1 π=1 π=1 where π = π ∑οΈ Íπ π=1 ππ with π − π ππ , Wishart distribution. Hence, W πΈ = πΊ, π −π Second, B= π ∑οΈ ππ ( xΜπ − xΜ)( xΜπ − xΜ) 0, π=1 with π − 1 ππ , Wishart distribution. Thus, TSS has total π − 1 = (π − π) + (π − 1) ππ , Wishart distribution. 2.4.2 Hypothesis Testing The goal is to test the presence of treatment effects. π»0 : π1 = π1 = · · · = ππ is equivalent to π»π : π1 = π1 = · · · = ππ 25, which is that treatment effects are all same. The test statistics26 uses Wilk’s Lambda27 test as B, W follows Wishart distribution, |W| 1 1 π ∗ ⇔ = Ππ=1 , π² = |B + W| det(I + W−1 B) 1 + πˆ π where πˆ 1 , .., πˆ π are eigenvalues of W−1 S, as π = min( π, π − 1) is the rank of B. 23Note that (xπ π − xΜπ ) ( xΜπ − xΜ) 0 + ( xΜπ − xΜ) (xπ π − xΜπ ) = 0 24Note that the generalized (π1 + π2 − 2)S ππππππ is recommended in two-sample case. 25As ππ = xΜπ − xΜ, testing for π»0 : π1 = · · · = ππ ⇔ xΜ1 − xΜπ = 0 ˆ 26Analogous LRT to | πΊπΊˆ |. 0 27There are Roy’s test maxπ (BW−1 ), Lawley-Hotelling’s test tr(BW−1 ), and Pillar’s test tr{B(B+W) −1 } CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 25 STA498 2.4.3 Number of variable π=1 Number of group π≥2 π=2 π≥2 π≥1 π=2 π≥1 π=3 2.4.4 Lim, Kyuson Distribution of Wilk’s Lambda Test statistics π−π 1−π²∗ π−1 1 − π²∗√ ∗ π−π−1 1 − 1−√ π²∗ π−1 π² π−π−1 1−π²∗ 1 − ∗ π π² √ ∗ π−π−2 1 − 1−√ π²∗ π π² Distribution under π»0 πΉπ−1,π−π πΉ2(π−1),2(π−π) πΉπ,π−π−1 πΉ2π,2(π−π−2) Large Sample property for modification of π²∗ If π»0 is true for π to be large, π+π − π −1− ln(π²∗ ) ∼ π2π(π−1) 2 However, reject π»0 if π+π − π −1− ln(π²∗ ) > π2π(π−1) (πΌ) 2 2.4.5 Simultaneous Confidence Intervals for treatment effect Let ππ = xΜπ − xΜ to be the πth treatment effect. Then, the treatment difference between πth 0 and πth treatment is πˆπ − πˆπ = xΜ π − xΜ − xΜπ + xΜ = xΜ π − xΜπ ⇔ π₯¯ π1 − π₯¯π1 · · · π₯¯ π π − π₯¯π π , Wππ and Var(πˆππ − πˆππ ) = Var(π₯¯ ππ − π₯¯ππ ) = π1π + π1π πππ , where πππ = π ππ, ππππππ = π−π . 
The 95% simultaneous Bonferroni’s confidence intervals for { πˆππ − πˆππ } 28 is √οΈ 1 1 ( π₯¯ ππ − π₯¯ππ ) ± π‘ π−π (πΌ/2π) + π ππ,ππππππ , where π = ππ(π − 1)/2, π π ππ where π is the number of variables and π is the number of populations. Hence, reject π»0 : πππ − πππ = 0 if | π₯¯ ππ − π₯¯ππ | πΌ π‘ = √οΈ > π‘ π−π . 2π 1 1 + π ππ ππ ππ, ππππππ 28Note that this is analogous to ( π₯¯1 π − π₯¯2 π ) ± π‘ π1 +π2 −2 26 πΌ 2π √οΈ 1 π1 + 1 π2 π π π, ππππππ defined. CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Chapter 3 Bayesian Alternative approach Let the discrete random variables π which is to be estimated and observed random variable of π = π₯. From the prior pr(π) information about possible values for the parameter, the approach uses observed data p(π₯|π) to update the information on posterior probabilities p(π|π₯)1 as a regenerating process by the confidence intervals, p(π ∈ πΆπΌ |π₯) = 1−πΌ. If known with π ∗ related to probability data points given, then the estimated π₯ˆπ ∼ p(π₯|π ∗ ) for the true value of π₯ to generate the update information in the compatible space, where there is no overfitting to be concerned with. From the likelihood p(π₯|π), the actual distribution by the Bayes theorem yield unnormalized posterior density which is the right side p(π|π₯) = p(π₯|π)p(π) ⇔ ∝ p(π₯|π)p(π), p(π₯) where p(π₯) is unknown with fixed π¦ and does not depend on π. Note that p(π₯) is also referred to as evidence. ⇔ Posterior ∝ Prior × Likelihood. A parameter of a prior distribution is referred to a hyperparameter. For predictive inference on unknown variable before data π₯ is considered, the distribution of unknown but observed π₯ is ∫ ∫ π(π₯) = π(π₯, π)ππ = π(π) π(π₯|π)ππ Θ Θ as a marginal distribution of π₯, which a prior predictive distribution2. For observed data x and unknown π = (π, π 2 ), the unknown observable π₯˜ to be predicted is referred to be posterior predictive distribution ∫ ∫ ∫ π( π₯|x) ˜ = π( π₯, ˜ π|x)ππ = π( π₯|π, ˜ x) π(π|x)ππ = π( π₯|π) ˜ π(π|x)ππ, Θ Θ Θ 1This is written by the Bayes theorem that p(π₯, π)/p(π₯) ⇔ (p(π₯ | π) p(π))/p(π₯) 2predictive refers to the distribution for a quantity that is observable. 27 STA498 Lim, Kyuson as posterior which is conditional on observed x and predictive for observable π₯. ˜ The ratio of posterior density π(π|π₯) evaluated at points π 1 and π 2 under the given model is referred to posterior odds for π 1 compared to π 2 . π(π 1 |π₯) π(π 1 ) π(π₯|π 1 )/π(π₯) π(π 1 ) π(π₯|π 1 ) = = , π(π 2 |π₯) π(π 2 ) π(π₯|π 2 )/π(π₯) π(π 2 ) π(π₯|π 2 ) which the posterior odds, π(π₯|π 1 ) π(π₯|π 2 ) π(π 1 |π₯) π(π 2 |π₯) equal to the prior odds π(π 1 ) π(π 2 ) times likelihood ratio, under the Bayes’ rule for discrete parameters. Random variables and Bayesian statistical inference For the unknown random variables of Θ to be estimated and π = π₯ which is observed, the Bayes’ rule yields π(Θ = π|π = π₯) = π π |Θ (π₯|π) π Θ (π) π(π = π₯|Θ = π) π(Θ = π) ⇔ π Θ|π (π|π₯) = . π(π = π₯) π π (π₯) Either Θ or π2 are continuous random variable, replace the PMF or PDF in the formula. Equivalently, the posterior PDF is represented by prior times likelihood with π π (π₯) using the law of total probability as ⇔ πΘ|π (π|π₯) = π π |Θ (π₯|π) πΘ (π) . π π (π₯) In the problem of Bayesian statistics, the choice prior πΘ (π) is generally unclear and subjective to be different. With unknown variable Θ, the goal is to draw inferences by observing related random variable π, about Θ. 
Note that the posterior distribution of Θ, πΘ|π (π|π₯)/π Θ|π (π|π₯), contains all information is derived by point or interval estimates of Θ. Comparison between Frequentist and Bayesian methods For frequentist inference, probabilities are frequencies as the goal is to create procedure with long run frequency guarantees. For Bayesian inference, probabilities are subjective degrees of belief as to state and analyze. Hence, frequentists view parameter as fixed constant while Bayesian considers as random variable. For example, the confidence interval is considered. √ √ For confidence interval defined as πΆ πΌ = [ π¯ π − 1.96/ π, π¯ π + 1.96/ π], the probability statement is π π (π ∈ πΆ πΌ) = 0.95 for all π ∈ R, which is random due to function of the data. With parameter π fixed, the CI trap the true value with probability 0.95. For infinitely many experiments of π data points and chosen π π , the computed intervals 28 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson STA498 πΆ πΌπ is found to trap the parameter π π , 95% of the time, that is almost surely convergent for any sequences π π , π 1 ∑οΈ πΌ (π π ∈ πΆ πΌπ ) ≥ 0.95 lim inf π→∞ π π=1 On the other hands, for beliefs the unknown parameter π is given as a prior distribution π(π) to represent the subjective beliefs about π. Using Bayes’ theorem, the posterior distribution for π given the observed data π1 , ..., ππ is computed with likelihood function π π(π |π), πΏ(π) = Ππ=1 π ∫ π(π|π1 , ..., ππ ) ∝ πΏ (π)π(π) ⇔ π(π|π1 , ..., ππ )ππ = 0.95 ⇔ π(π ∈ πΆ πΌ |π1 , ..., ππ ) = 0.95 πΆπΌ Hence, the degree-of-belief probability statement about π given the observed data is not the same, where the intervals would not trap the true value 95% of the time. In summary, the frequentist CI satisfies inf π π π (π ∈ πΆ πΌ) = 1 − πΌ for the coverage of the interval CI, and the probability refers to random interval CI. A Bayesian confidence interval CI satisfies π(π ∈ πΆ πΌ |π1 , ..., ππ ) = 1 − πΌ, where the probability refers to π. While the subjective Bayesian interpret probability strictly as personal degrees of belief, the objective Bayesian try to find the prior distributions for the resulting posterior to be objective 3. However, frequentist Bayesian only use Bayesian methods when resulting posterior has good frequency behaviours. On the other hands, the likelihoodist use the likelihood function to measure the strength of data as an evidence. 3.0.1 Overview: Univariate Binomial distribution with known and unknown parameter Let the probability of a success in a trial is π. Also, let π = {π₯1 , .., π₯ π } be the observation Íπ set where π₯ 1 ∼ π΅ππ (π). Then, the probability of π = π=1 π₯π success times in π trials π π (π₯ 1 , ..., π₯ π ) happens is p(π = π₯ 1 , ..., π₯ π |π, π) = Bin(π |π, π)= π π (1 − π) π−π as the posterior distribution. Example. Objective Bayesian approach As π(π|π = π₯ 1 , ..., π₯ π ) ∝ π(π = π₯ 1 , ..., π₯ π |π) π(π) for the prior π ∼ π [0, 1] to be unknown so to set the following the uniform distribution for π(π) = 1, then π(π|π) ∝ π π (1 − π) π−π = π π +1−1 (1 − π) π−π +1−1 ⇔ π π (1 − π) π−π Γ(π + 2) π π (1 − π) π−π = , Γ(π + 1)Γ(π − π + 1) Beta(π + 1, π − π + 1) π|π, π ∼ Beta(π + 1, π − π + 1) 3Empirical Bayesian estimate the prior distribution from the data. CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 29 STA498 Lim, Kyuson where posterior follows the Beta distribution. Since the density function integrates to 1, the normalizing constant (π§) is ∫ 1 Γ(π + 1)Γ(π − π + 1) π§= π π (1 − π) π−π = . 
Γ(π + 2) 0 The prior predictive distribution for fixed π success in ∫ 1 ∫ 1 1 π π π Γ(π + 1)(π − π + 1) π−π π(π) = π(π |π , π) π(π, π )ππ = π (1−π) ππ = = . π π Γ(π + 2) π+1 0 0 Hence, the prior predictive density π(π) = 1 π+1 ∫ one observation with an outcome π( π₯˜ = 1) = an example. which is to be uniform, where π₯˜ is the ∫1 π( π₯˜ = 1|π) π(π)ππ = 0 πππ = 1/2 as 1 0 Also, by the mean of Beta distribution the Bayes posterior estimator is πΈ (π|π) = For instructive purpose of convexity, 1 π + (1 − π π ) , πΈ (π|π) = π π π 2 π +1 π+2 . from the prior mean for 1/2 and the maximum likelihood estimate4 π /π. Moreover, the π which is close to 1. optimized convex set for π π is π+2 Example. Subjective Bayesian approach On the other hand, the subjective Bayesian find the uninformed prior to be strongly peaked around 1/2, as a subjective beliefs about the data. For the known of π, the posterior with Bayes rule yield ∫ π(π |π, π ) π (π|π ) π (π|π, π ) = , where π(π |π ) = π(π |π, π ) π (π|π )ππ. π(π |π ) Θ By setting the prior distribution π ∼ π΅ππ‘π(πΌ, π½) for π (π|π ) = π (π), π π Γ(πΌ + π½) πΌ−1 π½−1 π−π π(π|π, π ) ∝ π(π |π) π (π) ⇔ π (1 − π) π (1 − π) π Γ(πΌ)Γ(π½) = Γ(πΌ + π½ + π) π πΌ+π −1 (1 − π) π½+π−π −1 ⇔∝ π πΌ+π −1 (1 − π) π½+π−π −1 Γ(πΌ + π )Γ(π½ − π + π) without the normalizing constant 5. The posterior distribution π|π ∼ π΅ππ‘π(πΌ+π , π½+π−π ) πΈ (π|π) = πΌ+π , πΌ+π½+π π£ππ (π|π) = (πΌ + π )(π½ − π + π) , (πΌ + π½ + π) 2 (πΌ + π½ + π + 1) and PDF ππ|π (π|π = π₯ 1 , ..., π₯ π ). π 4The estimate is success over success plus failure as πΏ (Θ, π0 = Ππ=1 π ππ in multinomial distribution. 5Note that Γ(π) = (π − 1)! 30 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson STA498 Moreover, the Bayesian point estimate is summarized at the center of the posterior distribution ∫ 1 πΌ + π π π πΌ + π½ πΌ π¯ = π π (π|π)ππ = = + πΌ+π½+π πΌ+π½+π π πΌ+π½+π πΌ+π½ 0 πΌ+π½ π π π πΏπΈ + πΈ (π), ⇔ πΌ+π½+π πΌ+π½+π for the prior mean. After the data π have been observed, an unknown observable, π₯˜ is predicted, which is referred to as a posterior predictive distribution. Now, the posterior predictive distribution for just one observation π₯˜ = 1 of new value conditional on several observations π yield ∫ 1 ∫ 1 π( π₯˜ = 1|π, π ) = π( π₯˜ = 1|π, π, π ) π(π|π, π )ππ = π΅ππ ( π₯˜ = 1|π) π΅ππ‘π(π|π +πΌ, π½)ππ 0 0 ∫ ⇔ 1 ∫ π π΅ππ‘π(π|π + πΌ, π½)ππ = 0 1 π π(π|π)ππ = πΈ (π|π) 0 where π( π₯˜ = 1) = π6 such that the mean of the posterior distribution is derived to be πΌ+π πΈ (π|π) = πΌ+π½+π . For the purpose of Bayesian inference, the predictive distribution for the new observations are derived in the example. Equivalently, the generalized form of prediction is Íπ πΌ+ π=1 π₯π π( π₯˜ = 1|π) = πΈ (π|π) = πΌ+π½+π . On the other hand, π( π₯˜ = 0|π) = 1 − πΈ (π|π) = π½+ Íπ π=1 (1−π₯ π ) . πΌ+π½+π 3.1 Conditional distribution of the subset Given canonical form of x (2) ∼ π π−π (π (2) , πΊ22 ), the conditional distribution of x (1) ∼ (2) − π (2) ) and πΊ π π (π (1) , πΊ11 ) 7 is π π (π1.2 , πΊ11.2 ), where π1.2 = π (1) + πΊ12 πΊ−1 11.2 = 22 (x −1 πΊ11 − πΊ12 πΊ22 πΊ21 , and x is π × 1 matrix 8. (2) (2) −1 Thus, the conditional density x (1) |x (2) ∼ π (π (1) +πΊ12 πΊ−1 22 (x − π ), πΊ11 −πΊ12 πΊ22 πΊ21 ). Independence and covariance 0 For partition of subset x = x = x (1) x (2) , x (1) ⊥ x (1) ⇔ πΊ12 = 0.9 Generally, if both x (1) and x (2) follow normal distribution and are independent, then the joint distribution is normally distributed. 
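A small numerical sketch of the conditional-distribution formulas above (Python with NumPy assumed; the partition sizes and parameter values are purely illustrative) computes μ_{1·2} = μ^(1) + Σ₁₂Σ₂₂⁻¹(x^(2) − μ^(2)) and Σ_{11·2} = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁ for a given observed x^(2).

```python
import numpy as np

# Illustrative 4-dimensional normal, partitioned into x1 (first 2 coords) and x2 (last 2).
mu = np.array([0.0, 1.0, -1.0, 2.0])
Sigma = np.array([[4.0, 1.0, 0.5, 0.2],
                  [1.0, 3.0, 0.3, 0.1],
                  [0.5, 0.3, 2.0, 0.4],
                  [0.2, 0.1, 0.4, 1.0]])
idx1, idx2 = [0, 1], [2, 3]

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S21 = Sigma[np.ix_(idx2, idx1)]
S22 = Sigma[np.ix_(idx2, idx2)]

x2 = np.array([0.0, 3.0])                      # observed value of the conditioning block
mu_cond = mu[idx1] + S12 @ np.linalg.solve(S22, x2 - mu[idx2])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

print(mu_cond)      # conditional mean of x1 given x2
print(Sigma_cond)   # conditional covariance (does not depend on the observed x2)
```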
6For Bernoulli trial, π( π₯˜ = 0) = 1-π 7Note that this definition for partition is also valid in EM algorithm to be estimated 0 8Note that x = x (1) x (2) . 9This could be proven as π (x) = π (x (1) ) π (x (2) ) = 0, where off-diagonal elements other than πΊππ is 0. CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 31 STA498 Linear transformation Lim, Kyuson For the linear transformation with respect to A which could be defined as y = Ax, y follows the distribution of π π (π∗ , πΊ∗ ), where π∗ = Aπ and πΊ∗ = AπΊA0 10. Based on the conditional distribution formula, y (1) |y (2) ∼ π (π∗∗ , πΊ∗∗ ), where π∗∗ = (2) − π ∗(2) ) and πΊ∗∗ = πΊ∗ − πΊ∗ πΊ∗−1 πΊ∗ , where y (2) is given matrix. π∗(1) + πΊ∗12 πΊ∗−1 22 (y 11 12 22 21 3.1.1 Law of total expectation Often referred to as tower property, the Adam’s law for random variable π and π is πΈ (π) = πΈ (πΈ (π |π )) ∑οΈ ∑οΈ ∑οΈ πΈ (πΈ (π |π )) = πΈ π₯ π(π = π₯|π ) = π₯ π(π = π₯|π = π¦) π(π = π¦) π₯ = ∑οΈ ∑οΈ π¦ =π₯ π¦ π₯ π(π = π₯, π = π¦) = π₯ π₯ ∑οΈ ∑οΈ π₯ π₯ ∑οΈ ∑οΈ π(π = π₯, π = π¦) = π¦ π₯ π(π = π₯, π = π¦) π¦ ∑οΈ π₯ π(π = π₯) = πΈ (π), π₯ and the Eve’s law is π£ππ (π) = πΈ (π£ππ (π |π )) + π£ππ (πΈ (π |π )), as πΈ (π 2 ) = πΈ πΈ ((π |π ) 2 ) − (πΈ (π |π )) 2 + (πΈ (π |π )) 2 ⇔ πΈ π£ππ (π |π ) + (πΈ (π |π )) 2 2 ⇔ πΈ (π 2 ) − πΈ (π) 2 = πΈ π£ππ (π |π ) + (πΈ (π |π )) 2 − πΈ (πΈ (π |π )) = πΈ π£ππ (π |π ) + πΈ (πΈ (π |π ) 2 ) − πΈ (πΈ (π |π )) 2 , where π£ππ (π |π ) = πΈ (πΈ (π |π ) 2 ) − πΈ (πΈ (π |π )) 2 ⇔ πΈ π£ππ (π |π ) + π£ππ πΈ (π |π ) which is also referred to as the law of total expectation. As {π΄π } is the partition of the probability space and assumes finite or countably infinite set of finite values πΈ (π) < ∞, the law of total probability in countable and finite cases guarantees, ∑οΈ πΈ (π) = πΈ (π | π΄π ) π( π΄π ) π For Eve’s law, notice that the posterior variance is on average smaller than the prior variance. This indicates the greater the latter variation in Eve’s law, the more the potential for reducing our uncertainty with regard to π. 10Note that cov(Ax (1) ) = A cov(x (1) )A0 and also cov(Ax (1) , Bx (2) ) = A cov(x (1) , x (2) )B0 in partitioning the vector. 32 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson 3.1.2 STA498 Conditional expectation (MMSE) For posterior distribution for unknown random variable π , such as ππ |π (π¦|π₯), the point estimate of the posterior mean is defined as π¦ˆ π = πΈ (π |π = π₯), which is the minimum estimate of the π in terms of the MSE, referred to as a minimum mean squared error (MMSE) 11 or Bayes’ estimate of π . The posterior density is derived with computing π π |π (π₯|π¦) ππ (π¦) ππ |π (π¦|π₯) = , where π π (π₯) = π π (π₯) ∫ +∞ π π |π (π₯|π¦) ππ (π¦)ππ¦. −∞ Then, the MMSE estimate of π given π = π₯ is then given by ∫ +∞ π¦ˆ π = π¦ ππ |π (π¦|π₯)ππ¦ ⇒ πΈ (πˆπ ) = πΈ (πΈ (π |π)) = πΈ (π ), −∞ by applying for the Adam’s law. Hence, πΈ (π ) − πΈ (πˆπ ) = 0 which is an unbiased estimator. Properties of estimation error For the unobserved random variable to be estimated is π which is estimated by π = π₯, let estimate πˆ = π(π) to be the function of π, and the error of estimate is defined as π˜ = π − πˆ ⇔ π − π(π) for MSE of πΈ [(π − π(π)) 2 ]. The goal is to derive the variance of π and expectation of π . Now, let π = πΈ (π˜ |π) and πˆπ = πΈ (π |π), where π˜ = π − πˆπ . Then, π = πΈ (π˜ |π) = πΈ (π − πˆπ |π) = πΈ (π |π) − πΈ (πˆπ |π) = πˆπ − πΈ (πˆπ |π) = πˆπ − πˆπ = 0. For any function of π(π) πΈ (π˜ π(π)|π) = π(π)πΈ (π˜ |π) = π(π)π = 0. 
Similarly, by the Adam’s law for iterated expectations πΈ (π˜ π(π)) = πΈ [πΈ (π˜ π(π)|π)] = 0, by the previous result applied. However, the estimation error of π˜ = π − πˆπ and πˆπ = πΈ (π |π) is uncorrelated to derive the variance of π . By applying covariance formula πππ£(π˜ , πˆπ ) = πΈ (π˜ πˆπ ) − πΈ (π˜ )πΈ (πˆπ ) = πΈ (π˜ πˆπ ), 11Least mean squares (LMS) CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 33 STA498 where πΈ (π˜ ) = πΈ (πΈ (π˜ |π)) = 0 from the previous result such that Lim, Kyuson ⇔ πΈ (π˜ π(π)) = πΈ [πΈ (π˜ π(π))|π] = 0, by applying for the Adam’s law for iterated expectation. For π˜ = π − πˆπ , due to πππ£(π˜ , πˆπ ) = 0, π£ππ (π ) = π£ππ (πˆπ ) + π£ππ (π˜ ) 2 ⇔ πΈ (π 2 ) − πΈ (π ) 2 = πΈ (πˆπ ) − πΈ (πˆπ ) 2 + πΈ (π˜ 2 ) − πΈ (π˜ ) 2 , where πΈ (πˆπ ) = πΈ (πΈ (π |π)) = πΈ (π ), πΈ (π˜ ) = πΈ (π − πˆπ ) = 0 such that 2 2 πΈ (π 2 ) = πΈ (πˆπ ) + πΈ (π˜ 2 ) + (πΈ (π ) 2 − πΈ (πˆπ ) 2 ) − πΈ (π˜ ) 2 ⇔ πΈ (πˆπ ) + πΈ (π˜ 2 ). Also, the MSE of π |π, where π is the unknown variable, is derived as πππΈ (π |π) = πΈ [π£ππ (π |π)] π£ππ (π |π) = πΈ [(π − πΈ (π |π)) 2 |π] ⇔ πΈ (π 2 |π) − πΈ (π |π) 2 by definition ⇔ πΈ [π£ππ (π |π)] = πΈ [πΈ [(π − πΈ (π |π)) 2 |π]] = πΈ [(π − πΈ (π |π)) 2 ] = πΈ [(π − πˆπ ) 2 ], which is the MSE of the estimator. Moreover, the above derivations and equation involves for convolution of normals and bivariate normal for estimators. MSE for convolution of two normally distributed random variables For π ∼ π (π π , ππ2 ) independent of π ∼ π (ππ , ππ2 ), let π = π + π. The goal is to 2 ) − πΈ (π ˜ 2 ). derive πˆ π = πΈ (π |π ), πΈ [(π − πˆ π ) 2 ] which will verify for πΈ (π 2 ) = πΈ ( πˆ π Now, πππ£(π, π ) = πππ£(π, π + π) = π£ππ (π) + πππ£(π, π) = ππ2 by independence and π π,π = πππ£(π, π )(ππ ππ ) −1 = ππ (ππ + ππ ) −1 . Then, MMSE of π |π is πΈ (π |π ) = πˆ π = π π + πππ ππ−1 (π − ππ ) = π π + (ππ2 (π − ππ ))ππ−2 , which is analogous to (2) − π (2) ). πΈ (x (1) |x (2) ) = π (1) + πΊ12 πΊ−1 22 (x Also, the MSE of πˆ π is πΈ ( πˆ 2 ) = πΈ [(π − πˆ π ) 2 ] with substituting the derived equation. 2 ) + πΈ (π ˜ 2 ) could be verified for substitution Since πΈ (π 2 ) = ππ2 + πΈ (π) 2 , πΈ (π 2 ) = πΈ ( πˆ π from the above equation. 3.1.3 Laplace’s law of succession For the rule of succession 12, when repeating Bernoulli trials for π times independently with π successes, if π1 , ..., ππ+1 conditionally independent random variables, then π(ππ+1 = 1|π1 + · · · + ππ = π ) = π +1 . π+2 12When there are few observations, or for events that have not been observed to occur at all in (finite) sample data, the probability examine the next repetition to succeed 34 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson STA498 Within the prior success or failure, let π ∈ π [0, 1] to describe the uncertainty as a prior probability measure. Also, ππ describe πth trial for 0 and 1 and π₯π is the data actually observed. Now, the likelihood function for π is π πΏ (π) = π(π1 = π₯ 1 , ..., ππ = π₯ π |π) = Ππ=1 π π₯π (1 − π) 1−π₯π = π π (1 − π) π−π , Íπ where π = π=1 π₯π is the number of successes for π trials. The goal is to derive for posterior distribution π (π|π1 = π₯1 , ..., ππ = π₯ π ) = ∫ 1 πΏ (π) π (π) 0 ˜ π ( π)π ˜ π˜ πΏ ( π) = ∫1 0 π π (1 − π) π−π , ˜ π−π π π˜ π˜π (1 − π) where the Beta distribution PDF yield ∫ 1 Γ(π + 1)Γ(π − π + 1) ˜π Γ(π + 2) Γ(π + 2) ˜ π−π π π˜ = π (1 − π) Γ(π + 1)Γ(π − π + 1) 0 Γ(π + 2) Γ(π + 1)Γ(π − π + 1) so that π΅(πΌ, π½) = Γ(π +1)Γ(π−π +1) ,πΌ Γ(π+2) = π + 1, π½ = π − π + 1 ⇔ π (π|π1 = π₯ 1 , ..., ππ = π₯ π ) = (π + 1)! 
π π (1 − π) π−π , π !(π − π )! where this is the Beta distribution with expected value ∫ 1 π +1 , πΈ (π|π1 = π₯ 1 , ..., ππ = π₯ π ) = π π (π|π1 = π₯ 1 , ..., ππ = π₯ π )ππ = π+2 0 as π is a random variable the law of total probability provide the expected probability of success is π. For cases when π = 0 or π = π, the hypergeometric distribution Hyp(π|π, π, Θ) used, where Θ is the total number of successes in the total population size π. When π, Θ → ∞, 1 the ratio π = Θ π is fixed. Now, the prior probability of π (1−π) is roughly equivalent to 1 Θ(π−Θ) with 1 ≤ Θ ≤ π − 1. Then, the posterior for Θ, Θ π −Θ 1 π(Θ|π, π, π) ∝ . Θ(π − Θ) π π − π For conjugate prior of multinomial distribution, the Dirichlet distribution is the posterior distribution. 13 3.1.4 Bayesian Hypothesis testing For two hypothesis π»0 and π»π , let π(π»0 ) = π 0 and π(π»π ) = π 1 and π 0 + π 1 = 1. Also, for random variable π, the distribution of π under hypothesis is defined as π π (π₯|π»0 ) and π π (π₯|π»π ). By Bayes’ rule, the posterior probability of π»0 and π»π is obtained: π(π»0 |π = π₯) = ππ₯ (π₯|π»0 ) π(π»0 ) , ππ₯ (π₯) π(π»π |π = π₯) = ππ₯ (π₯|π»π ) π(π»π ) . ππ₯ (π₯) 13Th joint posterior distribution of π 1 , ..., π π for π (π 1 , ..., π π |π1 , ..., ππ , πΌ) = Íπ π=1 π π Íπ Γ( π=1 (ππ +1)) π Γ(π +1) π π1 ··· π ππ Ππ=1 π π 1 , = 1. CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 35 STA498 Lim, Kyuson The posterior comparison between π(π»0 |π = π₯) and π(π»π |π = π₯) could be used to decide between π»0 and π»π for higher probability to take into account for. Maximum A Posteriori (MAP) test The idea to take for the higher posterior probability in hypothesis test is referred to as MAP test. The π»0 is chosen if and only if π(π»0 |π = π₯) ≥ π(π»1 |π = π₯) ⇔ ππ₯ (π₯|π»0 ) π(π»0 ) ≥ ππ₯ (π₯|π»π ) π(π»π ). Note that the MAP test is also generalized for the case where there are more than 2 hypotheses for taking the hypothesis with highest posterior probability, π(π»π |π = π₯) ⇔ π π (π₯|π»π ) π(π»π ). Then, the average error probability for hypothesis testing is written as π π = π(choose π»1 |π»0 ) π(π»0 ) + π(choose π»0 |π»1 ) π(π»1 ), where the MAP test achieve minimum possible average error probability. Either for continuous π π |π (π₯|π¦) or discrete π π |π (π₯|π¦), the maximum a posteriori (MAP) estimate, π₯ˆ π π΄π could be obtained for the point or interval estimates of π, by maximizing ππ |π (π¦|π₯) π π (π₯), as π₯ does not depend on ππ (π¦). Minimum Cost hypothesis test In two hypothesis testing, there are two types of error which is to accept π»0 while π»π is true or π»π while π»0 is true. Let the cost to each error type be defined as πΆ10 and πΆ01 accordingly. Then, the average cost is πΆ = πΆ10 π(choose π»π |π»0 ) π(π»0 ) + πΆ01 π(choose π»0 |π»π ) π(π»π ) ⇔ π(choose π»π |π»0 ) [ π(π»0 )πΆ10 ] + π(choose π»0 |π»π ) [ π(π»π )πΆ01 ]. Since π(π»π |π = π₯) = π π (π₯|π»π ) π(π»π ) , π π (π₯) the π»0 is chosen if and only if π π (π₯|π»0 ) π(π»0 )πΆ10 ≥ π π (π₯|π»π ) π(π»π )πΆ01 = π π (π₯|π»0 ) π(π»π )πΆ01 ≥ π π (π₯|π»π ) π(π»0 )πΆ10 ⇔ π(π»0 |π₯)πΆ10 ≥ π(π»π |π₯)πΆ01 , for decision rule. Hence, the posterior risk for accepting π»π is derived to be π(π»0 |π₯)πΆ10 14. This would derived to take the minimum cost test as to accept the hypothesis test with lowest posterior risk. 14Equivalently, the posterior risk for accepting π»0 is derived to be π(π» π |π₯)πΆ01 . 36 CHAPTER 3. 
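A minimal sketch of this rule (Python with SciPy assumed; the two hypothesized densities, prior probabilities, and costs are hypothetical choices rather than values from the text) evaluates both posterior risks at an observed x and accepts the hypothesis with the smaller one; with equal costs it reduces to the MAP test.

```python
from scipy import stats

# Hypothetical simple-vs-simple setup: X | H0 ~ N(0, 1) and X | H1 ~ N(2, 1).
prior_H0, prior_H1 = 0.7, 0.3
C10, C01 = 1.0, 5.0        # C10: cost of accepting H1 when H0 is true; C01: the reverse

def decide(x):
    # Unnormalized posteriors; the common evidence f(x) cancels in the comparison.
    post_H0 = stats.norm.pdf(x, loc=0.0, scale=1.0) * prior_H0
    post_H1 = stats.norm.pdf(x, loc=2.0, scale=1.0) * prior_H1
    risk_accept_H1 = post_H0 * C10      # posterior risk of accepting H1
    risk_accept_H0 = post_H1 * C01      # posterior risk of accepting H0
    return "H0" if risk_accept_H0 <= risk_accept_H1 else "H1"

for x in (0.0, 1.0, 1.5, 2.5):
    print(x, decide(x))
```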
BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson Decision rule for cost in hypothesis testing STA498 In two hypothesis cases for π»0 and π»1 , let πΆπ π to be defined for the cost of accepting π»π given π» π to be true 15. Since associated cost for the correct decision is less than the incorrect decision, that is πΆππ < πΆ ππ for π, π = 1, 2, the average cost is derived as Í πΆ = π, π ∈{0,1} πΆπ π π(choose π»π |π» π ) π(π» π ) 16, as the goal is to find the decision rule such that the average cost is minimized. First, the complement for choosing the correct hypothesis is the complement of choosing the wrong hypothesis such that π(choose π»0 |π»0 ) = 1 − π(choose π»1 |π»0 ), π(choose π»1 |π»1 ) = 1 − π(choose π»0 |π»1 ) Hence, πΆ = πΆ00 [1 − π(choose π»1 |π»0 )] π(π»0 ) + πΆ01 π(choose π»0 |π»1 ) π(π»1 ) +πΆ10 π(choose π»1 |π»0 ) π(π»0 ) + πΆ11 [1 − π(choose π»0 |π»1 )] π(π»1 ) ⇔ (πΆ10 −πΆ00 ) π(choose π»1 |π»0 ) π(π»0 )+(πΆ01 −πΆ11 ) π(choose π»0 |π»1 ) π(π»1 )+πΆ00 π(π»0 )+πΆ11 π(π»1 ), where πΆ00 π(π»0 ) + πΆ11 π(π»1 ) is constant. To minimize, the decision rule is simplified as π· = π(chooseπ»1 |π»0 ) π(π»0 )(πΆ10 − πΆ00 ) + π(chooseπ»1 = 0|π»1 ) π(π»1 )(πΆ01 − πΆ11 ) Applying the hypothesis testing from previous inequality, the π»0 is chosen if and only if π π (π₯|π»0 ) π(π»0 )(πΆ10 − πΆ00 ) ≥ π π (π₯|π»1 ) π(π»1 )(πΆ01 − πΆ11 ) ⇔ π(π»0 |π)(πΆ10 − πΆ00 ) ≥ π(π»1 |π)(πΆ01 − πΆ11 ) 3.1.5 Bayesian Interval Estimation Instead of posterior density ππ₯1 |π₯2 (π₯ 1 |π₯ 2 ) for unobserved random variable π₯ 1 given observed π₯ 2 , the (1 − πΌ)100% credible interval of π₯1 being in [π, π] is derived as π(π ≤ π₯ 1 ≤ π|π2 = π₯ 2 ) = 1 − πΌ. Bivariate normal example For π1 ∼ π (0, 1) and π2 ∼ π (1, 4) with π(π1 , π2 ) = 41 , the goal is to derive a 95% credible interval for π1 , given π2 = 2 is observed. Analogous to πΈ (x (1) |x (2) ) = (2) − π (2) ), π (1) + πΊ12 πΊ−1 22 (x πΈ (π1 |π2 = π₯ 2 ) = π π1 + πππ1 π₯ 1 − ππ₯2 , ππ2 15Then, there would be 2 more cases, including πΆ00 : The cost of choosing π»0 given π»0 is true and πΆ11 : The cost of choosing π»1 given π»1 is true. 16which is πΆ00 π(choose π»0 |π»0 ) π(π»0 ) + πΆ01 π(choose π»0 |π»1 ) π(π»1 ) + πΆ10 π(choose π»1 |π»0 ) π(π»0 ) + πΆ11 π(choose π»1 |π»1 ) π(π»1 ) CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 37 STA498 Lim, Kyuson where π π1 ,π2 ππ1 ππ2 = Σ12 ⇔ πππ£(π1 , π2 ) (and ππ1 π1 = Σ11 ), equivalently. Similar to π£ππ (x (1) |x (2) ) = πΊ11 − πΊ12 πΊ−1 22 πΊ21 , π£ππ (π1 |π2 = π₯ 2 ) = ππ21 − π 2 ππ21 . Hence, the π1 |π2 = 2 is normally distributed with mean as πΈ (π1 |π2 = π₯ 2 ) = 0+ 12 ( 2−1 2 ) = 1 3 1 4 and variance as π£ππ (π1 |π2 = π₯ 2 ) = 1 − 4 = 4 . For πΌ = 0.05, the interval is π(π ≤ π1 ≤ π|π2 = 2) = 0.95 which is centered around πΈ (π1 |π2 = π₯ 2 ) = 14 with the form of [ 14 − π, 14 + π]. 1 1 π −π π π( − π ≤ π1 ≤ + π|π2 = 2) = Φ √οΈ − Φ √οΈ = 2Φ √οΈ − 1 = 0.95. 4 4 3/4 3/4 3/4 √οΈ ⇔π= 3 −1 1.95 Φ = 1.7 4 2 Thus, the 95% credible interval for π1 is 1 1 − π, + π = [−1.45, 1.95] 4 4 3.2 Prior The prior distribution of an uncertain quantity is to express one’s beliefs about this quantity before some evidence is taken into account. Based on the unconditional probability, the chosen parameters of the prior distribution are hyperparameters. The prior for parameter π is denoted as π(π) include conjugate priors with the binomial/beta and multinomial/Dirichlet families. In case prior ∝ constant, the Bernoulli example is a representative form of the noninformative prior as π(π) = 1 lead to π|π ∼ π΅ππ‘π(π + 1, π − π + 1) to be seen earlier. 
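A short sketch of this conjugate update (Python with SciPy assumed; the counts and Beta hyperparameters are illustrative) forms the Beta(α + s, β + n − s) posterior, its mean (α + s)/(α + β + n), and an equal-tailed credible interval; setting α = β = 1 recovers the uniform-prior case seen earlier.

```python
from scipy import stats

# Illustrative data: s successes in n Bernoulli trials, with a Beta(alpha, beta) prior.
n, s = 20, 14
alpha, beta = 1.0, 1.0                        # uniform prior; other values encode prior beliefs

post = stats.beta(alpha + s, beta + n - s)    # theta | data ~ Beta(alpha + s, beta + n - s)

post_mean = (alpha + s) / (alpha + beta + n)  # Bayes posterior estimator E(theta | data)
cred_int = post.interval(0.95)                # equal-tailed 95% credible interval
pred_next_success = post_mean                 # posterior predictive P(x_tilde = 1 | data)

print(post_mean, cred_int, pred_next_success)
```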
3.2 Prior

The prior distribution of an uncertain quantity expresses one's beliefs about that quantity before any evidence is taken into account. The parameters chosen for the prior distribution, which are not conditioned on the data, are called hyperparameters. The prior for a parameter $\theta$ is denoted $\pi(\theta)$; important examples include the conjugate priors of the binomial/beta and multinomial/Dirichlet families. When the prior is proportional to a constant, the Bernoulli example with $\pi(p)=1$, leading to $p\mid X \sim \mathrm{Beta}(k+1, n-k+1)$ as seen earlier, is a representative non-informative prior.

Bayesian Procedure

1. Choose a prior density $\pi(\theta) = f(\theta)$ that expresses our beliefs about a parameter $\theta$ before any data.
2. Choose a statistical model $f(x\mid\theta)$ that reflects our beliefs about $x$ given $\theta$.
3. After observing data $D = \{x_1,\dots,x_n\}$, update the beliefs and compute the posterior distribution $f(\theta\mid D)$.

3.2.1 Conjugate Prior

Simply put, if the prior $\pi(\theta)$ and the posterior $\pi(\theta\mid x)$ have the same form, then the prior and posterior are conjugate distributions for the likelihood function $f(x\mid\theta)$.

For a class of sampling distributions $f(x\mid\theta)$, a class (family) of prior distributions $\pi(\theta)$ is conjugate for $f(x\mid\theta)$ if
$$\pi(\theta\mid x) = \frac{f(x\mid\theta)\,\pi(\theta)}{\int_\Theta f(x\mid\theta)\,\pi(\theta)\,d\theta}$$
is in the class of $\pi(\theta)$ for every $f(\cdot\mid\theta)$ in the class of sampling distributions and every $\pi(\cdot)$ in the class of priors. In that case the prior family is conjugate to the family of sampling distributions for all resulting posteriors; such conjugate families are available in particular for exponential-family sampling distributions.

A sampling distribution $f(x\mid\theta)$ in the exponential family has the general form
$$f(x_i\mid\theta) = h(x_i)\,g(\theta)\,e^{\phi(\theta)^{T}u(x_i)},$$
where $\phi(\theta)$ and $u(x_i)$ are vectors and $\theta$ is the parameter. For a sample $\mathbf{x}$,
$$f(\mathbf{x}\mid\theta) \propto g(\theta)^{n}\exp\!\Big(\phi(\theta)^{T}\sum_{i=1}^{n}u(x_i)\Big),$$
where $\sum_{i=1}^{n}u(x_i)$ is the sufficient statistic for $\theta$, since the likelihood for $\theta$ depends on the data $\mathbf{x}$ only through it. Hence the likelihood of $\mathbf{x}$ is
$$f(\mathbf{x}\mid\theta) = \Big(\prod_{i=1}^{n}h(x_i)\Big)\,g(\theta)^{n}\exp\!\Big(\phi(\theta)^{T}\sum_{i=1}^{n}u(x_i)\Big).$$
If the prior density is specified as $\pi(\theta) \propto g(\theta)^{\eta}\exp(\phi(\theta)^{T}\nu)$, then the posterior density is
$$\pi(\theta\mid\mathbf{x}) \propto g(\theta)^{\eta+n}\exp\!\Big(\phi(\theta)^{T}\Big(\nu + \sum_{i=1}^{n}u(x_i)\Big)\Big),$$
so the prior density is conjugate.

List of Conjugate Models

Likelihood                      | Prior          | Posterior
Binomial                        | Beta           | Beta
Negative Binomial               | Beta           | Beta
Poisson                         | Gamma          | Gamma
Geometric                       | Beta           | Beta
Exponential                     | Gamma          | Gamma
Normal (mean unknown)           | Normal         | Normal
Normal (variance unknown)       | Inverse Gamma  | Inverse Gamma
Normal (mean, variance unknown) | Normal / Gamma | Normal / Gamma
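As a concrete instance of the first row of the table, the following sketch performs a Beta-Binomial conjugate update; the prior parameters and data are made-up values for illustration.

```python
from scipy.stats import beta

# Hypothetical numbers: a Beta(2, 2) prior on the success probability p,
# then k = 7 successes observed in n = 10 Bernoulli trials.
a0, b0 = 2.0, 2.0
n, k = 10, 7

# Conjugacy: Beta prior x Binomial likelihood -> Beta posterior
a_post, b_post = a0 + k, b0 + (n - k)
posterior = beta(a_post, b_post)

print(posterior.mean())          # (a0 + k) / (a0 + b0 + n) = 9/14
print(posterior.interval(0.95))  # central 95% credible interval for p
```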
3.2.2 Univariate Normal distribution Conjugate Prior with known variance

From the previous example, $\mathrm{Beta}(p\mid\alpha,\beta) \propto p^{\alpha-1}(1-p)^{\beta-1}$ was derived. In the univariate case, the normal model for an observation $X$, with multiple observations $x_1,\dots,x_n$, has the form
$$f(X\mid\theta) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{1}{2\sigma^{2}}(x-\theta)^{2}\Big), \qquad X \sim N(\theta,\sigma^{2}),$$
which enters $f(\theta\mid X) \propto f(X\mid\theta)\,f(\theta)$. Since the variance $\sigma^{2}$ is known, the joint prior is just the prior $f(\theta)$, in contrast to the variance-unknown case where $f(\theta,\sigma^{2}\mid X) \propto f(X\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2})$. The goal is to update the unknown quantity $\theta$, i.e. to find $f(\theta\mid x_1,\dots,x_n)$.

First, the likelihood function for the observed data is
$$f(X\mid\theta) \propto L(\theta\mid X) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big).$$
The prior is parametrized with known hyperparameters $\mu_0$ and $\tau_0^{2}$, with $\theta \sim N(\mu_0,\tau_0^{2})$:
$$f(\theta) = \frac{1}{\sqrt{2\pi\tau_0^{2}}}\exp\!\Big(-\frac{1}{2\tau_0^{2}}(\theta-\mu_0)^{2}\Big) \propto \exp\!\Big(-\frac{1}{2\tau_0^{2}}(\theta-\mu_0)^{2}\Big),$$
where $\mu_0$ is the prior mean and $\tau_0$ reflects the variation of $\theta$ around $\mu_0$; $\tau_0$ controls how much the mean can vary and does not directly describe the variability of the individual observations $x_i$, whose standard deviation $\sigma$ is in general different from $\tau_0$. (For example, a class of 30 students may have mean grade $\bar{x}=75$ with s.d. $\sigma=10$, while over several semesters the class means have overall mean $\mu_0 = 75$ and s.d. $\tau_0 = 5$.) With $X = \{x_1,\dots,x_n\}$,
$$f(X\mid\theta) = f(x_1,\dots,x_n\mid\theta) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\!\Big(-\sum_{i=1}^{n}\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big).$$
Hence the posterior is the prior times the likelihood:
$$f(\theta\mid x) \propto \exp\!\left[-\frac12\left(\frac{(\theta-\mu_0)^{2}}{\tau_0^{2}} + \frac{\sum_{i=1}^{n}(x_i-\theta)^{2}}{\sigma^{2}}\right)\right].$$
Ignoring constant terms, the posterior can be written as
$$f(\theta\mid x) \propto \exp\!\left[-\frac12\left(\frac{\sum_{i=1}^{n}x_i^{2} - 2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}} + \frac{(\theta-\mu_0)^{2}}{\tau_0^{2}}\right)\right].$$
Dropping the terms that do not include $\theta$ and collecting the terms in $\theta^{2}$ and $\theta$,
$$\propto \exp\!\left[-\frac12\,\frac{(n\tau_0^{2}+\sigma^{2})\theta^{2} - 2(\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x})\theta}{\sigma^{2}\tau_0^{2}}\right] = \exp\!\left[-\frac12\,\frac{\theta^{2} - 2\theta(\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x})/(n\tau_0^{2}+\sigma^{2})}{\sigma^{2}\tau_0^{2}/(n\tau_0^{2}+\sigma^{2})}\right];$$
completing the square and dropping the remaining constant simplifies this to
$$f(\theta\mid X) \propto \exp\!\left[-\frac12\,\frac{\big[\theta - (\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x})/(n\tau_0^{2}+\sigma^{2})\big]^{2}}{\sigma^{2}\tau_0^{2}/(n\tau_0^{2}+\sigma^{2})}\right].$$
Therefore $\theta\mid x \sim N(\mu_1, \tau_1^{2})$, where
$$\mu_1 = \frac{\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x}}{n\tau_0^{2}+\sigma^{2}}, \qquad \tau_1^{2} = \frac{\sigma^{2}\tau_0^{2}}{n\tau_0^{2}+\sigma^{2}}.$$
The posterior mean $\mu_1$ is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$, with weights proportional to the respective precisions: if the sampling variance is large, the prior mean carries considerable weight in the posterior; if the prior variance is large, the sample mean carries considerable weight. (The posterior precision is $1/\tau_1^{2} = n/\sigma^{2} + 1/\tau_0^{2}$ and the prior precision is $1/\tau_0^{2}$.) In the case $\tau_0^{2} = \sigma^{2}$, i.e. each observation's standard deviation equals the prior standard deviation, $\mu_1$ reduces to $(\mu_0 + n\bar{x})/(n+1)$, so the prior mean receives weight only $1/(n+1)$ in the posterior.

Posterior predictive distribution

For a future observation $\tilde{x}$, the posterior predictive density $f(\tilde{x}\mid x)$ is
$$f(\tilde{x}\mid x) = \int_\Theta f(\tilde{x}\mid\theta)\,f(\theta\mid x)\,d\theta \propto \int_\Theta \exp\!\Big(-\frac{1}{2\sigma^{2}}(\tilde{x}-\theta)^{2}\Big)\exp\!\Big(-\frac{1}{2\tau_1^{2}}(\theta-\mu_1)^{2}\Big)\,d\theta,$$
using $f(\theta\mid x) \propto \exp\!\big(-(\theta-\mu_1)^{2}/(2\tau_1^{2})\big)$. The joint distribution of $(\tilde{x},\theta)$ given the data is bivariate normal, so the marginal posterior distribution of $\tilde{x}$ is normal, with $E(\tilde{x}\mid\theta,x)=\theta$ and $\mathrm{var}(\tilde{x}\mid\theta,x)=\sigma^{2}$. By the laws of total expectation and total variance,
$$E(\tilde{x}\mid x) = E\{E(\tilde{x}\mid\theta, x)\mid x\} = E(\theta\mid x) = \mu_1,$$
$$\mathrm{var}(\tilde{x}\mid x) = E\{\mathrm{var}(\tilde{x}\mid\theta, x)\mid x\} + \mathrm{var}\{E(\tilde{x}\mid\theta, x)\mid x\} = E(\sigma^{2}\mid x) + \mathrm{var}(\theta\mid x) = \sigma^{2} + \tau_1^{2}.$$
Thus the posterior predictive distribution of $\tilde{x}$ has mean equal to the posterior mean of $\theta$, while the predictive variance combines the sampling variance $\sigma^{2}$ with the additional variance $\tau_1^{2}$ due to posterior uncertainty in $\theta$.
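The following short sketch carries out this known-variance update and the corresponding predictive moments; the data and hyperparameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                        # known sampling s.d.
mu0, tau0 = 0.0, 1.0               # prior: theta ~ N(mu0, tau0^2)
x = rng.normal(1.5, sigma, size=25)
n, xbar = len(x), x.mean()

# Posterior theta | x ~ N(mu1, tau1^2)
tau1_sq = (sigma**2 * tau0**2) / (n * tau0**2 + sigma**2)
mu1 = (sigma**2 * mu0 + n * tau0**2 * xbar) / (n * tau0**2 + sigma**2)

# Posterior predictive for a future observation: N(mu1, sigma^2 + tau1^2)
pred_mean, pred_var = mu1, sigma**2 + tau1_sq
print(mu1, tau1_sq, pred_mean, pred_var)
```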
Univariate Normal distribution Conjugate Prior with unknown variance

In most cases $\sigma^{2}$ is unknown. Then $f(\theta,\sigma^{2}\mid X) \propto f(X\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2})$, so a joint prior for $\theta$ and $\sigma^{2}$ must be specified, including the prior of $\theta$. If the two parameters are assumed independent, $\pi(\theta,\sigma^{2}) = \pi(\theta)\,\pi(\sigma^{2})$, and separate priors can be set for each parameter. With $\theta \sim N(\mu_0,\tau_0^{2})$ as before, where $\mu_0$ is the measure of belief for $\theta$, the easiest prior for $\sigma^{2}$ is a non-informative prior, discussed in the next section.

3.2.3 Non-informative Prior

If there is no prior information about $\theta$, a non-informative prior is meant to have minimal influence on the inference. However, a uniform prior cannot simply be declared non-informative, because uniformity is not invariant under reparametrization: the uniform prior on $\theta$ does not correspond to a uniform prior on $1/\theta$. (For a one-to-one transformation $\phi = g(\theta)$ with $\pi(\theta)=1$, the induced prior is $\pi(\phi) = \big|\tfrac{d}{d\phi}g^{-1}(\phi)\big|$, which is generally not constant.) As an expression of ignorance there is no unique non-informative prior, and a sufficiently flat prior is usually adequate in practice. On the other hand, invariant non-informative priors can be constructed for location parameters and scale parameters.

For a location parameter $\mu$, the random variable $X$ has density $f(x-\mu)$. Since $Y = X + c$ has density $f(y-\eta)$ with $\eta = \mu + c$, $X$ and $Y$ have the same distribution up to a shift of the parameter, so the prior should be location invariant: $\pi(\mu) = \pi(\mu - c)$ for all $c$, which forces $\pi(\mu) \propto 1$. For a scale parameter $\sigma$, $X \sim \frac1\sigma f\!\big(\frac{x}{\sigma}\big)$, and the family is scale invariant in the sense that for $c > 0$ the prior should satisfy $\pi(\sigma) = \frac1c\,\pi\!\big(\frac{\sigma}{c}\big)$; the invariant non-informative prior for the scale parameter, $\pi(\sigma) = \sigma^{-1}$, satisfies this equation.

3.2.4 Univariate Normal distribution Conjugate Prior with unknown variance

Continuing, priors for both $\theta$ and $\sigma^{2}$ need to be specified; as before, independence allows the separation $\pi(\theta,\sigma^{2}) = \pi(\theta)\,\pi(\sigma^{2})$. The full probability model for $\theta$ and $\sigma^{2}$ is
$$f(\theta,\sigma^{2}\mid x) \propto f(x\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2}).$$
Since $\sigma^{2}$ measures the uncertainty about $\theta$, it is used in updating the knowledge of $\theta$ (by the CLT, $\bar{x}\sim N(\theta,\sigma^{2}/n)$ for $n$ observations, so for fixed $\sigma^{2}$ the quantity $\sigma^{2}/n$ is the variance relevant to $\theta$, and it depends heavily on the new sample data), and a prior for $\sigma^{2}$ must be specified.

To develop a non-informative prior, the first approach assigns a uniform prior to $\theta$ and to $\log(\sigma^{2})$, since $\sigma^{2} > 0$ while $\log(\sigma^{2}) \in \mathbb{R}$. Transforming $\pi(\log(\sigma^{2})) \propto \text{constant}$ back to the density of $\sigma^{2}$ via the Jacobian gives
$$\pi(\sigma^{2}) \propto \Big|\frac{\partial \log(\sigma^{2})}{\partial\sigma^{2}}\Big| = \frac{1}{\sigma^{2}},$$
so the joint prior is $\pi(\theta,\sigma^{2}) \propto 1/\sigma^{2}$.

The other approach avoids working with $\log(\sigma^{2}) \in \mathbb{R}$ and chooses values $\mu_0$ and $\sigma^{2}$ with $\theta \sim N(\mu_0,\sigma^{2})$, together with a relatively non-informative prior on $\sigma^{2}$: taking $1/\sigma^{2} \sim \mathrm{Gamma}(\alpha,\beta)$, the variance follows an Inverse Gamma distribution, $\sigma^{2}\sim \mathrm{IG}(\alpha,\beta)$, with density
$$f(\sigma^{2}) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,(\sigma^{2})^{-(\alpha+1)}\exp(-\beta/\sigma^{2}), \qquad \sigma^{2} > 0,$$
so that $f(\sigma^{2}\mid\alpha,\beta) \propto (\sigma^{2})^{-(\alpha+1)}\exp(-\beta/\sigma^{2})$. (The IG distribution is used as the conjugate prior for the variance of the normal model.) Notice that in the improper limit where both parameters approach $0$, $\sigma^{2}\sim\mathrm{IG}(0,0) \iff \pi(\sigma^{2}) \propto 1/\sigma^{2}$, which recovers the prior above.
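A quick numerical check of the statement that a Gamma prior on the precision $1/\sigma^{2}$ is the same model as an Inverse Gamma prior on $\sigma^{2}$; the shape and rate values are arbitrary.

```python
import numpy as np
from scipy.stats import gamma, invgamma

alpha, beta_ = 3.0, 2.0
rng = np.random.default_rng(1)

# 1/sigma^2 ~ Gamma(shape=alpha, rate=beta)  <=>  sigma^2 ~ Inverse-Gamma(alpha, beta)
precision = gamma(a=alpha, scale=1.0 / beta_).rvs(200_000, random_state=rng)
variance = 1.0 / precision

qs = [0.1, 0.5, 0.9]
print(np.quantile(variance, qs))                 # empirical quantiles of 1/precision
print(invgamma(a=alpha, scale=beta_).ppf(qs))    # IG(alpha, beta) quantiles: should match closely
```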
Unknown variance: posterior density of $\theta$

With the sampling density $f(x\mid\theta,\sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\big(-(x_i-\theta)^{2}/(2\sigma^{2})\big)$ and the joint prior $f(\theta,\sigma^{2}) = 1/\sigma^{2}$, the posterior for $\theta$ and $\sigma^{2}$ is $f(\theta,\sigma^{2}\mid X) \propto f(X\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2})$, i.e.
$$f(\theta,\sigma^{2}\mid X) \propto \frac{1}{\sigma^{2}}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big).$$
Notice that, with two parameters, a conditional posterior distribution can always be read off from the joint posterior up to proportionality. With $X = \{x_1,\dots,x_n\}$,
$$f(X\mid\theta,\sigma^{2}) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\!\Big(-\sum_{i=1}^{n}\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big)
\;\Longrightarrow\;
f(\theta,\sigma^{2}\mid X) \propto \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\!\Big(-\frac12\,\frac{\sum_{i=1}^{n}x_i^{2} - 2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}}\Big).$$
Dropping the terms that do not contain the parameter of interest $\theta$ yields
$$f(\theta\mid\sigma^{2}, X) \propto \exp\!\Big(-\frac12\,\frac{-2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}}\Big);$$
dividing through by $n$ and adding back the constant needed to complete the square gives
$$f(\theta\mid\sigma^{2}, X) \propto \exp\!\Big(-\frac{(\theta-\bar{x})^{2}}{2\sigma^{2}/n}\Big),$$
which is the posterior $\theta\mid\sigma^{2}, X \sim N(\bar{x}, \sigma^{2}/n)$. Notice that, by the CLT, the sampling distribution of $\bar{x}$ is $N(\theta_0,\sigma^{2}/n)$ as well.

Unknown variance approach 1: conditional posterior density of $\sigma^{2}$

The posterior distribution of $\sigma^{2}$ can be obtained either through the conditional distribution of $\sigma^{2}\mid\theta, X$ or through the joint posterior of $\theta$ and $\sigma^{2}$. In the first approach, keeping only the terms that involve $\sigma^{2}$, with $\theta$ regarded as fixed, gives
$$f(\sigma^{2}\mid\theta, X) \propto \frac{1}{\sigma^{n+2}}\exp\!\Big(-\sum_{i=1}^{n}\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big),$$
which is an Inverse Gamma density without its normalizing constant $\beta^{\alpha}/\Gamma(\alpha)$; equivalently, $\sigma^{2}\mid\theta, X \sim \mathrm{IG}(\alpha,\beta)$ with the two parameters $\alpha = n/2$ and $\beta = \sum_{i=1}^{n}(x_i-\theta)^{2}/2$.

Unknown variance approach 2: marginal posterior density of $\sigma^{2}$

The second approach uses Bayes' rule in the form
$$f(\theta,\sigma^{2}\mid X) = \frac{f(\theta,\sigma^{2},X)}{f(\sigma^{2},X)}\cdot\frac{f(\sigma^{2},X)}{f(X)} = f(\theta\mid\sigma^{2},X)\,f(\sigma^{2}\mid X),$$
so that separating off the known factor $f(\theta\mid\sigma^{2},X)$ from the joint posterior leaves the marginal posterior density $f(\sigma^{2}\mid X)$, which is the goal. From before,
$$f(\theta,\sigma^{2}\mid X) \propto \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\!\Big(-\frac12\,\frac{\sum_{i=1}^{n}x_i^{2} - 2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}}\Big).$$
Rearranging the terms to isolate $\theta$ and completing the square in $\theta$,
$$f(\theta,\sigma^{2}\mid X) \propto \frac{1}{\sigma}\exp\!\Big(-\frac{(\theta-\bar{x})^{2}}{2\sigma^{2}/n}\Big)\times\frac{1}{\sigma^{n+1}}\exp\!\Big(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}{2\sigma^{2}}\Big),$$
where the first factor corresponds to $f(\theta\mid\sigma^{2},X)$ and the second to $f(\sigma^{2}\mid X)$; notice that the numerator in the exponent of $f(\sigma^{2}\mid X)$ is the sum of squared deviations, i.e. $(n-1)$ times the sample variance. Hence
$$\sigma^{2}\mid X \sim \mathrm{IG}\!\left(\frac{n-1}{2},\ \frac{(n-1)s^{2}}{2}\right), \qquad (n-1)s^{2} = \sum_{i=1}^{n}(x_i-\bar{x})^{2}$$
(the power $\sigma^{-(n+1)} = (\sigma^{2})^{-\{(n-1)/2+1\}}$ identifies $\alpha = (n-1)/2$).
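The two results above fully determine the joint posterior under the prior $1/\sigma^{2}$, so it can be sampled directly; the sketch below does this with simulated data.

```python
import numpy as np
from scipy.stats import invgamma

# Under f(mu, sigma^2) ∝ 1/sigma^2:
#   sigma^2 | x ~ IG((n-1)/2, (n-1)s^2/2)  and  mu | sigma^2, x ~ N(xbar, sigma^2/n)
rng = np.random.default_rng(2)
x = rng.normal(5.0, 3.0, size=40)          # made-up data
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

m = 10_000
sigma2_draws = invgamma(a=(n - 1) / 2, scale=(n - 1) * s2 / 2).rvs(m, random_state=rng)
mu_draws = rng.normal(xbar, np.sqrt(sigma2_draws / n))

print(np.mean(mu_draws), np.quantile(mu_draws, [0.025, 0.975]))  # matches the frequentist t-interval closely
print(np.mean(sigma2_draws))
```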
Unknown variance: connection to the multivariate normal

For a $p$-dimensional scale matrix $\boldsymbol\Sigma$ and $\nu$ degrees of freedom, if $\mathbf{X} \sim W_{\nu}(\boldsymbol\Sigma)$ with both $\mathbf{X}$ and $\boldsymbol\Sigma$ positive definite, then
$$f(\mathbf{X}) \propto |\mathbf{X}|^{(\nu-p-1)/2}\exp\!\Big(-\frac12\,\mathrm{tr}(\boldsymbol\Sigma^{-1}\mathbf{X})\Big),$$
ignoring the normalizing constant $\frac{1}{2^{\nu p/2}|\boldsymbol\Sigma|^{\nu/2}\Gamma_p(\nu/2)}$. Just as the conjugate prior for the variance of the univariate normal is the IG, the inverse Wishart distribution is the conjugate prior of $\boldsymbol\Sigma$ in the multivariate normal (the marginal distribution of the mean vector $\boldsymbol\mu$ is then a multivariate $t$ distribution). Similarly, if $\mathbf{X} \sim W_{\nu}^{-1}(\boldsymbol\Sigma^{-1})$, then
$$f(\mathbf{X}) \propto |\boldsymbol\Sigma|^{\nu/2}\,|\mathbf{X}|^{-(\nu+p+1)/2}\exp\!\Big(-\frac12\,\mathrm{tr}(\boldsymbol\Sigma\mathbf{X}^{-1})\Big).$$

3.3 Posterior

Note that the usual Bayesian inference typically involves (1) establishing a model and obtaining a posterior distribution for the parameter(s) of interest, (2) generating samples from the posterior distribution, and (3) applying summary formulas to the samples from the posterior distribution to summarize our knowledge of the parameters. Two basic sampling methods, the inversion method and the rejection method, underlie this scheme and are useful for understanding MCMC methods.

Weakly-informative Prior

For specifying and justifying the prior distribution, from the population point of view the prior represents a population of possible parameter values from which the $\theta$ of current interest has been drawn. From the subjective point of view, in contrast, the prior expresses uncertainty about $\theta$, as if its value were a random realization from the prior distribution.

3.3.1 Maximum A Posteriori (MAP)

Given random variables $x_1,\dots,x_n \sim N(\theta,\sigma^{2})$ with $X=\{x_1,\dots,x_n\}$ and the prior $\theta \sim N(\mu_0,\tau_0^{2})$, the function to maximize is
$$f(X\mid\theta)\,f(\theta) = L(\theta)\,\pi(\theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left[-\frac12\Big(\frac{x_i-\theta}{\sigma}\Big)^{2}\right]\cdot\frac{1}{\sqrt{2\pi\tau_0^{2}}}\exp\!\left[-\frac12\Big(\frac{\theta-\mu_0}{\tau_0}\Big)^{2}\right].$$
Notice that $\hat\theta_{MLE}$ maximizes $\prod_{i=1}^{n}f(x_i\mid\theta) = f(X=x_1,\dots,x_n\mid\theta)$, whereas $\hat\theta_{MAP}$ is the mode of the posterior distribution, obtained by maximizing $\log\pi(\theta) + \sum_{i=1}^{n}\log f(x_i\mid\theta)$.

For the derivation,
$$\log f(\theta\mid X) = \sum_{i=1}^{n}\left[-\log\sqrt{2\pi\sigma^{2}} - \frac{(x_i-\theta)^{2}}{2\sigma^{2}}\right] - \log\sqrt{2\pi\tau_0^{2}} - \frac{(\theta-\mu_0)^{2}}{2\tau_0^{2}},$$
and setting the derivative to zero,
$$\frac{\partial\log f(\theta\mid X)}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i-\theta}{\sigma^{2}} - \frac{\theta-\mu_0}{\tau_0^{2}} = 0
\;\iff\; \frac{\sum_{i=1}^{n}x_i}{\sigma^{2}} + \frac{\mu_0}{\tau_0^{2}} = \frac{(\sigma^{2}+n\tau_0^{2})\,\theta}{\sigma^{2}\tau_0^{2}}
\;\iff\; \hat\theta_{MAP} = \frac{\sigma^{2}\mu_0 + \tau_0^{2}\sum_{i=1}^{n}x_i}{\sigma^{2}+n\tau_0^{2}}.$$
Equivalently, maximizing the posterior $f(\theta\mid X)$ amounts, from the derivation in the prior chapter, to minimizing $\sum_{i=1}^{n}\big(\frac{x_i-\theta}{\sigma}\big)^{2} + \big(\frac{\theta-\mu_0}{\tau_0}\big)^{2}$ in $\theta$. Hence the MAP estimate of $\theta$ is
$$\hat\theta_{MAP} = \frac{n\tau_0^{2}}{n\tau_0^{2}+\sigma^{2}}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i + \frac{\sigma^{2}}{n\tau_0^{2}+\sigma^{2}}\,\mu_0 = \frac{\tau_0^{2}\sum_{i}x_i + \sigma^{2}\mu_0}{n\tau_0^{2}+\sigma^{2}},$$
which is a weighted average of the prior mean and the sample mean.
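The closed form above can be checked against a direct numerical maximization of the log posterior; the data and hyperparameters below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma, mu0, tau0 = 2.0, 0.0, 1.5
x = rng.normal(1.0, sigma, size=30)
n, xbar = len(x), x.mean()

# Closed-form MAP estimate (weighted average of prior mean and sample mean)
theta_map_closed = (n * tau0**2 * xbar + sigma**2 * mu0) / (n * tau0**2 + sigma**2)

def neg_log_post(theta):
    # -(log prior + log likelihood)
    return -(norm.logpdf(theta, mu0, tau0) + norm.logpdf(x, theta, sigma).sum())

theta_map_numeric = minimize_scalar(neg_log_post).x
print(theta_map_closed, theta_map_numeric)   # the two values should agree closely
```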
3.3.2 Multivariate Normal distribution with known $\boldsymbol\Sigma$

The multivariate normal likelihood is $\mathbf{x}_i\mid\boldsymbol\mu,\boldsymbol\Sigma \sim N(\boldsymbol\mu,\boldsymbol\Sigma)$, where $\boldsymbol\Sigma$ is a $p\times p$ positive-definite matrix and $\boldsymbol\mu$ is a $p$-vector; the normalizing constant $1/(2\pi)^{p/2}$ is omitted as before. Equivalently, the likelihood for a single observation is
$$f(\mathbf{x}_i\mid\boldsymbol\mu,\boldsymbol\Sigma) \propto |\boldsymbol\Sigma|^{-1/2}\exp\!\Big(-\frac12(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big),$$
and for a sample of $n$ iid observations $\mathbf{X}=\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$,
$$f(\mathbf{X}\mid\boldsymbol\mu,\boldsymbol\Sigma) = f(\mathbf{x}_1,\dots,\mathbf{x}_n\mid\boldsymbol\mu,\boldsymbol\Sigma) \propto |\boldsymbol\Sigma|^{-n/2}\exp\!\Big(-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big).$$
Analogous to the univariate case with known variance, where $f(\theta\mid X) \propto f(X\mid\theta)f(\theta)$ with $\theta\sim N(\mu_0,\tau_0^{2})$ (Section 3.2.2), the multivariate posterior for multiple observations is $f(\boldsymbol\mu\mid\mathbf{X}) \propto f(\mathbf{X}\mid\boldsymbol\mu)\,f(\boldsymbol\mu)$ with $\boldsymbol\mu \sim N(\boldsymbol\mu_0,\boldsymbol\Lambda_0)$. Equivalently, $f(\boldsymbol\mu\mid\mathbf{X}) \propto f(\boldsymbol\mu)\prod_{i=1}^{n}f(\mathbf{x}_i\mid\boldsymbol\mu)$, that is,
$$f(\boldsymbol\mu\mid\mathbf{X}) \propto |\boldsymbol\Lambda_0|^{-1/2}\exp\!\Big(-\frac12(\boldsymbol\mu-\boldsymbol\mu_0)^{T}\boldsymbol\Lambda_0^{-1}(\boldsymbol\mu-\boldsymbol\mu_0)\Big)\,|\boldsymbol\Sigma|^{-n/2}\exp\!\Big(-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big),$$
and dropping constant factors,
$$f(\boldsymbol\mu\mid\mathbf{X}) \propto \exp\!\left[-\frac12\Big((\boldsymbol\mu-\boldsymbol\mu_0)^{T}\boldsymbol\Lambda_0^{-1}(\boldsymbol\mu-\boldsymbol\mu_0) + \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big)\right].$$
Taking the natural logarithm and keeping only the terms involving $\boldsymbol\mu$,
$$\log f(\boldsymbol\mu\mid\mathbf{X}) = -\frac12\left[\boldsymbol\mu^{T}(n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})\boldsymbol\mu - 2\boldsymbol\mu^{T}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0 + \boldsymbol\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i\Big)\right] + \text{const}.$$
Completing the square in $\boldsymbol\mu$ (matrix rearrangement) gives
$$-\frac12\Big[\boldsymbol\mu - (n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0+\boldsymbol\Sigma^{-1}\sum_i\mathbf{x}_i\Big)\Big]^{T}(n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})\Big[\boldsymbol\mu - (n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0+\boldsymbol\Sigma^{-1}\sum_i\mathbf{x}_i\Big)\Big],$$
which is the log density of a normal distribution,
$$\boldsymbol\mu\mid\mathbf{X} \sim N\!\left((n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0+\boldsymbol\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i\Big),\ (n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\right).$$
Equivalently, the posterior mean $\boldsymbol\mu_n$ and the inverse covariance matrix $\boldsymbol\Lambda_n^{-1}$ are
$$\boldsymbol\mu_n = (\boldsymbol\Lambda_0^{-1}+n\boldsymbol\Sigma^{-1})^{-1}(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0 + n\boldsymbol\Sigma^{-1}\bar{\mathbf{x}}), \qquad \boldsymbol\Lambda_n^{-1} = \boldsymbol\Lambda_0^{-1}+n\boldsymbol\Sigma^{-1},$$
so the posterior precision is the sum of the prior and data precisions; the Woodbury identity can be applied to this expression for the covariance matrix. Since the multivariate normal prior is conjugate for the multivariate normal likelihood, analogous to the univariate case, an inverse Wishart distribution will be introduced as the conjugate prior for $\boldsymbol\Sigma$. Posterior conditional and marginal distributions of subvectors of $\boldsymbol\mu$ with known $\boldsymbol\Sigma$ can also be derived.

Posterior predictive distribution for $\tilde{\mathbf{x}}$

For a new observation $\tilde{\mathbf{x}} \sim N(\boldsymbol\mu,\boldsymbol\Sigma)$, the joint distribution is $f(\tilde{\mathbf{x}},\boldsymbol\mu\mid\mathbf{X}) \propto f(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\boldsymbol\Sigma)\,f(\boldsymbol\mu\mid\boldsymbol\mu_n,\boldsymbol\Lambda_n)$. The posterior predictive distribution of $\tilde{\mathbf{x}}$ is multivariate normal because $\boldsymbol\Sigma$ is known. By Adam's law (the law of total expectation),
$$E(\tilde{\mathbf{x}}\mid\mathbf{X}) = E\{E(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\mathbf{X})\mid\mathbf{X}\} = E(\boldsymbol\mu\mid\mathbf{X}) = \boldsymbol\mu_n,$$
and applying Eve's law (the law of total variance, as in the MMSE decomposition),
$$\mathrm{var}(\tilde{\mathbf{x}}\mid\mathbf{X}) = E\{\mathrm{var}(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\mathbf{X})\mid\mathbf{X}\} + \mathrm{var}\{E(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\mathbf{X})\mid\mathbf{X}\} = E(\boldsymbol\Sigma\mid\mathbf{X}) + \mathrm{var}(\boldsymbol\mu\mid\mathbf{X}) = \boldsymbol\Sigma + \boldsymbol\Lambda_n.$$

Non-informative prior density of $\boldsymbol\mu$

If $\pi(\boldsymbol\mu) \propto \text{constant}$, i.e. the prior precision $\boldsymbol\Lambda_0^{-1}$ converges to $\mathbf{0}$ (the prior variance grows without bound), then the prior mean becomes irrelevant for the posterior density.
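A small sketch of this known-covariance update and its predictive covariance, using illustrative prior values and simulated data.

```python
import numpy as np

# mu | X ~ N(mu_n, Lambda_n) with Lambda_n^{-1} = Lambda_0^{-1} + n Sigma^{-1}
rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # known covariance
mu0 = np.zeros(2)
Lambda0 = np.eye(2) * 4.0                    # prior covariance of mu
X = rng.multivariate_normal([1.0, -0.5], Sigma, size=50)
n, xbar = len(X), X.mean(axis=0)

prec_n = np.linalg.inv(Lambda0) + n * np.linalg.inv(Sigma)
Lambda_n = np.linalg.inv(prec_n)
mu_n = Lambda_n @ (np.linalg.inv(Lambda0) @ mu0 + n * np.linalg.inv(Sigma) @ xbar)

pred_cov = Sigma + Lambda_n                  # posterior predictive covariance for a new x
print(mu_n, Lambda_n, pred_cov, sep="\n")
```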
Before, π (π|X) ∝ π (X|π) π (π), π ∼ π (π0 , Λ0 ) in 3.3.2. Thus, the posterior of two unknown parameters is derived with respect to π (π, Σ|X) ∝ π (X|π, Σ) π (π, Σ). Notice, few lines before 1 π −1 π −1 π (π, Σ) = |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) − (π − π0 ) Σ (π − π0 ) . 2 2 π 1 −1 π −1 π/2 −{( π+π)/2+1} ⇔ π (π, Σ|X) ∝ |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) − (π − π0 ) Σ (π − π0 ) 2 2 −π 1 2 π −1 2 ×|Σ| exp − (π − 1)S + π( xΜ − π) Σ ( xΜ − π) 2 π/2 −{( π+π)/2+1} = |Λ0 | π/2 |Σ| −{(π+π+π)/2+1} 1 −1 2 π −1 π −1 exp − π‘π (Λ0 Σ ) + (π − 1)S + π( xΜ − π) Σ ( xΜ − π) + π (π − π0 ) Σ (π − π0 ) . 2 Posterior with unknown π 2 : derivation for square and Inverse-Wishart kernel First, (π − 1)S2 + π( xΜ − π)π Σ−1 ( xΜ − π) + π (π − π0 )π Σ−1 (π − π0 ) = (π − 1)S2 + πxΜπ Σ−1 xΜ − 2ππ π Σ−1 xΜ + ππ π Σ−1 π + ππ π Σ−1 π − 2ππ π Σ−1 π0 + π ππ0 Σ−1 π0 . Now, for rearrangement later, add (π + π)π ππ Σ−1 π π and subtract for balancing out the equation for posterior distribution parameters such that (π − 1)S2 + (π + π)π π Σ−1 π − 2π π Σ−1 (π π0 + πXΜ) + (π + π)π ππ Σ−1 π π − (π + π)π ππ Σ−1 π π + π ππ0 Σ−1 π0 + πxΜπ Σ−1 xΜ = ππ (π0 − XΜ)π Σ−1 (π0 − XΜ). Then, (π − 1)S2 + (π + π)(π − π π )π Σ−1 (π − π π ) + π+π 1 π/2 −{( π+π+π+1)/2} −1 π (π, Σ|X) ∝ |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) 2 −1 1 ππ 2 π −1 π −1 2 ×|Σ| exp − (π −1)S + (π + π)(π − π π ) Σ (π − π π ) + (π0 − XΜ) Σ (π0 − XΜ) 2 π +π 1 ππ π/2 − ( π+π+π+1) −1 2 π −1 2 ⇔ |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) + (π − 1)S + (π0 − XΜ) Σ (π0 − XΜ) 2 π +π CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 49 STA498 ×|Σ| −1 2 1 π −1 exp − (π + π)(π − π π ) Σ (π − π π ) 2 Lim, Kyuson The idea is to build up a model that is exactly similar to Normal × Inversed Wishart and identify the parameters. For identifying the exponent of the normal by inverted Wishart kernels, the property of adding symmetric matrix and multiplication is used, ie. π‘π ( π΄) + π‘π (π΅) = π‘π ( π΄ + π΅), π‘π (π·πΆ) = π‘π (πΆπ·) and π₯π Σ−1 π₯ = π‘π (π₯ π‘ Σ−1 π₯) ⇒ π‘π (π₯π Σ−1 π₯) = π‘π (π₯π₯π Σ−1 ). 1 Íπ π −1 Notice that S2 = π−1 π=1 (xπ − xΜ) Σ (xπ − xΜ). Then, the first part of the exponential is simplified to be 1 ππ −1 2 π −1 − π‘π (Λ0 Σ ) + (π − 1)S + (π0 − XΜ) Σ (π0 − XΜ) 2 π +π ∑οΈ ππ 1 π π (xπ − XΜ)(xπ − XΜ) − ( XΜ − π0 ) ( XΜ − π0 ) Σ−1 = − π‘π Λ0 + 2 π +π Σ These properties enable the equation to rearrange as π π π , π+π × Inverted Wishart(π π , Λπ ) −1 with Σ , which is π+ π+π+1 π π +π 1 −1 −1/2 π −1 − 2 exp − (π − π π ) Σ (π − π π ) , exp − π‘π (Λπ Σ ) |Σ| |Λπ | 2 |Σ| 2 2 and det(Λ0 ) = det(Λπ ) as symmetric matrix. Now, comparing with the equation of the interest, 1 ππ = (π π0 + πXΜ), π +π Λπ = Λ0 + π ∑οΈ (xπ − XΜ)(xπ − XΜ)π + ππ ( XΜ − π0 )( XΜ − π0 )π , π +π π=1 where the first term of π π matches with the second term in the modelling and the second term fo Λπ describes the first term in the modelling for equivalent relationship. Thus, Σ π, Σ|X ∼ π π π , with π × Inverse Wishart (π π , Λπ ) with Σ−1 , π +π to follow the modelling. Also, π +π π −1 (π − π π ) Σ (π − π π ) . π|Σ ∼ π π π , (π + π) Σ ⇔ π (π|Σ, X) ∝ exp − 2 −1 Posterior with unknown π 2 : uninformative priors The joint uninformative prior (with a locally uniform prior for π) is π (π, Σ) ∝ |Σ| − and the joint posterior is derived as π 1 − π+1 − 2 π −1 π (π, Σ|X) ∝ |Σ| 2 |Σ| 2 exp − (π − 1)S + π( XΜ − π) Σ ( XΜ − π) 2 1 − π+π+1 −1 ⇔ |Σ| 2 exp − π‘π (π(π)Σ ) , 2 50 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH π+1 2 , Lim, Kyuson Í STA498 π π + π( XΜ − π)( XΜ − π) π . 
Posterior with unknown $\boldsymbol\Sigma$: non-informative priors

The joint non-informative prior (with a locally uniform prior for $\boldsymbol\mu$) is $f(\boldsymbol\mu,\boldsymbol\Sigma) \propto |\boldsymbol\Sigma|^{-(p+1)/2}$, and the joint posterior is derived as
$$f(\boldsymbol\mu,\boldsymbol\Sigma\mid\mathbf{X}) \propto |\boldsymbol\Sigma|^{-(p+1)/2}\,|\boldsymbol\Sigma|^{-n/2}\exp\!\Big(-\frac12\big[(n-1)S^{2} + n(\bar{\mathbf{x}}-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol\mu)\big]\Big) \iff |\boldsymbol\Sigma|^{-(n+p+1)/2}\exp\!\Big(-\frac12\mathrm{tr}\big(S(\boldsymbol\mu)\boldsymbol\Sigma^{-1}\big)\Big),$$
where $S(\boldsymbol\mu) = \sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{T} + n(\bar{\mathbf{x}}-\boldsymbol\mu)(\bar{\mathbf{x}}-\boldsymbol\mu)^{T}$. The conditional posterior for $\boldsymbol\mu$ is then $\boldsymbol\mu\mid\boldsymbol\Sigma,\mathbf{X} \sim N(\bar{\mathbf{x}},\,\boldsymbol\Sigma/n)$, such that
$$f(\boldsymbol\mu\mid\boldsymbol\Sigma,\mathbf{X}) \propto \exp\!\Big(-\frac{n}{2}(\boldsymbol\mu-\bar{\mathbf{x}})^{T}\boldsymbol\Sigma^{-1}(\boldsymbol\mu-\bar{\mathbf{x}})\Big).$$

List of Conjugate Models (kernel form)

Parameter            | Prior $\pi(\theta)$                                              | Likelihood $f(X\mid\theta)$                                        | Posterior $f(\theta\mid X)$
Normal mean $\theta$ | $\propto \exp\!\big(-\frac{(\theta-\mu_0)^{2}}{2\sigma_0^{2}}\big)$ | $\propto \prod_{i=1}^{n}\exp\!\big(-\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\big)$ | $\propto \exp\!\big(-\frac{(\theta-\mu_n)^{2}}{2\sigma_n^{2}}\big)$, $\ \mu_n = \frac{\sigma^{2}\mu_0+n\sigma_0^{2}\bar{X}}{n\sigma_0^{2}+\sigma^{2}}$, $\ \sigma_n^{2} = \frac{\sigma^{2}\sigma_0^{2}}{n\sigma_0^{2}+\sigma^{2}}$
Beta-Binomial $\theta$ | Beta $\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$           | Bin $\propto \theta^{k}(1-\theta)^{n-k}$                           | Beta $\propto \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$
Gamma-Exponential $\theta$ | Gamma $\propto \theta^{\alpha-1}\exp(-\beta\theta)$        | Exp $\propto \theta^{n}\exp(-\theta s)$, $s=\sum_i x_i$            | Gamma $\propto \theta^{\alpha+n-1}\exp(-(\beta+s)\theta)$

3.3.4 Lindley's Paradox

Depending on the choice of prior distribution, the frequentist and Bayesian analyses can give different results for the same hypothesis test. The paradox concerns the result of an experiment with two explanations, $H_0$ and $H_a$, and some prior distribution $\pi$ representing the uncertainty about which hypothesis is more accurate before taking the result $x$ into account. Lindley's paradox occurs when the result $x$ is significant by the frequentist test of $H_0$, indicating sufficient evidence to reject $H_0$ at a given $\alpha = 0.05$, while at the same time the Bayesian posterior probability of $H_0$ given $x$ is high, indicating strong evidence that $H_0$ is in better agreement with $x$ than $H_a$.

Example for the comparison

In a statistics program, 4,900 male and 4,700 female students are enrolled in a certain time period. The observed proportion of male students is $x = 4900/9600 \approx 0.51$. Assume the number of male students is a Binomial variable with parameter $\theta$; the goal is to test whether $\theta$ is $0.5$ or some other value, i.e. $H_0: \theta = 0.5$ against $H_a: \theta \neq 0.5$.

The frequentist approach to testing $H_0$ computes a p-value, the probability of observing a count of male students at least as large as $k = 4900$ assuming $H_0$ is true. A normal approximation for the count of male students is $k \sim N(\mu,\sigma^{2})$ with $\mu = n\theta = 9600(0.5) = 4800$ and $\sigma^{2} = n\theta(1-\theta) = 9600(0.5)(1-0.5) = 2400$:
$$P(k \ge 4900\mid\mu=4800) = \int_{4900}^{9600}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(u-\mu)^{2}}{2\sigma^{2}}\Big)\,du = \int_{4900}^{9600}\frac{1}{\sqrt{2\pi(2400)}}\exp\!\Big(-\frac{(u-4800)^{2}}{4800}\Big)\,du \approx 0.0206.$$
For the two-sided test the p-value is $2(0.0206) \approx 0.041 < 0.05$, so $H_0$ is rejected and the observed proportion is declared different from $0.5$.

The Bayesian approach assumes equal prior probabilities, since neither hypothesis is favoured a priori: $P(H_0) = P(H_a) = 0.5$, with $\theta \sim U[0,1]$ under $H_a$. The posterior probability of $H_0$ after observing $k/n = 4900/9600$ male students is
$$P(H_0\mid k) = \frac{P(k\mid H_0)\,P(H_0)}{P(k\mid H_0)\,P(H_0) + P(k\mid H_a)\,P(H_a)},$$
where the marginal likelihoods are computed from the binomial PMF under each hypothesis:
$$P(k\mid H_0) = \binom{n}{k}(0.5)^{k}(1-0.5)^{n-k} \approx 1.0\times10^{-3},$$
$$P(k\mid H_a) = \int_0^1\binom{n}{k}\theta^{k}(1-\theta)^{n-k}\,d\theta = \frac{1}{n+1} \approx 1.04\times10^{-4}
\;\Rightarrow\; P(k\mid H_0) > P(k\mid H_a).$$
Hence the posterior probability is $P(H_0\mid k) \approx 0.91$, which strongly favours $H_0$ over $H_a$. Thus the Bayesian and frequentist approaches appear to be in conflict, which is the paradox; this comparison also leads to the goodness-of-fit test.
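The two sides of the paradox can be reproduced directly; the sketch below uses only the counts stated in the example.

```python
from scipy.stats import binom, norm

n, k = 9600, 4900

# Frequentist: two-sided p-value via the normal approximation N(n/2, n/4)
z = (k - n / 2) / (n / 4) ** 0.5
p_value = 2 * norm.sf(z)
print(p_value)                     # ≈ 0.041, rejects H0 at the 5% level

# Bayesian: P(H0) = P(Ha) = 1/2, theta ~ U[0,1] under Ha
p_k_h0 = binom.pmf(k, n, 0.5)      # ≈ 1.0e-3
p_k_ha = 1.0 / (n + 1)             # marginal likelihood under a uniform prior
post_h0 = p_k_h0 / (p_k_h0 + p_k_ha)
print(post_h0)                     # ≈ 0.91, favouring H0
```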
3.3.5 Bernstein-von Mises theorem

The Bernstein-von Mises theorem is a result that links Bayesian inference with frequentist inference. In particular, it states that Bayesian credible sets of a given credibility level $\alpha$ are asymptotically confidence sets of confidence level $\alpha$, which allows a frequentist interpretation of Bayesian credible sets under the probabilistic data-generating process. In parameter inference, the posterior distribution converges, in the limit of infinitely many data, to a multivariate normal distribution centered at the maximum likelihood estimate, with covariance matrix given by the inverse Fisher information divided by $n$. From a frequentist point of view, inference based on the posterior distribution is therefore asymptotically correct.

Bernstein-von Mises theorem: Univariate normal data

In the Bayesian approach, when the number $n$ of observed data points is large,
$$\sqrt{n}(\bar{x}-\theta)\mid X = x_1,\dots,x_n \;\to\; N(0,\sigma^{2}),$$
so the prior does not matter for large samples. In the frequentist approach, for large samples,
$$\sqrt{n}(\bar{x}-\theta)\mid\theta \;\sim\; N(0,\sigma^{2}).$$
Without loss of generality, the Bayesian 95% credible region and the frequentist 95% confidence interval then match:
$$P\!\left(\theta\in\Big[\bar{X}-1.96\tfrac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\tfrac{\sigma}{\sqrt{n}}\Big]\,\Big|\,X_1,\dots,X_n\right) \approx P\!\left(\theta\in\Big[\bar{X}-1.96\tfrac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\tfrac{\sigma}{\sqrt{n}}\Big]\,\Big|\,\theta\right) = 0.95.$$

3.4 Goodness-of-fit test

Suppose $n$ samples are drawn from a normal distribution $N(\theta,\sigma^{2})$ with known variance, and the goal is to select the model that best predicts the mean of the distribution.

3.4.1 Bayes factor

The Bayes factor is used as the Bayesian alternative to frequentist hypothesis testing; Bayesian model comparison is a method of model selection based on Bayes factors ($BF$) among statistical models. Based on the observed data $D$, the relative plausibility of two different models $M_1$ and $M_2$, parametrized by $\theta_1$ and $\theta_2$ respectively, is assessed by the probability odds of the two models,
$$BF_{12} = \frac{P(D\mid M_1)}{P(D\mid M_2)} = \frac{\int\pi(\theta_1\mid M_1)\,P(D\mid\theta_1,M_1)\,d\theta_1}{\int\pi(\theta_2\mid M_2)\,P(D\mid\theta_2,M_2)\,d\theta_2} = \frac{P(M_1\mid D)\,P(D)/P(M_1)}{P(M_2\mid D)\,P(D)/P(M_2)} = \frac{P(M_1\mid D)}{P(M_2\mid D)}\cdot\frac{P(M_2)}{P(M_1)},$$
i.e. Likelihood Ratio $\iff$ Posterior odds $\times$ Prior odds$^{-1}$. Unlike the LRT, integrating over the parameters guards against overfitting, at the price of some bias. Moreover, the Bayes factor is a relative predictive accuracy of one hypothesis over another, and measures the extent to which the data sway our relative belief from one hypothesis to the other: $BF = q$, $q\in(0,\infty)$, means there is $q$ times more evidence for $H_a$ than for $H_0$.

In the case of only two models, given the Bayes factor $BF(D)$, the posterior probability of Model 1 is derived as
$$P(M_1\mid D) = 1 - P(M_2\mid D) = 1 - \frac{P(D\mid M_2)\,P(M_2)}{P(D)} = 1 - \frac{P(D\mid M_1)\,P(M_2)}{P(D)\,BF(D)}$$
$$\Rightarrow\ 1 - \frac{P(M_1\mid D)}{BF(D)}\cdot\frac{P(M_2)}{P(M_1)} = P(M_1\mid D)
\ \iff\ 1 = \left(1 + \frac{1}{BF(D)}\frac{P(M_2)}{P(M_1)}\right)P(M_1\mid D)
\ \iff\ P(M_1\mid D) = \frac{1}{1 + \dfrac{1}{BF(D)}\dfrac{P(M_2)}{P(M_1)}}.$$
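The last identity is easy to apply in practice; the following minimal sketch converts a Bayes factor and prior odds into a posterior model probability, with arbitrary illustrative values.

```python
# P(M1 | D) = 1 / (1 + (1 / BF(D)) * P(M2) / P(M1))
def posterior_prob_m1(bf_12, prior_m1=0.5, prior_m2=0.5):
    return 1.0 / (1.0 + (1.0 / bf_12) * (prior_m2 / prior_m1))

print(posterior_prob_m1(10.0))        # BF = 10 with equal priors -> ≈ 0.909
print(posterior_prob_m1(1.0 / 3.0))   # BF = 1/3 -> 0.25
```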
Bayes factor cutoffs

$BF_{10}$      | Interpretation
$> 100$        | Extreme evidence for $H_a$
$30 - 100$     | Very strong evidence for $H_a$
$10 - 30$      | Strong evidence for $H_a$
$3 - 10$       | Moderate evidence for $H_a$
$1 - 3$        | Anecdotal evidence for $H_a$
$1$            | Equal evidence for $H_a$ and $H_0$
$1/3 - 1$      | Anecdotal evidence for $H_0$
$1/10 - 1/3$   | Moderate evidence for $H_0$
$1/30 - 1/10$  | Strong evidence for $H_0$
$1/100 - 1/30$ | Very strong evidence for $H_0$
$< 1/100$      | Extreme evidence for $H_0$

Relationship between Bayes factor and p-value

Larger Bayes factors in favour of $H_a$ correspond to smaller p-values, both pointing towards rejecting $H_0$, while larger p-values correspond to smaller Bayes factor values: the relationship is an inverse one. Note that Bayes factors also allow the null hypothesis itself to be tested directly (relative to the models under consideration).

3.4.2 Bayes factor: hypothesis testing

A benefit of the Bayes factor is that multiple hypotheses can be compared on the same observed data; for instance, regression models can be tested through
$$BF_{10} = \frac{P(D\mid H_1)}{P(D\mid H_0)}, \qquad BF_{20} = \frac{P(D\mid H_2)}{P(D\mid H_0)} \;\Rightarrow\; \frac{BF_{10}}{BF_{20}} = \frac{P(D\mid H_1)}{P(D\mid H_2)} = BF_{12}.$$

Frequentist: Chi-square Goodness-of-fit test

As before, for $H_0: \theta = \theta_0$ the CLT guarantees that the sample mean $\bar{x} = \sum_{i=1}^{n}x_i/n$ satisfies $\bar{x} \sim N(\theta,\sigma^{2}/n)$ (for $n$ samples from a distribution with mean $\theta$ and variance $\sigma^{2}$). The test statistic is computed as
$$\chi^{2} = \frac{(\bar{x}-\theta_0)^{2}}{\sigma^{2}/n}.$$

Bayesian: Bayes factor

Here the emphasis is on computing the Bayes factor of the models. Consider two models: $M_1: \theta = \theta_0$, and $M_2$: $\theta$ lies in an interval of length $L$ that contains $\theta_0$, with $L$ large relative to $\sigma/\sqrt{n}$ and the prior $\pi(\theta) = 1/L$ on that interval, which determines the relative ratio. The evidence for $M_1$ is
$$P(X\mid M_1) = \frac{1}{\sqrt{2\pi\sigma^{2}/n}}\exp\!\Big(-\frac{(\bar{x}-\theta_0)^{2}}{2\sigma^{2}/n}\Big) = \sqrt{\frac{n}{2\pi\sigma^{2}}}\exp\!\big(-\chi^{2}/2\big).$$
For $M_2$, marginalize over $\theta$:
$$P(X\mid M_2) = \int P(X\mid\theta,M_2)\,\pi(\theta\mid M_2)\,d\theta = \int\frac{1}{\sqrt{2\pi\sigma^{2}/n}}\exp\!\Big(-\frac{(\bar{x}-\theta)^{2}}{2\sigma^{2}/n}\Big)\frac{1}{L}\,d\theta \approx \frac{1}{L}.$$
The Bayes factor in favour of $M_1$ is therefore derived as
$$B = \frac{P(X\mid M_1)}{P(X\mid M_2)} = \frac{\sqrt{n}\,L}{\sqrt{2\pi\sigma^{2}}}\exp\!\big(-\chi^{2}/2\big).$$
For a fixed value of $\chi^{2}$, as $n\to\infty$ the Bayes factor favours $\theta = \theta_0$, since $B\to\infty$.
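The Lindley-type behaviour of this Bayes factor is easy to see numerically; in the sketch below, $\sigma$, $L$ and the fixed $\chi^{2}$ value are illustrative choices.

```python
import numpy as np

# B = sqrt(n) * L / sqrt(2*pi*sigma^2) * exp(-chi2 / 2)
def bayes_factor_point_null(n, chi2, sigma=1.0, L=1.0):
    return np.sqrt(n) * L / np.sqrt(2 * np.pi * sigma**2) * np.exp(-chi2 / 2)

# For a fixed chi-square value (3.84, borderline significant at the 5% level),
# the Bayes factor in favour of theta = theta0 grows without bound as n increases.
for n in (10, 100, 10_000, 1_000_000):
    print(n, bayes_factor_point_null(n, chi2=3.84))
```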
Improper prior and example

When the prior is a function with $\int_\Theta\pi(\theta)\,d\theta = \infty$, the prior is not a pdf, but the posterior can still be a valid pdf as long as the marginal distribution $m(x) = \int_\Theta f(x\mid\theta)\,\pi(\theta)\,d\theta$ is well defined. For the univariate normal distribution with known variance, if the uniform prior $\pi(\theta) = 1$ is used, expressing no prior information, then $\int_\Theta\pi(\theta)\,d\theta = \infty$; however, the corresponding marginal distribution $m(X) = \int_\Theta f(X\mid\theta)\,d\theta$ is
$$m(X) = \int(2\pi\sigma^{2})^{-n/2}\exp\!\Big(-\frac{\sum_i(x_i-\theta)^{2}}{2\sigma^{2}}\Big)\,d\theta = \frac{(2\pi\sigma^{2})^{-(n-1)/2}}{\sqrt{n}}\exp\!\Big(-\frac{(n-1)s^{2}}{2\sigma^{2}}\Big), \qquad (n-1)s^{2}=\sum_i(x_i-\bar{x})^{2},$$
so the posterior becomes $\pi(\theta\mid X) = f(\theta\mid\bar{x},\sigma^{2}/n)$, as shown before. For the Bayesian $t$-test, the Jeffreys prior, which is improper, is used, and the resulting posterior still integrates to one.

3.4.3 Two-sample test for equal means

Suppose there are two samples of sizes $n_1$ and $n_2$, with $n = n_1+n_2$, where $X_{1i} \sim N(\mu_1,\sigma^{2})$ for $i$ in sample 1 and $X_{2j} \sim N(\mu_2,\sigma^{2})$ for $j$ in sample 2, with sample variances $s_1^{2}$ and $s_2^{2}$. First, the frequentist two-sided $t$-test asks whether the means of the two groups differ,
$$H_0: \mu_1 = \mu_2 \iff \mu_1-\mu_2 = 0 \qquad\text{vs.}\qquad H_a: \mu_1 \neq \mu_2 \iff \mu_1-\mu_2 \neq 0,$$
and the two-sample $t$-statistic is
$$t = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\dfrac{(n_1-1)s_1^{2}+(n_2-1)s_2^{2}}{n_1+n_2-2}\Big(\dfrac{1}{n_1}+\dfrac{1}{n_2}\Big)}}.$$
Let $\mathbf{x} = \{\mathbf{x}_1,\mathbf{x}_2\}$, where $\mathbf{x}_1 = (x_1,\dots,x_{n_1})'$ and $\mathbf{x}_2 = (x_{n_1+1},\dots,x_{n_1+n_2})'$. The goal is to test
$$H_0: x_i\mid\mu,\sigma^{2} \sim N(\mu,\sigma^{2}),\quad 1\le i\le n,$$
against
$$H_a: x_i\mid\mu_1,\sigma_1^{2} \sim N(\mu_1,\sigma_1^{2}),\ 1\le i\le n_1, \quad\text{and}\quad x_i\mid\mu_2,\sigma_2^{2} \sim N(\mu_2,\sigma_2^{2}),\ n_1+1\le i\le n_1+n_2.$$
The Bayesian approach, in contrast, places the prior on the standardized difference of means, $\delta = (\mu_1-\mu_2)/\sigma$. In the simplest (one-sample, point-null) building block, the marginal likelihoods under the two models are
$$m_0 = (2\pi\sigma^{2})^{-n/2}\exp\!\Big(-\frac{\sum_i x_i^{2}}{2\sigma^{2}}\Big), \qquad m_1 = (2\pi\sigma^{2})^{-n/2}\exp\!\Big(-\frac{\sum_i(x_i-\mu_1)^{2}}{2\sigma^{2}}\Big),$$
so that the Bayes factor is derived as
$$BF_{01} = \frac{m_0}{m_1} = \exp\!\Big(-\frac{n\mu_1}{2\sigma^{2}}(2\bar{x}-\mu_1)\Big);$$
when the prior on the mean is normal with variance $\sigma_0^{2}$, this becomes
$$BF_{01} = \Big(\frac{\lambda\,\sigma_0^{2}}{\sigma^{2}}\Big)^{1/2}\exp\!\Big(-\frac{n^{2}\bar{x}^{2}}{2\lambda\sigma^{2}}\Big), \qquad \lambda = n + \sigma^{2}/\sigma_0^{2},$$
and the goal is to derive the Jeffreys-Zellner-Siow (JZS) Bayes factor from this construction.

Chapter 4

Appendix

4.1 Extension of Bayesian distribution

4.1.1 EM (expectation-maximization) algorithm for MLE example

When part of the data set is missing, the prediction step uses initial estimates $\tilde{\boldsymbol\mu}$ and $\tilde{\boldsymbol\Sigma}$ to predict the contribution of the missing values to the sufficient statistics.

Algorithm: assume that the population mean and covariance $\boldsymbol\mu$ and $\boldsymbol\Sigma$ are unknown and must be estimated.

1. Prediction: given estimates $\tilde{\boldsymbol\theta}$ of the unknown parameters, predict the contribution of any missing observation to the complete-data sufficient statistics.
2. Estimation: using the predicted sufficient statistics, compute revised estimates of the parameters.
3. Iteration: repeat until the revised estimates do not differ appreciably from the estimates obtained previously.
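A minimal sketch of these three steps for a bivariate normal with values missing at random in the second coordinate; the data are simulated and the implementation is illustrative, not a general-purpose routine.

```python
import numpy as np

rng = np.random.default_rng(6)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=300)
X[rng.random(300) < 0.2, 1] = np.nan        # drop ~20% of the second coordinate

mu = np.nanmean(X, axis=0)                  # initial estimates
Sigma = np.diag(np.nanvar(X, axis=0))

for _ in range(50):
    filled = X.copy()
    extra = np.zeros((2, 2))                # accumulated conditional covariances
    miss = np.isnan(X[:, 1])
    # Prediction (E-step): fill in E(x2 | x1) and record the conditional variance
    cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (X[miss, 0] - mu[0])
    cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
    filled[miss, 1] = cond_mean
    extra[1, 1] = miss.sum() * cond_var
    # Estimation (M-step): recompute mu and Sigma from the completed sufficient statistics
    mu = filled.mean(axis=0)
    Sigma = ((filled - mu).T @ (filled - mu) + extra) / len(X)

print(mu)       # close to true_mu
print(Sigma)    # close to true_Sigma
```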