Bayesian Hotelling’s π 2 Lim, Kyuson November 4, 2021 STA498 2 Lim, Kyuson Contents 1 Acknowledgement 7 2 Multivariate Normal and Hypothesis Testing 9 2.1 Basic definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Multivariate Normal distribution . . . . . . . . . . . . . . . . . 9 2.1.2 Distribution of (x − π) 0πΊ−1 (x − π) . . . . . . . . . . . . . . . . 9 2.1.3 MLE of π and πΊ . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.4 The sampling distribution of S and xΜ . . . . . . . . . . . . . . . 11 2.1.5 Hypothesis testing when π, Σ is known . . . . . . . . . . . . . 11 2.1.6 Hypothesis testing when π, Σ is unknown . . . . . . . . . . . . 12 Confidence region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Univariate π‘-interval . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Bonferroni’s Simultaneous Confidence interval . . . . . . . . . 17 2.2.3 Simultaneous π 2 -intervals . . . . . . . . . . . . . . . . . . . . 17 2.2.4 Comparison between Simultaneous π 2 -intervals and Bonferroni’s Confidence intervals . . . . . . . . . . . . . . . . . . . . 18 Multivariate Quality-Control (QC) . . . . . . . . . . . . . . . . 19 Comparing mean vectors of two population . . . . . . . . . . . . . . . 21 2.3.1 Pooled sample covariance when π1 , π2 is small and Σ = Σ1 = Σ2 21 2.3.2 Hypothesis test with small samples when Σ1 = Σ2 . . . . . . . . 22 2.3.3 Confidence intervals with small samples when Σ1 = Σ2 . . . . . 22 2.3.4 Behrens-Fisher problem . . . . . . . . . . . . . . . . . . . . . 23 2.3.5 Heterogeneous covariance matrices with large sample size . . . 23 2.3.6 Box’s M test (Bartlett’s test) . . . . . . . . . . . . . . . . . . . 23 2.2 2.2.5 2.3 3 STA498 2.4 3 Lim, Kyuson MANOVA (Multivariate Analysis Of Variance) . . . . . . . . . . . . . 24 2.4.1 Sum of Squares (TSS = SSπ‘π +SSπππ ) . . . . . . . . . . . . . . . 24 2.4.2 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.3 Distribution of Wilk’s Lambda . . . . . . . . . . . . . . . . . . 26 2.4.4 Large Sample property for modification of π²∗ . . . . . . . . . . 26 2.4.5 Simultaneous Confidence Intervals for treatment effect . . . . . 26 Bayesian Alternative approach 3.0.1 3.1 3.2 3.3 3.4 4 27 Overview: Univariate Binomial distribution with known and unknown parameter . . . . . . . . . . . . . . . . . . . . . . . . 29 Conditional distribution of the subset . . . . . . . . . . . . . . . . . . . 31 3.1.1 Law of total expectation . . . . . . . . . . . . . . . . . . . . . 32 3.1.2 Conditional expectation (MMSE) . . . . . . . . . . . . . . . . 33 3.1.3 Laplace’s law of succession . . . . . . . . . . . . . . . . . . . 34 3.1.4 Bayesian Hypothesis testing . . . . . . . . . . . . . . . . . . . 35 3.1.5 Bayesian Interval Estimation . . . . . . . . . . . . . . . . . . . 37 Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.1 Conjugate Prior . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.2 Univariate Normal distribution Conjugate Prior with known variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.3 Non-informative Prior . . . . . . . . . . . . . . . . . . . . . . 42 3.2.4 Univariate Normal distribution Conjugate Prior with unknown variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 Posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3.1 Maximum A Posteriori (MAP) . . . . . . . . . . . . . . . . . . 45 3.3.2 Multivariate Normal distribution with known Σ . . 
. . . . . . . 46 3.3.3 Multivariate Normal distribution with unknown Σ . . . . . . . . 48 3.3.4 Lindley’s Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.5 Bernstein-von Mises theorem . . . . . . . . . . . . . . . . . . 52 Goodness-of-fit test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.1 53 Bayes factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . CONTENTS Lim, Kyuson 3.4.2 Bayes factor: hypothesis testing . . . . . . . . . . . . . . . . . STA498 54 3.4.3 One sample test for equal means . . . . . . . . . . . . . . . . . 55 4 Appendix 57 4.1 Extension of Bayesian distribution . . . . . . . . . . . . . . . . . . . . 57 4.1.1 57 EM (expectation-maximizing) algorithm for MLE example . . . CONTENTS 5 STA498 6 Lim, Kyuson CONTENTS Chapter 1 Acknowledgement First, the chapter of multivariate Normal and Hypothesis Testing explains about the construction and concepts of multivariate normal and relevant statistical inference as to apply. The notation and interpretation is all of multivariate random variables to consider for. Second, the basic information starts with frequentist approach in understanding the Bayesian statistics. However, the chapter is mainly about Bayesian approach apart from frequentist approach for interpretation where majority of concept and derivation lies on Bayesian approach to consider for. The chapter discuss for the hypothesis testing for Bayesian approach and derivation for posterior distribution of the univariate normal distribution as well as Bayes posterior estimator. The normal distribution is the main topic of Bayesian inference for the posterior distribution where multivariate statistics concept is introduced and used to build up the knowledge. The goal is to expand for bivariate and multivariate normal distribution including Wishart distribution. Also, the idea of relative belief ratio and normal distribution for understanding posterior distribution is discussed. 7 STA498 8 Lim, Kyuson CHAPTER 1. ACKNOWLEDGEMENT Chapter 2 Multivariate Normal and Hypothesis Testing 2.1 2.1.1 Basic definitions Multivariate Normal distribution If x ∼ π π (π, πΊ), then the PDF of x 1 is π (x) = 1 0 −1 exp − (x − π) πΊ (x − π) , 2 1 (2π) π/2 |πΊ| 1/2 where (x− π) 0πΊ−1 (x− π) is the squared Mahalanobis distance 2 between x and population mean vector π as a quadratic term. Notice that the PDF does not exists if πΊ is not positive definite 3, which implies |πΊ| = 0. For Gaussian function exp − 21 (x − π), the normalizing constant (2π)1 π/2 is multiplied so the area under the curve is 1. 2.1.2 Distribution of (x − π) 0πΊ−1 (x − π) For x ∼ π π (π, πΊ), 0 −1 (x − π) πΊ (x − π) = {πΊ −1/2 0 (x − π)} {πΊ −1/2 π ∑οΈ 1 0 (x − π)}4 ⇔ √ eπ (x − π) = z0z, ππ π=1 1The constant probability density contour of the function is defined to be C = {x : π (x) = π 0 ⇔ x : (x − π) 0πΊ−1 (x − π) = π2 } for connections of points. √οΈ 2For arbitrary distance of π and π, π (π, π) = (x − y) 0S−1 (x − y), where x = [π₯ 1 , ..., π₯ π ] 0, y = [π¦ 1 , ..., π¦ π ] 0 and π is the sample covariance matrix of all measurements on p variables. 3By the spectral decomposition, πΊ = Qπ²Q is positive definite if and only if ππ ≥ 0 for eigenvalues. Íπ 1 0 4Notice that πΊ−1 = π=1 ππ ei ei . 9 STA498 where Lim, Kyuson z ∼ π π (0, I) = π ∑οΈ π§π2 , where π§π ∼ π (0, 1), π=1 as π(2π) is defined as the distribution of 2.1.3 2 π=1 π§π Íπ such that (x − π) 0πΊ−1 (x − π) ∼ π(2π) . 
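As a quick numerical check of this result, the following minimal sketch (Python, assuming NumPy and SciPy are available; the dimension, mean vector, and covariance are arbitrary illustrative choices, not values from the text) simulates multivariate normal draws and compares the empirical quantiles of the squared Mahalanobis distance (x − μ)'Σ⁻¹(x − μ) with the χ²_p quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative parameters (hypothetical): p = 3 with an arbitrary SPD covariance.
p = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)          # positive definite by construction

X = rng.multivariate_normal(mu, Sigma, size=100_000)
diff = X - mu
# Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu) for every row.
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)

# The empirical quantiles should track the chi-square(p) quantiles.
for q in (0.50, 0.90, 0.95, 0.99):
    print(q, np.quantile(d2, q), stats.chi2.ppf(q, df=p))
```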
MLE of π and πΊ πˆ = xΜ and πΊˆ = 1 π Íπ π=1 (xπ − xΜ)(xπ − xΜ) 0 = Sπ = π−1 π S, where π 1 ∑οΈ S= (xπ − xΜ)(xπ − xΜ) 0 π − 1 π=1 Now, πΈ (S) = πΊ and πΈ ( xΜ) = π5 are unbiased estimators. π π π ∑οΈ 1 ∑οΈ 0 π−1 1 ∑οΈ 0 0 0 S= xπ xπ − xΜxΜ0, xπ xπ − 2 xπ xΜ + πxΜxΜ = π π π=1 π π=1 π=1 where πΈ (xπ x0π ) = πΊ + ππ0 and πΈ ( xΜxΜ0) = cov( xΜ) + πΈ ( xΜ)πΈ ( xΜ0) = π1 πΊ + ππ0. Hence, by taking the expected value π π−1 1 ∑οΈ 0 1 π−1 0 0 0 πΈ (S) = πΈ xπ xπ − πΈ ( xΜxΜ ) = πΊ + ππ − πΊ + ππ = πΊ π π π π π=1 π π π → πΊ, S − → πΊ. Asymptotically, S could be replaced by Sπ According to LLN, xΜ − → π, Sπ − 1 Íπ or πΊ. By definition of S = {π ππ = π−1 π=1 (x ππ − xΜπ )(x π π − xΜ π )}, π π ππ = π 1 ∑οΈ 1 ∑οΈ (x ππ − xΜπ )(x π π − xΜ π ) = (x ππ − ππ + ππ − xΜπ )(x π π − π π + π π − xΜ π ) π − 1 π=1 π − 1 π=1 π 1 ∑οΈ π (x ππ − ππ )(x π π − π π ) + ( xΜπ − ππ )( xΜ π − π π ), = π − 1 π=1 π−1 where the second term converges to 0. By applying LLN, π π−1 ∑οΈ π 1 1 π (x ππ −ππ )(x π π −π π ) = 1− πΈ {(x ππ −ππ )(x π π −π π )} − → πππ , as π → ∞. π π=1 π Equivalently, Sπ is a consistent estimator for πΊ which is analogous to univariate case 6. By CLT where xπ ∼ π π (π0 , πΊ) and xΜ ∼ π π (π0 , π1 πΊ) √ π π( xΜ − π0 ) → − π π (π0 , πΊ) Íπ Íπ Íπ 5πΈ ( xΜ) = πΈ π1 π=1 xπ = π=1 πΈ π1 xπ = π1 π=1 π=π 6As π → ∞, π 2π converges to π 2 which is the population variance. 10 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson and STA498 (x − π0 ) 0πΊ−1 (x − π0 ) ∼ π2π such that π π( xΜ − π0 ) 0πΊ−1 ( xΜ − π0 ) → − π2π , for large sample size and π relatively larger than π. 2.1.4 The sampling distribution of S and xΜ 1 xΜ ∼ π π (π, πΊ), π 1 Var( xΜ) = πΊ, π where S and xΜ are independent. As xπ ∼ π π (π, πΊ) and xΜ is a linear combination of xπ , xΜ follows a normal distribution. (π − 1)S = π ∑οΈ 0 (xπ − xΜ)(xπ − xΜ) = π=1 π ∑οΈ π ∑οΈ (xπ − π + π − xΜ)(xπ − π + π − xΜ) 0 = π=1 (xπ − xΜ) (xπ − xΜ) 0 +π(ππ − xΜ)(π − xΜ) 0 −2π(π − xΜ)(π − xΜ) 0 = π=1 π ∑οΈ (xπ − π) 0 −π(π − xΜ)(π − xΜ) 0, π=1 and π ∑οΈ (xπ − π) 0 ∼ ππ (πΊ), π(π − xΜ)(π − xΜ) 0 ∼ π1 (πΊ) π=1 such that (π − 1)S ∼ ππ−1 (πΊ) = π−1 ∑οΈ zπ z0π , zπ ∼ π π (0, πΊ) π=1 The Wishart distribution with π − 1 degree of freedom has a property πΈ {(π − 1)S} = (π − 1)πΊ. 2.1.5 Hypothesis testing when π, Σ is known The statistical inference is based upon the hypothesis test and to construct confidence regions for the parameters of interest. The goal of this chapter is to include two general ideas, including construction of a likelihood ratio test (LRT) based on the multivariate normal distribution, and the unionintersection approach. CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 11 STA498 Univariate test statistics (π known) Lim, Kyuson If π₯ ∼ π1 (0, π 2 ), the hypothesis testing is π»0 : π = π0 vs π»π : π ≠ π0 . For random samples of π₯1 , ..., π₯ π from the Normal population, the test statistics is π§= π₯¯ − π0 √ ∼ π1 (0, 1) π/ π or π§2 = ( π₯¯ − π0 ) 2 ∼ π12 π 2 /π under π»0 . Multivariate generalization (Σ known) If x ∼ π π (π, πΊ) where |πΊ| > 0, then the hypothesis test is π»0 : π = π0 vs π»π : π ≠ π0 . If x1 , ..., xπ is a random sample from a normal population, then the test statistics π§ 2 = π( xΜ − π0 ) 0πΊ−1 ( xΜ − π0 ) ∼ π2π under π»0 . 2.1.6 Hypothesis testing when π, Σ is unknown Univariate test statistics (π unknown) As an estimated mean vector and hypothesized mean vector π0 for the distance measure is defined, the hypothesis testing is π»0 : π = π0 vs π»π : π ≠ π0 . 
The test statistics is π‘= π₯¯ − π0 √ ∼ π‘ π−1 π / π under π»0 , where π 2 = or π‘ 2 = Íπ π=1 ( π₯¯ − π0 ) 2 2 = π( π₯¯ − π0 )(π 2 ) −1 ( π₯¯ − π0 ) ∼ πΉ1,π−1 π 2 /π (π₯ π −π₯) ¯ 2 π−1 . Note that π‘ 2 is the square distance between sample mean π₯¯ and the test value π0 . The distribution of π‘ 2 under π»0 Under the π»0 , −1 π₯¯ − π0 π 2 π₯¯ − π0 π‘ = π( π₯¯ − π0 )(π ) π( π₯¯ − π0 ) = √ √ π/ π π 2 π/ π Íπ 2 −1 ¯ π₯¯ − π0 π₯¯ − π0 π=1 {(π₯π − π₯)/π} = √ √ π−1 π/ π π/ π 2 −1 π π2 /1 ∼ (π (0, 1)) π−1 (π (0, 1)) ⇔ 2 1 ⇔ πΉ1,π−1 π−1 ππ−1 /(π − 1) 2 12 √ 2 −1 √ CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson Multivariate generalization (π unknown) STA498 For π-dimensional vector, π»0 : π = π0 vs π»π : π ≠ π0 . A natural generalization of univariate π‘ 2 is a multivariate analog of test statistics for Hotelling’s π 2 distribution. −1 s π = ( xΜ − π0 ) ( xΜ − π0 ) π 0 2 √ = π( xΜ−π0 ) 0 Íπ π=1 (xπ − xΜ)(xπ − xΜ) 0 π−1 −1 √ π π,π−1 (πΊ) π( xΜ−π0 ) ∼ (π π (0, πΊ)) , π−1 0 −1 (π π (0, πΊ)), which is in the form of (multivariate normal)0 (Wishart distribution / ππ ) −1 (multivariate normal)7. ⇔ (π − 1)(π π (0, I)) 0 {ππ−1 (I)}−1 (π π (0, I)), where I = I π×π In case the π 2 is too large, this means xΜ too far from the π0 such that π»0 is rejected. Hotelling’s π 2 distribution In the case vector d follows the multivariate normal distribution π π (0, I) which is √ π( xΜ − π0 ) (by CLT), and another random vector M (which is S) follows the Wishart distribution, then π(d0Md) (which is π 2 ) has a Hotelling’s π 2 ( π, π) distribution with dimensionality parameter π and π degrees of freedom, based on the observation π and π. If a random vector π₯ follows the Hotelling’s π 2 distribution which is π₯ ∼ π 2 ( π, π), then π− π+1 π₯ ∼ πΉπ,π−π+1 ππ For hypothesis testing, reject π»0 : π = π0 if π2 > π(π − 1) πΉπ,π−π (πΌ) π−π or πΉ= π−π 2 π > πΉπ,π−π (πΌ), π(π − 1) when observed π = π − 1 for the sample size and π = π to be the dimension of πΊ. Computational example The student Kyuson from sample of 15 course marks he has taken at UTM was analyzed based on the classification on the π₯ 1 = MAT, π₯ 2 = STA and π₯ 3 = other courses (for simplicity sample numbers for courses are the same). Question: Is π0 = (99 99 95) 0 plausible for the population mean vector at πΌ = 0.1? 2 7Notice this is analogous to π‘ π−1 = (Normal random variable)0 (chi-square random variable/ ππ ) −1 (Normal random variable) CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 13 STA498 Lim, Kyuson Equivalently, the problem actually is to test π»0 : π = (99 99 95) 0 vs. π»π : π ≠ (99 99 95) 0. At level πΌ = 0.1, reject π»0 if π 2 > π(π−1) π−π πΉπ,π−π (πΌ) = 3(40−1) 40−3 πΉ3,37 (0.1) The sample mean xΜ = (100 100 99) 0 with S computed. 40( xΜ − π) 0S−1 ( xΜ − π) = 8.739. = 7.544. The computed π 2 = Since 8.739 > 7.544 which is the critical value, π»0 is rejected where his true average differ at least for one area, ππ ≠ ππ0 and conclude he is not being honest. Invariant under transformation, Hotelling’s π 2 Moreover, Hotelling’s π 2 is invariant under transformation of the form y = Cx+b, where C π×π for the hypothesis testing of π»0 : πΈ (y) = Cπ0 + b 8. Since yΜ = CxΜ + b and Sπ¦ = 1 π−1 Íπ π=1 (yπ − yΜ)(yπ − yΜ) 0 = CSπ₯ C0, 0 −1 π 2 = π{yΜ − (Cπ0 + π)}0S−1 π¦ { yΜ − (Cπ 0 + π)} = π{C( xΜ − π0 )} (CSπ₯ C) {C( xΜ − π0 )} = π( xΜ − π0 ) 0C0 (CSπ₯ C) −1 C( xΜ − π0 ) = π( xΜ − π0 ) 0 (Sπ₯ ) −1 ( xΜ − π0 ) Normality, Hotelling’s π 2 π 2 = π( xΜ − π0 ) 0S−1 ( xΜ − π0 ) is approximately chi-square distribution with π ππ . 
whenπ0 is correct. Note that the πΉ-distribution of π 2 rely on the normality assumption. Then, the critical value π(π − 1) πΉπ,π−π (πΌ) > π2π (πΌ), π−π but the value is nearly equivalent for larger values of π − π of πΉπ,π−π (πΌ) as π > π − π. In other words, if π >> π then the difference is larger but if π > π then the gap is smaller 9. Likelihood Ratio Test (LRT) The Hotelling’s π 2 test is equivalent to the LRT 10. For hypothesis testing of π»0 : π = π0 vs π»πΌ : π ≠ π0 , the likelihood ratio (π²) is ˆ π/2 maxπΊ πΏ (π0 , πΊ) | πΊ| π²= = max π,πΊ πΏ (π, πΊ) | πΊˆ 0 | 8Instead of π»0 : πΈ (x) = π0 9For example π = 3000 and π = 10, π (π−1) π− π πΉ π,π− π (πΌ) = 16.057 is close to π2π (πΌ) = 15.987 but if (π−1) 2 π = 30 and π = 5, ππ− π πΉ π,π− π (πΌ) = 12.135 is greater than π π (πΌ) = 9.236. 10Note that this is extended to Neyman-Pearson Lemma for uniformly most powerful test. 14 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson STA498 , where the maximum of multivariate normal likelihood of π and πΊ is 1 1 −ππ −ππ , max πΏ (π, πΊ) = max πΏ (π0 , πΊ) = ππ₯ π ππ₯ π π,πΊ πΊ 2 2 (2π) ππ/2 | ΣΜ0 | π/2 (2π) ππ/2 | ΣΜ| π/2 However, while ΣΜ0 is restricted under the π»0 , ΣΜ is unrestricted 11. The LRT reject π»0 if π² < π for the cutoff value π. ˆ is approxiUnder the continuous mapping theorem, −2 ln(π²) = π{ln( πΊˆ 0 ) − ln( πΊ)} 2 mately following the πππ , where ππ = {π + π( π + 1)/2} − {π( π + 1)/2} = π 12 (number of parameters without the restriction of π»0 - number of parameters under π»0 ). Wilk’s Lambda Equivalently, based on the likelihood ration statistics of π² it is derived for π²2/π Íπ −1 ˆ | π=1 (xπ − xΜ)(xπ − xΜ) 0 | | πΊ| π2 2/π π² = = Íπ ⇔ 1+ < ππ | π=1 (xπ − π0 )(xπ − π0 ) 0 | π−1 | πΊˆ 0 | Notice that for large π 2 the likelihood ratio is small and will reject π»0 . Also, the Hotelling’s π 2 , Wilk’s Lambda and LRT are all equivalent. Inverse of Wishart distribution The Wishart distribution which is (π − 1)S ∼ ππ−1 (πΊ) is an multivariate analogue of the Gamma distribution (as the chi-square distribution of z2 is gamma random variable as well). Íπ−1 0 −1 With a reparametrization where x1 , ..., xπ ∼ π (0, S−1 π=1 xπ xπ 0 ), a cov-matrix πΊ = is sampled from the inverse-Wishart distribution, which is π − 1 df and parameter S−1 0 . Hence, πΈ (πΊ−1 ) = (π − 1)S−1 0 , πΈ (πΊ) = 1 1 −1 (S−1 = S0 , 0 ) (π − 1) − π − 1 π− π−2 by the property of Wishart distribution. For large π − 1, S0 = (π − π − 2)πΊ0 is near true covariance matrix of πΊ. Union-Intersection derivation of π 2 If the null hypothesis is not rejected for given a π of π¦ = a0x ∼ π1 (a0 π, a0πΊa) that maximize test statistics of π‘ a2 , then any of univariate null hypothesis π»0,a : a0 π = a0 π0 ⇔ Íπ Íπ 11ΣΜ0 = π1 π=1 (xπ − π0 ) (xπ − π0 ) 0 to be restricted and ΣΜ = π1 π=1 (xπ − xΜ) (xπ − xΜ) 0 to be unrestricted 12 π correspond to π, π( π + 1)/2 correspond to var-cov matrix, πΊ CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 15 STA498 π = π0 is not rejected. Lim, Kyuson First, the π»0 : π = π0 is equivalent of the form π»0,a : a0 π = a0 π0 . The test statistics 0 π¦¯ − a0 π0 2 a π₯¯ − a0 π0 2 {a0 ( xΜ − π0 )}2 π‘a = = √οΈ = π π¦¯ a0 (S/π)a 1 0 a Sa π Hence, if maxa π‘ a2 < π, then π‘a2 < π for any a. Second, the maximum squared t-test is max π‘a2 a −1 S = ( xΜ − π0 ) ( xΜ − π0 ) = π( xΜ − π0 ) 0 (S) −1 ( xΜ − π0 ) = π 2 , π 0 which is the Hotelling’s π 2 distribution 13. 
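Before moving to confidence regions, a compact numerical sketch of the one-sample test may help (Python with NumPy/SciPy assumed; the simulated data and parameter values are hypothetical, not the course-mark example above). It computes T² = n(x̄ − μ₀)'S⁻¹(x̄ − μ₀), the equivalent F statistic (n − p)T²/(p(n − 1)) with an F_{p, n−p} reference distribution, and the T² critical value at level α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: n observations on p variables.
n, p, alpha = 40, 3, 0.10
mu0 = np.array([99.0, 99.0, 95.0])                   # hypothesized mean vector
X = rng.multivariate_normal(mu0 + 1.0, np.diag([4.0, 4.0, 9.0]), size=n)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                          # unbiased sample covariance (divisor n - 1)
diff = xbar - mu0

T2 = n * diff @ np.linalg.solve(S, diff)             # Hotelling's T^2 statistic
F = (n - p) / (p * (n - 1)) * T2                     # equivalent F statistic
p_value = stats.f.sf(F, p, n - p)
critical_T2 = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)

print(T2, F, p_value, critical_T2)                   # reject H0 if T2 > critical_T2
```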
2.2 Confidence region 2.2.1 Univariate π‘-interval Without relationship between multivariate components, the univariate π‘-interval method is constructed as π¦¯ − π π¦ < π‘ π−1 (πΌ/2) = 1 − πΌ, π − π‘ π−1 (πΌ/2) ≤ √οΈ 2 π π¦ /π π¦¯ −π π¦ as √ π 2π¦ /π ∼ π‘ π−1 . √οΈ In other words, the confidence interval of 100(1 − πΌ)% for π π¦ is π¦¯ ± π‘ π−1 (πΌ/2) π 2π¦ /π, where π‘ π−1 (πΌ/2) is the upper percentile. Problem and Bonferroni’s inequality Notice that the π 100(1 − πΌ)% does not cover joint CI. For π π , each π π covers the corresponding ππ and π(π π ) = 1 − πΌπ . Then, for each π π independent π{ππ ∈ π π } = π π(∩π=1 π π ) =1− π π(∪π=1 π ππ ) ≥ 1− π ∑οΈ πΌπ , π=1 which is the Bonferroni’s inequality. In the case πΌπ = πΌ for all π, then 1 − 1 − π πΌ < 1 − πΌ such that if π > 1 the inequality is not guaranteed. Íπ π=1 πΌπ = 13Using the Maximization Lemma, the Cauchy-Schwartz inequality is based to derive the UnionIntersection derivation of π 2 . 16 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson STA498 Equivalently, if π ππ 14 is the the event of making a Type 1 error on πth test, then π(at least one Type 1 error) = π π(∪π=1 π ππ ) ≤ π ∑οΈ π(π ππ ). π=1 For each of π tests, use a significance level of πΌ/π, then the CI coverage rate or Type 1 error rate is at most 100(1 − πΌ)% or πΌ. So the probability that at least one test results in a Type I error is at most πΌ or the chance that at least one CI does not capture the true mean difference is at most 100(1 − πΌ)%. 2.2.2 Bonferroni’s Simultaneous Confidence interval To construct the simultaneous confidence interval for {π1 , ..., π π } by the confidence level πΌ/π for each of π separate univariate CI’s that is √οΈ π ππ π₯¯π ± π‘ π−1 (πΌ/(2π)) , π = 1, .., π. π √οΈ Since π( π₯¯π ±(ππ ∈ πΌ/(2π)) π ππ /π) = 1−πΌ/π, the joint coverage probability ≥ 1−π( πΌπ ) = 1 − πΌ, which now guarantee to be no smaller than 1 − πΌ 15. 2.2.3 Simultaneous π 2 -intervals To construct simultaneous confidence intervals for any linear combinations of a0 π that Íπ is the expected value of π=1 ππ xπ = a0x where x ∼ π π (π, πΊ) with variance a0πΊa, the CI is derived from univariate-intersection derivation of π 2 0 (a0xΜ − a0 π0 ) 2 a π₯¯ − a0 π0 2 {a0 ( xΜ − π0 )}2 = √οΈ = π‘a = var(a0x)/π a0 (S/π)a 1 0 a Sa π √οΈ ⇔ a0xΜ ± π(π − 1) πΉπ,π−π (πΌ) π−π where max π‘a2 = π 2 ∼ a √οΈ a0Sa , π π(π − 1) πΉπ,π−π π−π For each ππ , √οΈ π₯¯π ± π(π − 1) πΉπ,π−π (πΌ) π−π √οΈ π ππ π 14Notice that this the confidence interval that is √οΈ not covered for ππ . 15Note that π‘ π−1 (πΌ/2π) could be replaces with (π − 1) ππΉ π,π− π (πΌ)/(π − π) for equivalency, by the property of Hotelling π 2 . CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 17 STA498 Therefore, Lim, Kyuson π π‘a2 π(π − 1) π(π − 1) 2 πΉπ,π−π (πΌ) = π max π‘a ≤ πΉπ,π−π (πΌ) ≤ a π−π π−π π(π − 1) 2 =π π ≤ πΉπ,π−π (πΌ) = 1 − πΌ π−π The drawback of the simultaneous π 2 -intervals is less powerful due to wider range of interval, which lead to be less powerful and conservative. 
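Since the two kinds of intervals differ only in their critical multiplier, a short sketch makes the comparison concrete before the next subsection (Python with NumPy/SciPy assumed; the sample summaries are hypothetical). For each component mean μ_i it builds the simultaneous T² interval x̄_i ± sqrt(p(n−1)/(n−p) · F_{p,n−p}(α)) · sqrt(s_ii/n) and the Bonferroni interval x̄_i ± t_{n−1}(α/(2p)) · sqrt(s_ii/n).

```python
import numpy as np
from scipy import stats

# Hypothetical summaries: sample mean and covariance for p = 3 variables, n = 40.
n, p, alpha = 40, 3, 0.05
xbar = np.array([100.0, 100.0, 99.0])
S = np.array([[16.0, 4.0, 2.0],
              [ 4.0, 9.0, 1.0],
              [ 2.0, 1.0, 4.0]])
se = np.sqrt(np.diag(S) / n)                                   # sqrt(s_ii / n)

c_T2 = np.sqrt(p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p))
c_bonf = stats.t.ppf(1 - alpha / (2 * p), df=n - 1)

for i in range(p):
    t2_int = (xbar[i] - c_T2 * se[i], xbar[i] + c_T2 * se[i])
    bonf_int = (xbar[i] - c_bonf * se[i], xbar[i] + c_bonf * se[i])
    print(i, t2_int, bonf_int)   # the T^2 intervals are wider (conservative), Bonferroni narrower
```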
2.2.4 Comparison between Simultaneous π 2 -intervals and Bonferroni’s Confidence intervals Criteria shape joint coverage rate π‘-intervals Bonferroni’s π‘-intervals narrower, powerful < 100(1 − πΌ)% depends on number of intervals Simultaneous π 2 -intervals winder, conservative ≥ 100(1 − πΌ)% does not depend For each ππ , the simultaneous confidence intervals is computed as √οΈ √οΈ π(π − 1) π ππ π₯¯π ± πΉπ,π−π (πΌ) , π−π π but the Bonferroni’s confidence intervals for ππ is computed as √οΈ πΌ π ππ π₯¯π ± π‘ π−1 , where π = 1, .., π. 2π π Confidence Region Denoted as π (x), which is the multivariate extensions of univariate confidence interval (CI) where xπ ∼ π π (π, πΊ) for π = 1, ..., π. Then, for mean vector π π(π − 1) 0 −1 π π( xΜ − π) S ( xΜ − π) ≤ πΉπ,π−π (πΌ) = 1 − πΌ π−π Cantered at xΜ and computing S for the set yields π(π − 1) 0 −1 πΉπ,π−π (πΌ) π (x) = π : π( xΜ − π) S ( xΜ − π) ≤ π−π However, the on half-length along the normalized eigenvector eπ from S16 gives √οΈ √οΈ √οΈ ππ π(π − 1) ππ √οΈ 2 πΉπ,π−π (πΌ) = π (πΌ), π π−π π for each eigenvalues of ππ from S, π = 1, ..., π. 16Correlation R for eigenvalues are computed to be equivalent to the covariance matrix of S for standardized eigenvalues. 18 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson 2.2.5 Multivariate Quality-Control (QC) STA498 Univariate paired t-test For π₯ 1π and π₯ 2π which is the response to treatments, let ππ = π₯π1 − π₯π2 with ππ ∼ π (π π , ππ2 ), for hypothesis testing of π»0 : π π = 0 vs. π»π : π π ≠ 0. Then, the test statistics is π‘= π¯ √ ∼ π‘ π−1 , π π / π under π»0 . If |π‘| > π‘ π−1 (πΌ/2), then reject π»0 . The confidence interval for π π is π π π ± π‘ π−1 (πΌ/2) √ π Multivariate extension in comparison of confidence intervals and confidence region Suppose for π units that there are 2 treatments to be x1π = (π₯ 1π1 , ..., π₯ 1π π ) 0 and x2π = (π₯ 2π1 , ..., π₯ 2π π ) 0 with dπ = x1π − x2π ⇔ ππ π = π₯ 1π π − π₯ 2π π , for all π = 1, ..., π. For dπ ∼ π π (π π , πΊπ , π»0 : π π = 0 vs. π»0 : π π ≠ 0. Then, the test statistics is π 2 = πdΜ0S−1 π dΜ √ 0 Íπ π=1 dπ π=1 (dπ Íπ = π π Íπ − dΜ)(dπ − dΜ) 0 √ π(π − 1) π=1 dπ π πΉπ,π−π (πΌ), ∼ π−1 π π−π under π»0 where the 100(1 − πΌ)% confidence region of π π is π(π − 1) 0 −1 πΉπ,π−π (πΌ) , π (π π ) = π π : π( dΜ − π π ) Sπ ( dΜ − π π ) ≤ π−π which is analogous to the confidence region for xΜ and S π(π − 1) 0 −1 π : π( xΜ − π) S ( xΜ − π) ≤ πΉπ,π−π (πΌ) π−π When π 2 > π(π−1) π−π πΉπ,π−π (πΌ) for the critical value, then reject π»0 . However, 100(1 − πΌ)% Simultaneous π 2 confidence intervals for individual mean differences {π π π } 17 is given by √οΈ π¯π ± √οΈ π(π − 1) πΉπ,π−π (πΌ) π−π π 2π π , π = 1..., π. π (π−1) 2 17Note that when π − π is large replaced by ππ− π πΉ π,π− π (πΌ) with π π (πΌ) by the property of Hotelling’s 2 π and the normality assumption is not necessary. CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 19 STA498 Lim, Kyuson Note that π¯π is the diagonal element of dΜ and π 2π π is the diagonal element of Sπ . This is analogous to ππ of √οΈ √οΈ π(π − 1) π ππ π₯¯π ± πΉπ,π−π (πΌ) . π−π π Also, 100(1 − πΌ)% Bonferroni’s confidence intervals for {ππ } is given by √οΈ π 2π π π¯π ± π‘ π−1 (πΌ/2π) , π = 1..., π. π This is analogous to √οΈ π₯¯π ± π‘ π−1 (πΌ/(2π)) π ππ , π = 1, .., π. π shown before. Simple Block design For π treatments over successive period of time, observation data is denoted as xπ = π₯π1 , ..., π₯ππ and π = π1 , ..., π π . The goal is to compare the components of π. The contrast matrix C is found for two ways. 
First, the contrast matrix is set for control treatments compared with other treatment, which is π − π2 1 −1 0 · · · 0 © 1 ª π ­ π1 − π3 ® ©­1 0 −1 · · · 0 ª® © .1 ª ­ . ®=­ ­ ® ­ .. ®® =C1 π. ­ .. ® ­ . . . ® ­ ® 1 0 0 · · · −1¬ « π π ¬ « π1 − π π ¬ « The other way is a successive treatments for contrast matrix π − π2 1 −1 0 · · · 0 © 1 ª π ­ π2 − π3 ® ©­0 1 −1 · · · 0 ª® © .1 ª ­ ­ ® =­ ® ­ .. ®® =C2 π. .. ­ ® . . . ­ ® . ­ ® 0 0 · · · 1 −1¬ « π π ¬ « π π−1 − π π ¬ « Note that the contrast matrix C1 and C2 is (π − 1) × π matrix. However, in order to test that there is no difference in treatments π»0 : Cπ = 0 vs. π»π : Cπ ≠ 0. The test statistics is the Hotelling’s π 2 as xπ ∼ π π (π, πΊ), π 2 = π(CxΜ) 0 (CSC0) −1 (CxΜ) ∼ (π − 1)(π − 1) πΉπ−1,π−π+1 (πΌ), π−π+1 and the 100(1 − πΌ)% confidence region for C is (π − 1)(π − 1) 0 0 −1 Cπ : π(CxΜ − Cπ) (CSC ) (CxΜ − Cπ) ≤ πΉπ−1,π−π+1 (πΌ) , π−π+1 20 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson STA498 compared to the 100(1 − πΌ)% simultaneous π 2 confidence intervals for 1-dimensional {cπ π} is √οΈ √οΈ c0π Sc π (π − 1)(π − 1) 0 πΉπ−1,π−π+1 (πΌ) c π xΜ ± π−π+1 π Computational Example A sample of 20 courses were administrated with 4 assessments ways: Treatment 1: final exam and no term tests Treatment 2: no exam and no term tests Treatment 3: final exam and term test Treatment 4: no exam and term test The outcome variable is % for students marks. (π3 + π4 ) − (π1 + π2 ): effect of having term test (π1 + π3 ) − (π2 + π4 ): effect of having final exam (π1 + π4 ) − (π2 + π3 ): interaction between term test and final exam −1 −1 1 1 © ª C = ­ 1 −1 1 −1® « 1 −1 −1 1 ¬ Then, π»0 : Cπ = 0 vs. π»π : Cπ ≠ 0 at πΌ = 0.05. From the data, π 2 = π(CxΜ) 0 (CSC0) −1 CxΜ = 20.5. (π−1) 3×19 At πΌ = 0.05, the critical value is (π−1) π−π+1 πΉπ−1,π−π+1 (πΌ) = 17 πΉ3,17 (0.05) = 10.73. Since π 2 > 10.93, reject π»0 at the level of πΌ = 0.05 and conclude that there is a significant difference in contrast for the effect of midterm and final exam for courses to be offered. Within 95% simultaneous confidence intervals, if the confidence interval does not contain 0 then there is an effect by the presence of either term test or final exam. Note that the interaction effect of two factors is not significant if the confidence interval does contain 0. 2.3 Comparing mean vectors of two population When x1π ∼ π (π1 , Σ1 ) and x2 π ∼ π (π2 , Σ2 ) for π = 1, .., π1 and π = 1, ..., π2 for π-variate population and independent, then the goal is to make inference on π1 − π2 . 2.3.1 Pooled sample covariance when π1 , π2 is small and Σ = Σ1 = Σ2 As x1 ∼ π (π, Σ1 ), x1 ∼ π (π, Σ1 ), let xΜ π = xΜ) 0. 1 ππ Íπ π π=1 x ππ and S π = 1 π π −1 Íπ π π=1 (x ππ − xΜ)(x ππ − CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 21 STA498 Lim, Kyuson The pooled sample covariance is the weighted mean of two samples. Íπ 1 0 Íπ2 (x − xΜ )(x − xΜ ) 0 π1 − 1 π2 − 1 2π 2 2π 2 π=1 (x1π − xΜ1 )(x1π − xΜ1 ) S ππππππ = + π=1 ⇔ S1 + S2 π1 − 1 + π2 − 1 π1 − 1 + π2 − 1 π1 + π2 − 2 π1 + π2 − 2 Hypothesis test with small samples when Σ1 = Σ2 2.3.2 For π»0 : π1 − π2 = πΏ0 18 vs. π»π : π1 − π2 ≠ πΏ0 , test statistics of Hotelling’s under π»0 2 π = (x1 − x2 − πΏ0 ) 1 1 = + π1 π2 0 −1/2 (x1 − x2 − π1 + −1 1 1 + S ππππππ (x1 − x2 − πΏ0 ) π1 π2 π2 ) 0S−1 ππππππ 1 1 + π1 π2 −1/2 (x1 − x2 − π1 + π2 ), which follows (multivariate normal)(Wishart / ππ )(multivariate normal) such that ππ1 +π2 −2 ⇔ π π (0, πΊ) π1 + π2 − 2 0 −1 π π (0, πΊ) = π(π1 + π2 − 2) πΉπ,π1 +π2 −π−1 . 
π1 + π2 − π − 1 The hypothesis testing reject π»0 if π2 > 2.3.3 π(π1 + π2 − 2) 2 πΉπ,π1 +π2 −π−1 (πΌ) = πππππ‘ππππ . π1 + π2 − π − 1 Confidence intervals with small samples when Σ1 = Σ2 Confidence region Analogously, the half-length with axes along e1 , ..., e π and ellipsoid centered at x1 − x2 is √οΈ √οΈ 1 1 π(π1 + π2 − 2) + πΉπ,π1 +π2 −π−1 (πΌ) , π = 1, ..., π. ππ π1 π2 π1 + π2 − π − 1 Simultaneous π 2 Confidence intervals a0 (π1 − π2 ) √οΈ √οΈ 1 1 0 2 0 a ( xΜ1 − xΜ2 ) ± πππππ‘ππππ a + S ππππππ a . π1 π2 Notice that this is analogous to each confidence intervals √οΈ √οΈ π(π1 + π2 − 2) 1 1 ( π₯¯1 π − π₯¯2 π ) ± πΉπ,π1 +π2 −π−1 (πΌ) + π π π,ππππππ . π1 + π2 − π − 1 π1 π2 18ie. πΏ0 = 0 22 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson Bonferroni’s Confidence Intervals STA498 ( π₯¯1 π − π₯¯2 π ) ± π‘ π1 +π2 −2 2.3.4 πΌ 2π √οΈ 1 1 + π π π,ππππππ . π1 π2 Behrens-Fisher problem In the case of heterogeneous covariance πΊ1 ≠ πΊ2 with small (moderate) sample sizes π1 , π2 greater than π, the estimator of sample mean difference yields sample covariance S1 S2 (π − 1)(S1 + S2 ) 2 Sπ = + ⇔ π1 π2 2(π − 1) π such that the test statistics under π»0 : π1 − π2 = 0 vs. π»π : π1 − π2 ≠ 0 is π 2 = ( xΜ1 − xΜ2 ) 0S−1 π ( xΜ1 − xΜ2 ) ∼ π£π πΉπ,π£−π+1 , π£− π+1 for π number of variables where π£=Í 2 π=1 π + π2 1 1 −1 2 + tr ππ tr ππ Sπ Sπ 1 −1 2 ππ Sπ Sπ where min(π1 , π2 ) ≤ π£ ≤ π1 + π2 19. Hence, reject π»0 if π 2 > 2.3.5 , π£π π£−π+1 πΉπ,π£−π+1 (πΌ). Heterogeneous covariance matrices with large sample size The test statistics under π»0 with same Sπ is 2 π 2 = ( xΜ1 − xΜ2 ) 0S−1 π ( xΜ1 − xΜ2 ) ∼ π π with the assumption that π1 − π and π2 − π is large enough. Hence, reject π»0 if π 2 > π2π (πΌ). 2.3.6 Box’s M test (Bartlett’s test) The goal is to hypothesis test for the equality of covariance matrices, π»0 : πΊ1 = · · · = πΊπ = πΊ vs. π»π : at least one πΊπ ≠ πΊ π , for some π ≠ π with chi-square approximation. Under multivariate normal distribution, the LRT20 Íπ (ππ −1)/2 (ππ − 1)Sπ |Sπ | , where S ππππππ = Íπ=1 Λ = Ππ π |S ππππππ | π=1 (ππ − 1) 19The approximation reduces to Welch π‘-test in univariate (π = 1), π‘ = 20Formerly, under π»0 : π = π0 maxπΊ πΏ ( π0 ,πΊ) max π,πΊ πΏ ( π,πΊ) = | πΊˆ | π/2 . | πΊˆ 0 | π₯¯ 1 − π₯¯ 2 π 2 1 π1 π 2 + π2 2 and π£ = π 2 1 π1 π 2 2 + π2 π 4 1 π 2 ( π1 −1) 1 2 π 4 2 2 ( π2 −1) +π CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 23 STA498 Lim, Kyuson for ππ that is the sample size for the πth group of Sπ sample covariance. ∑οΈ π π ∑οΈ ⇔ −2 πππΛ = π = (ππ − 1) πππ|S ππππππ | − {(ππ − 1) πππ|Sπ |}, π=1 where π=1 π 1 2π 2 + 3π − 1 ∑οΈ 1 , − Íπ π’= 6( π + 1)(π − 1) π=1 ππ − 1 π=1 (ππ − 1) as π is the number of variables and π is the number of groups. The test statistics is 1 π( π + 1)(π − 1) 2 under π»0 21. While Box’s M test is sensitive to non-normality, MANOVA test of means or treatments are robust to non-normality 22. πΆ = (1 − π’)π ∼ ππ£2 , π£ = 2.4 MANOVA (Multivariate Analysis Of Variance) The one-way MANOVA model for comparing π population mean vector is illustrated as Xπ π = π + ππ + eπ π , eπ π ∼ π π (0, πΊ), which is random vector = overall mean+ πth population treatment effect +random error, where there are π populations and ππ observations ({xπ1 , ..., xπππ }) for population π with the population mean ππ , π = 1, .., π, which follows Wishart distribution. Íπ Constraint on π=1 ππ ππ = 0 define the unique model parameters. 
For vector of observations, decomposes into xπ π = xΜ + ( xΜπ − xΜ) + (xπ π − xΜπ ), which is also observation = overall sample mean π+ ˆ estimated treatment effect πˆπ + residual error, eΜπ π . Note that the normality assumption for samples can be relaxed when the sample size {ππ } is large. 2.4.1 Sum of Squares (TSS = SSπ‘π +SSπππ ) Total (corrected) sum of squares (and cross products), TSS = treatment (between groups) sum of squares and cross products, B + residuals (within group) sum of squares and cross products, W. π ∑οΈ ππ ∑οΈ π=1 π=1 0 (xπ π − xΜ)(xπ π − xΜ) = π ∑οΈ π=1 0 ππ ( xΜπ − xΜ)( xΜπ − xΜ) + π ∑οΈ ππ ∑οΈ (xπ π − xΜπ )(xπ π − xΜπ ) 0, π=1 π=1 21Reject π»0 if πΆ > ππ£2 (πΌ) 22Although M-test reject π»0 , MANOVA test could be inconsistent with. 24 CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Lim, Kyuson as simplified from STA498 (xπ π − xΜ)(xπ π − xΜ) 0 = [(xπ π − xΜ + xΜπ − xΜ) [(xπ π − xΜ + xΜπ − xΜ)] 0 = (xπ π − xΜπ )(xπ π − xΜπ ) 0 + (xπ π − xΜπ )( xΜπ − xΜ) 0 + ( xΜπ − xΜ)(xπ π − xΜπ ) + ( xΜπ − xΜ)( xΜπ − xΜ) 0, Íπ (xπ π − xΜπ ) = 023 such that and ππ=1 ⇔ (xπ π − xΜ)(xπ π − xΜ) 0 = (xπ π − xΜπ )(xπ π − xΜπ ) 0 + ( xΜπ − xΜ)( xΜπ − xΜ) 0 . Notice that (xπ π − xΜ)(xπ π − xΜ) 0 = (xπ π − xΜ) 2 applies for other terms. First, for Sπ of πth sample covariance matrix 24 W= π ∑οΈ ππ ∑οΈ 0 (xπ π − xΜπ )(xπ π − xΜπ ) = (π1 −1)S1 +· · ·+(ππ −1)Sπ ⇔ (ππ −1)S = (π −π)S, π=1 π=1 π=1 where π = π ∑οΈ Íπ π=1 ππ with π − π ππ , Wishart distribution. Hence, W πΈ = πΊ, π −π Second, B= π ∑οΈ ππ ( xΜπ − xΜ)( xΜπ − xΜ) 0, π=1 with π − 1 ππ , Wishart distribution. Thus, TSS has total π − 1 = (π − π) + (π − 1) ππ , Wishart distribution. 2.4.2 Hypothesis Testing The goal is to test the presence of treatment effects. π»0 : π1 = π1 = · · · = ππ is equivalent to π»π : π1 = π1 = · · · = ππ 25, which is that treatment effects are all same. The test statistics26 uses Wilk’s Lambda27 test as B, W follows Wishart distribution, |W| 1 1 π ∗ ⇔ = Ππ=1 , π² = |B + W| det(I + W−1 B) 1 + πˆ π where πˆ 1 , .., πˆ π are eigenvalues of W−1 S, as π = min( π, π − 1) is the rank of B. 23Note that (xπ π − xΜπ ) ( xΜπ − xΜ) 0 + ( xΜπ − xΜ) (xπ π − xΜπ ) = 0 24Note that the generalized (π1 + π2 − 2)S ππππππ is recommended in two-sample case. 25As ππ = xΜπ − xΜ, testing for π»0 : π1 = · · · = ππ ⇔ xΜ1 − xΜπ = 0 ˆ 26Analogous LRT to | πΊπΊˆ |. 0 27There are Roy’s test maxπ (BW−1 ), Lawley-Hotelling’s test tr(BW−1 ), and Pillar’s test tr{B(B+W) −1 } CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING 25 STA498 2.4.3 Number of variable π=1 Number of group π≥2 π=2 π≥2 π≥1 π=2 π≥1 π=3 2.4.4 Lim, Kyuson Distribution of Wilk’s Lambda Test statistics π−π 1−π²∗ π−1 1 − π²∗√ ∗ π−π−1 1 − 1−√ π²∗ π−1 π² π−π−1 1−π²∗ 1 − ∗ π π² √ ∗ π−π−2 1 − 1−√ π²∗ π π² Distribution under π»0 πΉπ−1,π−π πΉ2(π−1),2(π−π) πΉπ,π−π−1 πΉ2π,2(π−π−2) Large Sample property for modification of π²∗ If π»0 is true for π to be large, π+π − π −1− ln(π²∗ ) ∼ π2π(π−1) 2 However, reject π»0 if π+π − π −1− ln(π²∗ ) > π2π(π−1) (πΌ) 2 2.4.5 Simultaneous Confidence Intervals for treatment effect Let ππ = xΜπ − xΜ to be the πth treatment effect. Then, the treatment difference between πth 0 and πth treatment is πˆπ − πˆπ = xΜ π − xΜ − xΜπ + xΜ = xΜ π − xΜπ ⇔ π₯¯ π1 − π₯¯π1 · · · π₯¯ π π − π₯¯π π , Wππ and Var(πˆππ − πˆππ ) = Var(π₯¯ ππ − π₯¯ππ ) = π1π + π1π πππ , where πππ = π ππ, ππππππ = π−π . 
The 95% simultaneous Bonferroni’s confidence intervals for { πˆππ − πˆππ } 28 is √οΈ 1 1 ( π₯¯ ππ − π₯¯ππ ) ± π‘ π−π (πΌ/2π) + π ππ,ππππππ , where π = ππ(π − 1)/2, π π ππ where π is the number of variables and π is the number of populations. Hence, reject π»0 : πππ − πππ = 0 if | π₯¯ ππ − π₯¯ππ | πΌ π‘ = √οΈ > π‘ π−π . 2π 1 1 + π ππ ππ ππ, ππππππ 28Note that this is analogous to ( π₯¯1 π − π₯¯2 π ) ± π‘ π1 +π2 −2 26 πΌ 2π √οΈ 1 π1 + 1 π2 π π π, ππππππ defined. CHAPTER 2. MULTIVARIATE NORMAL AND HYPOTHESIS TESTING Chapter 3 Bayesian Alternative approach Let the discrete random variables π which is to be estimated and observed random variable of π = π₯. From the prior pr(π) information about possible values for the parameter, the approach uses observed data p(π₯|π) to update the information on posterior probabilities p(π|π₯)1 as a regenerating process by the confidence intervals, p(π ∈ πΆπΌ |π₯) = 1−πΌ. If known with π ∗ related to probability data points given, then the estimated π₯ˆπ ∼ p(π₯|π ∗ ) for the true value of π₯ to generate the update information in the compatible space, where there is no overfitting to be concerned with. From the likelihood p(π₯|π), the actual distribution by the Bayes theorem yield unnormalized posterior density which is the right side p(π|π₯) = p(π₯|π)p(π) ⇔ ∝ p(π₯|π)p(π), p(π₯) where p(π₯) is unknown with fixed π¦ and does not depend on π. Note that p(π₯) is also referred to as evidence. ⇔ Posterior ∝ Prior × Likelihood. A parameter of a prior distribution is referred to a hyperparameter. For predictive inference on unknown variable before data π₯ is considered, the distribution of unknown but observed π₯ is ∫ ∫ π(π₯) = π(π₯, π)ππ = π(π) π(π₯|π)ππ Θ Θ as a marginal distribution of π₯, which a prior predictive distribution2. For observed data x and unknown π = (π, π 2 ), the unknown observable π₯˜ to be predicted is referred to be posterior predictive distribution ∫ ∫ ∫ π( π₯|x) ˜ = π( π₯, ˜ π|x)ππ = π( π₯|π, ˜ x) π(π|x)ππ = π( π₯|π) ˜ π(π|x)ππ, Θ Θ Θ 1This is written by the Bayes theorem that p(π₯, π)/p(π₯) ⇔ (p(π₯ | π) p(π))/p(π₯) 2predictive refers to the distribution for a quantity that is observable. 27 STA498 Lim, Kyuson as posterior which is conditional on observed x and predictive for observable π₯. ˜ The ratio of posterior density π(π|π₯) evaluated at points π 1 and π 2 under the given model is referred to posterior odds for π 1 compared to π 2 . π(π 1 |π₯) π(π 1 ) π(π₯|π 1 )/π(π₯) π(π 1 ) π(π₯|π 1 ) = = , π(π 2 |π₯) π(π 2 ) π(π₯|π 2 )/π(π₯) π(π 2 ) π(π₯|π 2 ) which the posterior odds, π(π₯|π 1 ) π(π₯|π 2 ) π(π 1 |π₯) π(π 2 |π₯) equal to the prior odds π(π 1 ) π(π 2 ) times likelihood ratio, under the Bayes’ rule for discrete parameters. Random variables and Bayesian statistical inference For the unknown random variables of Θ to be estimated and π = π₯ which is observed, the Bayes’ rule yields π(Θ = π|π = π₯) = π π |Θ (π₯|π) π Θ (π) π(π = π₯|Θ = π) π(Θ = π) ⇔ π Θ|π (π|π₯) = . π(π = π₯) π π (π₯) Either Θ or π2 are continuous random variable, replace the PMF or PDF in the formula. Equivalently, the posterior PDF is represented by prior times likelihood with π π (π₯) using the law of total probability as ⇔ πΘ|π (π|π₯) = π π |Θ (π₯|π) πΘ (π) . π π (π₯) In the problem of Bayesian statistics, the choice prior πΘ (π) is generally unclear and subjective to be different. With unknown variable Θ, the goal is to draw inferences by observing related random variable π, about Θ. 
Note that the posterior distribution of Θ, πΘ|π (π|π₯)/π Θ|π (π|π₯), contains all information is derived by point or interval estimates of Θ. Comparison between Frequentist and Bayesian methods For frequentist inference, probabilities are frequencies as the goal is to create procedure with long run frequency guarantees. For Bayesian inference, probabilities are subjective degrees of belief as to state and analyze. Hence, frequentists view parameter as fixed constant while Bayesian considers as random variable. For example, the confidence interval is considered. √ √ For confidence interval defined as πΆ πΌ = [ π¯ π − 1.96/ π, π¯ π + 1.96/ π], the probability statement is π π (π ∈ πΆ πΌ) = 0.95 for all π ∈ R, which is random due to function of the data. With parameter π fixed, the CI trap the true value with probability 0.95. For infinitely many experiments of π data points and chosen π π , the computed intervals 28 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson STA498 πΆ πΌπ is found to trap the parameter π π , 95% of the time, that is almost surely convergent for any sequences π π , π 1 ∑οΈ πΌ (π π ∈ πΆ πΌπ ) ≥ 0.95 lim inf π→∞ π π=1 On the other hands, for beliefs the unknown parameter π is given as a prior distribution π(π) to represent the subjective beliefs about π. Using Bayes’ theorem, the posterior distribution for π given the observed data π1 , ..., ππ is computed with likelihood function π π(π |π), πΏ(π) = Ππ=1 π ∫ π(π|π1 , ..., ππ ) ∝ πΏ (π)π(π) ⇔ π(π|π1 , ..., ππ )ππ = 0.95 ⇔ π(π ∈ πΆ πΌ |π1 , ..., ππ ) = 0.95 πΆπΌ Hence, the degree-of-belief probability statement about π given the observed data is not the same, where the intervals would not trap the true value 95% of the time. In summary, the frequentist CI satisfies inf π π π (π ∈ πΆ πΌ) = 1 − πΌ for the coverage of the interval CI, and the probability refers to random interval CI. A Bayesian confidence interval CI satisfies π(π ∈ πΆ πΌ |π1 , ..., ππ ) = 1 − πΌ, where the probability refers to π. While the subjective Bayesian interpret probability strictly as personal degrees of belief, the objective Bayesian try to find the prior distributions for the resulting posterior to be objective 3. However, frequentist Bayesian only use Bayesian methods when resulting posterior has good frequency behaviours. On the other hands, the likelihoodist use the likelihood function to measure the strength of data as an evidence. 3.0.1 Overview: Univariate Binomial distribution with known and unknown parameter Let the probability of a success in a trial is π. Also, let π = {π₯1 , .., π₯ π } be the observation Íπ set where π₯ 1 ∼ π΅ππ (π). Then, the probability of π = π=1 π₯π success times in π trials π π (π₯ 1 , ..., π₯ π ) happens is p(π = π₯ 1 , ..., π₯ π |π, π) = Bin(π |π, π)= π π (1 − π) π−π as the posterior distribution. Example. Objective Bayesian approach As π(π|π = π₯ 1 , ..., π₯ π ) ∝ π(π = π₯ 1 , ..., π₯ π |π) π(π) for the prior π ∼ π [0, 1] to be unknown so to set the following the uniform distribution for π(π) = 1, then π(π|π) ∝ π π (1 − π) π−π = π π +1−1 (1 − π) π−π +1−1 ⇔ π π (1 − π) π−π Γ(π + 2) π π (1 − π) π−π = , Γ(π + 1)Γ(π − π + 1) Beta(π + 1, π − π + 1) π|π, π ∼ Beta(π + 1, π − π + 1) 3Empirical Bayesian estimate the prior distribution from the data. CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 29 STA498 Lim, Kyuson where posterior follows the Beta distribution. Since the density function integrates to 1, the normalizing constant (π§) is ∫ 1 Γ(π + 1)Γ(π − π + 1) π§= π π (1 − π) π−π = . 
Γ(π + 2) 0 The prior predictive distribution for fixed π success in ∫ 1 ∫ 1 1 π π π Γ(π + 1)(π − π + 1) π−π π(π) = π(π |π , π) π(π, π )ππ = π (1−π) ππ = = . π π Γ(π + 2) π+1 0 0 Hence, the prior predictive density π(π) = 1 π+1 ∫ one observation with an outcome π( π₯˜ = 1) = an example. which is to be uniform, where π₯˜ is the ∫1 π( π₯˜ = 1|π) π(π)ππ = 0 πππ = 1/2 as 1 0 Also, by the mean of Beta distribution the Bayes posterior estimator is πΈ (π|π) = For instructive purpose of convexity, 1 π + (1 − π π ) , πΈ (π|π) = π π π 2 π +1 π+2 . from the prior mean for 1/2 and the maximum likelihood estimate4 π /π. Moreover, the π which is close to 1. optimized convex set for π π is π+2 Example. Subjective Bayesian approach On the other hand, the subjective Bayesian find the uninformed prior to be strongly peaked around 1/2, as a subjective beliefs about the data. For the known of π, the posterior with Bayes rule yield ∫ π(π |π, π ) π (π|π ) π (π|π, π ) = , where π(π |π ) = π(π |π, π ) π (π|π )ππ. π(π |π ) Θ By setting the prior distribution π ∼ π΅ππ‘π(πΌ, π½) for π (π|π ) = π (π), π π Γ(πΌ + π½) πΌ−1 π½−1 π−π π(π|π, π ) ∝ π(π |π) π (π) ⇔ π (1 − π) π (1 − π) π Γ(πΌ)Γ(π½) = Γ(πΌ + π½ + π) π πΌ+π −1 (1 − π) π½+π−π −1 ⇔∝ π πΌ+π −1 (1 − π) π½+π−π −1 Γ(πΌ + π )Γ(π½ − π + π) without the normalizing constant 5. The posterior distribution π|π ∼ π΅ππ‘π(πΌ+π , π½+π−π ) πΈ (π|π) = πΌ+π , πΌ+π½+π π£ππ (π|π) = (πΌ + π )(π½ − π + π) , (πΌ + π½ + π) 2 (πΌ + π½ + π + 1) and PDF ππ|π (π|π = π₯ 1 , ..., π₯ π ). π 4The estimate is success over success plus failure as πΏ (Θ, π0 = Ππ=1 π ππ in multinomial distribution. 5Note that Γ(π) = (π − 1)! 30 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson STA498 Moreover, the Bayesian point estimate is summarized at the center of the posterior distribution ∫ 1 πΌ + π π π πΌ + π½ πΌ π¯ = π π (π|π)ππ = = + πΌ+π½+π πΌ+π½+π π πΌ+π½+π πΌ+π½ 0 πΌ+π½ π π π πΏπΈ + πΈ (π), ⇔ πΌ+π½+π πΌ+π½+π for the prior mean. After the data π have been observed, an unknown observable, π₯˜ is predicted, which is referred to as a posterior predictive distribution. Now, the posterior predictive distribution for just one observation π₯˜ = 1 of new value conditional on several observations π yield ∫ 1 ∫ 1 π( π₯˜ = 1|π, π ) = π( π₯˜ = 1|π, π, π ) π(π|π, π )ππ = π΅ππ ( π₯˜ = 1|π) π΅ππ‘π(π|π +πΌ, π½)ππ 0 0 ∫ ⇔ 1 ∫ π π΅ππ‘π(π|π + πΌ, π½)ππ = 0 1 π π(π|π)ππ = πΈ (π|π) 0 where π( π₯˜ = 1) = π6 such that the mean of the posterior distribution is derived to be πΌ+π πΈ (π|π) = πΌ+π½+π . For the purpose of Bayesian inference, the predictive distribution for the new observations are derived in the example. Equivalently, the generalized form of prediction is Íπ πΌ+ π=1 π₯π π( π₯˜ = 1|π) = πΈ (π|π) = πΌ+π½+π . On the other hand, π( π₯˜ = 0|π) = 1 − πΈ (π|π) = π½+ Íπ π=1 (1−π₯ π ) . πΌ+π½+π 3.1 Conditional distribution of the subset Given canonical form of x (2) ∼ π π−π (π (2) , πΊ22 ), the conditional distribution of x (1) ∼ (2) − π (2) ) and πΊ π π (π (1) , πΊ11 ) 7 is π π (π1.2 , πΊ11.2 ), where π1.2 = π (1) + πΊ12 πΊ−1 11.2 = 22 (x −1 πΊ11 − πΊ12 πΊ22 πΊ21 , and x is π × 1 matrix 8. (2) (2) −1 Thus, the conditional density x (1) |x (2) ∼ π (π (1) +πΊ12 πΊ−1 22 (x − π ), πΊ11 −πΊ12 πΊ22 πΊ21 ). Independence and covariance 0 For partition of subset x = x = x (1) x (2) , x (1) ⊥ x (1) ⇔ πΊ12 = 0.9 Generally, if both x (1) and x (2) follow normal distribution and are independent, then the joint distribution is normally distributed. 
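A small numerical sketch of the conditional-distribution formulas above (Python with NumPy assumed; the partition sizes and parameter values are purely illustrative) computes μ_{1·2} = μ^(1) + Σ₁₂Σ₂₂⁻¹(x^(2) − μ^(2)) and Σ_{11·2} = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁ for a given observed x^(2).

```python
import numpy as np

# Illustrative 4-dimensional normal, partitioned into x1 (first 2 coords) and x2 (last 2).
mu = np.array([0.0, 1.0, -1.0, 2.0])
Sigma = np.array([[4.0, 1.0, 0.5, 0.2],
                  [1.0, 3.0, 0.3, 0.1],
                  [0.5, 0.3, 2.0, 0.4],
                  [0.2, 0.1, 0.4, 1.0]])
idx1, idx2 = [0, 1], [2, 3]

S11 = Sigma[np.ix_(idx1, idx1)]
S12 = Sigma[np.ix_(idx1, idx2)]
S21 = Sigma[np.ix_(idx2, idx1)]
S22 = Sigma[np.ix_(idx2, idx2)]

x2 = np.array([0.0, 3.0])                      # observed value of the conditioning block
mu_cond = mu[idx1] + S12 @ np.linalg.solve(S22, x2 - mu[idx2])
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

print(mu_cond)      # conditional mean of x1 given x2
print(Sigma_cond)   # conditional covariance (does not depend on the observed x2)
```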
6For Bernoulli trial, π( π₯˜ = 0) = 1-π 7Note that this definition for partition is also valid in EM algorithm to be estimated 0 8Note that x = x (1) x (2) . 9This could be proven as π (x) = π (x (1) ) π (x (2) ) = 0, where off-diagonal elements other than πΊππ is 0. CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 31 STA498 Linear transformation Lim, Kyuson For the linear transformation with respect to A which could be defined as y = Ax, y follows the distribution of π π (π∗ , πΊ∗ ), where π∗ = Aπ and πΊ∗ = AπΊA0 10. Based on the conditional distribution formula, y (1) |y (2) ∼ π (π∗∗ , πΊ∗∗ ), where π∗∗ = (2) − π ∗(2) ) and πΊ∗∗ = πΊ∗ − πΊ∗ πΊ∗−1 πΊ∗ , where y (2) is given matrix. π∗(1) + πΊ∗12 πΊ∗−1 22 (y 11 12 22 21 3.1.1 Law of total expectation Often referred to as tower property, the Adam’s law for random variable π and π is πΈ (π) = πΈ (πΈ (π |π )) ∑οΈ ∑οΈ ∑οΈ πΈ (πΈ (π |π )) = πΈ π₯ π(π = π₯|π ) = π₯ π(π = π₯|π = π¦) π(π = π¦) π₯ = ∑οΈ ∑οΈ π¦ =π₯ π¦ π₯ π(π = π₯, π = π¦) = π₯ π₯ ∑οΈ ∑οΈ π₯ π₯ ∑οΈ ∑οΈ π(π = π₯, π = π¦) = π¦ π₯ π(π = π₯, π = π¦) π¦ ∑οΈ π₯ π(π = π₯) = πΈ (π), π₯ and the Eve’s law is π£ππ (π) = πΈ (π£ππ (π |π )) + π£ππ (πΈ (π |π )), as πΈ (π 2 ) = πΈ πΈ ((π |π ) 2 ) − (πΈ (π |π )) 2 + (πΈ (π |π )) 2 ⇔ πΈ π£ππ (π |π ) + (πΈ (π |π )) 2 2 ⇔ πΈ (π 2 ) − πΈ (π) 2 = πΈ π£ππ (π |π ) + (πΈ (π |π )) 2 − πΈ (πΈ (π |π )) = πΈ π£ππ (π |π ) + πΈ (πΈ (π |π ) 2 ) − πΈ (πΈ (π |π )) 2 , where π£ππ (π |π ) = πΈ (πΈ (π |π ) 2 ) − πΈ (πΈ (π |π )) 2 ⇔ πΈ π£ππ (π |π ) + π£ππ πΈ (π |π ) which is also referred to as the law of total expectation. As {π΄π } is the partition of the probability space and assumes finite or countably infinite set of finite values πΈ (π) < ∞, the law of total probability in countable and finite cases guarantees, ∑οΈ πΈ (π) = πΈ (π | π΄π ) π( π΄π ) π For Eve’s law, notice that the posterior variance is on average smaller than the prior variance. This indicates the greater the latter variation in Eve’s law, the more the potential for reducing our uncertainty with regard to π. 10Note that cov(Ax (1) ) = A cov(x (1) )A0 and also cov(Ax (1) , Bx (2) ) = A cov(x (1) , x (2) )B0 in partitioning the vector. 32 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson 3.1.2 STA498 Conditional expectation (MMSE) For posterior distribution for unknown random variable π , such as ππ |π (π¦|π₯), the point estimate of the posterior mean is defined as π¦ˆ π = πΈ (π |π = π₯), which is the minimum estimate of the π in terms of the MSE, referred to as a minimum mean squared error (MMSE) 11 or Bayes’ estimate of π . The posterior density is derived with computing π π |π (π₯|π¦) ππ (π¦) ππ |π (π¦|π₯) = , where π π (π₯) = π π (π₯) ∫ +∞ π π |π (π₯|π¦) ππ (π¦)ππ¦. −∞ Then, the MMSE estimate of π given π = π₯ is then given by ∫ +∞ π¦ˆ π = π¦ ππ |π (π¦|π₯)ππ¦ ⇒ πΈ (πˆπ ) = πΈ (πΈ (π |π)) = πΈ (π ), −∞ by applying for the Adam’s law. Hence, πΈ (π ) − πΈ (πˆπ ) = 0 which is an unbiased estimator. Properties of estimation error For the unobserved random variable to be estimated is π which is estimated by π = π₯, let estimate πˆ = π(π) to be the function of π, and the error of estimate is defined as π˜ = π − πˆ ⇔ π − π(π) for MSE of πΈ [(π − π(π)) 2 ]. The goal is to derive the variance of π and expectation of π . Now, let π = πΈ (π˜ |π) and πˆπ = πΈ (π |π), where π˜ = π − πˆπ . Then, π = πΈ (π˜ |π) = πΈ (π − πˆπ |π) = πΈ (π |π) − πΈ (πˆπ |π) = πˆπ − πΈ (πˆπ |π) = πˆπ − πˆπ = 0. For any function of π(π) πΈ (π˜ π(π)|π) = π(π)πΈ (π˜ |π) = π(π)π = 0. 
Similarly, by the Adam’s law for iterated expectations πΈ (π˜ π(π)) = πΈ [πΈ (π˜ π(π)|π)] = 0, by the previous result applied. However, the estimation error of π˜ = π − πˆπ and πˆπ = πΈ (π |π) is uncorrelated to derive the variance of π . By applying covariance formula πππ£(π˜ , πˆπ ) = πΈ (π˜ πˆπ ) − πΈ (π˜ )πΈ (πˆπ ) = πΈ (π˜ πˆπ ), 11Least mean squares (LMS) CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 33 STA498 where πΈ (π˜ ) = πΈ (πΈ (π˜ |π)) = 0 from the previous result such that Lim, Kyuson ⇔ πΈ (π˜ π(π)) = πΈ [πΈ (π˜ π(π))|π] = 0, by applying for the Adam’s law for iterated expectation. For π˜ = π − πˆπ , due to πππ£(π˜ , πˆπ ) = 0, π£ππ (π ) = π£ππ (πˆπ ) + π£ππ (π˜ ) 2 ⇔ πΈ (π 2 ) − πΈ (π ) 2 = πΈ (πˆπ ) − πΈ (πˆπ ) 2 + πΈ (π˜ 2 ) − πΈ (π˜ ) 2 , where πΈ (πˆπ ) = πΈ (πΈ (π |π)) = πΈ (π ), πΈ (π˜ ) = πΈ (π − πˆπ ) = 0 such that 2 2 πΈ (π 2 ) = πΈ (πˆπ ) + πΈ (π˜ 2 ) + (πΈ (π ) 2 − πΈ (πˆπ ) 2 ) − πΈ (π˜ ) 2 ⇔ πΈ (πˆπ ) + πΈ (π˜ 2 ). Also, the MSE of π |π, where π is the unknown variable, is derived as πππΈ (π |π) = πΈ [π£ππ (π |π)] π£ππ (π |π) = πΈ [(π − πΈ (π |π)) 2 |π] ⇔ πΈ (π 2 |π) − πΈ (π |π) 2 by definition ⇔ πΈ [π£ππ (π |π)] = πΈ [πΈ [(π − πΈ (π |π)) 2 |π]] = πΈ [(π − πΈ (π |π)) 2 ] = πΈ [(π − πˆπ ) 2 ], which is the MSE of the estimator. Moreover, the above derivations and equation involves for convolution of normals and bivariate normal for estimators. MSE for convolution of two normally distributed random variables For π ∼ π (π π , ππ2 ) independent of π ∼ π (ππ , ππ2 ), let π = π + π. The goal is to 2 ) − πΈ (π ˜ 2 ). derive πˆ π = πΈ (π |π ), πΈ [(π − πˆ π ) 2 ] which will verify for πΈ (π 2 ) = πΈ ( πˆ π Now, πππ£(π, π ) = πππ£(π, π + π) = π£ππ (π) + πππ£(π, π) = ππ2 by independence and π π,π = πππ£(π, π )(ππ ππ ) −1 = ππ (ππ + ππ ) −1 . Then, MMSE of π |π is πΈ (π |π ) = πˆ π = π π + πππ ππ−1 (π − ππ ) = π π + (ππ2 (π − ππ ))ππ−2 , which is analogous to (2) − π (2) ). πΈ (x (1) |x (2) ) = π (1) + πΊ12 πΊ−1 22 (x Also, the MSE of πˆ π is πΈ ( πˆ 2 ) = πΈ [(π − πˆ π ) 2 ] with substituting the derived equation. 2 ) + πΈ (π ˜ 2 ) could be verified for substitution Since πΈ (π 2 ) = ππ2 + πΈ (π) 2 , πΈ (π 2 ) = πΈ ( πˆ π from the above equation. 3.1.3 Laplace’s law of succession For the rule of succession 12, when repeating Bernoulli trials for π times independently with π successes, if π1 , ..., ππ+1 conditionally independent random variables, then π(ππ+1 = 1|π1 + · · · + ππ = π ) = π +1 . π+2 12When there are few observations, or for events that have not been observed to occur at all in (finite) sample data, the probability examine the next repetition to succeed 34 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson STA498 Within the prior success or failure, let π ∈ π [0, 1] to describe the uncertainty as a prior probability measure. Also, ππ describe πth trial for 0 and 1 and π₯π is the data actually observed. Now, the likelihood function for π is π πΏ (π) = π(π1 = π₯ 1 , ..., ππ = π₯ π |π) = Ππ=1 π π₯π (1 − π) 1−π₯π = π π (1 − π) π−π , Íπ where π = π=1 π₯π is the number of successes for π trials. The goal is to derive for posterior distribution π (π|π1 = π₯1 , ..., ππ = π₯ π ) = ∫ 1 πΏ (π) π (π) 0 ˜ π ( π)π ˜ π˜ πΏ ( π) = ∫1 0 π π (1 − π) π−π , ˜ π−π π π˜ π˜π (1 − π) where the Beta distribution PDF yield ∫ 1 Γ(π + 1)Γ(π − π + 1) ˜π Γ(π + 2) Γ(π + 2) ˜ π−π π π˜ = π (1 − π) Γ(π + 1)Γ(π − π + 1) 0 Γ(π + 2) Γ(π + 1)Γ(π − π + 1) so that π΅(πΌ, π½) = Γ(π +1)Γ(π−π +1) ,πΌ Γ(π+2) = π + 1, π½ = π − π + 1 ⇔ π (π|π1 = π₯ 1 , ..., ππ = π₯ π ) = (π + 1)! 
π π (1 − π) π−π , π !(π − π )! where this is the Beta distribution with expected value ∫ 1 π +1 , πΈ (π|π1 = π₯ 1 , ..., ππ = π₯ π ) = π π (π|π1 = π₯ 1 , ..., ππ = π₯ π )ππ = π+2 0 as π is a random variable the law of total probability provide the expected probability of success is π. For cases when π = 0 or π = π, the hypergeometric distribution Hyp(π|π, π, Θ) used, where Θ is the total number of successes in the total population size π. When π, Θ → ∞, 1 the ratio π = Θ π is fixed. Now, the prior probability of π (1−π) is roughly equivalent to 1 Θ(π−Θ) with 1 ≤ Θ ≤ π − 1. Then, the posterior for Θ, Θ π −Θ 1 π(Θ|π, π, π) ∝ . Θ(π − Θ) π π − π For conjugate prior of multinomial distribution, the Dirichlet distribution is the posterior distribution. 13 3.1.4 Bayesian Hypothesis testing For two hypothesis π»0 and π»π , let π(π»0 ) = π 0 and π(π»π ) = π 1 and π 0 + π 1 = 1. Also, for random variable π, the distribution of π under hypothesis is defined as π π (π₯|π»0 ) and π π (π₯|π»π ). By Bayes’ rule, the posterior probability of π»0 and π»π is obtained: π(π»0 |π = π₯) = ππ₯ (π₯|π»0 ) π(π»0 ) , ππ₯ (π₯) π(π»π |π = π₯) = ππ₯ (π₯|π»π ) π(π»π ) . ππ₯ (π₯) 13Th joint posterior distribution of π 1 , ..., π π for π (π 1 , ..., π π |π1 , ..., ππ , πΌ) = Íπ π=1 π π Íπ Γ( π=1 (ππ +1)) π Γ(π +1) π π1 ··· π ππ Ππ=1 π π 1 , = 1. CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 35 STA498 Lim, Kyuson The posterior comparison between π(π»0 |π = π₯) and π(π»π |π = π₯) could be used to decide between π»0 and π»π for higher probability to take into account for. Maximum A Posteriori (MAP) test The idea to take for the higher posterior probability in hypothesis test is referred to as MAP test. The π»0 is chosen if and only if π(π»0 |π = π₯) ≥ π(π»1 |π = π₯) ⇔ ππ₯ (π₯|π»0 ) π(π»0 ) ≥ ππ₯ (π₯|π»π ) π(π»π ). Note that the MAP test is also generalized for the case where there are more than 2 hypotheses for taking the hypothesis with highest posterior probability, π(π»π |π = π₯) ⇔ π π (π₯|π»π ) π(π»π ). Then, the average error probability for hypothesis testing is written as π π = π(choose π»1 |π»0 ) π(π»0 ) + π(choose π»0 |π»1 ) π(π»1 ), where the MAP test achieve minimum possible average error probability. Either for continuous π π |π (π₯|π¦) or discrete π π |π (π₯|π¦), the maximum a posteriori (MAP) estimate, π₯ˆ π π΄π could be obtained for the point or interval estimates of π, by maximizing ππ |π (π¦|π₯) π π (π₯), as π₯ does not depend on ππ (π¦). Minimum Cost hypothesis test In two hypothesis testing, there are two types of error which is to accept π»0 while π»π is true or π»π while π»0 is true. Let the cost to each error type be defined as πΆ10 and πΆ01 accordingly. Then, the average cost is πΆ = πΆ10 π(choose π»π |π»0 ) π(π»0 ) + πΆ01 π(choose π»0 |π»π ) π(π»π ) ⇔ π(choose π»π |π»0 ) [ π(π»0 )πΆ10 ] + π(choose π»0 |π»π ) [ π(π»π )πΆ01 ]. Since π(π»π |π = π₯) = π π (π₯|π»π ) π(π»π ) , π π (π₯) the π»0 is chosen if and only if π π (π₯|π»0 ) π(π»0 )πΆ10 ≥ π π (π₯|π»π ) π(π»π )πΆ01 = π π (π₯|π»0 ) π(π»π )πΆ01 ≥ π π (π₯|π»π ) π(π»0 )πΆ10 ⇔ π(π»0 |π₯)πΆ10 ≥ π(π»π |π₯)πΆ01 , for decision rule. Hence, the posterior risk for accepting π»π is derived to be π(π»0 |π₯)πΆ10 14. This would derived to take the minimum cost test as to accept the hypothesis test with lowest posterior risk. 14Equivalently, the posterior risk for accepting π»0 is derived to be π(π» π |π₯)πΆ01 . 36 CHAPTER 3. 
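A minimal sketch of this rule (Python with SciPy assumed; the two hypothesized densities, prior probabilities, and costs are hypothetical choices rather than values from the text) evaluates both posterior risks at an observed x and accepts the hypothesis with the smaller one; with equal costs it reduces to the MAP test.

```python
from scipy import stats

# Hypothetical simple-vs-simple setup: X | H0 ~ N(0, 1) and X | H1 ~ N(2, 1).
prior_H0, prior_H1 = 0.7, 0.3
C10, C01 = 1.0, 5.0        # C10: cost of accepting H1 when H0 is true; C01: the reverse

def decide(x):
    # Unnormalized posteriors; the common evidence f(x) cancels in the comparison.
    post_H0 = stats.norm.pdf(x, loc=0.0, scale=1.0) * prior_H0
    post_H1 = stats.norm.pdf(x, loc=2.0, scale=1.0) * prior_H1
    risk_accept_H1 = post_H0 * C10      # posterior risk of accepting H1
    risk_accept_H0 = post_H1 * C01      # posterior risk of accepting H0
    return "H0" if risk_accept_H0 <= risk_accept_H1 else "H1"

for x in (0.0, 1.0, 1.5, 2.5):
    print(x, decide(x))
```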
BAYESIAN ALTERNATIVE APPROACH Lim, Kyuson Decision rule for cost in hypothesis testing STA498 In two hypothesis cases for π»0 and π»1 , let πΆπ π to be defined for the cost of accepting π»π given π» π to be true 15. Since associated cost for the correct decision is less than the incorrect decision, that is πΆππ < πΆ ππ for π, π = 1, 2, the average cost is derived as Í πΆ = π, π ∈{0,1} πΆπ π π(choose π»π |π» π ) π(π» π ) 16, as the goal is to find the decision rule such that the average cost is minimized. First, the complement for choosing the correct hypothesis is the complement of choosing the wrong hypothesis such that π(choose π»0 |π»0 ) = 1 − π(choose π»1 |π»0 ), π(choose π»1 |π»1 ) = 1 − π(choose π»0 |π»1 ) Hence, πΆ = πΆ00 [1 − π(choose π»1 |π»0 )] π(π»0 ) + πΆ01 π(choose π»0 |π»1 ) π(π»1 ) +πΆ10 π(choose π»1 |π»0 ) π(π»0 ) + πΆ11 [1 − π(choose π»0 |π»1 )] π(π»1 ) ⇔ (πΆ10 −πΆ00 ) π(choose π»1 |π»0 ) π(π»0 )+(πΆ01 −πΆ11 ) π(choose π»0 |π»1 ) π(π»1 )+πΆ00 π(π»0 )+πΆ11 π(π»1 ), where πΆ00 π(π»0 ) + πΆ11 π(π»1 ) is constant. To minimize, the decision rule is simplified as π· = π(chooseπ»1 |π»0 ) π(π»0 )(πΆ10 − πΆ00 ) + π(chooseπ»1 = 0|π»1 ) π(π»1 )(πΆ01 − πΆ11 ) Applying the hypothesis testing from previous inequality, the π»0 is chosen if and only if π π (π₯|π»0 ) π(π»0 )(πΆ10 − πΆ00 ) ≥ π π (π₯|π»1 ) π(π»1 )(πΆ01 − πΆ11 ) ⇔ π(π»0 |π)(πΆ10 − πΆ00 ) ≥ π(π»1 |π)(πΆ01 − πΆ11 ) 3.1.5 Bayesian Interval Estimation Instead of posterior density ππ₯1 |π₯2 (π₯ 1 |π₯ 2 ) for unobserved random variable π₯ 1 given observed π₯ 2 , the (1 − πΌ)100% credible interval of π₯1 being in [π, π] is derived as π(π ≤ π₯ 1 ≤ π|π2 = π₯ 2 ) = 1 − πΌ. Bivariate normal example For π1 ∼ π (0, 1) and π2 ∼ π (1, 4) with π(π1 , π2 ) = 41 , the goal is to derive a 95% credible interval for π1 , given π2 = 2 is observed. Analogous to πΈ (x (1) |x (2) ) = (2) − π (2) ), π (1) + πΊ12 πΊ−1 22 (x πΈ (π1 |π2 = π₯ 2 ) = π π1 + πππ1 π₯ 1 − ππ₯2 , ππ2 15Then, there would be 2 more cases, including πΆ00 : The cost of choosing π»0 given π»0 is true and πΆ11 : The cost of choosing π»1 given π»1 is true. 16which is πΆ00 π(choose π»0 |π»0 ) π(π»0 ) + πΆ01 π(choose π»0 |π»1 ) π(π»1 ) + πΆ10 π(choose π»1 |π»0 ) π(π»0 ) + πΆ11 π(choose π»1 |π»1 ) π(π»1 ) CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 37 STA498 Lim, Kyuson where π π1 ,π2 ππ1 ππ2 = Σ12 ⇔ πππ£(π1 , π2 ) (and ππ1 π1 = Σ11 ), equivalently. Similar to π£ππ (x (1) |x (2) ) = πΊ11 − πΊ12 πΊ−1 22 πΊ21 , π£ππ (π1 |π2 = π₯ 2 ) = ππ21 − π 2 ππ21 . Hence, the π1 |π2 = 2 is normally distributed with mean as πΈ (π1 |π2 = π₯ 2 ) = 0+ 12 ( 2−1 2 ) = 1 3 1 4 and variance as π£ππ (π1 |π2 = π₯ 2 ) = 1 − 4 = 4 . For πΌ = 0.05, the interval is π(π ≤ π1 ≤ π|π2 = 2) = 0.95 which is centered around πΈ (π1 |π2 = π₯ 2 ) = 14 with the form of [ 14 − π, 14 + π]. 1 1 π −π π π( − π ≤ π1 ≤ + π|π2 = 2) = Φ √οΈ − Φ √οΈ = 2Φ √οΈ − 1 = 0.95. 4 4 3/4 3/4 3/4 √οΈ ⇔π= 3 −1 1.95 Φ = 1.7 4 2 Thus, the 95% credible interval for π1 is 1 1 − π, + π = [−1.45, 1.95] 4 4 3.2 Prior The prior distribution of an uncertain quantity is to express one’s beliefs about this quantity before some evidence is taken into account. Based on the unconditional probability, the chosen parameters of the prior distribution are hyperparameters. The prior for parameter π is denoted as π(π) include conjugate priors with the binomial/beta and multinomial/Dirichlet families. In case prior ∝ constant, the Bernoulli example is a representative form of the noninformative prior as π(π) = 1 lead to π|π ∼ π΅ππ‘π(π + 1, π − π + 1) to be seen earlier. 
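A short sketch of this conjugate update (Python with SciPy assumed; the counts and Beta hyperparameters are illustrative) forms the Beta(α + s, β + n − s) posterior, its mean (α + s)/(α + β + n), and an equal-tailed credible interval; setting α = β = 1 recovers the uniform-prior case seen earlier.

```python
from scipy import stats

# Illustrative data: s successes in n Bernoulli trials, with a Beta(alpha, beta) prior.
n, s = 20, 14
alpha, beta = 1.0, 1.0                        # uniform prior; other values encode prior beliefs

post = stats.beta(alpha + s, beta + n - s)    # theta | data ~ Beta(alpha + s, beta + n - s)

post_mean = (alpha + s) / (alpha + beta + n)  # Bayes posterior estimator E(theta | data)
cred_int = post.interval(0.95)                # equal-tailed 95% credible interval
pred_next_success = post_mean                 # posterior predictive P(x_tilde = 1 | data)

print(post_mean, cred_int, pred_next_success)
```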
3.2 Prior

The prior distribution of an uncertain quantity expresses one's beliefs about that quantity before any evidence is taken into account. The parameters chosen for the prior distribution, which are not conditioned on the data, are called hyperparameters. The prior for a parameter $\theta$ is denoted $\pi(\theta)$; important examples include the conjugate priors of the binomial/beta and multinomial/Dirichlet families. When the prior is proportional to a constant, the Bernoulli example with $\pi(p)=1$, leading to $p\mid X \sim \mathrm{Beta}(k+1, n-k+1)$ as seen earlier, is a representative non-informative prior.

Bayesian Procedure

1. Choose a prior density $\pi(\theta) = f(\theta)$ that expresses our beliefs about a parameter $\theta$ before any data.
2. Choose a statistical model $f(x\mid\theta)$ that reflects our beliefs about $x$ given $\theta$.
3. After observing data $D = \{x_1,\dots,x_n\}$, update the beliefs and compute the posterior distribution $f(\theta\mid D)$.

3.2.1 Conjugate Prior

Simply put, if the prior $\pi(\theta)$ and the posterior $\pi(\theta\mid x)$ have the same form, then the prior and posterior are conjugate distributions for the likelihood function $f(x\mid\theta)$.

For a class of sampling distributions $f(x\mid\theta)$, a class (family) of prior distributions $\pi(\theta)$ is conjugate for $f(x\mid\theta)$ if
$$\pi(\theta\mid x) = \frac{f(x\mid\theta)\,\pi(\theta)}{\int_\Theta f(x\mid\theta)\,\pi(\theta)\,d\theta}$$
is in the class of $\pi(\theta)$ for every $f(\cdot\mid\theta)$ in the class of sampling distributions and every $\pi(\cdot)$ in the class of priors. In that case the prior family is conjugate to the family of sampling distributions for all resulting posteriors; such conjugate families are available in particular for exponential-family sampling distributions.

A sampling distribution $f(x\mid\theta)$ in the exponential family has the general form
$$f(x_i\mid\theta) = h(x_i)\,g(\theta)\,e^{\phi(\theta)^{T}u(x_i)},$$
where $\phi(\theta)$ and $u(x_i)$ are vectors and $\theta$ is the parameter. For a sample $\mathbf{x}$,
$$f(\mathbf{x}\mid\theta) \propto g(\theta)^{n}\exp\!\Big(\phi(\theta)^{T}\sum_{i=1}^{n}u(x_i)\Big),$$
where $\sum_{i=1}^{n}u(x_i)$ is the sufficient statistic for $\theta$, since the likelihood for $\theta$ depends on the data $\mathbf{x}$ only through it. Hence the likelihood of $\mathbf{x}$ is
$$f(\mathbf{x}\mid\theta) = \Big(\prod_{i=1}^{n}h(x_i)\Big)\,g(\theta)^{n}\exp\!\Big(\phi(\theta)^{T}\sum_{i=1}^{n}u(x_i)\Big).$$
If the prior density is specified as $\pi(\theta) \propto g(\theta)^{\eta}\exp(\phi(\theta)^{T}\nu)$, then the posterior density is
$$\pi(\theta\mid\mathbf{x}) \propto g(\theta)^{\eta+n}\exp\!\Big(\phi(\theta)^{T}\Big(\nu + \sum_{i=1}^{n}u(x_i)\Big)\Big),$$
so the prior density is conjugate.

List of Conjugate Models

Likelihood                      | Prior          | Posterior
Binomial                        | Beta           | Beta
Negative Binomial               | Beta           | Beta
Poisson                         | Gamma          | Gamma
Geometric                       | Beta           | Beta
Exponential                     | Gamma          | Gamma
Normal (mean unknown)           | Normal         | Normal
Normal (variance unknown)       | Inverse Gamma  | Inverse Gamma
Normal (mean, variance unknown) | Normal / Gamma | Normal / Gamma
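As a concrete instance of the first row of the table, the following sketch performs a Beta-Binomial conjugate update; the prior parameters and data are made-up values for illustration.

```python
from scipy.stats import beta

# Hypothetical numbers: a Beta(2, 2) prior on the success probability p,
# then k = 7 successes observed in n = 10 Bernoulli trials.
a0, b0 = 2.0, 2.0
n, k = 10, 7

# Conjugacy: Beta prior x Binomial likelihood -> Beta posterior
a_post, b_post = a0 + k, b0 + (n - k)
posterior = beta(a_post, b_post)

print(posterior.mean())          # (a0 + k) / (a0 + b0 + n) = 9/14
print(posterior.interval(0.95))  # central 95% credible interval for p
```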
3.2.2 Univariate Normal distribution Conjugate Prior with known variance

From the previous example, $\mathrm{Beta}(p\mid\alpha,\beta) \propto p^{\alpha-1}(1-p)^{\beta-1}$ was derived. In the univariate case, the normal model for an observation $X$, with multiple observations $x_1,\dots,x_n$, has the form
$$f(X\mid\theta) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{1}{2\sigma^{2}}(x-\theta)^{2}\Big), \qquad X \sim N(\theta,\sigma^{2}),$$
which enters $f(\theta\mid X) \propto f(X\mid\theta)\,f(\theta)$. Since the variance $\sigma^{2}$ is known, the joint prior is just the prior $f(\theta)$, in contrast to the variance-unknown case where $f(\theta,\sigma^{2}\mid X) \propto f(X\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2})$. The goal is to update the unknown quantity $\theta$, i.e. to find $f(\theta\mid x_1,\dots,x_n)$.

First, the likelihood function for the observed data is
$$f(X\mid\theta) \propto L(\theta\mid X) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big).$$
The prior is parametrized with known hyperparameters $\mu_0$ and $\tau_0^{2}$, with $\theta \sim N(\mu_0,\tau_0^{2})$:
$$f(\theta) = \frac{1}{\sqrt{2\pi\tau_0^{2}}}\exp\!\Big(-\frac{1}{2\tau_0^{2}}(\theta-\mu_0)^{2}\Big) \propto \exp\!\Big(-\frac{1}{2\tau_0^{2}}(\theta-\mu_0)^{2}\Big),$$
where $\mu_0$ is the prior mean and $\tau_0$ reflects the variation of $\theta$ around $\mu_0$; $\tau_0$ controls how much the mean can vary and does not directly describe the variability of the individual observations $x_i$, whose standard deviation $\sigma$ is in general different from $\tau_0$. (For example, a class of 30 students may have mean grade $\bar{x}=75$ with s.d. $\sigma=10$, while over several semesters the class means have overall mean $\mu_0 = 75$ and s.d. $\tau_0 = 5$.) With $X = \{x_1,\dots,x_n\}$,
$$f(X\mid\theta) = f(x_1,\dots,x_n\mid\theta) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\!\Big(-\sum_{i=1}^{n}\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big).$$
Hence the posterior is the prior times the likelihood:
$$f(\theta\mid x) \propto \exp\!\left[-\frac12\left(\frac{(\theta-\mu_0)^{2}}{\tau_0^{2}} + \frac{\sum_{i=1}^{n}(x_i-\theta)^{2}}{\sigma^{2}}\right)\right].$$
Ignoring constant terms, the posterior can be written as
$$f(\theta\mid x) \propto \exp\!\left[-\frac12\left(\frac{\sum_{i=1}^{n}x_i^{2} - 2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}} + \frac{(\theta-\mu_0)^{2}}{\tau_0^{2}}\right)\right].$$
Dropping the terms that do not include $\theta$ and collecting the terms in $\theta^{2}$ and $\theta$,
$$\propto \exp\!\left[-\frac12\,\frac{(n\tau_0^{2}+\sigma^{2})\theta^{2} - 2(\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x})\theta}{\sigma^{2}\tau_0^{2}}\right] = \exp\!\left[-\frac12\,\frac{\theta^{2} - 2\theta(\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x})/(n\tau_0^{2}+\sigma^{2})}{\sigma^{2}\tau_0^{2}/(n\tau_0^{2}+\sigma^{2})}\right];$$
completing the square and dropping the remaining constant simplifies this to
$$f(\theta\mid X) \propto \exp\!\left[-\frac12\,\frac{\big[\theta - (\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x})/(n\tau_0^{2}+\sigma^{2})\big]^{2}}{\sigma^{2}\tau_0^{2}/(n\tau_0^{2}+\sigma^{2})}\right].$$
Therefore $\theta\mid x \sim N(\mu_1, \tau_1^{2})$, where
$$\mu_1 = \frac{\sigma^{2}\mu_0 + n\tau_0^{2}\bar{x}}{n\tau_0^{2}+\sigma^{2}}, \qquad \tau_1^{2} = \frac{\sigma^{2}\tau_0^{2}}{n\tau_0^{2}+\sigma^{2}}.$$
The posterior mean $\mu_1$ is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$, with weights proportional to the respective precisions: if the sampling variance is large, the prior mean carries considerable weight in the posterior; if the prior variance is large, the sample mean carries considerable weight. (The posterior precision is $1/\tau_1^{2} = n/\sigma^{2} + 1/\tau_0^{2}$ and the prior precision is $1/\tau_0^{2}$.) In the case $\tau_0^{2} = \sigma^{2}$, i.e. each observation's standard deviation equals the prior standard deviation, $\mu_1$ reduces to $(\mu_0 + n\bar{x})/(n+1)$, so the prior mean receives weight only $1/(n+1)$ in the posterior.

Posterior predictive distribution

For a future observation $\tilde{x}$, the posterior predictive density $f(\tilde{x}\mid x)$ is
$$f(\tilde{x}\mid x) = \int_\Theta f(\tilde{x}\mid\theta)\,f(\theta\mid x)\,d\theta \propto \int_\Theta \exp\!\Big(-\frac{1}{2\sigma^{2}}(\tilde{x}-\theta)^{2}\Big)\exp\!\Big(-\frac{1}{2\tau_1^{2}}(\theta-\mu_1)^{2}\Big)\,d\theta,$$
using $f(\theta\mid x) \propto \exp\!\big(-(\theta-\mu_1)^{2}/(2\tau_1^{2})\big)$. The joint distribution of $(\tilde{x},\theta)$ given the data is bivariate normal, so the marginal posterior distribution of $\tilde{x}$ is normal, with $E(\tilde{x}\mid\theta,x)=\theta$ and $\mathrm{var}(\tilde{x}\mid\theta,x)=\sigma^{2}$. By the laws of total expectation and total variance,
$$E(\tilde{x}\mid x) = E\{E(\tilde{x}\mid\theta, x)\mid x\} = E(\theta\mid x) = \mu_1,$$
$$\mathrm{var}(\tilde{x}\mid x) = E\{\mathrm{var}(\tilde{x}\mid\theta, x)\mid x\} + \mathrm{var}\{E(\tilde{x}\mid\theta, x)\mid x\} = E(\sigma^{2}\mid x) + \mathrm{var}(\theta\mid x) = \sigma^{2} + \tau_1^{2}.$$
Thus the posterior predictive distribution of $\tilde{x}$ has mean equal to the posterior mean of $\theta$, while the predictive variance combines the sampling variance $\sigma^{2}$ with the additional variance $\tau_1^{2}$ due to posterior uncertainty in $\theta$.
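The following short sketch carries out this known-variance update and the corresponding predictive moments; the data and hyperparameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                        # known sampling s.d.
mu0, tau0 = 0.0, 1.0               # prior: theta ~ N(mu0, tau0^2)
x = rng.normal(1.5, sigma, size=25)
n, xbar = len(x), x.mean()

# Posterior theta | x ~ N(mu1, tau1^2)
tau1_sq = (sigma**2 * tau0**2) / (n * tau0**2 + sigma**2)
mu1 = (sigma**2 * mu0 + n * tau0**2 * xbar) / (n * tau0**2 + sigma**2)

# Posterior predictive for a future observation: N(mu1, sigma^2 + tau1^2)
pred_mean, pred_var = mu1, sigma**2 + tau1_sq
print(mu1, tau1_sq, pred_mean, pred_var)
```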
Univariate Normal distribution Conjugate Prior with unknown variance

In most cases $\sigma^{2}$ is unknown. Then $f(\theta,\sigma^{2}\mid X) \propto f(X\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2})$, so a joint prior for $\theta$ and $\sigma^{2}$ must be specified, including the prior of $\theta$. If the two parameters are assumed independent, $\pi(\theta,\sigma^{2}) = \pi(\theta)\,\pi(\sigma^{2})$, and separate priors can be set for each parameter. With $\theta \sim N(\mu_0,\tau_0^{2})$ as before, where $\mu_0$ is the measure of belief for $\theta$, the easiest prior for $\sigma^{2}$ is a non-informative prior, discussed in the next section.

3.2.3 Non-informative Prior

If there is no prior information about $\theta$, a non-informative prior is meant to have minimal influence on the inference. However, a uniform prior cannot simply be declared non-informative, because uniformity is not invariant under reparametrization: the uniform prior on $\theta$ does not correspond to a uniform prior on $1/\theta$. (For a one-to-one transformation $\phi = g(\theta)$ with $\pi(\theta)=1$, the induced prior is $\pi(\phi) = \big|\tfrac{d}{d\phi}g^{-1}(\phi)\big|$, which is generally not constant.) As an expression of ignorance there is no unique non-informative prior, and a sufficiently flat prior is usually adequate in practice. On the other hand, invariant non-informative priors can be constructed for location parameters and scale parameters.

For a location parameter $\mu$, the random variable $X$ has density $f(x-\mu)$. Since $Y = X + c$ has density $f(y-\eta)$ with $\eta = \mu + c$, $X$ and $Y$ have the same distribution up to a shift of the parameter, so the prior should be location invariant: $\pi(\mu) = \pi(\mu - c)$ for all $c$, which forces $\pi(\mu) \propto 1$. For a scale parameter $\sigma$, $X \sim \frac1\sigma f\!\big(\frac{x}{\sigma}\big)$, and the family is scale invariant in the sense that for $c > 0$ the prior should satisfy $\pi(\sigma) = \frac1c\,\pi\!\big(\frac{\sigma}{c}\big)$; the invariant non-informative prior for the scale parameter, $\pi(\sigma) = \sigma^{-1}$, satisfies this equation.

3.2.4 Univariate Normal distribution Conjugate Prior with unknown variance

Continuing, priors for both $\theta$ and $\sigma^{2}$ need to be specified; as before, independence allows the separation $\pi(\theta,\sigma^{2}) = \pi(\theta)\,\pi(\sigma^{2})$. The full probability model for $\theta$ and $\sigma^{2}$ is
$$f(\theta,\sigma^{2}\mid x) \propto f(x\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2}).$$
Since $\sigma^{2}$ measures the uncertainty about $\theta$, it is used in updating the knowledge of $\theta$ (by the CLT, $\bar{x}\sim N(\theta,\sigma^{2}/n)$ for $n$ observations, so for fixed $\sigma^{2}$ the quantity $\sigma^{2}/n$ is the variance relevant to $\theta$, and it depends heavily on the new sample data), and a prior for $\sigma^{2}$ must be specified.

To develop a non-informative prior, the first approach assigns a uniform prior to $\theta$ and to $\log(\sigma^{2})$, since $\sigma^{2} > 0$ while $\log(\sigma^{2}) \in \mathbb{R}$. Transforming $\pi(\log(\sigma^{2})) \propto \text{constant}$ back to the density of $\sigma^{2}$ via the Jacobian gives
$$\pi(\sigma^{2}) \propto \Big|\frac{\partial \log(\sigma^{2})}{\partial\sigma^{2}}\Big| = \frac{1}{\sigma^{2}},$$
so the joint prior is $\pi(\theta,\sigma^{2}) \propto 1/\sigma^{2}$.

The other approach avoids working with $\log(\sigma^{2}) \in \mathbb{R}$ and chooses values $\mu_0$ and $\sigma^{2}$ with $\theta \sim N(\mu_0,\sigma^{2})$, together with a relatively non-informative prior on $\sigma^{2}$: taking $1/\sigma^{2} \sim \mathrm{Gamma}(\alpha,\beta)$, the variance follows an Inverse Gamma distribution, $\sigma^{2}\sim \mathrm{IG}(\alpha,\beta)$, with density
$$f(\sigma^{2}) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,(\sigma^{2})^{-(\alpha+1)}\exp(-\beta/\sigma^{2}), \qquad \sigma^{2} > 0,$$
so that $f(\sigma^{2}\mid\alpha,\beta) \propto (\sigma^{2})^{-(\alpha+1)}\exp(-\beta/\sigma^{2})$. (The IG distribution is used as the conjugate prior for the variance of the normal model.) Notice that in the improper limit where both parameters approach $0$, $\sigma^{2}\sim\mathrm{IG}(0,0) \iff \pi(\sigma^{2}) \propto 1/\sigma^{2}$, which recovers the prior above.
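A quick numerical check of the statement that a Gamma prior on the precision $1/\sigma^{2}$ is the same model as an Inverse Gamma prior on $\sigma^{2}$; the shape and rate values are arbitrary.

```python
import numpy as np
from scipy.stats import gamma, invgamma

alpha, beta_ = 3.0, 2.0
rng = np.random.default_rng(1)

# 1/sigma^2 ~ Gamma(shape=alpha, rate=beta)  <=>  sigma^2 ~ Inverse-Gamma(alpha, beta)
precision = gamma(a=alpha, scale=1.0 / beta_).rvs(200_000, random_state=rng)
variance = 1.0 / precision

qs = [0.1, 0.5, 0.9]
print(np.quantile(variance, qs))                 # empirical quantiles of 1/precision
print(invgamma(a=alpha, scale=beta_).ppf(qs))    # IG(alpha, beta) quantiles: should match closely
```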
Unknown variance: posterior density of $\theta$

With the sampling density $f(x\mid\theta,\sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\big(-(x_i-\theta)^{2}/(2\sigma^{2})\big)$ and the joint prior $f(\theta,\sigma^{2}) = 1/\sigma^{2}$, the posterior for $\theta$ and $\sigma^{2}$ is $f(\theta,\sigma^{2}\mid X) \propto f(X\mid\theta,\sigma^{2})\,f(\theta,\sigma^{2})$, i.e.
$$f(\theta,\sigma^{2}\mid X) \propto \frac{1}{\sigma^{2}}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big).$$
Notice that, with two parameters, a conditional posterior distribution can always be read off from the joint posterior up to proportionality. With $X = \{x_1,\dots,x_n\}$,
$$f(X\mid\theta,\sigma^{2}) = \frac{1}{(2\pi)^{n/2}\sigma^{n}}\exp\!\Big(-\sum_{i=1}^{n}\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big)
\;\Longrightarrow\;
f(\theta,\sigma^{2}\mid X) \propto \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\!\Big(-\frac12\,\frac{\sum_{i=1}^{n}x_i^{2} - 2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}}\Big).$$
Dropping the terms that do not contain the parameter of interest $\theta$ yields
$$f(\theta\mid\sigma^{2}, X) \propto \exp\!\Big(-\frac12\,\frac{-2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}}\Big);$$
dividing through by $n$ and adding back the constant needed to complete the square gives
$$f(\theta\mid\sigma^{2}, X) \propto \exp\!\Big(-\frac{(\theta-\bar{x})^{2}}{2\sigma^{2}/n}\Big),$$
which is the posterior $\theta\mid\sigma^{2}, X \sim N(\bar{x}, \sigma^{2}/n)$. Notice that, by the CLT, the sampling distribution of $\bar{x}$ is $N(\theta_0,\sigma^{2}/n)$ as well.

Unknown variance approach 1: conditional posterior density of $\sigma^{2}$

The posterior distribution of $\sigma^{2}$ can be obtained either through the conditional distribution of $\sigma^{2}\mid\theta, X$ or through the joint posterior of $\theta$ and $\sigma^{2}$. In the first approach, keeping only the terms that involve $\sigma^{2}$, with $\theta$ regarded as fixed, gives
$$f(\sigma^{2}\mid\theta, X) \propto \frac{1}{\sigma^{n+2}}\exp\!\Big(-\sum_{i=1}^{n}\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\Big),$$
which is an Inverse Gamma density without its normalizing constant $\beta^{\alpha}/\Gamma(\alpha)$; equivalently, $\sigma^{2}\mid\theta, X \sim \mathrm{IG}(\alpha,\beta)$ with the two parameters $\alpha = n/2$ and $\beta = \sum_{i=1}^{n}(x_i-\theta)^{2}/2$.

Unknown variance approach 2: marginal posterior density of $\sigma^{2}$

The second approach uses Bayes' rule in the form
$$f(\theta,\sigma^{2}\mid X) = \frac{f(\theta,\sigma^{2},X)}{f(\sigma^{2},X)}\cdot\frac{f(\sigma^{2},X)}{f(X)} = f(\theta\mid\sigma^{2},X)\,f(\sigma^{2}\mid X),$$
so that separating off the known factor $f(\theta\mid\sigma^{2},X)$ from the joint posterior leaves the marginal posterior density $f(\sigma^{2}\mid X)$, which is the goal. From before,
$$f(\theta,\sigma^{2}\mid X) \propto \frac{1}{(2\pi)^{n/2}\sigma^{n+2}}\exp\!\Big(-\frac12\,\frac{\sum_{i=1}^{n}x_i^{2} - 2n\theta\bar{x} + n\theta^{2}}{\sigma^{2}}\Big).$$
Rearranging the terms to isolate $\theta$ and completing the square in $\theta$,
$$f(\theta,\sigma^{2}\mid X) \propto \frac{1}{\sigma}\exp\!\Big(-\frac{(\theta-\bar{x})^{2}}{2\sigma^{2}/n}\Big)\times\frac{1}{\sigma^{n+1}}\exp\!\Big(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}{2\sigma^{2}}\Big),$$
where the first factor corresponds to $f(\theta\mid\sigma^{2},X)$ and the second to $f(\sigma^{2}\mid X)$; notice that the numerator in the exponent of $f(\sigma^{2}\mid X)$ is the sum of squared deviations, i.e. $(n-1)$ times the sample variance. Hence
$$\sigma^{2}\mid X \sim \mathrm{IG}\!\left(\frac{n-1}{2},\ \frac{(n-1)s^{2}}{2}\right), \qquad (n-1)s^{2} = \sum_{i=1}^{n}(x_i-\bar{x})^{2}$$
(the power $\sigma^{-(n+1)} = (\sigma^{2})^{-\{(n-1)/2+1\}}$ identifies $\alpha = (n-1)/2$).
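The two results above fully determine the joint posterior under the prior $1/\sigma^{2}$, so it can be sampled directly; the sketch below does this with simulated data.

```python
import numpy as np
from scipy.stats import invgamma

# Under f(mu, sigma^2) ∝ 1/sigma^2:
#   sigma^2 | x ~ IG((n-1)/2, (n-1)s^2/2)  and  mu | sigma^2, x ~ N(xbar, sigma^2/n)
rng = np.random.default_rng(2)
x = rng.normal(5.0, 3.0, size=40)          # made-up data
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

m = 10_000
sigma2_draws = invgamma(a=(n - 1) / 2, scale=(n - 1) * s2 / 2).rvs(m, random_state=rng)
mu_draws = rng.normal(xbar, np.sqrt(sigma2_draws / n))

print(np.mean(mu_draws), np.quantile(mu_draws, [0.025, 0.975]))  # matches the frequentist t-interval closely
print(np.mean(sigma2_draws))
```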
Unknown variance: connection to the multivariate normal

For a $p$-dimensional scale matrix $\boldsymbol\Sigma$ and $\nu$ degrees of freedom, if $\mathbf{X} \sim W_{\nu}(\boldsymbol\Sigma)$ with both $\mathbf{X}$ and $\boldsymbol\Sigma$ positive definite, then
$$f(\mathbf{X}) \propto |\mathbf{X}|^{(\nu-p-1)/2}\exp\!\Big(-\frac12\,\mathrm{tr}(\boldsymbol\Sigma^{-1}\mathbf{X})\Big),$$
ignoring the normalizing constant $\frac{1}{2^{\nu p/2}|\boldsymbol\Sigma|^{\nu/2}\Gamma_p(\nu/2)}$. Just as the conjugate prior for the variance of the univariate normal is the IG, the inverse Wishart distribution is the conjugate prior of $\boldsymbol\Sigma$ in the multivariate normal (the marginal distribution of the mean vector $\boldsymbol\mu$ is then a multivariate $t$ distribution). Similarly, if $\mathbf{X} \sim W_{\nu}^{-1}(\boldsymbol\Sigma^{-1})$, then
$$f(\mathbf{X}) \propto |\boldsymbol\Sigma|^{\nu/2}\,|\mathbf{X}|^{-(\nu+p+1)/2}\exp\!\Big(-\frac12\,\mathrm{tr}(\boldsymbol\Sigma\mathbf{X}^{-1})\Big).$$

3.3 Posterior

Note that the usual Bayesian inference typically involves (1) establishing a model and obtaining a posterior distribution for the parameter(s) of interest, (2) generating samples from the posterior distribution, and (3) applying summary formulas to the samples from the posterior distribution to summarize our knowledge of the parameters. Two basic sampling methods, the inversion method and the rejection method, underlie this scheme and are useful for understanding MCMC methods.

Weakly-informative Prior

For specifying and justifying the prior distribution, from the population point of view the prior represents a population of possible parameter values from which the $\theta$ of current interest has been drawn. From the subjective point of view, in contrast, the prior expresses uncertainty about $\theta$, as if its value were a random realization from the prior distribution.

3.3.1 Maximum A Posteriori (MAP)

Given random variables $x_1,\dots,x_n \sim N(\theta,\sigma^{2})$ with $X=\{x_1,\dots,x_n\}$ and the prior $\theta \sim N(\mu_0,\tau_0^{2})$, the function to maximize is
$$f(X\mid\theta)\,f(\theta) = L(\theta)\,\pi(\theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left[-\frac12\Big(\frac{x_i-\theta}{\sigma}\Big)^{2}\right]\cdot\frac{1}{\sqrt{2\pi\tau_0^{2}}}\exp\!\left[-\frac12\Big(\frac{\theta-\mu_0}{\tau_0}\Big)^{2}\right].$$
Notice that $\hat\theta_{MLE}$ maximizes $\prod_{i=1}^{n}f(x_i\mid\theta) = f(X=x_1,\dots,x_n\mid\theta)$, whereas $\hat\theta_{MAP}$ is the mode of the posterior distribution, obtained by maximizing $\log\pi(\theta) + \sum_{i=1}^{n}\log f(x_i\mid\theta)$.

For the derivation,
$$\log f(\theta\mid X) = \sum_{i=1}^{n}\left[-\log\sqrt{2\pi\sigma^{2}} - \frac{(x_i-\theta)^{2}}{2\sigma^{2}}\right] - \log\sqrt{2\pi\tau_0^{2}} - \frac{(\theta-\mu_0)^{2}}{2\tau_0^{2}},$$
and setting the derivative to zero,
$$\frac{\partial\log f(\theta\mid X)}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i-\theta}{\sigma^{2}} - \frac{\theta-\mu_0}{\tau_0^{2}} = 0
\;\iff\; \frac{\sum_{i=1}^{n}x_i}{\sigma^{2}} + \frac{\mu_0}{\tau_0^{2}} = \frac{(\sigma^{2}+n\tau_0^{2})\,\theta}{\sigma^{2}\tau_0^{2}}
\;\iff\; \hat\theta_{MAP} = \frac{\sigma^{2}\mu_0 + \tau_0^{2}\sum_{i=1}^{n}x_i}{\sigma^{2}+n\tau_0^{2}}.$$
Equivalently, maximizing the posterior $f(\theta\mid X)$ amounts, from the derivation in the prior chapter, to minimizing $\sum_{i=1}^{n}\big(\frac{x_i-\theta}{\sigma}\big)^{2} + \big(\frac{\theta-\mu_0}{\tau_0}\big)^{2}$ in $\theta$. Hence the MAP estimate of $\theta$ is
$$\hat\theta_{MAP} = \frac{n\tau_0^{2}}{n\tau_0^{2}+\sigma^{2}}\cdot\frac{1}{n}\sum_{i=1}^{n}x_i + \frac{\sigma^{2}}{n\tau_0^{2}+\sigma^{2}}\,\mu_0 = \frac{\tau_0^{2}\sum_{i}x_i + \sigma^{2}\mu_0}{n\tau_0^{2}+\sigma^{2}},$$
which is a weighted average of the prior mean and the sample mean.
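The closed form above can be checked against a direct numerical maximization of the log posterior; the data and hyperparameters below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma, mu0, tau0 = 2.0, 0.0, 1.5
x = rng.normal(1.0, sigma, size=30)
n, xbar = len(x), x.mean()

# Closed-form MAP estimate (weighted average of prior mean and sample mean)
theta_map_closed = (n * tau0**2 * xbar + sigma**2 * mu0) / (n * tau0**2 + sigma**2)

def neg_log_post(theta):
    # -(log prior + log likelihood)
    return -(norm.logpdf(theta, mu0, tau0) + norm.logpdf(x, theta, sigma).sum())

theta_map_numeric = minimize_scalar(neg_log_post).x
print(theta_map_closed, theta_map_numeric)   # the two values should agree closely
```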
3.3.2 Multivariate Normal distribution with known $\boldsymbol\Sigma$

The multivariate normal likelihood is $\mathbf{x}_i\mid\boldsymbol\mu,\boldsymbol\Sigma \sim N(\boldsymbol\mu,\boldsymbol\Sigma)$, where $\boldsymbol\Sigma$ is a $p\times p$ positive-definite matrix and $\boldsymbol\mu$ is a $p$-vector; the normalizing constant $1/(2\pi)^{p/2}$ is omitted as before. Equivalently, the likelihood for a single observation is
$$f(\mathbf{x}_i\mid\boldsymbol\mu,\boldsymbol\Sigma) \propto |\boldsymbol\Sigma|^{-1/2}\exp\!\Big(-\frac12(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big),$$
and for a sample of $n$ iid observations $\mathbf{X}=\{\mathbf{x}_1,\dots,\mathbf{x}_n\}$,
$$f(\mathbf{X}\mid\boldsymbol\mu,\boldsymbol\Sigma) = f(\mathbf{x}_1,\dots,\mathbf{x}_n\mid\boldsymbol\mu,\boldsymbol\Sigma) \propto |\boldsymbol\Sigma|^{-n/2}\exp\!\Big(-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big).$$
Analogous to the univariate case with known variance, where $f(\theta\mid X) \propto f(X\mid\theta)f(\theta)$ with $\theta\sim N(\mu_0,\tau_0^{2})$ (Section 3.2.2), the multivariate posterior for multiple observations is $f(\boldsymbol\mu\mid\mathbf{X}) \propto f(\mathbf{X}\mid\boldsymbol\mu)\,f(\boldsymbol\mu)$ with $\boldsymbol\mu \sim N(\boldsymbol\mu_0,\boldsymbol\Lambda_0)$. Equivalently, $f(\boldsymbol\mu\mid\mathbf{X}) \propto f(\boldsymbol\mu)\prod_{i=1}^{n}f(\mathbf{x}_i\mid\boldsymbol\mu)$, that is,
$$f(\boldsymbol\mu\mid\mathbf{X}) \propto |\boldsymbol\Lambda_0|^{-1/2}\exp\!\Big(-\frac12(\boldsymbol\mu-\boldsymbol\mu_0)^{T}\boldsymbol\Lambda_0^{-1}(\boldsymbol\mu-\boldsymbol\mu_0)\Big)\,|\boldsymbol\Sigma|^{-n/2}\exp\!\Big(-\frac12\sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big),$$
and dropping constant factors,
$$f(\boldsymbol\mu\mid\mathbf{X}) \propto \exp\!\left[-\frac12\Big((\boldsymbol\mu-\boldsymbol\mu_0)^{T}\boldsymbol\Lambda_0^{-1}(\boldsymbol\mu-\boldsymbol\mu_0) + \sum_{i=1}^{n}(\mathbf{x}_i-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\mathbf{x}_i-\boldsymbol\mu)\Big)\right].$$
Taking the natural logarithm and keeping only the terms involving $\boldsymbol\mu$,
$$\log f(\boldsymbol\mu\mid\mathbf{X}) = -\frac12\left[\boldsymbol\mu^{T}(n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})\boldsymbol\mu - 2\boldsymbol\mu^{T}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0 + \boldsymbol\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i\Big)\right] + \text{const}.$$
Completing the square in $\boldsymbol\mu$ (matrix rearrangement) gives
$$-\frac12\Big[\boldsymbol\mu - (n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0+\boldsymbol\Sigma^{-1}\sum_i\mathbf{x}_i\Big)\Big]^{T}(n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})\Big[\boldsymbol\mu - (n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0+\boldsymbol\Sigma^{-1}\sum_i\mathbf{x}_i\Big)\Big],$$
which is the log density of a normal distribution,
$$\boldsymbol\mu\mid\mathbf{X} \sim N\!\left((n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\Big(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0+\boldsymbol\Sigma^{-1}\sum_{i=1}^{n}\mathbf{x}_i\Big),\ (n\boldsymbol\Sigma^{-1}+\boldsymbol\Lambda_0^{-1})^{-1}\right).$$
Equivalently, the posterior mean $\boldsymbol\mu_n$ and the inverse covariance matrix $\boldsymbol\Lambda_n^{-1}$ are
$$\boldsymbol\mu_n = (\boldsymbol\Lambda_0^{-1}+n\boldsymbol\Sigma^{-1})^{-1}(\boldsymbol\Lambda_0^{-1}\boldsymbol\mu_0 + n\boldsymbol\Sigma^{-1}\bar{\mathbf{x}}), \qquad \boldsymbol\Lambda_n^{-1} = \boldsymbol\Lambda_0^{-1}+n\boldsymbol\Sigma^{-1},$$
so the posterior precision is the sum of the prior and data precisions; the Woodbury identity can be applied to this expression for the covariance matrix. Since the multivariate normal prior is conjugate for the multivariate normal likelihood, analogous to the univariate case, an inverse Wishart distribution will be introduced as the conjugate prior for $\boldsymbol\Sigma$. Posterior conditional and marginal distributions of subvectors of $\boldsymbol\mu$ with known $\boldsymbol\Sigma$ can also be derived.

Posterior predictive distribution for $\tilde{\mathbf{x}}$

For a new observation $\tilde{\mathbf{x}} \sim N(\boldsymbol\mu,\boldsymbol\Sigma)$, the joint distribution is $f(\tilde{\mathbf{x}},\boldsymbol\mu\mid\mathbf{X}) \propto f(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\boldsymbol\Sigma)\,f(\boldsymbol\mu\mid\boldsymbol\mu_n,\boldsymbol\Lambda_n)$. The posterior predictive distribution of $\tilde{\mathbf{x}}$ is multivariate normal because $\boldsymbol\Sigma$ is known. By Adam's law (the law of total expectation),
$$E(\tilde{\mathbf{x}}\mid\mathbf{X}) = E\{E(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\mathbf{X})\mid\mathbf{X}\} = E(\boldsymbol\mu\mid\mathbf{X}) = \boldsymbol\mu_n,$$
and applying Eve's law (the law of total variance, as in the MMSE decomposition),
$$\mathrm{var}(\tilde{\mathbf{x}}\mid\mathbf{X}) = E\{\mathrm{var}(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\mathbf{X})\mid\mathbf{X}\} + \mathrm{var}\{E(\tilde{\mathbf{x}}\mid\boldsymbol\mu,\mathbf{X})\mid\mathbf{X}\} = E(\boldsymbol\Sigma\mid\mathbf{X}) + \mathrm{var}(\boldsymbol\mu\mid\mathbf{X}) = \boldsymbol\Sigma + \boldsymbol\Lambda_n.$$

Non-informative prior density of $\boldsymbol\mu$

If $\pi(\boldsymbol\mu) \propto \text{constant}$, i.e. the prior precision $\boldsymbol\Lambda_0^{-1}$ converges to $\mathbf{0}$ (the prior variance grows without bound), then the prior mean becomes irrelevant for the posterior density.
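A small sketch of this known-covariance update and its predictive covariance, using illustrative prior values and simulated data.

```python
import numpy as np

# mu | X ~ N(mu_n, Lambda_n) with Lambda_n^{-1} = Lambda_0^{-1} + n Sigma^{-1}
rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # known covariance
mu0 = np.zeros(2)
Lambda0 = np.eye(2) * 4.0                    # prior covariance of mu
X = rng.multivariate_normal([1.0, -0.5], Sigma, size=50)
n, xbar = len(X), X.mean(axis=0)

prec_n = np.linalg.inv(Lambda0) + n * np.linalg.inv(Sigma)
Lambda_n = np.linalg.inv(prec_n)
mu_n = Lambda_n @ (np.linalg.inv(Lambda0) @ mu0 + n * np.linalg.inv(Sigma) @ xbar)

pred_cov = Sigma + Lambda_n                  # posterior predictive covariance for a new x
print(mu_n, Lambda_n, pred_cov, sep="\n")
```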
Before, π (π|X) ∝ π (X|π) π (π), π ∼ π (π0 , Λ0 ) in 3.3.2. Thus, the posterior of two unknown parameters is derived with respect to π (π, Σ|X) ∝ π (X|π, Σ) π (π, Σ). Notice, few lines before 1 π −1 π −1 π (π, Σ) = |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) − (π − π0 ) Σ (π − π0 ) . 2 2 π 1 −1 π −1 π/2 −{( π+π)/2+1} ⇔ π (π, Σ|X) ∝ |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) − (π − π0 ) Σ (π − π0 ) 2 2 −π 1 2 π −1 2 ×|Σ| exp − (π − 1)S + π( xΜ − π) Σ ( xΜ − π) 2 π/2 −{( π+π)/2+1} = |Λ0 | π/2 |Σ| −{(π+π+π)/2+1} 1 −1 2 π −1 π −1 exp − π‘π (Λ0 Σ ) + (π − 1)S + π( xΜ − π) Σ ( xΜ − π) + π (π − π0 ) Σ (π − π0 ) . 2 Posterior with unknown π 2 : derivation for square and Inverse-Wishart kernel First, (π − 1)S2 + π( xΜ − π)π Σ−1 ( xΜ − π) + π (π − π0 )π Σ−1 (π − π0 ) = (π − 1)S2 + πxΜπ Σ−1 xΜ − 2ππ π Σ−1 xΜ + ππ π Σ−1 π + ππ π Σ−1 π − 2ππ π Σ−1 π0 + π ππ0 Σ−1 π0 . Now, for rearrangement later, add (π + π)π ππ Σ−1 π π and subtract for balancing out the equation for posterior distribution parameters such that (π − 1)S2 + (π + π)π π Σ−1 π − 2π π Σ−1 (π π0 + πXΜ) + (π + π)π ππ Σ−1 π π − (π + π)π ππ Σ−1 π π + π ππ0 Σ−1 π0 + πxΜπ Σ−1 xΜ = ππ (π0 − XΜ)π Σ−1 (π0 − XΜ). Then, (π − 1)S2 + (π + π)(π − π π )π Σ−1 (π − π π ) + π+π 1 π/2 −{( π+π+π+1)/2} −1 π (π, Σ|X) ∝ |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) 2 −1 1 ππ 2 π −1 π −1 2 ×|Σ| exp − (π −1)S + (π + π)(π − π π ) Σ (π − π π ) + (π0 − XΜ) Σ (π0 − XΜ) 2 π +π 1 ππ π/2 − ( π+π+π+1) −1 2 π −1 2 ⇔ |Λ0 | |Σ| exp − π‘π (Λ0 Σ ) + (π − 1)S + (π0 − XΜ) Σ (π0 − XΜ) 2 π +π CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH 49 STA498 ×|Σ| −1 2 1 π −1 exp − (π + π)(π − π π ) Σ (π − π π ) 2 Lim, Kyuson The idea is to build up a model that is exactly similar to Normal × Inversed Wishart and identify the parameters. For identifying the exponent of the normal by inverted Wishart kernels, the property of adding symmetric matrix and multiplication is used, ie. π‘π ( π΄) + π‘π (π΅) = π‘π ( π΄ + π΅), π‘π (π·πΆ) = π‘π (πΆπ·) and π₯π Σ−1 π₯ = π‘π (π₯ π‘ Σ−1 π₯) ⇒ π‘π (π₯π Σ−1 π₯) = π‘π (π₯π₯π Σ−1 ). 1 Íπ π −1 Notice that S2 = π−1 π=1 (xπ − xΜ) Σ (xπ − xΜ). Then, the first part of the exponential is simplified to be 1 ππ −1 2 π −1 − π‘π (Λ0 Σ ) + (π − 1)S + (π0 − XΜ) Σ (π0 − XΜ) 2 π +π ∑οΈ ππ 1 π π (xπ − XΜ)(xπ − XΜ) − ( XΜ − π0 ) ( XΜ − π0 ) Σ−1 = − π‘π Λ0 + 2 π +π Σ These properties enable the equation to rearrange as π π π , π+π × Inverted Wishart(π π , Λπ ) −1 with Σ , which is π+ π+π+1 π π +π 1 −1 −1/2 π −1 − 2 exp − (π − π π ) Σ (π − π π ) , exp − π‘π (Λπ Σ ) |Σ| |Λπ | 2 |Σ| 2 2 and det(Λ0 ) = det(Λπ ) as symmetric matrix. Now, comparing with the equation of the interest, 1 ππ = (π π0 + πXΜ), π +π Λπ = Λ0 + π ∑οΈ (xπ − XΜ)(xπ − XΜ)π + ππ ( XΜ − π0 )( XΜ − π0 )π , π +π π=1 where the first term of π π matches with the second term in the modelling and the second term fo Λπ describes the first term in the modelling for equivalent relationship. Thus, Σ π, Σ|X ∼ π π π , with π × Inverse Wishart (π π , Λπ ) with Σ−1 , π +π to follow the modelling. Also, π +π π −1 (π − π π ) Σ (π − π π ) . π|Σ ∼ π π π , (π + π) Σ ⇔ π (π|Σ, X) ∝ exp − 2 −1 Posterior with unknown π 2 : uninformative priors The joint uninformative prior (with a locally uniform prior for π) is π (π, Σ) ∝ |Σ| − and the joint posterior is derived as π 1 − π+1 − 2 π −1 π (π, Σ|X) ∝ |Σ| 2 |Σ| 2 exp − (π − 1)S + π( XΜ − π) Σ ( XΜ − π) 2 1 − π+π+1 −1 ⇔ |Σ| 2 exp − π‘π (π(π)Σ ) , 2 50 CHAPTER 3. BAYESIAN ALTERNATIVE APPROACH π+1 2 , Lim, Kyuson Í STA498 π π + π( XΜ − π)( XΜ − π) π . 
Posterior with unknown $\boldsymbol\Sigma$: non-informative priors

The joint non-informative prior (with a locally uniform prior for $\boldsymbol\mu$) is $f(\boldsymbol\mu,\boldsymbol\Sigma) \propto |\boldsymbol\Sigma|^{-(p+1)/2}$, and the joint posterior is derived as
$$f(\boldsymbol\mu,\boldsymbol\Sigma\mid\mathbf{X}) \propto |\boldsymbol\Sigma|^{-(p+1)/2}\,|\boldsymbol\Sigma|^{-n/2}\exp\!\Big(-\frac12\big[(n-1)S^{2} + n(\bar{\mathbf{x}}-\boldsymbol\mu)^{T}\boldsymbol\Sigma^{-1}(\bar{\mathbf{x}}-\boldsymbol\mu)\big]\Big) \iff |\boldsymbol\Sigma|^{-(n+p+1)/2}\exp\!\Big(-\frac12\mathrm{tr}\big(S(\boldsymbol\mu)\boldsymbol\Sigma^{-1}\big)\Big),$$
where $S(\boldsymbol\mu) = \sum_{i=1}^{n}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{T} + n(\bar{\mathbf{x}}-\boldsymbol\mu)(\bar{\mathbf{x}}-\boldsymbol\mu)^{T}$. The conditional posterior for $\boldsymbol\mu$ is then $\boldsymbol\mu\mid\boldsymbol\Sigma,\mathbf{X} \sim N(\bar{\mathbf{x}},\,\boldsymbol\Sigma/n)$, such that
$$f(\boldsymbol\mu\mid\boldsymbol\Sigma,\mathbf{X}) \propto \exp\!\Big(-\frac{n}{2}(\boldsymbol\mu-\bar{\mathbf{x}})^{T}\boldsymbol\Sigma^{-1}(\boldsymbol\mu-\bar{\mathbf{x}})\Big).$$

List of Conjugate Models (kernel form)

Parameter            | Prior $\pi(\theta)$                                              | Likelihood $f(X\mid\theta)$                                        | Posterior $f(\theta\mid X)$
Normal mean $\theta$ | $\propto \exp\!\big(-\frac{(\theta-\mu_0)^{2}}{2\sigma_0^{2}}\big)$ | $\propto \prod_{i=1}^{n}\exp\!\big(-\frac{(x_i-\theta)^{2}}{2\sigma^{2}}\big)$ | $\propto \exp\!\big(-\frac{(\theta-\mu_n)^{2}}{2\sigma_n^{2}}\big)$, $\ \mu_n = \frac{\sigma^{2}\mu_0+n\sigma_0^{2}\bar{X}}{n\sigma_0^{2}+\sigma^{2}}$, $\ \sigma_n^{2} = \frac{\sigma^{2}\sigma_0^{2}}{n\sigma_0^{2}+\sigma^{2}}$
Beta-Binomial $\theta$ | Beta $\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$           | Bin $\propto \theta^{k}(1-\theta)^{n-k}$                           | Beta $\propto \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}$
Gamma-Exponential $\theta$ | Gamma $\propto \theta^{\alpha-1}\exp(-\beta\theta)$        | Exp $\propto \theta^{n}\exp(-\theta s)$, $s=\sum_i x_i$            | Gamma $\propto \theta^{\alpha+n-1}\exp(-(\beta+s)\theta)$

3.3.4 Lindley's Paradox

Depending on the choice of prior distribution, the frequentist and Bayesian analyses can give different results for the same hypothesis test. The paradox concerns the result of an experiment with two explanations, $H_0$ and $H_a$, and some prior distribution $\pi$ representing the uncertainty about which hypothesis is more accurate before taking the result $x$ into account. Lindley's paradox occurs when the result $x$ is significant by the frequentist test of $H_0$, indicating sufficient evidence to reject $H_0$ at a given $\alpha = 0.05$, while at the same time the Bayesian posterior probability of $H_0$ given $x$ is high, indicating strong evidence that $H_0$ is in better agreement with $x$ than $H_a$.

Example for the comparison

In a statistics program, 4,900 male and 4,700 female students are enrolled in a certain time period. The observed proportion of male students is $x = 4900/9600 \approx 0.51$. Assume the number of male students is a Binomial variable with parameter $\theta$; the goal is to test whether $\theta$ is $0.5$ or some other value, i.e. $H_0: \theta = 0.5$ against $H_a: \theta \neq 0.5$.

The frequentist approach to testing $H_0$ computes a p-value, the probability of observing a count of male students at least as large as $k = 4900$ assuming $H_0$ is true. A normal approximation for the count of male students is $k \sim N(\mu,\sigma^{2})$ with $\mu = n\theta = 9600(0.5) = 4800$ and $\sigma^{2} = n\theta(1-\theta) = 9600(0.5)(1-0.5) = 2400$:
$$P(k \ge 4900\mid\mu=4800) = \int_{4900}^{9600}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\Big(-\frac{(u-\mu)^{2}}{2\sigma^{2}}\Big)\,du = \int_{4900}^{9600}\frac{1}{\sqrt{2\pi(2400)}}\exp\!\Big(-\frac{(u-4800)^{2}}{4800}\Big)\,du \approx 0.0206.$$
For the two-sided test the p-value is $2(0.0206) \approx 0.041 < 0.05$, so $H_0$ is rejected and the observed proportion is declared different from $0.5$.

The Bayesian approach assumes equal prior probabilities, since neither hypothesis is favoured a priori: $P(H_0) = P(H_a) = 0.5$, with $\theta \sim U[0,1]$ under $H_a$. The posterior probability of $H_0$ after observing $k/n = 4900/9600$ male students is
$$P(H_0\mid k) = \frac{P(k\mid H_0)\,P(H_0)}{P(k\mid H_0)\,P(H_0) + P(k\mid H_a)\,P(H_a)},$$
where the marginal likelihoods are computed from the binomial PMF under each hypothesis:
$$P(k\mid H_0) = \binom{n}{k}(0.5)^{k}(1-0.5)^{n-k} \approx 1.0\times10^{-3},$$
$$P(k\mid H_a) = \int_0^1\binom{n}{k}\theta^{k}(1-\theta)^{n-k}\,d\theta = \frac{1}{n+1} \approx 1.04\times10^{-4}
\;\Rightarrow\; P(k\mid H_0) > P(k\mid H_a).$$
Hence the posterior probability is $P(H_0\mid k) \approx 0.91$, which strongly favours $H_0$ over $H_a$. Thus the Bayesian and frequentist approaches appear to be in conflict, which is the paradox; this comparison also leads to the goodness-of-fit test.
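The two sides of the paradox can be reproduced directly; the sketch below uses only the counts stated in the example.

```python
from scipy.stats import binom, norm

n, k = 9600, 4900

# Frequentist: two-sided p-value via the normal approximation N(n/2, n/4)
z = (k - n / 2) / (n / 4) ** 0.5
p_value = 2 * norm.sf(z)
print(p_value)                     # ≈ 0.041, rejects H0 at the 5% level

# Bayesian: P(H0) = P(Ha) = 1/2, theta ~ U[0,1] under Ha
p_k_h0 = binom.pmf(k, n, 0.5)      # ≈ 1.0e-3
p_k_ha = 1.0 / (n + 1)             # marginal likelihood under a uniform prior
post_h0 = p_k_h0 / (p_k_h0 + p_k_ha)
print(post_h0)                     # ≈ 0.91, favouring H0
```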
3.3.5 Bernstein-von Mises theorem

The Bernstein-von Mises theorem is a result that links Bayesian inference with frequentist inference. In particular, it states that Bayesian credible sets of a given credibility level $\alpha$ are asymptotically confidence sets of confidence level $\alpha$, which allows a frequentist interpretation of Bayesian credible sets under the probabilistic data-generating process. In parameter inference, the posterior distribution converges, in the limit of infinitely many data, to a multivariate normal distribution centered at the maximum likelihood estimate, with covariance matrix given by the inverse Fisher information divided by $n$. From a frequentist point of view, inference based on the posterior distribution is therefore asymptotically correct.

Bernstein-von Mises theorem: Univariate normal data

In the Bayesian approach, when the number $n$ of observed data points is large,
$$\sqrt{n}(\bar{x}-\theta)\mid X = x_1,\dots,x_n \;\to\; N(0,\sigma^{2}),$$
so the prior does not matter for large samples. In the frequentist approach, for large samples,
$$\sqrt{n}(\bar{x}-\theta)\mid\theta \;\sim\; N(0,\sigma^{2}).$$
Without loss of generality, the Bayesian 95% credible region and the frequentist 95% confidence interval then match:
$$P\!\left(\theta\in\Big[\bar{X}-1.96\tfrac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\tfrac{\sigma}{\sqrt{n}}\Big]\,\Big|\,X_1,\dots,X_n\right) \approx P\!\left(\theta\in\Big[\bar{X}-1.96\tfrac{\sigma}{\sqrt{n}},\ \bar{X}+1.96\tfrac{\sigma}{\sqrt{n}}\Big]\,\Big|\,\theta\right) = 0.95.$$

3.4 Goodness-of-fit test

Suppose $n$ samples are drawn from a normal distribution $N(\theta,\sigma^{2})$ with known variance, and the goal is to select the model that best predicts the mean of the distribution.

3.4.1 Bayes factor

The Bayes factor is used as the Bayesian alternative to frequentist hypothesis testing; Bayesian model comparison is a method of model selection based on Bayes factors ($BF$) among statistical models. Based on the observed data $D$, the relative plausibility of two different models $M_1$ and $M_2$, parametrized by $\theta_1$ and $\theta_2$ respectively, is assessed by the probability odds of the two models,
$$BF_{12} = \frac{P(D\mid M_1)}{P(D\mid M_2)} = \frac{\int\pi(\theta_1\mid M_1)\,P(D\mid\theta_1,M_1)\,d\theta_1}{\int\pi(\theta_2\mid M_2)\,P(D\mid\theta_2,M_2)\,d\theta_2} = \frac{P(M_1\mid D)\,P(D)/P(M_1)}{P(M_2\mid D)\,P(D)/P(M_2)} = \frac{P(M_1\mid D)}{P(M_2\mid D)}\cdot\frac{P(M_2)}{P(M_1)},$$
i.e. Likelihood Ratio $\iff$ Posterior odds $\times$ Prior odds$^{-1}$. Unlike the LRT, integrating over the parameters guards against overfitting, at the price of some bias. Moreover, the Bayes factor is a relative predictive accuracy of one hypothesis over another, and measures the extent to which the data sway our relative belief from one hypothesis to the other: $BF = q$, $q\in(0,\infty)$, means there is $q$ times more evidence for $H_a$ than for $H_0$.

In the case of only two models, given the Bayes factor $BF(D)$, the posterior probability of Model 1 is derived as
$$P(M_1\mid D) = 1 - P(M_2\mid D) = 1 - \frac{P(D\mid M_2)\,P(M_2)}{P(D)} = 1 - \frac{P(D\mid M_1)\,P(M_2)}{P(D)\,BF(D)}$$
$$\Rightarrow\ 1 - \frac{P(M_1\mid D)}{BF(D)}\cdot\frac{P(M_2)}{P(M_1)} = P(M_1\mid D)
\ \iff\ 1 = \left(1 + \frac{1}{BF(D)}\frac{P(M_2)}{P(M_1)}\right)P(M_1\mid D)
\ \iff\ P(M_1\mid D) = \frac{1}{1 + \dfrac{1}{BF(D)}\dfrac{P(M_2)}{P(M_1)}}.$$
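The last identity is easy to apply in practice; the following minimal sketch converts a Bayes factor and prior odds into a posterior model probability, with arbitrary illustrative values.

```python
# P(M1 | D) = 1 / (1 + (1 / BF(D)) * P(M2) / P(M1))
def posterior_prob_m1(bf_12, prior_m1=0.5, prior_m2=0.5):
    return 1.0 / (1.0 + (1.0 / bf_12) * (prior_m2 / prior_m1))

print(posterior_prob_m1(10.0))        # BF = 10 with equal priors -> ≈ 0.909
print(posterior_prob_m1(1.0 / 3.0))   # BF = 1/3 -> 0.25
```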
Bayes factor cutoffs

$BF_{10}$      | Interpretation
$> 100$        | Extreme evidence for $H_a$
$30 - 100$     | Very strong evidence for $H_a$
$10 - 30$      | Strong evidence for $H_a$
$3 - 10$       | Moderate evidence for $H_a$
$1 - 3$        | Anecdotal evidence for $H_a$
$1$            | Equal evidence for $H_a$ and $H_0$
$1/3 - 1$      | Anecdotal evidence for $H_0$
$1/10 - 1/3$   | Moderate evidence for $H_0$
$1/30 - 1/10$  | Strong evidence for $H_0$
$1/100 - 1/30$ | Very strong evidence for $H_0$
$< 1/100$      | Extreme evidence for $H_0$

Relationship between Bayes factor and p-value

Larger Bayes factors in favour of $H_a$ correspond to smaller p-values, both pointing towards rejecting $H_0$, while larger p-values correspond to smaller Bayes factor values: the relationship is an inverse one. Note that Bayes factors also allow the null hypothesis itself to be tested directly (relative to the models under consideration).

3.4.2 Bayes factor: hypothesis testing

A benefit of the Bayes factor is that multiple hypotheses can be compared on the same observed data; for instance, regression models can be tested through
$$BF_{10} = \frac{P(D\mid H_1)}{P(D\mid H_0)}, \qquad BF_{20} = \frac{P(D\mid H_2)}{P(D\mid H_0)} \;\Rightarrow\; \frac{BF_{10}}{BF_{20}} = \frac{P(D\mid H_1)}{P(D\mid H_2)} = BF_{12}.$$

Frequentist: Chi-square Goodness-of-fit test

As before, for $H_0: \theta = \theta_0$ the CLT guarantees that the sample mean $\bar{x} = \sum_{i=1}^{n}x_i/n$ satisfies $\bar{x} \sim N(\theta,\sigma^{2}/n)$ (for $n$ samples from a distribution with mean $\theta$ and variance $\sigma^{2}$). The test statistic is computed as
$$\chi^{2} = \frac{(\bar{x}-\theta_0)^{2}}{\sigma^{2}/n}.$$

Bayesian: Bayes factor

Here the emphasis is on computing the Bayes factor of the models. Consider two models: $M_1: \theta = \theta_0$, and $M_2$: $\theta$ lies in an interval of length $L$ that contains $\theta_0$, with $L$ large relative to $\sigma/\sqrt{n}$ and the prior $\pi(\theta) = 1/L$ on that interval, which determines the relative ratio. The evidence for $M_1$ is
$$P(X\mid M_1) = \frac{1}{\sqrt{2\pi\sigma^{2}/n}}\exp\!\Big(-\frac{(\bar{x}-\theta_0)^{2}}{2\sigma^{2}/n}\Big) = \sqrt{\frac{n}{2\pi\sigma^{2}}}\exp\!\big(-\chi^{2}/2\big).$$
For $M_2$, marginalize over $\theta$:
$$P(X\mid M_2) = \int P(X\mid\theta,M_2)\,\pi(\theta\mid M_2)\,d\theta = \int\frac{1}{\sqrt{2\pi\sigma^{2}/n}}\exp\!\Big(-\frac{(\bar{x}-\theta)^{2}}{2\sigma^{2}/n}\Big)\frac{1}{L}\,d\theta \approx \frac{1}{L}.$$
The Bayes factor in favour of $M_1$ is therefore derived as
$$B = \frac{P(X\mid M_1)}{P(X\mid M_2)} = \frac{\sqrt{n}\,L}{\sqrt{2\pi\sigma^{2}}}\exp\!\big(-\chi^{2}/2\big).$$
For a fixed value of $\chi^{2}$, as $n\to\infty$ the Bayes factor favours $\theta = \theta_0$, since $B\to\infty$.
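The Lindley-type behaviour of this Bayes factor is easy to see numerically; in the sketch below, $\sigma$, $L$ and the fixed $\chi^{2}$ value are illustrative choices.

```python
import numpy as np

# B = sqrt(n) * L / sqrt(2*pi*sigma^2) * exp(-chi2 / 2)
def bayes_factor_point_null(n, chi2, sigma=1.0, L=1.0):
    return np.sqrt(n) * L / np.sqrt(2 * np.pi * sigma**2) * np.exp(-chi2 / 2)

# For a fixed chi-square value (3.84, borderline significant at the 5% level),
# the Bayes factor in favour of theta = theta0 grows without bound as n increases.
for n in (10, 100, 10_000, 1_000_000):
    print(n, bayes_factor_point_null(n, chi2=3.84))
```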
Improper prior and example

When the prior is a function with $\int_\Theta\pi(\theta)\,d\theta = \infty$, the prior is not a pdf, but the posterior can still be a valid pdf as long as the marginal distribution $m(x) = \int_\Theta f(x\mid\theta)\,\pi(\theta)\,d\theta$ is well defined. For the univariate normal distribution with known variance, if the uniform prior $\pi(\theta) = 1$ is used, expressing no prior information, then $\int_\Theta\pi(\theta)\,d\theta = \infty$; however, the corresponding marginal distribution $m(X) = \int_\Theta f(X\mid\theta)\,d\theta$ is
$$m(X) = \int(2\pi\sigma^{2})^{-n/2}\exp\!\Big(-\frac{\sum_i(x_i-\theta)^{2}}{2\sigma^{2}}\Big)\,d\theta = \frac{(2\pi\sigma^{2})^{-(n-1)/2}}{\sqrt{n}}\exp\!\Big(-\frac{(n-1)s^{2}}{2\sigma^{2}}\Big), \qquad (n-1)s^{2}=\sum_i(x_i-\bar{x})^{2},$$
so the posterior becomes $\pi(\theta\mid X) = f(\theta\mid\bar{x},\sigma^{2}/n)$, as shown before. For the Bayesian $t$-test, the Jeffreys prior, which is improper, is used, and the resulting posterior still integrates to one.

3.4.3 Two-sample test for equal means

Suppose there are two samples of sizes $n_1$ and $n_2$, with $n = n_1+n_2$, where $X_{1i} \sim N(\mu_1,\sigma^{2})$ for $i$ in sample 1 and $X_{2j} \sim N(\mu_2,\sigma^{2})$ for $j$ in sample 2, with sample variances $s_1^{2}$ and $s_2^{2}$. First, the frequentist two-sided $t$-test asks whether the means of the two groups differ,
$$H_0: \mu_1 = \mu_2 \iff \mu_1-\mu_2 = 0 \qquad\text{vs.}\qquad H_a: \mu_1 \neq \mu_2 \iff \mu_1-\mu_2 \neq 0,$$
and the two-sample $t$-statistic is
$$t = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{\dfrac{(n_1-1)s_1^{2}+(n_2-1)s_2^{2}}{n_1+n_2-2}\Big(\dfrac{1}{n_1}+\dfrac{1}{n_2}\Big)}}.$$
Let $\mathbf{x} = \{\mathbf{x}_1,\mathbf{x}_2\}$, where $\mathbf{x}_1 = (x_1,\dots,x_{n_1})'$ and $\mathbf{x}_2 = (x_{n_1+1},\dots,x_{n_1+n_2})'$. The goal is to test
$$H_0: x_i\mid\mu,\sigma^{2} \sim N(\mu,\sigma^{2}),\quad 1\le i\le n,$$
against
$$H_a: x_i\mid\mu_1,\sigma_1^{2} \sim N(\mu_1,\sigma_1^{2}),\ 1\le i\le n_1, \quad\text{and}\quad x_i\mid\mu_2,\sigma_2^{2} \sim N(\mu_2,\sigma_2^{2}),\ n_1+1\le i\le n_1+n_2.$$
The Bayesian approach, in contrast, places the prior on the standardized difference of means, $\delta = (\mu_1-\mu_2)/\sigma$. In the simplest (one-sample, point-null) building block, the marginal likelihoods under the two models are
$$m_0 = (2\pi\sigma^{2})^{-n/2}\exp\!\Big(-\frac{\sum_i x_i^{2}}{2\sigma^{2}}\Big), \qquad m_1 = (2\pi\sigma^{2})^{-n/2}\exp\!\Big(-\frac{\sum_i(x_i-\mu_1)^{2}}{2\sigma^{2}}\Big),$$
so that the Bayes factor is derived as
$$BF_{01} = \frac{m_0}{m_1} = \exp\!\Big(-\frac{n\mu_1}{2\sigma^{2}}(2\bar{x}-\mu_1)\Big);$$
when the prior on the mean is normal with variance $\sigma_0^{2}$, this becomes
$$BF_{01} = \Big(\frac{\lambda\,\sigma_0^{2}}{\sigma^{2}}\Big)^{1/2}\exp\!\Big(-\frac{n^{2}\bar{x}^{2}}{2\lambda\sigma^{2}}\Big), \qquad \lambda = n + \sigma^{2}/\sigma_0^{2},$$
and the goal is to derive the Jeffreys-Zellner-Siow (JZS) Bayes factor from this construction.

Chapter 4

Appendix

4.1 Extension of Bayesian distribution

4.1.1 EM (expectation-maximization) algorithm for MLE example

When part of the data set is missing, the prediction step uses initial estimates $\tilde{\boldsymbol\mu}$ and $\tilde{\boldsymbol\Sigma}$ to predict the contribution of the missing values to the sufficient statistics.

Algorithm: assume that the population mean and covariance $\boldsymbol\mu$ and $\boldsymbol\Sigma$ are unknown and must be estimated.

1. Prediction: given estimates $\tilde{\boldsymbol\theta}$ of the unknown parameters, predict the contribution of any missing observation to the complete-data sufficient statistics.
2. Estimation: using the predicted sufficient statistics, compute revised estimates of the parameters.
3. Iteration: repeat until the revised estimates do not differ appreciably from the estimates obtained previously.
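A minimal sketch of these three steps for a bivariate normal with values missing at random in the second coordinate; the data are simulated and the implementation is illustrative, not a general-purpose routine.

```python
import numpy as np

rng = np.random.default_rng(6)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=300)
X[rng.random(300) < 0.2, 1] = np.nan        # drop ~20% of the second coordinate

mu = np.nanmean(X, axis=0)                  # initial estimates
Sigma = np.diag(np.nanvar(X, axis=0))

for _ in range(50):
    filled = X.copy()
    extra = np.zeros((2, 2))                # accumulated conditional covariances
    miss = np.isnan(X[:, 1])
    # Prediction (E-step): fill in E(x2 | x1) and record the conditional variance
    cond_mean = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (X[miss, 0] - mu[0])
    cond_var = Sigma[1, 1] - Sigma[1, 0] ** 2 / Sigma[0, 0]
    filled[miss, 1] = cond_mean
    extra[1, 1] = miss.sum() * cond_var
    # Estimation (M-step): recompute mu and Sigma from the completed sufficient statistics
    mu = filled.mean(axis=0)
    Sigma = ((filled - mu).T @ (filled - mu) + extra) / len(X)

print(mu)       # close to true_mu
print(Sigma)    # close to true_Sigma
```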