Introduction to Bayesian Methods

Introduction

We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study in which the parameter θ is of inferential interest. Here θ may be vector valued. For example,

1. θ = difference in treatment means
2. θ = hazard ratio
3. θ = vector of regression coefficients
4. θ = probability that a treatment is effective

In parametric inference, we specify a parametric model for the data, indexed by the parameter θ. Letting x denote the data, we denote this model (density) by p(x|θ). The likelihood function of θ is any function proportional to p(x|θ), i.e., L(θ) ∝ p(x|θ).

Example. Suppose x|θ ~ Binomial(N, θ). Then
\[
p(x\mid\theta) = \binom{N}{x}\theta^{x}(1-\theta)^{N-x}, \qquad x = 0, 1, \ldots, N.
\]
We can take L(θ) = θ^x (1 − θ)^{N−x}.

The parameter θ is unknown. In the Bayesian mind-set, we express our uncertainty about quantities by specifying distributions for them. Thus, we express our uncertainty about θ by specifying a prior distribution for it. We denote the prior density of θ by π(θ). The word "prior" indicates that it is the density of θ before the data x are observed. By Bayes theorem, we can construct the distribution of θ|x, which is called the posterior distribution of θ. We denote the posterior density of θ by p(θ|x).

By Bayes theorem,
\[
p(\theta\mid x) = \frac{p(x\mid\theta)\,\pi(\theta)}{\int_{\Theta} p(x\mid\theta)\,\pi(\theta)\,d\theta},
\]
where Θ denotes the parameter space of θ. The quantity
\[
p(x) = \int_{\Theta} p(x\mid\theta)\,\pi(\theta)\,d\theta
\]
is the normalizing constant of the posterior distribution. For most inference problems, p(x) does not have a closed form. Bayesian inference about θ is primarily based on the posterior distribution of θ, p(θ|x).

For example, one can compute various posterior summaries, such as the mean, median, mode, variance, and quantiles. The posterior mean of θ is given by
\[
E(\theta\mid x) = \int_{\Theta} \theta\, p(\theta\mid x)\,d\theta.
\]

Example 1. Given θ, suppose x_1, x_2, ..., x_n are i.i.d. Binomial(1, θ), and θ ~ Beta(α, λ). The parameters of the prior distribution are often called the hyperparameters. Let us derive the posterior distribution of θ. Let x = (x_1, x_2, ..., x_n). Then
\[
p(x\mid\theta) = \prod_{i=1}^{n} p(x_i\mid\theta)
              = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}
              = \theta^{\sum x_i}(1-\theta)^{\,n-\sum x_i},
\]
where \(\sum x_i = \sum_{i=1}^{n} x_i\). Also,
\[
\pi(\theta) = \frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha)\Gamma(\lambda)}\,\theta^{\alpha-1}(1-\theta)^{\lambda-1}.
\]
Now, we can write the kernel of the posterior density as
\[
p(\theta\mid x) \propto \theta^{\sum x_i}(1-\theta)^{\,n-\sum x_i}\,\theta^{\alpha-1}(1-\theta)^{\lambda-1}
                = \theta^{\sum x_i+\alpha-1}(1-\theta)^{\,n-\sum x_i+\lambda-1}.
\]
We can recognize this kernel as a beta kernel with parameters \(\bigl(\sum x_i+\alpha,\; n-\sum x_i+\lambda\bigr)\). Thus
\[
\theta\mid x \sim \mathrm{Beta}\Bigl(\sum x_i+\alpha,\; n-\sum x_i+\lambda\Bigr),
\]
and therefore
\[
p(\theta\mid x) = \frac{\Gamma(\alpha+n+\lambda)}{\Gamma\bigl(\sum x_i+\alpha\bigr)\,\Gamma\bigl(n-\sum x_i+\lambda\bigr)}
                  \,\theta^{\sum x_i+\alpha-1}(1-\theta)^{\,n-\sum x_i+\lambda-1}.
\]
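As a quick numerical sanity check of this Beta-Binomial conjugacy, the following is a minimal Python sketch. The Bernoulli data and the hyperparameter values α = 2, λ = 3 are made up for illustration; the sketch compares the closed-form Beta posterior with a brute-force grid normalization of L(θ)π(θ).

```python
import numpy as np
from scipy.stats import beta

# Hypothetical Bernoulli data and Beta(alpha, lam) hyperparameters (illustration only)
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, s = len(x), x.sum()                      # n trials, s = sum of the x_i
alpha, lam = 2.0, 3.0

# Closed-form conjugate posterior: theta | x ~ Beta(s + alpha, n - s + lam)
post = beta(s + alpha, n - s + lam)

# Brute-force check: normalize L(theta) * pi(theta) on a grid of theta values
theta = np.linspace(1e-6, 1 - 1e-6, 2001)
dt = theta[1] - theta[0]
kernel = theta**s * (1 - theta)**(n - s) * beta(alpha, lam).pdf(theta)
kernel /= (kernel * dt).sum()               # divide by the numerical p(x)

print("posterior mean, closed form:", post.mean())
print("posterior mean, grid check: ", (theta * kernel * dt).sum())
```

Up to grid error, the two posterior means agree, which is exactly what the conjugacy argument above predicts.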
Remark. In deriving posterior densities, an often used technique is to try to recognize the kernel of the posterior density of θ. This avoids direct computation of p(x) and saves a lot of time in the derivation. If the kernel cannot be recognized, then p(x) must be computed directly. In this example we have
\[
p(x) = p(x_1,\ldots,x_n) \propto \int_0^1 \theta^{\sum x_i+\alpha-1}(1-\theta)^{\,n-\sum x_i+\lambda-1}\,d\theta
     = \frac{\Gamma\bigl(\sum x_i+\alpha\bigr)\,\Gamma\bigl(n-\sum x_i+\lambda\bigr)}{\Gamma(\alpha+n+\lambda)}.
\]
Thus
\[
p(x_1,\ldots,x_n) = \frac{\Gamma(\alpha+\lambda)}{\Gamma(\alpha)\Gamma(\lambda)}\,
                    \frac{\Gamma\bigl(\sum x_i+\alpha\bigr)\,\Gamma\bigl(n-\sum x_i+\lambda\bigr)}{\Gamma(\alpha+n+\lambda)}
\]
for x_i = 0, 1 and i = 1, ..., n.

Suppose A_1, A_2, ... are events such that A_i ∩ A_j = ∅ for i ≠ j and \(\bigcup_{i=1}^{\infty} A_i = \Omega\), where Ω denotes the sample space. Let B denote an event in Ω. Then Bayes theorem for events can be written as
\[
P(A_i\mid B) = \frac{P(B\mid A_i)\,P(A_i)}{\sum_{j=1}^{\infty} P(B\mid A_j)\,P(A_j)}.
\]
P(A_i) is the prior probability of A_i, and P(A_i|B) is the posterior probability of A_i given that B has occurred.

Example 2. Bayes theorem is often used in diagnostic tests for cancer. A young person was diagnosed as having a type of cancer that occurs extremely rarely in young people. Naturally, he was very upset. A friend told him that it was probably a mistake. His friend reasoned as follows: no medical test is perfect; there are always incidences of false positives and false negatives.

Let C stand for the event that he has cancer and let + stand for the event that an individual responds positively to the test. Assume P(C) = 1/1,000,000 = 10^{-6}, P(+|C) = 0.99, and P(+|C^c) = 0.01. (So only one per million people his age have the disease, and the test is extremely good relative to most medical tests, giving only 1% false positives and 1% false negatives.) Find the probability that he has cancer given that he has a positive response. (After you make this calculation you will not be surprised to learn that he did not have cancer.)
\[
P(C\mid +) = \frac{P(+\mid C)\,P(C)}{P(+\mid C)\,P(C) + P(+\mid C^c)\,P(C^c)}
           = \frac{(0.99)(10^{-6})}{(0.99)(10^{-6}) + (0.01)(0.999999)}
           = \frac{0.00000099}{0.01000098} = 0.00009899.
\]

Example 3. Suppose x_1, ..., x_n is a random sample from N(μ, σ²).

i) Suppose σ² is known and μ ~ N(μ_0, σ_0²). The posterior density of μ is given by
\[
p(\mu\mid x) \propto \prod_{i=1}^{n} p(x_i\mid\mu,\sigma^2)\;\pi(\mu)
             \propto \exp\Bigl\{-\frac{1}{2\sigma^2}\sum (x_i-\mu)^2\Bigr\}
                     \times \exp\Bigl\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\Bigr\}.
\]
Collecting the terms in μ,
\[
p(\mu\mid x) \propto \exp\Bigl\{-\frac{1}{2}\Bigl[\frac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}\,\mu^2
      - 2\mu\,\frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{\sigma_0^2\sigma^2}\Bigr]\Bigr\}
 \propto \exp\Bigl\{-\frac{1}{2}\,\frac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}
      \Bigl[\mu - \frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2}\Bigr]^2\Bigr\}.
\]
We can recognize this as a normal kernel with mean
\(\mu_{\mathrm{post}} = \dfrac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2}\)
and variance
\(\sigma_{\mathrm{post}}^2 = \Bigl(\dfrac{n\sigma_0^2+\sigma^2}{\sigma_0^2\sigma^2}\Bigr)^{-1} = \dfrac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}\).
Thus
\[
\mu\mid x \sim N\Bigl(\frac{\sigma_0^2\sum x_i + \mu_0\sigma^2}{n\sigma_0^2+\sigma^2},\;
                      \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2}\Bigr).
\]

ii) Suppose μ is known and σ² is unknown. Let τ = 1/σ²; τ is often called the precision parameter. Suppose τ ~ gamma(δ_0/2, γ_0/2), so that
\[
\pi(\tau) \propto \tau^{\delta_0/2-1}\exp\Bigl\{-\frac{\tau\gamma_0}{2}\Bigr\}.
\]
Let us derive the posterior distribution of τ:
\[
p(\tau\mid x) \propto \tau^{n/2}\exp\Bigl\{-\frac{\tau}{2}\sum (x_i-\mu)^2\Bigr\}\,
                      \tau^{\delta_0/2-1}\exp\Bigl\{-\frac{\tau\gamma_0}{2}\Bigr\}
            = \tau^{\frac{n+\delta_0}{2}-1}\exp\Bigl\{-\frac{\tau}{2}\Bigl(\gamma_0 + \sum (x_i-\mu)^2\Bigr)\Bigr\}.
\]
Thus
\[
\tau\mid x \sim \mathrm{gamma}\Bigl(\frac{n+\delta_0}{2},\; \frac{\gamma_0 + \sum (x_i-\mu)^2}{2}\Bigr).
\]
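Both conjugate updates in parts i) and ii) amount to a few arithmetic operations on the data. Here is a minimal Python sketch that simply evaluates the closed forms derived above; the data are simulated and the hyperparameter values (μ_0, σ_0², δ_0, γ_0) are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical normal data and hyperparameters (illustration only)
x = rng.normal(loc=2.0, scale=1.5, size=20)
n = len(x)

# i) sigma^2 known, mu ~ N(mu0, sig02): the posterior for mu is normal
sigma2 = 1.5**2
mu0, sig02 = 0.0, 10.0
post_var  = sig02 * sigma2 / (n * sig02 + sigma2)
post_mean = (sig02 * x.sum() + mu0 * sigma2) / (n * sig02 + sigma2)
print("mu | x ~ N(%.3f, %.3f)" % (post_mean, post_var))

# ii) mu known, tau = 1/sigma^2 ~ gamma(delta0/2, gamma0/2): the posterior for tau is gamma
mu = 2.0
delta0, gamma0 = 1.0, 1.0
shape = (n + delta0) / 2.0
rate  = (gamma0 + ((x - mu)**2).sum()) / 2.0
print("tau | x ~ gamma(shape=%.2f, rate=%.3f); posterior mean = %.3f"
      % (shape, rate, shape / rate))
```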
iii) Now suppose μ and σ² are both unknown. Suppose we specify the joint prior π(μ, τ) = π(μ|τ)π(τ), where
\[
\mu\mid\tau \sim N(\mu_0,\ \tau^{-1}\sigma_0^2), \qquad
\tau \sim \mathrm{gamma}\Bigl(\frac{\delta_0}{2},\ \frac{\gamma_0}{2}\Bigr).
\]
The joint posterior density of (μ, τ) is given by
\[
p(\mu,\tau\mid x) \propto \tau^{n/2}\exp\Bigl\{-\frac{\tau}{2}\sum (x_i-\mu)^2\Bigr\}
 \times \tau^{1/2}\exp\Bigl\{-\frac{\tau}{2\sigma_0^2}(\mu-\mu_0)^2\Bigr\}
 \times \tau^{\delta_0/2-1}\exp\Bigl\{-\frac{\tau\gamma_0}{2}\Bigr\}
\]
\[
 = \tau^{\frac{n+\delta_0+1}{2}-1}\exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{(\mu-\mu_0)^2}{\sigma_0^2} + \sum (x_i-\mu)^2\Bigr]\Bigr\}.
\]
The joint posterior does not have a clear recognizable form. Thus, we need to compute p(x) by brute force:
\[
p(x) \propto \int_0^\infty\!\!\int_{-\infty}^{\infty} \tau^{\frac{n+\delta_0+1}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{(\mu-\mu_0)^2}{\sigma_0^2} + \sum (x_i-\mu)^2\Bigr]\Bigr\}\,d\mu\,d\tau
\]
\[
\propto \int_0^\infty\!\!\int_{-\infty}^{\infty} \tau^{\frac{n+\delta_0+1}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\mu^2\Bigl(n+\frac{1}{\sigma_0^2}\Bigr) - 2\mu\Bigl(\sum x_i + \frac{\mu_0}{\sigma_0^2}\Bigr)\Bigr]\Bigr\}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl(\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2\Bigr)\Bigr\}\,d\mu\,d\tau
\]
\[
= \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl(\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2\Bigr)\Bigr\}
      \Biggl(\int_{-\infty}^{\infty}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\mu^2\Bigl(n+\frac{1}{\sigma_0^2}\Bigr) - 2\mu\Bigl(\sum x_i + \frac{\mu_0}{\sigma_0^2}\Bigr)\Bigr]\Bigr\}\,d\mu\Biggr)\,d\tau.
\]
The integral with respect to μ can be evaluated by completing the square:
\[
\int_{-\infty}^{\infty} \exp\Bigl\{-\frac{\tau(n+\sigma_0^{-2})}{2}
      \Bigl[\mu - \frac{\sum x_i + \mu_0\sigma_0^{-2}}{n+\sigma_0^{-2}}\Bigr]^2\Bigr\}
      \exp\Bigl\{\frac{\tau(\sum x_i + \mu_0\sigma_0^{-2})^2}{2(n+\sigma_0^{-2})}\Bigr\}\,d\mu
= (2\pi)^{1/2}\,\tau^{-1/2}\,(n+\sigma_0^{-2})^{-1/2}
      \exp\Bigl\{\frac{\tau(\sum x_i + \mu_0\sigma_0^{-2})^2}{2(n+\sigma_0^{-2})}\Bigr\}.
\]
Now we need to evaluate
\[
(2\pi)^{1/2}(n+\sigma_0^{-2})^{-1/2}\int_0^\infty \tau^{\frac{n+\delta_0}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2\Bigr]\Bigr\}
      \exp\Bigl\{\frac{\tau}{2}\,\frac{(\sum x_i + \mu_0/\sigma_0^2)^2}{n+1/\sigma_0^2}\Bigr\}\,d\tau
\]
\[
= (2\pi)^{1/2}(n+\sigma_0^{-2})^{-1/2}\int_0^\infty \tau^{\frac{n+\delta_0}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2
      - \frac{(\sum x_i + \mu_0/\sigma_0^2)^2}{n+1/\sigma_0^2}\Bigr]\Bigr\}\,d\tau
\]
\[
= \frac{(2\pi)^{1/2}\,\Gamma\bigl(\frac{n+\delta_0}{2}\bigr)\,2^{\frac{n+\delta_0}{2}}\,(n+1/\sigma_0^2)^{-1/2}}
       {\Bigl[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 - \frac{(\sum x_i + \mu_0/\sigma_0^2)^2}{n+1/\sigma_0^2}\Bigr]^{\frac{n+\delta_0}{2}}}
\equiv p^*(x).
\]
Thus,
\[
p(x) = \Bigl(\frac{\gamma_0}{2}\Bigr)^{\delta_0/2}\,
       \frac{(2\pi)^{-(n+1)/2}\,\sigma_0^{-1}}{\Gamma(\delta_0/2)}\; p^*(x).
\]
The joint posterior density of (μ, τ|x) can also be obtained in this case by deriving p(μ, τ|x) = p(μ|τ, x) p(τ|x). Exercise: find p(μ|τ, x) and p(τ|x).

It is of great interest to find the marginal posterior distributions of μ and τ. For μ,
\[
p(\mu\mid x) = \int_0^\infty p(\mu,\tau\mid x)\,d\tau
\propto \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2\Bigr]\Bigr\}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\mu^2\Bigl(n+\frac{1}{\sigma_0^2}\Bigr) - 2\mu\Bigl(\sum x_i + \frac{\mu_0}{\sigma_0^2}\Bigr)\Bigr]\Bigr\}\,d\tau
\]
\[
= \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2\Bigr]\Bigr\}
      \exp\Bigl\{-\frac{\tau(n+1/\sigma_0^2)}{2}
      \Bigl[\mu - \frac{\sum x_i + \mu_0/\sigma_0^2}{n+1/\sigma_0^2}\Bigr]^2\Bigr\}
      \exp\Bigl\{\frac{\tau}{2}\,\frac{(\sum x_i + \mu_0/\sigma_0^2)^2}{n+1/\sigma_0^2}\Bigr\}\,d\tau.
\]
Let \(a = \dfrac{\sum x_i + \mu_0/\sigma_0^2}{n+1/\sigma_0^2}\). Then we can write the integral as
\[
= \int_0^\infty \tau^{\frac{n+\delta_0+1}{2}-1}
      \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2
      + \Bigl(n+\frac{1}{\sigma_0^2}\Bigr)(\mu-a)^2 - \Bigl(n+\frac{1}{\sigma_0^2}\Bigr)a^2\Bigr]\Bigr\}\,d\tau
\]
\[
= \frac{2^{\frac{n+\delta_0+1}{2}}\,\Gamma\bigl(\frac{n+\delta_0+1}{2}\bigr)}
       {\bigl[\gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2 + (n+1/\sigma_0^2)(\mu-a)^2 - (n+1/\sigma_0^2)a^2\bigr]^{\frac{n+\delta_0+1}{2}}}
\propto \Bigl[1 + \frac{c(\mu-a)^2}{b - ca^2}\Bigr]^{-\frac{n+\delta_0+1}{2}},
\]
where \(c = n + 1/\sigma_0^2\) and \(b = \gamma_0 + \mu_0^2/\sigma_0^2 + \sum x_i^2\). We recognize this kernel as that of a t-distribution with location parameter a, dispersion parameter \(\bigl(\tfrac{(n+\delta_0)c}{b-ca^2}\bigr)^{-1}\), and n + δ_0 degrees of freedom.

Definition. Let y = (y_1, ..., y_p)′ be a p × 1 random vector. Then y is said to have a p-dimensional multivariate t distribution with d degrees of freedom, location parameter m, and dispersion matrix Σ (p × p) if y has density
\[
p(y) = \frac{\Gamma\bigl(\frac{d+p}{2}\bigr)}{\Gamma\bigl(\frac{d}{2}\bigr)}\,(\pi d)^{-p/2}\,|\Sigma|^{-1/2}
       \Bigl[1 + \frac{1}{d}(y-m)'\Sigma^{-1}(y-m)\Bigr]^{-\frac{d+p}{2}}.
\]
We write this as y ~ S_p(d, m, Σ). In our problem, p = 1, d = n + δ_0, m = a,
\(\Sigma = \bigl(\tfrac{(n+\delta_0)c}{b-ca^2}\bigr)^{-1}\), and
\(\Sigma^{-1} = \tfrac{(n+\delta_0)c}{b-ca^2}\).
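To make the S_p(d, m, Σ) definition concrete, here is a minimal Python sketch that evaluates the density above for p = 1 and checks it against the usual location-scale Student t. The numerical values of d, m, and Σ are hypothetical stand-ins for n + δ_0, a, and (b − ca²)/((n + δ_0)c).

```python
import numpy as np
from scipy.stats import t
from scipy.special import gammaln

def mvt_pdf_1d(y, d, m, Sigma):
    """Density of S_p(d, m, Sigma) from the definition above, specialized to p = 1."""
    log_const = (gammaln((d + 1) / 2) - gammaln(d / 2)
                 - 0.5 * np.log(np.pi * d) - 0.5 * np.log(Sigma))
    quad = (y - m)**2 / Sigma                       # (y - m)' Sigma^{-1} (y - m) for p = 1
    return np.exp(log_const - (d + 1) / 2 * np.log1p(quad / d))

# Hypothetical values standing in for d = n + delta0, m = a, Sigma = (b - c a^2)/((n + delta0) c)
d, m, Sigma = 12.0, 1.3, 0.25
y = np.linspace(-2, 5, 5)

# For p = 1, S_1(d, m, Sigma) is the ordinary Student t with df = d, loc = m, scale = sqrt(Sigma)
print(mvt_pdf_1d(y, d, m, Sigma))
print(t.pdf(y, df=d, loc=m, scale=np.sqrt(Sigma)))
```

The two sets of density values coincide, since for p = 1 the dispersion Σ plays the role of the squared scale of a location-scale t.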
The marginal posterior distribution of τ is given by
\[
p(\tau\mid x) = \int_{-\infty}^{\infty} p(\mu,\tau\mid x)\,d\mu
\propto \tau^{\frac{n+\delta_0+1}{2}-1}
   \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2\Bigr]\Bigr\}
   \exp\Bigl\{\frac{\tau}{2}\Bigl(n+\frac{1}{\sigma_0^2}\Bigr)a^2\Bigr\}
   \int_{-\infty}^{\infty}\exp\Bigl\{-\frac{\tau(n+1/\sigma_0^2)}{2}(\mu-a)^2\Bigr\}\,d\mu
\]
\[
\propto \tau^{\frac{n+\delta_0+1}{2}-1}\,\tau^{-1/2}
   \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2 - \Bigl(n+\frac{1}{\sigma_0^2}\Bigr)a^2\Bigr]\Bigr\}
 = \tau^{\frac{n+\delta_0}{2}-1}
   \exp\Bigl\{-\frac{\tau}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2 - \Bigl(n+\frac{1}{\sigma_0^2}\Bigr)a^2\Bigr]\Bigr\}.
\]
Thus,
\[
\tau\mid x \sim \mathrm{gamma}\Bigl(\frac{n+\delta_0}{2},\;
   \frac{1}{2}\Bigl[\gamma_0 + \frac{\mu_0^2}{\sigma_0^2} + \sum x_i^2 - \Bigl(n+\frac{1}{\sigma_0^2}\Bigr)a^2\Bigr]\Bigr).
\]

Remark. A t distribution can be obtained as a scale mixture of normals. That is, if x|τ ~ N_p(m, τ^{-1}Σ), where
\[
p(x\mid\tau) = (2\pi)^{-p/2}\,\tau^{p/2}\,|\Sigma|^{-1/2}
   \exp\Bigl\{-\frac{\tau}{2}(x-m)'\Sigma^{-1}(x-m)\Bigr\},
\]
and τ ~ gamma(δ_0/2, γ_0/2), then
\[
p(x) = \int_0^\infty p(x\mid\tau)\,\pi(\tau)\,d\tau
\]
is the \(S_p\bigl(\delta_0,\, m,\, \tfrac{\gamma_0}{\delta_0}\Sigma\bigr)\) density. That is,
\(x \sim S_p\bigl(\delta_0,\, m,\, \tfrac{\gamma_0}{\delta_0}\Sigma\bigr)\).

Remark. Note that in Examples 1 and 3 i), ii), the posterior distribution is of the same family as the prior distribution. When the posterior distribution of a parameter is of the same family as the prior distribution, such prior distributions are called conjugate prior distributions. In Example 1, a Beta prior for θ led to a Beta posterior for θ. In Example 3 i), a normal prior for μ yielded a normal posterior for μ. In Example 3 ii), a gamma prior for τ yielded a gamma posterior for τ. More on conjugate priors later.

Advantages of Bayesian Methods

1. Interpretation. Having a distribution for your unknown parameter θ is easier to understand than a point estimate and a standard error. In addition, consider the following example of a confidence interval. A 95% confidence interval for a population mean θ can be written as
\[
\bar{x} \pm 1.96\, s/\sqrt{n}.
\]
However, for the observed interval (a, b), P(a < θ < b) ≠ 0.95. We have to rely on a repeated-sampling interpretation to make a probability statement like the one above. Thus, after observing the data, we cannot make a statement like "the true θ has a 95% chance of falling in \(\bar{x} \pm 1.96\, s/\sqrt{n}\)", although we are tempted to say this.

2. Bayes Inference Obeys the Likelihood Principle. The likelihood principle: if two distinct sampling plans (designs) yield proportional likelihood functions for θ, then inference about θ should be identical under the two designs. Frequentist inference does not obey the likelihood principle, in general.

Example. Suppose in 12 independent tosses of a coin, 9 heads and 3 tails are observed. We wish to test the null hypothesis H_0: θ = 1/2 vs. H_1: θ > 1/2, where θ is the true probability of heads. Consider the following two choices for the likelihood function:

a) Binomial: n = 12 (fixed), x = number of heads, x ~ Binomial(12, θ), and the likelihood is
\[
L_1(\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} = \binom{12}{9}\theta^9(1-\theta)^3.
\]

b) Negative binomial: n is not fixed; flip until the third tail appears. Here x is the number of heads observed when the experiment stops, x ~ NegBinomial(r = 3, θ), and
\[
L_2(\theta) = \binom{r+x-1}{x}\theta^x(1-\theta)^r = \binom{11}{9}\theta^9(1-\theta)^3.
\]
Note that L_1(θ) ∝ L_2(θ). From a Bayesian perspective, the posterior distribution of θ is the same under either design. That is,
\[
p(\theta\mid x) = \frac{L_1(\theta)\,\pi(\theta)}{\int L_1(\theta)\,\pi(\theta)\,d\theta}
               \equiv \frac{L_2(\theta)\,\pi(\theta)}{\int L_2(\theta)\,\pi(\theta)\,d\theta}.
\]
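A minimal Python sketch of this point: it verifies numerically that the two likelihoods are proportional in θ and that, under a prior, both designs lead to the same posterior. The Beta(1, 1) (uniform) prior used here is a hypothetical choice for illustration; any prior would give the same conclusion.

```python
import numpy as np
from scipy.special import comb
from scipy.stats import beta

theta = np.linspace(1e-6, 1 - 1e-6, 2001)
x, n, r = 9, 12, 3                                     # 9 heads, 12 tosses, stop at the 3rd tail

L1 = comb(n, x) * theta**x * (1 - theta)**(n - x)      # binomial design
L2 = comb(r + x - 1, x) * theta**x * (1 - theta)**r    # negative binomial design
print("L1/L2 constant in theta:", np.allclose(L1 / L2, (L1 / L2)[0]))

# With a hypothetical Beta(1, 1) prior, both designs give the same normalized posterior
prior = beta(1, 1).pdf(theta)
dt = theta[1] - theta[0]
post1 = L1 * prior; post1 /= (post1 * dt).sum()
post2 = L2 * prior; post2 /= (post2 * dt).sum()
print("identical posteriors:", np.allclose(post1, post2))
```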
However, under the frequentist paradigm, inferences about θ are quite different under the two designs. The p-value based on the binomial likelihood is
\[
P(x \ge 9\mid \theta = 1/2) = \sum_{j=9}^{12}\binom{12}{j}\theta^j(1-\theta)^{12-j} = 0.075,
\]
while for the negative binomial likelihood, the p-value is
\[
P(x \ge 9\mid \theta = 1/2) = \sum_{j=9}^{\infty}\binom{2+j}{j}\theta^j(1-\theta)^3 = 0.0325.
\]
The two designs lead to different decisions, rejecting H_0 under design b) and not under design a).

3. Bayesian Inference Does Not Lead to Absurd Results. Absurd results can be obtained when doing UMVUE estimation. Suppose x ~ Poisson(λ), and we want to estimate θ = e^{-2λ}, 0 < θ < 1. It can be shown that the UMVUE of θ is (−1)^x. Thus, if x is even the UMVUE of θ is 1, and if x is odd the UMVUE of θ is −1!

4. Bayes Theorem is a Formula for Learning. Suppose you conduct an experiment and collect observations x_1, ..., x_n. Then
\[
p(\theta\mid x) = \frac{p(x\mid\theta)\,\pi(\theta)}{\int_\Theta p(x\mid\theta)\,\pi(\theta)\,d\theta},
\]
where x = (x_1, ..., x_n). Suppose you collect an additional observation x_{n+1} in a new study. Then
\[
p(\theta\mid x, x_{n+1}) = \frac{p(x_{n+1}\mid\theta)\,\pi(\theta\mid x)}{\int_\Theta p(x_{n+1}\mid\theta)\,\pi(\theta\mid x)\,d\theta},
\]
so your prior in the new study is the posterior from the previous one (see the sketch at the end of this section).

5. Bayes Inference Does Not Require Large-Sample Theory. With modern computing advances, "exact" calculations can be carried out using Markov chain Monte Carlo (MCMC) methods. Bayes methods do not require asymptotics for valid inference. Thus small-sample Bayesian inference proceeds in the same way as if one had a large sample.

6. Bayes Inference Often Has Frequentist Inference as a Special Case. Often one can obtain frequentist answers by choosing a uniform prior for the parameters, i.e., π(θ) ∝ 1, so that p(θ|x) ∝ L(θ). In such cases, frequentist answers can be obtained from such a posterior distribution.
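To illustrate point 4 concretely, here is a minimal Python sketch using the Beta-Binomial model of Example 1. The Bernoulli data are simulated and the Beta(1, 1) starting prior is a hypothetical choice; the sketch shows that using the posterior from a first study as the prior for a second gives the same answer as a single analysis of the pooled data.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# Hypothetical Bernoulli data split across two "studies" (illustration only)
study1 = rng.binomial(1, 0.7, size=15)
study2 = rng.binomial(1, 0.7, size=10)

a, b = 1.0, 1.0                                             # hypothetical Beta(1, 1) prior
a, b = a + study1.sum(), b + len(study1) - study1.sum()     # posterior after study 1
print("after study 1: Beta(%.0f, %.0f), mean %.3f" % (a, b, beta(a, b).mean()))

# The posterior from study 1 becomes the prior for study 2
a, b = a + study2.sum(), b + len(study2) - study2.sum()
print("after study 2: Beta(%.0f, %.0f), mean %.3f" % (a, b, beta(a, b).mean()))

# Same answer as analyzing all the data at once with the original prior
all_x = np.concatenate([study1, study2])
print("all at once:   Beta(%.0f, %.0f)" % (1 + all_x.sum(), 1 + len(all_x) - all_x.sum()))
```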