Bayesian Methods for Data Analysis
ENAR Annual Meeting
Tampa, Florida – March 26, 2006

Course contents
• Introduction of Bayesian concepts using single-parameter models.
• Multiple-parameter models and hierarchical models.
• Computation: approximations to the posterior, rejection and importance sampling, and MCMC.
• Model checking, diagnostics, model fit.
• Linear hierarchical models: random effects and mixed models.
• Generalized linear models: logistic, multinomial, Poisson regression.
• Hierarchical models for spatially correlated data.
Course emphasis
• We focus on the implementation of Bayesian methods and interpretation of results.
• Little theory, but some is needed to understand methods.
• Lots of examples. Some are not directly drawn from biological problems, but still serve to illustrate methodology.
• Biggest idea to get across: inference by simulation.
• Software: R (or SPlus) early on, and WinBUGS for most of the examples after we discuss computational methods.
• Notes draw heavily on the book by Gelman et al., Bayesian Data Analysis, 2nd ed., and many of the figures are 'borrowed' directly from that book.
• All of the programs used to construct examples can be downloaded from www.public.iastate.edu/~alicia.
• On my website, go to Teaching and then to ENAR Short Course.
Single parameter models
• There is no real advantage to being a Bayesian in these simple models. We discuss them to introduce important concepts:
  – Priors: non-informative, conjugate, other informative
  – Computations
  – Summary of the posterior, presentation of results
• Binomial, Poisson, Normal models as examples
• Some "real" examples

Binomial model
• Of historical importance: Bayes derived his theorem and Laplace provided computations for the Binomial model.
• Before Bayes, the question was: given θ, what are the probabilities of the possible outcomes of y?
• Bayes asked: what is Pr(θ1 < θ < θ2 | y)?
• Using a uniform prior for θ, Bayes showed that
  Pr(θ1 < θ < θ2 | y) ∝ [ ∫ from θ1 to θ2 of θ^y (1 − θ)^(n−y) dθ ] / p(y),
  with p(y) = 1/(n + 1) uniform a priori.
• Laplace devised an approximation to the integral expression in the numerator.
• First application by Laplace: estimate the probability that there were more female births in Paris from 1745 to 1770.

Binomial model (cont'd)
• Consider n exchangeable Bernoulli trials y1, ..., yn.
• yi = 1 a "success", yi = 0 a "failure".
• Exchangeability: the summary is the number of successes in the n trials, y.
• For θ the probability of success, Y ∼ Bin(n, θ):
  p(y|θ) = (n choose y) θ^y (1 − θ)^(n−y).
• Prior on θ: uniform on [0,1] (for now):
  p(θ) ∝ 1.
• Posterior:
  p(θ|y) ∝ θ^y (1 − θ)^(n−y).
• As a function of θ, the posterior is proportional to a Beta(y + 1, n − y + 1) density.
• We could also do the calculations to derive the posterior:
  p(θ|y) = (n choose y) θ^y (1 − θ)^(n−y) / ∫ (n choose y) θ^y (1 − θ)^(n−y) dθ
  = (n + 1) [n!/(y!(n − y)!)] θ^y (1 − θ)^(n−y)
  = [(n + 1)!/(y!(n − y)!)] θ^y (1 − θ)^(n−y)
  = [Γ(n + 2)/(Γ(y + 1)Γ(n − y + 1))] θ^(y+1−1)(1 − θ)^(n−y+1−1)
  = Beta(y + 1, n − y + 1)

• Point estimation
  – Posterior mean E(θ|y) = (y + 1)/(n + 2)
  – Note the posterior mean is a compromise between the prior mean (1/2) and the sample proportion y/n.
  – Posterior mode: y/n
  – Posterior median: θ∗ such that Pr(θ ≤ θ∗|y) = 0.5.
  – Best point estimator minimizes the expected loss (more later).
  – Posterior variance
    Var(θ|y) = (y + 1)(n − y + 1) / [(n + 2)^2 (n + 3)].

Credible set and HPD set
• Interval estimation
  – A 95% credible set (or central posterior interval) is (a, b) if:
    ∫ from 0 to a of p(θ|y)dθ = 0.025  and  ∫ from 0 to b of p(θ|y)dθ = 0.975
  – A 100(1 − α)% highest posterior density (HPD) credible set is the subset C of Θ such that
    C = {θ ∈ Θ : p(θ|y) ≥ k(α)}
    where k(α) is the largest constant such that Pr(C|y) ≥ 1 − α.
  – For symmetric unimodal posteriors, central credible sets and highest posterior density credible sets coincide.
  – In other cases, HPD sets have the smallest size.
  – Interpretation (for either): the probability that θ is in the set is equal to 1 − α. Which is what we want to say!
• Inference by simulation:
  – Draw values from the posterior: easy for closed-form posteriors, can also do in other cases (later)
  – Monte Carlo estimates of point and interval estimates
  – Added MC error (due to "sampling")
  – Easy to get interval estimates and estimators for functions of parameters.
• Prediction:
  – Prior predictive distribution
    p(y) = ∫ from 0 to 1 of (n choose y) θ^y (1 − θ)^(n−y) dθ = 1/(n + 1)
  – A priori, all values of y are equally likely.
  – Posterior predictive distribution, to predict the outcome of a new trial ỹ given y successes in the previous n trials:
    Pr(ỹ = 1|y) = ∫ from 0 to 1 of Pr(ỹ = 1|y, θ) p(θ|y) dθ
                = ∫ from 0 to 1 of Pr(ỹ = 1|θ) p(θ|y) dθ
                = ∫ from 0 to 1 of θ p(θ|y) dθ = E(θ|y) = (y + 1)/(n + 2)

Binomial model: different priors
• How do we choose priors?
  – In a purely subjective manner (orthodox)
  – Using actual information (e.g., from literature or scientific knowledge)
  – Eliciting from experts
  – For mathematical convenience
  – To express ignorance
• Unless we have reliable prior information about θ, we prefer to let the data 'speak' for themselves.
• Asymptotic argument: as the sample size increases, the likelihood should dominate the posterior.
Conjugate prior
• Suppose that we choose a Beta prior for θ:
  p(θ|α, β) ∝ θ^(α−1) (1 − θ)^(β−1)
• Beta is the conjugate prior for the binomial model: the posterior is in the same form as the prior.
• The posterior is now
  p(θ|y) ∝ θ^(y+α−1)(1 − θ)^(n−y+β−1)
• The posterior is again proportional to a Beta:
  p(θ|y) = Beta(y + α, n − y + β).
• To choose the prior parameters, think as follows: observe α successes in α + β prior "trials". Prior "guess" for θ is α/(α + β).
• For now, α, β are considered fixed and known, but they can also get their own prior distribution (hierarchical model).
Conjugate priors
• Formal definition: F a class of sampling distributions and P a class of prior distributions. Then P is conjugate for F if p(θ) ∈ P and p(y|θ) ∈ F implies p(θ|y) ∈ P.
• If F is an exponential family then distributions in F have natural conjugate priors.
• A distribution in F has the form
  p(yi|θ) = f(yi) g(θ) exp(φ(θ)^T u(yi)).
• For an iid sequence, the likelihood function is
  p(y|θ) ∝ g(θ)^n exp(φ(θ)^T t(y)),
  for φ(θ) the natural parameter and t(y) a sufficient statistic.
• Consider the prior density
  p(θ) ∝ g(θ)^η exp(φ(θ)^T ν)
• The posterior is also in exponential form:
  p(θ|y) ∝ g(θ)^(n+η) exp(φ(θ)^T (t(y) + ν))
Conjugate prior for binomial proportion
• y is the # of successes in n exchangeable trials, so the sampling distribution is binomial with probability of success θ:
  p(y|θ) ∝ θ^y (1 − θ)^(n−y)
• Note that
  p(y|θ) ∝ θ^y (1 − θ)^n (1 − θ)^(−y) ∝ (1 − θ)^n exp{y[log θ − log(1 − θ)]}
• Written in exponential family form:
  p(y|θ) ∝ (1 − θ)^n exp{y log[θ/(1 − θ)]}
  where g(θ)^n = (1 − θ)^n, y is the sufficient statistic, and the logit log[θ/(1 − θ)] is the natural parameter.
• Consider a prior for θ: Beta(α, β):
  p(θ) ∝ θ^(α−1)(1 − θ)^(β−1)
• If we let ν = α − 1 and η = β + α − 2, we can write
  p(θ) ∝ θ^ν (1 − θ)^(η−ν)
  or
  p(θ) ∝ (1 − θ)^η exp{ν log[θ/(1 − θ)]}
• Then the posterior is in the same form as the prior:
  p(θ|y) ∝ (1 − θ)^(n+η) exp{(y + ν) log[θ/(1 − θ)]}
• Since p(y|θ) ∝ θ^y (1 − θ)^(n−y), the prior Beta(α, β) suggests that a priori we believe in approximately α successes in α + β trials.
• Our prior guess for the probability of success is α/(α + β).
• By varying (α + β) (with α/(α + β) fixed), we can incorporate more or less information into the prior: "prior sample size".
• Also note:
  E(θ|y) = (α + y)/(α + β + n)
  is always between the prior mean α/(α + β) and the MLE y/n.
• Posterior variance
  Var(θ|y) = (α + y)(β + n − y) / [(α + β + n)^2 (α + β + n + 1)]
           = E(θ|y)[1 − E(θ|y)] / (α + β + n + 1)
• As y and n − y get large:
  – E(θ|y) → y/n
  – var(θ|y) → (y/n)[1 − (y/n)]/n, which approaches zero at rate 1/n.
  – Prior parameters have diminishing influence in the posterior as n → ∞.
Placenta previa example
• Condition in which the placenta is implanted low in the uterus, preventing normal delivery.
• In a study in Germany, 437 out of 980 births with placenta previa were females, so the observed ratio is 0.446.
• Suppose that in the general population, the proportion of female births is 0.485.
• Can we say that the proportion of female babies among placenta previa births is lower than in the general population?
• If y = number of female births, θ is the probability of a female birth, and p(θ) is uniform, then
  p(θ|y) ∝ θ^y (1 − θ)^(n−y)
  and therefore
  θ|y ∼ Beta(y + 1, n − y + 1).
• Thus a posteriori, θ is Beta(438, 544) and
  E(θ|y) = 438/(438 + 544) = 0.446
  Var(θ|y) = (438 × 544) / [(438 + 544)^2 (438 + 544 + 1)]
• Check sensitivity to the choice of prior: try several Beta priors with increasingly more "information" about θ.
• Fix the prior mean at 0.485 and increase the "prior sample size" α + β.
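A small R sketch of the placenta previa calculation under the uniform prior (the Beta(438, 544) posterior given above); it is not part of the original course code, and the number of draws is arbitrary.

# theta | y ~ Beta(438, 544) for the placenta previa data (437/980 female births)
draws <- rbeta(5000, 438, 544)

mean(draws)                            # posterior mean, approximately 0.446
quantile(draws, c(0.025, 0.5, 0.975))  # central 95% credible set and median
mean(draws < 0.485)                    # posterior prob. that the rate is below the
                                       # general-population proportion 0.485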
Sensitivity to the prior (placenta previa example):

  Prior mean   α+β   Post. median   Post. 2.5th pctile   Post. 97.5th pctile
  0.500          2   0.446          0.415                0.477
  0.485          2   0.446          0.415                0.477
  0.485          5   0.446          0.415                0.477
  0.485         10   0.446          0.415                0.477
  0.485         20   0.447          0.416                0.478
  0.485        100   0.450          0.420                0.479
  0.485        200   0.453          0.424                0.481

Results are robust to the choice of prior, even a very informative prior. The prior mean is not in the posterior 95% credible set.

Conjugate prior for Poisson rate
• y ∈ {0, 1, ...} is a count, with rate θ > 0. The sampling distribution is Poisson:
  p(yi|θ) = θ^(yi) exp(−θ) / yi!
• For exchangeable (y1, ..., yn):
  p(y|θ) = Π from i=1 to n of [θ^(yi) exp(−θ) / yi!] = θ^(nȳ) exp(−nθ) / Π yi!
• Consider Gamma(α, β) as the prior for θ:
  p(θ) ∝ θ^(α−1) exp(−βθ)
• Then the posterior is also Gamma:
  p(θ|y) ∝ θ^(nȳ) exp(−nθ) θ^(α−1) exp(−βθ) ∝ θ^(nȳ+α−1) exp(−(n + β)θ)
• For η = nȳ + α and ν = n + β, the posterior is Gamma(η, ν).
• Note:
  – Prior mean of θ is α/β
  – Posterior mean of θ is
    E(θ|y) = (nȳ + α)/(n + β)
  – If the sample size n → ∞, then E(θ|y) approaches the MLE of θ.
  – If the sample size goes to zero, then E(θ|y) approaches the prior mean.

Mean of a normal distribution
• y ∼ N(θ, σ^2) with σ^2 known.
• For {y1, ..., yn} an iid sample, the likelihood is:
  p(y|θ) = Π_i [1/√(2πσ^2)] exp(−(1/(2σ^2))(yi − θ)^2)
• Viewed as a function of θ, the likelihood is the exponential of a quadratic in θ:
  p(y|θ) ∝ exp(−(1/(2σ^2)) Σ_i (θ^2 − 2yiθ + yi^2))
• The conjugate prior for θ must belong to the family of the form
  p(θ) = exp(Aθ^2 + Bθ + C)
  that can be parameterized as
  p(θ) ∝ exp(−(1/(2τ0^2))(θ − µ0)^2)
  and then p(θ) is N(µ0, τ0^2).
• (µ0, τ0^2) are hyperparameters. For now, consider them known.
• If p(θ) is conjugate, then the posterior p(θ|y) must also be normal with parameters (µn, τn^2).
• Recall that p(θ|y) ∝ p(θ)p(y|θ), so
  p(θ|y) ∝ exp(−(1/2)[ Σ_i (yi − θ)^2 / σ^2 + (θ − µ0)^2 / τ0^2 ])
• Expand the squares, collect terms in θ^2 and in θ:
  p(θ|y) ∝ exp(−(1/2)[ (Σ_i yi^2 − 2θ Σ_i yi + nθ^2)/σ^2 + (θ^2 − 2µ0θ + µ0^2)/τ0^2 ])
         ∝ exp(−(1/2)[ (nτ0^2 + σ^2)θ^2 − 2(nȳτ0^2 + µ0σ^2)θ ] / (σ^2 τ0^2))
         ∝ exp(−(1/2) [((σ^2/n) + τ0^2)/((σ^2/n)τ0^2)] [θ^2 − 2((ȳτ0^2 + µ0(σ^2/n))/((σ^2/n) + τ0^2))θ])
• Then p(θ|y) is normal with
  – Mean: µn = (ȳτ0^2 + µ0(σ^2/n)) / ((σ^2/n) + τ0^2)
  – Variance: τn^2 = ((σ^2/n)τ0^2) / ((σ^2/n) + τ0^2)

Posterior mean
• Note that
  µn = (ȳτ0^2 + µ0(σ^2/n)) / ((σ^2/n) + τ0^2)
     = [(n/σ^2)ȳ + (1/τ0^2)µ0] / [(n/σ^2) + (1/τ0^2)]
  (by dividing numerator and denominator by (σ^2/n)τ0^2)
• Also, note that
  µn = µ0 [(σ^2/n)/((σ^2/n) + τ0^2)] + ȳ [τ0^2/((σ^2/n) + τ0^2)]
• Posterior mean is a weighted average of the prior mean and the sample mean.
• Weights are given by the precisions n/σ^2 and 1/τ0^2.
• Add and subtract µ0τ0^2/((σ^2/n) + τ0^2) to see that
  µn = µ0 + (ȳ − µ0) [τ0^2/((σ^2/n) + τ0^2)]
• Posterior mean is the prior mean shrunken towards the observed value. The amount of shrinkage depends on the relative sizes of the precisions.
• As the data precision increases ((σ^2/n) → 0) because σ^2 → 0 or because n → ∞, µn → ȳ.

Posterior variance
• Recall that
  p(θ|y) ∝ exp(−(1/2) [((σ^2/n) + τ0^2)/((σ^2/n)τ0^2)] [θ^2 − 2((ȳτ0^2 + µ0(σ^2/n))/((σ^2/n) + τ0^2))θ])
• Then
  1/τn^2 = ((σ^2/n) + τ0^2) / ((σ^2/n)τ0^2) = n/σ^2 + 1/τ0^2
• Posterior precision = sum of prior and data precisions.

Posterior predictive distribution
• Want to predict the next observation ỹ.
• Recall p(ỹ|y) ∝ ∫ p(ỹ|θ)p(θ|y)dθ.
• We know that
  – Given θ, ỹ ∼ N(θ, σ^2)
  – θ|y ∼ N(µn, τn^2)
• Then
  p(ỹ|y) ∝ ∫ exp{−(1/2)(ỹ − θ)^2/σ^2} exp{−(1/2)(θ − µn)^2/τn^2} dθ
         ∝ ∫ exp{−(1/2)[(ỹ − θ)^2/σ^2 + (θ − µn)^2/τn^2]} dθ
• The integrand is the kernel of a bivariate normal, so (ỹ, θ) have a bivariate normal joint posterior.
• The marginal p(ỹ|y) must be normal.
• Need E(ỹ|y) and var(ỹ|y):
  E(ỹ|y) = E[E(ỹ|y, θ)|y] = E(θ|y) = µn,
  because E(ỹ|y, θ) = E(ỹ|θ) = θ
  var(ỹ|y) = E[var(ỹ|y, θ)|y] + var[E(ỹ|y, θ)|y] = E(σ^2|y) + var(θ|y) = σ^2 + τn^2
  because var(ỹ|y, θ) = var(ỹ|θ) = σ^2
• The variance includes the additional term τn^2 as a penalty for us not knowing the true value of θ.
• Recall
  µn = [(1/τ0^2)µ0 + (n/σ^2)ȳ] / [(1/τ0^2) + (n/σ^2)]
• In µn, the prior precision 1/τ0^2 and the data precision n/σ^2 are "equivalent". Then:
  – For n large, (ȳ, σ^2) determine the posterior.
  – For τ0^2 = σ^2/n, the prior has the same weight as adding one more observation with value µ0.
  – When τ0^2 → ∞ with n fixed, or when n → ∞ with τ0^2 fixed:
    p(θ|ȳ) → N(ȳ, σ^2/n)
  – Good approximation in practice when prior beliefs about θ are vague or when the sample size is large.
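A minimal R sketch of the normal-mean update with σ^2 known (not from the original notes); the prior values µ0 = 0, τ0^2 = 10 and the simulated data are made up to illustrate the weighted-average form of µn and the predictive variance σ^2 + τn^2.

# Normal data with known variance: posterior for theta and predictive for y~
set.seed(1)
sigma2 <- 4; mu0 <- 0; tau02 <- 10            # assumed known variance and prior
y <- rnorm(25, mean = 1.5, sd = sqrt(sigma2)) # made-up data
n <- length(y); ybar <- mean(y)

tau_n2 <- 1 / (1 / tau02 + n / sigma2)                  # posterior variance
mu_n   <- tau_n2 * (mu0 / tau02 + n * ybar / sigma2)    # posterior mean (weighted average)

theta  <- rnorm(10000, mu_n, sqrt(tau_n2))    # draws from p(theta | y)
ytilde <- rnorm(10000, theta, sqrt(sigma2))   # draws from p(y~ | y)
c(mu_n, tau_n2)
var(ytilde)                                   # close to sigma2 + tau_n2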
Normal variance
• Example of a scale model.
• Assume that y ∼ N(θ, σ^2), θ known.
• For iid y1, ..., yn:
  p(y|σ^2) ∝ (σ^2)^(−n/2) exp(−(1/(2σ^2)) Σ from i=1 to n of (yi − θ)^2)
           ∝ (σ^2)^(−n/2) exp(−nv/(2σ^2))
  for the sufficient statistic
  v = (1/n) Σ from i=1 to n of (yi − θ)^2
• The likelihood is in exponential family form with natural parameter φ(σ^2) = σ^(−2). Then the natural conjugate prior must be of the form
  p(σ^2) ∝ (σ^2)^(−η) exp(−βφ(σ^2))
• Consider an inverse Gamma prior:
  p(σ^2) ∝ (σ^2)^(−(α+1)) exp(−β/σ^2)
• For ease of interpretation, reparameterize as a scaled inverse χ^2 distribution, with ν0 d.f. and scale σ0^2:
  – Scale σ0^2 corresponds to the prior "guess" for σ^2.
  – Large ν0: lots of confidence in σ0^2 as a good value for σ^2.
• If σ^2 ∼ Inv-χ^2(ν0, σ0^2) then
  p(σ^2) ∝ (σ^2/σ0^2)^(−(ν0/2+1)) exp(−ν0σ0^2/(2σ^2))
• Corresponds to Inv-Gamma(ν0/2, ν0σ0^2/2).
• Prior mean is σ0^2 ν0/(ν0 − 2).
• Prior variance behaves like σ0^4/ν0: large ν0, large prior precision.
• The prior provides information equivalent to ν0 observations with average squared deviation equal to σ0^2.
• Posterior:
  p(σ^2|y) ∝ p(σ^2) p(y|σ^2)
           ∝ (σ^2)^(−n/2) exp(−nv/(2σ^2)) (σ^2/σ0^2)^(−(ν0/2+1)) exp(−ν0σ0^2/(2σ^2))
           ∝ (σ^2)^(−(ν1/2+1)) exp(−ν1σ1^2/(2σ^2))
  with ν1 = ν0 + n and
  σ1^2 = (nv + ν0σ0^2)/(n + ν0)
• p(σ^2|y) is also a scaled inverse χ^2.
• The posterior scale is a weighted average of the prior "guess" and the data estimate; weights are given by the prior and sample degrees of freedom.
• As n → ∞, σ1^2 → v.
Poisson model
• Appropriate for count data such as number of cases, number of accidents, etc. For a vector of n iid observations:
  p(y|θ) = Π_i [θ^(yi) e^(−θ) / yi!] = θ^(t(y)) e^(−nθ) / Π from i=1 to n of yi!,
  where θ is the rate, yi = 0, 1, ... and t(y) = Σ from i=1 to n of yi is the sufficient statistic for θ.
• We can write the model in exponential family form:
  p(y|θ) ∝ e^(−nθ) e^(t(y) log θ)
  where φ(θ) = log θ is the natural parameter.
• The natural conjugate prior must have the form
  p(θ) ∝ e^(−ηθ) e^(ν log θ)
  or p(θ) ∝ e^(−βθ) θ^(α−1), which looks like a Gamma(α, β).
• The posterior is Gamma(α + nȳ, β + n).
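A short R sketch of the Gamma-Poisson update (not from the original notes); the prior values α = 2, β = 1 and the simulated counts are only for illustration.

# Poisson rate with Gamma(alpha, beta) prior: posterior is Gamma(alpha + n*ybar, beta + n)
set.seed(2)
alpha <- 2; beta <- 1               # hypothetical prior "counts" and "exposure"
y <- rpois(30, lambda = 3.2)        # made-up counts
n <- length(y)

post_shape <- alpha + sum(y)
post_rate  <- beta + n
theta <- rgamma(10000, shape = post_shape, rate = post_rate)

post_shape / post_rate              # posterior mean (n*ybar + alpha)/(n + beta)
quantile(theta, c(0.025, 0.975))    # 95% credible interval for the rate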
Poisson model - Digression
• In the conjugate case, we can often derive p(y) using
  p(y) = p(θ)p(y|θ) / p(θ|y)
• In the case of the Gamma-Poisson model for a single observation:
  p(y) = Poisson(y|θ) Gamma(θ|α, β) / Gamma(θ|α + y, β + 1)
       = Γ(α + y) β^α / [Γ(α) y! (1 + β)^(α+y)]
       = (α + y − 1 choose y) [β/(β + 1)]^α [1/(β + 1)]^y,
  the density of a negative binomial with parameters (α, β).
• Thus we can interpret negative binomial distributions as arising from a mixture of Poissons with rate θ, where θ follows a Gamma distribution:
  p(y) = ∫ Poisson(y|θ) Gamma(θ|α, β) dθ
• The negative binomial is a robust alternative to the Poisson.
Poisson model (cont'd)
• Given the rate, observations are assumed to be exchangeable.
• Add an exposure to the model: observations are exchangeable within small exposure intervals.
• Examples:
  – In a city of size 500,000, define the rate of death from cancer per million people; then the exposure is 0.5.
  – For an intersection in Ames with traffic of one million vehicles per year, consider the number of traffic accidents per million vehicles per year. The exposure for the intersection is 1.
  – Exposure is typically known and reflects the fact that in each unit, the number of persons or cars or plants or animals that are 'at risk' is different.
• Now
  p(y|θ) ∝ θ^(Σ_i yi) exp(−θ Σ_i xi)
  for xi the exposure of unit i.
• With prior p(θ) = Gamma(α, β), the posterior is
  p(θ|y) = Gamma(α + Σ_i yi, β + Σ_i xi)
Non-informative prior distributions
• A non-informative prior distribution
  – Has minimal impact on the posterior
  – Lets the data "speak for themselves"
  – Also called vague, flat, diffuse
• So far, we have concentrated on the conjugate family.
• Conjugate priors can also be almost non-informative.
• If y ∼ N(θ, 1), the natural conjugate for θ is N(µ0, τ0^2), and the posterior is N(µ1, τ1^2), where
  µ1 = [µ0/τ0^2 + nȳ/σ^2] / [1/τ0^2 + n/σ^2],   τ1^2 = 1 / [1/τ0^2 + n/σ^2]
• For τ0 → ∞:
  – µ1 → ȳ
  – τ1^2 → σ^2/n
• The same result could have been obtained using p(θ) ∝ 1.
• The uniform prior is a natural non-informative prior for location parameters (see later).
• It is improper:
  ∫ p(θ)dθ = ∫ dθ = ∞
  yet it leads to a proper posterior for θ. This is not always the case.
• Poisson example: y1, ..., yn iid Poisson(θ), consider p(θ) ∝ θ^(−1/2).
  – ∫ from 0 to ∞ of θ^(−1/2)dθ = ∞, improper
  – Yet
    p(θ|y) ∝ θ^(Σ_i yi) e^(−nθ) θ^(−1/2) = θ^(Σ_i yi + 1/2 − 1) e^(−nθ),
    proportional to a Gamma(1/2 + Σ_i yi, n), proper.
• Uniform priors arise when we assign equal density to each θ ∈ Θ: no preferences for one value over another.
• But they are not invariant under one-to-one transformations.
• Example:
  – η = exp{θ}, so that θ = log{η} is the inverse transformation.
  – dθ/dη = 1/η is the Jacobian.
  – Then, if p(θ) is the prior for θ, p∗(η) = η^(−1) p(log η) is the corresponding prior for the transformation.
  – For p(θ) ∝ c, p∗(η) ∝ η^(−1), informative.
  – An informative prior is needed to arrive at the same answer in both parameterizations.
Jeffreys prior
• In 1961, Jeffreys proposed a method for finding non-informative priors that are invariant to one-to-one transformations.
• Jeffreys proposed
  p(θ) ∝ [I(θ)]^(1/2)
  where I(θ) is the expected Fisher information:
  I(θ) = −Eθ[ d^2/dθ^2 log p(y|θ) ]
• If θ is a vector, then I(θ) is a matrix with (i, j) element equal to
  −Eθ[ d^2/(dθi dθj) log p(y|θ) ]
  and
  p(θ) ∝ |I(θ)|^(1/2)
• Theorem: the Jeffreys prior is locally uniform and therefore non-informative.
Jeffreys prior for binomial proportion
• If y ∼ B(n, θ) then log p(y|θ) ∝ y log θ + (n − y) log(1 − θ) and
  d^2/dθ^2 log p(y|θ) = −yθ^(−2) − (n − y)(1 − θ)^(−2)
• Taking expectations:
  I(θ) = −Eθ[ d^2/dθ^2 log p(y|θ) ] = E(y)θ^(−2) + (n − E(y))(1 − θ)^(−2)
       = nθ θ^(−2) + (n − nθ)(1 − θ)^(−2) = n/[θ(1 − θ)].
• Then
  p(θ) ∝ [I(θ)]^(1/2) ∝ θ^(−1/2)(1 − θ)^(−1/2) ∝ Beta(1/2, 1/2).

Jeffreys prior for normal mean
• y1, ..., yn iid N(θ, σ^2), σ^2 known.
  p(y|θ) ∝ exp(−(1/(2σ^2)) Σ (y − θ)^2)
  log p(y|θ) ∝ −(1/(2σ^2)) Σ (y − θ)^2
  d^2/dθ^2 log p(y|θ) = −n/σ^2,
  constant with respect to θ.
• Then I(θ) is constant and p(θ) ∝ constant.
Jeffreys prior for normal variance
• y1, ..., yn iid N(θ, σ^2), θ known.
  p(y|σ) ∝ σ^(−n) exp(−(1/(2σ^2)) Σ (yi − θ)^2)
  log p(y|σ) ∝ −n log σ − (1/(2σ^2)) Σ (yi − θ)^2
  d^2/dσ^2 log p(y|σ) = n/σ^2 − (3/σ^4) Σ (yi − θ)^2
• Take the negative of the expectation, so that:
  I(σ) = −n/σ^2 + (3/σ^4) nσ^2 = 2n/σ^2
  and therefore the Jeffreys prior is p(σ) ∝ σ^(−1).

Invariance property of Jeffreys'
• The Jeffreys' prior is invariant to one-to-one transformations φ = h(θ).
• Then:
  p(φ) = p(θ) |dh(θ)/dθ|^(−1) = p(θ) |dθ/dφ|
• Under invariance, choose p(θ) such that p(φ) constructed as above would match what would be obtained directly.
• Consider p(θ) = [I(θ)]^(1/2) and evaluate I(φ) at θ = h^(−1)(φ):
  I(φ) = −E[ d^2/dφ^2 log p(y|φ) ]
       = −E[ d^2/dθ^2 log p(y|θ = h^(−1)(φ)) (dθ/dφ)^2 ]
       = I(θ) (dθ/dφ)^2
• Then [I(φ)]^(1/2) = [I(θ)]^(1/2) |dθ/dφ|, as required.
Multiparameter models - Intro
• Most realistic problems require models with more than one parameter.
• Typically, we are interested in one or a few of those parameters.
• Classical approach for estimation in multiparameter models:
  1. Maximize a joint likelihood: can get nasty when there are many parameters
  2. Proceed in steps
• Bayesian approach: base inference on the marginal posterior distributions of the parameters of interest.
• Parameters that are not of interest are called nuisance parameters.

Nuisance parameters
• Consider a model with two parameters (θ1, θ2) (e.g., a normal distribution with unknown mean and variance).
• We are interested in θ1, so θ2 is a nuisance parameter.
• The marginal posterior distribution of interest is p(θ1|y).
• It can be obtained directly from the joint posterior density
  p(θ1, θ2|y) ∝ p(θ1, θ2)p(y|θ1, θ2)
  by integrating with respect to θ2:
  p(θ1|y) = ∫ p(θ1, θ2|y)dθ2

Nuisance parameters (cont'd)
• Note too that
  p(θ1|y) = ∫ p(θ1|θ2, y)p(θ2|y)dθ2
• The marginal of θ1 is a mixture of conditionals on θ2, or a weighted average of the conditional evaluated at different values of θ2. Weights are given by the marginal p(θ2|y).
• Important difference with frequentists!
• By averaging the conditional p(θ1|θ2, y) over possible values of θ2, we explicitly recognize our uncertainty about θ2.
• Two extreme cases:
  1. Almost certainty about the value of θ2: if the prior and sample are very informative about θ2, the marginal p(θ2|y) will be concentrated around some value θ̂2. In that case,
     p(θ1|y) ≈ p(θ1|θ̂2, y)
  2. Lots of uncertainty about θ2: the marginal p(θ2|y) will assign relatively high probability to a wide range of values of θ2. A point estimate θ̂2 is no longer "reliable". Important to average over the range of values of θ2.
Nuisance parameters (cont'd)
• In most cases, the integral is not computed explicitly.
• Instead, use a two-step simulation approach:
  1. Marginal simulation step: draw a value θ2^(k) of θ2 from p(θ2|y), for k = 1, 2, ...
  2. Conditional simulation step: for each θ2^(k), draw a value of θ1 from the conditional density p(θ1|θ2^(k), y)
• Effective approach when the marginal and the conditional are of standard form.
• More sophisticated simulation approaches later.

Example: Normal model
• yi iid from N(µ, σ^2), both unknown.
• Non-informative prior for (µ, σ^2), assuming prior independence:
  p(µ, σ^2) ∝ 1 × σ^(−2)
• Joint posterior:
  p(µ, σ^2|y) ∝ p(µ, σ^2)p(y|µ, σ^2)
              ∝ σ^(−n−2) exp(−(1/(2σ^2)) Σ from i=1 to n of (yi − µ)^2)
Example: Normal model (cont'd)
• Note that
  Σ from i=1 to n of (yi − µ)^2 = Σ_i (yi^2 − 2µyi + µ^2)
                                 = Σ_i yi^2 − 2µnȳ + nµ^2
                                 = Σ_i (yi − ȳ)^2 + n(ȳ − µ)^2
  by adding and subtracting nȳ^2.
• Let
  s^2 = (1/(n − 1)) Σ_i (yi − ȳ)^2
• Then we can write the posterior for (µ, σ^2) as
  p(µ, σ^2|y) ∝ σ^(−n−2) exp(−(1/(2σ^2))[(n − 1)s^2 + n(ȳ − µ)^2])
• Sufficient statistics are (ȳ, s^2).
Example: Conditional posterior p(µ|σ^2, y)
• Conditional on σ^2:
  p(µ|σ^2, y) = N(ȳ, σ^2/n)
• We know this from an earlier chapter (posterior of the normal mean when the variance is known).
• We can also see this by noting that, viewed as a function of µ only:
  p(µ|σ^2, y) ∝ exp(−(n/(2σ^2))(ȳ − µ)^2)
  which we recognize as the kernel of a N(ȳ, σ^2/n).

Example: Marginal posterior p(σ^2|y)
• To get p(σ^2|y) we need to integrate p(µ, σ^2|y) over µ:
  p(σ^2|y) ∝ ∫ σ^(−n−2) exp(−(1/(2σ^2))[(n − 1)s^2 + n(ȳ − µ)^2]) dµ
           ∝ σ^(−n−2) exp(−(n − 1)s^2/(2σ^2)) ∫ exp(−(n/(2σ^2))(ȳ − µ)^2) dµ
           ∝ σ^(−n−2) exp(−(n − 1)s^2/(2σ^2)) √(2πσ^2/n)
• Then
  p(σ^2|y) ∝ (σ^2)^(−(n+1)/2) exp(−(n − 1)s^2/(2σ^2))
  which is proportional to a scaled-inverse χ^2 distribution with degrees of freedom (n − 1) and scale s^2.
• Recall the classical result: conditional on σ^2, the distribution of the scaled sufficient statistic (n − 1)s^2/σ^2 is χ^2 with n − 1 d.f.
Normal model: analytical derivation
• For the normal model, we can derive the marginal p(µ|y) analytically:
  p(µ|y) = ∫ p(µ, σ^2|y) dσ^2
         ∝ ∫ (1/σ^2)^(n/2+1) exp(−(1/(2σ^2))[(n − 1)s^2 + n(ȳ − µ)^2]) dσ^2
• Use the transformation
  z = A/(2σ^2),
  where A = (n − 1)s^2 + n(ȳ − µ)^2. Then
  dσ^2/dz = −A/(2z^2)
  and
  p(µ|y) ∝ ∫ from 0 to ∞ of (z/A)^(n/2+1) (A/z^2) exp(−z) dz
         ∝ A^(−n/2) ∫ z^(n/2−1) exp(−z) dz
• The integrand is an unnormalized Gamma(n/2, 1), so the integral is constant with respect to µ.
• Recall that A = (n − 1)s^2 + n(ȳ − µ)^2. Then
  p(µ|y) ∝ A^(−n/2)
         ∝ [(n − 1)s^2 + n(ȳ − µ)^2]^(−n/2)
         ∝ [1 + n(µ − ȳ)^2/((n − 1)s^2)]^(−n/2),
  the kernel of a t-distribution with n − 1 degrees of freedom, centered at ȳ and with scale parameter s^2/n.

Normal model: analytical derivation (cont'd)
• We saw that, for the non-informative prior p(µ, σ^2) ∝ σ^(−2), the posterior distribution of µ is a non-standard t. Then
  p((µ − ȳ)/(s/√n) | y) = t with n − 1 d.f.,
  the standard t distribution.
• Notice the similarity to the classical result: for iid normal observations from N(µ, σ^2), given (µ, σ^2), the pivotal quantity
  (ȳ − µ)/(s/√n) | µ, σ^2 ∼ t with n − 1 d.f.
• A pivot is a non-trivial function of the data and the parameter(s) θ whose distribution, given θ, is independent of θ. The property is deduced from the sampling distribution, as above.
• Baby example of the pivot property: y ∼ N(θ, 1). The pivot is x = y − θ. Given θ, x ∼ N(0, 1), independent of θ.
Posterior predictive for future obs
• The posterior predictive distribution for a future observation ỹ is a mixture:
  p(ỹ|y) = ∫∫ p(ỹ|y, σ^2, µ) p(µ, σ^2|y) dµ dσ^2
• The first factor in the integrand is just the normal model, and it does not depend on y at all.
• To simulate ỹ from the posterior predictive distribution, do the following (see the R sketch below):
  1. Draw σ^2 from Inv-χ^2(n − 1, s^2)
  2. Draw µ from N(ȳ, σ^2/n)
  3. Draw ỹ from N(µ, σ^2)
• We can derive the posterior predictive distribution in analytic form. Note that
  p(ỹ|σ^2, y) = ∫ p(ỹ|σ^2, µ) p(µ|σ^2, y) dµ
              ∝ ∫ exp(−(1/(2σ^2))(ỹ − µ)^2) exp(−(n/(2σ^2))(µ − ȳ)^2) dµ
• After some algebra:
  p(ỹ|σ^2, y) = N(ȳ, (1 + 1/n)σ^2)
• Using the same approach as in deriving the posterior distribution of µ, we find that
  p(ỹ|y) ∝ t with n − 1 d.f., location ȳ and scale (1 + 1/n)^(1/2) s
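A small R sketch of the three simulation steps above (not from the original notes); the data are simulated only to make the code self-contained.

# Draws from p(mu, sigma2 | y) and p(y~ | y) under p(mu, sigma2) propto 1/sigma2
set.seed(3)
y <- rnorm(40, mean = 10, sd = 2)     # made-up data
n <- length(y); ybar <- mean(y); s2 <- var(y)

m <- 10000
sigma2 <- (n - 1) * s2 / rchisq(m, df = n - 1)   # sigma2 ~ scaled Inv-chi2(n-1, s2)
mu     <- rnorm(m, ybar, sqrt(sigma2 / n))       # mu | sigma2 ~ N(ybar, sigma2/n)
ytilde <- rnorm(m, mu, sqrt(sigma2))             # y~ | mu, sigma2 ~ N(mu, sigma2)

quantile(mu, c(0.025, 0.975))         # compare to ybar +/- t_{n-1} s/sqrt(n)
quantile(ytilde, c(0.025, 0.975))     # posterior predictive interval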
Normal data and conjugate prior
• Recall that using a non-informative prior, we found that
  p(µ|σ^2, y) ∝ N(ȳ, σ^2/n)
  p(σ^2|y) ∝ Inv-χ^2(n − 1, s^2)
• Then, factoring p(µ, σ^2) = p(µ|σ^2)p(σ^2), the conjugate prior for σ^2 would also be a scaled inverse χ^2 and for µ (conditional on σ^2) would be normal. Consider
  µ|σ^2 ∼ N(µ0, σ^2/κ0)
  σ^2 ∼ Inv-χ^2(ν0, σ0^2)
• Jointly:
  p(µ, σ^2) ∝ σ^(−1) (σ^2)^(−(ν0/2+1)) exp(−(1/(2σ^2))[ν0σ0^2 + κ0(µ0 − µ)^2])
• Note that µ and σ^2 are not independent a priori.
• Posterior density for (µ, σ^2):
  – Multiply the likelihood by the N-Inv-χ^2(µ0, σ^2/κ0; ν0, σ0^2) prior
  – Expand the two squares in µ
  – Complete the square by adding and subtracting a term depending on ȳ and µ0
  Then p(µ, σ^2|y) ∝ N-Inv-χ^2(µn, σn^2/κn; νn, σn^2) where
  µn = [κ0/(κ0 + n)] µ0 + [n/(κ0 + n)] ȳ
  κn = κ0 + n,  νn = ν0 + n
  νnσn^2 = ν0σ0^2 + (n − 1)s^2 + [κ0n/(κ0 + n)](ȳ − µ0)^2
Normal data and conjugate prior (cont'd)
• Interpretation of the posterior parameters:
  – µn is a weighted average, as earlier.
  – νnσn^2: sum of the sample sum of squares, prior sum of squares, and additional uncertainty due to the difference between the sample mean and the prior mean.
• Conditional posterior of µ: as before, µ|σ^2, y ∼ N(µn, σ^2/κn).
• Marginal posterior of σ^2: as before, σ^2|y ∼ Inv-χ^2(νn, σn^2).
• Marginal posterior of µ: as before, µ|y ∼ t with νn d.f., location µn and scale σn^2/κn.
• Two ways to sample from the joint posterior distribution:
  1. Sample µ from the t and σ^2 from the Inv-χ^2
  2. Sample σ^2 from the Inv-χ^2 and, given σ^2, sample µ from the normal
Semi-conjugate prior for normal model
• Consider setting independent priors for µ and σ^2:
  µ ∼ N(µ0, τ0^2)
  σ^2 ∼ Inv-χ^2(ν0, σ0^2)
• Example: mean weight of students; visual inspection shows weights between 100 and 200 pounds.
• p(µ, σ^2) = p(µ)p(σ^2) is not conjugate and does not lead to a posterior of known form.
• Can factor as earlier:
  µ|σ^2, y ∼ N(µn, τn^2)
  with
  µn = [(1/τ0^2)µ0 + (n/σ^2)ȳ] / [(1/τ0^2) + (n/σ^2)],   τn^2 = 1 / [(1/τ0^2) + (n/σ^2)]
• NOTE: even though µ and σ^2 are independent a priori, they are not independent in the posterior.
Semi-conjugate prior and p(σ^2|y)
• The marginal posterior p(σ^2|y) can be obtained by integrating the joint p(µ, σ^2|y) w.r.t. µ:
  p(σ^2|y) ∝ ∫ N(µ|µ0, τ0^2) Inv-χ^2(σ^2|ν0, σ0^2) Π N(yi|µ, σ^2) dµ
• The integration can be performed by noting that the integrand, as a function of µ, is proportional to a normal density.
• Keeping track of the normalizing constants that depend on σ^2 is messy. Easier to note that:
  p(σ^2|y) = p(µ, σ^2|y) / p(µ|σ^2, y)
  so that
  p(σ^2|y) ∝ N(µ|µ0, τ0^2) Inv-χ^2(σ^2|ν0, σ0^2) Π N(yi|µ, σ^2) / N(µ|µn, τn^2)
  which is still a mess.
• Factors that depend on µ must cancel, and therefore we know that p(σ^2|y) does not depend on µ, in the sense that we can evaluate p(σ^2|y) for a grid of values of σ^2 and any arbitrary value of µ.
• Choose µ = µn and then the denominator simplifies to something proportional to τn^(−1). Then
  p(σ^2|y) ∝ τn N(µ|µ0, τ0^2) Inv-χ^2(σ^2|ν0, σ0^2) Π N(yi|µ, σ^2)
  that can be evaluated for a grid of values of σ^2.

Implementing inverse CDF method
• We often need to draw values from distributions that do not have a "nice" standard form. An example is expression 3.14 in the text, for p(σ^2|y) in the normal model with semi-conjugate priors for µ and σ^2.
• One approach is the inverse cdf method (see earlier notes), implemented numerically.
• We assume that we can evaluate the pdf (even in unnormalized form) for a grid of values of θ in the appropriate range.
Implementing inverse CDF method (cont'd)
• To generate 1000 draws from a distribution p(θ|y) with parameter θ, do:
  1. Evaluate p(θ|y) for a grid of m values of θ. Let the evaluations be denoted by (p1, p2, ..., pm).
  2. Compute the CDF over the grid: (p1, p1 + p2, ..., Σ from i=1 to m−1 of pi, 1) and denote those (f1, f2, ..., fm).
  3. Generate M uniform random variables u in [0, 1].
  4. If u ∈ [fi−1, fi], draw θi.

Inverse CDF method - Example

  θ     Prob(θ)   CDF (F)
  1     0.03      0.03
  2     0.04      0.07
  3     0.08      0.15
  4     0.15      0.30
  5     0.20      0.50
  6     0.30      0.80
  7     0.10      0.90
  8     0.05      0.95
  9     0.03      0.98
  10    0.02      1.00

• Consider a parameter θ with values in (1, 10), with probability distribution and cumulative distribution functions as above. Under this distribution, values of θ between 4 and 7 are more likely.
Inverse CDF method - Example (cont'd)
• To implement the method do (see the R sketch after this example):
  1. Draw u ∼ U(0, 1). For example, in the first draw u = 0.45.
  2. For u ∈ (fi−1, fi), draw θi.
  3. For u = 0.45 ∈ (0.3, 0.5), draw θ = 5.
  4. Alternative approach:
     (a) Flip another coin v ∼ U(0, 1).
     (b) Pick θi−1 if v ≤ 0.5 and pick θi if v > 0.5.
  5. In the example, for u = 0.45, we would choose θ = 5, or under the alternative approach choose θ = 4 or θ = 5 with probability 1/2 each.
  6. Repeat many times, M.
• If M is very large, we expect that about 65% of our draws will be θ = 4, 5, or 6, about 2% will be 10, etc.

Example: Football point spreads
• Data di, i = 1, ..., 672 are differences between the predicted outcome of a football game and the actual score.
• Normal model:
  di ∼ N(µ, σ^2)
• Priors:
  µ ∼ N(0, 2^2)
  p(σ^2) ∝ σ^(−2)
• To draw values of (µ, σ^2) from the posterior, do:
  – First draw σ^2 from p(σ^2|y) using the inverse cdf method
  – Then draw µ from p(µ|σ^2, y), normal.
• To draw σ^2, evaluate p(σ^2|y) on the grid [150, 250].
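A small R sketch of the grid-based inverse cdf method for the 10-point example above (not from the original notes); findInterval is used in place of an explicit loop over the CDF bins.

# Inverse cdf method on a discrete grid (the 10-point example above)
theta <- 1:10
p     <- c(0.03, 0.04, 0.08, 0.15, 0.20, 0.30, 0.10, 0.05, 0.03, 0.02)
f     <- cumsum(p)                      # grid CDF: 0.03, 0.07, ..., 1.00

M <- 10000
u <- runif(M)
draws <- theta[findInterval(u, f) + 1]  # u in (f[i-1], f[i]] -> theta[i]

table(draws) / M                        # should be close to p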
Example: Rounded measurements (prob. 3.5)
• Sometimes, measurements are rounded, and we do not observe the "true" values.
• yi are the observed rounded values, and zi are the unobserved true measurements.
• If zi ∼ N(µ, σ^2), then
  y|µ, σ^2 ∼ Π_i [Φ((yi + 0.5 − µ)/σ) − Φ((yi − 0.5 − µ)/σ)]
• Prior: p(µ, σ^2) ∝ σ^(−2)
• We are interested in posterior inference about (µ, σ^2) and in differences between the rounded and exact analyses.

[Figure: joint posterior contours of (mu, sigma2) under the exact normal analysis and under the rounded analysis.]
Multinomial model - Intro
• Generalization of the binomial model, for the case where observations can have more than two possible values.
• Example: In a survey, respondents may: Strongly Agree, Agree, Disagree, Strongly Disagree, or have No Opinion when presented with a statement such as "Instructor for Stat 544 is spectacular".

Multinomial model - Sampling dist'n
• Sampling distribution: multinomial with parameters (θ1, ..., θk), the probabilities associated with each of the k possible outcomes.
• Formally:
  – y = k × 1 vector of counts of the numbers of observations in each outcome
  – θj: probability of the jth outcome
  – Σ from j=1 to k of θj = 1 and Σ from j=1 to k of yj = n
• Sampling distribution:
  p(y|θ) ∝ Π from j=1 to k of θj^(yj)

Multinomial model - Prior
• The conjugate prior for (θ1, ..., θk) is the Dirichlet distribution, a multivariate generalization of the Beta:
  p(θ|α) ∝ Π from j=1 to k of θj^(αj−1)
  with αj > 0 ∀j, α0 = Σ from j=1 to k of αj, θj > 0 ∀j, and Σ from j=1 to k of θj = 1.
• The αj can be thought of as "prior counts" associated with the jth outcome, so that α0 would then be a "prior sample size".
• For the Dirichlet:
  E(θj) = αj/α0
  Var(θj) = αj(α0 − αj) / [α0^2(α0 + 1)]
  Cov(θi, θj) = −αiαj / [α0^2(α0 + 1)]
Dirichlet distribution
• The Dirichlet distribution is the conjugate prior for the parameters of the multinomial model.
• If (θ1, θ2, ..., θK) ∼ D(α1, α2, ..., αK) then
  p(θ1, ..., θK) = [Γ(α0) / Π_j Γ(αj)] Π_j θj^(αj−1),
  where θj ≥ 0, Σ_j θj = 1, αj ≥ 0 and α0 = Σ_j αj.
• Because of the sum-to-one restriction, the pdf of the K-dimensional random vector can be written in terms of K − 1 random variables.
• Some properties:
  E(θj) = αj/α0
  Var(θj) = αj(α0 − αj) / [α0^2(α0 + 1)]
  Cov(θi, θj) = −αiαj / [α0^2(α0 + 1)]
  Note that the θj are negatively correlated.
• The marginal distributions can be shown to be Beta.
• Proof for K = 3:
  p(θ1, θ2) = [Γ(α1 + α2 + α3) / (Γ(α1)Γ(α2)Γ(α3))] θ1^(α1−1) θ2^(α2−1) (1 − θ1 − θ2)^(α3−1)
• To get the marginal for θ1, integrate p(θ1, θ2) with respect to θ2, with limits of integration 0 and 1 − θ1. Call the normalizing constant Q and use the change of variable
  v = θ2/(1 − θ1)
  so that θ2 = v(1 − θ1) and dθ2 = (1 − θ1)dv.
• The marginal is then:
  p(θ1) = Q ∫ from 0 to 1 of θ1^(α1−1) (v(1 − θ1))^(α2−1) (1 − θ1 − v(1 − θ1))^(α3−1) (1 − θ1) dv
        = Q θ1^(α1−1) (1 − θ1)^(α2+α3−1) ∫ from 0 to 1 of v^(α2−1)(1 − v)^(α3−1) dv
        = Q θ1^(α1−1) (1 − θ1)^(α2+α3−1) Γ(α2)Γ(α3)/Γ(α2 + α3)
• Then,
  θ1 ∼ Beta(α1, α2 + α3).
• This generalizes to what is called the clumping property of the Dirichlet. In general, if (θ1, ..., θK) ∼ D(α1, ..., αK):
  p(θi) = Beta(αi, α0 − αi)
Multinomial model - Posterior
• The posterior must have Dirichlet form also:
  p(θ|y) ∝ Π from j=1 to k of θj^(αj−1) θj^(yj) ∝ Π from j=1 to k of θj^(yj+αj−1)
• αn = Σ_j (αj + yj) = α0 + n is the "total" number of "observations".
• For αj = 1 ∀j: uniform non-informative prior on all vectors of θj such that Σ_j θj = 1.
• For αj = 0 ∀j: the uniform prior is on the log(θj), with the same restriction.
• In either case, the posterior is proper if yj ≥ 1 ∀j.
• Posterior mean (a point estimate) of θj is:
  E(θj|y) = (αj + yj)/(α0 + n) = "#" of obs. of the jth outcome / "total" # of obs.
Multinomial model - samples from p(θ|y)
• Gamma method: two steps for each θj:
  1. Draw x1, x2, ..., xk independently, with xj ∼ Gamma with shape αj + yj and any common scale δ
  2. Set θj = xj / Σ from i=1 to k of xi
• Beta method: relies on properties of the Dirichlet:
  – Marginal p(θj|y) = Beta(αj + yj, αn − (αj + yj))
  – Conditional p(θj|θ−j, y) = Dirichlet

Multinomial model - Example
• Pre-election polling in 1988:
  – n = 1,447 adults in the US
  – y1 = 727 supported G. Bush (the elder)
  – y2 = 583 supported M. Dukakis
  – y3 = 137 supported other or had no opinion
• If no other information is available, we can assume that observations are exchangeable given θ. (However, if information on party affiliation is available, it is unreasonable to assume exchangeability.)
• Polling is done under a complex survey design. Ignore this for now and assume simple random sampling. Then, (y1, y2, y3) ∼ Mult(θ1, θ2, θ3).
• Of interest: θ1 − θ2, the difference in population support for the two major candidates.
Multinomial model - Example (cont'd)
• Non-informative prior, for now: set α1 = α2 = α3 = 1.
• The posterior distribution is Dirichlet(728, 584, 138). Then:
  E(θ1|y) = 0.502, E(θ2|y) = 0.403
• Other quantities are obtained by simulation (see next, and the R sketch below).
• To derive p(θ1 − θ2|y) do:
  1. Draw m values of (θ1, θ2, θ3) from the posterior
  2. For each draw, compute θ1 − θ2
• Results and program: see the quantiles of the posterior distributions of θ1 and θ2, the credible set for θ1 − θ2, and Prob(θ1 > θ2|y).

[Figures: posterior histograms of θ1 (proportion voting for Bush, the elder), θ2 (proportion voting for Dukakis), and of the difference θ1 − θ2 between the two candidates.]
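A small R sketch of the Gamma method for the Dirichlet(728, 584, 138) posterior (not the original course program); the number of draws is arbitrary.

# Dirichlet(728, 584, 138) posterior via the Gamma method
set.seed(4)
a <- c(728, 584, 138)                   # alpha_j + y_j with alpha_j = 1
m <- 10000
x <- matrix(rgamma(m * 3, shape = rep(a, each = m)), nrow = m)
theta <- x / rowSums(x)                 # each row is one draw of (theta1, theta2, theta3)

d <- theta[, 1] - theta[, 2]
quantile(d, c(0.025, 0.5, 0.975))       # credible set for theta1 - theta2
mean(theta[, 1] > theta[, 2])           # Prob(theta1 > theta2 | y)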
Example: Bioassay experiment
• Does mortality in lab animals increase with increased dose of some drug?
• Experiment: 20 animals randomly allocated to four doses ("treatments"), and the number of dead animals within each dose recorded.

  Dose xi    ni   No. of deaths yi
  -0.863      5   0
  -0.296      5   1
  -0.053      5   3
   0.727      5   5

• Animals are exchangeable within dose.
• The θi are not exchangeable because the probability of death depends on dose.

Bioassay example (cont'd)
• Model
  yi ∼ Bin(ni, θi)
• One way to model is with a linear relationship:
  θi = α + βxi.
  Not a good idea because θi ∈ (0, 1).
• Transform θi:
  logit(θi) = log[θi/(1 − θi)]
• Since logit(θi) ∈ (−∞, +∞), we can use a linear model:
  E[logit(θi)] = α + βxi.   (Logistic regression)
• Likelihood: if yi ∼ Bin(ni, θi), then
  p(yi|ni, θi) ∝ (θi)^(yi) (1 − θi)^(ni−yi).
• But recall that
  log[θi/(1 − θi)] = α + βxi,
  so that
  θi = exp(α + βxi) / [1 + exp(α + βxi)].
• Then
  p(yi|α, β, ni, xi) ∝ [exp(α + βxi)/(1 + exp(α + βxi))]^(yi) [1 − exp(α + βxi)/(1 + exp(α + βxi))]^(ni−yi)
• Prior: p(α, β) ∝ 1.
• Posterior:
  p(α, β|y, n, x) ∝ Π from i=1 to k of p(yi|α, β, ni, xi).
• We first evaluate p(α, β|y, n, x) on the grid
  (α, β) ∈ [−5, 10] × [−10, 40]
  and use the inverse cdf method to sample from the posterior.
Bioassay example (cont'd)
• The posterior is evaluated on a 200 × 200 grid of values of (α, β), with one margin indexed by α1, ..., α200 and the other by β1, ..., β200.
• Entry (i, j) in the grid is p(α = αi, β = βj | y).
• Summing the grid entries over the values of β gives the (unnormalized) marginal p(αj|y), because
  p(αj|y) = ∫ p(αj, β|y) dβ = ∫ p(αj|β, y) p(β|y) dβ ≈ Σ_i p(αj, βi|y)
• Similarly, summing the grid entries over the values of α gives p(βi|y).
Bioassay example (cont'd)
• We sample α from p(α|y), and then β from p(β|α, y) (or the other way around); see the R sketch below.
• To sample α from p(α|y):
  1. Obtain the empirical p(α = αj|y), j = 1, ..., 200, by summing over the β.
  2. Use the inverse cdf method to sample from p(α|y).
• To sample β from p(β|α, y):
  1. Given a draw α∗, choose the appropriate column in the grid.
  2. Use the inverse cdf method on p(β|α = α∗, y).

Bioassay example (cont'd)
• LD50 is the dose at which the probability of death is 0.5, or
  E[yi/ni] = θi = 0.5
  so that
  0.5 = logit^(−1)(α + βxi)
  logit(0.5) = α + βxi
  0 = α + βxi
  Then LD50 is computed as xi = −α/β.
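A sketch in R of the grid evaluation and grid sampling described above (not the original program); sample() is used in place of an explicit inverse-cdf loop, which is equivalent for a discrete grid.

# Grid evaluation and grid sampling for (alpha, beta) in the bioassay example
x <- c(-0.863, -0.296, -0.053, 0.727)
n <- c(5, 5, 5, 5)
y <- c(0, 1, 3, 5)

alpha <- seq(-5, 10, length = 200)
beta  <- seq(-10, 40, length = 200)

logpost <- function(a, b) {
  theta <- plogis(a + b * x)                    # inverse logit
  sum(dbinom(y, n, theta, log = TRUE))          # flat prior: posterior = likelihood
}
lp <- outer(alpha, beta, Vectorize(logpost))
post <- exp(lp - max(lp))                       # unnormalized joint on the grid

m <- 2000
draws <- matrix(NA, m, 2)
p_alpha <- rowSums(post)                        # unnormalized marginal of alpha
for (s in 1:m) {
  i <- sample(200, 1, prob = p_alpha)           # draw alpha from its marginal
  j <- sample(200, 1, prob = post[i, ])         # draw beta | alpha from that row
  draws[s, ] <- c(alpha[i], beta[j])
}
ld50 <- -draws[, 1] / draws[, 2]                # LD50 = -alpha/beta for each draw
quantile(ld50, c(0.025, 0.5, 0.975))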
[Figure: box plots of the posterior distributions of the probability of death θ1, ..., θ4 for each dose.]

[Figure: posterior distributions of the regression parameters, based on 2000 draws each. Alpha: posterior mean = 1.297, with 95% credible set [-0.365, 3.755]. Beta: posterior mean = 11.52, with 95% credible set [3.572, 25.79].]
[Figure: posterior distribution of the LD50 in the log scale, based on 2000 draws. Posterior mean: -0.1064; 95% credible set: [-0.268, 0.1141].]

Multivariate normal model - Intro
• Generalization of the univariate normal model, where now observations are d × 1 vectors of measurements.
• For the ith sample unit: yi = (yi1, yi2, ..., yid)', and yi ∼ N(µ, Σ), with (µ, Σ) d × 1 and d × d respectively.
• For a sample of n iid observations:
  p(y|µ, Σ) ∝ |Σ|^(−n/2) exp[−(1/2) Σ_i (yi − µ)'Σ^(−1)(yi − µ)]
            ∝ |Σ|^(−n/2) exp[−(1/2) tr(Σ^(−1) S)]
  with
  S = Σ_i (yi − µ)(yi − µ)'
• S is the d × d matrix of sample squared deviations and cross-deviations from µ.

Multivariate normal model - Known Σ
• We want the posterior distribution of the mean, p(µ|y), under the assumption that the variance matrix is known.
• The conjugate prior for µ is p(µ|µ0, Λ0) = N(µ0, Λ0), as in the univariate case.
• Posterior for µ:
  p(µ|y, Σ) ∝ exp{−(1/2)[(µ − µ0)'Λ0^(−1)(µ − µ0) + Σ_i (yi − µ)'Σ^(−1)(yi − µ)]}
  which is a quadratic function of µ. Completing the square in the exponent:
  p(µ|y, Σ) = N(µn, Λn)
Multivariate normal model - Known Σ (cont'd)
with
  µn = (Λ0^(−1) + nΣ^(−1))^(−1) (Λ0^(−1)µ0 + nΣ^(−1)ȳ)
  Λn = (Λ0^(−1) + nΣ^(−1))^(−1)
• Posterior precision is equal to the sum of prior and sample precisions:
  Λn^(−1) = Λ0^(−1) + nΣ^(−1)
• Posterior mean is a weighted average of prior and sample means:
  µn = (Λ0^(−1) + nΣ^(−1))^(−1) (Λ0^(−1)µ0 + nΣ^(−1)ȳ)
• From the usual properties of the multivariate normal distribution, we can derive the marginal posterior distribution of any subvector of µ or the conditional posterior distribution of µ(1) given µ(2).
Multivariate normal model - Known Σ (cont'd)
• Let µ' = (µ(1)', µ(2)') and correspondingly, let
  µn = [µn(1); µn(2)]
  and
  Λn = [Λn(1,1)  Λn(1,2); Λn(2,1)  Λn(2,2)]
• The marginal of the subvector µ(1) is p(µ(1)|Σ, y) = N(µn(1), Λn(1,1)).
• The conditional of the subvector µ(1) given µ(2) is
  p(µ(1)|µ(2), Σ, y) = N(µn(1) + β^(1|2)(µ(2) − µn(2)), Λ^(1|2))
  where
  β^(1|2) = Λn(1,2) (Λn(2,2))^(−1)
  Λ^(1|2) = Λn(1,1) − Λn(1,2) (Λn(2,2))^(−1) Λn(2,1)
• We recognize β^(1|2) as the regression coefficient in a regression of µ(1) on µ(2).
Multivariate normal - Posterior predictive
• We seek p(ỹ|y), and note that
  p(ỹ|y) = ∫ p(ỹ|µ, y) p(µ|y) dµ
  p(ỹ, µ|y) = N(ỹ; µ, Σ) N(µ; µn, Λn)
• The exponential in p(ỹ, µ|y) is a quadratic form in (ỹ, µ), so (ỹ, µ) have a normal joint posterior distribution.
• The marginal p(ỹ|y) is also normal, with mean and variance:
  E(ỹ|y) = E(E(ỹ|µ, y)|y) = E(µ|y) = µn
  var(ỹ|y) = E(var(ỹ|µ, y)|y) + var(E(ỹ|µ, y)|y) = E(Σ|y) + var(µ|y) = Σ + Λn.

Multivariate normal - Sampling from p(ỹ|y)
• With Σ known, to draw a value ỹ from p(ỹ|y):
  1. Draw µ from p(µ|y) = N(µn, Λn)
  2. Draw ỹ from p(ỹ|µ, y) = N(µ, Σ)
• Alternatively (better), draw ỹ directly from
  p(ỹ|y) = N(µn, Σ + Λn)
Sampling from a multivariate normal
• Want to sample y, d × 1, from N(µ, Σ).
• Two common approaches (see the R sketch below for the first):
  – Using the Cholesky decomposition of Σ:
    1. Get A, a lower triangular matrix such that AA' = Σ
    2. Draw (z1, z2, ..., zd) iid N(0, 1) and let z = (z1, ..., zd)'
    3. Compute y = µ + Az
  – Using sequential conditional sampling, based on the fact that all conditionals in a multivariate normal are also normal:
    1. Draw y1 from N(µ1, Σ(11))
    2. Then draw y2|y1 from N(µ2|1, Σ2|1)
    3. Etc.
• For example, if d = 2:
  1. Draw y1 from N(µ(1), σ^2(1))
  2. Draw y2 from
     N(µ(2) + [σ(12)/σ^2(1)](y1 − µ(1)), σ^2(2) − (σ(12))^2/σ^2(1))
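A minimal R sketch of the Cholesky approach (not from the original notes); the mean and covariance values are made up.

# Drawing from N(mu, Sigma) via the Cholesky decomposition
rmvn_chol <- function(m, mu, Sigma) {
  d <- length(mu)
  A <- t(chol(Sigma))                  # lower triangular, A %*% t(A) = Sigma
  z <- matrix(rnorm(m * d), nrow = d)  # d x m matrix of iid N(0,1) draws
  t(mu + A %*% z)                      # each row is one draw
}

Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)   # made-up covariance
draws <- rmvn_chol(5000, c(1, -1), Sigma)
colMeans(draws)
cov(draws)                                 # close to Sigma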
Non-informative prior for µ
• A non-informative prior for µ is obtained by letting the prior precision Λ0^(−1) go to zero.
• With the uniform prior, the posterior for µ is proportional to the likelihood.
• The posterior is proper only if n > d; otherwise, S is not of full column rank.
• If n > d,
  p(µ|Σ, y) = N(ȳ, Σ/n)

Multivariate normal - unknown µ, Σ
• The conjugate family of priors for (µ, Σ) is the normal-inverse Wishart family.
• The inverse Wishart is the multivariate generalization of the inverse χ^2 distribution.
• If Σ|ν0, Λ0 ∼ Inv-Wishart with ν0 d.f. and scale Λ0^(−1), then
  p(Σ|ν0, Λ0) ∝ |Σ|^(−(ν0+d+1)/2) exp(−(1/2) tr(Λ0Σ^(−1)))
  for Σ positive definite and symmetric.
• For the Inv-Wishart with ν0 degrees of freedom and scale Λ0: E(Σ) = (ν0 − d − 1)^(−1) Λ0.
Multivariate normal - unknown µ, Σ (cont'd)
• Then the conjugate prior is
  Σ ∼ Inv-Wishart with ν0 d.f. and scale Λ0^(−1)
  µ|Σ ∼ N(µ0, Σ/κ0)
  which corresponds to
  p(µ, Σ) ∝ |Σ|^(−((ν0+d)/2+1)) exp(−(1/2) tr(Λ0Σ^(−1)) − (κ0/2)(µ − µ0)'Σ^(−1)(µ − µ0))
• The posterior must also be in the Normal-Inv-Wishart form.
• Results from the univariate normal generalize directly:
  – Σ|y ∼ Inv-Wishart with νn d.f. and scale Λn^(−1)
  – µ|Σ, y ∼ N(µn, Σ/κn)
  – µ|y ∼ Mult-t with νn − d + 1 d.f., location µn and scale Λn/(κn(νn − d + 1))
  – ỹ|y ∼ Mult-t
• Here:
  µn = [κ0/(κ0 + n)] µ0 + [n/(κ0 + n)] ȳ
  κn = κ0 + n
  νn = ν0 + n
  Λn = Λ0 + S + [nκ0/(κ0 + n)] (ȳ − µ0)(ȳ − µ0)'
  S = Σ_i (yi − ȳ)(yi − ȳ)'
Multivariate normal - sampling µ, Σ
• To sample from p(µ, Σ|y), do: (1) sample Σ from p(Σ|y); (2) sample µ from p(µ|Σ, y).
• To sample from p(ỹ|y), use the drawn (µ, Σ) and obtain a draw ỹ from N(µ, Σ).
• Sampling from a Wishart with ν d.f. and scale S (see the R sketch below):
  1. Draw ν independent d × 1 vectors α1, α2, ..., αν from a N(0, S)
  2. Let Q = Σ from i=1 to ν of αi αi'
• The method works when ν > d. If Q ∼ Wishart, then Q^(−1) = Σ ∼ Inv-Wishart.
• We have already seen how to sample from a multivariate normal given the mean vector and covariance matrix.
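A small R sketch of the Wishart sampler described above (not the original program); it reuses the hypothetical rmvn_chol idea by forming N(0, S) draws from the Cholesky factor of S.

# Sampling a Wishart(nu, S) matrix by summing outer products of N(0, S) draws
# (works for nu > d)
rwishart <- function(nu, S) {
  d <- ncol(S)
  A <- t(chol(S))                      # lower triangular factor of S
  Q <- matrix(0, d, d)
  for (i in 1:nu) {
    a <- A %*% rnorm(d)                # one draw from N(0, S)
    Q <- Q + a %*% t(a)
  }
  Q
}

S <- diag(c(1, 2))
Q <- rwishart(50, S)
solve(Q)        # if Q ~ Wishart(S) then Q^(-1) ~ Inv-Wishart(S^(-1))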
Multivariate normal - Jeffreys prior
• If we start from the conjugate normal-inv-Wishart prior and let:
  κ0 → 0
  ν0 → −1
  |Λ0| → 0
  then the resulting prior is the Jeffreys prior for (µ, Σ):
  p(µ, Σ) ∝ |Σ|^(−(d+1)/2)

Example: bullet lead concentrations
• In the US, four major manufacturers produce all bullets fired. One of them is Cascade.
• A sample of 200 round-nosed .38 caliber bullets with lead tips was obtained from Cascade.
• The concentration of five elements (antimony, copper, arsenic, bismuth, and silver) in the lead alloy was obtained. Data for Cascade are stored in federal.data on the course's web site.
• Assuming that the 200 5 × 1 observation vectors for Cascade can be represented by a multivariate normal distribution (perhaps after transformation), we are interested in the posterior distribution of the mean vector and of functions of the covariance parameters.
Bullet lead example - Model and prior
• Particular quantities of interest for inference include:
  – The mean trace element concentrations (µ1, ..., µ5)
  – The correlations between trace element concentrations (ρ12, ..., ρ45)
  – The largest eigenvalue of the covariance matrix
• In the data file, columns correspond to antimony, copper, arsenic, bismuth, and silver, in that order.
• For this example, the antimony concentration was divided by 100.
• We concentrate on the correlations that involve copper (four of them).
• Sampling distribution:
  y|µ, Σ ∼ N(µ, Σ)
• We use the conjugate Normal-Inv-Wishart family of priors, and choose the parameters for p(Σ|ν0, Λ0) first:
  Λ0,ii = (100, 5000, 15000, 1000, 100)
  Λ0,ij = 0 for all i ≠ j
  ν0 = 7
• We set Λ0 to be diagonal a priori: we have some information about the variances of the five element concentrations, but no information about their correlations.
• For p(µ|Σ, µ0, κ0):
  µ0 = (200, 200, 200, 100, 50)
  κ0 = 10
• Low values for ν0 and κ0 suggest little confidence in the prior guesses µ0, Λ0.
• We are interested in the posterior distributions of the five means, the eigenvalue ratio, and the four correlation coefficients.
• Sampling from the posterior distribution: we follow the usual approach:
  1. Draw Σ from Inv-Wishart with νn d.f. and scale Λn^(−1)
  2. Draw µ from N(µn, Σ/κn)
  3. For each draw, compute:
     (a) the ratio between λ1 (largest eigenvalue of Σ) and λ2 (second largest eigenvalue of Σ)
     (b) the four correlations ρ(copper, j) = σ(copper, j)/(σ(copper) σ(j)), for j ∈ {antimony, arsenic, bismuth, silver}.
[Figure: scatter plot matrix of the posterior draws of the mean concentrations of the five elements.]
Results from R program
# antimony
c(mean(sample.mu[,1]),sqrt(var(sample.mu[,1])))
[1] 265.237302
1.218594
quantile(sample.mu[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
2.5%
5.0%
50.0%
95.0%
97.5%
262.875 263.3408 265.1798 267.403 267.7005
# copper
c(mean(sample.mu[,2]),sqrt(var(sample.mu[,2])))
[1] 259.36335
5.20864
quantile(sample.mu[,2],probs=c(0.025,0.05,0.5,0.95,0.975))
2.5%
5.0%
50.0%
95.0%
97.5%
249.6674 251.1407 259.5157 268.0606 269.3144
# arsenic
c(mean(sample.mu[,3]),sqrt(var(sample.mu[,3])))
[1] 231.196891
8.390751
quantile(sample.mu[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%      5.0%     50.0%     95.0%     97.5%
 213.5079  216.7256  231.6443  243.5567  247.1686

# bismuth
c(mean(sample.mu[,4]),sqrt(var(sample.mu[,4])))
[1] 127.248741   1.553327
quantile(sample.mu[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%      5.0%     50.0%     95.0%     97.5%
 124.3257  124.6953  127.2468  129.7041   130.759

# silver
c(mean(sample.mu[,5]),sqrt(var(sample.mu[,5])))
[1] 38.2072916  0.7918199
quantile(sample.mu[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%      5.0%     50.0%     95.0%     97.5%
 36.70916  36.93737   38.1971  39.54205  39.73169

[Figure: Posterior of λ1/λ2, the ratio of the two largest eigenvalues of Σ]
Summary statistics of posterior dist. of ratio

c(mean(ratio.l),sqrt(var(ratio.l)))
[1] 5.7183875 0.8522507
quantile(ratio.l,probs=c(0.025,0.05,0.5,0.95,0.975))
     2.5%      5.0%     50.0%     95.0%     97.5%
 4.336424  4.455404  5.606359  7.203735  7.619699

Summary statistics for correlations of copper with other elements

# j = antimony
c(mean(sample.rho[,1]),sqrt(var(sample.rho[,1])))
[1] 0.50752063 0.05340247
quantile(sample.rho[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
      2.5%       5.0%      50.0%      95.0%      97.5%
 0.4014274  0.4191201  0.5076544  0.5901489  0.6047564

# j = arsenic
c(mean(sample.rho[,3]),sqrt(var(sample.rho[,3])))
[1] -0.56623609 0.04896403
quantile(sample.rho[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
      2.5%       5.0%      50.0%      95.0%      97.5%
-0.6537461 -0.6448817  -0.570833 -0.4857224  -0.465808

# j = bismuth
c(mean(sample.rho[,4]),sqrt(var(sample.rho[,4])))
[1] 0.63909311 0.04087149
quantile(sample.rho[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
      2.5%       5.0%      50.0%      95.0%      97.5%
 0.5560137  0.5685715  0.6429459  0.7013519  0.7135556

# j = silver
c(mean(sample.rho[,5]),sqrt(var(sample.rho[,5])))
[1] 0.03642082 0.07232010
quantile(sample.rho[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
       2.5%        5.0%       50.0%       95.0%       97.5%
-0.09464575  -0.0831816  0.03370379    0.164939   0.1765523
S-Plus code: Cascade example

We need to define two functions, one to generate random vectors from a multivariate
normal distribution and the other to generate random matrices from a Wishart
distribution.

Note that a function generating random matrices from an inverse-Wishart distribution is
not needed: if W ~ Wishart(S) then W^(-1) ~ Inv-Wishart(S^(-1)).

# A function that generates random observations from a
# Multivariate Normal distribution: rmnorm(a,b)
# the parameters of the function are
#   a = a column vector of dimension kx1
#   b = a positive definite kxk matrix
rmnorm <- function(a,b) {
  k <- nrow(b)
  zz <- t(chol(b)) %*% matrix(rnorm(k),nrow=k) + a
  zz
}

# A function that generates random observations from a
# Wishart distribution: rwishart(a,b)
# the parameters of the function are
#   a = degrees of freedom (must be > k)
#   b = a positive definite kxk matrix
rwishart <- function(a,b) {
  k <- ncol(b)
  m <- matrix(0,nrow=k,ncol=1)
  cc <- matrix(0,nrow=a,ncol=k)
  for (i in 1:a) { cc[i,] <- rmnorm(m,b) }
  w <- t(cc) %*% cc
  w
}

# Read the data
y <- scan(file="cascade.data")
y <- matrix(y,ncol=5,byrow=T)
y[,1] <- y[,1]/100
means <- rep(0,5)
for (i in 1:5) { means[i] <- mean(y[,i]) }
means <- matrix(means,ncol=1,byrow=T)
n <- nrow(y)

# Assign values to the prior parameters
v0 <- 7
Delta0 <- diag(c(100,5000,15000,1000,100))
k0 <- 10
mu0 <- matrix(c(200,200,200,100,50),ncol=1,byrow=T)

# Calculate the values of the parameters of the posterior
mu.n <- (k0*mu0 + n*means)/(k0 + n)
v.n <- v0 + n
k.n <- k0 + n
ones <- matrix(1,nrow=n,ncol=1)
S <- t(y - ones %*% t(means)) %*% (y - ones %*% t(means))
Delta.n <- Delta0 + S + (k0*n/(k0+n))*(means - mu0) %*% t(means - mu0)

# Draw Sigma and mu from their posteriors
samplesize <- 200
sample.l <- matrix(0,nrow=samplesize,ncol=5)    # stores the eigenvalues of Sigma
sample.rho <- matrix(0,nrow=samplesize,ncol=5)  # stores correlations with copper
sample.mu <- matrix(0,nrow=samplesize,ncol=5)   # stores draws of mu
for (j in 1:samplesize) {
  # Sigma
  SS <- solve(Delta.n)
  # The following makes sure that SS is symmetric
  for (pp in 1:5) { for (jj in 1:5) {
    if (pp < jj) { SS[pp,jj] <- SS[jj,pp] } } }
  Sigma <- solve(rwishart(v.n,SS))
  # The following makes sure that Sigma is symmetric
  for (pp in 1:5) { for (jj in 1:5) {
    if (pp < jj) { Sigma[pp,jj] <- Sigma[jj,pp] } } }
  # Eigenvalues of Sigma
  sample.l[j,] <- eigen(Sigma)$values
  # Correlation coefficients with copper (column 2)
  for (pp in 1:5) {
    sample.rho[j,pp] <- Sigma[pp,2]/sqrt(Sigma[pp,pp]*Sigma[2,2]) }
  # mu
  sample.mu[j,] <- rmnorm(mu.n,Sigma/k.n)
}

# Graphics and summary statistics
sink("cascade.output")
# Ratio between the largest eigenvalue of Sigma and the second largest
ratio.l <- sample.l[,1]/sample.l[,2]
# Histogram of the draws of the ratio
postscript("cascade_lambda.eps",height=5,width=6)
hist(ratio.l,nclass=30,axes=F,xlab="eigenvalue1/eigenvalue2",xlim=c(3,9))
axis(1)
dev.off()
# Summary statistics of the ratio of the eigenvalues
c(mean(ratio.l),sqrt(var(ratio.l)))
quantile(ratio.l,probs=c(0.025,0.05,0.5,0.95,0.975))

# Correlations with copper
postscript("cascade_corr.eps",height=8,width=8)
par(mfrow=c(2,2))
hist(sample.rho[,1],nclass=30,axes=F,xlab="corr(antimony,copper)",xlim=c(0.3,0.7))
axis(1)
hist(sample.rho[,3],nclass=30,axes=F,xlab="corr(arsenic,copper)")
axis(1)
hist(sample.rho[,4],nclass=30,axes=F,xlab="corr(bismuth,copper)",xlim=c(0.5,0.8))
axis(1)
hist(sample.rho[,5],nclass=25,axes=F,xlab="corr(silver,copper)",xlim=c(-0.2,0.3))
axis(1)
dev.off()
# Summary statistics of the dsn. of correlation of copper with
# antimony
c(mean(sample.rho[,1]),sqrt(var(sample.rho[,1])))
quantile(sample.rho[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
# arsenic
c(mean(sample.rho[,3]),sqrt(var(sample.rho[,3])))
quantile(sample.rho[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
# bismuth
c(mean(sample.rho[,4]),sqrt(var(sample.rho[,4])))
quantile(sample.rho[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
# silver
c(mean(sample.rho[,5]),sqrt(var(sample.rho[,5])))
quantile(sample.rho[,5],probs=c(0.025,0.05,0.5,0.95,0.975))

# Means
mulabels <- c("Antimony","Copper","Arsenic","Bismuth","Silver")
postscript("cascade_means.eps",height=8,width=8)
pairs(sample.mu,labels=mulabels)
dev.off()
# Summary statistics of the dsn. of the mean of
# antimony
c(mean(sample.mu[,1]),sqrt(var(sample.mu[,1])))
quantile(sample.mu[,1],probs=c(0.025,0.05,0.5,0.95,0.975))
# copper
c(mean(sample.mu[,2]),sqrt(var(sample.mu[,2])))
quantile(sample.mu[,2],probs=c(0.025,0.05,0.5,0.95,0.975))
# arsenic
c(mean(sample.mu[,3]),sqrt(var(sample.mu[,3])))
quantile(sample.mu[,3],probs=c(0.025,0.05,0.5,0.95,0.975))
# bismuth
c(mean(sample.mu[,4]),sqrt(var(sample.mu[,4])))
quantile(sample.mu[,4],probs=c(0.025,0.05,0.5,0.95,0.975))
# silver
c(mean(sample.mu[,5]),sqrt(var(sample.mu[,5])))
quantile(sample.mu[,5],probs=c(0.025,0.05,0.5,0.95,0.975))
q()
Advanced Computation

Strategy for computation

• Why do we need advanced computational methods? Except for the simpler cases,
  computation is not possible with available methods, for example:
  – Logistic regression with random effects
  – Normal-normal model with unknown sampling variances σj²
  – Poisson-lognormal hierarchical model for counts.

• Three broad approaches: approximations based on posterior modes, simulation from
  posterior distributions, and Markov chain simulation.

• If possible, work in the log-posterior scale.

• Factor the posterior distribution: p(γ, φ|y) = p(γ|φ, y)p(φ|y)
  – Reduces to a lower-dimensional problem
  – May make it possible to sample on a grid
  – Helps to identify the parameters most influenced by the prior

• Re-parametrizing sometimes helps:
  – Create parameters with easier interpretation
  – Permit a normal approximation (e.g., log of a variance, log of a Poisson rate, or
    logit of a probability)

Normal approximation to the posterior

• It is often reasonable to approximate a complicated posterior distribution using a
  normal (or a mixture of normals) approximation.

• Practical advice: computations are often easier with log posteriors than with
  posteriors.

• The approximation may be a good starting point for more sophisticated methods.

• Notation:
  – p(θ|y): joint posterior of interest (target distribution)
  – q(θ|y): un-normalized density, typically p(θ)p(y|θ)
  – θ = (γ, φ), with φ typically lower dimensional than γ.

• Computational strategy:
  1. Find the joint posterior mode or the mode of the marginal posterior distributions
     (the better strategy, if possible)
  2. Fit normal approximations at the mode (or use a mixture of normals if the posterior
     is multimodal)
Finding posterior modes

• To find the mode of a density, we maximize the function with respect to the
  parameter(s): an optimization problem.

• Modes are not interesting per se, but provide the first step for analytically
  approximating a density.

• Modes are easier to find than means: no integration, and we can work with the
  un-normalized density.

• Many algorithms exist to find modes. The most popular include Newton-Raphson and
  Expectation-Maximization (EM).

• Multi-modal posteriors pose a problem: the only way to find multiple modes is to run
  the mode-finding algorithm several times, starting from different locations in the
  parameter space.

• Whenever possible, aim for the modes of marginal and conditional posteriors. If
  θ = (γ, φ) and p(θ|y) = p(γ|φ, y)p(φ|y), then:
  1. Find the mode φ̂
  2. Then find γ̂ by maximizing p(γ|φ = φ̂, y)
Taylor approximation to p(θ|y)

• The second-order expansion of log p(θ|y) around the mode θ̂ is

      log p(θ|y) ≈ log p(θ̂|y) + (1/2) (θ − θ̂)' [ d²/dθ² log p(θ|y) ]_{θ=θ̂} (θ − θ̂).

• The linear term in the expansion vanishes because the log posterior has a zero
  derivative at the mode.

• Then, for large n and θ̂ in the interior of the parameter space,

      p(θ|y) ≈ N(θ̂, [I(θ̂)]⁻¹),   with   I(θ) = − d²/dθ² log p(θ|y),

  the observed information matrix.

• Viewing the approximation to log p(θ|y) as a function of θ, we see that:
  1. log p(θ̂|y) is constant, and
  2. (1/2) (θ − θ̂)' [ d²/dθ² log p(θ|y) ]_{θ=θ̂} (θ − θ̂) is the logarithm of a normal
     density kernel.
Normal and normal-mixture approximation

• For one mode, we saw that

      pn-approx(θ|y) = N(θ̂, Vθ),   with Vθ = [−L''(θ̂)]⁻¹ or [I(θ̂)]⁻¹.

• L''(θ̂) is called the curvature of the log posterior at the mode.

• Suppose now that p(θ|y) has K modes. The approximation to p(θ|y) is then a mixture of
  normals:

      pn-approx(θ|y) = Σk ωk N(θ̂k, Vθk).

• The ωk reflect the relative mass associated with each mode; it can be shown that
  ωk = q(θ̂k|y) |Vθk|^{1/2}.

• The mixture-normal approximation is then

      pn-approx(θ|y) ∝ Σk q(θ̂k|y) exp{ −(1/2)(θ − θ̂k)' Vθk⁻¹ (θ − θ̂k) }.

• A more robust approximation can be obtained by using a t-kernel instead of the normal
  kernel:

      pt-approx(θ|y) ∝ Σk q(θ̂k|y) [ ν + (θ − θ̂k)' Vθk⁻¹ (θ − θ̂k) ]^{−(d+ν)/2},

  with d the dimension of θ and ν relatively low. For most problems, ν = 4 works well.

Sampling from a mixture approximation

• To sample from the normal-mixture approximation:
  1. First choose one of the K normal components, using the relative probability masses
     ωk as multinomial probabilities.
  2. Given a component, draw a value θ from either the normal or the t density.

• Reasons not to use the normal approximation:
  – When the mode of a parameter is near the edge of the parameter space (e.g., τ in the
    SAT example).
  – When even a transformation of the parameter does not make the normal approximation
    reasonable.
  – We can do better than the normal approximation using more advanced methods.

Simulation from posterior - Rejection sampling

• The idea is to draw values from p(θ|y), perhaps by making use of an instrumental or
  auxiliary distribution g(θ|y) from which it is easier to sample.

• The target density p(θ|y) need only be known up to the normalizing constant.

• Let q(θ|y) = p(θ)p(y; θ) be the un-normalized posterior distribution, so that

      p(θ|y) = q(θ|y) / ∫ p(θ)p(y; θ) dθ.

• Simpler notation:
  1. Target density: f(x)
  2. Instrumental density: g(x)
  3. Constant M such that f(x) ≤ M g(x)

• The following algorithm produces a variable Y that is distributed according to f:
  1. Generate X ∼ g and U ∼ U[0,1]
  2. Set Y = X if U ≤ f(X)/(M g(X))
  3. Reject the draw otherwise.

• Proof: the distribution function of Y is

      P(Y ≤ y) = P( X ≤ y | U ≤ f(X)/(M g(X)) )
               = P( X ≤ y, U ≤ f(X)/(M g(X)) ) / P( U ≤ f(X)/(M g(X)) )
               = [ ∫_{-∞}^{y} ∫_0^{f(x)/(M g(x))} du g(x) dx ] / [ ∫_{-∞}^{∞} ∫_0^{f(x)/(M g(x))} du g(x) dx ]
               = [ M⁻¹ ∫_{-∞}^{y} f(x) dx ] / [ M⁻¹ ∫_{-∞}^{∞} f(x) dx ].

  Since the last expression equals ∫_{-∞}^{y} f(x) dx, we have proven the result.

• In the Bayesian context, q(θ|y) (the un-normalized posterior) plays the role of f(x)
  above.

• When both f and g are normalized densities:
  1. The probability of accepting a draw is 1/M, and the expected number of draws until
     one is accepted is M.
  2. For each f there will be many possible instrumental densities g1, g2, .... Choose
     the g that requires the smallest bound M.
  3. M is necessarily larger than 1, and approaches its minimum value 1 when g closely
     imitates f.

• In general, g needs to have thicker tails than f for f/g to remain bounded for all x.

• One cannot use a normal g to generate values from a Cauchy f; the opposite works,
  however.

• Rejection sampling can be used within other algorithms, such as the Gibbs sampler
  (see later).
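As a concrete illustration (not part of the original notes), here is a minimal R sketch
of the rejection algorithm above. The target f is an un-normalized Beta(3, 8) kernel, the
instrumental density g is Uniform(0, 1), and the bound M is computed at the target's
mode; all of these choices are arbitrary toy assumptions for the example.

```r
# Rejection sampling sketch: un-normalized target f, instrumental density g,
# bound M with f(x) <= M g(x) for all x in (0,1).
f <- function(x) x^2 * (1 - x)^7          # un-normalized Beta(3,8) kernel
g <- function(x) dunif(x)                 # instrumental density
xmode <- 2/9                              # mode of f on (0,1)
M <- f(xmode) / g(xmode)                  # smallest valid bound for this pair

rejection.sample <- function(n) {
  out <- numeric(n)
  i <- 0
  while (i < n) {
    x <- runif(1)                         # draw X ~ g
    u <- runif(1)                         # draw U ~ U[0,1]
    if (u <= f(x) / (M * g(x))) {         # accept with probability f/(M g)
      i <- i + 1
      out[i] <- x
    }
  }
  out
}

draws <- rejection.sample(2000)
c(mean(draws), 3/11)                      # compare with the true Beta(3,8) mean
```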
Rejection sampling - Getting M

• Finding the bound M can be a problem, but consider the following implementation
  approach, recalling that q(θ|y) = p(θ)p(y; θ):
  – Let g(θ) = p(θ) and draw a value θ* from the prior.
  – Draw U ∼ U(0, 1).
  – Let M = p(y; θ̂), where θ̂ is the mode of p(y; θ).
  – Accept the draw θ* if

        u ≤ q(θ|y)/(M p(θ)) = p(θ)p(y; θ)/(p(y; θ̂)p(θ)) = p(y; θ)/p(y; θ̂).

• Those θ from the prior that are likely under the likelihood are kept in the posterior
  sample.

Importance sampling

• Also known as SIR (Sampling Importance Resampling), the method is no more than a
  weighted bootstrap.

• Suppose we have a sample of draws (θ1, ..., θn) from the proposal distribution g(θ).
  Can we 'convert it' to a sample from q(θ|y)?

• For each θi, compute

      φi = q(θi|y)/g(θi)   and   wi = φi / Σj φj.

• Draw θ* from the discrete distribution over {θ1, ..., θn} with weight wi on θi, without
  replacement.
• The sample of re-sampled θ's is approximately a sample from q(θ|y).

• The size of the re-sample can be as large as desired.

• The more g resembles q, the smaller the sample size that is needed for the re-sample to
  approximate well the target distribution p(θ|y).

• A consistent estimator of the normalizing constant is n⁻¹ Σi φi.

• Proof: suppose that θ is univariate (for convenience). Then:

      Pr(θ* ≤ a) = Σ_{i=1}^{n} wi 1_{(-∞,a]}(θi)
                 = [ n⁻¹ Σi φi 1_{(-∞,a]}(θi) ] / [ n⁻¹ Σi φi ]
                 → E_g[ (q(θ|y)/g(θ)) 1_{(-∞,a]}(θ) ] / E_g[ q(θ|y)/g(θ) ]
                 = ∫_{-∞}^{a} q(θ|y) dθ / ∫_{-∞}^{∞} q(θ|y) dθ
                 = ∫_{-∞}^{a} p(θ|y) dθ.
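A minimal R sketch of the weighted bootstrap (SIR) just described, not part of the
original notes. The target is known only up to a constant (here the kernel of a N(0, 1)
density) and the proposal is a t with 3 degrees of freedom, chosen because it has heavier
tails than the target; both are illustrative assumptions.

```r
# SIR (weighted bootstrap) sketch: convert proposal draws into approximate
# draws from an un-normalized target q.
q <- function(theta) exp(-theta^2 / 2)     # un-normalized N(0,1) kernel
n.draws <- 10000
theta <- rt(n.draws, df = 3)               # sample from the proposal g
phi <- q(theta) / dt(theta, df = 3)        # un-normalized importance ratios
w <- phi / sum(phi)                        # normalized weights
# Re-sample without replacement with probability proportional to the weights
resample <- sample(theta, size = 2000, replace = FALSE, prob = w)
c(mean(resample), sd(resample))            # should be close to (0, 1)
```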
Importance sampling in a different context

• Suppose that we wish to estimate E[h(θ)] for θ ∼ p(θ|y) (e.g., a posterior mean).

• If N values θi can be drawn from p(θ|y), we can compute the Monte Carlo estimate

      Ê[h(θ)] = (1/N) Σi h(θi).

• Sampling from p(θ|y) may be difficult. But note that

      E[h(θ)|y] = ∫ h(θ) [ p(θ|y)/g(θ) ] g(θ) dθ.

• So generate values from g(θ), and estimate

      Ê[h(θ)] = (1/N) Σi h(θi) p(θi|y)/g(θi).

• The w(θi) = p(θi|y)/g(θi) are the same importance weights from earlier.

• This will not work if the tails of g are short relative to the tails of p.
Importance sampling (cont'd)

• Importance sampling is most often used to improve the normal approximation to the
  posterior.

• Example: suppose that you have a normal (or t) approximation to the posterior and use
  it to draw a sample that you hope approximates a sample from p(θ|y).

• Importance sampling can be used to improve that sample:
  1. Obtain a large sample of size L: (θ1, ..., θL) from the approximation g(θ).
  2. Construct importance weights w(θl) = q(θl|y)/g(θl).
  3. Sample k < L values from (θ1, ..., θL) with probability proportional to the weights,
     without replacement.

• Why without replacement? If the weights are approximately constant, we can re-sample
  with replacement. If some weights are large, the re-sample would favor values of θ with
  large weights repeatedly unless we sample without replacement.

• It is difficult to determine whether the importance sampling draws approximate draws
  from the posterior.

• Diagnostic: monitor the weights and look for outliers.

• Note: draws from g(θ) can be reused for other posteriors p(θ|y)!

• Given draws (θ1, ..., θm) from g(θ), we can investigate sensitivity to the prior by
  re-computing the importance weights. For posteriors p1(θ|y) and p2(θ|y), compute (for
  the same draws of θ)

      wj(θi) = pj(θi|y)/g(θi),   j = 1, 2.
Markov chain Monte Carlo

• Methods based on stochastic process theory, useful for approximating target posterior
  distributions.

• Iterative: we must decide when convergence has happened.

• Quite general; can be implemented where other methods fail.

• Can be used even in high-dimensional examples.

• Three methods: Metropolis-Hastings, Metropolis, and the Gibbs sampler. The Metropolis
  algorithm and the Gibbs sampler are special cases of Metropolis-Hastings.

Markov chains

• A process Xt in discrete time t = 0, 1, 2, ..., T in which the conditional distribution
  of Xt given the history depends only on the most recent value,

      p(Xt|X0, X1, ..., Xt−1) = p(Xt|Xt−1),

  is called a Markov chain.

• A Markov chain is irreducible if it is possible to reach all states from any other
  state:

      pn(j|i) > 0,   pm(i|j) > 0   for some m, n > 0,

  where pn(j|i) denotes the probability of getting to state j from state i in n steps.

• A Markov chain is periodic with period d if pn(i|i) = 0 unless n = kd for some integer
  k. If d = 2, the chain returns to i in cycles of 2k steps. If d = 1, the chain is
  aperiodic.

Properties of Markov chains (cont'd)

• Theorem: finite-state, irreducible, aperiodic Markov chains have a limiting
  distribution: limn→∞ pn(j|i) = π, with pn(j|i) the probability that we reach j from i
  after n steps or transitions.

• Ergodicity: if a Markov chain is irreducible and aperiodic and has stationary
  distribution π, then we have ergodicity:

      ān = (1/n) Σt a(Xt) → E{a(X)}   as n → ∞.

  ān is an ergodic average.

• Also, the rate of convergence can be calculated and is geometric.
Markov chain simulation

• Idea: suppose that sampling from p(θ|y) is hard, but that we can generate (somehow) a
  Markov chain {θ(t), t ∈ T} with stationary distribution p(θ|y).

• The situation is different from the usual stochastic process case:
  – Here we know the stationary distribution.
  – We seek an algorithm to transition from θ(t) to θ(t+1) that will take us to the
    stationary distribution.

• Idea: start from some initial guess θ0 and let the chain run for n steps (n large), so
  that it reaches its stationary distribution.

• After convergence, all additional steps in the chain are draws from the stationary
  distribution p(θ|y).

Numerical standard error

• The sequence {X1, X2, ..., Xn} is not iid.

• The asymptotic standard error of ān is approximately

      sqrt( (σa²/n) { 1 + 2 Σi ρi(a) } ),

  where ρi is the ith-lag autocorrelation of the function a{Xt}.

• The first term, σa²/n, is the usual sampling variance under iid sampling.

• The second term is a 'penalty' for the fact that the sample is not iid, and is usually
  bigger than 1.

• Often, the standard error is computed assuming a finite number of lags.
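A small R sketch (not from the notes) of the autocorrelation-corrected standard error
just described, applied to a toy correlated chain; the AR(1) chain, the use of the sample
variance for σa², and the lag cutoff of 50 are all arbitrary illustrative choices.

```r
# Numerical (MCMC) standard error for a chain mean, truncating the
# autocorrelation sum at a finite number of lags.
set.seed(1)
n <- 5000
x <- numeric(n)
for (t in 2:n) x[t] <- 0.9 * x[t-1] + rnorm(1)    # toy correlated chain (AR(1))

max.lag <- 50                                     # arbitrary truncation point
rho <- acf(x, lag.max = max.lag, plot = FALSE)$acf[-1]
mcmc.se <- sqrt((var(x) / n) * (1 + 2 * sum(rho)))
naive.se <- sqrt(var(x) / n)                      # iid formula, too optimistic here
c(mcmc.se = mcmc.se, naive.se = naive.se)
```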
The Gibbs Sampler

• MCMC methods are all based on the same idea; the difference is just in how the
  transitions in the Markov chain are created.

• In MCMC simulation, we generate at least one Markov chain for each parameter in the
  model; often, more than one (independent) chain for each parameter.

• The Gibbs sampler is an iterative algorithm that produces Markov chains with joint
  stationary distribution p(θ|y) by cycling through all possible conditional posterior
  distributions.

• Example: suppose that θ = (θ1, θ2, θ3), and that the target distribution is p(θ|y).
  The steps in the Gibbs sampler are:
  1. Start with a guess (θ1⁰, θ2⁰, θ3⁰)
  2. Draw θ1¹ from p(θ1|θ2 = θ2⁰, θ3 = θ3⁰, y)
  3. Draw θ2¹ from p(θ2|θ1 = θ1¹, θ3 = θ3⁰, y)
  4. Draw θ3¹ from p(θ3|θ1 = θ1¹, θ2 = θ2¹, y)

• The steps above complete one iteration of the Gibbs sampler.

• Repeat the steps above n times; after convergence (see later), the draws
  (θ^(n+1), θ^(n+2), ...) are a sample from the stationary distribution p(θ|y).
Example: Gibbs sampling in the bivariate normal

• Just as an illustration, we consider the trivial problem of sampling from the posterior
  distribution of a bivariate normal mean vector.

• Suppose that we have a single observation vector (y1, y2), where

      (y1, y2)' | θ ∼ N( (θ1, θ2)', [[1, ρ], [ρ, 1]] ).                     (1)

• With a uniform prior on (θ1, θ2), the joint posterior distribution is

      (θ1, θ2)' | y ∼ N( (y1, y2)', [[1, ρ], [ρ, 1]] ).                     (2)

• The conditional distributions are

      θ1 | θ2, y ∼ N( y1 + ρ(θ2 − y2), 1 − ρ² )
      θ2 | θ1, y ∼ N( y2 + ρ(θ1 − y1), 1 − ρ² ).

• In this case we would not need the Gibbs sampler; the example is just for illustration.

• We generate four independent chains, starting from four different corners of the
  two-dimensional parameter space.

• See figures: y1 = 0, y2 = 0, and ρ = 0.8.

  [Figure: draws after 50 iterations, with path lines]
  [Figure: draws after 1000 iterations, with the path lines removed]
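A minimal R sketch of this Gibbs sampler (not part of the original notes), using the same
illustration values y1 = 0, y2 = 0, ρ = 0.8 and a single chain; the starting point
(−4, 4) is an arbitrary "corner".

```r
# Gibbs sampler for the posterior of a bivariate normal mean (y1 = y2 = 0, rho = 0.8).
y1 <- 0; y2 <- 0; rho <- 0.8
n.iter <- 1000
theta <- matrix(0, nrow = n.iter, ncol = 2)
theta[1, ] <- c(-4, 4)                        # arbitrary starting corner
for (t in 2:n.iter) {
  # theta1 | theta2, y ~ N(y1 + rho*(theta2 - y2), 1 - rho^2)
  theta[t, 1] <- rnorm(1, y1 + rho * (theta[t-1, 2] - y2), sqrt(1 - rho^2))
  # theta2 | theta1, y ~ N(y2 + rho*(theta1 - y1), 1 - rho^2)
  theta[t, 2] <- rnorm(1, y2 + rho * (theta[t, 1] - y1), sqrt(1 - rho^2))
}
# Discard the first half as burn-in and summarize
keep <- theta[(n.iter/2 + 1):n.iter, ]
colMeans(keep); cor(keep[, 1], keep[, 2])     # means near 0, correlation near 0.8
```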
The non-conjugate normal example
• Let y ∼ N(µ, σ²), with (µ, σ²) unknown.

• Consider the semi-conjugate prior:

      µ ∼ N(µ0, τ0²),   σ² ∼ Inv-χ²(ν0, σ0²).

• The joint posterior distribution is

      p(µ, σ²|y) ∝ (σ²)^(-n/2) exp{ −(1/(2σ²)) Σi (yi − µ)² }
                   × exp{ −(1/(2τ0²)) (µ − µ0)² }
                   × (σ²)^(-(ν0/2 + 1)) exp{ −ν0σ0²/(2σ²) }.

Non-conjugate example (cont'd)

• Derive the full conditional distributions p(µ|σ², y) and p(σ²|µ, y).

• For µ:

      p(µ|σ², y) ∝ exp{ −(1/2) [ (1/σ²) Σi (yi − µ)² + (1/τ0²)(µ − µ0)² ] }
                 ∝ exp{ −(1/(2σ²τ0²)) [ (nτ0² + σ²)µ² − 2(nτ0²ȳ + σ²µ0)µ ] }
                 ∝ exp{ −(1/2) [(nτ0² + σ²)/(σ²τ0²)] [ µ − (nτ0²ȳ + σ²µ0)/(nτ0² + σ²) ]² }
                 ∝ N(µn, τn²),

  where

      µn = [ (n/σ²) ȳ + (1/τ0²) µ0 ] / [ n/σ² + 1/τ0² ]   and   τn² = 1 / [ n/σ² + 1/τ0² ].
Non-conjugate normal example (cont'd)

• The full conditional for σ² is

      p(σ²|µ, y) ∝ (σ²)^(-((n+ν0)/2 + 1)) exp{ −(1/(2σ²)) [ Σi (yi − µ)² + ν0σ0² ] }
                 ∝ Inv-χ²(νn, σn²),

  where

      νn = ν0 + n,   νn σn² = nS² + ν0σ0²,   S² = n⁻¹ Σi (yi − µ)².

• For the non-conjugate (semi-conjugate) normal model, Gibbs sampling consists of the
  following steps:

  1. Start with a guess for σ², call it σ²(0).
  2. Draw µ(1) from a normal with mean and variance (µn(σ²(0)), τn²(σ²(0))), where

        µn(σ²(0)) = [ (n/σ²(0)) ȳ + (1/τ0²) µ0 ] / [ n/σ²(0) + 1/τ0² ]
        τn²(σ²(0)) = 1 / [ n/σ²(0) + 1/τ0² ].

  3. Next draw σ²(1) from Inv-χ²(νn, σn²(µ(1))), where

        νn σn²(µ(1)) = nS²(µ(1)) + ν0σ0²   and   S²(µ(1)) = n⁻¹ Σi (yi − µ(1))².

  4. Repeat many times.
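A minimal R sketch of these four steps (not part of the original notes), run on simulated
data; the prior values µ0 = 0, τ0² = 100, ν0 = 1, σ0² = 1 are arbitrary choices for the
illustration.

```r
# Gibbs sampler for the semi-conjugate normal model described above.
set.seed(2)
y <- rnorm(50, mean = 5, sd = 2)                 # toy data
n <- length(y); ybar <- mean(y)
mu0 <- 0; tau02 <- 100; nu0 <- 1; sigma02 <- 1   # arbitrary prior values

n.iter <- 2000
mu <- numeric(n.iter); sigma2 <- numeric(n.iter)
sigma2[1] <- var(y)                              # step 1: starting guess for sigma^2
for (t in 2:n.iter) {
  # step 2: draw mu from its normal full conditional
  prec <- n / sigma2[t-1] + 1 / tau02
  mu.n <- (n * ybar / sigma2[t-1] + mu0 / tau02) / prec
  mu[t] <- rnorm(1, mu.n, sqrt(1 / prec))
  # step 3: draw sigma^2 from its scaled inverse chi-square full conditional
  nu.n <- nu0 + n
  nun.sigman2 <- sum((y - mu[t])^2) + nu0 * sigma02
  sigma2[t] <- nun.sigman2 / rchisq(1, df = nu.n)
}
keep <- (n.iter/2 + 1):n.iter                    # discard burn-in
c(mean(mu[keep]), mean(sqrt(sigma2[keep])))      # compare with the true values (5, 2)
```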
A more interesting example

• Poisson counts with a change point: consider a sample of size n of counts (y1, ..., yn)
  distributed as Poisson with some rate, and suppose that the rate changes at an unknown
  point in the sequence, so that
      yi ∼ Poi(λ)   for i = 1, ..., m,
      yi ∼ Poi(φ)   for i = m + 1, ..., n,

  and m is unknown.

• Priors on (λ, φ, m):

      λ ∼ Ga(α, β),   φ ∼ Ga(γ, δ),   m ∼ U[1, n] (discrete uniform).
• Joint posterior distribution:

      p(λ, φ, m|y) ∝ [ Π_{i=1}^{m} e^{−λ} λ^{yi} ] [ Π_{i=m+1}^{n} e^{−φ} φ^{yi} ]
                      × λ^{α−1} e^{−λβ} φ^{γ−1} e^{−φδ} n^{−1}
                   ∝ λ^{y1* + α − 1} e^{−λ(m+β)} φ^{y2* + γ − 1} e^{−φ(n−m+δ)},

  with y1* = Σ_{i=1}^{m} yi and y2* = Σ_{i=m+1}^{n} yi.

• Note: if we knew m, the problem would be trivial to solve. Thus Gibbs sampling, where
  we condition on all other parameters to determine the Markov chains, appears to be the
  right approach.

• To implement Gibbs sampling, we need the full conditional distributions. Pick and
  choose the pieces of the joint posterior that depend on each parameter.

• Conditional of λ:

      p(λ|m, φ, y) ∝ λ^{Σ_{i=1}^{m} yi + α − 1} exp(−λ(m + β))
                   ∝ Ga( α + Σ_{i=1}^{m} yi, m + β ).

• Conditional of φ:

      p(φ|λ, m, y) ∝ φ^{Σ_{i=m+1}^{n} yi + γ − 1} exp(−φ(n − m + δ))
                   ∝ Ga( γ + Σ_{i=m+1}^{n} yi, n − m + δ ).

• Conditional of m = 1, 2, ..., n:

      p(m|λ, φ, y) = c⁻¹ q(m|λ, φ, y)
                   = c⁻¹ λ^{y1* + α − 1} exp(−λ(m + β)) φ^{y2* + γ − 1} exp(−φ(n − m + δ)).

• Note that all terms in the joint posterior depend on m, so the conditional of m is
  proportional to the joint posterior.

• This distribution does not look like any standard form, so we cannot sample from it
  directly. We need to obtain the normalizing constant, which is easy to do for
  relatively small n and for this discrete case:

      c = Σ_{k=1}^{n} q(k|λ, φ, y).

• To sample from p(m|λ, φ, y) we can use the inverse cdf method on a grid, or other
  methods.

Non-standard distributions

• It may happen that one or more of the full conditionals is not a standard distribution.

• What to do then?
  – Try direct simulation: grid approximation, rejection sampling.
  – Try an approximation: normal or t approximation, which requires the mode at each
    iteration (see later).
  – Try more general Markov chain algorithms: Metropolis or Metropolis-Hastings.
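A compact R sketch of the Gibbs sampler for the Poisson change-point model above, not
part of the original notes. The data are simulated, the hyperparameter values
α = β = γ = δ = 1 are arbitrary, and m is restricted to 1, ..., n−1 to keep both segments
non-empty in this toy version.

```r
# Gibbs sampler for the Poisson change-point model, on simulated data.
set.seed(3)
n <- 100; m.true <- 40
y <- c(rpois(m.true, 3), rpois(n - m.true, 8))
alpha <- 1; beta <- 1; gamma <- 1; delta <- 1      # arbitrary hyperparameters

n.iter <- 2000
lambda <- numeric(n.iter); phi <- numeric(n.iter); m <- numeric(n.iter)
m[1] <- n %/% 2; lambda[1] <- mean(y); phi[1] <- mean(y)
for (t in 2:n.iter) {
  # lambda | m, phi, y ~ Ga(alpha + sum_{i<=m} y_i, m + beta)
  lambda[t] <- rgamma(1, alpha + sum(y[1:m[t-1]]), m[t-1] + beta)
  # phi | lambda, m, y ~ Ga(gamma + sum_{i>m} y_i, n - m + delta)
  phi[t] <- rgamma(1, gamma + sum(y[(m[t-1]+1):n]), n - m[t-1] + delta)
  # m | lambda, phi, y: discrete, evaluate the un-normalized log conditional on a grid
  ks <- 1:(n - 1)
  logq <- sapply(ks, function(k)
    sum(y[1:k]) * log(lambda[t]) - k * lambda[t] +
    sum(y[(k+1):n]) * log(phi[t]) - (n - k) * phi[t])
  prob <- exp(logq - max(logq)); prob <- prob / sum(prob)
  m[t] <- sample(ks, 1, prob = prob)
}
keep <- (n.iter/2 + 1):n.iter
c(mean(lambda[keep]), mean(phi[keep]), median(m[keep]))   # near (3, 8, 40)
```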
Metropolis-Hastings algorithm

• A more flexible transition kernel: rather than requiring sampling from conditional
  distributions, M-H permits using many other "proposal" densities.

• Idea: instead of drawing sequentially from conditionals as in Gibbs, M-H "jumps" around
  the parameter space.

• The algorithm is the following:
  1. Given a draw θ(t) in iteration t, sample a candidate draw θ* from a proposal
     distribution J(θ*|θ).
  2. Accept the draw with probability

        r = [ p(θ*|y)/J(θ*|θ) ] / [ p(θ|y)/J(θ|θ*) ].

  3. Stay in place (do not accept the draw) with probability 1 − r, i.e., θ(t+1) = θ(t).

• Remarkably, the proposal distribution (the text calls it the jumping distribution) can
  have just about any form.

• When the proposal distribution is symmetric, i.e., J(θ*|θ) = J(θ|θ*), the
  Metropolis-Hastings acceptance probability is

        r = p(θ*|y)/p(θ|y).

  This is the Metropolis algorithm.

Proposal distributions

• Convergence does not depend on J, but the rate of convergence does.

• The optimal J is p(θ|y), in which case r = 1.

• Otherwise, how do we choose J? We want a proposal for which:
  1. It is easy to get samples from J.
  2. It is easy to compute r.
  3. It leads to rapid convergence and mixing: jumps should be large enough to take us
     everywhere in the parameter space, but not so large that the draws are rarely
     accepted (see the figure from Gilks et al. (1995)).

• Three main approaches: random walk M-H (the most popular), the independence sampler,
  and approximation M-H.

Independence sampler

• The proposal distribution Jt does not depend on θ(t).

• Just find a distribution g(θ) and generate values from it.

• The acceptance probability is

        r = [ p(θ*|y)/g(θ*) ] / [ p(θ|y)/g(θ) ].

• Can work very well if g(·) is a good approximation to p(θ|y) and g has heavier tails
  than p; can work very poorly otherwise.

• With large samples (central limit theorem operating), the proposal could be a normal
  distribution centered at the mode of p(θ|y) and with variance larger than the inverse
  of the Fisher information evaluated at the mode.

Random walk M-H

• The most popular approach, and easy to use.

• Idea: generate the candidate using a random walk.

• The proposal is often normal, centered at the current draw:

        J(θ|θ(t−1)) = N(θ|θ(t−1), V).

• Symmetric: think of drawing values θ* − θ(t−1) from a N(0, V). Thus r simplifies.

• It is difficult to choose V:
  1. V too small: it takes long to explore the parameter space.
  2. V too large: jumps to extremes are less likely to be accepted, and the chain stays
     in the same place too long.
  3. Ideal V: the posterior variance. This is unknown, so one might do some trial and
     error runs.

• Optimal acceptance rates (from some theoretical results) are between 25% and 50% for
  this type of proposal distribution, getting lower as the dimension of the problem
  increases.
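A minimal R sketch of random-walk Metropolis (not from the notes). The target is a toy
un-normalized log-posterior (a Gamma(3, 1) kernel on θ > 0) and the proposal standard
deviation of 0.8 is an arbitrary tuning choice; in practice one would tune it toward the
acceptance rates quoted above.

```r
# Random-walk Metropolis sketch with a symmetric normal proposal.
log.q <- function(theta) if (theta > 0) 2 * log(theta) - theta else -Inf

n.iter <- 5000
theta <- numeric(n.iter)
theta[1] <- 1
accept <- 0
for (t in 2:n.iter) {
  cand <- rnorm(1, theta[t-1], 0.8)               # symmetric random-walk proposal
  log.r <- log.q(cand) - log.q(theta[t-1])        # Metropolis ratio (log scale)
  if (log(runif(1)) < log.r) {
    theta[t] <- cand; accept <- accept + 1
  } else {
    theta[t] <- theta[t-1]                        # stay in place
  }
}
keep <- (n.iter/2 + 1):n.iter
c(mean(theta[keep]), accept / n.iter)             # posterior mean near 3; acceptance rate
```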
Approximation M-H

• The idea is to improve the approximation of J to p(θ|y) as we learn more about θ.

• E.g., in random walk M-H, we can perhaps increase the acceptance rate by considering

        J(θ|θ(t−1)) = N(θ|θ(t−1), V_{θ(t−1)}),

  where the variance also depends on the current draw.

• Proposals here are typically not symmetric, and require the full expression for r.

Starting values

• If the chain is irreducible, the choice of θ0 will not affect convergence.

• With multiple chains (see later), we can choose over-dispersed starting values for each
  chain. A possible algorithm:
  1. Find the posterior modes.
  2. Create an over-dispersed approximation to the posterior at the mode (e.g., a t4).
  3. Sample starting values from that distribution.

• Not much research on this topic.

How many chains?

• Even after convergence, draws from the stationary distribution are correlated; the
  sample is not i.i.d.

• An iid sample of size n could be obtained by keeping only the last draw from each of n
  independent chains. Too inefficient.

• Compromise: if the autocorrelation at lag k and larger is negligible, then we can
  generate an almost i.i.d. sample by keeping only every kth draw in the chain after
  convergence. This needs a very long chain.

• To check convergence (see later), multiple chains can be generated independently (in
  parallel).

• Burn-in: the iterations required to reach the stationary distribution (approximately).
  Not used for inference.

Convergence

• It is impossible to decide whether a chain has converged; we can only monitor its
  behavior.

• Easiest approach: graphical displays (trace plots in WinBUGS).

• A bit more formal (and most popular in terms of use): the potential scale reduction
  statistic √R̂ of Gelman and Rubin (1992).

• To use the G-R diagnostic, we must generate multiple independent chains for each
  parameter.

• The G-R diagnostic compares the within-chain standard deviation to the between-chain
  standard deviation.
• Before convergence, the within-chain sd is an underestimate of sd(θ|y), and the
  between-chain sd is an overestimate.

• If the chains have converged, all draws are from the stationary distribution, so the
  within-chain and between-chain variances should be similar.

• Consider a scalar parameter η and let ηij be the ith draw in the jth chain, with
  i = 1, ..., n and j = 1, ..., J.

• Between-chain variance:

      B = [n/(J−1)] Σj (η̄.j − η̄..)²,   η̄.j = (1/n) Σi ηij,   η̄.. = (1/J) Σj η̄.j.

• Within-chain variance:

      W = [1/(J(n−1))] Σj Σi (ηij − η̄.j)².

• An unbiased estimate of var(η|y) is the weighted average

      var̂(η|y) = [(n−1)/n] W + (1/n) B.

• Early in the iterations, var̂(η|y) overestimates the true posterior variance.

• The diagnostic measures the potential reduction in the scale if the iterations were
  continued:

      √R̂ = sqrt( var̂(η|y) / W ),

  which goes to 1 as n → ∞.
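A short R sketch (not from the notes) that computes the potential scale reduction
directly from the formulas above; the four simulated chains are a toy check under
stationarity, so the result should be close to 1.

```r
# Gelman-Rubin potential scale reduction from an n x J matrix of draws
# (n iterations, typically the second halves only, from J parallel chains).
rhat <- function(draws) {
  n <- nrow(draws); J <- ncol(draws)
  chain.means <- colMeans(draws)
  B <- n * var(chain.means)            # n/(J-1) * sum_j (chain mean - grand mean)^2
  W <- mean(apply(draws, 2, var))      # average within-chain variance
  var.hat <- (n - 1) / n * W + B / n
  sqrt(var.hat / W)
}

# Toy check: four chains that have all reached the same stationary distribution
set.seed(4)
chains <- matrix(rnorm(4 * 500, mean = 3), nrow = 500, ncol = 4)
rhat(chains)                           # should be close to 1
```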
Hierarchical models - Introduction

• Consider the following study:
  – J counties are sampled from the population of 99 counties in the state of Iowa.
  – Within each county, nj farms are selected to participate in a fertilizer trial.
  – The outcome yij corresponds to the ith farm in the jth county, and is modeled as
    p(yij|θj).
  – We are interested in estimating the θj.

• Two obvious approaches:
  1. Conduct J separate analyses and obtain p(θj|yj): all θ's are independent.
  2. Get one posterior p(θ|y1, y2, ..., yJ): the point estimate is the same for all θ's.

• Neither approach is appealing.

• Hierarchical approach: view the θ's as a sample from a common population distribution
  indexed by a parameter φ.

• The observed data yij can be used to draw inferences about the θ's even though the θj
  are not observed.

• A simple hierarchical model:

      p(y, θ, φ) = p(y|θ) p(θ|φ) p(φ),

  where p(y|θ) is the usual sampling distribution, p(θ|φ) is the population distribution
  for θ (or prior), and p(φ) is the hyperprior.

Hierarchical models - Rat tumor example

• Imagine a single toxicity experiment performed on rats.

• θ is the probability that a rat receiving no treatment develops a tumor.

• Data: n = 14 rats, of which y = 4 develop a tumor. From the sample, the tumor rate is
  4/14 = 0.286.

• Simple analysis:

      y|θ ∼ Bin(n, θ),   θ|α, β ∼ Beta(α, β).

• The posterior for θ is

      p(θ|y) = Beta(α + 4, β + 10).

• Where can we get good "guesses" for α and β?

• One possibility: from the literature, look at data from similar experiments. In this
  example, information from 70 other experiments is available. See Table 5.1 and
  Figure 5.1 in Gelman et al.

• In the jth study, yj is the number of rats with tumors and nj is the sample size,
  j = 1, ..., 70. Model the yj as independent binomial data given the study-specific θj
  and sample size nj.
• The representation of the hierarchical model in Figure 5.1 assumes that the Beta(α, β)
  is a good population distribution for the θj.

  [Figure, from Gelman et al.: hierarchical representation of the tumor experiments]

• Empirical Bayes as a first step to estimating the mean tumor rate in experiment
  number 71: get point estimates of (α, β) from the earlier 70 experiments, as follows.

• The mean tumor rate is r̄ = (70)⁻¹ Σ_{j=1}^{70} yj/nj and the standard deviation is
  [(69)⁻¹ Σj (yj/nj − r̄)²]^{1/2}. The values are 0.136 and 0.103, respectively.

• Using the method of moments:

      α/(α + β) = 0.136   ⟹   α + β = α/0.136,
      αβ / [(α + β)²(α + β + 1)] = 0.103².

• The resulting estimate for (α, β) is (1.4, 8.6).

• The posterior is then

      p(θ|y) = Beta(5.4, 18.6),

  with posterior mean 0.223, lower than the sample mean, and posterior standard deviation
  0.083.

• The posterior point estimate is lower than the crude estimate; this indicates that in
  the current experiment, the number of tumors was unusually high.

• Why not go back now and use this prior to obtain better estimates for the tumor rates
  in the earlier 70 experiments? We should not do that, because:
  1. We can't use the data twice: we used the historical data to get the prior, and
     cannot now combine that prior with the data from the same experiments for inference.
  2. Using a point estimate for (α, β) suggests that there is no uncertainty about
     (α, β): not true.
  3. If the prior Beta(α, β) is the appropriate prior, shouldn't we know the values of
     its parameters prior to observing any data?

• Approach: place a hyperprior on the tumor rates θj with parameters (α, β). We can still
  use all the data to estimate the hyperparameters.

• Idea: carry out a Bayesian analysis on the joint distribution of all parameters,
  p(θ1, ..., θ71, α, β|y).
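A two-line R check (not from the notes) of the method-of-moments calculation above, using
the quoted summary values 0.136 and 0.103; it reproduces the (1.4, 8.6) estimate up to
rounding.

```r
# Method-of-moments estimates of (alpha, beta) from the historical tumor rates.
m <- 0.136; s <- 0.103
alpha.plus.beta <- m * (1 - m) / s^2 - 1     # from the Beta mean and variance formulas
alpha <- m * alpha.plus.beta
beta <- (1 - m) * alpha.plus.beta
c(alpha, beta)                               # approximately (1.4, 8.6)
```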
Hierarchical models - Exchangeability

• J experiments. In experiment j, yj ∼ p(yj|θj).

• To create a joint probability model, we use the idea of exchangeability.

• Recall: a set of random variables (z1, ..., zK) is exchangeable if their joint
  distribution p(z1, ..., zK) is invariant to permutations of their labels.

• In our set-up, we have two opportunities for assuming exchangeability:
  1. At the level of the data: conditional on θj, the yij's are exchangeable if we cannot
     "distinguish" between them.
  2. At the level of the parameters: unless something (other than the data) distinguishes
     the θ's, we assume that they are exchangeable as well.

• In the rat tumor example, we have no information to distinguish between the 71
  experiments, except sample size. Since nj is presumably not related to θj, we choose an
  exchangeable model for the θj.

• The simplest exchangeable distribution for the θj is

      p(θ|φ) = Π_{j=1}^{J} p(θj|φ).

• The hyperparameter φ is typically unknown, so the marginal for θ must be obtained by
  averaging:

      p(θ) = ∫ Π_{j=1}^{J} p(θj|φ) p(φ) dφ.

• This exchangeable distribution for (θ1, ..., θJ) is written in the form of a mixture
  distribution: the mixture model characterizes the parameters (θ1, ..., θJ) as
  independent draws from a superpopulation that is determined by the unknown
  parameters φ.

• Exchangeability does not hold when we have additional information on covariates xj that
  distinguishes between the θj. We can still model exchangeability with covariates:

      p(θ1, ..., θJ|x1, ..., xJ) = ∫ Π_{j=1}^{J} p(θj|xj, φ) p(φ|x) dφ,

  with x = (x1, ..., xJ). In the Iowa example, we might know that different counties have
  very different soil quality.

• Exchangeability does not imply that the θj are all the same; just that they can be
  assumed to be draws from some common superpopulation distribution.

Hierarchical models - Bayesian treatment

• Model formulation:

      p(φ, θ) = p(φ) p(θ|φ) = joint prior,
      p(φ, θ|y) ∝ p(φ, θ) p(y|φ, θ) = p(φ, θ) p(y|θ).

• Since φ is unknown, the posterior is now p(θ, φ|y).

• The hyperparameter φ gets its own prior distribution.

• It is important to check whether the posterior is proper when using improper priors in
  hierarchical models!
Posterior predictive distributions

• There are two posterior predictive distributions potentially of interest:
  1. The distribution of future observations ỹ corresponding to an existing θj. Draw ỹ
     from the posterior predictive given existing draws for θj.
  2. The distribution of new observations ỹ corresponding to new θ̃j's drawn from the
     same superpopulation. First draw φ from its posterior, then draw θ̃ for a new
     experiment, and then draw ỹ from the posterior predictive given the simulated θ̃.

• In the rat tumor example:
  1. More rats from experiment #71, for example.
  2. Experiment #72, and then rats from experiment #72.

Hierarchical models - Computation

• Harder than before, because we have more parameters.

• Easiest when the population distribution p(θ|φ) is conjugate to the likelihood p(y|θ).
  In non-conjugate models, we must use more advanced computation.

• Usual steps:
  1. Write p(φ, θ|y) ∝ p(φ) p(θ|φ) p(y|θ).
  2. Analytically determine p(θ|y, φ). This is easy for conjugate models.
  3. Derive the marginal p(φ|y) by

        p(φ|y) = ∫ p(φ, θ|y) dθ,

     or, if convenient, by

        p(φ|y) = p(φ, θ|y) / p(θ|y, φ).

     The normalizing "constant" in the denominator may depend on φ as well as y.

• Steps for simulation in hierarchical models:
  1. Draw the vector φ from p(φ|y). If φ is low-dimensional, we can use the inverse cdf
     method as before; else, we need more advanced methods.
  2. Draw θ from the conditional p(θ|φ, y). If the θj are conditionally independent,
     p(θ|φ, y) factors as

        p(θ|φ, y) = Πj p(θj|φ, y),

     so the components of θ can be drawn one at a time.
  3. Draw ỹ from the appropriate posterior predictive distribution.

Rat tumor example

• Sampling distribution for the data from experiments j = 1, ..., 71:

      yj ∼ Bin(nj, θj).

• The tumor rates θj are assumed to be independent draws from a Beta:

      θj ∼ Beta(α, β).

• Choose a non-informative prior for (α, β) to indicate prior ignorance. Since the
  hyperprior will be non-informative, and perhaps improper, we must check the
  integrability of the posterior.
• Defer the choice of p(α, β) until a bit later.

• Joint posterior distribution:

      p(θ, α, β|y) ∝ p(α, β) p(θ|α, β) p(y|θ, α, β)
                   ∝ p(α, β) Πj [ Γ(α+β)/(Γ(α)Γ(β)) ] θj^{α−1} (1 − θj)^{β−1}
                      × Πj θj^{yj} (1 − θj)^{nj − yj}.

• Conditional of θ: notice that given (α, β), the θj are independent, with Beta
  distributions:

      p(θj|α, β, y) = [ Γ(α+β+nj) / (Γ(α+yj)Γ(β+nj−yj)) ] θj^{α+yj−1} (1 − θj)^{β+nj−yj−1}.

• The marginal posterior distribution of the hyperparameters is obtained using

      p(α, β|y) = p(α, β, θ|y) / p(θ|y, α, β).

• Substituting into the expression above, we get

      p(α, β|y) ∝ p(α, β) Πj [ Γ(α+β)/(Γ(α)Γ(β)) ] [ Γ(α+yj)Γ(β+nj−yj) / Γ(α+β+nj) ].

• This is not a standard distribution, but it is easy to evaluate and is only
  two-dimensional.

• We can evaluate p(α, β|y) over a grid of values of (α, β), and then use the inverse cdf
  method to draw α from its marginal and β from the conditional given α.
• But what is p(α, β)? As it turns out, many of the 'obvious' choices for the prior lead
  to an improper posterior. (See the solution to Exercise 5.7 on the course web site.)

• Some obvious choices, such as p(α, β|y) ∝ 1, lead to a non-integrable posterior.

• Other choices leading to non-integrability of p(α, β|y) include:
  – A flat prior on the prior guess and "degrees of freedom":

        p( α/(α+β), α+β ) ∝ 1.

  – A flat prior on the log(mean) and log(degrees of freedom):

        p( log(α/β), log(α+β) ) ∝ 1.

• Integrability can be checked analytically by evaluating the behavior of the posterior
  as α, β (or functions of them) go to ∞.

• An empirical assessment can be made by looking at the contour plot of p(α, β|y) over a
  grid. Significant mass extending towards infinity suggests that the posterior will not
  integrate to a constant.

• A reasonable choice for the prior is a flat distribution on the prior mean and the
  square root of the inverse of the "degrees of freedom":

        p( α/(α+β), (α+β)^{-1/2} ) ∝ 1.

• This is equivalent to

        p(α, β) ∝ (α + β)^{-5/2},

  and also to

        p( log(α/β), log(α+β) ) ∝ αβ(α + β)^{-5/2}.

• We use the parameterization (log(α/β), log(α + β)).

• The idea for drawing values of α and β from their posterior is the usual one: evaluate
  p(log(α/β), log(α+β)|y) over a grid and then use the inverse cdf method. For this
  problem, it is easier to evaluate the log posterior and then exponentiate.

• Grid: from the earlier estimates, potential centers for the grid are α = 1.4, β = 8.6.
  In the new parameterization, this translates into u = log(α/β) = −1.8 and
  v = log(α + β) = 2.3. A grid that is too narrow leaves mass outside; try
  [−2.3, −1.3] × [1, 5].

• Remaining steps for the computation:
  1. Given draws (u, v), transform back to (α, β).
  2. Finally, sample θj from Beta(α + yj, β + nj − yj).

[Figure, from Gelman et al.: contours of the joint posterior distribution of α and β in
the reparametrized scale]

[Figure: posterior means and 95% credible sets for the θj]
Normal hierarchical model

• What we know as one-way random effects models.

• Set-up: J independent experiments; in each we want to estimate a mean θj.

• Sampling model:

      yij|θj, σ² ∼ N(θj, σ²),   i = 1, ..., nj,  j = 1, ..., J.

• Assume for now that σ² is known.

• If ȳ.j = nj⁻¹ Σi yij, then

      ȳ.j|θj ∼ N(θj, σj²),   with σj² = σ²/nj.

• The sampling model for ȳ.j is quite general: for nj large, ȳ.j is normal even if yij
  is not.

• We now need to think about priors for the θj.

• What type of posterior estimates for θj might be reasonable?
  1. θ̂j = ȳ.j: reasonable if nj is large.
  2. θ̂j = ȳ.. = (Σj σj⁻²)⁻¹ Σj σj⁻² ȳ.j: the pooled estimate, reasonable if we believe
     that all the means are the same.

• To decide between those two choices, one would do an F-test for differences between
  groups. ANOVA approach (for nj = n):

      Source           df        SS     MS             E(MS)
      Between groups   J-1       SSB    SSB / (J-1)    nτ² + σ²
      Within groups    J(n-1)    SSE    SSE / J(n-1)   σ²

  where τ², the variance of the group means, can be estimated as

      τ̂² = (MSB − MSE)/n.

• If MSB >>> MSE, then τ̂² > 0 and the F-statistic is significant: do not pool, and use
  θ̂j = ȳ.j. Else, the F-test cannot reject H0: τ² = 0, and we must pool.

• Alternative: why not a more general estimator,

      θ̂j = λj ȳ.j + (1 − λj) ȳ..,   λj ∈ [0, 1]?

  The factor (1 − λj) is a shrinkage factor.

• All three estimates have a Bayesian justification:
  1. θ̂j = ȳ.j is the posterior mean of θj if the sample means are normal and p(θj) ∝ 1.
  2. θ̂j = ȳ.. is the posterior mean if θ1 = ... = θJ and p(θ) ∝ 1.
  3. θ̂j = λj ȳ.j + (1 − λj) ȳ.. is the posterior mean if θj ∼ N(µ, τ²) independently of
     the other θ's and the sampling distribution for ȳ.j is normal.

• The latter is called the normal-normal model.
Normal-normal model

• Set-up (for σ² known):

      ȳ.j|θj ∼ N(θj, σj²),
      θ1, ..., θJ|µ, τ ∼ Πj N(µ, τ²).

• The joint prior can be written as p(µ, τ) = p(µ|τ)p(τ). For µ, we will consider a
  conditional flat prior, so that

      p(µ, τ) ∝ p(τ).

• It follows that

      p(θ1, ..., θJ) = ∫ Πj N(θj; µ, τ²) p(µ, τ²) dµ dτ².

• Joint posterior:

      p(θ, µ, τ|y) ∝ p(µ, τ) p(θ|µ, τ) p(y|θ)
                   ∝ p(µ, τ) Πj N(θj; µ, τ²) Πj N(ȳ.j; θj, σj²).

Conditional distribution of the group means

• For p(µ|τ) ∝ 1, the conditional distributions of the θj given (µ, τ, ȳ.j) are
  independent, and

      p(θj|µ, τ, ȳ.j) = N(θ̂j, Vj),

  with

      θ̂j = (σj² µ + τ² ȳ.j)/(σj² + τ²)   and   Vj⁻¹ = 1/σj² + 1/τ².

Marginal distribution of the hyperparameters

• To get p(µ, τ|y) we would typically either integrate p(µ, τ, θ|y) with respect to
  θ1, ..., θJ, or use the algebraic approach

      p(µ, τ|y) = p(µ, τ, θ|y) / p(θ|µ, τ, y).

• In the normal-normal model, the data provide information about (µ, τ), as shown below.
  Consider the marginal posterior

      p(µ, τ|y) ∝ p(µ, τ) p(y|µ, τ),

  where p(y|µ, τ) is the marginal likelihood:

      p(y|µ, τ) = ∫ p(y|θ, µ, τ) p(θ|µ, τ) dθ
                = ∫ Πj N(ȳ.j; θj, σj²) Πj N(θj; µ, τ²) dθ1 ... dθJ.

• The integrand above is a product of quadratic functions of ȳ.j and θj, so they are
  jointly normal.

• Then the ȳ.j|µ, τ are also normal, with mean and variance

      E(ȳ.j|µ, τ) = E[ E(ȳ.j|θj, µ, τ) | µ, τ ] = E(θj|µ, τ) = µ,
      var(ȳ.j|µ, τ) = E[ var(ȳ.j|θj, µ, τ) | µ, τ ] + var[ E(ȳ.j|θj, µ, τ) | µ, τ ]
                    = E(σj²|µ, τ) + var(θj|µ, τ) = σj² + τ².
• Therefore

      p(µ, τ|y) ∝ p(τ) Πj N(ȳ.j; µ, σj² + τ²)
                ∝ p(τ) Πj (σj² + τ²)^{-1/2} exp{ −(ȳ.j − µ)² / (2(σj² + τ²)) }.

• Now fix τ, and think of p(ȳ.j|µ) as the "likelihood" in a problem in which µ is the
  only parameter. Recall that p(µ|τ) ∝ 1.

• From the earlier results we then know that p(µ|y, τ) = N(µ̂, Vµ), where

      µ̂ = [ Σj ȳ.j/(σj² + τ²) ] / [ Σj 1/(σj² + τ²) ]   and   Vµ⁻¹ = Σj 1/(σj² + τ²).

• To get p(τ|y) we use the old trick:

      p(τ|y) = p(µ, τ|y) / p(µ|τ, y)
             ∝ p(τ) Πj N(ȳ.j; µ, σj² + τ²) / N(µ; µ̂, Vµ).

• The expression holds for any µ, so set µ = µ̂. The denominator is then just
  proportional to Vµ^{-1/2}, and

      p(τ|y) ∝ p(τ) Vµ^{1/2} Πj N(ȳ.j; µ̂, σj² + τ²).

• What do we use for p(τ)?

• Non-informative (and improper) choices:
  – Beware: an improper prior in a hierarchical model can easily lead to a non-integrable
    posterior.
  – In the normal-normal model, the "natural" non-informative prior p(τ) ∝ τ⁻¹, or
    equivalently p(log τ) ∝ 1, results in an improper posterior.
  – p(τ) ∝ 1 leads to a proper posterior p(τ|y).

• A safe choice is always a proper prior. For example, consider

      p(τ) = Inv-χ²(ν0, τ0²),

  where τ0² can be a "best guess" for τ and ν0 can be small to reflect prior uncertainty.
Normal-normal model - Computation

1. Evaluate the one-dimensional p(τ|y) on a grid.
2. Sample τ from p(τ|y) using the inverse cdf method.
3. Sample µ from N(µ̂, Vµ).
4. Sample each θj from N(θ̂j, Vj).

Normal-normal model - Prediction

• Predicting future data ỹ from the current experiments with means θ = (θ1, ..., θJ):
  1. Obtain draws of (τ, µ, θ1, ..., θJ).
  2. Draw ỹ from N(θj, σj²).

• Predicting future data ỹ from a future experiment with mean θ̃ and sample size ñ:
  1. Draw µ, τ from their posterior.
  2. Draw θ̃ from the population distribution p(θ|µ, τ) (also known as the prior for θ).
  3. Draw ỹ from N(θ̃, σ̃²), where σ̃² = σ²/ñ.
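A compact R sketch of the four computation steps above (not part of the original notes),
for the normal-normal model with known σj² and flat priors p(µ|τ) ∝ 1, p(τ) ∝ 1. The
group means and standard deviations used below are the eight-schools SAT values from
Gelman et al., included purely as an illustration.

```r
# Grid-based simulation for the normal-normal model with known sigma_j^2.
ybar  <- c(28, 8, -3, 7, -1, 1, 18, 12)       # illustrative group means (SAT example)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)     # known group sds
tau.grid <- seq(0.01, 40, length = 500)       # step 1: grid for p(tau | y)

log.ptau <- sapply(tau.grid, function(tau) {
  V <- sigma^2 + tau^2
  Vmu <- 1 / sum(1 / V)
  muhat <- sum(ybar / V) * Vmu
  0.5 * log(Vmu) + sum(dnorm(ybar, muhat, sqrt(V), log = TRUE))
})
ptau <- exp(log.ptau - max(log.ptau)); ptau <- ptau / sum(ptau)

n.sims <- 1000
tau <- sample(tau.grid, n.sims, replace = TRUE, prob = ptau)   # step 2
mu <- numeric(n.sims)
theta <- matrix(0, n.sims, length(ybar))
for (s in 1:n.sims) {
  V <- sigma^2 + tau[s]^2
  Vmu <- 1 / sum(1 / V)
  mu[s] <- rnorm(1, sum(ybar / V) * Vmu, sqrt(Vmu))            # step 3
  Vj <- 1 / (1 / sigma^2 + 1 / tau[s]^2)
  thetahat <- (mu[s] / tau[s]^2 + ybar / sigma^2) * Vj
  theta[s, ] <- rnorm(length(ybar), thetahat, sqrt(Vj))        # step 4
}
round(apply(theta, 2, quantile, c(0.025, 0.5, 0.975)), 1)
```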
Example: effect of diet on coagulation times

• Normal hierarchical model for the coagulation times of 24 animals randomized to four
  diets. Model:
  – Observations: yij, with i = 1, ..., nj and j = 1, ..., J.
  – Given θ, the coagulation times yij are exchangeable.
  – Treatment means are normal, θj ∼ N(µ, τ²), and the variance σ² is constant across
    treatments.
  – Prior: p(µ, log σ, log τ) ∝ τ. A uniform prior on log τ would lead to an improper
    posterior.

• The joint posterior is

      p(θ, µ, log σ, log τ|y) ∝ τ Πj N(θj|µ, τ²) Πj Πi N(yij|θj, σ²).

• Crude initial estimates:

      θ̂j = nj⁻¹ Σi yij = ȳ.j,   µ̂ = J⁻¹ Σj ȳ.j = ȳ..,
      σ̂j² = (nj − 1)⁻¹ Σi (yij − ȳ.j)²,   σ̂² = J⁻¹ Σj σ̂j²,   τ̂² = (J − 1)⁻¹ Σj (ȳ.j − ȳ..)².

Data

The following table contains the data: coagulation time in seconds for blood drawn from
24 animals randomly allocated to four different diets. Different treatments have
different numbers of observations because the randomization was unrestricted.

      Diet   Measurements
      A      62, 60, 63, 59
      B      63, 67, 71, 64, 65, 66
      C      68, 66, 71, 67, 68, 68
      D      56, 62, 60, 61, 63, 64, 63, 59
Conditional maximization for the joint mode

• Conjugacy makes conditional maximization easy.

• Conditional modes of the treatment means: the conditional posteriors are

      θj|µ, σ, τ, y ∼ N(θ̂j, Vj),

  with

      θ̂j = (µ/τ² + nj ȳ.j/σ²) / (1/τ² + nj/σ²)   and   Vj⁻¹ = 1/τ² + nj/σ².

  For j = 1, ..., J, maximize the conditional posteriors by using θ̂j in place of the
  current estimate of θj.

• Conditional mode of µ:

      µ|θ, σ, τ, y ∼ N(µ̂, τ²/J),   µ̂ = J⁻¹ Σj θj.

  Conditional maximization: replace the current estimate of µ with µ̂.

Conditional maximization (cont'd)

• Conditional mode of log σ: first derive the conditional posterior of σ²,

      σ²|θ, µ, τ, y ∼ Inv-χ²(n, σ̂²),   with σ̂² = n⁻¹ Σj Σi (yij − θj)².

  The mode of the Inv-χ² is σ̂² n/(n + 2). To get the mode for log σ, use the
  transformation; the term n/(n + 2) disappears with the Jacobian, so the conditional
  mode of log σ is log σ̂.

• Conditional mode of log τ: same reasoning. Note that

      τ²|θ, µ, σ², y ∼ Inv-χ²(J − 1, τ̂²),   with τ̂² = (J − 1)⁻¹ Σj (θj − µ)².

  After accounting for the Jacobian of the transformation, the conditional mode of log τ
  is log τ̂.

• Starting from the crude estimates, conditional maximization required only three
  iterations to converge approximately (see the table below). The log posterior increased
  at each step, and the values in the final iteration give the approximate joint mode.

• When J is large relative to the nj, the joint mode may not provide a good summary of
  the posterior. Try to get marginal modes instead. In this problem, factor

      p(θ, µ, log σ, log τ|y) = p(θ|µ, log σ, log τ, y) p(µ, log σ, log τ|y).

  The marginal of (µ, log σ, log τ) is three-dimensional regardless of J and the nj.
Conditional maximization results

The table shows the stepwise-ascent iterations of conditional maximization, starting from
the crude estimates.

      Parameter           Crude     First    Second   Third    Fourth
                          estimate  iter.    iter.    iter.    iter.
      θ1                  61.000    61.282   61.288   61.290   61.290
      θ2                  66.000    65.871   65.869   65.868   65.868
      θ3                  68.000    67.742   67.737   67.736   67.736
      θ4                  61.000    61.148   61.152   61.152   61.152
      µ                   64.000    64.010   64.011   64.011   64.011
      σ                   2.291     2.160    2.160    2.160    2.160
      τ                   3.559     3.318    3.318    3.312    3.312
      log p(params.|y)    -61.604   -61.420  -61.420  -61.420  -61.420

Marginal maximization

• Recall the algebraic trick:

      p(µ, log σ, log τ|y) = p(θ, µ, log σ, log τ|y) / p(θ|µ, log σ, log τ, y)
                           ∝ τ Πj N(θj|µ, τ²) Πj Πi N(yij|θj, σ²) / Πj N(θj|θ̂j, Vj).

• Using θ̂ in place of θ:

      p(µ, log σ, log τ|y) ∝ τ Πj N(θ̂j|µ, τ²) Πj Πi N(yij|θ̂j, σ²) Πj Vj^{1/2},

  with θ̂j and Vj as earlier.

• We can maximize p(µ, log σ, log τ|y) using EM. Here, the (θj) are the "missing data".

• Steps in EM:
  1. Average over the missing data θ in the E-step.
  2. Maximize over (µ, log σ, log τ) in the M-step.
EM algorithm for the marginal mode

• The log joint posterior is

      log p(θ, µ, log σ, log τ|y) ∝ −n log σ − (J − 1) log τ
            − (1/(2τ²)) Σj (θj − µ)² − (1/(2σ²)) Σj Σi (yij − θj)².

• E-step: average over θ using the conditioning trick (and p(θ|rest)). We need two
  expectations:

      E_old[(θj − µ)²] = E[(θj − µ)² | µ_old, σ_old, τ_old, y]
                       = [E_old(θj − µ)]² + var_old(θj) = (θ̂j − µ)² + Vj.

  Similarly,

      E_old[(yij − θj)²] = (yij − θ̂j)² + Vj.

• M-step: maximize the expected log posterior (expectations just taken in the E-step)
  with respect to (µ, log σ, log τ). Differentiate and set to zero to get the maximizing
  values:

      µ_new = (1/J) Σj θ̂j,
      σ_new = [ (1/n) Σj Σi ((yij − θ̂j)² + Vj) ]^{1/2},
      τ_new = [ (1/(J − 1)) Σj ((θ̂j − µ_new)² + Vj) ]^{1/2}.

• Beginning from the joint mode, the algorithm converged in three iterations.

• It is important to check that the log posterior increases at each step; otherwise there
  is a programming or formulation error!

• Results:

      Parameter   Value at joint mode   First iter.   Second iter.   Third iter.
      µ           64.01                 64.01         64.01          64.01
      σ           2.17                  2.33          2.36           2.36
      τ           3.31                  3.46          3.47           3.47
Using EM results in simulation

• In a problem with standard form such as the normal-normal, we can do the following:
  1. Use EM to find marginal and conditional modes.
  2. Approximate the posterior at the mode using a normal or t approximation.
  3. Draw values of the parameters from p_approx, and act as if they were from p.

• Importance re-sampling can be used to improve the accuracy of the draws: for each draw,
  compute the importance weight, and re-sample the draws (without replacement) with
  probability proportional to the weights.

• In the diet and coagulation example, the approximate approach above is likely to
  produce quite reasonable results.

Gibbs sampling in the normal-normal case

• The Gibbs sampler can be easily implemented in the normal-normal example because all
  posterior conditionals are of standard form.

• Refer back to the Gibbs sampler discussion, and see the derivation of the conditionals
  in the hierarchical normal example; a small sketch follows below.

• The full conditionals are:
  – θj|all ∼ N (J of them)
  – µ|all ∼ N
  – σ²|all ∼ Inv-χ²
  – τ²|all ∼ Inv-χ².

• Starting values for the Gibbs sampler can be drawn from, e.g., a t4 approximation to
  the marginal and conditional modes.

• But one must be careful, especially with the scale parameters.
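A minimal R sketch of this Gibbs sampler for the coagulation data (not part of the
original notes). It uses the four full conditionals in the forms given in the
conditional-maximization section, with the prior p(µ, log σ, log τ) ∝ τ of the example;
the starting values are rough guesses.

```r
# Gibbs sampler for the normal hierarchical (coagulation) model.
y <- list(A = c(62,60,63,59),
          B = c(63,67,71,64,65,66),
          C = c(68,66,71,67,68,68),
          D = c(56,62,60,61,63,64,63,59))
J <- length(y); nj <- sapply(y, length); n <- sum(nj); ybarj <- sapply(y, mean)

n.iter <- 2000
theta <- matrix(0, n.iter, J); mu <- sigma2 <- tau2 <- numeric(n.iter)
theta[1, ] <- ybarj; mu[1] <- mean(ybarj); sigma2[1] <- 5; tau2[1] <- 10
for (t in 2:n.iter) {
  # theta_j | rest ~ N(theta.hat_j, V_j)
  Vj <- 1 / (1 / tau2[t-1] + nj / sigma2[t-1])
  theta.hat <- (mu[t-1] / tau2[t-1] + nj * ybarj / sigma2[t-1]) * Vj
  theta[t, ] <- rnorm(J, theta.hat, sqrt(Vj))
  # mu | rest ~ N(mean(theta), tau^2 / J)
  mu[t] <- rnorm(1, mean(theta[t, ]), sqrt(tau2[t-1] / J))
  # sigma^2 | rest ~ Inv-chi^2(n, sigma.hat^2)
  ss <- sum(unlist(mapply(function(yj, th) (yj - th)^2, y, theta[t, ])))
  sigma2[t] <- ss / rchisq(1, n)
  # tau^2 | rest ~ Inv-chi^2(J - 1, tau.hat^2)
  tau2[t] <- sum((theta[t, ] - mu[t])^2) / rchisq(1, J - 1)
}
keep <- (n.iter/2 + 1):n.iter
round(c(colMeans(theta[keep, ]), mean(mu[keep]),
        mean(sqrt(sigma2[keep])), mean(sqrt(tau2[keep]))), 2)
```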
Results from Gibbs sampler

• Multiple parallel chains for each parameter permit monitoring convergence using the
  potential scale reduction statistic.

• Summary of the posterior distributions in the coagulation example: posterior quantiles
  and estimated potential scale reductions, computed from the second halves of the Gibbs
  sampler sequences, each of length 1000.

• Potential scale reductions for τ and σ were computed on the log scale.

• The hierarchical variance, τ², is estimated less precisely than the unit-level
  variance, σ², as is typical in hierarchical models with a small number of batches.
Posterior quantiles from the Gibbs sampler (burn-in = 50% of chain length)

      Parameter                      2.5%     25.0%    50.0%    75.0%    97.5%
      θ1                             58.92    60.44    61.23    62.08    63.69
      θ2                             63.96    65.26    65.91    66.57    67.94
      θ3                             65.72    67.11    67.77    68.44    69.75
      θ4                             59.39    60.56    61.14    61.71    62.89
      µ                              55.64    62.32    63.99    65.69    73.00
      σ                              1.83     2.17     2.41     2.70     3.47
      τ                              1.98     3.45     4.97     7.76     24.60
      log p(µ, log σ, log τ|y)       -70.79   -66.87   -65.36   -64.20   -62.71
      log p(θ, µ, log σ, log τ|y)    -71.07   -66.88   -65.25   -64.00   -62.42

Checking convergence

      Parameter                      R̂        upper
      θ1                             1.001    1.003
      θ2                             1.001    1.003
      θ3                             1.000    1.002
      θ4                             1.000    1.001
      µ                              1.005    1.005
      σ                              1.000    1.001
      τ                              1.012    1.013
      log p(µ, log σ, log τ|y)       1.000    1.000
      log p(θ, µ, log σ, log τ|y)    1.000    1.001
[Figure: Chains for the diet means θ1, ..., θ4 (trace plots, iterations 500-1000)]

[Figure: Posteriors for the diet means θ1, ..., θ4 (histograms)]

[Figure: Chains for µ, σ, τ (trace plots, iterations 500-1000)]

[Figure: Posteriors for µ, σ, τ (histograms)]
Hierarchical modeling for meta-analysis

• Idea: summarize and integrate the results of research studies in a specific area.

• Example 5.6 in Gelman et al.: 22 clinical trials conducted to study the effect of
  beta-blockers on reducing mortality after cardiac infarction.

• Data are 22 2×2 tables: in the jth study, n0j and n1j are the numbers of individuals
  assigned to the control and treatment groups, respectively, and y0j and y1j are the
  numbers of deaths in each group.

• Sampling model for the jth experiment: two independent Binomials, with probabilities of
  death p0j and p1j.

• Possible quantities of interest:
  1. difference p1j − p0j
  2. probability ratio p1j/p0j
  3. odds ratio ρj = [p1j/(1 − p1j)] / [p0j/(1 − p0j)]

• We parametrize in terms of the log odds ratios, θj = log ρj, because the posterior is
  almost normal even for small samples.

• Considering the studies separately, there is no obvious effect of beta-blockers.
Normal approximation to the likelihood

• Consider estimating θj by the empirical logit:

      yj = log[ y1j / (n1j − y1j) ] − log[ y0j / (n0j − y0j) ].

• Approximate sampling variance (e.g., using a Taylor expansion):

      σj² = 1/y1j + 1/(n1j − y1j) + 1/y0j + 1/(n0j − y0j).

• See Table 5.4 for the estimated log-odds ratios and their estimated standard errors.

Raw data

      Study   Control          Treated          Log-odds,   sd,
      j       deaths   total   deaths   total   yj          σj
      1       3        39      3        38      0.0282      0.8503
      2       14       116     7        114     -0.7410     0.4832
      3       11       93      5        69      -0.5406     0.5646
      4       127      1520    102      1533    -0.2461     0.1382
      5       27       365     28       355     0.0695      0.2807
      6       6        52      4        59      -0.5842     0.6757
      7       152      939     98       945     -0.5124     0.1387
      8       48       471     60       632     -0.0786     0.2040
      9       37       282     25       278     -0.4242     0.2740
      10      188      1921    138      1916    -0.3348     0.1171
      11      52       583     64       873     -0.2134     0.1949
      12      47       266     45       263     -0.0389     0.2295
      13      16       293     9        291     -0.5933     0.4252
      14      45       883     57       858     0.2815      0.2054
      15      31       147     25       154     -0.3213     0.2977
      16      38       213     33       207     -0.1353     0.2609
      17      12       122     28       251     0.1406      0.3642
      18      6        154     8        151     0.3220      0.5526
      19      3        134     6        174     0.4444      0.7166
      20      40       218     32       209     -0.2175     0.2598
      21      43       364     27       391     -0.5911     0.2572
      22      39       674     22       680     -0.6081     0.2724
Goals of analysis
• Goal 1: if studies can be assumed to be exchangeable, we wish to estimate the mean of the distribution of effect sizes, or "overall average effect".
• Goal 2: the average effect size in each of the exchangeable studies.
• Goal 3: the effect size that could be expected if a new, exchangeable study were to be conducted.

Normal-normal model for meta-analysis
• First level: sampling distribution yj |θj, σj² ∼ N(θj, σj²)
• Second level: population distribution θj |µ, τ ∼ N(µ, τ²)
• Third level: prior for the hyperparameters µ, τ: p(µ, τ) = p(µ|τ)p(τ) ∝ 1, mostly for convenience. Can incorporate information if available.

Posterior quantiles of effect θj, normal approx. (on log-odds scale):

Study j   2.5%    25.0%   50.0%   75.0%   97.5%
1         -0.58   -0.32   -0.24   -0.15   0.13
2         -0.63   -0.36   -0.28   -0.20   -0.03
3         -0.64   -0.34   -0.26   -0.18   0.06
4         -0.44   -0.31   -0.25   -0.18   -0.04
5         -0.44   -0.28   -0.21   -0.12   0.13
6         -0.62   -0.36   -0.27   -0.19   0.04
7         -0.62   -0.44   -0.36   -0.27   -0.16
8         -0.43   -0.28   -0.20   -0.13   0.08
9         -0.56   -0.36   -0.27   -0.20   -0.05
10        -0.48   -0.35   -0.29   -0.23   -0.12
11        -0.47   -0.31   -0.24   -0.17   -0.01
12        -0.42   -0.28   -0.21   -0.12   0.09
13        -0.65   -0.37   -0.27   -0.20   0.02
14        -0.34   -0.22   -0.12   0.00    0.30
15        -0.54   -0.32   -0.26   -0.17   0.00
16        -0.50   -0.30   -0.23   -0.15   0.06
17        -0.46   -0.28   -0.21   -0.11   0.15
18        -0.53   -0.30   -0.22   -0.13   0.15
19        -0.51   -0.31   -0.22   -0.13   0.17
20        -0.51   -0.32   -0.24   -0.17   0.04
21        -0.67   -0.40   -0.30   -0.23   -0.09
22        -0.69   -0.40   -0.30   -0.22   -0.07

Posterior quantiles for the hyperparameters and the predicted effect:

Estimand                 2.5%    25.0%   50.0%   75.0%   97.5%
Mean, µ                  -0.38   -0.29   -0.25   -0.21   -0.12
Standard deviation, τ    0.01    0.08    0.13    0.18    0.32
Predicted effect, θ̃j     -0.57   -0.33   -0.25   -0.17   0.08

[Figure: marginal posterior density p(τ |y) for τ between 0 and 2.]
[Figures: conditional posterior means E(θj |τ, y) and conditional posterior standard deviations sd(θj |τ, y) for the 22 studies, plotted as functions of τ; histograms of 1000 simulations of θ2, θ18, θ7, and θ14; and the overall mean effect E(µ|τ, y) as a function of τ.]
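For reference, the conditional posterior quantities plotted above have closed forms in the normal-normal model. The R sketch below uses the standard precision-weighted formulas; the vectors `y` and `sigma` stand for the estimated log-odds ratios and standard errors from the raw-data table and are assumed inputs.

# Conditional posterior of mu and of each theta_j given tau (normal-normal model).
cond.post <- function(y, sigma, tau) {
  w <- 1 / (sigma^2 + tau^2)
  mu.hat <- sum(w * y) / sum(w)                      # E(mu | tau, y)
  V.mu <- 1 / sum(w)                                 # var(mu | tau, y)
  prec <- 1 / sigma^2 + 1 / tau^2
  theta.hat <- (y / sigma^2 + mu.hat / tau^2) / prec # E(theta_j | tau, y)
  theta.sd <- sqrt(1 / prec + ((1 / tau^2) / prec)^2 * V.mu)  # sd(theta_j | tau, y)
  list(mu.hat = mu.hat, theta.hat = theta.hat, theta.sd = theta.sd)
}
# e.g. cond.post(y = yj, sigma = sigmaj, tau = 0.13) for the 22 studies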
Example: Mixed model analysis
• Sheffield Food Company produces dairy products.
• The government cited the company because the actual content of fat in yogurt sold by Sheffield appears to be higher than the label amount.
• Sheffield believes that the discrepancy is due to the method employed by the government to measure fat content, and conducted a multi-laboratory study to investigate.
• Four laboratories in the United States were randomly chosen.
• Each laboratory received 12 carefully mixed samples of yogurt, with instructions to analyze 6 using the government method and 6 using Sheffield's method.
• Fat content in all samples was known to be very close to 3%.
• Because of technical difficulties, none of the labs managed to analyze all six samples using the government method within the allotted time.

Lab     Government method              Sheffield method
Lab 1   5.19, 5.09                     3.26, 3.38, 3.24, 3.41, 3.35, 3.04
Lab 2   4.09, 3.0, 3.75, 4.04, 4.06    3.02, 3.32, 2.83, 2.96, 3.23, 3.07
Lab 3   4.62, 4.32, 4.35, 4.59         3.08, 2.95, 2.98, 2.74, 3.07, 2.70
Lab 4   3.71, 3.86, 3.79, 3.63         2.98, 2.89, 2.75, 3.04, 2.88, 3.20
• We fitted a mixed linear model to the data, where
  – Method is a fixed effect with two levels
  – Laboratory is a random effect with four levels
  – The method by laboratory interaction is random with six levels
• The model is
  yijk = αi + βj + (αβ)ij + eijk,
  with αi = µ + γi and eijk ∼ N(0, σ²).
• Priors:
  p(αi) ∝ 1
  p(βj |σβ²) ∝ N(0, σβ²)
  p((αβ)ij |σαβ²) ∝ N(0, σαβ²)
• The three variance components were assigned diffuse inverted gamma priors.

[Figures: WinBUGS trace plots for alpha[1] and alpha[2] (chains 1:3, iterations 5001–10000).]
[Figures: WinBUGS output for the mixed model — posterior density plots (sample: 15000), autocorrelation plots, and trace plots for beta[1]–beta[4] (chains 1:3), and box plots of the posteriors for alpha and beta.]
Model checking
• What do we need to check?
  – Model fit: does the model fit the data?
  – Sensitivity to prior and other assumptions
  – Model selection: is this the best model?
  – Robustness: do conclusions change if we change the data?
• Remember: models are never true; they may just fit the data well and allow for useful inference.
• Classical approaches to model checking:
  – Do parameter estimates make sense?
  – Does the model generate data like the observed sample?
  – Are predictions reasonable?
  – Is the model "best" in some sense (e.g., AIC, likelihood ratio, etc.)?
• Bayesian approach to model checking:
  – Does the posterior distribution of the parameters correspond with what we know from subject-matter knowledge?
  – The predictive distribution for future data must also be consistent with substantive knowledge.
  – Future data generated by the predictive distribution are compared to the current sample.
• Model checking strategies must address various parts of the model:
  – priors
  – sampling distribution
  – hierarchical structure
  – sensitivity to the prior and other model components
  – other model characteristics such as covariates, form of dependence between the response variable and covariates, etc.
Posterior predictive model checking
• The most popular model checking approach has a frequentist flavor: we generate replications of the sample from the posterior and observe the behavior of sample summaries over repeated sampling.
• The basic idea is that data generated from the model must look like the observed data.
• Posterior predictive model checks are based on replicated data y rep generated from the posterior predictive distribution:
  p(y rep|y) = ∫ p(y rep|θ) p(θ|y) dθ.
• y rep is not the same as ỹ. Predictive outcomes ỹ can be anything (e.g., a regression prediction using different covariates x̃). But y rep is a replication of y.
• y rep are data that could be observed if we repeated the exact same experiment again tomorrow (if in fact the θs in our analysis gave rise to the data y).
• Definition of a replicate in a hierarchical model:
  p(φ|y) → p(θ|φ, y) → p(y rep|θ) : rep from the same units
  p(φ|y) → p(θ|φ) → p(y rep|θ) : rep from new units
• Test quantity: T(y, θ) is a discrepancy statistic used as a standard.
• We use T(y, θ) to determine the discrepancy between model and data on some specific aspect we wish to check.
• Example of T(y, θ): the proportion of standardized residuals outside of (−3, 3) in a regression model, to check for outliers.
• In classical statistics, we use T(y), a test statistic that depends only on the data; this is a special case of the Bayesian T(y, θ).
• For model checking:
  – Determine an appropriate T(y, θ)
  – Compare the posterior predictive distribution of T(y rep, θ) to the posterior distribution of T(y, θ)
Bayes p-values
• p-values attempt to measure tail-area probabilities.
• Classical definition:
  class p-value = Pr(T(y rep) ≥ T(y) | θ)
  – The probability is taken over the distribution of y rep with θ fixed.
  – A point estimate θ̂ is typically used to compute the p-value.
• Posterior predictive p-values:
  Bayes p-value = Pr(T(y rep, θ) ≥ T(y, θ) | y)
• The probability is taken over the joint posterior distribution of (θ, y rep):
  Bayes p-value = ∫∫ I{T(y rep, θ) ≥ T(y, θ)} p(θ|y) p(y rep|θ) dθ dy rep
Relation between p-values
• Small example:
  – Sample y1, ..., yn ∼ N(µ, σ²)
  – Fit N(0, σ²) and check whether µ = 0 is a good fit
  – Test statistic is the sample mean, T(y) = ȳ
• This is a special case: it is not always possible to get rid of nuisance parameters. In real life, we would fit the more general N(µ, σ²) and decide whether µ = 0 is plausible.
• Classical approach:
  p-value = Pr(T(y rep) ≥ T(y) | σ²) = Pr(ȳ rep ≥ ȳ | σ²)
• Bayes approach:
  p-value = Pr(T(y rep) ≥ T(y) | y)
          = ∫∫ I{T(y rep) ≥ T(y)} p(y rep|σ²) p(σ²|y) dy rep dσ²
          = Pr( √n ȳ rep/S ≥ √n ȳ/S | y ) = P( t_{n−1} ≥ √n ȳ/S )
• Note that
  ∫ I{T(y rep) ≥ T(y)} p(y rep|σ²) dy rep = Pr(T(y rep) ≥ T(y) | σ²) = classical p-value.
• Then
  Bayes p-value = E{classical p-value | y},
  where the expectation is taken with respect to p(σ²|y).
• In this example, the classical p-value and the Bayes p-value are the same.
• In general, the Bayes approach can easily handle nuisance parameters.

Interpreting posterior predictive p-values
• We look for tail-area probabilities that are not too small or too large.
• The ideal posterior predictive p-value is 0.5: the test quantity falls right in the middle of its posterior predictive distribution.
• Posterior predictive p-values are actual posterior probabilities.
• Wrong interpretation: Pr(model is true | data).
Example: Independence of Bernoulli trials
• In sample, T (y) = 3.
• Sequence of binary outcomes y1, ..., yn modeled as iid Bernoulli trials
with probability of success θ.
• Uniform prior on θ leads to p(θ|y) ∝ θs(1 − θ)n−s with s =
• Sample:
P
• Posterior is Beta(8,14).
• To test assumption of independence, do:
1. For j = 1, ..., M draw θj from Beta(8,14).
rep,j
rep,j
, ..., y20 } independent Bernoulli variables with
2. Draw {y1
j
probability θ .
3. In each of the M replicate samples, compute T (y), the number of
switches between 0 and 1.
p = Prob(T (y rep) ≥ T (y)|y) = 0.98.
yi .
1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
• Is the assumption of independence warranted?
B
• If observed trials were independent, expected number of switches in 20
trials is approximately 8 and 3 is very unlikely.
• Consider discrepancy statistic
T (y, θ) = T (y) = number of switches between 0 and 1.
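A minimal R sketch of this check, using the Beta(8, 14) posterior and the sample above:

y <- c(1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0)
T.obs <- sum(diff(y) != 0)            # observed number of switches (3)
M <- 10000
T.rep <- numeric(M)
for (j in 1:M) {
  theta <- rbeta(1, 8, 14)            # draw theta from the posterior
  y.rep <- rbinom(20, 1, theta)       # replicate data under independence
  T.rep[j] <- sum(diff(y.rep) != 0)   # switches in the replicate
}
mean(T.rep >= T.obs)                  # Bayesian p-value, about 0.98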
[Figure: posterior predictive distribution of the number of switches.]
Example: Newcomb’s speed of light
• Newcomb obtained 66 measurements of the speed of light. (Chapter
3). One measurement of -44 is a potential outlier under a normal
model.
• From data: ȳ = 26.21 and s = 10.75.
• Model: yi|µ, σ² ∼ N(µ, σ²), with prior p(µ, σ²) ∝ σ⁻².
• Posteriors:
  p(σ²|y) = Inv-χ²(65, s²)
  p(µ|σ², y) = N(ȳ, σ²/66)
  or p(µ|y) = t65(ȳ, s²/66).
Example: Newcomb’s data
• For posterior predictive checks, do for i = 1, ..., M
(i)
2(i)
2
1. Generate (µ , σ ) from p(µ, σ |y)
rep(i)
rep(i)
from N (µ(i), σ 2(i))
, ..., y66
2. Generate y1
Example: Replicated datasets
• Is the observed minimum value of -44 consistent with the model?
  – Define T(y) = min{yi}
  – Get the distribution of T(y rep) = min{yi rep}.
  – The large negative value of -44 is inconsistent with the distribution of T(y rep): the observed T(y) is very unlikely under it.
  – The model is inadequate; it must account for the long tail to the left.
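A small R sketch of steps 1–2 and of this smallest-observation check, using ȳ = 26.21, s = 10.75 and n = 66 from the slides:

n <- 66; ybar <- 26.21; s <- 10.75
M <- 1000
T.min <- numeric(M)
for (i in 1:M) {
  sigma2 <- (n - 1) * s^2 / rchisq(1, n - 1)   # sigma^2 | y ~ Inv-chi^2(65, s^2)
  mu <- rnorm(1, ybar, sqrt(sigma2 / n))       # mu | sigma^2, y
  y.rep <- rnorm(n, mu, sqrt(sigma2))          # replicated dataset
  T.min[i] <- min(y.rep)                       # T(y rep) = smallest observation
}
mean(T.min <= -44)                             # how often replicates reach -44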
Posterior predictive of smallest observation
• Is the sampling variance captured by the model?
  – Define T(y) = s²y = (1/65) Σi (yi − ȳ)²
  – Obtain the distribution of T(y rep).
  – The observed variance of 115 is very consistent with the distribution (across reps) of sample variances.
  – The probability of observing a larger sample variance is 0.48: the observed variance is in the middle of the distribution of T(y rep).
  – But this is meaningless: T(y) is a sufficient statistic for σ², so the model MUST have fit it well.
• Symmetry in center of distribution
– Look at difference in the distance of 10th and 90th percentile from
the mean.
– Approx 10th percentile is 6th order statistic y(6), and approx 90th
percentile is y(61).
  – Define T(y, θ) = |y(61) − θ| − |y(6) − θ|
  – Need the joint distribution of T(y, θ) and of T(y rep, θ).
  – The model accounts for symmetry in the center of the distribution.
  – The joint distribution of T(y, θ) and T(y rep, θ) is about evenly distributed along the line T(y, θ) = T(y rep, θ).
  – The probability that T(y rep, θ) > T(y, θ) is about 0.26, plausible under sampling variability.
Posterior predictive of variance and symmetry
Choice of discrepancy measure
• Often, we choose more than one measure, to investigate different attributes of the model.
  – Can smoking behavior in adolescents be predicted using information on covariates? (Example on page 172 of Gelman et al.)
  – Data: six observations of smoking habits in 2,000 adolescents, every six months.
• Two models:
  – Logistic regression model
  – Latent-class model: propensity to smoke modeled as a nonlinear function of the predictors. Conditional on smoking, fit the same logistic regression model as above.
• Three different test statistics chosen:
  – % who never smoked
  – % who always smoked
  – % of "incident smokers": began not smoking but then switched to smoking and continued doing so.
• Generate replicate datasets from both models.
• Compute the three test statistics in each of the replicated datasets.
• Compare the posterior predictive distributions of each of the three statistics with the value of the statistic in the observed dataset.

                              Model 1 (logistic regression)      Model 2 (latent-class)
Test variable         T(y)    95% CI for T(y rep)   p-value      95% CI for T(y rep)   p-value
% never smokers       77.3    [75.5, 78.2]          0.27         [74.8, 79.9]          0.53
% always-smokers      5.1     [5.0, 6.5]            0.95         [3.8, 6.3]            0.44
% incident smokers    8.4     [5.3, 7.9]            0.005        [4.9, 7.8]            0.004

• Results: Model 2 is slightly better than Model 1 at predicting the % of always smokers, but both models fail to predict the % of incident smokers.
Omnibus tests
• It is often useful to also consider summary statistics:
  χ²-discrepancy: T(y, θ) = Σi (yi − E(yi|θ))² / var(yi|θ)
  Deviance: T(y, θ) = −2 log p(y|θ)
• The deviance is proportional to the mean squared error if the model is normal with constant variance.
• In the classical approach, insert θnull or θmle in place of θ in T(y, θ) and compare the test statistic to a reference distribution derived under asymptotic arguments.
• In the Bayesian approach, the reference distribution for the test statistic is automatically calculated from the posterior predictive simulations.
• A Bayesian χ² test is carried out as follows:
  – Compute the distribution of T(y, θ) for many draws of θ from the posterior and for the observed data.
  – Compute the distribution of T(y rep, θ) for replicated datasets and posterior draws of θ.
  – Compute the Bayesian p-value: the probability of observing a more extreme value of T(y rep, θ).
  – Calculate the p-value empirically from the simulations over θ and y rep.
Criticisms of posterior predictive checks
• Too conservative because the data get used twice.
• Posterior predictive p-values are not really p-values: the distribution under the null hypothesis is not uniform, as it should be.
• What is high/low if pp p-values are not uniform?
• Some Bayesians object to the frequentist slant of using unobserved data for anything.
• There is much recent work on improving posterior predictive p-values: Bayarri and Berger, JASA, 2000, is a good reference.
• In spite of the criticisms, pp checks are easy to use in very general cases, and intuitively appealing.

Pps can be conservative
• One criticism of pp model checks is that they tend to reject models only in the face of extreme evidence.
• Example (from Stern):
  – y ∼ N(µ, 1) and µ ∼ N(0, 9).
  – Observation: yobs = 10
  – Posterior: p(µ|y) = N(0.9 yobs, 0.9) = N(9, 0.9)
  – Posterior predictive dist: N(9, 1.9)
  – Posterior predictive p-value is 0.23; we would not reject the model.
• The effect of the prior is minimized because 9 > 1 and the posterior predictive mean is "close" to the observed datapoint.
• To reject, we would need to observe y ≥ 23.
Graphical posterior predictive checks
• The idea is to display the data together with simulated data.
• Three kinds of checks:
  – Direct data display
  – Display of data summaries or parameter estimates
  – Graphs of residuals or other measures of discrepancy

Model comparison
• Which model fits the data best?
• Often, models are nested: a model with parameters θ is nested within a model with parameters (θ, φ).
• Comparison involves deciding whether adding φ to the model improves its fit. Improvement in fit may not justify the additional complexity.
• It is also possible to compare non-nested models.
• We focus on predictive performance and on model posterior probabilities to compare models.
Expected deviance
• We compare the observed data to several models to see which predicts more accurately.
• We summarize model fit using the deviance
  D(y, θ) = −2 log p(y|θ).
• It can be shown that the model with the lowest expected deviance is best in the sense of minimizing the (Kullback-Leibler) distance between the model p(y|θ) and the true distribution of y, f(y).
• We compute the expected deviance by simulation:
  Davg(y) = E_{θ|y}(D(y, θ)|y).
• An estimate is
  D̂avg(y) = (1/M) Σj D(y, θ^j).
Deviance information criterion - DIC
• The idea is to estimate the error that would be expected when applying the model to future data.
• Expected mean squared predictive error:
  Davg^pred = E[ (1/n) Σi (yi rep − E(yi rep|y))² ].
• A model that minimizes the expected predictive deviance is best in the sense of out-of-sample predictive power.
• An approximation to Davg^pred is the Deviance Information Criterion, or DIC:
  DIC = D̂avg^pred = 2 D̂avg(y) − Dθ̂(y),
  where Dθ̂(y) = D(y, θ̂(y)) and θ̂(y) is a point estimator of θ such as the posterior mean.
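A minimal R sketch of the DIC computation from posterior draws, for a normal sampling model; the data and the stand-in posterior draws are toy assumptions, not the course example.

deviance <- function(y, mu, sigma) -2 * sum(dnorm(y, mu, sigma, log = TRUE))
set.seed(1)
y <- rnorm(30, 1, 2)                                 # toy data
mu.draws <- rnorm(2000, mean(y), sd(y) / sqrt(30))   # stand-in posterior draws of mu
sig.draws <- rep(sd(y), 2000)                        # sigma held fixed for simplicity
D.draws <- mapply(function(m, s) deviance(y, m, s), mu.draws, sig.draws)
D.avg.hat <- mean(D.draws)                           # estimated expected deviance
D.hat <- deviance(y, mean(mu.draws), mean(sig.draws))  # deviance at the posterior mean
DIC <- 2 * D.avg.hat - D.hat
DIC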
Example: SAT study
• We compare three models: no pooling, complete pooling, and the hierarchical model (partial pooling).
• The deviance is
  D(y, θ) = −2 Σj log N(yj |θj, σj²) = log(2πJσ²) + Σj (yj − θj)²/σj².
• The DIC for the three models were: 70.3 (no pooling), 61.5 (complete pooling), 63.4 (hierarchical model).
• Based on DIC, we would pick the complete pooling model.
• We still prefer the hierarchical model because the assumption that all schools have identical means is strong.
Bayes Factors
• Suppose that we wish to decide between two models M1 and M2 (different prior, sampling distribution, parameters).
• Priors on the models are p(M1) and p(M2) = 1 − p(M1).
• The posterior odds favoring M1 over M2 are
  p(M1|y)/p(M2|y) = [p(y|M1)/p(y|M2)] × [p(M1)/p(M2)].
• The ratio p(y|M1)/p(y|M2) is called a Bayes factor. It tells us how much the data support one model over the other.
• The BF12 is computed as
  BF12 = ∫ p(y|θ1, M1) p(θ1|M1) dθ1 / ∫ p(y|θ2, M2) p(θ2|M2) dθ2.
• Note that we need p(y) under each model to compute p(θi|Mi). Therefore, the BF is only defined when the marginal distribution of the data p(y) is proper, and therefore when the prior distribution in each model is proper.
• For example, if y ∼ N(θ, 1) with p(θ) ∝ 1, we get
  p(y) ∝ ∫ (2π)^(−1/2) exp{−(1/2)(y − θ)²} dθ = 1,
  which is constant for any y and thus is not a proper distribution. This creates an indeterminacy, because we can increase or decrease the BF simply by using a different constant in the numerator and denominator.
Computation of Bayes Factors
• The difficulty lies in the computation of p(y):
  p(y) = ∫ p(y|θ) p(θ) dθ.
• The simplest approach is to draw many values of θ from p(θ) and get a Monte Carlo approximation to the integral.
• This may not work well because the θ's from the prior may not come from the region of the parameter space where the sampling distribution has most of its mass.
• A better Monte Carlo approximation is similar to importance sampling. Note that
  p(y)^(−1) = ∫ [h(θ)/p(y)] dθ = ∫ [h(θ) / (p(y|θ) p(θ))] p(θ|y) dθ.
• h(θ) can be anything (e.g., a normal approximation to the posterior).
• Draw values of θ from the posterior and evaluate the integral numerically.
• Computations can get tricky because the denominator can be small.
• As n → ∞,
  log(BF) ≈ log p(y|θ̂2, M2) − log p(y|θ̂1, M1) − (1/2)(d1 − d2) log(n),
  where θ̂i is the posterior mode under model Mi and di is the dimension of θi.
• Notice that the criterion penalizes a model for additional parameters.
• Using log(BF) is equivalent to ranking models using the BIC criterion, given by
  BIC = − log p(y|θ̂, M) + (1/2) d log(n).
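As an illustration of the "simplest approach" above, the following R sketch estimates p(y) by averaging the likelihood over prior draws, in a toy Beta-Binomial setting where the exact value is known; the data values are placeholders.

set.seed(2)
y <- 7; n <- 20                            # toy data: 7 successes in 20 trials
theta <- rbeta(1e5, 1, 1)                  # draws from the uniform prior
p.y.mc <- mean(dbinom(y, n, theta))        # Monte Carlo estimate of p(y)
p.y.exact <- 1 / (n + 1)                   # exact marginal under a uniform prior
c(p.y.mc, p.y.exact)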
Ordinary Linear Regression - Introduction
• Question of interest: how does an outcome y vary as a function of a vector of covariates X?
• We want the conditional distribution p(y|θ, x), where the observations (y, x)i are assumed exchangeable.
• Covariates (x1, x2, ..., xk) can be discrete or continuous, and typically x1 = 1 for all n units. The n × k matrix X is the model matrix.
• The most common version of the model is the normal linear model
  E(yi|β, X) = Σ_{j=1}^k βj xij
• In ordinary linear regression models, the conditional variance is equal across observations: var(yi|θ, X) = σ². Thus θ = (β1, β2, ..., βk, σ²).

Ordinary Linear Regression - Model
• Likelihood: for the ordinary normal linear regression model, we have
  p(y|X, β, σ²) = N(Xβ, σ²In),
  with In the n × n identity matrix.
• Priors: a non-informative prior distribution for (β, σ²) is
  p(β, σ²) ∝ σ⁻²
  (more on informative priors later).
• Joint posterior:
  p(β, σ²|y) ∝ (σ²)^(−n/2 − 1) exp{ −(1/2) σ⁻² (y − Xβ)'(y − Xβ) }
• Consider p(β, σ²|y) = p(β|σ², y) p(σ²|y).
• Conditional posterior for β: expand and complete the squares in p(β|σ², y) (viewed as a function of β). We get:
  p(β|σ², y) ∝ exp{ −(1/(2σ²)) [β'X'Xβ − 2β̂'X'Xβ] }
            ∝ exp{ −(1/(2σ²)) (β − β̂)'X'X(β − β̂) },
  for β̂ = (X'X)⁻¹X'y.
• Then:
  β|σ², y ∼ N(β̂, σ²(X'X)⁻¹)
Ordinary Linear Regression - p(σ²|y)
• The marginal posterior distribution for σ² is obtained by integrating the joint posterior with respect to β:
  p(σ²|y) = ∫ p(β, σ²|y) dβ
          ∝ (σ²)^(−n/2 − 1) ∫ exp{ −(1/(2σ²)) (y − Xβ)'(y − Xβ) } dβ
• By expanding the square in the integrand, and adding and subtracting 2y'X(X'X)⁻¹X'y, we can write:
  p(σ²|y) ∝ (σ²)^(−n/2 − 1) ∫ exp{ −(1/(2σ²)) [ (y − Xβ̂)'(y − Xβ̂) + (β − β̂)'X'X(β − β̂) ] } dβ
          ∝ (σ²)^(−n/2 − 1) exp{ −(1/(2σ²)) (n − k)S² } ∫ exp{ −(1/(2σ²)) (β − β̂)'X'X(β − β̂) } dβ,
  where (n − k)S² = (y − Xβ̂)'(y − Xβ̂).
• The integrand is the kernel of a k-dimensional normal, so the result of the integration is proportional to (σ²)^(k/2). Then
  p(σ²|y) ∝ (σ²)^(−(n−k)/2 − 1) exp{ −(n − k)S² / (2σ²) },
  proportional to an Inv-χ²(n − k, S²) density.
Ordinary Linear Regression - Notes
• Note that β̂ and S² are the MLEs of β and σ².
• When is the posterior proper? p(β, σ²|y) is proper if
  1. n > k
  2. rank(X) = k: the columns of X are linearly independent, or |X'X| ≠ 0.
• To sample from the joint posterior:
  1. Draw σ² from Inv-χ²(n − k, S²)
  2. Given σ², draw the vector β from N(β̂, σ²(X'X)⁻¹)
• For efficiency, compute β̂, S², and (X'X)⁻¹ once, before starting the repeated drawing.

Regression - Posterior predictive distribution
• In regression, we often want to predict the outcome ỹ for a new set of covariates x̃. Thus, we wish to draw values from p(ỹ|y, X̃).
• By simulation:
  1. Draw σ² from Inv-χ²(n − k, S²)
  2. Draw β from N(β̂, σ²(X'X)⁻¹)
  3. Draw ỹi for i = 1, ..., m from N(x̃i'β, σ²)
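A minimal R sketch of this sampling scheme, including the posterior predictive draws described just above; X, y and the new covariate row x.tilde are simulated placeholders rather than course data.

set.seed(3)
n <- 50; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), n)); y <- X %*% c(1, 2, -1) + rnorm(n)
x.tilde <- c(1, 0.5, -0.2)
XtX.inv <- solve(t(X) %*% X)
beta.hat <- XtX.inv %*% t(X) %*% y
S2 <- sum((y - X %*% beta.hat)^2) / (n - k)                    # computed once
M <- 5000; y.tilde <- numeric(M)
for (m in 1:M) {
  sigma2 <- (n - k) * S2 / rchisq(1, n - k)                    # sigma^2 | y
  beta <- beta.hat + t(chol(sigma2 * XtX.inv)) %*% rnorm(k)    # beta | sigma^2, y
  y.tilde[m] <- rnorm(1, sum(x.tilde * beta), sqrt(sigma2))    # predictive draw
}
quantile(y.tilde, c(0.025, 0.5, 0.975))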
Regression - Posterior predictive distribution
• We can derive p(ỹ|y) in steps, first considering p(ỹ|y, σ²).
• Note that
  p(ỹ|y, σ²) = ∫ p(ỹ|β, σ²) p(β|σ², y) dβ
  is normal because the exponential in the integrand is a quadratic function in (β, ỹ).
• To get the mean and variance, use the conditioning trick:
  E(ỹ|σ², y) = E[ E(ỹ|β, σ², y) | σ², y ] = E(X̃β|σ², y) = X̃β̂
  var(ỹ|σ², y) = E[ var(ỹ|β, σ², y) | σ², y ] + var[ E(ỹ|β, σ², y) | σ², y ]
               = E[σ²I|σ², y] + var[X̃β|σ², y]
               = σ²(I + X̃(X'X)⁻¹X̃'),
  where the inner expectation averages over ỹ conditional on β and the outer averages over β.
• The variance has two terms: σ²I is sampling variation, and σ²X̃(X'X)⁻¹X̃' is due to uncertainty about β.
• To complete the specification of p(ỹ|y), integrate p(ỹ|y, σ²) with respect to the marginal posterior distribution of σ².
• The result is:
  p(ỹ|y) = t_{n−k}( X̃β̂, S²[I + X̃(X'X)⁻¹X̃'] )
Regression example: radon measurements in
Minnesota
• Radon measurements yi were taken in three counties in Minnesota: Blue Earth, Clay and Goodhue.
• 14 houses in each of Blue Earth and Clay county and 13 houses in Goodhue were sampled.
• Measurements were taken in the basement and in the first floor.
• We fit an ordinary regression model to the log radon measurements, without an intercept.
• We define dummy variables as follows: X1 = 1 if the county is Blue Earth and is 0 otherwise. Similarly, X2 and X3 are dummies for Clay and Goodhue counties, respectively.
• X4 = 1 if the measurement was taken in the first floor.
• The model is
  log(yi) = β1x1i + β2x2i + β3x3i + β4x4i + ei,
  with ei ∼ N(0, σ²).
• Thus
  E(y|Blue Earth, basement) = exp(β1)
  E(y|Blue Earth, first floor) = exp(β1 + β4)
  E(y|Clay, basement) = exp(β2)
  E(y|Clay, first floor) = exp(β2 + β4)
  E(y|Goodhue, basement) = exp(β3)
  E(y|Goodhue, first floor) = exp(β3 + β4)
• We used noninformative N(0, 1000) priors for the regression coefficients and a noninformative Gamma(0.01, 0.01) prior for the error variance.

[Figures: WinBUGS trace plots for beta[1]–beta[4] (iterations 3001–5000) and box plots of the posteriors for beta.]
[Figures: posterior densities (2000 samples each) of the expected radon levels for each county-by-floor combination (becb, becff, ccb, ccff, gcb, gcff).]

Regression - Posterior predictive checks
• For regression models, there are well-known methods for checking the model and assumptions using estimated residuals. Residuals:
  ei = yi − xi'β
• Two useful test statistics are:
  – Proportion of outliers among the residuals
  – Correlation between the squared residuals and the fitted values ŷ
• Posterior predictive distributions of both statistics can be obtained via simulation.
• Define a standardized residual as
  ei = (yi − xi'β)/σ
• If the normal model is correct, the standardized residuals should be approximately N(0, 1); therefore, |ei| > 3 suggests that the ith observation may be an outlier.
• To derive the posterior predictive distribution for the proportion of outliers q and for ρ, the correlation between (e², ŷ), do:
  1. Draw (σ², β) from the joint posterior distribution
  2. Draw y rep from N(Xβ, σ²I) given the existing X
  3. Run the regression of y rep on X, and save the residuals
  4. Compute ρ
  5. Compute the proportion of "large" standardized residuals, q
  6. Repeat for another y rep
• The approach is frequentist in nature: we act as if we could repeat the experiment many times.
• Inspection of the posterior predictive distributions of ρ and q provides information about model adequacy.
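A hedged R sketch of these six steps; the design matrix X and data y below are simulated stand-ins rather than the radon data.

set.seed(4)
n <- 41; k <- 4
X <- cbind(1, matrix(rnorm(n * (k - 1)), n))
y <- as.numeric(X %*% c(1, 0.5, -0.5, 0.2) + rnorm(n))
XtX.inv <- solve(t(X) %*% X); beta.hat <- XtX.inv %*% t(X) %*% y
S2 <- sum((y - X %*% beta.hat)^2) / (n - k)
M <- 2000; q.rep <- rho.rep <- numeric(M)
for (m in 1:M) {
  sigma2 <- (n - k) * S2 / rchisq(1, n - k)                      # step 1
  beta <- beta.hat + t(chol(sigma2 * XtX.inv)) %*% rnorm(k)
  y.rep <- as.numeric(X %*% beta + rnorm(n, 0, sqrt(sigma2)))    # step 2
  fit <- lm(y.rep ~ X - 1)                                       # step 3
  e.std <- resid(fit) / summary(fit)$sigma
  rho.rep[m] <- cor(resid(fit)^2, fitted(fit))                   # step 4
  q.rep[m] <- mean(abs(e.std) > 3)                               # step 5
}
quantile(q.rep, c(0.025, 0.975)); quantile(rho.rep, c(0.025, 0.975))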
• Results of the posterior predictive checks suggest that there are no outliers and that the correlation between the squared residuals and the fitted values is negligible.
• The 95% credible set for the proportion of absolute residuals (in the 41 observations) above 3 was (0, 4.88) with a mean of 0.59% and a median of 0.
• The 95% credible set for ρ was (−0.31, 0.32) with a mean of 0.011.
• Notice that the posterior distribution of the proportion of outliers is skewed. This sometimes happens when the quantity of interest is bounded and most of the mass of its distribution is close to the boundary.

Regression with unequal variances
• Consider now the case where y ∼ N(Xβ, Σy), with Σy ≠ σ²I.
• The covariance matrix Σy is n × n and has n(n + 1)/2 distinct parameters, and cannot be estimated from n observations. We must either specify Σy or assign it an informative prior distribution.
• Typically, some structure is imposed on Σy to reduce the number of free parameters.
Regression - known Σy
• As before, let p(β) ∝ 1 be the non-informative prior.
• Since Σy is positive definite and symmetric, it has an upper-triangular square root matrix (Cholesky factor) Σy^(1/2) such that Σy^(1/2) Σy^(1/2)' = Σy, so that if
  y = Xβ + e,  e ∼ N(0, Σy),
  then
  Σy^(−1/2) y = Σy^(−1/2) Xβ + Σy^(−1/2) e,  with Σy^(−1/2) e ∼ N(0, I).
• With Σy known, just proceed as in ordinary linear regression, but use the transformed y and X as above and fix σ² = 1. Algebraically, this is equivalent to computing
  β̂ = (X'Σy⁻¹X)⁻¹ X'Σy⁻¹ y
  Vβ = (X'Σy⁻¹X)⁻¹
• Note: by using the Cholesky factor you avoid computing the n × n inverse Σy⁻¹.
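A short R sketch of this Cholesky-based transformation; Sigma.y is a hypothetical known covariance matrix supplied by the user.

gls.cholesky <- function(y, X, Sigma.y) {
  R <- chol(Sigma.y)                            # upper triangular, t(R) %*% R = Sigma.y
  y.star <- backsolve(R, y, transpose = TRUE)   # Sigma^{-1/2} y
  X.star <- backsolve(R, X, transpose = TRUE)   # Sigma^{-1/2} X
  V.beta <- solve(t(X.star) %*% X.star)         # = (X' Sigma^{-1} X)^{-1}
  beta.hat <- V.beta %*% t(X.star) %*% y.star   # = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
  list(beta.hat = beta.hat, V.beta = V.beta)
}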
Prediction of new ỹ with known Σy
• Even if we know Σy, prediction of new observations is more complicated: we must know the covariance matrix of the old and new data.
• Example: heights of children from the same family are correlated. To predict the height of a new child when a brother is in the old dataset, we must include that correlation in the prediction.
• ỹ are ñ new observations given an ñ × k matrix of regressors X̃.
• The joint distribution of (y, ỹ) is
  (y, ỹ) | X, X̃, θ ∼ N( (Xβ, X̃β), [ Σy  Σyỹ ; Σỹy  Σỹỹ ] )
• Then
  ỹ | y, β, Σy ∼ N(µỹ, Vỹ),
  where
  µỹ = X̃β + Σyỹ Σy⁻¹ (y − Xβ)
  Vỹ = Σỹỹ − Σyỹ Σy⁻¹ Σỹy
Regression - unknown Σy
• To draw inferences about (β, Σy), we proceed in steps: derive p(β|Σy, y) and then p(Σy|y).
• Let p(β) ∝ 1 as before.
• We know that
  p(Σy|y) = p(β, Σy|y) / p(β|Σy, y) ∝ p(Σy) p(y|β, Σy) / p(β|Σy, y) ∝ p(Σy) N(y|β, Σy) / N(β|β̂, Vβ),
  where (β̂, Vβ) depend on Σy.
• The expression for p(Σy|y) must hold for any β, so we set β = β̂.
• Note that p(β|Σy, y) = N(β̂, Vβ), whose value at β = β̂ is proportional to |Vβ|^(−1/2). Then
  p(Σy|y) ∝ p(Σy) |Vβ|^(1/2) |Σy|^(−1/2) exp[ −(1/2) (y − Xβ̂)'Σy⁻¹(y − Xβ̂) ].
• In principle, p(Σy|y) could be evaluated for a range of values of Σy. However:
  1. It is difficult to determine a prior for an n × n unstructured matrix.
  2. β̂ and Vβ depend on Σy, and it is very difficult to draw values of Σy from p(Σy|y).
• We need to put some structure on Σy.
Regression - Σy = σ²Qy
• Suppose that we know Σy up to a scalar factor σ².
• The non-informative prior on (β, σ²) is p(β, σ²) ∝ σ⁻².
• Results follow directly from ordinary linear regression, by using the transformations Qy^(−1/2)y and Qy^(−1/2)X, which is equivalent to computing
  β̂ = (X'Qy⁻¹X)⁻¹ X'Qy⁻¹ y
  Vβ = (X'Qy⁻¹X)⁻¹
  s² = (n − k)⁻¹ (y − Xβ̂)'Qy⁻¹(y − Xβ̂)
• To estimate the joint posterior distribution p(β, σ²|y), do:
  1. Draw σ² from Inv-χ²(n − k, s²)
  2. Draw β from N(β̂, Vβσ²)
• In large datasets, use the Cholesky factor transformation and unweighted regression to avoid computation of Qy⁻¹.
Regression - other covariance structures

Weighted regression
• In some applications, Σy = diag(σ²/wi), for known weights and σ² unknown.
• Inference is the same as before, but now Qy⁻¹ = diag(wi).

Parametric model for unequal variances
• Variances may depend on the weights in a non-linear fashion: Σii = σ² v(wi, φ), for an unknown parameter φ ∈ (0, 1) and a known function v such as v = wi^(−φ).
• Note that for that function v:
  φ = 0 → Σii = σ² for all i
  φ = 1 → Σii = σ²/wi for all i
• In practice, to uncouple φ from the scale of the weights, we multiply the weights by a factor so that their product equals 1.
Parametric model for unequal variances
• Three parameters to estimate: β, σ², φ.
• A possible prior for φ is uniform on [0, 1].
• The non-informative prior for (β, σ²) is ∝ σ⁻².
• Joint posterior distribution:
  p(β, σ², φ|y) ∝ p(φ) p(β, σ²) Π_{i=1}^n N(yi; (Xβ)i, σ² v(wi, φ))
• For a given φ, we are back in the earlier weighted regression case, with Qy⁻¹ = diag(1/v(w1, φ), ..., 1/v(wn, φ)).
• This suggests the following scheme:
  1. Draw φ from the marginal posterior p(φ|y)
  2. Compute Qy^(−1/2) y and Qy^(−1/2) X
  3. Draw σ² from p(σ²|φ, y) as in ordinary regression
  4. Draw β from p(β|σ², φ, y) as in ordinary regression
• Marginal posterior distribution of φ:
  p(φ|y) = p(β, σ², φ|y) / [ p(β|σ², φ, y) p(σ²|φ, y) ]
         ∝ p(φ) σ⁻² Πi N(yi | (Xβ)i, σ² v(wi, φ)) / [ Inv-χ²(σ² | n − k, s²) N(β | β̂, Vβ) ]
• The expression must hold for any (β, σ²), so we set β = β̂ and σ² = s².
• Then
  p(φ|y) ∝ p(φ) |Vβ|^(1/2) (s²)^(−(n−k)/2)
• Recall that β̂ and s² depend on φ.
• Also recall that the weights are scaled to have product equal to 1.
• To carry out the computation:
  – First sample φ from p(φ|y):
    1. Evaluate p(φ|y) for a range of values of φ ∈ [0, 1]
    2. Use the inverse cdf method to sample φ
  – Given φ, compute y* = Qy^(−1/2) y and X* = Qy^(−1/2) X
  – Compute
    β̂ = (X*'X*)⁻¹ X*' y*
    Vβ = (X*'X*)⁻¹
    s² = (n − k)⁻¹ (y* − X*β̂)'(y* − X*β̂)
  – Draw σ² from Inv-χ²(n − k, s²)
  – Draw β from N(β̂, σ²Vβ)
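A minimal R sketch of this grid/inverse-cdf scheme for φ, with v(w, φ) = w^(−φ); the data and weights below are simulated placeholders.

set.seed(5)
n <- 40; k <- 2
X <- cbind(1, rnorm(n)); w <- runif(n, 0.5, 2)
w <- w / exp(mean(log(w)))                      # rescale so prod(w) = 1
y <- as.numeric(X %*% c(1, 1) + rnorm(n, 0, w^(-0.25)))
log.post.phi <- function(phi) {                 # log p(phi|y) up to a constant, uniform prior
  v <- w^(-phi)
  ys <- y / sqrt(v); Xs <- X / sqrt(v)          # Q^{-1/2} y and Q^{-1/2} X
  V.beta <- solve(t(Xs) %*% Xs)
  beta.hat <- V.beta %*% t(Xs) %*% ys
  s2 <- sum((ys - Xs %*% beta.hat)^2) / (n - k)
  0.5 * as.numeric(determinant(V.beta)$modulus) - 0.5 * (n - k) * log(s2)
}
grid <- seq(0.001, 0.999, length.out = 200)
p <- exp(sapply(grid, log.post.phi)); p <- p / sum(p)
phi.draw <- sample(grid, 1, prob = p)           # draw phi from the discretized posterior
phi.draw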
Including prior information about β
• Prior information can be added in the form of an additional 'data point'.
• An ordinary observation y is normal with mean xβ and variance σ².
• Suppose we wish to add prior information about a single regression coefficient βj of the form
  βj ∼ N(βj0, σ²βj),
  with βj0 and σ²βj known.
• As a function of βj, the prior can be viewed as an 'observation' with
  – 0 on all x's except xj
  – variance σ²βj.
• To include the prior information, do the following:
  1. Append one more 'data point' to the vector y with value βj0.
  2. Add one row to X with zeroes except in the jth column.
  3. Add a diagonal element with value σ²βj to Σy.
• Now apply the computational methods for the non-informative prior.
• Given Σy, the posterior for β is obtained by weighted linear regression.
Adding prior information for several βs
• Suppose that for the entire vector β,
  β ∼ N(β0, Σβ).
• Proceed as before: add k 'data points' and draw posterior inference by weighted linear regression applied to 'observations' y*, explanatory variables X*, and variance matrix Σ*:
  y* = [y; β0],  X* = [X; Ik],  Σ* = [Σy 0; 0 Σβ]

Prior information about σ²
• Typically we do not wish to include prior information about σ².
• If we do, we can use the conjugate prior
  σ² ∼ Inv-χ²(n0, σ0²).
• The marginal posterior of σ² is
  σ²|y ∼ Inv-χ²( n0 + n, (n0σ0² + nS²)/(n0 + n) ).
• Computation can be carried out conditional on Σ* first and then using the inverse cdf for Σ*, or using Gibbs sampling.
• If prior information on β is also incorporated, S² is replaced by the corresponding value from the regression of y* on X* with Σ*, and n is replaced by the length of y*.
Inequality constraints on β
• Sometimes we wish to impose inequality constraints such as
β1 > 0
or
β2 < β3 < β4.
• The easiest way is to ignore the constraint until the end:
– Simulate (β, σ 2) from posterior
– discard all the draws that do not satisfy the constraint.
• Typically a reasonably efficient way to proceed unless constraint
eliminates a large portion of unconstrained posterior distribution.
• If so, data tend to contradict the model.
Generalized Linear Models
• Generalized linear models are an extension of linear models to the case where the relationship between E(y|X) and X is not linear or the normal assumption is not appropriate.
• Sometimes a transformation suffices to return to the linear setup. Consider the multiplicative model
  yi = xi1^b1 xi2^b2 xi3^b3 εi.
  A simple log transformation leads to
  log(yi) = b1 log(xi1) + b2 log(xi2) + b3 log(xi3) + ei.
• When simple approaches do not work, we use GLIMs.
• There are three main components in the model:
  1. Linear predictor η = Xβ
  2. Link function g(.) relating the linear predictor to the mean of the outcome variable: E(y|X) = µ = g⁻¹(η) = g⁻¹(Xβ)
  3. Distribution of the outcome variable y with mean µ = E(y|X). The distribution can also depend on a dispersion parameter φ:
     p(y|X, β, φ) = Π_{i=1}^n p(yi|(Xβ)i, φ)
• In standard GLIMs for Poisson and binomial data, φ = 1.
• In many applications, however, excess dispersion is present.

Some standard GLIMs
• Linear model:
  – The simplest GLIM, with identity link function g(µ) = µ.
• Poisson model:
  – Mean and variance µ and link function log(µ) = Xβ, so that
    µ = exp(Xβ) = exp(η)
  – For y = (y1, ..., yn):
    p(y|β) = Π_{i=1}^n (1/yi!) exp(−exp(ηi)) (exp(ηi))^yi,
    with ηi = (Xβ)i.
• Binomial model: suppose that yi ∼ Bin(ni, µi), with ni known. The standard link function is the logit of the probability of success µ:
  g(µi) = log[ µi / (1 − µi) ] = (Xβ)i = ηi
• For a vector of data y:
  p(y|β) = Π_{i=1}^n (ni choose yi) [exp(ηi)/(1 + exp(ηi))]^yi [1/(1 + exp(ηi))]^(ni − yi)
• Another link used in econometrics is the probit link:
  Φ⁻¹(µi) = ηi,
  with Φ(.) the normal cdf.
• In practice, inference from the logit and probit models is almost the same, except in the extremes of the tails of the distribution.
Overdispersion
• In many applications, the model can be formulated to allow for extra variability or overdispersion.
• E.g., in the Poisson model, the variance is constrained to be equal to the mean.
• As an example, suppose that the data are the number of fatal car accidents at K intersections over T years. Covariates might include intersection characteristics and traffic control devices (stop lights, etc.).
• To accommodate overdispersion we model the (log) rate as a linear combination of covariates and add a random effect for intersection with its own population distribution.

Setting up GLIMs
• Canonical link functions: the canonical link is the function of the mean that appears in the exponent of the exponential-family form of the sampling distribution.
• All links discussed so far are canonical except for the probit.
• Offset: arises when counts are obtained from different population sizes or volumes or time periods and we need to use an exposure. An offset is a covariate with a known coefficient.
• Example: the number of incidents in a given exposure time T is Poisson with rate µ per unit of time. The mean number of incidents is µT.
• The link function would be log(µ) = ηi, but here the mean of y is not µ but µT.
• To apply the Poisson GLIM, add a column to X with values log(T) and fix its coefficient to 1. This is an offset.
Interpreting GLIMs
• In linear models, βj represents the change in the outcome when xj is changed by one unit.
• Here, βj reflects changes in g(µ) when xj is changed.
• The effect of changing xj depends on the current value of x.
• To translate effects into the scale of y, measure changes relative to a baseline:
  g(y0) = x0β → y0 = g⁻¹(x0β).
• A change in x of ∆x takes the outcome from y0 to y, where
  y = g⁻¹(g(y0) + (∆x)β).
Priors in GLIM
• We focus on priors for β, although sometimes φ is present and has its own prior.
• Non-informative prior for β:
  – With p(β) ∝ 1, the posterior mode = MLE for β.
  – Approximate posterior inference can be based on a normal approximation to the posterior at the mode.
• Conjugate prior for β:
  – As in regression, express prior information about β in terms of hypothetical data obtained under the same model.
  – Augment the data vector and model matrix with y0 hypothetical observations and an n0 × k matrix X0 of hypothetical predictors.
  – Non-informative prior for β in the augmented model.
• Non-conjugate priors:
  – Often more natural to model p(β|β0, Σ0) = N(β0, Σ0), with (β0, Σ0) known.
  – Approximate computation based on the normal approximation (see next) is particularly suitable.
• Hierarchical GLIM:
  – Same approach as in linear models.
  – Model some of the β as exchangeable with a common population distribution with unknown parameters. Hyperpriors for the parameters.
Computation
• Posterior distributions of the parameters can be estimated using MCMC methods in WinBUGS or other software.
• Metropolis within Gibbs will often be necessary: in GLIMs, most often the full conditionals do not have a standard form.
• An alternative is to approximate the sampling distribution with a cleverly chosen approximation.
• Idea:
  – Find the mode of the likelihood (β̂, φ̂), perhaps conditional on hyperparameters
  – Create pseudo-data with their pseudo-variances (see below)
  – Model the pseudo-data as normal with known (pseudo-)variances.

Normal approximation to the likelihood
• Objective: find zi and σi² such that the normal likelihood
  N(zi|(Xβ)i, σi²)
  is a good approximation to the GLIM likelihood p(yi|(Xβ)i, φ).
• Let (β̂, φ̂) be the mode of (β, φ), so that η̂i is the mode of ηi.
• For L the log-likelihood, write
  p(y1, ..., yn) = Πi p(yi|ηi, φ) = Πi exp(L(yi|ηi, φ))
• Approximate each factor in the exponent by a normal density in ηi:
  L(yi|ηi, φ) ≈ −(1/(2σi²)) (zi − ηi)²,
  where (zi, σi²) depend on (yi, ηi, φ).
• To get (zi, σi²), match the first- and second-order terms of a Taylor approximation around η̂i and solve for zi and σi².
• Let L' = δL/δηi and L'' = δ²L/δηi². For the normal approximation,
  L' = (1/σi²)(zi − ηi),  L'' = −1/σi².
• Then
  zi = η̂i − L'(yi|η̂i, φ̂) / L''(yi|η̂i, φ̂)
  σi² = −1 / L''(yi|η̂i, φ̂)
• Example: binomial model with logit link:
  L(yi|ηi) = yi log[exp(ηi)/(1 + exp(ηi))] + (ni − yi) log[1/(1 + exp(ηi))]
           = yi ηi − ni log(1 + exp(ηi))
• Then
  L' = yi − ni exp(ηi)/(1 + exp(ηi)),  L'' = −ni exp(ηi)/(1 + exp(ηi))²
• Pseudo-data and pseudo-variances:
  zi = η̂i + [(1 + exp(η̂i))²/exp(η̂i)] [ yi/ni − exp(η̂i)/(1 + exp(η̂i)) ]
  σi² = (1/ni) (1 + exp(η̂i))²/exp(η̂i)

Models for multinomial responses
• Multinomial data: outcomes y = (y1, ..., yK) are counts in K categories.
• Examples:
  – Number of students receiving grades A, B, C, D or F
  – Number of alligators that prefer to eat reptiles, birds, fish, invertebrate animals, or other (see example later)
  – Number of survey respondents who prefer Coke, Pepsi or tap water.
• In Chapter 3, we saw non-hierarchical multinomial models:
  p(y|α) ∝ Π_{j=1}^k αj^yj,
  with αj the probability of the jth outcome, Σ_{j=1}^k αj = 1 and Σ_{j=1}^k yj = n.
Logit model for multinomial data
• Here we model the αj as a function of covariates (or predictors) X, with corresponding regression coefficients βj.
• For a full hierarchical structure, the βj are modeled as exchangeable with some common population distribution p(β|µ, τ).
• The model can be developed as an extension of either the binomial or the Poisson model.
• Here i = 1, ..., I indexes the covariate patterns. E.g., in the alligator example, 2 sizes × 4 lakes = 8 covariate categories.
• Let yi be a multinomial random variable with sample size ni and k possible outcomes. Then
  yi ∼ Mult(ni; αi1, ..., αik),
  with Σj yij = ni and Σ_{j=1}^k αij = 1.
• αij is the probability of the jth outcome for the ith covariate combination.
• Standard parametrization: log of the probability of the jth outcome relative to the baseline category j = 1:
  log(αij/αi1) = ηij = (Xβj)i,
  with βj a vector of regression coefficients for the jth category.
• For identifiability, β1 = 0 and thus ηi1 = 0 for all i.
• βj is the effect of changing X on the probability of category j relative to category 1.
• Typically, indicators for each outcome category are added to the predictors to indicate the relative frequency of each category when X = 0. Then
  ηij = δj + (Xβj)i,
  with δ1 = β1 = 0 typically.
• Sampling distribution:
  p(y|β) ∝ Π_{i=1}^I Π_{j=1}^k [ exp(ηij) / Σ_{l=1}^k exp(ηil) ]^yij.
Example from WinBUGS - Alligators
• Agresti (1990) analyzes feeding choices of 221 alligators.
• The response is one of five categories: fish, invertebrate, reptile, bird, other.
• Two covariates: length of alligator (less than 2.3 meters or larger than 2.3 meters) and lake (Hancock, Oklawaha, Trafford, George).
• 2 × 4 = 8 covariate combinations (see data).
• For i, j a combination of size and lake, we have counts in five possible categories yij = (yij1, ..., yij5).
• Model:
  p(yij |αij, nij) = Mult(yij |nij, θij1, ..., θij5),
  with
  θijk = exp(ηijk) / Σ_l exp(ηijl)
  and
  ηijk = δk + βik + γjk.
• Here,
  – δk is the baseline indicator for category k
  – βik is the coefficient for the lake indicator
  – γjk is the coefficient for the size indicator
[Figures: box plots of the posterior distributions of alpha, beta, and gamma.]

Posterior summaries for alpha:
node       Mean     Std     2.5%    Median   97.5%
alpha[2]   -1.838   0.5278  -2.935  -1.823   -0.8267
alpha[3]   -2.655   0.706   -4.261  -2.593   -1.419
alpha[4]   -2.188   0.58    -3.382  -2.153   -1.172
alpha[5]   -0.7844  0.3691  -1.531  -0.7687  -0.1168

Posterior summaries for beta:
node        Mean     Std     2.5%     Median   97.5%
beta[2,2]   2.706    0.6431  1.51     2.659    4.008
beta[2,3]   1.398    0.8571  -0.2491  1.367    3.171
beta[2,4]   -1.799   1.413   -5.1     -1.693   0.5832
beta[2,5]   -0.9353  0.7692  -2.555   -0.867   0.4413
beta[3,2]   2.932    0.6864  1.638    2.922    4.297
beta[3,3]   1.935    0.8461  0.37     1.886    3.842
beta[3,4]   0.3768   0.7936  -1.139   0.398    1.931
beta[3,5]   0.7328   0.5679  -0.3256  0.7069   1.849
beta[4,2]   1.75     0.6116  0.636    1.73     3.023
beta[4,3]   -1.595   1.447   -4.847   -1.433   0.9117
beta[4,4]   -0.7617  0.8026  -2.348   -0.7536  0.746
beta[4,5]   -0.843   0.5717  -1.97    -0.8492  0.281

Posterior summaries for gamma:
node         Mean    Std     2.5%    Median   97.5%
gamma[2,2]   -1.523  0.4101  -2.378  -1.523   -0.7253
gamma[2,3]   0.342   0.5842  -0.788  0.3351   1.476
gamma[2,4]   0.7098  0.6808  -0.651  0.7028   2.035
gamma[2,5]   -0.353  0.4679  -1.216  -0.3461  0.518
[Figures: box plots of the fitted probabilities p[1,,], p[2,,], p[3,,], and p[4,,].]

Posterior summaries for the fitted probabilities p:
node       mean     sd       2.5%      median   97.5%
p[1,1,1]   0.5392   0.07161  0.4046    0.5384   0.676
p[1,1,2]   0.09435  0.04288  0.03016   0.08735  0.2067
p[1,1,3]   0.04585  0.02844  0.007941  0.04043  0.1213
p[1,1,4]   0.06837  0.03418  0.01923   0.06237  0.1457
p[1,1,5]   0.2523   0.06411  0.1389    0.2496   0.3808
p[1,2,1]   0.5679   0.09029  0.3959    0.5684   0.7342
p[1,2,2]   0.02322  0.01428  0.005301  0.02013  0.06186
p[1,2,3]   0.06937  0.04539  0.01191   0.05918  0.1768
p[1,2,4]   0.1469   0.07437  0.03647   0.1344   0.3182
p[1,2,5]   0.1927   0.07171  0.0793    0.1852   0.349
p[2,1,1]   0.2578   0.07217  0.1367    0.2496   0.4265
p[2,1,2]   0.5968   0.08652  0.4159    0.6      0.7551
p[2,1,3]   0.08137  0.0427   0.01939   0.0728   0.1841
p[2,1,4]   0.0095   0.01186  0.00533   0.04105  0.143
p[2,1,5]   0.05445  0.03517  0.009716  0.0476   0.1398
p[2,2,1]   0.4612   0.08211  0.3055    0.4632   0.6175
p[2,2,2]   0.2455   0.06879  0.1263    0.2426   0.3867
p[2,2,3]   0.1947   0.07042  0.07776   0.189    0.3502
p[2,2,4]   0.0302   0.02921  7.653E-4  0.02009  0.1122
p[2,2,5]   0.06833  0.03984  0.01357   0.06097  0.1643
p[3,1,1]   0.1794   0.05581  0.08555   0.1732   0.296
p[3,1,2]   0.5178   0.09136  0.3334    0.5185   0.6933
p[3,1,3]   0.09403  0.04716  0.02652   0.08517  0.2219
p[3,1,4]   0.03554  0.02504  0.006063  0.02996  0.1055
p[3,1,5]   0.1732   0.06253  0.07053   0.167    0.3261
p[3,2,1]   0.2937   0.07225  0.1618    0.2901   0.4469
Interpretation of results
Because we set several model parameters to zero, interpreting the results is tricky. For example:
• Beta[2,2] has posterior mean 2.714. This is the effect of lake Oklawaha (relative to Hancock) on the alligators' preference for invertebrates relative to fish. Since beta[2,2] > 0, we conclude that alligators in Oklawaha eat more invertebrates than do alligators in Hancock (even though both may prefer fish!).
• Gamma[2,2] is the effect of size 2 relative to size 1 on the relative preference for invertebrates. Since gamma[2,2] < 0, we conclude that large alligators prefer fish more than do small alligators.
• The alpha are baseline counts for each type of food relative to fish.

Hierarchical Poisson model
• Count data are often modeled using a Poisson model.
• If y ∼ Poisson(µ) then E(y) = var(y) = µ.
• When counts are assumed exchangeable given µ, and the rates µ can also be assumed to be exchangeable, a Gamma population model for the rates is often chosen.
• The hierarchical model is then
  yi ∼ Poisson(µi)
  µi ∼ Gamma(α, β).
• Priors for the hyperparameters are often taken to be Gamma (or exponential):
  α ∼ Gamma(a, b)
  β ∼ Gamma(c, d),
  with (a, b, c, d) known.
• The joint posterior distribution is
  p(µ, α, β|y) ∝ [Πi µi^yi exp{−µi} µi^(α−1) exp{−µiβ}] α^(a−1) exp{−αb} β^(c−1) exp{−βd}.
• To carry out Gibbs sampling we need to find the full conditional distributions.
• The conditional for µi is
  p(µi|all) ∝ µi^(yi+α−1) exp{−µi(β + 1)},
  which is proportional to a Gamma with parameters (yi + α, β + 1).
• The full conditional for α is
  p(α|all) ∝ (Πi µi^(α−1)) α^(a−1) exp{−αb}.
• The conditional for α does not have a standard form.
• For β:
  p(β|all) ∝ Πi exp{−βµi} β^(c−1) exp{−βd} ∝ β^(c−1) exp{−β(Σi µi + d)},
  which is proportional to a Gamma with parameters (c, Σi µi + d).
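A sketch in R of a Gibbs sampler for this model, with a Metropolis step for α. The data y and the hyperparameters (a, b, c, d) are placeholders. One assumption to flag: the sketch evaluates the conditionals from the full Gamma and Poisson densities, so the β update keeps the Gamma(α, β) normalizing terms for the µi and becomes Gamma(c + nα, d + Σµi), and the α target is evaluated via dgamma; this guarantees the chain targets the joint posterior written above.

set.seed(6)
y <- rpois(16, 7); n <- length(y)
a <- b <- cc <- d <- 1                           # hyperparameters (placeholders)
M <- 5000; alpha <- 1; beta <- 1
keep <- matrix(NA, M, 2, dimnames = list(NULL, c("alpha", "beta")))
log.targ.alpha <- function(al, mu, be)           # full conditional of alpha, up to a constant
  sum(dgamma(mu, al, be, log = TRUE)) + dgamma(al, a, b, log = TRUE)
for (m in 1:M) {
  mu <- rgamma(n, y + alpha, beta + 1)           # mu_i | rest ~ Gamma(y_i + alpha, beta + 1)
  beta <- rgamma(1, cc + n * alpha, d + sum(mu)) # beta | rest, keeping the beta^alpha terms
  alpha.prop <- alpha * exp(rnorm(1, 0, 0.3))    # random-walk proposal on the log scale
  log.r <- log.targ.alpha(alpha.prop, mu, beta) - log.targ.alpha(alpha, mu, beta) +
           log(alpha.prop) - log(alpha)          # Jacobian for the log-scale walk
  if (log(runif(1)) < log.r) alpha <- alpha.prop
  keep[m, ] <- c(alpha, beta)
}
colMeans(keep[-(1:1000), ])                      # posterior means after burn-in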
• Computation:
  – Given α and β, draw each µi from the corresponding Gamma conditional.
  – Draw α using a Metropolis step, rejection sampling, or the inverse cdf method.
  – Draw β from the Gamma conditional.
• See the Italian marriages example.

Italian marriages – Example
Gill (2002) collected data on the number of marriages per 1,000 people in Italy during 1936–1951.
Question: did the number of marriages decrease during the WWII years (1939–1945)?
Model:
The numbers of marriages yi are Poisson with year-specific means λi.
Assuming that the rates of marriage are exchangeable across years, we model the λi as Gamma(α, β).
To complete the model specification, place independent Gamma priors on (α, β), with known hyper-parameter values.
WinBUGS code:

model {
  for (i in 1:16) {
    y[i] ~ dpois(l[i])
    l[i] ~ dgamma(alpha, beta)
  }
  alpha ~ dgamma(1,1)
  beta ~ dgamma(1,1)
  warave <- (l[4]+l[5]+l[6]+l[7]+l[8]+l[9]+l[10]) / 7
  nonwarave <- (l[1]+l[2]+l[3]+l[11]+l[12]+l[13]+l[14]+l[15]+l[16]) / 9
  diff <- nonwarave - warave
}
list(y = c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7))

Results

[Figures: 95% credible sets and box plots for lambda[1]–lambda[16]; posterior of the difference between the non-war and war years marriage rates (nonwar − war average); posterior densities of the overall rate and overall standard deviation (chains 1:3, 3000 samples).]

Overall marriage rate: if λi ~ Gamma(α, β), then E(λi | y) = α/β.

             mean    sd     2.5%   median  97.5%
overallrate  7.362   1.068  5.508  7.285   9.665
overallstd   2.281   0.394  1.59   2.266   3.125
Poisson regression
• When the rates are not exchangeable, we need to incorporate covariates into the model. Often we are interested in the association between one or more covariates and the outcome.
• It is possible (but not easy) to incorporate covariates into the Poisson-Gamma model.
• Christiansen and Morris (1997, JASA) propose the following model.
• Sampling distribution, where ei is a known exposure:
  yi|λi ∼ Poisson(λi ei).
  Under the model, E(yi/ei) = λi.
• Population distribution for the rates:
  λi|α ∼ Gamma(ζ, ζ/µi),
  with log(µi) = xi'β, and α = (β0, β1, ..., βk−1, ζ).
• ζ is thought of as an unobserved prior count.
• Under the population model,
  E(λi) = ζ/(ζ/µi) = µi
  CV²(λi) = (µi²/ζ)(1/µi²) = 1/ζ.
• For k = 0, µi is known. For k = 1, the µi are exchangeable. For k ≥ 2, the µi are (unconditionally) nonexchangeable.
• In all cases, the standardized rates λi/µi are Gamma(ζ, ζ), are exchangeable, and have expectation 1.
• The covariates can include random effects.
• To complete the specification of the model, we need priors on α. Christiansen and Morris (1997) suggest:
  – β and ζ independent a priori.
  – Non-informative prior on the β's associated with 'fixed' effects.
  – For ζ, a proper prior of the form
    p(ζ|y0) ∝ y0 / (ζ + y0)²,
    where y0 is the prior guess for the median of ζ.
• Small values of y0 (for example, y0 < ζ̂, with ζ̂ the MLE of ζ) provide less information.
• When the rates cannot be assumed to be exchangeable, it is also common to choose a generalized linear model of the form
  p(y|β) ∝ Πi exp{−λi} λi^yi ∝ Πi exp{−exp(ηi)} [exp(ηi)]^yi,
  for ηi = xi'β and log(λi) = ηi.
• The vector of covariates can include one or more random effects to accommodate additional dispersion (see the epilepsy example).
• The second-level distribution for the β's will typically be flat (if the covariate is a 'fixed' effect) or normal,
  βj ∼ Normal(βj0, σ²βj),
  if the jth covariate is a random effect. The variance σ²βj represents the between-'batch' variability.

Epilepsy example
• From Breslow and Clayton, 1993, JASA.
• Fifty-nine epileptic patients in a clinical trial were randomized to a new drug: T = 1 is the drug and T = 0 is the placebo.
• Covariates included:
  – Baseline data: number of seizures during the eight weeks preceding the trial
  – Age in years.
• Outcomes: number of seizures during the two weeks preceding each of four clinical visits.
• The data suggest that the number of seizures was significantly lower prior to the fourth visit, so an indicator was used for V4 versus the others.
• Two random effects in the model:
  – A patient-level effect to introduce between-patient variability.
  – A patient-by-visit effect to introduce between-visit, within-patient dispersion.
Epilepsy study – Program and results
model {
  for(j in 1 : N) {
    for(k in 1 : T) {
      log(mu[j, k]) <- a0 + alpha.Base * (log.Base4[j] - log.Base4.bar)
        + alpha.Trt * (Trt[j] - Trt.bar)
        + alpha.BT * (BT[j] - BT.bar)
        + alpha.Age * (log.Age[j] - log.Age.bar)
        + alpha.V4 * (V4[k] - V4.bar)
        + b1[j] + b[j, k]
      y[j, k] ~ dpois(mu[j, k])
      b[j, k] ~ dnorm(0.0, tau.b)      # subject*visit random effects
    }
    b1[j] ~ dnorm(0.0, tau.b1)         # subject random effects
    BT[j] <- Trt[j] * log.Base4[j]     # interaction
    log.Base4[j] <- log(Base[j] / 4)
    log.Age[j] <- log(Age[j])
    diff[j] <- mu[j,4] - mu[j,1]
  }
  # covariate means:
  log.Age.bar <- mean(log.Age[])
  Trt.bar <- mean(Trt[])
  BT.bar <- mean(BT[])
  log.Base4.bar <- mean(log.Base4[])
  V4.bar <- mean(V4[])
  # priors:
  a0 ~ dnorm(0.0,1.0E-4)
  alpha.Base ~ dnorm(0.0,1.0E-4)
  alpha.Trt ~ dnorm(0.0,1.0E-4)
  alpha.BT ~ dnorm(0.0,1.0E-4)
  alpha.Age ~ dnorm(0.0,1.0E-4)
  alpha.V4 ~ dnorm(0.0,1.0E-4)
  tau.b1 ~ dgamma(1.0E-3,1.0E-3); sigma.b1 <- 1.0 / sqrt(tau.b1)
  tau.b ~ dgamma(1.0E-3,1.0E-3); sigma.b <- 1.0 / sqrt(tau.b)
  # re-calculate intercept on original scale:
  alpha0 <- a0 - alpha.Base * log.Base4.bar - alpha.Trt * Trt.bar
    - alpha.BT * BT.bar - alpha.Age * log.Age.bar - alpha.V4 * V4.bar
}

Individual random effects
[Figure: box plots of the individual (patient-level) random effects b1[1]–b1[59].]

Results

Parameter    Mean     Std      2.5th    Median    97.5th
alpha.Age    0.4677   0.3557   -0.2407  0.4744    1.172
alpha.Base   0.8815   0.1459   0.5908   0.8849    1.165
alpha.Trt    -0.9587  0.4557   -1.794   -0.9637   -0.06769
alpha.V4     -0.1013  0.08818  -0.273   -0.09978  0.07268
alpha.BT     0.3778   0.2427   -0.1478  0.3904    0.7886
sigma.b1     0.4983   0.07189  0.3704   0.4931    0.6579
sigma.b      0.3641   0.04349  0.2871   0.362     0.4552

Patients 5 and 49 are 'different'. For #5, the number of events increases from visits 1 through 4. For #49, there is a significant decrease.

[Figures: box plots of the visit-level random effects b[5, ] and b[49, ], and of the difference in seizure counts between V4 and V1 for each patient.]
Convergence and autocorrelation

• We ran three parallel chains for 10,000 iterations.
• We discarded the first 5,000 iterations from each chain as burn-in.
• We thinned by 10, so there are 1,500 draws for inference.

[Figures: autocorrelation and trace plots for sigma.b1 and sigma.b, chains 1:3, iterations 5001–10000.]

Example: hierarchical Poisson model

• The USDA collects data on the number of farmers who adopt conservation tillage practices.

• In a recent survey, 10 counties in a midwestern state were randomly sampled, and within each county, a random number of farmers were interviewed and asked whether they had adopted conservation practices.
• The number of farmers interviewed in each county varied from a low of 2 to a high of 100.

• Of interest to USDA is the estimation of the overall adoption rate in the state.

• We fitted the following model:

yi ∼ Poisson(λi)
λi = θi Ei
θi ∼ Gamma(α, β),

where yi is the number of adopters in the ith county, Ei is the number of farmers interviewed, and θi is the expected adoption rate. The expected number of adopters in a county can be obtained by multiplying θi by the number of farms in the county.

• The hierarchical structure establishes that even though counties may vary significantly in terms of the rate of adoption, they are still exchangeable, so the rates are generated from a common population distribution.

• Information about the overall rate of adoption across the state is contained in the posterior distribution of (α, β).

• We fitted the model using WinBUGS and chose non-informative priors for the hyperparameters.

Observed data:

County   Ei    yi
1        94    5
2        15    1
3        62    5
4        126   14
5        5     3
6        31    19
7        2     1
8        2     1
9        9     4
10       100   22
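As a point of reference (not part of the original notes), the raw county rates yi/Ei can be computed directly from the table above in R; the hierarchical model smooths these, most strongly where Ei is small.

## Raw (unsmoothed) county adoption rates y_i / E_i from the observed data,
## for comparison with the smoothed posterior means of theta_i reported below.
E <- c(94, 15, 62, 126, 5, 31, 2, 2, 9, 100)   # farmers interviewed
y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)        # adopters
round(y / E, 3)
# Counties with few interviews (e.g. counties 7 and 8, with E_i = 2) give very
# unstable raw rates; the hierarchical model shrinks these toward the state-wide rate.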
Posterior distribution of rate of adoption in each county

[Figure: box plot of the posterior distributions of theta[1]–theta[10].]

node        mean      sd        2.5%       median    97.5%
theta[1]    0.06033   0.02414   0.02127    0.05765   0.1147
theta[2]    0.1074    0.08086   0.009868   0.08742   0.306
theta[3]    0.09093   0.03734   0.03248    0.08618   0.1767
theta[4]    0.1158    0.03085   0.06335    0.1132    0.1831
theta[5]    0.5482    0.2859    0.131      0.4954    1.242
theta[6]    0.6028    0.1342    0.3651     0.5962    0.8932
theta[7]    0.4586    0.3549    0.0482     0.3646    1.343
theta[8]    0.4658    0.3689    0.04501    0.3805    1.41
theta[9]    0.4378    0.2055    0.1352     0.3988    0.9255
theta[10]   0.2238    0.0486    0.1386     0.2211    0.33
Posterior distribution of number of adopters in each county, given exposures

[Figure: box plot of the posterior distributions of lambda[1]–lambda[10].]

node         mean     sd       2.5%      median   97.5%
lambda[1]    5.671    2.269    2.0       5.419    10.78
lambda[2]    1.611    1.213    0.148     1.311    4.59
lambda[3]    5.638    2.315    2.014     5.343    10.95
lambda[4]    14.59    3.887    7.982     14.26    23.07
lambda[5]    2.741    1.43     0.6551    2.477    6.21
lambda[6]    18.69    4.16     11.32     18.48    27.69
lambda[7]    0.9171   0.7097   0.09641   0.7292   2.687
lambda[8]    0.9317   0.7377   0.09002   0.761    2.819
lambda[9]    3.94     1.85     1.217     3.589    8.33
lambda[10]   22.38    4.86     13.86     22.11    33.0

Posterior summaries for the hyperparameters:

node    mean     sd       2.5%     median   97.5%
alpha   0.8336   0.3181   0.3342   0.7989   1.582
beta    2.094    1.123    0.4235   1.897    4.835
mean    0.4772   0.2536   0.1988   0.4227   1.14
Poisson model for small area deaths

Taken from Bayesian Statistical Modeling, Peter Congdon, 2001.

• Congdon considers the incidence of heart disease mortality in 758 electoral wards in the Greater London area over three years (1990–92). These small areas are grouped administratively into 33 boroughs.

• Regressors:
  – at ward level: xij, an index of socio-economic deprivation;
  – at borough level: wj, where wj = 1 for inner London boroughs and wj = 0 for outer suburban boroughs.

• We assume borough-level variation in the intercepts and in the impacts of deprivation; this variation is linked to the category of borough (inner vs. outer).

Model

• First level of the hierarchy:

Oij | µij ∼ Poisson(µij)
log(µij) = log(Eij) + β1j + β2j (xij − x̄) + δij
δij ∼ N(0, σδ²)

– Death counts Oij are Poisson with means µij.
– βj = (β1j, β2j)' are random coefficients for the intercepts and the impacts of deprivation at the borough level.
– δij is a random error for Poisson over-dispersion. We fitted two models: one without and another with this random error term.
– log(Eij) is an offset, i.e., an explanatory variable with known coefficient (equal to 1).
Model (cont'd)

• Second level of the hierarchy:

βj = (β1j, β2j)' ∼ N2(µβj, Σ),   Σ = ( ψ11 ψ12 ; ψ21 ψ22 ),

with

µβ1j = γ11 + γ12 wj
µβ2j = γ21 + γ22 wj

– γ11, γ12, γ21, and γ22 are the population coefficients for the intercept, the impact of borough category on the intercept, the impact of deprivation, and the interaction between the level-2 and level-1 regressors, respectively.

• Hyperpriors:

Σ⁻¹ ∼ Wishart((ρR)⁻¹, ρ)
γij ∼ N(0, 0.1), i, j ∈ {1, 2}
σδ² ∼ Inv-Gamma(a, b)
Computation of Eij

• Suppose that p* is an overall disease rate. Then Eij = nij p* and µij = pij/p*.

• The µij are said to be:
  1. externally standardized if p* is obtained from another data source (such as a standard reference table);
  2. internally standardized if p* is obtained from the given dataset, e.g.,

p* = Σij Oij / Σij nij.

• In our example we rely on the latter approach.

• Under (1), the joint distribution of the Oij is a product Poisson; under (2) it is multinomial.

• However, since likelihood inference is unaffected by whether we condition on Σij Oij, the product Poisson likelihood is commonly retained.
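A tiny R illustration of internal standardization (the counts below are hypothetical, not from Congdon's data): the crude overall rate p* is computed from the data and the expected counts are Eij = nij p*.

## Internal standardization: p* = sum(O) / sum(n), then E_i = n_i * p*.
## Counts below are hypothetical, for illustration only.
O <- c(12, 5, 30, 8)             # observed deaths in four areas
n <- c(4000, 2500, 9000, 3100)   # persons at risk

p.star <- sum(O) / sum(n)        # crude overall disease rate
E <- n * p.star                  # internally standardized expected counts
round(cbind(O, E, SMR = O / E), 3)
# Note that sum(E) equals sum(O): internal standardization preserves the total count.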
Model 1: without overdispersion term (i.e. without δij)

node       mean      std. dev.   2.5%      median    97.5%
γ11        -0.075    0.074       -0.224    -0.075    0.070
γ12        0.078     0.106       -0.128    0.078     0.290
γ21        0.621     0.138       0.354     0.620     0.896
γ22        0.104     0.197       -0.282    0.103     0.486
σβ1        0.294     0.038       0.231     0.290     0.380
σβ2        0.445     0.077       0.318     0.437     0.618
Deviance   945.800   11.320      925.000   945.300   969.400

Model 2: with overdispersion term

node       mean      std. dev.   2.5%      median    97.5%
γ11        -0.069    0.074       -0.218    -0.069    0.076
γ12        0.068     0.106       -0.141    0.068     0.278
γ21        0.616     0.141       0.335     0.615     0.892
γ22        0.105     0.198       -0.276    0.104     0.499
σβ1        0.292     0.037       0.228     0.289     0.376
σβ2        0.431     0.075       0.306     0.423     0.600
Deviance   802.400   40.290      726.600   800.900   883.500

• Including the ward-level random variability δij reduces the average Poisson GLM deviance to 802, with a 95% credible interval from 726 to 883. This is in line with the expected value of the GLM deviance for N = 758 areas if the Poisson model is appropriate.
Hierarchical models for spatial data
Based on the book by Banerjee, Carlin and Gelfand Hierarchical Modeling
and Analysis for Spatial Data, 2004. We focus on Chapters 1, 2 and 5.
• Geo-referenced data arise in agriculture, climatology, economics,
epidemiology, transportation and many other areas.
• Often data are also collected over time, so models that include spatial
and temporal correlations of outcomes are needed.
• We focus on spatial, rather than spatio-temporal models.
• We also focus on models for univariate rather than multivariate
outcomes.
• What does geo-referenced mean? In a nutshell, we know the geographic
location at which an observation was collected.
• Why does it matter? Sometimes, relative location can provide
information about an outcome beyond that provided by covariates.
• Example: infant mortality is typically higher in high poverty areas.
Even after incorporating poverty as a covariate, residuals may still be
spatially correlated due to other factors such as nearness to pollution
sites, distance to pre and post-natal care centers, etc.
Types of spatial data

• Point-referenced data: Y(s) is a random outcome (perhaps vector-valued) at location s, where s varies continuously over some region D. The location s is typically two-dimensional (latitude and longitude) but may also include altitude. Known as geostatistical data.

• Areal data: the outcome Yi is an aggregate value over an areal unit with well-defined boundaries. Here, D is divided into a finite collection of areal units. Known as lattice data, even though lattices can be irregular.

• Point-pattern data: the outcome Y(s) is the occurrence or not of an event, and the locations s are random. Example: locations of trees of a species in a forest, or addresses of persons with a particular disease. Interest is often in deciding whether points occur independently in space or whether there is clustering.

• Marked point process data: if covariate information is available, we talk about a marked point process. The covariate value at each site marks the site as belonging to a certain covariate batch or group.

• Combinations: e.g., ozone daily levels collected at monitoring stations with precise locations, and the number of children in a zip code reporting to the ER on that day. These require data re-alignment to combine outcomes and covariates.
Models for point-level data

The basics

• The location index s varies continuously over the region D.

• We often assume that the covariance between two observations at locations si and si' depends only on the distance dii' between the points.

• The spatial covariance is often modeled as exponential:

Cov(Y(si), Y(si')) = C(dii') = σ² exp(−φ dii'),

where σ² > 0 and φ > 0 are the partial sill and decay parameters, respectively.

• Covariogram: a plot of C(dii') against dii'.

• For i = i', dii' = 0 and C(dii') = var(Y(si)).

• Sometimes var(Y(si)) = τ² + σ², where τ² is the nugget effect and τ² + σ² is the sill.

Covariance structure

• Suppose that outcomes are normally distributed and that we choose an exponential model for the covariance matrix. Then:

Y | µ, θ ∼ N(µ, Σ(θ)),

with Y = {Y(s1), Y(s2), ..., Y(sn)}, Σ(θ)ii' = cov(Y(si), Y(si')), and θ = (τ², σ², φ).

• Then

Σ(θ)ii' = σ² exp(−φ dii') + τ² I{i = i'},

with (τ², σ², φ) > 0.
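A small R sketch (not from the notes, with made-up locations and parameter values) that builds this covariance matrix Σ(θ) directly from the formula above:

## Exponential covariance with a nugget: Sigma_ii' = sigma2*exp(-phi*d_ii') + tau2*1{i=i'}
## Locations and parameter values below are hypothetical, for illustration only.
set.seed(1)
s <- cbind(runif(5), runif(5))           # five locations in the unit square
sigma2 <- 1.0; tau2 <- 0.2; phi <- 3.0   # partial sill, nugget, decay

d <- as.matrix(dist(s))                  # pairwise distances d_ii'
Sigma <- sigma2 * exp(-phi * d) + tau2 * diag(nrow(s))
round(Sigma, 3)
# On the diagonal d = 0, so var(Y(s_i)) = sigma2 + tau2, the sill.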
Models for point-level data, details

• This is an example of an isotropic covariance function: the spatial correlation is only a function of the distance d.

• Basic model:

Y(s) = µ(s) + w(s) + e(s),

where µ(s) = x'(s)β and the residual is divided into two components: w(s) is a realization of a zero-centered stationary Gaussian process and e(s) is uncorrelated pure error.

• The w(s) are functions of the partial sill σ² and decay φ parameters.

• The e(s) introduce the nugget effect τ².

• τ² is interpreted as pure sampling variability or as microscale variability, i.e., spatial variability at distances smaller than the distance between two outcomes; the e(s) are sometimes viewed as spatial processes with rapid decay.
The variogram and semivariogram

• A spatial process is said to be:
  – Strictly stationary if the distributions of Y(s) and Y(s + h) are equal, for any displacement h.
  – Weakly stationary if µ(s) = µ and Cov(Y(s), Y(s + h)) = C(h).
  – Intrinsically stationary if

E[Y(s + h) − Y(s)] = 0, and
E[Y(s + h) − Y(s)]² = Var[Y(s + h) − Y(s)] = 2γ(h),

defined for differences and depending only on the separation h.

• 2γ(h) is the variogram and γ(h) is the semivariogram.

Stationarity

• Strict stationarity implies weak stationarity, but the converse is not true except for Gaussian processes.

• Weak stationarity implies intrinsic stationarity, but the converse is not true in general.

• Notice that intrinsic stationarity is defined on the differences between outcomes at two locations, and thus says nothing about the joint distribution of the outcomes.

Examples of semi-variograms

[Figure: semi-variograms for the linear, spherical and exponential models.]

Semivariogram (cont'd)

• If γ(h) depends on h only through its length ||h||, then the spatial process is isotropic; else it is anisotropic.

• There are many choices for isotropic models. The exponential model is popular and has good properties. For t = ||h||:

γ(t) = τ² + σ²(1 − exp(−φt)) if t > 0, and γ(0) = 0.

• See figures, page 24.

• The powered exponential model has an extra parameter for smoothness:

γ(t) = τ² + σ²(1 − exp(−φ t^κ)) if t > 0.
• Another popular choice is the Gaussian variogram model, which is equal to the exponential model except for the exponent term, which is exp(−φ²t²).

• Fitting of the variogram has traditionally been done "by eye":
  – Plot an empirical estimate of the variogram, akin to the sample variance estimate or the autocorrelation function in time series.
  – Choose a theoretical functional form to fit the empirical γ.
  – Choose values for (τ², σ², φ) that fit the data well.

• If a distribution for the outcomes is assumed and a functional form for the variogram is chosen, parameters can be estimated via some likelihood-based method.

• Of course, we can also be Bayesians.
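A rough R sketch (not from the notes) of the empirical estimate mentioned above: average the squared differences within distance bins and halve them. The data are simulated and the bin choices are arbitrary.

## Empirical semivariogram: average of (Y_i - Y_j)^2 / 2 within distance bins.
## Simulated data and bin widths are hypothetical, for illustration only.
set.seed(2)
n <- 100
s <- cbind(runif(n), runif(n))                  # locations in the unit square
d <- as.matrix(dist(s))
Sigma <- 1.0 * exp(-3 * d) + 0.2 * diag(n)      # exponential covariance + nugget
y <- drop(t(chol(Sigma)) %*% rnorm(n))          # one realization of the process

gamma.hat <- sapply(1:10, function(k) {
  lo <- (k - 1) * 0.1; hi <- k * 0.1
  idx <- d > lo & d <= hi & upper.tri(d)        # pairs whose distance falls in bin k
  mean((outer(y, y, "-")[idx])^2) / 2
})
plot((1:10) * 0.1 - 0.05, gamma.hat, xlab = "distance", ylab = "empirical semivariogram")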
Point-level data (cont'd)

• For point-referenced data, frequentists focus on spatial prediction using kriging.

• Problem: given observations {Y(s1), ..., Y(sn)}, how do we predict Y(so) at a new site so?

• Consider the model

Y = Xβ + ε, where ε ∼ N(0, Σ),

and where

Σ = σ²H(φ) + τ²I.

Here, H(φ)ii' = ρ(φ, dii').
Classical kriging (cont'd)

• Kriging consists in finding a function f(y) of the observations that minimizes the MSE of prediction,

Q = E[(Y(so) − f(y))²|y].

• (Not a surprising result!) The f(y) that minimizes Q is the conditional mean of Y(so) given the observations y (see pages 50–52 for a proof):

E[Y(so)|y] = x'oβ̂ + γ̂'Σ̂⁻¹(y − Xβ̂)
Var[Y(so)|y] = σ̂² + τ̂² − γ̂'Σ̂⁻¹γ̂,

where

γ̂ = (σ̂²ρ(φ̂, do1), ..., σ̂²ρ(φ̂, don))'
β̂ = (X'Σ̂⁻¹X)⁻¹X'Σ̂⁻¹y
Σ̂ = σ̂²H(φ̂).

• The solution assumes that we have observed the covariates xo at the new site. If not, in the classical framework Y(so) and xo are jointly estimated using an EM-type iterative algorithm.
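A small R sketch of these plug-in kriging formulas at one new site. Everything here is illustrative: the locations, data and "estimated" parameters are made up, and (as an assumption of the sketch) the nugget τ̂² is included in Σ̂.

## Plug-in kriging at a new site s0:
## E[Y(s0)|y] = x0'beta.hat + gam' Sigma.hat^{-1} (y - X beta.hat)
set.seed(3)
n <- 30
s <- cbind(runif(n), runif(n)); s0 <- c(0.5, 0.5)
X <- cbind(1, s[, 1]); x0 <- c(1, s0[1])           # intercept + one covariate
sigma2 <- 1; tau2 <- 0.1; phi <- 3                  # pretend these are the estimates
D <- as.matrix(dist(s))
Sigma <- sigma2 * exp(-phi * D) + tau2 * diag(n)    # Sigma.hat (nugget included here)
y <- drop(X %*% c(1, 2) + t(chol(Sigma)) %*% rnorm(n))   # simulated data

gam <- sigma2 * exp(-phi * sqrt(colSums((t(s) - s0)^2))) # covariances with the new site
Si  <- solve(Sigma)
beta.hat <- solve(t(X) %*% Si %*% X, t(X) %*% Si %*% y)
pred <- drop(x0 %*% beta.hat + gam %*% Si %*% (y - X %*% beta.hat))
pvar <- sigma2 + tau2 - drop(gam %*% Si %*% gam)
c(prediction = pred, variance = pvar)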
Bayesian methods for estimation

• The Gaussian isotropic kriging model is just a general linear model, similar to those in Chapter 15 of the textbook.

• We just need to define the appropriate covariance structure.

• For an exponential covariance structure with a nugget effect, the parameters to be estimated are θ = (β, σ², τ², φ).

• Steps:
  – Choose priors and define the sampling distribution.
  – Obtain the posterior for all parameters, p(θ|y).
  – Bayesian kriging: get the posterior predictive distribution for the outcome at a new location, p(yo|y, X, xo).

• Sampling distribution (marginal data model):

y|θ ∼ N(Xβ, σ²H(φ) + τ²I).

• Priors: typically chosen so that parameters are independent a priori.

• As in the linear model:
  – A non-informative prior for β is uniform, or a normal prior can be used.
  – Conjugate priors for the variances σ², τ² are inverse gamma priors.

• For φ, the appropriate prior depends on the covariance model. For the simple exponential, where

ρ(si − sj; φ) = exp(−φ||si − sj||),

a Gamma prior can be a good choice.

• Be cautious with improper priors for anything but β.
Hierarchical representation of model

• Hierarchical model representation: first condition on the spatial random effects W = {w(s1), ..., w(sn)}:

y | θ, W ∼ N(Xβ + W, τ²I)
W | φ, σ² ∼ N(0, σ²H(φ)).

• The model specification is then completed by choosing priors for β, τ² and for φ, σ² (the hyperparameters).

• Note that the hierarchical model has n more parameters (the w(si)) than the marginal model.

• Computation with the marginal model is preferable because σ²H(φ) + τ²I tends to be better behaved than σ²H(φ) at small distances.

• Analytical marginalization over W is possible only if the model has Gaussian form.

Estimation of spatial surface W|y

• Interest is sometimes in estimating the spatial surface using p(W|y).

• If the marginal model is fitted, we can still get the marginal posterior for W as

p(W|y) = ∫ p(W|σ², φ) p(σ², φ|y) dσ² dφ.

• Given draws (σ²(g), φ(g)) from the Gibbs sampler on the marginal model, we can generate W from

p(W|σ²(g), φ(g)) = N(0, σ²(g)H(φ(g))).
Bayesian kriging

• Let Yo = Y(so) and xo = x(so). Kriging is accomplished by obtaining the posterior predictive distribution

p(yo|xo, X, y) = ∫ p(yo, θ|y, X, xo) dθ = ∫ p(yo|θ, y, xo) p(θ|y, X) dθ.

• Since (Yo, Y) are jointly multivariate normal (see expressions 2.18 and 2.19 on page 51), p(yo|θ, y, xo) is a conditional normal distribution.

• Given MCMC draws of the parameters (θ(1), ..., θ(G)) from the posterior distribution p(θ|y, X), we draw a value yo(g) for each θ(g) as

yo(g) ∼ p(yo|θ(g), y, xo).

• The draws {yo(1), yo(2), ..., yo(G)} are a sample from the posterior predictive distribution of the outcome at the new location so.

• To predict Y at a set of m new locations so1, ..., som, it is best to do joint prediction, to be able to estimate the posterior association among the m predictions.

• Beware of joint prediction at many new locations with WinBUGS. It can take forever.
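A compact R sketch of this composition step for a single new site. It assumes posterior draws of (β, σ², τ², φ) are already available; here they are faked (as are the data and locations) purely to show the mechanics.

## Composition sampling: for each posterior draw of theta, draw y0 from the
## conditional normal p(y0 | theta, y, x0).  All inputs below are placeholders.
set.seed(4)
n <- 30; G <- 200
s <- cbind(runif(n), runif(n)); s0 <- c(0.5, 0.5)
X <- cbind(1, s[, 1]); x0 <- c(1, s0[1])
y <- rnorm(n)                                   # placeholder data

draws <- data.frame(beta0 = rnorm(G, 1, 0.1), beta1 = rnorm(G, 2, 0.1),
                    sigma2 = rgamma(G, 20, 20), tau2 = rgamma(G, 5, 50),
                    phi = rgamma(G, 30, 10))    # stand-ins for MCMC output

D <- as.matrix(dist(s)); d0 <- sqrt(colSums((t(s) - s0)^2))
y0 <- sapply(1:G, function(g) {
  b   <- c(draws$beta0[g], draws$beta1[g])
  Sig <- draws$sigma2[g] * exp(-draws$phi[g] * D) + draws$tau2[g] * diag(n)
  gam <- draws$sigma2[g] * exp(-draws$phi[g] * d0)
  m   <- drop(x0 %*% b + gam %*% solve(Sig, y - X %*% b))      # conditional mean
  v   <- draws$sigma2[g] + draws$tau2[g] - drop(gam %*% solve(Sig, gam))
  rnorm(1, m, sqrt(v))                          # one posterior predictive draw
})
quantile(y0, c(0.025, 0.5, 0.975))              # posterior predictive summary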
Kriging example from WinBUGS

• Data were first published by Davis (1973) and consist of heights at 52 locations in a 310-foot square area.

• We have 52 s = (x, y) coordinates and outcomes (heights). The unit of distance is 50 feet and the unit of elevation is 10 feet.

• We predict elevations at 225 new locations.

• The model is

height = β + ε, where ε ∼ N(0, Σ),

and where

Σ = σ²H(φ). Here, H(φ)ij = ρ(si − sj; φ) = exp(−φ||si − sj||^κ).

• Priors are placed on (β, φ, κ).
model {
  # Spatially structured multivariate normal likelihood (exponential correlation function)
  height[1:N] ~ spatial.exp(mu[], x[], y[], tau, phi, kappa)
  for(i in 1:N) {
    mu[i] <- beta
  }
  # Priors
  beta ~ dflat()
  tau ~ dgamma(0.001, 0.001)
  sigma2 <- 1/tau
  # priors for spatial.exp parameters
  phi ~ dunif(0.05, 20)
  # prior decay for correlation at min distance (0.2 x 50 ft) is 0.02 to 0.99
  # prior range for correlation at max distance (8.3 x 50 ft) is 0 to 0.66
  kappa ~ dunif(0.05, 1.95)
  # Spatial prediction
  # Single site prediction
  for(j in 1:M) {
    height.pred[j] ~ spatial.unipred(beta, x.pred[j], y.pred[j], height[])
  }
  # Only use joint prediction for small subset of points, due to length of time it takes to run
  for(j in 1:10) { mu.pred[j] <- beta }
  height.pred.multi[1:10] ~ spatial.pred(mu.pred[], x.pred[1:10], y.pred[1:10], height[])
}

[Figures: trace plots of beta, phi and kappa over the sampled iterations.]
[Figure: map of the posterior means of height.pred over the grid of 225 prediction locations.]

Hierarchical models for areal data

• Areal data: often aggregate outcomes over a well-defined area. Example: the number of cancer cases per county, or the proportion of people living in poverty in a set of census tracts.

• What are the inferential issues?
  1. Identifying a spatial pattern and its strength. If data are spatially correlated, measurements from areas that are 'close' will be more alike.
  2. Smoothing, and to what degree. Observed measurements often present extreme values due to small samples in small areas. Maximal smoothing: substitute the observed measurements by the overall mean in the region. Something less extreme is what we discuss later.
  3. Prediction: for a new area, how would we predict Y given measurements in other areas?
Defining neighbors

• A proximity matrix W with entries wij spatially connects areas i and j in some fashion.

• The wij can be thought of as weights and provide a means to introduce spatial structure into the model: areas that are 'closer by' in some sense are more alike.

• Typically, wii = 0.

• There are many choices for the wij:
  – Binary: wij = 1 if areas i and j share a common boundary, and 0 otherwise.
  – Continuous: a decreasing function of the intercentroidal distance.
  – Combination: wij = 1 if areas are within a certain distance of each other.

• Entries are often standardized by dividing by Σj wij = wi+. If entries are standardized, W will often be asymmetric.

• W need not be symmetric.

• For any problem, we can define first, second, third, etc. order neighbors. For distance bins (0, d1], (d1, d2], (d2, d3], ... we can define:
  1. W(1), the first-order proximity matrix, with wij(1) = 1 if the distance between i and j is less than d1.
  2. W(2), the second-order proximity matrix, with wij(2) = 1 if the distance between i and j is more than d1 but less than d2.
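A tiny R sketch (using a hypothetical four-area map, not any dataset from the notes) of a binary proximity matrix and its row-standardized version:

## Binary adjacency for a hypothetical map of 4 areas, then row-standardized
## weights w_ij / w_i+.  The neighbor lists below are invented for illustration.
W <- matrix(0, 4, 4)
neighbors <- list(c(2), c(1, 3, 4), c(2, 4), c(2, 3))
for (i in 1:4) W[i, neighbors[[i]]] <- 1

w.plus <- rowSums(W)      # w_i+ = number of neighbors of area i
W.std  <- W / w.plus      # row-standardized: rows sum to 1; no longer symmetric
W.std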
Areal data models

• Because these models are used mostly in epidemiology, we begin with an application in disease mapping to introduce concepts.

• Typical data:
  Yi: observed number of cases in area i, i = 1, ..., I.
  Ei: expected number of cases in area i.

• The Y's are assumed to be random, and the E's are assumed to be known and to depend on the number of persons ni at risk.

• An internally standardized estimate of Ei is

Ei = ni r̄ = ni (Σi yi / Σi ni),

corresponding to a constant disease rate across areas.

• An externally standardized estimate is

Ei = Σj nij rj,

where rj is the risk for persons of age group j (from some existing table of risks by age) and nij is the number of persons of age group j in area i.
Standard frequentist approach

• For small Ei, Yi|ηi ∼ Poisson(Ei ηi), with ηi the true relative risk in area i.

• The MLE is the standard mortality ratio:

η̂i = SMRi = Yi / Ei.

• The variance of the SMRi is

var(SMRi) = var(Yi)/Ei² = ηi/Ei,

estimated by plugging in η̂i to obtain var(SMRi) = Yi/Ei².

• To get a confidence interval for ηi, first assume that log(SMRi) is approximately normal.

• From a Taylor expansion:

Var[log(SMRi)] ≈ (1/SMRi²) Var(SMRi) = (Ei²/Yi²) × (Yi/Ei²) = 1/Yi.

• An approximate 95% CI for log(ηi) is

log(SMRi) ± 1.96/(Yi)^(1/2).

• Transforming back, an approximate 95% CI for ηi is

(SMRi exp(−1.96/(Yi)^(1/2)), SMRi exp(1.96/(Yi)^(1/2))).

• Suppose we wish to test whether the risk in area i is high relative to other areas. Then test

H0: ηi = 1 versus Ha: ηi > 1.

This is a one-sided test.

• Under H0, Yi ∼ Poisson(Ei), so the p-value for the test is

p = Prob(X ≥ Yi | Ei) = 1 − Prob(X < Yi | Ei) = 1 − Σ_{x=0}^{Yi−1} exp(−Ei) Ei^x / x!.

• If p < 0.05 we reject H0.
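A short R illustration (with hypothetical counts for a single area) of the SMR, its back-transformed approximate 95% CI, and the exact one-sided Poisson p-value:

## SMR, approximate 95% CI for eta_i, and one-sided Poisson p-value.
## Y and E below are hypothetical counts for one area.
Y <- 12; E <- 6.5

SMR  <- Y / E
ci   <- SMR * exp(c(-1, 1) * 1.96 / sqrt(Y))   # back-transformed CI for eta_i
pval <- 1 - ppois(Y - 1, lambda = E)           # Prob(X >= Y | E) under H0: eta_i = 1
c(SMR = SMR, lower = ci[1], upper = ci[2], p.value = pval)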
Hierarchical models for areal data

• To estimate and map the underlying relative risks, we might wish to fit a random effects model.

• Assumption: the true risks come from a common underlying distribution.

• Random effects models permit borrowing strength across areas to obtain better area-level estimates.

• Alas, the models may be complex:
  – High-dimensional: one random effect for each area.
  – Non-normal if the data are counts or binomial proportions.

• We have already discussed hierarchical Poisson models, so the material in the next few transparencies is a review.

Poisson-Gamma model

• Consider

Yi|ηi ∼ Poisson(Ei ηi)
ηi|a, b ∼ Gamma(a, b).

• Since E(ηi) = a/b and Var(ηi) = a/b², we can fix a, b as follows:
  – A priori, let E(ηi) = 1, the null value.
  – Let Var(ηi) = (0.5)², large on that scale.

• The resulting prior is Gamma(4, 4).

• The posterior is also Gamma:

p(ηi|yi) = Gamma(yi + a, Ei + b).
• A point estimate of ηi is

E(ηi|y) = E(ηi|yi) = (yi + a)/(Ei + b)
        = (yi + [E(ηi)]²/Var(ηi)) / (Ei + E(ηi)/Var(ηi))
        = (Ei/(Ei + E(ηi)/Var(ηi))) (yi/Ei) + ((E(ηi)/Var(ηi))/(Ei + E(ηi)/Var(ηi))) E(ηi)
        = wi SMRi + (1 − wi) E(ηi),

where wi = Ei/[Ei + E(ηi)/Var(ηi)].

• The Bayesian point estimate is a weighted average of the data-based SMRi and the prior mean E(ηi).
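A short R illustration (hypothetical counts) of this shrinkage with the Gamma(4, 4) prior derived above: two areas with the same SMR, but the area with the smaller Ei is pulled much closer to the prior mean of 1.

## Poisson-Gamma posterior means E(eta_i | y_i) = (y_i + a)/(E_i + b), Gamma(4, 4) prior.
## Counts are hypothetical: a small area and a larger area, both with SMR = 1.5.
a <- 4; b <- 4
y <- c(3, 30); E <- c(2, 20)

smr    <- y / E
w      <- E / (E + b)             # weight on the data; b = E(eta)/Var(eta)
shrunk <- (y + a) / (E + b)
rbind(SMR = smr, weight = w, posterior.mean = shrunk)
# Same SMR, but the small area (E_i = 2) is shrunk much closer to the prior mean 1.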
Poisson-lognormal models with spatial errors

• The Poisson-Gamma model does not allow (easily) for spatial correlation among the ηi.

• Instead, consider the Poisson-lognormal model, where in the second stage we model the log-relative risks log(ηi) = ψi:

Yi|ψi ∼ Poisson(Ei exp(ψi))
ψi = x'iβ + θi + φi,

where the xi are area-level covariates.

• The θi are assumed to be exchangeable and model between-area variability:

θi ∼ N(0, 1/τh).
• The θi incorporate global extra-Poisson variability in the log-relative risks (across the entire region).

• The φi are the 'spatial' parameters; they capture regional clustering. They model extra-Poisson variability in the log-relative risks at the local level, so that 'neighboring' areas have similar risks.

• One way to model the φi is to proceed as in the point-referenced data case. For φ = (φ1, ..., φI), consider

φ|µ, λ ∼ NI(µ, H(λ)),

with H(λ)ii' = cov(φi, φi') and hyperparameters λ.

• Possible models for H(λ) include the exponential, the powered exponential, etc.

• While sensible, this model is difficult to fit because:
  – lots of matrix inversions are required;
  – the distance between φi and φi' may not be obvious.

CAR model

• It is more reasonable to think of a neighbor-based proximity measure and consider a conditionally autoregressive (CAR) model for φ:

φi ∼ N(φ̄i, 1/(τc mi)),

where

φ̄i = (1/mi) Σ_{j≠i} wij φj,

and mi is the number of neighbors of area i. Earlier we called this wi+.

• The weights wij are (typically) 0 if areas i and j are not neighbors and 1 if they are.

• CAR models lend themselves to the Gibbs sampler. Each φi can be sampled from its full conditional distribution, so no matrix inversion is needed:

p(φi | all) ∝ Poisson(yi | Ei e^{x'iβ + θi + φi}) × N(φi | φ̄i, 1/(mi τc)).
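Continuing the small hypothetical adjacency example from earlier (not data from the notes), the CAR conditional means φ̄i, i.e. the neighbor averages, are a one-line computation in R:

## CAR conditional means: phi.bar_i = (1/m_i) * sum_j w_ij * phi_j,
## using the 4-area binary W from the earlier sketch and invented phi values.
W <- matrix(0, 4, 4)
neighbors <- list(c(2), c(1, 3, 4), c(2, 4), c(2, 3))
for (i in 1:4) W[i, neighbors[[i]]] <- 1

phi <- c(0.3, -0.1, 0.2, -0.4)
m   <- rowSums(W)                 # m_i = number of neighbors of area i
phi.bar <- drop(W %*% phi) / m    # average of the neighboring phi_j
cbind(phi, m, phi.bar)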
Difficulties with CAR model

• The CAR prior is improper: it is a pairwise difference prior, identified only up to a constant.

• The posterior will still be proper, but to identify an intercept β0 for the log-relative risks, we need to impose the constraint Σi φi = 0.

• In simulation, the constraint is imposed numerically by recentering each vector φ around its own mean.

• τh and τc cannot be too large, because θi and φi become unidentifiable. We observe only one Yi in each area, yet we try to fit two random effects. There is very little data information.

• Hyperpriors for τh, τc need to be chosen carefully. Consider

τh ∼ Gamma(ah, bh), τc ∼ Gamma(ac, bc).

• To place equal emphasis on heterogeneity and spatial clustering, it is tempting to make ah = ac and bh = bc. This is not correct because:
  1. The τh prior is defined marginally, whereas the τc prior is conditional.
  2. The conditional prior precision is τc mi. Thus, a scale that satisfies

sd(θi) = 1/√τh ≈ 1/(0.7 √(m̄ τc)) ≈ sd(φi),

with m̄ the average number of neighbors, is more 'fair' (Bernardinelli et al. 1995, Statistics in Medicine).
Example: Incidence of lip cancer in 56 areas in Scotland

Data on the number of lip cancer cases in 56 counties in Scotland were obtained. Expected lip cancer counts Ei were available.

The covariate was the proportion of the population in each county working in agriculture.

The model included only one random effect bi, to introduce spatial association between counties. Two counties are neighbors if they have a common border (i.e., they are adjacent).

Three counties that are islands have no neighbors; for those, WinBUGS sets the random spatial effect to 0. The relative risk for the islands is thus based on the baseline rate α0 and on the value of the covariate xi.

We wish to smooth and map the relative risks RR.

Model

model {
  # Likelihood
  for (i in 1 : N) {
    O[i] ~ dpois(mu[i])
    log(mu[i]) <- log(E[i]) + alpha0 + alpha1 * X[i]/10 + b[i]
    RR[i] <- exp(alpha0 + alpha1 * X[i]/10 + b[i])   # Area-specific relative risk (for maps)
  }
  # CAR prior distribution for random effects:
  b[1:N] ~ car.normal(adj[], weights[], num[], tau)
  for(k in 1:sumNumNeigh) {
    weights[k] <- 1
  }
  # Other priors:
  alpha0 ~ dflat()
  alpha1 ~ dnorm(0.0, 1.0E-5)
  tau ~ dgamma(0.5, 0.0005)    # prior on precision
  sigma <- sqrt(1 / tau)       # standard deviation
}
Data

list(N = 56,
  O = c( 9, 39, 11,  9, 15,  8, 26,  7,  6, 20,
        13,  5,  3,  8, 17,  9,  2,  7,  9,  7,
        16, 31, 11,  7, 19, 15,  7, 10, 16, 11,
         5,  3,  7,  8, 11,  9, 11,  8,  6,  4,
        10,  8,  2,  6, 19,  3,  2,  3, 28,  6,
         1,  1,  1,  1,  0,  0),
  E = c( 1.4,  8.7,  3.0,  2.5,  4.3,  2.4,  8.1,  2.3,  2.0,  6.6,
         4.4,  1.8,  1.1,  3.3,  7.8,  4.6,  1.1,  4.2,  5.5,  4.4,
        10.5, 22.7,  8.8,  5.6, 15.5, 12.5,  6.0,  9.0, 14.4, 10.2,
         4.8,  2.9,  7.0,  8.5, 12.3, 10.1, 12.7,  9.4,  7.2,  5.3,
        18.8, 15.8,  4.3, 14.6, 50.7,  8.2,  5.6,  9.3, 88.7, 19.6,
         3.4,  3.6,  5.7,  7.0,  4.2,  1.8),
  X = c(16, 16, 10, 24, 10, 24, 10,  7,  7, 16,
         7, 16, 10, 24,  7, 16, 10,  7,  7, 10,
         7, 16, 10,  7,  1,  1,  7,  7, 10, 10,
         7, 24, 10,  7,  7,  0, 10,  1, 16,  0,
         1, 16, 16,  0,  1,  7,  1,  1,  0,  1,
         1,  0,  1,  1, 16, 10),
  num = c(3, 2, 1, 3, 3, 0, 5, 0, 5, 4,
          0, 2, 3, 3, 2, 6, 6, 6, 5, 3,
          3, 2, 4, 8, 3, 3, 4, 4, 11, 6,
          7, 3, 4, 9, 4, 2, 4, 6, 3, 4,
          5, 5, 4, 5, 4, 6, 6, 4, 9, 2,
          4, 4, 4, 5, 6, 5),
  adj = c(
    19, 9, 5,
    10, 7,
    12,
    28, 20, 18,
    19, 12, 1,
    17, 16, 13, 10, 2,
    29, 23, 19, 17, 1,
    22, 16, 7, 2,
    5, 3,
    19, 17, 7,
    35, 32, 31,
    29, 25,
    29, 22, 21, 17, 10, 7,
    29, 19, 16, 13, 9, 7,
    56, 55, 33, 28, 20, 4,
    17, 13, 9, 5, 1,
    56, 18, 4,
    50, 29, 16,
    16, 10,
    39, 34, 29, 9,
    56, 55, 48, 47, 44, 31, 30, 27,
    29, 26, 15,
    43, 29, 25,
    56, 32, 31, 24,
    45, 33, 18, 4,
    50, 43, 34, 26, 25, 23, 21, 17, 16, 15, 9,
    55, 45, 44, 42, 38, 24,
    47, 46, 35, 32, 27, 24, 14,
    31, 27, 14,
    55, 45, 28, 18,
    54, 52, 51, 43, 42, 40, 39, 29, 23,
    46, 37, 31, 14,
    41, 37,
    46, 41, 36, 35,
    54, 51, 49, 44, 42, 30,
    40, 34, 23,
    52, 49, 39, 34,
    53, 49, 46, 37, 36,
    51, 43, 38, 34, 30,
    42, 34, 29, 26,
    49, 48, 38, 30, 24,
    55, 33, 30, 28,
    53, 47, 41, 37, 35, 31,
    53, 49, 48, 46, 31, 24,
    49, 47, 44, 24,
    54, 53, 52, 48, 47, 44, 41, 40, 38,
    29, 21,
    54, 42, 38, 34,
    54, 49, 40, 34,
    49, 47, 46, 41,
    52, 51, 49, 38, 34,
    56, 45, 33, 30, 24, 18,
    55, 27, 24, 20, 18),
  sumNumNeigh = 234)
Results

The proportion of individuals working in agriculture appears to be associated with the incidence of cancer.

[Figure: posterior density of alpha1, based on 10,000 samples.]

There appears to be spatial correlation among counties.

[Figures: box plot of the spatial random effects b[1]–b[56], and maps of the posterior mean relative risks RR across the 56 counties (200 km scale).]
Example: Lung cancer in London

Data were obtained from an annual report by the London Health Authority in which the association between health and poverty was investigated.

The population under study is men 65 years of age and older.

Available were:
• Observed and expected lung cancer counts in 44 census ward districts.
• A ward-level index of socio-economic deprivation.

A model with two random effects was fitted to the data. One random effect introduces region-wide heterogeneity (between-ward variability) and the other introduces regional clustering.

The priors for the two components of random variability are sometimes known as convolution priors.

Interest is in:
• Determining the association between health outcomes and poverty.
• Smoothing and mapping relative risks.
• Assessing the degree of spatial clustering in these data.
Model

model {
  # Likelihood
  for (i in 1 : N) {
    O[i] ~ dpois(mu[i])
    log(mu[i]) <- log(E[i]) + alpha + beta * depriv[i] + b[i] + h[i]
    RR[i] <- exp(alpha + beta * depriv[i] + b[i] + h[i])   # Area-specific relative risk (for maps)
    # Exchangeable prior on unstructured random effects
    h[i] ~ dnorm(0, tau.h)
  }
  # CAR prior distribution for spatial random effects:
  b[1:N] ~ car.normal(adj[], weights[], num[], tau.b)
  for(k in 1:sumNumNeigh) {
    weights[k] <- 1
  }
  # Other priors:
  alpha ~ dflat()
  beta ~ dnorm(0.0, 1.0E-5)
  tau.b ~ dgamma(0.5, 0.0005)
  sigma.b <- sqrt(1 / tau.b)
  tau.h ~ dgamma(0.5, 0.0005)
  sigma.h <- sqrt(1 / tau.h)
  propstd <- sigma.b / (sigma.b + sigma.h)
}

Note that the priors on the precisions for the exchangeable and the spatial random effects are Gamma(0.5, 0.0005). That means that, a priori, the expected value of the standard deviations is approximately 0.03, with a relatively large prior standard deviation.

This is not a "fair" prior as discussed in class: the average number of neighbors is 4.8, and this is not taken into account in the choice of priors.
Results

Parameter                        Mean     Std      2.5th perc.   97.5th perc.
Alpha                            -0.208   0.1      -0.408        -0.024
Beta                             0.0474   0.0179   0.0133        0.0838
Relative size of spatial std     0.358    0.243    0.052         0.874

[Figures: posterior densities of sigma.b and sigma.h, based on 1,000 samples.]
Posterior distribution of RR for the first 15 wards

node     mean     sd       2.5%     median   97.5%
RR[1]    0.828    0.1539   0.51     0.8286   1.148
RR[2]    1.18     0.2061   0.7593   1.188    1.6
RR[3]    0.8641   0.1695   0.5621   0.8508   1.247
RR[4]    0.8024   0.1522   0.5239   0.7873   1.154
RR[5]    0.7116   0.1519   0.379    0.7206   0.9914
RR[6]    1.05     0.2171   0.7149   1.015    1.621
RR[7]    1.122    0.1955   0.7556   1.116    1.589
RR[8]    0.821    0.1581   0.4911   0.8284   1.132
RR[9]    1.112    0.2167   0.7951   1.07     1.702
RR[10]   1.546    0.2823   1.072    1.505    2.146
RR[11]   0.7697   0.1425   0.4859   0.7788   1.066
RR[12]   0.865    0.16     0.6027   0.8464   1.26
RR[13]   1.237    0.3539   0.8743   1.117    2.172
RR[14]   0.8359   0.1807   0.5279   0.8084   1.3
RR[15]   0.7876   0.1563   0.489    0.7869   1.141
[Figures: maps of the observed counts O, the spatial random effects b, and the smoothed relative risks RR across the 44 wards (2.5 km scale).]