jointly distributed random variables

env/bio 665 Bayesian inference for environmental models

Jim Clark

2020-01-27

#resources

##software

source('../clarkFunctions2020.r')

##readings

Models for Ecological Data, Appendix D.

Evaluating the impacts of fungal seedling pathogens on temperate forest seedling survival, Hersh et al., on joint, conditional, predictive distributions, Ecology.

#objectives

• Understand key concepts:
  – discrete and continuous densities, probability mass functions, and probability density functions
  – the sample space for a distribution
  – moments of a distribution
  – factoring a joint distribution
  – graphing a model
  – the contribution of data and prior to the posterior estimate of a mean
• Apply basic rules to manipulate jointly distributed random variables:
  – total probability
  – Bayes theorem
• Use R to draw random samples and to determine density and probability for standard distributions:
  – binomial and Bernoulli
  – beta and beta-binomial
  – normal and multivariate normal
• Find the posterior distribution of regression parameters

“only grade-schoolers can divide, only undergraduates can differentiate, a rare PhD can integrate”, Mark Twain, maybe

#Attention au calcul

I need a few rules from calculus to manipulate distributions. Calculus is more often important for its conceptual and notational contributions than for actual solutions to equations. Most functions cannot be integrated analytically. The limited capacity to handle the integration constant needed for Bayes theorem stalled progress until numerical analysis advanced with methods such as Gibbs sampling. Although I will not integrate much, calculus provides concepts needed here. For probability, I need both distribution functions and density functions. The latter is the derivative of the former.
The notation is powerful, providing a direct connection to concepts and notation used in basic algebra. The derivative $dF/dx$ can be viewed as the limit of a ratio,

$$f(x) = \frac{dF}{dx} = \lim_{dx \rightarrow 0} \left[ \frac{F(x + dx) - F(x)}{dx} \right]$$

i.e., division. Multiplication $f(x) \cdot dx$ can be viewed as the limit of the anti-derivative,

$$\int_x^{x+dx} f(u) du = \lim_{dx \rightarrow 0} [f(x) \cdot dx]$$

Ironically, although multiplication is easy and division is hard, the tables are turned on their calculus counterparts: differentiation is usually easier than integration. The idea of derivatives and integrals as limiting quantities is important for computation in Bayesian analysis. Differentiation is needed for optimization problems, which are more important for maximum likelihood. Optimizations and integrations are commonly approximated numerically. In the discussion ahead I rely on some of the notation of calculus, but there are no difficult solutions.

#Basic probability rules and the Janzen-Connell hypothesis

Basic probability ideas are introduced here with the Janzen-Connell (JC) effect. The JC effect is believed to promote forest tree diversity as natural enemies disproportionately attack the most abundant tree hosts. The mechanism requires that natural enemies are host-specific and that they most efficiently find and/or impact host populations when and where those hosts are abundant. Fungi are plausible candidates for the JC effect, because there are many taxa, and they include known pathogens of trees. To test whether or not fungal pathogens contribute to tree diversity, Hersh et al. (2012) planted seedlings of six species in 60 plots, observed survival, and assayed the seedlings for fungal infection with cultures and DNA sequencing. A model was constructed for the relationship between pathogen, host, and observations, and it was used to infer where pathogens occur (‘incidence’), infection of hosts, and the effect of infection on survival.
This example introduces some of the techniques used in Bayesian analysis, including some basic distribution theory. I begin with background on distributions, followed by examples that demonstrate ways to look at the JC hypothesis.

##continuous and discrete probability distributions

I express uncertainty about an event using probability. Uncertainty could be temporary, expressing current information: my prediction of heads or tails will be updated when I flip this coin. Or it could be indefinite, expressing my ability to predict in general: about half of my predictions for coin tosses will be wrong. A probability is dimensionless. It can be zero, one, or somewhere in between. In the sections that follow I introduce basic distribution theory needed for Bayesian analysis.

##probability spaces

The probability distribution assigns probability values to the set of all possible outcomes over the sample space. The sample space for a continuous probability distribution is all or part of the real line. The normal distribution applies to the full real line, $\mathbb{R} = (-\infty, \infty)$. The gamma (including the exponential) and the log-normal distributions apply to the positive real numbers, $\mathbb{R}^+ = (0, \infty)$. The beta distribution $beta(a, b)$ applies to real numbers in the interval $[0, 1]$. The uniform distribution, $unif(a, b)$, applies to the interval $[a, b]$. These are univariate distributions: they describe one dimension. [Note: when discussing an interval, a square bracket indicates that I am including the boundary–the uniform distribution is defined at zero, but the lognormal distribution is not.] A multivariate distribution describes variation in more than one dimension. For example, a d-dimensional multivariate normal distribution describes a length-$d$ random vector in $\mathbb{R}^d$.

Three continuous distributions supported on different portions of the real line. Zero and one (grey lines) are unsupported. PDFs above and CDFs below.
The sample space for a discrete distribution is a set of discrete values. For the Bernoulli distribution, the sample space is $\{0, 1\}$. For the binomial distribution, $binom(n, \theta)$, the sample space is $\{0, \dots, n\}$. For count data, often modeled as a Poisson distribution, the sample space is $\{0, 1, 2, \dots\}$. The probability mass function means that there is point mass on specific values (e.g., integers) and no support elsewhere.

A common multivariate discrete distribution is the multinomial distribution. It assigns probability to $n$ trials, each of which can have $J$ outcomes, $multinom(n, (\theta_1, \dots, \theta_J))$. It has the sample space $\{0, \dots, n\}^J$, subject to the constraint that the sum over all $J$ classes is equal to $n$.

Three discrete distributions, with PMFs above and CDFs below.

The figures show continuous and discrete distributions, each in two forms, as densities and as cumulative probabilities.

##probability density and cumulative probability

The cumulative distribution function (CDF) $P(x)$ accumulates continuous probability over the sample space, from low values (near zero) to high values (near one). The probability density function (PDF) is the derivative of the CDF, $f(x) = dP/dx$. Because the CDF is a dimensionless probability, its derivative must have units of $1/x$. The CDF is obtained from the PDF by integration,

$$P(x) = \int_{-\infty}^x f(u) du$$

I cannot assign a probability to a continuous value of $x$, only to an interval, say $(x, x + dx)$. For a small interval $dx$ the following relationships are useful,

$$P(x + dx) - P(x) = \int_x^{x+dx} f(u) du \approx f(x) \cdot dx$$

For example, if I wanted to assure myself that 68% of the normal density lies within 1 sd of the mean, I evaluate the CDF at these values and take the difference:

p <- pnorm(c(-1,1))
diff(p)

## [1] 0.6826895

Because integration is like multiplication, it adds one integer value to the exponent of the units, from the PDF ($x^{-1}$) to the CDF ($x^0$).
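The approximation $P(x + dx) - P(x) \approx f(x) \cdot dx$ can be checked directly; here is a minimal sketch in R using the standard normal (the evaluation point $x = 1.3$ is arbitrary):

```r
# approximate the PDF from the CDF: [P(x + dx) - P(x)]/dx -> f(x) as dx -> 0
x  <- 1.3
dx <- 1e-6
slope <- ( pnorm(x + dx) - pnorm(x) )/dx   # finite-difference derivative of CDF
slope
dnorm(x)                                   # density at x, nearly identical
```

Shrinking dx further drives the two values together, which is exactly the limit definition of the derivative given earlier.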
The area under the PDF is

$$1 = P(\infty) = \int_{-\infty}^{\infty} f(x) dx$$

For discrete distributions the PDF is replaced with a probability mass function (PMF). Like the CDF (and unlike the PDF), the PMF is a dimensionless probability. To obtain the CDF from the PMF I sum, rather than integrate,

$$P(x) = \sum_{k \le x} f(k)$$

To obtain the PMF from the CDF I difference, rather than differentiate,

$$f(x) = P(x) - P(x - 1)$$

The sum over the sample space is

$$1 = \sum_{k \in \mathcal{Y}} f(k)$$

where $\mathcal{Y}$ is the sample space.

In R there are functions for common distributions. The CDF begins with the letter p. The PDF and PMF begin with the letter d. To obtain random values use the letter r. For quantiles use the letter q. Suppose I want to draw the PMF for the Poisson distribution having intensity $\lambda = 4.6$. I can generate a sequence of integer values and then use dpois:

k  <- c(0:20)
dk <- dpois(k, 4.6)
plot(k, dk)
segments(k, 0, k, dk)

I could draw the CDF with ppois:

pk <- ppois(k, 4.6)
plot(k, pk, type='s')

If I want to confirm the values of k having these probabilities, I invert the probabilities with qpois:

qpois(pk, 4.6)

## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

—————————————— ===== ——————————————

Example 1. I want to generate a random sample from a normal distribution and see if I can ‘recover the mean’. Here are the steps I use:

• define a sample size, a mean, and a variance
• draw a random sample using rnorm
• estimate parameters as $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{n-1}{n} var(x)$
• determine a 95% confidence interval using qnorm
• draw a PDF using dnorm and a CDF using pnorm based on the estimates
• determine if the true mean lies within this interval

Plots from Example 1.

Now repeat this 1000 times and count the number of times the confidence interval includes the true mean.

Density of low and high 95% CIs from Example 1.

—————————————— ===== ——————————————

#moments

Moments are expectations of powers of $x$. They are used to summarize the location and shape of a distribution.
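For a discrete distribution, moments can be computed by brute-force summation over (a truncation of) the sample space. A sketch using the Poisson PMF drawn earlier, for which the mean and variance both equal $\lambda$:

```r
# first moment (mean) and second central moment (variance) of Pois(4.6),
# computed as weighted sums over a truncation of the sample space
k  <- 0:100
fk <- dpois(k, 4.6)
mu <- sum( k*fk )             # E[x]   = 4.6 for a Poisson
v  <- sum( (k - mu)^2*fk )    # var[x] = 4.6 for a Poisson
c(mu, v)
```

The truncation at k = 100 is harmless here because the upper tail of Pois(4.6) is vanishingly small.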
I can think of the $r^{th}$ moment of a distribution as a weighted average of $x^r$. For a continuous distribution,

$$E[x^r] = \int_{-\infty}^{\infty} x^r f(x) dx$$

The first moment, $r = 1$, is the mean. The variance is a central moment,

$$var[x] = E[(x - E[x])^2] = \int_{-\infty}^{\infty} (x - E[x])^2 f(x) dx$$

For discrete variables, we replace the integral with a sum.

#discrete probability for multiple host states

I now want to apply probability concepts to the Janzen-Connell question. In the foregoing discussion I used notation to distinguish between probability and density. In cases where I do not want to name a specific PDF or CDF I use the shorthand bracket notation. Let $[I, S]$ be the joint probability of two events, i) that a host plant is infected, $I$, and ii) that it survives, $S$. In this example, both of these events are binary, being either true (indicated with a one) or not (indicated with a zero). For example, the probability that an individual is not infected and survives is written as $[I = 0, S = 1]$. If I write simply $[I, S]$ for these binary events, it is interpreted as the probability that both events are a ‘success’ or ‘true’, i.e., $[I = 1, S = 1]$.

A graphical model of relationships discussed for the Janzen Connell hypothesis. Symbols are I - infected host, S - survival, D - detection.

As previously, states (or events) and parameters are nodes in the graph, the states $\{I, S, D\}$, and the parameters $\{\theta, \delta, s_0, s_1\}$. The connections, or arrows, between nodes are sometimes called edges. Here I assign parameters:

$$[I] = \theta$$
$$[D|I = 1] = \delta$$
$$[S|I = 0] = s_0$$
$$[S|I = 1] = s_1$$

A host can become infected with probability $[I] = \theta$. An infection can be detected with probability $[D|I = 1] = \delta$. An infected individual survives with probability $[S|I = 1] = s_1$, and a non-infected individual survives with probability $[S|I = 0] = s_0$. For this example, I assume no false positives, $[D = 1|I = 0] = 0$.
##start simple

A study of infection and host survival could be modeled as a joint distribution $[I, S]$. I might be interested in estimating state $I$, in comparing parameter estimates (‘does $s_0$ differ from $s_1$?’), or both. Both events are unknown before they are observed. After data collection I know $S$, but not $I$. I want a model for the conditional probability of survival given that an individual is infected, $[S|I = 1] = s_1$, or not, $[S|I = 0] = s_0$. The notation $[S|I]$ indicates that the event to the left of the bar is ‘conditional on’ the event to the right. An arrow points from $I$ to $S$, because I believe that infection might affect survival, but I do not believe that survival influences infection (because I am not concerned with infection ‘after’ or ‘caused by’ death). The challenge is that I observe $S$, but not $I$. I cannot condition on $I$ if it is unknown. Rather, I want to estimate it. Progress requires Bayes theorem.

Nodes from the graphical model for infection status and survival.

To make use of the model I need a relationship between conditional and joint probabilities,

$$[I, S] = [S|I][I]$$

Here I have factored the joint probability on the left-hand side into a conditional distribution and a marginal distribution on the right-hand side. I can also factor the joint distribution this way,

$$[I, S] = [I|S][S]$$

Because both are equal to the joint probability, they must be equal to each other,

$$[S|I][I] = [I|S][S]$$

Rearranging, I have Bayes theorem, solved two ways,

$$[S|I] = \frac{[I|S][S]}{[I]}$$

and

$$[I|S] = \frac{[S|I][I]}{[S]}$$

The two pieces of this relationship that I have not yet defined are the marginal distributions, $[S]$ and $[I]$. I could evaluate either one conditional on the other using the law of total probability,

$$[I = 0] = \sum_{k \in \{0,1\}} [I = 0|S = k][S = k]$$

or

$$[S = 1] = \sum_{k \in \{0,1\}} [S = 1|I = k][I = k]$$

How can I use these relationships to address the effect of infection on survival?
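Before answering, the factorization rules can be verified numerically. A sketch in R with assumed parameter values (the values of $\theta$, $s_0$, $s_1$ below are illustrative, not estimates):

```r
# assumed (illustrative) parameter values
theta <- 0.3    # [I = 1], infection probability
s0    <- 0.8    # [S = 1 | I = 0]
s1    <- 0.5    # [S = 1 | I = 1]

# total probability: [S = 1], summing over infection states
pS1 <- s0*(1 - theta) + s1*theta

# Bayes theorem: [I = 1 | S = 1] = [S = 1 | I = 1][I = 1]/[S = 1]
pI1.S1 <- s1*theta/pS1

# both factorizations recover the same joint probability [I = 1, S = 1]
c( s1*theta, pI1.S1*pS1 )
```

The two factorizations agree exactly, which is the identity that Bayes theorem rearranges.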
Given survival status, I first determine the probability that the individual was infected. I have four factors, all univariate: two conditional and two marginal distributions. I have defined $[S|I]$ in terms of parameter values, but I want to know $[I|S]$. For a host that survived, Bayes theorem gives me

$$[I|S = 1] = \frac{[S = 1|I][I]}{[S = 1]} = \frac{[S = 1|I][I]}{\sum_{k \in \{0,1\}} [S = 1|I = k][I = k]} = \frac{s_1 \theta}{s_0 (1 - \theta) + s_1 \theta}$$

For a host that died this conditional probability is

$$[I|S = 0] = \frac{[S = 0|I][I]}{[S = 0]} = \frac{(1 - s_1)\theta}{(1 - s_0)(1 - \theta) + (1 - s_1)\theta}$$

These two expressions demonstrate that, if I knew the parameter values, then I could evaluate the conditional probability $[I|S]$. If I do not know parameter values, then they too might be estimated. Before going further, notice that the numerator is always the ‘unnormalized’ probability of the two events. The denominator simply normalizes them.

—————————————— ===== ——————————————

Exercise 1. In R: Assume that there are $m = 100$ hosts and that infection decreases survival probability ($s_1 < s_0$). Define parameter values for $\{s_0, s_1, \theta\}$ and draw a binomial distribution for $[I|S = 0]$ and for $[I|S = 1]$. (Use the function dbinom.) Is the infection rate estimated to be higher for survivors or for those that die? How are these two distributions affected by the underlying prevalence of infection, $\theta$? [Hint: write down the probabilities required and then place them in a function].

Comparison of binomial distributions of infection for survivors and deaths.

—————————————— ===== ——————————————

##continuous probability for parameters

Here I consider the problem of estimating parameters. The survival parameters are $s_I = \{s_0, s_1\}$. From Bayes theorem I need

$$[s_I|S] = \frac{[S|s_I][s_I]}{[S]}$$

where the subscript $I = 0$ (uninfected) or $I = 1$ (infected). Again, assuming I know whether or not a host is infected, I can write the distribution for survival conditioned on parameters as

$$[S|s_I] = s_I^S (1 - s_I)^{1-S}$$

This is a Bernoulli distribution.
The Bernoulli distribution is a special case of the binomial distribution for a single trial,

$$Bernoulli(\theta) = binom(1, \theta)$$

If I know $S$, and I want to estimate $s_I$, I again need Bayes theorem. Unlike states, the survival parameters take continuous values on $(0, 1)$. The total probability of survival $S$ requires not summation, but rather integration,

$$[S] = \int_0^1 [S|s_I][s_I] ds_I$$

I now have the elements needed to write the conditional distribution for $[s_I|S]$, but it involves an integral expression. One way to insure a solution to this integral is to assume that the marginal distribution of $s_I$ is a beta distribution, which results in a marginal beta-binomial distribution for $[S]$,

$$betaBinom(S|m, a, b) = \int_0^1 binom(S|m, s_I) beta(s_I|a, b) ds_I$$

To understand this integral, I draw the two distributions in the integrand. When I integrate I smear out the binomial distribution based on the variation represented by the beta distribution for $\pi$.

par(mfrow=c(2,2), mar=c(4,4,1,1), bty='n')

m  <- 50                       # no. at risk
S  <- 0:m
pi <- .35                      # survival Pr
b  <- 4                        # beta parameter
a  <- signif(b/(1/pi - 1),3)   # 2nd beta parameter to give mean value = pi

plot(S, dbinom(S, m, pi), type='s', lwd=3, xlab='S', ylab='[S]')
title('binomial distribution for m, pi')

plot(S/m, dbeta(S/m, a, b), lwd=2, xlab=expression( pi ),
     ylab=expression( paste("[", pi, "]") ), type='l')
title('beta density for a, b')
ptext <- paste( "(", a, ", ", b, ")", sep="")
text(1, 1.5, ptext, pos=2)

plot(S, dbinom(S, m, pi), type='s', lwd=3, xlab='S', ylab='[S]', col = 'grey')
lines(S, dbetaBinom(S, m, mu=pi, b=b), type='s', lwd=2, col='blue')
abline(v=pi*m, lty=2)
title('betabinomial for m, a, b')

Comparison of binomial and beta-binomial (above) and beta density with parameters (a, b) (right).

In the foregoing code I wanted to draw a $betaBinom(a, b)$ distribution with a mean of $\pi = 0.35$ and wide variance.
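The beta-binomial integral can also be evaluated numerically with R's integrate and compared with the closed form $\binom{m}{S} B(S + a, m - S + b)/B(a, b)$, the standard beta-binomial PMF. A sketch using the parameter values from the code above (a = 2.15 is the value produced by signif(b/(1/pi - 1), 3)):

```r
# beta-binomial PMF by numerical integration of binomial x beta
m <- 50; a <- 2.15; b <- 4; S <- 15
num <- integrate( function(p) dbinom(S, m, p)*dbeta(p, a, b), 0, 1 )$value

# closed form: choose(m, S) B(S + a, m - S + b)/B(a, b)
cf  <- choose(m, S)*beta(S + a, m - S + b)/beta(a, b)

c(num, cf)   # agree to numerical precision
```

The agreement confirms that 'smearing' the binomial with a beta prior on the success probability yields the beta-binomial.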
I selected a low value of parameter b and then used moments (the mean) to determine the value of parameter a,

$$\pi = \frac{a}{a + b}$$

To draw the binomial PMF I used the R function dbinom. The PMF for the beta-binomial is drawn by dbetaBinom in clarkFunctions2020.r. For the next exercise, consult the appendix on moments.

—————————————— ===== ——————————————

Exercise 2. The variance of the beta distribution decreases as the value of parameter b increases. Change parameter values to demonstrate this with a plot. Then compare the mean and variance (see Appendix) from the moments for the binomial and beta-binomial. Here it is for the beta-binomial:

meanS <- sum( S*dbetaBinom(S, m, mu=pi, b=b) )
varS  <- sum( (S - meanS)^2*dbetaBinom(S, m, mu=pi, b=b) )

For the binomial, your variance should agree with $m\pi(1 - \pi)$.

—————————————— ===== ——————————————

The graphical model for known infection status and survival and unknown survival probability.

From Bayes’ theorem I could now write the posterior distribution as

$$[s_I|S, a, b] = \frac{Bernoulli(S|s_I) beta(s_I|a, b)}{betaBinom(S|a, b)} = \frac{s_I^{S+a-1} (1 - s_I)^{b-S}}{B(S + a, 1 - S + b)}$$

where $B(\cdot)$ is the beta function. Here’s another way to draw this function, showing the prior beta distribution and posteriors for a single observation, survived or died:

post <- function(S, a, b, p){
  p^(S+a-1)*(1-p)^(b-S)/beta(S + a, 1 - S + b)
}

S <- 0
p <- seq(.01, .99, length=50)
plot(p, post(S=0, a, b, p), xlab='pi', ylab = '[pi]', type='l')  # if obs died
lines(p, post(S=1, a, b, p), col=2)                              # if obs survived
lines(p, dbeta(p, a, b), lty=2, col = 3)                         # prior
legend('topright', c('died','survived','prior (dashed)'), text.col=c(1,2,3))

With these few basic rules, I return to the Janzen-Connell hypothesis. The graph summarizes several events that influence infection and survival. I can use the model to evaluate important properties of the process it represents, to estimate parameters, and to predict behavior.
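As a quick check on the conjugate update above, the single-trial posterior is itself a beta density: $beta(a + 1, b)$ for a survivor ($S = 1$) and $beta(a, b + 1)$ for a death ($S = 0$). A sketch comparing the explicit form with dbeta (the post function is rewritten here so the block is self-contained; a = 2.15 and b = 4 are the illustrative values used above):

```r
# the single-trial posterior written out explicitly, as in the text
post <- function(S, a, b, p) p^(S + a - 1)*(1 - p)^(b - S)/beta(S + a, 1 - S + b)

a <- 2.15; b <- 4; p <- 0.4
c( post(S = 1, a, b, p), dbeta(p, a + 1, b) )   # survived: beta(a + 1, b)
c( post(S = 0, a, b, p), dbeta(p, a, b + 1) )   # died:     beta(a, b + 1)
```

Each pair matches, confirming that the normalizing constant $B(S + a, 1 - S + b)$ makes the posterior a proper beta density.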
Parameter values might be estimated from previous studies, or they might be completely unknown. The states might be observed or not. Consider a few of the ways the model could be used in the following example.

—————————————— ===== ——————————————

Example 2. What is the probability of infection $I$ where $D$ and $S$ are unknown?

I only need to consider arrows that ‘cause’ $I$–if neither $D$ nor $S$ causes $I$, and there is no knowledge of them that could affect my subjective probability of event $I$, then they have no influence on the result. The event $I = 1$ has probability $\theta$.

—————————————— ===== ——————————————

In the next exercise I return to the original graph to consider parameter estimates.

—————————————— ===== ——————————————

Example 3. If I know $S$ but I have no knowledge of $D$ or $I$, what is the probability of $I$?

This problem asks for the conditional probability $[I|S]$. I know $[S|I]$. I already determined $[I = 1] = \theta$. I still need $[S]$, which I can obtain using total probability. For an individual that survived,

$$[S = 1] = \sum_I [S = 1|I][I] = [S = 1|I = 0][I = 0] + [S = 1|I = 1][I = 1] = s_0 (1 - \theta) + s_1 \theta$$

By substitution I have

$$[I|S = 1] = \frac{s_1 \theta}{s_0 (1 - \theta) + s_1 \theta}$$

—————————————— ===== ——————————————

Exercise 3. I observe $D$ and $S$, and I know $\delta$ from previous studies. What is the probability of an observation $[D = 1, S = 1]$? (Hint: use total probability on $[D, I, S]$.) Now write down the posterior distribution of parameters, given observations and known detection probability, i.e., $[s_I, \theta|D = 1, S = 1, \delta]$.

—————————————— ===== ——————————————

These examples of discrete states and continuous parameters are used by Clark and Hersh and Hersh et al. to evaluate the effect of co-infection by multiple pathogens that attack multiple hosts. These models admit covariates, which could include abundances of host plants or environmental variables.
#the normal distribution

I used the normal distribution for regression examples without saying much about it. In Bayesian analysis it is used not only for the likelihood, but also as a prior distribution for parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve the normal distribution.

##Bayesian estimate of the mean

To obtain an estimate of the mean of a normal distribution I combine likelihood and prior distribution. My observations are $y_i, i = 1, \dots, n$. The likelihood for one observation $i$ is

$$f(y_i|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} exp\left[ -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right]$$

The likelihood for the sample of $n$ observations is

$$f(\mathbf{y}|\mu, \sigma^2) = \prod_{i=1}^n f(y_i|\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]$$

Recalling Bayes theorem, I combine this likelihood with a prior distribution. I use a normal distribution for the mean. For now I assume that $\sigma^2$ is fixed. Here is a prior distribution for $\mu$,

$$f(\mu|m, M)$$

To obtain posterior estimates I will use a trick that starts with the following observation. I know the posterior distribution will have this form,

$$f(\mu|Vv, V)$$

where $V$ will be the variance, and $v$ is an unknown constant. If I write out the exponent for the normal distribution I get this:

$$-\frac{1}{2V}(\mu - Vv)^2 = -\frac{1}{2}\left( \frac{\mu^2}{V} - 2\mu v + Vv^2 \right)$$

I now know that the inverse variance $V^{-1}$ will be whatever multiplies $\mu^2$, and $v$ will be whatever multiplies $-2\mu$. I want to multiply likelihood and prior, then find these constants.
Here is likelihood times prior, focusing on factors that include $\mu$ in the exponent,

$$f(\mathbf{y}|\mu, \sigma^2) f(\mu|m, M) \propto exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 - \frac{1}{2M}(\mu - m)^2 \right]$$

Setting $S_y = \sum_{i=1}^n y_i$ and extracting only the terms I need,

$$\mu^2 \left( \frac{n}{\sigma^2} + \frac{1}{M} \right) - 2\mu \left( \frac{S_y}{\sigma^2} + \frac{m}{M} \right)$$

shows that

$$V = \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1}$$

$$v = \frac{S_y}{\sigma^2} + \frac{m}{M}$$

You might recognize this to be a weighted average of data and prior, with the inverses of variances being the weights,

$$Vv = \frac{\frac{S_y}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}}$$

Note how large $n$ will swamp the prior,

$$\lim_{n \rightarrow \infty} Vv = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}$$

The prior can fight back with a tiny prior variance $M$,

$$\lim_{M \rightarrow 0} Vv = m$$

—————————————— ===== ——————————————

Exercise 4. Write a function to determine the posterior estimate of the mean for a normal likelihood, normal prior distribution, and known variance $\sigma^2$. You will need to generate a sample, supply a prior mean and variance, determine the posterior mean and variance, and plot.

Bayesian analysis of the mean.

Then demonstrate the effect of $n$ and $M$.

—————————————— ===== ——————————————

##Bayesian regression (known $\sigma^2$)

For the regression model, I start with matrix notation,

$$\mathbf{y} \sim MVN(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})$$

where $\mathbf{y}$ is the length-$n$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix, $\boldsymbol{\beta}$ is the length-$p$ vector of coefficients, and $\boldsymbol{\Sigma}$ is an $n \times n$ covariance matrix. I can write this as

$$(2\pi)^{-n/2} |\boldsymbol{\Sigma}|^{-1/2} exp\left[ -\frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$

Because we assume i.i.d. (independent, identically distributed) $y_i$, the covariance matrix is $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$, and $|\boldsymbol{\Sigma}|^{-1/2} = (\sigma^2)^{-n/2}$, giving us

$$(2\pi)^{-n/2} (\sigma^2)^{-n/2} exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$

This is the form of the likelihood I use to obtain the conditional posterior for regression coefficients.
The multivariate prior distribution is also multivariate normal,

$$[\beta_1, \dots, \beta_p] = MVN(\boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) = \frac{1}{(2\pi)^{p/2} det(\mathbf{B})^{1/2}} exp\left[ -\frac{1}{2} (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b}) \right]$$

If there are $p$ predictors, then $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'$. The prior mean is a length-$p$ vector $\mathbf{b}$. The prior covariance matrix could be a non-informative diagonal matrix,

$$\mathbf{B} = \begin{pmatrix} B & 0 & \cdots & 0 \\ 0 & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B \end{pmatrix}$$

for some large value $B$. The posterior distribution is $MVN(\boldsymbol{\beta}|\mathbf{Vv}, \mathbf{V})$, where

$$\mathbf{V} = (\sigma^{-2} \mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})^{-1}$$

$$\mathbf{v} = \sigma^{-2} \mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}$$

(appendix). Taking limits as I did for the previous example, I obtain the MLE for the mean parameter vector,

$$\lim_{n \rightarrow \infty} \mathbf{Vv} = (\mathbf{X'X})^{-1}\mathbf{X'y}$$

(appendix).

—————————————— ===== ——————————————

Exercise 5. Obtain the posterior mean and variance for regression parameters for a simulated data set. Your algorithm might proceed as follows:

1. define $n$, $p$, and $\sigma^2$
2. generate an $n \times p$ matrix $\mathbf{X}$ from random values, and set the first column to ones
3. generate a $p \times 1$ matrix $\boldsymbol{\beta}$ from random values
4. generate an $n \times 1$ vector $\mathbf{y}$ using rnorm
5. specify a $p \times 1$ prior matrix $\mathbf{b}$ and prior covariance matrix $\mathbf{B}$
6. write a function to evaluate $\mathbf{V}$ and $\mathbf{v}$, and return the mean vector and covariance matrix

Marginal posterior densities for beta.

Explain how you would check that the algorithm is correct.

—————————————— ===== ——————————————

##Residual variance (known $\boldsymbol{\beta}$)

Now I assume that I know the coefficients and want to estimate the residual variance $\sigma^2$.
Recall the likelihood for the normal distribution,

$$f(\mathbf{y}|\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \propto \sigma^{-2(n/2)} exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]$$

A prior distribution for $\sigma^2$ that is commonly used is the inverse gamma,

$$IG(\sigma^2|s_1, s_2) = \frac{s_2^{s_1}}{\Gamma(s_1)} \sigma^{-2(s_1+1)} exp(-s_2 \sigma^{-2}) \propto \sigma^{-2(s_1+1)} exp(-s_2 \sigma^{-2})$$

If I combine likelihood and prior I get another inverse gamma distribution,

$$IG(\sigma^2|u_1, u_2) \propto \sigma^{-2(s_1 + n/2 + 1)} exp\left[ -\sigma^{-2} \left( s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2 \right) \right]$$

Then $u_1 = s_1 + n/2$, and $u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2$. Here is a prior and posterior distribution for a sample data set.

library(MCMCpack)

par(bty='n')
n  <- 10
y  <- rnorm(n)
s1 <- s2 <- 1
yb <- mean(y)
ss <- seq(0, 4, length=100)
u1 <- s1 + n/2
u2 <- s2 + 1/2*sum( (y - yb)^2 )
plot(ss, dinvgamma(ss, u1, u2), type='l', lwd=2)
lines(ss, dinvgamma(ss, s1, s2), col='blue', lwd=2)

Prior and posterior IG distribution

##residual variance for regression

For regression, I replace $\mu$ with $\mathbf{X}\boldsymbol{\beta}$, so that $u_2 = s_2 + \frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$. To see this, recall the likelihood,

$$\sigma^{-2(n/2)} exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$

—————————————— ===== ——————————————

Exercise in class. Find the conditional posterior distribution for the variance in regression. Based on the previous two blocks of code, write a function to evaluate the variance for a sample regression.

—————————————— ===== ——————————————

##small step to Gibbs sampling

The conditional posterior distributions for coefficients and variance will be combined with Gibbs sampling. To see how this will come together, consider that we can now sample $[\boldsymbol{\beta}|\sigma^2]$ and, conversely, $[\sigma^2|\boldsymbol{\beta}]$. If we alternate these two steps repeatedly we have a simulation for their joint distribution, $[\boldsymbol{\beta}, \sigma^2]$. To see the setup that is used in jags, refer back to unit 2. For the regression example, I would simply add an additional step.
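The alternation can be sketched for the simplest case, the intercept-only model, using the two conditional posteriors derived above: the Vv rule for $[\mu|\sigma^2]$ and the inverse gamma for $[\sigma^2|\mu]$. This is a minimal illustration, not the regression sampler; the prior values m, M, s1, s2 are illustrative:

```r
# minimal Gibbs sampler for [mu, sigma^2], alternating the two conditional
# posteriors: normal for mu (Vv rule), inverse gamma for sigma^2
set.seed(1)
y  <- rnorm(100, 2, 1)     # simulated data, true mu = 2, sigma^2 = 1
n  <- length(y)
m  <- 0; M <- 100          # normal prior on mu (illustrative)
s1 <- s2 <- 1              # IG prior on sigma^2 (illustrative)

ng     <- 2000
chains <- matrix(NA, ng, 2, dimnames = list(NULL, c('mu','sigma2')))
mu <- 0; sigma2 <- 1       # initial values

for(g in 1:ng){

  # [mu | sigma^2]: the Vv rule
  V  <- 1/( n/sigma2 + 1/M )
  v  <- sum(y)/sigma2 + m/M
  mu <- rnorm(1, V*v, sqrt(V))

  # [sigma^2 | mu]: IG(s1 + n/2, s2 + 1/2 sum (y - mu)^2)
  u1 <- s1 + n/2
  u2 <- s2 + 1/2*sum( (y - mu)^2 )
  sigma2 <- 1/rgamma(1, u1, u2)   # IG draw as reciprocal of a gamma draw

  chains[g,] <- c(mu, sigma2)
}

colMeans( chains[-(1:200),] )     # posterior means, near the true (2, 1)
```

The first 200 iterations are discarded as burnin. The jags example that follows automates exactly this kind of alternation.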
#jags example

To see how well we can recover parameters when they are known, here is a simulated data set:

n     <- 100                        # sample size
p     <- 4                          # no. predictors
beta  <- matrix( rnorm(p), p)       # coefficients
sigma <- .1                         # residual variance
x     <- matrix( rnorm(n*p), n, p)  # design
x[,1] <- 1                          # intercept
mu    <- x%*%beta
y     <- rnorm(n, mu, sqrt(sigma) )
pairs(cbind(y, x[,-1]))

If I knew the residual variance, this would be my Bayesian estimate:

B <- diag(10000, p)
b <- beta*0
V <- solve( 1/sigma*crossprod(x) + solve(B) )
v <- 1/sigma*crossprod(x, y)
betaHat <- V%*%v
betaSe  <- sqrt( diag(V) )
coefficients <- signif( cbind(beta, betaHat, betaSe), 4)
colnames(coefficients) <- c('true', 'estimate', 'Se')
coefficients

##         true estimate      Se
## [1,] 0.27350   0.3340 0.03200
## [2,] 0.68000   0.6661 0.03347
## [3,] 0.44740   0.4949 0.03721
## [4,] 0.04942   0.0363 0.03312

For comparison, here’s the classical estimate:

summary( lm( y ~ x[,-1]) )$coefficients[,1:2]

##              Estimate Std. Error
## (Intercept) 0.3339715 0.03367717
## x[, -1]1    0.6660953 0.03522470
## x[, -1]2    0.4949326 0.03915431
## x[, -1]3    0.0363018 0.03484890

Now I want to sample the joint distribution of $[\boldsymbol{\beta}, \sigma^2]$.
Here’s jags:

library(rjags)

## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs

file <- "lmSimulated.txt"

cat("model{

  # Likelihood
  for(i in 1:n){
    y[i]  ~ dnorm(mu[i], precision)
    mu[i] <- inprod(beta[], x[i,])
  }
  for (i in 1:p) {
    beta[i] ~ dnorm(0, 1.0E-5)
  }

  # Prior for the inverse variance
  precision ~ dgamma(0.01, 0.01)
  sigma     <- 1/precision

}", file = file)

Here is a function that sets up the posterior sampling:

model <- jags.model(file=file, data = list(x = x, y = y, n=nrow(x), p=ncol(x)))

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 5
##    Total graph size: 713
##
## Initializing model

I start with 100 burnin iterations, then sample for 2000:

update(model, 100)
jagsLm <- coda.samples(model, variable.names=c("beta","sigma"), n.iter=2000)
tmp <- summary(jagsLm)
print(tmp$statistics)

##               Mean         SD     Naive SE Time-series SE
## beta[1] 0.33539775 0.03402236 0.0007607631   0.0007607631
## beta[2] 0.66702382 0.03561230 0.0007963152   0.0008575067
## beta[3] 0.49500158 0.03941578 0.0008813636   0.0008813636
## beta[4] 0.03524711 0.03548611 0.0007934935   0.0008929009
## sigma   0.11349828 0.01673148 0.0003741272   0.0003929038

Here are plots:

plot(jagsLm)

—————————————— ===== ——————————————

Exercise in class. Make an informative prior distribution for regression parameters. Then compare the estimates you get with the non-informative prior. Do this analytically and with jags.

—————————————— ===== ——————————————

#recap

Bayesian analysis requires some basic distribution theory to combine data and prior information to generate a posterior distribution. Fundamental ways to parameterize probability include probability density functions (continuous), probability mass functions (discrete), and cumulative distribution functions (both). The sample space defines the allowable (non-zero probability) values for a random variable. Integrating (continuous) or summing (discrete) over the sample space gives a probability of 1.
Distributions have moments, which are expectations of integer powers of a random variable. The first moment is the mean, and the second central moment is the variance. Higher moments include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).

Joint distributions can be factored into conditional and marginal distributions. A conditional distribution assumes a specific value for the variable that is being conditioned on. Marginalizing over a variable is done with the law of total probability. Bayes theorem relies on a specific factorization giving a posterior distribution in terms of likelihood and prior.

R can be used to draw random variables and to evaluate densities and probabilities. Binomial and Bernoulli distributions apply to numbers of successes in $n$ or 1 trials, respectively. The multivariate normal distribution is commonly used as a prior distribution. When combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.

#appendix

Here I provide a bit more detail on moments used in the beta-binomial example, the posterior for regression parameters, and its connection to maximum likelihood estimates.

##moments

Moments describe the shape of a distribution. The mean of the distribution is the first moment. The variance is the second central moment. The $r^{th}$ moment of a distribution for $x$ is the expected value of $x^r$. For a continuous variable $x$ having PDF $f(x)$ this is

$$E[x^r] = \int_{-\infty}^{\infty} x^r f(x) dx$$

Note that the zeroth moment = 1, the area under the PDF. For a discrete variable this is

$$E[x^r] = \sum_{x \in \mathcal{Y}} x^r f(x)$$

Let $\mu = E[x^1]$ be the first moment. Then the $r^{th}$ central moment is

$$E[(x - \mu)^r] = \int_{-\infty}^{\infty} (x - \mu)^r f(x) dx$$

(continuous) and

$$E[(x - \mu)^r] = \sum_{x \in \mathcal{Y}} (x - \mu)^r f(x)$$

(discrete). The variance is $E[(x - \mu)^2]$.

Moments also exist for a sample. In this case I can think of the discrete probability assigned to each observation as $1/n$, where $n$ is the number of observations.
Plugging this into the discrete moment equation I have

$$\bar{x} = E[x] = \frac{1}{n} \sum_{i=1}^n x_i$$

for the sample mean and

$$var(x) = E[(x - \bar{x})^2] = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$

for the sample variance.

##Bayesian regression parameters

As for the example for the mean of the normal distribution, I apply the “big-V, small-v” method. For matrices, the exponent of $f(\boldsymbol{\beta}|\mathbf{Vv}, \mathbf{V})$ is

$$-\frac{1}{2} (\boldsymbol{\beta} - \mathbf{Vv})' \mathbf{V}^{-1} (\boldsymbol{\beta} - \mathbf{Vv}) = -\frac{1}{2} (\boldsymbol{\beta}' \mathbf{V}^{-1} \boldsymbol{\beta} - 2\boldsymbol{\beta}'\mathbf{v} + \mathbf{v}'\mathbf{Vv})$$

As before, I find $\mathbf{V}$ and $\mathbf{v}$ in the first two terms. When I combine the regression likelihood with this prior distribution, I have an exponent on the multivariate normal distribution that looks like this,

$$-\frac{1}{2} \left[ \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 + (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b}) \right]$$

or like this,

$$-\frac{1}{2} \left[ \frac{1}{\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b}) \right]$$

where $\mathbf{y}$ is the length-$n$ vector of responses, and $\mathbf{X}$ is the $n \times p$ design matrix. Retaining only terms containing coefficients, I collect terms,

$$-2\boldsymbol{\beta}'(\sigma^{-2}\mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}) + \boldsymbol{\beta}'(\sigma^{-2}\mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})\boldsymbol{\beta}$$

I identify the parameter vectors,

$$\mathbf{V} = (\sigma^{-2}\mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})^{-1}$$

$$\mathbf{v} = \sigma^{-2}\mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}$$

These determine the posterior distribution.

##connection to maximum likelihood

Consider again the likelihood, now ignoring the prior distribution, having exponent

$$logL \propto -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

To maximize the log likelihood I consider only these terms, because others do not contain parameters. I differentiate once,

$$\frac{d logL}{d\boldsymbol{\beta}} = \sigma^{-2}\mathbf{X}'\mathbf{y} - \sigma^{-2}\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$

and again,

$$\frac{d^2 logL}{d\boldsymbol{\beta}^2} = -\sigma^{-2}\mathbf{X}'\mathbf{X}$$

To obtain MLEs I set the first derivative equal to zero and solve,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

The matrix of curvatures, or second derivatives, is related to Fisher Information and the covariance of parameter estimates,

$$\mathbf{I} = -\frac{d^2 logL}{d\boldsymbol{\beta}^2}$$

The covariance of parameter estimates is $\mathbf{I}^{-1}$.
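The closed-form MLE can be checked against R's lm on simulated data; a quick sketch (the dimensions and coefficient values are arbitrary):

```r
# check the closed-form MLE (X'X)^{-1} X'y against lm()
set.seed(2)
n <- 100; p <- 3
X <- cbind(1, matrix( rnorm(n*(p - 1)), n, p - 1 ))   # design with intercept
beta <- c(1, -0.5, 2)
y <- X%*%beta + rnorm(n, 0, 0.1)

bHat <- solve( crossprod(X) )%*%crossprod(X, y)       # (X'X)^{-1} X'y
cbind( bHat, coef( lm(y ~ X[,-1]) ) )                 # columns agree
```

lm uses a QR decomposition rather than forming X'X directly, but with a well-conditioned design the two answers coincide to machine precision.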