Bayesian Inference for Environmental Models

jointly distributed random variables env/bio 665 Bayesian inference for environmental models

Jim Clark

2020-01-27


#resources

##software

    source('../clarkFunctions2020.r')

##readings

Models for Ecological Data, Appendix D.

Evaluating the impacts of fungal seedling pathogens on temperate forest seedling survival, Hersh et al., on joint, conditional, predictive distributions, Ecology.

#objectives

• Understand key concepts:

– discrete and continuous densities, probability mass functions, and probability density functions

– the sample space for a distribution

– moments of a distribution

– factor a joint distribution

– graph a model

– the contribution of data and prior to the posterior estimate of a mean

• Apply basic rules to manipulate jointly distributed random variables:

– total probability

– Bayes theorem

• Use R to draw random samples and to determine density and probability for standard distributions

– binomial and Bernoulli

– beta and beta-binomial

– normal and multivariate normal

• Find the posterior distribution of regression parameters

“only grade-schoolers can divide, only undergraduates can differentiate, a rare PhD can integrate”, Mark Twain, maybe

#Attention au l’calcule

I need a few rules from calculus to manipulate distributions. Calculus is more often important for its conceptual and notational contributions than for actual solutions to equations. Most functions cannot be integrated analytically. The limited capacity to handle the integration constant needed for Bayes theorem stalled progress until numerical analysis advanced with methods such as Gibbs sampling. Although I will not integrate much, calculus provides concepts needed here.

For probability, I need both distribution functions and density functions. The latter is the derivative of the former. The notation is powerful, providing a direct connection to concepts and notation used in basic algebra. The derivative $dP/dx$ can be viewed as a limit of a ratio

$$p(x) = \frac{dP}{dx} = \lim_{dx \rightarrow 0} \left[ \frac{P(x + dx) - P(x)}{dx} \right]$$

i.e., division. Multiplication $p(x) \cdot dx$ can be viewed as the limit of the anti-derivative

$$\int_x^{x + dx} p(u) du = \lim_{dx \rightarrow 0} \left[ p(x) \cdot dx \right]$$

Ironically, although multiplication is easy and division is hard, the tables are turned on their calculus counterparts: differentiation is usually easier than integration.

The idea of derivatives and integrals as limiting quantities is important for computation in Bayesian analysis. Differentiation is needed for optimization problems, which are more important for maximum likelihood. Optimizations and integrations are commonly approximated numerically.

In the discussion ahead I rely on some of the notation of calculus, but there are no difficult solutions.

#Basic probability rules and the Janzen Connell hypothesis

Basic probability ideas are introduced here with the Janzen Connell (JC) effect. The JC effect is believed to promote forest tree diversity as natural enemies disproportionately attack the most abundant tree hosts. The mechanism requires that natural enemies are host-specific and that they most efficiently find and/or impact host populations when and where those hosts are abundant. Fungi are plausible candidates for the JC effect, because there are many taxa, and they include known pathogens of trees.

To test whether or not fungal pathogens contribute to tree diversity, Hersh et al. (2012) planted seedlings of six species in 60 plots, observed survival, and assayed the seedlings for fungal infection on cultures and with DNA sequencing. A model was constructed for the relationship between pathogen, host, and observations, and it was used to infer where pathogens occur (‘incidence’), infection of hosts, and effect of infection on survival. This example introduces some of the techniques used in Bayesian analysis, including some basic distribution theory. I begin with background on distributions, followed by examples that demonstrate ways to look at the JC hypothesis.

##continuous and discrete probability distributions

I express uncertainty about an event using probability. Uncertainty could be temporary, expressing current information: my prediction of heads or tails will be updated when I flip this coin. Or it could be indefinite, expressing my ability to predict in general: about half of my predictions for coin tosses will be wrong. A probability is dimensionless. It can be zero, one, or somewhere in between. In the sections that follow I introduce basic distribution theory needed for Bayesian analysis.

##probability spaces

The probability distribution assigns probability values to the set of all possible outcomes over the sample space. The sample space for a continuous probability distribution is all or part of the real line. The normal distribution applies to the full real line, $\mathbb{R} = (-\infty, \infty)$. The gamma (including the exponential) and the log-normal distributions apply to the positive real numbers $\mathbb{R}_+ = (0, \infty)$. The beta distribution $beta(a, b)$ applies to real numbers in the interval $[0, 1]$. The uniform distribution, $unif(a, b)$, applies to the interval $[a, b]$. These are univariate distributions: they describe one dimension. [Note: when discussing an interval, a square bracket indicates that I am including the boundary and a round bracket that I am excluding it. A uniform distribution with $a = 0$ is defined at zero, but the lognormal distribution is not.]

A multivariate distribution describes variation in more than one dimension. For example, a $d$-dimensional multivariate normal distribution describes a length-$d$ random vector in $\mathbb{R}^d$.

Three continuous distributions supported on different portions of the real line. Zero and one (grey lines) are unsupported. PDFs above and CDFs below.

The sample space for a discrete distribution is a set of discrete values. For the Bernoulli distribution, the sample space is $\{0, 1\}$. For the binomial distribution, $binom(n, \theta)$, the sample space is $\{0, \dots, n\}$. For count data, often modeled as a Poisson distribution, the sample space is $\{0, 1, 2, \dots\}$. The probability mass function means that there is point mass on specific values (e.g., integers) and no support elsewhere.

A common multivariate discrete distribution is the multinomial distribution. It assigns probability to $n$ trials, each of which can have $J$ outcomes, $multinom(n, (\theta_1, \dots, \theta_J))$. It has the sample space $\{0, \dots, n\}^J$, subject to the constraint that the sum over all $J$ classes is equal to $n$.
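The sum constraint on the multinomial sample space can be checked by simulation. This is a minimal sketch using base R `rmultinom` and `dmultinom`; the values of `n` and `theta` are assumptions chosen only for illustration.

```r
set.seed(1)
n     <- 10               # number of trials (assumed value)
theta <- c(.2, .5, .3)    # probabilities for J = 3 classes (assumed values)

# each column of y is one multinomial draw of length J
y <- rmultinom(5, size = n, prob = theta)
colSums(y)                # every draw sums to n

# probability of one particular outcome in the sample space
pr <- dmultinom(c(2, 5, 3), size = n, prob = theta)
```

Each column of `y` holds counts in $\{0, \dots, n\}$, and `colSums(y)` returns `n` for every draw.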

Three discrete distributions, with PMFs above and CDFs below.

The figures show continuous and discrete distributions, each in two forms, as densities and as cumulative probabilities.

##probability density and cumulative probability

The cumulative distribution function (CDF) $P(x)$ accumulates continuous probability over the sample space, from low values (near-zero) to high values (near-one). The probability density function (PDF) is the derivative of the CDF, $p(x) = dP/dx$. Because the CDF is a dimensionless probability, its derivative must have units of $1/x$. The CDF is obtained from the PDF by integration,

$$P(x) = \int_{-\infty}^{x} p(u) du$$

I cannot assign a probability to a continuous value of $x$, only to an interval, say $(x, x + dx)$. For a small interval $dx$ the following relationships are useful,

$$P(x + dx) - P(x) = \int_x^{x + dx} p(u) du \approx p(x) \cdot dx$$

For example, if I wanted to assure myself that 68% of the normal density lies within 1 sd of the mean, I evaluate the CDF at these values and take the difference:

    p <- pnorm( c(-1, 1) )   # CDF at one sd below and above the mean
    diff(p)

    ## [1] 0.6826895

Because integration is like multiplication, it adds one integer value to the exponent, from the PDF ($x^{-1}$) to the CDF ($x^0$). The area under the PDF is

$$1 = P(\infty) = \int_{-\infty}^{\infty} p(x) dx$$

For discrete distributions the PDF is replaced with a probability mass function (PMF). Like the CDF (and unlike the PDF), the PMF is a dimensionless probability. To obtain the CDF from the PMF I sum, rather than integrate,

$$P(x) = \sum_{k \leq x} p(k)$$

To obtain the PMF from the CDF I difference, rather than differentiate,

$$p(x) = P(x) - P(x - 1)$$

The sum over the sample space is

$$1 = \sum_{k \in \mathcal{K}} p(k)$$

where $\mathcal{K}$ is the sample space.

In R there are functions for common distributions. The CDF begins with the letter p. The PDF and PMF begin with the letter d. To obtain random values use the letter r. For quantiles use the letter q.

Suppose I want to draw the PMF for the Poisson distribution having intensity $\lambda = 4.6$. I can generate a sequence of integer values and then use dpois:

    k  <- c(0:20)
    dk <- dpois(k, 4.6)
    plot(k, dk)
    segments(k, 0, k, dk)

I could draw the CDF with ppois:

    pk <- ppois(k, 4.6)
    plot(k, pk, type = 's')

If I want to confirm the values of k having these probabilities, I invert the probabilities with qpois:

    qpois(pk, 4.6)

    ## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

—————————————— ===== ——————————————

Example 1. I want to generate a random sample from a normal distribution and see if I can ‘recover the mean’. Here are the steps I use:

• define a sample size, a mean and variance

• draw a random sample using rnorm

• estimate parameters as $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{n}{n-1} var(x)$, where $var(x)$ is the sample variance defined in the appendix

• determine a 95% confidence interval using qnorm

• draw a PDF using dnorm and CDF using pnorm based on the estimates

• determine if the true mean lies within this interval

Plots from Example 1.

Now repeat this 1000 times and count the number of times the confidence interval includes the true mean.
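The steps in Example 1 can be sketched as follows. This is one possible version, not the code behind the figures: the sample size, mean, and standard deviation are assumed values, and the helper `ci95` is a hypothetical name. It uses `sd()`, which already applies the $n - 1$ divisor.

```r
set.seed(1)
n     <- 50     # sample size (assumed value)
mu    <- 2      # true mean (assumed value)
sigma <- 1.5    # true standard deviation (assumed value)

# hypothetical helper: normal-approximation 95% CI for the mean
ci95 <- function(x){
  se <- sd(x)/sqrt(length(x))            # sd() divides by n - 1
  mean(x) + qnorm(c(.025, .975))*se
}

# a single sample
x      <- rnorm(n, mu, sigma)
ci     <- ci95(x)
inside <- ci[1] < mu & mu < ci[2]        # does the CI include the true mean?

# repeat 1000 times and count how often the CI includes the true mean
nsim  <- 1000
cover <- replicate(nsim, {
  ci <- ci95(rnorm(n, mu, sigma))
  ci[1] < mu & mu < ci[2]
})
coverage <- mean(cover)                  # fraction of intervals covering mu
```

If the interval behaves as advertised, `coverage` should be close to 0.95.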

Density of low and high 95% CIs from Example 1.

—————————————— ===== ——————————————

#moments

Moments are expectations of powers of $x$. They are used to summarize the location and shape of a distribution. I can think of the $m^{th}$ moment of a distribution as a weighted average of $x^m$. For a continuous distribution

$$E[x^m] = \int_{-\infty}^{\infty} x^m p(x) dx$$

The first moment, $m = 1$, is the mean. The variance is a central moment,

$$var[x] = E[(x - E[x])^2] = \int_{-\infty}^{\infty} (x - E[x])^2 p(x) dx$$

For discrete variables, we replace the integral with a sum.
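The moment definitions can be evaluated numerically. This sketch uses base R `integrate` for a normal density and a direct sum for a Poisson PMF; the parameter values are assumptions for illustration, and the Poisson sum is truncated at 200, where the remaining mass is negligible.

```r
# continuous case: normal with assumed mean 2 and sd 1.5
mu    <- 2
sigma <- 1.5
m1 <- integrate(function(x) x*dnorm(x, mu, sigma), -Inf, Inf)$value          # first moment
v1 <- integrate(function(x) (x - m1)^2*dnorm(x, mu, sigma), -Inf, Inf)$value # variance

# discrete case: Poisson with intensity 4.6, sum replaces the integral
lambda <- 4.6
k  <- 0:200
mp <- sum(k*dpois(k, lambda))             # first moment
vp <- sum((k - mp)^2*dpois(k, lambda))    # second central moment
```

The numerical results should agree with the known values: mean 2 and variance $1.5^2 = 2.25$ for the normal, and mean and variance both equal to $\lambda$ for the Poisson.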

#discrete probability for multiple host states

I now want to apply probability concepts to the Janzen-Connell question. In the foregoing discussion I used notation to distinguish between probability and density. In cases where I do not want to name a specific PDF or CDF I use the shorthand bracket notation.

Let $[I, S]$ be the joint probability of two events, i) that a host plant is infected, $I$, and ii) that it survives, $S$. In this example, both of these events are binary, being either true (indicated with a one) or not (indicated with a zero). For example, the probability that an individual is not infected and survives is written as $[I = 0, S = 1]$. If I write simply $[I, S]$ for these binary events, it is interpreted as the probability that both events are a ‘success’ or ‘true’, i.e., $[I = 1, S = 1]$.

A graphical model of relationships discussed for the Janzen Connell hypothesis. Symbols are I - infected host, S - survival, D - detection.

As previously, states (or events) and parameters are nodes in the graph, the states $\{I, S, D\}$, and the parameters $\{\theta, \phi, \pi_0, \pi_1\}$. The connections, or arrows, between nodes are sometimes called edges. Here I assign parameters:

$$[I] = \theta$$

$$[D|I = 1] = \phi$$

$$[S|I = 0] = \pi_0$$

$$[S|I = 1] = \pi_1$$

A host can become infected with probability $[I] = \theta$. An infection can be detected with probability $[D|I = 1] = \phi$. An infected individual survives with probability $[S|I = 1] = \pi_1$, and a non-infected individual survives with probability $[S|I = 0] = \pi_0$. For this example, I assume no false positives, $[D = 1|I = 0] = 0$.

##start simple

A study of infection and host survival could be modeled as a joint distribution $[I, S]$. I might be interested in estimating state $I$, in comparing parameter estimates (‘does $\pi_0$ differ from $\pi_1$?’), or both. Both events are unknown before they are observed. After data collection I know $S$, but not $I$. I want a model for the conditional probability of survival given that an individual is infected, $[S|I = 1] = \pi_1$, or not, $[S|I = 0] = \pi_0$. The notation $[S|I]$ indicates that the event to the left of the bar is ‘conditional on’ the event to the right.

An arrow points from $I$ to $S$, because I believe that infection might affect survival, but I do not believe that survival influences infection (because I am not concerned with infection ‘after’ or ‘caused by’ death). The challenge is that I observe $S$, but not $I$. I cannot condition on $I$ if it is unknown. Rather, I want to estimate it. Progress requires Bayes theorem.

Nodes from the graphical model for infection status and survival.

To make use of the model I need a relationship between conditional and joint probabilities,

[𝐼, 𝑆] = [𝑆|𝐼][𝐼]

Here I have factored the joint probability on the left-hand side into a conditional distribution and a marginal distribution on the right-hand side. I can also factor the joint distribution this way,

[𝐼, 𝑆] = [𝐼|𝑆][𝑆]

Because both are equal to the joint probability, they must be equal to each other,

$$[S|I][I] = [I|S][S]$$

Rearranging, I have Bayes theorem, solved two ways,

$$[S|I] = \frac{[I|S][S]}{[I]}$$

and

$$[I|S] = \frac{[S|I][I]}{[S]}$$

The two pieces of this relationship that I have not yet defined are the marginal distributions, $[S]$ and $[I]$. I could evaluate either one conditional on the other using the law of total probability,

$$[I = 0] = \sum_{j \in \{0,1\}} [I = 0|S = j][S = j]$$

or

$$[S = 1] = \sum_{j \in \{0,1\}} [S = 1|I = j][I = j]$$

How can I use these relationships to address the effect of infection on survival?

Given survival status, I first determine the probability that the individual was infected. I have four factors, all univariate, two conditional and two marginal distributions. I have defined $[S|I]$ in terms of parameter values, but I want to know $[I|S]$. For a host that survived, Bayes theorem gives me

$$[I|S = 1] = \frac{[S = 1|I][I]}{[S = 1]} = \frac{[S = 1|I][I]}{\sum_{j \in \{0,1\}} [S = 1|I = j][I = j]} = \frac{\pi_1 \theta}{\pi_0 (1 - \theta) + \pi_1 \theta}$$

For a host that died this conditional probability is

$$[I|S = 0] = \frac{[S = 0|I][I]}{[S = 0]} = \frac{(1 - \pi_1) \theta}{(1 - \pi_0)(1 - \theta) + (1 - \pi_1) \theta}$$

These two expressions demonstrate that, if I knew the parameter values, then I could evaluate the conditional probability for [𝐼|𝑆] . If I do not know parameter values, then they too might be estimated.

Before going further, notice that the numerator is always the ‘unnormalized’ probability of the two events. The denominator simply normalizes it.
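The two conditional probabilities for $[I|S]$ can be written as one small R function. This is a sketch only: the function name `pInfGivenS` and the parameter values are assumptions made for illustration.

```r
# hypothetical helper: probability of infection given survival status S (0 or 1)
pInfGivenS <- function(S, theta, pi0, pi1){
  like1 <- pi1^S*(1 - pi1)^(1 - S)     # [S | I = 1]
  like0 <- pi0^S*(1 - pi0)^(1 - S)     # [S | I = 0]
  num   <- like1*theta                 # unnormalized probability of (I = 1, S)
  num/(like0*(1 - theta) + num)        # normalize by [S], total probability
}

theta <- .3    # prevalence of infection (assumed value)
pi0   <- .8    # survival if not infected (assumed value)
pi1   <- .4    # survival if infected (assumed value)

pSurvived <- pInfGivenS(1, theta, pi0, pi1)   # [I | S = 1]
pDied     <- pInfGivenS(0, theta, pi0, pi1)   # [I | S = 0]
```

With these assumed values, $[I|S = 1] = 0.12/0.68 \approx 0.176$ and $[I|S = 0] = 0.18/0.32 = 0.5625$: a host that died is more likely to have been infected than the prevalence $\theta$ alone would suggest.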

—————————————— ===== ——————————————

Exercise 1. In R: Assume that there are $n = 100$ hosts and that infection decreases survival probability ($\pi_1 < \pi_0$). Define parameter values for $\{\pi_0, \pi_1, \theta\}$ and draw a binomial distribution for $[I|S = 0]$ and for $[I|S = 1]$. (Use the function dbinom.) Is the infection rate estimated to be higher for survivors or for those that die? How are these two distributions affected by the underlying prevalence of infection, $\theta$? [Hint: write down the probabilities required and then place them in a function].

Comparison of binomial distributions of infection for survivors and deaths.

—————————————— ===== ——————————————

##continuous probability for parameters

Here I consider the problem of estimating parameters. The survival parameters are $\pi_I = \{\pi_0, \pi_1\}$. From Bayes theorem I need

$$[\pi_I|S] = \frac{[S|\pi_I][\pi_I]}{[S]}$$

where the subscript $I = 0$ (uninfected) or $I = 1$ (infected). Again, assuming I know whether or not a host is infected, I can write the distribution for survival conditioned on parameters as

$$[S|\pi_I] = \pi_I^S (1 - \pi_I)^{1 - S}$$

This is a Bernoulli distribution. The Bernoulli distribution is a special case of the binomial distribution for a single trial,

𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) = 𝑏𝑖𝑛𝑜𝑚(1, 𝑝)

If I know $S$, and I want to estimate $\pi_I$, I again need Bayes theorem. Unlike states, the survival parameters take continuous values on $(0, 1)$. The total probability of survival $S$ requires not summation, but rather integration,

$$[S] = \int_0^1 [S|\pi_I][\pi_I] d\pi_I$$

I now have the elements needed to write the conditional distribution for $[\pi_I|S]$, but it involves an integral expression. One way to ensure a solution to this integral is to assume that the marginal distribution of $\pi_I$ is a beta distribution, which results in a marginal beta-binomial distribution for $[S]$,

$$betaBinom(S|m, a, b) = \int_0^1 binom(S|m, \pi_I) beta(\pi_I|a, b) d\pi_I$$

To understand this integral, I draw the two distributions in the integrand. When I integrate I smear out the binomial distribution based on the variation represented by the beta distribution for $\pi$.

    par( mfrow = c(2, 2), mar = c(4, 4, 1, 1), bty = 'n' )
    m  <- 50                          # no. at risk
    S  <- 0:m
    pi <- .35                         # survival Pr (masks the R constant pi)
    b  <- 4                           # beta parameter
    a  <- signif(b/(1/pi - 1), 3)     # 2nd beta parameter to give mean value = pi

    # binomial PMF
    plot(S, dbinom(S, m, pi), type = 's', lwd = 3, xlab = 'S', ylab = '[S]')
    title('binomial distribution for m, pi')

    # beta density
    plot(S/m, dbeta(S/m, a, b), lwd = 2, xlab = expression(pi),
         ylab = expression( paste("[", pi, "]") ), type = 'l')
    title('beta density for a, b')
    ptext <- paste("(", a, ", ", b, ")", sep = "")
    text(1, 1.5, ptext, pos = 2)

    # binomial (grey) and beta-binomial (blue)
    plot(S, dbinom(S, m, pi), type = 's', lwd = 3, xlab = 'S', ylab = '[S]',
         col = 'grey')
    lines(S, dbetaBinom(S, m, mu = pi, b = b), type = 's', lwd = 2, col = 'blue')
    abline(v = pi*m, lty = 2)
    title('betabinomial for m, a, b')

Comparison of binomial and beta-binomial (above) and beta density with parameters (a, b) (right).

In the foregoing code I wanted to draw a $betaBinom(m, a, b)$ distribution for which the underlying beta distribution has a mean of $\mu = 0.35$ and wide variance. I selected a low value of parameter b and then used moments (the mean) to determine the value of parameter a,

$$\mu = \frac{a}{a + b}$$

To draw the binomial PMF I used the R function dbinom. The PMF for the beta-binomial is drawn by dbetaBinom in clarkFunctions2020.r.
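For readers without clarkFunctions2020.r at hand, the beta-binomial PMF can also be written with base R. This is a sketch under stated assumptions: the helper `dbb` is a hypothetical name parameterized by (a, b), not the `dbetaBinom` function from the course file, and it uses the standard closed form $\binom{m}{S} B(S + a, m - S + b)/B(a, b)$. The numerical integral checks that the closed form agrees with the integral of binomial times beta.

```r
# hypothetical helper: beta-binomial PMF in closed form
dbb <- function(S, m, a, b){
  choose(m, S)*beta(S + a, m - S + b)/beta(a, b)
}

m  <- 50               # no. at risk
mu <- .35              # mean of the beta distribution
b  <- 4                # beta parameter
a  <- b*mu/(1 - mu)    # solves mu = a/(a + b) for a
S  <- 0:m

pS    <- dbb(S, m, a, b)
total <- sum(pS)       # PMF sums to one over the sample space
meanS <- sum(S*pS)     # first moment, equal to m*mu

# check one value against the integral of binomial x beta
S0  <- 10
num <- integrate(function(p) dbinom(S0, m, p)*dbeta(p, a, b), 0, 1)$value
```

Here `num` and `dbb(S0, m, a, b)` should agree to within the tolerance of the numerical integration.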

For the next exercise, consult the appendix on moments.

—————————————— ===== ——————————————

Exercise 2. The variance in the beta distribution decreases as the value of parameter b increases. Change parameter values to demonstrate this with a plot. Then compare the mean and variance (see Appendix) from the moments for the binomial and beta-binomial. Here it is for the beta-binomial:

    meanS <- sum( S*dbetaBinom(S, m, mu = pi, b = b) )
    varS  <- sum( (S - meanS)^2*dbetaBinom(S, m, mu = pi, b = b) )

For the binomial, your variance should agree with $m\pi(1 - \pi)$.

—————————————— ===== ——————————————

The graphical model for known infection status and survival and unknown survival probability.

From Bayes’ theorem I could now write the posterior distribution as

$$[\pi_I|S, a, b] = \frac{Bernoulli(S|\pi_I) beta(\pi_I|a, b)}{betaBinom(S|1, a, b)} = \frac{\pi_I^{S + a - 1} (1 - \pi_I)^{b - S}}{B(S + a, 1 - S + b)}$$

where $B(\cdot)$ is the beta function and the beta-binomial in the denominator is for a single trial, $m = 1$. Here’s another way to draw this function, showing the prior beta distribution and posteriors for a single observation, survived or died:

    post <- function(S, a, b, p){
      p^(S + a - 1)*(1 - p)^(b - S)/beta(S + a, 1 - S + b)
    }

    S <- 0
    p <- seq(.01, .99, length = 50)
    plot(p, post(S = 0, a, b, p), xlab = 'pi', ylab = '[pi]', type = 'l')  # if obs died
    lines(p, post(S = 1, a, b, p), col = 2)                                # if obs survived
    lines(p, dbeta(p, a, b), lty = 2, col = 3)                             # prior
    legend('topright', c('died', 'survived', 'prior (dashed)'), text.col = c(1, 2, 3))

With these few basic rules, I return to the Janzen Connell hypothesis. The graph summarizes several events that influence infection and survival. I can use the model to evaluate important properties of the process it represents, to estimate parameters, and to predict behavior.

Parameter values might be estimated from previous studies, or they might be completely unknown. The states might be observed or not. Consider a few of the ways the model could be used in the following example.

—————————————— ===== ——————————————

Example 2. What is the probability of infection $I$ when $D$ and $S$ are unknown?

I only need to consider arrows that ‘cause’ $I$. If neither $D$ nor $S$ cause $I$, and there is no knowledge of them that could affect my subjective probability of event $I$, then they have no influence on the result. The event $I = 1$ has probability $\theta$.

—————————————— ===== ——————————————

In the next exercise I return to the original graph to consider parameter estimates.

—————————————— ===== ——————————————

Example 3. If I know $S$ but I have no knowledge of $D$ or $I$, what is the probability of $I$?

This problem asks for the conditional probability $[I|S]$. I know $[S|I]$. I already determined $[I = 1] = \theta$. I still need $[S]$, which I can obtain using total probability. For an individual that survived,

$$[S = 1] = \sum_I [S = 1|I][I] = [S = 1|I = 0][I = 0] + [S = 1|I = 1][I = 1] = \pi_0 (1 - \theta) + \pi_1 \theta$$

By substitution I have

$$[I|S = 1] = \frac{\pi_1 \theta}{\pi_0 (1 - \theta) + \pi_1 \theta}$$

—————————————— ===== ——————————————

—————————————— ===== ——————————————

Exercise 3. I observe $D$ and $S$, and I know $\phi$ from previous studies. What is the probability of an observation $[D = 1, S = 1]$? (Hint: use total probability on $[D, I, S]$.)

Now write down the posterior distribution of parameters, given observations and known detection probability, i.e., $[\pi_I, \theta|D = 1, S = 1, \phi]$.

—————————————— ===== ——————————————

These examples of discrete states and continuous parameters are used by Clark and Hersh and Hersh et al. to evaluate the effect of co-infection of multiple pathogens that attack multiple hosts. These models admit covariates, which could be abundances of host plants or environmental variables.

#the normal distribution

I used the normal distribution for regression examples without saying much about it. In Bayesian analysis it is not only used for the likelihood, but also as a prior distribution for parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve the normal distribution.

##Bayesian estimate of the mean

To obtain an estimate of the mean of a normal distribution I combine likelihood and prior distribution. My observations are $y_i, i = 1, \dots, n$. The likelihood for one observation $i$ is

$$N(y_i|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right]$$

The likelihood for the sample of $n$ observations is

$$N(\mathbf{y}|\mu, \sigma^2) = \prod_{i=1}^n N(y_i|\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]$$

Recalling Bayes theorem, I combine this likelihood with a prior distribution. I use a normal distribution for the mean. For now I assume that 𝜎 2 is fixed. Here is a prior distribution for 𝜇

𝑁(𝜇|𝑚, 𝑀)

To obtain posterior estimates I will use a trick that starts with the following observation. I know the posterior distribution will have this form,

$$N(\mu|Vv, V)$$

where $V$ will be the variance, and $v$ is an unknown constant. If I write out the exponent for the normal distribution I get this:

$$-\frac{1}{2V} (\mu - Vv)^2 = -\frac{1}{2} \left( \frac{\mu^2}{V} - 2\mu v + V v^2 \right)$$

I now know that the inverse of the variance, $V^{-1}$, will be whatever is multiplied by $\mu^2$, and $v$ will be whatever is multiplied by $-2\mu$. I want to multiply likelihood and prior, then find these constants.

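Here is likelihood times prior, focusing on factors that include $\mu$ in the exponent

$$N(\mathbf{y}|\mu, \sigma^2) N(\mu|m, M) \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 - \frac{1}{2M} (\mu - m)^2 \right]$$

Setting $n\bar{y} = \sum_{i=1}^n y_i$ and extracting only the terms I need,

$$\mu^2 \left( \frac{n}{\sigma^2} + \frac{1}{M} \right) - 2\mu \left( \frac{n\bar{y}}{\sigma^2} + \frac{m}{M} \right)$$

shows that

$$V = \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1}$$

$$v = \frac{n\bar{y}}{\sigma^2} + \frac{m}{M}$$

You might recognize this to be a weighted average of data and prior, with the inverse of variances being the weights,

$$Vv = \frac{\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}}$$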

Note how large $n$ will swamp the prior,

$$Vv \rightarrow \frac{1}{n} \sum_{i=1}^n y_i = \bar{y} \quad \text{as } n \rightarrow \infty$$

The prior can fight back with a tiny prior variance $M$.

$$Vv \rightarrow m \quad \text{as } M \rightarrow 0$$
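The weighted average and its two limits can be checked with a few lines of R. This is a minimal sketch: the function name `postMean`, the simulated sample, and the prior values are all assumptions for illustration.

```r
# hypothetical helper: posterior mean Vv and variance V for a normal mean,
# with known variance sigma2 and prior N(m, M)
postMean <- function(y, sigma2, m, M){
  n <- length(y)
  V <- 1/(n/sigma2 + 1/M)        # posterior variance
  v <- sum(y)/sigma2 + m/M       # sum(y) is n times the sample mean
  c(mean = V*v, var = V)
}

set.seed(1)
sigma2 <- 4                                  # known variance (assumed value)
y      <- rnorm(20, 10, sqrt(sigma2))        # simulated sample (assumed values)

p1 <- postMean(y, sigma2, m = 0, M = 1)      # informative prior centered on zero
p2 <- postMean(y, sigma2, m = 0, M = 1e+6)   # vague prior, data dominate
p3 <- postMean(y, sigma2, m = 0, M = 1e-6)   # tiny prior variance, prior dominates
```

With the vague prior the posterior mean is essentially `mean(y)`, with the tiny prior variance it is essentially the prior mean `m`, and the informative prior gives a value between the two.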

—————————————— ===== ——————————————

Exercise 4. Write a function to determine the posterior estimate of the mean for a normal likelihood, normal prior distribution, and known variance $\sigma^2$. You will need to generate a sample, supply a prior mean and variance, determine the posterior mean and variance, and plot.

Bayesian analysis of the mean.

Then demonstrate the effect of 𝑛 and 𝑀 .

—————————————— ===== ——————————————

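##Bayesian regression (known $\sigma^2$)

For the regression model, I start with matrix notation,

$$\mathbf{y} \sim MVN(\mathbf{X}\boldsymbol{\beta}, \Sigma)$$

where $\mathbf{y}$ is the length-$n$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix, $\boldsymbol{\beta}$ is the length-$p$ vector of coefficients, and $\Sigma$ is an $n \times n$ covariance matrix. I can write this as

$$(2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left[ -\frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' \Sigma^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$

Because we assume i.i.d. (independent, identically distributed) $y_i$, the covariance matrix is $\Sigma = \sigma^2 \mathbf{I}$, and $|\Sigma|^{-1/2} = (\sigma^2)^{-n/2}$, giving us

$$(2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$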

This is the form of the likelihood I use to obtain the conditional posterior for regression coefficients.

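The prior distribution is also multivariate normal,

$$[\beta_1, \dots, \beta_p] = MVN(\boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) = \frac{1}{(2\pi)^{p/2} \det(\mathbf{B})^{1/2}} \exp\left[ -\frac{1}{2} (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b}) \right]$$

If there are $p$ predictors, then $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'$. The prior mean is a length-$p$ vector $\mathbf{b}$. The prior covariance matrix could be a non-informative diagonal matrix,

$$\mathbf{B} = \begin{pmatrix} B & 0 & \cdots & 0 \\ 0 & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B \end{pmatrix}$$

for some large value $B$. The posterior distribution is $MVN(\boldsymbol{\beta}|\mathbf{Vv}, \mathbf{V})$, where

$$\mathbf{V} = (\sigma^{-2} \mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})^{-1}$$

$$\mathbf{v} = \sigma^{-2} \mathbf{X}'\mathbf{y} + \mathbf{B}^{-1} \mathbf{b}$$

(appendix). Taking limits as I did for the previous example, I obtain the MLE for the mean parameter vector,

$$\mathbf{Vv} \rightarrow (\mathbf{X'X})^{-1} \mathbf{X'y} \quad \text{as } n \rightarrow \infty$$

(appendix).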

—————————————— ===== ——————————————

Exercise 5. Obtain the posterior mean and variance for regression parameters for a simulated data set. Your algorithm might proceed as follows:

1. define $n$, $p$, and $\sigma^2$

2. generate $n \times p$ matrix $\mathbf{X}$ from random values, and set the first column to ones

3. generate a $p \times 1$ matrix $\boldsymbol{\beta}$ from random values

4. generate a $n \times 1$ vector $\mathbf{y}$ using rnorm

5. specify a $p \times 1$ prior matrix $\mathbf{b}$ and prior covariance matrix $\mathbf{B}$

6. write a function to evaluate $\mathbf{V}$, $\mathbf{v}$, and return the mean vector and covariance matrix

Marginal posterior densities for beta.

Explain how you would check that the algorithm is correct.

—————————————— ===== ——————————————

##Residual variance (known 𝛍 )

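Now I assume that I know the coefficients and want to estimate the residual variance $\sigma^2$. Recall the likelihood for the normal distribution,

$$N(\mathbf{y}|\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \propto \sigma^{-2(n/2)} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]$$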

A prior distribution that is commonly used is inverse gamma,

$$IG(\sigma^2|s_1, s_2) = \frac{s_2^{s_1}}{\Gamma(s_1)} \sigma^{-2(s_1 + 1)} \exp(-s_2 \sigma^{-2}) \propto \sigma^{-2(s_1 + 1)} \exp(-s_2 \sigma^{-2})$$

If I combine likelihood and prior I get another inverse gamma distribution,

$$IG(\sigma^2|u_1, u_2) \propto \sigma^{-2(s_1 + n/2 + 1)} \exp\left[ -\sigma^{-2} \left( s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2 \right) \right]$$

Then $u_1 = s_1 + n/2$, and $u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2$. Here is a prior and posterior distribution for a sample data set.

    library(MCMCpack)
    par(bty = 'n')
    n  <- 10
    y  <- rnorm(n)
    s1 <- s2 <- 1
    yb <- mean(y)                        # sample mean stands in for the known mean
    ss <- seq(.01, 4, length = 100)      # start above zero, where the density is defined
    u1 <- s1 + n/2
    u2 <- s2 + 1/2*sum( (y - yb)^2 )
    plot(ss, dinvgamma(ss, u1, u2), type = 'l', lwd = 2)       # posterior
    lines(ss, dinvgamma(ss, s1, s2), col = 'blue', lwd = 2)    # prior

Prior and posterior IG distribution

##residual variance for regression

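For regression, I replace $\mu$ with $\mathbf{X}\boldsymbol{\beta}$, and I have

$$u_2 = s_2 + \frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

To see this, recall the likelihood,

$$\sigma^{-2(n/2)} \exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$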

—————————————— ===== ——————————————

Exercise in class. Find the conditional posterior distribution for the variance in regression. Based on the previous two blocks of code, write a function to evaluate the variance for a sample regression.

##small step to Gibbs sampling

The conditional posterior distributions for coefficients and variance will be combined with Gibbs sampling. To see how this will come together, consider that we can now sample $[\boldsymbol{\beta}|\sigma^2]$ and, conversely, $[\sigma^2|\boldsymbol{\beta}]$. If we alternate these two steps repeatedly we have a simulation for their joint distribution, $[\boldsymbol{\beta}, \sigma^2]$.

To see the setup that is used in jags, refer back to unit 2. For the regression example, I would simply add an additional step.
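The alternation between the two conditional posteriors can be sketched in base R. This is an illustration under stated assumptions, not the sampler used by jags: the simulated data, the prior values, the number of iterations, and all object names are chosen here for the example. The coefficient draw uses the multivariate normal $MVN(\mathbf{Vv}, \mathbf{V})$ and the variance draw uses the inverse gamma $IG(u_1, u_2)$, sampled as the reciprocal of a gamma variate.

```r
set.seed(1)
# simulated data (assumed values)
n     <- 100
p     <- 3
x     <- cbind(1, matrix(rnorm(n*(p - 1)), n, p - 1))  # design with intercept
beta  <- c(1, -.5, .8)                                 # true coefficients
sigma <- .5                                            # true residual variance
y     <- rnorm(n, x %*% beta, sqrt(sigma))

# priors (assumed values)
b    <- rep(0, p)          # prior mean for beta
Binv <- diag(1/1000, p)    # inverse of prior covariance B
s1   <- s2 <- .1           # inverse gamma parameters for sigma^2

XX <- crossprod(x)
Xy <- crossprod(x, y)
ng <- 2000                           # Gibbs iterations
bgibbs <- matrix(NA, ng, p)          # holds beta draws
sgibbs <- rep(NA, ng)                # holds sigma^2 draws
sg <- 1                              # initial value for sigma^2

for(g in 1:ng){
  # [beta | sigma^2] is MVN(Vv, V)
  V  <- solve(XX/sg + Binv)
  v  <- Xy/sg + Binv %*% b
  bg <- V %*% v + t(chol(V)) %*% rnorm(p)   # t(chol(V)) z has covariance V

  # [sigma^2 | beta] is IG(u1, u2), drawn as 1/gamma(shape u1, rate u2)
  u1 <- s1 + n/2
  u2 <- s2 + .5*sum( (y - x %*% bg)^2 )
  sg <- 1/rgamma(1, u1, u2)

  bgibbs[g,] <- bg
  sgibbs[g]  <- sg
}

keep     <- 201:ng                    # discard the first 200 as burn-in
betaPost <- colMeans(bgibbs[keep,])   # posterior means for beta
sigPost  <- mean(sgibbs[keep])        # posterior mean for sigma^2
```

With a vague prior such as this one, `betaPost` should be close to the least-squares estimates from `lm`, and `sigPost` close to the residual variance used to simulate the data.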

#jags example

To see how well we can recover parameters when they are known, here is a simulated data set:

    n     <- 100                          # sample size
    p     <- 4                            # no. predictors
    beta  <- matrix( rnorm(p), p )        # coefficients
    sigma <- .1                           # residual variance
    x     <- matrix( rnorm(n*p), n, p )   # design
    x[,1] <- 1                            # intercept
    mu    <- x %*% beta
    y     <- rnorm(n, mu, sqrt(sigma))
    pairs( cbind(y, x[,-1]) )

If I knew the residual variance, this would be my Bayesian estimate:

    B <- diag(10000, p)                                # prior covariance
    b <- beta*0                                        # prior mean of zero
    V <- solve( 1/sigma*crossprod(x) + solve(B) )
    v <- 1/sigma*crossprod(x, y)                       # prior term drops out because b = 0
    betaHat <- V %*% v
    betaSe  <- sqrt( diag(V) )
    coefficients <- signif( cbind(beta, betaHat, betaSe), 4 )
    colnames(coefficients) <- c('true', 'estimate', 'Se')
    coefficients

## true estimate Se

## [1,] 0.27350 0.3340 0.03200

## [2,] 0.68000 0.6661 0.03347

## [3,] 0.44740 0.4949 0.03721

## [4,] 0.04942 0.0363 0.03312

For comparison, here’s the classical estimate:

    summary( lm(y ~ x[,-1]) )$coefficients[,1:2]

## Estimate Std. Error

## (Intercept) 0.3339715 0.03367717

## x[, -1]1 0.6660953 0.03522470

## x[, -1]2 0.4949326 0.03915431

## x[, -1]3 0.0363018 0.03484890

Now I want to sample the joint distribution of $[\boldsymbol{\beta}, \sigma^2]$. Here’s jags:

    library(rjags)

    ## Linked to JAGS 4.3.0
    ## Loaded modules: basemod,bugs

    file <- "lmSimulated.txt"
    cat("model{
      # Likelihood
      for(i in 1:n){
        y[i] ~ dnorm(mu[i], precision)
        mu[i] <- inprod(beta[], x[i,])
      }
      for (i in 1:p) {
        beta[i] ~ dnorm(0, 1.0E-5)
      }
      # Prior for the inverse variance
      precision ~ dgamma(0.01, 0.01)
      sigma <- 1/precision
    }", file = file)

Here is a function that sets up the posterior sampling:

    model <- jags.model(file = file,
                        data = list(x = x, y = y, n = nrow(x), p = ncol(x)))

## Compiling model graph

## Resolving undeclared variables

## Allocating nodes

## Graph information:

## Observed stochastic nodes: 100

## Unobserved stochastic nodes: 5

## Total graph size: 713

##

## Initializing model

I start with 100 burn-in iterations, then sample for 2000:

    update(model, 100)
    jagsLm <- coda.samples(model, variable.names = c("beta", "sigma"), n.iter = 2000)
    tmp <- summary(jagsLm)
    print(tmp$statistics)

## Mean SD Naive SE Time-series SE

## beta[1] 0.33539775 0.03402236 0.0007607631 0.0007607631

## beta[2] 0.66702382 0.03561230 0.0007963152 0.0008575067

## beta[3] 0.49500158 0.03941578 0.0008813636 0.0008813636

## beta[4] 0.03524711 0.03548611 0.0007934935 0.0008929009

## sigma 0.11349828 0.01673148 0.0003741272 0.0003929038

Here are plots:

    plot(jagsLm)

—————————————— ===== ——————————————

Exercise in class. Make an informative prior distribution for regression parameters. Then compare the estimates you get with the non-informative prior. Do this analytically and with jags.

#recap

Bayesian analysis requires some basic distribution theory to combine data and prior information to generate a posterior distribution. Fundamental ways to parameterize probability include densities (continuous), probability mass functions (discrete), and cumulative distribution functions (both). The sample space defines allowable (non-zero probability) values for a random variable. Integrating (continuous) or summing (discrete) over the sample space gives a probability of 1.

Distributions have moments, which are expectations for integer powers of a random variable.

The first moment is the mean, and the second central moment is the variance. Higher moments include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).

A joint distribution can be factored into conditional and marginal distributions. A conditional distribution assumes a specific value for the variable that is being conditioned on.

Marginalizing over a variable is done with the law of total probability. Bayes theorem relies on a specific factorization giving a posterior distribution in terms of likelihood and prior.
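A minimal sketch of total probability and Bayes theorem for a discrete variable. The two states and all probabilities are hypothetical values chosen only for illustration; they do not come from the readings.

```r
# Hypothetical two-state example: every number here is made up for illustration
prior <- c(infected = 0.2, healthy = 0.8)   # marginal [state]
like  <- c(infected = 0.9, healthy = 0.1)   # conditional [symptom | state]

joint     <- like * prior                   # joint [symptom, state]
marginal  <- sum(joint)                     # total probability: [symptom]
posterior <- joint / marginal               # Bayes theorem: [state | symptom]

print(marginal)
print(posterior)
```

The posterior sums to 1 because the marginal in the denominator is the joint distribution summed over the sample space for the state.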

R can be used to draw random variables and to evaluate densities and probabilities. Binomial and Bernoulli distributions apply to numbers of successes in 𝑛 or 1 trials, respectively.

The multivariate normal distribution is commonly used as a prior distribution. When combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.
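As a short illustration of drawing random variables and evaluating probabilities in R, here is a sketch for the binomial and Bernoulli using the base functions `rbinom`, `dbinom`, and `pbinom`. The seed, the number of trials `n`, and the success probability `theta` are arbitrary choices for the example.

```r
set.seed(1)                                  # arbitrary seed for reproducibility
n     <- 10                                  # number of trials (hypothetical)
theta <- 0.3                                 # success probability (hypothetical)

y <- rbinom(5, size = n, prob = theta)       # five random draws from the binomial
b <- rbinom(5, size = 1, prob = theta)       # Bernoulli is a binomial with one trial

pmf <- dbinom(0:n, size = n, prob = theta)   # probability mass over the sample space
cdf <- pbinom(3, size = n, prob = theta)     # Pr(y <= 3)

print(sum(pmf))                              # summing over the sample space gives 1
```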

#appendix

Here I provide a bit more detail on moments used in the beta-binomial example, the posterior for regression parameters, and its connection to maximum likelihood estimates.

##moments

Moments describe the shape of a distribution. The mean of the distribution is the first moment. The variance is the second central moment. The $m^{th}$ moment of a distribution for $x$ is the expected value of $x^m$. For a continuous variable $x$ having PDF $p(x)$ this is

$$E[x^m] = \int_{-\infty}^{\infty} x^m p(x) dx$$

Note that the zeroth moment equals 1, the area under the PDF. For a discrete variable this is

$$E[x^m] = \sum_{k \in \mathcal{K}} x^m p(x)$$

Let $\mu = E[x^1]$ be the first moment. Then the $m^{th}$ central moment is

$$E[(x - \mu)^m] = \int_{-\infty}^{\infty} (x - \mu)^m p(x) dx$$

(continuous) and

$$E[(x - \mu)^m] = \sum_{k \in \mathcal{K}} (x - \mu)^m p(x)$$

(discrete). The variance is $E[(x - \mu)^2]$.

Moments also exist for a sample. In this case I can think of the discrete probability assigned to each observation as $1/n$, where $n$ is the number of observations. Plugging this into the discrete moment equation with $m = 1$, I have

$$\bar{x} = E[x] = \frac{1}{n} \sum_{i=1}^{n} x_i$$

for the sample mean and

$$var(x) = E[(x - \mu)^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

for the sample variance.
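The sample moments can be checked numerically. This sketch assigns probability $1/n$ to each observation of a simulated sample (the sample size, mean, and standard deviation are arbitrary) and compares the results with R's `mean` and `var`. Note that `var` divides by $n - 1$ rather than $n$, so it is rescaled for the comparison.

```r
set.seed(2)                            # arbitrary seed
x <- rnorm(50, mean = 3, sd = 2)       # simulated sample (hypothetical values)
n <- length(x)
p <- rep(1 / n, n)                     # discrete probability 1/n per observation

xbar <- sum(x * p)                     # first moment: sample mean
v    <- sum((x - xbar)^2 * p)          # second central moment: sample variance

print(c(xbar, mean(x)))                # these agree
print(c(v, var(x) * (n - 1) / n))      # var() uses n - 1, so rescale to compare
```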

##Bayesian regression parameters

As for the example for the mean of the normal distribution, I apply the “big-V, small-v” method.

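For matrices the exponent of $N(\boldsymbol{\beta} | \mathbf{V}\mathbf{v}, \mathbf{V})$ is

$$-\frac{1}{2} (\boldsymbol{\beta} - \mathbf{V}\mathbf{v})' \mathbf{V}^{-1} (\boldsymbol{\beta} - \mathbf{V}\mathbf{v}) = -\frac{1}{2} \left( \boldsymbol{\beta}' \mathbf{V}^{-1} \boldsymbol{\beta} - 2 \boldsymbol{\beta}' \mathbf{v} + \mathbf{v}' \mathbf{V} \mathbf{v} \right)$$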

As before I find 𝐕 and 𝐯 in the first two terms.

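When I combine the regression likelihood with this prior distribution, I have an exponent on the multivariate normal distribution that looks like this,

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{x}_i' \boldsymbol{\beta})^2 + (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b})$$

or like this,

$$\frac{1}{\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b})$$

where $\mathbf{y}$ is the length-$n$ vector of responses, and $\mathbf{X}$ is the $n \times p$ design matrix.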

Retaining only terms containing coefficients, I collect terms,

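$$-2 \boldsymbol{\beta}' (\sigma^{-2} \mathbf{X}'\mathbf{y} + \mathbf{B}^{-1} \mathbf{b}) + \boldsymbol{\beta}' (\sigma^{-2} \mathbf{X}'\mathbf{X} + \mathbf{B}^{-1}) \boldsymbol{\beta}$$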

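I identify the parameters,

$$\mathbf{V} = (\sigma^{-2} \mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})^{-1}$$

$$\mathbf{v} = \sigma^{-2} \mathbf{X}'\mathbf{y} + \mathbf{B}^{-1} \mathbf{b}$$

These determine the posterior distribution, which has mean $\mathbf{V}\mathbf{v}$ and covariance $\mathbf{V}$.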
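A minimal numerical sketch of the posterior for regression coefficients, assuming the residual variance $\sigma^2$ is known. The simulated data, the coefficient values `betaTrue`, the prior mean `b`, and the prior covariance `B` are all hypothetical choices for illustration.

```r
set.seed(3)                                      # arbitrary seed
n <- 100
p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))     # design matrix with intercept
betaTrue <- c(0.5, -1, 2)                        # hypothetical coefficients
sigma2   <- 1                                    # residual variance, treated as known
y <- as.vector(X %*% betaTrue + rnorm(n, 0, sqrt(sigma2)))

b <- rep(0, p)                                   # prior mean (hypothetical)
B <- diag(100, p)                                # prior covariance (hypothetical, vague)

Binv <- solve(B)
V <- solve(crossprod(X) / sigma2 + Binv)         # V = (sigma^-2 X'X + B^-1)^-1
v <- crossprod(X, y) / sigma2 + Binv %*% b       # v = sigma^-2 X'y + B^-1 b

postMean <- V %*% v                              # posterior mean Vv
postSD   <- sqrt(diag(V))                        # posterior standard deviations
print(cbind(postMean, postSD))
```

With a prior this vague, the posterior mean should sit very close to the least-squares estimate. Shrinking the diagonal of `B` pulls the posterior mean toward `b`.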

##connection to maximum likelihood

Consider again the likelihood, now ignoring the prior distribution, having exponent

$$\log L \propto -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

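To maximize the log likelihood I consider only these terms, because the others do not contain $\boldsymbol{\beta}$. I differentiate once,

$$\frac{\partial \log L}{\partial \boldsymbol{\beta}} = \sigma^{-2} \mathbf{X}'\mathbf{y} - \sigma^{-2} \mathbf{X}'\mathbf{X} \boldsymbol{\beta}$$

and again,

$$\frac{\partial^2 \log L}{\partial \boldsymbol{\beta}^2} = -\sigma^{-2} \mathbf{X}'\mathbf{X}$$

To obtain MLEs I set the first derivative equal to zero and solve,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$$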

The matrix of curvatures, or second derivatives, is related to Fisher Information and the covariance of parameter estimates,

$$\mathbf{I} = -\frac{\partial^2 \log L}{\partial \boldsymbol{\beta}^2}$$

The covariance of parameter estimates is $\mathbf{I}^{-1}$.
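The MLE and the inverse Fisher Information can be compared with a standard least-squares fit. This is a sketch on simulated data with hypothetical coefficients. It plugs the `lm` estimate of $\sigma^2$ into $\mathbf{I} = \sigma^{-2} \mathbf{X}'\mathbf{X}$, which is an assumption made here for the comparison, because the derivation above treats $\sigma^2$ as given.

```r
set.seed(4)                                    # arbitrary seed
n <- 200
X <- cbind(1, rnorm(n))                        # design matrix: intercept and one predictor
y <- as.vector(X %*% c(1, 2) + rnorm(n))       # hypothetical coefficients, sigma^2 = 1

XtX     <- crossprod(X)
betaHat <- solve(XtX, crossprod(X, y))         # MLE: (X'X)^-1 X'y

fit    <- lm(y ~ X - 1)                        # same model fitted by least squares
s2     <- summary(fit)$sigma^2                 # lm's estimate of sigma^2 (divides by n - p)
Info   <- XtX / s2                             # Fisher Information: sigma^-2 X'X
covHat <- solve(Info)                          # covariance of estimates: I^-1

print(cbind(betaHat, coef(fit)))               # the two columns agree
print(sqrt(diag(covHat)))                      # matches the standard errors from lm()
```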
