2020-01-27

#resources

##software

**source**

```r
source('../clarkFunctions2020.r')
```

*Models for Ecological Data*, Appendix D.

Hersh et al. on joint, conditional, predictive distributions, *Ecology*.

#objectives

Understand key concepts:

– discrete and continuous densities, probability mass functions, and probability density functions

– the sample space for a distribution

– moments of a distribution

– factor a joint distribution

– graph a model

– the contribution of data and prior to the posterior estimate of a mean

Apply basic rules to manipulate jointly distributed random variables:

– total probability

– Bayes theorem

Use R to draw random samples and to determine density and probability for standard distributions:

– binomial and Bernoulli

– beta and beta-binomial

– normal and multivariate normal

Find the posterior distribution of regression parameters

*“only grade-schoolers can divide, only undergraduates can differentiate, a rare PhD can integrate”, Mark Twain, maybe*

#*Attention au calcul*

I need a few rules from calculus to manipulate distributions. Calculus is more often important for its conceptual and notational contributions than for actual solutions to equations. Most functions cannot be integrated. The limited capacity to handle the integration constant needed for Bayes theorem stalled progress until numerical analysis advanced with methods such as Gibbs sampling. Although I will not integrate much, calculus provides concepts needed here.

For probability, I need both distribution functions and density functions. The latter is the derivative of the former. The notation is powerful, providing a direct connection to concepts and notation used in basic algebra. The derivative $dF/dx$ can be viewed as a limit of a ratio,

$$f(x) = \frac{dF}{dx} = \lim_{dx \to 0} \left[ \frac{F(x + dx) - F(x)}{dx} \right]$$

i.e., division. Multiplication $f(x) \cdot dx$ can be viewed as the limit of the anti-derivative,

$$\int_x^{x + dx} f(u)\, du = \lim_{dx \to 0} \left[ f(x) \cdot dx \right]$$

Ironically, although multiplication is easy and division is hard, the tables are turned on their calculus counterparts: differentiation is usually easier than integration.

The idea of derivatives and integrals as limiting quantities is important for computation in Bayesian analysis. Differentiation is needed for optimization problems, which matter more for maximum likelihood. Optimizations and integrations are commonly approximated numerically.

In the discussion ahead I rely on some of the notation of calculus, but there are no difficult solutions.

#Basic probability rules and the Janzen Connell hypothesis

Basic probability ideas are introduced here with the Janzen Connell (JC) effect. The JC effect is believed to promote forest tree diversity as natural enemies disproportionately attack the most abundant tree hosts. The mechanism requires that natural enemies are host-specific and that they most efficiently find and/or impact host populations when and where those hosts are abundant. Fungi are plausible candidates for the JC effect, because there are many taxa, and they include known pathogens of trees.

To test whether or not fungal pathogens contribute to tree diversity, Hersh et al. (2012) planted seedlings of six species in 60 plots, observed survival, and assayed the seedlings for fungal infection on cultures and with DNA sequencing. A model was constructed for the relationship between pathogen, host, and observations, and it was used to infer where pathogens occur (‘incidence’), infection of hosts, and the effect of infection on survival. This example introduces some of the techniques used in Bayesian analysis, including some basic distribution theory. I begin with background on distributions, followed by examples that demonstrate ways to look at the JC hypothesis.

##continuous and discrete probability distributions

I express uncertainty about an event using probability. Uncertainty could be temporary, expressing current information: my prediction of heads or tails will be updated when I flip this coin. Or it could be indefinite, expressing my ability to predict in general: about half of my predictions for coin tosses will be wrong. A **probability** is dimensionless. It can be zero, one, or somewhere in between. In the sections that follow I introduce basic distribution theory needed for Bayesian analysis.

##probability spaces

The **probability distribution** assigns probability values to the set of all possible outcomes over the **sample space**. The sample space for a **continuous probability distribution** is all or part of the real line. The normal distribution applies to the full real line, $\mathbb{R} = (-\infty, \infty)$. The gamma (including the exponential) and the log-normal distributions apply to the non-negative real numbers, $\mathbb{R}^+ = (0, \infty)$. The beta distribution $beta(a, b)$ applies to real numbers in the interval $[0, 1]$. The uniform distribution, $unif(a, b)$, applies to the interval $[a, b]$. These are **univariate distributions**: they describe one dimension. [Note: when discussing an interval, a square bracket indicates that I am including the boundary–the uniform distribution is defined at zero, but the lognormal distribution is not.]

A **multivariate distribution** describes variation in more than one dimension. For example, a *d*-dimensional multivariate normal distribution describes a length-$d$ random vector in $\mathbb{R}^d$.

*Three continuous distributions supported on different portions of the real line. Zero and one (grey lines) are unsupported. PDFs above and CDFs below.*

The sample space for a **discrete distribution** is a set of discrete values. For the Bernoulli distribution, the sample space is $\{0, 1\}$. For the binomial distribution, $binom(n, \theta)$, the sample space is $\{0, \dots, n\}$. For count data, often modeled as a Poisson distribution, the sample space is $\{0, 1, 2, \dots\}$. The **probability mass function** means that there is point mass on specific values (e.g., integers) and no support elsewhere.

A common multivariate discrete distribution is the **multinomial distribution**. It assigns probability to $n$ trials, each of which can have $J$ outcomes, $multinom(n, (\theta_1, \dots, \theta_J))$. It has the sample space $\{0, \dots, n\}^J$, subject to the constraint that the sum over all $J$ classes is equal to $n$.

*Three discrete distributions, with PMFs above and CDFs below. *

The figures show continuous and discrete distributions, each in two forms, as densities and as cumulative probabilities.

##probability density and cumulative probability

The **cumulative distribution function (CDF)** $F(x)$ accumulates continuous probability over the sample space, from low values (near zero) to high values (near one). The **probability density function (PDF)** is the derivative of the CDF, $f(x) = dF/dx$. Because the CDF is a dimensionless probability, its derivative must have units of $1/x$. The CDF is obtained from the PDF by integration,

$$F(x) = \int_{-\infty}^{x} f(u)\, du$$

I cannot assign a probability to a continuous value of $x$, only to an interval, say $(x, x + dx)$. For a small interval $dx$ the following relationships are useful,

$$F(x + dx) - F(x) = \int_x^{x + dx} f(u)\, du \approx f(x) \cdot dx$$

For example, if I wanted to assure myself that 68% of the normal density lies within 1 sd of the mean, I evaluate the CDF at these values and take the difference:

```r
p <- pnorm( c(-1, 1) )
diff(p)
```

```
## [1] 0.6826895
```
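The same machinery gives a quick numerical check that $F(x + dx) - F(x) \approx f(x) \cdot dx$ for a small interval (the values of `x` and `dx` below are illustrative):

```r
# numerical check that F(x + dx) - F(x) is close to f(x)*dx for small dx
x  <- 0.5
dx <- 1e-4
exact  <- pnorm(x + dx) - pnorm(x)   # CDF difference over the interval
approx <- dnorm(x) * dx              # density times interval width
c(exact, approx)
```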

Because integration is like multiplication, it adds one integer value to the exponent, from the PDF ($x^{-1}$) to the CDF ($x^0$). The area under the PDF is

$$1 = F(\infty) = \int_{-\infty}^{\infty} f(x)\, dx$$

For discrete distributions the PDF is replaced with a **probability mass function (PMF)**. Like the CDF (and unlike the PDF), the PMF is a dimensionless probability. To obtain the CDF from the PMF I sum, rather than integrate,

$$F(x) = \sum_{k \le x} f(k)$$

To obtain the PMF from the CDF I difference, rather than differentiate,

$$f(x) = F(x) - F(x - 1)$$

The sum over the sample space is

$$1 = \sum_{k \in \mathcal{Y}} f(k)$$

where $\mathcal{Y}$ is the sample space. **In R** there are functions for common distributions. The CDF begins with the letter `p`. The PDF and PMF begin with the letter `d`. To obtain random values use the letter `r`. For quantiles use the letter `q`.

Suppose I want to draw the PMF for the Poisson distribution having intensity $\lambda = 4.6$. I can generate a sequence of integer values and then use `dpois`:

```r
k  <- c(0:20)
dk <- dpois(k, 4.6)
plot(k, dk)
segments(k, 0, k, dk)
```

I could draw the CDF with `ppois`:

```r
pk <- ppois(k, 4.6)
plot(k, pk, type = 's')
```


If I want to confirm the values of `k` having these probabilities, I invert the probabilities with `qpois`:

```r
qpois(pk, 4.6)
```

```
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
```
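The differencing relationship between PMF and CDF can be verified directly for this example:

```r
# f(k) = F(k) - F(k-1): recover the Poisson PMF by differencing the CDF
k  <- 0:20
dk <- dpois(k, 4.6)
pk <- ppois(k, 4.6)
max(abs( diff(c(0, pk)) - dk ))  # essentially zero
```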

—————————————— ===== ——————————————

Example 1. I want to generate a random sample from a normal distribution and see if I can ‘recover the mean’. Here are the steps I use:

– define a sample size, a mean and variance

– draw a random sample using `rnorm`

– estimate parameters as $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$

– determine a 95% confidence interval using `qnorm`

– draw a PDF using `dnorm` and CDF using `pnorm` based on the estimates

– determine if the true mean lies within this interval

*Plots from Example 1. *
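The steps above might be sketched as follows (sample size and parameter values are illustrative, not from the text; plotting is omitted):

```r
# a minimal sketch of Example 1 with assumed values
set.seed(42)
n <- 50; mu <- 10; sigma <- 2
x <- rnorm(n, mu, sigma)
muHat  <- mean(x)
sigHat <- sd(x)
ci <- muHat + qnorm(c(.025, .975)) * sigHat/sqrt(n)  # 95% CI for the mean
ci[1] < mu & mu < ci[2]                               # does the CI cover the true mean?
```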

Now repeat this 1000 times and count the number of times the confidence interval includes the true mean.

*Density of low and high 95% CIs from Example 1. *

—————————————— ===== ——————————————

#moments

Moments are expectations of powers of ๐ฅ

. They are used to summarize the location and shape of a distribution. I can think of the

For a continuous distribution ๐ ๐กโ moment of a distribution as a weighted average of ๐ฅ ๐ .

∞

๐ธ[๐ฅ ๐

] = ∫ ๐ฅ ๐

−∞ ๐(๐ฅ)๐๐ฅ

The first moment, ๐ = 1

, is the mean. The variance is a central moment,

∞ ๐ฃ๐๐[๐ฅ] = ๐ธ[(๐ฅ − ๐ธ[๐ฅ])

2

] = ∫ (

−∞ ๐ฅ − ๐ธ[๐ฅ])

2 ๐(๐ฅ)๐๐ฅ

For discrete variables, we replace the integral with a sum.
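As a check on these definitions, the moments of a discrete distribution can be computed directly from its PMF, here using the Poisson intensity from the earlier example:

```r
# moments of the Poisson computed directly from its PMF (lambda = 4.6)
k   <- 0:100
pmf <- dpois(k, 4.6)
m1  <- sum(k * pmf)            # first moment (mean)
v2  <- sum((k - m1)^2 * pmf)   # second central moment (variance)
c(m1, v2)                      # both close to lambda
```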

#discrete probability for multiple host states

I now want to apply probability concepts to the Janzen-Connell question. In the foregoing discussion I used notation for to distinguish between probability and density. In cases where I do not want to name a specific PDF or CDF I use the shorthand **bracket notation**.

Let

[๐ผ, ๐]

be the **joint probability** of two events, i) that a host plant is infected,

๐ผ

, and ii) that it survives,

๐

. In this example, both of these events are binary, being either true (indicated with a one) or not (indicated with a zero). For example, the probability that an individual is not infected and survives is written as

[๐ผ = 0, ๐ = 1]

. If I write simply

[๐ผ, ๐]

for these binary events, it is interpreted as the probability that both events are a ‘success’ or ‘true’, i.e.,

[๐ผ = 1, ๐ = 1]

.

*A graphical model of relationships discussed for the Janzen Connell hypothesis. Symbols are I - infected host, S - survival, D - detection. *

As previously, states (or events) and parameters are **nodes** in the graph: the states $\{I, S, D\}$ and the parameters $\{\phi, \delta, s_0, s_1\}$. The connections, or arrows, between nodes are sometimes called **edges**. Here I assign parameters:

$$[I] = \phi$$

$$[D|I = 1] = \delta$$

$$[S|I = 0] = s_0$$

$$[S|I = 1] = s_1$$

A host can become infected with probability $[I] = \phi$. An infection can be detected with probability $[D|I = 1] = \delta$. An infected individual survives with probability $[S|I = 1] = s_1$, and a non-infected individual survives with probability $[S|I = 0] = s_0$. For this example, I assume no false positives, $[D = 1|I = 0] = 0$.

##start simple

A study of infection and host survival could be modeled as a joint distribution

[๐ผ, ๐]

. I might be interested in estimating state

๐ผ

, in comparing parameter estimates (‘does ๐

0

differ from ๐

1

?), or both. Both events are unknown before they are observed. After data collection I know

๐

, but not

๐ผ

. I want a model for the **conditional probability** of survival given that an individual is infected,

[๐|๐ผ = 1] = ๐

1

or not

[๐|๐ผ = 0] = ๐

0

. The notation left of the bar is ‘conditional on’ the event to the right.

[๐|๐ผ]

indicates that the event to the

An arrow points from $I$ to $S$, because I believe that infection might affect survival, but I do not believe that survival influences infection (because I am not concerned with infection ‘after’ or ‘caused by’ death). The challenge is that I observe $S$, but not $I$. I cannot condition on $I$ if it is unknown. Rather, I want to estimate it. Progress requires Bayes theorem.

*Nodes from the graphical model for infection status and survival. *

To make use of the model I need a relationship between conditional and joint probabilities,

$$[I, S] = [S|I][I]$$

Here I have factored the joint probability on the left-hand side into a **conditional distribution** and a **marginal distribution** on the right-hand side. I can also factor the joint distribution this way,

$$[I, S] = [I|S][S]$$

Because both are equal to the joint probability, they must be equal to each other,

$$[S|I][I] = [I|S][S]$$

Rearranging, I have **Bayes theorem**, solved two ways,

$$[S|I] = \frac{[I|S][S]}{[I]} \quad \text{and} \quad [I|S] = \frac{[S|I][I]}{[S]}$$

The two pieces of this relationship that I have not yet defined are the marginal distributions, $[S]$ and $[I]$. I could evaluate either one *conditional* on the other using the **law of total probability**,

$$[I = 0] = \sum_{k \in \{0,1\}} [I = 0|S = k][S = k]$$

or

$$[S = 1] = \sum_{k \in \{0,1\}} [S = 1|I = k][I = k]$$

How can I use these relationships to address the effect of infection on survival? Given survival status, I first determine the probability that the individual was infected. I have four factors, all univariate: two conditional and two marginal distributions. I have defined $[S|I]$ in terms of parameter values, but I want to know $[I|S]$. For a host that survived, Bayes theorem gives me

$$[I|S = 1] = \frac{[S = 1|I][I]}{[S = 1]} = \frac{[S = 1|I][I]}{\sum_{k \in \{0,1\}} [S = 1|I = k][I = k]} = \frac{s_1 \phi}{s_0 (1 - \phi) + s_1 \phi}$$

For a host that died this conditional probability is

$$[I|S = 0] = \frac{[S = 0|I][I]}{[S = 0]} = \frac{(1 - s_1)\phi}{(1 - s_0)(1 - \phi) + (1 - s_1)\phi}$$

These two expressions demonstrate that, if I knew the parameter values, then I could evaluate the conditional probability $[I|S]$. If I do not know parameter values, then they too might be estimated.

Before going further, notice that the numerator is always the ‘unnormalized’ probability of the two events. The denominator simply normalizes them.
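As a quick check, the two conditional probabilities can be evaluated numerically (the parameter values below are illustrative, not from the text):

```r
# illustrative parameter values (not from the text)
phi <- 0.3   # infection probability
s0  <- 0.8   # survival if uninfected
s1  <- 0.5   # survival if infected
pInfSurv <- s1*phi / ( s0*(1 - phi) + s1*phi )                    # [I = 1|S = 1]
pInfDead <- (1 - s1)*phi / ( (1 - s0)*(1 - phi) + (1 - s1)*phi )  # [I = 1|S = 0]
c(pInfSurv, pInfDead)   # infection is more probable among the dead when s1 < s0
```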

—————————————— ===== ——————————————

Exercise 1. In R: Assume that there are $n = 100$ hosts and that infection decreases survival probability $(s_1 < s_0)$. Define parameter values for $\{s_0, s_1, \phi\}$ and draw a binomial distribution for $[I|S = 0]$ and for $[I|S = 1]$. (Use the function `dbinom`.) Is the infection rate estimated to be higher for survivors or for those that die? How are these two distributions affected by the underlying prevalence of infection, $\phi$? [Hint: write down the probabilities required and then place them in a function.]

*Comparison of binomial distributions of infection for survivors and deaths. *

—————————————— ===== ——————————————

##continuous probability for parameters

Here I consider the problem of estimating parameters. The survival parameters are ๐

๐ผ

{๐

0

, ๐

1

}

. From Bayes theorem I need

=

[๐

๐ผ

|๐] =

[๐|๐

๐ผ

][๐

๐ผ

]

[๐] where the subscript

๐ผ = 0

(uninfected) or

๐ผ = 1

(infected). Again, assuming I know whether or not a host is infected, I can write the distribution for survival conditioned on parameters as

[๐|๐

๐ผ

] = ๐

๐ผ

๐

(1 − ๐

๐ผ

)

1−๐

This is a **Bernoulli distribution**. The Bernoulli distribution is a special case of the binomial distribution for a single trial,

๐ต๐๐๐๐๐ข๐๐๐(๐) = ๐๐๐๐๐(1, ๐)

If I know

๐

, and I want to estimate ๐ summation, but rather integration,

๐ผ

, I again need Bayes theorem. Unlike states, the survival parameters take continous values on

(0,1)

. The total probability of survival

๐

requires not

1

[๐] = ∫ [ ๐|๐

๐ผ

][๐

๐ผ

]๐๐

๐ผ

0

I now have the elements needed to write the conditional distribution for marginal distribution of ๐

๐ผ distribution for

[๐]

,

[๐

๐ผ

|๐]

, but it involves an integral expression. One way to insure a solution to this integral is to assume that the

is a beta distribution, which results in a marginal **beta-binomial**

1 ๐๐๐ก๐๐ต๐๐๐๐(๐|๐, ๐, ๐) = ∫ ๐

0 ๐๐๐๐(๐|๐, ๐

๐ผ

)๐๐๐ก๐(๐

๐ผ

|๐, ๐)๐๐

๐ผ

To understand this integral, I draw the two distributions in the integrand. When I integrate I smear out the binomial distribution based on the variation represented by the beta distribution for ๐

.

```r
par( mfrow = c(2, 2), mar = c(4, 4, 1, 1), bty = 'n' )
m  <- 50    # no. at risk
S  <- 0:m
pi <- .35   # survival Pr
b  <- 4     # beta parameter
a  <- signif( b/(1/pi - 1), 3 )   # 2nd beta parameter to give mean value = pi
plot(S, dbinom(S, m, pi), type = 's', lwd = 3, xlab = 'S', ylab = '[S]')
title('binomial distribution for m, pi')
plot(S/m, dbeta(S/m, a, b), lwd = 2, xlab = expression(pi),
     ylab = expression( paste("[", pi, "]") ), type = 'l')
title('beta density for a, b')
ptext <- paste( "(", a, ", ", b, ")", sep = "" )
text(1, 1.5, ptext, pos = 2)
plot(S, dbinom(S, m, pi), type = 's', lwd = 3, xlab = 'S', ylab = '[S]', col = 'grey')
lines(S, dbetaBinom(S, m, mu = pi, b = b), type = 's', lwd = 2, col = 'blue')
abline(v = pi*m, lty = 2)
title('betabinomial for m, a, b')
```

*Comparison of binomial and beta-binomial (above) and beta density with parameters (a, b) *

*(right). *

In the foregoing code I wanted to draw a $betaBinom(a, b)$ distribution with a mean of $\pi = 0.35$ and wide variance. I selected a low value of parameter `b` and then used moments (the mean) to determine the value of parameter `a`,

$$\pi = \frac{a}{a + b}$$

To draw the binomial PMF I used the R function `dbinom`. The PMF for the beta-binomial is drawn by `dbetaBinom` in `clarkFunctions2020.r`.
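For readers without `clarkFunctions2020.r` at hand, the integral can be approximated numerically. This hypothetical `betaBinomNum` is only a sketch of the marginalization that `dbetaBinom` presumably computes in closed form:

```r
# hypothetical numerical stand-in for the beta-binomial PMF:
# integrate the binomial PMF over a beta density with integrate()
betaBinomNum <- function(S, m, a, b){
  sapply(S, function(s)
    integrate(function(p) dbinom(s, m, p) * dbeta(p, a, b), 0, 1)$value)
}
sum( betaBinomNum(0:10, 10, 2, 4) )   # marginal probabilities sum to 1
```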

For the next exercise, consult the appendix on moments.

—————————————— ===== ——————————————

Exercise 2. The variance in the beta distribution decreases as the value of parameter `b` increases. Change parameter values to demonstrate this with a plot. Then compare the mean and variance (see Appendix) from the moments for the binomial and beta-binomial. Here it is for the beta-binomial:

```r
meanS <- sum( S * dbetaBinom(S, m, mu = pi, b = b) )
varS  <- sum( (S - meanS)^2 * dbetaBinom(S, m, mu = pi, b = b) )
```

For the binomial, your variance should agree with $m\pi(1 - \pi)$.

—————————————— ===== ——————————————

*The graphical model for known infection status and survival and unknown survival probability. *

From Bayes’ theorem I could now write the posterior distribution as

$$[s_I|S, a, b] = \frac{Bernoulli(S|s_I)\, beta(s_I|a, b)}{betaBinom(S|a, b)} = \frac{s_I^{S + a - 1} (1 - s_I)^{b - S}}{B(S + a, 1 - S + b)}$$

where $B(\cdot)$ is the beta function. Here’s another way to draw this function, showing the prior beta distribution and posteriors for a single observation, survived or died:

```r
post <- function(S, a, b, p){
  p^(S + a - 1) * (1 - p)^(b - S) / beta(S + a, 1 - S + b)
}
S <- 0
p <- seq(.01, .99, length = 50)
plot(p, post( S = 0, a, b, p), xlab = 'pi', ylab = '[pi]', type = 'l')  # if obs died
lines(p, post( S = 1, a, b, p), col = 2)                                # if obs survived
lines(p, dbeta(p, a, b), lty = 2, col = 3)                              # prior
legend('topright', c('died', 'survived', 'prior (dashed)'), text.col = c(1, 2, 3))
```
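Because the beta prior is conjugate to the Bernoulli likelihood, the posterior written above is itself a beta density, $beta(S + a, 1 - S + b)$. A quick numerical check (the values of `a` and `b` are illustrative):

```r
# the posterior density equals dbeta(p, S + a, 1 - S + b); illustrative a, b
post <- function(S, a, b, p){
  p^(S + a - 1) * (1 - p)^(b - S) / beta(S + a, 1 - S + b)
}
a <- 2; b <- 4
p <- seq(.01, .99, length = 50)
max(abs( post(S = 1, a, b, p) - dbeta(p, 1 + a, 1 - 1 + b) ))  # essentially zero
```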

With these few basic rules, I return to the Janzen Connell hypothesis. The graph summarizes several events that influence infection and survival. I can use the model to evaluate important properties of the process it represents, to estimate parameters, and to predict behavior.

Parameter values might be estimated from previous studies, or they might be completely unknown. The states might be observed or not. Consider a few of the ways the model could be used in the following example.

—————————————— ===== ——————————————

Example 2. What is the probability of infection $I$ where $D$ and $S$ are unknown?

I only need to consider arrows that ‘cause’ $I$–if neither $D$ nor $S$ causes $I$, and there is no knowledge of them that could affect my subjective probability of event $I$, then they have no influence on the result. The event $I = 1$ has probability $\phi$.

—————————————— ===== ——————————————

In the next exercise I return to the original graph to consider parameter estimates.

—————————————— ===== ——————————————

Example 3. If I know $S$ but I have no knowledge of $D$ or $I$, what is the probability of $I$?

This problem asks for the conditional probability $[I|S]$. I know $[S|I]$. I already determined $[I = 1] = \phi$. I still need $[S]$, which I can obtain using total probability. For an individual that survived,

$$[S = 1] = \sum_I [S = 1|I][I] = [S = 1|I = 0][I = 0] + [S = 1|I = 1][I = 1] = s_0(1 - \phi) + s_1\phi$$

By substitution I have

$$[I|S = 1] = \frac{s_1\phi}{s_0(1 - \phi) + s_1\phi}$$

—————————————— ===== ——————————————

—————————————— ===== ——————————————

Exercise 3. I observe $D$ and $S$, and I know $\delta$ from previous studies. What is the probability of an observation $[D = 1, S = 1]$? (Hint: use total probability on $[D, I, S]$.)

Now write down the posterior distribution of parameters, given observations and known detection probability, i.e., $[s_I, \phi|D = 1, S = 1, \delta]$.

—————————————— ===== ——————————————

These examples of discrete states and continuous parameters are used by Hersh et al. (2012) to evaluate the effect of co-infection by multiple pathogens that attack multiple hosts. These models admit covariates, which could be abundances of host plants or environmental variables.

#the normal distribution

I used the normal distribution for regression examples without saying much about it. In

Bayesian analysis it is not only used for the likelihood, but also as a prior distribution for parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve the normal distribution.

##Bayesian estimate of the mean

To obtain an estimate of the mean of a normal distribution I combine likelihood and prior distribution. My observations are $y_i$, $i = 1, \dots, n$. The likelihood for one observation $i$ is

$$N(y_i|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} exp\left[-\frac{1}{2\sigma^2}(y_i - \mu)^2\right]$$

The likelihood for the sample of $n$ observations is

$$N(\mathbf{y}|\mu, \sigma^2) = \prod_{i=1}^n N(y_i|\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right]$$
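As a sanity check, this collapsed product form agrees with multiplying individual `dnorm` terms in R (the values below are illustrative):

```r
# product of normal densities vs the collapsed form (illustrative values)
y  <- c(1.2, -0.3, 0.7); mu <- 0.5; s2 <- 2
n  <- length(y)
byHand <- (2*pi*s2)^(-n/2) * exp( -sum((y - mu)^2) / (2*s2) )
byR    <- prod( dnorm(y, mu, sqrt(s2)) )
c(byHand, byR)   # identical
```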

Recalling Bayes theorem, I combine this likelihood with a prior distribution. I use a normal distribution for the mean. For now I assume that $\sigma^2$ is fixed. Here is a prior distribution for $\mu$,

$$N(\mu|m, M)$$

To obtain posterior estimates I will use a trick that starts with the following observation. I know the posterior distribution will have this form,

$$N(\mu|Vv, V)$$

where $V$ will be the variance, and $v$ is an unknown constant. If I write out the exponent for the normal distribution I get this:

$$-\frac{1}{2V}(\mu - Vv)^2 = -\frac{1}{2}\left(\frac{\mu^2}{V} - 2\mu v + Vv^2\right)$$

I now know that the inverse variance $V^{-1}$ will be whatever is multiplied by $\mu^2$, and $v$ will be whatever is multiplied by $-2\mu$. I want to multiply likelihood and prior, then find these constants.

Here is likelihood times prior, focusing on factors that include $\mu$ in the exponent,

$$N(\mathbf{y}|\mu, \sigma^2)\, N(\mu|m, M) \propto exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2 - \frac{1}{2M}(\mu - m)^2\right]$$

Setting $n\bar{y} = \sum_{i=1}^n y_i$ and extracting only the terms I need,

$$\mu^2\left(\frac{n}{\sigma^2} + \frac{1}{M}\right) - 2\mu\left(\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}\right)$$

shows that

$$V = \left(\frac{n}{\sigma^2} + \frac{1}{M}\right)^{-1}, \quad v = \frac{n\bar{y}}{\sigma^2} + \frac{m}{M}$$

You might recognize this to be a weighted average of data and prior, with the inverse of variances being the weights,

$$Vv = \frac{\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}}$$

Note how a large $n$ will swamp the prior,

$$\lim_{n \to \infty} Vv \rightarrow \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$$

The prior can fight back with a tiny prior variance $M$,

$$\lim_{M \to 0} Vv \rightarrow m$$
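The ‘Vv rule’ can be checked numerically. A minimal sketch with illustrative values (not from the text):

```r
# posterior mean and variance for a normal mean with known variance
# (n, sigma2, and the prior m, M below are illustrative)
set.seed(1)
n <- 100; sigma2 <- 4
m <- 0;   M <- 100                # prior mean and variance
y <- rnorm(n, 2, sqrt(sigma2))
V <- 1 / ( n/sigma2 + 1/M )       # posterior variance
v <- sum(y)/sigma2 + m/M
V * v                             # posterior mean, close to mean(y) for large n
```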

—————————————— ===== ——————————————

Exercise 4. Write a function to determine the posterior estimate of the mean for a normal likelihood, normal prior distribution, and known variance $\sigma^2$. You will need to generate a sample, supply a prior mean and variance, determine the posterior mean and variance, and plot.

*Bayesian analysis of the mean.*

Then demonstrate the effect of $n$ and $M$.

—————————————— ===== ——————————————

##Bayesian regression (known $\sigma^2$)

For the regression model, I start with matrix notation,

$$\mathbf{y} \sim MVN(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})$$

where $\mathbf{y}$ is the length-$n$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix, $\boldsymbol{\beta}$ is the length-$p$ vector of coefficients, and $\boldsymbol{\Sigma}$ is an $n \times n$ covariance matrix. I can write this as

$$(2\pi)^{-n/2} |\boldsymbol{\Sigma}|^{-1/2} exp\left[-\frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right]$$

Because we assume i.i.d. (independent, identically distributed) $y_i$, the covariance matrix is $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, and $|\boldsymbol{\Sigma}|^{-1/2} = (\sigma^2)^{-n/2}$, giving us

$$(2\pi)^{-n/2}(\sigma^2)^{-n/2} exp\left[-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right]$$

This is the form of the likelihood I use to obtain the conditional posterior for regression coefficients.

The prior distribution is also multivariate normal,

$$[\beta_1, \dots, \beta_p] = MVN(\boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) = \frac{1}{(2\pi)^{p/2} det(\mathbf{B})^{1/2}} exp\left[-\frac{1}{2}(\boldsymbol{\beta} - \mathbf{b})'\mathbf{B}^{-1}(\boldsymbol{\beta} - \mathbf{b})\right]$$

If there are $p$ predictors, then $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'$. The prior mean is a length-$p$ vector $\mathbf{b}$. The prior covariance matrix could be a non-informative diagonal matrix,

$$\mathbf{B} = \begin{pmatrix} B & 0 & \cdots & 0 \\ 0 & B & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B \end{pmatrix}$$

for some large value $B$. The posterior distribution is $MVN(\boldsymbol{\beta}|\mathbf{Vv}, \mathbf{V})$, where

$$\mathbf{V} = (\sigma^{-2}\mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})^{-1}, \quad \mathbf{v} = \sigma^{-2}\mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}$$

(appendix). Taking limits as I did for the previous example, I obtain the MLE for the mean parameter vector,

$$\lim_{n \to \infty} \mathbf{Vv} \rightarrow (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

(appendix).

—————————————— ===== ——————————————

Exercise 5. Obtain the posterior mean and variance for regression parameters for a simulated data set. Your algorithm might proceed as follows:

1. define $n$, $p$, and $\sigma^2$
2. generate an $n \times p$ matrix $\mathbf{X}$ from random values, and set the first column to ones
3. generate a $p \times 1$ matrix $\boldsymbol{\beta}$ from random values
4. generate an $n \times 1$ vector $\mathbf{y}$ using `rnorm`
5. specify a $p \times 1$ prior matrix $\mathbf{b}$ and prior covariance matrix $\mathbf{B}$
6. write a function to evaluate $\mathbf{V}$ and $\mathbf{v}$, and return the mean vector and covariance matrix

*Marginal posterior densities for beta.*

Explain how you would check that the algorithm is correct.

—————————————— ===== ——————————————

##Residual variance (known $\mu$)

Now I assume that I know the coefficients and want to estimate the residual variance $\sigma^2$. Recall the likelihood for the normal distribution,

$$N(\mathbf{y}|\mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right] \propto \sigma^{-2(n/2)}\, exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right]$$

A commonly used prior distribution for $\sigma^2$ is the inverse gamma,

$$IG(\sigma^2|s_1, s_2) = \frac{s_2^{s_1}}{\Gamma(s_1)}\, \sigma^{-2(s_1 + 1)}\, exp(-s_2\sigma^{-2}) \propto \sigma^{-2(s_1 + 1)}\, exp(-s_2\sigma^{-2})$$

If I combine likelihood and prior I get another inverse gamma distribution,

$$IG(\sigma^2|u_1, u_2) \propto \sigma^{-2(s_1 + n/2 + 1)}\, exp\left[-\sigma^{-2}\left(s_2 + \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2\right)\right]$$

Then $u_1 = s_1 + n/2$, and $u_2 = s_2 + \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2$. Here is a prior and posterior distribution for a sample data set.

```r
library(MCMCpack)
par( bty = 'n' )
n  <- 10
y  <- rnorm(n)
s1 <- s2 <- 1
yb <- mean(y)
ss <- seq(0, 4, length = 100)
u1 <- s1 + n/2
u2 <- s2 + 1/2 * sum( (y - yb)^2 )
plot(ss, dinvgamma(ss, u1, u2), type = 'l', lwd = 2)
lines(ss, dinvgamma(ss, s1, s2), col = 'blue', lwd = 2)
```

*Prior and posterior IG distribution *

##residual variance for regression

For regression, replacing $\mu$ with $\mathbf{X}\boldsymbol{\beta}$, I have

$$u_2 = s_2 + \frac{1}{2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

To see this, recall the likelihood,

$$\sigma^{-2(n/2)}\, exp\left[-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right]$$

—————————————— ===== ——————————————

Exercise in class: Find the conditional posterior distribution for the variance in regression. Based on the previous two blocks of code, write a function to evaluate the variance for a sample regression.

##small step to Gibbs sampling

The conditional posterior distributions for coefficients and variance will be combined with Gibbs sampling. To see how this will come together, consider that we can now sample $[\boldsymbol{\beta}|\sigma^2]$ and, conversely, $[\sigma^2|\boldsymbol{\beta}]$. If we alternate these two steps repeatedly we have a simulation for their joint distribution, $[\boldsymbol{\beta}, \sigma^2]$.

To see the setup that is used in `jags`, refer back to unit 2. For the regression example, I would simply add an additional step.

#jags example

To see how well we can recover parameters when they are known, here is a simulated data set:

```r
n     <- 100                         # sample size
p     <- 4                           # no. predictors
beta  <- matrix( rnorm(p), p )       # coefficients
sigma <- .1                          # residual variance
x     <- matrix( rnorm(n*p), n, p )  # design
x[, 1] <- 1                          # intercept
mu    <- x %*% beta
y     <- rnorm(n, mu, sqrt(sigma))
pairs( cbind(y, x[, -1]) )
```

If I knew the residual variance, this would be my Bayesian estimate:

```r
B <- diag(10000, p)
b <- beta * 0
V <- solve( 1/sigma * crossprod(x) + solve(B) )
v <- 1/sigma * crossprod(x, y)
betaHat <- V %*% v
betaSe  <- sqrt( diag(V) )
coefficients <- signif( cbind(beta, betaHat, betaSe), 4 )
colnames(coefficients) <- c('true', 'estimate', 'Se')
coefficients
```

```
##         true estimate      Se
## [1,] 0.27350   0.3340 0.03200
## [2,] 0.68000   0.6661 0.03347
## [3,] 0.44740   0.4949 0.03721
## [4,] 0.04942   0.0363 0.03312
```

For comparison, here’s the classical estimate:

```r
summary( lm( y ~ x[, -1] ) )$coefficients[, 1:2]
```

```
##              Estimate Std. Error
## (Intercept) 0.3339715 0.03367717
## x[, -1]1    0.6660953 0.03522470
## x[, -1]2    0.4949326 0.03915431
## x[, -1]3    0.0363018 0.03484890
```

Now I want to sample the joint distribution of $[\boldsymbol{\beta}, \sigma^2]$. Here’s jags:

```r
library(rjags)
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs

file <- "lmSimulated.txt"
cat( "model{
  # Likelihood
  for(i in 1:n){
    y[i] ~ dnorm(mu[i],precision)
    mu[i] <- inprod(beta[],x[i,])
  }
  for (i in 1:p) {
    beta[i] ~ dnorm(0, 1.0E-5)
  }
  # Prior for the inverse variance
  precision ~ dgamma(0.01, 0.01)
  sigma <- 1/precision
}", file = file)
```

Here is a function that sets up the posterior sampling:

```r
model <- jags.model( file = file, data = list( x = x, y = y, n = nrow(x), p = ncol(x) ) )
```

```
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 5
##    Total graph size: 713
##
## Initializing model
```

I start with 100 burn-in iterations, then sample for 2000:

```r
update(model, 100)
jagsLm <- coda.samples(model, variable.names = c("beta", "sigma"), n.iter = 2000)
tmp <- summary(jagsLm)
print(tmp$statistics)
```

```
##               Mean         SD     Naive SE Time-series SE
## beta[1] 0.33539775 0.03402236 0.0007607631   0.0007607631
## beta[2] 0.66702382 0.03561230 0.0007963152   0.0008575067
## beta[3] 0.49500158 0.03941578 0.0008813636   0.0008813636
## beta[4] 0.03524711 0.03548611 0.0007934935   0.0008929009
## sigma   0.11349828 0.01673148 0.0003741272   0.0003929038
```

Here are plots:

**plot**

(jagsLm)

**Exercise in class**: Make an informative prior distribution for the regression parameters. Then compare the estimates you get with those from the non-informative prior. Do this analytically and with jags.

#recap

Bayesian analysis requires some basic distribution theory to combine data and prior information into a posterior distribution. Fundamental ways to parameterize probability include probability mass functions (discrete variables), probability density functions (continuous variables), and distribution functions (both). The sample space defines the allowable (non-zero probability) values for a random variable. Integrating (continuous) or summing (discrete) over the sample space gives a probability of 1.

Distributions have moments, which are expectations for integer powers of a random variable.

The first moment is the mean, and the second central moment is the variance. Higher moments include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).

Joint distributions can be factored into conditional and marginal distributions. A conditional distribution assumes a specific value for the variable being conditioned on.

Marginalizing over a variable is done with the law of total probability. Bayes theorem relies on a specific factorization giving a posterior distribution in terms of likelihood and prior.

R can be used to draw random variables and to evaluate densities and probabilities. Binomial and Bernoulli distributions apply to numbers of successes in $n$ or 1 trials, respectively.

The multivariate normal distribution is commonly used as a prior distribution. When combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.
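A minimal sketch of R's d/p/r naming convention for the distributions mentioned above (the particular sizes and probabilities are illustrative):

```r
# r* draws random values, d* evaluates density/mass, p* evaluates probability.
set.seed(1)
rbinom(10, size = 1, prob = 0.3)   # ten Bernoulli trials (binomial with size = 1)
rbinom(1, size = 20, prob = 0.3)   # one binomial count of successes in 20 trials

dbinom(5, size = 20, prob = 0.3)   # Pr(y = 5): probability mass
pbinom(5, size = 20, prob = 0.3)   # Pr(y <= 5): distribution function
dnorm(0, mean = 0, sd = 1)         # normal probability density at zero
```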

#appendix

Here I provide a bit more detail on moments used in the beta-binomial example, the posterior for regression parameters, and its connection to maximum likelihood estimates.

##moments

Moments describe the shape of a distribution. The **mean** of the distribution is the **first moment**. The **variance** is the **second central moment**. The $m^{th}$ moment of a distribution for $x$ is the expected value of $x^m$. For a continuous variable $x$ having PDF $p(x)$ this is

$$E[x^m] = \int_{-\infty}^{\infty} x^m p(x)\, dx$$

Note that the zeroth moment $= 1$, the area under the PDF. For a discrete variable this is

$$E[x^m] = \sum_{x \in \mathcal{X}} x^m p(x)$$

Let $\mu = E[x^1]$ be the first moment. Then the $m^{th}$ central moment is

$$E[(x - \mu)^m] = \int_{-\infty}^{\infty} (x - \mu)^m p(x)\, dx$$

(continuous) and

$$E[(x - \mu)^m] = \sum_{x \in \mathcal{X}} (x - \mu)^m p(x)$$

(discrete). The variance is $E[(x - \mu)^2]$.

Moments also exist for a sample. In this case I can think of the discrete probability assigned to each observation as $1/n$, where $n$ is the number of observations. Plugging this into the discrete moment equation I have

$$\bar{x} = E[x] = \frac{1}{n} \sum_{i=1}^{n} x_i$$

for the sample mean and

$$var(x) = E[(x - \bar{x})^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

for the sample variance.
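The sample-moment formulas can be checked directly in R; a sketch with simulated values. Note that R's built-in `var` divides by $n - 1$ rather than $n$:

```r
set.seed(5)
x <- rnorm(1000, mean = 2, sd = 0.5)
n <- length(x)

m1 <- sum(x) / n              # first moment: sample mean
m2 <- sum((x - m1)^2) / n     # second central moment: sample variance

m1 - mean(x)                  # zero
m2 - var(x) * (n - 1) / n     # zero once var() is rescaled from n - 1 to n
```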

##Bayesian regression parameters

As for the example for the mean of the normal distribution, I apply the "big-V, small-v" method. For matrices the exponent of $N(\beta \mid \mathbf{V}\mathbf{v}, \mathbf{V})$ is

$$-\frac{1}{2}(\beta - \mathbf{V}\mathbf{v})'\mathbf{V}^{-1}(\beta - \mathbf{V}\mathbf{v}) = -\frac{1}{2}\left(\beta'\mathbf{V}^{-1}\beta - 2\beta'\mathbf{v} + \mathbf{v}'\mathbf{V}\mathbf{v}\right)$$

As before, I find $\mathbf{V}$ and $\mathbf{v}$ in the first two terms.

When I combine the regression likelihood with this prior distribution, I have an exponent on the multivariate normal distribution that looks like this,

$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mathbf{x}_i'\beta)^2 + (\beta - \mathbf{b})'\mathbf{B}^{-1}(\beta - \mathbf{b})$$

or like this,

$$\frac{1}{\sigma^2}(\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta) + (\beta - \mathbf{b})'\mathbf{B}^{-1}(\beta - \mathbf{b})$$

where $\mathbf{y}$ is the length-$n$ vector of responses, and $\mathbf{X}$ is the $n \times p$ design matrix.

Retaining only terms containing coefficients, I collect terms,

$$-2\beta'\left(\sigma^{-2}\mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}\right) + \beta'\left(\sigma^{-2}\mathbf{X}'\mathbf{X} + \mathbf{B}^{-1}\right)\beta$$

I identify parameter vectors,

$$\mathbf{V} = \left(\sigma^{-2}\mathbf{X}'\mathbf{X} + \mathbf{B}^{-1}\right)^{-1}$$

$$\mathbf{v} = \sigma^{-2}\mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}$$

These determine the posterior distribution.
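The $\mathbf{V}, \mathbf{v}$ identification translates directly into R. This sketch simulates its own data (the sample size, dimension, and prior settings are illustrative, not the chapter's simulation), this time keeping the prior-mean term $\mathbf{B}^{-1}\mathbf{b}$ explicit:

```r
set.seed(2)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix
trueBeta <- c(0.3, 0.7, 0.5)
sigma2   <- 0.1^2                                     # known residual variance
y <- drop(X %*% trueBeta + rnorm(n, 0, sqrt(sigma2)))

b <- rep(0, p)                                    # prior mean
B <- diag(1000, p)                                # weak prior covariance
V <- solve( crossprod(X) / sigma2 + solve(B) )    # (sigma^-2 X'X + B^-1)^-1
v <- crossprod(X, y) / sigma2 + solve(B) %*% b    # sigma^-2 X'y + B^-1 b
posteriorMean <- V %*% v                          # close to trueBeta
posteriorSe   <- sqrt( diag(V) )
```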

##connection to maximum likelihood

Consider again the likelihood, now ignoring the prior distribution, having exponent

$$\log L \propto -\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\beta)'(\mathbf{y} - \mathbf{X}\beta)$$

To maximize the log likelihood I consider only these terms, because others do not contain parameters. I differentiate once,

$$\frac{d \log L}{d\beta} = \sigma^{-2}\mathbf{X}'\mathbf{y} - \sigma^{-2}\mathbf{X}'\mathbf{X}\beta$$

and again,

$$\frac{d^2 \log L}{d\beta^2} = -\sigma^{-2}\mathbf{X}'\mathbf{X}$$

To obtain MLEs I set the first derivative equal to zero and solve,

$$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

The matrix of curvatures, or second derivatives, is related to **Fisher Information** and the covariance of parameter estimates,

$$\mathbf{I} = -\frac{d^2 \log L}{d\beta^2}$$

The covariance of parameter estimates is $\mathbf{I}^{-1}$.
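As a check (on simulated data with illustrative names), the closed-form MLE and the Fisher-information covariance reproduce the estimates and standard errors from `lm`:

```r
set.seed(3)
n <- 50
X <- cbind(1, rnorm(n))                     # design matrix with intercept
y <- drop(X %*% c(1, 2) + rnorm(n))

betaMLE <- solve( crossprod(X), crossprod(X, y) )   # (X'X)^-1 X'y
fit <- lm(y ~ X[, 2])

s2 <- sum(residuals(fit)^2) / (n - 2)          # estimate of sigma^2
se <- sqrt( diag( s2 * solve(crossprod(X)) ) ) # from I^-1 = sigma^2 (X'X)^-1

cbind(betaMLE, coef(fit))                      # columns agree
cbind(se, summary(fit)$coefficients[, 2])      # columns agree
```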