# 3jointDistributions jointly distributed random variables
env/bio 665 Bayesian inference for environmental models
Jim Clark
2020-01-27
#resources
##software
source('../clarkFunctions2020.r')
Models for Ecological Data, Appendix D.
Evaluating the impacts of fungal seedling pathogens on temperate forest seedling survival,
Hersh et al. on joint, conditional, predictive distributions, Ecology.
#objectives
• Understand key concepts:
  – discrete and continuous densities, probability mass functions, and probability density functions
  – the sample space for a distribution
  – moments of a distribution
  – factor a joint distribution
  – graph a model
  – the contribution of data and prior to the posterior estimate of a mean
• Apply basic rules to manipulate jointly distributed random variables:
  – total probability
  – Bayes theorem
• Use R to draw random samples and to determine density and probability for standard distributions
  – binomial and Bernoulli
  – beta and beta-binomial
  – normal and multivariate normal
• Find the posterior distribution of regression parameters
“only grade-schoolers can divide, only undergraduates can differentiate, a rare PhD can
integrate”, Mark Twain, maybe
#Attention au l’calcule
I need a few rules from calculus to manipulate distributions. Calculus is more often important
for its conceptual and notation contributions than for actual solutions to equations. Most
functions cannot be integrated. The limited capacity to handle the integration constant needed
for Bayes theorem stalled progress until numerical analysis advanced with methods such as
Gibbs sampling. Although I will not integrate much, calculus provides concepts needed
here.
For probability, I need both distribution functions and density functions. The latter is the
derivative of the former. The notation is powerful, providing a direct connection to concepts
and notation used in basic algebra. The derivative 𝑑𝑃/𝑑𝑥 can be viewed as a limit of a ratio
$$p(x) = \frac{dP}{dx} = \lim_{dx \rightarrow 0} \left[ \frac{P(x + dx) - P(x)}{dx} \right]$$
i.e., division. Multiplication 𝑝(𝑥) ⋅ 𝑑𝑥 can be viewed as the limit of the anti-derivative
$$\int_x^{x+dx} p(u)\,du = \lim_{dx \rightarrow 0} \left[ p(x) \cdot dx \right]$$
Ironically, although multiplication is easy and division is hard, the tables are turned on their
calculus counterparts: differentiation is usually easier than integration.
The idea of derivatives and integrals as limiting quantities is important for computation in
Bayesian analysis. Differentiation is needed for optimization problems–more important for
maximum likelihood. Optimizations and integrations are commonly approximated
numerically.
In the discussion ahead I rely on some of the notation of calculus, but there are no difficult
solutions.
#Basic probability rules and the Janzen Connell hypothesis
Basic probability ideas are introduced here with the Janzen Connell (JC) effect. The JC effect is
believed to promote forest tree diversity as natural enemies disproportionately attack the
most abundant tree hosts. The mechanism requires that natural enemies are host-specific and
that they most efficiently find and/or impact host populations when and where those hosts are
abundant. Fungi are plausible candidates for the JC effect, because there are many taxa, and
they include known pathogens of trees.
To test whether or not fungal pathogens contribute to tree diversity, Hersh et al. (2012)
planted seedlings of six species in 60 plots, they observed survival, and they assayed them for
fungal infection on cultures and with DNA sequencing. A model was constructed for the
relationship between pathogen, host, and observations, and it was used to infer where
pathogens occur (‘incidence’), infection of hosts, and effect of infection on survival. This
example introduces some of the techniques used in Bayesian analysis, including some basic
distribution theory. I begin with background on distributions, followed by examples that
demonstrate ways to look at the JC hypothesis.
##continuous and discrete probability distributions
I express uncertainty about an event using probability. Uncertainty could be temporary,
expressing current information: my prediction of heads or tails will be updated when I flip this
coin. Or it could be indefinite, expressing my ability to predict in general: about half of my
predictions for coin tosses will be wrong. A probability is dimensionless. It can be zero, one,
or somewhere in between. In the sections that follow I introduce basic distribution theory
needed for Bayesian analysis.
##probability spaces
The probability distribution assigns probability values to the set of all possible outcomes
over the sample space. The sample space for a continuous probability distribution is all or
part of the real line. The normal distribution applies to the full real line, ℝ = (−∞, ∞). The
gamma (including the exponential) and the log-normal distributions apply to the positive
real numbers ℝ+ = (0, ∞). The beta distribution 𝑏𝑒𝑡𝑎(𝑎, 𝑏) applies to real numbers in the
interval [0,1]. The uniform distribution, 𝑢𝑛𝑖𝑓(𝑎, 𝑏), applies to the interval [𝑎, 𝑏]. These are
univariate distributions: they describe one dimension. [Note: when discussing an interval, a
square bracket indicates that I am including the boundary–the uniform distribution is defined
at zero, but the lognormal distribution is not.]
A multivariate distribution describes variation in more than one dimension. For example, a
d-dimensional multivariate normal distribution describes a length-$d$ random vector in $\mathbb{R}^d$.
Three continuous distributions supported on different portions of the real line. Zero and one
(grey lines) are unsupported. PDFs above and CDFs below.
The sample space for a discrete distribution is a set of discrete values. For the Bernoulli
distribution, the sample space is {0,1}. For the binomial distribution, 𝑏𝑖𝑛𝑜𝑚(𝑛, 𝜃), the sample
space is {0, … , 𝑛}. For count data, often modeled as a Poisson distribution, the sample space is
{0,1,2, … }. The probability mass function means that there is point mass on specific values
(e.g., integers) and no support elsewhere.
A common multivariate discrete distribution is the multinomial distribution. It assigns
probability to $n$ trials, each of which can have $J$ outcomes, $multinom(n, (\theta_1, \dots, \theta_J))$. It has the
sample space $\{0, \dots, n\}^J$, subject to the constraint that the sum over all $J$ classes is equal to $n$.
Three discrete distributions, with PMFs above and CDFs below.
The figures show continuous and discrete distributions, each in two forms, as densities and as
cumulative probabilities.
##probability density and cumulative probability
The cumulative distribution function (CDF) 𝑃(𝑥) accumulates continuous probability over
the sample space, from low values (near-zero) to high values (near-one). The probability
density function (PDF) is the derivative of the CDF, 𝑝(𝑥) = 𝑑𝑃/𝑑𝑥. Because the CDF is a
dimensionless probability, its derivative must have units of 1/𝑥. The CDF is obtained from the
PDF by integration,
$$P(x) = \int_{-\infty}^{x} p(x)\,dx$$
I cannot assign a probability to a continuous value of 𝑥, only to an interval, say (𝑥, 𝑥 + 𝑑𝑥). For
a small interval 𝑑𝑥 the following relationships are useful,
$$P(x + dx) - P(x) = \int_x^{x+dx} p(x)\,dx \approx p(x) \cdot dx$$
For example, if I wanted to assure myself that 68% of the normal density lies within 1 sd of the
mean, I evaluate the CDF at these values and take the difference:
p <- pnorm(c(-1,1))
diff(p)
## [1] 0.6826895
Because integration is like multiplication, it adds one integer value to the exponent, from the
PDF ($x^{-1}$) to the CDF ($x^0$). The area under the PDF is
$$1 = P(\infty) = \int_{-\infty}^{\infty} p(x)\,dx$$
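As a quick numerical check of these relationships, here is a minimal sketch using only base R and the standard normal. The evaluation point `x0` and the step `dx` are arbitrary choices of mine, not values from the text. It compares the finite-difference slope of the CDF with the PDF, compares the interval probability with $p(x) \cdot dx$, and integrates the PDF over the real line.

```r
x0 <- 0.7                                    # arbitrary evaluation point
dx <- 1e-5                                   # small interval width

slope <- (pnorm(x0 + dx) - pnorm(x0))/dx     # [P(x + dx) - P(x)]/dx
dens  <- dnorm(x0)                           # p(x)

interval <- pnorm(x0 + dx) - pnorm(x0)       # probability of (x, x + dx)
pdx      <- dens*dx                          # p(x) * dx

area <- integrate(dnorm, -Inf, Inf)$value    # area under the PDF

c(slope = slope, dens = dens, interval = interval, pdx = pdx, area = area)
```

The slope and the density should agree closely, and the area should be close to one, up to numerical error.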
For discrete distributions the PDF is replaced with a probability mass function (PMF). Like
the CDF (and unlike the PDF), the PMF is a dimensionless probability. To obtain the CDF from
the PMF I sum, rather than integrate,
$$P(x) = \sum_{k \leq x} p(k)$$
To obtain the PMF from the CDF I difference, rather than differentiate,
𝑝(𝑥) = 𝑃(𝑥) − 𝑃(𝑥 − 1)
The sum over the sample space is
$$1 = \sum_{k \in \mathcal{K}} p(k)$$
where 𝒦 is the sample space. In R there are functions for common distributions. The CDF
begins with the letter p. The PDF and PMF begin with the letter d. To obtain random values use
the letter r. For quantiles use the letter q.
Suppose I want to draw the PMF for the Poisson distribution having intensity 𝜆 = 4.6. I can
generate a sequence of integer values and then use dpois:
k <- c(0:20)
dk <- dpois(k,4.6)
plot(k, dk)
segments(k, 0, k, dk)
I could draw the CDF with ppois:
pk <- ppois(k, 4.6)
plot(k, pk, type='s')
If I want to confirm the values of k having these probabilities, I invert the probabilities with
qpois:
qpois(pk,4.6)
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
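The differencing and summing relationships between PMF and CDF can be checked with the same Poisson example. This is a small sketch of mine using base R functions only; the seed and the number of random draws are arbitrary.

```r
lambda <- 4.6
k  <- 0:20
pk <- ppois(k, lambda)                 # CDF
dk <- dpois(k, lambda)                 # PMF

# differencing the CDF recovers the PMF: p(k) = P(k) - P(k - 1)
dkFromCDF <- diff(c(0, pk))
max(abs(dkFromCDF - dk))               # should be near zero

# summing the PMF recovers the CDF
max(abs(cumsum(dk) - pk))              # should be near zero

# random draws from rpois have a sample mean near lambda
set.seed(1)
mean(rpois(10000, lambda))
```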
—————————————— ===== ——————————————
Example 1. I want to generate a random sample from a normal distribution and see if I can
‘recover the mean’. Here are the steps I use:
• define a sample size, a mean and variance
• draw a random sample using rnorm
• estimate parameters as $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{n}{n-1} var(x)$
• determine a 95% confidence interval using qnorm
• draw a PDF using dnorm and CDF using pnorm based on the estimates
• determine if the true mean lies within this interval
Plots from Example 1.
Now repeat this 1000 times and count the number of times the confidence interval includes
the true mean.
Density of low and high 95% CIs from Example 1.
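The repeated version of Example 1 can be sketched as follows. The sample size, true mean, true standard deviation, and seed are hypothetical values chosen only for illustration. Note that R's `sd()` already uses the $n - 1$ denominator, so no further correction is applied here. The interval is taken as the estimate plus or minus normal quantiles from `qnorm` times the standard error of the mean, which is one reasonable reading of the steps above.

```r
set.seed(665)                  # arbitrary seed, for reproducibility
n     <- 50                    # sample size
mu    <- 3                     # true mean
sigma <- 2                     # true standard deviation
nsim  <- 1000                  # number of repeated samples

covered <- logical(nsim)
for(i in 1:nsim){
  x  <- rnorm(n, mu, sigma)                   # random sample
  mh <- mean(x)                               # estimate of the mean
  se <- sd(x)/sqrt(n)                         # standard error of the mean
  ci <- mh + qnorm(c(.025, .975))*se          # 95% confidence interval
  covered[i] <- ci[1] < mu & mu < ci[2]       # does it include the true mean?
}
mean(covered)                                 # fraction of CIs that include mu
```

The fraction should be in the neighborhood of 0.95, varying from run to run with the seed.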
—————————————— ===== ——————————————
#moments
Moments are expectations of powers of $x$. They are used to summarize the location and shape
of a distribution. I can think of the $m^{th}$ moment of a distribution as a weighted average of $x^m$.
For a continuous distribution
$$E[x^m] = \int_{-\infty}^{\infty} x^m p(x)\,dx$$
The first moment, $m = 1$, is the mean. The variance is a central moment,
$$var[x] = E[(x - E[x])^2] = \int_{-\infty}^{\infty} (x - E[x])^2 p(x)\,dx$$
For discrete variables, we replace the integral with a sum.
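Both versions can be evaluated numerically. The sketch below is my own illustration: it uses `integrate` for a gamma density (shape and rate chosen arbitrarily) and a finite sum for a Poisson PMF, truncating the sample space at 200 on the assumption that the remaining probability is negligible.

```r
# continuous: gamma density with shape 3, rate 2 (arbitrary choice)
shape <- 3; rate <- 2
m1 <- integrate(function(x) x*dgamma(x, shape, rate), 0, Inf)$value
m2 <- integrate(function(x) (x - m1)^2*dgamma(x, shape, rate), 0, Inf)$value
c(mean = m1, variance = m2)        # compare with shape/rate and shape/rate^2

# discrete: Poisson PMF, summing over enough of the sample space
lambda <- 4.6
k  <- 0:200
e1 <- sum(k*dpois(k, lambda))
v1 <- sum((k - e1)^2*dpois(k, lambda))
c(mean = e1, variance = v1)        # both equal lambda for the Poisson
```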
#discrete probability for multiple host states
I now want to apply probability concepts to the Janzen-Connell question. In the foregoing
discussion I used notation to distinguish between probability and density. In cases where I
do not want to name a specific PDF or CDF I use the shorthand bracket notation.
Let [𝐼, 𝑆] be the joint probability of two events, i) that a host plant is infected, 𝐼, and ii) that it
survives, 𝑆. In this example, both of these events are binary, being either true (indicated with a
one) or not (indicated with a zero). For example, the probability that an individual is not
infected and survives is written as [𝐼 = 0, 𝑆 = 1]. If I write simply [𝐼, 𝑆] for these binary events,
it is interpreted as the probability that both events are a ‘success’ or ‘true’, i.e., [𝐼 = 1, 𝑆 = 1].
A graphical model of relationships discussed for the Janzen Connell hypothesis. Symbols are I - infected host, S - survival, D - detection.
As previously, states (or events) and parameters are nodes in the graph, the states {𝐼, 𝑆, 𝐷},
and the parameters {𝜃, 𝜙, 𝜋0 , 𝜋1 }. The connections, or arrows, between nodes are sometimes
called edges. Here I assign parameters:
$$\begin{aligned}
[I] &= \theta \\
[D|I = 1] &= \phi \\
[S|I = 0] &= \pi_0 \\
[S|I = 1] &= \pi_1
\end{aligned}$$
A host can become infected with probability [𝐼] = 𝜃. An infection can be detected with
probability [𝐷|𝐼 = 1] = 𝜙. An infected individual survives with probability [𝑆|𝐼 = 1] = 𝜋1 , and
a non-infected individual survives with probability [𝑆|𝐼 = 0] = 𝜋0 . For this example, I assume
no false positives, [𝐷 = 1|𝐼 = 0] = 0.
##start simple
A study of infection and host survival could be modeled as a joint distribution [𝐼, 𝑆]. I might be
interested in estimating state 𝐼, in comparing parameter estimates (‘does 𝜋0 differ from 𝜋1?’),
or both. Both events are unknown before they are observed. After data collection I know 𝑆, but
not 𝐼. I want a model for the conditional probability of survival given that an individual is
infected, [𝑆|𝐼 = 1] = 𝜋1 or not [𝑆|𝐼 = 0] = 𝜋0 . The notation [𝑆|𝐼] indicates that the event to the
left of the bar is ‘conditional on’ the event to the right.
An arrow points from 𝐼 to 𝑆, because I believe that infection might affect survival, but I do not
believe that survival influences infection (because I am not concerned with infection ‘after’ or
‘caused by’ death). The challenge is that I observe 𝑆, but not 𝐼. I cannot condition on 𝐼 if it is
unknown. Rather, I want to estimate it. Progress requires Bayes theorem.
Nodes from the graphical model for infection status and survival.
To make use of the model I need a relationship between conditional and joint probabilities,
[𝐼, 𝑆] = [𝑆|𝐼][𝐼]
Here I have factored the joint probability on the left-hand side into a conditional distribution
and a marginal distribution on the right-hand side. I can also factor the joint distribution this
way,
[𝐼, 𝑆] = [𝐼|𝑆][𝑆]
Because both are equal to the joint probability, they must be equal to each other,
[𝑆|𝐼][𝐼] = [𝐼|𝑆][𝑆]
Rearranging, I have Bayes theorem, solved two ways,
$$[S|I] = \frac{[I|S][S]}{[I]}$$
and
$$[I|S] = \frac{[S|I][I]}{[S]}$$
The two pieces of this relationship that I have not yet defined are the marginal distributions,
[𝑆] and [𝐼]. I could evaluate either one conditional on the other using the law of total
probability,
$$[I = 0] = \sum_{j \in \{0,1\}} [I = 0|S = j][S = j]$$
or
$$[S = 1] = \sum_{j \in \{0,1\}} [S = 1|I = j][I = j]$$
How can I use these relationships to address the effect of infection on survival?
Given survival status, I first determine the probability that the individual was infected. I have
four factors, all univariate, two conditional and two marginal distributions. I have defined [𝑆|𝐼]
in terms of parameter values, but I want to know [𝐼|𝑆]. For a host that survived, Bayes theorem
gives me
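$$\begin{aligned}
[I|S = 1] &= \frac{[S = 1|I][I]}{[S = 1]} \\
&= \frac{[S = 1|I][I]}{\sum_{j \in \{0,1\}} [S = 1|I = j][I = j]} \\
&= \frac{\pi_1 \theta}{\pi_0 (1 - \theta) + \pi_1 \theta}
\end{aligned}$$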
For a host that died this conditional probability is
$$\begin{aligned}
[I|S = 0] &= \frac{[S = 0|I][I]}{[S = 0]} \\
&= \frac{(1 - \pi_1) \theta}{(1 - \pi_0)(1 - \theta) + (1 - \pi_1) \theta}
\end{aligned}$$
These two expressions demonstrate that, if I knew the parameter values, then I could evaluate
the conditional probability for [𝐼|𝑆]. If I do not know parameter values, then they too might be
estimated.
Before going further, notice that the numerator is always the ‘unnormalized’ probability of the
two events. The denominator simply normalizes them.
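To see the normalization at work, here is a minimal numerical sketch. The parameter values are hypothetical, chosen only for illustration, and the object names are mine. The joint (unnormalized) probabilities are computed for both infection states and then divided by their sum, which is the total probability $[S]$.

```r
# hypothetical parameter values, chosen only for illustration
theta <- 0.3      # [I = 1], probability of infection
pi0   <- 0.8      # [S = 1 | I = 0]
pi1   <- 0.4      # [S = 1 | I = 1]

# unnormalized (joint) probabilities [I, S = 1] for I = 0 and I = 1
jointS1 <- c(pi0*(1 - theta), pi1*theta)
# and [I, S = 0]
jointS0 <- c((1 - pi0)*(1 - theta), (1 - pi1)*theta)

# normalizing by total probability gives [I | S]
IgivenS1 <- jointS1/sum(jointS1)
IgivenS0 <- jointS0/sum(jointS0)

rbind(survived = IgivenS1, died = IgivenS0)   # columns are I = 0, I = 1
```

Each row sums to one, because each row is a distribution over the two infection states given the survival status.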
—————————————— ===== ——————————————
Exercise 1. In R: Assume that there are 𝑛 = 100 hosts and that infection decreases survival
probability (𝜋1 < 𝜋0 ). Define parameter values for {𝜋0 , 𝜋1 , 𝜃} and draw a binomial distribution
for [𝐼|𝑆 = 0] and for [𝐼|𝑆 = 1]. (Use the function dbinom.) Is the infection rate estimated to be
higher for survivors or for those that die? How are these two distributions affected by the
underlying prevalence of infection, 𝜃? [Hint: write down the probabilities required and then
place them in a function].
Comparison of binomial distributions of infection for survivors and deaths.
—————————————— ===== ——————————————
##continuous probability for parameters
Here I consider the problem of estimating parameters. The survival parameters are 𝜋𝐼 =
{𝜋0 , 𝜋1 }. From Bayes theorem I need
$$[\pi_I|S] = \frac{[S|\pi_I][\pi_I]}{[S]}$$
where the subscript 𝐼 = 0 (uninfected) or 𝐼 = 1 (infected). Again, assuming I know whether or
not a host is infected, I can write the distribution for survival conditioned on parameters as
$$[S|\pi_I] = \pi_I^S (1 - \pi_I)^{1-S}$$
This is a Bernoulli distribution. The Bernoulli distribution is a special case of the binomial
distribution for a single trial,
𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝) = 𝑏𝑖𝑛𝑜𝑚(1, 𝑝)
If I know $S$, and I want to estimate $\pi_I$, I again need Bayes theorem. Unlike states, the survival
parameters take continuous values on (0,1). The total probability of survival $S$ requires not
summation, but rather integration,
$$[S] = \int_0^1 [S|\pi_I][\pi_I]\,d\pi_I$$
I now have the elements needed to write the conditional distribution for $[\pi_I|S]$, but it involves
an integral expression. One way to ensure a solution to this integral is to assume that the
marginal distribution of $\pi_I$ is a beta distribution, which results in a marginal beta-binomial
distribution for $[S]$,
$$betaBinom(S|m, a, b) = \int_0^1 binom(S|m, \pi_I)\,beta(\pi_I|a, b)\,d\pi_I$$
To understand this integral, I draw the two distributions in the integrand. When I integrate I
smear out the binomial distribution based on the variation represented by the beta
distribution for 𝜋.
par(mfrow=c(2,2), mar=c(4,4,1,1), bty='n')
m <- 50                        # no. at risk
S <- 0:m
pi <- .35                      # survival Pr
b <- 4                         # beta parameter
a <- signif(b/(1/pi - 1),3)    # 2nd beta parameter to give mean value = pi
plot(S, dbinom(S, m, pi), type='s',lwd=3, xlab='S', ylab='[S]')
title('binomial distribution for m, pi')
plot(S/m, dbeta(S/m, a, b), lwd=2, xlab=expression( pi ),
     ylab=expression( paste("[", pi, "]") ), type='l')
title('beta density for a, b')
ptext <- paste( "(", a, ", ", b, ")",sep="")
text(1,1.5,ptext,pos=2)
plot(S, dbinom(S, m, pi), type='s',lwd=3, xlab='S', ylab='[S]', col = 'grey')
lines(S, dbetaBinom(S, m, mu=pi,b=b), type='s',lwd=2, col='blue')
abline(v=pi*m,lty=2)
title('betabinomial for m, a, b')
Comparison of binomial and beta-binomial (above) and beta density with parameters (a, b)
(right).
In the foregoing code I wanted to draw a $betaBinom(a, b)$ distribution with a mean of $\mu = 0.35$
and wide variance. I selected a low value of parameter b and then used moments (the
mean) to determine the value of parameter a,
$$\mu = \frac{a}{a + b}$$
To draw the binomial PMF I used the R function dbinom. The PMF for the beta-binomial is
drawn by dbetaBinom in clarkFunctions2020.r.
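For readers without clarkFunctions2020.r at hand, the integral that defines the beta-binomial can be checked with base R. The helper `dBetaBinomSketch` below is a hypothetical stand-in of mine, written from the standard closed form of the beta-binomial PMF in terms of the beta function; it is not the `dbetaBinom` function from the course file, whose internals I have not assumed. The values of `m`, `a`, and `b` follow the plotting code above.

```r
# hypothetical helper, NOT the dbetaBinom in clarkFunctions2020.r
dBetaBinomSketch <- function(S, m, a, b){
  choose(m, S)*beta(S + a, m - S + b)/beta(a, b)
}

m <- 50; a <- 2.15; b <- 4

# numerical version of the integral, one value of S at a time
byIntegral <- sapply(0:m, function(S){
  integrate(function(p) dbinom(S, m, p)*dbeta(p, a, b), 0, 1)$value
})

closed <- dBetaBinomSketch(0:m, m, a, b)
max(abs(byIntegral - closed))      # agreement of the two calculations
sum(closed)                        # a PMF sums to one
sum(0:m*closed)                    # mean, compare with m*a/(a + b)
```

Integrating the binomial over the beta density one value of $S$ at a time is slow but makes the ‘smearing’ explicit; the closed form gives the same values directly.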
For the next exercise, consult the appendix on moments.
—————————————— ===== ——————————————
Exercise 2. The variance of the beta distribution decreases as the value of parameter b increases.
Change parameter values to demonstrate this with a plot. Then compare the mean and
variance (see Appendix) from the moments for the binomial and beta-binomial. Here it is for
the beta-binomial:
meanS <- sum( S*dbetaBinom(S, m, mu=pi,b=b) )
varS <- sum( (S - meanS)^2*dbetaBinom(S, m, mu=pi,b=b) )
For the binomial, your variance should agree with $m\pi(1 - \pi)$.
—————————————— ===== ——————————————
The graphical model for known infection status and survival and unknown survival probability.
From Bayes’ theorem I could now write the posterior distribution as
$$[\pi_I|S, a, b] = \frac{Bernoulli(S|\pi_I)\,beta(\pi_I|a, b)}{betaBinom(S|a, b)} = \frac{\pi_I^{S+a-1} (1 - \pi_I)^{b-S}}{B(S + a, 1 - S + b)}$$
where 𝐵(⋅) is the beta function. Here’s another way to draw this function, showing the prior
beta distribution and posteriors for a single observation, survived or died:
post <- function(S, a, b, p){
  p^(S+a-1)*(1-p)^(b-S)/beta(S + a, 1 - S + b)
}
S <- 0
p <- seq(.01,.99,length=50)
plot(p, post(S=0, a, b, p), xlab='pi', ylab = '[pi]', type='l')   # if obs died
lines(p, post(S=1, a, b, p), col=2)                               # if obs survived
lines(p, dbeta(p, a, b), lty=2, col = 3)                          # prior
legend('topright',c('died','survived','prior (dashed)'),text.col=c(1,2,3))
With these few basic rules, I return to the Janzen Connell hypothesis. The graph summarizes
several events that influence infection and survival. I can use the model to evaluate important
properties of the process it represents, to estimate parameters, and to predict behavior.
Parameter values might be estimated from previous studies, or they might be completely
unknown. The states might be observed or not. Consider a few of the ways the model could be
used in the following example.
—————————————— ===== ——————————————
Example 2. What is the probability of infection 𝐼 where 𝐷 and 𝑆 are unknown?
I only need to consider arrows that ‘cause’ 𝐼–if neither 𝐷 nor 𝑆 cause 𝐼, and there is no
knowledge of them that could affect my subjective probability of event 𝐼, then they have no
influence on the result. The event 𝐼 = 1 has probability 𝜃.
—————————————— ===== ——————————————
In the next exercise I return to the original graph to consider parameter estimates.
—————————————— ===== ——————————————
Example 3. If I know 𝑆 but I have no knowledge of 𝐷 or 𝐼, what is the probability of 𝐼?
This problem asks for the conditional probability [𝐼|𝑆]. I know [𝑆|𝐼]. I already determined [𝐼 =
1] = 𝜃. I still need [𝑆], which I can obtain using total probability. For an individual that
survived,
$$\begin{aligned}
[S = 1] &= \sum_I [S = 1|I][I] \\
&= [S = 1|I = 0][I = 0] + [S = 1|I = 1][I = 1] \\
&= \pi_0 (1 - \theta) + \pi_1 \theta
\end{aligned}$$
By substitution I have
$$[I|S = 1] = \frac{\pi_1 \theta}{\pi_0 (1 - \theta) + \pi_1 \theta}$$
—————————————— ===== ——————————————
—————————————— ===== ——————————————
Exercise 3. I observe 𝐷 and 𝑆, and I know 𝜙 from previous studies. What is the probability of
an observation [𝐷 = 1, 𝑆 = 1]? (Hint: use total probability on [𝐷, 𝐼, 𝑆].)
Now write down the posterior distribution of parameters, given observations and known
detection probability, i.e., [𝜋𝐼 , 𝜃|𝐷 = 1, 𝑆 = 1, 𝜙].
—————————————— ===== ——————————————
These examples of discrete states and continuous parameters are used by Clark and Hersh and
Hersh et al. to evaluate the effect of co-infection of multiple pathogens that attack multiple
hosts. These models admit covariates, which could be abundances of host plants or
environmental variables.
#the normal distribution
I used the normal distribution for regression examples without saying much about it. In
Bayesian analysis it is not only used for the likelihood, but also as a prior distribution for
parameters. Here I extend the foregoing distribution theory to Bayesian methods that involve
the normal distribution.
##Bayesian estimate of the mean
To obtain an estimate of the mean of a normal distribution I combine likelihood and prior
distribution. My observations are 𝑦𝑖 , 𝑖 = 1, … , 𝑛. The likelihood for one observation 𝑖 is
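$$N(y_i|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \mu)^2 \right]$$
The likelihood for the sample of $n$ observations is
$$\begin{aligned}
N(\mathbf{y}|\mu, \sigma^2) &= \prod_{i=1}^n N(y_i|\mu, \sigma^2) \\
&= \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]
\end{aligned}$$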
Recalling Bayes theorem, I combine this likelihood with a prior distribution. I use a normal
distribution for the mean. For now I assume that 𝜎 2 is fixed. Here is a prior distribution for 𝜇
𝑁(𝜇|𝑚, 𝑀)
To obtain posterior estimates I will use a trick that starts with the following observation. I
know the posterior distribution will have this form,
$$N(\mu|Vv, V)$$
where $V$ will be the variance, and $v$ is an unknown constant. If I write out the exponent for the
normal distribution I get this:
$$-\frac{1}{2V} (\mu - Vv)^2 = -\frac{1}{2} \left( \frac{\mu^2}{V} - 2\mu v + Vv^2 \right)$$
I now know that the inverse of the variance, $V^{-1}$, will be whatever is multiplied by $\mu^2$, and $v$ will
be whatever is multiplied by $-2\mu$. I want to multiply likelihood and prior, then find these constants.
Here is likelihood times prior, focusing on factors that include 𝜇 in the exponent
$$N(\mathbf{y}|\mu, \sigma^2) N(\mu|m, M) \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 - \frac{1}{2M} (\mu - m)^2 \right]$$
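Setting $n\bar{y} = \sum_{i=1}^n y_i$ and extracting only the terms I need,
$$\mu^2 \left( \frac{n}{\sigma^2} + \frac{1}{M} \right) - 2\mu \left( \frac{n\bar{y}}{\sigma^2} + \frac{m}{M} \right)$$
shows that
$$\begin{aligned}
V &= \left( \frac{n}{\sigma^2} + \frac{1}{M} \right)^{-1} \\
v &= \frac{n\bar{y}}{\sigma^2} + \frac{m}{M}
\end{aligned}$$
You might recognize this to be a weighted average of data and prior, with the inverse of
variances being the weights,
$$Vv = \frac{\frac{n\bar{y}}{\sigma^2} + \frac{m}{M}}{\frac{n}{\sigma^2} + \frac{1}{M}}$$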
Note how large 𝑛 will swamp the prior,
$$\underset{\scriptscriptstyle \lim n \rightarrow \infty}{Vv} \rightarrow \frac{1}{n} \sum_{i=1}^n y_i = \bar{y}$$
The prior can fight back with a tiny prior variance 𝑀.
$$\underset{\scriptscriptstyle \lim M \rightarrow 0}{Vv} \rightarrow m$$
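One way to convince yourself that the ‘Vv’ solution is right is to compare it with a brute-force calculation. The sketch below is my own check, with hypothetical values for the sample, the known variance, and the prior: it multiplies likelihood and prior on a fine grid of $\mu$ values, normalizes numerically, and compares the resulting mean and variance with $Vv$ and $V$. The grid limits are an assumption that happens to cover the posterior for these values.

```r
set.seed(1)
n      <- 20
sigma2 <- 4                     # known variance
y      <- rnorm(n, 5, sqrt(sigma2))
m <- 0; M <- 10                 # prior mean and variance (arbitrary)

# 'Vv' solution
V  <- 1/(n/sigma2 + 1/M)
v  <- n*mean(y)/sigma2 + m/M
Vv <- V*v

# brute force: likelihood x prior on a grid, then normalize
mu   <- seq(-5, 15, length = 20001)
dmu  <- mu[2] - mu[1]
logp <- sapply(mu, function(u) sum(dnorm(y, u, sqrt(sigma2), log = TRUE))) +
        dnorm(mu, m, sqrt(M), log = TRUE)
post <- exp(logp - max(logp))   # subtract the maximum to avoid underflow
post <- post/sum(post*dmu)      # numerical normalization

gridMean <- sum(mu*post*dmu)
gridVar  <- sum((mu - gridMean)^2*post*dmu)
c(Vv = Vv, gridMean = gridMean, V = V, gridVar = gridVar)
```

The grid approach only works in low dimensions, which is why the analytical solution, and later sampling methods, matter.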
—————————————— ===== ——————————————
Exercise 4. Write a function to determine the posterior estimate of the mean for a normal
likelihood, normal prior distribution, and known variance 𝜎 2 . You will need to generate a
sample, supply a prior mean and variance, determine the posterior mean and variance, and
plot.
Bayesian analysis of the mean.
Then demonstrate the effect of 𝑛 and 𝑀.
—————————————— ===== ——————————————
##Bayesian regression (known 𝜎 2 )
𝐲 ∼ 𝑀𝑉𝑁(𝐗𝛃, 𝛴)
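where $\mathbf{y}$ is the length-$n$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix, $\boldsymbol{\beta}$ is the length-$p$
vector of coefficients, and $\Sigma$ is an $n \times n$ covariance matrix. I can write this as
$$(2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left[ -\frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' \Sigma^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$
Because we assume i.i.d. (independent, identically distributed) $y_i$, the covariance matrix is $\Sigma = \sigma^2 \mathbf{I}$,
and $|\Sigma|^{-1/2} = (\sigma^2)^{-n/2}$, giving us
$$(2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})' (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$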
This is the form of the likelihood I use to obtain the conditional posterior for regression
coefficients.
The multivariate prior distribution is also multivariate normal,
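$$\begin{aligned}
[\beta_1, \dots, \beta_p] &= MVN(\boldsymbol{\beta}|\mathbf{b}, \mathbf{B}) \\
&= \frac{1}{(2\pi)^{p/2} det(\mathbf{B})^{1/2}} \exp\left[ -\frac{1}{2} (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b}) \right]
\end{aligned}$$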
If there are 𝑝 predictors, then 𝛃 = (𝛽1 , … , 𝛽𝑝 )′. The prior mean is a length-𝑝 vector 𝐛. The prior
covariance matrix could be a non-informative diagonal matrix,
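$$\mathbf{B} = \begin{pmatrix}
B & 0 & \cdots & 0 \\
0 & B & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & B
\end{pmatrix}$$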
for some large value 𝐵. The posterior distribution is 𝑀𝑉𝑁(𝛃|𝐕𝐯, 𝐕), where
𝐕 = (𝜎 −2 𝐗′𝐗 + 𝐁 −1 )−1
𝐯 = 𝜎 −2 𝐗′𝐲 + 𝐁 −1 𝐛
(appendix). Taking limits as I did for the previous example, I obtain the MLE for the mean
parameter vector,
$$\underset{\scriptscriptstyle \lim n \rightarrow \infty}{\mathbf{Vv}} \rightarrow (\mathbf{X'X})^{-1}\mathbf{X'y}$$
(appendix).
—————————————— ===== ——————————————
Exercise 5. Obtain the posterior mean and variance for regression parameters for a simulated
data set. Your algorithm might proceed as follows:
1. define $n$, $p$, and $\sigma^2$
2. generate $n \times p$ matrix $\mathbf{X}$ from random values, and set the first column to ones
3. generate a $p \times 1$ matrix $\boldsymbol{\beta}$ from random values
4. generate a $n \times 1$ vector $\mathbf{y}$ using rnorm.
5. specify a $p \times 1$ prior mean matrix $\mathbf{b}$ and prior covariance matrix $\mathbf{B}$
6. write a function to evaluate $\mathbf{V}$, $\mathbf{v}$, and return the mean vector and covariance matrix
Marginal posterior densities for beta.
Explain how you would check that the algorithm is correct.
—————————————— ===== ——————————————
##Residual variance (known 𝛍)
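Now I assume that I know the coefficients and want to estimate the residual variance $\sigma^2$.
Recall the likelihood for the normal distribution,
$$\begin{aligned}
N(\mathbf{y}|\mu, \sigma^2) &= \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \\
&\propto \sigma^{-2(n/2)} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]
\end{aligned}$$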
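A prior distribution that is commonly used for $\sigma^2$ is inverse gamma,
$$\begin{aligned}
IG(\sigma^2|s_1, s_2) &= \frac{s_2^{s_1}}{\Gamma(s_1)} \sigma^{-2(s_1+1)} \exp(-s_2 \sigma^{-2}) \\
&\propto \sigma^{-2(s_1+1)} \exp(-s_2 \sigma^{-2})
\end{aligned}$$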
If I combine likelihood and prior I get another inverse gamma distribution,
$$IG(\sigma^2|u_1, u_2) \propto \sigma^{-2(s_1+n/2+1)} \exp\left[ -\sigma^{-2} \left( s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2 \right) \right]$$
Then $u_1 = s_1 + n/2$, and $u_2 = s_2 + \frac{1}{2} \sum_{i=1}^n (y_i - \mu)^2$. Here is a prior and posterior distribution
for a sample data set.
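library(MCMCpack)                  # for dinvgamma
par(bty='n')
n <- 10
y <- rnorm(n)                      # sample with true mean 0, variance 1
s1 <- s2 <- 1                      # prior parameters
yb <- mean(y)                      # sample mean, used here in place of mu
ss <- seq(0,4,length=100)          # grid of variance values
u1 <- s1 + n/2                     # posterior parameters
u2 <- s2 + 1/2*sum( (y - yb)^2)
plot(ss,dinvgamma(ss, u1, u2), type='l', lwd=2)      # posterior
lines(ss,dinvgamma(ss,s1,s2),col='blue',lwd=2)       # prior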
Prior and posterior IG distribution
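If MCMCpack is not installed, the inverse gamma can be handled with base R, because $\sigma^2 \sim IG(s_1, s_2)$ is the same statement as $1/\sigma^2 \sim gamma(s_1, s_2)$ with $s_2$ as a rate. The helpers `dIG` and `rIG` below are hypothetical names of mine, written from that change of variables; they are a sketch, not a replacement for the package functions. The parameter values are arbitrary.

```r
# hypothetical helpers based on the gamma distribution for 1/sigma^2
dIG <- function(x, s1, s2){
  ifelse(x > 0, dgamma(1/x, shape = s1, rate = s2)/x^2, 0)
}
rIG <- function(n, s1, s2) 1/rgamma(n, shape = s1, rate = s2)

u1 <- 6; u2 <- 5                       # arbitrary parameter values

# the density integrates to one
area <- integrate(dIG, 0, Inf, s1 = u1, s2 = u2)$value

# random draws have mean near u2/(u1 - 1), which holds for u1 > 1
set.seed(1)
draws <- rIG(100000, u1, u2)
c(area = area, sampleMean = mean(draws), theory = u2/(u1 - 1))
```

Drawing $1/\sigma^2$ from a gamma distribution and inverting is the step that a sampler for the variance would use.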
##residual variance for regression
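For regression, I replace $\mu$ with $\mathbf{X}\boldsymbol{\beta}$, and I have $u_2 = s_2 + \frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$.
To see this, recall the likelihood,
$$\sigma^{-2(n/2)} \exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right]$$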
—————————————— ===== ——————————————
Exercise in class Find the conditional posterior distribution for the variance in regression.
Based on the previous two blocks of code, write a function to evaluate the variance for a
sample regression.
##small step to Gibbs sampling
The conditional posterior distributions for coefficients and variance will be combined with
Gibbs sampling. To see how this will come together, consider that we can now sample [𝛃|𝜎 2 ]
and, conversely, [𝜎 2 |𝛃]. If we alternate these two steps repeatedly we have a simulation for
their joint distribution, [𝛃, 𝜎 2 ].
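The alternation can be sketched in a few lines of base R. Everything below is an illustration under assumptions of mine: the simulated data, the ‘true’ parameter values, the prior values, the chain length, and the burn-in are all hypothetical. The coefficient step uses the $MVN(\mathbf{Vv}, \mathbf{V})$ conditional and the variance step uses the $IG(u_1, u_2)$ conditional derived above; multivariate normal draws are generated from a Cholesky factor to avoid extra packages.

```r
set.seed(2020)
n <- 100; p <- 3
X      <- cbind(1, matrix(rnorm(n*(p - 1)), n, p - 1))   # design with intercept
beta   <- c(1, -0.5, 2)                                  # 'true' coefficients
sigma2 <- 0.5                                            # 'true' residual variance
y      <- as.vector(X %*% beta) + rnorm(n, 0, sqrt(sigma2))

b  <- rep(0, p); Binv <- diag(1/1000, p)   # prior for beta: MVN(b, B), B = 1000*I
s1 <- s2 <- 1                              # prior for sigma2: IG(s1, s2)

ng     <- 2000                             # number of Gibbs steps
chains <- matrix(NA, ng, p + 1,
                 dimnames = list(NULL, c(paste0('beta', 1:p), 'sigma2')))
sg <- 1                                    # initial value for sigma2
XX <- crossprod(X); Xy <- crossprod(X, y)

for(g in 1:ng){
  # [beta | sigma2]: MVN(Vv, V)
  V  <- solve(XX/sg + Binv)
  v  <- Xy/sg + Binv %*% b
  bg <- as.vector(V %*% v + t(chol(V)) %*% rnorm(p))

  # [sigma2 | beta]: IG(u1, u2), drawn as the inverse of a gamma variate
  u1 <- s1 + n/2
  u2 <- s2 + sum((y - X %*% bg)^2)/2
  sg <- 1/rgamma(1, shape = u1, rate = u2)

  chains[g, ] <- c(bg, sg)
}
postMean <- colMeans(chains[-(1:200), ])   # discard 200 steps as burn-in
postMean
```

The posterior means should land near the values used to simulate the data, within the uncertainty expected for a sample of this size.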
To see the setup that is used in jags, refer back to unit 2. For the regression example, I would
#jags example
To see how well we can recover parameters when they are known, here is a simulated data
set:
n <- 100                            # sample size
p <- 4                              # no. predictors
beta <- matrix( rnorm(p), p)        # coefficients
sigma <- .1                         # residual variance
x <- matrix( rnorm(n*p), n, p)      # design
x[,1] <- 1                          # intercept
mu <- x%*%beta
y <- rnorm(n, mu, sqrt(sigma) )
pairs(cbind(y,x[,-1]))
If I knew the residual variance, this would be my Bayesian estimate:
B <- diag(10000,p)                                  # prior covariance
b <- beta*0                                         # prior mean
V <- solve( 1/sigma*crossprod(x) + solve(B) )
v <- 1/sigma*crossprod(x,y) + solve(B)%*%b          # second term is zero here
betaHat <- V%*%v
betaSe <- sqrt( diag(V) )
coefficients <- signif( cbind(beta, betaHat, betaSe), 4)
colnames(coefficients) <- c('true', 'estimate', 'Se')
coefficients
##         true estimate      Se
## [1,] 0.27350   0.3340 0.03200
## [2,] 0.68000   0.6661 0.03347
## [3,] 0.44740   0.4949 0.03721
## [4,] 0.04942   0.0363 0.03312
For comparison, here’s the classical estimate:
summary( lm( y ~ x[,-1]) )$coefficients[,1:2]
##              Estimate Std. Error
## (Intercept) 0.3339715 0.03367717
## x[, -1]1    0.6660953 0.03522470
## x[, -1]2    0.4949326 0.03915431
## x[, -1]3    0.0363018 0.03484890
Now I want to sample the joint distribution of $[\boldsymbol{\beta}, \sigma^2]$. Here’s jags:
library(rjags)
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
file <- "lmSimulated.txt"
cat("model{
  # Likelihood
  for(i in 1:n){
    y[i] ~ dnorm(mu[i],precision)
    mu[i] <- inprod(beta[],x[i,])
  }
  for (i in 1:p) {
    beta[i] ~ dnorm(0, 1.0E-5)
  }
  # Prior for the inverse variance
  precision ~ dgamma(0.01, 0.01)
  sigma <- 1/precision
}", file = file)
Here is a function that sets up the posterior sampling:
model <- jags.model(file=file,
                    data = list(x = x, y = y, n=nrow(x), p=ncol(x)))
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 100
##    Unobserved stochastic nodes: 5
##    Total graph size: 713
##
## Initializing model
I start with 100 burnin iterations, then sample for 2000:
update(model, 100)
jagsLm <- coda.samples(model, variable.names=c("beta","sigma"),
                       n.iter=2000)
tmp <- summary(jagsLm)
print(tmp$statistics)
##               Mean         SD     Naive SE Time-series SE
## beta[1] 0.33539775 0.03402236 0.0007607631   0.0007607631
## beta[2] 0.66702382 0.03561230 0.0007963152   0.0008575067
## beta[3] 0.49500158 0.03941578 0.0008813636   0.0008813636
## beta[4] 0.03524711 0.03548611 0.0007934935   0.0008929009
## sigma   0.11349828 0.01673148 0.0003741272   0.0003929038
Here are plots:
plot(jagsLm)
—————————————— ===== ——————————————
Exercise in class Make an informative prior distribution for regression parameters. Then
compare the estimates you get with the non-informative prior. Do this analytically and with
jags.
#recap
Bayesian analysis requires some basic distribution theory to combine data and prior
information to generate a posterior distribution. Fundamental ways to parameterize
probability include densities (continuous), probability mass functions (discrete), and
cumulative distribution functions (both). The sample space defines allowable (non-zero
probability) values for a random variable. Integrating (continuous) or summing (discrete) over the
sample space gives a probability of 1.
Distributions have moments, which are expectations for integer powers of a random variable.
The first moment is the mean, and the second central moment is the variance. Higher moments
include skewness (asymmetry) and kurtosis (shoulders versus peak and tails).
A joint distribution can be factored into conditional and marginal distributions. A conditional
distribution assumes a specific value for the variable that is being conditioned on.
Marginalizing over a variable is done with the law of total probability. Bayes theorem relies on
a specific factorization giving a posterior distribution in terms of likelihood and prior.
R can be used to draw random values and to evaluate densities and probabilities. Binomial
and Bernoulli distributions apply to numbers of successes in $n$ or 1 trials, respectively.
The multivariate normal distribution is commonly used as a prior distribution. When
combined with a normal likelihood, the posterior mean can be found with the ‘Vv rule’.
#appendix
Here I provide a bit more detail on moments used in the beta-binomial example, the posterior
for regression parameters, and its connection to maximum likelihood estimates.
##moments
Moments describe the shape of a distribution. The mean of the distribution is the first
moment. The variance is the second central moment. The $m^{th}$ moment of a distribution for
$x$ is the expected value of $x^m$. For continuous variable $x$ having PDF $p(x)$ this is
$$E[x^m] = \int_{-\infty}^{\infty} x^m p(x)\,dx$$
Note that the zero moment = 1, the area under the PDF. For a discrete variable this is
$$E[x^m] = \sum_{k \in \mathcal{K}} k^m p(k)$$
Let $\mu = E[x^1]$ be the first moment. Then the $m^{th}$ central moment is
$$E[(x - \mu)^m] = \int_{-\infty}^{\infty} (x - \mu)^m p(x)\,dx$$
(continuous) and
$$E[(x - \mu)^m] = \sum_{k \in \mathcal{K}} (k - \mu)^m p(k)$$
(discrete). The variance is $E[(x - \mu)^2]$.
Moments also exist for a sample. In this case I can think of the discrete probability assigned to
each observation as $1/n$, where $n$ is the number of observations. Plugging this into the discrete
moment equation I have
$$\bar{x} = E[x] = \frac{1}{n} \sum_{i=1}^n x_i$$
for the sample mean and
$$var(x) = E[(x - \mu)^2] = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$
for the sample variance.
##Bayesian regression parameters
As for the example for the mean of the normal distribution, I apply the “big-V, small-v” method.
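For matrices the exponent of $MVN(\boldsymbol{\beta}|\mathbf{Vv}, \mathbf{V})$ is
$$-\frac{1}{2} (\boldsymbol{\beta} - \mathbf{Vv})' \mathbf{V}^{-1} (\boldsymbol{\beta} - \mathbf{Vv}) = -\frac{1}{2} (\boldsymbol{\beta}' \mathbf{V}^{-1} \boldsymbol{\beta} - 2\boldsymbol{\beta}'\mathbf{v} + \mathbf{v}'\mathbf{Vv})$$
As before I find $\mathbf{V}$ and $\mathbf{v}$ in the first two terms.
When I combine the regression likelihood with this prior distribution, I have an exponent on
the multivariate normal distribution that, omitting the factor $-1/2$, looks like this,
$$\frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 + (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b})$$
or like this,
$$\frac{1}{\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{b})' \mathbf{B}^{-1} (\boldsymbol{\beta} - \mathbf{b})$$
where $\mathbf{y}$ is the length-$n$ vector of responses, and $\mathbf{X}$ is the $n \times p$ design matrix.
Retaining only terms containing coefficients, I collect terms,
$$-2\boldsymbol{\beta}'(\sigma^{-2} \mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}) + \boldsymbol{\beta}'(\sigma^{-2} \mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})\boldsymbol{\beta}$$
I identify parameter vectors,
$$\begin{aligned}
\mathbf{V} &= (\sigma^{-2} \mathbf{X}'\mathbf{X} + \mathbf{B}^{-1})^{-1} \\
\mathbf{v} &= \sigma^{-2} \mathbf{X}'\mathbf{y} + \mathbf{B}^{-1}\mathbf{b}
\end{aligned}$$
These determine the posterior distribution.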
##connection to maximum likelihood
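Consider again the likelihood, now ignoring the prior distribution, having exponent
$$\log L \propto -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
To maximize the log likelihood I consider only these terms, because others do not contain
parameters. I differentiate once,
$$\frac{\partial \log L}{\partial \boldsymbol{\beta}} = \sigma^{-2} \mathbf{X}'\mathbf{y} - \sigma^{-2} \mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$
and again,
$$\frac{\partial^2 \log L}{\partial \boldsymbol{\beta}^2} = -\sigma^{-2} \mathbf{X}'\mathbf{X}$$
To obtain MLEs I set the first derivative equal to zero and solve,
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$$
The matrix of curvatures, or second derivatives, is related to Fisher Information and the
covariance of parameter estimates,
$$\mathbf{I} = -\frac{\partial^2 \log L}{\partial \boldsymbol{\beta}^2}$$
The covariance of parameter estimates is $\mathbf{I}^{-1}$.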
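These results can be checked numerically. The sketch below is my own illustration on simulated data, with hypothetical coefficient values and a residual variance treated as known. It compares the closed-form MLE with a numerical maximization by `optim`, and compares the inverse of the analytical information matrix with the inverse of the numerical Hessian of the negative log likelihood.

```r
set.seed(3)
n <- 200; sigma2 <- 0.25                          # sigma2 treated as known
X <- cbind(1, rnorm(n), rnorm(n))                 # design with intercept
y <- as.vector(X %*% c(0.5, 1, -1)) + rnorm(n, 0, sqrt(sigma2))

# closed-form MLE and covariance from the inverse Fisher information
betaHat <- solve(crossprod(X), crossprod(X, y))
covHat  <- solve(crossprod(X)/sigma2)

# numerical minimization of the negative log likelihood (terms with beta only)
negLogL <- function(beta) sum((y - X %*% beta)^2)/(2*sigma2)
fit <- optim(rep(0, 3), negLogL, method = 'BFGS', hessian = TRUE)

cbind(closed = as.vector(betaHat), optim = fit$par)
max(abs(solve(fit$hessian) - covHat))             # numerical vs analytical covariance
```

Because the log likelihood is quadratic in the coefficients, the numerical Hessian should match $\sigma^{-2}\mathbf{X}'\mathbf{X}$ closely, and the two sets of estimates should agree to several decimal places.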