Bayesian Hierarchical Models in Ecology
Steve Midway
2019-10-28
Contents

1 Background
  1.1 How to Use This Book
  1.2 Acknowledgments
  1.3 Motivation

2 The Model Matrix and Random Effects
  2.1 Linear Models
  2.2 Model Effects
  2.3 Hierarchical Models

3 Fundamentals of Bayesian Inference
  3.1 Models vs. Estimation
  3.2 Bayesian Basics
  3.3 Bayesian and Frequentist Comparison

4 Bayesian Machinery
  4.1 Bayes' Rule
  4.2 Priors: p(θ)
  4.3 Normalizing Constant: P(y)

5 Introduction to JAGS
  5.1 WinBUGS
  5.2 JAGS
  5.3 Convergence
  5.4 Additional Resources
  5.5 JAGS in R: Model of the Mean

6 Simple Models in JAGS
  6.1 Revisiting Hierarchical Structures
  6.2 Simple Linear Regression
  6.3 Varying Intercept Model
  6.4 Varying Slope Model

7 Varying Coefficients

8 Generalized Linear Models in JAGS
  8.1 Background to GLMs
  8.2 Components of a GLM
  8.3 Binomial Regression
  8.4 Poisson Regression

9 Plotting

10 Within-subjects Model

11 Hierarchical Models
Chapter 1
Background
Welcome to Bayesian Hierarchical Models in Ecology. This is an ebook that is
also serving as the course materials for a graduate class of the same name. There
will be numerous and ongoing changes to this book, so please check back. And don't hesitate to email me if you have questions, comments, or anything else.
To start, let's clarify the title of this text—it should really be Hierarchical Models in Ecology Using Bayesian Inference. A Bayesian hierarchical model is more a term of convenience than of accuracy, as hierarchical models need not be Bayesian and Bayesian models can take many forms. However, hierarchical models and Bayesian inference do work together very nicely, as you will see, and so hopefully the title is not too misleading. I have dedicated several parts of this book to differentiating these terms and concepts, while also making them as useful as possible.
1.1 How to Use This Book
You are welcome to use this book in any way you find it useful. The contents
are really a mashup of conceptual descriptions, lecture notes, enumerated and
itemized lists, images, code, analysis, models, output, plots, and other things
that I wanted to include. The only real motivation behind the content and
organization is that it has been useful for some students to learn, and so I
have tried to adopt the best and most effective formats while revising others.
Bookdown has provided such freedom in creating content, and perhaps I have
veered too far from traditional formats.
The document structure (e.g., chapters, sections, etc.) should be logical enough
to skip around, if you prefer. I rely somewhat heavily on quotes and more
heavily on code, which are formatted with their own colored boxes.
# R code or JAGS code is in gray boxes.
Quotes are in colored boxes.
1.2 Acknowledgments
In creating this course and format, I want to first acknowledge Ty Wagner.
In reality, Ty is the co-author of this text. Ty taught me most of what I
know about Bayesian hierarchical models, and for a few years he and I taught a multiday Bayesian hierarchical models workshop from which much of this course material derives. I would also like to thank Yihui Xie for all his work in
developing numerous R packages, especially the bookdown package (Xie, 2019)
that has enabled the creation of this book.
1.3 Motivation
The computer scientist Alan Kay is known for this quote:
People who are really serious about software should make their own
hardware. –Alan Kay, 1982
I have always liked this quote and have chosen to adapt it for how I think about
data analysis. My adaptation of this quote also serves to describe why I have
undertaken learning statistical models and the general approach I advocate for
students and other data analysts.
People who are really serious about their analyses should code their
own models. –Midway, 2018
Chapter 2
The Model Matrix and Random Effects
2.1 Linear Models
Let’s start with reviewing some linear modeling terminology.
• units–observations; i; data points
• x variable–predictor(s); independent variables
• y variable–outcome, response; dependent variable
• inputs–model terms; ≠ predictors; inputs ≥ predictors
• random variables–outcomes that include chance; often described with a probability distribution
• multiple regression–regression with > 1 predictor; not multivariate regression (MANOVA, PCA, etc.)
The general components of a linear model can be thought of as any of the
following:
response = deterministic + stochastic
outcome = linear predictor + error
result = systematic + chance
• The stochastic component is what makes these models statistical
• Explanatory variables whose effects add together
• Not necessarily a straight line!
2.1.1 Stochastic Component
• Nature is stochastic; nature adds error, variability, chance
• Always a reason, we might just not know it
Figure 2.1: Although hurricanes are not statistical models, they are a good
example of understanding something that cannot perfectly be modeled, and
therefore has some stochasticity inherent to it.
Figure 2.2: Examples of different statistical distributions as seen by their characteristic shapes.
• Might not be important and including it risks over-parameterization
• Combined effect of unobserved factors captured in probability distribution
• Often (default) is Normal distribution, but others common
In a linear model, we need to identify a distribution that we assume is the data-generating distribution underlying the process we are attempting to model. Not only is this distribution important in describing what we are modeling, but it is also critically important because it will accommodate the errors that arise when we model the data. So how do we know which distribution we need for a given model?
• We can know possible distributions
• We can know how the data were sampled
• We can know characteristics of the data
• We can run a model, evaluate it, and try another distribution
• The correct distribution is not always a yes/no answer
2.1.2 Distributions
Although many distributions are available, we will review three.
1. Normal distribution: continuous, −∞ to ∞
2. Binomial distribution: discrete, integers, 0 to N
3. Poisson distribution: discrete, integers, 0 to ∞
The Normal distribution
The normal distribution is the most common distribution in linear modeling and possibly the most common in nature. In a normal distribution, measurements are affected by a large number of additive effects (the Central Limit Theorem). When effects are multiplicative, the result is a log-normal distribution. There are 2 parameters that govern the normal distribution: µ = mean and σ² = variance.
A normal distribution can be easily simulated in R.
rnorm(n = 10, mean = 0, sd = 1)
##  [1]  1.2299399  1.2514092  0.9777302  0.7566882  1.1019663  1.4762023
##  [7]  1.4565016 -0.2789666 -0.4747550  0.8731642
The Binomial distribution
The binomial distribution always concerns the number of successes (i.e., a specific outcome) out of a total number of trials. A single trial is a special case of the binomial and is called a Bernoulli trial. A single coin flip is the classic Bernoulli trial. A binomial distribution might be thought of as the sum of some number of Bernoulli trials. A binomial distribution has 2 parameters: p = success probability and N = sample size (although N = 1 in a Bernoulli trial, which means a Bernoulli trial can be thought of as a 1-parameter distribution). N is the upper limit, which differentiates a binomial distribution from a Poisson distribution. The binomial distribution mean is a derived quantity, where µ = N × p. The variance is N × p × (1 − p); therefore, the variance is a function of the mean.
A binomial distribution can be simulated in R.
rbinom(n = 10, size = 5, prob = 0.3) # size = trials
## [1] 0 0 1 2 3 3 1 4 2 0
The Poisson distribution
The Poisson distribution is the classic distribution for (integer) counts (e.g., things in a plot, things that pass a point, etc.). The Poisson distribution approximates the binomial when N is large and p is small, and it approximates the normal distribution when λ is large. The Poisson distribution has 1 parameter: λ = mean = variance. This distribution can be modified for zero-inflated data (ZIP) and other distributional anomalies.
A Poisson distribution can be easily simulated in R.
Table 2.1: Replicated data from Kery (2010).

mass  pop  region  hab  svl
   6    1       1    1   40
   8    1       1    2   45
   5    2       1    3   39
   7    2       1    1   50
   9    3       2    2   52
  11    3       2    3   57
rpois(n = 10, lambda = 3)
## [1] 4 3 2 2 2 2 6 3 4 3
2.1.3 Linear Component
Recall that the linear contribution to our model includes the predictors (explanatory variables) with additive effects. Although this is a linear model, it need not be thought of literally as a straight line. The linear component can accommodate both continuous and categorical data. Conceptually,
linear predictor = design matrix + parameterization
R takes care of both the design matrix and the parameterization, but this is not
always true in JAGS, so it is worth a review.
Marc Kery has succinctly summarized the design matrix:
“For each element of the response vector, the design matrix indicates which effect is present for categorical (discrete) explanatory variables and what amount of an effect is present in the case of continuous explanatory variables.” (Kéry, 2010)
In other words, the design matrix is a matrix that tells the model if and where an effect is present, and how much. The number of columns in the matrix is equal to the number of fitted parameters. Ultimately, the design matrix is
multiplied with the parameter vector to produce the linear predictor. Let’s use
the simulated data in Chapter 6 of Marc Kery’s book as an example (Kéry,
2010).
If you would like to code this dataset into R and play along, use the code below.
mass <- c(6,8,5,7,9,11)
pop <- c(1,1,2,2,3,3)
region <- c(1,1,1,1,2,2)
hab <- c(1,2,3,1,2,3)
svl <- c(40,45,39,50,52,57)
If we think about a model of the mean, it might look like this:
$$\text{mass}_i = \mu + \epsilon_i$$
This model has a covariate of one for every observation, because every observation has a mass. This model is also sometimes referred to as an intercept
only model. The model describes individual snake mass as a (global) mean
plus individual variation (deviation or “error”).
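To see this model in R, a quick sketch (my own illustration, not code from Kery's text) is to fit an intercept-only linear model to the snake data entered above:

lm(mass ~ 1)   # intercept-only model: the estimated intercept is the overall mean mass
mean(mass)     # the sample mean, which matches the intercept above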
Increasing model complexity, we might hypothesize that snake mass is predicted by region. Because region is a categorical variable (although it is input as numeric, we don't assume any ordering or relationship among the region numbers), the model we are interested in is a t-test. A t-test is just a specific case of a linear model. To represent the t-test, we can write the model equation as:
$$\text{mass}_i = \alpha + \beta \times \text{region}_i + \epsilon_i$$
This model considers snake mass to be made up of these components: the mean snake mass, the region effect, and the residual error. Now is also a good time to start thinking about the residuals, which we assume are normally distributed with a mean of 0 and expressed as
$$\epsilon_i \sim N(0, \sigma^2)$$
Let's pause for a moment on the t-test and think about how we write models. You are familiar with the notation I have used thus far, where the model and residual error are expressed as two separate equations. What if we were to combine them into one equation? How would that look?
$$\text{mass}_i \sim N(\alpha + \beta \times \text{region}_i, \sigma^2)$$
This notation may take a little while to get used to, but it should make sense. This expression is basically saying that snake mass is thought to be normally distributed with a mean that is a function of the mean mass plus the region effect, and with some residual error. Expressing models with distributional notation may seem odd at first, but it may help you better understand how distributions are built into the processes we are modeling, and what part of the process belongs to which part of the distribution. We will stick with this notation (not exclusively) for much of this course.
Back to the model matrix—how might we visualize the model matrix in R?
model.matrix(mass ~ region)
##   (Intercept) region
## 1           1      1
## 2           1      1
## 3           1      1
## 4           1      1
## 5           1      2
## 6           1      2
## attr(,"assign")
## [1] 0 1
But we said earlier that the regions were not inherently quantities, just categories or indicator variables. Above, R is taking the region variable as numeric because it does not know any better. Let's tell R that region is a factor.
model.matrix(mass ~ as.factor(region))
##   (Intercept) as.factor(region)2
## 1           1                  0
## 2           1                  0
## 3           1                  0
## 4           1                  0
## 5           1                  1
## 6           1                  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`as.factor(region)`
## [1] "contr.treatment"
The model matrix yields a system of equations. And for the matrix above, the
system of equations we would have can be expressed 2 ways. The first way
shows the equation for each observation.
$$6 = \alpha \times 1 + \beta \times 0 + \epsilon_1$$
$$8 = \alpha \times 1 + \beta \times 0 + \epsilon_2$$
$$5 = \alpha \times 1 + \beta \times 0 + \epsilon_3$$
$$7 = \alpha \times 1 + \beta \times 0 + \epsilon_4$$
$$9 = \alpha \times 1 + \beta \times 1 + \epsilon_5$$
$$11 = \alpha \times 1 + \beta \times 1 + \epsilon_6$$
The second way adopts matrix (vector) notation to economize the system.
$$\begin{pmatrix} 6 \\ 8 \\ 5 \\ 7 \\ 9 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \end{pmatrix} \times \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}$$
In both cases the design matrix and parameter vector are combined. And using
least-squares estimation to estimate the parameters will result in the best fits
for α and β.
2.1.4 Parameterization
Although parameters are inherent to specific models, there are cases in the
linear model where parameters can be represented in different ways in the design matrix, and these different representations will have different interpretations. Typically, we refer to either a means or effects parameterization. These
two parameterizations really only come into play when you have a categorical
variable—the representation of continuous variables in the model matrix includes the quantity or magnitude of that variable (and defaults to an effects
parameterization). A means parameterization may be present or needed when
categorical variables are present. Let’s take the above t-test example with 2
regions. As we notated the model above, α = µ for region 1, and β represented
the difference in expected mass in region 2 compared to region 1. In other
words, β is an estimate of the effect of being in region 2. As you guessed, this
is an effects parameterization because region 1 is considered a baseline or reference level, and other levels are compared to this reference, hence the coefficients
represent their effect. Effects parameterization is the default in R (e.g., lm()).
In a means parameterization, α has the same interpretation—the mean for the first group. However, in a means parameterization, all other coefficients represent the means of those groups, which is actually a simpler interpretation than the effects parameterization. Going back to our t-test, in a means parameterization β would represent the mean mass for snakes in region 2. Thus, no addition or subtraction of effects is required in a means parameterization. Both the means and effects parameterization will yield the same estimates (with or without some simple math), but it is advised to know which parameterization you are using. Additionally, as you work in JAGS you may need to use one parameterization over another, and understanding how they work and are interpreted is critically important.
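As a quick sketch of the two parameterizations in R (my own illustration, using the snake data from above), the default formula gives the effects parameterization, and dropping the intercept gives the means parameterization:

# Effects parameterization (R default): the intercept is the region 1 mean,
# and the coefficient is the difference of region 2 from region 1
lm(mass ~ as.factor(region))

# Means parameterization: removing the intercept gives one coefficient per
# region, each estimating that region's mean directly
lm(mass ~ as.factor(region) - 1)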
Let’s look at one more model using the snake data. A simple linear regression
will fit one continuous covariate (x) as a predictor on y. Let’s look at the model
for the snout-vent length (svl) as the predictor on mass.
$$\text{mass}_i = \alpha + \beta \times \text{svl}_i + \epsilon_i$$
The model matrix for this model (an effects parameterization) can be reported as follows.
model.matrix(mass ~ svl)
##   (Intercept) svl
## 1           1  40
## 2           1  45
## 3           1  39
## 4           1  50
## 5           1  52
## 6           1  57
## attr(,"assign")
## [1] 0 1
And we see that each snake has an intercept (mean) and an effect of svl.
These examples using the fake snake data were taken from Kery (2010), in which
he provides more context and examples that you are encouraged to review.
2.2 Model Effects
To understand hierarchical models, we need to understand random effects, and
to understand random effects, we need to understand variance.
1. What is variance?
2. How do we deal with variance?
3. What can random effects do?
4. How do I know if I need random effects?
5. What is a hierarchical model? (later)
Variance was historically seen as noise or a distraction—it is the part of the data that the model does not capture well. At times it has been called a nuisance parameter. The traditional view was always that less variance was better. However:
“Twentieth-century science came to view the universe as statistical rather than as clockwork, and variation as signal rather than as noise…” (Leslie, 2003)
The good news is that there is (often) information in variance. Two problems,
however, are that our models don’t always account for the information connected
to variance, and that variance is often ubiquitous and can occur at numerous
places within the model. For instance, variance can occur among observations
and within observations, along with how we parameterize our models and how
we take our measurements.
One way to deal with variance concerns how we treat the factors in our model.
(Recall that a factor is a categorical predictor that has 2 or more levels.) Specifically, we can treat our factors as fixed or random, and the underlying difference lies in how the factor levels interact with each other.
2.2.1 Fixed Effects
Fixed effects are those model effects that are thought to be separate, independent, and otherwise not similar to each other.
• Likely more common—at least in use—than random effects, if only for the fact that they are a statistical default in most statistical software.
• Treat the factor levels as separate levels with no assumed relationship between them. For example, a fixed effect of sex might include two factor levels (male and female) that are assumed to have no relationship between them, other than that they are different types of sex.
• Without any relationship, we cannot infer what might exist or happen
between levels, even when it might be obvious.
• Fixed effects are also homoscedastic, which means they assume a common
variance.
• If you use fixed effects, you would likely need some type of post hoc means
comparison (including adjustment) to compare the factor levels.
2.2.2 Random Effects
Random effects are those model effects that can be thought of as units from a
distribution, almost like a random variable.
• Random effects are less commonly used, but perhaps more commonly
encountered (than fixed effects).
• Each factor level can be thought of as a random variable from an underlying process or distribution.
• Estimation provides inference about specific levels, but also the
population-level (and thus absent levels!)
• Exchangeability
• If n is large, factor level estimates are the same or similar to fixed effect
estimates.
• If n is small, factor level estimates draw information from the population-level information.
Comparing Effects Consider that A is a fixed effect and B is a random effect, each with 5 levels. For A, inferences and estimates for the levels are applicable only to those 5 levels. We cannot assume to know what a 6th level would look like, or what a level between two levels might look like. Contrast that to B, where the 5 levels are assumed to represent an infinite (or at least much larger) number of levels, and our inferences therefore extend beyond the data (i.e., we can use estimates and inferences predictively to interpolate and extrapolate).
When are Random Effects Appropriate?
You can find your own support for what you want to do, which means modeling freedom, but it also means you should be prepared to defend your model structure.
“The statistical literature is full of confusing and contradictory advice.” (Gelman and Hill, 2006)
• You can probably find a reference to support your desire
• Personal preference is Gelman and Hill (2006): “You should always use
random effects.” (use = consider)
– Several reasons why random effects extract more information from your data and model
– Built-in safety: with no real group-level information, effects revert
back to a fixed effects model (essentially)
• Know that fixed effects are a default, which does not make them right
Random effects may not be appropriate when…
• The number of factor levels is very low (e.g., 2); it's hard to estimate distributional parameters with so little information (but there is still little risk).
• Or when you don’t want factor levels to inform each other.
• e.g., male and female growth (combined estimate could be meaningless
and misleading)
Summarizing Kery 2010:
“…as soon as we assume effects are random, we introduce a hierarchy
of effects.”
1. Scope of Inference: understand levels not in the model
2. Assessment of Variability: can model different kinds of variability
3. Partitioning of Variability: can add covariates to variability
4. Modeling of Correlations among Parameters: helps understand correlated model terms
5. Accounting for all Random Processes in a Modeled System: acknowledges within-group variability
6. Avoiding Pseudoreplication: better system description
7. Borrowing Strength: weaker levels draw from population effect
8. Compromise between Pooling and No Pooling (Batch Effects)
9. Combining information
Do I need a random effect (hierarchical model)?
If you answer Yes to any of these, consider random effects.
1. Can the factor(s) be viewed as a random sample from a probability distribution?
2. Does the intended scope of inference extend beyond the levels of a factor
included in the current analysis to the entire population of a factor?
3. Are the coefficients of a given factor going to be modeled?
4. Is there a lack of statistical independence due to multiple observations from the same level within a factor over space and/or time?
Random Effects Equation Notation There will be a much more extensive treatment of model notation, and you will need to become familiar with notation to successfully code models; however, now is as good a time as any to introduce some basics of random effects statistical notation. (Note that statistical notation and code notation are different, but which is meant may be unspecified and unclear when referring to model notation.)
• SLR, no RE: $y_i = \alpha + \beta \times x_i + \epsilon_i$
• SLR, random intercept, fixed slope: $y_i = \alpha_j + \beta \times x_i + \epsilon_i$
• SLR, fixed intercept, random slope: $y_i = \alpha + \beta_j \times x_i + \epsilon_i$
• SLR, random coefficients: $y_i = \alpha_j + \beta_j \times x_i + \epsilon_i$
• MLR, random slopes: $y_i = \alpha + \beta_{j1} \times x_{1i} + \beta_{j2} \times x_{2i} + \epsilon_i$
For the most part, we will use subscript i to index the observation-level (data)
and subscript j to index groups (and indicate a random effect).
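To connect this notation to a familiar frequentist tool before we move to JAGS, here is a hedged sketch of a varying (random) intercept model using the lme4 package; lme4 is not used elsewhere in this book, and with only six snake observations the fit is purely illustrative:

library(lme4)
# Random intercept for each population (the j subscript in the notation above),
# with a fixed slope for the observation-level predictor svl
lmer(mass ~ svl + (1 | pop))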
2.3 Hierarchical Models
Because random effects are used to model variation at different levels of the data, they add a hierarchical structure to the model. Let's start with non-hierarchical models and review how data inform parameters. Consider a simple model with no hierarchical structure (Figure 2.3). The observations ($y_1, \ldots, y_n$) all inform one parameter, and the data are said to be pooled.
Perhaps we want to group the data in some logical way. We might add a
fixed effect, and then assign certain observations to different parameters (Figure
2.4). In this case, our parameters, θ are subscripted to indicate that they are
different levels within a factor. Despite this apparent connection, the data and
the parameters inherent to one group inform only that group, and there is said
to be no pooling or sharing of information.
Hierarchical models are a middle ground or compromise approach to dealing
with data and parameters. Often we want groups or factor levels in our data
Figure 2.3: Diagrammatic representation of complete pooling, a model structure in which all observations inform one parameter.
Figure 2.4: Diagrammatic representation of no pooling, a model structure in which separate observations inform separate parameters.
Figure 2.5: Diagrammatic representation of partial pooling, a model structure in which different observations inform different latent parameters, which are then governed by additional parameters.
because they represent real differences in the information we are trying to understand. However, often the groups that we want to include are not completely
independent of each other. In cases like these, we can structure a model where
the data inform factor level parameters, but those factor level parameters are
then governed by some additional parameter (Figure 2.5).
Example A simple example of rationalizing a hierarchical model might be modeling the size of birds. Our y values would be some measure of bird size. But we know that different species inherently attain different sizes (i.e., observations are not independent, as any two size measurements from the same species are much more likely to be similar to each other than to a measurement from an individual of another species). Therefore it makes sense to group the size observations by species, θ_j. However, we also know that there may still be relatedness among species, and as such some additional parameter, ϕ, can be added to govern the collection of species parameters. (To continue with the example, ϕ might represent a taxonomic order.)
Once you understand some examples of how hierarchical structures reflect the
ecological systems we seek to model, you will find hierarchical models to be an
often realistic representation of the system or process you want to understand—
or at least more realistic than the traditional model representations.
2.3.1 Definitions of Hierarchical Models
There is a lot of terminology.
From Raudenbush and Bryk (2002)
• multilevel models in social research
• mixed-effects and random-effects models in biometric research
• random-coefficient regression models in econometrics
• variance components models in statistics
• hierarchical models (Lindley and Smith, 1972)
Some useful definitions are:
“…hierarchical structure, in which response variables are measured at the lowest level of the hierarchy and modeled as a function of predictor variables measured at that level and higher levels of the hierarchy.” (Wagner et al., 2006)
“Multilevel (hierarchical) modeling is a generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data.” (Gelman and Hill, 2006)
“The existence of the process model is central to the hierarchical modeling view. We recognize two basic types of hierarchical models. First is the hierarchical model that contains an explicit process model, which describes realizations of an actual ecological process (e.g., abundance or occurrence) in terms of explicit biological quantities. The second type is a hierarchical model containing an implicit process model. The implicit process model is commonly represented by a collection of random effects that are often spatially and or temporally indexed. Usually the implicit process model serves as a surrogate for a real ecological process, but one that is difficult to characterize or poorly informed by the data (or not at all).” (Royle and Dorazio, 2008)
“However, the term, hierarchical model, by itself is about as informative about the actual model used as it would be to say that I am using a four-wheeled vehicle for locomotion; there are golf carts, quads, Smartcars, Volkswagen, and Mercedes, for instance, and they all have four wheels. Similarly, plenty of statistical models applied in ecology have at least one set of random effects (beyond the residual) and therefore represent a hierarchical model. Hence, the term is not very informative about the particular model adopted for inference about an animal population.” (Kéry and Schaub, 2011)
Chapter 3
Fundamentals of Bayesian Inference
3.1 Models vs. Estimation
Need text on actual difference between models and estimation.
Observations are a function of observable and unobservable influences. We can
think about the observable influences as data and the unobservable influences as
parameters. But even with this breakdown of influences, most systems are too
complex to look at and understand. Consider the effects of time, space, unknown
factors, interactions of factors, and other things that obscure relationships. One
approach is to start with a simple model that we might know is wrong (i.e., incomplete), but which can be known and understood. Any model is necessarily a formal simplification of a more complex system, and no model is perfect, yet models can still be useful. By starting with a simple model we can add complexity as we understand it and hypothesize the mechanisms, rather than trying to start with a complex model that might be hard to understand and work with.
A lot of good things have been said about models, including:
“All models are wrong, but some are useful” -George Box
“There has never been a straight line nor a Normal distribution in
history, and yet, using assumptions of linearity and normality allows,
to a good approximation, to understand and predict a huge number
of observations.” -W.J. Youden
“Nothing is gained if you replace a world that you don’t understand
with a model that you don’t understand.” -John Maynard Smith
3.1.1 Model Building
Most model selection procedures attempt to balance model generalizability and specificity, and in putting a model together this balance is also important. Consider whether you seek prediction or understanding; although they
may overlap they do have differences. Model (system) understanding may not
predict well, but can help to explain drivers of a system. On the other hand
model prediction may not explain well, but the value is in performance. Often,
prediction and understanding are not exclusive and thought needs to be given
to the balance of both in a model. Let’s consider prediction and understanding
a little more.
Explanation (understanding)
• Emphasis on understanding a system
• Often simpler models (but not always!)
• Strong focus on parameters, which represent your hypothesis of the system
• Think: Causes of effects
Prediction
• Focus on fitting y
• Often results in more complex models (but not always!)
• Think: Effects of causes
3.1.2 Case Study: Explanation vs Prediction
Reproductive biological work on Southern Flounder (Paralichthys lethostigma) was conducted to determine predictors associated with oocyte development and expected spawning. Because the species, like many other fish species, is not observed on the spawning grounds, all information about maturity has to be collected before fully developed, spawning-capable individuals are available. A wide range of predictors was quantified to examine their correlation with histological samples of ovarian tissue. In addition to identifying reliable predictors, value was placed on simplicity—the fewer predictors needed, the more useful the model would be.
AIC was able to determine a best-fitting model; however, there were several
competing models all of which tended to have a large number of predictors.
Upon closer evaluation, it appeared that a large number of models all performed
relatively well, when compared with each other. This cloud of model points
warranted further investigation.
When evaluating model performance based on cross-validation, a large number of simpler models were found to be effective. These results highlight that for this particular dataset, system understanding was optimized by AIC, while system prediction was optimized by cross-validation.
Figure 3.1: Table of best-fitting models as determined by AIC (Midway et al.
2013).
Figure 3.2:
Figure 3.3:
3.1.3 Models vs Estimation
Consider a simple linear regression model: $y_i = \alpha + \beta \times x_i + \epsilon_i$. How might we come up with estimates of the model parameters, namely α and β? To do this, we need an estimation routine, and there is more than one to choose from. It's important to know that the estimation routine we select will have different operating assumptions, different underlying machinery, and may produce different results (estimates). That being said, for simple models and clear data,
different estimation routines may result in very similar outcomes. Regardless,
it remains very important to remember that both models and estimation are
independent components of statistics. We might all agree on a model, but not
the estimation (or the opposite). The linkage between models and estimation is
often the parameters; parameters are what define a model, and parameters can
be estimated by different methods.
“…there is no ‘Bayesian so-and-so model’.” (Kéry and Schaub, 2011)
3.1.4 What is a parameter?
Parameters are system descriptors. Think of a parameter as something that underlies and influences a population, whereas a statistic does the same for a sample. For example, a population growth rate parameter, λ, may describe the rate of change in the size of a population, whereas some difference statistic, $N_d$, may describe the difference in the size of the population over some time interval. In addition to making sure we know the parameters—and their configuration—that serve as hypotheses for systems and processes, the interpretation of parameters is also at the foundation of different statistical estimation procedures and philosophies.

Figure 3.4: An attempt at humor while illustrating different statistical philosophies.

Figure 3.5: The Reverend Thomas Bayes
3.2 Bayesian Basics
3.2.1 Why learn Bayesian estimation?
“Our answer is that the Bayesian approach is attractive because it is
useful. Its usefulness derives in large measure from its simplicity. Its
simplicity allows the investigation of far more complex models than
can be handled by the tools in the classical toolbox.” (Link and Barker, 2009)
“In order to understand and command complexity, you need to revisit simplicity—and when you go back to basics, you gain deeper understanding.” –Midway
Thomas Bayes was a Presbyterian minister who is thought to have been born in 1701 and died in 1761. He posthumously published An Essay towards solving
a Problem in the Doctrine of Chances in 1763, in which he introduced his
approach to probability. Bayesian approaches were never really practiced in Bayes' lifetime—not only was his work not published until after he died, but his work was not developed and popularized until Pierre-Simon Laplace (1749–1827)
did so in the early 19th century. (Fun fact according to Wikipedia: His [Laplace]
brain was removed by his physician, François Magendie, and kept for many years,
eventually being displayed in a roving anatomical museum in Britain. It was
reportedly smaller than the average brain.)
Bayes’ rule, in his words:
Given the number of times in which an unknown event has happened and failed:
Required the chance that the probability of its happening in a single trial lies
somewhere between any two degrees of probability that can be named.
• “unknown event” = e.g., Bernoulli trial
• “probability of its happening in a single trial” = p
• We may know ahead of time, or not
Consider an example where 10 chicks hatch and 4 survive. Bayes attempts to draw a conclusion, such as “The probability that p (survival) is between 0.23 and 0.59 is 0.80.” The two degrees of probability are an interval: $Pr(a \leq p \leq b)$. The overall idea is similar to a confidence interval in trying to account for and reduce uncertainty—but confidence intervals are not probabilistic, so they are not the same.
A key distinction between Bayesians and Frequentists is how uncertainty regarding a parameter θ is treated. Frequentists view parameters as fixed, and probability is the long-run frequency of events in hypothetical repeated datasets. The result is that probability statements are made about the data—not the parameters! A Frequentist could never state: “I am 95% certain that this population is declining.” (Note: in order to learn about Bayesian approaches from a practical standpoint, we will often consider them against the Frequentist approach for comparison.)
But for a Bayesian, probability is the belief that a parameter takes a specific
value. “Probability is the sole measure of uncertainty about all unknown quantities: parameters, unobservables, missing or mis-measured values, future or
unobserved predictions” (Kéry and Schaub, 2011). When everything is a probability, we can use mathematical laws of probability. One way to get started is
to think of parameters as random variables (but technically they are not.)
3.2.2 Bayesian vs. Frequentist Comparison
1. Both start with data distribution (DGF)
2. Data, y, is a function of parameter(s), θ
3. Example: $p(y|\theta) \sim Pois(\theta)$, which is often abbreviated to $y|\theta \sim Pois(\theta)$ or $y \sim Pois(\theta)$
4. The Frequentist then uses the likelihood function to interpret the distribution of the observed data as a function of unknown parameters, L(θ|y). But likelihoods do not integrate to 1, and are therefore not probabilistic
5. Frequentists estimate a single point, the maximum likelihood estimate
(MLE), which represents the parameter value that maximizes the chance
of getting the observed data
6. see Kéry and Schaub (2011) for extended example of MLE
3.2.3 Bayesians use Bayes' Rule for inference
$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$
Bayes' Rule is a mathematical expression of the simple relationship between conditional and marginal probabilities.
Example of Bayes' Rule Bird watching (B) or watching football (F), depending on good weather (g) or bad weather (b) on a given day.
We know:
1. P(g, B) = 0.5 (joint)
2. P(g) = 0.6 (marginal)
3. P(B) = 0.7 (marginal)
If you are watching football, what is the best guess as to the weather?
                      Good Weather (g)   Bad Weather (b)   Total
Bird watching (B)           0.5                0.2          0.7
Watch Football (F)          0.1                0.2          0.3
Total                       0.6                0.4          1.0
Rephrased, we are asking for p(b|F).
We know:
1. p(b, F) = 0.2
2. p(F) = 0.3
3. p(b|F) = p(b, F) / p(F) = 0.2 / 0.3 ≈ 0.67
The marginal probability of bad weather is 0.4, but knowing that football is more likely in bad weather, conditioning on football increased our guess to 0.67.
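The same arithmetic can be checked in a couple of lines of R (a trivial sketch using the table values above):

p_bF <- 0.2    # joint probability of bad weather and football
p_F  <- 0.3    # marginal probability of football
p_bF / p_F     # p(b|F) is about 0.67, up from the marginal p(b) of 0.4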
3.2.4 Breaking down Bayes' Rule
$$P(\theta|y) = \frac{P(y|\theta) \times P(\theta)}{P(y)}$$
where
1. P(θ|y) = posterior distribution
2. P(y|θ) = likelihood function
3. P(θ) = prior distribution
4. P(y) = normalizing constant
Another way to think about Bayes’ Rule
P (θ|y) ∝ P (y|θ) × P (θ)
Combine the information about parameters contained in the data y, quantified
in the likelihood function, with what is known or assumed about the parameter
before the data are collected and analyzed, the prior distribution.
3.3 Bayesian and Frequentist Comparison
3.3.1 Example with Data
Consider this simple dataset of two groups each with 8 measurements.
y <- c(-0.5, 0.1, 1.2, 1.2, 1.2, 1.85, 2.45, 3.0, -1.8, -1.8, -0.4, 0, 0, 0.5, 1.2, 1.8)
g <- c(rep('g1', 8), rep('g2', 8))
We might want to consider a t-test to compare the groups (means). In this case,
we are estimating 5 parameters from 16 data points.
t.test(y ~ g)
##
##  Welch Two Sample t-test
##
## data:  y by g
## t = 2.2582, df = 13.828, p-value = 0.04064
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06751506 2.68248494
## sample estimates:
## mean in group g1 mean in group g2
##           1.3125          -0.0625
Our t-test results in a p-value of 0.0406, which is less than the traditional cutoff of 0.05, allowing us to conclude that the group means are significantly different. At this point in our study, we would write up our model results and likely conclude the analysis. The frequentist t-test is simple, but limited.
Let’s try a Bayesian t-test.
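The exact code used to produce the results below is not shown here, but a minimal sketch of a Bayesian t-test (a preview of the JAGS workflow covered in Chapter 5, assuming the R2jags package) gives each group its own mean and standard deviation, and tracks the difference in means as a derived quantity:

library(R2jags)
cat("
model {
  for (i in 1:n) {
    y[i] ~ dnorm(mu[group[i]], tau[group[i]])   # each observation uses its group's mean and precision
  }
  for (j in 1:2) {
    mu[j] ~ dnorm(0, 0.001)       # vague priors on the group means
    sigma[j] ~ dunif(0, 10)       # group-specific standard deviations
    tau[j] <- pow(sigma[j], -2)
  }
  diff <- mu[1] - mu[2]           # derived quantity: difference between group means
}
", file = "bayes_ttest.txt")

fit <- jags(data = list(y = y, group = as.numeric(factor(g)), n = length(y)),
            parameters.to.save = c("mu", "sigma", "diff"),
            model.file = "bayes_ttest.txt",
            n.chains = 3, n.iter = 10000, n.burnin = 2000)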
Here we see a posterior distribution of our results. The mean difference between groups was estimated to be 1.37, and the 95% credible interval around
that difference is reported as -0.248 to 3.01. A credible interval represents the
percentage (95% in this case) of symmetric area under the distribution that
represents the uncertainty in our point estimate. The percentage can also be
used to interpret the chance that the interval has captured the true parameter.
So in this example, we can say that we are 95% certain that the true difference
in means lies between -0.248 and 3.01, with our best guess being 1.37. Another
thing we can use credible intervals for is “significance” interpretation. If a credible interval overlaps 0, it can be interpreted that 0 is within the range of possible estimates and therefore there is no significant difference. (If 0 were outside our 95% CI, we would have evidence that 0 is not a very likely estimate and therefore the group means are “statistically significant.”) It's worth noting that the concept of statistical significance may be alive and well in Bayesian estimation; however, you need to define how the term is used because there is no a priori significance level as there is in frequentist routines.
Consider also that the frequentist t-test found a significant difference and the
Bayesian t-test did not find a difference. In reality, with simple (and even
more complex) datasets, both types of estimation will often arrive at the same
answer. This example was generated to show that there is the possibility to
reach different significance conclusions using the same data. Perhaps what is
more important in this example is to ask yourself if two types of estimation
reach different conclusions about the data, which are you more likely to trust—
the procedure that provides little information and spits out a yes or no, or the procedure that is fully tractable and provides results and estimates for all parts of the model?
But wait—there’s more! Because Bayesian estimation assumes underlying distributions for everything, we get richer results. All estimated parameters have
their own posterior distributions, even the group-specific standard deviations.
Despite the benefits illustrated in this example, it's important to know that the Bayesian approach is not inherently the correct way to approach estimation. In
fact, we cannot know which method is correct. All we can know is that we are
presented with a choice as to which method provides richer information from
which we can base a decision.
3.3.2 More differences
Frequentists: p(data|θ)
• Data are repeatable, random variables
• Inference from (long-run) frequency of data
• Parameters do not change (the world is fixed and unknowable)
• All experiments are (statistically) independent
• Results driven by point estimates
• Null hypothesis emphasis; accept or reject
Bayesians: p(θ|data)
• The data are fixed (the data are what we know)
• Parameters are random variables to estimate
• Degree-of-belief interpretation from probability
• We can update our beliefs; analyses need not be independent
• Driven by distributions and uncertainty
3.3.3 Put another way
The Frequentist asks: The world is fixed and unchanging; therefore, given a certain parameter value, how likely am I to observe data that support that parameter value?
The Bayesian asks: The only thing I can know about the changing world is what I can observe; therefore, based on my data, what are the most likely parameter values I could infer?
Given the strengths and weaknesses of both types of estimation, why did the Frequentist approach dominate for so long, even to the present? The best explanation includes several reasons. Frequentist routines are computationally simple compared to Bayesian approaches, which has permitted them to be formalized into point-and-click routines that are available to armchair statisticians. Additionally, the creation and popularization of ANOVA ran parallel to the rise in popularity of Frequentist approaches, and ANOVA provided an excellent model for many data sets. Finally, although the limitations and issues with p-values are well publicized, there has been historic appeal in statistical complexity being reduced to a significant or non-significant outcome.
3.3.4 Uncertainty: Confidence Interval vs Credible Interval
Consider the point estimate of 5.12 and the associated uncertainty of 4.12–6.12.
Frequentist interpretation: Based on repeated sampling, 95% of intervals contain
the true, unknowable parameter. Therefore, there is a 95% chance that 4.12–6.12
contains the true parameter we are interested in.
Bayesian interpretation: Based on the data, we are 95% certain that the interval 4.12–6.12 contains the parameter value, with the most likely value being 5.12 (the mean/mode of the distribution).
These are different interpretations!
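In practice, a credible interval is computed directly from the posterior samples. A small sketch follows; the vector below is only a stand-in for real MCMC draws, generated so its summaries roughly match the example numbers above:

post <- rnorm(10000, mean = 5.12, sd = 0.51)   # stand-in for posterior samples of a parameter
quantile(post, probs = c(0.025, 0.975))        # 95% credible interval, roughly 4.12 to 6.12
mean(post)                                     # point estimate, roughly 5.12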
3.3.5 Bayesian Pros and Cons
Pros
• Can accommodate small sample sizes
• Uncertainty accounted for easily
• Degree-of-belief interpretation
• Very flexible with respect to model design
• Modern computers make computation possible
Cons
• Still need to code (but also a pro?)
• Still computationally intensive in some applications
Other Practical Advantages of Bayesian Estimation
• Easy to use different distributions (e.g., t-test with t-distribution)
• Free examination and comparisons
– Comparing standard deviations
– Comparing factor levels
– The data don’t change, why should the p-value?
• Uncertainty for ALL parameters and derived quantities
• Credible intervals make more sense than confidence intervals
• Probabilistic statements about parameters and inference
• With diffuse priors, approximate MLE (no estimation risk)
• Test statistics and model outcome NOT dependent on sampling intentions
– Frequentist experiments need an a priori sample size (do you do this?)
– Test statistics vary, which means p-values vary, which means decisions
change
– The data don’t change, why should the p-value?
Chapter 4
Bayesian Machinery
4.1 Bayes' Rule
$$P(\theta|y) = \frac{P(y|\theta) \times P(\theta)}{P(y)}$$
where
1. P(θ|y) = posterior distribution
2. P(y|θ) = likelihood function
3. P(θ) = prior distribution
4. P(y) = normalizing constant
4.1.1 Posterior Distribution: p(θ|y)
The posterior distribution (often abbreviated as the posterior) is simply the result of computing Bayes' Rule for a set of data and parameters. Because we don't get point estimates for answers, we correctly call it a distribution, and we add the term posterior because this is the distribution produced at the end. You can think of the posterior as a statement about the probability of the parameter value given the data you observed.
“Reallocation of credibilities across possibilities.” - John Kruschke
(Figure: a histogram of posterior samples of θ, and a trace plot of θ across MCMC iterations.)
4.1.2 Likelihood Function: p(y|θ)
• Skip the math
• Consider it similar to other likelihood functions
• In fact, it will give you the same answer as ML estimator (interpretation
differs)
4.2 Priors: p(θ)
• Distribution we give to a parameter before computation
Figure 4.1: Prior information can be useful.
• WARNING: This is historically a big deal among statisticians, and subjectivity is a main concern cited by Frequentists
• Priors can have very little, if any, influence (e.g., diffuse, vague, noninformative, unspecified, etc), yet all priors are technically informative.
• Much of ecology uses diffuse priors, so little concern
• But priors can be practical if you really do know information (e.g., even
basic information, like populations can’t be negative values)
• Simple models may not need informative priors; complex models may need them
You may not use informative priors when starting to model. Regardless, always
think about your priors, explore how they work, and be prepared to defend
them to reviewers and other peers.
“So far there are only few articles in ecological journals that have
actually used this asset of Bayesian statistics.” - Marc Kery (2010)
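One easy way to explore your priors is simply to plot candidate prior distributions against the scale of the parameter. A minimal sketch in R follows; the particular means and standard deviations are arbitrary choices for illustration:

curve(dnorm(x, mean = 0, sd = 100), from = -50, to = 50, ylab = "density")  # very diffuse prior
curve(dnorm(x, mean = 0, sd = 10), add = TRUE, lty = 2)                     # vaguely informative prior
curve(dnorm(x, mean = 5, sd = 1), add = TRUE, lty = 3)                      # informative prior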
4.3 Normalizing Constant: P(y)
The normalizing constant is a function that converts the area under the curve to
1. While this may seem technical—and it is—this is what allows us to interpret
Bayesian output probabilistically. The normalizing constant is a high-dimensional integral that in most cases cannot be solved analytically. But we need it, so we have to simulate it. To do this, we use Markov chain Monte Carlo (MCMC).

Figure 4.2: Example of prior, likelihood, and posterior distributions.

Figure 4.3: Example of prior influence based on prior parameters.

Figure 4.4: MCMC samplers are designed to sample parameter space with a combination of dependency and randomness.
4.3.1 MCMC Background
• Stan Ulam: Manhattan Project scientist
• The solitaire problem: How do you know the chance of winning?
• Can't really solve it… too much complexity
• But we can automate a bunch of games and monitor the results—basically we can do something so much that we assume the simulations are approximating the real thing.
Fun Fact: There are 80,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 solitaire combinations!
Markov Chain: transitions from one state to another (dependency)
Monte Carlo: chance applied to the transition (randomness)
• MCMC is a group of functions, governed by specific algorithms
• Metropolis-Hastings algorithm: one of the first algorithms
• Gibbs algorithm: splits multidimensional θ into separate blocks, reducing
dimensionality
• Consider MCMC a black box, if that’s easier
4.3.2 MCMC Example
A politician on a chain of islands wants to spend time on each island proportional
to each island’s population.
1. After visiting one island, she needs to decide whether to…
   • stay on the current island
   • move to the island to the west
   • move to the island to the east
2. But she doesn't know the overall population—she can ask the current islanders their population and the population of adjacent islands
3. She flips a coin to decide the east or west island
   • if the selected island has a larger population, she goes
   • if the selected island has a smaller population, she goes probabilistically
MCMC is a set of techniques to simulate draws from the posterior distribution
p(θ|x) given a model, a likelihood p(x|θ), and data x, using dependent sequences
of random variables. That is, MCMC yields a sample from the posterior distribution of a parameter.
4.3.3 Gibbs Sampling
One iteration includes as many random draws as there are parameters in the
model; in other words, the chain for each parameter is updated by using the
last value sampled for each of the other parameters, which is referred to as full
conditional sampling.
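As a toy illustration of full conditional sampling (my own sketch, not code from the text), here is a two-parameter Gibbs sampler for a normal mean and variance with conjugate priors; each iteration draws the mean given the current variance, then the variance given the new mean:

set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)                # simulated data
n <- length(y); n_iter <- 5000
mu <- numeric(n_iter); sigma2 <- numeric(n_iter)
mu[1] <- 0; sigma2[1] <- 1                        # starting values
for (t in 2:n_iter) {
  # full conditional for mu, with a N(0, 100) prior
  v <- 1 / (1/100 + n/sigma2[t - 1])
  m <- v * sum(y) / sigma2[t - 1]
  mu[t] <- rnorm(1, m, sqrt(v))
  # full conditional for sigma2, with an inverse-gamma(0.01, 0.01) prior
  sigma2[t] <- 1 / rgamma(1, shape = 0.01 + n/2,
                          rate = 0.01 + sum((y - mu[t])^2) / 2)
}
mean(mu[-(1:1000)]); mean(sigma2[-(1:1000)])      # posterior means after discarding burn-in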
Although the nuts and bolts of MCMC can get very detailed and may go beyond
the operational knowledge you need to run models, there are some practical
issues that you will need to be comfortable handling, including initial values, burn-in, convergence, and thinning.
4.3.4 Burn-in
• Chains start with an initial value that you specify or randomize
• The initial value may not be close to the true value
• This is OK, but the chain needs time to find the correct parameter space
• If you know your high probability region, then you may have burned in already
• Visual assessment can confirm burn-in
Figure 4.5: Visualizing parameter sampling
Figure 4.6: Burn-in is the initial MCMC sampling that may take place outside
of the highest probability region for a parameter.
Figure 4.7: Clean non-convergence for 3 chains.
4.3.5 Convergence
• We run multiple independent chains for stronger evidence of correct parameter space
• When chains converge on the same space, that is strong evidence for convergence
• But how do we know or measure convergence?
• Averages of the functions may converge (chains don't technically converge)
Convergence Diagnostics
1. Visual convergence of iterations (“hairy caterpillar” or the “grass”)
2. Visual convergence of histograms
3. Brooks-Gelman-Rubin statistic, R̂
4. Others
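As a sketch of how these diagnostics are obtained in practice, the coda package can be applied to a fitted JAGS object; the object name fit below is hypothetical, and converting it with as.mcmc() assumes the R2jags workflow used later in this book:

library(coda)
post <- as.mcmc(fit)    # convert a fitted R2jags object to coda's mcmc.list format
traceplot(post)         # visual check: chains should overlap like a "hairy caterpillar"
gelman.diag(post)       # Brooks-Gelman-Rubin statistic; values near 1 suggest convergence
gelman.plot(post)       # how the statistic evolves across iterations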
4.3.6 Thinning
MCMC chains are autocorrelated, so $\hat{\theta}_t \sim f(\hat{\theta}_{t-1})$. It is common practice to thin by 2, 3, or 4 to reduce autocorrelation. However, there are also arguments against thinning.
4.3.7 MCMC Summary
There is some artistry in MCMC, or at least some decisions that need to be made by the modeler. Your final number of samples in your posterior is often much
less than your total iterations, because in handling the MCMC iterations you
will need to eliminate some samples (e.g., burn-in and thinning). Many MCMC
adjustments you make will not result in major changes, and this is typically a
good thing because it means you are in the parameter space you need to be in. Other times, you will have a model issue and some MCMC adjustment will make a (good) difference. Because computation is cheap—especially for simple models—it is common to over-do the iterations a little. This is OK.

Figure 4.8: Non-convergence is not always obvious. These chains are not converging despite overlapping.

Figure 4.9: Histograms and density plots are a good way to visualize convergence.
Here is a nice overview video about MCMC: https://www.youtube.com/watch?v=OTO1DygELpY
And here is a great simulator to play with to evaluate how changes in MCMC
settings play out visually. https://chi-feng.github.io/mcmc-demo/app.html
Chapter 5
Introduction to JAGS
5.1 WinBUGS
• BUGS language developed in 1990s
• There is a point and click routine called WinBUGS and OpenBUGS
• WinBUGS made more of a splash because it made Bayesian estimation
accessible
• WinBUGS can be slow and fail to converge…so other software has been developed
5.1.1 BUGS? JAGS?
• BUGS was first on the scene, standalone or run through R
• JAGS originally for BUGS models on Unix, but also developed with R in
mind
• BUGS and JAGS differences tend to be minor (Plummer, 2012)
• JAGS has really taken over BUGS, practically
5.1.2 STAN?
• STAN is newest, developed by Gelman et al.
• STAN fits models in C++, but can also be run through R
• STAN differs more from the other two; there are more language differences, more code, and fitting differences, but it also offers some improvements (diagnostics) for more complex models.
5.2 JAGS
1. Software can be called from R
2. JAGS functionally handles the MCMC
Figure 5.1: The BUGS logo.
Figure 5.2: The STAN logo.
3. All you need to do
• Specify the model likelihood
• Specify statistical distributions
• JAGS does the rest…
4. JAGS is not a procedural language like R; you can write the model code
in any order
5.2.1 General steps to fitting a model in JAGS
1. Define model in BUGS language (write in R and export as text file to be
read by JAGS)
2. Bundle data
3. Provide initial values for parameters
4. Set MCMC settings
   • # of iterations
   • # to thin
   • # to burn-in
   • # of chains
5. Define parameters to keep track of (i.e., parameters of interest)
6. Call JAGS from R to fit the model
7. Assess convergence
5.2.2 Define the BUGS Model: Types of Nodes
Constants—specified in the data file; fixed value(s)
• example: number of observations, n
Stochastic—variables given a distribution
• example: variable ~ distribution(parameter 1, parameter 2, ...)
• y[i] ∼ N(µ, σ²) is written as y[i] ~ dnorm(mu, tau)
• Note that the variance is expressed as a precision (τ), where τ = 1/σ²
Deterministic—logical functions of other nodes
• example: mu[i] <- alpha + beta * x[i]
5.2.3 Arrays and Indexing
Arrays are indexed by terms within square brackets
• 1 term indexed: y[i] ~ dnorm(mu, tau)
• 2 term indexed: mu.p[i,j] <- psi[i,j] * w[i]
• 3 term indexed: mu[i,j,k] <- z[i,k] * gamma[i,j,k]
5.2.4 Repeated Structures
for loops
for(i in 1:n){ # loop over observations 1 to n
  # list of statements to be repeated for increasing values of loop index i
} # close loop
Recall that n is a constant and needs to be defined!
5.2.5 Likelihood Specification
Assume the response variable Y with n observations is stored in a vector y with elements yi. The stochastic part of the model is written as:

Y ∼ distribution(ϕ)

where ϕ is the parameter vector of the assumed distribution. The parameter vector is linked with some explanatory variables, X1, X2, ..., Xp, using a link function h:

ϕ = h(θ, X1, X2, ..., Xp)

where θ is a set of parameters used to specify the link function and the final model structure. In generalized linear models, the link function links the parameters of the assumed distribution with a linear combination of predictor variables.
Family              Default link function
binomial            logit
gaussian            identity
Gamma               inverse
inverse gaussian    1/µ²
poisson             log
quasi               identity (with variance = constant)
quasibinomial       logit
quasipoisson        log

5.2.6 BUGS Syntax
for(i in 1:n){
  y[i] ~ distribution(parameter1[i], parameter2[i])
  parameter1[i] <- [function of theta and X's]
  parameter2[i] <- [function of theta and X's]
}
Note: Not all parameters will be indexed—depends on the model.
5.2.7 Simple Example
You would like to run a model to draw inference about the mean. The data include 30 observations of fish length, yi, for i = 1, ..., 30.
Statistical Model

yi ∼ N(µ, σ²)

Priors

µ ∼ N(0, 100)
σ ∼ U(0, 10)
5.2.8 Derived Quantities
One of the nicest things about a Bayesian analysis is that parameters that are functions of primary parameters, along with their uncertainty (e.g., standard errors or credible intervals), can easily be obtained using the MCMC posterior samples. We just add a line to the JAGS model that computes the derived quantity of interest, and we directly obtain samples from the posterior distributions of not only the original estimated parameters, but also the derived relationships we seek to quantify. In a frequentist mode of inference, this would require application of the delta method (or various other corrective procedures), which is more complicated and also makes more assumptions. In the Bayesian analysis, estimation error is automatically propagated into functions of parameters.
“Derived quantities: One of the greatest things about a Bayesian
analysis.” Kery 2010
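As a concrete (hypothetical) illustration, suppose we wanted the coefficient of variation of fish length from the Simple Example above; a single extra line inside the model block would give us its full posterior. This is a sketch, not part of the original example:

# Hypothetical derived quantity added inside the model block of the
# simple example: the coefficient of variation of fish length,
# computed from the primary parameters mu and sigma at every iteration
cv <- sigma / mu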
5.3 Convergence
• Think about priors: do they need to be more informative?
• Starting/initial values: make sure they are reasonable
• Standardize data (predictors): zi = (xi − x̄)/sx
  – Slope and intercept parameters tend to be correlated
  – MCMC sampling from tightly correlated distributions is difficult (samplers can get stuck)
  – Standardizing data (mean-centering) reduces correlation and allows us to set vague priors no matter what the scale of the original data
  – Consider standardizing by 2 SDs (Gelman and Hill, 2006)
• Visually inspect trace plots and posterior densities
• Gelman-Rubin statistic (R̂)
  – It compares the within-chain variance to the between-chain variance. Is there a chain effect?
  – Values near 1 indicate likely convergence; a value ≤ 1.1 is considered acceptable by some (see the sketch below)
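As a sketch of what these checks look like in practice (assuming a fitted R2jags object named out, as produced in Section 5.5, which can be converted with as.mcmc()), the coda package can compute R̂ and draw the usual diagnostic plots:

library(coda)
out.mcmc <- as.mcmc(out)   # convert the fit to a coda mcmc.list (one element per chain)
gelman.diag(out.mcmc)      # Gelman-Rubin (R-hat) point estimates and upper CIs
traceplot(out.mcmc)        # look for well-mixed, "grassy" chains
densplot(out.mcmc)         # posterior density for each monitored parameter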
5.4 Additional Resources
• Kéry (2010), Chapter 5
• Lykou and Ntzoufras (2011)
• Plummer (2012) (JAGS User Manual)
• Spiegelhalter et al. (2003) (WinBUGS User Manual)
5.5 JAGS in R: Model of the Mean
This tutorial will work through the code needed to run a simple JAGS model, where the mean and variance are estimated using JAGS. The model is likely not very useful, but the objective is to show the preparation and coding that go into a JAGS model. The first thing we need to do is load the R2jags library.
library(R2jags)
5.5.1 Generate some data
For this exercise, let’s generate our own data so we know the answer.
n <- 50
mu <- 1.12
sigma <- 0.38
# Generate values (stochastic part of the model)
set.seed(214) # so we all get the same random numbers
yi <- rnorm(n, mean=mu, sd=sigma)
# Mean and SD of sample
mean(yi)
sd(yi)
# Get frequentist confidence interval for the mean
t.test(yi)$conf.int[1:2]
5.5.2 Define Model
We will write the model directly in R using JAGS syntax. Note that the model is entirely character data (according to R) and is sunk to an external text file. Note that you can write the model at any point prior to running it, but I find it useful to code the model first, as several of the other preparations we will make depend upon the model and are much easier to do with a model to reference.
sink("model.txt")
cat("
model {
# Likelihood
# Priors
# Derived quantities
} # end model
",fill = TRUE)
sink()
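The skeleton above is left for you to fill in. One possible completion is sketched below, using the statistical model from the Simple Example (Section 5.2.7) and reading its N(0, 100) prior on µ as a variance of 100 (i.e., a precision of 0.01); this is an assumption about the intended parameterization, not the only way to write the model.

sink("model.txt")
cat("
model {
  # Likelihood
  for(i in 1:n){
    y[i] ~ dnorm(mu, tau)   # each observation drawn from a common mean
  }
  # Priors
  mu ~ dnorm(0, 0.01)       # vague prior on the mean (precision scale)
  sigma ~ dunif(0, 10)      # uniform prior on the SD
  # Derived quantities
  tau <- pow(sigma, -2)     # JAGS needs a precision, not a variance
} # end model
",fill = TRUE)
sink()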
5.5.3 Bundle Data
We need to bundle our data into a list that JAGS will be able to read. This is very simple in this exercise and generally remains simple in more complex models. Just make sure you include all the elements that the model needs, which may extend beyond data. Using an equals sign (not the assignment operator), set the data name that JAGS is looking for on the left side to the input data on the right side.
data <- list(y = yi,
n = n )
5.5.4 Initial Values
JAGS will generate random starting values if they are not specified, and for simple models this should work. However, it is strongly suggested that you get in the habit of providing starting values, because eventually you will need to start the MCMC chains in a high-probability area. Like the data, list starting values for any parameters in the model.
inits <- function (){
list (mu=rnorm(1), sigma=runif(1) )
}
5.5.5 MCMC Settings
You can adjust or set the MCMC settings directly in the JAGS command, but
I recommend against hardcoding these settings as you will likely want an easy
way to modify them. This simple code adjusts the settings, and the JAGS
command will never need to be modified.
ni <- 1000
nt <- 2
nb <- 500
nc <- 3

5.5.6 Parameters to Monitor
JAGS models can get complex and often have many (hundreds or thousands)
of parameters. Here we will specify that we want to monitor both parameters
in this model, but in the future, you may want to specify only some parameters
of interest.
parameters <- c("mu","sigma")
5.5.7 Run JAGS model
The command jags() runs the model, with the arguments below. Again, we have specified everything outside this command, which makes things a bit more manageable.
out <- jags(data,
inits,
parameters,
"model.txt",
n.chains = nc,
n.thin = nt,
n.iter = ni,
n.burnin = nb)
5.5.8 Assess Convergence and Model Output
We will get into this more later, but the first thing to do is print the model object summary. This will give you a convergence statistic as a place to start. We can also make traceplots and density plots to check convergence visually.
print(out, dig = 3)
Also, get comfortable working with the JAGS model object, which stores a lot
of information. Inspect it here.
str(out)
How would you get the posterior mean out of the JAGS model object without
using the summary function?
How would you plot the posterior by hand (it need not be pretty)?
How would you make traceplots?
How would you make density plots?
There are a number of packages that can also be used to plot JAGS output
and make things simpler. However, it is not a bad idea to become comfortable
with the JAGS model object as there may come a model that has sufficiently
complex output that a canned diagnostic does not work.
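As a starting point for those questions, here is one possible way in (a sketch; the slot names below assume the structure of an R2jags fit, where the posterior draws are stored under out$BUGSoutput):

post <- out$BUGSoutput$sims.list   # named list of pooled posterior draws
mean(post$mu)                      # posterior mean of mu "by hand"
hist(post$mu)                      # rough plot of the posterior for mu
plot(post$mu, type = "l")          # traceplot-style view of the pooled draws
plot(density(post$sigma))          # density plot for sigma
traceplot(out)                     # built-in traceplots, one per parameter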
Chapter 6
Simple Models in JAGS
6.1 Revisiting Hierarchical Structures
6.1.1 What is a hierarchical model?
Parent and Rivot (2012): A model with three basic levels
1. A data level that specifies the probability distribution of the observables at hand, given the parameters and the underlying processes
2. A latent process level depicting the various hidden ecological mechanisms that make sense of the data
3. A parameter level identifying the fixed quantities that would be sufficient, were they known, to mimic the behavior of the system and to produce new data statistically similar to the ones already collected
Also
1. Observation model: distribution of data given parameters
2. Structural model: distribution of parameters, governed by hyperparameters
3. Hyperparameter model: priors on the hyperparameters
Other Definitions
• “Estimating the population distribution of unobserved parameters” (Gelman et al., 2013)
• “Sharing statistical strength”: we can make inferences about data-poor
groups using the entire ensemble of data
– Pooling (partially) across groups to obtain more reliable estimates
for each group
6.1.2 Hierarchical Model Example
“Similar, but not identical” can be shown to be mathematically equivalent to assuming that unit-specific parameters, θi, i = 1, ..., N, arise from a common “population” distribution whose parameters are unknown (but assigned appropriate prior distributions):

θi ∼ N(µ, σ²)
µ ∼ N(., .); σ ∼ unif(., .)

We learn about θi not only directly through yi, but also indirectly through information from the other yj via the population distribution, parameterized here by ϕ = (µ, σ).
Example: A normal-normal mixture model
• Let yij denote replicate measurement i made on subject j, for j = 1, 2, ..., n subjects
• Observation model: yij ∼ N(αj, σ²)
  – yij is assumed normally distributed with a mean that depends on a subject-specific parameter, αj
  – Assume that the αj's come from a common population distribution: αj ∼ N(µ, σα²)
Ecological data are characterized by:
• Observations measured at multiple spatial and/or temporal scales
• An uneven number of observations measured for any given subject or group of interest (i.e., unbalanced designs)
• Observations that lack independence (i.e., the value of a measurement at one location or time period influences the value of another measurement made at a different location or time period)
• Widely applicable to many (most) ecological investigations
Desirable properties of hierarchical models:
1. Accommodate lack of statistical independence
2. Scope of inference
3. Quantify and model variability at multiple levels
4. Ability to “borrow strength” from the entire ensemble of data (a phenomenon referred to as shrinkage)
6.1.3 Borrowing Strength?
• Make use of all available information
• Results in estimators that are a weighted composite of information from
an individual group (e.g., species, individual, reservoir, etc.) and the
relationships that exist in the overall sample
• Could fit separate ordinary least squares (OLS) regressions to each group
(e.g., reservoir) and obtain estimate of a slope and intercept
• OLS will give estimates of the parameters; however, they may not be very accurate for any given group
• Depends on sample size within a group (nj ) and the range represented in
the level-1 predictor variable, Xij
– If nj is small then intercept estimate will be imprecise
– If sample size is small or a restricted range of X, the slope estimate
will be imprecise
Hierarchical models allow us to take the imprecision of OLS estimates into account:
• More weight is given to the grand mean when there are few observations within a group and when the within-group variability is large compared to the between-group variability
  – E.g., a small sample size for a given group means a not-very-precise OLS estimate, so the value is “shrunk” toward the population mean
• More weight is given to the observed group mean if the opposite is true
• Thus, hierarchical models account and allow for the small sample sizes observed in some groups
• The same is true for the range represented in predictor variables (i.e., hierarchical models account and allow for the small and limited ranges of X observed in some groups)
6.1.4 Shrinkage toward the grand mean (µα)
The estimate of αj is a linear combination of the population-average estimate µα and the ordinary least squares (OLS) estimate, αjOLS:

αj = wj × αjOLS + (1 − wj) × µα

The weight wj is the ratio of the between-group variability (σα²) to the sum of the within- and between-group variability (i.e., the total variability):

wj = (nj × σα²) / (nj × σα² + σ²)

where nj is the sample size in group j.
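A quick numerical illustration (with made-up values) shows how a small group in a noisy system gets pulled toward the grand mean:

n.j      <- 4     # observations in group j (hypothetical)
sigma2.a <- 0.5   # between-group variance (hypothetical)
sigma2   <- 4.0   # within-group variance (hypothetical)
w.j <- (n.j * sigma2.a) / (n.j * sigma2.a + sigma2)
w.j   # 0.33, so only a third of the weight goes to the group's own OLS estimate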
Figure: Example where ICC = 0% (i.e., no among-group variability)

Figure: Example where ICC = 13%

Figure: Example where ICC = 80%
6.2 Simple Linear Regression
The model from a Bayesian point of view:

yi ∼ N(α + βxi, σ²), for i = 1, ..., n

Priors:
α ∼ N(0, 0.001)
β ∼ N(0, 0.001)
σ ∼ U(0, 10)
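A minimal JAGS sketch of this regression is below, assuming the priors above are written in the BUGS precision convention (so dnorm(0, 0.001) is a vague normal):

model {
  # Likelihood
  for(i in 1:n){
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- alpha + beta * x[i]
  }
  # Priors
  alpha ~ dnorm(0, 0.001)
  beta ~ dnorm(0, 0.001)
  sigma ~ dunif(0, 10)
  # Derived quantities
  tau <- pow(sigma, -2)
}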
6.2.1 Varying Coefficient Models
Few people run models in JAGS because they want fixed effects; in fact, because parameters are given distributions, you will have the option to create random effects, or varying coefficients, very easily. Before we start on the different types of simple varying coefficient models, let's visually review them.
6.3 Varying Intercept Model
Figure 6.2: Examples of different varying coefficient models

Another way of labeling a varying intercept model is a one-way ANOVA with a random effect. A one-way ANOVA is among the simpler of statistical models, and a little complexity has been added by changing the single fixed factor to be random. Either terminology is accurate and acceptable, but because ANOVAs model means, and means without slopes are intercepts, we will often hear an ANOVA called an intercepts model, with the modifier varying meaning that group levels are random and can vary from the grand mean. Let's think about the varying intercepts model with an example.
Say we have multiple measurements of total phosphorus (TP) taken from i lakes located in j regions (j = 1, 2, ..., J). The number of lakes sampled in each region varies, and we want to estimate the mean TP for each region. We can assign each region its own intercept (i.e., its own mean TP), and we will allow these intercepts to come from a normal distribution characterized by an overall mean TP (µα) and a between-region variance (σα²).

The model can be expressed as follows:

yi ∼ N(αj(i), σ²), for i = 1, ..., n
αj ∼ N(µα, σα²), for j = 1, ..., J
µα ∼ N(0, 0.001)
σα ∼ U(0, 10)
σ ∼ U(0, 10)
# Likelihood
for(i in 1:n){
  y[i] ~ dnorm(mu[i], tau)
  mu[i] <- alpha[group[i]]
}
for(j in 1:J){
  alpha[j] ~ dnorm(mu.alpha, tau.alpha)
}
# Priors
mu.alpha ~ dnorm(0,0.001)
sigma ~ dunif(0,10)
sigma.alpha ~ dunif(0,10)
# Derived quantities
tau <- pow(sigma,-2)
tau.alpha <- pow(sigma.alpha,-2)
6.3.1 Nested Indexing
In random intercept models we have observation i in region j. Multiple i’s can
be observed within a single j. Therefore, we have multiple observations per
some grouping variable, so we have two indexes, i and j. Up until this point,
we only had observation i, which was easily accommodated in a single for loop.
“Nested indexing” is a way of accommodating this type of data structure
alpha[group[i]]
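A small, hypothetical illustration of the data structure this expects: y holds every observation, and group holds the region index j for each observation i.

y     <- c(12.1, 10.8, 11.5, 14.2, 13.9, 9.7)   # TP measurements (made up)
group <- c(1, 1, 1, 2, 2, 3)                    # region of each lake
n <- length(y)    # 6 observations
J <- max(group)   # 3 regions; alpha[group[i]] picks out observation i's regional intercept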
6.3.2 Intraclass correlation coefficient (ICC)
This is another version of the one-way ANOVA with a random effect. The primary difference between this and the model above is that here we are going to actually calculate the intraclass correlation coefficient, ICC, as a derived quantity, in order to help us understand more about where the variability is in the data.

Recall that the ICC is a measure of the among-group variability relative to the total (within- plus among-group) variability.
Figure 6.3: Low ICC
Figure 6.4: High ICC
The model can be expressed the same way as the one-way ANOVA with a random effect:

yi ∼ N(αj(i), σ²), for i = 1, ..., n
αj ∼ N(µα, σα²), for j = 1, ..., J
Recall that the total variance = σ² + σα², and therefore

ICC = σα² / (σ² + σα²)
Both terms required in the ICC equation are already being estimated by the
model, so this is a good example of a derived quantity.
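Both variance components are already in the JAGS code from Section 6.3, so the ICC can be added there as a single derived-quantity line (a sketch, assuming the sigma/sigma.alpha parameterization used in that code):

# Derived quantity: among-group variance over total variance
ICC <- pow(sigma.alpha, 2) / (pow(sigma, 2) + pow(sigma.alpha, 2))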
6.3.3 Varying Intercept, Fixed Slope Model
We can easily add a level-1 covariate (slope) to our model, but because it is not varying, it will create a model where the effect of the covariate is equal for all groups, although the intercepts may vary among groups.

This model might look like:

yi ∼ N(αj(i) + βxi, σ²), for i = 1, ..., n
αj ∼ N(µα, σα²), for j = 1, ..., J
Note that the second level model (the model for αj ’s) is the same as the
previous model. The only change is that we have added an x predictor and a β
that represents a single, common slope, or the effect of x on y. Although β is
not varying, we still need to give it a prior, because it is an estimated
parameter and not a deterministic node.
Figure 6.5: Example of Simpson’s Paradox
µα ∼ N(0, 0.001)
β ∼ N(0, 0.0001)
σα ∼ U(0, 10)
σ ∼ U(0, 10)
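A sketch of the corresponding JAGS likelihood and priors, built by extending the varying intercept code from Section 6.3 (x[] is the level-1 covariate and is assumed to be supplied in the data list):

# Likelihood
for(i in 1:n){
  y[i] ~ dnorm(mu[i], tau)
  mu[i] <- alpha[group[i]] + beta * x[i]   # varying intercept, common slope
}
for(j in 1:J){
  alpha[j] ~ dnorm(mu.alpha, tau.alpha)
}
# Additional prior for the single, common slope
beta ~ dnorm(0, 0.0001)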
6.4 Varying Slope Model
A single slope may make sense in some applications where a change in a covariate affects all groups the same, despite the fact that the groups may start at different values. However, there is a good chance that covariates have the potential to exert effects of different directions and magnitudes on different groups. In cases like this, a varying slope would make sense. In ecology, the assumption that many processes are spatially and/or temporally homogeneous or invariant may not be valid. For example, a stressor-response relationship may vary spatially depending on local landscape features.
Another important attribute of varying slopes is that they can help avoid
Simpson’s Paradox, which is when a trend appears in several different groups
of data but disappears or reverses when these groups are combined.
Recall our varying intercept model, and we will change one thing in the
(level-1) equation. We will index β by i and j in the same way we did for α in
the varying intercept model.
yi ∼ N(αj(i) + βj(i)xi, σ²), for i = 1, ..., n
Figure 6.6: Most quantitative ecologists cannot resist the temptation to include
a cartoon of Homer Simpson when referencing Simpson’s Paradox.
The second level of our model changes more dramatically, because we now have two varying coefficients and need to model both their variances and their correlation:

(αj, βj)′ ∼ N((µα, µβ)′, Σ), for j = 1, ..., J

where the variance-covariance matrix is

Σ = [ σα²      ρσασβ ]
    [ ρσασβ    σβ²   ]
Several of these terms we have seen before, specifically the terms involving α, the intercept, and many of the β terms are used similarly to the α terms. After understanding that the equation is written in matrix notation to accommodate both varying parameters, see that the row for βj contains the mean of the β's, expressed as µβ. The final part is the variance-covariance matrix, which might be simpler than it appears. Note that we have already recognized σα² as the among-group variance for intercepts, and σβ² is simply the among-group variance for slopes. The only term remaining is ρσασβ, which represents the joint variability of αj and βj. Perhaps the most important part of this term is that it is prefaced with ρ, the Greek symbol commonly used for correlation. So the term ρσασβ quantifies the correlation between the varying parameters, which is important to know because we want and need the correlation to be minimal.
Finally, we need priors for the model. Again, think about building on previous
models that we have learned. This is not entirely a new model, so use what
you already know, and add the new parts.
µα ∼ N(0, 0.001)
µβ ∼ N(0, 0.001)
σ ∼ U(0, 10)
σα ∼ U(0, 10)
σβ ∼ U(0, 10)
ρ ∼ U(−1, 1)
Note that ICC cannot be applied to this model for reasons we won’t
get into.
6.4.1 Correlation between parameters
When estimated parameters are correlated, it may be a sign of model-fitting problems. Correlated parameters may mean that the parameters are not able to independently explore parameter space, which may lead to poor estimates. Additionally, even if the estimates are reliable, correlated parameters may mean that you aren't getting much independent information from the data. Regardless, correlated parameters are a problem in your models and need to be addressed. (Also, there is no agreed-upon threshold for "highly correlated," but by the time correlations approach 0.7, you may want to start looking into things.)
Fortunately, there are several (and mostly simple) things you can do to address parameter correlation. The best thing to do is often to center and/or standardize your model covariates, which has been discussed before (see the sketch below). Standardizing data typically reduces correlation between parameters, improves convergence, may improve parameter interpretation, and can also mean that the same priors can be used regardless of the scale of the original data.
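The usual fixes look something like this in R (x is a hypothetical raw covariate):

x.c  <- x - mean(x)                  # mean-centering: shifts values, keeps scale
x.z  <- (x - mean(x)) / sd(x)        # standardizing: shifts values and rescales
x.z2 <- (x - mean(x)) / (2 * sd(x))  # 2-SD standardizing (Gelman and Hill, 2006)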
Going back to the PLD data that we have examined, note that the correlation
of intercepts and effect of temperature based on the raw data is high, at about
-0.95.
In the PLD paper, a centering constant was used that greatly reduced the
correlation.
The centering of the temperature by -15 resulted in near elimination of the
parameter correlation. Note that this example is about as impressive as you
might find and you should not expect results this dramatic in all data.
Figure 6.7: Correlation between intercepts and effect of log(temperature) for
the raw PLD dataset. Correlation is around 0.95.
Figure 6.8: Simulation of varying centering constants examined to reduce parameter correlation.
Figure 6.9: PLD intercepts and slopes with almost no correlation.
However, centering and/or standardizing does routinely reduce correlation as advertised. Recall that centering changes the values, but not the scale, of the data; think about it as shifting the number line under your data. Mean-centering is a very common type of centering. Standardizing your data changes both the values and the scale. There is a reasonable literature on various standardizing approaches, but mean-centering and dividing by 1 or 2 standard deviations are very common and usually produce good results.
6.4.2 Correlation between 2 or more varying parameters
Another way to represent the models with 2 or more varying parameters is:
yi ∼ N(XiBj(i), σ²), for i = 1, ..., n
Bj ∼ N(MB, ΣB), for j = 1, ..., J

where B represents the matrix of group-level coefficients, MB represents the population-average coefficients (i.e., the means of the distributions of the intercepts and slopes), and ΣB represents the covariance matrix.
Once we move beyond 2 varying parameters, it is no longer possible to set diffuse priors one parameter at a time; each correlation is restricted to be within -1 and 1, and the correlations are jointly constrained. In other words, we need to put a prior on the covariance matrix (ΣB) itself.

To accomplish this "prior on a matrix," we will use the scaled inverse-Wishart distribution. This distribution implies a uniform distribution on the correlation parameters. According to Gelman and Hill (2006), "When the number K of varying coefficients per group is more than two, modeling the correlation parameters ρ is a challenge." But the scaled inverse-Wishart approach is a "useful trick." Note that Kéry (2010) does not go so far as to deal with this issue. While this adds some model complexity, ultimately this development is something that will be provided in the JAGS model code and never really changed. Note that this distribution requires the library MCMCpack.
Below is an example of coding for 2 varying coefficients without using the Wishart distribution. This approach will work, but it is not recommended.
#### Dealing with variance-covariance of random slopes and intercepts
# Convert covariance matrix to precision for use in bivariate normal
Tau.B[1:K,1:K] <- inverse(Sigma.B[,])
# Variance among intercepts
Sigma.B[1,1] <- pow(sigma.a, 2)
sigma.a ~ dunif(0, 100)
# Variance among slopes
Sigma.B[2,2] <- pow(sigma.b, 2)
sigma.b ~ dunif(0, 100)
# Covariance between alpha's and beta's
Sigma.B[1,2] <- rho * sigma.a * sigma.b
Sigma.B[2,1] <- Sigma.B[1,2]
# Uniform prior on correlation
rho ~ dunif(-1, 1)
Below is code for modeling the variance-covariance using the Wishart distribution. There are computational advantages to this, in addition to the fact that it easily scales up for more than 2 varying coefficients.
### Model variance-covariance with Wishart distribution
Tau.B[1:K,1:K] ~ dwish(W[,], df)
df <- K+1
Sigma.B[1:K,1:K] <- inverse(Tau.B[,])
for (k in 1:K){
  for (k.prime in 1:K){
    rho.B[k,k.prime] <- Sigma.B[k,k.prime]/
      sqrt(Sigma.B[k,k]*Sigma.B[k.prime,k.prime])
  }
  sigma.B[k] <- sqrt(Sigma.B[k,k])
}
Chapter 7
Varying Coefficients
Chapter 8
Generalized Linear Models in JAGS
8.1 Background to GLMs
Recall the linear model, which models a response variable that is assumed to come from a normal distribution and to have normally-distributed errors. Once we can adopt different error structures, we have generalized the linear model, and the name reflects this. Generalized linear models are not hierarchical, but they are still of great interest because the ability to modify the error structure on the response variable greatly increases the types of data we can model. GLMs were introduced in the early 1970s by Nelder and Wedderburn and have become very popular in recent decades.
8.2 Components of a GLM
A transformation of the expectation of the response (E(y)) is expressed as a linear combination of covariate effects, and distributions other than the normal can be used for the random part of the model.
1. A statistical distribution is used to describe the random variation in the response y
2. A link function g is applied to the expectation of the response, E(y)
3. A linear predictor is the linear combination of covariate effects thought to make up g(E(y))
Figure 8.1: Family calls and default link functions for GLMs in R
Figure 8.2: Common statistical distributions, their description, use, and default
link function information.
8.2.1 Common GLMs
Technically the normal distribution is considered a special case of the GLM, but practically speaking the binomial and Poisson GLMs are the most common. Note that the Bernoulli distribution is also common, but it is just a special case of the binomial where the parameter N = 1. Each distributional family in a GLM has a default link function, which is very likely the link function you will want to use. However, note that other link functions are available, and with some digging you may find a link function that is better for your data than the default.

Although it is not considered part of the family of GLMs, recall that the Beta distribution can be used for Beta regression, which is appropriate when your response variable is a proportion (0 < y < 1).
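For orientation, the two workhorse GLMs look like this when fit with R's glm() and their default links; the variables (y, trials, counts, x) are hypothetical:

# Binomial (logistic) regression: y successes out of `trials`, logit link
fit.binom <- glm(cbind(y, trials - y) ~ x, family = binomial)
# Poisson regression for counts, log link
fit.pois  <- glm(counts ~ x, family = poisson)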
8.3 Binomial Regression
8.4 Poisson Regression
8.4.1 Poisson Extras
Chapter 9
Plotting
Chapter 10
Within-subjects Model
Chapter 11
Hierarchical Models
Bibliography
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. Chapman and Hall/CRC.

Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Kéry, M. (2010). Introduction to WinBUGS for ecologists: Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.

Kéry, M. and Schaub, M. (2011). Bayesian population analysis using WinBUGS: a hierarchical perspective. Academic Press.

Lindley, D. V. and Smith, A. F. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society: Series B (Methodological), 34(1):1–18.

Lykou, A. and Ntzoufras, I. (2011). WinBUGS: a tutorial. Wiley Interdisciplinary Reviews: Computational Statistics, 3(5):385–396.

Parent, E. and Rivot, E. (2012). Introduction to hierarchical Bayesian modeling for ecological data. Chapman and Hall/CRC.

Plummer, M. (2012). JAGS Version 3.3.0 user manual. International Agency for Research on Cancer, Lyon, France.

Raudenbush, S. W. and Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods, volume 1. Sage.

Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003). WinBUGS user manual.

Xie, Y. (2019). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.13.