Bayesian Hierarchical Models in Ecology
Steve Midway
2019-10-28

Contents

1 Background
  1.1 How to Use This Book
  1.2 Acknowledgments
  1.3 Motivation
2 The Model Matrix and Random Effects
  2.1 Linear Models
  2.2 Model Effects
  2.3 Hierarchical Models
3 Fundamentals of Bayesian Inference
  3.1 Models vs. Estimation
  3.2 Bayesian Basics
  3.3 Bayesian and Frequentist Comparison
4 Bayesian Machinery
  4.1 Bayes’ Rule
  4.2 Priors: p(θ)
  4.3 Normalizing Constant: P(y)
5 Introduction to JAGS
  5.1 WinBUGS
  5.2 JAGS
  5.3 Convergence
  5.4 Additional Resources
  5.5 JAGS in R: Model of the Mean
6 Simple Models in JAGS
  6.1 Revisiting Hierarchical Structures
  6.2 Simple Linear Regression
  6.3 Varying Intercept Model
  6.4 Varying Slope Model
7 Varying Coefficients
8 Generalized Linear Models in JAGS
  8.1 Background to GLMs
  8.2 Components of a GLM
  8.3 Binomial Regression
  8.4 Poisson Regression
9 Plotting
10 Within-subjects Model
11 Hierarchical Models

Chapter 1 Background

Welcome to Bayesian Hierarchical Models in Ecology. This is an ebook that also serves as the course materials for a graduate class of the same name. There will be numerous and ongoing changes to this book, so please check back. And don’t hesitate to email me if you have questions, comments, or anything else.

To start, let’s clarify the title of this text: it should be Hierarchical Models in Ecology Using Bayesian Inference. A Bayesian Hierarchical Model is more a term of convenience than accuracy, as hierarchical models need not be Bayesian and Bayesian models can take many forms. However, hierarchical models and Bayesian inference do work together very nicely, as you will see, and so hopefully the title is not too misleading. I have dedicated several parts of this book to differentiating these terms and concepts, while also making them as useful as possible.

1.1 How to Use This Book

You are welcome to use this book in any way you find useful. The contents are really a mashup of conceptual descriptions, lecture notes, enumerated and itemized lists, images, code, analysis, models, output, plots, and other things that I wanted to include. The only real motivation behind the content and organization is that it has been useful for some students to learn, and so I have tried to adopt the best and most effective formats while revising others. Bookdown has provided such freedom in creating content, and perhaps I have veered too far from traditional formats. The document structure (e.g., chapters, sections, etc.) should be logical enough to skip around, if you prefer.
I rely somewhat heavily on quotes and more heavily on code, which are formatted with their own colored boxes.

# R code or JAGS code is in gray boxes.

Quotes are in colored boxes.

1.2 Acknowledgments

In creating this course and format, I want to first acknowledge Ty Wagner. In reality, Ty is the co-author of this text. Ty taught me most of what I know about Bayesian hierarchical models, and for a few years he and I taught a multiday Bayesian hierarchical models workshop from which many of these course materials derive. I would also like to thank Yihui Xie for all his work in developing numerous R packages, especially the bookdown package (Xie, 2019) that has enabled the creation of this book.

1.3 Motivation

The computer scientist Alan Kay is known for this quote:

People who are really serious about software should make their own hardware. –Alan Kay, 1982

I have always liked this quote and have chosen to adapt it for how I think about data analysis. My adaptation of this quote also serves to describe why I have undertaken learning statistical models and the general approach I advocate for students and other data analysts.

People who are really serious about their analyses should code their own models. –Midway, 2018

Chapter 2 The Model Matrix and Random Effects

2.1 Linear Models

Let’s start by reviewing some linear modeling terminology.

• units: observations; i; data points
• x variable: predictor(s); independent variables
• y variable: outcome, response; dependent variable
• inputs: model terms; not the same as predictors (number of inputs ≥ number of predictors)
• random variables: outcomes that include chance; often described with a probability distribution
• multiple regression: regression with > 1 predictor; not multivariate regression (MANOVA, PCA, etc.)
The general components of a linear model can be thought of as any of the following:

response = deterministic + stochastic
outcome = linear predictor + error
result = systematic + chance

• The stochastic component makes these statistical models
• Explanatory variables whose effects add together
• Not necessarily a straight line!

2.1.1 Stochastic Component

• Nature is stochastic; nature adds error, variability, chance
• Always a reason, we might just not know it
• Might not be important and including it risks over-parameterization
• Combined effect of unobserved factors captured in a probability distribution
• Often (by default) the Normal distribution, but others are common

Figure 2.1: Although hurricanes are not statistical models, they are a good example of understanding something that cannot perfectly be modeled, and therefore has some stochasticity inherent to it.

Figure 2.2: Examples of different statistical distributions as seen by their characteristic shape.

In a linear model, we need to identify a distribution that we assume is the data-generating distribution underlying the process we are attempting to model. Not only is this distribution important in describing what we are modeling, but it is also critically important because it will accommodate the errors that arise when we model the data.

So how do we know which distribution we need for a given model?

• We can know possible distributions
• We can know how data were sampled
• We can know characteristics of the data
• We can run a model, evaluate it, and try another distribution
• The correct distribution is not always a Y/N answer

2.1.2 Distributions

Although many distributions are available, we will review three.

1. Normal distribution: continuous, −∞ to ∞
2. Binomial distribution: discrete, integers, 0 to N
3. Poisson distribution: discrete, integers, 0 to ∞

The Normal distribution

The normal distribution is the most common distribution in linear modeling and possibly the most common in nature. A normal distribution arises when measurements are affected by a large number of additive effects (central limit theorem, CLT). When effects are multiplicative, the result is a log-normal distribution. There are 2 parameters that govern the normal distribution: µ = mean and σ² = variance (σ = standard deviation, which is what R asks for). A normal distribution can be easily simulated in R.

rnorm(n = 10, mean = 0, sd = 1)
## [1]  1.2299399  1.2514092  0.9777302  1.4565016 -0.2789666 -0.4747550
## [7]  0.7566882  1.1019663  1.4762023  0.8731642

The Binomial distribution

The binomial distribution always concerns the number of successes (i.e., a specific outcome) out of a total number of trials. A single trial is a special case of the binomial, and is called a Bernoulli trial. The flip of a coin once is the classic Bernoulli trial. A binomial distribution might be thought of as the sum of some number of Bernoulli trials. A binomial distribution has 2 parameters: p = success probability and N = sample size (although N = 1 in a Bernoulli trial, which means a Bernoulli trial can be thought of as a 1-parameter distribution). N is the upper limit, which differentiates a binomial distribution from a Poisson distribution. The binomial distribution mean is a derived quantity, where µ = N × p. The variance is N × p × (1 − p) = µ × (1 − p); therefore, the variance is a function of the mean. A binomial distribution can be simulated in R.

rbinom(n = 10, size = 5, prob = 0.3) # size = trials
## [1] 0 0 1 2 3 3 1 4 2 0

The Poisson distribution

The Poisson distribution is the classic distribution for (integer) counts (e.g., things in a plot, things that pass a point, etc.). The Poisson distribution approximates the binomial when N is large and p is small, and it can be approximated by the normal distribution when λ is large. The Poisson distribution has 1 parameter: λ = mean = variance.
This distribution can be modified for zero-inflated data (ZIP) and other distributional anomalies. A Poisson distribution can be easily simulated in R.

rpois(n = 10, lambda = 3)
## [1] 4 3 2 2 2 2 6 3 4 3

2.1.3 Linear Component

Recall that the linear contribution to our model includes the predictors (explanatory variables) with additive effects. Although this is a linear model, it need not be thought of literally as a straight line. The linear component can accommodate both continuous and categorical data. Conceptually,

linear predictor = design matrix + parameterization

R takes care of both the design matrix and the parameterization, but this is not always true in JAGS, so it is worth a review. Marc Kéry has succinctly summarized the design matrix:

“For each element of the response vector, the design matrix n index indicates which effect is present for categorical (discrete) explanatory variables and what amount of an effect is present in the case of continuous explanatory variables.” (Kéry, 2010)

In other words, the design matrix is a matrix that tells the model if and where an effect is present, and how much. The number of columns in the matrix is equal to the number of fitted parameters. Ultimately, the design matrix is multiplied with the parameter vector to produce the linear predictor.

Let’s use the simulated data in Chapter 6 of Marc Kéry’s book as an example (Kéry, 2010).

Table 2.1: Replicated data from Kéry (2010).

mass  pop  region  hab  svl
   6    1       1    1   40
   8    1       1    2   45
   5    2       1    3   39
   7    2       1    1   50
   9    3       2    2   52
  11    3       2    3   57

If you would like to code this dataset into R and play along, use the code below.

mass <- c(6,8,5,7,9,11)
pop <- c(1,1,2,2,3,3)
region <- c(1,1,1,1,2,2)
hab <- c(1,2,3,1,2,3)
svl <- c(40,45,39,50,52,57)

If we think about a model of the mean, it might look like this:

$$mass_i = \mu + \epsilon_i$$

This model has a covariate of one for every observation, because every observation has a mass.
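The model of the mean can be checked directly in R. The short sketch below builds its design matrix (a single column of ones) and fits it with lm(); the object names X0 and fit0 are only illustrative and not part of the original example.

```r
# Snake masses from the example data (Kery 2010)
mass <- c(6, 8, 5, 7, 9, 11)

# Design matrix for the model of the mean: one column of ones
X0 <- model.matrix(mass ~ 1)
X0

# Fit the model of the mean; its single coefficient is the sample mean
fit0 <- lm(mass ~ 1)
coef(fit0)
mean(mass)
```

The single estimated coefficient plays the role of µ, and the residuals of the fit are the individual deviations.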
This model is also sometimes referred to as an intercept-only model. The model describes individual snake mass as a (global) mean plus individual variation (deviation or “error”).

Increasing model complexity, we might hypothesize that snake mass is predicted by region. Because region is a categorical variable (although it is input as numeric, we don’t assume any ordering or relationship among the region numbers), the model we are interested in is a t-test. A t-test is just a specific case of a linear model. To represent the t-test, we can write the model equation as:

$$mass_i = \alpha + \beta \times region_i + \epsilon_i$$

This model considers snake mass to be made up of these components: the mean snake mass, the region effect, and the residual error. Now is also a good time to start thinking about the residuals, which we assume are normally distributed with a mean of 0 and expressed as

$$\epsilon_i \sim N(0, \sigma^2)$$

Let’s pause for a moment on the t-test and think about how we write models. You are familiar with the notation that I have used thus far, where the model and residual error are expressed as two separate equations. What if we were to combine them into one equation? How would that look?

$$mass_i \sim N(\alpha + \beta \times region_i, \sigma^2)$$

This notation may take a little while to get used to, but it should make sense. This expression is basically saying that snake mass is thought to be normally distributed with a mean that is a function of the mean mass plus the region effect, and with some residual error. Expressing models with distributional notation may seem odd at first, but it may help you better understand how distributions are built into the processes we are modeling, and what part of the process belongs to which part of the distribution. We will stick with this notation (not exclusively) for much of this course.

Back to the model matrix: how might we visualize the model matrix in R?
model.matrix(mass ~ region)
##   (Intercept) region
## 1           1      1
## 2           1      1
## 3           1      1
## 4           1      1
## 5           1      2
## 6           1      2
## attr(,"assign")
## [1] 0 1

But we said earlier that the regions are not inherent quantities, just categories or indicator variables. Above, R is taking the region variable as numeric because it does not know any better. Let’s tell R that region is a factor.

model.matrix(mass ~ as.factor(region))
##   (Intercept) as.factor(region)2
## 1           1                  0
## 2           1                  0
## 3           1                  0
## 4           1                  0
## 5           1                  1
## 6           1                  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$`as.factor(region)`
## [1] "contr.treatment"

The model matrix yields a system of equations. For the matrix above, the system of equations can be expressed in 2 ways. The first way shows the equation for each observation.

$$
\begin{aligned}
6 &= \alpha \times 1 + \beta \times 0 + \epsilon_1 \\
8 &= \alpha \times 1 + \beta \times 0 + \epsilon_2 \\
5 &= \alpha \times 1 + \beta \times 0 + \epsilon_3 \\
7 &= \alpha \times 1 + \beta \times 0 + \epsilon_4 \\
9 &= \alpha \times 1 + \beta \times 1 + \epsilon_5 \\
11 &= \alpha \times 1 + \beta \times 1 + \epsilon_6
\end{aligned}
$$

The second way adopts matrix (vector) notation to economize the system.

$$
\begin{pmatrix} 6 \\ 8 \\ 5 \\ 7 \\ 9 \\ 11 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}
\times
\begin{pmatrix} \alpha \\ \beta \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \end{pmatrix}
$$

In both cases the design matrix and parameter vector are combined. Using least-squares estimation to estimate the parameters will result in the best fits for α and β.

2.1.4 Parameterization

Although parameters are inherent to specific models, there are cases in the linear model where parameters can be represented in different ways in the design matrix, and these different representations will have different interpretations. Typically, we refer to either a means or an effects parameterization. These two parameterizations really only come into play when you have a categorical variable; the representation of continuous variables in the model matrix includes the quantity or magnitude of that variable (and defaults to an effects parameterization). A means parameterization may be present or needed when categorical variables are present.
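Before looking at the two parameterizations in detail, it may help to see the design matrix and parameter vector combined numerically. The sketch below solves the least-squares problem for the two-region model by hand (through the normal equations) and compares the result with lm(). The names X, b_hat, and linpred are illustrative, and solving the normal equations directly is shown only for transparency, not as how lm() works internally.

```r
mass   <- c(6, 8, 5, 7, 9, 11)
region <- c(1, 1, 1, 1, 2, 2)

# Design matrix with region treated as a factor
X <- model.matrix(mass ~ as.factor(region))

# Least-squares solution of the normal equations: (X'X) b = X'y
b_hat <- solve(t(X) %*% X, t(X) %*% mass)
b_hat

# Linear predictor = design matrix multiplied with the parameter vector
linpred <- X %*% b_hat
linpred

# lm() returns the same estimates
coef(lm(mass ~ as.factor(region)))
```

The first element of b_hat corresponds to α and the second to β in the system of equations above.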
Let’s take the above t-test example with 2 regions. As we notated the model above, α = µ for region 1, and β represents the difference in expected mass in region 2 compared to region 1. In other words, β is an estimate of the effect of being in region 2. As you guessed, this is an effects parameterization because region 1 is considered a baseline or reference level, and other levels are compared to this reference, hence the coefficients represent their effect. Effects parameterization is the default in R (e.g., lm()).

In a means parameterization, α has the same interpretation: the mean for the first group. However, in a means parameterization, all other coefficients represent the means of those groups, which is actually a simpler interpretation than the effects parameterization. Going back to our t-test, in a means parameterization β would represent the mean mass for snakes in region 2. Thus, no addition or subtraction of effects is required in a means parameterization.

Both the means and effects parameterizations will yield the same estimates (with or without some simple math), but it is advised to know which parameterization you are using. Additionally, as you work in JAGS you may need to use one parameterization over another, and understanding how they work and are interpreted is critically important.

Let’s look at one more model using the snake data. A simple linear regression will fit one continuous covariate (x) as a predictor of y. Let’s look at the model for snout-vent length (svl) as the predictor of mass.

$$mass_i = \alpha + \beta \times svl_i + \epsilon_i$$

The model matrix for this model (an effects parameterization) can be reported as such.

model.matrix(mass ~ svl)
##   (Intercept) svl
## 1           1  40
## 2           1  45
## 3           1  39
## 4           1  50
## 5           1  52
## 6           1  57
## attr(,"assign")
## [1] 0 1

And we see that each snake has an intercept (mean) and an effect of svl.
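The effects and means parameterizations discussed in this section can be compared side by side in R. The sketch below is one possible illustration using the region example; removing the intercept with `- 1` in the formula is a standard way to request a means parameterization in lm(), and the object names fit_effects and fit_means are only illustrative.

```r
mass   <- c(6, 8, 5, 7, 9, 11)
region <- factor(c(1, 1, 1, 1, 2, 2))

# Effects parameterization (R default): intercept = region 1 mean,
# second coefficient = difference between region 2 and region 1
fit_effects <- lm(mass ~ region)
coef(fit_effects)

# Means parameterization: dropping the intercept gives one mean per region
fit_means <- lm(mass ~ region - 1)
coef(fit_means)

# The design matrix now has one indicator column per region
model.matrix(mass ~ region - 1)

# Region 2 mean recovered from the effects fit by simple addition
sum(coef(fit_effects))
```

Both fits describe the same model; only the meaning of the second coefficient changes.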
These examples using the fake snake data were taken from Kéry (2010), in which he provides more context and examples that you are encouraged to review.

2.2 Model Effects

To understand hierarchical models, we need to understand random effects, and to understand random effects, we need to understand variance.

1. What is variance?
2. How do we deal with variance?
3. What can random effects do?
4. How do I know if I need random effects?
5. What is a hierarchical model? (later)

Variance was historically seen as noise or a distraction; it is the part of the data that the model does not capture well. At times it has been called a nuisance parameter. The traditional view was always that less variance was better. However:

“Twentieth-century science came to view the universe as statistical rather than as clockwork, and variation as signal rather than as noise…” [@leslie2003]

The good news is that there is (often) information in variance. Two problems, however, are that our models don’t always account for the information connected to variance, and that variance is often ubiquitous and can occur at numerous places within the model. For instance, variance can occur among observations and within observations, along with how we parameterize our models and how we take our measurements.

One way to deal with variance concerns how we treat the factors in our model. (Recall that a factor is a categorical predictor that has 2 or more levels.) Specifically, we can treat our factors as fixed or random, and the underlying difference lies in how the factor levels interact with each other.

2.2.1 Fixed Effects

Fixed effects are those model effects that are thought to be separate, independent, and otherwise not similar to each other.

• Likely more common (at least in use) than random effects, if only for the fact that they are the default in most statistical software.
• Treat the factor levels as separate levels with no assumed relationship between them.
  – For example, a fixed effect of sex might include two factor levels (male and female) that are assumed to have no relationship between them, other than that they are different levels of sex.
• Without any relationship, we cannot infer what might exist or happen between levels, even when it might be obvious.
• Fixed effects are also homoscedastic, which means they assume a common variance.
• If you use fixed effects, you would likely need some type of post hoc means comparison (including adjustment) to compare the factor levels.

2.2.2 Random Effects

Random effects are those model effects that can be thought of as units from a distribution, almost like a random variable.

• Random effects are less commonly used, but perhaps more commonly encountered (than fixed effects).
• Each factor level can be thought of as a random variable from an underlying process or distribution.
• Estimation provides inference about specific levels, but also about the population level (and thus absent levels!)
• Exchangeability
• If n is large, factor level estimates are the same as or similar to fixed effect estimates.
• If n is small, factor level estimates draw information from the population-level information.

Comparing Effects

Consider A as a fixed effect and B as a random effect, each with 5 levels. For A, inferences and estimates for the levels are applicable only to those 5 levels. We cannot assume to know what a 6th level would look like, or what a level between two levels might look like. Contrast that with B, where the 5 levels are assumed to represent a much larger (possibly infinite) number of levels, and our inferences therefore extend beyond the data (i.e., we can use estimates and inferences predictively to interpolate and extrapolate).

When are Random Effects Appropriate?

You can find your own support for what you want to do, which means modeling freedom, but you should also be prepared to defend your model structure.
“The statistical literature is full of confusing and contradictory advice.” (Gelman and Hill, 2006)

• You can probably find a reference to support your desire
• Personal preference is Gelman and Hill (2006): “You should always use random effects.” (use = consider)
  – Several reasons why random effects extract more information from your data and model
  – Built-in safety: with no real group-level information, effects revert back to a fixed effects model (essentially)
• Know that fixed effects are a default, which does not make them right

Random effects may not be appropriate when…

• The number of factor levels is very low (e.g., 2); it’s hard to estimate distributional parameters with so little information (but still little risk).
• You don’t want factor levels to inform each other.
  – e.g., male and female growth (a combined estimate could be meaningless and misleading)

Summarizing Kéry (2010): “…as soon as we assume effects are random, we introduce a hierarchy of effects.”

1. Scope of Inference: understand levels not in the model
2. Assessment of Variability: can model different kinds of variability
3. Partitioning of Variability: can add covariates to variability
4. Modeling of Correlations among Parameters: helps understand correlated model terms
5. Accounting for all Random Processes in a Modeled System: acknowledges within-group variability
6. Avoiding Pseudoreplication: better system description
7. Borrowing Strength: weaker levels draw from population effect
8. Compromise between Pooling and No Pooling Batch Effects
9. Combining Information

Do I need a random effect (hierarchical model)?

If you answer Yes to any of these, consider random effects.

1. Can the factor(s) be viewed as a random sample from a probability distribution?
2. Does the intended scope of inference extend beyond the levels of a factor included in the current analysis to the entire population of a factor?
3. Are the coefficients of a given factor going to be modeled?
4. Is there a lack of statistical independence due to multiple observations from the same level within a factor over space and/or time?

Random Effects Equation Notation

There will be a much more extensive treatment of model notation, and you will need to become familiar with notation to successfully code models; however, now is as good a time as any to introduce some basics of random effects statistical notation. (Note that statistical notation and code notation are different, and which one is meant is often left unspecified when people refer to model notation.)

• SLR, no RE: $y_i = \alpha + \beta \times x_i + \epsilon_i$
• SLR, random intercept, fixed slope: $y_i = \alpha_j + \beta \times x_i + \epsilon_i$
• SLR, fixed intercept, random slope: $y_i = \alpha + \beta_j \times x_i + \epsilon_i$
• SLR, random coefficients: $y_i = \alpha_j + \beta_j \times x_i + \epsilon_i$
• MLR, random slopes: $y_i = \alpha + \beta_{j1} \times x_{1i} + \beta_{j2} \times x_{2i} + \epsilon_i$

For the most part, we will use subscript i to index the observation level (data) and subscript j to index groups (and indicate a random effect).

2.3 Hierarchical Models

Because random effects are used to model variation at different levels of the data, they add a hierarchical structure to the model. Let’s start with non-hierarchical models and review how data inform parameters.

Consider a simple model with no hierarchical structure (Figure 2.3). The observations ($y_1 \ldots y_n$) all inform one parameter and the data are said to be pooled.

Perhaps we want to group the data in some logical way. We might add a fixed effect, and then assign certain observations to different parameters (Figure 2.4). In this case, our parameters, θ, are subscripted to indicate that they are different levels within a factor. Despite this apparent connection, the data and the parameters inherent to one group inform only that group, and there is said to be no pooling or sharing of information.

Hierarchical models are a middle ground or compromise approach to dealing with data and parameters. Often we want groups or factor levels in our data
because they represent real differences in the information we are trying to understand. However, often the groups that we want to include are not completely independent of each other. In cases like these, we can structure a model where the data inform factor level parameters, but those factor level parameters are then governed by some additional parameter (Figure 2.5).

Figure 2.3: Diagrammatic representation of complete pooling, a model structure in which all observations inform one parameter.

Figure 2.4: Diagrammatic representation of no pooling, a model structure in which separate observations inform separate parameters.

Figure 2.5: Diagrammatic representation of partial pooling, a model structure in which different observations inform different latent parameters, which are then governed by additional parameters.

Example

A simple example of rationalizing a hierarchical model might be modeling the size of birds. Our y values would be some measure of bird size. But we know that different species inherently attain different sizes (i.e., observations are not independent, as any two size measurements from the same species are much more likely to be similar to each other than to a measurement from an individual of another species). Therefore it makes sense to group the size observations by species, $\theta_j$. However, we also know that there may still be relatedness among species, and as such some additional parameter, ϕ, can be added to govern the collection of species parameters. (To continue with the example, ϕ might represent a taxonomic order.)

Once you understand some examples of how hierarchical structures reflect the ecological systems we seek to model, you will find hierarchical models to be an often realistic representation of the system or process you want to understand, or at least more realistic than the traditional model representations.
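The three structures described in this section (complete pooling, no pooling, and partial pooling) can be sketched numerically with simulated bird sizes. Everything in the block below is hypothetical: the number of species, the sample sizes, and the values of phi, sigma_a, and sigma_y are assumptions chosen for illustration. The partial-pooling step is a simplified precision-weighted average that treats both standard deviations as known; a fitted hierarchical model would estimate them from the data.

```r
set.seed(1)

# Hypothetical settings (assumptions for illustration only)
J       <- 5                    # number of species
n_j     <- c(2, 3, 5, 20, 40)   # observations per species
phi     <- 100                  # overall mean governing the species means
sigma_a <- 10                   # SD among species means
sigma_y <- 15                   # SD among individuals within a species

# Simulate species-level parameters, then individual sizes
theta   <- rnorm(J, mean = phi, sd = sigma_a)
species <- rep(seq_len(J), times = n_j)
y       <- rnorm(length(species), mean = theta[species], sd = sigma_y)

# Complete pooling: all observations inform one parameter
pooled <- mean(y)

# No pooling: each species informs only its own parameter
no_pool <- as.numeric(tapply(y, species, mean))

# Partial pooling: precision-weighted compromise between the two,
# treating sigma_y and sigma_a as known
w       <- (n_j / sigma_y^2) / (n_j / sigma_y^2 + 1 / sigma_a^2)
partial <- w * no_pool + (1 - w) * pooled

round(cbind(n = n_j, weight = w, no_pool, partial, pooled), 2)
```

In this sketch, species with few observations receive a small weight and are pulled more strongly toward the overall mean, while well-sampled species stay close to their own mean.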
2.3.1 Definitions of Hierarchical Models

There is a lot of terminology. From Raudenbush and Bryk (2002):

• multilevel models in social research
• mixed-effects and random-effects models in biometric research
• random-coefficient regression models in econometrics
• variance components models in statistics
• hierarchical models (Lindley and Smith, 1972)

Some useful definitions are:

“…hierarchical structure, in which response variables are measured at the lowest level of the hierarchy and modeled as a function of predictor variables measured at that level and higher levels of the hierarchy.” @wagner2006

“Multilevel (hierarchical) modeling is a generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data.” (Gelman and Hill, 2006)

“The existence of the process model is central to the hierarchical modeling view. We recognize two basic types of hierarchical models. First is the hierarchical model that contains an explicit process model, which describes realizations of an actual ecological process (e.g., abundance or occurrence) in terms of explicit biological quantities. The second type is a hierarchical model containing an implicit process model. The implicit process model is commonly represented by a collection of random effects that are often spatially and/or temporally indexed. Usually the implicit process model serves as a surrogate for a real ecological process, but one that is difficult to characterize or poorly informed by the data (or not at all).” @royle2008

“However, the term, hierarchical model, by itself is about as informative about the actual model used as it would be to say that I am using a four-wheeled vehicle for locomotion; there are golf carts, quads, Smartcars, Volkswagen, and Mercedes, for instance, and they all have four wheels.
Similarly, plenty of statistical models applied in ecology have at least one set of random effects (beyond the residual) and therefore represent a hierarchical model. Hence, the term is not very informative about the particular model adopted for inference about an animal population.” @kery2011

Chapter 3 Fundamentals of Bayesian Inference

3.1 Models vs. Estimation

Need text on actual difference between models and estimation.

Observations are a function of observable and unobservable influences. We can think about the observable influences as data and the unobservable influences as parameters. But even with this breakdown of influences, most systems are too complex to look at and understand. Consider the effects of time, space, unknown factors, interactions of factors, and other things that obscure relationships.

One approach is to start with a simple model that we might know is wrong (i.e., incomplete), but which can be known and understood. Any model is necessarily a formal simplification of a more complex system, and no model is perfect, although models can still be useful. By starting with a simple model we can add complexity as we understand it and hypothesize the mechanisms, rather than trying to start with a complex model that might be hard to understand and work with.

A lot of good things have been said about models, including:

“All models are wrong, but some are useful” -George Box

“There has never been a straight line nor a Normal distribution in history, and yet, using assumptions of linearity and normality allows, to a good approximation, to understand and predict a huge number of observations.” -W.J. Youden

“Nothing is gained if you replace a world that you don’t understand with a model that you don’t understand.” -John Maynard Smith
3.1.1 Model Building

Most model selection procedures attempt to balance model generalizability and specificity, and in putting a model together this balance is also important. Consider whether you seek prediction or understanding; although they may overlap, they do have differences. A model built for (system) understanding may not predict well, but can help to explain the drivers of a system. On the other hand, a model built for prediction may not explain well, but its value is in performance. Often, prediction and understanding are not exclusive, and thought needs to be given to the balance of both in a model. Let’s consider prediction and understanding a little more.

Explanation (understanding)

• Emphasis on understanding a system
• Often simpler models (but not always!)
• Strong focus on parameters, which represent your hypothesis of the system
• Think: Causes of effects

Prediction

• Focus on fitting y
• Often results in more complex models (but not always!)
• Think: Effects of causes

3.1.2 Case Study: Explanation vs Prediction

Reproductive biological work on Southern Flounder (Paralichthys lethostigma) was conducted to determine predictors associated with oocyte development and expected spawning. Because the species, like many other fish species, is not observed on the spawning grounds, all information about maturity has to be collected before fully developed, spawning-capable individuals are available. A wide range of predictors were quantified to examine their correlation to histological samples of ovarian tissue. In addition to identifying reliable predictors, value was placed on simplicity: the fewer predictors needed, the more useful the model would be.

AIC was able to determine a best-fitting model; however, there were several competing models, all of which tended to have a large number of predictors. Upon closer evaluation, it appeared that a large number of models all performed relatively well when compared with each other.
This cloud of model points warranted further investigation. When evaluating model performance based on cross-validation, a large number of simpler models were found to be effective. These results highlight that for this particular dataset system understanding was optimized by AIC, while system prediction was optimized by cross-validation.

Figure 3.1: Table of best-fitting models as determined by AIC (Midway et al. 2013).

Figure 3.2:

Figure 3.3:

3.1.3 Models vs Estimation

Consider a simple linear regression model: $y_i = \alpha + \beta \times x_i + \epsilon_i$. How might we come up with estimates of the model parameters, namely $\alpha$ and $\beta$? To do this, we need an estimation routine, and there is more than one to choose from. It’s important to know that the estimation routine we select will have different operating assumptions, different underlying machinery, and may produce different results (estimates). That being said, for simple models and clear data, different estimation routines may result in very similar outcomes. Regardless, it remains very important to remember that models and estimation are separate components of statistics. We might all agree on a model, but not the estimation (or the opposite). The linkage between models and estimation is often the parameters; parameters are what define a model, and parameters can be estimated by different methods.

“…there is no ‘Bayesian so-and-so model’.” (Kéry and Schaub, 2011)

3.1.4 What is a parameter?

Parameters are system descriptors. Think of a parameter as something that underlies and influences a population, whereas a statistic describes a sample. For example, a population growth rate parameter, $\lambda$, may describe the rate of change in the size of a population, whereas some difference statistic, $N_d$, may describe the difference in the size of the population over some time interval.
In addition to making sure we know the parameters (and their configuration) that serve as hypotheses for systems and processes, the interpretation of parameters is also at the foundation of different statistical estimation procedures and philosophies.

Figure 3.4: An attempt at humor while illustrating different statistical philosophies.

Figure 3.5: The Reverend Thomas Bayes

3.2 Bayesian Basics

3.2.1 Why learn Bayesian estimation?

“Our answer is that the Bayesian approach is attractive because it is useful. Its usefulness derives in large measure from its simplicity. Its simplicity allows the investigation of far more complex models than can be handled by the tools in the classical toolbox.” @link2009

“In order to understand and command complexity, you need to revisit simplicity, and when you go back to basics, you gain deeper understanding.” Midway

Thomas Bayes was a Presbyterian minister who is thought to have been born in 1701 and who died in 1761. His An Essay towards solving a Problem in the Doctrine of Chances, in which he introduced his approach to probability, was published posthumously in 1763. Bayesian approaches were never really practiced in Bayes’ lifetime: not only was his work not published until after he died, but it was not developed and popularized until Pierre-Simon Laplace (1749–1827) did so in the early 19th century. (Fun fact according to Wikipedia: His [Laplace] brain was removed by his physician, François Magendie, and kept for many years, eventually being displayed in a roving anatomical museum in Britain. It was reportedly smaller than the average brain.)

Bayes’ rule, in his words:

Given the number of times in which an unknown event has happened and failed: Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.
• “unknown event” = e.g., Bernoulli trial
• “probability of its happening in a single trial” = p
• We may or may not know p ahead of time

Consider an example where 10 chicks hatch and 4 survive. Bayes attempts to draw a conclusion, such as “The probability that p (survival) is between 0.23 and 0.59 is 0.80.” The two degrees of probability are an interval, $Pr(a \le p \le b)$. The overall idea is similar to a confidence interval in that both try to account for uncertainty, but a confidence interval is not a probability statement about the parameter, so the two are not the same.

A key distinction between Bayesians and Frequentists is how uncertainty regarding a parameter $\theta$ is treated. Frequentists view parameters as fixed, and probability is the long-run frequency of events in hypothetical datasets. The result is that probability statements are made about the data, not the parameters! A Frequentist could never state: “I am 95% certain that this population is declining.” (Note: in order to learn about Bayesian approaches from a practical standpoint, we will often consider them against the Frequentist approach for comparison.)

But for a Bayesian, probability is the belief that a parameter takes a specific value. “Probability is the sole measure of uncertainty about all unknown quantities: parameters, unobservables, missing or mis-measured values, future or unobserved predictions” (Kéry and Schaub, 2011). When everything is a probability, we can use the mathematical laws of probability. One way to get started is to think of parameters as random variables (but technically they are not).

3.2.2 Bayesian vs. Frequentist Comparison

1. Both start with a data distribution (DGF)
2. Data, $y$, is a function of parameter(s), $\theta$
3. Example: $p(y|\theta) \sim Pois(\theta)$, which is often abbreviated to $y|\theta \sim Pois(\theta)$ or $y \sim Pois(\theta)$
4. Frequentists then use the likelihood function to interpret the distribution of the observed data as a function of unknown parameters, $L(\theta|y)$.
But likelihoods do not integrate to 1 and are therefore not probability distributions for $\theta$
5. Frequentists estimate a single point, the maximum likelihood estimate (MLE), which is the parameter value that maximizes the chance of getting the observed data
6. See Kéry and Schaub (2011) for an extended example of MLE

3.2.3 Bayesians use Bayes’ Rule for inference

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

Bayes’ Rule is a mathematical expression of the simple relationship between conditional and marginal probabilities.

Example of Bayes’ Rule

On a given day you go bird watching (B) or watch football (F), and the weather is good (g) or bad (b). We know:

1. $p(g, B) = 0.5$ (joint)
2. $p(g) = 0.6$ (marginal)
3. $p(B) = 0.7$ (marginal)

If you are watching football, what is the best guess as to the weather?

                      Bird watching (B)   Watch football (F)   Total
    Good weather (g)        0.5                 0.1             0.6
    Bad weather (b)         0.2                 0.2             0.4
    Total                   0.7                 0.3             1.0

Rephrased, we are asking for $p(b|F)$. We know:

1. $p(b, F) = 0.2$
2. $p(F) = 0.3$
3. $p(b|F) = \frac{p(b, F)}{p(F)} = \frac{0.2}{0.3} \approx 0.67$

The marginal probability of bad weather is 0.4, but because football is more likely in bad weather, knowing that football is being watched raises our estimate to 0.67.

3.2.4 Breaking down Bayes’ Rule

$$P(\theta|y) = \frac{P(y|\theta) \times P(\theta)}{P(y)}$$

where

1. $P(\theta|y)$ = posterior distribution
2. $P(y|\theta)$ = likelihood function
3. $P(\theta)$ = prior distribution
4. $P(y)$ = normalizing constant

Another way to think about Bayes’ Rule:

$$P(\theta|y) \propto P(y|\theta) \times P(\theta)$$

Combine the information about parameters contained in the data $y$, quantified in the likelihood function, with what is known or assumed about the parameter before the data are collected and analyzed, the prior distribution.

3.3 Bayesian and Frequentist Comparison

3.3.1 Example with Data

Consider this simple dataset of two groups, each with 8 measurements.

    y <- c(-0.5, 0.1, 1.2, 1.2, 1.2, 1.85, 2.45, 3.0,
           -1.8, -1.8, -0.4, 0, 0, 0.5, 1.2, 1.8)
    g <- c(rep('g1', 8), rep('g2', 8))

We might want to consider a t-test to compare the groups (means).
In this case, we are estimating 5 parameters from 16 data points.

    t.test(y ~ g)

    ##
    ##  Welch Two Sample t-test
    ##
    ## data:  y by g
    ## t = 2.2582, df = 13.828, p-value = 0.04064
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  0.06751506 2.68248494
    ## sample estimates:
    ## mean in group g1 mean in group g2
    ##           1.3125          -0.0625

Our t-test results in a p-value of 0.0406, which is less than the traditional cutoff of 0.05 and allows us to conclude that the group means are significantly different. At this point in our study, we would write up our model results and likely conclude the analysis. The frequentist t-test is simple, but limited.

Let’s try a Bayesian t-test. Here we see a posterior distribution of our results. The mean difference between groups was estimated to be 1.37, and the 95% credible interval around that difference is reported as -0.248 to 3.01. A credible interval is the central portion of the posterior distribution (95% of its area in this case) and represents the uncertainty in our point estimate. The percentage can be interpreted directly as the probability that the interval contains the true parameter. So in this example, we can say that we are 95% certain that the true difference in means lies between -0.248 and 3.01, with our best guess being 1.37.

Another thing we can use credible intervals for is “significance” interpretation. If a credible interval overlaps 0, it can be interpreted that 0 is within the range of plausible estimates and the difference is therefore not “significant.”
(If 0 were to be outside our 95% CI, we would have evidence that 0 is not a very likely estimate and therefore the group means are “statistically significant.”) It’s worth noting that the concept of statistical significance may be alive and well in Bayesian estimation; however, you need to define how the term is used because there is no a priori significance level as there is in frequentist routines.

Consider also that the frequentist t-test found a significant difference and the Bayesian t-test did not. In reality, with simple (and even more complex) datasets, both types of estimation will often arrive at the same answer. This example was generated to show that it is possible to reach different significance conclusions using the same data. Perhaps what is more important in this example is to ask yourself, if two types of estimation reach different conclusions about the data, which are you more likely to trust: the procedure that provides little information and spits out a yes or no, or the procedure that is fully tractable and provides results and estimation for all parts of the model?

But wait, there’s more! Because Bayesian estimation assumes underlying distributions for everything, we get richer results. All estimated parameters have their own posterior distributions, even the group-specific standard deviations.

Despite the benefits illustrated in this example, it’s important to know that the Bayesian approach is not inherently the correct way to approach estimation. In fact, we cannot know which method is correct. All we can know is that we are presented with a choice as to which method provides richer information on which we can base a decision.
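To make the idea of a probability statement about a parameter concrete, the sketch below approximates the posterior of the difference in group means for the dataset above using base R only. It is not the Bayesian t-test that produced the estimates reported in this section: it assumes normal data within each group and a standard noninformative prior, under which each group mean has a shifted and scaled t posterior. The helper name `draw_mu` and the number of draws are illustrative choices, so the interval it gives should not be expected to match the one reported above.

```r
# Data from the example above
y <- c(-0.5, 0.1, 1.2, 1.2, 1.2, 1.85, 2.45, 3.0,
       -1.8, -1.8, -0.4, 0, 0, 0.5, 1.2, 1.8)
g <- c(rep('g1', 8), rep('g2', 8))

set.seed(1)        # so the simulation is repeatable
n_draws <- 100000  # number of posterior draws (arbitrary choice)

# Assumption: normal likelihood with the noninformative prior
# p(mu, sigma^2) proportional to 1 / sigma^2. Under that assumption the
# marginal posterior of a group mean is
#   mean(x) + (sd(x) / sqrt(n)) * t,  with t on n - 1 degrees of freedom.
draw_mu <- function(x, n_draws) {
  mean(x) + sd(x) / sqrt(length(x)) * rt(n_draws, df = length(x) - 1)
}

mu1 <- draw_mu(y[g == 'g1'], n_draws)
mu2 <- draw_mu(y[g == 'g2'], n_draws)
diff_draws <- mu1 - mu2  # posterior draws of the difference in means

post_mean     <- mean(diff_draws)                       # point estimate
cred_int      <- quantile(diff_draws, c(0.025, 0.975))  # 95% credible interval
prob_positive <- mean(diff_draws > 0)                   # Pr(mu1 > mu2 | data)

post_mean
cred_int
prob_positive
```

Here `prob_positive` is the kind of direct statement (the posterior probability that group 1 has the larger mean) that a p-value does not provide.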
3.3.2 More differences

Frequentists: p(data|θ)
• Data are repeatable, random variables
• Inference from (long-run) frequency of data
• Parameters do not change (the world is fixed and unknowable)
• All experiments are (statistically) independent
• Results driven by point estimates
• Null hypothesis emphasis; accept or reject

Bayesians: p(θ|data)
• The data are fixed (data is what we know)
• Parameters are random variables to estimate
• Degree-of-belief interpretation from probability
• We can update our beliefs; analyses need not be independent
• Driven by distributions and uncertainty

3.3.3 Put another way

A Frequentist asks: The world is fixed and unchanging; therefore, given a certain parameter value, how likely am I to observe data that supports that parameter value?

A Bayesian asks: The only thing I can know about the changing world is what I can observe; therefore, based on my data, what are the most likely parameter values I could infer?

Given the strengths and weaknesses of both types of estimation, why did the Frequentist approach dominate for so long, even to the present? The best explanation includes several reasons. Frequentist routines are computationally simple compared to Bayesian approaches, which has permitted them to be formalized into point-and-click routines that are available to armchair statisticians. Additionally, the creation and popularization of ANOVA ran parallel to the rise in popularity of Frequentist approaches, and ANOVA provided an excellent model for many data sets. Finally, although the limitations and issues with p-values are well-publicized, there has been historic appeal in statistical complexity being reduced to a significant or non-significant outcome.
3.3.4 Uncertainty: Confidence Interval vs Credible Interval

Consider the point estimate of 5.12 and the associated uncertainty of 4.12–6.12.

Frequentist interpretation: Based on repeated sampling, 95% of intervals constructed this way would contain the true, unknowable parameter. The interval 4.12–6.12 is one such interval; it either contains the true parameter or it does not.

Bayesian interpretation: Based on the data, we are 95% certain that the interval 4.12–6.12 contains the parameter value, with the most likely value being 5.12 (the mean/mode of the distribution).

These are different interpretations!

3.3.5 Bayesian Pros and Cons

Pros
• Can accommodate small sample sizes
• Uncertainty accounted for easily
• Degree-of-belief interpretation
• Very flexible with respect to model design
• Modern computers make computation possible

Cons
• Still need to code (but also a pro?)
• Still computationally intensive in some applications

Other Practical Advantages of Bayesian Estimation
• Easy to use different distributions (e.g., t-test with t-distribution)
• Free examination and comparisons
  – Comparing standard deviations
  – Comparing factor levels
  – The data don’t change, why should the p-value?
• Uncertainty for ALL parameters and derived quantities
• Credible intervals make more sense than confidence intervals
• Probabilistic statements about parameters and inference
• With diffuse priors, approximate MLE (no estimation risk)
• Test statistics and model outcome NOT dependent on sampling intentions
  – Frequentist experiments need an a priori sample size (do you do this?)
  – Test statistics vary, which means p-values vary, which means decisions change
  – The data don’t change, why should the p-value?

Chapter 4 Bayesian Machinery

4.1 Bayes’ Rule

$$P(\theta|y) = \frac{P(y|\theta) \times P(\theta)}{P(y)}$$

where
1. $P(\theta|y)$ = posterior distribution
2. $P(y|\theta)$ = likelihood function
3. $P(\theta)$ = prior distribution
4. $P(y)$ = normalizing constant

4.1.1 Posterior Distribution: p(θ|y)

The posterior distribution (often abbreviated as the posterior) is simply the result of computing Bayes’ Rule for a set of data and parameters. Because we don’t get point estimates for answers, we correctly call it a distribution, and we add the term posterior because this is the distribution produced at the end. You can think of the posterior as a statement about the probability of the parameter value given the data you observed.

“Reallocation of credibilities across possibilities.” - John Kruschke

[Figure: histogram of posterior samples of θ (frequency vs. θ) and trace plot of θ by iteration]

4.1.2 Likelihood Function: p(y|θ)

• Skip the math
• Consider it similar to other likelihood functions
• In fact, it will give you the same answer as the ML estimator (interpretation differs)

4.2 Priors: p(θ)

• The distribution we give to a parameter before computation
• WARNING: This is historically a big deal among statisticians, and subjectivity is a main concern cited by Frequentists
• Priors can have very little, if any, influence (e.g., diffuse, vague, noninformative, unspecified, etc.), yet all priors are technically informative
• Much of ecology uses diffuse priors, so there is little concern
• But priors can be practical if you really do know information (e.g., even basic information, like populations can’t take negative values)
• Simple models may not need informative priors; complex models may need priors

Figure 4.1: Prior information can be useful.

You may not use informative priors when starting to model. Regardless, always think about your priors, explore how they work, and be prepared to defend them to reviewers and other peers.
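One way to explore how a prior behaves is to try several on a small problem. The sketch below reuses the chick survival example from Section 3.2.1 (10 chicks hatch, 4 survive) and assumes a binomial likelihood with a Beta prior on survival p, a combination for which the posterior is also a Beta distribution. The three priors and their labels are hypothetical choices for illustration, not recommendations.

```r
survived <- 4
hatched  <- 10

# Hypothetical Beta(a, b) priors on the survival probability p
priors <- list(
  flat        = c(a = 1,  b = 1),   # every value of p equally plausible
  weak        = c(a = 2,  b = 2),   # mild pull toward 0.5
  informative = c(a = 30, b = 10)   # strong prior belief that p is near 0.75
)

# Conjugate update: a Beta(a, b) prior combined with binomial data gives a
# Beta(a + successes, b + failures) posterior
summarize_posterior <- function(prior) {
  a_post <- prior[["a"]] + survived
  b_post <- prior[["b"]] + (hatched - survived)
  c(mean  = a_post / (a_post + b_post),
    lower = qbeta(0.10, a_post, b_post),   # 80% credible interval, lower bound
    upper = qbeta(0.90, a_post, b_post))   # 80% credible interval, upper bound
}

# One row per prior; columns are posterior mean and interval bounds
posterior_summary <- t(sapply(priors, summarize_posterior))
round(posterior_summary, 2)
```

With only 10 observations, the informative prior moves the posterior mean from about 0.42 (flat prior) to 0.68; with more data the same prior would matter less.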
“So far there are only few articles in ecological journals that have actually used this asset of Bayesian statistics.” - Marc Kery (2010)

4.3 Normalizing Constant: P(y)

The normalizing constant is a function that converts the area under the curve to 1. While this may seem technical (and it is), this is what allows us to interpret Bayesian output probabilistically. The normalizing constant is a high-dimensional integral that in most cases cannot be solved analytically. But we need it, so we have to simulate it. To do this, we use Markov Chain Monte Carlo, MCMC.

Figure 4.2: Example of prior, likelihood, and posterior distributions.

Figure 4.3: Example of prior influence based on prior parameters.

Figure 4.4: MCMC samplers are designed to sample parameter space with a combination of dependency and randomness.

4.3.1 MCMC Background

• Stan Ulam: Manhattan Project scientist
• The solitaire problem: How do you know the chance of winning?
• Can’t really solve… too much complexity
• But we can automate a bunch of games and monitor the results; basically we can do something so many times that we assume the simulations are approximating the real thing.

Fun Fact: There are 80,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000 (about 8 × 10^67) solitaire combinations!

Markov Chain: transitions from one state to another (dependency)
Monte Carlo: chance applied to the transition (randomness)

• MCMC is a group of functions, governed by specific algorithms
• Metropolis-Hastings algorithm: one of the first algorithms
• Gibbs algorithm: splits multidimensional θ into separate blocks, reducing dimensionality
• Consider MCMC a black box, if that’s easier

4.3.2 MCMC Example

A politician on a chain of islands wants to spend time on each island proportional to each island’s population.

1. After visiting one island, she needs to decide…
• stay on current island
• move to island to the west
• move to island to the east
2. But she doesn’t know the overall population; she can ask the current islanders their population and the population of adjacent islands
3. She flips a coin to decide between the east or west island
• if the selected island has a larger population, she goes
• if the selected island has a smaller population, she goes probabilistically

MCMC is a set of techniques to simulate draws from the posterior distribution $p(\theta|x)$ given a model, a likelihood $p(x|\theta)$, and data $x$, using dependent sequences of random variables. That is, MCMC yields a sample from the posterior distribution of a parameter.

4.3.3 Gibbs Sampling

One iteration includes as many random draws as there are parameters in the model; in other words, the chain for each parameter is updated by using the last value sampled for each of the other parameters, which is referred to as full conditional sampling.

Although the nuts and bolts of MCMC can get very detailed and may go beyond the operational knowledge you need to run models, there are some practical issues that you will need to be comfortable handling, including initial values, burn-in, convergence, and thinning.

4.3.4 Burn-in

• Chains start with an initial value that you specify or randomize
• The initial value may not be close to the true value
• This is OK, but the chain needs time to find the correct parameter space
• If you know your high probability region, then you may have burned in already
• Visual assessment can confirm burn-in

Figure 4.5: Visualizing parameter sampling

Figure 4.6: Burn-in is the initial MCMC sampling that may take place outside of the highest probability region for a parameter.

Figure 4.7: Clean non-convergence for 3 chains.
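The island-hopping rule and the idea of burn-in can be sketched in a few lines of R. This is a minimal random-walk Metropolis sampler written for illustration, not the sampler JAGS uses (JAGS chooses its own). It assumes the chick survival example from Section 3.2.1 (4 of 10 survive), a binomial likelihood, and a flat prior on p; the proposal standard deviation, the deliberately poor starting value, and the burn-in length are arbitrary choices.

```r
set.seed(42)
survived <- 4
hatched  <- 10

# Log posterior up to a constant: binomial log-likelihood plus a flat prior on
# (0, 1). The normalizing constant P(y) is never needed.
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)   # outside the support of p
  dbinom(survived, hatched, p, log = TRUE)
}

n_iter   <- 10000
chain    <- numeric(n_iter)
chain[1] <- 0.99                       # deliberately poor initial value

for (t in 2:n_iter) {
  current  <- chain[t - 1]
  proposal <- current + rnorm(1, mean = 0, sd = 0.1)  # propose a nearby value
  # A more probable proposal is always accepted (the "larger island");
  # a less probable one is accepted with probability equal to the ratio
  log_ratio <- log_post(proposal) - log_post(current)
  if (log(runif(1)) < log_ratio) {
    chain[t] <- proposal
  } else {
    chain[t] <- current                # stay put
  }
}

burn_in <- 1000                        # discard early, start-dependent samples
kept    <- chain[(burn_in + 1):n_iter]

mean(kept)                             # posterior mean of p
quantile(kept, c(0.10, 0.90))          # 80% credible interval
```

Plotting `chain` against iteration (for example with `plot(chain, type = "l")`) shows the early samples walking away from 0.99 before settling into the high-probability region, which is the burn-in behavior described above.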
4.3.5 Convergence

• We run multiple independent chains for stronger evidence of correct parameter space
• When chains converge on the same space, that is strong evidence for convergence
• But how do we know or measure convergence?
• Averages of the functions may converge (chains don’t technically converge)

Convergence Diagnostics

1. Visual convergence of iterations (“hairy caterpillar” or the “grass”)
2. Visual convergence of histograms
3. Brooks-Gelman-Rubin statistic, $\hat{R}$
4. Others

Figure 4.8: Non-convergence is not always obvious. These chains are not converging despite overlapping.

Figure 4.9: Histograms and density plots are a good way to visualize convergence.

4.3.6 Thinning

MCMC chains are autocorrelated, so $\hat{\theta}_t \sim f(\hat{\theta}_{t-1})$. It is common practice to thin by 2, 3, or 4 to reduce autocorrelation. However, there are also arguments against thinning.

4.3.7 MCMC Summary

There is some artistry in MCMC, or at least some decisions that need to be made by the modeler. The final number of samples in your posterior is often much less than your total iterations, because in handling the MCMC iterations you will need to eliminate some samples (e.g., burn-in and thinning). Many MCMC adjustments you make will not result in major changes, and this is typically a good thing because it means you are in the parameter space you need to be in. Other times, you will have a model issue and some MCMC adjustment will make a (good) difference. Because computation is cheap, especially for simple models, it is common to over-do the iterations a little. This is OK.

Here is a nice overview video about MCMC. v=OTO1DygELpY

And here is a great simulator to play with to evaluate how changes in MCMC settings play out visually.
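The Brooks-Gelman-Rubin statistic listed above can be computed by hand from a set of chains, which helps show what it measures. The sketch below implements the basic (non-split) version, comparing between-chain and within-chain variance. The function name `rhat` and the simulated chains are illustrative assumptions: independent normal draws stand in for real MCMC output.

```r
# chains: numeric matrix with one column per chain and one row per saved iteration
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))      # average within-chain variance
  B <- n * var(colMeans(chains))        # between-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled estimate of the posterior variance
  sqrt(var_plus / W)                    # near 1 when chains agree
}

set.seed(7)
n_iter <- 1000

# Three chains sampling the same distribution (stand-in for converged chains)
mixed <- cbind(rnorm(n_iter, 0, 1), rnorm(n_iter, 0, 1), rnorm(n_iter, 0, 1))

# Three chains where one is stuck in a different region of parameter space
stuck <- cbind(rnorm(n_iter, 0, 1), rnorm(n_iter, 0, 1), rnorm(n_iter, 3, 1))

rhat_mixed <- rhat(mixed)   # should be close to 1
rhat_stuck <- rhat(stuck)   # should be well above 1.1

# Thinning by 2 keeps every second saved iteration of each chain
thinned <- mixed[seq(1, n_iter, by = 2), ]

rhat_mixed
rhat_stuck
```

Software typically reports refined versions of this statistic, so values in a model summary may differ slightly from this hand calculation.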
Chapter 5 Introduction to JAGS

5.1 WinBUGS

• The BUGS language was developed in the 1990s
• There are point-and-click programs called WinBUGS and OpenBUGS
• WinBUGS made more of a splash because it made Bayesian estimation accessible
• WinBUGS can be slow and fail to converge…so other software has been developed

5.1.1 BUGS? JAGS?

• BUGS was first on the scene, standalone or run through R
• JAGS was originally for BUGS models on Unix, but was also developed with R in mind
• BUGS and JAGS differences tend to be minor (Plummer, 2012)
• JAGS has really taken over from BUGS, practically

Figure 5.1: The BUGS logo.

5.1.2 STAN?

• STAN is newest, developed by Gelman et al.
• STAN fits models in C++, but can also be run through R
• STAN is more different from the other two; more language differences, more code, and fitting differences, but it also offers some improvements (diagnostics) for more complex models.

Figure 5.2: The STAN logo.

5.2 JAGS

1. Software can be called from R
2. JAGS is functionally for the MCMC
3. All you need to do
• Specify the model likelihood
• Specify statistical distributions
• JAGS does the rest…
4. JAGS is not a procedural language like R; you can write the model code in any order

5.2.1 General steps to fitting a model in JAGS

1. Define model in BUGS language (write in R and export as text file to be read by JAGS)
2. Bundle data
3. Provide initial values for parameters
4. Set MCMC settings
• # iterations
• # to thin
• # to burn-in
• # of chains
5. Define parameters to keep track of (i.e., parameters of interest)
6. Call JAGS from R to fit the model
7. Assess convergence

5.2.2 Define the BUGS Model: Types of Nodes

Constants: specified in the data file; fixed value(s)
• example: number of observations, n

Stochastic: variables given a distribution
• example: variable ~ distribution(parameter 1, parameter 2, . . .)
• $y_i \sim N(\mu, \sigma^2)$ is written as y[i] ~ dnorm(mu, tau)
• Note that the variance is expressed as a precision ($\tau$), where $\tau = \frac{1}{\sigma^2}$

Deterministic: logical functions of other nodes
• example: mu[i] <- alpha + beta * x[i]

5.2.3 Arrays and Indexing

Arrays are indexed by terms within square brackets
• 1 term indexed: y[i] ~ dnorm(mu, tau)
• 2 term indexed: mu.p[i,j] <- psi[i,j] * w[i]
• 3 term indexed: mu[i,j,k] <- z[i,k] * gamma[i,j,k]

5.2.4 Repeated Structures

for loops

    for(i in 1:n){ # loop over observations 1 to n
      # list of statements to be repeated for increasing values of loop index i
    } # close loop

Recall that n is a constant and needs to be defined!

5.2.5 Likelihood Specification

Assume the response variable $Y$ with $n$ observations is stored in a vector y with elements $y_i$. The stochastic part of the model is written as:

$$Y \sim distribution(\phi)$$

where $\phi$ is the parameter vector of the assumed distribution. The parameter vector is linked with some explanatory variables, $X_1, X_2, ..., X_p$, using a link function $h$:

$$\phi = h(\theta, X_1, X_2, ..., X_p)$$

where $\theta$ is a set of parameters used to specify the link function and the final model structure. In generalized linear models, the link function links the parameters of the assumed distribution with a linear combination of predictor variables.

    Family              Default link function
    binomial            logit
    gaussian            identity
    Gamma               inverse
    inverse gaussian    1/µ^2
    poisson             log
    quasi               identity (with variance = constant)
    quasibinomial       logit
    quasipoisson        log

5.2.6 BUGS Syntax

    for(i in 1:n){
      y[i] ~ distribution(parameter1[i], parameter2[i])
      parameter1[i] <- [function of theta and X’s]
      parameter2[i] <- [function of theta and X’s]
    }

Note: Not all parameters will be indexed; it depends on the model.

5.2.7 Simple Example

You would like to run a model to draw inference about the mean.
The data include 30 observations of fish length ($y_i$, $i = 1, ..., 30$).

Statistical Model

$$y_i \sim N(\mu, \sigma^2)$$

Priors

$$\mu \sim N(0, 100)$$
$$\sigma \sim U(0, 10)$$

5.2.8 Derived Quantities

One of the nicest things about a Bayesian analysis is that parameters that are functions of primary parameters, along with their uncertainty (e.g., standard errors or credible intervals), can easily be obtained using the MCMC posterior samples. We just add a line to the JAGS model that computes the derived quantity of interest, and we directly obtain samples from the posterior distributions of not only the original estimated parameters, but also the derived relationships we seek to quantify. In a frequentist mode of inference, this would require application of the delta method (or various procedures that correct for computations), which is more complicated and also makes more assumptions. In the Bayesian analysis, estimation error is automatically propagated into functions of parameters.

“Derived quantities: One of the greatest things about a Bayesian analysis.” Kery 2010

5.3 Convergence

• Think about priors: do they need to be more informative?
• Starting/initial values: make sure they are reasonable
• Standardize data (predictors): $z_i = (x_i - \bar{x})/s_x$
  – Slope and intercept parameters tend to be correlated
  – MCMC sampling from tightly correlated distributions is difficult (samplers can get stuck)
  – Standardizing data reduces correlation (mean-centering) and allows us to set vague priors no matter what the scale of the original data
  – Consider standardizing by 2 SDs (Gelman and Hill, 2006)
• Visually inspect trace plots and posterior densities
• Gelman-Rubin statistic ($\hat{R}$)
  – It compares the within-chain variance to the between-chain variance. Is there a chain effect?
  – Values near 1 indicate likely convergence; a value of ≤ 1.1 is considered acceptable by some

5.4 Additional Resources

• Kéry (2010), Chapter 5
• Lykou and Ntzoufras (2011)
• Plummer (2012) (JAGS User Manual)
• Spiegelhalter et al. (2003) (WinBUGS User Manual)

5.5 JAGS in R: Model of the Mean

This tutorial will work through the code needed to run a simple JAGS model, where the mean and variance are estimated using JAGS. The model is likely not very useful, but the objective is to show the preparation and coding that goes into a JAGS model.

The first thing we need to do is load the R2jags library.

    library(R2jags)

5.5.1 Generate some data

For this exercise, let’s generate our own data so we know the answer.

    n <- 50
    mu <- 1.12
    sigma <- 0.38

    # Generate values (stochastic part of the model)
    set.seed(214) # so we all get the same random numbers
    yi <- rnorm(n, mean = mu, sd = sigma)

    # Mean and SD of sample
    mean(yi)
    sd(yi)

    # Get frequentist confidence interval for the mean
    t.test(yi)$conf.int[1:2]

5.5.2 Define Model

We will write the model directly in R using JAGS syntax. Note that the model is entirely character data (according to R) and is sunk to an external text file. Note that you can write the model at any point prior to running the model, but I find it useful to code the model first, as several other preparations we will make depend on the model and are much easier to do with a model to reference.

    sink("model.txt")
    cat("
    model {
      # Likelihood

      # Priors

      # Derived quantities

    } # end model
    ", fill = TRUE)
    sink()

5.5.3 Bundle Data

We need to bundle our data into a list that JAGS will be able to read. This is very simple in this exercise and generally very simple in more complex models. Just make sure you have all the elements that the model needs, which may extend beyond data. Using an equals sign (not the assignment operator), set the data that JAGS is looking for on the left side to the input data on the right side.
    data <- list(y = yi, n = n)

5.5.4 Initial Values

JAGS will generate random starting values if they are not specified, and for simple models this should work. However, it is strongly suggested to get in the habit of providing starting values because eventually you will need to start the MCMC chains in a high-probability area. As with the data, list starting values for any parameters in the model.

    inits <- function() {
      list(mu = rnorm(1), sigma = runif(1))
    }

5.5.5 MCMC Settings

You can adjust or set the MCMC settings directly in the JAGS command, but I recommend against hardcoding these settings as you will likely want an easy way to modify them. This simple code adjusts the settings, and the JAGS command will never need to be modified.

    ni <- 1000  # iterations
    nt <- 2     # thinning rate
    nb <- 500   # burn-in
    nc <- 3     # chains

5.5.6 Parameters to Monitor

JAGS models can get complex and often have many (hundreds or thousands) of parameters. Here we will specify that we want to monitor both parameters in this model, but in the future, you may want to specify only some parameters of interest.

    parameters <- c("mu", "sigma")

5.5.7 Run JAGS model

The command jags() runs the model, with the arguments below. Again, we have specified everything outside this command, which makes things a bit more manageable.

    out <- jags(data, inits, parameters, "model.txt",
                n.chains = nc, n.thin = nt, n.iter = ni, n.burnin = nb)

5.5.8 Assess Convergence and Model Output

We will get into this more, but the first thing to do is print the model object summary. This will give you a convergence statistic as a place to start. We can also run traceplots and density plots for visual convergence.

    print(out, digits = 3)

Also, get comfortable working with the JAGS model object, which stores a lot of information. Inspect it here.

    str(out)

How would you get the posterior mean out of the JAGS model object without using the summary function? How would you plot the posterior by hand (it need not be pretty)?
How would you make traceplots? How would you make density plots?

There are a number of packages that can also be used to plot JAGS output and make things simpler. However, it is not a bad idea to become comfortable with the JAGS model object, as there may come a model that has sufficiently complex output that a canned diagnostic does not work.

Chapter 6 Simple Models in JAGS

6.1 Revisiting Hierarchical Structures

6.1.1 What is a hierarchical model?

Parent and Rivot (2012): A model with three basic levels

1. A data level that specifies the probability distribution of the observables at hand given the parameters and the underlying processes
2. A latent process level depicting the various hidden ecological mechanisms that make sense of the data
3. A parameter level identifying the fixed quantities that would be sufficient, were they known, to mimic the behavior of the system and to produce new data statistically similar to the ones already collected

Also

1. Observation model: distribution of data given parameters
2. Structural model: distribution of parameters, governed by hyperparameters
3. Hyperparameter model: parameters on priors

Other Definitions

• “Estimating the population distribution of unobserved parameters” (Gelman et al., 2013)
• “Sharing statistical strength”: we can make inferences about data-poor groups using the entire ensemble of data
  – Pooling (partially) across groups to obtain more reliable estimates for each group

Figure 6.1:

6.1.2 Hierarchical Model Example

“Similar, but not identical” can be shown to be mathematically equivalent to assuming that unit-specific parameters, $\theta_i$, $i = 1, ..., N$, arise from a common “population” distribution whose parameters are unknown (but assigned appropriate prior distributions):

$$\theta_i \sim N(\mu, \sigma^2)$$
$$\mu \sim N(., .); \quad \sigma \sim unif(., .)$$
We learn about $\theta_i$ not only directly through $y_i$, but also indirectly through information from the other $y_j$ via the population distribution parameterized by $\phi$ (here, $\mu$ and $\sigma$).

Example: A normal-normal mixture model

• Let $y_j = (y_{1j}, y_{2j}, \ldots, y_{nj})$ denote the replicate measurements made on subject $j$, for each of $j = 1, 2, \ldots, J$ subjects
• Observation model: $y_{ij} \sim N(\alpha_j, \sigma^2)$
  – $y_{ij}$ is assumed normally distributed with a mean that depends on a subject-specific parameter, $\alpha_j$
  – Assume that the $\alpha_j$'s come from a common population distribution, $\alpha_j \sim N(\mu, \sigma_\alpha^2)$

Ecological data are characterized by:
- Observations measured at multiple spatial and/or temporal scales
- An uneven number of observations measured for any given subject or group of interest (i.e., unbalanced designs)
- Observations that lack independence (i.e., the value of a measurement at one location or time period influences the value of another measurement made at a different location or time period)

Hierarchical models accommodate these features, which makes them widely applicable to many (most) ecological investigations.

Desirable properties of hierarchical models

1. Accommodate lack of statistical independence
2. Scope of inference
3. Quantify and model variability at multiple levels
4. Ability to “borrow strength” from the entire ensemble of data (a phenomenon referred to as shrinkage)

6.1.3 Borrowing Strength?

• Make use of all available information
• Results in estimators that are a weighted composite of information from an individual group (e.g., species, individual, reservoir, etc.)
and the relationships that exist in the overall sample
• Could fit separate ordinary least squares (OLS) regressions to each group (e.g., reservoir) and obtain estimates of a slope and intercept
• OLS will give estimates of the parameters; however, they may not be very accurate for any given group
• Accuracy depends on the sample size within a group ($n_j$) and the range represented in the level-1 predictor variable, $X_{ij}$
  – If $n_j$ is small, then the intercept estimate will be imprecise
  – If the sample size is small or the range of $X$ is restricted, the slope estimate will be imprecise

Hierarchical models allow for taking into account the imprecision of OLS estimates

• More weight is given to the grand mean when there are few observations within a group and when the within-group variability is large compared to the between-group variability
  – E.g., a small sample size for a given group gives an imprecise OLS estimate, so the value is “shrunk” towards the population mean
• More weight is given to the observed group's mean if the opposite is true
• Thus, hierarchical models account and allow for the small sample sizes observed in some groups
• The same is true for the range represented in predictor variables (i.e., hierarchical models account and allow for the small and limited ranges of $X$ observed in some groups)

6.1.4 Shrinkage toward the grand mean ($\mu_\alpha$)

The estimate of $\alpha_j$ is a linear combination of the population-average estimate, $\mu_\alpha$, and the ordinary least squares (OLS) estimate, $\alpha_j^{OLS}$

$$\alpha_j = w_j \times \alpha_j^{OLS} + (1 - w_j) \times \mu_\alpha$$

The weight, $w_j$, is a ratio of the between-group variability ($\sigma_\alpha^2$) to the sum of the within- and between-group variability (i.e., total variability), with the between-group variance scaled by the group sample size

$$w_j = \frac{n_j \times \sigma_\alpha^2}{n_j \times \sigma_\alpha^2 + \sigma^2}$$

where $n_j$ is the sample size of group $j$.

Figure: Example where ICC = 0% (i.e., no among-group variability)

Figure: Example where ICC = 13%

Figure: Example where ICC = 80%

6.2 Simple Linear Regression

The model from a Bayesian point of view

$$y_i \sim N(\alpha + \beta x_i, \sigma^2), \quad \text{for } i = 1, \ldots, n$$

Priors:

$$\alpha \sim N(0, 0.001)$$

$$\beta \sim N(0, 0.001)$$

$$\sigma \sim U(0, 10)$$

Note that the normal priors are written as they are coded in JAGS, where the second argument is a precision: 0.001 corresponds to a variance of 1000. The $\sigma^2$ in the likelihood is a variance.

6.2.1 Varying Coefficient Models

Few people run models in JAGS because they want fixed effects. In fact, because parameters need to have an underlying distribution, you have the option to create random effects, or varying coefficients, very easily. Before we start on the different types of simple varying coefficient models, let’s visually review them.

Figure 6.2: Examples of different varying coefficient models

6.3 Varying Intercept Model

Another way of labeling a varying intercept model is a one-way ANOVA with a random effect. A one-way ANOVA is among the simpler statistical models, and a little complexity has been added by changing the single fixed factor to be random. Either terminology is accurate and acceptable, but because ANOVAs model means, and means without slopes are intercepts, you will often hear an ANOVA called an intercepts model, with the modifier varying meaning that group levels are random and can vary from the grand mean.

Let’s think about the varying intercepts model with an example. Say we have multiple measurements of total phosphorus ($TP$) taken from lakes $i$ ($i = 1, 2, \ldots, n$) located in regions $j$ ($j = 1, 2, \ldots, J$). The number of lakes sampled in each region varies, and we want to estimate the mean $TP$ for each region. We can assign each region its own intercept (i.e., its own mean $TP$), and we will allow these intercepts to come from a normal distribution characterized by an overall mean $TP$ ($\mu_\alpha$) and between-region variance ($\sigma_\alpha^2$). The model can be expressed as follows:

$$y_i \sim N(\alpha_{j(i)}, \sigma^2), \quad \text{for } i = 1, \ldots, n$$

$$\alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2), \quad \text{for } j = 1, \ldots, J$$

$$\mu_\alpha \sim N(0, 0.001)$$

$$\sigma_\alpha \sim U(0, 10)$$

$$\sigma \sim U(0, 10)$$

    model {
      # Likelihood
      for (i in 1:n) {
        y[i] ~ dnorm(mu[i], tau)
        mu[i] <- alpha[group[i]]
      }

      # Group-level intercepts
      for (j in 1:J) {
        alpha[j] ~ dnorm(mu.alpha, tau.alpha)
      }

      # Priors
      mu.alpha ~ dnorm(0, 0.001)
      sigma ~ dunif(0, 10)
      sigma.alpha ~ dunif(0, 10)

      # Derived quantities (JAGS uses precision = 1 / variance)
      tau <- pow(sigma, -2)
      tau.alpha <- pow(sigma.alpha, -2)
    }

6.3.1 Nested Indexing

In random intercept models we have observation $i$ in region $j$. Multiple $i$'s can be observed within a single $j$. Therefore, we have multiple observations per level of some grouping variable, so we have two indexes, $i$ and $j$. Up until this point, we only had observation $i$, which was easily accommodated in a single for loop. “Nested indexing” is a way of accommodating this type of data structure: group[i] holds the region of observation i, so the expression below selects the intercept for that region.

    alpha[group[i]]

6.3.2 Intraclass correlation coefficient (ICC)

This is another version of the one-way ANOVA with a random effect. The primary difference between this and the model above is that here we are going to actually calculate the intraclass correlation coefficient, ICC, as a derived quantity, in order to help us understand more about where the variability is in the data. Recall that the ICC compares the variability among groups with the total (within- plus among-group) variability.

Figure 6.3: Low ICC

Figure 6.4: High ICC

The model can be expressed the same way as the one-way ANOVA with a random effect:

$$y_i \sim N(\alpha_{j(i)}, \sigma^2), \quad \text{for } i = 1, \ldots, n$$

$$\alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2), \quad \text{for } j = 1, \ldots, J$$

Recall that the total variance is $\sigma^2 + \sigma_\alpha^2$, and therefore

$$ICC = \frac{\sigma_\alpha^2}{\sigma^2 + \sigma_\alpha^2}$$

Both terms required in the ICC equation are already being estimated by the model, so this is a good example of a derived quantity.

6.3.3 Varying Intercept, Fixed Slope Model

We can easily add a level-1 covariate (slope) to our model, but because it is not varying, it will create a model where the effect of the covariate is the same for all groups, although the intercepts may vary among groups.
This model might look like:

$$y_i \sim N(\alpha_{j(i)} + \beta x_i, \sigma^2), \quad \text{for } i = 1, \ldots, n$$

$$\alpha_j \sim N(\mu_\alpha, \sigma_\alpha^2), \quad \text{for } j = 1, \ldots, J$$

Note that the second-level model (the model for the $\alpha_j$'s) is the same as in the previous model. The only change is that we have added an $x$ predictor and a $\beta$ that represents a single, common slope, or the effect of $x$ on $y$. Although $\beta$ is not varying, we still need to give it a prior, because it is an estimated parameter and not a deterministic node.

$$\mu_\alpha \sim N(0, 0.001)$$

$$\beta \sim N(0, 0.0001)$$

$$\sigma_\alpha \sim U(0, 10)$$

$$\sigma \sim U(0, 10)$$

6.4 Varying Slope Model

A single slope may make sense in some applications where a change in a covariate affects all groups the same, despite the fact that the groups may start at different values. However, there is a good chance that covariates have effects of different directions and magnitudes on different groups. In cases like this, a varying slope would make sense. In ecology, the assumption that many processes are spatially and/or temporally homogeneous or invariant may not be valid. For example, a stressor-response relationship may vary spatially depending on local landscape features. Another important attribute of varying slopes is that they can help avoid Simpson’s Paradox, which is when a trend appears in several different groups of data but disappears or reverses when these groups are combined.

Figure 6.5: Example of Simpson’s Paradox

Recall our varying intercept model; we will change one thing in the level-1 equation. We will index $\beta$ by group, $j(i)$, in the same way we did for $\alpha$ in the varying intercept model.

$$y_i \sim N(\alpha_{j(i)} + \beta_{j(i)} x_i, \sigma^2), \quad \text{for } i = 1, \ldots, n$$

Figure 6.6: Most quantitative ecologists cannot resist the temptation to include a cartoon of Homer Simpson when referencing Simpson’s Paradox.
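Simpson's Paradox, mentioned above, is easy to reproduce with a small constructed dataset. The R sketch below builds three hypothetical groups with no noise, so that the within-group and pooled slopes can be compared directly. The data are invented for illustration, and ordinary least squares (lm) is used here only as a quick stand-in for a full JAGS model.

```r
# Hypothetical data: three groups of five observations each, no noise.
group <- rep(1:3, each = 5)
k     <- rep(0:4, times = 3)      # position of an observation within its group
x     <- 5 * (group - 1) + k      # groups occupy x = 0-4, 5-9, and 10-14
y     <- 10 * group - k           # y drops by 1 per unit of x within a group

dat <- data.frame(group = group, x = x, y = y)

# A single slope fitted to the pooled data
pooled_slope <- coef(lm(y ~ x, data = dat))[["x"]]

# A separate slope fitted to each group
within_slopes <- sapply(
  split(dat, dat$group),
  function(d) coef(lm(y ~ x, data = d))[["x"]]
)

print(pooled_slope)
print(within_slopes)
```

In this constructed example the pooled slope is positive (about 1.68), while the slope within every group is -1: the trend reverses once group membership is ignored. Letting intercepts vary by group is what lets the model recover the within-group relationship.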
The second level of our model changes more dramatically, because we now have two varying coefficients and need to model both their variances and their correlation.

$$\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix}, \begin{pmatrix} \sigma_\alpha^2 & \rho\sigma_\alpha\sigma_\beta \\ \rho\sigma_\alpha\sigma_\beta & \sigma_\beta^2 \end{pmatrix} \right), \quad \text{for } j = 1, \ldots, J$$

Several of these terms we have seen before, specifically the terms involving $\alpha$, the intercept, and the $\beta$ terms are used similarly to the $\alpha$ terms. After understanding that the equation is written in matrix notation to accommodate both varying parameters, see that the row for $\beta_j$ contains the mean of the $\beta$'s, expressed as $\mu_\beta$. The final part is the variance-covariance matrix, which might be simpler than it appears. Note that we have already recognized $\sigma_\alpha^2$ as the among-group variance for intercepts, and $\sigma_\beta^2$ is simply the among-group variance for slopes. The only term remaining is the off-diagonal one, $\rho\sigma_\alpha\sigma_\beta$ (sometimes written $\sigma_{\alpha\beta}$), which represents the joint variability (covariance) of $\alpha_j$ and $\beta_j$. Perhaps the important part of this term is that it is prefaced with $\rho$, the Greek symbol commonly used for correlation. So $\rho$ quantifies the correlation between the varying parameters, which is important to know because we want and need the correlation to be minimal.

Finally, we need priors for the model. Again, think about building on previous models that we have learned. This is not entirely a new model, so use what you already know, and add the new parts.

$$\mu_\alpha \sim N(0, 0.001)$$

$$\mu_\beta \sim N(0, 0.001)$$

$$\sigma \sim U(0, 10)$$

$$\sigma_\alpha \sim U(0, 10)$$

$$\sigma_\beta \sim U(0, 10)$$

$$\rho \sim U(-1, 1)$$

Note that ICC cannot be applied to this model for reasons we won’t get into.

6.4.1 Correlation between parameters

When estimated parameters are correlated, it may be a sign of model fitting problems. Correlated parameters may mean that the parameters are not able to independently explore parameter space, which may lead to poor estimates.
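To see what a correlation between varying intercepts and slopes looks like, the R sketch below assembles the 2 x 2 variance-covariance matrix from two standard deviations and a correlation, then draws group-level coefficients from the corresponding bivariate normal. All values (the means, standard deviations, correlation, and number of groups) are assumptions chosen for illustration, and the draws use a Cholesky factor in base R rather than any JAGS machinery.

```r
# Illustrative (assumed) second-level parameters
mu_a    <- 5      # mean intercept
mu_b    <- 1      # mean slope
sigma_a <- 2      # among-group SD of intercepts
sigma_b <- 0.5    # among-group SD of slopes
rho     <- -0.6   # correlation between intercepts and slopes

# Variance-covariance matrix: variances on the diagonal,
# rho * sigma_a * sigma_b on the off-diagonal
Sigma_B <- matrix(
  c(sigma_a^2,               rho * sigma_a * sigma_b,
    rho * sigma_a * sigma_b, sigma_b^2),
  nrow = 2, byrow = TRUE
)

# The correlation can be recovered from the covariance matrix
rho_check <- cov2cor(Sigma_B)[1, 2]

# Draw J pairs (alpha_j, beta_j): standard normal draws times the
# Cholesky factor have covariance Sigma_B; then add the means
set.seed(42)
J <- 1000
Z <- matrix(rnorm(2 * J), ncol = 2)
B <- sweep(Z %*% chol(Sigma_B), 2, c(mu_a, mu_b), "+")

print(rho_check)
print(cor(B[, 1], B[, 2]))
```

With a negative $\rho$, groups with higher intercepts tend to have lower slopes, which is the pattern a scatterplot of estimated intercepts against estimated slopes would reveal.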
Additionally, if the estimates are reliable but the parameters are correlated, it may mean that the data are not providing much independent information about each of the correlated parameters. Regardless, correlated parameters are a problem in your models and need to be addressed. (Also, there is no agreed-upon threshold for “highly correlated”, but by the time correlations approach 0.7, you may want to start looking into things.)

Fortunately, there are several (mostly simple) things you can do to address parameter correlation. The best thing to do is often to center and/or standardize your model covariates, which has been discussed before. Standardizing data typically reduces correlation between parameters, improves convergence, may improve parameter interpretation, and can also mean that the same priors can be used.

Going back to the PLD data that we have examined, note that the correlation between the intercepts and the effect of temperature based on the raw data is high, at about -0.95. In the PLD paper, a centering constant was used that greatly reduced the correlation. The centering of the temperature by -15 resulted in near elimination of the parameter correlation. Note that this example is about as impressive as you might find, and you should not expect results this dramatic in all data.

Figure 6.7: Correlation between intercepts and effect of log(temperature) for the raw PLD dataset. The magnitude of the correlation is around 0.95.

Figure 6.8: Simulation of varying centering constants examined to reduce parameter correlation.

Figure 6.9: PLD intercepts and slopes with almost no correlation.

However, centering and/or standardizing does routinely reduce correlation as advertised. Recall that centering changes the values, but not the scale, of the data. Think about it as shifting the number line under your data. Mean-centering is a very common type of centering. Standardizing your data changes both the values and the scale.
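As a rough illustration of why centering helps, the R sketch below fits an ordinary least squares regression before and after mean-centering a covariate and reads the correlation between the intercept and slope estimates from vcov(). OLS is used only as a convenient stand-in for the JAGS model, and the temperature-like covariate, the response, and every numeric value are made up; this is not the PLD dataset.

```r
# Hypothetical covariate that sits far from zero, and a made-up response
set.seed(7)
temp <- runif(50, min = 10, max = 30)
y    <- 40 - 1.2 * temp + rnorm(50, sd = 2)

# Centering shifts the values; standardizing also rescales them
temp_c <- temp - mean(temp)
temp_z <- (temp - mean(temp)) / sd(temp)

fit_raw      <- lm(y ~ temp)
fit_centered <- lm(y ~ temp_c)

# Correlation between the intercept and slope estimates
cor_raw      <- cov2cor(vcov(fit_raw))[1, 2]
cor_centered <- cov2cor(vcov(fit_centered))[1, 2]

print(cor_raw)       # strongly negative
print(cor_centered)  # zero, up to rounding error

# Centering leaves the scale alone; standardizing sets the SD to 1
print(c(sd(temp), sd(temp_c), sd(temp_z)))
```

In this simple regression, mean-centering removes the intercept-slope correlation entirely, because the intercept is now the expected response at the average covariate value rather than an extrapolation back to zero. In hierarchical models the reduction is usually less clean, but the mechanism is the same.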
There is a reasonable literature on various standardizing approaches, but mean-centering and dividing by 1 or 2 standard deviations are very common and usually produce good results.

6.4.2 Correlation between 2 or more varying parameters

Another way to represent models with 2 or more varying parameters is:

$$y_i \sim N(X_i B_{j(i)}, \sigma^2), \quad \text{for } i = 1, \ldots, n$$

$$B_j \sim N(M_B, \Sigma_B), \quad \text{for } j = 1, \ldots, J$$

where $B$ represents the matrix of group-level coefficients, $M_B$ represents the population-average coefficients (i.e., the means of the distributions of the intercepts and slopes), and $\Sigma_B$ represents the covariance matrix.

Once we move beyond 2 varying parameters, it is no longer possible to set diffuse priors one parameter at a time; each correlation is restricted to be between -1 and 1, and the correlations are jointly constrained. In other words, we need to put a prior on the covariance matrix ($\Sigma_B$) itself. To accomplish this “prior on a matrix”, we will use the scaled inverse-Wishart distribution. This distribution implies a uniform distribution on the correlation parameters. According to Gelman and Hill (2006), “When the number K of varying coefficients per group is more than two, modeling the correlation parameters ρ is a challenge.” But the scaled inverse-Wishart approach is a “useful trick”. Note that Kéry (2010) does not go so far as to deal with this issue. While this adds some model complexity, ultimately this development is something that will be provided in the JAGS model code and never really changed. Note that this distribution requires the library MCMCpack.

Below is an example of coding for 2 varying coefficients without using the Wishart distribution. This approach will work, but is not recommended.

    #### Dealing with variance-covariance of random slopes and intercepts
    # Convert covariance matrix to precision for use in bivariate normal
    Tau.B[1:K,1:K] <- inverse(Sigma.B[,])

    # Variance among intercepts
    Sigma.B[1,1] <- pow(sigma.a, 2)
    sigma.a ~ dunif(0, 100)

    # Variance among slopes
    Sigma.B[2,2] <- pow(sigma.b, 2)
    sigma.b ~ dunif(0, 100)

    # Covariance between alpha's and beta's
    Sigma.B[1,2] <- rho * sigma.a * sigma.b
    Sigma.B[2,1] <- Sigma.B[1,2]

    # Uniform prior on correlation
    rho ~ dunif(-1, 1)

Below is code for the model variance-covariance using the Wishart distribution. There are computational advantages to this, in addition to the fact that it easily scales up for more than 2 varying coefficients.

    ### Model variance-covariance with Wishart distribution
    Tau.B[1:K,1:K] ~ dwish(W[,], df)
    df <- K + 1
    Sigma.B[1:K,1:K] <- inverse(Tau.B[,])

    # The second loop index was lost in the source; k.prime is used here
    for (k in 1:K) {
      for (k.prime in 1:K) {
        # Correlation between coefficients k and k.prime
        rho.B[k,k.prime] <- Sigma.B[k,k.prime] /
          sqrt(Sigma.B[k,k] * Sigma.B[k.prime,k.prime])
      }
      # Among-group standard deviation of coefficient k
      sigma.B[k] <- sqrt(Sigma.B[k,k])
    }

Chapter 7 Varying Coefficients

Chapter 8 Generalized Linear Models in JAGS

8.1 Background to GLMs

Recall the linear model, which models a response variable that is assumed to come from a normal distribution and to have normally distributed errors. Once we can adopt different error structures, we have generalized the linear model, and the name reflects this. Generalized linear models are not inherently hierarchical, but they are still of great interest because the ability to modify the error structure on the response variable greatly increases the types of data we can model. GLMs were introduced in the early 1970s (Nelder and Wedderburn 1972) and have become very popular in recent decades.

8.2 Components of a GLM

A transformation of the expectation of the response, $E(y)$, is expressed as a linear combination of covariate effects, and distributions other than the normal can be used for the random part of the model.

1. A statistical distribution is used to describe the random variation in the response $y$
2. A link function $g$ is applied to the expectation of the response $E(y)$
3.
A linear predictor is the linear combination of covariate effects thought to make up $E(y)$

Figure 8.1: Family calls and default link functions for GLMs in R

Figure 8.2: Common statistical distributions, their description, use, and default link function information.

8.2.1 Common GLMs

Technically, a linear model with normal errors is a special case of the GLM, but practically speaking the binomial and Poisson GLMs are the most common. Note that the Bernoulli distribution is also common, but is just a special case of the binomial where the parameter $N = 1$. Each distributional family in a GLM has a default link function, which is very likely the link function you will want to use. However, note that other link functions are available, and with some digging you may find a link function that is better for your data than the default link function. Although it is not considered part of the family of GLMs, recall that the Beta distribution can be used for Beta regression, which is appropriate when your response variable is a proportion ($0 < y < 1$).

8.3 Binomial Regression

8.4 Poisson Regression

8.4.1 Poisson Extras

Chapter 9 Plotting

Chapter 10 Within-subjects Model

Chapter 11 Hierarchical Models

Bibliography

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. Chapman and Hall/CRC.

Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Kéry, M. (2010). Introduction to WinBUGS for ecologists: Bayesian approach to regression, ANOVA, mixed models and related analyses. Academic Press.

Kéry, M. and Schaub, M. (2011). Bayesian population analysis using WinBUGS: a hierarchical perspective. Academic Press.

Lindley, D. V.
and Smith, A. F. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society: Series B (Methodological), 34(1):1–18.

Lykou, A. and Ntzoufras, I. (2011). WinBUGS: a tutorial. Wiley Interdisciplinary Reviews: Computational Statistics, 3(5):385–396.

Parent, E. and Rivot, E. (2012). Introduction to hierarchical Bayesian modeling for ecological data. Chapman and Hall/CRC.

Plummer, M. (2012). JAGS version 3.3.0 user manual. International Agency for Research on Cancer, Lyon, France.

Raudenbush, S. W. and Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods, volume 1. Sage.

Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003). WinBUGS user manual.

Xie, Y. (2019). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.13.