Josafhat Salinas Ruíz • Osval Antonio Montesinos López • Gabriela Hernández Ramírez • Jose Crossa Hiriart

Generalized Linear Mixed Models with Applications in Agriculture and Biology

Josafhat Salinas Ruíz, Innovación Agroalimentaria Sustentable, Colegio de Postgraduados, Córdoba, Mexico
Osval Antonio Montesinos López, Facultad de Telemática, University of Colima, Colima, Mexico
Gabriela Hernández Ramírez, Ciencia de los Alimentos y Biotecnología, Tecnológico Nacional de México/ITS de Tierra Blanca, Mexico
Jose Crossa Hiriart, International Maize and Wheat Improvement Center (CIMMYT), Texcoco, Mexico

ISBN 978-3-031-32799-5
ISBN 978-3-031-32800-8 (eBook)
https://doi.org/10.1007/978-3-031-32800-8

The translation was done with the help of artificial intelligence (machine translation by the service DeepL.com). A subsequent human revision was done primarily in terms of content.

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material.
If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

This book is another example of CIMMYT’s commitment to scientific advancement, and comprehensive and inclusive knowledge sharing and dissemination. By using alternative statistical models and methods to describe and analyze data sets from different disciplines, such as biology and agriculture, the book facilitates the adoption and effective use of these tools by publicly funded researchers and practitioners of national agricultural research extension systems (NARES) and universities across the Global South. The authors aim to offer different and new models, methods, and techniques to agricultural scientists who often lack the resources to adopt these tools or face practical constraints when analyzing different types of data.
This work would not be possible without the continuous support of CIMMYT’s outstanding partners and donors who invest in non-profit frontier research for the benefit of millions of farmers and low-income communities worldwide. For that reason, it could not be more fitting for this book to be published as an open access resource for the international community to benefit from. I trust that this publication will greatly contribute to accelerating the development and deployment of resource-efficient and nutritious crops for a food secure future.

Bram Govaerts
Director General, CIMMYT

Acknowledgments

The work presented in this book describes models and methods for improving the statistical analyses of continuous, ordinal, and count data usually collected in agriculture and biology. The authors are grateful to the past and present CIMMYT Directors General, Deputy Directors of Research, Deputy Directors of Administration, Directors of Research Programs, and other Administration offices and Laboratories of CIMMYT for their continuous and firm support of biometrical genetics and statistics research, training, and service in support of CIMMYT’s mission: “maize and wheat science for improved livelihoods.” This work was made possible with support from the CGIAR Research Programs on Wheat and Maize (wheat.org, maize.org), and many funders including Australia, United Kingdom (DFID), USA (USAID), South Africa, China, Mexico (SAGARPA), Canada, India, Korea, Norway, Switzerland, France, Japan, New Zealand, Sweden, and the World Bank. We are grateful for the financial support of the Mexican Government through MASAGRO and several other regional projects and for the close collaboration with numerous Mexican researchers. We acknowledge the financial support provided by the (1) Bill and Melinda Gates Foundation (INV-003439 BMGF/FCDO Accelerating Genetic Gains in Maize and Wheat for Improved Livelihoods [AG2MW]) as well as (2) USAID projects (Amend. No.
9 MTO 069033, USAID-CIMMYT Wheat/AGGMW, AGG-Maize Supplementary Project, AGG [Stress Tolerant Maize for Africa]). Very special recognition is given to the Bill and Melinda Gates Foundation for providing the Open Access fee of this book. We are also thankful for the financial support provided by the (1) Foundations for Research Levy on Agricultural Products (FFL) and the Agricultural Agreement Research Fund (JA) in Norway through NFR grant 267806, (2) Sveriges Lantbruksuniversitet (Swedish University of Agricultural Sciences) Department of Plant Breeding, Sundsvägen 10, 23053 Alnarp, Sweden, (3) CIMMYT CRP, (4) the Consejo Nacional de Tecnología y Ciencia (CONACYT) of México, and (5) Universidad de Colima of Mexico. We highly appreciate and thank the several students and professors at the Universidad de Colima and the students as well as professors from the Colegio de PostGraduados (COLPOS) who tested and made suggestions on early versions of the material covered in the book; their many useful suggestions had a direct impact on the current organization and the technical content of the book.

Note: The original version of this book has been revised. The Acknowledgment section, which was inadvertently omitted after the Foreword, has now been included.

Contents

1 Elements of Generalized Linear Mixed Models
  1.1 Introduction to Linear Models
  1.2 Regression Models
    1.2.1 Simple Linear Regression
    1.2.2 Multiple Linear Regression
  1.3 Analysis of Variance Models
    1.3.1 One-Way Analysis of Variance
    1.3.2 Two-Way Nested Analysis of Variance
    1.3.3 Two-Way Analysis of Variance with Interaction
  1.4 Analysis of Covariance (ANCOVA)
  1.5 Mixed Models
    1.5.1 Introduction
    1.5.2 Mixed Models
    1.5.3 Distribution of the Response Variable Conditional on Random Effects (y | b)
    1.5.4 Types of Factors and Their Related Effects on LMMs
    1.5.5 Nested Versus Crossed Factors and Their Corresponding Effects
    1.5.6 Estimation Methods
    1.5.7 One-Way Random Effects Model
    1.5.8 Analysis of Variance Model of a Randomized Block Design
  1.6 Exercises
  Appendix
2 Generalized Linear Models
  2.1 Introduction
  2.2 Components of a GLM
    2.2.1 The Random Component
    2.2.2 The Systematic Component
    2.2.3 Predictor’s Link Function η
  2.3 Assumptions of a GLM
  2.4 Estimation and Inference of a GLM
  2.5 Specification of a GLM
    2.5.1 Continuous Normal Response Variable
    2.5.2 Binary Logistic Regression
    2.5.3 Poisson Regression
    2.5.4 Gamma Regression
    2.5.5 Beta Regression
  2.6 Exercises
  Appendix
3 Objectives of Inference for Stochastic Models
  3.1 Three Aspects to Consider for an Inference
    3.1.1 Data Scale in the Modeling Process Versus Original Data
    3.1.2 Inference Space
    3.1.3 Inference Based on Marginal and Conditional Models
  3.2 Illustrative Examples of the Data Scale and the Model Scale
  3.3 Fixed and Random Effects in the Inference Space
    3.3.1 A Broad Inference Space or a Population Inference
    3.3.2 Mixed Models with a Normal Response
  3.4 Marginal and Conditional Models
    3.4.1 Marginal Versus Conditional Models
    3.4.2 Normal Distribution
    3.4.3 Non-normal Distribution
  3.5 Exercises
4 Generalized Linear Mixed Models for Non-normal Responses
  4.1 Introduction
  4.2 A Brief Description of Linear Mixed Models (LMMs)
  4.3 Generalized Linear Mixed Models
  4.4 The Inverse Link Function
  4.5 The Variance Function
  4.6 Specification of a GLMM
  4.7 Estimation of the Dispersion Parameter
  4.8 Estimation and Inference in Generalized Linear Mixed Models
    4.8.1 Estimation
    4.8.2 Inference
  4.9 Fitting the Model
  4.10 Exercises
5 Generalized Linear Mixed Models for Counts
  5.1 Introduction
  5.2 The Poisson Model
    5.2.1 CRD with a Poisson Response
    5.2.2 Example 2: CRDs with Poisson Response
    5.2.3 Example 3: Control of Weeds in Cereal Crops in an RCBD
    5.2.4 Overdispersion in Poisson Data
    5.2.5 Factorial Designs
    5.2.6 Latin Square (LS) Design
  5.3 Exercises
  Appendix 1
6 Generalized Linear Mixed Models for Proportions and Percentages
  6.1 Response Variables as Ratios and Percentages
  6.2 Analysis of Discrete Proportions: Binary and Binomial Responses
    6.2.1 Completely Randomized Design (CRD): Methylation Experiment
  6.3 Factorial Design in a Randomized Complete Block Design (RCBD) with Binomial Data: Toxic Effect of Different Treatments on Two Species of Fleas
  6.4 A Split-Plot Design in an RCBD with a Normal Response
    6.4.1 An RCBD Split Plot with Binomial Data: Carrot Fly Larval Infestation of Carrots
  6.5 A Split-Split Plot in an RCBD: In Vitro Germination of Seeds
  6.6 Alternative Link Functions for Binomial Data
    6.6.1 Probit Link: A Split-Split Plot in an RCBD with a Binomial Response
    6.6.2 Complementary Log-Log Link Function: A Split Plot in an RCBD with a Binomial Response
  6.7 Percentages
    6.7.1 RCBD: Dead Aphid Rate
    6.7.2 RCBD: Percentage of Quality Malt
    6.7.3 A Split Plot in an RCBD: Cockroach Mortality (Blattella germanica)
    6.7.4 A Split-Plot Design in an RCBD: Percentage Disease Inhibition
    6.7.5 Randomized Complete Block Design with a Binomial Response with Multiple Variance Components
  6.8 Exercises
  Appendix
7 Time of Occurrence of an Event of Interest
  7.1 Introduction
  7.2 Generalized Linear Mixed Models with a Gamma Response
    7.2.1 CRD: Estrus Induction in Pelibuey Ewes
    7.2.2 Randomized Complete Block Design (RCBD): Itch Relief Drugs
    7.2.3 Factorial Design: Insect Survival Time
    7.2.4 A Split Plot with a Factorial Structure on a Large Plot in a Completely Randomized Design (CRD)
  7.3 Survival Analysis
    7.3.1 Concepts and Definitions
    7.3.2 CRD: Aedes aegypti
    7.3.3 RCBD: Aedes aegypti
  7.4 Exercises
  Appendix 1
8 Generalized Linear Mixed Models for Categorical and Ordinal Responses
  8.1 Introduction
  8.2 Concepts and Definitions
  8.3 Cumulative Logit Models (Proportional Odds Models)
    8.3.1 Completely Randomized Design (CRD) with a Multinomial Response: Ordinal
    8.3.2 Randomized Complete Block Design (RCBD) with a Multinomial Response: Ordinal
  8.4 Cumulative Probit Models
  8.5 Effect of Judges’ Experience on Canned Bean Quality Ratings
  8.6 Generalized Logit Models: Nominal Response Variables
    8.6.1 CRDs with a Nominal Multinomial Response
    8.6.2 CRD: Cheese Tasting
  8.7 Exercises
  Appendix
9 Generalized Linear Mixed Models for Repeated Measurements
  9.1 Introduction
  9.2 Example of Turf Quality
  9.3 Effect of Insecticides on Aphid Growth
  9.4 Manufacture of Livestock Feed
  9.5 Characterization of Spatial and Temporal Variations in Fecal Coliform Density
  9.6 Log-Normal Distribution
    9.6.1 Emission of Nitrous Oxide (N2O) in Beef Cattle Manure with Different Percentages of Crude Protein in the Diet
  9.7 Effect of a Chemical Salt on the Percentage Inhibition of the Fusarium sp.
  9.8 Carbon Dioxide (CO2) Emission as a Function of Soil Moisture and Microbial Activity
  9.9 Effect of Soil Compaction and Soil Moisture on Microbial Activity
  9.10 Joint Model for Binary and Poisson Data
  9.11 Exercises
  Appendix
References

Chapter 1
Elements of Generalized Linear Mixed Models

1.1 Introduction to Linear Models

Linear models are commonly used to describe and analyze datasets from different research areas, such as biology, agriculture, and the social sciences. A linear model aims to represent the structure of a dataset as well as possible. A model is usually made up of one or more factors, which can be nominal or discrete variables (sex, year, etc.) or continuous variables (age, height, etc.) and which have an effect on the observed data. Linear models are the most commonly used statistical models for estimating and predicting a response based on a set of observations.

Linear models get their name because they are linear in the model parameters. The general form of a linear model is given by

$$
y = X\beta + \varepsilon \tag{1.1}
$$

where y is the n × 1 vector of observed responses, X is the n × (p + 1) design matrix of fixed constants, β is the (p + 1) × 1 vector of unknown parameters to be estimated, and ε is the n × 1 vector of random errors. Linearity arises because the mean response of vector y is linear in the vector of unknown parameters β.
Mathematically, this is verified by taking the first derivative of the predictor with respect to β: if, after differentiation, the result is still a function of any of the β parameters, then the model is said to be nonlinear; otherwise, it is a linear model. In this case, the derivative of the predictor Xβ in (1.1) with respect to β is equal to X, so the model in (1.1) is linear, since after differentiation the predictor no longer depends on the β parameters.

Several models used in statistics are examples of the general linear model y = Xβ + ε. These include regression models and analysis of variance (ANOVA) models. Regression models generally refer to those in which the design matrix X is of full column rank, whereas in analysis of variance models, the design matrix X is not of full column rank. Some linear models are briefly described in the following sections.

© The Author(s) 2023. J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_1

1.2 Regression Models

Linear models are often used to model the relationship between a variable, known as the response or dependent variable, y, and one or more predictors, known as independent or explanatory variables, X1, X2, ⋯, Xp.

1.2.1 Simple Linear Regression

Consider a model in which a response variable y is linearly related to an explanatory variable X1 via

$$
y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i
$$

where εi (i = 1, 2, ⋯, n) are uncorrelated random errors, commonly assumed to be normally distributed with mean 0 and constant variance σ² > 0, that is, εi ~ N(0, σ²).
If X11, X12, ⋯, X1n are constant (fixed), then this is a general linear model y = Xβ + ε where

$$
y_{n \times 1} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X_{n \times 2} = \begin{bmatrix} 1 & X_{11} \\ 1 & X_{12} \\ \vdots & \vdots \\ 1 & X_{1n} \end{bmatrix}, \quad
\beta_{2 \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad
\varepsilon_{n \times 1} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

Example
Let us consider the relationship between the performance test scores and tissue concentration of lysergic acid diethylamide, commonly known as LSD (from German Lysergsäure-diethylamid), in a group of volunteers who received the drug (Wagner et al. 1968). The average scores on the mathematical test and the LSD tissue concentrations are shown in Table 1.1.

Table 1.1 Average mathematical test scores and LSD tissue concentrations

Tissue concentration of LSD   Mathematical average
1.17                          78.93
2.97                          58.20
3.26                          67.47
4.69                          37.47
5.83                          45.65
6.00                          32.92
6.41                          29.97

The components of this regression model are as follows:

Distribution: $y_i \sim N(\eta_i, \sigma^2)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 \times \mathrm{Conc}_i$
Link function: $\mu_i = \eta_i$ (identity)

The syntax for performing a simple linear regression using the GLIMMIX procedure in Statistical Analysis Software (SAS) is as follows:

    proc glimmix;
      model y = X1 / solution;
    run;

Part of the results is shown in Table 1.2.

Table 1.2 Results of the simple regression analysis

(a) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Conc     1        5        35.93     0.0019

(b) Parameter estimates
Effect       Estimate   Standard error   DF   t-value   Pr > |t|
Intercept    89.1239    7.0475           5    12.65     <0.0001
Conc         -9.0095    1.5031           5    -5.99     0.0019
Scale (σ²)   50.7763    32.1137          .    .         .

The analysis of variance (item a) indicates that drug concentration has a significant effect on average mathematical performance (P = 0.0019).
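The estimates in Table 1.2 can be cross-checked outside SAS with the closed-form least squares formulas. The following is a minimal sketch in pure Python, assuming the data exactly as printed in Table 1.1; the function name `simple_ols` and the dictionary keys are illustrative choices, not part of the original analysis.

```python
# Cross-check of the simple linear regression (pure Python, no libraries).
# Data are the values printed in Table 1.1; all names here are illustrative.
conc = [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41]
score = [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]


def simple_ols(x, y):
    """Closed-form least squares fit of y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx                # slope
    b0 = y_bar - b1 * x_bar       # intercept
    sse = syy - b1 * sxy          # residual sum of squares
    mse = sse / (n - 2)           # error variance estimate on n - 2 DF
    f_value = (b1 * sxy) / mse    # regression SS (1 DF) divided by MSE
    r2 = 1 - sse / syy            # coefficient of determination
    return {"b0": b0, "b1": b1, "mse": mse, "F": f_value, "R2": r2}


fit = simple_ols(conc, score)
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

Here `b0` and `b1` correspond to the "Intercept" and "Conc" rows, `mse` to the "Scale" row, and `F` to the type III test in part (a) of Table 1.2.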
The estimates of the regression model parameters, $\hat{\beta}_0$ and $\hat{\beta}_1$, and the mean squared error (MSE, the "Scale" row) are shown in Table 1.2(b) under “Parameter estimates.” With these results, the linear predictor $\hat{\eta}_i$ that predicts the average mathematical performance as a function of LSD concentration is as follows:

$$
\hat{\eta}_i = 89.124 - 9.01 \times \mathrm{Conc}_i
$$

This means that we can predict the average mathematical performance of an individual once the LSD concentration (Conci) is known. From the estimated parameters, we can say that there is a negative relationship between LSD concentration and mathematical score. Figure 1.1 clearly shows that an increase in drug concentration has a negative effect on the mathematical score. This fitted model explains 87.8% of the variability in the data (Fig. 1.1).

[Fig. 1.1 Relationship between applied drug concentration and the mathematical score of the youth. Fitted model of the relationship between the average score (y-axis) and LSD concentration (x-axis): y = 89.124 - 9.0095 × Conc, R² = 0.8778]

1.2.2 Multiple Linear Regression

Suppose that a response variable y is linearly related to several independent variables X1, X2, ⋯, Xp such that

$$
y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i
$$

for i = 1, 2, ⋯, n. Here, εi are uncorrelated random errors, normally distributed with a zero mean and constant variance σ², i.e., εi ~ N(0, σ²).
If the explanatory variables are fixed constants, then the above model belongs to a general linear model of the form y = Xβ + ε, as can be seen below:

$$
y_{n \times 1} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X_{n \times (p+1)} = \begin{bmatrix}
1 & X_{11} & X_{12} & \cdots & X_{1p} \\
1 & X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix}, \quad
\beta_{(p+1) \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \quad
\varepsilon_{n \times 1} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

A regression analysis can be used to assess the relationship between explanatory variables and the response variable. It is also a useful tool for predicting future observations or simply describing the structure of the data.

Example
Let us fit a regression model of the relationship between body weight and the heart girth and length of seven young bulls from the data shown in Table 1.3.

Table 1.3 Body weight (kilograms) and its relationship with circumference (centimeters) and heart length (centimeters) of seven young bulls

Bull   Weight (kilograms)   Circumference (centimeters)   Length (centimeters)
1      480                  175                           128
2      450                  177                           122
3      480                  178                           124
4      500                  175                           131
5      520                  186                           131
6      510                  183                           130
7      500                  185                           124

The components of this multiple regression model are as follows:

Distribution: $y_i \sim N(\eta_i, \sigma^2)$
Linear predictor: $\eta_i = \beta_0 + X_{i1}\beta_1 + X_{i2}\beta_2$
Link function: $\mu_i = \eta_i$ (identity)

The syntax for performing a multiple regression using the GLIMMIX procedure in SAS, assuming that there is no interaction between bull heart girth (X1) and length (X2), is shown below:

    proc glimmix;
      model y = X1 X2 / solution cl;
    run;

Based on the regression model specifications, the option “solution cl” prompts GLIMMIX to provide the values of the estimated parameters and their respective confidence intervals. Other useful options are “htype = 1, 2, and 3,” which refer to the sums of squares of types I, II, and III. The type III fixed effects tests in (a) of Table 1.4 indicate that there is a linear relationship between heart length (size) and weight in young bulls.
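The least squares computation behind a multiple regression of this kind can be sketched in pure Python by solving the normal equations on centred predictors. This is only an illustration of the method, not the GLIMMIX algorithm: the helper names (`centre`, `solve`, `fit_ols`) are hypothetical, and the data are taken exactly as printed in Table 1.3, so agreement with Table 1.4 to the last digit is not assumed.

```python
# Multiple regression via the normal equations on centred predictors.
# Data are the values printed in Table 1.3; all helper names are illustrative.
weight = [480, 450, 480, 500, 520, 510, 500]   # response y
girth = [175, 177, 178, 175, 186, 183, 185]    # X1, circumference
length = [128, 122, 124, 131, 131, 130, 124]   # X2, length


def mean(v):
    return sum(v) / len(v)


def centre(v):
    m = mean(v)
    return [vi - m for vi in v]


def solve(a, b):
    """Solve a small system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(y, predictors):
    """Return intercept, slopes, residuals, and MSE of the least squares fit."""
    yc = centre(y)
    xc = [centre(x) for x in predictors]
    sxx = [[sum(u * v for u, v in zip(a, b)) for b in xc] for a in xc]
    sxy = [sum(u * v for u, v in zip(a, yc)) for a in xc]
    slopes = solve(sxx, sxy)
    intercept = mean(y) - sum(b * mean(x) for b, x in zip(slopes, predictors))
    fitted = [intercept + sum(b * x[i] for b, x in zip(slopes, predictors))
              for i in range(len(y))]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    mse = sum(e * e for e in residuals) / (len(y) - len(predictors) - 1)
    return intercept, slopes, residuals, mse


intercept, slopes, residuals, mse = fit_ols(weight, [girth, length])
print("intercept:", round(intercept, 4))
print("slopes:", [round(b, 4) for b in slopes])
print("MSE:", round(mse, 4))
```

A useful sanity check on any such fit is that the residuals are orthogonal to every column of the design matrix, including the column of ones.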
The estimated parameters $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ with their respective confidence intervals, as well as the MSE (scale) of the fitted regression model, are listed in (b) of Table 1.4.

Table 1.4 Results of the multiple regression analysis

(a) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
X1       1                 4.42      0.1034
X2       1                 9.51      0.0368

(b) Parameter estimates
Effect      Estimate   Standard error   DF   t-value   Pr > |t|   α      Lower      Upper
Intercept   -495.01    225.87                -2.19     0.0935     0.05   -1122.13   132.10
X1          2.2573     1.0739                2.10      0.1034     0.05   -0.7243    5.2388
X2          4.5808     1.4855                3.08      0.0368     0.05   0.4564     8.7053
Scale       139.51     98.6518          .    .         .          .      .          .

Note that in a linear model, the parameters enter linearly, but the variables do not necessarily have to be linear. For example, consider the following two examples:

$$
y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 \log(X_{i2}) + \cdots + \beta_k X_{ik} + \varepsilon_i
$$

$$
y_i = \beta_0 + X_{i1}^{\beta_1} + \beta_2 X_{i2} + \cdots + \beta_k \exp(X_{ik}) + \varepsilon_i
$$

The first example is a linear model, whereas the second one is not. In the second example, the derivatives of the predictor with respect to the β coefficients do not depend on those coefficients, with the exception of the term $X_{i1}^{\beta_1}$, whose derivative with respect to β1 is $X_{i1}^{\beta_1}\log(X_{i1})$. This shows that the second example is a nonlinear model, because the derivative of the predictor depends on β1.

1.3 Analysis of Variance Models

1.3.1 One-Way Analysis of Variance

Consider an experiment in which t treatments (t > 2) are to be tested, with ni experimental units selected and randomly assigned to the ith treatment. The model describing this experiment is as follows:

$$
y_{ij} = \mu + \tau_i + \varepsilon_{ij}
$$

for i = 1, 2, ⋯, t and j = 1, 2, ⋯, ni. Here, εij are uncorrelated random errors, normally distributed with a zero mean and constant variance σ² (εij ~ N(0, σ²)). If the treatment effects are considered as fixed constants (drawn from a finite number), then this model is a special case of the general linear model (1.1), with the total number of experimental units $n = \sum_{i=1}^{t} n_i$. In matrix terms, the information under this experimental design is equal to:

$$
y_{n \times 1} = \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{t n_t} \end{bmatrix}, \quad
X_{n \times (t+1)} = \begin{bmatrix}
1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\
1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1_{n_t} & 0_{n_t} & 0_{n_t} & \cdots & 1_{n_t}
\end{bmatrix}, \quad
\beta_{(t+1) \times 1} = \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \vdots \\ \tau_t \end{bmatrix}, \quad
\varepsilon_{n \times 1} = \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \vdots \\ \varepsilon_{t n_t} \end{bmatrix}
$$

where $1_{n_i}$ is the vector of ones of order ni and $0_{n_i}$ is the vector of zeros of order ni. Note that the matrix $X_{n \times (t+1)}$ is not of full column rank because its first column can be obtained as a linear combination (the sum) of its remaining columns.

Example
Assume that measurements of the biomass produced by three different types of bacteria are collected in three separate Petri dishes (replicates) in a glucose broth culture medium for each bacterium (Table 1.5).

Table 1.5 Biomass production of the three types of bacteria

Bacteria A   Bacteria B   Bacteria C
12           20           40
15           19           35
9            23           42

The sources of variation and degrees of freedom (DFs) for this experiment are shown in Table 1.6.

Table 1.6 Analysis of variance

Sources of variation   Degrees of freedom
Bacteria type          t - 1 = 3 - 1 = 2
Error                  t(r - 1) = 3 × 2 = 6
Total                  tr - 1 = 3 × 3 - 1 = 8

The components for this one-way model, assuming that each response yij is normally distributed, are as follows:

Distribution: $y_{ij} \sim N(\mu_i, \sigma^2)$
Linear predictor: $\eta_i = \alpha + \tau_i$
Link function: $\mu_i = \eta_i$ (identity)

where yij is the response observed at the jth repetition in the ith bacterium, ηi is the linear predictor, α is the intercept (the grand mean), and τi is the fixed effect due to the type of bacterium.
The SAS syntax for a one-way analysis of variance (ANOVA) is as follows:

    proc glimmix data=biomass;
      class bacteria;
      model y = bacteria;
      lsmeans bacteria / lines;
    run;

As in “proc glm” or “proc mixed,” the “class” statement declares the classification (categorical or nominal) variables to be included in the model, in this case the variable “bacteria”; the “model” statement lists the response variable “y” and all the class or continuous variables that enter the model; the “lsmeans” statement asks GLIMMIX to estimate the treatment means; and the “lines” option requests a comparison of means. Part of the results is presented in Table 1.7.

Table 1.7 Results of the one-way analysis of variance

(a) Fit statistics
-2 Res log likelihood                                                       33.36
AIC (Akaike information criterion) (smaller is better)                      41.36
AICC (Corrected Akaike information criterion) (smaller is better)           81.36
BIC (Bayesian information criterion) (smaller is better)                    40.52
CAIC (Consistent Akaike's information criterion) (smaller is better)        44.52
HQIC (Hannan and Quinn information criterion) (smaller is better)           38.02
Pearson’s chi-square                                                        52.67
Pearson’s chi-square / DF                                                   8.78

(b) Type III tests of fixed effects
Effect     Num DF   Den DF   F-value   Pr > F
Bacteria                     64.95     <0.0001

By default, “proc GLIMMIX” provides the fit statistics (information criteria), which are extremely useful for comparing or choosing a model that explains the largest possible proportion of the variation present in a dataset, i.e., the best-fit model (part (a) of Table 1.7). The statistic “-2 res log likelihood” is most useful when comparing nested models, and the rest of the statistics are useful for comparing models that are not necessarily nested. The mean squared error (MSE) in GLIMMIX is given as the statistic “Pearson's chi-square / DF.” In this analysis, this value is $\hat{\sigma}^2 = \mathrm{MSE} = 8.78$.
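The key quantities of this one-way analysis can be reproduced from the classical sums of squares. The sketch below is pure Python and uses the data printed in Table 1.5; the function name `one_way_anova` and its return keys are illustrative assumptions, and the calculation is the textbook ANOVA decomposition rather than the residual likelihood machinery that GLIMMIX uses internally.

```python
import math

# Biomass data as printed in Table 1.5 (three replicates per bacterium).
biomass = {"A": [12, 15, 9], "B": [20, 19, 23], "C": [40, 35, 42]}


def one_way_anova(groups):
    """Classical one-way ANOVA: treatment and error sums of squares, MSE, F."""
    all_values = [x for values in groups.values() for x in values]
    n = len(all_values)
    t = len(groups)
    grand_mean = sum(all_values) / n
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    ss_trt = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
    ss_err = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    df_trt, df_err = t - 1, n - t
    mse = ss_err / df_err                 # estimate of sigma^2
    f_value = (ss_trt / df_trt) / mse     # F statistic for equal treatment means
    return {"means": means, "ss_err": ss_err, "df": (df_trt, df_err),
            "mse": mse, "F": f_value}


res = one_way_anova(biomass)
# Standard error of a treatment mean with r = 3 replicates.
se_mean = math.sqrt(res["mse"] / 3)
print("means:", {g: round(m, 4) for g, m in res["means"].items()})
print("SSE:", round(res["ss_err"], 2), "DF:", res["df"])
print("MSE:", round(res["mse"], 2), "F:", round(res["F"], 2))
print("SE of a mean:", round(se_mean, 4))
```

In this sketch, `ss_err` plays the role of “Pearson's chi-square,” `mse` that of “Pearson's chi-square / DF,” and `F` that of the type III test in Table 1.7, while `df` matches the degrees of freedom listed in Table 1.6.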
In part (b), the analysis of variance indicates that at least one type of bacterium produces a different biomass (P < 0.0001). That is, the null hypothesis $H_0: \tau_A = \tau_B = \tau_C$ is rejected at a significance level of 5%. The estimated least squares (LS) means obtained with "lsmeans" are tabulated under the "Estimate" column, with their standard errors in the "Standard error" column, of Table 1.8. With the "lines" option, these means are compared (by default) using Fisher's LSD (least significant difference).

Table 1.8 Means and estimated standard errors of the one-way model

Least squares means of bacteria
Bacteria   Estimate   Standard error   DF   t-value   Pr > |t|
A          12.0000    1.7105           6    7.02      0.0004
B          20.6667    1.7105           6    12.08     <0.0001
C          39.0000    1.7105           6    22.80     <0.0001

Table 1.9 Comparison of the means (LSD) in the one-way model

T grouping of the least squares means of bacteria (α = 0.05)
LS means with the same letter are not significantly different
Bacteria   Estimate   Group
C          39.0000    A
B          20.6667    B
A          12.0000    C

Finally, Table 1.9 presents the comparison of the means obtained with "lines" and indicates that bacteria type C has a better fermentative conversion of glucose to lactic acid than bacteria types B and A. Means that share a letter are not significantly different.

1.3.2 Two-Way Nested Analysis of Variance

Let us consider an experiment with two factors, A and B, in which each level of factor B is nested within a level of factor A, that is, each level of B appears within only one level of A. The model that describes this experiment is as follows:

$$
y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{ijk}
$$

for $i = 1, 2, \cdots, a$; $j = 1, 2, \cdots, b_i$; and $k = 1, 2, \cdots, n_{ij}$. In this model, $\mu$ is the overall mean, $\alpha_i$ represents the effect due to the ith level of factor A, and $\beta_{j(i)}$ represents the effect of the jth level of factor B nested within the ith level of factor A.
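Assuming that all factors are fixed and that the errors $\varepsilon_{ijk}$ are normally distributed, that is, $\varepsilon_{ijk} \sim N(0, \sigma^2)$, this model is a general linear model of the form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$. For example, suppose that there are a = 3 levels of factor A, b = 2 levels of factor B within each level of A, and $n_{ij} = 2$. Then the vectors and matrices have the following form, where each $\mathbf{1}$ and $\mathbf{0}$ in $\mathbf{X}$ is a 2 × 1 vector of ones or zeros (one block row per combination of A and B), and the first subscript of $\beta$ indexes the level of A:

$$
\mathbf{y} = \begin{bmatrix} y_{111} \\ y_{112} \\ y_{121} \\ y_{122} \\ y_{211} \\ y_{212} \\ y_{221} \\ y_{222} \\ y_{311} \\ y_{312} \\ y_{321} \\ y_{322} \end{bmatrix},\quad
\mathbf{X} = \begin{bmatrix}
\mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1}
\end{bmatrix},\quad
\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_{11} \\ \beta_{12} \\ \beta_{21} \\ \beta_{22} \\ \beta_{31} \\ \beta_{32} \end{bmatrix},\quad
\mathbf{e} = \begin{bmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \varepsilon_{121} \\ \varepsilon_{122} \\ \varepsilon_{211} \\ \varepsilon_{212} \\ \varepsilon_{221} \\ \varepsilon_{222} \\ \varepsilon_{311} \\ \varepsilon_{312} \\ \varepsilon_{321} \\ \varepsilon_{322} \end{bmatrix}.
$$

Example Suppose that a researcher was studying the assimilation of fluorescently labeled proteins in rat kidneys and wanted to know whether his two technicians, technician A and technician B, were performing the procedure consistently. Technician A randomly chose three rats, technician B randomly chose three other rats, and each technician measured the protein assimilation in each of his or her rats. Since rats are expensive and measurements are cheap, both technicians measured protein assimilation at several random locations in the kidneys of each rat (Table 1.10). When performing a nested ANOVA, we are often interested in testing the null hypothesis $H_0: \tau_A = \tau_B$. In this example, we do not wish to test whether the subgroups (rats within technicians) are significantly different, since the goal is to show that both technicians perform the procedure consistently. The sources of variation and degrees of freedom are shown in Table 1.11.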
Table 1.10 Levels of protein assimilation in the rat kidneys measured by both technicians

Technician A                   Technician B
Rat1     Rat2     Rat3         Rat4     Rat5     Rat6
1.119    1.045    0.9873       1.3883   1.3952   1.2574
1.2996   1.1418   0.9873       1.104    0.9714   1.0295
1.5407   1.2569   0.8714       1.1581   1.3972   1.1941
1.5084   0.6191   0.9452       1.319    1.5369   1.0759
1.6181   1.4823   1.1186       1.1803   1.3727   1.3249
1.5962   0.8991   1.2909       0.8738   1.2909   0.9494
1.2617   0.8365   1.1502       1.387    1.1874   1.1041
1.2288   1.2898   1.1635       1.301    1.1374   1.1575
1.3471   1.1821   1.151        1.3925   1.0647   1.294
1.0206   0.9177   0.9367       1.0832   0.9486   1.4543

Table 1.11 Sources of variation and degrees of freedom of the two-way nested design

Sources of variation   Degrees of freedom
Technician             a - 1 = 2 - 1 = 1
Rat (technician)       a(b - 1) = 2(3 - 1) = 4
Error                  ab(r - 1) = 2 × 3(10 - 1) = 54
Total                  abr - 1 = 2 × 3 × 10 - 1 = 59

The components of this two-way model, assuming that the response variable $y_{ij}$ is normally distributed, are as follows:

Distribution: $y_{ij} \sim N(\mu_{ij}, \sigma^2)$
Linear predictor: $\eta_{ij} = \alpha + \tau_i + \beta(\tau)_{j(i)}$
Link function: $\mu_{ij} = \eta_{ij}$ (identity)

where $y_{ij}$ is the level of assimilation of the fluorescent protein obtained from rat j by technician i, $\alpha$ is the intercept, $\tau_i$ is the fixed effect due to the technician, and $\beta(\tau)_{j(i)}$ is the nested effect of rat j within technician i. The SAS commands for the main effect of factor A and of factor B nested within A are as follows:

    proc glimmix data=rata nobound;
      class technician rat rep;
      model protein = technician rat(technician);
      lsmeans technician rat(technician)/lines;
    run;

Part of the results is shown in Table 1.12. The mean squared error (Pearson's chi-square / DF) is 0.04 (part (a)), which indicates little variability among the measurements taken within each rat. This means that the variance between group means is smaller than would be expected.
Table 1.12 Fit statistics of the two-way nested design

(a) Fit statistics
-2 Res log likelihood       -12.39
AIC (smaller is better)     1.61
AICC (smaller is better)    4.04
BIC (smaller is better)     15.53
CAIC (smaller is better)    22.53
HQIC (smaller is better)    6.98
Pearson's chi-square        1.95
Pearson's chi-square / DF   0.04

(b) Type III tests of fixed effects
Effect             Num DF   Den DF   F-value   Pr > F
Technician         1        54       1.07      0.3065
Rat (technician)   4        54       3.98      0.0067

The analysis of variance in part (b) indicates that there is no difference between technicians in the measurement of fluorescent proteins in the rats (P = 0.3065). Since average protein uptake varies from rat to rat, it is to be expected that there are mean differences in protein uptake between rats within technicians (P = 0.0067).

Table 1.13 Comparison of the means (LSD) in the nested model

(a) Least squares means of technician
Technician   Estimate   Standard error   DF   t-value   Pr > |t|
A            1.2110     0.03466          54   34.94     <0.0001
B            1.1604     0.03466          54   33.48     <0.0001

(b) T grouping of the least squares means of technician (α = 0.05)
LS means with the same letter are not significantly different
Technician   Estimate   Group
A            1.2110     A
B            1.1604     A

In part (a) of Table 1.13, the least squares means are tabulated under the "Estimate" column together with their respective standard errors. It can be seen that rats under technician A have statistically the same mean protein uptake as rats under technician B (part (b)). The comparison of means for the rat subgroups showed similar means for rats under technician A but different means for rats under technician B (parts (a) and (b) of Table 1.14).
Table 1.14 Comparison of the means (LSD) of the subgroups nested within technicians

(a) Least squares means of rat (technician)
Technician   Rat   Estimate   Standard error   DF   t-value   Pr > |t|
A            5     1.2187     0.06003          54   20.30     <0.0001
A                  1.2302     0.06003          54   20.49     <0.0001
A                  1.1841     0.06003          54   19.72     <0.0001
B            1     1.3540     0.06003          54   22.56     <0.0001
B                  1.0670     0.06003          54   17.77     <0.0001
B                  1.0602     0.06003          54   17.66     <0.0001

(b) T grouping of the least squares means (α = 0.05) of rat (technician)
LS means with the same letter are not significantly different
Technician   Rat   Estimate   Group
B            1     1.3540     A
A                  1.2302     A B
A            5     1.2187     A B
A                  1.1841     A B
B                  1.0670     B
B                  1.0602     B

1.3.3 Two-Way Analysis of Variance with Interaction

This design is used when one wishes to test two factors, A and B, with a levels of factor A and b levels of factor B. In this experiment, both factors are crossed; that is, each level of A occurs in combination with each level of B. The model with interaction is given by:

$$
y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}
$$

for $i = 1, 2, \cdots, a$; $j = 1, 2, \cdots, b$; $k = 1, 2, \cdots, n_{ij}$; and $\varepsilon_{ijk} \sim N(0, \sigma^2)$. If all the parameters of the model are fixed, then this model can be expressed as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$.
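For this model with a = 3, b = 2, and $n_{ij} = 3$, the matrix expression has the following form, where each $\mathbf{1}$ and $\mathbf{0}$ in $\mathbf{X}$ is a 3 × 1 vector of ones or zeros (one block row per treatment combination):

$$
\mathbf{y} = \begin{bmatrix} y_{111} \\ y_{112} \\ y_{113} \\ y_{121} \\ \vdots \\ y_{322} \\ y_{323} \end{bmatrix},\quad
\mathbf{X} = \begin{bmatrix}
\mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1}
\end{bmatrix},\quad
\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \\ \gamma_{11} \\ \gamma_{12} \\ \gamma_{21} \\ \gamma_{22} \\ \gamma_{31} \\ \gamma_{32} \end{bmatrix},\quad
\mathbf{e} = \begin{bmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \varepsilon_{113} \\ \varepsilon_{121} \\ \vdots \\ \varepsilon_{322} \\ \varepsilon_{323} \end{bmatrix}.
$$

Here $\mathbf{y}$ and $\mathbf{e}$ contain the 18 observations and errors ordered as $y_{111}, y_{112}, y_{113}, y_{121}, y_{122}, y_{123}, y_{211}, \cdots, y_{323}$.

Example This experiment consisted of developing an in vitro efficacy test for self-tanning formulations. Two brands, 1 = erythrulose and 2 = dihydroxyacetone (factor A), and three formulations, 1 = solution, 2 = gel, and 3 = cream (factor B), were tested with four replicates for each condition according to Jermann et al. (2001). Total color change was measured for each of the combinations. The dataset is shown in Table 1.15.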
Table 1.15 Color change (Y) in each of the brands and formulations

Brand   Formulation   Y (four replicates)
1       1             16.79   12.68   12.47   11.67
1       2             10.23   10.29   8.97    8.51
1       3             9.43    9.45    8.86    8.66
2       1             32.85   38.08   30.25   28.41
2       2             25.06   21.66   19.86   18.62
2       3             25.89   22.96   24.55   24.59

Table 1.16 Analysis of variance of the two-way model with interaction

Sources of variation   Degrees of freedom
Brand                  a - 1 = 2 - 1 = 1
Formulation            b - 1 = 3 - 1 = 2
Brand × formulation    (a - 1)(b - 1) = 1 × 2 = 2
Error                  ab(r - 1) = 2 × 3 × 3 = 18
Total                  abr - 1 = 2 × 3 × 4 - 1 = 23

For this two-way model, assuming that the response variable $y_{ijk}$ has a normal distribution, the components are as follows:

Distribution: $y_{ijk} \sim N(\mu_{ij}, \sigma^2)$
Linear predictor: $\eta_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$
Link function: $\mu_{ij} = \eta_{ij}$ (identity)

where $y_{ijk}$ is the color change observed at the kth repetition at the ith level of factor A and the jth level of factor B, $\mu$ is the intercept (the overall mean), $\alpha_i$ is the fixed effect due to the level of factor A (brand), $\beta_j$ is the fixed effect of the level of factor B (type of formulation), and $\gamma_{ij}$ is the fixed effect due to the interaction between brand and formulation. Table 1.16 shows the sources of variation and degrees of freedom.
The following GLIMMIX code in SAS allows us to estimate the main effects and the interaction:

    proc glimmix;
      class brand formulation;
      model y = brand|formulation;
      lsmeans brand|formulation/lines;
    run;

Part of the results is shown below.

Table 1.17 Results of the analysis of variance of the two-way model with interaction

(a) Fit statistics
-2 Res log likelihood       90.20
AIC (smaller is better)     104.20
AICC (smaller is better)    115.40
BIC (smaller is better)     110.43
CAIC (smaller is better)    117.43
HQIC (smaller is better)    105.06
Pearson's chi-square        99.61
Pearson's chi-square / DF   5.53

(b) Type III tests of fixed effects
Effect                Num DF   Den DF   F-value   Pr > F
Brand                 1        18       257.04    <0.0001
Formulation           2        18       22.99     <0.0001
Brand × formulation   2        18       4.68      0.0231

Of all the fit statistics in part (a) of Table 1.17, the value that we are interested in highlighting in this analysis is "Pearson's chi-square / DF," which corresponds to the mean squared error (MSE), since we are not comparing alternative models for this dataset. The value of the MSE is 5.53. The type III tests of fixed effects, in part (b) of Table 1.17, indicate that brand (P < 0.0001), formulation (P < 0.0001), and the interaction between both factors (P = 0.0231) all have a significant effect on the self-tanning color change. The least squares means obtained with "lsmeans" are shown in Table 1.18 for the levels of the brand factor, in Table 1.19 for the levels of the formulation factor, and in Table 1.20 for the interaction of both factors.

Table 1.18 Means and standard errors of the tanning brand

Least squares means of the brand
Brand   Estimate   Standard error   DF   t-value   Pr > |t|
1       10.6675    0.6791           18   15.71     <0.0001
2       26.0650    0.6791           18   38.38     <0.0001

T grouping of the least squares means (α = 0.05)
LS means with the same letter are not significantly different
Brand   Estimate   Group
2       26.0650    A
1       10.6675    B
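The MSE and F values in Table 1.17 can be cross-checked from the raw data with the classical sums of squares for a balanced two-way layout. The following Python sketch is an illustration only: the function name `two_way_anova` is hypothetical, and the assignment of the observations in Table 1.15 to brand and formulation cells is assumed from the row order of the table (four consecutive values per cell).

```python
from statistics import mean

# Color change from Table 1.15, keyed by (brand, formulation).
# The cell assignment is assumed from the row order of the table.
cells = {
    (1, 1): [16.79, 12.68, 12.47, 11.67],
    (1, 2): [10.23, 10.29, 8.97, 8.51],
    (1, 3): [9.43, 9.45, 8.86, 8.66],
    (2, 1): [32.85, 38.08, 30.25, 28.41],
    (2, 2): [25.06, 21.66, 19.86, 18.62],
    (2, 3): [25.89, 22.96, 24.55, 24.59],
}

def two_way_anova(cells):
    """Balanced two-way ANOVA with interaction; returns the MSE and F values."""
    a_levels = sorted({i for i, _ in cells})
    b_levels = sorted({j for _, j in cells})
    r = len(next(iter(cells.values())))            # replicates per cell
    a, b = len(a_levels), len(b_levels)
    cell_mean = {k: mean(v) for k, v in cells.items()}
    grand = mean(list(cell_mean.values()))
    a_mean = {i: mean([cell_mean[i, j] for j in b_levels]) for i in a_levels}
    b_mean = {j: mean([cell_mean[i, j] for i in a_levels]) for j in b_levels}
    ss_a = b * r * sum((a_mean[i] - grand) ** 2 for i in a_levels)
    ss_b = a * r * sum((b_mean[j] - grand) ** 2 for j in b_levels)
    ss_ab = r * sum((cell_mean[i, j] - a_mean[i] - b_mean[j] + grand) ** 2
                    for i in a_levels for j in b_levels)
    ss_e = sum((y - cell_mean[k]) ** 2 for k, ys in cells.items() for y in ys)
    mse = ss_e / (a * b * (r - 1))
    return {
        "MSE": mse,
        "F_brand": ss_a / (a - 1) / mse,
        "F_formulation": ss_b / (b - 1) / mse,
        "F_interaction": ss_ab / ((a - 1) * (b - 1)) / mse,
    }

result = two_way_anova(cells)
print({k: round(v, 2) for k, v in result.items()})
```

Because the design is balanced, these sums of squares coincide with the type III tests reported by GLIMMIX, and the marginal means computed inside the function match the estimates in Table 1.18.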
The "lines" option allows us to compare the means using the LSD method. The least squares means for the brand factor and for the type of formulation are given in Tables 1.18 and 1.19, respectively.

Table 1.19 Means and standard errors of the tanning brand formulation

Least squares means for the tanning brand formulation
Formulation   Estimate   Standard error   DF   t-value   Pr > |t|
1             22.9000    0.8317           18   27.53     <0.0001
2             15.4000    0.8317           18   18.52     <0.0001
3             16.7988    0.8317           18   20.20     <0.0001

T grouping of the least squares means (α = 0.05) of the tanning brand formulation
LS means with the same letter are not significantly different
Formulation   Estimate   Group
1             22.9000    A
3             16.7988    B
2             15.4000    B

Table 1.20 Comparison of the means of the interaction of both factors

T grouping of the least squares means (α = 0.05) of brand*formulation
LS means with the same letter are not significantly different
Brand   Formulation   Estimate   Group
2       1             32.3975    A
2       3             24.4975    B
2       2             21.3000    B
1       1             13.4025    C
1       2             9.5000     D
1       3             9.1000     D

The hypothesis of no interaction should be tested first, and the main effects should be tested only if the interaction effect is not significant. If the interaction is significant, then tests for the main effects are meaningless. The interaction analysis shows that brand 2 (dihydroxyacetone) produces a greater color change than brand 1 (erythrulose) in all three formulations.

Now consider the previous model without interaction ($\gamma_{11} = \gamma_{12} = \cdots = \gamma_{32} = 0$), where factor A has a levels and factor B has b levels. The model without interaction is given by:

$$
y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}
$$

for $i = 1, 2, \cdots, a$; $j = 1, 2, \cdots, b$; $k = 1, 2, \cdots, n_{ij}$; and $\varepsilon_{ijk} \sim N(0, \sigma^2)$.
The model without interaction with a = 3, b = 2, and $n_{ij} = 3$ reduces to the following, where each $\mathbf{1}$ and $\mathbf{0}$ in $\mathbf{X}$ is a 3 × 1 vector of ones or zeros:

$$
\mathbf{y} = \begin{bmatrix} y_{111} \\ y_{112} \\ y_{113} \\ y_{121} \\ \vdots \\ y_{322} \\ y_{323} \end{bmatrix},\quad
\mathbf{X} = \begin{bmatrix}
\mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} \\
\mathbf{1} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{1} \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{1} & \mathbf{0} \\
\mathbf{1} & \mathbf{0} & \mathbf{0} & \mathbf{1} & \mathbf{0} & \mathbf{1}
\end{bmatrix},\quad
\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \end{bmatrix},\quad
\mathbf{e} = \begin{bmatrix} \varepsilon_{111} \\ \varepsilon_{112} \\ \vdots \\ \varepsilon_{322} \\ \varepsilon_{323} \end{bmatrix}.
$$

Note that the design matrix for the model without interaction is the same as that for the model with interaction, except that the last six columns are removed. Let us assume that the interaction effect is not significant. The following SAS code estimates the main effects of both factors. Running the program and analyzing the output is left as practice for the reader.

    proc glimmix;
      class brand formulation;
      model y = formulation brand;
      lsmeans brand formulation/lines;
    run;

1.4 Analysis of Covariance (ANCOVA)

Consider an experiment to compare t ≥ 2 treatments after adjusting for the effects of a covariate x. The model for an analysis of covariance is given by:

$$
y_{ij} = \mu + \tau_i + \beta_i x_{ij} + \varepsilon_{ij}
$$

for $i = 1, 2, \cdots, t$ and $j = 1, 2, \cdots, n_i$. Here, the $\varepsilon_{ij}$ are independent, normally distributed random errors with zero mean and constant variance $\sigma^2 > 0$. In this model, $\mu$ is the overall mean, $\tau_i$ is the fixed effect of the ith treatment (ignoring the covariate x), $\beta_i$ denotes the slope of the line that relates the response variable y to x for the ith treatment, and the $x_{ij}$ are fixed covariates.
Assuming t = 3 and $n_1 = n_2 = n_3 = 3$, we have:

$$
\mathbf{y} = \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{31} \\ y_{32} \\ y_{33} \end{bmatrix},\quad
\mathbf{X} = \begin{bmatrix}
1 & 1 & 0 & 0 & x_{11} & 0 & 0 \\
1 & 1 & 0 & 0 & x_{12} & 0 & 0 \\
1 & 1 & 0 & 0 & x_{13} & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & x_{21} & 0 \\
1 & 0 & 1 & 0 & 0 & x_{22} & 0 \\
1 & 0 & 1 & 0 & 0 & x_{23} & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & x_{31} \\
1 & 0 & 0 & 1 & 0 & 0 & x_{32} \\
1 & 0 & 0 & 1 & 0 & 0 & x_{33}
\end{bmatrix},\quad
\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix},\quad
\mathbf{e} = \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{33} \end{bmatrix}.
$$

The analysis of covariance (ANCOVA), as can be seen, is a general linear model of the form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$.

For example, consider a hypothetical study of flower production in two subspecies of plants. The number of flowers per plant may vary between the subspecies, but, within each subspecies, flower production may also vary with the size of each plant, and this relationship may be positive or negative. A positive relationship might arise if plants with more resources (sunlight, water, nutrients) could invest more energy in both growth and flower production. A negative relationship could arise if there were a trade-off between the energy invested in growth and the energy invested in flower production. In this study, subspecies is a categorical variable and plant size is a continuous variable (the covariate). Measuring plant size and flower production in the two subspecies allows the investigation of three different questions: Is flower production influenced by subspecies? Is flower production influenced by plant size? Is the effect of plant size on flower production influenced by subspecies?

Example 1. A central question in plant reproductive ecology is how hermaphroditic plant species allocate resources to male and female structures. A study conducted to address this question counted the number of stamens (male structures that produce pollen) and ovules (female structures that, when fertilized by a pollen grain, become seeds) in the flowers of "prairie larkspur" plants in two populations in southeastern Minnesota.
The total number of flowers produced by each plant was also determined to assess whether plant size affected ovule production per flower. The dataset for this example can be found in the Appendix (Data: Larkspur plants).

An ANCOVA is appropriate for this study to test the following three null hypotheses:

(a) There is no difference in the average number of ovules per flower between the two populations (the main effect).
(b) There is no effect of plant size on the average number of ovules per flower (the covariate effect).
(c) The effect of plant size on the mean number of ovules per flower does not differ between the study sites (the interaction effect).

The components of the ANCOVA model, assuming that the response variable $y_{ij}$ is normally distributed, are as follows:

Distribution: $y_{ij} \sim N(\mu_{ij}, \sigma^2)$
Linear predictor: $\eta_{ij} = \mu + \tau_i + plant(\tau)_{j(i)} + \beta_i (X_{ij} - \bar{X}_{..})$
Link function: $\mu_{ij} = \eta_{ij}$ (identity)

where $y_{ij}$ is the number of ovules observed in the jth plant of the ith population, $\mu$ is the overall mean, $\tau_i$ is the fixed effect due to population i, $plant(\tau)_{j(i)}$ is the random effect due to plant j in population i, $\beta_i$ is the slope for population i, $\bar{X}_{..}$ is the overall mean of the size of all plants, and $X_{ij}$ is the size of plant j in population i. The sources of variation and degrees of freedom of the ANCOVA are shown in Table 1.21.
Table 1.21 Analysis of covariance

Sources of variation                      Degrees of freedom
Population                                t - 1 = 2 - 1 = 1
β (depending on the size of the plant)    1
Population × β                            1
Error                                     $\sum_{i=1}^{t} r_i - 1 - t - 1 = 75$
Total                                     $\sum_{i=1}^{t} r_i - 1 = 79 - 1 = 78$

The basic syntax in GLIMMIX for an analysis of covariance with different slopes is as follows:

    proc glimmix;
      class population plant;
      model ovules = population xbar population*xbar/ddfm=satterthwaite;
      random plant(population);
      lsmeans population/lines;
    run;

In the above syntax, the "class" statement lists all the class (categorical) variables but not the covariate (continuous variable), which in this case is plant size centered at the average size of all plants, $xbar = X_{ij} - \bar{X}_{..}$. The options "ddfm" and "lines" ask proc GLIMMIX to apply a degrees-of-freedom correction using the Satterthwaite method and to compare the means using the LSD method, respectively. Part of the results is shown in Table 1.22.

Table 1.22 Results of the analysis of covariance for the two populations of larkspur plants

(a) Covariance parameter estimates
Cov Parm             Estimate   Standard error
Plant (population)   12.7951    2.2416
Residual             0.9321     .

(b) Type III tests of fixed effects
Effect                Num DF   Den DF   F-value   Pr > F
Population            1                 7.32      0.0084
Center                1                 16.39     0.0001
Center × population   1                 7.81      0.0066

(c) Least squares means of population
Population   Estimate   Standard error   DF   t-value   Pr > |t|
Cedar.cr     20.3538    0.6062                33.58     <0.0001
St. Croix    22.7596    0.6502                35.00     <0.0001

(d) T grouping of the least squares means (α = 0.05) of population
LS means with the same letter are not significantly different
Population   Estimate   Group
St. Croix    22.7596    A
Cedar.cr     20.3538    B

The estimates of the variance components (part (a)) due to plant and to within-treatment variability are $\hat{\sigma}^2_{plant(population)} = 12.795$ and $\hat{\sigma}^2 = MSE = 0.9321$, respectively.
The tests in part (b) show significant effects of population (P = 0.0084), of plant size (P = 0.0001), and of the population × plant size interaction (P = 0.0066) on the average number of ovules per flower; that is, the effect of plant size differs between the populations. The estimated means of the average number of ovules for both populations are tabulated, with their standard errors, in the "Estimate" column of part (c), and the comparison of the means is given in part (d).

If in the previous model we assume that the slopes are equal ($\beta_1 = \beta_2$), then the ANCOVA reduces to:

$$
y_{ij} = \mu + \tau_i + \beta (X_{ij} - \bar{X}_{..}) + \varepsilon_{ij}
$$

With t = 3 and $n_1 = n_2 = n_3 = 3$, the ANCOVA model for this case (equality of slopes) reduces to:

$$
\mathbf{y} = \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{31} \\ y_{32} \\ y_{33} \end{bmatrix},\quad
\mathbf{X} = \begin{bmatrix}
1 & 1 & 0 & 0 & x_{11} \\
1 & 1 & 0 & 0 & x_{12} \\
1 & 1 & 0 & 0 & x_{13} \\
1 & 0 & 1 & 0 & x_{21} \\
1 & 0 & 1 & 0 & x_{22} \\
1 & 0 & 1 & 0 & x_{23} \\
1 & 0 & 0 & 1 & x_{31} \\
1 & 0 & 0 & 1 & x_{32} \\
1 & 0 & 0 & 1 & x_{33}
\end{bmatrix},\quad
\boldsymbol{\beta} = \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \\ \beta \end{bmatrix},\quad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{33} \end{bmatrix}.
$$

The basic syntax using GLIMMIX for an analysis of covariance with equal slopes is as follows:

    proc glimmix;
      class population plant;
      model ovules = population xbar/ddfm=satterthwaite;
      random plant(population);
      lsmeans population/lines;
    run;

So far, we have exemplified the general linear model of the form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}$. In the following, some characteristics of a linear mixed model (LMM) are described.

1.5 Mixed Models

1.5.1 Introduction

Linear mixed models (LMMs) are appropriate for analyzing continuous response variables in which the residuals are normally distributed. These types of models are well suited for studies of grouped datasets such as (1) students in classrooms, animals in herds, people grouped by municipality or geographic region, or randomized block experimental designs such as batches of raw materials for an industrial process, and (2) longitudinal or repeated measures studies, in which subjects are measured repeatedly over time or under different conditions.
These designs occur in a wide variety of settings: biology, agriculture, industry, and socioeconomic sciences. LMMs provide researchers with powerful and flexible analytical tools for these types of data.

The name linear mixed models comes from the fact that these models are linear in the parameters and that the covariates, or independent variables, may involve a combination of fixed and random effects. "Fixed effects" can be associated with continuous covariates, such as the weight of an animal in kilograms, maize yield in tons per hectare, a reference test score, or socioeconomic status, which take a continuous range of values, or with factors, such as gender, variety, or treatment group, which are categorical. Fixed effects are unknown constant parameters associated with continuous covariates or with the levels of the categorical factors in an LMM. The estimation of these parameters in LMMs is generally of intrinsic interest because they indicate the relationship of the covariates with the continuous response variable.

When the levels of a factor are a sample from a larger population, so that the particular levels are not of interest in themselves (e.g., classrooms, regions, herds, or clinics that are randomly sampled from a population), the effects associated with the levels of that factor can be modeled as random effects in an LMM. "Random effects" are represented by random (unobserved) variables that we generally assume to have a particular distribution, the normal distribution being the most common.

Mixed models are extremely useful because they allow us to address two important aspects:

1. From a statistical point of view, biological data are often structured in a way that does not satisfy the assumption of independence of the observations.
Examples include the following:
   (a) Multiple measurements of the same subject/organism
   (b) Experiments organized into spatial blocks
   (c) Observational data in which multiple investigations were conducted in different locations
   (d) Synthesis of data from similar experiments that were performed by different researchers

2. From a biological perspective, the processes being measured can be affected by multiple sources of variation, often occurring at different spatial or temporal scales. We are interested in using statistical methods that can model multiple sources of stochasticity, at multiple scales, so that we can measure the relative magnitude of the different sources of variation and determine which predictors explain variation at different scales.

1.5.2 Mixed Models

The matrix notation for a mixed model is highly similar to that for a fixed effects (systematic) model. The main difference is that, instead of using a single design matrix to describe the systematic part of the model, the matrix notation for a mixed model uses at least two design matrices: a design matrix X to describe the fixed effects and a design matrix Z to describe the random effects. The fixed effects design matrix X is constructed in the same way as in a general linear fixed effects model ($\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$). X has dimension n × (p + 1), where n is the number of observations in the dataset and p + 1 is the number of fixed effects parameters to be estimated. The design matrix Z is constructed in the same way as X, but for the random effects. Z has dimension n × q, where q is the number of random effects coefficients in the model.
In matrix notation, a linear mixed model can be represented as

$$
\mathbf{y} = \underbrace{\mathbf{X}\boldsymbol{\beta}}_{\text{systematic}} + \underbrace{\mathbf{Z}\mathbf{b}}_{\text{random}} + \underbrace{\boldsymbol{\varepsilon}}_{\text{experimental error}}, \qquad \mathbf{b} \sim N(\mathbf{0}, \mathbf{G}) \ \text{and}\ \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{R}) \tag{1.2}
$$

where y is the n × 1 vector of observations, β is the (p + 1) × 1 vector of fixed effects, b is the q × 1 vector of random effects, ε is the n × 1 vector of random error terms, X is the n × (p + 1) design matrix that relates the observations to the fixed effects β, and Z is the n × q design matrix that relates the observations to the random effects b. Assuming that b and ε are uncorrelated random variables with zero mean and variance-covariance matrices G and R, respectively, we have

$$
E(\mathbf{b}) = \mathbf{0},\quad E(\boldsymbol{\varepsilon}) = \mathbf{0},\quad Var(\mathbf{b}) = \mathbf{G},\quad Var(\boldsymbol{\varepsilon}) = \mathbf{R},\quad Cov(\mathbf{b}, \boldsymbol{\varepsilon}) = \mathbf{0}.
$$

Because Xβ is fixed and b and ε are uncorrelated, it is not difficult to verify that Var(y) = Var(Xβ + Zb + ε) is

$$
Var(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R} = \mathbf{V}.
$$

Matrix V is an important component when working with linear mixed models (LMMs) because it contains the random sources of variation and also defines how estimation in such models differs from ordinary least squares. If the random variation beyond the residual enters the model only through random effects, as in a randomized complete block design (RCBD), then matrix G is the first point of attention. On the other hand, for repeated measures or for spatial analysis, matrix R is extremely important.

Assuming that the random effects (blocks) and the errors have normal distributions,

$$
\mathbf{b} \sim N(\mathbf{0}, \mathbf{G}) \quad \text{and} \quad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \mathbf{R}),
$$

the vector of observations y also has a normal distribution, that is, $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{V})$. The same model can be written in probability distribution form in two different but equivalent ways. The first is the marginal model

$$
\mathbf{y} \sim N\left(E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta},\ \mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}\right) \tag{1.3}
$$

In this marginal model, the mean is based only on the fixed effects, and the parameters describing the random effects are contained in the variance-covariance matrix V (Littell et al. 2006).
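To make the marginal covariance matrix concrete, the following Python sketch assembles V = ZGZ′ + R for a small hypothetical layout: six observations arranged in three random blocks of two, with G = σ_b² I and R = σ² I. The variance values (σ_b² = 2 and σ² = 1), the block layout, and the helper names are arbitrary choices made for illustration and are not taken from any dataset in this chapter.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def scaled_identity(n, value):
    """n x n diagonal matrix with a constant value on the diagonal."""
    return [[value if i == j else 0.0 for j in range(n)] for i in range(n)]

sigma2_b, sigma2 = 2.0, 1.0            # illustrative variance components
block_of_obs = [0, 0, 1, 1, 2, 2]      # observations 1-2 in block 1, 3-4 in block 2, ...
n, q = len(block_of_obs), 3

# Z is the n x q incidence matrix that maps each observation to its block.
Z = [[1.0 if block_of_obs[i] == k else 0.0 for k in range(q)] for i in range(n)]
G = scaled_identity(q, sigma2_b)       # Var(b)
R = scaled_identity(n, sigma2)         # Var(epsilon)

ZGZt = matmul(matmul(Z, G), transpose(Z))
V = [[ZGZt[i][j] + R[i][j] for j in range(n)] for i in range(n)]

for row in V:
    print(row)
```

The result is block diagonal: each diagonal entry equals σ² + σ_b², two observations in the same block share the covariance σ_b², and observations in different blocks are uncorrelated.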
In general, a structure is imposed on b in terms of Var(b) = G, and, therefore, marginally, the components of y depend on the structure of V = ZGZ′ + R.

The second is the conditional model

$$
\mathbf{y} \mid \mathbf{b} \sim N(\mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b},\ \mathbf{R}) \tag{1.4}
$$

In this conditional model, b is distributed as shown in Eq. (1.2). For LMMs, the two models are exactly the same; but if the response variable is modeled under a non-normal distribution, then the models are different (Stroup, 2012) and generalized linear mixed models are required. The estimator $\hat{\boldsymbol{\beta}}$ of the fixed effects yields the best linear unbiased estimators (commonly known as BLUEs), whereas $\hat{\mathbf{b}}$ yields the best linear unbiased predictors (commonly known as BLUPs) of the random effects b. Estimating the expected value of the marginal LMM (1.3) gives the BLUEs, and estimating that of the conditional LMM (1.4) gives the BLUPs. The estimators for the BLUEs of β and the BLUPs of b are as follows:

$$
\hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}
$$

$$
\hat{\mathbf{b}} = \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)
$$

This solution is practical only for small datasets: with large datasets it is computationally highly demanding, because the n × n matrix V has to be inverted. For this reason, the BLUEs of β and the BLUPs of b are normally obtained by solving Henderson's mixed model equations, which are presented later in this chapter.

1.5.3 Distribution of the Response Variable Conditional on Random Effects (y | b)

The distribution that the researcher selects for the population under study should be the true distribution, or a good approximation to the likely distribution, of the response variable. A good representation of the population distribution of a response variable should not only take into account the nature of the response variable (e.g., continuous, discrete, etc.)
and the shape of the distribution but should also provide a good model for the relationship between the mean and the variance. In this chapter, we assume that the data are normally distributed with mean μ and variance σ², that is, $y_{ij} \sim N(\mu, \sigma^2)$, and that the random effects are normally distributed with mean 0 and constant variance $\sigma_b^2$, that is, $b_j \sim N(0, \sigma_b^2)$.

1.5.4 Types of Factors and Their Related Effects on LMMs

In an LMM, there are two types of factors: fixed factors, which make up the systematic part, and random factors, which make up the stochastic part. Each has its related effects on the dependent (response) variable. In the following sections, we provide a brief description of these factors and their implications in the context of an LMM.

1.5.4.1 Fixed Factors

A fixed factor is commonly used in standard analysis of variance (ANOVA) or analysis of covariance (ANCOVA) models. It is defined as a categorical or classification variable for which the researcher has included in the model all the levels (or conditions) that are of interest in the study. Fixed factors may include qualitative covariates, such as gender; classification variables implied by a sampling design, such as a region or a stratum, or by a study design, such as the method of treatment in a randomized clinical trial; and so on. The levels of a fixed factor are chosen to represent specific conditions so that they can be used to define contrasts (or sets of contrasts) of interest in the research study.

1.5.4.2 Random Factors

A random factor is a classification variable whose levels can be regarded as randomly sampled from a population of levels. Not all possible levels of a random factor are present in the dataset, but the researcher's intention is to make inferences about the entire population of levels from the selected sample of factor levels.
Random factors are considered in an analysis so that the change in the dependent variable across random factor levels can be evaluated and the results of the data analysis can be generalized to all random factor levels in the population.

1.5.4.3 Fixed Versus Random Factors

In contrast to fixed factor levels, random factor levels do not represent conditions specifically chosen to meet the objectives of the study. However, depending on the objectives of the study, the same factor may be considered as either a fixed factor or a random factor.

Fixed effects, commonly referred to as regression coefficients or fixed effect parameters, describe the relationships between the dependent variable and predictor variables (i.e., fixed factors or continuous covariates) for an entire population of units of analysis or for a relatively small number of subpopulations defined by the levels of a fixed factor. Fixed effects may describe the contrasts or differences between levels of a fixed factor (e.g., between males and females for the factor sex) in the mean responses for a continuous dependent variable or may describe the effect of a continuous covariate on the dependent variable. Fixed effects are assumed to be unknown fixed quantities in an LMM and are estimated based on analysis of the data collected in a study.

Random effects are random values associated with the levels of a random factor (or factors) in an LMM. These values, which are specific to a given level of a random factor, generally represent random deviations from the relationships described by fixed effects. For example, random effects associated with levels of a random factor may enter an LMM as random intercepts (random deviations for a given subject or group from an overall intercept) or as random coefficients (random deviations for a given subject or group from the overall fixed effects) in the model. In contrast to fixed effects, random effects are represented as stochastic variables in an LMM.
1.5.5 Nested Versus Crossed Factors and Their Corresponding Effects

When a given level of one factor (random or fixed) can be measured only at a single level of another factor and not across multiple levels, the levels of the first factor are said to be nested within the levels of the second factor. The effects of the nested factor on the response variable are known as nested effects. For example, suppose that you want to conduct a particular study at the primary level in a school zone and you select schools and classrooms at random. Classroom levels (one of the random factors) are nested within school levels (another random factor), since each classroom can appear only within a single school.

When a given level of one factor (random or fixed) can be measured across multiple levels of another factor, one factor is said to be crossed with the other, and the effects of these factors on the dependent variable are known as crossed effects.

1.5.6 Estimation Methods

Standard methods of estimation in mixed models with a normal response are maximum likelihood (ML) and restricted maximum likelihood (REML). The linear mixed effects model is as follows:

$$y = X\beta + Zb + \varepsilon$$

The variance–covariance matrix V for a one-way analysis of variance (ANOVA) with a randomized block effect and six observations (two treatments in each of three blocks, ordered as $y_{11}, y_{21}, y_{12}, y_{22}, y_{13}, y_{23}$) is equal to:

$$
V = \mathrm{Var}(y) = ZGZ^{T} + \sigma^{2}I =
\begin{bmatrix}
\sigma^{2}+\sigma^{2}_{b} & \sigma^{2}_{b} & 0 & 0 & 0 & 0\\
\sigma^{2}_{b} & \sigma^{2}+\sigma^{2}_{b} & 0 & 0 & 0 & 0\\
0 & 0 & \sigma^{2}+\sigma^{2}_{b} & \sigma^{2}_{b} & 0 & 0\\
0 & 0 & \sigma^{2}_{b} & \sigma^{2}+\sigma^{2}_{b} & 0 & 0\\
0 & 0 & 0 & 0 & \sigma^{2}+\sigma^{2}_{b} & \sigma^{2}_{b}\\
0 & 0 & 0 & 0 & \sigma^{2}_{b} & \sigma^{2}+\sigma^{2}_{b}
\end{bmatrix}
$$

The variance of $y_{11}$ is $V_{11} = \sigma^{2} + \sigma^{2}_{b}$ and the covariance between $y_{11}$ and $y_{21}$ is $V_{12} = V_{21} = \sigma^{2}_{b}$, because these two observations come from the same block. The covariance between $y_{11}$ and the observations from other blocks is zero. All possible covariances can be found in the matrix V.

1.5.6.1 Maximum Likelihood

The likelihood function l is a function of the observations and the model parameters.
It gives us a measure of the probability of observing a particular observation y, given a set of model parameters β and b. The log-likelihood functions for $y \mid b$ and for b in a mixed model are given by:

$$l(y \mid b) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|R| - \frac{1}{2}(y - X\beta - Zb)^{T}R^{-1}(y - X\beta - Zb)$$

and

$$l(b) = -\frac{N_{b}}{2}\log(2\pi) - \frac{1}{2}\log|G| - \frac{1}{2}b^{T}G^{-1}b$$

where $N_{b}$ represents the total number of random effect levels. Therefore, up to terms that do not involve β and b, the joint log likelihood of y and b is equal to:

$$l(y, b) = -\frac{1}{2}(y - X\beta - Zb)^{T}R^{-1}(y - X\beta - Zb) - \frac{1}{2}b^{T}G^{-1}b$$

Differentiating this expression with respect to β and b gives:

$$\frac{\partial l(y, b)}{\partial \beta} = X^{T}R^{-1}y - X^{T}R^{-1}X\beta - X^{T}R^{-1}Zb$$

$$\frac{\partial l(y, b)}{\partial b} = Z^{T}R^{-1}y - Z^{T}R^{-1}X\beta - Z^{T}R^{-1}Zb - G^{-1}b$$

Setting these derivatives to zero and collecting terms in β and b, we obtain the following linear mixed model equations:

$$
\begin{bmatrix}
X^{T}R^{-1}X & X^{T}R^{-1}Z\\
Z^{T}R^{-1}X & Z^{T}R^{-1}Z + G^{-1}
\end{bmatrix}
\begin{bmatrix}\hat{\beta}\\ \hat{b}\end{bmatrix}
=
\begin{bmatrix}X^{T}R^{-1}y\\ Z^{T}R^{-1}y\end{bmatrix}
$$

The solution can be written as:

$$
\begin{bmatrix}\hat{\beta}\\ \hat{b}\end{bmatrix}
=
\begin{bmatrix}
X^{T}R^{-1}X & X^{T}R^{-1}Z\\
Z^{T}R^{-1}X & Z^{T}R^{-1}Z + G^{-1}
\end{bmatrix}^{-1}
\begin{bmatrix}X^{T}R^{-1}y\\ Z^{T}R^{-1}y\end{bmatrix}
$$

Here, β is the vector of fixed effects parameters and b is the vector of random effects. The solution for these parameters depends on the two covariance matrices G and R, and it no longer depends directly on V as in the previous solution. Moreover, this solution, which is known as Henderson's (1950) mixed model equations, is computationally much more efficient than the previous one given for β and b, since it does not require the inverse of the matrix $V = ZGZ^{T} + R$. The solution to these mixed model equations is based on the assumption that we know the components of G and R, which, in practice, need to be estimated. The following section therefore presents a popular method for estimating the variance components of G and R, which is extremely versatile and powerful.
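As a hedged illustration of the mixed model equations above, the following minimal Python sketch builds and solves them for the piglet weight gains that appear later in Table 1.23 (diets fixed, litters random). It assumes $R = \sigma^{2}I$ and $G = \sigma^{2}_{b}I$ and treats the variance estimates reported in Table 1.25 as known constants; the helper `solve` and all variable names are illustrative choices, not part of the original text.

```python
# Sketch: Henderson's mixed model equations for a small RCBD
# (diets fixed, litters random), with G and R treated as known.

def solve(a, rhs):
    """Solve the linear system a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]   # augmented matrix [a | rhs]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

# Weight gains: one row per litter, columns are diets 1, 2, 3 (Table 1.23).
gain = [[54.3, 53.1, 59.7],
        [53.6, 52.4, 59.7],
        [55.2, 57.1, 67.2]]
sigma2_b, sigma2 = 5.3117, 3.2961   # assumption: Table 1.25 estimates taken as known

# Build W = [X Z] row by row: X = (intercept, diet 2, diet 3), Z = litter indicators.
W, y = [], []
for j in range(3):                  # litter
    for i in range(3):              # diet
        x_row = [1.0, float(i == 1), float(i == 2)]
        z_row = [float(j == k) for k in range(3)]
        W.append(x_row + z_row)
        y.append(gain[j][i])

p, q = 3, 3                         # numbers of fixed and random effects
n_par = p + q
# With R = sigma2 * I, the coefficient matrix is W'W / sigma2 plus G^{-1} on the Z block.
lhs = [[sum(row[r] * row[c] for row in W) / sigma2 for c in range(n_par)]
       for r in range(n_par)]
for k in range(q):
    lhs[p + k][p + k] += 1.0 / sigma2_b
rhs = [sum(row[r] * yi for row, yi in zip(W, y)) / sigma2 for r in range(n_par)]

solution = solve(lhs, rhs)
beta_hat, b_hat = solution[:p], solution[p:]
print("BLUEs (intercept, diet 2 - diet 1, diet 3 - diet 1):", [round(v, 4) for v in beta_hat])
print("BLUPs (litters 1-3):", [round(v, 4) for v in b_hat])
```

With this parameterization, the intercept is the mean of diet 1, so it should agree with the least squares mean of diet I in Table 1.25, and the litter BLUPs are the litter-mean deviations shrunk toward zero by the factor $3\sigma^{2}_{b}/(\sigma^{2} + 3\sigma^{2}_{b})$. Only the 6 × 6 coefficient matrix is inverted (implicitly, through elimination); V itself is never formed.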
1.5.6.2 Restricted Maximum Likelihood Estimation

The restricted maximum likelihood method is also known as the residual maximum likelihood method and is extremely useful, among other things, for estimating variance components. This method is also based on the maximum likelihood method, but, instead of maximizing the likelihood function of the original data, it maximizes the likelihood function of a set of residual contrasts obtained by removing the fixed effects from the original response. That is, instead of maximizing over y, the likelihood is maximized over Ky, where K is a matrix of constants such that KX = 0, which implies that:

$$E(Ky) = E(KX\beta + KZb + K\varepsilon) = 0, \qquad \mathrm{Var}(Ky) = KVK^{T}$$

This implies that Ky is distributed as $N(0, KVK^{T})$, and the likelihood of Ky is called the restricted (or residual) likelihood; maximizing it gives the restricted maximum likelihood (REML) estimates. There are many options for choosing K; a typical choice is $K = I - X(X^{T}X)^{-1}X^{T}$, the ordinary least squares residual operator. The log likelihood of Ky is equal to

$$l(V \mid Ky) = -\frac{n-p}{2}\log(2\pi) - \frac{1}{2}\log\left|KVK^{T}\right| - \frac{1}{2}(Ky)^{T}\left(KVK^{T}\right)^{-1}(Ky)$$

After some algebra, according to Stroup (2012), this log likelihood is equal to:

$$l(V \mid Ky) = -\frac{n-p}{2}\log(2\pi) - \frac{1}{2}\log|V| - \frac{1}{2}\log\left|X^{T}V^{-1}X\right| - \frac{1}{2}r^{T}V^{-1}r$$

where $p = \mathrm{rank}(X)$ and $r = y - X\hat{\beta}_{ML}$, with $\hat{\beta}_{ML} = \left(X^{T}V^{-1}X\right)^{-}X^{T}V^{-1}y$.

The variance components of G and R are estimated with iterative methods such as the Newton–Raphson or Fisher scoring method, which maximize the likelihood function $l(V \mid Ky)$ with respect to the variance components.
The maximization process begins with starting values for the variance components in G and R; with these values of G and R, it is possible to compute a new, more refined version of the parameters β and b; these values are then used to update the estimates of the variance components of the matrices G and R, and the process continues until the established convergence criterion is met.

1.5.7 One-Way Random Effects Model

Suppose that we randomly select $a$ levels from a sufficiently large set of possible levels of the factor of interest. In this case, we say that the factor is random. Random factors are usually categorical. Continuous covariates, which are not measured at randomly chosen levels, generally enter the model as "systematic" or "fixed" effects (e.g., linear, quadratic, or even exponential terms). Random effects are not systematic. Let us assume a simple one-way model:

$$y_{ij} = \mu + \tau_{i} + \varepsilon_{ij}; \quad i = 1, 2, \cdots, a; \quad j = 1, 2, \cdots, n_{i}$$

However, in this case, both the treatment effects and the error term are random variables, i.e., $\tau_{i} \sim N(0, \sigma^{2}_{\tau})$ and $\varepsilon_{ij} \sim N(0, \sigma^{2})$, respectively. The terms $\tau_{i}$ and $\varepsilon_{ij}$ are uncorrelated, and their variances $\sigma^{2}_{\tau}$ and $\sigma^{2}$ are commonly referred to as "variance components."

There can be some confusion about the differences between noise factors and random factors. Noise factors can be fixed or random. Factors are random when we think of their levels as a random sample from a larger population and their effect is not systematic. It is not always clear when a factor is random. For example, suppose that the vice president of a chain of stores is interested in the effects of implementing a management policy in the stores. If the experiment includes all five existing stores, "store" might be considered a fixed factor because the levels of the factor "store" do not come from a random sample.
However, if the store chain has 100 stores and 5 of them are taken for the experiment, because the company is considering rapid expansion and plans to implement the selected new policy at the new locations, then "store" could be considered a random factor. In fixed effects models, the researcher's interest would focus on testing the equality of the treatment (store) means. This would not be appropriate, however, when 5 stores are randomly selected out of 100, because the treatments are randomly selected and we are interested in the population of treatments (stores), not in a particular store or group of stores. The appropriate hypothesis test for this random effects model would be

$$H_{0}: \sigma^{2}_{\tau} = 0 \quad \text{vs} \quad H_{a}: \sigma^{2}_{\tau} > 0$$

The standard analysis of variance partition of the total sum of squares still applies; however, the form of the appropriate test statistic depends on the expected mean squares. In this case, the appropriate test statistic would be

$$F_{c} = \frac{\text{Mean Square}_{\text{Treatments}}}{\text{Mean Square}_{\text{Error}}}$$

$F_{c}$ follows an F-distribution (Fisher–Snedecor) with $a - 1$ degrees of freedom in the numerator and $N - a$ in the denominator, where $N = \sum_{i=1}^{a} n_{i}$.

In a completely random model, we are interested in estimating the variance components $\sigma^{2}_{\tau}$ and $\sigma^{2}$. To do so, we use the analysis of variance method, which consists of equating the expected mean squares with their observed values. With n observations per treatment,

$$\hat{\sigma}^{2} + n\hat{\sigma}^{2}_{\tau} = \text{Mean Square}_{\text{Treatments}}, \qquad \hat{\sigma}^{2} = \text{Mean Square}_{\text{Error}}$$

so that

$$\hat{\sigma}^{2}_{\tau} = \frac{\text{Mean Square}_{\text{Treatments}} - \hat{\sigma}^{2}}{n}$$

1.5.8 Analysis of Variance Model of a Randomized Block Design

Consider a one-way analysis of variance model with an additive randomized block effect. Assume two treatments and three blocks,

$$y_{ij} = \mu + \tau_{i} + b_{j} + \varepsilon_{ij}$$

where $b_{j} \sim N(0, \sigma^{2}_{b})$ and $\varepsilon_{ij} \sim N(0, \sigma^{2})$ with i = 1, 2 and j = 1, 2, 3. The random effects $b_{j}$ and $\varepsilon_{ij}$ are independent and uncorrelated. In addition, treatment effects are assumed to be fixed.
The matrix notation of this model, with its systematic part $X\beta$ and its random part $Zb + \varepsilon$, is as follows:

$$
\underbrace{\begin{bmatrix} y_{11}\\ y_{21}\\ y_{12}\\ y_{22}\\ y_{13}\\ y_{23}\end{bmatrix}}_{y}
=
\underbrace{\begin{bmatrix}
1 & 1 & 0\\
1 & 0 & 1\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 1 & 0\\
1 & 0 & 1
\end{bmatrix}}_{X}
\underbrace{\begin{bmatrix}\mu\\ \tau_{1}\\ \tau_{2}\end{bmatrix}}_{\beta}
+
\underbrace{\begin{bmatrix}
1 & 0 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
0 & 1 & 0\\
0 & 0 & 1\\
0 & 0 & 1
\end{bmatrix}}_{Z}
\underbrace{\begin{bmatrix} b_{1}\\ b_{2}\\ b_{3}\end{bmatrix}}_{b}
+
\underbrace{\begin{bmatrix}\varepsilon_{11}\\ \varepsilon_{21}\\ \varepsilon_{12}\\ \varepsilon_{22}\\ \varepsilon_{13}\\ \varepsilon_{23}\end{bmatrix}}_{\varepsilon}
$$

where $b \sim N(0, G)$ and $\varepsilon \sim N(0, R)$. The variance–covariance matrix G for the random effects in this case is a 3 × 3 diagonal matrix with diagonal elements $\sigma^{2}_{b}$. Note how the matrix representation of this model corresponds exactly to the mixed model formulation, that is, $y = X\beta + Zb + \varepsilon$, where $b \sim N(0, G)$ and $\varepsilon \sim N(0, R)$.

Example

An animal nutritionist is interested in comparing the effect of three diets on weight gain in piglets. To conduct the experiment, the nutritionist randomly selects 3 litters from a set of 20, each containing 3 healthy, similar-sized, recently weaned piglets. In each litter, the three piglets are selected and each piglet is randomly assigned to a treatment.

A randomized complete block design (RCBD) is a variation of the completely randomized design (CRD). In this design, blocks of experimental units are chosen in such a way that the units within a block are as homogeneous as possible and the blocks differ from one another. In a randomized complete block design, there is generally one experimental unit for each treatment in each block, but this does not preclude having more than one experimental unit for each treatment in each block. An RCBD has two sources of variation: the factor of interest, which includes the treatments to be studied, and the "block factor," which here identifies the litters used in the experiment.

Assumptions in RCBD:

1. Sampling: Blocks (litters) are independently and randomly selected, and treatments are randomly assigned to the experimental units within each block.
2. Errors are independent and identically normally distributed with zero mean and constant variance $\sigma^{2}$.
Table 1.23 lists the weight gain in kilograms of piglets from three different litters under three different diets.

Table 1.23 Weight gain (kilograms) of the three litters of piglets

| Litter | Diet1 | Diet2 | Diet3 |
|---|---|---|---|
| 1 | 54.3 | 53.1 | 59.7 |
| 2 | 53.6 | 52.4 | 59.7 |
| 3 | 55.2 | 57.1 | 67.2 |

To make inferences about the pattern of weight gain for the entire population (all litters) of piglets, the litters must be considered in the model as a random effect. Thus, the linear mixed model describing the variability of piglet weight gain in this research, as a function of diets, is as follows:

$$y_{ij} = \mu + \tau_{i} + b_{j} + \varepsilon_{ij} \quad \text{for } i = 1, 2, 3;\; j = 1, 2, 3$$

where $y_{ij}$ is the weight gain observed in the piglet receiving diet i in litter j, μ is the overall mean, $\tau_{i}$ is the fixed effect due to the ith diet, $b_{j}$ is the random effect due to the jth block (litter), assuming $b_{j} \sim N(0, \sigma^{2}_{b})$, and $\varepsilon_{ij}$ is the random error term, independent and identically normally distributed with mean 0 and variance $\sigma^{2}$, i.e., $\varepsilon_{ij} \sim N(0, \sigma^{2})$. The random effects $b_{j}$ and $\varepsilon_{ij}$ are assumed to be independent and uncorrelated. Table 1.24 shows an outline of the analysis of variance for this dataset.

Table 1.24 Analysis of variance of the randomized complete block design

| Sources of variation | Degrees of freedom |
|---|---|
| Blocks | b - 1 = 3 - 1 = 2 |
| Diet | t - 1 = 3 - 1 = 2 |
| Error | (t - 1)(b - 1) = 4 |
| Total | tb - 1 = 8 |
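The quantities behind the outline in Table 1.24 can be checked by hand. The following minimal Python sketch, offered as an illustration rather than as the authors' procedure, computes the sums of squares, mean squares, the F statistic for diet, and the moment (ANOVA-method) estimate of the litter variance from the data in Table 1.23. Variable names are illustrative, and the litter variance uses the expected mean square $E[MS_{\text{Blocks}}] = \sigma^{2} + t\sigma^{2}_{b}$ for a balanced RCBD.

```python
# Sketch: hand computation of the RCBD analysis of variance for the piglet data.
gain = {                      # weight gains for litters 1, 2, 3 under each diet (Table 1.23)
    "Diet1": [54.3, 53.6, 55.2],
    "Diet2": [53.1, 52.4, 57.1],
    "Diet3": [59.7, 59.7, 67.2],
}
t = len(gain)                            # number of treatments (diets)
b = len(next(iter(gain.values())))       # number of blocks (litters)
values = [v for row in gain.values() for v in row]
grand_mean = sum(values) / (t * b)

diet_means = [sum(row) / b for row in gain.values()]
litter_means = [sum(row[j] for row in gain.values()) / t for j in range(b)]

# Partition of the total sum of squares.
ss_total = sum((v - grand_mean) ** 2 for v in values)
ss_diet = b * sum((m - grand_mean) ** 2 for m in diet_means)
ss_block = t * sum((m - grand_mean) ** 2 for m in litter_means)
ss_error = ss_total - ss_diet - ss_block

# Mean squares with the degrees of freedom of Table 1.24.
ms_diet = ss_diet / (t - 1)
ms_block = ss_block / (b - 1)
ms_error = ss_error / ((t - 1) * (b - 1))

f_diet = ms_diet / ms_error
var_litter = (ms_block - ms_error) / t   # from E[MS_block] = sigma^2 + t * sigma_b^2

print(f"MS error = {ms_error:.4f}, F(diet) = {f_diet:.2f}, litter variance = {var_litter:.4f}")
```

Because the layout is balanced, these moment estimates should coincide with the residual variance, the F-value, and the litter variance reported by the mixed model software for this example.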
The SAS program to analyze this dataset is as follows:

    /* Data from Table 1.23: litter, diet, weight gain (kg) */
    data piglets;
      input litter diet gain @@;
      datalines;
    1 1 54.3  1 2 53.1  1 3 59.7
    2 1 53.6  2 2 52.4  2 3 59.7
    3 1 55.2  3 2 57.1  3 3 67.2
    ;
    proc glimmix data=piglets;
      class litter diet;
      model gain = diet / ddfm=satterthwaite;  /* Satterthwaite denominator df */
      random litter;                           /* litters as random blocks */
      lsmeans diet / lines;                    /* letter grouping of LS means */
      contrast "Diet1 vs Diet2" diet 1 -1 0;
      contrast "Diet2 vs Diet3" diet 0 1 -1;
    run;
    quit;

Two options in this syntax are of particular importance in this example: (1) the "ddfm=satterthwaite" option requests the Satterthwaite correction of the denominator degrees of freedom, which is especially important when the number of experimental units (EUs) differs among treatments, and (2) the "lines" option of the "lsmeans" statement displays the least squares means grouped with letters; means that appear with different letters are significantly different. The output for this code is shown in Table 1.25. Part (a) of this table shows the estimated variance due to litter ($\hat{\sigma}^{2}_{\text{litter}} = 5.3117$) and the mean squared error ($\hat{\sigma}^{2} = 3.2961$). The analysis of variance, part (b), shows that there is a highly significant effect of diet on piglet weight gain (P = 0.0091). In the results, we also observe the estimated means and their standard errors (part c, obtained with "lsmeans diet/lines") and the grouping of the means that are statistically different (part d).
In these last results, we can observe that the weight gains of piglets under treatments I and II are not statistically different from each other, but both are statistically different from the weight gain under treatment III.

Table 1.25 Results of the analysis of variance of the three different diets tested on piglet weight gain

(a) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Litter | 5.3117 | 6.4573 |
| Residual | 3.2961 | 2.3307 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Diet | | | 19.02 | 0.0091 |

(c) Dietary least squares means

| Diet | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| I | 54.3667 | 1.6939 | 3.406 | 32.10 | <0.0001 |
| II | 54.2000 | 1.6939 | 3.406 | 32.00 | <0.0001 |
| III | 62.2000 | 1.6939 | 3.406 | 36.72 | <0.0001 |

(d) T grouping of the dietary least squares means (α = 0.05). LS means with the same letter are not significantly different

| Diet | Estimate | Group |
|---|---|---|
| III | 62.2000 | A |
| I | 54.3667 | B |
| II | 54.2000 | B |

Since the researcher wishes to make an inference about the entire population of litters, the factor "litter" must be entered as a random effect; otherwise, the ability of the F-test to detect differences between treatments is diminished, as shown by the P-value changing from 0.0091 to 0.0248 when the litter effect is omitted (Table 1.26).

Table 1.26 Analysis of variance under a CRD. Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Diet | | | 7.28 | 0.0248 |

Another way to see the importance of including random effects in an ANOVA is to calculate the relative efficiency (RE) between the two models.
Table 1.26 shows the results of the analysis of variance under a completely randomized design (CRD), i.e., under the model $y_{ij} = \mu + \tau_{i} + e_{ij}$, in which the litter effect is omitted. If the experiment had been analyzed under a CRD, then the relative efficiency (RE) between an RCBD and a CRD would be:

$$RE = \frac{MSE_{CRD}}{MSE_{RCBD}} = \frac{\left(SSB_{RCBD} + SSE_{RCBD}\right)/\left[t(b-1)\right]}{MSE_{RCBD}}$$

A closely related textbook estimate, which is not algebraically identical to the pooled form above, is

$$RE = \frac{(b-1)\,MSB_{RCBD} + b(t-1)\,MSE_{RCBD}}{(bt-1)\,MSE_{RCBD}}$$

where $MSE_{CRD}$ is the mean squared error under a CRD, $MSE_{RCBD}$ is the mean squared error under an RCBD, $SSB_{RCBD}$ is the sum of squares due to blocks in an RCBD, $SSE_{RCBD}$ is the sum of squares of errors in an RCBD, $MSB_{RCBD}$ is the mean square due to blocks, and t and b are the number of treatments and blocks, respectively. If blocks are not useful, then the RE would be close to 1. The higher the RE, the more effective the blocking is in reducing the error variance. This value can be interpreted as the ratio r/b, where r is the number of experimental units that would have to be assigned to each treatment if a CRD were used instead of an RCBD.

In Table 1.27, we can observe the mean squared error (MSE) of a CRD and of an RCBD (Pearson's chi-square / DF) obtained with the GLIMMIX procedure in SAS, as well as a series of fit statistics.

Table 1.27 Fit statistics of a CRD and RCBD

| Fit statistics | CRD | RCBD |
|---|---|---|
| -2 Res log likelihood | 33.24 | 31.01 |
| AIC (smaller is better) | 41.24 | 35.01 |
| AICC (smaller is better) | 81.24 | 39.01 |
| BIC (smaller is better) | 40.41 | 33.20 |
| CAIC (smaller is better) | 44.41 | 35.20 |
| HQIC (smaller is better) | 37.90 | 31.38 |
| Pearson's chi-square | 51.65 | 19.78 |
| Pearson's chi-square / DF | 8.61 | 3.30 |

The MSEs for a CRD and an RCBD are 8.61 and 3.30, respectively. Substituting these values into the first equation above, we obtain

$$RE = \frac{MSE_{CRD}}{MSE_{RCBD}} = \frac{8.61}{3.30} = 2.609$$

This value indicates that an RCBD is 2.609 times more efficient than a CRD.
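The arithmetic of the relative efficiency can be reproduced from the raw data. The sketch below, a hedged illustration with names of my own choosing, computes the CRD error mean square by pooling the block and error sums of squares over t(b - 1) degrees of freedom, as in the pooled formula above, and then forms the ratio. Because it uses unrounded mean squares, the ratio differs in the third decimal from 2.609, which is based on the rounded values 8.61 and 3.30.

```python
import math

# Sketch: relative efficiency of the RCBD versus a CRD for the piglet data.
# Rows are litters (blocks), columns are diets (treatments), as in Table 1.23.
gain = [[54.3, 53.1, 59.7],
        [53.6, 52.4, 59.7],
        [55.2, 57.1, 67.2]]
b, t = len(gain), len(gain[0])
grand_mean = sum(map(sum, gain)) / (b * t)
diet_means = [sum(row[i] for row in gain) / b for i in range(t)]
litter_means = [sum(row) / t for row in gain]

# Sum of squares within diets: the error SS when litters are ignored (CRD).
ss_within_diet = sum((gain[j][i] - diet_means[i]) ** 2
                     for j in range(b) for i in range(t))
ss_block = t * sum((m - grand_mean) ** 2 for m in litter_means)
ss_error = ss_within_diet - ss_block          # RCBD error SS

mse_crd = ss_within_diet / (t * (b - 1))      # (SSB + SSE) / [t(b - 1)]
mse_rcbd = ss_error / ((t - 1) * (b - 1))
re = mse_crd / mse_rcbd
units_needed = math.ceil(re * b)              # replicates per treatment needed in a CRD

print(f"MSE CRD = {mse_crd:.2f}, MSE RCBD = {mse_rcbd:.2f}, RE = {re:.3f}, r = {units_needed}")
```

The two mean squares should match the Pearson's chi-square / DF row of Table 1.27 after rounding to two decimals.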
In other words, about 8 experimental units per treatment (2.609 × 3 ≈ 8) would have been needed in a CRD to obtain the same precision as that obtained with the RCBD.

1.6 Exercises

Exercise 1.6.1 The following dataset corresponds to the growth of pea plants, in ocular units (0.114 mm per unit), in tissue culture with auxins. The purpose of this experiment was to test the effects of the addition of various types of sugars to the culture medium on growth in length. Pea plants were randomly assigned to one of five treatments: control (no sugar), 2% glucose, 2% fructose, 1% glucose + 1% fructose, and 2% sucrose. A total of 10 observations were taken in each of the treatments, assuming that the measurements are approximately normally distributed with constant variance. Here, the individual plants to which the treatments were applied are the experimental units. The data from this experiment are shown below (Table 1.28):

Table 1.28 Growth of pea plants in the culture medium with auxins with different types of sugars

| Plant | Control | 2% Glucose | 2% Fructose | 1% Glucose + 1% fructose | 2% Sucrose |
|---|---|---|---|---|---|
| 1 | 75 | 57 | 58 | 58 | 62 |
| 2 | 67 | 58 | 61 | 59 | 66 |
| 3 | 70 | 60 | 56 | 58 | 65 |
| 4 | 75 | 59 | 58 | 61 | 63 |
| 5 | 65 | 62 | 57 | 57 | 64 |
| 6 | 71 | 60 | 56 | 56 | 62 |
| 7 | 67 | 60 | 61 | 58 | 65 |
| 8 | 67 | 57 | 60 | 57 | 62 |
| 9 | 76 | 59 | 57 | 57 | 62 |
| 10 | 68 | 61 | 58 | 59 | 67 |

(a) Write the statistical model that best describes this dataset, indicating its components.
(b) Calculate the analysis of variance for this experiment.
(c) Is there any significant difference between treatments on average plant growth?

Exercise 1.6.2 A forage company wants to test three different types of fertilizers (F1, F2, and F3) for the production of two forage species (A and B) for cattle and compare them with a fertilizer it usually applies, which we will call control. For this, the company decides to use 48 pots in the greenhouse, with 6 replications of each combination of fertilizer and forage species. The data from this experiment are shown in Table 1.29:

Table 1.29 Growth (height in centimeters) of the two forage species with three types of fertilizers plus a control

| Species | Control | F1 | F2 | F3 |
|---|---|---|---|---|
| A | 21 | 32 | 22.5 | 28 |
| A | 19.5 | 30.5 | 26 | 27.5 |
| A | 22.5 | 25 | 28 | 31 |
| A | 21.5 | 27.5 | 27 | 29.5 |
| A | 20.5 | 28 | 26.5 | 30 |
| A | 21 | 28.6 | 25.2 | 29.2 |
| B | 23.7 | 30.1 | 30.6 | 36.1 |
| B | 23.8 | 28.9 | 30.6 | 36.1 |
| B | 23.8 | 30.9 | 28.1 | 38.7 |
| B | 23.7 | 34.4 | 34.9 | 37.1 |
| B | 22.8 | 32.7 | 30.1 | 36.8 |
| B | 24.4 | 32.7 | 25.5 | 37.1 |

(a) Write and describe the statistical model of the experimental design with all its components.
(b) Calculate the analysis of variance for this experiment.
(c) Is there any significant difference between treatments on average plant growth?

Exercise 1.6.3 The data in this experiment correspond to plants that regrew after grazing by sheep and goats. The initial size of the plant at the top of its rootstock is recorded, and the weight of seeds (g) that it produces at the end of the season is the response or dependent variable. The data for this experiment are as follows (Table 1.30):

(a) List and describe all the components of the linear mixed model.
(b) Calculate the ANOVA for this dataset and answer the following questions: Is seed weight influenced by the type of grazing? Is seed weight influenced by the plant size? Is the effect of grazing type on seed weight influenced by the initial plant size?

Exercise 1.6.4 An experiment was conducted to study the effect of supplementation of weaned lambs on health and growth rate when exposed to helminthiasis. A total of 16 Dorper (breed 1) and 16 Red Maasai (breed 2) lambs were treated with an anthelmintic at 3 months of age (after weaning) and allocated into "blocks" of 4 per breed, classified on the basis of 3-month body weight. Two lambs in each block were then randomly allocated to the supplemented group (night-fed cotton seed meal and wheat bran) and two to the unsupplemented group. All lambs were kept on grazing for a further 3 months.
Data recorded included the initial body weight (kilograms) at weaning and the weight at 3 months after weaning, the percentage red blood cell volume (PRBC), and the fecal egg count (FEC) at 6 months of age. Data from this experiment are shown below (Table 1.31):

(a) List and describe all the components of the linear mixed model.
(b) Calculate the ANOVA for this dataset and answer the following questions: Did supplementation improve weight gain? Did supplementation affect PRBC and FEC, and were there differences in weight gain, PRBC, or FEC between breeds?

Table 1.30 Fruit production after grazing

| Size | Fruit | Grazing |
|---|---|---|
| 6.225 | 59.77 | No Grazing |
| 6.487 | 60.98 | No Grazing |
| 4.919 | 14.73 | No Grazing |
| 5.13 | 19.28 | No Grazing |
| 5.417 | 34.25 | No Grazing |
| 5.359 | 35.53 | No Grazing |
| 7.614 | 87.73 | No Grazing |
| 6.352 | 63.21 | No Grazing |
| 4.975 | 24.25 | No Grazing |
| 6.93 | 64.34 | No Grazing |
| 6.248 | 52.92 | No Grazing |
| 5.451 | 32.35 | No Grazing |
| 6.013 | 53.61 | No Grazing |
| 5.928 | 54.86 | No Grazing |
| 6.264 | 64.81 | No Grazing |
| 7.181 | 73.24 | No Grazing |
| 7.001 | 80.64 | No Grazing |
| 4.426 | 18.89 | No Grazing |
| 7.302 | 75.49 | No Grazing |
| 5.836 | 46.73 | No Grazing |
| 10.253 | 116.05 | Grazing |
| 6.958 | 38.94 | Grazing |
| 8.001 | 60.77 | Grazing |
| 9.039 | 84.37 | Grazing |
| 8.91 | 70.11 | Grazing |
| 6.106 | 14.95 | Grazing |
| 7.691 | 70.7 | Grazing |
| 8.988 | 80.31 | Grazing |
| 8.975 | 82.35 | Grazing |
| 9.844 | 105.07 | Grazing |
| 8.508 | 73.79 | Grazing |
| 7.354 | 50.08 | Grazing |
| 8.643 | 78.28 | Grazing |
| 7.916 | 41.48 | Grazing |
| 9.351 | 98.47 | Grazing |
| 7.066 | 40.15 | Grazing |
| 8.158 | 52.26 | Grazing |
| 7.382 | 46.64 | Grazing |
| 8.515 | 71.01 | Grazing |
| 8.53 | 83.03 | Grazing |

Table 1.31 Supplementation trial in Dorper (breed 1) and Red Maasai (breed 2) lambs

| Id | Breed | Sex | Supplement | Block | IW | FW | PRBC | FEC | WG |
|---|---|---|---|---|---|---|---|---|---|
| 349 | 1 | 2 | 1 | 1 | 8 | 8.9 | 10 | 6500 | 0.9 |
| 326 | 1 | 2 | 1 | 1 | 9 | 10.1 | 11 | 2650 | 1.1 |
| 393 | 1 | 1 | 1 | 2 | 12 | 12.6 | 22 | 750 | 0.6 |
| 71 | 1 | 1 | 1 | 2 | 12.3 | 14.6 | 15 | 5200 | 2.3 |
| 271 | 1 | 1 | 1 | 3 | 13 | 13.7 | 19 | 4800 | 0.7 |
| 382 | 1 | 2 | 1 | 3 | 15.5 | 16.8 | 24 | 2450 | 1.3 |
| 85 | 1 | 2 | 1 | 4 | 16.3 | 18.2 | 27 | 200 | 1.9 |
| 176 | 1 | 2 | 1 | 4 | 15.9 | 17.7 | 21 | 3000 | 1.8 |
| 286 | 1 | 2 | 2 | 1 | 11 | 13.6 | 21 | 1600 | 2.6 |
| 183 | 1 | 1 | 2 | 1 | 9.9 | 11.7 | 21 | 450 | 1.8 |
| 21 | 1 | 2 | 2 | 2 | 11.6 | 13.1 | 25 | 2900 | 1.5 |
| 122 | 1 | 1 | 2 | 2 | 12.5 | 14.8 | 25 | 300 | 2.3 |
| 374 | 1 | 1 | 2 | 3 | 14.6 | 17.9 | 19 | 2250 | 3.3 |
| 32 | 1 | 2 | 2 | 3 | 14.2 | 16.9 | 22 | 2800 | 2.7 |
| 282 | 1 | 2 | 2 | 4 | 16.3 | 20.2 | 20 | 750 | 3.9 |
| 94 | 1 | 1 | 2 | 4 | 16.7 | 17.7 | 13 | 5600 | 1 |
| 127 | 2 | 2 | 1 | 1 | 7.5 | 8.1 | 26 | 1350 | 0.6 |
| 216 | 2 | 2 | 1 | 1 | 8.2 | 9.3 | 19 | 1150 | 1.1 |
| 133 | 2 | 1 | 1 | 2 | 10.1 | 11.7 | 30 | 200 | 1.6 |
| 249 | 2 | 1 | 1 | 2 | 8.8 | 10.4 | 28 | 0 | 1.6 |
| 123 | 2 | 2 | 1 | 3 | 1.6 | 12.6 | 23 | 600 | 1 |
| 222 | 2 | 2 | 1 | 3 | 11.3 | 13.5 | 24 | 1500 | 2.2 |
| 290 | 2 | 2 | 1 | 4 | 12.3 | 14.3 | 22 | 1950 | 2 |
| 148 | 2 | 1 | 1 | 4 | 13.1 | 14.9 | 26 | 500 | 1.8 |
| 142 | 2 | 2 | 2 | 1 | 8.2 | 11.5 | 25 | 850 | 3.3 |
| 154 | 2 | 2 | 2 | 1 | 9.5 | 12.2 | 35 | 700 | 3.7 |
| 166 | 2 | 1 | 2 | 2 | 9.7 | 12.8 | 29 | 400 | 3.1 |
| 322 | 2 | 1 | 2 | 2 | 8.6 | 12 | 26 | 800 | 3.4 |
| 156 | 2 | 1 | 2 | 3 | 10.2 | 13 | 28 | 1550 | 2.8 |
| 161 | 2 | 2 | 2 | 3 | 11.2 | 14.6 | 22 | 550 | 3.4 |
| 321 | 2 | 1 | 2 | 4 | 12.1 | 15.9 | 25 | 1250 | 3.8 |
| 324 | 2 | 1 | 2 | 4 | 13.8 | 18.1 | 24 | 1100 | 4.3 |

IW initial weight, FW final weight, PRBC percentage of red blood cells, FEC fecal egg count, WG weight gain

Appendix

Data: Larkspur plants from two populations in the state of Minnesota

| Population | Plant | Stamens | Ovules | Total no. of flowers | Ratio (stamens/ovules) |
|---|---|---|---|---|---|
| St. Croix | 1 | 30.75 | 13.75 | 8 | 2.24 |
| St. Croix | 2 | 33.83 | 16.17 | 12 | 2.09 |
| St. Croix | 3 | 35.67 | 16.33 | 6 | 2.18 |
| St. Croix | 4 | 35.40 | 17.40 | 14 | 2.03 |
| St. Croix | 5 | 33.50 | 23.50 | 13 | 1.43 |
| St. Croix | 6 | 37.40 | 18.40 | 10 | 2.03 |
| St. Croix | 7 | 33.57 | 21.29 | 25 | 1.58 |
| St. Croix | 8 | 29.86 | 28.71 | 20 | 1.04 |
| St. Croix | 9 | 33.80 | 29.60 | 17 | 1.14 |
| St. Croix | 10 | 31.60 | 25.80 | 14 | 1.22 |
| St. Croix | 11 | 32.57 | 27.50 | 21 | 1.18 |
| St. Croix | 12 | 31.80 | 24.00 | 13 | 1.33 |
| St. Croix | 13 | 35.25 | 17.75 | 8 | 1.99 |
| St. Croix | 14 | 30.00 | 16.83 | 13 | 1.78 |
| St. Croix | 15 | 30.50 | 18.75 | 9 | 1.63 |
| St. Croix | 16 | 32.20 | 21.40 | 13 | 1.50 |
| St. Croix | 17 | 32.40 | 26.25 | 12 | 1.23 |
| St. Croix | 18 | 38.50 | 17.75 | 8 | 2.17 |
| St. Croix | 19 | 37.00 | 25.83 | 16 | 1.43 |
| St. Croix | 20 | 33.00 | 25.25 | 8 | 1.31 |
| St. Croix | 21 | 31.40 | 25.20 | 15 | 1.25 |
| St. Croix | 22 | 31.80 | 25.60 | 14 | 1.24 |
| St. Croix | 23 | 30.40 | 19.20 | 15 | 1.58 |
| St. Croix | 24 | 35.20 | 22.40 | 22 | 1.57 |
| St. Croix | 25 | 27.80 | 20.80 | 10 | 1.34 |
| St. Croix | 26 | 31.29 | 22.71 | 14 | 1.38 |
| St. Croix | 27 | 32.83 | 22.33 | 20 | 1.47 |
| St. Croix | 29 | 31.20 | 17.40 | 14 | 1.79 |
| St. Croix | 30 | 33.00 | 19.20 | 13 | 1.72 |
| St. Croix | 31 | 33.80 | 22.20 | 13 | 1.52 |
| St. Croix | 32 | 32.22 | 27.63 | 31 | 1.17 |
| St. Croix | 33 | 32.91 | 28.73 | 18 | 1.15 |
| St. Croix | 34 | 34.50 | 15.75 | 9 | 2.19 |
| St. Croix | 35 | 28.33 | 17.33 | 8 | 1.63 |
| St. Croix | 36 | 30.71 | 23.14 | 14 | 1.33 |
| St. Croix | 37 | 33.00 | 24.00 | 14 | 1.38 |
| St. Croix | 38 | 31.00 | 20.50 | 4 | 1.51 |
| St. Croix | 39 | 35.00 | 21.83 | 15 | 1.60 |
| St. Croix | 40 | 35.00 | 18.00 | 10 | 1.94 |
| Cedar Creek | 1 | 30.17 | 18.67 | 16 | 1.62 |
| Cedar Creek | 2 | 32.43 | 15.14 | 23 | 2.14 |
| Cedar Creek | 3 | 28.00 | 14.00 | 15 | 2.00 |
| Cedar Creek | 4 | 29.22 | 16.89 | 35 | 1.73 |
| Cedar Creek | 5 | 36.00 | 17.14 | 20 | 2.10 |
| Cedar Creek | 6 | 30.83 | 20.17 | 15 | 1.53 |
| Cedar Creek | 7 | 31.75 | 18.00 | 18 | 1.76 |
| Cedar Creek | 8 | 29.25 | 19.00 | 8 | 1.54 |
| Cedar Creek | 9 | 32.78 | 24.44 | 24 | 1.34 |
| Cedar Creek | 10 | 32.67 | 22.83 | 17 | 1.43 |
| Cedar Creek | 11 | 31.43 | 21.00 | 28 | 1.50 |
| Cedar Creek | 15 | 33.50 | 29.50 | 4 | 1.14 |
| Cedar Creek | 16 | 32.83 | 15.17 | 20 | 2.16 |
| Cedar Creek | 17 | 35.00 | 15.00 | 9 | 2.33 |
| Cedar Creek | 18 | 33.17 | 13.83 | 15 | 2.40 |
| Cedar Creek | 19 | 33.29 | 27.14 | 28 | 1.23 |
| Cedar Creek | 20 | 35.50 | 19.83 | 16 | 1.79 |
| Cedar Creek | 21 | 35.71 | 18.86 | 21 | 1.89 |
| Cedar Creek | 23 | 31.38 | 25.63 | 5 | 1.22 |
| Cedar Creek | 25 | 28.25 | 17.50 | 11 | 1.61 |
| Cedar Creek | 27 | 31.82 | 24.91 | 37 | 1.28 |
| Cedar Creek | 28 | 35.13 | 26.88 | 23 | 1.31 |
| Cedar Creek | 32 | 33.75 | 21.63 | 26 | 1.56 |
| Cedar Creek | 33 | 32.00 | 20.80 | 14 | 1.54 |
| Cedar Creek | 34 | 36.29 | 17.00 | 18 | 2.13 |
| Cedar Creek | 35 | 28.60 | 16.40 | 11 | 1.74 |
| Cedar Creek | 36 | 33.00 | 20.80 | 14 | 1.59 |
| Cedar Creek | 37 | 34.90 | 25.11 | 49 | 1.39 |
| Cedar Creek | 38 | 34.80 | 19.60 | 18 | 1.78 |
| Cedar Creek | 40 | 30.00 | 21.17 | 16 | 1.42 |
| Cedar Creek | 41 | 34.50 | 20.50 | 16 | 1.68 |
| Cedar Creek | 42 | 37.75 | 29.00 | 18 | 1.30 |
| Cedar Creek | 43 | 33.50 | 20.75 | 10 | 1.61 |
| Cedar Creek | 44 | 33.00 | 22.40 | 12 | 1.47 |
| Cedar Creek | 45 | 35.50 | 21.50 | 16 | 1.65 |
| Cedar Creek | 46 | 32.50 | 22.00 | 14 | 1.48 |
| Cedar Creek | 47 | 32.67 | 16.67 | 8 | 1.96 |
| Cedar Creek | 48 | 35.75 | 21.50 | 26 | 1.66 |
| Cedar Creek | 49 | 31.38 | 22.88 | 22 | 1.37 |
| Cedar Creek | 50 | 33.83 | 20.50 | 17 | 1.65 |

Chapter 2
Generalized Linear Models

2.1 Introduction

In the general linear model (LM) $y = X\beta + e$, which is not highly general, the response variable is normally distributed with constant variance across the values of all the predictor variables, and its mean is a linear function of the predictor variables. Traditionally, transformations of the data have been used to try to force the data into a normal linear regression model, that is, to find a transformation of a non-normal response variable (discrete, categorical, positive continuous scale, etc.) that is linearly related to the predictor variables; however, this is no longer necessary.
Instead of using a normal distribution, a positively skewed distribution with values that are positive real numbers can be selected, for example. Generalized linear models (GLMs) go beyond general linear models: the response variable need not be continuous or normally distributed, the variance may be heteroscedastic, and a function of the mean of the response variable, rather than the mean itself, is linearly related to the predictor or explanatory variables.

Nelder and Wedderburn (1972) introduced a unified methodology for linear models, thus opening a window for researchers to design models that can explain the variation of the phenomenon under study. Later, McCullagh and Nelder (1989) gave a comprehensive account of this extension of linear models, called generalized linear models (GLMs). They pointed out that the key elements of a classical linear model are as follows: (i) the observations are independent, (ii) the mean of the observation is a linear function of some covariates, and (iii) the variance of the observation is a constant. To extend the classical model, points (ii) and (iii) are modified as follows: (ii') the mean of the observation is associated with a linear function of some covariates via a link function and (iii') the variance of the observation is a function of the mean. For more details, see McCullagh and Nelder (1989).

GLMs can be adapted to a wide variety of response variables. Special cases of GLMs include not only regression and analysis of variance (ANOVA) but also logistic regression, probit models, Poisson regression, log-linear models, and many more.

© The Author(s) 2023. J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_2

2.2 Components of a GLM

The construction of a GLM begins with choosing the distribution of the response variable, the predictor or explanatory variables to include in the systematic component, and how to connect the mean of the response to the systematic component. These three components are described in the following sections.

2.2.1 The Random Component

The first component to specify is the random component, which consists of choosing a probability distribution for the response variable. This can be any member of the exponential family of distributions, such as normal, binomial, Poisson, gamma, and so on.

2.2.2 The Systematic Component

The second component of a GLM is the systematic component or linear predictor, which consists of a linear combination of explanatory variables (the predictors). The systematic component of a model is the fixed structural part of the model that explains the systematic variability between means. The linear predictor is found on the right-hand side of the equation in the specification of a linear or nonlinear regression model. Let $x_{1}, x_{2}, \cdots, x_{p}$ be the numerical (dummy) or discrete (category) predictor (explanatory) variables; then the linear predictor is

$$\eta_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \cdots + \beta_{p}x_{pi} = x_{i}^{T}\beta$$

where $\beta^{T} = (\beta_{0}, \beta_{1}, \beta_{2}, \cdots, \beta_{p})$ is the vector of regression parameters and $x_{i}^{T} = (1, x_{1i}, x_{2i}, \cdots, x_{pi})$ is the vector of predictor variables. Although η is linear in the parameters, the x's can enter in nonlinear form. For example, η can be a quadratic, cubic, or higher-order polynomial in a predictor. The expected value of $y_{i}$ and the linear predictor $\eta_{i}$ are related through the link function. For example, in a Poisson GLM, the linear predictor is equal to $\log(\lambda_{i}) = x_{i}^{T}\beta$, since the link is the natural logarithm, better known as the log link.
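As a small, hedged illustration of the linear predictor and the log link just described, the sketch below evaluates $\eta_{i} = x_{i}^{T}\beta$ for one observation and recovers the Poisson mean as $\lambda_{i} = \exp(\eta_{i})$. The coefficient values and the covariate values are hypothetical, chosen only to show the mechanics (including a quadratic term), and do not come from any dataset in this book.

```python
import math

# Hypothetical coefficients (beta0, beta1, beta2); not estimated from any data.
beta = [0.5, 0.2, -0.1]

def linear_predictor(x, beta):
    """eta_i = beta0 + beta1 * x1 + ... + betap * xp for one observation."""
    return beta[0] + sum(bk * xk for bk, xk in zip(beta[1:], x))

def poisson_mean(x, beta):
    """Inverse of the log link: lambda_i = exp(eta_i)."""
    return math.exp(linear_predictor(x, beta))

# x1 = 2 and x2 = x1 ** 2: the predictor is quadratic in x1 but still linear in beta.
x_i = [2.0, 4.0]
eta = linear_predictor(x_i, beta)
lam = poisson_mean(x_i, beta)
print(f"eta = {eta:.3f}, lambda = {lam:.4f}")
```

Whatever real value η takes, the exponential keeps the implied Poisson mean positive, which is the practical reason for pairing count data with the log link.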
In normal linear regression models, the focus is on $\eta$ and on finding the predictors or explanatory variables that best explain or predict the mean of the response variable. This is also important in a GLM. Problems such as multicollinearity in normal linear regression are also problems in generalized linear models.

2.2.3 Predictor's Link Function η

Finally, we look at the specification of the link function, which maps the mean of the response variable to the linear predictor. The link function $g(\cdot)$ allows a nonlinear relationship between the mean of the response variable and the linear predictor. That is,

$$g(\mu) = \eta$$

The function must be monotonic (and differentiable). The mean is equal, in turn, to the inverse transformation of $g(\cdot)$, that is,

$$\mu = g^{-1}(\eta)$$

The most natural and meaningful way to interpret the model parameters is on the scale of the data. In other words,

$$\mu_i = g^{-1}(\eta_i) = g^{-1}\left(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}\right)$$

It is important to note that the link relates the mean of the response to the linear predictor and that this is different from transforming the response variable directly. If the response values (i.e., the $y$'s) are transformed, then a distribution must be selected that describes the population distribution of the transformed data, which makes interpretation on the original scale of the data more difficult. A transformation of the mean is generally not equal to the mean of the transformed values, that is, $g(E[y]) \neq E(g[y])$. For example, suppose we have a distribution with the following values (and probabilities):

| $y_i$ | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| $P(Y = y_i)$ | 0.125 | 0.375 | 0.375 | 0.125 |

The mean of this distribution is $E[y] = 1 \times 0.125 + 2 \times 0.375 + 3 \times 0.375 + 4 \times 0.125 = 2.5$. Therefore, the logarithm of the mean of this distribution is $\ln(E[y]) = \ln(2.5) = 0.916$, whereas the mean of the logarithm is $E(\ln[y]) = 0.845$.
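The two quantities in this example can be checked numerically. The short Python sketch below uses only the four values and probabilities listed above; the variable names are illustrative.

```python
import math

values = [1, 2, 3, 4]
probs = [0.125, 0.375, 0.375, 0.125]

# Mean of the distribution, E[y].
mean_y = sum(y * p for y, p in zip(values, probs))

# Transformation of the mean, ln(E[y]).
log_of_mean = math.log(mean_y)

# Mean of the transformed values, E[ln(y)].
mean_of_log = sum(math.log(y) * p for y, p in zip(values, probs))

print(f"E[y]     = {mean_y:.3f}")
print(f"ln(E[y]) = {log_of_mean:.3f}")
print(f"E[ln(y)] = {mean_of_log:.3f}")
```

Because the logarithm is concave, the mean of the logarithm falls below the logarithm of the mean, which is why a link applied to the mean and a transformation applied to the data lead to different models.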
The value of the linear predictor $\eta$ can be any real number, but the expected value of the response variable, as in the case of counts or proportions, can be bounded. If there are no restrictions on the response variable (positive or negative real numbers), then the "identity link" function can be used, where the mean is identical to the linear predictor, that is, $\mu = \eta$.

As mentioned before, the link function establishes a connection between the linear predictor $\eta$ and the mean of the distribution $\mu$. It is important to note that the link function resembles a transformation only in a limited sense: it establishes a purely mathematical connection between parameters of the model, whereas a transformation is applied to the observations themselves, either to simplify the relationship between the mean of the response and the predictors or, in some cases, to stabilize the variance. Special cases are mentioned below:

(a) For a normal distribution, the link function is the identity function, $\eta = \mu$; the variance function is constant, i.e., $V(\mu) = 1$; and the scale parameter is the variance, i.e., $\phi = \sigma^2$. This allows the use of ordinary least squares for parameter estimation in procedures such as linear regression, analysis of variance (ANOVA) models, or analysis of covariance (ANCOVA) models.

(b) In a binomial distribution, the response variable takes binary values such as 0 and 1 or represents a relative frequency, i.e., $y_i = e_i / n_i$, where $e_i$ is the number of successes and $n_i$ is the number of trials. The mean is a probability ($\pi$) and therefore must lie between 0 and 1. The linear predictor is not bounded. Therefore, the link function must map the interval (0, 1) onto the real line, so that its inverse maps any value of the linear predictor back into the interval (0, 1).
A natural link function for binomial data is the logit link:

$$\eta = \log\left(\frac{\pi}{1 - \pi}\right) \;\rightarrow\; \pi = \frac{e^{\eta}}{1 + e^{\eta}}$$

Another useful alternative for these types of data is the probit link function:

$$\eta = \Phi^{-1}(\pi) \;\rightarrow\; \pi = \Phi(\eta)$$

where $\Phi$ is the cumulative distribution function of a standard normal distribution. The variance function has the form $V(\pi) = \pi(1 - \pi)$, and the scale parameter is known and equal to 1 ($\phi = 1$). The difference between the logit and probit estimators matters mainly when the estimated probabilities are extremely small or extremely close to 1, in which case large sample sizes are required for an effective inference. Otherwise, the logit and probit functions produce extremely close or equivalent results, especially for probability values around 0.5.

(c) For a Poisson distribution, the link function is the natural log:

$$\eta = \log(\lambda) \;\rightarrow\; \lambda = e^{\eta}$$

The variance function has the form $V(\lambda) = \lambda$, and, as for the binomial distribution, the scale parameter is 1. Poisson models with a log link function are often referred to as log-linear models, commonly used for contingency (frequency) tables with at least two entries.

(d) A gamma distribution has a link function of the form:

$$\eta = \frac{1}{\mu} \;\rightarrow\; \mu = \frac{1}{\eta}$$

The variance function is given by $V(\mu) = \mu^2$, and the scale parameter $\phi$ is usually unknown. In some cases, the log link function is used instead, whose inverse link is the exponential function. It should be noted that the reciprocal link does not guarantee that every value of the linear predictor is mapped to an admissible (positive) mean. Therefore, given this limitation, the theory only provides reasonable approximations for most applications. An exponential distribution is a special case of the gamma distribution.
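The link functions in cases (b) to (d) and their inverses can be written down directly. The following Python sketch is illustrative only: the function names and the `LINKS` dictionary are assumptions of this example, and the standard normal distribution is taken from the standard library class `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def probit(p):
    return _STD_NORMAL.inv_cdf(p)

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)

def reciprocal(mu):
    # Used both as the gamma link and as its own inverse.
    return 1.0 / mu

# Each entry pairs a link g with its inverse g^{-1}.
LINKS = {
    "logit": (logit, inv_logit),
    "probit": (probit, inv_probit),
    "log": (math.log, math.exp),
    "inverse": (reciprocal, reciprocal),
}

def round_trip(name, mean):
    """Apply the link and then its inverse; the result should equal `mean`."""
    g, g_inv = LINKS[name]
    return g_inv(g(mean))

for name in LINKS:
    print(name, round(LINKS[name][0](0.3), 4), round(round_trip(name, 0.3), 4))
```

The round trip makes the role of the inverse link explicit: estimation happens on the scale of $\eta$, and results are reported on the scale of the mean.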
Before the advances in computational methods, the classical approach to non-normal data consisted of a direct transformation of the response variable; that is, the data were transformed using a function $t(y)$ before being analyzed. The goal of the transformation was to obtain a simple connection between the mean and the linear predictor. However, obtaining a consistent scale of variation when selecting a transformation is vitally important. The usual way of selecting a suitable transformation is based on the assumption that, within the region of variation of the random variable, the transformation is adequately described by a simple linear approximation around the mean. That is, if the random variable $y$ has a distribution with mean $\mu$ and variance $\sigma^2(\mu)$, we want to find a transformation $t(y)$ whose variance is approximately constant (that is, a transformation that stabilizes the variance). The functions commonly used to stabilize the variance are the square root $\sqrt{y}$ when data have a Poisson distribution; the arcsine square root when data are binomial; and the logarithmic transformation for data with a constant coefficient of variation. Table 2.1 provides an overview of the most common link functions that give admissible values for certain types of response variables, together with the corresponding inverse of the link function.
Table 2.1 Common link functions for different response variables

| Type of response | Mean | Variance | $g(\mu) = \eta$ | $g^{-1}(\eta)$ |
|---|---|---|---|---|
| Normal | $\mu$ | $\sigma^2$ | $\mu$ (identity) | $\mu = \eta$ |
| Poisson | $\lambda$ | $\lambda$ | $\log(\lambda)$ | $\lambda = e^{\eta}$ |
| Binomial ratio | $\pi$ | $\pi(1 - \pi)/N$ | (logit) $\log(\pi/(1 - \pi))$ | $\pi = e^{\eta}/(1 + e^{\eta})$ |
| | | | (probit) $\Phi^{-1}(\pi)$ | $\pi = \Phi(\eta)$ |
| Exponential | $\mu$ | $\mu^2$ | $\log(\mu)$ | $\mu = e^{\eta}$ |
| Gamma | $\mu$ | $\mu^2$ | (inverse) $1/\mu$ | $\mu = 1/\eta$ |
| Negative binomial | $\lambda$ | $\lambda + \lambda^2 c$ | $\log(\lambda)$ | $\lambda = e^{\eta}$ |

Note: $\Phi$ is the cumulative distribution function of a standard normal distribution; $\mu$, $\lambda$, and $\pi$ are the expected values of the response; $\eta$ is the linear predictor; and $\phi$ is the scale parameter

2.3 Assumptions of a GLM

According to McCullagh and Nelder (1989) and Agresti (2013) in Chap. 4, a GLM is defined under the following assumptions:

(a) The data $y_1, y_2, \cdots, y_n$ are independent.

(b) The response variable $y_i$ does not necessarily have a normal distribution, but we usually assume a distribution from an exponential family (e.g., binomial, Poisson, multinomial, gamma, etc.).

(c) A GLM does not assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the mean response transformed by the link function and the explanatory variables; for example, in a binary logistic regression, $\text{logit}(\pi) = \beta_0 + \beta_1 x$.

(d) The predictor (explanatory) variables may be powers or other nonlinear transformations of the original independent variables.

(e) The assumption of homogeneity of variance need not be satisfied. In fact, it cannot be satisfied in many cases, given the structure of the model and the presence of overdispersion (when the observed variance is larger than what the model assumes).

(f) Errors are independent but are not necessarily normally distributed.

(g) The parameters are estimated by maximum likelihood (ML) or other methods instead of ordinary least squares (OLS).
2.4 Estimation and Inference of a GLM

Estimators of the regression coefficients for linear models with a normal response are obtained using least squares or ML, and significance tests generally compare sums of squares under different hypotheses using the F-test. It is worth mentioning that these tests are exact, so no approximations are required for their implementation. GLMs offer a natural extension of this situation in the sense that:

(1) The computations used to determine the ML estimates of the regression parameters/coefficients are highly similar to those used when the response is normal, with the difference that the estimation process is iterative and produces successive approximations that converge to the ML estimates.

(2) In the inference procedures, the test statistic commonly used is the likelihood ratio test, which parallels the F-tests in linear models with a normal response.

Thus, GLMs provide a uniform method of estimation and inference. Estimation of the parameter vector $\boldsymbol{\beta}$ is based on the ML method, whereas the inference methods are generally approximate since they rely on large-sample distribution theory, as in the case of the likelihood ratio method. Several alternative tests are available, such as the Wald test, the score test, and the likelihood ratio test.

2.5 Specification of a GLM

In the following examples, we will describe the components of a GLM for some normal, gamma, binomial, and Poisson regression models.
2.5.1 Continuous Normal Response Variable

In simple linear regression models, the expected value of a continuous response variable depends on an explanatory variable, as follows:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$$

Equivalently,

$$E(y_i \mid x_i) = \beta_0 + \beta_1 x_i$$

This GLM can be expressed in terms of its three components:

Distribution: $y_i \sim N(\mu_i, \sigma^2)$, with $E(y_i) = \mu_i$ and $\text{Var}(y_i) = \sigma^2$
Linear predictor: $\eta_i = \beta_0 + \beta_1 x_i$
Link function: $\eta_i = \mu_i$ (identity link)

where $\beta_0$ and $\beta_1$ are the intercept and slope, respectively. This means that we are expressing the linear model as a GLM.

Example 1. A simple linear regression analysis was performed on the diamond price ($y$) as a function of the weight in carats (Table 2.2), assuming that the response variable $y$ has a normal distribution with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. The basic Statistical Analysis Software (SAS) syntax for simple linear regression is as follows:

    proc reg;
      model price = weight / clb p r;
      output out=diag p=pred r=resid;
      id weight;
    run;

In the above program, "proc reg" invokes the linear regression procedure in SAS. The "clb" option generates confidence intervals for the slope and intercept. The "p" option generates fitted values and standard errors. The "r" option performs a residual analysis (i.e., checks assumptions). The "output out" statement generates a new dataset called "diag" containing the residuals and the predicted/adjusted values. The "id weight" statement adds the specified variable to the fitted values output.

Table 2.2 Diamond price (dollars) based on weight (carats)

Weight: 0.17 0.16 0.17 0.18 0.25 0.16 0.15 0.19 0.21 0.15
Price: 355 328 642 342 322 485 483
Weight: 0.18 0.28 0.16 0.2 0.23 0.29 0.12 0.26 0.25 0.27
Price: 462 823 336 498 595 860 223 663
Weight: 0.18 0.16 0.17 0.16 0.17 0.18 0.17 0.18 0.17 0.15
Price: 468 345 352 332 353 438 318 419 346
Weight: 0.17 0.32 0.32 0.15 0.16 0.16 0.23 0.23 0.17 0.33
Price: 918 919 298 339 338 595 553 345 945
Weight: 0.25 0.35 0.18 0.25 0.25 0.15 0.26 0.15 0.43
Price: 655 1086 443 678 675 287 693 316 .

Part of the results is shown in Table 2.3, which lists the estimated parameters obtained from "proc reg."

Table 2.3 Regression analysis results

Estimated parameters

| Parameter | Effect | Estimate | Standard error | Degrees of freedom (DF) | t-value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| $\beta_0$ | Intercept | -259.63 | 17.3189 | 46 | -14.99 | <0.0001 |
| $\beta_1$ | Weight | 3721.02 | 81.7859 | 46 | 45.50 | <0.0001 |
| $\sigma^2$ | Scale | 1013.82 | 211.40 | | | |

Note that the estimated regression parameters are all significantly different from zero. The estimated linear predictor then takes the form:

$$\hat{\eta}_i = -259.63 + 3721.02 \times \text{weight}_i$$

If the model does not fit the data well, then the normal distribution may represent the response distribution poorly, that is, it would only weakly explain the variability of the data; the "identity" may not be the best link function; the linear predictor may not include all the relevant information; or some combination of the three components of the GLM may be inadequate. Although other fit statistics exist for the linear regression model, such as the coefficient of determination ($R^2$), residual analysis is used to determine whether the model fits well and whether the assumptions of a Gaussian model are met. In this example, $R^2 = 0.9783$, which indicates that the model explains 97.83% of the total variability of the dataset. In Fig. 2.1, we can see that the simple linear regression model provides a good fit to this dataset.

Fig. 2.1 A dot plot of price vs.
weight (carat) and fitted model

2.5.2 Binary Logistic Regression

Logistic regression and other binomial response models are widely used in research areas such as the biological sciences and agriculture. Given their importance, some relevant features of these models are mentioned in this section. Let $y_i$ be the observed response, associated with a set of $p$ explanatory variables $x_1, x_2, \cdots, x_p$, whose distribution is binomial with $n_i$ independent Bernoulli trials and probability of success $\pi_i$ on each trial, i.e.,

$$y_i \sim \text{Binomial}(n_i, \pi_i)$$

Then, we can model the response using a GLM with a binomial response. The linear predictor in this case is

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$

commonly known as the "logit" model, because the logit is defined as

$$\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)$$

which models the logarithm of the odds, $\pi_i / (1 - \pi_i)$, as a function of the predictor variables. The components of this GLM for binomial data are:

Distribution: $y_i \sim \text{Binomial}(n_i, \pi_i)$, with mean and variance $E(y_i) = n_i \pi_i$ and $\text{Var}(y_i) = n_i \pi_i (1 - \pi_i)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$
Link function: $\eta_i = \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)$ (logit link)

Another highly useful link function in experimental settings is the "probit" link $\eta_i = \Phi^{-1}(\pi_i)$, which was mentioned before. The basic GLM under the probit link is almost identical to that under the logit link, as seen below:

Distribution: $y_i \sim \text{Binomial}(n_i, \pi_i)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$
Link function: $\eta_i = \text{probit}(\pi_i) = \Phi^{-1}(\pi_i)$

Example 1. An engineer is interested in studying the effect of temperature (Temp), from 0 to 40 °C, and time, from 0 to 15 days, on the germination of seeds of a certain crop. To this end, seeds were placed in different pots containing moist soil. After a certain number of days, the number of germinated seeds was counted. If a seed germinated, then $y = 1$; otherwise, $y = 0$.
The probability of germination $\pi_{ij}$ can be modeled through

$$\eta_{ij} = \beta_0 + \beta_1 \text{Temp}_i + \beta_2 \text{Day}_j$$

where $\eta_{ij}$ is the linear predictor and $\beta_0$, $\beta_1$, and $\beta_2$ are the parameters to be estimated. In this GLM, the link function is

$$\eta_{ij} = \text{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right)$$

and the probability in the interval (0, 1) is computed through the inverse of the link function

$$\pi_{ij} = \frac{1}{1 + \exp(-\eta_{ij})} = g^{-1}(\eta_{ij})$$

This last expression allows us to estimate the probability of germination ($\pi_{ij}$) under different temperature conditions (°C) and time periods (days). Note that the nonlinear relationship between the result $\pi_{ij}$ and the linear predictor $\eta_{ij}$ is modeled by the inverse of the link function. In this particular case, the link function is the logit:

$$\eta_{ij} = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = g(\pi_{ij})$$

To illustrate this example, a dataset was simulated using the values $\beta_0 = -8$, $\beta_1 = 0.19$, and $\beta_2 = 0.37$ in the linear predictor and the inverse of the link function, varying the temperature from 0 to 40 °C and the time from 0 to 15 days, i.e.,

$$\hat{\pi}_{ij} = \frac{1}{1 + e^{-(-8 + 0.19 \times \text{Temp}_i + 0.37 \times \text{Day}_j)}} = \frac{1}{1 + e^{\,8 - 0.19 \times \text{Temp}_i - 0.37 \times \text{Day}_j}}$$

Part of the simulated data is shown below:

| Temp | Days | Germ |
|---|---|---|
| 0 | 0 | 0.000335 |
| 0 | 0.5 | 0.000403 |
| 0 | 1 | 0.000485 |
| 0 | 1.5 | 0.000584 |
| 0 | 2 | 0.000703 |
| ⋮ | ⋮ | ⋮ |
| 40 | 13 | 0.987991 |
| 40 | 13.5 | 0.989999 |
| 40 | 14 | 0.991674 |
| 40 | 14.5 | 0.99307 |
| 40 | 15 | 0.994234 |

The following commands fit binomial regressions with the "logit" and "probit" links and a linear regression with the "identity" link. Temperature is denoted by t and days by d in the code below.
Logit Regression

    proc glimmix data=germ;
      model p = t d / solution dist=binomial link=logit cl;
      output out=logitout pred(noblup ilink)=predicted resid=residual;
    run;

Probit Regression

    proc glimmix data=germ;
      model p = t d / solution dist=binomial link=probit cl;
      output out=probitout pred(noblup ilink)=predicted resid=residual;
    run;

Linear Regression (Identity)

    proc glimmix data=germ;
      model p = t d / solution cl dist=normal;
      output out=identityout pred(noblup ilink)=predicted resid=residual;
    run;

"proc GLIMMIX" in SAS fits these models without modifying the response variable, as would occur if a direct transformation were applied to it. Instead, GLIMMIX models a link function of the mean of the response as a linear function of the explanatory variables. The "model" statement specifies the response variable p as a function of the explanatory variables t and d, which define $\mathbf{X}\boldsymbol{\beta}$. The "solution" option in the model statement asks the procedure to list the fixed effects parameter estimates of the model ($\beta_0$, $\beta_1$, and $\beta_2$). The "dist" option specifies the distribution of the response variable, and the "link" option specifies the link function. To get predicted probabilities for each observation, the "output" statement in proc GLIMMIX is used. Two types of predicted values can be obtained with the "output" statement. The first type includes the solutions for the random effects (best linear unbiased predictors (BLUPs)) in the linearized model, and the second type is based on the fixed effects only (best linear unbiased estimators (BLUEs)), requested here with pred(noblup ilink) = predicted. The "ilink" sub-option of "pred" asks for the predictions on the inverse link scale, that is, the predicted probabilities, which are stored in the variable named predicted. Finally, the "resid" option requests the residuals of the regression, which are stored in the variable named residual.
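To make the inverse link step concrete, the following Python sketch applies the inverse logit to a linear predictor in temperature and days. It is only an illustration written outside SAS: the coefficient values (-8, 0.19, and 0.37 for the intercept, temperature, and days) are taken as assumptions from the logit estimates reported in Table 2.4, and the function name is hypothetical.

```python
import math

# Assumed logit-scale coefficients (intercept, temperature, days),
# taken from the estimates reported in Table 2.4.
B0, B_TEMP, B_DAY = -8.0, 0.19, 0.37

def germination_probability(temp, day):
    """Inverse logit of the linear predictor: prediction on the data scale."""
    eta = B0 + B_TEMP * temp + B_DAY * day
    return 1.0 / (1.0 + math.exp(-eta))

for temp, day in [(0, 0), (0, 0.5), (40, 13), (40, 15)]:
    print(temp, day, round(germination_probability(temp, day), 6))
```

With these assumed coefficients, the four probabilities agree with the corresponding rows of the simulated data listed earlier (0.000335, 0.000403, 0.987991, and 0.994234).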
Table 2.4 shows part of the output of the regression procedure using the logit link function: the analysis of variance (part (a)) and the estimates and significance of the fixed effects (part (b)).

Table 2.4 Estimation and significance of fixed effects using the logit link function

(a) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| T | 1 | 2508 | 551.28 | <0.0001 |
| D | 1 | 2508 | 407.19 | <0.0001 |

(b) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | -8.0000 | 0.3189 | 2508 | -25.08 | <0.0001 |
| T | 0.1900 | 0.008092 | 2508 | 23.48 | <0.0001 |
| D | 0.3700 | 0.01834 | 2508 | 20.18 | <0.0001 |

Table 2.5 Parameter estimates, linear predictor, and probability of linear, logit, and probit models

| Link function | Parameter | Estimated value | $\eta$ | $\pi$ |
|---|---|---|---|---|
| Linear | $\beta_0$ | -0.417 | 1.149 | 0.873 |
| | $\beta_1$ | 0.022 | | |
| | $\beta_2$ | 0.0412 | | |
| Logit | $\beta_0$ | -8.00 | 5.95 | 0.962 |
| | $\beta_1$ | 0.190 | | |
| | $\beta_2$ | 0.370 | | |
| Probit | $\beta_0$ | -4.483 | 3.362 | 0.965 |
| | $\beta_1$ | 0.106 | | |
| | $\beta_2$ | 0.207 | | |

*The linear predictor $\eta$ and the probability $\pi$ were estimated using D = 15 and T = 30

Fig. 2.2 (a, b) Probability of seed germination as a function of temperature and day

In Table 2.5, the parameter estimates of the linear predictor for the linear (identity), logit, and probit models are presented. Note that, with the logit estimates, D = 15 and T = 30 give $\eta = -8 + 0.19 \times 30 + 0.37 \times 15 = 3.25$ and hence $\pi = 0.963$, in agreement with the tabulated probability; the tabulated $\eta$ values appear to correspond to T = 15 and D = 30. The probabilities estimated by the probit and logit models are almost identical to each other, but those of the linear probability model are different; this is because the data were generated from a binomial model with a logit link, so the linear predictor estimated under the identity link differs substantially from the linear predictors under the probit and logit links. In Fig. 2.2a, b, we observe that in an interval between 3 and 7 days and 0 and 15 °C, there is approximately 20% seed germination, but as both factors increase, the germination percentage also increases substantially.
2.5.2.1 Model Diagnosis

For a linear model, a plot of the predicted values against the residuals is probably the simplest way to decide whether the model provides a good fit to the data; for a GLM, however, we must decide on the appropriate scale to use for the fitted values. Generally, it is better to use the linear predictors $\eta$ in the plot rather than the predicted responses $\mu$. If there is no linear relationship between the linear predictors and the residuals, then this could indicate a lack of fit in the model. For a linear model, we could transform the response variable, but this is not recommended for a GLM because it could change the response distribution. Another alternative would be to change the link function, but since there are not many link functions that allow a model to be interpreted easily, this is not a good option. Moreover, changing the linear predictor or transforming the predictor variables would not be the best way to go.

Fig. 2.3 Predicted vs. residual values using the logit link

Figures 2.3, 2.4, and 2.5 show the linear predictor versus the residual (we can also see the predicted value versus the residual). By investigating the nature of the relationship between the predictors and the residuals in Fig. 2.3, we can see that there is a linear relationship between the predictor and the residual under the logit function, whereas the probit and identity functions do not show this linear relationship. In particular, with the probit link function, we observe a curvilinear relationship between the predictor and the residual, which may be because homogeneity of variance is not satisfied under this link function. Therefore, the logit link is shown to be the best choice.
Fig. 2.4 Predicted values vs. residuals using the probit link

Fig. 2.5 Predicted vs. residual values using the identity link

Example 2. Fruit flies can be a year-round problem in fruit-growing areas in many regions of the world, such as Mexico, and are most common in late summer and fall because ripe or fermented fruits and vegetables attract the insects by serving as a natural host. If these insects are not controlled, economic losses in fruit-growing areas can be large and devastating to the producers. In response to this, entomologists have implemented experiments to help mitigate the damage caused by these insects. One such experiment attempted to establish the relationship between the concentration of a toxic agent (nicotine), applied for 5 hours, and the number of insects killed (common fruit fly); the data are shown in Table 2.6, and, for more information, see the study by Myers et al. (2002). The number of dead insects can be modeled under a binomial distribution ($n$, $\pi$). Let $y_i$ denote the number of dead insects at concentration $i$.

Table 2.6 Ratio of the concentration of a toxic agent to the number of fruit flies killed

| Concentration (g/100 cc) | Number of insects (n) | Dead insects (y) | Proportion of dead insects |
|---|---|---|---|
| 0.1 | 47 | 8 | 0.17 |
| 0.15 | 53 | 14 | 0.264 |
| 0.20 | 55 | 24 | 0.436 |
| 0.30 | 52 | 32 | 0.615 |
| 0.50 | 46 | 38 | 0.826 |
| 0.70 | 54 | 50 | 0.926 |
| 0.95 | 52 | 50 | 0.962 |

The GLM components for this dataset are:

Distribution: $y_i \sim \text{Binomial}(n_i, \pi_i)$, with mean and variance $E(y_i) = n_i \pi_i$ and $\text{Var}(y_i) = n_i \pi_i (1 - \pi_i)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 \text{conc}_i$
Link function: $\eta_i = \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)$ (logit link)

Note that we are using $\text{conc}_i$ to denote the independent variable, the concentration of the nicotine toxicant.
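The iterative ML estimation described in Sect. 2.4 can be sketched for this model with iteratively reweighted least squares (IRLS). The Python code below is a minimal illustration under stated assumptions, not the algorithm used internally by SAS: the function name, starting values, and convergence tolerance are choices made for this example, and only the data of Table 2.6 are taken from the text.

```python
import math

# Fruit fly data from Table 2.6: concentration, insects exposed, insects killed.
conc = [0.10, 0.15, 0.20, 0.30, 0.50, 0.70, 0.95]
n = [47, 53, 55, 52, 46, 54, 52]
y = [8, 14, 24, 32, 38, 50, 50]

def fit_logit_irls(x, n, y, max_iter=50, tol=1e-10):
    """Fit logit(pi_i) = b0 + b1 * x_i to grouped binomial data by IRLS."""
    b0, b1 = 0.0, 0.0  # assumed starting values
    for _ in range(max_iter):
        # Accumulate the sums of the weighted least-squares normal equations.
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, ni, yi in zip(x, n, y):
            eta = b0 + b1 * xi
            pi = 1.0 / (1.0 + math.exp(-eta))   # inverse logit
            w = ni * pi * (1.0 - pi)            # iterative weight
            z = eta + (yi - ni * pi) / w        # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2 x 2 weighted normal equations for the updated estimates.
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

b0_hat, b1_hat = fit_logit_irls(conc, n, y)
print(f"intercept = {b0_hat:.4f}, slope = {b1_hat:.4f}")
```

Each pass fits a weighted linear regression of a working response on the concentration, which is the sense in which GLM estimation resembles the normal case while being iterative. The resulting estimates should agree closely with the SAS estimates reported in Table 2.7.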
The following SAS code allows us to perform a binomial regression for the fruit fly dataset:

    data fly;
      input conc n y;
      datalines;
    0.1 47 8
    0.15 53 14
    0.2 55 24
    0.3 52 32
    0.5 46 38
    0.7 54 50
    0.95 52 50
    ;
    proc glimmix data=fly nobound;
      model y/n = conc / dist=binomial link=logit solution;
    run;

The above syntax produces the output shown in Table 2.7.

Table 2.7 Results of the analysis of variance with the logit link

(a) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Conc | 1 | 5 | 71.94 | 0.0004 |

(b) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | -1.7361 | 0.2420 | 5 | -7.17 | 0.0008 |
| Conc | 6.2954 | 0.7422 | 5 | 8.48 | 0.0004 |

The analysis of variance (Table 2.7a) shows that there is a highly significant effect of nicotine concentration on the number of flies killed (P = 0.0004). In part (b), the maximum likelihood estimates of the intercept and slope are $\hat{\beta}_0 = -1.7361$ and $\hat{\beta}_1 = 6.2954$, respectively, which are used to construct the linear predictor:

$$\hat{\eta}_i = -1.7361 + 6.2954 \times \text{conc}_i$$

Therefore, with the logistic regression model, we can estimate the probability that an insect dies when exposed to a certain concentration $i$ of nicotine using the following expression:

$$\hat{\pi}(\text{conc}_i) = \frac{e^{\hat{\eta}_i}}{1 + e^{\hat{\eta}_i}} = \frac{e^{-1.7361 + 6.2954 \times \text{conc}_i}}{1 + e^{-1.7361 + 6.2954 \times \text{conc}_i}}$$

Fig. 2.6 Proportion of dead insects as a function of nicotine concentration

A plot of the mean proportion of dead insects exposed to a certain concentration of nicotine, together with the regression curves (linear, quadratic, and cubic), is shown in Fig. 2.6. In this figure, we observe that as the nicotine concentration increases, the mean proportion of dead insects increases. The best linear predictor is of a quadratic order.

2.5.3 Poisson Regression

Often, the outcome of a variable is numerical in the form of counts.
Sometimes it is a count of rare events such as, for example, (1) the number of plants infected by a certain disease in a population over a period of time, (2) the number of insects surviving after the application of an insecticide over time, (3) the number of dead fish found per cubic kilometer due to a certain pollutant, (4) the number of sick animals occurring in a given month in a given country, and so on. The Poisson probability distribution is perhaps the most widely used for modeling count-type response variables. As $\lambda$ (the average count) increases, the Poisson distribution becomes more symmetric and eventually approaches a normal distribution. The Poisson likelihood function is appropriate for nonnegative integer data, and this process assumes that events occur randomly over time, so the following conditions must be met:

(a) The probability of at least one occurrence of an event in a given time interval is proportional to the length of the interval.

(b) The probability of two or more occurrences of an event within an extremely small interval is negligible.

(c) The numbers of occurrences of an event in disjoint time intervals are mutually independent.

The probability distribution of a Poisson random variable $y$, which represents the number of successes occurring in a given time interval or in a given region of space, is given by the expression

$$P(y = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \quad \lambda > 0, \; k = 0, 1, 2, \cdots$$

where $\lambda$ is the average number of successes (the average count) in a time or space interval. The mean and variance of this distribution are the same, that is,

$$E(y) = \text{Var}(y) = \lambda$$

Poisson regression is a GLM and is appropriate for analyzing count data or contingency tables. A Poisson regression assumes that the response variable $y$ has a Poisson distribution and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters and independent variables.
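The equality of the mean and the variance can be verified numerically from the probability function, summing over the counts k = 0, 1, 2, and so on. In the Python sketch below, the value of λ is hypothetical, and the infinite sum is truncated at 100 terms, which is an assumption that is harmless for a small λ.

```python
import math

def poisson_pmf(k, lam):
    """P(y = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.5              # hypothetical average count
support = range(100)   # truncated support; the omitted tail is negligible here

total = sum(poisson_pmf(k, lam) for k in support)
mean = sum(k * poisson_pmf(k, lam) for k in support)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in support)

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```

This equality is also what an overdispersion check looks for: when the observed variance clearly exceeds the mean, the Poisson assumption becomes doubtful.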
As in a standard linear regression, the predictors $x_1, x_2, \cdots, x_P$, weighted by their coefficients, are summed to form the linear predictor,

$$\eta_i = \beta_0 + \sum_{p=1}^{P} x_{pi} \beta_p$$

where $\beta_0$ is the intercept and $\beta_p$ is the slope of the covariate $x_p$ ($p = 1, \cdots, P$). Thus, the expected value of $y_i$ and the linear predictor $\eta_i$ are related through the link function. The components of a GLM with a Poisson response ($y_i \sim \text{Poisson}(\lambda_i)$), where $\lambda_i$ is the expected value of $y_i$, are as follows:

Distribution: $y_i \sim \text{Poisson}(\lambda_i)$, with $E(y_i) = \text{Var}(y_i) = \lambda_i$
Linear predictor: $\eta_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_P x_{Pi}$
Link function: $\eta_i = \log(\lambda_i) = g(\lambda_i)$ (log link)

Example 1. The following dataset corresponds to the number of students diagnosed (Fig. 2.7) with a certain infectious disease within a period of days after an initial outbreak. We will fit a generalized linear model for "count" data assuming a Poisson distribution.

Fig. 2.7 Students infected with the disease

Note that the response distribution is skewed to the right and that the responses are nonnegative integers. Since the response variable is a count, the initial choice of a Poisson distribution with its canonical link, the natural logarithm, is reasonable for this dataset. The number of "days elapsed" after the initial disease outbreak is the predictor variable in the systematic component. Thus, the GLM for this dataset (Appendix: Data: Infected students) is:

Distribution: $\text{Infected students}_i \sim \text{Poisson}(\lambda_i)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 \text{Days}_i$
Link function: $\eta_i = \log(\lambda_i)$ (log link)

Part of the data is shown below:

| Days elapsed | Infected students |
|---|---|
| 1 | 6 |
| 2 | 8 |
| 3 | 12 |
| ⋮ | ⋮ |
| 109 | 1 |
| 110 | 1 |
| 112 | 0 |

For the purposes of implementation, we use days to denote elapsed days and students to denote infected students.
We can fit the Poisson regression model using GLIMMIX in SAS, as shown below:

    proc glimmix data=students method=laplace;
      model students = days / solution dist=poisson link=log;
      output out=sal_infection pred(noblup ilink)=predicted resid=residual;
    run;

The "proc GLIMMIX" statement invokes the SAS generalized linear mixed model (GLMM) procedure. The "model" statement specifies the response variable and the predictor variable, whereas the "solution" option in the model statement requests a listing of the fixed effects parameter estimates. The "dist = poisson" option specifies the distribution of the data, and the "link = log" option declares the link function to be used in the model. The default estimation technique in generalized linear mixed models is restricted pseudo-likelihood (the "RSPL" method); in this example, we use "method = laplace." The "output" statement creates a dataset containing predicted values and diagnostic residuals, calculated after fitting the model. By default, all variables in the original dataset are included in the output dataset, whereas the "out = sal_infection" option specifies the name of the output dataset. The "pred(noblup ilink) = predicted" option calculates the predicted values without taking into account the random effects of the model, and "ilink" calculates the statistics and predicted values on the scale of the data. Finally, the "resid = residual" option calculates the residuals.

The likelihood of a GLMM involves an integral, which, in general, cannot be calculated explicitly. "GLIMMIX," by default, uses the RSPL method, but it also offers other options such as the quadrature and Laplace integration methods, among others. These integral approximation methods approximate the likelihood function of a GLMM, and the approximated function is then optimized numerically. These methods provide a real objective function for optimization. For more details, see the SAS manual.
However, in a GLM, this approximation of the integral is not necessary: because there are no random effects, the likelihood can be evaluated exactly and maximized directly to estimate the parameters. The results of this analysis are shown in Table 2.8. The fit statistics in part (a) ("Fit statistics") give us an idea of the goodness of fit of the model; these statistics are most useful when we are comparing candidate models to find the best one for the data. In this case, the value of the Pearson chi-square statistic divided by its degrees of freedom (0.78) is close to 1. This indicates that the variability of these data has been reasonably modeled and that there is no evidence of overdispersion. The value of the chi-square statistic divided by its degrees of freedom (Pearson's chi-square/DF) is the experimental error of the analysis. The "Type III tests of fixed effects" in part (b) and the solution for the intercept and the days effect ("Parameter estimates") in part (c) are also shown in Table 2.8.

Table 2.8 Results of the analysis of variance

(a) Fit statistics (Akaike's information criterion (AIC), small-sample bias-corrected Akaike's information criterion (AICC), Bozdogan's consistent Akaike's information criterion (CAIC), Schwarz's Bayesian information criterion (BIC), Hannan and Quinn information criterion (HQIC))

| Statistic | Value |
|---|---|
| -2 Log likelihood | 389.11 |
| AIC (smaller is better) | 393.11 |
| AICC (smaller is better) | 393.22 |
| BIC (smaller is better) | 398.49 |
| CAIC (smaller is better) | 400.49 |
| HQIC (smaller is better) | 395.29 |
| Pearson's chi-square | 84.95 |
| Pearson's chi-square / DF | 0.78 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Days | 1 | | 102.28 | <0.0001 |

(c) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 1.9902 | 0.08394 | | 23.71 | <0.0001 |
| Days | -0.01746 | 0.001727 | | -10.11 | <0.0001 |
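Because a GLM has no random effects, its parameters can be estimated by maximizing the Poisson log-likelihood directly. The following Python sketch illustrates the idea with Newton-Raphson iterations (which coincide with iteratively reweighted least squares for the canonical log link) for a model with one covariate. The data are a small hypothetical set invented for illustration, not the students dataset, and the function name `fit_poisson_loglink` is an assumption of this sketch; it is not meant to describe what GLIMMIX does internally.

```python
import math

def fit_poisson_loglink(x, y, n_iter=50, tol=1e-10):
    """Maximum likelihood fit of log(lambda_i) = b0 + b1 * x_i by Newton-Raphson."""
    # Start from the intercept-only fit: b0 = log(mean of y), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2), with Var(y_i) = mu_i.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2 x 2 system information * step = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical toy data: counts that decline as days elapse.
days = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
counts = [7, 6, 6, 4, 4, 3, 3, 2, 2, 1, 1]

b0_hat, b1_hat = fit_poisson_loglink(days, counts)
print(f"intercept = {b0_hat:.4f}, slope = {b1_hat:.5f}")
print(f"multiplicative change per day = {math.exp(b1_hat):.4f}")
```

At convergence, the score equations are satisfied, so the fitted means reproduce the observed total count; this is a convenient check that the iterations have reached the maximum.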
The negative coefficient of the covariate days indicates that as the number of days increases, the average number of students diagnosed with the disease decreases. That is, we reject the null hypothesis (P < 0.0001) that the expected number of infected students stays the same as the number of days increases. With a 1-day increase in the infection period, the expected (or average) number of students diagnosed with the disease is multiplied by a factor of $e^{-0.01746} = 0.9827$; that is, it decreases by about 1.7%. The estimated linear predictor for this GLM is:

$$\hat{\eta}_i = 1.9902 - 0.01746 \times \text{Days}_i$$

For example, we can calculate the probability of diagnosing k = 2 infected students on day 2 after the outbreak (Days = 2) as follows:

$$P(Y_i = k) = \frac{\exp(-\lambda_i)\,\lambda_i^{k}}{k!}$$

$$P(Y_i = 2) = \frac{\exp(-\exp[1.9902 - 0.01746 \times 2])\,(\exp[1.9902 - 0.01746 \times 2])^{2}}{2!} = \frac{\exp(-\exp(1.9553))\,(\exp(1.9553))^{2}}{2} = 0.0213$$

This value indicates that the probability of observing/diagnosing two students with the disease on day 2 is 0.0213 (2.13%). In Fig. 2.8, we observe that the Poisson model is a good candidate for modeling this dataset, since there is no overdispersion in this regression model.

Fig. 2.8 Infected students and a Poisson regression fit

Example 2 A forest engineer is interested in modeling the number of trees recently infected by a certain virus. The available data are the age (years) and height (meters) of the trees and the number of infected trees. Modeling λ directly with a linear model could result in negative values of the parameter λ, which would not make sense. The link function g(λ) for a Poisson error structure is the logarithm. Therefore, the GLM, defining $y_i = \text{infected trees}_i$, can be written as follows:

Distribution: $y_i \sim \text{Poisson}(\lambda_i)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 \times \text{Age}_i + \beta_2 \times \text{Height}_i$
Link function: $\eta_i = \log(\lambda_i) = g(\lambda_i)$ (log link)

For this example, a dataset was simulated using the following parameter values: $\beta_0 = -2$, $\beta_1 = -0.03$, and $\beta_2 = -0.04$.
In addition, in order to obtain the linear predictor, the variable age (years) was varied from 0 to 50 and height (meters) from 0 to 30, both in increments of one unit. Thus, the values of $y_{ij}$ were simulated with the following expression:

$$y_{ij} = \exp(-2 - 0.03 \times \text{Age}_i - 0.04 \times \text{Height}_j)$$

Fig. 2.9 (a, b) Probability of tree infection as a function of tree height and age in years

In Fig. 2.9a, b, we can see that at a young age, between 1 and 10 years, and at a height of no more than 10 meters, trees are more susceptible to infection by the virus. However, as their age increases, trees show greater resistance. The following SAS code fits a Poisson regression model with two predictor variables, assuming that there is no interaction between the two explanatory variables.

    proc glimmix data=infection method=laplace;
      model infection = age height / solution dist=poisson link=log;
      output out=sal_infection pred(noblup ilink)=predicted resid=residual;
    run;

In Table 2.9 part (a), the analysis of variance shows that age and tree height are highly significant (P < 0.0001), indicating that both variables help explain the infection mechanism of the trees through a Poisson model. The estimated linear predictor for this GLM with a Poisson-distributed response variable is:

$$\hat{\eta}_{ij} = -2 - 0.03 \times \text{Age}_i - 0.04 \times \text{Height}_j$$

The estimated values of the parameters of each of the explanatory variables indicate that as age (years) and height (meters) increase by one unit, the tree is less susceptible to the virus.
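The simulation design described for this example can be sketched in Python as follows. The parameter values and the age and height grids come from the text; the text gives only the mean expression, so drawing Poisson counts around that mean, the seed, and the helper names `expected_count` and `draw_poisson` are assumptions of this sketch rather than the authors' procedure.

```python
import math
import random

B0, B1, B2 = -2.0, -0.03, -0.04  # parameter values stated in the text

def expected_count(age, height):
    """Mean of the Poisson response under the log link."""
    return math.exp(B0 + B1 * age + B2 * height)

def draw_poisson(lam, rng):
    """Draw one Poisson(lam) count by inverting the CDF (adequate for small lam)."""
    u = rng.random()
    k = 0
    p = math.exp(-lam)          # P(Y = 0)
    cdf = p
    while u > cdf and k < 1000:  # the cap guards against round-off stalls
        k += 1
        p *= lam / k             # P(Y = k) obtained from P(Y = k - 1)
        cdf += p
    return k

rng = random.Random(2023)  # arbitrary seed, for reproducibility only

# Age 0..50 years and height 0..30 meters, both in steps of one unit.
grid = [(age, height) for age in range(0, 51) for height in range(0, 31)]
lam = {cell: expected_count(*cell) for cell in grid}
sim = {cell: draw_poisson(lam[cell], rng) for cell in grid}

print("number of age-height combinations:", len(grid))
print("expected count at age 0, height 0:", round(lam[(0, 0)], 4))
print("expected count at age 50, height 30:", round(lam[(50, 30)], 4))
# Multiplicative change in the mean per extra year and per extra meter.
print("rate ratios:", round(math.exp(B1), 4), round(math.exp(B2), 4))
```

Because both coefficients are negative, each rate ratio is below 1, which is the numerical counterpart of the statement that older and taller trees are less susceptible.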
If we want to calculate the probability of diagnosing k = 3 trees infected with the virus when they are 2 years old and 3 meters tall, we can use the following equation:

$$P(Y_i = k) = \frac{\exp(-\lambda_i)\,\lambda_i^{k}}{k!}$$

$$\hat{P}(Y_i = 3) = \frac{\exp(-\exp[-2 - 0.03 \times \text{Age} - 0.04 \times \text{Height}])\,(\exp[-2 - 0.03 \times \text{Age} - 0.04 \times \text{Height}])^{3}}{3!}$$

$$= \frac{\exp(-\exp[-2 - 0.03 \times 2 - 0.04 \times 3])\,(\exp[-2 - 0.03 \times 2 - 0.04 \times 3])^{3}}{3!} = 0.000215$$

This value indicates that the probability of observing/diagnosing three trees with the virus causing the disease when they are 2 years old and 3 meters tall is 0.000215 (0.0215%).

Table 2.9 Part of the results of the analysis of variance under a Poisson distribution

(a) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Age | 1 | 6158 | 43.20 | <0.0001 |
| Height | 1 | 6158 | 29.10 | <0.0001 |

(b) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | -2.0000 | 0.1388 | 6158 | -14.41 | <0.0001 |
| Age | -0.03000 | 0.004564 | 6158 | -6.57 | <0.0001 |
| Height | -0.04000 | 0.007415 | 6158 | -5.39 | <0.0001 |

A Poisson regression model, sometimes referred to as a log-linear model, is especially useful for modeling contingency tables. Log-linear models describe associations between variables in a contingency table; they treat the variables symmetrically and do not single out one variable as the response. They apply to tables with two or more classification factors and can be fitted by binomial or Poisson regression. These models for contingency tables have several specific applications in the biological and social sciences. Variables can be nominal or ordinal. A nominal variable has no natural order; for example, gender (male, female, transgender), eye color (blue, brown, green), and type of pet (cat, bird, fish, dog, mouse).
An ordinal variable has categories with a natural order; for example, the degree of consumer satisfaction with a product (very dissatisfied, somewhat dissatisfied, neither satisfied nor dissatisfied, somewhat satisfied, very satisfied).

2.5.4 Gamma Regression

A gamma distribution occurs naturally in processes for which waiting times between events are relevant. Lifetime data are sometimes modeled with a gamma distribution. This distribution can take a wide range of forms due to the relationship between the mean and variance through its two parameters (α and β) and is suitable for dealing with heteroscedasticity in nonnegative data. The probability density of a particular value y, given the parameters α and β, is

$$f(y) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, y^{\alpha - 1} e^{-y/\beta}; \quad y, \alpha, \beta > 0$$

where Γ(∙) is the gamma function. A gamma regression uses the input variables X and coefficients to make a prediction about the mean of y, but it actually focuses more of its attention on the scale parameter β. The mean and variance of a gamma random variable are:

$$E(Y) = \alpha\beta = \mu \quad \text{and} \quad \text{Var}(Y) = \alpha\beta^{2} = \mu^{2}/\alpha$$

The gamma probability density function can be rewritten in terms of the mean μ and the shape parameter α as follows:

$$f(y) = \frac{1}{\Gamma(\alpha)\, y}\left(\frac{\alpha y}{\mu}\right)^{\alpha} \exp\left(-\frac{\alpha y}{\mu}\right), \quad y > 0$$

Figure 2.10 plots the gamma density for three values of the shape parameter, α = 0.75, 1, and 2; the mean μ has a multiplicative (scaling) effect. In the first panel (α = 0.75), the density is infinite at 0; in the second panel (α = 1), it corresponds to the exponential density; and in the third panel (α = 2), we see a right-skewed distribution.

Fig. 2.10 Gamma density: from left to right, α = 0.75, 1, and 2

A gamma distribution can arise in different ways. The sum of n independent and identically distributed exponential random variables with scale parameter β has a gamma distribution with shape n and scale β, gamma(n, β).
The chi-squared distribution $\chi^2_\nu$ is a special case of the gamma distribution with shape α = ν/2 and scale β = 2, where ν is the degrees of freedom. In theory, a gamma distribution is a natural choice when the response variable takes real values between zero and infinity, and it is appropriate when a fixed relationship between the mean and variance is suspected, namely a variance proportional to the square of the mean. If we expect the values of y to be small, then we should expect a small amount of variability in the observed values. Conversely, if we expect large values of y, then we should expect a lot of variability. Gamma models with multiplicative covariate effects, such as a gamma response with the log link function, are well suited to modeling nonnegative, right-skewed continuous responses. Whether the data are modeled with an inverse or a logarithmic link function will depend on whether the rate of change or the logarithm of the rate of change is the more meaningful measure. For example, in yield-density studies, which commonly assume that yield per plant is inversely proportional to plant density (Shinozaki and Kira 1956), the mean is modeled as

$$\mu_i = (\beta_0 + \beta_1 x_i)^{-1}$$

which corresponds to the linear predictor $\eta_i = 1/\mu_i = \beta_0 + \beta_1 x_i$ under the inverse link.

Example 1 In the development of coagulation agents, it is common to perform in vitro clotting time studies. The following data were reported by McCullagh and Nelder (1989). Plasma samples from healthy men were diluted to nine different percentages of prothrombin-free plasma concentration; the greater the dilution, the more interference with the ability of the blood to clot because the natural clotting ability of the blood has been weakened. For each sample, clotting was induced by introducing thromboplastin, a clotting agent, and the time until clotting occurred (in seconds) was recorded. Five samples were measured at each of the nine concentration percentages, and their clotting times were averaged; therefore, the response is the mean clotting time across the five samples. In Fig.
2.11, the response variable is plotted against the plasma concentration percentage; we observe that the longer clotting times tend to be more variable than the shorter ones, so a linear regression model with constant variance may not be appropriate. In this analysis, we will model clotting time as the response variable ($y_i$) with plasma concentration percentage as the predictor variable. Conc denotes the independent variable concentration. The GLM for this dataset is:

Distribution: $y_i = \text{Clotting time}_i \sim \text{Gamma}(\alpha, \beta)$
Linear predictor: $\eta_i = \beta_0 + \beta_1 \times \text{conc}_i$
Link function: $\eta_i = 1/\mu_i$, i.e., $\mu_i = \dfrac{1}{\beta_0 + \beta_1 \times \text{conc}_i}$ (inverse link)

Fig. 2.11 Clotting time (seconds), depending on the thromboplastin concentration

The following syntax allows us to fit a GLM with gamma errors in GLIMMIX:

    data coagu;
      input num conc y;
      datalines;
    1 5 118
    2 10 58
    3 15 42
    4 20 35
    5 30 27
    6 40 25
    7 60 21
    8 80 19
    9 100 18
    ;
    proc glimmix data=coagu;
      model y = conc / dist=gamma link=power(-1) solution;
      output out=salgamm1 pred(noblup ilink)=predicted resid=residual;
    run;

Most of the syntax has already been described in the previous examples; the only new element is the "link=power(-1)" option, which requests the inverse link function in the GLIMMIX procedure. Some of the output from this analysis is shown in Table 2.10. According to part (a) of Table 2.10, the dilution percentage of the blood plasma concentration significantly affects the clotting time (P = 0.0004). The values for constructing the fitted linear predictor are tabulated in part (b) of Table 2.10:

$$\hat{\eta}_i = 0.008686 + 0.000658 \times \text{conc}_i$$

Table 2.10 Results of the regression analysis under a gamma distribution

(a) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Conc | 1 | | 41.01 | 0.0004 |

(b) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 0.008686 | 0.002294 | | 3.79 | 0.0068 |
| Conc | 0.000658 | 0.000103 | | 6.40 | 0.0004 |
| Scale | 0.05213 | 0.02436 | . | . | . |
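To see what the inverse link implies for the fitted values, the estimates reported in Table 2.10 can be plugged into $\mu_i = 1/(\beta_0 + \beta_1 \times \text{conc}_i)$ for each of the nine concentrations in the dataset. The Python sketch below does only this back-transformation; the function name `fitted_mean` is an illustrative choice, and no model is refitted.

```python
# Estimates from Table 2.10 (inverse link): 1 / mu = B0 + B1 * conc
B0, B1 = 0.008686, 0.000658

# Concentrations (%) and mean clotting times (seconds) from the coagu dataset.
conc = [5, 10, 15, 20, 30, 40, 60, 80, 100]
observed = [118, 58, 42, 35, 27, 25, 21, 19, 18]

def fitted_mean(c):
    """Back-transform the linear predictor to the mean clotting time."""
    return 1.0 / (B0 + B1 * c)

for c, y in zip(conc, observed):
    print(f"conc={c:3d}  observed={y:3d}  fitted={fitted_mean(c):7.2f}")
```

Comparing the two columns shows where a predictor that is linear in concentration departs from the observed clotting times, for instance at the lowest concentration.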
With the parameterization of the gamma distribution chosen previously, GLIMMIX in SAS estimates the intercept and the coefficient of the concentration variable, as well as a scale parameter. The "Scale" row of the GLIMMIX output is the dispersion φ = 1/α, so that Var[Y] = μ²/α = φμ². With this information, it is possible to calculate the mean (E[Y] = μ) and the variance for a concentration conc = 10 as follows:

$$\hat{y} = \hat{\mu} = \frac{1}{0.008686 + 0.000658 \times \text{conc}} = \frac{1}{0.008686 + 0.000658 \times 10} = 65.505$$

$$\text{Var}(y) = \frac{\mu^{2}}{\alpha} = \phi\mu^{2} = 0.05213 \times 65.505^{2} = 223.7$$

The average time it takes for blood to clot at a concentration of 10% is 65.505 seconds, with a variance of 223.7 (a standard deviation of about 15 seconds).

2.5.4.1 Model Selection

Selecting, from a set of candidate models, the model that provides the best fit and explains most of the variability in the data is a necessary but complex task. This process involves trying to minimize information loss. From the field of information theory, several information criteria have been proposed to quantify information, or the expected value of information; among these, the most widely used are the Akaike information criterion (AIC) (Akaike 1973, 1974) and the Bayesian information criterion (BIC) (Schwarz 1978). Both AIC and BIC are based on the ML estimates of the model parameters. In a regression fit, the estimates of the β's under the ordinary least squares method and the ML method are identical. The difference between the two methods comes from estimating the common variance σ² of the normal distribution of the errors around the true mean.
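As a concrete illustration of how these criteria penalize the maximized log-likelihood, the Python sketch below computes them from a value of -2 log likelihood, the number of estimated parameters k, and the sample size n, using the standard textbook definitions. Taking k = 2 (intercept and slope) and n = 109 (the number of records listed in the appendix) for the Poisson example of Table 2.8 is an assumption of this sketch; with those values, the formulas reproduce the fit statistics reported in that table.

```python
import math

def information_criteria(neg2_loglik, k, n):
    """Standard information criteria from -2 log-likelihood, k parameters, n observations."""
    return {
        "AIC": neg2_loglik + 2 * k,
        "AICC": neg2_loglik + 2 * k * n / (n - k - 1),
        "BIC": neg2_loglik + k * math.log(n),
        "CAIC": neg2_loglik + k * (math.log(n) + 1),
        "HQIC": neg2_loglik + 2 * k * math.log(math.log(n)),
    }

# Poisson regression of Example 1 (Table 2.8): -2 log likelihood = 389.11.
ic = information_criteria(389.11, k=2, n=109)
for name, value in ic.items():
    print(f"{name:5s} {value:8.2f}")
```

All five criteria start from the same -2 log likelihood and differ only in the penalty attached to k, which is why they can rank candidate models differently when the sample is small.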
To get an idea of how to use these fit statistics, let us compare three candidate models for the effect of the plasma dilution percentage:

Model 1: $\eta_i = \beta_0 + \beta_1 \times \text{conc}_i$
Model 2: $\eta_i = \beta_0 + \beta_1 \times \text{conc}_i + \beta_2 \times \text{conc}_i^{2}$
Model 3: $\eta_i = \beta_0 + \beta_1 \times \text{conc}_i + \beta_2 \times \text{conc}_i^{2} + \beta_3 \times \text{conc}_i^{3}$

Since the proposed models have a gamma error structure, the fit statistic commonly used in simple linear regression (R²) is not reported. Part of the results of this analysis is shown in Table 2.11, with various metrics as goodness-of-fit measures. For the goodness-of-fit metrics in part (a) of Table 2.11, the smaller the value, the better the fit. On this basis, the fit of the three regression models improved as the degree of the polynomial in the linear predictor increased. That is, model 3 best explained the variability of the plasma clotting time. The Type III tests of fixed effects and the estimated parameters under model 3 are tabulated in parts (b) and (c) of Table 2.11, respectively. The parameter estimates for the linear, quadratic, and cubic effects in the linear predictor are highly significant. The results suggest that including a cubic effect of the percentage dilution in plasma concentration in the linear predictor explains the clotting time better than a predictor with only a linear effect or with both linear and quadratic effects (Fig. 2.12).

Table 2.11 Goodness-of-fit metrics for each of the three models and regression analysis results for model 3

(a) Fit statistics

| Statistic | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| -2 Log likelihood | 62.15 | 44.49 | 27.47 |
| AIC (smaller is better) | 68.15 | 52.49 | 37.47 |
| AICC (smaller is better) | 72.95 | 62.49 | 57.47 |
| BIC (smaller is better) | 68.74 | 53.27 | 38.45 |
| CAIC (smaller is better) | 71.74 | 57.27 | 43.45 |
| HQIC (smaller is better) | 66.87 | 50.78 | 35.34 |
| Pearson's chi-square | 0.50 | 0.07 | 0.01 |
| Pearson's chi-square / DF | 0.07 | 0.01 | 0.001 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Conc | 1 | 5 | 476.73 | <0.0001 |
| Conc × conc | 1 | 5 | 110.78 | 0.0001 |
| Conc × conc × conc | 1 | 5 | 50.92 | 0.0008 |

(c) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 0.00040 | 0.000576 | 5 | 0.70 | 0.5177 |
| Conc | 0.001946 | 0.000089 | 5 | 21.83 | <0.0001 |
| Conc × conc | 0.00003 | 2.576E-6 | 5 | 10.53 | 0.0001 |
| Conc × conc × conc | 1.337E-7 | 2.520E-8 | 5 | 5.306 | <0.0001 |
| Scale | 0.001125 | 0.000530 | . | . | . |

Fig. 2.12 Fitting the gamma regression model with three predictors

2.5.5 Beta Regression

Studies in various areas of knowledge, including agriculture, often need to explain a variable expressed as a proportion, percentage, rate, or fraction in the continuous range (0, 1). In economics, for example, the factors that influence the proportion of households that do not have a cement floor have been studied. Similarly, in plant breeding, researchers want to investigate the factors that influence the proportion of plant leaves damaged by a certain disease. Likewise, the proportion of impurities in chemical compounds is of everyday interest in physics and chemistry. Studies on electoral preferences analyze citizen participation rates and the variables that can explain them, and, in the field of education and academic performance, researchers try to explain the proportion of success in standardized tests. Such models are also used to identify the factors associated with the proportion of credit used by credit card users. The public health field has also been confronted with the need to model the proportion of coverage in health programs in order to identify the sociodemographic and economic characteristics associated with whether a woman is covered. Johnson et al.
(1995) presented the properties of the probability distribution of this type of variable; these researchers showed that the beta distribution can be used to model proportions, since its density can take different forms depending on the values of the two shape parameters that index the distribution. However, the beta regression that results from using the beta distribution for the response variable in the context of generalized linear models is not very well known, although its use is increasing every day, thanks to user-friendly software that makes it easy to implement.

Definition Let Y be a continuous random variable defined on the interval (0, 1) and let α, β > 0. Then, Y has a beta distribution with shape parameters α and β if and only if:

$$f_Y(y) = \frac{1}{B(\alpha, \beta)}\, y^{\alpha - 1}(1 - y)^{\beta - 1}, \quad 0 < y < 1$$

where B(α, β) is the beta function, defined as $B(\alpha, \beta) = \dfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$, and Γ is the gamma function. The mean and variance of this probability density function are given by

$$E(Y) = \frac{\alpha}{\alpha + \beta} \quad \text{and} \quad \text{Var}(Y) = \frac{\alpha\beta}{(\alpha + \beta + 1)(\alpha + \beta)^{2}}$$

In the context of regression analysis, the density of the beta distribution given above is not very useful for modeling the mean of the response. Therefore, this density is reparametrized so that it contains a precision (or dispersion) parameter. This reparameterization consists of defining $\mu = \dfrac{\alpha}{\alpha + \beta}$ and ϕ = α + β, i.e., α = μϕ and β = (1 - μ)ϕ, which means that:

$$E(y) = \mu \quad \text{and} \quad \text{Var}(y) = \frac{\mu(1 - \mu)}{1 + \phi}$$

So, μ is the mean of the response variable and ϕ can be interpreted as a precision parameter in the sense that, for a fixed μ, the higher the value of ϕ, the smaller the variance of y. The density function of y can be written as:

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1 - \mu)\phi)}\, y^{\mu\phi - 1}(1 - y)^{(1 - \mu)\phi - 1}, \quad 0 < y < 1$$

where 0 < μ < 1 and ϕ > 0. Let $y_1, y_2, \ldots, y_n$ be independent random variables, where each $y_i$, with i = 1, 2, ..., n, follows the reparametrized beta model with mean $\mu_i$ and an unknown precision parameter ϕ. The model is obtained by assuming that the mean of $y_i$ can be written as:

$$g(\mu_i) = \sum_{j=1}^{k} x_{ij}\beta_j = \eta_i$$

where $\beta_1, \beta_2, \ldots, \beta_k$ are unknown regression parameters and the $x_{ij}$ are the values of the k covariates (k < n), which are fixed and known. Finally, g(∙) is a strictly monotone and differentiable link function that maps the interval (0, 1) onto the real numbers. There are several possible options for the link function g(∙). For example, we can use the logit link function $g(\mu) = \log\left(\dfrac{\mu}{1 - \mu}\right)$, which is considered the most popular and asymptotically efficient, but it is also feasible to use the probit function $g(\mu) = \Phi^{-1}(\mu)$, where Φ(∙) is the cumulative distribution function of a standard normal random variable, and the complementary log-log link function g(μ) = log{-log(1 - μ)}, among others (McCullagh and Nelder 1989).

Example 1 The objective of this experiment was to evaluate the effect of the concentration of a chemical compound on the proportion of damage (y) in the fruits (Table 2.12). This compound is known to inhibit the growth of an insect, but, at a certain concentration, it can cause damage to the fruits.

Table 2.12 Proportion of fruit damage (y) as a function of concentration. Percentage is equal to proportion × 100

| Concentration | Y |
|---|---|
| 0.1 | 0.08 |
| 0.25 | 0.09 |
| 0.5 | 0.11 |
| 1 | 0.2 |
| 2 | 0.3 |
| 4 | 0.53 |
| 5 | 0.63 |
| 8 | 0.71 |
| 10 | 0.73 |
| 25 | 0.84 |
| 50 | 0.85 |
| 100 | 0.86 |

The proportion of damage to the fruits can be modeled under a beta(μ, ϕ) distribution. Let $y_i$ be the proportion of damage to the fruits due to the ith concentration.
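The mean-precision reparameterization of the beta distribution and the logit link described in this section can be checked numerically with a short Python sketch. The values μ = 0.3 and ϕ = 9 are hypothetical, chosen only for illustration, and the helper names are assumptions of this sketch.

```python
import math

def beta_shapes(mu, phi):
    """Convert (mean, precision) to the shape parameters (alpha, beta)."""
    return mu * phi, (1.0 - mu) * phi

def beta_mean_var(alpha, beta):
    """Mean and variance of a beta(alpha, beta) random variable."""
    s = alpha + beta
    return alpha / s, alpha * beta / ((s + 1.0) * s ** 2)

def logit(mu):
    """Logit link: maps a mean in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

mu, phi = 0.3, 9.0  # hypothetical mean and precision
alpha, beta = beta_shapes(mu, phi)
mean, var = beta_mean_var(alpha, beta)

print("alpha, beta:", round(alpha, 4), round(beta, 4))
print("mean from shapes:", round(mean, 4))
print("variance from shapes:", round(var, 4))
print("mu(1 - mu)/(1 + phi):", round(mu * (1 - mu) / (1 + phi), 4))
print("round trip through the link:", round(inv_logit(logit(mu)), 4))
```

The two variance lines agree, which confirms that a larger ϕ shrinks the variance for a fixed mean, and the round trip shows that the logit link keeps fitted means inside (0, 1).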
The GLM components for this dataset are as follows:

Distribution: $y_i \sim \text{beta}(\mu_i, \phi)$, with $E(y_i) = \mu_i$ and $\text{Var}(y_i) = \dfrac{\mu_i(1 - \mu_i)}{1 + \phi}$
Linear predictor: $\eta_i = \beta_0 + \beta_1 \times \text{conc}_i$
Link function: $\eta_i = \text{logit}(\mu_i) = \log\left(\dfrac{\mu_i}{1 - \mu_i}\right)$ (logit link)

Note that we are using conc to denote the independent variable concentration of the chemical compound. The following SAS code allows us to perform a beta regression for the dataset:

    proc glimmix method=laplace;
      model y = conc / dist=beta s;
    run;

The "method=laplace" option asks SAS to use Laplace integration as the estimation method, and the "dist=beta" and "s" options ask GLIMMIX to perform a beta regression and to print the fixed effects parameter estimates, respectively. In order to see which type of predictor (linear, quadratic, or cubic) best explains the observed variability in a dataset, we make use of the fit statistics (-2 log likelihood, AIC, etc.). Part of the output is shown in Table 2.13.

Table 2.13 Goodness-of-fit metrics for the linear and quadratic models and results of the quadratic model fit

(a) Fit statistics

| Statistic | Linear | Quadratic |
|---|---|---|
| -2 Log likelihood | -6.23 | -14.50 |
| AIC (smaller is better) | -0.23 | -6.50 |
| AICC (smaller is better) | 2.77 | -0.79 |
| BIC (smaller is better) | 1.22 | -4.56 |
| CAIC (smaller is better) | 4.22 | -0.56 |
| HQIC (smaller is better) | -0.77 | -7.22 |
| Pearson's chi-square | 12.85 | 15.05 |
| Pearson's chi-square / DF | 1.07 | 1.25 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Conc | 1 | | 8.08 | 0.0193 |
| Conc × conc | 1 | | 6.11 | 0.0354 |

(c) Parameter estimates

| Effect | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | -1.1425 | 0.2935 | | -3.89 | 0.0037 |
| Conc | 0.1572 | 0.05530 | | 2.84 | 0.0193 |
| Conc × conc | -0.00132 | 0.000534 | | -2.47 | 0.0354 |
| Scale | 9.0432 | 4.0045 | . | . | . |

According to the fit statistics in part (a), the predictor that best models this experiment is the quadratic predictor. In Fig.
2.13, we can see that the cubic predictor follows the data most closely; however, because the t-values in the hypothesis tests of its estimated parameters were indeterminate (infinite; output not shown), the quadratic predictor was chosen. Both the quadratic and the cubic predictors model the proportion (percentage = proportion × 100) of fruit damage caused by the concentration of the applied chemical better than the linear predictor.

Fig. 2.13 Fitting the beta regression model

2.6 Exercises

Exercise 2.6.1 The partial dataset corresponds to an evaluation of the effects of increasing application rates of picloram (0, 1.1, 2.2, and 4.5 kg/ha) for the control of larkspur plants (data in Table 2.14). The objective of this study was to assess the efficacy of the herbicide picloram in controlling larkspur plants.
(a) List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).
(b) Fit the model according to part (a).
(c) Interpret your results.

Exercise 2.6.2 Effect of pH, Brix, temperature, and nisin concentration on the growth of Alicyclobacillus acidoterrestris CRA7152 in apple juice. The objective of this experiment was to model the presence/absence of CRA7152 growth in apple juice as a function of pH (3.5–5.5), Brix (11–19), temperature (25–50 °C), and nisin concentration (0–70). The data are shown below (Table 2.15):
(a) List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).
(b) Fit the model according to part (a).
(c) Interpret your findings.

Exercise 2.6.3 The objective of this experiment was to evaluate the level of toxicity of concentrations of pyrethrin and piperonyl butoxide on the mortality of beetles (Tribolium castaneum). Pyrethrin is a natural insecticide found in the plant Chrysanthemum cinerariaefolium and its flowers. The active ingredients are pyrethrins I and II, cinerins I and II, and jasmolins I and II.
The dried flowers contain 0.9–1.3% pyrethrum. The crude extract contains 50–60% pyrethrum and is imported from 2.6 Exercises 77 Table 2.14 Toxicity of picloram in controlling larkspur plants Rep 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Conc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 Rep Conc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 Rep Conc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (continued) 78 2 Generalized Linear Models Table 2.14 (continued) Rep 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Conc 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 Y 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Rep Conc 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 Y 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Rep Conc 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 Y 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (continued) 2.6 Exercises 79 Table 2.14 (continued) Rep 1 1 1 1 1 1 1 1 1 1 1 Conc 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 Y 1 1 1 1 1 1 1 1 1 
1 1 Rep Conc 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Rep Conc 2.2 2.2 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Rep replicate, Conc concentration, Y =1 dead/ Y = 0 alive various countries. The extract is diluted to 20%, which is the maximum concentration commercially available in the United States. Pyrethrin oxidizes on exposure to air but has been shown to be stable for long periods in water-based emulsions and oil concentrates. Synergistic compounds (such as piperonyl butoxide or N-octyl bicycloheptene dicarboximide), which enhance the effect of pyrethrin on insects, are present in commercially available pyrethrin formulations. The results of this study are shown below (Table 2.16). (a) List and describe the components of the GLM (distribution, systematic component (predictor), and the link function). (b) Fit the model according to part (a). (c) Interpret your results. 
80 2 Generalized Linear Models Table 2.15 Growth of Alicyclobacillus acidoterrestris CRA7152 pH 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 3.5 4 5 Nisin 70 70 50 50 30 30 0 0 70 70 50 50 30 30 0 70 70 50 50 30 30 0 0 70 70 50 50 30 30 30 0 0 0 0 0 Temp (°C) 50 43 43 35 35 25 50 25 43 35 50 35 50 43 25 25 25 50 25 43 43 50 35 50 35 43 25 50 35 25 43 43 35 35 43 Brix 11 19 13 15 13 11 19 15 15 11 13 19 11 15 19 15 13 15 19 19 11 13 11 19 13 11 11 15 19 13 15 13 11 11 11 Y 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 pH 5.5 3.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 3.5 4 5 5.5 3.5 Nisin 70 0 70 70 50 50 30 30 0 0 70 70 50 50 30 30 0 70 70 50 50 30 30 0 0 70 70 50 50 30 30 30 0 0 0 0 0 70 0 Temp (°C) 50 25 50 43 43 35 35 25 50 25 43 35 50 35 50 43 25 25 25 50 25 43 43 50 35 50 35 43 25 50 35 25 43 43 35 35 43 50 25 Brix 19 11 11 19 13 15 13 11 19 15 15 11 13 19 11 15 19 15 13 15 19 19 11 13 11 19 13 11 11 15 19 13 15 13 11 11 11 19 11 Y 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 1 0 1 1 0 0 2.6 Exercises Table 2.16 Mixture: pyrethrin plus piperonyl; n is the number of beetles exposed and Y is number of beetles killed 81 n 150 149 150 151 151 150 149 150 140 150 150 149 200 Mixture 1.5 1.06 0.75 1.35 1.03 0.8 3.3 3.07 2.9 10.65 10.46 10.32 0 Y 138 75 32 129 65 19 143 112 37 141 117 56 1 Table 2.17 Results of the experiment with carbon disulfide Dose 49.1 53 56.9 60.8 64.8 68.7 72.6 76.5 Number of exposed beetles 59 60 62 56 63 59 62 60 Number of dead beetles 6 13 18 28 52 53 61 61 Proportion of dead beetles 0.102 0.217 0.29 0.5 0.825 0.898 0.984 1 Exercise 2.6.4 The objective of this experiment was to model the probability of mortality of the toxic effect of carbon disulfide (CS2) gas on beetles. 
The insects were exposed to various concentrations of this gas (in mg/L) for 5 hours (Bliss 1935), and then the number of dead beetles (Y) was counted. The data are shown below (Table 2.17).
(a) List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).
(b) Fit the model according to part (a).
(c) Interpret your results.

Exercise 2.6.5 A study was conducted to assess the fowlpox virus in chorioallantois by the pock counting technique. The membrane pock count was recorded for 50 embryos, each exposed to one of four dilutions of the virus (multiples of $10^{-3.86}$). In Table 2.18, the column heading FD corresponds to the dilution factor and Count to the number of pocks observed.

Table 2.18 Results of the fowl pox experiment

| FD | Count (one value per embryo) |
|---|---|
| 0.125 | 1, 2, 2, 3, 2, 2, 1, 0, 0, 1, 2, 1, 1, 2, 1 |
| 0.25 | 5, 2, 3, 2, 5, 0, 2, 2, 0, 3, 2, 2 |
| 0.5 | 5, 11, 7, 5, 4, 6, 5, 9, 4, 7, 4, 8, 4 |
| 1 | 12, 9, 11, 17, 11, 10, 8, 16, 15, 12 |

(a) List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).
(b) Fit the model according to part (a).
(c) Interpret your findings.

Exercise 2.6.6 Data were provided by Margolin et al. (1981) from an Ames Salmonella reverse mutagenicity assay. The table shows the number of revertant colonies observed on each of the three plates (repeats) tested at each of the six quinoline dose levels. The focus is on testing for mutagenic effects in the presence of the excess variation typically observed between counts (Table 2.19).

Table 2.19 Number of reversed Salmonella TA98 colonies

| Quinoline dosage (μg/plate) | Plate 1 | Plate 2 | Plate 3 |
|---|---|---|---|
| 0 | 15 | 21 | 19 |
| 10 | 16 | 18 | 21 |
| 33 | 16 | 26 | 33 |
| 100 | 27 | 41 | 60 |
| 333 | 33 | 38 | 41 |
| 1000 | 20 | 27 | 42 |
(a) List and describe the components of the GLM (distribution, systematic component (predictor), and the link function).
(b) Fit the model according to part (a).
(c) Interpret your results.

Appendix

Students infected with a certain disease

id   Day  Students    id   Day  Students    id   Day  Students
1    1    6           45   45   3           89   95   1
2    2    8           46   46   3           90   96   0
3    3    12          47   48   3           91   96   0
4    3    9           48   48   2           92   97   1
5    4    3           49   49   3           93   98   1
6    4    3           50   49   1           94   100  2
7    4    11          51   53   3           95   101  2
8    6    5           52   53   3           96   102  1
9    7    7           53   53   5           97   103  1
10   8    3           54   54   4           98   104  1
11   8    8           55   55   4           99   105  1
12   8    4           56   56   3           100  106  0
13   8    6           57   56   5           101  107  0
14   12   8           58   58   4           102  108  0
15   14   3           59   60   3           103  109  1
16   15   6           60   63   5           104  110  1
17   17   3           61   65   3           105  111  0
18   17   2           62   67   4           106  112  0
19   17   2           63   67   2           107  113  0
20   18   6           64   68   3           108  114  0
21   19   3           65   71   3           109  115  0
22   19   7           66   71   1
23   20   7           67   72   3
24   23   2           68   72   2
25   23   2           69   72   5
26   23   8           70   73   4
27   24   3           71   74   3
28   24   6           72   74   0
29   25   5           73   74   3
30   26   7           74   75   3
31   27   6           75   75   4
32   28   4           76   80   0
33   29   4           77   81   3
34   34   3           78   81   3
35   36   3           79   81   4
36   36   5           80   81   0
37   42   3           81   88   2
38   42   3           82   88   2
39   43   3           83   90   1
40   43   5           84   93   1
41   44   3           85   93   2
42   44   5           86   94   0
43   44   6           87   95   2
44   44   3           88   95   1

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material.
If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 3 Objectives of Inference for Stochastic Models

Throughout this book, we have been using the acronym GLMM to denote generalized linear mixed models. The common denominator among all these models is that they contain a linear model (LM) part, namely the fixed effects component Xβ of the linear predictor. In a GLMM, the prefix “G” indicates that the distribution of the observations need not be normal, and the first “M” indicates that the linear predictor is mixed, that is, it also contains random effects, which are expressed by the term “Zb.” The fixed linear component of the predictor, Xβ, is important because the fixed effects describe the treatment design, which, in turn, is determined by the objectives or the initial research questions that the study wishes to answer. Therefore, if the researcher proposes a reasonable model to analyze an experiment, then he/she must be able to express each objective as a question about a model parameter or about a linear combination of model parameters.

Example Assume a 2 × 2 factorial model, with two levels in each of factors A and B, in which all possible combinations are tested. In this case, Xβ corresponds to a two-way model with interaction and a predictor given by

$$\eta_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}; \quad i, j = 1, 2$$

where the intercept μ is written as η in Table 3.1. As in all the statistical models studied so far, the linear predictor is expressed in terms of the link function, and ηij estimates the mean μij of a treatment combination directly if the data follow a normal distribution and indirectly if the data are not normally distributed.
For this example, the inference should focus on one or more of the following options (estimable functions): a treatment combination mean; a main effect mean, i.e., the mean of a level of factor A averaged over the levels of factor B, or vice versa; a difference between main effect means; a simple effect, i.e., the difference between two levels of factor A at a given level of B or the difference between two levels of factor B at a given level of factor A; and so on. Each of these options can be expressed in terms of the parameters of the linear predictor, as shown in Table 3.1.

© The Author(s) 2023 J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_3

Table 3.1 Estimable functions in a factorial 2 × 2 treatment structure using the identity link function

Target estimate | Estimator in terms of the parameters of the linear predictor | Target in terms of the expected value
Combination A × B | $\eta + \alpha_i + \beta_j + (\alpha\beta)_{ij}$ | $\mu_{ij}$
Main effect of factor A | $\bar{\eta}_{i\cdot} = \eta + \alpha_i + \frac{1}{2}\sum_j \beta_j + \frac{1}{2}\sum_j (\alpha\beta)_{ij}$ | $\bar{\mu}_{i\cdot} = \frac{1}{2}\sum_j \mu_{ij}$
Main effect of factor B | $\bar{\eta}_{\cdot j} = \eta + \frac{1}{2}\sum_i \alpha_i + \beta_j + \frac{1}{2}\sum_i (\alpha\beta)_{ij}$ | $\bar{\mu}_{\cdot j} = \frac{1}{2}\sum_i \mu_{ij}$
Difference between level 1 and level 2 of factor A | $\bar{\eta}_{1\cdot} - \bar{\eta}_{2\cdot} = \alpha_1 - \alpha_2 + \frac{1}{2}\left[\sum_j (\alpha\beta)_{1j} - \sum_j (\alpha\beta)_{2j}\right]$ | $\bar{\mu}_{1\cdot} - \bar{\mu}_{2\cdot}$
Difference between level 1 and level 2 of factor B | $\bar{\eta}_{\cdot 1} - \bar{\eta}_{\cdot 2} = \beta_1 - \beta_2 + \frac{1}{2}\left[\sum_i (\alpha\beta)_{i1} - \sum_i (\alpha\beta)_{i2}\right]$ | $\bar{\mu}_{\cdot 1} - \bar{\mu}_{\cdot 2}$
Simple effect of A given Bj (A|Bj) | $\eta_{1j} - \eta_{2j} = \alpha_1 - \alpha_2 + (\alpha\beta)_{1j} - (\alpha\beta)_{2j}$ | $\mu_{1j} - \mu_{2j}$
Simple effect of B given Ai (B|Ai) | $\eta_{i1} - \eta_{i2} = \beta_1 - \beta_2 + (\alpha\beta)_{i1} - (\alpha\beta)_{i2}$ | $\mu_{i1} - \mu_{i2}$
Interaction between factors A and B (A × B) = A|B1 - A|B2 = B|A1 - B|A2 | $(\eta_{11} - \eta_{21}) - (\eta_{12} - \eta_{22}) = (\alpha\beta)_{11} - (\alpha\beta)_{21} - (\alpha\beta)_{12} + (\alpha\beta)_{22} = (\eta_{11} - \eta_{12}) - (\eta_{21} - \eta_{22})$ | $\mu_{11} - \mu_{12} - \mu_{21} + \mu_{22}$

Assuming that the data have a normal distribution, which corresponds to using an identity link function, the estimator in terms of the linear predictor (column 2) estimates the expected values
of column three. If the data do not follow a normal distribution, then column 2 estimates the expected values of column three only indirectly, and the inverse of the link function is required to obtain them. For link functions other than the identity, the estimates in column two therefore require more careful handling.

In an experimental design with a factorial treatment structure, the analysis should first focus on the interaction of the two factors. If this interaction is not significant, then the main effects provide useful information. If the interaction is significant, then the simple effects are not equal and the main effects are confounded; in this case, it is better to focus on the simple effects.

3.1 Three Aspects to Consider for an Inference

When constructing a model, the researcher must decide whether the effects are fixed or random. This decision has important implications for the estimation criteria and for the interpretation of the tests and estimates obtained. Given these implications, three important aspects, described in the following sections, must be taken into consideration in statistical modeling.

3.1.1 Data Scale in the Modeling Process Versus Original Data

This issue is specific to models with a link function other than the “identity,” since the scale used in the modeling process is not the same as the scale of the original data when the assumption of normality of the response variable is no longer valid. When the data are normally distributed, the estimable function directly estimates the expected value. However, this is not true if the data follow a non-normal distribution. For example, in a logistic model for binomial data in a completely randomized design, the estimable function η + τi estimates a logit, that is, the log odds.
In this vein, the estimate of η + τi should be reported as a probability and not as a logit, since the expected value for individuals receiving the ith treatment is a probability. This requires converting the estimate to a probability using the inverse link, that is:

$$\pi_i = \frac{1}{1 + e^{-(\eta + \tau_i)}}$$

Thus, for link functions other than the “identity,” there are two ways of expressing the estimates: (1) in terms of the parameters directly estimated from the GLMM (model scale) or (2) in terms of the expected value of the response variable (data scale).

3.1.2 Inference Space

This issue arises only when the linear predictor contains random effects. In these models, the estimates are obtained from a linear combination of the fixed effects (an estimable function), even though the linear predictor also contains random effects. The estimable function is denoted by K′β, where K is a matrix of order [(p + 1) × k] and β is the vector of fixed effects parameters of order [(p + 1) × 1]. The estimable function K′β supports a broad inference, as it generalizes the results to the entire population represented by the random effects. In contrast, the linear combination K′β + Z′b, where Z′ is a matrix of nonzero coefficients for the random effects, is a predictable function whose inference is limited to the levels defined in b. Suppose that you are conducting an experiment with three treatments at different locations (L). The estimable function τ1 - τ2 supports inference about the difference between treatments 1 and 2 in the whole population under study, whereas the predictable function τ1 - τ2 + (Lτ)1j - (Lτ)2j restricts the inference about this difference to location Lj. The type of inference produced by predictable functions is called “narrow inference” because the nonzero coefficients in matrix Z reduce the scope of inference from the entire population to the levels identified in Z.
Thus, the predictable function K′β + Z′b should be used for estimates that are specific to particular levels of the random effects, whereas the estimable function K′β should be used for estimates that are valid for the entire population under study.

3.1.3 Inference Based on Marginal and Conditional Models

As mentioned in the previous chapter, a generalized linear mixed model (GLMM) is specified in terms of two probability distributions: (1) the distribution of the observations given the random effects, y | b, and (2) the distribution of the random effects, b. This two-part specification is a feature of all mixed models (MMs), Gaussian and non-Gaussian alike, so it also holds for mixed models whose response variable does not follow a normal distribution. From probability theory, the marginal probability distribution of the data (y) is obtained from the joint probability distribution of y and b by integrating over the random effects b. Of these distributions, only the marginal distribution of the data can be observed. Many non-Gaussian mixed models that seem reasonable do not distinguish between the distribution of y | b and that of y. Models that do not make this distinction are called marginal models. Estimates obtained from marginal models have different expected values from those produced by conditional models. Therefore, marginal models are not estimated in the same way as conditional models.

3.2 Illustrative Examples of the Data Scale and the Model Scale

In linear models, inference begins with the estimable function K′β. These models are defined in terms of the linear predictor η = g(μ) = Xβ (if there are no random effects) or η = Xβ + Zb (if there are random effects in the model), so K′β produces results on the scale of the link function. For linear models with a normal response, such as LMs and LMMs, the link function is not visible because they use the “identity” function as the link.
Linear combinations of model parameters then directly estimate the quantities of interest, such as differences between treatments, and support many other hypothesis tests of interest. Inference for an LM is therefore straightforward. For GLMs and GLMMs with a non-normal response, the estimation of K′β yields a linear combination of elements of the linear predictor η = g(μ), which is typically a nonlinear function of μ. For example, with Poisson data, K′β is on the logarithmic (log) scale, and, with binomial data, it is on the logit or probit scale. However, most of the time, the researcher wants to see binomial results expressed in terms of the probability of the outcome of interest and Poisson results expressed in terms of counts. Since both GLMs and GLMMs carry out the estimation process on the model scale (which depends on the link used), reporting the results of interest on the data scale requires applying the inverse link to the model-scale estimates. For example, with the logit link for binomial data, the results are then expressed as probabilities, and, with the Poisson model, they are expressed as counts. To exemplify the model scale and the data scale, an example is shown below.

Example 3.1 Consider the following experiment in which three chemical seed coat softeners were tested to study their effect on the germination of tomato seeds in Styrofoam trays (Table 3.2).

Table 3.2 Number of germinated seeds (Y) out of total seeds (N)

Treatment (Trt)   Y (no. of germinated seeds)   N (total no. of seeds)
Trt1              54                            70
Trt1              41                            60
Trt1              52                            70
Trt2              28                            70
Trt2              22                            60
Trt2              21                            70
Trt3              41                            70
Trt3              37                            60
Trt3              47                            70
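As a descriptive check before any model is fitted, the raw data in Table 3.2 can be summarized per treatment. The following is a minimal sketch in Python, used here only for illustration and not part of the original SAS analysis; the names `data`, `mean_count`, and `pooled_proportion` are hypothetical. It computes the average count per tray and the pooled proportion of germinated seeds for each treatment.

```python
# Data from Table 3.2: (germinated seeds y, total seeds n) for each tray.
# The dictionary layout and all names below are illustrative assumptions.
data = {
    "Trt1": [(54, 70), (41, 60), (52, 70)],
    "Trt2": [(28, 70), (22, 60), (21, 70)],
    "Trt3": [(41, 70), (37, 60), (47, 70)],
}


def mean_count(trays):
    """Average number of germinated seeds per tray."""
    return sum(y for y, _ in trays) / len(trays)


def pooled_proportion(trays):
    """Total germinated seeds divided by total seeds for one treatment."""
    return sum(y for y, _ in trays) / sum(n for _, n in trays)


# One (mean count, pooled proportion) pair per treatment.
summary = {trt: (mean_count(t), pooled_proportion(t)) for trt, t in data.items()}

for trt, (m, p) in summary.items():
    print(f"{trt}: mean count = {m:.4f}, pooled proportion = {p:.3f}")
```

The mean count is a summary on the scale of a normal-response analysis, whereas the pooled proportion is on the scale of a binomial analysis, so the two columns of this summary anticipate the two ways in which the same data can be reported.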
To illustrate the above two concepts, we first analyze these data as a completely randomized design (CRD), assuming the response variable to be normal, and then we analyze the same experimental design with a binomial response variable. We are interested in comparing the treatment means. Note that, for demonstrative purposes, we are assuming that Y has a normal distribution, when in fact it has a binomial distribution. The components of this model are defined as follows:

Distribution: $y_{ij} \sim N(\mu_i, \sigma^2)$
Linear predictor: $\eta_i = \eta + \tau_i$; (i = 1, 2, 3)
Link function: $\eta_i = \mu_i$ (identity link)

The analysis of variance (ANOVA) (part (a)) and the estimated parameters (part (b)) of this experimental design indicate that there is a highly significant difference between the treatments (P = 0.0033) for the germination of tomato seeds. Table 3.3 shows part of the results. Because the model is over-parameterized, the parameter estimates in part (b) of Table 3.3 (obtained with the “solution” option) set the effect of treatment three to zero. The estimable functions K′β for the treatment means are as follows:

$$K' = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}; \quad \beta = \begin{bmatrix} \eta \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix}$$

Table 3.3 Results of the analysis of variance using a CRD

(a) Type III tests of fixed effects
Effect   Num degrees of freedom (DF)   Den DF   F-value   Pr > F
Trt      2                             6        17.25     0.0033

(b) Parameter estimates
Effect      Trt    Parameter   Estimate   Standard error   DF   t-value   Pr > |t|
Intercept          η̂           41.6667    3.1388           6    13.27     <0.0001
Trt         Trt1   τ̂1          7.3333     4.4389           6    1.65      0.1496
Trt         Trt2   τ̂2          -18.0000   4.4389           6    -4.06     0.0067
Trt         Trt3   τ̂3          0          .                .    .         .
Scale                          29.5556    17.0639          .    .         .
From the estimated parameters, $\hat{\mu}_i = \hat{\eta} + \hat{\tau}_i$, we can obtain the estimated mean for each of the treatments (i = 1, 2, 3) as follows: for treatment 1, $\hat{\mu}_1 = \hat{\eta} + \hat{\tau}_1 = 41.6667 + 7.3333 = 49.0000$; for treatment 2, $\hat{\mu}_2 = \hat{\eta} + \hat{\tau}_2 = 41.6667 - 18 = 23.6667$; and for treatment 3, $\hat{\mu}_3 = \hat{\eta} + \hat{\tau}_3 = 41.6667 + 0 = 41.6667$. The value of the mean squared error ($\hat{\sigma}^2$), which appears in the table as “Scale,” is 29.5556. The differences between treatments, $\hat{\mu}_i - \hat{\mu}_{i'}$ for i ≠ i′, are as follows: $\hat{\mu}_1 - \hat{\mu}_2 = \hat{\eta} + \hat{\tau}_1 - (\hat{\eta} + \hat{\tau}_2) = \hat{\tau}_1 - \hat{\tau}_2 = 7.3333 - (-18) = 25.3333$, $\hat{\mu}_1 - \hat{\mu}_3 = \hat{\tau}_1 - \hat{\tau}_3 = 7.3333 - 0 = 7.3333$, and $\hat{\mu}_2 - \hat{\mu}_3 = \hat{\tau}_2 - \hat{\tau}_3 = -18.0000 - 0 = -18.0000$. These estimates can be obtained with the “estimate” and “lsmeans” statements of the Statistical Analysis Software (SAS), as shown below:

    proc glimmix data=germi;
      class trt;
      model y = trt / solution;
      lsmeans trt / diff e;
      estimate 'lsm trt 1' intercept 1 trt 1 0;
      estimate 'lsm trt 2' intercept 1 trt 0 1;
      estimate 'lsm trt 3' intercept 1 trt 0 0 1;
      /* trt has three levels, so three coefficients are given */
      estimate 'overall mean' intercept 1 trt 0.33333 0.33333 0.33333;
      estimate 'overall mean' intercept 3 trt 1 1 1 / divisor=3;
      estimate 'trt diff 1&2' trt 1 -1 0;
      estimate 'trt diff 2&3' trt 0 1 -1;
    run;

In the “estimate” statement, we specify what we wish to estimate: the keyword “intercept” refers to the intercept (η) and “trt” to the treatment effects (τi) under evaluation, each followed by the coefficients needed for the estimate, as shown above.
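The arithmetic behind these “estimate” statements is a matrix product K′β applied to the solution vector. The sketch below reproduces it in Python purely as an illustration (it is not part of the SAS program); `beta_hat` is copied from Table 3.3 (b), and the names `K_means`, `K_diffs`, and `estimable` are hypothetical.

```python
# Solution vector from Table 3.3 (b): (intercept, tau1, tau2, tau3).
beta_hat = [41.6667, 7.3333, -18.0, 0.0]

# Rows of K' for the three treatment means: intercept plus one treatment effect,
# i.e. the same coefficients as the 'lsm trt' ESTIMATE statements.
K_means = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]

# Rows of K' for the pairwise differences 1-2, 1-3, and 2-3;
# the intercept coefficient is zero because it cancels.
K_diffs = [
    [0, 1, -1, 0],
    [0, 1, 0, -1],
    [0, 0, 1, -1],
]


def estimable(K, beta):
    """Evaluate each row of K' against beta, giving the vector K'beta."""
    return [sum(k * b for k, b in zip(row, beta)) for row in K]


means = estimable(K_means, beta_hat)
diffs = estimable(K_diffs, beta_hat)

print("treatment means:", [round(m, 4) for m in means])
print("pairwise differences:", [round(d, 4) for d in diffs])
```

Because the link is the identity, these values are already on the data scale; no conversion is needed before reporting them.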
The “lsmeans” statement asks GLIMMIX to estimate the treatment means, the “diff” option requests the differences between pairs of treatments, and the “e” option displays the coefficients of the estimable functions used in “lsmeans.” Some of the outputs of the above code are shown in Table 3.4.

Table 3.4 Results obtained using the “estimate” and “lsmeans” commands

(a) Differences of Trt least squares means
Trt    Trt    Estimate   Standard error   DF   t-value   Pr > |t|
Trt1   Trt2   25.3333    4.4389           6    5.71      0.0013
Trt1   Trt3   7.3333     4.4389           6    1.65      0.1496
Trt2   Trt3   -18.0000   4.4389           6    -4.06     0.0067

(b) Estimates
Label          Estimate   Standard error   DF   t-value   Pr > |t|
LSM Trt 1      49.0000    3.1388           6    15.61     <0.0001
LSM Trt 2      23.6667    3.1388           6    7.54      0.0003
LSM Trt 3      41.6667    3.1388           6    13.27     <0.0001
Overall mean   38.1111    1.8122           6    21.03     <0.0001
Overall mean   38.1111    1.8122           6    21.03     <0.0001
Trt diff 1&2   25.3333    4.4389           6    5.71      0.0013
Trt diff 2&3   -18.0000   4.4389           6    -4.06     0.0067

Next, we analyze the same data, also using a CRD, but now assuming a binomial distribution for the response variable. $N_{ij}$ is the number of independent Bernoulli trials in the ijth observation. The components of the model are as follows:

Distribution: $y_{ij} \sim \text{Binomial}(N_{ij}, \pi_i)$
Linear predictor: $\eta_i = \eta + \tau_i$; (i = 1, 2, 3)
Link function: $\eta_i = \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)$ (logit link)

When these data are fitted with a binomial model, the fixed effects solution on the model scale is the one tabulated in Table 3.5. These results were obtained using the following SAS code:

    proc glimmix data=germi;
      class trt;
      model y/n = trt / solution;
    run;

As in the previous example, we can estimate the treatment means and the differences between pairs of treatments.
The linear predictors for the treatments are as follows: $\hat{\eta}_1 = \hat{\eta} + \hat{\tau}_1 = 0.5108 + 0.5093 = 1.0201$, $\hat{\eta}_2 = \hat{\eta} + \hat{\tau}_2 = 0.5108 - 1.1080 \approx -0.5971$, and $\hat{\eta}_3 = \hat{\eta} + \hat{\tau}_3 = 0.5108 + 0.0 = 0.5108$. The differences between treatments (1 and 2, 1 and 3, and 2 and 3) are $\hat{\eta}_1 - \hat{\eta}_2 = 1.0201 - (-0.5971) \approx 1.6173$, $\hat{\eta}_1 - \hat{\eta}_3 = 1.0201 - 0.5108 = 0.5093$, and $\hat{\eta}_2 - \hat{\eta}_3 = -0.5971 - 0.5108 = -1.1079$, respectively.

Table 3.5 Estimated parameters at the model scale

Parameter estimates
Effect      Trt    Parameter   Estimate   Standard error   DF   t-value   Pr > |t|
Intercept          η̂           0.5108     0.1461           6    3.50      0.0129
Trt         Trt1   τ̂1          0.5093     0.2168           6    2.35      0.0571
Trt         Trt2   τ̂2          -1.1080    0.2078           6    -5.33     0.0018
Trt         Trt3   τ̂3          0          .                .    .         .

Using the relationship between the linear predictor and the link function, $\eta_i = \text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right)$, we can estimate the probability of observing a favorable outcome for each of the treatments, that is, π1, π2, and π3. Applying the inverse link, we obtain:

$$\hat{\pi}_1 = \frac{1}{1 + e^{-(\hat{\eta} + \hat{\tau}_1)}}; \quad \hat{\pi}_2 = \frac{1}{1 + e^{-(\hat{\eta} + \hat{\tau}_2)}}; \quad \hat{\pi}_3 = \frac{1}{1 + e^{-(\hat{\eta} + \hat{\tau}_3)}}$$

Substituting the corresponding values, we obtain

$$\hat{\pi}_1 = \frac{1}{1 + e^{-1.0201}} = 0.735, \quad \hat{\pi}_2 = \frac{1}{1 + e^{-(-0.5971)}} = 0.355, \quad \hat{\pi}_3 = \frac{1}{1 + e^{-0.5108}} = 0.625$$

Here, we can see that the treatment with the highest probability of success is treatment one, followed by treatment three, whereas treatment two has the lowest probability of success. Now, for the difference between two treatments, τi - τi′ for i ≠ i′, we can estimate the logarithm of the odds ratio as

$$\tau_i - \tau_{i'} = \log\left(\frac{\pi_i}{1 - \pi_i}\right) - \log\left(\frac{\pi_{i'}}{1 - \pi_{i'}}\right) = \log\left(\frac{\pi_i / (1 - \pi_i)}{\pi_{i'} / (1 - \pi_{i'})}\right)$$

where, in this particular case, $\text{odds}_i = \frac{\pi_i}{1 - \pi_i}$ is the odds of treatment i and $\text{odds ratio} = \frac{\pi_i / (1 - \pi_i)}{\pi_{i'} / (1 - \pi_{i'})}$ is the odds ratio for treatments i and i′, for i ≠ i′.
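The relationships above between the linear predictor, the probability, the odds, and the odds ratio can be checked numerically. The following Python sketch is only an illustration under the stated values: `eta_hat` holds the model-scale predictors quoted in the text, and `inv_logit`, `odds`, `odds_ratio`, and `log_odds_ratio` are hypothetical helper names.

```python
import math

# Model-scale linear predictors for each treatment, as quoted in the text.
eta_hat = {"Trt1": 1.0201, "Trt2": -0.5971, "Trt3": 0.5108}


def inv_logit(eta):
    """Inverse of the logit link: converts a log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds of success for a probability p."""
    return p / (1.0 - p)


# Data-scale probabilities obtained by applying the inverse link.
prob = {trt: inv_logit(eta) for trt, eta in eta_hat.items()}


def odds_ratio(a, b):
    """Odds ratio of treatment a relative to treatment b, from the probabilities."""
    return odds(prob[a]) / odds(prob[b])


def log_odds_ratio(a, b):
    """Difference of linear predictors, which is the log of the odds ratio."""
    return eta_hat[a] - eta_hat[b]


for a, b in [("Trt1", "Trt2"), ("Trt1", "Trt3"), ("Trt2", "Trt3")]:
    print(a, b, round(log_odds_ratio(a, b), 4), round(odds_ratio(a, b), 3))
```

The ratio of odds computed from the probabilities agrees with the exponential of the difference of the linear predictors, which is exactly the identity stated in the last display.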
If the inverse link is applied to the estimated log odds ratio $\hat{\tau}_i - \hat{\tau}_{i'}$, we get

$$\frac{1}{1 + e^{-(\hat{\tau}_i - \hat{\tau}_{i'})}}$$

For treatments 1 and 3, this value is

$$\frac{1}{1 + e^{-(\hat{\tau}_1 - \hat{\tau}_3)}} = \frac{1}{1 + e^{-0.5093}} = 0.6246$$

Similarly, for the pairs of treatments 1–2 and 2–3, the resulting values are 0.8344 and 0.2483, respectively. It is important to note that these values are neither odds ratios nor estimates of the difference πi - πi′ for i ≠ i′; the odds ratio itself is obtained by exponentiation, for example, $e^{\hat{\tau}_1 - \hat{\tau}_3} = e^{0.5093} = 1.664$ for treatments 1 and 3.

From the previous example, it is clear that when the response variable is not normal, parameter estimation and inference occur at two levels. The linear predictor Xβ and the estimable function K′β are expressed in terms of the link function, here the logit, and therefore give estimates on the model scale (the scale of the link function). Under the logit link, the log odds and the difference between two estimates (the log odds ratio) are very common and useful quantities in categorical data analysis for describing treatments and treatment differences. Commonly, however, estimates at the model scale in GLMs are not easy to interpret, and, as such, the data scale plays a very important role. Working on the data scale involves applying the inverse of the link function to the estimable function K′β, as we did in the previous example to convert the log odds of each treatment to a probability. In general, we use the inverse of the link function to transform estimates from the model scale to the data scale. The inverse of the link function is not applied to differences between treatments, however, because link functions are generally nonlinear and the result is meaningless. Thus, in the logit model, we have two approaches for the difference.
First, we could apply the inverse of the link function to the linear predictor of each treatment and then take the difference between the probabilities, $\hat{\pi}_i - \hat{\pi}_{i'}$. That is, we estimate the difference πi - πi′ through $\frac{1}{1 + e^{-(\hat{\eta} + \hat{\tau}_i)}} - \frac{1}{1 + e^{-(\hat{\eta} + \hat{\tau}_{i'})}}$ and not as $\frac{1}{1 + e^{-(\hat{\tau}_i - \hat{\tau}_{i'})}}$. Second, since $\hat{\tau}_i - \hat{\tau}_{i'}$ estimates the logarithm of the odds ratio, $e^{\hat{\tau}_i - \hat{\tau}_{i'}}$ produces an estimate of the odds ratio. Both approaches are valid, and the choice between them depends on the requirements of the particular study. With the GLIMMIX procedure, we can obtain results on the data scale with the “ilink,” “exp,” and “oddsratio” options, as shown in the following SAS code:

    proc glimmix;
      class trt;
      model y/n = trt / solution oddsratio;
      lsmeans trt / diff oddsratio ilink;
      estimate 'lsm trt 1' intercept 1 trt 1 0 / ilink;
      estimate 'lsm trt 2' intercept 1 trt 0 1 / ilink;
      estimate 'lsm trt 3' intercept 1 trt 0 0 1 / ilink;
      /* trt has three levels, so three coefficients are given */
      estimate 'overall mean' intercept 1 trt 0.33333 0.33333 0.33333 / ilink;
      estimate 'overall mean' intercept 3 trt 1 1 1 / divisor=3 ilink;
      estimate 'trt diff 1&2' trt 1 -1 0 / oddsratio ilink;
      estimate 'trt diff 1&3' trt 1 0 -1 / oddsratio ilink;
      estimate 'trt diff 2&3' trt 0 1 -1 / oddsratio ilink;
      estimate 'trt diff 1&2' trt 1 -1 0 / exp;
      estimate 'trt diff 1&3' trt 1 0 -1 / exp;
      estimate 'trt diff 2&3' trt 0 1 -1 / exp;
    run;

Part of the output of “proc GLIMMIX” is shown in Table 3.6. The “Odds ratio estimates” (part (a)) are the result of the “oddsratio” option in the “model” statement of the previous program, and the confidence intervals are provided by default. What appears under “Estimate” (in part (b)) is on the model scale, $\hat{\eta}_i = \hat{\eta} + \hat{\tau}_i$, and what appears under “Mean” (in part (b)) is the result of applying the inverse of the link function, $\hat{\pi}_i = \frac{1}{1 + e^{-(\hat{\eta} + \hat{\tau}_i)}}$, which, in this case, is a probability and corresponds to the data scale.
Similarly, what appears under “Estimate” in part (c) is on the model scale, $\hat{\tau}_i - \hat{\tau}_{i'}$, whereas the “Odds ratio” values were computed as $e^{\hat{\tau}_i - \hat{\tau}_{i'}}$ and are on the data scale. In Table 3.7, the exponentiated log odds ratio appears under “Exponentiated estimate” regardless of whether we use the “oddsratio” or the “exp” option in the “estimate” statement. For the overall mean, the inverse of the link function applied to $\hat{\eta} + \frac{1}{3}(\hat{\tau}_1 + \hat{\tau}_2 + \hat{\tau}_3)$ is 0.5772, which differs from the average of the $\hat{\pi}_i$ values, that is, $\frac{1}{3}(\hat{\pi}_1 + \hat{\pi}_2 + \hat{\pi}_3) = \frac{1}{3}(0.735 + 0.355 + 0.625) = 0.5717$. This illustrates that we have to be extremely careful when using the output of proc GLIMMIX: it can report results on both the model scale and the data scale through the inverse of the link function, but the inverse link has to be applied to the appropriate quantities; otherwise, we will get meaningless results.

Example 3.2: Randomized complete block design (RCBD) with normal and binomial responses

Now, assume that we have the same example but in an RCBD. The three treatments were tested in each of the blocks, as shown in Table 3.8. In this example, the data are first analyzed assuming a normal response and a fixed block effect; then, they are analyzed assuming a binomial response.
The model components under a Gaussian response variable are as follows:

Distribution: $y_{ij} \sim N(\mu_{ij}, \sigma^2)$
Linear predictor: $\eta_{ij} = \eta + \tau_i + \text{block}_j$; (i, j = 1, 2, 3)
Link function: $\eta_{ij} = \mu_{ij}$ (identity link)

From the theory of linear models, we know that we can estimate the ith treatment mean through

$$\bar{\eta}_{i\cdot} = \frac{1}{3}\sum_{j=1}^{3} \eta_{ij} = \eta + \tau_i + \frac{1}{3}\sum_{j=1}^{3} \text{block}_j = \eta + \tau_i + \overline{\text{block}}_{\cdot}$$

where $\overline{\text{block}}_{\cdot} = \frac{1}{3}\sum_{j=1}^{3} \text{block}_j$. The mean difference between two treatments i and i′ is estimated as

$$\bar{\eta}_{i\cdot} - \bar{\eta}_{i'\cdot} = \eta + \tau_i + \overline{\text{block}}_{\cdot} - \left(\eta + \tau_{i'} + \overline{\text{block}}_{\cdot}\right) = \tau_i - \tau_{i'}$$

Table 3.6 Results of the “ilink,” “exp,” and “oddsratio” commands

(a) Odds ratio estimates
Trt    Trt    Estimate   DF   95% confidence limits
Trt1   Trt3   1.664      6    0.979   2.829
Trt2   Trt3   0.330      6    0.199   0.549

(b) Trt least squares means
Trt    Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error (Mean)
Trt1   1.0201     0.1602           6    6.37      0.0007     0.7350   0.03121
Trt2   -0.5971    0.1478           6    -4.04     0.0068     0.3550   0.03384
Trt3   0.5108     0.1461           6    3.50      0.0129     0.6250   0.03423

(c) Differences of Trt least squares means
Trt    Trt    Estimate   Standard error   DF   t-value   Pr > |t|   Odds ratio
Trt1   Trt2   1.6173     0.2180           6    7.42      0.0003     5.039
Trt1   Trt3   0.5093     0.2168           6    2.35      0.0571     1.664
Trt2   Trt3   -1.1080    0.2078           6    -5.33     0.0018     0.330

Table 3.7 Estimates at the model scale and at the data scale

Estimates
Label          Estimate   Standard error   DF   t Value   Pr > |t|   Mean     Standard error (Mean)   Exponentiated estimate
LSM Trt 1      1.0201     0.1602           6    6.37      0.0007     0.7350   0.03121
LSM Trt 2      -0.5971    0.1478           6    -4.04     0.0068     0.3550   0.03384
LSM Trt 3      0.5108     0.1461           6    3.50      0.0129     0.6250   0.03423
Overall Mean   0.3113     0.08746          6    3.56      0.0119     0.5772   0.02134
Trt diff 1&2   1.6173     0.2180           6    7.42      0.0003     0.8344   0.03011                 5.0393
Trt diff 1&3   0.5093     0.2168           6    2.35      0.0571     0.6246   0.05083                 1.6642
Trt diff 2&3   -1.1080    0.2078           6    -5.33     0.0018     0.2483   0.03878                 0.3302
Trt diff 1&2   1.6173     0.2180           6    7.42      0.0003                                      5.0393
Trt diff 1&3   0.5093     0.2168           6    2.35      0.0571                                      1.6642
Trt diff 2&3   -1.1080    0.2078           6    -5.33     0.0018                                      0.3302

Table 3.8 Number of germinated seeds (Y) out of total seeds (N) in a randomized complete block design

Treatment   Block    Y (no. of germinated seeds)   N (total no. of seeds)
Trt1        Block1   54                            70
Trt1        Block2   41                            60
Trt1        Block3   52                            70
Trt2        Block1   28                            70
Trt2        Block2   22                            60
Trt2        Block3   21                            70
Trt3        Block1   41                            70
Trt3        Block2   37                            60
Trt3        Block3   47                            70

The goal of this experiment could be to compare the treatment means, that is, to test $\bar{\eta}_{1\cdot} = \bar{\eta}_{2\cdot} = \bar{\eta}_{3\cdot}$, which can equivalently be expressed as τ1 = τ2 = τ3, or to compare one treatment with the average of the other treatments, for example, treatment 1 with the average of treatments 2 and 3 (Trt1.vs.average.Trt2.and.Trt3). For the hypothesis test of the equality of treatments (τ1 = τ2 = τ3), the estimable function K′β is given by:

$$K' = \begin{bmatrix} 0 & 1 & 0 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & -1 & 0 & 0 & 0 \end{bmatrix}; \quad \beta = \begin{bmatrix} \eta \\ \tau_1 \\ \tau_2 \\ \tau_3 \\ \text{block}_1 \\ \text{block}_2 \\ \text{block}_3 \end{bmatrix}$$

For the contrasts Trt1.vs.average.Trt2.and.Trt3 and Trt2.vs.average.Trt1.and.Trt3, the functions are

$$\text{Trt1.vs.average.Trt2.and.Trt3} = \bar{\eta}_{1\cdot} - \frac{1}{2}\left(\bar{\eta}_{2\cdot} + \bar{\eta}_{3\cdot}\right) = \tau_1 - \frac{\tau_2 + \tau_3}{2}$$

$$\text{Trt2.vs.average.Trt1.and.Trt3} = \bar{\eta}_{2\cdot} - \frac{1}{2}\left(\bar{\eta}_{1\cdot} + \bar{\eta}_{3\cdot}\right) = \tau_2 - \frac{\tau_1 + \tau_3}{2}$$

and K′β is given by (each row of K′ is twice the corresponding contrast, which does not change the test):

$$K' = \begin{bmatrix} 0 & 2 & -1 & -1 & 0 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 & 0 & 0 \end{bmatrix}; \quad \beta = \begin{bmatrix} \eta \\ \tau_1 \\ \tau_2 \\ \tau_3 \\ \text{block}_1 \\ \text{block}_2 \\ \text{block}_3 \end{bmatrix}$$

The following GLIMMIX procedure allows us to implement the above example.
    proc glimmix;
      class trt block;
      model y = trt block / solution;
      lsmeans trt / diff e;
      /* trt and block each have three levels, so three coefficients are given */
      estimate 'lsm trt1' intercept 3 trt 3 0 0 block 1 1 1 / divisor=3;
      estimate 'overall mean' intercept 3 trt 1 1 1 block 1 1 1 / divisor=3;
      estimate 'average trt1&trt2' intercept 6 trt 3 3 0 block 2 2 2 / divisor=6;
      estimate 'average trt1&trt2&trt3' intercept 9 trt 3 3 3 block 3 3 3 / divisor=9;
      estimate 'trt1 vs trt2' trt 1 -1 0;
      estimate 'trt1 vs trt3' trt 1 0 -1;
      estimate 'trt2 vs trt3' trt 0 1 -1;
      estimate 'trt1 vs trt2' trt 1 -1 0,
               'trt1 vs trt3' trt 1 0 -1,
               'trt2 vs trt3' trt 0 1 -1 / divisor=1,1,1 adjust=sidak;
      contrast 'trt1 vs trt2' trt 1 -1 0;
      contrast 'trt1 vs trt3' trt 1 0 -1;
      contrast 'trt2 vs trt3' trt 0 1 -1;
      /* label kept as in the output; the coefficients compare trt1 with the average of trt2 and trt3 */
      contrast 'trt1 vs average trt1,trt2' trt 2 -1 -1;
      contrast 'trt2 vs average trt1,trt3' trt -1 2 -1;
      contrast 'type 3 trt ss' trt 1 0 -1, trt 0 1 -1;
      contrast 'type 3 trt test' trt 2 -1 -1, trt -1 2 -1;
    run;

Part of the GLIMMIX output is shown in Table 3.9, which gives the parameter estimates for treatments 1 and 2 and for blocks 1 and 2; the estimates for treatment 3 and block 3 are set to zero. This is because the model is not of full rank: the estimation uses a generalized inverse obtained through the SWEEP operator of SAS, which, in this case, sets the last level of each class effect equal to zero (Table 3.9). The “Coefficients” (part (a) of Table 3.10), obtained with the “e” option of “lsmeans,” show how SAS combines the parameter solution to calculate the treatment least squares means (part (b)). Part (c) shows the differences of means obtained with the “diff” option of “lsmeans.” The estimates obtained from the “estimate” statement with multiple estimable functions, the “Sidak” adjustment, and the contrasts are shown in Table 3.11. The Sidak adjustment controls the type I error rate over the set of comparisons. The “adjust” option of “estimate” (part (b)) provides the adjusted P-values, denoted as Adj P, in addition to Pr > |t|.
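The Sidak adjustment can be reproduced by hand. The sketch below is a hedged illustration in Python, assuming the single-step Šidák formula p_adj = 1 - (1 - p)^k for k simultaneous comparisons; the raw P-values are the rounded Pr > |t| values reported for the three pairwise differences, and `sidak_adjust` is a hypothetical helper name, not a SAS function.

```python
def sidak_adjust(p_values):
    """Single-step Sidak adjustment for a family of k = len(p_values) tests."""
    k = len(p_values)
    return [1.0 - (1.0 - p) ** k for p in p_values]


# Rounded raw P-values for trt1 vs trt2, trt1 vs trt3, and trt2 vs trt3.
raw_p = [0.0019, 0.1036, 0.0067]
adj_p = sidak_adjust(raw_p)

for label, p, q in zip(["trt1 vs trt2", "trt1 vs trt3", "trt2 vs trt3"], raw_p, adj_p):
    print(f"{label}: Pr > |t| = {p:.4f}, Sidak-adjusted = {q:.4f}")
```

Because the inputs are rounded to four decimals, the results agree with the Adj P column of Table 3.11 only up to rounding.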
Table 3.9 Estimation of treatment and block parameters

Parameter estimates
Effect      Trt   Block   Parameter   Estimate   Standard error   DF   t-value   Pr > |t|
Intercept                 η̂           43.5556    3.1866           4    13.67     0.0002
Trt         1             τ̂1          7.3333     3.4907           4    2.10      0.1036
Trt         2             τ̂2          -18.0000   3.4907           4    -5.16     0.0067
Trt         3             τ̂3          0
Block             1       block1      1.0000     3.4907           4    0.29      0.7887
Block             2       block2      -6.6667    3.4907           4    -1.91     0.1288
Block             3       block3      0
Scale                     σ̂²          18.2778    12.9243

Table 3.10 Coefficients for treatment and block used in least squares

(a) Coefficients for Trt least squares means
Effect      Trt   Block   Row1     Row2     Row3
Intercept                 1        1        1
Trt         1             1
Trt         2                      1
Trt         3                               1
Block             1       0.3333   0.3333   0.3333
Block             2       0.3333   0.3333   0.3333
Block             3       0.3333   0.3333   0.3333

(b) Trt least squares means
Trt   Estimate   Standard error   DF   t-value   Pr > |t|
1     49.0000    2.4683           4    19.85     <0.0001
2     23.6667    2.4683           4    9.59      0.0007
3     41.6667    2.4683           4    16.88     <0.0001

(c) Differences of Trt least squares means
Trt   _Trt   Estimate   Standard error   DF   t-value   Pr > |t|
1     2      25.3333    3.4907           4    7.26      0.0019
1     3      7.3333     3.4907           4    2.10      0.1036
2     3      -18.0000   3.4907           4    -5.16     0.0067

The planned contrasts defined in matrix K′ and the F-values obtained with the “contrast” statement produce the same results (part (c) of Table 3.11). Now, the same dataset is fitted using the same linear predictor but assuming that the response variable is binomial. This analysis is intended to show the options available in SAS for fitting non-normal responses, in this case, a binomial response. Practically the same statements as in the previous program for normal data are used, but some other options (“ilink,” “oddsratio,” or “exp”) are now illustrated, with details on the circumstances under which they should be used.
This is because all estimable functions produce estimates on the model scale, and we must decide which conversions are needed to obtain the results on the data scale. The estimable functions and the conversion each one requires are listed below, where the non-normal case is the binomial response with a logit link used here.

(a) Least squares means (“lsmeans”): used directly for normal data; converted with the inverse link (“ilink”) for non-normal data
(b) Differences between pairs of treatment least squares means: used directly for normal data; reported as odds ratios (“oddsratio”) for non-normal data
(c) Mean of a treatment obtained with “estimate”: used directly for normal data; converted with the inverse link (“ilink”) for non-normal data
(d) Treatment i versus treatment i′ obtained with “estimate”: exponentiation (“exp”), which gives the odds ratio
(e) Multiple estimates of treatment differences: exponentiation (“exp”), which gives the odds ratios for non-normal data
(f) “contrast”: no conversion to the data scale is needed, since the result is only an F-test

Table 3.11 Multiple estimates and contrasts

(a) Estimates
Label                    Estimate   Standard error  DF  t-value  Pr > |t|
LSM Trt1                 49.0000    2.4683          4   19.85    <0.0001
Overall mean             38.1111    1.4251          4   26.74    <0.0001
Average Trt1&Trt2        36.3333    1.7454          4   20.82    <0.0001
Average Trt1&Trt2&Trt3   38.1111    1.4251          4   26.74    <0.0001
Trt1 vs Trt2             25.3333    3.4907          4   7.26     0.0019
Trt1 vs Trt3             7.3333     3.4907          4   2.10     0.1036
Trt2 vs Trt3             -18.0000   3.4907          4   -5.16    0.0067

(b) Estimate adjustment for multiplicity: Sidak
Label          Estimate   Standard error  DF  t-value  Pr > |t|  Adj P
Trt1 vs Trt2   25.3333    3.4907          4   7.26     0.0019    0.0057
Trt1 vs Trt3   7.3333     3.4907          4   2.10     0.1036    0.2796
Trt2 vs Trt3   -18.0000   3.4907          4   -5.16    0.0067    0.0200

(c) Contrasts
Label                       Num DF  Den DF  F-value  Pr > F
Trt1 vs Trt2                1       4       52.67    0.0019
Trt1 vs Trt3                1       4       4.41     0.1036
Trt2 vs Trt3                1       4       26.59    0.0067
Trt1 vs average Trt1,Trt2   1       4       29.19    0.0057
Trt2 vs average Trt1,Trt3   1       4       51.37    0.0020
Type 3 Trt SS               2       4       27.89    0.0045
Type 3 Trt Test             2       4       27.89    0.0045
The following GLIMMIX program shows how to implement this model with a binomial response.

proc glimmix;
  class trt block;
  /* events/trials syntax: y successes out of n trials */
  model y/n = trt block / solution oddsratio;
  /* ilink requests the "Mean" columns on the data scale shown in Table 3.14 */
  lsmeans trt / diff e oddsratio ilink;
  estimate 'lsm trt1' intercept 3 trt 3 0 0 block 1 1 1 / divisor=3 ilink;
  estimate 'difference trt1 vs trt2' trt 1 -1 0 / exp;
  estimate 'avg trt1&trt2&trt3' intercept 9 trt 3 3 3 block 3 3 3 / divisor=9;
  estimate 'trt1 vs trt2' trt 1 -1 0 / exp;
  estimate 'trt1 vs trt3' trt 1 0 -1 / exp;
  estimate 'trt2 vs trt3' trt 0 1 -1 / exp;
  estimate 'trt1 vs trt2' trt 1 -1 0,
           'trt1 vs trt3' trt 1 0 -1,
           'trt2 vs trt3' trt 0 1 -1 / exp adjust=sidak;
  contrast 'trt1 vs trt2' trt 1 -1 0;
  contrast 'trt1 vs trt3' trt 1 0 -1;
  contrast 'trt2 vs trt3' trt 0 1 -1;
  /* coefficients 2 -1 -1 compare trt1 with the average of trt2 and trt3 */
  contrast 'trt1 vs average trt1,trt2' trt 2 -1 -1;
  contrast 'trt2 vs average trt1,trt3' trt -1 2 -1;
  contrast 'type 3 trt ss' trt 1 0 -1, trt 0 1 -1;
  contrast 'type 3 trt test' trt 2 -1 -1, trt -1 2 -1;
run;

Part of the output is shown in Table 3.12. The estimated treatment and block parameters of the model are given in part (a); the last level of each class effect (treatment 3 and block 3) was set to zero because the design matrix is not of full rank. Part (b) gives the type III tests of fixed effects and part (c) the odds ratio estimates. Note that $\hat{\sigma}^2$ does not appear in the output because the variance of the binomial distribution is not an independent parameter.

The type III tests of fixed effects and the odds ratios in Table 3.12 (parts (b) and (c)) show that the treatment effect is significant but the block effect is not, which suggests that these data could also be analyzed as a completely randomized design. Two sets of odds ratios were estimated (part (c)): one for the treatment effects and the other for the block effects in the model.
When odds ratios are calculated, each level of a factor is generally compared with the last level of that factor, which serves as the reference. The estimates obtained with “estimate” in Table 3.13 (parts (a) and (b)) are on the model scale, whereas the last column, the exponentiated estimate, is obtained as $\exp(\hat{\tau}_i - \hat{\tau}_{i'})$. The least squares means for treatments and the differences between them (parts (a) and (b) of Table 3.14, respectively), obtained with “lsmeans,” are the values in the “Estimate” column; these, together with their standard errors, were computed from the linear predictor $\hat{\eta}_{i\cdot} = \hat{\eta} + \hat{\tau}_i + \overline{\widehat{block}}_{\cdot}$. These estimates are on the model scale, whereas the values in the “Mean” column and their standard errors were obtained by applying the inverse link, which gives the estimated probability of success for each treatment ($\hat{\pi}_i$). The differences between treatment linear predictors are log odds ratios; the “oddsratio” option exponentiates them to give odds ratios, which express the treatment comparisons on the data scale.
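The conversions from the model scale to the data scale can be reproduced with a few lines of Python. This is a minimal sketch using the rounded values reported in Table 3.14. It assumes that the standard error of the mean is a delta-method approximation, that is, the standard error of the linear predictor multiplied by the derivative of the inverse logit, π(1 − π). The function name `inverse_logit` is only illustrative.

```python
import math


def inverse_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Model-scale least squares mean of treatment 1 and its standard error
# (Table 3.14, part (a)).
eta_trt1, se_eta_trt1 = 1.0174, 0.1603

# Data-scale probability of success.
pi_trt1 = inverse_logit(eta_trt1)

# Delta method (assumed here): SE(pi) is approximately SE(eta) * pi * (1 - pi).
se_pi_trt1 = se_eta_trt1 * pi_trt1 * (1.0 - pi_trt1)

# Differences of linear predictors are log odds ratios (Table 3.14, part (b));
# exponentiating them gives the odds ratios.
log_odds_ratios = {
    "trt1 vs trt2": 1.6184,
    "trt1 vs trt3": 0.5097,
    "trt2 vs trt3": -1.1088,
}
odds_ratios = {label: math.exp(v) for label, v in log_odds_ratios.items()}

print(round(pi_trt1, 4), round(se_pi_trt1, 5))
for label, value in odds_ratios.items():
    print(label, round(value, 3))
```

Note that the inverse link is applied to a mean on the logit scale, whereas exponentiation is applied to a difference; applying the inverse link to a difference would not give a meaningful probability.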
Table 3.12 Results of the analysis of variance in a binomial model

(a) Parameter estimates
Effect                          Trt  BLOCK  Estimate   Standard error  DF  t-value  Pr > |t|
Intercept ($\hat{\eta}$)                     0.5099     0.1883          4   2.71     0.0536
Trt ($\hat{\tau}_1$)            1            0.5097     0.2169          4   2.35     0.0785
Trt ($\hat{\tau}_2$)            2            -1.1088    0.2079          4   -5.33    0.0059
Trt ($\hat{\tau}_3$)            3            0          .               .   .        .
BLOCK ($\widehat{block}_1$)          1       0.06541    0.2088          4   0.31     0.7698
BLOCK ($\widehat{block}_2$)          2       -0.07205   0.2164          4   -0.33    0.7559
BLOCK ($\widehat{block}_3$)          3       0          .               .   .        .

(b) Type III tests of fixed effects
Effect  Num DF  Den DF  F-value  Pr > F
Trt     2       4       29.55    0.0040
BLOCK   2       4       0.20     0.8258

(c) Odds ratio estimates
Trt  BLOCK  _Trt  _BLOCK  Estimate  DF  95% confidence limits
1           3             1.665     4   0.912   3.040
2           3             0.330     4   0.185   0.588
     1            3       1.068     4   0.598   1.906
     2            3       0.930     4   0.510   1.697

3.3 Fixed and Random Effects in the Inference Space

In an analysis, inference can be directed solely at fixed effects (population inference) or at a combination of fixed and random effects (specific inference). To illustrate these two levels of inference, we will consider two examples.

3.3.1 A Broad Inference Space or a Population Inference

In practice, the random effects in a linear mixed model (LMM) should represent the population from which the data were collected and should be included in studies as if they came from a well-planned sample. In a model, random effects can be locations, regions, states, blocks, and so on, and they have two very particular characteristics.

• Random effects represent the target population.
• Random effects have a probability distribution.

These two characteristics give us a broad inference space, in which we can calculate point estimates and interval estimates and perform hypothesis tests that apply to the entire population represented by the random effects.
Formally, an estimate or hypothesis test based on an LMM has a broad inference space, defined by the estimable function K′β, if all the coefficients in Z are zero; otherwise, the estimate or hypothesis test is defined by the predictable function K′β + Z′b, which is a specific inference.

Table 3.13 Different estimates obtained with “estimate”

(a) Estimates
Label                    Estimate  Standard error  DF  t-value  Pr > |t|  Mean     Standard error mean  Exponentiated estimate
LSM Trt1                 1.0174    0.1603          4   6.35     0.0032    0.7345   0.03127              .
Average Trt1&Trt2&Trt3   0.3080    0.08768         4   3.51     0.0246    Non-est  .                    .
Trt1 vs Trt2             1.6184    0.2181          4   7.42     0.0018    Non-est  .                    5.0452
Trt1 vs Trt3             0.5097    0.2169          4   2.35     0.0785    Non-est  .                    1.6647
Trt2 vs Trt3             -1.1088   0.2079          4   -5.33    0.0059    Non-est  .                    0.3300

(b) Estimate adjustment for multiplicity: Sidak
Label          Estimate  Standard error  DF  t-value  Pr > |t|  Adj P   Exponentiated estimate
Trt1 vs Trt2   1.6184    0.2181          4   7.42     0.0018    0.0053  5.0452
Trt1 vs Trt3   0.5097    0.2169          4   2.35     0.0785    0.2175  1.6647
Trt2 vs Trt3   -1.1088   0.2079          4   -5.33    0.0059    0.0177  0.3300

Table 3.14 Estimated linear predictors for treatments and treatment differences with their respective inverse values

(a) Trt least squares means
Trt  Estimate  Standard error  DF  t-value  Pr > |t|  Mean    Standard error mean
1    1.0174    0.1603          4   6.35     0.0032    0.7345  0.03127
2    -0.6011   0.1480          4   -4.06    0.0153    0.3541  0.03385
3    0.5077    0.1462          4   3.47     0.0255    0.6243  0.03429

(b) Differences of Trt least squares means
Trt  _Trt  Estimate  Standard error  DF  t-value  Pr > |t|  Odds ratio
1    2     1.6184    0.2181          4   7.42     0.0018    5.045
1    3     0.5097    0.2169          4   2.35     0.0785    1.665
2    3     -1.1088   0.2079          4   -5.33    0.0059    0.330

3.3.2 Mixed Models with a Normal Response

In Example 3.2, the response variable was modeled as a function of treatment and block effects, both of which were assumed to be fixed.
Now, suppose that the treatments were applied by three different people (blocks). Treating the block effects as fixed would then be questionable, since each person does the job according to his or her own experience, skill, and so forth. Clearly, there is some variability between blocks that is not due to the treatments, and this has to be accounted for, so the effects due to blocks must be considered random. In this example, let us assume that the three blocks (persons) were randomly selected from a population. Thus, the components of the model are defined as follows:

Distribution: (a) $y_{ij} \mid block_j \sim N(\mu_{ij}, \sigma^2)$
              (b) $block_j \sim N(0, \sigma^2_{block})$
Linear predictor: $\eta_{ij} = \eta + \tau_i + block_j$ $(i, j = 1, 2, 3)$
Link function: $\eta_{ij} = \mu_{ij}$ (identity link)

Note the impact of this change on the estimable function for the treatment means. In Example 3.2, the estimable function was defined by $E(\bar{y}_{i\cdot}) = \eta + \tau_i + \overline{block}_{\cdot}$. Now, with the mixed model (fixed treatment effects and random blocks), the estimable function is defined by $E(\bar{y}_{i\cdot}) = \eta + \tau_i$ because $E(block_j) = 0$. Therefore, the estimable function for the mean of each treatment is $\eta + \tau_i$. In this situation, two questions arise:

• How much do the results obtained from a fixed effects model differ from those obtained from a mixed model?
• How can we compare the two results?

The following program allows us to estimate a mixed model with a normal response.
proc glimmix;
  class trt block;
  model y = trt / solution;
  random block / solution;
  lsmeans trt / diff e;
  /* fixed-effect coefficients before the bar, random-effect coefficients after it */
  estimate 'lsm trt1' intercept 1 trt 1 0 0 | block 0 0 0;
  estimate 'lsm trt2' intercept 1 trt 0 1 0 | block 0 0 0;
  estimate 'lsm trt3' intercept 1 trt 0 0 1 | block 0 0 0;
  estimate 'blup trt1' intercept 3 trt 3 0 0 | block 1 1 1 / divisor=3;
  estimate 'blup trt2' intercept 3 trt 0 3 0 | block 1 1 1 / divisor=3;
  estimate 'blup trt3' intercept 3 trt 0 0 3 | block 1 1 1 / divisor=3;
run;

In this GLIMMIX code, each “estimate” statement gives the coefficients of the fixed effects before the vertical bar (|) and the coefficients of the random effects after it. For the three “blup” statements, after dividing by 3, this is

$$K'\beta + Z'b = \underbrace{\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \eta \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix}}_{\text{fixed effects}} + \underbrace{\frac{1}{3}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} block_1 \\ block_2 \\ block_3 \end{bmatrix}}_{\text{random effects}}$$

Part of the output is shown in Table 3.15. Part (a) shows the estimated variance components: the variance due to blocks is $\hat{\sigma}^2_{block} = 11.2778$, whereas the residual variance of the conditional observations, that is, the mean squared error (MSE), is $\hat{\sigma}^2 = 18.2778$. The fixed effects solution obtained with the “solution” option is given in part (b). The analysis of variance (part (c)) indicates that there is a significant difference between treatments (P = 0.0045), and so the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ is rejected. Because the expected block effect is zero, the treatment means and the best linear unbiased predictors (BLUPs) estimated with the “estimate” statement have the same point estimates and differ only in their standard errors (3.1388 versus 2.4683), as shown in Table 3.16 (part (a)). Parts (b) and (c) show the means and the pairwise differences between means estimated with the “lsmeans” statement and the “diff” option.
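The two sets of standard errors can be traced back to the variance components. The sketch below is a hand calculation for this balanced design (three blocks, one observation per treatment in each block), not a general formula: it assumes that the broad-inference standard error of a treatment mean includes the block variance, while the narrow-inference (BLUP) standard error, which averages over the three observed blocks, involves only the residual variance. The variable names are illustrative.

```python
import math

# Variance component estimates from Table 3.15, part (a).
sigma2_block = 11.2778
sigma2_resid = 18.2778
n_blocks = 3  # each treatment appears once in each block

# Broad inference (K'beta only): blocks are a sample from a population,
# so their variance adds to the uncertainty of a treatment mean.
se_broad = math.sqrt((sigma2_block + sigma2_resid) / n_blocks)

# Narrow inference (K'beta + Z'b, averaged over the observed blocks):
# only the residual variance remains in this balanced case.
se_narrow = math.sqrt(sigma2_resid / n_blocks)

# Difference between two treatment means: block effects cancel because
# every treatment is observed in every block.
se_difference = math.sqrt(2.0 * sigma2_resid / n_blocks)

print(round(se_broad, 4), round(se_narrow, 4), round(se_difference, 4))
```

Under these assumptions, the three values correspond to the standard errors of the “lsm” estimates, the “blup” estimates, and the treatment differences in Table 3.16, which is one way to see why the fixed-block and random-block analyses give the same differences but different standard errors for the means.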
Table 3.15 Variance components, fixed effects, and fixed effects test

(a) Covariance parameter estimates
Cov Parm  Estimate  Standard error
BLOCK     11.2778   17.8966
Residual  18.2778   12.9243

(b) Solutions for fixed effects
Effect     Trt  Estimate   Standard error  DF  t-value  Pr > |t|
Intercept       41.6667    3.1388          2   13.27    0.0056
Trt        1    7.3333     3.4907          4   2.10     0.1036
Trt        2    -18.0000   3.4907          4   -5.16    0.0067
Trt        3    0          .               .   .        .

(c) Type III tests of fixed effects
Effect  Num DF  Den DF  F-value  Pr > F
Trt     2       4       27.89    0.0045

3.4 Marginal and Conditional Models

The process of analyzing a dataset has two main objectives: the first is model selection, which aims to find well-fitting parsimonious models for the responses being measured, and the second is model prediction, where estimates from the selected models are used to predict quantities of interest and their uncertainties. The differences that may arise in this process are mainly due to the choice of identifiability constraints on the random effects. To compare two different models, we must compare analogous quantities: different constraints can lead to models that look very different but are inferentially identical. The conditional model is regarded as the basic model, and any conditional model leads to a specific marginal model. Lee and Nelder (2004) proposed and worked on conditional models derived from hierarchical generalized linear models (HGLMs) and on the marginal models derived from these conditional models. Marginal models have often been fitted using generalized estimating equations (GEEs), the drawbacks of which they also discuss.

3.4.1 Marginal Versus Conditional Models

Consider two models with a normal distribution. One is a random effects model (a mixed model)

$$y_{ij} = \mu + \tau_i + b_j + \varepsilon_{ij}$$

where $b_j \sim N(0, \sigma^2_b)$ and $\varepsilon_{ij} \sim N(0, \sigma^2)$.
The other is a marginal model

$$E(y_{ij}) = \mu + \tau_i$$

where the elements of $V(y) = \Sigma$ are variances and covariances with an arbitrary correlation structure. Zeger et al. (1988) pointed out that, for a marginal model, generalized estimating equations yield consistent estimators. An obvious advantage of random effects models is that they allow conditional inferences in addition to marginal inferences (Robinson 1991). Using the model with random effects, we can obtain not only the conditional mean

$$\mu^C_{ij} = E(y_{ij} \mid b_j) = \mu + \tau_i + b_j$$

but also the marginal mean

$$\mu_{ij} = E(\mu^C_{ij}) = E[E(y_{ij} \mid b_j)] = \mu + \tau_i$$

whereas with the marginal model, we can obtain only the marginal mean $\mu_{ij}$. It may be reasonable to assume that the unobservable block effects ($b_j$) follow a certain distribution. However, the center of this distribution cannot be identified because it is confounded with the intercept. Therefore, in the random effects model, we impose the identifiability constraints $E(b_j) = 0$ and $E(\varepsilon_{ij}) = 0$, as we do for the error terms in linear models. In the mixed model, these restrictions become $\sum_j \hat{b}_j = 0$ and $\sum_j \hat{\varepsilon}_{ij} = 0$ in any estimation procedure. First, we consider the case in which the data follow a normal distribution. We then briefly discuss how the results differ for data with a non-normal distribution.

Table 3.16 Estimated means, best linear unbiased estimates (BLUEs), and BLUPs for treatment and the difference between two means

(a) Estimates
Label      Estimate  Standard error  DF  t-value  Pr > |t|
lsm trt1   49.0000   3.1388          4   15.61    <0.0001
lsm trt2   23.6667   3.1388          4   7.54     0.0017
lsm trt3   41.6667   3.1388          4   13.27    0.0002
blup trt1  49.0000   2.4683          4   19.85    <0.0001
blup trt2  23.6667   2.4683          4   9.59     0.0007
blup trt3  41.6667   2.4683          4   16.88    <0.0001

(b) Trt least squares means
Trt  Estimate  Standard error  DF  t-value  Pr > |t|
1    49.0000   3.1388          4   15.61    <0.0001
2    23.6667   3.1388          4   7.54     0.0017
3    41.6667   3.1388          4   13.27    0.0002

(c) Differences of Trt least squares means
Trt  _Trt  Estimate   Standard error  DF  t-value  Pr > |t|
1    2     25.3333    3.4907          4   7.26     0.0019
1    3     7.3333     3.4907          4   2.10     0.1036
2    3     -18.0000   3.4907          4   -5.16    0.0067

3.4.2 Normal Distribution Example

The effect of different substrates (factor A), namely three substrates made from vermicompost and one from compost, on the development of physiological variables and on the mortality of cuttings of three clones (factor B) of robusta coffee (Coffea canephora p.) was evaluated. The levels of factor A are randomly assigned to rows in each block, with the restriction that each block receives levels A1, A2, A3, and A4, and the levels of factor B (B1, B2, and B3) are randomly assigned to columns, so that each level of factor B is combined with each level of factor A in each block. The data for this experiment are tabulated in Table 3.17. Note that while there are two randomization processes, there are effectively three sizes of experimental units: rows for A levels, columns for B levels, and row–column intersections for A × B combinations. Thus, the experimental design used was a randomized complete block design with a strip-plot treatment arrangement.

Table 3.17 Mortality of coffee seedling clones (C) in different substrates (S)

Block  S  C  Mortality  Pct      |  Block  S  C  Mortality  Pct
1      3  1  3.33       0.0333   |  3      3  1  6.6        0.066
1      3  3  16         0.16     |  3      3  2  10         0.1
1      3  2  16         0.16     |  3      3  3  56.6       0.566
1      1  1  3.33       0.0333   |  3      2  1  3.3        0.033
1      1  3  6.6        0.066    |  3      2  2  26.6       0.266
1      1  2  3.3        0.033    |  3      2  3  40         0.4
1      2  1  10         0.1      |  3      4  1  3.3        0.033
1      2  3  3.33       0.0333   |  3      4  2  46         0.46
1      2  2  3.33       0.0333   |  3      4  3  33.3       0.333
1      4  1  3.33       0.0333   |  3      1  1  6.6        0.066
1      4  3  16         0.16     |  3      1  2  43.3       0.433
1      4  2  13.3       0.133    |  3      1  3  50         0.5
2      4  3  3.3        0.033    |  4      4  1  33         0.33
2      4  1  3.3        0.033    |  4      4  2  10         0.1
2      4  2  20         0.2      |  4      4  3  23.3       0.233
2      1  3  10         0.1      |  4      2  3  50         0.5
2      1  1  3.33       0.0333   |  4      1  2  23.3       0.233
2      1  2  6.6        0.066    |  4      1  3  6.6        0.066
2      2  3  36.6       0.366    |  4      2  1  16         0.16
2      2  1  26.6       0.266    |  4      2  2  10         0.1
2      2  2  43.3       0.433    |  4      2  3  16         0.16
2      3  3  3.3        0.033    |
The model for these data is given below:

$$y_{ijk} = \mu + b_k + \alpha_i + (\alpha b)_{ik} + \beta_j + (\beta b)_{jk} + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where $y_{ijk}$ is the response observed in the kth block at the ith level of factor A and the jth level of factor B, $\mu$ is the overall mean, $b_k$ is the random effect due to blocks, assuming $b_k \sim N(0, \sigma^2_b)$, $\alpha_i$ is the fixed effect due to the substrate type (S), $(\alpha b)_{ik}$ is the random effect due to the interaction of a substrate with blocks, assuming $(\alpha b)_{ik} \sim N(0, \sigma^2_{\alpha b})$, $\beta_j$ is the fixed effect due to the coffee clone type (C), $(\beta b)_{jk}$ is the random effect due to the interaction of a coffee clone with blocks, assuming $(\beta b)_{jk} \sim N(0, \sigma^2_{\beta b})$, $(\alpha\beta)_{ij}$ is the fixed interaction effect between a substrate and a coffee clone, and $\varepsilon_{ijk}$ is the normal random error, $\varepsilon_{ijk} \sim N(0, \sigma^2)$. The components of the model for this dataset are as follows:

Linear predictor: $\eta_{ijk} = \mu + b_k + \alpha_i + (\alpha b)_{ik} + \beta_j + (\beta b)_{jk} + (\alpha\beta)_{ij}$
Distributions: $y_{ijk} \mid b_k, (\alpha b)_{ik}, (\beta b)_{jk} \sim N(\mu_{ijk}, \sigma^2)$
               $b_k \sim N(0, \sigma^2_b)$; $(\alpha b)_{ik} \sim N(0, \sigma^2_{\alpha b})$; $(\beta b)_{jk} \sim N(0, \sigma^2_{\beta b})$
Link function: $\eta_{ijk} = \mu_{ijk}$ (identity link)

The following GLIMMIX syntax fits a GLMM with a normal response.

proc glimmix;
  class block s c;
  model y = s|c;
  /* random block, block x substrate, and block x clone effects */
  random intercept s c / subject=block;
  lsmeans s*c / slicediff=s;
run;

Part of the results of this analysis is shown below. The estimated variance components for blocks, block × substrate, and block × clone and the MSE are $\hat{\sigma}^2_b = 23.4714$, $\hat{\sigma}^2_{\alpha b} = 35.4995$, $\hat{\sigma}^2_{\beta b} = 67.0160$, and $\hat{\sigma}^2 = \text{MSE} = 139.58$, respectively, which are listed in part (a) of Table 3.18. The fixed effects tests for both factors and their interaction (part (b)) are not statistically significant. Following the “slicediff=s” option in the “lsmeans” statement, Table 3.19 shows the simple effect comparisons between clone levels within each substrate level.
Table 3.18 Estimated variance components and type III tests of fixed effects

(a) Covariance parameter estimates
Cov Parm   Subject  Estimate  Standard error
Intercept  Block    23.4714   58.9336
S          Block    35.4995   44.9134
C          Block    67.0160   64.0909
Residual            139.58    49.9056

(b) Type III tests of fixed effects
Effect  Num DF  Den DF  F-value  Pr > F
S       3       8.318   0.48     0.7076
C       2       5.935   1.68     0.2650
S*C     6       16.65   0.44     0.8441

Table 3.19 Simple effect comparisons across substrate levels

Simple effect comparisons of S*C least squares means by S
Simple effect level  C  _C  Estimate   Standard error  DF  t-value  Pr > |t|
S1                   1  2   -12.9814   10.9667         15  -1.18    0.2549
S1                   1  3   -12.1564   10.9667         15  -1.11    0.2851
S1                   2  3   0.8250     10.1636         15  0.08     0.9364
S2                   1  2   -6.8325    10.1636         15  -0.67    0.5116
S2                   1  3   -15.2779   9.8708          15  -1.55    0.1425
S2                   2  3   -8.4454    9.8708          15  -0.86    0.4057
S3                   1  2   -4.4620    12.8219         15  -0.35    0.7327
S3                   1  3   -19.0350   12.8124         15  -1.49    0.1581
S3                   2  3   -14.5730   11.4417         15  -1.27    0.2222
S4                   1  2   -11.5925   10.1636         15  -1.14    0.2719
S4                   1  3   -8.2425    10.1636         15  -0.81    0.4301
S4                   2  3   3.3500     10.1636         15  0.33     0.7463

3.4.3 Non-normal Distribution Example

Using the data in Table 3.17 but under a beta distribution, the components of the GLMM change slightly:

Distributions: $y_{ijk} \mid b_k, (\alpha b)_{ik}, (\beta b)_{jk} \sim \text{Beta}(\mu_{ijk}, \phi)$, where $\phi$ is the scale parameter
               $b_k \sim N(0, \sigma^2_b)$; $(\alpha b)_{ik} \sim N(0, \sigma^2_{\alpha b})$; $(\beta b)_{jk} \sim N(0, \sigma^2_{\beta b})$
Linear predictor: $\eta_{ijk} = \mu + b_k + \alpha_i + (\alpha b)_{ik} + \beta_j + (\beta b)_{jk} + (\alpha\beta)_{ij}$
Link function: $\eta_{ijk} = \text{logit}(\mu_{ijk})$

The following GLIMMIX syntax fits the model with a beta response variable.
proc glimmix method=laplace;
  class block s c;
  model pct = s|c / dist=beta;
  /* random block, block x substrate, and block x clone effects */
  random intercept s c / subject=block;
  lsmeans s*c / plot=meanplot(sliceby=s join) slicediff=s ilink;
run;

Table 3.20 Variance components and the fixed effects test

(a) Covariance parameter estimates
Cov Parm                Subject  Estimate  Standard error
Intercept               block    0.06723   0.1617
S                       block    0.1594    0.1420
C                       block    0.1932    0.1687
Scale ($\hat{\phi}$)             16.6041   5.6153

(b) Type III tests of fixed effects
Effect  Num DF  Den DF  F-value  Pr > F
S       3       8       0.74     0.5584
C       2       6       2.60     0.1540
S*C     6       15      0.59     0.7303

Table 3.21 Simple effect comparisons across substrate levels

Simple effect comparisons of S*C least squares means by S
Simple effect level  C  _C  Estimate  Standard error  DF  t-value  Pr > |t|
S1                   1  2   -0.9698   0.6578          15  -1.47    0.1611
S1                   1  3   -0.9298   0.6696          15  -1.39    0.1852
S1                   2  3   0.03999   0.5457          15  0.07     0.9426
S2                   1  2   -0.4407   0.5432          15  -0.81    0.4299
S2                   1  3   -0.8555   0.5237          15  -1.63    0.1231
S2                   2  3   -0.4149   0.4911          15  -0.84    0.4115
S3                   1  2   -0.4588   0.8285          15  -0.55    0.5879
S3                   1  3   -1.3224   0.7773          15  -1.70    0.1095
S3                   2  3   -0.8636   0.6375          15  -1.35    0.1955
S4                   1  2   -0.9880   0.5619          15  -1.76    0.0991
S4                   1  3   -0.7138   0.5739          15  -1.24    0.2326
S4                   2  3   0.2741    0.5261          15  0.52     0.6099

Some of the SAS output from this analysis is shown above. The estimated variance components for blocks, block × substrate, and block × clone and the scale parameter are $\hat{\sigma}^2_b = 0.06723$, $\hat{\sigma}^2_{\alpha b} = 0.1594$, $\hat{\sigma}^2_{\beta b} = 0.1932$, and $\hat{\phi} = 16.6041$, respectively, which are listed in part (a) of Table 3.20. As before, the fixed effects tests for both factors and their interaction (part (b)) are not statistically significant. Compared with the normal model of the previous example, the variance components under the beta distribution (even when multiplied by 100) are smaller, and the type III tests of fixed effects are closer to being significant. Table 3.21 shows the estimates (on the linear predictor scale) of the simple effect comparisons between clone levels within each substrate level.
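To see what the scale parameter implies on the data scale, the following Python sketch evaluates the variance of a beta response for a few mean values. It assumes the mean and precision parameterization in which the two shape parameters are μφ and (1 − μ)φ, so that the variance is μ(1 − μ)/(1 + φ); this parameterization, the chosen mean values, and the helper names are assumptions made for illustration and are not taken from the output above.

```python
phi_hat = 16.6041  # estimated scale parameter from Table 3.20, part (a)


def beta_shapes(mu, phi):
    """Shape parameters of a beta distribution with mean mu and scale phi."""
    return mu * phi, (1.0 - mu) * phi


def beta_variance(mu, phi):
    """Variance under the assumed parameterization: mu (1 - mu) / (1 + phi)."""
    return mu * (1.0 - mu) / (1.0 + phi)


def variance_from_shapes(a, b):
    """Textbook variance of Beta(a, b), used as a cross-check."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


# Hypothetical mortality proportions spanning the range seen in Table 3.17.
for mu in (0.05, 0.20, 0.50):
    a, b = beta_shapes(mu, phi_hat)
    print(mu, round(beta_variance(mu, phi_hat), 5),
          round(variance_from_shapes(a, b), 5))
```

The variance depends on the mean, shrinking as the proportion approaches 0 or 1, which is one reason the variance components of the beta model are not on the same footing as those of the normal model.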
The simple effects in Table 3.21 differ from those obtained under the normal model, mainly because in a GLMM these values are linear predictors estimated on the model scale and not means estimated on the data scale (as in the example of Sect. 3.4.2). It is also important to note that the degrees-of-freedom correction in the estimation of means cannot yet be used in the estimation of a GLMM.

3.5 Exercises

Exercise 3.5.1 The data in Table 3.22 show the yield of five barley varieties in a randomized complete block experiment conducted in Minnesota (Immer et al. 1934).

Table 3.22 Total yields (grams) of barley varieties in 12 independent trials

Location  Manchuria  Svansota  Velvet  Trebi  Peatland
1         81.0       105.4     119.7   109.7  98.3
1         80.7       85.3      80.4    87.2   84.2
2         146.6      142.0     150.7   191.5  145.7
2         100.4      115.5     112.2   147.4  108.1
3         82.3       77.3      78.4    131.3  89.6
3         103.1      105.1     116.5   139.9  129.6
4         119.8      121.4     124.0   140.8  124.8
4         98.9       61.9      96.2    125.5  75.7
5         98.9       89.0      69.1    89.3   104.1
5         66.4       49.9      96.7    61.9   80.3
6         86.9       77.1      78.9    101.8  96.0
6         67.7       66.7      67.4    91.8   94.1

(a) Write a complete description of the statistical model associated with this study and the assumptions of this model.
(b) Compute the ANOVA for the design model according to part (a) and determine whether there is a significant difference between the varieties.
(c) Use the least significant difference (LSD) method to make pairwise comparisons of the variety mean yields.

Exercise 3.5.2 Lew (2007) conducted an experiment to determine whether cultured cells respond to two drugs (chemical formulations). The experiment was conducted using a cell culture line placed in Petri dishes. Each experimental trial consisted of three Petri dishes: one treated with drug 1, one treated with drug 2, and one untreated as a control. The data are shown in Table 3.23.

(a) Write a complete description of the statistical model associated with this study and the assumptions of this model.
(b) Analyze the data using a completely randomized design. Is there a significant difference between the treatment groups?
(c) Analyze the data as a randomized complete block design, where the trials represent a blocking factor.
(d) Is there any difference between the results obtained in (b) and (c)? If so, explain what might cause the difference and which method you would recommend.

Table 3.23 Number of cells cultured in different drugs

          Control  Drug 1  Drug 2
Trial 1   1147     1169    1009
Trial 2   1273     1323    1260
Trial 3   1216     1276    1143
Trial 4   1046     1240    1099
Trial 5   1108     1432    1385
Trial 6   1265     1562    1164

Chapter 4 Generalized Linear Mixed Models for Non-normal Responses

4.1 Introduction

Generalized linear mixed models (GLMMs) have been recognized as one of the major methodological developments in recent years, as evidenced by the increasing use of these sophisticated statistical tools and by their broad applicability and flexibility.
This family of models can be applied to a wide range of data types (continuous, categorical (nominal or ordinal), percentages, and counts), with each member of the family appropriate for a specific type of data. This modern methodology allows data to be described through the distribution of the exponential family that best fits the response variable. These complex models were not computationally feasible until recently, when advances in statistical software allowed users to apply GLMMs (Zuur et al. 2009; Stroup 2012; Zuur et al. 2013). Researchers in fields other than statistical science are also interested in modeling the structure of data. For example, in the social sciences, there have been applications in education, when several tests are applied to students; in longitudinal personality studies, when the occurrence of an emotion is repeatedly observed over time in a set of people; and in surveys that investigate the political preference of a population, among others. Likewise, agriculture and the life sciences are other major areas in which the response variables depart from the classical methodology based on “normality,” since the data generally fall within the nominal, ordinal, or interval (continuous) scales of measurement. In a GLMM, the response data do not undergo any transformation; instead, a function of the expected value of the response is modeled as a linear combination of the explanatory variables. GLMMs are a powerful tool that allows proper modeling of the variation between groups and across space and time, leading to accuracy in the modeling of the observed data as well as in the estimation of variance components.

© The Author(s) 2023. J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_4

4.2 A Brief Description of Linear Mixed Models (LMMs)

Before addressing GLMMs, we present a brief overview of linear mixed models (LMMs). An LMM is a model whose response variable is normal and that assumes (1) that the relationship between the mean of the dependent variable ($y$) and the fixed and random effects can be modeled as a linear function; (2) that the variance is not a function of the mean; and (3) that the random effects follow a normal distribution. The classic representation of an LMM in matrix form is

$$y = X\beta + Zb + \varepsilon \qquad (4.1)$$

where $y$ is the $(n \times 1)$ vector of the response variable; $X$ is the $(n \times (p + 1))$ design matrix of fixed effects with rank $k$; $\beta$ is the $((p + 1) \times 1)$ vector of unknown fixed effects parameters; $Z$ is the $(n \times q)$ design matrix of random effects; and $b$ is the $(q \times 1)$ vector of random effects, which is assumed to follow a normal distribution with mean 0 and variance–covariance matrix $G$, that is, $b \sim N(0, G)$. Finally, $\varepsilon$ is the error vector, which follows a normal distribution with mean 0 and variance–covariance matrix $R$, that is, $\varepsilon \sim N(0, R)$; the vectors $b$ and $\varepsilon$ are assumed to be independent of each other. As previously mentioned, Model 4.1 can be described in terms of a probability distribution in two ways. The first is the marginal model $y \sim N(E[y] = X\beta, \text{Var}[y] = V = ZGZ' + R)$, where the mean is based solely on the fixed effects and the parameters describing the random effects are contained in the variance–covariance matrix $V$ (Littell et al. 2006). The second is the conditional model $y \mid b \sim N(X\beta + Zb, R)$. Under normality assumptions, both models are exactly the same and hence produce the same solution, whereas when normality is not satisfied, the models produce different solutions (Stroup 2012).
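The difference between the marginal and the conditional formulation can be illustrated numerically. The sketch below is a hypothetical illustration with a single random intercept b ~ N(0, σ²_b): it computes the marginal mean E[g⁻¹(η + b)] by simple numerical integration and compares it with g⁻¹(η), the conditional mean at b = 0. The values of η and σ_b, and all function names, are assumptions chosen only for this example.

```python
import math


def normal_pdf(b, sd):
    """Density of N(0, sd^2) at b."""
    return math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def marginal_mean(inverse_link, eta, sd, n=4001, width=8.0):
    """E[inverse_link(eta + b)] for b ~ N(0, sd^2), by the trapezoidal rule."""
    lower, upper = -width * sd, width * sd
    step = (upper - lower) / (n - 1)
    total = 0.0
    for i in range(n):
        b = lower + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * inverse_link(eta + b) * normal_pdf(b, sd)
    return total * step


def identity(eta):
    return eta


def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


eta, sd_b = 1.0, 1.5  # hypothetical linear predictor and random effect SD

# Normal response, identity link: averaging over b leaves the mean unchanged.
marginal_identity = marginal_mean(identity, eta, sd_b)

# Binomial response, logit link: the marginal mean is pulled toward 0.5.
marginal_logit = marginal_mean(inverse_logit, eta, sd_b)
conditional_logit = inverse_logit(eta)

print(round(marginal_identity, 4))
print(round(marginal_logit, 4), round(conditional_logit, 4))
```

With the identity link the two means coincide, which is why the marginal and conditional LMMs give the same solution, whereas with a nonlinear link the mean of the population-averaged response is not the inverse link of the fixed effects alone.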
4.3 Generalized Linear Mixed Models

Many datasets in the agricultural, biological, and social sciences fall outside the scope of the traditional methods taught in introductory statistics and statistical methods courses. Often, these data (response variables) are (a) binary (the presence or absence of a trait of interest, success or failure, the infection status of an individual, or the expression of a genetic disorder); (b) proportions (the ratio of females to males, infection or mortality rates within a group of individuals); or (c) counts (the number of emerging seedlings, the number of sprouts, etc.), where basic statistical methods attempt to quantify the effect of each predictor variable. However, these studies often involve random effects, the purpose of which is to quantify the variation among individuals or units. The most common random effects are blocks in experimental or observational studies that are replicated across sites (locations or environments) or over time. Random effects also encompass variation among individuals (when multiple responses are measured per individual, such as the survival or sex ratios of multiple offspring), genotypes, species, and regions or periods of time.

GLMMs are a powerful class of statistical tools that combine the concepts and ideas of generalized linear models (GLMs) with those of linear mixed models (LMMs). That is, a GLMM is an extension of the GLM in which the linear predictor contains random effects in addition to fixed effects. These models handle a wide range of response distributions and of scenarios in which observations are sampled. GLMMs extend the theory of LMMs to response variables that have a non-normal distribution. In GLMMs, the response data are not transformed; instead, a function $g$ of the expectation of $y \mid b$ is expressed as a linear combination of the explanatory variables; that is, the response is modeled conditionally on the random effects.
This function g is the link function, which relates the response to the explanatory variables in a linear manner, thus allowing the use of standard LMM techniques for estimation and hypothesis testing. A conditional model, analogous to Model 4.1, is used to describe a GLMM with non-Gaussian errors. Given a link function g, the model is

$$g(\mu) = X\beta + Zb,$$

where μ is the conditional expectation given by

$$E[y \mid b] = g^{-1}(X\beta + Zb) = g^{-1}(\eta) = \mu \tag{4.2}$$

where $g^{-1}(\cdot)$ is the inverse link and the other terms have already been defined. The fixed and random effects are combined to form the conditional linear predictor

$$\eta = g(E[y \mid b]) = X\beta + Zb \tag{4.3}$$

The relationship between the linear predictor and the vector of observations is modeled as follows:

$$y \mid b \sim \left(g^{-1}(\eta),\ R\right) \tag{4.4}$$

The notation in (4.4) expresses that the conditional distribution of y given b has mean $g^{-1}(\eta)$ and variance R. Note that instead of specifying the distribution for y, as in the case of a GLM, we specify a distribution for the conditional response y | b. The variance and covariance matrix for the observations is given by:

$$V(y) = E[V(y \mid b)] + V(E[y \mid b]) = A^{1/2} R A^{1/2} + ZGZ' \tag{4.5}$$

where A is a diagonal matrix containing the variance functions of the model.

GLMMs cover an important group of statistical models, such as:

(a) Linear models (LMs): random effects are absent, the link function is the identity, and a normal distribution is assumed.
(b) Generalized linear models (GLMs): random effects are absent, the link function is different from the identity function, and the response variables are non-normally distributed.
(c) Linear mixed models (LMMs): random effects are present, the link function is the identity, and a normal distribution is assumed for the response variable.

GLMMs have been formulated to correct the shortcomings of LMMs, as there are many cases where the assumptions made in linear mixed models are inadequate.
First, an LMM assumes that the relationship between the mean of the dependent variable (y) and the fixed and random effects (β, b) can be modeled through a linear function. This assumption is questionable, for example, when a researcher wishes to model the incidence of a disease or the success or failure of an event. The second assumption of an LMM is that the variance is not a function of the mean and that the random effects follow a normal distribution. The assumption of constant variance is not met when the response variable is binary (1, 0). In this case, the variance is π(1 − π), which is a function of the mean. The response is a random variable that can take only two values (0 or 1); in contrast, a normally distributed variable can take any real value. Finally, the predictions from an LMM can take any real value, whereas the predictions for a binary variable are bounded in the interval (0, 1), since they are probabilities and cannot take negative values.

Historically, a number of options have been used to work around these LMM problems, even though their use is not the most appropriate. These include logarithmic transformations ($\log(y)$), square root transformations ($\sqrt{y}$), arcsine transformations ($\sin^{-1}(y)$), and so on. After such a transformation, a linear mixed model is still fitted, even though the analyst is aware that the response variable does not satisfy the assumption of normality. These options are attractive because they are relatively simple and easy to implement using the LMM machinery. However, they circumvent the problem that a linear mixed model is not the best model for analyzing the data.

4.4 The Inverse Link Function

In a GLMM, the link function maps the conditional mean of the data to the linear predictor of the model, $g(\mu) = \eta = X\beta + Zb$. This linear predictor can be transformed to the observed data scale through an inverse link function.
In other words, the inverse link function is used to map the value of the linear predictor for the ith observation, $\eta_i$, to the conditional mean on the data scale. For example, suppose that we are conducting an experiment in which we assess the number of undesirable weeds observed in a crop of interest after the application of a certain number of treatments; the response variable is assumed to have a Poisson distribution with mean $\lambda_{ij}$, the linear predictor of which is given by

$$\eta_{ij} = \eta + \tau_i + b_j$$

where η is the intercept, $\tau_i$ is the fixed effect due to treatments, and $b_j$ is the random effect, assuming $b_j \sim N(0, \sigma_b^2)$. To obtain the inverse of the link function

$$\log(\lambda_{ij}) = g(\lambda_{ij}) = \eta_{ij} = \eta + \tau_i + b_j,$$

we exponentiate both sides of the previous equation, which gives the inverse link function shown below:

$$\lambda_{ij} = e^{\eta + \tau_i + b_j},$$

which is denoted as $\lambda_{ij} = g^{-1}(\eta_{ij}) = g^{-1}(\eta + \tau_i + b_j)$. Therefore, for this example, $\lambda_{ij}$ depends on the linear predictor through the inverse link function, and the variance $\sigma_{ij}^2$ depends on $\lambda_{ij}$ through the variance function.

4.5 The Variance Function

The variance function is used to model the nonconstant variability of the phenomenon under study. With GLMMs, the residual variability arises from two sources, namely, the variability of the distribution of sampling units in an experimental arrangement (blocks, plots, locations, etc.) and the variability due to overdispersion. Overdispersion can be modeled in several ways. When dealing with a GLMM, the scale parameter or dispersion parameter ϕ is extremely important, since it can either increase or decrease the variance in the model for each observation:

$$\mathrm{Var}(y_{ij} \mid b_j) = \phi\, V(\mu_{ij})$$

where $V(\mu_{ij})$ denotes the variance function evaluated at the conditional mean. If overdispersion exists, one way to account for it is to add a random effect for each observation (in SAS, _residual_) to the linear predictor.
Another alternative is to use another distribution to model the dataset; for example, the two-parameter negative binomial (NB) distribution $(\lambda_{ij}, \phi)$ instead of the single-parameter Poisson distribution $(\lambda_{ij})$ in the case of count data.

4.6 Specification of a GLMM

A GLMM is composed of three parts: (1) fixed effects, which convey systematic and structural differences in responses; (2) random effects, which convey stochastic differences between blocks or other random factors, as these effects allow generalizations to the population from which the sampling units have been (randomly) sampled; and (3) the distribution of errors.

Table 4.1 Common distributions with their respective link functions

Distribution        Link function      Distribution syntax              Syntax of the link function
Binomial            Logit or probit    dist = binomial|bin|b            link = logit or probit
Poisson             Log                dist = Poisson|Poi               link = log
Beta                Logit              dist = beta                      link = logit
Normal              Identity           dist = normal|gaussian           link = identity|id
Negative binomial   Log                dist = negbinomial|negbin|nb     link = log
Multinomial         Cumulative logit   dist = multinomial|multi         link = cumlogit|clogit

Thus, a complete definition of a GLMM is as follows:

$$y \mid b \sim f(\mu, \phi) \quad \text{(conditional distribution)}$$
$$b \sim N(0, G) \quad \text{(random effects)}$$
$$g(\mu) = \eta \quad \text{(link function)}$$
$$\eta = X\beta + Zb \quad \text{(linear predictor)}$$

where the distribution function $f(\cdot)$ is a member of the exponential family, $g(\mu)$ is the link function, X and Z are the design matrices, and β and b are the unknown parameters for fixed and random effects, respectively.

When fitting a GLMM, the data remain on the original measurement scale (data scale). However, when means are estimated from a linear function of the explanatory variables (the predictor), these means are on the model scale. The link function connects the two scales, and its inverse converts estimates on the model scale back to the original data scale. This is not the same as transforming the original measurements to a different measurement scale.
For example, applying the log transformation to counts followed by an analysis of variance (ANOVA) under a normal distribution is not the same as fitting a generalized linear model, assuming a Poisson distribution and using a log link (Gbur et al. 2012). In the first case, the least squares means would normally be equal to the arithmetic means, whereas in the second case, the means are converted back to the data scale through the inverse link and may not be equal to the arithmetic means of the original sample.

The distribution specifications in "proc GLIMMIX" have default link functions, but it is always highly recommended to explicitly code the link function, since for some types of response variable, more than one alternative exists. This way, there is no doubt about which function was used. Using the wrong link function will lead to meaningless and incorrect results. Table 4.1 shows some common distributions, the appropriate link function, and the proper syntax for each. For a complete list, see the online Statistical Analysis Software (SAS/STAT) documentation for PROC GLIMMIX.

4.7 Estimation of the Dispersion Parameter

The overall measures of fit compare the observed values of the response variable with the fitted (predicted) values. The dispersion parameter is unknown and therefore must be estimated. There are two methods for estimating the overdispersion parameter. McCullagh (1983) proposed estimating overdispersion as follows:

$$\phi = \frac{(y - \mu)' V_{\mu}^{-1} (y - \mu)}{N - p} = \frac{\text{Pearson's } \chi^2}{N - p}$$

where $V_{\mu}$ is the diagonal matrix of the variance functions and N − p is the degrees of freedom for lack of fit. Later, McCullagh and Nelder (1989) suggested using the deviance:

$$\phi = \frac{-2\left[\ln L(M_1) - \ln L(M_2)\right]}{N - p} = \frac{\text{Deviance}}{N - p}$$

Deviance is a global fit statistic that also compares fitted and observed values; however, its exact form depends on the likelihood function of the random component of the model.
Deviance compares the maximum value of the likelihood function of a model, such as M1, with the maximum possible value of the likelihood function, which is calculated using the data. When the data themselves are used in the likelihood function in place of fitted means, the model is saturated: it has as many parameters as there are observations. Thus, M2 is the saturated model; it fits the data exactly and gives the highest possible value of the likelihood.

If the overdispersion parameter is significantly greater than one, this indicates that overdispersion exists; in other words, the variance is greater than the model assumes (for a Poisson model, greater than the mean). Therefore, the parameter should be used to adjust the variance. If overdispersion is not taken into account, inflated test statistics may be generated. However, when the dispersion parameter is less than 1, the test statistics are more conservative, which is not considered a big problem. The following example is intended to show how GLIMMIX in SAS estimates the dispersion parameter in a GLMM.

Example
An agronomist wants to test the effectiveness of a new herbicide offered on the market (we will denote this as herb_N) and compare it with the herbicide that has been used for several cycles (herb_C). The experimental arrangement used was a randomized complete block design, as shown below (Table 4.2).

Table 4.2 Number of undesirable weeds per plot

Block    Herb_C   Herb_N
1        1        36
2        5        109
3        21       30
4        7        48
5        2        3
6        6        0
7        0        5
8        19       26

The components of a GLMM with a Poisson response variable are listed below:

Distribution: $y_{ij} \mid b_j \sim \text{Poisson}(\lambda_{ij})$, $b_j \sim N(0, \sigma^2_{block})$
Linear predictor: $\eta_{ij} = \eta + \text{herbicide}_i + b_j$
Link function: $\log(\lambda_{ij}) = \eta_{ij}$

This model assumes that the slopes are the same for each herbicide.
The following SAS code is used for the proposed model:

    proc glimmix nobound method=laplace;           /* Laplace approximation of the likelihood */
      class block trt;                             /* classification variables */
      model count = trt / dist=poisson link=log;   /* Poisson response with log link */
      random block;                                /* random block effect */
      lsmeans trt / ilink lines;                   /* means on model and data scales, letter groups */
    run;

Explanation
The "method=" option specifies the method used to optimize the logarithm of the likelihood function. In "proc GLIMMIX," there are two popular methods, adaptive quadrature (quad) and Laplace (laplace), which are the preferred methods for categorical response variables. Both of these methods fit a conditional model. When the quadrature method is used (method=quad), subjects (individuals) must be declared in the random effects (e.g., for the above program, "random intercept/subject=block"). In addition, processing random effects by subject is more efficient than using the syntax "random block". The "dist" option is where you specify the probability distribution that is appropriate for the type of response; in this case, it is the Poisson distribution. The "link" option specifies the link function of the distribution. The "ddfm" option is omitted so that GLIMMIX uses its default method for calculating the denominator degrees of freedom for the fixed effects tests that result from the model. The "ilink" option converts the estimates of the treatment means (lsmeans) from the model scale to the data scale. Finally, "proc GLIMMIX" supports the "lines" option, which adds letter groups to the mean differences resulting from using "lsmeans."

The most relevant parts of the SAS output, for the purposes of what we want to show, are shown in Tables 4.3 and 4.4. The fit statistics of the fitted model are shown in part (a) and part (b) of Table 4.3.
Table 4.3 Fit statistics and variance components

(a) Fit statistics (Akaike's information criterion (AIC), a small sample bias corrected Akaike's information criterion (AICC), Bozdogan Akaike's information criterion (CAIC), Schwarz's Bayesian information criterion (BIC), Hannan and Quinn information criterion (HQIC))
-2 Log likelihood                      175.35
AIC (smaller is better)                181.35
AICC (smaller is better)               183.35
BIC (smaller is better)                181.59
CAIC (smaller is better)               184.59
HQIC (smaller is better)               179.74

(b) Fit statistics for conditional distribution
-2 Log L (count | r. effects)          139.03
Pearson's chi-square                   77.56
Pearson's chi-square/degree of freedom (DF)   4.85

(c) Covariance parameter estimates
Cov Parm    Estimate    Standard error
Block       1.5590      0.8690

Table 4.4 Type III fixed effects tests and estimated least squares means

(a) Type III tests of fixed effects
Effect       Num DF    Den DF    F-value    Pr > F
Herbicide    1         7         101.34     <0.0001

(b) Trts least squares means
Trts      Estimate    Standard error    DF    t-value    Pr > |t|    Mean       Standard error mean
Herb_C    1.4604      0.4696            7     3.11       0.0171      4.3076     2.0227
Herb_N    2.8947      0.4561            7     6.35       0.0004      18.0778    8.2447

The -2 log likelihood statistic is extremely useful for comparing nested models, whereas the different versions of information criteria that exist, such as the Akaike information criterion (AIC), Akaike's information criterion with small sample bias correction (AICC), the Bayesian information criterion (BIC), Bozdogan Akaike's information criterion (CAIC), and the Hannan and Quinn information criterion (HQIC), are useful when comparing models that are not necessarily nested (part (a)). The table of fit statistics for the conditional distribution (part (b)) shows the sum of the independent contributions to the conditional -2 log likelihood, the value of which is 139.03, whereas the value of Pearson's statistic divided by the degrees of freedom for the conditional distribution (Pearson's chi-square/DF) is 4.85.
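The entries of Table 4.3 can be checked by hand. The sketch below recomputes the information criteria from the -2 log likelihood using the usual penalty formulas, and the dispersion estimate from Pearson's chi-square. The parameter count (3) and the sample sizes (16 observations for AICC and for the Pearson ratio, 8 blocks for BIC, CAIC, and HQIC) are assumptions, chosen because they reproduce the tabulated values; they are not stated in the text, and the function name is hypothetical.

```python
import math

def information_criteria(neg2_loglik, n_params, n_obs, n_subjects):
    """Penalized fit statistics computed from a -2 log likelihood value.

    n_obs is used for the small-sample correction (AICC) and n_subjects for
    the log-sample-size penalties; both choices are assumptions made so that
    the results match Table 4.3.
    """
    d = n_params
    return {
        "AIC": neg2_loglik + 2 * d,
        "AICC": neg2_loglik + 2 * d * n_obs / (n_obs - d - 1),
        "BIC": neg2_loglik + d * math.log(n_subjects),
        "CAIC": neg2_loglik + d * (math.log(n_subjects) + 1),
        "HQIC": neg2_loglik + 2 * d * math.log(math.log(n_subjects)),
    }

# Poisson GLMM of Table 4.3: intercept, herbicide effect, block variance.
poisson_ic = information_criteria(175.35, n_params=3, n_obs=16, n_subjects=8)
for name, value in poisson_ic.items():
    print(f"{name}: {value:.2f}")

# Dispersion estimate: Pearson's chi-square divided by its degrees of freedom
# (assumed here to be the 16 observations).
pearson_ratio = 77.56 / 16
print(f"Pearson chi-square/DF: {pearson_ratio:.4f}")
```

Under these assumptions, each criterion is the -2 log likelihood plus a penalty that grows with the number of parameters, which is why the criteria can rank non-nested models while the raw likelihood cannot.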
The estimated dispersion parameter (ϕ = Pearson's chi-square/DF) has a value far from 1; in this case, it is ϕ = 4.85, which indicates that there is strong overdispersion. This may be because the specified distribution of the data is not appropriate, the counts are too small, or the variance function was not correctly specified. The estimate of the variance component due to blocks is tabulated in part (c) of Table 4.3, the estimated value of which is $\sigma^2_{block} = 1.559$.

The fixed effects test and least squares means are shown in Table 4.4. The type III fixed effects tests indicate that there is a highly significant difference (part (a)) in the effectiveness of the herbicides in weed suppression; the estimated means with their respective standard errors are tabulated under the "Mean" column (part (b)). The "Estimate" column, containing the estimates of the means from lsmeans, is on the model scale. They are derived from the log likelihood function. SAS always lists the means obtained with lsmeans on the model scale when creating least squares means test tables. The "Mean" column has been converted back to the data scale using the "ilink" inverse link option. These values are estimates of the average counts for each treatment level (in this case, the herbicide type) on the data scale. When we report the results, we must replace the model-scale least squares values in the test tables with these estimates (the means on the data scale, corresponding to the values in the "Mean" column).

Since there is strong overdispersion (ϕ > 1), assuming that the data have a Poisson distribution is risky, because the Poisson distribution implies that the mean and variance are equal.
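The conversion from the "Estimate" column to the "Mean" column of Table 4.4 described above can be reproduced by hand. The sketch below applies the inverse of the log link (the exponential) to each estimate; for the data-scale standard error it uses a first-order delta-method approximation, which is an assumption about how those standard errors were obtained, although it agrees with the tabulated values to within rounding. The function name is hypothetical.

```python
import math

def ilink_log(estimate, std_err):
    """Back-transform a log-link estimate to the data scale.

    The mean is exp(estimate). The standard error uses the delta method,
    SE(mean) ~ exp(estimate) * SE(estimate), assumed here for illustration.
    """
    mean = math.exp(estimate)
    return mean, mean * std_err

# Model-scale estimates and standard errors from Table 4.4, part (b).
mean_c, se_c = ilink_log(1.4604, 0.4696)   # Herb_C
mean_n, se_n = ilink_log(2.8947, 0.4561)   # Herb_N

print(f"Herb_C: mean = {mean_c:.4f}, standard error = {se_c:.4f}")
print(f"Herb_N: mean = {mean_n:.4f}, standard error = {se_n:.4f}")
```

Note that only the means are transformed after estimation; the observed counts themselves are never transformed.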
A useful alternative might be the negative binomial distribution; this distribution has mean λ and variance $\lambda + \phi\lambda^2$, with ϕ > 0 commonly known as the scale parameter. The following is the specification of the components of a GLMM with a negative binomial (NB) response variable:

Distribution: $y_{ij} \mid b_j \sim \text{Negative binomial}(\lambda_{ij}, \phi)$, $b_j \sim N(0, \sigma^2_{block})$
Linear predictor: $\eta_{ij} = \eta + \text{herbicide}_i + b_j$
Link function: $\log(\lambda_{ij}) = \eta_{ij}$

The GLIMMIX procedure also allows modeling a GLMM with a negative binomial response variable:

    proc glimmix data=itam nobound method=laplace;
      class block trts;
      model count = trts / dist=negbin;   /* negative binomial response */
      random block;                       /* random block effect */
      lsmeans trts / ilink;               /* means on model and data scales */
    run;

Part of the output is shown in Table 4.5. The fit statistics for the model comparison (part (a)) and those for the conditional distribution (part (b)) are both provided by the GLIMMIX procedure when a conditional distribution is specified. The previous analysis showed that overdispersion exists when a Poisson distribution is assumed; the results under a negative binomial distribution indicate that this overdispersion problem no longer exists, i.e., under the negative binomial distribution the data are no longer overdispersed (ϕ = 0.58). In other words, the negative binomial distribution does a better job than the Poisson distribution in fitting these data, since it effectively controls the overdispersion.

Table 4.5 Fit statistics under a negative binomial distribution

(a) Fit statistics
-2 Log likelihood                      120.50
AIC (smaller is better)                128.50
AICC (smaller is better)               132.13
BIC (smaller is better)                128.81
CAIC (smaller is better)               132.81
HQIC (smaller is better)               126.35

(b) Fit statistics for conditional distribution
-2 Log L (count | r. effects)          120.50
Pearson's chi-square                   9.21
Pearson's chi-square/DF                0.58

(c) Fit statistics and Pearson's chi-square/DF
                                       Poisson    Negative binomial
-2 Log likelihood                      175.35     120.50
AIC (smaller is better)                181.35     128.50
AICC (smaller is better)               183.35     132.13
BIC (smaller is better)                181.59     128.81
CAIC (smaller is better)               184.59     132.81
HQIC (smaller is better)               179.74     126.35
ϕ (Pearson's chi-square/DF)            4.85       0.58

Comparing the fit statistics under both distributions, tabulated in part (c) of Table 4.5, we can observe that when the data are modeled under a negative binomial distribution, the values of the fit statistics are lower than those under a Poisson distribution, and the estimated dispersion parameter falls below 1. This indicates that the negative binomial models this dataset better.

4.8 Estimation and Inference in Generalized Linear Mixed Models

4.8.1 Estimation

In GLMMs, inference involves the estimation and testing of hypotheses about the unknown parameters in β, G, and R, as well as the best linear unbiased predictions (BLUPs) of the random effects, b. In most modern statistical tools, including GLMMs, parameter fitting is performed via maximum likelihood (ML) or methods derived from it. For simple analyses, in which the response variables are normal, classical ANOVA methods are based on calculating differences of sums of squares, which produce the same results as ML estimation. However, this equivalence does not hold in models with more complex structures such as LMMs or GLMMs. To find the ML estimators in GLMMs, one must integrate over all possible values of the random effects. For GLMMs, this computation is at best slow and at worst (with a large number of random effects) computationally infeasible.
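To make the integration over the random effects concrete, the following sketch approximates the marginal likelihood contribution of a single block in a Poisson model with a random block intercept, using a brute-force midpoint rule on a grid. This is only an illustration of the integral, not the algorithm used by any statistical package. The inputs are taken from the herbicide example (the block 1 counts of Table 4.2, the model-scale means of Table 4.4, and the block variance of Table 4.3) purely as plausible values, and all function names and grid settings are assumptions.

```python
import math

def poisson_logpmf(y, lam):
    """log f(y) = -lam + y*log(lam) - log(y!)."""
    return -lam + y * math.log(lam) - math.lgamma(y + 1)

def normal_pdf(b, sigma2):
    """Density of N(0, sigma2) evaluated at b."""
    return math.exp(-b * b / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def conditional_likelihood(counts, etas, b):
    """Likelihood of the counts in one block, given the block effect b."""
    return math.exp(sum(poisson_logpmf(y, math.exp(eta + b))
                        for y, eta in zip(counts, etas)))

def block_marginal_likelihood(counts, etas, sigma2_b, n_grid=2001, width=8.0):
    """Midpoint-rule approximation of the integral over b of
    conditional_likelihood(counts, etas, b) * normal_pdf(b, sigma2_b)."""
    sd = math.sqrt(sigma2_b)
    lower = -width * sd
    step = 2 * width * sd / n_grid
    total = 0.0
    for k in range(n_grid):
        b = lower + (k + 0.5) * step
        total += conditional_likelihood(counts, etas, b) * normal_pdf(b, sigma2_b) * step
    return total

counts = (1, 36)            # block 1: Herb_C and Herb_N counts
etas = (1.4604, 2.8947)     # fixed part of the linear predictor per herbicide
marginal = block_marginal_likelihood(counts, etas, sigma2_b=1.559)
print(f"marginal likelihood contribution of block 1: {marginal:.3e}")
```

With a single random intercept, the integral is one-dimensional and a grid is affordable; with several crossed or correlated random effects, the dimension of the integral grows and this kind of direct evaluation quickly becomes impractical, which motivates the approximations discussed next.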
Statisticians have proposed several ways to approximate the parameter estimates of a GLMM, including penalized quasi-likelihood (PQL) and pseudo-likelihood methods (Schall 1991; Wolfinger and O'Connell 1993; Breslow and Clayton 1993), Laplace approximations (Raudenbush et al. 2000), Gauss–Hermite quadrature (Pinheiro and Chao 2006), and Bayesian methods based on Markov chain Monte Carlo (Gilks et al. 1996). In all these approaches, researchers must distinguish between standard ML estimation, which estimates the standard deviations of the random effects assuming that the fixed effects estimates are precisely correct, and restricted maximum likelihood (REML), a variant that averages over the uncertainty in the fixed effects parameters (Pinheiro and Bates 2000; Littell et al. 2006). The ML method underestimates the standard deviations of random effects, except in extremely large datasets, but it is most useful for comparing models with different fixed effects.

Pseudo- and quasi-likelihood methods are the simplest and the most widely used for approximating a GLMM. They are widely implemented in statistical packages, which has promoted the use of GLMMs in many areas of ecology, biology, and quantitative and evolutionary genetics (Breslow 2004). Unfortunately, pseudo- and quasi-likelihood methods produce biased parameter estimates if the standard deviations of the random effects are large, especially with binary data (Rodriguez and Goldman 2001; Goldstein and Rasbash 1996). Lee and Nelder (2001) have implemented several improvements to PQL, but these are not available in most common statistical software packages. As a rule of thumb, PQL performs poorly for Poisson data when the average number of counts per treatment combination is less than five, or for binomial data when the expected numbers of successes and failures for each observation are less than five (Breslow 2004).
Another disadvantage of PQL is that it calculates a quasi-likelihood rather than the true likelihood. Because of this, many statisticians believe that PQL-based methods should not be used for inference. There are two more accurate approximations available, which also reduce bias. One is the Laplace approximation (Raudenbush et al. 2000), which approximates the true likelihood of a GLMM instead of a quasi-likelihood, allowing the maximum likelihood method to be used in the GLMM inference process. The other is Gauss–Hermite quadrature (Pinheiro and Chao 2006), which is more accurate than the Laplace approximation but slower (it requires more computational resources).

The approximate procedures for estimating the parameters of a GLMM can therefore be summarized as follows. The penalized quasi-likelihood method performs the estimation process by alternating between (1) estimating the fixed parameters by fitting a GLM with a variance–covariance matrix based on an LMM fit and (2) estimating the variances and covariances by fitting a GLM with unequal variances calculated from the previous GLM fit. Pseudo-likelihood, a close cousin of PQL, estimates the variances differently and estimates a scale parameter to account for overdispersion (some authors use these terms interchangeably).

In summary, GLMMs require an iterative process for parameter estimation. Two categories of iterative procedures are used by SAS: linearization and integral approximation. In the GLIMMIX procedure, linearization uses the pseudo-likelihood method, and integral approximation uses the Laplace approximation or adaptive methods such as Gauss–Hermite quadrature. These methods maximize the log likelihood for the exponential family of distributions, i.e., non-normal distributions. The pseudo-likelihood method is the default in the GLIMMIX procedure (Proc GLIMMIX).
The Laplace method and quadrature both approximate maximum likelihood, but the Laplace method is computationally simpler than quadrature and also provides excellent estimates.

4.8.2 Inference

After estimating the parameter values in a GLMM, the next step is to extract information and draw statistical conclusions from a given dataset through careful analysis of the parameter estimates (confidence intervals, hypothesis testing) and to select the model that best describes, or explains the most variability in, the dataset. Inference can generally be based on three approaches: (a) hypothesis testing, (b) model comparison, and (c) Bayesian approaches. Hypothesis testing compares test statistics (e.g., the F statistic in ANOVA) with their expected distributions under the null hypothesis (H0), yielding a P-value that is used to determine whether H0 can be rejected. Model selection, on the other hand, compares the fits of candidate models. Candidate models can be compared using hypothesis tests, that is, testing simpler nested models against more complex ones (Stephens et al. 2005), for instance with Wald tests (Z, χ², t, and F), or using information-theoretic approaches. In model selection, likelihood ratio (LR) tests can assess the significance of factors or choose the better of a pair of candidate models. Information criteria, on the other hand, allow multiple comparisons and the selection of non-nested models. Among these criteria are the Akaike information criterion (AIC) and related information criteria that use deviance as a measure of fit, adding a term to penalize more complex models. Information criteria can provide better estimates. Variations of AIC are commonly used when sample sizes are not large (AICC), when there is overdispersion in the data (quasi-AIC, QAIC), or when one wishes to identify or determine the number of parameters in a model (Bayesian information criterion, BIC).

4.9 Fitting the Model

The mathematics behind a GLMM is quite complex.
It is difficult to conceptualize the use of constructs such as distributions, link functions, log likelihood, and quasi-likelihood when fitting a model. Perhaps the following points will help explain the modeling process.

(a) An analysis of variance model is a vector of linear predictors (equation) with unknown parameter estimates.
(b) Each distribution has a corresponding probability function.
(c) The vector of linear predictors is substituted into the likelihood function.
(d) Solutions to the parameter estimates are found by minimizing the negative of the log likelihood function (-log likelihood).
(e) The means (least squares means, lsmeans) are derived from the parameter estimates and are on the model scale.
(f) The inverse link function converts the mean estimates on the model scale to the original data scale.

The key concepts of proc GLIMMIX are: (1) it uses a distribution to estimate the model parameters; it does not fit the data to a distribution; and (2) the data values are not transformed by the link function; the inverse link converts the means (least squares means) to the data scale after estimation on the model scale.

4.10 Exercises

1. As a simple example of these types of data, consider the following results of an experiment on wheat germination, carried out in pots under glass. The experiment consisted of four blocks of six treatments (Table 4.6).
(a) According to the response variable, what type(s) of probability distribution do you suggest for the variable?
(b) Construct a GLMM to study the effect of treatments on seed germination.
(c) Analyze the dataset according to the model proposed in (a). Is the probability distribution proposed in (a) adequate?
(d) Is there a significant difference in the proportion of germinated seeds between treatments?
Table 4.6 Number of seeds not germinating (out of 50)

Treatment   A     B     C     D
Trt1        10    8     5     1
Trt2        11    10    11    6
Trt3        8     3     2     4
Trt4        9     7     8     13
Trt5        7     9     10    7
Trt6        6     3     7     10

Table 4.7 Control of cockchafer larvae

Block   Age   Trt1   Trt2   Trt3   Trt4   Trt5
A       a     3      4      3      5      4
A       b     7      3      10     8      6
B       a     7      3      6      4      4
B       b     15     12     12     11     11
C       a     5      4      4      1      2
C       b     7      2      4      5      2
D       a     5      12     1      5      3
D       b     14     5      14     9      8
E       a     0      2      2      2      0
E       b     3      3      2      7      1
F       a     1      1      1      3      5
F       b     7      6      7      7      4
G       a     1      3      1      0      1
G       b     10     5      8      3      6
H       a     4      4      7      3      1
H       b     13     11     10     12     8

2. Table 4.7 shows the counts per sample area of a type of cockchafer larva (two age groups, a and b). The experiment consisted of five treatments in eight randomized blocks and two age groups, to study the differential effects of the treatments according to insect age.
(a) Considering the type of response in this exercise, what type(s) of probability distribution(s) do you suggest for it?
(b) Construct a GLMM to study the effect of treatments and the age of cockchafer larvae.
(c) Analyze the dataset according to the model proposed in (a).
(d) Is the model used in (a) sufficient? If so, discuss your findings.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Chapter 5 Generalized Linear Mixed Models for Counts

5.1 Introduction

Data in the form of counts regularly appear in studies in which the number of occurrences is investigated, such as the number of insects, birds, or weeds in agricultural or agroecological studies; the number of plants transformed or regenerated using modern breeding techniques; the number of individuals with a certain disease in a medical study; and the number of defective products in a quality improvement study, among others. These counts can be recorded per unit of time, area, or volume. When using a generalized linear model (GLM) with a Poisson distribution, it is often found that there is excessive dispersion (extra variation) that is not captured by the Poisson model. In these cases, the data are often better modeled with a negative binomial distribution, which has the same mean as the Poisson distribution but a variance greater than the mean. Most experiments have some form of structure due to the experimental design (completely randomized design (CRD), randomized complete block design (RCBD), incomplete block, or split-plot design) or the sampling design, which must be incorporated into the predictor to adequately model the data.

5.2 The Poisson Model

A Poisson distribution with parameter λ belongs to the exponential family and is a discrete distribution whose probability function is

$$f(y) = \frac{e^{-\lambda} \lambda^{y}}{y!}; \quad \lambda > 0, \quad y = 0, 1, 2, \cdots$$

© The Author(s) 2023 J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_5

The mean and variance of a Poisson random variable are equal, i.e., E(y) = Var(y) = λ. A Poisson distribution is often used to model responses that are "counts." As λ increases, the Poisson distribution becomes more symmetric, and eventually it can be reasonably approximated by a normal distribution.
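The two properties just stated, equality of the mean and variance and increasing symmetry as λ grows, can be checked numerically from the probability function. The sketch below sums the probability function over a truncated range of counts; the truncation point (400) and the λ values (1, 4, 25) are arbitrary choices made for illustration, and the function names are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """f(y) = exp(-lam) * lam**y / y!, computed on the log scale for stability."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def poisson_moments(lam, y_max=400):
    """Mean, variance, and skewness obtained by summing over y = 0, ..., y_max.

    y_max must be large relative to lam so that the truncated tail is negligible.
    """
    probs = [poisson_pmf(y, lam) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    skew = sum((y - mean) ** 3 * p for y, p in enumerate(probs)) / var ** 1.5
    return mean, var, skew

for lam in (1.0, 4.0, 25.0):
    mean, var, skew = poisson_moments(lam)
    print(f"lambda = {lam:5.1f}: mean = {mean:.4f}, variance = {var:.4f}, skewness = {skew:.4f}")
```

The skewness of a Poisson distribution is 1/√λ, so it shrinks toward zero (the value for a symmetric distribution) as λ increases, consistent with the normal approximation for large λ.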
Let $y_{ij}$ be the value of the count variable associated with unit i at level one and with unit j at level two, given a set of explanatory variables. Its probability function is

$$f(y_{ij}) = \frac{e^{-\lambda_{ij}}\lambda_{ij}^{y_{ij}}}{y_{ij}!}, \quad y_{ij} = 0, 1, 2, \cdots$$

and the logarithm of the likelihood is given by

$$\log f(y_{ij}) = \log\left(\frac{e^{-\lambda_{ij}}\lambda_{ij}^{y_{ij}}}{y_{ij}!}\right) = -\lambda_{ij} + y_{ij}\log\lambda_{ij} - \log(y_{ij}!).$$

A Poisson distribution has very particular mathematical properties that are used when we model “counts.” For example, the expected value of y is equal to the variance of y, such that

$$E(y_{ij}) = \mathrm{Var}(y_{ij}) = \lambda_{ij}.$$

Consequently, $\lambda_{ij}$ is necessarily a nonnegative number, which could lead to difficulties if we used the identity link function in this context. The natural logarithm is therefore the link function most often used for expected “counts.” For single-level data, a Poisson regression model is used, which models the natural logarithm of the expected count, $\log(\lambda_i)$, whereas for multilevel data, a mixed model with a Poisson response is a better choice for modeling $\log(\lambda_{ij})$.

Suppose that, given the random effects b, the counts $y_1, y_2, \cdots, y_n$ are conditionally independent such that $y_{ij} \mid b_j \sim \text{Poisson}(\lambda_{ij})$, where

$$\log\lambda_{ij} = \eta + \tau_i + b_j.$$

This is a special case of a generalized linear mixed model (GLMM) in which the link function is $g(\lambda_{ij}) = \log(\lambda_{ij})$. The dispersion parameter ϕ is, in this case, equal to 1. Sometimes, if the counts are extremely large, their distribution can be approximated by a continuous distribution. Likewise, if all the counts are large enough, fitting the model to the square root of the counts is a viable option because the transformation stabilizes the variance. However, as mentioned in previous chapters, estimation under normality can be problematic because it can produce negative fitted values and predictions, which are meaningless for counts.
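The log-likelihood above and the role of the log link can be illustrated with a short Python sketch. The counts are hypothetical values chosen only for illustration, and `poisson_loglik` is an assumed helper name rather than anything from the book. For a single group, the maximum likelihood estimate on the model scale is the logarithm of the sample mean, and the inverse link exp(η) keeps the fitted mean positive.

```python
import math

def poisson_loglik(counts, eta):
    """Sum over observations of -lambda + y*log(lambda) - log(y!), with lambda = exp(eta)."""
    lam = math.exp(eta)  # inverse log link: always positive
    return sum(-lam + y * eta - math.lgamma(y + 1) for y in counts)

# Hypothetical counts, chosen only for illustration (not data from the book).
counts = [3, 7, 4, 6]
eta_hat = math.log(sum(counts) / len(counts))  # MLE of eta for a single common mean

for eta in (eta_hat - 0.2, eta_hat, eta_hat + 0.2):
    print(f"eta = {eta:.3f}  lambda = {math.exp(eta):.3f}  "
          f"loglik = {poisson_loglik(counts, eta):.4f}")
```

The middle line of the output, at η = log(5), has the largest log-likelihood of the three.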
5.2.1 CRD with a Poisson Response

A CRD is a design in which a fixed number of t treatments is randomly assigned to the experimental units. The linear predictor describing the mean structure of this GLM is

$$\eta_{ij} = \eta + \tau_i$$

where $\eta_{ij}$ denotes the linear predictor for the jth observation of the ith treatment, η is the intercept, and $\tau_i$ is the fixed effect due to treatment i ($i = 1, 2, \cdots, t$; $j = 1, 2, \cdots, r_i$), with t treatments and $r_i$ replicates in treatment i.

Example: Effect of a subculture on the number of shoots during micropropagation of sugarcane. The objective of micropropagation in sugarcane is to produce vegetative material identical to the donor so that its genetic integrity is preserved. Despite this, somaclonal variation has been observed in plants derived from in vitro culture regardless of explant, variety, ploidy level, number of subcultures, and generation route used, among others. A total of 8 explants were planted in temporary immersion bioreactors (explant/bioreactor) to determine whether the number of subcultures (10 subcultures) influences the number of shoots observed per explant. In this example, we have $r_i$ observations ($j = 1, 2, \ldots, r_i$) on each of the 10 subcultures ($i = 1, 2, \ldots, 10$) in a completely randomized design (Appendix 1: Data: Subcultures). The sources of variation and degrees of freedom for this model are given in the analysis of variance (ANOVA) table (Table 5.1). The components of the GLM are set out below:

Distribution: $y_{ij} \sim \text{Poisson}(\lambda_{ij})$
Linear predictor: $\eta_{ij} = \eta + \tau_i$
Link function: $\log\lambda_{ij} = \eta_{ij}$

where $y_{ij}$ denotes the number of shoots observed in subculture i and explant j ($i = 1, 2, \cdots, 10$; $j = 1, 2, \cdots, 8$), $\eta_{ij}$ is the linear predictor, η is the intercept, and $\tau_i$ is the fixed effect of subculture i.
Table 5.1 Analysis of variance

Sources of variation   Degrees of freedom (unbalanced design)   Degrees of freedom (balanced design)
Subculture             t - 1 = 10 - 1 = 9                       t - 1
Error                  $\sum_{i=1}^{10} r_i - t = 164$          t(r - 1)
Total                  $\sum_{i=1}^{10} r_i - 1 = 173$          tr - 1

Table 5.2 Model information and estimation methods

(a) Model information
Dataset                      WORK.SUGAR
Response variable            NB
Response distribution        Poisson
Link function                Log
Variance function            Default
Variance matrix              Diagonal
Estimation technique         Maximum likelihood
Degrees of freedom method    Residual

(b) Dimensions
Columns in X                 11
Columns in Z                 0
Subjects (blocks in V)       1
Max Obs per subject

The following Statistical Analysis Software (SAS) code fits a CRD with a Poisson response.

    proc glimmix data=sugar method=laplace;
      class rep1 sub1;
      model nb = sub1 / dist=poisson s link=log;
      lsmeans sub1 / lines ilink;
    run; quit;

While most of the commands used have been explained before, the options “dist,” “s,” and “link” in the model statement tell SAS the distribution of the data, to print the fixed effects solution, and the link function to use, respectively. In addition, the “lines” option in the “lsmeans” (least squares means) statement requests comparisons of means, and the “ilink” option applies the inverse link function so that the means are also reported on the data scale. Part of the output is shown in Table 5.2, where part (a) shows the model and the methods used to fit it, whereas part (b) lists the dimensions of the relevant matrices in the model specification. Because this model has no random effects, there are no columns in matrix Z. The 11 columns in matrix X comprise an intercept and 10 columns for the effect of subcultures. The goodness-of-fit statistics of the model are shown in part (a) of Table 5.3. The ratio of Pearson's chi-square statistic to its degrees of freedom (DF) is less than 1 (Pearson's chi-square/DF = 0.79).
This indicates that there is no overdispersion and that the variability in the data has been adequately modeled with the Poisson distribution. Part (b) of Table 5.3 shows the maximum likelihood (ML) parameter estimates (“Estimate”), their standard errors, and the t-tests of the hypothesis that each parameter is equal to zero.

Table 5.3 Fit statistics and estimated parameters

(a) Fit statistics (Akaike's information criterion (AIC), small-sample bias-corrected Akaike's information criterion (AICC), Bozdogan's consistent Akaike's information criterion (CAIC), Schwarz's Bayesian information criterion (BIC), Hannan and Quinn information criterion (HQIC))
-2 Log likelihood            1062.11
AIC (smaller is better)      1082.11
AICC (smaller is better)     1083.46
BIC (smaller is better)      1113.70
CAIC (smaller is better)     1123.70
HQIC (smaller is better)     1094.93
Pearson's chi-square          137.70
Pearson's chi-square/DF         0.79

(b) Parameter estimates
Effect     sub1  Parameter          Estimate   Standard error   DF    t-value   Pr > |t|
Intercept        $\hat\eta$          3.6687    0.04124          164    88.96    <0.0001
sub1       1     $\hat\tau_1$       -1.0809    0.07389          164   -14.63    <0.0001
sub1       2     $\hat\tau_2$       -0.9043    0.06664          164   -13.57    <0.0001
sub1       3     $\hat\tau_3$       -0.5596    0.06839          164    -8.18    <0.0001
sub1       4     $\hat\tau_4$       -0.3412    0.06398          164    -5.33    <0.0001
sub1       5     $\hat\tau_5$        0.2177    0.05540          164     3.93     0.0001
sub1       6     $\hat\tau_6$        0.2257    0.05452          164     4.14    <0.0001
sub1       7     $\hat\tau_7$        0.2631    0.05178          164     5.08    <0.0001
sub1       8     $\hat\tau_8$        0.3387    0.05109          164     6.63    <0.0001
sub1       9     $\hat\tau_9$        0.2684    0.05478          164     4.90    <0.0001
sub1       10    $\hat\tau_{10}$     0         .                .       .        .

Table 5.4 (part (a)) shows the significance tests for the fixed effects in the model (“Type III tests of fixed effects”). These tests are Wald tests, not likelihood ratio tests. The effect of subculture on the number of shoots is highly significant in this model (P < 0.0001), indicating that the 10 subcultures do not produce the same number of shoots; that is, the number of subcultures affects the average shoot production per explant.
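The t-values in Table 5.3(b) are Wald statistics, that is, each estimate divided by its standard error. As a quick arithmetic check, the following Python sketch recomputes three of them from the tabulated estimates; the dictionary layout and the helper name `wald_t` are illustrative assumptions.

```python
# Estimates and standard errors copied from Table 5.3(b).
estimates = {
    "Intercept": (3.6687, 0.04124),
    "sub1 1": (-1.0809, 0.07389),
    "sub1 5": (0.2177, 0.05540),
}

def wald_t(estimate: float, std_error: float) -> float:
    """Wald statistic for H0: parameter = 0."""
    return estimate / std_error

t_values = {name: round(wald_t(est, se), 2) for name, (est, se) in estimates.items()}
for name, t in t_values.items():
    print(f"{name:10s} t = {t:7.2f}")
```

The recomputed values agree with the t-value column of the table to two decimals.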
The least squares means obtained with “lsmeans” (part (b) of Table 5.4) are the values under the column “Estimate,” which, along with their standard errors, were calculated with the linear predictor $\hat\eta_i = \hat\eta + \hat\tau_i$. These estimates are on the model scale, whereas the values in the “Mean” column and their standard errors are on the data scale; they were obtained by applying the inverse link, i.e., $\hat\lambda_i = \exp(\hat\eta_i)$. A comparison of means, using the option “lines,” is presented in Fig. 5.1. In this figure, we can see that the average production is lowest in the first subcultures, increases from subculture 5 to subculture 8, and begins to decrease in subculture 9.

Table 5.4 Type III tests of fixed effects and least squares means

(a) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
sub1     9        164      120.14    <0.0001

(b) sub1 least squares means
sub1   Estimate ($\hat\eta_i$)   Standard error   DF    t-value   Pr > |t|   Mean ($\hat\lambda_i$)   Standard error mean
1      2.5878                    0.06131          164    42.21    <0.0001    13.3000                  0.8155
2      2.7644                    0.05234          164    52.81    <0.0001    15.8696                  0.8307
3      3.1091                    0.05455          164    56.99    <0.0001    22.4000                  1.2220
4      3.3274                    0.04891          164    68.03    <0.0001    27.8667                  1.3630
5      3.8864                    0.03699          164   105.08    <0.0001    48.7333                  1.8025
6      3.8944                    0.03567          164   109.18    <0.0001    49.1250                  1.7522
7      3.9318                    0.03131          164   125.57    <0.0001    51.0000                  1.5969
8      4.0073                    0.03015          164   132.91    <0.0001    55.0000                  1.6583
9      3.9370                    0.03606          164   109.18    <0.0001    51.2667                  1.8487
10     3.6687                    0.04124          164    88.96    <0.0001    39.2000                  1.6166

Fig. 5.1 Average number of shoots per subculture. Bars with different letters are statistically different using α = 0.05

5.2.2 Example 2: CRDs with Poisson Response

Researchers want to determine whether the application of a new growth compound to walnut trees changes the number of nuts produced per tree. The compound was applied at three different times (pre-flowering = 1, flowering = 2, and post-flowering = 3) and in two formulations (A and B); in addition to these treatments (Trt), there was a control (C), in which no compound was applied. In total, 7 treatments were randomly applied to the experimental units, i.e., 35 trees in a rectangular arrangement. The number of nuts $y_{ij}$ observed on each tree, together with the formulation and time of application, is given in Table 5.5.

Table 5.5 Number of nuts per tree ($y_{ij}$) in each of the combinations of the two factors

Trt  yij    Trt  yij    Trt  yij    Trt  yij    Trt  yij
C      1    A3    79    A3    89    B1    50    B2   138
A2   118    B3    99    C     21    A2    69    B1    69
A1    69    A1    50    A2    79    A3    69    A3   138
B1    89    B3   118    B3    99    A1    21    C     11
B1    99    A3    99    A1    79    B2   118    B2    89
B2   158    C     50    B1   118    C     30    B3   158
A1    89    A2   127    A2    89    B2    99    B3   118

The components of the GLMM are listed below:

Distribution: $y_{ij} \mid r_j \sim \text{Poisson}(\lambda_{ij})$, $r_j \sim N(0, \sigma^2_{tree})$
Linear predictor: $\eta_{ij} = \eta + \tau_i + r_j$
Link function: $\log\lambda_{ij} = \eta_{ij}$

where $y_{ij}$ denotes the number of nuts in treatment i on tree j ($i = 1, 2, \cdots, 7$; $j = 1, 2, \cdots, 5$), $\eta_{ij}$ is the linear predictor, η is the intercept, $\tau_i$ is the fixed effect due to treatment i, and $r_j$ is the random effect due to tree j.

The following SAS statements fit a GLMM to a completely randomized design with a Poisson response variable.

    proc glimmix data=crd_nuez nobound method=laplace;
      class trt rep;
      model count = trt / dist=Poi link=log;
      random rep;
      lsmeans trt / lines ilink;
    run;

The options “dist” and “link” in the model statement tell SAS the distribution of the data and the link function to use, respectively.
In addition, the “lines” option in the “lsmeans” (least squares means) statement requests comparisons of means, and the “ilink” option applies the inverse of the link function. Part of the results is presented in Table 5.6. The value of the fit statistic for the conditional distribution (part (a)) indicates that there is strong overdispersion ($\chi^2/df = 3.62$), and the estimated variance component due to the experimental units (trees) is $\hat\sigma^2_{tree} = 0.03573$ (part (b)).

Table 5.6 Results of the analysis of variance

(a) Fit statistics for conditional distribution
-2 Log L (count | r. effects)   354.60
Pearson's chi-square            126.54
Pearson's chi-square/DF           3.62

(b) Covariance parameter estimates
Cov Parm   Estimate   Standard error
rep        0.03573    0.02362

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Trt      6        24       59.55     <0.0001

In addition, Table 5.6 (part (c)) shows the type III tests of fixed effects, indicating that there is a significant difference between treatments in the average number of nuts per tree (P < 0.0001). However, it is not advisable to continue with the inference and analysis of the experiment, because the extra variance in the data (commonly known as overdispersion; Pearson's chi-square/DF = 3.62) strongly affects the F-test and the standard errors of the means. An effective way to deal with overdispersion is to replace the Poisson distribution with a different distribution. A negative binomial distribution is a good option for overdispersed count data. It arises by assuming that the conditional distribution of the observations is $y_{ij} \mid r_j \sim \text{Poisson}(\lambda_{ij})$, where $\lambda_{ij} \sim \text{Gamma}(1/\phi, \phi)$, with ϕ as the scale parameter and $r_j \sim N(0, \sigma^2_{tree})$.
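Under this Poisson-gamma construction, the marginal distribution of a count with mean λ is negative binomial with variance λ + ϕλ², a standard result that is restated in Sect. 5.2.4.3. The Python sketch below checks the two moments by summing the probability function directly. The parameterization with shape 1/ϕ, the function names, the truncation point, and the values λ = 20 and ϕ = 0.5 are illustrative assumptions, not estimates from the book.

```python
import math

def negbin_pmf(y: int, lam: float, phi: float) -> float:
    """Negative binomial pmf with mean lam and scale (dispersion) phi."""
    k = 1.0 / phi  # shape of the mixing gamma distribution
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + y * math.log(phi * lam / (1.0 + phi * lam))
             - k * math.log(1.0 + phi * lam))
    return math.exp(log_p)

def negbin_moments(lam: float, phi: float, y_max: int = 5000) -> tuple[float, float]:
    """Mean and variance by direct summation over 0..y_max (assumed truncation)."""
    probs = [negbin_pmf(y, lam, phi) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

lam, phi = 20.0, 0.5  # illustrative values only
mean, var = negbin_moments(lam, phi)
print(f"mean = {mean:.4f}, variance = {var:.4f}, lam + phi*lam^2 = {lam + phi * lam**2:.4f}")
```

As ϕ approaches 0 the variance approaches λ, which is why the Poisson model can be viewed as a limiting case.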
The resulting new GLMM is:

Distribution: $y_{ij} \mid r_j \sim \text{Negative Binomial}(\lambda_{ij}, \phi)$, $r_j \sim N(0, \sigma^2_{tree})$
Linear predictor: $\eta_{ij} = \eta + \tau_i + r_j$
Link function: $\log\lambda_{ij} = \eta_{ij}$

The following GLIMMIX statements fit this model to the CRD under a negative binomial distribution.

    proc glimmix data=crd_nuez nobound method=laplace;
      class trt rep;
      model count = trt / dist=Negbin link=log;
      random rep;
      lsmeans trt / lines ilink;
    run;

Table 5.7 Poisson and negative binomial model fit statistics

(a) Fit statistics
                                Poisson   Negative binomial
-2 Log likelihood               374.83    328.03
AIC (smaller is better)         390.83    346.03
AICC (smaller is better)        396.37    353.23
BIC (smaller is better)         387.71    342.51
CAIC (smaller is better)        395.71    351.51
HQIC (smaller is better)        382.45    336.60

(b) Fit statistics for conditional distribution
                                Poisson   Negative binomial
-2 Log L (count | r. effects)   354.60    316.06
Pearson's chi-square            126.54     32.02
Pearson's chi-square/DF           3.62      0.91

Table 5.8 Variance component estimates and fixed effects tests

(a) Covariance parameter estimates
Cov Parm   Estimate   Standard error
Rep        0.04288    0.03398
Scale      0.06141    0.02428

(b) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Trt      6        24       18.75     <0.0001

Part of the results is listed in Tables 5.7 and 5.8. The information criteria in Table 5.7 (part (a)) help in choosing the model that best fits the dataset; all of them are smaller for the negative binomial distribution, which therefore provides the better fit to these data. In the conditional fit statistics (part (b)), the Poisson model shows strong overdispersion (Pearson's chi-square/DF = 3.62), whereas fitting the data under a negative binomial distribution removes it (Pearson's chi-square/DF = 0.91). Table 5.8 shows the variance component estimates (part (a)) and the type III tests of fixed effects (part (b)).
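The entries of Table 5.7 can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that AIC is computed as -2 log likelihood plus twice the number of estimated parameters and that the Pearson ratio uses the 35 observations as its divisor. The parameter counts (8 and 9) are inferred from the model description rather than reported in the output, so they should be read as assumptions.

```python
# Values copied from Table 5.7. The parameter counts are an assumption:
# 7 treatment parameters + 1 tree variance, plus 1 scale for the negative binomial.
fits = {
    "Poisson": {"neg2loglik": 374.83, "n_params": 8, "pearson": 126.54},
    "Negative binomial": {"neg2loglik": 328.03, "n_params": 9, "pearson": 32.02},
}
n_obs = 35  # trees in the experiment

summary = {}
for name, fit in fits.items():
    aic = fit["neg2loglik"] + 2 * fit["n_params"]
    ratio = fit["pearson"] / n_obs
    summary[name] = (round(aic, 2), round(ratio, 2))
    print(f"{name:18s} AIC = {aic:.2f}  Pearson chi-square/DF = {ratio:.2f}")
```

Under these assumptions the recomputed AIC and Pearson ratios agree with the tabulated values for both models.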
The estimated variance component due to trees is $\hat\sigma^2_{tree} = 0.04288$, and the estimated scale parameter (Scale) is $\hat\phi = 0.06141$. The type III tests of fixed effects (part (b)) show that there is a highly significant effect of treatments on the average number of nuts (P < 0.0001). In Table 5.9, the values under the column “Estimate” are the estimates of the linear predictor $\hat\eta_i$ (the model scale), and the values under “Mean” are the means $\hat\lambda_i$ (the data scale), with their respective standard errors, obtained with the “lsmeans” statement and the “ilink” option. The results show that all the treatments applied in this experiment produced a higher average number of walnuts than the control treatment C. In general, formulation B applied to the walnut trees at the full-flowering stage gave the highest nut production.

Table 5.9 Estimates on the model scale (“Estimate”) and means on the data scale (“Mean”)

Trt least squares means
Trt   Estimate   Standard error   DF   t-value   Pr > |t|   Mean       Standard error mean
A1    4.0865     0.1560           24   26.19     <0.0001     59.5307    9.2890
A2    4.5624     0.1519           24   30.04     <0.0001     95.8162   14.5529
A3    4.5293     0.1519           24   29.82     <0.0001     92.6956   14.0783
B1    4.4349     0.1529           24   29.01     <0.0001     84.3417   12.8958
B2    4.7863     0.1504           24   31.82     <0.0001    119.86     18.0304
B3    4.7641     0.1504           24   31.67     <0.0001    117.23     17.6335
C     3.0499     0.1742           24   17.51     <0.0001     21.1140    3.6785

In the agricultural and biological sciences, there is often interest in conducting experiments that involve random effects (blocks, locations, etc.) and non-normal response variables. For example, suppose that a certain number of treatments are being tested at different locations selected at random from a sufficiently large number of locations. At each location, the experimental units are randomly assigned to the treatments. Let $y_{ij}$ be the number of (observed) individuals possessing the characteristic of interest in the ith treatment in the jth block.
The model for the mean structure of this experiment is

$$\eta_{ij} = \eta + \tau_i + b_j$$

where η is the intercept, $\tau_i$ is the fixed effect due to the ith treatment, and $b_j$ is the random effect of block j, with $b_j \sim N(0, \sigma^2_{block})$.

5.2.3 Example 3: Control of Weeds in Cereal Crops in an RCBD

One of the main problems when growing cereal crops is the competition between weeds and seedlings. A field supervisor is interested in testing five treatments plus a control for weed control in cereal crops, using a randomized complete block design with four blocks. Table 5.10 shows the number of weed plants observed in each plot ($y_{ij}$), with the treatment number in parentheses. Table 5.11 shows the sources of variation and the degrees of freedom of the randomized complete block design used in this experiment.

Table 5.10 Number of weeds in each treatment (the number in parentheses corresponds to the treatment number)

Block A    Block B    Block C    Block D
(1) 438    (3) 61     (5) 77     (2) 315
(4) 17     (2) 422    (3) 157    (1) 380
(2) 538    (6) 57     (4) 87     (5) 20
(5) 18     (1) 442    (6) 100    (3) 52
(3) 77     (5) 26     (2) 377    (4) 16
(6) 115    (4) 31     (1) 319    (6) 45

Table 5.11 Analysis of variance

Sources of variation   Degrees of freedom
Block                  b - 1 = 4 - 1 = 3
Treatment              t - 1 = 6 - 1 = 5
Error                  (t - 1)(b - 1) = 15
Total                  tb - 1 = 23

Since the response is a count, it is modeled using a GLMM with a Poisson response variable, whose components are stated below:

Distribution: $y_{ij} \mid b_j \sim \text{Poisson}(\lambda_{ij})$, $b_j \sim N(0, \sigma^2_{block})$
Linear predictor: $\eta_{ij} = \eta + \tau_i + b_j$
Link function: $\log\lambda_{ij} = \eta_{ij}$

where $y_{ij}$ denotes the number of weed plants observed in treatment i and block j ($i = 1, 2, \cdots, 6$; $j = 1, 2, 3, 4$), $\eta_{ij}$ is the linear predictor, η is the intercept, $\tau_i$ is the fixed effect due to treatment i, and $b_j$ is the random block effect, $b_j \sim N(0, \sigma^2_{block})$. Using the GLIMMIX procedure, the following syntax specifies the analysis of a GLMM with a Poisson response.
    proc glimmix nobound method=laplace;
      class Block Trt;
      model Count = Trt / dist=Poisson s;
      random block;
      lsmeans Trt / diff lines ilink;
    run; quit;

Note that in the above syntax, we use “method = laplace” (“method = quadrature” can also be used) to fit the mixed model and obtain the Pearson chi-square/DF fit statistic. If the method of integration is not specified, then a generalized chi-square/DF statistic is obtained instead. The options after the “lsmeans” statement are as follows: “diff” provides pairwise comparisons between treatments, “lines” presents the pairwise comparison of means using letters, and “ilink” provides the value of the inverse of the link function. Some of the outputs are listed below. Table 5.12 (a) presents the basic information about the model and the estimation procedure used. Part (b) of Table 5.12 lists the “Dimensions” of the relevant matrices used in the model. The random effects matrix Z has four columns due to blocks, and the fixed effects matrix X has one column for the intercept plus six columns due to treatments.

Table 5.12 Basic model information

(a) Model information
Dataset                      WORK.DBCA
Response variable            Count
Response distribution        Poisson
Link function                Log
Variance function            Default
Variance matrix              Not blocked
Estimation technique         Maximum likelihood
Likelihood approximation     Laplace
Degrees of freedom method    Containment

(b) Dimensions
G-side Cov. parameters       1
Columns in X                 7
Columns in Z                 4
Subjects (blocks in V)       1
Max Obs per subject          24

Table 5.13 Model fit statistics

(a) Fit statistics
-2 Log likelihood            434.46
AIC (smaller is better)      448.46
AICC (smaller is better)     455.46
BIC (smaller is better)      444.16
CAIC (smaller is better)     451.16
HQIC (smaller is better)     439.03

(b) Fit statistics for conditional distribution
-2 Log L (Count | r. effects)   418.66
Pearson's chi-square            283.09
Pearson's chi-square/DF          11.80

The “Fit statistics” and “Fit statistics for conditional distribution” (parts (a) and (b) of Table 5.13, respectively) show information about the fit of the GLMM. Pearson's chi-square statistic is the sum of the squared Pearson residuals of the fitted model; its ratio to the degrees of freedom measures the variability of the observations about the fitted mean. The value of Pearson's chi-square/DF for the conditional distribution is 11.8, well above 1. This value gives strong evidence of overdispersion in the dataset. In other words, it calls the assumed distribution and linear predictor into question, which means that the variance function was not adequately specified.

Table 5.14 Variance component estimates, parameter estimates, and type III tests of fixed effects

(a) Covariance parameter estimates
Cov Parm                             Estimate   Standard error
Block ($\hat\sigma^2_{block}$)       0.01840    0.01377

(b) Solutions for fixed effects
Effect     Trt  Parameter        Estimate   Standard error   DF   t-value   Pr > |t|
Intercept       $\hat\eta$        4.3637    0.08808           3    49.54    <0.0001
Trt        1    $\hat\tau_1$      1.6056    0.06155          15    26.09    <0.0001
Trt        2    $\hat\tau_2$      1.6508    0.06132          15    26.92    <0.0001
Trt        3    $\hat\tau_3$      0.09042   0.07769          15     1.16     0.2627
Trt        4    $\hat\tau_4$     -0.7416    0.09888          15    -7.50    <0.0001
Trt        5    $\hat\tau_5$     -0.8101    0.1012           15    -8.00    <0.0001
Trt        6    $\hat\tau_6$      0         .                 .      .        .

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Trt      5        15       523.57    <0.0001

Table 5.15 Estimated least squares means (“Mean”)

Trt least squares means
Trt   Estimate ($\hat\eta_i$)   Standard error   DF   t-value   Pr > |t|   Mean ($\hat\lambda_i$)   Standard error mean
1     5.9693                    0.07237          15   82.49     <0.0001    391.25                   28.3139
2     6.0145                    0.07217          15   83.33     <0.0001    409.34                   29.5437
3     4.4541                    0.08652          15   51.48     <0.0001     85.9802                  7.4390
4     3.6221                    0.1060           15   34.19     <0.0001     37.4150                  3.9643
5     3.5536                    0.1081           15   32.86     <0.0001     34.9372                  3.7784
6     4.3637                    0.08808          15   49.54     <0.0001     78.5467                  6.9186

The F-test of $H_0: \tau_1 = \tau_2 = \cdots = \tau_6$, or equivalently $H_0: \mu_1 = \mu_2 = \cdots = \mu_6$, indicates that the average number of weeds differs in at least one treatment, and the difference is highly significant (P < 0.0001; part (c) of Table 5.14). The estimates of the linear predictor on the model scale for each treatment ($\hat\eta_i$) and the corresponding means on the data scale ($\hat\lambda_i$), with their respective standard errors, are calculated as $\hat\eta_i = \hat\eta + \hat\tau_i$ and $\hat\lambda_i = \exp(\hat\eta_i)$, respectively. These values are listed in Table 5.15.

The “plots” option in the “proc GLIMMIX” statement creates a set of plots for the raw residuals, Pearson residuals, and studentized residuals. The panel consists of a plot of studentized residuals versus the linear predictor ($\hat\eta_i$), a histogram of the residuals with a normal density superimposed, a plot of residuals versus quantiles, and a box plot of the residuals. The panel of studentized residuals indicates the possibility of a slightly skewed distribution (Fig. 5.2). In this figure, we can see that the range of the residuals changes with the values of the linear predictor, indicating that the assumption of constant variance is no longer met. The residuals versus quantiles plot confirms the violation of constant variance. A nonconstant variance may also suggest an incorrect selection of the response distribution or variance function.

Fig. 5.2 Studentized conditional residuals
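The relationship between Tables 5.14 and 5.15 can be reproduced by hand. The Python sketch below adds each treatment effect to the intercept and applies the inverse link. The data-scale standard error uses the delta method (mean times the model-scale standard error), which is an assumption about how the reported values were obtained; it agrees with them to within rounding.

```python
import math

# Estimates copied from Tables 5.14 and 5.15.
intercept = 4.3637
trt_effects = [1.6056, 1.6508, 0.09042, -0.7416, -0.8101, 0.0]
se_model = [0.07237, 0.07217, 0.08652, 0.1060, 0.1081, 0.08808]

rows = []
for trt, (tau, se) in enumerate(zip(trt_effects, se_model), start=1):
    eta_i = intercept + tau   # model scale: eta + tau_i
    mean_i = math.exp(eta_i)  # data scale: inverse log link
    se_mean = mean_i * se     # delta-method approximation (assumption)
    rows.append((trt, eta_i, mean_i, se_mean))
    print(f"Trt {trt}: eta = {eta_i:.4f}  mean = {mean_i:.2f}  SE(mean) = {se_mean:.2f}")
```

Small differences in the last digits are expected because the tabulated inputs are already rounded.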
5.2.4 Overdispersion in Poisson Data

Linear mixed models assume that the observations have a normal distribution conditional on the random effects. In addition, the mean μ is independent of the variance σ², whereas, in most GLMMs that assume a binomial or Poisson distribution, the variance is a function of the mean and the dispersion parameter is set to 1. That is, if the mean is known, then we assume that the variance is also known. Overdispersion is the extra variability not predicted by the random component of a generalized linear model. It can arise because the mean and variance of a GLM are related and depend on the same parameter, which is predicted through the set of predictors. If overdispersion is present in a dataset, then the estimated standard errors and the test statistics of the overall goodness of fit will be distorted, and adjustments must be made. In other words, when there is overdispersion in a dataset, the standard errors of the estimated parameters are too small, which leads to test statistics for the model parameters that are too large (i.e., the type I error rate increases).

Overdispersion can be caused by several factors: omission of predictor variables from the model, high correlation among the observations due to nested effects, misspecification of the systematic component, or an incorrect choice of the distribution of the data. Systematic deviations or overdispersion may be the result of incorrect assumptions about the stochastic and/or systematic component of the model. The model may also fit the dataset poorly because of an incorrect choice of the link function. Systematic deviations may also result from missing random effects or from a lack of independence among the observations. Including the appropriate random effects generally addresses these violations and the problems associated with the systematic component. According to Stroup (2013), overdispersion occurs when the observed variance exceeds the theoretical variance under the assumed distribution of the data.
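A rough first check for overdispersion is to compare the Pearson statistic with its degrees of freedom. The Python sketch below does this for the weed counts of Table 5.10, using the treatment sample means as fitted values. This deliberately ignores the block effect, so it is a simplified fixed-effects illustration under that assumption and does not reproduce the GLMM statistics reported in the tables.

```python
# Weed counts per treatment, read from Table 5.10 (blocks A, B, C, D in order).
weeds = {
    1: [438, 442, 319, 380],
    2: [538, 422, 377, 315],
    3: [77, 61, 157, 52],
    4: [17, 31, 87, 16],
    5: [18, 26, 77, 20],
    6: [115, 57, 100, 45],
}

pearson = 0.0
for counts in weeds.values():
    mu = sum(counts) / len(counts)  # fitted mean, ignoring blocks (assumption)
    pearson += sum((y - mu) ** 2 / mu for y in counts)  # Poisson variance = mean

n_obs = sum(len(counts) for counts in weeds.values())
df = n_obs - len(weeds)  # observations minus treatment means
ratio = pearson / df
print(f"Pearson chi-square = {pearson:.2f}, DF = {df}, ratio = {ratio:.2f}")
```

A ratio far above 1, as obtained here, signals that the counts vary much more than a Poisson model with these means would allow.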
Overdispersion is theoretically possible for any distribution with a nontrivial variance function. Distributions belonging to the one-parameter exponential family are especially exposed because they lack a scale parameter to relax the mean–variance relationship; therefore, models such as the Poisson are vulnerable to overdispersion. In summary, overdispersion occurs when:

(a) The variance is larger than expected, which leads to standard errors that are not correct.
(b) The mean structure is not well specified.
(c) The linear predictor η is not well specified.
(d) The chosen distribution of the data is not appropriate.
(e) Predictor variables are omitted.
(f) Observations are significantly correlated.

If we do not account for overdispersion, we underestimate the standard errors and inflate the test statistics, which inflates the type I error rate and makes the confidence intervals unreliable. Fig. 5.3 shows that as the predicted mean μ increases, the residuals spread more widely, indicating that the variance may increase as a function of the mean, whereas Fig. 5.4 shows a nonconstant variance. In the fit statistics obtained under the GLMM with the Poisson distribution (part (b) of Table 5.13), the value of Pearson's chi-square/DF (11.8) indicates that there is strong overdispersion in the dataset. Another aspect provided by the output is the value of the F-statistic (F = 523.57) tabulated in part (c) of Table 5.14. A value this large may indicate that the fit is incorrect. Once the researcher has detected overdispersion, he or she must decide which strategy to use to remedy it. There are three possible alternatives to evaluate and eliminate overdispersion, which are reviewed below.

Fig. 5.3 Conditional residuals versus predicted values on the data scale

Fig. 5.4 Residuals on the model scale

5.2.4.1 Using the Scale Parameter

The first alternative is to add a scale parameter and replace $\mathrm{Var}(y_{ij} \mid b_j) = \lambda_{ij}$ by $\mathrm{Var}(y_{ij} \mid b_j) = \phi\lambda_{ij}$. This consists of replacing the logarithm of the conditional likelihood, $y_{ij}\log(\lambda_{ij}) - \lambda_{ij} - \log(y_{ij}!)$, by the quasi-likelihood $[y_{ij}\log(\lambda_{ij}) - \lambda_{ij}]/\phi$, on the assumption that some ϕ > 1 can adequately model the observed variance. The following GLIMMIX syntax adds a scale parameter to a model with a Poisson response variable.

    proc glimmix;
      class Block Trt;
      model Count = Trt / dist=Poisson;
      random intercept / subject=block;
      random _residual_;
      lsmeans Trt / ilink;
    run;

The SAS code is very similar to that used previously, with the addition of the “random _residual_” statement. Note that the Laplace integration method (“method = laplace”) has been removed, so the estimation is performed using the pseudo-likelihood (PL) method; the scale parameter is estimated and used to adjust the standard errors and test statistics. The GLIMMIX procedure uses the generalized chi-square divided by its degrees of freedom, Gener. chi-square/DF $= \hat\phi$, as the estimate of the scale parameter. All standard errors are multiplied by $\sqrt{\hat\phi}$, and all F-test values are divided by $\hat\phi$. Table 5.16 shows part of the results.

In Table 5.16, we observe the fit statistics (part (a)), the covariance parameter estimates (part (b)), and the value of the scale parameter, which is equal to $\hat\phi = 19.4848$ (Residual (VC)). The value of the F-statistic in this analysis is 26.87 (part (c)); this value is obtained by dividing the F-value from the previous analysis by the scale parameter, $523.57/\hat\phi$. The results indicate that overdispersion remains even under this adjustment: the dispersion statistic increases from 11.8 to 19.4848 (part (a)).
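The adjustment of the F-statistic is simple arithmetic and can be verified directly. In the Python sketch below, the residual degrees of freedom are taken as the number of observations minus the number of treatment means (24 - 6 = 18); this is an inference from the reported ratio, not a value shown in the output, and the variable names are illustrative.

```python
f_poisson = 523.57  # type III F-value from the Poisson GLMM (Table 5.14)
gen_chisq = 350.73  # generalized chi-square (Table 5.16)
phi_reported = 19.4848  # Residual (VC) estimate (Table 5.16)

n_obs, n_trt = 24, 6
df_resid = n_obs - n_trt  # assumption: 18 residual degrees of freedom

phi_hat = gen_chisq / df_resid          # scale estimate = Gener. chi-square / DF
f_adjusted = f_poisson / phi_reported   # F-value after the scale adjustment

print(f"phi_hat = {phi_hat:.4f}")
print(f"adjusted F = {f_adjusted:.2f}")
```

The recomputed scale estimate matches the Residual (VC) entry to three decimals, and the adjusted F-value matches part (c) of Table 5.16.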
The inclusion of the scale parameter affects the estimate of the variance due to blocks, $\hat\sigma^2_{block}$, as well as the estimates of the treatment means (part (d)), but the main impact is on the standard errors. The inclusion of the scale parameter implies a quasi-likelihood, meaning that the model has no true likelihood; that is, there is no probability distribution with expected value λ and variance ϕλ from which the counts could have been generated.

Table 5.16 Results of the adjustment by adding the scale parameter

(a) Fit statistics
-2 Res log pseudo-likelihood    29.48
Generalized chi-square         350.73
Gener. chi-square/DF            19.48

(b) Covariance parameter estimates
Cov Parm                            Subject   Estimate   Standard error
Intercept                           block     0.004981   0.02077
Residual (variance component, VC)             19.4848    7.1346

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Trt      5        15       26.87     <0.0001

(d) Trt least squares means
Trt   Estimate   Standard error   DF   t-value   Pr > |t|   Mean      Standard error mean
1     5.9779     0.1166           15   51.29     <0.0001    394.60    45.9935
2     6.0231     0.1142           15   52.74     <0.0001    412.84    47.1443
3     4.4626     0.2396           15   18.63     <0.0001     86.7160  20.7753
4     3.6306     0.3609           15   10.06     <0.0001     37.7352  13.6205
5     3.5621     0.3734           15    9.54     <0.0001     35.2362  13.1576
6     4.3722     0.2504           15   17.46     <0.0001     79.2189  19.8383

5.2.4.2 Linear Predictor Review

With count and binomial response variables, it is important to check whether the linear predictor is correctly specified, in particular whether $\lambda_{ij}$ is randomly affected by the experimental units within blocks. If it is, the ANOVA table should include the block × treatment source of variation, and this effect must be specified in the linear predictor of the GLMM.
Thus, the linear predictor is specified as $\eta_{ij} = \eta + \tau_i + b_j + (b\tau)_{ij}$, and the components of the GLMM are:

Distribution: $y_{ij} \mid b_j, (b\tau)_{ij} \sim \text{Poisson}(\lambda_{ij})$, $b_j \sim N(0, \sigma^2_{block})$, $(b\tau)_{ij} \sim N(0, \sigma^2_{block \times \tau})$
Linear predictor: $\eta_{ij} = \eta + \tau_i + b_j + (b\tau)_{ij}$
Link function: $\log\lambda_{ij} = \eta_{ij}$

The following GLIMMIX program fits the above model:

    proc glimmix method=laplace;
      class Block Trt;
      model Count = Trt / dist=Poisson;
      random intercept Trt / subject=block;
      lsmeans Trt / ilink;
    run;

Table 5.17 Results of the fit by redefining the predictor of the model

(a) Fit statistics for conditional distribution
-2 Log L (Count | r. effects)   156.38
Pearson's chi-square              2.58
Pearson's chi-square/DF           0.11

(b) Covariance parameter estimates
Cov Parm    Subject   Estimate   Standard error
Intercept   block     0.05969    0.05758
Trt         block     0.1152     0.04115

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Trt      5        15       41.48     <0.0001

(d) Trt least squares means
Trt   Estimate   Standard error   DF   t-value   Pr > |t|   Mean      Standard error mean
1     5.9692     0.2106           15   28.34     <0.0001    391.20    82.3947
2     6.0037     0.2106           15   28.51     <0.0001    404.92    85.2707
3     4.3674     0.2170           15   20.12     <0.0001     78.8402  17.1100
4     3.4255     0.2301           15   14.89     <0.0001     30.7370   7.0714
5     3.4005     0.2298           15   14.80     <0.0001     29.9786   6.8891
6     4.3027     0.2175           15   19.79     <0.0001     73.8997  16.0707

Part of the output is shown in Table 5.17. The results tabulated in part (a) indicate that the overdispersion has been eliminated (Pearson's chi-square/DF = 0.11); however, a value this far below 1 carries a risk of underestimating the variance, which is why a value close to 1 is recommended. The estimated variance components (part (b)) for blocks and block × treatment are $\hat\sigma^2_{block} = 0.05969$ and $\hat\sigma^2_{block \times Trt} = 0.1152$, respectively. The type III test of fixed effects is highly significant (P < 0.0001), indicating that the six treatments are not equally effective in weed control (part (c)).
The values in part (d) under the “Mean” column are the treatment means on the original scale of the data, with their respective standard errors. Compared with the previous fit (using the scale parameter), the means do not vary much, but the standard errors change markedly.

5.2.4.3 Using a Different Distribution

Another way to deal with overdispersion under a Poisson distribution is to change the assumed distribution of the response variable. Poisson variables have equal mean and variance, but, in the biological sciences, this assumption does not always hold for count variables. A negative binomial distribution is a good alternative (see Example 5.2), as previously discussed. A negative binomial variable has mean $\lambda > 0$ and variance $\lambda + \phi\lambda^2$, with $\phi > 0$; that is, $E(y) = \lambda$ and $Var(y) = \lambda + \phi\lambda^2$, where $\phi$ is the scale parameter. The components of this model are as follows. Given that $y_{ij} \mid b_j \sim \text{Poisson}(\lambda_{ij})$, it is assumed that $\lambda_{ij} \sim \text{Gamma}(1/\phi, \phi)$, with $\phi$ as the scale parameter and $b_j \sim N(0, \sigma^2_{block})$. The resulting GLMM is specified as:

Distribution: $y_{ij} \mid b_j \sim \text{Negative Binomial}(\lambda_{ij}, \phi)$, $b_j \sim N(0, \sigma^2_{block})$
Linear predictor: $\eta_{ij} = \eta + \tau_i + b_j$
Link function: $\log \lambda_{ij} = \eta_{ij}$

The following GLIMMIX statements fit the model with a negative binomial distribution:

proc glimmix method=laplace;
  class block Trt;
  model count = Trt / dist=NegBin;
  random block;
  lsmeans Trt / ilink;
run;

Some of the most relevant outputs from GLIMMIX are presented in Table 5.18. The value of Pearson’s chi-square/DF, 0.88 (part (a)), shows that the overdispersion in the dataset has been removed. The estimated scale parameter tabulated in part (b) (Scale) is $\hat{\phi} = 0.1080$.
This value is not the same as the scale parameter estimated under the Poisson model with the “random _residual_” command, since the two models compute it differently. However, as mentioned above, both scale parameters govern the relationship between the mean and the variance in their respective models. The value of the test statistic for the treatment effect under the negative binomial distribution (part (c) of Table 5.18) is very similar to the value obtained under the Poisson distribution when the block × treatment interaction was added to the linear predictor. The values under “Estimate” are estimates of the linear predictor on the model scale (part (d)), whereas those under the “Mean” column are the treatment means on the data scale, using the negative binomial distribution. Of the three alternatives proposed to fit these data, the last two (including the block × treatment interaction in the predictor and assuming a negative binomial distribution) provide a better fit.

Table 5.18 Fitting results by redefining the model structure

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (Count \| r. effects) | 235.11 |
| Pearson’s chi-square | 21.13 |
| Pearson’s chi-square/DF | 0.88 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Block | 0.07713 | 0.06955 |
| Scale | 0.1080 | 0.03768 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 5 | 15 | 41.11 | <0.0001 |

(d) Trt least squares means

| Trt | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 6.0280 | 0.2179 | 15 | 27.66 | <0.0001 | 414.90 | 90.4085 |
| 2 | 6.0465 | 0.2174 | 15 | 27.81 | <0.0001 | 422.64 | 91.8789 |
| 3 | 4.3941 | 0.2228 | 15 | 19.72 | <0.0001 | 80.9704 | 18.0426 |
| 4 | 3.5190 | 0.2335 | 15 | 15.07 | <0.0001 | 33.7516 | 7.8815 |
| 5 | 3.4684 | 0.2338 | 15 | 14.83 | <0.0001 | 32.0863 | 7.5030 |
| 6 | 4.3439 | 0.2235 | 15 | 19.44 | <0.0001 | 77.0085 | 17.2111 |

5.2.5 Factorial Designs

Many experiments involve studying the effects of two or more factors.
Factorial designs are the most efficient for these types of experiments. In a factorial design, all possible combinations of factor levels are investigated in each replicate. If there are a levels of factor A and b levels of factor B, then each replicate contains all ab treatment combinations.

5.2.5.1 Example: A 2 × 4 Factorial with a Poisson Response

This application refers to a factorial experiment involving explants from cotyledons of cucumber (Cucumis sativus L.) with two factors, i.e., genotype (two levels) and culture medium (four levels). Each of the eight combinations of genotype and culture medium was applied to four Petri dishes, each containing six leaf explants. The response variable was the number of buds in each of the leaf explants, i.e., the response variable was a count. There are two sources of variation in this application, namely, variation between Petri dishes and variation between the explants within the Petri dishes (Table 5.19). The sources of variation and degrees of freedom for this experiment are shown in Table 5.20.
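The degrees-of-freedom bookkeeping for this layout can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original SAS analysis: the variable names are made up here, and the Petri.dish × Explant entry follows the arithmetic tabulated in Table 5.20 (4 × 6 - 1), with the error term obtained by difference.

```python
# Degrees of freedom for the 2 x 4 factorial with Petri dishes and explants.
# Names are illustrative; the counts come from the description of the experiment.
a, b = 2, 4   # genotypes, culture media
c, r = 4, 6   # Petri dishes per combination, explants per dish

n_obs = a * b * c * r  # total number of explants observed

df = {
    "Genotype": a - 1,
    "Culture": b - 1,
    "Genotype x culture": (a - 1) * (b - 1),
    "Petri.dish x Explant": c * r - 1,  # as tabulated: 4 x 6 - 1
}
df["Total"] = n_obs - 1
# The error term is obtained by difference, as in Table 5.20.
df["Error"] = df["Total"] - sum(v for k, v in df.items() if k != "Total")

for source, value in df.items():
    print(f"{source:22s} {value}")
```

The error degrees of freedom obtained this way are the denominator degrees of freedom that appear in the type III tests of Table 5.21.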
The components that define this model are shown below:

Distribution: $y_{ijkl} \mid \text{Petri.dish}_k, \text{explant(Petri.dish)}_{l(k)} \sim \text{Poisson}(\lambda_{ijkl})$, $\text{Petri.dish}_k \sim N(0, \sigma^2_{Petri.dish})$, $\text{explant(Petri.dish)}_{l(k)} \sim N(0, \sigma^2_{explant(Petri.dish)})$
Linear predictor: $\eta_{ijkl} = \eta + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{Petri.dish}_k + \text{explant(Petri.dish)}_{l(k)}$
Link function: $\log \lambda_{ijkl} = \eta_{ijkl}$

where $\eta_{ijkl}$ is the linear predictor for genotype i (i = 1, 2), culture medium j (j = 1, 2, 3, 4), Petri dish k (k = 1, 2, 3, 4), and explant l (l = 1, 2, 3, 4, 5, 6); $\eta$ is the intercept; $\alpha_i$ is the fixed effect due to genotype i; $\beta_j$ is the fixed effect due to culture medium j; $(\alpha\beta)_{ij}$ is the effect of the interaction between genotype i and culture medium j; $\text{Petri.dish}_k$ is the random effect of the Petri dish; and $\text{explant(Petri.dish)}_{l(k)}$ is the random effect of the explant within the Petri dish, assuming $\text{Petri.dish}_k \sim N(0, \sigma^2_{Petri.dish})$ and $\text{explant(Petri.dish)}_{l(k)} \sim N(0, \sigma^2_{explant(Petri.dish)})$.

The following GLIMMIX procedure fits a factorial experiment with a Poisson response:

proc glimmix method=laplace;
  class genotype culture petri_dish explant;
  model y = genotype|culture / dist=Poisson;
  random petri_dish explant(petri_dish);
  lsmeans genotype|culture / ilink lines;
run;

Some of the SAS output is shown in Table 5.21. Part (a) lists the fit statistics for this dataset. Note that “method = laplace” was used for the estimation process and to obtain Pearson’s fit statistic $\chi^2/DF$. The result indicates that there is evidence of overdispersion (Pearson’s chi-square/DF = 1.84). Overdispersion, as discussed before, implies more variability in the data than would be expected under the model, potentially explaining the lack of fit of a Poisson model. Part (b) shows the variance component estimates: $\hat{\sigma}^2_{Petri.dish} = 0.003616$ for Petri dishes and $\hat{\sigma}^2_{explant(Petri.dish)} = 0.01462$ for the explants within Petri dishes.
However, the type III tests of fixed effects indicate statistically significant effects of genotype, culture medium, and the interaction of both factors (part (c)). The plot of residuals against the linear predictor in Fig. 5.5 provides further evidence of possible overdispersion. The least squares means on the model scale for genotype (part (a)), culture medium (part (b)), and the interaction between both factors (part (c)) are listed under the “Estimate” column of Table 5.22, whereas the “Mean” column gives the corresponding means on the data scale.

Table 5.19 Number of buds counted in the cucumber experiment

| Genotype | Culture | Petri dish | Explant 1 | Explant 2 | Explant 3 | Explant 4 | Explant 5 | Explant 6 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 10 | 7 | 5 | 7 | 12 | 1 |
| 1 | 1 | 2 | 6 | 16 | 20 | 5 | 16 | 13 |
| 1 | 1 | 3 | 10 | 12 | 13 | 0 | 12 | 15 |
| 1 | 1 | 4 | 10 | 12 | 2 | 8 | 15 | 2 |
| 1 | 2 | 1 | 20 | 16 | 14 | 18 | 17 | 20 |
| 1 | 2 | 2 | 12 | 8 | 18 | 20 | 20 | 17 |
| 1 | 2 | 3 | 22 | 13 | 24 | 15 | 10 | 14 |
| 1 | 2 | 4 | 11 | 12 | 18 | 19 | 14 | 18 |
| 1 | 3 | 1 | 20 | 18 | 15 | 18 | 20 | 18 |
| 1 | 3 | 2 | 5 | 20 | 0 | 18 | 5 | 0 |
| 1 | 3 | 3 | 17 | 10 | 20 | 12 | 14 | 21 |
| 1 | 3 | 4 | 4 | 8 | 5 | 12 | 10 | 15 |
| 1 | 4 | 1 | 10 | 4 | 0 | 5 | 4 | 8 |
| 1 | 4 | 2 | 5 | 5 | 6 | 2 | 3 | 1 |
| 1 | 4 | 3 | 7 | 5 | 10 | 2 | 10 | 0 |
| 1 | 4 | 4 | 9 | 5 | 8 | 4 | 4 | 7 |
| 2 | 1 | 1 | 16 | 9 | 10 | 11 | 9 | 12 |
| 2 | 1 | 2 | 13 | 7 | 3 | 2 | 3 | 12 |
| 2 | 1 | 3 | 14 | 6 | 9 | 9 | 15 | 18 |
| 2 | 1 | 4 | 13 | 0 | 3 | 7 | 5 | 2 |
| 2 | 2 | 1 | 8 | 9 | 10 | 9 | 15 | 9 |
| 2 | 2 | 2 | 11 | 12 | 8 | 9 | 12 | 10 |
| 2 | 2 | 3 | 15 | 6 | 9 | 9 | 16 | 16 |
| 2 | 2 | 4 | 18 | 12 | 6 | 0 | 13 | 14 |
| 2 | 3 | 1 | 8 | 12 | 6 | 4 | 6 | 11 |
| 2 | 3 | 2 | 10 | 10 | 12 | 12 | 15 | 10 |
| 2 | 3 | 3 | 10 | 10 | 17 | 10 | 14 | 12 |
| 2 | 3 | 4 | 10 | 14 | 14 | 14 | 9 | 14 |
| 2 | 4 | 1 | 9 | 5 | 1 | 9 | 15 | 9 |
| 2 | 4 | 2 | 2 | 6 | 2 | 3 | 7 | 0 |
| 2 | 4 | 3 | 4 | 12 | 2 | 0 | 4 | 3 |
| 2 | 4 | 4 | 6 | 1 | 9 | 3 | 5 | 8 |

Table 5.20 Sources of variation and degrees of freedom

| Sources of variation | Degrees of freedom |
|---|---|
| Genotype | a - 1 = 2 - 1 = 1 |
| Culture | b - 1 = 4 - 1 = 3 |
| Genotype × culture | (a - 1)(b - 1) = 3 |
| Petri.dish × Explant | cr - 1 = 4 × 6 - 1 = 23 |
| Error | (by difference) = 161 |
| Total | abcr - 1 = 191 |

Table 5.21 Conditional fit statistics, variance component estimates, and type III tests of fixed effects under the Poisson distribution

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (y \| r. effects) | 1168.16 |
| Pearson’s chi-square | 354.01 |
| Pearson’s chi-square/DF | 1.84 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Petri.dish | 0.003616 | 0.006014 |
| Explant (Petri.dish) | 0.01462 | 0.008798 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Genotype | 1 | 161 | 11.01 | 0.0011 |
| Culture | 3 | 161 | 57.30 | <0.0001 |
| Genotype*culture | 3 | 161 | 3.95 | 0.0095 |

Since there is overdispersion in the data, we will fit the GLMM again using the negative binomial distribution, that is, under the following GLMM:

Distribution: $y_{ijkl} \mid \text{Petri.dish}_k, \text{explant(Petri.dish)}_{l(k)} \sim \text{Negative Binomial}(\lambda_{ijkl}, \phi)$, $\text{Petri.dish}_k \sim N(0, \sigma^2_{Petri.dish})$, $\text{explant(Petri.dish)}_{l(k)} \sim N(0, \sigma^2_{explant(Petri.dish)})$
Linear predictor: $\eta_{ijkl} = \eta + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{Petri.dish}_k + \text{explant(Petri.dish)}_{l(k)}$
Link function: $\log \lambda_{ijkl} = \eta_{ijkl}$, with scale parameter $\phi$.

The following GLIMMIX program fits a GLMM with a negative binomial response variable:

proc glimmix method=laplace;
  class genotype culture petri_dish explant;
  model y = genotype|culture / dist=NegBin link=log;
  random petri_dish explant(petri_dish);
  lsmeans genotype|culture / ilink lines;
run;

This program is very similar to the previous one; the only difference is that a negative binomial distribution is now used (“dist = negbin”). Part of the results is presented in Table 5.23. As already mentioned, the negative binomial distribution is another model for count variables when there is overdispersion in the dataset. If Pearson’s chi-square divided by its degrees of freedom is less than or equal to 1, there is little or no remaining overdispersion, which means that the model adequately captures the extra variability in the data.

Fig. 5.5 Studentized conditional residuals

Based on the conditional distribution, Pearson’s fit statistic ($\chi^2/DF = 0.83$) gives no evidence of overdispersion, which justifies the negative binomial distribution over the Poisson distribution fitted above. Part (b) shows that the estimated scale parameter is $\hat{\phi} = 0.1712$. This value is not the same as the parameter of the quasi-Poisson model obtained with the “random _residual_” command. Note that the variance component estimates were also affected. Additionally, part (c) of Table 5.23 shows the type III tests for the fixed effects of the model, where a significant effect of genotype, culture, and the interaction between both factors (genotype*culture) on the number of buds in the leaf explant can be observed. The “lines” option in the “lsmeans” command is used to obtain mean comparisons based on Fisher’s least significant difference (LSD) for both factors and their interaction. The means and their respective standard errors, on the model scale (“Estimate” column) and on the data scale (“Mean” column), are tabulated in Table 5.24 for genotype, in Table 5.25 for culture medium, and in Table 5.26 for the interaction between both factors. For genotype (Table 5.24), the values under “Estimate” are the values of the linear predictor $\hat{\eta}_i$ on the model scale, whereas those under “Mean” are the means $\hat{\lambda}_i$ on the data scale (part (a)); the comparison of means, on the model scale, is tabulated in part (b).
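The relationship between the “Estimate” and “Mean” columns can be illustrated with a short Python sketch. It assumes the log link used throughout this section, so that the mean is the exponential of the estimate, and it approximates the standard error on the data scale with the delta method (mean × standard error of the estimate). The function name is hypothetical, and the inputs are the genotype estimates reported in Table 5.24.

```python
import math

def to_data_scale(estimate, std_err):
    """Back-transform a log-link estimate; delta-method standard error."""
    mean = math.exp(estimate)      # inverse of the log link
    return mean, mean * std_err    # d/d(eta) exp(eta) = exp(eta)

# Genotype least squares means on the model scale (Table 5.24, part (a)).
genotype_lsmeans = {1: (2.3054, 0.05407), 2: (2.1426, 0.05535)}

data_scale = {g: to_data_scale(est, se) for g, (est, se) in genotype_lsmeans.items()}
for g, (mean, se_mean) in data_scale.items():
    print(f"genotype {g}: mean = {mean:.2f}, standard error = {se_mean:.2f}")
```

Up to rounding of the tabulated estimates, the back-transformed values agree with the “Mean” and “Standard error mean” columns of Table 5.24.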
Table 5.22 Estimates on the model scale and means on the data scale under the Poisson distribution

(a) Genotype least squares means

| Genotype | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 2.2979 | 0.05165 | 161 | 44.49 | <0.0001 | 9.9533 | 0.5141 |
| 2 | 2.1345 | 0.05298 | 161 | 40.29 | <0.0001 | 8.4531 | 0.4479 |

(b) Culture least squares means

| Culture | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 2.1984 | 0.06180 | 161 | 35.57 | <0.0001 | 9.0107 | 0.5569 |
| 2 | 2.5684 | 0.05607 | 161 | 45.81 | <0.0001 | 13.0456 | 0.7314 |
| 3 | 2.4609 | 0.05738 | 161 | 42.89 | <0.0001 | 11.7156 | 0.6723 |
| 4 | 1.6371 | 0.07445 | 161 | 21.99 | <0.0001 | 5.1402 | 0.3827 |

(c) Genotype*culture least squares means

| Genotype | Culture | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2.2465 | 0.07676 | 161 | 29.26 | <0.0001 | 9.4547 | 0.7258 |
| 1 | 2 | 2.7789 | 0.06395 | 161 | 43.45 | <0.0001 | 16.1018 | 1.0298 |
| 1 | 3 | 2.5331 | 0.06932 | 161 | 36.54 | <0.0001 | 12.5925 | 0.8729 |
| 1 | 4 | 1.6331 | 0.09793 | 161 | 16.68 | <0.0001 | 5.1196 | 0.5014 |
| 2 | 1 | 2.1503 | 0.07958 | 161 | 27.02 | <0.0001 | 8.5877 | 0.6834 |
| 2 | 2 | 2.3580 | 0.07370 | 161 | 31.99 | <0.0001 | 10.5694 | 0.7790 |
| 2 | 3 | 2.3887 | 0.07290 | 161 | 32.77 | <0.0001 | 10.8997 | 0.7945 |
| 2 | 4 | 1.6411 | 0.09760 | 161 | 16.81 | <0.0001 | 5.1609 | 0.5037 |

Table 5.23 Conditional fit statistics, variance component estimates, and type III tests of fixed effects under the negative binomial distribution

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (y \| r. effects) | 1143.90 |
| Pearson’s chi-square | 159.95 |
| Pearson’s chi-square/DF | 0.83 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Petri.dish | -0.02717 | . |
| Explant (Petri.dish) | -0.04323 | . |
| Scale | 0.1712 | 0.03514 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Genotype | 1 | 161 | 4.43 | 0.0369 |
| Culture | 3 | 161 | 25.91 | <0.0001 |
| Genotype*culture | 3 | 161 | 1.44 | 0.0322 |

Table 5.24 Estimates on the model scale and means on the data scale under the negative binomial distribution

(a) Genotype least squares means

| Genotype | Estimate ($\eta_i$) | Standard error | DF | t-value | Pr > \|t\| | Mean ($\lambda_i$) | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 2.3054 | 0.05407 | 161 | 42.64 | <0.0001 | 10.0287 | 0.5423 |
| 2 | 2.1426 | 0.05535 | 161 | 38.71 | <0.0001 | 8.5219 | 0.4717 |

(b) T grouping of genotype least squares means (α = 0.05). LS means with the same letter are not significantly different

| Genotype | Estimate | Group |
|---|---|---|
| 1 | 2.3054 | A |
| 2 | 2.1426 | B |

Table 5.25 Means estimates on the model scale and data scale for the culture medium

(a) Culture least squares means

| Culture | Estimate ($\eta_j$) | Standard error | DF | t-value | Pr > \|t\| | Mean ($\lambda_j$) | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 2.2061 | 0.07653 | 161 | 28.82 | <0.0001 | 9.0802 | 0.6950 |
| 2 | 2.5766 | 0.07198 | 161 | 35.80 | <0.0001 | 13.1527 | 0.9468 |
| 3 | 2.4684 | 0.07300 | 161 | 33.81 | <0.0001 | 11.8031 | 0.8617 |
| 4 | 1.6451 | 0.08708 | 161 | 18.89 | <0.0001 | 5.1815 | 0.4512 |

(b) T grouping of culture least squares means (α = 0.05). LS means with the same letter are not significantly different

| Culture | Estimate | Group |
|---|---|---|
| 2 | 2.5766 | A |
| 3 | 2.4684 | A |
| 1 | 2.2061 | B |
| 4 | 1.6451 | C |

For the culture medium (Table 5.25), the estimated values correspond to the values of the linear predictor $\hat{\eta}_j$ (on the model scale); applying the inverse link to $\hat{\eta}_j$ gives the values under the “Mean” column, which are the means on the data scale (part (a)). The mean comparisons on the model scale are shown in part (b). The results indicate that culture media 2 and 3 produced statistically similar average numbers of buds, both higher than those obtained in culture media 1 and 4 (see Fig. 5.6). For the interaction between both factors, the average number of buds and the mean comparisons are shown in Table 5.26.
Table 5.26 Estimates on the model scale and means on the data scale for the interaction between genotype and culture medium

Genotype*culture least squares means

| Genotype | Culture | Estimate ($\eta_{ij}$) | Standard error | DF | t-value | Pr > \|t\| | Mean ($\lambda_{ij}$) | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2.2540 | 0.1072 | 161 | 21.02 | <0.0001 | 9.5255 | 1.0212 |
| 1 | 2 | 2.7869 | 0.09844 | 161 | 28.31 | <0.0001 | 16.2310 | 1.5978 |
| 1 | 3 | 2.5401 | 0.1020 | 161 | 24.91 | <0.0001 | 12.6805 | 1.2933 |
| 1 | 4 | 1.6408 | 0.1233 | 161 | 13.31 | <0.0001 | 5.1595 | 0.6360 |
| 2 | 1 | 2.1582 | 0.1093 | 161 | 19.75 | <0.0001 | 8.6558 | 0.9457 |
| 2 | 2 | 2.3663 | 0.1050 | 161 | 22.53 | <0.0001 | 10.6582 | 1.1196 |
| 2 | 3 | 2.3967 | 0.1045 | 161 | 22.94 | <0.0001 | 10.9865 | 1.1478 |
| 2 | 4 | 1.6493 | 0.1230 | 161 | 13.41 | <0.0001 | 5.2036 | 0.6401 |

Fig. 5.6 Comparison of the average number of buds as a function of the type of culture medium (LSD, α = 0.05)

The values under “Estimate” (Table 5.26) correspond to those of the linear predictor $\hat{\eta}_{ij}$ (model scale), whereas the values under “Mean” correspond to the means $\hat{\lambda}_{ij}$ on the data scale. Graphically, Fig. 5.7 shows that genotype 1 in culture medium 2 provides the highest number of buds, whereas the lowest number of buds was observed in culture medium 4. For genotype 2, the highest number of buds was observed in culture media 2 and 3. Finally, culture medium 4 is the least suitable for both genotypes.

Fig. 5.7 Effect of the cultivar × culture medium interaction on the average number of buds, by genotype (LSD, α = 0.05)

5.2.6 Latin Square (LS) Design

A Latin square (LS) is used where heterogeneity is associated with the crossing of two factors, generally, both with the same number of levels. This design was originally used in agricultural experimentation with plots placed in a square arrangement, with expected heterogeneity along the rows and columns of the square.
This design blocks in both directions, across rows and columns. Whenever blocking in two directions is appropriate, an LS design is a good option. Some examples that illustrate the use of this experimental design are provided below:
• Field experiments on plots set in a square arrangement, with rows and columns that contribute to the heterogeneity between plots, for example, through gradients of fertility, moisture, or management practices.
• Experiments in greenhouses, rooms with a controlled environment, or growth chambers where the placement of shelves, trays, etc. with respect to walls or light sources can introduce systematic variability related to temperature, humidity, or light in different directions (e.g., left to right, back to front, or top to bottom).
• Laboratory experiments in which there are two potential sources of variability (e.g., technicians, machines, etc.) and researchers are aware of the possible impact of variation from both sources.
For an LS layout, the number of rows (r) and columns (c) should be equal to the number of treatments (t) and to the number of replicates of each treatment. Treatments are assigned so that each treatment appears exactly once in each row and column, with each row and column containing a full set of treatments. Thus, the treatment effect estimates are independent of the differences between rows or columns, and the rows, columns, and treatments are orthogonal to each other. The analysis of variance for this experimental design, assuming that there are r rows, c columns, and t treatments, with r = c = t, contains the following sources of variability (Table 5.27).

Table 5.27 Sources of variation and degrees of freedom of a Latin square design

| Sources of variation | Degrees of freedom |
|---|---|
| Rows | t - 1 |
| Columns | t - 1 |
| Treatments | t - 1 |
| Error | (t - 1)(t - 2) |
| Total | t × t - 1 |
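The defining property of a Latin square and the degrees of freedom in Table 5.27 can be checked with a small Python sketch. It is only an illustration under stated assumptions: the cyclic construction and the function names are choices made here, not the authors' procedure, and in practice rows, columns, and treatment labels would also be randomized.

```python
def cyclic_latin_square(treatments):
    """Build a t x t square by shifting the treatment list one position per row."""
    t = len(treatments)
    return [[treatments[(row + col) % t] for col in range(t)] for row in range(t)]

def is_latin_square(square):
    """Check that every treatment appears exactly once in each row and column."""
    t = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == t and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(t)} == symbols for c in range(t))
    return len(symbols) == t and rows_ok and cols_ok

square = cyclic_latin_square(["A", "B", "C", "D"])
for row in square:
    print(" ".join(row))

# Degrees of freedom as in Table 5.27.
t = len(square)
df = {
    "Rows": t - 1,
    "Columns": t - 1,
    "Treatments": t - 1,
    "Error": (t - 1) * (t - 2),
    "Total": t * t - 1,
}
print(df)
```

The first four entries of `df` add up to the total, which is a quick way to confirm that the partition of the t × t - 1 degrees of freedom is complete.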
From the analysis of variance table, the linear model for an LS design with t treatments is as follows:

$$y_{ijk} = \mu + f_j + c_k + \tau_i + \varepsilon_{ijk}$$

where $y_{ijk}$ is the response observed for treatment i in row j and column k, $\mu$ is the overall mean, $f_j$ is the random effect of row j, assuming $f_j \sim N(0, \sigma^2_f)$, $c_k$ is the random effect of column k, with $c_k \sim N(0, \sigma^2_c)$, $\tau_i$ is the fixed effect of treatment i, and $\varepsilon_{ijk}$ is the random error term, distributed $N(0, \sigma^2)$. Note that each treatment is allocated to the jk-th cell of the square (row j and column k).

5.2.6.1 Latin Square Design with a Poisson Response

In a series of field experiments, several “inducer-attractant” strategies were tested to control insect pests in oilseed rape. In one experiment, the use of wild turnip rape as an earlier flowering trap crop (TR, the “attractor”) was tested together with the use of a repellent (an antifeedant) applied to oilseed rape in spring (S, the “inducer”). Untreated oilseed rape (U) was included as a control. The experiment was set up as a 6 × 6 Latin square with two replicates of each of the three treatments per row and column. The number of mature pollen beetles was assessed on 10 plants per plot in early April, 1 day after spraying the repellent (antifeedant). The average number of adult beetles sampled on 10 plants per plot was recorded (Appendix 1: Data: Beatles). The question is: Is there evidence that the attractor or the inducer works? That is, are fewer beetles present in the proposed treatments compared to the control?
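A layout with this structure (three treatments, each appearing twice in every row and column of a 6 × 6 square) can be sketched in Python as follows. This is a hypothetical construction for illustration only: it is not the field plan used in the experiment, the function name is invented, and a real design would additionally randomize rows and columns.

```python
from collections import Counter

def replicated_latin_square(treatments, reps=2):
    """Cyclic square in which each treatment appears `reps` times per row and column."""
    n_trt = len(treatments)
    size = n_trt * reps
    # A cyclic Latin square on `size` labels, with labels mapped onto treatments.
    return [[treatments[((row + col) % size) % n_trt] for col in range(size)]
            for row in range(size)]

layout = replicated_latin_square(["S", "TR", "U"], reps=2)
for row in layout:
    print(" ".join(f"{trt:>2}" for trt in row))

# Count how often each treatment occurs in every row and column.
row_counts = [Counter(row) for row in layout]
col_counts = [Counter(layout[r][c] for r in range(len(layout)))
              for c in range(len(layout))]
```

Each entry of `row_counts` and `col_counts` should show two plots per treatment, which is the balance the description above requires.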
The model components that define this GLMM are as described below:

Distribution: $y_{ijkl} \mid f_j, c_k \sim \text{Poisson}(\lambda_{ijkl})$, $f_j \sim N(0, \sigma^2_f)$, $c_k \sim N(0, \sigma^2_c)$
Linear predictor: $\eta_{ijkl} = \eta + f_j + c_k + \tau_i$
Link function: $\log \lambda_{ijkl} = \eta_{ijkl}$

where $\eta_{ijkl}$ is the linear predictor for repetition l (l = 1, 2) in row j (j = 1, 2, ⋯, 6) and column k (k = 1, 2, ⋯, 6) when treatment i is applied (i = 1, 2, 3), $\eta$ is the intercept, $\tau_i$ is the fixed effect of treatment i, $f_j$ is the random effect of row j, and $c_k$ is the random effect due to column k, assuming that there is no interaction between the rows and columns or between the treatments and the rows or columns. The assumed distributions for rows and columns are $f_j \sim N(0, \sigma^2_f)$ and $c_k \sim N(0, \sigma^2_c)$, respectively. The model uses the linear predictor ($\eta_{ijkl}$) to estimate the means ($\lambda_{ijkl} = \mu_{ijkl}$) of the treatments. The following GLIMMIX program fits a Latin square design with a Poisson response:

proc glimmix nobound method=laplace;
  class Row Column Treatment;
  model count = treatment / dist=Poisson link=log;
  random row column;
  lsmeans treatment / lines ilink;
run;

Part of the output is shown in Table 5.28. In the fit statistics (part (a)), Pearson’s chi-square divided by the degrees of freedom is less than 1 ($\chi^2/DF = 0.55$), indicating that there is no overdispersion in the data and that the Poisson distribution adequately models the dataset. The type III test of fixed effects in part (b) indicates that there is no significant evidence of differences between the treatments (P = 0.0621). Part (c) of Table 5.28 shows the estimates of the treatments on the model scale (“Estimate”) and on the data scale (“Mean”) with their respective standard errors. The values 4.6191, 6.9396, and 5.1561 (under the “Mean” column) correspond to the treatment means for S, TR, and U, respectively.
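The dispersion ratios quoted in this section can be reproduced from the reported Pearson chi-square values. The Python sketch below assumes that the divisor is the number of observations (192 explants in the factorial example, 36 plots in the 6 × 6 Latin square); this divisor is inferred from the reported figures rather than stated in the text, and the function name is illustrative.

```python
def dispersion_ratio(pearson_chi_square, n_obs):
    """Pearson chi-square divided by the number of observations."""
    return pearson_chi_square / n_obs

# (Pearson chi-square, number of observations, ratio reported in the table)
reported = {
    "Poisson, 2 x 4 factorial (Table 5.21)": (354.01, 192, 1.84),
    "Negative binomial, 2 x 4 factorial (Table 5.23)": (159.95, 192, 0.83),
    "Poisson, Latin square (Table 5.28)": (19.78, 36, 0.55),
}

ratios = {name: round(dispersion_ratio(chi2, n), 2)
          for name, (chi2, n, _) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

Values well above 1 point to overdispersion relative to the assumed distribution, whereas values close to 1 suggest that the assumed mean-variance relationship is adequate.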
Table 5.28 Results of the analysis of variance

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (Count \| r. effects) | 147.03 |
| Pearson’s chi-square | 19.78 |
| Pearson’s chi-square/DF | 0.55 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Treatment | 3 | 23 | 3.14 | 0.0621 |

(c) Treatment least squares means

| Treatment | Estimate ($\tau_i$) | Standard error | DF | t-value | Pr > \|t\| | Mean ($\lambda_i$) | Standard error mean |
|---|---|---|---|---|---|---|---|
| S | 1.5302 | 0.1343 | 23 | 11.39 | <0.0001 | 4.6191 | 0.6204 |
| TR | 1.9372 | 0.1096 | 23 | 17.68 | <0.0001 | 6.9396 | 0.7605 |
| U | 1.6402 | 0.1271 | 23 | 12.90 | <0.0001 | 5.1561 | 0.6555 |

5.2.6.2 Randomized Complete Block Design in a Split Plot

Sometimes the researcher is interested in testing multiple factors using different experimental units, and, in most cases, the experimenter cannot randomly assign all the treatment combinations to the experimental units. Suppose that one wishes to test two factors, A and B, with a and b levels, respectively. The levels of the first factor (A) are randomly applied to the primary experimental units. Then, the levels of the second factor (B) are applied to the secondary subunits formed within the primary unit to which the first factor was applied. In other words, the primary experimental unit (whole plot) is used for the application of the first factor and is then divided to form the secondary experimental units (subplots) for the application of the levels of the second factor. Since the split-plot design has two levels of experimental units, the whole plots (primary units) and the subplots (secondary units) have different experimental errors. Split-plot experiments were invented in agriculture by Fisher (1925), and their importance in industrial experimentation has been widely recognized (Yates 1935). As a simple illustration, consider a study of the effect of three pulp preparation methods (factor A) and four temperature levels (factor B) on paper tensile strength (paper quality).
A batch of pulp is produced by one of the three methods; it is then divided into four equal portions (samples). Each portion is cooked at a specific temperature level. The assignment of treatments to whole plots and subplots is shown in Table 5.29. The standard ANOVA model for two factors in a split-plot design, in which there are three levels of factor A and four levels of factor B applied within each whole plot of factor A, is described below:

$$y_{ijk} = \mu + \alpha_i + r_k + \alpha(r)_{ik} + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

where $y_{ijk}$ is the observed response at level i (i = 1, 2, 3) of factor A and at level j (j = 1, 2, 3, 4) of factor B in block k (k = 1, 2, 3), $\mu$ is the overall mean, $\alpha_i$ is the effect at level i of factor A, $r_k$ is the random effect of blocks, assuming $r_k \sim N(0, \sigma^2_r)$, $\alpha(r)_{ik}$ is the random effect of the error of the whole plot, assuming $\alpha(r)_{ik} \sim N(0, \sigma^2_{\alpha(r)})$, $\beta_j$ is the effect at level j of factor B, $(\alpha\beta)_{ij}$ is the fixed effect of the interaction between level i of factor A and level j of factor B, and $\varepsilon_{ijk}$ is the normal random experimental error, $\varepsilon_{ijk} \sim \text{iid } N(0, \sigma^2)$. The ANOVA table with the sources of variation for this experimental design is shown in Table 5.30.

Table 5.29 Assigning treatments to whole plots and subplots (columns within each block are the preparation methods)

| Temperature | Block 1: A1 | Block 1: A3 | Block 1: A2 | Block 2: A3 | Block 2: A2 | Block 2: A1 | Block 3: A2 | Block 3: A3 | Block 3: A1 |
|---|---|---|---|---|---|---|---|---|---|
| B4 | AB14 | AB34 | AB24 | AB34 | AB24 | AB14 | AB24 | AB34 | AB14 |
| B2 | AB12 | AB32 | AB22 | AB32 | AB22 | AB12 | AB22 | AB32 | AB12 |
| B1 | AB11 | AB31 | AB21 | AB31 | AB21 | AB11 | AB21 | AB31 | AB11 |
| B3 | AB13 | AB33 | AB23 | AB33 | AB23 | AB13 | AB23 | AB33 | AB13 |

Table 5.30 Sources of variation and degrees of freedom for a randomized block design with a split-plot treatment arrangement

| Sources of variation | Degrees of freedom |
|---|---|
| Blocks | r - 1 = 3 - 1 = 2 |
| Factor A | a - 1 = 3 - 1 = 2 |
| Error a | (a - 1)(r - 1) = 4 |
| Factor B | b - 1 = 4 - 1 = 3 |
| A × B | (a - 1)(b - 1) = 6 |
| Error | a(r - 1)(b - 1) = 3 × 2 × 3 = 18 |
| Total | r × a × b - 1 = 3 × 3 × 4 - 1 = 35 |
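The degrees of freedom in Table 5.30 follow directly from the number of blocks and factor levels. The Python sketch below encodes the formulas of that table in a small helper; the function name and dictionary keys are illustrative choices, not part of the original analysis.

```python
def split_plot_df(r, a, b):
    """Degrees of freedom for a split plot in randomized complete blocks.

    r: blocks, a: levels of the whole-plot factor A, b: levels of the subplot factor B.
    """
    df = {
        "Blocks": r - 1,
        "Factor A": a - 1,
        "Error a": (a - 1) * (r - 1),       # whole-plot error
        "Factor B": b - 1,
        "A x B": (a - 1) * (b - 1),
        "Error b": a * (r - 1) * (b - 1),   # subplot error
    }
    df["Total"] = r * a * b - 1
    return df

# Pulp example: 3 blocks, 3 preparation methods, 4 temperatures.
pulp = split_plot_df(r=3, a=3, b=4)
for source, value in pulp.items():
    print(f"{source:9s} {value}")
```

Because the whole-plot and subplot errors have different degrees of freedom, factor A is tested against "Error a", whereas factor B and the A × B interaction are tested against "Error b".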
Example 5.1 A split-plot design in randomized complete block arrangement with a Poisson response

A split plot is probably the most common design structure in plant and soil research. Such experiments involve two or more treatment factors. Typically, large units called whole plots are grouped into blocks. The levels of the first factor are randomly assigned to whole plots. Each whole plot is divided into smaller units, called subplots (split plots). Next, the levels of the second factor are randomly assigned to the subplots within each whole plot. In this example, four blocks were used, each divided into seven whole plots for the seven levels of the first factor (A1, A2, A3, A4, A5, A6, and A7). Then, each whole plot was divided into four subplots, to which the four levels of factor B (B1, B2, B3, and B4) were randomly assigned. Both factors were used to control the growth of a particular weed. Both factors were randomly allocated in each block, as shown below:

Block 1

| A7 | A1 | A3 | A2 | A5 | A4 | A6 |
|---|---|---|---|---|---|---|
| B3 | B3 | B4 | B1 | B2 | B1 | B3 |
| B1 | B2 | B3 | B3 | B1 | B2 | B2 |
| B2 | B4 | B1 | B4 | B3 | B3 | B4 |
| B4 | B1 | B2 | B2 | B4 | B4 | B1 |

⋯

Block 4

| A6 | A3 | A7 | A2 | A1 | A5 | A4 |
|---|---|---|---|---|---|---|
| B3 | B3 | B4 | B1 | B2 | B1 | B3 |
| B1 | B2 | B3 | B3 | B1 | B2 | B2 |
| B2 | B4 | B1 | B4 | B3 | B3 | B4 |
| B4 | B1 | B2 | B2 | B4 | B4 | B1 |

The sources of variation and degrees of freedom for this experiment are shown in Table 5.31.

Table 5.31 Sources of variation and degrees of freedom for a randomized block design with a split-plot treatment arrangement

| Sources of variation | Degrees of freedom |
|---|---|
| Blocks | r - 1 = 4 - 1 = 3 |
| Factor A | a - 1 = 7 - 1 = 6 |
| Error a (A × r) | (a - 1)(r - 1) = 18 |
| Factor B | b - 1 = 4 - 1 = 3 |
| A × B | (a - 1)(b - 1) = 18 |
| Error b | a(r - 1)(b - 1) = 7 × 3 × 3 = 63 |
| Total | r × a × b - 1 = 4 × 7 × 4 - 1 = 111 |

In this experiment, the response variable was the number of weeds in each of the plots (Appendix 1: Weed counts).
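The two-stage randomization described in this example can be sketched in Python. The code below is a hypothetical illustration, not the randomization actually used in the experiment: the function name, the seed, and the level labels are assumptions. Within each block it first shuffles the levels of factor A over the whole plots and then, independently for each whole plot, shuffles the levels of factor B over the subplots.

```python
import random

def randomize_split_plot(n_blocks, a_levels, b_levels, seed=0):
    """Return {block: [(A level, [B levels in subplot order]), ...]}."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    layout = {}
    for block in range(1, n_blocks + 1):
        # Stage 1: randomize factor A over the whole plots of this block.
        whole_plots = rng.sample(a_levels, k=len(a_levels))
        # Stage 2: randomize factor B over the subplots of each whole plot.
        layout[block] = [(a, rng.sample(b_levels, k=len(b_levels)))
                         for a in whole_plots]
    return layout

a_levels = [f"A{i}" for i in range(1, 8)]
b_levels = [f"B{j}" for j in range(1, 5)]
layout = randomize_split_plot(4, a_levels, b_levels, seed=2023)

for a, subplots in layout[1]:
    print(a, " ".join(subplots))

n_subplots = sum(len(subplots) for plots in layout.values() for _, subplots in plots)
```

The total number of subplots, `n_subplots`, is one more than the total degrees of freedom listed in Table 5.31.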
The components that define this GLMM are as shown below:

Distribution: $y_{ijk} \mid r_k, \alpha(r)_{ik} \sim \text{Poisson}(\lambda_{ijk})$, $r_k \sim N(0, \sigma^2_r)$, $\alpha(r)_{ik} \sim N(0, \sigma^2_{\alpha(r)})$
Linear predictor: $\eta_{ijk} = \eta + \alpha_i + r_k + \alpha(r)_{ik} + \beta_j + (\alpha\beta)_{ij}$
Link function: $\log \lambda_{ijk} = \eta_{ijk}$

where $\eta_{ijk}$ is the linear predictor for level i of factor A (i = 1, 2, ⋯, 7) and level j of factor B (j = 1, 2, 3, 4) in block k (k = 1, 2, 3, 4); $\eta$ is the intercept; $\alpha_i$ is the fixed effect at level i of factor A; $\beta_j$ is the fixed effect at level j of factor B; $(\alpha\beta)_{ij}$ is the fixed effect of the interaction between level i of factor A and level j of factor B; $r_k$ is the random effect due to block; and $\alpha(r)_{ik}$ is the random error effect of the whole plot, assuming $r_k \sim N(0, \sigma^2_r)$ and $\alpha(r)_{ik} \sim N(0, \sigma^2_{\alpha(r)})$, respectively. The model uses the aforementioned linear predictor ($\eta_{ijk}$) to estimate the means ($\lambda_{ijk} = \mu_{ijk}$) of the treatments. The following GLIMMIX program fits a split-plot block design with a Poisson response variable:

proc glimmix method=laplace;
  class block a b;
  model count = a|b / dist=Poisson link=log;
  random block block*a;
  lsmeans a|b / lines ilink;
run;

Part of the output is shown in Table 5.32.

Table 5.32 Results of the analysis of variance

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (Count \| r. effects) | 1053.96 |
| Pearson’s chi-square | 504.44 |
| Pearson’s chi-square/DF | 4.50 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Block | 0.01526 | 0.03867 |
| Block*A | 0.2454 | 0.07565 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| A | 6 | 18 | 2.32 | 0.0775 |
| B | 3 | 63 | 22.91 | <0.0001 |
| A*B | 18 | 63 | 10.06 | <0.0001 |

As in the previous examples, the Poisson model was found to be inadequate because the value of Pearson’s chi-square statistic divided by the degrees of freedom is greater than 1 ($\chi^2/df = 4.50$). This indicates that either the conditional distribution of $y \mid b$ or the linear predictor has probably been misspecified; in this case, the evidence suggests looking for another distribution for this dataset (part (a), Table 5.32). In addition, part (b) lists the variance component estimates due to blocks and blocks × A ($\hat{\sigma}^2_r = 0.01526$; $\hat{\sigma}^2_{\alpha(r)} = 0.2454$). On the other hand, the type III tests of fixed effects (part (c)) show a significant effect of factor B and of the interaction between both factors. An alternative to reduce the overdispersion is to keep the same linear predictor and replace the Poisson distribution of the response variable with the negative binomial distribution, that is:

Distribution: $y_{ijk} \mid r_k, \alpha(r)_{ik} \sim \text{Negative Binomial}(\lambda_{ijk}, \phi)$, $r_k \sim \text{iid } N(0, \sigma^2_r)$, $\alpha(r)_{ik} \sim \text{iid } N(0, \sigma^2_{\alpha(r)})$
Linear predictor: $\eta_{ijk} = \eta + \alpha_i + r_k + \alpha(r)_{ik} + \beta_j + (\alpha\beta)_{ij}$
Link function: $\log \lambda_{ijk} = \eta_{ijk}$

The following syntax fits a GLMM under a negative binomial distribution:

proc glimmix method=laplace;
  class block a b;
  model count = a|b / dist=NegBin link=log;
  random intercept a / subject=block;
  lsmeans a|b / lines ilink;
run;

Part of the output is shown in Table 5.33.

Table 5.33 Results of the analysis of variance

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (Count \| r. effects) | 838.51 |
| Pearson’s chi-square | 79.36 |
| Pearson’s chi-square/DF | 0.71 |

(b) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Intercept | Block | 0.002421 | 0.02768 |
| A | Block | 0.1222 | 0.07102 |
| Scale | | 0.3458 | 0.06875 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| A | 6 | 18 | 2.71 | 0.0473 |
| B | 3 | 63 | 2.13 | 0.1054 |
| A*B | 18 | 63 | 1.18 | 0.3017 |

The results tabulated in part (a) indicate that the overdispersion has been removed from the analysis ($\chi^2/df = 0.71$). The variance component estimates, tabulated in part (b), are $\hat{\sigma}^2_r = 0.0024$ and $\hat{\sigma}^2_{\alpha(r)} = 0.1222$ for blocks and blocks × A, respectively. The estimated scale parameter is $\hat{\phi} = 0.3458$. Note that the results under the negative binomial distribution differ from those obtained under the Poisson distribution, because the negative binomial distribution better captures the overdispersion. The fixed effects F-test for factor A is significant at the 5% significance level (part (c)), whereas factor B and the interaction effect do not significantly influence the response variable.

Example 5.2 A split-split plot in time in a randomized complete block design with a Poisson response

The propagation of coffee seedlings through grafting in nurseries depends on several factors such as the type of substrate, the rootstock of the plant that will host the graft, the type of graft, light intensity, the type and size of the container, humidity, temperature, and so forth. The objective of this experiment was to evaluate the effect of shade cloth (light intensity), type of container, and clone on the number of leaves produced by the Coffea canephora P. clones grafted with the Coffea arabica L. variety Oro azteca. The factors studied were the color of the shade cloth (black, pearl, and red), container size (tube of 0.5 kg and 1 kg), and five coffee clones of Coffea canephora P. plus a franc foot (Coffea arabica L. var. Oro azteca) over a period of 11 months (Appendix 1: Coffee data). The clones used in the experiment are listed below (Table 5.34).
Different physiological parameters were evaluated for 11 months. The experiment was laid out in four randomized complete blocks. The layout following Table 5.34 exemplifies how a block was constructed.

Table 5.34 Clones of Coffea canephora P.

Graft carrier (Coffea canephora P.)              Grafting (Coffea arabica L.)   Code
Clone 1                                          Var. Oro Azteca                C1
Clone 2                                          Var. Oro Azteca                C2
Clone 3                                          Var. Oro Azteca                C3
Clone 4                                          Var. Oro Azteca                C4
Clone 5                                          Var. Oro Azteca                C5
Franc foot: Coffea arabica L. var. Oro Azteca                                   Pf

Layout of one block:

Shade cloth red
  Container Tray 0.5 kg:  C2 C5 C4 Pf C3 C1 C5 C2 Pf C4 C1 C3
  Container 1 kg:         C4 C3 C5 Pf C1 C2
Shade cloth pearl
  Container Tray 0.5 kg:  C4 C5 C3 Pf C5 C1 Pf C2 C1 C4 C2 C3
  Container 1 kg:         C2 C4 C3 C5 Pf C1
Shade cloth black
  Container Tray 0.5 kg:  C5 C2 Pf C3 C1 C5 C2 Pf C4 C1 C3 C24
  Container 1 kg:         C4 C2 C3 C5 Pf C1

The statistical model describing a split-split plot in time design is given below:

$$
\begin{aligned}
y_{ijklm} = {} & \mu + \alpha_i + r_m + (\alpha r)_{im} + \beta_j + (\alpha\beta)_{ij} + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} \\
& + (r\alpha\beta\gamma)_{ijkm} + \tau_l + (\alpha\tau)_{il} + (\beta\tau)_{jl} + (\alpha\beta\tau)_{ijl} + (\gamma\tau)_{kl} + (\alpha\gamma\tau)_{ikl} \\
& + (\beta\gamma\tau)_{jkl} + (\alpha\beta\gamma\tau)_{ijkl} + \varepsilon_{ijklm}
\end{aligned}
$$

$$i = 1, 2, 3; \quad j = 1, \ldots, 6; \quad k = 1, 2, 3; \quad l = 1, \ldots, 11; \quad m = 1, 2, 3, 4$$

where $y_{ijklm}$ is the response variable in repetition (block) $m$, shade cloth $i$, clone $j$, and tray $k$ at time $l$; $\mu$ is the overall mean; $\alpha_i$ is the fixed effect due to the type of shade cloth; $\beta_j$, $\gamma_k$, and $\tau_l$ are the fixed effects due to clone type, tray, and sampling time, respectively; $(\alpha\beta)_{ij}$, $(\alpha\gamma)_{ik}$, $(\beta\gamma)_{jk}$, $(\alpha\tau)_{il}$, $(\beta\tau)_{jl}$, and $(\gamma\tau)_{kl}$ are the two-factor interactions among shade cloth, clone, tray, and sampling time; $(\alpha\beta\gamma)_{ijk}$, $(\alpha\beta\tau)_{ijl}$, $(\alpha\gamma\tau)_{ikl}$, $(\beta\gamma\tau)_{jkl}$, and $(\alpha\beta\gamma\tau)_{ijkl}$ are the three- and four-factor interactions of the factors under study; $r_m$, $(\alpha r)_{im}$, and $(r\alpha\beta\gamma)_{ijkm}$ are the random effects due to blocks, blocks × shade cloth, and blocks × shade cloth × clone × tray, assuming $r_m \sim N(0, \sigma^2_{r})$, $(\alpha r)_{im} \sim N(0, \sigma^2_{r\alpha})$, and $(r\alpha\beta\gamma)_{ijkm} \sim N(0, \sigma^2_{\alpha\beta\gamma(rep)})$; and $\varepsilon_{ijklm}$ is the random error, $\varepsilon_{ijklm} \sim N(0, \sigma^2)$. The clone index runs over the five clones plus the franc foot, in agreement with the five numerator degrees of freedom for clone in Table 5.37.

The following SAS program fits a GLMM in a split-split plot in time under a randomized complete block design with a Poisson response.

    proc glimmix data=work.Nhojas_cafe nobound method=laplace;
      class shade clone tray rep time;
      model y = shade|clone|tray|time / dist=poi link=log;
      random intercept shade shade*clone*tray / subject=rep type=ar(1);
      lsmeans shade|clone|tray|time / lines ilink;
    run;

Some of the results are listed below.

Table 5.35 Fit statistics for choosing the correlation structure

Fit statistic                 CS         AR(1)      UN              TOEP(1)    ANTE(1)
-2 Log likelihood             29047.38   29043.38   Not converged   29053.85   Not converged
AIC (smaller is better)       30236.38   30235.38                   30243.85
AICC (smaller is better)      30338.31   30337.31                   30345.44
BIC (smaller is better)       29871.61   29869.61                   29878.70
CAIC (smaller is better)      30469.61   30465.61                   30473.70
HQIC (smaller is better)      29435.72   29432.72                   29442.55

Table 5.36 Conditional fit statistics and variance component estimates

(a) Fit statistics for conditional distribution
-2 Log L (y | r. effects)    28709.17
Pearson's chi-square          4288.74
Pearson's chi-square/DF          0.56

(b) Covariance parameter estimates
Cov Parm   Subject   Estimate    Standard error
Variance   Rep       0.008106    0.001392
AR(1)      Rep       -0.3254     0.09437

To study which correlation structure best fits this experimental design, five structures were tested (Table 5.35): compound symmetry (CS), first-order autoregressive (AR(1)), unstructured (UN), Toeplitz of order 1 (TOEP(1)), and first-order antedependence (ANTE(1)). The structure to be tested is specified with the "type" option of the "random" statement, and this is the only part of the program that changes from one fit to the next. The fit statistics indicate that the variance–covariance structure that best fits the model is the first-order autoregressive structure, AR(1).
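The selection rule described above (smaller is better for each information criterion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numbers are typed in by hand as read from Table 5.35, the two structures that did not converge are left out, and the names `fit_stats` and `best_structure` are hypothetical helpers that are not part of the SAS output.

```python
# Fit statistics as read from Table 5.35 (smaller is better for every criterion).
# UN and ANTE(1) are left out because those fits did not converge.
fit_stats = {
    "CS": {"AIC": 30236.38, "AICC": 30338.31, "BIC": 29871.61,
           "CAIC": 30469.61, "HQIC": 29435.72},
    "AR(1)": {"AIC": 30235.38, "AICC": 30337.31, "BIC": 29869.61,
              "CAIC": 30465.61, "HQIC": 29432.72},
    "TOEP(1)": {"AIC": 30243.85, "AICC": 30345.44, "BIC": 29878.70,
                "CAIC": 30473.70, "HQIC": 29442.55},
}


def best_structure(stats, criterion):
    """Return the name of the structure with the smallest value of `criterion`."""
    return min(stats, key=lambda name: stats[name][criterion])


criteria = ["AIC", "AICC", "BIC", "CAIC", "HQIC"]
winners = {c: best_structure(fit_stats, c) for c in criteria}

for c in criteria:
    # Difference between the two closest competitors, CS and AR(1).
    gap = fit_stats["CS"][c] - fit_stats["AR(1)"][c]
    print(f"{c:5s} best: {winners[c]:8s} CS minus AR(1): {gap:.2f}")
```

With these values every criterion selects AR(1), although its margin over compound symmetry is only a few units, so the choice between the two is not a strong one.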
Table 5.35 reports the goodness-of-fit statistics used to choose among these variance–covariance structures. Table 5.36 shows the conditional fit statistics and the variance component estimates. The fit statistic Pearson's chi-square/DF = 0.56 in part (a) indicates that, in the conditional model, there is no evidence of misspecification of the distribution or the linear predictor. In other words, there is no overdispersion in the dataset, and it is therefore reasonable to base the analysis and inference on the Poisson model.

The type III tests of fixed effects (Table 5.37) indicate highly significant main effects of clone, tray, and time (P < 0.0001 each), whereas the main effect of shade cloth is not significant at the 5% level (P = 0.1011). Most of the interactions are significant; the exceptions are shade*clone (P = 0.3846), shade*clone*time (P = 0.9289), clone*tray*time (P = 0.9760), and shade*clone*tray*time (P = 0.2484).

Table 5.37 Type III fixed effects tests

Effect                  Num DF   Den DF   F-value   Pr > F
Shade                   2        6        3.44      0.1011
Clone                   5        153      16.38     <0.0001
Shade*clone             10       153      1.08      0.3846
Tray                    2        153      56.60     <0.0001
Shade*tray              4        153      8.83      <0.0001
Clone*tray              10       153      2.86      0.0027
Shade*clone*tray        20       153      1.71      0.0363
Time                    10       6822     721.20    <0.0001
Shade*time              20       6822     6.91      <0.0001
Clone*time              50       6822     3.17      <0.0001
Shade*clone*time        100      6822     0.80      0.9289
Tray*time               20       6822     9.03      <0.0001
Shade*tray*time         40       6822     2.42      <0.0001
Clone*tray*time         100      6822     0.74      0.9760
Shade*clone*tray*time   200      6822     1.07      0.2484

Part (a) of Table 5.38 shows, for each type of shade cloth, the estimate of the linear predictor on the model scale ("Estimate" column) and the mean and its standard error on the data scale ("Mean" and "Standard error mean" columns); part (b) shows the mean comparisons for the type of shade cloth.

Table 5.38 Estimated means on the model scale and on the data scale for the shade cloth

(a) Shade cloth least squares means
Shade cloth   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
Black         1.6221     0.01542          2    105.17    <0.0001    5.0638   0.07810
Pearl         1.5472     0.01533          2    100.94    <0.0001    4.6981   0.07201
Red           1.7184     0.01301          2    132.09    <0.0001    5.5757   0.07254

(b) T grouping of shade cloth least squares means (α = 0.05)
LS means with the same letter are not significantly different
Shade cloth   Estimate   Group
Red           1.7184     A
Black         1.6221     B
Pearl         1.5472     B

Table 5.39 presents the estimates of the linear predictor on the model scale ("Estimate" column) and the treatment means on the data scale ("Mean" column) for the type of clone (part (a)), together with the mean comparisons for the type of clone (part (b)).

Table 5.39 Estimated means on the model scale and on the data scale for the type of clone

(a) Clone least squares means
Clone   Estimate   Standard error   DF    t-value   Pr > |t|   Mean     Standard error mean
C1      1.5008     0.04989          153   30.08     <0.0001    4.4854   0.2238
C2      1.4250     0.05080          153   28.05     <0.0001    4.1578   0.2112
C3      1.5064     0.05019          153   30.02     <0.0001    4.5106   0.2264
C4      1.4750     0.05029          153   29.33     <0.0001    4.3709   0.2198
C5      1.5965     0.04970          153   32.12     <0.0001    4.9357   0.2453
Pf      1.6344     0.04943          153   33.07     <0.0001    5.1264   0.2534

(b) T grouping of clone least squares means (α = 0.05)
LS means with the same letter are not significantly different
Clone   Estimate   Group
Pf      1.6344     A
C5      1.5965     A
C3      1.5064     B
C1      1.5008     B
C4      1.4750     B C
C2      1.4250     C

Table 5.40 Estimated means on the model scale and on the data scale for the tray factor

(a) Tray least squares means
Tray   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
CH1    1.3843     0.04859               28.49     <0.0001    3.9921   0.1940
CH2    1.5665     0.04838               32.38     <0.0001    4.7898   0.2317
CH3    1.6183     0.04819               33.58     <0.0001    5.0443   0.2431

(b) T grouping of tray least squares means (α = 0.05)
LS means with the same letter are not significantly different
Tray   Estimate   Group
CH3    1.6183     A
CH2    1.5665     B
CH1    1.3843     C

Table 5.40 presents the
estimates for the levels of the tray on both scales (part (a)) and, in part (b), the treatment mean comparisons for the levels of the tray. Tables 5.41, 5.42, 5.43, and 5.44 show the means and standard errors on both scales for the two-factor and three-factor interactions:

Interaction type of shade cloth*clone (Table 5.41)
Interaction type of shade cloth*tray (Table 5.42)
Interaction clone*tray (Table 5.43)
Interaction shade*clone*tray (Table 5.44)

Table 5.41 Estimated means on the model scale and on the data scale for the type of shade cloth*clone

Shade cloth*clone least squares means
Shade cloth   Clone   Estimate   Standard error   DF    t-value   Pr > |t|   Mean     Standard error mean
Black         C1      1.5109     0.06230          153   24.25     <0.0001    4.5307   0.2823
Black         C2      1.3340     0.06507          153   20.50     <0.0001    3.7961   0.2470
Black         C3      1.4990     0.06354          153   23.59     <0.0001    4.4771   0.2845
Black         C4      1.4485     0.06425          153   22.54     <0.0001    4.2566   0.2735
Black         C5      1.5916     0.06163          153   25.83     <0.0001    4.9118   0.3027
Black         pf      1.6219     0.06114          153   26.53     <0.0001    5.0628   0.3095
Pearl         C1      1.3835     0.07711          153   17.94     <0.0001    3.9889   0.3076
Pearl         C2      1.3926     0.07781          153   17.90     <0.0001    4.0254   0.3132
Pearl         C3      1.4028     0.07589          153   18.49     <0.0001    4.0666   0.3086
Pearl         C4      1.4288     0.07575          153   18.86     <0.0001    4.1736   0.3161
Pearl         C5      1.5216     0.07536          153   20.19     <0.0001    4.5797   0.3451
Pearl         pf      1.5285     0.07458          153   20.50     <0.0001    4.6112   0.3439
Red           C1      1.6081     0.06991          153   23.00     <0.0001    4.9933   0.3491
Red           C2      1.5483     0.07100          153   21.81     <0.0001    4.7036   0.3339
Red           C3      1.6175     0.07055          153   22.93     <0.0001    5.0404   0.3556
Red           C4      1.5477     0.07072          153   21.88     <0.0001    4.7005   0.3324
Red           C5      1.6762     0.06971          153   24.04     <0.0001    5.3451   0.3726
Red           pf      1.7528     0.06923          153   25.32     <0.0001    5.7707   0.3995

Table 5.42 Estimated means on the model scale and on the data scale for the interaction type of shade cloth*tray

Shade*tray least squares means
Shade cloth   Tray   Estimate   Standard error   DF    t-value   Pr > |t|   Mean     Standard error mean
Black         CH1    1.4274     0.05961          153   23.94     <0.0001    4.1679   0.2485
Black         CH2    1.5523     0.05846          153   26.55     <0.0001    4.7224   0.2761
Black         CH3    1.5232     0.05824          153   26.15     <0.0001    4.5869   0.2672
Pearl         CH1    1.2070     0.07354          153   16.41     <0.0001    3.3434   0.2459
Pearl         CH2    1.4972     0.07218          153   20.74     <0.0001    4.4691   0.3226
Pearl         CH3    1.6247     0.07145          153   22.74     <0.0001    5.0771   0.3628
Red           CH1    1.5185     0.06733          153   22.55     <0.0001    4.5655   0.3074
Red           CH2    1.6499     0.06714          153   24.57     <0.0001    5.2066   0.3496
Red           CH3    1.7068     0.06732          153   25.35     <0.0001    5.5114   0.3710

Table 5.43 Estimated means on the model scale and on the data scale for the clone–tray interaction

Clone*tray least squares means
Clone   Tray   Estimate   Standard error   DF    t-value   Pr > |t|   Mean     Standard error mean
C1      CH1    1.3916     0.06112          153   22.77     <0.0001    4.0214   0.2458
C1      CH2    1.5502     0.05861          153   26.45     <0.0001    4.7122   0.2762
C1      CH3    1.5607     0.05861          153   26.63     <0.0001    4.7622   0.2791
C2      CH1    1.2242     0.06459          153   18.95     <0.0001    3.4014   0.2197
C2      CH2    1.4780     0.06029          153   24.51     <0.0001    4.3843   0.2644
C2      CH3    1.5727     0.05890          153   26.70     <0.0001    4.8196   0.2839
C3      CH1    1.2924     0.06114          153   21.14     <0.0001    3.6414   0.2226
C3      CH2    1.5433     0.05975          153   25.83     <0.0001    4.6799   0.2796
C3      CH3    1.6836     0.05841          153   28.83     <0.0001    5.3851   0.3145
C4      CH1    1.2982     0.06251          153   20.77     <0.0001    3.6626   0.2289
C4      CH2    1.5829     0.05815          153   27.22     <0.0001    4.8690   0.2831
C4      CH3    1.5439     0.05939          153   26.00     <0.0001    4.6828   0.2781
C5      CH1    1.5311     0.05843          153   26.20     <0.0001    4.6234   0.2702
C5      CH2    1.5981     0.05920          153   26.99     <0.0001    4.9438   0.2927
C5      CH3    1.6602     0.05803          153   28.61     <0.0001    5.2604   0.3053
pf      CH1    1.5684     0.05794          153   27.07     <0.0001    4.7989   0.2781
pf      CH2    1.6464     0.05833          153   28.23     <0.0001    5.1884   0.3026
pf      CH3    1.6884     0.05728          153   29.48     <0.0001    5.4107   0.3099

Although a full interpretation is not the objective of this book, part of the results is discussed below. In Fig. 5.8, it is possible to observe that the red shade cloth significantly stimulates leaf production in coffee grafts, followed by the black and pearl shade cloths. Leaf production in coffee grafts over time shows a bimodal pattern, which may be due to factors such as humidity and temperature: extreme conditions of both factors cause stress at the growing points and thereby affect the appearance of leaves. Regarding the type of clone used as rootstock, the clones showed a better average leaf production in months 5 and 6, whereas the lowest production was observed in months 1, 2, 8, and 9. The franc foot showed a higher average number of leaves than the rest of the clones (Fig. 5.9).

5.3 Exercises

Exercise 5.3.1 A researcher in the area of plant sciences wants to know how an in vitro plant culture responds, in terms of the number of outbreaks (shoots) that the explant produces ($y_{ij}$), when it is exposed to different concentrations (ppm) of a chemical compound. The data for this experiment are given in Table 5.45.
(a) Write down the analysis of variance table (sources of variation and degrees of freedom).
(b) Write down the components of the GLMM.
(c) Analyze the dataset with the model proposed in (b).
(d) Compare and contrast the results of these analyses. If necessary, reanalyze the dataset using the same model as above, but, now, assume that the data have a negative binomial distribution.
(e) Summarize the relevant results.
Table 5.44 Estimated means on the model scale and on the data scale for the shade–clone–tray interaction

Shade*clone*tray least squares means
Shade cloth   Clone   Tray   Estimate   Standard error   DF    t-value   Pr > |t|   Mean     Standard error mean
Black         C1      CH1    1.2821     0.1528           153   8.39      <0.0001    3.6041   0.5509
Black         C1      CH2    1.4143     0.1521           153   9.30      <0.0001    4.1136   0.6258
Black         C1      CH3    1.2201     0.1538           153   7.93      <0.0001    3.3874   0.5209
Black         C2      CH1    0.8131     0.1615           153   5.04      <0.0001    2.2549   0.3641
Black         C2      CH2    1.1486     0.1543           153   7.45      <0.0001    3.1538   0.4866
Black         C2      CH3    1.3376     0.1533           153   8.72      <0.0001    3.8100   0.5842
Black         C3      CH1    1.1809     0.1548           153   7.63      <0.0001    3.2574   0.5041
Black         C3      CH2    1.1105     0.1550           153   7.17      <0.0001    3.0359   0.4705
Black         C3      CH3    1.3672     0.1528           153   8.95      <0.0001    3.9242   0.5996
Black         C4      CH1    0.7672     0.1608           153   4.77      <0.0001    2.1538   0.3462
Black         C4      CH2    1.4660     0.1517           153   9.66      <0.0001    4.3318   0.6573
Black         C4      CH3    1.3925     0.1523           153   9.14      <0.0001    4.0250   0.6131
Black         C5      CH1    1.2316     0.1538           153   8.01      <0.0001    3.4269   0.5270
Black         C5      CH2    1.6090     0.1507           153   10.67     <0.0001    4.9979   0.7534
Black         C5      CH3    1.4684     0.1515           153   9.70      <0.0001    4.3422   0.6577
Black         Pf      CH1    1.6751     0.1503           153   11.15     <0.0001    5.3393   0.8025
Black         Pf      CH2    1.3126     0.1548           153   8.48      <0.0001    3.7160   0.5753
Black         Pf      CH3    1.5092     0.1511           153   9.99      <0.0001    4.5231   0.6834
Pearl         C1      CH1    0.6441     0.1741           153   3.70      0.0003     1.9043   0.3314
Pearl         C1      CH2    1.3602     0.1639           153   8.30      <0.0001    3.8970   0.6387
Pearl         C1      CH3    1.6030     0.1633           153   9.82      <0.0001    4.9678   0.8111
Pearl         C2      CH1    0.6336     0.1741           153   3.64      0.0004     1.8844   0.3281
Pearl         C2      CH2    1.2050     0.1672           153   7.21      <0.0001    3.3366   0.5579
Pearl         C2      CH3    1.5547     0.1635           153   9.51      <0.0001    4.7335   0.7740
Pearl         C3      CH1    0.8786     0.1684           153   5.22      <0.0001    2.4074   0.4053
Pearl         C3      CH2    1.2777     0.1646           153   7.76      <0.0001    3.5885   0.5905
Pearl         C3      CH3    1.5724     0.1637           153   9.60      <0.0001    4.8184   0.7889
Pearl         C4      CH1    0.9893     0.1680           153   5.89      <0.0001    2.6893   0.4519
Pearl         C4      CH2    1.4198     0.1636           153   8.68      <0.0001    4.1362   0.6769
Pearl         C4      CH3    1.4357     0.1646           153   8.72      <0.0001    4.2026   0.6919
Pearl         C5      CH1    1.4557     0.1631           153   8.93      <0.0001    4.2875   0.6992
Pearl         C5      CH2    1.1672     0.1696           153   6.88      <0.0001    3.2130   0.5449
Pearl         C5      CH3    1.6010     0.1633           153   9.80      <0.0001    4.9582   0.8098
Pearl         Pf      CH1    1.1901     0.1649           153   7.22      <0.0001    3.2875   0.5422
Pearl         Pf      CH2    1.4004     0.1643           153   8.52      <0.0001    4.0570   0.6665
Pearl         Pf      CH3    1.7623     0.1620           153   10.88     <0.0001    5.8260   0.9440
Red           C1      CH1    1.5245     0.1606           153   9.49      <0.0001    4.5930   0.7379
Red           C1      CH2    1.6004     0.1605           153   9.97      <0.0001    4.9548   0.7953
Red           C1      CH3    1.6327     0.1607           153   10.16     <0.0001    5.1178   0.8224
Red           C2      CH1    1.3462     0.1630           153   8.26      <0.0001    3.8430   0.6264
Red           C2      CH2    1.4270     0.1622           153   8.80      <0.0001    4.1663   0.6759
Red           C2      CH3    1.6500     0.1620           153   10.19     <0.0001    5.2071   0.8435
Red           C3      CH1    1.3915     0.1632           153   8.53      <0.0001    4.0207   0.6563
Red           C3      CH2    1.4491     0.1614           153   8.98      <0.0001    4.2592   0.6872
Red           C3      CH3    1.7875     0.1603           153   11.15     <0.0001    5.9747   0.9577
Red           C4      CH1    1.3961     0.1614           153   8.65      <0.0001    4.0394   0.6520
Red           C4      CH2    1.5874     0.1606           153   9.89      <0.0001    4.8910   0.7854
Red           C4      CH3    1.3805     0.1635           153   8.44      <0.0001    3.9768   0.6503
Red           C5      CH1    1.6313     0.1601           153   10.19     <0.0001    5.1103   0.8180
Red           C5      CH2    1.6395     0.1605           153   10.22     <0.0001    5.1527   0.8268
Red           C5      CH3    1.6470     0.1610           153   10.23     <0.0001    5.1912   0.8360
Red           Pf      CH1    1.7075     0.1600           153   10.67     <0.0001    5.5151   0.8825
Red           Pf      CH2    1.7594     0.1594           153   11.04     <0.0001    5.8087   0.9260
Red           Pf      CH3    1.7568     0.1601           153   10.98     <0.0001    5.7938   0.9273

[Fig. 5.8 Effect of mesh type on the average number of leaves. Horizontal axis: time (months), 1 to 11; vertical axis: average number of leaves; series: Black, Pearl, Red]

[Fig. 5.9 Effect of mesh type on the average number of leaves. Horizontal axis: time (months), 1 to 11; vertical axis: average number of leaves; series: Clone1, Clone2, Clone3, Clone4, Clone5, Pf]

Exercise 5.3.2 Earthworms (Lumbricus terrestris L.)
were counted in four replicates of a factorial experiment at the W.K. Kellogg Biological Station in Battle Creek, Michigan, in 1995. A $2^4$ factorial experiment was conducted. The factors and their levels were plowing (chiseled and unplowed), input level (conventional and low), manure application (yes/no), and crop (corn and soybean). The objective was to determine whether L. terrestris density varies according to these management protocols and how the various factors act and interact. Table 5.46 shows the total worm counts (juvenile and adult worms, per square foot; data not pooled) for the 64 experimental units ($2^4 \times 4$) of the $2^4$ factorial design. The numbers in each cell of the table correspond to the counts in the four replicates.
(a) Write down the analysis of variance table (sources of variation and degrees of freedom).
(b) Write down the components of the GLMM.
(c) Analyze the dataset with the model proposed in (b).
(d) Summarize the relevant results.

Exercise 5.3.3 This experiment investigates genotypic variation within cultivars of leek (Allium porrum L.) with respect to adventitious shoot formation in callus tissue. The data in Table 5.47 refer to 20 genotypes of 1 cultivar. Each genotype is represented by six calluses, and the observations are the numbers of shoots per callus. The data are subject to two sources of variation: variation between genotypes and variation between the calluses within genotypes.

Table 5.45 In vitro culture (Conc = concentration in ppm) Conc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 50 50 50 50 50 50 50 50 50 50 50 50 Explant 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 9 10 11 12 No.
of outbreaks 39 32 36 46 30 40 46 28 29 25 29 36 28 35 35 45 45 38 34 47 36 47 35 38 39 42 42 41 31 33 37 38 54 45 57 38 60 35 45 44 33 49 58 45 Conc 50 50 50 50 50 50 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200 Explant 13 15 16 17 18 19 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 No. of outbreaks 54 35 50 51 38 61 46 55 54 49 55 55 47 42 38 50 46 42 44 30 38 31 42 36 37 27 38 25 29 30 30 37 28 37 29 36 34 27 32 37 30 31 30 5.3 Exercises 175 Table 5.46 Results of the experiment with earthworms Cultivation Manure Corn Yes No Yes No Soy Table 5.47 Results of the callus tissue experiment Tillage Chisel ploughing Entry level Low Conventional 5, 5, 4, 2 5, 1, 5, 0 3, 11, 0, 0 2, 0, 6, 1 8, 6, 0, 3 8, 4, 2, 2 8, 5, 3, 11 2, 6, 9, 4 Callus Genotype 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 0 9 2 1 6 6 0 1 3 3 2 0 9 2 0 5 1 0 1 4 2 0 0 4 2 3 2 2 1 3 6 6 0 3 3 0 4 0 1 4 3 No Tillage Entry level Low 8, 4, 6, 4 2, 2, 11, 4 2, 2, 13, 7 7, 5, 18, 3 3 0 1 4 5 8 4 0 3 1 4 8 3 5 2 0 4 0 0 6 5 Conventional 14, 9, 9, 6 15, 9, 6, 4 5, 3, 6, 0 23, 12, 17, 9 4 5 0 5 0 9 3 4 0 1 0 7 8 2 5 0 0 7 0 0 2 18 3 2 4 0 5 2 1 0 6 1 7 10 6 3 1 7 0 1 0 4 6 0 4 0 4 9 7 0 2 2 8 5 6 4 2 1 1 1 0 7 0 (a) Write down the analysis of variance table (sources of variation and degrees of freedom). (b) Write down the components of the GLMM. (c) Analyze the dataset with the model proposed in (b). (d) Reanalyze the dataset using the same model as above, but, now, assume that the data have a negative binomial distribution. (e) Compare and contrast the results of these analyses. (f) Summarize the relevant results. 
Exercise 5.3.4 In an experiment at the Research Institute for Animal Production "Schoonoord" in the Netherlands, the effects of active immunization against androstenedione on the fertility of Texel ewes were studied (Engel and te Brake 1993). The number of fetuses per ewe can be considered as the net result of a process that determines the number of ovulations and a probability process by which these ovulations produce fetuses. In this study, the goals are to model and analyze (a) the number of ovulations and the number of fetuses in relation to Fecundin (androstenedione-7α-carboxyethylthioether) treatment, animal age, and mating period and (b) the number of fetuses in relation to treatment, animal age, and number of ovulations observed. A summary of the experiment and of the data is given below (Table 5.48).

Of the 125 Texel ewes, 63 were treated with Fecundin, whereas the remaining 62 served as a control group. The ewes were sorted into four age classes (<0.5, 0.5–1.5, 1.5–2.5, and >2.5 years) and two mating periods (starting on October 1 and October 22, 1986, respectively). Interactions with age are of interest, and they are easier to handle when age is entered as a factor rather than as a covariate. The numbers of animals in the four age classes are 25, 44, 24, and 32, respectively. The age classes are evenly distributed over the combinations of mating period and treatment group. Ewes were slaughtered 75–80 days after the last mating, and the number of ovulations and the number of fetuses were determined. Ovulation numbers ranged from 1 to 5. For six animals, the number of ovulations was not known, so these ewes were excluded from the database.
(a) Analyze the dataset using a GLMM with the predictor $\eta_{ijkl} = \eta + \tau_i + \alpha_j + \beta_k + (\tau\alpha)_{ij} + (\tau\beta)_{ik} + (\tau\alpha\beta)_{ijk} + b_l$, where $\tau$, $\alpha$, and $\beta$ are the fixed effects of treatment, age, and mating period and $b$ is the random effect due to animal.
Assuming that each b has normal distribution with a zero mean and variance σ 2b , and under the assumption that the number of ovulations and the number of fetuses have a Poisson distribution. (b) From the analyses performed, do you observe the presence of overdispersion in the dataset? If so, propose an alternative distribution for the analysis for this dataset. (c) Reanalyze the dataset using the same model as before with the new data distribution. (d) Compare and contrast the results of these analyses. (e) Summarize the relevant results. Exercise 5.3.5 The following example deals with one of the most harmful insects in the root system of the main crops, whose common name is “blind hen.” The experiment consisted of six treatments formulated for larval control in a randomized block arrangement (A, B, C, D, E, and F). The count per area shows the number of larvae in two age groups (a and b) (Table 5.49). T 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 A 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 M 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 N 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 4 3 2 2 2 2 3 2 x 1 2 2 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 1 2 T 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 A 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 M 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 n 2 2 2 3 2 3 3 3 3 2 3 3 2 3 2 2 4 4 4 3 3 2 3 2 x 1 2 1 2 2 3 2 1 2 1 3 2 1 3 2 1 4 3 1 3 2 2 2 1 T 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 A 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 2 2 M 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 1 1 n 3 3 2 3 2 4 2 4 2 2 5 1 2 2 1 1 1 1 1 1 1 1 2 1 X 2 3 2 3 2 4 2 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 T 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 A 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 M 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 n 3 2 1 2 2 2 2 3 2 2 2 2 2 2 2 2 1 2 1 1 2 3 2 2 x 2 1 1 1 2 1 2 2 2 1 2 2 2 2 2 2 1 2 1 1 2 3 1 2 T 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 A 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 
4 4 4 4 4 4 4 M 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 n 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 3 3 x 2 2 1 2 1 2 2 2 1 2 2 2 2 3 2 2 2 2 1 2 2 3 3 Table 5.48 Factors T (Treatment: 1: Fecundin; 2: Control), A (Age: 1: ≤0.5; 2: 0.5 < - ≤ 1.5; 3: 1.5 < - ≤ 2.5; 4:≥2.5 years); M (Mating period: 1: October 1; 2: October 22); n (number of ovulations), and x (number of fetuses) 5.3 Exercises 177 178 5 Generalized Linear Mixed Models for Counts Table 5.49 Results of the blind hen experiment Trt 1 2 3 4 5 A A 5 4 4 1 2 b 7 2 4 5 2 B A 5 12 1 5 3 B 14 5 14 9 8 C A 0 2 2 2 0 B 3 3 2 7 0 D A 1 1 1 3 5 b 7 6 7 7 4 E A 1 3 1 0 1 b 10 5 8 3 6 F A 4 4 7 3 1 b 13 11 10 12 8 (a) Write down the analysis of variance table (sources of variation and degrees of freedom). (b) Write down the components of the GLMM. (c) Analyze the dataset with the model proposed in (b). (d) Does the proposed model in (b) adequately describe the variation observed in the dataset? Summarize the relevant results. Appendix 1 Data: Subcultures sub1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Rep1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 NB 18 16 15 15 11 17 10 8 17 13 16 15 12 15 8 8 15 15 14 8 15 11 12 18 sub1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Rep1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 NB 24 24 19 25 24 20 24 20 19 26 22 23 24 23 24 28 29 34 24 24 25 28 24 32 sub1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 Rep1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 NB 45 44 45 44 52 47 46 45 48 56 54 44 54 62 55 45 56 62 45 45 46 48 55 45 sub1 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 Rep1 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 NB 53 59 57 65 63 55 50 52 55 50 53 52 48 44 54 55 51 58 47 42 50 48 48 53 (continued) Appendix 1 sub1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 Rep1 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 1 179 NB 8 17 8 18 22 19 19 24 12 12 11 21 10 15 20 22 20 13 18 19 sub1 4 4 4 4 4 5 5 5 5 5 
5 5 5 5 5 5 5 5 5 5 Rep1 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 NB 34 30 26 27 29 38 38 37 41 46 44 54 45 60 57 51 54 51 62 53 sub1 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 Rep1 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 NB 45 44 52 45 43 58 62 45 63 56 55 50 53 58 56 50 57 60 50 52 sub1 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 Rep1 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 NB 54 59 58 46 38 29 30 31 33 35 59 37 44 42 41 45 38 40 Data: Beatles Row 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 Column 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 Treatment S U U TR S TR TR S TR U U S U TR U S S TR U TR Count 3 6 2 7 1 5 5 4 5 8 6 3 3 6 4 3 4 7 3 4 (continued) 180 Row 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 5 Column 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 Generalized Linear Mixed Models for Counts Treatment TR S U S TR S S U TR U S U S TR TR Count 5 2 3 6 8 5 6 7 9 4 6 5 6 9 9 B 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 Count 14 7 5 7 5 14 5 9 0 0 10 13 20 53 21 7 12 31 32 22 49 16 7 14 20 Data: Weed counts Block 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 A 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 (continued) Appendix 1 Block 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 181 A 7 7 7 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 1 1 1 1 2 2 2 2 3 3 3 3 B 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Count 20 16 6 9 9 9 19 31 11 30 29 25 11 15 23 7 22 20 3 0 28 18 18 55 58 18 19 14 44 19 17 12 8 44 0 29 11 5 49 99 66 11 15 (continued) 182 Block 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 A 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 Generalized Linear Mixed Models for Counts B 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 Count 9 8 9 21 49 49 17 22 41 21 48 11 58 34 28 20 6 9 20 0 10 0 7 9 9 29 22 4 22 31 32 41 112 44 24 28 8 8 
11 10 117 78 36 38 Shade Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Coffee data Clone C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 Tray CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 Rep R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 R2 R2 R2 R2 y1 2 2 2 0 0 2 2 0 2 2 0 2 2 2 0 2 2 2 2 2 2 0 0 2 2 y2 2 2 0 2 0 0 0 2 2 2 2 2 1 2 2 2 2 4 2 2 2 2 2 2 0 y3 4 4 4 5 3 2 4 6 4 6 4 4 5 4 6 6 4 6 5 4 3 4 4 6 4 y4 4 4 5 2 2 2 5 6 5 5 4 6 7 5 8 5 5 7 6 5 5 6 6 5 6 y5 6 2 8 6 6 6 8 8 6 8 8 8 8 6 10 8 8 8 8 8 6 8 8 2 8 y6 4 2 7 3 6 6 3 10 7 9 8 6 6 4 10 7 8 10 9 10 8 10 10 8 10 y7 5 4 5 9 7 6 3 10 7 7 4 6 4 7 8 6 5 11 6 6 5 6 7 4 9 y8 5 2 4 8 6 5 6 10 5 4 4 4 4 7 8 2 3 11 6 4 4 7 4 5 8 y9 5 4 4 8 7 5 2 9 5 3 4 4 2 8 7 2 2 10 6 2 4 7 3 4 6 y10 4 8 6 9 5 10 11 5 5 10 5 4 11 6 4 6 9 11 8 2 6 9 7 4 13 (continued) 7 8 7 4 13 y11 7 12 5 8 6 14 9 9 7 6 7 8 13 11 2 12 12 13 9 Appendix 1 183 Shade Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Clone C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 Coffee data (continued) Tray CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 Rep R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 y1 0 0 2 2 0 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 y2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 0 2 2 2 4 4 y3 4 5 4 6 5 6 4 4 4 4 4 6 4 6 4 2 2 4 4 2 4 4 4 6 4 y4 4 4 6 6 6 6 4 6 6 6 6 6 6 4 6 4 2 4 6 4 4 6 5 5 6 y5 8 8 6 6 8 10 8 8 8 8 8 6 6 6 8 4 2 6 8 6 6 8 6 4 8 y6 10 10 10 6 10 8 10 10 8 10 8 8 8 8 10 4 4 6 6 6 4 8 6 6 10 3 2 4 2 7 5 2 9 y7 10 8 3 4 6 5 8 6 7 10 8 5 8 10 4 2 y8 3 4 2 3 3 4 10 3 8 5 4 7 1 8 3 6 2 2 8 2 2 7 2 2 2 y10 5 11 3 7 1 1 11 9 13 11 14 5 11 8 10 6 8 4 8 7 1 8 12 8 y9 3 2 2 7 3 3 12 3 
12 5 4 1 7 7 6 4 2 1 3 2 4 4 6 4 6 10 6 12 9 2 8 12 12 y11 15 5 8 7 10 9 9 6 12 11 11 12 9 9 11 184 5 Generalized Linear Mixed Models for Counts Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH2 CH2 CH2 CH2 CH2 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R1 R1 R1 R1 R1 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 0 2 2 2 2 2 2 2 2 2 3 2 2 2 2 0 2 0 0 0 0 0 2 4 0 0 4 2 4 2 2 2 2 2 6 4 6 6 4 6 6 2 2 2 3 4 2 4 2 4 4 4 5 6 4 6 3 4 4 4 4 6 5 6 6 5 6 6 2 2 2 2 4 4 5 2 5 5 4 6 5 5 6 5 6 6 4 4 8 4 8 8 8 10 8 4 4 2 4 8 6 4 2 6 10 4 8 4 8 8 6 10 8 8 6 10 10 8 8 8 8 8 6 3 4 4 10 7 2 4 8 10 6 10 8 8 10 8 10 10 8 6 12 10 6 6 2 10 12 6 3 6 3 4 7 1 5 8 6 3 9 2 7 8 5 11 11 2 3 9 2 3 5 3 2 3 1 2 6 1 8 6 1 2 6 5 3 4 3 4 3 10 11 10 1 5 3 8 1 4 4 1 6 9 10 1 3 11 3 16 10 9 13 7 1 6 6 1 6 7 2 2 5 7 2 3 4 3 4 9 11 1 5 11 11 12 18 2 11 2 10 4 3 5 3 2 2 (continued) 3 9 2 6 5 1 7 9 5 1 2 11 6 16 14 8 14 11 6 11 16 16 23 4 11 10 Appendix 1 185 Shade Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Clone C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 Coffee data (continued) Tray CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 Rep R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 y1 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 y2 2 2 2 2 0 2 2 2 2 2 2 2 4 0 0 2 2 2 0 2 0 2 2 0 2 y3 6 4 6 4 4 4 4 6 6 6 5 6 4 4 4 6 4 4 2 6 2 4 4 4 4 y4 6 6 6 6 4 6 6 6 6 6 6 6 7 6 6 6 6 6 4 4 4 6 4 4 4 y5 8 8 10 8 8 8 8 10 10 8 8 8 10 8 8 8 6 8 6 6 4 6 8 8 8 y6 8 10 10 10 10 10 10 10 10 8 10 12 11 8 8 10 10 10 8 10 8 10 10 8 8 y7 9 7 11 9 6 12 10 9 12 10 10 6 
10 4 5 10 4 12 10 10 4 10 4 2 8 y9 2 1 5 6 5 14 9 4 9 5 9 7 8 5 6 3 2 2 2 9 3 y8 2 1 6 6 6 12 10 5 9 8 12 5 9 7 6 3 2 2 2 5 8 4 11 5 9 3 10 3 8 1 10 7 7 9 8 12 6 9 14 11 9 15 16 10 7 8 14 9 10 10 7 9 8 5 14 y11 1 y10 2 1 7 186 5 Generalized Linear Mixed Models for Counts Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R4 R4 R4 0 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 3 2 2 2 2 2 2 0 2 2 2 2 2 2 2 0 2 3 3 2 0 2 2 4 6 6 6 6 4 6 4 4 6 4 4 4 6 6 4 6 6 6 4 6 5 5 6 4 4 4 6 6 6 6 5 6 6 6 5 6 5 6 6 6 6 6 5 6 5 4 8 6 6 6 5 6 6 8 8 10 8 8 8 10 10 8 8 8 8 8 10 10 8 8 8 8 8 8 10 10 10 4 8 8 8 10 10 12 10 10 10 10 10 10 12 10 12 10 10 8 6 8 10 10 10 12 12 10 8 10 10 6 10 7 4 8 10 4 2 6 3 10 8 10 10 10 4 6 4 6 6 6 11 6 5 10 8 12 1 1 1 2 1 5 1 1 1 3 2 3 7 1 7 2 5 12 9 1 2 2 1 3 3 4 7 1 6 2 6 8 1 12 1 3 11 12 12 5 6 14 8 7 3 13 4 3 5 5 5 8 5 1 6 4 6 10 6 1 5 4 2 1 3 1 (continued) 10 2 2 11 5 14 6 9 18 11 8 4 4 14 6 4 11 8 Appendix 1 187 Shade Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Clone C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 Coffee data (continued) Tray CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 Rep R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 y1 2 2 0 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 2 0 2 2 0 2 2 2 0 2 0 2 3 0 3 0 2 2 0 0 2 2 0 2 2 y3 6 5 4 6 4 6 6 6 4 6 4 6 4 4 7 4 5 2 4 5 3 4 4 4 4 y4 6 6 4 6 6 6 6 6 6 6 4 8 5 6 7 6 6 4 6 5 5 6 4 6 4 y5 8 8 6 8 8 10 10 8 8 10 8 8 8 8 10 7 8 6 6 6 7 9 8 8 8 y6 8 8 
8 8 10 9 8 10 10 8 8 10 8 10 10 10 8 8 8 8 8 8 8 10 8 y7 8 5 8 8 12 11 9 10 9 8 6 10 8 9 9 9 10 7 8 10 8 10 5 9 9 y8 4 4 6 3 9 9 8 7 9 7 2 7 5 7 10 7 6 2 9 4 9 10 5 7 7 y9 3 6 6 3 9 9 7 6 6 1 1 8 5 7 10 4 7 3 9 4 9 6 3 9 6 y10 7 10 4 10 7 8 5 7 4 7 3 7 3 6 12 4 7 5 9 2 9 5 3 11 7 y11 13 9 4 6 9 18 10 12 8 10 4 7 11 14 20 13 4 7 8 2 10 8 10 14 5 188 5 Generalized Linear Mixed Models for Counts Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R3 2 2 2 2 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 0 2 2 2 0 2 2 2 0 0 0 2 2 0 2 1 2 2 2 2 2 2 2 2 0 2 2 5 4 4 4 6 6 4 4 4 4 4 4 2 4 4 6 6 4 6 4 6 4 6 4 4 4 5 4 6 6 4 6 6 6 6 6 6 6 6 2 4 6 6 8 6 6 6 6 6 6 6 6 6 6 9 8 8 6 8 8 9 8 8 8 8 8 6 8 10 10 10 8 9 8 10 8 8 8 8 8 8 8 8 10 8 10 10 10 10 8 8 10 10 8 8 8 10 10 8 8 8 10 8 10 10 10 10 10 7 10 9 5 5 10 8 6 10 3 12 12 8 6 8 10 9 7 2 8 12 8 10 10 8 10 10 5 8 8 12 10 10 9 2 6 8 10 11 9 9 10 9 15 5 14 12 10 12 5 17 12 3 6 1 4 10 14 12 8 3 8 4 4 9 7 11 9 3 11 8 3 10 2 5 7 9 1 12 11 8 7 6 5 4 6 7 4 5 10 7 11 (continued) 12 14 17 20 8 10 15 5 22 18 6 10 7 10 12 17 9 8 3 9 Appendix 1 189 Shade Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Clone C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 Coffee data (continued) Tray CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 Rep R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 R4 y1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 2 2 2 y3 4 5 6 4 3 5 6 4 4 4 4 4 4 5 4 
4 4 6 4 4 4 4 6 6 5 y4 5 5 4 4 4 6 6 6 5 6 6 6 6 5 6 6 6 6 5 6 6 6 6 6 5 y5 9 9 9 8 6 8 8 6 7 8 8 8 8 8 8 8 8 8 8 8 8 8 7 8 7 y6 10 12 8 8 8 10 11 12 8 10 6 10 10 10 8 8 10 8 12 8 10 9 10 10 8 y7 10 6 6 10 8 11 4 7 4 5 6 4 12 4 3 5 6 8 6 10 12 11 6 11 5 13 6 4 3 1 12 13 8 8 3 2 2 8 5 10 2 3 4 10 3 9 10 4 7 9 8 1 2 3 6 1 8 6 1 6 6 2 7 y9 5 2 1 9 y8 6 2 14 14 6 7 8 18 6 22 23 13 16 2 5 17 4 12 11 11 15 y11 18 8 1 10 6 3 7 6 16 10 1 2 2 10 11 3 1 9 y10 3 190 5 Generalized Linear Mixed Models for Counts Roja Roja Roja Roja Roja Roja Roja Roja Roja Roja Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 0 2 2 0 0 0 0 0 2 0 2 2 0 0 3 2 2 2 2 0 2 6 4 6 6 6 6 6 4 4 5 4 4 2 2 6 3 4 4 4 2 4 4 4 4 4 6 4 6 7 8 8 8 8 8 5 6 6 3 5 2 2 4 2 5 2 5 2 3 5 6 4 5 4 6 8 8 10 9 10 10 8 8 8 8 2 7 2 4 4 4 8 3 7 2 6 7 5 7 7 6 8 10 12 10 10 8 10 10 10 10 6 2 5 4 5 4 4 6 2 4 7 3 6 6 8 9 4 6 2 9 4 2 1 3 3 2 7 4 2 2 2 6 2 1 3 4 3 3 4 10 4 5 3 9 5 10 10 10 9 10 6 12 8 5 7 10 12 12 8 1 1 1 7 1 3 1 2 10 10 9 10 11 7 11 4 4 4 6 4 5 4 2 (continued) 8 12 14 14 6 6 1 2 9 14 16 24 23 13 2 14 12 14 21 11 9 1 13 11 14 7 Appendix 1 191 Shade Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Clone pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 Coffee data (continued) Tray CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 Rep R1 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 R3 y1 2 2 2 2 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 
2 2 2 2 0 0 0 2 2 0 2 0 2 2 2 0 3 2 2 2 2 0 2 2 2 y3 6 4 4 4 3 2 2 4 4 4 4 4 4 4 6 5 4 4 5 6 2 4 6 4 4 y4 6 3 6 3 3 2 1 4 4 3 6 4 1 4 6 3 4 5 6 6 2 4 7 6 4 y5 7 2 8 4 4 2 2 4 8 4 6 5 4 4 6 4 5 5 8 8 4 4 4 8 6 y6 3 8 8 2 6 8 2 2 6 4 7 4 6 5 7 5 5 7 4 7 5 6 2 8 4 y7 3 6 8 2 8 5 3 3 4 2 1 4 3 4 5 4 2 5 4 4 2 1 9 3 4 7 4 1 2 9 6 3 8 1 1 1 6 5 5 8 3 2 3 3 2 5 y10 3 10 7 3 2 1 2 3 2 4 2 2 1 2 1 1 2 2 11 1 2 5 3 2 1 2 y9 3 2 y8 3 1 7 7 12 16 2 5 9 6 5 11 11 2 4 5 5 5 y11 2 10 192 5 Generalized Linear Mixed Models for Counts Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 2 2 2 2 2 0 4 2 4 4 2 0 0 0 0 2 0 0 0 0 0 2 4 2 2 3 4 4 6 4 4 4 4 4 6 3 6 4 4 4 2 4 4 4 2 3 2 3 4 10 4 4 3 6 5 5 7 8 6 6 5 7 6 5 4 3 3 2 4 4 4 2 3 2 4 4 7 4 5 3 6 8 9 7 6 7 8 7 5 7 5 4 5 6 2 6 7 7 4 3 2 6 7 9 7 7 2 5 4 4 5 5 9 5 3 4 7 4 4 3 4 2 5 4 6 2 4 4 3 3 8 8 7 2 8 4 3 1 8 6 5 6 1 6 1 7 3 3 5 5 2 1 2 3 8 2 1 2 4 1 1 5 2 1 3 3 4 3 5 3 6 2 5 4 5 3 6 3 4 2 5 5 3 2 8 3 8 9 4 2 10 4 9 2 5 5 3 5 6 6 (continued) 4 9 3 13 12 4 2 11 10 9 8 2 12 10 7 6 Appendix 1 193 Shade Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Clone pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 Coffee data (continued) Tray CH1 CH1 CH1 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 Rep R4 R4 R4 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 R2 y1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 0 0 2 2 2 2 2 0 0 
2 2 2 0 2 2 0 2 0 2 2 2 2 2 2 0 y3 4 4 6 4 4 4 6 4 4 4 4 4 4 2 6 4 4 6 4 4 4 6 4 4 4 y4 5 4 6 6 6 4 6 4 4 6 6 4 4 2 6 6 6 6 4 6 6 6 6 6 6 y5 6 6 8 8 8 8 10 8 7 8 8 8 8 4 10 8 5 10 6 8 10 10 10 10 8 y6 4 4 2 9 8 8 10 8 6 6 8 8 8 5 10 10 7 8 9 4 7 8 5 8 8 10 6 12 4 1 11 2 6 11 10 3 5 8 9 9 5 9 4 9 8 6 y7 3 2 1 2 6 4 1 2 10 3 1 1 1 6 1 1 1 10 12 11 10 4 12 9 9 4 10 y9 2 1 3 12 y8 2 1 4 2 11 9 13 10 2 1 1 12 15 16 12 11 9 y11 10 4 4 12 5 7 9 7 14 15 12 12 5 y10 1 1 2 12 194 5 Generalized Linear Mixed Models for Counts Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 0 2 2 2 2 2 0 1 2 2 1 0 2 5 0 2 2 0 0 2 2 0 0 2 2 0 0 4 4 4 6 4 6 6 4 4 4 4 4 4 6 4 4 6 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 7 5 6 4 6 6 9 6 6 6 3 4 6 4 4 4 8 8 6 4 7 8 10 10 8 9 10 8 10 10 8 10 10 11 8 8 8 7 8 8 8 7 8 8 8 8 8 9 10 8 10 8 9 10 10 8 6 8 9 4 10 7 8 8 7 7 6 8 5 8 7 8 4 8 3 6 5 6 6 8 6 4 6 5 1 6 1 3 5 5 4 3 2 4 2 2 1 5 7 8 9 8 9 8 10 7 10 5 9 6 3 6 5 5 6 2 5 9 8 9 1 3 10 2 1 1 5 3 2 5 1 1 3 2 2 7 6 4 5 7 8 7 4 7 5 3 5 (continued) 6 6 11 6 6 9 9 6 8 4 3 1 3 6 7 6 11 3 10 11 9 9 7 12 4 7 5 6 3 8 9 6 7 5 9 Appendix 1 195 Shade Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Clone C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 Coffee data (continued) Tray CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH3 CH3 Rep R3 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R1 R1 y1 2 2 2 2 2 2 
2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 0 0 2 2 2 0 2 2 0 0 0 0 2 2 2 2 2 0 2 2 2 2 4 0 2 y3 2 4 4 4 4 4 4 4 4 4 4 2 2 4 4 4 4 4 4 4 4 4 6 4 2 y4 2 6 7 6 8 5 6 7 6 6 6 4 4 6 4 6 6 6 6 5 6 6 8 6 6 y5 6 8 12 8 8 2 10 5 5 8 7 4 6 10 5 8 7 9 2 6 8 8 10 8 8 y6 9 10 9 6 4 7 9 3 6 6 5 7 8 7 5 8 6 5 4 8 8 5 8 8 7 8 4 7 5 7 9 6 3 7 8 5 9 9 9 13 7 6 7 7 7 8 10 10 12 6 y11 7 2 12 10 4 5 2 4 5 6 4 2 1 1 1 1 2 3 y10 7 1 8 7 y9 6 1 3 4 3 2 3 2 1 9 1 7 1 7 8 4 5 y8 5 2 4 1 5 5 2 1 1 y7 9 9 4 5 1 5 3 1 1 2 196 5 Generalized Linear Mixed Models for Counts Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 0 0 0 2 2 0 2 2 2 2 2 2 2 0 4 0 0 2 0 0 2 0 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4 2 4 4 4 4 3 4 4 4 4 4 4 6 4 6 6 6 6 6 6 4 4 6 6 6 6 6 6 6 6 4 6 4 4 6 4 4 4 4 7 8 8 8 8 8 8 8 8 6 10 10 10 8 8 8 8 9 8 10 7 8 8 8 8 8 8 10 6 8 5 10 9 9 6 8 5 7 10 8 8 10 8 10 5 8 8 7 8 8 8 8 7 8 9 10 9 9 11 10 11 8 8 6 11 12 9 6 12 9 12 5 10 5 7 10 10 10 10 9 10 12 9 9 8 3 10 7 5 6 4 12 11 5 10 7 3 12 8 10 9 7 9 10 6 9 8 10 3 8 11 4 11 6 3 11 8 8 8 5 10 10 6 9 10 10 10 11 10 9 3 8 7 5 11 11 11 10 2 8 8 6 2 4 10 11 10 11 14 3 10 9 10 12 6 10 12 9 10 14 9 (continued) 16 13 11 11 4 10 12 8 8 12 16 11 11 17 20 5 14 14 16 16 9 15 14 10 16 9 6 Appendix 1 197 Shade Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Clone C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf Coffee data (continued) Tray CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 
CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 Rep R2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 y1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 2 0 0 0 0 2 0 2 0 2 0 0 0 2 2 0 2 0 0 0 0 2 4 2 2 y3 4 4 4 4 4 4 4 4 4 4 2 2 2 4 4 4 4 4 4 2 4 4 6 4 4 y4 6 6 6 6 6 6 6 6 6 6 2 2 2 6 6 6 4 6 4 4 8 6 6 6 6 y5 8 8 8 8 8 8 8 8 8 8 2 4 4 10 8 10 8 8 10 8 10 8 8 8 8 y6 10 10 10 8 8 10 7 6 7 8 3 6 6 8 5 8 8 8 7 6 10 8 8 10 9 y7 12 4 12 10 8 10 5 4 2 8 3 8 8 10 9 7 10 10 9 8 12 7 9 8 8 y8 9 7 5 5 6 12 12 5 2 6 3 4 7 8 3 2 6 6 6 7 11 10 7 5 10 6 4 2 1 8 10 9 11 14 12 11 5 9 11 4 6 5 10 13 19 12 16 18 26 15 11 16 8 21 22 18 12 14 4 5 6 7 9 y11 y10 7 6 7 3 7 13 14 6 9 5 5 4 5 8 y9 9 3 4 4 5 12 4 2 4 8 3 4 8 8 1 1 2 198 5 Generalized Linear Mixed Models for Counts Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Perla Negra Negra Negra Negra Negra Negra Negra Negra Negra C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R1 R1 R1 R1 R1 R1 R1 R1 R1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 2 2 2 2 0 0 2 0 0 2 0 0 2 2 0 2 2 2 2 0 4 2 2 0 0 0 0 2 2 0 4 4 6 3 5 4 4 4 4 4 4 6 6 5 4 4 5 3 4 6 4 2 2 2 4 4 2 6 6 6 5 6 4 6 6 6 4 6 6 4 5 4 6 5 5 6 6 6 4 2 2 4 5 3 7 10 8 7 6 6 8 8 6 6 8 8 9 8 6 8 7 6 6 6 8 6 2 2 6 6 4 5 7 5 8 8 2 8 7 8 4 4 6 5 4 5 6 5 4 8 8 8 8 4 4 8 8 6 1 5 1 2 5 5 6 8 3 4 3 4 4 9 7 5 3 1 2 5 5 6 6 10 4 1 1 1 1 5 3 1 5 4 4 5 5 2 5 9 6 4 9 5 10 2 2 4 4 4 1 2 4 4 7 1 1 1 1 5 5 5 5 7 5 4 6 3 9 2 7 16 4 10 4 5 3 7 8 3 11 5 (continued) 6 4 7 10 9 11 5 8 20 8 14 11 8 7 8 8 7 14 6 Appendix 1 199 Shade Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Clone C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 
C3 C3 C4 C4 C4 C5 C5 C5 pf Coffee data (continued) Tray CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 Rep R1 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 y1 2 2 2 2 2 0 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 y2 0 0 2 2 2 0 2 2 2 2 2 0 0 2 2 2 0 2 2 0 0 0 2 2 2 y3 4 4 4 4 4 2 6 4 6 6 4 4 4 4 4 6 2 6 4 2 4 4 6 6 4 y4 6 4 6 6 4 4 6 4 6 6 6 6 6 4 6 6 2 6 4 2 2 4 6 6 6 y5 6 6 8 8 6 6 8 6 8 8 6 6 4 6 8 4 6 6 6 4 2 6 6 8 6 y6 4 6 8 8 8 8 8 6 8 8 8 6 4 8 8 4 6 8 4 6 2 6 8 7 8 6 3 6 4 6 2 4 4 6 2 4 4 11 3 1 1 3 2 3 3 3 6 4 7 10 9 7 3 5 4 5 y10 8 6 8 8 3 y9 4 y8 4 y7 4 13 2 2 13 22 16 9 8 y11 200 5 Generalized Linear Mixed Models for Counts Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 R2 R2 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 4 2 2 0 0 0 0 2 2 2 2 2 2 2 2 2 2 4 4 2 0 2 2 2 2 2 6 6 6 6 4 2 2 4 6 6 4 4 6 6 6 6 6 6 6 6 6 4 6 4 6 4 4 6 7 6 6 6 6 2 4 6 6 6 6 6 4 6 6 6 6 6 8 6 6 6 5 6 6 6 8 8 8 8 8 8 2 4 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 6 4 10 10 10 4 2 8 6 10 6 6 10 4 6 10 9 10 8 8 6 4 4 8 6 6 6 3 6 2 5 5 4 6 2 8 8 6 4 5 5 1 6 4 4 3 4 5 12 7 6 3 4 2 1 2 4 4 6 6 2 4 3 4 3 2 1 3 2 2 3 2 9 5 6 1 2 5 2 13 1 7 10 13 18 3 3 7 1 7 7 1 9 4 7 4 1 2 1 5 1 6 6 1 3 1 4 5 9 6 4 (continued) 11 2 20 2 10 18 17 18 10 9 12 4 10 9 15 14 8 10 Appendix 1 201 Shade Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Clone C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 Coffee data (continued) Tray CH1 
CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH1 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 Rep R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 y1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 2 2 2 2 2 2 2 y2 0 0 2 0 2 0 2 2 4 2 2 2 0 2 0 2 2 0 0 2 2 2 0 2 2 y3 4 4 6 2 4 6 6 6 6 6 6 4 4 6 2 4 2 6 4 4 4 6 4 6 4 y4 6 4 6 2 6 4 5 6 6 5 4 6 6 6 4 4 2 6 6 6 6 6 10 6 7 y5 6 8 8 2 8 6 8 8 10 8 4 8 8 8 6 6 4 8 8 8 8 8 8 8 10 y6 2 2 3 4 6 6 8 6 10 6 4 8 7 8 8 10 6 10 8 10 10 10 7 10 10 4 3 4 5 1 5 4 3 1 7 5 4 7 8 8 3 5 6 6 4 10 9 9 8 6 8 5 6 7 5 10 10 9 5 y8 5 3 2 6 y7 9 3 3 4 3 7 1 5 4 3 4 4 1 6 3 3 1 2 3 2 y9 5 3 2 6 2 9 4 1 y10 2 5 3 2 9 8 5 7 11 8 10 7 4 1 2 9 1 3 8 5 7 y11 2 6 4 5 8 15 11 10 16 11 14 10 8 5 5 9 12 202 5 Generalized Linear Mixed Models for Counts Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 R1 R1 R1 R1 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 3 2 0 2 2 2 2 2 2 0 2 0 0 2 2 2 2 2 0 0 0 2 2 0 0 4 4 6 6 4 6 4 6 4 4 4 4 4 4 6 6 6 6 4 4 4 4 6 6 4 4 4 8 6 6 6 6 6 6 6 4 4 6 4 6 4 6 6 6 6 8 6 6 6 6 6 6 6 6 10 8 8 8 8 8 6 8 8 6 8 6 8 8 8 10 8 8 8 8 8 8 8 10 8 8 8 10 6 6 6 7 9 8 5 7 8 6 8 8 9 7 10 10 10 10 10 10 7 8 4 8 8 6 11 3 6 7 10 10 2 6 7 5 4 6 8 8 6 8 9 5 3 10 10 10 9 7 9 12 3 5 2 2 7 7 5 5 2 6 7 10 6 8 7 10 10 7 10 9 4 2 8 6 4 1 3 8 2 2 6 7 6 4 3 3 8 2 2 8 9 2 8 6 10 (continued) 12 14 12 20 4 6 7 11 10 4 2 7 7 7 10 2 10 11 12 8 2 14 6 9 2 2 3 2 7 Appendix 1 203 Shade Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Clone C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 
pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 Coffee data (continued) Tray CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 CH2 Rep R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 y1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 2 2 2 2 2 2 2 2 2 2 0 0 0 2 2 2 0 0 2 2 2 0 2 0 2 y3 4 6 4 4 6 6 4 4 4 6 4 4 6 6 6 6 2 4 6 6 6 4 6 6 6 y4 6 6 4 6 6 6 6 6 4 8 6 6 8 6 6 6 4 4 6 6 6 4 6 6 4 y5 6 8 6 8 8 8 6 8 8 8 8 8 8 8 8 8 6 4 8 8 8 6 6 8 8 y6 5 10 4 6 4 10 6 10 8 12 10 10 8 8 8 3 5 2 2 8 6 2 8 10 4 4 3 4 5 1 3 4 2 4 3 4 6 7 5 1 2 2 5 8 9 7 8 2 2 9 10 13 10 6 7 5 4 2 2 9 1 5 4 5 4 7 13 14 18 12 7 11 14 12 6 5 12 8 4 5 2 3 7 8 11 6 9 y11 y10 y9 2 3 2 4 4 y8 4 4 5 4 6 y7 7 9 5 8 7 7 7 11 8 8 12 8 9 204 5 Generalized Linear Mixed Models for Counts Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 CH2 CH2 CH2 CH2 CH2 CH2 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 R4 R4 R4 R4 R4 R4 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R1 R2 R2 R2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 2 0 2 2 2 2 0 2 2 0 4 2 2 2 2 2 2 2 0 0 0 2 6 4 6 4 4 4 6 6 4 6 6 2 6 4 4 4 4 4 6 6 6 6 6 4 4 4 6 6 8 6 6 6 6 6 6 4 4 6 4 6 6 6 6 4 6 6 6 8 8 6 6 6 6 6 10 10 8 8 8 8 8 8 4 6 8 2 8 6 8 8 6 7 8 8 8 6 8 8 8 8 10 10 6 8 8 8 10 9 6 4 8 8 6 10 8 10 10 8 6 10 9 5 8 10 10 10 6 6 2 2 5 5 7 5 6 10 12 6 4 8 8 9 6 9 9 9 6 9 7 8 8 6 7 10 12 8 3 7 10 8 6 8 8 8 8 7 7 5 2 4 5 4 3 10 1 8 6 5 3 6 4 10 12 1 2 7 11 7 11 4 9 10 3 10 1 1 2 9 7 7 7 9 10 3 4 3 6 2 5 1 5 6 9 9 6 3 9 10 6 4 4 5 (continued) 2 6 11 3 9 12 9 10 7 6 4 10 15 5 7 16 6 9 8 8 8 Appendix 1 205 Shade Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra 
Negra Negra Negra Negra Negra Clone C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 Coffee data (continued) Tray CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 Rep R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R2 R3 R3 R3 R3 R3 R3 R3 R3 R3 R3 y1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 y2 0 2 0 2 2 2 0 2 2 2 2 2 2 0 0 2 0 2 0 0 2 0 0 0 0 y3 2 6 4 6 6 6 6 4 6 6 6 6 4 4 4 6 4 4 4 4 6 4 4 4 4 y4 2 6 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 y5 4 6 6 8 8 8 8 6 6 10 8 8 8 8 6 8 8 8 8 8 10 7 7 8 8 y6 6 6 7 10 6 7 7 4 5 7 7 8 5 2 7 6 8 4 9 8 8 8 8 5 8 y7 6 7 6 10 7 6 7 4 7 6 8 9 6 4 6 2 4 5 4 5 9 4 8 3 9 y9 2 4 3 10 5 5 2 3 6 6 4 6 5 6 4 4 10 2 7 6 y8 2 4 5 12 6 7 3 6 5 3 5 4 4 6 4 4 11 2 6 7 2 6 4 7 7 6 7 4 4 2 7 4 5 3 4 7 4 8 14 2 2 2 11 8 5 2 2 1 9 10 4 1 5 6 3 6 y11 y10 206 5 Generalized Linear Mixed Models for Counts Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra Negra C4 C4 C5 C5 C5 pf pf pf C1 C1 C1 C2 C2 C2 C3 C3 C3 C4 C4 C4 C5 C5 C5 pf pf pf CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 CH3 R3 R3 R3 R3 R3 R3 R3 R3 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 R4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 0 0 0 2 0 0 2 0 0 0 0 2 0 2 2 4 4 6 2 4 4 6 4 4 4 4 4 4 4 2 2 4 6 4 6 4 4 6 4 4 4 4 6 6 6 4 6 6 6 8 8 4 6 6 4 6 2 4 6 4 6 4 4 8 6 6 6 8 8 8 8 8 8 6 8 8 8 8 8 8 6 6 6 6 10 6 8 6 8 8 8 8 8 8 10 7 8 3 6 3 5 8 5 8 7 8 5 7 5 6 10 7 6 8 7 6 8 8 10 8 10 8 8 2 5 4 5 4 3 9 9 1 6 7 7 6 10 7 8 10 6 4 8 8 11 3 5 4 7 2 3 6 7 4 1 2 10 6 2 5 4 8 1 8 1 3 2 3 5 4 4 2 2 6 8 4 2 1 6 7 6 8 2 7 2 2 1 2 7 2 8 10 6 2 2 5 8 6 2 1 10 7 4 9 2 9 Appendix 1 207 208 5 Generalized Linear Mixed Models for Counts Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International 
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 6 Generalized Linear Mixed Models for Proportions and Percentages

6.1 Response Variables as Ratios and Percentages

In this chapter, we review generalized linear mixed models (GLMMs) whose response is either a proportion or a percentage. By proportion and percentage data, we mean data whose expected value lies between 0 and 1 or between 0 and 100, respectively. For the remainder of this book, we refer to this type of data only in terms of proportions, since a proportion is converted to the percentage scale simply by multiplying it by 100.

Proportions can be classified into two types: discrete and continuous. Discrete proportions arise when the unit of observation consists of $N$ distinct entities, of which $y$ individuals have the attribute of interest. $N$ must be a positive integer and $y$ a nonnegative integer, with $y \le N$. Therefore, the observed proportion must be a discrete fraction, which can take the values $0/N, 1/N, \ldots, N/N$. A binomial distribution describes the sum of a series of $N$ independent binary trials (i.e., trials with only two possible outcomes: success or failure), where all trials have the same probability of success.
For binary and binomial distributions, the target of inference is the parameter $\pi = E(y/N)$, where $0 \le \pi \le 1$.

Continuous proportions (ratios) arise when the researcher measures responses such as the fraction of the area of a leaf infested with a fungus, the proportion of damaged cloth in a square meter, the fraction of a contaminated area, and so on. As with the binomial parameter $\pi$, continuous proportions take values between 0 and 1, but, unlike the binomial, they do not result from a set of Bernoulli trials. Instead, the beta distribution is most often used when the response variable is a continuous proportion.

In the following sections, we first address modeling issues for binary and binomial data. When the response variable is binomial, we have the option of using a linearization method (pseudolikelihood (PL)) or an integral approximation (Laplace or quadrature) (Stroup 2012).

© The Author(s) 2023 J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_6

6.2 Analysis of Discrete Proportions: Binary and Binomial Responses

A binomial distribution describes the number of successes in a series of $N$ independent binary trials, called Bernoulli trials (i.e., trials with two possible outcomes: success or failure), where all trials have the same probability of success. In the context of a GLMM, there is a set of binomial responses, each of which is the result of a series of binary trials. The $i$th response consists of two pieces of information: the number of trials $n_i$ and the number of successes $y_i$, as shown in the following example.
6.2.1 Completely Randomized Design (CRD): Methylation Experiment

An agent to induce demethylation is applied to plants; this agent converts methylated nucleotides to their unmethylated forms, thus causing epigenetic changes that produce or induce abnormal phenotypes such as deformation or stunting (Amoah et al. 2008). A pilot study was implemented to investigate the relationship between the dose of the demethylating agent and the observed proportion of plants with a normal phenotype. Seeds were treated with the demethylating agent at six different doses, including the control. Plants were sown in trays, with each tray containing seeds previously treated with the same dose of the demethylating agent. Each dose was replicated 4 times: 2 trays with 60 plants and 2 trays with 100 plants. The trays were allocated following a completely randomized design (CRD). Table 6.1 shows the number of plants with a normal phenotype in each tray, together with the number of plants per tray (N). The notation 59(60) indicates that 59 normal plants were found out of 60 plants under study. In the same way, the notation 14(100) indicates that 14 normal plants were found out of 100 plants under study. The sources of variation and degrees of freedom (DFs) for this experiment are shown in Table 6.2.

Table 6.1 Number of normal plants out of a total of N plants per tray and dose of the demethylating agent

    Dose:   0          0.01       0.1        0.5        1.0       1.5
            59(60)     58(60)     54(60)     4(60)      3(60)     3(60)
            58(60)     59(60)     53(60)     11(60)     2(60)     3(60)
            99(100)    98(100)    88(100)    14(100)    2(100)    1(100)
            98(100)    99(100)    87(100)    15(100)    1(100)    3(100)

Table 6.2 Sources of variation and degrees of freedom

    Sources of variation    Degrees of freedom
    Dose                    t - 1 = 6 - 1 = 5
    Error                   t(r - 1) = 6 × (4 - 1) = 18
    Total                   t × r - 1 = 6 × 4 - 1 = 23

Fig. 6.1 Effect of the demethylating agent on the proportion of normal plants (x-axis: dose; y-axis: observed proportion)

The statistical model of a completely randomized design (CRD) is

$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}$$

where $y_{ij}$ is the number of observed normal plants in tray $j$ ($j = 1, 2, 3, 4$) at dose $i$ ($i = 1, 2, \ldots, 6$), $\mu$ is the overall mean, $\tau_i$ is the effect of dose $i$ of the demethylating agent, and $\varepsilon_{ij}$ are non-normal errors. The number of normal plants $y_i$ observed in a set of $n_i$ trials follows a binomial distribution, $y_i \sim \text{Binomial}(n_i, \pi_i)$, where $\pi_i$ is the probability of success in each trial, with $0 \le \pi_i \le 1$, and is estimated by the observed proportion $y_i/n_i$. Thus, the probability of observing an outcome $y_i$ can be written as

$$P(Y_i = y_i \mid n_i, \pi_i) = \binom{n_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}; \quad y_i = 0, 1, \ldots, n_i.$$

This probability depends on the number of trials $n_i$, which is known, whereas the probability of success ($\pi_i$) is an unknown parameter. In Fig. 6.1, we observe that the probability of obtaining a normal plant depends on the applied dose of the demethylating agent.

Given that $y_i$ has a binomial distribution, the expected value (the mean) is the product of the number of trials and the probability of success in each trial, that is, $E(Y_i) = n_i \pi_i$. Since the number of trials is fixed (once the data have been obtained), modeling the probability of success is equivalent to modeling the expected value, and also the variance, since the variance is likewise a function of the number of trials and the probability of success. So, the expected value and variance of $y_i$ are

$$E(y_i) = \mu_i = n_i \pi_i; \quad \mathrm{Var}(y_i) = n_i \pi_i (1 - \pi_i).$$

This variance is small if the value of $\pi_i$ is close to 0 or 1, and it increases to its maximum when $\pi_i = 0.5$. This can be seen in Fig. 6.1, where proportions close to 0 or 1 show less variance than do the proportions between 0.1 and 0.2 observed at a demethylating agent dose of 0.5.
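To make the dependence of the variance on $\pi_i$ concrete, the following worked example evaluates $\mathrm{Var}(y_i) = n_i \pi_i (1 - \pi_i)$ for a tray of $n_i = 100$ plants. The two success probabilities, 0.50 and 0.98, are values assumed here purely for illustration; they are not estimates from the experiment.

```latex
\begin{align*}
\pi_i = 0.50:&\quad \mathrm{Var}(y_i) = 100 \times 0.50 \times (1 - 0.50) = 25,\\
\pi_i = 0.98:&\quad \mathrm{Var}(y_i) = 100 \times 0.98 \times (1 - 0.98) = 1.96.
\end{align*}
```

Under these assumed values, a tray whose success probability is close to 1 has a variance more than ten times smaller than a tray with $\pi_i = 0.5$, even though both contain the same number of plants.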
This variance can also be written in terms of the expected value as:

$$\mathrm{Var}(y_i) = \frac{\mu_i (n_i - \mu_i)}{n_i}.$$

In this CRD, each of the $t$ treatments (doses) was randomly assigned to $r$ experimental units (trays). The linear predictor describing the structure of the mean of this GLMM is

$$\eta_i = \eta + \tau_i$$

where $\eta_i$ denotes the $i$th linear predictor, $\eta$ is the intercept, and $\tau_i$ is the fixed effect due to treatment $i$ ($i = 1, 2, \ldots, t$), with $t$ treatments and $r_i$ replicates in each treatment. The components that define this GLMM are shown below:

    Distribution: $y_{ij} \sim \text{Binomial}(N_{ij}, \pi_i)$
    Linear predictor: $\eta_i = \eta + \tau_i$
    Link function: $\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta_i$

where $\eta_i$ is the linear predictor that relates the effect of dose $i$ ($i = 1, 2, \ldots, 6$) to the probability $\pi_i$. The model uses the linear predictor ($\eta_i$) to estimate the means ($\pi_i = \mu_i$) of the observations for each treatment. The following GLIMMIX program fits a CRD with a binomial response:

    proc glimmix nobound method=Laplace;
    class Dose Rep;
    model y/N = dose / link=logit;
    lsmeans dose / lines ilink;
    run;

In this example, the distribution of the data was not specified in the model statement because, with the expression "y/N," proc GLIMMIX automatically infers that the response has a binomial distribution. It is also important to note that the variables dose and repetition were declared as class variables in the "class" command, which Statistical Analysis Software (SAS) interprets as explanatory variables that are nonnumerical factors. However, the declared variable "Rep" is not used in the model specification.

Table 6.3 Results of the analysis of variance

    (a) Fit statistics for conditional distribution
    -2 Log L (y | r. effects)     83.46
    Pearson's chi-square          11.95
    Pearson's chi-square/DF        0.50

    (b) Type III tests of fixed effects
    Effect    Num DF    Den DF    F-value    Pr > F
    Dose      5         15        132.53     <0.0001

    (c) Dose least squares (LS) means
    Dose    Estimate    Standard error    DF    t-value    Pr > |t|    Mean       Standard error mean
    0        3.9580     0.4122            15      9.60     <0.0001     0.9813     0.007581
    0.01     3.9580     0.4122            15      9.60     <0.0001     0.9813     0.007581
    0.1      2.0049     0.1728            15     11.60     <0.0001     0.8813     0.01808
    0.5     -1.8360     0.1623            15    -11.31     <0.0001     0.1375     0.01925
    1       -3.6633     0.3580            15    -10.23     <0.0001     0.02501    0.008729
    1.5     -3.4337     0.3212            15    -10.69     <0.0001     0.03126    0.009728

Part of the results is shown in Table 6.3. The value of Pearson's chi-square statistic divided by the degrees of freedom in part (a) (Pearson's chi-square/DF = 0.50) indicates that there is no evidence of overdispersion in the dataset. The analysis of variance (ANOVA) in part (b) of Table 6.3, with the type III tests of fixed effects, indicates that there is a highly significant difference (P < 0.0001) in the average proportion of normal plants with respect to the dose applied to the seeds.

The output produced by the "lsmeans" command in conjunction with the "ilink" option is in the "Mean" column (part (c) in Table 6.3). These are the values of the $\pi_i$'s, i.e., the estimated probabilities of normal plants: $\hat{\pi}_{0} = 0.9813$ and $\hat{\pi}_{0.01} = 0.9813$ for the treatments whose doses are 0 and 0.01, respectively. For treatments with doses of 0.1 and 0.5, the estimated probabilities of normal plants are $\hat{\pi}_{0.1} = 0.8813$ and $\hat{\pi}_{0.5} = 0.1375$, respectively, whereas for the 1 and 1.5 doses, the estimated probabilities of normal plants decrease dramatically, to $\hat{\pi}_{1} = 0.02501$ and $\hat{\pi}_{1.5} = 0.03126$, respectively. Figure 6.2 shows the mean comparisons (least significant difference (LSD)) of the estimated probabilities according to the dose applied to the seeds in trays.
In Fig. 6.2, the estimated proportions of normal plants for dose = 0 (control) and dose = 0.01 are not statistically different from each other, but they do differ from those of the other doses. At a dose of 0.1, the estimated proportion of normal plants was 88.13%, which was statistically different from all other doses. Finally, at doses of 0.5, 1, and 1.5 of the demethylating agent, the estimated proportion of normal plants decreased drastically to 13.75%, 2.501%, and 3.126%, respectively. The doses of 1 and 1.5 produced statistically equal proportions of normal plants.

[Fig. 6.2 Comparison of the estimated probabilities per dose of the demethylating agent. Axes: dose (C, 0.01, 0.1, 0.5, 1, 1.5) versus average proportion.]

If the researcher wishes to model how the dose level of the demethylating agent affects the proportion of normal plants, then dose must be treated as a continuous variable, i.e., it must not appear in the "class" statement. The following SAS syntax with proc GLIMMIX runs a binomial regression:

    proc glimmix data=crd_bin method=Laplace plots=all;
      class Rep;
      model y/N = dose / solution;
      random rep;
    run; quit;

Most of the statements and options have already been discussed throughout this book. The "model y/N" specification indicates that the response is given as a ratio of events to trials, so proc GLIMMIX models it with a binomial distribution, which accounts for the different number of individuals in each repetition. The "solution" option requests the parameter estimates of the model (intercept and slope).
The components that define this GLMM are shown below:

Distribution: $y_{ij} \sim \text{Binomial}(N_{ij}, \pi_i)$
Linear predictor: $\eta_i = \eta + \beta \times \text{dose}_i$
Link function: $\text{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta_i$

Thus, the model can be written as

$$\eta_i = \log\left(\frac{\mu_i}{n_i - \mu_i}\right) = \log\left(\frac{n_i \pi_i}{n_i - n_i \pi_i}\right) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \text{logit}(\pi_i) = \eta + \beta \times \text{dose}_i$$

and the logit function can be inverted to give the probability of success, $\pi_i$, as

$$\pi_i = \frac{1}{1 + \exp(-\eta_i)}.$$

Part of the SAS output of the GLIMMIX syntax is shown in Table 6.4, which lists the goodness-of-fit statistics, type III tests of fixed effects, and parameter estimates.

Table 6.4 Regression analysis results

(a) Fit statistics
-2 Log likelihood                                        231.58
Akaike information criterion (AIC) (smaller is better)   235.58
AICC (smaller is better)                                 236.15
Bayesian information criterion (BIC) (smaller is better) 237.93
CAIC (smaller is better)                                 239.93
HQIC (smaller is better)                                 236.20
Pearson's chi-square                                     2317.12
Pearson's chi-square/DF                                  96.55

(b) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Dose     1        19       475.97    <0.0001

(c) Solutions for fixed effects
Effect      Estimate   Standard error   DF   t-value   Pr > |t|
Intercept   2.7927     0.1302           3    21.46     0.0002
Dose        -7.6232    0.3494           19   -21.82    <0.0001

The analysis of variance indicates that the demethylating agent has a highly significant effect on the observed proportion of normal plants (P < 0.0001) (part (b)). The maximum likelihood estimates for the intercept and slope are $\hat\eta = 2.7927$ and $\hat\beta = -7.6232$, respectively. Figure 6.3 shows that as the value of the linear predictor ($\eta_i$) increases, the value of the residuals rapidly decreases. We can also see that the residuals plotted against the quantiles clearly do not follow a normal distribution because this model is not a linear function of the explanatory variable "dose." Figure 6.4 shows that the observed and fitted proportions are not so far apart, and, as such, the binomial model is suitable for this dataset.
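Using the estimates in Table 6.4, the fitted probability of a normal plant can be evaluated at any dose by applying the inverse logit to the estimated linear predictor. The data step below is a minimal sketch of that calculation. The dataset name pred and the variables b0, b1, eta_hat, pi_hat, and dose50 are illustrative and do not appear in the original program.

```sas
/* Illustrative sketch: fitted probabilities from the Table 6.4 estimates */
data pred;
  b0 = 2.7927;                          /* estimated intercept */
  b1 = -7.6232;                         /* estimated slope for dose */
  dose50 = -b0 / b1;                    /* dose at which pi_hat equals 0.5 */
  do dose = 0, 0.01, 0.1, 0.5, 1, 1.5;  /* doses used in the experiment */
    eta_hat = b0 + b1*dose;             /* linear predictor (logit scale) */
    pi_hat  = 1 / (1 + exp(-eta_hat));  /* inverse logit */
    output;
  end;
run;

proc print data=pred noobs;
  var dose eta_hat pi_hat dose50;
run;
```

At dose 0, this gives a fitted probability of about 0.94, and the fitted probability falls to 0.5 at a dose of about 0.37 (2.7927/7.6232).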
The estimated linear predictor of this model is as follows:

$$\hat\eta_i = \hat\eta + \hat\beta \times \text{dose}_i = 2.7927 - 7.6232 \times \text{dose}_i.$$

[Fig. 6.3 A graph of residuals versus the linear predictor, quantiles. Panels: residual versus linear predictor; percent versus residual; residual versus quantile.]

[Fig. 6.4 Observed and estimated proportion. Axes: dose versus probability ($\pi_i$).]

The logit of the probability of success is a linear function of the explanatory variables, so the model can be written in terms of the probability of success (observing normal plants) as

$$\pi_i = \frac{1}{1 + \exp(-\eta_i)}.$$

Given the parameter estimates, we can predict the probability of observing a normal plant at a given concentration of the demethylating agent. This estimated probability, obtained from the estimated linear predictor, is plotted in Fig. 6.4:

$$\hat\pi_i = \frac{1}{1 + \exp(-\hat\eta_i)} = \frac{1}{1 + \exp(-2.7927 + 7.6232 \times \text{dose}_i)}.$$

6.3 Factorial Design in a Randomized Complete Block Design (RCBD) with Binomial Data: Toxic Effect of Different Treatments on Two Species of Fleas

A group of researchers wishes to study the toxic effect of certain treatments (Trts) on two flea species (SP) (Daphnia magna and Ceriodaphnia dubia). To compare the toxicity of the treatments on both flea species, a randomized complete block design (RCBD, with bioassays as blocks) was implemented with three replicates per treatment, each replicate consisting of 10 fleas (Appendix: Fleas).
The linear predictor for this experiment is

$$\eta_{ijkl} = \eta + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{bioassay}_k + \text{rep(bioassay)}_{l(k)}$$

where $\eta$ is the intercept, $\alpha_i$ is the fixed effect due to species $i$, $\beta_j$ is the fixed effect of treatment $j$, $(\alpha\beta)_{ij}$ is the fixed effect of the interaction between flea species and treatment, $\text{bioassay}_k$ is the random effect due to bioassay $k$, assuming $\text{bioassay}_k \sim N(0, \sigma^2_{\text{bioassay}})$, and $\text{rep(bioassay)}_{l(k)}$ is the random effect due to replicate $l$ within bioassay $k$, assuming $\text{rep(bioassay)}_{l(k)} \sim N(0, \sigma^2_{\text{rep(bioassay)}})$. The remaining components of this GLMM with a binomial response are described below:

Distribution: $y_{ijkl} \mid \text{bioassay}_k, \text{rep(bioassay)}_{l(k)} \sim \text{Binomial}(N_{ijkl}, \pi_{ijkl})$
$\text{bioassay}_k \sim N(0, \sigma^2_{\text{bioassay}})$, $\text{rep(bioassay)}_{l(k)} \sim N(0, \sigma^2_{\text{rep(bioassay)}})$
Link function: $\text{logit}(\pi_{ijkl}) = \log\left(\frac{\pi_{ijkl}}{1 - \pi_{ijkl}}\right) = \eta_{ijkl}$

where $y_{ijkl}$ is the number of surviving fleas (the variable Sobrevi in the program below) out of the $N_{ijkl}$ fleas observed in species $i$, replicate $l$, and bioassay $k$ under treatment $j$. The following SAS syntax allows us to fit the GLMM with a binomial response.

    proc glimmix data=pulgas nobound method=laplace;
      class Bioen SP Trt Rep;
      model Sobrevi/n = SP|Trt / dist=binomial;
      random Bioen sp*bioen(rep);
      lsmeans SP|Trt / lines ilink;
    run;

Table 6.5 Results of the analysis of variance

(a) Fit statistics
-2 Log likelihood           145.33
AIC (smaller is better)     173.33
AICC (smaller is better)    177.85
BIC (smaller is better)     160.71
CAIC (smaller is better)    174.71
HQIC (smaller is better)    147.97

(b) Fit statistics for conditional distribution
-2 Log L (Sobrevi | r. effects)   145.33
Pearson's chi-square              10.72
Pearson's chi-square/DF           0.10

(c) Covariance parameter estimates
Cov Parm         Estimate   Standard error
Bioen            -0.1051    .
Bioen*SP (Rep)   -0.1192    .

(d) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
SP       1        14       0.02      0.8829
Trt      5        80       15.08     <0.0001
SP*trt   5        80       4.66      0.0009

Part of the results is listed in Table 6.5.
The fit statistics in part (a) and the conditional statistics in part (b) are useful for model comparison, whereas the variance component estimates are shown in part (c). The value of Pearson's chi-square/DF = 0.10 shows no evidence of overdispersion under the binomial model. The variance component estimates for bioassays and replication nested in bioassays are $\hat\sigma^2_{\text{bioassay}} = -0.1051$ and $\hat\sigma^2_{\text{rep(bioassay)}} = -0.1192$, respectively; negative estimates are possible here because the "nobound" option removes the lower bound of zero on the variance components. The type III tests of fixed effects (part (d)) show the significance tests of the fixed effects in the model. The treatment effect and the interaction between flea species (SP) and treatment are clearly significant, with P < 0.0001 and P = 0.0009, respectively. Since survival was statistically similar in both flea species, we will focus on the factors that were significant. Part (a) in Table 6.6 shows the treatment means and standard errors on the model scale ("Estimate" column) and on the data scale ("Mean" column), obtained with "lsmeans" and the "ilink" option, and part (b) shows the mean comparisons, which are on the model scale.

Table 6.6 Means and standard errors on the model scale and on the data scale

(a) Trt least squares means
Trt   Estimate   Standard error   DF   t-value   Pr > |t|   Mean       Standard error mean
T1    8.1179     4.3180           80   1.88      0.0637     0.9997     0.001287
T2    4.3564     3.0554           80   1.43      0.1578     0.9873     0.03820
T3    1.0081     0.1924           80   5.24      <0.0001    0.7326     0.03768
T4    -1.0509    0.1712           80   -6.14     <0.0001    0.2591     0.03286
T5    -4.7187    3.0570           80   -1.54     0.1266     0.008848   0.02681
T6    -8.1182    4.3184           80   -1.88     0.0638     0.000298   0.001286

(b) Conservative T grouping of Trt least squares means (α = 0.05). LS means with the same letter are not significantly different.
Trt   Estimate   Letter group
T1    8.1179     A
T2    4.3564     B A
T3    1.0081     B A
T4    -1.0509    B D
T5    -4.7187    D
T6    -8.1182    D
A letter C also appears on three rows of the original display; its row positions are not recoverable from the extracted layout.
The LINES display does not reflect all significant comparisons. The following additional pairs are significantly different: (T3, T4).

Based on the fixed effects tests, the flea species × treatment interaction is significant. The means on the model scale are listed under the "Estimate" column, followed by their standard errors, "Standard error" (Table 6.7). The "ilink" option in "lsmeans" applies the inverse of the link function to the estimates on the model scale to obtain the estimates on the data scale. The probabilities, on the data scale, are given under the "Mean" column with their respective standard errors and correspond to the probability of flea survival. Figure 6.5 shows that the survival of the two species differs in treatments 2–5; the Daphnia species showed more resistance in treatments 2 and 3, whereas the Ceriodaphnia species showed greater resistance in treatments 4 and 5. On the other hand, in treatments 1 and 6, survival was similar in both species.

6.4 A Split-Plot Design in an RCBD with a Normal Response

A split plot is the most common treatment structure design in agricultural and agroindustrial research areas. These experiments generally involve two or more factors under study. Typically, large or primary experimental units, commonly known as whole plots, are grouped into blocks. The levels of the first factor are randomly assigned to the whole plots. Then, each whole plot is divided into smaller units, known as subplots (split or secondary plots). The levels of the second factor are randomly assigned to the subplots within each whole plot.
Table 6.7 Means and standard errors on the model scale and on the data scale of the interaction between both factors

SP*treatment least squares means
SP             Treatment   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
Daphnia        T1          8.1179     6.1065           80   1.33      0.1875     0.9997   0.001820
Daphnia        T2          8.1180     6.1068           80   1.33      0.1875     0.9997   0.001820
Daphnia        T3          1.9717     0.3218           80   6.13      <0.0001    0.8778   0.03452
Daphnia        T4          -1.2537    0.2536           80   -4.94     <0.0001    0.2221   0.04381
Daphnia        T5          -8.1186    6.1085           80   -1.33     0.1876     0.0002   0.001819
Daphnia        T6          -8.1182    6.1073           80   -1.33     0.1875     0.0002   0.001819
Ceriodaphnia   T1          8.1178     6.1064           80   1.33      0.1875     0.9997   0.001820
Ceriodaphnia   T2          0.5947     0.2202           80   2.70      0.0084     0.6444   0.05046
Ceriodaphnia   T3          0.04446    0.2109           80   0.21      0.8336     0.5111   0.05269
Ceriodaphnia   T4          -0.8480    0.2301           80   -3.69     0.0004     0.2999   0.04830
Ceriodaphnia   T5          -1.3188    0.2583           80   -5.11     <0.0001    0.2110   0.04301
Ceriodaphnia   T6          -8.1182    6.1071           80   -1.33     0.1875     0.0002   0.001819

[Fig. 6.5 The average survival rate of both species. Axes: treatment (Trt1 to Trt6) versus survival probability; series: Daphnia, Ceriodaphnia.]

The model equation for the analysis of variance assuming normality in the response is

$$y_{ijk} = \eta + \alpha_i + r_k + (ra)_{ik} + \beta_j + (\alpha\beta)_{ij} + e_{ijk}, \quad i = 1, 2, \cdots, a; \; j = 1, 2, \cdots, b; \; k = 1, 2, \cdots, r$$

where $y_{ijk}$ is the observed response in the $k$th block at the $i$th level of factor A and the $j$th level of factor B, $\alpha_i$ and $\beta_j$ are the fixed treatment effects due to factors A and B, respectively, $(\alpha\beta)_{ij}$ is their interaction, $r_k$ is the random effect due to blocks, $(ra)_{ik}$ is the random whole-plot error term, which is the interaction between blocks and factor A, and $e_{ijk}$ is the random residual effect.
Normally, the errors and other random terms are also assumed to be normal. This way of specifying the model is valid under the assumption that the response variable is normal; when the response variable is not normally distributed, it is not the most appropriate specification.

6.4.1 An RCBD Split Plot with Binomial Data: Carrot Fly Larval Infestation of Carrots

Data were obtained from an experiment that was designed to compare a number of carrot genotypes with respect to their resistance to infestation by carrot fly larvae. The data involved 16 genotypes that were compared at 2 levels of pest control. The experiment was conducted in three randomized blocks. Each block consisted of 32 plots, 1 for each combination of genotype and pest infestation level.

Table 6.8 Carrot fly infestation data. The notation 44/53 denotes that 44 carrots were infected ($y$) out of a sample size of 53 studied ($N$)

           Treatment (level of infestation) 1    Treatment (level of infestation) 2
Genotype   Block1   Block2   Block3              Block1   Block2   Block3
G1         44/53    42/48    27/51               16/60    9/52     26/54
G2         24/48    35/42    45/52               13/44    20/48    16/53
G3         8/49     16/49    16/50               4/52     6/51     12/43
G4         4/51     5/42     12/46               15/52    10/56    6/48
G5         11/52    13/51    15/44               4/51     6/43     9/46
G6         15/50    5/49     7/50                1/51     8/49     3/54
G7         18/52    13/47    7/47                2/52     4/52     6/52
G8         5/47     15/49    8/50                6/56     4/50     6/42
G9         11/52    6/45     5/51                3/54     8/51     3/53
G10        0/51     10/39    14/48               3/50     0/50     10/51
G11        6/52     4/46     10/37               1/52     7/38     4/48
G12        0/52     4/55     1/40                1/50     3/50     1/45
G13        14/45    18/43    4/40                4/51     7/46     7/45
G14        3/52     12/53    4/55                3/52     7/48     12/49
G15        11/52    6/54     5/49                2/50     4/46     14/53
G16        4/53     1/40     4/52                4/56     1/44     3/42

Table 6.9 Sources of variation and degrees of freedom

Sources of variation           Degrees of freedom
Blocks                         r - 1 = 3 - 1 = 2
Factor A (infestation)         a - 1 = 2 - 1 = 1
Error a (A*blocks)             (r - 1)(a - 1) = 2
Factor B (genotypes)           b - 1 = 16 - 1 = 15
Infestation*genotype (A*B)     (a - 1)(b - 1) = 15
Error b                        a(r - 1)(b - 1) = 2 × 2 × 15 = 60
Total                          r × a × b - 1 = 3 × 2 × 16 - 1 = 95
At the end of the experiment, about 50 carrots were taken from each plot and assessed for infestation by carrot fly larvae. The data obtained are shown in Table 6.8. Table 6.9 summarizes the sources of variation and degrees of freedom of the analysis of variance. Rewriting the model in terms of the linear predictor gives

$$\eta_{ijk} = \eta + \alpha_i + r_k + (ra)_{ik} + \beta_j + (\alpha\beta)_{ij}$$

Since the observations were taken at the subplot level, conditioned on the structural effects of the design, these observations have a variance associated with the subplot. Therefore, $\alpha_i$ and $\beta_j$ refer to the fixed treatment effects due to factors A and B, respectively; $(\alpha\beta)_{ij}$ refers to the interaction of these factors; $r_k$ is the random effect due to blocks; and the blocks × whole plot term $(ra)_{ik}$ is assumed to contribute to the variation, such that $r_k \sim N(0, \sigma^2_r)$ and $(ra)_{ik} \sim N(0, \sigma^2_{\text{block} \times A})$.

Table 6.10 Results of the analysis of variance

(a) Fit statistics for conditional distribution
-2 Log L (y | r. effects)    527.82
Pearson's chi-square         189.09
Pearson's chi-square/DF      1.97

(b) Covariance parameter estimates
Cov Parm    Subject   Estimate   Standard error
Intercept   Bloque    0.004272   0.02741
Trt         Bloque    0.03344    0.03545

(c) Type III tests of fixed effects
Effect         Num DF   Den DF   F-value   Pr > F
Genotype       15       60       28.28     <0.0001
Trt            1        2        16.24     0.0564
Genotype*Trt   15       60       4.45      <0.0001

This model uses the linear predictor $\eta_{ijk}$ to estimate the mean of the observations $\mu_{ijk}$. The specification of this GLMM is as follows:

Distribution: $y_{ijk} \mid r_k, (ra)_{ik} \sim \text{Binomial}(N_{ijk}, \pi_{ijk})$
$r_k \sim N(0, \sigma^2_r)$, $(ra)_{ik} \sim N(0, \sigma^2_{\text{block} \times A})$
Link function: $\text{logit}(\pi_{ijk}) = \eta_{ijk}$

The following SAS GLIMMIX program allows the fitting of a GLMM with a split-plot structure in a randomized complete block design with a binomial response.
    proc glimmix data=spd_pp nobound method=quadrature;
      class Genotype Trt Block;
      model y/N = Genotype|Trt;
      random intercept trt / subject=block;
      lsmeans Genotype|Trt / lines ilink;
    run;

The program uses the quadrature estimation method (method=quadrature), which produces results similar to those of the Laplace method. Part of the results is provided in Table 6.10. The value of Pearson's chi-square/DF in part (a) gives an idea of whether there is overdispersion or extra variation in the dataset. In this case, Pearson's chi-square/DF = 1.97 indicates that there is overdispersion in the dataset, so alternatives such as pseudo-likelihood (PL) estimation with a scale parameter or a different distribution should be considered. In addition, the estimated variance components due to blocks and blocks × treatment (the whole plot) in part (b) are $\hat\sigma^2_{\text{block}} = 0.004272$ and $\hat\sigma^2_{\text{block} \times A} = 0.03344$, respectively. The results of the fixed effects tests (part (c)) indicate that the effect of genotype and the interaction between genotype and treatment are significant.

The appropriate method for model evaluation depends on whether or not there is evidence of overdispersion, so we consider this issue below. The residual variance incorporates systematic discrepancies between the model and the observed responses, variation between replicates (observations in independent experimental units with the same values of the explanatory variables), and sampling variation arising from the distribution of the data, which in this case is the binomial distribution. If there are no replicate observations and the fitted model provides an adequate description of the systematic trend, then only sampling variation contributes to the residual variance. If this is true, then the residual deviance has an approximate chi-squared distribution with degrees of freedom equal to the residual degrees of freedom.
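As a quick check of the overdispersion diagnostic, the ratio in part (a) of Table 6.10 can be reproduced by hand. The divisor of 96 used below is an assumption: it is the number of plot-level observations (3 blocks × 2 infestation levels × 16 genotypes), which is the value that reproduces the reported ratio.

```latex
\frac{\chi^2_{\text{Pearson}}}{\text{DF}} = \frac{189.09}{96} \approx 1.97 .
```

Under a well-specified binomial model, this ratio would be expected to be close to 1, so a value near 2 suggests roughly twice the variation that the binomial distribution allows for.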
Since there is overdispersion in the data under the binomial distribution, there are three alternatives we can explore: (1) review the linear predictor, which involves carefully revising the analysis of variance table; (2) add a scale parameter; or (3) use another distribution for the dataset. Each of these alternatives is discussed below, in this order.

6.4.1.1 Linear Predictor Review ($\eta_{ijk}$)

If the proportion of infested carrots ($\pi_{ijk}$) is being affected by the genotype within each infestation level (trt = $\alpha_i$) from plot to plot within each of the blocks, then a nested effect of genotype within infestation levels (trt) could be included in the analysis of variance. Thus, the linear predictor would be defined as

$$\eta_{ijk} = \eta + \alpha_i + r_k + (ra)_{ik} + \beta(\alpha)_{j(i)}$$

where $\alpha_i$ is the fixed effect due to treatments, $\beta(\alpha)_{j(i)}$ is the effect of genotypes nested within a treatment, $r_k$ is the random effect due to blocks, with $r_k \sim N(0, \sigma^2_r)$, and $(ra)_{ik}$ is the interaction between blocks and treatment, with $(ra)_{ik} \sim N(0, \sigma^2_{RA})$. The following GLIMMIX syntax estimates the above linear predictor:

    proc glimmix data=spd_pp method=laplace;
      class Genotype Trt Block;
      model y/N = Trt genotype(trt);
      random intercept trt / subject=block;
      lsmeans genotype(trt) / lines ilink slice=trt slicediff=trt;
    run;

The main difference between this proc GLIMMIX program and the previous one is that the genotype main effect and the genotype × treatment interaction have been replaced by the nested effect of genotypes within treatment, genotype(trt). Part of the results is shown in Table 6.11.

Table 6.11 Results of the analysis of variance, under a new linear predictor

(a) Fit statistics for conditional distribution
-2 Log L (y | r. effects)    527.82
Pearson's chi-square         189.07
Pearson's chi-square/DF      1.97

(b) Covariance parameter estimates
Cov Parm    Subject   Estimate   Standard error
Intercept   Bloque    0.004265   0.02740
Trt         Bloque    0.03343    0.03544

(c) Type III tests of fixed effects
Effect           Num DF   Den DF   F-value   Pr > F
Trt              1        2        16.20     0.0565
Genotype (Trt)   30       60       15.83     <0.0001

Neither the value of Pearson's chi-square/DF (part (a)) nor the other fit statistics decreased when the linear predictor was modified. However, the F-values calculated for treatments and genotypes within treatments (part (c)) are smaller than those obtained in the split-plot design. Since the overdispersion is still present (Pearson's chi-square/DF = 1.97), another alternative is to add a scale parameter to the model. This alternative is presented below.

6.4.1.2 Scale Parameter

If the residual deviance is larger than expected when compared to critical values of the appropriate chi-squared distribution, and if this cannot be corrected by redefining the linear predictor of the model, then there is more variation present than can be accounted for by the assumed distribution. In this case, we say that the data show overdispersion. The simplest way to deal with overdispersion is to extend the model by scaling the variance function. Adding the scale parameter replaces $\mathrm{Var}(y_{ij}) = \pi_{ij}(1 - \pi_{ij})$ with $\mathrm{Var}(y_{ij}) = \phi\,\pi_{ij}(1 - \pi_{ij})$. The rationale for this approach is discussed by Collett (2002). The parameter $\phi$ is a scale factor, called the dispersion parameter, which summarizes the degree of overdispersion present in the observations. Clearly, $\phi = 1$ corresponds to the original distribution model. This parameter can be estimated in several different ways.
The logarithm of the likelihood of the binomial distribution is given by

$$\log\binom{N_{ij}}{y_{ij}} + y_{ij} \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) + N_{ij} \log(1 - \pi_{ij})$$

In the log-likelihood, the term $y_{ij} \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right)$ is very important; the quantity that multiplies $y_{ij}$ is known as the natural or canonical parameter, and this parameter is always a function of the mean. For the binomial distribution, the mean is $N_{ij}\pi_{ij}$ and the natural parameter is $\log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right)$, which, in categorical data analysis, is known as the "log odds." The generalized estimating equation (GEE) method provides a valid analysis for marginal means, since, under a binomial distribution, the quasi-likelihood variance is given by $\phi\,\pi_{ij}(1 - \pi_{ij})$. This is achieved by adding the "random _residual_" statement to the SAS syntax. The following GLIMMIX statements invoke the scale parameter, using the first linear predictor proposed for these data.

    proc glimmix data=spd_pp nobound;
      class Genotype Trt Block;
      model y/N = Trt|genotype;
      random intercept trt / subject=block;
      random _residual_;
      lsmeans Trt|genotype / lines ilink;
    run;

In this syntax, we still keep the binomial distribution (y/N tells GLIMMIX that the response is binomial) but add the "random _residual_" statement. In this case, we cannot obtain maximum likelihood estimates, because the "random _residual_" statement cannot be combined with the Laplace ("method = laplace") or adaptive quadrature ("method = quad") approximations, so the estimation is performed through the pseudo-likelihood (PL) method. This causes the scale parameter to be estimated, and, consequently, it is used in the adjustment of all standard errors and statistical tests. Proc GLIMMIX uses the generalized chi-square statistic of McCullagh and Nelder (1989), i.e., $\chi^2/\text{df}$, as the estimator of the scale parameter ($\hat\phi$).
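To make the estimator concrete, the scale parameter reported for this model (Table 6.12) can be reproduced from the generalized chi-square statistic. The divisor of 64 is an assumption made here: 96 observations minus the 32 fixed-effect parameters of the genotype × treatment structure.

```latex
\hat\phi = \frac{\chi^2_{\text{gen}}}{\text{df}} = \frac{200.09}{64} \approx 3.126 .
```

The F-values of the subplot-level effects shrink by about this factor. For example, the genotype F-value of 28.28 in Table 6.10 divided by 3.1263 gives about 9.05, close to the 9.04 reported in Table 6.12.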
All standard errors from the analysis under a binomial distribution are multiplied by $\sqrt{\hat\phi}$, and all F-tests are divided by $\hat\phi$, to account for overdispersion. Part of the output is shown in Table 6.12.

Table 6.12 Results of the analysis of variance, adding a scale parameter to the model

(a) Fit statistics
-2 Res log pseudo-likelihood   182.52
Generalized chi-square         200.09
Gener. chi-square/DF           3.13

(b) Covariance parameter estimates
Cov Parm                             Subject   Estimate   Standard error
Intercept                            Bloque    0.005416   0.04750
Trt                                  Bloque    0.03202    0.06338
Residual variance component (VC)               3.1263     0.5719

(c) Type III tests of fixed effects
Effect         Num DF   Den DF   F-value   Pr > F
Trt            1        2        10.20     0.0856
Genotype       15       60       9.04      <0.0001
Genotype*Trt   15       60       1.42      0.1674

The value of the generalized chi-square statistic in part (a) indicates that overdispersion has not been eliminated; on the contrary, chi-square/DF = 3.13 shows that this value has increased. This result indicates that adding a scale parameter to the model does not decrease the extra variation present in the dataset, since the binomial assumption forces a relationship between the mean and the variance that may not hold for the data being analyzed. On the other hand, the estimated scale parameter is $\hat\phi = 3.1263$ (part (b)). An analysis of the Pearson residuals showed that their variance is 3.6257, which is considerably larger than 1, implying a large overdispersion. In addition, the results of the fixed effects tests (part (c)) differ from those above (Table 6.12). Therefore, the third option, based on assuming an alternative distribution (the beta distribution) for the response variable, is discussed below.

6.4.1.3 Alternative Distribution

Another approach to control the overdispersion would be to model the data with a different distribution defined on the interval (0, 1), such as the beta distribution.
Generally, this distribution yields good results when all experimental units have the same number of observations (successes plus failures), i.e., when $N_{ijk} = N$. When $N_{ijk}$ varies only a little, the beta distribution still yields acceptable results in many cases. It is important to mention that the proportions come from binomial counts, and, therefore, we now define the response variable as $p_{ijk} = y_{ijk}/N_{ijk}$ so that it can be modeled with the beta distribution. The components of the beta response model are listed below:

Distribution: $p_{ijk} \mid r_k, (ra)_{ik} \sim \text{Beta}(\pi_{ijk}, \phi)$, with $\phi$ as the scale parameter
$r_k \sim N(0, \sigma^2_r)$, $(ra)_{ik} \sim N(0, \sigma^2_{RA})$
Linear predictor: $\eta_{ijk} = \eta + \alpha_i + r_k + (\alpha r)_{ik} + \beta_j + (\alpha\beta)_{ij}$
Link function: $\text{logit}(\pi_{ijk}) = \log\left(\frac{\pi_{ijk}}{1 - \pi_{ijk}}\right) = \eta_{ijk}$

As mentioned before, the response variable is now $p_{ijk} = y_{ijk}/N_{ijk}$, which is not the same response as the one used under the binomial distribution. The following SAS statements fit a GLMM in a split-plot randomized complete block design with a beta response. Before implementing this model in SAS GLIMMIX, the variable $p = p_{ijk} = y_{ijk}/N_{ijk}$ was defined.

    proc glimmix data=spd_pp nobound method=laplace;
      class Genotype Trt Block;
      model p = Genotype|Trt / dist=beta;
      random intercept trt / subject=block;
      lsmeans Genotype|Trt / lines ilink;
    run;

Table 6.13 Fit statistics assuming binomial and beta distributions

(a) Fit statistics
                             Binomial   Beta
-2 Log likelihood            541.85     -246.49
AIC (smaller is better)      609.85     -176.49
AICC (smaller is better)     648.87     -132.28
BIC (smaller is better)      579.20     -208.04
CAIC (smaller is better)     613.20     -173.04
HQIC (smaller is better)     548.24     -239.91

(b) Fit statistics for conditional distribution
                             Binomial   Beta
-2 Log L (y | r. effects)    527.82     -254.68
Pearson's chi-square         189.09     93.95
Pearson's chi-square/DF      1.97       1.01

Table 6.14 Results of the analysis of variance, assuming binomial and beta distributions

(a) Covariance parameter estimates (cells are listed in the order in which they appear in the source; a blank cell is not recoverable)
                                Binomial                      Beta
Cov Parm             Subject    Estimate   Standard error     Estimate   Standard error
Intercept            Bloque     0.004272   0.02741            -0.00524   .
Trt                  Bloque     0.03344    0.03545            0.02175    0.1475
Scale $\hat\phi$                           .                  25.7070

(b) Type III tests of fixed effects
                                   Binomial               Beta
Effect         Num DF   Den DF     F-value   Pr > F       F-value   Pr > F
Trt            1        4          16.24     0.0564       9.98      0.0342
Genotype       15       60         28.28     <0.0001      13.25     <0.0001
Genotype*Trt   15       60         4.45      <0.0001      2.23      0.0146

Some of the SAS GLIMMIX output is listed in Tables 6.13 and 6.14. Based on the fit statistics under the binomial (first alternative) and beta distributions (Table 6.13), the values of the statistics are clearly lower under the beta distribution than under the binomial distribution, indicating that the beta distribution provides a better fit (part (a)). Looking at the fit statistics for the conditional model in part (b), the values of the three fit statistics in the binomial model are higher than those in the beta model. The value of Pearson's chi-square/DF under the beta distribution is 1.01. This value indicates that the overdispersion has been virtually eliminated and that, therefore, the beta distribution is a better candidate model for this dataset. Introducing the scale parameter ($\phi$) changes the variance component estimates and their standard errors (Table 6.14, part (a)), and, therefore, the F- and t-tests are affected (part (b)).
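The beta model needs the proportion p to exist in the input dataset. A minimal sketch of that preparation step is given below, assuming the counts y and N are stored in spd_pp as in the binomial programs. The output dataset name spd_beta and the flag variable boundary are hypothetical. Because the beta distribution is defined only on the open interval (0, 1), plots with y = 0 or y = N are flagged, since a beta likelihood cannot use such observations.

```sas
/* Illustrative sketch: build the proportion used as the beta response */
data spd_beta;
  set spd_pp;
  p = y / N;                        /* observed proportion of infested carrots */
  boundary = (p <= 0 or p >= 1);    /* 1 if p falls outside the open interval (0,1) */
run;

/* Count how many plots sit on the boundary */
proc freq data=spd_beta;
  tables boundary;
run;
```

In Table 6.8, three plots have y = 0 (G10 and G12 in block 1 of treatment 1, and G10 in block 2 of treatment 2). If these plots are excluded, 93 observations remain, which is consistent with the ratio 93.95/93 ≈ 1.01 in Table 6.13 and with the 57 degrees of freedom reported for the genotype means under the beta model.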
The estimated value of the scale parameter is $\hat\phi = 25.7018$. The variance components based on the binomial and beta models were listed in Table 6.14. The treatment means (part (a)) and genotype means (part (b)) are presented in Table 6.15.

Table 6.15 Estimated means and standard errors on the model scale and the data scale

(a) Trt least squares means
Trt    Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
Trt1   -1.2362    0.01768          2    -69.94    0.0002     0.2251   0.003083
Trt2   -1.9327    0.01768          2    -109.34   <0.0001    0.1264   0.001952

(b) Genotype least squares means
Genotype   Estimate   Standard error   DF   t-value   Pr > |t|   Mean      Standard error mean
G1         0.1524     0                57   Infty     <0.0001    0.5380    0
G10        -1.4143    0                57   -Infty    <0.0001    0.1956    0
G11        -1.8698    0                57   -Infty    <0.0001    0.1336    0
G12        -2.8971    0.03885          57   -74.58    <0.0001    0.05230   0.001925
G13        -1.4336    0                57   -Infty    <0.0001    0.1925    0
G14        -1.8761    0.1304           57   -14.39    <0.0001    0.1328    0.01502
G15        -1.8618    0                57   -Infty    <0.0001    0.1345    0
G16        -2.6686    0                57   -Infty    <0.0001    0.06485   0
G2         0.2225     0                57   Infty     <0.0001    0.5554    0
G3         -1.3329    0                57   -Infty    <0.0001    0.2087    0
G4         -1.5897    0                57   -Infty    <0.0001    0.1694    0
G5         -1.3696    0                57   -Infty    <0.0001    0.2027    0
G6         -2.0173    0                57   -Infty    <0.0001    0.1174    0
G7         -1.7001    0.1356           57   -12.53    <0.0001    0.1545    0.01771
G8         -1.7161    0                57   -Infty    <0.0001    0.1524    0
G9         -1.9796    0                57   -Infty    <0.0001    0.1214    0

The estimates on the model scale are listed under the column "Estimate" with their respective standard errors ("Standard error"), and the values on the data scale are listed under the column "Mean" with their respective standard errors ("Standard error mean"). In the table of least squares means for genotypes, inconsistencies are observed in the t-values and in the standard errors of the means (standard errors of 0 and infinite t-values), so other estimation alternatives should be sought. In large samples, the binomial and normal distributions are quite similar.
Logically, the latter two analyses, binomial and beta, are attractive because of their consistency with the nature of the data. Because of the inconsistencies in the estimates for genotypes (t-value = Infty and standard errors of the mean equal to 0), a more robust approach could be used; in this case, this is the normal distribution. Assuming that $p_{ijk}$ has a normal distribution with mean $\mu_{ijk}$ and constant variance $\sigma^2$, the components of this model are as follows:

Distribution: $pct_{ijk} \mid r_k, (ra)_{ik} \sim \text{Normal}(\mu_{ijk}, \sigma^2)$
$r_k \sim N(0, \sigma^2_r)$, $(ra)_{ik} \sim N(0, \sigma^2_{RA})$
Linear predictor: $\eta_{ijk} = \eta + \alpha_i + r_k + (\alpha r)_{ik} + \beta_j + (\alpha\beta)_{ij}$
Link function: $\eta_{ijk} = \mu_{ijk}$ (identity)

Similarly, in this example, the response variable used was $pct_{ijk} = y_{ijk}/N_{ijk}$. This response variable is not the same as the one used under the binomial distribution. The following SAS GLIMMIX statements fit a linear mixed model (LMM) for a split plot in a randomized complete block design with a normal response.

    proc glimmix data=spd_pct nobound;
      class Genotype Trt Block;
      model pct = Genotype|Trt;
      random block block*trt;
      lsmeans Genotype|Trt / lines;
    run;

Part of the results is shown in Table 6.16.

Table 6.16 Results of the analysis of variance, assuming a normal distribution

(a) Fit statistics
-2 Res log likelihood        -79.38
AIC (smaller is better)      -73.38
AICC (smaller is better)     -72.98
BIC (smaller is better)      -76.08
CAIC (smaller is better)     -73.08
HQIC (smaller is better)     -78.81
Generalized chi-square       0.60
Gener. chi-square/DF         0.01

(b) Covariance parameter estimates
Cov Parm     Estimate   Standard error
Bloque       0.000123   0.000742
Trt*bloque   0.000329   0.000925
Residual     0.009442   0.001724

(c) Type III tests of fixed effects
Effect         Num DF   Den DF   F-value   Pr > F
Genotype       15       60       12.59     <0.0001
Trt            1        2        20.46     0.0456
Genotype*Trt   15       60       2.93      0.0016
The values of the fit statistics in part (a) of Table 6.16 are clearly lower than those estimated under the previous options. This indicates that the normal distribution is reasonable, even though the response is a proportion. The estimated variance components tabulated in part (b), due to blocks, blocks × treatment, and the mean squared error (MSE) (Residual = Gener. chi-square/DF), are $\hat{\sigma}^2_{block} = 0.000123$, $\hat{\sigma}^2_{block \times trt} = 0.000329$, and $\hat{\sigma}^2 = MSE = 0.009442 \approx 0.01$, respectively.

The F-tests for the fixed effects of genotype (P < 0.0001), treatment (P = 0.0456), and their interaction (P = 0.0016) are all significant at the 5% level, providing evidence that the proportion of infested carrots differs among genotypes and treatments (part (c)).

Table 6.17 Means and standard errors for genotypes and treatments

(a) Genotype least squares means
Genotype   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
G1         0.5260     0.04086          60   12.87     <0.0001    0.5260   0.04086
G10        0.1340     0.04086          60   3.28      0.0017     0.1340   0.04086
G11        0.1522     0.04086          60   3.73      0.0004     0.1522   0.04086
G12        0.03332    0.04086          60   0.82      0.4179     0.0333   0.04086
G13        0.2026     0.04086          60   4.96      <0.0001    0.2026   0.04086
G14        0.1342     0.04086          60   3.28      0.0017     0.1342   0.04086
G15        0.1360     0.04086          60   3.33      0.0015     0.1360   0.04086
G16        0.05625    0.04086          60   1.38      0.1737     0.0562   0.04086
G2         0.5355     0.04086          60   13.11     <0.0001    0.5355   0.04086
G3         0.2139     0.04086          60   5.24      <0.0001    0.2139   0.04086
G4         0.1751     0.04086          60   4.28      <0.0001    0.1751   0.04086
G5         0.2035     0.04086          60   4.98      <0.0001    0.2035   0.04086
G6         0.1301     0.04086          60   3.18      0.0023     0.1301   0.04086
G7         0.1671     0.04086          60   4.09      0.0001     0.1671   0.04086
G8         0.1504     0.04086          60   3.68      0.0005     0.1504   0.04086
G9         0.1187     0.04086          60   2.90      0.0051     0.1187   0.04086

(b) Trt least squares means
Trt    Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
Trt1   0.2478     0.01863          2    13.30     0.0056     0.2478   0.01863
Trt2   0.1358     0.01863          2    7.29      0.0183     0.1358   0.01863
The least squares means for genotypes and treatments are reported in parts (a) and (b) of Table 6.17. The genotypes showing the highest proportion of infested carrots were 1, 2, 3, 5, and 13, whereas genotypes 12 and 16 showed the lowest. For treatments, the highest proportion of infested carrots was observed in treatment 1 (24.78%), compared with 13.58% in treatment 2. Based on the fixed effects tests, the genotype × treatment interaction effect on the proportion of infested carrots was statistically significant. Genotypes 9 and 16 showed higher susceptibility in treatment 1 than in treatment 2, whereas genotypes 5, 11, 13, and 15 showed the same proportions of infested carrots in both treatments (Fig. 6.6). On the other hand, the genotypes that showed higher resistance to infestation were genotypes 1, 2, and 6, followed by genotypes 3, 4, 7, 8, 10, and 12.

Fig. 6.6 The average proportion of infested carrots in genotypes as a function of treatment (x-axis: genotype × treatment interaction, G1 to G16; y-axis: proportion of infested carrots)

6.5 A Split-Split Plot in an RCBD: In Vitro Germination of Seeds

The growth of a plant in tissue culture can be explained by the combined effects of several factors (A, B, and C). The availability and efficient use of chemical resources (factors) is of great relevance when they are scarce or too expensive. In light of this, the combination of three reagents, reagent A at three levels and reagents B and C at two levels each, was tested on the in vitro germination of orchid seeds. The combination of the levels of each of the factors is schematized below.
[Layout diagram: within each of the two blocks, the three levels of A are assigned to the whole plots, the two levels of B to the subplots within each whole plot, and the two levels of C to the sub-subplots within each subplot.]

In each of the factor combinations, N orchid seeds were placed to germinate for a period of time. Let $y_{ijkl}$ be the number of seeds germinated at the ith level of factor A, the jth level of factor B, and the kth level of factor C in block l. Since the observations are made at the sub-subplot level, conditional on the structural effects of the design, these observations have a variance associated with the subplot. Therefore, the statistical model for this experiment is given below:

Distribution: $y_{ijkl} \mid r_l, (r\alpha)_{il}, (r\alpha\beta)_{ijl} \sim \text{Binomial}(N_{ijkl}, \pi_{ijkl})$
Linear predictor: $\eta_{ijkl} = \eta + \alpha_i + r_l + (r\alpha)_{il} + \beta_j + (\alpha\beta)_{ij} + (r\alpha\beta)_{ijl} + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$, where blocks ($r_l$), blocks × A ($(r\alpha)_{il}$), and blocks × A × B ($(r\alpha\beta)_{ijl}$) are assumed to contribute to the variation such that $r_l \sim N(0, \sigma^2_r)$, $(r\alpha)_{il} \sim N(0, \sigma^2_{rA})$, and $(r\alpha\beta)_{ijl} \sim N(0, \sigma^2_{rab})$, respectively. No separate residual error term is included, because in a binomial GLMM the variance of the observations, conditional on the random effects, is determined by the distribution itself. The linear predictor $\eta_{ijkl}$ is used to estimate the probability $\pi_{ijkl}$ and hence the mean of the observations, $\mu_{ijkl} = N_{ijkl}\pi_{ijkl}$.
Link function: $\text{logit}(\pi_{ijkl}) = \eta_{ijkl}$

Table 6.18 below shows the data obtained from this experiment.

Table 6.18 Number of seeds that germinated ($y_{ijkl}$) in each of the factor combinations
Block   A   B   C   Y    N
1       1   1   1   15   73
2       1   1   1   10   86
1       1   1   2   17   69
2       1   1   2   19   32
1       1   2   1   26   125
2       1   2   1   21   62
1       1   2   2   14   81
2       1   2   2   12   21
1       2   1   1   10   92
2       2   1   1   12   108
1       2   1   2   30   44
2       2   1   2   32   33
1       2   2   1   37   91
2       2   2   1   30   42
1       2   2   2   32   98
2       2   2   2   37   44
1       3   1   1   18   52
2       3   1   1   18   73
1       3   1   2   23   108
2       3   1   2   21   55
1       3   2   1   24   106
2       3   2   1   27   92
1       3   2   2   37   64
2       3   2   2   37   97
Table 6.19 presents the sources of variation and degrees of freedom for this experimental design.

Table 6.19 Sources of variation and degrees of freedom for the randomized block design with an arrangement of treatments under the split-split-plot structure
Sources of variation      Degrees of freedom
Blocks                    r - 1 = 2 - 1 = 1
Factor A                  a - 1 = 3 - 1 = 2
Error a (Bloque*A)        (r - 1)(a - 1) = 2
Factor B                  b - 1 = 2 - 1 = 1
A*B                       (a - 1)(b - 1) = 2
Error b (A*B(Bloque))     a(b - 1)(r - 1) = 3 × 1 × 1 = 3
Factor C                  c - 1 = 2 - 1 = 1
A*C                       (a - 1)(c - 1) = (3 - 1)(2 - 1) = 2
B*C                       (b - 1)(c - 1) = 1
A*B*C                     (a - 1)(b - 1)(c - 1) = 2
Error                     ab(c - 1)(r - 1) = 3 × 2 × 1 × 1 = 6
Total                     r × a × b × c - 1 = 2 × 3 × 2 × 2 - 1 = 23

The following SAS GLIMMIX program fits a GLMM with a split-split plot structure in an RCBD with a binomial response.

  proc glimmix data=germ nobound method=laplace;
    class Block A B C;
    model Y/N = A|B|C / dist=binomial link=logit;
    random block block*A block*A*B;
    lsmeans A|B|C / lines ilink;
  run;

Part of the output is shown in Table 6.20. The value of the conditional statistic Pearson’s chi-square/DF = 1.81 (part (a)) indicates that there is overdispersion in the dataset, since this value is greater than 1. The estimated variance components tabulated in part (b) correspond to blocks, blocks × factor A, and blocks × factor A × factor B, which are $\hat{\sigma}^2_r = 0.0752$, $\hat{\sigma}^2_{rA} = 0.088$, and $\hat{\sigma}^2_{rab} = 0.02205$, respectively. The type III tests of fixed effects are shown in part (c). Here, we see that the test of equality of treatments is not significant for factors A and B and the interaction AB (A, P = 0.1917; B, P = 0.0897; AB, P = 0.6262), whereas for factor C and the interactions AC, BC, and ABC, it is significant at a level of 5%. Since there is overdispersion in the dataset, the binomial distribution does not provide a good fit (Pearson’s chi-square/DF = 1.81).
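The overdispersion check above is simple arithmetic on the values in part (a) of Table 6.20. In the sketch below, the divisor of 24 (the number of observations in Table 6.18) is an assumption; it is used because it reproduces the reported ratio. The helper name is hypothetical.

```python
def pearson_ratio(pearson_chi2, df):
    """Pearson chi-square divided by its degrees of freedom."""
    return pearson_chi2 / df

# Assumed divisor: one observation per block x A x B x C combination
n_obs = 2 * 3 * 2 * 2

# Pearson's chi-square reported in Table 6.20(a) for the binomial model
ratio = pearson_ratio(43.49, n_obs)

# Values clearly above 1 are read as a sign of overdispersion
overdispersed = ratio > 1.0

print(f"Pearson chi-square/DF = {ratio:.2f}, overdispersed: {overdispersed}")
```

A ratio near 1 would be read as consistent with the assumed distribution, which is the argument used later for the beta model.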
Table 6.20 Results of the analysis of variance of the RCBD in the split-split plot under the binomial distribution

(a) Fit statistics for conditional distribution
-2 Log L (y | r. effects)   146.19
Pearson’s chi-square        43.49
Pearson’s chi-square/DF     1.81

(b) Covariance parameter estimates
Cov Parm     Estimate   Standard error
Bloque       0.07521    0.1180
Bloque*A     0.08847    0.09319
Bloque*A*B   0.02205    0.04258

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
A        2        2        4.22      0.1917
B        1        3        6.12      0.0897
A*B      2        3        0.55      0.6262
C        1        6        65.73     0.0002
A*C      2        6        11.68     0.0085
B*C      1        6        29.38     0.0016
A*B*C    2        6        31.69     0.0006

An alternative for modeling this dataset could be the beta distribution. Under this assumption, let the response variable be $p_{ijkl} = y_{ijkl}/N_{ijkl}$, the proportion of seeds that germinated; $p_{ijkl}$ is then assumed to have a beta distribution, rather than assuming a binomial distribution for the success count $y_{ijkl}$ out of a total of $N_{ijkl}$ Bernoulli trials. The components of the model are listed below:

Distribution: $p_{ijkl} \mid r_l, (r\alpha)_{il}, (r\alpha\beta)_{ijl} \sim \text{Beta}(\pi_{ijkl}, \phi)$, with $\phi$ as the scale parameter, $r_l \sim N(0, \sigma^2_r)$, $(r\alpha)_{il} \sim N(0, \sigma^2_{rA})$, and $(r\alpha\beta)_{ijl} \sim N(0, \sigma^2_{rab})$
Linear predictor: $\eta_{ijkl} = \eta + \alpha_i + r_l + (r\alpha)_{il} + \beta_j + (\alpha\beta)_{ij} + (r\alpha\beta)_{ijl} + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$
Link function: $\text{logit}(\pi_{ijkl}) = \log\left(\frac{\pi_{ijkl}}{1 - \pi_{ijkl}}\right) = \eta_{ijkl}$

The following SAS statements fit a GLMM for a split-split plot in a randomized complete block design, assuming a beta distribution for the response variable.

  proc glimmix data=germ nobound method=laplace;
    class Block A B C;
    model p = A|B|C / dist=beta;
    random block block*A block*A*B; /* intercept A / subject=block */
    lsmeans A|B|C / lines ilink;
  run;

Part of the results is listed in Table 6.21 under a beta distribution.
The value of the fit statistic for the conditional model tabulated in part (a) (Pearson’s chi-square/DF = 1.01) indicates that overdispersion has been removed and that the beta distribution is a good model for this dataset. Part (b) shows the variance component estimates for blocks, block × A, and block × A × B, $\hat{\sigma}^2_r = -0.157$, $\hat{\sigma}^2_{rA} = -0.05558$, and $\hat{\sigma}^2_{rab} = -0.227$, respectively (negative estimates are possible because of the nobound option), and the value of the estimated scale parameter, $\hat{\phi} = 19.2789$. According to the type III tests of fixed effects in part (c), the main effect of factor C (P = 0.0128) and the interaction A×B×C (P = 0.0424) are statistically significant at a level of 5%. The estimates of the interactions are shown in Table 6.22 on the model scale under the “Estimate” column and as probabilities on the data scale under the “Mean” column, with their corresponding standard errors under the “Standard error mean” column.

Table 6.21 Results of the analysis of variance of the RCBD in the split-split plot structure under the beta distribution

(a) Fit statistics for conditional distribution
-2 Log L (p | r. effects)   -37.51
Pearson’s chi-square        21.31
Pearson’s chi-square/DF     1.01

(b) Covariance parameter estimates
Cov Parm     Estimate   Standard error
Bloque       -0.1570    .
Bloque*A     -0.05558   .
Bloque*A*B   -0.2270    .
Scale        19.2789    5.8703

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
A        2        2        1.21      0.4521
B        1        2        0.00      0.9687
A*B      2        2        1.08      0.4799
C        1        4        18.34     0.0128
A*C      2        4        1.50      0.3257
B*C      1        4        6.56      0.0626
A*B*C    2        4        7.72      0.0424

Table 6.22 Estimated least squares means on the model scale (“Estimate” column) and the data scale (“Mean” column)

A*B*C least squares means
A   B   C   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
1   1   1   -0.3769    0.3194           4    -1.18     0.3034     0.4069   0.07709
1   1   2   0.9506     0.3445           4    2.76      0.0509     0.7212   0.06927
1   2   1   0.1721     0.3147           4    0.55      0.6135     0.5429   0.07810
1   2   2   0.7010     0.3308           4    2.12      0.1014     0.6684   0.07331
2   1   1   -0.6521    0.3296           4    -1.98     0.1190     0.3425   0.07422
2   1   2   2.9148     0.8071           4    3.61      0.0225     0.9486   0.03937
2   2   1   0.7430     0.4699           4    1.58      0.1890     0.6776   0.1026
2   2   2   0.4056     0.4515           4    0.90      0.4198     0.6000   0.1084
3   1   1   0.2695     0.3161           4    0.85      0.4419     0.5670   0.07761
3   1   2   0.2752     0.3163           4    0.87      0.4334     0.5684   0.07759
3   2   1   0.1236     0.3143           4    0.39      0.7143     0.5309   0.07827
3   2   2   1.1726     0.3614           4    3.24      0.0315     0.7636   0.06523

Fig. 6.7 The average seed germination rate (x-axis: A × B × C interaction; y-axis: average germination rate; letters above the bars indicate mean groupings)

The simple effects of the factors show that the best combination of factor levels was A2*B1*C2, which had the highest seed germination proportion, followed by the combinations A1*B1*C2 and A3*B2*C2; lower proportions were observed in the combinations A1*B2*C2, A2*B2*C1, and A2*B2*C2 (Fig. 6.7). Finally, the combination A2 × B1 × C1 showed the lowest proportion of seed germination.

6.6 Alternative Link Functions for Binomial Data

In previous chapters, we used proc GLIMMIX with binomial data and, by default, it works with the “logit” link function. However, in certain applications with binomial data, other link functions are acceptable, either because they make interpretation easier or because, for certain binomial datasets, the “logit” link cannot model the data accurately and, as a result, produces biased (misleading) results. In this section, we consider two alternatives to the logit link for binomial data: the “probit” link and the complementary log-log link.

The probit model is also used to model dichotomous (Bernoulli) or binomial (sum of Bernoulli trials) responses. For this model, the link function, called the probit link, uses the inverse of the cumulative distribution function of a standard normal distribution to transform probabilities to the standard normal variable. That is, $\Phi^{-1}(\pi_i) = \eta_i$, which implies that $\pi_i = \Phi(\eta_i)$, where $\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt$. The use of the probit regression model dates back to Bliss (1934). Bliss was interested in finding an effective pesticide to control insects that fed on grape leaves. He discovered that the relationship between the response and the dose of pesticide was sigmoid, and he applied the probit link function to transform the dose–response curve from a sigmoid to a linear relationship.

The complementary log-log link, defined as $\eta_i = \log(-\log(1 - \pi_i))$, whose inverse is $\pi_i = 1 - e^{-e^{\eta_i}}$, is useful for data in which most of the probabilities are near zero or near one. For small values of $\pi_i$, the complementary log-log transformation produces results highly similar to those produced when using a logit link. As the probability increases, the transformation approaches infinity more slowly than the probit or logit link.

6.6.1 Probit Link: A Split-Split Plot in an RCBD with a Binomial Response

This example takes the dataset of the split-split plot in an RCBD (Exercise 6.8.5).
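Before turning to the fits, the three link functions described in Section 6.6 and their inverses can be compared numerically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and are not part of any SAS procedure.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def probit(p):
    # Inverse of the standard normal cumulative distribution function
    return _STD_NORMAL.inv_cdf(p)

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

# Compare the three transformations over a range of probabilities
for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p = {p:4.2f}  logit = {logit(p):7.3f}  "
          f"probit = {probit(p):7.3f}  cloglog = {cloglog(p):7.3f}")
```

Evaluating the loop shows the two properties mentioned in the text: for small probabilities the complementary log-log and logit values nearly coincide, and for probabilities near one the complementary log-log value grows more slowly than the other two.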
In that example (Section 6.5), the data were modeled using the “logit” link function. In this exercise, we fit the same dataset using the “probit” link and compare the results with those obtained using the logit link. The components of the GLMM are identical to those in Example 6.5, except for the link function. That is, we replace the link function $\text{logit}(\pi_{ijk}) = \log\left(\frac{\pi_{ijk}}{1 - \pi_{ijk}}\right) = \eta_{ijk}$ by $\Phi^{-1}(\pi_{ijk}) = \eta_{ijk}$. The following GLIMMIX syntax fits the binomial data using the “probit” link function.

  proc glimmix data=germ nobound method=laplace;
    class Block A B C;
    model Y/N = A|B|C / link=probit;
    random block block*A block*A*B;
    lsmeans A|B|C / lines ilink;
  run;

Table 6.23 shows part of the results under the binomial distribution with the “probit” link function. In parts (a) and (b), we see the conditional fit statistics and the variance component estimates for blocks, whole plots (block × A), and subplots (block × A × B). These estimates are positive, unlike the negative ones obtained with the logit link under the beta distribution (Table 6.21). Since the variance components are positive, this analysis makes more sense than that one. The type III tests of fixed effects are tabulated in part (c) of Table 6.23, alongside the P-values labeled “Logit,” which coincide with those of Table 6.21. The main effects of factors A and B and the interaction A*B are not significant in either column, whereas the main effect of factor C and the interaction A*B*C are significant at the 5% level in both. The interactions A*C (P = 0.0075) and B*C (P = 0.0017) are significant under the “probit” link but not in the “Logit” column (P = 0.3257 and P = 0.0626). The estimated probabilities $\hat{\pi}_{ijk}$ and their respective standard errors are presented in Table 6.24 for each of the combinations of the three factors; they are very similar for both link functions. However, the average standard error is slightly higher with the “logit” link function (0.0711) than with the “probit” link (0.0693).

Table 6.23 Results of the analysis of variance of the RCBD in the split-split plot structure under the binomial distribution using the “probit” link

(a) Fit statistics for conditional distribution
-2 Log L (y | r. effects)                               146.43
Pearson’s chi-square                                    43.01
Pearson’s chi-square/DF (MSE = $\hat{\sigma}^2$)        1.09

(b) Covariance parameter estimates
Cov Parm                                        Estimate   Standard error
Block ($\hat{\sigma}^2_{block}$)                0.02411    0.03707
Block*A ($\hat{\sigma}^2_{block \times A}$)     0.02128    0.02830
Block*A*B ($\hat{\sigma}^2_{block \times AB}$)  0.01617    0.01896

(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Probit Pr > F   Logit Pr > F
A        2        2        5.49      0.1541          0.4521
B        1        3        4.17      0.1339          0.9687
A*B      2        3        0.36      0.7226          0.4799
C        1        6        67.13     0.0002          0.0128
A*C      2        6        12.34     0.0075          0.3257
B*C      1        6        29.16     0.0017          0.0626
A*B*C    2        6        33.93     0.0005          0.0424

6.6.2 Complementary Log-Log Link Function: A Split Plot in an RCBD with a Binomial Response

Researchers studied the effect of three different micro-minerals (A, B, and C) on the attachment of explants of a commercial crop. Micro-mineral A was tested at three levels (i = 1, 2, 3), and micro-minerals B and C at two levels each (j, k = 1, 2). The combination of the different levels yielded a total of 12 combinations. Since the researchers wanted to study factor C with greater precision, a split-plot treatment structure was designed in which micro-minerals A and B were placed in the whole plot (a large plot) and micro-mineral C in the subplot (a small plot). Treatment factor combinations were arranged in an RCBD with two blocks (l = 1, 2).
The outcome of interest was the number of live plants ($y_{ijkl}$) out of the total number of plants growing in the unit ($N_{ijkl}$). The data can be found in the Appendix (Data: Commercial crop explant attachment).

Table 6.24 Means and standard errors using the probit and logit link functions

A*B*C least squares means
            Probit                         Logit
A   B   C   Mean     Standard error mean   Mean      Standard error mean
1   1   1   0.1543   0.05050               0.1494    0.04796
1   1   2   0.3723   0.08296               0.3780    0.08767
1   2   1   0.2724   0.06746               0.2694    0.06896
1   2   2   0.2954   0.07798               0.2953    0.08053
2   1   1   0.1023   0.03805               0.09593   0.03409
2   1   2   0.8255   0.06338               0.8292    0.06135
2   2   1   0.5684   0.08306               0.5703    0.08845
2   2   2   0.5529   0.08327               0.5530    0.08847
3   1   1   0.2844   0.07196               0.2844    0.07418
3   1   2   0.2751   0.06868               0.2733    0.07041
3   2   1   0.2568   0.06452               0.2563    0.06589
3   2   2   0.4612   0.08017               0.4608    0.08553

The GLMM for this experiment is described below (log-log data):

Distribution: $y_{ijkl} \mid r_l, r(\alpha\beta)_{ijl} \sim \text{Binomial}(N_{ijkl}, \pi_{ijkl})$
Linear predictor: $\eta_{ijkl} = \eta + r_l + \alpha_i + \beta_j + (\alpha\beta)_{ij} + r(\alpha\beta)_{ijl} + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$, where blocks ($r_l$) and blocks × (A × B) ($r(\alpha\beta)_{ijl}$) are assumed to contribute to the variation such that $r_l \sim N(0, \sigma^2_r)$ and $r(\alpha\beta)_{ijl} \sim N(0, \sigma^2_{rab})$, respectively.
Link function: $\log(-\log(1 - \pi_{ijkl})) = \eta_{ijkl}$ (complementary log-log)

The following GLIMMIX code fits the binomial proportions with a complementary log-log link function in an RCBD.

  proc glimmix data=spp nobound method=laplace;
    class block A B C;
    model y/n = A|B|C / link=cloglog;
    random block block(A*B);
    lsmeans A|B|C / lines ilink;
  run;

The “link=cloglog” option specifies that “proc GLIMMIX” will fit the model using the complementary log-log link function.
The “lsmeans A|B|C/lines ilink” statement requests the least squares means of the linear predictor $\hat{\eta}_{ijk}$; the “lines” option adds letter groupings for comparing them, and the “ilink” option reports the corresponding means on the data scale by applying the inverse link. Part of the output is shown below. Table 6.25 shows the variance component estimates of blocks and blocks(A×B) using alternative link functions. Under the “probit” link, the variance components are smaller than those obtained with the “log-log” and “logit” link functions. The hypothesis tests for the fixed effects, both main effects and interactions, are shown in Table 6.26. The three link functions behave similarly.

Table 6.25 Variance component estimates using the same distribution but a different link function

Covariance parameter estimates
             Log-log                     Logit                       Probit
Cov Parm     Estimate   Standard error   Estimate   Standard error   Estimate   Standard error
Block        0.05808    0.07112          0.08144    0.1042           0.02676    0.03494
Block(A*B)   0.05065    0.03121          0.09203    0.05754          0.03374    0.02111

Table 6.26 Type III tests of fixed effects using the same distribution but with a different link function

Type III tests of fixed effects
                           Log-log            Logit              Probit
Effect   Num DF   Den DF   F-value   Pr > F   F-value   Pr > F   F-value   Pr > F
A        2        5        6.27      0.0434   7.44      0.0318   8.17      0.0266
B        1        5        4.85      0.0789   3.13      0.1370   2.81      0.1543
A*B      2        5        0.65      0.5613   0.28      0.7693   0.24      0.7971
C        1        6        68.84     0.0002   65.29     0.0002   66.70     0.0002
A*C      2        6        11.94     0.0081   11.53     0.0088   12.12     0.0078
B*C      1        6        27.51     0.0019   28.88     0.0017   28.77     0.0017
A*B*C    2        6        32.44     0.0006   32.36     0.0006   33.93     0.0005

Table 6.27 Fit statistics using the same distribution but a different link function
                           Log-log   Logit    Probit
-2 Log likelihood          164.85    172.57   170.88
AIC (smaller is better)    192.85    200.57   198.88
AICC (smaller is better)   239.51    247.24   245.55
BIC (smaller is better)    174.55    182.27   180.59
CAIC (smaller is better)   188.55    196.27   194.59
HQIC (smaller is better)   154.58    162.31   160.62
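The information criteria in Table 6.27 can be related to the -2 log likelihood through the usual formulas AIC = -2ℓ + 2k and AICC = -2ℓ + 2kn/(n - k - 1). The sketch below assumes k = 14 estimated parameters (12 fixed-effect cell means plus 2 variance components) and n = 24 observations (12 treatment combinations in 2 blocks). Both values are inferred rather than stated in the text; they are used because they agree with the tabulated AIC and AICC to rounding. The function names are hypothetical.

```python
def aic(neg2_loglik, k):
    """Akaike information criterion from -2 log likelihood and parameter count."""
    return neg2_loglik + 2 * k

def aicc(neg2_loglik, k, n):
    """Small-sample corrected AIC."""
    return neg2_loglik + 2 * k * n / (n - k - 1)

# -2 log likelihood values from Table 6.27
neg2_loglik = {"cloglog": 164.85, "logit": 172.57, "probit": 170.88}

K = 14  # assumed: 12 fixed-effect parameters + 2 variance components
N = 24  # assumed: 12 treatment combinations x 2 blocks

table = {link: (aic(v, K), aicc(v, K, N)) for link, v in neg2_loglik.items()}
best = min(table, key=lambda link: table[link][0])

for link, (a, ac) in table.items():
    print(f"{link:8s} AIC = {a:.2f}  AICC = {ac:.2f}")
print("smallest AIC:", best)
```

Because the three models have the same number of parameters, ranking them by AIC or AICC is equivalent to ranking them by the -2 log likelihood itself.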
One tool that may be useful for choosing which link function provides a better fit, or which best describes the variability of a dataset, is the set of model fit statistics. The fit statistics indicate that the model with the complementary “log-log” link function provides the best fit, since it has the smallest value of every criterion (Table 6.27). Table 6.28 shows the estimates $\hat{\pi}_{ijk}$ for each of the link functions and each combination of factor levels, and it can be verified that they are very similar. It is important to mention that the correct specification of the linear predictor as well as the distribution of the response variable are the most important elements for obtaining a good fit.

Table 6.28 Means and standard errors using the same distribution but with a different link function

A*B*C least squares means
            Log-log                        Logit                          Probit
A   B   C   Mean     Standard error mean   Mean     Standard error mean   Mean     Standard error mean
1   1   1   0.1494   0.04259               0.1513   0.04732               0.1547   0.05030
1   1   2   0.3776   0.08554               0.3727   0.08510               0.3696   0.08223
1   2   1   0.2661   0.06257               0.2706   0.06744               0.2737   0.06718
1   2   2   0.3001   0.07718               0.2993   0.07951               0.2980   0.07789
2   1   1   0.1020   0.03079               0.1023   0.03451               0.1047   0.03829
2   1   2   0.8389   0.08212               0.8188   0.06189               0.8196   0.06375
2   2   1   0.5558   0.09578               0.5733   0.08633               0.5700   0.08251
2   2   2   0.5578   0.09596               0.5560   0.08635               0.5546   0.08273
3   1   1   0.2770   0.06780               0.2805   0.07192               0.2827   0.07131
3   1   2   0.2782   0.06574               0.2779   0.06929               0.2778   0.06855
3   2   1   0.2555   0.05987               0.2561   0.06416               0.2569   0.06410
3   2   2   0.4599   0.08735               0.4610   0.08331               0.4609   0.07965

6.7 Percentages

In this section, we consider proportions that have been calculated from discrete counts, for example, the number of infected plants in treatment i out of a total of $N_i$ plants, which are likely to have a binomial distribution. This class of models allows the response to arise from different distributions and probabilities.
6.7.1 RCBD: Dead Aphid Rate

An experiment was designed to study the effect of conidial density on the transmission of a fungus that attacks aphids. Aphid carcasses killed by the fungus, and from which the fungus released spores, were placed on bean plants at three densities (A = 1, B = 5, or C = 10 carcasses per plant) to provide different doses of fungal conidia. Densities were assigned to individual bean plants in a completely randomized design with six replicates. A total of 20 live uninfected aphids (N) were placed on each plant, together with a ladybug that was allowed to forage (feed on the bean plants) to facilitate the transfer of conidia between the carcasses and the live aphids. For each plant, the number of aphids infected with the fungus ($n_{ij}$) was counted, and the proportion of aphids infected with the fungus was calculated 7 days after the inoculum was placed. The results shown below correspond to the proportion of infected aphids ($p_{ij} = n_{ij}/N$; N = 20) at each of the conidial concentrations (densities) tested (Table 6.29).

Table 6.29 Proportion of infested aphids
Plant   Density   p_ij
1       C         0.34299
2       A         0.16659
3       B         0.47004
4       C         0.62481
5       B         0.21926
6       B         0.16659
7       C         0.47502
8       C         0.52747
9       A         0.41581
10      B         0.42556
11      A         0.19466
12      A         0.34299
13      C         0.677
14      C         0.76674
15      A         0.13124
16      B         0.58419
17      B         0.38225
18      A         0.28905

The sources of variation and degrees of freedom for this experiment are shown in Table 6.30.

Table 6.30 Sources of variation and degrees of freedom
Sources of variation   Degrees of freedom
Trt                    t - 1 = 2
Error                  t(r - 1) = 15
Total                  t × r - 1 = 17
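The degrees of freedom in Table 6.30 follow directly from the number of treatments (t = 3 densities) and replicates (r = 6 plants per density). A minimal sketch of that bookkeeping, with a hypothetical function name, is:

```python
def crd_degrees_of_freedom(t, r):
    """Degrees of freedom for a completely randomized design
    with t treatments and r replicates per treatment."""
    return {
        "Trt": t - 1,
        "Error": t * (r - 1),
        "Total": t * r - 1,
    }

df = crd_degrees_of_freedom(t=3, r=6)
for source, value in df.items():
    print(f"{source:6s} {value}")
```

The treatment and error degrees of freedom add up to the total, which is a quick consistency check that can be applied to any of the tables of sources of variation in this chapter.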
The components of the GLMM with a beta response are listed below:

Distribution: $p_{ij} \mid density(plant)_{i(j)} \sim \text{Beta}(\pi_{ij}, \phi)$, with $density(plant)_{i(j)} \sim N(0, \sigma^2_{density(plant)})$
Linear predictor: $\eta_{ij} = \mu + density_i + density(plant)_{i(j)}$; i = 1, 2, 3; j = 1, ⋯, 6
Link function: $\text{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \eta_{ij}$

The following GLIMMIX program fits a GLMM in a completely randomized design with a beta distribution. Here, density is conc_ino.

  proc glimmix data=thumbs nobound method=laplace;
    class plant conc_ino;
    model p = conc_ino / dist=beta link=logit;
    random conc_ino(plant);
    lsmeans conc_ino / lines ilink;
  run;

Part of the results is shown in Table 6.31.

Table 6.31 Results of the analysis of variance

(a) Fit statistics for conditional distribution
-2 Log L (P | r. effects)   -24.13
Pearson’s chi-square        18.45
Pearson’s chi-square/DF     1.02

(b) Covariance parameter estimates
Cov Parm            Estimate   Standard error
Conc_Ino (Planta)   -0.1833    .
Scale               12.9999    4.1954

(c) Type III tests of fixed effects
Effect     Num DF   Den DF   F-value   Pr > F
Conc_Ino   2        15       8.25      0.0038

Table 6.32 Means and standard errors on the model scale and the data scale

Conc_Ino least squares means
Conc_Ino   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
A          -1.0340    0.2438           15   -4.24     0.0007     0.2623   0.04717
B          -0.5282    0.2246           15   -2.35     0.0328     0.3709   0.05241
C          0.2775     0.2197           15   1.26      0.2259     0.5689   0.05388

The value of the conditional fit statistic in part (a), Pearson’s chi-square/DF = 1.02, indicates that there is no overdispersion in the data and that the beta distribution is a good model for this dataset. The estimated variance of plants nested within inoculum density is $\hat{\sigma}^2_{density(plant)} = -0.1833$ and the estimated scale parameter is $\hat{\phi} = 12.9999$; both are tabulated in part (b).
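To give a feel for what the scale parameter means, the sketch below converts a mean π and a scale ϕ to the usual beta shape parameters, assuming the mean-scale parameterization in which a = πϕ, b = (1 - π)ϕ, and Var(p) = π(1 - π)/(1 + ϕ). This parameterization is an assumption about the fitted model, and the function name is hypothetical. The inputs are the means from Table 6.32 and the estimated scale from Table 6.31.

```python
def beta_shapes(pi, phi):
    """Shape parameters and variance of a beta distribution
    with mean pi and scale phi (assumed parameterization)."""
    a = pi * phi
    b = (1.0 - pi) * phi
    variance = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return a, b, variance

PHI_HAT = 12.9999                                # Table 6.31(b)
means = {"A": 0.2623, "B": 0.3709, "C": 0.5689}  # Table 6.32, "Mean" column

summary = {density: beta_shapes(m, PHI_HAT) for density, m in means.items()}

for density, (a, b, var) in summary.items():
    print(f"density {density}: a = {a:.3f}, b = {b:.3f}, variance = {var:.5f}")
```

Under this parameterization, a larger ϕ means less variability around the mean, so a much larger scale estimate corresponds to proportions that are more tightly concentrated around their means.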
In part (c) of the same table, the type III tests of fixed effects are shown, indicating that the density (concentration) of the inoculum has a significant effect (P = 0.0038) on the proportion of aphids infected with the fungus. The values under the column “Estimate” are the estimated means on the model (logit) scale, whereas the column “Mean” shows the estimated mean proportions on the data scale with their respective standard errors (Table 6.32). These estimates were obtained with the “lsmeans” statement and the “ilink” option. Figure 6.8 shows a linear trend in the proportion of infected aphids as conidial density increases. Conidial densities A and B showed statistically equal proportions of infected aphids, in contrast to density C. Finally, the highest proportion of infected aphids was observed at density C.

Fig. 6.8 Proportion of aphids infected at different conidia concentration densities (x-axis: conidial density A, B, C; y-axis: proportion of infested aphids)

6.7.2 RCBD: Percentage of Quality Malt

An agro-industrial engineer is interested in studying the effect of germination time in minutes (48, 96, and 144) on the percentage of quality malt obtained from six sorghum (Sorghum bicolor) varieties: Gambella 1107, Macia, Meko, Red Swazi, Teshale, and 76T1#23 (Bekele et al. 2012). The percentage of quality malt (y) as a function of both factors is shown in Table 6.33. For this purpose, an RCBD was implemented with a factorial treatment structure (variety × germination time). The statistical model to analyze the dataset is the following:

Distribution: $y_{ijk} \mid r_k \sim \text{Beta}(\pi_{ijk}, \phi)$; i = 1, ⋯, 6; j, k = 1, 2, 3, with $r_k \sim N(0, \sigma^2_{block})$, where $y_{ijk}$ is the percentage of quality malt observed in block k for the ith variety with the jth germination time.
Linear predictor: $\eta_{ijk} = \mu + r_k + \alpha_i + \beta_j + (\alpha\beta)_{ij}$, where $\mu$ is the overall mean, $\alpha_i$ is the fixed effect due to variety i, $\beta_j$ is the fixed effect due to germination time j, and $(\alpha\beta)_{ij}$ is the interaction effect between variety and germination time.
Link function: $\text{logit}(\pi_{ijk}) = \eta_{ijk}$

Table 6.34 shows the sources of variation and degrees of freedom for this experiment. The following GLIMMIX statements fit a GLMM with a beta response.

  proc glimmix data=malting nobound method=laplace;
    class var_sorghum ger_time block;
    model p = var_sorghum|ger_time / dist=beta link=logit;
    random block;
    lsmeans var_sorghum|ger_time / lines ilink;
  run;

Table 6.33 Percentage of quality malt as a function of both factors (variety and germination time)
Variety     Time   Block   y
Gambella    T1     1       7.25
Gambella    T1     2       11.16
Gambella    T1     3       15.9
Macia       T1     1       10.91
Macia       T1     2       8.75
Macia       T1     3       10.87
Meko        T1     1       24.65
Meko        T1     2       23.63
Meko        T1     3       28.75
Red Swazi   T1     1       20.95
Red Swazi   T1     2       15.82
Red Swazi   T1     3       25.24
Teshale     T1     1       25.92
Teshale     T1     2       27.64
Teshale     T1     3       28.03
76T1#23     T1     1       23.39
76T1#23     T1     2       19.43
76T1#23     T1     3       25.55
Gambella    T2     1       10.03
Gambella    T2     2       12.9
Gambella    T2     3       17.84
Macia       T2     1       7.88
Macia       T2     2       9.14
Macia       T2     3       11.99
Meko        T2     1       22.97
Meko        T2     2       25.37
Meko        T2     3       25.71
Red Swazi   T2     1       21
Red Swazi   T2     2       15.09
Red Swazi   T2     3       24.84
Teshale     T2     1       25.42
Teshale     T2     2       26.86
Teshale     T2     3       26.64
76T1#23     T2     1       23.69
76T1#23     T2     2       20.71
76T1#23     T2     3       26.14
Gambella    T3     1       12.45
Gambella    T3     2       15.34
Gambella    T3     3       17.32
Macia       T3     1       8.51
Macia       T3     2       8.15
Macia       T3     3       13.07
Meko        T3     1       22.09
Meko        T3     2       24.11
Meko        T3     3       24.47
Red Swazi   T3     1       20.81
Red Swazi   T3     2       16.05
Red Swazi   T3     3       23.7
Teshale     T3     1       26.42
Teshale     T3     2       27.07
Teshale     T3     3       28.01
76T1#23     T3     1       24.18
76T1#23     T3     2       19.58
76T1#23     T3     3       25.74

Table 6.34 Sources of variation and degrees of freedom
Sources of variation   Degrees of freedom
Blocks                 r - 1 = 3 - 1 = 2
Variety                a - 1 = 6 - 1 = 5
Time_Germination       b - 1 = 3 - 1 = 2
Variety*germ_time      (a - 1)(b - 1) = 10
Error                  (ab - 1)(r - 1) = 17 × 2 = 34
Total                  r × a × b - 1 = 54 - 1 = 53

Part of the results of the above program is shown in Table 6.35. In part (a), the value of Pearson’s chi-square/DF is tabulated ($\chi^2/df = 0.92$), which indicates that the beta distribution is a good distribution for modeling malt percentage, since the value of Pearson’s chi-square/DF is close to 1. The estimated variance due to blocks is $\hat{\sigma}^2_{block} = 0.012$ and the estimated scale parameter is $\hat{\phi} = 431.54$ (part (b)), whereas the type III tests of fixed effects in part (c) show that sorghum variety has a significant effect on malt quality percentage (P < 0.0001); germination time (P = 0.7722) and the interaction (P = 0.4041) are not significant. The least squares means on the model scale and the data scale for the factor variety are listed under the columns “Estimate” and “Mean,” with their respective standard errors, in Table 6.36.

Table 6.35 Results of the analysis of variance of the RCBD with a beta distribution

(a) Fit statistics for conditional distribution
-2 Log L (p | r. effects)   -280.89
Pearson’s chi-square        49.66
Pearson’s chi-square/DF     0.92

(b) Covariance parameter estimates
Cov Parm   Estimate   Standard error
Block      0.01210    0.01055
Scale      431.54     85.4922

(c) Type III tests of fixed effects
Effect                 Num DF   Den DF   F-value   Pr > F
Var_sorghum            5        34       106.51    <0.0001
Ger_time               2        34       0.26      0.7722
Var_sorghum*ger_time   10       34       1.08      0.4041

Table 6.36 Means and standard errors on the model scale and the data scale for sorghum varieties

Var_sorghum least squares means
Var_sorghum   Estimate   Standard error   DF   t-value   Pr > |t|   Mean      Standard error mean
76 T1#23      -1.2011    0.07401          34   -16.23    <0.0001    0.2313    0.01316
Gambella      -1.8898    0.07929          34   -23.83    <0.0001    0.1313    0.009042
Macia         -2.2067    0.08295          34   -26.60    <0.0001    0.09915   0.007409
Meko          -1.1201    0.07364          34   -15.21    <0.0001    0.2460    0.01366
Red Swazi     -1.3685    0.07493          34   -18.26    <0.0001    0.2029    0.01212
Teshale       -1.0025    0.07314          34   -13.71    <0.0001    0.2685    0.01436
Figure 6.9 shows that Teshale produced the highest average malt percentage (0.2685 ± 0.01436), followed by the varieties Meko (0.2460 ± 0.01366) and 76 T1#23 (0.2313 ± 0.01316), whereas the variety Macia produced the lowest malt percentage (0.09915 ± 0.007409).

Fig. 6.9 Percentage of quality malt of bicolor sorghum varieties (average malt percentage by variety)

6.7.3 A Split Plot in an RCBD: Cockroach Mortality (Blattella germanica)

An entomologist is interested in testing six isolates of insect pathogenic fungi: five obtained from different hosts and one already known isolate (Control) of a fungus with potential for biological control of a particular species of cockroach. To do so, the entomologist decides to test these fungal isolates on three different insect ages (age1 = E1, age2 = E2, and age3 = E3). Each of the isolates was placed in a Petri dish with 10 insects of a specific age. Each set (isolate–age) was randomly assigned to two blocks (Appendix: Data: Cockroaches). The analysis of variance table (Table 6.37) with the sources of variation and degrees of freedom for this experiment is presented below.

Table 6.37 Analysis of variance with sources of variation and degrees of freedom for this experiment

    Sources of variation   Degrees of freedom
    Blocks                 r - 1 = 2 - 1 = 1
    Isolation              a - 1 = 6 - 1 = 5
    Block(isolation)       a(r - 1) = 6
    Age                    b - 1 = 3 - 1 = 2
    Isolation*age          (a - 1)(b - 1) = 5 × 2 = 10
    Error                  (a - 1)(b - 1)(r - 1) = 2 × 5 × 1 = 10
    Total                  r × a × b - 1 = 2 × 6 × 3 - 1 = 35

The response variable (percentage mortality) for this experiment is assumed to have a beta distribution. The components that describe the model of this experiment are listed below:

Distributions: $y_{ijk} \mid r_k, r(\alpha)_{k(i)} \sim \text{Beta}(\pi_{ijk}, \phi)$; i = 1, ⋯, 6; j = 1, 2, 3; k = 1, 2.
$r_k \sim N(0, \sigma^2_r)$, $r(\alpha)_{k(i)} \sim N(0, \sigma^2_{r(\alpha)})$

Linear predictor: $\eta_{ijk} = \mu + r_k + \alpha_i + r(\alpha)_{k(i)} + \beta_j + (\alpha\beta)_{ij}$

Link function: $\text{logit}(\pi_{ijk}) = \eta_{ijk}$

The following GLIMMIX commands fit a GLMM with a beta response.

    proc glimmix nobound method=laplace;
      class block Isolation Age;
      model y = Isolation|Age / dist=beta link=logit;
      /* whole-plot error: isolation within block */
      random Isolation / subject=block;
      lsmeans Isolation|Age / slice=Isolation lines ilink;
    run;

Some of the outputs are listed below (Table 6.38).

Table 6.38 Results of the analysis of variance of the RCBD with a factorial structure in treatments

    (a) Fit statistics for conditional distribution
    -2 Log L (y | r. effects)     -74.53
    Pearson’s chi-square          34.02
    Pearson’s chi-square/DF       1.00

    (b) Covariance parameter estimates
    Cov Parm    Subject   Estimate   Standard error
    Isolation   Block     -0.03125   .
    Scale                 24.1882    5.7925

    (c) Type III tests of fixed effects
    Effect          Num DF   Den DF   F-value   Pr > F
    Isolation       5        6        16.48     0.0019
    Age             2        10       30.01     <0.0001
    Isolation*age   10       10       4.83      0.0102

The conditional statistic Pearson’s chi-square/DF = 1.00 indicates that the assumed distribution is appropriate for this dataset (part (a)). The variance component estimates are tabulated in part (b): the estimate for isolation within block is $\hat{\sigma}^2_{r(\alpha)} = -0.03125$ (a negative value, which the nobound option permits), and the estimated scale parameter is $\hat{\phi} = 24.1882$. The type III tests of fixed effects in part (c) test the equality of means for isolate, age of the insect, and the interaction between both factors. These outputs indicate that all three have a significant effect on insect mortality (P = 0.0019, P < 0.0001, and P = 0.0102, respectively). The expected proportions of both factors, with their respective standard errors, appear on the data scale under the “Mean” column (Tables 6.39 and 6.40). These values arise by applying the inverse link to estimates under “Estimate” on the model scale.
Table 6.39 shows the estimated average mortality probabilities for the isolates; for example, for isolate A1, applying the inverse link to the linear predictor estimate $\hat{\eta}_{1\cdot} = 0.1722$ gives $\hat{\pi}_{1\cdot} = 1/(1 + e^{-0.1722}) = 0.5429$. In the same manner, the expected proportions for isolates 2 and 4 are $\hat{\pi}_{2\cdot} = 0.6557$ and $\hat{\pi}_{4\cdot} = 0.5762$, respectively, whereas for the control $\hat{\pi}_{control\cdot} = 0.1157$. Regarding the age of the insect (Table 6.40), the expected average probability of mortality was highest at age three (adults), $\hat{\pi}_{\cdot 3} = 0.6435$, whereas insects at age two (E2) were the most resistant to the isolates, with a mortality of $\hat{\pi}_{\cdot 2} = 0.2598$.

Table 6.39 Means and standard errors on the model scale and the data scale for isolation (isolate least squares means)

    Isolate   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
    A1        0.1722     0.1859           6    0.93      0.3900     0.5429   0.04614
    A2        0.6442     0.2100           6    3.07      0.0220     0.6557   0.04740
    A3        -0.1489    0.1952           6    -0.76     0.4746     0.4629   0.04853
    A4        0.3073     0.2088           6    1.47      0.1915     0.5762   0.05098
    A5        -0.2023    0.1806           6    -1.12     0.3053     0.4496   0.04468
    Control   -2.0339    0.2418           6    -8.41     0.0002     0.1157   0.02473

Table 6.40 Means and standard errors on the model scale and the data scale for insect age (age least squares means)

    Age   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
    E1    -0.1747    0.1310                -1.33     0.2120     0.4564   0.03251
    E2    -1.0468    0.1374                -7.62     <0.0001    0.2598   0.02643
    E3    0.5908     0.1634                3.61      0.0047     0.6435   0.03749

In general, fungal isolates A1, A2, A3, and A4 showed an average mortality of more than 75% for adult insects (E3), whereas isolates A1, A2, and A5 showed a mortality rate of around 65% for cockroaches of age E1 (juvenile insects). On the other hand, all isolates showed lower lethal effectiveness on insects of age E2 (Fig. 6.10).
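The back-transformation described above can be reproduced for every row of Tables 6.39 and 6.40. This is a small sketch rather than part of the original analysis: the function name `inv_logit` and the dictionary names are chosen here for illustration, and the estimates and means are copied from the two tables.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a model-scale estimate to a proportion."""
    return 1.0 / (1.0 + math.exp(-eta))


# Model-scale estimates ("Estimate" column) from Tables 6.39 and 6.40.
estimates = {
    "A1": 0.1722, "A2": 0.6442, "A3": -0.1489,
    "A4": 0.3073, "A5": -0.2023, "Control": -2.0339,
    "E1": -0.1747, "E2": -1.0468, "E3": 0.5908,
}

# Data-scale values ("Mean" column) reported in the same tables.
reported = {
    "A1": 0.5429, "A2": 0.6557, "A3": 0.4629,
    "A4": 0.5762, "A5": 0.4496, "Control": 0.1157,
    "E1": 0.4564, "E2": 0.2598, "E3": 0.6435,
}

means = {label: inv_logit(eta) for label, eta in estimates.items()}

for label in estimates:
    print(f"{label:8s} inverse link = {means[label]:.4f}   reported = {reported[label]}")
```

Small differences in the last digit are expected because the tabulated estimates are themselves rounded to four decimals.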
Fig. 6.10 Cockroach mortality percentage (average mortality rate for each isolate, A1 to A5 and Control, at ages E1, E2, and E3)

6.7.4 A Split-Plot Design in an RCBD: Percentage Disease Inhibition

A plant pathologist wishes to compare the response of two plant varieties to different doses of a pesticide formulated to protect plants against a disease. Five racks (blocks) were chosen to account for local variation within the greenhouse. Each rack was divided into four sections, and one of four pesticide levels was randomly assigned to each section within the rack. The four pesticide levels were 1, 2, 4, and 8 mg/L. One plant of each variety was placed in each section of the rack. Of the two plant varieties, one was susceptible, labeled S, and the other was resistant, labeled R (Table 6.41). The response variable (y) is the percentage of disease inhibition in the plant. The sources of variation and degrees of freedom for this experiment are shown in Table 6.42.

Table 6.41 Percentage of inhibition

    Variety   Dose   y (Block 1)   y (Block 2)   y (Block 3)   y (Block 4)   y (Block 5)
    R         1      15.7          23.1          15.9          20.8          24.5
    R         2      25.1          29.2          29.7          28.6          26.6
    R         4      27.9          29.7          24            29.7          29.6
    R         8      23.8          31.2          21.8          23.3          23.9
    S         1      19.8          17.8          13.2          14.8          19.7
    S         2      21.2          29.3          26            27.5          22
    S         4      29.3          27.2          26            31.5          27.9
    S         8      22.8          33            25.2          27.2          20.8

Table 6.42 Sources of variation and degrees of freedom

    Sources of variation   Degrees of freedom
    Blocks                 r - 1 = 5 - 1 = 4
    Dose                   a - 1 = 4 - 1 = 3
    Error a (Block*Dose)   (r - 1)(a - 1) = 12
    Variety                b - 1 = 2 - 1 = 1
    Dose*variety           (a - 1)(b - 1) = 3
    Error b                a(b - 1)(r - 1) = 4 × 1 × 4 = 16
    Total                  r × a × b - 1 = 5 × 4 × 2 - 1 = 39

Following the same reasoning used in the examples above, the components of the GLMM with a beta response that models the observed disease inhibition proportion ($y_{ijk}$) under dose i with variety j in block k are listed as follows:

Distributions: $y_{ijk} \mid r_k, (r\alpha)_{ik} \sim \text{Beta}(\pi_{ijk}, \phi)$; i = 1, ⋯, 4; j = 1, 2; k = 1, ⋯, 5

$r_k \sim N(0, \sigma^2_r)$, $(r\alpha)_{ik} \sim N(0, \sigma^2_{rA})$

Linear predictor: $\eta_{ijk} = \mu + r_k + \alpha_i + (r\alpha)_{ik} + \beta_j + (\alpha\beta)_{ij}$, where $r_k$ is the random block effect, $\alpha_i$ is the fixed dose effect, $\beta_j$ is the fixed variety effect, $(r\alpha)_{ik}$ is the random effect due to the block by dose interaction, and $(\alpha\beta)_{ij}$ is the fixed interaction effect of dose and variety.

Link function: $\text{logit}(\pi_{ijk}) = \eta_{ijk}$

The following GLIMMIX commands fit a GLMM.

    proc glimmix nobound method=laplace;
      class Variety dose block;
      model y = dose variety dose*variety / dist=beta link=logit;
      random Block Block*dose;
      /* polynomial contrasts for the four dose levels (four coefficients each) */
      contrast 'Linear dose' dose -3 -1 1 3;
      contrast 'Quadratic dose' dose 1 -1 -1 1;
      contrast 'Cubic dose' dose -1 3 -3 1;
      lsmeans variety|dose / slice=(variety dose) lines ilink;
      ods output lsmeans=dose_means;
    run;

The “contrast” statements in the program test which trend (linear, quadratic, or cubic) the “dose” factor has on the percentage of disease inhibition. Part of the output is shown in Table 6.43.
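The coefficients used in polynomial contrasts can be checked numerically. The sketch below assumes the usual orthogonal polynomial coefficients for a factor with four equally spaced levels, (-3, -1, 1, 3), (1, -1, -1, 1), and (-1, 3, -3, 1), and verifies that each set sums to zero and that the sets are mutually orthogonal. The doses 1, 2, 4, and 8 are equally spaced only on a log2 scale, so reading these contrasts as trends in log dose is an assumption made here, not a statement from the text.

```python
import math

# Orthogonal polynomial coefficients for four equally spaced levels.
contrasts = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def dot(u, v):
    """Inner product of two coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))


# A valid contrast has coefficients that sum to zero.
sums = {name: sum(coefs) for name, coefs in contrasts.items()}

# Orthogonal contrasts have zero inner product for every pair.
names = list(contrasts)
cross = {
    (names[i], names[j]): dot(contrasts[names[i]], contrasts[names[j]])
    for i in range(len(names))
    for j in range(i + 1, len(names))
}

# Spacing of the dose levels on the log2 scale.
doses = [1, 2, 4, 8]
log2_doses = [math.log2(d) for d in doses]
spacing = [b - a for a, b in zip(log2_doses, log2_doses[1:])]

print("sums:", sums)
print("cross products:", cross)
print("log2 dose spacing:", spacing)
```

Each contrast must list exactly as many coefficients as the factor has levels, which is four for dose in this experiment.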
The value of the conditional goodness-of-fit statistic Pearson’s chi-square/DF = 0.59 indicates that there is no evidence of overdispersion, and, therefore, the beta distribution is adequate to model this dataset (part (a)). The variance component estimates in part (b) for block and block × dose are $\hat{\sigma}^2_r = 0.004898$ and $\hat{\sigma}^2_{r \cdot dose} = 0.002372$, respectively. Finally, the type III tests in part (c) provide sufficient statistical evidence of an effect of dose on disease inhibition in plants (P = 0.0001), whereas there is no sufficient evidence of an effect of variety (P = 0.2057) or of the dose × variety interaction (P = 0.3337).

Table 6.43 Results of the analysis of variance

    (a) Fit statistics for conditional distribution
    -2 Log L (y | r. effects)     -184.32
    Pearson’s chi-square          23.63
    Pearson’s chi-square/DF       0.59

    (b) Covariance parameter estimates
    Cov Parm     Estimate   Standard error
    Block        0.004898   .
    Block*dose   0.002372   0.007513
    Scale        205.52     67.7447

    (c) Type III tests of fixed effects
    Effect         Num DF   Den DF   F-value   Pr > F
    Dose           3        12       17.67     0.0001
    Variety        1        16       1.74      0.2057
    Dose*variety   3        16       1.22      0.3337

Table 6.44 shows the polynomial contrasts for the effect of “dose,” which indicate significant linear (P = 0.0003) and quadratic (P = 0.0001) effects on the percentage of disease inhibition, whereas the cubic effect is not significant (P = 0.5948). The inhibition percentage has almost a linear trend as the dose increases from 1 to 4 ml/L in both varieties, but when the dose is higher than 4 ml/L, the inhibition of the disease decreases in both varieties (Fig. 6.11).

Table 6.44 Polynomial contrasts

    Label            Num DF   Den DF   F-value   Pr > F
    Linear dose      1        12       25.48     0.0003
    Quadratic dose   1        12       30.93     0.0001
    Cubic dose       1        12       0.30      0.5948

6.7.5 Randomized Complete Block Design with a Binomial Response with Multiple Variance Components

The dataset corresponds to an experiment implemented by Madden and Hughes (1995) on the incidence of the disease caused by the fungus Plasmopara viticola on grape plants (Vitis labrusca).
Six different treatments, of which treatment 1 was the control, were tested in a randomized block design (b = 3) to study the disease, with three grape plants (v = 3). On a single date in autumn, five sprouts (r = 5) were randomly selected from each of the three grape plants, and the number of leaves with at least one mildew lesion (m) was counted out of a total of n leaves. The number of leaves per sprout ranged from 7 to 21. The data for this experiment can be found in the Appendix (Data: Disease incidence on grape plants). The statistical model that could describe the incidence of disease in this experiment, if the response variable $p_{ijkl}$ were treated as a normal variable, would be as described below:

$p_{ijkl} = \eta + \tau_i + b_j + (bv)_{jk} + (bvr)_{jkl} + \varepsilon_{ijkl}$

i = 1, 2, . . . , 6; j = 1, 2, 3; k = 1, 2, 3; l = 1, 2, . . . , 5

where $p_{ijkl}$ is the proportion of diseased leaves, $\eta$ is the intercept, $\tau_i$ is the fixed effect of treatment i, $b_j$ is the random effect of blocks assuming $b_j \sim N(0, \sigma^2_{block})$, $(bv)_{jk}$ is the block–plant random effect assuming $(bv)_{jk} \sim N(0, \sigma^2_{block \times plant})$, $(bvr)_{jkl}$ is the random effect due to block–plant–sprout assuming $(bvr)_{jkl} \sim N(0, \sigma^2_{block \times plant \times sprout})$, and $\varepsilon_{ijkl}$ is the experimental error assuming $\varepsilon_{ijkl} \sim N(0, \sigma^2)$.

Fig. 6.11 Percentage of disease inhibition in both varieties (inhibition percentage versus dose (ml/L) for varieties R and S)

For the disease incidence data, the assumption of a normal distribution for $p_{ijkl}$ is not recommended. A good starting point for the analysis is to assume that the observed number of diseased leaves in the sprouts ($y_{ijkl}$) follows a binomial distribution with parameters $\pi_{ijkl}$ and $n_{ijkl}$, the total number of leaves on the sprout.
Therefore, the components of the GLMM with a binomial distribution in the response variable are as follows:

Distribution: $y_{ijkl} \mid b_j, (bv)_{jk}, (bvr)_{jkl} \sim \text{binomial}(\pi_{ijkl}, n_{ijkl})$

$b_j \sim N(0, \sigma^2_{block})$, $(bv)_{jk} \sim N(0, \sigma^2_{block \times plant})$, $(bvr)_{jkl} \sim N(0, \sigma^2_{block \times plant \times sprout})$

Linear predictor: $\eta_{ijkl} = \eta + \tau_i + b_j + (bv)_{jk} + (bvr)_{jkl}$

Link function: $\text{logit}(\pi_{ijkl}) = \eta_{ijkl}$

The following GLIMMIX syntax fits a GLMM with a binomial response.

    proc glimmix method=laplace nobound;
      class v r b t;
      /* events/trials syntax: m diseased leaves out of n leaves */
      model m/n = t / dist=bin;
      /* block, plant within block, and sprout within plant within block */
      random intercept v v*r / subject=b;
      lsmeans t / lines ilink;
    run;

Part of the results based on the aforementioned model is shown in Table 6.45.

Table 6.45 Results of the analysis of variance under the binomial distribution

    (a) Fit statistics
    -2 Log likelihood             723.17
    AIC (smaller is better)       741.17
    AICC (smaller is better)      741.87
    BIC (smaller is better)       733.06
    CAIC (smaller is better)      742.06
    HQIC (smaller is better)      724.87

    (b) Fit statistics for conditional distribution
    -2 Log L (m | r. effects)     665.02
    Pearson’s chi-square          398.21
    Pearson’s chi-square/DF       1.47

    (c) Covariance parameter estimates
    Cov Parm    Subject   Estimate   Standard error
    Intercept   b         -0.00408   .
    v           b         0.01917    .
    v*r         b         0.1960     .

    (d) Type III tests of fixed effects
    Effect   Num DF   Den DF   F-value   Pr > F
    t        2        220      1837.99   <0.0001

By default, proc GLIMMIX provides the fit statistics useful for selecting the best model from a group of models (part (a)). In addition to accuracy considerations, the Laplace (or quadrature) analysis allows us to obtain the “conditional distribution fit statistics,” specifically Pearson’s $\chi^2/df$. Recall that this statistic helps assess the goodness of fit of the model. A value of $\chi^2/df \gg 1$ is an indicator of overdispersion in the dataset, which may arise because the linear predictor is incomplete or because the assumed distribution is not suitable (misspecified) for this dataset.
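The ratio itself is simple arithmetic. As a rough check, and assuming the divisor is the number of observations (an assumption that is consistent with the reported values but is not stated in the text), the ratios in Tables 6.35, 6.43, and 6.45 can be recomputed from the tabulated Pearson chi-square values and the design sizes. The dictionary layout and names below are illustrative only.

```python
# Each entry: (Pearson chi-square, assumed number of observations, reported ratio).
# Observation counts come from the designs described in the text:
#   malt:       3 blocks x 6 varieties x 3 germination times
#   inhibition: 5 blocks x 4 doses x 2 varieties
#   grape:      3 blocks x 6 treatments x 3 plants x 5 sprouts
cases = {
    "Table 6.35 (malt, beta)": (49.66, 3 * 6 * 3, 0.92),
    "Table 6.43 (inhibition, beta)": (23.63, 5 * 4 * 2, 0.59),
    "Table 6.45 (grape, binomial)": (398.21, 3 * 6 * 3 * 5, 1.47),
}

ratios = {name: chi2 / n_obs for name, (chi2, n_obs, _) in cases.items()}

for name, (chi2, n_obs, reported) in cases.items():
    print(f"{name}: {chi2} / {n_obs} = {ratios[name]:.4f}   reported = {reported}")
```

A ratio near 1 is read as adequate fit, whereas a ratio clearly above 1, as in the binomial fit, is read as a sign of overdispersion.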
In part (b), the value of the conditional distribution statistic is Pearson’s $\chi^2/df = 1.47$, which is evidence of overdispersion. The variance component estimates due to block, block × plant, and block × plant × sprout are tabulated in part (c), whereas the type III tests of fixed effects (part (d)) indicate that there is a significant difference (P < 0.0001) between treatments.

Since there is overdispersion in the data in the binomial model, an alternative distribution is the beta distribution. The components of the GLMM are as follows:

Distribution: $p_{ijkl} \mid b_j, (bv)_{jk}, (bvr)_{jkl} \sim \text{beta}(\pi_{ijkl}, \phi)$

$b_j \sim N(0, \sigma^2_{block})$, $(bv)_{jk} \sim N(0, \sigma^2_{block \times plant})$, $(bvr)_{jkl} \sim N(0, \sigma^2_{block \times plant \times sprout})$

Linear predictor: $\eta_{ijkl} = \eta + \tau_i + b_j + (bv)_{jk} + (bvr)_{jkl}$

Link function: $\text{logit}(\pi_{ijkl}) = \eta_{ijkl}$

The following SAS commands fit a GLMM under a beta distribution.

    proc glimmix method=laplace nobound;
      class v r b t;
      /* pct is the observed proportion of diseased leaves */
      model pct = t / dist=beta link=logit;
      random intercept v v*r / subject=b;
      lsmeans t / lines ilink;
    run;

Some of the outputs are shown below.

Table 6.46 Results of the analysis of variance under the beta distribution

    (a) Fit statistics
    -2 Log likelihood             -231.10
    AIC (smaller is better)       -211.10
    AICC (smaller is better)      -209.30
    BIC (smaller is better)       -220.11
    CAIC (smaller is better)      -210.11
    HQIC (smaller is better)      -229.22

    (b) Fit statistics for conditional distribution
    -2 Log L (m | r. effects)     -231.10
    Pearson’s chi-square          136.55
    Pearson’s chi-square/DF       1.07

    (c) Covariance parameter estimates
    Cov Parm    Subject   Estimate   Standard error
    Intercept   b         -0.6308    .
    v           b         -0.2215    .
    v*r         b         -0.1843    .
    Scale                 9.8397     1.1926

    (d) Type III tests of fixed effects
    Effect   Num DF   Den DF   F-value   Pr > F
    t        2        220      1837.99   <0.0001
Table 6.46 shows that the values of the fit statistics, as well as the conditional distribution statistics (parts (a) and (b)), are much smaller than when the binomial distribution was used. This indicates that the beta distribution is more appropriate for the dataset: the value of Pearson’s statistic is $\chi^2/df = 1.07$, indicating that the problem of overdispersion was almost totally controlled. The variance component estimates as well as the estimated scale parameter $\hat{\phi}$ are tabulated in part (c). Similar to the previous analysis, the type III tests of fixed effects indicate that there is a highly significant difference (part (d)) in treatments on the average proportion of leaves with fungal disease.

The least squares means on the model scale (column “Estimate”) and on the data scale (column “Mean”) are tabulated in Table 6.47.

Table 6.47 Estimated means (least squares means) on the model scale and on the data scale

    t   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
    1   0.7223     0.09989          83   7.23      <0.0001    0.6731   0.02198
    2   -1.7482    0.1543           83   -11.33    <0.0001    0.1483   0.01949
    3   -2.0178    0.2214           83   -9.11     <0.0001    0.1174   0.02294
    4   -1.9358    0.1873           83   -10.34    <0.0001    0.1261   0.02064
    5   -1.7887    0.2173           83   -8.23     <0.0001    0.1432   0.02667
    6   -1.5360    0.1665           83   -9.23     <0.0001    0.1771   0.02427

The results indicate that all proposed treatments in this study reduce the proportion of diseased leaves compared to the control treatment (t = 1). The mean comparison (LSD) obtained with the option “lines” indicates that the proportion of diseased leaves in treatment one is statistically different from the rest of the treatments (Table 6.48).

Table 6.48 Mean comparison (LSD method): T grouping of t least squares means (α = 0.05). LS means with the same letter are not significantly different

    t   Estimate   Group
    1   0.7223     A
    6   -1.5360    B
    2   -1.7482    B
    5   -1.7887    B
    4   -1.9358    B
    3   -2.0178    B
6.8 Exercises

Exercise 6.8.1 Seeds of a particular crop were stored at four different temperatures (T1, T2, T3, and T4) under four different chemical concentrations (0, 0.1, 1.0, and 10). To study the effects of temperature and chemical concentration, a completely randomized experiment was conducted with a 4 × 4 factorial treatment structure and four replicates. For each of the 64 experimental units, 50 seeds were placed in a dish and the number of seeds that germinated under standard conditions was recorded. Germination data were obtained from Mead et al. (1993, p. 325) (Table 6.49).

Table 6.49 Seed germination experiment results (number of germinated seeds in each of four replicates)

    Temperature   Chemical concentration
                  0                0.1              1.0              10
    T1            9, 9, 3, 7       13, 12, 14, 15   21, 23, 24, 27   40, 32, 43, 34
    T2            19, 30, 21, 29   33, 32, 30, 26   43, 40, 37, 41   48, 48, 49, 48
    T3            7, 7, 2, 5       1, 2, 4, 4       8, 10, 6, 7      3, 4, 8, 5
    T4            4, 9, 3, 7       13, 6, 15, 7     16, 13, 18, 19   13, 18, 11, 16

(a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.
(b) List all the components of the GLMM in (a).
(c) Analyze this dataset and summarize the relevant results.

Exercise 6.8.2 Data were obtained from an experiment in which separate sprouts of apple trees were inoculated with macroconidia of the fungus Nectria galligena, which causes apple canker. The experimental factors were inoculum density (three levels: 200, 1000, and 5000 macroconidia per ml) and variety (three levels: Jonagold, Golden Delicious, and Jonathan). The experiment was carried out in 4 randomized blocks with 12 plots. Each plot consisted of one sprout on which five inoculations were made. The numbers of successful inoculations per plot on day 17 after inoculation are shown in the table below (Table 6.50).

Table 6.50 Results of the apple sprouts experiment

    Density of inoculum   Cultivar           Block 1   Block 2   Block 3   Block 4
    200                   Jonagold           5/1       5/2       5/1       5/0
    200                   Golden Delicious   5/1       5/0       5/0       5/0
    200                   Jonathan           5/2       5/2       5/2       5/0
    1000                  Jonagold           5/0       5/2       5/2       5/4
    1000                  Golden Delicious   5/0       5/0       5/2       5/0
    1000                  Jonathan           5/4       5/4       5/4       5/0
    5000                  Jonagold           5/5       5/5       5/4       5/5
    5000                  Golden Delicious   5/5       5/4       5/3       5/5
    5000                  Jonathan           5/5       5/0       5/3       5/5

The first number refers to the number of inoculations (n) and the second to the number of inoculations that developed the gangrenous sore (Y)

(a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment.
(b) List all the components of the GLMM from part (a).
(c) Analyze this dataset and summarize the relevant results.
(d) Is there extra variation in the dataset? What alternative distribution do you propose? Reanalyze the data and compare the results.

Exercise 6.8.3 This experiment concerns the germination efficiencies of protoplasts obtained from plants of seven species of the genera Lycopersicon (tomato) and Solanum (potato). For each species, three or four protoplast isolates were used and, depending on the availability of the protoplasts, a variable number of plates was prepared. Per plate, approximately 105 protoplasts were placed in a Petri dish, and, after 4 weeks, the proportion of dividing protoplasts was recorded. The results in percentages are listed below (Table 6.51).

Table 6.51 Protoplast germination experiment results

Species 1 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 7 Isolation 1 2 3 4 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 4 1 8.9 3.1 2.1 2.5 0.2 1.8 6.6 1.8 1.5 2 11.4 2.9 2.3 21.5 18.7 11.5 4.6 2.4 1.6 3 2.5 2.6 2.9 2 6.3 2.7 1.9 2.9 0.9 1.6 7.5 1.5 3.2 2.3 11.3 3.8 4.4 25.5 20 13.1 3.4 2.4 1.1 4 2.5 2.7 3 3 10.5 4.1 1.4 2.6 0.5 1.6 5.4 1.9 1.1 2.8 14.4 4.7 4.8 18.1 4 11.5 2.7 2 1.6 4.1 2.5 2.9 3 5 1.5 2.6 0.6 6 7 8 9 10 2.6 1.2 2.6 0.4 2.8 2.7 2.8 2.7 5.3 1.7 1.3 2.6 13.7 5.1 4.9 22.2 5 1.3 1.8 3.2 6.5 1.5 1.2 2.2 6.3 5.8 5.9 5.6 1.6 2.5 1.4 2.4 1.2 2.8 1.8 2.4 2.7 5.8 3.2 4.7 5.6 4.2 3.3 4.5 16.2 3 2.5 1.3 4.4 2.7 2.7 3.1 10.1 4.1 3.6 1.6 2.8 2.3 2.7 17.2 3.1 3.2 1 3.3 2.6 2.6 16 10.5 2.6 0.8 4.5 1.4 1.3 3.3 2.5 0.8 3 2.7 2.2 3.2

(a) Write down an ANOVA table (sources of variation, degrees of freedom) for the experimental design of this study.
(b) Write down a generalized linear mixed model based on (a), assuming a beta distribution for the response variable.
(c) Implement an analysis of these data according to the linear predictor and model in part (b). Summarize the relevant results.

Exercise 6.8.4 The data in this example are the results of a triangle test for 12 raters tasting 10 pairs of coffee varieties (Table 6.52). The triangle test consisted of each rater drinking three cups, one of one variety and two of the other. Each rater had 12 triangles for each pair of varieties, 2 for each of the following sequences: AAB, ABA, BAA, ABB, BAB, and BBA. The response is the number of correct identifications of the variety appearing once. The experiment was conducted in two groups of six
The experiment was conducted in two groups of six 260 6 Generalized Linear Mixed Models for Proportions and Percentages Table 6.52 Triangle test (G = group, Eval = panelist, PdV = variety pair, V_A = variety A; V_B = variety B; Y = number of correct discriminations, n = number of trials) G 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Eval 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 PdV 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 V_A 8 5 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 8 5 V_B 9 9 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 9 9 Y 2 11 9 6 8 9 6 8 11 5 5 8 8 9 10 11 8 9 8 7 4 9 9 11 8 10 3 7 10 9 7 10 7 8 7 8 7 6 10 7 6 10 n 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 G 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Eval 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 PdV 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 V_A 8 5 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 8 5 V_B 9 9 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 9 9 Y 4 12 7 9 10 5 9 9 7 5 2 10 8 9 8 9 4 6 10 7 3 11 11 8 8 11 5 4 11 8 7 9 5 11 5 10 7 8 6 9 7 9 n 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 (continued) 6.8 Exercises 261 Table 6.52 (continued) G 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Eval 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 PdV 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 V_A 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 V_B 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 Y 4 8 6 7 8 9 9 8 3 9 6 9 7 10 7 7 8 9 n 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 G 2 2 2 2 2 2 2 
2 2 2 2 2 2 2 2 2 2 2 Eval 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 PdV 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 V_A 9 6 6 5 7 7 7 7 8 5 9 6 6 5 7 7 7 7 V_B 6 5 8 8 8 9 5 6 9 9 6 5 8 8 8 9 5 6 Y 6 10 5 10 8 6 9 9 6 7 7 7 8 11 9 9 10 9 n 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 evaluators, each with the aim of discriminating the abilities of the panelists for future evaluations. The data for this example are shown below: (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment. (b) List all the components of the GLMM according to part (a). (c) Analyze this dataset and summarize the relevant results. (d) Is there an extra-variation in the dataset? If so, what alternative distribution do you propose? Reanalyze the data and compare the results. Exercise 6.8.5 Several brewing techniques are used in the production of espresso coffee. Among them, the most widespread are bar machines and single-dose pods, designed in large numbers due to their commercial popularity. This experiment tries to compare the foaming rate (Y, in percentage) effects of three different brewing techniques on espresso quality (method 1 = bar machine (BM), method 2 = hyperespresso method (HIP), and method 3 = I-espresso system (IT)). Nine replicates per method were carried out (Table 6.53). (a) Write down an ANOVA table (sources of variation, degrees of freedom) for the experimental design of this study. (b) Describe the generalized linear mixed model in (a), assuming a beta distribution. (c) Implement the analysis of these data according to the predictor and model in (b). Summarize the relevant results. 262 6 Generalized Linear Mixed Models for Proportions and Percentages Table 6.53 Experimental results of espresso coffee Method 1 1 1 1 1 1 1 1 1 Table 6.54 Results of wheat germination experiment in pots. 
Number of seeds that did not germinate out of 50 Treatments 1 A 10 B 8 C 5 D 1 Index 36.64 39.65 37.74 35.96 38.52 21.02 24.81 34.18 23.08 2 11 10 11 6 Method 2 2 2 2 2 2 2 2 2 3 8 3 2 4 Index 70.84 46.68 73.19 57.78 48.61 72.77 65.04 62.53 54.26 Method 3 3 3 3 3 3 3 3 3 Index 56.19 36.67 35.35 40.11 33.52 37.12 37.33 32.68 48.33 7 9 11 11 10 4 5 6 9 7 8 13 7 9 10 7 6 3 7 10 Exercise 6.8.6 The decision to adopt a particular scale for data involving small integers is not an easy one because any analysis must be – to some extent – as adequate as possible to obtain estimates with as little uncertainty as possible. As a simple example of this type of data, consider the following results from a potted wheat germination experiment (Table 6.54). (a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment. (b) List all components of the GLMM in (a), assuming a binomial response variable. (c) Analyze this dataset and summarize the relevant results. (d) Is there an extra-variation in the dataset? If so, reanalyze the data with an alternative distribution. Summarize and compare your findings. Exercise 6.8.7 A greenhouse experiment was carried out to investigate how a disease spreads in two varieties of (agurkesyge) cucumber, which is supposed to depend on the climate and the amount of fertilizers used for the two varieties. The following data come from the Department of Plant Pathology. Two climates were used: (1) change to day temperature 3 hours before sunrise and (2) normal change to day temperature. Three amounts of fertilizer were applied, normal (2.0 units), high (3.5 units), and very high (4.0 units). The two varieties were Aminex and Dalibor. To have a better controlled experiment, the plants were “standardized” to equally have as many leaves, and, then (on day 0, for example), the plants were contaminated with the disease. Subsequently, 8 days after the plants were contaminated, the amount of infection (in percentage) was recorded. 
From the resulting infection curve, two measures were calculated (in a manner not specified here), namely, the rate of spread of the disease (%) and the level of infection at the end of the disease period. The experiment was implemented in three blocks, each of which consisted of two sections. Each section consisted of three plots, which were divided into two subplots, each of which had six to eight plants. Thus, there were a total of 36 subplots. The results were recorded for each subplot. The experimental factors were randomly assigned to the different units as follows: two climates to the two sections within each block, three amounts of fertilizer to the three plots within each section, and, finally, the two varieties to the two subplots within each plot. The data are shown below (Table 6.55).

(a) Write down a statistical model of this experiment.
(b) List all the components of the GLMM in (a).
(c) Write down the null and alternative hypotheses associated with this experiment.
(d) Construct an ANOVA table indicating the sources of variation and degrees of freedom.
(e) Analyze the rate of disease spread to investigate the effect of different factors.
(f) Comment on the results obtained.

Exercise 6.8.8 This example is an experiment to identify damage to the uterus in laboratory rodents after exposure to boric acid, a compound widely used in pesticides, pharmaceuticals, and other household products (Heindel et al. 1992). The study design included four doses of boric acid. The compound was administered to pregnant female mice during the first 17 days of gestation, and, then, the females were sacrificed and their litters examined. The table below presents the number of deaths in utero (Y) out of the total number of trials (N) at each of the four doses tested: d1 = 0 (control), d2 = 0.1, d3 = 0.2, and d4 = 0.4 (as percentage of boric acid in the diet) (Table 6.56).
(a) Write down an ANOVA table (sources of variation, degrees of freedom) for this experiment. (b) List all the components of the GLMM in (a). (c) Analyze this dataset and summarize the relevant results. (d) Is there an extra-variation in the dataset? If so, what alternative distribution do you propose? Reanalyze the data and compare your findings. 264 6 Generalized Linear Mixed Models for Proportions and Percentages Table 6.55 Greenhouse experiment results of cucumber varieties Block 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 Section 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 Plot 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 Weather 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 Fertilizer 2 2 3.5 3.5 4 4 2 2 3.5 3.5 4 4 2 2 3.5 3.5 4 4 2 2 3.5 3.5 4 4 2 2 3.5 3.5 4 4 2 2 3.5 3.5 4 4 Variety Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Aminex Dalibor Proportion (%) 48.8981 42.2463 48.2108 41.6767 55.4369 40.9562 51.5573 36.7739 47.9937 47.8723 57.9171 37.7185 60.1747 45.6937 51.0017 52.2796 51.1251 48.7217 51.6001 50.4463 48.3387 38.6538 51.3147 38.2488 49.6958 29.6786 46.6692 36.5892 56.032 36.0955 45.979 37.2489 40.7277 38.4831 44.5242 34.3907 Level 0.06915 0.06595 0.04679 0.04881 0.04025 0.04859 0.09353 0.10353 0.05327 0.04397 0.05225 0.09324 0.04182 0.06983 0.08863 0.03622 0.05875 0.08169 0.07001 0.09907 0.05788 0.06834 0.05695 0.07908 0.07218 0.11351 0.08825 0.09107 0.04532 0.08712 0.08882 0.12796 0.06418 0.0854 0.06215 0.09651 Appendix 265 Table 6.56 Rodent experiment results Dose 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Y 0 0 1 1 1 2 0 0 1 2 0 0 3 1 0 0 2 3 0 2 0 0 2 1 1 0 0 N 15 3 9 12 13 13 16 11 11 8 14 13 14 
13 8 13 14 14 11 12 15 15 14 11 16 12 14 Dose 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 Y 0 1 1 0 2 0 0 3 0 2 3 1 1 0 0 0 1 0 2 2 2 3 1 0 1 1 1 N 6 14 12 10 14 12 14 14 10 12 13 11 11 11 13 10 12 11 10 12 15 12 12 12 12 13 15 Y 1 0 0 0 0 0 4 0 0 1 2 0 1 1 0 0 1 0 1 0 0 1 2 1 0 0 1 Dose 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 N 12 12 11 13 12 14 15 14 12 6 13 10 14 12 10 9 12 13 14 13 14 13 12 14 13 12 7 Dose 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 Y 12 1 0 2 2 4 0 1 0 1 3 0 1 0 3 2 3 3 1 1 8 0 2 8 4 2 N 12 12 13 8 12 13 13 13 12 9 9 11 14 10 12 21 10 11 11 11 14 15 13 11 12 12 Appendix Data: Fleas Bioen B1 B1 B1 B1 B1 B1 B1 B1 B1 SP Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Treat T1 T1 T1 T2 T2 T2 T3 T3 T3 Rep 1 2 3 1 2 3 1 2 3 Overvi 10 10 10 10 10 10 9 9 8 Dead 0 0 0 0 0 0 1 1 2 (continued) 266 Data: Fleas Bioen B1 B1 B1 B1 B1 B1 B1 B1 B1 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 6 SP Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Daphnia Generalized Linear Mixed Models for Proportions and Percentages Treat T4 T4 T4 T5 T5 T5 T6 T6 T6 T1 T1 T1 T2 T2 T2 T3 T3 T3 T4 T4 T4 T5 T5 T5 T6 T6 T6 T1 T1 T1 T2 T2 T2 T3 T3 T3 T4 T4 T4 T5 T5 T5 Rep 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Overvi 2 2 3 0 0 0 0 0 0 10 10 10 10 10 10 9 9 9 2 2 2 0 0 0 0 0 0 10 10 10 10 10 10 8 9 9 3 2 2 0 0 0 Dead 8 8 7 10 10 10 10 10 10 0 0 0 0 0 0 1 1 1 8 8 8 10 10 10 10 10 10 0 0 0 0 0 0 2 1 1 7 
8 8 10 10 10 (continued) Appendix Data: Fleas Bioen B3 B3 B3 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B3 B3 B3 267 SP Daphnia Daphnia Daphnia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Treat T6 T6 T6 T1 T1 T1 T2 T2 T2 T3 T3 T3 T4 T4 T4 T5 T5 T5 T6 T6 T6 T1 T1 T1 T2 T2 T2 T3 T3 T3 T4 T4 T4 T5 T5 T5 T6 T6 T6 T1 T1 T1 Rep 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Overvi 0 0 0 10 10 10 5 6 6 5 5 5 2 3 3 2 2 2 0 0 0 10 10 10 7 5 6 5 5 5 4 4 4 2 2 2 0 0 0 10 10 10 Dead 10 10 10 0 0 0 5 4 4 5 5 5 8 7 7 8 8 8 10 10 10 0 0 0 3 5 4 5 5 5 6 6 6 8 8 8 10 10 10 0 0 0 (continued) 268 Data: Fleas Bioen B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 B3 6 SP Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Dubia Generalized Linear Mixed Models for Proportions and Percentages Treat T2 T2 T2 T3 T3 T3 T4 T4 T4 T5 T5 T5 T6 T6 T6 Data: Commercial crop explant detachment Block A B 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1 2 1 1 2 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 3 1 2 3 1 1 3 1 2 3 1 1 3 2 2 3 2 1 3 2 2 3 2 Rep 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Overvi 8 8 7 5 5 6 2 3 2 3 2 2 0 0 0 Dead 2 2 3 5 5 4 8 7 8 7 8 8 10 10 10 C 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 y 15 10 17 19 26 21 14 12 10 12 30 32 37 30 32 37 18 18 23 21 24 27 37 37 N 73 86 69 32 125 62 81 21 92 108 44 33 91 42 98 44 52 73 108 55 106 92 64 97 Appendix Data: Cockroaches (E1 = np, E2 = ng, E3 = adult) Bioassay Isolation 1 Bb1 2 Bb1 1 Bb1 2 Bb1 1 Bb1 2 Bb1 1 Bb2 2 Bb2 1 Bb2 2 Bb2 1 Bb2 2 Bb2 1 Bb3 2 Bb3 1 Bb3 2 Bb3 1 Bb3 2 Bb3 1 Bb4 2 Bb4 1 Bb4 2 Bb4 1 Bb4 2 Bb4 1 Bb5 2 Bb5 1 Bb5 2 Bb5 1 Bb5 2 Bb5 1 Bb6 2 Bb6 1 Bb6 2 Bb6 1 Bb6 2 Bb6 1 Bb8 2 Bb8 1 Bb8 2 Bb8 1 Bb8 2 Bb8 269 Age np np 
ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a Dead 7 6 2 2 9 6 6 7 7 3 10 8 3 2 2 3 8 9 6 4 5 3 10 8 7 7 3 1 7 3 7 9 6 2 10 7 9 9 4 2 9 10 (continued) 270 6 Generalized Linear Mixed Models for Proportions and Percentages Data: Cockroaches (E1 = np, E2 = ng, E3 = adult) Bioassay Isolation 1 Bb9 2 Bb9 1 Bb9 2 Bb9 1 Bb9 2 Bb9 1 Bb10 2 Bb10 1 Bb10 2 Bb10 1 Bb10 2 Bb10 1 Bb11 2 Bb11 1 Bb11 2 Bb11 1 Bb11 2 Bb11 1 Bb12 2 Bb12 1 Bb12 2 Bb12 1 Bb12 2 Bb12 1 Bb13 2 Bb13 1 Bb13 2 Bb13 1 Bb13 2 Bb13 1 Bb14 2 Bb14 1 Bb14 2 Bb14 1 Bb14 2 Bb14 1 Bb15 2 Bb15 1 Bb15 2 Bb15 1 Bb15 2 Bb15 Age np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a np np ng ng a a Dead 5 8 6 2 7 5 8 6 1 4 3 4 8 7 1 3 6 8 8 9 8 9 7 6 6 3 0 1 5 6 10 5 4 2 6 6 5 10 6 1 4 5 (continued) Appendix Data: Cockroaches (E1 = np, E2 = ng, E3 = adult) Bioassay Isolation 1 Bb16 2 Bb16 1 Bb16 2 Bb16 1 Bb16 2 Bb16 1 Control 2 Control 1 Control 2 Control 1 Control 2 Control 271 Age np np ng ng a a np np ng ng a a Dead 5 7 3 4 8 6 1 0 0 0 0 1 Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). b v r t M n 1 1 1 1 1 14 1 1 1 2 2 12 1 1 1 3 0 12 1 1 1 4 0 13 1 1 1 5 3 8 1 1 1 6 0 9 1 1 2 1 7 8 1 1 2 2 0 10 1 1 2 3 1 14 1 1 2 4 0 10 1 1 2 5 0 17 1 1 2 6 0 10 1 1 3 1 9 14 1 1 3 2 1 11 1 1 3 3 0 10 1 1 3 4 1 14 1 1 3 5 0 10 1 1 3 6 0 21 1 1 4 1 10 17 1 1 4 2 0 9 1 1 4 3 1 12 1 1 4 4 0 11 1 1 4 5 0 12 1 1 4 6 0 10 1 1 5 1 8 11 1 1 5 2 1 10 (continued) 272 6 Generalized Linear Mixed Models for Proportions and Percentages Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). 
b v r t M n 1 1 5 3 0 9 1 1 5 4 2 12 1 1 5 5 0 10 1 1 5 6 1 11 1 2 1 1 7 9 1 2 1 2 2 10 1 2 1 3 0 10 1 2 1 4 0 14 1 2 1 5 1 12 1 2 1 6 0 13 1 2 2 1 6 12 1 2 2 2 0 11 1 2 2 3 1 13 1 2 2 4 0 9 1 2 2 5 2 11 1 2 2 6 0 10 1 2 3 1 6 7 1 2 3 2 1 12 1 2 3 3 0 9 1 2 3 4 1 10 1 2 3 5 0 14 1 2 3 6 2 12 1 2 4 1 7 13 1 2 4 2 0 10 1 2 4 3 0 10 1 2 4 4 1 12 1 2 4 5 0 9 1 2 4 6 1 8 1 2 5 1 11 15 1 2 5 2 1 13 1 2 5 3 0 14 1 2 5 4 1 14 1 2 5 5 0 11 1 2 5 6 0 11 1 3 1 1 5 11 1 3 1 2 5 11 1 3 1 3 0 15 1 3 1 4 1 15 1 3 1 5 0 8 1 3 1 6 1 10 1 3 2 1 4 9 1 3 2 2 1 15 (continued) Appendix 273 Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). b v r t M n 1 3 2 3 0 11 1 3 2 4 0 13 1 3 2 5 1 12 1 3 2 6 0 12 1 3 3 1 9 12 1 3 3 2 2 14 1 3 3 3 0 12 1 3 3 4 0 12 1 3 3 5 0 10 1 3 3 6 0 13 1 3 4 1 10 10 1 3 4 2 0 10 1 3 4 3 0 8 1 3 4 4 0 10 1 3 4 5 1 14 1 3 4 6 3 11 1 3 5 1 9 11 1 3 5 2 0 11 1 3 5 3 1 11 1 3 5 4 1 14 1 3 5 5 0 9 1 3 5 6 0 9 2 1 1 1 0 12 2 1 1 2 0 12 2 1 1 3 0 14 2 1 1 4 0 12 2 1 1 5 0 10 2 1 1 6 1 13 2 1 2 1 8 9 2 1 2 2 1 9 2 1 2 3 0 12 2 1 2 4 0 10 2 1 2 5 0 12 2 1 2 6 1 10 2 1 3 1 11 14 2 1 3 2 1 12 2 1 3 3 1 11 2 1 3 4 0 10 2 1 3 5 0 9 2 1 3 6 3 11 2 1 4 1 12 15 2 1 4 2 0 15 (continued) 274 6 Generalized Linear Mixed Models for Proportions and Percentages Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). 
b v r t M n 2 1 4 3 0 10 2 1 4 4 1 9 2 1 4 5 1 10 2 1 4 6 0 16 2 1 5 1 10 14 2 1 5 2 1 9 2 1 5 3 0 11 2 1 5 4 0 11 2 1 5 5 0 11 2 1 5 6 0 11 2 2 1 1 1 9 2 2 1 2 0 9 2 2 1 3 0 12 2 2 1 4 1 10 2 2 1 5 1 12 2 2 1 6 0 17 2 2 2 1 9 12 2 2 2 2 0 12 2 2 2 3 0 11 2 2 2 4 2 14 2 2 2 5 0 11 2 2 2 6 0 10 2 2 3 1 7 13 2 2 3 2 0 16 2 2 3 3 1 12 2 2 3 4 0 10 2 2 3 5 0 10 2 2 3 6 0 11 2 2 4 1 7 13 2 2 4 2 1 18 2 2 4 3 0 10 2 2 4 4 0 11 2 2 4 5 0 11 2 2 4 6 3 13 2 2 5 1 5 10 2 2 5 2 0 10 2 2 5 3 0 10 2 2 5 4 0 10 2 2 5 5 0 9 2 2 5 6 1 12 2 3 1 1 6 13 2 3 1 2 0 10 (continued) Appendix 275 Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). b v r t M n 2 3 1 3 1 11 2 3 1 4 3 11 2 3 1 5 0 12 2 3 1 6 1 19 2 3 2 1 12 13 2 3 2 2 0 11 2 3 2 3 0 8 2 3 2 4 0 9 2 3 2 5 0 17 2 3 2 6 0 12 2 3 3 1 8 11 2 3 3 2 4 12 2 3 3 3 0 11 2 3 3 4 0 10 2 3 3 5 0 15 2 3 3 6 3 13 2 3 4 1 5 14 2 3 4 2 1 9 2 3 4 3 0 12 2 3 4 4 1 12 2 3 4 5 0 10 2 3 4 6 2 14 2 3 5 1 10 14 2 3 5 2 0 14 2 3 5 3 1 10 2 3 5 4 1 13 2 3 5 5 1 15 2 3 5 6 4 10 3 1 1 1 8 12 3 1 1 2 1 14 3 1 1 3 0 12 3 1 1 4 0 20 3 1 1 5 1 18 3 1 1 6 7 15 3 1 2 1 9 16 3 1 2 2 1 12 3 1 2 3 0 13 3 1 2 4 0 15 3 1 2 5 0 17 3 1 2 6 1 18 3 1 3 1 7 12 3 1 3 2 0 14 (continued) 276 6 Generalized Linear Mixed Models for Proportions and Percentages Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). 
b v r t M n 3 1 3 3 1 13 3 1 3 4 0 18 3 1 3 5 0 14 3 1 3 6 0 14 3 1 4 1 10 14 3 1 4 2 2 17 3 1 4 3 0 10 3 1 4 4 1 19 3 1 4 5 0 17 3 1 4 6 0 16 3 1 5 1 9 10 3 1 5 2 1 14 3 1 5 3 1 11 3 1 5 4 0 18 3 1 5 5 0 15 3 1 5 6 1 11 3 2 1 1 10 10 3 2 1 2 1 11 3 2 1 3 0 12 3 2 1 4 1 15 3 2 1 5 4 20 3 2 1 6 0 14 3 2 2 1 9 12 3 2 2 2 1 10 3 2 2 3 1 12 3 2 2 4 3 18 3 2 2 5 0 16 3 2 2 6 0 12 3 2 3 1 10 11 3 2 3 2 1 16 3 2 3 3 1 14 3 2 3 4 1 17 3 2 3 5 2 15 3 2 3 6 1 16 3 2 4 1 9 11 3 2 4 2 2 14 3 2 4 3 0 10 3 2 4 4 0 18 3 2 4 5 0 17 3 2 4 6 0 12 3 2 5 1 11 12 3 2 5 2 2 12 (continued) Appendix Data: Disease incidence in grapevine plants (b = block, v = plant, r = shoot, t = treatment, m = number of diseased leaves per shoot, and n = total number of leaves per shoot). b v r t M n 3 2 5 3 0 11 3 2 5 4 0 13 3 2 5 5 0 18 3 2 5 6 0 12 3 3 1 1 7 9 3 3 1 2 0 13 3 3 1 3 0 9 3 3 1 4 0 18 3 3 1 5 0 18 3 3 1 6 0 13 3 3 2 1 6 14 3 3 2 2 3 16 3 3 2 3 1 15 3 3 2 4 0 17 3 3 2 5 1 17 3 3 2 6 3 14 3 3 3 1 10 11 3 3 3 2 0 10 3 3 3 3 1 16 3 3 3 4 1 18 3 3 3 5 0 16 3 3 3 6 0 11 3 3 4 1 10 10 3 3 4 2 1 14 3 3 4 3 0 10 3 3 4 4 1 19 3 3 4 5 2 19 3 3 4 6 2 14 3 3 5 1 8 10 3 3 5 2 0 12 3 3 5 3 0 12 3 3 5 4 0 18 3 3 5 5 0 14 3 3 5 6 0 12 277 278 6 Generalized Linear Mixed Models for Proportions and Percentages Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. 
If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 7 Time of Occurrence of an Event of Interest

7.1 Introduction

In fields such as the biological sciences, animal science, and agronomy, a common outcome of interest is the time at which an event of interest occurs. The main characteristic of these data is that the subjects/experimental units are usually observed for different periods of time until the event of interest occurs. These events may be adverse, such as the death of an experimental unit or the cessation of lactation, or positive, such as conception in a female following a particular treatment or the onset of estrus in a female undergoing hormone treatment, among others. Because of the characteristics of these response variables, a "normal" distribution is often a poor choice for modeling the time at which the event of interest occurs. The exponential, log-normal, gamma, Weibull, and other more complex distributions are more commonly used and are better choices for modeling these phenomena. Fitting a generalized linear mixed model (GLMM) is a good option for analyzing these phenomena because the distribution of the response conditional on the random effects has desirable properties. In this vein, it is conventional to speak of survival data and survival analysis, regardless of the nature of the event. Similar data also arise when measuring the time to complete a task, such as walking 50 meters, passing an agronomy exam, performing a sensory evaluation of coffee, and so on. The purpose of this chapter is to provide the reader with the essential language of linear models and the connection between GLMMs and survival analysis.

© The Author(s) 2023. J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_7

7.2 Generalized Linear Mixed Models with a Gamma Response

The gamma family of distributions describes continuous, nonnegative, right-skewed data. A gamma distribution has two positive parameters, $\alpha$ and $\beta$, and its probability density function is given by:

$$f(y; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, y^{\alpha - 1} e^{-y/\beta}, \quad y \geq 0,$$

where $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\, dt$ is the gamma function (Casella and Berger 2002). The mean and variance of a gamma random variable are $E[Y] = \alpha\beta = \mu$ and $\mathrm{Var}[Y] = \alpha\beta^2 = \mu^2/\alpha$, respectively. This density function can be rewritten in terms of the mean $\mu$ and the scale parameter $\phi = 1/\alpha$ (so that $\beta = \mu\phi$):

$$f(y; \mu, \phi) = \frac{1}{\Gamma\left(\frac{1}{\phi}\right)(\mu\phi)^{1/\phi}}\, y^{\frac{1}{\phi} - 1} e^{-y/(\mu\phi)}, \quad y \geq 0.$$

7.2.1 CRD: Estrus Induction in Pelibuey Ewes

Estrus induction in ewes is a very common practice on livestock farms and at research centers. In this experiment, an animal researcher used gonadotropin-releasing hormone (GnRH), equine chorionic gonadotropin (eCG), and progesterone (P4) in a controlled internal drug-releasing (CIDR) intravaginal device in Pelibuey ewes (n = 78), with the type of lambing (single, double, or triple) as the treatments. To ensure that all animals were in good condition during the experiment, the ewes received the same zootechnical management and feeding. The ewes were synchronized on the same day under a synchronization protocol. Table 7.1 presents the sources of variation and degrees of freedom of the analysis of variance (ANOVA). The variables evaluated in this experiment were the time of onset and the duration of estrus ($y_{ij}$), in hours, according to the type of calving.
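As a brief aside on the two forms of the gamma density given in Sect. 7.2, their equivalence can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function names and the values of alpha, beta, and y are illustrative assumptions, not taken from the book.

```python
import math

def gamma_pdf_shape_scale(y, alpha, beta):
    """Gamma density with shape alpha and scale beta, for y > 0."""
    log_f = ((alpha - 1) * math.log(y) - y / beta
             - math.lgamma(alpha) - alpha * math.log(beta))
    return math.exp(log_f)

def gamma_pdf_mean_dispersion(y, mu, phi):
    """Same density written with mean mu = alpha*beta and phi = 1/alpha."""
    shape = 1.0 / phi
    log_f = ((shape - 1) * math.log(y) - y / (mu * phi)
             - math.lgamma(shape) - shape * math.log(mu * phi))
    return math.exp(log_f)

# Hypothetical parameter values, chosen only for illustration
alpha, beta, y = 4.0, 2.5, 7.0
mu, phi = alpha * beta, 1.0 / alpha      # mean and scale parameter
variance = alpha * beta ** 2             # equals phi * mu**2

f_shape_scale = gamma_pdf_shape_scale(y, alpha, beta)
f_mean_disp = gamma_pdf_mean_dispersion(y, mu, phi)
print(f_shape_scale, f_mean_disp, variance, phi * mu ** 2)
```

Both functions return the same density value, and the variance computed as alpha*beta^2 agrees with phi*mu^2, which is the mean-variance relationship used by the gamma GLMMs in the rest of this section.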
The variability among ewes in weight, age, and body condition must be taken into account in the analysis. The data from this experiment can be found in Appendix 1 of this book (Data: Pelibuey Sheep).

Table 7.1 Sources of variation and degrees of freedom

| Sources of variation | Degrees of freedom |
|---|---|
| Treatment | $t - 1 = 3 - 1 = 2$ |
| Error | $\sum_{i=1}^{3} r_i - t = 75$ |
| Total | $\sum_{i=1}^{3} r_i - 1 = 78 - 1 = 77$ |

Thus, the components of a gamma GLMM are as follows:

Distributions: $y_{ij} \mid r(\tau)_{ij} \sim \text{Gamma}(\mu_{ij}, \phi)$; $i = 1, 2, 3$; $j = 1, \cdots, r_i$; $r(\tau)_{ij} \sim N(0, \sigma^2_{\tau(\text{animal})})$
Linear predictor: $\eta_{ij} = \mu + \tau_i + r(\tau)_{ij}$
Link function: $\log(\mu_{ij}) = \eta_{ij}$

where $\eta_{ij}$ is the linear predictor for treatment $i$ (type of birth: single, double, or triple) in ewe $j$, $\mu$ is the overall mean, $\tau_i$ is the fixed effect due to the type of birth (treatment), and $r(\tau)_{ij}$ is the random effect of ewe $j$ within type of birth $i$, with $r(\tau)_{ij} \sim N(0, \sigma^2_{\tau(\text{animal})})$. The following GLIMMIX program fits the model:

    /* nobound: variance estimates may be negative; method=laplace: Laplace approximation */
    proc glimmix nobound method=laplace;
    class animal birthtype;
    /* gamma response; GLIMMIX uses the log link by default for dist=gamma */
    model Inestro = birthtype/dist=gamma;
    random birthtype(animal);
    /* ilink: also report the means on the data scale */
    lsmeans birthtype/lines ilink;
    run;

Part of the results is reported in Table 7.2.

Table 7.2 Results of the analysis of variance

(a) Covariance parameter estimates

| Cov Parm | Onset of estrus: Estimate | Onset of estrus: Standard error | Duration of estrus: Estimate | Duration of estrus: Standard error |
|---|---|---|---|---|
| Birth type (animal), $\hat{\sigma}^2_{\text{birthtype(animal)}}$ | -0.01572 | . | 0.08370 | 0.09692 |
| Residual ($\phi$) | 0.06668 | 0.01232 | 0.2073 | 0.08938 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | Onset of estrus: F-value | Onset of estrus: Pr > F | Duration of estrus: F-value | Duration of estrus: Pr > F |
|---|---|---|---|---|---|---|
| Birth type | 2 | 75 | 5.12 | 0.0082 | 22.61 | <0.0001 |

Part (a) shows the estimated variance component for animals within type of birth, $\hat{\sigma}^2_{\text{birthtype(animal)}} = -0.0157$ for the onset of estrus (0.0837 for its duration), as well as the scale parameter for the onset of estrus, $\hat{\phi} = 0.06668$.
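Because the variance of a gamma response is Var[Y] = μ²/α = φμ², the scale parameter can be read as a squared coefficient of variation of the response, conditional on the random effects. The sketch below applies this reading to the two residual estimates reported in Table 7.2(a); the variable and function names are illustrative assumptions.

```python
import math

phi_onset = 0.06668     # Residual (phi) for onset of estrus, Table 7.2(a)
phi_duration = 0.2073   # Residual (phi) for duration of estrus, Table 7.2(a)

def conditional_cv(phi):
    """Coefficient of variation implied by Var[Y] = phi * mu**2."""
    return math.sqrt(phi)

cv_onset = conditional_cv(phi_onset)
cv_duration = conditional_cv(phi_duration)
print(f"onset: {cv_onset:.3f}  duration: {cv_duration:.3f}")
```

Under this reading, the onset times vary by roughly 26% around their conditional mean and the durations by roughly 46%, which suggests that the duration of estrus is the noisier of the two responses.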
Table 7.2 (b) shows the results of the type III tests of fixed effects, which indicate a statistically significant effect of treatment (type of birth) on both the time of onset and the duration of estrus.

Table 7.3 Means and standard errors on the model scale ("Estimate" column) and the data scale ("Mean" column) for the onset and duration of estrus in Pelibuey ewes

Birth type least squares means

| Birth type | Estimate | Standard error | DF | t-value | P-value | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| Onset of estrus | | | | | | | |
| 1 | 3.2913 | 0.06631 | 75 | 49.63 | <0.0001 | 26.8787 | 1.7824 |
| 2 | 3.0622 | 0.04606 | 75 | 66.48 | <0.0001 | 21.3735 | 0.9845 |
| 3 | 3.0496 | 0.04542 | 75 | 67.14 | <0.0001 | 21.1059 | 0.9586 |
| Duration of estrus | | | | | | | |
| 1 | 1.6826 | 0.1518 | 75 | 11.09 | <0.0001 | 5.3795 | 0.8164 |
| 2 | 2.6716 | 0.1171 | 75 | 22.81 | <0.0001 | 14.4637 | 1.6938 |
| 3 | 2.8075 | 0.09846 | 75 | 28.51 | <0.0001 | 16.5684 | 1.6313 |

The last two columns of Table 7.3, labeled "Mean" and "Standard error mean," correspond to the means ($\mu_{ij}$) on the data scale for the onset and duration of estrus, with their respective standard errors. For example, the mean time to onset of estrus in single-birth ewes was 26.88 ± 1.78 hours, whereas for double- and triple-birth ewes, it was 21.37 ± 0.98 and 21.11 ± 0.96 hours, respectively. On the other hand, the average duration of estrus (in hours) was longer in double- and triple-birth ewes (14.46 ± 1.69 and 16.57 ± 1.63, respectively) than in single-birth ewes (5.38 ± 0.82).

7.2.2 Randomized Complete Block Design (RCBD): Itch Relief Drugs

A total of 10 male volunteer patients between 20 and 30 years of age participated as a study group to compare 7 treatments (Trts) (5 drugs, 1 placebo, and 1 no drug) for itch relief. Because subjects respond differently to the drugs, and each subject received a different treatment on each of the 7 study days, each subject can be considered a block. Treatment assignment was randomized across days.
Except for the drug-free day, subjects were administered the treatment intravenously, and, then, itching was induced on their forearms using an effective itch stimulus called cowage. The duration of itching, in seconds, was recorded. The data are shown in Table 7.4. From left to right, the drugs used were papaverine = Papv, morphine = Morp, aminophylline = Amino, pentobarbital = Pent, and tripelennamine = Tripel. The analysis of variance table (Table 7.5) shows the sources of variation and degrees of freedom for this experiment.

Table 7.4 Time taken to get rid of the itch (seconds)

| Patient | No drug | Placebo | Papv | Morp | Amino | Pent | Tripel |
|---|---|---|---|---|---|---|---|
| 1 | 174 | 263 | 105 | 199 | 141 | 108 | 141 |
| 2 | 224 | 213 | 103 | 143 | 168 | 341 | 184 |
| 3 | 260 | 231 | 145 | 113 | 78 | 159 | 125 |
| 4 | 255 | 291 | 103 | 225 | 164 | 135 | 227 |
| 5 | 165 | 168 | 144 | 176 | 127 | 239 | 194 |
| 6 | 237 | 121 | 94 | 144 | 114 | 136 | 155 |
| 7 | 191 | 137 | 35 | 87 | 96 | 140 | 121 |
| 8 | 100 | 102 | 133 | 120 | 222 | 134 | 129 |
| 9 | 115 | 89 | 83 | 100 | 165 | 185 | 79 |
| 10 | 189 | 433 | 237 | 173 | 168 | 188 | 317 |

Table 7.5 Sources of variation and degrees of freedom

| Sources of variation | Degrees of freedom |
|---|---|
| Blocks | $r - 1 = 10 - 1 = 9$ |
| Treatment | $t - 1 = 7 - 1 = 6$ |
| Error | $(t - 1)(r - 1) = 6 \times 9 = 54$ |
| Total | $r \times t - 1 = 10 \times 7 - 1 = 69$ |

The components of the GLMM with a gamma response are as follows:

Distributions: $y_{ij} \mid r_j \sim \text{Gamma}(\mu_{ij}, \phi)$; $i = 1, \cdots, 7$; $j = 1, \cdots, 10$; $r_j \sim N(0, \sigma^2_{\text{patient}})$
Linear predictor: $\eta_{ij} = \mu + r_j + \tau_i$
Link function: $\log(\mu_{ij}) = \eta_{ij}$

where $\eta_{ij}$ is the linear predictor for treatment $i$ in block $j$, $\mu$ is the overall mean, $r_j$ is the random effect of the patient with $r_j \sim N(0, \sigma^2_{\text{patient}})$, and $\tau_i$ is the fixed effect due to treatment. Note that, although the canonical link of the exponential and gamma distributions is the inverse of the mean, gamma and exponential GLMMs most often use the computationally more stable log link (link = log), which was used in this and in the previous analysis. The following GLIMMIX syntax fits a gamma GLMM for the randomized complete block design.
    /* nobound: variance estimates may be negative; method=laplace: Laplace approximation */
    proc glimmix nobound method=laplace;
    class Patient Trt;
    /* gamma response; GLIMMIX uses the log link by default for dist=gamma */
    model y = Trt/dist=gamma;
    random Patient;
    /* ilink: also report the means on the data scale */
    lsmeans Trt/lines ilink;
    run;

Table 7.6 Results of the analysis of variance

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (y \| r. effects) | 728.62 |
| Pearson's chi-square | 5.69 |
| Pearson's chi-square/DF | 0.08 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Patient | 0.03964 | 0.02375 |
| Residual ($\phi$) | 0.09132 | 0.01640 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 6 | 54 | 3.82 | 0.0030 |

The fit statistics of the conditional model (Pearson's chi-square/DF = 0.08), together with the variance component (Patient) and the scale parameter $\phi$, indicate that the gamma model adequately describes the dataset (Table 7.6 parts (a) and (b)). The analysis of variance (Table 7.6 part (c)) indicates that the treatments differ highly significantly in the mean duration of itching (P = 0.0030). The scatter of the residuals versus the linear predictor (Fig. 7.1, top left) suggests that the variance is constant and homogeneous. The histogram (top right) shows a nearly symmetrical pattern with little skewness. Furthermore, the residuals versus quantile plot (bottom left) shows no marked deviations, indicating that the fit is adequate. Finally, the bottom right plot shows that the residuals average zero and vary between -0.5 and 0.75.

The "lsmeans" on the data scale for each of the five drugs, the placebo, and the no-drug control are shown under the "Mean" column, with their respective standard errors, in Table 7.7. Each of the five drugs appears to have a significant effect compared to the placebo and control. Papaverine (Papv) is the most effective drug. The placebo and the control treatment have statistically similar means.
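Since the model uses a log link, the "Mean" column of Table 7.7 is the exponential of the "Estimate" column, and its standard error can be approximated by the delta method as exp(estimate) × SE(estimate). The sketch below assumes that this is how the data-scale values were obtained; the function and variable names are illustrative, and the inputs are the rounded values printed in Table 7.7.

```python
import math

# (estimate, standard error) on the log scale, taken from Table 7.7
lsmeans_log_scale = {
    "Papv": (4.7356, 0.1149),
    "Placebo": (5.2704, 0.1151),
    "No drug": (5.2542, 0.1148),
}

def back_transform(estimate, std_err):
    """Inverse log link plus delta-method standard error."""
    mean = math.exp(estimate)
    return mean, mean * std_err

for trt, (est, se) in lsmeans_log_scale.items():
    mean, se_mean = back_transform(est, se)
    print(f"{trt:8s} mean = {mean:7.2f}  SE = {se_mean:6.2f}")

papv_mean, papv_se = back_transform(*lsmeans_log_scale["Papv"])
```

The results agree with the tabulated means and standard errors up to the rounding of the printed inputs (for papaverine, about 113.9 seconds with a standard error of about 13.1).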
The relatively large variability in the placebo group suggests that some patients responded negatively to the placebo compared to the control, whereas others responded positively. Figure 7.2 shows that papaverine reduced the itching time the most, followed by aminophylline and morphine, whereas the efficacies of pentobarbital and tripelennamine in eliminating itching were highly similar to each other.

Fig. 7.1 Conditional residuals

Table 7.7 Means and standard errors on the model scale ("Estimate" column) and the data scale ("Mean" column) for the average duration of the itch

Trt least squares means

| Trt | Estimate | Standard error | DF | t-value | P-value | Mean | Standard error of mean |
|---|---|---|---|---|---|---|---|
| Amino | 4.9795 | 0.1149 | | 43.32 | <0.0001 | 145.41 | 16.7129 |
| Morp | 4.9797 | 0.1146 | | 43.44 | <0.0001 | 145.43 | 16.6733 |
| Papv | 4.7356 | 0.1149 | | 41.20 | <0.0001 | 113.93 | 13.0956 |
| Pento | 5.1703 | 0.1149 | | 44.99 | <0.0001 | 175.97 | 20.2211 |
| Placebo | 5.2704 | 0.1151 | | 45.79 | <0.0001 | 194.49 | 22.3867 |
| No drug | 5.2542 | 0.1148 | | 45.76 | <0.0001 | 191.36 | 21.9723 |
| Tripel | 5.0802 | 0.1147 | | 44.28 | <0.0001 | 160.80 | 18.4487 |

Fig. 7.2 Average time taken to eliminate itching (average time in seconds by treatment: Amino, Morp, Papv, Pento, Placebo, No drug, Tripel)

7.2.3 Factorial Design: Insect Survival Time

This experiment studied the effectiveness of four different insecticides (Insec1, Insec2, Insec3, and Insec4) at three different concentration levels (low, medium, and high) on the survival time (in hours) of a particular species of beetles (Appendix 1: Data: Beetles). The combination of both factors (insecticide × dose) yielded a total of 12 treatments. The objective of this study was to assess the effects of insecticide, dose, and their interaction on beetle survival time. Because insects differ in their intrinsic characteristics and respond differently to certain stimuli, they must be considered a source of variation in the experiment. The 48 available beetles were randomly assigned in equal numbers to 4 groups (blocks), each receiving the 12 treatment combinations. That is, four beetles were randomly assigned to each treatment. The sources of variation and degrees of freedom for this experiment are shown in the following analysis of variance table (Table 7.8).

Table 7.8 Sources of variation and degrees of freedom

| Sources of variation | Degrees of freedom |
|---|---|
| Blocks | $r - 1 = 4 - 1 = 3$ |
| Insecticide | $a - 1 = 4 - 1 = 3$ |
| Dose | $b - 1 = 3 - 1 = 2$ |
| Insecticide*dose | $(a - 1)(b - 1) = 3 \times 2 = 6$ |
| Error | $(ab - 1)(r - 1) = 11 \times 3 = 33$ |
| Total | $r \times a \times b - 1 = 4 \times 4 \times 3 - 1 = 47$ |

The components of the gamma-response GLMM are as follows:

Distributions: $y_{ijk} \mid r_k \sim \text{Gamma}(\mu_{ijk}, \phi)$; $i = 1, \cdots, 4$; $j = 1, 2, 3$; $k = 1, \cdots, 4$; $r_k \sim N(0, \sigma^2_{\text{block}})$
Linear predictor: $\eta_{ijk} = \mu + r_k + \alpha_i + \beta_j + (\alpha\beta)_{ij}$
Link function: $\log(\mu_{ijk}) = \eta_{ijk}$

The following GLIMMIX statements fit a GLMM with a gamma response:

    proc glimmix nobound method=laplace;
    class dose insecticide insect;
    model time = dose|insecticide/dist=gamma;
    random insect;
    lsmeans dose|insecticide/lines ilink;
    run;

Part of the Statistical Analysis Software (SAS) output is shown in Table 7.9.

Table 7.9 Results of the analysis of variance

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (time \| r. effects) | 121.05 |
| Pearson's chi-square | 1.91 |
| Pearson's chi-square/DF | 0.04 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Block | -0.00173 | . |
| Residual | 0.04155 | 0.008818 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Dose | 2 | 33 | 69.61 | <0.0001 |
| Insecticide | 3 | 33 | 31.36 | <0.0001 |
| Dose*insecticide | 6 | 33 | 2.05 | 0.0868 |

The value of the conditional model's Pearson's chi-square/DF = 0.04 indicates that the gamma distribution adequately models the data.
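The Pearson's chi-square/DF values quoted in this section can be recomputed from the fit statistics. The sketch below assumes that the divisor is the number of observations; this is an inference made here because it reproduces the ratios reported in Tables 7.6 and 7.9, not something stated in the text. The dictionary labels and function name are illustrative.

```python
# (Pearson chi-square, assumed DF = number of observations, reported ratio)
fits = {
    "Table 7.6, itch data (10 patients x 7 treatments)": (5.69, 10 * 7, 0.08),
    "Table 7.9, beetle data (4 insecticides x 3 doses x 4 blocks)": (1.91, 4 * 3 * 4, 0.04),
}

def pearson_ratio(chi_square, df):
    """Pearson chi-square divided by its degrees of freedom."""
    return chi_square / df

for label, (chi_square, df, reported) in fits.items():
    ratio = pearson_ratio(chi_square, df)
    print(f"{label}: {ratio:.4f} (reported {reported})")
```

Both recomputed ratios round to the values printed in the tables.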
The estimated variance component for blocks and the scale parameter, given by the "Residual" value, are shown in part (b) ($\hat{\sigma}^2_{\text{block}} = -0.00173$ and $\hat{\phi} = 0.04155$, respectively). The analysis of variance in part (c) of Table 7.9 indicates that both insecticide and dose (P < 0.0001) have significant effects (toxicity) on beetle survival time. The interaction between the two factors is not significant at the 5% level, although it is close (P = 0.0868). The "lsmeans" values on the data scale for dose, $\mu_{\cdot j \cdot}$ (part (a)), and insecticide, $\mu_{i \cdot \cdot}$ (part (b)), with their respective standard errors, are listed under the columns titled "Mean" and "Standard error mean" of Table 7.10.

Table 7.10 Means and standard errors on the model scale ("Estimate") and the data scale ("Mean") for the factors dose and type of insecticide

(a) Dose least squares means

| Dose | Estimate | Standard error | DF | t-value | P-value | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| High | 0.9960 | 0.04984 | 33 | 19.98 | <0.0001 | 2.7075 | 0.1349 |
| Low | 1.7840 | 0.04984 | 33 | 35.79 | <0.0001 | 5.9538 | 0.2967 |
| Medium | 1.6203 | 0.04984 | 33 | 32.51 | <0.0001 | 5.0548 | 0.2519 |

(b) Insecticide least squares means

| Insecticide | Estimate | Standard error | DF | t-value | P-value | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| Insec1 | 1.1074 | 0.05755 | 33 | 19.24 | <0.0001 | 3.0265 | 0.1742 |
| Insec2 | 1.8272 | 0.05755 | 33 | 31.75 | <0.0001 | 6.2166 | 0.3578 |
| Insec3 | 1.3041 | 0.05755 | 33 | 22.66 | <0.0001 | 3.6845 | 0.2121 |
| Insec4 | 1.6284 | 0.05755 | 33 | 28.29 | <0.0001 | 5.0960 | 0.2933 |

The combination of levels of both factors affected the average survival time of the beetles (Table 7.11). For insecticides 1 and 3 at a high dose, the survival time was the lowest, with average times of 2.10 ± 0.209 and 2.35 ± 0.234 hours, respectively. In general, lower survival times were observed for insecticides 1 and 3 than for insecticides 2 and 4.

Table 7.11 Means and standard errors on the model scale and the data scale for the interaction between dose and type of insecticide

Dose*insecticide least squares means

| Dose | Insecticide | Estimate | Standard error | DF | t-value | P-value | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| High | Insec1 | 0.7419 | 0.09968 | 33 | 7.44 | <0.0001 | 2.1000 | 0.2093 |
| High | Insec2 | 1.2089 | 0.09968 | 33 | 12.13 | <0.0001 | 3.3499 | 0.3339 |
| High | Insec3 | 0.8545 | 0.09969 | 33 | 8.57 | <0.0001 | 2.3501 | 0.2343 |
| High | Insec4 | 1.1788 | 0.09969 | 33 | 11.82 | <0.0001 | 3.2503 | 0.3240 |
| Low | Insec1 | 1.4171 | 0.09968 | 33 | 14.22 | <0.0001 | 4.1250 | 0.4112 |
| Low | Insec2 | 2.1747 | 0.09968 | 33 | 21.82 | <0.0001 | 8.7998 | 0.8772 |
| Low | Insec3 | 1.7361 | 0.09969 | 33 | 17.42 | <0.0001 | 5.6754 | 0.5658 |
| Low | Insec4 | 1.8082 | 0.09968 | 33 | 18.14 | <0.0001 | 6.0994 | 0.6080 |
| Medium | Insec1 | 1.1632 | 0.09969 | 33 | 11.67 | <0.0001 | 3.2000 | 0.3190 |
| Medium | Insec2 | 2.0980 | 0.09968 | 33 | 21.05 | <0.0001 | 8.1499 | 0.8124 |
| Medium | Insec3 | 1.3218 | 0.09969 | 33 | 13.26 | <0.0001 | 3.7501 | 0.3738 |
| Medium | Insec4 | 1.8984 | 0.09969 | 33 | 19.04 | <0.0001 | 6.6753 | 0.6654 |

7.2.4 A Split Plot with a Factorial Structure on a Large Plot in a Completely Randomized Design (CRD)

Four samples were obtained from each of two batches (Reps) of unprocessed gum from Acacia sp. trees, giving eight samples in total. Within each batch, the four samples were randomly assigned to the combinations of two factors with two levels each. The first factor refers to whether the gum was demineralized or not, and the second factor refers to whether the gum was pasteurized or not.
An emulsion made from each gum sample was divided into three smaller parts, which were randomly assigned to the levels of a third factor, pH, which was adjusted to 2.5, 4.5, or 5.5 using citric acid (Appendix 1: Data: Gum Breakdown Times). This is a split-plot design, with the gum samples as whole plots, arranged within batches. The combined levels of demineralization and pasteurization of the gum are the large (whole) plot factors. The split plots are the smaller parts, each with a specific pH, which is the only split-plot factor. The response measured ($y$) was the time to break, i.e., the time (in hours) until the emulsion failed. The sources of variation and degrees of freedom for this experiment are shown in Table 7.12.

Table 7.12 Sources of variation and degrees of freedom

| Sources of variation | Degrees of freedom |
|---|---|
| Demineralization (Des) | $a - 1 = 2 - 1 = 1$ |
| Pasteurization (Pasteu) | $b - 1 = 2 - 1 = 1$ |
| Demineralization*pasteurization | $(a - 1)(b - 1) = 1$ |
| Des*Pasteu (rep) | $ab(r - 1) = 2 \times 2 \times 1 = 4$ |
| pH | $c - 1 = 3 - 1 = 2$ |
| Demineralization*pH | $(a - 1)(c - 1) = 2$ |
| Pasteurization*pH | $(b - 1)(c - 1) = 2$ |
| Des*Pasteu*pH | $(a - 1)(b - 1)(c - 1) = 2$ |
| Error | $ab(c - 1)(r - 1) = 2 \times 2 \times 2 \times 1 = 8$ |
| Total | $r \times a \times b \times c - 1 = 2 \times 2 \times 2 \times 3 - 1 = 23$ |

The components of the GLMM with a gamma response are as follows:

Distributions: $y_{ijkl} \mid r(\alpha\beta)_{ijl} \sim \text{Gamma}(\mu_{ijkl}, \phi)$; $i = 1, 2$; $j = 1, 2$; $k = 1, 2, 3$; $l = 1, 2$; $r(\alpha\beta)_{ijl} \sim N(0, \sigma^2_{r(\alpha\beta)})$
Linear predictor: $\eta_{ijkl} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + r(\alpha\beta)_{ijl} + \gamma_k + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$
Link function: $\log(\mu_{ijkl}) = \eta_{ijkl}$

where $\alpha_i$, $\beta_j$, and $\gamma_k$ are the fixed effects due to the factors demineralization, pasteurization, and pH, respectively; the effects $(\alpha\beta)_{ij}$, $(\alpha\gamma)_{ik}$, $(\beta\gamma)_{jk}$, and $(\alpha\beta\gamma)_{ijk}$ are the two- and three-way interactions of the factors under study; and $r(\alpha\beta)_{ijl}$ are the random effects due to the demineralization × pasteurization × rep interaction, assuming that $r(\alpha\beta)_{ijl} \sim N(0, \sigma^2_{r(\alpha\beta)})$.

The GLIMMIX statements for fitting this GLMM are as follows:

    proc glimmix nobound method=laplace;
    class Batch Demineralization Pasteurization pH;
    model y = Demineralization|Pasteurization|pH/dist=gamma;
    /* whole-plot error: batch within each demineralization*pasteurization combination */
    random batch(Demineralization*Pasteurization);
    lsmeans Demineralization|Pasteurization|pH/lines ilink;
    run;

The relevant results from the SAS output are shown in Table 7.13.

Table 7.13 Results of the analysis of variance

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (y \| r. effects) | 192.24 |
| Pearson's chi-square | 0.12 |
| Pearson's chi-square/DF | 0.01 |

(b) Covariance parameter estimates

| Cov Parm | Estimate | Standard error |
|---|---|---|
| Rep (Desmin*Pasteur) | 0.001428 | 0.001864 |
| Residual ($\phi$) | 0.006011 | 0.002126 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Demineralization (Des) | 1 | 4 | 35.48 | 0.0040 |
| Pasteurization (Pasteu) | 1 | 4 | 19.49 | 0.0116 |
| Demineralization*pasteurization | 1 | 4 | 35.67 | 0.0039 |
| pH | 2 | 8 | 5.27 | 0.0346 |
| Demineralization*pH | 2 | 8 | 3.84 | 0.0676 |
| Pasteurization*pH | 2 | 8 | 0.57 | 0.5889 |
| Des*Pasteu*pH | 2 | 8 | 4.32 | 0.0535 |

The conditional model's Pearson's $\chi^2/DF = 0.01$ gives no indication of overdispersion under the gamma distribution. The variance component due to blocks × demineralization × pasteurization, $\sigma^2_{r(\alpha\beta)}$, and the scale parameter $\phi$ are shown in part (b). The type III tests of fixed effects are presented in part (c) of Table 7.13, where significant effects of demineralization, pasteurization, and pH, as well as of the interaction between demineralization and pasteurization, are observed on the emulsion breaking time. The interactions demineralization*pH (P = 0.0676) and demineralization*pasteurization*pH (P = 0.0535) are close to significance. The emulsion breaking time is strongly affected by no demineralization (demineralization = 1) and no pasteurization (pasteurization = 1) of the gum and, to a lesser extent, by the pH to which the gum was adjusted (Table 7.14).
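The degrees of freedom of Table 7.12 follow from the numbers of levels of the three factors (a = 2, b = 2, c = 3) and the number of batches (r = 2). The short sketch below recomputes them and checks that they add up to the total; the dictionary labels are illustrative, not the book's notation.

```python
a, b, c, r = 2, 2, 3, 2   # demineralization, pasteurization, pH levels; batches

df = {
    "Demineralization": a - 1,
    "Pasteurization": b - 1,
    "Demineralization*pasteurization": (a - 1) * (b - 1),
    "Whole-plot error, Des*Pasteu(rep)": a * b * (r - 1),
    "pH": c - 1,
    "Demineralization*pH": (a - 1) * (c - 1),
    "Pasteurization*pH": (b - 1) * (c - 1),
    "Des*Pasteu*pH": (a - 1) * (b - 1) * (c - 1),
    "Split-plot error": a * b * (c - 1) * (r - 1),
}
total_df = r * a * b * c - 1

for source, value in df.items():
    print(f"{source:36s} {value}")
print(f"{'Total':36s} {total_df}")
```

The whole-plot error (4 DF) matches the denominator DF of the demineralization and pasteurization tests in Table 7.13(c), and the split-plot error (8 DF) matches the denominator DF of every test involving pH.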
Analyzing the simple effects of the factors, we can observe that when the gum has not been pasteurized (B = 1), the average emulsion break time is very similar for the demineralized and the non-demineralized gum at the three pH levels. However, when the gum has been pasteurized, demineralization has a significant impact on the emulsion break time; for example, for a gum that is pasteurized but not demineralized (A1B2), the emulsion break time is much lower than for a gum that has been both demineralized and pasteurized (A2B2) at all three pH levels. Finally, a demineralized, pasteurized gum at pH = 4.5 gives the highest breaking stability (Table 7.15).

Table 7.14 Means and standard errors of the main effects on the model scale (Estimate) and the data scale (Mean)
(a) Demineralization least squares means
Demineralization   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
1                  5.0911     0.02930          4    173.77    <0.0001    162.57   4.7628
2                  5.3379     0.02930          4    182.18    <0.0001    208.07   6.0964
(b) Pasteurization least squares means
Pasteurization     Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
1                  5.1230     0.02930          4    174.87    <0.0001    167.84   4.9171
2                  5.3059     0.02930          4    181.08    <0.0001    201.53   5.9051
(c) pH least squares means
pH                 Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
1                  5.1610     0.03050          8    169.22    <0.0001    174.33   5.3171
2                  5.2839     0.03050          8    173.24    <0.0001    197.13   6.0124
3                  5.1986     0.03052          8    170.32    <0.0001    181.02   5.5255

Table 7.15 Means and standard errors of the simple effects on the model scale (Estimate) and the data scale (Mean)
Demineralization*pasteurization*pH least squares means
A   B   C   Estimate   Standard error   DF   t-value   Pr > |t|   Mean     Standard error mean
1   1   1   5.0696     0.06099          8    83.13     <0.0001    159.11   9.7035
1   1   2   5.1695     0.06105          8    84.68     <0.0001    175.83   10.7339
1   1   3   5.1311     0.06100          8    84.12     <0.0001    169.20   10.3204
1   2   1   5.1137     0.06099          8    83.84     <0.0001    166.28   10.1419
1   2   2   5.0445     0.06098          8    82.72     <0.0001    155.17   9.4623
1   2   3   5.0183     0.06098          8    82.29     <0.0001    151.15   9.2170
2   1   1   5.0811     0.06103          8    83.26     <0.0001    160.95   9.8225
2   1   2   5.1694     0.06099          8    84.76     <0.0001    175.81   10.7225
2   1   3   5.1175     0.06110          8    83.76     <0.0001    166.91   10.1978
2   2   1   5.3796     0.06100          8    88.19     <0.0001    216.93   13.2320
2   2   2   5.7520     0.06100          8    94.30     <0.0001    314.81   19.2031
2   2   3   5.5277     0.06106          8    90.53     <0.0001    251.57   15.3607
A = demineralization (1 = no, 2 = yes), B = pasteurization (1 = no, 2 = yes), and C = pH (1 = 2.5, 2 = 4.5, and 3 = 5.5)

7.3 Survival Analysis

When research focuses on the time of occurrence of a specific event, we usually refer to survival times, and the statistical analysis of these times, as mentioned above, is known as survival analysis. A characteristic feature of survival times is the presence of censored times, that is, individuals whose actual survival time is not known. For a set of survival times (including censored ones) from a sample of individuals, it is possible to estimate the proportion of the population that will survive a given time interval under the same circumstances. The methods used to make this estimate are based on the proposal of Kaplan and Meier (1958). Combined with different statistical tests (log-rank, Breslow, Tarone–Ware, etc.), this method allows the comparison of the survival of two or more groups of individuals who differ with respect to certain factors.

Survival analysis focuses on one or several groups of individuals for whom an event is defined that occurs after a time interval. To determine the time of interest, there are three requirements: an initial time, a scale to measure the passage of time (minutes, hours, days, etc.), and clarity about what is meant by the event of interest. Conceptually, the survival of an individual is the probability of being alive at a given time t after a starting point such as diagnosis, initiation of treatment, or complete remission.
In clinical studies, survival times often refer to the time until death, the development of a particular symptom, or relapse after complete remission of a disease. Failure is defined as death, relapse, or the occurrence of a new disease. In many survival analyses, when the end of the observation period previously set by the investigator is reached, there are individuals for whom the event has not occurred, and we do not know when it will occur. Therefore, their actual survival time is unknown, and only the survival time up to the end of the study is known. Such survival times are called censored times. In some cases, individuals also leave the study before the end of the observation period for reasons unrelated to the research, e.g., death from other causes; these times are also censored. Censored data contribute valuable information and, therefore, should not be omitted from the analysis.

The pharmaceutical and food industries are legally required to label the shelf life of their products on the packaging. For pharmaceuticals, the requirements for how to determine shelf life are highly regulated. However, the regulatory standards do not specifically define shelf life. Instead, the definition is implicit in the estimation procedure. The interest is in the situation where multiple batches are used to determine a shelf life that applies to all future batches of a product. Consequently, both shelf life and label life are of great importance because of the variability within and between batches. Product development must be very well thought out before a company can have confidence in shelf life estimates. The company must be able to reliably produce a homogeneous product from batch to batch of ingredients, because physical and chemical factors such as pH, water activity, and uniformity of the mix (distribution of moisture, salt, preservative, or food acid) affect the ability of bacteria to grow and, consequently, the shelf life of the product.
Therefore, products should be inspected at appropriate times and samples should be tested for critical stability of physical and chemical characteristics. These tests also provide an opportunity to begin microbiological testing for spoilage organisms. Testing should continue beyond the intended shelf life unless the product fails earlier. Testing should lead to an understanding of target levels and ranges of ingredients for evaluation of the critical physical and chemical characteristics of the product over the intended shelf life.

Survival analysis is the name for a collection of statistical techniques used to describe and quantify the time at which the event of interest occurs. The term "survival time" specifies the amount of time taken for the event to occur. Situations in which survival analyses have been used in epidemiology include:
(a) Survival of insects after having received an insecticide.
(b) The time taken by cows or ewes to conceive after calving.
(c) The time taken for a farm to experience its first case of an exotic disease.

7.3.1 Concepts and Definitions

To clearly understand and interpret a rate of change calculated from the event data of interest, a more extensive approach is needed. The definition of a rate of change begins with the mathematical description of a changing pattern over time, represented by the symbol S(t). A rate is formed by dividing the change in the function S(t), from S(t) to S(t + Δt), by the corresponding change in time, from t to t + Δt:

$$\text{rate of change} = \frac{\text{change in } S(t)}{\text{change in time}} = \frac{S(t) - S(t + \Delta t)}{(t + \Delta t) - t} = \frac{S(t) - S(t + \Delta t)}{\Delta t}$$

Rates of change with respect to time apply to a variety of situations, but one specific function, traditionally denoted by S(t), is fundamental to the analysis of survival data. This is called the survival function and is defined as the probability of surviving beyond a specific point in time (denoted by t).
That is,

$$S(t) = P(\text{surviving from time } 0 \text{ to time } t) = P(\text{surviving the interval } [0, t])$$

or, equivalently,

$$S(t) = P(\text{surviving beyond time } t) = P(T \ge t) = 1 - F(t)$$

where F(t) is the cumulative distribution function, with $F(t) = P(T \le t)$ (for a continuous T). Another important concept in survival analysis is the hazard function h(t). The hazard function of T is defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$$

which can be expressed as

$$h(t) = \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t} \times \frac{1}{P(T \ge t)} = \frac{f(t)}{S(t)}$$

where f(t) is the probability density function. Any distribution defined on $t \in [0, \infty)$ can serve as a survival distribution. Consequently,

$$h(t) = -\frac{\partial}{\partial t}\{\log S(t)\}.$$

It then follows that

$$S(t) = \exp\{-H(t)\}$$

where H(t) is the cumulative hazard function

$$H(t) = \int_0^t h(u)\,du.$$

Another useful relationship is $H(t) = -\log S(t)$.

For the simplest model, the exponential model with $h(t) = \lambda$ ($\lambda$ is a constant), the survival function is given by

$$S(t) = \exp\left\{-\int_0^t h(u)\,du\right\} = \exp\left\{-\int_0^t \lambda\,du\right\} = e^{-\lambda t}$$

with the probability density function given by

$$f(t) = -\frac{\partial}{\partial t} S(t) = \lambda e^{-\lambda t}.$$

Thus, the survival function, hazard function, and cumulative hazard function for the exponential model are given by:

Survival function: $S(t) = e^{-\lambda t}$

Hazard function: $h(t) = \dfrac{f(t)}{S(t)} = \dfrac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$

Cumulative hazard function: $H(t) = \int_0^t h(u)\,du = \int_0^t \lambda\,du = \lambda t$

7.3.2 CRD: Aedes aegypti

The objective of this experiment was to test the vulnerability of Aedes aegypti mosquitoes to different fungal treatments (four treatments). A bioassay was conducted to determine the survival time of each of the mosquitoes. Three-day-old mosquitoes were maintained after hatching in 45-cm rearing cages with access to water but not food.
The mosquitoes were kept in rearing cages with water and fed warm pig blood (37 °C) through a natural membrane (sausage casing) approximately every 3 days and allowed to oviposit freely during the waiting period. Ten mosquitoes were placed in each chamber, and each chamber received one of the four treatments or the control. Here, we present part of the data from a bioassay with four replicates. The complete data from this trial can be found in Appendix 1 (Data: Aedes aegypti).

Treatment   Rep   Y
C           1     8
C           1     11
⋮           ⋮     ⋮
C           4     20
Mam         1     2
Mam         1     2
⋮           ⋮     ⋮
MaS         1     3
MaS         1     3
⋮           ⋮     ⋮
MaC         1     2
MaC         1     2
MaC         1     2
⋮           ⋮     ⋮
Ma1         1     2
Ma1         1     2
⋮           ⋮     ⋮
Ma1         4     11

The components of this GLMM are as follows:

Distributions:
$$y_{ij} \mid rep_j \sim \text{Gamma}(\mu_{ij}, \phi), \quad rep_j \sim N(0, \sigma^2_{rep})$$

Linear predictor:
$$\eta_{ij} = \eta + \tau_i + rep_j$$

Link function:
$$\eta_{ij} = \log(\mu_{ij})$$

where $\eta$ is the intercept, $\tau_i$ is the treatment effect, and $rep_j$ is the random effect due to the mosquito chamber, assuming $rep_j \sim N(0, \sigma^2_{rep})$.

The following GLIMMIX commands fit a GLMM with a gamma response:

  proc glimmix data=mosquitos method=laplace;
    class bio trt rep;
    model y = trt / dist=gamma;
    random rep;
    lsmeans trt / lines ilink;
  run;

Part of the output is shown in Table 7.16.

Table 7.16 Results of the analysis of variance
(a) Fit statistics for conditional distribution
-2 Log L (T | r. effects)     716.70
Pearson's chi-square          35.33
Pearson's chi-square/DF       0.18
(b) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
Trt      4        192      186.42    <0.0001

The fit statistics in (a) give no indication of overdispersion in the fit of the data, since Pearson's chi-square/DF = 0.18. The analysis of variance (type III tests of fixed effects) indicates that there is a highly significant effect (P < 0.0001) of the fungal treatments on the mean mosquito survival time.
The relevant information in Table 7.17 ("lsmeans") comes from the columns labeled "Estimate" and "Mean": these are the estimates on the model scale and the data scale, respectively, and the average survival time in each of the treatments is represented by $\hat{\mu}_i$ (± standard error). The estimated hazard function for each treatment is $\hat{\lambda}_i = 1/\hat{\mu}_i$. For example, for treatment Ma1, the estimated hazard function is $\hat{\lambda}_{Ma1} = 1/3.4223 = 0.2922$. We can calculate these values manually from the Mean column, or we can automate the process by adding the command "ods output lsmeans = mu" to the GLIMMIX program above.

Table 7.17 Means and standard errors of the main effects on the model scale (Estimate) and the data scale (Mean)
Trt least squares means
Trt       Estimate   Standard error   DF    t-value   Pr > |t|   Mean      Standard error mean
Ma1       1.2303     0.06354          192   19.36     <0.0001    3.4223    0.2174
MaC       0.9562     0.06350          192   15.06     <0.0001    2.6017    0.1652
MaS       1.5798     0.06357          192   24.85     <0.0001    4.8542    0.3086
Mam       0.6946     0.06350          192   10.94     <0.0001    2.0029    0.1272
Control   2.7155     0.06362          192   42.68     <0.0001    15.1126   0.9615

Once we have saved the treatment means, we can ask SAS to compute the estimated hazard function for the treatments. The commands are as follows:

  data hazard;
    set mu;
    hazard = 1/mu;
  run;
  proc print data=hazard;
  run;

The results are listed in Table 7.18. The hazard column contains the estimated hazard functions for each treatment, $\hat{h}_i(t) = \hat{\lambda}_i$. From the values $\hat{\lambda}_i$, we can calculate the estimated survival function $\hat{S}_i(t) = e^{-\hat{\lambda}_i t}$ for each of the treatments. Figure 7.3 shows the probability of survival over time, obtained with $\hat{S}_i(t) = e^{-\hat{\lambda}_i t}$, for each of the proposed treatments and the control. Clearly, all four fungal treatments (MaS, Ma1, MaC, and Mam) reduced survival markedly relative to the control, which indicates their efficacy in the biological control of these mosquitoes.
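The survival curves described above can also be tabulated directly from the data-scale means. The following is a minimal sketch under the exponential assumption used in this section ($\hat{\lambda}_i = 1/\hat{\mu}_i$ and $\hat{S}_i(t) = e^{-\hat{\lambda}_i t}$). It is not the authors' program: the data set name `surv_curves`, the variable names, and the plotting range of 0 to 20 days are illustrative assumptions, and the means are typed in from Table 7.17 rather than read from the `mu` data set.

```sas
/* Sketch only: data-scale means typed in from Table 7.17.      */
/* Data set and variable names are hypothetical.                */
data surv_curves;
  input trt $ mu_hat;
  lambda = 1 / mu_hat;            /* exponential hazard: 1 / mean     */
  do t = 0 to 20;                 /* days; arbitrary plotting range   */
    surv = exp(-lambda * t);      /* S(t) = exp(-lambda * t)          */
    output;                       /* one row per treatment and day    */
  end;
  datalines;
Ma1     3.4223
MaC     2.6017
MaS     4.8542
Mam     2.0029
Control 15.1126
;
run;

/* One survival curve per treatment */
proc sgplot data=surv_curves;
  series x=t y=surv / group=trt;
  xaxis label="Time (days)";
  yaxis label="Survival probability S(t)";
run;
```

Writing one row per treatment and day keeps the hazard and the survival probability in the same data set, so the same table can be printed or plotted.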
7.3.3 RCBD: Aedes aegypti

As in the previous example, this experiment tested the vulnerability of Aedes aegypti mosquitoes to different fungal treatments (four treatments). For this, two bioassays were conducted to determine the survival time of each of the mosquitoes. Three-day-old mosquitoes were maintained after hatching in 45-cm rearing cages with access to water but not food. Mosquitoes were maintained in rearing cages with water and were fed warm pig blood (37 °C) through a natural membrane (sausage casing) approximately every 3 days. They were allowed to freely oviposit during the waiting period. Ten mosquitoes were placed in each chamber, and each chamber received one of the four treatments or the control. The data can be found in Appendix 1 (Data: Aedes aegypti).

Table 7.18 Means and standard errors of the main effects on the model scale (Estimate) and the data scale (Mean) and the hazard function λi
Effect   TRT       Estimate   Standard error   DF    t-value   Probt     Mean      Standard error mean   Hazard λi
TRT      Ma1       1.2303     0.06354          192   19.36     <0.0001   3.4223    0.2174                0.29220
TRT      MaC       0.9562     0.06350          192   15.06     <0.0001   2.6017    0.1652                0.38437
TRT      MaS       1.5798     0.06357          192   24.85     <0.0001   4.8542    0.3086                0.20601
TRT      Mam       0.6946     0.06350          192   10.94     <0.0001   2.0029    0.1272                0.49927
TRT      Control   2.7155     0.06362          192   42.68     <0.0001   15.1126   0.9615                0.06617

Fig. 7.3 Estimated survival probability for each treatment (curves for Ma1, MaC, MaS, Mam, and Control; x-axis: Time (days); y-axis: Survival probability S(t))

The components of this GLMM are as follows:

Distributions:
$$y_{ijk} \mid bio_j, rep(bio)_{k(j)} \sim \text{Gamma}(\mu_{ijk}, \phi)$$
$$bio_j \sim N(0, \sigma^2_{bio}), \quad rep(bio)_{k(j)} \sim N(0, \sigma^2_{rep(bio)})$$

Linear predictor:
$$\eta_{ijk} = \eta + \tau_i + bio_j + rep(bio)_{k(j)}$$

where $\eta$ is the intercept, $\tau_i$ is the treatment effect, and $bio_j$ and $rep(bio)_{k(j)}$ are the random effects of the bioassay and the mosquito chamber within the bioassay, respectively, assuming $bio_j \sim N(0, \sigma^2_{bio})$ and $rep(bio)_{k(j)} \sim N(0, \sigma^2_{rep(bio)})$.

Link function:
$$\eta_{ijk} = \log(\mu_{ijk})$$

The following GLIMMIX program fits a block GLMM with a gamma response.

  proc glimmix method=laplace nobound;
    class bio trt ind rep;
    model y = trt / dist=gamma;
    random bio rep(bio);
    ods output lsmeans=mu;
    lsmeans trt / lines ilink;
  run; quit;

The results obtained are shown below. Part of the fit statistics and the variance components are listed in Table 7.19.

Table 7.19 Results of the analysis of variance
(a) Fit statistics for conditional distribution
-2 log L (Y | r. effects)     3303.50
Pearson's chi-square          202.30
Pearson's chi-square/DF       0.34
(b) Covariance parameter estimates
Cov Parm     Estimate   Standard error
BIO          0.1859     0.1936
REP(BIO)     0.02562    0.01673
Residual     0.2822     0.01568
(c) Type III tests of fixed effects
Effect   Num DF   Den DF   F-value   Pr > F
TRT      4        588      115.36    <0.0001

Table 7.20 Means and standard errors of the main effects on the model scale (Estimate) and the data scale (Mean)
TRT least squares means
TRT       Estimate   Standard error   DF    t-value   Pr > |t|   Mean      Standard error mean
Ma1       1.6344     0.3140           588   5.21      <0.0001    5.1266    1.6097
MaC       1.4903     0.3140           588   4.75      <0.0001    4.4386    1.3939
MaS       1.8788     0.3140           588   5.98      <0.0001    6.5455    2.0550
Mam       1.8053     0.3143           588   5.74      <0.0001    6.0820    1.9115
Control   2.8293     0.3139           588   9.01      <0.0001    16.9329   5.3153

Table 7.21 Means and standard errors of the main effects on the model scale (Estimate), the data scale (Mean), and the hazard function λi
TRT       Estimate   Standard error   DF    t-value   Probt     Mean      Standard error mean   Hazard λi
Ma1       1.6344     0.3140           588   5.21      <0.0001   5.1266    1.6097                0.19506
MaC       1.4903     0.3140           588   4.75      <0.0001   4.4386    1.3939                0.22529
MaS       1.8788     0.3140           588   5.98      <0.0001   6.5455    2.0550                0.15278
Mam       1.8053     0.3143           588   5.74      <0.0001   6.0820    1.9115                0.16442
Control   2.8293     0.3139           588   9.01      <0.0001   16.9329   5.3153                0.05906

In part (a), Pearson's chi-square/DF = 0.34, which, as in the previous example, gives no indication of overdispersion. In part (b), the estimated variance components due to blocks, within-block replicates, and experimental error are $\hat{\sigma}^2_{bio} = 0.1859$, $\hat{\sigma}^2_{rep(bio)} = 0.02562$, and $\hat{\sigma}^2 = 0.2822$, respectively. The type III tests of fixed effects (part (c)) indicate that there is a highly significant difference between treatments in the mean survival time (P < 0.0001). Tables 7.20 and 7.21 show the estimates on the model scale and the data scale, that is, the linear predictors $(\hat{\eta}_i)$ and the means $(\hat{\mu}_i)$ with their respective standard errors, together with the estimated hazard function. The results indicate that the MaC treatment has the greatest lethal effect on A. aegypti mosquitoes, with the highest estimated hazard (0.22529, versus 0.05906 for the control).
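A short worked check of this conclusion, assuming the exponential survival function $S_i(t) = e^{-\lambda_i t}$ used throughout this section and the hazards reported in Table 7.21. The choice of day 5 is arbitrary and serves only as an illustration:

```latex
\begin{align*}
S_{\text{MaC}}(5) &= e^{-0.22529 \times 5} = e^{-1.12645} \approx 0.324,\\
S_{\text{Control}}(5) &= e^{-0.05906 \times 5} = e^{-0.29530} \approx 0.744.
\end{align*}
```

Under these assumptions, roughly a third of the mosquitoes treated with MaC would be expected to survive to day 5, against about three quarters of the controls.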
Fig. 7.4 Estimated survival probability for each treatment (curves for Ma1, MaC, MaS, Mam, and Control; x-axis: Time (days); y-axis: Survival probability S(t))

Figure 7.4 shows the estimated survival curves for the different treatments tested. These curves were obtained with $\hat{S}_i(t) = e^{-\hat{\lambda}_i t}$.

7.4 Exercises

Exercise 7.4.1 This experiment studied the incapacitation times of animals exposed to the burning of eight types of aircraft interior materials (M1–M8), together with the yields (in milligrams per gram) of seven combustion gases (CO, HCN, H2S, HCl, HBr, NO2, SO2) (Spurgeon 1978). The recorded incapacitation time of the animal (in minutes) for each combustion material (column "Material") is found under the column "Time", and the third column contains the value of 1000/Time; these data are shown below (Table 7.22):
(a) Write down a statistical model of this experiment.
(b) List all the components of the GLMM in (a).
(c) Write down the null and alternative hypotheses associated with this experiment.
(d) Construct an ANOVA table indicating the sources of variation and degrees of freedom.
(e) Analyze the incapacitation time of the animals exposed to the gases from the different types of materials.
(f) Comment on the results obtained.
302 7 Time of Occurrence of an Event of Interest Table 7.22 Time of incapacity of the animal when exposed to different combustion gases Material M1 M1 M1 M1 M1 M1 M1 M1 M1 M1 M1 M1 M1 M2 M2 M2 M2 M2 M2 M2 M2 M2 M3 M3 M3 M3 M3 M3 M3 M3 M3 M4 M4 M4 M4 M4 M4 M4 M4 M4 M4 M4 Time 2.36 2.38 2.61 3.07 3.07 3.19 3.7 3.9 4.18 4.7 4.86 5.58 5.85 3.22 3.89 4.79 5.07 5.22 5.82 6.09 8.36 13.02 4.29 4.8 5.04 5.06 5.25 5.5 5.55 7.55 9.58 1.15 2 2.15 2.22 2.23 2.72 2.93 3.07 3.47 4.18 4.64 1000/Time 423.7 420.2 383.1 325.7 325.7 313.5 270.3 256.4 239.2 212.8 205.8 179.2 170.9 310.6 257.1 208.8 197.2 191.6 171.8 164.2 119.6 76.8 233.1 208.3 198.4 197.6 190.5 181.8 180.2 132.5 104.4 869.6 500 465.1 450.5 448.4 367.6 341.3 325.7 288.2 239.2 215.5 CO 164 174 96 101 142 143 147 156 124 101 142 104 90 159 153 161 159 162 106 124 89 88 129 105 108 120 149 28 83 68 28 88 89 63 112 96 78 348 255 112 144 70 HCN 6.4 7.5 4.7 7.5 6.8 8.2 5.2 4.7 3.2 8.9 4.6 3.4 2.3 16.4 2.9 0.6 0 0 3.2 1.5 0.7 0 6 5.8 7.8 11.6 0 9.1 5 5.5 2.4 62.4 41.7 14.9 37.2 7 33.8 1.9 1.9 19.5 3.8 11.2 H2S 0 0 0 0 0 0 0 0 0 0.9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.4 0 0 2 0 13.4 0 14.2 0 13.9 0 0 10.7 0 6.2 HCl 0 0 33 0 27.6 0 11.3 12 23.3 5.4 19.4 80 34.4 0 0 0 4.6 22 45.2 0 0 0 4.2 0 7.3 23 8.6 56.2 0 27.3 137 182 0 0 0 43.1 0 28 0 88 14.5 205 HBr 0 5 5 7.1 0 5.5 0 2.6 0 8 4.1 0 0 5.3 6.6 0 1.7 0 15.6 0 5.3 0 0 0 0 0 0 0 0 0 0 0 0 9.6 20.5 0 0 7.1 0 0 5.1 0 NO2 0.26 1.07 0.08 0.43 0.25 0.33 0.37 0.39 0.2 0.63 0.19 0.15 0.09 2 0.15 0.62 0.04 0.04 0.08 0.85 0.29 0.02 0.02 0.03 0.04 0.02 0 0 0.02 0.01 0 0.52 0 1.6 0 0.53 0 1 0.57 0.03 0.39 0.04 SO2 0 0 0 0 0 0 0 0 0 0 0 0.4 1.2 0 0 0 0 0 0 0 0 0 0.7 0 0 0 0 2.2 0 0.9 16.6 2.1 0.3 8.5 1.5 11.2 0 1.8 0 4.8 0.9 4.9 (continued) 7.4 Exercises 303 Table 7.22 (continued) Material M4 M5 M5 M5 M5 M6 M6 M6 M6 M6 M6 M7 M7 M7 M7 M7 M7 M7 M7 M7 M7 M8 M8 M8 M8 M8 M8 M8 M8 Time 7.57 6.97 7.47 10.7 13.71 4.94 5.26 5.53 7.46 9.84 10.9 3.7 3.8 3.83 4.04 5.19 6.01 7.56 9.41 9.59 10.79 3.7 
3.99 6.56 7.68 9.16 10.33 12.26 14.96 1000/Time 132.1 143.5 133.9 93.5 72.9 202.4 190.1 180.8 134 101.6 91.7 270.3 263.2 261.1 247.5 192.7 166.4 132.3 106.3 104.3 92.7 270.3 250.6 152.4 130.2 109.2 96.8 81.6 66.8 CO 92 114 103 70 56 94 55 46 77 52 41 398 345 406 342 196 148 86 54 55 55 0 90 37 66 45 62 31 9 HCN 0 0 0 0 0 6.7 14.9 13.5 3.1 4.1 2.4 0 0 0 0 0 0 0 2.2 1.7 4.1 15 8.6 3.1 0 0 0 2.7 0 H2S 0.3 0 0 0 0 0 5.3 6.1 0 0.7 0 0 0 0 0 0 0.2 0 0 0 0 0 0 0 0 0 0 0 0 HCl 536 114 221 259 220 0 21.9 24.9 158 19 82 0 0 0 23 0 387 0 197 321 162 0 88 27.7 105 0 61 0 0 HBr 0 0 0 0 0 0 0 0 0 0 0 21 15.5 47 10.3 0 0 47 0 0 0 0 0 0 0 0 0 0 0 NO2 0.01 0 0 0.02 0.01 0.32 0 0 0.04 0.01 0 0 0.01 0 0.04 0 0.01 0 0 0 0.02 0.34 0.59 0.01 0 0.01 0.01 0.22 0.01 SO2 3 0 0 1.4 0.9 0 2.2 2.5 0 1.4 0 0 0 0 0 0 1.9 0 2.6 1.1 2.9 0 0 0 0 0 0 0 0 Exercise 7.4.2 Cockroaches are responsible for 80% of infestations in spaces used by humans. They associate with humans and have the ability to contaminate food with their feces and secretions, having both medical and economic implications. Different insecticides have been formulated, mainly synthetic, and, in some cases, have led to the development of cockroaches’ resistance. This example deals with the study of survival in days ( y) of this insect when exposed to two promising fungi in the biological control of this insect plus an already known control. 
The data for this example are shown below (Table 7.23): 304 7 Time of Occurrence of an Event of Interest Table 7.23 Results of the cockroach biological control experiment Insect 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 Strain Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Age 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 Time 2 3 3 3 4 7 8 9 10 10 17 19 19 20 20 20 20 20 20 20 4 13 13 13 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 3 3 Insect 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 Strain Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Age 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 Time 2 2 2 3 3 3 3 4 5 8 9 10 11 20 20 20 20 20 20 20 3 3 4 6 8 13 17 17 19 19 20 20 20 20 20 20 20 20 20 20 2 3 Insect 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 Strain Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Age 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 Time 2 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 11 20 (continued) 7.4 Exercises 305 Table 7.23 (continued) Insect 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Strain Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Bb1 Age 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Time 4 4 6 7 8 8 9 10 13 14 
16 17 17 20 20 20 20 20 Insect 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Strain Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Bb2 Age 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Time 4 5 5 10 10 10 11 13 13 13 14 15 15 15 15 19 20 20 Insect 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 Strain Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Test Age 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Time 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
(a) Write down a statistical model of this experiment.
(b) List all components of the GLMM from (a).
(c) Write down the null and alternative hypotheses associated with this experiment.
(d) Analyze the survival time of the insect when infected with the different types of fungi.
(e) Comment on the results obtained.

Exercise 7.4.3 Consider a study on the effect of analgesic treatments (Trt) in elderly patients with neuralgia. Two test treatments (A and B) and a placebo (P) are compared. The response variable is whether the patient reported pain or not (yes = 1, no = 0). The investigators recorded the age (E) and sex (S) of 60 patients and the time (T) until the pain disappeared after starting the treatment. The data are presented in Table 7.24 below.
(a) List all components of the GLMM for this exercise.
(b) Write down the null and alternative hypotheses associated with this experiment.
(c) Construct an ANOVA table indicating the sources of variation and degrees of freedom.
(d) Analyze the average time during which the patient experiences pain after starting the treatment. Are there any significant differences?
(e) Comment on the results obtained.
Table 7.24 Results with neuralgia patients (Trt = Treatment, S = Sex, E = Age, T = Time, D = Pain with yes = 1 and no = 0)
Trt  S  E   T   D     Trt  S  E   T   D     Trt  S  E   T   D
P    F  68  1   0     B    M  74  16  0     P    F  67  30  0
P    M  66  26  1     B    F  67  28  0     B    F  77  16  0
A    F  71  12  0     B    F  72  50  0     B    F  76  9   1
A    M  71  17  1     A    F  63  27  0     A    F  69  18  1
B    F  66  12  0     A    M  62  42  0     P    F  64  1   1
A    F  64  17  0     P    M  74  4   0     A    F  72  25  0
P    M  70  1   1     B    M  66  19  0     B    M  59  29  0
A    F  64  30  0     A    M  70  28  0     A    M  69  1   0
B    F  78  1   0     P    M  83  1   1     B    F  69  42  0
B    M  75  30  1     P    M  77  29  1     P    F  79  20  1
A    M  70  12  0     A    F  69  12  0     B    F  65  14  0
B    M  70  1   0     B    M  67  23  0     A    M  76  25  1
P    M  78  12  1     B    M  77  1   1     B    F  69  24  0
P    M  66  4   1     P    F  65  29  0     P    M  60  26  1
A    M  78  15  1     B    M  75  21  1     A    F  67  11  0
P    F  72  27  0     P    F  70  13  1     A    M  75  6   1
B    F  65  7   0     P    F  68  27  1     P    M  68  11  1
P    M  67  17  1     B    M  70  22  0     A    M  65  15  0
P    F  67  1   1     A    M  67  10  0     P    F  72  11  1
A    F  74  1   0     B    M  80  21  1     A    F  69  3   0

Exercise 7.4.4 Refer to the previous exercise and perform an analysis of covariance.
(a) Write down the linear predictor of this experiment.
(b) Analyze the average time during which the patient experiences pain after starting the treatment using an analysis of covariance. Are there any significant differences?
(c) Comment on the results obtained. Do your results differ from those obtained in the previous exercise?
Appendix 1 Data: Onset and duration of estrus in Pelibuey ewes (age in weeks, weight in kilograms, Inestro = number of days from the onset of estrus, Durestro = number of days in the duration of estrus) Animal 1 2 3 4 Birth type 1 1 1 1 Age 18.5096 18.4438 19.3973 19.3973 Weight 52.5 47.4 50.2 53.6 CC 4 4 4 4 Inestro 28 28 16 28 Durestro 4 4 20 16 (continued) Appendix 1 Animal 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 307 Birth type 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Age 9.4356 18.674 20.0877 19.5616 19.5288 19.7589 29.9507 19.3644 19.0027 18.3452 20.0877 40.7671 51.189 40.0767 54.3123 52.274 53.6219 40.2082 36.4932 50.6301 51.0247 46.389 50.7945 30.411 30.5096 50.6959 36.6247 30.5425 36.6247 29.9507 47.211 40.2082 52.2411 53.4247 55.5616 30.5425 29.0959 40.1425 50.7945 37.6767 36.4274 30.5425 Weight 47.5 41.3 60.5 49.4 53.4 52.6 35.5 50.5 62.2 54.7 48.7 40.5 49.3 38.1 41.9 58 34.8 40.3 34.6 42.1 52.6 32.1 40 37.9 42.2 33.2 34.2 39 33.7 32.9 39.5 57.5 53.3 43.4 46 31.6 47.8 36 42.2 44.3 43.1 38 CC 4 3 5 4 4 5 3 4 5 4 4 3 4 3 3 4 3 2 2 2 4 2 2 2 3 2 3 2 2 2 2 5 4 3 3 2 3 2 3 3 2 2 Inestro 28 28 28 28 28 28 28 28 28 28 28 28 12 20 24 28 28 24 28 28 28 20 16 24 20 24 20 12 24 24 32 12 12 12 24 24 20 20 24 24 20 20 Durestro 4 4 4 4 4 4 4 4 4 4 4 4 8 20 8 4 4 8 4 4 4 12 16 8 20 16 20 32 16 16 12 32 28 32 16 16 20 20 16 16 20 20 (continued) 308 Animal 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 7 Birth type 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Age 68.9753 63.1233 68.7781 64.3068 68.6795 62.6301 69.8959 69.6 63.4849 64.274 63.5178 78.7397 64.537 62.4329 67.6603 63.7151 74.4986 63.6493 72.9205 69.4027 69.9616 69.6 63.4849 63.6164 67.8575 63.6822 65.7534 67.989 61.1836 63.3534 79.8904 63.7151 Time of Occurrence of an Event of Interest Weight 42.9 44 38.5 48 40.1 46.3 32.5 42.8 51.3 47.7 
44.5 38 52.5 41.2 50.8 48.2 33.3 45.1 33 40.4 43.3 43.2 51 57.4 43 49.7 40.1 33.4 51.6 43.3 44.7 37.9 CC 2 4 3 4 2 3 2 3 4 3 3 2 4 2 4 3 2 3 3 3 3 2 4 4 3 4 3 1 4 3 3 3 Inestro 24 24 20 24 20 32 20 20 24 24 12 12 12 12 20 20 32 24 24 24 12 24 24 24 24 24 24 20 20 20 24 20 Durestro 8 8 12 8 12 4 12 12 8 16 28 28 28 28 20 24 8 16 20 16 28 16 16 16 16 16 16 20 20 20 16 20 Data: Beetles Dose Low Low Low Low Medium Medium Medium Medium High Insecticide Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Rep 1 1 1 1 1 1 1 1 1 Frac 0.31 0.82 0.43 0.45 0.36 0.92 0.44 0.56 0.22 Time 3.1 8.2 4.3 4.5 3.6 9.2 4.4 5.6 2.2 (continued) Appendix 1 Dose High High High Low Low Low Low Medium Medium Medium Medium High High High High Low Low Low Low Medium Medium Medium Medium High High High High Low Low Low Low Medium Medium Medium Medium High High High High 309 Insecticide Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Insec1 Insec2 Insec3 Insec4 Rep 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 Frac 0.3 0.23 0.3 0.45 1.1 0.45 0.71 0.29 0.61 0.35 1.02 0.21 0.37 0.25 0.36 0.46 0.88 0.63 0.66 0.4 0.49 0.31 0.71 0.18 0.38 0.24 0.31 0.43 0.72 0.76 0.62 0.23 1.24 0.4 0.38 0.23 0.29 0.22 0.33 Time 3 2.3 3 4.5 11 4.5 7.1 2.9 6.1 3.5 10.2 2.1 3.7 2.5 3.6 4.6 8.8 6.3 6.6 4 4.9 3.1 7.1 1.8 3.8 2.4 3.1 4.3 7.2 7.6 6.2 2.3 12.4 4 3.8 2.3 2.9 2.2 3.3 310 7 Time of Occurrence of an Event of Interest Data: Rubber break time Block 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 Demineralization 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 Pasteurization 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 pH 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 y 198.5 299 223.1 166.6 196.5 178.9 160.7 151.1 146.5 146.3 169.3 198.1 236.3 330.7 281.2 151.8 156 159.7 
171.8 159.3 155.9 175.2 182.2 136.2 Data: Aedes aegypti (Trt = treatment, Rep = repetition, Y = survival time) Trt Control Control Control Control Control Control Control Control Control Control Control Control Control Control Rep 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Y 8 11 11 11 11 11 13 13 14 20 8 11 11 11 Trt Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Rep 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Y 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Trt MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS Rep 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Y 3 3 3 3 4 5 6 6 9 12 3 3 3 3 Trt MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC Rep 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Y 2 2 2 2 3 3 3 3 3 4 2 2 2 2 Trt Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Rep 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Y 2 2 2 2 3 3 3 3 6 12 2 2 2 3 (continued) Appendix 1 Trt Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Control Rep 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 311 Y 11 11 15 15 15 16 11 11 11 11 23 25 26 27 30 30 8 8 11 13 14 19 20 20 20 22 Trt Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Rep 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Y 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Trt MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS Rep 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Y 3 3 3 4 5 6 3 3 3 2 2 5 5 6 10 12 3 3 3 4 4 5 5 6 9 12 Trt MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC Rep 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Y 2 2 3 3 3 4 2 2 2 2 3 3 3 3 4 4 2 2 2 2 2 2 3 3 3 3 Trt Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Rep 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Y 3 3 3 4 4 4 
2 2 2 2 3 3 3 3 4 4 2 2 2 3 3 3 4 5 6 11 Data: Aedes aegypti (Bio = bioassay, Trt = treatment, Rep = repetition, Y = survival time) Bio B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 Trt C C C C C C C C C C C C C C Rep 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Y 8 11 11 11 11 11 13 13 14 20 8 11 11 11 Bio B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 Trt MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS Rep 3 3 3 3 3 3 3 3 3 3 4 4 4 4 Y 3 3 3 2 2 5 5 6 10 12 3 3 3 4 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt C C C C C C C C C C C C C C Rep 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Y 5 7 8 8 10 13 14 16 20 22 22 23 23 23 (continued) 312 Bio B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 7 Trt C C C C C C C C C C C C C C C C C C C C C C C C C C Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Rep 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 Y 11 11 15 15 15 16 11 11 11 11 23 25 26 27 30 30 8 8 11 13 14 19 20 20 20 22 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Bio B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 Trt MaS MaS MaS MaS MaS MaS MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC Time of Occurrence of an Event of Interest Rep 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 Y 4 5 5 6 9 12 2 2 2 2 3 3 3 3 3 4 2 2 2 2 2 2 3 3 3 4 2 2 2 2 3 3 3 3 4 4 2 2 2 2 2 2 3 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C Rep 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Y 24 24 24 24 28 28 10 11 11 12 12 15 15 16 16 16 16 18 19 27 27 27 27 27 
28 28 16 19 19 19 19 19 25 25 26 28 28 28 28 28 28 28 28 (continued) Appendix 1 Bio B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 Trt Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS 313 Rep 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 Y 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 4 5 6 6 9 12 3 3 3 3 3 3 3 4 5 6 Bio B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 B1 Trt MaC MaC MaC Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Rep 4 4 4 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 Y 3 3 3 2 2 2 2 3 3 3 3 6 12 2 2 2 3 3 3 3 4 4 4 2 2 2 2 3 3 3 3 4 4 2 2 2 3 3 3 4 5 6 11 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt C C C C C C C C C C C C C C C C C C C C C C C Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Rep 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Y 28 28 28 16 17 17 17 17 17 17 19 28 28 28 28 28 28 28 28 28 28 28 28 2 3 3 4 4 4 5 5 5 6 6 6 7 15 17 21 25 28 28 28 (continued) 314 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 7 Trt Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Rep 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 Y 2 2 2 3 3 3 4 7 11 11 11 13 13 14 14 14 15 16 16 
23 3 3 5 6 8 8 10 10 11 11 11 12 17 17 17 17 17 17 18 25 4 4 5 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaC MaC Time of Occurrence of an Event of Interest Rep 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 Y 18 5 5 5 9 10 10 10 12 12 12 12 12 14 15 18 18 23 25 25 25 5 5 6 6 6 6 7 8 8 9 10 10 10 11 11 11 11 11 11 24 2 2 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Rep 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 Y 13 23 2 2 3 6 6 6 8 8 8 8 9 9 9 9 10 10 12 19 20 24 2 2 3 3 3 4 4 5 5 5 6 6 6 7 8 8 8 10 17 21 2 (continued) Appendix 1 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam Mam MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS 315 Rep 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 Y 7 9 10 12 12 12 12 12 12 13 13 13 18 27 27 27 27 2 2 2 2 3 3 3 3 4 5 5 5 6 6 8 8 9 11 13 21 3 3 4 6 6 8 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC Rep 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 Y 2 2 2 2 3 3 3 3 3 4 4 5 5 7 7 9 9 10 2 2 2 3 3 3 3 3 5 6 6 6 7 7 7 9 10 10 18 19 2 3 6 6 8 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Rep 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 Y 2 2 3 3 3 4 5 6 7 7 7 8 9 9 10 10 10 10 10 2 3 3 3 5 7 7 7 8 8 8 9 10 10 10 10 10 10 11 17 4 5 8 8 (continued) 316 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 7 Trt MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS MaS Rep 2 2 2 2 2 2 2 2 2 2 2 2 2 Y 8 9 9 10 10 11 11 11 11 11 12 12 12 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC MaC Time of Occurrence of an Event of Interest Rep 3 3 3 3 3 3 3 3 3 3 3 3 3 Y 8 9 9 9 9 9 10 10 10 10 10 11 13 Bio B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 B2 Trt Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Ma1 Rep 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 Y 8 9 9 11 11 11 11 11 12 12 13 13 13 13 14 18 Data: Pelibuey Sheep Animal 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 Birthtype 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 Age 18.509589 18.4438356 19.3972603 19.3972603 9.43561644 18.6739726 20.0876712 19.5616438 19.5287671 19.7589041 29.9506849 19.3643836 19.0027397 18.3452055 20.0876712 40.7671233 51.1890411 40.0767123 54.3123288 52.2739726 53.6219178 40.2082192 36.4931507 50.630137 Weight 52.5 47.4 50.2 53.6 47.5 41.3 60.5 49.4 53.4 52.6 35.5 50.5 62.2 54.7 48.7 40.5 49.3 38.1 41.9 58 34.8 40.3 34.6 42.1 Inestro 28 28 16 28 28 28 28 28 28 28 28 28 28 28 28 28 12 20 24 28 28 24 28 28 Durestro 4 4 20 16 4 4 4 4 4 4 4 4 4 4 4 4 8 20 8 4 4 8 4 4 (continued) Appendix 1 Animal 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 
30 31 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 317 Birthtype 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 Age 51.0246575 46.3890411 50.7945206 30.4109589 30.509589 50.6958904 36.6246575 30.5424658 36.6246575 29.9506849 47.2109589 40.2082192 52.2410959 53.4246575 55.5616438 30.5424658 29.0958904 40.1424658 50.7945206 37.6767123 36.4273973 30.5424658 68.9753425 63.1232877 68.7780822 64.3068493 68.6794521 62.630137 69.8958904 69.6 63.4849315 64.2739726 63.5178082 78.739726 64.5369863 62.4328767 67.660274 63.7150685 74.4986301 63.6493151 72.920548 69.4027397 69.9616438 Weight 52.6 32.1 40 37.9 42.2 33.2 34.2 39 33.7 32.9 39.5 57.5 53.3 43.4 46 31.6 47.8 36 42.2 44.3 43.1 38 42.9 44 38.5 48 40.1 46.3 32.5 42.8 51.3 47.7 44.5 38 52.5 41.2 50.8 48.2 33.3 45.1 33 40.4 43.3 Inestro 28 20 16 24 20 24 20 12 24 24 32 12 12 12 24 24 20 20 24 24 20 20 24 24 20 24 20 32 20 20 24 24 12 12 12 12 20 20 32 24 24 24 12 Durestro 4 12 16 8 20 16 20 32 16 16 12 32 28 32 16 16 20 20 16 16 20 20 8 8 12 8 12 4 12 12 8 16 28 28 28 28 20 24 8 16 20 16 28 (continued) 318 Animal 22 23 24 25 26 27 28 29 30 31 32 7 Birthtype 3 3 3 3 3 3 3 3 3 3 3 Age 69.6 63.4849315 63.6164384 67.8575343 63.6821918 65.7534247 67.9890411 61.1835616 63.3534247 79.890411 63.7150685 Time of Occurrence of an Event of Interest Weight 43.2 51 57.4 43 49.7 40.1 33.4 51.6 43.3 44.7 37.9 Inestro 24 24 24 24 24 24 20 20 20 24 20 Durestro 16 16 16 16 16 16 20 20 20 16 20 Data: Gum Breakdown Times Batch 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 Demineralization 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 Pasteurization 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 pH 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Time 198.5 299 223.1 166.6 196.5 178.9 160.7 151.1 146.5 146.3 169.3 198.1 236.3 330.7 281.2 151.8 156 159.7 171.8 159.3 155.9 175.2 182.2 136.2 Appendix 1 319 Open Access This chapter is licensed under the terms of the Creative Commons Attribution 
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses

8.1 Introduction

According to Agresti (2013), the multinomial distribution is a generalization of the binomial distribution to cases with more than two possible ordered (ordinal) or unordered (nominal) outcomes. Given a response with more than two possible outcomes and independent trials with the same category probabilities in each trial, the distribution of counts across categories follows a multinomial distribution.

Quinn and Keough (2002) note that several methods exist for multinomial data analysis. The most common form of categorical data analysis in the biological sciences, which works with frequency counts, is to build cross-tabulations or contingency tables and use chi-squared tests to examine associations between two or more categorical variables. However, such an approach is ill suited to a study that aims to estimate how the response changes with the explanatory variable(s), because contingency tables analyze the association between variables without designating any of them as a predictor or a response.
In this analysis, the results are valid as long as fewer than 20% of the cells have an expected count of less than five and none have an expected count of less than one (Logan 2010). Fisher's exact test is an alternative to the chi-squared test in studies involving small sample sizes. There are several methods for modeling multinomial data; traditional methods of multinomial data analysis include frequency (count) analysis, which uses the chi-squared test, and the log-linear model for contingency tables. This chapter focuses on describing multinomial logit and probit models in detail.

© The Author(s) 2023 J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_8

8.2 Concepts and Definitions

In the multinomial distribution, each of a total of $N$ observations belongs to exactly one of $C$ mutually exclusive categories $c = 1, \ldots, C$, and an observation falls in category $c$ with probability $\pi_c$. The multinomial distribution gives the probability that, among the $N$ sampled observations, exactly $y_1$ belong to category 1, $y_2$ belong to category 2, and so forth up to $y_C$ in category $C$, where $\sum_{c=1}^{C} y_c = N$ and $\sum_{c=1}^{C} \pi_c = 1$. The probability function of this distribution is

$$f(y_1, y_2, \ldots, y_C) = \frac{N!}{y_1!\, y_2! \cdots y_C!}\, \pi_1^{y_1} \pi_2^{y_2} \cdots \pi_C^{y_C}$$

Multinomial models are applied in data analysis where the categorical response variable has more than two possible outcomes, while the independent variables can be continuous, categorical, or both (Hosmer and Lemeshow 2000). The categorical response variable can be either ordinal (ordered) or nominal (unordered). Ordinal response variables take values that represent a rank order on some dimension, but there are not enough values for them to be treated as a continuous variable.
Nominal (unordered) response variables are those whose values provide a classification but no indication of order.

Models for multinomial data are constructed in a similar way as for binomial data. The link functions used in these types of models are similar to the logit and probit functions used for binomial data. Cumulative logit and cumulative probit models define the link function such that, when properly fitted to the data, they allow for parsimonious modeling of ordinal multinomial data. Generalized logit and probit models do not require ordered categories and are therefore suitable for nominal multinomial data.

In terms of generalized linear models (GLMs) and generalized linear mixed models (GLMMs), a multinomial distribution with $C$ categories requires $C - 1$ link functions to fully specify a model that relates the response probabilities $(\pi_1, \pi_2, \ldots, \pi_C)$ to the linear predictor. The commonly used models are the cumulative logit model, also known as the proportional odds model proposed by McCullagh (1980), and the cumulative probit model, also known as the threshold model. Throughout this chapter, we will use either of these two link functions. The link functions for a cumulative logit model with $C$ categories are

$$\eta_1 = \log\left(\frac{\pi_1}{1 - \pi_1}\right) = \eta_1 + X\beta + Zb$$

$$\eta_2 = \log\left(\frac{\pi_1 + \pi_2}{1 - (\pi_1 + \pi_2)}\right) = \eta_2 + X\beta + Zb$$

$$\vdots$$

$$\eta_{C-1} = \log\left(\frac{\pi_1 + \pi_2 + \cdots + \pi_{C-1}}{1 - (\pi_1 + \pi_2 + \cdots + \pi_{C-1})}\right) = \eta_{C-1} + X\beta + Zb$$

where $X$ and $Z$ are the design matrices, whereas $\beta$ and $b$ are the vectors of fixed and random effects parameters, respectively. On the left-hand side, $\eta_c$ denotes the $c$th link; on the right-hand side, $\eta_c$ denotes the intercept (cutoff) of that link. The inverse links of each of the functions are as follows:

$$\pi_1 = \frac{1}{1 + e^{-\eta_1}} = h(\eta_1)$$

$$\pi_1 + \pi_2 = \frac{1}{1 + e^{-\eta_2}} = h(\eta_2)$$

$$\vdots$$

$$\pi_1 + \pi_2 + \cdots + \pi_{C-1} = \frac{1}{1 + e^{-\eta_{C-1}}} = h(\eta_{C-1})$$

Once $h(\eta_1), h(\eta_2), \ldots, h(\eta_{C-1})$ have been estimated, we can then estimate the probabilities $\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_C$.

8.3 Cumulative Logit Models (Proportional Odds Models)

Multinomial logit models are used to model the relationships between a polytomous response variable and a set of predictor variables. These polytomous response models can be classified, as mentioned above, into two different types, depending on whether the response variable has an ordered or an unordered structure. In a proportional odds model, the covariates (linear predictor $\eta$) have the same effect on every cumulative logit, so that a change in the covariate values shifts the response distribution to the right (or left) without changing its shape. In a proportional odds model, the cumulative logits model the effect of the covariates on the probability of a response at or below each category cutoff.

A multinomial logit model assumes independence of categories, which implies that the odds of choosing a category $c$ relative to a category $c'$ are independent of the characteristics of the other categories, for $c \neq c'$. The assumption requires that if a new category becomes available, then the prior probabilities adjust precisely so as to preserve the original odds between all pairs of outcomes. The proportional odds model employs the strict assumption that the odds ratio does not depend on the category, and, therefore, we need to test the proportional odds assumption, which is also called the "parallel regression assumption."

8.3.1 Completely Randomized Design (CRD) with a Multinomial Response: Ordinal

Data are obtained from an experiment related to red core disease in strawberries, which is caused by the fungus Phytophthora fragariae. In this example, 12 strawberry populations were evaluated in a completely randomized experiment with 4 replications (Table 8.1).
Plots generally consisted of 10 plants; in some cases, only 9 plants were observed. At the end of the experiment, each plant was assigned to one of three ordered categories representing fungal damage (1 = no damage, 2 = moderate damage, and 3 = severe damage). The 12 populations were obtained by crossing 3 genotypes of male parents with 4 genotypes of female parents. The variation between and within plots is considered minimal, whereas the genetic and nongenetic effects are more significant, as plants from the same cross are not genetically identical. The model that fits these data for the cumulative probabilities is a GLMM that includes a classification effect for the treatment variable (the population resulting from crossing genotypes). Thus, the GLMM for multinomial ordered outcomes with $C$ categories requires $C - 1$ link function equations to fully specify the model that relates the response probabilities $(\pi_1, \pi_2, \ldots, \pi_C)$ to the linear predictor $\eta_{ij}$ (Stroup 2013). The $C - 1$ cumulative logit equations are defined at the cutoffs that follow each of the categories $1, 2, \ldots, C - 1$.
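The step from the $C - 1$ cumulative logits back to the $C$ category probabilities can be sketched numerically. The following Python fragment is only an illustration and is not part of the original analysis: the function name `category_probabilities` and the two cutoff values are hypothetical, chosen to show how the inverse logit of each cumulative link is differenced to obtain one probability per category.

```python
from math import exp


def category_probabilities(cumulative_logits):
    """Turn C - 1 cumulative logits into C category probabilities.

    Each logit is mapped to a cumulative probability with the inverse
    logit 1 / (1 + exp(-eta)); successive differences then give the
    probability of each individual category.
    """
    cumulative = [1.0 / (1.0 + exp(-eta)) for eta in cumulative_logits]
    # Cumulative probabilities must not decrease from one cutoff to the next.
    if any(later < earlier for earlier, later in zip(cumulative, cumulative[1:])):
        raise ValueError("cumulative logits must be non-decreasing")
    bounds = [0.0] + cumulative + [1.0]
    return [upper - lower for lower, upper in zip(bounds, bounds[1:])]


# Hypothetical cutoffs for a three-category response (not taken from the data).
probs = category_probabilities([-1.0, 0.5])
print([round(p, 4) for p in probs])
```

With three categories, two logits are enough: the last probability is whatever remains after the final cumulative probability.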
Table 8.1 Evaluation of red core disease in strawberry plants

Repetition Parent plant male/female 1 1 1 2 1 3 1 4 2 1 2 2 2 3 2 4 3 1 3 2 3 3 3 4 1 2 Disease category 1 2 3 1 0 3 6 2 2 3 5 0 3 4 3 7 0 5 5 5 1 4 4 2 1 4 5 3 4 3 3 5 1 4 5 1 0 0 9 3 5 3 2 3 0 3 6 2 3 0 7 5 3 2 2 3 2 4 2 4 1 2 5 2 5 2 3 6 7 1 1 6 2 4 6 2 5 3 3 1 2 4 1 2 1 1 3 8 2 3 1 7 4 2 3 6 1 8 2 6 3 5 5 6 3 3 3 5 0 7 0 7 3 4 0 3 1 6 0 1 2 2 2 1 1 4 4 2 0 2 0 3 2 5 3 3 4 5 2 2 5 0 1 3 4 3 3 5 5 5 4 4 4 3 10 7 7 3

The components of the GLMM with an ordinal multinomial response are as follows:

Distributions: $y_{1ij}, y_{2ij}, y_{3ij} \mid r_j \sim \text{Multinomial}(N_{ij}, \pi_{1ij}, \pi_{2ij}, \pi_{3ij})$, where $y_{1ij}$, $y_{2ij}$, and $y_{3ij}$ are the observed frequencies of responses (damage level) in each category (1 = no damage, 2 = moderate damage, and 3 = severe damage) and $r_j$ is the random effect due to repetition, assuming $r_j \sim N(0, \sigma_r^2)$.

Linear predictor: $\eta_{cij} = \eta_c + \tau_i + r_j$, where $\eta_{cij}$ is the $c$th link ($c = 1, 2$) that relates the mean and the linear predictor for treatment $i$ ($i = 1, 2, \ldots, 12$) and repetition $j$ ($j = 1, 2, 3, 4$); $\eta_c$ is the intercept for the $c$th link; $\tau_i$ is the fixed effect due to the $i$th treatment (cross); and $r_j$ is the random effect due to the $j$th repetition, $r_j \sim N(0, \sigma_r^2)$.

The link functions for each category are as follows:

$$\log\left(\frac{\pi_{1ij}}{1 - \pi_{1ij}}\right) = \eta_{1ij}$$

$$\log\left(\frac{\pi_{1ij} + \pi_{2ij}}{1 - (\pi_{1ij} + \pi_{2ij})}\right) = \eta_{2ij}$$

The following GLIMMIX program fits a cumulative logit model with an ordinal multinomial response in a CRD.
    proc glimmix data=FRESA;
      class rep trt cat;
      /* cumulative logit model; category order is taken from the data */
      model cat(order=data) = trt / dist=Multinomial link=clogit solution oddsratio;
      random intercept / subject=rep solution;
      /* one estimate per cutoff (c) and treatment (t):            */
      /* intercept picks the cutoff, trt picks the treatment level */
      estimate 'c=1, t=1'  intercept 1 0 trt 1 0 0 0 0 0 0 0 0 0 0 0,
               'c=2, t=1'  intercept 0 1 trt 1 0 0 0 0 0 0 0 0 0 0 0,
               'c=1, t=2'  intercept 1 0 trt 0 1 0 0 0 0 0 0 0 0 0 0,
               'c=2, t=2'  intercept 0 1 trt 0 1 0 0 0 0 0 0 0 0 0 0,
               'c=1, t=3'  intercept 1 0 trt 0 0 1 0 0 0 0 0 0 0 0 0,
               'c=2, t=3'  intercept 0 1 trt 0 0 1 0 0 0 0 0 0 0 0 0,
               'c=1, t=4'  intercept 1 0 trt 0 0 0 1 0 0 0 0 0 0 0 0,
               'c=2, t=4'  intercept 0 1 trt 0 0 0 1 0 0 0 0 0 0 0 0,
               'c=1, t=5'  intercept 1 0 trt 0 0 0 0 1 0 0 0 0 0 0 0,
               'c=2, t=5'  intercept 0 1 trt 0 0 0 0 1 0 0 0 0 0 0 0,
               'c=1, t=6'  intercept 1 0 trt 0 0 0 0 0 1 0 0 0 0 0 0,
               'c=2, t=6'  intercept 0 1 trt 0 0 0 0 0 1 0 0 0 0 0 0,
               'c=1, t=7'  intercept 1 0 trt 0 0 0 0 0 0 1 0 0 0 0 0,
               'c=2, t=7'  intercept 0 1 trt 0 0 0 0 0 0 1 0 0 0 0 0,
               'c=1, t=8'  intercept 1 0 trt 0 0 0 0 0 0 0 1 0 0 0 0,
               'c=2, t=8'  intercept 0 1 trt 0 0 0 0 0 0 0 1 0 0 0 0,
               'c=1, t=9'  intercept 1 0 trt 0 0 0 0 0 0 0 0 1 0 0 0,
               'c=2, t=9'  intercept 0 1 trt 0 0 0 0 0 0 0 0 1 0 0 0,
               'c=1, t=10' intercept 1 0 trt 0 0 0 0 0 0 0 0 0 1 0 0,
               'c=2, t=10' intercept 0 1 trt 0 0 0 0 0 0 0 0 0 1 0 0,
               'c=1, t=11' intercept 1 0 trt 0 0 0 0 0 0 0 0 0 0 1 0,
               'c=2, t=11' intercept 0 1 trt 0 0 0 0 0 0 0 0 0 0 1 0,
               'c=1, t=12' intercept 1 0 trt 0 0 0 0 0 0 0 0 0 0 0 1,
               'c=2, t=12' intercept 0 1 trt 0 0 0 0 0 0 0 0 0 0 0 1 / ilink;
      /* each data line stands for freq plants */
      freq freq;
    run;

Although most of the GLIMMIX commands have already been described in previous examples, it is important to emphasize how the data should be structured: one line for each combination of repetition, treatment, and lesion category, together with the frequency or number of observations (Y), which, in this case, are referenced by the variables rep, trt (trt = cross), cat (category), and freq, respectively. Part of the data arrangement can be seen in Table 8.2, whereas the rest of the dataset can be found in the Appendix (Data: CRD with multinomial response: ordinal).

Table 8.2 Data ordered

    Rep    Cross   Cat        Freq
    rep1   M1H1    Without    0
    rep1   M1H2    Without    2
    rep1   M1H3    Without    3
    rep1   M1H4    Without    0
    rep1   M1H1    Moderate   3
    rep1   M1H2    Moderate   3
    rep1   M1H3    Moderate   4
    rep1   M1H4    Moderate   5
    rep1   M1H1    Severe     6
    rep1   M1H2    Severe     5
    rep1   M1H3    Severe     3

In the program commands of this example, "order = data" tells GLIMMIX to take the order of the response categories from the order in which they appear in the dataset. The categories must therefore first appear in the dataset in the intended ordinal sequence: no injury (Without), moderate injury (Moderate), and severe injury (Severe). If this option is omitted, GLIMMIX orders the categories alphabetically or numerically, depending on the initial coding of the data. The "estimate" command specifies the estimable functions that form the boundaries between the categories for each of the populations (trt). Finally, the "freq" command instructs GLIMMIX to use the variable "freq" as the number of observations (frequency) under the corresponding categorization.
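The one-line-per-combination layout described above can be illustrated with a short, hedged sketch. The Python fragment below is not part of the original workflow: the helper `to_long_format` is a hypothetical name, and the counts are the repetition 1 values for the first three crosses listed in Table 8.2. It lists the categories in their ordinal sequence so that a reader relying on "order = data" would see them first in the intended order.

```python
# Ordinal sequence of the response categories (first appearance fixes the order).
CATEGORIES = ["Without", "Moderate", "Severe"]


def to_long_format(rep, counts_by_cross):
    """Return one (rep, cross, category, freq) row per cross and category.

    counts_by_cross maps a cross label to its counts in CATEGORIES order.
    Rows are grouped by category, as in Table 8.2.
    """
    rows = []
    for position, category in enumerate(CATEGORIES):
        for cross, counts in counts_by_cross.items():
            rows.append((rep, cross, category, counts[position]))
    return rows


# Repetition 1 counts for three crosses, as listed in Table 8.2.
rep1_counts = {
    "M1H1": (0, 3, 6),
    "M1H2": (2, 3, 5),
    "M1H3": (3, 4, 3),
}

rows = to_long_format("rep1", rep1_counts)
for row in rows:
    print(*row)
```

Each printed row corresponds to one data line with the variables rep, trt, cat, and freq.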
In this way, the first estimate "c = 1, t = 1" defines the predictor $\eta_1 + \tau_1$, that is, the boundary between the "Without" and "Moderate" categories for treatment 1 with its corresponding logit $\log\left(\frac{\pi_{11}}{1 - \pi_{11}}\right)$, whereas the second estimate "c = 2, t = 1" defines the boundary between the categories of "Moderate" and "Severe" damage with the logit $\log\left(\frac{\pi_{11} + \pi_{21}}{1 - (\pi_{11} + \pi_{21})}\right)$, which estimates the probability of observing a plant from population 1 (M1H1 = trt) "Without" damage or with "Moderate" damage when exposed to the fungus (Phytophthora fragariae). Part of the output is presented in Table 8.3.

Table 8.3 Results of the multinomial analysis of variance for injury level in strawberry plants

(a) Covariance parameter estimates

    Cov Parm    Subject   Estimate   Standard error
    Intercept   Rep       0.1453     0.1437

(b) Type III tests of fixed effects

    Effect   Num degree of freedom (DF)   Den DF   F-value   Pr > F
    Trt      11                           457      2.60      0.0032

Table 8.4 Fixed effects solution for injury categories

Solutions for fixed effects

    Effect               Cat        Trt    Estimate   Standard error   DF    t-value   Pr > |t|
    Intercept (η1)       Without           -0.4571    0.3526           3     -1.30     0.2855
    Intercept (η2)       Moderate          1.0631     0.3558           3     2.99      0.0582
    Trt (τ1)                        M1H1   -1.1456    0.4264           457   -2.69     0.0075
    Trt (τ2)                        M1H2   -0.8355    0.4179           457   -2.00     0.0462
    Trt (τ3)                        M1H3   -0.4621    0.4171           457   -1.11     0.2685
    Trt (τ4)                        M1H4   -0.4716    0.4145           457   -1.14     0.2558
    Trt (τ5)                        M2H1   -1.2644    0.4295           457   -2.94     0.0034
    Trt (τ6)                        M2H2   -0.6060    0.4181           457   -1.45     0.1479
    Trt (τ7)                        M2H3   -0.2332    0.4140           457   -0.56     0.5735
    Trt (τ8)                        M2H4   -0.3912    0.4168           457   -0.94     0.3484
    Trt (τ9)                        M3H1   -1.5563    0.4393           457   -3.54     0.0004
    Trt (τ10)                       M3H2   -0.4508    0.4144           457   -1.09     0.2772
    Trt (τ11)                       M3H3   -1.4426    0.4350           457   -3.32     0.0010
    Trt (τ12)                       M3H4   0          .                .     .         .
The estimated variance component (part (a)) due to repetitions is $\hat{\sigma}_r^2 = 0.1453$, whereas the hypothesis tests for type III effects (part (b)) ("Type III tests of fixed effects") indicate that the crosses differ significantly in their tolerance to fungal attack ($P = 0.0032$). The results of the fixed effects solution, obtained by specifying the "solution" option in the model, are shown in Table 8.4.

From the fixed effects solution, we can estimate the linear predictors for the two cutoffs of each treatment, which are on the model scale. Here, $\eta_1$ defines the boundary between the categories "Without" damage and "Moderate" damage and $\eta_2$ defines the boundary between the categories "Moderate" damage and "Severe" damage. For example, for treatment 1, the linear predictor at the first cutoff is $\eta_{11} = \eta_1 + \tau_1 = -0.4571 + (-1.1456) = -1.6027$, and the linear predictor at the second cutoff is $\eta_{21} = \eta_2 + \tau_1 = 1.0631 + (-1.1456) = -0.0825$. Note that for the proportional odds model, the $\tau_i$ values are not category-specific; treatment effects move the boundaries as a group.

Table 8.5 Estimated odds ratio

Odds ratio estimates

    Trt    _Trt   Estimate   DF    95% Confidence limits
    M1H1   M3H4   0.318      457   0.138   0.735
    M1H2   M3H4   0.434      457   0.191   0.986
    M1H3   M3H4   0.630      457   0.278   1.430
    M1H4   M3H4   0.624      457   0.276   1.409
    M2H1   M3H4   0.282      457   0.121   0.657
    M2H2   M3H4   0.546      457   0.240   1.241
    M2H3   M3H4   0.792      457   0.351   1.787
    M2H4   M3H4   0.676      457   0.298   1.534
    M3H1   M3H4   0.211      457   0.089   0.500
    M3H2   M3H4   0.637      457   0.282   1.438
    M3H3   M3H4   0.236      457   0.101   0.556

The odds ratio (Table 8.5) is the result of taking $e^{\tau_i}$ for each cross, with cross 12 (M3H4) as the reference level. Since the odds ratios are not specific to a particular category, this value is the same at both cutoffs, and hence the name proportional odds.
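The arithmetic behind the linear predictors and the odds ratios can be checked with a few lines of Python. This is only a sketch under stated assumptions: the dictionaries below copy the estimates reported in Table 8.4, while the names `INTERCEPTS`, `TAU`, `linear_predictors`, and `odds_ratio` are hypothetical and do not come from the GLIMMIX output.

```python
from math import exp

# Cutoff intercepts and treatment effects copied from Table 8.4.
INTERCEPTS = {"Without": -0.4571, "Moderate": 1.0631}
TAU = {
    "M1H1": -1.1456, "M1H2": -0.8355, "M1H3": -0.4621, "M1H4": -0.4716,
    "M2H1": -1.2644, "M2H2": -0.6060, "M2H3": -0.2332, "M2H4": -0.3912,
    "M3H1": -1.5563, "M3H2": -0.4508, "M3H3": -1.4426, "M3H4": 0.0,
}


def linear_predictors(cross):
    """Model-scale predictor eta_c + tau_i at each cutoff for one cross."""
    return {cutoff: eta + TAU[cross] for cutoff, eta in INTERCEPTS.items()}


def odds_ratio(cross, reference="M3H4"):
    """Cumulative odds of `cross` relative to the reference cross."""
    return exp(TAU[cross] - TAU[reference])


eta_m1h1 = linear_predictors("M1H1")
print({cutoff: round(value, 4) for cutoff, value in eta_m1h1.items()})
print(round(odds_ratio("M1H1"), 3))
```

Because the same $\tau_i$ enters both cutoffs, `odds_ratio` needs no cutoff argument, which is the proportional odds property stated above.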
In Table 8.6, we show the maximum likelihood estimates of the linear predictors $\hat{\eta}_{ci} = \hat{\eta}_c + \hat{\tau}_i$ in the "Estimate" column, on the model scale, as well as the means on the data scale for each of the categories of the treatments tested ("Mean"). Thus, for c = 1, t = 1 (response category "Without" damage and treatment 1), the estimate is $\hat{\eta}_{11} = -1.6027$, and for c = 2, t = 1 ("Moderate" damage and treatment 1), the linear predictor is $\hat{\eta}_{21} = -0.0825$. Taking the inverse of the link function yields the probability $\hat{\pi}_{11} = 1/(1 + e^{1.6027}) = 0.1676$. This is the estimated probability that the cross (treatment) M1H1 has a response score of "Without damage." This inverse value is presented under the "Mean" column (Table 8.6). Now, for c = 2, t = 1, the inverse of the link yields the following cumulative probability: $\hat{\pi}_{11} + \hat{\pi}_{21} = 1/(1 + e^{0.0825}) = 0.4794$. From this value, we deduce the probabilities of observing "Moderate" damage and "Severe" damage in a plant of the cross M1H1. For "Moderate" damage, the probability is $\hat{\pi}_{21} = 0.4794 - \hat{\pi}_{11} = 0.4794 - 0.1676 = 0.3118$, and, for "Severe" damage, it is $\hat{\pi}_{31} = 1 - (\hat{\pi}_{11} + \hat{\pi}_{21}) = 1 - 0.4794 = 0.5206$. The rest of the probabilities for the different crosses are estimated similarly.

8.3.2 Randomized Complete Block Design (RCBD) with a Multinomial Response: Ordinal

In recent years, poultry production has become conscious of animal welfare, which is associated with bird mortality, behavior, and health, among others (Stanley 1981; Martrenchar et al. 2002). One of the diseases related to animal welfare is footpad dermatitis, which, among many repercussions, affects a bird's ability to walk (Bilgili et al. 2009).
Table 8.6 Estimates on the model scale (Estimate) and on the data scale (Mean) for the damage categories in strawberry plants

Estimates

    Label           Estimate   Standard error   DF    t-value   Pr > |t|   Mean     Standard error mean
    c = 1, t = 1    -1.6027    0.3706           457   -4.33     <0.0001    0.1676   0.05170
    c = 2, t = 1    -0.08254   0.3625           457   -0.23     0.8200     0.4794   0.09047
    c = 1, t = 2    -1.2926    0.3597           457   -3.59     0.0004     0.2154   0.06080
    c = 2, t = 2    0.2276     0.3542           457   0.64      0.5208     0.5567   0.08741
    c = 1, t = 3    -0.9191    0.3572           457   -2.57     0.0104     0.2851   0.07281
    c = 2, t = 3    0.6010     0.3555           457   1.69      0.0916     0.6459   0.08131
    c = 1, t = 4    -0.9286    0.3542           457   -2.62     0.0090     0.2832   0.07190
    c = 2, t = 4    0.5915     0.3524           457   1.68      0.0939     0.6437   0.08081
    c = 1, t = 5    -1.7214    0.3744           457   -4.60     <0.0001    0.1517   0.04818
    c = 2, t = 5    -0.2013    0.3656           457   -0.55     0.5822     0.4499   0.09047
    c = 1, t = 6    -1.0631    0.3590           457   -2.96     0.0032     0.2567   0.06850
    c = 2, t = 6    0.4571     0.3557           457   1.28      0.1995     0.6123   0.08444
    c = 1, t = 7    -0.6903    0.3526           457   -1.96     0.0509     0.3340   0.07842
    c = 2, t = 7    0.8299     0.3533           457   2.35      0.0193     0.6963   0.07471
    c = 1, t = 8    -0.8483    0.3566           457   -2.38     0.0178     0.2998   0.07485
    c = 2, t = 8    0.6719     0.3556           457   1.89      0.0595     0.6619   0.07958
    c = 1, t = 9    -2.0133    0.3864           457   -5.21     <0.0001    0.1178   0.04016
    c = 2, t = 9    -0.4932    0.3759           457   -1.31     0.1902     0.3791   0.08849
    c = 1, t = 10   -0.9079    0.3540           457   -2.56     0.0106     0.2874   0.07250
    c = 2, t = 10   0.6123     0.3524           457   1.74      0.0830     0.6485   0.08033
    c = 1, t = 11   -1.8997    0.3813           457   -4.98     <0.0001    0.1301   0.04317
    c = 2, t = 11   -0.3795    0.3714           457   -1.02     0.3074     0.4062   0.08958
    c = 1, t = 12   -0.4571    0.3526           457   -1.30     0.1955     0.3877   0.08369
    c = 2, t = 12   1.0631     0.3558           457   2.99      0.0030     0.7433   0.06789

Pododermatitis, also known as contact dermatitis or footpad dermatitis, is characterized by inflammation and necrotic lesions that extend from the plantar surface to deep within the footpads of chickens.
Deep ulcers may result in abscesses and in the thickening of the underlying tissues and structures (Greene et al. 1985). Chicken feet have great economic importance because they are in high demand in the foreign market, mainly in Southeast Asia and China; however, diseases or alterations such as pododermatitis cause significant economic losses, since diseased feet are not suitable for human consumption, and this is subsequently reflected in market prices (Taira et al. 2014). Due to the economic importance of this product, Garcia et al. (2010) have focused on studying the factors that cause this disease and on finding strategies to reduce leg and carcass lesions in poultry.

Table 8.7 Treatment design

    Treatment   Features
    Trt1        Traditional program + 1 kg m$^{-2}$ of rice husks
    Trt2        Traditional program + 2 kg m$^{-2}$ of rice husks
    Trt3        Traditional program + podal health program + 1 kg m$^{-2}$ of rice husks
    Trt4        Traditional program + podal health program + 2 kg m$^{-2}$ of rice husks

Important factors in broiler fattening are the type of litter, litter height, nutrition and feeding programs, and bird health, among others. The objective of this study was to evaluate the effect of litter density and organic minerals (Availa Zn and Availa Mn), with an extract of Yucca schidigera (Micro-Aid) as a supplement to a traditional fattening program, on the development of footpad dermatitis in broilers. The genetic material used in this experiment was mainly male Ross line chickens. The traditional broiler fattening program of the poultry farm consists of three phases: a starter diet (1–18 days), a grower diet (19–35 days), and a finisher diet (36–50 days), applied over a period of 50 days, with rice husk used as bedding material at a density of 1 kg m$^{-2}$.
In this research, a foot health program was implemented in addition to the traditional fattening program, which included the addition of 125 ppm of Micro-Aid (Yucca schidigera extract), 40 ppm of Availa Zn, and 40 ppm of Availa Mn to the fattening diet. Based on the above information, four treatments were evaluated at two poultry farms, as described below:

• Treatment 1 involved the application of the company’s traditional fattening program (Trt1).
• Treatment 2 was the company’s traditional fattening program plus an increase in litter density from 1 to 2 kg m⁻² (Trt2).
• Treatment 3 was the traditional fattening program plus the implementation of the foot health program during the fattening period until completion (Trt3).
• Treatment 4 consisted of the traditional fattening program plus the implementation of the foot health program and an increase in litter density from 1 to 2 kg m⁻² (Trt4).

Table 8.7 lists the treatments studied.

The response variable evaluated was the degree of foot lesion (pododermatitis) at the end of the fattening period (50 days). The response variable was evaluated on 1250 chickens per treatment. The degree of a footpad lesion was determined according to a visual guide for lesions in chickens based on the method of De Jong and Guémené (2012). This method defines three grades: grade 0 for feet with no lesions, grade 1 for lesions in some areas of the footpad (<50%), and grade 2 for extensive lesions of the footpad (50–100%). Table 8.8 shows the dataset indicating the block, treatment, level of lesion, and the number of birds observed with a given lesion (frequency).
Table 8.8 Pododermatitis in broilers

| Block | Trt | Category | Frequency |
|---|---|---|---|
| 1 | 1 | Without | 26 |
| 1 | 1 | Slight | 58 |
| 1 | 1 | Severe | 17 |
| 2 | 1 | Without | 37 |
| 2 | 1 | Slight | 56 |
| 2 | 1 | Severe | 6 |
| 1 | 2 | Without | 40 |
| 1 | 2 | Slight | 57 |
| 1 | 2 | Severe | 3 |
| 2 | 2 | Without | 77 |
| 2 | 2 | Slight | 23 |
| 2 | 2 | Severe | 0 |
| 1 | 3 | Without | 54 |
| 1 | 3 | Slight | 43 |
| 1 | 3 | Severe | 3 |
| 2 | 3 | Without | 25 |
| 2 | 3 | Slight | 69 |
| 2 | 3 | Severe | 7 |
| 1 | 4 | Without | 65 |
| 1 | 4 | Slight | 34 |
| 1 | 4 | Severe | 1 |
| 2 | 4 | Without | 63 |
| 2 | 4 | Slight | 36 |
| 2 | 4 | Severe | 0 |

Note: Without stands for no lesion, slight stands for moderate lesion, and severe stands for severe lesion

The GLMM for ordered multinomial responses with C categories requires C − 1 link functions, instead of one, to fully specify a model that relates the response probabilities (π1, π2, ..., πC) to the linear predictor ηij (Stroup 2013). One cumulative logit equation is formed for each of the categories 1, 2, ..., C − 1. The link functions for the cumulative logit model to describe the response variable with C categories are as follows:

$$\eta_{(1)ij} = \log\left(\frac{\pi_{1ij}}{1-\pi_{1ij}}\right) = \eta_1 + \tau_i + b_j$$

$$\eta_{(2)ij} = \log\left(\frac{\pi_{1ij}+\pi_{2ij}}{1-(\pi_{1ij}+\pi_{2ij})}\right) = \eta_2 + \tau_i + b_j$$

$$\vdots$$

$$\eta_{(C-1)ij} = \log\left(\frac{\pi_{1ij}+\pi_{2ij}+\cdots+\pi_{(C-1)ij}}{1-(\pi_{1ij}+\pi_{2ij}+\cdots+\pi_{(C-1)ij})}\right) = \eta_{C-1} + \tau_i + b_j$$

The components of the GLMM with an ordinal multinomial response variable are as follows:

Distributions: $y_{0ij}, y_{1ij}, y_{2ij} \mid b_j \sim \text{Multinomial}(N_{ij}, \pi_{0ij}, \pi_{1ij}, \pi_{2ij})$, where $y_{0ij}$, $y_{1ij}$, and $y_{2ij}$ are the observed frequencies of the responses (footpad lesion) in each category (none, mild, and severe) and $b_j$ is the random effect due to block, assuming $b_j \sim N(0, \sigma^2_b)$.

Linear predictor: $\eta_{(c)ij} = \eta_c + \tau_i + b_j$, where $\eta_{(c)ij}$ is the cth link (c = 0, 1) for treatment i and block j, $\eta_c$ is the intercept for the cth link, $\tau_i$ is the fixed effect due to the ith treatment, and $b_j$ is the random effect due to the jth block, with $b_j \sim N(0, \sigma^2_b)$.
The link functions for each category are as follows:

$$\log\left(\frac{\pi_{0ij}}{1-\pi_{0ij}}\right) = \eta_{(0)ij}$$

$$\log\left(\frac{\pi_{0ij}+\pi_{1ij}}{1-(\pi_{0ij}+\pi_{1ij})}\right) = \eta_{(1)ij}$$

The following GLIMMIX commands fit a cumulative logit model with an ordinal multinomial response.

    proc glimmix data=multinomial_ord;
       class block trt;
       /* order=data keeps the lesion categories in the order found in the dataset */
       model categoria(order=data) = trt / dist=multinomial link=clogit
             solution oddsratio(diff=last label);
       random intercept / subject=block;
       /* intercept 1 0 selects the first boundary, intercept 0 1 the second; */
       /* the four trt coefficients select treatments 1 to 4                  */
       estimate 'c=0, t=1' intercept 1 0 trt 1 0 0 0,
                'c=1, t=1' intercept 0 1 trt 1 0 0 0,
                'c=0, t=2' intercept 1 0 trt 0 1 0 0,
                'c=1, t=2' intercept 0 1 trt 0 1 0 0,
                'c=0, t=3' intercept 1 0 trt 0 0 1 0,
                'c=1, t=3' intercept 0 1 trt 0 0 1 0,
                'c=0, t=4' intercept 1 0 trt 0 0 0 1,
                'c=1, t=4' intercept 0 1 trt 0 0 0 1 / ilink;
       freq y;
    run;

The data should have one column each for block, treatment, lesion category, and frequency or number of observations, which, in this program, are referenced by the variables block, trt, categoria, and y, respectively. Most of the options in the above syntax have already been explained; the “order = data” option specifies that the order in which the categories appear in the dataset is treated as the ordinal order, from lowest to highest, for the analysis. If this option is not used with the response variable in the model specification, “proc GLIMMIX” sorts the categories alphabetically or numerically, depending on whether they are entered as names or numbers. The “freq y” statement instructs GLIMMIX to use y as the number of observations in the corresponding category. The “estimate” command specifies the estimable functions that form the boundaries between categories for each of the four treatments. For example, the first estimate “c = 0, t = 1” defines η0 + τ1, that is, the boundary between the categories “Without” (no lesion) and “Moderate” (slight lesion) for treatment 1.
This first estimate corresponds to the logit $\log[\pi_{01}/(1-\pi_{01})]$, where $\pi_{01}$ is the probability that a chicken that received treatment 1 has a lesion classified under category 0 (no lesion). The second estimate, “c = 1, t = 1,” defines η1 + τ1, that is, the boundary between the categories “Moderate” (slight lesion) and “Severe” (severe lesion) for treatment 1, and corresponds to the cumulative logit $\log[(\pi_{01}+\pi_{11})/(1-(\pi_{01}+\pi_{11}))]$, and so on. By taking the inverse of these link values, we obtain the estimated cumulative probabilities $\pi_{01}$ and $\pi_{01}+\pi_{11}$, from which $\pi_{11}$ follows by subtraction. Part of the Statistical Analysis Software (SAS) GLIMMIX output is presented below:

Table 8.9 Results of the analysis of variance in the multinomial cumulative logit model

(a) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 3 | 794 | 22.45 | <0.0001 |

(b) Solutions for fixed effects

| Effect | Category | Trt | Estimate | Standard error |
|---|---|---|---|---|
| Intercept (η1) | Without | | 0.6144 | 0.1799 |
| Intercept (η2) | Moderate | | 3.8787 | 0.2465 |
| Trt (τ1) | | 1 | -1.5034 | 0.2086 |
| Trt (τ2) | | 2 | -0.2509 | 0.2055 |
| Trt (τ3) | | 3 | -1.0365 | 0.2036 |
| Trt (τ4) | | 4 | 0 | . |

The results of the analysis of variance in part (a) of Table 8.9 indicate that the degree of lesion in the chicken footpad (pododermatitis) differed significantly among the treatments tested (P < 0.0001). Therefore, the null hypothesis of no treatment effect (H0: τi = 0 for all i, that is, odds ratio = 1) is rejected. In part (b) of Table 8.9, we can see that the estimated intercepts η1 = 0.6144 and η2 = 3.8787 define the boundary between the categories “Without” lesion and “Moderate” lesion and the boundary between the categories “Moderate” lesion and “Severe” lesion, respectively. The estimated effects of the treatments (τi) show that the boundaries move either upward or downward when a certain treatment is applied. In this sense, all estimated treatment coefficients have a negative effect with respect to treatment 4.
This means that chickens under treatments 1–3 have a lower probability of remaining free of lesions and a higher probability of developing a severe lesion than chickens under treatment 4. To calculate the probability that a chicken will not develop footpad dermatitis (c = 0) when receiving treatment 1, that is, “c = 0, Trt = 1,” we first estimate the linear predictor $\eta_{01} = \eta_0 + \tau_1 = 0.6144 + (-1.5034) = -0.889$ (here η0 and η1 denote the two intercepts labeled η1 and η2 in Table 8.9), and, taking the inverse, we obtain $\pi_{01} = 1/(1+e^{-(-0.889)}) = 0.291$. This value is the estimated probability that a chicken will not develop footpad dermatitis when receiving treatment 1. For “c = 1, Trt = 1,” $\eta_{11} = \eta_1 + \tau_1 = 3.8787 + (-1.5034) = 2.3753$, whose inverse value is 0.915. This value is an estimate of the cumulative probability $\pi_{01} + \pi_{11}$. From this value, we obtain the probabilities that a chicken will develop a moderate lesion or a severe lesion. For a moderate lesion, the probability is $\pi_{11} = 0.915 - \pi_{01} = 0.915 - 0.291 = 0.624$, and, for a severe lesion, the probability is $\pi_{21} = 1 - 0.915 = 0.085$. The probabilities for the categories (c = 0, 1, 2) of the remaining treatments are computed in the same way.

Table 8.10 Estimated odds ratio

| Comparison | Estimate | DF | Lower 95% confidence limit | Upper 95% confidence limit |
|---|---|---|---|---|
| trt 1 vs. 4 | 0.222 | 794 | 0.148 | 0.335 |
| trt 2 vs. 4 | 0.778 | 794 | 0.520 | 1.165 |
| trt 3 vs. 4 | 0.355 | 794 | 0.238 | 0.529 |

The odds ratios in Table 8.10 are $e^{\tau_i}$ for treatments i = 1, 2, 3 relative to treatment 4. Because the values of τi are not category-specific, the odds ratio at the boundary between “Without” lesion and “Moderate” lesion is the same as that at the boundary between “Moderate” lesion and “Severe” lesion (hence the name “proportional odds”). The confidence limits for treatments 1 and 3 exclude 1, which is consistent with the F- and P-values of the fixed effects test. Adding the “ilink” option to the end of the “estimate” command prompts GLIMMIX to estimate the inverse of the linear predictors (ηci), i.e., the cumulative probabilities $1/(1+e^{-\eta_{ci}})$ (Table 8.11).

Table 8.11 Estimates on the model scale (Estimate) and on the data scale (Mean) for footpad dermatitis categories in the multinomial cumulative logit model

| Label | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| c = 0, t = 1 | -0.8893 | 0.2428 | 794 | -4.93 | <0.0001 | 0.2914 | 0.001174 |
| c = 1, t = 1 | 2.3753 | 0.2214 | 794 | 10.73 | <0.0001 | 0.9149 | 0.01724 |
| c = 0, t = 2 | 0.3634 | 0.1757 | 794 | 2.07 | 0.0390 | 0.5899 | 0.04252 |
| c = 1, t = 2 | 3.6277 | 0.2420 | 794 | 14.99 | <0.0001 | 0.9741 | 0.006103 |
| c = 0, t = 3 | -0.4222 | 0.1740 | 794 | -2.43 | 0.0155 | 0.3960 | 0.04162 |
| c = 1, t = 3 | 2.8422 | 0.2304 | 794 | 12.34 | <0.0001 | 0.9449 | 0.01199 |
| c = 0, t = 4 | 0.6144 | 0.1799 | 794 | 3.41 | 0.0007 | 0.6489 | 0.04098 |
| c = 1, t = 4 | 3.8787 | 0.2465 | 794 | 15.73 | <0.0001 | 0.9797 | 0.004893 |

Table 8.11 shows the estimates of ηc + τi. For example, the linear predictor for the probability that a chicken will not develop a lesion under treatment 1 is given by “c = 0, t = 1,” that is, η0 + τ1 = -0.8893. This result matches the one obtained from the fixed effects table “Solutions for fixed effects” previously shown.
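The hand calculation for the cumulative logit model (add the treatment effect to each intercept, apply the inverse logit, then take differences) can be checked numerically. The following Python sketch is not part of the original SAS analysis; it assumes the rounded estimates reported in Table 8.9 as inputs and evaluates the predictor with the block effect set to zero. All function and variable names are illustrative.

```python
import math

# Rounded estimates copied from Table 8.9 (assumed inputs, not re-estimated here).
INTERCEPTS = [0.6144, 3.8787]   # boundaries Without|Moderate and Moderate|Severe
TRT_EFFECTS = {1: -1.5034, 2: -0.2509, 3: -1.0365, 4: 0.0}


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probabilities(tau):
    """Return (P(Without), P(Moderate), P(Severe)) for treatment effect tau."""
    cumulative = [inv_logit(eta_c + tau) for eta_c in INTERCEPTS]
    p_without = cumulative[0]
    p_moderate = cumulative[1] - cumulative[0]
    p_severe = 1.0 - cumulative[1]
    return p_without, p_moderate, p_severe


# Category probabilities per treatment and odds ratios relative to treatment 4.
probabilities = {trt: category_probabilities(tau) for trt, tau in TRT_EFFECTS.items()}
odds_ratios = {trt: math.exp(tau) for trt, tau in TRT_EFFECTS.items() if trt != 4}

for trt, probs in probabilities.items():
    print(f"Trt{trt}: " + ", ".join(f"{p:.3f}" for p in probs))
for trt, ratio in odds_ratios.items():
    print(f"Odds ratio, trt {trt} vs. 4: {ratio:.3f}")
```

Because the inputs are rounded to four decimals, the printed values should agree with the SAS output only up to rounding.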
Taking the inverse of the link yields the probability $\pi_{01} = 1/(1 + e^{0.8893}) = 0.2914$. This is the maximum likelihood estimate of the probability that a chicken will have no footpad lesion with treatment 1. The inverse of the link function is under the “Mean” column of Table 8.11. Now, for the category “c = 1, t = 1,” the inverse of the linear predictor is 0.9149, which is the estimate of $\pi_{01} + \pi_{11}$. From this value, we can obtain the probability of a chicken showing a “Moderate” lesion when receiving treatment 1: since $\pi_{01} + \pi_{11} = 0.9149$, substituting the value of $\pi_{01}$ gives $\pi_{11} = 0.9149 - 0.2914 = 0.6235$. Finally, the probability that a chicken will present a “Severe” lesion (category “c = 2, t = 1”) is $\pi_{21} = 1 - 0.9149 = 0.0851$. Following the same procedure, we can obtain the probabilities for each of the categories (c = 0, 1, 2) of the rest of the treatments (2–4).

Fig. 8.1 Estimated probabilities for the footpad lesion categories in the treatments tested, using the cumulative logit model (horizontal axis: Treatment, Trt1 to Trt4; vertical axis: Probability of lesion; series: Without, Moderate, Severe)

Figure 8.1 shows that under the traditional feeding program with a litter density of 1 kg m⁻² of rice husks (Trt1), there is a high probability that broilers will develop moderate and severe footpad lesions, as shown by $\pi_{11} = 0.624$ and $\pi_{21} = 0.085$, respectively.
When the litter density was increased from 1 to 2 kg m⁻² of rice husks under the traditional broiler program (Trt2), the probabilities of developing moderate and severe footpad lesions decreased to $\pi_{12} = 0.384$ and $\pi_{22} = 0.026$, respectively, compared to Trt1, whereas the probability of not developing a footpad lesion increased to $\pi_{02} = 0.590$ (Trt2) from $\pi_{01} = 0.291$ (Trt1). For the two treatments that include the foot health program, the probability of chickens not developing a footpad lesion is $\pi_{04} = 0.649$ with a litter density of 2 kg m⁻² of rice husks (Trt4) compared to $\pi_{03} = 0.396$ in Trt3, whereas the probabilities of developing moderate and severe lesions are $\pi_{14} = 0.331$ and $\pi_{24} = 0.020$ in Trt4, lower than $\pi_{13} = 0.549$ and $\pi_{23} = 0.055$ in Trt3.

8.4 Cumulative Probit Models

An ordinal cumulative probit model, first considered by Aitchison and Silvey (1957), generalizes a binary probit model to ordinal responses. This model results from the probit modeling of the cumulative probabilities as a linear function of the covariates. The link functions for the cumulative probit model with C categories are listed below:

$$\eta_{(1)} = \Phi^{-1}(\pi_1) = \eta_1 + X\beta + Zb$$

$$\eta_{(2)} = \Phi^{-1}(\pi_1 + \pi_2) = \eta_2 + X\beta + Zb$$

$$\vdots$$

$$\eta_{(C-1)} = \Phi^{-1}(\pi_1 + \pi_2 + \cdots + \pi_{C-1}) = \eta_{C-1} + X\beta + Zb$$

where X and Z are the design matrices, β and b are the vectors of fixed and random effects parameters, respectively, and $\Phi^{-1}(\cdot)$ is the inverse function of the standard normal cumulative distribution. The inverse link of each of the link functions is as follows:

$$\pi_1 = \Phi(\eta_{(1)}) = h(\eta_{(1)})$$

$$\pi_1 + \pi_2 = \Phi(\eta_{(2)}) = h(\eta_{(2)})$$

$$\vdots$$

$$\pi_1 + \pi_2 + \cdots + \pi_{C-1} = \Phi(\eta_{(C-1)}) = h(\eta_{(C-1)})$$

Once $h(\eta_{(1)}), h(\eta_{(2)}), \ldots, h(\eta_{(C-1)})$ are estimated, we can estimate $\pi_1, \ldots, \pi_C$.
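As a worked illustration of this last step, each category probability is the difference between adjacent cumulative probabilities. The display below is a sketch that follows directly from the inverse links; writing $\eta_{(c)}$ for the cth linear predictor and treating the first and last categories separately are conventions assumed here for compactness.

```latex
\begin{aligned}
\pi_1 &= \Phi\left(\eta_{(1)}\right),\\
\pi_c &= \Phi\left(\eta_{(c)}\right) - \Phi\left(\eta_{(c-1)}\right), \qquad c = 2, \ldots, C-1,\\
\pi_C &= 1 - \Phi\left(\eta_{(C-1)}\right).
\end{aligned}
```

With C = 3, as in the pododermatitis example, this reduces to one direct evaluation, one subtraction, and one complement.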
The fit of the ordinal cumulative probit model is usually very similar to that of the ordinal cumulative logit model, although not for every dataset. Both imply a stochastic ordering of the response distributions at different levels of the explanatory variables and are designed to detect shifts in the location of the response variable. Returning to Example 8.3.1, for the cumulative probit model, we replace “LINK = CLOGIT” with “LINK = CPROBIT” in the model statement of the above program. The output contains all the same elements, except the odds ratios. The analysis for the cumulative probit model proceeds exactly as for the cumulative logit model. Part of the output is shown in parts (a)–(c) of Table 8.12.

Table 8.12 Results of the analysis of variance in the multinomial cumulative probit model

(a) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Intercept | Blk | 0.009262 | 0.01817 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 3 | 794 | 24.57 | <0.0001 |

(c) Solutions for fixed effects

| Effect | Category | Trt | Estimate | Standard error |
|---|---|---|---|---|
| Intercept (η1) | Without | | 0.3880 | 0.1124 |
| Intercept (η2) | Moderate | | 2.2407 | 0.1375 |
| Trt (τ1) | | 1 | -0.9278 | 0.1227 |
| Trt (τ2) | | 2 | -0.1595 | 0.1242 |
| Trt (τ3) | | 3 | -0.6459 | 0.1219 |
| Trt (τ4) | | 4 | 0 | . |

The estimated variance component due to blocks is $\sigma^2_{block} = 0.0093$. The results of the analysis of variance show that the degrees of lesion in the chickens’ footpads (pododermatitis) differ significantly among the tested treatments (P < 0.0001). In part (c) of Table 8.12, the estimated intercepts η1 = 0.3880 and η2 = 2.2407 define the boundary between the “Without” lesion and “Moderate” lesion categories and the boundary between the “Moderate” lesion and “Severe” lesion categories, respectively. The estimated effects of the treatments (τi) move the boundaries either upward or downward when a certain treatment is applied. In this sense, all estimated treatment coefficients have a negative effect with respect to treatment 4. This means that chickens under treatments 1–3 have a lower probability of remaining free of footpad lesions and a higher probability of developing a severe lesion than chickens under treatment 4.

From “Solutions for fixed effects” (Table 8.12, part (c)), the probabilities for each of the categories can be obtained. For the probability that a chicken will not develop a footpad lesion (c = 0) under treatment 1, i.e., “c = 0, Trt = 1,” the estimated linear predictor is $\eta_{01} = \eta_0 + \tau_1 = 0.3880 + (-0.9278) = -0.5398$ (here η0 and η1 denote the two intercepts labeled η1 and η2 in Table 8.12) and, taking the inverse, gives $\pi_{01} = \Phi(-0.5398) = 0.2947$, that is, the estimated probability that a chicken will not develop a footpad lesion when receiving treatment 1. For “c = 1, Trt = 1,” $\eta_{11} = \eta_1 + \tau_1 = 2.2407 + (-0.9278) = 1.3129$, whose inverse value is 0.9054. This value is an estimate of $\pi_{01} + \pi_{11}$. From this value, we can obtain the probabilities that a chicken will develop a moderate lesion or a severe lesion. For a moderate lesion, $\pi_{11} = 0.9054 - \pi_{01} = 0.9054 - 0.2947 = 0.6107$, and, for a severe lesion, $\pi_{21} = 1 - 0.9054 = 0.0946$. Similarly, we can obtain the probabilities of the categories (c = 0, 1, 2) for the rest of the treatments.
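The probit calculation differs from the logit one only in the inverse link. The following Python sketch is not part of the original SAS analysis; it assumes the rounded estimates of Table 8.12 as inputs, evaluates Φ through the error function of the standard library, and sets the block effect to zero. All names are illustrative.

```python
import math

# Rounded estimates copied from Table 8.12, part (c) (assumed inputs).
PROBIT_INTERCEPTS = [0.3880, 2.2407]   # Without|Moderate and Moderate|Severe
PROBIT_TRT_EFFECTS = {1: -0.9278, 2: -0.1595, 3: -0.6459, 4: 0.0}


def std_normal_cdf(z):
    """Standard normal cumulative distribution function, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_category_probabilities(tau):
    """Return (P(Without), P(Moderate), P(Severe)) under the cumulative probit link."""
    cumulative = [std_normal_cdf(eta_c + tau) for eta_c in PROBIT_INTERCEPTS]
    return cumulative[0], cumulative[1] - cumulative[0], 1.0 - cumulative[1]


probit_probabilities = {
    trt: probit_category_probabilities(tau) for trt, tau in PROBIT_TRT_EFFECTS.items()
}

for trt, probs in probit_probabilities.items():
    print(f"Trt{trt}: " + ", ".join(f"{p:.4f}" for p in probs))
```

Comparing these values with those of the cumulative logit model gives a quick impression of how little the choice of link changes the estimated probabilities in this example.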
Similar to the previous example, adding the “ILINK” option to the end of the “ESTIMATE” command prompts GLIMMIX to estimate the values of the linear predictors (ηci) and their inverses, which are the cumulative probabilities $\Phi(\eta_{ci})$. Table 8.13 shows the estimates of the linear predictors as well as their inverse values (probabilities in this case).

Table 8.13 Estimates on the model scale (Estimate) and on the data scale (Mean) for footpad lesion categories in the multinomial cumulative probit model

| Label | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| c = 0, t = 1 | -0.5398 | 0.1100 | 794 | -4.91 | <0.0001 | 0.2947 | 0.03793 |
| c = 1, t = 1 | 1.3129 | 0.1208 | 794 | 10.87 | <0.0001 | 0.9054 | 0.02035 |
| c = 0, t = 2 | 0.2285 | 0.1105 | 794 | 2.07 | 0.0389 | 0.5904 | 0.04293 |
| c = 1, t = 2 | 2.0812 | 0.1345 | 794 | 15.47 | <0.0001 | 0.9813 | 0.006153 |
| c = 0, t = 3 | -0.2578 | 0.1085 | 794 | -2.38 | 0.0178 | 0.3983 | 0.04189 |
| c = 1, t = 3 | 1.5949 | 0.1258 | 794 | 12.68 | <0.0001 | 0.9446 | 0.01407 |
| c = 0, t = 4 | 0.3880 | 0.1124 | 794 | 3.45 | 0.0006 | 0.6510 | 0.04158 |
| c = 1, t = 4 | 2.2407 | 0.1375 | 794 | 16.29 | <0.0001 | 0.9875 | 0.004457 |

Table 8.13 shows the estimates of ηc + τi. For example, the estimated linear predictor for the probability that a chicken will not develop a footpad lesion under treatment 1, i.e., “c = 0, t = 1,” is η0 + τ1 = -0.5398. This result matches the values obtained from the fixed effects table (“Solutions for fixed effects”) previously shown. Taking the inverse of the link function gives $\pi_{01} = \Phi(-0.5398) = 0.2947$. This is the probability that a chicken will not develop a footpad lesion when receiving treatment 1, and it appears under the “Mean” column. Now, for the category “c = 1, t = 1,” the inverse of the link function is a probability of 0.9054, which results from the inverse value of the linear predictor η1 + τ1 = 1.3129. This value is the estimate of the cumulative probability $\pi_{01} + \pi_{11}$.
From this value, we can obtain the probability that a chicken presents a “Moderate” lesion when receiving treatment 1: since $\pi_{01} + \pi_{11} = 0.9054$, using the value of $\pi_{01}$ we obtain $\pi_{11} = 0.9054 - 0.2947 = 0.6107$ and $\pi_{21} = 1 - 0.9054 = 0.0946$. Following the same procedure, we can obtain the rest of the probabilities for each one of the categories (c = 0, 1, 2) and for the rest of the treatments (2–4).

8.5 Effect of Judges’ Experience on Canned Bean Quality Ratings

Canning quality is one of the most essential traits required in all new dry bean (Phaseolus vulgaris L.) varieties, and the selection for this trait is a critical part of bean breeding programs. Advanced lines that are candidates for release as varieties must be evaluated for canning quality for at least 3 years from samples grown at different locations. Quality is evaluated by a panel of judges with varying levels of experience in evaluating breeding lines for visual quality traits. A total of 264 bean breeding lines from 4 commercial classes were canned according to the procedures described by Walters et al. (1997). These included 62 white (navy), 65 black, 55 kidney, and 82 pinto bean lines plus control or “check” lines. The visual appearance of the processed beans was determined subjectively by a panel of 13 judges on a 7-point hedonic scale (1 = very undesirable, ..., 4 = neither desirable nor undesirable, ..., 7 = very desirable). Beans were presented to the panel of judges in random order at the same time.

Table 8.14 Frequency of ratings of different types of beans as a function of the bean-rating experience

| Rating | Black, < 5 years | Black, > 5 years | Kidney, < 5 years | Kidney, > 5 years | Navy, < 5 years | Navy, > 5 years | Pinto, < 5 years | Pinto, > 5 years |
|---|---|---|---|---|---|---|---|---|
| 1 | 13 | 32 | 7 | 10 | 10 | 22 | 13 | 2 |
| 2 | 91 | 78 | 32 | 31 | 56 | 51 | 29 | 17 |
| 3 | 123 | 124 | 136 | 96 | 84 | 107 | 91 | 68 |
| 4 | 72 | 122 | 101 | 104 | 84 | 98 | 109 | 124 |
| 5 | 24 | 31 | 47 | 71 | 51 | 52 | 60 | 109 |
| 6 | 2 | 3 | 6 | 18 | 24 | 37 | 25 | 78 |
| 7 | 0 | 0 | 1 | 0 | 1 | 5 | 1 | 12 |
Before evaluating the samples, all judges were shown examples of samples rated as satisfactory. There is concern that certain judges, due to lack of experience, may not be able to correctly score the canned samples. From attribute-based product evaluations, inferences about the effects of experience can be drawn from the psychology literature (Wallsten and Budescu 1981). Prior to the bean canning quality rating experiment, it was postulated not only that less experienced judges rate more severely than more experienced judges but also that experience should have little or no effect on white beans, for which the canning procedure was developed. Judges are stratified for the purpose of analysis by experience (less than 5 years, greater than 5 years). Counts by canning quality, judge experience, and bean class are listed in Table 8.14.

The link functions for the cumulative logit model for describing a variable with C categories are as follows:

$$\eta_{(1)ij} = \log\left(\frac{\pi_{1ij}}{1-\pi_{1ij}}\right) = \eta_1 + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

$$\eta_{(2)ij} = \log\left(\frac{\pi_{1ij}+\pi_{2ij}}{1-(\pi_{1ij}+\pi_{2ij})}\right) = \eta_2 + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

$$\vdots$$

$$\eta_{(C-1)ij} = \log\left(\frac{\pi_{1ij}+\pi_{2ij}+\cdots+\pi_{(C-1)ij}}{1-(\pi_{1ij}+\pi_{2ij}+\cdots+\pi_{(C-1)ij})}\right) = \eta_{C-1} + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

The components of the GLMM with an ordinal multinomial response are as follows:

Distributions: $y_{1ij}, y_{2ij}, y_{3ij}, y_{4ij}, y_{5ij}, y_{6ij}, y_{7ij} \sim \text{Multinomial}(N_{ij}, \pi_{1ij}, \pi_{2ij}, \pi_{3ij}, \pi_{4ij}, \pi_{5ij}, \pi_{6ij}, \pi_{7ij})$, where $y_{1ij}, \ldots, y_{7ij}$ are the observed frequencies of the responses in each category c of the hedonic scale (1 = very undesirable, ..., 4 = neither desirable nor undesirable, ..., 7 = very desirable).
Linear predictor: $\eta_{(c)ij} = \eta_c + \alpha_i + \beta_j + (\alpha\beta)_{ij}$, where $\eta_{(c)ij}$ is the cth link (c = 1, 2, ..., 6) for bean class i and judge’s experience j; $\eta_c$ is the intercept for the cth link; $\alpha_i$ is the fixed effect due to the ith bean class; $\beta_j$ is the fixed effect due to the jth level of judge experience; and $(\alpha\beta)_{ij}$ is the fixed effect due to the interaction between bean class and judge experience. The link functions for each category are as follows:

$$\log\left(\frac{\pi_{1ij}}{1-\pi_{1ij}}\right) = \eta_{(1)ij}$$

$$\log\left(\frac{\pi_{1ij}+\pi_{2ij}}{1-(\pi_{1ij}+\pi_{2ij})}\right) = \eta_{(2)ij}$$

$$\log\left(\frac{\pi_{1ij}+\pi_{2ij}+\pi_{3ij}}{1-(\pi_{1ij}+\pi_{2ij}+\pi_{3ij})}\right) = \eta_{(3)ij}$$

$$\log\left(\frac{\pi_{1ij}+\cdots+\pi_{4ij}}{1-(\pi_{1ij}+\cdots+\pi_{4ij})}\right) = \eta_{(4)ij}$$

$$\log\left(\frac{\pi_{1ij}+\cdots+\pi_{5ij}}{1-(\pi_{1ij}+\cdots+\pi_{5ij})}\right) = \eta_{(5)ij}$$

$$\log\left(\frac{\pi_{1ij}+\cdots+\pi_{6ij}}{1-(\pi_{1ij}+\cdots+\pi_{6ij})}\right) = \eta_{(6)ij}$$

The following GLIMMIX commands fit a cumulative logit model with an ordinal multinomial response.

    proc glimmix data=beans;
       class Class Exper;
       model cal(order=data) = Exper|Class / dist=multinomial link=clogit
             solution oddsratio;
       /* Experience effect within each bean class. Interaction cells are ordered */
       /* Black*1 Black*2 Kidney*1 Kidney*2 Navy*1 Navy*2 Pinto*1 Pinto*2          */
       contrast 'Effect of experience on black beans'  exper 1 -1 class*exper 1 -1 0 0 0 0 0 0;
       contrast 'Effect of experience on kidney beans' exper 1 -1 class*exper 0 0 1 -1 0 0 0 0;
       contrast 'Effect of experience on navy beans'   exper 1 -1 class*exper 0 0 0 0 1 -1 0 0;
       contrast 'Effect of experience on pinto beans'  exper 1 -1 class*exper 0 0 0 0 0 0 1 -1;
       /* Cumulative logits: 6 intercepts, 4 classes, 2 experience levels, 8 cells */
       estimate 'Black, < 5 years, Rating = 1'   intercept 1 0 0 0 0 0 class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0 / ilink;
       estimate 'Black, < 5 years, Rating <= 2'  intercept 0 1 0 0 0 0 class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0 / ilink;
       estimate 'Black, < 5 years, Rating <= 3'  intercept 0 0 1 0 0 0 class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0 / ilink;
       estimate 'Black, < 5 years, Rating <= 4'  intercept 0 0 0 1 0 0 class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0 / ilink;
       estimate 'Black, < 5 years, Rating <= 5'  intercept 0 0 0 0 1 0 class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0 / ilink;
       estimate 'Black, < 5 years, Rating <= 6'  intercept 0 0 0 0 0 1 class 1 0 0 0 exper 1 0 class*exper 1 0 0 0 0 0 0 0 / ilink;
       estimate 'Black, > 5 years, Rating = 1'   intercept 1 0 0 0 0 0 class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0 / ilink;
       estimate 'Black, > 5 years, Rating <= 2'  intercept 0 1 0 0 0 0 class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0 / ilink;
       estimate 'Black, > 5 years, Rating <= 3'  intercept 0 0 1 0 0 0 class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0 / ilink;
       estimate 'Black, > 5 years, Rating <= 4'  intercept 0 0 0 1 0 0 class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0 / ilink;
       estimate 'Black, > 5 years, Rating <= 5'  intercept 0 0 0 0 1 0 class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0 / ilink;
       estimate 'Black, > 5 years, Rating <= 6'  intercept 0 0 0 0 0 1 class 1 0 0 0 exper 0 1 class*exper 0 1 0 0 0 0 0 0 / ilink;
       estimate 'Kidney, < 5 years, Rating = 1'  intercept 1 0 0 0 0 0 class 0 1 0 0 exper 1 0 class*exper 0 0 1 0 0 0 0 0 / ilink;
       estimate 'Kidney, < 5 years, Rating <= 2' intercept 0 1 0 0 0 0 class 0 1 0 0 exper 1 0 class*exper 0 0 1 0 0 0 0 0 / ilink;
       estimate 'Kidney, < 5 years, Rating <= 3' intercept 0 0 1 0 0 0 class 0 1 0 0 exper 1 0 class*exper 0 0 1 0 0 0 0 0 / ilink;
       estimate 'Kidney, < 5 years, Rating <= 4' intercept 0 0 0 1 0 0 class 0 1 0 0 exper 1 0 class*exper 0 0 1 0 0 0 0 0 / ilink;
       estimate 'Kidney, < 5 years, Rating <= 5' intercept 0 0 0 0 1 0 class 0 1 0 0 exper 1 0 class*exper 0 0 1 0 0 0 0 0 / ilink;
       estimate 'Kidney, < 5 years, Rating <= 6' intercept 0 0 0 0 0 1 class 0 1 0 0 exper 1 0 class*exper 0 0 1 0 0 0 0 0 / ilink;
       estimate 'Kidney, > 5 years, Rating = 1'  intercept 1 0 0 0 0 0 class 0 1 0 0 exper 0 1 class*exper 0 0 0 1 0 0 0 0 / ilink;
       estimate 'Kidney, > 5 years, Rating <= 2' intercept 0 1 0 0 0 0 class 0 1 0 0 exper 0 1 class*exper 0 0 0 1 0 0 0 0 / ilink;
       estimate 'Kidney, > 5 years, Rating <= 3' intercept 0 0 1 0 0 0 class 0 1 0 0 exper 0 1 class*exper 0 0 0 1 0 0 0 0 / ilink;
       estimate 'Kidney, > 5 years, Rating <= 4' intercept 0 0 0 1 0 0 class 0 1 0 0 exper 0 1 class*exper 0 0 0 1 0 0 0 0 / ilink;
       estimate 'Kidney, > 5 years, Rating <= 5' intercept 0 0 0 0 1 0 class 0 1 0 0 exper 0 1 class*exper 0 0 0 1 0 0 0 0 / ilink;
       estimate 'Kidney, > 5 years, Rating <= 6' intercept 0 0 0 0 0 1 class 0 1 0 0 exper 0 1 class*exper 0 0 0 1 0 0 0 0 / ilink;
       estimate 'Navy, < 5 years, Rating = 1'    intercept 1 0 0 0 0 0 class 0 0 1 0 exper 1 0 class*exper 0 0 0 0 1 0 0 0 / ilink;
       estimate 'Navy, < 5 years, Rating <= 2'   intercept 0 1 0 0 0 0 class 0 0 1 0 exper 1 0 class*exper 0 0 0 0 1 0 0 0 / ilink;
       estimate 'Navy, < 5 years, Rating <= 3'   intercept 0 0 1 0 0 0 class 0 0 1 0 exper 1 0 class*exper 0 0 0 0 1 0 0 0 / ilink;
       estimate 'Navy, < 5 years, Rating <= 4'   intercept 0 0 0 1 0 0 class 0 0 1 0 exper 1 0 class*exper 0 0 0 0 1 0 0 0 / ilink;
       estimate 'Navy, < 5 years, Rating <= 5'   intercept 0 0 0 0 1 0 class 0 0 1 0 exper 1 0 class*exper 0 0 0 0 1 0 0 0 / ilink;
       estimate 'Navy, < 5 years, Rating <= 6'   intercept 0 0 0 0 0 1 class 0 0 1 0 exper 1 0 class*exper 0 0 0 0 1 0 0 0 / ilink;
       estimate 'Navy, > 5 years, Rating = 1'    intercept 1 0 0 0 0 0 class 0 0 1 0 exper 0 1 class*exper 0 0 0 0 0 1 0 0 / ilink;
       estimate 'Navy, > 5 years, Rating <= 2'   intercept 0 1 0 0 0 0 class 0 0 1 0 exper 0 1 class*exper 0 0 0 0 0 1 0 0 / ilink;
       estimate 'Navy, > 5 years, Rating <= 3'   intercept 0 0 1 0 0 0 class 0 0 1 0 exper 0 1 class*exper 0 0 0 0 0 1 0 0 / ilink;
       estimate 'Navy, > 5 years, Rating <= 4'   intercept 0 0 0 1 0 0 class 0 0 1 0 exper 0 1 class*exper 0 0 0 0 0 1 0 0 / ilink;
       estimate 'Navy, > 5 years, Rating <= 5'   intercept 0 0 0 0 1 0 class 0 0 1 0 exper 0 1 class*exper 0 0 0 0 0 1 0 0 / ilink;
       estimate 'Navy, > 5 years, Rating <= 6'   intercept 0 0 0 0 0 1 class 0 0 1 0 exper 0 1 class*exper 0 0 0 0 0 1 0 0 / ilink;
       estimate 'Pinto, < 5 years, Rating = 1'   intercept 1 0 0 0 0 0 class 0 0 0 1 exper 1 0 class*exper 0 0 0 0 0 0 1 0 / ilink;
       estimate 'Pinto, < 5 years, Rating <= 2'  intercept 0 1 0 0 0 0 class 0 0 0 1 exper 1 0 class*exper 0 0 0 0 0 0 1 0 / ilink;
       estimate 'Pinto, < 5 years, Rating <= 3'  intercept 0 0 1 0 0 0 class 0 0 0 1 exper 1 0 class*exper 0 0 0 0 0 0 1 0 / ilink;
       estimate 'Pinto, < 5 years, Rating <= 4'  intercept 0 0 0 1 0 0 class 0 0 0 1 exper 1 0 class*exper 0 0 0 0 0 0 1 0 / ilink;
       estimate 'Pinto, < 5 years, Rating <= 5'  intercept 0 0 0 0 1 0 class 0 0 0 1 exper 1 0 class*exper 0 0 0 0 0 0 1 0 / ilink;
       estimate 'Pinto, < 5 years, Rating <= 6'  intercept 0 0 0 0 0 1 class 0 0 0 1 exper 1 0 class*exper 0 0 0 0 0 0 1 0 / ilink;
       estimate 'Pinto, > 5 years, Rating = 1'   intercept 1 0 0 0 0 0 class 0 0 0 1 exper 0 1 class*exper 0 0 0 0 0 0 0 1 / ilink;
       estimate 'Pinto, > 5 years, Rating <= 2'  intercept 0 1 0 0 0 0 class 0 0 0 1 exper 0 1 class*exper 0 0 0 0 0 0 0 1 / ilink;
       estimate 'Pinto, > 5 years, Rating <= 3'  intercept 0 0 1 0 0 0 class 0 0 0 1 exper 0 1 class*exper 0 0 0 0 0 0 0 1 / ilink;
       estimate 'Pinto, > 5 years, Rating <= 4'  intercept 0 0 0 1 0 0 class 0 0 0 1 exper 0 1 class*exper 0 0 0 0 0 0 0 1 / ilink;
       estimate 'Pinto, > 5 years, Rating <= 5'  intercept 0 0 0 0 1 0 class 0 0 0 1 exper 0 1 class*exper 0 0 0 0 0 0 0 1 / ilink;
       estimate 'Pinto, > 5 years, Rating <= 6'  intercept 0 0 0 0 0 1 class 0 0 0 1 exper 0 1 class*exper 0 0 0 0 0 0 0 1 / ilink;
       freq y;
    run;

Part of the results is shown below.

Table 8.15 Fixed effects hypothesis testing in the multinomial cumulative logit model (Type III tests of fixed effects)

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Class | 1 | 2779 | 36.19 | <0.0001 |
| Exper | 3 | 2779 | 85.20 | <0.0001 |
| Class*Exper | 3 | 2779 | 10.13 | <0.0001 |

Table 8.16 Hypothesis testing in quality assessment (contrasts)

| Label | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Effect of experience on black beans | 1 | 2779 | 2.77 | 0.0961 |
| Effect of experience on kidney beans | 1 | 2779 | 7.86 | 0.0051 |
| Effect of experience on navy beans | 1 | 2779 | 0.02 | 0.8822 |
| Effect of experience on pinto beans | 1 | 2779 | 58.06 | <0.0001 |

The results of the analysis of variance show that the class of bean (Class), the experience of the evaluator (Exper), and the interaction between class and experience (Class×Exper) have significant effects on bean canning scores (P < 0.0001). That is, the result of comparing judges with more and fewer years of experience depends on the class of beans (Table 8.15). The contrasts address this interaction (Table 8.16). For each bean class, the hypothesis tested is H0: π(class of bean, < 5 years of experience) = π(class of bean, > 5 years of experience). The results show that judges with more than 5 years of experience differ from those with less than 5 years of experience in evaluating the quality of canned kidney beans (P = 0.0051) and pinto beans (P < 0.0001), but not black beans (P = 0.0961) or navy beans (P = 0.8822) (Table 8.16). With the “solution” option in the model specification, the fixed parameter estimates table shows the solution of the fixed effects parameters under maximum likelihood.
In this table, we can observe the values of the estimated intercepts: $\hat{\eta}_1 = -4.6421$ defines the boundary between the categories “1 = highly undesirable” and “2 = moderately undesirable,” $\hat{\eta}_2 = -2.9316$ defines the boundary between the categories “2 = moderately undesirable” and “3 = slightly undesirable,” $\hat{\eta}_3 = -1.3995$ defines the boundary between the categories “3 = slightly undesirable” and “4 = neither undesirable nor desirable,” and so on. The estimated effects of bean type ($\alpha_i$), evaluator experience ($\beta_j$), and their interaction ($(\alpha\beta)_{ij}$) are shown in Table 8.17. From these values, we can estimate the linear predictors for each of the categories. For example, for canned black beans evaluated by an inexperienced judge, the linear predictors for the categories “1 = highly undesirable,” “2 = moderately undesirable,” “3 = slightly undesirable,” and “4 = neither undesirable nor desirable” are

$$\hat{\eta}_{111} = \hat{\eta}_1 + \hat{\alpha}_1 + \hat{\beta}_1 + (\widehat{\alpha\beta})_{11} = -4.6421 + 1.9670 + 1.0284 - 0.8066 = -2.4533,$$

$$\hat{\eta}_{211} = \hat{\eta}_2 + \hat{\alpha}_1 + \hat{\beta}_1 + (\widehat{\alpha\beta})_{11} = -2.9316 + 1.9670 + 1.0284 - 0.8066 = -0.7428,$$

$$\hat{\eta}_{311} = \hat{\eta}_3 + \hat{\alpha}_1 + \hat{\beta}_1 + (\widehat{\alpha\beta})_{11} = -1.3995 + 1.9670 + 1.0284 - 0.8066 = 0.7893,$$

$$\hat{\eta}_{411} = \hat{\eta}_4 + \hat{\alpha}_1 + \hat{\beta}_1 + (\widehat{\alpha\beta})_{11} = 0.004287 + 1.9670 + 1.0284 - 0.8066 = 2.1931.$$

This is how the other categories are calculated for each type of bean and assessor (Table 8.17).

Table 8.17 Maximum likelihood estimation of the estimated parameters in the fixed effects solution of canned bean quality ratings in the multinomial cumulative logit model (Fixed parameter estimates)

| Effect | Class | Exper / Cal | Parameter | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|---|---|---|
| Intercept | | 1 | η1 | -4.6421 | 0.1363 | 2779 | -34.05 | <0.0001 |
| Intercept | | 2 | η2 | -2.9316 | 0.1057 | 2779 | -27.74 | <0.0001 |
| Intercept | | 3 | η3 | -1.3995 | 0.09643 | 2779 | -14.51 | <0.0001 |
| Intercept | | 4 | η4 | 0.004287 | 0.09230 | 2779 | 0.05 | 0.9630 |
| Intercept | | 5 | η5 | 1.4191 | 0.1026 | 2779 | 13.84 | <0.0001 |
| Intercept | | 6 | η6 | 3.8925 | 0.2346 | 2779 | 16.59 | <0.0001 |
| Class | Black | | α1 | 1.9670 | 0.1318 | 2779 | 14.93 | <0.0001 |
| Class | Kidney | | α2 | 1.0472 | 0.1342 | 2779 | 7.80 | <0.0001 |
| Class | Navy | | α3 | 1.3076 | 0.1345 | 2779 | 9.72 | <0.0001 |
| Class | Pinto | | α4 | 0 | . | . | . | . |
| Exper | | 1 | β1 | 1.0284 | 0.1350 | 2779 | 7.62 | <0.0001 |
| Exper | | 2 | β2 | 0 | . | . | . | . |
| Class*Exper | Black | 1 | αβ11 | -0.8066 | 0.1894 | 2779 | -4.26 | <0.0001 |
| Class*Exper | Black | 2 | αβ12 | 0 | . | . | . | . |
| Class*Exper | Kidney | 1 | αβ21 | -0.6457 | 0.1912 | 2779 | -3.38 | 0.0007 |
| Class*Exper | Kidney | 2 | αβ22 | 0 | . | . | . | . |
| Class*Exper | Navy | 1 | αβ31 | -1.0072 | 0.1969 | 2779 | -5.12 | <0.0001 |
| Class*Exper | Navy | 2 | αβ32 | 0 | . | . | . | . |
| Class*Exper | Pinto | 1 | αβ41 | 0 | . | . | . | . |
| Class*Exper | Pinto | 2 | αβ42 | 0 | . | . | . | . |

The results of Table 8.18 were obtained with the “estimate” command in conjunction with the “ilink” option, which prompts GLIMMIX to compute the values of the linear predictors, $\hat{\eta}_{cij}$, tabulated under the “Estimate” column, and the corresponding estimated (cumulative) probabilities for all categories of each treatment except the reference category, tabulated under the “Mean” column.
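The hand calculation above can be reproduced numerically. The following Python sketch is not part of the GLIMMIX analysis; it simply copies the estimates from Table 8.17, and the variable names (`intercepts`, `alpha_black`, `beta_lt5`, `ab_black_lt5`) are chosen here for illustration. It assembles the six linear predictors for black beans rated by a judge with less than 5 years of experience and applies the inverse logit, which yields cumulative probabilities of the form P(score ≤ c).

```python
import math

# Estimates copied from Table 8.17 (fixed-effects solution).
intercepts = [-4.6421, -2.9316, -1.3995, 0.004287, 1.4191, 3.8925]
alpha_black = 1.9670      # class effect, black beans
beta_lt5 = 1.0284         # experience effect, < 5 years
ab_black_lt5 = -0.8066    # class-by-experience interaction


def inverse_logit(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# The treatment shift is common to all six cumulative logits.
shift = alpha_black + beta_lt5 + ab_black_lt5
etas = [eta_c + shift for eta_c in intercepts]
cumulative = [inverse_logit(eta) for eta in etas]

for c, (eta, p) in enumerate(zip(etas, cumulative), start=1):
    print(f"score <= {c}: eta = {eta:.4f}, cumulative probability = {p:.4f}")
```

Because the inputs are rounded to four decimals, the last digit can differ slightly from the values GLIMMIX reports.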
From Table 8.18 (“Estimates”), we can obtain the probabilities reported under the “Mean” column. An inexperienced (<5 years) panelist (judge) would rate canned black beans as category 1 (1 = highly undesirable) with a probability of $\hat{\pi}_{111} = 0.08$, compared to an experienced panelist (>5 years), who would do so with a probability of $\hat{\pi}_{112} = 0.0645$. To calculate the probability that a judge with less than 5 years of experience would assign a rating of 2 (2 = moderately undesirable) to canned black beans, we start from the cumulative probability of 0.3224, which corresponds to $\pi_{211} + \pi_{111}$, and obtain $\hat{\pi}_{211} = 0.3224 - \hat{\pi}_{111} = 0.3224 - 0.08 = 0.24$. On the other hand, for a judge with experience (>5 years), the probability of assigning a score of 2 to canned black beans is $\hat{\pi}_{212} = 0.2760 - \hat{\pi}_{112} = 0.2760 - 0.06446 = 0.2115$. Following the same procedure, the other probabilities for the rest of the categories are obtained. The probabilities calculated for each of the categories are shown in Table 8.19 and can be seen in Fig. 8.2.

Table 8.18 Estimates on the model scale (Estimate) and on the data scale (Mean) based on judges’ experience in canned bean quality ratings in the multinomial cumulative logit model (Estimates)

| Class | Experience | Score | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|---|
| Black | <5 years | = 1 | -2.4533 | 0.1292 | 2779 | -18.99 | <0.0001 | 0.07920 | 0.009419 |
| Black | <5 years | ≤ 2 | -0.7428 | 0.1004 | 2779 | -7.40 | <0.0001 | 0.3224 | 0.02194 |
| Black | <5 years | ≤ 3 | 0.7893 | 0.1008 | 2779 | 7.83 | <0.0001 | 0.6877 | 0.02164 |
| Black | <5 years | ≤ 4 | 2.1931 | 0.1076 | 2779 | 20.38 | <0.0001 | 0.8996 | 0.009716 |
| Black | <5 years | ≤ 5 | 3.6079 | 0.1238 | 2779 | 29.15 | <0.0001 | 0.9736 | 0.003180 |
| Black | <5 years | ≤ 6 | 6.0814 | 0.2467 | 2779 | 24.65 | <0.0001 | 0.9977 | 0.000561 |
| Black | >5 years | = 1 | -2.6751 | 0.1264 | 2779 | -21.17 | <0.0001 | 0.06446 | 0.007621 |
| Black | >5 years | ≤ 2 | -0.9646 | 0.09577 | 2779 | -10.07 | <0.0001 | 0.2760 | 0.01913 |
| Black | >5 years | ≤ 3 | 0.5675 | 0.09314 | 2779 | 6.09 | <0.0001 | 0.6382 | 0.02151 |
| Black | >5 years | ≤ 4 | 1.9713 | 0.09967 | 2779 | 19.78 | <0.0001 | 0.8778 | 0.01069 |
| Black | >5 years | ≤ 5 | 3.3861 | 0.1170 | 2779 | 28.95 | <0.0001 | 0.9673 | 0.003704 |
| Black | >5 years | ≤ 6 | 5.8595 | 0.2434 | 2779 | 24.07 | <0.0001 | 0.9972 | 0.000690 |
| Kidney | <5 years | = 1 | -3.2122 | 0.1333 | 2779 | -24.11 | <0.0001 | 0.03871 | 0.004958 |
| Kidney | <5 years | ≤ 2 | -1.5017 | 0.1018 | 2779 | -14.74 | <0.0001 | 0.1822 | 0.01517 |
| Kidney | <5 years | ≤ 3 | 0.03040 | 0.09608 | 2779 | 0.32 | 0.7517 | 0.5076 | 0.02401 |
| Kidney | <5 years | ≤ 4 | 1.4342 | 0.1011 | 2779 | 14.19 | <0.0001 | 0.8076 | 0.01571 |
| Kidney | <5 years | ≤ 5 | 2.8490 | 0.1178 | 2779 | 24.18 | <0.0001 | 0.9453 | 0.006096 |
| Kidney | <5 years | ≤ 6 | 5.3225 | 0.2438 | 2779 | 21.83 | <0.0001 | 0.9951 | 0.001179 |
| Kidney | >5 years | = 1 | -3.5949 | 0.1372 | 2779 | -26.20 | <0.0001 | 0.02673 | 0.003569 |
| Kidney | >5 years | ≤ 2 | -1.8844 | 0.1071 | 2779 | -17.60 | <0.0001 | 0.1319 | 0.01226 |
| Kidney | >5 years | ≤ 3 | -0.3523 | 0.09988 | 2779 | -3.53 | 0.0004 | 0.4128 | 0.02421 |
| Kidney | >5 years | ≤ 4 | 1.0515 | 0.1020 | 2779 | 10.31 | <0.0001 | 0.7411 | 0.01957 |
| Kidney | >5 years | ≤ 5 | 2.4663 | 0.1176 | 2779 | 20.98 | <0.0001 | 0.9217 | 0.008480 |
| Kidney | >5 years | ≤ 6 | 4.9397 | 0.2436 | 2779 | 20.27 | <0.0001 | 0.9929 | 0.001719 |
| Navy | <5 years | = 1 | -3.3133 | 0.1404 | 2779 | -23.60 | <0.0001 | 0.03512 | 0.004757 |
| Navy | <5 years | ≤ 2 | -1.6027 | 0.1119 | 2779 | -14.33 | <0.0001 | 0.1676 | 0.01561 |
| Navy | <5 years | ≤ 3 | -0.07066 | 0.1068 | 2779 | -0.66 | 0.5084 | 0.4823 | 0.02667 |
| Navy | <5 years | ≤ 4 | 1.3332 | 0.1102 | 2779 | 12.10 | <0.0001 | 0.7914 | 0.01820 |
| Navy | <5 years | ≤ 5 | 2.7479 | 0.1251 | 2779 | 21.97 | <0.0001 | 0.9398 | 0.007077 |
| Navy | <5 years | ≤ 6 | 5.2214 | 0.2473 | 2779 | 21.12 | <0.0001 | 0.9946 | 0.001321 |
| Navy | >5 years | = 1 | -3.3345 | 0.1348 | 2779 | -24.74 | <0.0001 | 0.03441 | 0.004479 |
| Navy | >5 years | ≤ 2 | -1.6240 | 0.1047 | 2779 | -15.51 | <0.0001 | 0.1647 | 0.01440 |
| Navy | >5 years | ≤ 3 | -0.09190 | 0.09897 | 2779 | -0.93 | 0.3532 | 0.4770 | 0.02469 |
| Navy | >5 years | ≤ 4 | 1.3119 | 0.1028 | 2779 | 12.76 | <0.0001 | 0.7878 | 0.01719 |
| Navy | >5 years | ≤ 5 | 2.7267 | 0.1186 | 2779 | 22.99 | <0.0001 | 0.9386 | 0.006836 |
| Navy | >5 years | ≤ 6 | 5.2002 | 0.2439 | 2779 | 21.32 | <0.0001 | 0.9945 | 0.001331 |
| Pinto | <5 years | = 1 | -3.6137 | 0.1380 | 2779 | -26.19 | <0.0001 | 0.02624 | 0.003527 |
| Pinto | <5 years | ≤ 2 | -1.9032 | 0.1081 | 2779 | -17.61 | <0.0001 | 0.1297 | 0.01221 |
| Pinto | <5 years | ≤ 3 | -0.3711 | 0.1008 | 2779 | -3.68 | 0.0002 | 0.4083 | 0.02436 |
| Pinto | <5 years | ≤ 4 | 1.0327 | 0.1030 | 2779 | 10.03 | <0.0001 | 0.7374 | 0.01994 |
| Pinto | <5 years | ≤ 5 | 2.4475 | 0.1184 | 2779 | 20.67 | <0.0001 | 0.9204 | 0.008678 |
| Pinto | <5 years | ≤ 6 | 4.9210 | 0.2439 | 2779 | 20.17 | <0.0001 | 0.9928 | 0.001753 |
| Pinto | >5 years | = 1 | -4.6421 | 0.1363 | 2779 | -34.05 | <0.0001 | 0.009545 | 0.001289 |
| Pinto | >5 years | ≤ 2 | -2.9316 | 0.1057 | 2779 | -27.74 | <0.0001 | 0.05061 | 0.005078 |
| Pinto | >5 years | ≤ 3 | -1.3995 | 0.09643 | 2779 | -14.51 | <0.0001 | 0.1979 | 0.01531 |
| Pinto | >5 years | ≤ 4 | 0.004287 | 0.09230 | 2779 | 0.05 | 0.9630 | 0.5011 | 0.02307 |
| Pinto | >5 years | ≤ 5 | 1.4191 | 0.1026 | 2779 | 13.84 | <0.0001 | 0.8052 | 0.01609 |
| Pinto | >5 years | ≤ 6 | 3.8925 | 0.2346 | 2779 | 16.59 | <0.0001 | 0.9800 | 0.004595 |

8.6 Generalized Logit Models: Nominal Response Variables

In a model with unordered data, the polytomous response variable does not have an ordered structure. Two classes of models, generalized logit models and conditional logit models, can be used with nominal response data.
A generalized logit model consists of a combination of several binary logits estimated simultaneously. A logit model is the simplest and best-known probabilistic choice model. However, there are problems in making use of a multinomial logit model because of its inflexibility. A generalized logit model is essentially more flexible than the traditional multinomial cumulative logit model; it shows the same flexibility as a probit model but is much more tractable. Like cumulative logit and probit models, a generalized logit model has C − 1 link functions, where C denotes the number of response categories.

Table 8.19 Probabilities calculated for each of the canned bean grades

| Bean class | Judge | Cal1 | Cal2 | Cal3 | Cal4 | Cal5 | Cal6 | Cal7 |
|---|---|---|---|---|---|---|---|---|
| Black | J1 | 0.08 | 0.24 | 0.37 | 0.21 | 0.07 | 0.02 | 0.00 |
| Black | J2 | 0.06 | 0.21 | 0.36 | 0.24 | 0.09 | 0.03 | 0.00 |
| Kidney | J1 | 0.04 | 0.14 | 0.33 | 0.30 | 0.14 | 0.05 | 0.00 |
| Kidney | J2 | 0.03 | 0.11 | 0.28 | 0.33 | 0.18 | 0.07 | 0.01 |
| Navy | J1 | 0.04 | 0.13 | 0.31 | 0.31 | 0.15 | 0.05 | 0.01 |
| Navy | J2 | 0.03 | 0.13 | 0.31 | 0.31 | 0.15 | 0.06 | 0.01 |
| Pinto | J1 | 0.03 | 0.10 | 0.28 | 0.33 | 0.18 | 0.07 | 0.01 |
| Pinto | J2 | 0.01 | 0.04 | 0.15 | 0.30 | 0.30 | 0.17 | 0.02 |

Cal1 = rating 1, Cal2 = rating 2, ..., Cal7 = rating 7; J1 = panelist with less than 5 years’ experience, and J2 = panelist with more than 5 years’ experience

Fig. 8.2 Estimated probabilities for each category of the acceptability of canned beans, according to the experience of the panelist (judge). [Stacked bar chart: one bar per judge group (J1, J2) within each bean class (Black, Kidney, Navy, Pinto); vertical axis: probability for each category; legend: Cal1 to Cal7.]

Moreover, in this class of models, a category is first defined as the reference category.
This may be arbitrary or it may make compelling logical sense in the study to designate a particular response category as the reference. In practice and throughout the analysis, the category used as the reference is irrelevant, as long as we are consistent about it. For example, if C is used as the reference category, then the generalized logits are defined as shown below:

$$\eta_1 = \log\left(\frac{\pi_{1ij}}{\pi_{Cij}}\right) = \alpha_1 + X\beta_1 + Zb_1$$

$$\eta_2 = \log\left(\frac{\pi_{2ij}}{\pi_{Cij}}\right) = \alpha_2 + X\beta_2 + Zb_2$$

$$\vdots$$

$$\eta_{C-1} = \log\left(\frac{\pi_{(C-1)ij}}{\pi_{Cij}}\right) = \alpha_{C-1} + X\beta_{C-1} + Zb_{C-1}$$

Given the different effects in the models, the intercepts (α's), β's, and b's vary across the pairs of response variable categories for each link function. Using algebra, it can be shown that the general form of the inverse of the link functions is given by

$$\pi_c = \frac{e^{\eta_c}}{1 + \sum_{k=1}^{C-1} e^{\eta_k}}, \quad c = 1, 2, \ldots, C-1.$$

Once $\pi_1, \pi_2, \ldots, \pi_{C-1}$ are estimated, the probability of the reference category is estimated as

$$\pi_C = 1 - \sum_{c=1}^{C-1} \pi_c.$$

8.6.1 CRDs with a Nominal Multinomial Response

In practice, cumulative models are used for analyzing ordinal data and generalized logit models for nominal data. Returning to Example 8.3.1, we will now implement the analysis of a generalized logit model. This model relaxes the assumption of proportionality, but it is less parsimonious than the proportional odds model, since it fits C − 1 binary logit models, where C is the number of categories of the response variable. The linear predictor and distribution are the same as in the previous example.
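For reference, the algebra behind the inverse link of the generalized logit model is short. The sketch below drops the subscripts i and j and uses only the definition of the generalized logits and the constraint that the C category probabilities sum to one.

```latex
\begin{aligned}
\eta_c = \log\frac{\pi_c}{\pi_C}
  &\;\Longrightarrow\; \pi_c = \pi_C\, e^{\eta_c}, \qquad c = 1, \ldots, C-1, \\
\sum_{c=1}^{C} \pi_c = 1
  &\;\Longrightarrow\; \pi_C \Bigl( 1 + \sum_{k=1}^{C-1} e^{\eta_k} \Bigr) = 1
  \;\Longrightarrow\; \pi_C = \frac{1}{1 + \sum_{k=1}^{C-1} e^{\eta_k}}, \\
\text{so that}\quad
\pi_c &= \frac{e^{\eta_c}}{1 + \sum_{k=1}^{C-1} e^{\eta_k}}, \qquad c = 1, \ldots, C-1 .
\end{aligned}
```

The reference category therefore behaves as if its linear predictor were fixed at zero, which is why only C − 1 link functions are needed.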
The following GLIMMIX syntax implements the analysis of the generalized logit model:

proc glimmix data=chickens;
class trt block category;
model category(reference='severe') = trt / dist=Multinomial link=glogit oddsratio;
random intercept / subject=block solution group=category;
estimate 't=1' intercept 1 trt 1 0 0 0,
         't=2' intercept 1 trt 0 1 0 0,
         't=3' intercept 1 trt 0 0 1 0,
         't=4' intercept 1 trt 0 0 0 1 / ilink bycat;
freq y;
run;

Most of the syntax of the program has already been explained. The “reference=” option, which appears in the command where the model is defined, is new to this program and is used to designate the reference category. If the “reference=” option is not specified, GLIMMIX uses the last category in the dataset by default. Moreover, the “link = glogit” option prompts GLIMMIX to fit a generalized logit model. The “bycat” option in the “estimate” command is unique to the generalized logit model. Finally, the “ilink” option asks GLIMMIX to estimate all category probabilities for each treatment, except those for the reference category.

Table 8.20 Analysis of variance in the generalized multinomial logit model (Type III tests of fixed effects)

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 6 | 790 | 10.78 | <0.0001 |

Table 8.21 Maximum likelihood estimates on the model scale (Estimate) for footpad lesion level in the multinomial generalized logit model (Solutions for fixed effects)

| Effect | Category | Trt | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|---|---|
| Intercept | Without lesion | | 4.8525 | 1.0059 | 2 | 4.82 | 0.0404 |
| Intercept | Moderate lesion | | 4.2485 | 1.0071 | 2 | 4.22 | 0.0519 |
| trt | Without lesion | 1 | -3.8447 | 1.0330 | 790 | -3.72 | 0.0002 |
| trt | Moderate lesion | 1 | -2.6478 | 1.0327 | 790 | -2.56 | 0.0105 |
| trt | Without lesion | 2 | -1.1888 | 1.1618 | 790 | -1.02 | 0.3065 |
| trt | Moderate lesion | 2 | -0.9651 | 1.1662 | 790 | -0.83 | 0.4082 |
| trt | Without lesion | 3 | -2.7860 | 1.0585 | 790 | -2.63 | 0.0087 |
| trt | Moderate lesion | 3 | -1.8326 | 1.0598 | 790 | -1.73 | 0.0842 |
| trt | Without lesion | 4 | 0 | . | . | . | . |
| trt | Moderate lesion | 4 | 0 | . | . | . | . |
Part of the output is shown in Table 8.20. The fixed effects test shows that there are highly significant differences (P < 0.0001) in footpad lesion level between treatments. Unlike the cumulative logit model, in the generalized logit model, the estimates of the fixed effects (treatments), as well as the intercepts, are separate for each link function. For the estimation of linear predictors, we use the estimated values of Table 8.21 (“Solutions for fixed effects”). The estimated intercepts $\hat{\alpha}_1 = 4.8525$ and $\hat{\alpha}_2 = 4.2485$ are the log odds of the “Without” lesion category and of the “Moderate” lesion category, respectively, relative to the reference category (“Severe” lesion) under the reference treatment (treatment 4). For treatment 1, the estimated treatment effect ($\hat{\tau}_i$) for the “Without” lesion category is $\hat{\tau}_1 = -3.8447$, and for the “Moderate” lesion category, it is $\hat{\tau}_1 = -2.6478$. With these values, the linear predictors for the “Without” lesion and “Moderate” lesion categories under treatment 1 are $\hat{\eta}_{01} = 4.8525 - 3.8447 = 1.0077$ and $\hat{\eta}_{11} = 4.2485 - 2.6478 = 1.6007$, respectively (values as reported by GLIMMIX, which works with unrounded estimates).
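As a numerical check, the generalized logit inverse link can be applied to these two linear predictors. The Python sketch below is an illustration rather than GLIMMIX output: it copies the estimates from Table 8.21, uses dictionary keys chosen here for readability, and ignores the random block effects. It returns the three category probabilities for treatment 1 and the odds ratios of treatment 1 relative to treatment 4, which under this parameterization are the exponentiated treatment effects.

```python
import math

# Estimates copied from Table 8.21; "severe" is the reference category.
intercept = {"without": 4.8525, "moderate": 4.2485}
trt1_effect = {"without": -3.8447, "moderate": -2.6478}

# One linear predictor per non-reference category.
eta = {cat: intercept[cat] + trt1_effect[cat] for cat in intercept}

# Inverse generalized logit: exp(eta_c) / (1 + sum of exp(eta_k)).
denominator = 1.0 + sum(math.exp(value) for value in eta.values())
prob = {cat: math.exp(value) / denominator for cat, value in eta.items()}
prob["severe"] = 1.0 / denominator  # reference category

# Odds ratios of treatment 1 versus treatment 4 (whose effect is 0).
odds_ratio = {cat: math.exp(effect) for cat, effect in trt1_effect.items()}

for cat in ("without", "moderate", "severe"):
    print(f"P({cat} | trt 1) = {prob[cat]:.4f}")
for cat, value in odds_ratio.items():
    print(f"odds ratio ({cat} vs severe, trt 1 vs trt 4) = {value:.3f}")
```

The first two probabilities agree, up to rounding, with the “Mean” values 0.3150 and 0.5700 reported for treatment 1 in Table 8.22, and the three probabilities sum to one.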
The estimated probabilities for each of the categories (“Without” lesion and “Moderate” lesion) in each treatment, except for the reference category, are found under the “Mean” column of Table 8.22.

Table 8.22 Estimates on the model scale (“Estimate”) and on the data scale (“Mean”) for footpad lesion level observed in treatments in the multinomial generalized logit model (Estimates)

| Label | Category | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| t=1 | Without lesion | 1.0077 | 0.2515 | 790 | 4.01 | <0.0001 | 0.3150 | 0.03552 |
| t=1 | Moderate lesion | 1.6007 | 0.2286 | 790 | 7.00 | <0.0001 | 0.5700 | 0.03677 |
| t=2 | Without lesion | 3.6637 | 0.5881 | 790 | 6.23 | <0.0001 | 0.5850 | 0.03801 |
| t=2 | Moderate lesion | 3.2834 | 0.5881 | 790 | 5.58 | <0.0001 | 0.4000 | 0.03761 |
| t=3 | Without lesion | 2.0665 | 0.3414 | 790 | 6.05 | <0.0001 | 0.3929 | 0.03755 |
| t=3 | Moderate lesion | 2.4159 | 0.3300 | 790 | 7.32 | <0.0001 | 0.5573 | 0.03762 |
| t=4 | Without lesion | 4.8525 | 1.0059 | 790 | 4.82 | <0.0001 | 0.6433 | 0.03687 |
| t=4 | Moderate lesion | 4.2485 | 1.0071 | 790 | 4.22 | <0.0001 | 0.3517 | 0.03669 |

The probability that a chick has no footpad lesion when receiving treatment 1 is $\hat{\pi}_{01} = 0.315$, whereas the value 0.57 is the probability of observing a moderate lesion, $\hat{\pi}_{11} = 0.57$. Under the generalized logit link, the “Mean” column reports the probability of each category, not a cumulative probability; this can be verified from the linear predictors, since $e^{1.0077}/(1 + e^{1.0077} + e^{1.6007}) = 0.315$ and $e^{1.6007}/(1 + e^{1.0077} + e^{1.6007}) = 0.570$. From these probabilities, we can estimate the probability of observing a severe footpad lesion under treatment 1 as $\hat{\pi}_{21} = 1 - (0.315 + 0.57) = 0.115$. Following the same logic, we can estimate the reference category probabilities for the other treatments. Another important result is the odds ratio estimates, which are shown in Table 8.23. These odds ratios compare the odds for the labeled category to those for the reference category for treatments 1–3 relative to treatment 4. These odds ratio values can be derived from the estimated probabilities in each of the categories.
For example, under treatment 4, the probabilities that a chick presents no lesion and a moderate lesion are $\hat{\pi}_{04} = 0.6433$ and $\hat{\pi}_{14} = 0.3517$, respectively. From these probabilities, we can estimate the probability of observing a severe lesion as follows: $\hat{\pi}_{24} = 1 - (0.6433 + 0.3517) = 0.005$. Likewise, under treatment 1, $\hat{\pi}_{21} = 1 - (0.315 + 0.570) = 0.115$. The estimated odds ratio of not observing a lesion (“Without” lesion) between treatments 1 and 4 is

$$\text{Odds ratio}_{\text{Trt1},\text{Trt4}} = \frac{\hat{\pi}_{01}/\hat{\pi}_{21}}{\hat{\pi}_{04}/\hat{\pi}_{24}} = \frac{0.315/0.115}{0.6433/0.005} = 0.0213,$$

the value provided in the odds ratio estimates table (Table 8.23). If we compare the analysis using the cumulative logit link and the generalized logit link, we observe negligible changes in the estimated category probabilities by treatment as well as in the significance level in the test of treatment effects.

Table 8.23 Estimated odds ratio (Odds ratio estimates)

| Category | Trt | _trt | Estimate | DF | 95% lower confidence limit | 95% upper confidence limit |
|---|---|---|---|---|---|---|
| Without lesion | 1 | 4 | 0.021 | 790 | 0.003 | 0.163 |
| Moderate lesion | 1 | 4 | 0.071 | 790 | 0.009 | 0.538 |
| Without lesion | 2 | 4 | 0.305 | 790 | 0.031 | 2.979 |
| Moderate lesion | 2 | 4 | 0.381 | 790 | 0.039 | 3.759 |
| Without lesion | 3 | 4 | 0.062 | 790 | 0.008 | 0.493 |
| Moderate lesion | 3 | 4 | 0.160 | 790 | 0.020 | 1.281 |

8.6.2 CRD: Cheese Tasting

Consider a study in which you want to know the effects of various additives on the flavor of cheese. Researchers tested 4 cheese additives and obtained 52 response ratings for each additive. Each response was measured on a scale of 9 categories ranging from “I dislike it very much” (1) to “I like it very much” or “excellent flavor” (9). Data are obtained from the study by McCullagh and Nelder (1989) (Table 8.24).
The components of the GLMM with an ordinal multinomial response are as follows:

Distributions: $y_{1i}, y_{2i}, \ldots, y_{9i} \sim \text{Multinomial}(N_i, \pi_{1i}, \pi_{2i}, \ldots, \pi_{9i})$, where $y_{1i}, y_{2i}, \ldots, y_{9i}$ are the observed frequencies of the responses in each category c of the hedonic scale (1 = very undesirable, ..., 5 = neither desirable nor undesirable, ..., 9 = very desirable).

Linear predictor: $\eta_{ci} = \eta_c + \alpha_i$, where $\eta_{ci}$ is the cth link (c = 1, 2, ..., 8) for the additive type i, $\eta_c$ is the intercept for the cth link, and $\alpha_i$ is the fixed effect due to the ith additive.

The link functions for each category are as follows:

$$\log\left(\frac{\pi_{1i}}{1 - \pi_{1i}}\right) = \eta_{1i}$$

$$\log\left(\frac{\pi_{1i} + \pi_{2i}}{1 - (\pi_{1i} + \pi_{2i})}\right) = \eta_{2i}$$

$$\log\left(\frac{\pi_{1i} + \pi_{2i} + \pi_{3i}}{1 - (\pi_{1i} + \pi_{2i} + \pi_{3i})}\right) = \eta_{3i}$$

$$\vdots$$

$$\log\left(\frac{\pi_{1i} + \pi_{2i} + \cdots + \pi_{8i}}{1 - (\pi_{1i} + \pi_{2i} + \cdots + \pi_{8i})}\right) = \eta_{8i}$$

That is, for c = 1, 2, ..., 8, the cth link is the logit of the cumulative probability $\pi_{1i} + \cdots + \pi_{ci}$.

Table 8.24 Effect of additives on cheese flavor

| id | Additive | Y | Freq | id | Additive | Y | Freq |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 0 | 19 | 3 | 1 | 1 |
| 2 | 1 | 2 | 0 | 20 | 3 | 2 | 1 |
| 3 | 1 | 3 | 1 | 21 | 3 | 3 | 6 |
| 4 | 1 | 4 | 7 | 22 | 3 | 4 | 8 |
| 5 | 1 | 5 | 8 | 23 | 3 | 5 | 23 |
| 6 | 1 | 6 | 8 | 24 | 3 | 6 | 7 |
| 7 | 1 | 7 | 19 | 25 | 3 | 7 | 5 |
| 8 | 1 | 8 | 8 | 26 | 3 | 8 | 1 |
| 9 | 1 | 9 | 1 | 27 | 3 | 9 | 0 |
| 10 | 2 | 1 | 6 | 28 | 4 | 1 | 0 |
| 11 | 2 | 2 | 9 | 29 | 4 | 2 | 0 |
| 12 | 2 | 3 | 12 | 30 | 4 | 3 | 0 |
| 13 | 2 | 4 | 11 | 31 | 4 | 4 | 1 |
| 14 | 2 | 5 | 7 | 32 | 4 | 5 | 3 |
| 15 | 2 | 6 | 6 | 33 | 4 | 6 | 7 |
| 16 | 2 | 7 | 1 | 34 | 4 | 7 | 14 |
| 17 | 2 | 8 | 0 | 35 | 4 | 8 | 16 |
| 18 | 2 | 9 | 0 | 36 | 4 | 9 | 11 |

The following GLIMMIX commands fit a cumulative logit model with an ordinal multinomial response.
proc glimmix;
class id additive scale;
model scale(order=data) = additive / dist=Multinomial link=clogit solution oddsratio;
estimate 'c=1, a=1' intercept 1 0 0 0 0 0 0 0 additive 1 0 0 0,
         'c=2, a=1' intercept 0 1 0 0 0 0 0 0 additive 1 0 0 0,
         'c=3, a=1' intercept 0 0 1 0 0 0 0 0 additive 1 0 0 0,
         'c=4, a=1' intercept 0 0 0 1 0 0 0 0 additive 1 0 0 0,
         'c=5, a=1' intercept 0 0 0 0 1 0 0 0 additive 1 0 0 0,
         'c=6, a=1' intercept 0 0 0 0 0 1 0 0 additive 1 0 0 0,
         'c=7, a=1' intercept 0 0 0 0 0 0 1 0 additive 1 0 0 0,
         'c=8, a=1' intercept 0 0 0 0 0 0 0 1 additive 1 0 0 0,
         'c=1, a=2' intercept 1 0 0 0 0 0 0 0 additive 0 1 0 0,
         'c=2, a=2' intercept 0 1 0 0 0 0 0 0 additive 0 1 0 0,
         'c=3, a=2' intercept 0 0 1 0 0 0 0 0 additive 0 1 0 0,
         'c=4, a=2' intercept 0 0 0 1 0 0 0 0 additive 0 1 0 0,
         'c=5, a=2' intercept 0 0 0 0 1 0 0 0 additive 0 1 0 0,
         'c=6, a=2' intercept 0 0 0 0 0 1 0 0 additive 0 1 0 0,
         'c=7, a=2' intercept 0 0 0 0 0 0 1 0 additive 0 1 0 0,
         'c=8, a=2' intercept 0 0 0 0 0 0 0 1 additive 0 1 0 0,
         'c=1, a=3' intercept 1 0 0 0 0 0 0 0 additive 0 0 1 0,
         'c=2, a=3' intercept 0 1 0 0 0 0 0 0 additive 0 0 1 0,
         'c=3, a=3' intercept 0 0 1 0 0 0 0 0 additive 0 0 1 0,
         'c=4, a=3' intercept 0 0 0 1 0 0 0 0 additive 0 0 1 0,
         'c=5, a=3' intercept 0 0 0 0 1 0 0 0 additive 0 0 1 0,
         'c=6, a=3' intercept 0 0 0 0 0 1 0 0 additive 0 0 1 0,
         'c=7, a=3' intercept 0 0 0 0 0 0 1 0 additive 0 0 1 0,
         'c=8, a=3' intercept 0 0 0 0 0 0 0 1 additive 0 0 1 0,
         'c=1, a=4' intercept 1 0 0 0 0 0 0 0 additive 0 0 0 1,
         'c=2, a=4' intercept 0 1 0 0 0 0 0 0 additive 0 0 0 1,
         'c=3, a=4' intercept 0 0 1 0 0 0 0 0 additive 0 0 0 1,
         'c=4, a=4' intercept 0 0 0 1 0 0 0 0 additive 0 0 0 1,
         'c=5, a=4' intercept 0 0 0 0 1 0 0 0 additive 0 0 0 1,
         'c=6, a=4' intercept 0 0 0 0 0 1 0 0 additive 0 0 0 1,
         'c=7, a=4' intercept 0 0 0 0 0 0 1 0 additive 0 0 0 1,
         'c=8, a=4' intercept 0 0 0 0 0 0 0 1 additive 0 0 0 1 / ilink;
freq freq;
run;

Part of the results is shown in Table 8.25. The results of the analysis of variance show that the type of additive used in the manufacture of cheese significantly affects the degree of consumer acceptance (P < 0.0001). That is, the type of additive affects the sensory characteristics of the cheese.

Table 8.25 Fixed effects tests in the multinomial cumulative logit model (Type III tests of fixed effects)

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Additive | 3 | 197 | 38.11 | <0.0001 |

The hypothesis contrasts are presented in Table 8.26. The hypotheses tested are as follows:

$$H_0: \pi_{\text{additive } i} = \pi_{\text{additive } j}, \quad \forall\, i \neq j.$$

The results show that the additives provide different sensory characteristics that are reflected in the evaluation of preference.

Table 8.26 Contrast of hypothesis in the acceptance of cheese made with four additives (Contrasts)

| Label | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Additive effect 1 vs. 2 | 1 | 197 | 61.13 | <0.0001 |
| Additive effect 1 vs. 3 | 1 | 197 | 21.19 | <0.0001 |
| Additive effect 2 vs. 3 | 1 | 197 | 19.14 | <0.0001 |
| Additive effect 2 vs. 4 | 1 | 197 | 108.45 | <0.0001 |
| Additive effect 3 vs. 4 | 1 | 197 | 62.04 | <0.0001 |

With the “solution” option in the model specification, Table 8.27 (fixed parameter estimates) shows the solution of the maximum likelihood estimates for the fixed effects parameters.

Table 8.27 Maximum likelihood estimates of the fixed effects in the preference ratings of cheese made with different types of additives in the multinomial cumulative logit model (Fixed parameter estimates)

| Effect | Scale | Additive | Parameter | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|---|---|---|
| Intercept | 1 | | η1 | -7.0802 | 0.5640 | 197 | -12.55 | <0.0001 |
| Intercept | 2 | | η2 | -6.0250 | 0.4764 | 197 | -12.65 | <0.0001 |
| Intercept | 3 | | η3 | -4.9254 | 0.4257 | 197 | -11.57 | <0.0001 |
| Intercept | 4 | | η4 | -3.8568 | 0.3880 | 197 | -9.94 | <0.0001 |
| Intercept | 5 | | η5 | -2.5206 | 0.3453 | 197 | -7.30 | <0.0001 |
| Intercept | 6 | | η6 | -1.5685 | 0.3122 | 197 | -5.02 | <0.0001 |
| Intercept | 7 | | η7 | -0.06688 | 0.2738 | 197 | -0.24 | 0.8073 |
| Intercept | 8 | | η8 | 1.4930 | 0.3357 | 197 | 4.45 | <0.0001 |
| Additive | | 1 | α1 | 1.6128 | 0.3805 | 197 | 4.24 | <0.0001 |
| Additive | | 2 | α2 | 4.9646 | 0.4767 | 197 | 10.41 | <0.0001 |
| Additive | | 3 | α3 | 3.3227 | 0.4218 | 197 | 7.88 | <0.0001 |
| Additive | | 4 | α4 | 0 | . | . | . | . |

In this table, we observe the values of the estimated intercepts: $\hat{\eta}_1 = -7.0802$ defines the boundary between categories “1” and “2,” whereas $\hat{\eta}_2 = -6.0250$ defines the boundary between categories “2” and “3.” The third intercept, $\hat{\eta}_3 = -4.9254$, defines the boundary between categories “3” and “4,” and so forth. The estimated effects of the additive type ($\hat{\alpha}_i$, i = 1, 2, 3, and 4) are 1.6128, 4.9646, 3.3227, and 0, respectively. From these values, linear predictors are estimated for each of the categories. For example, the estimated linear predictor for a cheese made with additive 1, where the evaluator (consumer) assigns it category “1 = highly undesirable,” is $\hat{\eta}_{11} = \hat{\eta}_1 + \hat{\alpha}_1 = -7.0802 + 1.6128 = -5.4674$; for the category “2 = moderately undesirable,” it is $\hat{\eta}_{21} = \hat{\eta}_2 + \hat{\alpha}_1 = -6.0250 + 1.6128 = -4.4122$; for the category “3 = slightly undesirable,” it is $\hat{\eta}_{31} = \hat{\eta}_3 + \hat{\alpha}_1 = -4.9254 + 1.6128 = -3.3126$; and for the category “4 = neither undesirable nor desirable,” it is $\hat{\eta}_{41} = \hat{\eta}_4 + \hat{\alpha}_1 = -3.8568 + 1.6128 = -2.2440$. These values are shown in the “Estimate” column of Table 8.28; other categories are similarly calculated for each type of additive.

The values in Table 8.28 were obtained with the “estimate” command in conjunction with the “ilink” option, which prompts GLIMMIX to calculate the linear predictors $\hat{\eta}_{ci}$, tabulated in the “Estimate” column, and the estimated cumulative probabilities for all categories of each treatment except the reference category, tabulated in the “Mean” column. From Table 8.28 (Estimates), we obtain the probability of each category from the values reported under the “Mean” column. In this case, the probability for the first category under additive 1 is $\hat{\pi}_{11} = 0.004205$.
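The step from cumulative to category probabilities, and the odds ratios implied by the additive effects, can be checked with a few lines of Python. This sketch is an illustration, not GLIMMIX output: it copies the estimates from Table 8.27, and the function name `category_probabilities` is chosen here. Cumulative probabilities come from the inverse logit of $\hat{\eta}_c + \hat{\alpha}_i$; successive differences give the nine category probabilities; and $\exp(\hat{\alpha}_i)$ gives the odds ratio of additive i relative to additive 4.

```python
import math

# Estimates copied from Table 8.27; additive 4 is the reference level.
intercepts = [-7.0802, -6.0250, -4.9254, -3.8568,
              -2.5206, -1.5685, -0.06688, 1.4930]
additive_effect = {1: 1.6128, 2: 4.9646, 3: 3.3227, 4: 0.0}


def category_probabilities(alpha):
    """Return the nine category probabilities for one additive."""
    cumulative = [1.0 / (1.0 + math.exp(-(eta + alpha))) for eta in intercepts]
    cumulative.append(1.0)               # P(rating <= 9) is 1 by definition
    previous = [0.0] + cumulative[:-1]   # P(rating <= c - 1)
    return [cur - prev for cur, prev in zip(cumulative, previous)]


probs = {a: category_probabilities(alpha) for a, alpha in additive_effect.items()}
odds_ratio_vs_4 = {a: math.exp(alpha)
                   for a, alpha in additive_effect.items() if a != 4}

for a, row in probs.items():
    print(f"additive {a}: " + " ".join(f"{p:.4f}" for p in row))
for a, value in odds_ratio_vs_4.items():
    print(f"odds ratio, additive {a} vs 4: {value:.3f}")
```

For additive 1, the first entry of the result reproduces the probability $\hat{\pi}_{11} = 0.004205$ quoted above.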
This value is obtained by applying the inverse link to the linear predictor $\hat{\eta}_{11} = -5.4674$: $\hat{\pi}_{11} = 1/(1 + \exp(5.4674)) = 0.004205$. To calculate the probability that a panelist would assign a rating of 2 (2 = moderately undesirable) to cheese made with additive 1, we use the cumulative probability of 0.01198, which corresponds to $\hat{\pi}_{21} + \hat{\pi}_{11}$. From this value, we obtain $\hat{\pi}_{21} = 0.01198 - \hat{\pi}_{11} = 0.01198 - 0.004205 = 0.007775$, and for the probability of assigning a rating of 3 to cheese made with additive 1, $\hat{\pi}_{31} = 0.03514 - (\hat{\pi}_{21} + \hat{\pi}_{11}) = 0.03514 - 0.01198 = 0.02316$. Following the same procedure, we obtain the other probabilities for the rest of the categories of each of the additives used in the manufacturing of cheese, which are tabulated in Table 8.29 and can be seen in Fig. 8.3.

Figure 8.3 shows the estimated probability of each flavor rating for each of the additives (some probability values were omitted from the figure to avoid overlapping labels). It can be seen that additive 1 primarily receives ratings of 5–7; additive 2 primarily receives ratings of 2–5; additive 3 primarily receives ratings of 4–6; and additive 4 primarily receives ratings of 7–9. The odds ratio results (Table 8.30) show the preferences more clearly. For example, the odds ratio for additive 1 vs. 4 states that the odds of receiving a lower score are 5.017 times higher for the first additive than for the fourth additive.

8.7 Exercises

Exercise 8.7.1 The dataset for this exercise corresponds to the results of 9 judges who rated 2 classes of wine, namely, white wine (WW = 1) and red wine (RW = 2), and, within each wine class, they rated 10 wines on a scale of 1–20 points. The minimum rating for a particular wine was 7, and the maximum rating was 19.5. For didactic purposes, ratings between 7 and 11 were classified as low quality, ratings between 12 and 15 as medium quality, and anything above 15 as excellent quality.
The data from the wine evaluation experiment are shown in Table 8.31 under the columns “Judge” (wine evaluator panelist), “Wine_type” (white wine: 1, red wine: 2), “Quality” (low, medium, and excellent), and the frequency of the observed qualities (“y”).

(a) Fit the cumulative logit proportional odds model to these data. Perform a complete and appropriate analysis of the data, focusing on:
(i) An evaluation of the effects of the combination of treatments
(ii) Interpretation of the odds ratios
(iii) The expected probability per category for each treatment

Table 8.28 Estimates on the model scale (Estimate) and on the data scale (Mean) based on judges’ preference ratings of cheese made with different types of additives in the multinomial cumulative logit model (Estimates)

| Label | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| c = 1, a = 1 | -5.4674 | 0.5236 | 197 | -10.44 | <0.0001 | 0.004205 | 0.002192 |
| c = 2, a = 1 | -4.4122 | 0.4278 | 197 | -10.31 | <0.0001 | 0.01198 | 0.005064 |
| c = 3, a = 1 | -3.3126 | 0.3700 | 197 | -8.95 | <0.0001 | 0.03514 | 0.01255 |
| c = 4, a = 1 | -2.2440 | 0.3267 | 197 | -6.87 | <0.0001 | 0.09587 | 0.02832 |
| c = 5, a = 1 | -0.9078 | 0.2833 | 197 | -3.20 | 0.0016 | 0.2875 | 0.05804 |
| c = 6, a = 1 | 0.04425 | 0.2646 | 197 | 0.17 | 0.8673 | 0.5111 | 0.06611 |
| c = 7, a = 1 | 1.5459 | 0.3017 | 197 | 5.12 | <0.0001 | 0.8243 | 0.04369 |
| c = 8, a = 1 | 3.1058 | 0.4057 | 197 | 7.65 | <0.0001 | 0.9571 | 0.01665 |
| c = 1, a = 2 | -2.1155 | 0.4106 | 197 | -5.15 | <0.0001 | 0.1076 | 0.03942 |
| c = 2, a = 2 | -1.0603 | 0.3009 | 197 | -3.52 | 0.0005 | 0.2572 | 0.05749 |
| c = 3, a = 2 | 0.03922 | 0.2735 | 197 | 0.14 | 0.8861 | 0.5098 | 0.06836 |
| c = 4, a = 2 | 1.1078 | 0.2969 | 197 | 3.73 | 0.0002 | 0.7517 | 0.05542 |
| c = 5, a = 2 | 2.4441 | 0.3397 | 197 | 7.19 | <0.0001 | 0.9201 | 0.02497 |
| c = 6, a = 2 | 3.3961 | 0.3724 | 197 | 9.12 | <0.0001 | 0.9676 | 0.01168 |
| c = 7, a = 2 | 4.8978 | 0.4249 | 197 | 11.53 | <0.0001 | 0.9926 | 0.003124 |
| c = 8, a = 2 | 6.4576 | 0.5045 | 197 | 12.80 | <0.0001 | 0.9984 | 0.000789 |
| c = 1, a = 3 | -3.7575 | 0.4761 | 197 | -7.89 | <0.0001 | 0.02281 | 0.01061 |
| c = 2, a = 3 | -2.7023 | 0.3677 | 197 | -7.35 | <0.0001 | 0.06284 | 0.02165 |
| c = 3, a = 3 | -1.6027 | 0.3001 | 197 | -5.34 | <0.0001 | 0.1676 | 0.04186 |
| c = 4, a = 3 | -0.5341 | 0.2556 | 197 | -2.09 | 0.0379 | 0.3696 | 0.05955 |
| c = 5, a = 3 | 0.8021 | 0.2610 | 197 | 3.07 | 0.0024 | 0.6904 | 0.05579 |
| c = 6, a = 3 | 1.7541 | 0.2984 | 197 | 5.88 | <0.0001 | 0.8525 | 0.03752 |
| c = 7, a = 3 | 3.2558 | 0.3618 | 197 | 9.00 | <0.0001 | 0.9629 | 0.01293 |
| c = 8, a = 3 | 4.8157 | 0.4528 | 197 | 10.63 | <0.0001 | 0.9920 | 0.003610 |
| c = 1, a = 4 | -7.0802 | 0.5640 | 197 | -12.55 | <0.0001 | 0.000841 | 0.000474 |
| c = 2, a = 4 | -6.0250 | 0.4764 | 197 | -12.65 | <0.0001 | 0.002412 | 0.001146 |
| c = 3, a = 4 | -4.9254 | 0.4257 | 197 | -11.57 | <0.0001 | 0.007207 | 0.003046 |
| c = 4, a = 4 | -3.8568 | 0.3880 | 197 | -9.94 | <0.0001 | 0.02070 | 0.007865 |
| c = 5, a = 4 | -2.5206 | 0.3453 | 197 | -7.30 | <0.0001 | 0.07443 | 0.02379 |
| c = 6, a = 4 | -1.5685 | 0.3122 | 197 | -5.02 | <0.0001 | 0.1724 | 0.04455 |
| c = 7, a = 4 | -0.06688 | 0.2738 | 197 | -0.24 | 0.8073 | 0.4833 | 0.06838 |
| c = 8, a = 4 | 1.4930 | 0.3357 | 197 | 4.45 | <0.0001 | 0.8165 | 0.05029 |

Exercise 8.7.2 Data were obtained from a series of experiments conducted to reduce damage to potato tubers due to a potato lifter. The experiments were conducted at the Institute of Agricultural Engineering (IMAG-DLO) in Wageningen, the Netherlands. One source of damage was the type of rod used in the lifter. In the experiment under consideration, eight types of rods were compared. It is an empirical fact that the degree of damage varies considerably between potato varieties. Three blocks of observations were obtained for the combinations of varieties and rod types. Most of the combinations involved about 20 potatoes. For some combinations, there are no data due to an insufficient number of large potatoes. Tubers were dropped from a given height. To determine the damage, all tubers were peeled and the degree of blue coloration was classified into one of four classes (class 1: no damage; class 2: slight damage; class 3: moderate damage; and class 4: severe damage).
The observations, in the form of counts per class and combination, are shown in Table 8.32 of the tuber experiment, whose columns are “Variety” (1, 2, 3, 4, 5, 6), “String” (1, 2, 3, 4, 5, 6, 7, 8), “Block” (1, 2, 3), type of damage (sd = no damage, dl = slight damage, dm = moderate damage, ds = severe damage), and the observed frequency (Y).

Table 8.29 Probabilities calculated for each of the ratings by additives used in the manufacture of cheese

Rating   Additive 1   Additive 2   Additive 3   Additive 4
Cal1     0.00421      0.1076       0.02281      0.00084
Cal2     0.00778      0.1496       0.04003      0.00157
Cal3     0.02316      0.2526       0.10476      0.0048
Cal4     0.06073      0.2419       0.202        0.01349
Cal5     0.19163      0.1684       0.3208       0.05373
Cal6     0.2236       0.0475       0.1621       0.09797
Cal7     0.3132       0.025        0.1104       0.3109
Cal8     0.1328       0.0058       0.0291       0.3332
Cal9     0.0429       0.0016       0.008        0.1835
Note: Cal1 = Grade 1, Cal2 = Grade 2, ..., Cal9 = Grade 9

Fig. 8.3 Estimated probabilities for the categories of acceptability for the cheese according to the type of additive (vertical axis: probability of acceptability; horizontal axis: Additive 1 to Additive 4; legend: Cal1 to Cal9)

Table 8.30 Odds ratio estimates

Additive   _Additive   Estimate   DF    95% Confidence limits
1          4           5.017      197   2.369     10.625
2          4           143.257    197   55.953    366.783
3          4           27.735     197   12.071    63.724

(a) List the components of the multinomial GLMM.
(b) Fit the cumulative logit proportional odds model to these data.
Perform a complete and appropriate analysis of the data, focusing on:
(i) An evaluation of the effects of the combination of treatments
(ii) Interpretation of the odds ratios
(iii) The expected probability per category for each treatment
(c) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption.
(d) If, as a result of (c), you consider that an alternative cumulative logit model is better, then revise your analysis in (b) accordingly.

Table 8.31 Results of the wine evaluation experiment

Judge   Wine_type   Quality     y
1       1           Low         4
1       1           Medium      6
1       1           Excellent   0
1       2           Low         3
1       2           Medium      6
1       2           Excellent   1
2       1           Low         3
2       1           Medium      5
2       1           Excellent   2
2       2           Low         2
2       2           Medium      7
2       2           Excellent   1
3       1           Low         0
3       1           Medium      4
3       1           Excellent   6
3       2           Low         0
3       2           Medium      1
3       2           Excellent   9
4       1           Low         1
4       1           Medium      3
4       1           Excellent   6
4       2           Low         4
4       2           Medium      3
4       2           Excellent   3
5       1           Low         6
5       1           Medium      3
5       1           Excellent   1
5       2           Low         3
5       2           Medium      5
5       2           Excellent   2
6       1           Low         0
6       1           Medium      4
6       1           Excellent   6
6       2           Low         0
6       2           Medium      6
6       2           Excellent   4
7       1           Low         1
7       1           Medium      9
7       1           Excellent   0
7       2           Low         3
7       2           Medium      5
7       2           Excellent   2
8       1           Low         0
8       1           Medium      7
8       1           Excellent   3
8       2           Low         0
8       2           Medium      6
8       2           Excellent   4
9       1           Low         0
9       1           Medium      5
9       1           Excellent   5
9       2           Low         2
9       2           Medium      4
9       2           Excellent   4

Exercise 8.7.3 Fit a generalized multinomial logit model using the dataset from Exercise 8.7.2 of this section, following the instructions:
(a) List the components of this model.
(b) Perform a thorough and appropriate analysis of the data, focusing on:
(i) An evaluation of the main effects and treatment interaction
(ii) Odds ratio interpretation
(iii) The expected probability per category for each treatment
(c) Comment on and discuss your results.
Cite relevant evidence to support your conclusion regarding the adequacy of the assumption.

Exercise 8.7.4 In this exercise, the effects of judges’ experience on quality ratings of canned beans are assessed. Canning quality is one of the most essential traits required in all new dry bean (Phaseolus vulgaris L.) varieties, and selection for this trait is a critical part of bean breeding programs. Advanced lines, which are candidates for release as varieties, must be evaluated for canning quality for at least 3 years from samples grown at different locations. Quality is evaluated by a panel of judges with varying levels of experience in evaluating breeding lines for visual quality traits. In all, 264 bean breeding lines from 4 commercial classes were canned according to the procedures described by Walters et al. (1997). These included 62 white (navy), 65 black, 55 kidney, and 82 pinto bean lines plus control lines and “checks.” The visual appearance of the processed beans was determined subjectively by a panel of 13 judges on a 7-point hedonic scale (1 = very undesirable, 4 = neither desirable nor undesirable, 7 = very desirable). The beans were presented to the panel of judges in random order at the same time. Prior to evaluating the samples, all judges were shown examples of samples rated as satisfactory (4). There is concern that certain judges, due to lack of experience, may not be able to score canned samples correctly. Inferences about the effects of experience on attribute-based product evaluations can be drawn from the psychology literature (Wallsten and Budescu 1981). Prior to the bean canning quality rating experiment, it was postulated that not only do
Table 8.32 Results of the tuber experiment.
V = variety, C = string, B = block, D = damage (sd = no damage, dl = slight damage, dm = moderate damage, ds = severe damage), and Y = observed frequency V 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 C 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 D sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd Y 5 14 1 0 6 11 1 0 6 13 0 0 2 9 6 3 11 8 0 0 12 5 2 0 8 12 0 0 12 4 0 0 5 31 2 1 6 11 1 0 5 V 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 C 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 D sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd Y 4 5 4 7 8 3 0 0 3 10 6 1 1 3 11 5 16 3 1 0 16 3 0 0 11 9 0 0 10 10 0 0 5 7 5 1 13 6 1 0 5 V 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 C 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 B 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 D sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd Y 3 2 8 7 18 1 0 0 5 7 4 4 1 4 6 9 12 7 1 0 16 3 0 1 20 0 0 0 18 2 0 0 6 8 5 1 12 6 1 1 10 (continued) 8.7 Exercises 365 Table 8.32 (continued) V 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 C 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 B 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 D dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm 
Y 13 0 0 2 11 9 0 16 4 0 0 15 5 0 0 9 7 3 0 0 0 0 0 0 0 0 1 7 10 2 0 1 19 0 0 4 13 3 0 15 4 0 V 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 C 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 B 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 D dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm Y 12 3 0 0 8 11 1 10 9 1 0 10 7 1 1 11 6 1 0 13 6 0 0 9 7 2 0 18 2 0 0 13 6 0 0 0 9 9 2 16 2 1 V 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 C 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 B 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 D dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm Y 8 0 2 2 6 10 2 16 4 0 0 14 5 0 0 16 3 0 0 16 3 0 1 3 15 2 0 16 6 2 2 8 9 2 1 3 5 10 1 16 3 0 (continued) 366 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses Table 8.32 (continued) V 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 C 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 B 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 D ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd Y 1 11 4 0 5 11 9 0 0 17 2 1 0 4 9 7 0 12 8 0 0 10 10 0 0 2 8 4 6 17 2 0 0 18 2 0 0 19 1 0 0 15 V 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 C 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 B 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 D ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd 
dl dm ds sd dl dm ds sd dl dm ds sd Y 0 15 3 2 0 14 5 1 0 12 3 2 1 9 10 1 0 17 2 0 0 15 5 0 0 12 7 0 0 16 4 0 0 20 0 0 0 20 0 0 0 20 V 3 3 3 3 3 3 3 3 3 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 C 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 B 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 D ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd Y 1 15 5 0 0 15 5 0 0 16 4 0 0 5 14 1 0 18 2 0 0 5 14 0 0 5 14 1 0 15 5 0 0 19 1 0 0 18 2 0 0 18 (continued) 8.7 Exercises 367 Table 8.32 (continued) V 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 C 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 B 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 D dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm Y 3 0 1 4 11 3 1 17 2 0 1 10 9 0 0 4 8 6 1 18 2 0 0 19 1 0 0 20 0 0 0 20 0 0 0 10 9 1 0 16 3 1 V 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 C 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 B 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 D dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm Y 0 0 0 10 10 0 0 19 1 0 0 14 6 0 0 7 11 1 0 15 4 1 0 19 1 0 0 17 2 0 0 18 2 0 0 5 11 4 0 12 8 0 V 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 C 8 8 8 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 1 1 1 1 2 2 2 B 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 D dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds 
sd dl dm ds sd dl dm Y 2 0 0 6 13 0 0 18 2 0 0 13 7 0 0 0 15 5 0 16 3 0 0 19 1 0 0 17 3 0 0 15 4 0 0 3 15 2 0 13 7 0 (continued) 368 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses Table 8.32 (continued) V 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 C 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 B 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 D ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds Y 0 15 4 0 0 5 11 4 0 17 2 0 0 16 2 0 0 17 2 1 0 17 2 0 0 V 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 C 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 B 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 D ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds Y 0 7 12 1 0 6 6 6 2 16 4 0 0 17 3 0 0 18 2 0 0 17 2 0 0 V 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 C 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 B 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 D ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds sd dl dm ds Y 0 2 18 0 0 4 10 6 0 6 13 1 0 18 2 0 0 15 5 0 0 19 1 0 0 less experienced judges have a more severe rating than do more experienced judges but also that experience should have little or no effect on the white beans for which the canning procedure was developed. Judges are stratified for the purpose of analysis by experience (less than 5 years, greater than 5 years). Counts by canning quality, judge experience, and bean breeding lines are listed in the following table (Table 8.33). 8.7 Exercises 369 Table 8.33 Bean experiment results Cal 1 2 3 4 5 6 7 Black < 5 Years 13 91 123 72 24 2 0 > 5 Years 32 78 124 122 31 3 0 Kidney < 5 Years 7 32 136 101 47 6 1 > 5 Years 10 31 96 104 71 18 0 Navies < 5 Years 10 56 84 84 51 24 1 > 5 Years 22 51 107 98 52 37 5 Pinto < 5 Years 13 29 91 109 60 25 1 > 5 Years 2 17 68 124 109 78 12 (a) Fit the generalized logit model to these data. 
Perform a complete and appropriate analysis of the data, focusing on: (i) An evaluation of the effects of the combination of treatments (ii) Interpretation of the odds ratios (iii) The expected probability per category for each treatment (b) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption. Exercise 8.7.5 An experiment was conducted to look at the damage levels (ordinal categories 0–4) of Picea sitchensis shoots in two time periods (10 November and 8 December), at four temperatures (different on each date), and at four ozone levels (Table 8.34). (a) Fit the cumulative logit proportional odds model to these data. Perform a complete and appropriate analysis of the data, focusing on: (i) An evaluation of the effects of the combination of treatments (ii) Interpretation of the odds ratios (iii) The expected probability per category for each treatment (b) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption. 
370 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses Table 8.34 Experimental results of Picea sitchensis sprouts Weather 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Temperature (°C) -9 -9 -9 -9 -9 -12 -12 -12 -12 -12 -15 -15 -15 -15 -15 -18 -18 -18 -18 -18 -9 -9 -9 -9 -9 -12 -12 -12 -12 -12 -15 -15 -15 -15 -15 -18 -18 -18 -18 -18 -9 -9 Ozone 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 70 70 Category 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 Frequency 1 10 2 2 0 0 8 3 1 3 0 3 2 4 6 0 1 1 4 9 1 9 4 1 0 0 7 7 1 0 0 1 5 6 3 0 0 4 5 6 4 6 (continued) 8.7 Exercises 371 Table 8.34 (continued) Weather 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 Temperature (°C) -9 -9 -9 -12 -12 -12 -12 -12 -15 -15 -15 -15 -15 -18 -18 -18 -18 -18 -9 -9 -9 -9 -9 -12 -12 -12 -12 -12 -15 -15 -15 -15 -15 -18 -18 -18 -18 -18 -15 -15 -15 -15 Ozone 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 170 170 170 170 Category 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 Frequency 3 2 0 1 6 6 2 0 0 3 6 4 2 0 1 0 5 9 2 11 2 0 0 1 6 6 2 0 2 4 4 3 2 0 4 3 5 3 3 8 4 1 (continued) 372 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses Table 8.34 (continued) Weather 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Temperature (°C) -15 -19 -19 -19 -19 -19 -23 -23 -23 -23 -23 -27 -27 -27 -27 -27 -15 -15 -15 -15 -15 -19 -19 -19 -19 -19 -23 -23 -23 -23 -23 -27 -27 -27 -27 -27 -15 -15 -15 -15 -15 -19 Ozone 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 170 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 70 70 70 70 70 70 Category 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 
3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 Frequency 3 0 10 5 0 4 0 1 8 4 6 0 0 2 3 14 6 6 8 0 0 1 12 7 0 0 0 0 7 7 6 0 0 1 2 17 9 4 5 2 0 2 (continued) 8.7 Exercises 373 Table 8.34 (continued) Weather 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Temperature (°C) -19 -19 -19 -19 -23 -23 -23 -23 -23 -27 -27 -27 -27 -27 -15 -15 -15 -15 -15 -19 -19 -19 -19 -19 -23 -23 -23 -23 -23 -27 -27 -27 -27 -27 Ozone 70 70 70 70 70 70 70 70 70 70 70 70 70 70 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Category 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 Frequency 10 6 0 2 0 3 5 4 8 0 0 0 3 17 5 6 3 1 2 6 5 3 1 2 0 4 2 1 3 0 1 0 5 11 374 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses Appendix Data: CRD with a multinomial response: ordinal Rep Trt Cat Freq rep1 M1H1 Without 0 rep2 M1H1 Without 2 rep3 M1H1 Without 2 rep4 M1H1 Without 2 rep1 M1H2 Without 2 rep2 M1H2 Without 0 rep3 M1H2 Without 4 rep4 M1H2 Without 2 rep1 M1H3 Without 3 rep2 M1H3 Without 7 rep3 M1H3 Without 1 rep4 M1H3 Without 2 rep1 M1H4 Without 0 rep2 M1H4 Without 5 rep3 M1H4 Without 2 1 rep4 M1H4 Without rep1 M1H1 Moderate 3 rep2 M1H1 Moderate 2 rep3 M1H1 Moderate 3 rep4 M1H1 Moderate 5 rep1 M1H2 Moderate 3 rep2 M1H2 Moderate 3 rep3 M1H2 Moderate 6 rep4 M1H2 Moderate 3 rep1 M1H3 Moderate 4 rep2 M1H3 Moderate 2 rep3 M1H3 Moderate 1 rep4 M1H3 Moderate 3 rep1 M1H4 Moderate 5 rep2 M1H4 Moderate 4 rep3 M1H4 Moderate 8 rep4 M1H4 Moderate 4 M1H1 Severe 6 rep1 rep2 M1H1 Severe 6 rep3 M1H1 Severe 5 rep4 M1H1 Severe 3 rep1 M1H2 Severe 5 rep2 M1H2 Severe 7 rep3 M1H2 Severe 0 rep4 M1H2 Severe 5 Rep rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 Trt M2H3 M2H3 M2H3 M2H3 M2H4 M2H4 M2H4 M2H4 M2H1 M2H1 M2H1 M2H1 M2H2 M2H2 M2H2 M2H2 M2H3 M2H3 M2H3 M2H3 M2H4 M2H4 M2H4 M2H4 M3H1 M3H1 M3H1 M3H1 M3H2 M3H2 M3H2 M3H2 M3H3 M3H3 
M3H3 M3H3 M3H4 M3H4 M3H4 M3H4 Cat Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Without Without Without Without Without Without Without Without Without Without Without Without Without Without Without Without Freq 3 1 3 2 4 2 2 5 4 6 7 4 5 2 3 4 3 4 4 4 5 6 0 3 0 3 2 0 5 3 3 2 0 2 1 0 3 5 7 3 (continued) Appendix Data: CRD with a multinomial response: ordinal Rep Trt Cat Freq rep1 M1H3 Severe 3 rep2 M1H3 Severe 1 rep3 M1H3 Severe 7 rep4 M1H3 Severe 5 rep1 M1H4 Severe 5 rep2 M1H4 Severe 1 rep3 M1H4 Severe 0 rep4 M1H4 Severe 5 rep1 M2H1 Without 1 rep2 M2H1 Without 2 rep3 M2H1 Without 1 rep4 M2H1 Without 1 rep1 M2H2 Without 1 rep2 M2H2 Without 3 rep3 M2H2 Without 1 rep4 M2H2 Without 4 rep1 M2H3 Without 4 rep2 M2H3 Without 5 rep3 M2H3 Without 3 rep4 M2H3 Without 4 rep1 M2H4 Without 1 rep2 M2H4 Without 1 rep3 M2H4 Without 8 rep4 M2H4 Without 2 rep1 M2H1 Moderate 4 rep2 M2H1 Moderate 2 rep3 M2H1 Moderate 2 rep4 M2H1 Moderate 5 rep1 M2H2 Moderate 4 rep2 M2H2 Moderate 4 rep3 M2H2 Moderate 6 rep4 M2H2 Moderate 2 375 Rep rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 rep1 rep2 rep3 rep4 Trt M3H1 M3H1 M3H1 M3H1 M3H2 M3H2 M3H2 M3H2 M3H3 M3H3 M3H3 M3H3 M3H4 M3H4 M3H4 M3H4 M3H1 M3H1 M3H1 M3H1 M3H2 M3H2 M3H2 M3H2 M3H3 M3H3 M3H3 M3H3 M3H4 M3H4 M3H4 M3H4 Cat Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Moderate Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Severe Freq 0 5 5 0 3 2 6 1 3 5 3 3 0 2 3 4 9 2 3 10 2 5 1 7 6 3 6 7 7 3 0 3 376 8 Generalized Linear Mixed Models for Categorical and Ordinal Responses Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Chapter 9
Generalized Linear Mixed Models for Repeated Measurements

9.1 Introduction

Repeated measures data, also known as longitudinal data, are those derived from experiments in which observations are made on the same experimental units at various planned times. These experiments can be of the regression or analysis of variance (ANOVA) type, can contain two or more treatments, and are set up using familiar designs: a completely randomized design (CRD); a randomized complete block design (RCBD) or randomized incomplete blocks, if blocking is appropriate; or row and column designs such as Latin squares. Repeated measures designs are widely used in the biological sciences and are fairly well understood for normally distributed data but less so for binary, ordinal, count data, and so on. Nevertheless, recent developments in statistical computing methodology and software have greatly increased the number of tools available for analyzing categorical data. A generalized linear mixed model (GLMM) is one of the most useful and sophisticated structures in modern statistics, as it allows complex structures to be incorporated into the framework of a general linear model.
Fitting such models has been the subject of much research over the last three decades. GLMMs for repeated measures combine generalized linear model (GLM) theory (e.g., a binomial, multinomial, or Poisson response variable) with linear mixed effects models.

Experimentation is sometimes not well understood, since researchers may believe that it involves only manipulating the levels of independent variables and observing the subsequent responses in dependent variables. Independent variables whose levels are determined or set by the experimenter are said to have fixed effects. Random effects are also very common; here the levels of the effect are assumed to be randomly selected from an infinite population of possible levels. Many variables of interest in research are not fully amenable to experimental manipulation but can nevertheless be studied by considering them to have random effects. For example, the genetic composition of individuals of a species cannot be manipulated experimentally, but it is of great interest to geneticists aiming to assess the genetic contribution to individual variation in some specific behaviors.

© The Author(s) 2023. J. Salinas Ruíz et al., Generalized Linear Mixed Models with Applications in Agriculture and Biology, https://doi.org/10.1007/978-3-031-32800-8_9

A GLMM with repeated measures is a generalization of the standard linear model, and this generalization is due to (1) the presence of more than one response variable, which can be binary, ordinal, a count, and so on, and (2) the nonconstant correlation and/or variability exhibited by the data. The linear mixed model therefore gives the flexibility to model not only the means of the data (as in the standard linear model) but also their variances and covariances. Usually, a normal distribution is assumed for random effects.
Since normally distributed data can be modeled entirely in terms of their means and variances/covariances, the two sets of parameters in a linear mixed model specify the full probability distribution of the data. The parameters of the mean structure are known as fixed effects parameters, which can be qualitative (as in traditional analysis of variance) or quantitative (as in standard regression). The parameters of the variance–covariance structure are known as covariance parameters; they are what distinguishes a linear mixed model from the standard linear model.

Covariance parameters arise frequently in applications, most typically in two scenarios:
(a) Experimental units on which data are measured can be grouped into clusters, and data from a common cluster are correlated.
(b) Repeated measurements of the same experimental unit are taken, and these repeated measurements are correlated or show nonconstant variability.

The first scenario can be generalized to include a set of clusters nested within one another. For example, if students are the experimental unit, they can be grouped into classes (clusters), which, in turn, can be grouped into schools. Each level of this hierarchy may present an additional source of variability and correlation. The second scenario occurs in longitudinal studies, in which repeated measurements of the same experimental unit are taken over time. Alternatively, these repeated measures could be spatial or multivariate.

9.2 Example of Turf Quality

The proportional odds model, introduced by McCullagh (1980), was proposed as an extension of the generalized linear model for ordinal responses. Recall that the proportional odds model is a special case of a GLM with a cumulative link function, in which the probability of an observation falling into a given category or below is modeled.
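To make the “proportional odds” idea concrete, the following Python sketch (an illustration, not part of the SAS analysis in this chapter) shows that, under a cumulative logit link with a common treatment effect, the cumulative odds ratio between two groups is the same at every cut-point. The cut-point intercepts, the effect size, and the function names are hypothetical values chosen only for illustration.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def cumulative_probs(cutpoints, effect):
    """P(Y <= c) for each cut-point c, given a common shift `effect`."""
    return [inv_logit(c + effect) for c in cutpoints]

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical values, chosen only for illustration.
cutpoints = [-2.0, 0.5]   # intercepts for "low" and "low or medium"
effect = 1.2              # shift of a treatment relative to a reference

reference = cumulative_probs(cutpoints, 0.0)
treated = cumulative_probs(cutpoints, effect)

# One cumulative odds ratio per cut-point.
odds_ratios = [odds(t) / odds(r) for t, r in zip(treated, reference)]

for c, ratio in zip(cutpoints, odds_ratios):
    print(f"cut-point {c:+.1f}: cumulative odds ratio = {ratio:.4f}")
print(f"exp(effect) = {math.exp(effect):.4f}")
```

Every cumulative odds ratio equals exp(effect), which is why a single odds ratio per treatment comparison is enough to summarize such a model.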
In the case of a logit link with only two categories (a binary response), the proportional odds model reduces to a standard logistic regression or a classification model. As with any other type of response variable, repeated measurements are common in agronomic research. They result in clustered data structures with correlations between repeated observations on the same experimental unit that must be taken into account in the analysis.

The data were obtained from an experiment studying the turf quality of five grass varieties. The varieties were sown independently in 17 or 18 plots. The plots (experimental units) were evaluated in May, July, and September of the growing season, and turf quality was classified on an ordinal scale into three categories: low quality, medium quality, and excellent quality, as shown in Table 9.1.

Table 9.1 Turf quality of five grass varieties (Low, Med = medium, Excel = excellent, Sept = September)

                         May                  July                 Sept
Variety   No. of plots   Low   Med   Excel    Low   Med   Excel    Low   Med   Excel
1         18             4     10    4        1     9     8        0     12    6
2         17             2     11    4        0     7     10       0     9     8
3         17             2     11    4        2     8     7        2     11    4
4         18             8     7     3        4     8     6        4     13    1
5         18             1     11    6        3     4     11       3     6     9

The components of the GLMM with repeated measures and an ordinal multinomial response are as follows:

Distributions: $y_{1ij}, y_{2ij}, y_{3ij} \mid \rho_{ij} \sim \text{Multinomial}(N_{ij}, \pi_{1ij}, \pi_{2ij}, \pi_{3ij})$, where $y_{1ij}$, $y_{2ij}$, and $y_{3ij}$ are the observed frequencies of the responses (turf quality) in each category $c$ (low, medium, and excellent), and $\rho_{ij}$ is the random effect due to the variety × month (measurement time) combination, assuming $\rho_{ij} \sim N(0, \sigma^2_{\rho})$.

Linear predictor: $\eta_{(c)ij} = \eta_c + \tau_i + \rho_{ij}$, where $\eta_{(c)ij}$ is the $c$th link ($c = 1, 2$) in the $ij$th variety × month combination, $\eta_c$ is the intercept for the $c$th link, $\tau_i$ is the fixed effect due to the $i$th treatment, and $\rho_{ij}$ is the random effect due to the $ij$th variety × month measurement, with $\rho_{ij} \sim N(0, \sigma^2_{\text{variety} \times \text{month}})$.
The link functions for each category are as follows:

$$\log\left(\frac{\pi_{1ij}}{1 - \pi_{1ij}}\right) = \eta_{(1)ij}, \qquad \log\left(\frac{\pi_{1ij} + \pi_{2ij}}{1 - (\pi_{1ij} + \pi_{2ij})}\right) = \eta_{(2)ij}$$

The following Statistical Analysis Software (SAS) program fits a repeated measures GLMM with an ordinal response.

proc glimmix data=turfgrass method=laplace;
  class variety time;
  /* cumulative logit link for the ordinal response cat */
  model cat(order=data) = variety|time / dist=multinomial link=clogit solution oddsratio;
  /* change type= to compare covariance structures */
  random intercept / subject=variety type=cs solution;
  /* linear predictors for each cumulative logit (c) and variety (var) */
  estimate 'c=1, var=1' intercept 1 0 variety 1 0 0 0 0,
           'c=2, var=1' intercept 0 1 variety 1 0 0 0 0,
           'c=1, var=2' intercept 1 0 variety 0 1 0 0 0,
           'c=2, var=2' intercept 0 1 variety 0 1 0 0 0,
           'c=1, var=3' intercept 1 0 variety 0 0 1 0 0,
           'c=2, var=3' intercept 0 1 variety 0 0 1 0 0,
           'c=1, var=4' intercept 1 0 variety 0 0 0 1 0,
           'c=2, var=4' intercept 0 1 variety 0 0 0 1 0,
           'c=1, var=5' intercept 1 0 variety 0 0 0 0 1,
           'c=2, var=5' intercept 0 1 variety 0 0 0 0 1 / ilink;
  freq y;
run;

Table 9.2 Fit statistics under different correlation structures

                                                                      Covariance structure
Fit statistics                                                        CS       AR(1)    Toep(1)   UN
-2 Log likelihood                                                     497.38   497.46   497.37    no convergence
AIC (smaller is better)                                               513.38   513.46   511.37
AICC (smaller is better)                                              513.94   514.02   511.81
BIC (smaller is better)                                               510.26   510.34   508.64
CAIC (Consistent Akaike's information criterion) (smaller is better)  518.26   518.34   515.64
HQIC (Hannan Quinn information criterion) (smaller is better)         504.99   505.07   504.03

Mixed models have advantages over fixed linear models (Littell et al. 1996) because they can incorporate both fixed (Xβ) and random (Zb) effects. This allows different variance–covariance structures to be tried for repeated measures experiments (with or without missing data) to see which one best fits the data (Henderson 1984; Smith et al. 2005). Building a good enough model therefore involves selecting the covariance structure that best fits the dataset.
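As a check on how the fit statistics in Table 9.2 relate to one another, the following Python sketch recomputes AIC, AICC, BIC, CAIC, and HQIC as penalized versions of -2 log likelihood. The penalty formulas are assumed to be the ones applied when subjects are used as the sample size for BIC-type criteria. The parameter counts (8 for CS and AR(1), 7 for Toep(1)), the number of subjects (5 varieties), and the number of observations (264 plot evaluations, i.e., 88 plots in Table 9.1 evaluated three times) are assumptions inferred from the tables, not values reported in the output shown here.

```python
import math

def fit_statistics(neg2ll, n_params, n_subjects, n_obs):
    """Information criteria as penalized versions of -2 log likelihood."""
    d, s, n = n_params, n_subjects, n_obs
    return {
        "AIC": neg2ll + 2 * d,
        "AICC": neg2ll + 2 * d * n / (n - d - 1),
        "BIC": neg2ll + d * math.log(s),
        "CAIC": neg2ll + d * (math.log(s) + 1),
        "HQIC": neg2ll + 2 * d * math.log(math.log(s)),
    }

# -2 log likelihood from Table 9.2; parameter counts are inferred (assumption).
structures = {
    "CS": (497.38, 8),
    "AR(1)": (497.46, 8),
    "Toep(1)": (497.37, 7),
}
N_SUBJECTS = 5   # varieties used as subjects (assumption)
N_OBS = 264      # 88 plots x 3 evaluation months (assumption based on Table 9.1)

results = {
    name: fit_statistics(neg2ll, d, N_SUBJECTS, N_OBS)
    for name, (neg2ll, d) in structures.items()
}

for name, stats in results.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})

# The structure with the smallest AIC is the preferred one.
best = min(results, key=lambda name: results[name]["AIC"])
print("Smallest AIC:", best)
```

Under these assumptions the recomputed values agree with Table 9.2 to two decimals, and the comparison makes explicit that Toep(1) is favored mainly because it uses one fewer covariance parameter for an almost identical likelihood.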
The fit statistics provided by proc GLIMMIX (-2 log likelihood, Akaike information criterion (AIC), corrected Akaike information criterion (AICC), Bayesian information criterion (BIC), etc.) are used to select the covariance structure (compound symmetry (“CS”), first-order autoregressive (“AR(1)”), Toeplitz (“Toep(1)”), or unstructured (“UN”)) that best models the dataset. Most of the commands have already been explained. To specify the correlation structure to be modeled in the above program, set the “TYPE=” option to each of CS, AR(1), Toep(1), and UN in turn and refit the model. Part of the results is shown below.

According to the fit statistics (Table 9.2), the covariance structure that best fits the dataset is Toeplitz of order 1 (Toep(1)), which has the smallest value of every criterion. The type III tests of fixed effects, shown in part (a) of Table 9.3, indicate that turf quality differs among grass varieties (P = 0.0202). The “solution” option in the “model” statement provides the solution for the fixed effects of the model (intercepts and treatments), shown in part (b), which we use to estimate the linear predictors $\hat{\eta}_{ci} = \hat{\eta}_c + \widehat{\text{Variety}}_i$.

Table 9.3 Results of the analysis of variance

(a) Type III tests of fixed effects
Effect    Num degrees of freedom (DF)   Den DF   F-value   Pr > F
Variety   4                             10       4.80      0.0202

(b) Solutions for fixed effects
Effect      Cat      Variety   Estimate   Standard error   DF   t-value   Pr > |t|
Intercept   Low                -2.4509    0.3219           10   -7.61     <0.0001
Intercept   Medium             0.1961     0.2721           10   0.72      0.4875
Variety              Var1      0.4261     0.3753           10   1.14      0.2827
Variety              Var2      -0.01502   0.3785           10   -0.04     0.9691
Variety              Var3      0.6125     0.3825           10   1.60      0.1404
Variety              Var4      1.4904     0.3943           10   3.78      0.0036
Variety              Var5      0          .                .    .         .

The cumulative probabilities obtained by back-transforming the “Estimate” values are tabulated under the “Mean” column of Table 9.4.

Table 9.4 Estimated linear predictors on the model scale (Estimate) and means on the data scale (Mean) for observed turfgrass quality in grass varieties in the multinomial cumulative logit model

Label            Estimate   Standard error   DF   t-value   Pr > |t|   Mean      Standard error mean
c = 1, var = 1   -2.0248    0.3018           10   -6.71     <0.0001    0.1166    0.03110
c = 2, var = 1   0.6222     0.2646           10   2.35      0.0405     0.6507    0.06013
c = 1, var = 2   -2.4659    0.3177           10   -7.76     <0.0001    0.07828   0.02292
c = 2, var = 2   0.1811     0.2667           10   0.68      0.5125     0.5452    0.06613
c = 1, var = 3   -1.8384    0.3040           10   -6.05     0.0001     0.1372    0.03599
c = 2, var = 3   0.8086     0.2760           10   2.93      0.0150     0.6918    0.05884
c = 1, var = 4   -0.9605    0.2791           10   -3.44     0.0063     0.2768    0.05588
c = 2, var = 4   1.6865     0.2992           10   5.64      0.0002     0.8438    0.03944
c = 1, var = 5   -2.4509    0.3219           10   -7.61     <0.0001    0.07937   0.02352
c = 2, var = 5   0.1961     0.2721           10   0.72      0.4875     0.5489    0.06737

From these values, we can observe that for the row “c = 1, var = 1,” the value of the linear predictor is $\hat{\eta}_{11} = \hat{\eta}_1 + \widehat{\text{Variety}}_1 = -2.0248$. Applying the inverse logit link to $\hat{\eta}_{11}$ gives $\hat{\pi}_{11} = 1/(1 + e^{2.0248}) = 0.1166$, the probability of observing “Low”-quality turf for variety 1. For the row “c = 2, var = 1,” the inverse link of the linear predictor is 0.6507, which is the estimate of the cumulative probability $\pi_{11} + \pi_{21}$. From this value, we can obtain the probability that variety 1 provides turf of “Medium” quality: since $\hat{\pi}_{11} + \hat{\pi}_{21} = 0.6507$, substituting the value of $\hat{\pi}_{11}$ gives $\hat{\pi}_{21} = 0.6507 - 0.1166 = 0.5341$.
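The back-transformation described here can be sketched in Python for all five varieties (an illustration only, not part of the SAS program). The linear predictor estimates are copied from the “Estimate” column of Table 9.4; the function and variable names are hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# (estimate for c = 1, estimate for c = 2) per variety, from Table 9.4.
estimates = {
    1: (-2.0248, 0.6222),
    2: (-2.4659, 0.1811),
    3: (-1.8384, 0.8086),
    4: (-0.9605, 1.6865),
    5: (-2.4509, 0.1961),
}

def category_probabilities(eta_low, eta_low_or_medium):
    """Difference the cumulative probabilities to get one value per category."""
    p_low = inv_logit(eta_low)                      # P(Low)
    p_low_or_medium = inv_logit(eta_low_or_medium)  # P(Low) + P(Medium)
    return {
        "Low": p_low,
        "Medium": p_low_or_medium - p_low,
        "Excellent": 1.0 - p_low_or_medium,
    }

probabilities = {
    variety: category_probabilities(*etas) for variety, etas in estimates.items()
}

for variety, probs in probabilities.items():
    print(variety, {quality: round(p, 4) for quality, p in probs.items()})
```

For each variety the three probabilities sum to one by construction, because the last category receives whatever probability the cumulative logits leave over.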
With these two probability estimates, $\hat{\pi}_{11}$ and $\hat{\pi}_{21}$, it is possible to estimate the probability that variety 1 will yield an "Excellent" quality turf, which is equal to $\hat{\pi}_{31} = 1 - 0.6507 = 0.3493$. Likewise, we obtain the values of the remaining probabilities $\pi_{ci}$ for the rest of the grass varieties.

9.3 Effect of Insecticides on Aphid Growth

A cage experiment was used to investigate the effect of three insecticides on aphid colonies with partial resistance to a common active compound. There were eight treatments: all combinations of the three insecticides and a control (no insecticide) with two types of colonies (susceptible or partially resistant). The experiment was organized as a randomized complete block design (RCBD) with six blocks of eight cages, and each treatment combination was assigned to one cage in each block. A colony of aphids was reared in each cage, and the number of live aphids was recorded before the insecticide treatment was applied and then 2 and 6 days after application. Both hatches and deaths could occur within each cage between evaluations. The dataset from this experiment is shown below (Table 9.5). Following the same reasoning as in previous examples, the components of the GLMM with a Poisson response and repeated measures, which models the number of aphids ($y_{ijkl}$), are described in the following lines.
Distributions:

$y_{ijkl} \mid b_l, \text{insecticide} \times \text{clone(block)}_{ij(l)} \sim \text{Poisson}(\lambda_{ijkl})$

$b_l \sim N(0, \sigma^2_{block})$, $\text{insecticide} \times \text{clone(block)}_{ij(l)} \sim N(0, \sigma^2_{insecticide \times clone \times block})$

Linear predictor:

$\eta_{ijkl} = \theta + I_i + C_j + (IC)_{ij} + b_l + IC(b)_{ij(l)} + \tau_k + (I\tau)_{ik} + (C\tau)_{jk} + (IC\tau)_{ijk}$

where $\eta_{ijkl}$ is the linear predictor, $\theta$ is the intercept, $I_i$ ($i = 1, \ldots, 4$) is the fixed effect due to the insecticide (control and the three insecticides), $C_j$ ($j = 1, 2$) is the fixed effect due to the aphid clone, $(IC)_{ij}$ is the fixed effect due to the interaction between the type of insecticide and clone, $b_l$ ($l = 1, \ldots, 6$) is the random block effect, assuming $b_l \sim N(0, \sigma^2_{block})$, $IC(b)_{ij(l)}$ is the random effect of the interaction between the insecticide and clone within blocks, assuming $\text{insecticide} \times \text{clone(block)}_{ij(l)} \sim N(0, \sigma^2_{insecticide \times clone \times block})$, $\tau_k$ ($k = 1, 2, 3$) is the fixed effect due to measurement time, and $(I\tau)_{ik}$, $(C\tau)_{jk}$, and $(IC\tau)_{ijk}$ are the fixed effects due to the interactions with measurement time.

Table 9.5 Effect of insecticides (C = control, R = resistant, S = susceptible) on aphid growth

| Block | Cage | Insecticide | Clone | Dia1 | Dia2 | Dia6 |
|---|---|---|---|---|---|---|
| 1 | 1 | Control | R | 60 | 111 | 220 |
| 1 | 2 | Control | S | 127 | 131 | 220 |
| 1 | 3 | D | R | 64 | 30 | 27 |
| 1 | 4 | D | S | 110 | 27 | 35 |
| 1 | 5 | H | R | 118 | 75 | 121 |
| 1 | 6 | H | S | 71 | 10 | 111 |
| 1 | 7 | P | R | 66 | 69 | 62 |
| 1 | 8 | P | S | 40 | 25 | 19 |
| 2 | 1 | Control | R | 54 | 152 | 156 |
| 2 | 2 | Control | S | 58 | 130 | 362 |
| 2 | 3 | D | R | 76 | 60 | 110 |
| 2 | 4 | D | S | 48 | 22 | 110 |
| 2 | 5 | H | R | 130 | 113 | 101 |
| 2 | 6 | H | S | 76 | 76 | 85 |
| 2 | 7 | P | R | 93 | 77 | 185 |
| 2 | 8 | P | S | 49 | 0 | 8 |
| 3 | 1 | Control | R | 94 | 175 | 292 |
| 3 | 2 | Control | S | 26 | 33 | 52 |
| 3 | 3 | D | R | 121 | 73 | 60 |
| 3 | 4 | D | S | 78 | 23 | 1 |
| 3 | 5 | H | R | 73 | 74 | 56 |
| 3 | 6 | H | S | 54 | 27 | 49 |
| 3 | 7 | P | R | 25 | 10 | 32 |
| 3 | 8 | P | S | 36 | 22 | 1 |
| 4 | 1 | Control | R | 75 | 134 | 238 |
| 4 | 2 | Control | S | 86 | 57 | 194 |
| 4 | 3 | D | R | 69 | 32 | 12 |
| 4 | 4 | D | S | 122 | 66 | 20 |
| 4 | 5 | H | R | 185 | 88 | 251 |
| 4 | 6 | H | S | 47 | 23 | 116 |

Link function: $\log(\lambda_{ijkl}) = \eta_{ijkl}$ is the link function that relates the linear predictor to the mean ($\lambda_{ijkl}$). The following SAS program fits the GLMM with a Poisson distribution on repeated measures.
    proc glimmix nobound method=laplace;
      class ID Block Insecticide Cage Clone time;
      /* full factorial of fixed effects, Poisson response */
      model y = Insecticide|Clone|time/dist=poi;
      /* random block and insecticide*clone-within-block effects */
      random intercept Insecticide*Clone/subject=block;
      /* ilink adds the means on the data scale */
      lsmeans Insecticide|Clone|time/lines ilink;
    run;quit;

Table 9.6 Results of the analysis of variance in the Poisson GLMM

(a) Fit statistics

| Fit statistics | CS | AR(1) | Toep(1) | UN |
|---|---|---|---|---|
| -2 Log likelihood | 1125.54 | 1113.19 | 1127.25 | Did not converge |
| AIC (smaller is better) | 1177.54 | 1165.19 | 1177.25 | |
| AICC (smaller is better) | 1202.17 | 1189.82 | 1199.67 | |
| BIC (smaller is better) | 1161.58 | 1149.24 | 1161.91 | |
| CAIC (smaller is better) | 1187.58 | 1175.24 | 1186.91 | |
| HQIC (smaller is better) | 1142.52 | 1130.18 | 1143.59 | |

(b) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (y \| r. effects) | 1006.38 |
| Pearson's chi-square | 484.48 |
| Pearson's chi-square/DF | 5.77 |

Before interpreting the GLMM, we compare the fits obtained under the different covariance structures, assuming a Poisson distribution for the response variable. According to the fit statistics, the covariance structure that best models the data is the first-order autoregressive structure (AR(1)). The value of the conditional distribution fit statistic, Pearson's chi-square/DF = 5.77, indicates that there is extra variation (overdispersion) and that the Poisson distribution does not adequately fit the data (Table 9.6).

Since there is overdispersion in the data, a highly recommended alternative is to find a more appropriate distribution for this dataset. In this case, the linear predictor remains the same, but a negative binomial distribution is now assumed for the response variable. That is,

$y_{ijkl} \mid b_l, \text{insecticide} \times \text{clone(block)}_{ij(l)} \sim \text{Negative Binomial}(\lambda_{ijkl}, \phi)$

This negative binomial model arises by assuming that the conditional distribution of the observations, given the random effects of blocks and insecticide × clone(block)$_{ij(l)}$, is $y_{ijkl} \mid b_l, \text{insecticide} \times \text{clone(block)}_{ij(l)} \sim \text{Poisson}(\lambda_{ijkl})$, where $\lambda_{ijkl} \sim \text{Gamma}(1/\phi, \phi)$.
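The effect of the gamma mixing step can be checked numerically. The sketch below is a minimal Python illustration under one common reading of the statement above: the gamma variable has shape 1/ϕ and scale ϕ (mean 1, variance ϕ) and multiplies the Poisson mean λ, which gives the closed-form negative binomial probability function used here, with mean λ and variance λ + ϕλ². The values λ = 50 and ϕ = 0.2 are illustrative only and are not estimates from the aphid data.

```python
import math

def negbin_pmf(y, mean, scale):
    """P(Y = y) for a negative binomial with E[Y] = mean and
    Var[Y] = mean + scale * mean**2 (gamma shape r = 1 / scale)."""
    r = 1.0 / scale
    p = r / (r + mean)
    log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)

def moments(mean, scale, upper=2000):
    """Total probability, mean, and variance by direct summation up to `upper`."""
    probs = [negbin_pmf(y, mean, scale) for y in range(upper)]
    total = sum(probs)
    m = sum(y * q for y, q in enumerate(probs))
    v = sum((y - m) ** 2 * q for y, q in enumerate(probs))
    return total, m, v

lam, phi = 50.0, 0.2          # illustrative values (assumption)
total, m, v = moments(lam, phi)
print(f"total={total:.6f}  mean={m:.3f}  variance={v:.3f}")
print(f"Poisson variance would be {lam:.1f}; lam + phi*lam^2 = {lam + phi * lam**2:.1f}")
```

The variance exceeds the mean by the factor 1 + ϕλ, which is how the extra scale parameter absorbs variation that a Poisson model cannot represent.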
The resulting distribution of $y_{ijkl} \mid b_l, \text{insecticide} \times \text{clone(block)}_{ij(l)}$ is a negative binomial, Negative Binomial($\lambda_{ijkl}$, $\phi$). The link function is $\log(\lambda_{ijkl}) = \eta_{ijkl}$. The following SAS code fits the GLMM with a negative binomial distribution.

    proc glimmix nobound method=laplace;
      class ID Block Insecticide Cage Clone time;
      model y = Insecticide|clone|time/dist=negbin;
      random intercept Insecticide*Clone/subject=block;
      lsmeans Insecticide|Clone|time/lines ilink;
    run;

Table 9.7 Fit statistics

(a) Fit statistics

| Statistic | Value |
|---|---|
| -2 Log likelihood | 878.02 |
| AIC (smaller is better) | 932.02 |
| AICC (smaller is better) | 956.41 |
| BIC (smaller is better) | 915.45 |
| CAIC (smaller is better) | 942.45 |
| HQIC (smaller is better) | 895.66 |

(b) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (y \| r. effects) | 841.84 |
| Pearson's chi-square | 72.47 |
| Pearson's chi-square/DF | 0.81 |

Table 9.8 Estimated variance components and tests of fixed effects

(a) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Variance | Block | 0.06138 | 0.03429 |
| AR(1) | Block | -0.7143 | 0.1710 |
| Scale | | 0.1654 | 0.03575 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Insecticide | 3 | 19 | 23.04 | <0.0001 |
| Clone | 1 | 19 | 8.60 | 0.0086 |
| Insecticide*clone | 3 | 19 | 2.25 | 0.1161 |
| Time | 2 | 44 | 6.08 | 0.0047 |
| Insecticide*time | 6 | 44 | 7.93 | <0.0001 |
| Clone*time | 2 | 44 | 3.90 | 0.0275 |
| Insecticide*clone*time | 6 | 44 | 2.15 | 0.0663 |

Part of the results is shown in Table 9.7. The values of the fit statistics, assuming a negative binomial distribution of the data, are shown in part (a), and the value of the conditional distribution statistic is given in part (b) (Pearson's chi-square/DF = 0.81). This indicates that overdispersion has been eliminated, so the negative binomial distribution adequately models the response variable. The estimated variance components are shown in part (a) of Table 9.8, under an AR(1) covariance structure.
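The "ilink" option in the "lsmeans" statement of the program above reports each least squares mean on the data scale as well as on the model scale. For the log link used here, the data-scale mean is the exponential of the estimate. The minimal Python sketch below also approximates the standard error of the mean with the delta method (mean times the model-scale standard error); that this is how the reported standard errors are obtained is an assumption, although it reproduces the insecticide rows of Table 9.9 closely. The function name is illustrative.

```python
import math

def back_transform_log(estimate, std_err):
    """Return (mean, approximate SE of the mean) for a log-link estimate.

    The SE uses the delta method: d/d(eta) exp(eta) = exp(eta).
    """
    mean = math.exp(estimate)
    return mean, mean * std_err

# Model-scale estimates and standard errors from Table 9.9.
insecticide_lsmeans = {
    "C": (4.7344, 0.1478),
    "D": (3.9647, 0.1547),
    "H": (4.3733, 0.1561),
    "P": (3.4892, 0.1753),
}

results = {label: back_transform_log(est, se)
           for label, (est, se) in insecticide_lsmeans.items()}

for label, (mean, se_mean) in results.items():
    print(f"{label}: mean = {mean:8.2f}   SE(mean) = {se_mean:6.2f}")
```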
The estimates of the block variance, the AR(1) parameter, and the scale parameter are $\hat{\sigma}^2_{block} = 0.06138$, $-0.7143$, and $\hat{\phi} = 0.1654$, respectively (Table 9.8, part (a)). The type III tests of fixed effects (part (b)) indicate that there is a significant effect of insecticide type (P < 0.0001), clone (P = 0.0086), measurement time (P = 0.0047), and the interactions insecticide × measurement time (P < 0.0001) and clone × measurement time (P = 0.0275) on the average number of aphids. The interaction insecticide × clone × measurement time is close to significance (P = 0.0663).

The linear predictors and estimated means of the factors and their interactions are listed under the "Estimate" and "Mean" columns, respectively. The average numbers of aphids for insecticide, clone, time, and their interactions are given below.

For insecticide type (Table 9.9):

Table 9.9 Estimates of insecticide least squares (LS) means on the model scale (Estimate) and the data scale (Mean)

| Insecticide | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| C | 4.7344 | 0.1478 | 19 | 32.03 | <0.0001 | 113.79 | 16.8211 |
| D | 3.9647 | 0.1547 | 19 | 25.62 | <0.0001 | 52.7043 | 8.1553 |
| H | 4.3733 | 0.1561 | 19 | 28.02 | <0.0001 | 79.3010 | 12.3753 |
| P | 3.4892 | 0.1753 | 19 | 19.90 | <0.0001 | 32.7588 | 5.7432 |

For clone (Table 9.10):

Table 9.10 Clone least squares means on the model scale (Estimate) and the data scale (Mean)

| Clone | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| R | 4.4332 | 0.1320 | 19 | 33.58 | <0.0001 | 84.1990 | 11.1158 |
| S | 3.8476 | 0.1890 | 19 | 20.36 | <0.0001 | 46.8785 | 8.8586 |

For the interaction insecticide*clone (Table 9.11):

Table 9.11 Insecticide*clone least squares means on the model scale (Estimate) and the data scale (Mean)

| Insecticide | Clone | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| C | R | 4.8836 | 0.1529 | 19 | 31.93 | <0.0001 | 132.10 | 20.2032 |
| C | S | 4.5852 | 0.2479 | 19 | 18.49 | <0.0001 | 98.0186 | 24.3008 |
| D | R | 4.0521 | 0.1886 | 19 | 21.49 | <0.0001 | 57.5153 | 10.8459 |
| D | S | 3.8773 | 0.2337 | 19 | 16.59 | <0.0001 | 48.2958 | 11.2858 |
| H | R | 4.7106 | 0.1997 | 19 | 23.59 | <0.0001 | 111.11 | 22.1870 |
| H | S | 4.0359 | 0.2244 | 19 | 17.98 | <0.0001 | 56.5964 | 12.7026 |
| P | R | 4.0866 | 0.2322 | 19 | 17.60 | <0.0001 | 59.5346 | 13.8263 |
| P | S | 2.8918 | 0.2534 | 19 | 11.41 | <0.0001 | 18.0255 | 4.5675 |

For measurement time (Table 9.12):

Table 9.12 Time least squares means on the model scale (Estimate) and the data scale (Mean)

| Time | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 4.2730 | 0.1434 | 44 | 29.79 | <0.0001 | 71.7375 | 10.2905 |
| 2 | 3.9108 | 0.1454 | 44 | 26.90 | <0.0001 | 49.9372 | 7.2603 |
| 3 | 4.2373 | 0.1457 | 44 | 29.09 | <0.0001 | 69.2231 | 10.0830 |

For the interaction insecticide*time (Table 9.13):

Table 9.13 Insecticide*time least squares means on the model scale (Estimate) and the data scale (Mean)

| Insecticide | Time | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| C | 1 | 4.2381 | 0.1930 | 44 | 21.95 | <0.0001 | 69.2781 | 13.3733 |
| C | 2 | 4.6631 | 0.1913 | 44 | 24.38 | <0.0001 | 105.96 | 20.2696 |
| C | 3 | 5.3019 | 0.1898 | 44 | 27.94 | <0.0001 | 200.71 | 38.0898 |
| D | 1 | 4.4854 | 0.1965 | 44 | 22.83 | <0.0001 | 88.7111 | 17.4275 |
| D | 2 | 3.7061 | 0.2014 | 44 | 18.40 | <0.0001 | 40.6940 | 8.1959 |
| D | 3 | 3.7026 | 0.2035 | 44 | 18.20 | <0.0001 | 40.5537 | 8.2517 |
| H | 1 | 4.4718 | 0.1978 | 44 | 22.60 | <0.0001 | 87.5164 | 17.3133 |
| H | 2 | 3.9790 | 0.2016 | 44 | 19.73 | <0.0001 | 53.4625 | 10.7804 |
| H | 3 | 4.6689 | 0.1977 | 44 | 23.62 | <0.0001 | 106.59 | 21.0694 |
| P | 1 | 3.8967 | 0.2241 | 44 | 17.39 | <0.0001 | 49.2403 | 11.0358 |
| P | 2 | 3.2949 | 0.2357 | 44 | 13.98 | <0.0001 | 26.9755 | 6.3583 |
| P | 3 | 3.2759 | 0.2399 | 44 | 13.65 | <0.0001 | 26.4664 | 6.3502 |

For the interaction clone*time (Table 9.14):

Table 9.14 Clone*time least squares means on the model scale (Estimate) and the data scale (Mean)

| Clone | Time | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| R | 1 | 4.3839 | 0.1595 | 44 | 27.49 | <0.0001 | 80.1482 | 12.7828 |
| R | 2 | 4.2826 | 0.1605 | 44 | 26.68 | <0.0001 | 72.4270 | 11.6256 |
| R | 3 | 4.6331 | 0.1601 | 44 | 28.94 | <0.0001 | 102.83 | 16.4644 |
| S | 1 | 4.1621 | 0.2092 | 44 | 19.90 | <0.0001 | 64.2093 | 13.4323 |
| S | 2 | 3.5390 | 0.2131 | 44 | 16.60 | <0.0001 | 34.4308 | 7.3387 |
| S | 3 | 3.8416 | 0.2144 | 44 | 17.91 | <0.0001 | 46.5989 | 9.9931 |

For the interaction insecticide*clone*time (Table 9.15):

Table 9.15 Insecticide*clone*time least squares means on the model scale (Estimate) and the data scale (Mean)

| Insecticide | Clone | Time | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|---|
| C | R | 1 | 4.2592 | 0.2321 | 44 | 18.35 | <0.0001 | 70.7546 | 16.4256 |
| C | R | 2 | 4.9631 | 0.2280 | 44 | 21.76 | <0.0001 | 143.04 | 32.6184 |
| C | R | 3 | 5.4284 | 0.2269 | 44 | 23.92 | <0.0001 | 227.78 | 51.6892 |
| C | S | 1 | 4.2170 | 0.3044 | 44 | 13.86 | <0.0001 | 67.8323 | 20.6460 |
| C | S | 2 | 4.3630 | 0.3031 | 44 | 14.40 | <0.0001 | 78.4950 | 23.7891 |
| C | S | 3 | 5.1754 | 0.2999 | 44 | 17.26 | <0.0001 | 176.87 | 53.0366 |
| D | R | 1 | 4.4161 | 0.2538 | 44 | 17.40 | <0.0001 | 82.7767 | 21.0081 |
| D | R | 2 | 3.8625 | 0.2587 | 44 | 14.93 | <0.0001 | 47.5825 | 12.3088 |
| D | R | 3 | 3.8775 | 0.2623 | 44 | 14.78 | <0.0001 | 48.3054 | 12.6694 |
| D | S | 1 | 4.5546 | 0.2908 | 44 | 15.66 | <0.0001 | 95.0710 | 27.6437 |
| D | S | 2 | 3.5497 | 0.2994 | 44 | 11.86 | <0.0001 | 34.8028 | 10.4203 |
| D | S | 3 | 3.5277 | 0.3026 | 44 | 11.66 | <0.0001 | 34.0460 | 10.3007 |
| H | R | 1 | 4.8146 | 0.2622 | 44 | 18.36 | <0.0001 | 123.30 | 32.3313 |
| H | R | 2 | 4.4776 | 0.2639 | 44 | 16.97 | <0.0001 | 88.0246 | 23.2294 |
| H | R | 3 | 4.8395 | 0.2644 | 44 | 18.30 | <0.0001 | 126.40 | 33.4231 |
| H | S | 1 | 4.1291 | 0.2839 | 44 | 14.54 | <0.0001 | 62.1193 | 17.6362 |
| H | S | 2 | 3.4803 | 0.2935 | 44 | 11.86 | <0.0001 | 32.4709 | 9.5292 |
| H | S | 3 | 4.4984 | 0.2823 | 44 | 15.93 | <0.0001 | 89.8763 | 25.3740 |
| P | R | 1 | 4.0456 | 0.3077 | 44 | 13.15 | <0.0001 | 57.1426 | 17.5807 |
| P | R | 2 | 3.8271 | 0.3117 | 44 | 12.28 | <0.0001 | 45.9304 | 14.3176 |
| P | R | 3 | 4.3870 | 0.3060 | 44 | 14.34 | <0.0001 | 80.3983 | 24.5990 |
| P | S | 1 | 3.7479 | 0.3189 | 44 | 11.75 | <0.0001 | 42.4308 | 13.5319 |
| P | S | 2 | 2.7627 | 0.3453 | 44 | 8.00 | <0.0001 | 15.8430 | 5.4700 |
| P | S | 3 | 2.1648 | 0.3634 | 44 | 5.96 | <0.0001 | 8.7125 | 3.1659 |

9.4 Manufacture of Livestock Feed

In this experiment, two types of pelleted feed were manufactured using different amounts of whole sorghum. Using the whole grain resulted in one feed with a high pellet durability index (PDI) and one with a low PDI. The researcher was interested in how much impact this difference in PDI would have on the amount of intact and pelleted feed distributed to the different positions along the feeding line. The line was fed four times with the high PDI feed and four times with the low PDI feed. After each run, the total weight of the feed in each of the 12 identified trays was measured. The feed in each tray was then sieved, and the crushed fine granules were weighed. The response of interest was the ratio (proportion) between the weight of fine granules and the total weight of the feed for each tray. The data for this experiment are in the Appendix (Data: Feeding line experiment).

The experimental design used in this study was a split plot in a completely randomized design. There were two fixed factors: feed, with 2 levels (high PDI feed (H) and low PDI feed (L)), and tray, with 12 levels (locations 1, 2, 3, ..., 12 along the feed line). Different run levels (runs 1, 2, 3, and 4 in the feed line) may influence the inference of this experiment, so it is advisable to analyze which variance structure is suitable for this analysis. The ANOVA table (Table 9.16) with degrees of freedom for this experiment is shown below.

Table 9.16 Results of the analysis of variance of the experiment

| Sources of variation | Degrees of freedom |
|---|---|
| Feed | (2 - 1) = 1 |
| Run (within feed) | 2(4 - 1) = 6 |
| Tray | (12 - 1) = 11 |
| Feed*tray | (2 - 1)(12 - 1) = 11 |
| Tray*run (within feed) | 2(12 - 1)(4 - 1) = 66 |
| Total | a × b × r - 1 = 95 |

The researcher aims to draw conclusions about the destructiveness of the feed line for the two types of feed, high PDI and low PDI. The following GLMM is used to describe the experiment:

$y_{ijk} = \mu + \alpha_i + \alpha(r)_{ik} + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$

where $y_{ijk}$ is the proportion observed in run $k$ ($k = 1, 2, 3, 4$), tray $j$ ($j = 1, 2, \ldots, 12$), and feed $i$ ($i = 1, 2$); $\mu$ is the overall mean; $\alpha_i$ is the fixed effect of feed $i$; $\alpha(r)_{ik}$ is the random effect of the $k$th run within the $i$th feed, assuming $\alpha(r)_{ik} \sim N(0, \sigma^2_{\alpha r})$; $\beta_j$ is the fixed effect due to the $j$th tray; $(\alpha\beta)_{ij}$ is the effect of the interaction between the $i$th feed and the $j$th tray; and $\varepsilon_{ijk}$ is the experimental error.

The components of the conditional GLMM, assuming that the response variable follows a beta distribution, are listed below. The distribution of the response variable is $y_{ijk} \mid \alpha(r)_{ik} \sim \text{Beta}(\pi_{ijk}, \phi)$, the linear predictor is $\eta_{ijk} = \mu + \alpha_i + \alpha(r)_{ik} + \beta_j + (\alpha\beta)_{ij}$, and the link function is $\text{logit}(\pi_{ijk}) = \log\left(\frac{\pi_{ijk}}{1 - \pi_{ijk}}\right) = \eta_{ijk}$. The following GLIMMIX syntax fits a GLMM with a beta distribution.

    proc glimmix method=laplace;
      class tray feed run;
      model ratio = feed|tray/dist=beta;
      random intercept/subject=feed(run) type=toep(1);
      lsmeans feed|tray/lines ilink;
    run;

Part of the output is shown below. Four covariance structures ("CS," "AR(1)," "Toep(1)," and "UN") were tested to see which one best fits the response variable. Of these covariance structures, "Toep(1)" produced the best fit statistics (part (a), Table 9.17).
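The information criteria used to compare covariance structures are all penalized versions of the -2 log likelihood. The minimal Python sketch below uses the standard definitions of these criteria; the number of parameters (d = 26), the number of observations (n = 95), and the number of subjects (s = 8, the feed-by-run combinations) are not stated in the text and are assumptions, chosen because they reproduce part (a) of Table 9.17 to rounding. The function name is illustrative.

```python
import math

def information_criteria(neg2loglik, n_params, n_obs, n_subjects):
    """Penalized-likelihood criteria computed from -2 log likelihood.

    AICC uses the number of observations; BIC, CAIC, and HQIC use the
    number of subjects (an assumption that matches Table 9.17).
    """
    d, n, s = n_params, n_obs, n_subjects
    return {
        "AIC": neg2loglik + 2 * d,
        "AICC": neg2loglik + 2 * d * n / (n - d - 1),
        "BIC": neg2loglik + d * math.log(s),
        "CAIC": neg2loglik + d * (math.log(s) + 1),
        "HQIC": neg2loglik + 2 * d * math.log(math.log(s)),
    }

# -2 log likelihood of the beta GLMM with Toep(1), from Table 9.17 (a).
criteria = information_criteria(-429.84, n_params=26, n_obs=95, n_subjects=8)
for name, value in criteria.items():
    print(f"{name:5s} {value:10.2f}")
```

Because every criterion adds a penalty that grows with the number of covariance parameters, two structures with the same -2 log likelihood are ranked in favor of the one with fewer parameters.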
Another important result that guides the rest of the analysis is the conditional distribution statistic (Pearson's chi-square/DF = 0.96), whose value indicates that the beta model adequately fits the data, whereas the fixed effects tests (part (c)) indicate that there is a statistically significant effect of feed type (P < 0.0001) and tray (P < 0.0001).

Table 9.17 Results of the analysis of variance

(a) Fit statistics

| Statistic | Value |
|---|---|
| -2 Log likelihood | -429.84 |
| AIC (smaller is better) | -377.84 |
| AICC (smaller is better) | -357.19 |
| BIC (smaller is better) | -375.78 |
| CAIC (smaller is better) | -349.78 |
| HQIC (smaller is better) | -391.77 |

(b) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (ratio \| r. effects) | -453.24 |
| Pearson's chi-square | 91.40 |
| Pearson's chi-square/DF | 0.96 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Feed | 1 | 6 | 1071.19 | <0.0001 |
| Tray | 11 | 65 | 18.03 | <0.0001 |
| Tray*feed | 11 | 65 | 1.83 | 0.0660 |

The linear predictors and estimated proportions of the factors and their interaction are listed under the "Estimate" and "Mean" columns of the following tables, respectively.

For the feed type (Table 9.18):

Table 9.18 Feed least squares means on the model scale (Estimate) and the data scale (Mean)

| Feed | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| H | -2.0009 | 0.07409 | 6 | -27.01 | <0.0001 | 0.1191 | 0.007773 |
| L | 1.3832 | 0.07208 | 6 | 19.19 | <0.0001 | 0.7995 | 0.01155 |

For the tray (Table 9.19):

Table 9.19 Tray least squares means on the model scale (Estimate) and the data scale (Mean)

| Tray | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| 01 | -0.5652 | 0.08182 | 65 | -6.91 | <0.0001 | 0.3623 | 0.01891 |
| 02 | -0.6607 | 0.08531 | 65 | -7.74 | <0.0001 | 0.3406 | 0.01916 |
| 03 | -0.6950 | 0.08822 | 65 | -7.88 | <0.0001 | 0.3329 | 0.01959 |
| 04 | -0.2958 | 0.08100 | 65 | -3.65 | 0.0005 | 0.4266 | 0.01981 |
| 05 | -0.3773 | 0.08212 | 65 | -4.59 | <0.0001 | 0.4068 | 0.01982 |
| 06 | -0.2947 | 0.08057 | 65 | -3.66 | 0.0005 | 0.4268 | 0.01971 |
| 07 | -0.3520 | 0.08165 | 65 | -4.31 | <0.0001 | 0.4129 | 0.01979 |
| 08 | -0.2992 | 0.07939 | 65 | -3.77 | 0.0004 | 0.4258 | 0.01941 |
| 09 | -0.1314 | 0.07670 | 65 | -1.71 | 0.0916 | 0.4672 | 0.01909 |
| 10 | -0.3935 | 0.08096 | 65 | -4.86 | <0.0001 | 0.4029 | 0.01948 |
| 11 | 0.1571 | 0.07860 | 65 | 2.00 | 0.0499 | 0.5392 | 0.01953 |
| 12 | 0.2014 | 0.07949 | 65 | 2.53 | 0.0137 | 0.5502 | 0.01967 |

For the interaction feed*tray (Table 9.20):

Table 9.20 Tray*feed least squares means on the model scale (Estimate) and the data scale (Mean)

| Tray | Feed | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| 01 | H | -2.2408 | 0.1284 | 65 | -17.46 | <0.0001 | 0.09614 | 0.01116 |
| 01 | L | 1.1104 | 0.1015 | 65 | 10.94 | <0.0001 | 0.7522 | 0.01892 |
| 02 | H | -2.4581 | 0.1369 | 65 | -17.95 | <0.0001 | 0.07885 | 0.009946 |
| 02 | L | 1.1367 | 0.1018 | 65 | 11.17 | <0.0001 | 0.7571 | 0.01872 |
| 03 | H | -2.4724 | 0.1375 | 65 | -17.98 | <0.0001 | 0.07782 | 0.009869 |
| 03 | L | 1.0823 | 0.1105 | 65 | 9.79 | <0.0001 | 0.7469 | 0.02089 |
| 04 | H | -2.0307 | 0.1217 | 65 | -16.69 | <0.0001 | 0.1160 | 0.01248 |
| 04 | L | 1.4391 | 0.1070 | 65 | 13.45 | <0.0001 | 0.8083 | 0.01658 |
| 05 | H | -2.1481 | 0.1254 | 65 | -17.13 | <0.0001 | 0.1045 | 0.01174 |
| 05 | L | 1.3935 | 0.1061 | 65 | 13.13 | <0.0001 | 0.8011 | 0.01690 |
| 06 | H | -2.0087 | 0.1208 | 65 | -16.62 | <0.0001 | 0.1183 | 0.01260 |
| 06 | L | 1.4192 | 0.1066 | 65 | 13.31 | <0.0001 | 0.8052 | 0.01673 |
| 07 | H | -2.1026 | 0.1242 | 65 | -16.93 | <0.0001 | 0.1088 | 0.01204 |
| 07 | L | 1.3987 | 0.1061 | 65 | 13.18 | <0.0001 | 0.8020 | 0.01685 |
| 08 | H | -1.9310 | 0.1192 | 65 | -16.20 | <0.0001 | 0.1266 | 0.01318 |
| 08 | L | 1.3325 | 0.1050 | 65 | 12.69 | <0.0001 | 0.7913 | 0.01734 |
| 09 | H | -1.6240 | 0.1113 | 65 | -14.59 | <0.0001 | 0.1647 | 0.01531 |
| 09 | L | 1.3613 | 0.1056 | 65 | 12.89 | <0.0001 | 0.7960 | 0.01715 |
| 10 | H | -2.0863 | 0.1238 | 65 | -16.85 | <0.0001 | 0.1104 | 0.01216 |
| 10 | L | 1.2994 | 0.1044 | 65 | 12.45 | <0.0001 | 0.7857 | 0.01757 |
| 11 | H | -1.4559 | 0.1075 | 65 | -13.55 | <0.0001 | 0.1891 | 0.01648 |
| 11 | L | 1.7701 | 0.1148 | 65 | 15.42 | <0.0001 | 0.8545 | 0.01427 |
| 12 | H | -1.4519 | 0.1076 | 65 | -13.50 | <0.0001 | 0.1897 | 0.01653 |
| 12 | L | 1.8548 | 0.1171 | 65 | 15.83 | <0.0001 | 0.8647 | 0.01371 |

9.5 Characterization of Spatial and Temporal Variations in Fecal Coliform Density

During a 1-month period (June 1981), 30 river water samples were collected from the channel at 3 stations, A, B, and C (downstream to upstream), on 5 randomly selected days at 9:00 a.m. and 3:00 p.m. (1 sample per station per sampling time per day). Each sample was analyzed for fecal coliform by method FC-96. The data from this experiment are shown in Table 9.21.

Table 9.21 Variation in fecal coliform densities of the river water samples from three sampling stations on five sampling days at 9:00 a.m. (TM = 1) and 3:00 p.m. (TM = 2)

| Sampling date | TM | Site | No. of coliforms per milliliter |
|---|---|---|---|
| 18 May | 9:00 a.m. | A | 648 |
| 18 May | 3:00 p.m. | A | 798 |
| 18 May | 9:00 a.m. | B | 517 |
| 18 May | 3:00 p.m. | B | 702 |
| 18 May | 9:00 a.m. | C | 532 |
| 18 May | 3:00 p.m. | C | 55 |
| 26 May | 9:00 a.m. | A | 1421 |
| 26 May | 3:00 p.m. | A | 1388 |
| 26 May | 9:00 a.m. | B | 1883 |
| 26 May | 3:00 p.m. | B | 1855 |
| 26 May | 9:00 a.m. | C | 1724 |
| 26 May | 3:00 p.m. | C | 1769 |
| 29 May | 9:00 a.m. | A | 1523 |
| 29 May | 3:00 p.m. | A | 759 |
| 29 May | 9:00 a.m. | B | 1361 |
| 29 May | 3:00 p.m. | B | 603 |
| 29 May | 9:00 a.m. | C | 2004 |
| 29 May | 3:00 p.m. | C | 541 |
| 1 June | 9:00 a.m. | A | 1987 |
| 1 June | 3:00 p.m. | A | 1056 |
| 1 June | 9:00 a.m. | B | 1796 |
| 1 June | 3:00 p.m. | B | 1579 |
| 1 June | 9:00 a.m. | C | 1221 |
| 1 June | 3:00 p.m. | C | 1223 |
| 5 June | 9:00 a.m. | A | 870 |
| 5 June | 3:00 p.m. | A | 1099 |
| 5 June | 9:00 a.m. | B | 920 |
| 5 June | 3:00 p.m. | B | 951 |
| 5 June | 9:00 a.m. | C | 926 |
| 5 June | 3:00 p.m. | C | 887 |

To assess the relative magnitudes of the sources of variation due to time, site, and subsampling on the number of coliforms per milliliter ($y_{ijk}$), an analysis of variance using a GLMM with a Poisson response was performed, as described below. We denote by $y_{ijk}$ the number of colonies per milliliter, whose conditional distribution is $y_{ijk} \mid \text{sampling(site)}_{ik} \sim \text{Poisson}(\lambda_{ijk})$, with the linear predictor $\eta_{ijk}$ defined by

$\eta_{ijk} = \theta + \text{site}_i + \text{sampling(site)}_{ik} + \text{time}_j + (\text{site} \times \text{time})_{ij}$, $\quad (i = 1, 2, 3;\ j = 1, 2, 3, 4, 5;\ k = 1, 2)$

where $\eta_{ijk}$ is the linear predictor, $\theta$ is the intercept, $\text{site}_i$ is the fixed effect due to sampling site $i$, $\text{sampling(site)}_{ik}$ is the random effect due to the sampling time nested within the site, assuming $\text{sampling(site)}_{ik} \sim N(0, \sigma^2_{sampling(site)})$, $\text{time}_j$ is the fixed effect due to the sampling date, and $(\text{site} \times \text{time})_{ij}$ is the effect of the interaction between the site and the sampling date. The link function for this model is $\log(\lambda_{ijk}) = \eta_{ijk}$.

Table 9.22 Results of the analysis of variance

(a) Covariance structures

| Fit statistics | Toep(1) | CS | AR(1) | UN |
|---|---|---|---|---|
| -2 Log likelihood | 2022.64 | 2022.64 | 2022.64 | 2022.64 |
| AIC (smaller is better) | 2054.64 | 2056.64 | 2056.64 | 2054.64 |
| AICC (smaller is better) | 2096.49 | 2107.64 | 2107.64 | 2096.49 |
| BIC (smaller is better) | 2051.31 | 2053.10 | 2053.10 | 2051.31 |
| CAIC (smaller is better) | 2067.31 | 2070.10 | 2070.10 | 2067.31 |
| HQIC (smaller is better) | 2041.31 | 2042.47 | 2042.47 | 2041.31 |

(b) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (ufc \| r. effects) | 1989.36 |
| Pearson's chi-square | 1632.28 |
| Pearson's chi-square/DF | 54.41 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Site | 2 | 3 | 1.41 | 0.3700 |
| T | 4 | 12 | 956.04 | <0.0001 |
| T*site | 8 | 12 | 82.44 | <0.0001 |
The following GLIMMIX syntax fits a GLMM with a Poisson response.

    proc glimmix data=ufc nobound method=laplace;
      class T TM Site;
      model ufc = Site|T/dist=Poisson link=log;
      random intercept/subject=TM(Site) type=toep(1);
      lsmeans Site|T/lines ilink;
    run;

Part of the results is summarized in Table 9.22. To determine which covariance structure best models the response variable, four types were tested (part (a)), all of which produced very similar results. Because of these results, the "Toep(1)" covariance structure was chosen. Under this structure, the value of the conditional distribution statistic is Pearson's chi-square/DF = 54.41. This value indicates that there is strong overdispersion in the dataset. Therefore, it is important to look for an alternative distribution that solves this problem.

The hypothesis tests in part (c) indicate that there is a significant effect of the sampling date (P < 0.0001) as well as of the interaction between the site and the sampling date (P < 0.0001). That is, the concentration of fecal coliform units per milliliter is affected by the date of data collection. However, we observed that there is excessive dispersion in the data. One way to check for and deal with overdispersion is to fit a quasi-Poisson model, which adds a dispersion parameter during the fitting process to account for the additional variance. Another option is to look for a distribution that adequately fits the data; in this case, the negative binomial distribution is a good alternative.

Next, we implement the analysis assuming that the response variable follows a negative binomial distribution. This means that the distribution of $y_{ijk}$ (number of colonies per milliliter) is $y_{ijk} \mid \text{sampling(site)}_{ik} \sim \text{Negative Binomial}(\lambda_{ijk}, \phi)$, where $\phi$ is the scale parameter. The linear predictor $\eta_{ijk}$ and the link function remain unchanged. The following GLIMMIX commands fit a GLMM with a negative binomial distribution.

    proc glimmix data=ufc nobound method=laplace;
      class T TM Site;
      model ufc = Site|T/dist=negbin;
      random intercept/subject=TM(Site) type=Toep(1);
      lsmeans Site|T/lines ilink;
    run;

Part of the output of the above program is shown below.

Table 9.23 Fit statistics under the negative binomial distribution

(a) Fit statistics

| Statistic | Value |
|---|---|
| -2 Log likelihood | 436.98 |
| AIC (smaller is better) | 470.98 |
| AICC (smaller is better) | 521.98 |
| BIC (smaller is better) | 467.44 |
| CAIC (smaller is better) | 484.44 |
| HQIC (smaller is better) | 456.81 |

(b) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (ufc \| r. effects) | 432.83 |
| Pearson's chi-square | 22.66 |
| Pearson's chi-square/DF | 0.76 |

The values of the fit statistics under the negative binomial distribution (part (a) of Table 9.23) are much smaller than those obtained under the Poisson model, indicating that the negative binomial distribution fits the response variable better. Furthermore, the value of the conditional distribution statistic indicates that the negative binomial distribution is a good distribution for these data (Pearson's chi-square/DF = 0.76). This statistic measures how large the observed variation is relative to the variation implied by the fitted model. Since its value is below 1 (part (b)), the data show no more variation than the negative binomial model predicts, indicating that overdispersion has been removed in the fitting of the data.
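The overdispersion check used in this example is simple arithmetic on the reported fit statistics. The minimal Python sketch below shows the definition of Pearson's chi-square for a given variance function and recomputes the two ratios; it assumes that the divisor is the 30 water samples, which is not stated in the output but reproduces the reported values of 54.41 and 0.76. The function names are illustrative.

```python
def pearson_chi_square(observed, fitted, variance):
    """Sum of squared Pearson residuals: (y - mu)^2 / V(mu)."""
    return sum((y - mu) ** 2 / variance(mu) for y, mu in zip(observed, fitted))

def dispersion_ratio(chi_square, df):
    """Pearson chi-square divided by its degrees of freedom.

    Values near 1 suggest the assumed variance function is adequate;
    values far above 1 suggest overdispersion.
    """
    return chi_square / df

n_obs = 30  # assumption: one divisor unit per water sample

# Pearson chi-square values reported in Tables 9.22 (b) and 9.23 (b).
poisson_ratio = dispersion_ratio(1632.28, n_obs)
negbin_ratio = dispersion_ratio(22.66, n_obs)

print(f"Poisson:           {poisson_ratio:.2f}")
print(f"Negative binomial: {negbin_ratio:.2f}")
```

For a Poisson fit, the variance function passed to `pearson_chi_square` would be `lambda mu: mu`; for the negative binomial fit it would be `lambda mu: mu + phi * mu**2`, with `phi` the estimated scale parameter.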
Another direct effect of removing the overdispersion is observed in the F-values of the fixed effects tests (Table 9.24). In this case, the date on which the samples were collected was significant, but the interaction between the two factors was not, unlike the case in which the data were fitted using the Poisson GLMM.

Table 9.24 Type III fixed effects tests

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Site | 2 | 3 | 0.78 | 0.5346 |
| T | 4 | 12 | 11.57 | 0.0004 |
| T*site | 8 | 12 | 1.13 | 0.4096 |

The linear predictors and estimated means of the main effects and of the interaction between both factors are under the columns "Estimate" and "Mean," respectively. The sampling site averages are presented below (Table 9.25).

Table 9.25 Means and standard errors on the model scale (Estimate) and on the data scale (Mean) of the sampling site data

| Site | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| A | 7.0195 | 0.1270 | 3 | 55.25 | <0.0001 | 1118.25 | 142.06 |
| B | 7.0243 | 0.1271 | 3 | 55.28 | <0.0001 | 1123.57 | 142.78 |
| C | 6.8237 | 0.1305 | 3 | 52.30 | <0.0001 | 919.40 | 119.95 |

The averages by sampling date are listed below (Table 9.26).

Table 9.26 Means and standard errors of measurement time on the model scale (Estimate) and the data scale (Mean)

| T | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | 6.2084 | 0.1467 | 12 | 42.32 | <0.0001 | 496.91 | 72.8990 |
| 2 | 7.4212 | 0.1420 | 12 | 52.27 | <0.0001 | 1670.97 | 237.23 |
| 3 | 7.0074 | 0.1455 | 12 | 48.16 | <0.0001 | 1104.74 | 160.75 |
| 4 | 7.2910 | 0.1418 | 12 | 51.42 | <0.0001 | 1466.97 | 208.00 |
| 5 | 6.8513 | 0.1422 | 12 | 48.19 | <0.0001 | 945.09 | 134.35 |

The means of the interaction site × sampling date are shown below (Table 9.27).

Table 9.27 Means and standard errors for the interaction T*site on the model scale (Estimate) and the data scale (Mean)

| Site | T | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| A | 1 | 6.5905 | 0.2463 | | 26.76 | <0.0001 | 728.17 | 179.35 |
| B | 1 | 6.4197 | 0.2466 | | 26.03 | <0.0001 | 613.79 | 151.38 |
| C | 1 | 5.6151 | 0.2772 | | 20.26 | <0.0001 | 274.53 | 76.1038 |
| A | 2 | 7.2508 | 0.2452 | | 29.57 | <0.0001 | 1409.17 | 345.59 |
| B | 2 | 7.5367 | 0.2451 | | 30.75 | <0.0001 | 1875.54 | 459.64 |
| C | 2 | 7.4761 | 0.2463 | | 30.36 | <0.0001 | 1765.30 | 434.71 |
| A | 3 | 7.0336 | 0.2461 | | 28.58 | <0.0001 | 1134.09 | 279.14 |
| B | 3 | 6.8855 | 0.2465 | | 27.94 | <0.0001 | 978.01 | 241.05 |
| C | 3 | 7.1030 | 0.2586 | | 27.47 | <0.0001 | 1215.59 | 314.37 |
| A | 4 | 7.3224 | 0.2458 | | 29.79 | <0.0001 | 1513.87 | 372.07 |
| B | 4 | 7.4329 | 0.2450 | | 30.33 | <0.0001 | 1690.62 | 414.28 |
| C | 4 | 7.1176 | 0.2463 | | 28.90 | <0.0001 | 1233.47 | 303.81 |
| A | 5 | 6.9003 | 0.2460 | | 28.04 | <0.0001 | 992.56 | 244.21 |
| B | 5 | 6.8467 | 0.2458 | | 27.85 | <0.0001 | 940.73 | 231.25 |
| C | 5 | 6.8069 | 0.2453 | | 27.75 | <0.0001 | 904.07 | 221.80 |

9.6 Log-Normal Distribution

Positively skewed distributions are highly common, especially when modeling biological data. Data often have a lower bound, usually 0 or the detection limit, but have no restriction on the upper bound. Therefore, an observation below the median can be no further from it than the lower bound, whereas an observation above the median may lie many times further away, which gives a positively skewed distribution. These skewed distributions can often be approximated by a log-normal distribution (Limpert et al. 2001).

A log-normal distribution is characterized by having only positive nonzero values, positive skewness, a nonconstant variance that is proportional to the square of the mean value, and a normally distributed natural logarithm. The probability density function of a log-normal distribution has an asymmetric appearance, with most of the data below the expected value and a long, thin right tail toward higher values. Figure 9.1 shows the positive skewness of a log-normal distribution with parameters 1 and 0.6.

Fig. 9.1 Density function of the log-normal distribution with parameters 1 and 0.6
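The properties listed above follow from the standard closed-form moments of the log-normal distribution. The minimal Python sketch below evaluates them for the parameters of Fig. 9.1, under the assumption that 1 and 0.6 are the mean and standard deviation of the logarithm of the variable (the usual parameterization); the function name is illustrative.

```python
import math

def lognormal_summary(mu, sigma):
    """Closed-form summaries of a log-normal variable whose logarithm
    is normal with mean mu and standard deviation sigma."""
    s2 = sigma ** 2
    return {
        "mode": math.exp(mu - s2),
        "median": math.exp(mu),
        "mean": math.exp(mu + s2 / 2),
        "variance": (math.exp(s2) - 1) * math.exp(2 * mu + s2),
    }

summary = lognormal_summary(1.0, 0.6)
cv = math.sqrt(summary["variance"]) / summary["mean"]

for name, value in summary.items():
    print(f"{name:9s} {value:8.4f}")
print(f"coefficient of variation {cv:.4f}")
```

The ordering mode < median < mean reflects the positive skewness, and the coefficient of variation depends only on the log-scale standard deviation, which is why the variance is proportional to the square of the mean.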
9.6 Log-Normal Distribution 9.6.1 397 Emission of Nitrous Oxide (N2O) in Beef Cattle Manure with Different Percentages of Crude Protein in the Diet The experiment was conducted between January and February 2017 at the Colegio de Postgraduados Campus Córdoba located in Amatlán de los Reyes, Veracruz, México. The genetic material used were four 5–6-month-old males of the Criollo lechero tropical (CLT) breed, randomly distributed in individual pens of 4.8 × 2.1 m2, each one with 75% shade, a cup drinker, and a drawer-type feeder. To ensure the required crude protein percentages for each treatment, the following diets (treatments 1–4) were developed: Trt1 (12% crude protein), Trt2 (14% crude protein), Trt3 (16% crude protein), and Trt4 (commercial feed with 16% crude protein). Each animal randomly received the four treatments in different periods. Each treatment was applied for 11 days, of which the first 7 were considered adaptation days and the following 4 days were used for the measurement of gases in the daily accumulated excreta. The experiment had a total duration of 44 days. The data from this experiment are tabulated in the Appendix (Data: Nitrous oxide emission). The N2O gas fluxes in ppm were calculated from a linear or nonlinear increase of the concentrations inside the static chambers over time, and these fluxes were converted to micrograms of N2O–N per m2 per hour ( y); for more details, see the study by Nadia Hernández-Tapia et al., (2019). The statistical model used in this study was an analysis of covariance model in a randomized complete block design with repeated measures, as described below. 
$$y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta_i (x_{ij} - \bar{x}) + \varepsilon_{ijk}$$

where $y_{ijk}$ is the flux of N2O–N (μg m$^{-2}$ h$^{-1}$); $\mu$ is the overall mean; $\tau_i$ is the fixed effect due to treatment $i$ ($i = 1, 2, 3, 4$); $\text{animal}_j$ is the random effect due to animal $j$ ($j = 1, 2, 3, 4$), assuming $\text{animal}_j \sim N(0, \sigma^2_{animal})$; $\text{time}_k$ is the fixed effect of measurement time $k$ ($k = 1, 2, 3, 4, 5$); $(\tau \times \text{time})_{ik}$ is the effect of the interaction between $\tau_i$ and $\text{time}_k$; $\beta_i$ is the linear regression coefficient of the covariate $x_{ij}$ in treatment $i$, where $x_{ij}$ can be the pH, humidity (HE), or temperature (TE) of the manure, the maximum temperature (TMaxA), minimum temperature (TMinA), maximum humidity (HMaxA), or minimum humidity (HMinA) of the environment, or the initial weight (kilograms) at the start of a treatment; $\bar{x}$ is the mean of the covariate in question; and $\varepsilon_{ijk}$ is the non-normal experimental error.

The linear predictor $\eta_{ijk}$ for N2O–N is

$$\eta_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta_i (x_{ij} - \bar{x}).$$

The response variable $y_{ijk}$ has a conditional log-normal distribution with log-scale mean $\mu_{ijk}$ and variance $(e^{\sigma^2} - 1)\, e^{2\mu_{ijk} + \sigma^2}$, that is, $y_{ijk} \mid \text{animal}_j \sim \text{Log-normal}\left(\mu_{ijk},\ (e^{\sigma^2} - 1)\, e^{2\mu_{ijk} + \sigma^2}\right)$; the rest of the parameters have already been described above.

The following GLIMMIX syntax fits a GLMM with a log-normal response:

    /* Log-normal GLMM for the N2O-N flux; xbar is the averaged TMinA covariate */
    proc glimmix data=co2 nobound method=laplace;
      class animal trt time;
      model flox = trt|time xbar / dist=lognormal;
      /* random animal effect with a compound symmetry structure */
      random intercept / subject=animal type=cs;
      lsmeans trt|time / lines ilink;
    run;

Most of the commands have already been described in previous chapters; in this example, "xbar" denotes the averaged TMinA covariate. Part of the output is shown below. The gas emissions from cattle manure, regardless of the treatment applied, are influenced by several factors (covariates) that the researcher cannot control and that have a significant effect on the estimation of the means and of the experimental error. Both are linearly related to the response variable.
Covariates such as the pH, humidity, and temperature of the excreta, as well as the maximum and minimum temperature and humidity of the environment, influence the dynamics of gas emission. These covariates were considered and analyzed in the covariance model to adjust the estimated means of the N2O–N flux. Based on the fit statistics of the proposed models (Table 9.28), the model chosen to explain the variability of the N2O–N flux is model 5, which includes the minimum environment temperature, because it provides the lowest BIC and MSE (mean square error). The conditional fit statistics (part (a)) and the estimated variance components (part (b)) are shown in Table 9.29. The type III tests of fixed effects (part (c)) indicate significant effects of Trt (P = 0.0008), time (P = 0.0288), the interaction Trt × time (P = 0.0140), the covariate Tmin (P = 0.0079), and the interaction Tmin × Trt (P = 0.0038). The average N2O–N emissions of Trt1 (12% CP, crude protein) and Trt2 (14% CP) were statistically different from each other. Treatment 1 emitted the highest N2O–N flux despite being the treatment with the lowest percentage of CP (Table 9.30).

9.7 Effect of a Chemical Salt on the Percentage Inhibition of the Fusarium sp.

To assess the tolerance of the fungus Fusarium sp. to different concentrations of a chemical salt, a bioassay was implemented to evaluate the percentage of inhibition of the fungus. The bioassay consisted of placing a nutritive culture medium for fungal development in Petri dishes, to which different concentrations of the salt were added (0, 500, 1000, and 2000 ppm). Mycelium growth was measured for 6 days, and the percentage of inhibition of Fusarium sp. growth was calculated.
Table 9.28 Fit statistics in the different models proposed

| Model with the specific covariate | AIC | AICC | BIC | MSE |
|---|---|---|---|---|
| 1. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(pH_{ij} - \overline{pH}) + \varepsilon_{ijk}$ | 422.40 | 433.94 | 407.67 | 1.09 |
| 2. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(HE_{ij} - \overline{HE}) + \varepsilon_{ijk}$ | 418.44 | 429.98 | 403.71 | 0.97 |
| 3. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(TE_{ij} - \overline{TE}) + \varepsilon_{ijk}$ | 442.74 | 454.28 | 428.01 | 1.16 |
| 4. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(TMaxA_{ij} - \overline{TMaxA}) + \varepsilon_{ijk}$ | 450.66 | 462.20 | 435.93 | 1.25 |
| 5. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(TMinA_{ij} - \overline{TMinA}) + \varepsilon_{ijk}$ | 417.62 | 459.14 | 391.84 | 0.76 |
| 6. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(HMaxA_{ij} - \overline{HMaxA}) + \varepsilon_{ijk}$ | 443.88 | 455.42 | 429.15 | 1.18 |
| 7. $Y_{ijk} = \mu + \tau_i + \text{animal}_j + \text{time}_k + (\tau \times \text{time})_{ik} + \beta(HMinA_{ij} - \overline{HMinA}) + \varepsilon_{ijk}$ | 416.78 | 428.32 | 402.05 | 0.99 |

Table 9.29 Conditional fit statistics, variance components, and type III fixed effect tests

(a) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (F \| r. effects) | 333.62 |
| Pearson's chi-square | 99.06 |
| Pearson's chi-square/DF | 0.76 |

(b) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Variance | A | -0.01561 | . |
| CS | A | 0.000767 | . |
| Residual | | 0.7776 | 0.09845 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 3 | 88 | 6.13 | 0.0008 |
| Time | 4 | 88 | 2.84 | 0.0288 |
| Trt*time | 11 | 88 | 2.34 | 0.0140 |
| Tmin | 1 | 88 | 7.40 | 0.0079 |
| Tmin*Trt | 3 | 88 | 4.81 | 0.0038 |
| Tmin*time | 4 | 88 | 1.80 | 0.1351 |
| Tmin*Trt*time | 11 | 88 | 1.23 | 0.2814 |

Table 9.30 Mean and standard error of the N2O flux (μg of N2O-N m$^{-2}$ h$^{-1}$) of the different treatments under study

| Treatment | N2O (μ) ± standard error |
|---|---|
| Trt1 (12% CP) | 3.6442 ± 0.2213a |
| Trt2 (14% CP) | 3.0714 ± 0.3119b |
| Trt3 (16% CP) | 3.5706 ± 0.2974ab |
| Trt4 (16% CP, commercial feed) | 3.1205 ± 0.2130ab |

Part of the data is shown below, and the complete dataset is in the Appendix (Data: Percentage inhibition).

| Bio | Day | Conc | Rep | Y | Bio | Day | Conc | Rep | Y |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 3 | 5.263 | 2 | 1 | 0 | 2 | 0.0016 |
| 1 | 1 | 0 | 4 | 5.263 | 2 | 1 | 0 | 3 | 14.285 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 2 | 0 | 2 | 1.935 | 2 | 2 | 500 | 2 | 31.506 |
| 1 | 2 | 0 | 3 | 4.516 | 2 | 2 | 500 | 3 | 42.465 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 3 | 0 | 2 | 1.234 | 2 | 3 | 500 | 3 | 35.042 |
| 1 | 3 | 0 | 3 | 3.703 | 2 | 3 | 500 | 4 | 24.786 |
| 1 | 4 | 0 | 3 | 4.672 | 2 | 4 | 500 | 2 | 23.123 |
| 1 | 4 | 500 | 1 | 19.626 | 2 | 4 | 500 | 3 | 27.927 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 5 | 0 | 3 | 4.065 | 2 | 5 | 500 | 1 | 13.253 |
| 1 | 5 | 0 | 4 | 4.065 | 2 | 5 | 500 | 2 | 21.285 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1 | 6 | 1000 | 3 | 15.862 | 2 | 6 | 2000 | 1 | 31.197 |
| 1 | 6 | 1000 | 4 | 18.62 | 2 | 6 | 2000 | 2 | 29.173 |
| 1 | 6 | 2000 | 1 | 32.413 | 2 | 6 | 2000 | 3 | 29.848 |
| 1 | 6 | 2000 | 2 | 29.655 | 2 | 6 | 2000 | 4 | 30.522 |
| 1 | 6 | 2000 | 3 | 31.724 | | | | | |
| 1 | 6 | 2000 | 4 | 35.172 | | | | | |

Following the same reasoning as in previous examples, the components of the repeated-measures GLMM with a beta response distribution for the percentage inhibition of Fusarium sp. ($y_{ijkl}$) are listed below:

Distributions: $y_{ijkl} \mid \omega_{kl}, \text{conc}(\omega)_{i(kl)} \sim \text{Beta}(\pi_{ijkl}, \phi)$; $i = 1, \dots, 4$; $j = 1, \dots, 6$; $k = 1, 2$; $l = 1, \dots, r_i$. $\omega_{kl} \sim N(0, \sigma^2_{\omega})$, $\text{conc}(\omega)_{i(kl)} \sim N(0, \sigma^2_{conc(\omega)})$.
Linear predictor: $\eta_{ijkl} = \theta + \text{conc}_i + \omega_{kl} + \text{conc}(\omega)_{i(kl)} + \text{time}_j + (\text{conc} \times \text{time})_{ij}$, where $\eta_{ijkl}$ is the linear predictor, $\theta$ is the intercept, $\text{conc}_i$ is the fixed effect of salt concentration, $\omega_{kl}$ is the random effect of the Petri dish within the bioassay, assuming $\omega_{kl} \sim N(0, \sigma^2_{\omega})$, $\text{conc}(\omega)_{i(kl)}$ is the random effect of salt concentration–Petri dish–bioassay, assuming $\text{conc}(\omega)_{i(kl)} \sim N(0, \sigma^2_{conc(\omega)})$, $\text{time}_j$ is the fixed effect due to the day of measurement, and $(\text{conc} \times \text{time})_{ij}$ is the interaction effect of chemical salt concentration with the day of measurement.

Link function: $\text{logit}(\pi_{ijkl}) = \eta_{ijkl}$ is the link function that relates the linear predictor to the mean ($\pi_{ijkl}$).

The following SAS program fits the beta GLMM with repeated measures.

    proc glimmix data=inhibition method=laplace nobound;
      class Bio Day Con Rep;
      model pct = Con|Day / dist=beta link=logit;
      /* type=cs is one of the covariance structures compared in Table 9.31 */
      random intercept / subject=Con(Bio) type=cs;
      lsmeans Con|Day / lines ilink;
    run;

Before interpreting the fitted generalized linear mixed model, we compare the fit statistics obtained under different covariance structures with the beta distribution for the response variable (Table 9.31 part (a)). According to the fit statistics, the covariance structures that best fit the data are Toeplitz of order 1 (Toep(1)) and unstructured (UN). Having chosen the covariance structure, in this case Toeplitz of order 1, we present part of the results of the fit (Table 9.31 part (b)). The fit statistic Pearson's chi-square/DF = 1.07 indicates that there is no overdispersion and that the beta distribution fits the data adequately. Under Toeplitz(1), the estimated variance component of concentration within bioassay is $\hat{\sigma}^2_{conc(\omega)} = 0.00285$ and the estimated scale parameter is $\hat{\phi} = 52.281$ (part (c)).
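To see what a scale parameter of this size implies, the sketch below converts a mean and a scale parameter into the two shape parameters of a beta distribution and into a variance. It assumes the mean–precision parameterization in which $\text{Var}(y) = \pi(1 - \pi)/(1 + \phi)$; the mean of 0.25 is an illustrative value, not an estimate from the fitted model, and the function names are hypothetical.

```python
import math


def beta_shapes(mean, phi):
    """Shape parameters (a, b) of a beta distribution with a given mean and
    scale (precision) phi, assuming a = mean * phi and b = (1 - mean) * phi."""
    return mean * phi, (1.0 - mean) * phi


def beta_variance(mean, phi):
    """Variance under the assumed mean-precision parameterization."""
    return mean * (1.0 - mean) / (1.0 + phi)


phi_hat = 52.281   # estimated scale parameter reported in the text
pi_example = 0.25  # illustrative mean (hypothetical value)

a, b = beta_shapes(pi_example, phi_hat)
# Textbook variance of Beta(a, b), used as a cross-check of the formula above
var_from_shapes = a * b / ((a + b) ** 2 * (a + b + 1.0))
var_direct = beta_variance(pi_example, phi_hat)

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"variance = {var_direct:.6f}, sd = {math.sqrt(var_direct):.4f}")
```

Under this parameterization, a larger $\phi$ means less variability around the mean, so the scale parameter acts as a precision.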
Table 9.31 Fit statistics for the conditional distribution and variance components

(a) Fit statistics

| Statistic | CS | AR(1) | Toep(1) | UN |
|---|---|---|---|---|
| -2 Log likelihood | -523.69 | -523.69 | -523.69 | -523.69 |
| AIC (smaller is better) | -469.69 | -469.69 | -471.69 | -471.69 |
| AICC (smaller is better) | -458.73 | -458.73 | -461.59 | -461.59 |
| BIC (smaller is better) | -467.54 | -467.54 | -469.62 | -469.62 |
| CAIC (smaller is better) | -440.54 | -440.54 | -443.62 | -443.62 |
| HQIC (smaller is better) | -484.16 | -484.16 | -485.62 | -485.62 |

(b) Fit statistics for conditional distribution

| Statistic | Value |
|---|---|
| -2 Log L (pct \| r. effects) | -529.79 |
| Pearson's chi-square | 177.68 |
| Pearson's chi-square/DF | 1.07 |

(c) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Variance | Con(Bio) | 0.002849 | 0.004147 |
| Scale | | 52.2809 | 5.8849 |

Table 9.32 Type III fixed effects tests

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Con | 3 | 4 | 125.40 | 0.0002 |
| Day | 5 | 138 | 10.99 | <0.0001 |
| Day*Con | 15 | 138 | 2.25 | 0.0074 |

The fixed effects tests indicate a highly significant effect of salt concentration (P = 0.0002), time (P < 0.0001), and the interaction concentration × time (P = 0.0074) on the growth inhibition of Fusarium sp. (Table 9.32). The linear predictors and estimated probabilities of the factors (Table 9.33 parts (a) and (b)) and of the interaction (Table 9.34) are found under the columns "Estimate" and "Mean," respectively.

9.8 Carbon Dioxide (CO2) Emission as a Function of Soil Moisture and Microbial Activity

Productive agricultural soil requires a certain level of aeration to maintain active plant root growth and soil microbial activity. A scientist found that soil oxygenation levels had been affected in soils fertilized with nutrient-rich sludge from a sewage treatment plant.
The level of soil aeration can be reduced by (1) the high water content of the added sludge, (2) compaction by the heavy machinery used to apply the sludge, and, ironically, (3) the increased microbial activity that occurs when sludge with high organic matter content is added. The objective of the research was to determine the moisture levels at which aeration becomes a limiting factor for microbial activity in the soil. The study included a control treatment (no sludge) and three treatments using sludge as a fertilizer with different moisture contents; the moisture levels of the fertilized soil were 0.24, 0.26, and 0.28 kg water/kg soil. Soil samples were randomly assigned to the four treatments in a completely randomized design. Soil samples were placed in sealed containers and incubated under favorable conditions for microbial activity. The soil was compacted in the containers to simulate the degree of compaction experienced in the field. Microbial activity, measured as an increase in CO2, was used as a measure of the level of soil oxygenation.

Table 9.33 Concentration and measurement time least squares means on the model scale (Estimate) and the data scale (Mean)

(a) Conc least squares means

| Con | Estimate $\eta_{i\cdot}$ | Standard error | DF | t-value | Pr > \|t\| | Mean $\pi_{i\cdot}$ | Standard error mean |
|---|---|---|---|---|---|---|---|
| 0 | -3.5438 | 0.1499 | 4 | -23.64 | <0.0001 | 0.02809 | 0.004093 |
| 500 | -1.0650 | 0.05941 | 4 | -17.93 | <0.0001 | 0.2563 | 0.01133 |
| 1000 | -0.9847 | 0.05895 | 4 | -16.70 | <0.0001 | 0.2720 | 0.01167 |
| 2000 | -0.4487 | 0.05891 | 4 | -7.62 | 0.0016 | 0.3897 | 0.01401 |

(b) Day least squares means

| Day | Estimate $\eta_{\cdot j}$ | Standard error | DF | t-value | Pr > \|t\| | Mean $\pi_{\cdot j}$ | Standard error mean |
|---|---|---|---|---|---|---|---|
| 1 | -1.6017 | 0.1161 | 138 | -13.79 | <0.0001 | 0.1677 | 0.01621 |
| 2 | -1.0446 | 0.08689 | 138 | -12.02 | <0.0001 | 0.2603 | 0.01673 |
| 3 | -1.2475 | 0.08794 | 138 | -14.19 | <0.0001 | 0.2231 | 0.01524 |
| 4 | -1.5668 | 0.1020 | 138 | -15.36 | <0.0001 | 0.1727 | 0.01457 |
| 5 | -1.7606 | 0.1039 | 138 | -16.94 | <0.0001 | 0.1467 | 0.01301 |
| 6 | -1.8422 | 0.1067 | 138 | -17.26 | <0.0001 | 0.1368 | 0.01260 |
The CO2 evolution/kilogram soil/day in each container was measured 2, 4, 6, and 8 days after the start of the incubation period. Microbial activity in each soil sample was recorded as the percentage increase in CO2 produced above the atmospheric level. The data are shown in Table 9.35. The analysis of variance table for this experiment is shown below (Table 9.36).

Let $pct_{ijk}$ be the percentage of CO2 emission, assuming that $pct_{ijk}$ has a beta distribution with mean $\pi_{ijk}$ and scale parameter $\phi$, i.e., $pct_{ijk} \sim \text{Beta}(\pi_{ijk}, \phi)$. The linear predictor $\eta_{ijk}$, which is related to the mean through the link function, is given by

$$\eta_{ijk} = \theta + \alpha_i + \alpha(r)_{i(k)} + \tau_j + (\alpha\tau)_{ij}; \quad i = 1, \dots, 4,\ j = 1, \dots, 4,\ k = 1, 2, 3$$

where $\theta$ is the intercept, $\alpha_i$ is the fixed effect of treatment $i$, $\alpha(r)_{i(k)}$ is the random effect of treatment nested in repetition $k$, assuming that $\alpha(r)_{i(k)} \sim N(0, \sigma^2_{\alpha(r)})$, $\tau_j$ is the fixed effect of measurement time $j$, and $(\alpha\tau)_{ij}$ is the interaction effect of treatment with measurement time. The link function is defined by $\text{logit}(\pi_{ijk}) = \eta_{ijk}$. The following SAS syntax fits a GLMM on repeated measures with a beta distribution.
    proc glimmix data=co2 method=laplace;
      class trt container time;
      model pct = trt|time / dist=beta link=logit;
      /* random treatment effect within each container */
      random trt / subject=container;
      lsmeans trt|time / lines ilink;
    run;

Table 9.34 Measuring time*salt concentration interaction on the model scale (Estimate) and the data scale (Mean)

Day*con least squares means

| Day | Con | Estimate $\eta_{ij}$ | Standard error | DF | t-value | Pr > \|t\| | Mean $\pi_{ij}$ | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | -4.0127 | 0.4083 | 138 | -9.83 | <0.0001 | 0.01776 | 0.007124 |
| 1 | 500 | -0.8709 | 0.1123 | 138 | -7.76 | <0.0001 | 0.2951 | 0.02335 |
| 1 | 1000 | -0.6848 | 0.1092 | 138 | -6.27 | <0.0001 | 0.3352 | 0.02434 |
| 1 | 2000 | -0.8382 | 0.1579 | 138 | -5.31 | <0.0001 | 0.3019 | 0.03328 |
| 2 | 0 | -3.5743 | 0.2957 | 138 | -12.09 | <0.0001 | 0.02727 | 0.007844 |
| 2 | 500 | -0.4140 | 0.1061 | 138 | -3.90 | 0.0001 | 0.3980 | 0.02543 |
| 2 | 1000 | -0.3519 | 0.1053 | 138 | -3.34 | 0.0011 | 0.4129 | 0.02554 |
| 2 | 2000 | 0.1616 | 0.1043 | 138 | 1.55 | 0.1235 | 0.5403 | 0.02590 |
| 3 | 0 | -2.9511 | 0.2944 | 138 | -10.02 | <0.0001 | 0.04969 | 0.01390 |
| 3 | 500 | -0.9923 | 0.1149 | 138 | -8.64 | <0.0001 | 0.2705 | 0.02266 |
| 3 | 1000 | -0.9044 | 0.1131 | 138 | -8.00 | <0.0001 | 0.2881 | 0.02319 |
| 3 | 2000 | -0.1423 | 0.1041 | 138 | -1.37 | 0.1739 | 0.4645 | 0.02590 |
| 4 | 0 | -3.5167 | 0.3558 | 138 | -9.88 | <0.0001 | 0.02884 | 0.009967 |
| 4 | 500 | -1.2429 | 0.1213 | 138 | -10.25 | <0.0001 | 0.2239 | 0.02108 |
| 4 | 1000 | -1.0361 | 0.1159 | 138 | -8.94 | <0.0001 | 0.2619 | 0.02241 |
| 4 | 2000 | -0.4716 | 0.1065 | 138 | -4.43 | <0.0001 | 0.3842 | 0.02520 |
| 5 | 0 | -3.5503 | 0.3579 | 138 | -9.92 | <0.0001 | 0.02791 | 0.009710 |
| 5 | 500 | -1.4180 | 0.1269 | 138 | -11.17 | <0.0001 | 0.1950 | 0.01992 |
| 5 | 1000 | -1.4489 | 0.1277 | 138 | -11.34 | <0.0001 | 0.1902 | 0.01967 |
| 5 | 2000 | -0.6251 | 0.1083 | 138 | -5.77 | <0.0001 | 0.3486 | 0.02458 |
| 6 | 0 | -3.6579 | 0.3691 | 138 | -9.91 | <0.0001 | 0.02514 | 0.009046 |
| 6 | 500 | -1.4522 | 0.1277 | 138 | -11.37 | <0.0001 | 0.1897 | 0.01963 |
| 6 | 1000 | -1.4823 | 0.1289 | 138 | -11.50 | <0.0001 | 0.1851 | 0.01944 |
| 6 | 2000 | -0.7765 | 0.1106 | 138 | -7.02 | <0.0001 | 0.3151 | 0.02388 |

Part of the results of the CO2 analysis is shown below.
The fit statistics under different covariance structures (Table 9.37 part (a)), such as AIC and AICC, indicate that a Toeplitz covariance structure of order 1 provides the best fit to the dataset of this experiment. Table 9.38 part (a) shows the estimated variance component due to treatment × repetition, i.e., $\hat{\sigma}^2_{\alpha(r)} = 0.03363$, and the estimated scale parameter $\hat{\phi} = 790.82$; the hypothesis test (part (b)) indicates that the treatments yielded statistically different means (P = 0.0011).

Table 9.35 Repeated measurements of emissions of CO2 by bacterial activity in soil under different moisture conditions (% CO2 evolution/kilogram soil/day)

| Moisture (kg water/kg soil) | Container | Day 2 | Day 4 | Day 6 | Day 8 |
|---|---|---|---|---|---|
| Control | 1 | 0.22 | 0.56 | 0.66 | 0.89 |
| Control | 2 | 0.68 | 0.91 | 1.06 | 0.8 |
| Control | 3 | 0.68 | 0.45 | 0.72 | 0.89 |
| 0.24 | 1 | 2.53 | 2.7 | 2.1 | 1.5 |
| 0.24 | 2 | 2.59 | 1.43 | 1.35 | 0.74 |
| 0.24 | 3 | 0.56 | 1.37 | 1.87 | 1.21 |
| 0.26 | 1 | 0.22 | 0.22 | 0.2 | 0.11 |
| 0.26 | 2 | 0.45 | 0.28 | 1.24 | 0.86 |
| 0.26 | 3 | 0.22 | 0.33 | 0.34 | 0.2 |
| 0.28 | 1 | 0.22 | 0.8 | 0.8 | 0.37 |
| 0.28 | 2 | 0.22 | 0.62 | 0.89 | 0.95 |
| 0.28 | 3 | 0.22 | 0.56 | 0.69 | 0.63 |

Table 9.36 Analysis of variance of a CRD with repeated measures

| Sources of variation | Degrees of freedom |
|---|---|
| Treatment | (a - 1) = 4 - 1 = 3 |
| Error1 | a(r - 1) = 8 |
| Measurement time | (t - 1) = 4 - 1 = 3 |
| Treatment × time | (a - 1)(t - 1) = 9 |
| Error2 | a(t - 1)(r - 1) = 4 × 3 × 2 = 24 |
| Total | a × t × r - 1 = 4 × 4 × 3 - 1 = 47 |

Table 9.37 Fit statistics of the beta GLMM under different covariance structures

(a) Fit statistics

| Statistic | CS | AR(1) | Toep(1) | UN |
|---|---|---|---|---|
| -2 Log likelihood | -433.28 | -433.94 | -433.28 | No convergence |
| AIC (smaller is better) | -395.28 | -395.94 | -397.28 | |
| AICC (smaller is better) | -368.14 | -368.80 | -373.69 | |
| BIC (smaller is better) | -412.41 | -413.07 | -413.50 | |
| CAIC (smaller is better) | -393.41 | -394.07 | -395.50 | |
| HQIC (smaller is better) | -429.71 | -430.37 | -429.89 | |

(b) Fit statistics for conditional distribution

| Statistic | CS | AR(1) | Toep(1) | UN |
|---|---|---|---|---|
| -2 Log L (y \| r. effects) | -446.54 | -444.41 | -446.58 | No convergence |
| Pearson's chi-square | 30.46 | 33.98 | 30.38 | |
| Pearson's chi-square/DF | 0.63 | 0.71 | 0.63 | |

Table 9.39 shows the estimated average emissions of CO2 in the tested treatments. The treatment with a moisture level of 0.24 kg water/kg soil favored the highest microbial activity, whereas the treatments with moisture levels of 0.26 and 0.28 kg water/kg soil showed similar microbial activity.

Table 9.38 Variance components and fixed effects test

(a) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Variance | Container | 0.03363 | 0.03153 |
| Scale ($\phi$) | | 790.82 | 190.89 |

(b) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Trt | 3 | 8 | 15.52 | 0.0011 |
| Time | 3 | 24 | 2.94 | 0.0537 |
| Trt*time | 9 | 24 | 1.29 | 0.2914 |

Table 9.39 Means and standard errors on the model scale (Estimate) and the data scale (Mean)

(a) Trt least squares means

| Trt | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|
| C | -4.9242 | 0.1595 | 8 | -30.87 | <0.0001 | 0.007216 | 0.001143 |
| T0.24 | -4.1331 | 0.1343 | 8 | -30.79 | <0.0001 | 0.01578 | 0.002085 |
| T0.26 | -5.5728 | 0.1898 | 8 | -29.36 | <0.0001 | 0.003786 | 0.000716 |
| T0.28 | -5.1588 | 0.1728 | 8 | -29.86 | <0.0001 | 0.005716 | 0.000982 |

Fig. 9.2 CO2 emission as a measure of microbial activity (vertical axis: % CO2; horizontal axis: time in days; series: C, T0.24, T0.26, T0.28)

Figure 9.2 clearly shows that the treatment with a moisture level of 0.24 kg water/kg soil provides the best conditions for soil microbial activity, whereas the rest of the treatments significantly affect the activity of microorganisms.
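The relation between the "Estimate" and "Mean" columns of Table 9.39 can be reproduced by hand. The sketch below applies the inverse logit link to each estimate and approximates the standard error on the data scale with the delta method, $SE(\hat{\pi}) \approx SE(\hat{\eta})\,\hat{\pi}(1 - \hat{\pi})$. It is an illustration under the assumption that the reported data-scale standard errors come from this first-order approximation; the function names are hypothetical.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a proportion."""
    return 1.0 / (1.0 + math.exp(-eta))


def delta_method_se(eta, se_eta):
    """First-order (delta method) standard error on the data scale."""
    p = inv_logit(eta)
    return se_eta * p * (1.0 - p)


# (Estimate, Standard error, Mean, Standard error mean) from Table 9.39
table_9_39 = {
    "C":     (-4.9242, 0.1595, 0.007216, 0.001143),
    "T0.24": (-4.1331, 0.1343, 0.01578, 0.002085),
    "T0.26": (-5.5728, 0.1898, 0.003786, 0.000716),
    "T0.28": (-5.1588, 0.1728, 0.005716, 0.000982),
}

results = {}
for trt, (eta, se_eta, mean_tab, se_tab) in table_9_39.items():
    results[trt] = (inv_logit(eta), delta_method_se(eta, se_eta))
    print(f"{trt}: mean = {results[trt][0]:.6f} (table {mean_tab}), "
          f"se = {results[trt][1]:.6f} (table {se_tab})")
```

The back-transformed values agree with the tabulated ones up to rounding of the inputs. Since the response was analyzed as a proportion, a mean of 0.01578 corresponds to about 1.58% CO2.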
9.9 Effect of Soil Compaction and Soil Moisture on Microbial Activity

A soil scientist conducted an experiment to evaluate the effects of soil compaction and soil moisture on microbial activity. Aeration may be restricted in highly saturated or compacted soils, thus reducing microbial activity. The experiment consisted of three levels of soil compaction (1.1, 1.4, and 1.6 Mg soil/m3) and three levels of soil moisture (0.1, 0.2, and 0.24 kg water/kg soil). The treated soil samples were placed in sealed containers and incubated under conditions favorable to microbial activity. The percentage increase in CO2 produced above atmospheric levels was measured in each soil sample. The experimental design was a completely randomized design (CRD) with a 3 × 3 factorial structure of treatments. Two replicates of the soil container units were prepared for each treatment. The evolution of CO2/kg soil/day was measured for three successive days. The data from this experiment are shown below in Table 9.40. The analysis of variance table for this experiment is shown below (Table 9.41).

Table 9.40 Percentage of CO2 by bacterial activity as a function of soil density (Mg soil/m3) and soil humidity (kg water/kg soil)

| Density | Humidity | Replication | Day 1 | Day 2 | Day 3 |
|---|---|---|---|---|---|
| 1.1 | 0.1 | 1 | 2.7 | 0.34 | 0.11 |
| 1.1 | 0.1 | 2 | 2.9 | 1.57 | 1.25 |
| 1.1 | 0.2 | 1 | 5.2 | 5.04 | 3.7 |
| 1.1 | 0.2 | 2 | 3.6 | 3.92 | 2.69 |
| 1.1 | 0.24 | 1 | 4 | 3.47 | 3.47 |
| 1.1 | 0.24 | 2 | 4.1 | 3.47 | 2.46 |
| 1.4 | 0.1 | 1 | 2.6 | 1.12 | 0.9 |
| 1.4 | 0.1 | 2 | 2.2 | 0.78 | 0.34 |
| 1.4 | 0.2 | 1 | 4.3 | 3.36 | 3.02 |
| 1.4 | 0.2 | 2 | 3.9 | 2.91 | 2.35 |
| 1.4 | 0.24 | 1 | 1.9 | 3.02 | 2.58 |
| 1.4 | 0.24 | 2 | 3 | 3.81 | 2.69 |
| 1.6 | 0.1 | 1 | 2 | 0.67 | 0.22 |
| 1.6 | 0.1 | 2 | 3 | 0.78 | 0.22 |
| 1.6 | 0.2 | 1 | 3.8 | 2.8 | 2.02 |
| 1.6 | 0.2 | 2 | 2.6 | 3.14 | 2.46 |
| 1.6 | 0.24 | 1 | 1.3 | 2.69 | 2.46 |
| 1.6 | 0.24 | 2 | 0.5 | 0.34 | . |

Table 9.41 Analysis of variance of a CRD with a factorial structure of treatments in repeated measures

| Sources of variation | Degrees of freedom |
|---|---|
| Density | (a - 1) = 3 - 1 = 2 |
| Humidity | (b - 1) = 3 - 1 = 2 |
| Density*humidity | (a - 1)(b - 1) = 4 |
| Error1 | ab(r - 1) = 3 × 3 × 1 = 9 |
| Time | (c - 1) = 3 - 1 = 2 |
| Density*time | (a - 1)(c - 1) = 4 |
| Humidity*time | (b - 1)(c - 1) = 4 |
| Density*humidity*time | (a - 1)(b - 1)(c - 1) = 8 |
| Error2 | By difference = 17 |
| Total | a × b × c × r - 1 - 1 = 3 × 3 × 3 × 2 - 1 - 1 = 52 |

Note: Here, 1 degree of freedom was subtracted from the total observations of the experiment since there is a missing observation.

Let $pct_{ijkl}$ be the percentage of CO2 emission and assume that $pct_{ijkl}$ has a beta distribution with mean $\pi_{ijkl}$ and scale parameter $\phi$, i.e., $pct_{ijkl} \sim \text{Beta}(\pi_{ijkl}, \phi)$. The linear predictor $\eta_{ijkl}$, which is related to the mean through the link function, is given by

$$\eta_{ijkl} = \theta + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \alpha\beta(r)_{ij(l)} + \tau_k + (\alpha\tau)_{ik} + (\beta\tau)_{jk} + (\alpha\beta\tau)_{ijk}$$

$$i = 1, 2, 3;\ j = 1, 2, 3;\ k = 1, 2, 3;\ l = 1, 2$$

where $\theta$ is the intercept, $\alpha_i$ is the fixed effect of the density factor, $\beta_j$ is the fixed effect of the humidity factor, $(\alpha\beta)_{ij}$ is the effect of the interaction between density and humidity, $\alpha\beta(r)_{ij(l)}$ is the random effect of the interaction density × humidity × repetition, with $\alpha\beta(r)_{ij(l)} \sim N(0, \sigma^2_{\alpha\beta(r)})$, $\tau_k$ is the fixed effect of measurement time, $(\alpha\tau)_{ik}$ is the fixed effect of the interaction between density and measurement time, $(\beta\tau)_{jk}$ is the fixed effect of the interaction between humidity and measurement time, and $(\alpha\beta\tau)_{ijk}$ is the fixed effect of the interaction density × humidity × time. The link function is defined by $\text{logit}(\pi_{ijkl}) = \eta_{ijkl}$. The following SAS GLIMMIX syntax fits a repeated measures GLMM with a beta distribution.

    proc glimmix data=co2_fact nobound method=laplace;
      class density humidity rep time;
      model pct = density|humidity|time / dist=beta link=logit;
      /* random density*humidity effect within replicate, Toeplitz(1) structure */
      random density*humidity / subject=rep type=toep(1);
      lsmeans density|humidity|time / lines ilink;
    run;

Part of the results is listed below. The fit statistics (AIC and AICC) in Table 9.42 part (a) indicate that a Toeplitz covariance structure of order 1 provides the best fit to the data.
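The comparison of covariance structures can be made explicit with the values reported in Table 9.42 part (a). The sketch below is illustrative only; it assumes the usual definition AIC = -2 log L + 2k, so the number of estimated parameters k implied by each fit can be recovered from the two reported statistics. The names `fits` and `implied_parameters` are hypothetical.

```python
# (-2 log likelihood, AIC) for each covariance structure, from Table 9.42 (a).
# The unstructured (UN) model did not converge and is therefore omitted.
fits = {
    "CS":      (-413.74, -353.74),
    "AR(1)":   (-413.72, -353.72),
    "Toep(1)": (-413.72, -355.72),
}


def implied_parameters(minus2_loglik, aic):
    """Number of parameters k implied by AIC = -2 log L + 2k (assumed definition)."""
    return round((aic - minus2_loglik) / 2)


param_counts = {name: implied_parameters(m2ll, aic) for name, (m2ll, aic) in fits.items()}
best_by_aic = min(fits, key=lambda name: fits[name][1])

for name, (m2ll, aic) in fits.items():
    print(f"{name}: -2logL = {m2ll}, AIC = {aic}, implied k = {param_counts[name]}")
print("Smallest AIC:", best_by_aic)
```

The three structures reach practically the same likelihood, so the advantage of Toep(1) in AIC comes from the smaller penalty: under the assumed definition it uses one parameter fewer than CS and AR(1).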
The type III tests of fixed effects in Table 9.43 indicate that soil density (P = 0.0021), humidity (P < 0.0001), the evolution of emission over time (P < 0.0001), and the interaction between humidity and time of measurement (P < 0.0001) are statistically significant.

Table 9.42 Fit statistics of a beta GLMM with a factorial structure of treatments under different covariance structures

(a) Fit statistics

| Statistic | CS | AR(1) | Toep(1) | UN |
|---|---|---|---|---|
| -2 Log likelihood | -413.74 | -413.72 | -413.72 | No convergence |
| AIC (smaller is better) | -353.74 | -353.72 | -355.72 | |
| AICC (smaller is better) | -269.19 | -269.18 | -280.07 | |
| BIC (smaller is better) | -392.94 | -392.93 | -393.62 | |
| CAIC (smaller is better) | -362.94 | -362.93 | -364.62 | |
| HQIC (smaller is better) | -435.73 | -435.71 | -434.98 | |

(b) Fit statistics for conditional distribution

| Statistic | CS | AR(1) | Toep(1) | UN |
|---|---|---|---|---|
| -2 Log L (y \| r. effects) | -413.74 | -413.72 | -413.72 | No convergence |
| Pearson's chi-square | 64.60 | 64.65 | 64.65 | |
| Pearson's chi-square/DF | 1.22 | 1.22 | 1.22 | |

Table 9.43 Hypothesis testing of the factors under study

Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Density | 2 | 9 | 13.14 | 0.0021 |
| Humidity | 2 | 9 | 69.66 | <0.0001 |
| Density*humidity | 4 | 9 | 3.57 | 0.0524 |
| Time | 2 | 17 | 21.97 | <0.0001 |
| Density*time | 4 | 17 | 0.72 | 0.5904 |
| Humidity*time | 4 | 17 | 17.85 | <0.0001 |
| Density*humidity*time | 8 | 17 | 2.12 | 0.0923 |

The least squares means obtained with the "lsmeans" command are shown on the model scale under the "Estimate" column and on the data scale under the "Mean" column of Table 9.44.

9.10 Joint Model for Binary and Poisson Data

Another advantage of the GLIMMIX procedure is the ability to fit models to data in which the distribution and/or the link function varies across response variables. This is accomplished by specifying DIST=BYOBS or LINK=BYOBS in the model definition. The dataset created below provides an example with a bivariate outcome. It records the condition and length of hospital stay of 32 patients with herniorrhaphy.
These data are taken from data provided by Mosteller and Tukey (1977) and reproduced in the study by Hand et al. (1994) (Table 9.45). For each patient, two responses were recorded. A binary response takes the value one if a patient experienced a routine recovery and the value zero if postoperative intensive care was required. The second response variable is a count variable that measures the length of hospital stay after the surgery (in days). The binary variable "OKstatus" is a regressor variable that distinguishes patients according to their postoperative physical status ("1" implies better status), and the variable age is the age of the patient.

Table 9.44 Means and standard errors and comparison of means (least significance difference (LSD)) on the model scale (Estimate) and data scale (Mean)

(a) Density*humidity least squares means

| Density | Humidity | Estimate | Standard error | DF | t-value | Pr > \|t\| | Mean | Standard error mean |
|---|---|---|---|---|---|---|---|---|
| 1.1 | 0.1 | -4.5961 | 0.1614 | 9 | -28.48 | <0.0001 | 0.009990 | 0.001596 |
| 1.1 | 0.2 | -3.1829 | 0.07606 | 9 | -41.85 | <0.0001 | 0.03981 | 0.002908 |
| 1.1 | 0.24 | -3.3152 | 0.08060 | 9 | -41.13 | <0.0001 | 0.03505 | 0.002726 |
| 1.4 | 0.1 | -4.4567 | 0.1450 | 9 | -30.74 | <0.0001 | 0.01147 | 0.001643 |
| 1.4 | 0.2 | -3.3798 | 0.08333 | 9 | -40.56 | <0.0001 | 0.03293 | 0.002654 |
| 1.4 | 0.24 | -3.5363 | 0.08932 | 9 | -39.59 | <0.0001 | 0.02830 | 0.002456 |
| 1.6 | 0.1 | -4.7890 | 0.1809 | 9 | -26.47 | <0.0001 | 0.008252 | 0.001481 |
| 1.6 | 0.2 | -3.5453 | 0.08972 | 9 | -39.52 | <0.0001 | 0.02805 | 0.002446 |
| 1.6 | 0.24 | -4.3213 | 0.1440 | 9 | -30.00 | <0.0001 | 0.01311 | 0.001863 |

(b) T grouping of density*humidity least squares means (α = 0.05). LS means with the same letter are not significantly different

| Density | Humidity | Estimate | Grouping |
|---|---|---|---|
| 1.1 | 0.20 | -3.1829 | A |
| 1.1 | 0.24 | -3.3152 | A B |
| 1.4 | 0.20 | -3.3798 | A B |
| 1.4 | 0.24 | -3.5363 | B |
| 1.6 | 0.20 | -3.5453 | B |
| 1.6 | 0.24 | -4.3213 | C |
| 1.4 | 0.10 | -4.4567 | C |
| 1.1 | 0.10 | -4.5961 | C |
| 1.6 | 0.10 | -4.7890 | C |

These data can be modeled with a separate logistic model for the binary outcome and with a Poisson model for the count outcome.
Such separate analyses would not take into account the correlation between the two response variables. It is reasonable to assume that the duration of post-surgery hospitalization is correlated with, and depends on, whether the patient requires intensive care. In the following analysis, the correlation between the two types of response variables for a patient is modeled with shared random effects (G-side). The dataset variable "dist" identifies the distribution for each observation. For those observations that follow a binary distribution, the response variable option (event='1') determines which value of the binary variable is modeled as the event of interest. Since no "link" option is specified, the link is also chosen on an observation-by-observation basis as the default link for the respective distribution.

Table 9.45 Hospital condition and length of stay of patients (for each patient, y is given for the binary observation, D = B, and for the Poisson observation, D = P)

| Patient | Age | OKstatus | y (D = B) | y (D = P) |
|---|---|---|---|---|
| 1 | 78 | 1 | 0 | 9 |
| 2 | 60 | 1 | 0 | 4 |
| 3 | 68 | 1 | 1 | 7 |
| 4 | 62 | 0 | 1 | 35 |
| 5 | 76 | 0 | 0 | 9 |
| 6 | 76 | 1 | 1 | 7 |
| 7 | 64 | 1 | 1 | 5 |
| 8 | 74 | 1 | 1 | 16 |
| 9 | 68 | 0 | 1 | 7 |
| 10 | 79 | 1 | 0 | 11 |
| 11 | 80 | 0 | 1 | 4 |
| 12 | 48 | 1 | 1 | 9 |
| 13 | 35 | 1 | 1 | 2 |
| 14 | 58 | 1 | 1 | 4 |
| 15 | 40 | 1 | 1 | 3 |
| 16 | 19 | 1 | 1 | 4 |
| 17 | 79 | 0 | 0 | 3 |
| 18 | 51 | 1 | 1 | 5 |
| 19 | 57 | 1 | 1 | 8 |
| 20 | 51 | 0 | 1 | 8 |
| 21 | 48 | 1 | 1 | 3 |
| 22 | 48 | 1 | 1 | 5 |
| 23 | 66 | 1 | 1 | 8 |
| 24 | 71 | 1 | 0 | 2 |
| 25 | 75 | 0 | 0 | 7 |
| 26 | 2 | 1 | 1 | 0 |
| 27 | 65 | 1 | 0 | 16 |
| 28 | 42 | 1 | 0 | 3 |
| 29 | 54 | 1 | 0 | 2 |
| 30 | 43 | 1 | 1 | 3 |
| 31 | 4 | 1 | 1 | 3 |
| 32 | 52 | 1 | 1 | 8 |

The following GLIMMIX commands fit this dataset with two distributions:

    data Poi_Bin;
      length dist $7;
      input d$ patient age OKstatus response @@;
      /* d = 'B' marks the binary record, otherwise the count record */
      if d = 'B' then dist='Binary';
      else dist='Poisson';
      datalines;
    B 1 78 1 0 P 1 78 1 9 B 2 60 1 0 P 2 60 1 4
    B 3 68 1 1 P 3 68 1 7 B 4 62 0 1 P 4 62 0 35
    .................................
    B 29 54 1 0 P 29 54 1 2 B 30 43 1 1 P 30 43 1 3
    B 31 4 1 1 P 31 4 1 3 B 32 52 1 1 P 32 52 1 8
    ;
    proc glimmix data=Poi_Bin;
      class patient dist;
      model response(event='1') = dist dist*age dist*OKstatus / noint s dist=byobs(dist);
      /* shared random patient effect links the two responses */
      random int / subject=patient;
      lsmeans dist / lines ilink;
    run;

Some of the output is shown below. Table 9.46 ("Model information") shows that the distribution of the data is multivariate and that possibly multiple link functions are involved; by default, PROC GLIMMIX uses a logit link for the binary observations and a log link for the Poisson data.

Table 9.46 Model information

| Model information | |
|---|---|
| Dataset | WORK.POI_BIN |
| Response variable | Response |
| Response distribution | Multivariate |
| Link function | Multiple |
| Variance function | Default |
| Variance matrix blocked by | Patient |
| Estimation technique | Residual pseudo-likelihood (PL) |
| Degrees of freedom method | Containment |

Table 9.47 shows the value of the fit statistic Gener. chi-square/DF = 0.90, which indicates that there is no overdispersion, and also the estimated variance component due to patient, $\hat{\sigma}^2_{patient} = 0.299$. The fixed effects tests for the effects of age and status are shown in part (c). In addition to the above results, the maximum likelihood estimates of the intercepts, as well as the slopes of each of the variables for both probability distributions, are tabulated in Table 9.48.
Thus, to calculate the probability that a patient will experience a routine recovery, the following expression is used:

$$\hat{\pi} = \frac{1}{1 + \exp\{-\beta_0 - \beta_1 \times \text{age} - \beta_2 \times \text{okstatus}\}} = \frac{1}{1 + \exp\{-5.7783 + 0.07572 \times \text{age} + 0.4697 \times \text{okstatus}\}}$$

whereas the following expression is used to calculate the average length of hospital stay after the surgery (in days):

$$\hat{\lambda} = \exp\{\alpha_0 + \alpha_1 \times \text{age} + \alpha_2 \times \text{okstatus}\} = \exp\{0.8410 + 0.01875 \times \text{age} - 0.1856 \times \text{okstatus}\}$$

For example, for a 60-year-old patient with okstatus = 1 and the random patient effect set to zero, these expressions give $\hat{\pi} \approx 0.68$ and $\hat{\lambda} \approx 5.9$ days.

Table 9.47 Results of the analysis of variance

(a) Fit statistics

| Statistic | Value |
|---|---|
| -2 Res log pseudo-likelihood | 226.71 |
| Generalized chi-square | 52.25 |
| Gener. chi-square/DF | 0.90 |

(b) Covariance parameter estimates

| Cov Parm | Subject | Estimate | Standard error |
|---|---|---|---|
| Intercept | Patient | 0.2990 | 0.1116 |

(c) Type III tests of fixed effects

| Effect | Num DF | Den DF | F-value | Pr > F |
|---|---|---|---|---|
| Dist | 2 | 29 | 2.74 | 0.0814 |
| Age*dist | 2 | 29 | 5.94 | 0.0069 |
| OKstatus*dist | 2 | 29 | 0.24 | 0.7909 |

Table 9.48 Maximum likelihood estimators for fixed effects

Solutions for fixed effects

| Effect | Dist | Estimate | Standard error | DF | t-value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Dist | Binary | 5.7783 | 2.9048 | 29 | 1.99 | 0.0562 |
| Dist | Poisson | 0.8410 | 0.5696 | 29 | 1.48 | 0.1506 |
| Age*dist | Binary | -0.07572 | 0.03791 | 29 | -2.00 | 0.0552 |
| Age*dist | Poisson | 0.01875 | 0.007383 | 29 | 2.54 | 0.0167 |
| OKstatus*dist | Binary | -0.4697 | 1.1251 | 29 | -0.42 | 0.6794 |
| OKstatus*dist | Poisson | -0.1856 | 0.3020 | 29 | -0.61 | 0.5435 |

9.11 Exercises

Exercise 9.11.1 Consider an experiment in which three treatments are compared. There are r blocks of n animals, each formed using grouping criteria relevant to the experiment. Within each block, one animal is randomly assigned to each treatment. A measurement was taken on the animals at "week 0," when the treatments were applied, and again at weeks 4 and 12.
Variables measured included weight, the presence or absence of disease symptoms, and severity of symptoms, classified as “worse,” “no change,” or “better.” The focus of this experiment was on repeated measures analysis of the last two types of data in the above list: categorical responses, either binary or ordinal (ratings), in an experiment designed with a repeated measures and treatment factor structure. Regardless of whether the observations are normally distributed, categorical, or have some other distribution, a general approach to repeated measures analysis based on the linear mixed model uses the following general form:

Observation = systematic between-subjects variation + random between-subjects variation + systematic within-subjects effects + random within-subjects variation.

The following table shows the data from an experiment in which each cell contains the number of animals in a given treatment × week × response category combination (Table 9.49).

Table 9.49 Results of a repeated measures experiment with an ordinal response variable
Response Bad Without change Better Week 0 Placebo 60 7 0 Trt1 59 6 Trt2 54 13 0 0 Week 4 Response 14 34 Placebo 5 33 Trt1 3 38 Week 12 Trt2 Response 13 10 25 17 Placebo 7 21 15 22 17 17 21 28

(a) List all the components of the repeated measures multinomial GLMM.
(b) Study and choose the best covariance structure for modeling this dataset. Cite the most relevant results.
(c) Fit the multinomial cumulative logit model to these data. Perform a complete and appropriate analysis of the data, focusing on:
    (i) An evaluation of the effects of the combination of treatments
    (ii) Odds ratio interpretation
    (iii) The expected probability per category for each treatment
(d) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption.
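For part (c)(iii), the step from cumulative logits to expected probabilities per category can be sketched as follows. The sketch assumes the parameterization logit P(Y ≤ c) = θ_c + η, where θ_c is the threshold (intercept) of category c and η is the treatment part of the linear predictor. The threshold values and treatment effects below are hypothetical numbers chosen only to illustrate the arithmetic; they are not estimates obtained from Table 9.49.

```python
import math


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probabilities(thresholds, eta):
    """Category probabilities under logit P(Y <= c) = threshold_c + eta.

    `thresholds` must be increasing; with C categories there are C - 1 of them.
    """
    # Cumulative probabilities for the first C - 1 categories, then 1 for the last.
    cumulative = [inv_logit(t + eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        # Each category probability is a difference of adjacent cumulative ones.
        probs.append(c - previous)
        previous = c
    return probs


if __name__ == "__main__":
    thresholds = [-1.0, 1.0]                       # hypothetical: worse | no change | better
    effects = {"Placebo": 0.0, "Trt1": -0.8}       # hypothetical treatment effects
    for name, eta in effects.items():
        probs = category_probabilities(thresholds, eta)
        print(name, [round(p, 3) for p in probs])
    # Under proportional odds, one odds ratio applies to every cumulative split.
    print("cumulative odds ratio Trt1 vs Placebo:", round(math.exp(-0.8), 3))
```

Because the same η shifts every cumulative logit, a single odds ratio, exp(η_Trt1 − η_Placebo), describes the treatment effect at every cut point; that is the proportional odds assumption examined in part (d).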
Repeat (b) through (d), assuming a generalized multinomial logit in Exercise 9.11.1. Discuss your results.

Repeat (b) through (d), assuming a multinomial cumulative probit in Exercise 9.11.1. Discuss your results and compare with those found in (1) and (2).

Alternatively, the contingency table approach can be implemented using a log-linear model. For the previous example, 9.11.1, fit the log-linear model

$$\log \lambda_{ijk} = \mu + \tau_i + \varpi_j + (\tau\varpi)_{ij} + c_k + (\tau c)_{ik} + (\tau\varpi c)_{ijk}$$

where $\lambda_{ijk}$ is the expected count for the ijk-th treatment × week × response category combination, and $\tau$, $\varpi$, and $c$ refer to treatment, week, and response category effects, respectively.

Table 9.50 Nitrogen injection treatment factors study
Handling practices
1 = N surface applied without additional water injection
2 = N surface applied with supplementary water injection
3 = N injected with a number 56 nozzle (7.6 cm depth of injection)
4 = N injected with a number 53 nozzle (12.7 cm depth of injection)
Application rate
2.5 g/m2
5.0 g/m2

             Handling practices 1    Handling practices 2    Handling practices 3    Handling practices 4
Quality      N1    N2    Total       N1    N2    Total       N1    N2    Total       N1    N2    Total
Poor         14    5     19          15    8     23          0     0     0           1     0     1
Average      2     11    13          1     8     9           9     2     11          12    4     16
Good         0     0     0           0     0     0           7     14    21          0     0     0
Excellent    0     0     0           0     0     0           0     0     0           0     0     0
Total        16    16    32          16    16    32          16    16    32          16    16    32

Exercise 9.11.2 Fertilization of turf has traditionally been accomplished through surface applications. The introduction of new equipment (Hydroject) has made it possible to place soluble materials below the surface (Table 9.50). A study was conducted during the 1997 growing season to compare surface application and subsoil injection of nitrogen on the green color of bentgrass (Agrostis palustris L. Huds) 1 year after transplanting.
The treatment structure was a full factorial of the grass management practice (four levels) and the rate of nitrogen application per square meter (g/m2; two levels). The eight treatment combinations were arranged in a completely randomized design with four replications. Turf color was evaluated in each experimental unit at weekly intervals for 4 weeks as poor, average, good, or excellent. Of particular interest was the determination of the water injection effect, the subsurface effect, and the comparison of injection versus surface applications. These are contrasts among the levels of the management practice factor, and the primary objective was to determine whether this factor interacts with the rate of application.

(a) List all the GLMM components of this experiment.
(b) Fit the multinomial cumulative logit proportional odds model to these data. Perform a complete and appropriate analysis of the data, focusing on:
    (i) An evaluation of the effects of the combination of treatments
    (ii) Interpretation of the odds ratios
    (iii) The expected probability per category for each treatment
(c) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption.

Exercise 9.11.3 Refer to Exercise 9.11.1.

(a) Fit the multinomial generalized logit proportional odds model to these data.
(b) List all the components of the GLMM of this experiment.
(c) Perform a complete and appropriate analysis of the data, focusing on:
    (i) An evaluation of the effects of the combination of treatments
    (ii) Interpretation of odds ratios
    (iii) The expected probability per category for each treatment
(d) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption.

Exercise 9.11.4 Refer to Exercise 9.11.1.

(a) List all the components of the GLMM of this experiment.
(b) Fit the multinomial cumulative probit proportional odds model to these data. Perform a complete and appropriate analysis of the data, focusing on:
    (i) An evaluation of the effects of the combination of treatments
    (ii) Interpretation of the odds ratios
    (iii) The expected probability per category for each treatment
(c) Test whether the proportional odds assumption is viable. Cite relevant evidence to support your conclusion regarding the adequacy of the assumption.

Appendix

Data: Feeding line experiment
Tray  Feeding  Run  Proportion    Tray  Feeding  Run  Proportion
1     H        1    0.18217       1     H        3    0.06818
2     H        1    0.15493       2     H        3    0.05874
3     H        1    0.15906       3     H        3    0.05757
4     H        1    0.15869       4     H        3    0.10349
5     H        1    0.14891       5     H        3    0.08564
6     H        1    0.17654       6     H        3    0.09359
7     H        1    0.12915       7     H        3    0.09706
8     H        1    0.12895       8     H        3    0.13188
9     H        1    0.16688       9     H        3    0.18477
10    H        1    0.11965       10    H        3    0.10966
11    H        1    0.21719       11    H        3    0.18069
12    H        1    0.20797       12    H        3    0.18182
1     L        1    0.70601       1     L        3    0.75524
2     L        1    0.68817       2     L        3    0.77249
3     L        1    0.68317       3     L        3    .
4     L        1    0.77805       4     L        3    0.84204
5     L        1    0.76692       5     L        3    0.81572
6     L        1    0.79127       6     L        3    0.79161
7     L        1    0.73653       7     L        3    0.81234
8     L        1    0.74939       8     L        3    0.81795
9     L        1    0.78773       9     L        3    0.8225
10    L        1    0.7381        10    L        3    0.79384
11    L        1    0.88486       11    L        3    0.8135
12    L        1    0.90401       12    L        3    0.83965
1     H        2    0.07547       1     H        4    0.07105
2     H        2    0.05801       2     H        4    0.05511
3     H        2    0.0565        3     H        4    0.05217
4     H        2    0.09579       4     H        4    0.10567
5     H        2    0.10954       5     H        4    0.0755
6     H        2    0.12154       6     H        4    0.0853
7     H        2    0.1144        7     H        4    0.09363
8     H        2    0.13728       8     H        4    0.11154
9     H        2    0.15012       9     H        4    0.16264
10    H        2    0.12113       10    H        4    0.09215
11    H        2    0.17633       11    H        4    0.1834
12    H        2    0.16408       12    H        4    0.21016
1     L        2    0.78318       1     L        4    0.76556
2     L        2    0.78418       2     L        4    0.78307
3     L        2    0.78589       3     L        4    0.76486
4     L        2    0.78867       4     L        4    0.82391
5     L        2    0.81988       5     L        4    0.8044
6     L        2    0.82793       6     L        4    0.81178
7     L        2    0.81384       7     L        4    0.84339
8     L        2    0.81037       8     L        4    0.78833
9     L        2    0.77528       9     L        4    0.79804
10    L        2    0.78916       10    L        4    0.82236
11    L        2    0.87109       11    L        4    0.83807
12    L        2    0.84704       12    L        4    0.85532

Data: Nitrous oxide emission
A T T F pH 3 1 1 108 8 2 2 1 15 4.7 1 3 1 33 5.5 4 4 1 23 6.4 3 1 2 -58 8 2 2 2 -45 4.7 1 3 2 -97 5.5 4 4 2 185 6.4 3 1 3 47 8 2 2 3 26 4.7 1 3 3 41 5.5 19 6.4 4 4 3 3 1 4 311 8 2 2 4 -29 4.7 1 3 4 37 5.5 4 4 4 204 6.4 3 1 5 6.8 8 2 2 5 -18 4.7 1 3 5 68 5.5 4 4 5 91 6.4 3 1 1 135 4.6 2 2 1 1.4 4.3 1 3 1 55 6.5 4 4 1 18 6.1 3 1 5 5 4.6
HE 85 86 85 84 85 86 85 84 85 86 85 84 85 86 85 84 85 86 85 84 84 85 85 85 84
TE 20 20 20 21 23 24 29 28 32 32 34 34 28 27 28 27 22 22 22 22 20 20 20 21 22
tx 17 17 17 17 34 34 34 34 35 35 35 35 30 30 30 30 27 27 27 27 18 18 18 18 24
tm 16 16 16 16 34 34 34 34 35 35 35 35 30 30 30 30 27 27 27 27 18 18 18 18 24
Hx 69 69 69 69 54 54 54 54 38 38 38 38 40 40 40 40 51 51 51 51 60 60 60 60 61
Hm 68 68 68 68 51 51 51 51 30 30 30 30 38 38 38 38 43 43 43 43 59 59 59 59 59
A 2 3 4 1 3 4 1 3 4 1 3 4 1 3 4 1 3 4 1 3 4 1 3 4 1
T 1 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4 2 3 4
t 1 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 1 1 1 5 5 5 1 1 1
4.13 82.4 51.4 704 14 130 537 1.02 41.9 824 53.9 9.88 745 92.7 187 591 7 8.39 1367 -49 -91 711 108 15 621
F pH 6.9 7 6.4 95 7 6.4 95 7 6.4 95 7 6.4 95 7 6.4 95 6.6 6.8 86 6.6 6.8 86 7.1 6.8 86
HE 87 87 87 20 87 87 24 87 87 24 87 87 20 87 87 20 85 87 20 85 87 19 86 87 20
TE 20 22 20 19 22 24 27 25 24 28 20 20 22 20 20 22 20 20 20 19 19 18 20 20 17
tx 19 19 19 19 27 27 27 28 28 27 22 22 22 22 22 22 20 20 18 18 18 18 17 17 16
tm 19 19 19 67 27 27 68 27 27 63 22 22 61 22 22 65 18 18 80 18 18 89 16 16 87
Hx 67 67 67 66 68 68 67 63 63 57 61 61 60 65 65 62 80 80 74 89 89 89 87 87 87
Hm 66 66 66 170 67 67 170 57 57 170 60 60 170 62 62 170 74 74 170 89 89 170 87 87 170
2 1 4 3 2 1 4 3 2
1 4 3 2 1 4 3 2 1 4 1 3 2 1 3 2 1 3 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 3 4 1 3 4 1 3 5 5 5 1 1 1 1 5 5 5 5 1 1 1 1 5 5 5 5 1 1 1 2 2 2 3 3 12 121 51 21 87 21 28 31 101 -37 136 26 16 92 -82 35 -10 41 19 16 83 28 24 29 9.4 29 38 4.3 6.5 6.1 4.3 4.6 6.5 6.1 4.3 4.6 6.5 6.1 4.4 4.8 5.5 6 4.4 4.8 5.5 6 54 5.8 7.2 55 5.8 7.2 44 5.8 85 85 85 85 86 85 85 85 86 85 85 84 85 87 85 84 85 87 85 160 85 84 160 85 84 160 85 22 22 23 19 19 19 19 23 23 23 23 19 19 19 19 22 25 23 23 161 21 20 161 23 22 161 31 24 24 24 18 18 18 18 25 25 25 25 19 19 19 19 24 24 24 24 4 16 16 4 24 24 4 29 24 24 24 17 17 17 17 25 25 25 25 19 19 19 19 22 22 22 22 2 19 19 2 23 23 2 25 61 61 61 61 61 61 61 57 57 57 57 61 61 61 61 67 67 67 67 2 50 50 2 58 58 2 51 59 59 59 58 58 58 58 55 55 55 55 60 60 60 60 62 62 62 62 1 54 54 2 55 55 3 44 3 4 1 3 4 1 3 4 1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 2 3 4 2 3 4 2 3 4 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 5 5 5 1 1 1 5 5 5 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 6.19 1.36 656 18.2 55.9 731 -77 33.6 880 29.4 49.3 69.7 36.1 -100 123 45.3 18.2 -216 -71 3.47 57.6 59.1 38.2 74.4 26 67.5 -77 7.1 6.8 86 7.3 7 90 7.3 7 90 7.5 7.1 6.7 6.8 7.5 7.1 6.7 6.8 7.5 7.1 6.7 6.8 7.5 7.1 6.7 6.8 7.5 7.1 86 87 25 86 87 19 86 87 26 88 88 85 90 88 88 85 90 88 88 85 90 88 88 85 90 88 88 26 24 24 20 20 18 26 25 25 20 20 20 24 24 24 24 30 30 30 30 26 26 26 26 24 24 24 24 24 24 18 18 17 25 25 25 20 20 20 42 42 42 42 39 39 39 39 31 31 31 31 26 26 26 24 24 65 17 17 77 25 25 65 21 21 21 29 29 29 29 34 34 34 34 28 28 28 28 25 25 25 64 64 170 76 76 170 63 63 163 59 59 59 20 20 20 20 29 29 29 29 23 23 23 23 31 31 31 (continued) 65 65 64 77 77 76 65 65 63 60 60 60 51 51 51 51 45 45 45 45 26 26 26 26 43 43 43 Appendix 419 Data: Nitrous oxide emission A T T F pH 2 4 3 1.2 7.2 1 1 4 28 35 3 3 4 81 5.8 2 4 4 17 7.2 1 1 5 26 39 3 3 5 99 5.8 2 4 5 31 7.2 1 1 1 20 50 3 3 1 -45 6.4 2 4 1 108 7.4 1 1 5 27 47 2.3 6.4 3 3 5 2 4 5 39 7.4 1 1 1 24 53 3 3 1 13 6.4 2 4 1 148 7.1 1 1 5 24 57 3 3 5 9 6.4 2 4 5 13 
7.1 1 1 1 22 69 3 3 1 -36 6.5 2 4 1 45 6.8 1 1 5 23 70 3 3 5 35 6.5 2 4 5 -17 6.8 HE 84 160 85 84 160 85 84 160 85 84 160 85 84 160 85 84 160 85 84 160 83 84 160 83 84 TE 31 161 27 29 161 20 20 161 20 22 161 28 28 161 20 20 161 28 28 161 20 20 161 22 22 tx 29 4 28 28 4 26 26 4 20 20 4 27 27 4 24 24 4 24 24 4 22 22 4 23 23 tm 25 2 27 27 2 26 26 2 20 20 2 26 26 2 22 22 2 23 23 2 21 21 2 22 22 Hx 51 2 35 35 2 41 41 2 54 54 2 50 50 2 54 54 2 60 60 2 73 73 2 71 71 Hm 44 4 35 35 5 39 39 1 50 50 5 47 47 1 53 53 5 57 57 1 69 69 5 70 70 A 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 T 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 t 5 1 1 1 1 5 5 5 5 1 1 1 1 5 5 5 5 1 1 1 1 5 5 5 5 40.8 37.5 88.4 63.4 -72 -83 95.1 72.8 12.2 27 23.2 -0.7 61.1 73.8 62.1 -96 40.2 7.22 5.53 74.5 -23 90.3 -21 63.9 -16 F pH 6.7 7.2 8.1 7.6 7.4 7.2 8.1 7.6 7.4 6.9 8.1 7.4 7.3 6.9 8.1 7.4 7.3 7.3 7.9 7.4 7.4 7.3 7.9 7.4 7.4 HE 85 89 88 90 85 89 88 90 85 88 90 88 86 88 90 88 86 90 88 86 82 90 88 86 82 TE 24 22 21 22 22 26 26 26 26 21 22 22 24 19 20 20 20 17 17 18 19 20 20 20 20 tx 26 18 18 18 18 29 30 31 32 23 23 23 23 18 18 18 18 17 17 17 17 21 21 21 21 tm 25 17 17 17 17 27 28 29 30 22 22 22 22 18 18 18 18 17 17 17 17 20 20 20 20 Hx 43 61 61 61 61 48 49 50 51 63 63 63 63 76 76 76 76 73 73 73 73 70 70 70 70 Hm 31 48 48 48 48 45 46 47 48 58 58 58 58 65 65 65 65 67 67 67 67 69 69 69 69 420 9 Generalized Linear Mixed Models for Repeated Measurements Appendix 421 Data: Percentage inhibition (Bio bioassay, Con concentration, Rep repetition, Por percentage inhibition) Bio Day Con Rep Por Bio Day Con Rep Por 1 1 0 3 5.2632 1 6 2000 4 35.1724 1 1 0 4 5.2632 2 1 0 2 0.0016 1 1 500 1 15.7895 2 1 0 3 14.2857 1 1 500 2 26.3158 2 1 500 1 42.8571 1 1 500 3 15.7895 2 1 500 2 42.8571 1 1 500 4 15.7895 2 1 500 3 42.8571 1 1 1000 1 36.8421 2 1 500 4 42.8571 1 1 1000 2 36.8421 2 1 1000 1 7.1429 1 1 1000 3 36.8421 2 1 1000 2 42.8571 1 1 1000 4 36.8421 2 1 1000 3 42.8571 1 1 2000 1 15.7895 2 1 1000 4 42.8571 1 1 
2000 2 36.8421 2 2 0 1 1.3699 1 1 2000 3 36.8421 2 2 0 2 1.3699 1 1 2000 4 36.8421 2 2 0 4 1.3699 1 2 0 2 1.9355 2 2 500 1 34.2466 1 2 0 3 4.5161 2 2 500 2 31.5068 1 2 0 4 1.9355 2 2 500 3 42.4658 1 2 500 1 43.2258 2 2 500 4 36.9863 1 2 500 2 48.3871 2 2 1000 1 34.2466 1 2 500 3 40.6452 2 2 1000 2 47.9452 1 2 500 4 40.6452 2 2 1000 3 45.2055 1 2 1000 1 35.4839 2 2 1000 4 45.2055 1 2 1000 2 45.8065 2 2 2000 1 47.9452 1 2 1000 3 43.2258 2 2 2000 2 53.4247 1 2 1000 4 32.9032 2 2 2000 3 50.6849 2000 4 56.1644 1 2 2000 1 58.7097 2 2 1 2 2000 2 53.5484 2 3 0 1 4.2735 1 2 2000 3 53.5484 2 3 0 4 14.5299 1 2 2000 4 58.7097 2 3 500 1 28.2051 1 3 0 2 1.2346 2 3 500 2 28.2051 1 3 0 3 3.7037 2 3 500 3 35.0427 1 3 500 1 25.9259 2 3 500 4 24.7863 1 3 500 2 23.4568 2 3 1000 1 24.7863 1 3 500 3 23.4568 2 3 1000 2 35.0427 1 3 500 4 24.6914 2 3 1000 3 24.7863 1 3 1000 1 30.8642 2 3 1000 4 26.4957 1 3 1000 2 32.0988 2 3 2000 1 40.1709 1 3 1000 3 28.3951 2 3 2000 2 38.4615 1 3 1000 4 25.9259 2 3 2000 3 47.0085 1 3 2000 1 53.0864 2 3 2000 4 41.8803 1 3 2000 2 49.3827 2 4 0 2 1.5015 1 3 2000 3 49.3827 2 4 0 3 1.5015 (continued) 422 9 Generalized Linear Mixed Models for Repeated Measurements Data: Percentage inhibition (Bio bioassay, Con concentration, Rep repetition, Por percentage inhibition) Bio Day Con Rep Por Bio Day Con Rep Por 1 3 2000 4 51.8519 2 4 0 4 1.5015 1 4 0 3 4.6729 2 4 500 1 20.7207 1 4 500 1 19.6262 2 4 500 2 23.1231 1 4 500 2 20.5607 2 4 500 3 27.9279 1 4 500 3 22.4299 2 4 500 4 20.7207 1 4 500 4 20.5607 2 4 1000 1 35.1351 1 4 1000 1 21.4953 2 4 1000 2 26.7267 1 4 1000 2 21.4953 2 4 1000 3 26.7267 1 4 1000 3 23.3645 2 4 1000 4 32.7327 1 4 1000 4 20.5607 2 4 2000 1 33.9339 1 4 2000 1 42.0561 2 4 2000 2 37.5375 1 4 2000 2 36.4486 2 4 2000 3 44.7447 3 32.7103 2 4 2000 4 38.7387 1 4 2000 1 4 2000 4 40.1869 2 5 0 2 2.008 1 5 0 3 4.065 2 5 0 4 0.4016 1 5 0 4 4.065 2 5 500 1 13.253 1 5 500 1 21.1382 2 5 500 2 21.2851 1 5 500 2 24.3902 2 5 500 3 21.2851 1 5 500 3 17.0732 2 5 
500 4 18.0723 1 5 500 4 17.0732 2 5 1000 1 21.2851 1 5 1000 1 18.6992 2 5 1000 2 18.0723 1 5 1000 2 18.6992 2 5 1000 3 16.4659 1 5 1000 3 20.3252 2 5 1000 4 16.4659 1 5 1000 4 17.8862 2 5 2000 1 35.743 1 5 2000 1 41.4634 2 5 2000 2 34.1365 2000 3 29.3173 1 5 2000 2 38.2114 2 5 1 5 2000 3 34.1463 2 5 2000 4 30.9237 1 5 2000 4 33.3333 2 6 0 2 4.2159 1 6 0 3 4.8276 2 6 0 4 0.1686 1 6 0 4 2.069 2 6 500 1 18.3811 1 6 500 1 17.2414 2 6 500 2 20.4047 1 6 500 2 18.6207 2 6 500 3 22.4283 1 6 500 3 16.5517 2 6 500 4 20.4047 1 6 500 4 13.7931 2 6 1000 1 21.0793 1 6 1000 1 15.8621 2 6 1000 2 17.7066 1 6 1000 2 16.5517 2 6 1000 3 17.7066 1 6 1000 3 15.8621 2 6 1000 4 20.4047 1 6 1000 4 18.6207 2 6 2000 1 31.1973 1 6 2000 1 32.4138 2 6 2000 2 29.1737 6 2000 2 29.6552 2 6 2000 3 29.8482 1 1 6 2000 3 31.7241 2 6 2000 4 30.5228

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.