Animal Models

L. R. Schaeffer

© L. R. Schaeffer, 2019

Acknowledgements

C. R. Henderson
Dale Van Vleck
Jim Wilton
George Wiggans
Brian Kennedy
Ivan Mao
Bill Slanger
Janusz Jamrozik
Knut Ronningen
Oje & Britta Danell
Bill Szkotnicki
Jack Dekkers
John Moxley
Jan Philipsson
Daniel Gianola
Ignacy Misztal
Reinhard Reents
Karin Meyer
Bruce Tier
All my Graduate Students
All co-authors
Laura McKay
Cornell University 1970

Contents

Part I  Animal Models .... 15

1 Background History .... 17
  1.1 Sire Model .... 17
    1.1.1 Fixed Versus Random Factors .... 20
    1.1.2 Better Sires in Better Herds .... 21
    1.1.3 Sires Related .... 27
  1.2 The Animal Model .... 28
  1.3 Model Considerations .... 30
    1.3.1 Time and Location Effects .... 30
    1.3.2 Contemporary Groups .... 31
    1.3.3 Interaction of Fixed Factors with Time .... 32
    1.3.4 Pre-adjustments .... 33
    1.3.5 Additive Genetic Effects .... 34
    1.3.6 Permanent Environmental Effects .... 35

2 Genetic Relationships .... 37
  2.1 Pedigree Preparation .... 37
  2.2 Additive Relationships .... 39
  2.3 Inbreeding Calculations .... 43
  2.4 Example Additive Matrix .... 46
  2.5 Inverse of Additive Relationship Matrix .... 48

3 Mixed Model Equations .... 53
  3.1 Accuracies .... 55
  3.2 Expression of EBVs .... 56
    3.2.1 Genetic Base .... 57
    3.2.2 EBV .... 58
    3.2.3 EPD, ETA .... 58
    3.2.4 RBV .... 59
    3.2.5 Percentile Ranking .... 59
    3.2.6 Intermediate Traits .... 60
  3.3 Numerical Example .... 60

4 Phantom Parent Groups .... 69
  4.1 Background History .... 71
  4.2 Simplifying the MME .... 72

5 Estimation of Variances .... 75
  5.1 Some History .... 75
  5.2 Modern Thinking .... 76
  5.3 The Joint Posterior Distribution .... 77
    5.3.1 Conditional Distribution of Data Vector .... 78
    5.3.2 Prior Distributions of Random Variables .... 78
  5.4 Fully Conditional Posterior Distributions .... 81
    5.4.1 Fixed and Random Effects of the Model .... 81
    5.4.2 Variances .... 82
  5.5 Computational Scheme .... 82
    5.5.1 Year Effects .... 84
    5.5.2 Contemporary Group Effects .... 85
    5.5.3 Contemporary Group Variance .... 85
    5.5.4 Animal Genetic Effects .... 86
    5.5.5 Additive Genetic Variance .... 86
    5.5.6 Residual Variance .... 87
    5.5.7 Visualization .... 87
    5.5.8 Burn-In Periods .... 89
    5.5.9 Post Burn-In Analysis .... 90
    5.5.10 Long Chain or Many Chains? .... 91
  5.6 Software .... 91
  5.7 Covariance Matrices .... 92
    5.7.1 Example .... 92

6 Repeated Records Animal Model .... 95
  6.1 Traditional Approach .... 95
    6.1.1 The Model .... 96
    6.1.2 Example .... 97
    6.1.3 Estimation of Variances .... 100
  6.2 Cumulative Environmental Approach .... 103
    6.2.1 A Model .... 104
    6.2.2 Example .... 106
    6.2.3 Estimation of Variances .... 108
  6.3 Alternative Models .... 109

7 Multiple Trait Models .... 111
  7.1 Introduction .... 111
  7.2 Data .... 112
  7.3 Gibbs Sampling .... 115
    7.3.1 Year-Season Effects .... 117
    7.3.2 Age Effects .... 118
    7.3.3 Days in Milk Group Effects .... 119
    7.3.4 HYS Effects .... 120
    7.3.5 Additive Genetic Effects .... 121
    7.3.6 Residual Covariances .... 122
  7.4 Positive Definite Matrices .... 123
  7.5 Starting Matrices .... 125

8 Maternal Traits .... 127
  8.1 Introduction .... 127
  8.2 Data .... 129
  8.3 A Model .... 131
    8.3.1 Data Structure .... 132
    8.3.2 Assumptions .... 133
  8.4 MME .... 134
    8.4.1 Preparing the Data .... 134
    8.4.2 Covariance Matrices .... 135
    8.4.3 Year-Month Effects .... 136
    8.4.4 Gender Effects .... 137
    8.4.5 Flock-Year-Month Effects .... 138
    8.4.6 Animal Additive Genetic Effects .... 139
    8.4.7 Maternal Genetic Effects .... 140
    8.4.8 Maternal PE Effects .... 142
    8.4.9 Residual Effects .... 143

Part II  Random Regression Analyses .... 145

9 Longitudinal Data .... 147
  9.1 Introduction .... 147
  9.2 Collect Data .... 149
  9.3 Covariance Functions .... 151
  9.4 Reduced Orders of Fit .... 159
  9.5 Data Requirements .... 164

10 The Models .... 167
  10.1 Fitting The Trajectory .... 167
    10.1.1 Spline Functions .... 172
    10.1.2 Classification Approach .... 174
  10.2 Random Variation in Curves .... 176
  10.3 Residuals .... 178
  10.4 Complete Model .... 179
    10.4.1 Fixed Factors .... 179
    10.4.2 Random Factors .... 180
    10.4.3 Mixed Model Equations .... 181

11 RRM Calculations .... 185
  11.1 Enter Data .... 189
  11.2 Enter Starting Covariance Matrices .... 192
  11.3 Sampling Process .... 194
    11.3.1 Year-Month Effects .... 194
    11.3.2 Age Groups .... 196
    11.3.3 Contemporary Groups .... 196
    11.3.4 Animal Additive Genetic Effects .... 197
    11.3.5 Animal Permanent Environmental Effects .... 199
    11.3.6 Residual Covariance Matrices .... 200

12 Lactation Production .... 203
  12.1 Measuring Yields .... 204
  12.2 Curve Fitting .... 205
  12.3 Factors in a Model .... 207
    12.3.1 Observations .... 207
    12.3.2 Year-Month of Calving .... 207
    12.3.3 Age-Season of Calving .... 208
    12.3.4 Days Pregnant .... 208
    12.3.5 Herd-Test-Day .... 209
    12.3.6 Herd-Year-Season of Calving .... 209
    12.3.7 Additive Genetic Effects .... 210
    12.3.8 Permanent Environmental Effects .... 210
    12.3.9 Number Born .... 210
    12.3.10 Residual Effects .... 211
  12.4 Covariance Function Matrices .... 211
  12.5 Expression of EBVs .... 215

13 Growth .... 219
  13.1 Curve Fitting .... 222
    13.1.1 Spline Function .... 222
    13.1.2 Classification Approach .... 223
  13.2 Model Factors .... 223
    13.2.1 Observations .... 223
    13.2.2 Year-Month of Birth-Gender (fixed) .... 224
    13.2.3 Year - Age of Dam - Gender (fixed) .... 224
    13.2.4 Contemporary-Management Groups (random) .... 224
    13.2.5 Litter Effects (random) .... 225
    13.2.6 Animal Additive Genetic Effects (random) .... 225
    13.2.7 Animal Permanent Environmental Effects (random) .... 226
    13.2.8 Maternal Genetic Effects (random) .... 226
    13.2.9 Maternal Permanent Environmental Effects (random) .... 226
    13.2.10 Residual Variances (random) .... 227
    13.2.11 Summary .... 227
  13.3 Covariance Function Matrices .... 227
  13.4 Expression of EBVs .... 230

14 Survival .... 233
  14.1 Survival Function .... 234
  14.2 Model Factors .... 237
    14.2.1 Year-Season of Birth-Gender (fixed) .... 237
    14.2.2 Age at First Calving (fixed) .... 237
    14.2.3 Production Level (fixed) .... 238
    14.2.4 Conformation Level (fixed) .... 238
    14.2.5 Unexpected Events .... 238
    14.2.6 Contemporary Groups (random) .... 239
    14.2.7 Animal Additive Genetic Effects (random) .... 239
    14.2.8 Animal Permanent Environment Effects (random) .... 239
    14.2.9 Residual Variances (random) .... 239
  14.3 Example .... 240
    14.3.1 Year Trajectories .... 245
    14.3.2 Contemporary Groups .... 247
    14.3.3 Animal Estimated Breeding Values .... 248
    14.3.4 Covariance Matrices .... 250

Part III  Loose Ends .... 253

15 Selection .... 255
  15.1 Randomization .... 255
  15.2 Multiple Traits .... 257
  15.3 Better Sires In Better Herds Problem .... 261
    15.3.1 The Simulation .... 262
    15.3.2 Results .... 263
  15.4 Nonrandom Mating .... 264
  15.5 Masking Selection .... 267
  15.6 Preferential Treatment .... 270
  15.7 Nonrandom Progeny Groups .... 270
  15.8 Summary .... 271

16 Genomics .... 273
  16.1 Introduction .... 274
  16.2 Data .... 277
  16.3 Model for Growth Traits .... 279
  16.4 Models Involving Genomics .... 280
  16.5 Comparison of Models .... 283
  16.6 Results .... 284

17 Other Models .... 289
  17.1 Sire-Dam Model .... 289
  17.2 Sire Models .... 291
  17.3 Reduced Animal Models .... 291

18 Non-Additive Genetic Effects .... 299
  18.1 Interactions at a Single Locus .... 299
  18.2 Interactions for Two Unlinked Loci .... 301
    18.2.1 Estimation of Additive Effects .... 302
    18.2.2 Estimation of Dominance Effects .... 303
    18.2.3 Additive by Additive Effects .... 305
    18.2.4 Additive by Dominance Effects .... 306
    18.2.5 Dominance by Dominance Effects .... 307
  18.3 More than Two Loci .... 308
  18.4 Linear Models for Non-Additive Genetic Effects .... 308
    18.4.1 Simulation of Data .... 308
    18.4.2 HMME .... 311
  18.5 Computing Simplification .... 312
  18.6 Estimation of Variances .... 314

19 Threshold Models .... 315
  19.1 Categorical Data .... 315
  19.2 Threshold Model Computations .... 316
    19.2.1 Equations to Solve .... 318
  19.3 Estimation of Variance Components .... 325
  19.4 Expression of Genetic Values .... 326

Part IV  Computing Ideas .... 327

20 Fortran Programs .... 329
  20.1 Main Program .... 329
  20.2 Call Params .... 335
  20.3 Call Peddys .... 338
  20.4 Call Datum .... 340
  20.5 Iteration Subroutines .... 343
    20.5.1 Year-Month of Calving Factor .... 343
    20.5.2 Region-Age-Season of Calving .... 345
    20.5.3 Contemporary Groups .... 348
    20.5.4 Animal Permanent Environmental .... 352
    20.5.5 Animal Additive Genetic .... 356
    20.5.6 Residual Effects .... 361
    20.5.7 Finish Off .... 364
  20.6 Other Items .... 364

21 DKMVHF - Inversion Routine .... 367

22 References and Suggested Readings .... 373

Part I
Animal Models

Chapter 1
Background History

The individual cow model first appeared in Dr. Henderson's graduate course notes during the mid-1960s at Cornell University. Because Henderson was a dairy cattle geneticist, he used the term 'cow' model rather than animal model.
Those models showed the need for the additive genetic relationship matrix and its inverse, but in those early years computers (with only 128K of memory) were not capable of storing such a matrix for the cows in a national or regional population. The term 'animal' model came from a beef cattle paper by Pollak and Quaas (1981), a name that stuck to the concept and that Henderson and everyone else readily accepted.

In 1976, Henderson discovered and published his rapid method for inverting the additive genetic relationship matrix. That simple discovery made it possible to apply an animal model to a national dairy cattle genetic evaluation program. However, the first applied animal models did not appear until 1987 (Wiggans and Misztal), due mainly to a lack of computer power. Bill Slanger (1976) programmed an early within-herd cow model for evaluating both dairy cows and bulls, but in it genetic relationships between animals in different herds had to be ignored.

1.1 Sire Model

The animal model came after the sire model, and therefore it is important to know the sire model. In 1970, the animal breeding world was introduced to linear models and best linear unbiased prediction (BLUP) methods by C. R. Henderson through the Northeast AI Sire Comparison. The initial model was

\[
y_{ijklm} = YS_i + HYS_{ij} + G_k + S_{kl} + e_{ijklm},
\]

where $y_{ijklm}$ was the first lactation 305-d milk yield of daughter $m$ of sire $l$, belonging to genetic group $k$, making a record in year-season of calving $i$ and herd-year-season $j$; $YS_i$ was a fixed year-season of calving effect to account for time trends in the data; $HYS_{ij}$ was a random herd-year-season of first calving contemporary group; $G_k$ was a fixed sire genetic group, defined by the year of sampling and AI ownership; $S_{kl}$ was a random sire effect within genetic group; and $e_{ijklm}$ was a random residual effect.
In matrix notation, let y be the vector of first lactation milk yields, b the vector of year-season effects, h the vector of herd-year-season effects, g the vector of genetic group effects, s the vector of sire transmitting abilities, and e the vector of residuals; then

\[
y = Xb + Wh + Qg + Zs + e,
\]

where X, W, Q, and Z are design matrices relating observations to the factors in the model. Also,

\[
\begin{aligned}
E(y) &= Xb + Qg, & E(h) &= 0, & E(s) &= 0, & E(e) &= 0, \\
Var(e) &= I\sigma_e^2, & Var(s) &= I\sigma_s^2, & Var(h) &= I\sigma_h^2.
\end{aligned}
\]

The assumptions of this model were:

1. Sires were unrelated to each other.
2. Sires were mated randomly to dams.
3. Progeny were a random sample of daughters.
4. Daughters of sires were randomly distributed across herd-year-seasons.
5. Milk yields were adjusted perfectly for age and month of calving.

The limitations were:

1. Sires were related to each other.
2. Sires were not randomly mated to dams in the population.
3. Daughters of higher priced bulls tended to be associated with richer herds that supposedly had better environments.
4. The age-month of calving adjustment factors were not without errors.
5. Only first lactation records were used.
6. Cows were not evaluated.

1.1.1 Fixed Versus Random Factors

Most statistics books in 1970 would define a fixed factor as one with relatively few levels, such as age groups, treatments, diets, and years, for which differences among the levels were to be estimated. If the conceptualized experiment were repeated, then the same levels of ages, treatments, or diets might appear again, and their estimated differences would be expected to be consistent between the two samples. However, year effects would differ in a repeated sampling approach because time does not stand still and years do not repeat themselves. In the proposed model above, year-seasons were considered a fixed factor because there were a limited number of years and a limited number of seasons per year in the data.
A random factor, on the other hand, would have many levels, large enough to be considered infinite, and viewed as being randomly sampled from an infinite population, such as sires or herd-year-seasons. If the conceptualized study were repeated, then the sires and herd-year-seasons would be completely new, i.e. a different random sample. In animal breeding situations, studies are not repeated; instead, new sires and new herd-year-seasons are constantly being generated, based partly on the composition of previous samples, and thus each occurs only once. Sire effects were considered a random factor even though the number of sires was fewer than 1500 for Holsteins, which is considerably less than infinite. Herd-year-season effects were also a random factor, having around 100,000 levels.

The distinction between fixed and random factors was also based on how the results were to be interpreted. With a fixed factor, interest was in the estimated differences between levels of that factor, for those specific levels. For example, if there were two diets involving different feed components, then the researcher would be interested in the impacts of those two diets on milk production or growth. There could be fifty other diets that could be compared, but the experiment was only for those two specific diets. With a random factor, the researcher would be more interested in the variation of the effects of its levels on milk production or growth, and not in any specific levels. Herd-year-season effects might account for 10% of the variation in milk yields across the country; this fact was more important than the specific difference between two year-seasons within one herd. In genetic evaluation, researchers are interested in the realized values of levels of a random sire factor. Levels of random factors occur outside the control of the researcher.
Researchers do not decide which herd-year-seasons will have cows calving and certainly do not control the effect of a herd-year-season on the yields of cows. The researcher also does not control how many herd-year-seasons there will be, or where they will be. On the other hand, researchers do decide which diets will be compared, in which herds, and with which animals, and thus diets are a fixed factor. Basically, any experiment in a research environment will have mostly fixed factors in the model, whereas data collected from livestock producers will have more random factors influencing the observations that the researcher has not been able to control. Determining whether a factor is fixed or random involves many shades of grey in practice. In the end, the researcher makes the decision based on experience and tradition in that field of study.

1.1.2 Better Sires in Better Herds

Before the Northeast AI Sire Comparison became public, the dairy industry in the northeast believed that the better sires were used in the better herds, thus creating a bias in sire estimated transmitting abilities. Henderson had a theory about different kinds of selection bias and how to account for them, which he had taught in his graduate course at Cornell since 1965, but did not publish until 1975. To illustrate the problem, consider the progeny data in Table 1.1, representing 1 year-season, 3 herds, and 4 sires. Ignore genetic groups in this example.

Table 1.1. Example data, $(n_{ij}, y_{ij.})$, on 4 sires and 3 herds in 1 year-season.

                             Sires
  Herd        1          2          3          4        Totals
    1      (9,270)   (16,432)    (0,0)      (0,0)     (25,702)
    2      (1,27)     (3,72)     (5,85)     (1,12)    (10,196)
    3      (0,0)      (1,21)    (25,350)   (39,351)   (65,722)
  Totals  (10,297)   (20,525)   (30,435)   (40,363)  (100,1620)

Note that sires 1 and 2 have most of their progeny in herds 1 and 2, and sires 3 and 4 have most of their progeny in herds 2 and 3.
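The margins of Table 1.1 can be cross-checked numerically. A small sketch in Python (the record layout is an assumption for illustration), storing each filled cell as a (herd, sire, n, y) tuple and recomputing the totals:

```python
# Each filled cell of Table 1.1 as (herd, sire, n_ij, y_ij.)
records = [(1, 1, 9, 270), (1, 2, 16, 432),
           (2, 1, 1, 27), (2, 2, 3, 72), (2, 3, 5, 85), (2, 4, 1, 12),
           (3, 2, 1, 21), (3, 3, 25, 350), (3, 4, 39, 351)]

# Row (herd) margins, column (sire) margins, and the grand total
herd_totals = {h: (sum(n for hh, _, n, _ in records if hh == h),
                   sum(y for hh, _, _, y in records if hh == h))
               for h in (1, 2, 3)}
sire_totals = {s: (sum(n for _, ss, n, _ in records if ss == s),
                   sum(y for _, ss, _, y in records if ss == s))
               for s in (1, 2, 3, 4)}
grand = (sum(n for _, _, n, _ in records), sum(y for _, _, _, y in records))

print(herd_totals)   # {1: (25, 702), 2: (10, 196), 3: (65, 722)}
print(sire_totals)   # {1: (10, 297), 2: (20, 525), 3: (30, 435), 4: (40, 363)}
print(grand)         # (100, 1620)
```

The recomputed margins agree with the Totals row and column of the table.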
In creating the data, sires 1 and 2 had positive true breeding values and sires 3 and 4 had negative true breeding values. Also, the true value of herd 1 was better than those of herds 2 and 3, and the true value of herd 2 was better than that of herd 3. Thus, the better-sires-in-better-herds selection bias was constructed. In real life, of course, this cannot be done because the true values of sires or herds are never known, but industry personnel believed that this bias was occurring.

Assume that $k_h = 10$ and $k_s = 15$; then the mixed model equations for the model would be

\[
\begin{bmatrix}
100 & 25 & 10 & 65 & 10 & 20 & 30 & 40 \\
 25 & 35 &  0 &  0 &  9 & 16 &  0 &  0 \\
 10 &  0 & 20 &  0 &  1 &  3 &  5 &  1 \\
 65 &  0 &  0 & 75 &  0 &  1 & 25 & 39 \\
 10 &  9 &  1 &  0 & 25 &  0 &  0 &  0 \\
 20 & 16 &  3 &  1 &  0 & 35 &  0 &  0 \\
 30 &  0 &  5 & 25 &  0 &  0 & 45 &  0 \\
 40 &  0 &  1 & 39 &  0 &  0 &  0 & 55
\end{bmatrix}
\begin{bmatrix}
\widehat{YS}_1 \\ \widehat{HYS}_{11} \\ \widehat{HYS}_{21} \\ \widehat{HYS}_{31} \\
\hat{S}_1 \\ \hat{S}_2 \\ \hat{S}_3 \\ \hat{S}_4
\end{bmatrix}
=
\begin{bmatrix}
1620 \\ 702 \\ 196 \\ 722 \\ 297 \\ 525 \\ 435 \\ 363
\end{bmatrix}. \tag{1.1}
\]

The solutions to the equations were

\[
\begin{bmatrix}
\widehat{YS}_1 \\ \widehat{HYS}_{11} \\ \widehat{HYS}_{21} \\ \widehat{HYS}_{31} \\
\hat{S}_1 \\ \hat{S}_2 \\ \hat{S}_3 \\ \hat{S}_4
\end{bmatrix}
=
\begin{bmatrix}
19.2398 \\ 4.7928 \\ 0.0779 \\ -4.8707 \\ 2.4556 \\ 1.9473 \\ -0.4626 \\ -3.9403
\end{bmatrix}.
\]

The sire solutions are presumably biased because of the selection bias issue. Henderson (1975) outlined three types of selection; the bias discussed here fell under the 'selection on u' type. Henderson derived the following mixed model equations (MME) to account for the bias:

\[
\begin{bmatrix}
X'X & X'W & X'Q & X'Z & 0 \\
W'X & W'W + Ik_h & W'Q & W'Z & -L \\
Q'X & Q'W & Q'Q & Q'Z & 0 \\
Z'X & Z'W & Z'Q & Z'Z + Ik_s & 0 \\
0 & -L' & 0 & 0 & L'L/k_h
\end{bmatrix}
\begin{bmatrix}
\hat{b} \\ \hat{h} \\ \hat{g} \\ \hat{s} \\ \hat{\theta}
\end{bmatrix}
=
\begin{bmatrix}
X'y \\ W'y \\ Q'y \\ Z'y \\ 0
\end{bmatrix}, \tag{1.2}
\]

where $L'$ describes selection differentials, $\theta$ is a vector of estimates of the selection bias, $k_h$ is the ratio of residual to herd-year-season variances, and $k_s$ is the ratio of residual to sire variances. Unfortunately, neither Henderson (1975) nor Henderson (1984) gave any hints about creating an appropriate L, only that one exists.

Two different $L'$ were attempted here. Here u is taken to be h, the herd-year-season random effects. A selection differential is a function that describes the difference between one HYS and the other HYS.
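The mixed model equations (1.1) and their solutions can be checked numerically. A minimal sketch, assuming NumPy is available; the coefficient matrix and right-hand side are exactly those of (1.1):

```python
import numpy as np

# Coefficient matrix of the MME in (1.1); order of equations is
# YS1, HYS11, HYS21, HYS31, S1, S2, S3, S4.
# kh = 10 was added to the HYS diagonals, ks = 15 to the sire diagonals.
C = np.array([
    [100, 25, 10, 65, 10, 20, 30, 40],
    [ 25, 35,  0,  0,  9, 16,  0,  0],
    [ 10,  0, 20,  0,  1,  3,  5,  1],
    [ 65,  0,  0, 75,  0,  1, 25, 39],
    [ 10,  9,  1,  0, 25,  0,  0,  0],
    [ 20, 16,  3,  1,  0, 35,  0,  0],
    [ 30,  0,  5, 25,  0,  0, 45,  0],
    [ 40,  0,  1, 39,  0,  0,  0, 55]], dtype=float)
rhs = np.array([1620, 702, 196, 722, 297, 525, 435, 363], dtype=float)

sol = np.linalg.solve(C, rhs)
print(np.round(sol, 4))   # YS1, the three HYS, then the four sire solutions
```

The printed vector reproduces the solutions shown above to four decimals.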
One possible matrix is

\[
L'_1 = \begin{bmatrix} 1 & -0.5 & -0.5 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix}.
\]

This matrix says that $HYS_{11}$ is different from the average of $HYS_{21}$ and $HYS_{31}$ (first row of $L'_1$), that $HYS_{21}$ is different from $HYS_{31}$, and that $HYS_{31}$ is not better than any other HYS.

In Henderson (1984), he made $L'_2 = I$, which states that each HYS is unique. The solutions from the two attempts, together with the original solutions to the MME, are in Table 1.2.

Table 1.2. Solutions to the original MME, the MME using $L'_1$, and the MME using $L'_2$.

  Effect        Original    With L'1    With L'2
  YS1            19.2398     20.6401     19.5479
  HYS11           4.7928      6.6348      7.7269
  HYS21           0.0779     -1.4166     -0.3245
  HYS31          -4.8707     -8.3010     -7.2089
  S1              2.4556      1.2921      1.2921
  S2              1.9473      0.5312      0.5312
  S3             -0.4626      0.6757      0.6757
  S4             -3.9403     -2.4990     -2.4990
  theta1                     66.3476     77.2691
  theta2                     19.0076     -3.2447
  theta3                    -30.8289    -72.0889

Note that the solutions for sires are identical for the two MME using an $L'$, and that estimable functions of the other effects are also identical. For example, the solution for $YS_1 + HYS_{11}$ is the same in the last two columns. Several other $L'$ matrices were constructed and tried, and all gave the same solutions for sire effects. Henderson likely tried many $L'$ matrices too, and came to the conclusion that the exact structure of $L'$ did not matter. Thus, let $L' = I$; then the above equations reduce to

\[
\begin{bmatrix}
X'X & X'W & X'Q & X'Z \\
W'X & W'W & W'Q & W'Z \\
Q'X & Q'W & Q'Q & Q'Z \\
Z'X & Z'W & Z'Q & Z'Z + Ik_s
\end{bmatrix}
\begin{bmatrix}
\hat{b} \\ \hat{h} \\ \hat{g} \\ \hat{s}
\end{bmatrix}
=
\begin{bmatrix}
X'y \\ W'y \\ Q'y \\ Z'y
\end{bmatrix},
\]

which are the MME for the original sire model except that Wh is treated as a fixed effect. The solutions to these equations, for the example data of Table 1.1, are

\[
\begin{bmatrix}
\widehat{YS}_1 \\ \widehat{HYS}_{11} \\ \widehat{HYS}_{21} \\ \widehat{HYS}_{31} \\
\hat{S}_1 \\ \hat{S}_2 \\ \hat{S}_3 \\ \hat{S}_4
\end{bmatrix}
=
\begin{bmatrix}
14.7093 \\ 12.5655 \\ 4.5141 \\ -2.3703 \\ 1.2921 \\ 0.5312 \\ 0.6757 \\ -2.4990
\end{bmatrix}.
\]

The sire solutions are the same as those in the last two columns of Table 1.2, and $YS_1 + HYS_{11}$ is also identical.
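The invariance of the sire solutions can be verified by solving the reduced MME for the example data. With no $k_h$ added, $YS_1$ is confounded with the HYS effects and the coefficient matrix is singular, so a generalized inverse yields only one of many solution vectors; the sire solutions and estimable functions such as $YS_1 + HYS_{11}$ are nevertheless unchanged. A sketch assuming NumPy, using the minimum-norm least-squares solution as one particular generalized-inverse choice:

```python
import numpy as np

# Reduced MME: the coefficients of (1.1) but without kh = 10 on the
# HYS diagonals (HYS treated as a fixed factor); ks = 15 kept on sires.
C = np.array([
    [100, 25, 10, 65, 10, 20, 30, 40],
    [ 25, 25,  0,  0,  9, 16,  0,  0],
    [ 10,  0, 10,  0,  1,  3,  5,  1],
    [ 65,  0,  0, 65,  0,  1, 25, 39],
    [ 10,  9,  1,  0, 25,  0,  0,  0],
    [ 20, 16,  3,  1,  0, 35,  0,  0],
    [ 30,  0,  5, 25,  0,  0, 45,  0],
    [ 40,  0,  1, 39,  0,  0,  0, 55]], dtype=float)
rhs = np.array([1620, 702, 196, 722, 297, 525, 435, 363], dtype=float)

# The matrix is singular, so use the minimum-norm least-squares solution
sol, *_ = np.linalg.lstsq(C, rhs, rcond=None)

print(np.round(sol[4:], 4))     # sire solutions: invariant to the g-inverse
print(round(sol[0] + sol[1], 4))  # YS1 + HYS11: an estimable function
```

The sire part agrees with the last two columns of Table 1.2, and $YS_1 + HYS_{11}$ equals 27.2748, even though the individual $YS_1$ and $HYS_{11}$ values depend on the generalized inverse chosen.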
Thus, treating random HYS as a fixed factor removes the selection bias due to better sires being used in the better herds. It seems incredible that any $L'$ matrix can be used, as long as it has 3 independent rows. Hence Henderson used model (1.2) with random HYS effects, i.e. contemporary groups (CG), treated as fixed in the first official run of the Northeast AI Sire Comparison. Ever since, HYS effects have been treated as a fixed factor in sire and animal models, except in a few rare cases, even though they were always considered conceptually to be a random factor. There has never been a study on the amount of bias that supposedly existed, nor on the increase in prediction error variances as a result of treating HYS effects as fixed.

Thompson (1979) argued that $L'$ was not well defined and that therefore Henderson's selection bias theory was not sound. Gianola et al. (1988) gave Bayesian arguments against the concept of repeated sampling underlying the assumptions used to derive the modified MME. Therefore, if Henderson's selection bias theory is unsound, have animal breeders made a mistake in treating random HYS as a fixed factor over the last 45 years? How much genetic change has been lost as a result, if any?

The modified model became

\[
y = Wh + Qg + Zs + e, \tag{1.3}
\]

where W, Q, and Z are design matrices relating observations to the factors in the model. Note that h is now treated as a fixed effect in the model, and that the year-season effects, Xb, were totally confounded with herd-year-seasons; thus they were not estimable in this model and were removed. Also,

\[
\begin{aligned}
E(y) &= Wh + Qg, & E(s) &= 0, & E(e) &= 0, \\
Var(e) &= I\sigma_e^2, & Var(s) &= I\sigma_s^2.
\end{aligned}
\]

In the northeast United States, HYS subclasses were fairly large, but in some European countries there were many HYS with fewer than five animals. Any HYS in which all daughters were from only one bull did not contribute any information to sire evaluations.
Not much data were lost in the northeast US, but in Europe there were many more HYS with only 1 or 2 cows, and these were lost when random HYS effects were treated as a fixed factor. The other assumptions and limitations were as with the initial model (1.2). The problems of sires being related, and of sires not being randomly mated to dams, were probably much more significant in their effects on estimated transmitting abilities than the problem of non-random association of sires with herd-year-season effects, but they were largely ignored.

The modified model (1.3) is the one that every country tried to adopt during the 1970s. I gave courses in Sweden and Switzerland in 1976 on how to use Henderson's programs, and those notes still pop up when I visit other countries (Schaeffer 1976). Thus, model (1.3) became the accepted practice even if the bias that was present in the northeast United States did not exist in other countries or situations. For example, sire models used in swine or sheep, where artificial insemination was not prevalent and where progeny group sizes were not large, probably had no selection bias that needed removal. Contemporary groups could have remained a random factor in those models without issue.

1.1.3 Sires Related

Henderson (1976) discovered a method of inverting the additive genetic relationship matrix, A, and this made it possible to account for sires that were related, through their sires and maternal grandsires. Herd-year-seasons were still treated as a fixed factor. The model did not account for non-random mating of sires to dams. Now

\[
Var(s) = A\sigma_s^2.
\]

The sire model continued to be employed for sire evaluation until 1988. By 1988, computer hardware and computing techniques had improved enough to make animal models feasible (Wiggans and Misztal, 1987). In summary, the problems with sire models were:

• Sires were not randomly mated to dams.
• Genetic groups should have enough bulls.
• Only first lactations of cows were used.
• No cow evaluations were produced.
• HYS in which all cows were daughters of the same bull were useless.

Problems motivate changes for the future, and the problems of the sire model motivated the change to an animal model, although it took some years.

The programming strategy for sire models, as laid out by Henderson, was to first absorb the herd-year-season effects into the sire effects, thus creating the coefficients of the mixed model equations explicitly, but only the non-zero coefficients. After the absorption process, the coefficients needed to be sorted by columns within rows, and if there were multiple coefficients with the same row and column identifier, these needed to be summed together. Finally, the equations were solved by iterating through the coefficients (read in from magnetic tape), updating the sire solutions each iteration.

In 1986, Schaeffer and Kennedy published the iteration-on-data strategy, which was much simpler than the series of programs presented in Sweden in 1976. Firstly, there was no need to absorb the herd-year-season equations, because the coefficients of the mixed model equations were formed as needed rather than being stored on magnetic tape or in memory. Secondly, other factors could be included in the model besides herd-year-seasons and sires. There had to be enough storage space in memory to keep the solutions for all factors in the model. The calculations ended up being summations, with few multiplications or divisions. Multiplications took more than twice as much time to perform as summations, and divisions took nearly four times as long as summations. On today's computers the same relative differences between operations exist, but the speeds of operations are so fast that divisions and multiplications are no longer a problem. Compilers also optimize code.

1.2 The Animal Model

In any biological field of research, observations are usually taken on individuals, either one observation or many.
Thus, the experimental unit is the individual, or animal (humans are also animals). The statistical analysis would begin with a model that describes the factors affecting the observations on individuals. In general, the model for a single trait could be written as

\[ y = Xb + Wu + \begin{pmatrix} Z & 0 \end{pmatrix} \begin{pmatrix} a_w \\ a_o \end{pmatrix} + Zp + e \]

where

y is a vector of observations of a single trait,
b is a vector of fixed effects (such as age, year, gender, farm, cage, diet) that affect the trait of interest, and are not genetic in origin,
u is a vector of random factors (such as contemporary groups),
(a_w', a_o')' is a vector of animal additive genetic effects (breeding values) for all animals included in the pedigree, many of which may not have observations; animals with records are in a_w, and animals without records are in a_o,
p is a vector of permanent environmental effects for animals that have been observed, corresponding to a_w,
e is a vector of residual effects,
X, W and Z are matrices that relate observations to the factors in the model.

In addition,

E(b) = b,  E(u) = 0,  E(a_w) = 0,  E(a_o) = 0,  E(p) = 0,  E(e) = 0,

and

\[
Var\begin{pmatrix} u \\ a_w \\ a_o \\ p \\ e \end{pmatrix} =
\begin{pmatrix}
U & 0 & 0 & 0 & 0 \\
0 & A_{ww}\sigma_a^2 & A_{wo}\sigma_a^2 & 0 & 0 \\
0 & A_{ow}\sigma_a^2 & A_{oo}\sigma_a^2 & 0 & 0 \\
0 & 0 & 0 & I\sigma_p^2 & 0 \\
0 & 0 & 0 & 0 & R\sigma_e^2
\end{pmatrix},
\]

where U = Σ⁺_i Iσ_i², for i going from 1 to the number of other random factors in the model, and R is usually diagonal, but there can be different values on the diagonals depending on the situation. Often all of the diagonals are the same, so that R = I. The additive genetic relationship matrix,

\[ A = \begin{pmatrix} A_{ww} & A_{wo} \\ A_{ow} & A_{oo} \end{pmatrix}, \]

is derived from the pedigrees, and can be constructed by the tabular method that Henderson also described many years ago. In practice, A is seldom constructed explicitly. The pedigrees are assumed to trace back to the same base generation; that is, to a particular birth year such that all the parents of animals born in that year were unknown. The variance, σ_a², is the genetic variance of animals in that time period, where all animals are assumed unrelated.
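The matrices X and Z that relate observations to factors are simple indicator (incidence) matrices. A minimal sketch in Python; the toy records, herd means, and breeding values are assumed for illustration:

```python
import numpy as np

# Incidence matrices for a tiny animal model: 4 records on 3 animals
# (the third animal has two records) and one fixed factor (herd, 2 levels).
herd   = [0, 0, 1, 1]            # fixed-effect level of each record
animal = [0, 1, 2, 2]            # which animal in a_w made each record
n, nb, na = len(herd), 2, 3

X = np.zeros((n, nb)); X[np.arange(n), herd]   = 1.0  # relates records to b
Z = np.zeros((n, na)); Z[np.arange(n), animal] = 1.0  # relates records to a_w

b   = np.array([10.0, 12.0])     # herd effects (assumed)
a_w = np.array([0.5, -0.2, 0.1]) # breeding values of recorded animals
y   = X @ b + Z @ a_w            # u, p, and e omitted for brevity
```

Each row of X and of Z contains a single 1, picking out the factor level and the animal that generated that record; the same Z also relates records to p.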
Assumptions implied with an animal model are as follows:

• Progeny of sire-dam pairs are randomly generated and not selected from amongst all possible progeny of that pair.
• Henderson originally assumed that animals mated randomly, and that inbreeding accrued due to finite population size. However, selective matings of sires to dams are traced within the pedigrees, and therefore are taken into account through the relationship matrix.
• Animals are randomly dispersed across levels of fixed and other random factors.
• Observations are taken on either males or females, but if taken on both sexes, then the assumption is that parents would rank the same if evaluations were based only on one gender or the other.
• There should be no preferential treatment given to individuals or groups of individuals.
• No animals should be omitted from the analysis because their observations were too low (unless an error has occurred) or too high. That is, the data should not be a selected subset of all possible animals.
• There are no competition effects between animals. Every animal is able to express its full genetic potential without restraint from other individuals in its contemporary group.
• Animals should be from a common population with the same genetic variation in the base generation.

Animal models apply equally to plant breeding, but will naturally refer either to individual plants or to individual strains of plants, where the plants represent multiple observations of the same strain or variety.

1.3 Model Considerations

1.3.1 Time and Location Effects

Data on animals usually accumulate over time (months or years), hence the need to account for time in some manner in an animal model. Environments improve or deteriorate over time (too much rain, too much heat). Feeds vary in quality and quantity from year to year. The categories of time could be weeks, months, or years.
Sometimes, if there are not enough animals in a time period, then some time periods may need to be merged together. If data are from animals that are spread from coast to coast, and have been raised in different geographical locations, there may be a need to include an interaction effect of location by year by month. This assumes that location effects differ from year to year and month to month. This is particularly true for most livestock species within Canada and the United States. Such an effect is a fixed factor and will generally have several hundred, if not thousands, of data points in each subclass.

1.3.2 Contemporary Groups

Contemporary groups are always present in an animal model, and are always a random factor. Contemporary groups are usually subsets of the time by location effects (i.e. a nested factor). Animals are raised on different farms in different years, and are managed differently from one owner to the next. Each herd, flock, or farm is a smaller location within a larger geographical region. Animals are assumed to belong to a particular contemporary group due to birth and management system. Sometimes the groups can be made smaller by further partitioning the year into seasons or months of the year. A further partitioning may occur by management systems. For example, a group of animals may belong to one farm owned by brothers, and each brother manages his animals slightly differently. Thus, a contemporary group would be the herd-year-owner subclass in this case. Once animals reach a certain age, such as weaning, males and females may be separated into different management groups. This gives herd-year-management groups as contemporary groups. This emphasizes that contemporaries can change as animals grow and mature. Of importance are the contemporaries at the time that animals are measured or recorded for a trait.
Because contemporary groups are created as animals come along, the effects of contemporary groups are always considered to be random factors in an animal model. The animals within a contemporary group are roughly the same age, born around the same date, housed and managed alike, and are perhaps the same gender, depending on the trait being measured. Contemporary groups can have from 1 to many individuals. There have been many papers published which have treated contemporary groups as fixed effects. Suppose we define a contemporary group as cows in the same herd-year-season of calving; this can be written as

HYS_ijk = YS_jk + H_i + (YS × H)_ijk,

which says that herd effects are nested within year-season effects. As a fixed factor, the two-way interaction includes the 'main' effects of year-season and herd, as well as the interaction between the two factors. From an estimability standpoint, the main effect factors are not estimable (they cannot be separated from the interaction term). However, the herd effect should not be estimated across all year-seasons, because it would have a useless value if there had been 20 or 30 years of data. Hence the model for contemporary groups should be

HYS_ijk = YS_jk + YS:H_ijk.

Because contemporary groups are a random factor, an animal model needs to include both YS_jk as fixed and YS:H_ijk as random. Failure to include YS_jk as fixed will cause bias in the YS:H_ijk estimates and in the estimates of additive genetic values of animals.

1.3.3 Interaction of Fixed Factors with Time

In dairy cattle, models for milk production include the effects of age and month of calving interactions for 305-day lactation yields. Differences between age groups or months of calving can change over the years due to increases in the overall means of production and/or improvements in nutrition and management.
Thus, an interaction of age and month of calving with years of calving (five-year periods, perhaps) is needed to account for changing age-month differences over time. Shorter time periods may be needed if means are changing rapidly. An example could be age-of-calving (in months) differences in milk production of dairy cows. Suppose in 1973 the difference between 24 months of age and 30 months of age was 200 kg of milk. Over time, because feeding and management have improved, the difference between 24- and 30-month-old heifers might be bigger due to a general increase in average milk production by the year 2010. Perhaps the difference is now 250 kg. To pick up this effect in the data there needs to be an age by year interaction effect. One would have to explore the data to determine if age effects change every year, or maybe only every three years (Ptak et al. 1993). Any fixed factor in any animal model may need an interaction with time as part of the model. Models should be considered dynamic and constantly evolving, not static over many years or generations.

1.3.4 Pre-adjustments

In the 1960's and 70's, there were tables of adjustment factors for converting milk yields of cows to Mature Equivalents (ME) of yield. The reference point was a mature cow of 90 months of age, calving in November, for example. The table of factors had age at calving down the side, and the months of the year going across the top (from January to December). Because the adjustments were to a Mature Equivalent, most of the multiplicative adjustment factors in the table were above 1.0. Soon, researchers realized that the factors should differ depending on where in the United States the cow was housed. This was due to temperature differences and feed availability differences. Cows in the southern US had more stress due to heat than cows in the northeast, midwest, or west.
A problem with adjustment factors was that they needed to be updated every few years (a factor by time interaction), and this required a major effort of someone's time, usually someone at the USDA. There were also adjustment factors for age of dam and sex of calf in beef cattle, dependent upon breed of cattle. Pre-adjustment of data for one or more factors was an accepted practice in data analysis. An assumption was that the adjustment factors were known perfectly, without error. Today (2018), with the computing power that is available to every researcher, there is no need to employ pre-adjustment factors. A sensible animal model could include age and month of calving by time period interactions, and therefore the adjustments would be estimated simultaneously with the estimated breeding values. One should look at the estimates from time to time, and make graphs of their changes over time, to see if the time periods need to be shortened or lengthened.

1.3.5 Additive Genetic Effects

Originally animals were assumed to be randomly mating, which in an animal breeding context is totally invalid. The purpose of animal breeding is to make genetic improvement, and thus non-random matings are the way in which genetic improvement is achieved. The animal model, however, uses the additive genetic relationship matrix, and through this matrix every mating is identified (as long as the parents of each animal are known). Consequently, all non-randomness in matings is described and accounted for in the analysis, if the relationship matrix is complete. The more important assumption is that each mating should produce a random sample of progeny. Until genomics came along, every animal that was born usually had the opportunity to grow and make an observation, and therefore this assumption was mostly valid. With genomics, however, there can be pre-selection of embryos and culling of those that do not have the best genotypes for a set of markers.
Only the better embryos are implanted. Thus, progeny groups being created today (using genomic pre-selection) are NOT a random sample of progeny. This will cause a serious bias in genetic evaluations using an animal model, depending on the selection intensity, which will vary from family to family. The assumption of random progeny groups will be considered true throughout most of these notes. A later chapter on genomics and animal models will cover cases where this assumption is not valid. All animals having observations should have both parents identified. Animals having observations, but with one or both parents unknown, should be eliminated from the analysis. Technically all animals can be included in an animal model (with phantom parent groups), but once producers know that animals with both parents unknown will not be used in genetic evaluation, there will be better reporting of parentage by owners. In dairy cattle in Canada, pedigrees go back to the mid 1950's, but test day records start in the 1980's. This is a massive amount of data to process every few months, with new records being added all the time. One should conduct a test where all of the available data are utilized, and then gradually eliminate data, in a systematic manner, until half the data remain. Then compare rankings of the more recent animals to determine the impact. If there is a large impact, that could indicate that factors are missing from the model or are not appropriate. If rankings of animals are not greatly altered by leaving out older data, then the older data could be omitted permanently, thereby saving on computing time. Advances in computing hardware and software have allowed genetic evaluation centres to continue to utilize all data. However, someone should monitor the consequences of using so much data over a long period of time.
1.3.6 Permanent Environmental Effects

When animals are measured many times during a lactation or over a few months, then it is possible to estimate permanent environmental effects. These are factors that affect an animal throughout its life, but which are not transmitted to any progeny. They are a result of the environment and experiences encountered by the animal. An animal could undergo an event that either enhances or detracts from its genetic ability for a trait. For many years a permanent environmental effect was considered an effect that was constant and unchanging throughout the life of the animal. The sum of the additive genetic and permanent environmental variances divided by the total variance provided estimates of repeatability. In actual fact, animals are continually experiencing new environmental effects, and these environmental effects are cumulative over time rather than constant. Cumulative environmental effects should be used rather than permanent environmental effects in animal models. Animals with only one observation on a trait also have an environmental effect on their record, but it is impossible to separate that effect from the additive genetic and residual effects, so the environmental effect becomes part of the residual term in the model. If one thinks of a racing horse, there are many environmental effects contributing to that horse before it makes its first race. Collectively the effects are p_1. After the first race, the trainer may make adjustments to the training schedule or the diet, so that when the second race is run, both p_1 and the new environmental effects after the first race, p_2, affect the outcome of race 2. Possibly p_2 offsets p_1, or p_2 may enhance or aggravate p_1. Up until the third race there are new environmental effects p_3, and (p_1 + p_2 + p_3) cumulatively affect the outcome of race 3.
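The repeatability ratio described above can be written out numerically; a minimal sketch, with assumed variance components:

```python
# Repeatability = (additive genetic + permanent environmental variance)
# divided by the total (phenotypic) variance.
# The variance components below are assumed values for illustration.
var_a, var_p, var_e = 40.0, 10.0, 50.0     # sigma_a^2, sigma_p^2, sigma_e^2
var_total = var_a + var_p + var_e
repeatability = (var_a + var_p) / var_total
heritability  = var_a / var_total          # h^2 is never above repeatability
```

With these assumed components, repeatability is 0.5 and heritability is 0.4; repeatability is an upper bound on heritability because it adds the permanent environmental variance to the numerator.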
Any animal that has a chance to make more than one record for a trait will accumulate environmental effects from one record to the next. The time between records may be a factor to consider.

Chapter 2  Genetic Relationships

A major component of an animal model is the additive genetic effects and, in particular, the additive genetic relationships among the animals. Genetic links are key to increasing the accuracy of genetic evaluations. Maintaining a good database of pedigree information is critical. There are always errors in pedigrees. Estimates of 10% in the United States dairy cattle population have been reported. With SNP markers available, animals can be genotyped and parentage can be verified readily, using a hundred markers or more. Without genotyping, there are many checks on pedigrees that can be made. For an animal model, arranging animals chronologically is a key to success.

2.1 Pedigree Preparation

Pedigrees of animals need to be arranged in chronological order. Parents should appear in a list ahead of their progeny. Birthdates should be ignored because they are often in error. The following procedure does not rely on birthdates. Initially assign a generation number of one to all animals, then iterate through the pedigrees, modifying the generation numbers of the sire and dam to be at least one greater than the generation number of the offspring. The number of iterations depends on the number of generations of animals in the list. Probably 20 or fewer iterations are needed for most situations. If the number of iterations reaches 50 or more, then there could be a loop in the pedigrees. That means an animal is its own ancestor, somewhere in the pedigree. For example, A might be the parent of B, B the parent of C, and C the parent of A. In this case the generation numbers will keep increasing in each iteration.
Thus, if more than 50 iterations are used, look at the animals with the highest generation numbers and try to find the loop. A loop is an error in the pedigrees and must be repaired. Either correct the parentage, or remove the parent of the older animal to break the loop. An example pedigree and sorting follows.

Table 2.1
Animal  Sire  Dam  Generation Number
A       B     E    1
H       D     F    1
B       D     H    1
D       G     E    1
E       G     F    1
G       -     -    1
F       -     -    1

All animals begin with generation number 1. Proceed through the pedigrees one animal at a time.

1. Take the current generation number of the animal and increase it by one (1); call it m. The first animal is A, for example, and its generation number is 1; increase it by 1 to become m = 2.

2. Compare m to the generation numbers of the animal's sire and dam. In the case of A, the sire is B, and B's generation number is 1. That is less than 2, so B's generation number has to be changed to 2 (m). The dam is E, and E's generation number also becomes 2.

Repeat for each animal in the pedigree list. Keep modifying the generation numbers until no more need to be changed. The animal with the highest generation number is the oldest animal. At the end, sort the list in decreasing order of generation number. The end result after four iterations of the example pedigree is shown below.

Table 2.2
Animal  Sire  Dam  Generation Number
A       B     E    1
H       D     F    3
B       D     H    2
D       G     E    4
E       G     F    5
G       -     -    6
F       -     -    6

The order of animals within the same generation number (such as G and F) is not critical. Once the pedigree is sorted, the birthdates can be checked, and errors can be spotted more readily. Once the errors are found and corrected, the generation numbers could be checked again. Animals should then be numbered consecutively according to the last list, from 1 to the total number of animals in the list. That means that parent numbers should always be smaller than progeny ID numbers.
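The generation-numbering procedure can be sketched as follows; this is a minimal Python illustration using the example pedigree from Table 2.1, with missing parents coded as None (the book's own implementation is the R function given later):

```python
# Generation numbering without birthdates: each parent's generation number
# must end up at least one greater than that of any of its progeny.
ped = {  # animal: (sire, dam) -- the example pedigree
    'A': ('B', 'E'), 'H': ('D', 'F'), 'B': ('D', 'H'),
    'D': ('G', 'E'), 'E': ('G', 'F'), 'G': (None, None), 'F': (None, None),
}
gen = {a: 1 for a in ped}          # start every animal at generation 1
maxloop = 50                       # exceeding this suggests a pedigree loop

for sweep in range(maxloop):
    changed = False
    for animal, (sire, dam) in ped.items():
        m = gen[animal] + 1
        for parent in (sire, dam):
            if parent is not None and gen[parent] < m:
                gen[parent] = m
                changed = True
    if not changed:
        break                      # converged: no generation number changed
else:
    raise RuntimeError("possible loop: an animal may be its own ancestor")
```

Sorting the animals by decreasing generation number, for example sorted(ped, key=lambda a: -gen[a]), then places every parent ahead of its progeny.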
Having animals in this order facilitates calculation of inbreeding coefficients, assignment of animals with unknown parents to groups, and utilization of the inverse of the relationship matrix in the solution of mixed model equations.

2.2 Additive Relationships

The order of the additive genetic relationship matrix, A, equals the number of animals (N) in the pedigree. However, elements of A can be determined directly by the tabular method, and its inverse can be derived directly using the methods of Henderson (1976) and Meuwissen and Luo (1992). Sewall Wright, in his work on genetic relationships and inbreeding, defined the relationship between two animals to be a correlation coefficient: the genetic covariance between two animals divided by the square root of the product of the genetic variances of each animal. The genetic variance of an animal was equal to (1 + F_i)σ_a², where F_i is the inbreeding coefficient of that animal, and σ_a² is the population additive genetic variance. Correlations range from -1 to +1, and therefore represent a percentage relationship between two individuals, usually positive only. The elements of the additive relationship matrix are the numerators of Wright's correlation coefficients. Consequently, the diagonals of A can be as high as 2, and relationships between two individuals can be greater than 1. Thus, A is a matrix that represents the relative genetic variances and covariances among individuals. Additive genetic relationships among animals may be calculated using a recursive procedure called the Tabular Method (attributable to Henderson, and perhaps to Wright before him). To begin, make a list of all animals that have observations in your data, and for each of these determine their parents (called the sire and dam). An example list is shown below.
Table 2.3
Animal  Sire  Dam
A       -     -
B       -     -
C       -     -
D       A     B
E       A     C
F       E     D

The list should be in chronological order so that parents appear in the list before their progeny. The sire and dam of animals A, B, and C are assumed to be unknown, and consequently animals A, B, and C are assumed to be genetically unrelated. In some instances the parentage of animals may be traced for several generations, and for each animal the parentage should be traced to a common base generation. Using the completed list of animals and pedigrees, form a two-way table with n rows and columns, where n is the number of animals in the list, in this case n = 6. Label the rows and columns with the corresponding animal identification, and above each animal ID write the ID of its parents, as shown below.

Table 2.4 Tabular Method Example, Starting Values.
     -,-   -,-   -,-   A,B   A,C   E,D
      A     B     C     D     E     F
A     1     0     0
B     0     1     0
C     0     0     1
D
E
F

For each animal whose parents were unknown, a one (1) was written on the diagonal of the table (i.e. for animals A, B, and C), and zeros were written in the off-diagonals between these three animals, assuming they were unrelated. Let the elements of this table (referred to as matrix A) be denoted as a_ij. Thus, by putting a 1 on the diagonals for animals with unknown parents, the additive genetic relationship of an animal with itself (in the base group) is one. The additive genetic relationship to animals without common parents, or whose parents are unknown, is assumed to be zero. The next step is to compute relationships between animal A and animals D, E, and F. The relationship of one animal to another is equal to the average of the relationships of that animal with the parents of the other animal. For example, the relationship between A and D is the average of the relationships between A and the parents of D, who are A and B. Thus,

Table 2.5
a_AD = .5(a_AA + a_AB) = .5(1 + 0) = .5
a_AE = .5(a_AA + a_AC) = .5(1 + 0) = .5
a_AF = .5(a_AE + a_AD) = .5(.5 + .5) = .5
The relationship table, or A matrix, is symmetric, so that a_AD = a_DA, a_AE = a_EA, and a_AF = a_FA. Continue calculating the relationships for animals B and C to give the following table.

Table 2.6 Tabular Method Example, Partially Completed.
     -,-   -,-   -,-   A,B   A,C   E,D
      A     B     C     D     E     F
A     1     0     0    .5    .5    .5
B     0     1     0    .5     0    .25
C     0     0     1     0    .5    .25
D    .5    .5     0
E    .5     0    .5
F    .5    .25   .25

Next, compute the diagonal element for animal D. By definition this is one plus the inbreeding coefficient, i.e. a_DD = 1 + F_D. The inbreeding coefficient, F_D, is equal to one-half the additive genetic relationship between the parents of animal D, namely F_D = .5 a_AB = 0. When parents are unknown, the inbreeding coefficient is zero, assuming the parents of the individual were unrelated. After computing the diagonal element for an animal, like D, the remaining relationships to other animals in that row are calculated as before. The completed matrix is given below. Note that only animal F is inbred in this example. The inbreeding coefficient is a measure of the percentage of loci in the gametes of an animal that have become homozygous, that is, where the two alleles at a locus are the same (identical by descent). Sometimes these alleles may be lethal, and therefore inbreeding is generally avoided.

Table 2.7 Tabular Method Example, Completed Table.
     -,-   -,-   -,-   A,B   A,C   E,D
      A     B     C     D     E     F
A     1     0     0    .5    .5    .5
B     0     1     0    .5     0    .25
C     0     0     1     0    .5    .25
D    .5    .5     0     1    .25   .625
E    .5     0    .5    .25    1    .625
F    .5    .25   .25   .625  .625  1.125

Generally, the matrix A is nonsingular, but if the matrix includes two animals that are identical twins, then the two rows and columns of A for these animals would be identical, and therefore A would be singular. In this situation, assume that the twins are genetically equal and treat them as one animal (by giving them the same registration number or identification).
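The tabular method lends itself to a short program. A minimal Python sketch, assuming animals are already in chronological order and unknown parents are None; it reproduces the completed Table 2.7:

```python
# Tabular method: a_ij = average of the relationships of i with j's parents,
# and a_jj = 1 + F_j, where F_j = 0.5 * a(sire_j, dam_j).
ped = [('A', None, None), ('B', None, None), ('C', None, None),
       ('D', 'A', 'B'), ('E', 'A', 'C'), ('F', 'E', 'D')]
idx = {name: k for k, (name, s, d) in enumerate(ped)}
n = len(ped)
A = [[0.0] * n for _ in range(n)]

def rel(i, parent):
    # relationship of animal i with a parent ID (0 if the parent is unknown)
    return A[i][idx[parent]] if parent is not None else 0.0

for j, (name, sire, dam) in enumerate(ped):
    for i in range(j):            # off-diagonals with all older animals
        A[i][j] = A[j][i] = 0.5 * (rel(i, sire) + rel(i, dam))
    Fj = 0.5 * A[idx[sire]][idx[dam]] if (sire and dam) else 0.0
    A[j][j] = 1.0 + Fj            # diagonal: one plus inbreeding coefficient
```

Because parents precede progeny in the list, every value needed on the right-hand side has already been filled in when it is used, which is exactly why chronological order matters.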
2.3 Inbreeding Calculations

The inbreeding coefficients, and the inverse of A for inbred animals, are generally required for BLUP analyses of animal models. Thus, fast methods for both of these calculations, suitable for very large populations of animals, are necessary. The following is a description of Henderson's discovery of 1976:

A = TBT',

where T is a lower triangular matrix and B is a diagonal matrix. Quaas (1976) showed that the diagonals of B, say b_ii, were

b_ii = .5 − .25(F_s + F_d),

where F_s and F_d are the inbreeding coefficients of the sire and dam, respectively, of the ith individual. If one parent is unknown, then

b_ii = .75 − .25 F_p,

where F_p is the inbreeding coefficient of the parent that is known. Lastly, if neither parent is known, then b_ii = 1. Animals should be in chronological order, as for the Tabular Method. To illustrate, consider the example given in the Tabular Method section. The corresponding elements of B for animals A to F would be

( 1  1  1  .5  .5  .5 ).

Now consider a new animal, G, with parents F and B. The first step is to set up three vectors, where the first vector contains the identification of animals in the pedigree of animal G, the second vector will contain the elements of a row of matrix T, and the third vector will contain the corresponding b_ii for each animal.

Step 1. Add animal G to the ID vector, a 1 to the T-vector, and b_GG = .5 − .25(.125 + 0) = 15/32 to the B-vector, giving

ID vector   T-vector   B-vector
G           1          15/32

Step 2. Add the parents of G to the ID vector, and because they are one generation back, add .5 to the T-vector for each parent. In the B-vector, animal B has b_BB = 1, and animal F has b_FF = .5. The vectors now appear as

ID vector   T-vector   B-vector
G           1          15/32
F           .5         .5
B           .5         1

Step 3. Add the parents of F and B to the ID vector, add .25 (.5 times the T-vector value of the individual, F or B) to the T-vector, and add their corresponding b_ii values.
The parents of F were E and D, and the parents of B were unknown. These give

ID vector   T-vector   B-vector
G           1          15/32
F           .5         .5
B           .5         1
E           .25        .5
D           .25        .5

Step 4. Add the parents of E and D to the ID vector, .125 to the T-vector, and the appropriate values to the B-vector. The parents of E were A and C, and the parents of D were A and B.

ID vector   T-vector   B-vector
G           1          15/32
F           .5         .5
B           .5         1
E           .25        .5
D           .25        .5
A           .125       1
C           .125       1
A           .125       1
B           .125       1

The vectors are complete because the parents of A, B, and C are unknown and no further ancestors can be added to the pedigree of animal G.

Step 5. Accumulate the values in the T-vector for each animal ID. For example, animals A and B appear twice in the ID vector. Accumulating their T-vector values gives

ID vector   T-vector             B-vector
G           1                    15/32
F           .5                   .5
B           .5 + .125 = .625     1
E           .25                  .5
D           .25                  .5
A           .125 + .125 = .25    1
C           .125                 1

Do not accumulate quantities until all pathways in the pedigree have been processed; otherwise a coefficient may be missed and the wrong inbreeding coefficient could be calculated.

Step 6. The diagonal of the A matrix for animal G is calculated as the sum of squares of the values in the T-vector times the corresponding values in the B-vector; hence

a_GG = (1)²(15/32) + (.5)²(.5) + (.625)²(1) + (.25)²(.5) + (.25)²(.5) + (.25)²(1) + (.125)²(1) = 72/64 = 1 1/8.

The inbreeding coefficient for animal G is one-eighth. The efficiency of this algorithm depends on the number of generations in each pedigree. If each pedigree is 10 generations deep, then each of the vectors above could have over 1000 elements for a single animal. To obtain greater efficiency, animals with the same parents could be processed together, and each would receive the same inbreeding coefficient, so that it only needs to be calculated once. For situations with only 3 or 4 generation pedigrees, this algorithm would be very fast and the amount of computer memory required would be low.
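The six steps can be sketched in code. A minimal Python version, assuming the inbreeding coefficients of all ancestors are already known (as they would be when animals are processed in chronological order); the pedigree and F values are those of the worked example:

```python
# Diagonal of A for one animal via A = TBT': a_ii = sum(t_j^2 * b_jj) over
# all ancestors j, where t_j accumulates 0.5 per generation along each path.
ped = {'G': ('F', 'B'), 'F': ('E', 'D'), 'E': ('A', 'C'), 'D': ('A', 'B'),
       'A': (None, None), 'B': (None, None), 'C': (None, None)}
F = {'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 0.0, 'E': 0.0, 'F': 0.125}  # known

def b_val(animal):
    sire, dam = ped[animal]
    if sire and dam:
        return 0.5 - 0.25 * (F[sire] + F[dam])
    if sire or dam:
        return 0.75 - 0.25 * F[sire or dam]
    return 1.0

def a_diag(animal):
    t = {}                           # accumulated T-vector (Steps 1-5)
    stack = [(animal, 1.0)]
    while stack:
        x, tx = stack.pop()
        t[x] = t.get(x, 0.0) + tx    # accumulate per pathway
        for parent in ped[x]:
            if parent is not None:
                stack.append((parent, 0.5 * tx))
    # Step 6: sum of squared T-values times the corresponding b values
    return sum(tx * tx * b_val(x) for x, tx in t.items())
```

Here a_diag('G') returns 1.125, so F_G = a_diag('G') − 1 = 0.125, one-eighth, agreeing with the worked example.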
2.4 Example Additive Matrix

Consider the pedigrees in the table below:

Table 2.8
Animal  Sire  Dam
1       -     -
2       -     -
3       1     -
4       1     2
5       3     4
6       1     4
7       5     6

Animals with unknown parents may or may not be selected individuals, but their parents (which are unknown) are assumed to belong to a base generation of animals, i.e. a large, random mating population of unrelated individuals. Animal 3 has one parent known and one parent unknown. Animal 3 and its sire do not belong to the base generation, but its unknown dam is assumed to belong to the base generation. If these assumptions are not valid, then the concept of phantom parent groups needs to be utilized. Using the tabular method, the A matrix for the above seven animals is given below.

Table 2.9
        1      2      3      4      5       6       7
      (-,-)  (-,-)  (1,-)  (1,2)  (3,4)   (1,4)   (5,6)
1       1      0     .5     .5     .5      .75     .625
2       0      1      0     .5     .25     .25     .25
3      .5      0      1     .25    .625    .375    .5
4      .5     .5     .25     1     .625    .75     .6875
5      .5     .25    .625   .625  1.125    .5625   .84375
6      .75    .25    .375   .75    .5625  1.25     .90625
7      .625   .25    .5     .6875  .84375  .90625 1.28125

Now partition A into T and B, giving:

Table 2.10
Animal (Sire,Dam)     Row of T                              b_ii
1 (-,-)               1     0     0     0     0     0    0  1.00
2 (-,-)               0     1     0     0     0     0    0  1.00
3 (1,-)              .5     0     1     0     0     0    0  0.75
4 (1,2)              .5    .5     0     1     0     0    0  0.50
5 (3,4)              .5    .25   .5    .5     1     0    0  0.50
6 (1,4)              .75   .25    0    .5     0     1    0  0.50
7 (5,6)              .625  .25   .25   .5    .5    .5    1  0.40625

Note that the rows of T account for the direct relationships, that is, the direct transfer of genes from parents to offspring.

2.5 Inverse of Additive Relationship Matrix

The inverse of the relationship matrix can be constructed by a set of rules. Recall the example in the previous section, of seven animals with the following values for b_ii.

Table 2.11
Animal  Sire  Dam   b_ii      1/b_ii
1       -     -     1.00      1.00
2       -     -     1.00      1.00
3       1     -     0.75      1.33333
4       1     2     0.50      2.00
5       3     4     0.50      2.00
6       1     4     0.50      2.00
7       5     6     0.40625   2.4615385

Let δ = 1/b_ii; then if both parents are known, the following constants are added to the appropriate elements in the inverse matrix:
Table 2.12
          animal    sire     dam
animal      δ      -.5δ     -.5δ
sire      -.5δ      .25δ     .25δ
dam       -.5δ      .25δ     .25δ

If one parent is unknown, then delete the appropriate row and column from the rules above, and if both parents are unknown, then just add δ to the animal's diagonal element of the inverse. Each animal in the pedigree is processed one at a time, and the animals can be processed in any order. Let's start with animal 6 as an example. The sire is animal 1 and the dam is animal 4. In this case, δ = 2.0. Following the rules, and starting with an inverse matrix that is empty, after handling animal 6 the inverse matrix should appear as follows:

Table 2.13
      1     2     3     4     5     6     7
1    .5                .5          -1
2
3
4    .5                .5          -1
5
6   -1                -1            2
7

After processing all of the animals, the inverse of the relationship matrix for these seven animals should be as follows:

Table 2.14
        1          2         3          4          5          6          7
1     2.33333     .5       -.66667    -.5         0         -1          0
2      .5        1.5        0        -1.00000     0          0          0
3     -.66667     0        1.83333    .5         -1          0          0
4     -.5       -1.00000    .5       3.0000      -1         -1          0
5      0          0        -1        -1         2.61538     .61538   -1.23077
6     -1          0         0        -1          .61538    2.61538   -1.23077
7      0          0         0         0        -1.23077   -1.23077   2.46154

The product of the above matrix with the original relationship matrix, A, gives an identity matrix. Below is an R script that can create A⁻¹ from a list of sires and dams for animals in chronological order, with the b_i values known.

AINV = function(sid, did, bi) {
  # Animal IDs assumed to be consecutively numbered; 0 = unknown parent
  rules = matrix(data = c( 1,   -0.5, -0.5,
                          -0.5,  0.25, 0.25,
                          -0.5,  0.25, 0.25),
                 byrow = TRUE, nrow = 3)
  nam = length(sid)
  np = nam + 1
  ss = sid + 1     # shift IDs by 1 so unknown parents (0) map to row 1
  dd = did + 1
  LAI = matrix(data = c(0), nrow = np, ncol = np)
  for (i in 1:nam) {
    ip = i + 1
    X = 1 / bi[i]            # X is delta for animal i
    k = cbind(ip, ss[i], dd[i])
    LAI[k, k] = LAI[k, k] + rules * X
  }
  k = c(2:np)
  C = LAI[k, k]    # drop the row and column used for unknown parents
  return(C)
}

Likewise, for putting pedigrees in chronological order, the R script could be as follows.
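The same rules can be sketched in Python (paralleling the R function above) and checked against the seven-animal example: build A by the tabular method, compute the b values from parental inbreeding, apply the rules, and verify that the product A⁻¹A is an identity matrix. The pedigree is the one from Table 2.8.

```python
import numpy as np

# Pedigree from Table 2.8 (0 = unknown parent); animals 1..7 -> indices 0..6
ped = [(0, 0), (0, 0), (1, 0), (1, 2), (3, 4), (1, 4), (5, 6)]
n = len(ped)

# Build A by the tabular method (gives the F values and the final check)
A = np.zeros((n, n))
for j, (s, d) in enumerate(ped):
    for i in range(j):
        asj = A[i][s - 1] if s else 0.0
        adj = A[i][d - 1] if d else 0.0
        A[i][j] = A[j][i] = 0.5 * (asj + adj)
    A[j][j] = 1.0 + (0.5 * A[s - 1][d - 1] if s and d else 0.0)

# b_ii from parental inbreeding (Quaas, 1976), then the rules of Table 2.12
Ainv = np.zeros((n, n))
for j, (s, d) in enumerate(ped):
    Fs = A[s - 1][s - 1] - 1.0 if s else 0.0
    Fd = A[d - 1][d - 1] - 1.0 if d else 0.0
    if s and d:
        b = 0.5 - 0.25 * (Fs + Fd)
    elif s or d:
        b = 0.75 - 0.25 * (Fs + Fd)   # only the known parent contributes
    else:
        b = 1.0
    delta = 1.0 / b
    idx = [j] + [p - 1 for p in (s, d) if p]
    wt = [1.0, -0.5, -0.5][: len(idx)]
    for r, wr in zip(idx, wt):        # delta on the diagonal, -.5*delta on
        for c, wc in zip(idx, wt):    # animal-parent cells, .25*delta on
            Ainv[r][c] += wr * wc * delta  # parent-parent cells
```

The rules table is just the outer product of (1, −.5, −.5) with itself, scaled by δ, which is why the code can loop over the weight vector instead of writing out nine cases.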
border = function(anm, sir, dam) {
  # Assign a generation number to every animal so that parents always
  # receive a higher number than their progeny.  anm, sir, and dam are
  # vectors of animal, sire, and dam IDs, with "NA" for unknown parents.
  maxloop = 1000
  i = 1
  count = 0
  mam = length(anm)
  old = rep(1, mam)
  new = old
  while (i > 0) {
    for (j in 1:mam) {
      ks = sir[j]
      kd = dam[j]
      gen = new[j] + 1
      if (ks != "NA") {
        js = match(ks, anm)
        if (gen > new[js]) { new[js] = gen }
      }
      if (kd != "NA") {
        jd = match(kd, anm)
        if (gen > new[jd]) { new[jd] = gen }
      }
    } # for loop
    changes = sum(new - old)
    old = new
    i = changes
    count = count + 1
    if (count > maxloop) { i = 0 }   # safeguard against pedigree loops
  } # while loop
  return(new)
} # function

Chapter 3

Mixed Model Equations

Henderson first published his Mixed Model Equations (MME) in 1949 without knowing the full statistical properties of the method. He knew that the solutions to MME were the same as Lush's selection index equations if the means that were used were generalized least squares means. Goldberger (an econometrician) also published the same Mixed Model Equations around the same time, but Henderson made wider use of them in animal breeding. Henderson did not learn matrix algebra until he took a sabbatical leave in New Zealand, where he met Shayle Searle. With Searle's help, Henderson proved that solutions to MME were best linear unbiased predictors, and that solutions for the fixed effects were generalized least squares solutions. These proofs were published in 1963. Animal scientists learned about MME in 1973 at the Lush Symposium at the North Carolina State Dairy Science meetings, when Henderson presented BLUP and MME for purposes of genetic evaluation of dairy sires.

The general animal model is

  y = Xb + Wu + [ Z  0 ] [ a_w ] + Zp + e
                         [ a_o ]

where

y is a vector of observations of a single trait,

b is a vector of fixed effects (such as age, year, gender, farm, cage, diet) that affect the trait of interest, and are not genetic in origin,

u is a vector of random factors (such as contemporary groups),

(a_w, a_o) is a vector of animal additive genetic effects (breeding values) for all animals included in the pedigree, many of which may not have observations. Animals with records are in a_w, and animals without records are in a_o. a_o may also contain phantom parent groups,

p is a vector of permanent environmental effects for animals that have been observed, corresponding to a_w,

e is a vector of residual effects,

X, W and Z are matrices that relate observations to the factors in the model.

In addition,

  E(b) = b,  E(u) = 0,  E(a_w) = 0,  E(a_o) = 0,  E(p) = 0,  E(e) = 0,

and

       [ u   ]   [ U    0             0             0          0         ]
  Var  [ a_w ] = [ 0    A_ww σ_a^2    A_wo σ_a^2    0          0         ]
       [ a_o ]   [ 0    A_ow σ_a^2    A_oo σ_a^2    0          0         ]
       [ p   ]   [ 0    0             0             I σ_p^2    0         ]
       [ e   ]   [ 0    0             0             0          R σ_e^2   ]

where U = Σ_i I σ_i^2, for i going from 1 to the number of other random factors in the model, and R is usually diagonal, but there can be different values on the diagonals depending on the situation. Often all of the diagonals are the same, so that R = I. Assume the latter is true; then Henderson's mixed model equations (MME) for a single trait, single record per animal situation, can be written as follows:

  [ X'X    X'W                  X'Z               0         ] [ b̂   ]   [ X'y ]
  [ W'X    W'W + U^{-1}σ_e^2    W'Z               0         ] [ û   ] = [ W'y ]
  [ Z'X    Z'W                  Z'Z + A^{ww}α     A^{wo}α   ] [ â_w ]   [ Z'y ]
  [ 0      0                    A^{ow}α           A^{oo}α   ] [ â_o ]   [ 0   ]

where α = σ_e^2 / σ_a^2. Let a generalized inverse of the coefficient matrix be

  [ C_xx   C_xw   C_xz   C_xo ]
  [ C_wx   C_ww   C_wz   C_wo ]
  [ C_zx   C_zw   C_zz   C_zo ]
  [ C_ox   C_ow   C_oz   C_oo ]

then

  Var(b̂) = C_xx σ̂_e^2,

  Var(û − u) = C_ww σ̂_e^2,

  Var( [ â_w − a_w ] )  =  [ C_zz   C_zo ] σ̂_e^2,
       ( [ â_o − a_o ] )     [ C_oz   C_oo ]

where

  σ̂_e^2 = y'(y − Xb̂ − Wû − Zâ_w) / (N − r(X))

is an estimate of the residual variance.

3.1 Accuracies

The variances of prediction error can be used to obtain accuracies of evaluation, also known as reliabilities. Breeders and owners of animals always want to know how accurately their animals are evaluated.
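As a minimal numerical sketch (not from the text), the reliability calculation 1 − C_ii·α can be carried out for a toy model with five unrelated animals, each with one record, and an overall mean as the only fixed effect, using the chapter's variance ratio α = 144/64:

```python
import numpy as np

# Toy single-trait animal model: 5 unrelated animals, one record each,
# a single overall mean as the only fixed effect.  The variance ratio
# alpha = sigma_e^2 / sigma_a^2 uses illustrative values of 144 and 64.
n = 5
X = np.ones((n, 1))          # design matrix for the mean
Z = np.eye(n)                # each animal has exactly one record
Ainv = np.eye(n)             # unrelated, non-inbred animals: A = I
alpha = 144.0 / 64.0

# Henderson's MME coefficient matrix and its inverse
M = np.block([[X.T @ X, X.T @ Z],
              [Z.T @ X, Z.T @ Z + Ainv * alpha]])
C = np.linalg.inv(M)

# Reliability of animal i is 1 - C_ii * alpha, where C_ii is the
# diagonal element of the animal block of the inverse
rel = 1.0 - np.diag(C)[1:] * alpha
print(rel)
```

With no relationships and a single record each, every animal gets the same reliability, about 0.246, somewhat below h^2 = 64/208 ≈ 0.31 because the mean must also be estimated from the same data.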
The correct calculation of accuracies requires a generalized inverse of the coefficient matrix of the mixed model equations, which is seldom possible in most situations due to the large order of the equations. This problem has led to many proposed approximation methods for obtaining the needed inverse elements. Some of the methods lead to upwardly biased accuracies and others to downwardly biased accuracies. Of the two options, one in which accuracies are equal to or lower than the correct accuracies would be preferable, so that accuracies are always on the conservative side.

The selection index approach is an easy and consistent method to calculate accuracies. You may incorporate whatever sources of information you wish, for example,

• The number of observations on the animal,
• The number of records on the dam,
• The number of progeny of the animal with records,
• The number of progeny of the dam with records, and
• The number of progeny of the sire with records.

This approach ignores contemporary group size. An advantage is that it is consistent from one year to the next, assuming an annual genetic evaluation run. One problem is that there could be many animals with a zero accuracy. In an animal model, accuracies should nearly always be greater than zero due to relationships among animals, but a selection index approximation for accuracies describes the amount of information available (from only those sources that are considered in the calculations). Test out whatever approximation method you choose. See the example in Section 3.3.

3.2 Expression of EBVs

The solutions for animals from the MME are called Estimated Breeding Values, or EBVs. An EBV is a prediction of the additive genetic value of an animal. Each progeny receives a random sample half of the alleles at all genes in the genome. As with most statistical models, only differences among animals can be estimated.
A property of animal model solutions from MME is that

  1' A^{-1} â = 0.

That is, a weighted sum of EBVs should equal zero. When parents are known for all animals, except for base population animals, this means that all of the base population animal EBVs must sum to zero. This is a mathematical consequence of the MME and the model, which almost always has at least one fixed factor. By absorbing the fixed factor out of the MME, one can show that the right hand sides of the MME for animals add to zero, and consequently the above property holds true. If some parents are missing and phantom parents need to be used, then the sum of the phantom parent group solutions will be zero.

In practice, most livestock producers want to see numbers that have meaning to them. Thus, different forms of expressing EBVs have been created in an effort to help producers understand and use EBV information to improve their animals.

3.2.1 Genetic Base

Estimated breeding values (EBVs), which are a product of the mixed model equations, need to be presented relative to a well-defined genetic base. There are two types of genetic base, fixed and rolling. The genetic base may be defined many different ways. As an example, let the genetic base be defined as the group of all female animals born from year 1 through year 5. To impose that base, the average EBV of all female animals that are included in that definition of the base is subtracted from ALL EBVs. Alternatively, the definition could be all male animals, or all first lactation animals calving in years 1 through 5. The best definition is simple, straightforward, and understandable to producers and owners of the animals.

A fixed genetic base is one where the definition is kept constant for many years or generations. No new animals are added to the base, and no old animals are removed.
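Imposing a genetic base amounts to a single subtraction. A small sketch, with made-up animal IDs and EBV values used only for illustration:

```python
# Impose a genetic base: subtract the mean EBV of the animals in the
# base definition from all EBVs.  IDs and values are illustrative.
ebv = {"A": 3.0, "B": -1.0, "C": 4.0, "D": 0.5}
base = ["A", "B"]                                  # base definition
base_mean = sum(ebv[k] for k in base) / len(base)  # mean of base EBVs
rebased = {k: v - base_mean for k, v in ebv.items()}
print(rebased)   # the base animals now average zero
```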
The animals in the genetic base definition are always the same from one run to the next in an annual genetic evaluation system. If genetic change is occurring in the population, then gradually over time the more recent animals will tend to have a higher average EBV than the animals in the genetic base definition. The problem is that producers think that any animal with an EBV above 0 could be used for breeding, and those below 0 would be avoided. However, with a fixed genetic base, eventually all currently living animals will have an EBV above 0. Producers would have to keep informed about the current average EBV.

Scientists like to help producers by using a rolling genetic base. That is, the definition is constant, but years 1 and 5 are increased to 2 and 6, then to 3 and 7, etc., in subsequent years. The animals in the genetic base definition therefore change every year, and the average EBV of current breeding candidates stays around zero. Producers will continue to use only animals that have EBVs above 0; animals with EBVs below zero will not be used for breeding. How often to roll the definition of genetic base animals can be debated. If genetic change is rapid, then annual changes may be necessary; otherwise, the definition may be fixed for several years. In this matter, the wishes of the producers should be heeded. A problem with a rolling base is that, as animals age, their EBVs go down in value because of the positive genetic trend. This can be disconcerting to animal owners. Either base has advantages and disadvantages.

3.2.2 EBV

Estimated Breeding Values (EBV) come directly out of the solutions to the MME for an animal model. For many countries and most species of livestock, an EBV is all that is needed by producers. Producers generally like to see genetic information that is expressed in the same units as what they see from day to day.
That is, kilograms of milk, kilograms of weight, gain per day, centimeters of height or circumference, and so on. An EBV is a number that allows producers to rank the animals and make decisions.

3.2.3 EPD, ETA

Producers often think that the number they see is how the progeny should be expected to perform. In this case, Expected Progeny Differences (EPDs) or Estimated Transmitting Abilities (ETAs), which equal one half the EBV, should be used. Producers want to know what amount is actually transmitted to the progeny, which is half the animal's EBV. The ranking of animals remains exactly the same, but the actual genetic information that the producer sees is half as big as with EBVs. Thus, if a producer has an ETA on each parent, then the progeny would have an EBV equal to the sum of those two ETAs.

3.2.4 RBV

In some cases, producers favour a Relative Breeding Value, RBV. If µ is the phenotypic mean of animals in the genetic base group, then an RBV is

  RBV = 100 * (EBV + µ) / µ.

RBVs have a mean of around 100, depending on the genetic base chosen for the EBV. The variation in RBV can be great or small, but RBVs are often standardized so that they range from 80 to 120. The RBV gives an indication of which animals are above and below average without having to know the average EBV for current animals. In cases where the units associated with the trait have no meaning to producers, such as conformation traits or disease traits, RBVs may be easier to understand than EBVs.

3.2.5 Percentile Ranking

Another form of expression is the Percentile Ranking (PR). Animals' EBVs are ranked from high to low (1 to N); then the percentile rank is

  PR = 100 * (N + 1 − R) / (N + 1),

where R is the rank of the animal's EBV. The higher the percentile rank, the higher the EBV. Producers should only use animals that are above the 50th percentile. Sometimes producers can see ETA, EBV, RBV, and PR all on the same animals.
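The three re-expressions just defined can be collected in one small function. This is an illustrative sketch, not code from the text; ties in rank are ignored:

```python
def express(ebv, mu):
    """Re-express EBVs (already on the desired genetic base) as
    (ETA, RBV, percentile rank), using the formulas in the text."""
    n = len(ebv)
    # rank 1 = highest EBV
    order = sorted(range(n), key=lambda i: -ebv[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    out = []
    for i in range(n):
        eta = ebv[i] / 2.0                         # half the EBV
        rbv = 100.0 * (ebv[i] + mu) / mu           # relative breeding value
        pr = 100.0 * (n + 1 - rank[i]) / (n + 1)   # percentile rank
        out.append((eta, rbv, pr))
    return out

vals = express([8.0134, -6.2926, 0.5655], mu=90.0)
```

For example, an EBV of 8.0134 with µ = 90 gives RBV = 100 × 98.0134/90 ≈ 108.90, the value reported for animal 25 in the numerical example of Section 3.3.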
This seems like overkill and an overabundance of redundant information. If at all possible, try to utilize just one form of expression.

When a trait like backfat thickness is analyzed, a high EBV indicates an animal with more fat, which is undesirable. Thus, producers want to select for low EBVs when choosing mates for breeding, for this trait. Relative breeding values or percentile rankings could be altered to reflect the difference in selection goals for that trait, so that a high value always indicates a desirable animal. Percentile rankings can also make such traits easier to interpret.

3.2.6 Intermediate Traits

Birthweights can be considered an intermediate trait. Breeders do not want progeny that have high birthweights because that could lead to birthing problems and death of the fetus, the female, or both. Breeders also do not want progeny that have low birthweights because such progeny may not survive to weaning. Thus, there is a range of acceptable birthweights, and the goal is to maintain that optimum range. EBVs could be changed to a categorical basis of TOO HIGH, TOO LOW, and JUST RIGHT.

Some conformation traits in dairy cattle are also intermediate, such as pelvic angle. Classifiers score the trait from 1 to 9, for example, where 1 is a flat extreme and 9 is a too steep extreme. The scores from 1 to 9 should be analyzed as any other trait, and the resulting EBVs should be re-categorized into TOO FLAT, TOO STEEP, and JUST RIGHT, in the proportions of 25:25:50, respectively.

Do not fold a distribution. That is, do not assign new scores such that the middle category, 5, remains as 5, and 6, 7, 8, and 9 become 4, 3, 2, and 1, respectively, while categories 1 to 4 stay the same. This is incorrect because the heritability of the folded trait becomes much lower than for the original 1 to 9 scale. With the folded scores, a 1 indicates either TOO FLAT or TOO STEEP, which are two genetically opposite conditions. Both are bad, but they are not the same genetically. The original scale should be used for analysis, as long as it delineates one genetic extreme from the other.

3.3 Numerical Example

Below are data on 15 animals showing parents (all unrelated), herd, year, and contemporary groups (herd-year subclasses). Parents of animals 1 to 14 are unknown.

Table 3.1 Data For Animal Model
Animal  Sire  Dam  Herd  Year  CG  Trait
  15     1     2    1     1    1     94
  16     1     2    1     1    1     89
  17     3     4    1     1    1     72
  18     3     6    1     1    1    100
  19    13     2    1     2    2     73
  20     5     4    1     2    2     70
  21     5     4    1     2    2     84
  22     1     8    2     1    3     88
  23     1    10    2     1    3    102
  24     7    12    2     1    3     82
  25     7    14    2     1    3    130
  26     5     8    2     2    4     93
  27     5    10    2     2    4    105
  28     9    12    2     2    4    118
  29    11    14    2     2    4     69

The model equation is

  y_ijk = b_i + c_j + a_k + e_ijk,

where y_ijk is an observation on the kth animal in the jth contemporary group within the ith year, b_i is an effect of the ith year, c_j is an effect of the jth contemporary group (herd-year subclass), a_k is an effect of the kth animal, and e_ijk is a residual effect. Further, the expectations and covariance matrices are as follows:

  E(b_i) = b_i,  E(c_j) = 0,  E(a_k) = 0,  E(e_ijk) = 0,

and

       [ c ]   [ I σ_c^2    0          0        ]
  Var  [ a ] = [ 0          A σ_a^2    0        ]
       [ e ]   [ 0          0          I σ_e^2  ]

where c, a, and e represent vectors of contemporary group effects, animal breeding values, and residuals, respectively. The parameters will be assumed known for this example:

  σ_e^2 = 144,  σ_a^2 = 64,  σ_c^2 = 36,  h^2 = .2623.

The order of the mixed model equations is 35 by 35, with 2 equations for years, 4 equations for contemporary groups, and 29 equations for animals. The rank of the equations is also 35. The two year solutions were 94.82039 for year 1 and 87.13676 for year 2. Contemporary group solutions were -2.065268, -3.668846, 2.065268, and 3.668846 for groups 1 to 4, respectively. A table of the solutions for animals is given below.
Table 3.2 EBVs from MME
Animal  Sire  Dam     EBV        Animal    EBV
  15     1     2    -0.7470         1    -0.7469
  16     1     2    -1.6561         2    -1.6324
  17     3     4    -6.4880         3    -1.8120
  18     3     6     1.1317         4    -4.8231
  19    13     2    -3.2291         5     0.9764
  20     5     4    -4.0223         6     1.3585
  21     5     4    -1.4769         7     2.5589
  22     1     8    -2.3494         8    -1.0471
  23     1    10     1.8323         9     4.4193
  24     7    12    -1.1046        10     2.9529
  25     7    14     7.8180        11    -3.7871
  26     5     8     0.3701        12     1.3569
  27     5    10     4.1883        13    -1.6086
  28     9    12     7.3074        14     1.8343
  29    11    14    -4.7635

Define the genetic base as those animals with records in Year 1, namely animals 15 to 18 and 22 to 25. Their average EBV was -0.19539, which is subtracted from all EBVs. Thus, all animals are now compared to the average of animals with records in Year 1, as shown in the next table. Also shown are the ETAs, or one half the EBV (also on the same genetic base), the relative breeding values (RBV) using 90 as the mean of the trait, and the percentile rankings, PR. The last two columns are the standard errors of prediction and the reliability (or accuracy) from the elements of the generalized inverse of the MME.
Table 3.3 Animal solutions from MME
Animal    EBV       ETA       RBV     Percentile   SEP   Reliability
   1    -0.5516   -0.3735    99.39       53        9.23      13
   2    -1.4370   -0.8162    98.40       33        9.19      14
   3    -1.6166   -0.9060    98.20       27        9.45       9
   4    -4.6277   -2.4115    94.86        7        9.20      14
   5     1.1718    0.4882   101.30       60        9.31      12
   6     1.5539    0.6792   101.73       70        9.59       6
   7     2.7543    1.2795   103.06       80        9.45       9
   8    -0.8517   -0.5235    99.05       47        9.28      12
   9     4.6147    2.2097   105.13       90        9.60       6
  10     3.1483    1.4765   103.50       83        9.28      12
  11    -3.5917   -1.8936    96.01       17        9.60       6
  12     1.5523    0.6784   101.72       67        9.31      12
  13    -1.4132   -0.8043    98.43       37        9.61       6
  14     2.0297    0.9171   102.26       77        9.31      12
  15    -0.5516   -0.3735    99.39       50        8.62      24
  16    -1.4607   -0.8281    98.38       30        8.62      24
  17    -6.2926   -3.2440    93.01        3        8.61      25
  18     1.3271    0.5659   101.47       63        8.67      23
  19    -3.0337   -1.6146    96.63       20        8.60      25
  20    -3.8269   -2.0112    95.75       13        8.74      22
  21    -1.2815   -0.7384    98.58       40        8.74      22
  22    -2.1541   -1.1747    97.61       23        8.63      24
  23     2.0277    0.9162   102.25       73        8.63      24
  24    -0.9092   -0.5523    98.99       43        8.62      24
  25     8.0134    3.9090   108.90       97        8.62      24
  26     0.5655    0.1850   100.63       57        8.66      24
  27     4.3836    2.0941   104.87       87        8.66      24
  28     7.5028    3.6537   108.34       93        8.56      25
  29    -4.5682   -2.3818    94.92       10        8.56      25

An estimate of the residual variance is given by

  σ̂_e^2 = y'(y − Xb̂ − Wĉ − Zâ) / (N − r(X))
        = 2872.765 / (15 − 2) = 220.9819.

The variance of prediction error for animal 25, for example, is

  Var(â_25 − a_25) = 0.3359 * σ̂_e^2 = 74.2278.

The standard error of prediction is therefore SEP = 8.62. The ratio of residual to genetic variances is 144/64 or 2.25. Therefore, the accuracy (or reliability) of the EBV (ETA or RBV) of animal 25 is

  1 − 0.3359 * 2.25 = 0.2442,

or 24%. If the animal is inbred, then use (1 + F_i) − C^{ii} * (σ_e^2 / σ_a^2).

Approximate accuracies for these data could be based on (1) the number of records on the animal; (2) the number of progeny of the animal (with records); (3) the number of other progeny of the dam; and (4) the number of other progeny of the sire.
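The prediction-error arithmetic shown above for animal 25 can be checked in a few lines (all values taken from the example):

```python
import math

# Check of the prediction-error arithmetic for animal 25:
# C_ii = 0.3359, residual variance estimate 220.9819, alpha = 144/64.
c_ii = 0.3359
sigma_e2_hat = 220.9819
pev = c_ii * sigma_e2_hat       # variance of prediction error
sep = math.sqrt(pev)            # standard error of prediction
alpha = 144.0 / 64.0
rel = 1.0 - c_ii * alpha        # reliability
print(round(sep, 2), round(rel, 4))   # 8.62 0.2442
```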
The equations to construct, for each animal, are P b = C, or

  [ (1+(x−1)r)/x    .5h^2                 .25h^2                .25h^2               ] [ b_1 ]   [ h^2    ]
  [ .5h^2           (1+(z−1)(.25)h^2)/z   .25h^2                .25h^2               ] [ b_2 ] = [ .5h^2  ]
  [ .25h^2          .25h^2                (1+(s−1)(.25)h^2)/s   0                    ] [ b_3 ]   [ .25h^2 ]
  [ .25h^2          .25h^2                0                     (1+(d−1)(.25)h^2)/d  ] [ b_4 ]   [ .25h^2 ]

where x is the number of records on the animal; z is the number of progeny with records on the animal; s is the number of progeny of the sire; d is the number of progeny of the dam; r is the repeatability of the trait; and h^2 is the heritability of the trait. Just substitute in the appropriate numbers for each animal. The sire and dam are assumed to be unrelated, but the correct value could be included, if known. If an animal is inbred, then C_1 = (1 + F_i)h^2. The easiest course of action is to ignore inbreeding and assume animals are not inbred.

Solve

  b̂ = P^{-1} C,

then the accuracy is

  ACC = 100 * (b̂'C / h^2).

If one of the pieces of information is missing, then that entire row and column of P is set to zero. Depending on the situation, other sources of information could be included, such as the number of records on the dam and/or sire. If the true heritability is h^2, then it may be better to use a smaller value in calculating approximate accuracies, such as .75h^2 to .8h^2, to account for the simultaneous estimation of fixed factors in the model.

Table 3.4 Comparison of Accuracies. Reliability is from inverse elements of the equations, and Accuracy is from the approximation.
Animal  Records  Progeny  Reliability  Accuracy
   1       0        4         13         17.3
   2       0        3         14         13.6
   3       0        2          9          9.5
   4       0        3         14         13.6
   5       0        4         12         17.3
   6       0        1          6          5.0
   7       0        2          9          9.5
   8       0        2         12          9.5
   9       0        1          6          5.0
  10       0        2         12          9.5
  11       0        1          6          5.0
  12       0        2         12          9.5
  13       0        1          6          5.0
  14       0        2         12          9.5
  15       1        0         24         23.7
  16       1        0         24         23.7
  17       1        0         25         22.3
  18       1        0         23         20.8
  19       1        0         25         21.5
  20       1        0         22         23.7
  21       1        0         22         23.7
  22       1        0         24         23.0
  23       1        0         24         23.0
  24       1        0         24         21.6
  25       1        0         24         21.6
  26       1        0         24         23.0
  27       1        0         24         23.0
  28       1        0         25         20.8
  29       1        0         25         20.8

The approximations are reasonable values compared to the reliabilities from the inverse elements of the MME for this small example. Only four of the approximate values are greater than the exact values.

Chapter 4

Phantom Parent Groups

In most real-life applications, there are always some animals with missing parent information. One alternative is to include only animals with records that have both parents identified. However, producers become upset when their animals do not receive EBVs. Thus, another alternative is needed. Westell et al. (1988) and Robinson (1986) assigned phantom parents to animals with unknown parents. Each phantom parent was assumed to have only one progeny. Phantom parents were assumed to be unrelated to all other real or phantom animals. Then, phantom parents were assigned to groups based on the gender of their offspring and on the year in which the progeny were born. In the end, every animal in the pedigree has both parents identified, even though the real parents may be unknown; in place of an unknown parent there is a phantom parent group identification. The logic is that phantom parents whose first progeny were born in a particular time period probably underwent the same degree of selection intensity to become breeding animals. However, male phantom parents versus female phantom parents might have faced different selection intensities.
Phantom parents were assigned to phantom parent groups depending on whether they were sires or dams, on the year of birth of their first progeny, and on whether the first progeny was male or female. Phantom parent groups may also be formed depending on breed and regions within a country. The basis for further groups depends on the existence of different selection intensities involved in arriving at particular phantom parents.

Phantom parent groups are best handled by considering them as additional animals in the pedigree. Then the inverse of the relationship matrix can be constructed using the same rules as before. These results are due to Quaas (1988).

To illustrate, use the same seven animals as in Section 2.4. Assign the unknown sires of animals 1 and 2 to phantom group 1 (P1) and their unknown dams to phantom group 2 (P2). Assign the unknown dam of animal 3 to phantom group 3 (P3). The resulting matrix will be of order 10 by 10:

  A*^{-1} = [ A_rr   A_rp ]
            [ A_pr   A_pp ]

where A_rr is a 7 by 7 matrix corresponding to the elements among the real animals; A_rp and its transpose are of order 7 by 3 and 3 by 7, respectively, corresponding to elements of the inverse between real animals and phantom groups; and A_pp is of order 3 by 3 and contains inverse elements corresponding to phantom groups. A_rr will be exactly the same as A^{-1} given in the previous section. The other matrices are

          [ -.5     -.5      .33333 ]
          [ -.5     -.5      0      ]
          [  0       0      -.66667 ]
  A_rp =  [  0       0       0      ]
          [  0       0       0      ]
          [  0       0       0      ]
          [  0       0       0      ]

          [ .5    .5    0      ]
  A_pp =  [ .5    .5    0      ]
          [ 0     0     .33333 ]

In this formulation, phantom groups (according to Quaas (1988)) are additional fixed factors, and there is a dependency between phantom groups 1 and 2. This singularity can cause problems in deriving solutions from the MME. The dependency can be removed by adding an identity matrix to A_pp.
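A sketch (not the author's code) of this construction: treat the three phantom groups as extra rows and columns and apply the ordinary inverse rules, with δ taken from Table 2.11 (the b_ii values are unchanged by the grouping):

```python
import numpy as np

# Build the 10x10 inverse with phantom parent groups P1-P3 treated as
# additional "animals".  Indices 0-6 are animals 1-7; 7-9 are P1-P3.
ped = [  # (animal, sire index, dam index, delta = 1/b_ii)
    (0, 7, 8, 1.0),        # animal 1: phantom sire P1, phantom dam P2
    (1, 7, 8, 1.0),        # animal 2: phantom sire P1, phantom dam P2
    (2, 0, 9, 4.0 / 3.0),  # animal 3: sire 1, phantom dam P3
    (3, 0, 1, 2.0),        # animal 4
    (4, 2, 3, 2.0),        # animal 5
    (5, 0, 3, 2.0),        # animal 6
    (6, 4, 5, 2.0),        # animal 7
]
Ainv = np.zeros((10, 10))
for i, s, d, delta in ped:
    Ainv[i, i] += delta                 # delta on the animal diagonal
    for p in (s, d):                    # -delta/2 between animal and parents
        Ainv[i, p] -= delta / 2.0
        Ainv[p, i] -= delta / 2.0
    for p in (s, d):                    # delta/4 among the two parents
        for q in (s, d):
            Ainv[p, q] += delta / 4.0
print(Ainv[0, 7:])   # row of animal 1 against the groups
```

The printed row of animal 1 against the groups is (-.5, -.5, .33333), the first row of A_rp above, and the lower right 3 by 3 block of Ainv reproduces A_pp.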
When genetic groups have many animals assigned to them, adding the identity matrix to A_pp does not result in any significant re-ranking of animals in genetic evaluation, and it aids in getting faster convergence of the iterative system of equations.

Phantom groups are used in many genetic evaluation systems today. The phantom parents assigned to a genetic group are assumed to be the outcome of non-random mating and similar selection differentials on their parents. This assumption, while limiting, is not as severe as assuming that all phantom parents belong to one base population.

4.1 Background History

A pedigree file may have many animals with missing parent identification. Assign a parent group to an animal based upon the year of birth of the animal and the pathway of selection. If the animal is a female, for example, and the dam information is missing, then the parent group would be in the Dams of Females pathway (DF). There are also Dams of Males (DM), Sires of Females (SF), and Sires of Males (SM): four pathways, with various years of birth nested within each pathway. These have become known as phantom parent groups.

Genetic group effects are added to the model:

  y = Xb + ZQg + Za + e,

where a is the vector of animal additive genetic effects, Z is the matrix that relates animals to their observations, g is the vector of genetic group effects, Q is the matrix that relates animals to their genetic groups, and y, Xb, and e are as described in earlier notes. The Estimated Breeding Value, EBV, of an animal is equal to the sum of the group and animal solutions from MME, i.e.

  Vector of EBVs = Qĝ + â.

To illustrate, the pedigree list with groups becomes

Animal  Sire  Dam
  A      G1    G2
  B      G1    G2
  C      G1    G2
  D      A     B
  E      A     C
  F      E     D

4.2 Simplifying the MME

The advantage of phantom parent grouping is that the mixed model equations simplify significantly.
Using the previous model, the MME are

  [ X'X     X'ZQ      X'Z           ] [ b̂ ]   [ X'y   ]
  [ Q'Z'X   Q'Z'ZQ    Q'Z'Z         ] [ ĝ ] = [ Q'Z'y ]
  [ Z'X     Z'ZQ      Z'Z + A^{-1}α ] [ â ]   [ Z'y   ]

Notice that Q' times the third row subtracted from the second row gives

  Q' A^{-1} â α = 0.

Quaas (1988) showed that the above equations could be transformed so that Qĝ + â can be computed directly. The derivation is as follows. Note that

  [ b ]   [ I    0   0 ] [ b      ]
  [ g ] = [ 0    I   0 ] [ g      ]
  [ a ]   [ 0   -Q   I ] [ Qg + a ]

Substituting this equality into the left hand side (LHS) of the MME gives

  [ X'X     0            X'Z           ] [ b̂      ]   [ X'y   ]
  [ Q'Z'X   0            Q'Z'Z         ] [ ĝ      ] = [ Q'Z'y ]
  [ Z'X    -A^{-1}Qα     Z'Z + A^{-1}α ] [ Qĝ + â ]   [ Z'y   ]

To make the equations symmetric again, both sides of the above equations must be premultiplied by

  [ I    0    0   ]
  [ 0    I   -Q'  ]
  [ 0    0    I   ]

This gives the following system of equations:

  [ X'X    0             X'Z           ] [ b̂      ]   [ X'y ]
  [ 0      Q'A^{-1}Qα   -Q'A^{-1}α     ] [ ĝ      ] = [ 0   ]
  [ Z'X   -A^{-1}Qα      Z'Z + A^{-1}α ] [ Qĝ + â ]   [ Z'y ]

Quaas (1987) examined the structure of Q and the inverse of A under phantom parent grouping and noticed that Q'A^{-1}Q and -Q'A^{-1} had properties that followed the rules of Henderson (1976) for forming the elements of the inverse of A. Thus, the elements of A^{-1}, Q'A^{-1}Q, and -Q'A^{-1} can be created by a simple modification of Henderson's rules. Use δ_i as computed earlier (i.e. δ_i = 1/b_ii), let i refer to the individual animal, and let s and d refer to either the parent or the phantom parent group if either parent is missing; then the rules are

  Constant to Add    Location in Matrix
  δ_i                (i,i)
  -δ_i/2             (i,s), (s,i), (i,d), and (d,i)
  δ_i/4              (s,s), (d,d), (s,d), and (d,s)

Thus, Q'A^{-1}Q and Q'A^{-1} can be created directly without explicitly forming Q and without performing the multiplications by A^{-1}.

Chapter 5

Estimation of Variances

5.1 Some History

In 1953, Henderson published a paper that described three methods of estimating variance components from a linear model.
The first method assumed a model in which all factors were random, except for the overall mean. In the 1950s, variances were estimated using mechanical calculators, and the computations took many days for even a simple data set and model. The second method allowed fixed factors in the model, but only as long as there were no interactions of the fixed factors with the random factors. The least squares solutions for all factors were calculated and the observations adjusted for the fixed factors. Then the first method was applied to the adjusted observations, making sure to adjust the expectations involving the residual variance in the sums of squares. This was tedious, but not much more difficult than the first method. His third method, known as the fitting constants method, was applicable to general models with both fixed and random factors. This method required the most computations and was always considered the most accurate of the three. The third method, however, was not well defined, in that one could calculate more sums of squares than were needed to estimate all of the components.

The only statistical properties of Henderson's three methods were that the estimators were unbiased and easy to calculate. By imposing unbiasedness, the estimates always had a chance of turning out negative, which is an undesirable property for a variance estimator. By 1970, maximum likelihood had been described (Hartley and Rao), followed by restricted maximum likelihood (Patterson and Thompson, 1971) and minimum variance quadratic estimation (C. R. Rao, 1971). These methods were much more complex than Henderson's methods, but were based on sounder statistical properties (e.g. positive estimates, lower variance of the estimates). At the same time, computing power was improving substantially. Soon, Henderson's methods were outdated and fell out of use.
5.2 Modern Thinking

Every variable in a linear model is a random variable described by a distribution function. A fixed factor is a random variable having a possibly uniform distribution going from a lower limit to an upper limit. A component of variance is a random variable having a Gamma or Chi-square distribution with df degrees of freedom. In addition, there may be information from previous experiments that strongly indicates the value that a variance component may have, and the Bayesian approach allows such prior information to be included in the analysis. The Bayesian process is to

1. Specify distributions for each random variable of the model.
2. Combine the distributions into a joint posterior distribution.
3. Derive the fully conditional distributions of each random variable from the joint posterior distribution.
4. Employ Markov Chain Monte Carlo (MCMC) methods to draw samples from the joint posterior distribution.

Gibbs Sampling is a tool in MCMC methods for deriving estimates of parameters from the joint posterior distribution. By determining the conditional distribution for each random variable of the model, random samples drawn repeatedly from these distributions eventually converge to random samples from the overall joint posterior distribution. Computationally, any program that calculates solutions to Henderson's mixed model equations can be modified to implement Gibbs Sampling, and theoretically this allows one to use the same data, same model, and same program as for genetic evaluation. That is, there is no need to choose a subset of the data to estimate the variances, except to save time.

When a subset of the complete data set is chosen to estimate variance components, two things may happen. First, the subset may or may not be a representative sample of the complete data set.
Usually, the subset is chosen by keeping only animals having pedigrees, or animals (sires or dams) having many progeny (observations); thus the subset is no longer random, and no longer representative. Secondly, a subset is generally smaller than the complete data set, and therefore the estimates of variances will be less accurate than if all of the data were used. Thus, if possible, all data should be utilized and no subsets taken. The amount of computer time will be greater than for a subset, but the estimates that are derived will be a better reflection of the complete data set.

5.3 The Joint Posterior Distribution

Begin with a simple single trait, single record per animal, animal model. That is,

  y = Xb + Za + Wu + e.

Let θ be the vector of random variables, where θ = (b, a, u, σ_a^2, σ_u^2, σ_e^2), and y is the data vector. Then

  p(θ, y) = p(θ) p(y | θ) = p(y) p(θ | y).

Re-arranging gives

  p(θ | y) = p(θ) p(y | θ) / p(y)
           = (prior for θ) × p(y | θ) / p(y)
           = posterior probability function of θ.

5.3.1 Conditional Distribution of Data Vector

The conditional distribution of y given θ is

  y | b, a, u, σ_a^2, σ_u^2, σ_e^2 ~ N(Xb + Za + Wu, I σ_e^2),

and

  p(y | b, a, u, σ_a^2, σ_u^2, σ_e^2)
    ∝ (σ_e^2)^{-N/2} exp[ -(y − Xb − Za − Wu)'(y − Xb − Za − Wu) / (2σ_e^2) ].

5.3.2 Prior Distributions of Random Variables

Fixed Effects Vector

There is little prior knowledge about the values that b might have. This is represented by assuming

  p(b) ∝ constant.

Additive Genetic Effects and Variances

For a, the vector of additive genetic values, quantitative genetics theory suggests that they follow a normal distribution, i.e.

  a | A, σ_a^2 ~ N(0, A σ_a^2)

and

  p(a) ∝ (σ_a^2)^{-q/2} exp[ -a' A^{-1} a / (2σ_a^2) ],

where q is the length of a. A natural estimator of σ_a^2 is a' A^{-1} a / q; call it S_a^2, where

  S_a^2 ~ χ²_q σ_a^2 / q.
5.3.1 Conditional Distribution of Data Vector

The conditional distribution of y given θ is

y | b, a, u, σa², σu², σe² ~ N(Xb + Za + Wu, Iσe²),

and

p(y | b, a, u, σa², σu², σe²) ∝ (σe²)^(-N/2) exp[ -(y - Xb - Za - Wu)'(y - Xb - Za - Wu) / 2σe² ].

5.3.2 Prior Distributions of Random Variables

Fixed Effects Vector

There is little prior knowledge about the values that b might have. This is represented by assuming

p(b) ∝ constant.

Additive Genetic Effects and Variances

For a, the vector of additive genetic values, quantitative genetics theory suggests that they follow a normal distribution, i.e.

a | A, σa² ~ N(0, Aσa²),

and

p(a) ∝ (σa²)^(-q/2) exp[ -a'A⁻¹a / 2σa² ],

where q is the length of a. A natural estimator of σa² is a'A⁻¹a/q; call it Sa², where

Sa² ~ χ²_q σa² / q.

Multiply both sides by q and divide by χ²_q to give

σa² ~ q Sa² / χ²_q,

which is a scaled, inverted Chi-square distribution, written as

p(σa² | va, Sa²) ∝ (σa²)^(-(va/2 + 1)) exp[ -va Sa² / 2σa² ],

where va and Sa² are hyperparameters, with Sa² being a prior guess about the value of σa² and va being the degrees of belief in that prior value. Usually q is much larger than va, and therefore the data provide nearly all of the information about σa².

Other Random Factors

The vector u may contain a number of different factors, u' = ( u₁' u₂' ··· u_s' ), for s factors, and

Var(u_j) = I σ²_uj

for the j-th factor having length q_j. For the j-th factor,

u_j | I, σ²_uj ~ N(0, I σ²_uj),

and

p(u_j) ∝ (σ²_uj)^(-q_j/2) exp[ -u_j'u_j / 2σ²_uj ].

A natural estimator of σ²_uj is u_j'u_j/q_j; call it S²_uj, where

S²_uj ~ χ²_qj σ²_uj / q_j.

Multiply both sides by q_j and divide by χ²_qj to give

σ²_uj ~ q_j S²_uj / χ²_qj,

which is a scaled, inverted Chi-square distribution, written as

p(σ²_uj | v_uj, S²_uj) ∝ (σ²_uj)^(-(v_uj/2 + 1)) exp[ -v_uj S²_uj / 2σ²_uj ],

where v_uj and S²_uj are hyperparameters, with S²_uj being a prior guess about the value of σ²_uj and v_uj being the degrees of belief in that prior value. Usually q_j is much larger than v_uj, and therefore the data provide nearly all of the information about σ²_uj.

Residual Effects

Similarly, for the residual variance,

p(σe² | ve, Se²) ∝ (σe²)^(-(ve/2 + 1)) exp[ -ve Se² / 2σe² ].

Combining Prior Distributions

The joint posterior distribution is

p(b, a, u, σa², σu², σe² | y) ∝ p(b) p(a | σa²) p(σa²) [ Π_{j=1}^{s} p(u_j | σ²_uj) p(σ²_uj) ] p(σe²) p(y | b, a, u, σa², σu², σe²),

which can be written as

∝ (σe²)^(-(N+ve)/2 - 1) exp[ -((y - Xb - Za - Wu)'(y - Xb - Za - Wu) + ve Se²) / 2σe² ]
× (σa²)^(-(q+va)/2 - 1) exp[ -(a'A⁻¹a + va Sa²) / 2σa² ]
× Π_{j=1}^{s} (σ²_uj)^(-(q_j+v_uj)/2 - 1) exp[ -(u_j'u_j + v_uj S²_uj) / 2σ²_uj ].
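Sampling from a scaled, inverted Chi-square distribution only requires drawing a central Chi-square variate and inverting it, since σ² = vS²/χ²_v. A minimal sketch in Python (the book's examples are in R; the function name and the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def rinvchisq(v, S2, size, rng=rng):
    """Scaled, inverted Chi-square draws: sigma2 = v * S2 / chi2_v,
    with S2 the prior guess and v the degrees of belief."""
    return v * S2 / rng.chisquare(v, size=size)

# Prior guess S2 = 50 with v = 10 degrees of belief (illustrative values).
draws = rinvchisq(10, 50.0, 200_000)
```

For v > 2 the mean of this distribution is vS²/(v − 2), so the draws above should average near 62.5; all draws are strictly positive, as a variance must be.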
5.4 Fully Conditional Posterior Distributions

In order to implement Gibbs sampling, all of the fully conditional posterior distributions (one for each component of θ) need to be derived from the joint posterior distribution. A conditional posterior distribution is derived from the joint posterior distribution by picking out the parts that involve the unknown parameter in question.

5.4.1 Fixed and Random Effects of the Model

Let

T = ( X Z W ),   β' = ( b' a' u' ),

Σ = [ 0      0        0
      0   A⁻¹ k_a     0
      0      0      I k_u ],

with k_a = σe²/σa² and k_u = σe²/σu², and

C = Henderson's Mixed Model Equations = T'T + Σ,   C β̂ = T'y.

A new notation is introduced: let β' = ( β_i  β_{-i}' ), where β_i is a scalar representing just one element of the vector β, and β_{-i} is a vector representing all of the other elements except β_i. Similarly, C and T can be partitioned in the same manner as

T = ( T_i  T_{-i} ),

C = [ C_{i,i}   C_{i,-i}
      C_{-i,i}  C_{-i,-i} ].

In general terms, the conditional posterior distribution of β_i is a normal distribution,

β_i | β_{-i}, σa², σu², σe², y ~ N( β̂_i, C_{i,i}⁻¹ σe² ),

where

C_{i,i} β̂_i = T_i'y - C_{i,-i} β_{-i}.

Then

b_i | b_{-i}, a, u, σa², σu², σe², y ~ N( b̂_i, C_{i,i}⁻¹ σe² ), for C_{i,i} = x_i'x_i.

Also,

a_i | b, a_{-i}, u, σa², σu², σe², y ~ N( â_i, C_{i,i}⁻¹ σe² ), where C_{i,i} = ( z_i'z_i + A^{i,i} k_a ), for k_a = σe²/σa².

5.4.2 Variances

The conditional posterior distributions for the variances are inverted Chi-square distributions:

σa² | b, a, u, σu², σe², y ~ ṽa S̃a² χ⁻²_ṽa, for ṽa = q + va and S̃a² = ( a'A⁻¹a + va Sa² ) / ṽa;

σ²_uj | b, a, u, σa², σe², y ~ ṽ_uj S̃²_uj χ⁻²_ṽuj, for ṽ_uj = q_j + v_uj and S̃²_uj = ( u_j'u_j + v_uj S²_uj ) / ṽ_uj, for each factor u_j;

σe² | b, a, u, σa², σu², y ~ ṽe S̃e² χ⁻²_ṽe, for ṽe = N + ve and S̃e² = ( e'e + ve Se² ) / ṽe, where e = y - Xb - Za - Wu.

5.5 Computational Scheme

The following is a small example for an animal model to illustrate Gibbs sampling. The example has

• 35 observations,
• Only one record per animal,

• 67 non-inbred animals,

• 3 years,

• 9 contemporary groups,

• 3 variances to estimate, i.e. contemporary group, genetic, and residual.

Gibbs sampling is much like Gauss-Seidel iteration. When a new solution is calculated in the Mixed Model Equations for a level of a fixed or random factor, a random amount of variation (or noise) is added to the solution, based upon its conditional posterior distribution variance, before proceeding to the next level of that factor or the next factor. After all equations have been processed, new values of the variances are calculated and new variance ratios are determined prior to beginning the next round. The example data are given in Table 5.1.

Table 5.1 Animal Model Example Data

Animal Sire Dam Year CG Obs.    Animal Sire Dam Year CG Obs.
  22    1   11   1   1  58        44    5   20   2   6  72
  23    1   12   1   1  57        45    6   21   2   6  45
  24    2   13   1   1  38        46    7   34   2   6  46
  25    2   14   1   1  68        47    2   35   2   6  45
  26    3   15   1   2  55        48    2   36   2   6  45
  27    3   16   1   2  56        56    2   14   3   7  58
  28    4   17   1   2  71        57    3   49   3   7  52
  29    4   18   1   2  63        58    4   50   3   7  59
  30    5   19   1   3  43        59    5   51   3   7  78
  31    5   20   1   3  49        60    1   18   3   8  35
  32    1   21   1   3  50        61    4   33   3   8  47
  37    1   12   2   4  65        62    8   27   3   8  57
  38    2   13   2   4  62        63    9   52   3   9  50
  39    6   14   2   4  76        64   10   35   3   9  46
  40    3   15   2   5  55        65    6   53   3   9  52
  41    4   17   2   5  51        66    7   54   3   9  46
  42    6   18   2   5  62        67    3   55   3   9  52
  43    7   33   2   5  79

5.5.1 Year Effects

The following variable names and assignments will be used.

nyr = 3                    # number of years (generations)
ncg = 9                    # number of contemporary groups
nam = 67                   # number of animals
SDa = sqrt(64/1.6)         # true genetic SD
SDc = sqrt(64/4)           # true CG SD
SDe = 8                    # true residual SD
alpa = SDe*SDe/(SDa*SDa)   # starting ratio, residual to genetic
alpc = SDe*SDe/(SDc*SDc)   # starting ratio, residual to CG
obs  = observations as in Table 5.1
cgid = list of contemporary groups for each observation
anwr = list of animals with records
yrid = list of years for each observation

The starting solutions for all factors in the model start at zero.
The starting variance ratios can be anything. In this case, the true ratios were used to start the Gibbs sampling, namely 1.6 and 4 for residual to genetic and residual to contemporary group variances, respectively. The starting value for the residual standard deviation was 8.

yrsn = c(1:nyr)*0   # Solutions for year effects
cgsn = c(1:ncg)*0   # Solutions for CG
ansn = c(1:nam)*0   # Solutions for animals
SDE  = SDe          # Standard deviation of residuals
alpa = 1.6          # starting variance ratio
alpc = 4            # starting variance ratio

The first factor in the Gibbs sampling is the year effects.

# Adjust observations for other factors
# in the model, besides year effects
yobs = obs - cgsn[cgid] - ansn[anwr]
XY = tapply(yobs,yrid,sum)
XX = tapply(yrid,yrid,length)
yrsn = XY/XX
# 61.54129 62.32179 50.23683
# add Gibbs sampling noise
vnois = rnorm(nyr,0,SDE)/sqrt(XX)
# -4.0796973 -3.1532584 -0.9838543
yrsn = yrsn + vnois
# 57.46159 59.16853 49.25298

The command tapply(yobs,yrid,sum) sums the adjusted observations for each year id.

5.5.2 Contemporary Group Effects

Contemporary groups are a random factor. Note below that alpc is added to the diagonal elements of XX. The observations are adjusted for year and animal additive genetic effects.

yobs = obs - yrsn[yrid] - ansn[anwr]
XY = tapply(yobs,cgid,sum)
XX = tapply(cgid,cgid,length) + alpc
cgsn = XY/XX
#  2.2201539  2.5634252 -2.4878904  3.2943517  0.2948540
# -0.6972253  5.2038220 -0.7221015  0.1462229
vnois = rnorm(ncg,0,SDE)/sqrt(XX)
cgsn = cgsn + vnois
#  0.7957398  3.2481316  0.7435676  5.8141136  0.4675327
#  4.0875102 11.1844145 -4.0507944 -2.0645445

5.5.3 Contemporary Group Variance

Because contemporary groups are a random factor, a variance needs to be sampled next.

ss = sum(cgsn*cgsn)   # = 208.2291
ndf = length(cgsn)+2
Vcg = ss/rchisq(1,ndf)
Vcg                   # = 21.74546

where ss is the sum of squares of the current solutions for contemporary groups, ndf is the degrees of freedom, equal to the number of contemporary groups (9) plus 2, and Vcg is the new sample value of the contemporary group variance.

5.5.4 Animal Genetic Effects

Animal additive genetic effects involve the inverse of the additive relationship matrix and the residual to genetic variance ratio. Equations for all animals are made for this example. Another approach would be used if there were many hundreds of animals. AI is the inverse of the additive relationship matrix, of order 67.

yobs = obs - yrsn[yrid] - cgsn[cgid]
ZY = c(1:nam)*0;  ZZ = ZY
ZY[anwr] = yobs;  ZZ[anwr] = 1
ZZA = diag(ZZ) + AI*alpa
C = ginv(ZZA)
ansn = C %*% ZY
CD = t(chol(C))
ve = rnorm(nam,0,SDE)
vnois = CD %*% ve
ansn = ansn + vnois   # Too many to show

5.5.5 Additive Genetic Variance

Because animal effects are random, the additive genetic variance must be sampled. The sum of squares involves AI.

v1 = AI %*% ansn
aaa = sum(ansn*v1)        # = 1333.982
ndf = length(ansn)+2      # = 69
Van = aaa/rchisq(1,ndf)
Van                       # = 20.4803

5.5.6 Residual Variance

The last random variable of the model is the residual variance. Observations are adjusted for all factors in the model, using the latest current solution vectors.

ehat = obs - yrsn[yrid] - cgsn[cgid] - ansn[anwr]
rss = sum(ehat*ehat)      # = 2905.817
ndf = length(ehat)+2      # = 37
Vare = rss/rchisq(1,ndf)  # = 78.04883
alpa = Vare/Van           # = 3.810923
alpc = Vare/Vcg           # = 3.589201
SDE = sqrt(Vare)          # = 8.834525

5.5.7 Visualization

One should plot the estimates of variances to 'see' the convergence of samples to a single joint posterior distribution. Figure 5.1 contains sample values of the additive genetic variance for 2000 samples.

[Figure 5.1: Sample values of the additive genetic variance across 2000 rounds.]

Heritability may also be calculated after each round of sampling as

ĥ² = σ̂a² / ( σ̂a² + σ̂c² + σ̂e² ).

A similar plot to the one above may be made, but one could also do a histogram of the sample values, as in Figure 5.2.

[Figure 5.2: Histogram of heritability sample values.]

5.5.8 Burn-In Periods

Samples do not immediately represent samples from the joint posterior distribution. Generally, convergence takes anywhere from 100 to 10,000 samples, depending on the model and the amount of data. This period is known as the burn-in period, and samples from it are discarded. The length of the burn-in period (i.e. the number of samples) is usually judged by visually inspecting a plot of sample values across rounds, as in Figure 5.1.

A less subjective approach to determine convergence to the joint posterior distribution is to run two chains at the same time, both using the same random number seed. However, the starting values (of the variances) for each chain are usually greatly different, e.g. one set greatly above the expected outcome and the other set greatly below. When the two chains essentially become one chain, i.e. the squared difference between variance estimates is less than a specified value (like 10⁻⁵), then convergence to the joint posterior distribution is deemed to have occurred. All previous samples are considered part of the burn-in period and discarded.

5.5.9 Post Burn-In Analysis

After burn-in, each round of Gibbs sampling is dependent on the results of the previous round. Depending on the total number of observations and parameters, one round may be positively correlated with the next twenty to three hundred rounds. The user can determine the effective number of samples by calculating lag correlations, i.e. the correlation of estimates between rounds, between every other round, between every third round, etc. Determine the number of rounds between two samples such that the correlation is zero.
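The lag-correlation calculation just described can be sketched in code. A small illustration in Python (function name and the AR(1) test chain are invented for this sketch; Gibbs chains behave similarly, with each round correlated with its neighbours):

```python
import numpy as np

def effective_samples(chain, cutoff=0.05):
    """Find the smallest lag at which the lag correlation of the chain is
    essentially zero, then divide the chain length by that lag to get the
    effective number of samples."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    for lag in range(1, n):
        r = (x[:-lag] @ x[lag:]) / (x @ x)   # lag correlation
        if r < cutoff:
            return n // lag, lag
    return 1, n

# An AR(1) series mimics the round-to-round dependence of Gibbs samples.
rng = np.random.default_rng(0)
z = rng.normal(size=20_000)
chain = np.empty_like(z)
chain[0] = z[0]
for t in range(1, len(z)):
    chain[t] = 0.9 * chain[t - 1] + z[t]

ess, lag = effective_samples(chain)
```

With the 12,000 post-burn-in samples and an interval of 240 rounds used as the example below, this calculation would give 12,000/240 = 50 effective samples.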
Divide the total number of samples (after burn-in) by the interval that gives a zero correlation, and that gives the effective number of samples. Suppose a total of 12,000 samples (after removing the burn-in rounds) and an interval of 240 rounds gives a zero correlation between samples; then the effective number of samples is 12,000 divided by 240, or 50 samples. There is no minimum number of independent samples that is required, just the need to know how many there actually were.

Another way is to randomly pick 100 to 200 samples out of the 12,000 that were made after burn-in, and then determine the averages and standard deviations of those sample values. This can be repeated a number of times. The appropriate R statements would be as follows, where wvh2 contains the 12,000 sample values, and the sample function randomly chooses 200 of those and stores them in vh2.

vh2 = sample(wvh2,200)
mean(vh2)   # 0.3232938
sd(vh2)     # 0.04631331  Standard Error of Estimate

Some research has shown that the mode of the estimates might be a better estimate, which indicates that the distribution of sample estimates is skewed. One could report both the mean and the mode of the samples; however, the mode should be based on the independent samples only.

5.5.10 Long Chain or Many Chains?

Early papers on Markov Chain Monte Carlo (MCMC) methods recommended running many chains of samples and then averaging the final values from each chain. This was to ensure independence of the samples. Another philosophy recommends one single long chain. For animal breeding applications this could mean 100,000 samples or more. If a month is needed to run 50,000 samples, then maybe three chains of 50,000 would be preferable, all running simultaneously on a network of computers. If only an hour is needed for 50,000 samples, then 1,000,000 samples would not be difficult to run.

5.6 Software

There are many people that provide software to estimate variance components and genetic evaluations.
The ability to write your own software is a valuable asset for research, one that could be considered essential for animal breeding research. Not everyone has that ability, or wants to write software. Some suppliers of software (as of 2019) are listed below, in no particular order.

• Karin Meyer, Australia

• Ignacy Misztal, Georgia, United States

• Arthur Gilmour, Australia (ASREML)

• Eildert Groeneveld, Germany (VCE)

• Per Madsen, Denmark (DMU)

• Mehdi Sargolzaei, Canada (FImpute, GBLUP)

Some software packages do only REML-style estimation, while some can also do Bayesian estimation. Some are limited in the types of models that can be used. Because the programs have been written to be general, to handle almost any kind of model and data, they are sometimes inefficient or extravagant in the use of memory, and thus they may force one to take a subset of the data, or to simplify the model. At the least, the programs may take a long time to run. The best approach is to try two or three different software packages and compare the estimates from each. This is not easy because each requires different parameter files and instruction sets. Check your results carefully.

5.7 Covariance Matrices

Variances must always be positive quantities. Similarly, covariance matrices must always be positive definite. Nearly every researcher encounters a covariance matrix during their career that is not positive definite. The problem is then to force the matrix (by some method) to become positive definite. Hayes and Hill (1981) presented a bending procedure in which the eigenvalues of a matrix are regressed towards the mean of the eigenvalues until all eigenvalues are positive, and the matrix is then reconstructed with the new eigenvalues. Jorjani et al. (2003) gave a weighted bending procedure, i.e. a different way of pulling the values closer to the mean eigenvalue. Finally, Meyer and Kirkpatrick (2010) used a penalized maximum likelihood method.
The bending procedure, in general, causes many of the original correlations and actual variances and covariances to differ greatly from the original values. There must be another way.

5.7.1 Example

Consider the matrix, M, of order 5:

M = [ 100  95  80  40  40
       95 100  95  80  40
       80  95 100  95  80
       40  80  95 100  95
       40  40  80  95 100 ] = UDU'.

The eigenvalues are the diagonal elements of D:

D = diag( 399.48, 98.52, 23.65, -3.12, -18.52 ).

Thus, whereas all of the pairwise correlations in M are below 1, the matrix as a whole is not positive definite, and is therefore invalid as a covariance matrix. The procedure that I endorse is to modify the negative eigenvalues such that they become positive, with values between the smallest positive eigenvalue and zero. The smallest positive eigenvalue in this example is 23.65. The following steps should be taken.

Step 1. Add together the negative eigenvalues and multiply by 2:

s = ( -3.12 - 18.52 ) × 2 = -43.28.

Now square this value, multiply by 100, and add 1:

t = ( s × s ) × 100 + 1 = 187,316.84.

Step 2. Let p be the smallest positive eigenvalue (23.65); then take each negative eigenvalue, n, separately and transform it as follows:

n* = p × ( s - n ) × ( s - n ) / t,

or

n*₄ = 23.65 × ( -43.28 + 3.12 )² / 187,316.84 = 0.20363,

and

n*₅ = 23.65 × ( -43.28 + 18.52 )² / 187,316.84 = 0.07740.

Step 3. Replace the negative eigenvalues with the new positive values, and reconstruct M using the new eigenvalues and the old eigenvectors:

D* = diag( 399.48, 98.52, 23.65, 0.20363, 0.07740 ).

Then

M* = UD*U'
   = [ 103.18978  90.82704  79.43676  44.56754  37.06769
        90.82704 106.54177  94.13679  74.06296  44.56754
        79.43676  94.13679 102.46432  94.13679  79.43676
        44.56754  74.06296  94.13679 106.54177  90.82704
        37.06769  44.56754  79.43676  90.82704 103.18978 ].

The key property of this method is that the resulting matrix is positive definite and can be used as a valid covariance matrix in BLUP or selection index calculations.
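Steps 1 to 3 are easy to code. A sketch in Python with numpy (the function name is illustrative; the matrix is the example M above):

```python
import numpy as np

def repair_covariance(M):
    """Replace negative eigenvalues with small positive values between 0
    and the smallest positive eigenvalue, following Steps 1-3, then
    reconstruct the matrix with the old eigenvectors."""
    d, U = np.linalg.eigh(M)
    if d.min() >= 0:
        return M.copy()                # already positive (semi)definite
    p = d[d > 0].min()                 # smallest positive eigenvalue
    s = 2.0 * d[d < 0].sum()           # Step 1
    t = 100.0 * s * s + 1.0
    d_new = d.copy()
    mask = d < 0
    d_new[mask] = p * (s - d[mask]) ** 2 / t    # Step 2
    return U @ np.diag(d_new) @ U.T             # Step 3

M = np.array([[100.,  95.,  80.,  40.,  40.],
              [ 95., 100.,  95.,  80.,  40.],
              [ 80.,  95., 100.,  95.,  80.],
              [ 40.,  80.,  95., 100.,  95.],
              [ 40.,  40.,  80.,  95., 100.]])
M_star = repair_covariance(M)          # all eigenvalues now positive
```

Running this on M reproduces the M* shown above (to rounding), and all eigenvalues of the repaired matrix are strictly positive.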
Chapter 6

Repeated Records Animal Model

6.1 Traditional Approach

Animals are observed more than once for some traits, such as

• Fleece weight of sheep in different years.

• Calf records of a beef cow over time.

• Litter size of sows over time.

• Antler size of deer in different seasons.

• Racing results of horses from several races.

Usually the trait is considered to be perfectly correlated over the ages of the animal. In addition to an animal's additive genetic value for a trait, there is a common permanent environmental (PE) effect, which is a non-genetic effect assumed to be common to all observations on the same animal.

6.1.1 The Model

The model is written as

y = Xb + ( 0  Z ) ( a₀' a_w' )' + Zp + Wu + e,

where

b is a vector of fixed effects,
a₀ and a_w are vectors of animal additive genetic effects for animals without records and animals with records, respectively,
p is a vector of PE effects of length equal to that of a_w,
u is a vector of random contemporary group effects, and
e is a vector of residual effects.

The matrices X, W and Z are design matrices that associate observations with particular levels of the fixed effects, the contemporary group effects, and the additive genetic and PE effects, respectively. In a repeated records model, Z is not equal to an identity matrix. Also,

a | A, σa² ~ N(0, Aσa²),
p | I, σp² ~ N(0, Iσp²),
u | I, σu² ~ N(0, Iσu²),
e ~ N(0, Iσe²).

Repeatability is a measure of the average similarity of multiple records on animals across the population (part genetic and part environmental), and is defined as a ratio of variances,

r = ( σa² + σp² ) / ( σu² + σa² + σp² + σe² ),

which is always greater than or equal to heritability, because

h² = σa² / ( σu² + σa² + σp² + σe² ).

One of the assumptions of a repeated records model is that all records of an animal are included in the analysis.
That means there should be no selection, like analyzing only the best record of each animal, or analyzing only the third record (because not all animals may have three records). By utilizing all records, the mixed model equations can account (to some degree) for selection of animals to make subsequent records. If the selection intensity is very high, like one out of three or greater, then the MME may not be able to account for selection, at least not fully.

6.1.2 Example

Let the parameters be

σu² = 20, σa² = 36, σp² = 16, and σe² = 100.

Thus,

h² = 36 / ( 20 + 36 + 16 + 100 ) = .21,

and

r = ( 36 + 16 ) / ( 20 + 36 + 16 + 100 ) = .30.

The data are given in the following table for twelve animals, each with from 1 to 3 records (made in years 1 to 3).

Table 6.1
Herd Animal Sire Dam  Observations
 1     7     1    2   69 53 65
 1     8     3    4   37 47
 1     9     5    6   39 62
 1    10     1    4   48 72
 1    11     3    6   96
 1    12     1    2   72
 2    19     1   14   55 51 86
 2    20    13   16   48 72
 2    21    15   17   71 96
 2    22    13   18   56 47
 2    23     5   14   51
 2    24    15   16   77

None of the animals are inbred. There were 6 contemporary groups, defined by herd-year combinations, and 3 year effects. The order of the mixed model equations (MME) was 45 by 45. The solutions to the MME were as follows. For years,

b̂ = ( 52.90, 57.61, 74.67 )',

and for contemporary groups,

û = ( -0.3560628, -0.8611276, -0.1803599, 0.3560628, 0.8611276, 0.1803599 )'.

The solutions for EBVs and permanent environmental (PE) effects of the animals, and the diagonals of the inverse of the MME for the EBVs, are given in the following table.
Table 6.2
Animal    EBV     inverse     PE
  1      1.1615   0.2944
  2      1.7503   0.3172
  3      0.4397   0.3191
  4     -3.3648   0.3132
  5     -3.2406   0.3192
  6      0.3567   0.3237
  7      1.3535   0.2362   -0.0910
  8     -3.8595   0.2552   -2.1306
  9     -3.9218   0.2561   -2.2043
 10     -2.0695   0.2601   -0.8603
 11      3.2348   0.2885    2.5214
 12      3.3086   0.2799    1.6468
 13     -1.7874   0.3157
 14     -0.3817   0.3159
 15      3.4269   0.3238
 16      0.9680   0.3187
 17      3.4327   0.3331
 18     -2.7612   0.3325
 19      0.7690   0.2369    0.3369
 20      0.5640   0.2558    0.8656
 21      6.8624   0.2568    3.0512
 22     -5.0355   0.2536   -2.4544
 23     -2.5718   0.2860   -0.6762
 24      2.1917   0.2881   -0.0051

An estimate of the residual variance was

σ̂e² = 2629.408 / ( 22 - 3 ) = 138.3899.

The variance of prediction error for animal 11 would therefore be 0.2885 × 138.3899 = 39.9255, and the standard error of prediction (SEP) is 6.3187. The reliability would be

REL = 1 - 0.2885 × 2.78 = 0.1980,

where 2.78 = σe²/σa² is the variance ratio used in the MME. Reliabilities of animals with records are greater than those of animals without records in this example.

6.1.3 Estimation of Variances

Gibbs sampling is best applied when iterating on solutions to the MME by iteration on the data. Hence, solutions are obtained for one factor at a time. Random variation is added to each new solution to reflect the relative accuracy of estimation of that solution. The accuracy is given by the diagonal element of the MME, denoted d_kk, which is a diagonal of X'X, or W'W, or Z_a'Z_a, or Z_p'Z_p, depending on which solution is being calculated. Below, RNORM(0, σ̂e) denotes a random normal deviate with mean 0 and standard deviation σ̂e, the square root of the latest sample value of the residual variance.

• Year Effects

The new solution is

b̂_i = x_i'( y - Z_a â_w - Z_p p̂ - W û ) / d_ii,

for d_ii = x_i'x_i. Random variation is then added to this solution:

b̂_i = b̂_i + RNORM(0, σ̂e) / √d_ii.

• Contemporary Group Effects

The new solution is

û_j = w_j'( y - Z_a â_w - Z_p p̂ - X b̂ ) / ( d_jj + k_u ),

for d_jj = w_j'w_j, where k_u is the ratio of the latest sample values of the residual and contemporary group variances, i.e.
k_u = σ̂e²/σ̂u². Random variation is then added to this solution:

û_j = û_j + RNORM(0, σ̂e) / √( d_jj + k_u ).

• Additive Breeding Values

The new solution is

â_i = ( z_ai'( y - X b̂ - W û - Z_p p̂ ) - k_a ( A^{i,0} â₀ + A^{i,w} â_w ) ) / ( d_ii + a^{ii} k_a ),

for d_ii = z_ai'z_ai, where k_a is the ratio of the latest sample values of the residual and additive genetic variances, i.e. k_a = σ̂e²/σ̂a², a^{ii} is the diagonal element of A⁻¹ for the i-th animal, and A^{i,0} and A^{i,w} are the elements of row i of A⁻¹ (excluding a^{ii}) corresponding to the other animals without and with records, respectively. Random variation is then added to this solution:

â_i = â_i + RNORM(0, σ̂e) / √( d_ii + a^{ii} k_a ).

• Permanent Environmental Effects

The new solution is

p̂_j = z_pj'( y - Z_a â_w - W û - X b̂ ) / ( d_jj + k_p ),

for d_jj = z_pj'z_pj, where k_p is the ratio of the latest sample values of the residual and PE variances, i.e. k_p = σ̂e²/σ̂p². Random variation is then added to this solution:

p̂_j = p̂_j + RNORM(0, σ̂e) / √( d_jj + k_p ).

There are 4 variances to estimate in this model, and the order in which they are calculated does not matter. The calculations usually follow after obtaining new sample values of the solutions to the MME.

• Contemporary Group Variance

σ̂u² = û'û / CHI(6 + 2) = 1.801702 / CHI(8) = 0.17837,

where CHI(8) is a random Chi-square variate with 8 degrees of freedom.

• Permanent Environmental Variance

σ̂p² = p̂'p̂ / CHI(12 + 2) = 35.87092 / CHI(14) = 2.535819,

where CHI(14) is a random Chi-square variate with 14 degrees of freedom.

• Residual Variance

σ̂e² = ê'ê / CHI(22 + 2) = 1971.808 / CHI(24) = 92.50942,

where CHI(24) is a random Chi-square variate with 24 degrees of freedom.
• Animal Additive Genetic Variance

σ̂a² = â'A⁻¹â / CHI(24 + 2) = 152.7832 / CHI(26) = 8.707745,

where CHI(26) is a random Chi-square variate with 26 degrees of freedom.

• New Variance Ratios for MME

k_u = σ̂e²/σ̂u² = 92.50942 / 0.17837 = 518.64,
k_p = σ̂e²/σ̂p² = 92.50942 / 2.535819 = 36.48,
k_a = σ̂e²/σ̂a² = 92.50942 / 8.707745 = 10.62.

The new variance ratios would be used to re-create the MME and obtain new solutions. The number of iterations (or samples) to generate should be very large. A plot of the sample values should be used to determine when the samples have converged to the joint posterior distribution. With this small numerical example, convergence is not likely to occur; a variance (such as the contemporary group variance) might converge to zero. With small datasets, prior values for the variances should be used, with degrees of freedom equal to the number of records. The prior values are constants, based on best guesses about the possible true values of the variances. Variance component estimation should be based on at least 5000 observations or more, but this depends on the model and the number of levels of the random factors. In some circumstances, 10,000 or more observations may be necessary. At this level of dataset size, the use of prior values becomes less necessary.

6.2 Cumulative Environmental Approach

"Permanent" implies stability and a constant presence. A better proposition is that new environmental effects can appear over time as the animal ages, or as the animal gains experience in the events being recorded, and are therefore cumulative. Suppose an animal makes 3 records over 3 years. The environmental effect E₁ that affects record 1 also has an influence on records 2 and 3. A new environmental effect, E₂, affects record 2 and any later records, but does not affect record 1, because E₂ occurred after record 1 was made.
Similarly, another new environmental effect, E₃, affects record 3, but not records 1 and 2, because E₃ occurred after records 1 and 2 were made. Conceivably, the variance of environmental effects could differ for each record, but the covariance between environmental effects affecting different records would be assumed to be zero. Hence the effect of E₁ on record 1 is independent of the effect of E₂ on record 2, and of the effect of E₃ on record 3, etc. However, the environmental variance for second records would be the sum of the variances of the environmental effects on records 1 and 2. Possibly, E₁ could be negative and E₂ positive, thus cancelling each other, but E₁ and E₂ could both be negative (or positive) and thus would add up in influencing record 2. Perhaps also the variance of E₂ could be smaller than the variance of E₁; in general, the variability of environmental effects could become smaller as animals age. Or the variances could become larger with age. There have been no studies on cumulative environmental effects in any species to know which might be the true state of nature.

6.2.1 A Model

The cumulative environmental repeated records (CE) model can be written as

y = Xb + ( 0  Z_a ) ( a₀' a_r' )' + Z_p p + e,

where

b is a vector of fixed effects,
a₀ and a_r are vectors of additive genetic effects of animals without records and animals with records, respectively,
p is a vector of CE effects of length equal to the number of records, and
e is a vector of residual effects.

The matrices X and Z_a are design matrices that associate observations with particular levels of fixed effects and with additive genetic effects, respectively, as described earlier. In a CE repeated records model, Z_p is not a typical design matrix: rows of Z_p may have more than a single 1. If an animal has two records, then in the row for the second record there will be two 1's.
If an animal has four records, then in the row for the first record there will be one 1, in the row for the second record two 1's, in the row for the third record three 1's, and in the row for the fourth record four 1's. Using the example data in this chapter, the part of Z_p p for animal 7 would appear as

[ ··· 1 0 0 ···        [  ⋮
  ··· 1 1 0 ···    ×     p₇₁
  ··· 1 1 1 ··· ]        p₇₂
                         p₇₃
                          ⋮  ].

Thus, p₇₁ contributes to records 1, 2, and 3 of animal 7, while p₇₂ contributes to records 2 and 3 of animal 7. Note that there are as many environmental effects to be estimated as there are records. Also, for animal 7, the block of Z_p'Z_p is

[ 3 2 1
  2 2 1
  1 1 1 ].

Because of the special form of Z_p, the CE effects can be estimated separately from the temporary environmental effects, and from the additive genetic effects of animals, which rely on the additive genetic relationship matrix. Typical software such as ASREML, VCE, or DMU cannot presently be used for a CE repeated records model; users need to write their own software. Also,

a | A, σa² ~ N(0, Aσa²),
p | P, σp² ~ N(0, Pσp²),
e ~ N(0, Iσe²),

and

G = [ Aσa²    0
       0    Pσp² ].

The matrix P is assumed to be a diagonal matrix with 1's on the diagonal for the first records made by every animal. For second records, one might suspect that the variance of the environmental effects would be less than that for first records, and so the diagonals for second records may be reduced. Similarly, the variances of the effects for third and later records may be reduced further. The point is that an allowance must be made for the variances of the CE effects to differ depending on the record number for that animal. The residual variance is assumed to be the same for all records, but this too could vary. The genetic variance is assumed to be constant, and the genetic correlation between records is still assumed to be unity.
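The cumulative structure of Z_p described above is just a lower-triangular matrix of ones within each animal. A minimal sketch in Python (the function name is illustrative), which reproduces the Z_p'Z_p block shown for animal 7:

```python
import numpy as np

def zp_block(n_records):
    """Cumulative design block for one animal: row r has r leading ones,
    so the CE effect of record j contributes to records j, j+1, ..."""
    return np.tril(np.ones((n_records, n_records)))

Zp = zp_block(3)       # animal 7 has three records
ZpZp = Zp.T @ Zp       # [[3, 2, 1], [2, 2, 1], [1, 1, 1]]
```

The diagonal of this block counts how many records each CE effect touches, matching the display above.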
The variance of three records on one non-inbred animal, for example, would be

Var( y₁, y₂, y₃ )' = [ 1 1 1          [ σ²p1   σ²p1          σ²p1
                       1 1 1  ] σa² +   σ²p1  (σ²p1+σ²p2)   (σ²p1+σ²p2)
                       1 1 1            σ²p1  (σ²p1+σ²p2)   (σ²p1+σ²p2+σ²p3) ] + I σe².

This implies that the variance of repeated records increases over time.

6.2.2 Example

The data from the repeated records example earlier in this chapter were analyzed with the CE repeated records model. Let

σu² = 20, σa² = 36, σe² = 100, σ²p1 = 16, σ²p2 = 14, and σ²p3 = 13.

The mixed model equations are of order 55 (i.e. 3 years, 6 contemporary groups, 24 animal additive genetic effects, and 22 CE effects). The solutions to the mixed model equations are given below.

Table 6.3 Solutions to the CE repeated records animal model analysis.

Year effects: Yr 1 = 52.9257, Yr 2 = 57.5993, Yr 3 = 75.1308.
Contemporary groups: 1 = -0.4818, 2 = -0.6824, 3 = -0.01028, 4 = 0.4818, 5 = 0.6824, 6 = 0.01028.

Animal    EBV     inverse    CE1      CE2      CE3
  1      1.2225   0.2970
  2      2.0331   0.3192
  3      0.4117   0.3200
  4     -3.4004   0.3146
  5     -3.1780   0.3201
  6      0.3103   0.3245
  7      1.8534   0.2438    0.2006  -1.8548  -1.1872
  8     -3.8388   0.2580   -2.0840  -0.4905
  9     -3.8798   0.2591   -2.1742  -0.8678
 10     -2.1448   0.2633   -0.9386  -0.0197
 11      3.1172   0.2895    2.4500
 12      3.4352   0.2812    1.6066
 13     -1.7051   0.3170
 14     -0.4867   0.3179
 15      3.2488   0.3246
 16      0.7993   0.3196
 17      3.2710   0.3339
 18     -2.5266   0.3335
 19      0.6133   0.2451    0.2181   0.0843   1.1439
 20      0.3685   0.2586    0.7302   1.5498
 21      6.5309   0.2568    2.9075   1.4025
 22     -4.6424   0.2572   -2.2458  -2.6100
 23     -2.5643   0.2866   -0.6507
 24      2.0019   0.2892   -0.0046

Most probable producing abilities (MPPA) are predictions of how animals might perform if they made another record, and are based on the genetic and CE estimates (Lush, 1933).
In the CE repeated records model, to predict a fourth record for animal 7, for example, the prediction would be
\[
MPPA = \hat{a}_7 + \hat{p}_{71} + \hat{p}_{72} + \hat{p}_{73} + \hat{p}_{future},
\]
where the prediction of the future CE effect is 0, the mean of the distribution from which a future CE effect would be sampled. The variance of prediction error of MPPA would be increased by the variance of the future CE effects.

Similarly, repeatability would vary with record number. Assuming the new CE effects are not correlated with previous CE effects, repeatability would increase with record number because the CE variances are additive. If the CE effects for second records, for example, depend on the CE effects for first records, then those covariances (positive or negative) may decrease repeatability with record number. Without an analysis of real data and estimates of the CE variances, these comments are only speculative, and the results could vary depending on species and traits.

6.2.3 Estimation of Variances

A Bayesian approach using Gibbs sampling could be used to estimate the PE variances, one for each record number, just as one estimates different residual variances for different years or herds.

• Contemporary Group Variance
\[
\hat{\sigma}_{cg}^2 = \hat{\mathbf{g}}'\hat{\mathbf{g}} / \chi^2_{(6+2)}
= 1.377382 / \chi^2_{(8)} = 0.139131
\]

• Residual Variance
\[
\hat{\sigma}_{e}^2 = 1738.198 / \chi^2_{(22+2)} = 94.50796
\]

• Additive Genetic Variance
\[
\hat{\sigma}_{a}^2 = \hat{\mathbf{a}}'\mathbf{A}^{-1}\hat{\mathbf{a}} / \chi^2_{(24+2)}
= 141.5144 / \chi^2_{(26)} = 6.0337.
\]

• Cumulative Environmental Variances
There are three CE variances to estimate:
\[
\hat{\sigma}_{p1}^2 = \hat{\mathbf{p}}_1'\hat{\mathbf{p}}_1 / \chi^2_{(12+2)}
= 33.07665 / \chi^2_{(14)} = 2.169933,
\]
\[
\hat{\sigma}_{p2}^2 = \hat{\mathbf{p}}_2'\hat{\mathbf{p}}_2 / \chi^2_{(8+2)}
= 15.62206 / \chi^2_{(10)} = 2.529656,
\]
\[
\hat{\sigma}_{p3}^2 = \hat{\mathbf{p}}_3'\hat{\mathbf{p}}_3 / \chi^2_{(2+2)}
= 2.718043 / \chi^2_{(4)} = 0.3556115.
\]
Obviously, there are only two third records on animals, and the estimate here is not very accurate.
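Each variance sample above is a scaled sum of squares divided by a chi-square deviate. A hedged R sketch of that conditional draw (the function name and seed are ours for illustration), using the first-record CE sum of squares from the example:

```r
set.seed(123)  # for a reproducible single draw

# One Gibbs draw of a variance component: sum of squares / chi-square(df)
samp_var <- function(ss, df) ss / rchisq(1, df)

s2p1 <- samp_var(33.07665, 12 + 2)  # CE variance for first records
s2p1
```

Across many Gibbs rounds, such draws are averaged (after burn-in) to give the posterior mean of each variance component.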
Because this is a single Gibbs sample, these new values will be used to re-construct the MME, and another set of samples will be generated. However, the example is too small to obtain valid parameter estimates.

6.3 Alternative Models

Multiple Trait Model
There are other alternative models for repeated records. Let each observation on an animal be considered a different trait that is not perfectly genetically correlated; then a multiple trait animal model could be used, such that permanent environmental effects are confounded with temporary environmental effects, and therefore estimated jointly. If the observations are truly perfectly genetically correlated, then the multiple trait approach may have difficulty in estimating a genetic correlation of 1 (Van Vleck and Gregory, 1992).

Random Regression Model
Another possibility is to consider a random regression model where the observations are functions of the age of the animal and follow a trajectory, such as a lactation curve. Test day models include random regression coefficients for permanent environmental effects such that their variability differs over the course of a lactation. However, if animals do not make very many records, then there could be estimation problems.

Autoregressive Model
The random regression model suggests that permanent environmental effects close together in time are more similar in magnitude than PE effects farther apart in time. Thus, an autoregressive correlation model might be assumed. The model for autocorrelated CE effects can be written as
\[
\mathbf{y} = \mathbf{Xb} + \mathbf{Z}_a \begin{pmatrix} \mathbf{a}_0 \\ \mathbf{a}_r \end{pmatrix} + \mathbf{Z}_p\mathbf{p} + \mathbf{e},
\]
where
b is a vector of fixed effects,
(a0; ar) contains (animals without records; animals with records),
p is a vector of CE effects of length equal to the number of records, and
e is a vector of residual effects.
In this model, Zp is an identity matrix within animals, so that the CE effects for each record are separated, but
\[
\mathbf{P} = \begin{pmatrix}
1 & \rho & \rho^2 & \cdots \\
\rho & 1 & \rho & \cdots \\
\rho^2 & \rho & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix} \sigma_p^2.
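For an assumed ρ, the AR(1) matrix P above is easy to construct; a small R sketch follows (the function name and the value ρ = 0.5 are ours for illustration):

```r
# AR(1) correlation matrix for the CE effects of one animal with n records:
# element (i, j) is rho^|i - j|
ar1 <- function(n, rho) rho^abs(outer(1:n, 1:n, "-"))

P <- ar1(4, 0.5)
round(P, 4)
```

Only ρ and σp² need to be estimated, regardless of the number of records per animal, which is the appeal of this structure.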
Chapter 7 Multiple Trait Models

7.1 Introduction

Livestock are often assessed for productive performance, reproductive performance, health, and economic efficiency. Within each area there may be several traits of importance. For example, in beef cattle, calving ease of the calf, survival to 4 days, birthweight, weaning weight, yearling weight, and feed intake can be grouped under productive performance. Within the reproductive area there could be calving ease of the cow, age at first calving, number of services to conception, conception rate, and others. Health covers general immune response, but also susceptibility to many diseases and parasites. Economic efficiency could include conformation traits and behaviour traits. The point is that there are multiple traits that need to be evaluated for each animal, and usually evaluations for these traits are combined into one or more indexes depending on the breeding goals of the producer.

A multiple trait (MT) model is one in which two or more traits are analyzed simultaneously in order to take advantage of genetic and environmental correlations between traits. Not all animals are measured for all traits, and thus some improvement in accuracy for all traits can be gained by analyzing them together. MT models are useful for traits where the difference between genetic and residual correlations is large (e.g. greater than 0.5) or where one trait has a much higher heritability than the other traits. EBVs for traits with low heritability tend to gain more in accuracy than EBVs for traits with high heritability, although all traits benefit to some degree from the simultaneous analysis. Another use of MT models is for traits that occur at different times in the life of the animal, such that culling of animals results in fewer observations on animals for traits that occur later in life compared to those at the start.
Consequently, animals which have observations later in life tend to have been selected based on their performance for earlier traits. Thus, analysis of later life traits by themselves could suffer from the effects of culling bias, and the resulting EBV could lead to errors in selecting future parents. An MT analysis that includes observations on all animals upon which culling decisions have been based has been shown to account, to some degree, for the selection that has taken place. That is, there needs to be some way to estimate the differences in first records between selected and non-selected individuals, to account for the correlated difference that might have been observed in second records had all animals made a second record. However, if the culling is intense, then an MT model may still give biased results.

Correlations: If the correlations (covariance matrices) that are used in an MT analysis are inaccurate, then EBV resulting from MT analyses could give erroneous rankings of animals for some or all traits. However, with Bayesian methods of covariance component estimation, appropriate parameters can be obtained readily, so that there should be few problems in ranking animals correctly.

7.2 Data

Multiple trait analyses will be demonstrated with a small example involving three traits. All animals were observed for trait 1, but some are missing traits 2 or 3. The data are given in the table below.

Table 7.1
DIM = days in milk group (3), YS = year-season (1 year, 2 seasons), HYS = herd-year-season (4). A zero indicates missing information.

Animal  Sire  Dam  Age  DIM  YS  HYS  Trait 1  Trait 2  Trait 3
10      1     5    1    1    1   1     29.3     52      137
11      2     6    2    1    1   1     30.9      0      151
12      3     7    1    2    1   1     27.4     54        0
13      4     8    3    3    1   2      3.1     55      195
14      1     9    2    1    1   2     26.6     46       82
15      2     5    1    3    1   2      3.7     37      202
16      3     6    3    2    1   2     30.4     53      167
17      3     8    3    1    2   3     35.2      0        0
18      4     7    1    2    2   3     33.2     66      149
19      3    10    2    3    2   3      5.0     58      173
20      2    12    1    1    2   3     33.8     61      175
21      2    13    2    2    2   4     38.8     56      152
22      1    11    3    1    2   4     38.1      0      194
23      3    14    2    3    2   4     17.6     53      237
24      4    19    1    2    2   4     36.8     36        0
25      2    10    3    3    2   4     10.3     40      148

None of the animals was inbred, and animals 1 to 9 do not have any observations and have unknown parents. The same model is assumed for all three traits, namely,
\[
\mathbf{y} = YS + Age + DG + HYS + \mathbf{a} + \mathbf{e},
\]
where YS are year-season effects, Age are age-group effects, DG are DIM-group effects, HYS are herd-year-season effects, a are animal additive genetic effects, and e are residual effects.

Assume the following covariance matrices:
\[
\mathbf{R} = \begin{pmatrix} 11 & 5 & 29 \\ 5 & 97 & 63 \\ 29 & 63 & 944 \end{pmatrix}
\]
for the residual effects,
\[
\mathbf{G} = \begin{pmatrix} 1.0 & 1.6 & -6.0 \\ 1.6 & 14.3 & -12.0 \\ -6.0 & -12.0 & 803 \end{pmatrix}
\]
for the additive genetic effects, and
\[
\mathbf{U} = \begin{pmatrix} 2.0 & 3.1 & -10.0 \\ 3.1 & 29.0 & -30.0 \\ -10.0 & -30.0 & 700 \end{pmatrix}
\]
for random HYS effects. The corresponding correlation matrices are
\[
Cor(\mathbf{R}) = \begin{pmatrix} 1.000 & 0.153 & 0.285 \\ 0.153 & 1.000 & 0.208 \\ 0.285 & 0.208 & 1.000 \end{pmatrix}, \quad
Cor(\mathbf{G}) = \begin{pmatrix} 1.000 & 0.423 & -0.212 \\ 0.423 & 1.000 & -0.112 \\ -0.212 & -0.112 & 1.000 \end{pmatrix},
\]
and
\[
Cor(\mathbf{U}) = \begin{pmatrix} 1.000 & 0.407 & -0.267 \\ 0.407 & 1.000 & -0.211 \\ -0.267 & -0.211 & 1.000 \end{pmatrix}.
\]
The heritabilities of the traits were 0.07, 0.10, and 0.33, respectively.

7.3 Gibbs Sampling

The problem with multiple trait examples is that
• there are too many equations to show,
• the equations involve inverses of R which contain decimal numbers, and
• there are several ways to construct the equations.

Hence R code will be used for this example. The first step is to enter the data.

obs = matrix(data = c(29.3, 52, 137,
                      30.9,  0, 151,
                      27.4, 54,   0,
                       3.1, 55, 195,
                      26.6, 46,  82,
                       3.7, 37, 202,
                      30.4, 53, 167,
                      35.2,  0,   0,
                      33.2, 66, 149,
                       5.0, 58, 173,
                      33.8, 61, 175,
                      38.8, 56, 152,
                      38.1,  0, 194,
                      17.6, 53, 237,
                      36.8, 36,   0,
                      10.3, 40, 148), byrow = TRUE, ncol = 3)

Next, enter the pedigrees and calculate the inverse of A.
anwr=c(10:25) # with records anim=c(1:25) sir=c(rep(0,9),1,2,3,4,1,2,3,3, 4,3,2,2,1,3,4,2) dam=c(rep(0,9),5,6,7,8,9,5,6,8, 7,10,12,13,11,14,19,10) length(sir)-length(dam) # check bi=c(rep(1,9),rep(0.5,16)) AI = AINV(sir,dam,bi) Assign each animal (with records) to appropriate age, days in milk, yearseason, and HYS levels. The misc variable identifies which residual matrix inverse should be used, where 1 indicates all traits observed, 2 indicates trait 2 is missing, 3 indicates trait 3 is missing, and 4 indicates traits 2 and 3 are missing. In this example, all animals were required to have trait 1 present, which may happen in some situations. 116 CHAPTER 7. MULTIPLE TRAIT MODELS dimid=c(1,1,2,3,1,3,2,1,2,3,1,2,1,3,2,3) ndim=3 agid=c(1,2,1,3,2,1,3,3,1,2,1,2,3,2,1,3) nage=3 ysid=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2) nys=2 cgid=c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4) ncg=4 misc=c(1,2,3,1,1,1,1,4,1,1,1,1,2,1,3,1) ntr=3 Different R inverses are needed depending on which traits are observed, as indicated in misc. RI=array(data=c(0),dim=c(3,3,4)) RI[ , ,1]=ginv(R) # all present ka = c(1,3) # trait 2 missing work=ginv(R[ka,ka]) RI[ka,ka,2]=work ka=c(1,2) # trait 3 missing work=ginv(R[ka,ka]) RI[ka,ka,3]=work ka=c(1) # traits 2 and 3 missing work=ginv(R[ka,ka]) RI[ka,ka,4]=work noa=length(anwr) Finally, initialize the solution vectors to zero, and begin Gibbs sampling. nam=length(anim) GI=ginv(G) UI=ginv(U) yssn=matrix(data=c(0),nrow=nys,ncol=ntr) cgsn=matrix(data=c(0),nrow=ncg,ncol=ntr) dmsn=matrix(data=c(0),nrow=ndim,ncol=ntr) agsn=matrix(data=c(0),nrow=nage,ncol=ntr) ansn=matrix(data=c(0),nrow=nam,ncol=ntr) 7.3. GIBBS SAMPLING 7.3.1 117 Year-Season Effects Adjust the observations for all factors in the model, except year-season effects, solve, add some sampling noise, and go to the next factor. 
WY = yssn*0
WW = array(data = c(0), dim = c(3, 3, nys))
for(i in 1:noa){
  yobs = obs[i, ]
  mss = misc[i]
  yobs = yobs - agsn[dimid[i]*0 + agid[i], ]*0 + yobs*0 + obs[i, ] - agsn[agid[i], ] -
         dmsn[dimid[i], ] - ansn[anwr[i], ] - cgsn[cgid[i], ]
  ma = ysid[i]
  WY[ma, ] = WY[ma, ] + (RI[ , , mss] %*% yobs)
  WW[ , , ma] = WW[ , , ma] + RI[ , , mss]
}
for(k in 1:nys){
  work = WW[ , , k]
  C = ginv(work)
  yssn[k, ] = C %*% WY[k, ]
  CD = t(chol(C))
  ve = rnorm(ntr, 0, 1)
  yssn[k, ] = yssn[k, ] + (CD %*% ve)
}

7.3.2 Age Effects

Similar to the year-season statements above.

WY = agsn*0
WW = array(data = c(0), dim = c(3, 3, nage))
for(i in 1:noa){
  yobs = obs[i, ]
  mss = misc[i]
  yobs = yobs - yssn[ysid[i], ] - dmsn[dimid[i], ] -
         ansn[anwr[i], ] - cgsn[cgid[i], ]
  ma = agid[i]
  WY[ma, ] = WY[ma, ] + (RI[ , , mss] %*% yobs)
  WW[ , , ma] = WW[ , , ma] + RI[ , , mss]
}
for(k in 1:nage){
  work = WW[ , , k]
  C = ginv(work)
  agsn[k, ] = C %*% WY[k, ]
  CD = t(chol(C))
  ve = rnorm(ntr, 0, 1)
  agsn[k, ] = agsn[k, ] + (CD %*% ve)
}

7.3.3 Days in Milk Group Effects

Another similar section for days in milk group effects.

WY = dmsn*0
WW = array(data = c(0), dim = c(3, 3, ndim))
for(i in 1:noa){
  yobs = obs[i, ]
  mss = misc[i]
  yobs = yobs - yssn[ysid[i], ] - agsn[agid[i], ] -
         ansn[anwr[i], ] - cgsn[cgid[i], ]
  ma = dimid[i]
  WY[ma, ] = WY[ma, ] + (RI[ , , mss] %*% yobs)
  WW[ , , ma] = WW[ , , ma] + RI[ , , mss]
}
for(k in 1:ndim){
  work = WW[ , , k]
  C = ginv(work)
  dmsn[k, ] = C %*% WY[k, ]
  CD = t(chol(C))
  ve = rnorm(ntr, 0, 1)
  dmsn[k, ] = dmsn[k, ] + (CD %*% ve)
}

7.3.4 HYS Effects

HYS is a random factor in the model, and therefore the R code is a little different, and there is a need to obtain a new sample for the covariance matrix, U.
WY = cgsn*0 WW = array(data=c(0),dim=c(3,3,ncg)) for(i in 1:noa){ yobs = obs[i, ] mss = misc[i] yobs=yobs - yssn[ysid[i], ]-agsn[agid[i], ]ansn[anwr[i], ] - dmsn[dimid[i], ] ma = cgid[i] WY[ma, ]=WY[ma, ]+(RI[ , ,mss] %*% yobs) WW[, ,ma]=WW[ , ,ma]+RI[ , ,mss] } for(k in 1:ncg){ work=WW[ , ,k] + UI #Note addition C = ginv(work) cgsn[k, ]=C %*% WY[k, ] CD=t(chol(C)) ve = rnorm(ntr,0,1) cgsn[k, ]=cgsn[k, ] +(CD %*% ve) } Now the additional part for estimating the covariance matrix. sss = t(cgsn)%*%cgsn ndf = ncg+2 # = 6 x2 = rchisq(1,ndf) Un = sss/x2 # ntr by ntr UI = ginv(Un) 7.3. GIBBS SAMPLING 7.3.5 121 Additive Genetic Effects This part is different from the previous R-scripts in that the relationship matrix must be considered in the calculations. All animals were therefore, estimated simultaneously. no=nam*ntr WW = matrix(data=c(0),nrow=no,ncol=no) WY = matrix(data=c(0),nrow=no,ncol=1) for(i in 1:noa){ yobs = obs[i, ] mss = misc[i] yobs=yobs - yssn[ysid[i], ]-agsn[agid[i], ]cgsn[cgid[i], ] - dmsn[dimid[i], ] ka=(anwr[i]-1)*ntr + 1 kb = ka + 2 kc = c(ka:kb) WY[kc, ]=WY[kc, ]+(RI[ , ,mss] %*% yobs) WW[kc,kc]=WW[kc,kc]+RI[ , ,mss] } HI = AI %x% GI # Direct product work = WW + HI # 75 by 75 C = ginv(work) wsn=C %*% WY CD=t(chol(C)) ve = rnorm(no,0,1) wsn=wsn +(CD %*% ve) # transform to new dimensions ansn = matrix(data=wsn,byrow=TRUE,ncol=3) # 25 by 3 For the genetic covariance matrix, then v1 = AI %*% ansn sss = t(ansn)%*%v1 ndf = nam+2 x2 = rchisq(1,ndf) Gn = sss/x2 GI = ginv(Gn) 122 7.3.6 CHAPTER 7. MULTIPLE TRAIT MODELS Residual Covariances Adjust observations for all factors in the model, estimate the missing residuals, then calculate the sum of squares and cross-products of residuals. 
sss = matrix(data = c(0), nrow = ntr, ncol = ntr)
for(i in 1:noa){
  yobs = obs[i, ]
  mss = misc[i]
  yobs = yobs - yssn[ysid[i], ] - agsn[agid[i], ] -
         ansn[anwr[i], ] - cgsn[cgid[i], ] - dmsn[dimid[i], ]
  rwrk = RI[ , , mss]
  bhat = matrix(data = yobs, ncol = 1)
  ehat = R %*% rwrk %*% bhat   # Must estimate missing residuals
  sss = sss + (ehat %*% t(ehat))
}
ndf = noa + 2
x2 = rchisq(1, ndf)
Rn = sss/x2   # ntr by ntr

The 4 R-inverses need to be re-created using the new Rn. This completes one round of Gibbs sampling. Now go back to the year-season effects and start the next sample. The order in which factors are processed is not critical, but each factor should be processed once within each sampling round. A suggestion would be to put the factor with the smallest number of levels first, which would therefore have the largest number of observations per level. The samples for this factor are likely not going to vary much from sample to sample. The animal additive genetic effects could go last within a sampling round, because the relationship matrix inverse is involved, and because animals usually have only one set of observations, their samples can vary from round to round.

Animals receive evaluations for all traits even if they have not been observed for those traits. This is possible because of the additive genetic relationship matrix and the genetic correlations among the traits. This leads to some political problems. For example, a sheep producer may not take ultrasound measures of backfat and lean yield on his lambs due to the costs. However, his data on other traits are included in the multiple trait genetic evaluation system, and his animals receive an evaluation for the ultrasound traits. Should the producer receive evaluations on ultrasound traits if they did not pay to have the data collected on their animals? The answer in Canada is no, for two reasons.
One, those producers did not pay to collect ultrasound data, and two, the accuracy of EBVs for the ultrasound traits in their flock would be low and should probably not be used for making decisions in their flock.

7.4 Positive Definite Matrices

Covariance matrices must always be positive definite for multiple trait analyses. Failure to guarantee positive definiteness can lead to incorrect solutions to the mixed model equations, erroneous rankings of individuals, and negative diagonals in the inverse of the left hand sides. Suppose you have four new traits and you obtained genetic and contemporary group covariance matrices by averaging reports from 20 different bi-variate studies (two traits at a time). In constructing an order 4 covariance matrix, the result may not be positive definite, as in the matrix below. Let
\[
\mathbf{G} = \begin{pmatrix}
100 & 80 & 20 & 6 \\
80 & 50 & 10 & 2 \\
20 & 10 & 6 & 1 \\
6 & 2 & 1 & 1
\end{pmatrix}.
\]
This matrix appears to be positive definite, because the correlations amongst all pairs of variables are less than unity. However, the eigenvalues of G are
\[
\begin{pmatrix} 162.1627196 & 4.1339019 & 0.9171925 & -10.2138140 \end{pmatrix},
\]
and therefore are not all positive. What can be done? The following R-script can be used to force a symmetric matrix to be positive definite.

makPD = function(A){
  D = eigen(A)
  sr = 0
  nneg = 0
  V = D$values
  U = D$vectors
  N = nrow(A)
  for(k in 1:N){
    if(V[k] < 0){
      nneg = nneg + 1
      sr = sr + V[k] + V[k]
    }
  }
  wr = (sr*sr*100) + 1
  p = V[N - nneg]
  for(m in 1:N){
    if(V[m] < 0){
      c = V[m]
      V[m] = p*(sr - c)*(sr - c)/wr
    }
  }
  A = U %*% diag(V) %*% t(U)
  return(A)
}

Applying the routine to G gives a new matrix,
\[
\mathbf{G} = \begin{pmatrix}
103.615081 & 75.499246 & 18.377481 & 5.013141 \\
75.499246 & 55.603411 & 12.020026 & 3.228634 \\
18.377481 & 12.020026 & 6.728218 & 1.442922 \\
5.013141 & 3.228634 & 1.442922 & 1.269397
\end{pmatrix}.
\]
Hayes and Hill (1981) described a "bending" procedure which essentially shrinks all of the eigenvalues towards the mean of the eigenvalues. This procedure is not recommended.
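The eigenvalue check itself is one line in R; the following sketch reproduces the diagnosis for the G above:

```r
G <- matrix(c(100, 80, 20, 6,
               80, 50, 10, 2,
               20, 10,  6, 1,
                6,  2,  1, 1), nrow = 4, byrow = TRUE)
ev <- eigen(G, symmetric = TRUE)$values
ev           # one eigenvalue is about -10.21
min(ev) > 0  # FALSE: G is not positive definite
```

Running this check on any covariance matrix assembled from the literature, before starting an analysis, avoids the problems listed above.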
On the other hand, one merely needs any matrix that is positive definite to begin the Gibbs sampling process. Hopefully, the samples will converge towards a better estimate of G.

7.5 Starting Matrices

Another quick technique to obtain starting matrices is to first calculate a phenotypic covariance matrix using animals that have all traits observed,
\[
\mathbf{P} = \mathbf{y}'(\mathbf{I} - \mathbf{J}/N)\mathbf{y}/(N-1),
\]
then
\[
\mathbf{G} = 0.2\,\mathbf{P}, \qquad \mathbf{U} = 0.5\,\mathbf{P}, \qquad \mathbf{R} = \mathbf{P} - \mathbf{G} - \mathbf{U}.
\]
P is positive definite by the nature of how it was calculated, unless the length of y is less than the order of P. Then the only matrix that needs to be checked is R. The constants 0.2 and 0.5 are guesses at the fractions of variance due to genetics and due to contemporary groups. This is a quick method of obtaining starting covariance matrices for the Gibbs sampling process, if nothing exists prior to the analysis. The approach works reasonably well because genetic covariances tend to follow phenotypic covariances. Pulling variances and covariances from the literature does require mandatory checking of matrices to guarantee they are positive definite.

Chapter 8 Maternal Traits

8.1 Introduction

Early growth in most livestock species relies on the mothering ability of the female parent. Maternal ability is expressed by the female parent during gestation and after birth, and is measured by the success of the offspring to live and grow (Wilham, 1972). Embryo transfer allows offspring to be reared pre and post birth by a female that is not the biological parent. In some species, offspring may be moved to a female different from the birth mother to be reared after birth. Thus, there are three types of dams that could potentially have an effect on an animal's early growth and survival. These are

1. the genetic dam,
2. the dam that gives birth, or the birth dam, and
3.
the dam that raises the individual after birth, or the raise dam, and the last possibility is bottle-feeding or direct human feeding, which is not genetic. See Figure 8.1. 127 128 CHAPTER 8. MATERNAL TRAITS Sire Embryo Genetic Dam Birth Dam Raise Dam Lamb Figure 8.1 Three Possible Sources of Maternal Effects The genetic dam provides the DNA to produce the new animal through the ova, along with the sperm from the male. The ova include female mitochondrial genes that do not mix with the male DNA, but are passed directly to the offspring. The genetic dam provides the genes for growth, embryo survival, and all traits. The birth dam contributes no genetic material to the animal, but gives birth to the animal. An embryo transfer (ET) has occurred. Usually embryos are put into females that are expected to give lots of milk (food) to the offspring, or which have already given birth to previous young, so that parturition problems are not anticipated. This female may be a different breed than the embryo, but possibly not considered a valuable individual. The purpose of the birth dam is to carry the embryo through pregnancy to birth. The birth dam provides a uterine environment, blood oxygen and nutrients during embryo development, plus the experience of the parturition event itself. Birth dams affect traits like calving ease of the calf, early survival, and birthweights. The raise dam cares for the young animal after birth. This is the training and education environment, plus milk yield and protection from predators. Sometimes depending on the number born or the welfare of the birth dam, young animals will be cross fostered to other lactating females with one or no young themselves. Raise dams affect traits after birth, such as weaning 8.2. DATA 129 weights. In the majority of situations, the three dams are the same individual. The female provides the DNA material, carries the embryo through pregnancy and gives birth, then raises the young up to weaning age. 
In the literature, models for dealing with maternal effects have assumed that the three dams are one individual. However, in beef cattle and sheep production, there is a sufficient rate of embryo transfer and/or cross fostering of young animals to warrant considering models that accommodate three types of dams.

8.2 Data

The following table contains data on 66 animals, of which only 26 have both birthweight (BW) and weaning weight (WW). Animals 1 to 40 have unknown parents.

Table 8.1 Example data for maternal genetic effects.
YM = year-month of lambing, FYM = flock-year-month of lambing,
BW = birthweight, WW = weaning weight (50 days).

Lamb  Ram  Genetic dam  Birth dam  Raise dam  Gender  YM  FYM  BW   WW
41    1    6            6          6          1       1   1    2.5   7.8
42    1    7            17         17         1       1   1    3.4  15.4
43    2    8            8          8          2       1   1    4.1  26.0
44    2    9            18         19         1       1   1    3.3  10.8
45    3    10           10         10         1       1   1    1.8  10.9
46    3    11           11         11         1       2   2    2.5  12.2
47    4    12           20         20         1       2   2    2.1  11.1
48    4    13           13         21         2       2   2    3.0  16.6
49    5    14           14         14         2       2   2    3.7  17.7
50    5    15           15         15         1       2   2    2.2  10.2
51    5    16           16         16         2       2   2    2.9  16.9
52    1    22           22         22         1       1   3    3.5  11.8
53    1    22           22         22         2       1   3    4.2  16.9
54    1    23           24         24         2       1   3    6.2  31.2
55    2    25           25         36         2       1   3    5.7  30.1
56    3    26           26         26         2       2   4    5.4  25.4
57    3    27           37         38         1       2   4    4.5  16.5
58    3    28           28         28         2       2   4    2.7  20.7
59    4    29           29         29         2       2   4    4.0  20.6
60    5    30           30         30         1       2   4    3.6  20.6
61    1    31           31         32         1       2   4    4.3  23.6
62    1    31           31         31         1       1   5    2.9  22.1
63    1    6            39         39         2       1   5    2.7  17.4
64    4    7            7          7          1       1   5    3.5  22.3
65    4    8            8          40         1       1   5    2.6  21.9
66    4    8            8          8          2       1   5    3.5  15.8

Lambs 42, 44, 47, 54, 57, and 63 had different birth dams from their genetic dams, and lambs 44, 48, 55, 57, 61, and 65 had different raise dams from their birth dams. The remaining lambs had the same ewe as their genetic dam, birth dam, and raise dam. All lambs have both BW and WW observations. Lambs 52 and 53, 61 and 62, and 65 and 66 were sets of twin lambs.

8.3 A Model

The same model equation will be used for BW and WW.
Let
\[
\mathbf{y}_t = \mathbf{Xb}_t + \mathbf{Wf}_t + \mathbf{Z}_a\mathbf{a}_t + \mathbf{Z}_m\mathbf{m}_t + \mathbf{Z}_p\mathbf{p}_t + \mathbf{e}_t,
\]
where
bt is a fixed vector containing year-month of lambing effects and gender effects for trait t,
ft is a random vector of flock-year-month of lambing effects for trait t,
at is a random vector of animal additive genetic effects for trait t,
mt is a random vector of maternal genetic effects, of the birth dam for BW and of the raise dam for WW,
pt is a random vector of maternal permanent environmental effects, of the birth dam for BW and of the raise dam for WW, and
et is a random vector of residual effects for trait t.
X, W, Za, Zm, and Zp are the corresponding design matrices. The expected values of the random vectors are null, and the covariance matrices for the two traits are as follows:
\[
\mathbf{G} = \begin{pmatrix} 0.2003 & 0.5381 \\ 0.5381 & 3.2623 \end{pmatrix},
\]
for the additive genetic covariance matrix,
\[
\mathbf{F} = \begin{pmatrix} 0.1822 & 0.1561 \\ 0.1561 & 7.4783 \end{pmatrix},
\]
for the flock-year-month of lambing covariance matrix,
\[
\mathbf{M} = \begin{pmatrix} 0.1391 & 0.5209 \\ 0.5209 & 3.2945 \end{pmatrix},
\]
for the maternal genetic covariance matrix,
\[
\mathbf{P} = \begin{pmatrix} 0.0301 & 0.0019 \\ 0.0019 & 1.5049 \end{pmatrix},
\]
for the maternal permanent environmental covariance matrix, and
\[
\mathbf{R} = \begin{pmatrix} 0.1826 & 0.0336 \\ 0.0336 & 6.9517 \end{pmatrix},
\]
for the residual covariance matrix.

8.3.1 Data Structure

Usually there is a direct genetic by maternal genetic covariance matrix. That is,
\[
Var\begin{pmatrix} \mathbf{a} \\ \mathbf{m} \end{pmatrix}
= \begin{pmatrix}
\mathbf{A} \otimes \mathbf{G} & \mathbf{A} \otimes \mathbf{C} \\
\mathbf{A} \otimes \mathbf{C}' & \mathbf{A} \otimes \mathbf{M}
\end{pmatrix} = \mathbf{H},
\]
where C is non-null. In the above description of the covariance matrices, C was assumed to be null. The problem is the estimation of G, M, and C. Proper estimation of C requires a correct and almost complete data structure. The correct data structure is one in which every female has its own records for BW and WW as a lamb, has records on its offsprings' BW and WW, and has female progeny that also have records on their offsprings' BW and WW. Three generations of female data should exist.
The data structure is complete if all females have three generations of data, except for the latest two generations, which have not had the opportunity to have offspring yet (Heydarpour et al. 2004?).

Generally, studies of maternal genetic effects do not have the correct data structure, and consequently the direct-maternal covariances have been estimated to be negative. The poorer the structure, the more negative are the estimates in C. This can be shown mathematically to occur if the data structure is not correct. The data structure for the example data in this chapter is not correct, and therefore should not be used to estimate C. When the data structure is not suitable, the best course of action is to assume that C = 0. If C has been estimated from data having a correct structure, then it may be used as the true covariances in genetic evaluation, even for data that do not have the correct structure.

Also needed are
\[
\mathbf{R}^{-1} = \begin{pmatrix} 5.48132623 & -0.02649317 \\ -0.02649317 & 0.14397776 \end{pmatrix},
\]
\[
\mathbf{F}^{-1} = \begin{pmatrix} 5.5884151 & -0.1166511 \\ -0.1166511 & 0.1361552 \end{pmatrix},
\]
and
\[
\mathbf{P}^{-1} = \begin{pmatrix} 33.22523926 & -0.04194827 \\ -0.04194827 & 0.66454894 \end{pmatrix}.
\]

8.3.2 Assumptions

• The example is small and simple.
• The example data do not have the proper data structure for estimation of covariance matrices.
• All animals have BW and WW.
• Age of dam effects are not important.
• Litter of the birth dam effects are not important, nor are litter of the raise dam effects.

8.4 MME

The order of the MME for the small example is 334 because there are two traits and 66 animals having both direct genetic and maternal genetic effects. The example will be described through R-scripts.
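As a quick numerical check before assembling the MME, the inverses listed above can be reproduced with solve(); a sketch for R, using the residual covariance matrix from the model description:

```r
R <- matrix(c(0.1826, 0.0336,
              0.0336, 6.9517), nrow = 2, byrow = TRUE)
RI <- solve(R)   # should reproduce the R-inverse listed above
round(RI, 8)
```

The same two lines, applied to F and P, reproduce the other two inverses.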
8.4.1 Preparing the Data

Enter the pedigree data in vectors as follows:

aid = c(1:66)
sid = c(0*c(1:40), 1,1,2,2,3,3,4,4,5,5,5,1,1,1,2,3,3,3,4,5,1,1,1,4,4,4)
gdam = c(0*c(1:40), 6,7,8,9,10,11,12,13,14,15,16,22,22,23,25,26,27,28,29,30,31,31,6,7,8,8)
bii = c((0*c(1:40)+1), (0*c(41:66)+0.5))
AI = AINV(sid, gdam, bii)
dim(AI)   # should be order 66
anwr = c(41:66)   # animals with records

# Enter birth dams and raise dams
bdid = c(6,17,8,18,10,11,20,13,14,15,16,22,22,24,25,26,37,28,29,30,31,31,39,7,8,8)
rdid = c(6,17,8,19,10,11,20,21,14,15,16,22,22,24,36,26,38,28,29,30,32,31,39,7,40,8)

# Enter gender codes
gnid = c(1,1,2,1,1,1,1,2,2,1,2,1,2,2,2,2,1,2,2,1,1,1,2,1,1,2)

# Enter observations, one row per lamb (BW, WW)
obs = matrix(data = c(25,78, 34,154, 41,260, 33,108, 18,109, 25,122,
                      21,111, 30,166, 37,177, 22,102, 29,169, 35,118,
                      42,169, 62,312, 57,301, 54,254, 45,165, 27,207,
                      40,206, 36,206, 43,236, 29,221, 27,174, 35,223,
                      26,219, 35,158), byrow = TRUE, ncol = 2)
obs = obs/10

# Year-month and flock-year-month classes
ymid = c(1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1)
cgid = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5)

8.4.2 Covariance Matrices

This example includes the direct by maternal covariance matrix between the two traits in CC.

G = matrix(data = c(0.2003, 0.5381, 0.5381, 3.2623), byrow = TRUE, nrow = 2)
R = matrix(data = c(0.1826, 0.0336, 0.0336, 6.9517), byrow = TRUE, nrow = 2)
F = matrix(data = c(0.1822, 0.1561, 0.1561, 7.4783), byrow = TRUE, ncol = 2)
PE = matrix(data = c(0.0301, 0.0019, 0.0019, 1.5049), byrow = TRUE, ncol = 2)
MT = matrix(data = c(0.1391, 0.5209, 0.5209, 3.2945), byrow = TRUE, ncol = 2)

Inverses are needed for each of the above matrices. Then initialize the solution vectors.

ymsn = matrix(data = c(0), nrow = nym, ncol = ntr)
cgsn = matrix(data = c(0), nrow = ncg, ncol = ntr)
gdsn = matrix(data = c(0), nrow = 2, ncol = ntr)
bdsn = matrix(data = c(0), nrow = nam, ncol = ntr)
rdsn = matrix(data = c(0), nrow = nam, ncol = ntr)
mpsn = matrix(data = c(0), nrow = nam, ncol = ntr)
ansn = matrix(data = c(0), nrow = nam, ncol = ntr)

The following scripts describe the Gibbs sampling process, where each factor is visited once per round.
Data are adjusted for all other factors, new solutions are calculated, and random noise is added to the solutions before going to the next factor. If the factor is a random factor, then a covariance matrix is sampled too. At the end, a new residual covariance matrix is sampled.

8.4.3 Year-Month Effects

WY = ymsn*0
WW = array(data = c(0), dim = c(ntr, ntr, nym))
for(i in 1:noa){
  yobs = obs[i, ]
  yobs = yobs - cgsn[cgid[i], ] - gdsn[gnid[i], ] -
         ansn[anwr[i], ] - bdsn[bdid[i], ] - rdsn[rdid[i], ] -
         mpsn[bdid[i], ] - mpsn[rdid[i], ]
  ma = ymid[i]
  WY[ma, ] = WY[ma, ] + (RI %*% yobs)
  WW[ , , ma] = WW[ , , ma] + RI
}
for(k in 1:nym){
  work = WW[ , , k]
  C = ginv(work)
  ymsn[k, ] = C %*% WY[k, ]
  # add sampling noise
  CD = t(chol(C))
  ve = rnorm(ntr, 0, 1)
  ymsn[k, ] = ymsn[k, ] + (CD %*% ve)
}

8.4.4 Gender Effects

WY = gdsn*0
WW = array(data = c(0), dim = c(2, 2, 2))
for(i in 1:noa){
  yobs = obs[i, ]
  yobs = yobs - cgsn[cgid[i], ] - ymsn[ymid[i], ] -
         ansn[anwr[i], ] - bdsn[bdid[i], ] - rdsn[rdid[i], ] -
         mpsn[bdid[i], ] - mpsn[rdid[i], ]
  ma = gnid[i]
  WY[ma, ] = WY[ma, ] + (RI %*% yobs)
  WW[ , , ma] = WW[ , , ma] + RI
}
ngend = 2
for(k in 1:ngend){
  work = WW[ , , k]
  C = ginv(work)
  gdsn[k, ] = C %*% WY[k, ]
  # add sampling noise
  CD = t(chol(C))
  ve = rnorm(ntr, 0, 1)
  gdsn[k, ] = gdsn[k, ] + (CD %*% ve)
}

8.4.5 Flock-Year-Month Effects

WY = cgsn*0
WW = array(data = c(0), dim = c(2, 2, ncg))
for(i in 1:noa){
  yobs = obs[i, ]
  yobs = yobs - gdsn[gnid[i], ] - ymsn[ymid[i], ] -
         ansn[anwr[i], ] - bdsn[bdid[i], ] - rdsn[rdid[i], ] -
         mpsn[bdid[i], ] - mpsn[rdid[i], ]
  ma = cgid[i]
  WY[ma, ] = WY[ma, ] + (RI %*% yobs)
  WW[ , , ma] = WW[ , , ma] + RI
}
for(k in 1:ncg){
  work = WW[ , , k] + FI   # Note addition of FI
  C = ginv(work)
  cgsn[k, ] = C %*% WY[k, ]
  # add sampling noise
  CD = t(chol(C))
  ve = rnorm(ntr, 0, 1)
  cgsn[k, ] = cgsn[k, ] + (CD %*% ve)
}

# Estimate covariance matrix
sss = t(cgsn) %*% cgsn
ndf = ncg + 2
x2 = rchisq(1, ndf)
Fn = sss/x2
FI = ginv(Fn)
MME 8.4.6 Animal Additive Genetic Effects no=nam*ntr WW = matrix(data=c(0),nrow=no,ncol=no) WY = matrix(data=c(0),nrow=no,ncol=1) for(i in 1:noa){ yobs = obs[i, ] yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]cgsn[cgid[i], ] - bdsn[bdid[i], ]-rdsn[rdid[i], ]mpsn[bdid[i], ] - mpsn[rdid[i], ] ka=(anwr[i]-1)*ntr + 1 kb = ka + 1 kc = c(ka:kb) WY[kc, ]=WY[kc, ]+(RI %*% yobs) WW[kc,kc]=WW[kc,kc]+RI } HI = AI %x% GI work = WW + HI C = ginv(work) wsn=C %*% WY CD=t(chol(C)) ve = rnorm(no,0,1) wsn=wsn +(CD %*% ve) # transform to new dimensions ansn = matrix(data=wsn,byrow=TRUE,ncol=ntr) # Estimate a new G and GI v1 = AI %*% ansn sss = t(ansn)%*%v1 ndf = nam+2 x2 = rchisq(1,ndf) Gn = sss/x2 GI = ginv(Gn) 139 140 CHAPTER 8. MATERNAL TRAITS 8.4.7 Maternal Genetic Effects Maternal genetic effects are different due to the birth dams versus raise dams, which need to be separated. no=nam*ntr WW = matrix(data=c(0),nrow=no,ncol=no) WY = matrix(data=c(0),nrow=no,ncol=1) for(i in 1:noa){ yobs = obs[i, ] yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]cgsn[cgid[i], ] - ansn[anwr[i], ]mpsn[bdid[i], ] - mpsn[rdid[i], ] # birth dams kc=(bdid[i]-1)*ntr + 1 wrk=RI %*% yobs WY[kc,]=WY[kc,]+wrk[1,] # Raise dams ka =(rdid[i]-1)*ntr + 1 kb = ka+1 WY[kb,]=WY[kb,]+wrk[2,] kd=c(kc,kb) WW[kd,kd]=WW[kd,kd]+RI } 8.4. MME 141 Now get solutions and add sampling noise, then calculate new sample for the maternal genetic covariance matrix. HI = AI %x% MTI work = WW + HI C = ginv(work) wsn=C %*% WY CD=t(chol(C)) ve = rnorm(no,0,1) wsn=wsn +(CD %*% ve) # transform to new dimensions matn = matrix(data=wsn,byrow=TRUE,ncol=ntr) v1 = AI %*% matn sss = t(matn)%*%v1 ndf = nam+2 x2 = rchisq(1,ndf) MTn = sss/x2 MTI = ginv(MTn) bdsn[ ,1]=matn[ ,1] rdsn[ ,2]=matn[ ,2] bdsn[ ,2]=0 rdsn[ ,1]=0 142 CHAPTER 8. MATERNAL TRAITS 8.4.8 Maternal PE Effects Birth dams and raise dams need to be separated again for this factor. There are no additive genetic relationships among PE effects. 
no = nam*ntr
nope = c(1:nam)*0
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:noa){
  yobs = obs[i, ]
  yobs = yobs - gdsn[gnid[i], ] - ymsn[ymid[i], ] - cgsn[cgid[i], ] -
         ansn[anwr[i], ] - bdsn[bdid[i], ] - rdsn[rdid[i], ]
  kc = (bdid[i]-1)*ntr + 1
  wrk = RI %*% yobs
  WY[kc,] = WY[kc,] + wrk[1,]
  ka = (rdid[i]-1)*ntr + 1
  nope[rdid[i]] = 1
  nope[bdid[i]] = 1
  kb = ka+1
  WY[kb,] = WY[kb,] + wrk[2,]
  kd = c(kc,kb)
  WW[kd,kd] = WW[kd,kd] + RI
}
HI = id(nam) %x% PEI
work = WW + HI
C = ginv(work)
wsn = C %*% WY
CD = t(chol(C))
ve = rnorm(no,0,1)
wsn = wsn + (CD %*% ve)
# transform to new dimensions
mpsn = matrix(data=wsn,byrow=TRUE,ncol=ntr)

Now obtain a new sample PE covariance matrix.

sss = t(mpsn)%*%mpsn
ndf = sum(nope)+2
x2 = rchisq(1,ndf)
PEn = sss/x2
PEI = ginv(PEn)

8.4.9 Residual Effects

Adjust the observations for all factors in the model to estimate the residual effects. Then accumulate the sums of squares and crossproducts for a new residual covariance matrix.

no = nam*ntr
sss = id(ntr)*0
for(i in 1:noa){
  yobs = obs[i, ]
  yobs = yobs - gdsn[gnid[i], ] - ymsn[ymid[i], ] - cgsn[cgid[i], ] -
         ansn[anwr[i], ] - bdsn[bdid[i], ] - rdsn[rdid[i], ] -
         mpsn[bdid[i], ] - mpsn[rdid[i], ]
  wrk = matrix(data=yobs,nrow=ntr,ncol=1)
  bhat = R %*% RI %*% wrk
  sss = sss + bhat%*%t(bhat)
}
ndf = noa+2
x2 = rchisq(1,ndf)
Rn = sss/x2
RI = ginv(Rn)
Rn

Repeat the process for more samples until convergence is achieved. Given the small number of observations, the Gibbs samples in this case will likely not converge to samples from the joint posterior distribution. Many more observations are needed.

Part II Random Regression Analyses

Chapter 9 Longitudinal Data

9.1 Introduction

A simple example of longitudinal data is the weight of an animal taken at different ages. Meat animals, like beef cattle, pigs, and sheep, are weighed two or three times from birth to market age, generally at birth, at weaning, and at market age. Weighing animals takes time and labour.
Birth is always day 1, but weaning and market ages are not the same for every animal. Weights get larger over time because animals grow, and the variance of weights also increases with age.

Another example is the lactation yield of dairy cows, sheep, or goats. Dairy animals are milked two or more times daily for up to a year after they give birth. Typically, 24-h production increases shortly after the animal gives birth, peaks at a few weeks after parturition, then slowly decreases until the animal is dried off in preparation for the next parturition. Milk recording programs send supervisors to herds once a month or less frequently to weigh the milk and take samples for lab analyses of content. Thus, an animal might give milk for over 300 days, but there might only be seven to ten supervised weighings during that period. Herds with robotic milking machines can have daily weighings, but not daily milk content measurements.

Traits measured at various times during the life of an animal are known as longitudinal data. Because weights or yields occur at different ages or times, they are not the same genetic trait. Weights at birth and weaning may have a positive correlation, but it is less than unity. Milk weights at day 10 and day 300 may also be correlated, but that correlation is much less than unity. Thus, the weight of an animal at every day of life is a 'different' trait. Every milk weight from the start of lactation to the end is a 'different' trait. There is a continuum of points in time when an animal could be observed for a trait. These traits have also been called infinite-dimensional traits.

Instead of age or time, observations could be based on degree of maturity or weight. For example, the fat content of an animal would change depending on the animal's weight or amount of feed ingested, regardless of age. The possibility exists that a trait may depend on both age and weight.

In general, there is a starting point, tmin, e.g.
birth or parturition, at which observations start to be taken. The observations are made either at specific intervals or at random intervals, and the number of observations can vary from animal to animal. Then there is an end point, tmax, beyond which no more observations are made, or are not of interest. Each observation, y(ti), has an associated time variable, ti. For simplicity, the ti are whole integer numbers. There could be a dozen or so points, or there could be 400 points; the number depends on the trait and situation.

Orthogonal polynomials have been suggested for use with longitudinal data to model the shape of a growth curve or a lactation curve. The reason is that orthogonal polynomials are less correlated with each other than are simple polynomials of age. One simple type of orthogonal polynomials are the Legendre polynomials, discovered in 1797. In order to use Legendre polynomials, or other kinds of orthogonal polynomials, the time values (whole integer numbers) must be scaled to a range from -1 to +1. The scaling formula is

  qi = -1 + 2 (ti - tmin) / (tmax - tmin).

The qi are decimal numbers.

Plotting y(ti) against ti (or against qi) gives a shape that is called the trajectory. This could be a lactation curve, or a growth curve, or an S-curve. The goal is to find a function that fits this trajectory as closely as possible, and to study the amount of animal variation around the trajectory from tmin to tmax. This type of study involves covariance functions and random regression models. Covariance functions help to predict the change in variation from tmin to tmax for the population. Random regression models provide a way to estimate covariance functions, and to determine individual differences in trajectories.

9.2 Collect Data

The first step in the study of longitudinal data is to collect data. In order to illustrate a few basic concepts, consider the following experiment.
Two hundred female mice were sampled every hour after an injection with glucose to observe the change in blood insulin levels over the next nine hours. This gave a total of 1800 observations on 200 unrelated individuals. A small sample of the data is shown in Table 9.1.

Table 9.1 Insulin levels in female mice.
                Time After Injection of Glucose, min
  Mouse   60    120   180   240   300   360   420   480   540
  1       11.9   9.7   8.7   4.5   5.3   1.9   2.3   1.6   1.0
  2       12.9  10.0   7.5   3.3   1.7   2.3   2.3   2.1   0.5
  3       12.2  10.0   6.0   4.2   4.4   2.7   2.2   2.9   0.2
  4       12.6  10.1   9.5   5.9   5.8   3.4   0.9   0.5   0.7
  5       12.7  10.5   8.2   5.4   4.7   2.1   1.9   3.2   0.2

A plot of all 200 mouse insulin decay trajectories is shown in Figure 9.1.

[Figure 9.1: Insulin Decay Over Time -- insulin levels plotted against hours, with the average curve shown in red.]

Note the general shape of the decay trajectory for all mice. Also, note the variability that exists around the average curve. Thus, mice have different decay trajectories. We want to study the variability among mice. Using the data on the 200 mice, the covariance matrix of the insulin amounts at each hour (a 9 x 9 matrix) can be calculated as follows.

V =
   0.8852  0.8352  0.6916  0.0350  0.1089 -0.0050 -0.0417 -0.0476 -0.0156
   0.8352  0.9574  0.4621  0.0399  0.1607 -0.0281 -0.0433 -0.0203  0.0197
   0.6916  0.4621  1.4005  0.1154  0.1556  0.0270 -0.1918 -0.1106 -0.0052
   0.0350  0.0399  0.1154  0.9331  0.0454 -0.0723 -0.0362 -0.0030  0.0117
   0.1089  0.1607  0.1556  0.0454  0.7993  0.1204 -0.0518 -0.0414  0.0558
  -0.0050 -0.0281  0.0270 -0.0723  0.1204  0.6833  0.0015 -0.0086 -0.0211
  -0.0417 -0.0433 -0.1918 -0.0362 -0.0518  0.0015  0.5807  0.0474 -0.0035
  -0.0476 -0.0203 -0.1106 -0.0030 -0.0414 -0.0086  0.0474  0.4409 -0.0291
  -0.0156  0.0197 -0.0052  0.0117  0.0558 -0.0211 -0.0035 -0.0291  0.2855

with V = {σ(ti,tj)}. This matrix is automatically a positive definite matrix by virtue of the way it was calculated, but a good practice is to always check each matrix.
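One quick way to check a matrix, sketched here in Python (a hypothetical helper, not the book's code): attempt a Cholesky factorization, which succeeds only when a symmetric matrix is positive definite.

```python
import math

def is_positive_definite(A):
    """Cholesky-based test: factorization exists only for a symmetric
    positive definite matrix A (given as a list of lists)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s          # squared diagonal of the factor
                if d <= 0.0:
                    return False         # a non-positive pivot: not PD
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return True

assert is_positive_definite([[2.0, 1.0], [1.0, 2.0]])       # eigenvalues 3, 1
assert not is_positive_definite([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues 3, -1
```

An equivalent check, and the one used next in the text, is that all eigenvalues of the matrix are positive.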
The eigenvalues (EV) were all positive, as shown below, and therefore V is positive definite.

EV = ( 2.48452319, 0.96233482, 0.86962324, 0.80221637, 0.59747351,
       0.50820761, 0.42359840, 0.27266640, 0.04525645 ).

9.3 Covariance Functions

Kirkpatrick et al. (1991) proposed the use of covariance functions for longitudinal data. A covariance function (CF) is a way to model the variances and covariances of a longitudinal trait. Orthogonal polynomials are used in this model, and Legendre polynomials are the easiest to apply. Each element of V is modeled as

  σ(ti,tj) = φ(qi)' K φ(qj),

where φ(qi) is a vector of functions of time, qi, and K is a matrix of constants necessary to estimate σ(ti,tj). The matrix K must be estimated. In the example V above, there are m = 9 time periods, and therefore there are m(m+1)/2 = 45 parameters in V. Table 9.2 gives the time variables and the standardized time values. Note that tmin = 60 and tmax = 540.

Table 9.2 Time Variables of Mouse Data.
  Item  ti (minutes)   qi
  1      60           -1.00
  2     120           -0.75
  3     180           -0.50
  4     240           -0.25
  5     300            0.00
  6     360            0.25
  7     420            0.50
  8     480            0.75
  9     540            1.00

Legendre polynomials, Pk(x), are defined as follows, for x being one of the qi. P0(x) = 1 and P1(x) = x; then, in general, the (n+1)st polynomial is described by the recursive equation

  P(n+1)(x) = ( (2n+1) x Pn(x) - n P(n-1)(x) ) / (n+1).

These quantities are "normalized" using

  φn(x) = ( (2n+1)/2 )^0.5 Pn(x).

This gives the following series:

  φ0(x) = (1/2)^0.5 P0(x) = 0.7071,
  φ1(x) = (3/2)^0.5 P1(x) = 1.2247 x,
  P2(x) = (1/2)( 3x P1(x) - 1 P0(x) ) = (3/2)x^2 - (1/2),
  φ2(x) = (5/2)^0.5 P2(x) = -0.7906 + 2.3717 x^2,

and so on. Because V is 9 x 9, then to model all of the σ(ti,tj) we need 9 orthogonal polynomials. Thus, we need Legendre polynomials of order 8, where 8 is the highest power of time. An order of 8 means there are 9 covariables (including time to the power of 0).
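The recursion and normalization above can be checked with a short script. This Python sketch builds the coefficient vectors of the Pn by the recursive equation and normalizes them, reproducing the series φ0 = 0.7071 and φ2 = -0.7906 + 2.3717 x^2:

```python
import math

def legendre_coeffs(order):
    """Coefficient vectors (constant term first) of P_0..P_order via
    P_{n+1}(x) = ((2n+1) x P_n(x) - n P_{n-1}(x)) / (n+1)."""
    P = [[1.0], [0.0, 1.0]]                 # P_0 = 1, P_1 = x
    for n in range(1, order):
        shifted = [0.0] + P[n]              # multiply P_n by x
        nxt = [0.0] * (n + 2)
        for j, c in enumerate(shifted):
            nxt[j] += (2 * n + 1) * c / (n + 1)
        for j, c in enumerate(P[n - 1]):
            nxt[j] -= n * c / (n + 1)
        P.append(nxt)
    return P

def normalized(order):
    """phi_n(x) = sqrt((2n+1)/2) * P_n(x)."""
    return [[math.sqrt((2 * n + 1) / 2.0) * c for c in p]
            for n, p in enumerate(legendre_coeffs(order))]

phi = normalized(2)
assert abs(phi[0][0] - 0.7071) < 1e-3      # phi_0
assert abs(phi[1][1] - 1.2247) < 1e-3      # phi_1 = 1.2247 x
assert abs(phi[2][0] + 0.7906) < 1e-3      # phi_2 constant term
assert abs(phi[2][2] - 2.3717) < 1e-3      # phi_2 coefficient on x^2
```

Running the recursion up to order 8 produces the full set of coefficients tabulated next.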
Table 9.3 Legendre Polynomials of Order 8 (rows are φ0 to φ8; columns are the coefficients of x^0 to x^8).

  n   x^0      x^1      x^2       x^3       x^4        x^5        x^6        x^7       x^8
  0   0.7071   0.0      0.0       0.0       0.0        0.0        0.0        0.0       0.0
  1   0.0      1.2247   0.0       0.0       0.0        0.0        0.0        0.0       0.0
  2  -0.7906   0.0      2.3717    0.0       0.0        0.0        0.0        0.0       0.0
  3   0.0     -2.8062   0.0       4.6771    0.0        0.0        0.0        0.0       0.0
  4   0.7955   0.0     -7.9550    0.0       9.2808     0.0        0.0        0.0       0.0
  5   0.0      4.3973   0.0     -20.5206    0.0       18.4685     0.0        0.0       0.0
  6  -0.7967   0.0     16.7312    0.0     -50.1935     0.0       36.8086     0.0       0.0
  7   0.0     -5.9907   0.0      53.9164    0.0     -118.6162     0.0       73.4291    0.0
  8   0.7972   0.0    -28.6992    0.0     157.8457     0.0     -273.5992     0.0     146.571

A simple R function that will give you the above table follows.

LPOLY = function(no) {
  if(no > 9) no = 9
  nom = no - 1
  phi = matrix(data=c(0),nrow=9,ncol=9)
  phi[1,1] = 1
  phi[2,2] = 1
  for(i in 2:nom){
    ia = i+1
    ib = ia - 1
    ic = ia - 2
    c = 2*(i-1) + 1
    f = i - 1
    c = c/i
    f = f/i
    for(j in 1:ia){
      if(j == 1){ z = 0 } else { z = phi[ib,j-1] }
      phi[ia,j] = c*z - f*phi[ic,j]
    }
  }
  for(m in 1:no){
    f = sqrt((2*(m-1)+1)/2)
    phi[m, ] = phi[m, ]*f
  }
  return(phi[1:no,1:no])
}

Let the matrix in Table 9.3 be denoted as Λ'. Now define another matrix, M, containing the powers of the standardized time values, with one row per time point:

  M = [ 1  qi  qi^2  qi^3  qi^4  qi^5  qi^6  qi^7  qi^8 ]

  1  -1.00  1.0000  -1.000000  1.000000  -1.000000  1.000000  -1.000000  1.000000
  1  -0.75  0.5625  -0.421875  0.316406  -0.237305  0.177979  -0.133484  0.100113
  1  -0.50  0.2500  -0.125000  0.062500  -0.031250  0.015625  -0.007813  0.003906
  1  -0.25  0.0625  -0.015625  0.003906  -0.000977  0.000244  -0.000061  0.000015
  1   0.00  0.0000   0.000000  0.000000   0.000000  0.000000   0.000000  0.000000
  1   0.25  0.0625   0.015625  0.003906   0.000977  0.000244   0.000061  0.000015
  1   0.50  0.2500   0.125000  0.062500   0.031250  0.015625   0.007813  0.003906
  1   0.75  0.5625   0.421875  0.316406   0.237305  0.177979   0.133484  0.100113
  1   1.00  1.0000   1.000000  1.000000   1.000000  1.000000   1.000000  1.000000
This gives Φ = MΛ:

Φ =
  0.7071  -1.2247   1.5811  -1.8708   2.1213  -2.3452   2.5495  -2.7386   2.9155
  0.7071  -0.9186   0.5435   0.1315  -0.7427   0.9765  -0.7158   0.0936   0.5761
  0.7071  -0.6124  -0.1976   0.8185  -0.6132  -0.2107   0.8241  -0.6111  -0.2147
  0.7071  -0.3062  -0.6423   0.6285   0.3346  -0.7967   0.0619   0.7666  -0.4445
  0.7071   0.0000  -0.7906   0.0000   0.7955   0.0000  -0.7967   0.0000   0.7972
  0.7071   0.3062  -0.6423  -0.6285   0.3346   0.7967   0.0619  -0.7666  -0.4445
  0.7071   0.6124  -0.1976  -0.8185  -0.6132   0.2107   0.8241   0.6111  -0.2147
  0.7071   0.9186   0.5435  -0.1315  -0.7427  -0.9765  -0.7158  -0.0936   0.5761
  0.7071   1.2247   1.5811   1.8708   2.1213   2.3452   2.5495   2.7386   2.9155

which can be used to specify the elements of V as

  V = ΦKΦ' = M(ΛKΛ')M' = MHM'.

Note that Φ, M, and Λ are matrices defined by the Legendre polynomial functions and by the standardized time values; they do not depend on the data or on the values in the matrix V. One can estimate either K or H. Estimate K using

  K = Φ^(-1) V Φ^(-T),
which gives

K =
   0.51696 -0.10761  0.51424  0.01598  0.34633 -0.03278 -0.13492  0.02114 -0.49907
  -0.10761  0.35040 -0.00961  0.02374  0.04881  0.02711 -0.02044 -0.12437 -0.05387
   0.51424 -0.00961  1.20888 -0.10372  0.66167 -0.07196 -0.28443  0.13495 -1.02176
   0.01598  0.02374 -0.10372  0.32922 -0.05996 -0.00723  0.05744 -0.24125  0.06434
   0.34633  0.04881  0.66167 -0.05996  0.59757 -0.04682 -0.24611  0.10218 -0.70061
  -0.03278  0.02711 -0.07196 -0.00723 -0.04682  0.12477  0.01695 -0.12536  0.08352
  -0.13492 -0.02044 -0.28443  0.05744 -0.24611  0.01695  0.17716 -0.05564  0.22522
   0.02114 -0.12437  0.13495 -0.24125  0.10218 -0.12536 -0.05564  0.35359 -0.12008
  -0.49907 -0.05387 -1.02176  0.06434 -0.70061  0.08352  0.22522 -0.12008  1.03220

and estimate H using

  H = M^(-1) V M^(-T),

which gives

H =
     0.7993     0.3759    -16.2282     -4.1121     85.2585      8.2512   -149.0388     -4.5415     79.2916
     0.3759    18.8458    -17.7140   -140.9800    116.0053    279.3968   -221.6913   -157.4900    123.1336
   -16.2282   -17.7140    565.2491    145.2098  -3353.8067   -275.2019   6130.4997    148.5089  -3327.8998
    -4.1121  -140.9800    145.2098   1244.7194  -1032.9673  -2605.3471   2044.5953   1505.9029  -1156.0341
    85.2585   116.0053  -3353.8068  -1032.9673  20823.0945   2024.2377 -38807.8169  -1117.0747  21270.5182
     8.2512   279.3968   -275.2019  -2605.3471   2024.2377   5566.7003  -4064.5381  -3249.7079   2313.7405
  -149.0388  -221.6913   6130.4997   2044.5953 -38807.8169  -4064.5381  72970.3381   2262.0269 -40177.7306
    -4.5415  -157.4900    148.5089   1505.9029  -1117.0747  -3249.7079   2262.0269   1906.4858  -1292.3600
    79.2916   123.1336  -3327.8998  -1156.0341  21270.5182   2313.7405 -40177.7306  -1292.3600  22174.7101

Note the difference in magnitude of the elements in K compared to H. Now calculate the correlations among the elements in the two matrices.
Cor(K) =
   1.00 -0.25  0.65  0.04  0.62 -0.13 -0.45  0.05 -0.68
  -0.25  1.00 -0.01  0.07  0.11  0.13 -0.08 -0.35 -0.09
   0.65 -0.01  1.00 -0.16  0.78 -0.19 -0.61  0.21 -0.91
   0.04  0.07 -0.16  1.00 -0.14 -0.04  0.24 -0.71  0.11
   0.62  0.11  0.78 -0.14  1.00 -0.17 -0.76  0.22 -0.89
  -0.13  0.13 -0.19 -0.04 -0.17  1.00  0.11 -0.60  0.23
  -0.45 -0.08 -0.61  0.24 -0.76  0.11  1.00 -0.22  0.53
   0.05 -0.35  0.21 -0.71  0.22 -0.60 -0.22  1.00 -0.20
  -0.68 -0.09 -0.91  0.11 -0.89  0.23  0.53 -0.20  1.00

and

Cor(H) =
   1.00  0.10 -0.76 -0.13  0.66  0.12 -0.62 -0.12  0.60
   0.10  1.00 -0.17 -0.92  0.19  0.86 -0.19 -0.83  0.19
  -0.76 -0.17  1.00  0.17 -0.98 -0.16  0.95  0.14 -0.94
  -0.13 -0.92  0.17  1.00 -0.20 -0.99  0.21  0.98 -0.22
   0.66  0.19 -0.98 -0.20  1.00  0.19 -1.00 -0.18  0.99
   0.12  0.86 -0.16 -0.99  0.19  1.00 -0.20 -1.00  0.21
  -0.62 -0.19  0.95  0.21 -1.00 -0.20  1.00  0.19 -1.00
  -0.12 -0.83  0.14  0.98 -0.18 -1.00  0.19  1.00 -0.20
   0.60  0.19 -0.94 -0.22  0.99  0.21 -1.00 -0.20  1.00

In H, many of the correlations round off to +1 or -1, which means that H is very close to being singular. This is not a good property when using H to construct mixed model equations, and it could lead to poor estimation of the effects in the model. By contrast, the largest correlation (in absolute value) in K is only 0.91. K is not close to singularity and should be safe to invert. The signs of the correlations in K are often opposite to those in H. K is a much preferred matrix for use in mixed model equations.

Once there is an estimate of K, the covariance function model can be used to calculate variances and covariances between other time points (between tmin and tmax). For example, let t150 = 150 minutes and t400 = 400 minutes, neither of which was actually observed or recorded in the 200 mice, but both of which are within the upper and lower bounds of the experimental period. First, calculate the standardized time equivalents (between -1 and +1).
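The standardization uses the same scaling formula as before, qi = -1 + 2(ti - tmin)/(tmax - tmin). A small Python sketch of the computation for the two new time points:

```python
def standardize(t, tmin=60.0, tmax=540.0):
    # q = -1 + 2 (t - tmin) / (tmax - tmin), mapping [tmin, tmax] onto [-1, +1]
    return -1.0 + 2.0 * (t - tmin) / (tmax - tmin)

q150 = standardize(150.0)
q400 = standardize(400.0)
assert abs(q150 + 0.6250) < 1e-9          # -0.6250
assert abs(q400 - 5.0 / 12.0) < 1e-9      # +0.4167
assert standardize(60.0) == -1.0 and standardize(540.0) == 1.0
```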
  q150 = -0.6250,   q400 = +0.4167.

Set up the matrix M for these two points,

M =
  1  -0.625000  0.390625  -0.244141  0.152588  -0.095367  0.059605  -0.037253  0.023283
  1   0.416667  0.173611   0.072338  0.030141   0.012559  0.005233   0.002180  0.000908

The Legendre polynomial covariates are

Φ = MΛ =
  0.70711  -0.76547   0.13588   0.61204  -0.89577   0.50032   0.27393  -0.84232   0.77675
  0.70711   0.51031  -0.37881  -0.83094  -0.30584   0.57972   0.78773   0.07451  -0.72623

so that

  ΦKΦ' =
     2.767230  -0.478992
    -0.478992   0.549733

Thus, the variance at 150 minutes is expected to be 2.767, and at 400 minutes, 0.550. Suppose the variance at 700 minutes was needed. This could not be predicted or calculated, because tmax is only 540 minutes. Do not predict variances for time periods outside the observed range.

9.4 Reduced Orders of Fit

In the previous example, the matrix V was 9 x 9, and the Legendre polynomials were generated for a full fit, with 9 covariates. Thus, the covariance function model resulted in no errors: all of the calculated variances and covariances were exactly the same as those in the original V. Kirkpatrick et al. (1990) suggested looking at the eigenvalues of the matrix K from a full rank fit. Below are the values. The sum of all the eigenvalues was 4.690745; also shown is each eigenvalue as a percentage of that total.

Table 9.4 Eigenvalues of K.
  Eigenvalue     Percentage
  2.980498752    63.5
  0.624171898    13.3
  0.433933695     9.3
  0.208716099     4.4
  0.195913819     4.2
  0.135701396     2.9
  0.102963163     2.2
  0.007215617     0.2
  0.001630618     0.03

The majority of the change in the elements of K is explained by the first three eigenvalues, at 86.1%. The first seven explain 99.8%. If a cut-off is set at 95%, then the first five eigenvalues would be important.

Covariance functions can be based on fewer than 9 covariates. Thus, the orders of fit can be 8, 7, 6, 5, 4, 3, 2, 1, or 0.
Order 0 implies that all of the elements in V are the same, which is obviously not true. Use the subscript r to indicate a reduced order of fit, that is, r < 8. Then

  V = Φr Kr Φr' + E,   for r < 8,

where Φr is a rectangular matrix of rank r composed of the first r columns of Φ, and E is a matrix of residuals, because any lower order of fit will not be perfect. Thus, Φr does not have an inverse, but we can still obtain an estimate of Kr. To determine Kr, first pre-multiply V by Φr' and post-multiply that by Φr:

  Φr' V Φr = Φr' (Φr Kr Φr') Φr = (Φr'Φr) Kr (Φr'Φr).

Now pre- and post-multiply by the inverse of P = (Φr'Φr) to determine Kr:

  Kr = P^(-1) Φr' V Φr P^(-1).

Kr will be square, with r rows and columns. With Kr we can estimate Vr as

  Vr = Φr Kr Φr'.

This matrix is symmetric with 45 unique elements, but it only has rank r. The half-store function in R is a way of changing a symmetric matrix into a vector of its unique elements.

hsmat <- function(vcvfull) {
  mord = nrow(vcvfull)
  np = (mord*(mord + 1))/2
  desg = rep(0,np)
  k = 0
  for(i in 1:mord){
    for(j in i:mord){
      k = k + 1
      desg[k] = vcvfull[i,j]
    }
  }
  return(desg)
}

Let

  ssr = ||hsmat(V - Vr)|| / ||hsmat(V)||

be the goodness of fit statistic, where ||x|| is defined as the sum of squares of the elements of a half-stored symmetric matrix. Thus, ssr measures the amount of difference between two matrices, scaled by the squares of the values in the original matrix. This statistic is like a convergence criterion for solving a set of equations: the smaller ssr is, the less difference there is between the two matrices.

To illustrate, for order 7 there are 8 covariates. The calculated K7 is as follows.
K7 =
   0.28382 -0.13845  0.02516  0.05280  0.00783  0.01502 -0.02732 -0.04759
  -0.13845  0.35040 -0.06584  0.02374  0.01210  0.02711 -0.00793 -0.12437
   0.02516 -0.06584  0.20045 -0.03657 -0.03170  0.01522 -0.06227  0.00961
   0.05280  0.02374 -0.03657  0.32922 -0.01611 -0.00723  0.04249 -0.24125
   0.00783  0.01210 -0.03170 -0.01611  0.12203  0.01010 -0.09328  0.02035
   0.01502  0.02711  0.01522 -0.00723  0.01010  0.12477 -0.00245 -0.12536
  -0.02732 -0.00793 -0.06227  0.04249 -0.09328 -0.00245  0.12822 -0.02775
  -0.04759 -0.12437  0.00961 -0.24125  0.02035 -0.12536 -0.02775  0.35359

which is used to calculate V7, and the goodness of fit statistic is

  ss7 = ||hsmat(V7 - V)|| / ||hsmat(V)|| = 0.0495289.

To determine the probability of finding a smaller value of ssr, one can use simulation, as shown in the following R script.

N = 10000
can = c(1:N)*0
VR = V
nocov = 2    # order of fit + 1
phr = PH[ ,c(1:nocov)]
PVP = t(phr)%*%VR%*%phr
PP = t(phr)%*%phr
PPI = ginv(PP)
Kr = PPI%*%PVP%*%PPI
ndf = 199
for(ko in 1:N){
  Ka = rWishart(1,ndf,Kr)/ndf
  Kb = Ka[ , ,1]
  Vr = phr%*%Kb%*%t(phr)
  DEL = Vr - VR
  er = hsmat(DEL)
  vh = hsmat(VR)
  vv = sum(vh*vh)
  ssr = sum(er*er)/vv
  can[ko] = ssr
} # end of samples
hist(can,breaks=50)

Then compare the observed ssr to the histogram to find the probability of obtaining a smaller statistic. With an order of fit equal to 1, ssr = 0.4592. In R one can use

kb = order(-can)
ncan = can[kb]
ncan[1:10]
kc = which(ncan < 0.4592)
prob = 0
if(length(kc)>0) prob = 1-(kc[1]/length(ncan))
prob

This gives 0.4379, which is a fairly large probability. Similar calculations for the other orders give the results in Table 9.5.

Table 9.5 Test statistics for reduced order of fit models.
  Order  Covariates  ssr        Probability
  1      2           0.4591629  0.4379
  2      3           0.4112741  0.2854
  3      4           0.3132577  0.2699
  4      5           0.2564594  0.1864
  5      6           0.1992198  0.1138
  6      7           0.1384349  0.0389
  7      8           0.0495289  0.0001

Orders 1, 2, 3, 4, and 5 gave Vr that were significantly different from V (i.e.
probabilities of a smaller statistic greater than 0.05), while orders 6 and 7 gave Vr that were not different from V (probabilities less than 0.05). Order 6 would be a sufficient minimal fit for the mouse insulin decay data. The mouse data example was entirely fictitious.

9.5 Data Requirements

Suppose a fixed regression model for the growth of sheep is to be studied. Each lamb has, at most, three weight measurements, and from these, coefficients for 5 covariates are to be estimated. With three data points, we can only estimate coefficients for 3 covariates per lamb, because there would be no degrees of freedom remaining. However, because we have many lambs weighed at various ages, it is possible to estimate coefficients for 5 covariates across lambs, but not for individual lambs.

The same logic must have some effect on a random regression model. Because animal genetic effects are random, it is computationally possible to estimate 5 coefficients per lamb; however, the quality of those estimates might be questionable. In early test day models, researchers required cows to have 7 or more test day records before an attempt was made to estimate covariance matrices with orders of fit equal to 5 or 6 (Jamrozik et al. 1998).

A general recommendation is: if the number of weights per animal is three or less, then a multiple trait animal model, with each weight as a separate trait, is the preferred analysis. With more than three weights per animal (on average), a random regression model could be employed. The appropriate order of fit should not be greater than the average number of weights per animal.

Chapter 10 The Models

Random regression models (RRM) are used to estimate the matrices that are part of the covariance functions, which in turn give a larger array of variances and covariances. RRM are also used to model the shape or trajectory of observations taken over time.
Phenotypically, the trajectory has to be fitted, and at the same time the variation along the trajectory needs to be considered.

10.1 Fitting The Trajectory

Most trajectories are smooth, continuous, and can be fit with very few covariates. Sometimes, however, the trajectory is unknown and may be undefined, with ups and downs over time. There could be different trajectories for males versus females, or for different breeds. Over the years, the trajectories could shift due to selection of animals for faster growth or higher milk yields. All of these factors must be considered.

The first course of action is to plot the data against the time scale of interest. Observations can be partitioned by gender, by age at start, or by years. One should look at all aspects of the data before committing to one model for analysis.

Below are some fictitious data on animals over a period of 1 to 100 days. The trait measured is the amount of resistance to a bacteria from the first day of spring to fall. Figure 10.1 represents data on 6 animals with from 4 to 7 observations per animal, a total of 34 observations.

[Figure 10.1: Daily Resistance Test -- resistance level plotted against days on test.]

The data show an initial resistance that declines up to day 35, then improves again until fall. Fitting the trajectory of this curve could be problematic. Ordinary linear regressions were fitted on days on test (divided by 100), from linear to sextic equations. Legendre polynomials could be used from order 1 to order 6, but at this stage of model building, simple regressions suffice.

[Figure 10.2: Daily Resistance Test -- linear fit.]

A linear regression does not fit the data very well.
The predicted y is correlated with the original y at 0.66. Including a squared term for days on test gave a correlation of 0.85.

[Figure 10.3: Daily Resistance Test -- quadratic fit.]
[Figure 10.4: Daily Resistance Test -- cubic fit.]
[Figure 10.5: Daily Resistance Test -- quartic fit.]
[Figure 10.6: Daily Resistance Test -- quintic fit.]
[Figure 10.7: Daily Resistance Test -- sextic fit.]

A summary of the fit of regressions on different powers of days (divided by 100) is given in Table 10.1.

Table 10.1 Correlations of predicted y with y.
  Fit        Correlation
  Linear     0.66
  Quadratic  0.85
  Cubic      0.9356
  Quartic    0.9423
  Quintic    0.9789
  Sextic     0.9740

The fit of the data improves with an increase in the power of the time variable, but none of the regressions adequately fit the low observations from day 30 to 40. Going from quintic to sextic, the correlation actually decreased, so the fit is starting to become worse. In the quintic equation, there were waves in the first five days and at the very end, from day 90 to 100. The lack of fit for the lower values in days 30 to 40 persists. What happens when there are 2000 observations rather than just 34? Does the fit become worse or better? These functions are not entirely adequate for fitting the trajectories.
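The fit criterion used in Table 10.1 is the correlation between the observed y and the predicted y. A Python sketch of that computation, with made-up vectors rather than the resistance data:

```python
import math

def correlation(y, yhat):
    """Pearson correlation between observed and predicted values."""
    n = len(y)
    my = sum(y) / n
    mh = sum(yhat) / n
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    sh = math.sqrt(sum((v - mh) ** 2 for v in yhat))
    return sum((a - my) * (b - mh) for a, b in zip(y, yhat)) / (sy * sh)

# perfect agreement gives 1, perfect disagreement gives -1
assert abs(correlation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) - 1.0) < 1e-12
assert abs(correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]) + 1.0) < 1e-12
```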
10.1.1 Spline Functions

Another approach would be to divide the trajectory into parts, such that each part is best fit by a linear or quadratic function. For the example, the parts might be days 1 to 20, days 21 to 50, and days 51 to 100. These are known as spline functions. Besides estimating the regression coefficients, one must also estimate the best cut-off points, and these might differ from one situation to another. A model might be

  y(kj) = μ + b1 X(kj) + b2 X(kj)^2 + b3 U(kj) + b4 U(kj)^2 + b5 W(kj) + b6 W(kj)^2 + e(kj),

where

  y(kj) is the jth observation on the kth animal,
  μ is an overall mean,
  X(kj) is the days on test (divided by 100) corresponding to the observation,
  U(kj) is zero, unless days on test are greater than 20, in which case it equals (X(kj) - 0.2),
  W(kj) is zero, unless days on test are greater than 50, in which case it equals (X(kj) - 0.5),
  b(l), for l = 1 to 6, are regression coefficients, and
  e(kj) are residual effects.

The spline function model, which has seven parameters to estimate, gave a correlation between actual and predicted observations of 0.9923. The agreement of the predicted and observed y is shown in Figure 10.8.

[Figure 10.8: Daily Resistance Test -- fitting splines.]

10.1.2 Classification Approach

One hundred days can be grouped into 20 periods of 5 days each, and a linear model with time period groups can be used to model the trajectory. Then there is no assumption about the function that fits the data; the data tell us what the function looks like. The model is

  y(ij) = μ + T(i) + e(ij),

where y(ij) is the jth observation within the ith time group (twenty groups), μ is an overall mean, T(i) is the effect of the ith time group, and e(ij) is a residual effect. This model fit the data very well, but at the expense of needing to estimate 20 parameters rather than 6 (quintic function) or 7 (splines).
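The covariates of the spline model in Section 10.1.1 can be generated per observation as follows; this Python sketch assumes the knots at days 20 and 50 described in the text:

```python
def spline_covariates(days):
    """X, X^2, U, U^2, W, W^2 for the spline model: U and W are zero until
    days on test exceed 20 and 50, respectively."""
    x = days / 100.0
    u = max(0.0, x - 0.2)     # switches on past the first knot
    w = max(0.0, x - 0.5)     # switches on past the second knot
    return [x, x * x, u, u * u, w, w * w]

row = spline_covariates(35)
assert row[2] > 0.0 and row[4] == 0.0    # day 35: past the first knot only
assert abs(row[2] - 0.15) < 1e-9
row = spline_covariates(10)
assert row[2] == 0.0 and row[4] == 0.0   # day 10: before both knots
```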
The time groups do not need to be equal in size. Time periods of 10 or 20 days might be appropriate if the observations have about the same magnitude over all 10 or 20 days. That would not be true in the example data, because the values are decreasing sharply at the beginning, then increasing quickly. The last 10 or 15 days of the 100-day period might be combined into one group. However, there is little harm in keeping 20 periods of 5 days each, and no major computing problem in doing so.

The fit of the classification model gave a correlation of 0.9975 between the predicted and observed y (Figure 10.9).

[Figure 10.9: Daily Resistance Test -- fitting time groups; predicted (red) and observed (blue) points.]

The time group means give a non-smooth trajectory, but it fits the data very well. Also, one does not need to define the type or shape of the curve. A drawback is that days are being combined, so if resistance is changing (increasing or decreasing) a lot from the first day to the fifth within a group, then the group mean, or T(i) effect, will not account for that, and there will be errors for the predicted observations farthest away from the middle day in the group. Forming time groups must be handled cautiously. In this example, there were not enough observations to make smaller time groups (e.g. 3 days each or 2 days each).

The classification approach is good when you do not know what kind of function describes the trajectory. Usually, researchers are dealing with many thousands of observations, so that having 20 time groups or a 6-parameter regression does not make very much difference in terms of estimation difficulty. One cannot go wrong with the classification approach over the regression approach for modelling the trajectories, unless there are not sufficient numbers of observations.
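The grouping used by the classification model amounts to a one-line mapping from day on test to time group; a Python sketch with the 5-day periods assumed from the text:

```python
def time_group(day, period=5):
    """Map a day on test (1..100) to its time group:
    days 1-5 -> group 1, days 6-10 -> group 2, ..., days 96-100 -> group 20."""
    return (day - 1) // period + 1

assert time_group(1) == 1
assert time_group(5) == 1
assert time_group(6) == 2
assert time_group(100) == 20
assert max(time_group(d) for d in range(1, 101)) == 20
```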
10.2 Random Variation in Curves

Begin by writing a model for the resistance level on one specific day, day t:

  y(tijklm) = g(ti) + h(tj) + c(tk) + a(tl) + e(tijklm),

where

  g(ti) is a fixed gender effect,
  h(tj) is a fixed year effect,
  c(tk) is a random contemporary group effect,
  a(tl) is a random animal additive genetic effect, and
  e(tijklm) is a residual effect.

Now envisage a multiple trait model with 100 traits (i.e. resistance on each day from 1 to 100). That is too many traits to analyze at one time, and there would be lots of missing data for many of the animals. A random regression model could instead be used to account for the relationships among observations on different days.

All factors of the model have a 'trajectory' of effects on the trait. At least one fixed factor has to attempt to fit the phenotypic shape of the trajectory. For the other factors, the model fits the variation around the trajectories. The gender effects in the model above, g(ti), become g(ip), where p refers to the time period group. Hence there are different gender trajectories, estimated by 20 period groups of five days each. This should fit the phenotypic trajectories almost perfectly.

The year effects can be modelled by a curve function, because the phenotypic trajectory has been modelled by the gender effects. Thus, h(tj) becomes

  Σ(x=0 to 4) h(jx) z(tx),

where the h(jx) are regression coefficients on Legendre polynomials (z) of time (t) of order 4. Order 4 (5 parameters) should be suitable to fit the variation around the trajectories. The best order could be found by testing models with different assumed orders.

Likewise, contemporary groups and animal additive genetic effects can be modelled by the same kind of regressions. That is, c(tk) becomes

  Σ(x=0 to 4) c(kx) z(tx),

where the c(kx) are regression coefficients on Legendre polynomials (z) of time (t) of order 4, and a(tl) becomes

  Σ(x=0 to 4) a(lx) z(tx).
In addition, the model needs to include animal permanent environmental effects, because there are repeated observations on the same animal:

$$pe_{tm} = \sum_{x=0}^{4} pe_{mx} z_{tx}.$$

The RRM is about visualizing curves and covariance functions over time. The observations are changing with time, and all effects of the model also change with time. The curves that fit the trajectory may require more covariates than those that fit the covariance functions. They do not need to be of the same order. Computationally, there could be advantages to using the same order of fit for the fixed and random factors. The important point is to have a good fit of the trajectories amongst the fixed effects, and an appropriate order for the Legendre polynomials with the random factors.

10.3 Residuals

Residuals are the difference between predicted observations (using the estimates of parameters from the RRM) and the actual observations,

$$\hat{\mathbf{e}} = \hat{\mathbf{y}} - \mathbf{y}.$$

Now partition the residuals into the 20 time periods as in the classification model,

$$\hat{\mathbf{e}} = \begin{pmatrix} \hat{\mathbf{e}}_1 \\ \hat{\mathbf{e}}_2 \\ \hat{\mathbf{e}}_3 \\ \vdots \\ \hat{\mathbf{e}}_{20} \end{pmatrix}.$$

Then estimate the residual variance for the $i$th time group as

$$\sigma^2_{e_i} = \hat{\mathbf{e}}_i'\hat{\mathbf{e}}_i / n_i,$$

where $n_i$ is the number of observations in the $i$th time group.

The overall residual covariance matrix, $\mathbf{R}$, is assumed to be diagonal with 20 different residual variances according to time group. The mixed model equations for the RRM therefore involve $\mathbf{R}^{-1}$, which means that observations are inversely weighted by the magnitude of their residual variance. Larger residual variances lead to lesser weight in the equations.

The residual variances can be plotted against time group number. Either a pattern of the residual variances can be observed and a function used to determine the residual variance for each observation, or some collapsing of the groups into larger time groups may be possible.
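The per-group residual variance calculation can be sketched as follows (Python purely for illustration; the chapter works in R, and the residuals below are simulated stand-ins): compute each group's variance from its residuals, then weight each observation by the reciprocal of its group's variance.

```python
import numpy as np

rng = np.random.default_rng(2)
group = np.repeat(np.arange(20), 10)         # 10 observations in each of 20 time groups
ehat = rng.normal(0.0, 1.0 + group / 10.0)   # simulated residuals, variance grows over time

# sigma^2_{e_i} = e_i' e_i / n_i within each time group i
var_by_group = np.array([ehat[group == g] @ ehat[group == g] / np.sum(group == g)
                         for g in range(20)])

# R is diagonal, so each observation's weight is 1 / (its group's variance)
rinv = 1.0 / var_by_group[group]
print(var_by_group.round(2))
```

Observations in the later (noisier) groups receive smaller weights, which is exactly the inverse weighting that R⁻¹ produces in the mixed model equations.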
The possibility exists for each of the 20 time group variances to be different, and thus one may always have to use 20 different variances.

10.4 Complete Model

10.4.1 Fixed Factors

The fixed factors are for modelling the shape of the phenotypic trajectories. Let

$$\mathbf{f}'(t) = \begin{pmatrix} t^0 & t^1 & t^2 & \cdots & t^{m-1} \end{pmatrix}$$

be a vector of covariates of time of length $m$; these may or may not be Legendre polynomials, depending on the function that best fits the trajectories. Alternatively,

$$\mathbf{f}'(t) = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \end{pmatrix}$$

is a vector (also of length $m$) indicating the time group to which an observation belongs, as in the classification approach.

The design matrix for gender effects, for example, would be written as

$$\mathbf{X}\mathbf{g} = \begin{pmatrix} \mathbf{f}'(t_1) & \mathbf{0}' \\ \mathbf{f}'(t_2) & \mathbf{0}' \\ \mathbf{0}' & \mathbf{f}'(t_3) \\ \vdots & \vdots \\ \mathbf{0}' & \mathbf{f}'(t_N) \end{pmatrix} \begin{pmatrix} \mathbf{g}_1 \\ \mathbf{g}_2 \end{pmatrix},$$

where $N$ is the total number of animals observed. The first two animals belonged to gender 1; the third and the $N$th animals belonged to gender 2. Also,

$$\mathbf{g}_i = \begin{pmatrix} g_{i1} \\ g_{i2} \\ \vdots \\ g_{im} \end{pmatrix}$$

is a vector of length $m$ for each gender containing the fixed regression coefficients that give the trajectory of the responses for gender $i$. Instead of one number for each gender effect, there will be a vector of $m$ numbers.

Similarly, the fixed effects of years can be represented as

$$\mathbf{W}\mathbf{h} = \begin{pmatrix} \mathbf{f}'(t_1) & \mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{0}' \\ \mathbf{f}'(t_2) & \mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{0}' \\ \mathbf{0}' & \mathbf{f}'(t_3) & \mathbf{0}' & \cdots & \mathbf{0}' \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}' & \mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{f}'(t_N) \end{pmatrix} \begin{pmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \\ \mathbf{h}_3 \\ \vdots \\ \mathbf{h}_{ny} \end{pmatrix},$$

where $ny$ is the number of years in the data. If $m = 7$, for example, and $N = 1000$, then $\mathbf{X}$ has 1000 rows and $2 \times m = 14$ columns. If $ny = 10$, then $\mathbf{W}$ has 1000 rows and $10 \times m = 70$ columns.

10.4.2 Random Factors

The random factors are for modelling the covariance functions, and make use of Legendre polynomials of order $r$. The analysis also gives different curves for every level of each random factor: a curve for each contemporary group, for each animal's genetic effect, and for each animal's permanent environmental effect. Let

$$\mathbf{z}'(q) = \begin{pmatrix} \phi_0(q) & \phi_1(q) & \phi_2(q) & \cdots & \phi_r(q) \end{pmatrix}.$$
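The block structure of the fixed-effect design matrices above can be sketched in a few lines (Python purely for illustration; the value m = 3, the time points, and the gender assignments are all invented): each record's row carries f'(t) in its own level's block of columns and zeros elsewhere.

```python
import numpy as np

m = 3                                               # covariates per trajectory (illustrative)
f = lambda t: np.array([t ** j for j in range(m)])  # f'(t) = (t^0, t^1, t^2)

times = [1.0, 2.0, 3.0, 4.0]                        # one time point per record
gender = [1, 1, 2, 2]                               # fixed-factor level of each record

# Each row holds f'(t) in its gender's block of m columns, zeros elsewhere.
Xg = np.zeros((len(times), 2 * m))
for row, (t, g) in enumerate(zip(times, gender)):
    Xg[row, (g - 1) * m:g * m] = f(t)
print(Xg)
```

With two levels and m covariates the matrix has 2m columns, matching the 2 × m = 14 columns quoted for m = 7 in the text.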
The design matrices are constructed in the same manner as those for the fixed factors, only using $\mathbf{z}'(q)$ rather than $\mathbf{f}'(t)$. Assume that $r = 4$ for this discussion. The design matrices are $\mathbf{Z}_c$ for contemporary groups, $\mathbf{Z}_a$ for animal additive genetic effects, and $\mathbf{Z}_p$ for animal permanent environmental effects.

Contemporary groups are groups of animals usually born within a few days or weeks of each other, which are reared together through much of their early life, and which are kept together during the time that the observations were collected. If $N = 1000$ observations, and the number of contemporary groups is $nc = 50$, then $\mathbf{Z}_c$ has 1000 rows and $5 \times nc = 250$ columns. The covariance function matrix for contemporary groups is $\mathbf{K}_c$, of dimension $5 \times 5$. Normally, the covariance matrix for contemporary group effects is $\mathbf{I}\sigma^2_c$, but in a RRM it is a block diagonal matrix,

$$Var(\mathbf{c}) = \mathbf{I} \otimes \mathbf{K}_c,$$

of dimension $250 \times 250$, where $\otimes$ is the direct product operation.

The additive relationship matrix, $\mathbf{A}$, is used in all animal models, and includes ancestors that may not have been observed or measured in the data itself. The design matrix for animal additive genetic effects must, therefore, have additional columns of zeros to accommodate the ancestors. Let $nw$ be the number of animals with observations in the data, and let $na$ be the total number of animals including ancestors. Hence, $na \geq nw$. The matrix $\mathbf{Z}_a$ has 1000 rows and $na \times 5$ columns. If $na = 200$, then $\mathbf{Z}_a$ has 1000 columns. If $\mathbf{K}_a$ is the covariance function matrix for genetic variances, then

$$Var(\mathbf{a}) = \mathbf{A} \otimes \mathbf{K}_a,$$

of dimension $1000 \times 1000$.

The animal permanent environmental (PE) covariance function matrix is $\mathbf{K}_p$. The design matrix $\mathbf{Z}_p$ has 1000 rows and $nw \times 5$ columns. Let $nw = 140$; then that is 700 columns, and

$$Var(\mathbf{p}) = \mathbf{I} \otimes \mathbf{K}_p,$$

of dimension $700 \times 700$. The assumption is that there are no covariances between levels of different random factors, e.g. between contemporary groups and animal additive genetic effects.
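The dimensions quoted above can be verified with Kronecker (direct) products; identity matrices stand in for A and the K matrices here, purely to check shapes.

```python
import numpy as np

k = 5                        # order-4 Legendre fit: 5 coefficients per level
nc, na, nw = 50, 200, 140    # contemporary groups, total animals, animals with records

Vc = np.kron(np.eye(nc), np.eye(k))   # Var(c) = I (x) Kc
Va = np.kron(np.eye(na), np.eye(k))   # Var(a) = A (x) Ka  (A = I as a stand-in)
Vp = np.kron(np.eye(nw), np.eye(k))   # Var(p) = I (x) Kp

print(Vc.shape, Va.shape, Vp.shape)   # (250, 250) (1000, 1000) (700, 700)
```

The direct product of an n × n matrix with a k × k matrix is nk × nk, which is why 50, 200, and 140 levels with 5 coefficients each give 250, 1000, and 700.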
The residual matrix, $\mathbf{R}$, was shown to be diagonal, but with different residual variances depending on the day on test.

10.4.3 Mixed Model Equations

Once a model is specified, then Henderson's best linear unbiased prediction methodology is employed. This requires constructing and solving the Mixed Model Equations (MME). These equations are

$$\begin{pmatrix}
\mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{W} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}_c & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}_a & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}_p \\
\mathbf{W}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{W}'\mathbf{R}^{-1}\mathbf{W} & \mathbf{W}'\mathbf{R}^{-1}\mathbf{Z}_c & \mathbf{W}'\mathbf{R}^{-1}\mathbf{Z}_a & \mathbf{W}'\mathbf{R}^{-1}\mathbf{Z}_p \\
\mathbf{Z}_c'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}_c'\mathbf{R}^{-1}\mathbf{W} & \mathbf{Z}_c'\mathbf{R}^{-1}\mathbf{Z}_c + \mathbf{I} \otimes \mathbf{K}_c^{-1} & \mathbf{Z}_c'\mathbf{R}^{-1}\mathbf{Z}_a & \mathbf{Z}_c'\mathbf{R}^{-1}\mathbf{Z}_p \\
\mathbf{Z}_a'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}_a'\mathbf{R}^{-1}\mathbf{W} & \mathbf{Z}_a'\mathbf{R}^{-1}\mathbf{Z}_c & \mathbf{Z}_a'\mathbf{R}^{-1}\mathbf{Z}_a + \mathbf{A}^{-1} \otimes \mathbf{K}_a^{-1} & \mathbf{Z}_a'\mathbf{R}^{-1}\mathbf{Z}_p \\
\mathbf{Z}_p'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}_p'\mathbf{R}^{-1}\mathbf{W} & \mathbf{Z}_p'\mathbf{R}^{-1}\mathbf{Z}_c & \mathbf{Z}_p'\mathbf{R}^{-1}\mathbf{Z}_a & \mathbf{Z}_p'\mathbf{R}^{-1}\mathbf{Z}_p + \mathbf{I} \otimes \mathbf{K}_p^{-1}
\end{pmatrix}
\begin{pmatrix} \hat{\mathbf{g}} \\ \hat{\mathbf{h}} \\ \hat{\mathbf{c}} \\ \hat{\mathbf{a}} \\ \hat{\mathbf{p}} \end{pmatrix}
=
\begin{pmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{W}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}_c'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}_a'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}_p'\mathbf{R}^{-1}\mathbf{y} \end{pmatrix}.$$

These equations are solved by using iteration on data routines, in which coefficients of the MME are calculated as they are needed. The following procedure describes how to estimate the covariance function matrices using a pseudo-Bayesian method.

1. First, go through the observations, one at a time, and calculate the deviations,

$$\mathbf{y} - \mathbf{X}\hat{\mathbf{g}} - \mathbf{W}\hat{\mathbf{h}} - \mathbf{Z}_c\hat{\mathbf{c}} - \mathbf{Z}_a\hat{\mathbf{a}} - \mathbf{Z}_p\hat{\mathbf{p}}.$$

2. Accumulate the deviations for gender effects, then solve for new gender regression coefficients.

3. Repeat for year effects.

4. Repeat for contemporary groups. For each contemporary group calculate $\hat{\mathbf{c}}_i\hat{\mathbf{c}}_i'$, and accumulate these $5 \times 5$ matrices over all contemporary groups. Then

$$\hat{\mathbf{K}}_c = \sum_i \hat{\mathbf{c}}_i\hat{\mathbf{c}}_i' \, / \, \chi^2(nc + 2)$$

to estimate the covariance function matrix, where $\chi^2(s)$ is a random chi-square variate having $s$ degrees of freedom.

5. Repeat for animal additive genetic effects. This step involves elements of $\mathbf{A}^{-1}$. For each animal that has observations, calculate

$$\mathbf{m}_\ell = \hat{\mathbf{a}}_\ell - 0.5(\hat{\mathbf{a}}_{sire} + \hat{\mathbf{a}}_{dam}).$$

From $\mathbf{A}^{-1}$ there will be

$$b^{\ell\ell} = \left(0.5 - 0.25(F_{sire} + F_{dam})\right)^{-1}$$

for each animal. The new covariance function matrix is

$$\hat{\mathbf{K}}_a = \sum_\ell \mathbf{m}_\ell\mathbf{m}_\ell' \, b^{\ell\ell} \, / \, \chi^2(nw + 2).$$
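Step 4's covariance update can be sketched as follows (Python purely for illustration; the coefficient values are random stand-ins, not estimates from the chapter): accumulate the outer products over levels, then divide by a chi-square draw with nc + 2 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(3)
k, nc = 5, 30                            # 5 coefficients per level, 30 contemporary groups
chat = rng.normal(size=(nc, k))          # stand-in sampled regression coefficients

sss = sum(np.outer(c, c) for c in chat)  # accumulate c_i c_i' over all groups
x2 = rng.chisquare(nc + 2)               # random chi-square variate with nc + 2 d.f.
Kc_new = sss / x2                        # updated covariance function matrix

print(Kc_new.shape)                      # (5, 5)
```

Because the accumulated sum of outer products is symmetric and positive semi-definite, and the chi-square draw is a positive scalar, the updated matrix keeps those properties, which is needed for it to serve as a covariance matrix in the next round.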
Only animals with records are used, because they are the only ones that contribute to the estimation of variances and covariances directly.

6. Repeat for animal permanent environmental effects. $\mathbf{K}_p$ is estimated in the same manner as that for contemporary groups.

7. Estimate new residual variances as before.

8. The new covariance function matrices are used in the next iteration, and should be saved in a file so that they may be averaged to give a final estimate of each.

9. Go back to step 1 and repeat.

Chapter 11

RRM Calculations

The calculations for applying random regression models are similar to those needed for multiple trait models. The R language is used to illustrate. To analyze thousands of records one would need to write specialized programs in either Fortran or C++, or use commercial software.

Below are completely fictitious data for two traits, so that the model is definitely multiple trait as well as random regression. The example data are spread over the following four tables. The pedigree appears in Table 11.4.
Table 11.1 Example Data - Part 1

Animal  Age  Yr-Mn  HYS  DIM  Fat   Protein
10      1    1      1    21   8.7   6.0
10      1    1      1    59   5.7   9.7
10      1    1      1    105  9.4   6.6
10      1    1      1    139  12.2  9.9
10      1    1      1    221  9.6   9.4
10      1    1      1    334  6.4   4.1
11      1    1      1    23   12.9  8.6
11      1    1      1    61   10.6  10.1
11      1    1      1    107  14.1  14.1
11      1    1      1    141  8.1   7.7
11      1    1      1    223  11.0  9.5
12      2    1      1    27   10.3  5.0
12      2    1      1    65   6.9   6.3
12      2    1      1    111  10.2  6.7
12      2    1      1    145  15.4  10.3
12      2    1      1    227  12.8  7.6
13      1    1      2    6    12.5  7.4
13      1    1      2    31   17.3  13.1
13      1    1      2    175  11.6  10.9
13      1    1      2    264  9.4   8.5
13      1    1      2    295  7.0   7.0
13      1    1      2    325  7.0   7.3
14      2    1      2    8    9.5   3.3
14      2    1      2    33   8.3   7.7
14      2    1      2    177  12.4  12.8
14      2    1      2    266  12.3  12.5
14      2    1      2    297  9.7   10.7
14      2    1      2    327  5.7   9.2
15      3    1      2    10   14.8  10.6
15      3    1      2    35   13.3  11.0
15      3    1      2    179  11.5  8.5
15      3    1      2    268  8.6   6.9
15      3    1      2    299  6.0   6.1

Table 11.2 Example Data - Part 2

Animal  Age  Yr-Mn  HYS  DIM  Fat   Protein
16      3    1      2    15   12.6  7.5
16      3    1      2    40   13.9  9.0
16      3    1      2    184  9.6   9.1
16      3    1      2    273  8.2   7.0
16      3    1      2    304  5.6   7.7
17      1    2      3    48   3.1   7.7
17      1    2      3    89   6.4   9.0
17      1    2      3    150  7.6   8.8
17      1    2      3    206  7.7   8.1
17      1    2      3    280  7.0   10.1
17      1    2      3    335  6.6   12.9
18      2    2      3    50   1.7   6.4
18      2    2      3    91   2.3   3.5
18      2    2      3    152  4.3   4.2
18      2    2      3    208  4.7   5.7
18      2    2      3    282  6.8   7.5
19      2    2      3    52   6.1   7.6
19      2    2      3    93   5.7   5.7
19      2    2      3    154  6.4   7.5
19      2    2      3    210  7.9   7.8
19      2    2      3    284  5.6   8.2
20      3    2      3    53   1.2   5.1
20      3    2      3    94   3.9   8.8
20      3    2      3    155  8.2   7.6
20      3    2      3    211  6.5   6.4
20      3    2      3    285  5.2   8.0
21      1    2      4    63   2.7   5.3
21      1    2      4    131  6.5   4.6
21      1    2      4    222  9.1   5.3
21      1    2      4    252  4.8   2.6
21      1    2      4    270  3.0   1.4
21      1    2      4    300  2.4   1.2
22      2    2      4    64   0.3   8.9
Table 11.3 Example Data - Part 3

Animal  Age  Yr-Mn  HYS  DIM  Fat  Protein
22      2    2      4    132  2.2  6.6
22      2    2      4    223  9.5  6.6
22      2    2      4    253  7.5  5.5
22      2    2      4    271  5.8  4.6
22      2    2      4    301  8.2  6.8
23      3    2      4    65   2.6  5.8
23      3    2      4    133  2.9  6.1
23      3    2      4    224  8.5  8.7
23      3    2      4    254  6.7  6.7
23      3    2      4    272  4.4  6.8
23      3    2      4    302  3.9  5.7
24      1    2      4    66   0.9  4.2
24      1    2      4    134  5.1  5.0
24      1    2      4    225  5.4  3.1
24      1    2      4    255  6.6  4.1
24      1    2      4    273  8.4  5.1
24      1    2      4    303  4.3  6.4
25      3    2      4    67   6.3  3.5
25      3    2      4    135  2.8  3.7
25      3    2      4    226  5.2  2.7
25      3    2      4    256  4.6  1.9
25      3    2      4    274  7.6  1.5

Table 11.4 Pedigree of Example Animals.

Animal  Sire  Dam  bii
1       0     0    1.0
2       0     0    1.0
3       0     0    1.0
4       0     0    1.0
5       0     0    1.0
6       0     0    1.0
7       0     0    1.0
8       0     0    1.0
9       0     0    1.0
10      1     5    0.5
11      2     6    0.5
12      3     7    0.5
13      4     8    0.5
14      1     9    0.5
15      2     5    0.5
16      3     6    0.5
17      3     8    0.5
18      4     7    0.5
19      3     10   0.5
20      2     12   0.5
21      2     13   0.5
22      1     11   0.5
23      3     14   0.5
24      4     19   0.5
25      2     10   0.5

11.1 Enter Data

First, enter the pedigree information and calculate the inverse of the additive genetic relationship matrix.

anim=c(1:25)
sir=c(rep(0,9),1,2,3,4,1,2,3,3,4,3,2,2,1,3,4,2)
dam=c(rep(0,9),5,6,7,8,9,5,6,8,7,10,12,13,11,14,19,10)
length(sir)-length(dam) # check
bi=c(rep(1,9),rep(0.5,16))
AI = AINV(sir,dam,bi)

Enter columns of information that identify the age group, year-month, animals with records, HYS numbers, days in milk, and fat and protein yields (not shown).
agid=c(rep(1,6),rep(1,5),rep(2,5),rep(1,6),rep(2,6),rep(3,5),
  rep(3,5),rep(1,6),rep(2,5),rep(2,5),rep(3,5),rep(1,6),
  rep(2,6),rep(3,6),rep(1,6),rep(3,5))
nob=length(agid)
ymid=c(rep(1,6),rep(1,5),rep(1,5),rep(1,6),rep(1,6),rep(1,5),
  rep(1,5),rep(2,6),rep(2,5),rep(2,5),rep(2,5),rep(2,6),
  rep(2,6),rep(2,6),rep(2,6),rep(2,5))
cgid=c(rep(1,6),rep(1,5),rep(1,5),rep(2,6),rep(2,6),rep(2,5),
  rep(2,5),rep(3,6),rep(3,5),rep(3,5),rep(3,5),rep(4,6),
  rep(4,6),rep(4,6),rep(4,6),rep(4,5))
anwr=c(rep(10,6),rep(11,5),rep(12,5),rep(13,6),rep(14,6),
  rep(15,5),rep(16,5),rep(17,6),rep(18,5),rep(19,5),rep(20,5),
  rep(21,6),rep(22,6),rep(23,6),rep(24,6),rep(25,5))
# obs is 88 by 2 matrix of fat and protein values

Observations also have to be assigned to residual error groups (1 to 4) based on days in milk. Group 1 is day 5 to day 99, group 2 is day 100 to 199, group 3 is day 200 to 299, and group 4 is day 300 plus.

dimid=c(21,59,105,139,221,334, 23,61,107,141,223,
  27,65,111,145,227, 6,31,175,264,295,325, 8,33,177,
  266,297,327, 10,35,179,268,299, 15,40,184,273,304,
  48,89,150,206,280,335, 50,91,152,208,282, 52,93,
  154,210,284, 53,94,155,211,285, 63,131,222,252,
  270,300, 64,132,223,253,271,301, 65,133,224,254,
  272,302, 66,134,225,255,273,303, 67,135,226,256,274)
erid=c(1,1,2,2,3,4,1,1,2,2,3,1,1,2,2,3,1,1,2,3,3,4,
  1,1,2,3,3,4,1,1,2,3,3,1,1,2,3,4,1,1,2,3,3,4,1,1,2,
  3,3,1,1,2,3,3,1,1,2,3,3,1,2,3,3,3,4,1,2,3,3,3,4,1,
  2,3,3,3,4,1,2,3,3,3,4,1,2,3,3,3)
length(erid)-length(dimid) # check

Now the Legendre polynomials need to be constructed, using two R functions, as follows. The first, LPOLY, sets up the polynomial coefficients; the second, LPTIME, calculates the covariates from standardized time values for days in milk.
LPOLY=function(no) {
  if(no > 9) no = 9
  nom = no - 1
  phi = matrix(data=c(0),nrow=9,ncol=9)
  phi[1,1]=1
  phi[2,2]=1
  for(i in 2:nom){
    ia = i+1
    ib = ia - 1
    ic = ia - 2
    c = 2*(i-1) + 1
    f = i - 1
    c = c/i
    f = f/i
    for(j in 1:ia){
      if(j == 1){ z = 0 } else { z = phi[ib,j-1] }
      phi[ia,j] = c*z - f*phi[ic,j]
    }
  }
  for(m in 1:no){
    f = sqrt((2*(m-1)+1)/2)
    phi[m, ] = phi[m, ]*f
  }
  return(phi[1:no,1:no])
}

The second function is below.

LPTIME=function(day,tmin,tmax,no){
  phi = LPOLY(no)
  if(day > tmax) day = tmax
  if(day < tmin) day = tmin
  z = -1 + 2*((day - tmin)/(tmax - tmin))
  s = rep(1,no)
  for(i in 2:no){
    s[i]=z*s[i-1]
  }
  x = phi %*% s
  return(x)
}

Use the functions to create the covariates in a matrix of order 335 by 5.

tmin=5
tmax=335
zcov=matrix(data=c(0),ncol=5,nrow=tmax)
for(i in tmin:tmax){
  zcov[i, ]=LPTIME(i,tmin,tmax,5)
}

11.2 Enter Starting Covariance Matrices

The following matrices come from an analysis of Simmental dairy cows. The observations were multiplied by 10, so the covariance matrices have been multiplied by 100. The matrices are positive definite, and are of order 10 by 10 (five covariates times 2 traits). The matrices were entered in half-stored mode, then converted to full-stored with the function fsmat. The genetic covariance matrix is

G=c(0.6132,-.0353,.0174,.0017,.0008,.0096,-.0005,-.0001,.0011,-.0005,
  .4705,-.0112,.0059,.0008,-.0061,.0651,.0019,-.0025,-.0005,
  .2519,-.0040,.0008,.0015,.0017,.0261,.0008,-.0005,
  .0614,-.0006,-.0001,-.0003,.0004,.0057,-.0002,
  .0136,.0004,.0003,-.0003,-.0003,.0007,
  .4272,-.0108,.0172,.0012,.0009,
  .3324,-.0028,.0150,.0000,
  .1978,-.0033,.0053,
  .0885,-.0008,
  .0310)
GM=fsmat(10,G)

The contemporary group covariance matrix is as follows.
H=c(3.7487,-.3112,-.0732,.0272,-.0201,2.8720,-.1985,-.1479,.0351,-.0275,
  .8256,-.0211,-.0314,.0140,-.2306,.3768,-.0091,-.0500,.0111,
  .3382,-.0222,-.0003,-.0967,-.0066,.1027,-.0106,-.0128,
  .2638,-.0121,.0152,-.0483,-.0062,.0712,-.0051,
  .1818,-.0215,.0067,-.0145,-.0032,.0419,
  2.9313,-.1625,-.1058,.0316,-.0197,
  .5873,-.0103,-.0187,.0094,
  .2325,-.0104,.0050,
  .1698,-.0060,
  .1127)
HM=fsmat(10,H)

The animal permanent environmental covariance matrix is as follows.

P=c(1.1340,-.0151,.0206,.0024,-.0005,.3505,.0017,-.0222,.0031,.0000,
  .4881,-.0003,.0155,.0018,-.0038,.0805,.0042,-.0015,-.0001,
  .3317,-.0052,.0037,-.0158,.0054,.0496,.0004,-.0019,
  .1412,-.0011,-.0010,-.0003,.0000,.0164,.0000,
  .0330,-.0018,.0006,-.0006,.0002,.0034,
  .8274,.0103,.0172,.0051,.0043,
  .3528,.0070,.0194,.0007,
  .2418,-.0007,.0114,
  .1161,-.0004,
  .0577)
PM=fsmat(10,P)

Invert those matrices for use in the mixed model equations.

GI=ginv(GM)
HI=ginv(HM)
PI=ginv(PM)

Because there are two traits, the residual covariance matrices are of order 2 by 2, and there are 4 of them because there are 4 residual groups per lactation. Their inverses are needed to form the MME.

R1=matrix(data=c(4.1267,1.5435,1.5435,2.5262),
  byrow=TRUE,ncol=2)
R2=matrix(data=c(3.4424,1.4586,1.4586,1.7293),
  byrow=TRUE,ncol=2)
R3=matrix(data=c(2.4530,1.1217,1.1217,1.2931),
  byrow=TRUE,ncol=2)
R4=matrix(data=c(2.0600,0.9695,0.9695,1.3123),
  byrow=TRUE,ncol=2)
RI=array(data=c(0),dim=c(2,2,4))
R1I=ginv(R1)
RI[ , ,1]=R1I
RI[ , ,2]=ginv(R2)
RI[ , ,3]=ginv(R3)
RI[ , ,4]=ginv(R4)

11.3 Sampling Process

Initialize the solution vectors for each factor in the model, which represent vectors for year-months, age groups, contemporary groups, animal additive genetic effects, and animal permanent environmental effects, respectively. The number of rows is the number of levels in each factor, and the number of columns is ten (10), for 5 covariates per trait times 2 traits.
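fsmat is the author's own helper for expanding a half-stored matrix; a sketch of what such a conversion does is below (Python purely for illustration, assuming upper-triangle, row-wise storage with the diagonal first in each row, which matches the row lengths 10, 9, ..., 1 visible in G, H, and P above).

```python
import numpy as np

def fsmat_like(n, half):
    """Expand a half-stored symmetric matrix (upper triangle, row by row,
    diagonal element first in each row) into its full n x n form."""
    M = np.zeros((n, n))
    k = 0
    for i in range(n):
        for j in range(i, n):
            M[i, j] = M[j, i] = half[k]
            k += 1
    return M

# A 3 x 3 symmetric matrix needs 3*4/2 = 6 half-stored values.
M = fsmat_like(3, [4.0, 1.0, 2.0,
                        3.0, 0.5,
                             5.0])
print(M)
```

An order-10 matrix needs 10·11/2 = 55 half-stored values, which is exactly the length of each of the G, H, and P vectors.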
The model assumes that all of the data are from first lactation animals of the same breed. In this example the numbers of levels are nym = 2 year-months, nage = 3 age groups, ncg = 4 contemporary groups, and nam = 25 animals.

ymsn=matrix(data=c(0),nrow=nym,ncol=10)
agsn=matrix(data=c(0),nrow=nage,ncol=10)
cgsn=matrix(data=c(0),nrow=ncg,ncol=10)
ansn=matrix(data=c(0),nrow=nam,ncol=10)
pesn=ansn

11.3.1 Year-Month Effects

All factors in this model are based on regression functions of order 4. There are not enough observations to use DIM classes to model the lactation curves in this example. The steps described below are similar for the age group and contemporary group solutions, except for the appropriate changes to the code, which should be obvious.

WY=ymsn*0
WW=array(data=c(0),dim=c(10,10,nym))
for(i in 1:nob){
  y = obs[i, ]
  kag=agid[i]      # age indicator
  kym=ymid[i]      # year-month
  kcg=cgid[i]      # contemporary group
  kam=anwr[i]      # animal id
  kdy=dimid[i]     # days in milk
  kt1=c(1:5)       # fat yield variables
  kt2=c(6:10)      # protein yield variables
  zphi=zcov[kdy, ] # Legendre covariates
  y[1]=y[1]-(agsn[kag,kt1]+cgsn[kcg,kt1]+
    ansn[kam,kt1]+pesn[kam,kt1])%*%zphi
  y[2]=y[2]-(agsn[kag,kt2]+cgsn[kcg,kt2]+
    ansn[kam,kt2]+pesn[kam,kt2])%*%zphi
  ker=erid[i]      # residual group (1 to 4)
  eh=RI[ , ,ker]%*%y
  WY[kym,kt1]=WY[kym,kt1]+zphi*eh[1]
  WY[kym,kt2]=WY[kym,kt2]+zphi*eh[2]
  zz=zphi%*%t(zphi) # 5 x 5 matrix
  WW[ , ,kym]=WW[ , ,kym]+(RI[ , ,ker]%x%zz)
}

The above script adjusts the observations for all other factors in the model, and accumulates the diagonal 10 by 10 blocks and the right hand sides (RHS) for each level of the factor. The next step is to solve for new year-month solutions, and add some sampling noise.

no=10
for(k in 1:nym){
  work=WW[ , ,k]
  C = ginv(work)
  ymsn[k, ]=C %*% WY[k, ]
  CD=t(chol(C))
  ve = rnorm(no,0,1)
  ymsn[k, ] = ymsn[k, ] + (CD %*% ve)
}

11.3.2 Age Groups

The R-script is nearly identical to that for year-month effects, except that y[1] and y[2] are adjusted for ymsn (rather than agsn) plus the other factors.
The solving part, with the added sampling noise, is

no=10
for(k in 1:nage){
  work=WW[ , ,k]
  C = ginv(work)
  agsn[k, ]=C %*% WY[k, ]
  CD=t(chol(C))
  ve = rnorm(no,0,1)
  agsn[k, ] = agsn[k, ] + (CD %*% ve)
}

11.3.3 Contemporary Groups

Contemporary groups are a random factor, and thus a covariance matrix should be estimated. However, in this small example there are only 4 contemporary groups with which to estimate a covariance matrix of order 10. The covariance matrix is, therefore, not properly estimable. There should be more than 10 contemporary groups. If the starting covariance matrix, H, can be considered a good "guesstimate" of the true covariance matrix, then H can be held constant throughout all of the sampling process.

The adjustment of observations and the collection of the diagonal blocks and RHS are the same as for the year-month and age factors. The solution phase and estimation of the covariance matrix are as follows.

no=10
for(k in 1:ncg){
  work=WW[ , ,k] + HI # NOTE
  C = ginv(work)
  cgsn[k, ]=C %*% WY[k, ]
  CD=t(chol(C))
  ve = rnorm(no,0,1)
  cgsn[k, ] = cgsn[k, ] + (CD %*% ve)
}

If ncg were greater than 10, then the covariance matrix would be estimated as follows.

sss = t(cgsn)%*%cgsn
ndf = ncg+2
x2 = rchisq(1,ndf)
Hn = sss/x2
HI = ginv(Hn)

In this example, sss only has a rank of 4, but an order of 10 by 10. Thus, Hn does not have an inverse, and HI cannot be used in the mixed model equations. The original H and HI have to be used each time. An alternative would be to treat the contemporary groups as another fixed factor.

11.3.4 Animal Additive Genetic Effects

The R-script for animal additive genetic effects is slightly different from that for year-months, ages, and contemporary groups. The difference is that a matrix is created for all animals at one time. Thus, 25 animals by 10 traits gives a matrix of order 250 by 250. This would not be a good strategy if there were many thousands of animals, but for this small example it is sufficient.
no=nam*10
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:nob){
  y = obs[i, ]
  kag=agid[i]
  kym=ymid[i]
  kcg=cgid[i]
  kam=anwr[i]
  kdy=dimid[i]
  kt1=c(1:5)
  kt2=c(6:10)
  zphi=zcov[kdy, ]
  y[1]=y[1]-(ymsn[kym,kt1]+agsn[kag,kt1]+
    cgsn[kcg,kt1]+pesn[kam,kt1])%*%zphi
  y[2]=y[2]-(ymsn[kym,kt2]+agsn[kag,kt2]+
    cgsn[kcg,kt2]+pesn[kam,kt2])%*%zphi
  ker=erid[i]
  eh=RI[ , ,ker]%*%y
  ka=(kam-1)*10 + 1   # first row of this animal's block of 10 equations
  kb=ka+9
  kc=c(ka:kb)
  WY[kc[kt1]]=WY[kc[kt1]]+zphi*eh[1]
  WY[kc[kt2]]=WY[kc[kt2]]+zphi*eh[2]
  zz=zphi%*%t(zphi)
  WW[kc,kc]=WW[kc,kc]+(RI[ , ,ker]%x%zz)
}

Note that WY is a single column here, so each animal's contributions go into its own block of 10 rows (kc), rather than into a row of a matrix as in the earlier sections.

The solution phase is

UI = AI %x% GI
work = WW + UI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn + (CD %*% ve)
# transform to new dimensions
ansn = matrix(data=wsn,byrow=TRUE,ncol=10)
# Estimate a new G and GI
v1 = AI %*% ansn
sss = t(ansn)%*%v1
ndf = nam+2
x2 = rchisq(1,ndf)
Gn = sss/x2
GI = ginv(Gn)

The additive genetic covariance matrix is based on 25 animals, which is greater than the order of the matrix (10), and therefore Gn is estimable and has an inverse, GI.

11.3.5 Animal Permanent Environmental Effects

The R-script for this factor is the same as for the animal additive genetic effects. The solution phase is slightly different.

UI = id(nam) %x% PI # note id(nam) instead of AI
work = WW + UI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn + (CD %*% ve)
pesn = matrix(data=wsn,byrow=TRUE,ncol=10)
# Estimate a new P and PI
sss = t(pesn)%*%pesn
ndf = 16+2
x2 = rchisq(1,ndf)
Pn = sss/x2
PI = ginv(Pn)

This covariance matrix was also estimable because the number of animals with records was 16, which is greater than the dimension of P.

11.3.6 Residual Covariance Matrices

Observations are now adjusted for all factors in the model to give estimated residual effects; then sums of squares are calculated separately for each of the four residual groups, each having different degrees of freedom.
In this case the covariance matrices are of order 2 by 2, and as long as there are more than two observations in a group, these covariance matrices are estimable.

Rss=array(data=c(0),dim=c(2,2,4))
for(i in 1:nob){
  y = obs[i, ]
  kag=agid[i]
  kym=ymid[i]
  kcg=cgid[i]
  kam=anwr[i]
  kdy=dimid[i]
  kt1=c(1:5)
  kt2=c(6:10)
  zphi=zcov[kdy, ]
  y[1]=y[1]-(ymsn[kym,kt1]+agsn[kag,kt1]+
    cgsn[kcg,kt1]+ansn[kam,kt1]+pesn[kam,kt1])%*%zphi
  y[2]=y[2]-(ymsn[kym,kt2]+agsn[kag,kt2]+
    cgsn[kcg,kt2]+ansn[kam,kt2]+pesn[kam,kt2])%*%zphi
  ker=erid[i]
  eh=matrix(data=y,ncol=1)
  Rss[ , ,ker]=Rss[ , ,ker]+ eh%*%t(eh)
}

The covariance matrices are estimated as

# Find number of observations in each group
kdf=tapply(erid,erid,length)
for(k in 1:4){
  ndf = kdf[k]+2
  x2 = rchisq(1,ndf)
  Rn = Rss[ , ,k]/x2
  Rni=ginv(Rn)
  RI[ , ,k]=Rni
}

Repeat more rounds of sampling. Code is needed to save the samples of each covariance matrix. After all samples have been made, randomly select 200 of the sample matrices past the burn-in period, and average those for the final estimated matrices.

If one is just solving the mixed model equations, then the parts of the scripts that add sampling noise can be skipped, and the covariance matrices do not need to be estimated. Some convergence criterion is needed to know when the solutions have converged to the solutions of the MME.

The R code could be written more concisely and compactly, but then the reader might not be able to follow exactly what is being done. The code here has been written to be easily followed, step by step.

Chapter 12

Lactation Production

When a female dairy cow gives birth, her lactation period begins. In an average dairy cow, the lactation period runs for 305 days. The amount of milk produced is greatest after calving, peaking at about 40 days, with daily production decreasing thereafter.
Around 100 to 120 days after calving, the cow is impregnated again through artificial insemination, and the growth of the new fetus begins to pull lactation production downwards even more. About 60 days before the cow is due to give birth again, she is stopped from milking (if she has not already stopped on her own) and dried off. The whole process is repeated as soon as she has the next calf, at approximately 13-month intervals.

Some cows have their peak production right at the birth of a calf, and it decreases continually thereafter. Other cows continue to increase in yield from calving to day 90 before starting to decrease. Another group of cows gives milk at a continuous level for many days. Some cows stop milking around 280 days, while others are kept milking to 365 days or more. Thus, there are many different shapes of daily production trajectories among cows.

Besides milk yield, there are also components of milk, i.e. percentages of fat, protein, and lactose, somatic cell scores, milk urea nitrogen, and beta-hydroxybutyrate. All of these have their own interrelated trajectories over the lactation period. Consequently, multiple trait analyses are favoured to make use of genetic correlations among the traits at different points during the lactation period. Fortunately, the same model equation is generally assumed for all of these traits. That is, the same factors are assumed to influence each lactation trait.

12.1 Measuring Yields

In the early days of milk recording in Canada, the federal government would record the daily milk yield of every cow in the herd. The trait that was analyzed was called the 305-day yield. This was the sum of the daily yields of a cow from day 5 to day 305 of the lactation period (or until the cow stopped milking). Daily yields were defined as the amount of milk given in a 24-hour period. This was usually two milkings per day, the morning or AM milking and the evening or PM milking.
However, daily recording was very costly, and someone had to add up the milk weights over the lactation period. The program ended before 1970. At the same time there was a program of supervised testing, in which a milk supervisor would visit a farm at approximately one-month intervals. He (or she) would measure yields in the PM and the following AM, and collect milk samples from each cow, which were sent to laboratories to be analyzed for fat content. The data were accumulated by the milk recording program, which was called Record of Performance (ROP). Later, provincial programs arose, called Dairy Herd Improvement (DHI) programs.

The monthly milk weights were combined using the Test Interval Method (TIM), which estimated the amount of milk produced between two visits by linear interpolation. There were tables of special factors to adjust the first test day (TD) visit, and another table for projecting the yield after the latest TD visit to 305 days. Both tables were based on the assumption of a standard or average lactation trajectory. There was no allowance for the fact that cows could vary drastically in their trajectories. The factors worked well for most cows, but gave biased results for cows with atypical trajectories.

A dairy technical committee existed in Canada, run by Agriculture Canada and composed of scientists from different universities across Canada. The purpose of the committee was to advise the ROP program on the best statistical procedures to adjust milk yields and to evaluate dairy bulls. The committee met twice a year in various locations across Canada. During one of these meetings, when it was time to develop new tables of factors for adjusting the first TD yields and the latest TD yields to get 305-day production, there was debate over who would do this and how often it needed to be done.
Dr. John Moxley, who at the time worked for the Quebec DHI equivalent, DHAS, remarked in 1974 that "it would be better if we could analyze TD milk yields directly rather than combining them into a 305-d yield." In 1974, however, the effort was going into getting a linear sire model adopted to evaluate dairy bulls. The computing power of the day was not capable of handling models for test day records.

By 1990 the dairy world had advanced to using animal models, and computer hardware had caught up, so that it was feasible to begin working on TD models. The first TD model did not have any curves in it. The model assumed that the trajectories of the curves were the same for all cows; trajectories only differed by the height at peak yield. Thus, there was still only one variable to estimate per animal. The problem with analyzing TD records was that each cow had 7 to 10 TD records compared to only one 305-d record, so there was much more data to process. Jack Dekkers suggested that "there should be a different lactation curve for every cow."

Henderson (1984) had a section of his book on the topic of "random regressions." There was only one paragraph, and nothing about their use with TD models. Henderson's son published a paper in Biometrics in 1982 on random regression models and the analysis of covariance. In 1994 the idea of TD models using random regression models was presented at the WCGALP meetings in Guelph. Four years later, everyone was studying random regression models for many situations. By 2000, Canada had adopted a TD model for its genetic evaluations of dairy bulls and cows.

12.2 Curve Fitting

A host of different models have been used to fit lactation curves in dairy species. Kistemaker (1996) compared 19 different models that had been studied previously in the literature. His results are shown in Table 12.1.
Table 12.1 Correlations (r) between predicted and actual test day yields, and mean absolute error (MAE), when applied to 5409 cows with at least 9 TD yields.

No.  Model                                                            r     MAE
1    ln(y/t) = a + b*t                                                .717  4.780
2    ln(y) = a + b*ln(t) + c*t                                        .951  1.290
3    ln(y) = a + b*ln(t) + c*t + d*t^.5                               .963  1.084
4    ln(y) = a + b*ln(t) + c*t + d*t^2                                .964  1.079
5    ln(y) = a + b*t^-1 + c*t + d*t^2                                 .964  1.063
6    ln(y) = a + b*ln(t) + c*t + d*t^.5 + f*t^2                       .973  0.888
7    1/y = a + b*t^-1 + c*t                                           .102  2.050
8    1/y = a + b*t^-1 + c*t + d*t^2                                   .766  1.269
9    1/y = a + b*t^-1 + c*t + d*t^2 + f*t^3                           .378  1.078
10   y = a                                                            .646  3.466
11   y = a + b*t + c*exp(-.5((log(t)-1)/.6)^2)*t^-1                   .953  1.229
12   y = a + b*t^.5 + c*ln(t)                                         .955  1.230
13   y = a + b*t + c*exp(-.05*t)                                      .953  1.232
14   y = a + b*t^.5 + c*ln(t) + d*t^4                                 .967  1.032
15   y = a + b(t/305) + c(t/305)^2 + d*ln(305/t) + f*ln^2(305/t)      .975  0.857
16   y = a + b*t + c*sin(.01)t^2 + d*sin(.01)t^3 + f*exp(-.055t)      .974  0.878
17   y = a + b*t + c*t^2 + d*t^3 + f*ln(t)                            .975  0.864
18   y = a + b*t + c*t^2 + d*t^3 + f*t^4                              .974  0.905
19   y = a + b*t + c*t^2 + d*t^3 + f*t^4 + g*t^5 + h*t^6              .987  0.581

Wood's model (1967), equation 2 in the table, has been used to study groups of cows. Equation 13 is known as Wilmink's function (1987), which has been applied in many studies. Equation 15 is known as the Ali and Schaeffer function (1987), which gives the second smallest mean absolute error and the second largest correlation. Equation 19 appears to be the best, but has the most parameters to be estimated.

The first 9 equations use the natural log of the test day yields or the inverse of yield. Equations 10 through 19 use the actual TD yields. In general, as the number of covariates in the model increases, the fit of the model improves. Subsequent work showed that Legendre polynomials of order 4 were similar to the Ali and Schaeffer function, but had the advantage of much lower correlations among the parameter estimates.
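Wilmink's function (equation 13) is easy to explore numerically; the sketch below is Python, purely for illustration, and the parameter values are invented, not fitted. With b < 0 and c < 0 the curve rises to a peak and then declines roughly linearly, which is the typical lactation shape.

```python
import numpy as np

# Wilmink (1987): y = a + b*t + c*exp(-0.05*t).  Here a sets the level,
# c < 0 pulls early yields down, and b < 0 gives the late decline.
a, b, c = 30.0, -0.1, -20.0        # purely illustrative values
t = np.arange(5, 306)              # days in milk, 5 to 305
y = a + b * t + c * np.exp(-0.05 * t)

peak_day = int(t[np.argmax(y)])
print(peak_day)                    # a peak in the mid-40s with these values
```

Setting the derivative b - 0.05c·exp(-0.05t) to zero gives the peak day analytically; moving b or c shifts both the peak day and the rate of late-lactation decline, which is how the three parameters trace out different cow trajectories.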
Thus, Legendre polynomials of order 4 have been used for both fixed and random regressions in test day models in Canada. A classification approach could also be used in place of the regressions for at least one of the fixed factors.

There have probably been a hundred different studies that investigated the best curve function for fitting lactation curves in dairy cows, dairy goats, dairy sheep, and water buffalos. The conclusions have not been unanimous, depending on the amount of data in the analyses. The majority of studies found that test day models gave higher correlations of estimated breeding values with true breeding values, and recommended their use for genetic evaluations of lactation production.

12.3 Factors in a Model

12.3.1 Observations

A multiple trait model will be described. Traits will be defined within parity number. Parities one and two are separate, and parity three includes third parity and all subsequent parities. The assumption is that cows in third parity or later are mature and that the shapes of their curves are similar. In some cases parity two might also be considered mature. No matter what, there are definite shape differences between parity 1 heifers and all later parities. In some situations it might be better to limit analyses to the first three parities only.

After parity number come several traits, depending on the country. These include milk yield, fat yield, protein yield, lactose yield, and somatic cell scores (SCS). There could also be milk urea nitrogen and beta-hydroxybutyrate. Finally, there could be fatty acid components. Deciding which factors to analyze depends on how many cows have data. In the United States of America, for example, there are so many cows and so many test days that a TD model is impractical to apply, even for one trait. The initial Canadian Test Day Model had the first three parities, and milk, fat, and protein yields plus SCS, for a total of 12 traits.
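The 12-trait count can be checked by enumerating the parity-by-trait combinations (the trait names here are just illustrative labels):

```r
# Sketch: the initial Canadian Test Day Model treated each yield trait
# within each of the first three parities as a separate trait.
traits <- expand.grid(parity = 1:3,
                      trait = c("milk", "fat", "protein", "SCS"))
nrow(traits)   # 3 parities x 4 traits = 12 traits
```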
12.3.2 Year-Month of Calving

In all animal models it is critically important to account for time trends in phenotypes. For lactation production this means putting a factor for the year and month of calving into the model. If data begin in 1986, then that means 32 years (it is now 2018), times 12 months per year, gives 384 levels, or 384 different lactation curves for one parity. Then assume 72 five-day periods within each lactation, and that gives 384 x 72 = 27,648 parameters to be estimated using the classification approach. Hopefully, there are many more test day records with which to estimate those parameters. If there are not, then maybe 36 ten-day periods could be used. If data are limiting, then Legendre polynomials of order 4 could be used instead, i.e. 5 coefficients per level, or 1,920 parameters to estimate.

12.3.3 Age-Season of Calving

Age at calving (parturition) is known to have a significant effect on milk production, as does month of calving. Month of calving has already been considered in the Year-Month of Calving effects; however, there is an interaction of month of calving with age at calving. To avoid some confounding, months can be combined into seasons, either four or six seasons per year. These can be formed on the basis of phenotypic averages, so that consecutive months that are similar in yield levels can be grouped together. Age groups would differ depending on the parity number, and there could be different numbers of age groups per parity. First parity heifers start calving at 18 months of age, and calvings can extend to 30 months. Again, if there are lots of data, then 13 age groups by 6 seasons would only be 78 subclasses in parity one. Later parities extend over a much wider age range, and thus some grouping of ages may be necessary too. Legendre polynomials would be used for this factor. If the data cover several decades, then age-season differences could change over time as production increases.
Thus, time periods of 5 to 10 years should be made, and the model expanded to have Time-Age-Season of Calving subclasses. This allows the age-season differences to change over time.

12.3.4 Days Pregnant

Once a cow becomes pregnant, part of her feed intake goes towards the growth of the fetus, and therefore less energy goes towards milk production. Groups of 5 or 10 days pregnant can be created, perhaps 30 groups altogether, to measure the decrease in yield. The assumption is that the decrease in yield is the same regardless of the number of days in milk when the cow becomes pregnant. As the number of days pregnant becomes larger, so does the amount of decrease in yield. Legendre polynomials of days in milk would be used within each days-pregnant group. Determining the time of conception is not immediate, and therefore test day records need to be continuously updated when pregnancy is validated. Canada has opted to multiplicatively pre-adjust TD yields for number of days pregnant rather than to put a factor into the model.

12.3.5 Herd-Test-Day

The purpose of this factor was to account for the environmental effects on the cows that were tested on the same day. This is very messy because cows (in the same parity) would have calved at different ages and months of the year. Thus, some cows would be just starting a lactation, while other cows in the same HTD subclass could be ending their lactation, so the yields would be all over the place. There would only be one parameter to estimate for each HTD subclass, and the contemporaries would be constantly changing from TD to TD. VERY IMPORTANT!!! Herd-Test-Day factors are NOT recommended for TD models. Parity-Herd-Year-Season of calving should be used as contemporary groups.

12.3.6 Herd-Year-Season of Calving

The random factor of Herd-Year-Season of Calving (HYS), each level with its own curve (not one parameter but five), should be used to account for contemporaries.
Contemporaries are cows that share the same environmental effects throughout their lactation, from birth to being dried off. They encounter the same weather and management variables throughout, and they likely also have the same test days during their lactations. Perhaps 4 seasons per year of 3 months each could be used. However, if the number of cows per subclass is small, then larger season groups (4 months or 6 months) may be necessary in some herds, especially for the less numerous breeds. Legendre polynomials of order 4 should be used with this factor, and hence a covariance function matrix needs to be estimated for it. As a random factor in the model, it is less critical to have a minimum number of records per subclass, because just one test day record will suffice.

12.3.7 Additive Genetic Effects

Every animal with TD records has both parents identified. If a parent is unknown, then a phantom parent group is assigned. Ancestors without TD records also have unknown parents replaced by phantom parent groups. The groups are based on year of birth of the animal and whether it is a male or female animal. Phantom groups represent the four pathways of selection in dairy cattle, and years of birth. Phantom groups are necessary in order to estimate genetic trends without bias. Each animal additive genetic effect is fitted by a Legendre polynomial of order 4. A covariance matrix must also be estimated.

12.3.8 Permanent Environmental Effects

Because cows have more than one TD record per lactation, permanent environmental effects are modeled for each parity by Legendre polynomials of order 4. A covariance matrix is needed for this factor too.

12.3.9 Number Born

The number of offspring born at a parturition, in litter bearing species such as dairy goats and sheep, can have an effect on the milk yield of the female.
A female carrying four young apparently "knows" this is happening, and the body prepares by increasing the amount of milk that will be needed after birth to feed that number of young. This is a fixed environmental effect that might differ depending on the parity number of the dam, but it can be fit by Legendre polynomials of order 4.

12.3.10 Residual Effects

In the Canadian Test Day Model, the lactation is divided into 4 periods of various numbers of days, such that the residual variance is similar across days within a period, but different between periods. One should begin with many groups, perhaps 30 of ten days each, in an initial analysis to determine the best grouping of days. The point is, the residual variance changes throughout the lactation. Table 12.2 contains residual variances for milk yields in the first three parities for a small subset of Canadian Holstein dairy cattle born from 2005 through 2009.

Table 12.2
Residual variances for a TD model.

Days in Milk   Parity 1   Parity 2   Parity 3
1-45              7.86      13.96      16.42
46-115            5.01       8.12       9.33
116-265           3.95       5.41       6.24
266-365           3.57       4.36       3.60

12.4 Covariance Function Matrices

Many of the early studies of random regression models focused on the estimation of the covariance function matrices, and the subsequent graphs that could be made. Let ai represent the vector of random regression coefficients of an animal for parity i. This vector is of order 5 by 1 (order 4 Legendre polynomial). Then an analysis of 3 parities gives a covariance function matrix of order 15 by 15. The parts of this matrix are shown below in order 5 by 5 subgroups.
Var(a1) =
[  8.1910   0.2880  -0.6694   0.2360  -0.1407 ]
[  0.2880   1.4534  -0.1327   0.4590   0.4926 ]
[ -0.6694  -0.1327   0.5108  -0.1512   0.0713 ]
[  0.2360   0.4590  -0.1512   0.1855  -0.0524 ]
[ -0.1407   0.4926   0.0713  -0.0524   0.0766 ]

Cov(a1, a2) =
[  8.1921   0.8613  -0.5547   0.2076  -0.0466 ]
[  1.6933   1.1975  -0.2810   0.0614  -0.0956 ]
[ -0.6192   0.1654   0.2865  -0.1721   0.0372 ]
[  0.3242  -0.0281  -0.0950   0.1168  -0.0478 ]
[ -0.1003   0.1086   0.0856  -0.0424   0.0194 ]

Var(a2) =
[ 12.0818   1.6093  -0.6870   0.4367  -0.1217 ]
[  1.6093   3.0648   0.0456  -0.2809   0.0247 ]
[ -0.6870   0.0456   0.5917  -0.1626   0.0166 ]
[  0.4367  -0.2809  -0.1626   0.4004  -0.1092 ]
[ -0.1217   0.0247   0.0166  -0.1092   0.1696 ]

Cov(a1, a3) =
[  8.3749   0.4892  -0.5377   0.2686  -0.0281 ]
[  1.486    1.343   -0.1682  -0.0111  -0.0232 ]
[ -0.7102   0.2314   0.2984  -0.2155   0.0487 ]
[  0.3355  -0.0659  -0.1044   0.1136  -0.0136 ]
[ -0.1365   0.0979   0.0705  -0.0456   0.0072 ]

Cov(a2, a3) =
[ 11.4893   1.9730  -0.9016   0.4398  -0.2730 ]
[  1.8533   2.7023  -0.1320  -0.2323  -0.0117 ]
[ -0.6313  -0.0365   0.2919  -0.2010   0.8696 ]
[  0.2867  -0.2212  -0.1646   0.2461  -0.0805 ]
[ -0.0897   0.0631   0.0654  -0.0579   0.0185 ]

and

Var(a3) =
[ 13.5971   1.7354  -0.8951   0.3615  -0.3546 ]
[  1.7354   3.8151  -0.2634  -0.1821   0.0662 ]
[ -0.8951  -0.2634   0.9197  -0.2324   0.0873 ]
[  0.3615  -0.1821  -0.2324   0.5535  -0.1615 ]
[ -0.3546   0.0662   0.0873  -0.1615   0.2291 ]

A plot of the genetic variances within parities and across the lactation period can be obtained as shown below.
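The script that follows calls LPOLY(5), which is not a base R function. It is assumed here to return the coefficient matrix of the normalized Legendre polynomials (row m+1 holds the coefficients of phi_m on the powers of x), so that M %*% t(LAM) produces the covariates; a minimal sketch under that assumption, not the author's original code:

```r
# Sketch of an LPOLY(n) helper: coefficients of the normalized Legendre
# polynomials phi_m(x) = sqrt((2m+1)/2) * P_m(x), m = 0,...,n-1, built
# with Bonnet's recursion m*P_m = (2m-1)*x*P_{m-1} - (m-1)*P_{m-2}.
# This is an assumed reconstruction, not the author's original function.
LPOLY <- function(n) {
  P <- matrix(0, n, n)              # row m+1: coefficients of P_m on x^0..x^(n-1)
  P[1, 1] <- 1                      # P_0(x) = 1
  if (n > 1) P[2, 2] <- 1           # P_1(x) = x
  if (n > 2) {
    for (m in 2:(n - 1)) {
      # multiplying by x shifts the coefficient vector up one power
      P[m + 1, ] <- ((2 * m - 1) * c(0, P[m, 1:(n - 1)]) -
                     (m - 1) * P[m - 1, ]) / m
    }
  }
  sweep(P, 1, sqrt((2 * (0:(n - 1)) + 1) / 2), "*")   # normalize each row
}
LAM <- LPOLY(5)
round(LAM, 4)
```

With this helper, PH = M %*% t(LAM) in the plotting script contains phi_0(x), ..., phi_4(x) evaluated for each day on test.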
# Legendre polynomials
LAM = LPOLY(5)
ti = c(5:365)
tmin = 5
tmax = 365
qi = 2*(ti - tmin)/(tmax - tmin) - 1
x = qi
x0 = x*0 + 1
x2 = x*x
x3 = x2*x
x4 = x3*x
M = cbind(x0,x,x2,x3,x4)
PH = M %*% t(LAM)
Ka1 = matrix(data=c( 8.1910,  0.2880, -0.6694,  0.2360, -0.1407,
                     0.2880,  1.4534, -0.1327,  0.4590,  0.4926,
                    -0.6694, -0.1327,  0.5108, -0.1512,  0.0713,
                     0.2360,  0.4590, -0.1512,  0.1855, -0.0524,
                    -0.1407,  0.4926,  0.0713, -0.0524,  0.0766),
             byrow=TRUE, ncol=5)
Va1 = PH %*% Ka1 %*% t(PH)   # order 361 x 361
vg1 = diag(Va1)
# similar arrays for vg2, vg3, Ka2, Ka3 (not shown)
days = seq(5, 355, 50)       # tick positions for the x-axis
par(bg="cornsilk")
plot(vg1, col="blue", lwd=5, type="l", axes=FALSE,
     xlab="Days on Test", ylab="Genetic Variance", ylim=c(4,16))
axis(1, days)
axis(2)
title(main="Genetic Variances")
lines(vg2, col="red", lwd=5)
lines(vg3, col="darkgreen", lwd=5)
points(55, 15, pch=0, col="blue", lwd=3)
text(55, 15, "First Parity", col="blue", pos=4)
points(55, 14, pch=0, col="red", lwd=3)
text(55, 14, "Second Parity", col="red", pos=4)
points(55, 13, pch=0, col="darkgreen", lwd=3)
text(55, 13, "Third Parity", col="darkgreen", pos=4)

The plot is shown in Figure 12.1. An obvious observation is that there are distinct differences in the variance curves across the lactation between parities. Also, the genetic variance is highest at the beginning and at the end of lactation, for each parity. This implies that there are great differences between cows in the amount of milk produced at the start of lactation; after 55 days the variances are smaller by nearly half, but they tend to increase again towards day 365. Some researchers have interpreted the high variances at the start and end of tests as artifacts of the Legendre polynomials. However, similar shapes are obtained using other polynomials (e.g. Ali and Schaeffer, 1987) of order 4. Spline functions tend to flatten the beginning and end a little more, but the general shape persists.
Figure 12.1: Genetic Variances. Genetic variance plotted against days on test (5 to 355) for first, second, and third parities.

The only way to determine the correct shape of the variances would be to use a multiple trait model in which yields are divided into 36 ten-day periods; genetic variances could then be estimated for each period, along with covariances between periods. A Legendre polynomial of order 4 could then be fit to the 36 by 36 covariance matrix and compared to the estimates from the test day model. The shape of the variance curves is less important than the results as a whole, which include the estimated breeding values. The residual variances are greatly reduced. The variances that need to be correct are Cov(ai, aj) for all pairs of parities.

12.5 Expression of EBVs

Estimated breeding values (EBV) in random regression models come in vectors of length equal to the order of the Legendre polynomials. The problem was how to condense 5 breeding values for a curve into one value for a single trait, like milk yield. Dairy cattle producers were used to a standard called "305-day yields". The solution was to calculate the daily milk yield for each day of lactation, and then to sum those daily yields from day 5 through 305. (The yield of the first 4 days was typically used to feed the newborn calf its colostrum, which provides immunity.) Let the solutions for one animal's additive genetic value for first parity milk yield be

a1i = ( a1i0  a1i1  a1i2  a1i3  a1i4 )',

then the daily yield (DY)ij for animal i on the jth day would be

DYij = phi_j0 a1i0 + phi_j1 a1i1 + phi_j2 a1i2 + phi_j3 a1i3 + phi_j4 a1i4,

where phi_jm is a Legendre polynomial covariate. The 305-d milk yield, M305i, is the sum of the daily yields,

M305i = sum(j=5 to 305) DYij.
Because the breeding values are constant in the calculation of every daily yield,

M305i = (sum(j=5 to 305) phi_j0) a1i0 + (sum(j=5 to 305) phi_j1) a1i1 + (sum(j=5 to 305) phi_j2) a1i2 + (sum(j=5 to 305) phi_j3) a1i3 + (sum(j=5 to 305) phi_j4) a1i4,

or

M305i = c0 a1i0 + c1 a1i1 + c2 a1i2 + c3 a1i3 + c4 a1i4,

where the cj are constants representing the sums of the Legendre polynomial covariates, which can be obtained by the following R script.

# PH contains the covariates for days 5 to 365 (rows 1 to 361)
ka = c(1:301)                  # rows for days 5 to 305
P305 = PH[ka, ]
C305 = t(P305) %*% jd(301,1)   # jd(n,1): n x 1 vector of ones
C305
[1,] 212.839141
[2,] -61.441368
[3,] -51.778637
[4,] -29.763643
[5,]  -1.346922

Now multiply the constants times the EBVs of the random regression coefficient solutions for each animal, and you have 305-d EBVs for ranking animals.

One can question whether 305 days should be the standard length of lactation. In 2016 many cows lactated for longer than 305 days, and the analysis was for test day yields up to 365 days. So a new standard could be 365-day yields. The constants to use for that standard would be

[1,] 255.2655
[2,]   0.0000
[3,]   1.5855
[4,]   0.0000
[5,]   2.1410

For dairy sheep and goats the standard length might be less than 305 days because those two species do not lactate as long as cattle. Note that in the dairy cattle example, it is not valid to calculate EBV for daily yields beyond 365 days, because only test day yields from days 5 to 365 were analyzed.

One of the first new EBVs in dairy cattle to result from random regressions was persistency. Persistency is the ability of a cow to milk at a high level over much of the lactation period. Selecting for it would allow for better feeding of animals, which could be housed in groups according to high, medium, or low persistency. The trouble was how to define persistency in a random regression model setting. The variable a1i1 was itself a measure of persistency, but it did not have any units. Animals with high values were more persistent.
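The constants c0, ..., c4 can be reproduced without the LPOLY() and jd() helpers; a self-contained sketch, assuming the covariates are normalized Legendre polynomials on days rescaled from [5, 365] to [-1, 1]:

```r
# Sketch: reproduce the c_j constants as column sums of the normalized
# Legendre covariates phi_0(x),...,phi_4(x) over days 5 to 305.
# This is an assumed reconstruction of what LPOLY() and jd() compute.
td <- 5:305
x  <- 2 * (td - 5) / (365 - 5) - 1
PHI <- cbind(sqrt(1/2) + 0 * x,
             sqrt(3/2) * x,
             sqrt(5/2) * (3 * x^2 - 1) / 2,
             sqrt(7/2) * (5 * x^3 - 3 * x) / 2,
             sqrt(9/2) * (35 * x^4 - 30 * x^2 + 3) / 8)
C305 <- colSums(PHI)
round(C305, 4)   # leading elements match 212.8391 and -61.4414 above
```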
Because dairy producers could not relate to a number like a1i1, other measures were proposed. The idea was to have some number that represented the downward slope of the curve after the peak yield of lactation. Cows differed in the day on which they expressed peak yield, so the initial point had to be well after the day of peak yield. The measure was also desired to be independent of peak yield or total 305-d yield. Suppose the yield on day 60 of an average, first parity cow was 90 kg of milk and on day 260 was 68 kg. Then calculate

Persist = (DY260 + 68) / (DY60 + 90),

which should be a number from 0 to 1 in most situations. The higher the value, the more persistent the cow. The average first parity cow would have a value of (68/90) = 0.756. Later parity cows tend to have lower persistency than first parity heifers. Note that a cow could have a persistency value greater than one, but that should happen very infrequently.

Egg laying production in poultry is similar to lactation production in dairy cattle, and a similar model approach could be used. Instead of daily production, one might consider weekly production.

Chapter 13

Growth

Growth curves have been studied in many species of plants and animals, but usually with non-linear models. Growth is the accumulation of size and mass of an organism over time. For most agricultural species, growth to maturity takes only 3 to 4 years at most, but for humans and other larger mammals, growth can take decades. In beef cattle, growth is important from birth until the animal reaches market age, and often only the period from weaning to one year of age is of interest. In the latter period, growth can be considered almost linear, with a slight quadratic shape; that is, growth tends to slow down as the animal matures. Early growth from birth to weaning is often ignored, although it is important in the overall scheme.
A non-linear mathematical model that describes growth from birth to maturity is the Gompertz function, where weight at time t, WTt, is given by the following equation:

WTt = BW + A * [1.0 - exp(-exp(B) * t^C)],

where t is the unit of time, usually in days, BW is average birthweight, and A, B, and C are parameters that define the shape of the growth curve. A is related to mature weight, B is related to the day of change from increasing growth rate to decreasing growth rate, and C is related to the steepness of growth, or how quickly an animal grows to maturity. Predicted body weights are positive at all ages, and weights hardly ever decrease, unless an animal is being starved or is sick. There are only 4 parameters to estimate (if you include BW), which means that 5 or more weights per animal are needed to estimate all of the parameters. Unfortunately, this is a nonlinear system to solve; a differential evolution algorithm can be used. Figure 13.1 shows the growth curve of a pig from birth to maturity, where A = 272, B = -12.8, and C = 2.65, and birthweight is 1.5 kg.

Figure 13.1: Growth Curve of Pigs. Weight (kg) plotted against days of age, from birth to 200 days.

The figure emphasizes that growth is cumulative. From the curve, the amount of weight gained each day can be computed, as shown in Figure 13.2. This is known as average daily gain, ADG. As can be seen in the figure, ADG is not constant over the growth period.

Figure 13.2: Growth Curve of Pigs. Daily weight gain (kg) plotted against days of age.

Hence, from birth to about 100 days of age, pigs put on weight faster and faster. After 100 days, their rate of gain declines. There are problems with measuring ADG. Firstly, the magnitude of ADG is small, only 1 to 2.5 kg per day, so weigh scales must be precise. Secondly, weight gain depends on the time of day at which it is taken. Did the pig just defecate or just eat breakfast?
The amount eaten or lost could be as much as 1 to 2.5 kg. Lastly, you would need to weigh pigs every day, which would be very labour intensive unless it was computerized and automated. The amount of variation in ADG from day to day would be large for one animal. Cumulative weights, in contrast, keep getting larger as the animal ages. Total weights can be off by 1 to 2.5 kg without changing the growth curve dramatically, and the pigs do not need to be weighed daily, but obviously there are key times when pigs should be weighed. Birthweights tend to be small relative to mature weights. Thus, whether the weight is 1.5 kg or 3 kg at birth does not alter the growth curve substantially, but weights at 200 days of age can differ by 10 to 20 kilograms between animals, giving very different growth curves.

13.1 Curve Fitting

13.1.1 Spline Function

Random regression models are linear models; thus the nonlinear Gompertz function needs to be approximated by linear regressions. The phenotypic shape may be approximated by a spline function. Let tmax be the maximum age, and in terms of the pig growth curve, let that be 240 days of age. tmin is day 1. Let T = t/tmax, and U = (t - 100)/tmax for t > 100, otherwise U = 0. Day 100 is when growth rate starts to decrease with age (Figure 13.2). The phenotypic curve might be

yt = b0 + b1 T + b2 T^2 + b3 T^3 + b4 U + b5 U^2 + b6 U^3.

Estimates of the regression coefficients from the data in Figure 13.1 are

b0 =     3.77839
b1 =   -96.82071
b2 =  1038.66035
b3 =  -393.28000
b4 =    81.43243
b5 = -1306.77963
b6 =   579.37098

A seven covariate function to model the trajectory of growth seems too large to be practical. There are places along the curve that are not fit well. At the beginning of the growth curve, the spline function predicts that weights actually decrease after birth, and then turn upwards. Also, weights do not plateau at maturity, but actually begin to decrease. The errors in prediction are at the beginning and end.
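These fitting problems can be illustrated numerically; a sketch comparing the two curves at a few ages, using the Gompertz parameters of Figure 13.1 and the spline coefficients quoted above:

```r
# Sketch: compare the Gompertz curve of Figure 13.1 with its cubic spline
# approximation, using the parameter values quoted in the text.
gompertz <- function(t, BW = 1.5, A = 272, B = -12.8, C = 2.65) {
  BW + A * (1.0 - exp(-exp(B) * t^C))
}
spline_wt <- function(t, tmax = 240) {
  T <- t / tmax
  U <- ifelse(t > 100, (t - 100) / tmax, 0)   # second segment starts at day 100
  b <- c(3.77839, -96.82071, 1038.66035, -393.28000,
         81.43243, -1306.77963, 579.37098)
  b[1] + b[2]*T + b[3]*T^2 + b[4]*T^3 + b[5]*U + b[6]*U^2 + b[7]*U^3
}
round(cbind(t = c(1, 50, 120, 240),
            gompertz = gompertz(c(1, 50, 120, 240)),
            spline   = spline_wt(c(1, 50, 120, 240))), 2)
```

Near birth the spline gives about 3.4 kg and initially declines, while the Gompertz curve starts at the 1.5 kg birthweight; by 240 days both are near the mature weight.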
In between, weights are predicted relatively accurately. The inflection point of 100 days was assumed known, but this point would not be 100 days for every animal, and would need to be estimated. Thus, the spline function is not totally suitable.

13.1.2 Classification Approach

Given the problems with the spline function, the classification approach could possibly work much better for all groups of animals, without making any assumptions about the shape of the growth curve or the position of the inflection point. Over the 240 day age range, make 48 five-day periods and estimate the mean weights within each period. Unfortunately, that requires estimating 48 means per curve, and thus there need to be many data points within each period.

13.2 Model Factors

13.2.1 Observations

Growth observations can be weight, height, length, feed intake, backfat thickness, or loin eye area. Depending on the situation, the growth period could be from birth to weaning, weaning to slaughter weight, or birth to maturity. If the period is short term, growth is often linear during this period. A lifetime curve would look the same as in Figure 13.1. This determines the order of the random regression covariates that are required. If growth is recorded after weaning, then maternal genetic effects may be unnecessary and safely ignored. So the factors listed in this section may or may not be needed, but should at least be considered in developing a working model for growth.

On a per animal basis there should probably be five or more measures of growth. Management systems where animals can be weighed automatically every day should be considered, or where feed intake can be recorded daily. However, if the management system does not allow weighing more than four times during the life of the animal, then random regression models should not be applied.
Multiple trait models should be considered as an alternative, where each weight is a different trait, like birth, weaning, and end-of-test weights.

13.2.2 Year-Month of Birth-Gender (fixed)

The first fixed factor in the model needs to account for time trends in growth curves for each gender separately. In some species the male is sometimes neutered, and so a third gender is needed for these animals, even if the neutering occurs later, after weaning for example. The classification approach will be used for this factor. Thus, 48 period means within each subclass implies there should be more than 48 observations within each subclass. With 20 years of data, times 12 months of birth per year, times 3 genders, that gives 720 subclasses. Assuming a minimum of 50 observations per subclass, there should be more than 36,000 weight measures.

13.2.3 Year - Age of Dam - Gender (fixed)

The birth-dam can be either the genetic-dam or a recipient dam, in the case of an embryo transfer. Offspring from older birth-dams often outgrow offspring of first time mothers. This might be because offspring from young mothers are smaller than those of older dams, or because young mothers do not provide as many nutrients in the milk as older females do. Age of birth-dam is usually defined within parity groups: parity 1 with 2 or 3 age groups, parity 2 with 4 or 5 age groups, and so on. Interactions with year of birth and gender of offspring probably exist, so it is best to account for them. Years of birth may be grouped together if there are not enough data. Legendre polynomials of order 3 can be used for this factor. Hence we are estimating deviations from the standard curves defined by the Year-Month of Birth-Gender subclasses.

13.2.4 Contemporary-Management Groups (random)

During growth, animals are usually moved to different management groups as they get bigger or older. Thus, animals belong to a different contemporary group each time they are weighed.
In pigs there is the farrowing barn, in which a dozen or more sows give birth within the same week. All of the piglets could be one contemporary group, separated by gender. After 20 days, the piglets are moved to growing pens where pigs of different litters are merged and become competitors for feed and water. Later those animals are moved to finishing pens where they are fed to market weight. Some could be selected as potential herd replacements and moved to a different facility. The contemporaries of a pig are, therefore, constantly changing.

Contemporary-Management groups are defined as pigs of roughly the same age and gender within the same physical environment at the time of weighing. The contemporary-management group accounts for the environmental effects at one point in time for a group of similarly treated individuals. We do not estimate a growth curve for each contemporary-management group, but only the effect on weights of pigs at one point in time. Contemporary-management groups are a random factor in the model, and there are many of these groups. The number of animals within a contemporary group is not critical.

13.2.5 Litter Effects (random)

In litter bearing species, such as sheep, goats, and swine, there is a common litter effect for the group of full-sibs. This has to be matched to the birth-dam or the raise-dam, if an animal is cross-fostered to another dam after birth. This is also a random factor in the model and can be modeled by Legendre polynomials of order 2.

13.2.6 Animal Additive Genetic Effects (random)

In growth data, the additive genetic effects are known as direct genetic effects (Willham, 1960s). The usual additive genetic relationship matrix, A, is used, as are phantom parent groups for animals with unknown parents. Legendre polynomials of order 3 are used to model the animal deviations from the fixed trajectories, and hence four parameters per animal are to be estimated for additive genetic effects.
13.2.7 Animal Permanent Environmental Effects (random)

Because animals are weighed several times, permanent environmental effects must be taken into account. Legendre polynomials of order 3 are used for this factor, which only exists for animals with records. Cumulative environmental effects could be considered as well, but constructing the correct mixed model equations could become complicated.

13.2.8 Maternal Genetic Effects (random)

Growth, in mammals, is influenced by maternal genetic effects (Willham, 1960s; see Chapter 8). That is, the female that gives birth provides an environment during the early growth period of that offspring. Maternal effects decrease as the animals age; however, some maternal effects can persist a long time. The female provides this environment to every offspring. Her genetic maternal ability is passed along to her progeny (male and female), but is only expressed when her female progeny have their own offspring. The three types of dams, i.e. genetic dam, birth dam, and raise dam, may need to be considered here (see Chapter 8). A third order Legendre polynomial would be used for this random factor too. The additive genetic relationship matrix and phantom parent groups are also utilized for this factor.

13.2.9 Maternal Permanent Environmental Effects (random)

Because dams have more than one progeny in the data, there are non-genetic permanent environmental effects associated with each dam. Legendre polynomials of order 3 are used for this factor too.

13.2.10 Residual Variances (random)

There should be a different residual variance for every day of age, and these variances should get larger over time, as weights increase. One can calculate phenotypic variances for each of the 48 five-day periods, separately for each gender. Then express all of the variances relative to the variance at birth.
The assumption is that the residual variances will follow that same relative pattern. Residual variances can then be estimated for each five-day period.

13.2.11 Summary

Growth is a very complicated trait. The main problem is having enough weight measurements on an animal to be able to estimate the trajectories and covariance functions. Maternal genetic effects, and their correlation with direct genetic effects, add a degree of difficulty to the model analyses. Also, if each animal can have up to three different dams influencing its growth, this too can make a random regression analysis difficult. If there are only 3 or 4 weights per animal, it may be much easier to analyze them with a multiple trait model, where each weight is taken at roughly the same age in all animals. The shape of the trajectory is then not important, and analyses can be simplified.

13.3 Covariance Function Matrices

A study of pigs on test from day 40 to 250 at a Quebec test station was conducted on 10,000 pigs. Two pigs per litter were represented in the trial. Litter effects and maternal effects were ignored. Quadratic random regressions (using Legendre polynomials) were used for contemporary groups, animal additive genetic effects, and animal permanent environmental effects. Each pig had 7 or more weight measurements during the test, and almost daily feed intakes. The covariance function matrices were estimated using Gibbs sampling in a Bayesian approach to a six-trait model. The six traits were number of times visiting the feeder (daily), time spent eating (daily), feed intake (daily), weight, fat thickness, and loin thickness. Below are the submatrices for weights only.

Kaa =
[ 139.47  126.60   42.25 ]
[ 126.60  125.49   50.19 ]
[  42.25   50.19   26.30 ]

Kpe =
[ 117.77   86.97   13.04 ]
[  86.97   76.79   22.74 ]
[  13.04   22.74   16.81 ]

and

Kcg =
[  80.85   38.39    7.13 ]
[  38.39   27.97   11.46 ]
[   7.13   11.46    8.26 ]

Using the above matrices, variances for each day on test were calculated, and then plotted (Figure 13.3).
# Legendre polynomials
LAM = LPOLY(3)
ti = c(40:250)
tmin = 40
tmax = 250
qi = 2*(ti - tmin)/(tmax - tmin) - 1
x = qi
x0 = x*0 + 1
x2 = x*x
M = cbind(x0, x, x2)
PH = M %*% t(LAM)
Vpe = PH %*% Kpe %*% t(PH)
vgpe = diag(Vpe)
Vcg = PH %*% Kcg %*% t(PH)
vgcg = diag(Vcg)
Vaa = PH %*% Kaa %*% t(PH)
vgaa = diag(Vaa)
par(bg="aquamarine")
plot(ti, vgaa, col="blue", lwd=5, type="l", xlab="Days on Test",
     ylab="Variance, kg-squared", ylim=c(0,1000))
title(main="Variances Over Days on Test")
lines(ti, vgcg, col="red", lwd=5)
lines(ti, vgpe, col="darkgreen", lwd=5)
points(55, 900, pch=0, col="blue", lwd=3)
text(55, 900, "Genetic", col="blue", pos=4)
points(55, 700, pch=0, col="red", lwd=3)
text(55, 700, "Contemporary Group", col="red", pos=4)
points(55, 500, pch=0, col="darkgreen", lwd=3)
text(55, 500, "PE", col="darkgreen", pos=4)

Figure 13.3 Variances over days on test: genetic, contemporary group, and PE variances (kg-squared) plotted against days on test.

Because of the quadratic regression, variances increase as days on test increase, but the larger increases did not occur until after 150 days. Higher order polynomials were not appropriate for these data. Residual variances were divided into 23 periods of 8 or 9 days each. The residual variances ranged from 3.5 kg² to 17.18 kg², and so were much smaller than the other components.

13.4 Expression of EBVs

With weight as the growth trait, there are two options for expressing the breeding value of an animal and ranking animals. One option is to choose a particular age and rank animals on their EBV for weight at that age. The other is to determine the number of days for an animal to reach a particular weight, for example, 110 kg. The latter option is essentially a growth rate, and you want to select animals with the smaller age. For swine and some other species, growth has to be combined with feed intake. Which animals grew the fastest and ate the least amount of feed?
So an index must be constructed to select for optimum growth. In addition, a fat carcass is usually not desired, and so carcass quality also needs to be included in the index. Increasing weight and growth rate could also have adverse consequences on ease of birth through larger birthweights. Growth is more than a single trait selection problem.

Chapter 14

Survival

The lifetime of an animal is the age at which it dies. For agricultural livestock, however, humans often determine when an animal dies. Some animals are voluntarily culled because the owner perceives that they are of lesser value than other animals. At the same time, some animals are involuntarily culled due to old age, accident, or disease. Producers generally want animals that are robust and hardy, and which could live a long time. It costs money to feed and raise an animal to maturity. Animals should have "longevity" or "stayability". Animals should be functional, either at producing offspring or at producing milk, meat, eggs, or wool.

The date of an animal dying or leaving the herd (flock) for any reason gives an uncensored record of survival. An animal's record is said to be censored when it has not yet died or been culled, due to a lack of adequate opportunity. All current, active animals are censored. When analyzing survival there are two possible situations:

1. Censored data are removed from the analysis, or
2. Censored data are included in the analysis.

Animals can relocate from one herd to another through sales. Such animals may be considered culled from the original herd, but are actually still alive and productive in another herd. Reasons for disposal from herds are important for determining whether records are censored. The analysis of survival should include censored data, in an appropriate manner. The age of an animal at the time it is culled is the observation, measured in days, months, or years.
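Censored and uncensored ages can be combined when estimating survival probabilities from past data. A minimal sketch in Python (the book's scripts are in R), with a hypothetical set of records; a censored record only tells us the animal survived at least to its current age:

```python
import numpy as np

# Hypothetical records: (age at exit in months, censored flag).
records = [(20, False), (45, False), (66, False), (30, True), (80, True)]

max_age = 100
alive = np.zeros(max_age + 1)   # animals known to be at risk at each age
died = np.zeros(max_age + 1)    # animals that exited (uncensored) at each age

for age, censored in records:
    alive[: age + 1] += 1       # at risk from age 0 through exit age
    if not censored:
        died[age] += 1

# P(survive to t+1 | alive at t), estimated from the data.
cond = np.ones(max_age)
for t in range(max_age):
    if alive[t] > 0:
        cond[t] = 1.0 - died[t] / alive[t]

# Product-limit survival curve built from the conditional probabilities.
surv = np.cumprod(cond)
```

Censored animals contribute to the denominators (animals at risk) for as long as they are observed, without ever being counted as deaths.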
This trait is not normally distributed. For censored animals, a prediction of length of productive life is usually made based on probabilities estimated from past data. Thus, if an animal has lived to time t, then the probability that it will live to the next time, t + 1, is used as the observation.

A different approach to survival analyses is to define a fixed time period, such as survival to 60 months of age, yes or no. Survival to 75 months of age would then be another binary trait.

In a non-linear approach, time to failure is modelled, and censored data can be included. A survival function is derived, and from this a hazard function is created, which is influenced by time dependent variables and time independent variables.

14.1 Survival Function

Consider 100 months after first calving as the productive life for a dairy cow. A survival function goes from 1 for an animal that is alive to 0 when the animal is dead or culled. A vertical line from 1 to 0 indicates the moment in the productive period when the animal's function changes, i.e. when the animal is removed from production. The survival function for one animal is a one-step function. Figure 14.1 shows the survival functions of 3 cows, where one has died at 20 months after first calving, one at 45 months, and one at 66 months. The fourth graph in the lower right of Figure 14.1 is the average step function for the three cows combined.

Figure 14.1 Survival functions of Cows 1, 2, and 3, and the average survival function of the 3 cows (frequency plotted against months after first calving).

As more and more cows are accumulated and averaged together, the survival function for the population becomes a smooth curve, as in Figure 14.2.
The values on the curve give the expected probability of an animal being alive x months after first calving. By the time a cow reaches 100 months, it has a fairly high probability of not being alive.

Figure 14.2 Population survival function (frequency plotted against months after first calving).

The approach of Veerkamp et al. (1999) and Galbraith (2003) was to apply a random regression model. For each cow there would be 100 observation points of 0 or 1. A cow that has lived 30 months past first calving and which has not yet been culled is a censored record. If a cow was censored, then the step function would be just ones up to the point of censoring (e.g. 30 months), and the next seventy values would be unknown, or not observed. The survival function in Figure 14.2, for this example, is

S_t = (n - d_t) / n

where t is the month in which an animal was last alive, n is the total number of live animals that had the opportunity to live for 100 months, and d_t is the number that have died up to and including period t. Eventually d_t approaches n.

14.2 Model Factors

A population survival function is shaped similarly to a lactation curve, and so using Legendre polynomials of order 4 (5 covariates) may be appropriate for fitting the general shape. However, because the scale goes from 1 down to 0, at the beginning of the curve many animals are alive, so that the variation in the first months after calving is very small. In general, the variance is the frequency times one minus the frequency, which has its greatest value when the frequency is 0.5. The variance becomes smaller again at the end, when most animals are dead. Legendre polynomials of order 4 (5 covariates) will be used to model the random animal additive genetic and permanent environmental effects.
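The survival function S_t = (n - d_t)/n can be computed directly from culling ages. A minimal sketch in Python (the book's scripts are in R), with a small hypothetical set of uncensored culling times:

```python
import numpy as np

# Hypothetical culling times (months after first calving), all uncensored.
cull_month = np.array([20, 45, 66, 30, 85, 50])
n = len(cull_month)
months = np.arange(1, 101)

# d_t = number dead up to and including month t; S_t = (n - d_t) / n.
d = np.array([(cull_month <= t).sum() for t in months])
S = (n - d) / n
```

The curve starts at 1 (all animals alive), steps down at each culling time, and reaches 0 once every animal has been culled.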
14.2.1 Year-Season of Birth-Gender (fixed)

The classification approach can be taken to model the fixed time factor curves for animals born in the same year and season of the year (perhaps month). If both genders are being analyzed together, then the additional interaction with gender is needed. In the dairy cow example, there would be 100 categories of months alive after first calving. That is a lot of levels (i.e. parameters) to be estimated, and requires a lot of animals. At the same time, there are 100 observations for all uncensored animals. If one was studying mice, then the time scale would have to be altered, and survival might be related to time after being infected with a deadly virus. Or there could be a study of bacteria and their survival under different antibiotics, measured in hours or minutes. In some cases, there might be only one overall fixed curve rather than several.

14.2.2 Age at First Calving (fixed)

For dairy cows, the age at first calving could be important to survival after calving. For mice and bacteria an important variable might be the length of exposure time before the trial begins.

14.2.3 Production Level (fixed)

Dairy cows that produce at a high level, and therefore make more profit for the owner, tend to have a survival advantage. Cows could therefore be divided into 3 or 5 categories of production level based upon their EBVs for milk or protein yields. These groups could be modelled by classification variables or with order 4 Legendre polynomials; a study should be conducted to see which alternative is more suitable. Adjusting for production level makes the survival evaluations free of production level, and the result is called functional survival.

14.2.4 Conformation Level (fixed)

Another important factor in dairy cows is their conformation scores. More favourable looking cows (scoring Good Plus or better) have higher survival than cows scoring Good, Fair, or Poor.
Six levels of conformation could be formed and fitted either as classification variables or with order 4 Legendre polynomials. The survival EBVs would then be free of both production and conformation considerations.

14.2.5 Unexpected Events

Unexpected events, such as outbreaks of disease or drought, may have a short or long term impact on animal survival. Animals that would not normally be culled have to be culled to guarantee the survival of the herd. This may affect certain types of animals (e.g. low producers, older animals) more than others. A simple year-month-age of cow subclass effect (not a curve, but an average percentage survival) could be used to model routine and unexpected downturns in survival. This could be across all herds, or within provinces or regions of a country. Besides increases in culling, this factor would also identify periods when cows were difficult to find, such that culling was below normal levels.

14.2.6 Contemporary Groups (random)

Contemporary groups are random effects in the model, and hence are modelled with order 4 Legendre polynomials. Contemporaries, for survival analyses, would be animals born in the same year-season, of the same gender, and undergoing the same or similar management practices up to first calving. Because survival looks at animals over many months and years, animals will move around and be placed in different environments with different managers, and therefore under different decision processes. Accounting for all of these possibilities is difficult, so the easy option is to leave animals in their original contemporary group throughout their lifetime. All subsequent changes cause variation that goes into the residual effects.

14.2.7 Animal Additive Genetic Effects (random)

Animal additive genetic effects are random, also modelled by order 4 Legendre polynomials.
The heritability of survival is generally low, due to all of the environmental influences on the decisions to keep or cull animals.

14.2.8 Animal Permanent Environment Effects (random)

Animal permanent environmental effects are random and account for some of the environmental influences on each animal. Legendre polynomials of order 4 could be used for this factor too.

14.2.9 Residual Variances (random)

For dairy cows, looking at 100 months after first calving, this period could be divided into twenty subgroups of five months each. Some trial and error is needed to get the groupings correct.

14.3 Example

Because this method of survival analysis is not common, a small example will be used to illustrate it. Consider a beef cattle situation in which we want to look at the survival of cows, as indicated by number of calvings, where the maximum is set at nine. Thus, there are just nine categories, each representing about 12 months. Assume the data are from two years and six contemporary groups, for a total of 30 cows. Including ancestors without survival data, there are 53 animals in total. None of the animals were inbred. The data are shown in Table 14.1. Note that four of the records in year 2 were censored, which means those animals were still active, i.e. not yet culled.

Table 14.1 Example Beef Cow Survival Data.

Year  CG  Cow  Sire  Dam  Calvings
1     1   24   1     9    7
1     1   25   1     10   2
1     1   26   2     11   5
1     1   27   2     12   6
1     1   28   2     13   8
1     2   29   2     14   3
1     2   30   2     15   2
1     2   31   3     16   1
1     2   32   3     17   4
1     2   33   3     18   4
1     2   34   3     19   6
1     3   35   3     20   6
1     3   36   4     21   6
1     3   37   4     22   9
1     3   38   4     23   3
2     4   39   5     9    5
2     4   40   5     10   7*
2     4   41   5     12   5
2     4   42   6     13   6*
2     4   43   6     29   4
2     4   44   6     30   6*
2     5   45   5     14   2
2     5   46   6     17   6
2     5   47   7     18   8*
2     5   48   7     19   2
2     5   49   7     35   4
2     6   50   5     20   4
2     6   51   7     22   1
2     6   52   8     23   3
2     6   53   8     25   5

* indicates censored records

The data can be set up in R as follows.
# Example data for RRM of survival
cg = c(rep(1,5), rep(2,6), rep(3,4), rep(4,6),
       rep(5,5), rep(6,4))   # contemporary groups
YR = c(rep(1,15), rep(2,15)) # Two years
# Pedigrees
aid = c(1:53)
sid = c(rep(0,23), 1,1,2,2,2,2,2,3,3,3,
        3,3,4,4,4,5,5,5,6,6,6,5,6,7,7,7,5,7,8,8)
did = c(rep(0,23), c(9:23), 9,10,12,13,29,
        30,14,17,18,19,35,20,22,23,25)
bi = c(rep(1,23), rep(0.5,30))
# Inverse of additive relationship matrix
AI = AINV(sid, did, bi)
y = c(7,2,5,6,8, 3,2,1,4,4,6, 6,6,9,3,
      5,7,5,6,4,6, 2,6,8,2,4, 4,1,3,5)
yb = c(9,9,9,9,9, 9,9,9,9,9,9, 9,9,9,9,
       9,7,9,6,9,6, 9,9,8,9,9, 9,9,9,9)
sum(yb)   # total number of observations

The vector y contains the number of calvings completed, and yb contains the number of calvings that could have been observed up to the current date. The covariance function matrices for the random effects were as follows.

Ka = [  .36814  -.17200   .32359   .00000  -.01844 ]
     [ -.17200   .35300  -.24600  -.03448   .00000 ]
     [  .32359  -.24600   .55338   .00000  -.04292 ]
     [  .00000  -.03448   .00000   .06567   .00000 ]
     [ -.01844   .00000  -.04292   .00000   .06466 ]

for the additive genetic effects,

Kp = [  .31894  -.13760   .25719   .00000  -.01844 ]
     [ -.13760   .30380  -.19680  -.03448   .00000 ]
     [  .25719  -.19680   .46318   .00000  -.04292 ]
     [  .00000  -.03448   .00000   .06567   .00000 ]
     [ -.01844   .00000  -.04292   .00000   .06466 ]

for animal permanent environmental effects, and

Kc = [  .68214  -.04600  -.03141   .00000  -.01844 ]
     [ -.04600   .99000   .00900  -.03448   .00000 ]
     [ -.03141   .00900   .14638   .00000  -.04292 ]
     [  .00000  -.03448   .00000   .06567   .00000 ]
     [ -.01844   .00000  -.04292   .00000   .06466 ]

for contemporary groups. Therefore, the assumed heritabilities by number of calvings are shown in Table 14.2.

Table 14.2 Heritabilities by Number of Calvings.
Number of Calvings   Heritability
1                    0.420
2                    0.349
3                    0.231
4                    0.153
5                    0.191
6                    0.185
7                    0.175
8                    0.218
9                    0.309

The Legendre polynomials for the random factors were order 4, and set up as

reglp = jd(9,5)*0
tmin = 1
tmax = 9
no = 5
for(i in 1:9){
  reglp[i, ] = LPTIME(i, tmin, tmax, no)
}

The design matrices need to be constructed for years, contemporary groups, and animal genetic and animal permanent environment factors. Recall that each cow with survival data has 9 observations, unless the record is censored, in which case there are fewer than 9 observations. In total in this example, there were 261 observations. We also need to create the observation vector, YOB, of zeros and ones, and the residual variances for each of the nine categories. For simplicity, let the residual variances be equal to the numbers given in pq below. This script makes the design matrix for years. The year effects use the classification approach, so that there are nine parameters for each year.

X = jd(261,18)*0
YOB = rep(0,261)
ri = YOB
nly = length(y)
pq = c(1:9)
pq[1]=0.09; pq[2]=0.16; pq[3]=0.21
pq[4]=0.24; pq[5]=0.25; pq[6]=0.24
pq[7]=0.21; pq[8]=0.16; pq[9]=0.09
pq = 1/pq   # residuals inverted
k = 0
for(i in 1:nly){
  my = YR[i]
  loff = (my-1)*9
  ja = y[i]   # number of ones
  jb = yb[i]  # number of obs for animal
  for(j in 1:jb){
    k = k+1
    X[k, j+loff] = 1
    YOB[k] = 1
    ri[k] = pq[j]
    if(j > ja) YOB[k] = 0
  }
}

Similarly, the design matrix for contemporary groups is generated as follows. Because contemporary groups are random, they are modelled by order 4 Legendre polynomials (from reglp).

# Contemporary Groups Zc
Zc = jd(261,30)*0
k = 0
for(i in 1:nly){
  mc = cg[i]
  loff = (mc-1)*5 + 1
  lofl = loff + 4
  ja = y[i]
  jb = yb[i]
  for(j in 1:jb){
    k = k+1
    Zc[k, c(loff:lofl)] = reglp[j, ]
  }
}

Animal additive genetic effects are also modelled by order 4 Legendre polynomials, as are the animal permanent environmental effects, whose columns are a subset of the columns for the animal additive genetic effects.
# Animal Additive
mcol = 53*5
manc = 23*5 + 1
anwr = c(24:53)   # animal IDs with records (cows 24 to 53)
Za = jd(261,mcol)*0
k = 0
for(i in 1:nly){
  ma = anwr[i]
  loff = (ma-1)*5 + 1
  lofl = loff + 4
  ja = y[i]
  jb = yb[i]
  for(j in 1:jb){
    k = k+1
    Za[k, c(loff:lofl)] = reglp[j, ]
  }
}
Zp = Za[ , c(manc:mcol)]

After the design matrices, one sets up the matrices for the mixed model equations, then solves them.

# setup MME and solve
ZZ = cbind(Zc, Za, Zp)
RI = diag(ri)
# make covariance matrices
Gai = solve(Ga)   # Ga = Ka
Gpi = solve(Gp)   # Gp = Kp
Gci = solve(Gc)   # Gc = Kc
HI = id(6) %x% Gci
GI = AI %x% Gai
PI = id(30) %x% Gpi
QI = block(HI, GI, PI)
# solve MME
RRS = MME(X, ZZ, QI, RI, YOB)

MME is a routine for setting up mixed model equations and solving them. See earlier chapters for details of the MME and AINV R-scripts.

14.3.1 Year Trajectories

The next step is to look at the solutions and make sense of the results. The first thing is to look at the year solutions and plot them in a graph.

Figure 14.3 Yearly survival functions for years 1 and 2 (frequency plotted against number of calvings).

Note that in year 1 all of the animal records were uncensored; none of those animals were still being observed because they had all been culled (some years after these data were obtained). In year 2, however, there were 4 censored records, and therefore the line for year 2 is not fully complete, and will not be until all the animals in year 2 have been culled. The line for year 2 could still change, but the line for year 1 is essentially complete and not likely to change very much in future analyses. It will change a little due to relatives' information added in later years. With only 261 observations, the fixed curves (i.e. trajectories) are not very smooth. If there were several thousand records per year, the curves might look smoother.

14.3.2 Contemporary Groups

Contemporary groups were modelled by order 4 Legendre polynomials. The solutions are shown in Table 14.3.
Table 14.3 Random regression solutions for contemporary groups.

Group Number   c0        c1        c2        c3        c4
1              0.12264  -0.00160  -0.04552  -0.02469   0.00526
2             -0.28856  -0.01639   0.09221   0.00770  -0.02311
3              0.16592   0.01799  -0.04669   0.01699   0.01784
4              0.24412  -0.02660  -0.07702   0.00605   0.01125
5             -0.03720   0.05555   0.03736  -0.01745  -0.01820
6             -0.20692  -0.02895   0.03966   0.01140   0.00695

From the values in Table 14.3, it is not easy to know which contemporary groups had greater or lesser survival rates. One needs to calculate survival differences for each of the nine categories using the Legendre polynomials, and then add the year trajectories for the years in which those contemporary groups were nested. Thus, the year 1 trajectory is added to contemporary groups 1, 2, and 3, and the year 2 trajectory is added to contemporary groups 4, 5, and 6. Those values can then be plotted, as shown in Figure 14.4.

Figure 14.4 Contemporary group survival functions for CG 1 to 6 (frequency plotted against number of calvings).

From Figure 14.4, contemporary groups 1, 3, and 4 had the better survival rates, and these corresponded to positive values for c0 and negative values for c2. Usually, the survival rates of contemporary groups are not of interest, but they need to be taken into account in calculating animal EBVs.

14.3.3 Animal Estimated Breeding Values

With the trait of survival, interest is primarily in sires and how they rank on daughter survival. As with the contemporary groups, the solutions for the regression coefficients are not informative on their own. They are multiplied by the Legendre polynomials for the nine categories, and the year 1 trajectory is added to those numbers. Year 1 was chosen because all animals in that year had been culled (uncensored data); usually one would take the latest year in which all animal records are uncensored. The results for eight sires are given in Figure 14.5.
Figure 14.5 Sire daughter survival (frequency plotted against years after first calving).

Notice the difference from Figure 14.4: there were greater differences among contemporary groups than among sires. To pick up differences among sires one has to look at the end of the trajectories, or category 9. Sires rank differently depending on which number of calvings is used as the ranking criterion. The sires and their ranks at the 1st, 5th, and 9th calvings are given in Table 14.4.

Table 14.4 Sire rankings at 1st, 5th, and 9th calvings.

Sire   1st   5th   9th
1      7     6     7
2      4     4     4
3      2     2     6
4      5     5     1
5      6     7     5
6      8     1     8
7      3     8     2
8      1     3     3

Which sire would you choose to use in future matings?

14.3.4 Covariance Matrices

The covariance function matrices used in the example were not estimated from real data, but were concocted for illustration purposes. Still, by using an order 4 Legendre polynomial for the random factors, the variances at the first and ninth calvings were artificially high (Figure 14.6). As already mentioned, the variances at the first and ninth calvings should be the smallest, and the largest should occur at the fifth calving. A full study using a very large data set needs to be conducted. The random regression model approach to survival analyses seems appropriate and useful. Comparisons to other methods may be warranted (Jamrozik et al., 2008).

Figure 14.6 Variances over categories: genetic, permanent environmental, contemporary group, and residual variances plotted against number of calvings.

Part III

Loose Ends

Chapter 15

Selection

The selection process must be clearly defined in order to make any statistical attempt at correcting for it. There are some selection processes for which no statistical correction methodology exists.
Selection is any process that could lead to a biased and/or inaccurate evaluation of an animal's underlying genetic value for one or more traits, resulting in the wrong choice of animals to be used in future matings, and incorrect determination of variance components, genetic correlations, or genetic changes in a population.

15.1 Randomization

Nearly all statistical procedures have been developed under the concept of random sampling or randomization. For example, if you generate 5,000 random normal deviates from N(0, 100) and average the results, you should get a value close to 0, and a variance close to 100, depending on the quality of the random number generator. Below is a script for repeating that process 1,000 times.

wsam = c(1:1000)*0
wsdm = wsam
for(i in 1:1000){
  n = 5000
  w = rnorm(n, 0, 10)
  wsam[i] = mean(w)
  wsdm[i] = var(w)
}
mean(wsam)   # -0.000172
mean(wsdm)   # 100.00067

If the 5,000 random normal deviates are sorted from high to low, and the top 50% taken as a sample, then the mean and variance of the sample will not estimate the original parameters of the normal distribution.

vsam = c(1:1000)*0
vsdm = vsam
for(i in 1:1000){
  n = 5000
  w = rnorm(n, 0, 10)
  m = 2500
  ka = order(-w)
  v = w[ka]
  vs = v[1:m]
  vsam[i] = mean(vs)
  vsdm[i] = var(vs)
}
mean(vsam)   # 7.968375
mean(vsdm)   # 36.38509

The expected mean of the selected sample would be

μ_s = μ + i · σ

where i is the selection intensity (in animal breeding terms), the mean of the selected fraction in standard deviation units; for the top 50% selected, i = 0.798. Thus, μ_s = 7.98 is the expected value, and the R-script gave 7.968375 for the average of 1,000 samples. The expected variance is

Var_s = (1 - i(i - x)) · σ²

where x is the standardized truncation point; for 50% selected, x = 0, so Var_s = (1 - i²) · σ², which gives around 36. Hence a selected sample of observations can give results greatly different from the original parameters. Only if one knows how the sample was selected can the true parameters of the population be estimated.
That is, knowing the value of i, the original mean and variance can be calculated by working backwards. In animal breeding situations, the type of selection is unknown, and thus i is unknown, because there are hundreds of owners of the animals, each making their own selection decisions. In the above example, the selected sample may not be the top 50%, but perhaps some other type of sample.

The data available for analysis are the consequence of the accumulation of non-random, human-influenced decisions, which are non-reproducible. Producers decide which pairs of animals will mate to produce the next generation. Due to costs of production, producers eliminate inferior animals early in life, and thus those animals never have the opportunity to reproduce. With the existence and usage of genetic DNA markers, some embryos are not allowed to be born because their marker genotypes are not acceptable. Non-randomly generated data imply that most statistical methodologies could produce biased results. Biased results imply there could be errors in mating and culling decisions. The questions become: how serious is the bias, and how big can the errors in decisions be? Ultimately, is there a methodology that will measure the bias and correct the data for it?

15.2 Multiple Traits

In some species, like beef cattle, animals are weighed at birth, at weaning, and at one year of age. Not all calves survive to weaning due to various environmental issues. After weaning, based on their weights and gender, individuals are culled, particularly males. These animals generally do not get a yearling weight. Mature growth, however, is an important economic trait for beef producers. Let there be two traits with the following covariance matrix,

V = [ 100   40 ]  =  L L',    L = [ 10   0 ]
    [  40  116 ]                  [  4  10 ]

Generate two correlated vectors of random normal deviates, and randomly discard half of the second vector.
Compute the mean and variance of the variables in the first and second vectors.

L = matrix(data=c(10,0,4,10), byrow=TRUE, ncol=2)
wsam = c(1:1000)*0
wsdm = wsam
vsam = wsam
vsdm = wsam
n = 5000
m = 2500
km = c(2501:5000)
for(i in 1:1000){
  w = rnorm(n, 0, 1)
  v = rnorm(n, 0, 1)
  W = rbind(w, v)
  P = L %*% W
  p1 = P[1, ]
  p2 = P[2, ]
  # select a random half of p2
  q2 = p2[1:m]
  wsam[i] = mean(p1)
  wsdm[i] = var(p1)
  vsam[i] = mean(q2)
  vsdm[i] = var(q2)
}
mean(wsam)   # 0.00271
mean(wsdm)   # 99.9275
mean(vsam)   # -0.0014988
mean(vsdm)   # 116.0756

With random selection of trait 2, the means and variances are as expected, for 5,000 observations of vector 1 and 2,500 of vector 2. Now remove 50% of vector 2 observations after sorting vector 1 from high to low. The script changes as follows:

n = 5000
m = 2500
for(i in 1:1000){
  w = rnorm(n, 0, 1)
  v = rnorm(n, 0, 1)
  W = rbind(w, v)
  P = L %*% W
  p1 = P[1, ]
  p2 = P[2, ]
  # Re-order based on vector 1
  ka = order(-p1)
  q1 = p1[ka]
  q2 = p2[ka]
  wsam[i] = mean(q1)
  wsdm[i] = var(q1)
  km = c(2501:5000)
  v1s = q1[1:m]
  v1c = q1[km]
  v2s = q2[1:m]
  vs = v2s
  vsam[i] = mean(vs)
  vsdm[i] = var(vs)
}
mean(wsam)   # 0.0037
mean(wsdm)   # 100.0342
mean(vsam)   # 3.2017
mean(vsdm)   # 105.8251

Nonrandom selection on trait 1 has increased the mean and decreased the variance of trait 2. The literature suggests that analyzing traits 1 and 2 in a multiple trait model will give unbiased estimates of breeding values (Pollak and Quaas, 1980), as long as all animals have trait 1 included. Is this really true? The script was modified to predict the missing trait 2 values from the trait 1 values that were present. They were predicted using the regression of trait 2 on trait 1 with the true parameter values; that regression was 0.4.
n = 5000
m = 2500
r = 40/100
for(i in 1:1000){
  w = rnorm(n, 0, 1)
  v = rnorm(n, 0, 1)
  W = rbind(w, v)
  P = L %*% W
  p1 = P[1, ]
  p2 = P[2, ]
  ka = order(-p1)
  q1 = p1[ka]
  q2 = p2[ka]
  wsam[i] = mean(q1)
  wsdm[i] = var(q1)
  km = c(2501:5000)
  v1s = q1[1:m]
  v1c = q1[km]
  v2s = q2[1:m]
  # predict missing trait 2 values
  v2c = v1c*r
  vs = c(v2s, v2c)
  vsam[i] = mean(vs)
  vsdm[i] = var(vs)
}
mean(wsam)   # -0.005937
mean(wsdm)   # 99.9593
mean(vsam)   # -0.002093
mean(vsdm)   # 65.9069

Thus, there is some truth that a multiple trait model accounts for selection bias in trait 2 given selection on trait 1 values. The mean of trait 2 seems to be estimated correctly, but the variance of the observed and predicted trait 2 values together is much lower than the correct variance. However, estimation of the mean is usually more important.

In a real situation the true parameters (without selection involved) would not be known, and would have to be estimated from the observed values. The greater the intensity of selection on trait 1, and the smaller the regression of trait 2 on trait 1, the poorer the ability to predict trait 2 values may be. For example, if only 10% of trait 2 values were observed and the correlation between traits was only 0.05, then predicted trait 2 values would all be very close to zero and would have a very low variance.

In conclusion, a multiple trait analysis involving missing values for one or both traits will clearly be better than separate analyses of the individual traits by themselves. Estimated covariance matrices, however, will not be estimates of the true covariance matrices based on traits that have not been subject to selection. The true covariance matrix must be estimated using animals that have both trait 1 and trait 2 observations, and that have not been subjected to selection.
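The simulated means and variances for trait 2 under selection on trait 1 can be checked against standard truncation-selection formulas. A sketch in Python, assuming i = z/p (z the standard normal ordinate at the truncation point x, p the selected proportion) and variance reduction factor k = i(i - x):

```python
from math import exp, pi, sqrt

# Truncation selection keeping the top 50%: truncation point x = 0, p = 0.5.
p, x = 0.5, 0.0
i = exp(-x * x / 2) / sqrt(2 * pi) / p   # selection intensity, about 0.798
k = i * (i - x)                          # variance reduction factor

# Two-trait covariance matrix from the text: V11 = 100, V12 = 40, V22 = 116.
v11, v12, v22 = 100.0, 40.0, 116.0
b = v12 / v11                            # regression of trait 2 on trait 1

mean2_sel = b * i * sqrt(v11)            # expected mean of trait 2 after selection on trait 1
var2_sel = v22 - b * b * v11 * k         # expected variance of selected trait 2
```

This gives an expected mean of about 3.19 and variance of about 105.8 for the selected trait 2 values, in close agreement with the simulated averages of 3.2017 and 105.8251.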
15.3 Better Sires In Better Herds Problem

In the first days of the Northeast AI Sire Comparison at Cornell University in 1968, the AI industry believed that the better sires were positively associated with the most favourable herds, and that consequently, sire ETAs would be biased in the sire model evaluations. This could be visualized as ranking the herds based on the true magnitudes of their effects on milk production (due to good management, good economics, and favourable environmental circumstances) and then those herd owners choosing sires based on the sires' true breeding values. Given that no one could ever actually know the true values of either herd effects or sire transmitting abilities, the degree of actual bias might have been very small, if it existed at all. However, herd owners with lots of money used sires with high semen prices, and money was associated with better herds and better sires.

Henderson and his team at Cornell believed that the bias must exist. He thought (based on his own selection bias theories, 1975) that if he treated herd-year-season effects as fixed, then the bias would not show up in the sire ETAs (Schaeffer, 2018). Thereafter, herd-year-season effects have always been a fixed factor in genetic evaluation models. Tests of the presence of bias, and of whether the model (with herd-year-season as fixed) removed the bias, were never performed.

15.3.1 The Simulation

To examine this problem, daughters of sires were generated over 100 herds and nine years for a trait with heritability of 0.30. True herd values were sorted from best to worst. There were 200 sires used per year, and these were sorted from high to low true breeding values. The first two sires were used in herd 1, the next two sires in herd 2, and so on; the poorest two sires were used in herd 100. Thus, the degree of bias should be maximized.
Every year 1,500 females were mated to the 200 sires, giving two progeny per mating, or 3,000 new animals per year. A total of 27,000 animals with records were available for analysis in each replicate. Data were analyzed with a sire model in two different ways. First, the model equation was

y = Xb + Wh + Zs + e

where b is a vector of fixed year effects; h is a vector of random herd-year-season effects, with mean 0 and covariance matrix I times sigma-h-squared; s is a vector of sire transmitting abilities, with mean 0 and covariance matrix I times sigma-s-squared; and X, W, and Z are design matrices relating observations to years, herd-year-seasons, and sires, respectively. This was the original model for the Northeast AI Sire Comparison. The second model equation was

y = Wh + Zs + e

where the variables are as described above, except that h is now treated as a fixed factor; fixed year effects are therefore not needed in the model because of confounding with the herd-year-season effects.

Data were also generated completely randomly, without sorting herd true values or true sire transmitting abilities. These were also evaluated with both sire models. One hundred replicates were made for each scenario.

15.3.2 Results

The correlation between sire true transmitting abilities and estimated transmitting abilities was used as the criterion to compare analyses. The numbers are given in Table 15.1.

Table 15.1 Correlations between true and estimated sire transmitting abilities.
Association Bias   HYS      Correlation
No                 Random   0.617 ± 0.019
No                 Fixed    0.534 ± 0.017
Yes                Random   0.159 ± 0.019
Yes                Fixed    0.093 ± 0.014

The following conclusions could be made.

1. Treating HYS as fixed resulted in lower accuracies, both with sire-herd associations and in the completely random scenarios.

2. With sire-herd associations, the accuracy was decreased substantially, although the amount of bias created was substantial too.

3.
Henderson's selection bias theory does not appear to be justified.

4. A statistical solution to correct for this bias does not seem obvious.

The same scenarios were examined using an animal model, with additive genetic relationships among animals. The correlations between true and estimated sire breeding values were calculated. The results are in Table 15.2.

Table 15.2 Correlations between true and estimated sire breeding values.
Association Bias   HYS      Correlation
No                 Random   0.783 ± 0.012
No                 Fixed    0.761 ± 0.015
Yes                Random   0.902 ± 0.010
Yes                Fixed    0.879 ± 0.010

The presence of sire-herd associations caused an increase in the correlations of EBV with true values, but the EBV were also highly correlated with the bias (the difference between true value and EBV). Treating HYS as a fixed factor in either sire or animal models was not successful in removing the bias due to better sires being used in better herds. An indirect conclusion is that the actual bias due to better sires being used in better herds was probably not very great back in 1969. Treating HYS as fixed was not the optimal solution, and probably resulted in lower accuracy of sire ETAs for many years. The effect of the bias in animal models does not appear to be as great as in sire models, probably because of the additive genetic relationship matrix.

15.4 Nonrandom Mating

Selective mating (nonrandom mating) is the key tool for livestock owners to make genetic change in their animals. Kennedy et al. (1988) discussed the genetic properties of animal models, and showed that animal models incorporating the additive genetic relationship matrix, A, account for nonrandom matings because the parents of each animal with observations are identified. A qualifier was that the pedigree of each animal should be traceable back to a base population of randomly mating individuals. The previous section demonstrated the increase in accuracy that is possible with an animal model over a sire model.
This was mainly due to accounting for nonrandom matings, and to including all relatives' information in the evaluations. Determining the base population is nontrivial and time consuming, and it adds a great number of animals to the dimensions of A. Thus, a question that often arises is how many generations of pedigrees are actually needed to achieve the benefits of an animal model. For example, there may be observations on animals from the last two years whose parents are known, but nothing further back in time. How far back in the ancestry must one go to make the animal model worthwhile? The depth of pedigree has two ramifications: the first is the accuracy of estimated breeding values, and the second is the effect on the estimated genetic variance. In all cases the progeny of each sire and dam pair should be random with respect to Mendelian sampling effects.

A single trait was simulated using a base population of 200 sires and 1,500 females. Each year the sires and dams were randomly mated to each other. At the end of each year, 200 males and 1,500 females were randomly selected from all available individuals. Nine years were simulated. Data were then restricted to those from years 8 and 9. Those animals plus their parents constituted the data for the first animal model analysis. An additive genetic relationship matrix was constructed based only on the animals with records (6,000) plus their parents. This was referred to as Data 0. Subsequent data sets were formed by including the next generation of parents, so the pedigree became more complete with each scenario, giving Data 1, Data 2, Data 3, and Data 4. Finally, all of the pedigrees over the nine years of simulation were included, Data All, with the 6,000 records from years 8 and 9. For each data set the correlation of estimated breeding values (EBV) with true breeding values (TBV) was calculated for the 6,000 animals with records, using the true parameters in the analysis.
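The construction of Data 0 through Data 4 amounts to trimming the full pedigree down to the animals with records plus a fixed number of ancestor generations. A sketch of that trimming step is shown below; the data.frame layout (columns animal, sire, dam, with 0 for an unknown parent) is an assumption for illustration.

```r
# Sketch: build the pedigree subset for "Data g" by starting from the
# animals that have records and adding g extra generations of parents
# beyond the parents themselves (extra_gens = 0 gives records + parents).
trim_pedigree <- function(ped, record_ids, extra_gens) {
  keep  <- record_ids
  front <- record_ids
  for (g in 0:extra_gens) {
    rows  <- ped[ped$animal %in% front, ]
    front <- setdiff(c(rows$sire, rows$dam), c(0, keep))  # new ancestors
    if (length(front) == 0) break
    keep <- c(keep, front)
  }
  ped[ped$animal %in% keep, ]
}

# Tiny example: 1 and 2 are base parents of 3 and 4; 3 and 4 are parents of 5
ped <- data.frame(animal = 1:5,
                  sire   = c(0, 0, 1, 1, 3),
                  dam    = c(0, 0, 2, 2, 4))
nrow(trim_pedigree(ped, record_ids = 5, extra_gens = 0))  # 3: animal 5 + parents
nrow(trim_pedigree(ped, record_ids = 5, extra_gens = 1))  # 5: + grandparents
```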
Also, a separate analysis was made to estimate the variance parameters of the model for herd-year-seasons, animal additive genetic effects, and residuals. One hundred replicates were made for each scenario.

Table 15.3 Completely random matings. Correlations of EBV with TBV, and estimates of variances.
Scenario  Ancestors  Correlation   Genetic      HYS          Residual
Data 0     9,022     .601 ± .014   29.9 ± 3.1   11.1 ± 3.1   58.4 ± 2.7
Data 1    11,798     .606 ± .015   30.0 ± 2.9   11.0 ± 2.9   58.7 ± 2.6
Data 2    12,600     .613 ± .013   30.3 ± 2.6   10.7 ± 2.6   58.8 ± 2.7
Data 3    12,843     .612 ± .015   30.3 ± 2.7   10.7 ± 2.7   58.7 ± 2.8
Data 4    12,796     .612 ± .015   30.6 ± 2.7   10.4 ± 2.7   58.6 ± 2.9
Data All  28,700     .610 ± .013   30.1 ± 3.4   10.9 ± 3.4   59.1 ± 2.8

The conclusion is that pedigrees containing parents, grandparents, and great-grandparents of animals with records are equivalent in accuracy to having all pedigrees available. Retrieving ancestors further back in the pedigree does not increase accuracy (correlations of TBV with EBV for animals with records only). Note also that the estimates of genetic, HYS, and residual variances are nearly unbiased in all cases.

The above results were for the case of no selection of animals for breeding. The same simulations were repeated with selection of sires and dams based on their phenotypes. Hence fewer animals became parents, because their phenotypes prevented them from being selected.

Table 15.4 Sires and dams selected based on phenotypes. Correlations of EBV with TBV, and estimates of variances.
Scenario  Ancestors  Correlation   Genetic      HYS          Residual
Data 0     8,299     .640 ± .026   29.7 ± 2.7   11.3 ± 2.7   58.9 ± 2.6
Data 1    10,218     .671 ± .032   30.7 ± 2.9   10.3 ± 2.9   58.1 ± 2.4
Data 2    11,634     .681 ± .030   30.3 ± 2.7   10.7 ± 2.7   58.3 ± 2.5
Data 3    12,064     .682 ± .030   30.6 ± 2.6   10.4 ± 2.6   58.4 ± 2.4
Data 4    12,182     .684 ± .034   30.6 ± 2.7   10.4 ± 2.7   58.1 ± 2.7
Data All  28,700     .680 ± .032   30.8 ± 2.3   10.2 ± 2.3   58.1 ± 2.4

The accuracies of EBV were greater when selection was present, presumably because there were fewer parents, each with more progeny among the 6,000 with records, resulting in greater accuracy among those with records. The addition of parents, grandparents, and great-grandparents was enough to raise the accuracy to the same level as having all pedigrees. The standard errors of the average correlations were almost twice as great under selective mating as under random mating. Thus, there are consequences to selection. The estimated variance components were nearly equal to the true parameters in both the random mating and selective mating scenarios.

15.5 Masking Selection

Masking selection occurs when animals are treated such that they are not allowed to express their true phenotype, and hence their true genotype. An example is a champion race horse that has won many races being replaced by a look-alike horse. People in the know then bet against the 'champion' because they know the look-alike will not win any races. There is a similar case in Sherlock Holmes. In dairy cattle, more and more cows are being drugged in order to synchronize their heat cycles and to improve the chances of conception. Traits like days from calving to first service, and days from first service to conception, can be manipulated phenotypically through the use of drugs. Instead of 120 d to first service, the use of drugs could move that up to 80 d.
This is a great management tool for herd owners that do not have good heat detection abilities. The question is, should the phenotype of 80 d be used in an animal model to evaluate cows and bulls for fertility, or should records that are influenced by drugs be removed from the data analysis? However, removing data may itself be a form of nonrandom sampling that causes bias.

To demonstrate the bias, a simple model,

y_i = mu + a_i + e_i,

for i from 1 to N, was used with four different heritabilities (.05, .10, .20, and .30) and N = 30,000. The assumed phenotypic mean and variance were 86 and 750, respectively.

N = 30000
h2 = 0.05
vrp = 750
vrg = h2*vrp
vre = (1-h2)*vrp
sdg = sqrt(vrg)
sde = sqrt(vre)
wa = rnorm(N,0,sdg)
we = rnorm(N,0,sde)
phen = 86 + wa + we
mean(phen)
var(phen)
mean(wa)
var(wa)
cor(phen,wa)

To apply masking selection, the phenotypes in phen were altered. Any values above 80 were set equal to 86, the overall mean. This represents cows that may have been treated with drugs in order to time their estrus and get them pregnant earlier than they might have been otherwise. The drugs were assumed to be 100% effective. Then the means, variances, and the correlation between the new phenotypes and the true genetic values were calculated, as shown in the following script.

ka = (1:N)[phen > 80]
phan = phen
phan[ka] = 86
mean(phan)
var(phan)
cor(phan,wa)
wb = wa[ka]
mean(wb)
var(wb)

In real life, only phan is observed, not phen. wb is a vector of the true genetic values of the animals whose true phenotypes were masked by the use of the drugs. The results from one replicate at each heritability are in Table 15.5.

Table 15.5 Effects of masked selection.
                                        ---------- Heritability ----------
Item                                    0.05     0.10     0.20     0.30
True Phenotypic Mean                    85.88    86.21    85.85    85.9256
True Phenotypic Var.                    740.12   742.98   739.67   756.03
True Genetic Mean                       0.02     -0.05    -0.08    -0.03
True Genetic Var.                       37.16    75.36    150.18   224.69
Cor(Phenotype,Genotype)                 0.2191   0.3217   0.4377   0.5452
Masked Phenotypic Mean                  75.35    75.45    75.35    75.26
Masked Pheno. Var.                      259.22   254.97   258.41   264.96
Genetic Mean of Masked Cows             0.88     1.80     3.47     5.48
Genetic Var. of Masked Cows             36.48    71.54    133.37   185.20
Cor(Masked Phenotype,Genotype)          0.1841   0.2701   0.3694   0.4694

The mean and variance of the records with masked animals are essentially the same regardless of heritability, because essentially the same percentage of cows is masked in each case. However, the genetic mean of the cows that were treated with drugs goes up (becomes worse) as heritability increases. Also, the drop in the correlation between phenotypes and genotypes across all animals (true and masked together) grows as heritability increases. Given that fertility has a low heritability, the decrease in the correlation is only about 0.03 at a heritability of 0.05, and this does not seem too bad. However, decreasing the accuracy of evaluation for a trait with an already low heritability is not an advantage.

Using true and masked phenotypes together in genetic evaluation programs is not a recommended practice; the bias due to masking is significant. Removing masked phenotypes from the analysis is also not recommended, because animals treated with drugs are not a random sample of all cows, but mostly those that may have a problem with conception. In practice, the amount of bias will depend on the degree of drug usage in the population. Masking will make genetic evaluations less accurate, causing inferior animals to obtain EBVs that may rank them higher than they should be ranked.

15.6 Preferential Treatment

Preferential treatment is the result of an owner (human influence) on the performance of one or more individuals within their own environment. This includes separating an animal from its contemporaries into a more favourable environment for special care, feed, or lower disease risk.
At the same time, the remaining animals may be considered to be receiving negative preferential treatment. There could even be individuals that are deliberately mistreated or underfed. Without constant police surveillance and monitoring, the amount of preferential treatment and its effect on performance are highly variable from animal to animal. During normal record keeping programs, the animals that have been preferentially treated cannot be easily identified, and owners are not likely to volunteer that information. The purpose is usually to make an animal look much better than its herdmates or contemporaries, so that it receives higher EBVs or makes higher records, leading to an increase in its monetary value.

There is little that genetic evaluation centres can do to remove animals that have been preferentially treated, or to account for preferential treatment in genetic evaluation statistical models. See Tierney and Schaeffer (1994) for a summary of a study on preferential treatment that used semen price to identify bulls whose daughters might be more likely to be preferentially treated. Due to its unpredictable nature, one might consider preferential treatment to be a factor that inflates the residual variances of traits. Residual variances within contemporary groups could be used to flag potential preferential treatment situations. Unfortunately, claiming that a herd owner has preferentially treated their animals could lead to civil legal action against the genetic evaluation centre if proof of preferential treatment cannot be established.

15.7 Nonrandom Progeny Groups

Prior to the availability of genetic markers, the animals born from a specific sire and dam mating could always safely be called a random group of progeny from that pair (Tyriseva et al. 2018). There was no way to restrict how the DNA was contributed from either parent, and no way to affect how that DNA combined into an embryo.
Thus, progeny means were good estimators of sire transmitting ability as long as sires were mated randomly to dams. With the animal model, identifying the parents of each animal with records accounts for the mating structure (random or not). However, if embryos are first collected, then genotyped for 200,000 SNP markers, and only the embryos with the highest probability of high economic merit are implanted into females, then those progeny (from the same sire and dam) are no longer a random sample of all possible progeny. The bias incurred is proportional to the accuracy of the pre-selection process. The Mendelian sampling variability of the selected progeny will be smaller than one-half the genetic variance.

Suppose 20 embryos are produced from one mating and genotyped, and the best five are implanted to produce offspring while the other 15 are discarded. What happens to EBVs from an animal model? Bias is generated as in section 15.1. Genetic evaluations should then be based solely on individual animal performance. An animal model could be used, but relationships to parents must be ignored, except for progeny that were naturally conceived. All pre-selected embryos should be coded as having unknown parents. Including links to their parents would bias the parent EBVs upwards, and would bias comparisons to other sires, to contemporaries of the pre-selected progeny, and so on through the relationship matrix. Note that pre-selection of embryos within those from a single sire-dam mating results in nonrandom progeny groups, and these can affect all traits of economic importance, not just one trait, because all traits are genetically correlated.

15.8 Summary

Many types of selection can lead to biased EBVs from sire or animal models. One must be clear about the type of selection being discussed. In most cases, the animal model does not solve or prevent a bias from affecting EBVs. Eliminating biased data from analyses could also lead to bias.
For the first time, genomics has the potential to create nonrandom progeny groups by discarding embryos that do not index high after genotyping. Researchers should be more vigilant about the potential for bias to creep into statistical analyses. Statistical procedures are largely based on random sampling within and between treatment groups. This chapter has demonstrated several types of nonrandomness that could hurt animal model analyses.

Chapter 16

Genomics

Back in the 1970s the field was called biotechnology, the buzzword that attracted lots of research dollars even though the practical applications were more than 30 years in the future. The promises were that individual genes of major importance would be found, and that humans could control what happens with them. During those 30 years more genetic change was made using statistical models and lots of data. In the early 2000s single nucleotide polymorphisms (SNPs) were found, and the usefulness of biotechnology became more obvious and immediate. However, genes were not just genes, but groups of base pairs from 1,000 to 1 million pairs in length. Genes were not all the same size (the same number of base pairs), and some genes controlled other genes. SNPs were located within or next to genes, so SNPs became markers for the genes of importance. SNPs had to be located fairly close to the genes (in linkage disequilibrium) so that the marker alleles and gene alleles were tightly linked and therefore tended to be inherited together.

There were essentially two uses for SNPs. One was to find genes that had major effects on economically important traits. This was known as a genome wide association study (GWAS). Over the years relatively few major genes have been consistently found, and some of those were already near fixation, that is, nearly every individual already had the favourable alleles. These studies produced hundreds of Manhattan plots of SNP marker effects versus location in the genome.
Research has moved along to genome sequencing, where instead of SNPs the DNA sequence of base pairs is determined, usually targeting areas where genes are known to exist. The goal is to determine not only the location of a gene, but also its function and products in various organs. This is an important area in human genetics, for understanding how various genetic diseases cause problems.

The other use for SNPs in livestock has been to be more specific about the relationships between individuals, in order to improve the accuracy of EBVs. Some full sibs, for example, are more related to their sire or dam than to other full sibs from the same sire and dam. The SNP genotypes can be used to determine what fraction of SNP genotypes are in common between individuals, to build a better relationship matrix; this includes both identity by descent and identity by state. Instead of using A, the matrix of additive genetic relationships based on pedigree information, a genomic matrix G is constructed from the SNP genotypes of individuals. The animal model is still used, but with G replacing A.

There are two difficulties with this strategy. First, G can only be constructed for animals that have been genotyped with the same SNP chip, although imputation can be used to predict the SNP genotypes of relatives of animals that were actually genotyped. Secondly, G must be inverted for use in the animal model, and most animal models involve over 100,000 animals. Inverting a matrix of up to 50,000 animals may be today's limit before rounding errors take over. Also, G and its inverse are full matrices (not sparse), and hence storing them and performing multiplications with them is very space and time consuming. This chapter looks at an approach that utilizes only a portion of the SNPs, along with the regular A matrix and its sparse inverse, to improve genetic evaluations of animals.
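As a sketch of how a genomic relationship matrix can be built from SNP genotypes, the following uses the common centered-genotype construction (not necessarily the exact formula intended here), with tiny dimensions for illustration. Note that G comes out full (no zero elements), which is the storage and inversion problem described above.

```r
# Sketch: genomic relationship matrix G from a 0/1/2 genotype matrix M
# (animals in rows, SNPs in columns), using the common construction
# G = Z Z' / (2 * sum(p*(1-p))), where Z holds genotypes centered by 2p.
# Allele frequencies and genotypes are simulated for illustration only.
set.seed(1)
nanim <- 10
nsnp  <- 500
p <- runif(nsnp, 0.1, 0.9)                     # allele frequencies
M <- sapply(p, function(pj) rbinom(nanim, 2, pj))
Z <- sweep(M, 2, 2 * p)                        # center column j by 2*p_j
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))
dim(G)        # 10 x 10: symmetric, full (not sparse), diagonals near 1
```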
16.1 Introduction

Genomics research projects can be found in all areas of livestock production, including aquaculture. In domesticated aquaculture there are generally few families (100 to 400 per year) and large progeny group sizes (50 to 1,000). Parents can be genotyped for many thousands (e.g. 50,000) of single nucleotide polymorphisms (SNPs), but costs prohibit genotyping too many progeny in freshwater and seawater environments. When data are to be analyzed, there are a large number of fish with phenotypes and a small number of individuals that are genotyped. Estimation of a large number of SNP effects from a few thousand genotyped fish is hugely overparameterized. The question arises whether all SNPs need to be utilized to gain an advantage from genomics.

The Cooke Aquaculture Atlantic salmon breeding program at the Oak Bay Hatchery in New Brunswick rears candidate broodstock entirely in freshwater tanks, while one of the traits of economic interest is growth in seawater cages. Three-year-old weights are taken in December and January, and spawning occurs in October to December of that year. Genetic improvement programs are used to increase growth rates, reduce feed consumption, improve carcass shape and colour, and increase resistance to many health challenges due to parasites, bacteria, and viruses. Generation intervals in salmon cannot easily be shortened without causing other problems, such as grilsing, i.e. early maturing and less growth (Gjerde et al. 1994). Four years is the minimum generation interval (Quinton et al. 2006). Therefore, early assessment of genetic ability will not reduce the spawning age. Families of salmon are well evaluated by traditional animal model analyses for freshwater and seawater growth, because family sizes can be fairly large and the heritability of growth is around 0.4 (Gjerde et al., 1994).
Year classes of Atlantic salmon are almost entirely genetically distinct, with completely different family lines. SNPs can improve the accuracy of estimates of year class differences, because SNP frequencies will likely be somewhat similar between year classes (Liu et al. 2016). Genotyping enough fish in both freshwater and seawater environments is costly and time consuming. Thus, the use of genomics in Atlantic salmon to make greater genetic change in growth is not very feasible.

Evaluation methods in fish rely on the ability to identify individual fish, which can only be done once the fish are large enough to be PIT tagged so that they can be readily scanned electronically. Alternatively, fish can be genotyped to determine parentage at the time they are weighed and measured, using a hundred or so SNP markers, but this depends on the costs and speed of analyzing DNA samples (Liu et al., 2016). Any fish that will not be used in matings does not need to be identified individually, but only by family. Families have been identified by fin clipping, or by raising them in separate tanks until they are large enough to be tagged; this generally limits the number of families that can be reared at one time. Fish may also be grown communally and identified to parentage at three years of age, using seven microsatellite markers or a panel of SNPs, which would allow a larger number of families to be created. With a communal system, some families could potentially be eliminated because they were not competitive during feeding.

Many methods are available for genomic evaluation of individuals, most of them designed for dairy cattle. Several of these methods were compared by Yoshida et al.
(2018) in rainbow trout, who found that low-density SNP panels could be used without compromising the accuracy of predictions, but that all methods using genomic information gave better results than a traditional pedigree-based best linear unbiased prediction (BLUP) approach using only phenotypes. Toro et al. (2017) conducted a simulation study of the accuracy of genomic within-family selection, and used a limited number of markers, from 4 to 400 per chromosome, to improve the accuracy of selection. Single-step genomic BLUP (Misztal et al. 2013) is another methodology, in which SNP markers are used to create a genomic relationship matrix among genotyped animals that includes relationships identical by descent and identical by state. The inverse of this matrix is combined with the inverse of the pedigree-based relationship matrix. Inverting the genomic relationship matrix becomes more challenging as more animals are genotyped, because a direct inversion routine is needed and the storage space to hold the matrix quickly becomes enormous. Also, any multiplications involving this matrix can be very time consuming, because the inverse is full (all elements are non-zero). Hence a longer term solution is needed if the number of genotyped animals increases rapidly.

Two problems are encountered when using genomic data in genetic evaluation. With a 50K SNP panel, about 50,000 SNP genotype effects (N_snp) need to be estimated, and a high percentage of the SNPs have little to no effect on the traits of interest. The second problem is that not enough individuals have been genotyped. Consequently, statistical models tend to be over-parameterized, i.e., there are too many parameters to estimate from too few animals. If you put the SNP genotypes in a matrix with the number of columns equal to the number of SNPs (N_snp), and the number of rows equal to the number of animals genotyped (N_g), then the rank of the matrix can be no greater than the smaller of N_g and N_snp.
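The rank limit can be verified directly on a small genotype matrix; the dimensions below are illustrative (real panels are far larger).

```r
# Sketch: the rank of a genotype matrix cannot exceed the number of
# genotyped animals when markers outnumber animals.
set.seed(1)
Ng   <- 50
Nsnp <- 500
M <- matrix(sample(0:2, Ng * Nsnp, replace = TRUE), Ng, Nsnp)
qr(M)$rank        # at most Ng = 50, however many markers there are
```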
Thus, if you have N_g = 100 and N_snp = 50,000, then the rank is 100 or less, and 49,900 columns of that matrix are linear functions of a set of 100 independent columns. Ideally, all animals with phenotypes (N_p) should also be genotyped (N_g), and both numbers should be greater than the number of SNP genotypes per animal (N_snp). That is, (N_p = N_g) > N_snp. If that holds, then SNP genotype effects can be estimated uniquely using ordinary least squares equations with an appropriate statistical model. Typically, however, N_g is much smaller than N_p, and N_g is much smaller than N_snp. Labour costs and the time involved in collecting and analyzing DNA samples are high, and therefore the number of genotyped animals has been limited by economics. Even so, N_g is expected to continue to rise over time. To increase the number of genotyped animals, imputation can be used to assign genotypes to relatives of genotyped animals (Sargolzaei et al., 2014), based on probability and pedigree relationships. Gengler et al. (2007, 2008) and Mulder et al. (2010) proposed a method to predict the SNP genotypes of all animals in the pedigree using a simple pedigree-based animal model.

16.2 Data

The Oak Bay Hatchery of Cooke Aquaculture, Inc. accommodates up to 150 families of Atlantic salmon in each of four different year classes. The St John River strain was started around 1984 (Glebe, 1998; Quinton et al. 2005). Spawning typically occurs in October, November, and December, and eggs are kept in freshwater (8°C). Some eggs can be chilled to 3-4°C in order to match their degree days with those of eggs fertilized later. Once the alevins have absorbed their yolk sac and are ready to take exogenous feed, the temperature is increased to 12°C. As the fish grow they are moved to larger family tanks. At smolting age (1 yr) individual fish may be PIT tagged, and a sorting is made for potential broodfish. Fish are communally reared from this point onward.
Another movement of fish occurs at 2 years of age, when they are scanned by ultrasound to determine sex, and early maturing fish are removed. Only weights and lengths taken at three years of age were used in the analyses. A total of 26,383 fish were measured at three years of age in this study. Male and female data from the Oak Bay Hatchery have been analyzed collectively for four traits (FW weight, FW length, SW weight, and SW length) in a multiple trait animal model over the last three years. A recent study has shown that male and female growth within families is not perfectly correlated, and that female growth is more variable than male growth. Thus, weights and lengths have been expanded into eight traits separated by gender and rearing environment. The phenotypic means and standard deviations are given in Table 16.1. The distributions of records by year class and environment are in Table 16.2.

Table 16.1 Phenotypic means and standard deviations (SD) of 3-yr-old growth traits.
Gender    Type(a)-trait     Number   Mean    SD
Males     FW weight, kg      2811    5.50    1.97
          FW length, cm      2811    75.91   8.39
          SW weight, kg      6176    5.28    1.26
          SW length, cm      6176    77.54   5.53
Females   FW weight, kg      9512    6.04    1.84
          FW length, cm      9512    77.05   7.57
          SW weight, kg      7884    5.01    1.13
          SW length, cm      7884    75.91   5.24
(a) FW = fresh water, SW = sea water.

Table 16.2 Number of growth records by year of weighing, type of environment, and gender.
Year     Year      ----Freshwater----    -----Seawater-----
Class    Weighed   Males    Females      Males    Females
2006     2009         0      1393           0      1566
2007     2010       404      1197        1552      1371
2008     2011      1171      1106        2325      1381
2009     2012       424         0        2095         0
2011     2014         0      1179           0      2418
2013     2016         0        81           0        56
2014     2017       812      1220        3540      1092

A 50K SNP panel for the North American Atlantic salmon was developed during the project, based on a 220K panel (Ødegård et al. 2014) and a 6K panel for European Atlantic salmon (Brenna-Hansen et al. 2012). A total of 3,437 fish with growth data were genotyped with either the 220K or 50K panels.
SNPs from the 50K panel were also on the 220K panel. There were a total of 33,189 animals in the pedigree file.

16.3 Model for Growth Traits

An 8-trait model was used with the traits described in Table 16.1. Each fish was observed for weight and length, in freshwater or seawater, but families were measured for all traits; hence an animal model is important for analyzing the data. The equation of the model for each trait is simple,

y = Wc + Za + e    (16.1)

where y is a vector of observations on the eight traits, c is a vector of 21 fixed tank-year subclass effects, a is a random vector of polygenic effects for each fish, e is a random vector of residuals, W is the design matrix relating tank-year effects to the observations, and Z is the design matrix relating animals to their observations. The expectations of the random vectors and the variances and covariances among the traits are

E(a) = 0,  E(e) = 0,
Var(a) = A ⊗ H,  Var(e) = R,

where A is the additive genetic numerator relationship matrix based on the pedigrees of 33,189 animals. Matrices H and R, of order 8 by 8, were estimated by Bayesian methods using Gibbs sampling. Forty thousand Gibbs samples, with a 10,000 sample burn-in period, were run using the entire dataset of 26,383 records and 33,189 pedigree animals. Standard errors of the estimates of covariance components, and of the estimates of heritabilities and genetic correlations, were obtained from the standard deviations of the Gibbs sample values after burn-in (Table 16.3).

Table 16.3 Heritability estimates with standard errors for 3-yr-old weight and length in Atlantic salmon by gender and type of environment.
Gender    Type(a)   Weight       Length
Males     FW        0.18±0.03    0.22±0.02
          SW        0.31±0.03    0.25±0.03
Females   FW        0.27±0.04    0.38±0.04
          SW        0.40±0.03    0.38±0.03
(a) FW = fresh water, SW = sea water.

16.4 Models Involving Genomics

In this study, N_g = 3,437 and N_p = 26,383, but the traits have been split into four groups (Table 16.1).
The number of markers was Nsnp = 49,638. Thus, Np and Ng are less than the number of SNP markers, Nsnp. If all SNP markers are included, then the model will be highly overparameterized, and it will be difficult to obtain unbiased and unique solutions for all parameters in the model. A first step was to reduce the number of SNP markers. From the 50K SNP panel, there were 11,467 SNPs with minor allele frequencies between 0.3 and 0.5. The correlations among the 11,467 SNP genotypes were calculated and ranged from −0.03 to 0.03, indicating that those markers were nearly independent of each other. The markers were correlated with SW female growth and ranked within chromosome in decreasing order of absolute magnitude. From the 11,467 markers, subsets of the top 6,000 (200 to 400 per chromosome) and top 270 (10 per chromosome) were chosen.

If all fish have both phenotypic records (weight and length) and genotypes for markers, then an appropriate linear model would be

  y = Wc + Za + Xm + e,   (16.2)

where everything is defined as in Equation (16.1), except m is a fixed vector of marker allele effects, and X is a design matrix that contains the genotypes of the markers (−1, 0, or 1) for each animal with records. The matrices H and R are as defined previously, and Var(a) = A ⊗ H, where A is the usual additive genetic, pedigree-based relationship matrix among animals. The genomic estimated breeding value, GEBV, would be the sum (Zâ + Xm̂), where â and m̂ are estimates of a and m, respectively.

Mixed model equations were constructed and solved for 8 traits. The number of markers was either 6,000 or 270 (GBV6K and GBV270, respectively). In both cases I was added to the diagonals of X′R⁻¹X to give ridge regression estimators of the regression coefficients.
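A minimal single-trait sketch of this ridge device (toy sizes and simulated genotypes, not the salmon data; the fixed effects and polygenic part of Equation (16.2) are omitted, and R⁻¹ is taken as I, so the ridge system reduces to X′X + I):

```python
import numpy as np

rng = np.random.default_rng(0)
n_fish, n_snp = 300, 1000                   # fewer records than markers

# Simulated marker codes (-1, 0, 1) and a phenotype driven by 10 "QTL".
X = rng.choice([-1, 0, 1], size=(n_fish, n_snp)).astype(float)
y = X[:, :10].sum(axis=1) + rng.normal(size=n_fish)

# X'X is singular because n_snp > n_fish; adding I to its diagonals
# makes the system full rank and shrinks the marker estimates.
lhs = X.T @ X + np.eye(n_snp)
m_hat = np.linalg.solve(lhs, X.T @ y)

gebv_marker_part = X @ m_hat                # Xm-hat contribution to GEBV
```

Without the added identity, `np.linalg.solve` would fail (or return one of infinitely many least-squares solutions); with it, the estimator is unique, at the cost of bias toward zero.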
This was necessary because overparameterization was still a problem. It could not be avoided when the number of markers was 6,000, and ridge regression was also applied with 270 markers so that the two cases would be comparable. Different strategies of picking markers were not studied. For each analysis there is likely an optimum set of markers, but finding that set could take a very long time starting with 49,638 markers. The reasoning was that with high minor allele frequencies, all of the genotypes at one SNP would be reasonably represented among genotyped fish. The 11,467 markers were nearly independent, so there would be little chance of confounding of markers. The linkage disequilibrium among the markers was therefore low.

The estimated breeding values from the animal model with predicted SNP genotypes were calculated as

  EBVi = âi + x′i m̂,

for animal i, where x′i is the row vector of predicted SNP genotypes for animal i from X.

The next step was to predict marker genotypes for all animals in the pedigree (Nped = 33,189) for both subsets of markers. An animal model was applied to the marker genotypes (−1, 0, or 1) of genotyped fish, with an overall mean and an animal additive effect. The additive genetic relationship matrix, A, amongst all animals with ancestors was used, and a very high heritability was assumed. The model is

  si = µ + gi + ei,   (16.3)

where si is the marker genotype, either −1, 0, or 1, for an animal that was genotyped, µ is an overall mean, gi is an animal's breeding value for the marker genotype, and ei is a residual error. Let g be the vector of breeding values for marker genotypes of all animals, partitioned into genotyped (w) and non-genotyped (o) individuals; then

  E(g) = 0,  E(e) = 0,

  Var(g) = Var( gw ) = ( Aww  Awo ) σg²,
               ( go )   ( Aow  Aoo )

  Var(e) = I σe²,

and let σe²/σg² = 0.05 = λ, which corresponds to a heritability of 0.9523.
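A toy sketch of this genotype-prediction BLUP (hypothetical three-animal pedigree invented for illustration: two unrelated genotyped parents and their ungenotyped progeny; one SNP coded ±1):

```python
import numpy as np

# Additive relationship matrix: animals 1, 2 unrelated, 3 their progeny.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
s = np.array([1.0, -1.0])         # observed SNP codes of animals 1 and 2
lam = 0.05                        # sigma_e^2 / sigma_g^2, h^2 about 0.95

Z = np.array([[1.0, 0.0, 0.0],    # links genotype "records" to animals
              [0.0, 1.0, 0.0]])
X = np.ones((2, 1))               # overall mean

# MME for s_i = mu + g_i + e_i, with A^-1 * lambda regularizing g.
Ainv = np.linalg.inv(A)
lhs = np.block([[X.T @ X, X.T @ Z],
                [Z.T @ X, Z.T @ Z + Ainv * lam]])
rhs = np.concatenate([X.T @ s, Z.T @ s])
sol = np.linalg.solve(lhs, rhs)
mu_hat, g_hat = sol[0], sol[1:]

predicted = mu_hat + g_hat        # decimal genotype code for every animal
```

With these numbers the genotyped animals are shrunk slightly toward the mean (by the factor 1/(1+λ)), and the ungenotyped progeny receives exactly the parent-average code, which is what makes the predicted genotypes usable as continuous covariates.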
The mixed model equations are

  ( n    1′            0′      ) ( µ̂  )   ( 1′s )
  ( 1    I + A^ww λ    A^wo λ  ) ( ĝw ) = (  s  )   (16.4)
  ( 0    A^ow λ        A^oo λ  ) ( ĝo )   (  0  )

where A^ww, A^wo, A^ow, and A^oo denote the corresponding blocks of the inverse of A. The solution for an animal plus the overall mean gives an estimate of the genotype of that animal. The predicted genotypes, which are decimal numbers, are used in Equation (16.2). Mulder et al. (2010) found a 0.69 correlation between predicted genotypes and actual genotypes in a simulation study. Each SNP would be analyzed separately. The predicted genotypes can be used directly in X, as continuous covariates, in Equation (16.2).

16.5 Comparison of Models

The models were compared as follows:

1. Sum of squares of residuals, and
2. Variance of Mendelian sampling effects.

GBV270 and GBV6K models should give lower sums of squares of residuals than the ANM model because they have more parameters to estimate than the traditional animal model. The comparisons are not to determine the better model, but to quantify the differences one might expect. Yoshida et al. (2018) computed an accuracy for their genomic models as the correlation between the GEBVs (genomic estimated breeding values) and the corresponding phenotypic observations, divided by the square root of heritability from the ANM, in a validation data set. In this study the data were not separated into estimation and validation groups, and therefore this statistic was not meaningful here.

Another measure of improved accuracy would be to calculate the variance of Mendelian sampling effects (for animals with both parents known, for each trait separately) as a fraction of one half the additive genetic variance used in the analysis (from the G matrix shown above). The Mendelian sampling effect of the ith animal is the animal's (G)EBV minus the parent average (G)EBV,

  msi = EBVi − 0.5 (EBVsire + EBVdam).

In theory, the greater the accuracy of evaluation, the more the variance of Mendelian sampling effects should increase towards one half the additive genetic variance.
Thus, the comparison statistic is

  v = Var(ms) / (0.5 Var(a)),

where Var(ms) is the variance of Mendelian sampling effects of animals with both parents known, and Var(a) is the additive genetic variance of the trait (obtained from the G matrix).

16.6 Results

From Table 16.1, the number of observations for male freshwater weights and lengths was only 2,811, which is substantially smaller than the total number of genotyped fish, and substantially smaller than the 6,000 marker effects that needed to be estimated in the GBV6K model. That only 3,437 animals were genotyped also caused a problem in estimating marker effects in GBV6K and GBV270, because those individuals were used to estimate SNP genotypes for the 33,189 fish in the pedigree. While that gives 6,000 SNP genotypes for all 33,189 fish, they are based on only 3,437 degrees of freedom, so to speak. The predicted genotypes are linear functions of the 3,437 independent genotyped animals. Thus, there was an overparameterization problem for both GBV6K and GBV270. The number of genotyped fish should be increased to above 6,000, and the number of phenotypes should be greater than 6,000 for all groups of fish. Ridge regression was used to solve the equations, where a value of 1 was added to the diagonals of X′R⁻¹X.

Table 16.4 shows the sums of squares of residuals from each model for each of the traits. GBV270 gave smaller sums of squares than either ANM or GBV6K. The expectation was that as the number of markers increased, the sums of squares of residuals would decrease. However, GBV6K gave sums of squares closer to, although still smaller than, those of ANM. Thus, more SNP markers was not necessarily better. The question becomes how many markers are sufficient. A stepwise forward regression could be used to add one marker at a time until the sums of squares of residuals stops decreasing. That process would not necessarily result in having markers on each chromosome.
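The comparison statistic v is straightforward to compute from any set of (G)EBVs. A small sketch (hypothetical animals and values, invented only to show the bookkeeping):

```python
import numpy as np

def mendelian_sampling_fraction(ebv, sire, dam, var_a):
    """v = Var(ms) / (0.5 * Var(a)), over animals with BOTH parents known.

    ebv: dict animal -> (G)EBV; sire, dam: dicts animal -> parent id
    (animal absent from a dict if that parent is unknown);
    var_a: additive genetic variance of the trait."""
    ms = [ebv[i] - 0.5 * (ebv[sire[i]] + ebv[dam[i]])
          for i in ebv if i in sire and i in dam]
    return np.var(ms) / (0.5 * var_a)

# Hypothetical example: animals 3 and 4 are progeny of parents 1 and 2.
ebv  = {1: 0.0, 2: 0.0, 3: 1.0, 4: -1.0}
sire = {3: 1, 4: 1}
dam  = {3: 2, 4: 2}
v = mendelian_sampling_fraction(ebv, sire, dam, var_a=4.0)
```

Animals without both parents known are simply skipped, since their Mendelian sampling deviation is not defined by the parent-average formula.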
The best set of markers would need to be found for each trait to be analyzed. The sets found for each trait could be merged to give one set to be used for all traits.

Table 16.4 Sums of squares of residuals from the traditional animal model (ANM), GBV270 model, and GBV6K model.

  Gender   Type(a)   Trait(b)     ANM     GBV270     GBV6K
  Males    FW        WT         2,335      2,073     2,234
           FW        LG        44,912     41,080    43,176
           SW        WT         4,716      4,465     4,630
           SW        LG       113,017    107,966   110,785
  Females  FW        WT         7,294      6,982     7,205
           FW        LG       107,957    104,598   106,659
           SW        WT         4,146      4,002     4,086
           SW        LG        95,556     93,339    94,629
  (a) FW = fresh water, SW = sea water.  (b) WT = weight, LG = length.

The variance of estimated Mendelian sampling effects (Var(ms)) was used to indicate genetic accuracy. When GEBV are more accurate, Var(ms) should increase towards one half the additive genetic variance (0.5 Var(a)). The values are given in Table 16.5. The ANM gave only 2.6% to 9.5% of 0.5 Var(a). This is because each fish has only one weight and one length, observed in just one environment. Hence their EBV are heavily regressed towards the family mean for all eight traits. Both GBV270 and GBV6K gave substantially higher fractions than ANM, but not consistently for one model or the other. GBV270 was higher in more cases than GBV6K.

Table 16.5 Variance of Mendelian sampling effects as a fraction of one-half additive genetic variance for the traditional animal model (ANM), GBV270, and GBV6K models.

  Gender   Type(a)   Trait(b)    ANM     GBV270    GBV6K
  Males    FW        WT         .0260    .8476     .2158
           FW        LG         .0744    .6176     .2037
           SW        WT         .0680    .2087     .2394
           SW        LG         .0870    .2250     .2389
  Females  FW        WT         .0718    .5628     .3496
           FW        LG         .0950    .3515     .2801
           SW        WT         .0734    .2138     .2032
           SW        LG         .0890    .2245     .1910
  (a) FW = fresh water, SW = sea water.  (b) WT = weight, LG = length.

Because GBV270 appears better than GBV6K, further comparisons are given only between GBV270 and ANM. Tables 16.6 and 16.7 show correlations among EBVs between ANM and GBV270, and within each model.
All of the correlations were less than unity, which means that genomic data added information to the EBVs, causing individuals to rank differently than under ANM. The correlations between traits within ANM and within GBV270 were different. For the ANM, correlations were generally greater than for GBV270. Recall that the ANM uses a multiple-trait model with fairly high genetic correlations as input parameters, so that the correlations of EBVs from ANM are usually greater than the true genetic parameters, because traits are used to predict other traits. With GBV270, the same can be said for the polygenic part of the model, but the GEBV includes the regressions on the marker genotypes, which do not utilize the genetic parameters to predict traits. Correlations do not indicate which model is better, only the behaviour of the results.

Table 16.6 Correlations (×100) of EBVs from the traditional animal model (ANM) with those of GBV270, for Nped = 33,189.

                  ----------------GBV270----------------
  ANM              1    2    3    4    5    6    7    8
  1 M/FW/WT       54   43   56   48   62   53   54   43
  2 M/FW/LG       34   72   68   84   63   87   66   74
  3 M/SW/WT       43   58   84   74   67   73   82   71
  4 M/SW/LG       36   72   72   85   68   90   71   77
  5 F/FW/WT       41   53   63   64   82   70   63   58
  6 F/FW/LG       36   71   70   84   71   92   71   78
  7 F/SW/WT       41   55   79   71   65   72   86   74
  8 F/SW/LG       35   67   71   81   65   87   77   84
  M = male, F = female; FW = fresh water, SW = sea water; WT = weight, LG = length.

Table 16.7 Correlations (×100) of EBVs within the traditional animal model (ANM), below the diagonal, and within GBV270, above the diagonal, for 33,189 individuals.

                   1    2    3    4    5    6    7    8
  1 M/FW/WT      100   70   43   35   32   28   40   30
  2 M/FW/LG       58  100   57   70   46   64   54   61
  3 M/SW/WT       72   76  100   85   58   66   79   69
  4 M/SW/LG       62   99   82  100   61   78   71   78
  5 F/FW/WT       77   66   79   73  100   81   55   52
  6 F/FW/LG       60   95   80   98   77  100   65   72
  7 F/SW/WT       66   70   95   78   77   80  100   87
  8 F/SW/LG       56   89   83   93   71   95   87  100
  M = male, F = female; FW = fresh water, SW = sea water; WT = weight, LG = length.

Animal models including predicted SNP marker genotypes (GBV270 and GBV6K) performed better than ANM.
Results imply that 270 markers gave a higher fraction of Mendelian sampling variance than did GBV6K, and definitely more than the ANM. There is a need for further study on how many SNPs, and which SNPs, give the best results. By using fewer markers, a new chip panel could be created with only 500 markers, for example, which might cost less than the 50K panel and therefore allow more fish to be genotyped. A chip with only 500 markers could be used to determine parentage, genetic sex, and continent of origin (North American or European) as well. Ideally, large samples of fish measured for 3-year-old weight and length should be genotyped in FW environments, along with random samples of fish from SW environments. Perhaps 50 fish from every family could be targeted, giving 7,500 genotyped individuals per year.

Chapter 17 Other Models

17.1 Sire-Dam Model

A sire-dam model is a subset of the animal model. Assumptions are needed in order to move from an animal model to a sire-dam model. This means that a sire-dam model will always be inferior to an animal model. Start with an animal model,

  y = Xb + Wc + Za + Zp + e,

for a single trait with repeated records, where b is a vector of fixed factors of the model (e.g. time trends, age effects, regions of a country), c is a vector of random factors (e.g. contemporary groups), a is a vector of animal additive genetic effects, p is a vector of cumulative (permanent) environmental effects, and e is a vector of residual effects. X, W, and Z are design matrices that relate observations in y to factors in the model, and the random factors have null expected values and appropriate covariance matrices. The additive genetic relationship matrix, A, is assumed for the animal additive genetic effects. A necessary assumption in forming a sire-dam model is that animals have only one record each, so that p can be combined with e.
Then, to move to a sire-dam model, re-write the additive genetic effects as

  a = (1/2) a_sire + (1/2) a_dam + ms,   (17.1)

where ms is the vector of Mendelian sampling effects of each progeny. The implied assumption is that all animals with a record in y have both parents known. Another assumption is that each Mendelian sampling effect comes from the same population, with a zero mean and variance equal to one half the additive genetic variance. Thus, inbred animals are not allowed, or else all animals are equally inbred and have the same variance. Substituting (17.1) into the animal model equation gives

  y = Xb + Wc + Z( (1/2) a_sire + (1/2) a_dam + ms ) + p + e.

Combine the Mendelian sampling effects with the residual term as well, assuming one record per animal:

  y = Xb + Wc + (1/2) Zs a_sire + (1/2) Zd a_dam + (ms + p + e).

The obvious sticking points with a sire-dam model are that

  • Animals must have only single records.
  • No progeny may be inbred.
  • All progeny should be a random sample of all potential progeny of a sire-dam pair.
  • The progeny do not later become dams themselves.
  • Animals that have records are not parents of other animals.
  • None of the parents have records of their own.

If the data cover only one generation of animals, then the results from an animal model and a sire-dam model should be the same. However, if the data cover many generations, then the animal model should always be more accurate than the sire-dam model.

17.2 Sire Models

A sire model is a sub-model of the sire-dam model. More assumptions are needed to move from a sire-dam model to a sire model, and thus a sire model is inferior to a sire-dam model, and also inferior to an animal model. If dams are assumed to have only one progeny each (mated to only one sire), then the dam additive genetic effect can be combined into the residual effect too. One must also assume that the dams are unrelated to each other, and to the sires:

  y = Xb + Wc + (1/2) Zs a_sire + ( (1/2) a_dam + ms + p + e ),

where Zd = I.
Implied in the sire model is that sires must be mated randomly to dams. If this is not the case, then sire estimated transmitting abilities will be biased. The assumptions that had to be made for the sire-dam model also apply to the sire model.

17.3 Reduced Animal Models

Consider an animal model with periods as a fixed factor and one observation per animal (not a realistic model — just an example for demonstrating a reduced animal model), as in Table 17.1.

Table 17.1 Animal Model Example Data

  Animal   Sire   Dam   Period   Observation
     5       1     3       2         250
     6       1     3       2         198
     7       2     4       2         245
     8       2     4       2         260
     9       2     4       2         235
     4                     1         255
     3                     1         200
     2                     1         225

Assume that the ratio of residual to additive genetic variances is 2. The MME for these data would be of order 11 (nine animals and two periods). The left-hand sides (columns ordered b1, b2, a1, ..., a9) and right-hand sides of the MME are

  3  0    0  1  1  1   0   0   0   0   0  |  680
  0  5    0  0  0  0   1   1   1   1   1  | 1188
  0  0    4  0  2  0  −2  −2   0   0   0  |    0
  1  0    0  6  0  3   0   0  −2  −2  −2  |  225
  1  0    2  0  5  0  −2  −2   0   0   0  |  200
  1  0    0  3  0  6   0   0  −2  −2  −2  |  255
  0  1   −2  0 −2  0   5   0   0   0   0  |  250
  0  1   −2  0 −2  0   0   5   0   0   0  |  198
  0  1    0 −2  0 −2   0   0   5   0   0  |  245
  0  1    0 −2  0 −2   0   0   0   5   0  |  260
  0  1    0 −2  0 −2   0   0   0   0   5  |  235

and the solutions to these equations are

  b̂1 = 225.8641,  b̂2 = 236.3366,
  â1 = −2.4078,   â2 = 1.3172,    â3 = −10.2265,
  â4 = 11.3172,   â5 = −2.3210,   â6 = −12.7210,
  â7 = 6.7864,    â8 = 9.7864,    â9 = 4.7864.

A property of these solutions is that 1′A⁻¹â = 0, which in this case means that the sum of the solutions for animals 1 through 4 is zero.

The reduced animal model was proposed by Quaas and Pollak (1980). The solutions from the reduced animal model are exactly the same as those from the full MME. The amount of reduction depends on the reproductive rate of the species of interest and the percentage of animals that become parents of the next generation.
Thus, the reduced animal model methodology is very applicable to swine, poultry, and fish, where large numbers of progeny can be produced per mating pair, and where a very small percentage of animals are actually selected as parents of the next generation.

In a typical animal model with a as the vector of additive genetic values of animals, there will be animals that have had progeny, and there will be other animals that have not yet had progeny (and some may never have progeny). Denote those animals which have progeny as ap, and those which have not had progeny as ao, so that a′ = ( a′p  a′o ). In terms of the example data,

  a′p = ( a1 a2 a3 a4 ),
  a′o = ( a5 a6 a7 a8 a9 ).

Genetically, for any individual i, the additive genetic value may be written as the average of the additive genetic values of the parents plus a Mendelian sampling effect, which is the animal's specific deviation from the parent average, i.e.

  ai = 0.5 (as + ad) + mi.

Therefore, we may write

  ao = T ap + m,

where T is a matrix that indicates the parents of each animal in ao, and m is the vector of Mendelian sampling effects. Then

  a = ( ap ) = ( I ) ap + ( 0 ),
      ( ao )   ( T )      ( m )

and

  V(a) = A σa² = ( I ) App ( I  T′ ) σa² + ( 0  0 ) σa²,
                 ( T )                     ( 0  D )

where D is a diagonal matrix with diagonal elements equal to (1 − 0.25 di), di being the number of identified parents (0, 1, or 2) of the ith animal, and V(ap) = App σa². The animal model can now be written as

  ( yp ) = ( Xp ) b + ( Zp  0 ) ( I ) ap + ( ep        ).
  ( yo )   ( Xo )     ( 0   Zo) ( T )      ( eo + Zo m )

Note that the residual vector now contains two different types of residuals, and that the additive genetic values of animals without progeny have been replaced by T ap. Because every individual has only one record, Zo = I, but Zp may have fewer rows than there are elements of ap, because not all parents have observations themselves. In the example data, animal 1 does not have an observation; therefore,

  Zp = ( 0 1 0 0 )
       ( 0 0 1 0 )
       ( 0 0 0 1 ).
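The partition of V(a) above can be checked numerically for the example pedigree. The sketch below builds A by the tabular method (a small helper written for this illustration; it assumes non-inbred parents listed before their progeny) and verifies that the block for non-parents equals T App T′ + D:

```python
import numpy as np

def a_matrix(ped):
    """Tabular method. ped maps animal -> (sire, dam), 0 = unknown;
    parents must appear before their progeny."""
    ids = list(ped)
    n = len(ids)
    A = np.zeros((n, n))
    for i, anim in enumerate(ids):
        s, d = ped[anim]
        si = ids.index(s) if s else None
        di = ids.index(d) if d else None
        for j in range(i):
            A[i, j] = A[j, i] = 0.5 * ((A[j, si] if si is not None else 0.0) +
                                       (A[j, di] if di is not None else 0.0))
        A[i, i] = 1.0 + 0.5 * (A[si, di] if si is not None and di is not None
                               else 0.0)
    return A

# Example pedigree: parents 1-4 unrelated; 5,6 from 1x3; 7,8,9 from 2x4.
ped = {1: (0, 0), 2: (0, 0), 3: (0, 0), 4: (0, 0),
       5: (1, 3), 6: (1, 3), 7: (2, 4), 8: (2, 4), 9: (2, 4)}
A = a_matrix(ped)
App = A[:4, :4]

T = np.array([[.5, 0, .5, 0], [.5, 0, .5, 0],
              [0, .5, 0, .5], [0, .5, 0, .5], [0, .5, 0, .5]])
D = np.diag([1 - 0.25 * 2] * 5)   # both parents known: 1 - 0.25(2) = 0.5

# The lower-right block of A (non-parents) equals T App T' + D.
assert np.allclose(A[4:, 4:], T @ App @ T.T + D)
```

The assertion holds exactly because the tabular method is the recursive form of the same parent-average-plus-Mendelian-sampling identity.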
Consequently,

  R = V( ep     ) = ( I σe²        0        ) = ( I  0  ) σe²,
       ( eo + m )   ( 0    I σe² + D σa²    )   ( 0  Ro )

where Ro = I + D α⁻¹. The mixed model equations for the reduced animal model are

  ( X′pXp + X′oRo⁻¹Xo    X′pZp + X′oRo⁻¹T          ) ( b̂  )   ( X′pyp + X′oRo⁻¹yo )
  ( Z′pXp + T′Ro⁻¹Xo     Z′pZp + T′Ro⁻¹T + App⁻¹α ) ( âp ) = ( Z′pyp + T′Ro⁻¹yo  ).

Solutions for âo are derived from the following formulas:

  âo = T âp + m̂,

where

  m̂ = (Z′oZo + D⁻¹α)⁻¹ (yo − Xo b̂ − T âp).

Using the example data,

  T = ( .5  0   .5  0  )
      ( .5  0   .5  0  )
      ( 0   .5  0   .5 )
      ( 0   .5  0   .5 )
      ( 0   .5  0   .5 ),

and D = diag( .5 .5 .5 .5 .5 ); then the MME with α = 2 are

  ( 3  0    0    1    1    1   ) ( b̂1 )   ( 680   )
  ( 0  4    .8   1.2  .8   1.2 ) ( b̂2 )   ( 950.4 )
  ( 0  .8   2.4  0    .4   0   ) ( â1 ) = ( 179.2 )
  ( 1  1.2  0    3.6  0    .6  ) ( â2 )   ( 521   )
  ( 1  .8   .4   0    3.4  0   ) ( â3 )   ( 379.2 )
  ( 1  1.2  0    .6   0    3.6 ) ( â4 )   ( 551   )

The solutions are as before, i.e. b̂1 = 225.8641, b̂2 = 236.3366, â1 = −2.4078, â2 = 1.3172, â3 = −10.2265, and â4 = 11.3172. To compute âo, we must first calculate m̂ as follows:

  (I + D⁻¹α) = diag( 5 5 5 5 5 ),

  yo = ( 250  198  245  260  235 )′,

  Xo b̂ = 236.3366 × ( 1 1 1 1 1 )′,

  T âp = ( −6.3172  −6.3172  6.3172  6.3172  6.3172 )′,

  m̂ = (I + D⁻¹α)⁻¹ (yo − Xo b̂ − T âp)
    = ( 3.9961  −6.4039  .4692  3.4692  −1.5308 )′,

and

  âo = T âp + m̂ = ( −2.3211  −12.7211  6.7864  9.7864  4.7864 )′.

The reduced animal model was originally described for models where animals had only one observation, but Henderson (1988) described many other possible models where this technique could be applied.

Chapter 18 Non-Additive Genetic Effects

Non-additive genetic effects (or epistatic effects) are the interactions among loci in the genome. There are many possible degrees of interaction (involving different numbers of loci), but the effects and contributions of those interactions have been shown to diminish as the degree of complexity increases.
Thus, mainly dominance, additive by additive, additive by dominance, and dominance by dominance interactions have been considered in animal studies. To estimate the variances associated with each type of interaction, relatives are needed that have different additive and dominance relationships among themselves. Computation of dominance relationships is more difficult than that of additive relationships, but can be done as shown in earlier notes. If non-additive genetic effects are included in an animal model, then an assumption of random mating is required. Otherwise non-zero covariances can arise between additive and dominance genetic effects, which complicates the model enormously.

18.1 Interactions at a Single Locus

Let the model for the genotypic values be given as

  Gij = µ + ai + aj + dij,

where

  µ = G.. = Σ(i,j) fij Gij,
  ai = Gi. − G..,  e.g. G1. = Pr(A1)G11 + Pr(A2)G12,
  aj = G.j − G..,  e.g. G.2 = Pr(A1)G12 + Pr(A2)G22,
  dij = Gij − ai − aj − µ.

Thus, there are just additive effects and dominance effects to be estimated at a single locus. A numerical example is given below.

  Genotype   Frequency     Value
  A1A1       f11 = 0.04    G11 = 100
  A1A2       f12 = 0.32    G12 = 70
  A2A2       f22 = 0.64    G22 = 50

Then

  µ   = 0.04(100) + 0.32(70) + 0.64(50) = 58.4,
  G1. = 0.2(100) + 0.8(70) = 76.0,
  G.2 = 0.2(70) + 0.8(50) = 54.0,
  a1  = G1. − µ = 17.6,
  a2  = G.2 − µ = −4.4,
  d11 = G11 − a1 − a1 − µ = 6.4,
  d12 = G12 − a1 − a2 − µ = −1.6,
  d22 = G22 − a2 − a2 − µ = 0.4.

Now a table of breeding values and dominance effects can be completed.

  Genotype   Frequency   Value       Additive           Dominance
  A1A1       0.04        G11 = 100   a1 + a1 = 35.2     d11 = 6.4
  A1A2       0.32        G12 = 70    a1 + a2 = 13.2     d12 = −1.6
  A2A2       0.64        G22 = 50    a2 + a2 = −8.8     d22 = 0.4
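The single-locus decomposition above is easy to verify numerically. A minimal numpy sketch of the same example:

```python
import numpy as np

f = np.array([0.04, 0.32, 0.64])     # genotype frequencies: A1A1, A1A2, A2A2
G = np.array([100.0, 70.0, 50.0])    # genotypic values
p1, p2 = 0.2, 0.8                    # allele frequencies of A1 and A2

mu = float(f @ G)                    # population mean, 58.4
G1 = p1 * G[0] + p2 * G[1]           # conditional mean given an A1 allele
G2 = p1 * G[1] + p2 * G[2]           # conditional mean given an A2 allele
a1, a2 = G1 - mu, G2 - mu            # average allele effects

bv = np.array([2 * a1, a1 + a2, 2 * a2])   # breeding values by genotype
d = G - mu - bv                            # dominance deviations

var_a = float(f @ bv**2)             # additive genetic variance, 154.88
var_d = float(f @ d**2)              # dominance genetic variance, 2.56
```

The frequency-weighted cross-product of `bv` and `d` is zero, which anticipates the zero additive-by-dominance covariance shown next.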
The additive genetic variance is

  σa² = 0.04(35.2)² + 0.32(13.2)² + 0.64(−8.8)² = 154.88,

the dominance genetic variance is

  σd² = 0.04(6.4)² + 0.32(−1.6)² + 0.64(0.4)² = 2.56,

and the total genetic variance is

  σG² = σa² + σd² = 157.44
      = 0.04(100 − µ)² + 0.32(70 − µ)² + 0.64(50 − µ)²
      = 0.04(41.6)² + 0.32(11.6)² + 0.64(−8.4)²
      = 157.44.

This result implies that there is a zero covariance between the additive and dominance deviations, which can be shown by calculating the covariance directly:

  Cov(A, D) = 0.04(35.2)(6.4) + 0.32(13.2)(−1.6) + 0.64(−8.8)(0.4) = 0.

The covariance is zero under the assumption of a large, random mating population without selection.

18.2 Interactions for Two Unlinked Loci

Consider two loci, each with two alleles, and assume that the two loci are on different chromosomes and therefore unlinked. Let pA = 0.4 be the frequency of the A1 allele at locus A, and let pB = 0.8 be the frequency of the B1 allele at locus B. Then the possible genotypes, their expected frequencies assuming joint equilibrium, and their genotypic values would be as in the tables below. Joint equilibrium means that each locus is in Hardy-Weinberg equilibrium and that the probabilities of the possible gametes are equal to the products of the allele frequencies, as shown in the following table.

  Possible Gametes   Expected Frequencies
  A1B1               pA pB = 0.32
  A1B2               pA qB = 0.08
  A2B1               qA pB = 0.48
  A2B2               qA qB = 0.12

Multiplying these gametic frequencies together, to simulate random mating, gives the genotype frequencies in the table below. The genotypic values were arbitrarily assigned to illustrate the process of estimating the genetic effects.
  Genotypes           Frequency,               Genotypic
  A-locus  B-locus    fijkl                    Value, Gijkl
  11       11         pA² pB²      = .1024     G1111 = 108
  11       12         pA² 2pBqB    = .0512     G1112 = 75
  11       22         pA² qB²      = .0064     G1122 = −80
  12       11         2pAqA pB²    = .3072     G1211 = 95
  12       12         4pAqA pBqB   = .1536     G1212 = 50
  12       22         2pAqA qB²    = .0192     G1222 = −80
  22       11         qA² pB²      = .2304     G2211 = 48
  22       12         qA² 2pBqB    = .1152     G2212 = 36
  22       22         qA² qB²      = .0144     G2222 = −100

18.2.1 Estimation of Additive Effects

  αA1 = G1... − µG = 16.6464,
  αA2 = −11.0976,
  αB1 = 10.4384,
  αB2 = −41.7536.

The additive genetic effect for each genotype is

  aijkl = αAi + αAj + αBk + αBl,

  a1111 = αA1 + αA1 + αB1 + αB1
        = 16.6464 + 16.6464 + 10.4384 + 10.4384 = 54.1696,
  a1112 = 1.9776,
  a1122 = −50.2144,
  a1211 = 26.4256,
  a1212 = −25.7664,
  a1222 = −77.9584,
  a2211 = −1.3184,
  a2212 = −53.5104,
  a2222 = −105.7024.

The additive genetic variance is then

  σa² = Σ(ijkl) fijkl a²ijkl = 1241.1517.

18.2.2 Estimation of Dominance Effects

There are six conditional means to compute, one for each single-locus genotype:

  G11.. = Pr(B1B1)G1111 + Pr(B1B2)G1112 + Pr(B2B2)G1122
        = (.64)(108) + (.32)(75) + (.04)(−80)
        = 89.92,
  G12.. = 73.60,
  G22.. = 38.24,
  G..11 = 80.16,
  G..12 = 48.96,
  G..22 = −87.20.

The dominance genetic effects are given by

  δAij = Gij.. − µG − αAi − αAj,

so that

  δA11 = 89.92 − 63.4816 − 16.6464 − 16.6464 = −6.8544,
  δA12 = 4.5696,
  δA22 = −3.0464,
  δB11 = −4.1984,
  δB12 = 16.7936,
  δB22 = −67.1744.

The dominance deviations for each genotype are

  dijkl = δAij + δBkl,

  d1111 = −11.0528,
  d1112 = 9.9392,
  d1122 = −74.0288,
  d1211 = 0.3712,
  d1212 = 21.3632,
  d1222 = −62.6048,
  d2211 = −7.2448,
  d2212 = 13.7472,
  d2222 = −70.2208.

The dominance genetic variance is therefore

  σd² = Σ(ijkl) fijkl d²ijkl = 302.90625.

18.2.3 Additive by Additive Effects

These are the interactions between alleles at different loci.
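The conditional means and average allele effects of this two-locus example can be verified compactly with matrix operations (same frequencies and values, arranged as a 3×3 table):

```python
import numpy as np

pA, qA, pB, qB = 0.4, 0.6, 0.8, 0.2
# Genotypic values: rows = A-locus genotypes (11, 12, 22),
# columns = B-locus genotypes (11, 12, 22).
G = np.array([[108.0,  75.0,  -80.0],
              [ 95.0,  50.0,  -80.0],
              [ 48.0,  36.0, -100.0]])
fA = np.array([pA**2, 2 * pA * qA, qA**2])
fB = np.array([pB**2, 2 * pB * qB, qB**2])

mu = float(fA @ G @ fB)              # overall mean, 63.4816

# Single-locus conditional means: G11.., G12.., G22.. and G..11, ...
GA = G @ fB                          # 89.92, 73.60, 38.24
GB = fA @ G                          # 80.16, 48.96, -87.20

# Average allele effects, e.g. alpha_A1 = E(G | one A allele is A1) - mu.
alpha_A1 = pA * GA[0] + qA * GA[1] - mu    # 16.6464
alpha_B1 = pB * GB[0] + qB * GB[1] - mu    # 10.4384
```

Because the loci are assumed to be in joint equilibrium, the joint genotype frequencies factor into `fA` and `fB`, which is what lets the conditional means be written as simple matrix-vector products.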
There are four conditional means to calculate:

  G1.1. = Pr(A1B1)G1111 + Pr(A1B2)G1112 + Pr(A2B1)G1211 + Pr(A2B2)G1212
        = .32(108) + .08(75) + .48(95) + .12(50)
        = 92.16,
  G1..2 = 84.192,
  G.21. = 61.76,
  G.2.2 = 14.88.

The additive by additive genetic effects are

  ααA1B1 = G1.1. − µG − αA1 − αB1 = 1.5936,
  ααA1B2 = −6.3744,
  ααA2B1 = −1.0624,
  ααA2B2 = 4.2496.

The additive by additive deviations for each genotype are

  aaijkl = ααAiBk + ααAiBl + ααAjBk + ααAjBl,

  aa1111 = 6.3744,
  aa1112 = −9.5616,
  aa1122 = −25.4976,
  aa1211 = 1.0624,
  aa1212 = −1.5936,
  aa1222 = −4.2496,
  aa2211 = −4.2496,
  aa2212 = 6.3744,
  aa2222 = 16.9984.

The additive by additive genetic variance is

  σaa² = Σ(ijkl) fijkl aa²ijkl = 27.08865.

18.2.4 Additive by Dominance Effects

This is the interaction between a single allele at one locus and the pair of alleles at a second locus. There are twelve possible conditional means, for the twelve possible A by D interactions. Not all are shown:

  G1.11 = Pr(A1)G1111 + Pr(A2)G1211
        = .4(108) + .6(95)
        = 100.2,
  G.211 = Pr(A1)G1211 + Pr(A2)G2211
        = .4(95) + .6(48)
        = 66.8.

The specific additive by dominance effects follow the pattern

  αδA1B11 = G1.11 − µG − αA1 − 2αB1 − δB11 − 2ααA1B1 = 0.2064.

Finally, the additive by dominance deviations for each genotype are

  adijkl = αδAiBkl + αδAjBkl + αδBkAij + αδBlAij,

  ad1111 = −3.8784,
  ad1112 = 4.7856,
  ad1122 = 23.7696,
  ad1211 = 2.9296,
  ad1212 = −4.5664,
  ad1222 = −10.3424,
  ad2211 = −2.1824,
  ad2212 = 3.9616,
  ad2222 = 3.2256.

The additive by dominance genetic variance is

  σad² = Σ(ijkl) fijkl ad²ijkl = 17.2772.

18.2.5 Dominance by Dominance Effects

Dominance by dominance genetic effects are the interaction between a pair of alleles at one locus and another pair of alleles at a second locus. These effects are calculated as the genotypic values minus all of the other effects for each genotype.
That is,

  ddijkl = Gijkl − µG − aijkl − dijkl − aaijkl − adijkl.

The dominance by dominance genetic variance is the sum over genotypes of the frequency times the squared dominance by dominance effect, σdd² = 8.5171. The table of all genetic effects is given below.

  Genotypes
  A    B     fijkl   Gijkl     aijkl       dijkl      aaijkl     adijkl     ddijkl
  11   11    .1024    108      54.1696    −11.0528     6.3744    −3.8784    −1.0944
  11   12    .0512     75       1.9776      9.9392    −9.5616     4.7856     4.3776
  11   22    .0064    −80     −50.2144    −74.0288   −25.4976    23.7696   −17.5104
  12   11    .3072     95      26.4256      0.3712     1.0624     2.9296     0.7296
  12   12    .1536     50     −25.7664     21.3632    −1.5936    −4.5664    −2.9184
  12   22    .0192    −80     −77.9584    −62.6048    −4.2496   −10.3424    11.6736
  22   11    .2304     48      −1.3184     −7.2448    −4.2496    −2.1824    −0.4864
  22   12    .1152     36     −53.5104     13.7472     6.3744     3.9616     1.9456
  22   22    .0144   −100    −105.7024    −70.2208    16.9984     3.2256    −7.7824

A summary of the genetic variances is

  Total Genetic   1596.9409
  Additive        1241.1517
  Dominance        302.9062
  Add by Add        27.0886
  Add by Dom        17.2772
  Dom by Dom         8.5171

18.3 More than Two Loci

Interactions can occur between several loci. The maximum number of loci involved in simultaneous interactions is unknown, but the limit is the number of gene loci in the genome. Many geneticists believe that the higher order interactions are few in number, and that if they exist, the magnitude of their effects is small. Someday the measurement of all of these interactions may be possible, but modelling them may be impossible, and practical utilization of that information may be close to impossible.
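A quick check on the summary above: because the decomposition is orthogonal under joint equilibrium, the component variances should reproduce the total genetic variance computed directly from the genotype frequencies and values. The sketch below does that check (agreement is to rounding error, since the tabled effects are carried to four decimals):

```python
import numpy as np

# Genotype frequencies and values from the two-locus example, in the
# same order as the table (A11B11, A11B12, ..., A22B22).
f = np.array([.1024, .0512, .0064, .3072, .1536, .0192, .2304, .1152, .0144])
G = np.array([108., 75., -80., 95., 50., -80., 48., 36., -100.])

mu = float(f @ G)                        # 63.4816
var_total = float(f @ (G - mu)**2)       # directly computed total variance

# Reported components: additive, dominance, a*a, a*d, d*d.
parts = [1241.1517, 302.9062, 27.0886, 17.2772, 8.5171]
```

If any covariances between the effect types were non-zero (e.g. under non-random mating), the components would no longer add up to the directly computed total.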
18.4 Linear Models for Non-Additive Genetic Effects

Consider a simple animal model with additive, dominance, and additive by dominance genetic effects, and repeated observations per animal, i.e.

  yij = µ + ai + di + (ad)i + pi + eij,

where µ is the overall mean, ai is the additive genetic effect of animal i, di is the dominance genetic effect of animal i, (ad)i is the additive by dominance genetic effect of animal i, pi is the permanent environmental effect for an animal with records, and eij is the residual effect. Also,

  Var(a) = A σ²10,  Var(d) = D σ²01,  Var(ad) = (A∘D) σ²11,
  Var(p) = I σp²,   Var(e) = I σe²,

with all covariances between the random vectors equal to zero, and where ∘ denotes the Hadamard (element by element) product.

18.4.1 Simulation of Data

Data should be simulated to understand the model and methodology. The desired data structure is given in the following table for four animals.

  Animal   Number of Records
    1             3
    2             2
    3             1
    4             4

Assume that

  σ²10 = 324,  σ²01 = 169,  σ²11 = 49,  σp² = 144,  σe² = 400.

The additive genetic relationship matrix for the four animals is A = L10 L′10, where

  L10 = ( 1    0     0      0     )
        ( .5   .866  0      0     )
        ( .25  0     .9682  0     )
        ( .75  .433  0      .7071 ),

so that

  A = ( 1     .5    .25    .75   )
      ( .5    1     .125   .75   )
      ( .25   .125  1      .1875 )
      ( .75   .75   .1875  1.25  ).

The dominance genetic relationship matrix (derived from the gametic relationship matrix) is D = L01 L′01, where

  L01 = ( 1      0       0        0     )
        ( .25    .9682   0        0     )
        ( .0625  −.0161  .9979    0     )
        ( .125   .0968   −.00626  .9874 ),

so that

  D = ( 1      .25   .0625  .125 )
      ( .25    1     0      .125 )
      ( .0625  0     1      0    )
      ( .125   .125  0      1    ).
The simulated genetic effects for the four animals are (with va , vd , and vad being vectors of random normal deviates) a = (324).5 L10 va , 12.91 13.28 = , −10.15 38.60 d = (169).5 L01 vd , 15.09 5.32 = , −17.74 3.89 (ad) = (49).5 L11 vad , −12.22 −1.32 = . −4.30 5.76 In the additive genetic animal model, base population animals were first simulated and then progeny were simulated by averaging the additive genetic values of the parents and adding a random Mendelian sampling effect to obtain the additive genetic values. With non-additive genetic effects, such a simple process does not exist. The appropriate genetic relationship matrices are necessary 18.4. LINEAR MODELS FOR NON-ADDITIVE GENETIC EFFECTS311 and these need to be decomposed. The alternative is to determine the number of loci affecting the trait, and to generate genotypes for each animal after defining the loci with dominance genetic effects and those that have additive by dominance interactions. This might be the preferred method depending on the objectives of the study. Let the permanent environmental effects for the four animals be 8.16 −8.05 −1.67 15.12 p= . The observations on the four animals, after adding a new residual effect for each record, and letting µ = 0, are given in the table below. Animal 1 2 3 4 18.4.2 a 12.91 13.28 -10.15 38.60 d (ad) p 1 2 3 4 15.09 -12.22 8.16 36.21 45.69 49.41 5.32 -1.32 -8.05 9.14 -14.10 -17.74 -4.30 -1.67 -20.74 3.89 5.76 15.12 24.13 83.09 64.67 50.13 HMME Using the simulated data, the MME that need to be constructed are as follows. X0 X X0 Z 0 0 Z X Z Z + A−1 k10 Z0 X Z0 Z 0 ZX Z0 Z 0 ZX Z0 Z X0 Z Z0 Z 0 Z Z + D−1 k01 Z0 Z Z0 Z X0 Z Z0 Z Z0 Z 0 Z Z + (A D)−1 k11 Z0 Z X0 Z Z0 Z Z0 Z Z0 Z 0 Z Z + Ikp b̂ â d̂ ˆ ad p̂ = where k10 = 400/324, k01 = 400/169, k11 = 400/49, and kp = 400/144. Thus, the order is 17 for these four animals, with only 10 observations. Note that X0 y = (327.63) , and Z0 y = 131.31 −4.96 −20.74 222.02 . 
X0 y Z0 y Z0 y Z0 y Z0 y , 312 CHAPTER 18. NON-ADDITIVE GENETIC EFFECTS The solutions are â = 12.30 1.79 −8.67 15.12 , d̂ = 4.18 −4.86 −6.20 8.19 , ˆ = ad 1.49 −1.68 −1.87 3.00 , p̂ = 4.56 −6.18 −5.57 7.18 , and µ̂ = 17.02. The total genetic merit of an animal can be estimated by adding together the solutions for the additive, dominance, and additive by dominance genetic values, 17.97 −4.75 ˆ ĝ = = (â + d̂ + ad). −16.73 26.32 On the practical side, the solutions for the individual dominance and additive by dominance solutions should be used in breeding programs, but how? Dominance effects arise due to particular sire-dam matings, and thus, dominance genetic values could be used to determine which matings were better. However, additive by dominance genetic solutions may be less useful. Perhaps the main point is that if non-additive genetic effects are significant, then they should be removed through the model to obtain more accurate estimates of the additive genetic effects, assuming that these have a much larger effect than the non-additive genetic effects. 18.5 Computing Simplification Take the MME as shown earlier, i.e. X0 X X0 Z 0 0 Z X Z Z + A−1 k10 Z0 X Z0 Z 0 Z0 Z ZX 0 ZX Z0 Z X0 Z Z0 Z 0 Z Z + D−1 k01 Z0 Z Z0 Z X0 Z Z0 Z Z0 Z 0 Z Z + (A D)−1 k11 Z0 Z X0 Z Z0 Z Z0 Z Z0 Z 0 Z Z + Ikp b̂ â d̂ ˆ ad p̂ = Now subtract the equation for dominance genetic effects from the equation for additive genetic effects, and similarly for the additive by dominance and X0 y Z0 y Z0 y Z0 y Z0 y , 18.5. COMPUTING SIMPLIFICATION 313 permanent environmental effects, giving A−1 k10 â − D−1 k01 d̂ = 0 ˆ = 0 A−1 k10 â − (A D)−1 k11 ad −1 −1 A k10 â − I kp p̂ = 0 Re-arranging terms, then d̂ = DA−1 (k10 /k01 )â ˆ = (A D)A−1 (k10 /k11 )â ad p̂ = A−1 (k10 /kp )â The only inverse that is needed is for A, and the equations to solve are only as large as the usual animal model MME. The steps in the procedure would be iterative. ˆ and p̂ (initially 1. 
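That only A⁻¹ is needed is what makes this practical, because A⁻¹ is never obtained by brute-force inversion; it is built directly from the pedigree by the well-known Henderson (1976) and Quaas (1976) rules. The Python sketch below is not the author's code, and the toy pedigree and function names are only illustrative; the result is checked against the tabular A.

```python
def a_matrix(ped):
    # tabular method; ped is a list of (animal, sire, dam) with 0 = unknown,
    # ids 1..n, and parents appearing before their progeny
    n = len(ped)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for a, s, d in ped:
        for j in range(1, a):
            A[j][a] = A[a][j] = 0.5 * (A[j][s] + A[j][d])
        A[a][a] = 1.0 + 0.5 * A[s][d]
    return [row[1:] for row in A[1:]]

def a_inverse(ped, F):
    # Henderson/Quaas rules; F[i] = inbreeding coefficient of animal i
    n = len(ped)
    Ai = [[0.0] * (n + 1) for _ in range(n + 1)]
    for a, s, d in ped:
        v = 1.0                      # Mendelian sampling variance of animal a
        for p in (s, d):
            if p:
                v -= 0.25 * (1.0 + F[p])
        alpha = 1.0 / v
        Ai[a][a] += alpha
        for p in (s, d):
            if p:
                Ai[a][p] -= 0.5 * alpha
                Ai[p][a] -= 0.5 * alpha
        for p in (s, d):
            for q in (s, d):
                if p and q:
                    Ai[p][q] += 0.25 * alpha
    return [row[1:] for row in Ai[1:]]

# toy pedigree: 3 = (1,2), 4 = (1,3); animal 4 is inbred (F_4 = .25),
# but no animal's *parent* is inbred here, so all F entries used are zero
ped = [(1, 0, 0), (2, 0, 0), (3, 1, 2), (4, 1, 3)]
F = [0.0] * 5
A = a_matrix(ped)
Ainv = a_inverse(ped, F)
```

Multiplying A by Ainv recovers the identity matrix, including the row for the inbred animal (whose diagonal of A is 1.25, as in the example above).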
The steps in the procedure are iterative.

1. Adjust the observation vector for the current solutions of d̂, âd, and p̂ (initially these would be zero),

      ỹ = y - Z( d̂ + âd + p̂ ).

2. Solve the following equations:

      [ X'X   X'Z           ] [ b̂ ]   [ X'ỹ ]
      [ Z'X   Z'Z + A⁻¹k₁₀  ] [ â ] = [ Z'ỹ ].

3. Obtain solutions for d̂, âd, and p̂ using

      d̂  = D A⁻¹ (k₁₀/k₀₁) â,
      âd = (A⊙D) A⁻¹ (k₁₀/k₁₁) â,
      p̂  = A⁻¹ (k₁₀/k_p) â.

4. Go to step 1 and begin again until convergence is reached.

18.6 Estimation of Variances

Given the new computing algorithm, and using Gibbs sampling as a tool, the variances can be estimated. Notice from the above formulas that

   w₀₁ = D⁻¹ d̂ = A⁻¹ (k₁₀/k₀₁) â,
   w₁₁ = (A⊙D)⁻¹ âd = A⁻¹ (k₁₀/k₁₁) â,
   w_p = I p̂ = A⁻¹ (k₁₀/k_p) â.

Again, the inverses of D and (A⊙D) are not needed. The necessary quadratic forms are then

   d̂'w₀₁ = d̂'D⁻¹d̂,
   âd'w₁₁ = âd'(A⊙D)⁻¹âd,
   p̂'w_p = p̂'p̂,

and â'A⁻¹â. Generate 4 random Chi-square variates, C_i, with degrees of freedom equal to the number of animals in â, then

   σ²₁₀ = â'A⁻¹â / C₁,
   σ²₀₁ = d̂'w₀₁ / C₂,
   σ²₁₁ = âd'w₁₁ / C₃,
   σ²_p = p̂'w_p / C₄.

The residual variance would be estimated from

   σ²_e = ê'ê / C₅,

where C₅ is a random Chi-square variate with degrees of freedom equal to the total number of observations. This may not be a totally correct algorithm, and some refinement may be necessary, but it is the basic starting point.

Chapter 19  Threshold Models

19.1 Categorical Data

If observations are expressed phenotypically as membership in one of m categories, then the data are categorical, discrete, and non-continuous. With two categories, m = 2, the trait is an "all-or-none" or binary trait. Most disease traits are binary, where the animal is either diseased or not diseased. Calving ease or conformation traits have more than two categories arranged in a delineated sequence from one extreme expression to the opposite extreme expression.
Calving ease, for example, has four categories, namely (1) unassisted calving, (2) slight assistance, (3) hard pull, or (4) caesarean section required. Rump angle has nine categories, where the pins range from too low to too high, with the desirable category in the middle.

Categorical traits may be inherited in a polygenic manner. The underlying susceptibility to a disease trait, calving ease, or type trait may actually be continuous and may follow a normal distribution. The underlying continuous scale is known as the liability scale. On the liability scale is a threshold point (or points); above this threshold the animal expresses the disease phenotype, and below the threshold point the animal does not express the disease. The liability scale is only conceptual and cannot be observed (Gianola and Foulley, 1983; Harville and Mee, 1984). The model employed is known as a "threshold" model. The details of this procedure are described below.

19.2 Threshold Model Computations

The general linear model is

   y = Xb + Zu + e,

where y is the observation vector, but in the case of categorical data it represents a variable on the underlying liability scale. The underlying liability scale is unobservable, and therefore y is unknown. However, y is known to be affected by fixed effects, b, random effects, u, such as animal additive genetic effects, and random environmental (or residual) effects, e. The assumed covariance structure is

   Var( u, e ) = [ G  0
                   0  R ].

The analysis requires that the threshold points be estimated from the categorical data simultaneously with the estimation of b and u. This makes for a set of non-linear estimation equations. Initial values of the thresholds are used to estimate y, which is then used to estimate new threshold points and new values of b and u. This must be repeated until the threshold points do not change.
There are various quantities which need to be computed repeatedly in the analysis, and these are based on normal distribution functions.

1. Φ(x) is the cumulative distribution function of the normal distribution. This function gives the area under the normal curve up to the value x, for x going from minus infinity to plus infinity (the range for the normal distribution). For example, if x = .4568, then Φ(x) = .6761, and if x = -.4568, then Φ(x) = .3239. The short program given at the end of these notes computes Φ(x) for a value of x. Let Φ_k represent the value up to and including category k. Note that if there are m categories and k = m, then Φ_k = 1.

2. φ(x) is a function that gives the height of the normal curve at the value x, for a normal distribution with mean zero and variance 1. That is,

      φ(x) = (2π)^-.5 exp( -.5x² ).

   For example, if x = 1.0929, then φ(x) = .21955.

3. P(k) is the probability that x from a N(0,1) distribution is between two threshold points, i.e., is in category k. That is,

      P(k) = Φ_k - Φ_(k-1).

   If k = 1, then Φ_(k-1) = 0.

Data should be arranged by smallest subclasses. Consider the calving ease data from the paper by Gianola and Foulley (1983).

                         Calving Ease Data
   Herd-   Age of    Sex of   Sire of       Category
   Year    Dam(yr)   Calf     Calf       1    2    3   Total
     1       2         M        1        1    0    0     1
     1       2         F        1        1    0    0     1
     1       3         M        1        1    0    0     1
     1       2         F        2        0    1    0     1
     1       3         M        2        1    0    1     2
     1       3         F        2        3    0    0     3
     1       2         M        3        1    1    0     2
     1       3         F        3        0    1    0     1
     1       3         M        3        1    0    0     1
     2       2         F        1        2    0    0     2
     2       2         M        1        1    0    0     1
     2       3         M        1        0    0    1     1
     2       2         F        2        1    0    1     2
     2       3         M        2        1    0    0     1
     2       2         F        3        0    1    0     1
     2       3         M        3        0    0    1     1
     2       2         M        4        0    1    0     1
     2       2         F        4        1    0    0     1
     2       3         F        4        2    0    0     2
     2       3         M        4        2    0    0     2

Let n_jk represent the number of observations in the k-th category for the j-th subclass, and n_j. the marginal total for the j-th subclass. There are s = 20 subclasses in the above example data, and m = 3 categories for the trait.
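The three functions above are available through the error function in most languages. A Python sketch (the author's notes use a short FORTRAN program for Φ(x); the helper names here are hypothetical):

```python
import math

def norm_cdf(x):
    # Phi(x): area under the standard normal curve up to x
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    # phi(x): height of the standard normal curve at x
    return (2.0 * math.pi) ** -0.5 * math.exp(-0.5 * x * x)

def category_prob(k, thresholds):
    # P(k) = Phi_k - Phi_(k-1), with m-1 threshold points supplied
    cum = [0.0] + [norm_cdf(t) for t in thresholds] + [1.0]
    return cum[k] - cum[k - 1]
```

With two thresholds the three category probabilities always sum to one, which is a useful sanity check when iterating.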
19.2.1 Equations to Solve

The system of equations to be solved can be written as follows:

   [ Q     L'X     L'Z         ] [ ∆t ]   [ p          ]
   [ X'L   X'WX    X'WZ        ] [ ∆b ] = [ X'v        ]
   [ Z'L   Z'WX    Z'WZ + G⁻¹  ] [ ∆u ]   [ Z'v - G⁻¹u ]

The equations must be solved iteratively. Note that ∆b, for example, is the change in the solutions for b between iterations. The calculations for Q, L, W, p, and v are described below. The values of these matrices and vectors change with each iteration of the non-linear system.

Using the example data on calving ease, X contains columns referring to an overall mean, herd-year effects, age of dam effects, and sex of calf effects, and Z contains columns referring to sires. With one row per subclass, in table order:

   X (µ, H1, H2, A1, A2, S1, S2)     Z (sires 1 to 4)
   1 1 0 1 0 1 0                     1 0 0 0
   1 1 0 1 0 0 1                     1 0 0 0
   1 1 0 0 1 1 0                     1 0 0 0
   1 1 0 1 0 0 1                     0 1 0 0
   1 1 0 0 1 1 0                     0 1 0 0
   1 1 0 0 1 0 1                     0 1 0 0
   1 1 0 1 0 1 0                     0 0 1 0
   1 1 0 0 1 0 1                     0 0 1 0
   1 1 0 0 1 1 0                     0 0 1 0
   1 0 1 1 0 0 1                     1 0 0 0
   1 0 1 1 0 1 0                     1 0 0 0
   1 0 1 0 1 1 0                     1 0 0 0
   1 0 1 1 0 0 1                     0 1 0 0
   1 0 1 0 1 1 0                     0 1 0 0
   1 0 1 1 0 0 1                     0 0 1 0
   1 0 1 0 1 1 0                     0 0 1 0
   1 0 1 1 0 1 0                     0 0 0 1
   1 0 1 1 0 0 1                     0 0 0 1
   1 0 1 0 1 0 1                     0 0 0 1
   1 0 1 0 1 1 0                     0 0 0 1

where A1 and A2 denote 2-yr-old and 3-yr-old dams, and S1 and S2 denote male and female calves.

The vector t will contain the threshold points for a N(0,1) distribution at the end of the iterations. The process is begun by choosing starting values for b, u, and t. Let b = 0 and u = 0; then starting values for t can be obtained from the data by knowing the fraction of animals in the first category (i.e. 0.6786), and in the first two categories (i.e. 0.8572). The threshold values that give those percentages are t1 = 0.3904 and t2 = 0.9563.

Suppose that several iterations have been performed and the latest solutions are

   t = ( .37529, 1.01135 )',

   b = ( µ, H1, H2, A1, A2, S1, S2 )'
     = ( 0.0, 0.0, .29752, 0.0, -.12687, 0.0, -.39066 )',

   u = ( u1, u2, u3, u4 )' = ( -.08154, .06550, .12280, -.10676 )'.
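Rather than typing the design matrices, they can be generated from the subclass codes in the table. A Python sketch (variable names are hypothetical; the tuples repeat the table's codes):

```python
# (herd-year, age of dam, sex, sire) for the 20 subclasses, in table order
sub = [(1, 2, 'M', 1), (1, 2, 'F', 1), (1, 3, 'M', 1), (1, 2, 'F', 2),
       (1, 3, 'M', 2), (1, 3, 'F', 2), (1, 2, 'M', 3), (1, 3, 'F', 3),
       (1, 3, 'M', 3),
       (2, 2, 'F', 1), (2, 2, 'M', 1), (2, 3, 'M', 1), (2, 2, 'F', 2),
       (2, 3, 'M', 2), (2, 2, 'F', 3), (2, 3, 'M', 3), (2, 2, 'M', 4),
       (2, 2, 'F', 4), (2, 3, 'F', 4), (2, 3, 'M', 4)]

# columns: mean, H1, H2, A1 (2-yr), A2 (3-yr), S1 (M), S2 (F)
X = [[1, int(hy == 1), int(hy == 2), int(age == 2), int(age == 3),
      int(sex == 'M'), int(sex == 'F')] for hy, age, sex, _ in sub]

# one indicator column per sire
Z = [[int(sire == s) for s in (1, 2, 3, 4)] for *_, sire in sub]
```

Each row of X sums to 4 (mean plus one level from each factor), and the column sums of Z give the number of subclasses per sire.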
The following calculations are performed for each subclass, from j = 1 to s; those for j = 1 are shown below.

1. f_jk = t_k - x_j'b - z_j'u, for k = 1 to (m-1).

      f_11 = t1 - µ - H1 - A1 - S1 - u1
           = .37529 - 0 - 0 - 0 - 0 - (-.08154) = .45683,
      f_12 = t2 - µ - H1 - A1 - S1 - u1 = 1.09289.

2. For k = 1 to m, Φ_jk = Φ(f_jk).

      Φ_11 = Φ(.45683) = .6761,
      Φ_12 = Φ(1.09289) = .8628,
      Φ_13 = 1,  and  Φ_10 = 0.

3. For k = 1 to m, P_jk = Φ_jk - Φ_j(k-1).

      P_11 = .6761 - 0 = .6761,
      P_12 = .8628 - .6761 = .1867,
      P_13 = 1 - .8628 = .1372.

4. For k = 0 to m, φ_jk = φ(f_jk).

      φ_10 = 0.0,
      φ_11 = φ(.45683) = .3594,
      φ_12 = φ(1.09289) = .2196,
      φ_13 = 0.0.

5. The matrix W is a diagonal matrix of order equal to the number of smallest subclasses, s. The elements of W provide a weighting factor for the j-th subclass that depends on the number of observations in that subclass and on the values of φ_jk and P_jk. The j-th diagonal is given by

      w_jj = n_j. Σ_{k=1}^{m} (φ_j(k-1) - φ_jk)² / P_jk.

   For the first subclass,

      w_11 = n_1. [ (φ_10 - φ_11)²/P_11 + (φ_11 - φ_12)²/P_12 + (φ_12 - φ_13)²/P_13 ]
           = 1 [ (-.3594)²/.6761 + (.1398)²/.1867 + (.2196)²/.1372 ]
           = .6471.

   The values for all 20 subclasses were

      ( .6471  .5167  .6082  .5692  1.3058  1.5721  .5447  .6687  1.2380  .7196
        .6923  1.3247 .6776  .7334  .7146   .6110   1.1373 1.3723 1.4002  .7234 ).

6. The vector v is created and used in place of y, which is unknown. For the j-th subclass,

      v_j = Σ_{k=1}^{m} n_jk (φ_j(k-1) - φ_jk) / P_jk.

   For j = 1, then

      v_1 = 1(φ_10 - φ_11)/P_11 + 0(φ_11 - φ_12)/P_12 + 0(φ_12 - φ_13)/P_13
          = -.3594/.6761 = -.5316.

   The complete vector v is

      ( -.5316  1.0521  .6416  -.3475  -.4671  .9848  1.0414  -1.0678  -.0928  -.5731
        -.9676  -.6993  1.4632 .9961   -.7115  1.3042 .4859   -.4713   -.8220  -1.2216 ).
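The per-subclass quantities w_jj and v_j can be reproduced in a few lines. A Python sketch (helper names hypothetical; the CDF uses the standard error-function identity):

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return (2.0 * math.pi) ** -0.5 * math.exp(-0.5 * x * x)

def subclass_terms(f, n_total, n_cat):
    # f: the (m-1) adjusted thresholds f_jk; n_cat: counts per category
    m = len(f) + 1
    Phi = [0.0] + [norm_cdf(v) for v in f] + [1.0]     # Phi_0 .. Phi_m
    phi = [0.0] + [norm_pdf(v) for v in f] + [0.0]     # phi_0 .. phi_m
    P = [Phi[k] - Phi[k - 1] for k in range(1, m + 1)]
    w = n_total * sum((phi[k - 1] - phi[k]) ** 2 / P[k - 1]
                      for k in range(1, m + 1))
    v = sum(n_cat[k - 1] * (phi[k - 1] - phi[k]) / P[k - 1]
            for k in range(1, m + 1))
    return w, v

# subclass j = 1: one record, in category 1
w11, v1 = subclass_terms([0.45683, 1.09289], 1, [1, 0, 0])
```

This reproduces the worked values for the first subclass to the printed precision.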
7. The matrix L is of order s × (m-1), and the jk-th element is

      l_jk = -n_j. φ_jk [ (φ_jk - φ_j(k-1))/P_jk - (φ_j(k+1) - φ_jk)/P_j(k+1) ].

   For the example data,

      L = [  -.46033  -.18681
             -.41079  -.10587
             -.45052  -.15773
             -.43594  -.13326
             -.92257  -.38328
            -1.24395  -.32817
             -.92432  -.47584
             -.42492  -.11981
             -.46308  -.20565
             -.90751  -.33045
             -.45725  -.26236
             -.46306  -.22921
             -.92501  -.39969
             -.45573  -.26770
             -.46349  -.21408
             -.45053  -.28290
             -.45893  -.25571
             -.45138  -.15961
             -.87141  -.26588
             -.92684  -.44551 ].

8. The matrix Q is a tri-diagonal matrix of order (m-1) × (m-1); the tri-diagonal structure is only noticeable when m is greater than 3. The diagonals of Q are given by

      q_kk = Σ_{j=1}^{s} n_j. φ²_jk (P_jk + P_j(k+1)) / (P_jk P_j(k+1)),

   and the off-diagonals are

      q_k(k+1) = - Σ_{j=1}^{s} n_j. φ_jk φ_j(k+1) / P_j(k+1).

   For the example data,

      Q = [  24.12523  -11.55769
            -11.55769   16.76720 ].

9. The elements of vector p are given by

      p_k = Σ_{j=1}^{s} φ_jk [ (n_jk/P_jk) - (n_j(k+1)/P_j(k+1)) ],

   for k = 1 to (m-1). Hence,

      p' = ( .003711  .000063 ).

Assume that G = I(1/19) for the example data. There are dependencies in X and with the threshold points, so that 4 restrictions on the solutions are needed. Let the solutions for µ, H1, A1, and S1 be restricted to zero by removing those columns from X. The remaining calculations give the following submatrices of the mixed model equations:

   L'X = [ -6.8311  -6.6726  -6.1344
           -3.1131  -2.6858  -2.0568 ],

   L'Z = [ -3.1495  -3.9832  -2.7263  -2.7086
           -1.2724  -1.5121  -1.2983  -1.1267 ],

   X'WX = [ 9.9442  4.6588  4.9885
            4.6588  9.3585  3.2541
            4.9885  3.2541  8.1912 ],

   X'WZ = [ 2.6498  2.0481  1.4110  3.8352
            1.3005  3.6014  1.9469  2.5096
            1.7546  3.4660  1.2223  1.7483 ],

   X'v = ( -.00221, -.00210, -.00149 )',

   Z'v - (19I)u = ( -.00070, -.00139, -.00116, -.00052 )',

   Z'WZ + 19I = diag( 23.4219  24.4953  23.0246  22.8352 ).

The solutions (changes in estimates) to these equations are added to the estimates used to start the iteration to give the new estimates, as shown in the table below.
   Effect     ∆β          β_(i-1)      β_i
   t1         .000229     .37529       .375519
   t2         .000158     1.01135      1.011508
   H1         0.0         0.0          0.0
   H2         -.000046    .29752       .297473
   A1         0.0         0.0          0.0
   A2         -.000013    -.12687      -.126883
   S1         0.0         0.0          0.0
   S2         .000064     -.39066      -.390596
   u1         .000011     -.08154      -.081529
   u2         -.000013    .06550       .065487
   u3         -.000014    .12280       .122786
   u4         .000017     -.10676      -.106743

As further iterations are completed, the right hand sides of the equations approach a null vector, so that the solutions, ∆β, also approach a null vector.

There could be problems in obtaining convergence, or a valid set of estimates. In theory the threshold points should be strictly increasing; that is, t_m should be greater than t_(m-1), which should be greater than t_(m-2), and so on. The solutions may not come out in this order. This could happen when two thresholds are close to each other due to the definition of categories. An option would be to combine two or more of the categories involved in the problem, and to re-analyze with fewer categories.

Another problem can occur when all observations fall into either extreme category. This is known as an "extreme category" problem. Suggestions have been made to delete subclasses with this problem, but throwing away data is not a good idea. Other types of analyses may be needed.

19.3 Estimation of Variance Components

Harville and Mee (1984) recommended a REML-like procedure for estimating the ratio of residual to random effect variances. One of the assumptions in a threshold model is that the residual variance is fixed at a value of 1. Hence only the variances of the random effects need to be estimated. Let a generalized inverse of the coefficient matrix in the equations be represented as

   C = [ C_tt  C_tx  C_tz
         C_xt  C_xx  C_xz
         C_zt  C_zx  C_zz ].

Then the REML EM estimator of the variance of the random effects (here u ~ N(0, Iσ²_u)) is

   σ²_u = ( û'û + tr(C_zz) ) / d,

where d is the number of levels in u.
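One pass of this update is only a few operations once the trace is available. A Python sketch, with the sire solutions from the table above and the trace value taken from the example that follows:

```python
u_hat = [-0.081529, 0.065487, 0.122786, -0.106743]   # sire solutions
tr_Czz = 0.18460   # trace of the u-block of the inverted coefficient matrix
d = len(u_hat)

# EM-style update; the residual variance is fixed at 1 on the liability scale
sigma2_u = (sum(x * x for x in u_hat) + tr_Czz) / d
new_ratio = 1.0 / sigma2_u   # ratio used in the next round of equations
```

The new ratio replaces the previous value of 19 in G⁻¹, and the whole non-linear system is iterated again.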
Using the example data, and the most recent estimates (as final values),

   û'û = .0374061,   tr(C_zz) = .18460,
   σ²_u = ( .0374061 + .18460 ) / 4 = .0555015.

The new variance ratio is (.0555015)⁻¹ = 18.0175. The REML EM-like algorithm would need to be repeated until the variance estimate has converged. Hence there are two levels of iteration going on at the same time in a very non-linear system of equations.

The final estimates of the variances are in terms of the liability, or underlying normal distribution, scale. If the number of categories is only two, then heritability can be converted to the "observed" scale using the formula

   h²_obs = h²_lia z² / ( p(1-p) ),

where p is the proportion of observations in one category, and z is the height of the standard normal curve at the threshold point.

19.4 Expression of Genetic Values

Using the estimates from the non-linear system of equations, the probability that a sire's offspring fall into each category can be calculated. Take sire 3 as an example. First the threshold point for male offspring of 2-yr-old dams in herd-year 1 needs to be determined as

   x = t1 + H1 + A1 + S1 + u3
     = .375519 + 0 + 0 + 0 + .122786 = .498305.

Using the short program below, Φ(x) = Φ(.498305) = .69087. Similarly, the probability of the sire's offspring being in categories 1 and 2 would be based on the second threshold,

   x = 1.011508 + .122786 = 1.134294,  or  Φ(x) = .87166.

Thus, the proportion of offspring in category 2 would be .87166 - .69087 = .18079, and the proportion in category 3 would be 1.0 - .87166 = .12834. In the raw data, about .68 of the offspring were in category 1, so sire 3 has only about .01 greater probability for category 1; for categories 1 and 2 combined, the raw percentage was .86. With the low heritability assumed, sires would not be expected to have greatly different probabilities from the raw averages.
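The category probabilities for sire 3 can be reproduced with the normal CDF from earlier (a Python sketch; variable names hypothetical):

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

t1, t2 = 0.375519, 1.011508   # estimated thresholds
u3 = 0.122786                 # solution for sire 3
# male calves of 2-yr-old dams in herd-year 1: those fixed solutions are zero
p_cat1 = norm_cdf(t1 + u3)
p_cat12 = norm_cdf(t2 + u3)
probs = [p_cat1, p_cat12 - p_cat1, 1.0 - p_cat12]
```

For a different subclass, the appropriate fixed effect solutions would simply be added to each threshold before applying the CDF.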
Notice that the probabilities are defined in terms of one of the smallest fixed effects subclasses. If 3-yr-old dams and female calves were considered, then the new threshold values would be lower, giving fewer offspring in category 1 and maybe more in categories 2 or 3, because the solutions for 3-yr-old dams and for female calves are both negative. If an average offspring is envisioned, then solutions for ages of dam can be averaged, and solutions for sex of calf can be averaged. A practical approach would be to choose the subclass where there seem to be the most economic losses, and to rank sires on that basis.

Part IV  Computing Ideas

Chapter 20  Fortran Programs

This chapter gives ideas on how to write a FORTRAN program to perform a multiple trait random regression model. The example will be lactation production of dairy cows, for milk yield in the first three lactations. The linux servers employ the Intel compiler and utilize a few Intel math library routines (sorting, and calculating eigenvalues and eigenvectors).

20.1 Main Program

The strategy is to have a main program that is only one or two pages in length. Almost every line of the main program is a call to a subroutine. The subroutines are also kept as simple as possible, but some can be very long.

The model assumes that all of the factors except Year-Month of Calving are fitted by order 4 Legendre polynomials, so every level of these factors has 5 covariates to estimate. The Year-Month of Calving (within parities) has 36 ten-day periods to estimate.

c     Random regression model
c     production test day milk records
c     one trait, 3 lactations, order 4 Legendre polynomials
c     y = YM(36) + RAS(5) + CG(15) + a(15) + p(15) + e(3,4)
      include 'SShd.f'
      itest = 0
      igibb = 0
      mxit = 6000
      if(itest.gt.0)mxit = 10
      if(igibb.gt.0)then
c Two files for storing Gibbs samples
        open(17,file='animalVCV.d',form='unformatted',
     x   status='unknown')
        open(19,file='cgVCV.d',form='unformatted',
     x   status='unknown')
        open(20,file='resVC.d',form='formatted',
     x   status='unknown')
      endif
c
c #############################################################
c read in parameters
      call params
c
c read in pedigree info
      call peddys
c
c read in data
      call datum
c
c iterations on equations to solve
 800  iter = iter + 1
      if(iter.gt.mxit)go to 9901
c
      ccn = 0.d0
      ccd = 0.d0
c one subroutine for each factor in the model
      call facYM
      call facRAS
      call facCG
      call permenv
      call genetic
c igibb > 0 if covariance matrices are estimated
      if(igibb.gt.0)then
        call facRES
      endif
      ccc = 100.0*(ccn/ccd)
      if(mod(iter,100).eq.0)print *,iter,ccc
      if(ccc.gt.0.1d-09)go to 800
c
c #############################################################
c finished, save solutions
c
 9901 call finis
      if(igibb.gt.0)then
        close(17)
        close(19)
        close(20)
      endif
      stop
      end
      include 'SSparams.f'
      include 'SSpeddys.f'
      include 'SSdatum.f'
      include 'SSfini.f'
      include 'SSYM.f'
      include 'SSRAS.f'
      include 'SSCG.f'
      include 'SSanm.f'
      include 'SSape.f'
      include 'SSres.f'
      include 'dkmvhf.f'

The program is only a few steps.

1. call params, to read in the necessary covariance function matrices, and set up their inverses for use in the mixed model equations.

2. call peddys, to read in the pedigree information and to set up the diagonals of A⁻¹ for each animal.

3. call datum, to read and store the data for the iteration on data procedure. Some arrays need to be sorted.

4. Iterate on the solutions by calling subroutines, one for each factor in the model.
5. call finis, to write out and save the solutions for the factors that are of interest, only if igibb = 0.

All of the subroutines are joined together by include 'SShd.f'. This is a file that defines the variables in the program that need to be shared between subroutines. It specifies which variables are double precision and which are integer. The big arrays are put into COMMON areas so that they are stored consecutively in memory and thereby take a little less space. Thus, during the initial programming, if an array needs to be increased in length, this can be done in this one file, and the change then occurs in all other subroutines. There is no chance of forgetting to make the change in one of the subroutines.

With COMMON areas one has to worry about boundary alignments. This problem is avoided by having separate COMMON areas for double precision arrays and integer arrays. Boundary alignment problems occur when integer and double precision arrays are mixed together in one COMMON area. If the two types are to be in the same COMMON area, then the double precision arrays should precede the integer arrays.

      Parameter(no=15,nop=120,nam=200000,ncg=12000,nras=200,
     x ndim=365,nrec=1270000,nped=500000,nym=500,mcov=5,
     x ntg=36,ntim=4)
c
c     no   = 5 covariates times 3 lactations = 15
c     nop  = (no*(no+1))/2, half-stored matrix array size
c     nam  = maximum number of animals
c     ncg  = maximum number of contemporary groups
c     nras = number of region-age-season of calving subclasses
c     ndim = number of days in milk (maximum)
c     nrec = maximum number of test day records
c     nped = maximum number of pedigree elements to store
c     nym  = number of year-month subclasses
c     mcov = number of covariates
c
      Common /recs/lp(ndim,5),obs(nrec),anid(nrec),cgid(nrec),
     x rasid(nrec),ymid(nrec),pari(nrec),days(nrec),timg(nrec),
     x mrec,mcgid,mras,mym,iseed
c
c     lp    = legendre polynomials of order 4 for days 1 to 365 in milk
c     obs   = test day milk yields
c     anid  = animal ID associated with each obs
c     cgid  = contemporary group associated with each obs
c     rasid = region-age-season for each obs
c     ymid  = year-month for each obs
c     pari  = parity number for each obs
c     days  = days in milk for each obs
c     timg  = 1 to 36 time groups within YM subclasses
c     mrec  = actual total number of test day records
c     mcgid = max id of contemporary groups
c     mras  = max id of ras subclasses
c     mym   = max id of year-month subclasses
c
      Common /peds/bii(nam),adiag(nam),sir(nam),dam(nam),
     x cpa(nped),cpc(nped),cps(nped),cpd(nped),
     x jped(nped),mam,mped
c
c     bii   = elements needed for A-inverse
c     adiag = diagonal elements for each animal
c     sir   = sire ID (consecutively numbered and chronological)
c     dam   = dam ID (consecutively numbered and chronological)
c     cpa   = coded pedigree record, animal id
c     cpc   = code (0,1,2)
c     cps   = sire or progeny ID
c     cpd   = dam or mate ID
c     mam   = total number of animals < nam
c     mped  = pedigree records < nped
c
      Common /parms/gi(nop),pi(nop),ci(3,15),res(4,3),
     x ri(ndim,3),wv(nras)
c
c     gi  = genetic covariance function matrix
c     pi  = permanent environmental covariance function matrix
c     ci  = contemporary group covariance function matrix
c     res = residual variances, 3 parities, 4 periods
c     ri  = inverses for each day in milk
c     wv  = work vector, used for many things
c
      Common /diags/pcg(nrec),pras(nrec),pymid(nrec),
     x panid(nrec),wr(nam),iwv(nras),itest,igibb
c
c     pcg   = CGID sorted order
c     pras  = rasid sorted order
c     pymid = YMID sorted order
c     panid = anid sorted order
c     wr    = number of records per animal (many are zero)
c     iwv   = integer work vector
c     itest = 0 for a good run, not zero during initial programming
c     igibb = not zero means to estimate covariance matrices
c
      Common /solns/sanm(nam,no),sape(nam,no),scg(ncg,mcov),
     x sras(nras,mcov),sym(nym,ntg),ccn,ccd,ccc
c
c     solution arrays for animal genetic, PE, contemporary groups,
c     and region-age-season; ccn, ccd, and ccc are for the
c     convergence criterion
c
      real*8 lp,obs,bii,adiag,gi,pi,res,ri,sanm,sape,scg,sras,
     x sym,rhs,ccn,ccd,ccc,wv
      integer anid,cgid,rasid,ymid,pari,days,timg,pcg,pras,
     x pymid,panid,mped,mrec,sir,dam,cpa,cpc,cps,cpd,jped,
     x wr,iwv,itest,mam,igibb,mras,mcgid,mym,iseed

The above lines are included in every subroutine that is part of the main program. Subroutines may have some of their own variables, which are only used within that subroutine. Define all of the variables in this file as either real*8 or integer; do not rely on default typing rules.

20.2 Call Params

The first subroutine to be called is params, which reads in the covariance matrices that will be used to start the iteration process. The random factors of the model are contemporary groups, animal additive genetic effects, and animal permanent environmental effects.
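The listing below relies on ihmssf(i,j,n), which maps element (i,j) of a symmetric matrix of order n to its position in a half-stored (packed triangular) array; that is why gi and pi have length nop = no(no+1)/2 = 120. The author's FORTRAN source for ihmssf is not shown here, but the usual column-wise upper-triangle mapping it performs can be sketched in Python (function name hypothetical):

```python
def packed_index(i, j, n):
    # 1-based (i, j) of an n x n symmetric matrix -> 1-based position in an
    # array holding the upper triangle stored column by column;
    # n is kept only to mirror the FORTRAN argument list
    if i > j:
        i, j = j, i          # symmetry: only one triangle is stored
    return i + j * (j - 1) // 2

no = 15
# every (i, j) with i <= j maps to a distinct slot in 1 .. n(n+1)/2
slots = {packed_index(i, j, no)
         for j in range(1, no + 1) for i in range(1, j + 1)}
```

For no = 15 this mapping fills exactly 120 slots, matching the nop parameter above.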
      subroutine params
      include 'SShd.f'
      real*8 varc,varg,varp,x,z(5),hh(15)
      open(10,file='GP4.d',form='formatted',status='old')
      gi = 0.d0
      pi = 0.d0
      ri = 0.d0
      ci = 0.d0
      res = 0.d0
      lp = 0.0d0
c
 10   read(10,1001,end=20)kr,kc,varg,varp
 1001 format(1x,2i4,8d20.10)
      if(kr.eq.0)go to 20
      m = ihmssf(kr,kc,no)
      gi(m) = varg
      pi(m) = varp
      go to 10
c
 20   read(10,1002,end=21)kp,kr,kc,varc
 1002 format(1x,3i4,d20.10)
      m = ihmssf(kr,kc,mcov)
      ci(kp,m)=varc
      go to 20
 21   close(10)
c
c residual variances, by parity and days groups
c
      open(11,file='RES4.d',form='formatted',status='old')
      do 22 i=1,3
        do 22 j=1,4
          read(11,*,end=30)ka,x
          res(j,i)=x
 22   continue
 30   close(11)
c
      call dkmvhf(gi,no,wv,iwv)
      call dkmvhf(pi,no,wv,iwv)
      do 81 kp=1,3
        hh = 0.d0
        do 82 m=1,15
          hh(m)=ci(kp,m)
 82     continue
        call dkmvhf(hh,mcov,wv,iwv)
        do 83 m=1,15
          ci(kp,m)=hh(m)
 83     continue
 81   continue
c Different residual covariance matrices through the lactation
      ri = 0.0d0
      do 31 i=1,45
        ri(i,1) = 1.d0/res(1,1)
        ri(i,2) = 1.d0/res(1,2)
        ri(i,3) = 1.d0/res(1,3)
 31   continue
      do 32 i=46,115
        ri(i,1) = 1.d0/res(2,1)
        ri(i,2) = 1.d0/res(2,2)
        ri(i,3) = 1.d0/res(2,3)
 32   continue
      do 33 i=116,265
        ri(i,1) = 1.d0/res(3,1)
        ri(i,2) = 1.d0/res(3,2)
        ri(i,3) = 1.d0/res(3,3)
 33   continue
      do 34 i=266,365
        ri(i,1) = 1.d0/res(4,1)
        ri(i,2) = 1.d0/res(4,2)
        ri(i,3) = 1.d0/res(4,3)
 34   continue
c
c read in Legendre polynomials, order 4
c
      open(12,file='LPOLY4.d',form='formatted',status='old')
      lp = 0.0d0
 40   read(12,1201,end=55)kdim,z
 1201 format(2x,i5,2x,5f20.10)
      do 42 k=1,5
        lp(kdim,k)=z(k)
 42   continue
      go to 40
 55   close(12)
c
c read in a random number seed, initialize random number
c generators
c
      open(13,file='seedno.d',form='formatted',status='old')
      read(13,1301,end=65)iseed
 1301 format(1x,i10)
      call firan(iseed)
 65   close(13)
      return
      end

Four input files are needed: a) one for the covariance matrices for genetic, permanent environmental, and contemporary group effects, b) one for the residual variances, c) one for the Legendre polynomials, and d) one for the random number seed. Remember to create the appropriate files, and make sure the format statements are in agreement with the data files.

The inverses of the covariance matrices are obtained using dkmvhf.f. This is Henderson's inversion routine that he wrote back in the 1960's. His version was called djnvhf.f, but Karin Meyer found a way to improve its speed. Henderson's version physically re-arranged rows and columns during the inversion process. Meyer's version merely kept an array of indexes of the rows to be re-arranged, and did not actually re-arrange them until the end. This increased the speed considerably, and so the new version became dkmvhf.f, where the km is for Karin Meyer. One advantage of both routines is that the matrix can have rows and columns that are all zeros. Many inversion routines require a non-singular matrix, so that the zero rows and columns must be removed before calling the subroutine. This routine is given in the Appendix.

Note that in this version of FORTRAN an entire array can be set to zero with one statement, gi = 0.d0.
It appears to be important to use 0.d0 rather than 0.0 in this version of FORTRAN. The latter gives a single precision zero, accurate only to about 7 decimal digits, which can be critical in some programs. Thus, always use the 0.d0 form.

The subroutine firan is specialized software (in Guelph) for initializing a series of random number generators for different distributions. The Mersenne twister is used as the algorithm in these routines; it has a very long cycle length, (2^19937 - 1). The cycle length is how many numbers are generated before the sequence starts to repeat itself. When using Gibbs sampling it is good to have a long cycle length.

20.3 Call Peddys

The following subroutine reads in the pedigree with the bii values needed for the inverse of the additive relationship matrix. These values were computed by another series of programs which order the animals chronologically and then calculate the inbreeding coefficients, processing parents before their progeny.

The subroutine also reads in a 'coded' pedigree file, which has an animal, then all of its progeny following, along with the mate that produced each progeny. This is so additive relationships can be accounted for easily in the iteration program.

      subroutine peddys
      include 'SShd.f'
      character*8 oid
      real*8 x,v,z
      open(10,file='PARTES.d',form='formatted',
     x status='old')
c
      adiag = 0.d0
      sir = 0
      dam = 0
      bii = 0.d0
      mam = 0
c
 10   read(10,1001,end=50)ka,ks,kd,x,z,oid
 1001 format(1x,3i10,1x,d20.10,2x,d20.10,1x,a8)
      mam = mam + 1
      sir(ka) = nam
      if(ks.gt.0)sir(ka) = ks
      dam(ka) = nam
      if(kd.gt.0)dam(ka) = kd
      bii(ka) = x
      adiag(ka) = adiag(ka) + x
      v = 0.25d0*x
      if(ks.gt.0)adiag(ks)=adiag(ks)+v
      if(kd.gt.0)adiag(kd)=adiag(kd)+v
      go to 10
c
 50   close(10)
      print *,'peddys, mam= ',mam
c
c read and store coded pedigree file
c
      open(11,file='CARTES.d',form='formatted',
     x status='old')
      mped = 0
      jped = 0
      cpa = 0
      cpc = 0
      cps = 0
      cpd = 0
 60   read(11,1101,end=90)ia,ic,is,id
 1101 format(1x,i10,i3,1x,2i10)
      mped = mped + 1
      if(mped.gt.nped)go to 89
      cpa(mped)=ia
      cpc(mped)=ic
      if(is.lt.1)is = nam
      if(id.lt.1)id = nam
      cps(mped)=is
      cpd(mped)=id
      if(ic.eq.0)jped(ia)=mped
      go to 60
c
 89   print *,'nped limit exceeded in SSpeddys.f'
 90   close(11)
      print *,'peddys, mped = ',mped
c
      return
      end

20.4 Call Datum

The data have been prepared by other programs, and the levels of each factor have been converted to consecutive numbers from 1 to the maximum number of levels for that factor. One could also calculate means of the milk yields by days in milk, year-months, or whatever may be of interest.
      subroutine datum
      include 'SShd.f'
      real*8 p(3)
      open(11,file='MILKTDM.d',form='formatted',
     x     status='old')
      mrec = 0
      obs = 0.d0
      nerr = 0
  11  read(11,1101,end=20,err=88)iam,iym,iras,icg,
     x     jdim,jtim,p
1101  format(1x,6i10,3f9.2)
c
      if(jdim.lt.5)go to 11
      if(jdim.gt.ndim)go to 11
      mrec = mrec + 1
      if(mrec.gt.nrec)go to 19
      anid(mrec) = iam
      cgid(mrec) = icg
      rasid(mrec) = iras
      ymid(mrec) = iym
      timg(mrec) = jtim
      if(icg.gt.mcgid)mcgid=icg
      if(iras.gt.mras)mras=iras
      if(iym.gt.mym)mym=iym
      pari(mrec) = 1
      if(p(2).gt.-9000.0)pari(mrec)=2
      if(p(3).gt.-9000.0)pari(mrec)=3
      kp=pari(mrec)
      obs(mrec) = p(kp)
      days(mrec) = jdim
      go to 11
  19  print *,'Too many records'
      go to 20
  88  print *,'Err rec',mrec
      go to 11
  20  close(11)
      print *,' datum, mrec= ',mrec
      write(30,3005)mrec,nrec,mcgid,ncg,mras,nras,
     x     mym,nym
3005  format(1x,2i10,' recs'/1x,2i10,' mcgid'/1x,2i10,
     x     ' mras'/1x,2i10,' mym')
c
c   sort data by levels of different factors
c   IPSORT is an Intel math library function
c
      kflag = 1
      ier = 0
      call IPSORT(ymid,mrec,pymid,kflag,ier)
      call IPSORT(rasid,mrec,pras,kflag,ier)
      call IPSORT(cgid,mrec,pcg,kflag,ier)
      call IPSORT(anid,mrec,panid,kflag,ier)
c
c   set all solution vectors to zero
c
      sanm = 0.d0
      sape = 0.d0
      scg = 0.d0
      sras = 0.d0
      sym = 0.d0
      return
      end

20.5 Iteration Subroutines

The main program can be thought of as 'modular'. There are fixed factors, random factors, and the animal additive genetic factor. Fixed factors do not have any covariance function matrices. There are two types of fixed factors in this model. The year-month of calving subclasses each have 36 ten-day periods associated with them, to model the trajectory of test day milk yields. The other type is the region-age-season subclasses, which are modelled by order 4 Legendre polynomials, and thus there are 5 parameters for each subclass.
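The order 4 Legendre polynomial covariates mentioned above can be generated as in the following Python sketch (a minimal illustration; the days-in-milk range of 5 to 365 is taken from the edits in the datum subroutine, and the normalization constant sqrt((2j+1)/2) is the usual one, giving the values stored in the lp table):

```python
import math

def legendre_covariates(t, tmin=5, tmax=365, order=4):
    """Normalized Legendre polynomial covariates of the given order for
    days in milk t, i.e. the quantities played by lp(jdim,j) in the
    FORTRAN programs (a sketch; the actual program precomputes a table)."""
    # scale days in milk to the interval [-1, 1]
    x = 2.0 * (t - tmin) / (tmax - tmin) - 1.0
    # Legendre polynomials by the three-term recurrence
    p = [1.0, x]
    for j in range(2, order + 1):
        p.append(((2 * j - 1) * x * p[j - 1] - (j - 1) * p[j - 2]) / j)
    # normalize each polynomial by sqrt((2j+1)/2)
    return [math.sqrt((2 * j + 1) / 2.0) * p[j] for j in range(order + 1)]
```

An order 4 fit therefore carries 5 covariates per subclass, matching the 5 parameters per region-age-season subclass described above.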
20.5.1 Year-Month of Calving Factor

      subroutine facYM
      include 'SShd.f'
c
      real*8 diags(ntg),vnois(ntg),ay(ntg),
     x     XRY(ntg),c,y,z,w,x,ddif,xad
      integer levs(nrec),iork(nop),mfac
c #######################################################
c   determine number of observations per
c   level of the factor, store in levs
      levs = 0
      mfac = 0
      do 8 krec=1,mrec
      iym = ymid(krec)
      if(iym.gt.mfac)mfac = iym
      levs(iym) = levs(iym) + 1
   8  continue
      kend=0
c #######################################################
c   For each level of the factor
c   adjust observations for all other solutions
c   save in XRY, make diags of MME
      do 11 iym = 1,mfac
      jrec = levs(iym)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = pymid(lrec)
c
      iam = anid(krec)
      icg = cgid(krec)
      iras = rasid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (scg(icg,j) + sanm(iam,ka)
     x     + sras(iras,j) + sape(iam,ka))*lp(jdim,j)
  15  continue
c
      xad = y*c
      XRY(jtim)=XRY(jtim) + xad
      diags(jtim)=diags(jtim) + c
c
  10  continue
c ####################################################
c   solve for new solution for this level of factor
c
      do 16 i=1,ntg
      if(diags(i).gt.0.d0)diags(i)=1.d0/diags(i)
  16  continue
      vnois = 0.d0
c ###################################################
c   if estimating covariance matrices then
c   generate sampling variance to
c   add to solutions
      if(igibb.gt.0)then
      do 17 i=1,ntg
      call fgnor3(z)
      vnois(i)=z*dsqrt(diags(i))
  17  continue
      endif
c
      do 25 j=1,ntg
      z = XRY(j)*diags(j)
c   add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sym(iym,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sym(iym,j) = z
  25  continue
  11  continue
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sym(jj,L),L=1,5)
5003  format(' PYM',i4,5f12.4)
      endif
      return
      end

20.5.2 Region-Age-Season of Calving

The above statements were for the year-month subclasses, modelling the trajectories of lactation curves for milk yield using 36 ten-day periods. Now compare that routine to the one for region-age-seasons, which are modelled by order 4 Legendre polynomials as covariates, but only within a parity. Subclasses were numbered consecutively across region-age-seasons. Ages are nested within parities, and thus there are only 5 covariates per subclass. Dealing with the covariates requires different coding.

      subroutine facRAS
      include 'SShd.f'
c
      real*8 diags(200),vnois(200),ay(20),
     x     XRY(20),work(200),hh(200),c,y,z,w,x,ddif,xad
      integer levs(nrec),iork(200),mfac
c #######################################################
c   determine number of observations per
c   level of the factor, store in levs
      levs = 0
      mfac = 0
      do 8 krec=1,mrec
      iras = rasid(krec)
      if(iras.gt.mfac)mfac = iras
      levs(iras) = levs(iras) + 1
   8  continue
      kend=0
c #######################################################
c   For each level of the factor
c   adjust observations for all other solutions
c   save in XRY, make diags of MME
      do 11 iras = 1,mfac
      jrec = levs(iras)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = pras(lrec)
      iam = anid(krec)
      icg = cgid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (scg(icg,j) + sanm(iam,ka)
     x     + sape(iam,ka))*lp(jdim,j)
  15  continue
c
      xad = y*c
      do 17 j=1,mcov
      kb=ja+j
      XRY(j)=XRY(j) + xad*lp(jdim,j)
      do 19 m=j,mcov
      kc=ja+m
      ma=ihmssf(j,m,mcov)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
  19  continue
  17  continue
c
  10  continue
c ####################################################
c   solve for new solution for this level of factor
c
      call dkmvhf(diags,mcov,work,iork)
      vnois = 0.d0
c ###################################################
c   if estimating covariance matrices then
c   do cholesky decomposition on diags
c   generate sampling variance (vnois) to
c   add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,mcov)
      call vgnor(vnois,work,hh,mcov)
      endif
c
      do 25 j=1,mcov
      z = 0.d0
      do 27 k=1,mcov
      m=ihmssf(j,k,mcov)
      z = z + diags(m)*XRY(k)
  27  continue
c   add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sras(iras,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sras(iras,j) = 0.5d0*(z + sras(iras,j))
  25  continue
  11  continue
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sras(jj,L),L=1,5)
5003  format(' RAS',i4,5f12.4)
      endif
      return
      end

20.5.3 Contemporary Groups

Contemporary groups (CG) are defined as cows in the same parity number calving within a few months of each other in the same herd and year. CG are modelled by order 4 Legendre polynomials. Because three parities are being analyzed together, there could be 3 covariance function matrices for CG effects, i.e. one for each parity, and we have assumed this to be true. However, if the three covariance matrices turn out to be similar, then the same covariance function matrix could be assumed for all parities.

The subroutine for CG differs from that for RAS because CG is a random factor, three separate covariance function matrices are allowed for, and the coding has to allow for the estimation of new matrices and for saving them in a file.
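The diags vectors in these routines are half-stored (upper triangle, row by row) and are addressed through the function IHMSSF listed at the end of Chapter 21. The same indexing in a minimal Python sketch (1-based arguments, mirroring the FORTRAN):

```python
def ihmssf(i, j, n):
    """1-based position of element (i,j) in a half-stored symmetric
    matrix of order n, stored as the upper triangle row by row,
    mirroring the FORTRAN function IHMSSF."""
    if i > j:
        i, j = j, i          # symmetric: only the upper triangle is stored
    return (2 * n - i) * (i - 1) // 2 + j
```

For n = 3 the six stored elements are addressed in the order (1,1), (1,2), (1,3), (2,2), (2,3), (3,3), which is exactly the order in which the m=m+1 loops above fill diags.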
      subroutine facCG
      include 'SShd.f'
c
      real*8 diags(200),vnois(200),ay(20),
     x     XRY(20),work(200),hh(200),c,y,z,w,x,ddif,xad
      real*8 ssc(3,15),VIc(15)
      integer levs(nrec),iork(200),levp(nrec),mfac
c #######################################################
c   determine number of observations per
c   level of the factor, store in levs
      levs = 0
      kop = 15
      levp = 0
      ssc = 0.d0
      mfac = 0
      do 8 krec=1,mrec
      icg = cgid(krec)
      if(icg.gt.mfac)mfac = icg
      levs(icg) = levs(icg) + 1
      levp(icg) = pari(krec)
   8  continue
      kend=0
c #######################################################
c   For each level of the factor
c   adjust observations for all other solutions
c   save in XRY, make diags of MME
      do 11 icg = 1,mfac
      jrec = levs(icg)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = pcg(lrec)
      iam = anid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sanm(iam,ka)
     x     + sape(iam,ka))*lp(jdim,j)
  15  continue
c
      xad = y*c
      do 17 j=1,mcov
      kb=ja+j
      XRY(j)=XRY(j) + xad*lp(jdim,j)
      do 19 m=j,mcov
      kc=ja+m
      ma=ihmssf(j,m,mcov)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
  19  continue
  17  continue
c
  10  continue
c ####################################################
c   Add inverse of covariance function matrix to diags
c   before inverting (one of three possible inverses)
c
      m=0
      kp = levp(icg)
      do 61 ir=1,mcov
      do 61 ic=ir,mcov
      m=m+1
      diags(m)=diags(m)+ci(kp,m)
  61  continue
      call dkmvhf(diags,mcov,work,iork)
      vnois = 0.d0
c ###################################################
c   if estimating covariance matrices then
c   do cholesky decomposition on diags
c   generate sampling variance (vnois) to
c   add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,mcov)
      call vgnor(vnois,work,hh,mcov)
      endif
c
      do 25 j=1,mcov
      z = 0.d0
      do 27 k=1,mcov
      m=ihmssf(j,k,mcov)
      z = z + diags(m)*XRY(k)
  27  continue
c   add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - scg(icg,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      scg(icg,j) = z
  25  continue
c
c   if estimating covariance matrices - must accumulate
c   sum of squares
      if(igibb.gt.0)then
      m=0
      ndf(kp)=ndf(kp)+1
      do 71 ir=1,mcov
      z = scg(icg,ir)
      do 71 ic=ir,mcov
      m=m+1
      ssc(kp,m)=ssc(kp,m)+scg(icg,ic)*z
  71  continue
      endif
  11  continue
c
c   Estimate new ci matrices
c
      if(igibb.gt.0)then
      kop=15
      do 217 kp=1,3
      nde = ndf(kp) + 2
      call fgchi1(nde,w)
      z=1.d0/w
      do 215 k=1,kop
      VIc(k)=ssc(kp,k)*z
 215  continue
      write(17)iter,VIc
      call dkmvhf(VIc,mcov,work,iork)
      do 216 k=1,kop
      ci(kp,k)=VIc(k)
 216  continue
 217  continue
      endif
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(scg(jj,L),L=1,5)
5003  format(' CGS',i4,5f12.4)
      endif
      return
      end

20.5.4 Animal Permanent Environmental

Animal permanent environmental (APE) effects concern all three parities and are correlated between parities, so that there are 15 covariates to estimate for each animal. Each parity is modelled by order 4 Legendre polynomials. The covariance matrix is therefore 15 by 15, and there is only one covariance matrix.

      subroutine facAPE
      include 'SShd.f'
c
      real*8 diags(nop),vnois(nop),ay(no),
     x     XRY(no),work(nop),hh(nop),c,y,z,w,x,ddif,xad
      real*8 ssp(nop),VIp(nop)
      integer levs(nrec),iork(nop),mfac
c #######################################################
c   determine number of observations per
c   level of the factor, store in levs
      levs = 0
      ssp = 0.d0
      mfac = 0
      do 8 krec=1,mrec
      iam = anid(krec)
      if(iam.gt.mfac)mfac = iam
      levs(iam) = levs(iam) + 1
   8  continue
      kend=0
c #######################################################
c   For each level of the factor
c   adjust observations for all other solutions
c   save in XRY, make diags of MME
      do 11 iam = 1,mfac
      jrec = levs(iam)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = panid(lrec)
      icg = cgid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sanm(iam,ka)
     x     + scg(icg,j))*lp(jdim,j)
  15  continue
c
      xad = y*c
      do 17 j=1,mcov
      kb=ja+j
      XRY(kb)=XRY(kb) + xad*lp(jdim,j)
      do 19 m=j,mcov
      kc=ja+m
      ma=ihmssf(kb,kc,no)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
  19  continue
  17  continue
c
  10  continue
c ####################################################
c   Add inverse of covariance function matrix to diags
c   before inverting
c
      m=0
      do 61 ir=1,no
      do 61 ic=ir,no
      m=m+1
      diags(m)=diags(m)+pi(m)
  61  continue
      call dkmvhf(diags,no,work,iork)
      vnois = 0.d0
c ###################################################
c   if estimating covariance matrices then
c   do cholesky decomposition on diags
c   generate sampling variance (vnois) to
c   add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,no)
      call vgnor(vnois,work,hh,no)
      endif
c
      do 25 j=1,no
      z = 0.d0
      do 27 k=1,no
      m=ihmssf(j,k,no)
      z = z + diags(m)*XRY(k)
  27  continue
c   add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sape(iam,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sape(iam,j) = z
  25  continue
c
c   if estimating covariance matrices - must accumulate
c   sum of squares
      if(igibb.gt.0)then
      m=0
      ndf=ndf+1
      do 71 ir=1,no
      z = sape(iam,ir)
      do 71 ic=ir,no
      m=m+1
      ssp(m)=ssp(m)+sape(iam,ic)*z
  71  continue
      endif
  11  continue
c
c   Estimate new pi matrix
c
      if(igibb.gt.0)then
      nde = ndf + 2
      call fgchi1(nde,w)
      z=1.d0/w
      do 215 k=1,nop
      VIp(k)=ssp(k)*z
 215  continue
      nm=1
      write(19)iter,nm,VIp
      call dkmvhf(VIp,no,work,iork)
      do 216 k=1,nop
      pi(k)=VIp(k)
 216  continue
      endif
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sape(jj,L),L=1,5)
5003  format(' APE',i4,5f12.4)
      endif
      return
      end

20.5.5 Animal Additive Genetic

Animal additive genetic (ANM) effects concern all three parities, like the APE effects, and are correlated between parities, so that there are 15 covariates to estimate for each animal. Each parity is modelled by order 4 Legendre polynomials. The covariance matrix is therefore 15 by 15, and there is only one covariance matrix. The big difference from APE, however, is the additive genetic relationships that must be taken into account amongst all animals, which accounts for the extra length of the following subroutine.

      subroutine facANM
      include 'SShd.f'
c
      real*8 diags(nop),vnois(nop),ay(no),
     x     XRY(no),work(nop),hh(nop),c,y,z,w,x,ddif,xad
      real*8 ssa(nop),VIa(nop),tcc(no),dg
      integer levs(nrec),iork(nop),mfac
c #######################################################
c   determine number of observations per
c   level of the factor, store in levs
      levs = 0
      ssa = 0.d0
      mfac = mam
      do 8 krec=1,mrec
      iam = anid(krec)
      levs(iam) = levs(iam) + 1
   8  continue
      kend=0
c #######################################################
c   For each level of the factor
c   adjust observations for all other solutions
c   save in XRY, make diags of MME
      do 11 iam = 1,mfac
      jrec = levs(iam)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 50
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = panid(lrec)
      icg = cgid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sape(iam,ka)
     x     + scg(icg,j))*lp(jdim,j)
  15  continue
c
      xad = y*c
      do 17 j=1,mcov
      kb=ja+j
      XRY(kb)=XRY(kb) + xad*lp(jdim,j)
      do 19 m=j,mcov
      kc=ja+m
      ma=ihmssf(kb,kc,no)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
  19  continue
  17  continue
c
  10  continue
c
c   Must account for genetic relationships among animals
c
  50  uped = jped(iam)
      tcc = 0.d0
      if(uped.lt.1)go to 11
      ianm = cpa(uped)
      if(ianm.ne.iam)print *,'xxxxx',iam,ianm
c
 850  jcode = cpc(uped)
      if(cpa(uped).ne.ianm)go to 432
      if(jcode.eq.0)then
      js = cps(uped)
      jd = cpd(uped)
      c = bii(ianm)*0.5d0
      do 406 jc=1,ntr
      tcc(jc)=tcc(jc)+c*(sanm(js,jc)+sanm(jd,jc))
 406  continue
      else
      jp = cps(uped)
      jm = cpd(uped)
      d = bii(jp)*0.5d0
      do 412 ja=1,ntr
      tcc(ja)=tcc(ja)+d*(sanm(jp,ja)-0.5d0*sanm(jm,ja))
 412  continue
      endif
c
 405  uped = uped + 1
      if(uped.gt.mped)go to 432
      if(cpa(uped).ne.ianm)go to 432
      go to 850
c
 432  do 435 jr=1,no
      s=0.d0
      do 437 jc=1,no
      s=s + gi(ihmssf(jr,jc,ntr))*tcc(jc)
 437  continue
      XRY(jr)=XRY(jr)+s
 435  continue
c ####################################################
c   Add inverse of covariance function matrix to diags
c   before inverting
c
      dg = adiag(iam)
      m=0
      do 61 ir=1,no
      do 61 ic=ir,no
      m=m+1
      diags(m)=diags(m)+gi(m)*dg
  61  continue
      call dkmvhf(diags,no,work,iork)
      vnois = 0.d0
c ###################################################
c   if estimating covariance matrices then
c   do cholesky decomposition on diags
c   generate sampling variance (vnois) to
c   add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,no)
      call vgnor(vnois,work,hh,no)
      endif
c
      ay = 0.d0
      js = sir(iam)
      jd = dam(iam)
      do 25 j=1,no
      z = 0.d0
      do 27 k=1,no
      m=ihmssf(j,k,no)
      z = z + diags(m)*XRY(k)
  27  continue
c   add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sanm(iam,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sanm(iam,j) = z
      ay(j)=sanm(iam,j)-0.5d0*(sanm(js,j)+sanm(jd,j))
  25  continue
c
c   if estimating covariance matrices - must accumulate
c   sum of squares of Mendelian sampling terms
      if(igibb.gt.0)then
      if(jrec.gt.0)then
      m=0
      ndf=ndf+1
      d = bii(iam)
      do 71 ir=1,no
      z = ay(ir)*d
      do 71 ic=ir,no
      m=m+1
      ssa(m)=ssa(m)+ay(ic)*z
  71  continue
      endif
      endif
  11  continue
c
c   Estimate new gi matrix
c
      if(igibb.gt.0)then
      nde = ndf + 2
      call fgchi1(nde,w)
      z=1.d0/w
      do 215 k=1,nop
      VIa(k)=ssa(k)*z
 215  continue
      nm=2
      write(19)iter,nm,VIa
      call dkmvhf(VIa,no,work,iork)
      do 216 k=1,nop
      gi(k)=VIa(k)
 216  continue
      endif
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sanm(jj,L),L=1,5)
5003  format(' ANM',i4,5f12.4)
      endif
      return
      end

20.5.6 Residual Effects

If the program is set to estimate covariance matrices, then a subroutine is needed to estimate the residual variances by parity and by periods within a lactation. If igibb=0, then this subroutine is skipped, and no residual variances are calculated.

      subroutine facRES
      include 'SShd.f'
c
      real*8 diags(nop),vnois(nop),ay(no),
     x     XRY(no),work(nop),hh(nop),c,y,z,w,x,ddif,xad
      real*8 sse(ntim,3),VI(nop),ndf(ntim,3)
      integer levs(nrec),iork(nop),mfac
c #######################################################
c   determine number of observations per
c   level of the factor, store in levs
      levs = 0
      sse = 0.d0
      mfac = 0
      do 8 krec=1,mrec
      itim = timg(krec)
      if(itim.gt.mfac)mfac = itim
      levs(itim) = levs(itim) + 1
   8  continue
      kend=0
c #######################################################
c   For each level of the factor
c   adjust observations for all other solutions
      do 11 itim = 1,mfac
      jrec = levs(itim)
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = pcg(lrec)
      iam = anid(krec)
      icg = cgid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sape(iam,ka)
     x     + scg(icg,j) + sanm(iam,ka))*lp(jdim,j)
  15  continue
c
      sse(jtim,kp)=sse(jtim,kp)+y*y
      ndf(jtim,kp)=ndf(jtim,kp)+1.d0
c
  10  continue
  11  continue
c
      do 31 jtim=1,4
      do 32 kp=1,3
      nde = ndf(jtim,kp)+2
      call fgchi1(nde,w)
      res(jtim,kp) = sse(jtim,kp)/w
  32  continue
  31  continue
      ri = 0.d0
      do 41 i=1,45
      do 141 kp=1,3
      ri(i,kp)=1.d0/res(1,kp)
 141  continue
  41  continue
      do 42 i=46,115
      do 142 kp=1,3
      ri(i,kp)=1.d0/res(2,kp)
 142  continue
  42  continue
      do 43 i=116,265
      do 143 kp=1,3
      ri(i,kp)=1.d0/res(3,kp)
 143  continue
  43  continue
      do 44 i=266,365
      do 144 kp=1,3
      ri(i,kp)=1.d0/res(4,kp)
 144  continue
  44  continue
c
c   save new estimates in file with sample number
c
      write(20,1235)iter,((res(i,j),i=1,4),j=1,3)
1235  format(1x,i10,12f15.5)
c
      return
      end

20.5.7 Finish Off

The last subroutine is "finis", which saves the solutions for the important factors, usually just the genetic evaluations of animals. However, some may want to save all of the solutions for all factors. With the genetic evaluations one may also want to save information about the number of records each animal had (by parity number), and perhaps the number of progeny and the sire and dam identifications. This information could be used to approximate the reliabilities of the EBVs. Thus, this last subroutine depends on what the user wishes to save and how it should be saved, and no coding will be provided for it.

20.6 Other Items

If Gibbs sampling was performed, then there will be three files of sample estimates for the covariance matrices and the residuals. The burn-in period needs to be determined, and then the remaining samples need to be averaged in some manner. Either all of the samples after burn-in could be averaged, or every mth sample could be averaged, where m is a number like 7 or 17 or 19. Consecutive samples are known to be dependent on the previous sample, and averaging every mth sample lessens this dependency considerably. Often the same results are obtained either way.
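Averaging every mth sample after burn-in can be sketched in a few lines of Python (the function name is hypothetical; the burn-in length and m are whatever was chosen for the run):

```python
def thin_and_average(samples, burn_in, m):
    """Discard the burn-in samples, keep every m-th sample thereafter,
    and average the kept samples. Thinning lessens the dependence
    between consecutive Gibbs samples."""
    kept = samples[burn_in::m]
    return sum(kept) / len(kept)

# 10 samples, discard the first 4, keep every 2nd: samples 4, 6, 8
```

The same function with m = 1 averages all post burn-in samples, which, as noted above, often gives much the same answer.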
After new covariance matrices are available, another run is made in which the new parameters are input and igibb = 0 is imposed. This run obtains the solutions to the mixed model equations (MME).

Having the EBVs, one can then compute the 305-d breeding values and persistency in a follow-up program. There may be other quantities to calculate for the users of the EBVs.

Note also that none of the preliminary programs have been provided. Programs for preparing the data, numbering the levels of each factor, editing out the error records, and ordering the animals chronologically for calculating inbreeding coefficients have also not been shown.

The programs shown in this chapter are not available for downloading. If you want to use them, then you should type them in from these pages. Why? Because it will help you to learn what the programs are doing, and you might find some errors in them. The programs are meant to give you an idea about writing code for a random regression model.

Chapter 21

DKMVHF - Inversion Routine

Matrix inversion routine of C. R. Henderson, as modified by Karin Meyer in 1983. Input is a half-stored symmetric matrix. There can be zero rows and columns present in the matrix.

      SUBROUTINE DKMVHF(A,N,V,IFLAG)
C     KARIN MEYER
C     NOVEMBER 1983
C-----------------------------------------------------------------------
      DOUBLE PRECISION A(1),V(1),XX,DMAX,AMAX,ZERO,DIMAX
      INTEGER IFLAG(1)
      ZERO=1.D-12
      IF(N.EQ.1)THEN
      XX=A(1)
      IF(DABS(XX).GT.ZERO)THEN
      A(1)=1.D0/XX
      ELSE
      A(1)=0.D0
      END IF
      RETURN
      END IF
      N1=N+1
      NN=(N*N1)/2
      DO 1 I=1,N
    1 IFLAG(I)=0
C
C     SET MINIMUM ABSOLUTE VALUE OF DIAGONAL ELEMENTS
C     FOR NON-SINGULARITY (MACHINE SPECIFIC!)
      ZERO=1.D-12
C-----------------------------------------------------------------------
C     START LOOP OVER ROWS/COLS
C-----------------------------------------------------------------------
      DO 8 II=1,N
C     ... FIND DIAGONAL ELEMENT WITH BIGGEST ABSOLUTE VALUE
      DMAX=0.D0
      AMAX=0.D0
      KK=-N
      DO 2 I=1,N
C     ... CHECK THAT THIS ROW/COL HAS NOT BEEN PROCESSED
      KK=KK+N1-I
      IF(IFLAG(I).NE.0)GO TO 2
      K=KK+I
      IF(DABS(A(K)).GT.AMAX)THEN
      DMAX=A(K)
      AMAX=DABS(DMAX)
      IMAX=I
      END IF
    2 CONTINUE
C     ... CHECK FOR SINGULARITY
      IF(AMAX.LE.ZERO)GO TO 11
C     ... ALL ELEMENTS SCANNED, SET FLAG
      IFLAG(IMAX)=II
C     ... INVERT DIAGONAL
      DIMAX=1.D0/DMAX
C     ... DIVIDE ELEMENTS IN ROW/COL PERTAINING TO THE
C     BIGGEST DIAGONAL ELEMENT BY DMAX
      IL=IMAX-N
      DO 3 I=1,IMAX-1
      IL=IL+N1-I
      XX=A(IL)
      A(IL)=XX*DIMAX
      IF(DABS(XX).LT.0.1D-17)XX=0.D0
    3 V(I)=XX
C     ... NEW DIAGONAL ELEMENT
      IL=IL+N1-IMAX
      A(IL)=-DIMAX
      DO 4 I=IMAX+1,N
      IL=IL+1
      XX=A(IL)
      A(IL)=XX*DIMAX
      IF(DABS(XX).LT.0.1D-17)XX=0.D0
    4 V(I)=XX
C     ... ADJUST THE OTHER ROWS/COLS :
C     A(I,J)=A(I,J)-A(I,IMAX)*A(J,IMAX)/A(IMAX,IMAX)
      IJ=0
      DO 7 I=1,N
      IF(I.EQ.IMAX)THEN
      IJ=IJ+N1-I
      GO TO 7
      END IF
      XX=V(I)
      IF(XX.NE.0.D0)THEN
      XX=XX*DIMAX
      DO 5 J=I,N
      IJ=IJ+1
      IF(J.NE.IMAX)A(IJ)=A(IJ)-XX*V(J)
    5 CONTINUE
      ELSE
    6 IJ=IJ+N1-I
      END IF
    7 CONTINUE
C     ... REPEAT UNTIL ALL ROWS/COLS ARE PROCESSED
    8 CONTINUE
C-----------------------------------------------------------------------
C     END LOOP OVER ROWS/COLS
C-----------------------------------------------------------------------
C     ... REVERSE SIGN
      DO 9 I=1,NN
    9 A(I)=-A(I)
C     ... AND THAT'S IT !
C     PRINT 10,N
   10 FORMAT(' FULL RANK MATRIX INVERTED, ORDER =',I5)
C     RETURN RANK AS LAST ELEMENT OF FLAG VECTOR
      IFLAG(N)=N
      RETURN
C-----------------------------------------------------------------------
C     MATRIX NOT OF FULL RANK, RETURN GENERALISED INVERSE
C-----------------------------------------------------------------------
   11 IRANK=II-1
      IJ=0
      DO 14 I=1,N
      IF(IFLAG(I).EQ.0)THEN
C     ... SET REMAINING N-II ROWS/COLS TO ZERO
      DO 12 J=I,N
      IJ=IJ+1
      A(IJ)=0.D0
   12 CONTINUE
      ELSE
      DO 13 J=I,N
      IJ=IJ+1
      IF(IFLAG(J).NE.0)THEN
REVERSE SIGN FOR II-1 ROWS/COLS PREVIOUSLY PROCESSED A(IJ)=-A(IJ) ELSE A(IJ)=0.D0 END IF 13 CONTINUE END IF 14 CONTINUE C PRINT 15,N,IRANK C 15 FORMAT(’ GENERALISED INVERSE OF MATRIX WITH ORDER =’,I5, C 1 ’ AND RANK =’,I5) IFLAG(N)=IRANK RETURN END c 371 c c 1 2 half-stored matrix subscripting function FUNCTION IHMSSF(I,J,N) IF(I-J)1,1,2 IHMSSF=((N+N-I)*(I-1))/2+J RETURN IHMSSF=((N+N-J)*(J-1))/2+I RETURN END 372 CHAPTER 21. DKMVHF - INVERSION ROUTINE Chapter 22 References and Suggested Readings Ali, T. E. , Schaeffer, L. R. 1987. Accounting for covariances among test day milk yields in dairy cows. Can. J. Anim. Sci. 67:637. Belonsky, G. M. , Kennedy, B. W. 1988. Selection on individual phenotype and best linear unbiased predictor of breeding value in a closed swine herd. J. Anim. Sci. 66:1124-1131. Bentsen, H. , Gjerde, B., Nguyenm N.H., Rye, M., Ponzoni, R.W., Palada de Vera, M.S., Bolivar, H.L., Velasco, R.R., Danting, J.C., Dionisio, E.E., Longalong, F.M., Reyes, R. A., Abella, T.A., Tayamen, M.M., Eknath, A.E., 2012. Growth of farmed tilapias: Genetic parameters for body weight at harvest in Nile tilapia (Oreochromis niloticus) during five generations of testing in multiple environments. Aquaculture 338-341, p56-65. Brenna-Hansen, S. , Li, J., Kent, M.P., Boulding, E.G., Dominik, S., Davidson, W.S., Lien, S., 2012. Chromosomal differences between European and North American Atlantic salmon discovered by linkage mapping and supported by fluorescence in situ hybridization analysis. BMC Genomics 13, 432. Ducrocq, V. 1987. An analysis of productive life in dairy cattle. Ph.D. Diss., Cornell University, Ithaca, NY. 373 374 CHAPTER 22. REFERENCES AND SUGGESTED READINGS Dwyer, D. J. , Schaeffer, L. R., Kennedy, B. W.. 1986. Bias due to corrective matings in sire evaluations for calving ease. J. Dairy Sci. 69:794-799. Foulley, J. L. , Gianola, D. 1984. Estimation of genetic merit from bivariate all or none responses. Genet. Sel. Evol. 16:285-306. Fries, L. A. 
, Schenkel, F. S. 1993. Estimation and prediction under a selection model. Presented at 30th Reuniao Anual da Sociedade Brasileira de Zootecnia, Rio de Janeiro. Galbraith, F. 2003. Random regression models to evaluate sires for daughter survival. Master’s Thesis, University of Guelph, Ontario, Canada, August. Gianola, D. 1982. Theory and analysis of threshold characters. J. Anim. Sci. 54:1079-1096. Gianola, D. , Foulley, J. L. 1983. Sire evaluation for ordered categorical data with a threshold model. Genet. Sel. Evol. 15:201. Gianola, D. , Fernando, R. L. 1986. Bayesian methods in animal breeding. J. Anim. Sci. 63:217. Gianola, D. , Im, S., Fernando, R. L. 1988. Prediction of breeding value under Henderson’s selection model: A Revisitation. J. Dairy Sci. 71:27902798. Gengler, N. , Mayeres, P., Szydlowski, M., 2007. A simple method to approximate gene content in large pedigree populations: application to the myostatin gene in dual-purpose Belgian Blue cattle. Animal 1:21-27. Gengler, N. , Abras, S., Verkenne, C., Vanderick, S., Szydlowski, M., Renaville, R., 2008. Accuracy of prediction of gene content in large animal populations and its use for candidate gene detection and genetic evaluation. J. Dairy Sci. 91:1652-1659. Gjerde, B. , Simianer, H.,Refstie, T., 1994. Estimates of genetic and phenotypic parameters for body weight, growth rate and sexual maturity in Atlantic salmon. Livest. Prod. Sci. 38:133-143. 375 Glebe, B.D. 1998. East Coast Salmon Aquaculture Breeding Programs: History and Future. Canadian Stock Assessment Secretariat Research Document 98/157, Fisheries and Oceans, Canada. Graser, H. U. , Tier, B. 1997. Applying the concept of number of effective progeny to approximate accuracies of predictions derived from multiple trait analyses. Presented at Australian Association of Animal Breeding and Genetics. Hartley, H. O. , Rao, J. N. K. 1967. Maximum likelihood estimation for the mixed analysis of variance model. Biometrika 54:93-108. Harville, D. A. 
, Mee, R. W. 1984. A mixed model procedure for analyzing ordered categorical data. Biometrics 40:393-408. Hayes, J. F. , W. G. Hill. 1981. Modification of estimates of parameters in the construction of genetic selection indices (’bending’). Biometrics 37:483-493. Henderson, C. R. 1953. Estimation of variance and covariance components. Biometrics 9:226-310. Henderson, C. R. 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423-448. Henderson, C. R. 1976. A simple method for computing the inverse of a numerator relationship matrix used for prediction of breeding values. Biometrics, 32:69-74. Henderson, C. R. 1978. Simulation to examine distributions of estimators of variances and ratios of variances. J. Dairy Sci. 61:267-273. Henderson, C. R. 1984. Applications of Linear Models in Animal Breeding. University of Guelph. Henderson, C. R. 1985. Best linear unbiased prediction of nonadditive genetic merits in noninbred populations. J. Anim. Sci. 60:111-117. Henderson, C. R. 1988. A simple method to account for selected base populations. J. Dairy Sci. 71:3399-3404. 376 CHAPTER 22. REFERENCES AND SUGGESTED READINGS Henderson, C. R. 1988. Theoretical basis and computational methods for a number of different animal models. J. Dairy Sci. 71(2):1-11. Henderson, Jr., C. R. 1982. Analysis of covariance in the mixed model: Higher level, nonhomogeneous, and random regressions. Biometrics 38:623640. Heydarpour, M. , Schaeffer, L. R., Yazdi, M. H. 2008. Influence of population structure on estimates of direct and maternal parameters. J. Anim. Breed. Genet. 125:89-99. Houston, R.D. , Gheyas, A., Hamilton, A., Guy, D.R., Tinch, A.E., Taggart, J.B., McAndrew, B.J., Haley, C.S. and Bishop, S.C., 2008. Detection and confirmation of a major QTL affecting resistance to infectious pancreatic necrosis (IPN) in Atlantic salmon (Salmo salar). In Animal Genomics for Animal Health (Vol. 132, pp. 199-204). Karger Publishers. Hudson, G. F. S. 
, Schaeffer, L. R. 1984. Monte Carlo comparison of sire evaluation models in populations subject to selection and nonrandom mating. J. Dairy Sci. 67:1264-1272. Jamrozik, J. , Kistemaker,G. K., Dekkers, J. C. M., Schaeffer, L. R. 1997. Comparison of possible covariates for use in a random regression model for analyses of test day yields. J. Dairy Sci. 80:2550-2556. Jamrozik, J. , Schaeffer, L. R. 1997. Estimates of genetic parameters for a test day model with random regressions for production of first lactation Holsteins. J. Dairy Sci. 80:762-770. Jamrozik, J. , L. R. Schaeffer, J. C. M. Dekkers. 1997. Genetic evaluation of dairy cattle using test day yields and random regression model. J. Dairy Sci. 80:1217-1226. Jamrozik, J. , J. Fatehi, L. R. Schaeffer. 2008. Comparison of models for genetic evaluation of survival traits in dairy cattle: a simulation study. J. Anim. Breed. Genet. 125:75-83. Jorjani, H. , Klei, L., Emanuelson, U. 2003. A simple method for weighted bending of genetic (co)variance matrices. J. Dairy Sci.86:677-679. 377 Kennedy, B. W. , L. R. Schaeffer, Sorensen, D. A. 1988. Genetic properties of animal models. J. Dairy Sci. 71:(Suppl. 2) 17-26. Kistemaker, G. 1997. The comparison of random regression test day models and a 305-day model for evaluation of milk yield in dairy cattle. Ph.D. Thesis. University of Guelph. Lien, S. , Gidskehaug, L., Moen, T., Hayes, B.J., Berg, P.R., Davidson, W.S., Omholt, S.W., Kent, M.P., 2011. A dense SNP-based linkage map for Atlantic salmon (Salmo salar) reveals extended chromosome homeologies and striking differences in sex-specific recombination patterns. BMC Genomics 12, 615. Liu, S. , Palti, Y., Gao, G., Rexroad III, C.E., 2016. Development and validation of a SNP panel for parentage assignment in rainbow trout. Aquaculture 452, 178-182. Lush, J. L. 1931. The number of daughters necessary to prove a sire. K. Dairy Sci. 14:209-220. Meuwissen, T. H. E. , Luo, Z. 1992. 
Computing inbreeding coefficients in large populations. Genet. Sel. Evol. 24:305-313.
Meuwissen, T. H. E., De Jong, G., Engel, B. 1996. Joint estimation of breeding values and heterogeneous variances of large data files. J. Dairy Sci. 79:310-316.
Meyer, K. 1989. Approximate accuracy of genetic evaluation under an animal model. Livest. Prod. Sci. 21:87-100.
Meyer, K. 2000. Random regressions to model phenotypic variation in monthly weights of Australian beef cows. Livest. Prod. Sci. 65:19-38.
Meyer, K., Kirkpatrick, M. 2010. Better estimates of genetic covariance matrices by "bending" using penalized maximum likelihood. Genetics 185:1097-1110.
Moen, T., Baranski, M., Sonesson, A. K., Kjøglum, S. 2009. Confirmation and fine mapping of a major QTL for resistance to infectious pancreatic necrosis in Atlantic salmon: population-level associations between markers and trait. BMC Genomics 10:368.
Mulder, H. A., Calus, M. P. L., Veerkamp, R. F. 2010. Prediction of haplotypes for ungenotyped animals and its effect on marker-assisted breeding value estimation. Genet. Sel. Evol. 42:10.
Ødegård, J., Moen, T., Santi, N., Korsvoll, S. A., Kjøglum, S., Meuwissen, T. H. 2014. Genomic prediction in an admixed population of Atlantic salmon (Salmo salar). Frontiers in Genetics 5:402.
Patterson, H. D., Thompson, R. 1971. Recovery of interblock information when block sizes are unequal. Biometrika 58:545-554.
Pollak, E. J., Quaas, R. L. 1981. Monte Carlo study of within herd multiple trait evaluation of beef cattle growth traits. J. Anim. Sci. 52:248-256.
Pollak, E. J., Quaas, R. L. 1981. Monte Carlo study of genetic evaluations using sequentially selected records. J. Anim. Sci. 52:257-264.
Pollak, E. J., van der Werf, J., Quaas, R. L. 1984. Selection bias and multiple trait evaluation. J. Dairy Sci. 67:1590-1596.
Ptak, E., Horst, H. S., Schaeffer, L. R. 1993.
Interaction of age and month of calving with year for Ontario Holstein production traits. J. Dairy Sci. 76:3792-3798.
Quaas, R. L. 1976. Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics 32:949-953.
Quaas, R. L., Pollak, E. J. 1980. Mixed model methodology for farm and ranch beef cattle testing programs. J. Anim. Sci. 51:1277-1287.
Quaas, R. L., Pollak, E. J. 1981. Modified equations for sire models with groups. J. Dairy Sci. 64:1868-1872.
Quaas, R. L. 1988. Additive genetic model with groups and relationships. J. Dairy Sci. 71:1338-1345.
Quinton, C. D., McMillan, I., Glebe, B. D. 2005. Development of an Atlantic salmon (Salmo salar) genetic improvement program: Genetic parameters of harvest body weight and carcass quality traits estimated with animal models. Aquaculture 247:211-217.
Rao, C. R. 1970. Estimation of heteroscedastic variances in linear models. J. Am. Statist. Assoc. 65:161-172.
Rao, C. R. 1971. Estimation of variance and covariance components - MINQUE theory. J. Mult. Anal. 1:445-456.
Robinson, G. K. 1986. Group effects and computing strategies for models for estimating breeding values. J. Dairy Sci. 69:3106-3111.
Schaeffer, L. R. 1976. BLUP Workshop Notes. August 9 to 14, 1976, Department of Animal Breeding, Agricultural College, Uppsala, Sweden. (Not available in print or electronic formats; some people may have a copy.)
Schaeffer, L. R., Kennedy, B. W. 1986. Computing strategies for solving mixed model equations. J. Dairy Sci. 69:575-579.
Schaeffer, L. R., Kennedy, B. W. 1989. Effects of embryo transfer in beef cattle on genetic evaluation methodology. J. Anim. Sci. 67:2536-2543.
Schaeffer, L. R., Kennedy, B. W., Gibson, J. P. 1989. The inverse of the gametic relationship matrix. J. Dairy Sci. 72:1266-1272.
Schaeffer, L. R., Dekkers, J. C. M. 1994. Random regressions in animal models for test-day production in dairy cattle. Proc. 5th World Congress of Genetics Applied to Livestock Production.
Guelph, Ontario, Canada. XVIII:443-446.
Schaeffer, L. R., Jamrozik, J., Kistemaker, G. J., Van Doormaal, B. J. 2000. Experience with a test-day model. J. Dairy Sci. 83:1135-1144.
Schaeffer, L. R. 2003. Computing simplifications for non-additive genetic models. J. Anim. Breed. Genet. 120:394-402.
Schaeffer, L. R. 2004. Application of random regression models in animal breeding. Livest. Prod. Sci. 86:35-45.
Schaeffer, L. R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218-223.
Schaeffer, L. R. 2010. Cumulative permanent environmental effects in a repeated records animal model. J. Anim. Breed. Genet. 128:95-99.
Schaeffer, L. R. 2018. Necessary changes to improve animal models. J. Anim. Breed. Genet. 135:124-131.
Schaeffer, L. R., Ang, K. P., Elliott, J. A. K., Herlin, M., Powell, F., Boulding, E. G. 2018. Genetic evaluation of Atlantic salmon for growth traits incorporating SNP markers. J. Anim. Breed. Genet. 135(5):349-356.
Schenkel, F. S. 1998. Studies on effects of parental selection on estimation of genetic parameters and breeding values of metric traits. Ph.D. Thesis. University of Guelph. Guelph, Ontario, Canada.
Searle, S. R. 1971. Linear Models. John Wiley, New York.
Simianer, H. 1991. Prospects for third generation methods of genetic evaluation. 42nd Annual Meeting of the European Association for Animal Production, Berlin.
Slanger, W. D., Jensen, E. L., Everett, R. W., Henderson, C. R. 1976. Programming cow evaluation. J. Dairy Sci. 59:1589.
Smith, S. P., Maki-Tanila, A. 1990. Genotypic covariance matrices and their inverses for models allowing dominance and inbreeding. Genet. Sel. Evol. 22:65-91.
Sonesson, A. K. 2007. Within-family marker-assisted selection for aquaculture species. Genet. Sel. Evol. 39:301-317.
Sonesson, A. K., Meuwissen, T. H. E. 2009. Testing strategies for genomic selection in aquaculture breeding programs. Genet. Sel. Evol. 41:37-45.
Tamate, T., Maekawa, K. 2004. Female-biased mortality rate and sexual size dimorphism of migratory masu salmon, Oncorhynchus masou. Ecology of Freshwater Fish 13:96-103.
Thompson, R. 1976. The estimation of maternal genetic variances. Biometrics 32:903-918.
Thompson, R. 1979. Sire evaluation. Biometrics 35:339-353.
Tier, B., Meyer, K. 2004. Approximating prediction error covariances among additive genetic effects within animals in multiple trait and random regression models. J. Anim. Breed. Genet. 121:77-89.
Tierney, J. S., Schaeffer, L. R. 1994. Inclusion of semen price of the sire in an animal model to account for preferential treatment. J. Dairy Sci. 77:576-582.
Tyriseva, A. M., Mantysaari, E. A., Jakobsen, J., Aamand, G. P., Durr, J., Fikse, W. F., Lidauer, M. H. 2018. Detection of evaluation bias caused by genomic preselection. J. Dairy Sci. 101:1-9.
Ufford, G. R., Henderson, C. R., Van Vleck, L. D. 1979. Computing algorithms for sire evaluation with all lactation records and natural service sires. J. Dairy Sci. 62:511-513.
VanRaden, P. M., Hoeschele, I. 1991. Rapid inversion of additive by additive relationship matrices by including sire-dam combination effects. J. Dairy Sci. 74:570-579.
Veerkamp, R. F., Brotherstone, S., Meuwissen, T. H. E. 1999. Survival analysis using random regression models. Proc. International Workshop on EU Concerted Action Genetic Improvement of Functional Traits in Cattle; Longevity. Interbull Bulletin 21:36-40.
Westell, R. A., Quaas, R. L., Van Vleck, L. D. 1988. Genetic groups in an animal model. J. Dairy Sci. 71:1310-1318.
Wiggans, G. R., Misztal, I. 1987. Supercomputer for animal model evaluation of Ayrshire milk yield. J. Dairy Sci. 70:1906-1912.
Willham, R. L. 1972. The role of maternal effects in animal breeding. III. Biometrical aspects of maternal effects in animals. J. Anim. Sci. 35:1288-1293.
Wood, P. D. P. 1967. Algebraic model of the lactation curve in cattle. Nature 216:164-165.
Wood, P. D. P. 1968.
Factors affecting the shape of the lactation curve in cattle. Anim. Prod. 11:307-316.