Animal Models
L. R. Schaeffer
2019
© L. R. Schaeffer, 2019
Acknowledgements
C. R. Henderson
Dale Van Vleck
Jim Wilton
George Wiggans
Brian Kennedy
Ivan Mao
Bill Slanger
Janusz Jamrozik
Knut Ronningen
Oje & Britta Danell
Bill Szkotnicki
Jack Dekkers
John Moxley
Jan Philipsson
Daniel Gianola
Ignacy Misztal
Reinhard Reents
Karin Meyer
Bruce Tier
All my Graduate Students
All co-authors
Laura McKay
Cornell University 1970
Contents

I  Animal Models

1  Background History
   1.1  Sire Model
        1.1.1  Fixed Versus Random Factors
        1.1.2  Better Sires in Better Herds
        1.1.3  Sires Related
   1.2  The Animal Model
   1.3  Model Considerations
        1.3.1  Time and Location Effects
        1.3.2  Contemporary Groups
        1.3.3  Interaction of Fixed Factors with Time
        1.3.4  Pre-adjustments
        1.3.5  Additive Genetic Effects
        1.3.6  Permanent Environmental Effects

2  Genetic Relationships
   2.1  Pedigree Preparation
   2.2  Additive Relationships
   2.3  Inbreeding Calculations
   2.4  Example Additive Matrix
   2.5  Inverse of Additive Relationship Matrix

3  Mixed Model Equations
   3.1  Accuracies
   3.2  Expression of EBVs
        3.2.1  Genetic Base
        3.2.2  EBV
        3.2.3  EPD, ETA
        3.2.4  RBV
        3.2.5  Percentile Ranking
        3.2.6  Intermediate Traits
   3.3  Numerical Example

4  Phantom Parent Groups
   4.1  Background History
   4.2  Simplifying the MME

5  Estimation of Variances
   5.1  Some History
   5.2  Modern Thinking
   5.3  The Joint Posterior Distribution
        5.3.1  Conditional Distribution of Data Vector
        5.3.2  Prior Distributions of Random Variables
   5.4  Fully Conditional Posterior Distributions
        5.4.1  Fixed and Random Effects of the Model
        5.4.2  Variances
   5.5  Computational Scheme
        5.5.1  Year Effects
        5.5.2  Contemporary Group Effects
        5.5.3  Contemporary Group Variance
        5.5.4  Animal Genetic Effects
        5.5.5  Additive Genetic Variance
        5.5.6  Residual Variance
        5.5.7  Visualization
        5.5.8  Burn-In Periods
        5.5.9  Post Burn-In Analysis
        5.5.10 Long Chain or Many Chains?
   5.6  Software
   5.7  Covariance Matrices
        5.7.1  Example

6  Repeated Records Animal Model
   6.1  Traditional Approach
        6.1.1  The Model
        6.1.2  Example
        6.1.3  Estimation of Variances
   6.2  Cumulative Environmental Approach
        6.2.1  A Model
        6.2.2  Example
        6.2.3  Estimation of Variances
   6.3  Alternative Models

7  Multiple Trait Models
   7.1  Introduction
   7.2  Data
   7.3  Gibbs Sampling
        7.3.1  Year-Season Effects
        7.3.2  Age Effects
        7.3.3  Days in Milk Group Effects
        7.3.4  HYS Effects
        7.3.5  Additive Genetic Effects
        7.3.6  Residual Covariances
   7.4  Positive Definite Matrices
   7.5  Starting Matrices

8  Maternal Traits
   8.1  Introduction
   8.2  Data
   8.3  A Model
        8.3.1  Data Structure
        8.3.2  Assumptions
   8.4  MME
        8.4.1  Preparing the Data
        8.4.2  Covariance Matrices
        8.4.3  Year-Month Effects
        8.4.4  Gender Effects
        8.4.5  Flock-Year-Month Effects
        8.4.6  Animal Additive Genetic Effects
        8.4.7  Maternal Genetic Effects
        8.4.8  Maternal PE Effects
        8.4.9  Residual Effects

II  Random Regression Analyses

9  Longitudinal Data
   9.1  Introduction
   9.2  Collect Data
   9.3  Covariance Functions
   9.4  Reduced Orders of Fit
   9.5  Data Requirements

10 The Models
   10.1 Fitting The Trajectory
        10.1.1 Spline Functions
        10.1.2 Classification Approach
   10.2 Random Variation in Curves
   10.3 Residuals
   10.4 Complete Model
        10.4.1 Fixed Factors
        10.4.2 Random Factors
        10.4.3 Mixed Model Equations

11 RRM Calculations
   11.1 Enter Data
   11.2 Enter Starting Covariance Matrices
   11.3 Sampling Process
        11.3.1 Year-Month Effects
        11.3.2 Age Groups
        11.3.3 Contemporary Groups
        11.3.4 Animal Additive Genetic Effects
        11.3.5 Animal Permanent Environmental Effects
        11.3.6 Residual Covariance Matrices

12 Lactation Production
   12.1 Measuring Yields
   12.2 Curve Fitting
   12.3 Factors in a Model
        12.3.1 Observations
        12.3.2 Year-Month of Calving
        12.3.3 Age-Season of Calving
        12.3.4 Days Pregnant
        12.3.5 Herd-Test-Day
        12.3.6 Herd-Year-Season of Calving
        12.3.7 Additive Genetic Effects
        12.3.8 Permanent Environmental Effects
        12.3.9 Number Born
        12.3.10 Residual Effects
   12.4 Covariance Function Matrices
   12.5 Expression of EBVs

13 Growth
   13.1 Curve Fitting
        13.1.1 Spline Function
        13.1.2 Classification Approach
   13.2 Model Factors
        13.2.1 Observations
        13.2.2 Year-Month of Birth-Gender (fixed)
        13.2.3 Year - Age of Dam - Gender (fixed)
        13.2.4 Contemporary-Management Groups (random)
        13.2.5 Litter Effects (random)
        13.2.6 Animal Additive Genetic Effects (random)
        13.2.7 Animal Permanent Environmental Effects (random)
        13.2.8 Maternal Genetic Effects (random)
        13.2.9 Maternal Permanent Environmental Effects (random)
        13.2.10 Residual Variances (random)
        13.2.11 Summary
   13.3 Covariance Function Matrices
   13.4 Expression of EBVs

14 Survival
   14.1 Survival Function
   14.2 Model Factors
        14.2.1 Year-Season of Birth-Gender (fixed)
        14.2.2 Age at First Calving (fixed)
        14.2.3 Production Level (fixed)
        14.2.4 Conformation Level (fixed)
        14.2.5 Unexpected Events
        14.2.6 Contemporary Groups (random)
        14.2.7 Animal Additive Genetic Effects (random)
        14.2.8 Animal Permanent Environment Effects (random)
        14.2.9 Residual Variances (random)
   14.3 Example
        14.3.1 Year Trajectories
        14.3.2 Contemporary Groups
        14.3.3 Animal Estimated Breeding Values
        14.3.4 Covariance Matrices

III Loose Ends

15 Selection
   15.1 Randomization
   15.2 Multiple Traits
   15.3 Better Sires In Better Herds Problem
        15.3.1 The Simulation
        15.3.2 Results
   15.4 Nonrandom Mating
   15.5 Masking Selection
   15.6 Preferential Treatment
   15.7 Nonrandom Progeny Groups
   15.8 Summary

16 Genomics
   16.1 Introduction
   16.2 Data
   16.3 Model for Growth Traits
   16.4 Models Involving Genomics
   16.5 Comparison of Models
   16.6 Results

17 Other Models
   17.1 Sire-Dam Model
   17.2 Sire Models
   17.3 Reduced Animal Models

18 Non-Additive Genetic Effects
   18.1 Interactions at a Single Locus
   18.2 Interactions for Two Unlinked Loci
        18.2.1 Estimation of Additive Effects
        18.2.2 Estimation of Dominance Effects
        18.2.3 Additive by Additive Effects
        18.2.4 Additive by Dominance Effects
        18.2.5 Dominance by Dominance Effects
   18.3 More than Two Loci
   18.4 Linear Models for Non-Additive Genetic Effects
        18.4.1 Simulation of Data
        18.4.2 HMME
   18.5 Computing Simplification
   18.6 Estimation of Variances

19 Threshold Models
   19.1 Categorical Data
   19.2 Threshold Model Computations
        19.2.1 Equations to Solve
   19.3 Estimation of Variance Components
   19.4 Expression of Genetic Values

IV Computing Ideas

20 Fortran Programs
   20.1 Main Program
   20.2 Call Params
   20.3 Call Peddys
   20.4 Call Datum
   20.5 Iteration Subroutines
        20.5.1 Year-Month of Calving Factor
        20.5.2 Region-Age-Season of Calving
        20.5.3 Contemporary Groups
        20.5.4 Animal Permanent Environmental
        20.5.5 Animal Additive Genetic
        20.5.6 Residual Effects
        20.5.7 Finish Off
   20.6 Other Items

21 DKMVHF - Inversion Routine

22 References and Suggested Readings
Part I
Animal Models
Chapter 1
Background History
The individual cow model first appeared in Dr Henderson’s graduate course
notes during the mid-1960’s at Cornell University. Because Henderson was
a dairy cattle geneticist, he used the term ‘cow’ model rather than animal
model. Those models showed the need for the additive genetic relationship
matrix and its inverse, but in those early years computers (with only 128K of
memory) were not capable of storing such a matrix for cows in a national or
regional population. The term ‘animal’ model came from a beef cattle paper
by Pollak and Quaas (1981), a name that stuck with the concept, and which
Henderson and everyone else readily accepted. In 1976, Henderson discovered
and published his rapid method for inverting the additive genetic relationship
matrix. That simple discovery made it possible to apply an animal model to
a national dairy cattle genetic evaluation program. However, the first applied
animal models did not appear until 1987 (Wiggans and Misztal) due mainly
to a lack of computer power. Bill Slanger (1976) programmed an early within-herd cow model for evaluating both dairy cows and bulls, but in which genetic relationships between animals in different herds had to be ignored.
1.1 Sire Model
The animal model came after the sire model, and therefore, it is important to
know the sire model. In 1970, the animal breeding world was introduced to
linear models and best linear unbiased prediction methods (BLUP) by C. R.
Henderson through the Northeast AI Sire Comparison. The initial model was
$$y_{ijklm} = YS_i + HYS_{ij} + G_k + S_{kl} + e_{ijklm},$$
where
$y_{ijklm}$ was the first lactation 305-d milk yield of daughter $m$ of sire $l$ belonging to genetic group $k$, making a record in year-season of calving $i$ and herd-year-season $j$;

$YS_i$ was a fixed year-season of calving effect to account for time trends in the data;

$HYS_{ij}$ was a random herd-year-season of first calving contemporary group;

$G_k$ was a fixed sire genetic group, defined by the year of sampling and AI ownership;

$S_{kl}$ was a random sire effect within genetic group; and

$e_{ijklm}$ was a random residual effect.
In matrix notation, let
y be the vector of first lactation milk yields,
b be the vector of year-season effects,
h be the vector of herd-year-season effects,
g be the vector of genetic group effects,
s be the vector of sire transmitting abilities, and
e be the vector of residuals,
then
y = Xb + Wh + Qg + Zs + e,
where X, W, Q, and Z are design matrices relating observations to the factors
in the model.
Also,
$$E(\mathbf{y}) = \mathbf{Xb} + \mathbf{Qg}, \quad E(\mathbf{h}) = \mathbf{0}, \quad E(\mathbf{s}) = \mathbf{0}, \quad E(\mathbf{e}) = \mathbf{0},$$
$$Var(\mathbf{e}) = \mathbf{I}\sigma^2_e, \quad Var(\mathbf{s}) = \mathbf{I}\sigma^2_s, \quad Var(\mathbf{h}) = \mathbf{I}\sigma^2_h.$$
The assumptions of this model were
1. Sires were unrelated to each other.
2. Sires were mated randomly to dams.
3. Progeny were a random sample of daughters.
4. Daughters of sires were randomly distributed across herd-year-seasons.
5. Milk yields were adjusted perfectly for age and month of calving.
The limitations were
1. Sires were related to each other.
2. Sires were not randomly mated to dams in the population.
3. Daughters of higher priced bulls tended to be associated with richer herds
that supposedly had better environments.
4. The age-month of calving adjustment factors were not without errors.
5. Only first lactation records were used.
6. Cows were not evaluated.
1.1.1 Fixed Versus Random Factors
Most statistics books in 1970 would define a fixed factor as one with relatively
few levels, such as age groups, treatments, diets, and years, of which differences among the levels were to be estimated. If the conceptualized experiment
were to be repeated, then the same levels of ages, treatments, or diets might
appear again, and their estimated differences would be expected to be consistent between the two samples. However, year effects would differ in a repeated
sampling approach because time does not stand still and years do not repeat
themselves. In the proposed model above, year-seasons were considered to
be a fixed factor because there were a limited number of years and a limited
number of seasons per year in the data.
A random factor, on the other hand, would have many levels, large enough
to be considered infinite, and large enough to be viewed as being randomly
sampled from an infinite population, such as sires or herd-year-seasons. If the
conceptualized study were repeated, then sires and herd-year-seasons would be
completely new, i.e. a different random sample. In animal breeding situations,
studies are not repeated, but instead new sires and new herd-year-seasons
are constantly being generated based partly on the composition of previous
samples, and thus, only occur once. Sire effects were considered a random
factor even though the number of sires was less than 1500 for Holsteins, which
is significantly less than infinite. Herd-year-season effects were also a random
factor having around 100,000 levels.
The distinction between fixed and random factors was also based on how
the results were to be interpreted. With a fixed factor, interest was on the
estimated differences between levels of that factor for those specific levels of
the factor. For example, if there were two diets involving different feed components, then the researcher would be interested in the impacts of those diets
on milk production or growth. There could be fifty other diets that could be
compared, but the experiment was only for those two specific diets. With a
random factor the researcher would be more interested in the variation of effects of levels of random factors on milk production or growth, and not in any
specific levels of the random factor. Herd-year-season effects might account
for 10% of the variation in milk yields across the country. This fact was more
important than the specific difference between two year-seasons within one
herd. In genetic evaluation, researchers are interested in the realized values of
1.1. SIRE MODEL
21
levels of a random sire factor.
Levels of random factors occur outside the control of the researcher. Researchers do not decide which herd-year-seasons will have cows calving and
certainly do not control the effect of that herd-year-season on the yields of
cows. The researcher also does not control the number of herd-year-seasons
that will be, or where they will be. On the other hand, researchers do decide
which diets will be compared, in which herds, and which animals, and thus, diets are a fixed factor. Basically, any experiment in a research environment will
have mostly fixed factors in the model, whereas, data collected from livestock
producers will have more random factors influencing the observations that the
researcher has not been able to control.
Determining whether a factor is fixed or random involves many shades of grey
in practice. In the end, the researcher makes the decision based on experience
and tradition in that field of study.
1.1.2 Better Sires in Better Herds
Before the Northeast AI Sire Comparison became public, the dairy industry
in the northeast believed that the better sires were used in the better herds,
thus creating a bias in sire estimated transmitting abilities. Henderson had
a theory about different kinds of selection bias and how to account for them
which he had taught in his graduate course at Cornell since 1965, but did not
publish until 1975.
To illustrate the problem, consider the progeny data in Table 1.1, representing 1 year-season, 3 herds, and 4 sires. Ignore genetic groups in this example.
Table 1.1
Example data $(n_{ij}, y_{ij.})$ on 4 sires and 3 herds in 1 year-season.

                         Sires
Herd               1          2          3          4    Herd Totals
1            (9,270)   (16,432)      (0,0)      (0,0)       (25,702)
2             (1,27)     (3,72)     (5,85)     (1,12)       (10,196)
3              (0,0)     (1,21)   (25,350)   (39,351)       (65,722)
Sire Totals (10,297)   (20,525)   (30,435)   (40,363)     (100,1620)
Note that sires 1 and 2 have most of their progeny in herds 1 and 2, and
sires 3 and 4 have most of their progeny in herds 2 and 3. In creating the data
sires 1 and 2 had positive true breeding values and sires 3 and 4 had negative
true breeding values. Also, the true value of herd 1 was better than those of herds 2 and 3, and the true value of herd 2 was better than that of herd 3. Thus, the better sires in better herds selection bias was constructed. In real life, of course, this cannot be
done because true values of either sires or herds are never known, but industry
personnel believed that this bias was occurring.
Assume that $k_h = 10$ and $k_s = 15$; then the mixed model equations for the model were

$$\begin{bmatrix} 100 & 25 & 10 & 65 & 10 & 20 & 30 & 40 \\ 25 & 35 & 0 & 0 & 9 & 16 & 0 & 0 \\ 10 & 0 & 20 & 0 & 1 & 3 & 5 & 1 \\ 65 & 0 & 0 & 75 & 0 & 1 & 25 & 39 \\ 10 & 9 & 1 & 0 & 25 & 0 & 0 & 0 \\ 20 & 16 & 3 & 1 & 0 & 35 & 0 & 0 \\ 30 & 0 & 5 & 25 & 0 & 0 & 45 & 0 \\ 40 & 0 & 1 & 39 & 0 & 0 & 0 & 55 \end{bmatrix} \begin{bmatrix} \widehat{YS}_1 \\ \widehat{HYS}_{11} \\ \widehat{HYS}_{21} \\ \widehat{HYS}_{31} \\ \hat{S}_1 \\ \hat{S}_2 \\ \hat{S}_3 \\ \hat{S}_4 \end{bmatrix} = \begin{bmatrix} 1620 \\ 702 \\ 196 \\ 722 \\ 297 \\ 525 \\ 435 \\ 363 \end{bmatrix}. \tag{1.1}$$

The solutions to the equations were

$$\begin{bmatrix} \widehat{YS}_1 \\ \widehat{HYS}_{11} \\ \widehat{HYS}_{21} \\ \widehat{HYS}_{31} \\ \hat{S}_1 \\ \hat{S}_2 \\ \hat{S}_3 \\ \hat{S}_4 \end{bmatrix} = \begin{bmatrix} 19.2398 \\ 4.7928 \\ 0.0779 \\ -4.8707 \\ 2.4556 \\ 1.9473 \\ -0.4626 \\ -3.9403 \end{bmatrix}.$$
The sire solutions are presumably biased because of the selection bias issue.
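As a numerical check, the coefficient matrix and right-hand sides of (1.1) can be entered directly and solved. The short Python sketch below is my own verification, not part of the original text; it reproduces the printed solutions.

```python
import numpy as np

# Coefficient matrix of the MME in (1.1); equation order is
# YS1, HYS11, HYS21, HYS31, S1, S2, S3, S4.
# The ratios kh = 10 and ks = 15 are already added to the diagonals.
C = np.array([
    [100, 25, 10, 65, 10, 20, 30, 40],
    [ 25, 35,  0,  0,  9, 16,  0,  0],
    [ 10,  0, 20,  0,  1,  3,  5,  1],
    [ 65,  0,  0, 75,  0,  1, 25, 39],
    [ 10,  9,  1,  0, 25,  0,  0,  0],
    [ 20, 16,  3,  1,  0, 35,  0,  0],
    [ 30,  0,  5, 25,  0,  0, 45,  0],
    [ 40,  0,  1, 39,  0,  0,  0, 55],
], dtype=float)

# Right-hand sides: total yield, then yields by HYS, then yields by sire.
r = np.array([1620, 702, 196, 722, 297, 525, 435, 363], dtype=float)

sol = np.linalg.solve(C, r)
print(np.round(sol, 4))
# Matches the printed solutions: 19.2398, 4.7928, 0.0779, -4.8707,
#                                 2.4556, 1.9473, -0.4626, -3.9403
```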
Henderson (1975) outlined three types of selection. The bias discussed here fell under the "selection on $\mathbf{u}$" type. Henderson derived the following mixed model equations (MME) to account for the bias:

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} & \mathbf{X}'\mathbf{Q} & \mathbf{X}'\mathbf{Z} & \mathbf{0} \\ \mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} + \mathbf{I}k_h & \mathbf{W}'\mathbf{Q} & \mathbf{W}'\mathbf{Z} & -\mathbf{L} \\ \mathbf{Q}'\mathbf{X} & \mathbf{Q}'\mathbf{W} & \mathbf{Q}'\mathbf{Q} & \mathbf{Q}'\mathbf{Z} & \mathbf{0} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{W} & \mathbf{Z}'\mathbf{Q} & \mathbf{Z}'\mathbf{Z} + \mathbf{I}k_s & \mathbf{0} \\ \mathbf{0} & -\mathbf{L}' & \mathbf{0} & \mathbf{0} & \mathbf{L}'\mathbf{L}/k_h \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{h}} \\ \hat{\mathbf{g}} \\ \hat{\mathbf{s}} \\ \hat{\boldsymbol{\theta}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{W}'\mathbf{y} \\ \mathbf{Q}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \\ \mathbf{0} \end{bmatrix}, \tag{1.2}$$

where $\mathbf{L}'$ describes selection differentials, $\boldsymbol{\theta}$ is a vector of the estimates of selection bias, $k_h$ is the ratio of residual to herd-year-season variances, and $k_s$ is the ratio of residual to sire variances. Unfortunately, neither Henderson (1975) nor Henderson (1984) gave any hints about creating an appropriate $\mathbf{L}$, only that one exists.
Two different $\mathbf{L}'$ were attempted here. Here $\mathbf{u}$ is taken to be $\mathbf{h}$, the herd-year-season random effects. A selection differential is a function that describes the difference between one HYS and the other HYS. One possible matrix is

$$\mathbf{L}'_1 = \begin{bmatrix} 1 & -0.5 & -0.5 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix}.$$

This matrix says that $HYS_{11}$ is different from the average of $HYS_{21}$ and $HYS_{31}$ (first row of $\mathbf{L}'_1$), that $HYS_{21}$ is different from $HYS_{31}$, and that $HYS_{31}$ is not better than any other HYS.
In Henderson (1984), he makes $\mathbf{L}'_2 = \mathbf{I}$, which states that each HYS is unique. The solutions from the two attempts and the original solutions to the MME are in Table 1.2.
Table 1.2
Solutions to the original MME, the MME using $\mathbf{L}'_1$, and the MME using $\mathbf{L}'_2$.

Effect     Original   With L'1   With L'2
YS1         19.2398    20.6401    19.5479
HYS11        4.7928     6.6348     7.7269
HYS21        0.0779    -1.4166    -0.3245
HYS31       -4.8707    -8.3010    -7.2089
S1           2.4556     1.2921     1.2921
S2           1.9473     0.5312     0.5312
S3          -0.4626     0.6757     0.6757
S4          -3.9403    -2.4990    -2.4990
θ1                     66.3476    77.2691
θ2                     19.0076    -3.2447
θ3                    -30.8289   -72.0889
Note that the solutions for sires are identical for the two MME using $\mathbf{L}'$, and that estimable functions of the other effects are also identical. For example, the solution for $YS_1 + HYS_{11}$ is the same in the last two columns. Several other $\mathbf{L}'$ matrices were constructed and tried, and all gave the same solutions for sire effects. Henderson likely tried many $\mathbf{L}'$ matrices too, and came to the conclusion that the exact structure of $\mathbf{L}'$ did not matter. Thus, let $\mathbf{L}' = \mathbf{I}$; then the above equations reduce to

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} & \mathbf{X}'\mathbf{Q} & \mathbf{X}'\mathbf{Z} \\ \mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} & \mathbf{W}'\mathbf{Q} & \mathbf{W}'\mathbf{Z} \\ \mathbf{Q}'\mathbf{X} & \mathbf{Q}'\mathbf{W} & \mathbf{Q}'\mathbf{Q} & \mathbf{Q}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{W} & \mathbf{Z}'\mathbf{Q} & \mathbf{Z}'\mathbf{Z} + \mathbf{I}k_s \end{bmatrix} \begin{bmatrix} \hat{\mathbf{b}} \\ \hat{\mathbf{h}} \\ \hat{\mathbf{g}} \\ \hat{\mathbf{s}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{W}'\mathbf{y} \\ \mathbf{Q}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix},$$
which are the MME for the original sire model except that $\mathbf{Wh}$ is treated as a fixed effect. The solutions to these equations, for the example data of Table 1.1, are
$$\begin{bmatrix} \widehat{YS}_1 \\ \widehat{HYS}_{11} \\ \widehat{HYS}_{21} \\ \widehat{HYS}_{31} \\ \hat{S}_1 \\ \hat{S}_2 \\ \hat{S}_3 \\ \hat{S}_4 \end{bmatrix} = \begin{bmatrix} 14.7093 \\ 12.5655 \\ 4.5141 \\ -2.3703 \\ 1.2921 \\ 0.5312 \\ 0.6757 \\ -2.4990 \end{bmatrix}.$$
The sire solutions are the same as those in the last two columns of Table 1.2, and $YS_1 + HYS_{11}$ is also identical. Thus, treating random HYS as a fixed factor removes the selection bias due to better sires being used in the better herds. It seems incredible that any $\mathbf{L}'$ matrix can be used, as long as it has 3 independent rows.
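This claim can be checked numerically for the Table 1.1 data. The sketch below (my own illustration, not the book's code) solves the MME with HYS treated as fixed. Because the year-season column is then confounded with the sum of the HYS columns, the coefficient matrix is singular, so a generalized-inverse (here the Moore-Penrose minimum-norm) solution is used; the sire solutions, which are invariant to the choice of generalized inverse, match the last two columns of Table 1.2.

```python
import numpy as np

# MME for the Table 1.1 data with HYS treated as fixed:
# the same system as (1.1) but without kh = 10 on the HYS diagonals.
# Order: YS1, HYS11, HYS21, HYS31, S1, S2, S3, S4 (ks = 15 kept on sires).
C = np.array([
    [100, 25, 10, 65, 10, 20, 30, 40],
    [ 25, 25,  0,  0,  9, 16,  0,  0],
    [ 10,  0, 10,  0,  1,  3,  5,  1],
    [ 65,  0,  0, 65,  0,  1, 25, 39],
    [ 10,  9,  1,  0, 25,  0,  0,  0],
    [ 20, 16,  3,  1,  0, 35,  0,  0],
    [ 30,  0,  5, 25,  0,  0, 45,  0],
    [ 40,  0,  1, 39,  0,  0,  0, 55],
], dtype=float)
r = np.array([1620, 702, 196, 722, 297, 525, 435, 363], dtype=float)

# YS1 is confounded with the HYS effects, so C is singular;
# the pseudoinverse gives the minimum-norm solution of the system.
sol = np.linalg.pinv(C) @ r

print(np.round(sol[4:], 4))        # sire solutions: invariant, estimable
print(round(sol[0] + sol[1], 2))   # estimable function YS1 + HYS11
```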
Hence Henderson used model (1.2), with the random HYS effects, i.e. contemporary groups (CG), treated as fixed, in the first official run of the Northeast AI Sire Comparison. Ever since, HYS effects have been treated as a fixed factor in sire and animal models, except in a few rare cases, even though they are conceptually a random factor. There has never been a study on the amount of bias that supposedly existed, nor on the increase in prediction error variances that resulted from treating HYS effects as fixed.
Thompson (1979) argued that $\mathbf{L}'$ was not well defined, and that therefore Henderson's selection bias theory was not sound. Gianola et al. (1988) gave Bayesian arguments against the concept of repeated sampling underlying the assumptions of Henderson's theory used to derive the modified MME. Therefore, if Henderson's selection bias theory is unsound, have animal breeders made a mistake in treating random HYS as a fixed factor over the last 45 years? How much genetic change has been lost as a result, if any? The modified model became
$$\mathbf{y} = \mathbf{Wh} + \mathbf{Qg} + \mathbf{Zs} + \mathbf{e}, \tag{1.3}$$
where $\mathbf{W}$, $\mathbf{Q}$, and $\mathbf{Z}$ are design matrices relating observations to the factors in the model. Note that $\mathbf{h}$ is now treated as a fixed effect in the model, and that year-season effects, $\mathbf{Xb}$, were totally confounded with herd-year-seasons, and thus were not estimable in this model, and therefore removed.
Also,

$$E(\mathbf{y}) = \mathbf{Wh} + \mathbf{Qg}, \quad E(\mathbf{s}) = \mathbf{0}, \quad E(\mathbf{e}) = \mathbf{0},$$
$$Var(\mathbf{e}) = \mathbf{I}\sigma^2_e, \quad Var(\mathbf{s}) = \mathbf{I}\sigma^2_s.$$
In the northeast United States, HYS subclasses were fairly large, but in
some European countries there were many HYS with fewer than five animals.
Any HYS with all daughters from only one bull did not contribute any information to sire evaluations. Not much data were lost in the northeast US, but
in Europe there were many more HYS with only 1 or 2 cows, which were lost
when random HYS effects were treated as a fixed factor.
The other assumptions and limitations were as with the initial model (1.2).
The problems of sires being related, and not being randomly mated to dams
were probably much more significant in their effects on estimated transmitting
abilities than the problem of non-random association of sires with herd-year-season effects, but were largely ignored.
The modified model (1.3) is the one that every country tried to adopt
during the 1970’s. I gave courses in Sweden and Switzerland in 1976 on how
to use Henderson’s programs, and those notes still pop up when I visit other
countries (Schaeffer 1976). Thus, model (1.3) became the accepted practice
even if the bias that was present in the northeast United States did not exist in
other countries or situations. For example, sire models used in swine or sheep,
where artificial insemination was not prevalent and where progeny group sizes
were not large, probably had no selection bias that needed removal. Contemporary groups could have remained a random factor in those models without
issue.
1.1.3 Sires Related
Henderson (1976) discovered a method of inverting the additive genetic relationship matrix (A), and this made it possible to account for sires that were
related, through their sire and maternal grandsire. Herd-year-seasons were still
treated as a fixed factor. The model did not account for non-random mating
of sires to dams. Now
$$Var(\mathbf{s}) = \mathbf{A}\sigma^2_s.$$
The sire model continued to be employed for sire evaluation until 1988.
By 1988, computer hardware and computing techniques had improved to make
animal models feasible (Wiggans and Misztal, 1987).
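Chapter 2 covers relationship matrices in detail, but the flavour of Henderson's 1976 discovery can be sketched here: $\mathbf{A}^{-1}$ is accumulated directly from simple pedigree rules, without ever forming $\mathbf{A}$ itself. The Python illustration below is my own, using a made-up four-animal pedigree with no inbreeding and the animal-pedigree version of the rules (the sire-maternal grandsire version used for sire models is analogous); it checks the rules against $\mathbf{A}$ built by the tabular method.

```python
import numpy as np

# Pedigree: animal -> (sire, dam), with 0 meaning unknown.
# Animals 3 and 4 are full sibs out of unrelated base animals 1 and 2,
# so there is no inbreeding and the simple (non-inbred) rules apply.
ped = {1: (0, 0), 2: (0, 0), 3: (1, 2), 4: (1, 2)}
n = len(ped)

# Build A by the tabular method (animals are numbered so that
# parents always precede their progeny).
A = np.zeros((n, n))
for i in sorted(ped):
    s, d = ped[i]
    A[i-1, i-1] = 1.0 + (0.5 * A[s-1, d-1] if s and d else 0.0)
    for j in range(1, i):
        aij = 0.0
        if s:
            aij += 0.5 * A[j-1, s-1]
        if d:
            aij += 0.5 * A[j-1, d-1]
        A[i-1, j-1] = A[j-1, i-1] = aij

# Henderson's rules, non-inbred case: A-inverse is accumulated
# animal by animal from the pedigree alone.
Ainv = np.zeros((n, n))
for i in sorted(ped):
    par = [p for p in ped[i] if p]             # known parents of animal i
    alpha = {0: 1.0, 1: 4.0 / 3.0, 2: 2.0}[len(par)]  # 1 / Mendelian sampling variance
    Ainv[i-1, i-1] += alpha
    for p in par:
        Ainv[i-1, p-1] -= alpha / 2.0
        Ainv[p-1, i-1] -= alpha / 2.0
        for q in par:
            Ainv[p-1, q-1] += alpha / 4.0

print(np.allclose(A @ Ainv, np.eye(n)))  # True
```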
In summary, the problems with sire models were
• Sires were not randomly mated to dams.
• Genetic groups should have enough bulls.
• Only first lactations of cows were used.
• No cow evaluations were produced.
• HYS with all cows being daughters of the same bull were useless.
Problems motivate changes for the future, and the problems of the sire model
motivated change to an animal model, although it took some years.
The programming strategy for sire models, as laid out by Henderson, was
to first absorb herd-year-season effects into the sire effects, thus creating the
coefficients of the mixed model equations explicitly, but only the non-zero
coefficients. After the absorption process, the coefficients needed to be sorted
by columns within rows, and if there were multiple coefficients with the same
row and column identifier, then these needed to be summed together. Finally,
the equations were solved by iterating through the coefficients (read in from
magnetic tape), and updating the sire solutions each iteration.
In 1986, Schaeffer and Kennedy published the iteration-on-data strategy
which was much simpler than the series of programs he presented in Sweden in
1976. Firstly, there was no need to absorb the herd-year-season equations because coefficients of the mixed model equations were formed as needed rather
than storing them on magnetic tape or in memory. Secondly, you could have
other factors in the model besides herd-year-seasons and sires. There had to
be enough storage space in memory to keep the solutions for all factors in the
model. The calculations ended up being summations with few multiplications
or divisions. Multiplications took more than twice as long to perform as summations, and divisions took nearly four times as long. On today’s computers the same relative differences between operations exist, but all operations are so fast that divisions and multiplications are no longer a concern. Compilers also optimize the code.
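The iteration-on-data idea can be sketched for a toy sire model. The example below uses invented records and an assumed variance ratio, and is only a sketch of the strategy, not Henderson's or Schaeffer and Kennedy's actual programs: each sweep re-reads the records, forming every coefficient of the mixed model equations as it is needed rather than building and storing them.

```python
# Toy "iteration on data" sketch -- made-up records and an assumed
# variance ratio, not the actual historical programs.
# Model: y = HYS (fixed) + sire (random) + e.  Each sweep re-reads the
# records, forming every coefficient as needed (Gauss-Seidel style).
records = [  # (hys, sire, yield)
    (0, 0, 10.0), (0, 1, 12.0), (1, 0, 11.0), (1, 1, 14.0), (1, 1, 13.0),
]
n_hys, n_sire, ratio = 2, 2, 10.0   # ratio = sigma_e^2 / sigma_s^2

h = [0.0] * n_hys   # fixed HYS solutions
s = [0.0] * n_sire  # random sire solutions
for _ in range(200):
    for i in range(n_hys):   # HYS equation: n_i*h_i + sum(s_j) = sum(y)
        num = sum(y - s[j] for hy, j, y in records if hy == i)
        n_i = sum(1 for hy, _, _ in records if hy == i)
        h[i] = num / n_i
    for j in range(n_sire):  # sire equation gets + ratio on the diagonal
        num = sum(y - h[hy] for hy, sj, y in records if sj == j)
        n_j = sum(1 for _, sj, _ in records if sj == j)
        s[j] = num / (n_j + ratio)
```

With the variance ratio added to the sire diagonals, the equations are strongly diagonally dominant, so the sweeps converge quickly, and only the current solutions need to be held in memory.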
1.2 The Animal Model
In any biological field of research, observations are usually taken on individuals,
either one observation or many. Thus, the experimental unit is the individual,
or animal (humans are also animals). The statistical analysis would begin with
a model that described the factors that affect the observations on individuals.
In general, the model for a single trait could be written as
\[
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{u}
+ \begin{pmatrix} \mathbf{Z} & \mathbf{0} \end{pmatrix}
\begin{pmatrix} \mathbf{a}_w \\ \mathbf{a}_o \end{pmatrix}
+ \mathbf{Z}\mathbf{p} + \mathbf{e},
\]
where
y is a vector of observations of a single trait,
b is a vector of fixed effects (such as age, year, gender, farm, cage, diet) that
affect the trait of interest, and are not genetic in origin,
u is a vector of random factors (such as contemporary groups),
$\begin{pmatrix} \mathbf{a}_w \\ \mathbf{a}_o \end{pmatrix}$ is a vector of animal additive genetic effects (breeding values) for all animals included in the pedigree, many of which may not have observations. Animals with records are in $\mathbf{a}_w$, and animals without records are in $\mathbf{a}_o$,
p is a vector of permanent environmental effects for animals that have been
observed, corresponding to aw ,
e is a vector of residual effects,
X, W and Z are matrices that relate observations to the factors in the model.
In addition,
\[
E(\mathbf{b}) = \mathbf{b}, \qquad
E(\mathbf{u}) = \mathbf{0}, \qquad
E\begin{pmatrix} \mathbf{a}_w \\ \mathbf{a}_o \end{pmatrix}
= \begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \qquad
E(\mathbf{p}) = \mathbf{0}, \qquad
E(\mathbf{e}) = \mathbf{0},
\]
and
\[
Var\begin{pmatrix} \mathbf{u} \\ \mathbf{a}_w \\ \mathbf{a}_o \\ \mathbf{p} \\ \mathbf{e} \end{pmatrix}
= \begin{pmatrix}
\mathbf{U} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{A}_{ww}\sigma_a^2 & \mathbf{A}_{wo}\sigma_a^2 & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{A}_{ow}\sigma_a^2 & \mathbf{A}_{oo}\sigma_a^2 & \mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{I}\sigma_p^2 & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{R}\sigma_e^2
\end{pmatrix}.
\]
where
\[
\mathbf{U} = \Sigma^{+}_{i}\, \mathbf{I}\sigma_i^2,
\]
for i going from 1 to the number of other random factors in the model, and R is
usually diagonal, but there can be different values on the diagonals depending
on the situation. Often all of the diagonals are the same, so that R = I.
The additive genetic relationship matrix,
\[
\mathbf{A} = \begin{pmatrix} \mathbf{A}_{ww} & \mathbf{A}_{wo} \\ \mathbf{A}_{ow} & \mathbf{A}_{oo} \end{pmatrix},
\]
is derived from the pedigrees, and can be constructed by the tabular method that Henderson also described many years ago. In practice, A is seldom constructed explicitly. The pedigrees are assumed to trace back to the same base generation; that is, to a particular birth year such that all the parents of animals born in that year were unknown. The variance, $\sigma_a^2$, is the genetic variance of animals in that time period, where all animals are assumed unrelated.
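As a small illustration of the notation (entirely made-up data, not from the notes), the incidence matrices X, W, and Z for three records could be built as follows; animals without records contribute the zero columns of the [Z 0] partition.

```python
# Made-up illustration of the incidence matrices in the animal model:
# three records on animals 1-3, two fixed-effect levels, two
# contemporary groups.
records = [  # (animal, fixed level, contemporary group, observation)
    (1, 0, 0, 250.0), (2, 0, 1, 310.0), (3, 1, 1, 280.0),
]

def incidence(n_cols, col_of_record):
    """One row per record, a single 1 in the column of its level."""
    return [[1.0 if j == col else 0.0 for j in range(n_cols)]
            for col in col_of_record]

X = incidence(2, [r[1] for r in records])      # fixed effects (b)
W = incidence(2, [r[2] for r in records])      # other random factors (u)
Z = incidence(3, [r[0] - 1 for r in records])  # animals with records (a_w)
```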
Assumptions implied with an animal model are as follows:
• Progeny of sire-dam pairs are randomly generated and not selected from
amongst all possible progeny of that pair.
• Henderson originally assumed that animals mated randomly, and inbreeding accrued due to finite population size. However, selective matings of sires to dams are traced within the pedigrees, and therefore, are
taken into account through the relationship matrix.
• Animals are randomly dispersed across levels of fixed and other random
factors.
• Observations are taken on either males or females, but if taken on both
sexes, then the assumption is that parents would rank the same if based
only on one gender or the other.
• There should be no preferential treatment given to individuals or groups
of individuals.
• No animals should be omitted from the analysis because their observations were too low (unless an error has occurred) or too high. That is,
the data should not be a selected subset of all possible animals.
• There are no competition effects between animals. Every animal is able to express its full genetic potential without restraint from other individuals in its contemporary group.
• Animals should be from a common population with the same genetic
variation in the base generation.
Animal models apply equally to plant breeding, but naturally will refer
to either individual plants or to individual strains of plants, where the plants
represent multiple observations of the same strain or variety.
1.3 Model Considerations

1.3.1 Time and Location Effects
Data on animals usually accumulates over time (months or years), hence the
need to account for time in some manner in an animal model. Environments
improve or disintegrate over time (too much rain, too much heat). Feeds vary
in quality and quantity from year to year. The categories of time could be
weeks, months, or years. Sometimes, if there are not enough animals in a time
period, then some time periods may need to be merged together.
If data are from animals that are spread from coast to coast, and have been
raised in different geographical locations, there may be the need to include an
interaction effect of location by year by month. This assumes that location
effects differ from year to year and month to month. This is particularly true
for most livestock species within Canada and the United States. Such an effect
is a fixed factor and will generally have several hundred, if not thousands, of
data points in each subclass.
1.3.2 Contemporary Groups
Contemporary groups are always present in an animal model, and are always
a random factor. Contemporary groups are usually subsets of the time by
location effects (i.e. a nested factor). Animals are raised on different farms in
different years, and are managed differently from one owner to the next. Each
herd, flock, farm is a smaller location within a larger geographical region.
Animals are assumed to belong to a particular contemporary group due to
birth and management system. Sometimes the groups can be made smaller by
further partitioning the year into seasons or months of the year.
A further partitioning may occur by management systems. For example, a
group of animals may belong to one farm owned by brothers, and each brother
manages his animals slightly differently. Thus, a contemporary group would
be the herd-year-owner subclass, in this case.
Once animals reach a certain age, such as weaning, males and females
may be separated into different management groups. This gives herd-year-management groups as contemporary groups. This emphasizes that contemporaries can change as animals grow and mature. Of importance are the
contemporaries at the time that animals are measured or recorded for a trait.
Because contemporary groups are created as animals come along, the effects of contemporary groups are always considered to be random factors in
an animal model. The animals within a contemporary group are roughly the
same age, born around the same date, housed and managed alike, and are perhaps the same gender, depending on the trait being measured. Contemporary
groups can have from 1 to many individuals.
There have been many papers published which have treated contemporary
groups as fixed effects. Suppose we define a contemporary group as cows in the same herd-year-season of calving; this can be written as
\[
HYS_{ijk} = YS{:}H_{ijk} = YS_{jk} + H_{i} + YS{\times}H_{ijk},
\]
which says that herd effects are nested within Year-Season effects. As a fixed
factor then the two way interaction includes the ‘main’ effects of year-season
and herd, as well as the interaction between the two factors. From an estimability standpoint, the main effect factors are not estimable (cannot be
separated from the interaction term). However, the herd effect should not be
estimated across all year-seasons because it would have a useless value if there
had been 20 or 30 years of data. Hence the model for contemporary groups
should be
\[
HYS_{ijk} = YS_{jk} + YS{:}H_{ijk}.
\]
Because contemporary groups are a random factor, an animal model needs to include both $YS_{jk}$ as fixed and $YS{:}H_{ijk}$ as random. Failure to include $YS_{jk}$ as fixed will cause bias in the $YS{:}H_{ijk}$ estimates and in the estimates of additive genetic values of animals.
1.3.3 Interaction of Fixed Factors with Time
In dairy cattle, models for milk production include the effects of age and month
of calving interactions for 305-day lactation yields. Differences between age
groups or months of calving can change over the years due to increases in the
overall means of production and/or increases due to nutrition and management.
Thus, an interaction of age and month of calving with years of calving (perhaps five-year periods) is needed to account for changing age-month differences
over time. Shorter time periods may be needed if means are changing rapidly.
An example could be age of calving (in months) differences in milk production of dairy cows. Suppose in 1973 the difference between 24 months of
age and 30 months of age was 200 kg of milk. Over time, because feeding has
improved, and management has improved, the difference between 24 and 30
month old heifers might be bigger due to a general increase in average milk
production by the year 2010. Perhaps the difference is now 250 kg. To pick
up this effect in the data there needs to be an age by year interaction effect.
One would have to explore the data to determine if age effects change every
year, or maybe only every three years (Ptak et al. 1993).
Any fixed factor in any animal model may need an interaction with time
to be part of the model. Models should be considered to be dynamic and
constantly evolving, and not static over many years or generations.
1.3.4 Pre-adjustments
In the 1960’s and 70’s, there were tables of adjustment factors for converting
milk yields of cows to Mature Equivalents (ME) of yield. The reference point
was a mature cow of 90 months of age, calving in November, for example. The
table of factors had age at calving down the side, and the months of the year
going across the top (from January to December). Because the adjustments
were to a Mature Equivalent, then most of the multiplicative adjustment factors were above 1.0 in the table.
Soon, researchers realized that the factors should be different depending on
where in the United States the cow was housed. This was due to temperature
differences, and feed availability differences. Cows in the southern US had
more stress due to heat, than cows in the northeast, midwest, or west.
A problem with adjustment factors was that they needed to be updated
every few years (factor by time interaction), and this required a major effort
on someone’s time, usually someone at the USDA. There were also adjustment
factors for age of dam and sex of calf in beef cattle, dependent upon breed of
cattle. Pre-adjustment of data for one or more factors was an accepted practice
in data analysis. An assumption was that the adjustment factors were known
perfectly, without error.
Today (2018), with the computing power that is available to every researcher, there is no need to employ pre-adjustment factors. A sensible animal
model could include age and month of calving by time period interactions, and
therefore, the adjustments would be estimated simultaneously with the estimated breeding values. One should look at the estimates, from time to time,
and make graphs of their changes over time, to see if the time periods need to
be shortened or lengthened.
1.3.5 Additive Genetic Effects
Originally animals were assumed to be randomly mating, which in an animal
breeding context is totally invalid. The purpose of animal breeding is to make
genetic improvement, and thus, non-random matings are the way in which
genetic improvement is achieved. The animal model, however, uses the additive genetic relationship matrix, and through this matrix every mating is
identified (as long as parents for each animal are known). Consequently, all
non-randomness in matings is described and accounted for in the analysis, if
the relationship matrix is complete.
The more important assumption is that each mating should produce a
random sample of progeny. Until genomics came along, every animal that
was born usually had the opportunity to grow and make an observation, and
therefore this assumption was mostly valid. With genomics, however, there
can be pre-selection of embryos and culling of those that do not have the best
genotypes for a set of markers. Only the better embryos are implanted. Thus,
progeny groups being created today (using genomic pre-selection) are NOT a
random sample of progeny. This will cause a serious bias in genetic evaluations
using an animal model depending on the selection intensity, which will vary
depending on the family. The assumption of random progeny groups will be
considered true throughout most of these notes. A later chapter on genomics
and animal models will cover cases where this assumption is not valid.
All animals having observations should have both parents identified. Animals having observations, but with one or both parents unknown, should be
eliminated from the analysis. Technically all animals can be included in an
animal model (with phantom parent groups), but once producers know that
animals with both parents unknown will not be used in genetic evaluation,
then there will be better reporting of parentage by owners.
In dairy cattle, in Canada, pedigrees go back to the mid 1950’s, but test
day records start in the 1980’s. This is a massive amount of data to process
every few months, with new records being added all the time. One should
conduct a test where all of the available data are utilized, and then gradually
eliminate data, in a systematic manner, until half the data remains. Then
compare rankings of the more recent animals to determine the impact. If
there is a large impact, then that could indicate that factors are missing in the
model or are not appropriate.
If rankings of animals are not greatly altered by leaving out older data,
then the older data could be omitted permanently, thereby saving on computing time. Advances in computing hardware and software have allowed genetic evaluation centres to continue to utilize all data. However, someone
should monitor the consequences of using so much data over a long period of
time.
1.3.6 Permanent Environmental Effects
When animals are measured many times during a lactation or over a few
months, then it is possible to estimate permanent environmental effects. These
are factors that affect an animal throughout its life, but which are not transmitted to any progeny. They are a result of the environment and experiences
encountered by the animal. An animal could undergo an event that either
enhances or detracts from its genetic ability for a trait.
For many years a permanent environmental effect was considered an effect
that was constant and unchanging throughout the life of the animal. The sum
of the additive genetic and permanent environmental variances divided by the
total variance provided estimates of repeatability. In actual fact, animals are
continually experiencing new environmental effects, and these environmental
effects are cumulative over time rather than constant. Cumulative environmental effects should be used rather than permanent environmental effects in
animal models.
Animals with only one observation on a trait also have an environmental
effect on their record, but it is impossible to separate the effect from the
additive genetic and residual effects, so that the environmental effect becomes
part of the residual term in the model.
If one thinks of a racing horse, there are many environmental effects contributing to that horse before it makes its first race. Collectively the effects
are $p_1$. After the first race, the trainer may make adjustments to the training schedule and the diet, so that when the second race is run, both $p_1$ and the new environmental effects after the first race, $p_2$, affect the outcome of race 2. Possibly $p_2$ offsets $p_1$, or $p_2$ may enhance or aggravate $p_1$. Up until the third race there are new environmental effects $p_3$, and $(p_1 + p_2 + p_3)$ cumulatively affect
the outcome of race 3. Any animal that has a chance to make more than one
record for a trait will accumulate environmental effects from one record to the
next. The time between records may be a factor to consider.
Chapter 2
Genetic Relationships
Major components of an animal model are the additive genetic effects and, in particular, the additive genetic relationships among the animals. Genetic
links are key to increasing the accuracy of genetic evaluations. Maintaining
a good database of pedigree information is critical. There are always errors
in pedigrees; error rates of 10% have been reported in the United States dairy cattle population. With SNP markers available, animals can be genotyped
and parentage can be verified readily, by using a hundred markers or more.
Without genotyping, there are many checks on pedigrees that can be made.
For an animal model, arranging animals chronologically is a key to success.
2.1 Pedigree Preparation
Pedigrees of animals need to be arranged in chronological order. Parents should
appear in a list ahead of their progeny. Birthdates should be ignored because
they are often in error. The following procedure does not rely on birthdates.
Initially assign a generation number of one to all animals, then iterate through
the pedigrees modifying the generation numbers of the sire and dam to be at
least one greater than the generation number of the offspring. The number
of iterations depends on the number of generations of animals in the list.
Probably 20 or fewer iterations are needed for most situations.
If the number of iterations reaches 50 or more, then there could be a loop
in the pedigrees. That means an animal is its own ancestor, somewhere in
the pedigree. For example, A might be the parent of B, and B is the parent
of C, and C is the parent of A. In this case the generation numbers will keep
increasing in each iteration. Thus, if more than 50 iterations are used, then
look at the animals with the highest generation numbers and try to find the
loop. A loop is an error in the pedigrees and must be repaired. Either correct
the parentage, or remove the parent of the older animal to break the loop.
An example pedigree and sorting follows.
Table 2.1
Animal   Sire   Dam   Generation Number
A        B      E     1
H        D      F     1
B        D      H     1
D        G      E     1
E        G      F     1
G                     1
F                     1
All animals begin with generation number 1. Proceed through the pedigrees one animal at a time.
1. Take the current generation number of the animal and increase it by one
(1), call it m. The first animal is A, for example, and its generation
number is 1, increase it by 1 to become m=2.
2. Compare m to the generation numbers of the animal’s sire and dam. In
the case of A, the sire is B and B’s generation number is 1. That is less
than 2 so B’s generation number has to be changed to 2 (m). The dam
is E, and E’s generation number also becomes 2.
Repeat for each animal in the pedigree list. Keep modifying the generation
numbers until no more need to be changed. The animal with the highest
generation number is the oldest animal.
At the end, sort the list in decreasing order of generation number. The
end result after four iterations of the example pedigree is shown below.
Table 2.2
Animal   Sire   Dam   Generation Number
A        B      E     1
H        D      F     3
B        D      H     2
D        G      E     4
E        G      F     5
G                     6
F                     6
The order of animals within the same generation number is not critical; animals G and F could appear in either order.
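The generation-numbering procedure above can be sketched in code. The following is a minimal illustration using the example pedigree of Table 2.1; the function and variable names are this sketch's own, not from the notes.

```python
# Sketch of the generation-numbering procedure for the Table 2.1
# pedigree.  None denotes an unknown parent.
pedigree = {
    "A": ("B", "E"), "H": ("D", "F"), "B": ("D", "H"),
    "D": ("G", "E"), "E": ("G", "F"), "G": (None, None), "F": (None, None),
}

def assign_generations(pedigree, max_iter=50):
    """Raise each parent's generation number to at least one more than
    its offspring's; reaching max_iter passes suggests a pedigree loop."""
    gen = {animal: 1 for animal in pedigree}
    for _ in range(max_iter):
        changed = False
        for animal, parents in pedigree.items():
            m = gen[animal] + 1
            for parent in parents:
                if parent is not None and gen[parent] < m:
                    gen[parent] = m
                    changed = True
        if not changed:
            return gen
    raise ValueError("possible loop: an animal may be its own ancestor")

gen = assign_generations(pedigree)                 # matches Table 2.2
ordered = sorted(pedigree, key=lambda a: -gen[a])  # parents before progeny
```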
Once the pedigree is sorted, then the birthdates can be checked. Errors
can be spotted more readily. Once the errors are found and corrected, then the
generation numbers could be checked again. Animals should then be numbered
consecutively according to the last list from 1 to the total number of animals
in the list. That means that parent numbers should always be smaller than
progeny ID numbers. Having animals in this order facilitates calculation of
inbreeding coefficients, assignment of animals with unknown parents to groups,
and utilization of the inverse of the relationship matrix in the solution of mixed
model equations.
2.2 Additive Relationships
The order of the additive genetic relationship matrix, A, equals the number
of animals (N ) in the pedigree. However, elements of A can be determined by
the tabular method directly, and its inverse can be derived directly using the
methods of Henderson (1976) and Meuwissen and Luo (1992).
Sewall Wright, in his work on genetic relationships and inbreeding, defined
the relationship between two animals to be a correlation coefficient. That is,
the genetic covariance between two animals divided by the square root of the
product of the genetic variances of each animal. The genetic variance of an
animal was equal to $(1 + F_i)\sigma_a^2$, where $F_i$ is the inbreeding coefficient of that animal, and $\sigma_a^2$ is the population additive genetic variance. Correlations range
from -1 to +1, and therefore, represent a percentage relationship between two
individuals, usually positive only.
The elements of the additive relationship matrix are the numerators of
Wright’s correlation coefficients. Consequently, the diagonals of A can be as
high as 2, and relationships between two individuals can be greater than 1.
Thus, A represents the relative genetic variances and covariances among individuals.
Additive genetic relationships among animals may be calculated using a
recursive procedure called the Tabular Method (attributable to Henderson and
perhaps to Wright before him). To begin, make a list of all animals that have
observations in your data, and for each of these determine their parents (called
the sire and dam). An example list is shown below.
Table 2.3
Animal   Sire   Dam
A
B
C
D        A      B
E        A      C
F        E      D
The list should be in chronological order so that parents appear in the list
before their progeny. The sire and dam of animals A, B, and C are assumed
to be unknown, and consequently animals A, B, and C are assumed to be
genetically unrelated. In some instances the parentage of animals may be
traced for several generations, and for each animal the parentage should be
traced to a common base generation.
Using the completed list of animals and pedigrees, form a two-way table
with n rows and columns, where n is the number of animals in the list, in
this case n = 6. Label the rows and columns with the corresponding animal
identification and above each animal ID write the ID of its parents as shown
below.
Table 2.4
Tabular Method Example, Starting Values.
     -,-   -,-   -,-   A,B   A,C   E,D
     A     B     C     D     E     F
A    1     0     0
B    0     1     0
C    0     0     1
D
E
F
For each animal whose parents were unknown, a one (1) was written on the diagonal of the table (i.e. for animals A, B, and C), and zeros were written in the off-diagonals between these three animals, assuming they were unrelated. Let the elements of this table (referred to as matrix A) be denoted as $a_{ij}$.
Thus, by putting a 1 on the diagonals for animals with unknown parents, the
additive genetic relationship of an animal with itself (in the base group) is
one. The additive genetic relationship to animals without common parents or
whose parents are unknown is assumed to be zero.
The next step is to compute relationships between animal A and animals
D, E, and F. The relationship of any animal to another is equal to the average
of the relationships of that animal with the parents of another animal. For
example, the relationship between A and D is the average of the relationships
between A and the parents of D, who are A and B. Thus,
Table 2.5
\[
\begin{aligned}
a_{AD} &= .5\,(a_{AA} + a_{AB}) = .5(1 + 0) = .5 \\
a_{AE} &= .5\,(a_{AA} + a_{AC}) = .5(1 + 0) = .5 \\
a_{AF} &= .5\,(a_{AE} + a_{AD}) = .5(.5 + .5) = .5
\end{aligned}
\]
The relationship table, or A matrix, is symmetric, so that $a_{AD} = a_{DA}$, $a_{AE} = a_{EA}$, and $a_{AF} = a_{FA}$. Continue calculating the relationships for animals B and C to give the following table.
Table 2.6
Tabular Method Example, Partially Completed.
     -,-   -,-   -,-   A,B   A,C   E,D
     A     B     C     D     E     F
A    1     0     0     .5    .5    .5
B    0     1     0     .5    0     .25
C    0     0     1     0     .5    .25
D    .5    .5    0
E    .5    0     .5
F    .5    .25   .25
Next, compute the diagonal element for animal D. By definition this is one plus the inbreeding coefficient, i.e.
\[
a_{DD} = 1 + F_D.
\]
The inbreeding coefficient, $F_D$, is equal to one-half the additive genetic relationship between the parents of animal D, namely,
\[
F_D = .5\,a_{AB} = 0.
\]
When parents are unknown, the inbreeding coefficient is zero assuming the
parents of the individual were unrelated. After computing the diagonal element
for an animal, like D, then the remaining relationships to other animals in
that row are calculated as before. The completed matrix is given below. Note
that only animal F is inbred in this example. The inbreeding coefficient is a
measure of the percentage of loci in the gamete of an animal that has become
homozygous, that is, the two alleles at a locus are the same (identical by
descent). Sometimes these alleles may be lethal and therefore, inbreeding is
generally avoided.
Table 2.7
Tabular Method Example, Completed Table.
     -,-   -,-   -,-   A,B    A,C    E,D
     A     B     C     D      E      F
A    1     0     0     .5     .5     .5
B    0     1     0     .5     0      .25
C    0     0     1     0      .5     .25
D    .5    .5    0     1      .25    .625
E    .5    0     .5    .25    1      .625
F    .5    .25   .25   .625   .625   1.125
Generally, the matrix A is nonsingular, but if the matrix includes two
animals that are identical twins, then two rows and columns of A for these
animals would be identical, and therefore, A would be singular. In this situation assume that the twins are genetically equal and treat them as one animal
(by giving them the same registration number or identification).
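The Tabular Method is easy to program. The sketch below (the function name is this sketch's own) reproduces Table 2.7 for the six-animal example, assuming the list is already sorted so parents appear before their progeny.

```python
# Sketch of the Tabular Method for the six animals of Table 2.3.
# None denotes an unknown parent.
pedigree = [  # (animal, sire, dam), parents listed before progeny
    ("A", None, None), ("B", None, None), ("C", None, None),
    ("D", "A", "B"), ("E", "A", "C"), ("F", "E", "D"),
]

def tabular_A(pedigree):
    idx = {animal: i for i, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    A = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s, d = idx.get(sire), idx.get(dam)
        # Diagonal: one plus half the relationship between the parents.
        A[i][i] = 1.0 + (0.5 * A[s][d] if s is not None and d is not None else 0.0)
        # Off-diagonals: average of the relationships with the parents.
        for j in range(i):
            a = 0.0
            if s is not None:
                a += 0.5 * A[j][s]
            if d is not None:
                a += 0.5 * A[j][d]
            A[i][j] = A[j][i] = a
    return A

A = tabular_A(pedigree)   # A[5][5] = a_FF = 1.125, so F_F = .125
```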
2.3 Inbreeding Calculations
The inbreeding coefficients and the inverse of A for inbred animals are generally required for BLUP analyses of animal models. Thus, fast methods of
doing both of these calculations for very large populations of animals are
necessary. The following is a description of Henderson’s discovery of 1976.
\[
\mathbf{A} = \mathbf{T}\mathbf{B}\mathbf{T}',
\]
where T is a lower triangular matrix and B is a diagonal matrix. Quaas (1976) showed that the diagonals of B, say $b_{ii}$, were
\[
b_{ii} = (.5 - .25(F_s + F_d)),
\]
where $F_s$ and $F_d$ are the inbreeding coefficients of the sire and dam, respectively, of the $i$th individual. If one parent is unknown, then
\[
b_{ii} = (.75 - .25F_p),
\]
where $F_p$ is the inbreeding coefficient of the parent that is known. Lastly, if neither parent is known then $b_{ii} = 1$.
Animals should be in chronological order, as for the Tabular Method. To
illustrate consider the example given in the Tabular Method section. The
corresponding elements of B for animals A to F would be
\[
\begin{pmatrix} 1 & 1 & 1 & .5 & .5 & .5 \end{pmatrix}.
\]
Now consider a new animal, G, with parents F and B. The first step is to set
up three vectors, where the first vector contains the identification of animals
in the pedigree of animal G, the second vector will contain the elements of a
row of matrix T, and the third vector will contain the corresponding $b_{ii}$ for each animal.
Step 1 Add animal G to the ID vector, a 1 to the T-vector, and
\[
b_{GG} = .5 - .25(.125 + 0) = 15/32
\]
to the B-vector, giving

ID vector   T-vector   B-vector
G           1          15/32
Step 2 Add the parents of G to the ID vector, and because they are one generation back, add .5 to the T-vector for each parent. In the B-vector, animal B has $b_{BB} = 1$, and animal F has $b_{FF} = .5$. The vectors now appear as

ID vector   T-vector   B-vector
G           1          15/32
F           .5         .5
B           .5         1
Step 3 Add the parents of F and B to the ID vector, add .25 (.5 times the T-vector value of the individual, F or B) to the T-vector, and their corresponding $b_{ii}$ values to the B-vector. The parents of F were E and D, and the parents of B were unknown. These give

ID vector   T-vector   B-vector
G           1          15/32
F           .5         .5
B           .5         1
E           .25        .5
D           .25        .5
Step 4 Add the parents of E and D to the ID vector, .125 to the T-vector, and the appropriate values to the B-vector. The parents of E were A and C, and the parents of D were A and B.

ID vector   T-vector   B-vector
G           1          15/32
F           .5         .5
B           .5         1
E           .25        .5
D           .25        .5
A           .125       1
C           .125       1
A           .125       1
B           .125       1
The vectors are complete because the parents of A, B, and C are unknown
and no further ancestors can be added to the pedigree of animal G.
Step 5 Accumulate the values in the T-vector for each animal ID. For example, animals A and B appear twice in the ID vector. Accumulating their T-vector values gives

ID vector   T-vector         B-vector
G           1                15/32
F           .5               .5
B           .5+.125=.625     1
E           .25              .5
D           .25              .5
A           .125+.125=.25    1
C           .125             1
Do not accumulate quantities until all pathways in the pedigree have
been processed, otherwise a coefficient may be missed and the wrong
inbreeding coefficient could be calculated.
Step 6 The diagonal of the A matrix for animal G is calculated as the sum of squares of the values in the T-vector times the corresponding values in the B-vector, hence
\[
\begin{aligned}
a_{GG} &= (1)^2(15/32) + (.5)^2(.5) + (.625)^2(1) \\
       &\quad + (.25)^2(.5) + (.25)^2(.5) + (.25)^2(1) + (.125)^2(1) \\
       &= 72/64 = 1\tfrac{1}{8}.
\end{aligned}
\]
The inbreeding coefficient for animal G is one-eighth.
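The six-step trace can be sketched in code. The sketch below (its helper names are this example's own) accumulates the T-vector over all pedigree paths and then forms the weighted sum of squares; as in the text, the inbreeding coefficients of the ancestors are assumed to be already known.

```python
# Sketch of the six-step trace for animal G.  None = unknown parent;
# F holds the (known) inbreeding coefficients of the ancestors.
pedigree = {
    "A": (None, None), "B": (None, None), "C": (None, None),
    "D": ("A", "B"), "E": ("A", "C"), "F": ("E", "D"), "G": ("F", "B"),
}
F = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": 0.0, "F": 0.125}

def b_diagonal(animal):
    """Quaas (1976) diagonal of B for one animal."""
    sire, dam = pedigree[animal]
    if sire is not None and dam is not None:
        return 0.5 - 0.25 * (F[sire] + F[dam])
    if sire is not None or dam is not None:
        return 0.75 - 0.25 * F[sire if sire is not None else dam]
    return 1.0

def diagonal_of_A(animal):
    t = {}                       # accumulated T-vector (Steps 1-5)
    stack = [(animal, 1.0)]
    while stack:
        an, contribution = stack.pop()
        t[an] = t.get(an, 0.0) + contribution
        for parent in pedigree[an]:
            if parent is not None:
                stack.append((parent, 0.5 * contribution))
    # Step 6: sum of squared T-values times the corresponding b_ii.
    return sum(t[an] ** 2 * b_diagonal(an) for an in t)

a_GG = diagonal_of_A("G")   # 1.125, so F_G = 1/8
```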
The efficiency of this algorithm depends on the number of generations in
each pedigree. If each pedigree is 10 generations deep, then each of the vectors
above could have over 1000 elements for a single animal. To obtain greater
efficiency, animals with the same parents could be processed together, and
each would receive the same inbreeding coefficient, so that it only needs to
be calculated once. For situations with only 3 or 4 generation pedigrees, this
algorithm would be very fast and the amount of computer memory required
would be low.
2.4 Example Additive Matrix
Consider the pedigrees in the table below:
Table 2.8
Animal   Sire   Dam
1
2
3        1
4        1      2
5        3      4
6        1      4
7        5      6
Animals with unknown parents may or may not be selected individuals, but their parents (which are unknown) are assumed to belong to a base
generation of animals, i.e. a large, random mating population of unrelated individuals. Animal 3 has one parent known and one parent unknown. Animal
3 and its sire do not belong to the base generation, but its unknown dam is
assumed to belong to the base generation. If these assumptions are not valid,
then the concept of phantom parent groups needs to be utilized. Using the
tabular method, the A matrix for the above seven animals is given below.
                             Table 2.9
          (-,-)  (-,-)  (1,-)  (1,2)   (3,4)    (1,4)    (5,6)
            1      2      3      4       5        6        7
    1     1.0     0     .5     .5      .5       .75      .625
    2      0     1.0     0     .5      .25      .25      .25
    3     .5      0     1.0    .25     .625     .375     .5
    4     .5     .5     .25    1.0     .625     .75      .6875
    5     .5     .25    .625   .625   1.125     .5625    .84375
    6     .75    .25    .375   .75     .5625   1.25      .90625
    7     .625   .25    .5     .6875   .84375   .90625  1.28125
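The tabular method used to fill this matrix is easy to program. Below is a short Python sketch (the book's own scripts are in R; this is only an illustration) that rebuilds Table 2.9 from the pedigree, assuming animals are numbered so that parents precede their offspring and `0` marks an unknown parent.

```python
def tabular_A(sire, dam):
    """Additive relationship matrix by the tabular method.

    sire[i], dam[i] are the parents of animal i+1 (0 = unknown);
    parents must be numbered before their offspring."""
    n = len(sire)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]   # row/col 0 is a dummy
    for i in range(1, n + 1):
        s, d = sire[i - 1], dam[i - 1]
        # diagonal: 1 plus half the relationship between the parents
        A[i][i] = 1.0 + 0.5 * A[s][d]
        for j in range(1, i):
            # off-diagonal: average relationship of j with the two parents
            A[i][j] = A[j][i] = 0.5 * (A[j][s] + A[j][d])
    return [row[1:] for row in A[1:]]

# Pedigree of Table 2.8
A = tabular_A(sire=[0, 0, 1, 1, 3, 1, 5], dam=[0, 0, 0, 2, 4, 4, 6])
```

Printing `A` reproduces Table 2.9; for example, the diagonal for animal 7 is 1.28125.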
Now partition A into T and B giving:
                        Table 2.10
 Animal  Sire  Dam      1     2     3    4    5    6    7        B
    1     -    -        1     0     0    0    0    0    0      1.0
    2     -    -        0     1     0    0    0    0    0      1.0
    3     1    -       .5     0     1    0    0    0    0      .75
    4     1    2       .5    .5     0    1    0    0    0      .50
    5     3    4       .5    .25   .5   .5    1    0    0      .50
    6     1    4       .75   .25    0   .5    0    1    0      .50
    7     5    6       .625  .25   .25  .5   .5   .5    1      .40625
Note that the rows of T account for the direct relationships, that is, the direct
transfer of genes from parents to offspring.
2.5
Inverse of Additive Relationship Matrix
The inverse of the relationship matrix can be constructed by a set of rules.
Recall the example in the previous section, of seven animals with the following
values for bii .
                 Table 2.11
 Animal  Sire  Dam     b_ii      b_ii^{-1}
    1     -    -      1.00       1.00
    2     -    -      1.00       1.00
    3     1    -      0.75       1.33333
    4     1    2      0.50       2.00
    5     3    4      0.50       2.00
    6     1    4      0.50       2.00
    7     5    6      0.40625    2.4615385
Let δ = b_ii^{-1}; then if both parents are known the following constants are
added to the appropriate elements in the inverse matrix:

              Table 2.12
             animal    sire     dam
    animal      δ      -.5δ    -.5δ
    sire      -.5δ     .25δ    .25δ
    dam       -.5δ     .25δ    .25δ

If one parent is unknown, then delete the appropriate row and column
from the rules above, and if both parents are unknown then just add δ to the
animal's diagonal element of the inverse.
Each animal in the pedigree is processed one at a time, but any order can
be taken. Let’s start with animal 6 as an example. The sire is animal 1 and the
dam is animal 4. In this case, δ = 2.0. Following the rules and starting with
an inverse matrix that is empty, after handling animal 6 the inverse matrix
should appear as follows:
              Table 2.13
          1    2    3    4    5    6    7
    1    .5             .5        -1
    2
    3
    4    .5             .5        -1
    5
    6    -1            -1          2
    7
After processing all of the animals, then the inverse of the relationship
matrix for these seven animals should be as follows:
                                   Table 2.14
              1        2        3        4         5         6         7
    1      2.33333    .5     -.66667   -.5        0        -1         0
    2       .5       1.5       0       -1         0         0         0
    3     -.66667     0      1.83333    .5       -1         0         0
    4      -.5       -1        .5      3.0       -1        -1         0
    5       0         0       -1       -1       2.61538    .61538   -1.23077
    6      -1         0        0       -1        .61538   2.61538   -1.23077
    7       0         0        0        0      -1.23077  -1.23077   2.46154
The product of the above matrix with the original relationship matrix, A,
gives an identity matrix.
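That check can be done numerically. The Python sketch below (an illustration, not the book's program) builds A by the tabular method, builds A^{-1} by the rules of Table 2.12, and confirms the product is an identity matrix. Here b_ii is derived as 1 − .25(1+F_s) − .25(1+F_d), dropping the term for an unknown parent, which reproduces the values in Table 2.11; a dummy row and column 0 absorbs the contributions of unknown parents.

```python
import numpy as np

def tabular_A(sire, dam):
    # additive relationships; parents numbered before offspring, 0 = unknown
    n = len(sire)
    A = np.zeros((n + 1, n + 1))
    for i in range(1, n + 1):
        s, d = sire[i - 1], dam[i - 1]
        A[i, i] = 1.0 + 0.5 * A[s, d]
        for j in range(1, i):
            A[i, j] = A[j, i] = 0.5 * (A[j, s] + A[j, d])
    return A[1:, 1:]

def ainverse(sire, dam):
    """A-inverse by the rules; inbreeding taken from the tabular A."""
    n = len(sire)
    F = np.diag(tabular_A(sire, dam)) - 1.0
    AI = np.zeros((n + 1, n + 1))   # row/col 0 absorbs unknown parents
    for i in range(1, n + 1):
        s, d = sire[i - 1], dam[i - 1]
        # b_ii = 1 - .25(1+F_s) - .25(1+F_d), dropping unknown parents
        b = 1.0
        if s: b -= 0.25 * (1.0 + F[s - 1])
        if d: b -= 0.25 * (1.0 + F[d - 1])
        delta = 1.0 / b
        for p, w in ((i, 1.0), (s, -0.5), (d, -0.5)):
            for q, v in ((i, 1.0), (s, -0.5), (d, -0.5)):
                AI[p, q] += w * v * delta
    return AI[1:, 1:]

sire = [0, 0, 1, 1, 3, 1, 5]
dam  = [0, 0, 0, 2, 4, 4, 6]
A  = tabular_A(sire, dam)
AI = ainverse(sire, dam)
```

The entries of `AI` match Table 2.14, and `A @ AI` comes out as the 7 by 7 identity.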
Below is an R-script that can create A−1 from a list of sires and dams for
animals in chronological order, and with the bi values known.
AINV = function(sid, did, bi) {
  # IDs assumed to be consecutively numbered; an unknown parent is
  # coded 0, which maps to a dummy first row/column that is dropped
  rules = matrix(data = c( 1.0, -0.5, -0.5,
                          -0.5, 0.25, 0.25,
                          -0.5, 0.25, 0.25),
                 byrow = TRUE, nrow = 3)
  nam = length(sid)
  np = nam + 1
  ss = sid + 1
  dd = did + 1
  LAI = matrix(data = c(0), nrow = np, ncol = np)
  for (i in 1:nam) {
    ip = i + 1
    X = 1 / bi[i]                  # X = delta for animal i
    k = cbind(ip, ss[i], dd[i])    # animal, sire, and dam positions
    LAI[k, k] = LAI[k, k] + rules * X
  }
  k = c(2:np)
  C = LAI[k, k]                    # drop the dummy row and column
  return(C)
}
Likewise for putting pedigrees in chronological order, the R-script could
be as follows.
border = function(anm, sir, dam) {
  # Assign a generation number to each animal; sorting animals by
  # these numbers puts parents ahead of their progeny
  maxloop = 1000
  i = 1
  count = 0
  mam = length(anm)
  old = rep(1, mam)
  new = old
  while (i > 0) {
    for (j in 1:mam) {
      ks = sir[j]
      kd = dam[j]
      gen = new[j] + 1
      if (ks != "NA") {
        js = match(ks, anm)
        if (gen > new[js]) { new[js] = gen }
      }
      if (kd != "NA") {
        jd = match(kd, anm)
        if (gen > new[jd]) { new[jd] = gen }
      }
    } # end of for loop
    changes = sum(new - old)
    old = new
    i = changes
    count = count + 1
    if (count > maxloop) { i = 0 }   # guard against loops in the pedigree
  } # end of while loop
  return(new)
} # end of function
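The idea behind `border` — repeatedly raise each known parent's generation number above its progeny's until nothing changes — can also be sketched in Python. The function name, the use of `None` for unknown parents, and the small example pedigree are choices made here, not part of the book's script.

```python
def generation_order(anm, sir, dam):
    """Generation number per animal (larger = earlier generation);
    sorting by descending number puts parents ahead of progeny.
    Unknown parents are None; the cap guards against pedigree loops."""
    gen = {a: 1 for a in anm}
    changed, rounds = True, 0
    while changed and rounds < 1000:
        changed = False
        rounds += 1
        for a, s, d in zip(anm, sir, dam):
            for p in (s, d):
                if p is not None and gen[p] <= gen[a]:
                    gen[p] = gen[a] + 1
                    changed = True
    return [gen[a] for a in anm]

# a pedigree deliberately listed out of order
anm = ["F", "E", "D", "A", "B", "C"]
sir = ["E", "A", "A", None, None, None]
dam = ["D", "C", "B", None, None, None]
g = generation_order(anm, sir, dam)
```

Sorting the animals by descending `g` gives A, B, C first, then D and E, then F, so every parent precedes its progeny.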
Chapter 3
Mixed Model Equations
Henderson first published his Mixed Model Equations (MME) in 1949 without knowing the full statistical properties of the method. He knew that the
solutions to MME were the same as Lush’s selection index equations if the
means that were used were generalized least squares means. Goldberger (an
econometrician) also published the same Mixed Model Equations around the
same time, but Henderson made wider use of them in animal breeding. Henderson did not learn matrix algebra until he took a sabbatical leave in New
Zealand where he met Shayle Searle. With Searle’s help, Henderson proved
that solutions to MME were best linear unbiased predictors, and solutions for
the fixed effects were generalized least squares solutions. These proofs were
published in 1963. Animal scientists learned about MME in 1973 at the Lush
Symposium in North Carolina State Dairy Science meetings, when Henderson
presented BLUP and MME for purposes of genetic evaluation of dairy sires.
Using the general animal model, then

    y = Xb + Wu + ( Z  0 ) ( a_w )  + Zp + e,
                           ( a_o )
where
y is a vector of observations of a single trait,
b is a vector of fixed effects (such as age, year, gender, farm, cage, diet) that
affect the trait of interest, and are not genetic in origin,
u is a vector of random factors (such as contemporary groups),
(a_w' a_o')' is a vector of animal additive genetic effects (breeding values) for all
animals included in the pedigree, many of which may not have observations. Animals with records are in a_w, and animals without records are
in a_o; a_o may also contain phantom parent groups,

p is a vector of permanent environmental effects for animals that have been
observed, corresponding to a_w,
e is a vector of residual effects,
X, W and Z are matrices that relate observations to the factors in the model.
In addition,

    E(b)   = b,
    E(u)   = 0,
    E(a_w) = 0,
    E(a_o) = 0,
    E(p)   = 0,
    E(e)   = 0,

and

          ( u   )   ( U       0             0            0         0       )
          ( a_w )   ( 0   A_ww σ_a^2    A_wo σ_a^2       0         0       )
    Var   ( a_o ) = ( 0   A_ow σ_a^2    A_oo σ_a^2       0         0       ),
          ( p   )   ( 0       0             0         I σ_p^2      0       )
          ( e   )   ( 0       0             0            0      R σ_e^2    )

where

    U = Σ_i I σ_i^2,

for i going from 1 to the number of other random factors in the model, and R is
usually diagonal, but there can be different values on the diagonals depending
on the situation. Often all of the diagonals are the same, so that R = I.
Assume the latter is true, then Henderson’s mixed model equations (MME)
for a single trait, single record per animal situation, can be written as follows:
    ( X'X    X'W                  X'Z              0        ) ( b̂   )   ( X'y )
    ( W'X    W'W + U^{-1}σ_e^2    W'Z              0        ) ( û   )   ( W'y )
    ( Z'X    Z'W                  Z'Z + A^{ww}α    A^{wo}α  ) ( â_w ) = ( Z'y )
    (  0      0                   A^{ow}α          A^{oo}α  ) ( â_o )   (  0  )

where

    α = σ_e^2 / σ_a^2.
Let a generalized inverse of the coefficient matrix be

    ( Cxx  Cxw  Cxz  Cxo )   ( X'X    X'W                  X'Z              0        ) -
    ( Cwx  Cww  Cwz  Cwo ) = ( W'X    W'W + U^{-1}σ_e^2    W'Z              0        )
    ( Czx  Czw  Czz  Czo )   ( Z'X    Z'W                  Z'Z + A^{ww}α    A^{wo}α  )
    ( Cox  Cow  Coz  Coo )   (  0      0                   A^{ow}α          A^{oo}α  )

then

    Var(b̂)     = Cxx σ̂_e^2,
    Var(û − u) = Cww σ̂_e^2,

    Var ( â_w − a_w )   ( Czz  Czo )
        ( â_o − a_o ) = ( Coz  Coo ) σ̂_e^2,

where

    σ̂_e^2 = y'(y − Xb̂ − Wû − Zâ_w)/(N − r(X))

is an estimate of the residual variance.
3.1
Accuracies
The variances of prediction error can be used to obtain accuracies of evaluation,
also known as reliabilities. Breeders and owners of animals always want to
know how accurately their animals are evaluated. The correct calculation of
accuracies requires a generalized inverse of the coefficient matrix of the mixed
model equations, which is seldom possible in practice because of the large
order of those equations. This problem has led to many
proposed approximation methods for the needed inverse elements. Some
of the methods lead to upwardly biased accuracies and others to downwardly
biased accuracies. Of the two options, one in which accuracies are equal to or
lower than the correct accuracies would be preferable so that accuracies are
always on the conservative side.
The selection index approach is an easy and consistent method to calculate
accuracies. You may incorporate whatever sources of information that you
wish, for example,
• The number of observations on the animal,
• The number of records on the dam,
• The number of progeny of the animal with records,
• The number of progeny of the dam with records, and
• The number of progeny of the sire with records.
This approach ignores contemporary group size. An advantage is that it is
consistent from one year to the next, assuming an annual genetic evaluation
run. One problem is that there could be many animals with a zero accuracy.
In an animal model, accuracies should nearly always be greater than zero due
to relationships among animals, but a selection index approximation for accuracies describes the amount of information available (from only those sources
that are considered in the calculations). Test out whatever approximation
method you choose. See the example in Section 3.3.
3.2
Expression of EBVs
The solutions for animals from the MME are called Estimated Breeding Values
or EBVs. This is a prediction of the additive genetic value of the animals.
Each progeny receives a random sample half of the alleles at all genes in the
genome. As with most statistical models, only differences among animals can
be estimated. A property of animal model solutions from MME is that

    1' A^{-1} â = 0.
That is, a weighted sum of EBVs should equal zero. When parents are known
for all animals, except for base population animals, then this means that all of
the base population animal EBVs must sum to zero. This is a mathematical
consequence of the MME and the model, which almost always have at least one
fixed factor in the model. By absorbing out the fixed factor from the MME,
one can show that the right hand sides of the MME for animals will add to
zero, and consequently, the above property holds true. If some parents are
missing and phantom parents need to be used, then the sum of the phantom
parent group solutions will be zero.
In practice, most livestock producers want to see numbers that have meaning to them. Thus, different forms of expressing EBVs have been created in
an effort to help producers understand and use EBV information to improve
their animals.
3.2.1
Genetic Base
Estimated breeding values (EBVs), which are a product of the mixed model
equations, need to be presented relative to a well-defined genetic base. There
are two types of genetic base, fixed and rolling, that can be used. The genetic
base may be defined many different ways.
As an example, let the genetic base be defined as the group of all female
animals born from year 1 through year 5. To impose that base the average
EBV of all female animals that are included in that definition of the base
is subtracted from ALL EBVs. Alternatively, the definition could be all male
animals. Or the definition could be all first lactation animals calving in years 1
through 5. The best definition is simple, straightforward, and understandable
to producers and owners of the animals.
A fixed genetic base is where the definition is kept constant for many years
or generations. No new animals are added to the base, and no old animals are
removed. The animals in the genetic base definition are always the same from
one run to the next, in an annual genetic evaluation system. If genetic change is
occurring in the population, then gradually over time, the more recent animals
will tend to have a higher average EBV than the animals in the genetic base
definition. The problem is that producers think that any EBV that is above
0 is an animal that could be used for breeding, and those below 0 would be
avoided. However, with a fixed genetic base, eventually all currently living
animals will have an EBV above 0. Producers would have to keep informed
about the current average EBV.
Scientists like to help producers by using a rolling genetic base. That
is, the definition is constant, but the years 1 and 5 are increased to 2 and
6, then to 3 and 7, etc., in subsequent years. The animals in the genetic
base definition, therefore, change every year, and the average EBV of current
breeding candidates stays around zero. Producers will continue to use only
animals that have EBVs above 0. Animals with EBVs below zero will not be
used for breeding. Deciding how often to roll the definition of genetic base
animals can be debated. If genetic change is rapid, then annual changes may
be necessary, but otherwise, the definition may be fixed for several years. In
this matter, attention should be paid to the wishes of the producers.
A problem with a rolling base is that as animals age their EBVs go down
in value because of the positive genetic trend. This can be disconcerting to
animal owners. Either base has advantages and disadvantages.
3.2.2
EBV
Estimated Breeding Values (EBV) come out in the solutions to the MME
for an animal model. For many countries and most species of livestock, an
EBV is all that is needed for the producers. Producers generally like to see
genetic information that is expressed in the same units as what they see from
day to day. That is, kilograms of milk, kilograms of weight, gain per day,
centimeters of height or circumference, and so on. An EBV is a number that
allows producers to rank the animals and make decisions.
3.2.3
EPD, ETA
Producers often think that the number they see is how the progeny should
be expected to perform. In this case, Expected Progeny Differences (EPDs)
or Estimated Transmitting Ability (ETAs), which equal one half the EBV,
should be used. Producers want to know what amount is actually transmitted
to the progeny, which is half the animal’s EBV. The ranking of animals remains
3.2. EXPRESSION OF EBVS
59
exactly the same, but the actual genetic information that the producer sees is
half as big as with EBVs. Thus, if a producer has an ETA on each parent,
then the progeny would have an EBV equal to the sum of those two ETA.
3.2.4
RBV
In some cases, producers favour a Relative Breeding Value, RBV . If µ is the
phenotypic mean of animals in the genetic base group, then a RBV is
RBV = 100 ∗ ((EBV + µ)/µ).
RBV have a mean of around 100, depending on the genetic base chosen for
the EBV. The variation in RBV can be great or small, but the RBV are
often standardized so that RBV range from 80 to 120. The RBV gives an
indication of which animals are above and below average without having to
know the average EBV for current animals.
In cases where the units associated with the trait have no meaning to
producers, such as conformation traits or disease traits, RBV may be easier
to understand than EBVs.
3.2.5
Percentile Ranking
Another form of expression is the Percentile Ranking (P R). Animals’ EBVs
are ranked from high to low (1 to N ), then the Percentile Rank is
P R = 100 ∗ (N + 1 − R)/(N + 1),
where R is the rank of the animal’s EBV. The higher is the percentile rank,
the higher the EBV. Producers should only use animals that are above the
50th percentile.
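The re-expressions above (ETA, RBV, and PR) can be collected in a few lines. Below is a small Python sketch using made-up EBVs and a phenotypic mean of 90, as in the example of Section 3.3; the function name is illustrative.

```python
def express(ebvs, mu):
    """Return (ETA, RBV, PR) for each EBV, per the formulas above."""
    n = len(ebvs)
    # rank 1 = highest EBV
    order = sorted(range(n), key=lambda i: -ebvs[i])
    rank = [0] * n
    for r, i in enumerate(order, start=1):
        rank[i] = r
    out = []
    for i, ebv in enumerate(ebvs):
        eta = 0.5 * ebv                       # transmitting ability
        rbv = 100.0 * (ebv + mu) / mu         # relative breeding value
        pr = 100.0 * (n + 1 - rank[i]) / (n + 1)  # percentile rank
        out.append((eta, rbv, pr))
    return out

res = express([8.0134, -0.5516, -6.2926], mu=90.0)
```

For the first (highest) EBV this gives ETA = 4.0067, RBV ≈ 108.90, and PR = 75 among these three animals.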
Sometimes producers can see ETA, EBV, RBV, and PR all for the same
animals. This is an overabundance of redundant information. If at all possible, use just one form of expression.
When a trait, like backfat thickness, is analyzed, then a high EBV indicates an animal with more fat, which is undesirable. Thus, producers want to
select for low EBVs when choosing mates for breeding, for this trait. Relative
breeding values or percentile rankings could be altered to reflect the difference
in selection goals for that trait, so that a high value always indicates a desirable
animal. Percentile rankings can also make such traits easier to interpret.
3.2.6
Intermediate Traits
Birthweights can be considered an intermediate trait. Breeders do not want
progeny that have high birthweight because that could lead to birthing problems and death of the fetus, the female, or both. Breeders also do not want
progeny that have low birthweights because such progeny may not survive to
weaning. Thus, there is a range of acceptable birthweights, and the goal is to
maintain that optimum range. EBVs could be changed to a categorical basis:
TOO HIGH, TOO LOW, and JUST RIGHT.
Some conformation traits in dairy cattle are also intermediate, such as
pelvic angle. Classifiers score the trait from 1 to 9, for example, where 1 is a
flat extreme, and 9 is a too steep extreme. The scores from 1 to 9 should be
analyzed as any other trait, and the resulting EBVs should be re-categorized
into TOO FLAT, TOO STEEP, and JUST RIGHT, in the proportions of
25:25:50, respectively. Do not fold a distribution. That is, do not assign new
scores such that the middle category 5, remains as 5, and 6, 7, 8, and 9,
become 4, 3, 2, and 1, respectively, and categories 1 to 4 stay the same. This
is incorrect because the heritability of the folded trait becomes much lower
than for the original 1 to 9 scale. With the folded scores, a 1 indicates either
TOO FLAT or TOO STEEP which are two genetically opposite conditions.
Both are bad, but are not the same genetically. The original scale should be
used for analysis, as long as it delineates from one genetic extreme to another.
3.3
Numerical Example
Below are data on 15 animals showing parents (all unrelated), herd, year, and
contemporary groups (herd-year subclasses). Parents of animals 1 to 14 are
unknown.
                      Table 3.1
               Data For Animal Model
    Animal  Sire  Dam  Herd  Year  CG  Trait
      15     1     2     1     1    1    94
      16     1     2     1     1    1    89
      17     3     4     1     1    1    72
      18     3     6     1     1    1   100
      19    13     2     1     2    2    73
      20     5     4     1     2    2    70
      21     5     4     1     2    2    84
      22     1     8     2     1    3    88
      23     1    10     2     1    3   102
      24     7    12     2     1    3    82
      25     7    14     2     1    3   130
      26     5     8     2     2    4    93
      27     5    10     2     2    4   105
      28     9    12     2     2    4   118
      29    11    14     2     2    4    69
The model equation is

    y_ijk = b_i + c_j + a_k + e_ijk,

where

y_ijk is an observation on the k-th animal in the j-th contemporary group within
the i-th year,

b_i is an effect of the i-th year,

c_j is an effect of the j-th contemporary group (herd-year subclasses),

a_k is an effect of the k-th animal, and

e_ijk is a residual effect.
Further, the expectations and covariance matrices are as follows:

    E(b_i) = b_i,
    E(c_j) = 0,
    E(a_k) = 0,
    E(e_ijk) = 0,

and

          ( c )   ( I σ_c^2      0         0      )
    Var   ( a ) = (    0      A σ_a^2      0      ),
          ( e )   (    0         0      I σ_e^2   )

where c, a, and e represent vectors of contemporary group effects, animal breeding
values, and residuals, respectively.

The parameters will be assumed to be known for this example.

    σ_e^2 = 144,
    σ_a^2 = 64,
    σ_c^2 = 36,
    h^2   = .2623.
The order of the mixed model equations is 35 by 35, with 2 equations
for years, 4 equations for contemporary groups, and 29 equations for animals.
The rank of the equations is also 35. The two year solutions were 94.82039 for
year 1 and 87.13676 for year 2. Contemporary group solutions were -2.065268,
-3.668846, 2.065268, and 3.668846 for groups 1 to 4, respectively. A table of
the solutions for animals is given below,
                         Table 3.2
                      EBVs from MME
    Animal  Sire  Dam     EBV     Animal     EBV
      15     1     2    -0.7470      1     -0.7469
      16     1     2    -1.6561      2     -1.6324
      17     3     4    -6.4880      3     -1.8120
      18     3     6     1.1317      4     -4.8231
      19    13     2    -3.2291      5      0.9764
      20     5     4    -4.0223      6      1.3585
      21     5     4    -1.4769      7      2.5589
      22     1     8    -2.3494      8     -1.0471
      23     1    10     1.8323      9      4.4193
      24     7    12    -1.1046     10      2.9529
      25     7    14     7.8180     11     -3.7871
      26     5     8     0.3701     12      1.3569
      27     5    10     4.1883     13     -1.6086
      28     9    12     7.3074     14      1.8343
      29    11    14    -4.7635
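The equations for this example can be assembled and solved directly. Below is a Python/numpy sketch (an illustration, not the program used for the book): year is the only fixed factor, contemporary groups carry a variance ratio of 144/36 = 4, animals carry α = 144/64 = 2.25 with A^{-1} from the 29-animal pedigree, and the resulting solutions can be compared with the values reported in the text. The property 1'A^{-1}â = 0 of Section 3.2 is verified at the end.

```python
import numpy as np

# pedigree: animals 1..29; parents of 1..14 unknown (0)
sire = [0]*14 + [1, 1, 3, 3, 13, 5, 5, 1, 1, 7, 7, 5, 5, 9, 11]
dam  = [0]*14 + [2, 2, 4, 6, 2, 4, 4, 8, 10, 12, 14, 8, 10, 12, 14]
n = 29

# A by the tabular method, then invert
A = np.zeros((n + 1, n + 1))
for i in range(1, n + 1):
    s, d = sire[i - 1], dam[i - 1]
    A[i, i] = 1 + 0.5 * A[s, d]
    for j in range(1, i):
        A[i, j] = A[j, i] = 0.5 * (A[j, s] + A[j, d])
A = A[1:, 1:]
Ainv = np.linalg.inv(A)

# records: animal, year, CG, trait (Table 3.1)
recs = [(15, 1, 1, 94), (16, 1, 1, 89), (17, 1, 1, 72), (18, 1, 1, 100),
        (19, 2, 2, 73), (20, 2, 2, 70), (21, 2, 2, 84),
        (22, 1, 3, 88), (23, 1, 3, 102), (24, 1, 3, 82), (25, 1, 3, 130),
        (26, 2, 4, 93), (27, 2, 4, 105), (28, 2, 4, 118), (29, 2, 4, 69)]
N = len(recs)
X = np.zeros((N, 2)); W = np.zeros((N, 4)); Z = np.zeros((N, n))
y = np.zeros(N)
for r, (an, yr, cg, t) in enumerate(recs):
    X[r, yr - 1] = 1; W[r, cg - 1] = 1; Z[r, an - 1] = 1; y[r] = t

kc, ka = 144 / 36, 144 / 64          # variance ratios
LHS = np.block([
    [X.T @ X, X.T @ W,                  X.T @ Z],
    [W.T @ X, W.T @ W + kc * np.eye(4), W.T @ Z],
    [Z.T @ X, Z.T @ W,                  Z.T @ Z + ka * Ainv]])
RHS = np.concatenate([X.T @ y, W.T @ y, Z.T @ y])
sol = np.linalg.solve(LHS, RHS)
years, cgs, ahat = sol[:2], sol[2:6], sol[6:]
```

Printing `years`, `cgs`, and `ahat` should reproduce the year, contemporary group, and animal solutions reported above.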
Define the genetic base as those animals with records in Year 1, namely
animals 15 to 18, and 22 to 25. Their average EBV was -0.19539, which is
subtracted from all EBVs. Thus, all animals are now compared to the average
of animals with records in Year 1, as shown in the next table. Also shown
are ETAs or one half the EBV (also on the same genetic base), the relative
breeding values (RBV ) using 90 as the mean of the trait, and the percentile
rankings, P R. The last two columns are the standard errors of prediction and
the reliability (or accuracy) from the elements of the generalized inverse of the
MME.
                                Table 3.3
                       Animal solutions from MME
    Animal     EBV      ETA      RBV   Percentile   SEP   Reliability
       1    -0.5516  -0.3735   99.39       53      9.23       13
       2    -1.4370  -0.8162   98.40       33      9.19       14
       3    -1.6166  -0.9060   98.20       27      9.45        9
       4    -4.6277  -2.4115   94.86        7      9.20       14
       5     1.1718   0.4882  101.30       60      9.31       12
       6     1.5539   0.6792  101.73       70      9.59        6
       7     2.7543   1.2795  103.06       80      9.45        9
       8    -0.8517  -0.5235   99.05       47      9.28       12
       9     4.6147   2.2097  105.13       90      9.60        6
      10     3.1483   1.4765  103.50       83      9.28       12
      11    -3.5917  -1.8936   96.01       17      9.60        6
      12     1.5523   0.6784  101.72       67      9.31       12
      13    -1.4132  -0.8043   98.43       37      9.61        6
      14     2.0297   0.9171  102.26       77      9.31       12
      15    -0.5516  -0.3735   99.39       50      8.62       24
      16    -1.4607  -0.8281   98.38       30      8.62       24
      17    -6.2926  -3.2440   93.01        3      8.61       25
      18     1.3271   0.5659  101.47       63      8.67       23
      19    -3.0337  -1.6146   96.63       20      8.60       25
      20    -3.8269  -2.0112   95.75       13      8.74       22
      21    -1.2815  -0.7384   98.58       40      8.74       22
      22    -2.1541  -1.1747   97.61       23      8.63       24
      23     2.0277   0.9162  102.25       73      8.63       24
      24    -0.9092  -0.5523   98.99       43      8.62       24
      25     8.0134   3.9090  108.90       97      8.62       24
      26     0.5655   0.1850  100.63       57      8.66       24
      27     4.3836   2.0941  104.87       87      8.66       24
      28     7.5028   3.6537  108.34       93      8.56       25
      29    -4.5682  -2.3818   94.92       10      8.56       25
An estimate of the residual variance is given by

    σ̂_e^2 = y'(y − Xb̂ − Wĉ − Zâ)/(N − r(X))
          = 2872.765/(15 − 2) = 220.9819.

The variance of prediction error for animal 25, for example, is

    Var(â_25 − a_25) = 0.3359 × σ̂_e^2 = 74.2278.

The standard error of prediction is therefore SEP = ±8.62.

The ratio of residual to genetic variances is 144/64 or 2.25. Therefore, the
accuracy (or reliability) of the EBV (ETA or RBV) of animal 25 is

    (1 − 0.3359 × 2.25) = 0.2442,

or 24%. If the animal was inbred, then use

    ((1 + F_i) − C^{ii} (σ_e^2/σ_a^2)).
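These calculations can be checked in a few lines of Python:

```python
import math

c25 = 0.3359                 # diagonal inverse element for animal 25
s2e = 2872.765 / (15 - 2)    # estimated residual variance
pev = c25 * s2e              # prediction error variance
sep = math.sqrt(pev)         # standard error of prediction
rel = 1 - c25 * (144 / 64)   # reliability, alpha = 2.25
```

The script reproduces σ̂_e^2 = 220.9819, PEV = 74.2278, SEP = 8.62, and a reliability of 0.2442.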
Approximate accuracies for these data could be based on (1) the number of
records on the animal; (2) the number of progeny of the animal (with records);
(3) the number of other progeny of the dam; and (4) the number of other
progeny of the sire. The equations to construct, for each animal, are

    P b = C,

that is,

    ( (1+(x-1)r)/x          .5h^2                .25h^2               .25h^2           ) ( b1 )   (  h^2   )
    (  .5h^2        (1+(z-1)(.25)h^2)/z          .25h^2               .25h^2           ) ( b2 )   ( .5h^2  )
    (  .25h^2              .25h^2        (1+(s-1)(.25)h^2)/s            0              ) ( b3 ) = ( .25h^2 )
    (  .25h^2              .25h^2                  0          (1+(d-1)(.25)h^2)/d      ) ( b4 )   ( .25h^2 )

where x is the number of records on the animal; z is the number of progeny
with records on the animal; s is the number of progeny of the sire; d is the
number of progeny of the dam; r is the repeatability of the trait; and h^2 is the
heritability of the trait. Just substitute in the appropriate numbers for each
animal. The sire and dam are assumed to be unrelated, but the correct value
could be included, if known. If an animal is inbred, then C_1 = (1 + F_i)h^2.
The easiest course of action is to ignore inbreeding and assume animals are
not inbred. Solve

    b̂ = P^{-1} C,

then the accuracy is

    ACC = 100 (b̂'C/h^2).

If one of the pieces of information is missing, then that entire row and
column of P is set to zero. Depending on the situation, other sources of
information could be included, such as the number of records on the dam
and/or sire.

If the true heritability is h^2, then it may be better to use a smaller value
in calculating approximate accuracies, such as .75h^2 to .8h^2, to account for
simultaneous estimation of fixed factors in the model.
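A Python sketch of this approximation follows. The exact constants behind Table 3.4 (for example, whether a reduced h^2 was used) are not restated in the text, so the checks below only confirm internal consistency: accuracy rises as sources of information are added. All names are illustrative, and r is set equal to h^2 as a placeholder since it only matters when x > 1.

```python
import numpy as np

def approx_acc(x, z, s, d, h2=0.2623, r=0.2623):
    """Approximate accuracy ACC = 100*(b'C/h2) from P b = C.
    x: records on the animal, z: progeny of the animal,
    s, d: progeny of the sire and of the dam.
    Rows/columns for missing sources are dropped."""
    m = [x, z, s, d]
    P = np.array([[0.0, 0.5, 0.25, 0.25],
                  [0.5, 0.0, 0.25, 0.25],
                  [0.25, 0.25, 0.0, 0.0],
                  [0.25, 0.25, 0.0, 0.0]]) * h2
    ks = [r, 0.25 * h2, 0.25 * h2, 0.25 * h2]
    for i, (cnt, k) in enumerate(zip(m, ks)):
        if cnt > 0:
            P[i, i] = (1 + (cnt - 1) * k) / cnt
    C = np.array([1.0, 0.5, 0.25, 0.25]) * h2
    keep = [i for i in range(4) if m[i] > 0]
    if not keep:
        return 0.0
    b = np.linalg.solve(P[np.ix_(keep, keep)], C[keep])
    return 100.0 * float(b @ C[keep]) / h2

own_only = approx_acc(1, 0, 0, 0)   # one record, nothing else
with_kin = approx_acc(1, 0, 3, 2)   # add progeny of sire and of dam
```

With the full h^2 this tends to give values somewhat above those in Table 3.4, consistent with the advice to use a reduced heritability.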
                       Table 3.4
               Comparison of Accuracies
    Reliability is from inverse elements of the equations and
    Accuracy is from the approximation.
    Animal  Records  Progeny  Reliability  Accuracy
       1       0        4         13         17.3
       2       0        3         14         13.6
       3       0        2          9          9.5
       4       0        3         14         13.6
       5       0        4         12         17.3
       6       0        1          6          5.0
       7       0        2          9          9.5
       8       0        2         12          9.5
       9       0        1          6          5.0
      10       0        2         12          9.5
      11       0        1          6          5.0
      12       0        2         12          9.5
      13       0        1          6          5.0
      14       0        2         12          9.5
      15       1        0         24         23.7
      16       1        0         24         23.7
      17       1        0         25         22.3
      18       1        0         23         20.8
      19       1        0         25         21.5
      20       1        0         22         23.7
      21       1        0         22         23.7
      22       1        0         24         23.0
      23       1        0         24         23.0
      24       1        0         24         21.6
      25       1        0         24         21.6
      26       1        0         24         23.0
      27       1        0         24         23.0
      28       1        0         25         20.8
      29       1        0         25         20.8
The approximations are reasonable values compared to the reliabilities
from the inverse elements of the MME for this small example. Only four of
the approximate values are greater than the exact values.
Chapter 4
Phantom Parent Groups
In most real-life applications, there are always some animals with missing
parent information. One alternative is to only include animals with records
that have both parents identified. However, producers become upset when
their animals do not receive EBVs. Thus, another alternative is needed.
Westell et al.(1988) and Robinson (1986) assigned phantom parents to
animals with unknown parents. Each phantom parent was assumed to have
only one progeny. Phantom parents were assumed to be unrelated to all other
real or phantom animals. Then, phantom parents were assigned to groups
based on the gender of their offspring, and on the year in which the progeny
were born. In the end, every animal in the pedigree will have both parents
identified, even though their real parent may be unknown. They would still
have a phantom parent group identification in place of the unknown parent(s).
The logic is that phantom parents whose first progeny were born in a particular time period probably underwent the same degree of selection intensity
to become a breeding animal. However, male phantom parents versus female
phantom parents might have faced different selection intensities. Phantom
parents were assigned to phantom parent groups depending on whether they
were sires or dams and on the year of birth of their first progeny, and whether
the first progeny was male or female.
Phantom parent groups may also be formed depending on breed and regions within a country. The basis for further groups depends on the existence
of different selection intensities involved in arriving at particular phantom parents.
Phantom parent groups are best handled by considering them as additional
animals in the pedigree. Then the inverse of the relationship matrix can be
constructed using the same rules as before. These results are due to Quaas
(1988). To illustrate, use the same seven animals as in section 2.4. Assign the
unknown sires of animals 1 and 2 to phantom group 1 (P 1) and the unknown
dams to phantom group 2 (P 2). Assign the unknown dam of animal 3 to
phantom group 3 (P 3). The resulting matrix will be of order 10 by 10 :
    A*^{-1} = ( A_rr  A_rp )
              ( A_pr  A_pp ),

where A_rr is a 7 by 7 matrix corresponding to the elements among the real
animals; A_rp and its transpose are of order 7 by 3 and 3 by 7, respectively,
corresponding to elements of the inverse between real animals and phantom
groups, and A_pp is of order 3 by 3 and contains inverse elements corresponding
to phantom groups. A_rr will be exactly the same as A^{-1} given in the previous
section. The other matrices are

            ( -.5     -.5      .33333 )
            ( -.5     -.5       0     )
            (  0       0      -.66667 )
    A_rp =  (  0       0        0     )
            (  0       0        0     )
            (  0       0        0     )
            (  0       0        0     )

            ( .5   .5    0      )
    A_pp =  ( .5   .5    0      )
            (  0    0   .33333  )
In this formulation, phantom groups (according to Quaas (1988)) are additional fixed factors and there is a dependency between phantom groups 1 and
2. This singularity can cause problems in deriving solutions from the MME.
The dependency can be removed by adding an identity matrix to App . When
genetic groups have many animals assigned to them, then adding the identity
matrix to App does not result in any significant re-ranking of animals in genetic evaluation and aids in getting faster convergence of the iterative system
of equations.
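Treating the groups as extra "animals" makes this construction easy to verify numerically. Below is a Python sketch (an illustration only) for the seven animals plus three groups: codes 8, 9, 10 stand for P1, P2, P3, the δ values are taken from Table 2.11 (groups never change b_ii), and the usual rules are applied to every animal.

```python
import numpy as np

# parent codes: 1..7 real animals, 8..10 phantom groups P1..P3
sire = [8, 8, 1, 1, 3, 1, 5]
dam  = [9, 9, 10, 2, 4, 4, 6]
delta = [1.0, 1.0, 4/3, 2.0, 2.0, 2.0, 1/0.40625]  # = 1/b_ii, Table 2.11

Astar = np.zeros((10, 10))
for i in range(1, 8):
    s, d = sire[i - 1], dam[i - 1]
    # Henderson's rules applied over animal, sire slot, and dam slot
    for p, w in ((i, 1.0), (s, -0.5), (d, -0.5)):
        for q, v in ((i, 1.0), (s, -0.5), (d, -0.5)):
            Astar[p - 1, q - 1] += w * v * delta[i - 1]

Arr = Astar[:7, :7]   # same as the A-inverse of section 2.5
App = Astar[7:, 7:]   # the 3 by 3 phantom group block
```

Running this reproduces the A_rp and A_pp blocks shown above, and Arr matches Table 2.14.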
Phantom groups are used in many genetic evaluation systems today. The
phantom parents assigned to a genetic group are assumed to be the outcome
of non random mating and similar selection differentials on their parents. This
assumption, while limiting, is not as severe as assuming that all phantom
parents belong to one base population.
4.1
Background History
A pedigree file may have many animals with missing parent identification.
Assign a parent group to an animal based upon the year of birth of the animal
and the pathway of selection. If the animal is a female, for example, and the
dam information is missing, then the parent group would be in the Dams of
Females pathway (DF). There are also Dams of Males (DM), Sires of Females
(SF), and Sires of Males (SM). Four pathways and various years of birth nested
within each pathway. These have become known as phantom parent groups.
Genetic group effects are added to the model.
y = Xb + ZQg + Za + e,
where
a is the vector of animal additive genetic effects,
Z is the matrix that relates animals to their observations,
g is the vector of genetic group effects, and
Q is the matrix that relates animals to their genetic groups, and
y, Xb, e are as described in earlier notes.
The Estimated Breeding Value, EBV, of an animal is equal to the sum of the
group and animal solutions from MME, i.e.
    Vector of EBVs = Qĝ + â.
To illustrate, the pedigree list with groups becomes
    Animal   Sire   Dam
      A       G1     G2
      B       G1     G2
      C       G1     G2
      D       A      B
      E       A      C
      F       E      D

4.2
Simplifying the MME
The advantage of phantom parent grouping is that the mixed model equations
simplify significantly. Using the previous model, the MME are
    ( X'X      X'ZQ      X'Z           ) ( b̂ )   ( X'y   )
    ( Q'Z'X    Q'Z'ZQ    Q'Z'Z         ) ( ĝ ) = ( Q'Z'y ).
    ( Z'X      Z'ZQ      Z'Z + A^{-1}α ) ( â )   ( Z'y   )

Notice that Q' times the third row subtracted from the second row gives

    Q'A^{-1}âα = 0.
Quaas (1988) showed that the above equations could be transformed so that
Qĝ + â can be computed directly. The derivation is as follows. Note that

    ( b̂ )   ( I   0   0 ) ( I   0   0 ) ( b̂ )
    ( ĝ ) = ( 0   I   0 ) ( 0   I   0 ) ( ĝ )
    ( â )   ( 0  -Q   I ) ( 0   Q   I ) ( â )

            ( I   0   0 ) (   b̂    )
          = ( 0   I   0 ) (   ĝ    ).
            ( 0  -Q   I ) ( Qĝ + â )
Substituting this equality into the left hand side (LHS) of the MME gives
    ( X'X        0             X'Z           ) (   b̂    )   ( X'y   )
    ( Q'Z'X      0             Q'Z'Z         ) (   ĝ    ) = ( Q'Z'y ).
    ( Z'X    -A^{-1}Qα     Z'Z + A^{-1}α     ) ( Qĝ + â )   ( Z'y   )
To make the equations symmetric again, both sides of the above equations
must be premultiplied by

    ( I   0    0  )
    ( 0   I   -Q' ).
    ( 0   0    I  )

This gives the following system of equations:

    ( X'X          0               X'Z           ) (   b̂    )   ( X'y )
    (  0       Q'A^{-1}Qα     -Q'A^{-1}α         ) (   ĝ    ) = (  0  ).
    ( Z'X     -A^{-1}Qα       Z'Z + A^{-1}α      ) ( Qĝ + â )   ( Z'y )
Quaas (1987) examined the structure of Q and the inverse of A under phantom
parent grouping and noticed that Q'A^{-1}Q and −Q'A^{-1} had properties that
followed the rules of Henderson (1976) for forming the elements of the inverse
of A. Thus, the elements of A^{-1}, Q'A^{-1}Q, and −Q'A^{-1} can be created
by a simple modification of Henderson's rules. Use δ_i as computed earlier (i.e.
δ_i = b_ii^{-1}), let i refer to the individual animal, and let s and d refer to either
the parent or the phantom parent group if either parent is missing; then the
rules are

    Constant to Add     Location in Matrix
    δ_i                 (i,i)
    −δ_i/2              (i,s), (s,i), (i,d), and (d,i)
    δ_i/4               (s,s), (d,d), (s,d), and (d,s)

Thus, Q'A^{-1}Q and Q'A^{-1} can be created directly without explicitly forming Q and without performing the multiplications with A^{-1}.
Chapter 5
Estimation of Variances
5.1
Some History
In 1953 Henderson published a paper that described three methods of estimating variance components from a linear model. The first method assumed
a model in which all factors in the model were random factors, except for
the overall mean. In the 1950’s variances were estimated using mechanical
calculators, and took many days for even a simple data set and model.
The second method allowed fixed factors in the model, but only as long
as there were no interactions of the fixed factors with the random factors. The
least squares solutions for all factors were calculated and the observations adjusted for the fixed factors. Then the first method was applied to the adjusted
observations, making sure to adjust the expectations involving the residual
variance in the sums of squares. This was tedious, but not much more difficult
than the first method.
His third method, known as the fitting constants method, was applicable
to general models with both fixed and random factors. This method required
the most computations and was always considered to be the more accurate
method of the three. The third method, however, was not well defined in that
one could calculate more sums of squares than were needed to estimate all of
the components.
The only statistical properties of Henderson's three methods were that the estimators were unbiased and easy to calculate. By imposing unbiasedness, the estimates always had a chance of turning out negative, which is an undesirable property for a variance estimator.

By 1970, maximum likelihood had been described for mixed models (Hartley and Rao, 1967), followed by restricted maximum likelihood (Patterson and Thompson, 1971) and minimum variance quadratic estimation (C. R. Rao, 1971). These methods were much more complex than Henderson's methods, and rested on sounder statistical properties (i.e. positive estimates, lower variance of the estimates). At the same time, computing power was improving substantially. Henderson's methods soon became outdated and fell out of use.
5.2 Modern Thinking
Every variable in a linear model is a random variable described by a distribution function. A fixed factor is a random variable having a (possibly uniform) distribution ranging from a lower limit to an upper limit. A component of variance is a random variable having a Gamma or Chi-square distribution with df degrees of freedom. In addition, there may be information from previous experiments that strongly indicates the value a variance component should have, and the Bayesian approach allows such prior information to be included in the analysis.
The Bayesian process is to
1. Specify distributions for each random variable of the model.
2. Combine the distributions into a joint posterior distribution.
3. Find the fully conditional distribution of each unknown from the joint posterior distribution.

4. Employ Markov Chain Monte Carlo (MCMC) methods to draw samples from the joint posterior distribution. Gibbs sampling is one MCMC tool for deriving estimates of parameters from the joint posterior distribution.
If random samples are drawn in turn from the fully conditional distributions of each random variable of the model, the samples eventually converge to random samples from the overall joint posterior distribution. Computationally, any program that calculates solutions to Henderson's mixed model equations can be modified to implement Gibbs sampling, and in principle this allows one to use the same data, the same model, and the same program as for genetic evaluation. That is, there is no need to choose a subset of the data for estimating the variances, except to save time.
When a subset of the complete data set is chosen to estimate variance components, two things may happen. First, the subset may not be a representative sample of the complete data set. Usually, the subset is formed by keeping only animals having pedigrees, or sires and dams having many progeny (observations); the subset is then no longer random, and no longer representative.
Secondly, a subset is generally smaller than the complete data set, and therefore the estimates of the variances will be less accurate than estimates using all of the data. Thus, if possible, all data should be utilized and no subsets taken. The computing time will be greater than for a subset, but the resulting estimates will be a better reflection of the complete data set.
5.3 The Joint Posterior Distribution
Begin with a simple single-trait, single-record-per-animal, animal model. That is,

y = Xb + Za + Wu + e.

Let θ be the vector of random variables,

θ = (b, a, u, σ²_a, σ²_u, σ²_e),

and let y be the data vector. Then

p(θ, y) = p(θ) p(y | θ) = p(y) p(θ | y).
Re-arranging gives

p(θ | y) = p(θ) p(y | θ) / p(y)
         = (prior for θ) × p(y | θ) / p(y)
         = posterior probability function of θ.
5.3.1 Conditional Distribution of Data Vector
The conditional distribution of y given θ is

y | b, a, u, σ²_a, σ²_u, σ²_e ~ N(Xb + Za + Wu, I σ²_e),

and

p(y | b, a, u, σ²_a, σ²_u, σ²_e) ∝ (σ²_e)^(-N/2) exp[ -(y - Xb - Za - Wu)'(y - Xb - Za - Wu) / (2σ²_e) ].
5.3.2 Prior Distributions of Random Variables
Fixed Effects Vector
There is little prior knowledge about the values that b might take. This is represented by assuming

p(b) ∝ constant.
Additive Genetic Effects and Variances
For a, the vector of additive genetic values, quantitative genetics theory suggests a normal distribution, i.e.

a | A, σ²_a ~ N(0, A σ²_a),

and

p(a | σ²_a) ∝ (σ²_a)^(-q/2) exp[ -a'A^{-1}a / (2σ²_a) ],
where q is the length of a.
A natural estimator of σ²_a is a'A^{-1}a/q; call it S²_a, where

S²_a ~ χ²_q σ²_a / q.

Multiply both sides by q and divide by χ²_q to give

σ²_a ~ q S²_a / χ²_q,

which is a scaled, inverted Chi-square distribution, written as

p(σ²_a | v_a, S²_a) ∝ (σ²_a)^(-(v_a/2 + 1)) exp[ -v_a S²_a / (2σ²_a) ],

where v_a and S²_a are hyperparameters, with S²_a being a prior guess about the value of σ²_a and v_a being the degrees of belief in that prior value. Usually q is much larger than v_a, and therefore the data provide nearly all of the information about σ²_a.
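The relationship σ²_a ~ q S²_a / χ²_q can be checked by simulation. The sketch below (in Python rather than the R used later in this chapter; the values of q and S²_a are arbitrary) draws from the scaled, inverted Chi-square and compares the sample mean with the theoretical mean q S²_a/(q - 2).

```python
import numpy as np

rng = np.random.default_rng(20190101)

def scaled_inv_chisq(nu, S2, size, rng):
    """Draw from nu * S2 / chi^2_nu, the scaled, inverted Chi-square."""
    return nu * S2 / rng.chisquare(nu, size)

draws = scaled_inv_chisq(nu=50, S2=36.0, size=200_000, rng=rng)
# theoretical mean for nu > 2 is nu * S2 / (nu - 2) = 50 * 36 / 48 = 37.5
print(draws.mean())
```

The same construction (a sum of squares divided by a Chi-square deviate) is exactly what the R statements in Section 5.5 use to sample each variance.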
Other Random Factors

The vector u may contain a number of different factors,

u' = ( u'_1  u'_2  · · ·  u'_s ),

for s factors, with

Var(u_j) = I σ²_uj

for the j-th factor having length q_j. For the j-th factor,

u_j | I, σ²_uj ~ N(0, I σ²_uj),

and

p(u_j) ∝ (σ²_uj)^(-q_j/2) exp[ -u'_j u_j / (2σ²_uj) ],

where q_j is the length of u_j. A natural estimator of σ²_uj is u'_j u_j / q_j; call it S²_uj, where

S²_uj ~ χ²_qj σ²_uj / q_j.
Multiply both sides by q_j and divide by χ²_qj to give

σ²_uj ~ q_j S²_uj / χ²_qj,

which is a scaled, inverted Chi-square distribution, written as

p(σ²_uj | v_uj, S²_uj) ∝ (σ²_uj)^(-(v_uj/2 + 1)) exp[ -v_uj S²_uj / (2σ²_uj) ],

where v_uj and S²_uj are hyperparameters, with S²_uj being a prior guess about the value of σ²_uj and v_uj being the degrees of belief in that prior value. Usually q_j is much larger than v_uj, and therefore the data provide nearly all of the information about σ²_uj.
Residual Effects

Similarly, for the residual variance,

p(σ²_e | v_e, S²_e) ∝ (σ²_e)^(-(v_e/2 + 1)) exp[ -v_e S²_e / (2σ²_e) ].
Combining Prior Distributions

The joint posterior distribution is

p(b, a, u, σ²_a, σ²_u, σ²_e | y) ∝ p(b) p(a | σ²_a) p(σ²_a) p(u | σ²_u) p(σ²_u) p(σ²_e) p(y | b, a, u, σ²_a, σ²_u, σ²_e),

which can be written as

∝ (σ²_e)^(-(N + v_e)/2 - 1) exp[ -((y - Xb - Za - Wu)'(y - Xb - Za - Wu) + v_e S²_e) / (2σ²_e) ]
  × (σ²_a)^(-(q + v_a)/2 - 1) exp[ -(a'A^{-1}a + v_a S²_a) / (2σ²_a) ]
  × Π_{j=1}^{s} (σ²_uj)^(-(q_j + v_uj)/2 - 1) exp[ -(u'_j u_j + v_uj S²_uj) / (2σ²_uj) ].
5.4 Fully Conditional Posterior Distributions
In order to implement Gibbs sampling, all of the fully conditional posterior distributions (one for each component of θ) need to be derived from the joint posterior distribution. Each conditional posterior distribution is derived from the joint posterior distribution by picking out the parts that involve the unknown parameter in question.
5.4.1 Fixed and Random Effects of the Model
Let

T = ( X  Z  W ),
β' = ( b'  a'  u' ),

      ( 0    0          0        )
Σ =   ( 0    A^{-1}k_a  0        )
      ( 0    0          U^{-1}k_u ),

with k_a = σ²_e/σ²_a, k_u = σ²_e/σ²_u, and U = Var(u)/σ²_u (an identity matrix here). Then

C = Henderson's mixed model equations coefficient matrix = T'T + Σ,

and

C β̂ = T'y.

A new notation is introduced. Let

β' = ( β_i  β'_{-i} ),

where β_i is a scalar representing just one element of the vector β, and β_{-i} is a vector of all of the other elements except β_i. Similarly, C and T can be partitioned in the same manner, as

T = ( T_i  T_{-i} ),

C = ( C_{i,i}   C_{i,-i}  )
    ( C_{-i,i}  C_{-i,-i} ).

In general terms, the conditional posterior distribution of β_i is a normal distribution,

β_i | β_{-i}, σ²_a, σ²_u, σ²_e, y ~ N( β̂_i, C_{i,i}^{-1} σ²_e ),
where

C_{i,i} β̂_i = T'_i y - C_{i,-i} β_{-i}.

Then

b_i | b_{-i}, a, u, σ²_a, σ²_u, σ²_e, y ~ N( b̂_i, C_{i,i}^{-1} σ²_e ),

for C_{i,i} = x'_i x_i. Also,

a_i | b, a_{-i}, u, σ²_a, σ²_u, σ²_e, y ~ N( â_i, C_{i,i}^{-1} σ²_e ),

where C_{i,i} = ( z'_i z_i + A^{i,i} k ), for k = σ²_e/σ²_a and A^{i,i} the i-th diagonal element of A^{-1}.
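The single-site updates above amount to Gauss-Seidel iteration with added noise. A minimal sketch (with a hypothetical coefficient matrix C and right-hand side, not from the text) shows the mechanics: each element is set to its conditional mean and a normal deviate with variance σ²_e/C_{i,i} is added. With the noise variance set to zero, repeated passes converge to the usual solutions C^{-1}T'y.

```python
import numpy as np

rng = np.random.default_rng(5)

def gibbs_pass(C, rhs, beta, sig2e, rng):
    """One round of single-site updates: beta_i is set to its
    conditional mean (rhs_i - C[i,-i] beta_-i) / C[i,i] plus a
    normal deviate with variance sig2e / C[i,i]."""
    for i in range(len(beta)):
        off = C[i] @ beta - C[i, i] * beta[i]   # C[i,-i] beta_-i
        bhat = (rhs[i] - off) / C[i, i]
        beta[i] = bhat + rng.normal(0.0, np.sqrt(sig2e / C[i, i]))
    return beta

C = np.array([[4.0, 1.0], [1.0, 3.0]])   # toy coefficient matrix
rhs = np.array([1.0, 2.0])               # toy right-hand side
beta = np.zeros(2)
for _ in range(100):
    gibbs_pass(C, rhs, beta, 0.0, rng)   # zero noise: plain Gauss-Seidel
```

With sig2e > 0 the same loop produces one Gibbs sample of β per pass instead of a point solution.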
5.4.2 Variances
The conditional posterior distributions for the variances are scaled, inverted Chi-square distributions,

σ²_a | b, a, u, σ²_u, σ²_e, y ~ ṽ_a S̃²_a χ^{-2}_{ṽ_a},

for ṽ_a = q + v_a and S̃²_a = (a'A^{-1}a + v_a S²_a)/ṽ_a;

σ²_uj | b, a, u, σ²_a, σ²_e, y ~ ṽ_uj S̃²_uj χ^{-2}_{ṽ_uj},

for ṽ_uj = q_j + v_uj and S̃²_uj = (u'_j u_j + v_uj S²_uj)/ṽ_uj, for each factor u_j; and

σ²_e | b, a, u, σ²_a, σ²_u, y ~ ṽ_e S̃²_e χ^{-2}_{ṽ_e},

for ṽ_e = N + v_e and S̃²_e = (e'e + v_e S²_e)/ṽ_e, with e = y - Xb - Za - Wu.
5.5 Computational Scheme
The following is a small example for an animal model to illustrate Gibbs sampling. The example has
• 35 observations,
• Only one record per animal,
• 67 non-inbred animals,
• 3 years,
• 9 contemporary groups,
• 3 variances to estimate, i.e. contemporary group, genetic, and residual.
Gibbs sampling is much like Gauss-Seidel iteration. When a new solution
is calculated in the Mixed Model Equations for a level of a fixed or random
factor, a random amount of variation (or noise) is added to the solution based
upon its conditional posterior distribution variance before proceeding to the
next level of that factor or the next factor. After all equations have been
processed, new values of the variances are calculated and new variance ratios
are determined prior to beginning the next round.
The example data are given in Table 5.1.

Table 5.1
Animal Model Example Data

Animal Sire Dam Year CG Obs.
  22    1   11   1   1   58
  23    1   12   1   1   57
  24    2   13   1   1   38
  25    2   14   1   1   68
  26    3   15   1   2   55
  27    3   16   1   2   56
  28    4   17   1   2   71
  29    4   18   1   2   63
  30    5   19   1   3   43
  31    5   20   1   3   49
  32    1   21   1   3   50
  37    1   12   2   4   65
  38    2   13   2   4   62
  39    6   14   2   4   76
  40    3   15   2   5   55
  41    4   17   2   5   51
  42    6   18   2   5   62
  43    7   33   2   5   79
  44    5   20   2   6   72
  45    6   21   2   6   45
  46    7   34   2   6   46
  47    2   35   2   6   45
  48    2   36   2   6   45
  56    2   14   3   7   58
  57    3   49   3   7   52
  58    4   50   3   7   59
  59    5   51   3   7   78
  60    1   18   3   8   35
  61    4   33   3   8   47
  62    8   27   3   8   57
  63    9   52   3   9   50
  64   10   35   3   9   46
  65    6   53   3   9   52
  66    7   54   3   9   46
  67    3   55   3   9   52
5.5.1 Year Effects
The following variable names and assignments will be used.

nyr = 3             # number of years (generations)
ncg = 9             # number of contemporary groups
SDa = sqrt(64/1.6)  # true genetic SD
SDc = sqrt(64/4)    # true CG SD
SDe = 8             # true residual SD
alpa = SDe*SDe/(SDa*SDa)  # starting values
alpc = SDe*SDe/(SDc*SDc)  # starting values
obs  = observations as in Table 5.1
cgid = list of contemporary groups for each observation
anwr = list of animals with records
yrid = list of years for each observation
The starting solutions for all factors in the model are zero. The starting variance ratios can be anything; in this case the true ratios were used to start the Gibbs sampling, namely 1.6 and 4 for the residual to genetic and residual to contemporary group variance ratios, respectively. The starting value for the residual standard deviation was 8.
nam  = 67          # number of animals
yrsn = c(1:nyr)*0  # solutions for year effects
cgsn = c(1:ncg)*0  # solutions for CG
ansn = c(1:nam)*0  # solutions for animals
SDE  = SDe         # standard deviation of residuals
alpa = 1.6         # starting variance ratio
alpc = 4           # starting variance ratio
The first factor in the Gibbs sampling is the year effects.

# Adjust observations for the other factors
# in the model, besides year effects
yobs = obs - cgsn[cgid] - ansn[anwr]
XY = tapply(yobs,yrid,sum)
XX = tapply(yrid,yrid,length)
yrsn = XY/XX
# 61.54129 62.32179 50.23683
# add Gibbs sampling noise
vnois = rnorm(nyr,0,SDE)/sqrt(XX)
# -4.0796973 -3.1532584 -0.9838543
yrsn = yrsn + vnois
# 57.46159 59.16853 49.25298

The command tapply(yobs,yrid,sum) sums the adjusted observations for each year id.
5.5.2 Contemporary Group Effects
Contemporary groups are a random factor. Note below that alpc is added to the diagonal elements in XX. The observations are adjusted for year and animal additive genetic effects.

yobs = obs - yrsn[yrid] - ansn[anwr]
XY = tapply(yobs,cgid,sum)
XX = tapply(cgid,cgid,length) + alpc
cgsn = XY/XX
#  2.2201539  2.5634252 -2.4878904  3.2943517  0.2948540
# -0.6972253  5.2038220 -0.7221015  0.1462229
vnois = rnorm(ncg,0,SDE)/sqrt(XX)
cgsn = cgsn + vnois
#  0.7957398  3.2481316  0.7435676  5.8141136  0.4675327
#  4.0875102 11.1844145 -4.0507944 -2.0645445

5.5.3 Contemporary Group Variance
Because contemporary groups are a random factor, a variance needs to be
sampled next.
ss = sum(cgsn*cgsn)
# = 208.2291
ndf = length(cgsn)+2
Vcg = ss/rchisq(1,ndf)
Vcg
# = 21.74546
where ss is the sum of squares of current solutions to contemporary
groups, ndf is the degrees of freedom equal to the number of contemporary
groups (9) plus 2, and Vcg is the new sample value for the contemporary group
variance.
5.5.4 Animal Genetic Effects
Animal additive genetic effects involve the inverse of the additive relationship matrix and the residual to genetic variance ratio. Equations for all animals are formed for this example; another approach would be used if there were many hundreds of animals. AI is the inverse of the additive relationship matrix, of order 67.

yobs = obs - yrsn[yrid] - cgsn[cgid]
ZY = anim*0; ZZ = ZY    # anim = vector of animal numbers, length 67
ZY[anwr] = yobs; ZZ[anwr] = 1
ZZA = diag(ZZ) + AI*alpa
C = ginv(ZZA)
ansn = C %*% ZY
CD = t(chol(C))
ve = rnorm(nam,0,SDE)
vnois = CD %*% ve
ansn = ansn + vnois   # too many values to show
5.5.5 Additive Genetic Variance
Because animal effects are random, the additive genetic variance must be sampled. The sum of squares involves AI.

v1 = AI %*% ansn
aaa = sum(ansn*v1)
# = 1333.982
ndf = length(ansn)+2  # = 69
Van = aaa/rchisq(1,ndf)
Van
# = 20.4803
5.5.6 Residual Variance
The last random variable of the model is the residual variance. Observations are adjusted for all factors in the model, using the latest solution vectors.
ehat = obs - yrsn[yrid]-cgsn[cgid]-ansn[anwr]
rss = sum(ehat*ehat) # = 2905.817
ndf = length(ehat)+2 # = 37
Vare=rss/rchisq(1,ndf)# = 78.04883
alpa = Vare/Van
# = 3.810923
alpc = Vare/Vcg
# = 3.589201
SDE=sqrt(Vare)
# = 8.834525
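The steps in Sections 5.5.1 through 5.5.6 repeat every round. The loop below is a self-contained sketch of the whole scheme for a simplified version of the model (simulated data rather than Table 5.1, no contemporary groups, and A = I assumed so that no pedigree is needed; each animal has three records so that the animal variance remains identifiable). It is written in Python, but each step mirrors the R statements above.

```python
import numpy as np

rng = np.random.default_rng(2019)

# --- simulated toy data (not Table 5.1): 40 animals, 3 records each ---
nan, nyr = 40, 3
anid = np.repeat(np.arange(nan), nyr)       # animal of each record
yrid = np.tile(np.arange(nyr), nan)         # year of each record
true_yr = np.array([55.0, 60.0, 50.0])
a_true = rng.normal(0.0, np.sqrt(40.0), nan)
obs = true_yr[yrid] + a_true[anid] + rng.normal(0.0, 8.0, nan * nyr)

yrsn = np.zeros(nyr)                        # year solutions
ansn = np.zeros(nan)                        # animal solutions (A = I)
Vare, Van = 64.0, 40.0                      # starting variances
keep = []

for rnd in range(3000):
    alpa = Vare / Van                       # current variance ratio
    # year effects (fixed): solution plus noise, as in Section 5.5.1
    yadj = obs - ansn[anid]
    XY = np.bincount(yrid, yadj)
    XX = np.bincount(yrid).astype(float)
    yrsn = XY / XX + rng.normal(0.0, np.sqrt(Vare / XX))
    # animal effects (random): alpa added to the diagonals
    aadj = obs - yrsn[yrid]
    ZY = np.bincount(anid, aadj)
    d = np.bincount(anid) + alpa
    ansn = ZY / d + rng.normal(0.0, np.sqrt(Vare / d))
    # variances: sums of squares over random Chi-square deviates
    Van = (ansn @ ansn) / rng.chisquare(nan + 2)
    ehat = obs - yrsn[yrid] - ansn[anid]
    Vare = (ehat @ ehat) / rng.chisquare(len(obs) + 2)
    if rnd >= 500:                          # discard burn-in rounds
        keep.append(Van)

print(np.mean(keep))
```

The retained samples of Van, after burn-in, are samples from the marginal posterior distribution of the animal variance under these assumptions.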
5.5.7 Visualization
One should plot the estimates of the variances to 'see' the convergence of the samples to a single joint posterior distribution. Figure 5.1 contains sample values of the additive genetic variance for 2000 samples.
Figure 5.1
Heritability may also be calculated after each round of sampling as

ĥ² = σ̂²_a / (σ̂²_a + σ̂²_c + σ̂²_e).
A plot similar to the one above may be made, but one could also make a histogram of the sample values, as in Figure 5.2.
Figure 5.2
5.5.8 Burn-In Periods
Samples do not immediately represent samples from the joint posterior distribution. Convergence generally takes anywhere from 100 to 10,000 samples, depending on the model and the amount of data. This period is known as the burn-in period, and samples from it are discarded. The length of the burn-in period (i.e. the number of samples to discard) is usually judged by visually inspecting a plot of sample values across rounds, as in Figure 5.1.
A less subjective approach to determining convergence to the joint posterior distribution is to run two chains at the same time, both using the same random number seed, but with greatly different starting values for the variances, e.g. one set well above the expected outcome and the other set well below it. When the two chains essentially become one chain, i.e. the squared difference between variance estimates is less than a specified value (like 10^{-5}), then convergence to the joint posterior distribution is deemed to have occurred. All previous samples are considered part of the burn-in period and are discarded.
5.5.9 Post Burn-In Analysis
After burn-in, each round of Gibbs sampling is dependent on the results of the previous round. Depending on the total number of observations and parameters, one round may be positively correlated with the next twenty to three hundred rounds. The user can determine the effective number of samples by calculating lag correlations, i.e. the correlations of estimates between adjacent rounds, between every other round, between every third round, and so on. Determine the number of rounds between two samples such that the correlation is zero. Dividing the total number of samples (after burn-in) by the interval that gives a zero correlation yields the effective number of samples. Suppose a total of 12,000 samples remain after removing the burn-in rounds, and an interval of 240 rounds gives a zero correlation between samples; then the effective number of samples is 12,000 divided by 240, or 50 samples. There is no minimum number of independent samples that is required, just the need to know how many there actually were.
Another way is to randomly pick 100 to 200 samples out of the 12,000 that were made after burn-in, then determine the average and standard deviation of those sample values. This can be repeated a number of times. The appropriate R statements would be as follows, where wvh2 contains the 12,000 sample values, and the sample function randomly chooses 200 of those and stores them in vh2.

vh2 = sample(wvh2,200)
mean(vh2)
# 0.3232938
sd(vh2)
# 0.04631331   standard error of estimate
Some research has shown that the mode of the estimates might be a better estimate, which indicates that the distribution of sample estimates is skewed. One could report both the mean and the mode of the samples; however, the mode should be based on the independent samples only.
5.5.10 Long Chain or Many Chains?
Early papers on Markov Chain Monte Carlo (MCMC) methods recommended running many chains of samples and then averaging the final values from each chain. This was to ensure independence of the samples. Another philosophy recommends one single long chain. For animal breeding applications this could mean 100,000 samples or more. If a month is needed to run 50,000 samples, then three chains of 50,000, all running simultaneously on a network of computers, may be preferable. If only an hour is needed for 50,000 samples, then 1,000,000 samples would not be difficult to run.
5.6 Software
There are many people who provide software to estimate variance components and obtain genetic evaluations. The ability to write your own software is a valuable asset for research, one that could be considered essential for animal breeding research. Not everyone has that ability, or wants to write software. Some suppliers of software (as of 2019) are listed below, in no particular order.

• Karin Meyer, Australia

• Ignacy Misztal, Georgia, United States

• Arthur Gilmour, Australia (ASREML)

• Eildert Groeneveld, Germany (VCE)

• Per Madsen, Denmark (DMU)

• Mehdi Sargolzaei, Canada (FImpute, GBLUP)
Some software packages do only REML-style estimation, while others can also do Bayesian estimation. Some are limited in the types of models that can be used. Because the programs have been written to be general, to handle almost any kind of model and data, they are sometimes inefficient or extravagant in their use of memory, and thus may force one to take a subset of the data or to simplify the model. At the least, the programs may take a long time to run.

The best approach is to try two or three different software packages and compare the estimates from each. This is not easy, because each requires different parameter files and instruction sets. Check your results carefully.
5.7 Covariance Matrices
Variances must be positive quantities. Similarly, covariance matrices must be positive definite. Nearly every researcher encounters, at some point in their career, an estimated covariance matrix that is not positive definite. The problem is then to force the matrix (by some method) to be positive definite. Hayes and Hill (1981) presented a bending procedure in which the eigenvalues of the matrix are regressed towards the mean of the eigenvalues until all eigenvalues are positive, and the matrix is then reconstructed with the new eigenvalues. Jorjani et al. (2003) gave a weighted bending procedure, i.e. a different way of pulling the eigenvalues closer to the mean eigenvalue. Finally, Meyer and Kirkpatrick (2010) used a penalized maximum likelihood method. Bending, in general, causes many of the original correlations, variances, and covariances to differ greatly from their original values. There must be another way.
5.7.1 Example
Consider the matrix, M, of order 5:

M = ( 100  95  80  40  40 )
    (  95 100  95  80  40 )
    (  80  95 100  95  80 )
    (  40  80  95 100  95 )
    (  40  40  80  95 100 )   = UDU'.

The eigenvalues are the diagonal elements of D:

D = diag( 399.48  98.52  23.65  -3.12  -18.52 ).
Thus, even though all of the pairwise correlations in M are below 1, the matrix as a whole is not positive definite, and is therefore invalid as a covariance matrix.

The procedure that I endorse is to modify the negative eigenvalues so that they become positive, with values between zero and the smallest positive eigenvalue. The smallest positive eigenvalue in this example is 23.65. The following steps should be taken.
Step 1

Add together the negative eigenvalues and multiply by 2:

s = (-3.12 - 18.52) × 2 = -43.28.

Now square this value, multiply by 100, and add 1:

t = (s × s) × 100 + 1 = 187,316.84.
Step 2
Let p be the lowest positive eigenvalue (23.65), then take each negative eigenvalue, n, separately and transform as follows:
n∗ = p × (s − n) × (s − n)/t
or
n∗4 = 23.65(−43.28 + 3.12) × (−43.28 + 3.12)/(187, 316.84) = 0.20363,
and
n∗5 = 23.65(−43.28 + 18.52) × (−43.28 + 18.52)/(187, 316.84) = 0.07740.
Step 3
Replace the negative eigenvalues with the new positive values, and reconstruct
M using the new eigenvalues and old eigenvectors.
D* = diag( 399.48  98.52  23.65  0.20363  0.07740 ).

Then

M* = UD*U'

   = ( 103.18978  90.82704  79.43676  44.56754  37.06769 )
     (  90.82704 106.54177  94.13679  74.06296  44.56754 )
     (  79.43676  94.13679 102.46432  94.13679  79.43676 )
     (  44.56754  74.06296  94.13679 106.54177  90.82704 )
     (  37.06769  44.56754  79.43676  90.82704 103.18978 ).

The only guaranteed property of this method is that the resulting matrix is positive definite, and can therefore be used as a valid covariance matrix in BLUP or selection index calculations.
Chapter 6
Repeated Records Animal
Model
6.1 Traditional Approach
Animals are observed more than once for some traits, such as
• Fleece weight of sheep in different years.
• Calf records of a beef cow over time.
• Litter size of sows over time.
• Antler size of deer in different seasons.
• Racing results of horses from several races.
Usually the trait is considered to be perfectly correlated across the ages of the animal. In addition to an animal's additive genetic value for the trait, there is a permanent environmental (PE) effect, a non-genetic effect assumed to be common to all observations on the same animal.
6.1.1 The Model
The model is written as

y = Xb + ( 0  Z ) ( a'_0  a'_w )' + Zp + Wu + e,

where

b is a vector of fixed effects,
a_0 and a_w are vectors of animal additive genetic effects for animals without records and animals with records, respectively,
p is a vector of PE effects, of length equal to that of a_w,
u is a vector of random contemporary group effects, and
e is a vector of residual effects.

The matrices X, W, and Z are design matrices that associate observations with particular levels of the fixed effects, the contemporary group effects, and the additive genetic and PE effects, respectively. In a repeated records model, Z is not equal to an identity matrix. Also,

a | A, σ²_a ~ N(0, A σ²_a),
p | I, σ²_p ~ N(0, I σ²_p),
u | I, σ²_u ~ N(0, I σ²_u),
e ~ N(0, I σ²_e).
Repeatability is a measure of the average similarity of multiple records on animals across the population (part genetic and part environmental), and is defined as a ratio of variances,

r = (σ²_a + σ²_p) / (σ²_u + σ²_a + σ²_p + σ²_e),

which is always greater than or equal to heritability, because

h² = σ²_a / (σ²_u + σ²_a + σ²_p + σ²_e).
One of the assumptions of a repeated records model is that all records of each animal are included in the analysis. That means there should be no selection of records, such as analyzing only the best record of each animal, or analyzing only third records (because not all animals may have three records). By utilizing all records, the mixed model equations can account (to some degree) for selection of animals to have subsequent records. If the selection intensity is very high, say one out of three or greater, then the MME may not be able to account for the selection, at least not fully.
6.1.2 Example
Let the parameters be

σ²_u = 20,  σ²_a = 36,  σ²_p = 16,  and  σ²_e = 100.

Thus,

h² = 36 / (20 + 36 + 16 + 100) = .21,

and

r = (36 + 16) / (20 + 36 + 16 + 100) = .30.
The data are given in the following table, for twelve animals with 1 to 3 records each.
Table 6.1

Herd Animal Sire Dam Records (in year order)
 1     7     1    2   69, 53, 65
 1     8     3    4   37, 47
 1     9     5    6   39, 62
 1    10     1    4   48, 72
 1    11     3    6   96
 1    12     1    2   72
 2    19     1   14   55, 51, 86
 2    20    13   16   48, 72
 2    21    15   17   71, 96
 2    22    13   18   56, 47
 2    23     5   14   51
 2    24    15   16   77
None of the animals are inbred. There were 6 contemporary groups, defined by herd-year interactions, and 3 year effects. The order of the mixed model equations (MME) was 45. The solutions to the MME were as follows. For years,

b̂ = ( 52.90  57.61  74.67 )',

and for contemporary groups,

û = ( -0.3560628  -0.8611276  -0.1803599  0.3560628  0.8611276  0.1803599 )'.
The solutions for EBV and permanent environmental (PE) effects for the animals, and the diagonals of the inverse of the coefficient matrix corresponding to the EBVs, are given in the following table.

Table 6.2

Animal   EBV      Diag. of inverse   PE
  1      1.1615      0.2944
  2      1.7503      0.3172
  3      0.4397      0.3191
  4     -3.3648      0.3132
  5     -3.2406      0.3192
  6      0.3567      0.3237
  7      1.3535      0.2362          -0.0910
  8     -3.8595      0.2552          -2.1306
  9     -3.9218      0.2561          -2.2043
 10     -2.0695      0.2601          -0.8603
 11      3.2348      0.2885           2.5214
 12      3.3086      0.2799           1.6468
 13     -1.7874      0.3157
 14     -0.3817      0.3159
 15      3.4269      0.3238
 16      0.9680      0.3187
 17      3.4327      0.3331
 18     -2.7612      0.3325
 19      0.7690      0.2369           0.3369
 20      0.5640      0.2558           0.8656
 21      6.8624      0.2568           3.0512
 22     -5.0355      0.2536          -2.4544
 23     -2.5718      0.2860          -0.6762
 24      2.1917      0.2881          -0.0051
An estimate of the residual variance was

σ̂²_e = 2629.408 / (22 - 3) = 138.3899.

The variance of prediction error for animal 11 would therefore be

0.2885 × 138.3899 = 39.9255,

and the standard error of prediction (SEP) is 6.3187. The reliability, using k = σ²_e/σ²_a = 100/36 = 2.78, would be

REL = 1 - 0.2885 × 2.78 = 0.1980.

Reliabilities of animals with records are greater than those of animals without records, in this example.
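These calculations can be sketched as a small helper (not from the text; note that the reliability below uses the unrounded ratio 100/36, which gives 0.1986 rather than the 0.1980 obtained with the ratio rounded to 2.78):

```python
import math

def prediction_accuracy(cii, sig2e_hat, k):
    """PEV, SEP and reliability from a diagonal element of the inverse
    of the MME coefficient matrix (cii), the residual variance estimate,
    and the residual-to-genetic variance ratio k."""
    pev = cii * sig2e_hat     # variance of prediction error
    sep = math.sqrt(pev)      # standard error of prediction
    rel = 1.0 - cii * k       # reliability
    return pev, sep, rel

# animal 11 from Table 6.2
pev, sep, rel = prediction_accuracy(0.2885, 138.3899, 100.0 / 36.0)
# pev = 39.9255, sep = 6.3187, rel = 0.1986
```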
6.1.3 Estimation of Variances
Gibbs sampling is best applied by iterating on the data when solving the MME, so that solutions are obtained for one factor at a time. Random variation is added to each new solution to reflect the relative accuracy of estimation of that solution. The accuracy is given by the diagonal element of the MME, denoted d_kk, which is a diagonal of X'X, W'W, Z'_a Z_a, or Z'_p Z_p, depending on which solution is being calculated.

• Year Effects. The new solution is

b̂_i = x'_i (y - Z_a â_w - Z_p p̂ - W û) / d_ii,

for d_ii = x'_i x_i. Random variation is added by obtaining a random normal deviate with mean 0 and standard deviation σ̂_e, the square root of the latest sample value of the residual variance:

b̂_i = b̂_i + RNORM(0, σ̂_e) / sqrt(d_ii).

• Contemporary Group Effects. The new solution is

û_j = w'_j (y - Z_a â_w - Z_p p̂ - X b̂) / (d_jj + k_u),

for d_jj = w'_j w_j, where k_u is the ratio of the latest sample values of the residual and contemporary group variances, i.e.

k_u = σ̂²_e / σ̂²_u.

Random variation is added in the same way:

û_j = û_j + RNORM(0, σ̂_e) / sqrt(d_jj + k_u).

• Additive Breeding Values. The new solution is

â_i = ( z'_ai (y - X b̂ - Z_p p̂ - W û) - k_a A^{i,-i} â_{-i} ) / (d_ii + a^{ii} k_a),

for d_ii = z'_ai z_ai, where k_a is the ratio of the latest sample values of the residual and additive genetic variances, i.e.

k_a = σ̂²_e / σ̂²_a,

a^{ii} is the diagonal element of A^{-1} for the i-th animal, and A^{i,-i} â_{-i} accumulates the off-diagonal elements of the i-th row of A^{-1} times the current solutions for all other animals (with and without records). Random variation is added as

â_i = â_i + RNORM(0, σ̂_e) / sqrt(d_ii + a^{ii} k_a).

• Permanent Environmental Effects. The new solution is

p̂_j = z'_pj (y - Z_a â_w - W û - X b̂) / (d_jj + k_p),

for d_jj = z'_pj z_pj, where k_p is the ratio of the latest sample values of the residual and PE variances, i.e.

k_p = σ̂²_e / σ̂²_p.

Random variation is added as

p̂_j = p̂_j + RNORM(0, σ̂_e) / sqrt(d_jj + k_p).
There are 4 variances to sample in this model, and the order in which they are calculated does not matter. The calculations usually follow after obtaining new sample values for the solutions to the MME.
• Contemporary Group Variance

σ̂²_u = û'û / CHI(6 + 2) = 1.801702 / CHI(8) = 0.17837,

where CHI(8) is a random Chi-square variate with 8 degrees of freedom.

• Permanent Environmental Variance

σ̂²_p = p̂'p̂ / CHI(12 + 2) = 35.87092 / CHI(14) = 2.535819,

where CHI(14) is a random Chi-square variate with 14 degrees of freedom.

• Residual Variance

σ̂²_e = ê'ê / CHI(22 + 2) = 1971.808 / CHI(24) = 92.50942,

where CHI(24) is a random Chi-square variate with 24 degrees of freedom.

• Animal Additive Genetic Variance

σ̂²_a = â'A^{-1}â / CHI(24 + 2) = 152.7832 / CHI(26) = 8.707745,

where CHI(26) is a random Chi-square variate with 26 degrees of freedom.
• New Variance Ratios for MME
ku =
=
=
kp =
=
=
ka =
=
=
σce2 /σcu2
92.50942/0.17837
518.64
σce2 /σcp2
92.50942/2.535819
36.48
σce2 /σca2
92.50942/8.707745
10.62.
The new variance ratios would be used to re-create the MME and obtain new solutions. The number of iterations (or samples) to generate should be very large. A plot of the sample values should be used to determine when the samples have converged to the joint posterior distribution. With this small numerical example, convergence is not likely to occur; a variance (such as the contemporary group variance) might converge towards zero. With small data sets, prior values for the variances should be used, with degrees of freedom equal to the number of records. The prior values are constants, based on best guesses about the possible true values of the variances.

Variance component estimation should be based on at least 5,000 observations, but this depends on the model and the number of levels of the random factors. In some circumstances, 10,000 or more observations may be necessary. At that data set size, the use of prior values becomes less necessary.
6.2 Cumulative Environmental Approach
“Permanent” implies stability and a constant presence. A better proposition is that new environmental effects appear over time as the animal ages, or as the animal gains experience with the events being recorded, and these effects are therefore cumulative. Suppose an animal makes 3 records over 3 years. The environmental effect E1, which affects record 1, also influences records 2 and 3. A new environmental effect, E2, affects record 2 and any later records, but does not affect record 1, because E2 occurred after record 1 was made. Similarly, another new environmental effect, E3, affects record 3, but not records 1 and 2, because E3 occurred after records 1 and 2 were made.
Conceivably, the variance of the environmental effects could differ for each record, but the covariances between environmental effects affecting different records would be assumed to be zero. Hence the effect of E1 on record 1 is independent of the effect of E2 on record 2, and of the effect of E3 on record 3, and so on. However, the total environmental variance for second records would be the sum of the variances of the environmental effects on records 1 and 2.

Possibly E1 could be negative and E2 positive, cancelling each other out; or E1 and E2 could both be negative (or positive), and thus add up in their influence on record 2. Perhaps also the variance of E2 could be smaller than the variance of E1, so that, in general, the variability of environmental effects becomes smaller as animals age. Or the variances could become larger with age. There have been no studies of cumulative environmental effects in any species to determine which might be the true state of nature.
104
CHAPTER 6. REPEATED RECORDS ANIMAL MODEL
6.2.1 A Model
The cumulative environmental repeated records (CE) model can be written as

y = Xb + \begin{pmatrix} 0 & Z_a \end{pmatrix} \begin{pmatrix} a_0 \\ a_r \end{pmatrix} + Z_p p + e,

where

b = vector of fixed effects,
a_0 = vector of additive genetic effects of animals without records,
a_r = vector of additive genetic effects of animals with records,
p = vector of PE effects of length equal to the number of records, and
e = vector of residual effects.
The matrices X and Za are design matrices that associate observations to
particular levels of fixed effects and to additive genetic effects, respectively, as
described earlier. In a CE repeated records model, Zp is not a typical design
matrix. Rows of Zp may have more than a single 1. If an animal has two
records, then in the row for the second record there will be two 1’s. If an
animal has four records, then in the row for the first record there will be one
1, for the second record two 1’s, for the third record three 1’s, and for the
fourth record four 1’s. Using the example data in this chapter, the Zp matrix
for animal 7 would appear as

Z_p p_7 = \begin{pmatrix} \cdots & 1 & 0 & 0 & \cdots \\ \cdots & 1 & 1 & 0 & \cdots \\ \cdots & 1 & 1 & 1 & \cdots \end{pmatrix} \begin{pmatrix} \vdots \\ p_{71} \\ p_{72} \\ p_{73} \\ \vdots \end{pmatrix}.
Thus, p71 contributes to records 1, 2, and 3 of animal 7, while p72 contributes
to records 2 and 3 of animal 7. Note that there are as many environmental
effects to be estimated as there are records. Also, for animal 7,

Z_p' Z_p = \begin{pmatrix} \cdots & 3 & 2 & 1 & \cdots \\ \cdots & 2 & 2 & 1 & \cdots \\ \cdots & 1 & 1 & 1 & \cdots \end{pmatrix}.
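For software written from scratch, the within-animal block of Z_p can be generated directly from the animal's number of records. A minimal R sketch (the function name buildZp is hypothetical):

```r
# Build the within-animal block of Zp for an animal with n records.
# Row i has 1's in columns 1 to i, so effect p_j enters records j, j+1, ..., n.
buildZp = function(n){
  Z = matrix(0, nrow=n, ncol=n)
  for(i in 1:n) Z[i, 1:i] = 1
  return(Z)
}
Z3 = buildZp(3)
t(Z3) %*% Z3  # reproduces the 3 by 3 block shown for animal 7
```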
Because of the special form of Zp, the CE effects can be estimated separately
from temporary environmental effects, and from the additive genetic effects
for animals, which rely on the additive genetic relationship matrix. Typical
software such as ASREML, VCE, or DMU cannot presently be used for a CE
repeated records model; users need to write their own software.
Also,

a | A, σ_a^2 ~ N(0, A σ_a^2),
p | P, σ_p^2 ~ N(0, P σ_p^2),
e ~ N(0, I σ_e^2),

G = \begin{pmatrix} A σ_a^2 & 0 \\ 0 & P σ_p^2 \end{pmatrix}.
The matrix P is assumed to be a diagonal matrix with 1’s on the diagonal
for the first records made by every animal. For second records, one might
suspect that the variance of permanent environmental effects might be less
than that for first records, and so the diagonals for second records may be
reduced. Similarly, the variance of PE effects for third and later records may
also be reduced further. The point is that an allowance must be made for the
variances of PE effects to be different depending on the record number for that
animal.
The residual variance is assumed to be the same for all records, but this
too could vary. The genetic variance is assumed to be constant and the genetic
correlation between records is still assumed to be unity.
The variance of three records, for example, on one non-inbred animal
would be

Var \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} σ_a^2 + \begin{pmatrix} σ_{p1}^2 & σ_{p1}^2 & σ_{p1}^2 \\ σ_{p1}^2 & (σ_{p1}^2 + σ_{p2}^2) & (σ_{p1}^2 + σ_{p2}^2) \\ σ_{p1}^2 & (σ_{p1}^2 + σ_{p2}^2) & (σ_{p1}^2 + σ_{p2}^2 + σ_{p3}^2) \end{pmatrix} + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} σ_e^2.
This implies that the variance of repeated records is getting larger over time.
6.2.2 Example
The data for the repeated records example from earlier in this chapter were analyzed with the CE repeated records model. Let

σ_u^2 = 20,
σ_a^2 = 36,
σ_e^2 = 100,
σ_{p1}^2 = 16,
σ_{p2}^2 = 14, and
σ_{p3}^2 = 13.
The mixed model equations are of order 55 (i.e., 3 years, 6 contemporary
groups, 24 animal additive genetic effects, and 22 CE effects). The solutions to the
mixed model equations are given below.
Table 6.3
Solutions to the CE repeated records animal model analysis.

Year effects: Yr 1 = 52.9257, Yr 2 = 57.5993, Yr 3 = 75.1308.
Contemporary groups: 1 = -0.4818, 2 = -0.6824, 3 = -0.01028,
4 = 0.4818, 5 = 0.6824, 6 = 0.01028.

Animal     EBV    inverse
   1    1.2225     0.2970
   2    2.0331     0.3192
   3    0.4117     0.3200
   4   -3.4004     0.3146
   5   -3.1780     0.3201
   6    0.3103     0.3245
   7    1.8534     0.2438
   8   -3.8388     0.2580
   9   -3.8798     0.2591
  10   -2.1448     0.2633
  11    3.1172     0.2895
  12    3.4352     0.2812
  13   -1.7051     0.3170
  14   -0.4867     0.3179
  15    3.2488     0.3246
  16    0.7993     0.3196
  17    3.2710     0.3339
  18   -2.5266     0.3335
  19    0.6133     0.2451
  20    0.3685     0.2586
  21    6.5309     0.2599
  22   -4.6424     0.2572
  23   -2.5643     0.2866
  24    2.0019     0.2892

CE effect solutions (CE1, CE2, CE3):
0.2006, -2.0840, -2.1742, -1.8548, -0.4905, -1.1872, -0.9386, -0.8678,
-0.0046, 2.4500, 1.6066, 0.2181, 0.7302, 2.9075, 0.0843, 1.5498,
-2.2458, -0.6507, 1.1439, 1.4025, -2.6100, -0.0197.
Most probable producing abilities (MPPA) are predictions of how animals
might perform if they made another record, and are based on the genetic and
CE estimates (Lush, 1933). In the CE repeated records model, to predict a
fourth record for animal 7, for example, the prediction would be
MPPA = â7 + p̂71 + p̂72 + p̂73 + p̂future,
where the prediction of the future CE effect would be 0, the mean of the
distribution from which a future CE effect would be sampled. The variance
of prediction error of MPPA would be increased by the variance of the future
CE effects.
Similarly, repeatability would vary according to record number. Assuming
the new CE effects are not correlated with previous CE effects, repeatability would increase with record number because the CE variances are
additive. If CE effects for second records, for example, depend on the CE
effects for first records, then those covariances (positive or negative) may
decrease repeatability with record number. Without an analysis of real data
and estimates of the CE variances, these comments are only speculative, and
the results could vary depending on species and traits.
6.2.3 Estimation of Variances
A Bayesian approach using Gibbs sampling could be used to estimate the PE
variances, one for each record number, just as one estimates different residual
variances for different years or herds.
• Contemporary Group Variance

\hat{σ}_{cg}^2 = \hat{g}' \hat{g} / CHI(6 + 2)
             = 1.377382 / CHI(8)
             = 0.139131

• Residual Variance

\hat{σ}_e^2 = 1738.198 / CHI(22 + 2)
            = 94.50796

• Additive Genetic Variance

\hat{σ}_a^2 = \hat{a}' A^{-1} \hat{a} / CHI(24 + 2)
            = 141.5144 / CHI(26)
            = 6.0337.
• Cumulative Environmental Variances

There are three CE variances to estimate.

\hat{σ}_{p1}^2 = \hat{p}_1' \hat{p}_1 / CHI(12 + 2)
              = 33.07665 / CHI(14)
              = 2.169933.

\hat{σ}_{p2}^2 = \hat{p}_2' \hat{p}_2 / CHI(8 + 2)
              = 15.62206 / CHI(10)
              = 2.529656.

\hat{σ}_{p3}^2 = \hat{p}_3' \hat{p}_3 / CHI(2 + 2)
              = 2.718043 / CHI(4)
              = 0.3556115.
Obviously, there are only two third records on animals, so this
estimate is not very accurate. Because this is a single Gibbs sample, these
new values will be used to reconstruct the MME, and another set of
samples will be generated. However, the example is too small to obtain
valid parameter estimates.
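Each draw above follows the same pattern: a sum of squares divided by a Chi-square deviate with (number of effects + 2) degrees of freedom. A sketch in R, with p1hat standing in for the twelve sampled first-record CE solutions (illustrative values only):

```r
set.seed(1)
p1hat = rnorm(12, 0, 4)       # stand-in for 12 sampled first-record CE solutions
ss = sum(p1hat^2)             # corresponds to p1' p1 in the text
df = length(p1hat) + 2        # 12 + 2 = 14
sp1new = ss / rchisq(1, df)   # one Gibbs sample of sigma^2_p1
```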
6.3 Alternative Models
Multiple Trait Model

There are other alternative models for repeated records. If each observation
on an animal is considered a different trait that is not perfectly genetically
correlated, then a multiple trait animal model could be used, in which permanent environmental effects are confounded with temporary environmental
effects and, therefore, estimated jointly. If the observations are truly perfectly
genetically correlated, then the multiple trait approach may have difficulty
estimating a genetic correlation of 1 (Van Vleck and Gregory, 1992).
Random Regression Model

Another possibility is to consider a random regression model where the observations are functions of the age of the animal and follow a trajectory, such
as a lactation curve. Test day models include random regression coefficients
for permanent environmental effects such that their variability differs over the
course of a lactation. However, if animals do not make very many records,
then there could be estimation problems.
Autoregressive Model
The random regression model suggests that permanent environmental effects
close together in time are more similar in magnitude than PE effects farther
apart in time. Thus, an autoregressive correlation model might be assumed.
The model for autocorrelated CE effects can be written as

y = Xb + \begin{pmatrix} 0 & Z_a \end{pmatrix} \begin{pmatrix} a_0 \\ a_r \end{pmatrix} + Z_p p + e,

where

b = vector of fixed effects,
a_0 = vector of additive genetic effects of animals without records,
a_r = vector of additive genetic effects of animals with records,
p = vector of CE effects of length equal to the number of records, and
e = vector of residual effects.
In this model, Zp is an identity matrix within animals, so that the CE effects
for each record are separated, but

P = \begin{pmatrix} 1 & ρ & ρ^2 & \cdots \\ ρ & 1 & ρ & \cdots \\ ρ^2 & ρ & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} σ_p^2.
Chapter 7
Multiple Trait Models

7.1 Introduction
Livestock are often assessed for productive performance, reproductive performance, health, and economic efficiency. Within each area there may be several
traits of importance. For example, in beef cattle there is calving ease of the
calf, survival to 4 days, birthweight, weaning weight, yearling weight and feed
intake that can be grouped under productive performance. Within the reproductive area there could be calving ease of the cow, age at first calving,
number of services to conception, conception rate, and others. Health covers
general immune response, but also susceptibility to many diseases and parasites. Economic efficiency could include conformation traits and behaviour
traits. The point is that there are multiple traits that need to be evaluated for
each animal, and usually evaluations for these traits are combined into one or
more indexes depending on the breeding goals of the producer.
A multiple trait (MT) model is one in which two or more traits are analyzed simultaneously in order to take advantage of genetic and environmental
correlations between traits. Not all animals are measured for all traits, and
thus, some improvement in accuracy for all traits can be gained by analyzing
them together.
MT models are useful for traits where the difference between genetic and
residual correlations is large (e.g., greater than 0.5) or where one
trait has a much higher heritability than the other traits. EBVs for traits with
low heritability tend to gain more in accuracy than EBVs for traits with high
heritability, although all traits benefit to some degree from the simultaneous
analysis.
Another use of MT models is for traits that occur at different times in the
life of the animal, such that culling of animals results in fewer observations
for traits that occur later in life compared to those at the start.
Consequently, animals which have observations later in life tend to have been
selected based on their performance for earlier traits. Thus, analysis of later
life traits by themselves could suffer from the effects of culling bias, and the
resulting EBVs could lead to errors in selecting future parents. An MT analysis
that includes observations on all animals upon which culling decisions have
been based has been shown to account, to some degree, for the selection that
has taken place. That is, there needs to be some way to estimate the differences
in first records between selected and non-selected individuals, to account for
the correlated difference that might have been observed in second records had
all animals made a second record. However, if the culling is intense, then an
MT model may still give biased results.

Correlations: If the correlations (covariance matrices) that are used in an
MT analysis are inaccurate, then EBVs resulting from MT analyses could give
erroneous rankings of animals for some or all traits. However, with Bayesian
methods of covariance component estimation, appropriate parameters can be
obtained readily, so that there should be few problems in ranking animals
correctly.
7.2 Data
Multiple trait analyses will be demonstrated with a small example involving
three traits. All animals were observed for trait 1, but not all were observed
for traits 2 and 3. The data are given in the table below.
Table 7.1
DIM = days in milk group (3)
YS = year-season (1 year, 2 seasons)
HYS = herd-year-season (4)
A zero indicates missing information.

Animal  Sire  Dam  Age  DIM  YS  HYS   Trait 1  Trait 2  Trait 3
  10     1     5    1    1    1   1      29.3      52      137
  11     2     6    2    1    1   1      30.9       0      151
  12     3     7    1    2    1   1      27.4      54        0
  13     4     8    3    3    1   2       3.1      55      195
  14     1     9    2    1    1   2      26.6      46       82
  15     2     5    1    3    1   2       3.7      37      202
  16     3     6    3    2    1   2      30.4      53      167
  17     3     8    3    1    2   3      35.2       0        0
  18     4     7    1    2    2   3      33.2      66      149
  19     3    10    2    3    2   3       5.0      58      173
  20     2    12    1    1    2   3      33.8      61      175
  21     2    13    2    2    2   4      38.8      56      152
  22     1    11    3    1    2   4      38.1       0      194
  23     3    14    2    3    2   4      17.6      53      237
  24     4    19    1    2    2   4      36.8      36        0
  25     2    10    3    3    2   4      10.3      40      148
None of the animals was inbred. Animals 1 to 9 do not have any
observations and have unknown parents. The same model is assumed for all three
traits, namely,

y = YS (year-seasons) + Age (age groups) + DG (DIM groups) + HYS + a + e.
Assume the following covariance matrices.

R = \begin{pmatrix} 11 & 5 & 29 \\ 5 & 97 & 63 \\ 29 & 63 & 944 \end{pmatrix},

for the residual effects,

G = \begin{pmatrix} 1.0 & 1.6 & -6.0 \\ 1.6 & 14.3 & -12.0 \\ -6.0 & -12.0 & 803 \end{pmatrix},

for the additive genetic effects, and

U = \begin{pmatrix} 2.0 & 3.1 & -10.0 \\ 3.1 & 29.0 & -30.0 \\ -10.0 & -30.0 & 700 \end{pmatrix},

for random HYS effects. The corresponding correlation matrices are

Cor(R) = \begin{pmatrix} 1.000 & 0.153 & 0.285 \\ 0.153 & 1.000 & 0.208 \\ 0.285 & 0.208 & 1.000 \end{pmatrix},

Cor(G) = \begin{pmatrix} 1.000 & 0.423 & -0.212 \\ 0.423 & 1.000 & -0.112 \\ -0.212 & -0.112 & 1.000 \end{pmatrix},

and

Cor(U) = \begin{pmatrix} 1.000 & 0.407 & -0.267 \\ 0.407 & 1.000 & -0.211 \\ -0.267 & -0.211 & 1.000 \end{pmatrix}.
The heritabilities of the traits were 0.07, 0.10, and 0.33, respectively.
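The heritabilities can be verified in R from the diagonals of G, U, and R, taking the phenotypic variance of each trait as the sum of its genetic, HYS, and residual variances:

```r
G = matrix(c(1.0, 1.6, -6, 1.6, 14.3, -12, -6, -12, 803), byrow=TRUE, nrow=3)
U = matrix(c(2.0, 3.1, -10, 3.1, 29, -30, -10, -30, 700), byrow=TRUE, nrow=3)
R = matrix(c(11, 5, 29, 5, 97, 63, 29, 63, 944), byrow=TRUE, nrow=3)
h2 = diag(G) / (diag(G) + diag(U) + diag(R))  # heritability per trait
round(h2, 2)  # 0.07 0.10 0.33
```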
7.3 Gibbs Sampling
The problem with multiple trait examples is that
• there are too many equations to show,
• the equations involve inverses of R which contain decimal numbers, and
• there are several ways to construct the equations.
Hence R-code will be used for this example.
The first step is to enter the data.
obs=matrix(data=c(29.3,52,137, 30.9,0,151,
27.4,54,0, 3.1,55,195, 26.6,46,82,
3.7,37,202, 30.4, 53,167, 35.2,0,0,
33.2,66,149, 5.0,58,173, 33.8,61,175,
38.8,56,152, 38.1,0,194, 17.6, 53, 237,
36.8, 36,0, 10.3,40,148),byrow=TRUE,ncol=3)
Next, enter the pedigrees and calculate A^{-1}.
anwr=c(10:25) # with records
anim=c(1:25)
sir=c(rep(0,9),1,2,3,4,1,2,3,3,
4,3,2,2,1,3,4,2)
dam=c(rep(0,9),5,6,7,8,9,5,6,8,
7,10,12,13,11,14,19,10)
length(sir)-length(dam) # check: should be 0
bi=c(rep(1,9),rep(0.5,16))
AI = AINV(sir,dam,bi)
Assign each animal (with records) to appropriate age, days in milk, year-season, and HYS levels. The misc variable identifies which residual matrix
inverse should be used, where 1 indicates all traits observed, 2 indicates trait
2 is missing, 3 indicates trait 3 is missing, and 4 indicates traits 2 and 3 are
missing. In this example, all animals were required to have trait 1 present,
which may happen in some situations.
dimid=c(1,1,2,3,1,3,2,1,2,3,1,2,1,3,2,3)
ndim=3
agid=c(1,2,1,3,2,1,3,3,1,2,1,2,3,2,1,3)
nage=3
ysid=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
nys=2
cgid=c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4)
ncg=4
misc=c(1,2,3,1,1,1,1,4,1,1,1,1,2,1,3,1)
ntr=3
Different R inverses are needed depending on which traits are observed,
as indicated in misc.
RI=array(data=c(0),dim=c(3,3,4))
RI[ , ,1]=ginv(R) # all present
ka = c(1,3)
# trait 2 missing
work=ginv(R[ka,ka])
RI[ka,ka,2]=work
ka=c(1,2)
# trait 3 missing
work=ginv(R[ka,ka])
RI[ka,ka,3]=work
ka=c(1)
# traits 2 and 3 missing
work=ginv(R[ka,ka])
RI[ka,ka,4]=work
noa=length(anwr)
Finally, initialize the solution vectors to zero, and begin Gibbs sampling.
nam=length(anim)
GI=ginv(G)
UI=ginv(U)
yssn=matrix(data=c(0),nrow=nys,ncol=ntr)
cgsn=matrix(data=c(0),nrow=ncg,ncol=ntr)
dmsn=matrix(data=c(0),nrow=ndim,ncol=ntr)
agsn=matrix(data=c(0),nrow=nage,ncol=ntr)
ansn=matrix(data=c(0),nrow=nam,ncol=ntr)
7.3.1 Year-Season Effects
Adjust the observations for all factors in the model, except year-season effects,
solve, add some sampling noise, and go to the next factor.
WY=yssn*0
WW=array(data=c(0),dim=c(3,3,nys))
for(i in 1:noa){
yobs=obs[i, ]
mss=misc[i]
yobs=yobs-agsn[agid[i], ]-dmsn[dimid[i], ]-
     ansn[anwr[i], ]-cgsn[cgid[i], ]
ma=ysid[i]
WY[ma, ]=WY[ma, ]+(RI[ , ,mss] %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI[ , ,mss]
}
for(k in 1:nys){
work=WW[ , ,k]
C = ginv(work)
yssn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(ntr,0,1)
yssn[k, ]=yssn[k, ]+(CD %*% ve)
}
7.3.2 Age Effects
Similar to the year-season statements above.
WY = agsn*0
WW = array(data=c(0),dim=c(3,3,nage))
for(i in 1:noa){
yobs = obs[i, ]
mss = misc[i]
yobs=yobs-yssn[ysid[i], ]-dmsn[dimid[i], ]-
     ansn[anwr[i], ]-cgsn[cgid[i], ]
ma = agid[i]
WY[ma, ]=WY[ma, ]+(RI[ , ,mss] %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI[ , ,mss]
}
for(k in 1:nage){
work=WW[ , ,k]
C = ginv(work)
agsn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(ntr,0,1)
agsn[k, ]=agsn[k, ]+(CD %*% ve)
}
7.3.3 Days in Milk Group Effects
Another similar section for days in milk group effects.
WY = dmsn*0
WW = array(data=c(0),dim=c(3,3,ndim))
for(i in 1:noa){
yobs = obs[i, ]
mss = misc[i]
yobs=yobs-yssn[ysid[i], ]-agsn[agid[i], ]-
     ansn[anwr[i], ]-cgsn[cgid[i], ]
ma = dimid[i]
WY[ma, ]=WY[ma, ]+(RI[ , ,mss] %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI[ , ,mss]
}
for(k in 1:ndim){
work=WW[ , ,k]
C = ginv(work)
dmsn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(ntr,0,1)
dmsn[k, ]=dmsn[k, ]+(CD %*% ve)
}
7.3.4 HYS Effects
HYS is a random factor in the model, and therefore, the R code is a little
different, and there is a need to obtain a new sample for the covariance matrix,
U.
WY = cgsn*0
WW = array(data=c(0),dim=c(3,3,ncg))
for(i in 1:noa){
yobs = obs[i, ]
mss = misc[i]
yobs=yobs-yssn[ysid[i], ]-agsn[agid[i], ]-
     ansn[anwr[i], ]-dmsn[dimid[i], ]
ma = cgid[i]
WY[ma, ]=WY[ma, ]+(RI[ , ,mss] %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI[ , ,mss]
}
for(k in 1:ncg){
work=WW[ , ,k] + UI # Note addition of U-inverse
C = ginv(work)
cgsn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(ntr,0,1)
cgsn[k, ]=cgsn[k, ]+(CD %*% ve)
}
Now the additional part for estimating the covariance matrix.
sss = t(cgsn)%*%cgsn
ndf = ncg+2 # = 6
x2 = rchisq(1,ndf)
Un = sss/x2 # ntr by ntr
UI = ginv(Un)
7.3.5 Additive Genetic Effects
This part differs from the previous R-scripts in that the relationship matrix
must be considered in the calculations. All animals are therefore estimated
simultaneously.
no=nam*ntr
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:noa){
yobs = obs[i, ]
mss = misc[i]
yobs=yobs-yssn[ysid[i], ]-agsn[agid[i], ]-
     cgsn[cgid[i], ]-dmsn[dimid[i], ]
ka=(anwr[i]-1)*ntr + 1
kb = ka + 2
kc = c(ka:kb)
WY[kc, ]=WY[kc, ]+(RI[ , ,mss] %*% yobs)
WW[kc,kc]=WW[kc,kc]+RI[ , ,mss]
}
HI = AI %x% GI # Direct (Kronecker) product
work = WW + HI # 75 by 75
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn +(CD %*% ve)
# transform to new dimensions
ansn = matrix(data=wsn,byrow=TRUE,ncol=3)
# 25 by 3
For the genetic covariance matrix, then
v1 = AI %*% ansn
sss = t(ansn)%*%v1
ndf = nam+2
x2 = rchisq(1,ndf)
Gn = sss/x2
GI = ginv(Gn)
7.3.6 Residual Covariances
Adjust observations for all factors in the model, estimate the missing residuals,
then calculate the sum of squares and cross-products of residuals.
sss=matrix(data=c(0),nrow=ntr,ncol=ntr)
for(i in 1:noa){
yobs = obs[i, ]
mss = misc[i]
yobs=yobs-yssn[ysid[i], ]-agsn[agid[i], ]-
     ansn[anwr[i], ]-cgsn[cgid[i], ]-dmsn[dimid[i], ]
rwrk = RI[ , ,mss]
bhat = matrix(data=yobs,ncol=1)
ehat = R %*% rwrk %*% bhat # Must estimate missing residuals
sss = sss + (ehat%*%t(ehat))
}
ndf = noa+2
x2 = rchisq(1,ndf)
Rn = sss/x2 # ntr by ntr
The 4 R-inverses need to be re-created using the new Rn. This completes
one round of Gibbs sampling. Now go back to the year-season effects and start
the next sample.
The order in which factors are processed is not critical, but each factor
should be processed once within each sampling round. A suggestion would be
to put the factor with the smallest number of levels first, which would therefore
have the largest number of observations per level. The samples for this factor
are not likely to vary much from sample to sample.

The animal additive genetic effects could go last within a sampling round because
the relationship matrix inverse is involved, and because animals usually have only one
set of observations, and hence their samples can vary from round to round.
Animals receive evaluations for all traits even if they have not been observed for those traits. This is possible because of the additive genetic relationship matrix and the genetic correlations among the traits. This leads to some
political problems. For example, a sheep producer may not take ultrasound
measures on backfat and lean yield on his lambs due to the costs. However,
his data on other traits is included in the multiple trait genetic evaluation system, and his animals receive an evaluation for the ultrasound traits. Should
the producer receive evaluations on ultrasound traits if they did not pay to
have the data collected on their animals? The answer in Canada is no, for two
reasons. One, those producers did not pay to collect ultrasound data, and two,
the accuracy of EBVs for the ultrasound traits in their flock would be low and
should probably not be used for making decisions in their flock.
7.4 Positive Definite Matrices
Covariance matrices must always be positive definite for multiple trait analyses. Failure to guarantee positive definiteness can lead to incorrect solutions
to mixed model equations, erroneous ranking of individuals, and negative diagonals in the inverse of the left hand sides.
Suppose you have four new traits and you obtained genetic and contemporary group covariance matrices from averaging reports from 20 different
bi-variate studies (two traits at a time). In constructing an order 4 covariance
matrix, the result may not be positive definite, as in the matrix below.
Let

G = \begin{pmatrix} 100 & 80 & 20 & 6 \\ 80 & 50 & 10 & 2 \\ 20 & 10 & 6 & 1 \\ 6 & 2 & 1 & 1 \end{pmatrix}.
At first glance, this matrix might appear to be positive definite. However, the eigenvalues of G are

162.1627196, 4.1339019, 0.9171925, -10.2138140,

and therefore are not all positive (indeed, the implied correlation between the
first two traits, 80/(100 × 50)^{1/2} ≈ 1.13, exceeds unity). What can be done?
The following R-script can be used to force a symmetric matrix to be
positive definite.
makPD = function(A){
# Replace negative eigenvalues with small positive values,
# then rebuild A from its eigen-decomposition.
D = eigen(A)
sr = 0
nneg = 0
V = D$values
U = D$vectors
N = nrow(A)
for(k in 1:N){
if(V[k] < 0){
nneg = nneg + 1
sr = sr + V[k] + V[k] }
}
wr = (sr*sr*100) + 1
p = V[N - nneg] # smallest positive eigenvalue
for(m in 1:N){
if(V[m] < 0){
c = V[m]
V[m] = p*(sr-c)*(sr-c)/wr
} }
A = U %*% diag(V) %*% t(U)
return(A) }
Applying the routine to G gives a new matrix,

G = \begin{pmatrix} 103.615081 & 75.499246 & 18.377481 & 5.013141 \\ 75.499246 & 55.603411 & 12.020026 & 3.228634 \\ 18.377481 & 12.020026 & 6.728218 & 1.442922 \\ 5.013141 & 3.228634 & 1.442922 & 1.269397 \end{pmatrix}.
Hayes and Hill (1981) described a “bending” procedure which essentially
shrinks all of the eigenvalues towards the mean of the eigenvalues. This procedure is not recommended. On the other hand, one merely needs any matrix
that is positive definite to begin the Gibbs sampling process. Hopefully, the
samples will converge towards a better estimate of G.
7.5 Starting Matrices
Another quick technique to obtain starting matrices is to first calculate a
phenotypic covariance matrix using animals that have all traits observed,

P = y'(I - J/N)y/(N - 1),

then

G = 0.2 * P,
U = 0.5 * P,
R = P - G - U.
P is positive definite by the nature of how it was calculated, unless the number
of animals is less than the order of P. Then the only matrix that needs to be checked is
R. The constants 0.2 and 0.5 are guesses as to the fraction of the variance that is genetic
and the fraction due to contemporary groups.

This is a quick method of obtaining starting covariance matrices for the
Gibbs sampling process if nothing exists prior to the analysis. This approach
works reasonably well because genetic covariances tend to follow phenotypic
covariances.
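For the three traits of this chapter this could be carried out as follows (a sketch; Y holds the eleven animals that had all three traits observed):

```r
# Phenotypic covariance matrix from complete records, split into guessed pieces.
Y = matrix(c(29.3,52,137, 3.1,55,195, 26.6,46,82, 3.7,37,202,
             30.4,53,167, 33.2,66,149, 5.0,58,173, 33.8,61,175,
             38.8,56,152, 17.6,53,237, 10.3,40,148), byrow=TRUE, ncol=3)
N = nrow(Y)
J = matrix(1, N, N)
P = t(Y) %*% (diag(N) - J/N) %*% Y / (N - 1)  # same result as cov(Y)
G0 = 0.2 * P       # guessed genetic fraction
U0 = 0.5 * P       # guessed contemporary group fraction
R0 = P - G0 - U0   # remainder assigned to the residual
```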
Pulling variances and covariances from the literature does require checking the assembled matrices to guarantee they are positive definite.
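A quick positive definite check in R, applied to the order 4 matrix assembled from bivariate studies in Section 7.4:

```r
# A matrix is positive definite when all of its eigenvalues are positive.
isPD = function(A) all(eigen(A, symmetric=TRUE)$values > 0)
G = matrix(c(100,80,20,6, 80,50,10,2, 20,10,6,1, 6,2,1,1), byrow=TRUE, nrow=4)
isPD(G)       # FALSE -- one eigenvalue is negative
isPD(diag(3)) # TRUE
```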
Chapter 8
Maternal Traits

8.1 Introduction
Early growth in most livestock species relies on the mothering ability of the
female parent. Maternal ability is expressed by the female parent during gestation and after birth, and is measured by the success of the offspring to live
and grow (Willham, 1972). Embryo transfer allows offspring to be reared pre-
and post-birth by a female that is not the biological parent. In some species,
offspring may be moved to a female different from the birth mother to be
reared after birth. Thus, there are three types of dams that could potentially
have an effect on an animal's early growth and survival. These are

1. the genetic dam,

2. the dam that gives birth, or the birth dam, and

3. the dam that raises the individual after birth, or the raise dam,

and the last possibility is bottle-feeding or direct human feeding, which is not
genetic. See Figure 8.1.
[Figure 8.1: Three Possible Sources of Maternal Effects — a diagram showing the Sire and Genetic Dam producing the Embryo, which is carried by the Birth Dam, with the Lamb reared by the Raise Dam.]
The genetic dam provides the DNA to produce the new animal through the
ova, along with the sperm from the male. The ova include female mitochondrial
genes that do not mix with the male DNA, but are passed directly to the
offspring. The genetic dam provides the genes for growth, embryo survival,
and all traits.

The birth dam contributes no genetic material to the animal, but gives
birth to the animal; an embryo transfer (ET) has occurred. Usually embryos
are put into females that are expected to give lots of milk (food) to the offspring, or which have already given birth to previous young, so that parturition
problems are not anticipated. This female may be of a different breed than the
embryo, and is possibly not considered a valuable individual. The purpose of the
birth dam is to carry the embryo through pregnancy to birth. The birth dam
provides a uterine environment, blood oxygen and nutrients during embryo
development, plus the experience of the parturition event itself. Birth dams
affect traits like calving ease of the calf, early survival, and birthweights.

The raise dam cares for the young animal after birth. This is the training
and education environment, plus milk yield and protection from predators.
Sometimes, depending on the number born or the welfare of the birth dam,
young animals will be cross fostered to other lactating females with one or
no young themselves. Raise dams affect traits after birth, such as weaning
weights.
In the majority of situations, the three dams are the same individual.
The female provides the DNA, carries the embryo through pregnancy
and gives birth, then raises the young up to weaning age. In the literature,
models for dealing with maternal effects have assumed that the three dams are
one individual. However, in beef cattle and sheep production, there is enough
embryo transfer and/or cross fostering of young animals to warrant
considering models that accommodate three types of dams.
8.2 Data
The following table contains data on 66 animals of which only 26 have both
birthweight (BW) and weaning weight (WW). Animals 1 to 40 have unknown
parents.
130
Lamb
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
CHAPTER 8. MATERNAL TRAITS
Table 8.1
Example data for maternal genetic effects.
YM = year-month of lambing,
FYM = Flock-year-month of lambing
BW = birthweight, WW = weaning weight (50 days).
Ram Genetic Birth Raise Gender YM FYM BW WW
dam
dam dam
1
6
6
6
1
1
1
2.5
7.8
1
7
17
17
1
1
1
3.4 15.4
2
8
8
8
2
1
1
4.1 26.0
2
9
18
19
1
1
1
3.3 10.8
3
10
10
10
1
1
1
1.8 10.9
3
11
11
11
1
2
2
2.5 12.2
4
12
20
20
1
2
2
2.1 11.1
4
13
13
21
2
2
2
3.0 16.6
5
14
14
14
2
2
2
3.7 17.7
5
15
15
15
1
2
2
2.2 10.2
5
16
16
16
2
2
2
2.9 16.9
1
22
22
22
1
1
3
3.5 11.8
1
22
22
22
2
1
3
4.2 16.9
1
23
24
24
2
1
3
6.2 31.2
2
25
25
36
2
1
3
5.7 30.1
3
26
26
26
2
2
4
5.4 25.4
3
27
37
38
1
2
4
4.5 16.5
3
28
28
28
2
2
4
2.7 20.7
4
29
29
29
2
2
4
4.0 20.6
5
30
30
30
1
2
4
3.6 20.6
1
31
31
32
1
2
4
4.3 23.6
1
31
31
31
1
1
5
2.9 22.1
1
6
39
39
2
1
5
2.7 17.4
4
7
7
7
1
1
5
3.5 22.3
4
8
8
40
1
1
5
2.6 21.9
4
8
8
8
2
1
5
3.5 15.8
Lambs 42, 44, 47, 54, 57, and 63 had different birth dams from their
genetic dams, and lambs 44, 48, 55, 57, 61, and 65 had different raise dams
from their birth dams. The remaining lambs had the same ewe as their genetic
dam, birth dam, and raise dam.
All lambs have both BW and WW observations. Lambs 52 and 53, 61
and 62, and 65 and 66 were sets of twin lambs.
8.3 A Model
The same model equation will be used for BW and WW. Let

y_t = X b_t + W f_t + Z_a a_t + Z_m m_t + Z_p p_t + e_t,
where
bt is a fixed vector containing year-month lambing effects for trait t, and
gender effects on trait t.
ft is a random vector of flock-year-month of lambing effects for trait t,
at is a random vector of animal additive genetic effects for trait t,
mt is a random vector of maternal genetic effects of the birth dam on BW,
and for the raise dam on WW,
pt is a random vector of maternal permanent environmental effects of the
birth dam on BW, and for the raise dam on WW, and
et is a random vector of residual effects for trait t.
X, W, Z_a, Z_m, and Z_p are the corresponding design matrices. The expected
values of the random vectors were null, and the covariance matrices
for the two traits are as follows:

G = \begin{pmatrix} 0.2003 & 0.5381 \\ 0.5381 & 3.2623 \end{pmatrix},

for the additive genetic covariance matrix,

F = \begin{pmatrix} 0.1822 & 0.1561 \\ 0.1561 & 7.4783 \end{pmatrix},

for the flock-year-month of lambing covariance matrix,

M = \begin{pmatrix} 0.1391 & 0.5209 \\ 0.5209 & 3.2945 \end{pmatrix},

for the maternal genetic covariance matrix,

P = \begin{pmatrix} 0.0301 & 0.0019 \\ 0.0019 & 1.5049 \end{pmatrix},

for the maternal permanent environmental covariance matrix, and

R = \begin{pmatrix} 0.1826 & 0.0336 \\ 0.0336 & 6.9517 \end{pmatrix},

for the residual covariance matrix.
8.3.1 Data Structure
Usually there is a direct genetic by maternal genetic covariance matrix. That
is,

Var \begin{pmatrix} a \\ m \end{pmatrix} = \begin{pmatrix} A \otimes G & A \otimes C \\ A \otimes C' & A \otimes M \end{pmatrix} = H,
where C is non-null. In the above description of the covariance matrices, C
was assumed to be null.
The problem is the estimation of G, M, and C. Proper estimation of
C requires a correct and almost complete data structure. The correct data
structure is one in which every female has its own records for BW and WW
as a lamb, has records on its offspring's BW and WW, and has
female progeny that also have records on their offspring's BW and WW. Three
generations of female data should exist. The data structure is complete if all
females have three generations of data, except for the latest two generations,
which have not had the opportunity to have offspring yet (Heydarpour et al.,
2004?).
Generally, studies of maternal genetic effects do not have the correct data
structure, and consequently, the direct-maternal covariances have been estimated to be negative. The poorer the structure, the more negative the
estimates in C. This can be shown mathematically to occur if the data structure is not correct. The data structure for the example data in this chapter is
not correct, and therefore, should not be used to estimate C. When the data
structure is not suitable, the best course of action is to assume that C = 0.
If C has been estimated from data having a correct structure, then it may
be used as the true covariances in genetic evaluation, even for data that do
not have the correct structure.
Also needed are

R^{-1} = \begin{pmatrix} 5.48132623 & -0.02649317 \\ -0.02649317 & 0.14397776 \end{pmatrix},

F^{-1} = \begin{pmatrix} 5.5884151 & -0.1166511 \\ -0.1166511 & 0.1361552 \end{pmatrix}, and

P^{-1} = \begin{pmatrix} 33.22523926 & -0.04194827 \\ -0.04194827 & 0.66454894 \end{pmatrix}.

8.3.2 Assumptions
• The example is small and simple.
• The example data do not have the proper data structure for estimation
of covariance matrices.
• All animals have BW and WW.
• Age of dam effects are not important.
• Litter of the birth dam effects are not important, and neither are litter of the raise dam effects.
8.4 MME
The order of the MME for the small example is 334 because there are two
traits and 66 animals having both direct genetic and maternal genetic effects.
The example will be described through R-scripts.
8.4.1 Preparing the Data
Enter the pedigree data in vectors as follows:
aid=c(1:66)
sid=c(0*c(1:40),1,1,2,2,3,3,4,4,
5,5,5,1,1,1,2,3,3,
3,4,5,1,1,1,4,4,4)
gdam=c(0*c(1:40),6,7,8,9,10,11,12,
13,14,15,16,22,22,
23,25,26,27,28,29,30,31,31,6,7,8,8)
bii=c((0*c(1:40)+1),(0*c(41:66)+0.5))
AI=AINV(sid,gdam,bii)
dim(AI) # should be order 66
anwr=c(41:66) # animals with records
# Enter birth dams and raise dams
bdid=c(6,17,8,18,10,11,20,13,14,
15,16,22,22,24,25,26,37,28,29,30,
31,31,39,7,8,8)
rdid=c(6,17,8,19,10,11,20,21,14,
15,16,22,22,24,36,26,38,28,29,30,
32,31,39,7,40,8)
# Enter gender codes
gnid=c(1,1,2,1,1,1,1,2,2,1,2,1,2,
2,2,2,1,2,2,1,1,1,2,1,1,2)
# Enter observations
obs=matrix(data=c(25,78,34,154,41,260,
33,108,18,109,25,122,21,111,30,166,
37,177,22,102,29,169,35,118,42,169,
62,312,57,301,54,254,45,165,27,207,
40,206,36,206,43,236,29,221,27,174,
35,223,26,219,35,158),ncol=1)
obs=obs/10
# Year-month and flock-year-month classes
ymid=c(1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,2,
2,2,2,2,2,1,1,1,1,1)
cgid=c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,4,
4,4,4,4,4,5,5,5,5,5)
8.4.2 Covariance Matrices
This example does not include a direct by maternal genetic covariance matrix between the two traits; C is assumed to be null.
G=matrix(data=c(0.2003,0.5381,0.5381,
3.2623),byrow=TRUE,nrow=2)
R=matrix(data=c(0.1826,0.0336,0.0336,
6.9517),byrow=TRUE,nrow=2)
F=matrix(data=c(0.1822,0.1561,0.1561,
7.4783),byrow=TRUE,ncol=2)
PE=matrix(data=c(0.0301,0.0019,0.0019,
1.5049),byrow=TRUE,ncol=2)
MT=matrix(data=c(0.1391,0.5209,0.5209,
3.2945),byrow=TRUE,ncol=2)
Inverses are needed for each of the above matrices. Then initialize the
solution vectors.
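One way to obtain the inverses, assuming `ginv()` from the MASS package and continuing the script above (the object names follow this chapter's code):

```r
library(MASS)    # provides ginv()
GI  <- ginv(G)   # direct genetic
RI  <- ginv(R)   # residual
FI  <- ginv(F)   # flock-year-month
PEI <- ginv(PE)  # maternal permanent environment
MTI <- ginv(MT)  # maternal genetic
```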
ymsn=matrix(data=c(0),nrow=nym,ncol=ntr)
cgsn=matrix(data=c(0),nrow=ncg,ncol=ntr)
gdsn=matrix(data=c(0),nrow=2,ncol=ntr)
bdsn=matrix(data=c(0),nrow=nam,ncol=ntr)
rdsn=matrix(data=c(0),nrow=nam,ncol=ntr)
mpsn=matrix(data=c(0),nrow=nam,ncol=ntr)
ansn=matrix(data=c(0),nrow=nam,ncol=ntr)
The following scripts describe the Gibbs sampling process, where each factor is visited once per round. Data are adjusted for all other factors, new solutions are calculated, and random noise is added to the solutions before going to the next factor. If the factor is a random factor, then a covariance matrix is sampled too. At the end, a new residual covariance matrix is sampled.
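The per-factor update can be illustrated in isolation. A self-contained toy sketch in R for one factor with two levels and two traits (all numbers invented):

```r
library(MASS)                     # for ginv()
set.seed(1)
ntr <- 2
RI  <- matrix(c(5.48, -0.03, -0.03, 0.14), 2, 2)   # invented inverse residual
yadj <- matrix(c(2.5,  7.8,                         # records already adjusted
                 3.4, 15.4,                         # for all other factors
                 4.1, 26.0), ncol = 2, byrow = TRUE)
lev <- c(1, 1, 2)                 # level of this factor for each record
WY <- matrix(0, 2, ntr)
WW <- array(0, c(ntr, ntr, 2))
for (i in 1:nrow(yadj)) {
  k <- lev[i]
  WY[k, ] <- WY[k, ] + RI %*% yadj[i, ]   # accumulate right-hand sides
  WW[, , k] <- WW[, , k] + RI             # accumulate coefficient blocks
}
sol <- matrix(0, 2, ntr)
for (k in 1:2) {
  C <- ginv(WW[, , k])
  sol[k, ] <- C %*% WY[k, ]                         # conditional mean
  sol[k, ] <- sol[k, ] + t(chol(C)) %*% rnorm(ntr)  # add sampling noise
}
sol
```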
8.4.3 Year-Month Effects
WY=ymsn*0
WW=array(data=c(0),dim=c(ntr,ntr,nym))
for(i in 1:noa){
yobs=obs[i, ]
yobs=yobs-cgsn[cgid[i], ]-gdsn[gnid[i], ]-
ansn[anwr[i], ]-bdsn[bdid[i], ]-rdsn[rdid[i], ]-
mpsn[bdid[i], ]-mpsn[rdid[i], ]
ma=ymid[i]
WY[ma, ]=WY[ma, ]+(RI %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI
}
for(k in 1:nym){
work=WW[ , ,k]
C = ginv(work)
ymsn[k, ]=C %*% WY[k, ]
# add sampling noise
CD=t(chol(C))
ve = rnorm(ntr,0,1)
ymsn[k, ] =ymsn[k, ] +(CD %*% ve)
}
8.4.4 Gender Effects
WY=gdsn*0
WW=array(data=c(0),dim=c(2,2,2))
for(i in 1:noa){
yobs=obs[i, ]
yobs=yobs-cgsn[cgid[i], ]-ymsn[ymid[i], ]-
ansn[anwr[i], ]-bdsn[bdid[i], ]-rdsn[rdid[i], ]-
mpsn[bdid[i], ]-mpsn[rdid[i], ]
ma=gnid[i]
WY[ma, ]=WY[ma, ]+(RI %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI
}
ngend=2
for(k in 1:ngend){
work=WW[ , ,k]
C = ginv(work)
gdsn[k, ]=C %*% WY[k, ]
# add sampling noise
CD=t(chol(C))
ve = rnorm(ntr,0,1)
gdsn[k, ] =gdsn[k, ] +(CD %*% ve)
}
8.4.5 Flock-Year-Month Effects
WY=cgsn*0
WW=array(data=c(0),dim=c(2,2,ncg))
for(i in 1:noa){
yobs=obs[i, ]
yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]-
ansn[anwr[i], ]-bdsn[bdid[i], ]-rdsn[rdid[i], ]-
mpsn[bdid[i], ]-mpsn[rdid[i], ]
ma=cgid[i]
WY[ma, ]=WY[ma, ]+(RI %*% yobs)
WW[ , ,ma]=WW[ , ,ma]+RI
}
for(k in 1:ncg){
work=WW[ , ,k]+ FI
C = ginv(work)
cgsn[k, ]=C %*% WY[k, ]
# add sampling noise
CD=t(chol(C))
ve = rnorm(ntr,0,1)
cgsn[k, ] =cgsn[k, ] +(CD %*% ve)
}
# Estimate covariance matrix
sss = t(cgsn)%*%cgsn
ndf = ncg+2
x2 = rchisq(1,ndf)
Fn = sss/x2
FI = ginv(Fn)
8.4.6 Animal Additive Genetic Effects
no=nam*ntr
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:noa){
yobs = obs[i, ]
yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]-
cgsn[cgid[i], ]-bdsn[bdid[i], ]-rdsn[rdid[i], ]-
mpsn[bdid[i], ]-mpsn[rdid[i], ]
ka=(anwr[i]-1)*ntr + 1
kb = ka + 1
kc = c(ka:kb)
WY[kc, ]=WY[kc, ]+(RI %*% yobs)
WW[kc,kc]=WW[kc,kc]+RI
}
HI = AI %x% GI
work = WW + HI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn +(CD %*% ve)
# transform to new dimensions
ansn = matrix(data=wsn,byrow=TRUE,ncol=ntr)
# Estimate a new G and GI
v1 = AI %*% ansn
sss = t(ansn)%*%v1
ndf = nam+2
x2 = rchisq(1,ndf)
Gn = sss/x2
GI = ginv(Gn)
8.4.7 Maternal Genetic Effects
Maternal genetic effects differ between birth dams and raise dams, which need to be separated.
no=nam*ntr
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:noa){
yobs = obs[i, ]
yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]-
cgsn[cgid[i], ]-ansn[anwr[i], ]-
mpsn[bdid[i], ]-mpsn[rdid[i], ]
# birth dams
kc=(bdid[i]-1)*ntr + 1
wrk=RI %*% yobs
WY[kc,]=WY[kc,]+wrk[1,]
# Raise dams
ka =(rdid[i]-1)*ntr + 1
kb = ka+1
WY[kb,]=WY[kb,]+wrk[2,]
kd=c(kc,kb)
WW[kd,kd]=WW[kd,kd]+RI
}
Now get solutions and add sampling noise, then calculate new sample for
the maternal genetic covariance matrix.
HI = AI %x% MTI
work = WW + HI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn +(CD %*% ve)
# transform to new dimensions
matn = matrix(data=wsn,byrow=TRUE,ncol=ntr)
v1 = AI %*% matn
sss = t(matn)%*%v1
ndf = nam+2
x2 = rchisq(1,ndf)
MTn = sss/x2
MTI = ginv(MTn)
bdsn[ ,1]=matn[ ,1]
rdsn[ ,2]=matn[ ,2]
bdsn[ ,2]=0
rdsn[ ,1]=0
8.4.8 Maternal PE Effects
Birth dams and raise dams need to be separated again for this factor. There
are no additive genetic relationships among PE effects.
no=nam*ntr
nope=c(1:nam)*0
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:noa){
yobs = obs[i, ]
yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]-
cgsn[cgid[i], ]-ansn[anwr[i], ]-
bdsn[bdid[i], ]-rdsn[rdid[i], ]
kc=(bdid[i]-1)*ntr + 1
wrk=RI %*% yobs
WY[kc,]=WY[kc,]+wrk[1,]
ka =(rdid[i]-1)*ntr + 1
nope[rdid[i]]=1
nope[bdid[i]]=1
kb = ka+1
WY[kb,]=WY[kb,]+wrk[2,]
kd=c(kc,kb)
WW[kd,kd]=WW[kd,kd]+RI
}
HI = id(nam) %x% PEI # id(n) = identity matrix of order n
work = WW + HI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn +(CD %*% ve)
# transform to new dimensions
mpsn = matrix(data=wsn,byrow=TRUE,ncol=ntr)
Now obtain a new sample PE covariance matrix.
sss = t(mpsn)%*%mpsn
ndf=sum(nope)+2
x2 = rchisq(1,ndf)
PEn = sss/x2
PEI = ginv(PEn)
8.4.9 Residual Effects
Adjust observations for all factors in the model to estimate the residual effects. Then accumulate sum of squares and crossproducts for a new residual
covariance matrix.
no=nam*ntr
sss=id(ntr)*0
for(i in 1:noa){
yobs = obs[i, ]
yobs=yobs-gdsn[gnid[i], ]-ymsn[ymid[i], ]-
cgsn[cgid[i], ]-ansn[anwr[i], ]-
bdsn[bdid[i], ]-rdsn[rdid[i], ]-
mpsn[bdid[i], ]-mpsn[rdid[i], ]
wrk=matrix(data=yobs,nrow=ntr,ncol=1)
bhat=R %*% RI %*% wrk
sss = sss + bhat%*%t(bhat)
}
ndf=noa+2
x2=rchisq(1,ndf)
Rn=sss/x2
RI=ginv(Rn)
Rn
Repeat the process for more samples until convergence is achieved. Given the small number of observations, the Gibbs samples in this case will likely not converge to samples from the joint posterior distribution. Many more observations are needed.
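After the chain has run, the usual practice is to discard the burn-in samples and average the remainder. A toy sketch with invented stand-in samples for one variance component (not output from the chapter's scripts):

```r
set.seed(2019)
nrounds <- 5000
burnin  <- 1000
# invented stand-in for per-round Gibbs samples of one variance component
samples <- 0.20 + rnorm(nrounds, 0, 0.02)
post_mean <- mean(samples[(burnin + 1):nrounds])   # posterior mean estimate
round(post_mean, 2)   # close to 0.20
```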
Part II

Random Regression Analyses
Chapter 9

Longitudinal Data

9.1 Introduction
A simple example of longitudinal data is the weight of an animal taken at
different ages. Meat animals, like beef cattle, pigs, and sheep, are weighed two
or three times from birth to market age, generally at birth, at weaning, and
at market age. Weighing animals takes time and labour. Birth is always day
1, but weaning and market ages are not the same for every animal. Weights
get larger over time because animals grow, and the variance of weights also
increases with age.
Another example is the lactation yield of dairy cows, sheep, or goats.
Dairy animals are milked two or more times daily for up to a year after they give
birth. Typically, 24-h production increases shortly after the animal gives birth,
peaks at a few weeks after parturition, then slowly decreases until the animal
dries up in preparation for the next parturition. Milk recording programs send
supervisors to herds once a month or less frequently to weigh the milk and take
samples for lab analyses of content. Thus, an animal might give milk for over
300 days, but there might only be seven to ten supervised weighings during
that period. Herds with robotic milking machines can have daily weighings,
but not daily milk content measurements.
Traits measured at various times during the life of an animal are known as
longitudinal data. Because weights or yields occur at different ages or times,
they are not the same genetic trait. Weights at birth and weaning may have
a positive correlation, but it is less than unity. Milk weights at day 10 and
day 300 may also be correlated, but that correlation is much less than unity.
Thus, the weight of an animal at every day of life is a ‘different’ trait. Every
milk weight from the start of lactation to the end is a ‘different’ trait. There
is a continuum of points in time when an animal could be observed for a trait.
These traits have also been called infinitely dimensional traits.
Instead of age or time, observations could be based on degree of maturity
or weight. For example, fat content of an animal would change depending on an
animal’s weight or amount of feed ingested, regardless of age. The possibility
exists that a trait may depend on both age and weight.
In general, there is a starting point, tmin , e.g. birth or parturition, at
which observations start to be taken. The observations are made either at
specific intervals or at random intervals, and the number of observations can
vary from animal to animal. Then there is the end point, tmax , beyond which no
more observations are made, or beyond which they are not of interest. Each observation, y_{t_i}, has an associated time variable, t_i. For simplicity, the t_i are whole integers.
There could be a dozen or so points, or there could be 400 points. The number
depends on the trait and situation.
Orthogonal polynomials have been suggested for use with longitudinal data to model the shape of a growth curve or a lactation curve. The reason is that orthogonal polynomials are less correlated with each other than are simple polynomials of age. One simple type of orthogonal polynomial is the Legendre polynomial, discovered in 1797. To use Legendre polynomials or other kinds of orthogonal polynomials, the time values (whole integers) must be scaled to a range from -1 to +1. The
scaling formula is
\[
q_i = -1 + 2\left( \frac{t_i - t_{min}}{t_{max} - t_{min}} \right).
\]
The q_i are decimal numbers.
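For example, with five equally spaced times from 1 to 100 days (invented values), the scaling works out as:

```r
tmin <- 1; tmax <- 100
ti <- c(1, 25, 50, 75, 100)                 # whole-number times
qi <- -1 + 2 * (ti - tmin) / (tmax - tmin)  # scaled to [-1, +1]
round(qi, 4)   # -1.0000 -0.5152 -0.0101  0.4949  1.0000
```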
Plotting y_{t_i} against t_i (or against q_i) gives a shape that is called the trajectory. This could be a lactation curve, a growth curve, or an S-curve. The goal is to find a function that fits this trajectory as closely as possible and to study the amount of animal variation around the trajectory from t_min to t_max. This type of study involves covariance functions and random regression models.
Covariance functions help to predict the change in variation from tmin to
tmax for the population. Random regression models provide a way to estimate
covariance functions, and to determine individual differences in trajectories.
9.2 Collect Data
The first step in the study of longitudinal data is to collect data. To illustrate a few basic concepts, consider the following experiment. Two hundred female mice were sampled every hour after an injection of glucose to observe the change in blood insulin levels over the next nine hours. This gave a total of 1800 observations on two hundred unrelated individuals. A small sample of the data is shown in Table 9.1.
Table 9.1
Insulin levels in female mice.
        Time After Injection of Glucose, min
Mouse    60   120   180   240   300   360   420   480   540
1       11.9   9.7   8.7   4.5   5.3   1.9   2.3   1.6   1.0
2       12.9  10.0   7.5   3.3   1.7   2.3   2.3   2.1   0.5
3       12.2  10.0   6.0   4.2   4.4   2.7   2.2   2.9   0.2
4       12.6  10.1   9.5   5.9   5.8   3.4   0.9   0.5   0.7
5       12.7  10.5   8.2   5.4   4.7   2.1   1.9   3.2   0.2
A plot of all 200 mouse insulin decay trajectories is shown in Figure 9.1.

Figure 9.1
Insulin Decay Over Time (Insulin Levels versus Hours).
Note the general shape of the decay trajectory for all mice. Also, note the
variability that exists around the average curve (shown in red). Thus, mice
have different decay trajectories. We want to study the variability among mice.
Using data on the 200 mice, the covariance matrix of the insulin amounts at each hour (a 9 x 9 matrix) can be calculated as follows.
\[
\mathbf{V} = \begin{pmatrix}
0.8852 & 0.8352 & 0.6916 & 0.0350 & 0.1089 & -0.0050 & -0.0417 & -0.0476 & -0.0156 \\
0.8352 & 0.9574 & 0.4621 & 0.0399 & 0.1607 & -0.0281 & -0.0433 & -0.0203 & 0.0197 \\
0.6916 & 0.4621 & 1.4005 & 0.1154 & 0.1556 & 0.0270 & -0.1918 & -0.1106 & -0.0052 \\
0.0350 & 0.0399 & 0.1154 & 0.9331 & 0.0454 & -0.0723 & -0.0362 & -0.0030 & 0.0117 \\
0.1089 & 0.1607 & 0.1556 & 0.0454 & 0.7993 & 0.1204 & -0.0518 & -0.0414 & 0.0558 \\
-0.0050 & -0.0281 & 0.0270 & -0.0723 & 0.1204 & 0.6833 & 0.0015 & -0.0086 & -0.0211 \\
-0.0417 & -0.0433 & -0.1918 & -0.0362 & -0.0518 & 0.0015 & 0.5807 & 0.0474 & -0.0035 \\
-0.0476 & -0.0203 & -0.1106 & -0.0030 & -0.0414 & -0.0086 & 0.0474 & 0.4409 & -0.0291 \\
-0.0156 & 0.0197 & -0.0052 & 0.0117 & 0.0558 & -0.0211 & -0.0035 & -0.0291 & 0.2855
\end{pmatrix},
\]
\[
\mathbf{V} = \{ \sigma_{t_i,t_j} \}.
\]
This matrix is automatically a positive definite matrix by virtue of the way it was calculated, but a good practice is to always check each matrix. The eigenvalues (EV) were all positive, as shown below, and therefore V is positive definite.
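The check is easy in R; a sketch with a small invented matrix (the 9 x 9 V is handled the same way):

```r
V2 <- matrix(c(2.0, 0.8,
               0.8, 1.5), 2, 2)           # invented covariance matrix
ev <- eigen(V2, symmetric = TRUE)$values  # eigenvalues of V2
all(ev > 0)   # TRUE, so V2 is positive definite
```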
\[
EV = \begin{pmatrix}
2.48452319 \\ 0.96233482 \\ 0.86962324 \\ 0.80221637 \\ 0.59747351 \\ 0.50820761 \\ 0.42359840 \\ 0.27266640 \\ 0.04525645
\end{pmatrix}.
\]

9.3 Covariance Functions
Kirkpatrick et al. (1991) proposed the use of covariance functions for longitudinal data. A covariance function (CF) is a way to model the variances and covariances of a longitudinal trait. Orthogonal polynomials are used in this model, and Legendre polynomials are the easiest to apply. Each element of V is modeled as
\[
\sigma_{t_i,t_j} = \phi(q_i)' \, \mathbf{K} \, \phi(q_j),
\]
where φ(q_i) is a vector of functions of time, q_i, and K is a matrix of constants necessary to estimate σ_{t_i,t_j}. The matrix K must be estimated.
In the example V above, there are m = 9 time periods, and therefore there are m(m + 1)/2 = 45 parameters in V. Table 9.2 gives the time variables and the standardized time values. Note that t_min = 60 and t_max = 540.
Table 9.2
Time Variables of Mouse Data.
Item   t_i (minutes)    q_i
1         60          -1.00
2        120          -0.75
3        180          -0.50
4        240          -0.25
5        300           0.00
6        360           0.25
7        420           0.50
8        480           0.75
9        540           1.00
Legendre polynomials, P_n(x), are defined as follows, for x being one of the q_i:
\[
P_0(x) = 1, \quad \text{and} \quad P_1(x) = x;
\]
then, in general, the (n+1)st polynomial is described by the following recursive equation:
\[
P_{n+1}(x) = \frac{1}{n+1}\left( (2n+1)\,x\,P_n(x) - n\,P_{n-1}(x) \right).
\]
These quantities are "normalized" using
\[
\phi_n(x) = \left( \frac{2n+1}{2} \right)^{0.5} P_n(x).
\]
This gives the following series:
\[
\begin{aligned}
\phi_0(x) &= \left(\tfrac{1}{2}\right)^{0.5} P_0(x) = 0.7071, \\
\phi_1(x) &= \left(\tfrac{3}{2}\right)^{0.5} P_1(x) = 1.2247\,x, \\
P_2(x) &= \tfrac{1}{2}\left( 3x\,P_1(x) - 1\,P_0(x) \right), \\
\phi_2(x) &= \left(\tfrac{5}{2}\right)^{0.5} \left( \tfrac{3}{2}x^2 - \tfrac{1}{2} \right) = -0.7906 + 2.3717\,x^2,
\end{aligned}
\]
and so on.
Because V is 9 x 9, to model all of the σ_{t_i,t_j} we need 9 orthogonal polynomials. Thus, we need Legendre polynomials of order 8, where 8 is the highest power of time. Order 8 means there are 9 covariables (including time to the power of 0).
Table 9.3
Legendre Polynomials of Order 8 (rows are the polynomials φ0 to φ8; columns are the coefficients of powers of x from 0 to 8).

         0         1         2          3          4           5           6         7        8
φ0    0.7071    0.0       0.0        0.0        0.0         0.0         0.0       0.0      0.0
φ1    0.0       1.2247    0.0        0.0        0.0         0.0         0.0       0.0      0.0
φ2   -0.7906    0.0       2.3717     0.0        0.0         0.0         0.0       0.0      0.0
φ3    0.0      -2.8062    0.0        4.6771     0.0         0.0         0.0       0.0      0.0
φ4    0.7955    0.0      -7.9550     0.0        9.2808      0.0         0.0       0.0      0.0
φ5    0.0       4.3973    0.0      -20.5206     0.0        18.4685      0.0       0.0      0.0
φ6   -0.7967    0.0      16.7312     0.0      -50.1935      0.0        36.8086    0.0      0.0
φ7    0.0      -5.9907    0.0       53.9164     0.0      -118.6162      0.0      73.4291   0.0
φ8    0.7972    0.0     -28.6992     0.0      157.8457      0.0      -273.5992    0.0    146.571

A simple R function that will give you the above table follows.
LPOLY = function(no) {
  # Returns normalized Legendre polynomial coefficients for
  # polynomials of order 0 to no-1 (rows), by powers of x (columns).
  if(no > 9) no = 9
  nom = no - 1
  phi = matrix(data=c(0),nrow=9,ncol=9)
  phi[1,1] = 1   # P0(x) = 1
  phi[2,2] = 1   # P1(x) = x
  if(nom >= 2) {
    for(i in 2:nom){
      ia = i + 1
      ib = ia - 1
      ic = ia - 2
      c = (2*(i-1) + 1)/i   # (2n+1)/(n+1) with n = i-1
      f = (i - 1)/i         # n/(n+1)
      for(j in 1:ia){
        # multiplying by x shifts the coefficients up one power
        if(j == 1){ z = 0 } else { z = phi[ib,j-1] }
        phi[ia,j] = c*z - f*phi[ic,j]
      }
    }
  }
  for(m in 1:no){
    f = sqrt((2*(m-1)+1)/2)   # normalization constant
    phi[m, ] = phi[m, ]*f
  }
  return(phi[1:no,1:no])
}
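A quick check of the recursion against the closed form for φ2 given in the series above:

```r
x  <- 0.5
P0 <- 1
P1 <- x
P2 <- ((2*1 + 1) * x * P1 - 1 * P0) / (1 + 1)  # recursion with n = 1
phi2 <- sqrt((2*2 + 1) / 2) * P2               # normalization
c(phi2, -0.7906 + 2.3717 * x^2)                # both approximately -0.198
```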
Let the matrix in Table 9.3 be denoted as Λ'. Now define another matrix, M, as a matrix containing the powers of the standardized time values. Therefore,
\[
\mathbf{M} = \begin{pmatrix}
1 & -1.00 & 1.0000 & -1.000000 & 1.00000000 & -1.0000000000 & 1.000000 & -1.0000000 & 1.000000 \\
1 & -0.75 & 0.5625 & -0.421875 & 0.31640625 & -0.2373046875 & 0.177979 & -0.1334839 & 0.100113 \\
1 & -0.50 & 0.2500 & -0.125000 & 0.06250000 & -0.0312500000 & 0.015625 & -0.0078125 & 0.003906 \\
1 & -0.25 & 0.0625 & -0.015625 & 0.00390625 & -0.0009765625 & 0.000244 & -0.0000610 & 0.000015 \\
1 & 0.00 & 0.0000 & 0.000000 & 0.00000000 & 0.0000000000 & 0.000000 & 0.0000000 & 0.000000 \\
1 & 0.25 & 0.0625 & 0.015625 & 0.00390625 & 0.0009765625 & 0.000244 & 0.0000610 & 0.000015 \\
1 & 0.50 & 0.2500 & 0.125000 & 0.06250000 & 0.0312500000 & 0.015625 & 0.0078125 & 0.003906 \\
1 & 0.75 & 0.5625 & 0.421875 & 0.31640625 & 0.2373046875 & 0.177979 & 0.1334839 & 0.100113 \\
1 & 1.00 & 1.0000 & 1.000000 & 1.00000000 & 1.0000000000 & 1.000000 & 1.0000000 & 1.000000
\end{pmatrix},
\]
with columns 1, q_i, q_i^2, ..., q_i^8.
This gives
\[
\Phi = \mathbf{M}\Lambda = \begin{pmatrix}
0.7071068 & -1.2247449 & 1.5811388 & -1.8708287 & 2.1213203 & -2.3452079 & 2.5495098 & -2.7386128 & 2.9154759 \\
0.7071068 & -0.9185587 & 0.5435165 & 0.1315426 & -0.7426693 & 0.9765020 & -0.7158436 & 0.0936154 & 0.5761252 \\
0.7071068 & -0.6123724 & -0.1976424 & 0.8184876 & -0.6131942 & -0.2107023 & 0.8241091 & -0.6111065 & -0.2146925 \\
0.7071068 & -0.3061862 & -0.6423376 & 0.6284815 & 0.3345637 & -0.7967180 & 0.0618938 & 0.7665889 & -0.4444760 \\
0.7071068 & 0.0000000 & -0.7905694 & 0.0000000 & 0.7954951 & 0.0000000 & -0.7967218 & 0.0000000 & 0.7972005 \\
0.7071068 & 0.3061862 & -0.6423376 & -0.6284815 & 0.3345637 & 0.7967180 & 0.0618938 & -0.7665889 & -0.4444760 \\
0.7071068 & 0.6123724 & -0.1976424 & -0.8184876 & -0.6131942 & 0.2107023 & 0.8241091 & 0.6111065 & -0.2146925 \\
0.7071068 & 0.9185587 & 0.5435165 & -0.1315426 & -0.7426693 & -0.9765020 & -0.7158436 & -0.0936154 & 0.5761252 \\
0.7071068 & 1.2247449 & 1.5811388 & 1.8708287 & 2.1213203 & 2.3452079 & 2.5495098 & 2.7386128 & 2.9154759
\end{pmatrix},
\]
which can be used to specify the elements of V as
\[
\mathbf{V} = \Phi \mathbf{K} \Phi' = \mathbf{M} (\Lambda \mathbf{K} \Lambda') \mathbf{M}' = \mathbf{M} \mathbf{H} \mathbf{M}'.
\]
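The identity can be verified numerically on a small case. A sketch in R with three time points, a full order-2 fit, and an invented K:

```r
q   <- c(-1, 0, 1)
M   <- cbind(1, q, q^2)                    # powers of standardized time
Lam <- matrix(c(0.7071, 0,      -0.7906,   # columns = phi_0, phi_1, phi_2
                0,      1.2247,  0,
                0,      0,       2.3717), 3, 3, byrow = TRUE)
Phi <- M %*% Lam
K   <- matrix(c(1.0, 0.2, 0.1,
                0.2, 0.5, 0.0,
                0.1, 0.0, 0.3), 3, 3)      # invented coefficient matrix
V   <- Phi %*% K %*% t(Phi)
Kback <- solve(Phi) %*% V %*% t(solve(Phi))  # K = Phi^{-1} V Phi^{-T}
max(abs(Kback - K))                          # essentially zero
```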
Note that Φ, M, and Λ are matrices defined by the Legendre polynomial functions and by the standardized time values, and do not depend on the data or the values in the matrix V. One can estimate either K or H.
Estimate K using
\[
\mathbf{K} = \Phi^{-1} \mathbf{V} \Phi^{-T}
\]
\[
\mathbf{K} = \begin{pmatrix}
0.51696 & -0.10761 & 0.51424 & 0.01598 & 0.34633 & -0.03278 & -0.13492 & 0.02114 & -0.49907 \\
-0.10761 & 0.35040 & -0.00961 & 0.02374 & 0.04881 & 0.02711 & -0.02044 & -0.12437 & -0.05387 \\
0.51424 & -0.00961 & 1.20888 & -0.10372 & 0.66167 & -0.07196 & -0.28443 & 0.13495 & -1.02176 \\
0.01598 & 0.02374 & -0.10372 & 0.32922 & -0.05996 & -0.00723 & 0.05744 & -0.24125 & 0.06434 \\
0.34633 & 0.04881 & 0.66167 & -0.05996 & 0.59757 & -0.04682 & -0.24611 & 0.10218 & -0.70061 \\
-0.03278 & 0.02711 & -0.07196 & -0.00723 & -0.04682 & 0.12477 & 0.01695 & -0.12536 & 0.08352 \\
-0.13492 & -0.02044 & -0.28443 & 0.05744 & -0.24611 & 0.01695 & 0.17716 & -0.05564 & 0.22522 \\
0.02114 & -0.12437 & 0.13495 & -0.24125 & 0.10218 & -0.12536 & -0.05564 & 0.35359 & -0.12008 \\
-0.49907 & -0.05387 & -1.02176 & 0.06434 & -0.70061 & 0.08352 & 0.22522 & -0.12008 & 1.03220
\end{pmatrix},
\]
and estimate H using
\[
\mathbf{H} = \mathbf{M}^{-1} \mathbf{V} \mathbf{M}^{-T}
\]
\[
\mathbf{H} = \begin{pmatrix}
0.7993 & 0.3759 & -16.2282 & -4.1121 & 85.2585 & 8.2512 & -149.0388 & -4.5415 & 79.2916 \\
0.3759 & 18.8458 & -17.7140 & -140.9800 & 116.0053 & 279.3968 & -221.6913 & -157.4900 & 123.1336 \\
-16.2282 & -17.7140 & 565.2491 & 145.2098 & -3353.8067 & -275.2019 & 6130.4997 & 148.5089 & -3327.8998 \\
-4.1121 & -140.9800 & 145.2098 & 1244.7194 & -1032.9673 & -2605.3471 & 2044.5953 & 1505.9029 & -1156.0341 \\
85.2585 & 116.0053 & -3353.8068 & -1032.9673 & 20823.0945 & 2024.2377 & -38807.8169 & -1117.0747 & 21270.5182 \\
8.2512 & 279.3968 & -275.2019 & -2605.3471 & 2024.2377 & 5566.7003 & -4064.5381 & -3249.7079 & 2313.7405 \\
-149.0388 & -221.6913 & 6130.4997 & 2044.5953 & -38807.8169 & -4064.5381 & 72970.3381 & 2262.0269 & -40177.7306 \\
-4.5415 & -157.4900 & 148.5089 & 1505.9029 & -1117.0747 & -3249.7079 & 2262.0269 & 1906.4858 & -1292.3600 \\
79.2916 & 123.1336 & -3327.8998 & -1156.0341 & 21270.5182 & 2313.7405 & -40177.7306 & -1292.3600 & 22174.7101
\end{pmatrix}.
\]
Note the difference in magnitude of elements in K compared to H. Now calculate the correlations among the elements in the two matrices.
\[
Cor(\mathbf{K}) = \begin{pmatrix}
1.00 & -0.25 & 0.65 & 0.04 & 0.62 & -0.13 & -0.45 & 0.05 & -0.68 \\
-0.25 & 1.00 & -0.01 & 0.07 & 0.11 & 0.13 & -0.08 & -0.35 & -0.09 \\
0.65 & -0.01 & 1.00 & -0.16 & 0.78 & -0.19 & -0.61 & 0.21 & -0.91 \\
0.04 & 0.07 & -0.16 & 1.00 & -0.14 & -0.04 & 0.24 & -0.71 & 0.11 \\
0.62 & 0.11 & 0.78 & -0.14 & 1.00 & -0.17 & -0.76 & 0.22 & -0.89 \\
-0.13 & 0.13 & -0.19 & -0.04 & -0.17 & 1.00 & 0.11 & -0.60 & 0.23 \\
-0.45 & -0.08 & -0.61 & 0.24 & -0.76 & 0.11 & 1.00 & -0.22 & 0.53 \\
0.05 & -0.35 & 0.21 & -0.71 & 0.22 & -0.60 & -0.22 & 1.00 & -0.20 \\
-0.68 & -0.09 & -0.91 & 0.11 & -0.89 & 0.23 & 0.53 & -0.20 & 1.00
\end{pmatrix},
\]
and
\[
Cor(\mathbf{H}) = \begin{pmatrix}
1.00 & 0.10 & -0.76 & -0.13 & 0.66 & 0.12 & -0.62 & -0.12 & 0.60 \\
0.10 & 1.00 & -0.17 & -0.92 & 0.19 & 0.86 & -0.19 & -0.83 & 0.19 \\
-0.76 & -0.17 & 1.00 & 0.17 & -0.98 & -0.16 & 0.95 & 0.14 & -0.94 \\
-0.13 & -0.92 & 0.17 & 1.00 & -0.20 & -0.99 & 0.21 & 0.98 & -0.22 \\
0.66 & 0.19 & -0.98 & -0.20 & 1.00 & 0.19 & -1.00 & -0.18 & 0.99 \\
0.12 & 0.86 & -0.16 & -0.99 & 0.19 & 1.00 & -0.20 & -1.00 & 0.21 \\
-0.62 & -0.19 & 0.95 & 0.21 & -1.00 & -0.20 & 1.00 & 0.19 & -1.00 \\
-0.12 & -0.83 & 0.14 & 0.98 & -0.18 & -1.00 & 0.19 & 1.00 & -0.20 \\
0.60 & 0.19 & -0.94 & -0.22 & 0.99 & 0.21 & -1.00 & -0.20 & 1.00
\end{pmatrix}.
\]
In H, many of the correlations round off to +1 or -1, which means that H is very close to being singular. This is not a good property for using H to construct mixed model equations, and could lead to poor estimation of effects in the model.
By contrast, the largest correlation in K (in absolute value) is only -0.91. K is not close to singularity and should be safe to invert. The signs of the correlations are often opposite to those in H. K is the much preferred matrix for use in mixed model equations.
Once there is an estimate of K, the covariance function model can be used to calculate variances and covariances at other time points (between t_min and t_max). For example, let t_150 = 150 minutes and t_400 = 400 minutes, neither of which was actually observed or recorded in the 200 mice, but both points are within the upper and lower bounds of the experimental period. First, calculate the standardized time equivalents (between -1 and +1):
\[
q_{150} = -0.6250, \qquad q_{400} = +0.4167.
\]
Set up the matrix M for these two points,
\[
\mathbf{M}' = \begin{pmatrix}
1 & 1 \\
-0.6250000 & 0.4166667 \\
0.3906250 & 0.1736111 \\
-0.24414062 & 0.07233796 \\
0.15258789 & 0.03014082 \\
-0.09536743 & 0.01255867 \\
0.059604645 & 0.005232781 \\
-0.037252903 & 0.002180325 \\
0.0232830644 & 0.0009084689
\end{pmatrix}.
\]
The Legendre polynomials are
\[
\Phi' = (\mathbf{M}\Lambda)' = \begin{pmatrix}
0.7071068 & 0.7071068 \\
-0.7654655 & 0.5103104 \\
0.1358791 & -0.3788145 \\
0.6120387 & -0.8309381 \\
-0.8957736 & -0.3058426 \\
0.5003195 & 0.5797175 \\
0.2739309 & 0.7877318 \\
-0.8423224 & 0.0745140 \\
0.7767490 & -0.7262336
\end{pmatrix},
\]
and
\[
\Phi \mathbf{K} \Phi' = \begin{pmatrix}
2.767230 & -0.478992 \\
-0.478992 & 0.549733
\end{pmatrix}.
\]
Thus, the variance at 150 minutes is expected to be 2.767, and for 400
minutes is 0.550.
Suppose the variance for 700 minutes was needed. This could not be
predicted or calculated because tmax is only 540 minutes. Do not predict
variances for time periods outside the observed range.
9.4 Reduced Orders of Fit
In the previous example, the matrix V was 9 x 9, and the Legendre polynomials were generated for a full fit, with 9 covariates. Thus, the covariance function model resulted in no errors; all of the calculated variances and covariances were exactly the same as those in the original V.
Kirkpatrick et al. (1990) suggested looking at the eigenvalues of the matrix K from a full rank fit. Below are the values. The sum of all the eigenvalues was 4.690745, and the percentage of that total is also shown.
Table 9.4
Eigenvalues of K.
Eigenvalue      Percentage
2.980498752       63.5
0.624171898       13.3
0.433933695        9.3
0.208716099        4.4
0.195913819        4.2
0.135701396        2.9
0.102963163        2.2
0.007215617        0.2
0.001630618        0.03
The majority of the variation in K is explained by the first three eigenvalues, at 86.1%. The first seven explain 99.8%. If a cut-off is set to 95%, then the first 5 eigenvalues would be important.
Covariance functions can be based on fewer than 9 covariates. Thus, the orders of fit can be 8, 7, 6, 5, 4, 3, 2, 1, or 0. Order 0 implies that all of the elements in V are the same, which is obviously not true.
Use the subscript r to indicate a reduced order of fit, that is, r < 8; then
\[
\mathbf{V} = \Phi_r \mathbf{K}_r \Phi_r' + \mathbf{E},
\]
for r < 8, where Φ_r is a rectangular matrix composed of the first r + 1 columns of Φ (an order r fit has r + 1 covariates), and E is a matrix of residuals, because any lower order of fit will not be perfect. Thus, Φ_r does not have an inverse, but we can obtain an estimate of K_r.
To determine K_r, first pre-multiply V by Φ_r' and post-multiply that by Φ_r, as
\[
\Phi_r' \mathbf{V} \Phi_r = \Phi_r' (\Phi_r \mathbf{K}_r \Phi_r') \Phi_r = (\Phi_r' \Phi_r) \mathbf{K}_r (\Phi_r' \Phi_r).
\]
Now pre- and post-multiply by the inverse of
\[
(\Phi_r' \Phi_r) = \mathbf{P}
\]
to determine K_r:
\[
\mathbf{K}_r = \mathbf{P}^{-1} \Phi_r' \mathbf{V} \Phi_r \mathbf{P}^{-1}.
\]
K_r will be square with r + 1 rows and columns. With K_r we can estimate V_r as
\[
\mathbf{V}_r = \Phi_r \mathbf{K}_r \Phi_r'.
\]
This matrix is symmetric with 45 unique elements, but only has rank r + 1. The half-store function in R is a way of changing a symmetric matrix to a vector of its unique elements.
hsmat <- function(vcvfull) {
  # half-store: return the unique (upper triangular) elements
  # of a symmetric matrix as a vector
  mord = nrow(vcvfull)
  np = (mord*(mord + 1))/2
  desg = rep(0,np)
  k = 0
  for(i in 1:mord){
    for(j in i:mord){
      k = k + 1
      desg[k] = vcvfull[i,j]
    }
  }
  return(desg)
}
Let
\[
ss_r = ||hsmat(\mathbf{V} - \mathbf{V}_r)|| \; / \; ||hsmat(\mathbf{V})||
\]
be the goodness of fit statistic, where ||x|| is defined as the sum of squares of the elements of a half-stored symmetric matrix. Thus, ss_r measures the amount of difference between the two matrices, scaled by the squares of the values in the original matrix. This statistic is like a convergence criterion for solving a set of equations: the smaller ss_r is, the less difference there is between the two matrices.
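Putting the pieces together, a self-contained sketch in R of a reduced fit and its ss_r, with a small invented V (the chapter's 9 x 9 case works the same way):

```r
q   <- c(-1, -0.5, 0, 0.5, 1)
M   <- cbind(1, q, q^2)
Lam <- matrix(c(0.7071, 0,      -0.7906,
                0,      1.2247,  0,
                0,      0,       2.3717), 3, 3, byrow = TRUE)
Phi <- M %*% Lam                     # covariables up to order 2
V   <- diag(5) + 0.3                 # invented 5 x 5 covariance matrix
Phr <- Phi[, 1:2]                    # reduced fit with 2 covariates
P   <- t(Phr) %*% Phr
Kr  <- solve(P) %*% t(Phr) %*% V %*% Phr %*% solve(P)
Vr  <- Phr %*% Kr %*% t(Phr)
hs  <- function(X) X[upper.tri(X, diag = TRUE)]   # half-store
ssr <- sum(hs(V - Vr)^2) / sum(hs(V)^2)
ssr   # small values mean Vr is close to V
```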
To illustrate, for order 7, there are 8 covariates. The calculated K_7 is as follows.
\[
\mathbf{K}_7 = \begin{pmatrix}
0.28382 & -0.13845 & 0.02516 & 0.05280 & 0.00783 & 0.01502 & -0.02732 & -0.04759 \\
-0.13845 & 0.35040 & -0.06584 & 0.02374 & 0.01210 & 0.02711 & -0.00793 & -0.12437 \\
0.02516 & -0.06584 & 0.20045 & -0.03657 & -0.03170 & 0.01522 & -0.06227 & 0.00961 \\
0.05280 & 0.02374 & -0.03657 & 0.32922 & -0.01611 & -0.00723 & 0.04249 & -0.24125 \\
0.00783 & 0.01210 & -0.03170 & -0.01611 & 0.12203 & 0.01010 & -0.09328 & 0.02035 \\
0.01502 & 0.02711 & 0.01522 & -0.00723 & 0.01010 & 0.12477 & -0.00245 & -0.12536 \\
-0.02732 & -0.00793 & -0.06227 & 0.04249 & -0.09328 & -0.00245 & 0.12822 & -0.02775 \\
-0.04759 & -0.12437 & 0.00961 & -0.24125 & 0.02035 & -0.12536 & -0.02775 & 0.35359
\end{pmatrix},
\]
which is used to calculate V_7, and the goodness of fit statistic is
\[
ss_7 = \frac{||\mathbf{V}_7 - \mathbf{V}||}{||\mathbf{V}||} = 0.0495289.
\]
To determine the probability of finding a smaller value of ss_r, one can use simulation, as shown in the following R script.
N=10000
can=c(1:N)*0
VR=V
nocov = 2 # order of fit + 1
phr = PH[ ,c(1:nocov)] # PH holds the full-fit Legendre covariables (Phi)
PVP = t(phr)%*%VR%*%phr
PP = t(phr)%*%phr
PPI=ginv(PP)
Kr = PPI%*%PVP%*%PPI
ndf=199
for(ko in 1:N){
Ka = rWishart(1,ndf,Kr)/ndf
Kb = Ka[, ,1]
Vr = phr%*%Kb%*%t(phr)
DEL = Vr - VR
er = hsmat(DEL)
vh = hsmat(VR)
vv = sum(vh*vh)
ssr = sum(er*er)/vv
can[ko] = ssr
} #end of samples
hist(can,breaks=50)
Then compare the ssr to the histogram to find the probability of obtaining
a smaller statistic. With an order of fit equal to 1, ssr = 0.4592. In R one can
use
kb = order(-can)
ncan = can[kb]
ncan[1:10]
kc=which(ncan < 0.4592)
prob = 0
if(length(kc)>0)prob=1-(kc[1]/length(ncan))
prob
This gives 0.4379 which is a pretty large probability. Similarly for the
other orders, one gets the results in Table 9.5.
Table 9.5
Test statistics for reduced order of fit models.
Order   Covariates      ss_r      Probability
1           2        0.4591629      0.4379
2           3        0.4112741      0.2854
3           4        0.3132577      0.2699
4           5        0.2564594      0.1864
5           6        0.1992198      0.1138
6           7        0.1384349      0.0389
7           8        0.0495289      0.0001
Orders 1 through 5 gave V_r that were significantly different from V (probabilities of a smaller statistic greater than 0.05), while orders 6 and 7 were not different from V (probabilities less than 0.05). Order 6 would be a sufficient minimal fit for the mice insulin decay data.
The mouse data example was entirely fictitious.
9.5 Data Requirements
Suppose a fixed regression model for growth of sheep is to be studied. Each lamb has, at most, three weight measurements, and from these, coefficients for 5 covariates are to be estimated. With three data points we can only estimate coefficients for 3 covariates, because there would be no degrees of freedom remaining. However, because we have many lambs weighed at various ages, it is possible to estimate coefficients for 5 covariates across lambs, but not for individual lambs.
The same logic applies to a random regression model. Even though animal genetic effects are random, and it is computationally possible to estimate 5 coefficients per lamb, the quality of those estimates might be questionable. In early test day models, researchers required cows to have 7 or more test day records before an attempt was made to estimate covariance matrices with orders of fit equal to 5 or 6 (Jamrozik et al. 1998).
A general recommendation is: if the number of weights per animal is three or less, then a multiple trait animal model with each weight as a separate trait is the preferred analysis. With more than three weights per animal (on average), a random regression model could be employed. The appropriate order of fit should not be greater than the average number of weights per animal.
Chapter 10

The Models
Random regression models (RRM) are used to estimate the matrices that are part of covariance functions, which in turn provide estimates for a larger array of variances and covariances. RRM are also used to model the shape or trajectory of observations taken over time. Phenotypically, the trajectory has to be fitted, and at the same time the variation along the trajectory needs to be considered.
10.1 Fitting The Trajectory
Most trajectories are smooth and continuous and can be fit with very few covariates. Sometimes, however, the trajectory is unknown and may be undefined, with ups and downs over time. There could be different trajectories for males versus females, or for different breeds. Over the years, the trajectories could shift due to selection of animals for faster growth or higher milk yields. All of these factors must be considered.
The first course of action is to plot the data against the time scale of interest. Observations can be partitioned by gender, by age at start, or by years. One should look at all aspects of the data before committing to one model for analysis. Below are some fictitious data on animals over a period of 1 to 100 days. The trait measured is the amount of resistance to a bacterium from the first day of spring to fall. Figure 10.1 represents data on 6 animals with from 4 to 7 observations per animal, a total of 34 observations.
Figure 10.1
Daily Resistance Test: resistance level plotted against days on test (1 to 100) for the 6 animals.
The data show an initial resistance that declines until about day 35, then
improves again until fall. Fitting the trajectory of this curve could be
problematic. Ordinary regressions on days on test (divided by 100) were
fitted, from linear through sextic equations. Legendre polynomials could be
used from order 1 to order 6, but at this stage of model building, simple
regressions suffice.
Figure 10.2
Linear Fit: the linear regression overlaid on the Daily Resistance Test data.
A linear regression does not fit the data very well. The predicted y is
correlated with the original y at 0.66. Including a squared term for days on
test gave a correlation of 0.85.
Figure 10.3
Quadratic Fit: the quadratic regression overlaid on the Daily Resistance Test data.
Figure 10.4
Cubic Fit: the cubic regression overlaid on the Daily Resistance Test data.
Figure 10.5
Quartic Fit: the quartic regression overlaid on the Daily Resistance Test data.
Figure 10.6
Quintic Fit: the quintic regression overlaid on the Daily Resistance Test data.
Figure 10.7
Sextic Fit: the sextic regression overlaid on the Daily Resistance Test data.
A summary of the fit of regressions of different powers of days (divided by
100) is given in Table 10.1.
Table 10.1
Correlations of ŷ with y.

Fit        Correlation
Linear     0.66
Quadratic  0.85
Cubic      0.9356
Quartic    0.9423
Quintic    0.9789
Sextic     0.9740
The fit of the data improves with an increase in the power of the time variable,
but none of the regressions adequately fit the low observations from day 30 to
40. Going from quintic to sextic, the correlation actually decreased, so the
fit is starting to become worse.
In the quintic equation, there were waves in the first five days and at the
very end, from day 90 to 100. The lack of fit for the lower values in days 30
to 40 persists.
What happens when there are 2000 observations rather than just 34? Does
the fit become worse or better? These functions are not entirely adequate for
fitting the trajectories.
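The correlations in Table 10.1 come from correlating fitted and observed values for each order of polynomial. That procedure can be sketched in Python with simulated stand-in data (the 34 actual observations are not tabulated in the text), using numpy's polynomial regression in place of the ordinary regressions on days/100:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the 34 observations: a dip near day 35,
# recovery afterward, plus noise.
days = np.sort(rng.integers(1, 101, size=34))
x = days / 100.0                            # days on test divided by 100
y = 40.0 + 60.0*(x - 0.35)**2 + rng.normal(0.0, 3.0, size=34)

rs = []
for order in range(1, 7):                   # linear through sextic
    coef = np.polyfit(x, y, order)
    yhat = np.polyval(coef, x)
    rs.append(np.corrcoef(y, yhat)[0, 1])   # correlation of y-hat with y
```

Because the polynomial models are nested, the correlation of fitted with observed values cannot decrease as the order rises on the same data; the decrease from quintic to sextic in Table 10.1 reflects rounding and the reporting of fits rather than a true drop in R².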
10.1.1 Spline Functions
Another approach would be to divide the trajectory into parts, such that the
parts are best fit by linear or quadratic functions. For the example, the parts
might be days 1 to 20, days 21 to 50, and days 51 to 100. These are known as
spline functions. Besides estimating the regression coefficients, one must also
estimate the best cut-off points, and these might differ from one situation to
another.
A model might be

y_{kj} = \mu + b_1 X_{kj} + b_2 X_{kj}^2 + b_3 U_{kj} + b_4 U_{kj}^2 + b_5 W_{kj} + b_6 W_{kj}^2 + e_{kj},
where
y_kj is the jth observation on the kth animal,
µ is an overall mean,
X_kj is the days on test (divided by 100) corresponding to the observation,
U_kj is zero unless days on test are greater than 20, in which case it equals (X_kj − 0.2),
W_kj is zero unless days on test are greater than 50, in which case it equals (X_kj − 0.5),
b_ℓ, for ℓ = 1 to 6, are regression coefficients, and
e_kj are residual effects.
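The covariates X, U, and W can be constructed directly from days on test. A Python sketch under the knots given above (the observation vector here is hypothetical, since the example data are not listed):

```python
import numpy as np

def spline_design(days):
    """Covariates for the quadratic spline with knots at days 20 and 50."""
    x = days / 100.0                        # X: days on test / 100
    u = np.where(x > 0.2, x - 0.2, 0.0)     # U: active past day 20
    w = np.where(x > 0.5, x - 0.5, 0.0)     # W: active past day 50
    return np.column_stack([np.ones_like(x), x, x**2, u, u**2, w, w**2])

days = np.array([5., 15., 25., 35., 45., 60., 80., 95.])
y = np.array([62., 50., 38., 33., 40., 46., 52., 57.])   # hypothetical
X = spline_design(days)                     # mu plus 6 regressions = 7 parameters
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
```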
The fit of the spline function model gave a correlation between actual and
predicted observations of 0.9923, with seven parameters to estimate. The
agreement of ŷ and y is shown in Figure 10.8.
Figure 10.8
Fitting Splines: the spline fit overlaid on the Daily Resistance Test data.
10.1.2 Classification Approach
One hundred days can be grouped into 20 periods of 5 days each. A linear
model with time period groups could be used to model the trajectory. Thus,
there is no assumption about the function that fits the data. The data tell us
what the function looks like. The model is,
yij = µ + Ti + eij ,
where
yij is the j th observation within the ith time group (twenty groups),
µ is an overall mean,
Ti is the effect of the ith time group, and
eij is a residual effect.
This model fit the data very well, but at the expense of needing to estimate
20 parameters rather than 6 (quintic function) or 7 (spline function).
The time groups do not need to be equal in size. Time periods of 10 or 20
days might be appropriate if the observations have about the same magnitude
over all 10 or 20 days. That would not be true in the example data, because
the values are decreasing sharply at the beginning, then increasing quickly.
The last 10 or 15 days of the 100 day period might be able to be combined
into one group. However, there is little harm in keeping 20 periods of 5 days
each. There is no major computing problem in doing so.
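The classification model reduces to one estimated mean per time group. A minimal Python sketch with hypothetical observations, where group 1 covers days 1 to 5, group 2 days 6 to 10, and so on:

```python
import numpy as np

days = np.array([3, 7, 12, 18, 23, 41, 67, 88, 99])
y    = np.array([62., 55., 48., 40., 36., 44., 50., 56., 58.])

group = (days - 1) // 5 + 1                 # 20 periods of 5 days each
# T-hat_i for this model is simply the mean of each time group
means = {g: y[group == g].mean() for g in np.unique(group)}
yhat  = np.array([means[g] for g in group])
```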
The fit of the classification model gave a correlation of 0.9975 between ŷ
(red points) and y (blue points) (Figure 10.9).
Figure 10.9
Fitting Time Groups: predicted (red) and observed (blue) points for the Daily Resistance Test data.
The time group means give a non-smooth trajectory, but it fits the data
very well. Also, one does not need to define the type of curve or its shape.
A drawback is that days are being combined, so if resistance is changing
(increasing or decreasing) a lot from the first day to the fifth within a group,
then the group mean or Ti effect will not account for that, and there will be
errors for the predicted observations farthest from the middle day of the group.
Forming time groups must be handled cautiously.
In this example there were not enough observations to make smaller time
groups (e.g. 3 days or 2 days each). The classification approach is good
when you do not know what kind of function describes the trajectory. Usually
researchers are dealing with many thousands of observations, so having 20
time groups or a 6-parameter regression makes little difference in estimation
difficulty. One cannot go wrong choosing the classification approach over the
regression approach for modelling trajectories, unless there are insufficient
numbers of observations.
10.2 Random Variation in Curves
Begin by writing a model for the resistance level on one specific day, day t:

y_{tijk\ell m} = g_{ti} + h_{tj} + c_{tk} + a_{t\ell} + e_{tijk\ell m},

where
g_{ti} is a fixed gender effect,
h_{tj} is a fixed year effect,
c_{tk} is a random contemporary group effect,
a_{tℓ} is a random animal additive genetic effect, and
e_{tijkℓm} is a residual effect.
Now envisage a multiple trait model with 100 traits (i.e., resistance on
each day from 1 to 100). That is too many traits to analyze at one time, and
there would be much missing data for many of the animals. A random
regression model could be used to account for the relationships among
observations on different days. All factors of the model have a 'trajectory' of
effects on the trait. At least one fixed factor has to attempt to fit the
phenotypic shape of the trajectory. For the other factors, the model fits the
variation around the trajectories.
The gender effects in the model above, g_{ti}, become g_{ip}, where p refers to
the time period group. Hence there are different gender trajectories estimated
by 20 period groups of five days each. This should fit the phenotypic
trajectories almost perfectly.
The year effects can be modelled by a curve function, because the phenotypic
trajectory has been modelled by the gender effects. Thus, h_{tj} become

\sum_{x=0}^{4} h_{jx} z_{tx},

where h_{jx} are regression coefficients on Legendre polynomials (z) of time (t)
of order 4. Order 4 (5 parameters) should be suitable to fit the variation
around the trajectories. The best order could be determined by testing models
with different assumed orders.
Likewise, contemporary groups and animal additive genetic effects can be
modelled by the same regressions. That is, c_{tk} become

\sum_{x=0}^{4} c_{kx} z_{tx},

where c_{kx} are regression coefficients on Legendre polynomials (z) of time (t)
of order 4, and a_{tℓ} becomes

\sum_{x=0}^{4} a_{\ell x} z_{tx}.
In addition, the model needs to include animal permanent environmental
effects, because there are repeated observations on the same animal:

pe_{tm} \approx \sum_{x=0}^{4} pe_{mx} z_{tx}.
The RRM is about visualizing curves and covariance functions over time.
The observations are changing with time, and all effects of the model also
change with time. The curves that fit the trajectory may require more covariates than those that fit the covariance functions. They do not need to be of
the same order. Computationally, there could be advantages to using the same
order of fit for the fixed and random factors. The important point is to have a
good fit of the trajectories amongst the fixed effects, and an appropriate order
for the Legendre polynomials with the random factors.
10.3 Residuals
Residuals are the difference between the predicted observations (using the
estimates of parameters from the RRM) and the actual observations,

\hat{e} = \hat{y} - y.

Now partition the residuals into the 20 time periods as in the classification
model,

\hat{e} = \left( \hat{e}_1' \;\; \hat{e}_2' \;\; \hat{e}_3' \;\; \cdots \;\; \hat{e}_{20}' \right)'.

Then estimate the residual variance for the ith time group as

\sigma^2_{e_i} = \hat{e}_i' \hat{e}_i / n_i,

where n_i is the number of observations in the ith time group.
The overall residual covariance matrix, R, is assumed to be diagonal with
20 different residual variances according to time group. The mixed model
equations for the RRM therefore involve R−1 which means that observations
are inversely weighted by the magnitude of their residual variance. Larger
residual variances lead to lesser weight in the equations.
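The group-wise variances and the resulting weights can be sketched as follows in Python, with simulated residuals standing in for actual RRM residuals and the 20-period partition of the text:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
group = rng.integers(1, 21, size=n)              # time group of each residual
ehat  = rng.normal(0.0, 1.0 + group/20.0)        # hypothetical residuals

# sigma2_i = e_i' e_i / n_i within each time group
sigma2 = {g: float(ehat[group == g] @ ehat[group == g] / np.sum(group == g))
          for g in np.unique(group)}

# R is diagonal, so each observation is weighted by 1 / sigma2 of its group
weights = np.array([1.0 / sigma2[g] for g in group])
```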
The residual variances can be plotted on a graph relative to time group
number. Either a pattern of the residual variances can be observed and a
function used to determine the residual variance for each observation, or some
collapsing of the groups into larger time groups may be possible. The possibility exists for each of the 20 time group variances to be different, and thus,
one may always have to use 20 different variances.
10.4 Complete Model

10.4.1 Fixed Factors
The fixed factors are for modelling the shape of the phenotypic trajectories.
Let

f'(t) = \left( t^0 \;\; t^1 \;\; t^2 \;\; \cdots \;\; t^{m-1} \right)

be a vector of covariates of time of length m; these may or may not be Legendre
polynomials, depending on the function that best fits the trajectories. Alternatively,

f'(t) = \left( 0 \;\; 1 \;\; 0 \;\; \cdots \;\; 0 \right)

is a vector (of length m) indicating the time group to which an observation
belongs, as in the classification approach. The design matrix for gender effects,
for example, would be written as

X_g \, g = \left[ \begin{array}{cc}
f'(t_1) & 0' \\
f'(t_2) & 0' \\
0' & f'(t_3) \\
\vdots & \vdots \\
0' & f'(t_N)
\end{array} \right]
\left( \begin{array}{c} g_1 \\ g_2 \end{array} \right),

where N is the total number of animals observed. The first two animals
belonged to gender 1; the third and the Nth animals belonged to gender 2.
Also,

g_i = \left( \begin{array}{c} g_{i1} \\ g_{i2} \\ \vdots \\ g_{im} \end{array} \right)

is a vector of length m for each gender, representing the fixed regression
coefficients that give the trajectory of responses for gender i. Instead of one
number for each gender effect, there will be a vector of m numbers.
Similarly, the fixed effects of years can be represented as

W_h \, h = \left[ \begin{array}{cccc}
f'(t_1) & 0' & \cdots & 0' \\
f'(t_2) & 0' & \cdots & 0' \\
0' & f'(t_3) & \cdots & 0' \\
\vdots & \vdots & \ddots & \vdots \\
0' & 0' & \cdots & f'(t_N)
\end{array} \right]
\left( \begin{array}{c} h_1 \\ h_2 \\ h_3 \\ \vdots \\ h_{ny} \end{array} \right),

where ny is the number of years in the data. If m = 7, for example, and
N = 1000, then X has 1000 rows and 2 × m = 14 columns. If ny = 10, then
W has 1000 rows and 10 × m = 70 columns.
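The block pattern of these design matrices can be produced by writing f'(t) into the column block of each observation's level. A Python sketch with the simple power covariates (all names and values here are hypothetical):

```python
import numpy as np

def f_t(t, m):
    """Power covariates t^0 ... t^(m-1)."""
    return t ** np.arange(m)

def design(times, levels, m):
    """N x (nlevels * m): row i holds f'(t_i) in the block of its level."""
    nlev = levels.max() + 1
    X = np.zeros((len(times), nlev * m))
    for i, (t, lev) in enumerate(zip(times, levels)):
        X[i, lev*m:(lev+1)*m] = f_t(t, m)
    return X

times  = np.array([0.1, 0.2, 0.3, 0.9])
gender = np.array([0, 0, 1, 1])        # first two animals gender 1, rest gender 2
Xg = design(times, gender, m=3)        # 4 rows, 2 levels x 3 covariates
```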
10.4.2 Random Factors
The random factors are for modelling the covariance functions and make use
of Legendre polynomials of order r. The analysis also gives a different curve
for every level of each random factor: a curve for each contemporary group,
for each animal's genetic effect, and for each animal's permanent
environmental effect. Let

z'(q) = \left( \phi_0(q) \;\; \phi_1(q) \;\; \phi_2(q) \;\; \cdots \;\; \phi_r(q) \right).
The design matrices are constructed in the same manner as those for the
fixed factors, only using z'(q) rather than f'(t). Assume that r = 4 for this
discussion. The design matrices are
Z_c for contemporary groups,
Z_a for animal additive genetic effects, and
Z_p for animal permanent environmental effects.
Contemporary groups are groups of animals usually born within a few
days or weeks of each other, which are reared together through much of their
early life, and who are kept together during the time that the observations
were collected. If N = 1000 observations, and the number of contemporary
groups is nc = 50, then Zc has 1000 rows and 5 × nc = 250 columns. The
covariance function matrix for contemporary groups is Kc of dimension 5 × 5.
Normally, the covariance matrix for contemporary group effects is I\sigma^2_c, but in
a RRM it is a block diagonal matrix,

Var(c) = I \otimes K_c,

of dimension 250 × 250, where \otimes is the direct product operation.
The additive relationship matrix, A, is used in all animal models, and
includes ancestors who may not have been observed or measured in the data
itself. The design matrix for animal additive genetic effects must, therefore,
have additional columns of zeros to accommodate the ancestors. Let nw be
the number of animals with observations in the data, and let na be the total
number of animals including ancestors. Hence, na ≥ nw . The matrix, Za , has
1000 rows, and na × 5 columns. If na = 200, then Za has 1000 columns. If Ka
is the covariance function matrix for genetic variances, then

Var(a) = A \otimes K_a,

of dimension 1000 × 1000.
The animal permanent environmental (PE) covariance function matrix is Kp.
The design matrix, Zp, has 1000 rows and nw × 5 columns. Let nw = 140;
then that is 700 columns.

Var(p) = I \otimes K_p,

of dimension 700 × 700.
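These block covariance structures can be checked with a direct product. A small Python sketch, with order-2 hypothetical matrices standing in for the 5 × 5 covariance function matrices:

```python
import numpy as np

k = 2
Kc = np.array([[2.0, 0.5],
               [0.5, 1.0]])              # hypothetical covariance function matrix
Vc = np.kron(np.eye(3), Kc)              # Var(c) = I (x) Kc: block diagonal

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])               # hypothetical additive relationships
Ka = np.eye(k)
Va = np.kron(A, Ka)                      # Var(a) = A (x) Ka: off-diagonal blocks too
```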
The assumption is that there are no covariances between levels of different
random factors, e.g. between contemporary groups and animal additive genetic
effects.
The residual matrix, R, was shown to be diagonal, but with different residual variances depending on the day on test.
10.4.3 Mixed Model Equations
Once a model is specified, then Henderson's best linear unbiased prediction
methodology is employed. This requires constructing and solving the Mixed
Model Equations (MME). These equations are
\left[ \begin{array}{ccccc}
X'R^{-1}X & X'R^{-1}W & X'R^{-1}Z_c & X'R^{-1}Z_a & X'R^{-1}Z_p \\
W'R^{-1}X & W'R^{-1}W & W'R^{-1}Z_c & W'R^{-1}Z_a & W'R^{-1}Z_p \\
Z_c'R^{-1}X & Z_c'R^{-1}W & Z_c'R^{-1}Z_c + I \otimes K_c^{-1} & Z_c'R^{-1}Z_a & Z_c'R^{-1}Z_p \\
Z_a'R^{-1}X & Z_a'R^{-1}W & Z_a'R^{-1}Z_c & Z_a'R^{-1}Z_a + A^{-1} \otimes K_a^{-1} & Z_a'R^{-1}Z_p \\
Z_p'R^{-1}X & Z_p'R^{-1}W & Z_p'R^{-1}Z_c & Z_p'R^{-1}Z_a & Z_p'R^{-1}Z_p + I \otimes K_p^{-1}
\end{array} \right]
\left[ \begin{array}{c} \hat{g} \\ \hat{h} \\ \hat{c} \\ \hat{a} \\ \hat{p} \end{array} \right]
=
\left[ \begin{array}{c} X'R^{-1}y \\ W'R^{-1}y \\ Z_c'R^{-1}y \\ Z_a'R^{-1}y \\ Z_p'R^{-1}y \end{array} \right].
These equations are solved by using iteration on data routines, in which
coefficients of the MME are calculated as they are needed. The following
procedure describes how to estimate the covariance function matrices using a
pseudo-Bayesian method.
1. First, go through the observations, one at a time, and calculate deviations,

y - X\hat{g} - W\hat{h} - Z_c\hat{c} - Z_a\hat{a} - Z_p\hat{p}.
2. Accumulate the deviations for gender effects, then solve for new gender
regression coefficients.
3. Repeat for year effects.
4. Repeat for contemporary groups. For each contemporary group calculate
\hat{c}_i \hat{c}_i' and accumulate these 5 × 5 matrices over all contemporary groups.
Then

\hat{K}_c = \sum_i \hat{c}_i \hat{c}_i' / \chi^2(n_c + 2)

to estimate the covariance function matrix, where \chi^2(s) is a random
Chi-square variate having s degrees of freedom.
5. Repeat for animal additive genetic effects. This step involves elements
of A^{-1}. For each animal ℓ that has observations, calculate

m_\ell = \hat{a}_\ell - 0.5(\hat{a}_{sire} + \hat{a}_{dam}).

From A^{-1} there will be b^{ii} = (0.5 - 0.25(F_{sire} + F_{dam}))^{-1} for each animal.
The new covariance function matrix is

\hat{K}_a = \sum_i m_i m_i' \, b^{ii} / \chi^2(n_w + 2).

Only animals with records are used because they are the only ones that
contribute directly to the estimation of variances and covariances.
6. Repeat for animal permanent environmental effects. Kp is estimated in
the same manner as that for contemporary groups.
7. Estimate new residual variances as before.
8. The new covariance function matrices are used in the next iteration, and
should be saved in a file so that they may be averaged to give a final
estimate of each.
9. Go back to step 1 and repeat.
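One round of the covariance-matrix update in step 4 can be sketched in Python; the solution vectors are simulated here rather than coming from actual MME, and the scaled chi-square draw mirrors the R code of Chapter 11:

```python
import numpy as np

rng = np.random.default_rng(3)
nc, k = 50, 5                               # 50 contemporary groups, order-5 curves
c_hat = rng.normal(0.0, 1.0, size=(nc, k))  # simulated solution vectors

sss = c_hat.T @ c_hat                       # sum of c_i c_i' over all groups
x2 = rng.chisquare(nc + 2)                  # chi-square with nc + 2 d.f.
Kc_new = sss / x2                           # new covariance function matrix
Kc_inv = np.linalg.inv(Kc_new)              # feeds the next round of MME
```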
Chapter 11
RRM Calculations
The calculations for applying random regression models are similar to those
needed for multiple trait models. The R language is used to illustrate. To
analyze thousands of records one would need to write specialized programs in
either Fortran or C++, or use commercial software.
Below are completely fictitious data for two traits, so that the model is
definitely multiple trait as well as random regression. The example data are
spread over the following four tables. The pedigree appears in Table 11.4.
Table 11.1
Example Data - Part 1

Animal  Age  Yr-Mn  HYS  DIM  Fat   Protein
10      1    1      1     21   8.7   6.0
10      1    1      1     59   5.7   9.7
10      1    1      1    105   9.4   6.6
10      1    1      1    139  12.2   9.9
10      1    1      1    221   9.6   9.4
10      1    1      1    334   6.4   4.1
11      1    1      1     23  12.9   8.6
11      1    1      1     61  10.6  10.1
11      1    1      1    107  14.1  14.1
11      1    1      1    141   8.1   7.7
11      1    1      1    223  11.0   9.5
12      2    1      1     27  10.3   5.0
12      2    1      1     65   6.9   6.3
12      2    1      1    111  10.2   6.7
12      2    1      1    145  15.4  10.3
12      2    1      1    227  12.8   7.6
13      1    1      2      6  12.5   7.4
13      1    1      2     31  17.3  13.1
13      1    1      2    175  11.6  10.9
13      1    1      2    264   9.4   8.5
13      1    1      2    295   7.0   7.0
13      1    1      2    325   7.0   7.3
14      2    1      2      8   9.5   3.3
14      2    1      2     33   8.3   7.7
14      2    1      2    177  12.4  12.8
14      2    1      2    266  12.3  12.5
14      2    1      2    297   9.7  10.7
14      2    1      2    327   5.7   9.2
15      3    1      2     10  14.8  10.6
15      3    1      2     35  13.3  11.0
15      3    1      2    179  11.5   8.5
15      3    1      2    268   8.6   6.9
15      3    1      2    299   6.0   6.1
Table 11.2
Example Data - Part 2

Animal  Age  Yr-Mn  HYS  DIM  Fat   Protein
16      3    1      2     15  12.6   7.5
16      3    1      2     40  13.9   9.0
16      3    1      2    184   9.6   9.1
16      3    1      2    273   8.2   7.0
16      3    1      2    304   5.6   7.7
17      1    2      3     48   3.1   7.7
17      1    2      3     89   6.4   9.0
17      1    2      3    150   7.6   8.8
17      1    2      3    206   7.7   8.1
17      1    2      3    280   7.0  10.1
17      1    2      3    335   6.6  12.9
18      2    2      3     50   1.7   6.4
18      2    2      3     91   2.3   3.5
18      2    2      3    152   4.3   4.2
18      2    2      3    208   4.7   5.7
18      2    2      3    282   6.8   7.5
19      2    2      3     52   6.1   7.6
19      2    2      3     93   5.7   5.7
19      2    2      3    154   6.4   7.5
19      2    2      3    210   7.9   7.8
19      2    2      3    284   5.6   8.2
20      3    2      3     53   1.2   5.1
20      3    2      3     94   3.9   8.8
20      3    2      3    155   8.2   7.6
20      3    2      3    211   6.5   6.4
20      3    2      3    285   5.2   8.0
21      1    2      4     63   2.7   5.3
21      1    2      4    131   6.5   4.6
21      1    2      4    222   9.1   5.3
21      1    2      4    252   4.8   2.6
21      1    2      4    270   3.0   1.4
21      1    2      4    300   2.4   1.2
22      2    2      4     64   0.3   8.9
Table 11.3
Example Data - Part 3

Animal  Age  Yr-Mn  HYS  DIM  Fat   Protein
22      2    2      4    132   2.2   6.6
22      2    2      4    223   9.5   6.6
22      2    2      4    253   7.5   5.5
22      2    2      4    271   5.8   4.6
22      2    2      4    301   8.2   6.8
23      3    2      4     65   2.6   5.8
23      3    2      4    133   2.9   6.1
23      3    2      4    224   8.5   8.7
23      3    2      4    254   6.7   6.7
23      3    2      4    272   4.4   6.8
23      3    2      4    302   3.9   5.7
24      1    2      4     66   0.9   4.2
24      1    2      4    134   5.1   5.0
24      1    2      4    225   5.4   3.1
24      1    2      4    255   6.6   4.1
24      1    2      4    273   8.4   5.1
24      1    2      4    303   4.3   6.4
25      3    2      4     67   6.3   3.5
25      3    2      4    135   2.8   3.7
25      3    2      4    226   5.2   2.7
25      3    2      4    256   4.6   1.9
25      3    2      4    274   7.6   1.5
Table 11.4
Pedigree of Example Animals.

Animal  Sire  Dam  bii
1       0     0    1.0
2       0     0    1.0
3       0     0    1.0
4       0     0    1.0
5       0     0    1.0
6       0     0    1.0
7       0     0    1.0
8       0     0    1.0
9       0     0    1.0
10      1     5    0.5
11      2     6    0.5
12      3     7    0.5
13      4     8    0.5
14      1     9    0.5
15      2     5    0.5
16      3     6    0.5
17      3     8    0.5
18      4     7    0.5
19      3     10   0.5
20      2     12   0.5
21      2     13   0.5
22      1     11   0.5
23      3     14   0.5
24      4     19   0.5
25      2     10   0.5
11.1 Enter Data
First, enter the pedigree information and calculate the inverse of the additive
genetic relationship matrix.
anim=c(1:25)
sir=c(rep(0,9),1,2,3,4,1,2,3,3,4,3,2,2,1,3,4,2)
dam=c(rep(0,9),5,6,7,8,9,5,6,8,7,10,12,13,11,14,19,10)
length(sir)-length(dam) # check
bi=c(rep(1,9),rep(0.5,16))
AI = AINV(sir,dam,bi)
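AINV is the author's function and is not listed in this chunk; its role can be sketched with Henderson's rules for building A-inverse directly from the sires, dams, and bᵢ values of Table 11.4 (a Python version, assumed not the author's implementation; 0 denotes an unknown parent):

```python
import numpy as np

def ainv(sire, dam, b):
    """A-inverse by Henderson's rules; animals numbered 1..n, 0 = unknown."""
    n = len(b)
    A = np.zeros((n, n))
    for i in range(n):
        d = 1.0 / b[i]                  # b_i = 0.5 - 0.25(F_sire + F_dam)
        idx, w = [i], [1.0]
        for p in (sire[i], dam[i]):
            if p > 0:
                idx.append(p - 1)
                w.append(-0.5)
        for r, wr in zip(idx, w):       # add d times the outer product of weights
            for c, wc in zip(idx, w):
                A[r, c] += wr * wc * d
    return A

# Two unrelated, non-inbred parents and their offspring
AI = ainv(sire=[0, 0, 1], dam=[0, 0, 2], b=[1.0, 1.0, 0.5])
```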
Enter columns of information that identify the age group, year-month,
animals with records, HYS numbers, Days in milk, and fat and protein yields
(not shown).
agid=c(rep(1,6),rep(1,5),rep(2,5),rep(1,6),rep(2,6),rep(3,5),
rep(3,5),rep(1,6),rep(2,5),rep(2,5),rep(3,5),rep(1,6),
rep(2,6),rep(3,6),rep(1,6),rep(3,5))
nob=length(agid)
ymid=c(rep(1,6),rep(1,5),rep(1,5),rep(1,6),rep(1,6),rep(1,5),
rep(1,5),rep(2,6),rep(2,5),rep(2,5),rep(2,5),rep(2,6),
rep(2,6),rep(2,6),rep(2,6),rep(2,5))
cgid=c(rep(1,6),rep(1,5),rep(1,5),rep(2,6),rep(2,6),rep(2,5),
rep(2,5),rep(3,6),rep(3,5),rep(3,5),rep(3,5),rep(4,6),
rep(4,6),rep(4,6),rep(4,6),rep(4,5))
anwr=c(rep(10,6),rep(11,5),rep(12,5),rep(13,6),rep(14,6),
rep(15,5),rep(16,5),rep(17,6),rep(18,5),rep(19,5),rep(20,5),
rep(21,6),rep(22,6),rep(23,6),rep(24,6),rep(25,5))
# obs is 88 by 2 matrix of fat and protein values
Observations also have to be assigned to residual error groups (1 to 4)
based on days in milk. Group 1 is day 5 to day 99, group 2 is day 100 to 199,
group 3 is day 200 to 299, and group 4 is day 300 plus.
dimid=c(21,59,105,139,221,334, 23,61,107,141,223,
27,65,111,145,227,6,31,175,264,295,325, 8,33,177,
266,297,327, 10,35,179,268,299,15,40,184,273,304,
48,89,150,206,280,335, 50,91,152,208,282,52,93,
154,210,284, 53,94,155,211,285, 63,131,222,252,
270,300,64,132,223,253,271,301, 65,133,224,254,
272,302, 66,134,225,255,273,303,67,135,226,256,274)
erid=c(1,1,2,2,3,4,1,1,2,2,3,1,1,2,2,3,1,1,2,3,3,4,
1,1,2,3,3,4,1,1,2,3,3,1,1,2,3,4,1,1,2,3,3,4,1,1,2,
3,3,1,1,2,3,3,1,1,2,3,3,1,2,3,3,3,4,1,2,3,3,3,4,1,
2,3,3,3,4,1,2,3,3,3,4,1,2,3,3,3)
length(erid)-length(dimid) # check
Now the Legendre polynomials need to be constructed, using two R functions.
The first, LPOLY, sets up the polynomial coefficients; the second, LPTIME,
calculates the covariates from standardized time variables for days in milk.
LPOLY=function(no) {
if(no > 9 ) no = 9
nom = no - 1
phi = matrix(data=c(0),nrow=9,ncol=9)
phi[1,1]=1
phi[2,2]=1
for(i in 2:nom){
ia = i+1
ib = ia - 1
ic = ia - 2
c = 2*(i-1) + 1
f = i - 1
c = c/i
f = f/i
for(j in 1:ia){
if(j == 1){ z = 0 }
else {z = phi[ib,j-1]}
phi[ia,j] = c*z - f*phi[ic,j]
}
}
for( m in 1:no){
f = sqrt((2*(m-1)+1)/2)
phi[m, ] = phi[m, ]*f
}
return(phi[1:no,1:no])
}
The second function is below.
LPTIME=function(day,tmin,tmax,no){
phi = LPOLY(no)
if(day > tmax)day = tmax
if(day < tmin)day = tmin
z = -1 + 2*((day - tmin)/(tmax - tmin))
s = rep(1,no)
for(i in 2:no){
s[i]=z*s[i-1] }
x = phi %*% s
return(x)
}
Use the functions to create the covariates in a matrix of order 335 by 5.
tmin=5
tmax=335
zcov=matrix(data=c(0),ncol=5,nrow=tmax)
for(i in tmin:tmax){
zcov[i, ]=LPTIME(i,tmin,tmax,5)
}
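As a cross-check of LPOLY/LPTIME, the same covariates can be computed with numpy's Legendre class in Python (the √((2k+1)/2) scaling matches the normalization in LPOLY above):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

def lp_time(day, tmin=5, tmax=335, no=5):
    """Normalized Legendre covariates phi_k(z), z standardized to [-1, 1]."""
    day = min(max(day, tmin), tmax)
    z = -1.0 + 2.0 * (day - tmin) / (tmax - tmin)
    return np.array([np.sqrt((2*k + 1) / 2.0) * Legendre.basis(k)(z)
                     for k in range(no)])

zrow = lp_time(170)    # day 170 standardizes to z = 0
```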
11.2 Enter Starting Covariance Matrices
The following matrices come from an analysis of Simmental dairy cows. The
observations were multiplied by 10, so the covariance matrices have been multiplied by 100. The matrices are positive definite, and are of order 10 by 10
(five covariates times 2 traits). The matrices were entered in half-stored mode,
then converted to full-stored with the function fsmat. The genetic covariance
matrix is
G=c(0.6132, -0.0353,.0174, .0017, .0008, .0096,
-.0005, -.0001, .0011, -.0005, .4705, -.0112,
.0059,.0008, -.0061,.0651,.0019,-.0025,-.0005,
.2519,-.0040,.0008,.0015,.0017,.0261,.0008,
-.0005, .0614,-.0006,-.0001,-.0003,.0004,.0057,
-.0002, .0136,.0004,.0003,-.0003,-.0003,.0007,
.4272,-.0108,.0172,.0012,.0009,.3324,-.0028,
.0150,.0000,.1978,-.0033,.0053,.0885,-.0008,.0310)
GM=fsmat(10,G)
The contemporary group covariance matrix is as follows.
H=c(3.7487,-.3112,-.0732,.0272,-.0201,2.8720,
-.1985,-.1479,.0351,-.0275,.8256,-.0211,-.0314,
.0140,-.2306,.3768,-.0091,-.0500,.0111,.3382,
-.0222,-.0003,-.0967,-.0066,.1027,-.0106,-.0128,
.2638,-.0121,.0152,-.0483,-.0062,.0712,-.0051,
.1818,-.0215,.0067,-.0145,-.0032,.0419,2.9313,
-.1625,-.1058,.0316,-.0197,.5873,-.0103,-.0187,
.0094,.2325,-.0104,.0050,.1698,-.0060,.1127)
HM=fsmat(10,H)
The animal permanent environmental covariance matrix is as follows.
P=c(1.1340,-.0151,.0206,.0024,-.0005,.3505,
.0017,-.0222,.0031,.0000,.4881,-.0003,.0155,
.0018,-.0038,.0805,.0042,-.0015,-.0001,.3317,
-.0052,.0037,-.0158,.0054,.0496,.0004,-.0019,
.1412,-.0011,-.0010,-.0003,.0000,.0164,.0000,
.0330,-.0018,.0006,-.0006,.0002,.0034,.8274,
.0103,.0172,.0051,.0043,.3528,.0070,.0194,.0007,
.2418,-.0007,.0114,.1161,-.0004,.0577)
PM=fsmat(10,P)
Invert those matrices for use in mixed model equations.
GI=ginv(GM)
HI=ginv(HM)
PI=ginv(PM)
Because there are two traits, the residual covariance matrices are of order 2
by 2, and there are 4 of them because there are 4 residual groups per lactation.
Then the inverses are needed to form MME.
R1=matrix(data=c(4.1267,1.5435,1.5435,2.5262),
byrow=TRUE,ncol=2)
R2=matrix(data=c(3.4424,1.4586,1.4586,1.7293),
byrow=TRUE,ncol=2)
R3=matrix(data=c(2.4530,1.1217,1.1217,1.2931),
byrow=TRUE,ncol=2)
R4=matrix(data=c(2.0600,0.9695,0.9695,1.3123),
byrow=TRUE,ncol=2)
RI=array(data=c(0),dim=c(2,2,4))
R1I=ginv(R1)
RI[ , ,1]=R1I
RI[ , ,2]=ginv(R2)
RI[ , ,3]=ginv(R3)
RI[ , ,4]=ginv(R4)
11.3 Sampling Process
Initialize the solution vectors for each factor in the model, which represent
vectors for year-months, age groups, contemporary groups, animal additive
genetic effects, and animal permanent environmental effects, respectively. The
number of rows is the number of levels in each factor, and the number of
columns is ten (10), for 5 covariates per trait times 2 traits. The model assumes
that all of the data are from first lactation animals of the same breed.
ymsn=matrix(data=c(0),nrow=nym,ncol=10)
agsn=matrix(data=c(0),nrow=nage,ncol=10)
cgsn=matrix(data=c(0),nrow=ncg,ncol=10)
ansn=matrix(data=c(0),nrow=nam,ncol=10)
pesn=ansn
11.3.1 Year-Month Effects
All factors in this model are based on regression functions of order 4. There
are not enough observations to use DIM classes to model the lactation curves
in this example.
The steps described below are similar for age groups, and contemporary
group solutions, except for the appropriate changes to the code, which should
be obvious.
WY=ymsn*0
WW=array(data=c(0),dim=c(10,10,nym))
for(i in 1:nob){
y = obs[i, ]
kag=agid[i] #Age indicator
kym=ymid[i] #year-month
kcg=cgid[i] #contemporary group
kam=anwr[i] #animal id
kdy=dimid[i] #Days in milk
kt1=c(1:5) # fat yield variables
kt2=c(6:10)# protein yield variables
zphi=zcov[kdy, ] #Legendre covariates
y[1]=y[1]-(agsn[kag,kt1]+cgsn[kcg,kt1]+
ansn[kam,kt1]+pesn[kam,kt1])%*%zphi
y[2]=y[2]-(agsn[kag,kt2]+cgsn[kcg,kt2]+
ansn[kam,kt2]+pesn[kam,kt2])%*%zphi
ker=erid[i] # residual group (1 to 4)
eh=RI[ , ,ker]%*%y
WY[kym,kt1]=WY[kym,kt1]+zphi*eh[1]
WY[kym,kt2]=WY[kym,kt2]+zphi*eh[2]
zz=zphi%*%t(zphi) # 5 x 5 matrix
WW[ , ,kym]=WW[ , ,kym]+(RI[ , ,ker]%x%zz)
}
The above script adjusts the observations for all other factors in the model,
and accumulates the diagonal 10 by 10 blocks for each level of a factor, and
the right hand sides (RHS) for each level. The next step is to solve for new
year-month solutions, and add some sampling noise.
no=10
for(k in 1:nym){
work=WW[ , ,k]
C = ginv(work)
ymsn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(no,0,1)
ymsn[k, ] =ymsn[k, ] +(CD %*% ve)
}
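Each block solve draws the new solution vector from a normal distribution with mean C·WY and covariance C. A Python sketch of the same operation on a small hypothetical block (numpy's Cholesky factor is lower triangular, matching t(chol(C)) in R):

```python
import numpy as np

rng = np.random.default_rng(0)
WW = np.array([[4.0, 1.0],
               [1.0, 3.0]])              # hypothetical coefficient block
WY = np.array([2.0, 1.0])                # hypothetical right-hand side

C = np.linalg.inv(WW)                    # C = ginv(work)
sol = C @ WY                             # conditional mean
L = np.linalg.cholesky(C)                # CD = t(chol(C)) in the R code
sol = sol + L @ rng.normal(0.0, 1.0, 2)  # mean + CD %*% ve
```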
11.3.2 Age Groups
The R-script is nearly identical to that for year-month effects, except that
y[1] and y[2] are adjusted for ymsn rather than agsn, plus the other factors.
The solving step, with sampling noise added, is
no=10
for(k in 1:nage){
work=WW[ , ,k]
C = ginv(work)
agsn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(no,0,1)
agsn[k, ] =agsn[k, ] +(CD %*% ve)
}
11.3.3 Contemporary Groups
Contemporary groups are a random factor, and thus a covariance matrix
should be estimated. However, in this small example there are only 4
contemporary groups with which to estimate a covariance matrix of order 10.
The covariance matrix is therefore not properly estimable; there should be
more than 10 contemporary groups. If the starting covariance matrix, H, can
be considered a good "guesstimate" of the true covariance matrix, then H can
be held constant throughout the sampling process.
The adjustment of observations and collection of the diagonal blocks and
RHS are the same as in Year-Month and Age factors. The solution phase and
estimation of the covariance matrix are as follows.
no=10
for(k in 1:ncg){
work=WW[ , ,k] + HI # NOTE
C = ginv(work)
cgsn[k, ]=C %*% WY[k, ]
CD=t(chol(C))
ve = rnorm(no,0,1)
cgsn[k, ] =cgsn[k, ] +(CD %*% ve)
}
If ncg were greater than 10, then the covariance matrix would be estimated
as follows.
sss = t(cgsn)%*%cgsn
ndf = ncg+2
x2 = rchisq(1,ndf)
Hn = sss/x2
HI = ginv(Hn)
In this example, sss has a rank of only 4, but an order of 10 by 10. Thus,
Hn does not have an inverse, and HI cannot be used in the mixed model
equations. The original H and HI have to be used each time.
An alternative would be to treat the contemporary groups as another fixed
factor.
11.3.4 Animal Additive Genetic Effects
The R-script for animal additive genetic effects is slightly different from that
for year-months, ages, and contemporary groups. The difference is that a
matrix is created for all animals at one time. Thus, 25 animals by 10
coefficients (5 covariates times 2 traits) gives a matrix of order 250 by 250.
This would not be a good strategy with many thousands of animals, but for
the small example it is sufficient.
no=nam*10
WW = matrix(data=c(0),nrow=no,ncol=no)
WY = matrix(data=c(0),nrow=no,ncol=1)
for(i in 1:nob){
y = obs[i, ]
kag=agid[i]
kym=ymid[i]
kcg=cgid[i]
kam=anwr[i]
kdy=dimid[i]
kt1=c(1:5)
kt2=c(6:10)
zphi=zcov[kdy, ]
y[1]=y[1]-(ymsn[kym,kt1]+agsn[kag,kt1]+
cgsn[kcg,kt1]+pesn[kam,kt1])%*%zphi
y[2]=y[2]-(ymsn[kym,kt2]+agsn[kag,kt2]+
cgsn[kcg,kt2]+pesn[kam,kt2])%*%zphi
ker=erid[i]
eh=RI[ , ,ker]%*%y
ka=(kam-1)*10 + 1
kb=ka+9
kc=c(ka:kb)
WY[kc[kt1]]=WY[kc[kt1]]+zphi*eh[1] # rows of this animal, trait 1
WY[kc[kt2]]=WY[kc[kt2]]+zphi*eh[2] # rows of this animal, trait 2
zz=zphi%*%t(zphi)
WW[kc,kc]=WW[kc,kc]+(RI[ , ,ker]%x%zz)
}
The solution phase is
UI = AI %x% GI
work = WW + UI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn +(CD %*% ve)
# transform to new dimensions
ansn = matrix(data=wsn,byrow=TRUE,ncol=10)
# Estimate a new G and GI
v1 = AI %*% ansn
sss = t(ansn)%*%v1
ndf = nam+2
x2 = rchisq(1,ndf)
Gn = sss/x2
GI = ginv(Gn)
The additive genetic covariance matrix is based on 25 animals, which is
greater than the order of the matrix, and therefore, Gn is estimable and has
an inverse, GI.
11.3.5 Animal Permanent Environmental Effects
The R-script for this factor is the same as for the animal additive genetic
effects. The solution phase is slightly different.
UI = id(nam) %x% PI # note id(nam) instead of AI
work = WW + UI
C = ginv(work)
wsn=C %*% WY
CD=t(chol(C))
ve = rnorm(no,0,1)
wsn=wsn +(CD %*% ve)
pesn = matrix(data=wsn,byrow=TRUE,ncol=10)
# Estimate a new P and PI
sss = t(pesn)%*%pesn
ndf = 16+2
x2 = rchisq(1,ndf)
Pn = sss/x2
PI = ginv(Pn)
This covariance matrix was also estimable because the number of animals
with records was 16 which is greater than the dimension of P.
11.3.6 Residual Covariance Matrices
Observations are now adjusted for all factors in the model to give estimated
residual effects; then sums of squares are calculated separately for each of the
four residual groups, each having different degrees of freedom. In this case
the covariance matrices are of order 2 by 2, and as long as there are more
than two observations in a group, these covariance matrices are estimable.
Rss=array(data=c(0),dim=c(2,2,4))
for(i in 1:nob){
y = obs[i, ]
kag=agid[i]
kym=ymid[i]
kcg=cgid[i]
kam=anwr[i]
kdy=dimid[i]
kt1=c(1:5)
kt2=c(6:10)
zphi=zcov[kdy, ]
y[1]=y[1]-(ymsn[kym,kt1]+agsn[kag,kt1]+
cgsn[kcg,kt1]+ansn[kam,kt1]+pesn[kam,kt1])%*%zphi
y[2]=y[2]-(ymsn[kym,kt2]+agsn[kag,kt2]+
cgsn[kcg,kt2]+ansn[kam,kt2]+pesn[kam,kt2])%*%zphi
ker=erid[i]
eh=matrix(data=y,ncol=1)
Rss[ , ,ker]=Rss[ , ,ker]+ eh%*%t(eh)
}
The covariance matrices are estimated as
# Find number of observations in each group
kdf=tapply(erid,erid,length)
for(k in 1:4){
ndf = kdf[k]+2
x2 = rchisq(1,ndf)
Rn = Rss[ , ,k]/x2
Rni=ginv(Rn)
RI[ , ,k]=Rni
}
Repeat more rounds of sampling. Code is needed to save the samples of
each covariance matrix. After all samples have been made, randomly select
200 of the sampled matrices past the burn-in period and average them to
obtain the final estimates.
If one is just solving the mixed model equations, then the parts of the
scripts that add sampling noise can be skipped, and covariance matrices do
not need to be estimated. Some convergence criterion statistic is needed to
know when the iterates have converged to the solutions of the MME.
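As a sketch of one common criterion, the change in the solution vector between consecutive rounds can be compared to the magnitude of the solutions themselves; the vectors below are hypothetical, for illustration only.

```r
# Hypothetical solution vectors from two consecutive rounds.
sold = c(1.00, 2.00, 3.00)
snew = c(1.01, 1.99, 3.00)
# Convergence criterion: log10 of the squared change relative to the
# squared magnitude of the solutions; iterate until this is small,
# say below -10.
cc = log10( sum((snew - sold)^2) / sum(snew^2) )
```

In practice `snew` would be the full solution vector from the current round, and the loop stops once `cc` falls below the chosen threshold.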
The R-code could be written more concisely and compactly, but then the
reader may not be able to follow what is being done exactly. The code here
has been written to hopefully be easily followed, step by step.
Chapter 12
Lactation Production
When a female dairy cow gives birth, her lactation period begins. In an average dairy cow, the lactation period runs for 305 days. Milk production
rises after calving, peaks at about 40 days, and declines daily thereafter. Around 100 to 120 days after calving, the cow
is impregnated again through artificial insemination, and the growth of the
new fetus begins to pull lactation production downwards even more. About 60
days before the cow is due to give birth again, she is stopped from milking (if
she has not already stopped on her own) and dried off. The whole process is
repeated as soon as she has the next calf, at approximately 13 month intervals.
Some cows have their peak production right at birth of a calf and it decreases continually thereafter. Other cows continue to increase in yield from
birth to day 90, before starting to decrease. Another group of cows gives milk
at a continuous level for many days. Some cows stop milking around 280 days,
while others are kept milking to 365 days or more. Thus, there are many
different shapes of daily production trajectories between cows.
Besides milk yield, there are also components of milk, i.e. percentages of
fat, protein, and lactose, somatic cell scores, milk urea nitrogen, and beta-hydroxybutyrate. All of these have their own interrelated trajectories over the
lactation period. Consequently, multiple trait analyses are favoured to make
use of genetic correlations among the traits at different points during the lactation period. Luckily, the same model equation is generally assumed for all of
these traits. That is, the same factors are assumed to influence each lactation
trait.
12.1 Measuring Yields
In the early days of milk recording in Canada, the federal government would
record the daily milk yield of every cow in the herd. The trait that was
analyzed was called the 305-day yield. This was the sum of the daily yields of
cows from day 5 to 305 in the lactation period (or whenever the cow stopped
milking). Daily yields were defined as the amount of milk given in a 24 hour
period. This was usually two milkings per day, morning or AM milking, and
evening or PM milking.
However, daily recording was very costly and someone had to add up the
milk weights over the lactation period. The program ended before 1970. At
the same time there was a program of supervised testing, in which a milk
supervisor would visit a farm at approximately one-month intervals. He or she
would measure yields in the PM and following AM, and collect milk samples
of each cow, which were sent to laboratories to be analyzed for fat content.
The data were accumulated by the milk recording program, which was called
Record of Performance (ROP). Later, provincial programs arose called Dairy
Herd Improvement (DHI) programs. The monthly milk weights were combined using the Test Interval Method (TIM), which estimated the amount of
milk produced between two visits by linear interpolation. There were tables
of special factors to adjust the first test day (TD) visit, and another table for
projecting the yield after the latest TD visit to 305-days. Both tables were
based on the assumption of a standard or average lactation trajectory. There
was no allowance for the fact that cows could vary drastically in their trajectories. The factors worked well for most cows, but gave biased results for cows
with atypical trajectories.
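The Test Interval Method's linear interpolation can be sketched in a few lines: the milk between two visits is approximated by the average of the two daily yields times the number of days between them. The test day values below are hypothetical.

```r
# Hypothetical TD visits: days in milk and daily yields (kg).
td_days  = c(15, 45, 75)
td_yield = c(28, 32, 30)
# Interval lengths and mean daily yield within each interval.
interval = diff(td_days)
avg = (head(td_yield, -1) + tail(td_yield, -1))/2
# Accumulated milk between the first and last visit.
milk = sum(interval * avg)
```

The official TIM additionally used the tabled factors described above for the period before the first visit and for projection past the latest visit.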
A dairy technical committee existed in Canada, run by Agriculture Canada
and composed of scientists from different universities across Canada. The purpose of the committee was to advise the ROP program on the best statistical
procedures to adjust milk yields and to evaluate dairy bulls. The committee met twice a year in various locations across Canada. During one of these
meetings when it was time to develop new tables of factors for adjusting the
first TD yields and the latest TD yields to get 305-day production, there was
debate over who would do this and how often it needed to be done. Dr John
Moxley, who worked for the Quebec DHI equivalent, DHAS at the time, made
a remark in 1974 that “it would be better if we could analyze TD milk yields
directly rather than combining them into a 305-d yield.” In 1974, however, the
effort was in getting a linear sire model adopted to evaluate dairy bulls. The
computing power of the day was not capable of handling models for test day
records. By 1990 the dairy world had advanced to using animal models, and
computer hardware had caught up, so that it was feasible to begin working on
TD models.
The first TD model did not have any curves in it. The model assumed
that the trajectories of the curves were the same for all cows. Trajectories only
differed by the height at peak yield. Thus, there was still only one variable to
estimate per animal. The problem with analyzing TD records was that each
cow had 7 to 10 TD records compared to only one 305-d record. There was
much more data to process.
Jack Dekkers suggested “there should be a different lactation curve for
every cow”. Henderson (1984) had a section of his book on the topic of “random regressions.” There was only one paragraph and nothing about their use
with TD models. Henderson’s son published a paper in Biometrics in 1982 on
random regression models and the analysis of covariance.
In 1994 the idea of TD models using random regression models was presented at the WCGALP meeting in Guelph. Four years later, everyone was
studying random regression models for many situations. By 2000, Canada had
adopted a TD model for its genetic evaluations of dairy bulls and cows.
12.2 Curve Fitting
A host of different models have been used to fit lactation curves in different
species of dairy cattle. Kistemaker (1996) compared 19 different models that
had been studied in the literature previously. His results are shown in Table
12.1.
Table 12.1
Correlations (r) between Predicted and Actual Test Day Yields
and Mean Absolute Error (MAE), when applied to 5409 cows with
at least 9 TD yields.

No.  Model                                                            r     MAE
 1   ln(y/t) = a + b·t                                              .717   4.780
 2   ln(y) = a + b·ln(t) + c·t                                      .951   1.290
 3   ln(y) = a + b·ln(t) + c·t + d·t^.5                             .963   1.084
 4   ln(y) = a + b·ln(t) + c·t + d·t^2                              .964   1.079
 5   ln(y) = a + b·t^-1 + c·t + d·t^2                               .964   1.063
 6   ln(y) = a + b·ln(t) + c·t + d·t^.5 + f·t^2                     .973   0.888
 7   1/y = a + b·t^-1 + c·t                                         .102   2.050
 8   1/y = a + b·t^-1 + c·t + d·t^2                                 .766   1.269
 9   1/y = a + b·t^-1 + c·t + d·t^2 + f·t^3                         .378   1.078
10   y = a                                                          .646   3.466
11   y = a + b·t + c·exp(-.5((log(t)-1)/.6)^2)·t^-1                 .953   1.229
12   y = a + b·t^.5 + c·ln(t)                                       .955   1.230
13   y = a + b·t + c·exp(-.05·t)                                    .953   1.232
14   y = a + b·t^.5 + c·ln(t) + d·t^4                               .967   1.032
15   y = a + b(t/305) + c(t/305)^2 + d·ln(305/t) + f·ln^2(305/t)    .975   0.857
16   y = a + b·t + c·sin(.01)t^2 + d·sin(.01)t^3 + f·exp(-.055t)    .974   0.878
17   y = a + b·t + c·t^2 + d·t^3 + f·ln(t)                          .975   0.864
18   y = a + b·t + c·t^2 + d·t^3 + f·t^4                            .974   0.905
19   y = a + b·t + c·t^2 + d·t^3 + f·t^4 + g·t^5 + h·t^6            .987   0.581
Wood’s model (1967) has been used to study groups of cows and is equation 2 in the table. Equation 13 is known as Wilmink’s function (1987) which
has been applied in many studies. Equation 15 is known as the Ali and Schaeffer function (1987), which gives the second smallest mean absolute error and
the second largest correlation. Equation 19 appears to be the best, but has
the most parameters to be estimated. The first 9 equations use the natural
log of the test day yields or the inverse of yield. Equations 10 through 19 use
the actual TD yields.
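Several of the first nine models can be fitted by ordinary least squares after the log transformation. A minimal sketch for Wood's model (equation 2), using hypothetical test day yields:

```r
# Hypothetical days in milk and daily yields (kg).
t = c(5, 35, 65, 95, 125, 155, 185, 215, 245, 275, 305)
y = c(24, 33, 34, 32, 30, 28, 26, 24, 22, 20, 18)
# Wood's model: ln(y) = a + b*ln(t) + c*t, fit by linear regression.
fit = lm(log(y) ~ log(t) + t)
# Back-transform the fitted values to predicted daily yields.
yhat = exp(fitted(fit))
```

Models 10 through 19 are already linear in the parameters and can be fitted the same way without the log transformation.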
In general, the fit of the model improves as the number of covariates
in the model increases. Subsequent work showed that Legendre polynomials
of order 4 were similar to the Ali and Schaeffer function, but had the advantage of having much lower correlations among the parameter estimates. Thus,
Legendre polynomials of order 4 have been used for both fixed and random
regressions in test day models in Canada.
A classification approach could be used for the fixed factor regressions for
at least one of the fixed factors. There have probably been a hundred different
studies that investigated the best curve function for fitting lactation curves
in dairy cows, dairy goats, dairy sheep, and water buffalos. The conclusions
have not been unanimous, depending on the amount of data in the analyses.
The majority of studies found test day models gave higher correlations of
estimated breeding values with true breeding values, and recommended their
use for genetic evaluations of lactation production.
12.3 Factors in a Model

12.3.1 Observations
A multiple trait model will be described. Traits will be defined within parity
number. Parities one and two are separate, and parity three includes third
parity and all subsequent parities. The assumption is that cows in third parity
or later are mature and the shape of their curves are similar. In some cases
parity two might also be considered mature. No matter what, there are definite shape differences between parity 1 heifers and all later parities. In some
situations it might be better to limit analyses to the first three parities only.
After parity number come several traits, depending on the country. These
include milk yield, fat yield, protein yield, lactose yield, and somatic cell scores
(SCS). There could also be milk urea nitrogen and beta-hydroxybutyrate. Finally, there could be fatty acid components. Deciding which traits to analyze
depends on how many cows have data. In the United States of America, for
example, there are too many cows and too many test days, such that a TD
model is impractical to apply, even for one trait. The initial Canadian Test
Day Model had the first three parities, and milk, fat, and protein yields plus
SCS or a total of 12 traits.
12.3.2 Year-Month of Calving
In all animal models it is critically important to account for time trends in
phenotypes. For lactation production this means putting in a factor for the
208
CHAPTER 12. LACTATION PRODUCTION
year and month of calving. If data begin in 1986, then that means 32 years
(it is now 2018), times 12 months per year, gives 384 levels or 384 different
lactation curves for one parity. Then assume 72 five-day periods within each
lactation, and that gives 384 × 72 = 27,648 parameters to be estimated using
the classification approach. Hopefully, there are many more test day records with which
to estimate those parameters. If there are not, then maybe 36 ten-day periods
could be used. If data are limiting, then Legendre polynomials of order 4 could
be used, i.e. 384 × 5 = 1,920 parameters to estimate.
12.3.3 Age-Season of Calving
Age at calving (parturition) is known to have a significant effect on milk production, as does month of calving. However, month of calving has already
been considered in the Year-Month of Calving effects. There is, however, an
interaction of month of calving with age at calving. To avoid some confounding, months can be combined into seasons, either six or four seasons per year.
These can be formed on the basis of phenotypic averages, so that consecutive
months that are similar in yield levels can be grouped together.
Age groups would differ depending on the parity number, and there could
be different numbers of age groups per parity. First parity heifers start calving
at 18 months of age, and can extend to 30 months. Again, if there are lots of
data, then 13 age groups by 6 seasons would only be 78 subclasses in parity one.
Later parities extend over a much wider age range, and thus, some groupings
of ages may be necessary too. Legendre polynomials would be used for this
factor.
If the data cover several decades, then age-season differences could change
over time as production increases. Thus, time periods of 5 to 10 years should be
made and the model expanded to have Time-Age-Season of Calving subclasses.
This allows the age-season differences to change over time.
12.3.4 Days Pregnant
Once a cow becomes pregnant, part of her feed intake goes towards the growth
of the fetus, and therefore, less energy goes towards milk production. Groups
of 5 or 10 days can be created, perhaps 30 groups altogether, to measure the
decrease in yield. The assumption is that the decrease in yield is the same
regardless of number of days in milk when the cow becomes pregnant. As the
number of days pregnant becomes larger, so does the amount of decrease in
yield. Legendre polynomials of days in milk would be used within each days
pregnant group.
Determining the time of conception is not immediate, and therefore, test
day records need to be continuously updated when pregnancy is validated.
Canada has opted to multiplicatively pre-adjust TD yields for number of days
pregnant rather than to put a factor into the model.
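A minimal sketch of such a multiplicative pre-adjustment follows; the 10-day grouping and the adjustment factors here are hypothetical, not the official Canadian ones.

```r
# Assign a days-pregnant group (10-day groups, capped at group 30).
dpgroup = function(dp) min(floor(dp/10) + 1, 30)
# Hypothetical multiplicative factors, increasing with days pregnant.
adj = seq(1.00, 1.30, length=30)
# Adjust a 25 kg TD yield for a cow that is 95 days pregnant.
yadj = 25.0 * adj[dpgroup(95)]
```

When a pregnancy is later validated, the affected TD records are re-adjusted with the factors for the corrected days-pregnant groups.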
12.3.5 Herd-Test-Day
The purpose of this factor was to account for the environmental effects on the
cows that were tested on the same day. This is very messy because cows (in
the same parity) would have calved at different ages and months of the year.
Thus, some cows would be just starting a lactation, and other cows (in the
same HTD subclass) could be ending their lactation. Thus, the yields would
vary widely within a subclass. There would be only one parameter to estimate for each
HTD subclass. The contemporaries would be constantly changing from TD to
TD.
VERY IMPORTANT!!!
Herd-Test-Day factors are NOT recommended for TD models.
Parity-Herd-Year-Season of calving
should be used as contemporary groups.
12.3.6 Herd-Year-Season of Calving
The random factor of Herd-Year-Season of Calving, (HYS), each with its own
curve (not one parameter but five), should be used to account for contemporaries. Contemporaries are cows that share the same environmental effects
throughout their lactation, from birth to being dried off. They encounter the
same weather and management variables throughout. They likely also have
the same test days during their lactations. Perhaps 4 seasons per year of 3
210
CHAPTER 12. LACTATION PRODUCTION
months each could be used. However, if the number of cows per subclass is small,
then larger season groups (4 months or 6 months) may be necessary in
some herds, especially for the less numerous breeds.
Legendre polynomials of order 4 should be used with this factor, and
hence a covariance function matrix needs to be estimated for it. As a random
factor in the model, it is less critical to have a minimum number of records
per subclass because just one test day record will suffice.
12.3.7 Additive Genetic Effects
Every animal with TD records has both parents identified. If a parent is
unknown, then a phantom parent group is assigned. Ancestors, without TD
records, also have unknown parents replaced by phantom parent groups. The
groups are based on year of birth of the animal and whether it is a male or
female animal. Phantom groups represent the four pathways of selection in
dairy cattle, and years of birth. Phantom groups are necessary in order to
estimate genetic trends without bias.
Each animal additive genetic effect is fitted by a Legendre polynomial of
order 4. A covariance matrix must also be estimated.
12.3.8 Permanent Environmental Effects
Because cows have more than one TD record per lactation, permanent environmental effects are modeled for each parity by Legendre polynomials of order
4. A covariance matrix is needed for this factor too.
12.3.9 Number Born
The number of offspring born at a parturition, in litter bearing species such
as dairy goats and sheep, can have an effect on the milk yield of the female.
A female carrying four young apparently “knows” this is happening and the
body prepares by increasing the amount of milk that will be needed after birth
to feed that number of young. This is a fixed environmental effect and might
differ depending on parity number of the dam, but it can be fit by Legendre
polynomials of order 4.
12.3.10 Residual Effects
In the Canadian Test Day Model, the lactation is divided into 4 periods of
various numbers of days, such that the residual variance is similar across days
within a period, but different between periods. One should begin using many
groups, perhaps 30 of ten days each, in an initial analysis to determine the
best grouping of days. The point is, the residual variance changes throughout
the lactation.
Table 12.2 contains residual variances for milk yields in the first three
parities for a small subset of Canadian Holstein dairy cattle born from 2005
through 2009.

Table 12.2
Residual variances for a TD model.

Days in Milk   Parity 1   Parity 2   Parity 3
1-45              7.86      13.96      16.42
46-115            5.01       8.12       9.33
116-265           3.95       5.41       6.24
266-365           3.57       4.36       3.60

12.4 Covariance Function Matrices
Many of the early studies of random regression models focussed on the estimation of the covariance function matrices, and the subsequent graphs that could
be made. Let ai represent the vector of random regression coefficients of an
animal for parity i. This vector is order 5 by 1 (order 4 Legendre polynomial).
Then an analysis of 3 parities gives a covariance function matrix of order 15
by 15. The parts of this matrix are shown below in order 5 by 5 subgroups.
Var(a1) =
  |  8.1910   0.2880  -0.6694   0.2360  -0.1407 |
  |  0.2880   1.4534  -0.1327   0.4590   0.4926 |
  | -0.6694  -0.1327   0.5108  -0.1512   0.0713 |
  |  0.2360   0.4590  -0.1512   0.1855  -0.0524 |
  | -0.1407   0.4926   0.0713  -0.0524   0.0766 |,

Cov(a1, a2) =
  |  8.1921   0.8613  -0.5547   0.2076  -0.0466 |
  |  1.6933   1.1975  -0.2810   0.0614  -0.0956 |
  | -0.6192   0.1654   0.2865  -0.1721   0.0372 |
  |  0.3242  -0.0281  -0.0950   0.1168  -0.0478 |
  | -0.1003   0.1086   0.0856  -0.0424   0.0194 |,

Cov(a1, a3) =
  |  8.3749   0.4892  -0.5377   0.2686  -0.0281 |
  |  1.4860   1.3430  -0.1682  -0.0111  -0.0232 |
  | -0.7102   0.2314   0.2984  -0.2155   0.0487 |
  |  0.3355  -0.0659  -0.1044   0.1136  -0.0136 |
  | -0.1365   0.0979   0.0705  -0.0456   0.0072 |,

Var(a2) =
  | 12.0818   1.6093  -0.6870   0.4367  -0.1217 |
  |  1.6093   3.0648   0.0456  -0.2809   0.0247 |
  | -0.6870   0.0456   0.5917  -0.1626   0.0166 |
  |  0.4367  -0.2809  -0.1626   0.4004  -0.1092 |
  | -0.1217   0.0247   0.0166  -0.1092   0.1696 |,

Cov(a2, a3) =
  | 11.4893   1.9730  -0.9016   0.4398  -0.2730 |
  |  1.8533   2.7023  -0.1320  -0.2323  -0.0117 |
  | -0.6313  -0.0365   0.2919  -0.2010   0.8696 |
  |  0.2867  -0.2212  -0.1646   0.2461  -0.0805 |
  | -0.0897   0.0631   0.0654  -0.0579   0.0185 |,

and

Var(a3) =
  | 13.5971   1.7354  -0.8951   0.3615  -0.3546 |
  |  1.7354   3.8151  -0.2634  -0.1821   0.0662 |
  | -0.8951  -0.2634   0.9197  -0.2324   0.0873 |
  |  0.3615  -0.1821  -0.2324   0.5535  -0.1615 |
  | -0.3546   0.0662   0.0873  -0.1615   0.2291 |.
A plot of the genetic variances within parities and across the lactation
period can be obtained as shown below.
# Legendre polynomials
LAM=LPOLY(5)
ti=c(5:365)
tmin=5
tmax=365
qi = 2*(ti - tmin)/(tmax - tmin) - 1
x=qi
x0=x*0 + 1
x2=x*x
x3=x2*x
x4=x3*x
M=cbind(x0,x,x2,x3,x4)
PH = M %*% t(LAM)
Ka1 = matrix(data=c(8.1910, 0.2880,-0.6694,
0.2360, -0.1407,
0.2880, 1.4534, -0.1327, 0.4590, 0.4926,
-0.6694, -0.1327, 0.5108, -0.1512, 0.0713,
0.2360, 0.4590, -0.1512, 0.1855, -0.0524,
-0.1407, 0.4926, 0.0713, -0.0524, 0.0766),
byrow=TRUE,ncol=5)
Va1 = PH%*%Ka1%*%t(PH) # order 361 x 361
vg1 = diag(Va1)
# similar arrays for vg2, vg3, Ka2, Ka3 (not shown)
par(bg="cornsilk")
days=c(5,55,105,155,205,255,305,355)
plot(ti,vg1,col="blue",lwd=5,type="l",axes=FALSE,
xlab="Days on Test",
ylab="Genetic Variance",ylim=c(4,16))
axis(1,at=days)
axis(2)
title(main="Genetic Variances")
lines(ti,vg2,col="red",lwd=5)
lines(ti,vg3,col="darkgreen",lwd=5)
points(55,15,pch=0,col="blue",lwd=3)
text(55,15,"First Parity",col="blue",pos=4)
points(55,14,pch=0,col="red",lwd=3)
text(55,14,"Second Parity",col="red",pos=4)
points(55,13,pch=0,col="darkgreen",lwd=3)
text(55,13,"Third Parity",col="darkgreen",pos=4)
The plot is shown in Figure 12.1. An obvious observation is that there are
distinct differences in the variance curves across the lactation between parities.
Also, the genetic variance is highest at the beginning of lactation and at the
end of lactation, for each parity. This implies that there are great differences
between cows in the amount of milk produced at the start of lactation; after
55 days the variances are smaller by nearly half, but they tend to increase
again towards day 365.
Some researchers have interpreted the high variances at the start and end
of tests as artifacts of the Legendre polynomials. However, similar shapes are
obtained using other polynomials (e.g. Ali and Schaeffer, 1987) of order 4.
Spline functions tend to flatten the beginning and end a little more, but the
general shape persists.
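The same covariance function also gives genetic correlations between any pair of days. A small helper for a matrix like Va1 from the script above, whose rows and columns correspond to days 5 to 365:

```r
# Genetic correlation between days d1 and d2 from a covariance
# matrix V whose rows/columns correspond to days 5 to 365.
gcor = function(V, d1, d2){
  i = d1 - 4
  j = d2 - 4
  V[i, j] / sqrt(V[i, i] * V[j, j])
}
# e.g. gcor(Va1, 5, 305) for days 5 and 305 in first parity
```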
Figure 12.1: Genetic variance of daily yield plotted against days on test (5 to 355), with separate curves for first, second, and third parity.
The only way to determine the correct shape of the variances is to use
a multiple trait model where yields are divided into 36 ten-day periods, then
genetic variances may be estimated for each period, and also covariances between periods. Then a Legendre polynomial of order 4 could be fit to the 36 by
36 covariance matrix, and compared to the estimates from the test-day model.
What matters is not the shape of the variance curves by itself, but the
entirety of the results, which includes the estimated breeding values. The
residual variances are greatly reduced. The covariances that need to be
correct are Cov(ai, aj) for all pairs of parities i and j.
12.5 Expression of EBVs
Estimated breeding values (EBV) in random regression models come in vectors
of length equal to the order of the Legendre polynomials. The problem was
how to condense 5 breeding values for a curve into one value for a single trait,
like milk yield. Dairy cattle producers were used to a standard called “305-day
yields”. The solution was to calculate the daily milk yield per day of lactation,
and then to sum those daily yields from day 5 through 305. (The first 4 days
of yield were typically used to feed the newborn calf and provide it with
colostrum for immunity.) Let the solutions for one animal’s additive genetic value for
first parity milk yield be
a1i' = ( a1i0  a1i1  a1i2  a1i3  a1i4 ),
then daily yield (DY )ij for animal i on the j th day would be
DYij = φj0 a1i0 + φj1 a1i1 + φj2 a1i2 + φj3 a1i3 + φj4 a1i4 ,
where φjm is a Legendre polynomial covariate. The 305-d milk yield, M 305i ,
is the sum of the daily yields,
M305i = Σ (j = 5 to 305) DYij.
Because the breeding values are constant for the calculation of every daily
yield, then
M305i = (Σ φj0)a1i0 + (Σ φj1)a1i1 + (Σ φj2)a1i2 + (Σ φj3)a1i3 + (Σ φj4)a1i4,

where each sum is taken over j = 5 to 305,
or
M305i = c0·a1i0 + c1·a1i1 + c2·a1i2 + c3·a1i3 + c4·a1i4,
where the cj are constants, and represent the sum of the Legendre polynomial
coefficients, which can be obtained by the following R script.
# PH (from the earlier script) holds the Legendre covariates for
# days 5 to 365; rows 1 to 301 correspond to days 5 to 305
ka=c(1:301)
P305 = PH[ka, ]
C305 = t(P305)%*%jd(301,1)  # jd(301,1) is a 301 x 1 vector of ones
C305
[1,] 212.839141
[2,] -61.441368
[3,] -51.778637
[4,] -29.763643
[5,] -1.346922
Now multiply these constants by the random regression coefficient solutions
for each animal, and you have 305-d EBVs for ranking animals.
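For example, with the constants above and a hypothetical vector of random regression solutions for one animal:

```r
# Constants from C305 above, and hypothetical RR solutions.
C305 = c(212.839141, -61.441368, -51.778637, -29.763643, -1.346922)
a1 = c(2.0, -0.5, 0.1, 0.05, -0.02)
# The 305-d EBV is the dot product of constants and solutions.
EBV305 = sum(C305 * a1)   # about 449.76 kg
```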
One can question whether 305 days should be the standard length of lactation.
In 2016 many cows lactate for longer than 305 days, and the analysis was for
test day yields up to 365 days. So a new standard could be 365 day yields.
The constants to use for that standard would be
[1,] 255.2655
[2,]   0.0000
[3,]   1.5855
[4,]   0.0000
[5,]   2.1410
For dairy sheep and goats the standard length might be less than 305 days
because those two species do not lactate as long as cattle.
Note that in the dairy cattle example, it is not valid to calculate EBV for
daily yields beyond 365 days because only test day yields from days 5 to 365
were analyzed.
One of the first new EBV in dairy cattle as a result of random regressions
was persistency. Persistency is the ability of a cow to milk at a high level over
much of the lactation period. This would allow for better feeding of animals,
which could be housed in groups according to high, medium, or low persistency. The
trouble was how to define persistency in a random regression model setting.
The variable, a1i1 , was itself a measure of persistency, but it did not have
any units. Animals with high values were more persistent. Because dairy
producers could not relate to this number, other measures were proposed.
The idea was to have some number that represented the downward slope of
the curve after the peak yield of lactation. Cows differed in the day on which
they expressed peak yield, so the initial point had to be well after the day of
peak yield. The measure was also desired to be independent of peak yield or
total 305-d yield.
Suppose the yield on day 60 of an average, first parity cow was 90 kg of
milk and on day 260 was 68 kg. Then calculate
Persist = (DY260 + 68) / (DY60 + 90),
which should be a number between 0 and 1 in most situations. The higher the
value, the more persistent the cow. The average first parity cow
would have a value of (68/90) = 0.756. Later parity cows tend to have lower
persistency than first parity heifers. Note that a cow could have a persistency
value greater than one, but that should happen very infrequently.
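A minimal numerical sketch, with hypothetical EBV daily yield deviations for one cow:

```r
# Hypothetical EBV deviations in daily yield on days 60 and 260.
DY60  = 2.0
DY260 = 1.5
# Persistency relative to the average first parity cow (90 and 68 kg).
Persist = (DY260 + 68) / (DY60 + 90)
```

A cow whose deviation at day 260 is large relative to her deviation at day 60 gets a value above the population average of 0.756.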
Egg laying production in poultry is similar to lactation production in
dairy cattle, and a similar model approach should be used. Instead of daily
production, one might consider weekly production.
Chapter 13
Growth
Growth curves have been studied in many species of plants and animals, but
usually with non-linear models. Growth is the accumulation of size and mass
of an organism over time. For most agricultural species, growth to maturity
takes only 3 to 4 years at most, but for humans and other larger mammals,
growth can take decades.
In beef cattle, growth is important from birth until the animal reaches
market age, or often only the period from weaning to one year of age is of
interest. In the latter period, growth can be considered almost linear, with a
slight quadratic shape. That is, growth tends to slow down as the animal matures. Early growth from birth to weaning is often ignored although important
in the overall scheme.
A non-linear mathematical model that describes growth from birth to
maturity is the Gompertz function, where weight at time t, WTt, is given by
the following equation:

WTt = BW + A · [1 − exp(−exp(B) · t^C)],
where
t = unit of time, usually in days,
BW is average birthweight,
A, B, and C are parameters that define the shape of the growth curve.
A is related to mature weight, B is related to the day of change from increasing growth rate to decreasing growth rate, and C is related to the
steepness of growth, or how quickly an animal grows to maturity.
Predicted body weights are positive at all ages, and weights hardly ever
decrease, unless an animal is being starved, or it is sick. There are only 4
parameters to estimate (if you include BW), which means that 5 or more
weights per animal are needed to estimate all of the parameters. Unfortunately,
the system is nonlinear; a differential evolution algorithm can
be used to solve it. Figure 13.1 shows the growth curve of a pig from birth to
maturity, where A = 272, B = −12.8, and C = 2.65, and birthweight is 1.5kg.
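The curve in the figure can be reproduced directly from the equation; a small sketch with the parameters just given:

```r
# Gompertz growth function with the pig parameters from the text.
gomp = function(t, BW=1.5, A=272, B=-12.8, C=2.65) {
  BW + A*(1 - exp(-exp(B) * t^C))
}
wt100 = gomp(100)   # weight at 100 days of age
wt200 = gomp(200)   # approaching mature weight
```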
Figure 13.1: Growth curve of pigs; cumulative weight (kg, 0 to 250) against days of age (0 to 200).
The figure emphasizes that growth is cumulative. From the curve, the
amount of weight gained each day can be derived, as shown in Figure 13.2. This is known as
average daily gain, ADG. As can be seen in the figure, ADG is not constant
over the growth period.
Figure 13.2: Growth curve of pigs; daily weight gain (kg, 0 to 1.5) against days of age (0 to 200).
Hence from birth to about 100 days of age, pigs are putting on weight faster
and faster. After 100 days, their rate of gain declines. There are problems with
measuring ADG. Firstly, the magnitude of ADG is small, only 1 to 2.5 kg per
day, so that weigh scales must be precise. Secondly, weight gain depends on
the time of day in which it is taken. Did the pig just defecate or just eat
breakfast? The amount eaten or lost could be as much as 1 to 2.5 kg. Lastly,
you need to weigh pigs every day and this would be very labour intensive,
unless it was computerized and automated. The amount of variation in ADG
from day to day would be large for one animal.
Cumulative weights keep getting larger as the animal ages. Total weights
can be off 1 to 2.5 kg without changing the growth curve dramatically, and the
pigs do not need to be weighed daily, but obviously there are key times when
pigs should be weighed. Birthweights tend to be small relative to mature
weights. Thus, whether the weight at birth is 1.5 kg or 3 kg does not alter
the growth curve substantially, but weights at 200 days of age can differ by 10
to 20 kilograms between animals giving very different growth curves.
13.1 Curve Fitting

13.1.1 Spline Function
Random regression models are linear models, thus the nonlinear Gompertz
function needs to be approximated by linear regressions. The phenotypic shape
may be approximated by a spline function. Let tmax be the maximum age, and
in terms of the pig growth curve, let that be 240 days of age. tmin is day 1, and
let T = t/tmax , and U = (t − 100)/tmax for t > 100 otherwise U = 0. Day 100
is when growth rate starts to decrease with age (Figure 13.2). The phenotypic
curve might be
yt = b0 + b1·T + b2·T^2 + b3·T^3 + b4·U + b5·U^2 + b6·U^3.
Estimates of the regression coefficients from the data in Figure 13.1 are

b0 =     3.77839
b1 =   -96.82071
b2 =  1038.66035
b3 =  -393.28000
b4 =    81.43243
b5 = -1306.77963
b6 =   579.37098
A seven-covariate function to model the trajectory of growth seems too
large to be practical. There are places along the curve that are not fit well.
At the beginning of the growth curve, the spline function will predict that
weights actually decrease after birth, and then turn upwards. Also, weights
do not plateau at maturity, but actually begin to decrease. The errors in
the prediction are at the beginning and end. Inbetween weights are predicted
relatively accurately. The inflection point of 100 days was assumed known,
but this point would not be 100 days for every animal, and would need to be
estimated. Thus, the spline function is not totally suitable.
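As a sketch, the spline regression above can be fitted with lm() in R. The weights below are simulated from a logistic-shaped curve, not the actual Figure 13.1 data, so the estimated coefficients will differ from those shown; the knot at day 100 and tmax = 240 follow the text.

```r
# Sketch: fit the cubic spline with a knot at day 100 using lm().
# The weights y are simulated (logistic-shaped), not the Figure 13.1 data.
set.seed(1)
tmax <- 240
t <- 1:tmax
y <- 240 / (1 + exp(-(t - 130) / 35)) + rnorm(tmax, 0, 2)
T <- t / tmax                                # first covariate set
U <- ifelse(t > 100, (t - 100) / tmax, 0)    # second set, zero before the knot
fit <- lm(y ~ T + I(T^2) + I(T^3) + U + I(U^2) + I(U^3))
length(coef(fit))   # 7 coefficients: b0 through b6
```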
13.1.2 Classification Approach
Given the problems with the spline function, the classification approach could
possibly work much better for all groups of animals, without making any assumptions about the shape of the growth curve or the position of the inflection
point. Over the 240 day age range, make 48 five-day periods and estimate the
mean weights within each period. Unfortunately, that requires estimating 48
means per curve, and thus there needs to be a lot of data points within each
mean.
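A sketch of the classification approach (with simulated ages and weights, purely for illustration): assign each weighing to one of the 48 five-day periods and compute raw period means.

```r
# Sketch: 48 five-day periods over ages 1..240 and their raw mean weights.
# Ages and weights are simulated for illustration only.
set.seed(2)
age <- sample(1:240, 5000, replace = TRUE)
wt  <- 1.5 + 1.1 * age + rnorm(5000, 0, 5)   # roughly linear growth
period <- ceiling(age / 5)                   # period 1 = days 1-5, ..., 48 = days 236-240
pmeans <- tapply(wt, period, mean)
length(pmeans)   # 48 period means
```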
13.2 Model Factors

13.2.1 Observations
Growth observations can be weight, height, length, feed intake, backfat thickness, or loin eye area. Depending on the situation, the growth period could be
from birth to weaning, weaning to slaughter weight, or birth to maturity. If
the period is short term, often growth is linear during this period. A lifetime
curve would look the same as in Figure 13.1. This determines the order of
the random regression covariates that are required. If growth is after weaning,
then maternal genetic effects may be unnecessary and safely ignored. So the
factors listed in this section may or may not be needed, but should at least be
considered in developing a working model for growth.
On a per animal basis there should probably be five or more measures of
growth. Management systems where animals can be weighed automatically
every day should be considered, or where feed intake can be recorded daily.
However, if the management system does not allow weighing more than four
times during the life of the animal, then random regression models should not
be applied. Multiple trait models should be considered as an alternative, where
each weight is a different trait, like birth, weaning, and end of test weights.
13.2.2 Year-Month of Birth-Gender (fixed)
The first fixed factor in the model needs to account for time trends in growth
curves for each gender separately. In some species the male is sometimes
neutered, and so a third gender is needed for these animals, even if the neutering occurs later, after weaning for example.
The classification approach will be used for this factor. Thus, 48 period
means within each subclass implies there should be more than 48 observations
within the subclasses. With 20 years of data, times 12 months of birth per year,
times 3 genders, gives 720 subclasses. Assuming a minimum of 50 observations
per subclass, then there should be more than 36,000 weight measures.
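The subclass arithmetic above can be checked in a few lines:

```r
# Subclass count for Year-Month of Birth-Gender, as described above.
years <- 20; months <- 12; genders <- 3
subclasses  <- years * months * genders   # 720 subclasses
min_records <- subclasses * 50            # at least 36,000 weight measures
c(subclasses, min_records)                # 720 36000
```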
13.2.3 Year - Age of Dam - Gender (fixed)
The birth-dam can be either the genetic-dam or a recipient dam, in the case of
an embryo transfer. Offspring from older birth-dams often outgrow offspring
of first time mothers. This might be because offspring from young mothers are
smaller than those of older dams, or because young mothers do not provide
as many nutrients in the milk as older females do. Age of birth-dam is usually
defined within parity groups. So parity 1 with 2 or 3 age groups, parity 2 with
4 or 5 age groups, and so on. The interactions with year of birth and gender
of offspring probably exist, so it is best to account for them. Years of birth
may be grouped together if there are not enough data.
Legendre polynomials of order 3 can be used for this factor. Hence we are
estimating deviations from the standard curves defined by the Year-Month of
Birth-Gender subclasses.
13.2.4 Contemporary-Management Groups (random)
During growth, animals are usually moved to different management groups
as they get bigger or older. Thus, animals belong to a different contemporary group each time they are weighed. In pigs there is the farrowing barn
during which a dozen or more sows give birth within the same week. All of
the piglets could be one contemporary group, separated by gender. After 20
days, the piglets are moved to growing pens where pigs of different litters are
merged and become competitors for feed and water. Later those animals are
moved to finishing pens where they are fed to market weight. Some could
be selected for potential herd replacements and moved to a different facility.
The contemporaries of a pig are, therefore, constantly changing. Contemporary - Management groups are defined as pigs of roughly the same age and
gender within the same physical environment at the time of weighing. The
contemporary-management group accounts for the environmental effects at
one point in time for a group of similarly treated individuals. We do not estimate a growth curve for each contemporary-management group, but only
the effect on weights of pigs at one point in time. Contemporary-management
groups are a random factor in the model, and there are many of these groups.
The number of animals within a contemporary group is not critical.
13.2.5 Litter Effects (random)
In litter bearing species, such as sheep, goats, and swine, there is a common
litter effect of the group of full-sibs. This has to be matched to the birth-dam
or the raise-dam, if an animal is cross-fostered to another dam after birth.
This is also a random factor in the model and can be modeled by Legendre
polynomials of order 2.
13.2.6 Animal Additive Genetic Effects (random)
In growth data, the additive genetic effects are known as direct genetic effects
(Willham, 1960s). The usual additive genetic relationship matrix, A, is used,
as are phantom parent groups for animals with unknown parents.
Legendre polynomials of order 3 are used to model the animal deviations
from the fixed trajectories, and hence, four parameters per animal for additive
genetic effects to be estimated.
13.2.7 Animal Permanent Environmental Effects (random)
Because animals are weighed several times, permanent environmental effects
must be taken into account. Legendre polynomials of order 3 are used for this
factor, which only exists for animals with records. Cumulative environmental
effects could be considered as well, but constructing the correct mixed model
equations could become complicated.
13.2.8 Maternal Genetic Effects (random)
Growth, in mammals, is influenced by maternal genetic effects (Willham,
1960s; see Chapter 8). That is, the female that gives birth provides an environment during the early growth period of that offspring. Maternal effects
decrease as the animals age. However, some maternal effects can persist a
long time. The female provides this environment to every offspring. Her genetic maternal ability is passed along to her progeny (male and female), but is
only expressed when her female progeny have their own offspring. The three
types of dams, i.e. genetic dam, birth dam, and raise dam, may need to be
considered here (See Chapter 8).
A third order Legendre polynomial would be used for this random factor
too. The additive genetic relationship matrix, and phantom parent groups are
also utilized for this factor.
13.2.9 Maternal Permanent Environmental Effects (random)
Because dams have more than one progeny in the data, there are non-genetic
permanent environmental effects associated with each dam. Legendre polynomials of order 3 are used for this factor too.
13.2.10 Residual Variances (random)
There should be a different residual variance for every day of age, and these
variances should be getting larger over time, as weights increase. One can
calculate phenotypic variances for each of the 48 five-day periods, separately
for each gender. Then express all of the variances relative to the variance at
birth. The assumption is that the residual variances will follow that same
relative pattern. Residual variances can be estimated for each five-day period.
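A sketch of that calculation, using simulated data with a roughly constant coefficient of variation (so variance grows with the mean weight):

```r
# Sketch: phenotypic variance per five-day period, expressed relative to
# the variance in the first (birth) period. Data simulated for illustration.
set.seed(3)
age <- sample(1:240, 20000, replace = TRUE)
wt  <- (1.5 + 1.1 * age) * exp(rnorm(20000, 0, 0.05))  # CV roughly constant
period <- ceiling(age / 5)
pvar   <- tapply(wt, period, var)
relvar <- pvar / pvar[1]   # pattern relative to the birth period
relvar[1]                  # 1 by construction; later periods much larger
```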
13.2.11 Summary
Growth is a very complicated trait. The main problem is having enough weight
measurements on an animal to be able to estimate the trajectories and covariance functions. Maternal genetic effects and the correlation of those with direct
genetic effects adds a degree of difficulty to the model analyses. Also, if each
animal can have up to three different dams influencing its growth, this too can
make a random regression analysis difficult.
If there are only 3 or 4 weights per animal, it may be much easier to analyze
them with a multiple trait model, where each weight is taken at roughly the
same age in all animals. The shape of the trajectory is then not important,
and analyses can be simplified.
13.3 Covariance Function Matrices
A study of pigs on test from day 40 to 250 at a Quebec test station was
conducted on 10,000 pigs. Two pigs per litter were represented in the trial.
Litter effects and maternal effects were ignored. Quadratic random regressions
(using Legendre polynomials) were fitted for contemporary group, animal additive
genetic, and animal permanent environmental effects. Each pig had 7 or more weight
measurements during the test, and almost daily feed intakes.
The covariance function matrices were estimated using Gibbs sampling in
a Bayesian approach to a six-trait model. The six traits were number of times
visiting the feeder (daily), time spent eating (daily), feed intake (daily), weight,
fat thickness, and loin thickness. Below are the submatrices for weights only.

    Kaa = [ 139.47  126.60   42.25
            126.60  125.49   50.19
             42.25   50.19   26.30 ],

    Kpe = [ 117.77   86.97   13.04
             86.97   76.79   22.74
             13.04   22.74   16.81 ],

and

    Kcg = [  80.85   38.39    7.13
             38.39   27.97   11.46
              7.13   11.46    8.26 ].
Using the above matrices, variances for each day on test were calculated,
and then plotted (Figure 13.3).
# Legendre polynomials
LAM=LPOLY(3)
ti=c(40:250)
tmin=40
tmax=250
qi = 2*(ti - tmin)/(tmax - tmin) - 1
x=qi
x0=x*0 + 1
x2=x*x
M=cbind(x0,x,x2)
PH = M %*% t(LAM)
Vpe = PH%*%Kpe%*%t(PH)
vgpe = diag(Vpe)
Vcg = PH%*%Kcg%*%t(PH)
vgcg = diag(Vcg)
Vaa = PH%*%Kaa%*%t(PH)
vgaa = diag(Vaa)
par(bg="aquamarine")
plot(ti,vgaa,col="blue",lwd=5,type="l",xlab="Days on Test",
ylab="Variance, kg-squared",ylim=c(0,1000))
title(main="Variances Over Days on Test")
lines(ti,vgcg,col="red",lwd=5)
lines(ti,vgpe,col="darkgreen",lwd=5)
points(55,900,pch=0,col="blue",lwd=3)
text(55,900,"Genetic",col="blue",pos=4)
points(55,700,pch=0,col="red",lwd=3)
text(55,700,"Contemporary Group",col="red",pos=4)
points(55,500,pch=0,col="darkgreen",lwd=3)
text(55,500,"PE",col="darkgreen",pos=4)
Figure 13.3: Variances (kg-squared) over days on test (40 to 250) for the
genetic, contemporary group, and permanent environmental (PE) components.
Because of the quadratic regression, variances increase as days on test
increase, but the larger increases did not occur until after 150 days. Higher
order polynomials were not appropriate for these data. Residual variances were
divided into 23 periods of 8 or 9 days each. The residual variances ranged from
3.5 kg^2 to 17.18 kg^2, and so were much smaller than the other components.
13.4 Expression of EBVs
With weight as the growth trait, there are two options for expressing the
breeding value of an animal and ranking animals. One option is to choose a
particular age and rank animals on the basis of their EBV for weight at
that age. The other option is to determine the number of days for an
animal to reach a particular weight, for example, 110 kg. The latter option is
essentially a growth rate; you want to select animals with the smaller age.
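The second option can be sketched with uniroot(), applied to an animal's fitted growth curve. The linear curve below is a hypothetical stand-in; any fitted curve function would be used in practice.

```r
# Sketch: days to reach 110 kg from a fitted growth curve, via uniroot().
# The linear curve is hypothetical; substitute the animal's fitted curve.
curve_wt <- function(t) 1.5 + 0.9 * t    # hypothetical weight (kg) at day t
days_to_110 <- uniroot(function(t) curve_wt(t) - 110, c(1, 240))$root
round(days_to_110, 1)   # 120.6 days for this hypothetical curve
```

Smaller values of days_to_110 identify the faster-growing animals.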
For swine and some other species, growth has to be combined with feed
intake. Which animals grew the fastest and ate the least amount of feed?
So an index must be constructed to select for optimum growth. In addition,
a fat carcass is usually not desired, and so carcass quality also needs to be
included in the index. Increasing weight and growth rate could also have
adverse consequences on ease of birth through larger birthweights. Growth is
more than a single trait selection problem.
Chapter 14
Survival
The lifetime of an animal is the age when it dies. For agricultural livestock
animals, however, humans often determine when an animal dies. Some animals
are voluntarily culled because the owner perceives that they are of lesser value
than other animals. At the same time, some animals are involuntarily culled
due to old age, accident, or disease. Producers generally want animals that are
robust and hardy, and which could live a long time. It costs money to feed and
raise an animal to maturity. Animals should have “longevity” or “stayability”.
Animals should be functional, either at producing offspring or producing milk,
meat, eggs, or wool.
The date of an animal dying or leaving the herd (flock) for any reason
gives an uncensored record of survival. An animal’s record is said to be censored
when it has not yet died or been culled due to a lack of adequate opportunities.
All current, active animals are censored. When analyzing survival there are
two possible situations.
1. Censored data are removed from the analysis, or
2. Censored data are included in the analysis.
Animals can relocate from one herd to another through sales. Such animals may be considered culled from the original herd, but are actually still
alive and productive in another herd. Reasons for disposal from herds are
important to determine if records are censored. The analysis of survival should
include censored data, in an appropriate manner.
The age of an animal at the time it is culled is the observation, measured
in days, months, or years. This trait is not normally distributed. For censored
animals, a prediction of length of productive life is usually made based on
probabilities estimated from past data. Thus, if an animal has lived to time
t, then the probability that it will live to the next time, t + 1, is used as the
observation.
A different approach to survival analyses is to define a fixed time period,
such as survival to 60 months of age, yes or no. Then survival to 75 months
of age as another binary trait.
A non-linear approach is where time to failure is modelled. Censored
data can be included. A survival function is derived and from this a hazard
function is created, which is influenced by time-dependent and time-independent
variables.
14.1 Survival Function
Consider 100 months after first calving as the productive life for a dairy cow. A
survival function goes from 1 for an animal that is alive to 0 when the animal
is dead or culled. A vertical line from 1 to 0 indicates the moment in the
productive period when the animal’s function changes, i.e. when the animal is
removed from production. The survival function for one animal is a one-step
function. Figure 14.1 shows the survival functions of 3 cows, where one has
died at 20 months after first calving, one at 45 months, and one at 66 months.
The fourth graph in the lower right of Figure 14.1 is the average step function
for the three cows combined.
Figure 14.1: Survival functions of Cow 1, Cow 2, and Cow 3, and the average
survival function of the 3 cows (frequency vs. months after first calving,
0 to 100).
As more and more cows are accumulated and averaged together, the survival function for the population becomes a smooth curve as in Figure 14.2.
The values on the curve give the expected probability of an animal being alive
in x months after first calving. By the time a cow reaches 100 months, it has
a pretty high probability of not being alive.
Figure 14.2: Population survival function (frequency vs. months after first
calving, 0 to 100).
The approach of Veerkamp et al. (1999) and Galbraith (2003) was to
apply a random regression model. For each cow there would be 100 observation
points of 0 or 1. A cow that has lived 30 months past first calving and which
has not yet been culled, is a censored record. If a cow was censored, then the
step function would be just ones up to the point of being censored (e.g. 30
months), and the next seventy values would be not known, or not observed.
The survival function in Figure 14.2, for this example, is
S_t = (n - d_t) / n,

where t is the month in which an animal was last alive, n is the total number
of live animals that had the opportunity to live for 100 months, and d_t is the
number that have died up to and including period t. Eventually d_t approaches n.
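A minimal sketch of this formula, using the three cows of Figure 14.1 (deaths at 20, 45, and 66 months after first calving):

```r
# Empirical survival function S_t = (n - d_t)/n for three cows with
# deaths at 20, 45, and 66 months after first calving (Figure 14.1).
death <- c(20, 45, 66)
n  <- length(death)
St <- sapply(1:100, function(t) (n - sum(death <= t)) / n)
St[c(19, 21, 50, 100)]   # 1, 2/3, 1/3, 0
```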
14.2 Model Factors
A population survival function is shaped similarly to a lactation curve, and so
using Legendre polynomials of order 4 (5 covariates) may be appropriate for
fitting the general shape. However, because the scale goes from 1 down to 0,
at the beginning of the curve many animals are alive, so that the variation
in the first months after calving is very small. In general, the variance is the
frequency times one minus the frequency, which has the greatest value when
frequency is 0.5. The variance becomes smaller again at the end when most
animals are dead. Legendre polynomials of order 4 (5 covariates) will be used
to model the random animal additive genetic, and permanent environmental
effects.
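The variance statement can be verified directly:

```r
# Variance of a 0/1 survival indicator is p(1 - p), largest at p = 0.5.
p <- (0:20) / 20
v <- p * (1 - p)
p[which.max(v)]   # 0.5
max(v)            # 0.25
```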
14.2.1 Year-Season of Birth-Gender (fixed)
The classification approach can be taken to model the fixed time factor curves
for animals born in the same year and season of the year (perhaps months). If
both genders are being analyzed together, then the additional interaction with
gender is needed.
In the dairy cow example, there would be 100 categories of months alive
after first calving. That is a lot of levels (i.e. parameters) to be estimated,
and requires a lot of animals. At the same time, there are 100 observations for
all uncensored animals.
If one was studying mice, then the time scale has to be altered, and
survival might be related to time after being infected with a deadly virus. Or
there could be a study of bacteria and their survival when exposed to different
antibiotics, measured in hours or minutes. In some cases, there might be only one overall
fixed curve rather than several.
14.2.2 Age at First Calving (fixed)
For dairy cows, the age at first calving could be important to survival after
calving. For mice and bacteria an important variable might be the length of
exposure time before the trial begins.
14.2.3 Production Level (fixed)
Dairy cows that produce at a high level, and therefore make more profit for the
owner, tend to have a survival advantage. Cows should, therefore, be
divided into 3 or 5 categories of production level based upon their EBV for milk
yields or protein yields. These groups could also be modelled by classification
variables or with order 4 Legendre polynomials. A study should be conducted
to see which alternative is more suitable. Adjusting for production level makes
the survival evaluations free of production level, and this is called functional
survival.
14.2.4 Conformation Level (fixed)
Another important factor in dairy cows is their conformation scores. More
favourable looking cows (scoring Good Plus or better) have a higher survival
than cows scoring Good, Fair, or Poor. Making six levels of conformation and
fitting classification variables or order 4 Legendre polynomials is necessary.
Thus, the survival EBV would be free of both production and conformation
considerations.
14.2.5 Unexpected Events
Unexpected events that may have a short- or long-term impact on animal
survival include outbreaks of disease or drought. Animals that would not
normally be culled have to be culled to guarantee the survival of the
herd. This may affect certain types of animals (e.g. low producers, older
animals) more than others. A simple year-month-age of cow subclass effect
(not a curve, but an average percentage survival) could be used to model
routine and unexpected downturns in survival. This could be across all herds
or within provinces or regions of a country. Besides increases in culling, this
factor would also identify periods when it was difficult to replace cows, such
that culling was below normal levels.
14.2.6 Contemporary Groups (random)
Contemporary groups are random effects in the model, and hence modelled
with order 4 Legendre polynomials. The definition of a contemporary for survival analyses would be animals born in the same year-season, of the same
gender, and undergoing the same or similar management practices up to first
calving. Because survival looks at animals over many months and years, animals will move around and be placed in different environments with different
managers, and therefore, under different decision processes. Accounting for
all of these possibilities is difficult, and therefore, the easy option is to leave
animals in their original contemporary group throughout their lifetime. All
subsequent changes cause variation that goes into the residual effects.
14.2.7 Animal Additive Genetic Effects (random)
Animal additive genetic effects are random, also modelled by order 4 Legendre
polynomials. The heritability of survival is generally low due to all of the
environmental influences on the decisions to keep or cull animals.
14.2.8 Animal Permanent Environment Effects (random)
Animal permanent environmental effects are random and account for some of
the environmental influences on each animal. Legendre polynomials of order
4 could be used for this factor too.
14.2.9 Residual Variances (random)
For dairy cows, looking at 100 months after first calving, this period could be
divided into twenty subgroups of five months each. Some trial and error is
needed to get the groupings correct.
14.3 Example
Because this method of survival analysis is not common, a small example will
be used to illustrate it. Consider a beef cattle situation where we want to look at
the survival of cows, as indicated by number of calvings, where the maximum
is set at nine. Thus, there are just nine categories, each representing about 12
months. Assume the data are from two years, and six contemporary groups
for a total of 30 cows. Including ancestors without survival data, there are a
total of 53 animals. None of the animals were inbred. The data are shown in
Table 14.1. Note that four of the records in year 2 were censored, which means
those animals were still active, i.e. not yet culled.
Table 14.1
Example Beef Cow Survival Data.

Year  CG  Cow  Sire  Dam  Calvings
  1    1   24    1     9     7
  1    1   25    1    10     2
  1    1   26    2    11     5
  1    1   27    2    12     6
  1    1   28    2    13     8
  1    2   29    2    14     3
  1    2   30    2    15     2
  1    2   31    3    16     1
  1    2   32    3    17     4
  1    2   33    3    18     4
  1    2   34    3    19     6
  1    3   35    3    20     6
  1    3   36    4    21     6
  1    3   37    4    22     9
  1    3   38    4    23     3
  2    4   39    5     9     5
  2    4   40    5    10     7*
  2    4   41    5    12     5
  2    4   42    6    13     6*
  2    4   43    6    29     4
  2    4   44    6    30     6*
  2    5   45    5    14     2
  2    5   46    6    17     6
  2    5   47    7    18     8*
  2    5   48    7    19     2
  2    5   49    7    35     4
  2    6   50    5    20     4
  2    6   51    7    22     1
  2    6   52    8    23     3
  2    6   53    8    25     5

* indicates censored records
The data can be set up in R as follows.
Sire
5
5
5
6
6
6
5
6
7
7
7
5
7
8
8
dam
9
10
12
13
29
30
14
17
18
19
35
20
22
23
25
Calvings
5
7*
5
6*
4
6*
2
6
8*
2
4
4
1
3
5
# Example data for RRM of survival
cg=c(rep(1,5),rep(2,6),rep(3,4),rep(4,6),
rep(5,5),rep(6,4)) # contemporary groups
YR = c(rep(1,15),rep(2,15)) # Two years
# Pedigrees
aid=c(1:53)
sid=c(rep(0,23),1,1,2,2,2,2,2,3,3,3,
3,3,4,4,4,5,5,5,6,6,6,5,6,7,7,7,5,7,8,8)
did=c(rep(0,23),c(9:23),9,10,12,13,29,
30,14,17,18,19,35,20,22,23,25)
bi=c(rep(1,23),rep(0.5,30))
# Inverse of additive relationship matrix
AI=AINV(sid,did,bi)
y = c(7,2,5,6,8, 3,2,1,4,4,6, 6,6,9,3,
5,7,5,6,4,6, 2,6,8,2,4, 4,1,3,5)
yb= c(9,9,9,9,9, 9,9,9,9,9,9, 9,9,9,9,
9,7,9,6,9,6, 9,9,8,9,9, 9,9,9,9)
sum(yb) # total number of observations
The vector y contains the number of calvings completed, and yb contains the number of calvings that could have been observed up to the current
date.
The covariance function matrices for the random effects were as follows.

    Ka = [  .36814  −.17200   .32359   .00000  −.01844
           −.17200   .35300  −.24600  −.03448   .00000
            .32359  −.24600   .55338   .00000  −.04292
            .00000  −.03448   .00000   .06567   .00000
           −.01844   .00000  −.04292   .00000   .06466 ]

for the additive genetic effects,

    Kp = [  .31894  −.13760   .25719   .00000  −.01844
           −.13760   .30380  −.19680  −.03448   .00000
            .25719  −.19680   .46318   .00000  −.04292
            .00000  −.03448   .00000   .06567   .00000
           −.01844   .00000  −.04292   .00000   .06466 ]

for animal permanent environmental effects, and

    Kc = [  .68214  −.04600  −.03141   .00000  −.01844
           −.04600   .99000   .00900  −.03448   .00000
           −.03141   .00900   .14638   .00000  −.04292
            .00000  −.03448   .00000   .06567   .00000
           −.01844   .00000  −.04292   .00000   .06466 ]

for contemporary groups. Therefore, the assumed heritabilities by number of
calvings are shown in Table 14.2.
Table 14.2
Heritabilities by Number of Calvings.

Calvings  Heritability
   1         0.420
   2         0.349
   3         0.231
   4         0.153
   5         0.191
   6         0.185
   7         0.175
   8         0.218
   9         0.309
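These heritabilities can be reproduced from Ka, Kp, Kc, and the residual variances (the pq vector in the script below), assuming the author's LPTIME returns normalized Legendre polynomials, sqrt((2j+1)/2) P_j(x) with x = 2(t − tmin)/(tmax − tmin) − 1. The helper legrow() is a stand-in for LPTIME.

```r
# Reproduce Table 14.2: h2(t) = va(t) / (va(t) + vp(t) + vc(t) + r(t)).
# legrow() is a stand-in for the author's LPTIME (normalized Legendre).
legrow <- function(t, tmin = 1, tmax = 9) {
  x <- 2 * (t - tmin) / (tmax - tmin) - 1
  P <- c(1, x, (3*x^2 - 1)/2, (5*x^3 - 3*x)/2, (35*x^4 - 30*x^2 + 3)/8)
  sqrt((2 * (0:4) + 1) / 2) * P
}
Ka <- matrix(c( .36814,-.17200, .32359, .00000,-.01844,
               -.17200, .35300,-.24600,-.03448, .00000,
                .32359,-.24600, .55338, .00000,-.04292,
                .00000,-.03448, .00000, .06567, .00000,
               -.01844, .00000,-.04292, .00000, .06466), 5, 5)
Kp <- matrix(c( .31894,-.13760, .25719, .00000,-.01844,
               -.13760, .30380,-.19680,-.03448, .00000,
                .25719,-.19680, .46318, .00000,-.04292,
                .00000,-.03448, .00000, .06567, .00000,
               -.01844, .00000,-.04292, .00000, .06466), 5, 5)
Kc <- matrix(c( .68214,-.04600,-.03141, .00000,-.01844,
               -.04600, .99000, .00900,-.03448, .00000,
               -.03141, .00900, .14638, .00000,-.04292,
                .00000,-.03448, .00000, .06567, .00000,
               -.01844, .00000,-.04292, .00000, .06466), 5, 5)
pq <- c(0.09,0.16,0.21,0.24,0.25,0.24,0.21,0.16,0.09) # residual variances
h2 <- sapply(1:9, function(t) {
  phi <- legrow(t)
  va <- c(phi %*% Ka %*% phi)
  vp <- c(phi %*% Kp %*% phi)
  vc <- c(phi %*% Kc %*% phi)
  va / (va + vp + vc + pq[t])
})
round(h2[1], 3)   # approximately 0.420, as in Table 14.2
```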
The Legendre polynomials for the random factors were order 4, and set
up as
reglp = jd(9,5)*0
tmin=1
tmax=9
no=5
for(i in 1:9){
reglp[i, ] = LPTIME(i,tmin,tmax,no)
}
The design matrices need to be constructed for years, contemporary groups,
and animal genetic and animal permanent environment factors. Recall that
each cow with survival data has 9 observations, unless their record is censored,
then they have fewer than 9 observations. In total, in this example, there were
261 observations. We also need to create the observation vector, YOB, of zeros
and ones, and the residual variances for each of the nine categories. For simplicity, let the residual variances be equal to the numbers given in pq below,
and this script makes the design matrix for years. The year effects make use
of the classification approach, so that there are nine parameters for each year.
X = jd(261,18)*0
YOB = rep(0,261)
ri=YOB
nly=length(y)
pq=c(0.09,0.16,0.21,0.24,0.25,
     0.24,0.21,0.16,0.09) # residual variances
pq = 1/pq # residuals inverted
k=0
for(i in 1:nly){
my = YR[i]
loff=(my-1)*9
ja = y[i] # number of ones
jb = yb[i] # number of obs for animal
for(j in 1:jb){
k=k+1
X[k,j+loff]=1
YOB[k]=1
ri[k]=pq[j]
if(j > ja)YOB[k]=0
}
}
Similarly, the design matrix for contemporary groups is generated as follows. Because contemporary groups are random, they are modelled by order
4 Legendre polynomials (from reglp).
# Contemporary Groups Zc
Zc=jd(261,30)*0
k=0
for(i in 1:nly){
mc = cg[i]
loff = (mc-1)*5 +1
lofl = loff + 4
ja = y[i]
jb = yb[i]
for(j in 1:jb){
k=k+1
Zc[k,c(loff:lofl)] = reglp[j, ]
}
}
Animal additive genetic effects are also modelled by order 4 Legendre
polynomials, as are the animal permanent environmental effects, but which
are a subset of the columns for the animal additive genetic effects.
# Animal Additive
anwr = c(24:53) # animals with survival records (cows 24 to 53)
mcol = 53*5
manc = 23*5 + 1
Za = jd(261,mcol)*0
k=0
for(i in 1:nly){
ma = anwr[i]
loff = (ma-1)*5 + 1
lofl = loff + 4
ja = y[i]
jb = yb[i]
for(j in 1:jb){
k=k+1
Za[k,c(loff:lofl)] = reglp[j, ]
}
}
Zp = Za[ ,c(manc:mcol)]
After the design matrices, one sets up the mixed model equations and then solves them.
# setup MME and solve
ZZ=cbind(Zc,Za,Zp)
RI=diag(ri)
# invert covariance function matrices
Gai=solve(Ka) # additive genetic
Gpi=solve(Kp) # permanent environmental
Gci=solve(Kc) # contemporary group
HI= id(6) %x% Gci
GI=AI %x% Gai
PI=id(30) %x% Gpi
QI=block(HI,GI,PI)
# solve MME
RRS = MME(X,ZZ,QI,RI,YOB)
MME is a routine for setting up mixed model equations and solving them.
See earlier chapters for details of MME and AINV R-scripts.
14.3.1 Year Trajectories
The next step is to look at the solutions and make sense of the results. The
first thing is to look at the year solutions and plot them in a graph.
Figure 14.3: Yearly survival functions for Year 1 and Year 2 (frequency vs.
number of calvings).
Note that in year 1 all of the animal records were uncensored, and therefore, none of them were being observed any longer because they have all been
culled (some years after these data were obtained). In year 2, however, there
were 4 censored records, and therefore, the line for year 2 is not fully completed, and will not be until all the animals in year 2 have been culled. The
line for year 2 could still change, but the line for year 1 is essentially complete
and not likely to change very much in future analyses. It will change a little
due to adding relatives' information in later years.
With only 261 observations, the fixed curves (i.e. trajectories) are not
very smooth. If there were several thousand records per year, then the curves
might be smoother looking.
14.3.2 Contemporary Groups
Contemporary groups were modelled by order 4 Legendre polynomials. The
solutions are shown in Table 14.3.
Table 14.3
Random regression solutions for contemporary groups.

Group      c0        c1        c2        c3        c4
  1      0.12264  -0.00160  -0.04552  -0.02469   0.00526
  2     -0.28856  -0.01639   0.09221   0.00770  -0.02311
  3      0.16592   0.01799  -0.04669   0.01699   0.01784
  4      0.24412  -0.02660  -0.07702   0.00605   0.01125
  5     -0.03720   0.05555   0.03736  -0.01745  -0.01820
  6     -0.20692  -0.02895   0.03966   0.01140   0.00695
From the values in Table 14.3, it is not easy to know which contemporary
groups had greater or lesser survival rates. One needs to calculate survival
differences for each of the nine categories using the Legendre polynomials, and
then one must add the year trajectories for the years in which those contemporary groups were nested. Thus, year 1 trajectory is added to contemporary
groups 1, 2, and 3, and year 2 trajectory is added to contemporary groups 4,
5, and 6. Then those values can be plotted as shown in Figure 14.4.
Figure 14.4: Contemporary group survival functions for CG 1 to CG 6
(frequency vs. number of calvings).
From Figure 14.4, contemporary groups 1, 3, and 4 had the better survival
rates, and these corresponded to positive values for c0 and negative values for
c2. Usually, the survival rates of contemporary groups are not of interest, but
they need to be taken into account in calculating animal EBVs.
14.3.3 Animal Estimated Breeding Values
With the trait of survival, interest is primarily in sires and how they rank on
daughter survival. As with the contemporary groups, the solutions for the
regression coefficients are not informative on their own. Multiply them by the
Legendre polynomials for the nine categories, and add the year 1 trajectory to
those numbers. Year 1 was chosen because all animals in that year have been
culled (uncensored data). Usually one would take the latest year in which all
animal records are uncensored. The results for eight sires are given in Figure
14.5.
Figure 14.5: Sire daughter survival functions for eight sires (frequency vs.
number of calvings).
Notice the difference from Figure 14.4. There were greater differences
among contemporary groups than among sires. To pick up differences among
sires one has to look at the end of the trajectories, or category 9. Sires rank
differently depending on which number of calvings you want to consider as the
ranking criteria. The sires and their ranks at the 1st , 5th , and 9th calvings are
given in Table 14.4.
Table 14.4
Sire rankings at 1st, 5th, and 9th calvings.

Sire  1st  5th  9th
  1    7    6    7
  2    4    4    4
  3    2    2    6
  4    5    5    1
  5    6    7    5
  6    8    1    8
  7    3    8    2
  8    1    3    3
Which sire would you choose to use in future matings?
14.3.4 Covariance Matrices
The covariance function matrices used in the example were not estimated from
real data, but were concocted for illustration purposes. Still, by using an order
4 Legendre polynomial for the random factors, the variances at the first and
ninth calvings were artificially high (Figure 14.6). As already mentioned,
the variances at the first and ninth calvings should be the smallest, and the
largest should occur at the fifth calving. A full study using a very large data
set needs to be conducted. The random regression model approach to survival
analyses seems appropriate and useful. Comparisons to other methods may be
warranted (Jamrozik et al. 2008).
Figure 14.6: Variances Over Categories. Genetic, permanent environmental, contemporary group, and residual variances plotted against number of calvings.
Part III
Loose Ends
Chapter 15
Selection
The selection process must be clearly defined in order to make any statistical
attempts at correcting for it. There are some selection processes that do not
have any statistical correction methodology. Selection is any process that could
lead to a biased and/or inaccurate evaluation of an animal's underlying genetic
value for one or more traits, resulting in the wrong choice of animals to be used
in future matings, and incorrect determination of variance components, genetic
correlations, or genetic changes in a population.
15.1 Randomization
Nearly all statistical procedures have been developed with the concept of random
sampling or randomization. For example, if you generate 5,000 random
normal deviates from N(0, 100) and average the results, you should get a value
close to 0, and a variance close to 100, depending on the quality of the random
number generator. Below is a script for repeating that process 1,000 times.
wsam = c(1:1000)*0
wsdm = wsam
for(i in 1:1000){
  n = 5000
  w = rnorm(n, 0, 10)   # mean 0, standard deviation 10 (variance 100)
  wsam[i] = mean(w)
  wsdm[i] = var(w)
}
mean(wsam)  # -0.000172
mean(wsdm)  # 100.00067
If the 5,000 random normal deviates are sorted from high to low, and the
top 50% taken as a sample, then the mean and variance of the sample will not
estimate the original parameters of the normal distribution.
vsam = c(1:1000)*0
vsdm = vsam
for(i in 1:1000){
  n = 5000
  w = rnorm(n, 0, 10)
  m = 2500
  ka = order(-w)     # indices of w sorted from high to low
  v = w[ka]
  vs = v[1:m]        # top 50% of the sample
  vsam[i] = mean(vs)
  vsdm[i] = var(vs)
}
mean(vsam)  # 7.968375
mean(vsdm)  # 36.38509
The expected mean of the select sample would be

    µ_s = µ + i · σ,

where i is the selection intensity (in animal breeding terms), i.e. the mean of
the select sample in standard deviation units; for the top 50% selected, i = 0.798.
Thus, µ_s = 7.98 is the expected value, and the R script gave 7.968375 for the
average of 1,000 samples.
The expected variance is

    Var_s = (1 − i²) · σ²,

which gives about 36.3. (This form holds for 50% selection, where the truncation
point is 0; in general, Var_s = (1 − i(i − x)) · σ², with x the standardized
truncation point.) Hence a select sample of observations can give results
greatly different from the original parameters. Only if one knows how the
sample was selected can the true parameters of the population be estimated.
That is, if the value of i is known, the original mean and variance can be
calculated by working backwards.
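These expectations can be checked directly. The sketch below (in Python rather than the chapter's R) computes i = φ(x)/p for truncation selection on a normal distribution and reproduces the expected mean and variance of the top 50%:

```python
import math

# truncation selection on N(0, 100): expected mean and variance
# of the top 50% (truncation point x = 0 for 50% selected)
mu, sigma, p = 0.0, 10.0, 0.5
x = 0.0
i = math.exp(-0.5*x*x) / math.sqrt(2.0*math.pi) / p   # i = phi(x)/p

mean_s = mu + i*sigma                  # expected mean of the select sample
var_s = (1.0 - i*(i - x)) * sigma**2   # general form; equals (1 - i^2)*sigma^2 at x = 0

print(round(mean_s, 3))   # 7.979, matching the simulated 7.968
print(round(var_s, 3))    # 36.338, matching the simulated 36.385
```

Working backwards, the original parameters are recovered as µ = µ_s − i·σ and σ² = Var_s / (1 − i(i − x)), which is only possible when the selection rule (and hence i and x) is known.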
In animal breeding situations, the type of selection is unknown, and thus
i is unknown, because there are hundreds of owners of the animals and each is
making their own selection decisions. In the above example, the select sample
may not be the top 50%, but perhaps some other type of sample. The data
available for analysis is the consequence of the accumulation of non-random,
human-influenced decisions, which are non-reproducible.
Producers decide which pairs of animals will mate to produce the next
generation. Due to costs of production, producers eliminate inferior animals
early in life, and thus, those animals never have the opportunity to reproduce.
With the existence and usage of genetic DNA markers, some embryos are
not allowed to be born because their marker genotypes are not acceptable.
Non-randomly generated data imply that most statistical methodologies could
produce biased results. Biased results imply there could be errors in mating
and culling decisions. The question becomes how serious is the bias and how
big can the errors in decisions be? Ultimately, is there a methodology that
will measure the bias and correct the data for it?
15.2 Multiple Traits
In some species, like beef cattle, animals are weighed at birth, at weaning, and
at one year of age. Not all calves survive to weaning due to various environmental
issues. After weaning, based on their weights and gender, individuals
are culled, particularly males. Culled animals generally do not record a yearling
weight. Mature growth, however, is an important economic trait for beef
producers.
Let there be two traits with the following covariance matrix, factored as V = LL':

    V = | 100  40 |  =  | 10  0 | | 10  4 |
        |  40 116 |     |  4 10 | |  0 10 |
Generate two correlated vectors of random normal deviates, and randomly
discard half of the second vector. Compute the mean and variance of the
variables in the first and second vectors.
L = matrix(data=c(10,0,4,10), byrow=TRUE, ncol=2)  # Cholesky factor of V
wsam = c(1:1000)*0
wsdm = wsam
vsam = wsam
vsdm = wsam
n = 5000
m = 2500
for(i in 1:1000){
  w = rnorm(n, 0, 1)
  v = rnorm(n, 0, 1)
  W = rbind(w, v)
  P = L %*% W        # correlated deviates with covariance matrix V
  p1 = P[1, ]
  p2 = P[2, ]
  q2 = p2[1:m]       # select a random half of p2
  wsam[i] = mean(p1)
  wsdm[i] = var(p1)
  vsam[i] = mean(q2)
  vsdm[i] = var(q2)
}
mean(wsam)  # 0.00271
mean(wsdm)  # 99.9275
mean(vsam)  # -0.0014988
mean(vsdm)  # 116.0756
With random selection of trait 2, the means and variances are as expected,
for 5,000 observations for vector 1 and 2,500 for vector 2.
Now remove 50% of vector 2 observations after sorting vector 1 from high
to low. The script changes as follows:
n = 5000
m = 2500
for(i in 1:1000){
  w = rnorm(n, 0, 1)
  v = rnorm(n, 0, 1)
  W = rbind(w, v)
  P = L %*% W
  p1 = P[1, ]
  p2 = P[2, ]
  ka = order(-p1)    # re-order both traits based on trait 1
  q1 = p1[ka]
  q2 = p2[ka]
  wsam[i] = mean(q1)
  wsdm[i] = var(q1)
  vs = q2[1:m]       # keep trait 2 only for the top half on trait 1
  vsam[i] = mean(vs)
  vsdm[i] = var(vs)
}
mean(wsam)  # 0.0037
mean(wsdm)  # 100.0342
mean(vsam)  # 3.2017
mean(vsdm)  # 105.8251
Nonrandom selection of trait 2 has increased the mean and decreased the
variance of trait 2.
The literature suggests that analyzing traits 1 and 2 in a multiple trait
model will give unbiased estimates of breeding values (Pollak and Quaas, 1980),
as long as all animals have trait 1 included. Is this really true? The script was
modified to predict the missing trait 2 values from the trait 1 values that were
present. They were predicted using the regression of trait 2 on trait 1 computed
from the true parameter values: b = 40/100 = 0.4.
n = 5000
m = 2500
km = c(2501:5000)       # positions of the culled (missing trait 2) animals
r = 40/100              # regression of trait 2 on trait 1
for(i in 1:1000){
  w = rnorm(n, 0, 1)
  v = rnorm(n, 0, 1)
  W = rbind(w, v)
  P = L %*% W
  p1 = P[1, ]
  p2 = P[2, ]
  ka = order(-p1)
  q1 = p1[ka]
  q2 = p2[ka]
  wsam[i] = mean(q1)
  wsdm[i] = var(q1)
  v1c = q1[km]          # trait 1 values of animals missing trait 2
  v2s = q2[1:m]         # observed trait 2 values
  v2c = v1c * r         # predict missing trait 2 values
  vs = c(v2s, v2c)
  vsam[i] = mean(vs)
  vsdm[i] = var(vs)
}
mean(wsam)  # -0.005937
mean(wsdm)  # 99.9593
mean(vsam)  # -0.002093
mean(vsdm)  # 65.9069
Thus, there is some truth that a multiple trait model accounts for selection
bias in trait 2 given selection on trait 1 values. The mean of trait 2 seems to
be estimated correctly, but the variance of the observed plus predicted trait 2
values (65.9) is much lower than the true variance (116). However, estimation
of the mean is usually more important.
In a real situation the true parameters (without selection involved) would
not be known, and would have to be estimated from the observed values. The
greater the intensity of selection on trait 1, and the smaller the regression of
trait 2 on trait 1, the poorer the ability to predict the missing trait 2 values.
For example, if only 10% of trait 2 values were observed and the correlation
between traits was only 0.05, then predicted trait 2 values would all be very
close to zero and have a very low variance.
In conclusion, a multiple trait analysis involving missing values for one or
both traits will clearly be better than separate analyses of individual traits by
themselves. However, the estimated covariance matrices will not be estimates
of the true covariance matrices, i.e. those based on traits that have not been
subject to selection.
The true covariance matrix must be estimated using animals that have both
trait 1 and trait 2 observations, and that have not been subjected to selection.
15.3 Better Sires In Better Herds Problem
In the first days of the Northeast AI Sire Comparison at Cornell University in
1968 the AI industry believed that the better sires were positively associated
with the most favourable herds, and that consequently, sire ETAs would be
biased in the sire model evaluations. This could be visualized as ranking
the herds based on the true magnitudes of their effects on milk production
(due to good management, good economics, and favourable environmental
circumstances) and then those herd owners choosing sires based on the sires’
true breeding values. Given that no one could ever actually know the true
values of either herd effects or sire transmitting abilities, the degree of actual
bias might have been very small, if it existed at all. However, herd owners
with lots of money used sires with high semen prices, and money was associated
with better herds and better sires.
Henderson and his team at Cornell believed that the bias must exist. He
thought (based on his own selection bias theories, 1975) that if he treated
herd-year-season effects as fixed effects, then the bias would not show up in the
sire ETAs (Schaeffer, 2018). Thereafter, herd-year-season effects have always
been a fixed factor in genetic evaluation models. Tests of whether the bias was
present, and whether the model (with herd-year-season as fixed) actually removed
it, were never performed.
15.3.1 The Simulation
In order to examine this problem, daughters of sires were generated over 100
herds and nine years for a trait with heritability of 0.30. True herd values were
sorted from best value to worst. There were 200 sires used per year, and these
were sorted from high to low true breeding values. The first two sires were
used in herd 1, the next two sires used in herd 2, and so on. The poorest two
sires were used in herd 100. Thus, the degree of bias should be maximized.
Every year 1,500 females were mated to 200 sires giving two progeny with each
mating, or 3,000 new animals per year. A total of 27,000 animals with records
were available for analysis in each replicate.
Data were analyzed with a sire model in two different ways. First, the
model equation was

    y = Xb + Wh + Zs + e

where

b represents a vector of fixed year effects,

h represents a vector of random herd-year-season effects, having a mean of
0 and covariance matrix Iσ_h²,

s represents a vector of random sire transmitting abilities, having a mean of 0
and covariance matrix Iσ_s²,

e represents a vector of random residual effects, having a mean of 0 and
covariance matrix Iσ_e², and

X, W, and Z are design matrices relating observations to years, herd-year-seasons,
and sires, respectively.

This was the original model for the Northeast AI Sire Comparison.
The second model equation was
y = Wh + Zs + e
where the variables are as described above, except that h is now treated as a
fixed factor, and therefore, fixed year effects are not needed in the model due
to confounding with the herd-year-season effects.
Data were also generated completely randomly, without sorting herd true
values or true sire transmitting abilities. These were also evaluated with both
sire models. One hundred replicates were made for each scenario.
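Computationally, the only difference between the two analyses is whether a variance ratio is added to the herd-year-season diagonals of the mixed model equations. A minimal sketch (in Python rather than R; the records and variance components are made up, and the fixed year effects of the first model are omitted to keep the example small):

```python
import numpy as np

# made-up records: (hys, sire, observation); 2 herd-year-seasons, 3 sires
records = [(0, 0, 12.0), (0, 1, 9.0), (0, 2, 11.0),
           (1, 0, 7.0), (1, 1, 5.0), (1, 2, 6.0)]
nh, ns = 2, 3
var_e, var_h, var_s = 60.0, 12.0, 5.0   # made-up variance components

W = np.zeros((len(records), nh))        # incidence matrix for hys
Z = np.zeros((len(records), ns))        # incidence matrix for sires
y = np.zeros(len(records))
for r, (h, s, obs) in enumerate(records):
    W[r, h], Z[r, s], y[r] = 1.0, 1.0, obs

X = np.hstack([W, Z])
lhs = X.T @ X                           # least squares part of the MME
rhs = X.T @ y

# hys fixed: variance ratio added only to the sire diagonals
lhs_fixed = lhs + np.diag([0.0]*nh + [var_e/var_s]*ns)
# hys random: variance ratio added to the hys diagonals as well
lhs_random = lhs + np.diag([var_e/var_h]*nh + [var_e/var_s]*ns)

sol_fixed = np.linalg.solve(lhs_fixed, rhs)
sol_random = np.linalg.solve(lhs_random, rhs)
print("sire solutions, hys fixed :", np.round(sol_fixed[nh:], 3))
print("sire solutions, hys random:", np.round(sol_random[nh:], 3))
```

Treating hys as random shrinks the herd-year-season solutions toward zero, which changes how much of the herd differences is absorbed before the sire solutions are computed; the simulation results below measure the consequence of that choice for accuracy.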
15.3.2 Results
The correlation between sire true and estimated transmitting abilities was
used as the criterion to compare analyses. The numbers are given in Table 15.1.
Table 15.1
Correlations between true and estimated sire transmitting abilities.

  Association Bias   HYS      Correlation
  No                 Random   0.617 ± 0.019
  No                 Fixed    0.534 ± 0.017
  Yes                Random   0.159 ± 0.019
  Yes                Fixed    0.093 ± 0.014
The following conclusions could be made.
1. Treating HYS as fixed resulted in lower accuracies, whether sire-herd
associations were present or matings were completely random.

2. With sire-herd associations, the accuracy was decreased substantially,
although the amount of bias created was substantial too.
3. Henderson’s selection bias theory does not appear to be justified.
4. A statistical solution to correct for this bias does not seem obvious.
The same scenarios were examined using an animal model, with additive
genetic relationships among animals. The correlations between true and estimated
breeding values of sires were calculated. The results are in Table 15.2.
Table 15.2
Correlations between true and estimated sire breeding values.

  Association Bias   HYS      Correlation
  No                 Random   0.783 ± 0.012
  No                 Fixed    0.761 ± 0.015
  Yes                Random   0.902 ± 0.010
  Yes                Fixed    0.879 ± 0.010
The presence of sire-herd associations caused an increase in the correlations of EBV with true values, but EBV were also highly correlated to the bias
(difference between true value and EBV). Treating HYS as a fixed factor in
either sire or animal models was not successful in removing bias due to better
sires being used in better herds.
Another indirect conclusion is that the actual bias due to the better sires
being used in the better herds was probably not very great back in 1969.
Treating HYS as fixed was not the optimal solution that was needed, and
probably resulted in lower accuracy of sire ETAs for many years. The effect
of the bias in animal models does not appear to be as great as in sire models,
probably due to the additive genetic relationship matrix.
15.4 Nonrandom Mating
Selective mating (nonrandom mating) is the key tool for livestock owners to
make genetic change in their animals. Kennedy et al. (1988) discussed the
genetic properties of animal models, and showed that animal models with
the incorporation of the additive genetic relationship matrix, A, account for
nonrandom matings because the parents of each animal with observations are
identified. A qualifier was that pedigrees of each animal should be traceable
back to a base population of randomly mating individuals.
The previous section demonstrated the increase in accuracy that is possible
with an animal model over a sire model. This was mainly due to accounting
for nonrandom matings, and to including all relatives' information in the
evaluations. Determining the base population is non-trivial, time-consuming,
and adds a great number of animals to the dimensions of A. Thus, a question
that often arises is how many generations of pedigrees are actually needed
to achieve the benefits from an animal model. For example, there may be
observations on animals from the last two years, and their parents are known,
but not any further back in time. How far back in the ancestry must one
go to make the animal model worthwhile? The depth of pedigree has two
ramifications: first, the accuracy of estimated breeding values, and second,
the effect on the estimated genetic variance. In all cases the progeny
of each sire and dam pair should be random with respect to Mendelian sampling
effects.
A single trait was simulated using a base population of 200 sires and 1500
females. Each year the sires and dams were randomly mated to each other. At
the end of each year 200 males and 1500 females were randomly selected from
all available individuals. Nine years were simulated. Data were then restricted
to those from years 8 and 9. Those animals plus their parents constituted
the data for the first animal model analysis. An additive genetic relationship
matrix was constructed based only on the animals with records (6,000) plus
their parents. This was referred to as Data 0.
Subsequent data sets were formed by including the next generation of
parents. Hence the pedigree became more complete with each scenario, giving
Data 1, Data 2, Data 3, and Data 4. Finally, all of the pedigrees over the nine
years of simulation were included, Data All, with the 6,000 records from years
8 and 9. For each data set the correlation of estimated breeding values (EBV)
with true breeding values (TBV) was calculated for the 6,000 animals with
records, using the true parameters in the analysis. Also, a separate analysis
was made to estimate the variance parameters of the model for herd-year-seasons,
animal additive genetic effects, and residual. One hundred replicates were
made for each scenario.
Table 15.3
Completely random matings.
Correlations of EBV with TBV, and estimates of variances.

  Scenario  Ancestors  Correlation   Genetic     HYS         Residual
  Data 0     9,022     .601 ± .014   29.9 ± 3.1  11.1 ± 3.1  58.4 ± 2.7
  Data 1    11,798     .606 ± .015   30.0 ± 2.9  11.0 ± 2.9  58.7 ± 2.6
  Data 2    12,600     .613 ± .013   30.3 ± 2.6  10.7 ± 2.6  58.8 ± 2.7
  Data 3    12,843     .612 ± .015   30.3 ± 2.7  10.7 ± 2.7  58.7 ± 2.8
  Data 4    12,796     .612 ± .015   30.6 ± 2.7  10.4 ± 2.7  58.6 ± 2.9
  Data All  28,700     .610 ± .013   30.1 ± 3.4  10.9 ± 3.4  59.1 ± 2.8
The conclusion is that pedigrees containing parents, grandparents, and
great grandparents of animals with records are equivalent in accuracy to having
all pedigrees available. Retrieving ancestors further back in the pedigree does
not result in increases in accuracy (correlations of TBV with EBV for animals
with records only). Note also that the estimates of genetic, HYS, and residual
variances are nearly unbiased in all cases.
The above results were for the case of no selection of animals for breeding. The same simulations were repeated with selection of sires and dams
based on their phenotypes. Hence fewer animals became parents because their
phenotypes prevented them from being selected.
Table 15.4
Sires and dams selected based on phenotypes.
Correlations of EBV with TBV, and estimates of variances.

  Scenario  Ancestors  Correlation   Genetic     HYS         Residual
  Data 0     8,299     .640 ± .026   29.7 ± 2.7  11.3 ± 2.7  58.9 ± 2.6
  Data 1    10,218     .671 ± .032   30.7 ± 2.9  10.3 ± 2.9  58.1 ± 2.4
  Data 2    11,634     .681 ± .030   30.3 ± 2.7  10.7 ± 2.7  58.3 ± 2.5
  Data 3    12,064     .682 ± .030   30.6 ± 2.6  10.4 ± 2.6  58.4 ± 2.4
  Data 4    12,182     .684 ± .034   30.6 ± 2.7  10.4 ± 2.7  58.1 ± 2.7
  Data All  28,700     .680 ± .032   30.8 ± 2.3  10.2 ± 2.3  58.1 ± 2.4
The accuracies of EBV were greater when selection was present, presumably
because there were fewer parents, each with more progeny among the 6,000
animals with records.
The addition of parents, grandparents, and great grandparents was enough to
increase the accuracy to the same level as having all pedigrees. The standard
error of the average correlations was greater in the case of selective matings
as opposed to random matings, being almost twice as great. Thus, there are
consequences to selection. The estimated variance components were nearly
equal to the true parameters in both the random mating and selective mating
scenarios.
15.5 Masking Selection
Masking selection is where animals are treated such that they are not allowed
to express their true phenotype, and hence their true genotype. An example
is where a champion race horse that has won many races is replaced by a
look-alike horse. People in the know then bet against the 'champion' because
they know the look-alike will not win any races. There is a similar case in a
Sherlock Holmes story.
In dairy cattle, more and more cows are being drugged in order to synchronize
their heat cycles and to improve the chances of conception. Days from
calving to first service, and days from first service to conception, are traits
that can be manipulated phenotypically through the use of drugs. Instead of
120 d to first service, drug use could move that up to 80 d. This is a great
management tool for herd owners that do not have good heat detection abilities.
The question is, should the phenotype of 80 d be used in an animal model to
evaluate cows and bulls for fertility, or should records that are influenced by
drugs be removed from the data analysis? However, removing data is itself a
form of nonrandom sampling that may cause bias too.
To demonstrate the bias, a simple model,

    y_i = µ + a_i + e_i,

for i going from 1 to N, was used with four different heritabilities (0.05,
0.10, 0.20, and 0.30), and N = 30,000. The assumed phenotypic mean and variance
were 86 and 750, respectively.
N = 30000
h2 = 0.05               # heritability
vrp = 750               # phenotypic variance
vrg = h2*vrp            # genetic variance
vre = (1-h2)*vrp        # residual variance
sdg = sqrt(vrg)
sde = sqrt(vre)
wa = rnorm(N, 0, sdg)   # true breeding values
we = rnorm(N, 0, sde)   # residuals
phen = 86 + wa + we     # true phenotypes
mean(phen)
var(phen)
mean(wa)
var(wa)
cor(phen, wa)
To apply masking selection, the phenotypes in phen were altered. Any
values above 80 were set equal to 86, the overall mean. This represents cows
that may have been treated with drugs in order to time their estrus and get
them pregnant earlier than they might have. The drugs were assumed to be
100% effective. Then the means and variances and the correlation between the
new phenotypes and the true genetic values were calculated, as shown in the
following script.
ka = (1:N)[phen > 80]   # cows with true phenotypes above 80
phan = phen
phan[ka] = 86           # masked phenotypes set to the overall mean
mean(phan)
var(phan)
cor(phan, wa)
wb = wa[ka]             # true genetic values of the masked cows
mean(wb)
var(wb)
In real life, only phan is observed, not phen. wb is a vector of the true
genetic values of the animals whose true phenotypes were masked by the use
of the drugs.
The results from one replicate at each heritability are in Table 15.5.
Table 15.5
Effects of masked selection.

                                           Heritability
  Item                              0.05     0.10     0.20     0.30
  True Phenotypic Mean             85.88    86.21    85.85    85.9256
  True Phenotypic Var.            740.12   742.98   739.67   756.03
  True Genetic Mean                 0.02    -0.05    -0.08    -0.03
  True Genetic Var.                37.16    75.36   150.18   224.69
  Cor(Phenotype,Genotype)           0.2191   0.3217   0.4377   0.5452
  Masked Phenotypic Mean           75.35    75.45    75.35    75.26
  Masked Pheno. Var.              259.22   254.97   258.41   264.96
  Genetic Mean of Masked Cows       0.88     1.80     3.47     5.48
  Genetic Var. of Masked Cows      36.48    71.54   133.37   185.20
  Cor(Masked Phenotype,Genotype)    0.1841   0.2701   0.3694   0.4694
The mean and variance of records including the masked animals are the same
regardless of heritability, because essentially the same percentage of cows is
being masked in each case. However, the genetic mean of the cows that were
treated with drugs actually goes up (becomes worse) as heritability increases.
Also, the correlation between the phenotypes of all animals (true and masked
together) and the true genetic values decreases more as heritability increases.
Thus, given that fertility has a low heritability, the decrease in the
correlation is only about 0.03 at a heritability
of 0.05, and this does not seem too bad. However, decreasing the accuracy of
evaluation for a trait with an already low heritability is not an advantage.
Using true and masked phenotypes in genetic evaluation programs is not a
recommended practice. Bias due to masking is significant. Removing masked
phenotypes from the analysis is also not recommended. Animals which are
treated with drugs are not a random sample of all cows, but mostly those that
may have a problem with conception.
In practice, the amount of bias will depend on the degree of drug usage in
the population. Masking will make genetic evaluations less accurate, causing
inferior animals to obtain EBVs that may rank them higher than they should
be ranked.
15.6 Preferential Treatment
Preferential treatment is the effect of an owner (a human influence) on the performance of one or more individuals within their own environment. This includes separating an animal from its contemporaries into a more favourable
environment for special care, feed, or lower disease risk. At the same time,
the remaining animals may be considered to be receiving negative preferential
treatment. There could even be individuals that are deliberately mistreated
or underfed.
Without constant police surveillance and monitoring, the amount of preferential treatment and its effect on performance are highly variable from animal
to animal. During normal record keeping programs, the animals which have
been preferentially treated can not be easily identified. Owners are also not
likely to volunteer that information. The purpose is usually to make an animal look much better than its herdmates or contemporaries, so that it receives
higher EBVs or makes higher records leading to an increase in its monetary
value.
There is little that genetic evaluation centres can do to remove animals
that have been preferentially treated, or to account for the treatment in genetic evaluation
statistical models. See Tierney and Schaeffer (1994) for a summary of a study
on preferential treatment using semen price to identify bulls whose daughters
might more likely be preferentially treated.
Due to the unpredictable nature of preferential treatment, one might consider it a factor that inflates the overall residual variances of traits.
Residual variances within contemporary groups could be used to categorize
potential preferential treatment situations. Unfortunately, claiming a herd
owner has preferentially treated their animals could lead to legal civil action
against the genetic evaluation centre, if proof of preferential treatment can not
be established.
15.7 Nonrandom Progeny Groups
Prior to the availability of genetic markers, the animals that were born from
a specific sire and dam mating could always safely be called a random group
of progeny from that pair (Tyriseva et al. 2018). There was no way to restrict
how the DNA was contributed from either parent and no way to affect how that
DNA combined into an embryo. Thus, progeny means were good estimators of
sire transmitting ability as long as sires were mated randomly to dams. With
the animal model, identifying the parents of each animal with records accounts for the
mating structure (random or not). However, if embryos are first collected, then
genotyped for 200,000 SNP markers, and subsequently only the embryos with
the highest probability of high economic merit are implanted into females, then
those progeny (from the same sire and dam) will no longer be a random sample
of all possible progeny. The bias incurred is proportional to the accuracy of
the pre-selection process. The Mendelian sampling variability of the selected
progeny will be smaller than one-half the genetic variance.
Suppose 20 embryos are produced from one mating and genotyped, and then
the best five are implanted to produce offspring while the other 15 are discarded.
What happens to EBVs from an animal model? Bias is generated
as in Section 15.1. Genetic evaluations should then be based solely on individual
animal performance. An animal model could be used, but relationships to
parents must be ignored, except for progeny that were naturally conceived. All
pre-selected embryos should be coded with unknown parents. Including links
to their parents would bias the parent EBVs upwards, and would bias comparisons to other sires, to contemporaries of the pre-selected progeny, and so
on through the relationship matrix.
Note that pre-selection of embryos within those from a single sire-dam
mating results in nonrandom progeny groups, and these can affect all traits
of economic importance, not just one trait, because all traits are genetically
correlated.
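The reduced Mendelian sampling variability can be illustrated with a small simulation of the 20-embryos-implant-5 scheme described above. This is a Python sketch; the genetic variance of 30 and the assumption of perfectly accurate pre-selection are illustrative assumptions, not values from the text.

```python
import random
import statistics

random.seed(1)
var_a = 30.0                    # assumed additive genetic variance
ms_sd = (0.5 * var_a) ** 0.5    # Mendelian sampling SD; variance is half of var_a

kept = []
for mating in range(2000):
    # 20 embryos from the same sire-dam pair differ only by Mendelian sampling
    embryos = [random.gauss(0.0, ms_sd) for _ in range(20)]
    # implant only the best 5, assuming perfectly accurate pre-selection
    embryos.sort(reverse=True)
    kept.extend(embryos[:5])

# the kept embryos have a positive mean and much less than half the
# genetic variance, so they are not a random sample of possible progeny
print(statistics.mean(kept))
print(statistics.variance(kept))
```

Under these assumptions the variance among implanted embryos falls to roughly a quarter of the full Mendelian sampling variance, while an animal model with pedigree links would still assume a variance of var_a/2 for them; that mismatch is the source of the bias.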
15.8 Summary
Many types of selection can lead to biased EBVs from sire or animal models.
One must be clear about the type of selection being discussed. In most cases,
the animal model does not solve or prevent a bias from affecting EBVs. Eliminating biased data from analyses could also lead to bias. For the first time,
genomics has the potential to create nonrandom progeny groups by discarding
embryos that do not index high after genotyping. Researchers should be more
vigilant about the potential for bias to creep into statistical analyses. Statistical procedures are largely based on random sampling within and between
treatment groups.
This chapter has demonstrated several types of nonrandomness that could
hurt animal model analyses.
Chapter 16
Genomics
Back in the 1970s the field was called biotechnology, the buzzword that attracted lots of research dollars even though the practical applications were
more than 30 years in the future. The promises were that individual genes
of major importance would be found, and that humans could control what
happens with them. During those 30 years more genetic change was made
using statistical models and lots of data. In the early 2000s single nucleotide
polymorphisms (SNPs) were found and the usefulness of biotechnology became
more obvious and immediate. However, genes were not just genes, but were
groups of base pairs from 1,000 to 1 million pairs in length. Genes were not
the same size (same number of base pairs). Some genes controlled other genes.
SNPs were located within or next to genes so that SNPs became markers for
the genes of importance. SNPs had to be located fairly close to the genes (in
linkage disequilibrium) so that the marker allele and gene alleles were tightly
linked and therefore, tended to be inherited together.
There were essentially two uses for SNPs. One was to find genes that had
major effects on economically important traits. This was known as a genome-wide
association study (GWAS). Over the years relatively few major genes have
been consistently found, and some of those were already near fixation, that
is, nearly every individual already had the favourable alleles. These studies
produced hundreds of Manhattan plots of SNP marker effects versus location
in the genome. Research has moved along to genome sequencing where instead
of SNPs, the DNA sequence of base pairs is determined, usually targeting areas
where genes are known to exist. The goal is to determine not only the location
of a gene, but also its function and products in various organs. This is an
important area in human genetics, for understanding how various genetic diseases
cause problems.
The other use for SNPs in livestock has been to be more specific about the
relationships between individuals in order to improve accuracy of EBVs. Some
full sibs, for example, are more related to their sire or dam than to other full
sibs from the same sire or dam. The SNP genotypes can be used to determine
what fraction of SNP genotypes are in common between individuals to build a
better relationship matrix. This includes both Identity by Descent and Identity
by State. Instead of using A, the matrix of additive genetic relationships
based on pedigree information, a genomic matrix, G is constructed from the
SNP genotypes of individuals. The animal model is still used, but with G
replacing A. There are two difficulties with this strategy. First, G can only
be constructed based on animals that have been genotyped with the same
SNP chip. However, imputation can be used to predict the SNP genotypes
of relatives of animals that were actually genotyped. Secondly, G must be
inverted for use in the animal model, and most animal models involve over
100,000 animals. Inverting a matrix up to 50,000 animals may be today’s limit
before rounding errors take over. Also, G and its inverse are full matrices (not
sparse), and hence storing it and performing multiplications with it become
very space and time consuming. This chapter looks at an approach that utilizes
only a portion of the SNPs in order to improve genetic evaluations of animals,
along with the regular A matrix and its sparse inverse.
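The construction of G from SNP genotypes can be sketched for a tiny, made-up example. This follows one common formulation (VanRaden's method 1, centering genotypes by allele frequencies); whether it matches the construction used in any particular evaluation is an assumption. Python is used here rather than R.

```python
import numpy as np

# made-up genotypes: 4 animals x 5 SNPs, coded 0/1/2 copies of the second allele
M = np.array([[0, 1, 2, 1, 0],
              [1, 1, 2, 0, 0],
              [2, 0, 1, 1, 1],
              [1, 2, 0, 2, 1]], dtype=float)

p = M.mean(axis=0) / 2.0                 # observed allele frequency per SNP
Z = M - 2.0 * p                          # centre each SNP at its mean
denom = 2.0 * np.sum(p * (1.0 - p))      # scales G to the A-matrix scale

G = Z @ Z.T / denom                      # genomic relationship matrix
print(np.round(G, 3))
```

The off-diagonals of G estimate the realized fraction of the genome shared between pairs of animals, capturing both identity by descent and identity by state, whereas A gives only the expected fractions from the pedigree.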
16.1 Introduction
Genomics research projects can be found in all areas of livestock production
including aquaculture. In domesticated aquaculture there are generally few
families (100 to 400 per year), and large progeny group sizes (50 to 1000).
Parents can be genotyped for many thousands (e.g. 50,000) of single nucleotide
polymorphisms (SNPs), but costs prohibit genotyping too many progeny in
freshwater and seawater environments. When data are to be analyzed there
are a large number of fish with phenotypes and a small number of individuals
that are genotyped. Estimation of a large number of SNP effects from a few
thousand genotyped fish is hugely overparameterized. The question arises
whether all SNPs need to be utilized to gain an advantage from genomics.
The Cooke Aquaculture Atlantic salmon breeding program at the Oak Bay
Hatchery in New Brunswick rears candidate broodstock entirely in freshwater
tanks, while one of the traits of economic interest is seawater growth production
in sea cages. Three year old weights are taken in December and January, and
spawning occurs in October to December that year. Genetic improvement
programs are used to increase growth rates, reduce feed consumption, improve
carcass shape and colour, and increase resistance to many health challenges
due to parasites, bacteria, and viruses. Generation intervals in salmon cannot
be easily shortened without causing other problems, such as grilsing, i.e., early
maturation and reduced growth (Gjerde et al. 1994). Four years is the minimum
generation interval (Quinton et al. 2006). Therefore, early assessment of
genetic ability will not reduce the spawning age.
Families of salmon are well evaluated by traditional animal model analyses
for freshwater and seawater growth because family size can be fairly large, and
the heritability of growth is around 0.4 (Gjerde et al., 1994). Year classes
of Atlantic salmon are almost entirely genetically distinct with completely
different family lines. SNPs can improve accuracy of estimates of year class
differences because SNP frequencies will likely be somewhat similar between
year classes (Liu et al. 2016). Genotyping enough fish in both freshwater
and seawater environments is costly and time consuming. Thus, the use of
genomics in Atlantic salmon to make greater genetic change in growth is not
very feasible.
Evaluation methods in fish rely on the ability to identify individual fish,
which can only be done once the fish are large enough to be PIT tagged such
that they can be readily scanned electronically. Alternatively, fish can be
genotyped to determine parentage at the time they are weighed and measured
using a hundred or so SNP markers, but this depends on the costs and speed
of analyzing DNA samples (Liu et al., 2016). Any fish that will not be used in
matings does not need to be identified individually, but only by family. Families have been identified by fin clipping, or by raising them in separate tanks
until they are large enough to be tagged. This generally limits the number of
families that can be reared at one time. Fish may also be grown communally,
and identified to parentage at three years of age using seven microsatellite
markers or a panel of SNPs, which would allow a larger number of families
to be created. With a communal system, some families could potentially be
eliminated because they were not competitive during feeding.
Many methods are available for genomic evaluation of individuals, many of
them designed for dairy cattle. Several of these methods were compared by Yoshida
et al. (2018) in rainbow trout, who found that low density SNP panels could be
used without compromising accuracy of predictions, and that all methods using
genomic information gave better results than a traditional pedigree-based best
linear unbiased prediction (BLUP) approach using only phenotypes.
Toro et al. (2017) conducted a simulation study of genomic within-family
selection using a limited number of markers, from 4 to 400 per chromosome,
to improve the accuracy of selection. Single-step genomic
BLUP (Misztal et al. 2013) is another methodology in which SNP markers
are used to create a genomic relationship matrix among genotyped animals,
which includes relationships that are identical by descent and identical by
state. The inverse of this matrix is combined with the inverse of the pedigree
based relationship matrix. The inverse of the genomic relationship matrix
becomes more challenging as more animals are genotyped because a direct
inversion routine is needed and the storage space to hold the matrix becomes
enormous quickly. Also, any multiplications involving this matrix can be very
time consuming because the inverse is full (all elements are non-zero). Hence
a longer term solution is needed if the number of genotyped animals increases
rapidly.
Two problems are encountered when using genomic data in genetic evaluation. With a 50K SNP panel, about 50,000 SNP genotype effects (Nsnp)
need to be estimated. A high percentage of the SNPs, however, have little
to no effect on the traits of interest. The second problem is that not enough
individuals have been genotyped. Consequently, statistical models tend to be
over-parameterized, i.e., too many parameters to estimate from too few animals. If you put the SNP genotypes in a matrix with the number of columns
equal to the number of SNPs (Nsnp), and the rows equal to the number of animals genotyped (Ng), then the rank of the matrix can be no greater than the
smaller of Ng or Nsnp. Thus, if you have Ng = 100 and Nsnp = 50,000,
then the rank is 100 or less, and 49,900 columns of that matrix are linear
functions of a set of 100 independent columns.
Ideally, all animals with phenotypes (Np) should also be genotyped (Ng),
and both numbers should be greater than the number of SNP genotypes per
animal (Nsnp). That is,

(Np = Ng) > Nsnp.

If the above holds true, then SNP genotype effects can be estimated uniquely
using ordinary least squares equations with an appropriate statistical model.
However, typically, Ng is much smaller than Np and much smaller than
Nsnp. Labour costs and the time involved in collecting and analyzing DNA
samples are high, and therefore, the number of genotyped animals has been
limited by economics. Even so, Ng is expected to continue to rise over time.
To increase the number of genotyped animals, imputation can be used to assign
genotypes to relatives of genotyped animals (Sargolzaei et al., 2014) based on
probability and pedigree relationships. Gengler et al. (2007, 2008) and Mulder
et al. (2010) proposed a method to predict SNP genotypes of all animals in
the pedigree using a simple pedigree-based animal model.
16.2 Data
The Oak Bay Hatchery of Cooke Aquaculture, Inc. accommodates up to 150
families of Atlantic salmon in each of four different year classes. The St John
River strain was started around 1984 (Glebe, 1998; Quinton et al. 2005).
Spawning typically occurs in October, November, and December, and eggs
are kept in freshwater (8°C). Some eggs can be chilled to 3-4°C in order to
match up their degree days with eggs fertilized later. Once the alevins have
absorbed their yolk sac and are ready to take exogenous feed, the temperature
is increased to 12°C. As the fish grow they are moved to larger family tanks.
At smolting age (1 yr) individual fish may be PIT tagged and a sorting is made
for potential broodfish. Fish are communally reared from this point onward.
Another movement of fish occurs at 2 years of age, when they are scanned by
ultrasound to determine sex. Early maturing fish are removed.
Only weights and lengths taken at three years of age were used in the
analyses. A total of 26,383 fish were measured at three years of age in this
study. Male and female data from the Oak Bay Hatchery have been analyzed
collectively for four traits (FW weight, FW length, SW weight, and SW length)
in a multiple trait animal model over the last three years. A recent study has
shown that male and female growth within families is not perfectly correlated,
and that female growth is more variable than male growth. Thus, weights and
lengths have been expanded into eight traits separated by gender and rearing
environment. The phenotypic means and standard deviations are given in
Table 16.1. The distribution of records by year class and environment is given
in Table 16.2.
Table 16.1
Phenotypic means and standard deviations (SD) of 3-yr-old growth traits.

Gender   Type-trait       Number    Mean     SD
Males    FW weight, kg      2811    5.50   1.97
         FW length, cm      2811   75.91   8.39
         SW weight, kg      6176    5.28   1.26
         SW length, cm      6176   77.54   5.53
Females  FW weight, kg      9512    6.04   1.84
         FW length, cm      9512   77.05   7.57
         SW weight, kg      7884    5.01   1.13
         SW length, cm      7884   75.91   5.24

FW=fresh water, SW=sea water.
Table 16.2
Number of growth records by year of weighing, type of environment, and gender.

Year    Year      ----Freshwater----   -----Seawater-----
Class   Weighed   Males     Females    Males     Females
2006    2009          0       1393         0       1566
2007    2010        404       1197      1552       1371
2008    2011       1171       1106      2325       1381
2009    2012        424          0      2095          0
2011    2014          0       1179         0       2418
2013    2016          0         81         0         56
2014    2017        812       1220      3540       1092
A 50K SNP panel was developed during the project for North American
Atlantic salmon, based on a 220K panel (Ødegård et al. 2014) and a 6K panel for
European Atlantic salmon (Brenna-Hansen et al. 2012). A total of 3,437 fish
with growth data were genotyped with either the 220K or 50K panels. SNPs
from the 50K panel were also on the 220K panel. There were a total of 33,189
animals in the pedigree file.
16.3 Model for Growth Traits
An 8 trait model was used with the traits described in Table 16.1. Each fish
was observed for weight and length, in freshwater or seawater, but families
were measured for all traits. Hence an animal model is important to analyze
the data. The equation of the model for each trait is simple,

y = Wc + Za + e,    (16.1)

where
y is a vector of eight traits,
c is a vector of 21 fixed tank-year subclass effects,
a is a random vector of polygenic effects for each fish,
e is a random vector of residuals,
W is the design matrix relating tank-year effects to the observations, and
Z is the design matrix relating animals to their observations.
The expectations of the random vectors, and the variances and covariances
among the traits, are given below:

E(a) = 0,    E(e) = 0,
Var(a) = A ⊗ H,
Var(e) = R.
A is the additive genetic numerator relationship matrix based on pedigrees
of 33,189 animals. Matrices H and R, of order 8 by 8, were estimated by
Bayesian methods using Gibbs sampling. Forty thousand Gibbs samples, using
the entire dataset of 26,383 records and 33,189 pedigree animals, with a 10,000-sample
burn-in period, were used. Standard errors on estimates of covariance
components, and on estimates of heritabilities and genetic correlations, were
obtained using the standard deviations of the Gibbs sample values after burn-in (Table 16.3).
Table 16.3
Heritability estimates with standard errors for 3-yr-old weight and length in
Atlantic salmon by gender and type of environment.

Gender   Type   Weight       Length
Males    FW     0.18±0.03    0.22±0.02
         SW     0.31±0.03    0.25±0.03
Females  FW     0.27±0.04    0.38±0.04
         SW     0.40±0.03    0.38±0.03

FW=fresh water, SW=sea water.
16.4 Models Involving Genomics
In this study, Ng = 3,437 and Np = 26,383, but the traits have been split into
four groups (Table 16.1). The number of markers was Nsnp = 49,638. Thus, Np and
Ng are less than the number of SNP markers, Nsnp. If all SNP markers are
included, then the model will be highly overparameterized, creating difficulty
in yielding unbiased and unique solutions for all parameters in the model.
A first step was to reduce the number of SNP markers. From the 50K SNP
panel, there were 11,467 SNPs with minor allele frequencies between 0.3 and
0.5. The correlations among the 11,467 SNP genotypes were calculated and
ranged from -0.03 to 0.03, indicating that those markers were nearly independent of each other. The markers were correlated with SW female growth and
ranked within chromosome in decreasing order of absolute magnitude. From
the 11,467 markers, subsets of the top 6,000 (200 to 400 per chromosome) and
top 270 (10 per chromosome) were chosen.
If all fish have both phenotypic records (weight and length) and genotypes
for markers, then an appropriate linear model would be

y = Wc + Za + Xm + e,    (16.2)

where everything is defined similarly to Equation (16.1), except

m is a fixed vector of marker allele effects for the 11,467 markers, and

X is a design matrix that contains the genotypes of the markers (-1, 0, or 1)
for each animal with records.
The matrices H and R are as defined previously,

Var(a) = A ⊗ H,

where A is the usual additive genetic, pedigree-based, relationship matrix
among animals. The genomic estimated breeding value, GEBV, would be the
sum (Zâ + Xm̂), where â and m̂ are estimates of a and m, respectively.
Mixed model equations were constructed and solved for 8 traits. The number of markers was either 6,000 or 270 (GBV6K and GBV270, respectively).
In both cases, I was added to the diagonals of X'R⁻¹X to give ridge regression
estimators of the regression coefficients. This was necessary because overparameterization was still a problem, which could not be avoided when the number of
markers was 6,000, and for comparability in the case when the number of markers
was 270.
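The ridge step just described can be sketched as follows. The data are simulated, and R = I is assumed for a single trait so that X'R⁻¹X reduces to X'X; in the actual analyses the tank-year and polygenic effects would be in the equations as well:

```python
import numpy as np

# Sketch of the ridge regression step on simulated data.  Assumes a
# single trait with R = I, so X'R^{-1}X reduces to X'X, and that the
# other model effects have already been absorbed into y.
rng = np.random.default_rng(7)
n_fish, n_markers = 100, 270
X = rng.integers(-1, 2, size=(n_fish, n_markers)).astype(float)  # -1/0/1 codes
true_m = np.zeros(n_markers)
true_m[:10] = rng.normal(0.0, 0.5, 10)          # only a few markers matter
y = X @ true_m + rng.normal(0.0, 1.0, n_fish)   # simulated phenotypes

lhs = X.T @ X + np.eye(n_markers)               # "I added to the diagonals"
m_hat = np.linalg.solve(lhs, X.T @ y)           # ridge regression estimates
gebv_part = X @ m_hat                           # marker part of the GEBV
```

With 270 markers and only 100 records this system is still overparameterized, which is exactly why the identity matrix is needed on the diagonals.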
Different strategies of picking markers were not studied. For each analysis
there is likely an optimum set of markers, but finding that set could take a
very long time starting with 49,638 markers. The reasoning was that with high
minor allele frequencies, all of the genotypes at one SNP would be reasonably
represented among genotyped fish. The 11,467 markers were nearly independent, so there would be little chance of confounding of markers. The linkage
disequilibrium among the markers was therefore low.
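The screening steps described above (minor allele frequency filter, correlation with the trait, ranking within chromosome by absolute magnitude) can be sketched on simulated data; the cutoff of 10 markers per chromosome mirrors the text, everything else here is invented:

```python
import numpy as np

# Sketch of the marker screening: keep SNPs with minor allele frequency
# (MAF) between 0.3 and 0.5, correlate each kept SNP with the trait,
# rank within chromosome by |r|, keep the top k per chromosome.
rng = np.random.default_rng(3)
n_fish, n_snps, n_chrom, k = 500, 1000, 10, 10
geno = rng.binomial(2, rng.uniform(0.05, 0.5, n_snps), size=(n_fish, n_snps)) - 1
chrom = rng.integers(0, n_chrom, n_snps)        # chromosome of each SNP
trait = rng.normal(0.0, 1.0, n_fish)            # e.g. SW female growth

p = (geno + 1).mean(axis=0) / 2.0               # allele frequency per SNP
maf = np.minimum(p, 1.0 - p)
keep = np.where((maf >= 0.3) & (maf <= 0.5))[0]

r = np.array([np.corrcoef(geno[:, j], trait)[0, 1] for j in keep])
top = []
for c in range(n_chrom):
    on_c = np.where(chrom[keep] == c)[0]        # positions within 'keep'
    order = on_c[np.argsort(-np.abs(r[on_c]))]
    top.extend(keep[order[:k]].tolist())        # top k SNPs on chromosome c
```

Requiring a high MAF guarantees all three genotypes are well represented, and the near-zero correlations among such SNPs keep confounding low, as argued above.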
The estimated breeding values from the animal model with predicted SNP
genotypes were calculated as

EBV_i = â_i + x'_i m̂,

for animal i, where x'_i is the row vector of predicted SNP genotypes for markers
for animal i from X.
The next step was to predict marker genotypes for all animals in the
pedigree (Nped = 33,189), for both subsets of markers. An animal model was
applied to the marker genotypes (-1, 0, or 1) of genotyped fish, with an overall
mean and an animal additive effect. The additive genetic relationship matrix
A, amongst all animals with ancestors, was used, and a very high heritability
was assumed. The model is

s_i = µ + g_i + e_i,    (16.3)

where
si is the marker genotype, either -1, 0, or 1, for an animal that was genotyped,
µ is an overall mean,
gi is an animal’s breeding value for the marker genotype, and
ei is a residual error.
Let g be the vector of breeding values for marker genotypes of all animals,
and let it be partitioned into genotyped (w) and non-genotyped (o) individuals.
Then

E(g) = 0,    E(e) = 0,

Var(g) = Var [ g_w ]  =  [ A_ww  A_wo ] σg²,
             [ g_o ]     [ A_ow  A_oo ]

Var(e) = I σe²,

and let

σe²/σg² = 0.05 = λ,

which corresponds to a heritability of 0.9523.
The mixed model equations are

[ N      1'           0'      ] [ µ̂  ]   [ 1's ]
[ 1   I + A^ww λ    A^wo λ    ] [ ĝ_w ] = [  s  ]    (16.4)
[ 0     A^ow λ      A^oo λ    ] [ ĝ_o ]   [  0  ]
The solutions for animals, plus the overall mean, give an estimate of the
genotype of each animal. The predicted genotypes, which are decimal numbers,
are used in Equation (16.2). Mulder et al. (2010) found a 0.69 correlation between
predicted genotypes and actual genotypes in a simulation study. Each SNP
would be analyzed separately. The predicted genotypes can be used directly
in X, as continuous covariates, in Equation (16.2).
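Equation (16.4) can be illustrated for a single SNP with a toy pedigree — two unrelated genotyped parents and one non-genotyped progeny (my own example, not from the data):

```python
import numpy as np

# Toy illustration of Equation (16.4) for one SNP: animals 1 and 2 are
# unrelated, genotyped parents (coded +1 and -1); animal 3 is their
# non-genotyped progeny.  lambda = 0.05 as in the text.
lam = 0.05
Ainv = np.array([[ 1.5,  0.5, -1.0],
                 [ 0.5,  1.5, -1.0],
                 [-1.0, -1.0,  2.0]])   # inverse of A for this pedigree
s = np.array([1.0, -1.0])               # observed genotypes of animals 1, 2
Z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])         # links records to genotyped animals

lhs = np.zeros((4, 4))
lhs[0, 0] = len(s)                      # N
lhs[0, 1:] = Z.sum(axis=0)              # 1'Z
lhs[1:, 0] = Z.sum(axis=0)
lhs[1:, 1:] = Z.T @ Z + Ainv * lam
rhs = np.concatenate(([s.sum()], Z.T @ s))

sol = np.linalg.solve(lhs, rhs)
mu, g = sol[0], sol[1:]
pred = mu + g                           # predicted genotypes, all 3 animals
```

The progeny's predicted genotype is the parent average, 0, and the parents' predictions (about ±0.952) are shrunk slightly towards the mean by λ.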
16.5 Comparison of Models
The models were compared as follows:
1. Sum of squares of residuals, and
2. Variance of Mendelian sampling effects.
GBV270 and GBV6K models should give lower sums of squares of residuals than the ANM model because there were more parameters to estimate
than in the traditional animal model. The comparisons are not to determine
the better model, but to quantify the differences one might expect.
Yoshida et al. (2018) computed an accuracy for their genomic models
as the correlation between the GEBVs (genomic estimated breeding values)
and the corresponding phenotypic observations, divided by the square root of
heritability from the ANM in a validation data set. In this study the data
were not separated into estimation and validation groups, and therefore, this
statistic was not meaningful for this study.
Another measure of improved accuracy is the variance of Mendelian sampling
effects (for animals with both parents known, for each trait separately) as a
fraction of one half the additive genetic variance used in the analysis. The
Mendelian sampling effect of the i-th animal is the animal's (G)EBV minus
the parent average (G)EBV,

ms_i = EBV_i - .5(EBV_sire + EBV_dam).

In theory, as the accuracy of evaluation increases, the variance of Mendelian
sampling effects should increase towards one half the additive genetic variance.
Thus, the comparison statistic is

v = Var(ms)/(.5 Var(a)),

where Var(ms) is the variance of Mendelian sampling effects of animals with
both parents known, and Var(a) is the additive genetic variance of the trait
(obtained from the genetic covariance matrix H shown above).
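The statistic v can be computed directly from EBVs and a pedigree; a minimal sketch with invented numbers:

```python
import numpy as np

# Sketch of v = Var(ms) / (0.5 Var(a)): Mendelian sampling effects are
# EBVs minus parent-average EBVs, for animals with both parents known.
# The EBVs, pedigree, and Var(a) below are invented for illustration.
def ms_variance_ratio(ebv, sire, dam, var_a):
    ms = [ebv[i] - 0.5 * (ebv[sire[i]] + ebv[dam[i]])
          for i in ebv if sire[i] is not None and dam[i] is not None]
    return np.var(ms) / (0.5 * var_a)

ebv  = {1: 0.2, 2: -0.1, 3: 0.5, 4: 0.4, 5: -0.3}
sire = {1: None, 2: None, 3: 1, 4: 1, 5: 1}
dam  = {1: None, 2: None, 3: 2, 4: 2, 5: 2}
v = ms_variance_ratio(ebv, sire, dam, var_a=1.0)
```

Values of v near 1 indicate evaluations approaching the accuracy needed to capture within-family differences.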
16.6 Results
From Table 16.1, the number of observations for male freshwater weights and
lengths was only 2,811, which is substantially smaller than the total number of
genotyped fish, and substantially smaller than the 6,000 marker effects that needed
to be estimated in the GBV6K model.

That only 3,437 animals were genotyped also caused a problem in
estimating marker effects in GBV6K and GBV270, because those individuals
were used to estimate SNP genotypes for 33,189 fish in the pedigree. While
that gives 6,000 SNP genotypes for all 33,189 fish, they are based on only
3,437 degrees of freedom, so to speak. The predicted genotypes are linear
functions of the 3,437 independent genotyped animals. Thus, there was an
overparameterization problem for both GBV6K and GBV270. The number
of genotyped fish should be increased to be above 6,000, and the number of
phenotypes should be greater than 6,000 for all groups of fish. Ridge regression
was used to solve the equations, where a value of 1 was added to the diagonals
of X'R⁻¹X.
Table 16.4 shows the sums of squares of residuals from each model for
each of the traits. GBV270 gave smaller sums of squares than either ANM or
GBV6K. The expectation was that increasing the number of markers would
decrease the sums of squares of residuals. However, GBV6K gave sums of
squares closer to, but still less than, those of ANM. Thus, more SNP
markers were not necessarily better. The question becomes how many markers
are sufficient. A stepwise forward regression could be used to add one marker
at a time until the sums of squares of residuals stops decreasing. That process
would not necessarily result in having markers on each chromosome. The best
set of markers would need to be found for each trait to be analyzed. The sets
found for each trait could be merged to give one set to be used for all traits.
Table 16.4
Sums of squares of residuals from the traditional animal model (ANM),
GBV270 model, and GBV6K model.

Gender   Type   Trait      ANM    GBV270    GBV6K
Males    FW     WT        2335      2073     2234
         FW     LG      44,912    41,080   43,176
         SW     WT        4716      4465     4630
         SW     LG     113,017   107,966  110,785
Females  FW     WT        7294      6982     7205
         FW     LG     107,957   104,598  106,659
         SW     WT        4146      4002     4086
         SW     LG      95,556    93,339   94,629

FW=fresh water, SW=sea water. WT=weight, LG=length.
The variance of estimated Mendelian sampling effects (Var(ms)) was used
to indicate genetic accuracy. When GEBV are more accurate, then Var(ms)
should increase towards one half the additive genetic variance (.5 Var(a)). The
values are given in Table 16.5. Thus, the ANM gave only 2.6% to 9.5% of
.5 Var(a). This is because each fish has only one weight and one length observed
in just one environment. Hence their EBV are heavily regressed towards the
family mean for all eight traits. Both GBV270 and GBV6K gave substantially
higher fractions than ANM, but not consistently for one model or the other.
GBV270 was higher in more cases than GBV6K.
Table 16.5
Variance of Mendelian sampling effects as a fraction of one-half additive
genetic variance for the traditional animal model (ANM), GBV270, and
GBV6K models.

Gender   Type   Trait    ANM    GBV270   GBV6K
Males    FW     WT      .0260    .8476   .2158
         FW     LG      .0744    .6176   .2037
         SW     WT      .0680    .2087   .2394
         SW     LG      .0870    .2250   .2389
Females  FW     WT      .0718    .5628   .3496
         FW     LG      .0950    .3515   .2801
         SW     WT      .0734    .2138   .2032
         SW     LG      .0890    .2245   .1910

FW=fresh water, SW=sea water. WT=weight, LG=length.
Because GBV270 appears better than GBV6K, further comparisons are
given only between GBV270 and ANM. Tables 16.6 and 16.7 show correlations
among EBVs between ANM and GBV270, and within each model. All of
the correlations were less than unity, which means that genomic data added
information to the EBVs, causing individuals to rank differently from ANM.
The correlations between traits within ANM or within GBV270 were different.
For the ANM, correlations were generally greater than for GBV270. Recall that the
ANM uses a multi-trait model with fairly high genetic correlations as input
parameters, so the correlations of EBVs from ANM are usually greater
than the true genetic parameters, because traits are used to predict other
traits. With GBV270, the same can be said for the polygenic part of the
model, but the GEBV includes the regressions on the marker genotypes, which
do not utilize the genetic parameters to predict traits. Correlations do not
indicate which model is better, only the behaviour of the results.
Table 16.6
Correlations (×100) of EBVs from the traditional animal model (ANM) with
those of GBV270, for Nped = 33,189.

ANM           -----------GBV270-----------
               1   2   3   4   5   6   7   8
1 M/FW/WT     54  43  56  48  62  53  54  43
2 M/FW/LG     34  72  68  84  63  87  66  74
3 M/SW/WT     43  58  84  74  67  73  82  71
4 M/SW/LG     36  72  72  85  68  90  71  77
5 F/FW/WT     41  53  63  64  82  70  63  58
6 F/FW/LG     36  71  70  84  71  92  71  78
7 F/SW/WT     41  55  79  71  65  72  86  74
8 F/SW/LG     35  67  71  81  65  87  77  84

M=male, F=female. FW=fresh water, SW=sea water. WT=weight, LG=length.
Table 16.7
Correlations (×100) of EBVs within the traditional animal model (ANM),
below the diagonal, and within GBV270, above the diagonal, for 33,189
individuals.

               1    2    3    4    5    6    7    8
1 M/FW/WT    100   70   43   35   32   28   40   30
2 M/FW/LG     58  100   57   70   46   64   54   61
3 M/SW/WT     72   76  100   85   58   66   79   69
4 M/SW/LG     62   99   82  100   61   78   71   78
5 F/FW/WT     77   66   79   73  100   81   55   52
6 F/FW/LG     60   95   80   98   77  100   65   72
7 F/SW/WT     66   70   95   78   77   80  100   87
8 F/SW/LG     56   89   83   93   71   95   87  100

M=male, F=female. FW=fresh water, SW=sea water. WT=weight, LG=length.
Animal models including predicted SNP marker genotypes (GBV270 and
GBV6K) performed better than ANM. The results imply that 270 markers gave a
higher fraction of Mendelian sampling variance than did 6,000 markers, and
definitely more than the ANM. There is a need for further study on how many SNPs
and which SNPs give the best results.
By using fewer markers, a new chip panel could be created with only 500
markers, for example, which might cost less than the 50K panel and therefore
allow more fish to be genotyped. A chip with only 500 markers could be used to
determine parentage, genetic sex, and continent of origin (North American or
European) as well. Ideally, large samples of fish measured for 3-year-old weight
and length should be genotyped in FW environments, and random samples
of fish from SW environments. Perhaps 50 fish from every family could be
targeted, giving 7,500 genotyped individuals per year.
Chapter 17
Other Models
17.1 Sire-Dam Model
A sire-dam model is a subset of the animal model. Assumptions are needed
in order to move from an animal model to a sire-dam model. This means that
a sire-dam model will always be inferior to an animal model. Start with an
animal model as
y = Xb + Wc + Za + Zp + e,
for a single trait, repeated records, where
b is a vector of fixed factors of the model (e.g. time trends, age effects,
regions of a country effects),
c is a vector of random factors (e.g. contemporary groups),
a is a vector of animal additive genetic effects,
p is a vector of cumulative environmental effects, and
e is a vector of residual effects.
X, W, and Z are design matrices that relate observations in y to factors in
the model, and the random factors have null expected values and appropriate
covariance matrices. The additive genetic relationship matrix, A, is assumed
for the animal additive genetic effects.
A necessary assumption to form a Sire-Dam model is to assume that
animals have only one record, and thus, p can be joined with e. Then, to go
to a Sire-Dam model, re-write the additive genetic effects as

a = (1/2) a_sire + (1/2) a_dam + ms,    (17.1)

where ms is the vector of Mendelian sampling effects of each progeny. The
implied assumption is that all animals with a record in y have both parents
known. Another assumption is that each Mendelian sampling effect comes
from the same population, with a zero mean and variance equal to one half the
additive genetic variance. Thus, inbred animals are not allowed, or they are all
inbred but have the same variance. Substituting (17.1) into the animal model
equation gives

y = Xb + Wc + Z( (1/2) a_sire + (1/2) a_dam + ms ) + p + e.

Combine the Mendelian sampling effects with the residual term, as well,
assuming one record per animal:

y = Xb + Wc + (1/2) Zs a_sire + (1/2) Zd a_dam + (ms + p + e).
The obvious sticking points with a sire-dam model are that

• Animals must have only single records.

• All progeny must not be inbred.

• All progeny should be a random sample of all potential progeny of a
sire-dam pair.

• The progeny do not later become dams themselves.

• Animals that have records are not parents of other animals.

• None of the parents have records of their own.
If the data covers only one generation of animals, then the results from
an animal model or a sire-dam model should be the same. However, if the
data covers many generations, then the animal model should always be more
accurate than the sire-dam model.
17.2 Sire Models
A sire model is a sub-model of the Sire-Dam Model. More assumptions are
needed to move from a Sire-Dam Model to a sire model, and thus, a sire model
is inferior to a Sire-Dam Model, and also inferior to an animal model. If dams
are assumed to have only one progeny (mated to only one sire), then the dam
additive genetic effect can be combined into the residual effect too. One must
also assume that the dams are unrelated to each other, and to the sires.
y = Xb + Wc + (1/2) Zs a_sire + ( (1/2) a_dam + ms + p + e ),

where Zd = I.
Implied in the Sire Model is that sires must be mated randomly to dams.
If this is not the case then sire estimated transmitting abilities will be biased.
The assumptions that had to be made with the Sire-Dam model also apply to
the Sire model.
17.3 Reduced Animal Models
Consider an animal model with periods as a fixed factor and one observation
per animal (not a realistic model - just an example for demonstrating a reduced
animal model), as in Table 17.1.
Table 17.1
Animal Model Example Data

Animal   Sire   Dam   Period   Observation
  5       1      3      2         250
  6       1      3      2         198
  7       2      4      2         245
  8       2      4      2         260
  9       2      4      2         235
  4       -      -      1         255
  3       -      -      1         200
  2       -      -      1         225
Assume that the ratio of residual to additive genetic variances is 2. The MME
for this data would be of order 11 (nine animals and two periods). The left
hand sides and right hand sides of the MME are

[ 3  0   0   1   1   1   0   0   0   0   0 ]      [  680 ]
[ 0  5   0   0   0   0   1   1   1   1   1 ]      [ 1188 ]
[ 0  0   4   0   2   0  -2  -2   0   0   0 ]      [    0 ]
[ 1  0   0   6   0   3   0   0  -2  -2  -2 ]      [  225 ]
[ 1  0   2   0   5   0  -2  -2   0   0   0 ]      [  200 ]
[ 1  0   0   3   0   6   0   0  -2  -2  -2 ]  ,   [  255 ]
[ 0  1  -2   0  -2   0   5   0   0   0   0 ]      [  250 ]
[ 0  1  -2   0  -2   0   0   5   0   0   0 ]      [  198 ]
[ 0  1   0  -2   0  -2   0   0   5   0   0 ]      [  245 ]
[ 0  1   0  -2   0  -2   0   0   0   5   0 ]      [  260 ]
[ 0  1   0  -2   0  -2   0   0   0   0   5 ]      [  235 ]
and the solutions to these equations are

b̂1 =  225.8641
b̂2 =  236.3366
â1 =   -2.4078
â2 =    1.3172
â3 =  -10.2265
â4 =   11.3172
â5 =   -2.3210
â6 =  -12.7210
â7 =    6.7864
â8 =    9.7864
â9 =    4.7864
A property of these solutions is that

1'A⁻¹â = 0,

which in this case means that the sum of solutions for animals 1 through 4 is
zero.
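The order-11 system can be verified numerically with a generic linear solver; the matrix below is exactly the one displayed above:

```python
import numpy as np

# Numerical check of the order-11 MME above (2 periods, 9 animals,
# alpha = 2).  Order of unknowns: b1, b2, a1, ..., a9.
lhs = np.array([
    [3, 0,  0,  1,  1,  1,  0,  0,  0,  0,  0],
    [0, 5,  0,  0,  0,  0,  1,  1,  1,  1,  1],
    [0, 0,  4,  0,  2,  0, -2, -2,  0,  0,  0],
    [1, 0,  0,  6,  0,  3,  0,  0, -2, -2, -2],
    [1, 0,  2,  0,  5,  0, -2, -2,  0,  0,  0],
    [1, 0,  0,  3,  0,  6,  0,  0, -2, -2, -2],
    [0, 1, -2,  0, -2,  0,  5,  0,  0,  0,  0],
    [0, 1, -2,  0, -2,  0,  0,  5,  0,  0,  0],
    [0, 1,  0, -2,  0, -2,  0,  0,  5,  0,  0],
    [0, 1,  0, -2,  0, -2,  0,  0,  0,  5,  0],
    [0, 1,  0, -2,  0, -2,  0,  0,  0,  0,  5],
], dtype=float)
rhs = np.array([680, 1188, 0, 225, 200, 255,
                250, 198, 245, 260, 235], dtype=float)

sol = np.linalg.solve(lhs, rhs)   # [b1, b2, a1..a9]
base_sum = sol[2:6].sum()         # animals 1-4 should sum to zero
```

The solver reproduces the solutions listed above, and the sum of the base animals' solutions is zero, illustrating the property 1'A⁻¹â = 0.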
The reduced animal model was proposed by Quaas and Pollak (1980).
The solutions from the reduced animal model were exactly the same as from
the full MME. The amount of reduction depends on the reproductive rate of
the species of interest and the percentage of animals that become parents of
the next generation. Thus, the reduced animal model methodology is very
applicable to swine, poultry, and fish where large numbers of progeny can be
produced per mating pair of individuals, and where a very small percentage of
animals are actually selected as parents of the next generation.
In a typical animal model with a as the vector of additive genetic values of
animals, there will be animals that have had progeny, and there will be other
animals that have not yet had progeny (and some may never have progeny).
Denote those animals which have progeny as ap , and those which have not had
progeny as ao , so that
a' = [ a'_p   a'_o ].

In terms of the example data,

a'_p = [ a1  a2  a3  a4 ],
a'_o = [ a5  a6  a7  a8  a9 ].
Genetically for any individual, i, their additive genetic value may be written as the average of the additive genetic values of the parents plus a Mendelian
sampling effect, which is the animal’s specific deviation from the parent average, i.e.
ai = .5(as + ad ) + mi .
Therefore, we may write
ao = Tap + m,
where T is a matrix that indicates the parents of each animal in ao , and m is
the vector of Mendelian sampling effects. Then
a = [ a_p ]  =  [ I ] a_p + [ 0 ] ,
    [ a_o ]     [ T ]       [ m ]

and

V(a) = A σa²  =  [ I ] A_pp [ I  T' ] σa² + [ 0  0 ] σa²,
                 [ T ]                      [ 0  D ]

where D is a diagonal matrix with diagonal elements equal to (1 - .25 d_i),
and d_i is the number of identified parents, i.e. 0, 1, or 2, for the i-th animal,
and

V(a_p) = A_pp σa².
The animal model can now be written as

[ y_p ]   [ X_p ]     [ Z_p   0  ] [ I ]       [ e_p         ]
[ y_o ] = [ X_o ] b + [  0   Z_o ] [ T ] a_p + [ e_o + Z_o m ].
Note that the residual vector has two different types of residuals and that the
additive genetic values of animals without progeny have been replaced with
Tap . Because every individual has only one record, then Zo = I, but Zp may
have fewer rows than there are elements of ap because not all parents may
have observations themselves. In the example data, animal 1 does not have an
observation, therefore,

      [ 0 1 0 0 ]
Z_p = [ 0 0 1 0 ].
      [ 0 0 0 1 ]
Consequently,

R = V [ e_p     ]  =  [ I σe²        0         ]  =  [ I   0  ] σe².
      [ e_o + m ]     [   0    I σe² + D σa²   ]     [ 0  R_o ]
The mixed model equations for the reduced animal model are

[ X'_p X_p + X'_o R_o⁻¹ X_o    X'_p Z_p + X'_o R_o⁻¹ T          ] [ b̂   ]   [ X'_p y_p + X'_o R_o⁻¹ y_o ]
[ Z'_p X_p + T' R_o⁻¹ X_o      Z'_p Z_p + T' R_o⁻¹ T + A_pp⁻¹ α ] [ â_p ] = [ Z'_p y_p + T' R_o⁻¹ y_o   ].
Solutions for â_o are derived from the following formulas:

â_o = T â_p + m̂,

where

m̂ = (Z'_o Z_o + D⁻¹ α)⁻¹ (y_o - X_o b̂ - T â_p).
Using the example data,

    [ .5   0   .5   0  ]
    [ .5   0   .5   0  ]
T = [  0  .5    0  .5  ],
    [  0  .5    0  .5  ]
    [  0  .5    0  .5  ]

and

D = diag( .5  .5  .5  .5  .5 ),
then the MME with α = 2 are

[ 3   0     0    1    1    1   ] [ b̂1 ]   [ 680.0 ]
[ 0   4    .8   1.2   .8  1.2  ] [ b̂2 ]   [ 950.4 ]
[ 0  .8   2.4    0    .4   0   ] [ â1 ] = [ 179.2 ]
[ 1  1.2    0   3.6   0    .6  ] [ â2 ]   [ 521.0 ]
[ 1  .8    .4    0   3.4   0   ] [ â3 ]   [ 379.2 ]
[ 1  1.2    0    .6   0   3.6  ] [ â4 ]   [ 551.0 ]

The solutions are as before, i.e.

b̂1 = 225.8641,
b̂2 = 236.3366,
â1 = -2.4078,
â2 = 1.3172,
â3 = -10.2265,
â4 = 11.3172.
To compute â_o, we must first calculate m̂ as follows:

(I + D⁻¹ α) = diag( 5  5  5  5  5 ),

y_o = [ 250  198  245  260  235 ]',

X_o b̂ = [ 236.3366  236.3366  236.3366  236.3366  236.3366 ]',

T â_p = [ -6.3172  -6.3172  6.3172  6.3172  6.3172 ]',

m̂ = (I + D⁻¹ α)⁻¹ (y_o - X_o b̂ - T â_p)
  = [ 3.9961  -6.4039  0.4692  3.4692  -1.5308 ]',

and

T â_p + m̂ = [ -2.3211  -12.7211  6.7864  9.7864  4.7864 ]'.
The reduced animal model was originally described for models where animals had only one observation, but Henderson (1988) described many other possible models to which this technique could be applied.
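The back-solving step above can be checked numerically with a few lines of linear algebra. This is a sketch, assuming the T, D, yo and MME solutions of this example, and taking all five non-parent records to be in the second fixed-effect level, as the numbers imply:

```python
import numpy as np

# Solutions from the reduced animal model MME above
b_hat = np.array([225.8641, 236.3366])
a_p = np.array([-2.4078, 1.3172, -10.2265, 11.3172])
T = np.array([[.5, 0, .5, 0],
              [.5, 0, .5, 0],
              [0, .5, 0, .5],
              [0, .5, 0, .5],
              [0, .5, 0, .5]])
D = np.diag([.5]*5)
alpha = 2.0
y_o = np.array([250., 198., 245., 260., 235.])
X_o = np.tile([0., 1.], (5, 1))           # all five non-parents in level 2

# m-hat = (I + D^{-1} alpha)^{-1} (y_o - X_o b-hat - T a-hat_p), since Z_o = I
W = np.eye(5) + np.linalg.inv(D)*alpha    # = diag(5,5,5,5,5)
m_hat = np.linalg.solve(W, y_o - X_o@b_hat - T@a_p)
a_o = T@a_p + m_hat                       # solutions for the non-parents
print(np.round(m_hat, 4))                 # close to the values in the text
print(np.round(a_o, 4))
```

The same T and D would be reused for any larger data set; only A_pp for the parents enters the MME.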
Chapter 18
Non-Additive Genetic Effects
Non-additive genetic effects are the interactions between alleles at the same locus (dominance) and the interactions among alleles at different loci (epistasis). There are many possible degrees of interaction (involving different numbers of loci), but the effects and contributions of those interactions have been shown to diminish as the degree of complexity increases. Thus, mainly dominance, additive by additive, additive by dominance, and dominance by dominance interactions have been considered in animal studies.
To estimate the variances associated with each type of interaction, relatives are needed that have different additive and dominance relationships
among themselves. Computation of dominance relationships is more difficult
than additive relationships, but can be done as shown in earlier notes.
If non-additive genetic effects are included in an animal model, then an
assumption of random mating is required. Otherwise non-zero covariances can
arise between additive and dominance genetic effects, which complicates the
model enormously.
18.1 Interactions at a Single Locus
Let the model for the genotypic values be given as
Gij = µ + ai + aj + dij ,
where

μ = G.. = Σ_{i,j} fij Gij,
ai = Gi. − G..,   Gi. = Pr(A1)G11 + Pr(A2)G12,
aj = G.j − G..,   G.j = Pr(A1)G12 + Pr(A2)G22,
dij = Gij − ai − aj − μ.
Thus, there are just additive effects and dominance effects to be estimated
at a single locus. A numerical example is given below.
Genotype   Frequency    Value
A1A1       f11 = 0.04   G11 = 100
A1A2       f12 = 0.32   G12 = 70
A2A2       f22 = 0.64   G22 = 50
Then

μ   = 0.04(100) + 0.32(70) + 0.64(50) = 58.4,
G1. = 0.2(100) + 0.8(70) = 76.0,
G.2 = 0.2(70) + 0.8(50) = 54.0,
a1  = G1. − μ = 17.6,
a2  = G.2 − μ = −4.4,
d11 = G11 − a1 − a1 − μ = 6.4,
d12 = G12 − a1 − a2 − μ = −1.6,
d22 = G22 − a2 − a2 − μ = 0.4.
Now a table of breeding values and dominance effects can be completed.

Genotype   Frequency   Total       Additive           Dominance
A1A1       0.04        G11 = 100   a1 + a1 = 35.2     d11 = 6.4
A1A2       0.32        G12 = 70    a1 + a2 = 13.2     d12 = −1.6
A2A2       0.64        G22 = 50    a2 + a2 = −8.8     d22 = 0.4
The additive genetic variance is
σa2 = 0.04(35.2)2 + 0.32(13.2)2 + 0.64(−8.8)2 = 154.88,
and the dominance genetic variance is
σd2 = 0.04(6.4)2 + 0.32(−1.6)2 + 0.64(0.4)2 = 2.56,
and the total genetic variance is

σG^2 = σa^2 + σd^2
     = 157.44,
     = 0.04(100 − μ)^2 + 0.32(70 − μ)^2 + 0.64(50 − μ)^2
     = 0.04(41.6)^2 + 0.32(11.6)^2 + 0.64(−8.4)^2
     = 157.44.
This result implies that there is a zero covariance between the additive
and dominance deviations. This can be shown by calculating the covariance
between additive and dominance deviations,
Cov(A, D) = 0.04(35.2)(6.4) + 0.32(13.2)(−1.6) + 0.64(−8.8)(0.4)
= 0
The covariance is zero under the assumption of a large, random mating population without selection.
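A minimal numeric check of the single-locus decomposition above, using the frequencies and genotypic values from the table:

```python
# Single-locus additive/dominance decomposition for the worked example
f = {(1, 1): 0.04, (1, 2): 0.32, (2, 2): 0.64}
G = {(1, 1): 100.0, (1, 2): 70.0, (2, 2): 50.0}
p1, p2 = 0.2, 0.8                          # allele frequencies of A1 and A2
mu = sum(f[g]*G[g] for g in f)             # population mean, 58.4
G1 = p1*G[(1, 1)] + p2*G[(1, 2)]           # conditional mean given one A1 allele
G2 = p1*G[(1, 2)] + p2*G[(2, 2)]
a = {1: G1 - mu, 2: G2 - mu}               # average effects: 17.6 and -4.4
bv = {g: a[g[0]] + a[g[1]] for g in G}     # breeding values
d = {g: G[g] - bv[g] - mu for g in G}      # dominance deviations
var_a = sum(f[g]*bv[g]**2 for g in f)      # 154.88
var_d = sum(f[g]*d[g]**2 for g in f)       # 2.56
cov_ad = sum(f[g]*bv[g]*d[g] for g in f)   # 0 under random mating
print(var_a, var_d, cov_ad)
```

The zero covariance falls out of the construction once the genotype frequencies are Hardy-Weinberg products of the allele frequencies.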
18.2 Interactions for Two Unlinked Loci
Consider two loci each with two alleles, and assume that the two loci are on
different chromosomes and therefore unlinked. Let pA = 0.4 be the frequency
of the A1 allele at locus A, and let pB = 0.8 be the frequency of the B1 allele
at locus B. Then the possible genotypes, their expected frequencies assuming
joint equilibrium, and genotypic values would be as in the table below. Joint
equilibrium means that each locus is in Hardy-Weinberg equilibrium and that
the probabilities of the possible gametes are equal to the product of the allele
frequencies as shown in the table below.
Possible    Expected
Gametes     Frequencies
A1B1        pA pB = 0.32
A1B2        pA qB = 0.08
A2B1        qA pB = 0.48
A2B2        qA qB = 0.12
Multiplying these gametic frequencies together to simulate random mating gives the frequencies in the table below. The genotypic values were arbitrarily assigned to illustrate the process of estimating the genetic effects.
Genotypes            Frequencies               Genotypic
A-Locus   B-Locus    fijkl                     Value, Gijkl
i,j       k,l
11        11         pA^2 pB^2     = .1024     G1111 = 108
11        12         pA^2 2pB qB   = .0512     G1112 = 75
11        22         pA^2 qB^2     = .0064     G1122 = −80
12        11         2pA qA pB^2   = .3072     G1211 = 95
12        12         4pA qA pB qB  = .1536     G1212 = 50
12        22         2pA qA qB^2   = .0192     G1222 = −80
22        11         qA^2 pB^2     = .2304     G2211 = 48
22        12         qA^2 2pB qB   = .1152     G2212 = 36
22        22         qA^2 qB^2     = .0144     G2222 = −100

18.2.1 Estimation of Additive Effects

αA1 = G1... − μG = 16.6464,
αA2 = −11.0976,
αB1 = 10.4384,
αB2 = −41.7536.
The additive genetic effect for each genotype is

aijkl = αAi + αAj + αBk + αBl,
a1111 = αA1 + αA1 + αB1 + αB1,
      = 16.6464 + 16.6464 + 10.4384 + 10.4384,
      = 54.1696,
a1112 = 1.9776,
a1122 = −50.2144,
a1211 = 26.4256,
a1212 = −25.7664,
a1222 = −77.9584,
a2211 = −1.3184,
a2212 = −53.5104,
a2222 = −105.7024.
The additive genetic variance is then

σa^2 = Σ_{ijkl} fijkl aijkl^2 = 1241.1517.
18.2.2 Estimation of Dominance Effects

There are six conditional means to compute, one for each single locus genotype.

G11.. = Pr(B1B1)G1111 + Pr(B1B2)G1112 + Pr(B2B2)G1122,
      = (.64)(108) + (.32)(75) + (.04)(−80),
      = 89.92,
G12.. = 73.60,
G22.. = 38.24,
G..11 = 80.16,
G..12 = 48.96,
G..22 = −87.20.
The dominance genetic effects are given by

δAij = Gij.. − μG − αAi − αAj,

so that

δA11 = 89.92 − 63.4816 − 16.6464 − 16.6464,
     = −6.8544,
δA12 = 4.5696,
δA22 = −3.0464,
δB11 = −4.1984,
δB12 = 16.7936,
δB22 = −67.1744.
The dominance deviations for each genotype are

dijkl = δAij + δBkl,
d1111 = −11.0528,
d1112 = 9.9392,
d1122 = −74.0288,
d1211 = 0.3712,
d1212 = 21.3632,
d1222 = −62.6048,
d2211 = −7.2448,
d2212 = 13.7472,
d2222 = −70.2208.

The dominance genetic variance is therefore,

σd^2 = Σ_{ijkl} fijkl dijkl^2 = 302.90625.
18.2.3 Additive by Additive Effects
These are the interactions between alleles at different loci. There are four conditional means to calculate,

G1.1. = Pr(A1B1)G1111 + Pr(A1B2)G1112 + Pr(A2B1)G1211 + Pr(A2B2)G1212,
      = .32(108) + .08(75) + .48(95) + .12(50),
      = 92.16,
G1..2 = 32.00,
G.21. = 61.76,
G.2.2 = 14.88.

The additive by additive genetic effect is

ααA1B1 = G1.1. − μG − αA1 − αB1 = 1.5936,
ααA1B2 = −6.3744,
ααA2B1 = −1.0624,
ααA2B2 = 4.2496.
The additive by additive deviations for each genotype are

aaijkl = ααAiBk + ααAiBl + ααAjBk + ααAjBl,
aa1111 = 6.3744,
aa1112 = −9.5616,
aa1122 = −25.4976,
aa1211 = 1.0624,
aa1212 = −1.5936,
aa1222 = −4.2496,
aa2211 = −4.2496,
aa2212 = 6.3744,
aa2222 = 16.9984.
The additive by additive genetic variance is

σaa^2 = Σ_{ijkl} fijkl aaijkl^2 = 27.08865.
18.2.4 Additive by Dominance Effects

This is the interaction of a single allele at one locus with the pair of alleles at a second locus. There are twelve possible conditional means for twelve possible different A by D interactions. Not all are shown,

G1.11 = Pr(A1)G1111 + Pr(A2)G1211,
      = .4(108) + .6(95),
      = 100.2,
G.211 = Pr(A1)G1211 + Pr(A2)G2211,
      = .4(95) + .6(48),
      = 66.8.

The specific additive by dominance effects remove all lower-order effects, for example,

αδA1B11 = G1.11 − μG − αA1 − 2αB1 − δB11 − 2ααA1B1 = 0.2064.
Finally, the additive by dominance genetic values for each genotype are

adijkl = αδAiBkl + αδAjBkl + αδBkAij + αδBlAij,
ad1111 = −3.8784,
ad1112 = 4.7856,
ad1122 = 23.7696,
ad1211 = 2.9296,
ad1212 = −4.5664,
ad1222 = −10.3424,
ad2211 = −2.1824,
ad2212 = 3.9616,
ad2222 = 3.2256.
The additive by dominance genetic variance is

σad^2 = Σ_{ijkl} fijkl adijkl^2 = 17.2772.
18.2.5 Dominance by Dominance Effects

Dominance by dominance genetic effects are the interaction between a pair of alleles at one locus with another pair of alleles at a second locus. These effects are calculated as the genotypic values minus all of the other effects for each genotype. That is,

ddijkl = Gijkl − μG − aijkl − dijkl − aaijkl − adijkl.

The dominance by dominance genetic variance is the sum of the frequencies of each genotype times the dominance by dominance effects squared, σdd^2 = 8.5171. The table of all genetic effects is given below.
Genotypes
A-Locus B-Locus  fijkl   Gijkl   aijkl      dijkl      aaijkl    adijkl     ddijkl
11      11       .1024   108     54.1696    -11.0528   6.3744    -3.8784    -1.0944
11      12       .0512   75      1.9776     9.9392     -9.5616   4.7856     4.3776
11      22       .0064   -80     -50.2144   -74.0288   -25.4976  23.7696    -17.5104
12      11       .3072   95      26.4256    0.3712     1.0624    2.9296     0.7296
12      12       .1536   50      -25.7664   21.3632    -1.5936   -4.5664    -2.9184
12      22       .0192   -80     -77.9584   -62.6048   -4.2496   -10.3424   11.6736
22      11       .2304   48      -1.3184    -7.2448    -4.2496   -2.1824    -0.4864
22      12       .1152   36      -53.5104   13.7472    6.3744    3.9616     1.9456
22      22       .0144   -100    -105.7024  -70.2208   16.9984   3.2256     -7.7824

A summary of the genetic variances is

Total Genetic   1596.9409,
Additive        1241.1517,
Dominance        302.9062,
Add by Add        27.0886,
Add by Dom        17.2772,
Dom by Dom         8.5171.
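The whole two-locus decomposition can be verified in a few dozen lines. This is a sketch assuming pA = 0.4, pB = 0.8 and the genotypic values of the table; the conditional means are computed directly rather than read from the text:

```python
# Two-locus genetic variance decomposition for the worked example
pA = {1: 0.4, 2: 0.6}
pB = {1: 0.8, 2: 0.2}
Gval = {(1,1,1,1): 108, (1,1,1,2): 75, (1,1,2,2): -80,
        (1,2,1,1): 95,  (1,2,1,2): 50, (1,2,2,2): -80,
        (2,2,1,1): 48,  (2,2,1,2): 36, (2,2,2,2): -100}
def G(i, j, k, l):                      # genotypic value, order-insensitive
    return Gval[tuple(sorted((i, j))) + tuple(sorted((k, l)))]
fA = {(i,j): (2 - (i == j))*pA[i]*pA[j] for (i,j) in [(1,1),(1,2),(2,2)]}
fB = {(k,l): (2 - (k == l))*pB[k]*pB[l] for (k,l) in [(1,1),(1,2),(2,2)]}
f = {g: fA[g[:2]]*fB[g[2:]] for g in Gval}
mu = sum(f[g]*Gval[g] for g in f)       # 63.4816
GA = {ij: sum(fB[kl]*G(*ij, *kl) for kl in fB) for ij in fA}   # G_ij..
GB = {kl: sum(fA[ij]*G(*ij, *kl) for ij in fA) for kl in fB}   # G_..kl
alA = {i: sum(pA[j]*GA[tuple(sorted((i, j)))] for j in (1,2)) - mu for i in (1,2)}
alB = {k: sum(pB[l]*GB[tuple(sorted((k, l)))] for l in (1,2)) - mu for k in (1,2)}
dA = {ij: GA[ij] - mu - alA[ij[0]] - alA[ij[1]] for ij in fA}
dB = {kl: GB[kl] - mu - alB[kl[0]] - alB[kl[1]] for kl in fB}
aa = {(i,k): sum(pA[j]*pB[l]*G(i,j,k,l) for j in (1,2) for l in (1,2))
             - mu - alA[i] - alB[k] for i in (1,2) for k in (1,2)}
adAB = {(i,kl): sum(pA[j]*G(i,j,*kl) for j in (1,2)) - mu - alA[i]
                - alB[kl[0]] - alB[kl[1]] - dB[kl] - aa[(i,kl[0])] - aa[(i,kl[1])]
        for i in (1,2) for kl in fB}
adBA = {(k,ij): sum(pB[l]*G(*ij,k,l) for l in (1,2)) - mu - alB[k]
                - alA[ij[0]] - alA[ij[1]] - dA[ij] - aa[(ij[0],k)] - aa[(ij[1],k)]
        for k in (1,2) for ij in fA}
var = dict.fromkeys(['a', 'd', 'aa', 'ad', 'dd'], 0.0)
for g in f:
    i, j, k, l = g
    a_  = alA[i] + alA[j] + alB[k] + alB[l]
    d_  = dA[(i,j)] + dB[(k,l)]
    aa_ = aa[(i,k)] + aa[(i,l)] + aa[(j,k)] + aa[(j,l)]
    ad_ = adAB[(i,(k,l))] + adAB[(j,(k,l))] + adBA[(k,(i,j))] + adBA[(l,(i,j))]
    dd_ = Gval[g] - mu - a_ - d_ - aa_ - ad_
    for key, e in zip(var, (a_, d_, aa_, ad_, dd_)):
        var[key] += f[g]*e*e
print({k: round(v, 4) for k, v in var.items()})
```

The five components sum exactly to the total genetic variance Σ f(G − μG)^2, which is the point of the orthogonal decomposition under joint equilibrium.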
18.3 More than Two Loci
Interactions can occur between several loci. The maximum number of loci
involved in simultaneous interactions is unknown, but the limit is the number
of gene loci in the genome. Many geneticists believe that the higher order
interactions are few in number, and that if they exist the magnitude of their
effects is small. Someday the measurement of all of these interactions may be
possible, but modelling them may be impossible, and practical utilization of
that information may be close to impossible.
18.4 Linear Models for Non-Additive Genetic Effects
Consider a simple animal model with additive, dominance, and additive by
dominance genetic effects, and repeated observations per animal, i.e.,
yij = µ + ai + di + (ad)i + pi + eij ,
where μ is the overall mean, ai is the additive genetic effect of animal i, di is the dominance genetic effect of animal i, (ad)i is the additive by dominance genetic effect of animal i, pi is the permanent environmental effect for an animal with records, and eij is the residual effect. Also,




Var [ a  ]   [ A σ10^2      0           0            0        0       ]
    [ d  ]   [ 0         D σ01^2        0            0        0       ]
    [ ad ] = [ 0            0     (A⊙D) σ11^2        0        0       ]
    [ p  ]   [ 0            0           0         I σp^2      0       ]
    [ e  ]   [ 0            0           0            0     I σe^2     ].

18.4.1 Simulation of Data
Data should be simulated to understand the model and methodology. The
desired data structure is given in the following table for four animals.
Animal   Number of Records
1        3
2        2
3        1
4        4

Assume that

σ10^2 = 324,  σ01^2 = 169,  σ11^2 = 49,  σp^2 = 144,  σe^2 = 400.
The additive genetic relationship matrix for the four animals is A = L10 L10', where

L10 = [ 1     0     0      0     ]
      [ .5    .866  0      0     ]
      [ .25   0     .9682  0     ]
      [ .75   .433  0      .7071 ],

so that

A = [ 1     .5    .25    .75   ]
    [ .5    1     .125   .75   ]
    [ .25   .125  1      .1875 ]
    [ .75   .75   .1875  1.25  ].

The dominance genetic relationship matrix (derived from the gametic relationship matrix) is D = L01 L01', where

L01 = [ 1      0       0        0     ]
      [ .25    .9682   0        0     ]
      [ .0625  −.0161  .9979    0     ]
      [ .125   .0968   −.00626  .9874 ],

so that

D = [ 1      .25   .0625  .125 ]
    [ .25    1     0      .125 ]
    [ .0625  0     1      0    ]
    [ .125   .125  0      1    ].
The additive by dominance genetic relationship matrix is the Hadamard product of A and D, which is the element by element product of the two matrices. That is, A⊙D = L11 L11', where

L11 = [ 1        0        0        0      ]
      [ .125     .9922    0        0      ]
      [ .015625  −.00197  .999876  0      ]
      [ .09375   .08268   −.0013   1.1110 ],

so that

A⊙D = [ 1        .125     .015625  .09375 ]
      [ .125     1        0        .09375 ]
      [ .015625  0        1        0      ]
      [ .09375   .09375   0        1.25   ].

The Cholesky decomposition of each of these matrices is necessary to simulate the separate genetic effects. The simulated genetic effects for the four animals are (with va, vd, and vad being vectors of random normal deviates)

a = (324)^.5 L10 va = ( 12.91  13.28  −10.15  38.60 )',
d = (169)^.5 L01 vd = ( 15.09  5.32  −17.74  3.89 )',
(ad) = (49)^.5 L11 vad = ( −12.22  −1.32  −4.30  5.76 )'.
In the additive genetic animal model, base population animals were first simulated and then progeny were simulated by averaging the additive genetic values
of the parents and adding a random Mendelian sampling effect to obtain the
additive genetic values. With non-additive genetic effects, such a simple process does not exist. The appropriate genetic relationship matrices are necessary
and these need to be decomposed. The alternative is to determine the number of loci affecting the trait, and to generate genotypes for each animal after
defining the loci with dominance genetic effects and those that have additive
by dominance interactions. This might be the preferred method depending on
the objectives of the study.
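The Cholesky-based simulation described above can be sketched as follows, using the relationship matrices of this example and the assumed variances (the random seed is arbitrary):

```python
import numpy as np

# Simulate the three correlated genetic effect vectors via Cholesky factors
A = np.array([[1,.5,.25,.75],[.5,1,.125,.75],[.25,.125,1,.1875],[.75,.75,.1875,1.25]])
D = np.array([[1,.25,.0625,.125],[.25,1,0,.125],[.0625,0,1,0],[.125,.125,0,1]])
AD = A*D                                   # Hadamard product
rng = np.random.default_rng(2019)
effects = {}
for name, M, s2 in [('a', A, 324.), ('d', D, 169.), ('ad', AD, 49.)]:
    L = np.linalg.cholesky(M)              # M = L L'
    effects[name] = np.sqrt(s2) * L @ rng.standard_normal(4)
# each simulated vector has covariance matrix s2 * M by construction
print({k: np.round(v, 2) for k, v in effects.items()})
```

The factor L of A reproduces the L10 matrix shown earlier, so the same code covers all three decompositions.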
Let the permanent environmental effects for the four animals be

p = ( 8.16  −8.05  −1.67  15.12 )'.
Animal   a       d       (ad)    p       1      2       3      4
1        12.91   15.09   −12.22  8.16    36.21  45.69   49.41
2        13.28   5.32    −1.32   −8.05   9.14   −14.10
3        −10.15  −17.74  −4.30   −1.67   −20.74
4        38.60   3.89    5.76    15.12   24.13  83.09   64.67  50.13

18.4.2 HMME
Using the simulated data, the MME that need to be constructed are as follows.

[ X'X  X'Z              X'Z              X'Z                  X'Z        ] [ b̂  ]   [ X'y ]
[ Z'X  Z'Z + A^{-1}k10  Z'Z              Z'Z                  Z'Z        ] [ â  ]   [ Z'y ]
[ Z'X  Z'Z              Z'Z + D^{-1}k01  Z'Z                  Z'Z        ] [ d̂  ] = [ Z'y ]
[ Z'X  Z'Z              Z'Z              Z'Z + (A⊙D)^{-1}k11  Z'Z        ] [ âd ]   [ Z'y ]
[ Z'X  Z'Z              Z'Z              Z'Z                  Z'Z + Ikp  ] [ p̂  ]   [ Z'y ],

where k10 = 400/324, k01 = 400/169, k11 = 400/49, and kp = 400/144. Thus, the order is 17 for these four animals, with only 10 observations. Note that

X'y = ( 327.63 ),

and

Z'y = ( 131.31  −4.96  −20.74  222.02 )'.
The solutions are

â = ( 12.30  1.79  −8.67  15.12 )',
d̂ = ( 4.18  −4.86  −6.20  8.19 )',
âd = ( 1.49  −1.68  −1.87  3.00 )',
p̂ = ( 4.56  −6.18  −5.57  7.18 )',

and μ̂ = 17.02.
The total genetic merit of an animal can be estimated by adding together the solutions for the additive, dominance, and additive by dominance genetic values,

ĝ = â + d̂ + âd = ( 17.97  −4.75  −16.73  26.32 )'.
On the practical side, the solutions for the individual dominance and additive by dominance solutions should be used in breeding programs, but how?
Dominance effects arise due to particular sire-dam matings, and thus, dominance genetic values could be used to determine which matings were better.
However, additive by dominance genetic solutions may be less useful. Perhaps
the main point is that if non-additive genetic effects are significant, then they
should be removed through the model to obtain more accurate estimates of
the additive genetic effects, assuming that these have a much larger effect than
the non-additive genetic effects.
18.5 Computing Simplification
Take the MME as shown earlier, i.e.

[ X'X  X'Z              X'Z              X'Z                  X'Z        ] [ b̂  ]   [ X'y ]
[ Z'X  Z'Z + A^{-1}k10  Z'Z              Z'Z                  Z'Z        ] [ â  ]   [ Z'y ]
[ Z'X  Z'Z              Z'Z + D^{-1}k01  Z'Z                  Z'Z        ] [ d̂  ] = [ Z'y ]
[ Z'X  Z'Z              Z'Z              Z'Z + (A⊙D)^{-1}k11  Z'Z        ] [ âd ]   [ Z'y ]
[ Z'X  Z'Z              Z'Z              Z'Z                  Z'Z + Ikp  ] [ p̂  ]   [ Z'y ].

Now subtract the equation for dominance genetic effects from the equation for additive genetic effects, and similarly for the additive by dominance and permanent environmental effects, giving
A^{-1}k10 â − D^{-1}k01 d̂ = 0,
A^{-1}k10 â − (A⊙D)^{-1}k11 âd = 0,
A^{-1}k10 â − I kp p̂ = 0.
Re-arranging terms, then

d̂ = D A^{-1} (k10/k01) â,
âd = (A⊙D) A^{-1} (k10/k11) â,
p̂ = A^{-1} (k10/kp) â.
The only inverse that is needed is for A, and the equations to solve are only
as large as the usual animal model MME. The steps in the procedure would
be iterative.
1. Adjust the observation vector for solutions to d̂, âd, and p̂ (initially these would be zero) as

   ỹ = y − Z(d̂ + âd + p̂).

2. Solve the following equations:

   [ X'X  X'Z             ] [ b̂ ]   [ X'ỹ ]
   [ Z'X  Z'Z + A^{-1}k10 ] [ â ] = [ Z'ỹ ].

3. Obtain solutions for d̂, âd, and p̂ using

   d̂ = D A^{-1} (k10/k01) â,
   âd = (A⊙D) A^{-1} (k10/k11) â,
   p̂ = A^{-1} (k10/kp) â.

4. Go to step 1 and begin again until convergence is reached.
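The simplification can be verified on the example data: a sketch that builds the full order-17 MME directly, solves it, and checks that the solutions satisfy the back-substitution relations of step 3 (so only A ever needs inverting in the iterative scheme):

```python
import numpy as np

# 4 animals, 10 records (3, 2, 1, 4 per animal), mu as the only fixed effect
y = np.array([36.21, 45.69, 49.41, 9.14, -14.10, -20.74, 24.13, 83.09, 64.67, 50.13])
X = np.ones((10, 1))
Z = np.zeros((10, 4))
Z[np.arange(10), [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]] = 1      # record -> animal
A = np.array([[1,.5,.25,.75],[.5,1,.125,.75],[.25,.125,1,.1875],[.75,.75,.1875,1.25]])
D = np.array([[1,.25,.0625,.125],[.25,1,0,.125],[.0625,0,1,0],[.125,.125,0,1]])
AD = A*D
k10, k01, k11, kp = 400/324, 400/169, 400/49, 400/144
Ainv = np.linalg.inv(A)
ridges = [(Ainv, k10), (np.linalg.inv(D), k01), (np.linalg.inv(AD), k11), (np.eye(4), kp)]
ZZ = Z.T @ Z
C = np.vstack([np.hstack([X.T@X] + [X.T@Z]*4)] +
              [np.hstack([Z.T@X] + [ZZ + (M*k if c == r else np.zeros((4, 4)))
                                    for c in range(4)])
               for r, (M, k) in enumerate(ridges)])
rhs = np.concatenate([X.T@y] + [Z.T@y]*4)
sol = np.linalg.solve(C, rhs)
mu, a, d, ad, p = sol[0], sol[1:5], sol[5:9], sol[9:13], sol[13:17]
# the relations used in step 3 of the iteration hold at the exact solution
assert np.allclose(d,  D  @ Ainv @ a * (k10/k01))
assert np.allclose(ad, AD @ Ainv @ a * (k10/k11))
assert np.allclose(p,       Ainv @ a * (k10/kp))
print(np.round(mu, 2), np.round(a, 2))   # should agree with the text's solutions
```

Because the three relations are exact at the solution, the iterative scheme and the full MME converge to the same answer.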
18.6 Estimation of Variances
Given the new computing algorithm, and using Gibbs sampling as a tool, the variances can be estimated. Notice from the above formulas that

w01 = D^{-1} d̂ = A^{-1} (k10/k01) â,
w11 = (A⊙D)^{-1} âd = A^{-1} (k10/k11) â,
wp = I p̂ = A^{-1} (k10/kp) â.

Again, the inverses of D and (A⊙D) are not needed. The necessary quadratic forms are then

d̂' w01 = d̂' D^{-1} d̂,
âd' w11 = âd' (A⊙D)^{-1} âd,
p̂' wp = p̂' p̂,

and â' A^{-1} â. Generate 4 random Chi-square variates, Ci, with degrees of freedom equal to the number of animals in â, then

σ10^2 = â' A^{-1} â / C1,
σ01^2 = d̂' w01 / C2,
σ11^2 = âd' w11 / C3,
σp^2 = p̂' wp / C4.

The residual variance would be estimated from

σe^2 = ê' ê / C5,
where C5 is a random Chi-square variate with degrees of freedom equal to the
total number of observations. This may not be a totally correct algorithm and
some refinement may be necessary, but this should be the basic starting point.
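The core of each such draw is a quadratic form divided by a Chi-square variate. A minimal sketch, with an illustrative quadratic form S and degrees of freedom rather than values from the example:

```python
import numpy as np

# One Gibbs-style draw of a variance component: S / chi-square(nu)
rng = np.random.default_rng(7)
S, nu = 10.0, 50                   # e.g. S = a' A^{-1} a, nu = number of animals
draws = S / rng.chisquare(nu, size=200_000)
# a scaled inverse Chi-square variate has mean S/(nu - 2)
mean_draw = draws.mean()
print(mean_draw)
```

Repeating such draws for each component, with the quadratic forms recomputed from the current solutions, gives the chain whose samples estimate the variances.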
Chapter 19
Threshold Models
19.1 Categorical Data
If observations are expressed phenotypically as being a member of one of m
categories, then the data are categorical, discrete, and non-continuous. With
two categories, m = 2, then the trait is an "all-or-none" or binary trait. Most disease traits are binary, where the animal is either diseased or not diseased.
Calving ease or conformation traits have more than two categories arranged in
a delineated sequence from one extreme expression to the opposite extreme expression. Calving ease, for example, has four categories, namely (1) unassisted
calving, (2) slight assistance, (3) hard pull, or (4) caesarian section required.
Rump angle, for example, has nine categories where the pins range from too
low to too high with the desirable category being in the middle.
Categorical traits may be inherited in a polygenic manner. The underlying susceptibility to a disease trait, calving ease, or type trait may actually be
continuous and may follow a normal distribution. The underlying continuous
scale is known as the liability scale. On the liability scale is a threshold point
(or points) where above this threshold the animal expresses the disease phenotype, and below the threshold point the animal does not express the disease.
The liability scale is only conceptual and cannot be observed (Gianola and Foulley, 1983; Harville and Mee, 1984). The model employed is known as a "threshold" model. The details of this procedure are described below.
19.2 Threshold Model Computations
The general linear model is
y = Xb + Zu + e,
where y is the observation vector, but in the case of categorical data represents
a variable on the underlying liability scale. The underlying liability scale is unobservable and therefore, y is unknown. However, y is known to be affected by
fixed effects, b, and random effects, u, such as animal additive genetic effects,
and random environmental (or residual) effects, e. The assumed covariance
structure is

Var [ u ]   [ G  0 ]
    [ e ] = [ 0  R ].
The analysis requires that the threshold points be estimated from the categorical data simultaneously with the estimation of b and u. This makes for a set
of non-linear estimation equations. Initial values of the thresholds are used to
estimate y, which are then used to estimate new threshold points and b and
u. This must be repeated until the threshold points do not change.
There are various quantities which need to be computed repeatedly in the
analysis, and these are based on normal distribution functions.
1. Φ(x) is known as the cumulative distribution function of the normal
distribution. This function gives the area under the normal curve up
to the value of x, for x going from minus infinity to plus infinity (the
range for the normal distribution). For example, if x = .4568, then
Φ(x) = .6761, or if x = −.4568, then Φ(x) = .3239. The short program
given at the end of these notes computes Φ(x) for a value of x. Let Φk
represent the value up to and including category k. Note that if there
are m categories and k = m, then Φk = 1.
2. φ(x) is a function that gives the height of the normal curve at the value x, for a normal distribution with mean zero and variance 1. That is,

   φ(x) = (2π)^{-.5} exp(−.5 x^2).

For example, if x = 1.0929, then φ(x) = .21955.
3. P (k) is the probability that x from a N (0, 1) distribution is between two
threshold points, or is in category k. That is,
P (k) = Φk − Φk−1 .
If k = 1, then Φk−1 = 0.
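The two normal-distribution functions can be written with the standard error function instead of the short program mentioned above; a minimal sketch:

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Cumulative distribution function of the N(0,1) distribution."""
    return 0.5*(1.0 + erf(x/sqrt(2.0)))

def phi(x):
    """Height of the N(0,1) density at x."""
    return exp(-0.5*x*x)/sqrt(2.0*pi)

print(round(Phi(0.4568), 4), round(Phi(-0.4568), 4), round(phi(1.0929), 4))
# the values quoted in the text: .6761, .3239, and .21955
```

By symmetry Φ(−x) = 1 − Φ(x), which is why the two example values of Φ add to one.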
Data should be arranged by smallest subclasses. Consider the calving ease
data from the paper by Gianola and Foulley (1983).
Calving Ease Data

Herd-  Age of    Sex of  Sire of        Category
Year   Dam (yr)  Calf    Calf        1    2    3    Total
1      2         M       1           1    0    0    1
1      2         F       1           1    0    0    1
1      3         M       1           1    0    0    1
1      2         F       2           0    1    0    1
1      3         M       2           1    0    1    2
1      3         F       2           3    0    0    3
1      2         M       3           1    1    0    2
1      3         F       3           0    1    0    1
1      3         M       3           1    0    0    1
2      2         F       1           2    0    0    2
2      2         M       1           1    0    0    1
2      3         M       1           0    0    1    1
2      2         F       2           1    0    1    2
2      3         M       2           1    0    0    1
2      2         F       3           0    1    0    1
2      3         M       3           0    0    1    1
2      2         M       4           0    1    0    1
2      2         F       4           1    0    0    1
2      3         F       4           2    0    0    2
2      3         M       4           2    0    0    2
Let njk represent the number of observations in the k th category for the
j th subclass, and nj. represent the marginal total for the j th subclass. There
are s = 20 subclasses in the above example data, and m = 3 categories for the
trait.
19.2.1 Equations to Solve
The system of equations to be solved can be written as follows:

[ Q    L'X    L'Z           ] [ ∆t ]   [ p             ]
[ X'L  X'WX   X'WZ          ] [ ∆b ] = [ X'v           ]
[ Z'L  Z'WX   Z'WZ + G^{-1} ] [ ∆u ]   [ Z'v − G^{-1}u ].
The equations must be solved iteratively. Note that ∆b, for example, is the
change in solutions for b between iterations. The calculations for Q, L, W, p,
and v need to be described. The values of these matrices and vectors change
with each iteration of the non-linear system. Using the example data on calving
ease, then X contains columns referring to an overall mean, herd-year effects,
age of dam effects, and sex of calf effects, and Z contains columns referring to
sires, as follows:

(with rows in the order of the subclasses in the table above, and columns of X corresponding to μ, H1, H2, A1, A2, S1, and S2)

X = [ 1 1 0 1 0 1 0 ]        Z = [ 1 0 0 0 ]
    [ 1 1 0 1 0 0 1 ]            [ 1 0 0 0 ]
    [ 1 1 0 0 1 1 0 ]            [ 1 0 0 0 ]
    [ 1 1 0 1 0 0 1 ]            [ 0 1 0 0 ]
    [ 1 1 0 0 1 1 0 ]            [ 0 1 0 0 ]
    [ 1 1 0 0 1 0 1 ]            [ 0 1 0 0 ]
    [ 1 1 0 1 0 1 0 ]            [ 0 0 1 0 ]
    [ 1 1 0 0 1 0 1 ]            [ 0 0 1 0 ]
    [ 1 1 0 0 1 1 0 ]            [ 0 0 1 0 ]
    [ 1 0 1 1 0 0 1 ]            [ 1 0 0 0 ]
    [ 1 0 1 1 0 1 0 ]            [ 1 0 0 0 ]
    [ 1 0 1 0 1 1 0 ]            [ 1 0 0 0 ]
    [ 1 0 1 1 0 0 1 ]            [ 0 1 0 0 ]
    [ 1 0 1 0 1 1 0 ]            [ 0 1 0 0 ]
    [ 1 0 1 1 0 0 1 ]            [ 0 0 1 0 ]
    [ 1 0 1 0 1 1 0 ]            [ 0 0 1 0 ]
    [ 1 0 1 1 0 1 0 ]            [ 0 0 0 1 ]
    [ 1 0 1 1 0 0 1 ]            [ 0 0 0 1 ]
    [ 1 0 1 0 1 0 1 ]            [ 0 0 0 1 ]
    [ 1 0 1 0 1 1 0 ]            [ 0 0 0 1 ].
The vector t will contain the threshold points for a N (0, 1) distribution at the
end of the iterations.
The process is begun by choosing starting values for b, u, and t. Let
b = 0 and u = 0, then starting values for t can be obtained from the data
by knowing the fraction of animals in the first category (i.e. 0.6786), and in
the first two categories (i.e. 0.8572). The threshold values that give those
percentages are t1 = 0.3904, and t2 = 0.9563.
Suppose that several iterations have been performed and the latest solutions are

t = ( .37529  1.01135 )',

b = ( μ  H1  H2  A1  A2  S1  S2 )' = ( 0.0  0.0  .29752  0.0  −.12687  0.0  −.39066 )',

u = ( u1  u2  u3  u4 )' = ( −.08154  .06550  .12280  −.10676 )'.
The following calculations are performed for each subclass from j = 1 to s; those for j = 1 are shown below:

1. fjk = tk − xj'b − zj'u for k = 1 to (m − 1).

   f11 = t1 − μ − H1 − A1 − S1 − u1
       = .37529 − 0 − 0 − 0 − 0 − (−.08154)
       = .45683, and
   f12 = t2 − μ − H1 − A1 − S1 − u1
       = 1.09289.

2. For k = 1 to m, Φjk = Φ(fjk).

   Φ11 = Φ(.45683) = .6761,
   Φ12 = Φ(1.09289) = .8628,
   Φ13 = 1, and
   Φ10 = 0.

3. For k = 1 to m, Pjk = Φjk − Φj(k−1).

   P11 = .6761 − 0 = .6761,
   P12 = .8628 − .6761 = .1867,
   P13 = 1 − .8628 = .1372.
4. For k = 0 to m, φjk = φ(fjk).

   φ10 = 0.0,
   φ11 = φ(.45683) = .3594,
   φ12 = φ(1.09289) = .2196,
   φ13 = 0.0.
5. The matrix W is a diagonal matrix of order equal to the number of smallest subclasses, s. The elements of W provide a weighting factor for the jth subclass that depends on the number of observations in that subclass and on the values of φjk and Pjk. The jth diagonal is given by

   wjj = nj. Σ_{k=1}^{m} (φj(k−1) − φjk)^2 / Pjk.

For the first subclass,

   w11 = n1. [ (φ10 − φ11)^2/P11 + (φ11 − φ12)^2/P12 + (φ12 − φ13)^2/P13 ]
       = 1 [ (−.3594)^2/.6761 + (.1398)^2/.1867 + (.2196)^2/.1372 ]
       = .6471.
The values for all subclasses were

( .6471  .5167  .6082  .5692  1.3058  1.5721  1.4002
  .5447  .6687  1.2380  .7196  .6923  1.3247  .7234
  .6776  .7334  .7146  .6110  1.1373  1.3723 ).
6. The vector v is created and used in place of y, which is unknown. For the jth subclass,

   vj = Σ_{k=1}^{m} njk (φj(k−1) − φjk) / Pjk.

For j = 1, then

   v1 = [ 1(φ10 − φ11)/P11 + 0(φ11 − φ12)/P12 + 0(φ12 − φ13)/P13 ],
      = −.3594/.6761,
      = −.5316.

The complete vector v is

( −.5316  −.3475  −.4671  .9848  1.0414  −1.0678  −.0928
  1.0521  −.5731  −.9676  −.6993  1.4632  .9961  −.7115
  .6416  1.3042  .4859  −.4713  −.8220  −1.2216 ).
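The calculations for the first subclass can be checked directly; a sketch, with Φ and φ defined through the error function as before:

```python
from math import erf, exp, pi, sqrt

Phi = lambda x: 0.5*(1.0 + erf(x/sqrt(2.0)))
phi = lambda x: exp(-0.5*x*x)/sqrt(2.0*pi)

t = (0.37529, 1.01135)
u1 = -0.08154
# subclass 1: herd-year 1, 2-yr-old dam, male calf, sire 1;
# all fixed-effect solutions are zero for this subclass
fvals = (t[0] - u1, t[1] - u1)                    # .45683 and 1.09289
P  = (Phi(fvals[0]), Phi(fvals[1]) - Phi(fvals[0]), 1.0 - Phi(fvals[1]))
ph = (0.0, phi(fvals[0]), phi(fvals[1]), 0.0)     # phi_0 .. phi_3
n  = (1, 0, 0)                                    # one record, in category 1
w11 = sum(n)*sum((ph[k] - ph[k+1])**2/P[k] for k in range(3))
v1  = sum(n[k]*(ph[k] - ph[k+1])/P[k] for k in range(3))
print(round(w11, 4), round(v1, 4))                # ~ .6471 and -.5316
```

Looping the same few lines over all 20 subclasses reproduces the complete W diagonal and v vector.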
7. The matrix L is of order s × (m − 1) and the jkth element is

   ljk = −nj. φjk [ (φjk − φj(k−1))/Pjk − (φj(k+1) − φjk)/Pj(k+1) ].

For the example data,

L = [ −.46033   −.18681 ]
    [ −.41079   −.10587 ]
    [ −.45052   −.15773 ]
    [ −.43594   −.13326 ]
    [ −.92257   −.38328 ]
    [ −1.24395  −.32817 ]
    [ −.92432   −.47584 ]
    [ −.42492   −.11981 ]
    [ −.46308   −.20565 ]
    [ −.90751   −.33045 ]
    [ −.45725   −.26236 ]
    [ −.46306   −.22921 ]
    [ −.92501   −.39969 ]
    [ −.45573   −.26770 ]
    [ −.46349   −.21408 ]
    [ −.45053   −.28290 ]
    [ −.45893   −.25571 ]
    [ −.45138   −.15961 ]
    [ −.87141   −.26588 ]
    [ −.92684   −.44551 ].
8. The matrix Q is a tri-diagonal matrix of order (m − 1) × (m − 1); however, this is only noticeable when m is greater than 3. The diagonals of Q are given by

   qkk = Σ_{j=1}^{s} nj. φjk^2 (Pjk + Pj(k+1)) / (Pjk Pj(k+1)),

and the off-diagonals are

   qk(k+1) = − Σ_{j=1}^{s} nj. φjk φj(k+1) / Pj(k+1).

For the example data,

Q = [ 24.12523   −11.55769 ]
    [ −11.55769   16.76720 ].
9. The elements of vector p are given by

   pk = Σ_{j=1}^{s} φjk [ (njk/Pjk) − (nj(k+1)/Pj(k+1)) ],

for k = 1 to (m − 1). Hence,

   p' = ( .003711  .000063 ).

Assume that G = I(1/19) for the example data. There are dependencies in X and with the threshold points, so that 4 restrictions on the solutions are needed. Let the solutions for μ, H1, A1, and S1 be restricted to zero by removing those columns from X. The remaining calculations give the following submatrices of the mixed model equations.
L'X = [ −6.8311  −6.6726  −6.1344 ]
      [ −3.1131  −2.6858  −2.0568 ],

L'Z = [ −3.1495  −3.9832  −2.7263  −2.7086 ]
      [ −1.2724  −1.5121  −1.2983  −1.1267 ],

X'WX = [ 9.9442  4.6588  4.9885 ]
       [ 4.6588  9.3585  3.2541 ]
       [ 4.9885  3.2541  8.1912 ],

X'v = ( −.00221  −.00210  −.00149 )',

X'WZ = [ 2.6498  2.0481  1.4110  3.8352 ]
       [ 1.3005  3.6014  1.9469  2.5096 ]
       [ 1.7546  3.4660  1.2223  1.7483 ],

Z'v − I(19)u = ( −.00070  −.00139  −.00116  −.00052 )',

Z'WZ + I(19) = diag( 23.4219  24.4953  23.0246  22.8352 ).
The solutions (change in estimates) to these equations are added to the estimates used to start the iteration to give the new estimates, as shown in the table below.

      ∆β        + βi−1     = βi
t1    .000229   .37529     .375519
t2    .000158   1.01135    1.011508
H1    0.0       0.0        0.0
H2    −.000046  .29752     .297473
A1    0.0       0.0        0.0
A2    −.000013  −.12687    −.126883
S1    0.0       0.0        0.0
S2    .000064   −.39066    −.390596
u1    .000011   −.08154    −.081529
u2    −.000013  .06550     .065487
u3    −.000014  .12280     .122786
u4    .000017   −.10676    −.106743
As further iterations are completed, the right hand sides of the equations approach a null vector, so that the solutions, ∆β, also approach a null vector.

There could be problems in obtaining convergence, or a valid set of estimates. In theory the threshold points should be strictly increasing. That is, tm should be greater than t(m−1), which should be greater than t(m−2), and so on. The solutions may not come out in this order. This could happen when two thresholds are close to each other due to the definition of categories. An option would be to combine two or more of the categories involved in the problem, and to re-analyze with fewer categories.
Another problem could occur when all observations fall into either extreme category. This is known as an "extreme category" problem. Suggestions have been to delete subclasses with this problem, but throwing away data is not a good idea. Other types of analyses may be needed.
19.3 Estimation of Variance Components
Harville and Mee (1984) recommended a REML-like procedure for estimating the variance ratio of residual to random effects variances. One of the assumptions in a threshold model is that the residual variance is fixed to a value of 1; hence only the variances of the random effects need to be estimated. Let a generalized inverse of the coefficient matrix in the equations be represented as

C = [ Ctt  Ctx  Ctz ]
    [ Cxt  Cxx  Cxz ]
    [ Czt  Czx  Czz ].

Then the REML EM estimator of the variance of the random effects is

σu^2 = (û'û + tr(Czz)) / d,

where d is the number of levels in u. Using the example data, and the recent estimates (as final values),

û'û = .0374061,
tr Czz = .18460,
σu^2 = (.0374061 + .18460)/4,
     = .0555015.

The new variance ratio is (.0555015)^{-1} = 18.0175. The REML EM-like algorithm would need to be repeated until the variance estimate is converged. Hence there are two levels of iteration going on at the same time in a very non-linear system of equations.
The final estimates of the variances are in terms of the liability or underlying normal distribution scale. If the number of categories is only two, then heritability can be converted to the "observed" scale using the formula

h2obs = h2lia z^2 / (p(1 − p)),

where p is the proportion of observations in one category, and z is the height of the normal curve at the threshold point.
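The conversion can be sketched in a few lines; the incidence p and liability heritability below are illustrative values, not estimates from the example:

```python
from statistics import NormalDist

# Liability-to-observed heritability conversion for a binary trait
p = 0.15                        # proportion of observations in one category
h2_lia = 0.25                   # heritability on the liability scale
nd = NormalDist()
thr = nd.inv_cdf(1.0 - p)       # threshold on the liability scale
z = nd.pdf(thr)                 # height of the normal curve at the threshold
h2_obs = h2_lia * z*z / (p*(1.0 - p))
print(round(h2_obs, 3))         # noticeably smaller than h2_lia
```

For incidences far from one half, z^2/(p(1 − p)) shrinks quickly, which is why observed-scale heritabilities of rare binary traits are small.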
19.4 Expression of Genetic Values
Using the estimates from the non-linear system of equations, the probability
of a sire’s offspring to fall into each category can be calculated. Take sire 3 as
an example. First the threshold point for male offspring of 2-yr-old dams in
herd-year 1 needs to be determined as
x = t1 + H1 + A1 + S1 + u3,
  = .375519 + 0 + 0 + 0 + .122786,
  = .498305.

Using the short program below, then

Φ(x) = Φ(.498305) = .69087.
Similarly, the probability of the sire’s offspring to be in categories 1 and 2
would be based on the second threshold,
x = 1.011508 + .122786 = 1.134294,
or Φ(x) = .87166. Thus, the proportion of offspring in category 2 would be
.87166 − .69087 = .18079, and the proportion in category 3 would be 1.0 −
.87166 = .12834. In the raw data about .68 of the offspring were in category
1, and therefore sire 3 has only .01 greater in probability for category 1, and
for categories 1 and 2 the raw percentage was .86. With the low heritability
assumed, sires would not be expected to have greatly different probabilities
from the raw averages.
Notice that the probabilities are defined in terms of one of the smallest fixed effects subclasses. If 3-yr-old dams and female calves were considered, then the new threshold values would be lower, thereby giving fewer offspring in category 1 and maybe more in categories 2 or 3, because the solutions for 3-yr-old dams and for female calves are both negative. If an average offspring is envisioned, then solutions for ages of dam can be averaged and solutions for sex of calf can be averaged. A practical approach would be to choose the subclass where there seems to be the most economic losses and rank sires on that basis.
Part IV
Computing Ideas
Chapter 20
Fortran Programs
This chapter gives ideas on how to write a FORTRAN program to perform a multiple-trait random regression model analysis. The example is milk yield from test day records of dairy cows in the first three lactations. The Linux servers employ the Intel compiler and a few Intel math library routines (sorting and calculating eigenvalues and eigenvectors).
20.1 Main Program
The strategy is to have a main program that is only one or two pages in
length. Almost every line of the main program is a call to a subroutine. The
subroutines are also kept as simple as possible, but some can be very long.
The model assumes that all of the factors except Year-Month of Calving are fitted by order 4 Legendre polynomials, so every level of these factors has 5 covariates to estimate. The Year-Month of Calving subclasses (within parities) each have 36 ten-day periods to estimate.
c
c     Random regression model production test day milk records
c     one trait, 3 lactations, order 4 Legendre polynomials
c     y = YM(36) + RAS(5) + CG(15) + a(15) + p(15) + e(3,4)
c
      include 'SShd.f'
      itest = 0
      igibb = 0
      mxit=6000
      if(itest.gt.0)mxit=10
      if(igibb.gt.0)then
c Two files for storing Gibbs samples
        open(17,file='animalVCV.d',form='unformatted',
     x   status='unknown')
        open(19,file='cgVCV.d',form='unformatted',
     x   status='unknown')
        open(20,file='resVC.d',form='formatted',
     x   status='unknown')
      endif
c
c #############################################################
c read in parameters
      call params
c
c ############################################################
c read in pedigree info
      call peddys
c
c ############################################################
c read in data
      call datum
c
c ############################################################
c iterations on equations to solve
      iter = 0
 800  iter = iter + 1
      if(iter.gt.mxit)go to 9901
c
      ccn = 0.d0
      ccd = 0.d0
c one subroutine for each factor in the model
      call facYM
      call facRAS
      call facCG
c permanent environmental and additive genetic effects
      call permenv
      call genetic
c igibb > 0 if covariance matrices are estimated
      if(igibb.gt.0)then
        call facRES
      endif
      ccc = 100.0*(ccn/ccd)
      if(mod(iter,100).eq.0)print *,iter,ccc
      if(ccc.gt.0.1d-09)go to 800
c
c #############################################################
c finished, save solutions
c
 9901 call finis
      if(igibb.gt.0)then
        close(17)
        close(19)
        close(20)
      endif
      stop
      end
      include 'SSparams.f'
      include 'SSpeddys.f'
      include 'SSdatum.f'
      include 'SSfini.f'
      include 'SSYM.f'
      include 'SSRAS.f'
      include 'SSCG.f'
      include 'SSanm.f'
      include 'SSape.f'
      include 'SSres.f'
      include 'dkmvhf.f'
The program has only a few steps.

1. call params, to read in the necessary covariance function matrices, and set up their inverses for use in the mixed model equations.

2. call peddys, to read in the pedigree information and to set up the diagonals of A−1 for each animal.

3. call datum, to read and store the data for the iteration-on-data procedure. Some arrays need to be sorted.

4. Iterate on the solutions by calling subroutines, one for each factor in the model.

5. call finis, to write out and save the solutions to the factors that are of interest, only if igibb = 0.
All of the subroutines are joined together by include 'SShd.f'. This file defines the variables that need to be shared between subroutines, and it specifies which variables are double precision and which are integer. The big arrays are put into COMMON areas so that they are stored consecutively in memory and thereby take a little less space. Thus, during the initial programming, if an array needs to be increased in length, the change can be made in this one file, and it automatically occurs in all of the subroutines. There is no chance of forgetting to make the change in one of them.
With COMMON areas one has to worry about boundary alignments. This problem is avoided by having separate COMMON areas for double precision arrays and integer arrays. Boundary alignment problems occur when integer and double precision arrays are mixed together in one COMMON area. If the two types must be in the same COMMON area, then the double precision arrays should precede the integer arrays.
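The alignment rule can be illustrated with Python's struct module, which models C/Fortran-style native layout; the exact sizes are platform dependent, so this is a demonstration of the principle rather than a statement about any particular compiler.

```python
import struct

# Native ('@') layout follows the platform's alignment rules.
# A 4-byte int followed by an 8-byte double usually forces 4 bytes of
# padding so the double starts on an 8-byte boundary; putting the
# double first needs no padding hole.
int_then_double = struct.calcsize('@id')   # typically 16 on 64-bit systems
double_then_int = struct.calcsize('@di')   # typically 12
```

This is exactly why the double precision arrays should come first: no padding holes are needed when the most strictly aligned items lead.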
      Parameter(no=15,nop=120,nam=200000,ncg=12000,nras=200,
     x ndim=365,nrec=1270000,nped=500000,nym=500,mcov=5,
     x ntg=36,ntim=4)
c
c   no = 5 covariates times 3 lactations = 15
c   nop = (no*(no+1))/2, half-stored matrix array size
c   nam = maximum number of animals
c   ncg = maximum number of contemporary groups
c   nras = number of region-age-season of calving subclasses
c   ndim = number of days in milk (maximum)
c   nrec = maximum number of test day records
c   nped = maximum number of pedigree elements to store
c   nym = number of year-month subclasses
c   mcov = number of covariates
c   ntg = number of ten-day time groups within year-months
c   ntim = number of residual variance periods within a lactation
c
      Common /recs/lp(ndim,5),obs(nrec),anid(nrec),cgid(nrec),
     x rasid(nrec),ymid(nrec),pari(nrec),days(nrec),timg(nrec),
     x mrec,mcgid,mras,mym,iseed
c
c   lp = legendre polynomials of order 4 for days 1 to 365 in milk
c   obs = test day milk yields
c   anid = animal ID associated with each obs
c   cgid = contemporary group associated with each obs
c   rasid = region-age-season for each obs
c   ymid = year-month for each obs
c   pari = parity number for each obs
c   days = days in milk for each obs
c   timg = 1 to 36 time groups within YM subclasses
c   mrec = actual total number of test day records
c   mcgid = max id of contemporary groups
c   mras = max id of ras subclasses
c   mym = max id of year-month subclasses
c
      Common /peds/bii(nam),adiag(nam),sir(nam),dam(nam),
     x cpa(nped),cpc(nped),cps(nped),cpd(nped),
     x jped(nped),mam,mped
c
c   bii = elements needed for A-inverse
c   adiag = diagonal elements for each animal
c   sir = sire ID (consecutively numbered and chronological)
c   dam = dam ID (consecutively numbered and chronological)
c   cpa = coded pedigree record, animal id
c   cpc = code (0,1,2)
c   cps = sire or progeny ID
c   cpd = dam or mate ID
c   mam = total number of animals < nam
c   mped = pedigree records < nped
c
      Common /parms/gi(nop),pi(nop),ci(3,15),res(4,3),
     x ri(ndim,3),wv(nras)
c
c   gi = genetic covariance function matrix
c   pi = permanent environmental covariance function matrix
c   ci = contemporary group covariance function matrix
c   res = residual variances, 3 parities, 4 periods
c   ri = inverses for each day in milk
c   wv = work vector, used for many things
c
      Common /diags/pcg(nrec),pras(nrec),pymid(nrec),
     x panid(nrec),wr(nam),iwv(nras),itest,igibb
c
c   pcg = cgid sorted order
c   pras = rasid sorted order
c   pymid = ymid sorted order
c   panid = anid sorted order
c   wr = number of records per animal (many are zero)
c   iwv = integer work vector
c   itest = 0 for good run, not zero during initial programming
c   igibb not zero, means to estimate covariance matrices
c
      Common /solns/sanm(nam,no),sape(nam,no),scg(ncg,mcov),
     x sras(nras,mcov),sym(nym,ntg),ccn,ccd,ccc
c
c   solution arrays for animal genetic, PE, cont. groups,
c   and region-age-season; ccn, ccd, and ccc are for the
c   convergence criteria
c
      real*8 lp,obs,bii,adiag,gi,pi,res,ri,sanm,sape,scg,sras,
     x sym,rhs,ccn,ccd,ccc,wv
      integer anid,cgid,rasid,ymid,pari,days,timg,pcg,pras,
     x pymid,panid,mped,mrec,sir,dam,cpa,cpc,cps,cpd,jped,
     x wr,iwv,itest,mam,igibb,mras,mcgid,mym,iseed
The above lines are included in every subroutine of the program. Subroutines may have some of their own variables, which are used only within that subroutine.

Define all of the variables in this file as either real*8 or integer. Do not rely on default typing rules.
20.2 Call Params
The first subroutine to be called is params, which reads in the covariance matrices that will be used to start the iteration process. The random factors of the model are contemporary groups, animal additive genetic effects, and animal permanent environmental effects.
      subroutine params
      include 'SShd.f'
      real*8 varc,varg,varp,x,z(5),hh(15)
      open(10,file='GP4.d',form='formatted',status='old')
      gi = 0.d0
      pi = 0.d0
      ri = 0.d0
      ci = 0.d0
      res = 0.d0
      lp = 0.d0
c
 10   read(10,1001,end=20)kr,kc,varg,varp
 1001 format(1x,2i4,8d20.10)
      if(kr.eq.0)go to 20
      m = ihmssf(kr,kc,no)
      gi(m) = varg
      pi(m) = varp
      go to 10
c
 20   read(10,1002,end=21)kp,kr,kc,varc
 1002 format(1x,3i4,d20.10)
      m = ihmssf(kr,kc,mcov)
      ci(kp,m)=varc
      go to 20
 21   close(10)
c
c residual variances, by parity and days groups
c
      open(11,file='RES4.d',form='formatted',status='old')
      do 22 i=1,3
      do 22 j=1,4
      read(11,*,end=30)ka,x
      res(j,i)=x
 22   continue
 30   close(11)
c
      call dkmvhf(gi,no,wv,iwv)
      call dkmvhf(pi,no,wv,iwv)
      do 81 kp=1,3
      hh = 0.d0
      do 82 m=1,15
      hh(m)=ci(kp,m)
 82   continue
      call dkmvhf(hh,mcov,wv,iwv)
      do 83 m=1,15
      ci(kp,m)=hh(m)
 83   continue
 81   continue
c
c Different residual covariance matrices through the lactation
      ri = 0.d0
      do 31 i=1,45
      ri(i,1) = 1.d0/res(1,1)
      ri(i,2) = 1.d0/res(1,2)
      ri(i,3) = 1.d0/res(1,3)
 31   continue
      do 32 i=46,115
      ri(i,1) = 1.d0/res(2,1)
      ri(i,2) = 1.d0/res(2,2)
      ri(i,3) = 1.d0/res(2,3)
 32   continue
      do 33 i=116,265
      ri(i,1) = 1.d0/res(3,1)
      ri(i,2) = 1.d0/res(3,2)
      ri(i,3) = 1.d0/res(3,3)
 33   continue
      do 34 i=266,365
      ri(i,1) = 1.d0/res(4,1)
      ri(i,2) = 1.d0/res(4,2)
      ri(i,3) = 1.d0/res(4,3)
 34   continue
c
c read in Legendre polynomials, order 4
c
      open(12,file='LPOLY4.d',form='formatted',status='old')
      lp = 0.d0
 40   read(12,1201,end=55)kdim,z
 1201 format(2x,i5,2x,5f20.10)
      do 42 k=1,5
      lp(kdim,k)=z(k)
 42   continue
      go to 40
 55   close(12)
c
c read in a random number seed, initialize random number
c    generators
c
      open(13,file='seedno.d',form='formatted',status='old')
      read(13,1301,end=65)iseed
 1301 format(1x,i10)
      call firan(iseed)
 65   close(13)
      return
      end
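The function ihmssf maps a (row, column) pair of a symmetric matrix of order n to its position in the half-stored array. The Fortran routine itself is not listed in this chapter; the sketch below assumes the upper triangle is stored row-wise (diagonal included), which is consistent with the sequential counter `m` used over `ir` and `ic=ir,mcov` loops in the subroutines that follow.

```python
def ihmssf(i, j, n):
    """1-based index into a half-stored symmetric matrix of order n,
    assuming the upper triangle is stored row-wise (an assumption;
    the original Fortran routine is not reproduced here)."""
    if i > j:
        i, j = j, i          # symmetry: (i, j) and (j, i) share a slot
    return (i - 1) * n - (i - 1) * (i - 2) // 2 + (j - i + 1)
```

For n = 5 this yields positions 1..15, matching nop = (no*(no+1))/2 half-stored storage.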
Four input files are needed: a) one for the covariance matrices for genetic, permanent environmental, and contemporary group effects; b) one for the residual variances; c) one for the Legendre polynomials; and d) one for the random number seed. Remember to create the appropriate files, and make sure the format statements agree with the data files.
The inverses of the covariance matrices are obtained using dkmvhf.f. This is Henderson's inversion routine that he wrote back in the 1960s. His version was called djnvhf.f, but Karin Meyer found a way to improve its speed. Henderson's version physically re-arranged rows and columns during the inversion process. Meyer's version merely kept an array of indexes of the rows to be re-arranged, and did not actually re-arrange them until the end. This increased the speed greatly, and so the new version became dkmvhf.f, where the km is for Karin Meyer. One advantage of both routines is that the matrix can have rows and columns that are all zeros. Many inversion routines require a non-singular matrix, so that the zero rows and columns must be removed before calling the subroutine. This routine is given in the Appendix.
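The zero-row feature can be sketched independently of the Fortran routine. This is not dkmvhf itself, only the idea: find the active rows/columns, invert that submatrix (Gauss-Jordan with partial pivoting here), and leave the zero rows and columns as zeros, all in plain full storage.

```python
def invert_with_zero_rows(a):
    """Invert a symmetric matrix that may contain all-zero rows/columns.
    Active rows/columns are inverted; zero rows/columns stay zero.
    A sketch of the idea only, not the dkmvhf algorithm."""
    n = len(a)
    active = [i for i in range(n)
              if any(abs(a[i][j]) > 1e-12 for j in range(n))]
    m = len(active)
    # active submatrix augmented with the identity
    aug = [[a[active[r]][active[c]] for c in range(m)] +
           [1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]
    for p in range(m):
        piv = max(range(p, m), key=lambda r: abs(aug[r][p]))
        aug[p], aug[piv] = aug[piv], aug[p]   # partial pivoting
        d = aug[p][p]
        aug[p] = [v / d for v in aug[p]]
        for r in range(m):
            if r != p and aug[r][p] != 0.0:
                f = aug[r][p]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[p])]
    # scatter the inverse back into the full-size (mostly zero) matrix
    out = [[0.0] * n for _ in range(n)]
    for r in range(m):
        for c in range(m):
            out[active[r]][active[c]] = aug[r][m + c]
    return out
```

With one zero row and column in the middle of a 3 x 3 matrix, the result is the 2 x 2 inverse embedded in a matrix of zeros.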
Note that in this version of FORTRAN an entire array can be set to zero with one statement, gi = 0.d0. It is important to use 0.d0 rather than 0.0: a single precision constant carries only about 7 significant digits, which can be critical in some programs. Thus, always use the d0 form of constants.
The subroutine firan is specialized software (at Guelph) for initializing a series of random number generators for different distributions. The Mersenne twister is used as the algorithm in these routines; it has a very long cycle length, (2^19937 − 1). The cycle length is how many numbers are generated before the sequence starts to repeat itself. When using Gibbs sampling it is good to have a long cycle length.
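As it happens, Python's standard random module also uses the Mersenne twister (MT19937), so the effect of reading a seed from a file can be illustrated directly; firan itself is Guelph-local software and is not reproduced here.

```python
import random

# Two generators seeded identically reproduce the same stream;
# storing iseed in a file gives the same reproducibility between runs.
iseed = 123457
g1 = random.Random(iseed)   # CPython's Random is a Mersenne twister
g2 = random.Random(iseed)
sample1 = [g1.gauss(0.0, 1.0) for _ in range(5)]
sample2 = [g2.gauss(0.0, 1.0) for _ in range(5)]
```

Identical seeds give identical normal deviates, which is exactly what is wanted when a Gibbs sampling run must be repeatable.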
20.3 Call Peddys
The following subroutine reads in the pedigree with the bii values needed for the inverse of the additive relationship matrix. These values were computed by another series of programs which order the animals chronologically and then calculate the inbreeding coefficients, so that parents are processed before their progeny.

The subroutine also reads in a 'coded' pedigree file, which has an animal, then all of its progeny following, and the mate that produced each progeny. This is so additive relationships can be accounted for easily in the iteration program.
      subroutine peddys
      include 'SShd.f'
      character*8 oid
      real*8 x,v,z
      open(10,file='PARTES.d',form='formatted',
     x status='old')
      adiag = 0.d0
      sir = 0
      dam = 0
      bii = 0.d0
      mam=0
c
 10   read(10,1001,end=50)ka,ks,kd,x,z,oid
 1001 format(1x,3i10,1x,d20.10,2x,d20.10,1x,a8)
      mam = mam + 1
      sir(ka) = nam
      if(ks.gt.0)sir(ka) = ks
      dam(ka) = nam
      if(kd.gt.0)dam(ka) = kd
      bii(ka) = x
      adiag(ka) = adiag(ka) + x
      v = 0.25d0*x
      if(ks.gt.0)adiag(ks)=adiag(ks)+v
      if(kd.gt.0)adiag(kd)=adiag(kd)+v
      go to 10
c
 50   close(10)
      print *,'peddys, mam= ',mam
c
c read and store coded pedigree file
c
      open(11,file='CARTES.d',form='formatted',
     x status='old')
      mped = 0
      jped = 0
      cpa = 0
      cpc = 0
      cps = 0
      cpd = 0
c
 60   read(11,1101,end=90)ia,ic,is,id
 1101 format(1x,i10,i3,1x,2i10)
      mped = mped + 1
      if(mped.gt.nped)go to 89
      cpa(mped)=ia
      cpc(mped)=ic
      if(is.lt.1)is = nam
      if(id.lt.1)id = nam
      cps(mped)=is
      cpd(mped)=id
      if(ic.eq.0)jped(ia)=mped
      go to 60
c
 89   print *,'nped limit exceeded in SSpeddys.f'
 90   close(11)
      print *,'peddys, mped = ',mped
c
      return
      end
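The accumulation rules in the subroutine (add bii to the animal's diagonal and bii/4 to each known parent's diagonal) can be verified on a toy pedigree. With no inbreeding, bii is 1 for an animal with both parents unknown and 2 for an animal with both parents known; the sketch below applies the same rules in Python.

```python
def ainverse_diagonals(pedigree, bii):
    """Accumulate the diagonals of A-inverse, as in peddys.
    pedigree: list of (animal, sire, dam), 0 = unknown parent.
    bii: dict animal -> b_ii value."""
    diag = {}
    for a, s, d in pedigree:
        x = bii[a]
        diag[a] = diag.get(a, 0.0) + x          # animal gets b_ii
        for p in (s, d):
            if p > 0:
                diag[p] = diag.get(p, 0.0) + 0.25 * x   # parents get b_ii/4
    return diag

# toy pedigree: 1 and 2 are unrelated base animals, 3 is their progeny
ped = [(1, 0, 0), (2, 0, 0), (3, 1, 2)]
b = {1: 1.0, 2: 1.0, 3: 2.0}                    # no inbreeding
diag = ainverse_diagonals(ped, b)
```

The diagonals come out as 1.5, 1.5, and 2, which match the known diagonals of A-inverse for this pedigree.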
20.4 Call Datum
The data have been prepared by other programs, and the levels of each factor have been converted to a consecutive number from 1 to the maximum number of levels for that factor.

One could also calculate means of the milk yields by days in milk, year-months, or whatever may be of interest.

At the end of the routine, the levels of the factors are sorted so that the levels of each factor can be processed one at a time, in sequence. Thus, the diagonal blocks for each factor can be constructed at the same time as the right hand sides of the mixed model equations (MME) are accumulated, and there is no need to save the diagonal blocks on disk or in memory. This is also handy if Gibbs sampling is used to estimate the covariance function matrices in the same program, because the covariance function matrices change with each sample.

Lastly, the solution vectors are set to zero before the iterations begin. Otherwise there could be unknown information in those arrays that might cause problems with convergence.
      subroutine datum
      include 'SShd.f'
      real*8 p(3)
      open(11,file='MILKTDM.d',form='formatted',
     x status='old')
      mrec = 0
      obs=0.d0
      nerr = 0
c
 11   read(11,1101,end=20,err=88)iam,iym,iras,icg,
     x jdim,jtim,p
 1101 format(1x,6i10,3f9.2)
c
      if(jdim.lt.5)go to 11
      if(jdim.gt.ndim)go to 11
      mrec = mrec + 1
      if(mrec.gt.nrec)go to 19
      anid(mrec) = iam
      cgid(mrec) = icg
      rasid(mrec) = iras
      ymid(mrec) = iym
      timg(mrec) = jtim
      if(icg.gt.mcgid)mcgid=icg
      if(iras.gt.mras)mras=iras
      if(iym.gt.mym)mym=iym
      pari(mrec) = 1
      if(p(2).gt.-9000.0)pari(mrec)=2
      if(p(3).gt.-9000.0)pari(mrec)=3
      kp=pari(mrec)
      obs(mrec) = p(kp)
      days(mrec) = jdim
      go to 11
 19   print *,'Too many records'
      go to 20
 88   print *,'Err rec',mrec
      nerr = nerr + 1
      go to 11
 20   close(11)
      print *,' datum, mrec= ',mrec
      write(30,3005)mrec,nrec,mcgid,ncg,mras,nras,
     x mym,nym
 3005 format(1x,2i10,' recs'/1x,2i10,' mcgid'/1x,2i10,
     x ' mras'/1x,2i10,' mym')
c
c sort data by levels of different factors
c    IPSORT is an Intel math library function
c
      kflag = 1
      ier = 0
      call IPSORT(ymid,mrec,pymid,kflag,ier)
      call IPSORT(rasid,mrec,pras,kflag,ier)
      call IPSORT(cgid,mrec,pcg,kflag,ier)
      call IPSORT(anid,mrec,panid,kflag,ier)
c
c set all solution vectors to zero
c
      sanm = 0.d0
      sape = 0.d0
      scg = 0.d0
      sras = 0.d0
      sym = 0.d0
      return
      end
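IPSORT returns a permutation vector rather than physically moving the records, which is what the iteration subroutines rely on. The idea can be sketched with argsort-style indexing on a small, made-up data set:

```python
# level of some factor attached to each record (hypothetical data)
level  = [3, 1, 2, 1, 3, 2, 1]
yields = [10.0, 8.0, 9.0, 7.0, 12.0, 11.0, 6.0]

# permutation vector that visits records level by level,
# analogous to the index vector produced by IPSORT
perm = sorted(range(len(level)), key=lambda k: level[k])

# walk the permutation and accumulate something per level
# (here a sum of yields, one level at a time, in sequence)
sums = {}
for k in perm:
    sums[level[k]] = sums.get(level[k], 0.0) + yields[k]
```

The original arrays stay in record order, yet each level's records are visited consecutively, so per-level right hand sides and diagonal blocks can be built without any disk storage.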
20.5 Iteration Subroutines
The main program can be thought of as 'modular'. There are fixed factors, random factors, and the animal additive genetic factor. Fixed factors do not have any covariance function matrices. There are two types of fixed factors in this model. The Year-Month of Calving subclasses each have 36 ten-day periods associated with them, to model the trajectory of test day milk yields. The other type is the region-age-season subclasses, which are modelled by order 4 Legendre polynomials, and thus there are 5 parameters for each subclass.
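All of the factor subroutines follow the same block Gauss-Seidel ('iteration on data') pattern: each factor's solutions are re-solved from observations adjusted for the current solutions of all other factors. A miniature two-factor version, with invented data and a variance ratio of 1 for the second (random) factor, shows the pattern:

```python
# records: (level of factor A, level of factor B, observation)
recs = [(0, 0, 10.0), (0, 1, 12.0), (1, 0, 11.0), (1, 1, 15.0)]
a = [0.0, 0.0]    # solutions for fixed factor A
b = [0.0, 0.0]    # solutions for random factor B
lam = 1.0         # variance ratio added to B's diagonals (shrinkage)

for _ in range(200):
    for i in range(2):
        # new A(i): mean of records adjusted for current B solutions
        rhs = sum(y - b[j] for ai, j, y in recs if ai == i)
        n = sum(1 for ai, j, y in recs if ai == i)
        a[i] = rhs / n
    for j in range(2):
        # new B(j): shrunken mean of records adjusted for current A
        rhs = sum(y - a[i] for i, bj, y in recs if bj == j)
        n = sum(1 for i, bj, y in recs if bj == j)
        b[j] = rhs / (n + lam)
```

At convergence each solution satisfies its own mixed model equation exactly, which is the criterion the convergence statistics ccn and ccd monitor in the real program.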
20.5.1 Year-Month of Calving Factor
      subroutine facYM
      include 'SShd.f'
c
      real*8 diags(ntg),vnois(ntg),ay(ntg),
     x XRY(ntg),c,y,z,w,x,ddif,xad
      integer levs(nrec),iork(nop),mfac
c #######################################################
c determine number of observations per
c    level of the factor, store in levs
      levs = 0
      mfac = 0
      do 8 krec=1,mrec
      iym = ymid(krec)
      if(iym.gt.mfac)mfac = iym
      levs(iym) = levs(iym) + 1
 8    continue
      kend=0
c #######################################################
c    For each level of the factor
c    adjust observations for all other solutions
c    save in XRY, make diags of MME
      do 11 iym = 1,mfac
      jrec = levs(iym)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
c records are processed in year-month order
      krec = pymid(lrec)
c
      iam = anid(krec)
      icg = cgid(krec)
      iras = rasid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (scg(icg,j) + sanm(iam,ka)
     x  + sras(iras,j) + sape(iam,ka))*lp(jdim,j)
 15   continue
      xad = y*c
      XRY(jtim)=XRY(jtim) + xad
      diags(jtim)=diags(jtim)+ c
c
 10   continue
c ####################################################
c solve for new solution for this level of factor
c    (invert the diagonal coefficients)
      do 16 i=1,ntg
      if(diags(i).gt.0.d0)diags(i)=1.d0/diags(i)
 16   continue
      vnois = 0.d0
c ###################################################
c    if estimating covariance matrices then
c    generate sampling variance to
c    add to solutions
      if(igibb.gt.0)then
      do 17 i=1,ntg
      call fgnor3(z)
      vnois(i)=z*diags(i)
 17   continue
      endif
c
      do 25 j=1,ntg
      z = XRY(j)*diags(j)
c add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sym(iym,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sym(iym,j) = z
 25   continue
 11   continue
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sym(jj,L),L=1,5)
 5003 format(' PYM',i4,5f12.4)
      endif
      return
      end
20.5.2 Region-Age-Season of Calving
The above statements were for the Year-Month subclasses, modelling the trajectories of lactation curves for milk yield using 36 ten-day periods. Now compare that routine to the one for region-age-seasons, which are modelled by order 4 Legendre polynomial covariates, but only within a parity. Subclasses were numbered consecutively across region-age-seasons. Ages are nested within parities, and thus there are only 5 covariates per subclass. Dealing with the covariates requires different coding.
      subroutine facRAS
      include 'SShd.f'
c
      real*8 diags(200),vnois(200),ay(20),
     x XRY(20),work(200),hh(200),c,y,z,w,x,ddif,xad
      integer levs(nrec),iork(200),mfac
c #######################################################
c determine number of observations per
c    level of the factor, store in levs
      levs = 0
      mfac = 0
      do 8 krec=1,mrec
      iras = rasid(krec)
      if(iras.gt.mfac)mfac = iras
      levs(iras) = levs(iras) + 1
 8    continue
      kend=0
c #######################################################
c    For each level of the factor
c    adjust observations for all other solutions
c    save in XRY, make diags of MME
      do 11 iras = 1,mfac
      jrec = levs(iras)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
      krec = pras(lrec)
      iam = anid(krec)
      icg = cgid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (scg(icg,j) + sanm(iam,ka)
     x  + sape(iam,ka))*lp(jdim,j)
 15   continue
      xad = y*c
      do 17 j=1,mcov
      XRY(j)=XRY(j) + xad*lp(jdim,j)
      do 19 m=j,mcov
      ma=ihmssf(j,m,mcov)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
 19   continue
 17   continue
c
 10   continue
c ####################################################
c solve for new solution for this level of factor
c
      call dkmvhf(diags,mcov,work,iork)
      vnois = 0.d0
c ###################################################
c    if estimating covariance matrices then
c    do cholesky decomposition on diags
c    generate sampling variance (vnois) to
c    add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,mcov)
      call vgnor(vnois,work,hh,mcov)
      endif
c
      do 25 j=1,mcov
      z = 0.d0
      do 27 k=1,mcov
      m=ihmssf(j,k,mcov)
      z = z + diags(m)*XRY(k)
 27   continue
c add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sras(iras,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sras(iras,j) = 0.5d0*(z + sras(iras,j))
 25   continue
 11   continue
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sras(jj,L),L=1,5)
 5003 format(' RAS',i4,5f12.4)
      endif
      return
      end
20.5.3 Contemporary Groups
Contemporary groups (CG) are defined as cows in the same parity number calving within a few months of each other in the same herd and year. CG are modelled by order 4 Legendre polynomials. Because three parities are being analyzed together, it is possible for there to be 3 covariance function matrices for CG effects, i.e. one for each parity. We have assumed this to be the case. However, if the three estimated matrices turn out to be similar, then the same covariance function matrix could be assumed for all parities.

The subroutine for CG differs from that for RAS because CG is a random factor; three separate covariance function matrices are allowed, and the coding has to allow for the estimation of new matrices and for saving them in a file.
      subroutine facCG
      include 'SShd.f'
c
      real*8 diags(200),vnois(200),ay(20),
     x XRY(20),work(200),hh(200),c,y,z,w,x,ddif,xad
      real*8 ssc(3,15),VIc(15)
      integer levs(nrec),iork(200),levp(nrec),ndf(3),mfac
c #######################################################
c determine number of observations per
c    level of the factor, store in levs
      levs = 0
      kop = 15
      levp = 0
      ssc=0.d0
      ndf=0
      mfac = 0
      do 8 krec=1,mrec
      icg = cgid(krec)
      if(icg.gt.mfac)mfac = icg
      levs(icg) = levs(icg) + 1
      levp(icg) = pari(krec)
 8    continue
      kend=0
c #######################################################
c    For each level of the factor
c    adjust observations for all other solutions
c    save in XRY, make diags of MME
      do 11 icg = 1,mfac
      jrec = levs(icg)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
c records are processed in contemporary group order
      krec = pcg(lrec)
      iam = anid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sanm(iam,ka)
     x  + sape(iam,ka))*lp(jdim,j)
 15   continue
      xad = y*c
      do 17 j=1,mcov
      XRY(j)=XRY(j) + xad*lp(jdim,j)
      do 19 m=j,mcov
      ma=ihmssf(j,m,mcov)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
 19   continue
 17   continue
c
 10   continue
c ####################################################
c Add inverse of covariance function matrix to diags
c    before inverting (one of three possible inverses)
c
      m=0
      kp = levp(icg)
      do 61 ir=1,mcov
      do 61 ic=ir,mcov
      m=m+1
      diags(m)=diags(m)+ci(kp,m)
 61   continue
      call dkmvhf(diags,mcov,work,iork)
      vnois = 0.d0
c ###################################################
c    if estimating covariance matrices then
c    do cholesky decomposition on diags
c    generate sampling variance (vnois) to
c    add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,mcov)
      call vgnor(vnois,work,hh,mcov)
      endif
c
      do 25 j=1,mcov
      z = 0.d0
      do 27 k=1,mcov
      m=ihmssf(j,k,mcov)
      z = z + diags(m)*XRY(k)
 27   continue
c add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - scg(icg,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      scg(icg,j) = z
 25   continue
c
c if estimating covariance matrices - must accumulate
c    sum of squares
      if(igibb.gt.0)then
      m=0
      ndf(kp)=ndf(kp)+1
      do 71 ir=1,mcov
      z = scg(icg,ir)
      do 71 ic=ir,mcov
      m=m+1
      ssc(kp,m)=ssc(kp,m)+scg(icg,ic)*z
 71   continue
      endif
 11   continue
c
c Estimate new ci matrices
c
      if(igibb.gt.0)then
      kop=15
      do 217 kp=1,3
      nde = ndf(kp) + 2
      call fgchi1(nde,w)
      z=1.d0/w
      do 215 k=1,kop
      VIc(k)=ssc(kp,k)*z
 215  continue
      write(17)iter,VIc
      call dkmvhf(VIc,mcov,work,iork)
      do 216 k=1,kop
      ci(kp,k)=VIc(k)
 216  continue
 217  continue
      endif
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(scg(jj,L),L=1,5)
 5003 format(' CGS',i4,5f12.4)
      endif
      return
      end
20.5.4 Animal Permanent Environmental
Animal permanent environmental (APE) effects concern all three parities, and are correlated between parities, so that there are 15 covariates to estimate for each animal. Each parity is modelled by order 4 Legendre polynomials. The covariance matrix is therefore 15 by 15, and there is only one covariance matrix.
      subroutine facAPE
      include 'SShd.f'
c
      real*8 diags(nop),vnois(nop),ay(no),
     x XRY(no),work(nop),hh(nop),c,y,z,w,x,ddif,xad
      real*8 ssp(nop),VIp(nop)
      integer levs(nrec),iork(nop),mfac
c #######################################################
c determine number of observations per
c    level of the factor, store in levs
      levs = 0
      ssp=0.d0
      ndf=0
      mfac = 0
      do 8 krec=1,mrec
      iam = anid(krec)
      if(iam.gt.mfac)mfac = iam
      levs(iam) = levs(iam) + 1
 8    continue
      kend=0
c #######################################################
c    For each level of the factor
c    adjust observations for all other solutions
c    save in XRY, make diags of MME
      do 11 iam = 1,mfac
      jrec = levs(iam)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 11
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
c records are processed in animal order
      krec = panid(lrec)
      icg = cgid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sanm(iam,ka)
     x  + scg(icg,j) )*lp(jdim,j)
 15   continue
      xad = y*c
      do 17 j=1,mcov
      kb=ja+j
      XRY(kb)=XRY(kb) + xad*lp(jdim,j)
      do 19 m=j,mcov
      kc=ja+m
      ma=ihmssf(kb,kc,no)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
 19   continue
 17   continue
c
 10   continue
c ####################################################
c Add inverse of covariance function matrix to diags
c    before inverting
c
      m=0
      do 61 ir=1,no
      do 61 ic=ir,no
      m=m+1
      diags(m)=diags(m)+pi(m)
 61   continue
      call dkmvhf(diags,no,work,iork)
      vnois = 0.d0
c ###################################################
c    if estimating covariance matrices then
c    do cholesky decomposition on diags
c    generate sampling variance (vnois) to
c    add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,no)
      call vgnor(vnois,work,hh,no)
      endif
c
      do 25 j=1,no
      z = 0.d0
      do 27 k=1,no
      m=ihmssf(j,k,no)
      z = z + diags(m)*XRY(k)
 27   continue
c add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sape(iam,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sape(iam,j) = z
 25   continue
c
c if estimating covariance matrices - must accumulate
c    sum of squares
      if(igibb.gt.0)then
      m=0
      ndf=ndf+1
      do 71 ir=1,no
      z = sape(iam,ir)
      do 71 ic=ir,no
      m=m+1
      ssp(m)=ssp(m)+sape(iam,ic)*z
 71   continue
      endif
 11   continue
c
c Estimate new pi matrix
c
      if(igibb.gt.0)then
      nde = ndf + 2
      call fgchi1(nde,w)
      z=1.d0/w
      do 215 k=1,nop
      VIp(k)=ssp(k)*z
 215  continue
      nm=1
      write(19)iter,nm,VIp
      call dkmvhf(VIp,no,work,iork)
      do 216 k=1,nop
      pi(k)=VIp(k)
 216  continue
      endif
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sape(jj,L),L=1,5)
 5003 format(' APE',i4,5f12.4)
      endif
      return
      end
20.5.5 Animal Additive Genetic
Animal additive genetic (ANM) effects concern all three parities, like the APE, and are correlated between parities, so that there are 15 covariates to estimate for each animal. Each parity is modelled by order 4 Legendre polynomials. The covariance matrix is therefore 15 by 15, and there is only one covariance matrix.

However, the big difference from APE is the additive genetic relationships that must be taken into account amongst all animals. This accounts for the extra length of the following subroutine.
      subroutine facANM
      include 'SShd.f'
c
      real*8 diags(nop),vnois(nop),ay(no),
     x XRY(no),work(nop),hh(nop),c,y,z,w,x,ddif,xad
      real*8 ssa(nop),VIa(nop),tcc(no),dg
      integer levs(nrec),iork(nop),mfac
c #######################################################
c determine number of observations per
c    level of the factor, store in levs
      levs = 0
      ssa=0.d0
      ndf=0
      mfac = mam
      do 8 krec=1,mrec
      iam = anid(krec)
      levs(iam) = levs(iam) + 1
 8    continue
      kend=0
c #######################################################
c    For each animal (even those without records)
c    adjust observations for all other solutions
c    save in XRY, make diags of MME
      do 11 iam = 1,mfac
      jrec = levs(iam)
      XRY = 0.d0
      diags = 0.d0
      if(jrec.lt.1)go to 50
      kstr = kend+1
      kend = kend+jrec
      do 10 lrec = kstr,kend
c records are processed in animal order
      krec = panid(lrec)
      icg = cgid(krec)
      iras = rasid(krec)
      iym = ymid(krec)
      jdim = days(krec)
      jtim = timg(krec)
      kp = pari(krec)
      ja = (kp-1)*5
      c = ri(jdim,kp)
      y = obs(krec) - sym(iym,jtim)
      do 15 j=1,mcov
      ka=ja+j
      y = y - (sras(iras,j) + sape(iam,ka)
     x  + scg(icg,j) )*lp(jdim,j)
 15   continue
      xad = y*c
      do 17 j=1,mcov
      kb=ja+j
      XRY(kb)=XRY(kb) + xad*lp(jdim,j)
      do 19 m=j,mcov
      kc=ja+m
      ma=ihmssf(kb,kc,no)
      diags(ma)=diags(ma)+lp(jdim,j)*c*lp(jdim,m)
 19   continue
 17   continue
c
 10   continue
c
c Must account for genetic relationships among animals
c
 50   uped = jped(iam)
      tcc=0.d0
      if(uped.lt.1)go to 432
      ianm = iam
      if(cpa(uped).ne.ianm)print *,'xxxxx',cpa(uped),ianm
c
 850  jcode = cpc(uped)
      if(cpa(uped).ne.ianm)go to 432
      if(jcode.eq.0)then
c animal as progeny: contribution from its own sire and dam
      js = cps(uped)
      jd = cpd(uped)
      c = bii(ianm)*0.5d0
      do 406 jc=1,no
      tcc(jc)=tcc(jc)+c*(sanm(js,jc)+sanm(jd,jc))
 406  continue
      else
c animal as parent: contribution from each progeny and its mate
      jp = cps(uped)
      jm = cpd(uped)
      d = bii(jp)*0.5d0
      do 412 ja=1,no
      tcc(ja)=tcc(ja)+d*(sanm(jp,ja)-0.5d0*sanm(jm,ja))
 412  continue
      endif
c
      uped = uped + 1
      if(uped.gt.mped)go to 432
      if(cpa(uped).ne.ianm)go to 432
      go to 850
c
 432  do 435 jr=1,no
      s=0.d0
      do 437 jc=1,no
      s=s + gi(ihmssf(jr,jc,no))*tcc(jc)
 437  continue
      XRY(jr)=XRY(jr)+s
 435  continue
c ####################################################
c Add inverse of covariance function matrix to diags
c    before inverting
c
      dg = adiag(iam)
      m=0
      do 61 ir=1,no
      do 61 ic=ir,no
      m=m+1
      diags(m)=diags(m)+gi(m)*dg
 61   continue
      call dkmvhf(diags,no,work,iork)
      vnois = 0.d0
c ###################################################
c    if estimating covariance matrices then
c    do cholesky decomposition on diags
c    generate sampling variance (vnois) to
c    add to solutions
      if(igibb.gt.0)then
      call cholsk(diags,work,no)
      call vgnor(vnois,work,hh,no)
      endif
c
      ay=0.d0
      js = sir(iam)
      jd = dam(iam)
      do 25 j=1,no
      z = 0.d0
      do 27 k=1,no
      m=ihmssf(j,k,no)
      z = z + diags(m)*XRY(k)
 27   continue
c add vnois, compute convergence criteria
      z = z + vnois(j)
      ddif = z - sanm(iam,j)
      ccn = ccn + ddif*ddif
      ccd = ccd + z*z
      sanm(iam,j) = z
      ay(j)=sanm(iam,j)-0.5d0*(sanm(js,j)+sanm(jd,j))
 25   continue
c
c if estimating covariance matrices - must accumulate
c    sum of squares of Mendelian sampling terms
      if(igibb.gt.0)then
      if(jrec.gt.0)then
      m=0
      ndf=ndf+1
      d = bii(iam)
      do 71 ir=1,no
      z = ay(ir)*d
      do 71 ic=ir,no
      m=m+1
      ssa(m)=ssa(m)+ay(ic)*z
 71   continue
      endif
      endif
 11   continue
c
c Estimate new gi matrix
c
      if(igibb.gt.0)then
      nde = ndf + 2
      call fgchi1(nde,w)
      z=1.d0/w
      do 215 k=1,nop
      VIa(k)=ssa(k)*z
 215  continue
      nm=2
      write(19)iter,nm,VIa
      call dkmvhf(VIa,no,work,iork)
      do 216 k=1,nop
      gi(k)=VIa(k)
 216  continue
      endif
      if(itest.gt.0)then
      jj=6
      print 5003,jj,(sanm(jj,L),L=1,5)
 5003 format(' ANM',i4,5f12.4)
      endif
      return
      end
20.5.6 Residual Effects
If the program is set to estimate covariance matrices, then a subroutine is needed to estimate the residual variances by parity and by periods within a lactation. If igibb = 0, then this subroutine is skipped, and no residual variances are calculated.
      subroutine facRES
      include 'SShd.f'
c
      real*8 diags(nop),vnois(nop),ay(no),
     x       XRY(no),work(nop),hh(nop),c,y,z,w,x,ddif,xad
      real*8 sse(ntim,3),VI(nop),ndf(ntim,3)
      integer levs(nrec),iork(nop),mfac
c #######################################################
c     determine number of observations per
c     level of the factor, store in levs
      levs = 0
      sse=0.d0
      mfac = 0
      do 8 krec=1,mrec
        itim = timg(krec)
        levs(itim) = levs(itim) + 1
        if(itim.gt.mfac)mfac=itim
 8    continue
      kend=0
c #######################################################
c     For each level of the factor
c       adjust observations for all other solutions
c       save in XRY, make diags of MME
      do 11 itim = 1,mfac
        jrec = levs(itim)
        if(jrec.lt.1)go to 11
        kstr = kend+1
        kend = kend+jrec
        do 10 lrec = kstr,kend
          krec = pcgid(lrec)
          iam = anid(krec)
          icg = cgid(krec)
          iras = rasid(krec)
          iym = ymid(krec)
          jdim = days(krec)
          jtim = timg(krec)
          kp = pari(krec)
          ja = (kp-1)*5
          c = ri(jdim,kp)
          y = obs(krec) - sym(iym,jtim)
          do 15 j=1,mcov
            ka=ja+j
            y = y - (sras(iras,j) + sape(iam,ka)
     x          + scg(icg,j) + sanm(iam,ka) )*lp(jdim,j)
 15       continue
          sse(jtim,kp)=sse(jtim,kp)+y*y
          ndf(jtim,kp)=ndf(jtim,kp)+1.d0
 10     continue
 11   continue
c
      do 31 jtim=1,4
        do 32 kp=1,3
          nde = ndf(jtim,kp)+2
          call fgchi1(nde,w)
          res(jtim,kp) = sse(jtim,kp)/w
 32     continue
 31   continue
      ri=0.d0
      do 41 i=1,45
        do 141 kp=1,3
          ri(i,kp)=1.d0/res(1,kp)
 141    continue
 41   continue
      do 42 i=46,115
        do 142 kp=1,3
          ri(i,kp)=1.d0/res(2,kp)
 142    continue
 42   continue
      do 43 i=116,265
        do 143 kp=1,3
          ri(i,kp)=1.d0/res(3,kp)
 143    continue
 43   continue
      do 44 i=266,365
        do 144 kp=1,3
          ri(i,kp)=1.d0/res(4,kp)
 144    continue
 44   continue
c
c     save new estimates in file with sample number
c
      write(20,1235)iter,((res(i,j),i=1,4),j=1,3)
 1235 format(1x,i10,12f15.5)
c
      return
      end
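Each period-by-parity residual variance above is the accumulated sum of squares divided by a Chi-square deviate with (degrees of freedom + 2), which is what the call to fgchi1 supplies. The same draw can be sketched in Python (the function name is illustrative):

```python
import numpy as np

def sample_residual_variance(sse, ndf, rng):
    """Draw a residual variance from its full conditional, a scaled
    inverse Chi-square: sse / chi2(ndf + 2), matching the Fortran
    update res(jtim,kp) = sse(jtim,kp)/w."""
    w = rng.chisquare(ndf + 2)
    return sse / w
```

With a large number of records, the draw concentrates near sse/ndf, the usual residual mean square.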
20.5.7 Finish Off
The last subroutine is called finis; it saves the solutions for the important factors, usually just the genetic evaluations of the animals. However, some may want to save all of the solutions for all factors.
With the genetic evaluations one may also want to save information about the number of records each animal had (by parity number), and perhaps the number of progeny and the sire and dam identifications. This information could be used to approximate the reliabilities of the EBVs.
This last subroutine thus depends on the wishes of the user as to what information should be saved and how. Therefore, no coding is provided for this subroutine.
20.6 Other Items
If Gibbs sampling was performed, then there will be three files of sample estimates, one for each covariance matrix and one for the residuals. The burn-in period needs to be determined, and then the remaining samples need to be averaged in some manner. Either all of the samples after burn-in could be averaged, or every mth sample could be averaged, where m is a number like 7 or 17 or 19. Consecutive samples are known to be dependent on the previous sample, and averaging every mth sample lessens this dependency considerably. Often the same results are obtained either way.
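Burn-in removal and thinning can be sketched in a few lines of Python (names are illustrative):

```python
def posterior_mean(samples, burn_in, m=1):
    """Average the Gibbs samples after discarding the burn-in,
    keeping only every m-th sample to lessen serial dependence."""
    kept = samples[burn_in::m]
    return sum(kept) / len(kept)
```

For samples numbered 0 through 9, posterior_mean(samples, 4) averages samples 4 through 9, while posterior_mean(samples, 4, m=2) averages only samples 4, 6, and 8.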
After new covariance matrices are available, another run is made in which the new parameters are input and igibb = 0 is imposed. This run obtains the solutions to the mixed model equations (MME).
Having the EBVs, one can compute the 305-d breeding values and persistency in a follow-up program. There may be other quantities to calculate for users of the EBVs. Note also that none of the preliminary programs have been provided. Programs for preparing the data, numbering the levels of each factor, editing out the error records, and ordering the animals chronologically for calculating inbreeding coefficients have also not been shown.
The programs shown in this chapter are not available for downloading. If you want to use them, then you should type them in from these pages. Why? Because it will help you to learn what the programs are doing, and you might find some errors in them. The programs are meant to give you an idea about writing code for a random regression model.
Chapter 21
DKMVHF - Inversion Routine
Matrix inversion routine of C. R. Henderson, as modified by Karin Meyer in
1983. Input is a half-stored symmetric matrix. There can be zero rows and
columns present in the matrix.
      SUBROUTINE DKMVHF(A,N,V,IFLAG)
C     KARIN MEYER
C     NOVEMBER 1983
C-----------------------------------------------------------------------
      DOUBLE PRECISION A(1),V(1),XX,DMAX,AMAX,ZERO,DIMAX
      INTEGER IFLAG(1)
      zero=1.D-12
      IF(N.EQ.1)THEN
        XX=A(1)
        IF(DABS(XX).GT.ZERO)THEN
          A(1)=1.D0/XX
        ELSE
          A(1)=0.D0
        END IF
        RETURN
      END IF
      N1=N+1
      NN=(N*N1)/2
      DO 1 I=1,N
 1    IFLAG(I)=0
C
C     SET MINIMUM ABSOLUTE VALUE OF DIAGONAL ELEMENTS FOR
C     NON-SINGULARITY (MACHINE SPECIFIC!)
      ZERO=1.D-12
C-----------------------------------------------------------------------
C     START LOOP OVER ROWS/COLS
C-----------------------------------------------------------------------
      DO 8 II=1,N
C     ... FIND DIAGONAL ELEMENT WITH BIGGEST ABSOLUTE VALUE
      DMAX=0.D0
      AMAX=0.D0
      KK=-N
      DO 2 I=1,N
C     ... CHECK THAT THIS ROW/COL HAS NOT BEEN PROCESSED
      KK=KK+N1-I
      IF(IFLAG(I).NE.0)GO TO 2
      K=KK+I
      IF(DABS(A(K)).GT.AMAX)THEN
        DMAX=A(K)
        AMAX=DABS(DMAX)
        IMAX=I
      END IF
 2    CONTINUE
C     ... CHECK FOR SINGULARITY
      IF(AMAX.LE.ZERO)GO TO 11
C     ... ALL ELEMENTS SCANNED, SET FLAG
      IFLAG(IMAX)=II
C     ... INVERT DIAGONAL
      DIMAX=1.D0/DMAX
C     ... DIVIDE ELEMENTS IN ROW/COL PERTAINING TO THE BIGGEST
C         DIAGONAL ELEMENT BY DMAX
      IL=IMAX-N
      DO 3 I=1,IMAX-1
      IL=IL+N1-I
      XX=A(IL)
      A(IL)=XX*DIMAX
      IF(DABS(XX).LT.0.1D-17)XX=0.D0
 3    V(I)=XX
C     ... NEW DIAGONAL ELEMENT
      IL=IL+N1-IMAX
      A(IL)=-DIMAX
      DO 4 I=IMAX+1,N
      IL=IL+1
      XX=A(IL)
      A(IL)=XX*DIMAX
      IF(DABS(XX).LT.0.1D-17)XX=0.D0
 4    V(I)=XX
C     ... ADJUST THE OTHER ROWS/COLS :
C         A(I,J)=A(I,J)-A(I,IMAX)*A(J,IMAX)/A(IMAX,IMAX)
      IJ=0
      DO 7 I=1,N
      IF(I.EQ.IMAX)THEN
        IJ=IJ+N1-I
        GO TO 7
      END IF
      XX=V(I)
      IF(XX.NE.0.D0)THEN
        XX=XX*DIMAX
        DO 5 J=I,N
        IJ=IJ+1
        IF(J.NE.IMAX)A(IJ)=A(IJ)-XX*V(J)
 5      CONTINUE
      ELSE
 6      IJ=IJ+N1-I
      END IF
 7    CONTINUE
C     ... REPEAT UNTIL ALL ROWS/COLS ARE PROCESSED
 8    CONTINUE
C-----------------------------------------------------------------------
C     END LOOP OVER ROWS/COLS
C-----------------------------------------------------------------------
C     ... REVERSE SIGN
      DO 9 I=1,NN
 9    A(I)=-A(I)
C     ... AND THAT'S IT !
C
      PRINT 10,N
 10   FORMAT(' FULL RANK MATRIX INVERTED, ORDER =',I5)
C     RETURN RANK AS LAST ELEMENT OF FLAG VECTOR
      IFLAG(N)=N
      RETURN
C-----------------------------------------------------------------------
C     MATRIX NOT OF FULL RANK, RETURN GENERALISED INVERSE
C-----------------------------------------------------------------------
 11   IRANK=II-1
      IJ=0
      DO 14 I=1,N
      IF(IFLAG(I).EQ.0)THEN
C     ... SET REMAINING N-II ROWS/COLS TO ZERO
        DO 12 J=I,N
        IJ=IJ+1
        A(IJ)=0.D0
 12     CONTINUE
      ELSE
        DO 13 J=I,N
        IJ=IJ+1
        IF(IFLAG(J).NE.0)THEN
C     ... REVERSE SIGN FOR II-1 ROWS/COLS PREVIOUSLY PROCESSED
          A(IJ)=-A(IJ)
        ELSE
          A(IJ)=0.D0
        END IF
 13     CONTINUE
      END IF
 14   CONTINUE
C
C     PRINT 15,N,IRANK
C 15  FORMAT(' GENERALISED INVERSE OF MATRIX WITH ORDER =',I5,
C    1       ' AND RANK =',I5)
      IFLAG(N)=IRANK
      RETURN
      END
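What DKMVHF returns — the (generalised) inverse of a half-stored symmetric matrix, with zero rows and columns of the input left as zeros — can be checked against a short Python sketch built on the Moore-Penrose pseudo-inverse. This reproduces the result, not the pivoting algorithm itself; the function name is illustrative:

```python
import numpy as np

def half_stored_inverse(a_half, n):
    """Generalised inverse of a symmetric matrix stored row-wise as its
    upper triangle (the IHMSSF ordering), returned in the same
    half-stored form. Zero rows/cols of the input stay zero."""
    A = np.zeros((n, n))
    iu = np.triu_indices(n)
    A[iu] = a_half                    # fill the upper triangle
    A = A + A.T - np.diag(np.diag(A))
    return np.linalg.pinv(A)[iu]      # pseudo-inverse, repacked half-stored
```

For a full-rank matrix the result agrees with the ordinary inverse; for a singular matrix the pseudo-inverse plays the role of the generalised inverse that DKMVHF produces when it detects zero pivots.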
c
c     half-stored matrix subscripting function
c
      FUNCTION IHMSSF(I,J,N)
      IF(I-J)1,1,2
 1    IHMSSF=((N+N-I)*(I-1))/2+J
      RETURN
 2    IHMSSF=((N+N-J)*(J-1))/2+I
      RETURN
      END
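The same subscripting can be written directly in Python; a one-to-one translation of IHMSSF, keeping the 1-based indices of the Fortran:

```python
def ihmssf(i, j, n):
    """1-based position of element (i,j) in the row-wise half-stored
    upper triangle of an n x n symmetric matrix."""
    if i > j:
        i, j = j, i
    return (n + n - i) * (i - 1) // 2 + j
```

For n = 3 the six stored elements (1,1), (1,2), (1,3), (2,2), (2,3), (3,3) map to positions 1 through 6, and (3,1) maps to the same position as (1,3).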
Chapter 22
References and Suggested Readings
Ali, T. E. , Schaeffer, L. R. 1987. Accounting for covariances among test day
milk yields in dairy cows. Can. J. Anim. Sci. 67:637.
Belonsky, G. M. , Kennedy, B. W. 1988. Selection on individual phenotype
and best linear unbiased predictor of breeding value in a closed swine
herd. J. Anim. Sci. 66:1124-1131.
Bentsen, H. , Gjerde, B., Nguyen, N.H., Rye, M., Ponzoni, R.W., Palada de Vera, M.S., Bolivar, H.L., Velasco, R.R., Danting, J.C., Dionisio, E.E., Longalong, F.M., Reyes, R.A., Abella, T.A., Tayamen, M.M., Eknath, A.E., 2012. Growth of farmed tilapias: Genetic parameters for body weight at harvest in Nile tilapia (Oreochromis niloticus) during five generations of testing in multiple environments. Aquaculture 338-341:56-65.
Brenna-Hansen, S. , Li, J., Kent, M.P., Boulding, E.G., Dominik, S., Davidson, W.S., Lien, S., 2012. Chromosomal differences between European
and North American Atlantic salmon discovered by linkage mapping and
supported by fluorescence in situ hybridization analysis. BMC Genomics
13, 432.
Ducrocq, V. 1987. An analysis of productive life in dairy cattle. Ph.D. Diss.,
Cornell University, Ithaca, NY.
Dwyer, D. J. , Schaeffer, L. R., Kennedy, B. W. 1986. Bias due to corrective matings in sire evaluations for calving ease. J. Dairy Sci. 69:794-799.
Foulley, J. L. , Gianola, D. 1984. Estimation of genetic merit from bivariate
all or none responses. Genet. Sel. Evol. 16:285-306.
Fries, L. A. , Schenkel, F. S. 1993. Estimation and prediction under a selection model. Presented at 30th Reuniao Anual da Sociedade Brasileira de
Zootecnia, Rio de Janeiro.
Galbraith, F. 2003. Random regression models to evaluate sires for daughter survival. Master’s Thesis, University of Guelph, Ontario, Canada,
August.
Gianola, D. 1982. Theory and analysis of threshold characters. J. Anim.
Sci. 54:1079-1096.
Gianola, D. , Foulley, J. L. 1983. Sire evaluation for ordered categorical data
with a threshold model. Genet. Sel. Evol. 15:201.
Gianola, D. , Fernando, R. L. 1986. Bayesian methods in animal breeding.
J. Anim. Sci. 63:217.
Gianola, D. , Im, S., Fernando, R. L. 1988. Prediction of breeding value under Henderson's selection model: A revisitation. J. Dairy Sci. 71:2790-2798.
Gengler, N. , Mayeres, P., Szydlowski, M., 2007. A simple method to approximate gene content in large pedigree populations: application to the
myostatin gene in dual-purpose Belgian Blue cattle. Animal 1:21-27.
Gengler, N. , Abras, S., Verkenne, C., Vanderick, S., Szydlowski, M., Renaville, R., 2008. Accuracy of prediction of gene content in large animal
populations and its use for candidate gene detection and genetic evaluation. J. Dairy Sci. 91:1652-1659.
Gjerde, B. , Simianer, H.,Refstie, T., 1994. Estimates of genetic and phenotypic parameters for body weight, growth rate and sexual maturity in
Atlantic salmon. Livest. Prod. Sci. 38:133-143.
Glebe, B.D. 1998. East Coast Salmon Aquaculture Breeding Programs: History and Future. Canadian Stock Assessment Secretariat Research Document 98/157, Fisheries and Oceans, Canada.
Graser, H. U. , Tier, B. 1997. Applying the concept of number of effective
progeny to approximate accuracies of predictions derived from multiple
trait analyses. Presented at Australian Association of Animal Breeding
and Genetics.
Hartley, H. O. , Rao, J. N. K. 1967. Maximum likelihood estimation for the
mixed analysis of variance model. Biometrika 54:93-108.
Harville, D. A. , Mee, R. W. 1984. A mixed model procedure for analyzing
ordered categorical data. Biometrics 40:393-408.
Hayes, J. F. , W. G. Hill. 1981. Modification of estimates of parameters
in the construction of genetic selection indices (’bending’). Biometrics
37:483-493.
Henderson, C. R. 1953. Estimation of variance and covariance components. Biometrics 9:226-252.
Henderson, C. R. 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics 31:423-448.
Henderson, C. R. 1976. A simple method for computing the inverse of a
numerator relationship matrix used for prediction of breeding values.
Biometrics, 32:69-74.
Henderson, C. R. 1978. Simulation to examine distributions of estimators
of variances and ratios of variances. J. Dairy Sci. 61:267-273.
Henderson, C. R. 1984. Applications of Linear Models in Animal Breeding.
University of Guelph.
Henderson, C. R. 1985. Best linear unbiased prediction of nonadditive genetic merits in noninbred populations. J. Anim. Sci. 60:111-117.
Henderson, C. R. 1988. A simple method to account for selected base populations. J. Dairy Sci. 71:3399-3404.
Henderson, C. R. 1988. Theoretical basis and computational methods for
a number of different animal models. J. Dairy Sci. 71(2):1-11.
Henderson, Jr., C. R. 1982. Analysis of covariance in the mixed model: Higher level, nonhomogeneous, and random regressions. Biometrics 38:623-640.
Heydarpour, M. , Schaeffer, L. R., Yazdi, M. H. 2008. Influence of population structure on estimates of direct and maternal parameters. J. Anim.
Breed. Genet. 125:89-99.
Houston, R.D. , Gheyas, A., Hamilton, A., Guy, D.R., Tinch, A.E., Taggart,
J.B., McAndrew, B.J., Haley, C.S. and Bishop, S.C., 2008. Detection and
confirmation of a major QTL affecting resistance to infectious pancreatic
necrosis (IPN) in Atlantic salmon (Salmo salar). In Animal Genomics
for Animal Health (Vol. 132, pp. 199-204). Karger Publishers.
Hudson, G. F. S. , Schaeffer, L. R. 1984. Monte Carlo comparison of sire
evaluation models in populations subject to selection and nonrandom
mating. J. Dairy Sci. 67:1264-1272.
Jamrozik, J. , Kistemaker,G. K., Dekkers, J. C. M., Schaeffer, L. R. 1997.
Comparison of possible covariates for use in a random regression model
for analyses of test day yields. J. Dairy Sci. 80:2550-2556.
Jamrozik, J. , Schaeffer, L. R. 1997. Estimates of genetic parameters for a
test day model with random regressions for production of first lactation
Holsteins. J. Dairy Sci. 80:762-770.
Jamrozik, J. , L. R. Schaeffer, J. C. M. Dekkers. 1997. Genetic evaluation
of dairy cattle using test day yields and random regression model. J.
Dairy Sci. 80:1217-1226.
Jamrozik, J. , J. Fatehi, L. R. Schaeffer. 2008. Comparison of models for
genetic evaluation of survival traits in dairy cattle: a simulation study.
J. Anim. Breed. Genet. 125:75-83.
Jorjani, H. , Klei, L., Emanuelson, U. 2003. A simple method for weighted
bending of genetic (co)variance matrices. J. Dairy Sci.86:677-679.
Kennedy, B. W. , L. R. Schaeffer, Sorensen, D. A. 1988. Genetic properties
of animal models. J. Dairy Sci. 71:(Suppl. 2) 17-26.
Kistemaker, G. 1997. The comparison of random regression test day models
and a 305-day model for evaluation of milk yield in dairy cattle. Ph.D.
Thesis. University of Guelph.
Lien, S. , Gidskehaug, L., Moen, T., Hayes, B.J., Berg, P.R., Davidson, W.S.,
Omholt, S.W., Kent, M.P., 2011. A dense SNP-based linkage map for
Atlantic salmon (Salmo salar) reveals extended chromosome homeologies
and striking differences in sex-specific recombination patterns. BMC
Genomics 12, 615.
Liu, S. , Palti, Y., Gao, G., Rexroad III, C.E., 2016. Development and validation of a SNP panel for parentage assignment in rainbow trout. Aquaculture 452, 178-182.
Lush, J. L. 1931. The number of daughters necessary to prove a sire. J. Dairy Sci. 14:209-220.
Meuwissen, T. H. E. , Luo, Z. 1992. Computing inbreeding coefficients in
large populations. Genet. Sel. Evol. 24:305-313.
Meuwissen, T. H. E. , De Jong, G., Engel, B. 1996. Joint estimation of
breeding values and heterogeneous variances of large data files. J. Dairy
Sci. 79:310-316.
Meyer, K. 1989. Approximate accuracy of genetic evaluation under an animal model. Livest. Prod. Sci. 21:87-100.
Meyer, K. 2000. Random regressions to model phenotypic variation in monthly
weights of Australian beef cows. Livest. Prod. Sci. 65:19-38.
Meyer, K. , Kirkpatrick, M. 2010. Better estimates of genetic covariance
matrices by ”bending” using penalized maximum likelihood. Genetics
185:1097-1110.
Moen, T. , Baranski, M., Sonesson, A.K., Kjøglum, S., 2009. Confirmation
and fine mapping of a major QTL for resistance to infectious pancreatic necrosis in Atlantic salmon: population-level associations between
markers and trait. BMC Genomics 10(1), p.368.
Mulder, H.A. , Calus, M.P.L., Veerkamp, R.F. 2010. Prediction of haplotypes for ungenotyped animals and its effect on marker-assisted breeding
value estimation. Genet. Sel. Evol. 42:10.
Ødegård, J. , Moen, T., Santi, N., Korsvoll, S.A., Kjøglum, S. and Meuwissen, T.H., 2014. Genomic prediction in an admixed population of Atlantic salmon (Salmo salar). Frontiers in Genetics, 5, p.402.
Patterson, H. D. , Thompson, R. 1971. Recovery of interblock information
when block sizes are unequal. Biometrika 58:545-554.
Pollak, E. J. , Quaas, R. L. 1981. Monte Carlo study of within herd multiple
trait evaluation of beef cattle growth traits. J. Anim. Sci. 52:248-256.
Pollak, E. J. , Quaas, R. L. 1981. Monte Carlo study of genetic evaluations
using sequentially selected records. J. Anim. Sci. 52:257-264.
Pollak, E. J. , van der Werf, J., Quaas, R. L. 1984. Selection bias and multiple trait evaluation. J. Dairy Sci. 67:1590-1596.
Ptak, E. , Horst, H. S., Schaeffer, L. R. 1993. Interaction of age and month
of calving with year for Ontario Holstein production traits. J. Dairy Sci.
76:3792-3798.
Quaas, R. L. 1976. Computing the diagonal elements and inverse of a large
numerator relationship matrix. Biometrics 32:949.
Quaas, R. L. , Pollak, E. J. 1980. Mixed model methodology for farm and
ranch beef cattle testing programs. J. Anim. Sci. 51:1277-1287.
Quaas, R. L. , Pollak, E. J. 1981. Modified equations for sire models with
groups. J. Dairy Sci. 64:1868-1872.
Quaas, R. L. 1988. Additive genetic model with groups and relationships. J. Dairy Sci. 71:1338-1345.
Quinton, C.D. , McMillan, I., Glebe, B.D., 2005. Development of an Atlantic salmon (Salmo salar) genetic improvement program: Genetic parameters of harvest body weight and carcass quality traits estimated with
animal models. Aquaculture 247, 211-217.
Rao, C. R. 1970. Estimation of heteroscedastic variances in linear models.
J. Am. Statist. Assoc. 65:161-172.
Rao, C. R. 1971. Estimation of variance covariance components - MINQUE
theory. J. Mult. Anal. 1:445-456.
Robinson, G. K. 1986. Group effects and computing strategies for models
for estimating breeding values. J. Dairy Sci. 69:3106-3111.
Schaeffer, L. R. 1976. BLUP Workshop Notes. August 9 to 14, 1976, Department of Animal Breeding, Agricultural College, Uppsala, Sweden.
(Not available in print or electronic formats, some people may have a
copy).
Schaeffer, L. R. , Kennedy, B. W. 1986. Computing strategies for solving
mixed model equations. J. Dairy Sci. 69:575-579.
Schaeffer, L. R. , Kennedy, B. W. 1989. Effects of embryo transfer in beef
cattle on genetic evaluation methodology. J. Anim. Sci. 67:2536-2543.
Schaeffer, L. R. , Kennedy, B. W., Gibson, J. P. 1989. The inverse of the
gametic relationship matrix. J. Dairy Sci. 72:1266-1272.
Schaeffer, L. R. , Dekkers, J. C. M. 1994. Random regressions in animal
models for test-day production in dairy cattle. Proc. 5th World Congress
of Genetics Applied to Livestock Production. Guelph, Ontario, Canada
XVIII:443-446.
Schaeffer, L. R. , Jamrozik, J., Kistemaker, G. J., Van Doormaal, B. J.
2000. Experience with a test-day model. J. Dairy Sci. 83:1135-1144.
Schaeffer, L. R. 2003. Computing simplifications for non-additive genetic
models. J. Anim. Breed. Genet. 120:394-402.
Schaeffer, L. R. 2004. Application of random regression models in animal
breeding. Livest. Prod. Sci. 86:35-45.
Schaeffer, L. R. 2006. Strategy for applying genome-wide selection in dairy
cattle. J. Anim. Breed. Genet. 123:218-223.
Schaeffer, L. R. 2010. Cumulative permanent environmental effects in a
repeated records animal model. J. Anim. Breed. Genet. 128:95-99.
Schaeffer, L. R. 2018. Necessary changes to improve animal models. J.
Anim. Breed. Genet. 135:124-131.
Schaeffer, L. R. , Ang, K. P., Elliott, J. A. K., Herlin, M., Powell, F., Boulding, E. G. 2018. Genetic evaluation of Atlantic salmon for growth traits
incorporating SNP markers. J. Anim. Breed. Genet. 135(5):349-356.
Schenkel, F. S. 1998. Studies on effects of parental selection on estimation
of genetic parameters and breeding values of metric traits. Ph.D. Thesis.
University of Guelph. Guelph, Ontario, Canada.
Searle, S. R. 1971. Linear Models. New York, John Wiley.
Simianer, H. 1991. Prospects for third generation methods of genetic evaluation. 42nd Annual Meeting of the European Association for Animal
Production, Berlin.
Slanger, W. D. , Jensen, E. L., Everett, R. W., Henderson, C. R. 1976.
Programming cow evaluation. J. Dairy Sci. 59:1589.
Smith, S. P. , Maki-Tanila, A. 1990. Genotypic covariance matrices and
their inverses for models allowing dominance and inbreeding. Genet.
Sel. Evol. 22:65-91.
Sonesson, A.K. 2007. Within-family marker-assisted selection for aquaculture species. Genet. Sel. Evol. 39, 301-317.
Sonesson, A.K. , Meuwissen, T.H.E., 2009. Testing Strategies for genomic
selection in aquaculture breeding programs. Genet. Sel. Evol. 41, 37-45.
Tamate, T. , Maekawa, K. 2004. Female-biased mortality rate and sexual
size dimorphism of migratory masu salmon, Oncorhynchus masou. Ecology of Freshwater Fish 13, 96-103.
Thompson, R. 1976. The estimation of maternal genetic variances. Biometrics 32:903-918.
Thompson, R. 1979. Sire evaluation. Biometrics 35:339-353.
Tier, B. , Meyer, K. 2004. Approximating prediction error covariances among
additive genetic effects within animals in multiple trait and random regression models. J. Anim. Breed. Genet. 121:77-89.
Tierney, J. S. , Schaeffer, L. R. 1994. Inclusion of semen price of the sire
in an animal model to account for preferential treatment. J. Dairy Sci.
77:576-582.
Tyriseva, A. M. , Mantysaari, E. A., Jakobsen, J., Aamand, G. P., Durr, J.,
Fikse, W. F., Lidauer, M. H. 2018. Detection of evaluation bias caused
by genomic preselection. J. Dairy Sci. 101:1-9.
Ufford, G. R. , Henderson, C. R., Van Vleck, L. D. 1979. Computing algorithms for sire evaluation with all lactation records and natural service
sires. J. Dairy Sci. 62:511-513.
Van Raden, P. M. , Hoeschele, I. 1991. Rapid inversion of additive by additive relationship matrices by including sire-dam combination effects.
J. Dairy Sci. 74:570-579.
Veerkamp, R. F. , Brotherstone, S., Meuwissen, T. H. E. 1999. Survival
analysis using random regression models. Proc. International Workshop
on EU Concerted Action Genetic Improvement of Functional Traits in
Cattle; Longevity. Interbull Bulletin 21:36-40.
Westell, R. A. , Quaas, R. L., Van Vleck, L. D. 1988. Genetic groups in an
animal model. J. Dairy Sci. 71:1310-1318.
Wiggans, G. R. , Misztal, I. 1987. Supercomputer for animal model evaluation of Ayrshire milk yield. J. Dairy Sci. 70:1906-1912.
Willham, R. L. 1972. The role of maternal effects in animal breeding. III. Biometrical aspects of maternal effects in animals. J. Anim. Sci. 35:1288-1293.
Wood, P. D. P. 1967. Algebraic model of the lactation curve in cattle. Nature 216:164-165.
Wood, P. D. P. 1968. Factors affecting the shape of the lactation curve in
cattle. Anim. Prod. 11:307-316.