Whittaker et al. (2000)

Genomic selection
Aaron Lorenz
Department of Agronomy and Horticulture
Role of markers in crop improvement
Bernardo, 2008
Genomic selection
DNA marker data
Model training
Training Population
Calibration Set
$y = Xb + Zu + e$
Predict and select
Phenotypic data
• No QTL mapping
• No testing for significant markers
Selection candidates
A genome-wide approach typically provides better predictions
[Figures: genomic r_A versus MAS r_A; GS compared with MAS. Sources: Lorenzana and Bernardo (2009); Lorenz (2013)]
Whittaker et al. (2000)
• When doing MAS, one cannot include all the markers, so a subset of markers must be selected to fit.
• No entirely satisfactory way of doing this exists.
• Objective is to evaluate ridge regression.
– Superior to subset selection when the objective is to make predictions.
Whittaker et al. (2000)
• Find subset of markers Q.
• Interested in $\hat{a}_i = \sum_{k \in Q} \hat{\beta}_k x_{ik}$
$\hat{\beta} = (X^T X)^{-1} X^T y$
• Cannot include all markers in Q
– Increases variance of $\hat{\beta}$
– If the number of markers is very large, there are not enough d.f.
Whittaker et al. (2000)
• Ridge regression – include all variables, but replace normal least-squares estimators with
$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$
• Normal estimates shrunk toward 0
– Degree of shrinkage determined by $\lambda$
• Choose $\lambda$ to minimize model error
• Addition of the $\lambda I$ term reduces collinearity and keeps the matrix being inverted, $X^T X + \lambda I$, nonsingular even when $X^T X$ itself is singular.
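The ridge estimator on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of Whittaker et al. (2000): the data are simulated, and the names `ridge_estimates`, `n`, `p`, and the marker coding (0/1, centered) are assumptions made for the example.

```python
import numpy as np

def ridge_estimates(X, y, lam):
    """Return (X'X + lam*I)^-1 X'y, the ridge estimator."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 200                                      # fewer records than markers (assumed sizes)
X = rng.integers(0, 2, size=(n, p)).astype(float)   # hypothetical 0/1 marker scores
X -= X.mean(axis=0)                                 # center each marker column
true_beta = rng.normal(0.0, 0.1, size=p)            # simulated marker effects
y = X @ true_beta + rng.normal(0.0, 1.0, size=n)

# X'X is p x p but its rank cannot exceed n, so it is singular when p > n
# and the ordinary least-squares inverse does not exist.
rank_xtx = np.linalg.matrix_rank(X.T @ X)
print(f"rank(X'X) = {rank_xtx}, p = {p}")

# Adding lam*I makes the system solvable; a larger lambda shrinks harder.
norms = {lam: float(np.linalg.norm(ridge_estimates(X, y, lam)))
         for lam in (1.0, 10.0, 100.0)}
for lam, size in norms.items():
    print(f"lambda = {lam:6.1f}   ||beta_hat|| = {size:.3f}")
```

The length of the estimated effect vector falls as lambda grows, which is the shrinkage toward 0 described above. In practice lambda would be chosen by cross-validation or from variance components rather than from a fixed grid.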
Whittaker et al. (2000)
Meuwissen, Hayes and Goddard (MHG) 2001
Objective: “Compare statistical methods for their accuracy in
predicting total breeding value of individuals in a situation where
a limited number of recorded individuals are genotyped for many
markers.”
- Computer simulation
- 2000 individuals
- Need to estimate 50,000 haplotype effects
MHG 2001
[Figure: accuracy r(GEBV, true BV), on a 0 to 1 scale, for least-squares, BLUP, BayesA and BayesB]
Genomic selection models
1. Shrinkage models
• RR-BLUP, G-BLUP
2. Dimension reduction methods
• Partial least squares
• Principal component regression
3. Variable selection models
• BayesB, BayesCπ, BayesDπ
4. Kernel and machine learning methods
• Support vector machine regression
LARGE p !!
Training population
Line      Yield   Mrk 1   Mrk 2   …   Mrk p
Line 1    76      1       1       …   1
Line 2    56      1       1       …   1
Line 3    45      1       1       …   1
Line 4    67      0       1       …   0
…
Line n    22      1       1       …   1
smaller n !!
Baseline model
$y_i = \mu + \sum_k \beta_k x_{ik} + e_i$
$\beta_k \sim\ ?$
-- More predictors than observations.
-- Solution: fit predictors as random effects.
-- Constrain possible effects.
-- What distribution is β being sampled from?
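A small simulation shows why the baseline model cannot be fitted with unconstrained effects when markers outnumber records. This is a sketch under assumed sizes (`n = 20`, `p = 100`) with a phenotype that is pure noise; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                                      # assumed: far more markers than lines
X = rng.integers(0, 2, size=(n, p)).astype(float)   # hypothetical 0/1 marker scores
y = rng.normal(50.0, 10.0, size=n)                  # "yield" with no genetic signal at all

# Design matrix with an intercept column for mu.
X1 = np.column_stack([np.ones(n), X])

# lstsq returns the minimum-norm solution among infinitely many exact fits.
coef, _, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
fitted = X1 @ coef
max_abs_residual = float(np.max(np.abs(y - fitted)))

print(f"columns = {X1.shape[1]}, rank = {rank}")
print(f"largest residual = {max_abs_residual:.2e}")
```

Least squares reproduces the noise exactly, so the estimated effects carry no predictive information. Treating the effects as random draws from a distribution is one way to constrain them.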
Priors and penalizations (examples)
$y_i = \mu + \sum_k \beta_k x_{ik} + e_i$
Ridge regression: $\beta_k \sim N(0, \sigma_\beta^2)$
LASSO: $\beta_k \sim DE(\lambda)$
BayesC: $\beta_k = 0$ with prob $\pi$; $\beta_k \sim N(0, \sigma_\beta^2)$ with prob $(1 - \pi)$
[Figure: densities of the double exponential distribution and the normal distribution]
The two distributions represent two different assumptions about the underlying distribution of QTL effects
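The difference between these priors can be seen by drawing effects from each. The sketch below assumes a common variance for the normal and double exponential draws and an arbitrary `prob_zero = 0.9` for the BayesC mixture; none of these values come from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200_000          # number of simulated marker effects (assumed)
sigma = 1.0          # common standard deviation (assumed)
prob_zero = 0.9      # BayesC probability of a zero effect (assumed)

normal_draws = rng.normal(0.0, sigma, size=m)
# A Laplace (double exponential) with scale b has variance 2*b^2,
# so b = sigma / sqrt(2) matches the variance of the normal draws.
de_draws = rng.laplace(0.0, sigma / np.sqrt(2.0), size=m)
# BayesC-style mixture: zero with prob_zero, otherwise normal.
is_zero = rng.random(m) < prob_zero
bayesc_draws = np.where(is_zero, 0.0, rng.normal(0.0, sigma, size=m))

draws = {"normal": normal_draws, "double exponential": de_draws}
near_zero = {k: float(np.mean(np.abs(v) < 0.1 * sigma)) for k, v in draws.items()}
in_tails = {k: float(np.mean(np.abs(v) > 3.0 * sigma)) for k, v in draws.items()}
frac_exact_zero = float(np.mean(bayesc_draws == 0.0))

for k in draws:
    print(f"{k:20s} near zero: {near_zero[k]:.3f}   beyond 3 sd: {in_tails[k]:.4f}")
print(f"BayesC share of exactly-zero effects: {frac_exact_zero:.3f}")
```

At equal variance the double exponential puts more mass both near zero and in the tails than the normal, which corresponds to many negligible effects plus a few large ones. The BayesC mixture goes further and sets most effects to exactly zero.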
de Los Campos et al. (2013)
Comparing marker effects between models
[Figure: marker effect estimates from BayesCπ and RR-BLUP under two simulated scenarios: large-effect QTL and many small-effect QTL]
G-BLUP
• Similar to traditional BLUP with pedigrees
• Calculate genomic relationship matrix
• Use genomic relationships in a mixed linear model to predict breeding value of relatives
$y_i = \mu + u_i + e_i$
$u \sim MVN(0, G\sigma_u^2)$
$G = \begin{bmatrix} G_{11} & G_{12} & \cdots & G_{1n} \\ G_{21} & G_{22} & \cdots & G_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ G_{n1} & G_{n2} & \cdots & G_{nn} \end{bmatrix}$
[Figure: G partitioned into blocks for the training population and the selection candidates]
Relationships between TP and selection candidates leveraged for prediction
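The G-BLUP steps can be sketched with simulated data. Several things here are assumptions made for illustration only: the relationship matrix is computed as centered markers times their transpose divided by the number of markers (other scalings exist), the variance components are treated as known, and the mean is estimated by the training-set average.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_cand, p = 400, 100, 300        # hypothetical population and marker counts
n = n_train + n_cand

# Marker scores for training lines and selection candidates together.
M = rng.integers(0, 2, size=(n, p)).astype(float)
W = M - M.mean(axis=0)                    # center each marker
G = W @ W.T / p                           # genomic relationship matrix (assumed scaling)

# Simulate breeding values so that var(u) = G * sigma_u2.
sigma_u2, sigma_e2 = 4.0, 1.0             # assumed known variance components
beta = rng.normal(0.0, np.sqrt(sigma_u2 / p), size=p)
u = W @ beta
y = 10.0 + u + rng.normal(0.0, np.sqrt(sigma_e2), size=n)

train = np.arange(n_train)                # phenotyped and genotyped
cand = np.arange(n_train, n)              # genotyped only
lam = sigma_e2 / sigma_u2

# BLUP of candidate breeding values from training phenotypes:
# u_hat_cand = G_ct (G_tt + lam*I)^-1 (y_t - mu_hat)
y_t = y[train]
mu_hat = y_t.mean()
G_tt = G[np.ix_(train, train)]
G_ct = G[np.ix_(cand, train)]
u_hat_cand = G_ct @ np.linalg.solve(G_tt + lam * np.eye(n_train), y_t - mu_hat)

accuracy = float(np.corrcoef(u_hat_cand, u[cand])[0, 1])
print(f"correlation of predicted and true candidate breeding values: {accuracy:.2f}")
```

The candidates' phenotypes in `y` are never used. Their predictions come only from the block of G that links them to the training population.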
Equivalency between RR-BLUP and G-BLUP
$y = \mu + \sum_k x_k \beta_k + e$
$\beta_k \sim N(0, \sigma_\beta^2)$
$u = \sum_k x_k \beta_k = X\beta$
From MVN distribution properties:
$\mathrm{var}(u) = XX^T \sigma_\beta^2 = G \sigma_u^2$
$G = XX^T$
Only valid with the normal prior!
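The equivalence can be checked numerically. The sketch below assumes G is taken as XX^T without rescaling, so the same shrinkage parameter `lam` (the ratio of residual variance to marker-effect variance) applies in both forms. Data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 30, 120, 5.0                            # assumed sizes and shrinkage parameter
X = rng.integers(0, 2, size=(n, p)).astype(float)   # hypothetical marker scores
X -= X.mean(axis=0)                                 # center markers
y = rng.normal(0.0, 1.0, size=n)
y -= y.mean()                                       # remove mu so only u and e remain

# RR-BLUP: estimate p marker effects, then sum them per individual.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
u_rrblup = X @ beta_hat

# G-BLUP: work with the n x n relationship matrix instead.
G = X @ X.T
u_gblup = G @ np.linalg.solve(G + lam * np.eye(n), y)

max_diff = float(np.max(np.abs(u_rrblup - u_gblup)))
print(f"largest difference between the two sets of predictions: {max_diff:.2e}")
```

Both routes give the same predicted breeding values up to rounding error. G-BLUP solves an n x n system and RR-BLUP a p x p system, which matters when p is much larger than n.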
Predicting prediction accuracy
• Prediction accuracy: $r_{u\hat{u}} = \mathrm{cor}(TBV, GEBV)$
$E(r_{u\hat{u}}) = \sqrt{\dfrac{N h^2}{N h^2 + M_e}}$   Daetwyler et al. (2008)
$E(r_{u\hat{u}}) = r^2 \left[ \dfrac{N h^2}{r^2 N h^2 + M_e} \right]^{1/2}$   Lian et al. (2014)
N = training pop size
h^2 = trait heritability
M_e = effective number of loci
r^2 = LD between marker and QTL (see Lian ref)
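The two expressions are easy to evaluate for planning purposes. In the sketch below the function names are hypothetical, the example inputs are invented, and the LD-adjusted form is assumed to read r^2 [N h^2 / (r^2 N h^2 + M_e)]^(1/2); it should be checked against Lian et al. (2014) before use.

```python
import math

def expected_accuracy(N, h2, Me):
    """Daetwyler et al. (2008) form: sqrt(N*h2 / (N*h2 + Me))."""
    return math.sqrt(N * h2 / (N * h2 + Me))

def expected_accuracy_ld(N, h2, Me, r2):
    """LD-adjusted form (assumed reading): r2 * sqrt(N*h2 / (r2*N*h2 + Me))."""
    return r2 * math.sqrt(N * h2 / (r2 * N * h2 + Me))

# Invented example inputs: heritability 0.5 and 50 effective loci.
h2, Me = 0.5, 50
for N in (50, 100, 200, 400):
    full = expected_accuracy(N, h2, Me)
    partial = expected_accuracy_ld(N, h2, Me, r2=0.8)
    print(f"N = {N:4d}   E(r) = {full:.3f}   E(r) with r2 = 0.8: {partial:.3f}")
```

Expected accuracy rises with training population size and heritability and falls as the effective number of loci grows. With r^2 = 1 the second function reduces to the first.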
Factors affecting prediction accuracy
• Training population size
• Trait heritability
– Influence of G x E, precision of measurements
• Marker density
• Effective population size of breeding population
– i.e., genetic diversity of breeding population
• Genetic relationship between training population and
selection candidates
• Statistical model
Effect of relationships: Predicting across populations
[Figure: principal component plot (PC 1 versus PC 2) based on 1180 polymorphic markers, showing training sets and validation sets in Subpop 1 and Subpop 2 (BuschAg, University of MN, NDSU 6-row), with prediction accuracy]
Effect of relationships: Presence of relatives in TP
[Figure axis: mean relationship of top ten relatives. Source: Clark et al. (2012)]
Models typically similar in accuracy
[Figure: accuracy (0 to 1 scale) of RR-BLUP, BayesCπ and Bayesian LASSO for DON and FHB]
Models also equivalent in:
• Bernardo and Yu (2007) [Maize]
• Lorenzana and Bernardo (2009) [Several plant species]
• Van Raden et al. (2009) [Holstein]
• Hayes (2009) [Holstein]
Why?
• Extensive LD in plant and animal breeding programs
– Perfect situation for G-BLUP
– Long stretches of genome that are identical by descent
means relationships calculated with markers are good
indicators of relationships at causal polymorphisms.
– Extensive LD also means it’s hard for variable selection models to zero in on markers in tight LD with causal polymorphisms.
• Expect variable selection models will be superior when
– Individuals are unrelated
– Very large TP (millions?)
– Very high marker density, so that markers are in LD with causal polymorphisms
Resources and packages
• rrBLUP package
– cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
– Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.
– Endelman, J.B., and J-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix. G3 2:1045
• BLR (Bayesian Linear Regression) package
– http://bglr.r-forge.r-project.org/
– Perez et al. 2010. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3:106-116.
References
• Bernardo, R. 2008. Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Sci. 48:1649-1664.
• Clark, S.A., J.M. Hickey, H.D. Daetwyler and J.H.J. van der Werf. 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44:.
• Daetwyler, H.D., B. Villanueva and J.A. Woolliams. 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3:.
• de los Campos, G., J.M. Hickey, R. Pong-Wong, H.D. Daetwyler and M.P.L. Calus. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327-+.
• Lian, L., A. Jacobson, S. Zhong and R. Bernardo. 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci.
• Lorenz, A.J. 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: A simulation experiment. G3-Genes Genomes Genetics 3:481-491.
• Lorenzana, R.E. and R. Bernardo. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120:151-161.
• Meuwissen, T.H., B.J. Hayes and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819-1829.
• Whittaker, J.C., R. Thompson and M.C. Denham. 2000. Marker-assisted selection using ridge regression. Genet. Res. 75:249-252.