Genomic selection
Aaron Lorenz
Department of Agronomy and Horticulture

Role of markers in crop improvement
[Figure: Bernardo, 2008]

Genomic selection
[Diagram labels: training population (calibration set); DNA marker data; phenotypic data; model training, $y = Xb + Zu + e$; selection candidates; predict and select]
• No QTL mapping
• No testing for significant markers

A genome-wide approach typically provides better predictions
[Figures: accuracy ($r_A$) of MAS and GS; Lorenzana and Bernardo (2009); Lorenz (2013)]

Whittaker et al. (2000)
• When doing MAS, we cannot include all the markers, so we must select a subset of markers to fit.
• No entirely satisfactory way of doing this exists.
• Objective is to evaluate ridge regression.
  – Superior to subset selection when the objective is to make predictions.

Whittaker et al. (2000)
• Find a subset of markers $Q$.
• Interested in
  $\hat{a} = \sum_{i \in Q} \hat{\beta}_i x_i$, with $\hat{\beta} = (X^T X)^{-1} X^T y$
• Cannot include all markers in $Q$
  – Increases the variance of $\hat{\beta}$
  – If the number of markers is really large, not enough d.f.

Whittaker et al. (2000)
• Ridge regression: include all variables, but replace the normal least-squares estimators with
  $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$
• Normal estimates are shrunk toward 0
  – Degree of shrinkage is determined by $\lambda$
• Choose $\lambda$ to minimize model error
• Addition of the $\lambda I$ term reduces collinearity and prevents the matrix $X^T X + \lambda I$ from being singular, even when $X^T X$ itself is.

Meuwissen, Hayes and Goddard (MHG) 2001
Objective: “Compare statistical methods for their accuracy in predicting total breeding value of individuals in a situation where a limited number of recorded individuals are genotyped for many markers.”
- Computer simulation
- 2000 individuals
- Need to estimate 50,000 haplotype effects

MHG 2001
[Figure: bar chart of r(GEBV : true BV), axis from 0 to 1, for least squares, BLUP, BayesA, and BayesB]

Genomic selection models
1. Shrinkage models
   • RR-BLUP, G-BLUP
2. Dimension reduction methods
   • Partial least squares
   • Principal component regression
3. Variable selection models
   • BayesB, BayesCπ, BayesDπ
4. Kernel and machine learning methods
   • Support vector machine regression

Training population
LARGE p !! (marker columns)    smaller n !! (lines)

Line     Yield   Mrk 1   Mrk 2   …   Mrk p
Line 1   76      1       1       …   1
Line 2   56      1       1       …   1
Line 3   45      1       1       …   1
Line 4   67      0       1       …   0
…
Line n   22      1       1       …   1

Baseline model
$y_i = \mu + \sum_k x_{ik} \beta_k + e_i$, $\beta_k \sim\ ?$
-- More predictors than observations.
-- Solution: fit predictors as random effects.
-- Constrain possible effects.
-- What distribution is $\beta$ being sampled from?

Priors and penalizations (examples)
$y_i = \mu + \sum_k x_{ik} \beta_k + e_i$
Ridge regression: $\beta_k \sim N(0, \sigma_\beta^2)$
LASSO: $\beta_k \sim DE(\lambda)$
BayesCπ: $\beta_k = 0$ with prob. $\pi$; $\beta_k \sim N(0, \sigma_\beta^2)$ with prob. $(1 - \pi)$
[Figure: double exponential and normal densities]
The double exponential and normal distributions represent two different assumptions about the underlying distribution of QTL effects.
de los Campos et al. (2013)

Comparing marker effects between models
[Figure: marker effect estimates under the RR-BLUP and BayesCπ priors; panels for large-effect QTL simulated and many small-effect QTL simulated]

G-BLUP
• Similar to traditional BLUP with pedigrees
• Calculate the genomic relationship matrix
• Use genomic relationships in a mixed linear model to predict breeding values of relatives
$y_i = \mu + u_i + e_i$, $\mathbf{u} \sim MVN(0, G\sigma_u^2)$
$G = \begin{bmatrix} G_{11} & G_{12} & \cdots & G_{1n} \\ G_{21} & G_{22} & \cdots & G_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ G_{n1} & G_{n2} & \cdots & G_{nn} \end{bmatrix}$
[Figure: $G$ partitioned into training population and selection candidate blocks]
Relationships between the TP and the selection candidates are leveraged for prediction.

Equivalency between RR-BLUP and G-BLUP
$y = \mu + \sum_k x_k \beta_k + e$, $\beta_k \sim N(0, \sigma_\beta^2)$
$u = \sum_k x_k \beta_k = X\beta$
From MVN distribution properties:
$\mathrm{var}(u) = XX^T \sigma_\beta^2 = G\sigma_u^2$, $G \propto XX^T$
Only valid with the normal prior!

Predicting prediction accuracy
• Prediction accuracy: $r_{u\hat{u}} = \mathrm{cor}(TBV, GEBV)$
$E(r_{u\hat{u}}) = \sqrt{\dfrac{N h^2}{N h^2 + M_e}}$   Daetwyler et al. (2008)
$E(r_{u\hat{u}}) = \left[ \dfrac{r^2 N h^2}{r^2 N h^2 + M_e} \right]^{1/2}$   Lian et al. (2014)
$N$ = training population size
$h^2$ = trait heritability
$M_e$ = effective number of loci
$r^2$ = LD between marker and QTL (see Lian et al., 2014)

Factors affecting prediction accuracy
• Training population size
• Trait heritability
  – Influence of G x E, precision of measurements
• Marker density
• Effective population size of the breeding population
  – i.e., genetic diversity of the breeding population
• Genetic relationship between training population and selection candidates
• Statistical model

Effect of relationships: Predicting across populations
[Figure: PC 1 vs. PC 2 from 1180 polymorphic markers showing Subpop 1 and Subpop 2, with training sets and validation sets; prediction accuracy for BuschAg, University of MN, and NDSU 6-row]

Effect of relationships: Presence of relatives in TP
[Figure: x-axis is mean relationship of top ten relatives; Clark et al. (2012)]

Models typically similar in accuracy
[Figure: accuracy, axis from 0 to 1, of RR-BLUP, BayesCπ, and Bayesian LASSO for DON and FHB]
Models also equivalent in:
• Bernardo and Yu (2007) [Maize]
• Lorenzana and Bernardo (2009) [Several plant species]
• Van Raden et al. (2009) [Holstein]
• Hayes (2009) [Holstein]

Why?
• Extensive LD in plant and animal breeding programs
  – Perfect situation for G-BLUP
  – Long stretches of genome that are identical by descent mean that relationships calculated with markers are good indicators of relationships at causal polymorphisms.
  – Extensive LD also means it is hard for variable selection models to zero in on markers in tight LD with causal polymorphisms.
• Expect variable selection models to be superior when
  – Individuals are unrelated
  – The TP is very large (millions?)
  – Marker density is very high, so that markers are in LD with causal polymorphisms

Resources and packages
• rrBLUP package
  – cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
  – Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255.
  – Endelman, J.B., and J-L. Jannink. 2012. Shrinkage estimation of the realized relationship matrix. G3 2:1045.
• BLR (Bayesian Linear Regression) package
  – http://bglr.r-forge.r-project.org/
  – Perez et al. 2010. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3:106-116.

References
• Bernardo, R. 2008. Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Sci. 48:1649-1664.
• Clark, S.A., J.M. Hickey, H.D. Daetwyler and J.H.J. van der Werf. 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44.
• Daetwyler, H.D., B. Villanueva and J.A. Woolliams. 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3.
• de los Campos, G., J.M. Hickey, R. Pong-Wong, H.D. Daetwyler and M.P.L. Calus. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327.
• Lian, L., A. Jacobson, S. Zhong and R. Bernardo. 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci.
• Lorenz, A.J. 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: A simulation experiment. G3-Genes Genomes Genetics 3:481-491.
• Lorenzana, R.E. and R. Bernardo. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120:151-161.
• Meuwissen, T.H., B.J. Hayes and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819-1829.
• Whittaker, J.C., R. Thompson and M.C. Denham. 2000. Marker-assisted selection using ridge regression. Genet. Res. 75:249-252.
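
Supplementary sketch: the ridge regression estimator, the RR-BLUP and G-BLUP equivalence, and the Daetwyler et al. (2008) expected accuracy from the slides can be checked numerically on a toy data set. The block below is a minimal pure-Python illustration under stated assumptions, not the rrBLUP implementation. The marker matrix `X` (coded -1/1), the value `lam = 2.0`, and all function names are hypothetical. The phenotypes are the four example yields from the training population table, centered on their mean, and G is taken as XX' without any scaling.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def add_ridge(a, lam):
    """Return a + lam * I for a square matrix a."""
    return [[v + (lam if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(a)]


def solve(a, b):
    """Solve a x = b (b is a column matrix) by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    m = [list(map(float, a[i])) + list(map(float, b[i])) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [[m[i][j] / m[i][i] for j in range(n, len(m[i]))] for i in range(n)]


def ridge_effects(X, y, lam):
    """Marker effects: (X'X + lam I)^-1 X'y."""
    Xt = transpose(X)
    return solve(add_ridge(matmul(Xt, X), lam), matmul(Xt, y))


def gblup_values(X, y, lam):
    """Genomic values: u = K (K + lam I)^-1 y with K = XX' (unscaled, an assumption)."""
    K = matmul(X, transpose(X))
    return matmul(K, solve(add_ridge(K, lam), y))


def expected_accuracy(n, h2, me):
    """Daetwyler et al. (2008): sqrt(N h^2 / (N h^2 + Me))."""
    return math.sqrt(n * h2 / (n * h2 + me))


# Hypothetical marker codes: n = 4 lines, p = 6 markers (p > n).
X = [[1, -1, 1, 1, -1, 1],
     [1, 1, -1, 1, 1, -1],
     [-1, 1, 1, -1, 1, 1],
     [-1, -1, -1, 1, -1, -1]]
# Yields 76, 56, 45, 67 from the slide, centered on their mean of 61.
y = [[15.0], [-5.0], [-16.0], [6.0]]
lam = 2.0  # assumed shrinkage parameter

beta = ridge_effects(X, y, lam)
u_rr = matmul(X, beta)            # RR-BLUP genomic values, X * beta
u_gblup = gblup_values(X, y, lam)  # same values via the relationship matrix

print("marker effects:", [round(b[0], 3) for b in beta])
print("RR-BLUP values:", [round(u[0], 3) for u in u_rr])
print("G-BLUP values: ", [round(u[0], 3) for u in u_gblup])
print("expected accuracy:", round(expected_accuracy(1000, 0.5, 500), 3))
```

Least squares has no unique solution here because X'X is singular when p > n; adding `lam` to the diagonal makes the system solvable, and the two routes return the same genomic values only because both rest on the normal prior.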