Association Mapping versus Genomic Selection

Association Mapping
• To discover genes and genetic variants that control a trait
• Knowledge can be applied to understand mechanism and genetic architecture, to design pathways with diversity, and to generate ideas for transgenic improvement

Genomic Selection
• To identify germplasm with the best breeding values and performance
• Can identify complementary varieties that should be crossed for future improvement

Association-based selection methods: Genomic selection
• We have MAS, why do we need something different?
• Historical introduction to genomic selection
  – The basic idea
  – Methods
  – Theory
  – Selected simulation results
  – Empirical results
  – Long-term genomic selection
  – Introgressing diversity using GS

MAS problems
• Relevant germplasm
• Bias of estimated effects
• Effects too small for detection

Association mapping identifies QTL rapidly while scanning relevant germplasm
[Figure: mapping approaches (association mapping, pedigree, F2 / BC, recombinant inbred lines, intermated recombinant inbreds, near-isogenic lines, positional cloning) compared by research time (years, 1 to 5), resolution (bp, 1 to 1 × 10^7), and relevance to breeding germplasm (high, depends, low)]

Bias in Effect Estimation
[Figure: locus effect estimates (true + error) against a significance threshold, showing the true effect, the average "detected" effect, and the estimated bias]
• Keep in all loci => No threshold => Estimated effects are unbiased

In polygenic traits, much is hidden
E.g., h² = 0.8, α = 0.01 [Figure; surviving label: 1200]
Lande & Thompson 1990

Genomic selection principles
• Meuwissen et al. 2001 Genetics 157:1819-1829
• No distinction between "significant" and "nonsignificant"; no arbitrary inclusion / exclusion: all markers contribute to prediction
• More effects must be estimated than there are phenotypic observations
• Estimated effects are unbiased
• Capture small effects

Genomic selection: Prediction using many markers
[Diagram: Breeding Material, Genotyping, Calculate GEBV, Make Selections]
Meuwissen et al.
2001 Genetics 157:1819-1829

Statistical modeling: The two cultures
[Diagram: X (observed inputs) passes through Nature to give Y (observed responses)]
• Can we understand Y? X, Regression, Y: identify causal inputs
• Can we predict Y? X, ?, Y: regression, decision trees, whatever works
Breiman 2001 Stat. Sci. 16:199-231

Need to shorten breeding cycle
[Figure: selection intensity i (0 to 4) against the ratio of candidates to selected (1 to 10000); accuracy r_A (0 to 1) against the number of replications (1 to 1000)]
i cumulates over breeding cycles

Phenotypic Selection
[Diagram: Select, Cross, Inbreed, Phenotype, Release. Labels in source order: 1 Season; Years; F13× Inducer; 2 Seasons; 1 Rep; Self DH0; N=2270, S=100; Phenotype 2 Years, 5 Reps; N=100, S=10]

Genomic Selection
[Diagram: Select, Cross, Inbreed, Phenotype, Release. Label: 1 Year!]

FastGS
[Diagram: Select, Cross, Inbreed, Phenotype, Release. Label: 1 Season = ⅓ Year!!]

Selection Intensities
• Phenotypic
  – N = 2270, S = 10: i = 2.4
• FastGS (!!!)
  – N = 370, S = 43: i = 1.7
  – 9 × i ≅ 15
Inbreeding:

Rates of gain per year

Impacts
• Schaeffer, L.R. 2006. Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123:218-223.

Cost per genetic standard deviation
[Figure: genomic vs. phenotypic; labels $116 M and $4.2 M]
Schaeffer 2006

Potential Impact
Heffner, E.L. et al. 2009. Genomic Selection for Crop Improvement.
Crop Science 49:1-12
[Diagram: two linked cycles around Genomic Selection, fed by New Germplasm. Model Training Cycle: Phenotype (lines have already been genotyped), Train prediction model, Updated Model. Line Development Cycle: Make crosses and advance generations, Genotype, Advance lines with highest GEBV, Advance lines informative for model improvement, Test varieties and release]

What (I think) is revolutionary
[Diagram: the same two cycles, with Phenotypic Selection shown alongside Genomic Selection]
For a century, breeding has focused on better ways to evaluate lines. Henceforth it will focus on how to improve a model.

A Focus for Information
Genomic Prediction Model Development
• Current pheno–geno data
• Historical pheno–geno data
• Linkage and association mapping
• Biological knowledge
[Diagram: Select, Cross (Population Improvement); Cultivar Release]

The Alleletarian Revolution
• The breeding line as the focus of evaluation has been dethroned in favor of the allele
• A line is useful to us only with respect to the alleles it carries
• Time-honored practice: replicate (progeny test) lines
• But alleles are replicated regardless of what line carries them

Methods
• Linear models:
  – Effects are random
  – Methods differ in marker effect priors
• Machine learning methods
  – Regression trees

Linear models: Priors on coefficients
• Ridge regression: …
• BayesB (SSVS): … else …
• BayesCπ: … else …

[Figure: density of Var(β) under ridge regression, BayesB, and BayesCπ]

Machine learning methods
• Random Forests
  – Forest of regression trees
  – Each tree on a bootstrapped sample
  – Nodes split on randomly sampled features
  – Prediction is forest mean
• Can capture interactions
[Diagram: example regression trees splitting on markers M1 and M2, each with genotypes 0 and 1]

Additive models and breeding value
• Breeding value = Mean phenotype of progeny –
Most important parent selection criterion
  – Recombination: parents do not always pass combinations of genes to their progeny
  – => Sum of individual locus effects
• Linear models capture this; machine learning methods may not

Theory
• How accurate will GS be?
• Impact of GS on inbreeding / loss of diversity
• Genomic selection captures pedigree relatedness among candidates

Prediction accuracy = Correlation(predicted, true)
• R = i r_A σ_A, where r_A = corr(selection criterion, breeding value)
• On simulated data corr(Â, A) is easy
• On real data:

Predict prediction accuracy
• Daetwyler, H.D. et al. 2008. Accuracy of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach. PLoS ONE 3:e3395
• Assume all loci affecting the trait are known and are independent
• Assume marker effects are fixed

[Figure: curves for λ = 20, 10, 5, 2, 1, 0.5, 0.1, 0.02]
Replicating hurts: 2000 with 1 plot is better than 1000 with 2 plots

Predict prediction accuracy
• Hayes, B.J. et al. 2009. Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91:47-60.
• Detail on the population genetics that drive n_G
• Assume marker effects are random
• Still assume all markers independent and estimated separately

Analytical approximations
[Figure: two panels, Daetwyler et al., 2008 and Hayes et al., 2009; x-axis N_P / N_G]

Take Homes
• Even with traits of very low heritability (h² = 0.01), sufficient n_P gives accuracy
• Replication may not be good
• The number of loci estimated (n_G) is a critical parameter
• If you don't know where the QTL are, higher marker coverage requires higher n_G
• N.B. All conclusions assume 100% LD only!

Genetic diversity loss / inbreeding
• Daetwyler, H.D. et al. 2007. Inbreeding in genome-wide selection. J. Anim. Breed. Genet. 124:369-376
• Avoid selecting close relatives together
• What is the correlation in the estimated breeding value between full sibs?
[Figure: correlation of sibling estimates]

Genetic diversity loss / inbreeding
A_j = ½A_S + ½A_D + a_j, where a_j is the Mendelian sampling term
• BLUP: σ²_B; σ²_W = 0
• GS: σ²_B; σ²_W > 0
[Figure: correlation of sibling estimates]

Daetwyler et al. 2007 Take Homes
• Genomic selection captures the Mendelian sampling term.
  – Correlation between the estimates of sibling performance is reduced
  – Co-selection of sibs is reduced
  – Rate of inbreeding / loss of diversity is reduced

A word on pedigree relatedness
• Five individuals, a, b, c, d, and e.
  – a, b, and c unrelated
  – d offspring of a and b
  – e offspring of a and c

A =
       a    b    c    d    e
  a    1    0    0    ½    ½
  b    0    1    0    ½    0
  c    0    0    1    0    ½
  d    ½    ½    0    1    ¼
  e    ½    0    ½    ¼    1

Ridge Regression
Habier, D. et al. 2007. Genetics 177:2389-2397
Hayes, B.J. et al. 2009. Genetics Research 91:47-60.

Habier et al. simulation set up

Genetic relationship decays fast
• Prediction from pedigree relationship loses accuracy very quickly
• Decay rate is initially more rapid, then stabilizes after about 5 generations
• Rapid initial decay reflects that the closest marker may not be in highest LD with the QTL
• RR-BLUP accuracy decays more rapidly than BayesB because more markers absorb the effect of a QTL
[Figure annotation: Training population here]

Habier et al. 2007 Take homes
• The ability of genomic selection to capture information on genetic relatedness is valuable
• That information decays rapidly
• The amount of that information relates to the number of markers fitted by a model:
  – Ridge regression > BayesB
• BayesB captured more LD information:
  – Long-term accuracy: BayesB > Ridge regression

Accuracy due to relationships vs. LD

Stochastic vs deterministic prediction
[Figure panels: Habier et al.; Zhong et al.]
[Figure x-axis: N_P / N_G]

To replicate or not to replicate
[Figure: 504 lines replicated once vs. 168 lines replicated three times, for Ridge Regression and BayesB]

Genetic diversity loss / inbreeding
A_j = ½A_S + ½A_D + a_j, where a_j is the Mendelian sampling term
• BLUP: σ²_B; σ²_W = 0
• GS: σ²_B; σ²_W > 0
[Figure: correlation of sibling estimates]
Capturing relationship information increases σ²_B, NOT σ²_W

Simulation setting: Meuwissen; Habier; Solberg
• N_e = 100; 1000 generations
• Mutation / Drift / Recombination equilibrium
• High marker mutation rate (2.5 × 10⁻³ per locus per generation); higher "haplotype mutation rate"
• Mutation effect distribution Gamma (1.66, 0.4): "effective QTL number" is only about 6 (!)
  – => Watch out how you simulate!

Results
• Prediction accuracy estimated by simulation

             MHG    HFD
  RR-BLUP    0.73   0.64
  BayesB     0.85   0.69

• These accuracies are ASTOUNDING
• If h² = 1, r = 0.71

Noteworthy discussion
• Markers flanking QTL not always in model
  – QTL effects captured by multiple markers
  – No need to "detect" QTL
• Recombination causes accuracy to decay
  – Faster than if QTL captured by flanking markers
  – Markers far from QTL contribute to capture its effect
• N_e / 2 markers per Morgan achieves close to maximum accuracy
  – Dependent on high marker mutation rates (?)

Solberg et al. 2008
• Density: Number of markers per Morgan

  SSR:   ¼ N_e   ½ N_e   1 N_e   2 N_e
  SNP:   1 N_e   2 N_e   4 N_e   8 N_e

Zhong et al. 2009
• Zhong, S. et al. 2009. Genetics 182:355-364.
• 42 diverse 2-row barley
• 1040 markers, approximately evenly spaced
• Mating designs to generate 500 high and low LD training dataset
• 20 or 80 QTL; h² = 0.4

Ridge regression vs. BayesB
[Figure: panels 20 QTL HiLD, 20 QTL LoLD, 80 QTL HiLD, 80 QTL LoLD, for Ridge Regression and BayesB; QTL observed vs. unobserved]

Zhong et al. 2009 Take-home messages
• Ridge regression is not affected by the number of QTL / the QTL effect size
• BayesB performs better with large marker-associated effects
• Co-linearity is more detrimental to BayesB
• High marker density and training pop. size?
Yes: BayesB; No: RR-BLUP

VanRaden et al. 2009
• VanRaden, P.M. et al. 2009. Invited Review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92:16-24.

VanRaden et al. 2009
• Some traits have major genes, others do not

VanRaden et al. 2009
• The larger the training population, the better. Where diminishing returns will begin is not in sight.
[Figure; surviving label: Predictor]

Take Homes
• Training population requirements very large
• BayesB did not help
• == no large marker-associated effects
• == Like the "Case of the missing heritability" in human GWAS studies
  – Are many quantitative traits driven by very low frequency variants?
  – RR would capture this case better than BayesB

Empirical data on crops: TP size

Empirical data on crops: Marker No.

Empirical data on Humans: Marker No.
Out of 295K SNP
Yang et al. 2010. Nat. Genet. 10.1038/ng.608

Long-term genomic selection
• Marker data from elite six-row barley program
• 880 Markers
• 100 hidden as additive-effect QTL
• Evaluate 200 progeny, select 20
• Phenotypic compared to genomic selection

Breeding / model update cycles
[Table: Seasons 1 to 6. Phenotypic Selection alternates Cross & Inbreed with Evaluate & Select. Genomic Selection begins with Cross & Inbreed, then Evaluate & Select, followed by Cross, Inb. & Select in later seasons, with an Evaluate step in every other one]
Evaluation is possible every other season. Candidates from every other cycle can be evaluated. There is still a lag: Parents of C2 are selected based on evaluation of C0.
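
The schedule above has genomic selection completing a selection cycle every season where phenotypic selection needs two. The following is a minimal sketch of what that does to gain per season, assuming truncation selection on a normally distributed criterion in a large population (so that i = φ(z) / p) and the response formula R = i r_A σ_A from the theory slides. The function names are illustrative, and the accuracies 0.6 and 0.4 are hypothetical placeholders, not results from the simulation.

```python
from statistics import NormalDist


def selection_intensity(n_candidates: int, n_selected: int) -> float:
    """Standardized selection differential i under truncation selection.

    Large-population approximation: i = pdf(z) / p, where p is the
    selected fraction and z is the truncation point on a standard normal.
    """
    p = n_selected / n_candidates
    if not 0.0 < p < 1.0:
        raise ValueError("need 0 < n_selected < n_candidates")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - p)
    return std_normal.pdf(z) / p


def gain_per_season(i: float, accuracy: float, sigma_a: float,
                    seasons_per_cycle: int) -> float:
    """Response per cycle, R = i * r_A * sigma_A, divided by cycle length."""
    return i * accuracy * sigma_a / seasons_per_cycle


# Evaluate 200 progeny, select 20 (the long-term simulation setting).
i_200_20 = selection_intensity(200, 20)

# Hypothetical accuracies, chosen only for illustration.
phenotypic = gain_per_season(i_200_20, accuracy=0.6, sigma_a=1.0,
                             seasons_per_cycle=2)
genomic = gain_per_season(i_200_20, accuracy=0.4, sigma_a=1.0,
                          seasons_per_cycle=1)

print(f"i (20 of 200) = {i_200_20:.2f}")
print(f"i (43 of 370) = {selection_intensity(370, 43):.2f}")
print(f"phenotypic, 2 seasons per cycle: {phenotypic:.3f} sigma_A per season")
print(f"genomic, 1 season per cycle:     {genomic:.3f} sigma_A per season")
```

With these placeholder values the less accurate genomic criterion still gains more per season, because halving the cycle length outweighs the drop in accuracy. The 43-of-370 case comes out close to the i = 1.7 quoted for FastGS.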
Response in genotypic value
[Figure: mean genotypic value against phenotypic breeding cycle. Series: Phenotypic Selection; Genomic, Small Training Pop; Genomic, Large Training Pop]

Accuracy
[Figure: mean realized accuracy against phenotypic breeding cycle. Same three series]

Genetic variance
[Figure: mean genotypic standard deviation against phenotypic breeding cycle. Same three series]

Lost favorable alleles
[Figure: mean number of lost favorable alleles against phenotypic breeding cycle. Same three series]

Goddard 2008; Hayes et al. 2009

Response in genotypic value
[Figure: mean genotypic value against phenotypic breeding cycle, unweighted and weighted panels. Series: Phenotypic Selection; Genomic, Small Training Pop; Genomic, Large Training Pop]

Genetic variance
[Figure: mean genotypic standard deviation against phenotypic breeding cycle, unweighted and weighted panels. Same three series]

Lost favorable alleles
[Figure: mean number of lost favorable alleles against phenotypic breeding cycle, unweighted and weighted panels. Same three series]

Long term genomic selection
• The acceleration of the breeding cycle is key
• Some favorable alleles will be lost
  – Likely those not in LD with any marker
• Managing diversity / favorable alleles appears a good idea
• This can be done using the same data as used for genomic prediction

Introgressing diversity
• GS relies on marker–QTL allele association
• An "exotic" line comes from a sub-population divergent from the breeding population
• After sub-populations separate
  – Drift moves allele frequencies independently
  – Drift & recombination shift associations independently
• Will the GS prediction model identify valuable segments from the exotic?
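
The recombination half of that last point can be made concrete. The following is a minimal sketch, assuming the standard expectation that linkage disequilibrium D between two loci shrinks by a factor (1 − c) per generation of random mating, where c is the recombination fraction. Drift is ignored entirely, map distance is converted with the short-distance approximation c ≈ cM / 100, and the distances and generation counts are illustrative choices rather than values from the slides.

```python
def expected_ld_retained(recomb_fraction: float, generations: int) -> float:
    """Expected fraction of the initial D remaining: (1 - c) ** t."""
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - recomb_fraction) ** generations


generation_counts = (10, 50, 100)
print("distance  " + "  ".join(f"t={t:<4d}" for t in generation_counts))
for centimorgans in (0.1, 1.0, 5.0):
    c = centimorgans / 100.0  # short-distance approximation, an assumption
    retained = [expected_ld_retained(c, t) for t in generation_counts]
    print(f"{centimorgans:>5.1f} cM  " + "  ".join(f"{x:6.3f}" for x in retained))
```

Tightly linked pairs keep most of the association that predates the split, while loosely linked pairs lose it within tens of generations, so whatever association they show afterwards has arisen separately in each sub-population.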
Three approaches
• Create a bi-parental family with the exotic (Bernardo 2009)
  – Develop a mini-training population for that family
  – Improve the family
  – Bring it into the main breeding population
• Develop a separate training population for the exotic sub-population (Ødegård et al. 2009)
• Develop a single multi-subpopulation (species-wide?) training population (Goddard 2006)

Need higher marker density
[Diagram: ancestral LD vs. sub-population specific LD]
• Tightly linked: ancestral LD
• Loosely linked: sub-population specific LD

Consistency of association across barley subpopulations
[Figure: correlation of r (0.0 to 1.0) against genetic distance, for marker pairs at 0 cM and 5 cM recombination distance]

Example: Dairy cattle breeds
[Figure: prediction accuracy (0 to 0.7) for validation populations VP = Holstein and VP = Jersey, with training populations TP = Holstein, TP = Jersey, and Holstein + Jersey]

Oat sub-populations (UOPN)
[Figure: sub-populations G1, G2, G3; sample sizes N=136, N=149, N=161]

Combined sub-population TP (β-Glucan)
[Figure: validation population (VP) against training population (TP) for G1, G2, G3 and the combinations "G2 and G3" and "G1 and G2"; values shown: 0.11, 0.50, 0.39]

Introgressing diversity using GS
• Need higher marker density
• Analysis of consistency of r may indicate whether current density is sufficient
  – Not sure we have it for barley
• If you have the density, a multi-subpopulation training population seems like a good idea
  – Focuses the model on tighter ancestral LD rather than looser sub-population specific LD
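
The consistency analysis mentioned above could be sketched roughly as follows: compute the signed LD correlation r for every marker pair within each sub-population, then correlate those r values across the two sub-populations. This is a toy illustration only. The haplotypes are made-up 0/1 data, the function names are hypothetical, and it is not the procedure behind the barley figure, which would also group marker pairs by map distance.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def pairwise_ld_r(haplotypes):
    """Signed r for each marker pair; rows are haplotypes of 0/1 alleles.

    For biallelic markers, r equals the Pearson correlation of the
    allele indicators, so monomorphic markers raise ValueError.
    """
    n_markers = len(haplotypes[0])
    columns = [[hap[m] for hap in haplotypes] for m in range(n_markers)]
    return [pearson(columns[a], columns[b])
            for a, b in combinations(range(n_markers), 2)]


def consistency_of_r(subpop_1, subpop_2):
    """Correlation of pairwise r values between two sub-populations."""
    return pearson(pairwise_ld_r(subpop_1), pairwise_ld_r(subpop_2))


# Made-up haplotypes: 8 per sub-population, 3 markers each.
subpop_a = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1],
            [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
subpop_b = [[0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 1, 0],
            [1, 1, 0], [1, 1, 0], [1, 1, 1], [1, 0, 1]]

print("r within A:", [round(r, 2) for r in pairwise_ld_r(subpop_a)])
print("r within B:", [round(r, 2) for r in pairwise_ld_r(subpop_b)])
print("consistency of r:", round(consistency_of_r(subpop_a, subpop_b), 3))
```

A value near 1 would suggest the same marker associations hold in both sub-populations at the marker spacing examined; a value near 0 would suggest the associations are sub-population specific and that denser markers are needed.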