Added value of whole-genome sequence data to genomic predictions in dairy cattle
Rianne van Binsbergen 1,2, Mario Calus 1, Chris Schrooten 3, Fred van Eeuwijk 2, Roel Veerkamp 1, Marco Bink 2
1 Animal Breeding & Genetics Centre, Wageningen UR (NL)
2 Biometris, Wageningen UR (NL)
3 CRV (cattle breeding company), Arnhem (NL)
Genomic Prediction in agricultural species
Reference population:
1) Estimate effects for each SNP (w)
2) Generate a prediction equation that combines all the marker genotypes with their effects to predict the breeding value of each individual
Apply prediction equation to a group of individuals that have genotypes but not phenotypes
Estimated genomic breeding values
Select the best individuals for breeding
Each SNP represented by a variable (x), which takes the values
0 [AA]
1 [AB]
2 [BB]
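The genotype coding and the prediction equation above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the toy genotypes, the SNP effects and the function names `encode_genotype` and `predict_gebv` are hypothetical and are not taken from the authors' software.

```python
import numpy as np

def encode_genotype(genotype: str) -> int:
    """Code a genotype as the number of B alleles: AA -> 0, AB -> 1, BB -> 2."""
    return genotype.count("B")

def predict_gebv(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Prediction equation: breeding value of individual i = sum_j x[i, j] * w[j]."""
    return x @ w

# Hypothetical toy data: 3 selection candidates genotyped at 4 SNPs.
genotypes = [
    ["AA", "AB", "BB", "AB"],
    ["BB", "BB", "AA", "AA"],
    ["AB", "AA", "AB", "BB"],
]
x = np.array([[encode_genotype(g) for g in row] for row in genotypes])

# Made-up SNP effects, standing in for effects estimated in a reference population.
w = np.array([0.5, -0.2, 0.1, 0.3])

gebv = predict_gebv(x, w)      # estimated genomic breeding values
best = int(np.argmax(gebv))    # index of the candidate with the highest value
```

In practice the effects `w` come from the reference population (step 1), and the same vector is then applied to candidates that have genotypes but no phenotypes.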
Advantages:
• Select at early age (before phenotypes available)
• Save costs to phenotype candidates
• Increase accuracy of predicted Breeding Values
Goddard & Hayes (2009)
Nature Reviews Genetics 10:381
Simulation Study
Dense marker maps
SNP markers at 1cM density
Prediction Accuracy
Least Squares method: 0.32
Genomic BLUP method: 0.73
Bayesian methods (A, B): 0.85
Conclusion:
“selection on genetic values predicted from markers could substantially increase the rate of genetic gain in animals and plants, especially if combined with reproductive techniques to shorten the generation interval”
Another (seminal) paper on Genomic Prediction
“In the case of whole-genome sequence data, the polymorphisms that are causing the genetic differences between the individuals are among those being analyzed.”
Higher accuracy in genomic predictions since causal mutation is included (assumption)
No dependency on LD
Persistency across generations
Genomic prediction across breeds
Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps
T. H. E. Meuwissen, B. J. Hayes and M. E. Goddard
“Only few SNPs were useful for predicting the trait [because they were in linkage disequilibrium (LD) with mutations causing variation in the trait] while many SNPs were not useful.”
Genomic predictions from whole-genome sequence data
Tremendous increase in number of SNPs (more noise)
Large (sequence) data are required
Solution
Sequence core set of individuals (e.g. founders)
Impute whole-genome sequence genotypes of other individuals
Accuracy of imputation to whole-genome sequence data was generally high for imputation from 777K SNP panel
Van Binsbergen et al., Genet Sel Evol 2014 (in press)
This presentation:
First results of genomic prediction with imputed whole-genome sequence data for 5503 bulls with accurate phenotypes
Dataset: SNP genotypes & trait phenotypes
5503 Holstein Friesian bulls
777K SNP genotypes
(Illumina BovineHD BeadChip)
Imputation - Beagle v4 software
1000 bull genomes project
28M SNP genotypes
429 bulls
(multiple breeds)
5503 Holstein Friesian bulls
12M SNP genotypes
MAF > 0.005
Imputation accuracy > 0.05
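The two SNP filters listed above (MAF > 0.005 and imputation accuracy > 0.05) can be sketched as follows. This is a minimal illustration assuming NumPy and a 0/1/2 allele-count matrix; the function names and the toy data are hypothetical, and the per-SNP imputation accuracy is assumed to be supplied by the imputation step rather than computed here.

```python
import numpy as np

def minor_allele_frequency(x: np.ndarray) -> np.ndarray:
    """MAF per SNP from an (individuals x SNPs) matrix of 0/1/2 allele counts."""
    p = x.mean(axis=0) / 2.0           # frequency of the counted allele
    return np.minimum(p, 1.0 - p)      # frequency of the rarer allele

def filter_snps(x, imputation_accuracy, maf_min=0.005, acc_min=0.05):
    """Keep SNPs that pass both thresholds; returns filtered matrix and mask."""
    keep = (minor_allele_frequency(x) > maf_min) & (imputation_accuracy > acc_min)
    return x[:, keep], keep

# Hypothetical toy data: 4 individuals, 3 SNPs.
x = np.array([
    [0, 0, 2],
    [0, 1, 2],
    [0, 2, 1],
    [0, 1, 2],
])
imputation_accuracy = np.array([0.90, 0.80, 0.01])  # assumed per-SNP values

x_filtered, keep = filter_snps(x, imputation_accuracy)
# SNP 0 is monomorphic (MAF = 0) and SNP 2 fails the accuracy threshold,
# so only SNP 1 is retained.
```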
De-regressed progeny-based proofs (DRP¹) and associated effective daughter contributions (EDC²)
Somatic cell score (SCS)
Interval between first and last insemination (IFL)
Protein yield (PY)
¹ VanRaden et al. 2009 (J Dairy Sci)
² VanRaden and Wiggans 1991 (J Dairy Sci)
Prediction reliability = squared correlation between original phenotype (DRP) and estimated genetic values (GEBV)
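The squared correlation between DRP and GEBV can be computed as below. This is a minimal sketch assuming NumPy; the function name `prediction_reliability` and the toy values are hypothetical, and in the study the correlation would be taken over the validation bulls only.

```python
import numpy as np

def prediction_reliability(drp: np.ndarray, gebv: np.ndarray) -> float:
    """Squared Pearson correlation between observed DRP and predicted GEBV."""
    r = np.corrcoef(drp, gebv)[0, 1]
    return float(r ** 2)

# Hypothetical toy values for four validation animals.
drp = np.array([1.0, 2.0, 3.0, 4.0])
gebv = np.array([1.0, 3.0, 2.0, 4.0])

reliability = prediction_reliability(drp, gebv)
```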
Training population: 4322 old bulls
Validation population: 1181 young bulls
Validation population
Youngest bulls with EDC 0
Mainly sons of bulls in training population
Mimics breeding practice
GBLUP: Genome-enabled best linear unbiased prediction
• Assumes the distribution of QTL effects is close to the infinitesimal model (all SNPs have equally small effects)
• Builds a genomic relationship matrix to model the variance-covariance structure
BSSVS: Bayes stochastic search variable selection
• Assumes a large number of SNPs with tiny (close to zero) effects and a few SNPs with moderate effects (a mixture of two Normal distributions)
• Implemented via Markov chain Monte Carlo (MCMC) simulation algorithms (computer intensive)
• 3 chains of 60,000 cycles (10,000 cycles burn-in)
Calus M (2014). Right-hand-side updating for fast computing of genomic breeding values. Genetics Selection Evolution 46(1): 24.
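The genomic relationship matrix used in GBLUP can be sketched as follows. The slides do not state which formula was used; this illustration assumes one widely used construction, G = ZZ' / (2 Σ p(1 − p)), with Z the genotype matrix centred by twice the allele frequency. The function name and toy genotypes are hypothetical.

```python
import numpy as np

def genomic_relationship_matrix(x: np.ndarray) -> np.ndarray:
    """G = Z Z' / (2 * sum_j p_j (1 - p_j)) from a 0/1/2 genotype matrix.

    Assumption: allele frequencies p are estimated from the same genotypes.
    """
    p = x.mean(axis=0) / 2.0                 # allele frequency per SNP
    z = x - 2.0 * p                          # centre each SNP column
    scale = 2.0 * np.sum(p * (1.0 - p))      # expected total SNP variance
    return z @ z.T / scale

# Hypothetical toy data: 4 individuals, 3 SNPs.
x = np.array([
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [1, 2, 1],
])

g = genomic_relationship_matrix(x)           # 4 x 4 relationship matrix
```

The matrix has one row and column per animal, so its size depends on the number of bulls rather than on the number of SNPs; the SNP count affects the cost of building it.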
GBLUP
• 777K SNP: HPC – 1 node, ~ 3 hours, ~ 32 GB RAM
• 12M SNP: HPC – 12 nodes, ~ 6 hours, ~ 600 GB RAM
BSSVS (per MCMC chain; 3 chains of 60,000 cycles, 10,000 cycles burn-in)
• 777K SNP: Windows – 1 CPU, ~ 5 days, ~ 1.6 GB RAM
• 12M SNP: HPC – 1 node, ~ 50 days, ~ 32 GB RAM
Windows 7 Enterprise desktop PC: 32 CPU – 8 GB RAM/CPU (clock speed 2.60 GHz)
HPC Linux cluster: normal nodes – 64 GB/node (2.60 GHz); 2 fat nodes – 1 TB RAM/node (2.20 GHz)
[Figure: results per trait (SCS, IFL, PY), vertical axis from 0.0 to 0.6; series: BovineHD GBLUP, BovineHD BSSVS, Sequence GBLUP, Sequence BSSVS*]
BSSVS: average over 3 chains of 60,000 cycles (10,000 cycles burn-in)
* Based on 45,000 cycles
[Figure: trace of variance of SNP effects and Bayes factor for SNP effects, for 777K SNP and 12M SNP]
3 chains of 60,000 cycles (10,000 cycles burn-in); sequence: 45,000 cycles
Large number of SNPs with tiny effects and a few SNPs with moderate effects
● Sequence data: really large number of SNPs with tiny effects
● Captures too much signal?
Another Bayesian Prediction Model: Bayes-C
● Large number of SNPs with NO effect and a few SNPs with moderate effects
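The difference between the two prior assumptions can be illustrated by drawing SNP effects from each. This is a sketch of the priors only, not of the MCMC samplers; it assumes NumPy, and the function names, the proportion `prop_large` and the variances are hypothetical values chosen for illustration.

```python
import numpy as np

def draw_bssvs_effects(n_snps, prop_large, var_small, var_large, rng):
    """BSSVS-type prior: mixture of two normals (tiny vs. moderate variance)."""
    large = rng.random(n_snps) < prop_large
    sd = np.where(large, np.sqrt(var_large), np.sqrt(var_small))
    return rng.normal(0.0, 1.0, n_snps) * sd, large

def draw_bayesc_effects(n_snps, prop_large, var_large, rng):
    """Bayes-C-type prior: exactly zero for most SNPs, normal for a few."""
    nonzero = rng.random(n_snps) < prop_large
    draws = rng.normal(0.0, np.sqrt(var_large), n_snps)
    return np.where(nonzero, draws, 0.0), nonzero

rng = np.random.default_rng(0)
n_snps = 10_000  # hypothetical number of SNPs

bssvs_effects, bssvs_large = draw_bssvs_effects(n_snps, 0.01, 1e-6, 1e-2, rng)
bayesc_effects, bayesc_nonzero = draw_bayesc_effects(n_snps, 0.01, 1e-2, rng)

# Under the mixture of two normals every SNP keeps a (tiny) non-zero effect;
# under the point-mass prior most SNPs contribute nothing at all.
n_zero_bssvs = int(np.sum(bssvs_effects == 0.0))
n_zero_bayesc = int(np.sum(bayesc_effects == 0.0))
```

With millions of sequence SNPs, the many tiny non-zero effects in the first prior can add up, which is one way to read the "captures too much signal?" question.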
Concentrate on single chromosome (BTA 6)
[Figure: MCMC convergence; BSSVS and Bayes-C, for 777K SNP and 12M SNP]
[Figure: signal of QTL effects; BSSVS and Bayes-C, for 777K SNP and 12M SNP]
Reliability estimates
            BSSVS   Bayes-C
BovineHD    0.328   0.328
Sequence    0.324   0.325
Genomic prediction using sequence data becomes reality
● However, sequence data require intensive computation: need for faster algorithms
Use of sequence data did not improve prediction reliability
● Convergence issues with BSSVS: longer chains may yield better results
BSSVS slightly better compared to GBLUP
Preliminary results for BTA 6 hint that the Bayes-C method may work better (than BSSVS) for sequence data
Next Steps: Did we bet on the wrong horse - named BSSVS?
Review choice of priors in BSSVS model.
Apply Bayes-C model to whole genome sequence data
1000 bull genomes project
(www.1000bullgenomes.com)
De-regressed proofs (DRP)
Effective daughter contribution (EDC)
$DRP = PA + (EBV - PA) \cdot \frac{EDC_{EBV}}{EDC_{prog}}$
where PA = parent average, EBV = estimated breeding value, EDC = effective daughter contribution
$EDC_{EBV} = \alpha \cdot REL_{EBV} / (1 - REL_{EBV})$
where $\alpha = (4 - h^2)/h^2$ and $REL_{EBV}$ = published reliability of EBV
$EDC_{prog} = EDC_{EBV} - EDC_{PA}$
where $EDC_{PA}$ is based on the reliability of the parents: $(REL_{sire} + REL_{dam})/4$
VanRaden et al. 2009 (J Dairy Sci)
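The de-regression formulas above can be worked through numerically. This is a minimal sketch in plain Python; the function names and input values are hypothetical. One step is an assumption not spelled out on the slide: EDC_PA is obtained by applying the same reliability-to-EDC conversion to the parent-average reliability (REL_sire + REL_dam)/4.

```python
def edc_from_reliability(rel: float, h2: float) -> float:
    """EDC = alpha * REL / (1 - REL), with alpha = (4 - h2) / h2."""
    alpha = (4.0 - h2) / h2
    return alpha * rel / (1.0 - rel)

def deregressed_proof(ebv, pa, rel_ebv, rel_sire, rel_dam, h2):
    """Return (DRP, EDC_prog) for one bull."""
    rel_pa = (rel_sire + rel_dam) / 4.0          # reliability of parent average
    edc_ebv = edc_from_reliability(rel_ebv, h2)
    # Assumption: same conversion applied to the parent-average reliability.
    edc_pa = edc_from_reliability(rel_pa, h2)
    edc_prog = edc_ebv - edc_pa                  # contribution from progeny only
    drp = pa + (ebv - pa) * edc_ebv / edc_prog
    return drp, edc_prog

# Hypothetical example: h2 = 0.25 gives alpha = 15.
drp, edc_prog = deregressed_proof(
    ebv=100.0, pa=50.0, rel_ebv=0.9, rel_sire=0.99, rel_dam=0.61, h2=0.25
)
# EDC_EBV = 135, EDC_PA = 10, EDC_prog = 125, DRP = 50 + 50 * 135 / 125 = 104
```

Because EDC_EBV exceeds EDC_prog, the deviation of the EBV from the parent average is scaled up, which undoes the regression toward the parent average.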