Day2 Imputation

From sequence data to
genomic prediction
Course overview
• Day 1
– Introduction
– Generation, quality control, alignment of sequence data
– Detection of variants, quality control and filtering
• Day 2
– Imputation from SNP array genotypes to sequence data
• Day 3
– Genome wide association studies with SNP array and
sequence variant genotypes
• Day 4 & 5
– Genomic prediction with SNP array and sequence variant
genotypes (BLUP and Bayesian methods)
– Use of genomic selection in breeding programs
Imputation
• Why impute?
• Approaches for imputation
• Factors affecting accuracy of imputation
• Does imputation give you more power?
• Imputation to whole genome sequence
variant genotypes
Why impute?
• Fill in missing genotypes from the lab
• Merge data sets with genotypes on different
arrays
– Eg. Affy and Illumina data
• Impute from low density to high density
– 7K-> 50K (save $$$)
– 50K->800K
– capture power of higher density?
– Better persistence of accuracy
• Sequencing is expensive; can we impute to full
sequence data?
Core concept
•Identity by state (IBS)
– A pair of individuals have the same allele
at a locus
•Identity by descent (IBD)
– A pair of individuals have the same
alleles at a locus and it traces to a
common ancestor
• Imputation methods determine
whether a chromosome segment is
IBD
Causes of LD
• A chunk of ancestral chromosome is
conserved in the current population
– Example marker haplotype: 1 1 1 2
Core concept 2
• Any individuals in a population may
share a proportion of their genome
identical by descent (IBD)
– IBD segments are the same and have
originated in a common ancestor
• The closer the relationship the longer
the IBD segments
– Pedigree relationships
Several methods for imputation
•Two main categories:
– Family based
– Population based
– Or combination of the two
– Some of the most effective are Beagle
(Browning and Browning, 2009), MACH
(Li et al., 2010), Impute2 (Howie et al.,
2009), AlphaPhase (Hickey et al 2011)
Finding an IBD segment
(genotypes at 14 markers, ? = missing genotype)
• Panel 1
– Sire:    0 2 2 0 2 0 2 0 0 2 0 2 2 2
– Progeny: 0 2 ? 0 2 ? 2 2 ? ? 0 0 2 0
• Panel 2 (IBD segment marked)
– Sire:    0 2 2 0 2 0 2 2 0 2 0 2 2 2
– Progeny: 0 2 2 ? 0 2 ? 2 ? ? 0 0 2 0
• Panel 3
– Sire:    0 2 2 0 2 0 2 2 0 2 0 2 2 2
– Progeny: 0 2 2 0 2 0 2 2 ? ? 0 0 2 0
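The family-based idea above can be sketched in a few lines. This is only an illustration under stated assumptions: genotypes are coded 0/2 with None for "?", the genotype values are hypothetical (not the ones on the slide), and the function names and the simple "longest run with no mismatch" rule are made up for this sketch. Real programs work on phased haplotypes and use pedigree information.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0 or 2 for homozygotes, None for a missing genotype ("?")


def longest_consistent_segment(sire: List[int], progeny: List[Genotype]) -> Tuple[int, int]:
    """Return (start, end), inclusive, of the longest run of markers in which
    every observed progeny genotype equals the sire genotype.
    The run always starts and ends on an observed, matching marker."""
    best = (0, -1)   # empty segment until a match is found
    start = None     # first observed match of the current run
    for i, (s, p) in enumerate(zip(sire, progeny)):
        if p is None:
            continue             # missing: neither starts nor breaks a run
        if p != s:
            start = None         # a mismatch breaks the run
            continue
        if start is None:
            start = i
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


def fill_from_sire(sire: List[int], progeny: List[Genotype]) -> List[Genotype]:
    """Fill missing progeny genotypes inside the candidate IBD segment only."""
    start, end = longest_consistent_segment(sire, progeny)
    filled = list(progeny)
    for i in range(start, end + 1):
        if filled[i] is None:
            filled[i] = sire[i]
    return filled


# Hypothetical example genotypes
sire = [0, 2, 2, 0, 2, 0, 2, 2, 0, 2]
progeny = [0, 2, None, 0, 2, None, 2, 0, None, 2]
segment = longest_consistent_segment(sire, progeny)
filled = fill_from_sire(sire, progeny)
print(segment, filled)
```

Missing genotypes outside the segment are deliberately left missing, since nothing in this rule supports copying the sire's genotype there.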
Several methods for imputation
•Two main categories:
– Family based
– Population based (exploits LD)
– Or combination of the two
– Some of the most effective are Beagle
(Browning and Browning, 2009), MACH
(Li et al., 2010), Impute2 (Howie et al.,
2009), AlphaPhase (Hickey et al 2011)
Population based imputation
• Hidden Markov Models
– Has “hidden states”
– For target individuals these are a “map” of
which reference haplotypes have been
inherited at each position
– Imputation problem is to derive
genotype probabilities given hidden
states, sparse genotypes, recombination
rates, other population parameters
Population based imputation
[Figure: haplotypes in the reference population and genotypes in the target population. Source: Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010 11:499-511.]
Population based imputation
• Consider three markers, 4 reference
haplotypes
• 0 1 1
• 0 1 0
• 1 0 1
• 0 0 1
• Imputation?
Li and Stephens
Beagle
Imputation accuracy
• Accuracy = correlation of real and
imputed genotypes
• Concordance = percentage (%) of
genotypes called correctly
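The two measures can be computed as follows. This is a sketch assuming genotypes are coded as allele counts 0/1/2 (an assumption, the slides do not fix a coding), with made-up example values; imputed dosages are rounded to the nearest genotype before concordance is counted.

```python
import numpy as np


def imputation_accuracy(true_geno, imputed_geno):
    """Return (correlation, concordance in %) between real and imputed genotypes."""
    true_geno = np.asarray(true_geno, dtype=float)
    imputed_geno = np.asarray(imputed_geno, dtype=float)
    # Accuracy: Pearson correlation of real and imputed genotypes
    r = np.corrcoef(true_geno, imputed_geno)[0, 1]
    # Concordance: percentage of genotypes called correctly
    concordance = 100.0 * np.mean(true_geno == np.rint(imputed_geno))
    return r, concordance


# Hypothetical genotypes for one SNP in eight animals (one imputation error)
true_geno = [0, 1, 2, 2, 0, 1, 0, 2]
imputed_geno = [0, 1, 2, 1, 0, 1, 0, 2]
r, concordance = imputation_accuracy(true_geno, imputed_geno)
print(r, concordance)
```

For a rare allele, concordance can be high simply because most animals are homozygous for the common allele, which is one reason to report the correlation as well.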
Imputation accuracy
• Depends on
– Size of reference set
• bigger the better!
– Density of markers
• extent of LD, effective population size
– Frequency of SNP alleles
– Genetic relationship to reference
Table 6. Accuracy of imputation from BovineLD genotypes to BovineSNP50 genotypes for Australian,
French, and North American breeds.
Boichard D, Chung H, Dassonneville R, David X, et al. (2012) Design of a Bovine Low-Density SNP Array Optimized for Imputation. PLoS
ONE 7(3): e34130. doi:10.1371/journal.pone.0034130
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0034130
Imputation accuracy
• Density of markers (extent of LD)
– In Holstein Dairy cattle
• 3K -> 50K accuracy 0.93
• 7K -> 50K accuracy 0.98
Illumina Bovine HD array
• We genotyped
— 898 Holstein heifers
— 47 Holstein Key ancestor bulls
• After (stringent) QC 634,307 SNPs
Imputation 50K -> 800K
• Holsteins
% Correct
Cross validation    Heifers only    Heifers using key ancestors
1                   96.7%           97.8%
2                   96.7%           97.7%
Average             96.7%           97.7%
Imputation accuracy
• Rare alleles?
Imputation accuracy
• Relationship to reference?
Imputation accuracy
• Effect of map errors?
Why more power with imputation
• High accuracies of imputation
demonstrate that we can accurately infer
the haplotypes of animals genotyped with
e.g. 3K
• But potentially a large number of
haplotypes
• With imputed data we can test a single
SNP, using only 1 degree of freedom
rather than a number that grows with the
number of haplotypes
Why more power with imputation
• Weigel et al. (2010)
Imputation
• Why impute?
• Approaches for imputation
• Factors affecting accuracy of imputation
• Does imputation give you more power?
• Imputation to whole genome sequence
variant genotypes
Which individuals to sequence?
• Those which capture greatest
genetic diversity?
• Select set of individuals which are
likely to capture highest proportion
of unique chromosome segments
Which individuals to sequence?
• Let total number of individuals in
population be n, number of individuals
that can be sequenced be m.
• A = average relationship matrix among n
individuals, from pedigree
• An example A matrix……..
Pedigree
Animal    Sire    Dam
1         0       0
2         0       0
3         0       0
4         1       2
5         1       2
6         1       3

A matrix (upper triangle shown)
            Animal 1   Animal 2   Animal 3   Animal 4   Animal 5   Animal 6
Animal 1    1          0          0          0.5        0.5        0.5
Animal 2               1          0          0.5        0.5        0
Animal 3                          1          0          0          0.5
Animal 4                                     1          0.5        0.25
Animal 5                                                1          0.25
Animal 6                                                           1

Animal 6 is a half sib of 4 and 5
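The example A matrix can be reproduced from the pedigree with the standard tabular method. A short sketch, assuming animals are numbered 1..n with parents listed before their offspring and 0 meaning an unknown parent; the function name is illustrative.

```python
import numpy as np


def relationship_matrix(sires, dams):
    """Additive relationship matrix A by the tabular method.
    sires[i-1] and dams[i-1] are the parents of animal i (0 = unknown)."""
    n = len(sires)
    A = np.zeros((n + 1, n + 1))   # row/column 0 is a dummy for unknown parents
    for i in range(1, n + 1):
        s, d = sires[i - 1], dams[i - 1]
        # Diagonal: 1 + inbreeding, which is half the relationship between the parents
        A[i, i] = 1.0 + 0.5 * A[s, d]
        for j in range(1, i):
            # Relationship with an older animal j: average of j's relationships with i's parents
            A[i, j] = A[j, i] = 0.5 * (A[j, s] + A[j, d])
    return A[1:, 1:]


# Pedigree from the slide
sires = [0, 0, 0, 1, 1, 1]
dams = [0, 0, 0, 2, 2, 3]
A = relationship_matrix(sires, dams)
print(A)
```

Animals 4 and 5 share both parents and get a relationship of 0.5, while animal 6 shares only the sire with them and gets 0.25.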
Which individuals to sequence?
• Let total number of individuals in
population be n, number of individuals
that can be sequenced be m.
• A = average relationship matrix among n
individuals, from pedigree
• c is a vector of size n, which for each
animal holds its average relationship to the
population (e.g. for individual i, take the
mean of the elements in column i of A)
Which individuals to sequence?
• If we choose a group of m animals for
sequencing, how much of the diversity do
they capture?
• $p_m = A_m^{-1} c_m$
– Where $A_m$ is the sub matrix of A for the m
individuals, and $c_m$ is the elements of the c
vector for the m individuals
• Proportion of diversity = $p_m' 1_n$
Which individuals to sequence?
• Example
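A small numerical sketch of this calculation, using the six-animal A matrix from the earlier pedigree slide. The function name is illustrative, and the proportion is computed as the sum of the elements of p_m.

```python
import numpy as np

# A matrix for the six-animal example pedigree
A = np.array([
    [1.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.5],
    [0.5, 0.5, 0.0, 1.0, 0.5, 0.25],
    [0.5, 0.5, 0.0, 0.5, 1.0, 0.25],
    [0.5, 0.0, 0.5, 0.25, 0.25, 1.0],
])


def diversity_captured(A, chosen):
    """Proportion of diversity captured by the animals in `chosen` (0-based indices)."""
    c = A.mean(axis=0)                        # average relationship to the population
    A_m = A[np.ix_(chosen, chosen)]           # sub matrix for the chosen animals
    p_m = np.linalg.solve(A_m, c[chosen])     # p_m = A_m^{-1} c_m
    return p_m.sum()


founders_1_2 = diversity_captured(A, [0, 1])     # sequence animals 1 and 2
everyone = diversity_captured(A, list(range(6)))
print(founders_1_2, everyone)
```

Sequencing the two founders 1 and 2 captures 0.75 of the diversity in this small example, and sequencing all six animals captures all of it.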
Which individuals to sequence?
• Then choose the set of m individuals to
sequence which maximises
$p_m' 1_n$
• Step wise regression
– Find the single individual with largest $p_i$, set
$c_i$ to zero, then the next largest $p_i$, set $c_i$ to
zero, ...
• Genetic algorithm
• No A? Use G
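One way to read the step-wise idea is as a greedy forward selection: add, one at a time, the animal that most increases the proportion of diversity captured. The sketch below takes that reading, which is an assumption and not necessarily the exact procedure meant on the slide. It uses the six-animal example A matrix, and the function name is illustrative. A genomic relationship matrix G could be passed in place of A.

```python
import numpy as np

# A matrix for the six-animal example pedigree
A = np.array([
    [1.0, 0.0, 0.0, 0.5, 0.5, 0.5],
    [0.0, 1.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.5],
    [0.5, 0.5, 0.0, 1.0, 0.5, 0.25],
    [0.5, 0.5, 0.0, 0.5, 1.0, 0.25],
    [0.5, 0.0, 0.5, 0.25, 0.25, 1.0],
])


def greedy_select(A, m):
    """Greedily pick m animals (0-based indices) to maximise the sum of p_m."""
    n = A.shape[0]
    c = A.mean(axis=0)
    chosen, captured = [], 0.0
    for _ in range(m):
        best, best_val = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            val = np.linalg.solve(A[np.ix_(idx, idx)], c[idx]).sum()
            if val > best_val + 1e-12:    # ties keep the first animal found
                best, best_val = i, val
        chosen.append(best)
        captured = best_val
    return chosen, captured


chosen, captured = greedy_select(A, 2)
print([i + 1 for i in chosen], captured)
```

On this pedigree the greedy search picks animals 4 and 3 (about 0.708), while founders 1 and 2 together would capture 0.75. A greedy search is not guaranteed to find the best set, which is one motivation for a search method such as a genetic algorithm.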
Which individuals to sequence?
• Poll Dorset sheep
[Figure: proportion of genetic diversity captured (y-axis, 0.00 to 0.45) against number of rams sequenced, ranked from most influential (x-axis, 0 to 80).]
Imputation of full sequence data
• Two groups of individuals
– Sequenced individuals: reference
population
– Individuals genotyped on SNP array:
target individuals
Imputation of full sequence data
• Steps:
– Step 1. Find polymorphisms in sequence
data
– Step 2. Genotype all sequenced animals
for polymorphisms (SNP, Indels)
– Step 3. Phase genotypes (eg Beagle) in
sequenced individuals, create reference
file
– Step 4. Impute all polymorphisms into
individuals genotyped with SNP array
Imputation of full sequence data
• Variant calling
– Create BAM files
• 1. Filter reads on quality score, trim ends
• 2. Remove PCR duplicates
• 3. Align with BWA
– BAM -> SamTools mPileup -> Vcf file -> filter
(number of forward/reverse reads of each allele,
read depth, quality, number of variants in 5bp window)
• Genotype probabilities
– Input genotype probs from Phred scores
– Beagle phasing in reference
– QC with 800K
– Reference file for imputation
• Beagle imputation in target
– SNP array data in target population
• Analysis
– Genome wide association
– Genomic selection
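The step "input genotype probs from Phred scores" can be illustrated with the usual Phred conversion. This sketch assumes the scores are Phred-scaled genotype likelihoods for the three genotypes of a biallelic variant (as in the PL field of a VCF file) and that all genotypes are equally likely a priori. Both are assumptions; the slides do not say which field or prior was used.

```python
def phred_to_probs(phred_likelihoods):
    """Convert Phred-scaled genotype likelihoods (AA, AB, BB) to probabilities.
    A Phred score Q corresponds to a relative likelihood of 10^(-Q/10)."""
    likelihoods = [10.0 ** (-q / 10.0) for q in phred_likelihoods]
    total = sum(likelihoods)
    return [x / total for x in likelihoods]


# Hypothetical scores: a confident homozygote and an uncertain heterozygote
confident = phred_to_probs([0, 30, 200])
uncertain = phred_to_probs([10, 0, 10])
print(confident, uncertain)
```

Passing probabilities rather than hard genotype calls lets the phasing step down-weight genotypes supported by few reads.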
Imputation of full sequence data
• How accurate?
1000 bull genomes Run 4.0
• 1147 animals sequenced
• 27 breeds
• 20 Partners
• Average coverage 11X

Breed/Cross                   Number
Holstein (Black and White)    288
Simmental (Dual and Beef)     216
Angus (Black and Red)         138
Jersey                        61
Brown Swiss                   59
Gelbvieh                      34
Charolais                     33
Hereford                      31
Limousin                      31
Guelph Composite              30
Beef Booster                  29
Alberta Composite             28
Montbeliarde                  28
Ayrshire Finnish              25
Normande                      24
Holstein (Red and White)      23
Swedish Red                   16
Danish Red                    15
Other Crosses                 11
Belgian Blue                  10
Piedmontese                   5
Eringer                       2
Galloway                      2
Unknown                       2
Scottish Highland             2
Pezzata Rossa Italiana        1
Romagnola                     1
Salers                        1
Tyrolean Grey                 1
Total                         1147
1000 bull genomes Run 4.0
• 36.9 million filtered variants
• 35.2 million SNP
• 1.7 million INDEL
Imputation of full sequence data
– Accuracy?
• Chromosome 14
• Remove 50 Holsteins, 20 Jerseys from data set
• Reduce genotypes to 800K for these animals
• Impute full sequence using rest of animals as
reference
Imputation of full sequence data
– Why so difficult to impute rare mutations?
– Examples: Complex Vertebral Malformation
(CVM) and Bovine Leukocyte Adhesion Deficiency
(BLAD)
• All cases of CVM trace back to Ivanhoe Bell
• BLAD traces to Osborndale Ivanhoe
Imputation of full sequence data
– Why so difficult to impute rare mutations?
                                    BLAD              CVM
Location                            Chr1:145114963    Chr3:43412427
Frequency                           0.0014            0.0103
Bulls genotyped                     5987              5987
Imputed correctly                   5970              5836
Accuracy                            0.9972            0.9748
# Carriers                          17                123
# Carriers correctly imputed        13                5
Prop. Carriers correctly imputed    0.765             0.041
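The ratios in this table follow directly from the counts, and recomputing them shows why overall accuracy says little for a rare mutation. In the sketch below the allele frequency is computed assuming every carrier is heterozygous, which is an assumption but reproduces the frequencies in the table.

```python
def summarise(bulls, imputed_correctly, carriers, carriers_correct):
    """Recompute the ratios in the table from the raw counts."""
    return {
        "accuracy": imputed_correctly / bulls,
        "prop_carriers_correct": carriers_correct / carriers,
        # Assumes carriers are heterozygous: one mutant allele per carrier
        "allele_frequency": carriers / (2 * bulls),
    }


blad = summarise(5987, 5970, 17, 13)
cvm = summarise(5987, 5836, 123, 5)
print(blad)
print(cvm)
```

For CVM, about 97% of bulls are imputed correctly simply because nearly all bulls are non-carriers, yet only 5 of 123 carriers are identified.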
Imputation of full sequence data
– Why so difficult to impute rare mutations?
– The BLAD mutation is in a unique 250kb
haplotype, which does not occur in any non-carriers
– The CVM mutation is in a 250kb haplotype
which occurs in many non-carriers, and also
occurs in breeds without the mutation
– Hypothesis: the BLAD mutation occurred on a
rare haplotype, while CVM is a recent
mutation that occurred on a common
haplotype background
Imputation of full sequence data
– Computationally efficient strategies
– Beagle: run imputation in chromosome
segments, say 5 Mb with 0.5 Mb overlap (to
avoid edge effects)
– FImpute: much faster than Beagle, used
to impute 32,500 animals from 800K to 16
million SNP!
• Does not give probabilities
– Beagle phasing + Minimac
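Splitting a chromosome into overlapping segments is simple to script. The sketch below assumes "5 Mb with 0.5 Mb overlap" means that consecutive 5 Mb windows share 0.5 Mb; the function name and the 1-based inclusive coordinates are choices made for this illustration. It only produces the window coordinates and does not call any imputation program.

```python
def chromosome_windows(length_bp, window_bp=5_000_000, overlap_bp=500_000):
    """Return 1-based inclusive (start, end) windows covering a chromosome,
    with consecutive windows sharing overlap_bp base pairs."""
    if overlap_bp >= window_bp:
        raise ValueError("overlap must be smaller than the window")
    windows = []
    start = 1
    step = window_bp - overlap_bp
    while True:
        end = min(start + window_bp - 1, length_bp)
        windows.append((start, end))
        if end == length_bp:
            break
        start += step
    return windows


# Hypothetical 12 Mb chromosome
windows = chromosome_windows(12_000_000)
print(windows)
```

Each window can then be imputed as a separate job, keeping the genotypes for each overlapping region from only one of the two windows that cover it.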
Conclusion
• Impute
– to fill in missing genotypes
– low density to high density to save $$
• Accuracy depends on size of reference,
effective population size, relationship to
reference, marker density
• Imputation to sequence possible,
relatively low accuracies for rare alleles
• Use genotype probabilities from
imputation in GWAS and genomic
prediction