Imputation 2
Presenter: Ka-Kit Lam
Outline
• Big Picture and Motivation
• IMPUTE
• IMPUTE2
• Experiments
• Conclusion and Discussion
• Supplementary:
– GWAS
– Estimate on mutation rate
Big Picture and Motivation
Background
• Genome-wide association study:
– Identify common genetic factors that influence
health/disease
Background
• Important to know the SNPs
• However:
– Not all SNPs are genotyped for all individuals in the case-control study in GWAS.
Individual 1: ACCCAATTACCAGTATTTA…
Individual 2: CCCCATTTACCACTATTTA…
Individual 3: ACCCATTTACCACTATTTA…
Individual 4: CCCCATTTACCAGTATTTA…
(In the original figure, "?" marks flag the untyped positions.)
• How can we guess the missing parts?
Information known
• Luckily, we now have references for human
DNA:
• But, how can we use the reference genomes?
Main Question
• Objective:
– Design algorithms
• to impute the missing genotypes of the individuals
being studied
– Criteria for algorithms
• Scalable
• Accurate
Big Picture on Algorithm Design
Input: SNPs in study, reference haplotype/genotype
→ Algorithms →
Output: Imputed genotype, associated confidence
In theory, it makes sense
1. Scalability
2. Accuracy
In practice, it works
1. Experimental validation
2. Application
IMPUTE
Notations and Setting
Reference Haplotypes (N rows, L columns):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Genotype in the study sample (K rows, L columns):
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
(Rmk: 0-00, 1-01, 2-11)
Formulation
• Observed genotype and missing genotype
• Classical inference problem:
– A reasonable estimate:
– Confidence:
Modeling (HMM model):
Relationship btw (H,G)
• Assumptions:
– Study individuals are independent
– Each haplotype of a study individual is a mosaic of the reference haplotypes; this copying process is captured by a Hidden Markov Model
– Mutations at different sites are conditionally independent given the copied haplotype
Modeling (HMM model):
Relationship btw (H,G)
Reference Haplotypes (N rows, L columns):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Study Individual:
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?   (typed SNPs only)
0 2 2 2 0 0 2 2 0 0 0 1 0 2 1   (all SNPs)
Modeling (HMM model):
Relationship btw (H,G)
[Figure: the N × L reference haplotype matrix with the study individual's genotype row 0 2 2 2 0 0 2 2 0 0 0 1 0 2 1 beneath it]
Modeling (Transition Probability)
• States
• Transition
• What is the intuition?
Modeling :relationship btw transition
Probability and Recombination
• Recombination Process:
Modeling :relationship btw transition
Probability and Recombination
• Recombination Process:
– The more reference haplotypes, the longer the copy length
[Figure: a study individual compared against Ref panel 1 and Ref panel 2, with the annotation "more likely to have longer copy length here" on one of the panels]
– Copy length in our model depends on genetic distance btw SNPs
Modeling (Transition Probability)
• States
• Transition
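The transition formula itself did not survive on this slide. As a rough sketch only, assuming the Li and Stephens style copying model listed in the references: between adjacent SNPs each copied haplotype either stays put or jumps uniformly to any of the N references. The parameter `rho` (a scaled recombination distance between the two SNPs) and both function names are assumptions for illustration, not taken from the slides.

```python
import math

def haploid_transition(i, j, rho, n_ref):
    """P(copy reference j at the next SNP | copying reference i now).

    Assumed form: with probability exp(-rho / n_ref) nothing happens,
    otherwise the copy restarts uniformly on one of the n_ref references.
    """
    jump = 1.0 - math.exp(-rho / n_ref)
    prob = jump / n_ref              # land on j after a restart
    if i == j:
        prob += 1.0 - jump           # no restart: keep copying the same one
    return prob

def diploid_transition(state_from, state_to, rho, n_ref):
    """Diploid states are pairs of references; the two copies move independently."""
    (i1, i2), (j1, j2) = state_from, state_to
    return (haploid_transition(i1, j1, rho, n_ref)
            * haploid_transition(i2, j2, rho, n_ref))
```

Under this assumed form, a larger genetic distance (larger `rho`) makes a switch more likely, which matches the intuition that copy length depends on genetic distance btw SNPs.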
Modeling (Emission Probability)
• Emission probability
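– Define mutation rate: λ
– Since mutation is assumed independent across sites, the emission probability at each site depends only on the copied pair and the genotype at that site:

Copied pair | Genotype 0 (00) | Genotype 1 (01)  | Genotype 2 (11)
00          | (1-λ)²          | 2λ(1-λ)          | λ²
01          | λ(1-λ)          | λ² + (1-λ)²      | λ(1-λ)
11          | λ²              | 2λ(1-λ)          | (1-λ)²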
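A minimal sketch of the emission table above in code, to make the per-site independence concrete. The function names and the use of `None` for an untyped site are assumptions made for this example.

```python
def emission_table(lam):
    """Rows: number of 1-alleles in the copied pair (00, 01, 11).
    Columns: observed genotype 0, 1, 2."""
    keep, flip = 1.0 - lam, lam
    return [
        [keep * keep, 2 * flip * keep, flip * flip],
        [flip * keep, flip * flip + keep * keep, flip * keep],
        [flip * flip, 2 * flip * keep, keep * keep],
    ]

def emission_prob(copied_counts, observed, lam):
    """Product over sites, skipping untyped sites (None)."""
    table = emission_table(lam)
    prob = 1.0
    for copied, genotype in zip(copied_counts, observed):
        if genotype is not None:
            prob *= table[copied][genotype]
    return prob
```

Each row of the table sums to 1, as a conditional distribution over the observed genotype should.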
Extension (completely missing)
• Problem:
– Missing genotype across all references and study
samples. How to impute?
• What can we expect?
– Generate information from no information?
– We cannot expect to know the genotype
– But we can guess the relationship btw them
0 0 1 0 1 1 0 ? 1 1 0 0 0 0 0
0 0 1 0 1 1 0 ? 1 1 0 0 0 0 1
0 0 1 0 1 1 0 ? 1 1 0 0 0 1 1
– Our friend, population genetics, may help!
Imputation on Reference
• Illustration
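H(1): 1 1 1 0 0 1 ?(0) 0 0 0 1 0 1 0
H(2): 1 1 1 0 1 0 ?(0) 1 1 0 0 0 1 0
H(3): 1 1 1 0 0 0 ?(1) 0 0 0 1 1 1 1
H(4): 1 1 1 1 0 0 ?(0) 0 0 0 0 0 0 0
H(N): 1 1 1 0 1 1 ?(1) 0 0 1 1 1 0 0
("?" is the completely missing SNP; the value in parentheses is the allele filled in on the slide.)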
Imputation on Reference
Algorithm:
1. Randomly select an ordering
2. Sample the first mutation according to
3. Treat the previous haplotypes as references and impute the next one
4. Repeat several times to get a stable output
5. Use the imputed reference to impute the study sample
Computational Complexity:
Imputation
O(N²L) for each individual
Computational Complexity:
Forward-Backward Algorithm
• Forward Equations:
• Naïve application takes O(N⁴)
Computational Complexity:
Forward-Backward Algorithm
• Q: How to compute the following in O(N²)?
• A: (suggested in fastPhase)
Computational Complexity:
Forward-Backward Algorithm
• Finally, we have a forward update that costs O(N²) overall. Term by term:
– O(N²) totally
– O(N) for each j, O(N²) totally
– O(N) for each i, O(N²) totally
• Similarly for the backward part
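The equations on these slides were lost, so the following is only a sketch of how such a speed-up can work, assuming each haplotype of the pair moves independently with transition "stay + jump" to the same reference and "jump" to any other. The names `stay`, `jump`, `forward_step_naive` and `forward_step_fast` are assumptions for illustration. Row and column sums are each computed once, which is where the "O(N) for each i" and "O(N) for each j" costs come from.

```python
def forward_step_naive(alpha, stay, jump, emit):
    """Reference O(N^4) update: sum over every previous pair for every new pair."""
    n = len(alpha)

    def t(a, b):
        return jump + (stay if a == b else 0.0)

    return [[emit[j1][j2] * sum(alpha[i1][i2] * t(i1, j1) * t(i2, j2)
                                for i1 in range(n) for i2 in range(n))
             for j2 in range(n)] for j1 in range(n)]

def forward_step_fast(alpha, stay, jump, emit):
    """Same update in O(N^2) using precomputed row, column and grand totals."""
    n = len(alpha)
    row = [sum(alpha[i]) for i in range(n)]                        # O(N) for each i
    col = [sum(alpha[i][j] for i in range(n)) for j in range(n)]   # O(N) for each j
    total = sum(row)
    return [[emit[i][j] * (stay * stay * alpha[i][j]
                           + stay * jump * (row[i] + col[j])
                           + jump * jump * total)
             for j in range(n)] for i in range(n)]
```

Expanding the product (stay·δ + jump)(stay·δ + jump) gives the four terms in the fast version: both copies stay, one of the two jumps, or both jump.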
Demo
./impute \
-h example/haplo.txt \
-l example/legend.txt \
-g example/geno.txt \
-m example/map.txt \
-s example/strand.txt \
-Ne 11400 \
-int 62000000 63000000
IMPUTE2
Motivation
• Accuracy:
– Not all information is used during imputation (e.g. other study individuals)
• Complexity:
– Need to scale well if we incorporate all information (e.g. previously it is O(LN²))
• New data type:
– Diploid reference (1000 Genomes Project)
• Q: How to design algorithms to handle this?
Description of Setting(Scenario A)
Reference Haplotypes (Nhap rows, L columns):
0 1 1 1 0 0 1 1 0 0 0 1 0 1 0
0 1 1 1 0 1 0 1 1 1 0 0 0 1 0
1 1 1 1 0 0 0 0 0 0 0 1 1 1 1
Genotype in the inference panel (Ninf rows, L columns):
0 ? ? ? ? ? 2 ? ? ? ? ? 0 ? ?
1 ? ? ? ? ? 1 ? ? ? ? ? ? ? ?
2 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
1 ? ? ? ? ? 0 ? ? ? ? ? 1 ? ?
Legend: T (SNPs typed in both panels), U (SNPs typed only in the reference) (Rmk: sets of index of SNPs)
(Rmk: 0-00, 1-01, 2-11)
Description of Setting(Scenario B)
[Figure: three stacked panels over L SNP columns. Reference haplotypes (Nhap rows); diploid reference panel (Ndip rows, genotypes with "?" at untyped SNPs); inference panel (Ninf rows, genotypes with "?" at untyped SNPs). Legend: T, U1, U2 (Rmk: sets of index of SNPs)]
(Rmk: 0-00, 1-01, 2-11)
Algorithm for Scenario A
• Illustration:
[Figure: reference haplotypes (top) and inference-panel genotypes (bottom, "?" at untyped SNPs), as in the Scenario A setting]
Algorithm for Scenario A
• Illustration (Burn in)
[Figure: reference haplotypes (top); each inference-panel genotype is replaced by an initial pair of haplotypes at the typed SNPs, with "?" left at untyped SNPs]
Algorithm for Scenario A
• Illustration (Phasing)
[Figure: "Update i": the haplotype pair of individual i is re-sampled at the typed SNPs, consistent with its genotypes (1), (0), (1); reference haplotypes and the other individuals' current haplotypes are shown above]
Algorithm for Scenario A
• Illustration (Imputing)
[Figure: "Update i": the "?" entries at untyped SNPs in the two haplotypes of individual i are filled in; its typed genotypes (1), (0), (1) are shown alongside]
Phasing Step: Path Sampling
• How to sample path?
…
…
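The content of this slide was lost. One standard way to draw a hidden path from an HMM posterior is forward filtering followed by backward sampling; the sketch below shows that generic technique and may differ from what the slide presented. All names (`sample_path`, `init`, `trans`, `emit`) are assumptions for illustration.

```python
import random

def sample_path(init, trans, emit, rng=random):
    """Draw one state path from P(path | observations).

    init[s]: prior over states at the first site
    trans[l][s][t]: P(state t at site l+1 | state s at site l)
    emit[l][s]: probability of the observation at site l under state s
    """
    n_sites, n_states = len(emit), len(init)
    # Forward pass, normalised at each site to avoid underflow.
    fwd = []
    for l in range(n_sites):
        if l == 0:
            cur = [init[s] * emit[0][s] for s in range(n_states)]
        else:
            cur = [emit[l][t] * sum(fwd[-1][s] * trans[l - 1][s][t]
                                    for s in range(n_states))
                   for t in range(n_states)]
        norm = sum(cur)
        fwd.append([c / norm for c in cur])
    # Backward pass: sample the last state, then each earlier state given the next.
    path = [rng.choices(range(n_states), weights=fwd[-1])[0]]
    for l in range(n_sites - 2, -1, -1):
        nxt = path[-1]
        weights = [fwd[l][s] * trans[l][s][nxt] for s in range(n_states)]
        path.append(rng.choices(range(n_states), weights=weights)[0])
    path.reverse()
    return path
```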
Imputation Step:
Extract Posterior Probability
• After many rounds, we can get :
– For each individual and for each missing site
Hap 1       | Hap 2       | Genotype
0     1     | 0     1     | 0      1      2
0.3   0.7   | 0.1   0.9   | 0.03   0.34   0.63
0.2   0.8   | 0.4   0.6   | 0.08   0.44   0.48
…     …     | …     …     | …      …      …
– Genotype probabilities are obtained assuming independence in sampling the haploid pair
– Then take the average over rounds
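The numbers in the table above can be reproduced with a few lines. This is a small sketch under the slide's stated independence assumption; the function names are made up for the example.

```python
def genotype_probs(p1_alt, p2_alt):
    """P(G = 0, 1, 2) from each haplotype's probability of carrying allele 1,
    assuming the two haplotypes are sampled independently."""
    return [
        (1 - p1_alt) * (1 - p2_alt),
        p1_alt * (1 - p2_alt) + (1 - p1_alt) * p2_alt,
        p1_alt * p2_alt,
    ]

def average_over_rounds(rounds):
    """Average the per-round genotype distributions into one posterior."""
    m = len(rounds)
    return [sum(r[g] for r in rounds) / m for g in range(3)]

# The two rounds shown on the slide: (0.3, 0.7 | 0.1, 0.9) and (0.2, 0.8 | 0.4, 0.6).
rounds = [genotype_probs(0.7, 0.9), genotype_probs(0.8, 0.6)]
posterior = average_over_rounds(rounds)
```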
Algorithm for Scenario A:
Complexity Analysis
• A) Burn in phase
• B) MCMC iterations for m times:
– For each individual i
• i) phase(i,T,hap+inf): O((Nhap + Ninf)² L_T)
• ii) impute(i,T+U,hap): O(Nhap L_{T+U})
• iii) record(posterior probability): O(L_{T+U})
• C) Average over different runs of MCMC to get the genotype and confidence
Benefits of the Algorithm
• Faster:
– Reducing the load in the imputation step
• More accurate:
– Utilize information available to guess
Algorithm for Scenario B
• Illustration:
[Figure: reference haplotypes (Nhap), diploid reference genotypes (Ndip) and inference-panel genotypes (Ninf), with "?" at untyped SNPs. Legend: U1, T, U2]
Algorithm for Scenario B
• Illustration: (Burn in)
[Figure: the observed genotypes in the diploid reference (Ndip) and inference (Ninf) panels are replaced by initial haplotype pairs; "?" remains at untyped SNPs. Legend: U1, T, U2]
Algorithm for Scenario B
• Illustration: (Phase T and U2 in diploid ref)
[Figure: "Update i": the haplotype pair of diploid-reference individual i is re-sampled at the SNPs in T and U2. Legend: U1, T, U2]
Algorithm for Scenario B
• Illustration: (Impute U1 in diploid ref)
[Figure: "Update i": the SNPs in U1 are filled in for diploid-reference individual i. Legend: U1, T, U2]
Algorithm for Scenario B
• Illustration: (Phase T in inference panel)
[Figure: "Update i": the haplotype pair of inference-panel individual i is re-sampled at the SNPs in T. Legend: U1, T, U2]
Algorithm for Scenario B
• Illustration: (Impute U2 in inference panel)
[Figure: "Update i": the SNPs in U2 are filled in for inference-panel individual i. Legend: U1, T, U2]
Algorithm for Scenario B
• Illustration: (Impute U1 in inference panel)
[Figure: "Update i": the SNPs in U1 are filled in for inference-panel individual i. Legend: U1, T, U2]
Algorithm for Scenario B:
Complexity Analysis
• A) Burn in phase
• B) MCMC iterations for m times:
– For each individual i in dip:
• i) phase(i,T+U2,hap+dip): O((Nhap + Ndip)² L_{T+U2})
• ii) impute(i,T+U1,hap): O(Nhap L_{T+U1})
• iii) record(posterior probability): O(L_{T+U1})
– For each individual i in inference:
• i) phase(i,T,hap+dip+inf): O((Nhap + Ndip + Ninf)² L_T)
• ii) impute(i,T+U2,hap+dip): O(N_{hap+dip} L_{T+U2})
• iii) impute(i,U1,hap): O(Nhap L_{U1})
• iv) record(posterior probability): O(L_{T+U1+U2})
• C) Average over different runs of MCMC to get the genotype and confidence
Benefits of the Algorithm
• Able to handle new data type
• Faster and more accurate
Further Speeding Up
• Choose k closest neighbours in phasing
• Need to compute Hamming distance
• O(k²L) for HMM but O(NL) for Hamming distance computation (better than O(N²L) in previous HMM calculation)
• Choose khap closest neighbours in imputation
• khap >> k is also good (because O(k²) in phasing but O(k) in imputation)
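A minimal sketch of the neighbour selection step, assuming haplotypes are stored as equal-length lists of 0/1 alleles. The function names are hypothetical. Computing the N distances costs O(NL); the sort adds O(N log N).

```python
def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))

def k_closest(target, candidates, k):
    """Indices of the k candidate haplotypes closest to target in Hamming distance."""
    order = sorted(range(len(candidates)),
                   key=lambda idx: hamming(target, candidates[idx]))
    return order[:k]

# The three reference haplotypes used in the earlier slides.
refs = [
    [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]
target = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1]
nearest_two = k_closest(target, refs, 2)
```

Only the selected haplotypes are then passed to the HMM, so the quadratic cost applies to k rather than N.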
Comparison with Beagle
• Weakness of BEAGLE:
– Full joint modeling of all individuals
– Accuracy decreases when population increases
/number of SNPs increases in the experiments
– Less accurate in rare SNPs than IMPUTE2
– More memory efficient
• Strength of BEAGLE:
– Faster
– Better accommodates trios and duos
Demo
./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.phased.impute2
Experiments
Experiment plans
• Evaluation of the performance of imputation:
– Accuracy
– Time and space complexity
– Comparison with other methods
• Application of imputation
– Identification of associated SNPs in GWAS
• Optimizing performance
– Effect of multiple reference panels
Accuracy and Calibration
• Setting:
– Mask the known genotype
– Impute using IMPUTE
– Compare called base with ground-truth
– Calling Threshold:
• by genotype
• by SNPs
– Measure % missing and % mismatch for different thresholds
– Compare the estimated confidence with the experimental confidence
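The masking experiment above can be sketched as follows, assuming a genotype is called only when its highest posterior probability reaches the threshold and that % mismatch is measured among called genotypes. The function name and the toy numbers are assumptions for illustration, not results from the talk.

```python
def call_and_score(posteriors, truth, threshold):
    """posteriors: [P(G=0), P(G=1), P(G=2)] for each masked genotype.
    truth: the masked true genotypes.
    Returns (% missing, % mismatch among called genotypes)."""
    called = wrong = 0
    for probs, true_g in zip(posteriors, truth):
        best = max(range(3), key=lambda g: probs[g])
        if probs[best] < threshold:
            continue                  # below threshold: left uncalled ("missing")
        called += 1
        wrong += best != true_g
    n = len(truth)
    pct_missing = 100.0 * (n - called) / n
    pct_mismatch = 100.0 * wrong / called if called else 0.0
    return pct_missing, pct_mismatch

# Toy input only, to show the shape of the data.
toy_post = [[0.03, 0.34, 0.63], [0.08, 0.44, 0.48], [0.9, 0.1, 0.0], [0.05, 0.95, 0.0]]
toy_truth = [2, 1, 0, 0]
missing, mismatch = call_and_score(toy_post, toy_truth, 0.6)
```

Sweeping the threshold traces out the trade-off between % missing and % mismatch.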
Accuracy and Calibration
[Plot: % mismatch against % missing]
Message: IMPUTE is reasonably accurate and is well calibrated
Comparison: Accuracy
(in general and for rare alleles)
The closer to the lower left, the better
Message: IMPUTE2 is accurate, especially for rare alleles
Comparison: Algorithm Complexity
(Time and Space Complexity)
Phasing step: shorter L
Imputation step: linear in N
Multiple MCMC increases time
Message: IMPUTE2 is not too bad in terms of time and space complexity
Application 1:
Identification of associated SNPs
• Setting:
– Use case and control sets to identify the gene associated with Type II Diabetes
– Use filtered genotypes that have MAF > 1%
– Evaluate the P-value and plot it against the chromosome position to identify the causal gene
• Useful in
1. Identifying SNPs to follow up
2. Assessing strength of signal
Application 1:
Identification of associated SNPs
Red: Imputed SNPs
Black: typed SNPs
Message: IMPUTE helps identify SNPs associated with phenotype
Application 2:
Validation of missing data
• Setting:
– Some collected genotypes are not very reliable
– Use imputation to impute the genotype by assuming it is missing
– Call and compare to the original genotype
Application 2:
Validation of missing data
[Plot: genotype clusters AA, AB and BB, with an uncertain call marked "?"]
Message: IMPUTE helps confirm confidence in the data
Effect of Reference Set
Effect of reference set
• Motivation:
– Capture low-frequency variants by incorporating
data among populations
– Remain computationally efficient
• Setting:
– Pearson correlation for accuracy
– Varying Khap
– Adding more references
Effect of Reference Set
Improvement saturates when we have enough references
Improvement saturates when khap reaches a certain threshold
Message: More reference sets improve accuracy, and IMPUTE2 facilitates this
Summary
• IMPUTE, IMPUTE2 and their extensions
• They attempt to design algorithms for
imputation based on
– Population genetics model
– HMM computation
• Extensive experiments suggest that IMPUTE2 is reasonably accurate and can make good use of the reference data sets available for GWAS.
Discussion
• Parameters in HMM:
– Can they learn the parameters of copying process from the
study data through EM algorithm?
• Completely missing SNPs:
– Can they use clustering algorithm in imputing completely
missing data?
• Trios:
– Can they use different panels to do the imputation?
• Speed:
– Can they preprocess the reference to speed up the
computation?
– Can the ideas of BEAGLE of merging come into place at some
part of pre-HMM computation?
Supplementary : GWAS
Genetic Architecture
• Why are we interested in imputation?
– For GWAS.
• Domain of interest:
Case-Control Study and Bayes Factor
Genotype | 0  | 1  | 2
Cases    | s0 | s1 | s2
Control  | r0 | r1 | r2
The prior distribution of θ is known
Supplementary :
Reverse Engineering the per site
mutation probability
Review of Population Genetics
• Wright-Fisher model for coalescence:
– 2M individuals
– Generate the next generation by randomly choosing, with replacement, from the last generation and copying
• Infinite-sites model for mutation
– At every inheritance, there is a probability u of mutation, and each mutation occurs at a new site that has never mutated before.
Relationship btw
Coalescent Theory and Imputation
• Our question:
– Having a sample of N individuals as references
– What is the mutation rate (per site) λ btw the study sample and the nearest neighbor in the N references?
[Figure: the whole population (2M) containing the study sample and the N references, with the nearest neighbor in the references marked]
Estimation of Mutation Rate λ
• Pr(no coalescence between the study and all references in last t generations)
• Average time to coalescence
• Thus, mutation rate is λ
[Figure: genealogy over time t relating the N references and the study sample, with labels A and B]
Estimation of Mutation Rate λ
• Estimate u
• Estimate λ
[Figure: genealogy of the N references over time t, with times t2, t3, t4 marked]
References
• Marchini et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet (2007) vol. 39 (7) pp. 906-13
• Howie et al. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet (2009) vol. 5 (6) pp. e1000529
• Howie et al. Genotype imputation with thousands of genomes. G3 (Bethesda) (2011) vol. 1 (6) pp. 457-70
• Marchini and Howie. Genotype imputation for genome-wide association studies. Nat Rev Genet (2010) vol. 11 (7) pp. 499-511
• R. Durrett. Probability Models for DNA Sequence Evolution. Springer, 2nd ed., 2008
• N. Li and M. Stephens. Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics, 165:2213–2233, 2003.