Haplotyping Algorithms - Division of Statistical Genomics

Haplotyping Algorithms
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621
Computational Statistical Genetics
Mar. 29, 2012
https://dsgweb.wustl.edu/qunyuan/presentations/Haplotyping_GEMS_2012.ppt
1
Questions
WHAT is haplotype?
WHY study haplotype?
WHY use algorithms for haplotyping?
HOW ? (Data, Hypotheses, Algorithms)
2
WHAT is Haplotype?
A haplotype (Greek haploos = simple) is a combination of alleles at
multiple linked loci that are transmitted together. Haplotype may refer to
as few as two loci or to an entire chromosome depending on the number
of recombination events that have occurred between a given set of loci.
The term haplotype is a portmanteau of "haploid genotype."
In a second meaning, haplotype is a set of single nucleotide
polymorphisms (SNPs) on a single chromatid that are statistically
associated. It is thought that these associations, and the identification of
a few alleles of a haplotype block, can unambiguously identify all other
polymorphic sites in its region. Such information is very valuable for
investigating the genetics behind common diseases, and is collected by
the International HapMap Project.
From http://en.wikipedia.org/wiki/Haplotype
3
Haplotype = Genotype of Haploid
Haplotypes: AB//ab => Genotype: Aa Bb
Haplotypes: Ab//aB => Genotype: Aa Bb
Example: Haplotype CG + Haplotype TA => Genotype CT GA
4
WHY Study Haplotype?
An efficient way to represent genetic
variation/polymorphism, useful in genomics,
population genetics, and genetic epidemiology
Population evolution
LD analysis
Missing genotype imputation
IBD estimation
Tag marker (SNP) selection
Multi-locus linkage & association
…
5
WHY use algorithms in haplotyping?
Most current molecular genotyping techniques mix DNA
pieces from the two homologous chromosomes and only provide
diploid genotypes (a mixture of two haplotypes)
genotype (AaBb) => ? => haplotype (Ab//aB or AB//ab)
Some molecular techniques can measure haplotypes directly,
but they are expensive (money, labor, time, ...), especially for genome-wide studies.
So, at least for now, we need algorithms …
6
Ambiguity of Haplotype
Genotype     Haplotypes
AA BB        AB//AB
Aa bb        Ab//ab
Aa Bb        Ab//aB or AB//ab
Aa Bb Cc     ABC//abc, ABc//abC, AbC//aBc or Abc//aBC
Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci
are heterozygous and their genetic phase is unknown
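The table above can be reproduced by brute force. The following is a minimal Python sketch, not part of the original slides; the function name `haplotype_pairs` and the representation of a genotype as a list of two-letter strings (one per locus) are assumptions made for illustration.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Return all unordered haplotype pairs compatible with a genotype.

    genotype: list of two-character strings, one per locus, e.g. ["Aa", "Bb"].
    """
    pairs = set()
    # At each locus, choose which of the two alleles goes on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        # Sorting makes H1//H2 and H2//H1 the same pair.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

for g in (["AA", "BB"], ["Aa", "bb"], ["Aa", "Bb"], ["Aa", "Bb", "Cc"]):
    print(" ".join(g), "=>", ", ".join(f"{a}//{b}" for a, b in haplotype_pairs(g)))
```

A genotype with L heterozygous loci yields 2^(L-1) distinct pairs, which is why one heterozygous locus is unambiguous and two or more are not.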
7
Rule-based Approaches
(Parsimony & Phylogeny)
Search an optimal set of haplotypes that
satisfies some specific rules
8
Parsimony Approaches
Parsimony rules: Maximum resolution of genotypes
and/or
Minimum set of haplotypes
Clark’s Algorithm
1. List all unambiguous haplotypes
   e.g. ABC, abc, abC
2. Resolve ambiguous individuals one by one using listed haplotypes
   e.g. AaBbCC => ABC//abC
3. If only half-resolved, add the new haplotype to the list
   e.g. AABbCc => ABC//Abc (Abc is added)
4. Continue 2 & 3
5. Until no one else can be resolved
9
Clark, 1990, Mol. Biol. Evol., 7(2): 111-122
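The steps of Clark's algorithm can be sketched in Python as follows. This is a simplified greedy illustration under stated assumptions, not Clark's published implementation: genotypes are lists of two-letter strings, ties are broken by sort order, and the helper names `pairs_for` and `clark` are hypothetical. The real algorithm's result depends on the order in which individuals are visited.

```python
from itertools import product

def pairs_for(genotype):
    """All unordered haplotype pairs compatible with a genotype such as ["Aa", "Bb"]."""
    out = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        out.add(tuple(sorted((h1, h2))))
    return sorted(out)

def clark(genotypes):
    """Greedy Clark-style resolution; returns (resolved pairs by index, haplotype list)."""
    resolved, known = {}, []
    # Step 1: individuals with exactly one compatible pair are unambiguous.
    for i, g in enumerate(genotypes):
        ps = pairs_for(g)
        if len(ps) == 1:
            resolved[i] = ps[0]
            known.extend(h for h in ps[0] if h not in known)
    # Steps 2-5: keep resolving until a full pass makes no progress.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            candidates = [p for p in pairs_for(g) if p[0] in known or p[1] in known]
            if not candidates:
                continue
            # Prefer pairs made entirely of listed haplotypes (fully resolved).
            candidates.sort(key=lambda p: (p[0] in known) + (p[1] in known), reverse=True)
            resolved[i] = candidates[0]
            # Step 3: a half-resolved pair contributes a new haplotype to the list.
            known.extend(h for h in candidates[0] if h not in known)
            progress = True
    return resolved, known

sample = [["AA", "BB", "CC"], ["aa", "bb", "cc"], ["aa", "bb", "Cc"],
          ["Aa", "Bb", "CC"], ["AA", "Bb", "Cc"]]
resolved, known = clark(sample)
print(resolved[3], resolved[4], known)
```

The sample reproduces the slide's example: the three unambiguous individuals give the list ABC, abc, abC; AaBbCC is then resolved as ABC//abC, and AABbCc as ABC//Abc, adding Abc to the list.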
Phylogeny Approaches
Given a set of genotypes, find a set of explaining haplotypes,
which defines a perfect phylogeny.
Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no
recombination; infinite-sites mutation, i.e. each site mutates at most once)
D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. In Comput. Mol. Biology, p166–175.
10
Probability-based Approaches
(EM & MCMC)
Calculate probability of haplotype, conditional on genotypes.
Pr(H|G)=?
11
Data Structure for Haplotyping
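Loci (A, B, C, …) are connected by linkage; subjects (1, 2, 3, …) are connected by genetic relationship.
Population parameters: gene/haplotype frequencies, HWE, LD
Genotype data (subjects × loci):
Subject   Locus A   Locus B   Locus C   …
1         G1,A      G1,B      G1,C      …
2         G2,A      G2,B      G2,C      …
3         G3,A      G3,B      G3,C      …
…         …         …         …         …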
12
HWE & LD
Hardy-Weinberg Equilibrium (HWE)
Hardy-Weinberg Disequilibrium (HWD)
HWE: random combination of alleles from the same locus
Under HWE, allele freq. determines genotype freq.
HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)
Linkage Equilibrium (LE)
Linkage Disequilibrium (LD)
LE: random combination of alleles from different loci
LD: association between alleles from different loci
Under LE, allele freq. determines haplotype freq.
LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)
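The HWE and LE product rules above can be checked numerically. This is a small Python sketch with hypothetical function names; the LD coefficient D = Pr(AB) - Pr(A)*Pr(B) is a standard measure that is not defined on the slide and is added here only as an illustration, and the allele and haplotype frequencies are made-up inputs.

```python
def hwe_genotype_freqs(p_A):
    """Genotype frequencies at one biallelic locus under HWE."""
    p_a = 1.0 - p_A
    return {"AA": p_A * p_A, "Aa": 2 * p_A * p_a, "aa": p_a * p_a}

def le_haplotype_freq(allele_freqs):
    """Haplotype frequency under LE: the product of its allele frequencies."""
    prob = 1.0
    for p in allele_freqs:
        prob *= p
    return prob

def ld_coefficient(p_AB, p_A, p_B):
    """D = Pr(AB) - Pr(A)*Pr(B); D = 0 under LE (assumed standard definition)."""
    return p_AB - p_A * p_B

print(hwe_genotype_freqs(0.3))             # genotype freq. from allele freq.
print(le_haplotype_freq([0.5, 0.5, 0.5]))  # Pr(ABC) under LE
print(ld_coefficient(0.01, 0.5, 0.5))      # strongly negative: A and B rarely share a haplotype
```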
13
Genetic Relationship (R) & Linkage (r)
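Without relationship information: AaBb => AB//ab or aB//Ab
Recombination rate (r):
r = 0: complete linkage
0 < r < 0.5: incomplete linkage
r = 0.5: no linkage
[Pedigree figure with genotypes AABB, aabb and AaBb: one AaBb member is phased as AB//ab; another AaBb member is AB//ab if r = 0, and AB//ab or Ab//aB if r > 0.]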
14
Haplotyping & Conditional Probability
AaBB: Pr(AB//aB)=1
AABb: Pr(AB//Ab)=1
AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5
P(H|G)=?
AABB, aabb, AABB, aabb, AABB, AABb, aabb
AaBB, aabb, AABB, AABB, AABB, AABB, aabb
aabb, AABB, AABB, AABB, AaBb, AABB,aabb
aabb, AABB, AABB, aabb, AABB, aabb, AABB …
Pr(AB//ab)=Pr(Ab//aB)=0.5 ?
HWE or HWD?
LD or LE?
P(H|G, R, r)=?
15
EM Algorithm
for unrelated individuals
Pr(H|G,F)=?
Genotype AaBb: Pr(AB//ab)=? Pr(Ab//aB)=?
F: Pr(AB)=0.25, Pr(Ab)=0.25, Pr(aB)=0.25, Pr(ab)=0.25
OR
F: Pr(AB)=0.01, Pr(Ab)=0.49, Pr(aB)=0.49, Pr(ab)=0.01
Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927
Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)
16
Likelihood: L(G|F)
Haplotypes: $H = (H_1, H_2, \ldots, H_i, \ldots, H_h)$
Haplotype frequencies: $F = (f_1, f_2, \ldots, f_i, \ldots, f_h)$
Genotypes: $G = (G_1, G_2, \ldots, G_k, \ldots, G_g)$
Joint likelihood of G given F:
$$L(G \mid F) = \prod_{k=1}^{g} \Pr(G_k \mid F)$$
Probability of the k-th individual's G given F & HWE:
$$\Pr(G_k \mid F) = \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b$$
Haplotype-genotype compatibility index of the k-th individual:
$$c^{k}_{ab} = \begin{cases} 1 & (H_a // H_b \Rightarrow G_k) \\ 0 & (H_a // H_b \nRightarrow G_k) \end{cases}$$
Constraint:
$$\sum_{i=1}^{h} f_i = 1$$
Therefore:
$$L(G \mid F) = \prod_{k=1}^{g} \Big( \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b \Big)$$
F=? => Max. L(G|F)
17
EM Algorithm
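Maximum Likelihood
Constraint:
$$g(F) = 1 - \sum_{i=1}^{h} f_i = 0$$
Log-likelihood:
$$q(F) = \log L(G \mid F) = \sum_{k=1}^{g} \log\Big( \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b \Big)$$
Lagrange multiplier (general form): to find x = ? giving max{q(x)} subject to g(x) = c, set
$$Q(x, \lambda) = q(x) + \lambda \, (g(x) - c), \qquad \frac{\partial Q}{\partial x} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0$$
Applied to haplotype frequencies:
$$Q(F, \lambda) = \sum_{k=1}^{g} \log\Big( \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b \Big) + \lambda \Big( 1 - \sum_{i=1}^{h} f_i \Big)$$
Partial derivative equations:
$$\frac{\partial Q}{\partial f_i} = 0, \qquad \frac{\partial Q}{\partial \lambda} = 0$$
Estimation of haplotype freq.:
$$f_i = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z^{i}_{ab} c^{k}_{ab} f_a f_b}{\sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b}$$
z=1 if i in (a,b), or z=0 (counted as 2 when a = b = i, so that the f_i sum to 1)
c=1 if (a,b)=>G, or c=0
EM recursion:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z^{i}_{ab} c^{k}_{ab} f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a^{(t)} f_b^{(t)}}$$
E (Expectation): compute Pr(H_{a,b} | G, F^{(t)}) from the current F^{(t)}
M (Maximization): update F^{(t+1)} from the expected haplotype counts
$$F^{(0)} \xrightarrow{E} \Pr(H_{a,b} \mid G, F^{(0)}) \xrightarrow{M} F^{(1)} \xrightarrow{E} \Pr(H_{a,b} \mid G, F^{(1)}) \xrightarrow{M} \cdots \xrightarrow{M} F^{(t)} \xrightarrow{E} \Pr(H_{a,b} \mid G, F^{(t)}) \xrightarrow{M} F^{(t+1)} \to \cdots$$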
18
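The EM recursion for unrelated individuals can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the code of any cited software: genotypes are lists of two-letter strings, haplotype pairs are enumerated by brute force (so it is only practical for a few loci), and the names `compatible_pairs` and `em_haplotype_freqs` are hypothetical.

```python
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs (a, b) with c = 1 for this genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    """Estimate haplotype frequencies F by EM, assuming HWE."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pair_lists for p in ps for h in p})
    f = {h: 1.0 / len(haps) for h in haps}  # F(0): uniform start
    two_g = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for ps in pair_lists:
            # E step: Pr(pair | G_k, F(t)); an unordered heterozygous pair
            # stands for both (a, b) and (b, a) in the double sum.
            weights = [(1 if a == b else 2) * f[a] * f[b] for a, b in ps]
            total = sum(weights)
            for (a, b), w in zip(ps, weights):
                counts[a] += w / total
                counts[b] += w / total  # a homozygous pair adds 2 copies
        # M step: expected haplotype counts divided by 2g chromosomes.
        new_f = {h: counts[h] / two_g for h in haps}
        delta = max(abs(new_f[h] - f[h]) for h in haps)
        f = new_f
        if delta < tol:
            break
    return f

data = [["AA", "BB"]] * 4 + [["aa", "bb"]] * 4 + [["Aa", "Bb"]]
print(em_haplotype_freqs(data))
```

With this made-up sample, the single AaBb individual is pulled toward AB//ab because AB and ab are common among the unambiguous individuals, so the estimated frequencies of Ab and aB shrink toward zero.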
Posterior Probability of Haplotype
$$\Pr(H_{a,b} \mid G_k, F) = \frac{\Pr(H_{a,b} \mid G_k) \cdot \Pr(F)}{\sum_{(a,b)} \Pr(H_{a,b} \mid G_k) \cdot \Pr(F)}$$
$$\Pr(F) = \Pr(H_a) \cdot \Pr(H_b) = f_a \cdot f_b$$
Example:
$G_k = DdEe$
$H$: $H_1 = DE$, $H_2 = De$, $H_3 = dE$, $H_4 = de$
Prior Prob.
$\Pr(H_{1,4} \mid G_k) = \Pr(DE//de \mid DdEe) = 0.5$
$\Pr(H_{2,3} \mid G_k) = \Pr(De//dE \mid DdEe) = 0.5$
$F$: $f_1 = 0.4$, $f_2 = 0.1$, $f_3 = 0.1$, $f_4 = 0.4$
Posterior Prob.
$$\Pr(H_{1,4} \mid G_k, F) = \frac{\Pr(H_{1,4} \mid G_k) \cdot f_1 \cdot f_4}{\Pr(H_{1,4} \mid G_k) \cdot f_1 \cdot f_4 + \Pr(H_{2,3} \mid G_k) \cdot f_2 \cdot f_3} = \frac{0.5 \cdot 0.4 \cdot 0.4}{0.5 \cdot 0.4 \cdot 0.4 + 0.5 \cdot 0.1 \cdot 0.1} = \frac{0.08}{0.08 + 0.005} = 0.9412$$
$$\Pr(H_{2,3} \mid G_k, F) = 0.0588$$
19
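The DdEe example can be verified with a few lines of Python. This is a sketch only; the function name `posterior_pair_probs` and the dictionary layout are assumptions, and the prior defaults to equal weight on every compatible pair (0.5 each in the example).

```python
def posterior_pair_probs(pairs, freqs, prior=None):
    """Pr(H_ab | G_k, F) for each compatible pair, by normalizing prior * f_a * f_b."""
    if prior is None:
        prior = {p: 1.0 / len(pairs) for p in pairs}  # equal prior over pairs
    weights = {p: prior[p] * freqs[p[0]] * freqs[p[1]] for p in pairs}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

freqs = {"DE": 0.4, "De": 0.1, "dE": 0.1, "de": 0.4}
post = posterior_pair_probs([("DE", "de"), ("De", "dE")], freqs)
for (a, b), p in post.items():
    print(f"Pr({a}//{b} | DdEe, F) = {p:.4f}")
```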
Limitation of EM Algorithm
For a diploid (2n) organism, a genotype with L
heterozygous markers may involve 2^L possible
haplotypes, so EM is impractical for large L
Only suitable for a small number of loci, 2~12
When L=20, 2^L=1,048,576 … a large space of F
Subsetting approaches (partition-ligation, block
partitioning, etc.) have been used to reduce
the computational burden …
20
MCMC
Markov Chain Monte Carlo Algorithm
for unrelated individuals
by sampling from Pr(H|G,F)
Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)
21
Markov Chain
MCMC Estimation
Start: $H^{(0)}_{G_1}, H^{(0)}_{G_2}, \ldots, H^{(0)}_{G_k}, H^{(0)}_{G_{k+1}}, \ldots, H^{(0)}_{G_g}$
Sample from $\Pr(H_{G_1} \mid G, H_{-G_1})$ => $H^{(1)}_{G_1}, H^{(0)}_{G_2}, \ldots, H^{(0)}_{G_k}, H^{(0)}_{G_{k+1}}, \ldots, H^{(0)}_{G_g}$
Sample from $\Pr(H_{G_2} \mid G, H_{-G_2})$ => $H^{(1)}_{G_1}, H^{(1)}_{G_2}, \ldots, H^{(0)}_{G_k}, H^{(0)}_{G_{k+1}}, \ldots, H^{(0)}_{G_g}$
......
Sample from $\Pr(H_{G_g} \mid G, H_{-G_g})$ => $H^{(1)}_{G_1}, H^{(1)}_{G_2}, \ldots, H^{(1)}_{G_k}, H^{(1)}_{G_{k+1}}, \ldots, H^{(1)}_{G_g}$
Sample from $\Pr(H_{G_1} \mid G, H_{-G_1})$ => $H^{(2)}_{G_1}, H^{(1)}_{G_2}, \ldots, H^{(1)}_{G_k}, H^{(1)}_{G_{k+1}}, \ldots, H^{(1)}_{G_g}$
......
$H^{(t)}_{G_1}, H^{(t)}_{G_2}, \ldots, H^{(t)}_{G_k}, H^{(t)}_{G_{k+1}}, \ldots, H^{(t)}_{G_g}$
....
$H^{(t+N)}_{G_1}, H^{(t+N)}_{G_2}, \ldots, H^{(t+N)}_{G_k}, \ldots, H^{(t+N)}_{G_g}$
Random sampling based on Pr(H|G,H_-), where H_- denotes the haplotypes of all other individuals
Repeat many times
After getting close to the stationary distribution of P(H|G):
Collect samples (iterations t to t+N)
Average over samples
22
Transition Probability
$\Pr(H_{G_k} \mid G, H_{-G_k})$, given $H^{(t)}_{a,b}$ of L loci (subsetting loci reduces time)
1. For all $G_k$, list $H = (H_1, H_2, \ldots, H_m)$ and count $n = (n_1, n_2, \ldots, n_m)$
2. Pick $G_k$ and remove $H^{(t)}_{a,b}(G_k)$ from H
3. For each $H_i$ in the list:
   if $G_k$ is not compatible with $H_i$, then $p_i = 0$
   if $G_k$ is compatible with $H_i$, then $G_k \Rightarrow (H_i, H_j)$ and check:
      if $H_j \in H$, then $p_i = (n_i + \theta/M)(n_j + \theta/M) - (\theta/M)^2$
      if $H_j \notin H$, then $p_i = n_i (\theta/M)$
   (coalescent hypothesis; $\theta$: mutation rate; M: haplotypes)
4. Finally get $p = (p_1, p_2, \ldots, p_m)$
5. For $H_i \in H$, construct the haplotype with prob. $p_i / \sum_{i'} p_{i'}$
   For $H_i \notin H$, randomly choose phase with prob. $2^L (\theta/M)^2 / \big( \sum_i p_i + 2^L (\theta/M)^2 \big)$
6. Add the newly constructed haplotype $H^{(t+1)}_{a,b}(G_k)$ to list H, pick $G_{k+1}$ …
23
EM vs. MCMC
EM                                     MCMC
Search F, Max. L(G|F)                  Sample from Pr(H|G,F)
Haplo. freq. => Haplo. construction    Haplo. construction => Haplo. freq.
Maximum likelihood approach            Sampling approach
“Analytical” posterior distribution    “Empirical” posterior distribution
Fewer loci                             More loci
Convergence: local maximum             Better convergence: whole parameter space (more computer time)
24
EM Algorithm
for family data
(no recombination, r=0)
Pr(H{fam.}|G,R,F)=?
Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO)
Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP)
O’Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)
25
Haplotype Configuration
of Family
Genotypes (three family members): AaBb, AaBb, AaBb
Possible Haplotype Configurations
1. AB//ab, AB//ab, AB//ab
2. Ab//aB, Ab//aB, Ab//aB
3. AB//ab, AB//ab, Ab//aB (recombinant; as r=0 or nearly 0, impossible or very low prob., so ignored)
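The non-recombinant configurations for a parent-parent-child trio can be enumerated by brute force. The sketch below is an illustration under the slide's r=0 assumption; the function names and the (father, mother, child) ordering are assumptions, not taken from the cited software.

```python
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype such as ["Aa", "Bb"]."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def trio_configurations(father, mother, child):
    """(father pair, mother pair, child pair) configurations possible when r = 0."""
    configs = []
    child_pairs = set(compatible_pairs(child))
    for fp in compatible_pairs(father):
        for mp in compatible_pairs(mother):
            # With no recombination the child receives one intact haplotype
            # from each parent.
            transmitted = {tuple(sorted((hf, hm))) for hf in fp for hm in mp}
            for cp in sorted(transmitted & child_pairs):
                configs.append((fp, mp, cp))
    return configs

g = ["Aa", "Bb"]
for fp, mp, cp in trio_configurations(g, g, g):
    print("//".join(fp), "//".join(mp), "//".join(cp))
```

For three AaBb genotypes this prints only the two non-recombinant configurations listed on the slide.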
26
EM Algorithm
Haplotype Freq. Estimation using Nuclear Families
Unrelated individuals:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z^{i}_{ab} c^{k}_{ab} f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a^{(t)} f_b^{(t)}}$$
Nuclear families:
$$f_i^{(t+1)} = \frac{1}{4 N_{fam}} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} z^{i}_{a_1 b_1 a_2 b_2} c^{fam}_{a_1 b_1 a_2 b_2} f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} c^{fam}_{a_1 b_1 a_2 b_2} f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}$$
Tips:
Only use parents to calculate haplotype freq. (f)
Use parents' + children's info to determine compatibility (c)
27
EM Algorithm
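Haplotype Freq. Estimation for General Pedigrees
$$f_i^{(t+1)} = \frac{1}{\sum_{fam=1}^{N_{fam}} n'_{fam}} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1, b_1, a_2, b_2, \ldots, a_n, b_n}^{h, h, h, h, \ldots, h, h} z^{i}_{a_1 b_1 a_2 b_2 \ldots a_n b_n} c^{fam}_{a_1 b_1 a_2 b_2 \ldots a_n b_n} f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}{\sum_{a_1, b_1, a_2, b_2, \ldots, a_n, b_n}^{h, h, h, h, \ldots, h, h} c^{fam}_{a_1 b_1 a_2 b_2 \ldots a_n b_n} f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}$$
Here $n'_{fam}$ is read as the number of founder haplotypes in the family (an assumption, by analogy with the factors 2g for unrelated individuals and 4N_fam for nuclear families).
Tips:
Only use founders to calculate haplotype freq. (f)
Use all members (founders & non-founders) to determine compatibility (c)
Discard the cases with too small probabilities to save time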
28
Posterior Probability of Haplotype Configuration
General Family
$$\Pr(H^{fam}_{a,b} \mid G^{fam}_k, F_{founders}) = \frac{\Pr(H^{fam}_{a,b} \mid G^{fam}_k) \cdot \Pr(F_{founders})}{\sum_{(all\ configs.)} \Pr(H^{fam}_{a,b} \mid G^{fam}_k) \cdot \Pr(F_{founders})}$$
$$\Pr(F_{founders}) = \prod_{j=1}^{N_{founders}} \Pr(H_{a_j}) \cdot \Pr(H_{b_j}) = \prod_{j=1}^{N_{founders}} f_{a_j} \cdot f_{b_j}$$
Nuclear Family
$$\Pr(H^{fam}_{a,b} \mid G^{fam}_k, F_{parents}) = \frac{\Pr(H^{fam}_{a,b} \mid G^{fam}_k) \cdot \Pr(F_{parents})}{\sum_{(all\ configs.)} \Pr(H^{fam}_{a,b} \mid G^{fam}_k) \cdot \Pr(F_{parents})}$$
$$\Pr(F_{parents}) = \Pr(H_{a_1}) \cdot \Pr(H_{b_1}) \cdot \Pr(H_{a_2}) \cdot \Pr(H_{b_2}) = f_{a_1} \cdot f_{b_1} \cdot f_{a_2} \cdot f_{b_2}$$
Dad: $(a_1, b_1)$; Mom: $(a_2, b_2)$
29
A Middle Summary …
Subject-oriented Algorithms
Joint prob./likelihood over loci (A, B, C) is computed
indiv. by indiv. (unrelated) or family by family (r=0)
Large/General Pedigree & Allowing Recombination (r>0) ?
30
Next …
Locus-oriented Algorithm (Lander-Green)
For Large/General Pedigree Data & Allowing Recombination (r>0)
Joint prob./likelihood for a pedigree is computed locus by locus (A, B, C, …)
31
Inheritance Vector (V)
of a pedigree
[Figure: inheritance vectors of a pedigree at locus A and their probabilities]
Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767(software: MERLIN)
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)
32
Inheritance Vector & Haplotype
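[Figure: individual 5 has genotype AaBb. Inheritance vectors 1101 and 1101 at the two loci give AB//ab; inheritance vectors 1101 and 1111 give Ab//aB.]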
33
Lander-Green Algorithm
One pedigree, loci A, B, C, …
Hidden states (inheritance vectors): VA => VB => VC => …
Transition Prob. = f(r): Pr(VB|VA), Pr(VC|VB), …, Pr(Vt+1|Vt)
Emission Prob.: Pr(GA|VA), Pr(GB|VB), Pr(GC|VC)
Observations (genotypes): GA, GB, GC
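The dependence of the transition probability on r can be sketched in Python. The formula used here is the standard Lander-Green form, which the slide does not spell out, so treat it as an assumption: each bit of the inheritance vector is one meiosis, and between adjacent loci each bit flips independently with probability r. The function name `transition_prob` is hypothetical.

```python
from itertools import product

def transition_prob(v_from, v_to, r):
    """Pr(V_next = v_to | V_current = v_from) for recombination rate r.

    Vectors are bit strings such as "1101"; d is the number of bits that
    differ (recombinant meioses) out of m meioses in total.
    """
    d = sum(x != y for x, y in zip(v_from, v_to))
    m = len(v_from)
    return r ** d * (1 - r) ** (m - d)

vectors = ["".join(bits) for bits in product("01", repeat=4)]
r = 0.1
row = {v: transition_prob("1101", v, r) for v in vectors}
print(row["1101"], row["1111"], sum(row.values()))
```

With r=0 the inheritance vector cannot change between loci, which matches the complete-linkage case; with r=0.5 all vectors become equally likely at the next locus.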
34
Lander-Green Algorithm Based
(or Similar) Approaches
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363
(software: GENEHUNTER)
Viterbi algorithm, the best haplotype configuration
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337
(software: SIMWALK2)
MCMC: Annealing & Metropolis Process
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767
(software: MERLIN)
Allowing LD & Marker Cluster/Block
35
Haplotyping
based on sequencing data
(can be done for individual subject with no population data)
36
Rationale
Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
37
Data Structure
Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
38
Algorithms
ML
Or MCMC when H space is huge
Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
39
Prob(sequence | haplotype)
[Formula figure: probability of the observed sequence given the assumed haplotype and the sequencing/mapping error. It uses an indicator that
=1 if observed sequence X matches assumed haplotype
=0 otherwise
(for the j-th variant site of the i-th fragment)]
Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
40
Markov Chain
Sampling H from .
Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.
41
Practices
(1) If a child’s genotype at 4 loci is AaBbCcDD, list all possible
haplotype pairs of the child and calculate the probability of each
pair, given no extra information.
(2) If you know that his/her father’s genotype is also AaBbCcDD
and the mother’s is AaBbCCDD, list all possible haplotype
configurations of the family and calculate the probability of each
configuration. (Assume recombination rate r=0)
(3) If you know the haplotype frequencies below in population:
ABCD(0.2),ABcD(0.1),AbcD(0.1)
aBCD(0.1),aBcD(0.2),abcD(0.3)
calculate the posterior probabilities in (1) .
Within a week, send your answers to (E-mail: qunyuan@wustl.edu)
42