Haplotyping Algorithms
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics
Mar. 29, 2012
https://dsgweb.wustl.edu/qunyuan/presentations/Haplotyping_GEMS_2012.ppt

Questions
WHAT is a haplotype?
WHY study haplotypes?
WHY use algorithms for haplotyping?
HOW? (Data, Hypotheses, Algorithms)

WHAT is Haplotype?
A haplotype (Greek haploos = simple) is a combination of alleles at multiple linked loci that are transmitted together. Haplotype may refer to as few as two loci or to an entire chromosome depending on the number of recombination events that have occurred between a given set of loci. The term haplotype is a portmanteau of "haploid genotype."
In a second meaning, haplotype is a set of single nucleotide polymorphisms (SNPs) on a single chromatid that are statistically associated. It is thought that these associations, and the identification of a few alleles of a haplotype block, can unambiguously identify all other polymorphic sites in its region. Such information is very valuable for investigating the genetics behind common diseases, and is collected by the International HapMap Project.
From http://en.wikipedia.org/wiki/Haplotype

Haplotype = Genotype of Haploid
Genotype: Aa Bb    Haplotypes: AB//ab
Genotype: Aa Bb    Haplotypes: Ab//aB
Example: haplotype CG and haplotype TA together give genotype CT GA.

WHY Study Haplotype?
An efficient way to represent genetic variation/polymorphism, useful in genomics, population genetics, and genetic epidemiology:
Population evolution
LD analysis
Missing genotype imputation
IBD estimation
Tag marker (SNP) selection
Multi-locus linkage & association
…

WHY use algorithms in haplotyping?
Most current molecular genotyping techniques mix DNA pieces from the two homologous chromosomes and only provide the genotype of the diploid (a mixture of haplotypes).
genotype (AaBb) => haplotype (Ab//aB or AB//ab)?
Some molecular techniques can measure haplotypes directly, but they are expensive (money, labor, time, …), especially for genome-wide studies.
So, at least for now, we need algorithms …

Ambiguity of Haplotype
Genotype        Haplotypes
AA BB           AB//AB
Aa bb           Ab//ab
Aa Bb           Ab//aB or AB//ab
Aa Bb Cc        ABC//abc, ABc//abC, AbC//aBc or Abc//aBC
Haplotypic ambiguity/uncertainty arises when ≥2 markers/loci are heterozygous and their genetic phase is unknown.

Rule-based Approaches (Parsimony & Phylogeny)
Search for an optimal set of haplotypes that satisfies some specific rules.

Parsimony Approaches
Parsimony rules: maximum resolution of genotypes and/or minimum set of haplotypes.
Clark's Algorithm
1. List all unambiguous haplotypes. Example list: ABC, abc, abC
2. Resolve ambiguous individuals one by one using the listed haplotypes. Example: AaBbCC => ABC//abC
3. If only half-resolved, add the new haplotype to the list. Example: AABbCc => ABC//Abc (Abc is added to the list)
4. Continue 2 & 3 …
5. Until no one can be resolved
Clark, 1990, Mol. Biol. Evol., 7(2): 111-122

Phylogeny Approaches
Given a set of genotypes, find a set of explaining haplotypes that defines a perfect phylogeny.
Perfect Phylogeny Haplotype (PPH) rule: coalescent rule (no recombination; infinite-site mutation, with each site mutating only once).
D. Gusfield. 2002. Proc. of the 6th Annual Inter. Conf. on Res. in Comput. Mol. Biology, p166–175.

Probability-based Approaches (EM & MCMC)
Calculate the probability of a haplotype, conditional on genotypes.
Pr(H|G)=?

Data Structure for Haplotyping
Columns: loci (A, B, C, …), connected by linkage and described by gene/haplotype frequencies, HWE and LD.
Rows: subjects (1, 2, 3, …), connected by genetic relationship.
            A       B       C       …
Subject 1   G1,A    G1,B    G1,C    …
Subject 2   G2,A    G2,B    G2,C    …
Subject 3   G3,A    G3,B    G3,C    …
…           …       …       …       …

HWE & LD
Hardy-Weinberg Equilibrium (HWE) vs. Hardy-Weinberg Disequilibrium (HWD)
HWE: random combination of alleles from the same locus
Under HWE, allele freq. determines genotype freq.
HWE => Pr(AA)=Pr(A)*Pr(A), Pr(aa)=Pr(a)*Pr(a), Pr(Aa)=2*Pr(A)*Pr(a)
Linkage Equilibrium (LE) vs. Linkage Disequilibrium (LD)
LE: random combination of alleles from different loci
LD: association between alleles from different loci
Under LE, allele freq. determines haplotype freq.
LE => Pr(ABC)=Pr(A)*Pr(B)*Pr(C)

Genetic Relationship (R) & Linkage (r)
Recombination rate (r):
r = 0, complete linkage
0 < r < 0.5, incomplete linkage
r = 0.5, no linkage
[Figure: pedigree diagrams with genotypes AABB, aabb and AaBb. Without family information, AaBb may be AB//ab or aB//Ab. Within the pedigree, AaBb resolves to AB//ab if r=0, and to AB//ab or Ab//aB if r>0.]

Haplotyping & Conditional Probability
AaBB: Pr(AB//aB)=1
AABb: Pr(AB//Ab)=1
AaBb: Pr(AB//ab)=0.5, Pr(Ab//aB)=0.5
P(H|G)=?
Population sample:
AABB, aabb, AABB, aabb, AABB, AABb, aabb
AaBB, aabb, AABB, AABB, AABB, AABB, aabb
aabb, AABB, AABB, AABB, AaBb, AABB, aabb
aabb, AABB, AABB, aabb, AABB, aabb, AABB
…
Pr(AB//ab)=Pr(Ab//aB)=0.5? HWE or HWD? LD or LE?
P(H|G, R, r)=?

EM Algorithm for unrelated individuals
Pr(H|G,F)=?
For genotype AaBb: Pr(AB//ab)=? Pr(Ab//aB)=?
given Pr(AB)=0.25, Pr(Ab)=0.25, Pr(aB)=0.25, Pr(ab)=0.25
OR Pr(AB)=0.01, Pr(Ab)=0.49, Pr(aB)=0.49, Pr(ab)=0.01
Excoffier et al., 1995, Mol. Biol. Evol., 12(5): 921-927
Hawley et al., 1995, J Hered., 86:409-411 (software: HAPLO)

Likelihood: L(G|F)
Haplotypes: $H = (H_1, H_2, \ldots, H_i, \ldots, H_h)$
Haplotype frequencies: $F = (f_1, f_2, \ldots, f_i, \ldots, f_h)$
Genotypes: $G = (G_1, G_2, \ldots, G_k, \ldots, G_g)$
Joint likelihood of G given F:
$$L(G \mid F) = \prod_{k=1}^{g} \Pr(G_k \mid F)$$
Probability of the k-th individual's G given F & HWE:
$$\Pr(G_k \mid F) = \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b$$
Haplotype-genotype compatibility index of the k-th individual:
$$c^{k}_{ab} = \begin{cases} 1 & \text{if } H_a // H_b \Rightarrow G_k \\ 0 & \text{otherwise} \end{cases}$$
Constraint:
$$\sum_{i=1}^{h} f_i = 1$$
$$L(G \mid F) = \prod_{k=1}^{g} \Big( \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b \Big)$$
F=? => Max. L(G|F)

EM Algorithm: Maximum Likelihood
Log-likelihood:
$$q(F) = \log L(G \mid F) = \sum_{k=1}^{g} \log \Big( \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b \Big)$$
Constraint:
$$g(F) = 1 - \sum_{i=1}^{h} f_i = 0$$
Lagrange multiplier: to find $\max\{q(x)\}$ subject to $g(x) = c$, set
$$Q(x, \lambda) = q(x) + \lambda\,(g(x) - c), \qquad \frac{\partial Q}{\partial x} = 0, \quad \frac{\partial Q}{\partial \lambda} = 0$$
Here:
$$Q(F, \lambda) = \sum_{k=1}^{g} \log \Big( \sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab} f_a f_b \Big) + \lambda \Big( 1 - \sum_{i=1}^{h} f_i \Big), \qquad \frac{\partial Q}{\partial f_i} = 0, \quad \frac{\partial Q}{\partial \lambda} = 0$$
Partial derivative equations:
$$f_i = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z^{i}_{ab}\, c^{k}_{ab}\, f_a f_b}{\sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab}\, f_a f_b}$$
$z^{i}_{ab}$ counts $H_i$ in the pair (a,b): z=1 if i is in (a,b) (z=2 if a=b=i), or z=0
$c^{k}_{ab}$: c=1 if (a,b)=>$G_k$, or c=0
EM recursion:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z^{i}_{ab}\, c^{k}_{ab}\, f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab}\, f_a^{(t)} f_b^{(t)}}$$
Expectation (E): from the prior $F^{(t)}$, compute $\Pr(H_{a,b} \mid G, F^{(t)})$.
Maximization (M): estimate the haplotype frequencies $F^{(t+1)}$.
$$F^{(0)} \xrightarrow{E} \Pr(H_{a,b} \mid G, F^{(0)}) \xrightarrow{M} F^{(1)} \xrightarrow{E} \Pr(H_{a,b} \mid G, F^{(1)}) \xrightarrow{M} \cdots \xrightarrow{M} F^{(t)} \xrightarrow{E} \Pr(H_{a,b} \mid G, F^{(t)}) \xrightarrow{M} F^{(t+1)} \cdots$$

Posterior Probability of Haplotype
$$\Pr(H_{a,b} \mid G_k, F) = \frac{\Pr(H_{a,b} \mid G_k) * \Pr(F)}{\sum_{(a,b)} \Pr(H_{a,b} \mid G_k) * \Pr(F)}, \qquad \Pr(F) = \Pr(H_a) * \Pr(H_b) = f_a * f_b$$
Example: $G_k$ = DdEe
H: $H_1$ = DE, $H_2$ = De, $H_3$ = dE, $H_4$ = de
Prior prob.:
$\Pr(H_{1,4} \mid G_k) = \Pr(DE//de \mid DdEe) = 0.5$
$\Pr(H_{2,3} \mid G_k) = \Pr(De//dE \mid DdEe) = 0.5$
F: $f_1 = 0.4$, $f_2 = 0.1$, $f_3 = 0.1$, $f_4 = 0.4$
Posterior prob.:
$$\Pr(H_{1,4} \mid G_k, F) = \frac{\Pr(H_{1,4} \mid G_k) * f_1 * f_4}{\Pr(H_{1,4} \mid G_k) * f_1 * f_4 + \Pr(H_{2,3} \mid G_k) * f_2 * f_3} = \frac{0.5 * 0.4 * 0.4}{0.5 * 0.4 * 0.4 + 0.5 * 0.1 * 0.1} = \frac{0.08}{0.08 + 0.005} = 0.9412$$
$$\Pr(H_{2,3} \mid G_k, F) = 0.0588$$

Limitation of EM Algorithm
For a diploid (2n) organism, a genotype of L heterozygous markers may have $2^L$ possible haplotypes, so EM is impractical for large L.
Only suitable for a small number of loci, 2~12.
When L=20, $2^L$ = 1,048,576 … a large space of F.
Subsetting approaches (partition-ligation, block partitioning, etc.) have been used to reduce the computational burden …

MCMC
Markov Chain Monte Carlo algorithm for unrelated individuals, by sampling from Pr(H|G,F)
Stephens et al., 2001, Am. J. Hum. Genet., 68:978-989 (software: PHASE)

Markov Chain / MCMC Estimation
Random sampling based on $\Pr(H \mid G, H_{-})$, one individual at a time:
Start: $H^{(0)}_{G_1}, H^{(0)}_{G_2}, \ldots, H^{(0)}_{G_k}, H^{(0)}_{G_{k+1}}, \ldots, H^{(0)}_{G_g}$
Sample from $\Pr(H_{G_1} \mid G, H_{-G_1})$: $H^{(1)}_{G_1}, H^{(0)}_{G_2}, \ldots, H^{(0)}_{G_k}, H^{(0)}_{G_{k+1}}, \ldots, H^{(0)}_{G_g}$
Sample from $\Pr(H_{G_2} \mid G, H_{-G_2})$: $H^{(1)}_{G_1}, H^{(1)}_{G_2}, \ldots, H^{(0)}_{G_k}, H^{(0)}_{G_{k+1}}, \ldots, H^{(0)}_{G_g}$
……
Sample from $\Pr(H_{G_g} \mid G, H_{-G_g})$: $H^{(1)}_{G_1}, H^{(1)}_{G_2}, \ldots, H^{(1)}_{G_k}, H^{(1)}_{G_{k+1}}, \ldots, H^{(1)}_{G_g}$
Sample from $\Pr(H_{G_1} \mid G, H_{-G_1})$: $H^{(2)}_{G_1}, H^{(1)}_{G_2}, \ldots, H^{(1)}_{G_k}, H^{(1)}_{G_{k+1}}, \ldots, H^{(1)}_{G_g}$
……
Repeat many times:
$H^{(t)}_{G_1}, H^{(t)}_{G_2}, \ldots, H^{(t)}_{G_k}, H^{(t)}_{G_{k+1}}, \ldots, H^{(t)}_{G_g}$
….
$H^{(t+N)}_{G_1}, H^{(t+N)}_{G_2}, \ldots, H^{(t+N)}_{G_k}, \ldots, H^{(t+N)}_{G_g}$
After getting close to the stationary distribution of P(H|G): collect samples, then average over samples.

Transition Probability $\Pr(H_{G_k} \mid G, H_{-G_k})$
Given $H^{(t)}_{a,b}$ of L loci for all $G_k$:
list $H = (H_1, H_2, \ldots, H_m)$, count $n = (n_1, n_2, \ldots, n_m)$ (subsetting loci reduces time)
Pick $G_k$ and remove $H^{(t)}_{a,b}$ of $G_k$ from H.
If $H_i$ is not compatible with $G_k$, then $p_i = 0$.
If $H_i$ is compatible with $G_k$, then $G_k = (H_i, H_j)$ and check:
if $H_j \in H$, then $p_i = (n_i + \theta/M)(n_j + \theta/M) - (\theta/M)^2$
if $H_j \notin H$, then $p_i = n_i(\theta/M)$
(Coalescent hypothesis; mutation rate $\theta$; M haplotypes)
Finally get $p = (p_1, p_2, \ldots, p_m)$.
For $H_i \in H$: construct the haplotype with prob. $p_i / \sum_{i'} p_{i'}$
For $H_i \notin H$: randomly choose the phase with prob. $2^L(\theta/M)^2 \,/\, \big(\sum_i p_i + 2^L(\theta/M)^2\big)$
Add the newly constructed haplotype $H^{(t+1)}_{a,b}$ of $G_k$ to list H, pick $G_{k+1}$ …

EM vs. MCMC
EM                                        MCMC
Search F, Max. L(G|F)                     Sample from Pr(H|G,F)
Haplo. freq. => Haplo. construction       Haplo. construction => Haplo. freq.
Maximum likelihood approach               Sampling approach
"Analytical" posterior distribution       "Empirical" posterior distribution
Fewer loci                                More loci
Convergence: local maximum                Better convergence: whole parameter space (more computer time)

EM Algorithm for family data (no recombination, r=0)
Pr(H{fam.}|G,R,F)=?
Rohde et al., 2001, Human Mutation, 17: 289-295 (software: HAPLO)
Becher et al., 2004, Genetic Epidemiology, 27:21-32 (software: FAMHAP)
O'Connell, 2000, Genetic Epidemiology, 19(Suppl 1):S64-S70 (software: ZAPLO)

Haplotype Configuration of Family
Genotypes of the three family members: AaBb, AaBb, AaBb
Possible haplotype configurations:
AB//ab, AB//ab, AB//ab
Ab//aB, Ab//aB, Ab//aB
AB//ab, AB//ab, Ab//aB (recombinant; as r=0 or nearly 0, impossible or very low prob., ignored)

EM Algorithm: Haplotype Freq. Estimation using Nuclear Families
Unrelated individuals:
$$f_i^{(t+1)} = \frac{1}{2g} \sum_{k=1}^{g} \frac{\sum_{a=1}^{h} \sum_{b=1}^{h} z^{i}_{ab}\, c^{k}_{ab}\, f_a^{(t)} f_b^{(t)}}{\sum_{a=1}^{h} \sum_{b=1}^{h} c^{k}_{ab}\, f_a^{(t)} f_b^{(t)}}$$
Nuclear families:
$$f_i^{(t+1)} = \frac{1}{4N_{fam}} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} z^{i}_{a_1b_1a_2b_2}\, c^{fam}_{a_1b_1a_2b_2}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}{\sum_{a_1=1}^{h} \sum_{b_1=1}^{h} \sum_{a_2=1}^{h} \sum_{b_2=1}^{h} c^{fam}_{a_1b_1a_2b_2}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)}}$$
Tips:
Only use parents to calculate haplotype freq. (f)
Use parents' + children's info to determine compatibility (c)

EM Algorithm: Haplotype Freq. Estimation for General Pedigrees
$$f_i^{(t+1)} = \frac{1}{\sum_{fam=1}^{N_{fam}} n'_{fam}} \sum_{fam=1}^{N_{fam}} \frac{\sum_{a_1,b_1,a_2,b_2,\ldots,a_n,b_n}^{h,h,h,h,\ldots,h,h} z^{i}_{a_1b_1a_2b_2 \ldots a_nb_n}\, c^{fam}_{a_1b_1a_2b_2 \ldots a_nb_n}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}{\sum_{a_1,b_1,a_2,b_2,\ldots,a_n,b_n}^{h,h,h,h,\ldots,h,h} c^{fam}_{a_1b_1a_2b_2 \ldots a_nb_n}\, f_{a_1}^{(t)} f_{b_1}^{(t)} f_{a_2}^{(t)} f_{b_2}^{(t)} \cdots f_{a_n}^{(t)} f_{b_n}^{(t)}}$$
Tips:
Only use founders to calculate haplotype freq. (f)
Use all members (founders & non-founders) to determine compatibility (c)
Discard the cases with too small probabilities to save time

Posterior Probability of Haplotype Configuration
General family:
$$\Pr(H^{fam}_{a,b} \mid G^{fam}_k, F_{founders}) = \frac{\Pr(H^{fam}_{a,b} \mid G^{fam}_k) * \Pr(F_{founders})}{\sum_{(all.configs.)} \Pr(H^{fam}_{a,b} \mid G^{fam}_k) * \Pr(F_{founders})}$$
$$\Pr(F_{founders}) = \prod_{j=1}^{N_{founders}} \Pr(H_{a_j}) * \Pr(H_{b_j}) = \prod_{j=1}^{N_{founders}} f_{a_j} * f_{b_j}$$
Nuclear family:
$$\Pr(H^{fam}_{a,b} \mid G^{fam}_k, F_{parents}) = \frac{\Pr(H^{fam}_{a,b} \mid G^{fam}_k) * \Pr(F_{parents})}{\sum_{(all.configs.)} \Pr(H^{fam}_{a,b} \mid G^{fam}_k) * \Pr(F_{parents})}$$
$$\Pr(F_{parents}) = \Pr(H_{a_1}) * \Pr(H_{b_1}) * \Pr(H_{a_2}) * \Pr(H_{b_2}) = f_{a_1} * f_{b_1} * f_{a_2} * f_{b_2}$$
Dad: $(a_1, b_1)$; Mom: $(a_2, b_2)$

A Middle Summary …
Subject-oriented algorithms (loci A, B, C are handled jointly within each subject):
Unrelated: individual by individual
Family (r=0): family by family
Joint Prob. / Likelihood
Large/General Pedigree & Allowing Recombination (r>0)?
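
The EM recursion for unrelated individuals summarized above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names (`compatible_pairs`, `em_haplotype_freqs`), the genotype encoding (one two-letter string per locus), the uniform starting frequencies and the fixed number of iterations are all hypothetical choices, not part of the original slides or of any published software.

```python
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (a, b) that reproduce an unphased genotype.

    genotype: list of two-letter strings, one per locus, e.g. ["Aa", "Bb"].
    These are the pairs with compatibility index c = 1.
    """
    pairs = set()
    # At each locus, either allele may sit on the first haplotype.
    for choice in product(*[(g, g[::-1]) for g in genotype]):
        a = "".join(c[0] for c in choice)
        b = "".join(c[1] for c in choice)
        pairs.add((a, b))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=50):
    """Estimate haplotype frequencies with the EM recursion (assumes HWE)."""
    pairs_per_ind = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pairs_per_ind for pair in pairs for h in pair})
    f = {h: 1.0 / len(haps) for h in haps}  # assumed uniform start F(0)
    g = len(genotypes)
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for pairs in pairs_per_ind:
            # E step: posterior weight of each compatible pair given F(t).
            weights = [f[a] * f[b] for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M step: expected haplotype counts divided by 2g chromosomes.
        f = {h: counts[h] / (2 * g) for h in haps}
    return f


# Hypothetical sample: 4 AABB, 4 aabb and 2 ambiguous AaBb individuals.
example = [["AA", "BB"]] * 4 + [["aa", "bb"]] * 4 + [["Aa", "Bb"]] * 2
freqs = em_haplotype_freqs(example)
```

In this toy sample the double heterozygotes are pulled toward AB//ab, because AB and ab are already common among the unambiguous individuals; the cost of enumerating all compatible pairs is what makes the approach impractical for many heterozygous loci.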
Next … Locus-oriented Algorithm (Lander-Green)
For Large/General Pedigree Data & Allowing Recombination (r>0)
[Figure: a pedigree processed locus by locus across loci A, B, C, …; joint prob./likelihood.]

Inheritance Vector (V) of a pedigree
[Figure: inheritance vectors of a pedigree and their probabilities at locus A.]
Lander & Green, 1987, Proc. Natl. Acad. Sci., 84: 2363-2367
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN)
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)

Inheritance Vector & Haplotype
[Figure: pedigree in which individual 5 has genotype AaBb. Inheritance vectors 1101, 1101 give AB//ab; inheritance vectors 1101, 1111 give Ab//aB.]

Lander-Green Algorithm
Loci A, B, C, … (one pedigree)
Hidden states (inheritance vectors): V_A -> V_B -> V_C -> …
Transition prob. = f(r): Pr(V_B|V_A), Pr(V_C|V_B), …, Pr(V_t+1|V_t)
Emission prob.: Pr(G_A|V_A), Pr(G_B|V_B), Pr(G_C|V_C)
Observations (genotypes): G_A, G_B, G_C

Lander-Green Algorithm Based (or Similar) Approaches
Kruglyak et al., 1996, Am. J. Hum. Genet., 58:1347-1363 (software: GENEHUNTER)
Viterbi algorithm, the best haplotype configuration
Sobel et al., 1996, Am. J. Hum. Genet., 58:1323-1337 (software: SIMWALK2)
MCMC: Annealing & Metropolis Process
Abecasis et al., 2005, Am. J. Hum. Genet., 77:754-767 (software: MERLIN)
Allowing LD & Marker Cluster/Block

Haplotyping based on sequencing data
(can be done for an individual subject with no population data)

Rationale
[Figure from Bansal et al. Genome Res. 2008 August; 18(8): 1336–1346.]

Data Structure
[Figure from Bansal et al., 2008.]

Algorithms
ML, or MCMC when the H space is huge
[Figure from Bansal et al., 2008.]

Prob(sequence|haplotype)
[Equation not recoverable from the extraction. Its labels: sequencing/mapping error, haplotype, observed sequence, and an indicator that =1 if the observed sequence X matches the assumed haplotype and =0 otherwise (for the j-th variant site of the i-th fragment).]
Bansal et al., 2008.

Markov Chain
Sampling H from [expression missing in the extraction].
Bansal et al., 2008.
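
The idea behind Prob(sequence|haplotype) can be sketched in Python. The slide's exact equation is not reproduced here, so this is a sketch under explicit assumptions rather than the formula of Bansal et al.: a single per-site error probability `q`, independence across variant sites, alleles coded as '0'/'1', and each fragment equally likely to come from either chromosome. All names (`fragment_likelihood`, `complement`, `log_likelihood`) are hypothetical.

```python
import math


def fragment_likelihood(fragment, haplotype, q=0.01):
    """Probability of one sequenced fragment given an assumed haplotype.

    fragment: dict {variant_site_index: observed allele '0' or '1'}
    haplotype: string of '0'/'1', one character per variant site
    q: assumed sequencing/mapping error probability per site
    """
    p = 1.0
    for j, allele in fragment.items():
        match = 1 if haplotype[j] == allele else 0  # indicator: 1 if match, else 0
        p *= match * (1.0 - q) + (1 - match) * q
    return p


def complement(haplotype):
    """The other chromosome at heterozygous sites: flip every allele."""
    return "".join("1" if c == "0" else "0" for c in haplotype)


def log_likelihood(fragments, haplotype, q=0.01):
    """Log-likelihood of all fragments for the pair (haplotype, complement).

    Assumption: each fragment is drawn from either chromosome with prob. 1/2.
    """
    other = complement(haplotype)
    return sum(
        math.log(0.5 * fragment_likelihood(f, haplotype, q)
                 + 0.5 * fragment_likelihood(f, other, q))
        for f in fragments
    )


# Hypothetical fragments covering three heterozygous sites.
fragments = [{0: "0", 1: "0"}, {1: "1", 2: "1"}, {0: "1", 2: "1"}]
ll_consistent = log_likelihood(fragments, "000")    # pair 000 // 111
ll_inconsistent = log_likelihood(fragments, "010")  # pair 010 // 101
```

A maximum-likelihood search or a Markov chain over H would compare such values across candidate haplotypes; here the pair 000//111 explains all three fragments without errors and therefore scores higher than 010//101.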
Practices
(1) If a child's genotype at 4 loci is AaBbCcDD, list all possible haplotype pairs of the child and calculate the probability of each pair, given no extra information.
(2) If you know that his/her father's genotype is also AaBbCcDD and the mother's is AaBbCCDD, list all possible haplotype configurations of the family and calculate the probability of each configuration. (Assume recombination rate r=0.)
(3) If you know the haplotype frequencies below in the population:
ABCD (0.2), ABcD (0.1), AbcD (0.1), aBCD (0.1), aBcD (0.2), abcD (0.3)
calculate the posterior probabilities in (1).
Within a week, send your answers by e-mail to qunyuan@wustl.edu
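
For checking hand calculations of this kind, the posterior probability of a haplotype pair can be computed with a few lines of Python. The sketch below reuses the DdEe example from the "Posterior Probability of Haplotype" slide rather than the exercise data. It assumes HWE and equal prior probabilities for all compatible pairs; the function names (`haplotype_pairs`, `posterior_pair_probs`) and the genotype encoding are hypothetical.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Unordered haplotype pairs compatible with an unphased genotype.

    genotype: list of two-letter strings, one per locus, e.g. ["Dd", "Ee"].
    """
    pairs = set()
    for choice in product(*[(g, g[::-1]) for g in genotype]):
        a = "".join(c[0] for c in choice)
        b = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((a, b))))  # (a, b) and (b, a) are the same pair
    return sorted(pairs)


def posterior_pair_probs(genotype, freqs):
    """Pr(H_ab | G, F): weight each compatible pair by f_a * f_b, then normalize.

    Equal priors over the compatible pairs cancel in the normalization.
    Haplotypes missing from freqs are given frequency 0.
    """
    weights = {
        (a, b): freqs.get(a, 0.0) * freqs.get(b, 0.0)
        for a, b in haplotype_pairs(genotype)
    }
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Example from the slide: G = DdEe, f(DE)=0.4, f(De)=0.1, f(dE)=0.1, f(de)=0.4
freqs = {"DE": 0.4, "De": 0.1, "dE": 0.1, "de": 0.4}
posterior = posterior_pair_probs(["Dd", "Ee"], freqs)
# posterior[("DE", "de")] is about 0.9412 and posterior[("De", "dE")] about 0.0588
```

With no frequency information every compatible pair gets the same probability, which is the situation in exercise (1); supplying frequencies, as in exercise (3), reweights the same list of pairs.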