Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-free Mendelian Inheritance on a Pedigree

Authors:
Lan Liu & Tao Jiang, Univ. California, Riverside
Jing Xiao, Lirong Xia, Tsinghua Univ., China

Outline
Introduction and problem definition
A new system of linear equations for ZRHC
An $O(mn^3)$ time algorithm for ZRHC
An improved algorithm for ZRHC
Conclusion

Pedigree
An example: British Royal Family
[Figure: pedigree chart with the following members]
Elizabeth II of the United Kingdom and Prince Philip, Duke of Edinburgh
  Prince Charles, Prince of Wales (spouses: Diana, Princess of Wales; Camilla, Duchess of Cornwall)
    Prince William of Wales
    Prince Henry of Wales
  Princess Anne, Princess Royal (spouses: Captain Mark Phillips; Commander Timothy Laurence)
    Peter Phillips
    Zara Phillips
  Prince Andrew, Duke of York (spouse: Sarah Margaret Ferguson)
    Princess Beatrice of York
    Princess Eugenie of York
  Prince Edward, Earl of Wessex (spouse: Sophie Rhys-Jones)
    Lady Louise Windsor

Biological Background
Basic concepts: locus, genotype, haplotype (paternal and maternal).
Mendelian Law: one haplotype comes from the father and the other comes from the mother.
Genotypes 11 and 22 are homozygous.
Genotype 12 is heterozygous; it can be phased as 1|2 or 2|1.
Example: Mendelian experiment

Notations and Recombinant
[Figure: genotypes and haplotype configurations of a father, a mother and a child, shown twice: once where the child inherits with 0 recombinants and once with 1 recombinant (the recombinant position is marked).]

Haplotype Configuration Reconstruction
Haplotypes: useful, but expensive to obtain.
Genotypes: not so informative, but cheaper to obtain.
In biological applications, genotypes instead of haplotypes are collected.
How can haplotypes be reconstructed from genotypes?
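To make the reconstruction question concrete, here is a minimal sketch that enumerates every haplotype pair consistent with a single member's genotype, ignoring the pedigree. The function name `phasings` and the tuple encoding of genotypes are assumptions made for illustration only; they do not come from the slides.

```python
from itertools import product

def phasings(genotype):
    """Yield every (paternal, maternal) haplotype pair consistent with a genotype.

    genotype: one unordered allele pair per locus, e.g. [(1, 2), (2, 2)].
    This encoding is hypothetical and chosen only for this example.
    """
    per_locus = []
    for a, b in genotype:
        if a == b:
            per_locus.append([(a, b)])          # homozygous: a single choice
        else:
            per_locus.append([(a, b), (b, a)])  # heterozygous: 1|2 or 2|1
    for choice in product(*per_locus):
        paternal = tuple(p for p, _ in choice)
        maternal = tuple(m for _, m in choice)
        yield paternal, maternal

# Genotype 12, 22, 12: two heterozygous loci, hence 2 * 2 candidate pairs.
example = [(1, 2), (2, 2), (1, 2)]
pairs = list(phasings(example))
```

A member with k heterozygous loci has 2^k candidate pairs, which is why the genotype alone does not determine the haplotypes and the pedigree constraints are needed.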
Recombination-free assumption
[Figure: two example haplotype configurations, panels (a) and (b).]

The ZRHC problem
Problem definition: Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Previous Work
Li and Jiang introduced a system of linear equations over F[2] and presented an $O(m^3 n^3)$ time algorithm for ZRHC [LJ03], where m is #loci and n is #members in the pedigree.
Several other attempts have been made recently, but the correctness of those algorithms was not proven in all cases, especially when the input pedigree has mating loops [CZ04] [LCL06].
Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops.

Related work
Methods based on fast matrix multiplication can achieve an asymptotic running time of $O(k^{2.376})$ on k equations with k unknowns.
The Lanczos and conjugate gradient algorithms are only heuristics [GV96].
The Wiedemann algorithm has expected quadratic running time [W86].

Our Result
We present a much faster algorithm for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$.
[Diagram: $Ax=b$ with $O(mn)$ equations on $O(mn)$ unknowns -> (transformation) -> $Ax=b$ with $O(mn)$ equations on $O(n)$ unknowns -> (redundancy elimination) -> $Ax=b$ with $O(n \log^2 n \log\log n)$ equations on $O(n)$ unknowns]

Outline
Introduction and problem definition
A new system of linear equations for ZRHC
An $O(mn^3)$ time algorithm for ZRHC
An improved algorithm for ZRHC
Conclusion
[Diagram: $Ax=b$ with $O(mn)$ equations on $O(mn)$ unknowns]

The New Linear System
m: #loci
n: #members in the pedigree
Unknowns
$p_j$: the paternal haplotype vector of a member $j$.
$h_{j_1,j}$: the scalar describing the inheritance information between a parent $j_1$ and a child $j$.
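The role of the two kinds of unknowns can be sketched in a few lines of code. The sketch below makes several assumptions that are not stated on this slide: alleles are recoded as 0/1 so that arithmetic is over F[2]; `w` is a 0/1 vector marking the heterozygous loci of a member, so that the maternal haplotype is `p XOR w`; and `h = 0` means the parent passes on its paternal haplotype while `h = 1` means it passes on its maternal one. The function names and the sample vectors are hypothetical.

```python
def transmitted(p_parent, w_parent, h):
    """Haplotype a parent passes to a child, computed over F[2].

    Returns p_parent when h == 0 and p_parent XOR w_parent when h == 1
    (assumed convention for the h-variable).
    """
    return [p ^ (h & w) for p, w in zip(p_parent, w_parent)]

def is_mendelian(p_child, w_child, father, mother, h_f, h_m):
    """Check a child against both parents under the recombination-free assumption.

    father and mother are (p, w) pairs; h_f and h_m are the h-values
    on the father-child and mother-child edges.
    """
    p_f, w_f = father
    p_m, w_m = mother
    child_maternal = [p ^ w for p, w in zip(p_child, w_child)]
    return (p_child == transmitted(p_f, w_f, h_f)
            and child_maternal == transmitted(p_m, w_m, h_m))

# Illustrative data over four loci (not taken from the slides).
father = ([0, 1, 0, 0], [1, 0, 0, 1])
mother = ([0, 0, 0, 0], [0, 1, 1, 1])
p_child = transmitted(father[0], father[1], 1)   # father's maternal haplotype
w_child = [1, 1, 0, 1]                           # child's heterozygous loci
ok = is_mendelian(p_child, w_child, father, mother, 1, 0)
```

Because a single scalar `h` selects a whole haplotype, no recombination can occur between loci, which is what makes the system linear in the p- and h-variables.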
The New Linear System
[Figure: a father $j_1$, a mother $j_2$ and their child $j$ over four loci. Each member's two haplotypes are $p$ and $p + w$, with $w_{j_1} = (1,0,0,1)$, $w_{j_2} = (0,1,1,1)$ and $w_j = (1,1,0,0)$; the edges from the parents to the child are labelled $h_{j_1,j}$ and $h_{j_2,j}$. At the homozygous loci of $j_1$ the values $p_{j_1,2} = 1$ and $p_{j_1,3} = 0$ are pre-determined.]

The Linear System
$O(mn)$ equations on $O(mn)$ unknowns.
Given a homozygous locus $i$ on a member $j$ (with a child $j_1$), $p_j[i]$ and $p_{j_1}[i]$ are pre-determined.

Pedigree Graph
[Figure: a pedigree with genotypes on members 1 to 9, and the corresponding pedigree graph G.]
#edges ≤ 2n

Locus Graph
Locus graph $G_i$: $G_i = (V, E_i)$, where $E_i = \{(k,j) \mid k \text{ is a parent of } j,\ w_k[i] = 1\}$.
[Figure: (a) genotype info; (b) locus graph, with edges labelled by h-variables such as $h_{1,4}$, $h_{6,8}$, $h_{8,9}$, $h_{4,9}$ and with zero-weight edges marked.]
Example: Locus graph for the 3rd locus

Outline
Introduction and problem definition
A new system of linear equations for ZRHC
An $O(mn^3)$ time algorithm for ZRHC
An improved algorithm for ZRHC
Conclusion
[Diagram: $Ax=b$ with $O(mn)$ equations on $O(mn)$ unknowns -> (transformation) -> $Ax=b$ with $O(mn)$ equations on $O(n)$ unknowns]

An Observation
For any cycle, or any path in a locus graph connecting two pre-determined vertices, the summation of the h-variables along the path is a constant.
We can use paths to denote constraints!
(proof sketch) Assume the path connects two pre-determined vertices $j_0$ and $j_k$ in the locus graph $G_i$. Each edge on the path gives one equation:
$p_{j_0}[i] + h_{j_0,j_1} = p_{j_1}[i] + d_{j_0,j_1}$
$p_{j_1}[i] + h_{j_1,j_2} = p_{j_2}[i] + d_{j_1,j_2}$
$p_{j_2}[i] + h_{j_2,j_3} = p_{j_3}[i] + d_{j_2,j_3}$
…
$p_{j_{k-1}}[i] + h_{j_{k-1},j_k} = p_{j_k}[i] + d_{j_{k-1},j_k}$
Adding all equations over F[2] cancels the intermediate p-variables, so
$h_{j_0,j_1} + h_{j_1,j_2} + \dots + h_{j_{k-1},j_k} = p_{j_0}[i] + p_{j_k}[i] + d_{j_0,j_1} + \dots + d_{j_{k-1},j_k}$, a constant.

Examples of Linear Constraints
[Figure: three locus graphs on the same pedigree; "?" marks vertices that are not pre-determined.]
(a) 1st locus graph: $h_{6,8} + h_{8,9} = 1$
(b) 2nd locus graph: $h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} = 0$
(c) 3rd locus graph: $h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0$

Linear Constraints
Obviously, the linear constraints are necessary.
We can also show that these constraints are sufficient.
Moreover, we can upper bound #constraints in each locus graph by $O(n)$, while the trivial analysis gives an upper bound of $O(n^2)$.
Total #constraints = $O(mn)$.

The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G = (V, E) and genotypes {g_j}
output: a general solution of {p_j}
begin
  Step 1. Preprocessing
  Step 2. Linear constraint generation on h-variables
  Step 3. Solve the h-variables by Gaussian elimination
  Step 4. Solve the p-variables by propagation from pre-determined p-variables to the others
end
Traditional method: solve the h-variables and p-variables together. $O(mn)$ equations on $O(mn)$ unknowns: $O(mn)$ p-variables and $O(n)$ h-variables.
Our method: solve the h-variables and p-variables separately. $O(mn)$ linear equations on $O(n)$ h-variables.

Outline
Introduction and problem definition
A new system of linear equations for ZRHC
An $O(mn^3)$ time algorithm for ZRHC
An improved algorithm for ZRHC
Conclusion
[Diagram: $Ax=b$ with $O(mn)$ equations on $O(mn)$ unknowns -> (transformation) -> $Ax=b$ with $O(mn)$ equations on $O(n)$ unknowns -> (redundancy elimination) -> $Ax=b$ with $O(n \log^2 n \log\log n)$ equations on $O(n)$ unknowns]

Redundant Equation Elimination
An observation: given a cycle, assume that there are constraints between each pair of vertices.
[Figure: a cycle on vertices $j_0, j_1, j_2, \dots, j_{k-2}, j_{k-1}, j_k$ with example constraints $j_0 \sim j_2$, $j_2 \sim j_{k-1}$ and $j_0 \sim j_{k-1}$.]
Key lemma: Originally, there are $O(k^2)$ constraints. Notice that they are not independent. However, we can replace the original constraints by an equivalent set of constraints of size $O(k)$.
Remove the redundant equations without solving them!

Redundant Equation Elimination
Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree.
Elkin, Emek, Spielman and Teng showed that any graph has a low-stretch spanning tree with average stretch $O(\log^2 n \log\log n)$.
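The definition of stretch can be illustrated with a short sketch. It only evaluates the stretch of each graph edge for a spanning tree that is already given; it does not construct a low-stretch spanning tree, and it uses a plain breadth-first search per edge rather than anything tuned for speed. The function names and the four-vertex example are hypothetical.

```python
from collections import deque

def tree_distances(tree_adj, source):
    """Breadth-first search on the tree: distance from source to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in tree_adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def stretches(vertices, tree_edges, graph_edges):
    """Map each graph edge (u, v) to the length of the tree path from u to v."""
    tree_adj = {v: [] for v in vertices}
    for u, v in tree_edges:
        tree_adj[u].append(v)
        tree_adj[v].append(u)
    return {(u, v): tree_distances(tree_adj, u)[v] for u, v in graph_edges}

# A 4-cycle whose spanning tree is the path 1-2-3-4.
vertices = [1, 2, 3, 4]
graph_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
tree_edges = [(1, 2), (2, 3), (3, 4)]
edge_stretch = stretches(vertices, tree_edges, graph_edges)
average = sum(edge_stretch.values()) / len(edge_stretch)
```

Tree edges always have stretch 1; the non-tree edge (4, 1) closes a cycle of length 4 and has stretch 3, the length of the tree path it shortcuts.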
The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, $O(n \log^2 n \log\log n)$.

Conclusion
We present an efficient algorithm for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$.
It remains open whether the time complexity for ZRHC on general pedigrees can be improved to $O(mn^2 + n^3)$ or lower.
Another open question is how to use the algorithm to get haplotype configurations on pedigrees that require only a small (constant) number of recombinants.

Thanks for your time and attention!