zrhc-soda

Fast Elimination of Redundant Linear Equations
and Reconstruction of Recombination-free
Mendelian Inheritance on a Pedigree
Authors:
Lan Liu & Tao Jiang, Univ. California, Riverside
Jing Xiao, Lirong Xia, Tsinghua Univ., China
Outline
Introduction and problem definition
A new system of linear equations for ZRHC
An O(mn^3) time algorithm for ZRHC
An improved algorithm for ZRHC
Conclusion
Pedigree
An example: British Royal Family
[Figure: pedigree chart starting from Elizabeth II of the United Kingdom and Prince Philip, Duke of Edinburgh. Other members shown: Prince Charles, Prince of Wales; Diana, Princess of Wales; Camilla, Duchess of Cornwall; Prince William of Wales; Prince Henry of Wales; Princess Anne, Princess Royal; Captain Mark Phillips; Commander Timothy Laurence; Peter Phillips; Zara Phillips; Prince Andrew, Duke of York; Sarah Margaret Ferguson; Princess Beatrice of York; Princess Eugenie of York; Prince Edward, Earl of Wessex; Sophie Rhys-Jones; Lady Louise Windsor.]
Biological Background
Basic concepts: genotype, haplotype, locus
[Figure: the paternal and maternal haplotypes of one individual across several loci, with alleles 1 and 2.]
Mendelian Law: one haplotype comes from the father and the other comes from the mother.
11, 22: homozygous
12: heterozygous (phased as 1|2 or 2|1)
Example: Mendelian experiment
Notations and Recombinant
[Figure: genotypes and the corresponding haplotype configurations for a father, a mother and a child over four loci. Two configurations are shown: one with 0 recombinants and one with 1 recombinant, where the recombinant is marked on the child's haplotype.]
Haplotype Configuration Reconstruction
Haplotypes: useful, but expensive to obtain
Genotypes: less informative, but cheaper to obtain
In biological applications, genotypes rather than haplotypes are collected.
How can haplotypes be reconstructed from genotypes?
Recombination-free assumption
[Figure: two panels, (a) and (b), each showing a haplotype configuration on a small pedigree with alleles 1 and 2.]
The ZRHC problem

Problem definition
Given a pedigree and the genotype information
for each member, find a recombination-free
haplotype configuration for each member that
obeys the Mendelian law of inheritance.
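The problem definition can be made concrete with a small check on a single father-mother-child trio. This is a minimal sketch, not the authors' code: the tuple encoding of haplotypes and the function names `matches_genotype` and `is_recombination_free_trio` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

Haplotype = Tuple[int, ...]            # one allele (1 or 2) per locus
Phased = Tuple[Haplotype, Haplotype]   # (paternal haplotype, maternal haplotype)


def matches_genotype(phased: Phased, genotype: Sequence[Tuple[int, int]]) -> bool:
    """True if the two haplotypes give the unordered allele pair at every locus."""
    paternal, maternal = phased
    return all(
        sorted((a, b)) == sorted(g)
        for a, b, g in zip(paternal, maternal, genotype)
    )


def is_recombination_free_trio(father: Phased, mother: Phased, child: Phased) -> bool:
    """Mendelian law without recombination: the child's paternal haplotype is
    a whole copy of one of the father's haplotypes, and the child's maternal
    haplotype is a whole copy of one of the mother's haplotypes."""
    child_paternal, child_maternal = child
    return child_paternal in father and child_maternal in mother


father = ((1, 1, 1, 1), (2, 2, 2, 2))
mother = ((2, 2, 2, 2), (2, 2, 2, 2))
good_child = ((1, 1, 1, 1), (2, 2, 2, 2))   # copies one paternal haplotype intact
bad_child = ((1, 1, 2, 2), (2, 2, 2, 2))    # switches between the father's haplotypes

print(is_recombination_free_trio(father, mother, good_child))
print(is_recombination_free_trio(father, mother, bad_child))
```

ZRHC asks for such an assignment on every trio of the pedigree at once, which is what makes the problem global rather than a per-family check.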
Previous Work
Li and Jiang introduced a system of linear equations over F[2] and presented an O(m^3 n^3) time algorithm for ZRHC [LJ03], where m is #loci and n is #members in the pedigree.
Several attempts have been made recently, but the authors failed to prove the correctness of their algorithms in all cases, especially when the input pedigree has mating loops [CZ04] [LCL06].
Recently, Chan et al. proposed a linear-time algorithm in [CCC+06], which only works for pedigrees without mating loops.
Related work
Methods based on fast matrix multiplication algorithms could achieve an asymptotic running time of O(k^2.376) on k equations with k unknowns.
The Lanczos and conjugate gradient algorithms are only heuristics [GV96].
The Wiedemann algorithm has expected quadratic running time [W86].
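Our Result
We present a much faster algorithm for ZRHC with running time O(mn^2 + n^3 log^2 n log log n).
[Diagram: a system Ax=b with O(mn) equations on O(mn) unknowns -> (transformation) -> a system Ax=b with O(mn) equations on O(n) unknowns -> (redundancy elimination) -> a system Ax=b with O(n log^2 n log log n) equations on O(n) unknowns.]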
The New Linear System
n, m
m: #loci
n: #members in pedigree
Unknowns
p_j: the paternal haplotype vector of a member j.
h_{j1,j}: the scalar encoding inheritance information between a parent j1 and a child j.
The New Linear System
[Figure: a trio with parents j1 and j2 and child j over four loci. Each member is drawn with two haplotype columns, p and p + w: p_j1 and p_j1 + w_j1, p_j2 and p_j2 + w_j2, p_j and p_j + w_j. The scalars h_{j1,j} and h_{j2,j} label the inheritance from each parent. In the example, w_j1 = (1, 0, 0, 1), w_j2 = (0, 1, 1, 1), w_j = (1, 1, 0, 0), and the pre-determined entries are p_j1[2] = 1 and p_j1[3] = 0.]
The Linear System
O(mn) equations on O(mn) unknowns.
Given a homozygous locus i on a member j (with a child j1), p_j[i] and p_j1[i] are pre-determined.
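The pre-determined entries can be read directly off the genotypes. The sketch below assumes an encoding that the slides do not spell out: allele 1 maps to bit 0, allele 2 maps to bit 1, w_j[i] = 1 exactly when locus i of member j is heterozygous, and the maternal haplotype of j is p_j + w_j over F[2]. All function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Sequence[Tuple[int, int]]   # unordered allele pair (1 or 2) per locus


def heterozygosity_vector(genotype: Genotype) -> List[int]:
    """w[i] = 1 if locus i is heterozygous, else 0 (assumed definition of w)."""
    return [0 if a == b else 1 for a, b in genotype]


def predetermined_paternal_bits(genotype: Genotype) -> List[Optional[int]]:
    """p[i] is known at homozygous loci (both alleles equal); None elsewhere.
    Assumed encoding: allele 1 -> 0, allele 2 -> 1."""
    return [a - 1 if a == b else None for a, b in genotype]


def child_paternal_bit(parent_bit: int, parent_is_father: bool, child_w_bit: int) -> int:
    """Paternal bit of a child at a locus where the parent is homozygous.
    A father passes parent_bit to the child's paternal haplotype directly.
    A mother passes it to the maternal haplotype p + w, so p = parent_bit + w."""
    if parent_is_father:
        return parent_bit
    return (parent_bit + child_w_bit) % 2


genotype_j = [(1, 2), (2, 2), (1, 1)]
print(heterozygosity_vector(genotype_j))
print(predetermined_paternal_bits(genotype_j))
```

Only the heterozygous loci leave p undetermined, which is why the later slides mark those vertices with "?".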
Pedigree Graph
[Figure: a pedigree of nine members (1 to 9) annotated with their genotypes, and the corresponding pedigree graph G on vertices 1 to 9.]
#edges ≤ 2n
Locus Graph
Locus graph G_i
G_i = (V, E_i), where E_i = {(k, j) | k is a parent of j, w_k[i] = 1}
[Figure: (a) genotype information and (b) locus graph for members 1 to 9. Pre-determined vertices are labelled 0 or 1, undetermined vertices are labelled "?", edges carry h-variables such as h1,4, h6,8, h8,9 and h4,9, and one element is marked "Zero-weight".]
Example: Locus graph for the 3rd locus
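The edge set E_i follows directly from the definition above. The following is a minimal sketch; the dictionary layout for `parents` and `w` and the function name `locus_graph_edges` are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def locus_graph_edges(
    parents: Dict[int, Tuple[int, int]],   # child -> (father, mother)
    w: Dict[int, Sequence[int]],           # member -> heterozygosity vector
    i: int,                                # locus index (0-based here)
) -> List[Tuple[int, int]]:
    """E_i = {(k, j) | k is a parent of j and w_k[i] = 1}."""
    edges = []
    for child, (father, mother) in parents.items():
        for parent in (father, mother):
            if w[parent][i] == 1:
                edges.append((parent, child))
    return edges


# Tiny hypothetical pedigree: founders 1 and 2, child 3, two loci.
parents = {3: (1, 2)}
w = {1: [1, 0], 2: [0, 1], 3: [1, 1]}
print(locus_graph_edges(parents, w, 0))
print(locus_graph_edges(parents, w, 1))
```

Because every member has at most two parents, each locus graph has at most 2n edges, the same bound as the pedigree graph.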
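An Observation
For any cycle in a locus graph, or any path connecting two pre-determined vertices, the sum of the h-variables along the path is a constant.
We can use paths to denote constraints!
(Proof sketch)
Assume the path j_0, j_1, ..., j_k in locus graph G_i connects two pre-determined vertices j_0 and j_k. Each edge (j_{t-1}, j_t) is labelled with a variable h_{j_{t-1},j_t} and a value d_{j_{t-1},j_t}:
p_{j_0}[i] + h_{j_0,j_1} = p_{j_1}[i] + d_{j_0,j_1}
p_{j_1}[i] + h_{j_1,j_2} = p_{j_2}[i] + d_{j_1,j_2}
p_{j_2}[i] + h_{j_2,j_3} = p_{j_3}[i] + d_{j_2,j_3}
...
p_{j_{k-1}}[i] + h_{j_{k-1},j_k} = p_{j_k}[i] + d_{j_{k-1},j_k}
Adding these equations over F[2] cancels the intermediate p-values, so
h_{j_0,j_1} + h_{j_1,j_2} + ... + h_{j_{k-1},j_k} = p_{j_0}[i] + p_{j_k}[i] + d_{j_0,j_1} + ... + d_{j_{k-1},j_k},
which is a constant.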
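The cancellation argument can be checked by brute force on a short path. This sketch assumes that each d-value is a known bit and that arithmetic is over F[2]; the function names are hypothetical.

```python
from itertools import product
from typing import Sequence, Set


def path_constant(p_start: int, p_end: int, d_values: Sequence[int]) -> int:
    """Right-hand side obtained by adding all edge equations over F[2]."""
    return (p_start + p_end + sum(d_values)) % 2


def h_sums_over_all_completions(p_start: int, p_end: int, d_values: Sequence[int]) -> Set[int]:
    """Try every assignment of the undetermined intermediate p-values, derive
    each h from its edge equation p[t] + h[t] = p[t+1] + d[t], and collect the
    sum of the h-variables along the path."""
    k = len(d_values)
    sums = set()
    for middle in product((0, 1), repeat=k - 1):
        p = (p_start, *middle, p_end)
        h = [(p[t] + p[t + 1] + d_values[t]) % 2 for t in range(k)]
        sums.add(sum(h) % 2)
    return sums


d = [0, 1, 1]
print(path_constant(1, 0, d))
print(h_sums_over_all_completions(1, 0, d))
```

Whatever the intermediate p-values are, the sum of the h-variables takes a single value, which is why a path between pre-determined vertices yields one linear constraint on h alone.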
Examples of Linear Constraints
[Figure: three locus graphs on members 1 to 9, with pre-determined vertices labelled 0 or 1 and undetermined vertices labelled "?".]
(a) 1st locus graph: h6,8 + h8,9 = 1
(b) 2nd locus graph: h3,5 + h3,6 + h2,5 + h2,6 = 0
(c) 3rd locus graph: h4,9 + h2,4 + h2,5 + h3,5 + h3,6 + h6,8 = 0
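Constraints like those in the examples above could be generated from a spanning forest of a locus graph: one cycle constraint per non-tree edge, and one path constraint linking each pre-determined vertex to a chosen pre-determined vertex in the same component. This is a standard fundamental-cycle construction offered as a sketch, not necessarily the authors' procedure. The edge format `(u, v, d)`, meaning p[u] + h_e = p[v] + d over F[2], and the bitmask representation are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple


def generate_constraints(
    n: int,
    edges: List[Tuple[int, int, int]],   # (u, v, d): p[u] + h_e = p[v] + d
    fixed: Dict[int, int],               # pre-determined vertex -> bit
) -> List[Tuple[int, int]]:
    """Return (mask, rhs) pairs: the XOR of h_e over the edges in mask equals rhs."""
    adj = [[] for _ in range(n)]
    for e, (u, v, _) in enumerate(edges):
        adj[u].append((v, e))
        adj[v].append((u, e))
    path = [None] * n    # bitmask of tree edges from the component root
    dsum = [0] * n       # XOR of d along that tree path
    comp = [None] * n    # component root of each vertex
    tree_edges = set()
    for root in range(n):
        if path[root] is not None:
            continue
        path[root], comp[root] = 0, root
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v, e in adj[u]:
                if path[v] is None:
                    path[v] = path[u] ^ (1 << e)
                    dsum[v] = dsum[u] ^ edges[e][2]
                    comp[v] = root
                    tree_edges.add(e)
                    queue.append(v)
    constraints = []
    # Cycle constraints: each non-tree edge closes exactly one cycle.
    for e, (u, v, d) in enumerate(edges):
        if e not in tree_edges:
            constraints.append((path[u] ^ path[v] ^ (1 << e), dsum[u] ^ dsum[v] ^ d))
    # Path constraints: tie every fixed vertex to the first one in its component.
    anchor = {}
    for v in sorted(fixed):
        a = anchor.setdefault(comp[v], v)
        if a != v:
            rhs = fixed[a] ^ fixed[v] ^ dsum[a] ^ dsum[v]
            constraints.append((path[a] ^ path[v], rhs))
    return constraints


edges = [(0, 1, 0), (1, 2, 1), (2, 0, 1), (2, 3, 0)]
constraints = generate_constraints(4, edges, {0: 1, 3: 0})
print(len(constraints))
```

With at most 2n edges per locus graph, this yields at most one constraint per non-tree edge plus one per pre-determined vertex, that is, a number of constraints linear in n.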
Linear Constraints
Obviously, the linear constraints are necessary. We can also show that these constraints are sufficient.
Moreover, we can upper bound #constraints in each locus graph as O(n), while the trivial analysis gives an upper bound O(n^2).
Total #constraints = O(mn).
The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G=(V,E) and genotype {gj}
output: a general solution of {pj}
begin
Step 1. Preprocessing
Step 2. Linear constraint generation on h-variables
Step 3. Solve h-variables by Gaussian Elimination
Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.
end

Traditional method
Solve h-variables and p-variables together
O(mn) equations on O(mn) unknowns: O(mn) p-variables and O(n) h-variables.

Our method
Solve h-variables and p-variables separately
O(mn) linear equations on O(n) h-variables.
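Step 3 can be illustrated with a small Gaussian elimination over F[2]. This is a minimal sketch under assumptions: each equation is stored as a Python integer bitmask plus a right-hand-side bit, and the function `solve_gf2` (a hypothetical name) returns one particular solution with free variables set to 0, rather than the general solution the algorithm outputs.

```python
from typing import List, Optional, Tuple


def solve_gf2(constraints: List[Tuple[int, int]], num_vars: int) -> Optional[List[int]]:
    """Solve XOR equations. Each constraint is (mask, rhs): the XOR of the
    variables whose bits are set in mask must equal rhs.
    Returns one solution, or None if the system is inconsistent."""
    pivots = {}   # highest set bit -> (mask, rhs) row in echelon form
    for mask, rhs in constraints:
        while mask:
            high = mask.bit_length() - 1
            if high not in pivots:
                pivots[high] = (mask, rhs)
                break
            pivot_mask, pivot_rhs = pivots[high]
            mask ^= pivot_mask
            rhs ^= pivot_rhs
        else:
            # The row reduced to 0 = rhs: redundant if rhs is 0, contradictory otherwise.
            if rhs:
                return None
    x = [0] * num_vars
    # Back-substitute from the lowest pivot up; non-pivot variables stay 0.
    for high in sorted(pivots):
        mask, rhs = pivots[high]
        value = rhs
        for bit in range(high):
            if (mask >> bit) & 1:
                value ^= x[bit]
        x[high] = value
    return x


# x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 (the third equation is redundant)
solution = solve_gf2([(0b011, 1), (0b110, 0), (0b101, 1)], 3)
print(solution)
```

Each equation is reduced against at most n pivots of n bits each, so O(mn) equations on O(n) unknowns cost about O(mn^3) bit operations in this naive form, in line with the O(mn^3) bound stated in the outline. Redundant equations are detected here only after doing the reduction work for them.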
Redundant Equation Elimination
An observation
Given a cycle j_0, j_1, ..., j_k, assume that there are constraints among each pair of vertices.
[Figure: a cycle through j_0, j_1, j_2, ..., j_{k-2}, j_{k-1}, j_k, with example constraints j_0 ~ j_2, j_2 ~ j_{k-1} and j_0 ~ j_{k-1}.]
Key lemma
Originally, there are O(k^2) constraints. Notice that they are not independent.
However, we can replace the original constraints by an equivalent set of constraints of size O(k).
Remove the redundant equations without solving them!
Redundant Equation Elimination
Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree.
Elkin, Emek, Spielman and Teng show that any graph has a low-stretch spanning tree with average stretch O(log^2 n log log n).
The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, O(n log^2 n log log n).
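The definition of stretch can be illustrated on a small graph. The sketch below uses a plain BFS spanning tree as a stand-in; it is not the low-stretch tree construction of Elkin, Emek, Spielman and Teng, and it assumes a connected graph. The function names are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple


def bfs_tree(n: int, edges: List[Tuple[int, int]], root: int = 0):
    """Return (parent, depth) arrays of a BFS spanning tree (connected graph assumed)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent: List[Optional[int]] = [None] * n
    depth: List[Optional[int]] = [None] * n
    depth[root] = 0
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if depth[v] is None:
                depth[v] = depth[u] + 1
                parent[v] = u
                queue.append(v)
    return parent, depth


def stretch(u: int, v: int, parent, depth) -> int:
    """Length of the unique tree path between u and v."""
    length = 0
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u
        u = parent[u]   # move the deeper endpoint one step toward the root
        length += 1
    return length


# A 5-cycle: the one non-tree edge has stretch 4, tree edges have stretch 1.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
parent, depth = bfs_tree(5, cycle_edges)
stretches = [stretch(u, v, parent, depth) for u, v in cycle_edges]
print(stretches, sum(stretches))
```

The stretch of a non-tree edge is one less than the length of the cycle it closes with the tree, which is how the sum of stretches controls the sum of cycle lengths in the bound above.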

Conclusion
We present an efficient algorithm for ZRHC with running time O(mn^2 + n^3 log^2 n log log n).
It remains open whether the time complexity for ZRHC on general pedigrees can be improved to O(mn^2 + n^3) or lower.
Another open question is how to use the algorithm to get haplotype configurations on pedigrees that require only a small (constant) number of recombinants.
Thanks for your time
and attention!