A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield
Department of Computer Science
University of California, Davis
RECOMB 2005
Haplotypes to Genotypes



Each individual has two “copies” of each chromosome.
At each site, each chromosome has one of two states, denoted 0 and 1.
From haplotypes to genotypes: at each site of an individual, if both haplotypes have state 0, the genotype has state 0; if both have state 1, the genotype has state 1. If the two haplotypes differ (0 and 1, or 1 and 0), the genotype has state 2.
Haplotypes to Genotypes
Sites: 1 2 3 4 5 6 7 8 9
       0 1 1 1 0 0 1 1 0   Two haplotypes per individual
       1 1 0 1 0 0 1 0 0
       (merge the haplotypes)
       2 1 2 1 0 0 1 2 0   Genotype for the individual
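The merging rule can be sketched in a few lines of Python. The function name `merge_haplotypes` is an illustrative choice, not something from the talk; the two haplotypes are the ones shown on this slide.

```python
def merge_haplotypes(h1, h2):
    """Return the genotype produced by two equal-length 0/1 haplotypes."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states carry over unchanged; differing states become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The result does not depend on the order of the two haplotypes, which is why the genotype alone cannot tell them apart.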
Genotypes to Haplotypes
At each site, if the genotype has state 0 or 1, both haplotypes must have that same state (0, 0 or 1, 1).
If the genotype has state 2, the two haplotypes can have states 0, 1 or 1, 0.
       0 1 1 1 0 0 1 1 0   Two haplotypes per individual
       1 1 0 1 0 0 1 0 0
       2 1 2 1 0 0 1 2 0   Genotype for the individual
Haplotype Inference Problem

For disease association studies, haplotype data
is more valuable than genotype data, but
haplotype data is harder and more expensive
to collect than genotype data.

Haplotype Inference Problem: Given a set
of n genotypes, determine the original set of n
haplotype pairs that generated the n genotypes.

The NIH leads the HapMap project to find common haplotypes in the human population.
Haplotype Inference Problem



If the genotype has state 2 at k sites, there are 2^(k-1) possible explaining haplotype pairs.
How to determine which haplotype pair is
the original one generating the genotype?
We need a model of haplotype evolution
to help solve the haplotype inference
problem.
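To make the count concrete, the sketch below enumerates every explaining pair for one genotype. It is an illustration only: `explaining_pairs` is a hypothetical helper name, and the genotype is the nine-site example from the earlier slides, which has state 2 at k = 3 sites.

```python
from itertools import product


def explaining_pairs(genotype):
    """Yield every unordered haplotype pair that merges into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No ambiguous site: both haplotypes equal the genotype.
        yield tuple(genotype), tuple(genotype)
        return
    # Pin the first heterozygous site to 0 on h1 so that (h1, h2) and
    # (h2, h1) are not both reported; the other k - 1 sites are free.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        yield tuple(h1), tuple(h2)


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]
pairs = list(explaining_pairs(genotype))
print(len(pairs))  # 4, i.e. 2 ** (3 - 1)
```

Nothing in the genotype itself distinguishes the original pair from the other three, which is the motivation for adding an evolutionary model.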
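The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a perfect phylogeny on sites 1 to 5. The root is the ancestral haplotype 00000, each edge is labeled with the site that mutates on it, and the extant haplotypes are at the leaves. Haplotype labels in the figure: 10100, 10000, 00010, 01010, 01011.]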
Assumptions of Perfect Phylogeny Model

No recombination, only mutation.

Infinite-site assumption: each site mutates at most once.
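These two assumptions can be checked mechanically on a set of 0/1 haplotypes. The sketch below uses a standard pairwise characterization that is background knowledge rather than something stated on the slides: with an all-zero ancestral haplotype, the haplotypes fit a perfect phylogeny exactly when no two sites show all three of the combinations 01, 10 and 11. The function name is an illustrative choice, and the first input is the set of extant haplotypes from the model slide.

```python
from itertools import combinations


def fits_perfect_phylogeny(haplotypes):
    """Check 0/1 haplotypes against a tree rooted at the all-zero haplotype.

    Uses the pairwise test: the haplotypes fit iff no two sites exhibit
    all three of the combinations (0, 1), (1, 0) and (1, 1).
    """
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True


extant = ["10100", "10000", "01010", "01011"]
haplotypes = [tuple(int(c) for c in h) for h in extant]
print(fits_perfect_phylogeny(haplotypes))                # True
print(fits_perfect_phylogeny([(1, 0), (0, 1), (1, 1)]))  # False
```

The second call fails because, starting from 00, producing 10, 01 and 11 would need a site to mutate twice or two lineages to recombine.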
The Perfect Phylogeny Haplotyping (PPH) Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix:
Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix:
Site  1 2
a     1 0
a     0 1
b     0 0
b     0 1
c     1 0
c     1 0

[Figure: the perfect phylogeny for these haplotypes, with edges labeled by sites 1 and 2 and leaves b 00, a 10, a 01, b 01, and c c 10 10.]
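For very small inputs the problem statement can be turned directly into a brute-force search: try every combination of explaining pairs and keep those that fit a perfect phylogeny. This is only a sketch of the problem definition, exponential in the number of 2 entries, and it is not the linear-time algorithm of the talk. It assumes an all-zero ancestral haplotype and uses the standard pairwise test (no two sites showing all of 01, 10, 11); all function names are hypothetical, and the input is this slide's example read as a = 22, b = 02, c = 10.

```python
from itertools import combinations, product


def pairs_for(genotype):
    """All unordered haplotype pairs that merge into one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    free = het[1:]  # the first heterozygous site is pinned to break symmetry
    result = []
    for bits in product((0, 1), repeat=len(free)):
        h1, h2 = list(genotype), list(genotype)
        if het:
            h1[het[0]], h2[het[0]] = 0, 1
        for site, bit in zip(free, bits):
            h1[site], h2[site] = bit, 1 - bit
        result.append((tuple(h1), tuple(h2)))
    return result


def fits_perfect_phylogeny(haplotypes):
    """Pairwise test for a tree rooted at the all-zero haplotype."""
    forbidden = {(0, 1), (1, 0), (1, 1)}
    sites = range(len(haplotypes[0]))
    return not any(
        forbidden <= {(h[i], h[j]) for h in haplotypes}
        for i, j in combinations(sites, 2)
    )


def pph_solutions(genotypes):
    """Brute force: try every combination of explaining pairs."""
    solutions = []
    for choice in product(*(pairs_for(g) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if fits_perfect_phylogeny(haplotypes):
            solutions.append(choice)
    return solutions


genotypes = [(2, 2), (0, 2), (1, 0)]  # rows a, b, c
solutions = pph_solutions(genotypes)
for pair in solutions[0]:
    print(pair)
```

For this input the search keeps a single solution, in which a is explained by 01 and 10; the alternative pair 00 and 11 for a is rejected because it would put 01, 10 and 11 on sites 1 and 2 together.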
Prior Work

Several existing algorithms solve the PPH problem, but none of them runs in linear time.

Our contribution:
A linear-time algorithm.
Our implementation is about 250 times faster than the fastest previous algorithm on large data sets.

A P-Class of PPH Solutions


P-Class: maximum common subgraph in all PPH solutions.
Each P-Class consists of two subtrees.

Genotype matrix:
Sites: 1 2 3 4 5
a      2 2 2 0 0
b      2 0 0 2 2
c      2 2 2 0 2
d      2 0 0 2 0

[Figure: one PPH solution for this matrix, a tree drawn from the root with edges labeled by sites 1 to 5 and leaves labeled a,d; b,c; b,d; a,c.]
P-Class Property of PPH Solutions

All PPH solutions can be obtained by choosing how
to flip each P-Class.
[Figure: one PPH solution, with its switching points marked, and a second PPH solution. Both are drawn from the root with edges labeled by sites 1 to 5 and leaves labeled a,d; b,c; b,d; a,c.]
The Key Theorem

Every PPH solution can be obtained by
choosing a flip for each P-Class.

Conversely, after fixing one P-Class, every distinct choice of flips for the P-Classes leads to a distinct PPH solution.

If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
Shadow Tree




Contains classes.
Each class in the shadow tree is a subgraph of a P-Class.
Merging classes results in larger classes; classes are never split.
Contains tree edges and shadow edges.
The Algorithm

Process the genotype matrix one row at a
time, starting at the first row, and modify
the shadow tree

The genotype matrix only contains entries
of value 0 and 2.
Overview of the Algorithm for One Row

Procedure FirstPath

Procedure SecondPath

Procedure FixTree

Procedure NewEntries
OldEntryList

OldEntryList: the column indices that have an entry of value 2 in the current row and also in some previous row.

Genotype matrix:
Row 1: 2 2 2 0 0
Row 2: 2 0 0 2 2
Row 3: 2 2 2 0 2
Row 4: 2 0 0 2 0

OldEntryList for row 3: 1, 2, 3, 5
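The definition of OldEntryList can be sketched as follows. The function name and the 0-based row argument are illustrative choices. For clarity this version rescans the earlier rows on every call, whereas an implementation aiming at O(nm) total time would presumably keep the set of already-seen columns up to date as rows are processed.

```python
def old_entry_list(matrix, row):
    """Columns (1-based) with a 2 in `row` and in some earlier row.

    `row` is a 0-based index into `matrix`.
    """
    seen = set()
    for earlier in matrix[:row]:
        seen.update(c for c, value in enumerate(earlier) if value == 2)
    return [
        c + 1
        for c, value in enumerate(matrix[row])
        if value == 2 and c in seen
    ]


matrix = [
    [2, 2, 2, 0, 0],
    [2, 0, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [2, 0, 0, 2, 0],
]
print(old_entry_list(matrix, 2))  # row 3 of the slide: [1, 2, 3, 5]
```

Columns with a 2 in the current row that are not in this list are the ones seen for the first time in that row.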
Procedures FirstPath and SecondPath

FirstPath: construct a first path towards the root of the shadow tree that passes through the tree edges of as many columns in OldEntryList as possible.

SecondPath: construct a second path towards the root of the shadow tree that passes through the tree edges of the columns in OldEntryList that are not on the first path.
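Shadow Tree After Processing the First Two Rows
Genotype matrix (rows 1 to 3 are marked): 22200, 20022, 22202, 20020
OldEntryList for row 3: 1, 2, 3, 5
[Figure: the shadow tree below the root after rows 1 and 2 have been processed; each of the edge labels 1 to 5 appears twice.]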
Algorithm – FirstPath
OldEntryList: 1, 2, 3, 5
CheckList: 3, 2
Edges 4 and 5 cannot be on the same path to the root in any PPH solution.
[Figure: the shadow tree during FirstPath, drawn from the root; each of the edge labels 1 to 5 appears twice.]
Algorithm – SecondPath
OldEntryList: 1, 2, 3, 5
CheckList: 2, 3
[Figure: the shadow tree during SecondPath, drawn from the root; each of the edge labels 1 to 5 appears twice.]
Shadow Tree to PPH Solutions
Genotype matrix:
Sites: 1 2 3 4 5
a      2 2 2 0 0
b      2 0 0 2 2
c      2 2 2 0 2
d      2 0 0 2 0

[Figure: the final shadow tree for this matrix next to one PPH solution; edges are labeled by sites 1 to 5.]
Shadow Tree to PPH Solutions
[Figure: the final shadow tree next to the second PPH solution, whose leaves are labeled b,c; a,d; b,d; a,c.]
Implementation – Leaf Count

Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.

L[i] is the number of leaves below mutation i in every perfect phylogeny for the genotype matrix.

Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.

[Figure: an example genotype matrix for individuals a to d, annotated with the leaf count of each column.]
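The leaf-count formula is a single pass over each column. The sketch below applies it to the four-by-five 0/2 genotype matrix used on the shadow-tree slides; the function name is an illustrative choice.

```python
def leaf_counts(matrix):
    """Leaf count per column: (# of 2's) + 2 * (# of 1's)."""
    counts = []
    for column in zip(*matrix):
        counts.append(column.count(2) + 2 * column.count(1))
    return counts


matrix = [
    (2, 2, 2, 0, 0),
    (2, 0, 0, 2, 2),
    (2, 2, 2, 0, 2),
    (2, 0, 0, 2, 0),
]
print(leaf_counts(matrix))  # [4, 2, 2, 2, 2]
```

A 2 contributes one leaf because only one of the individual's two haplotypes carries the mutation, while a 1 contributes two because both do.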
Time Complexity




Constant number of simple operations on each edge per row.
Each traversal in the shadow tree goes through O(m) edges.
The algorithm performs a constant number of traversals in the shadow tree for each row.
Total time: O(nm), where n and m are the numbers of rows and columns in the genotype matrix.
Results
Average running times (seconds)

Sites (m)  Individuals (n)  Datasets  DPPH, O(nm^2)  Our alg., O(nm)
300        150              30        1.07           0.05
500        250              30        5.72           0.13
1000       500              30        45.85          0.48
2000       1000             10        467.18         1.89
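The speedup behind the "about 250 times faster" claim can be recomputed from the reported averages. The snippet below only divides the two timing columns of the results table; the variable names are illustrative.

```python
# (sites m, individuals n, DPPH seconds, linear-time seconds), from the table.
results = [
    (300, 150, 1.07, 0.05),
    (500, 250, 5.72, 0.13),
    (1000, 500, 45.85, 0.48),
    (2000, 1000, 467.18, 1.89),
]

speedups = [dpph / ours for _, _, dpph, ours in results]
for (m, n, _, _), ratio in zip(results, speedups):
    print(f"m={m:5d} n={n:5d} speedup ~{ratio:.0f}x")
```

The ratio grows from roughly 21x on the smallest instances to roughly 247x on the largest, which is the pattern one would expect when an O(nm^2) method is compared with an O(nm) one as m increases.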
Thank you !
Paper and program can be downloaded at:
http://wwwcsif.cs.ucdavis.edu/~gusfield/lpph/