Algorithms for Imperfect Phylogeny
Haplotyping (IPPH) with a Single
Homoplasy or Recombination Event
Yun S. Song, Yufeng Wu and Dan Gusfield
University of California, Davis
WABI 2005
• Diploid organisms have two (not identical) copies of each chromosome.
• A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs)
• SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1
• The mixed description is a genotype, a vector of 0, 1, 2
– If both haplotypes are 0, genotype is 0
– If both haplotypes are 1, genotype is 1
– If one is 0 and the other is 1, genotype is 2
Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
(two haplotypes per individual; merge the haplotypes)
Genotype:     2 1 2 1 0 0 1 2 0
(genotype for the individual)
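The merge rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `merge_haplotypes` is made up for the example.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotype vectors into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; differing alleles are recorded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide (sites 1..9).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = merge_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```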
• Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
• Finding the original haplotypes is hopeless without a genetic model to guide the choice among solutions
• Gusfield (2002) introduced the Perfect Phylogeny Haplotyping (PPH) problem
• PPH: find HI solutions that fit a perfect phylogeny.
• Nice results for PPH, including a linear time algorithm
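To see why a guiding model is needed, the sketch below enumerates every haplotype pair that explains a single genotype. It is a brute-force illustration, and `explaining_pairs` is a hypothetical helper name. A genotype with k sites equal to 2 has 2^(k-1) unordered explanations when k >= 1.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return all unordered haplotype pairs that merge into `genotype`."""
    het = [i for i, g in enumerate(genotype) if g == 2]  # heterozygous sites
    pairs = set()
    for bits in product([0, 1], repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        # At each heterozygous site one haplotype gets b, the other 1 - b.
        for site, b in zip(het, bits):
            h1[site] = b
            h2[site] = 1 - b
        pairs.add(frozenset([tuple(h1), tuple(h2)]))
    return pairs


pairs = explaining_pairs([2, 1, 2, 1, 0, 0, 1, 2, 0])
print(len(pairs))  # 4
```

With n genotypes the choices multiply, so the number of candidate HI solutions grows exponentially.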
Assume at most 1 mutation at each site
[Figure: perfect phylogeny over sites 1-5. Ancestral sequence 00000 at the root; site mutations 1, 2, 3, 4, 5 label the edges, each exactly once; extant sequences at the leaves.]
The tree derives the set M:
10100
10000
01011
01010
00010
[Figure: genotypes, inferred haplotypes, and their perfect phylogeny]
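Whether a set of 0/1 sequences fits a perfect phylogeny with an all-zero ancestral sequence can be checked pairwise: no two sites may show all three combinations 01, 10 and 11. This is the standard textbook test, not something stated on the slides, and the function name below is made up for the example.

```python
from itertools import combinations


def has_rooted_perfect_phylogeny(seqs):
    """Pairwise test, assuming the ancestral sequence is all zeros."""
    n_sites = len(seqs[0])
    conflict = {("0", "1"), ("1", "0"), ("1", "1")}
    for i, j in combinations(range(n_sites), 2):
        seen = {(s[i], s[j]) for s in seqs}
        if conflict <= seen:  # all three combinations present
            return False
    return True


M = ["10100", "10000", "01011", "01010", "00010"]
print(has_rooted_perfect_phylogeny(M))  # True

# A small conflicting set: sites 1 and 2 show 01, 10 and 11.
print(has_rooted_perfect_phylogeny(["000", "010", "101", "110"]))  # False
```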
Imperfect Phylogeny Haplotyping
(IPPH): Extending PPH
• Real biological data often has no PPH solution.
• Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic)
• Our approach: IPPH with an explicit genetic model, allowing a small amount of
– Homoplasy, i.e. back or recurrent mutation
– Recombination
• Goal: extend the usage of PPH
– Real data: may be a small perturbation of PPH
– Haplotype block: low recombination or homoplasy
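Data: 000, 010, 101, 110
More than one mutation at a site
[Figure: tree rooted at 000; edges labeled with mutating sites 2, 1, 1, 3 (site 1 mutates twice); internal nodes 010 and 100; leaves 010, 110, 101, 000]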
• Recombination is one of the principal genetic forces shaping genetic variation
• Two equal-length sequences generate a third sequence of the same length
Prefix parent:  110001111111001
Suffix parent:  000110000001111
Recombinant:    11000|0000001111  (| marks the breakpoint)
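A single-crossover recombination is just a prefix of one parent joined to the suffix of the other. A minimal sketch, with `recombine` as an illustrative name and the breakpoint given as the number of sites taken from the prefix parent:

```python
def recombine(prefix_parent, suffix_parent, breakpoint):
    """Join the first `breakpoint` sites of one parent to the rest of the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


child = recombine("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```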
• Small deviation from PPH
• H1-IPPH problem
– Find a tree that allows exactly one site to mutate twice
– All other sites mutate at most once
– Derive haplotypes for the given genotypes
• R1-IPPH problem
– Find a network that has exactly one recombination event
– Each site mutates at most once
– Derive haplotypes for the given genotypes
Number of Minimum Recombinations for Haplotypes
Frequency of minimum recombinations (Rmin) for small rho (scaled recombination rate)
20 sequences, 30 sites, 500 simulations

Rmin   Rho=1   Rho=3   Rho=5
0      60.8%   23.6%    8.4%
1      31.8%   35.2%   27.6%
2       6.8%   24.8%   27.8%
3              11.6%   21.6%
4               3.8%    9.0%
5               0.8%    3.6%
6               0.2%    1.4%
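1 Homoplasy: more than one mutation at a site

Genotype
    s1 s2 s3
a    0  2  0
b    1  2  2

Haplotype
    s1 s2 s3
a1   0  0  0
a2   0  1  0
b1   1  0  1
b2   1  1  0

[Figure: tree rooted at 000; edges labeled with mutating sites 2, 1, 1, 3 (site 1 mutates twice); internal nodes 010 and 100; leaves a2, b2, b1, a1]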
• For each site s in the input genotype data M
– Test whether M-{s} has PPH solutions
– If not, move to next site.
– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions
– If yes, stop and report result
• Assume only one PPH solution for M-{s}
• But how to find solutions with 1 homoplasy at s efficiently?
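The first step of the loop (test whether M-{s} has a PPH solution) can be illustrated by brute force on tiny inputs. This sketch is exponential and is not the method from the talk: it tries every haplotype resolution and applies the pairwise perfect phylogeny test, assuming an all-zero ancestral sequence. All function names are hypothetical.

```python
from itertools import combinations, product


def resolutions(genotype):
    """All unordered haplotype pairs explaining one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    out = set()
    for bits in product([0, 1], repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        out.add(tuple(sorted([tuple(h1), tuple(h2)])))
    return sorted(out)


def fits_perfect_phylogeny(haplotypes):
    """Pairwise test, assuming the ancestral sequence is all zeros."""
    conflict = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(len(haplotypes[0])), 2):
        if conflict <= {(h[i], h[j]) for h in haplotypes}:
            return False
    return True


def has_pph_solution(genotypes):
    """Brute force: try every combination of resolutions."""
    for choice in product(*(resolutions(g) for g in genotypes)):
        if fits_perfect_phylogeny([h for pair in choice for h in pair]):
            return True
    return False


def candidate_homoplasy_sites(genotypes):
    """0-based sites s for which M-{s} has a PPH solution."""
    hits = []
    for s in range(len(genotypes[0])):
        reduced = [g[:s] + g[s + 1:] for g in genotypes]
        if has_pph_solution(reduced):
            hits.append(s)
    return hits


M = [[0, 2, 0], [1, 2, 2]]  # genotypes a and b from the homoplasy example
print(has_pph_solution(M))           # False
print(candidate_homoplasy_sites(M))  # [0, 1]
```

Each candidate site would still need the second check (can one homoplasy at s yield an HI solution?), which this sketch does not attempt.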
[Figure: M is split into M-{i3} and site i3; PPH on M-{i3} gives Mh-{i3}, and site i3 gives h{i3}]
Combine Mh-{i3} with h{i3}
Assume Mh-{i3} is fixed.
Haplotypes for the same genotype must pair up.
Two ways to pair r2 s2 r2’ s2’
[Figure: Mh-{i3} and h{i3} combined into candidate haplotype matrices Mh1 and Mh2]
• 4 ways to pair at site i3 in this example.
• The number of pairings is exponential in general, even for a single PPH solution
• Need a polynomial-time method that avoids trying all the pairings
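The blow-up is easy to reproduce. The sketch below takes already-fixed haplotype pairs for the other sites and enumerates every way to add the remaining site. The data and the name `extend_with_site` are hypothetical, and the new site is appended at the end for simplicity. Every genotype that is 2 at the new site and has two distinct haplotypes doubles the count.

```python
from itertools import product


def extend_with_site(pairs, site_column):
    """Return every way to append one site to already-fixed haplotype pairs."""
    per_genotype = []
    for (h1, h2), g in zip(pairs, site_column):
        if g == 2 and h1 != h2:
            # Heterozygous: either haplotype of the pair may carry the 1.
            options = [(h1 + "0", h2 + "1"), (h1 + "1", h2 + "0")]
        elif g == 2:
            # Identical haplotypes: both assignments give the same pair.
            options = [(h1 + "0", h2 + "1")]
        else:
            # Homozygous: both haplotypes get the same allele.
            options = [(h1 + str(g), h2 + str(g))]
        per_genotype.append(options)
    return list(product(*per_genotype))


# Hypothetical fixed pairs for three genotypes, and the genotype column of the new site.
fixed_pairs = [("00", "01"), ("10", "11"), ("00", "00")]
site = [2, 2, 1]
combos = extend_with_site(fixed_pairs, site)
print(len(combos))  # 4
```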
Convert the perfect phylogeny tree from the PPH solution to an unrooted tree
Recurrent mutation at site s
[Figure: tree T with site s mutating on two edges, and tree Tr; leaf groups O1, L1, L2, O2]
Deleting s induces tree Tr; s induces a split Ts:
L1, L2 | O1, O2
[Figure: tree T and tree Tr with leaf groups O1, L1, L - L1, O2; split Ts with sides L (containing L1 and L - L1) and O]
Find two subtrees Ts1, Ts2 in Tr such that
Ts1 and Ts2 correspond to one side of Ts
1. Pick one side of the partition from Ts
2. Pick the leaves of Tr corresponding to the chosen side
3. Check whether the selected leaves fit into two subtrees
s2 can pair with r2’
1. May need to refine a non-binary vertex before picking subtrees
Solution
• Efficient graph-coloring based method to select two subtrees (skipped)
• Implemented in C++
• Simulated data generated with the program ms
• Compared to PHASE (a haplotyping program)
– Accuracy: comparable
– Speed: at least 10x faster
– 100x100 data: about 3 seconds
• Can identify the homoplasy site with high accuracy: >95% in simulation
[Figure: M split into M_L (left sites) and M_R (right sites)]
Split M by cutting between two sites
Build a perfect phylogeny for each of the two partitions (T_L and T_R)
SPR: subtree-prune-regraft operation
1-recombination condition is equivalent to distance-SPR(T_L, T_R) = 1
• Brute-force 1-SPR idea leads to exponential time when T_L or T_R is not binary.
• Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time (not in paper).
• Contributions
– Assuming bounded number of PPH solutions
1. Polynomial time algorithm for H1-IPPH problem
2. Polynomial time algorithm for R1-IPPH problem
3. Possible extension to more than 1 homoplasy event.
• Open problems
– Efficient haplotyping with more than 1 recombination.
– Remove the assumption that the number of PPH solutions for M-{s} is bounded.
• Questions?