
Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event

Yun S. Song, Yufeng Wu and Dan Gusfield

University of California, Davis

WABI 2005

Haplotyping Problem

• Diploid organisms have two (not necessarily identical) copies of each chromosome.

• A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs)

• SNP: a site where two types of nucleotides occur frequently, encoded as 0 or 1

• The mixed description is a genotype, a vector over {0, 1, 2}

– If both haplotypes are 0, genotype is 0

– If both haplotypes are 1, genotype is 1

– If one is 0 and the other is 1, genotype is 2

Haplotypes and Genotypes

Sites: 1 2 3 4 5 6 7 8 9

0 1 1 1 0 0 1 1 0

1 1 0 1 0 0 1 0 0

Two haplotypes per individual

Merge the haplotypes

2 1 2 1 0 0 1 2 0

Genotype for the individual

• Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes
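The encoding above can be sketched in a few lines of Python. This is an illustration only: the function names `merge` and `phasings` are hypothetical, and haplotypes are assumed to be tuples of 0/1 values. `merge` builds a genotype from two haplotypes, and `phasings` lists every haplotype pair that could explain a genotype, which is the search space of the HI problem for one individual.

```python
from itertools import product

def merge(h1, h2):
    """Genotype of two haplotypes: the shared value where they agree, 2 where they differ."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def phasings(genotype):
    """Yield every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        yield h, h
        return
    # Fix the first heterozygous site to 0 in h1 so mirrored pairs are not repeated.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)

# The two haplotypes from the slide "Haplotypes and Genotypes".
h1 = (0, 1, 1, 1, 0, 0, 1, 1, 0)
h2 = (1, 1, 0, 1, 0, 0, 1, 0, 0)
g = merge(h1, h2)
pairs = list(phasings(g))
print(g, len(pairs))
```

A genotype with k sites equal to 2 has 2^(k-1) candidate pairs when k > 0, so the genotype data alone cannot single out the true pair; this is why a genetic model is needed to pick among HI solutions.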

Perfect Phylogeny Haplotyping (PPH)

• Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice among HI solutions

• Gusfield (2002) introduced PPH problem

• PPH is to find HI solutions that fit into a perfect phylogeny.

• Nice results for PPH, including a linear time algorithm

The Perfect Phylogeny Model for Haplotypes

Assume at most one mutation at each site

[Figure: a perfect phylogeny over sites 1 to 5, rooted at the ancestral sequence 00000. Site mutations label the edges, with each of the sites 1, 2, 3, 4, 5 appearing once. The extant sequences at the leaves form the set M derived by the tree: 10100, 10000, 01011, 01010, 00010.]
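Whether a set of haplotypes fits this model can be checked without building the tree. The sketch below uses the standard pairwise-compatibility test for binary sequences with an all-zero ancestor: a perfect phylogeny exists exactly when no pair of sites shows all three of the combinations 01, 10 and 11. The function name is hypothetical, and haplotypes are assumed to be strings of '0' and '1'.

```python
from itertools import combinations

def fits_perfect_phylogeny(haplotypes):
    """True if the haplotypes fit a perfect phylogeny rooted at the all-zero sequence.

    Uses the pairwise test: a conflict is a pair of sites (i, j) whose
    columns contain all of the combinations 01, 10 and 11.
    """
    n_sites = len(haplotypes[0])
    conflict = {("0", "1"), ("1", "0"), ("1", "1")}
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if conflict <= seen:
            return False
    return True

# The set M from the slide: derived by a tree with one mutation per site.
M = ["10100", "10000", "01011", "01010", "00010"]
print(fits_perfect_phylogeny(M))
```

Calling the same function on the sequences 000, 010, 101, 110 returns False, because sites 1 and 2 show all of 01, 10 and 11; such data need more than one mutation at some site, or a recombination.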

PPH Example

[Figure: genotypes, the inferred haplotypes, and the perfect phylogeny they fit.]

Imperfect Phylogeny Haplotyping (IPPH): Extending PPH

• Real biological data often have no PPH solution.

• Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (a heuristic)

• Our approach: IPPH with an explicit genetic model that allows a small amount of

– Homoplasy, i.e. back or recurrent mutation

– Recombination

• Goal: Extend the usage of PPH

– Real data: may be a small perturbation from PPH

– Haplotype blocks: low recombination or homoplasy

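Back/Recurrent Mutation for Haplotypes

More than one mutation at a site

Data: 000, 010, 101, 110

[Figure: a tree rooted at 000 whose edges are labeled with the mutating sites 2, 1, 1 and 3; site 1 appears on two edges. Node labels shown: 010, 100, 110, 101, 000.]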

Recombinations: Single Crossover

• Recombination is one of the principal genetic forces shaping genetic variation

• Two equal-length sequences generate a third sequence of the same length

110001111111001

Prefix

000110000001111

Suffix

11000 0000001111 breakpoint
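A single crossover is easy to state in code. The sketch below is an illustration with hypothetical names: the recombinant takes the first `breakpoint` sites from one parent and the remaining sites from the other.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Recombinant made of prefix_parent[:breakpoint] followed by suffix_parent[breakpoint:]."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]

# The two parents from the slide, with the breakpoint after site 5.
child = single_crossover("110001111111001", "000110000001111", 5)
print(child)
```

With the slide's parents and a breakpoint after site 5, the child is the prefix 11000 joined to the suffix 0000001111.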

IPPH (Imperfect Phylogeny Haplotyping) Problems

• Small deviation from PPH

• H1-IPPH problem

– Find a tree that allows exactly one site to mutate twice

– Every other site mutates at most once

– Derive haplotypes for the given genotypes

• R1-IPPH problem

– Find a network that has exactly one recombination event

– Each site mutates at most once

– Derive haplotypes for the given genotypes

Number of Minimum Recombinations for Haplotypes

Rmin    Rho=1    Rho=3    Rho=5
0       60.8%    23.6%     8.4%
1       31.8%    35.2%    27.6%
2        6.8%    24.8%    27.8%
3                11.6%    21.6%
4                 3.8%     9.0%
5                 0.8%     3.6%
6                 0.2%     1.4%

Frequency of minimum recombinations for small rho (scaled recombination rate): 20 sequences, 30 sites, 500 simulations

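Haplotyping with One Homoplasy

More than one mutation at a site

Genotype
      s1  s2  s3
a      0   2   0
b      1   2   2

Haplotype
      s1  s2  s3
a1     0   0   0
a2     0   1   0
b1     1   0   1
b2     1   1   0

[Figure: 1-homoplasy tree rooted at 000 whose edges are labeled with the mutating sites 2, 1, 1 and 3; site 1 appears on two edges. Internal node labels shown: 010 and 100. Leaves: a2, b2, b1, a1.]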

Algorithm for H1-IPPH

• For each site s in the input genotype data M

– Test whether M-{s} has PPH solutions

– If not, move to next site.

– Otherwise, check whether 1 homoplasy at site s can lead to HI solutions

– If yes, stop and report result

• Assume only one PPH solution for M-{s}

• But how to find solutions with 1 homoplasy at s efficiently?
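The first test in the loop above can be sketched as follows. This is not the talk's implementation: it decides PPH by brute force over all phasings, which is exponential and only suitable for tiny inputs, and it assumes an all-zero ancestral sequence. All function names are hypothetical. The second step, checking whether one homoplasy at site s yields an HI solution, is not shown.

```python
from itertools import combinations, product

def phasings(genotype):
    """Yield every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        yield tuple(genotype), tuple(genotype)
        return
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)

def fits_perfect_phylogeny(haps):
    """Pairwise test, assuming an all-zero root: no site pair shows 01, 10 and 11."""
    for i, j in combinations(range(len(haps[0])), 2):
        seen = {(h[i], h[j]) for h in haps}
        if {(0, 1), (1, 0), (1, 1)} <= seen:
            return False
    return True

def has_pph_solution(genotypes):
    """Brute force: try every combination of phasings (exponential time)."""
    for choice in product(*(list(phasings(g)) for g in genotypes)):
        haps = [h for pair in choice for h in pair]
        if fits_perfect_phylogeny(haps):
            return True
    return False

def candidate_homoplasy_sites(genotypes):
    """0-based sites s such that M-{s} has a PPH solution."""
    n_sites = len(genotypes[0])
    return [s for s in range(n_sites)
            if has_pph_solution([g[:s] + g[s + 1:] for g in genotypes])]

# Genotypes a and b from the slide "Haplotyping with One Homoplasy".
M = [(0, 2, 0), (1, 2, 2)]
print(has_pph_solution(M), candidate_homoplasy_sites(M))
```

For these two genotypes the full matrix has no PPH solution, while removing s1 or s2 leaves one, so only those two sites go on to the homoplasy check.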

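Example

[Figure: the genotype matrix M is split into M-{i3} and the single column {i3} for site i3. Solving PPH on M-{i3} gives the haplotype matrix Mh-{i3}, shown next to the haplotype column h{i3}.]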

Combine Mh-{i3} with h{i3}

Assume Mh-{i3} is fixed.

Haplotypes for the same genotype must pair up.

Two ways to pair r2, r2’ with s2, s2’

[Figure: Mh-{i3} combined with h{i3} in two alternative ways, Mh1 and Mh2.]

• 4 ways to try pairing i3.

• Exponential number in general, even for one PPH solution

• Need polynomial-time method to avoid trying all the pairings

Move to Trees

[Figure: Mh-{i3} and h{i3}.]

Convert the perfect phylogeny tree from the PPH solution to an unrooted tree

1 Homoplasy: from T to Tr, Ts

[Figure: tree T with a recurrent mutation at site s on two edges, and leaf groups O1, L1, L2, O2.]

Deleting s induces the tree Tr; s induces the split Ts, with L1, L2 on one side and O1, O2 on the other.

From Tr, Ts to T

[Figure: tree Tr with leaf groups O1, L1, L - L1, O2, and split Ts with L on one side and O on the other; in tree T, site s labels two edges.]

Find two subtrees Ts1, Ts2 in Tr such that Ts1 and Ts2 together correspond to one side of Ts

1. Pick one side of the partition from Ts

2. Pick the leaves of Tr corresponding to the chosen partition side

3. Check whether the selected leaves fit into two subtrees
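Step 3 can be illustrated by brute force on a small unrooted tree. The sketch below is not the graph-coloring method used in the talk: it assumes the leaf labels are already fixed, represents the tree as an edge list, and takes a "subtree" to be everything on one side of an edge. The names `build_adjacency`, `edge_side_leaf_sets` and `fits_two_subtrees` are hypothetical, and refinement of non-binary vertices is not handled.

```python
from itertools import combinations

def build_adjacency(edges):
    """Adjacency map of an undirected tree given as a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_side_leaf_sets(adj, leaves):
    """Leaf set of every subtree obtained by cutting one edge and keeping one side."""
    sides = set()
    for u in adj:
        for v in adj[u]:
            # Collect the leaves reachable from v without crossing back to u.
            stack, seen, found = [v], {u, v}, set()
            while stack:
                x = stack.pop()
                if x in leaves:
                    found.add(x)
                for y in adj[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            sides.add(frozenset(found))
    return sides

def fits_two_subtrees(adj, leaves, chosen):
    """True if the chosen leaves are exactly the leaves of two disjoint subtrees."""
    chosen = frozenset(chosen)
    sides = [s for s in edge_side_leaf_sets(adj, leaves) if s and s <= chosen]
    return any(a | b == chosen and not (a & b) for a, b in combinations(sides, 2))

# A small caterpillar tree: leaves a..f hang off the internal path x - y - z - w.
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("y", "z"),
         ("d", "z"), ("z", "w"), ("e", "w"), ("f", "w")]
adj = build_adjacency(edges)
leaves = {"a", "b", "c", "d", "e", "f"}
print(fits_two_subtrees(adj, leaves, {"a", "b", "e", "f"}))
```

In this toy tree the leaves {a, b, e, f} split into the subtrees below x and below w, whereas {a, c, e} would need three separate subtrees and is rejected.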

s2 can pair with r2’

1. May need to refine a non-binary vertex before picking subtree

Solution

Algorithms and Results

• Efficient graph-coloring based method to select two subtrees (skipped)

• Implemented in C++

• Simulations on data generated with the program ms

• Compared with PHASE (a haplotyping program)

– Accuracy: comparable

– Speed: at least 10x faster

– 100x100 data: about 3 seconds

• Can identify the homoplasy site with high accuracy: >95% in simulation

Algorithm for R1-IPPH

[Figure: the genotype matrix M split into a left part ML and a right part MR.]

Split M into ML and MR by cutting between two sites

PPH Solutions

Build perfect phylogeny for two partitions
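The split step can be sketched on haplotypes. This is a simplification made for illustration: it assumes the haplotypes are already known (the talk works on genotypes), assumes an all-zero ancestral sequence, and only tests that each side of a cut fits a perfect phylogeny. It does not perform the SPR comparison of the two trees. All names are hypothetical.

```python
from itertools import combinations

def fits_perfect_phylogeny(haps):
    """Pairwise test, assuming an all-zero root: no site pair shows 01, 10 and 11."""
    for i, j in combinations(range(len(haps[0])), 2):
        seen = {(h[i], h[j]) for h in haps}
        if {("0", "1"), ("1", "0"), ("1", "1")} <= seen:
            return False
    return True

def candidate_breakpoints(haps):
    """Cut positions c (1 <= c < n_sites) where both haps[:c] and haps[c:] columns fit."""
    n_sites = len(haps[0])
    cuts = []
    for c in range(1, n_sites):
        left = [h[:c] for h in haps]    # sites 1..c
        right = [h[c:] for h in haps]   # sites c+1..n_sites
        if fits_perfect_phylogeny(left) and fits_perfect_phylogeny(right):
            cuts.append(c)
    return cuts

# Four haplotypes that have no perfect phylogeny as a whole.
haps = ["000", "010", "101", "110"]
print(candidate_breakpoints(haps))
```

Here the only usable cut is after site 1, since sites 1 and 2 conflict and must end up on opposite sides.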

1-SPR operation

SPR: subtree-prune-regraft operation

The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1

Algorithm for R1-IPPH

• The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary.

• Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time (not in the paper).

Conclusions

• Contributions

– Assuming bounded number of PPH solutions

1. Polynomial time algorithm for H1-IPPH problem

2. Polynomial time algorithm for R1-IPPH problem

3. Possible extension to more than 1 homoplasy event.

• Open problems

– Haplotyping with more than 1 recombination efficiently.

– Remove assumption that number of PPH solutions for M-{s} is bounded.

• Questions?

Thank you
