Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies

Charles Semple

Biomathematics Research Centre

Department of Mathematics and Statistics

University of Canterbury

New Zealand

Joint work with Magnus Bordewich, Catherine McCartin e-Science Institute, Edinburgh 2007

Subtree Prune and Regraft (SPR)

Example.

r r a b

S c d

1 SPR a

1 SPR b

T

1 c r d a c

T

2 d b

Applications of SPR

Used

I.

As a search tool for selecting the best tree in reconstruction algorithms.

II.

To quantify the dissimilarity between two phylogenetic trees.

III.

To provide a lower bound on the number of reticulation events in the case of non-tree-like evolution.

For II and III , one wants the minimum number of SPR operations to transform one phylogeny into another.

This number is the SPR distance between two phylogenies S and T .

The Mathematical Problem

MINIMUM SPR

Instance: Two rooted binary phylogenetic trees S and T .

Goal: Find a minimum length sequence of single SPR operations that transforms S into T .

Measure: The length of the sequence.

Notation: Use d

SPR

(S, T) to denote this minimum length.

Theorem (Bordewich, S 2004)

MINIMUM SPR is NP-hard.

Overriding goal is to find (with no restrictions) the exact solution or a heuristic solution with a guarantee of closeness .

Algorithms for NP-Hard Problems

Fixed-parameter algorithms are a practical way to find optimal solutions if the parameter measuring the hardness of the problem is small.

Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.

Agreement Forests

A forest of T is a disjoint collection of phylogenetic subtrees whose union of leaf sets is X U r .

Example.

r a b c

S d e f r a b c

F

1 d e f

Agreement Forests

A forest of T is a disjoint collection of phylogenetic subtrees whose union of leaf sets is X U r .

Example.

r a b c

S d e f r a b c

F

1 d e f

Agreement Forests

An agreement forest for S and T is a forest of both S and T .

Example.

r r a b c

S d e f r a b c

F

1 d e f d e f

T a b c r a b c

F

2 d e f

Agreement Forests


Example.

r r a b c

S d e f r a b c

F

1 d e f d e f

T a b c r a b c

F

2 d e f

Agreement Forests


Example.

r r a b c

S d e f r a b c

F

1 d e f d e f

T a b c r a b c

F

2 d e f

Theorem.

(Bordewich, S, 2004)

Let S and T be two binary phylogenetic trees. Then d

SPR

(S,T) = size of maximum-agreement forest - 1 .

o It’s fast to construct from a maximum-agreement forest for S and T a sequence of SPR operations that transforms S into T .

Reducing the Size of the Instance

Subtree Reduction Chain Reduction

Fixed-Parameter Algorithms

The underlying idea is to find an algorithm whose running time separates the size of the problem instance from the parameter of interest.

One way to obtain such an algorithm is to reduce the size of the problem instance, while preserving the optimal value (kernalizing the problem) .

Are the subtree and chain reductions enough to kernalize the problem?


Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then n’ < 28d

SPR

(S,T) .

Corollary.

(Bordewich, S 2004) MINIMUM SPR is fixed-parameter tractable.

1.

Repeatedly apply the subtree and chain rules.

2.

Exhaustively find a maximum-agreement forest by deleting edges in S and comparing with T .

Running time is O((56k) k + p(n)) compared with O((2n) k ) , where k=d

SPR

(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.


A second way to obtain a fixed-parameter algorithm is using the method of bounded search trees .

Preliminaries o A rooted triple is a binary tree with three leaves.

o a b ab|c c

A tree S displays ab|c if the last common ancestor of ab is a descendant of the last common ancestor of ac .

r

Fixed-Parameter Algorithms r

Observation.

If F is an agreement forest for S and T , then F can be obtained from S by deleting |F|-1 edges.

Recall.

If F is maximum, then

|F|-1=d

SPR

(S,T) .

a d b c

S d e f e f

T a b c

r


Observation.

If F is an agreement forest for S and T , then F can be obtained from S by deleting |F|-1 edges.

Recall.

If F is maximum, then

|F|-1=d

SPR

(S,T) .

a d b c

S d e f e f

T a b c

r


1.

Find a minimal triple xy|z in S not in T .

2.

Branch into 4 computational paths e for each path until no such triple.

x

, e y

, e z

, e r

, and repeat

3.

For each path, find a pair of overlapping components t and branch into 2 paths e s

Repeat until no such components.

s

and t

, e t

.

t a d b c

S d e f e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t a d b c

S d e f e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a a e b b c

S d d e f e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a a e r e b b c

S d d e f e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a a e r e b b e d c

S d d e f e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.


S d e d e a e b e r d e f e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.



T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t a d b c

S d e f e d e a e b e r e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a a r e r e b b c

S d e e d e f e f

T a e d e a e b e r b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a a r e r e b b c

S d e e d e f e f

T a e d e a e b e r b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.



T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.



T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t a d b c


T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t a d b c


T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t a d b c

S d e f e d e a e b e r v st e f

T a b c

r

Fixed-Parameter Algorithms e r s

1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a t d b c

S d e f e d e a e b e r v st e f

T a b c

r

Fixed-Parameter Algorithms e r s

1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t e a t d b c

S d e f e d e a e b e r e s e t v st e f

T a b c

r


1.


2.


x

, e y

, e z

, e r

, and repeat

3.



s

and t

, e t

.

t a d b c

S d e f e d e a e b e r e s e t e f

T a b c

r


Key Lemma.

d

SPR

(S,T) <= k if and only if one of the computational paths of length at most k results in an agreement forest for S and T .

Theorem. (Bordewich, McCartin, S

2007)

The bounded search tree algorithm has running time O(4 k n 4 ) .

Combining the methods of kernalization and bounded search gives an FPT algorithm with running time O(4 k k 4 +p(n)) .

a b c

S d e f e d e b e r e a d e s e t e f

T a b c

Approximation Algorithms

Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.

An r-approximation algorithm A for MINIMUM SPR means that the size of an agreement forest (minus 1) for S and T returned by A is at most r times d

SPR

(S,T) .

Example. If r=3 , then the size of any agreement forest (minus 1) for S and T returned by A is at most 3 times d

SPR

(S,T) .

Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta

2006 establish a 5-approximation algorithm ( O(n) ) for MINIMUM

SPR .

r

Approximation Algorithms r

1.


2.

Delete each of components t e s

, e t e x

, e t z

, e

3.

Find a pair of overlapping r

, and

and delete

. Repeat until no such components.

s

and t y

, e repeat until no such triple.

Theorem. (Bordewich, McCartin, S

2007)

This gives a 4-approximation alg. for

MINIMUM SPR ( O(n 5 ) ).

Deleting only e x

, e z

, e r gives a 3approximation algorithm.

a b c

S d e f e e d r e a e b e s d e t e f

T a b c

Exact algorithms r=1

(B, S 07)

1 + 1/2112

Maximum independent set in planar graphs (PTAS)

MINIMUM SPR

No r, unless

P=NP

(B, M,S 07)

3

5

(B S,M, A 06)

1. Travelling salesman problem

2. Maximum clique

Modelling Hybridization Events with SPR Operations

Reticulation processes cause species to be a composite of

DNA regions derived from different ancestors.

Processes include o horizontal gene transfer , o hybridization , and o recombination .

… molecular phylogeneticists will have failed to find the `true tree’, not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot be properly represented as a tree.

Ford Doolittle, 1999

(Dalhousie University)


A single SPR operation models a single hybridization event (Maddison

1997) .

r

Example.

r a r b

S c d a b

T c d a b

H c d



1997) .

r

Example.

r a r b

S c d a b

T c d a b

H c d



1997) .

r

Example.

r a r b

S c d a b

T c d a b

H c d



1997) .

r

Example.

r a r b

S c d a b

T c d a b

H c d deleting hybrid edges r a d

F b c

A Fundamental Problem for Biologists

Given an initial set of data that correctly repesents the tree-like evolution of different parts of various species genomes, what is the smallest number of reticulation events required that simultaneously explains the evolution of the species?

This smallest number o Provides a lower bound on the number of such events.

o Indicates the extent that hybridization has had on the evolutionary history of the species under consideration.

Since 1930’s botantists have asked the question: How significant has the effect of hybridization been on the New Zealand flora?

Trees and Hybridization Networks

H explains T if T can be obtained from a rooted subtree of H by suppressing degree-2 vertices.

Example.

a b

S c d a c

T b d a b

H

1 c d a b

H

2 c d



Example.

a b

S c d a c

T b d a b

H

1 c d a b

H

2 c d



Example.

a b

S c d a c

T b d a b

H

1 c d a b

H

2 c d

The Mathematical Problem

MINIMUM HYBRIDIZATION

Instance: Two rooted binary phylogenetic trees S and T .

Goal: Find a hybridization network H that explains S and T , and minimizes the number of hybridization vertices .

Measure: The number of hybridization vertices in H .

Notation: Use h(S, T) to denote this minimum number.

Example: Arbitrary SPR operations not sufficient.

r a b c

F

1 d e f

o A sequence of SPR operations that avoids creating directed cycles to make a hybridization network that explains S and T .

o If one minimizes the length of an (acyclic) sequence , does the resulting network minimize the number of hybridization events to explain S and T ?

o YES, and such a sequence corresponds to an acyclic-agreement forest .

Theorem.

(Baroni, Grünewald, Moulton, S, 2005)

Let S and T be two binary phylogenetic trees. Then h(S,T) = size of maximum-acyclic agreement forest - 1 .

o It’s fast to construct from a maximum-acyclic agreement for S and T a hybridization network that realizes h(S,T) .

Theorem.

(Bordewich, S, 2007)

MINIMUM HYBRIDIZATION is NP-hard.


Subtree Reduction Chain Reduction


Are the subtree and chain reductions enough to kernalize the problem?

Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then n’<14h(S,T) .

Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is fixed-parameter tractable.

Running time is O((28k) k + p(n)) compared with O((2n) k ) , where k=h(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.


Cluster Reduction (Baroni 2004)

+

A Grass (Poaceae) Dataset (Grass Phylogeny Working Group,

Düsseldorf) o Ellstrand, Whitkus, Rieseburg 1996 (Distribution of spontaneous plant hybrids) o For each sequence, used fastDNAml to reconstruct a phylogenetic tree (H. Schmidt).

Chloroplast ( phyB ) sequences Nuclear (ITS) sequences






h=3 h=1

Chloroplast ( phyB ) sequences h=4

Nuclear (ITS) sequences h=0

pairwise combination ndhF ndhF ndhF ndhF ndhF phyB phyB phyB phyB rbcL rbcL rbcL rpoC2 rpoC2 waxy

ITS rpoC2 waxy

ITS waxy

ITS

ITS phyB rbcL rpoC2 waxy

ITS rbcL rpoC2 waxy

40

36

34

19

46

21

21

14

30

26

12

29

10

31

15

# overlapping taxa h(S,T)

14

13

12

9 at least 15

4

7

3

8

13

7 at least 9

1 at least 10

8 running time

2000 MHz

CPU, 2GB RAM

11h

11.8h

26.3h

320s

1s

180s

1s

19s

29.5h

230s

1s

620s

Bordewich, Linz, St John, S, 2007

Computing d

SPR

(S,T) and h(S,T) d

SPR

(S,T)

1.

FPT using kernalization and bounded search tree methods ( O(4 k k 4 +p(n)) ).

2.

No cluster-based reduction.

3.

3-approximation algorithm.

h(S,T)

1.

FPT using kernalization only

( O((28k) k +p(n)) ). Unknown if a bounded search tree method exists.

2.

Cluster-based reduction.

3.

Unknown if there even is an approximation algorithm.


Subtree Reduction Chain Reduction a

1 a

2

S a n a n

T a

2 a

1 x

2 x

1

S’ x

3 x

3 x

2 x

1

T’

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies

Reducing the Size of the Instance

Related documents

Products

Support

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies