Genome Rearrangement Phylogeny

advertisement
Genome Rearrangement Phylogeny
Robert K. Jansen
School of Biology
University of Texas at Austin
Bernard M.E. Moret
Department of Computer Science
University of New Mexico
Li-San Wang
Tandy Warnow
Department of Computer Sciences
University of Texas at Austin
Outline
• Introduction
• Genome rearrangement phylogeny
reconstruction
• Application
• Other methods
• Future research
2
New Phylogenetic Signals
• Large-throughput sequencing efforts lead to larger
datasets
− Challenge: inferring deep evolutionary events
• Biologists turning to “rare genomic changes”
− Rare
− Large state space
− High signal-to-noise ratio
− Potential for clarifying early evolution
− Best studied: gene order evolution
(genome rearrangement)
3
Genomes As Signed Permutations
1 –5 3 4 -2 -6
or
5 –1 6 2 -4 -3
etc.
4
Gene Order Data
• Rare changes on the genomic scale
• Large state space
DNA:
4 states/character
− Protein (amino acid sequence):
20 states/character
− Circular gene order with 120 genes:
−
2119 119! 3.7049 10 232
states/character
• High signal-to-noise ratio
5
Genomes Evolve by Rearrangements
1
2
3
4
5
6
7
8
9
10
Inversion:
1 2 –6 –5 -4 -3
7
8
9
10
Transposition:
1 2 7 8 3
5
6
9
10
Inverted Transposition:
1 2 7 8 –6 -5 -4 -3
9
10
4
6
Edit Distances Between Genomes
• (INV) Inversion distance [Hannenhalli & Pevzner 1995]
−
Computable in linear time [Moret et al 2001]
• (BP) Breakpoint distance [Watterson et al. 1982]
−
−
Computable in linear time
NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999]
A=
1
2
3
4
5
6
7
8
9
10
B=
1
2
3 -8 -7 -6
4
5
9
10
BP(A,B)=3
7
Our Model: the Generalized
Nadeau-Taylor Model [STOC’01]
• Three types of events:
−
−
−
Inversions (INV)
Transpositions (TRP)
Inverted Transpositions (ITP)
• Events of the same type are equiprobable
• Probabilities of the three types have fixed ratio
• We focus on signed circular genomes in this talk.
8
Simulation Study Protocol
Synthetic Input
Evolutionary Process
A
1 -4
2 -3
5
6
B
1
3
2
5
4
6
C
1
3 -2
5
6
4
D
1
3
4
5 -2
6
E
1
4
5 -3 -2
6
F
1
2
3
6
4
5
Phylogenetic
Method
True Tree
C
Known in
simulation
A
D
B
A
B
D
E F
C
E
F
Inferred Tree
9
Quantifying Error
True Tree
C
A
B
D
E F
FN: false negative
D
B
C
E
F
A
Inferred Tree
(missing edge)
1/3=33.3% error rate
10
Outline
• Genome rearrangement evolution
• Genome rearrangement phylogeny
reconstruction
• Application
• Other methods
• Future research
11
Gene Order Parsimony
A
A
B
C
D
C
X
Y
E
F
Z
B
D
Length (T) =
E
W
F
min
AX+BX+XY+CY+YZ+DZ+YW+EW+FW
X,Y,Z,W
12
Breakpoint Phylogeny
[Sankoff & Blanchette 1998]
• “Maximum Parsimony”-style problem:
− Find tree(s), leaf-labeled by genomes, with
shortest breakpoint length
• NP-hard problem on two levels:
− Find the shortest tree (the space of trees has
exponential size)
− Given a tree, find its breakpoint length
(Even for a tree with 3 leaves, but can be
reduced to TSP)
• BPAnalysis [Sankoff & Blanchette 1998]
− Takes 200 years to compute our 13-taxon
dataset on a Sun workstation
13
BPAnalysis
• Tree length evaluation for EVERY tree
• Given a fixed tree topology, evaluate the tree
length:
−
Iteratively evaluate the median problem (tree
length for a 3-leaf tree)
AA
C
C
X’
X
Y
Y’
E
Y
B
ZZ
B
D
W
F
14
GRAPPA
(Genome Rearrangement Analysis under
Parsimony and other Phylogenetic Algorithms)
http://www.cs.unm.edu/~moret/GRAPPA/
• Uses lowerbound techniques to speed up
• Used on real datasets, producing thousand-fold
speedups over BPAnalysis [ISMB’01]
• Contributors: (led by Bernard Moret at UNM)
U. New Mexico
U. Texas at Austin
Universitá di Bologna, Italy
15
The Circular Lowerbound of the
Length of a Tree
•
Given a tree, we can lowerbound its length very
quickly:
2w(T )  lb (T )  d1,2  d 2,3  d3,4  d 4,1
2w(T )
lb (T )
1
2
3
4
16
The Lowerbound Technique
•
Avoid any tree X without potential:
− tree X whose lowerbound lb(X) is higher than
twice the length c(T) of the best tree T
•
Finding a good starting tree quickly is of utmost
importance
•
We turn to distance-based methods
− Neighbor joining (NJ) [Saitou and Nei 1987]
−
Weighbor [Bruno et al. 2000]
17
Additive Distance Matrix and
True Evolutionary Distance (T.E.D.)
S3
S1
5
7
3
4
S4
5
1
8
S1 S2 S3
S1 0 9 15
S2
0 14
S3
0
S4
S5
S4
14
13
13
0
S5
17
16
16
13
0
S2
S5
Theorem [Waterman et al. 1977] Given an m×m
additive distance matrix, we can reconstruct a tree
realizing the distance in O(m2) time.
18
Error Tolerance of Neighbor Joining
Theorem [Atteson 1999]
Let {Dij} be the true evolutionary distances, and
{dij} be the estimated distances for T.
Let
be the length of the shortest edge in T. If
for all taxa i,j, we have

1
| Dij  d ij |  
2
then neighbor joining returns T.
19
BP and INV
BP/2 vs K
(120 genes)
(K: Actual number of inversions)
INV vs K
(Inversion-only evolution)
20
NJ(BP) [Blanchette, Kunisawa, Sankoff 1999]
and NJ(INV)
Inversion only
Transpositions/
inverted transpositions only
120 genes, 160 leaves
Uniformly Random Tree
21
Estimate True Evolutionary Distances
Using BP
To use the scatter plot to
estimate the actual number
of events (K):
1. Compute BP/2
2. From the curve, look up
the corresponding value
of K
(2)
(1)
BP/2 vs K
(120 genes)
(K: Actual number of inversions)
(Inversion-only evolution)
22
True Evolutionary Distance (t.e.d.)
Estimators for Gene Order Data
T.E.D.
Estimator
Exact-IEBP
[WABI’01]
Based on the Breakpoint
Expectation of distance
(Exact)
Approx-IEBP
[STOC’01]
EDE
[ISMB’01]
Breakpoint
distance
(Approx.)
Inversion
distance
(Approx.)
Derivation
Analytical
Analytical
Empirical
Model
knowledge
Required
Required
Inversiononly
IEBP: Inverting the Expected BreakPoint distance
EDE: Empirically Derived Estimator
23
True Evolutionary Distance Estimators
BP vs K
(120 genes)
(K: Actual number of inversions)
Exact-IEBP vs K
(Inversion-only evolution)
24
Variance of True Evolutionary
Distance Estimators
• There are new distance-based
phylogeny reconstruction
methods (though designed for
DNA sequences)
− Weighbor [Bruno et al. 2000]
These methods use the
variance of good t.e.d.’s, and
yield more accurate trees than
NJ.
• Variance estimates for the t.e.d.s
[Wang WABI’02]
− Weighbor(IEBP),
Weighbor(EDE)
K vs Exact-IEBP (120 genes)
25
Using T.E.D. Helps
120 genes
160 leaves
Uniformly random tree
Transpositions/inverted
transpositions only
(180 runs per figure)
5%
26
Observations
• EDE is the best distance estimator when used
with NJ and Weighbor.
• True evolutionary distance estimators are reliable
even when we do not know the GNT model
parameters (the probability ratios of the three
types of events).
27
Outline
• Genome rearrangement evolution
• Genome rearrangement phylogeny reconstruction
• Application
• Other methods
• Future research
28
Percentage of Trees Eliminated
Through Bounding [ISMB’01]
edge length=2
# taxa
10
20
40
80
160
10
0
0
0
1%
1%
20
0
80%
91%
1%
1%
40
91%
100%
100%
100%
100%
80
99%
100%
100%
100%
100%
160
100%
100%
100%
100%
100%
320
100%
100%
100%
100%
100%
#genes
Uses NJ(EDE) as starting tree
29
Campanulaceae cpDNA
• 13 taxa (tobacco as outlier)
• 105 gene segments
• GRAPPA finds 216 trees with shortest breakpoint
length (out of 654,729,075 trees)
• Running Time:
−
−
−
BPAnalysis takes 2 centuries on a Sun
workstation
GRAPPA takes 1.5 hours on a 512-node
supercluster
About 2300-fold speedup on a single node
30
Trachelium
Campanula
Adenophora
Symphandra
Legousia
Asyneuma
Triodanis
Wahlenbergia
Codonopsis
Merciera
Cyananthus
Platycodon
Tobacco
Campanulaceae [Moret et al. ISMB 2001]
Strict consensus of 216 optimal trees found by GRAPPA
6 out of 10 max.
edges found
31
Outline
• Genome rearrangement evolution
• Genome rearrangement phylogeny reconstruction
• Application
• Other methods
• Future research
32
“Fast” Approaches for
Genome Rearrangement Phylogeny
• Basic technique: encode data as strings and apply
maximum parsimony
• Running time exponential in the number of
genomes, but polynomial in the number of genes
(faster than GRAPPA)
• MPBE
[ISMB’00]
Maximum Parsimony using Binary Encodings
• MPME [Boore et al. Nature ’95, PSB’02]
Maximum Parsimony using Multi-state Encodings
• The length of a tree using these two methods is a
lowerbound of the true breakpoint length
[Bryant ’01]
33
Maximum Parsimony using Binary
Encoding (MPBE)
Input genome (circular)
A: 1 2 3 4
= -4 –3 –2 –1
B: 1 -4 -3 –2
= 2 3 4 -1
C: 1 2 -3 –4
= 4 3 –2 -1
MPBE Strings
A: 1
B: 0
C: 1
1
1
0
1
1
0
1
0
0
0
1
0
0
1
0
0
0
1
0
0
1
0
0
1
34
Maximum Parsimony using
Multistate Encoding (MPME)
Input genome (circular)
A: 1 2 3 4
= -4 –3 –2 –1
B: 1 -4 -3 –2
= 2 3 4 -1
C: 1 2 -3 –4
= 4 3 –2 -1
MPME Strings
1
2
3
4 -1 –2 –3 -4
A: 2 3 4 1 –4 –1 –2 -3
B: -4 3 4 –1 2 1 –2 -3
C: 2 –3 -2 3 4 –1 -4 1
We use PAUP to
solve Maximum
Parsimony
=> Constraint:
number of states per
site cannot exceed 32
35
NJ vs MP
(120 genes, 160 genomes)
All three event types equiprobable
(datasets that exceed 32-state limit for MPME are dropped)
36
Inversion Phylogeny
• Inversion median has higher running time
than breakpoint median
• Inversion phylogeny overall has shorter
running time than breakpoint phylogeny, and
returns more accurate trees
[Moret et al. WABI ’02]
37
DCM-GRAPPA [Moret & Tang 2003]
• Disk-Covering Method: divide the original
problem into subproblems [Huson, Nettles, Parida,
Warnow and Yooseph, 1998]
• Uses inversion distance
• DCM-GRAPPA: can now process thousands of
genomes, each having hundreds of genes
38
Ongoing and Future Research
• Genome rearrangement phylogeny with
unequal gene content (duplications, deletions,
etc.)
• Non-uniform genome rearrangement models
(Segment-length dependent model, hotspots)
39
Acknowledgements
• University of Texas
Tandy Warnow (Advisor)
Robert K. Jansen
Stacia Wyman
Luay Nakhleh
Usman Roshan
Cara Stockham
Jerry Sun
• University of New Mexico
Bernard M.E. Moret
David Bader
Jijun Tang
Mi Yan
• Central Washington University
Linda Raubeson
40
Phylolab
Department of Computer Sciences
University of Texas at Austin
Please visit us at
http://www.cs.utexas.edu/users/phylo/
41
Download