Crunching Huge Phylogenies: A Rapid Bootstrap Algorithm and Massive Parallelism on the IBM BlueGene
Alexandros Stamatakis
Swiss Federal Institute of Technology Lausanne (EPFL)
School of Computer & Communication Sciences
Laboratory for Computational Biology and Bioinformatics
Lausanne, Switzerland
&
Swiss Institute of Bioinformatics
Alexandros.Stamatakis@epfl.ch
icwww.epfl.ch/~stamatak
The Missing Part
Data Assembly → Inference ? → Tree Analysis
Alexandros Stamatakis, October 2007
The Missing Part
Data Assembly → IBM BlueGene/L supercomputer, Rapid Bootstrapping, Bootstopping Criterion → Tree Analysis
The Big Hardware Problem
[Figure: growth from 1980 to 2007. CPU Speed: 40% p.a.; Memory Speed: 9% p.a.]
... and why this concerns Bioinformatics
[Figure: the same plot with Sequence Data added.]
Application of HPC techniques will become much more important
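The gap implied by the two growth rates (CPU speed 40% p.a., memory speed 9% p.a.) can be made concrete with a few lines of arithmetic. The sketch below assumes that both rates compound annually and hold constant over the whole 1980-2007 span shown on the slide; the function name is illustrative only.

```python
def growth_factor(rate_per_year: float, years: int) -> float:
    """Cumulative growth factor for a constant annual growth rate."""
    return (1.0 + rate_per_year) ** years


YEARS = 2007 - 1980  # time span shown on the slide

cpu = growth_factor(0.40, YEARS)     # CPU speed: 40% p.a.
memory = growth_factor(0.09, YEARS)  # memory speed: 9% p.a.
gap = cpu / memory                   # how far memory speed has fallen behind

print(f"CPU speed factor:    {cpu:,.0f}")
print(f"Memory speed factor: {memory:,.1f}")
print(f"CPU/memory gap:      {gap:,.0f}")
```

Under these assumptions the gap comes out at several hundredfold, which is why memory access patterns matter as much as raw floating point throughput.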
Cache Hierarchy
Outline
● Introduction
● Computation of Phylogenies
● Maximum Likelihood
● Web & Grid Services
● Three Steps Towards the Tree of Life
● Parallelism on IBM BlueGene/L
● Rapid Bootstrapping
● A Bootstopping criterion
● Related Projects
● Outlook
Phylogenetics
Input: “good” multiple Alignment
Output: unrooted binary tree
Various methods for phylogenetic inference
• Neighbour Joining (fast & simple)
• Maximum Parsimony (relatively fast & simple)
• Maximum Likelihood (complex & slow)
• Bayesian Methods (complex & slower)
ML & Bayesian: explicit model choice
Complex Methods & Models required to reconstruct large & complicated trees!
Focus of this talk is on Maximum Likelihood!
The real reason for working on ML: ......
Challenges for Phyloinformatics
Holy grail: “Tree of Life”
What is a good alignment in a phylogenetic context?
Simultaneous alignment and tree building
Improve/extend models ... but thereby size of computable trees decreases!
More HPC awareness
Exploit multi-core architectures
Amount of available data grows at a higher rate than algorithms are getting faster
The algorithmic problem
The number of trees explodes!
BANG !
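How fast the number of trees explodes can be checked directly. For n taxa there are (2n-5)!! = 3 · 5 · ... · (2n-5) distinct unrooted binary topologies, a standard counting result. A minimal sketch follows; the function name is an illustrative choice.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Number of distinct unrooted binary tree topologies: (2n-5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Taxon k can be attached to any of the 2k-5 branches of a tree with k-1 taxa.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


for n in (4, 5, 10, 20, 50):
    print(n, num_unrooted_binary_trees(n))
```

Already at 10 taxa there are more than two million topologies, so exhaustive search is not an option for alignments of realistic size.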
Maximum Likelihood
[Figure, built up over several slides: an alignment of four sequences (Seq1 to Seq4) of length m; a substitution model over A, C, G, T; prior probabilities (empirical base frequencies) πA, πC, πG, πT; an unrooted tree connecting Seq 1 to Seq 4 with branch lengths b1 to b5 and a virtual root vr; at inner nodes, vectors of length m with entries P(A) P(C) P(G) P(T).]
Lots of floating point operations!
optimize branch lengths
optimize model parameters
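The likelihood computation pictured above can be sketched for a four-sequence tree. The code below is an illustration under stated assumptions, not the RAxML implementation: it assumes the Jukes-Cantor (JC69) substitution model with equal base frequencies, the topology ((Seq1,Seq2),(Seq3,Seq4)) with b5 as the inner branch, and unambiguous characters only. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc69(t):
    """4x4 JC69 transition probability matrix for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def tip_vector(base):
    """Conditional likelihoods at a tip: 1 for the observed base, 0 otherwise."""
    return [1.0 if b == base else 0.0 for b in BASES]


def combine(q, bq, r, br):
    """P = f(Q, R): conditional likelihoods of a parent from its two children."""
    pq, pr = jc69(bq), jc69(br)
    return [
        sum(pq[x][y] * q[y] for y in range(4)) * sum(pr[x][y] * r[y] for y in range(4))
        for x in range(4)
    ]


def log_likelihood(alignment, branches):
    """Log likelihood of ((Seq1,Seq2),(Seq3,Seq4)); assumed virtual root on b5."""
    b1, b2, b3, b4, b5 = branches
    pi = [0.25] * 4          # JC69 assumes equal base frequencies
    p5 = jc69(b5)
    total = 0.0
    for s1, s2, s3, s4 in zip(*alignment):      # one iteration per site
        left = combine(tip_vector(s1), b1, tip_vector(s2), b2)
        right = combine(tip_vector(s3), b3, tip_vector(s4), b4)
        site = sum(
            pi[x] * left[x] * sum(p5[x][y] * right[y] for y in range(4))
            for x in range(4)
        )
        total += math.log(site)
    return total


alignment = ["ACGTAC", "ACGTTC", "ACCTAC", "ACGAAC"]
print(log_likelihood(alignment, [0.1, 0.1, 0.1, 0.1, 0.05]))
```

Every site needs several small matrix-vector products per inner node, which is where the floating point operations come from. Optimizing branch lengths or model parameters means re-evaluating this function many times.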
Maximum Likelihood
Goal: Obtain topology with maximum likelihood value
Problem I: Number of possible topologies is exponential in n
Problem II: Computation of likelihood function is expensive
Problem III: Probably high score accuracy required
Problem IV: High memory consumption
Solution:
• New Algorithms
• New Models
• High Performance Computing
RAxML: Randomized Axelerated Maximum Likelihood
Web & Grid Services
RAxML Web-Server at San Diego Supercomputer Center via www.phylo.org (CIPRES project)
Web-Server at Vital-IT unit of Swiss Institute of Bioinformatics: phylobench.vital-it.ch/raxml-bb/
• Includes novel search algorithm with 1 order of magnitude run-time improvement
• Since Sept 3, about 700 jobs from 130 IPs
• Extension to SwissGrid planned
• Novel algorithm with Bootstopping to be integrated into CIPRES portal soon
RAxML integration into Distributed European Infrastructure for Supercomputing Applications (www.deisa.org) started 10 days ago
Integration into Debian medical distribution
RAxML Black Box
Why are Black Boxes useful?
Levels of Parallelism
Embarrassing Parallelism: MPI, CORBA, Grid Technologies
Coarse-Grained Parallelism: MPI Version of RAxML
[Figure: PC cluster. A master process and several worker processes, connected by an interconnection network, handle the jobs labelled B-0 to B-4.]
Levels of Parallelism
Embarrassing Parallelism: MPI, CORBA, Grid Technologies
Inference Parallelism: MPI, algorithm-dependent
Loop-Level Parallelism: OpenMP, GPUs, IBM CELL (Playstation), IBM BlueGene, Clusters with fast Interconnect
Loop Level Parallelism
[Figure: tree with a virtual root; the vector P at an inner node is computed from the vectors Q and R of its two children.]
P[i] = f(Q[i], R[i])
This operation uses ≥ 90% of total execution time!
→ simple fine-grained parallelization
The real reason for assuming independent evolution among sites: ......
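Because P[i] depends only on Q[i] and R[i], the loop over sites can be cut into blocks that are processed independently. The sketch below shows only this decomposition. The kernel f is a placeholder (a plain product), the helper names are hypothetical, and Python threads are used for brevity, so no real speedup should be expected from it. OpenMP or MPI would play this role in a production code.

```python
from concurrent.futures import ThreadPoolExecutor


def f(q, r):
    """Placeholder per-site kernel; stands in for the likelihood update."""
    return q * r


def update_chunk(Q, R, start, stop):
    """Compute P[i] = f(Q[i], R[i]) for one contiguous block of sites."""
    return [f(Q[i], R[i]) for i in range(start, stop)]


def parallel_update(Q, R, n_workers=4):
    """Split the site loop into blocks and process the blocks concurrently."""
    m = len(Q)
    bounds = [(w * m // n_workers, (w + 1) * m // n_workers) for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        blocks = pool.map(lambda b: update_chunk(Q, R, *b), bounds)
        # pool.map preserves block order, so P is reassembled in site order
        return [p for block in blocks for p in block]


Q = [0.1 * i for i in range(10)]
R = [1.0 + i for i in range(10)]
P = parallel_update(Q, R)
print(P == [f(q, r) for q, r in zip(Q, R)])
```

No communication between blocks is needed during the update itself; only the final per-site values have to be combined afterwards.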
Fine-Grained Parallelism: OpenMP version of RAxML
HPC for ML (Bayesian)
Proof of Concept & Programming Techniques:
• RAxML on a Graphics Processing Unit
• RAxML on the IBM CELL & Playstation (a good excuse to buy one)
Production Level Implementations:
• RAxML with OpenMP
• RAxML with MPI
• RAxML on BlueGene
• Multi-Core Architectures
RAxML-BlueGene
To be presented at IEEE/ACM 2007 Supercomputing Conference.
Many slow processors: 1024 in one rack
512 MB or 1 GB of main memory per node
But: high performance network
Challenges:
• Distribute tree data structure among CPUs
• Exploit fast collective communication network
• For optimal efficiency: loop-level + embarrassing parallelism → hybrid parallelism with MPI
Test & Production Run Data
• With Olaf Bininda-Emonds, Jena: 2,182 mammalian sequences x 51,000 base pairs
• With Dan Janies, Ohio State: 270 Human Haplotype Map sequences x 500,000 base pairs
Largest ML analysis to date in terms of memory footprint
Loop-Level Parallelism on BlueGene
[Figure: 50 Seqs x 23,385 bp. Superlinear Speedup]
[Figure: 250 Seqs x 403,581 bp]
Embarrassing Parallelism
[Figure: four masters (M) and twelve workers (W).]
Confidence Values
Tree without node confidence values is mostly useless
Problem: Confidence value calculation is major computational obstacle
→ We can compute large trees but not analyse them: compute ≠ analyse!
Current Slow Methods
• Sampling with Bayesian methods
• Non-parametric Bootstrapping
A Tree with Confidence Values
Joint work with Marc Gottschling, Charite Hospital, Berlin
Bootstrapping
[Figure: the original alignment is perturbed into several replicate alignments, and a tree is computed for each.]
This needs to be done 100-1000 times
Embarrassingly Parallel!
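The perturbation step of the non-parametric bootstrap draws alignment columns with replacement, which is equivalent to assigning each column an integer weight. A minimal sketch follows, with a made-up toy alignment and hypothetical function names.

```python
import random


def bootstrap_weights(n_sites, rng):
    """Draw n_sites columns with replacement; count how often each was drawn."""
    weights = [0] * n_sites
    for _ in range(n_sites):
        weights[rng.randrange(n_sites)] += 1
    return weights


def replicate_alignment(alignment, weights):
    """Build the perturbed alignment by repeating column i weights[i] times."""
    return ["".join(seq[i] * w for i, w in enumerate(weights)) for seq in alignment]


rng = random.Random(42)  # fixed seed so that runs are repeatable
alignment = ["ACGTACGTACGTAC", "ACGTTCGTACGAAC", "ACCTACGTACGTAC", "ACGTACGAACGTAC"]
for _ in range(3):
    w = bootstrap_weights(len(alignment[0]), rng)
    print("".join(map(str, w)), replicate_alignment(alignment, w)[0])
```

Each weight vector sums to the alignment length, and every replicate can be analysed independently of the others, which is what makes the standard procedure embarrassingly parallel.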
Two Questions
How to compute Bootstraps faster?
How many Bootstrap replicates do we need?
Current Work: Rapid Bootstrapping Algorithm
Tested on 22 diverse (mammals, bacteria, archaea, grasses, fishes, plants, viral) real-world DNA/AA single-/multi-gene datasets containing 125-7,764 sequences
Pearson correlation on best-scoring ML trees between RBS (Rapid BS) & SBS (Standard BS) support values: 0.95-0.99 (except one dataset at 0.91), average 0.97
Weighted topological distance < 6%, average 4%
Program Acceleration: 8-20, average ≈ 15
• Acceleration by one order of magnitude
• Full ML analysis (100 BS + ML search) of datasets of up to 5,000 sequences within less than 5 days on your desktop!
• Allows for a sufficiently large number of Bootstrap replicates
Quick & Dirty Bootstrap
[Figure: cycle between Modify Algorithm and Computational Experiments: iterate]
Rapid Bootstrap
Site weight vectors:
11111111111111
01102211111111
10111102220111
11111110112021
1. Compute Starting Tree
2. Optimize Model Params & Branch Lengths
3. Use Starting Tree & Model Params to compute RELL scores: 01102211111111 (-110), 10111102220111 (-105), 11111110112021 (-100)
4. Sort by RELL: 11111110112021 (-100), 10111102220111 (-105), 01102211111111 (-110)
5. T0: Thorough Search
6. T1: Quick Search on T0
7. T2: Quick Search on T1
sequential dependency is bad for parallelism
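The RELL scoring, sorting and chaining of searches shown on these slides can be sketched as follows. A RELL score is the sum of the per-site log likelihoods on the starting tree, weighted by a replicate's site weights. The per-site values below are made up (so the scores differ from those on the slides), and the two search functions are hypothetical callables supplied by the caller, not RAxML routines.

```python
def rell_score(site_lnl, weights):
    """RELL estimate: per-site log likelihoods reweighted by the site weights."""
    return sum(w * l for w, l in zip(weights, site_lnl))


def order_replicates(site_lnl, replicates):
    """Pair each weight vector with its RELL score, best score first."""
    scored = [(rell_score(site_lnl, w), w) for w in replicates]
    return sorted(scored, key=lambda sw: sw[0], reverse=True)


def rapid_bootstrap(ordered, thorough_search, quick_search):
    """T0 gets a thorough search; every later tree starts from its predecessor."""
    trees = []
    for k, (_, weights) in enumerate(ordered):
        if k == 0:
            trees.append(thorough_search(weights))
        else:
            trees.append(quick_search(weights, trees[-1]))
    return trees


# Made-up per-site log likelihoods on the starting tree (14 sites).
site_lnl = [-2.0, -1.5, -3.0, -2.5, -1.0, -4.0, -2.0,
            -1.5, -3.5, -2.0, -1.0, -2.5, -3.0, -1.5]
replicates = [[int(c) for c in s]
              for s in ("01102211111111", "10111102220111", "11111110112021")]
for score, weights in order_replicates(site_lnl, replicates):
    print(f"{score:8.1f}", "".join(map(str, weights)))
```

The loop in `rapid_bootstrap` makes the sequential dependency explicit: tree k cannot be started before tree k-1 is finished.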
Scalability of Rapid Bootstrap
[Figure] Some datasets are harder than others
[Figure: ML-Scores: Garli, RAxML, PHYML, 715 Sequences]
[Figure: Correlation 125 Taxa: 0.91]
[Figure: Support Value Distribution]
[Figure: Bootstrap Likelihood Values, 125 x 19,436; 10,000 replicates, only 195 non-trivial bipartitions]
[Figure: 3,491 rbcL sequences, Rapid versus Standard BS; Correlation: 0.98]
[Figure: 7,764 DNA Best Tree]
[Figure: 7,764 DNA All Bipartitions]
[Figure: 775 x 3,838 AA]
New Opportunities
Assess Impact of Alignment Method on tree and support values
Test Bootstrap of the Bootstrap (double Bootstrap) procedures
Devise and empirically verify Bootstopping criteria
[Figure: Bootstrap of the Bootstrap, 140 AA (Efron et al PNAS 1996)]
[Figure: Bootstrap of the Bootstrap, 3,491 rbcL]
Bootstopping
Rapid Bootstrapping makes it possible to assess Bootstopping criteria as follows:
1. Compute a high number of BS replicates (10,000)
2. Devise topology-based bootstopping criterion and apply it to these 10,000 replicates
3. Compare support values induced by bootstopped trees (say 300 replicates) with 10,000 replicates
We have 10,000 replicates for 18 datasets containing 125 to 2,554 sequences
Bootstopping Criterion
Every 50, 100, 150, ... replicates do a test:
• Say we have N BS trees
• Do the following 100 times:
  • Randomly split up this set of N trees into 2 equal sets S1, S2 of size N/2
  • Compute the bipartition support vectors for S1 and S2
  • Compute Pearson correlation of the support vectors
• return average of the 100 Pearson correlations
if average > 0.99 stop
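The test described on this slide can be sketched in a few lines. The representation is an assumption made for illustration: each bootstrap tree is given as a set of bipartition identifiers (any hashable labels). The handling of constant support vectors, for which the Pearson correlation is undefined, is likewise an assumption of this sketch and not taken from the slides.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: treat identical constant vectors as perfectly correlated.
        return 1.0 if x == y else 0.0
    return sxy / math.sqrt(sxx * syy)


def support_vector(trees, bipartitions):
    """Fraction of trees in which each bipartition occurs."""
    return [sum(b in t for t in trees) / len(trees) for b in bipartitions]


def bootstop(trees, rng, n_splits=100, threshold=0.99):
    """True if the average split-half correlation exceeds the threshold."""
    bipartitions = list(set().union(*trees))
    half = len(trees) // 2
    total = 0.0
    for _ in range(n_splits):
        shuffled = trees[:]
        rng.shuffle(shuffled)                       # random split into S1, S2
        s1, s2 = shuffled[:half], shuffled[half:2 * half]
        total += pearson(support_vector(s1, bipartitions),
                         support_vector(s2, bipartitions))
    return total / n_splits > threshold


rng = random.Random(0)
stable = [{"AB|CD", "ABC|DE"} for _ in range(50)]   # identical toy trees
print(bootstop(stable, rng))
```

A driver would call `bootstop` after every 50 additional replicates and stop generating replicates as soon as it returns True.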
Result Overview
Bootstopped between 100-400 (avg 213)
Correlation on best tree: Bootstopped versus 10,000 replicates > 0.99 (avg 0.995)
Correlation of all bipartitions > 0.995 (avg 0.997)
[Figure: Bootstopping Best 140 AA]
[Figure: Bootstopping Best 404 DNA (Multi-Gene)]
[Figure: Bootstopping Best 994 DNA]
[Figure: Bootstopping All 994 DNA]
[Figure: Bootstopping Best 1,908 DNA]
[Figure: Bootstopping Best 2,554 DNA]
Putting the Pieces together
Blue-Gene: Can handle huge datasets
Use CAT approximation on BlueGene
• Further speedup of factor 3.5
• Memory footprint reduction factor 4
[Figure: 8,864 Bacteria under GTR+Γ and GTR+CAT. Log Likelihood Score under Γ versus Execution Time; annotations: 7 days, 14 days]
Integrate rapid Bootstrap into BlueGene version
• Additional speedup ≈ 15
• Mechanisms available to accelerate BlueGene version by factor 50-60
Integrate Bootstopping into BlueGene
→ Conclusion: We will soon be able to compute a small tree of life with 10,000 organisms and data from multiple genes!
Host-Parasite Co-Evolution
[Figure: a tree of Hosts (eg Mammals) and a tree of Parasites (eg Lice). The Co-Evolution Hypothesis is given as a 0/1 Adjacency Matrix of 6 hosts and 8 parasites, which feeds a Statistical Test.]
What can HPC do for Bioinformatics?
Axelerated Parafit
“Parafit: statistical test of co-evolution”, Pierre Legendre, Syst. Biol. 2003
AxParafit (Axelerated Parafit)
• Statistical test of hypotheses of host-parasite coevolution
• C porting, optimization, BLAS integration
• Speedup up to factor 67
• Master-Worker MPI-parallelization
Largest co-phylogenetic study to date conducted within 8 minutes instead of 4 weeks
Open-Source Code: http://icwww.epfl.ch/~stamatak/AxParafit.html
SwissGrid-based Web-Server planned
[Figure: AxParafit: Sequential Performance]
[Figure: AxParafit: Parallel Performance]
The ML Benchmark: A Current Community Project
Standardized way required to test ML search programs
Web-Server with real-world alignments and performance data at Swiss Institute of Bioinformatics
Many developers of popular ML programs involved
• Stephane Guindon (PHYML) Montpellier
• Simon Whelan (LeaPhy) Manchester
• Bui Quang Minh (IQPNNI) Vienna
• Derrick Zwickl (GARLI) Virginia
• Thomas Keane (dprML) Cambridge
Byproduct: SPEC-like CPU benchmark for phylogenetics
Follow-up: (planned) ML competition at major conference with industrial sponsor
A Current Problem: Handling Multi-Gene Alignments
[Figure: alignment of Sequence 1 to Sequence 5 spanning Gene 1 and Gene 2.]
Missing Data ≠ Gap Data
A Multi-Gene Model
LogLH(T) = LogLH(T|Red) + LogLH(T|Yellow)
Challenge: devise efficient data structures for this
Why are Individual Branches per Gene a Challenge?
Outlook
Tree of Life
What is a good alignment in a phylogenetic context?
Simultaneous alignment and tree building
More HPC & memory-aware programming
Multi-core architectures
Models for “gappy” multi-gene alignments
Acknowledgements
BlueGene Project
• Michael Ott, TUM
• Srinivas Aluru, Jaroslaw Zola, Iowa State
• Dan Janies, Andrew Johnson, Ohio State
IBM CELL & Playstation
• Filip Blagojevic, Dimitris Nikolopoulos, Virginia Tech
• Christos Antonopoulos, Univ. of Thessaly
Bootstopping
• Bernard Moret, Masoud Alipour, EPFL
• Olaf Bininda-Emonds, Univ. Jena
RAxML Web-Server
• Jacques Rougemont, SIB
• Terri Liebowitz, SDSC
AxParafit/AxPcoords
• Markus Goeker, Alexander Auch, Jan Meier-Kolthoff, University of Tuebingen
Datasets for Studies
• Jun Inoue (Florida), Nicolas Salamin (Lausanne), Marc Gottschling (Berlin), Guido Grimm (Tuebingen), Nikos Poulakakis (Yale), Usman Roshan (NJIT)
Thank you for your Attention!
[Photo: Lake Geneva, Switzerland]