High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader]

Kexue Liu
CMSC 838 Presentation
Motivation


Phylogeny reconstruction from molecular data

Poses complex optimization problem

NP hard and thus computationally intractable
High performance Algorithm Engineering

Reduce the running time of existing phylogenetic algorithms
CMSC 838T – Presentation
Talk Overview

Overview of talk

Background

Breakpoint Phylogeny

Breakpoint Analysis

Re-Engineering Techniques

Impact in computational Biology

Observations
Background


Algorithm Engineering

Transform a pencil-and-paper algorithm into an efficient, robust
implementation.

Main focus is experimentation
High Performance Algorithm Engineering

Running time and quality of the solution as the paramount goal

Includes parallelism

Refining serial part of the code

Cache-aware programming is a key to performance
Background

Phylogeny



Reconstruction of the evolutionary history of a collection of
organisms
Takes the form of an evolutionary tree
Computational Phylogenetics




Is extremely computation-intensive
Methods for sequence data (RNA, DNA, amino acid, protein) do not
scale up to whole genomes
Genome-level data
a) At this level, evolution is slow
b) Enables us to recover deep evolutionary relationships
c) Much harder to analyze than sequence data
Optimization criteria
a) Heuristics
b) Parsimony criterion
c) Maximum likelihood
Breakpoint Phylogeny


Deal with simple genomic data

Organisms have a single chromosome or contain single-chromosome organelles

Each chromosome can be represented by an ordering of oriented
genes.

Evolutionary process includes inversion, transposition, insertion,
deletion and duplication.
Approaches

Construct parsimonious tree
a) Known or conjectured to be NP hard
b) No automated tool to solve it
Neighbor-joining heuristics
a) Fast and valuable
b) Can’t recover the ancestral gene orders.
Breakpoint phylogeny by Blanchette and Sankoff.
Breakpoint phylogeny


A more specialized case:

All the genomes have the same set of genes

Each gene appears once.
Is of interest to biologists

Inversions are the main evolutionary mechanism on such
genomes

Works well for certain datasets.

Implementation developed by Sankoff and Blanchette

Breakpoint Analysis

Too slow to be used on anything other than small datasets with
a few genes.
Breakpoint Analysis: Details


Breakpoint:

Two genomes G and G’ with the same set of genes and each
gene appears exactly once in each genome

An ordered pair of adjacent genes (gi , gj) appears in G

Neither (gi , gj) nor (-gj , -gi) appears in G’
Breakpoint Distance

Number of breakpoints between two genomes.
Median for three genomes

The genome which minimizes the sum of breakpoint distances to the three genomes
Median Problem for Breakpoints

Construct a median of given genomes

NP hard
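The breakpoint definition above can be sketched in a few lines of Python. This is a minimal illustration, not the BPAnalysis or GRAPPA code: it assumes each genome is a list of nonzero signed integers (the sign gives the gene's orientation), and the function names and the `circular` flag are assumptions made for this example.

```python
def adjacencies(genome, circular=True):
    """Collect every adjacent gene pair of a genome, in both strand readings."""
    n = len(genome)
    last = n if circular else n - 1
    pairs = set()
    for i in range(last):
        a, b = genome[i], genome[(i + 1) % n]
        pairs.add((a, b))
        pairs.add((-b, -a))  # the same adjacency read on the opposite strand
    return pairs


def breakpoint_distance(g1, g2, circular=True):
    """Count adjacent pairs of g1 that appear in g2 in neither reading."""
    if sorted(abs(g) for g in g1) != sorted(abs(g) for g in g2):
        raise ValueError("genomes must contain the same set of genes")
    adj2 = adjacencies(g2, circular)
    n = len(g1)
    last = n if circular else n - 1
    return sum((g1[i], g1[(i + 1) % n]) not in adj2 for i in range(last))


# Inverting the segment (2, 3) creates a breakpoint at each end of the segment.
example = breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4])
```

A genome and its complete reversal, such as `[1, 2, 3, 4]` and `[-4, -3, -2, -1]`, have distance zero under this definition, because every adjacency is preserved when read on the opposite strand.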
Breakpoint Analysis

Method developed by Sankoff and Blanchette to solve
breakpoint phylogeny

Uses reduction from MPB to Travelling Salesman
Problem

Directed MPB to undirected TSP

Representing each gene by a pair of cities connected by an
edge

Outer loop enumerates all (2n-5)!! trees on n leaves

Inner loop runs unknown number of iterations

Computational complexity is exponential in both the
number of genomes and the number of genes.
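To get a feel for the size of the outer loop, the count (2n-5)!! = 1 · 3 · 5 · ... · (2n-5) can be evaluated directly. The function name below is an assumption made for this sketch.

```python
def num_unrooted_binary_trees(n_leaves):
    """Return (2n-5)!!, the number of unrooted binary trees on n labelled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the empty product is 1 for n = 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


trees_10 = num_unrooted_binary_trees(10)
trees_13 = num_unrooted_binary_trees(13)
```

Going from 10 to 13 leaves multiplies the number of trees by 17 · 19 · 21 = 6783, which is why each additional genome is so costly for an exhaustive outer loop.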
Breakpoint Analysis
Initially label all internal nodes with gene orders
Repeat
    For each internal node v, with neighbors A, B, C do
        Solve the MPB on A, B, C to yield label m
        If relabelling v with m improves the score of T,
        then do it
until no internal node can be relabelled
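The relabelling loop above can be sketched in Python under some loudly stated assumptions: genomes are circular lists of signed integers, the tree is given as an edge list, and the median solver is a hypothetical brute-force stand-in (`brute_force_median`) that is only feasible for a handful of genes. The real method solves the MPB through a TSP reduction, which this sketch does not attempt.

```python
from itertools import permutations, product


def breakpoint_distance(g1, g2):
    """Breakpoints between two circular signed genomes on the same genes."""
    n = len(g2)
    adj = set()
    for i in range(n):
        a, b = g2[i], g2[(i + 1) % n]
        adj.add((a, b))
        adj.add((-b, -a))  # same adjacency read on the opposite strand
    return sum((g1[i], g1[(i + 1) % n]) not in adj for i in range(n))


def brute_force_median(a, b, c):
    """Hypothetical stand-in for an MPB solver: try every signed ordering."""
    genes = sorted(abs(g) for g in a)
    best, best_cost = None, None
    for perm in permutations(genes):
        for signs in product((1, -1), repeat=len(genes)):
            cand = [s * g for s, g in zip(signs, perm)]
            cost = sum(breakpoint_distance(cand, x) for x in (a, b, c))
            if best_cost is None or cost < best_cost:
                best, best_cost = cand, cost
    return best


def tree_score(edges, labels):
    """Score of the tree: total breakpoint distance over all edges."""
    return sum(breakpoint_distance(labels[u], labels[v]) for u, v in edges)


def iterative_improvement(edges, internal, labels, median=brute_force_median):
    """Relabel internal nodes with medians until no relabelling helps."""
    labels = dict(labels)  # leave the caller's labelling untouched
    neighbors = {v: [] for v in labels}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    improved = True
    while improved:
        improved = False
        for v in internal:
            a, b, c = (labels[w] for w in neighbors[v])
            m = median(a, b, c)
            # Only the three edges at v change, so comparing them is enough.
            old = sum(breakpoint_distance(labels[v], x) for x in (a, b, c))
            new = sum(breakpoint_distance(m, x) for x in (a, b, c))
            if new < old:
                labels[v] = m
                improved = True
    return labels


# Four leaves A-D, two internal nodes x and y with arbitrary initial labels.
edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]
start = {"A": [1, 2, 3, 4], "B": [1, 2, -3, 4], "C": [1, -3, -2, 4],
         "D": [1, 3, 2, 4], "x": [1, 3, 2, 4], "y": [1, 2, 3, 4]}
final = iterative_improvement(edges, ["x", "y"], start)
```

Each accepted relabelling strictly lowers an integer score that cannot go below zero, so the loop always terminates, although the number of passes is not known in advance.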
Re-Engineering Techniques

Profiling:

Identify bottlenecks to balance implementation

Eliminate problems which include excessive resource
consumption or poor results.

Examples:
a. Hand-unrolling loops cut the running time down by a
factor of at least six.
b. Refine distance computations
c. Refine lower bound computations
Speed-up by one order of magnitude on the Campanulaceae
dataset

Re-Engineering Techniques

Cache Awareness

Memory footprint
a. BPAnalysis: 60MB
b. GRAPPA: 1.8MB
Memory locality
a. BPAnalysis: poor locality, working set size of about 12MB
b. GRAPPA: good locality, working set size of about 600KB
Minimizing pointer dereferencing

Reuses allocated storage

Studies indicate that gain is likely to be factors of anywhere
from 2 to 40
Re-Engineering Techniques

Low-level Algorithmic Changes

Using all of the available information

Examples:
a. Using a lower bound to eliminate over 95% of the trees.
b. Take advantage of special structure: the TSP instance has only two
nontrivial edge costs (cost 1 and cost 2)
Speed-up by a factor of 5-10.
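As an illustration of pruning with a lower bound, the sketch below discards candidate trees whose bound already matches or exceeds the best score found so far. The specific bound is an assumption chosen for this example and is not claimed to be the exact one used in GRAPPA: for any leaf order read around the tree, walking from each leaf to the next crosses every edge twice, so by the triangle inequality the tree score is at least half the circular sum of pairwise leaf distances. Representing a candidate tree only by its leaf order is also a simplification.

```python
def circular_lower_bound(leaf_order, dist):
    """Half the sum of distances between consecutive leaves, rounded up."""
    n = len(leaf_order)
    total = sum(dist[leaf_order[i]][leaf_order[(i + 1) % n]] for i in range(n))
    return (total + 1) // 2  # scores are integers, so rounding up is safe


def prune(candidate_leaf_orders, dist, best_score):
    """Yield only the candidates that could still beat best_score."""
    for leaf_order in candidate_leaf_orders:
        if circular_lower_bound(leaf_order, dist) < best_score:
            yield leaf_order


# Hypothetical pairwise breakpoint distances between four genomes.
dist = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
survivors = list(prune([[0, 1, 2, 3], [0, 2, 1, 3]], dist, best_score=7))
```

The bound needs only the pairwise distance matrix, so it is computed once per tree without solving any median problem, which is what makes eliminating most trees cheap.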
Re-Engineering Techniques: Parallel Aspects

Efficient Tree Generation

Avoid unbounded-precision arithmetic
Allow generation from any count with variable gap
Provides parallel generation and also sampling of search space
Portable MPI implementation, each processor handles a fraction of
trees.
On the 512-processor Alliance cluster LOS LOBOS at UNM,
obtained a 512-fold speedup.
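Generating a tree "from any count with variable gap" can be sketched as unranking: a tree index is decoded into a sequence of edge choices, one per added leaf, using mixed radices 3, 5, ..., 2n-5. Each processor then walks the index space with a stride equal to the number of processors. This is an illustrative sketch only: the function names are assumptions, processor ranks are simulated with plain integers rather than MPI calls, and Python's built-in big integers sidestep the precision issue that a C implementation would have to handle.

```python
def num_trees(n_leaves):
    """(2n-5)!!, the number of unrooted binary trees on n labelled leaves."""
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


def unrank_tree(index, n_leaves):
    """Decode a tree index into edge choices.

    choices[j] is the edge (0-based) onto which leaf j + 4 is grafted;
    the tree built so far on j + 3 leaves has 2 * j + 3 edges.
    """
    if index < 0:
        raise ValueError("index out of range")
    choices = []
    for radix in range(3, 2 * n_leaves - 4, 2):
        index, digit = divmod(index, radix)
        choices.append(digit)
    if index != 0:
        raise ValueError("index out of range")
    return choices


def trees_for_rank(rank, n_procs, n_leaves, start=0):
    """Yield the trees handled by one processor: start + rank, then every n_procs-th."""
    for index in range(start + rank, num_trees(n_leaves), n_procs):
        yield unrank_tree(index, n_leaves)


# Simulate 4 processors sharing the 15 trees on 5 leaves.
shares = [list(trees_for_rank(r, 4, 5)) for r in range(4)]
```

Because every processor can compute its own trees from indices alone, no tree needs to be communicated between processors. Using a large stride with a single processor gives a uniform sample of the search space.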
Summary of speedups:

Profiling: one order of magnitude
Cache awareness: factors of anywhere from 2 to 40
Low-level Algorithmic changes: 5-10
512-processor parallelism: 512
Overall, GRAPPA demonstrated a million-fold speedup over the original
implementation
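As a rough sanity check, the individual factors listed above can be multiplied together. This sketch assumes the factors compound independently, which the slides do not state, and it treats "one order of magnitude" as exactly 10.

```python
def combined(factor_ranges):
    """Multiply (low, high) speedup ranges, assuming they compound independently."""
    low = high = 1
    for lo, hi in factor_ranges:
        low *= lo
        high *= hi
    return low, high


profiling = (10, 10)      # one order of magnitude
cache = (2, 40)           # cache awareness
algorithmic = (5, 10)     # low-level algorithmic changes
parallel = (512, 512)     # 512 processors

serial_range = combined([profiling, cache, algorithmic])
total_range = combined([profiling, cache, algorithmic, parallel])
```

Under these assumptions the serial factors alone span two to between three and four orders of magnitude, and the reported million-fold figure falls inside the upper part of the combined range.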
Evaluation: the Bluebell Family

Dataset: full gene sequences for the chloroplasts of
12 species of Campanulaceae (Bluebells), plus
tobacco.




Chloroplast
a. A semi-independent organism that lives within plant cells
and allows them to photosynthesize.
b. Has a single chromosome with about 120 genes.
Optimization target: reconstruct the phylogeny with
the least total amount of genomic changes.
Environment: 512-processor Los Lobos supercluster
at UNM
Results:


Speedup by three to four orders of magnitude in the serial part
Total speedup by a factor of over one million
Phylogeny of Bluebell Family
Impact in Computational Biology


Much faster implementations

Alter the practice of research in biology and medicine

Reducing the time of an analysis from two years down to a day

Makes an enormous difference in the pace and cost of drug
discovery and development
Fast and accurate analysis software

Enables researchers to pursue more leads, develop better
intuition on small datasets

Form new conjectures about biological mechanism
Observations

Algorithm re-engineering

Uncovers salient characteristics of the algorithm

Enables us to develop better algorithms

Example: a true linear-time algorithm for computing
inversion distance was found during the development of GRAPPA.
Can be applied to any existing bioinformatics algorithm

Several have been engineered for performance, such as
BLAST
Limited benefits in theoretical terms when applied to NP-hard
optimization problems



Does not scale up to “industrial strength”

GRAPPA only enables a move from 10 taxa to 13 taxa
Thank you