High-Performance Algorithm Engineering for Computational Phylogenetics [B Moret, D Bader] Kexue Liu CMSC 838 Presentation Motivation Phylogeny reconstruction from molecular data Poses complex optimization problem NP hard and thus computationally intractable High performance Algorithm Engineering Reduce the running time of existing phylogenetic algoritms CMSC 838T – Presentation Talk Overview Overview of talk Background Breakpoint Phylogeny Breakpoint Analysis Re-Engineering Techniques Impact in computational Biology Observations CMSC 838T – Presentation Background Algorithm Engineering Transform a pencil-and-paper algorithm into an efficient, robust implementation. Main focus is experimentation High Performance Algorithm Engineering Running time and quality of the solution as the paramount goal Includes parallelism Refining serial part of the code Cache-aware programming is a key to performance CMSC 838T – Presentation Background Phylogeny Reconstruction of the evolutionary history of a collection of organisms Takes the form of an evolutionary tree Computational Phylogenetics Is extremely computation-intensive Methods for sequence data (RNA, DNA, amino acid, Protein) do not scale up to whole genome Genome level data a) At this level, evolution is slow b) Enable us to recover deep evolutionary relationships c) Much hard to analyze than sequence data Optimization criteria a) Heuristics b) Parsimony criterion c) Maximum likelihood CMSC 838T – Presentation Breakpoint Phylogeny Deal with simple genomic data Organisms have a single chromosome or contain singlechromosome organelles Each chromosome can be represented by an ordering of oriented genes. Evolutionary process includes inversion, transposition, insertion, deletion and duplication. Approaches Construct parsimonious tree Known or conjectured to be NP hard b) No automated tool to solve it Neighbor-joining heuristics Fast and valuable b) Can’t recover the ancestral gene orders. Breakpoint phylogeny by Blanchette and Sankoff. a) a) CMSC 838T – Presentation Breakpoint phylogeny More special case: All the genomes have the same set of genes Each gene appears once. Is of interest to biologists Inversions are the main evolutionary mechanism on such genomes Works well for certain datasets. Implementation developed by Sankoff and Blanchette Breakpoint Analysis Too slow to be used on anything other than small datasets with a few genes. CMSC 838T – Presentation Breakpoint Analysis: Details Breakpoint: Two genomes G and G’ with the same set of genes and each gene appears exactly once in each genome Ordered pair of genes, (gi , gj) appears in G Neither (gi , gj) nor (-gj , -gi) appears in G’ Breakpoint Distance Median for three genomes Number of breakpoints between two genomes. The genome which minimizes the breakpoint distance Median Problem for Breakpoints Construct a median of given genomes NP hard CMSC 838T – Presentation Breakpoint Analysis Method developed by Sankoff and Blanchette to solve breakpoint phylogeny Uses reduction from MPB to Travelling Salesman Problem Directed MPB to undirected TSP Representing each gene by a pair of cities connected by an edge Outer loop enumerates all (2n-5)!! trees on n leaves Inner loop runs unknown number of iterations Computation complexity is exponential in each of the number of genomes and the number of genes. CMSC 838T – Presentation Breakpoint Analysis Initially label all internal nodes with gene orders Repeat For each internal node v, with neighbors A, B, C do Solve the MPB on A, B, C to yield label m If relabelling v with m improves the score of T, then do it until no internal node can be relabelled CMSC 838T – Presentation Re-Engineering Techniques Profiling: Identify bottlenecks to balance implementation Eliminate problems which include excessive resource consumption or poor results. Examples: Hand-unrolling loops, cut the running time down by a factor at least six. b. Refine distance computations c. Refine lower bound computations Speed-up by one order of magnitude on Campanulaceae dataset a. CMSC 838T – Presentation Re-Engineering Techniques Cache Awareness Memory footprint BPAnalysis: 60MB b. GRAPPA: 1.8MB Memory locality BPAnalysis: poor locality, working set size of about 12MB b. GRAPPA : good locality, working set size of about 600KB Minimizing pointer dereferencing Reuses allocated storage Studies indicate that gain is likely to be factors of anywhere from 2 to 40 a. a. CMSC 838T – Presentation Re-Engineering Techniques Low-level Algorithmic Changes Using all of the available information Examples: Using lower bound to eliminate over 95% of the tree. b. Take advantage of special structures: TSP has only two nontrivial edges( cost 1 and cost 2) Speed-up by a factor of 5-10. a. CMSC 838T – Presentation Re-Engineering Techniques: Parallel Aspects Efficient Tree Generation, Avoid unbounded-precision arithmetic Allow generation from any count with variable gap Provides parallel generation and also sampling of search space Portable MPI implementation, each processor handles a fraction of trees. On the 512-processor Alliance cluster LOS LOBOS at UNM, obtained a 512-fold speedup. Summarize speedups: Profiling: one order of magnitude Cache awareness: factors of anywhere from 2 to 40 Low-level Algorithmic changes: 5-10 512-processor parallelism: 512 Overall, Grappa demonstrated a million-fold speedup over the original implementation CMSC 838T – Presentation Evaluation: the Bluebell Family Dataset: full gene sequences for the chloroplasts of 12 species of Campanulaceae (Bluebells), plus tobacco. Chloroplast a. A semi-independent organism that lives within plant cells and allow them to photosynthesize. b. Have a single chromosome with about 120 genes. Optimization target: reconstruct the phylogeny with the least total amount of genomic changes. Environment: 512-processor Los Lobos supercluster at UNM Results: Speedup by three to four orders in the serial part Total speedup by over one million CMSC 838T – Presentation Phylogeny of Bluebell Family CMSC 838T – Presentation Impact in Computational Biology Much faster implementations Alter the practice of research in biology and medicine Reducing the time of an analysis from two years down to a day Makes an enormous difference in the pace and cost of drug discover and development Fast and accurate analysis software Enables researchers to pursue more leads, develop better institution on small dataset Form new conjectures about biological mechanism CMSC 838T – Presentation Observations Algorithm re-engineering Uncovers salient characteristic of the algorithm Enable us to develop better algorithms Example: find a true linear time algorithm for computing inversion distance in the development of GRAPPA. Can be applied to any existing bioinformatics algorithms Several have been engineered for performance, such as BLAST Limited benefits in theoretical terms when applied to NP-hard optimization problems Does not scale up to “industrial-strengthen” Grappa only enables to move from 10 taxa to 13 taxa CMSC 838T – Presentation Thank you CMSC 838T – Presentation