BioPerf: An Open Benchmark Suite for Evaluating Computer Architecture on Bioinformatics and Life Science Applications
David A. Bader

Collaborators
• Vipin Sachdeva (U New Mexico, Georgia Tech, IBM Austin)
• Tao Li (U Florida)
• Yue Li (U Florida)
• Virat Agrawal (IIT Delhi)
• Gaurav Goel (IIT Delhi)
• Abhishek Narain Singh (IIT Delhi)
• Ram Rajamony (IBM Austin)

BioPerf: an open bioinformatics and life sciences workload, David A. Bader

Acknowledgment of Support
• National Science Foundation
– CAREER: High-Performance Algorithms for Scientific Applications (06-11589; 0093039)
– ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO 03-31654)
– DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles (99-10123)
– ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
– DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
– ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
– DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering (04-20513)
• IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
– DARPA Contract NBCH30390004

Contributions of this Work
• An open-source, freely available, freely redistributable suite of applications and inputs, BioPerf, which spans a wide variety of bioinformatics applications
– www.bioperf.org
• Performance study on PowerPC G5, IBM Mambo simulator, and Alpha

Outline
• Motivation
• Bioinformatics Workload
• BioPerf Suite
• Performance Analysis on PowerPC G5 and Mambo
• Conclusions and Future Work
Motivation
• Improve performance on a wide range of bioinformatics applications
– Heterogeneous in problems, algorithms, applications
• BioPerf workload assembled as a representative set of bioinformatics applications that are important now and expected to increase in usage over the next 5-10 years
• Decide whether this is YAW (“yet another workload”) or unique in its characteristics

Related Work
• General benchmark suites: SPEC
• Domain-specific benchmarks
– TPC, EEMBC, SPLASH, SPLASH-2
• Few special-purpose benchmarks for bioinformatics
• Previous attempts have been incomplete:
– Analysis on old architectures (BioBench) [Albayraktaroglu et al., ISPASS 2005]
– Included proprietary codes in benchmark suite (BioInfoMark) [Li et al., MASCOTS 2005]
– Previous suites not available for download
– Included several non-redistributable packages
– Inputs not articulated and not included with benchmark suite for similar comparisons

Guiding Principles for BioPerf
• Coverage: The packages must span the heterogeneity of algorithms and biological and life science problems important today as well as (in our view) increasing in importance over the next 5-10 years.
• Popularity: Codes with larger numbers of users are preferred because these packages represent a greater percentage of the aggregate workloads used in this domain.
• Open Source: Open source code allows the scientific study of application performance, makes it possible to place hooks into the code, and eases porting to new architectures.
• Licensing: Only packages whose licensing allows free redistribution as open source are included. This requirement eliminated several popular packages, but was kept as a strict requirement to encourage the broadest use of this suite.
• Portability: Preference was given to packages that used standard programming languages and could easily be ported to new systems (both in sequential and parallel languages).
• Performance: We gave slight preference to packages whose performance is well characterized in other studies. In addition, we strove for computationally demanding packages and included parallel versions where available.

BioPerf Suite
• Pre-compiled binaries (PowerPC, x86, Alpha)
• Scalable input datasets with each code for fair comparisons
• Scripts for installation, running and collecting outputs
• Documentation for compiling and using the suite
• Parallel codes where available
• Available for download from www.bioperf.org

BioPerf workload
Area | Package | Executables
Sequence homology (word-based) | BLAST | blastp, blastn
Sequence homology (profile-based) | HMMER | hmmpfam, hmmsearch
Sequence alignment (pairwise) | FASTA | ssearch, fasta
Sequence alignment (multiple) | CLUSTALW | clustalw, clustalw_smp
Sequence alignment (multiple) | TCOFFEE | tcoffee
Phylogeny (parsimony/likelihood) | PHYLIP | dnapenny, promlk
Phylogeny (gene rearrangement) | GRAPPA | grappa
Protein structure prediction | PREDATOR | predator
Gene finding | GLIMMER | glimmer, glimmer-package
Molecular dynamics | CE | ce

Sequence Alignment
• Sequence alignment is one of the most useful techniques in computational biology
– Sequence alignment: stacking the sequences against each other, with gaps if necessary, to expose similarity.
Before alignment:
S1 : ACGCTGATATTA
S2 : AGTGTTATCCCTA
After alignment:
S1 : ACGCTGATAT---TA
S2 : AG--TGTTATCCCTA
Annotation on the figure: MATCH
Sequence Alignment
• The same alignment of S1 and S2 is shown again with annotations for MISMATCH and “GAPS”:
S1 : ACGCTGATAT---TA
S2 : AG--TGTTATCCCTA

Multiple Sequence Alignment
• Bring the greatest number of similar characters into the same column.
• Provides much more information than pairwise alignment
[Figure: example multiple alignment of short sequences over the characters A, S, N, V, with gaps]
• Run-time of dynamic programming solution = O(2^k n^k) for k sequences of length n
– 6 sequences of length 100: about 6.4 x 10^13 calculations
– Hence heuristics are employed

Sequence Homology
• Find similar sequences (DNA/protein) to an unknown sequence (DNA/protein).
• Computationally expensive
• Size of data is huge and grows exponentially every year
• Public databases available: Genbank, SwissProt, PDB
Database | Content | Size
NCBI Genbank | DNA sequences | 5 million sequences
Swissprot | Protein sequences | 160,000 sequences
PDB | Protein structures | 32,000 structures
• Problems with computational approach
– Exact alignment is an O(l^2) dynamic programming solution
– Quicker but less accurate heuristics are employed
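The exact O(l^2) dynamic programming solution mentioned above can be illustrated with a minimal sketch of global pairwise alignment (Needleman-Wunsch). The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen for illustration; they are not taken from BLAST, FASTA, or any other package in the suite, and the alignment produced need not be identical to the one drawn on the slide.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment (illustrative scoring, linear gaps).

    Returns (score, aligned_s1, aligned_s2), with '-' marking gaps.
    """
    n, m = len(s1), len(s2)
    # score[i][j] = best score for aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # match/mismatch
                              score[i - 1][j] + gap,      # gap in s2
                              score[i][j - 1] + gap)      # gap in s1

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a1.append(s1[i - 1])
                a2.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


if __name__ == "__main__":
    # The two sequences from the Sequence Alignment slide.
    total, top, bottom = global_align("ACGCTGATATTA", "AGTGTTATCCCTA")
    print(total)
    print(top)
    print(bottom)
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, which is where the quadratic cost comes from. Against a database of millions of sequences this is what motivates the faster heuristics.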
Blast
• Basic Local Alignment Search Tool
• Developed by NCBI
• The most important bioinformatics application in terms of popularity
• Executables: blastp, blastn
• Input: the Homo sapiens hereditary haemochromatosis protein
• Database: non-redundant protein sequence database nr, developed by NCBI

FASTA
• Also performs pairwise sequence alignment
• Executables: fasta34, ssearch
• Input: the human LDL receptor precursor
• Database: nr

ClustalW
• Multiple sequence alignment (MSA) program
• Executables: clustalw, clustalw_smp
• Input: 317 Ureaplasma gene sequences from the NCBI Bacteria genomes database

T-Coffee
• A sequential MSA program similar to ClustalW, with higher accuracy and complexity
• Executable: tcoffee
• Input: 50 sequences of average length 850 extracted from the Prefab database

Hmmer
• Align multiple sequences by using hidden Markov models
• Executables: hmmsearch, hmmpfam
• Inputs: brine shrimp globin; HMM of 50 aligned globin sequences

Phylogenetic Reconstruction
• Study the evolution of all sequences and all species
– The Tree of Life (10-100M organisms)
• Find the best among all possible trees.
• Given n taxa, the number of possible unrooted binary trees is (2n-5)!! (rooted: (2n-3)!!)
• 10 taxa: about 2 million unrooted trees
• Approaches include maximum parsimony and maximum likelihood, among others

Phylogeny Reconstruction: Phylip
• Collection of programs for inferring phylogenies
• Methods include
– Maximum parsimony
– Maximum likelihood
– Distance-based methods
• Input: Aligned dataset of 92 cyclophilin proteins from eukaryotes, each of length 220
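The size of the tree search space quoted on the Phylogenetic Reconstruction slide can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it relies on the standard counts of (2n-5)!! unrooted and (2n-3)!! rooted binary trees on n labeled taxa. The "2 million trees for 10 taxa" figure corresponds to the unrooted count.

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ..., with k!! = 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_binary_trees(n):
    """Number of distinct unrooted binary trees on n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)


def rooted_binary_trees(n):
    """Number of distinct rooted binary trees on n >= 2 labeled taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    print(unrooted_binary_trees(10))  # 2027025, i.e. about 2 million
    print(rooted_binary_trees(10))    # 34459425
```

Each additional taxon multiplies the count by a growing odd number, so exhaustive search stops being feasible after a few dozen taxa at most.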
Phylogeny Reconstruction: GRAPPA
• GRAPPA: Genome Rearrangements Analysis under Parsimony and other Phylogenetic Algorithms
• Freely available, open source, GNU GPL
• Already used by other computational phylogeny groups: Caprara, Pevzner, LANL, FBI, Smithsonian Institute, Aventis, GlaxoSmithKline, PharmCos.
• Gene-order based phylogeny
[Figure: gene-order phylogeny example; labels include Tobacco and nodes A-F, W-Z]
• Campanulaceae
– Bob Jansen, UT-Austin
– Linda Raubeson, Central Washington U
• Gene-order phylogeny reconstruction
– Breakpoint median
– Inversion median
• Over one-billion-fold speedup from previous codes
• Parallelism scales linearly with the number of processors [Bader, Moret, Warnow]
• Input: 12 bluebell flower species of 105 genes

Protein Structure Prediction
• Find the sequences, three-dimensional structures and functions of all proteins and vice versa
– Why computationally?
• Experimental techniques are slow and expensive
– Problems with computational approach
• Little understanding of how structure develops
• Does function really follow structure?

Protein Structure: Predator
• Tool for finding protein structures.
• Relies on local alignments from BLAST, FASTA
• Input: 20 sequences from Swissprot, each of length about 7000 residues.

CE (Combinatorial Extension)
• Find structural similarities between the primary structures of pairs of proteins.
• Executable: ce
• Input: two different types of hemoglobin, which is used to transport oxygen

Gene-Finding: Glimmer
• Gene-finding: find regions of a genome which code for proteins.
• Widely used gene-finding tool for microbial DNA.
• Input: Bacteria genome consisting of 9.2 million base pairs
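To make the gene-finding problem concrete, here is a deliberately naive open reading frame (ORF) scan. It is not Glimmer's method (Glimmer scores candidate genes with interpolated Markov models); it only illustrates what "regions of a genome which code for proteins" means at the simplest level. The function name, the `min_codons` threshold, the forward-strand-only restriction, and the toy sequence are all assumptions made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) half-open index pairs of forward-strand ORFs.

    An ORF here is an in-frame run that begins with ATG and ends with the
    first following stop codon; the stop codon is included in the span and
    counted toward min_codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # hypothetical toy sequence, not a BioPerf input
    for begin, end in find_orfs(toy):
        print(begin, end, toy[begin:end])
```

A real gene finder must also scan the reverse strand and distinguish coding ORFs from the many that occur by chance, which is the part a statistical model handles.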
Pre-compiled binaries
• PowerPC
• x86
• Alpha

BioPerf Performance Studies
• Analysis at the instruction and memory level on PowerPC
• Livegraph data help to visualize performance as it varies during phases of a run
• Identify bottlenecks of current processors and provide input toward better performance on future processors
• Ongoing work using Mambo simulator (IBM PERCS)
• Pre-compiled Alpha binaries for the majority of benchmarks for simulation
• To reduce the simulation time, we collect the simulation points for those benchmarks by using SimPoint

Conclusions
• Bioinformatics is a rapidly evolving field of increasing importance to computing
• BioPerf is a first step toward characterizing the bioinformatics workload: infrastructure to evaluate performance
• Performance data collected so far provide insight into the limitations of current architectures

Related Publications
• D.A. Bader, V. Sachdeva, A. Trehan, V. Agarwal, G. Gupta, and A.N. Singh, “BioSPLASH: A sample workload from bioinformatics and computational biology for optimizing next-generation high-performance computer systems,” (Poster Session), 13th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2005), Detroit, MI, June 25-29, 2005.
• D.A. Bader and V. Sachdeva, “BioSPLASH: Incorporating life sciences applications in the architectural optimizations of next-generation petaflop-system,” (Poster Session), The 4th IEEE Computational Systems Bioinformatics Conference (CSB 2005), Stanford University, CA, August 8-11, 2005.
• D.A. Bader, Y. Li, T. Li, and V. Sachdeva, “BioPerf: A Benchmark Suite to Evaluate High-Performance Computer Architecture on Bioinformatics Applications,” The IEEE International Symposium on Workload Characterization (IISWC 2005), Austin, TX, October 6-8, 2005.

Backup Slides

BioPerf on PowerPC
• PowerPC G5 dual-processor machine
– Uniprocessor performance ( nvram boot-args=1 )
– CPU frequency of 1.8 GHz
– 1 GB of physical memory available.
• Codes compiled using gcc-3.3 with no additional optimizations.
• MONster tool from the C.H.U.D. package used for collecting hardware performance counters
– Instruction and memory level analysis

Clustalw Algorithm Summary
• Step 1: Pairwise alignment of all sequences against one another.
– Dynamic programming step
• Step 2: Generate guide tree for aligning sequences
– Sequences with highest similarity get aligned first
• Step 3: Sequence-group and group-group alignments (progressive)
– All possible pairwise alignments between a sequence and a group are tried; the highest-scoring pair determines how the sequence is aligned to the group.
– All possible pairwise alignments of sequences between groups are tried; the highest-scoring pair determines how the groups are aligned.
– Clustalw uses calculations from step 1 for this step

Clustalw Livegraphs
• Input: 318 sequences each of length almost 1050
[Figure: livegraph over time samples (0-3000). Series: Instr. Completed (ppc, io, ld/st)/Cycle; Instr. (ppc)/Cycle]
• Pairwise alignment step: 70.1% of total time
• Guide tree formation: <0.1% of total time
• Progressive alignment step: 29.8% of total time
• Almost all instructions are ppc instructions
• ppc instructions lag the total instructions
Clustalw Livegraphs
[Figure: livegraph over time samples (0-3000). Series: Instr. Completed/Cycle; L1d Hit Rate]
• L1D hit rate almost 100%
• Instructions executed low
• Instructions executed increase remarkably
• L1D hit rate falls
• Instruction count increases in progressive alignment

Clustalw Livegraphs
• Is performance directly related to branch mispredicts?
[Figure: livegraph over time samples (0-3000). Series: Instr. Completed/Cycle; Branch Mispredicts/Instr.]
• Branch mispredicts are high in dynamic programming
• Instruction count is low
• Branch mispredicts fall in progressive alignment

Clustalw livegraphs
• Almost all branch mispredicts are caused by condition register mispredicts
[Figure: livegraph over time samples (0-3000). Series: Branch mispredict due to TA (target address); Branch mispredict due to CR (condition register)]

Clustalw livegraphs
• But what about loads per instruction?
[Figure: livegraph over time samples (0-3000). Series: Instr. Completed/Cycle; Loads/instr]
• Instruction count is low
• Instruction count increases in progressive alignment
• Loads per instruction is high in dynamic programming
• Loads per instruction falls in progressive alignment

Clustalw livegraphs - smaller inputs
• Smaller input: 44 sequences of length 583
[Figure: livegraph over time samples (0-3000). Series: Instructions per cycle; L1d hit rate]
• Same performance characteristics but with longer progressive alignment step

Clustalw livegraphs - smaller inputs
[Figure: livegraph over time samples (0-3000). Series: Instructions per cycle; Branch mispredicts/instr]
• Same performance characteristics but with longer progressive alignment step

Clustalw livegraphs - smaller inputs
• Almost all branch mispredicts are caused by condition register mispredicts
[Figure: livegraph over time samples (0-3000). Series: Branch mispredicts due to TA (target address); Branch mispredicts due to CR (condition register)]

Clustalw livegraphs - smaller input
• Can we use Mambo with smaller input sizes for more performance analysis?
[Figure: livegraph over time samples (0-3000). Series: Instructions Per Cycle; Loads/instr.]
• Same performance characteristics but with longer progressive alignment step

Using Mambo with Clustalw and other applications
• Collect separate outputs for each phase of the run
• Inserted “callthru exit” into the source code separating each part
• Dump the system statistics at the end of each phase
– mysim stats dump
– mysim caches stats dump
– MamboClearSystemStats (clean the previous statistics)
• Multiple “mysim go” in the .tcl file.

Clustalw on Mambo
• Mambo offers far more detailed instruction profiling than G5
[Figure: instruction counts by type over the run (x axis 0 to 5e+9). Series: INST_TYPE_ARITH; INST_TYPE_BRANCH; INST_TYPE_LOAD]
• Pairwise alignment: high loads and arithmetic instructions
• Progressive alignment uses results from first step: high branch and loads

Comparing large datasets with small datasets
• Is it feasible to use smaller input datasets for accurate simulation results?
• Branch mispredicts lower due to smaller dynamic programming step
• Branch mispredicts much higher
• High increase in L1d hit rate

Summary of BioPerf performance
• Highest instructions executed per cycle
• Highest branch mispredicts and TLB misses
• High % of ld/st/io instructions
• Low TLB misses
• High loads per instruction
• Very low % of ld/st/io
• High L1d hit rate
• Low branch mispredicts

Summary of BioPerf performance
• High branch mispredicts
• Mid-range instructions per cycle
• High loads per instruction
• Low TLB misses
• Low % of ld/st/io instructions

Summary of BioPerf Performance
• Lowest instruction rate
• Lowest loads per instruction
• Low branch mispredicts and TLB misses
• Lowest L1D and L2D hit rate
• Low % of ld/st/io instructions
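The livegraph series used throughout the backup slides (instructions per cycle, L1D hit rate, branch mispredicts per instruction, loads per instruction) are ratios of raw counter values taken per time sample. The sketch below shows one plausible way to derive them. The record layout, the field names, and the choice to measure L1D hit rate relative to loads are assumptions for illustration; they do not describe the output format of the G5 counter tool or of Mambo.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw event counts for one sampling interval (hypothetical field names)."""
    cycles: int
    instructions: int
    loads: int
    l1d_load_misses: int
    branch_mispredicts: int


def ratio(numerator, denominator):
    """Safe division: an interval with a zero denominator yields 0.0."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(sample):
    """Turn one raw sample into the per-interval ratios plotted in a livegraph."""
    return {
        "instr_per_cycle": ratio(sample.instructions, sample.cycles),
        "loads_per_instr": ratio(sample.loads, sample.instructions),
        "branch_mispredicts_per_instr": ratio(sample.branch_mispredicts,
                                              sample.instructions),
        # Assumption: hit rate is measured over load accesses only.
        "l1d_hit_rate": (1.0 - ratio(sample.l1d_load_misses, sample.loads)
                         if sample.loads else 0.0),
    }


if __name__ == "__main__":
    # Made-up numbers, only to show the shape of the calculation.
    sample = CounterSample(cycles=1000, instructions=1200, loads=600,
                           l1d_load_misses=6, branch_mispredicts=24)
    for name, value in derived_metrics(sample).items():
        print(f"{name}: {value:.3f}")
```

Computing the ratios per sample rather than over the whole run is what lets the phases of a program (for Clustalw, pairwise alignment, guide tree, progressive alignment) show up as distinct regions in the plots.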