BioPerf

BioPerf: An Open Benchmark Suite for Evaluating Computer
Architecture on Bioinformatics and Life Science Applications
David A. Bader
Collaborators
• Vipin Sachdeva (U New Mexico, Georgia Tech,
IBM Austin)
• Tao Li (U Florida)
• Yue Li (U Florida)
• Virat Agrawal (IIT Delhi)
• Gaurav Goel (IIT Delhi)
• Abhishek Narain Singh (IIT Delhi)
• Ram Rajamony (IBM Austin)
BioPerf: an open bioinformatics and life sciences workload, David A. Bader
Acknowledgment of Support
• National Science Foundation
– CAREER: High-Performance Algorithms for Scientific Applications (06-11589; 0093039)
– ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and
Computational Phylogenetics (EF/BIO 03-31654)
– DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality
Principles (99-10123)
– ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
– DEB Comparative Chloroplast Genomics: Integrating Computational Methods,
Molecular Evolution, and Phylogeny (01-20709)
– ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement
Metrics (01-13095)
– DBI: Acquisition of a High Performance Shared-Memory Computer for Computational
Science and Engineering (04-20513).
• IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
– DARPA Contract NBCH30390004
Contributions of this Work
• An open source, freely available, freely redistributable suite of applications and
inputs, BioPerf, which spans a wide variety of
bioinformatics applications
– www.bioperf.org
• Performance study on PowerPC G5, IBM
Mambo simulator, and Alpha
Outline
• Motivation
• Bioinformatics Workload
• BioPerf Suite
• Performance Analysis on PowerPC G5 and
Mambo
• Conclusions and Future Work
Motivation
• Improve performance on a wide range of
bioinformatics applications
– Heterogeneous in problems, algorithms,
applications
• BioPerf workload assembled as a
representative set of bioinformatics
applications important now and expected to
increase in usage over the next 5-10 years
• Decide whether this is YAW (“yet another workload”)
or rather unique in its characteristics
Related Work
• General benchmark suites: SPEC
• Domain-specific benchmarks
– TPC, EEMBC, SPLASH, SPLASH-2
• Few specialized benchmarks for bioinformatics
• Previous attempts have been incomplete:
– Analysis on old architectures (BioBench) [Albayraktaroglu et al., ISPASS 2005]
– Included proprietary codes in benchmark suite (BioInfoMark) [Li et al., MASCOTS 2005]
– Previous suites not available for download
– Included several non-redistributable packages
– Inputs not articulated and not included with the benchmark suite for similar comparisons
Guiding Principles for BioPerf
• Coverage: The packages must span the heterogeneity of algorithms and
biological and life science problems important today as well as (in our view)
increasing in importance over the next 5-10 years.
• Popularity: Codes with larger numbers of users are preferred because these
packages represent a greater percentage of the aggregate workloads used in this
domain.
• Open Source: Open source code allows the scientific study of the application
performance, the ability to place hooks into the code, and eases porting to new
architectures.
• Licensing: Only packages whose licensing allows free redistribution as
open source are included. This requirement eliminated several popular
packages, but was kept as a strict requirement to encourage the broadest use of
this suite.
• Portability: Preference was given to packages that used standard programming
languages and could easily be ported to new systems (both in sequential and
parallel languages).
• Performance: We gave slight preference to packages whose performance is well characterized in other studies. In addition, we strove for computationally demanding packages and included parallel versions where available.
BioPerf Suite
• Pre-compiled binaries (PowerPC, x86, Alpha)
• Scalable Input datasets with each code for
fair comparisons
• Scripts for installation, running and collecting
outputs
• Documentation for compiling and using the
suite
• Parallel codes where available
• Available for download from www.bioperf.org
BioPerf workload
Area                                   Package    Executables
Sequence homology, word-based          BLAST      blastp, blastn
Sequence homology, profile-based       HMMER      hmmpfam, hmmsearch
Sequence Alignment, pairwise           FASTA      ssearch, fasta
Sequence Alignment, multiple           CLUSTALW   clustalw, clustalw_smp
Sequence Alignment, multiple           TCOFFEE    tcoffee
Phylogeny, Parsimony/Likelihood        PHYLIP     dnapenny, promlk
Phylogeny, Gene Rearrangement          GRAPPA     grappa
Protein Structure Prediction           PREDATOR   predator
Gene Finding                           GLIMMER    glimmer, glimmer-package
Molecular Dynamics                     CE         ce
Sequence Alignment
• Sequence alignment is one of the most useful
techniques in computational biology
– Sequence alignment: stacking the sequences
against each other, with gaps if necessary, to
expose similarity.
Input sequences:
S1 : ACGCTGATATTA
S2 : AGTGTTATCCCTA
Alignment:
S1 : ACGCTGATAT---TA
S2 : AG--TGTTATCCCTA
• Each aligned column is a match, a mismatch, or a gap
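A minimal sketch of how the columns of the S1/S2 alignment on this slide can be classified as matches, mismatches, or gaps. The function name `classify_columns` and the use of "-" as the gap symbol are assumptions made for this illustration, not part of BioPerf.

```python
def classify_columns(row1, row2, gap="-"):
    """Count match, mismatch, and gap columns of two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            counts["gap"] += 1        # one sequence has no residue here
        elif a == b:
            counts["match"] += 1      # identical residues
        else:
            counts["mismatch"] += 1   # different residues
    return counts


# Aligned rows taken from the slide.
counts = classify_columns("ACGCTGATAT---TA", "AG--TGTTATCCCTA")
print(counts)  # {'match': 8, 'mismatch': 2, 'gap': 5}
```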
Multiple Sequence Alignment
• Bring the greatest number of similar characters into
same column.
• Provides much more information than pairwise alignment
[Figure: example alignment of three short sequences]
V S N - S
- S N A -
- - - A S
• Run-time of the dynamic programming solution = O(2^k n^k) for k sequences of length n
• 6 sequences of length 100 → 6.4 × 10^13 calculations
• Hence heuristics are employed
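The arithmetic behind the multiple sequence alignment estimate can be checked directly. This sketch evaluates 2^k n^k for k = 6 sequences of length n = 100; the function name `msa_dp_cost` is hypothetical, and the value is an order-of-magnitude operation count, not a measured run time.

```python
def msa_dp_cost(k, n):
    """Operation count 2^k * n^k for exact DP alignment of k sequences of length n."""
    return (2 ** k) * (n ** k)


cost = msa_dp_cost(6, 100)
print(f"{cost:.1e}")  # 6.4e+13
```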
Sequence Homology
• Find similar sequences (DNA/protein) to an unknown
sequence (DNA/protein).
• Computationally expensive
• Size of data is huge and grows exponentially every year
• Public databases available: Genbank, SwissProt, PDB
NCBI Genbank    DNA sequences        5 million sequences
Swissprot       Protein sequences    160,000 sequences
PDB             Protein structure    32,000 structures
Problems with computational approach
• Exact alignment is an O(l^2) dynamic programming solution for sequences of length l
• Quicker but less accurate heuristics employed
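To make the O(l^2) cost concrete, here is a minimal sketch of an exact global alignment score computed by dynamic programming (Needleman-Wunsch style). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. BLAST and FASTA use heuristics and more elaborate scoring, so this is not their method.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b; fills a len(a) x len(b) table."""
    # prev[j] holds the best score of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[len(b)]


# Sequences from the earlier alignment slide; the score depends on the assumed parameters.
print(global_alignment_score("ACGCTGATATTA", "AGTGTTATCCCTA"))
```

The two nested loops visit every cell of the table once, which is where the quadratic cost for two sequences of length l comes from.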
Blast
• Basic Local Alignment Search Tool
• Developed by NCBI
• The most important bioinformatics
application because of its popularity
Package: Blast
Executables: blastp, blastn
Input: the Homo sapiens hereditary haemochromatosis protein
Database: non-redundant protein sequence database (nr) developed by NCBI
FASTA
• Also performs pairwise sequence alignment
Package: FASTA
Executables: Fasta34, ssearch
Input: the human LDL receptor precursor
Database: nr
ClustalW
• Multiple sequence alignment (MSA) program
Package: ClustalW
Executables: Clustalw, Clustalw_smp
Input: 317 Ureaplasma gene sequences from the NCBI Bacteria genomes database
T-Coffee
• A sequential MSA similar to ClustalW with
higher accuracy and complexity
Package: T-coffee
Executable: Tcoffee
Input: 50 sequences of average length 850 extracted from the Prefab database
Hmmer
• Align multiple sequences by using hidden
Markov models
Package: Hmmer
Executables: hmmsearch, hmmpfam
Inputs: Brine shrimp globin; HMM of 50 aligned globin sequences
Phylogenetic Reconstruction
• Study the evolution of all sequences and all
species
[Figure: The Tree of Life (10-100M organisms)]
• Find the best among all possible trees.
• Given n taxa, the number of possible unrooted trees is (2n-5)!!, or (2n-3)!! if rooted
• 10 taxa → about 2 million unrooted trees
• Approaches include maximum parsimony and maximum likelihood,
among others
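The tree counts can be verified with a short double-factorial sketch. The figure of about 2 million trees for 10 taxa matches the unrooted count (2n-5)!!, while (2n-3)!! counts rooted binary trees. The function names here are hypothetical.

```python
def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2; returns 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labelled taxa: (2n-5)!!"""
    return double_factorial(2 * n - 5)


def count_rooted_trees(n):
    """Number of rooted binary trees on n >= 2 labelled taxa: (2n-3)!!"""
    return double_factorial(2 * n - 3)


print(count_unrooted_trees(10))  # 2027025
print(count_rooted_trees(10))    # 34459425
```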
Phylogeny Reconstruction: Phylip
• Collection of programs for inferring
phylogenies
• Methods include
– Maximum parsimony
– Maximum likelihood
– Distance based methods.
• Input: Aligned dataset of 92 cyclophilin
proteins of eukaryotes, each of length 220
Phylogeny Reconstruction: GRAPPA
• Genome Rearrangements Analysis under Parsimony and other
Phylogenetic Algorithms
• Gene-order phylogeny reconstruction
– Breakpoint Median
– Inversion Median
• Freely available, open source, GNU GPL
• Already used by other computational phylogeny groups:
Caprara, Pevzner, LANL, FBI, Smithsonian Institute, Aventis,
GlaxoSmithKline, PharmCos.
• Over one-billion-fold speedup from previous codes
• Parallelism scales linearly with the number of processors
[Bader, Moret, Warnow]
• Input: 12 bluebell flower species of 105 genes
[Figure: Campanulaceae and Tobacco; Bob Jansen, UT-Austin; Linda Raubeson, Central Washington U]
[Figure: gene-order based phylogeny; node labels A-F and W-Z]
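As background for the breakpoint median, the following sketch counts breakpoints between two signed gene orders. It assumes linear genomes with genes labelled 1..n and frames each order with 0 and n+1; GRAPPA's own data structures and its handling of circular genomes are not shown, and the function name `breakpoint_distance` is hypothetical.

```python
def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are not preserved in g2.

    g1 and g2 are signed permutations of the genes 1..n (assumed linear).
    An adjacency (a, b) is preserved if g2 contains a, b or -b, -a consecutively.
    """
    if sorted(abs(x) for x in g1) != sorted(abs(x) for x in g2):
        raise ValueError("both gene orders must contain the same genes")
    n = len(g1)
    framed1 = [0] + list(g1) + [n + 1]
    framed2 = [0] + list(g2) + [n + 1]
    preserved = set()
    for a, b in zip(framed2, framed2[1:]):
        preserved.add((a, b))
        preserved.add((-b, -a))  # same adjacency read on the opposite strand
    return sum(1 for pair in zip(framed1, framed1[1:]) if pair not in preserved)


print(breakpoint_distance([1, 2, 3], [1, 2, 3]))     # 0
print(breakpoint_distance([1, 2, 3], [-3, -2, -1]))  # 2
```

A median-based reconstruction looks for an ancestral gene order that minimizes the sum of such distances to its neighbours in the tree; that search is not shown here.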
Protein Structure Prediction
• Find the sequences, three-dimensional structures
and functions of all proteins and vice versa
– Why computationally?
• Experimental techniques are slow and expensive
– Problems with computational approach
• Little understanding of how structure develops
• Does function really follow structure?
Protein Structure : Predator
• Tool for finding protein structures.
• Relies on local alignments from BLAST, FASTA
• Input: 20 sequences from Swissprot each of
length about 7000 residues.
CE (Combinatorial Extension)
• Find structural similarities between the
primary structures of pairs of proteins.
Package: CE
Executable: ce
Input: two different types of hemoglobin, which is used
to transport oxygen
Gene-Finding: Glimmer
• Gene-Finding: Find regions of genome which
code for proteins.
• Widely used gene finding tool for microbial
DNA.
• Input: Bacteria genome consisting of 9.2
million base pairs
Pre-compiled binaries
• PowerPC
• x86
• Alpha
BioPerf Performance Studies
• Analysis at the instruction and memory level on
PowerPC
• Livegraph data helps to visualize performance as it
varies during phases of a run
• Identify bottlenecks of current processors and provide
input toward better performance on future processors
• Ongoing work using Mambo simulator (IBM PERCS)
• Pre-compiled Alpha binaries for the majority of
benchmarks for simulation
• In order to reduce the simulation time, we collect
the simulation points for those benchmarks by
using SimPoint
Conclusions
• Bioinformatics is a rapidly evolving field of
increasing importance to computing
• BioPerf is a first step to characterize
bioinformatics workload: infrastructure to
evaluate performance
• Performance data collected so far provides
insight into the limitations of current
architectures
Related Publications
• D.A. Bader, V. Sachdeva, A. Trehan, V. Agarwal, G. Gupta, and A.N. Singh,
“BioSPLASH: A sample workload from bioinformatics and computational
biology for optimizing next-generation high-performance computer
systems,” (Poster Session), 13th Annual International Conference on
Intelligent Systems for Molecular Biology (ISMB 2005), Detroit, MI, June
25-29, 2005.
• D.A. Bader, V. Sachdeva, “BioSPLASH: Incorporating life sciences
applications in the architectural optimizations of next-generation
petaflop-system,” (Poster Session), The 4th IEEE Computational Systems
Bioinformatics Conference (CSB 2005), Stanford University, CA, August
8-11, 2005
• D.A. Bader, Y. Li, T. Li, V. Sachdeva, “BioPerf: A Benchmark Suite to
Evaluate High-Performance Computer Architecture on Bioinformatics
Applications,” The IEEE International Symposium on Workload
Characterization (IISWC 2005), Austin, TX, October 6-8, 2005
Backup Slides
BioPerf on PowerPC
• PowerPC G5 dual-processor machine
– Uniprocessor performance ( nvram boot-args=1 )
– CPU frequency of 1.8 GHz
– 1 GB of physical memory available.
• Codes compiled using gcc-3.3 with no additional
optimizations.
• MONster tool from the CHUD package used for collecting
hardware performance counters
– Instruction and Memory level analysis
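The per-sample metrics plotted in the livegraphs (instructions per cycle, L1D hit rate, branch mispredicts per instruction, loads per instruction) are ratios of raw counter values. The sketch below shows one way such ratios could be derived; the field names and the sample numbers are invented for illustration and are not actual MONster event names or measured BioPerf data.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw counts for one sampling interval (hypothetical field names)."""
    cycles: int
    instructions: int
    loads: int
    l1d_load_misses: int
    branch_mispredicts: int


def derived_metrics(s):
    """Turn raw counts into the ratios shown on the livegraph axes."""
    return {
        "ipc": s.instructions / s.cycles,
        "loads_per_instr": s.loads / s.instructions,
        "l1d_hit_rate": 1.0 - s.l1d_load_misses / s.loads,
        "mispredicts_per_instr": s.branch_mispredicts / s.instructions,
    }


# Made-up sample values, used only to exercise the function.
sample = CounterSample(cycles=1000, instructions=1200, loads=600,
                       l1d_load_misses=6, branch_mispredicts=24)
metrics = derived_metrics(sample)
print(metrics)
```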
Clustalw Algorithm Summary
• Pairwise alignment of all sequences against one another.
– dynamic programming step
• Generate guide tree for aligning sequences
– Sequences with highest similarity get aligned first
• Sequence-group and group-group alignments (progressive)
– All possible pairwise alignments between sequence and group are tried.
Highest scoring pair is how it gets aligned to the group.
– All possible pairwise alignments of sequences between groups are tried;
highest scoring pair is how groups get aligned
– Clustalw uses calculations from step 1 for this step
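The first two steps above can be sketched as scoring every pair of sequences and then ordering the pairs so that the most similar sequences come first. This is a toy illustration only: `identity` is a stand-in similarity (a position-by-position identity fraction, not the dynamic programming alignment Clustalw performs), and `rank_pairs` is a hypothetical helper, not a guide-tree builder.

```python
from itertools import combinations


def identity(a, b):
    """Stand-in similarity: fraction of positions with equal characters (no DP)."""
    if not a or not b:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / max(len(a), len(b))


def rank_pairs(seqs, score=identity):
    """Score all n(n-1)/2 pairs and sort them from most to least similar."""
    pairs = [(score(seqs[i], seqs[j]), i, j)
             for i, j in combinations(range(len(seqs)), 2)]
    pairs.sort(key=lambda t: -t[0])  # stable sort keeps index order among ties
    return pairs


seqs = ["ACGT", "ACGA", "TTTT", "ACGT"]
ranked = rank_pairs(seqs)
print(ranked[0])  # (1.0, 0, 3): sequences 0 and 3 are identical
```

The number of pairs grows quadratically with the number of sequences, which is consistent with the pairwise step dominating run time for large inputs.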
Clustalw Livegraphs
• Input: 318 sequences each of length almost 1050
[Figure: Instr. Completed (ppc, io, ld/st)/Cycle and Instr. (ppc)/Cycle vs. Time Samples]
• Pairwise alignment step (70.1%)
• Guide tree formation (<0.1% of total time)
• Progressive alignment step (29.8%)
• Almost all instructions are ppc instructions
• ppc instructions lag the total instructions
Clustalw Livegraphs
[Figure: Instr. Completed/Cycle and L1d Hit Rate vs. Time Samples]
• L1D hit rate almost 100%
• Instructions executed low
• Instructions executed increase remarkably
• L1D hit rate falls
• Instruction count increases in progressive alignment
Clustalw Livegraphs
Is performance directly related to branch mispredicts?
[Figure: Instr. Completed/Cycle and Branch Mispredicts/Instr.]
• Branch mispredicts are high in dynamic programming
• Instruction count is low
• Branch mispredicts fall in progressive alignment
Clustalw livegraphs
Almost all branch mispredicts are caused by condition register mispredicts
[Figure: Branch mispredict due to TA and Branch mispredict due to CR]
But what about loads per instruction?
[Figure: Instr. Completed/Cycle and Loads/instr]
• Loads per instruction is high in dynamic programming
• Instruction count is low
• Loads per instruction falls in progressive alignment
• Instruction count increases in progressive alignment
Clustalw livegraphs - smaller inputs
• Smaller input: 44 sequences of length 583
[Figure: Instructions per cycle and L1d hit rate]
• Same performance characteristics but with longer progressive alignment step
Clustalw livegraphs - smaller inputs
[Figure: Instructions per cycle and Branch mispredicts/instr]
• Same performance characteristics but with longer progressive alignment step
Clustalw livegraphs - smaller inputs
Almost all branch mispredicts are caused by condition register mispredicts
[Figure: Branch mispredicts due to TA and Branch mispredicts due to CR]
Clustalw livegraphs - smaller input
Can we use Mambo with smaller input sizes for more performance analysis?
[Figure: Instructions Per Cycle and Loads/instr.]
• Same performance characteristics but with longer progressive alignment step
Using Mambo with Clustalw and other
applications
• Collect separate outputs for each phase of the run
• Inserted “callthru exit” into the source code
separating each part
• Dump the system statistics at the end of each
phase
– mysim stats dump
– mysim caches stats dump
– MamboClearSystemStats (clean the previous statistics)
• Multiple “mysim go” in the .tcl file.
Clustalw on Mambo
Mambo offers far more detailed instruction profiling than G5?
[Figure: INST_TYPE_ARITH, INST_TYPE_BRANCH, and INST_TYPE_LOAD counts]
• Pairwise alignment: high loads and arithmetic instructions
• Progressive alignment uses results from first step: high branch and loads
Comparing large datasets with small datasets
Is it feasible to use smaller input datasets for accurate simulation results?
[Figure: comparison of large and small input datasets]
• Branch mispredicts are lower due to the smaller dynamic programming step
• Branch mispredicts much higher
• High increase in L1d hit rate
Summary of BioPerf performance
[Figure: performance summary chart; callouts listed below]
• Highest instructions executed per cycle
• Highest branch mispredicts and TLB misses
• High % of ld/st/io instructions
• Low TLB misses
• High loads per instruction
• Very low % of ld/st/io
• High L1d hit rate
• Low branch mispredicts
Summary of BioPerf performance
[Figure: performance summary chart; callouts listed below]
• High branch mispredicts
• Mid-range instructions per cycle
• High loads per instruction
• Low TLB misses
• Low % of ld/st/io instructions
Summary of BioPerf Performance
[Figure: performance summary chart; callouts listed below]
• Lowest instruction rate
• Lowest loads per instruction
• Low branch mispredicts and TLB misses
• Lowest L1D and L2D hit rate
• Low % of ld/st/io instructions