TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub Bin Gan CMSC 838 Presentation Motivation NCBI BLAST on a single processor has become too costly , inefficient, and time-consuming. Sequence database are exploding in size. Growing at an exponential rate Exceeds the rate of increase in hardware capabilities ( Moores’ Law) Thrashing and buffer management Goals Faster results for life science laboratories Do not change the BLAST algorithm Avoid using costly multiprocessor machines Cheap alternatives of clusters of machines CMSC 838T – Presentation Talk Overview Overview of talk Motivation Techniques Database partition Use the sequential BLAST Merge results TurboHub infrastructure Evaluation 3 test runs and analysis Related work Powerblast Paracel’s BLAST Machine mpiBLAST Words of Bill Pearson ( auther of FASTA) Observations CMSC 838T – Presentation Techniques Approach Main intuition Implementation Clients, master, and workers TurboHub System Load balance Fault recovery Dynamic database partitioning Binary tree analogy CMSC 838T – Presentation Techniques Approach Split databases instead of query sequences in binary tree fashion Algorithms to decide how to split with the goal of balance overhead & load Each processor runs complete sequential BLAST using database subsets Merge the result into XML format Adjust BLAST statistics for database sizes TurboHub provide backend support for scheduling, fault recovery, etc. Main Intuitions Divide and Conquer BLAST compares target sequence with each sequence in the database individually Very little communication is needed, and the communication is not order dependant Easy to achieve parallelism by splitting the database and assembling the result CMSC 838T – Presentation Techniques Implementations 3 tier system Client End user submitting job to the system Master Java application accepts the job Sets up for processing Uses TurboHub Manage task execution Coordinate the workers Support dynamic change in set of workers, fault tolerances, etc. Workers CMSC 838T – Presentation Techniques Implementations Cont. Workers Has a local copy of NCBI blastall Partition the database so that the resulting portion can fit into available physical memory Initial task group of 10-20 sequences against all the databases to avoid startup cost Some worker process will merge the results Parse the output (store as XML format) Adjust BLAST statistics for database size Scheduling using Piranha models not talked in paper, but very important CMSC 838T – Presentation Techniques TurboHub System Developed by Scientific Computing Associates Capabilities Pipelining Component Replication Parallel Components in combination with tools from SCA, MPI, PVM, OpenMP Application in this topic Worker is a wrapped-up blastall components Component scheduling Fault recovery CMSC 838T – Presentation Techniques Task/Database Splitting 2 options Large Task Advantage Disadvantage Maximize resource utilization Minimize task startup overhead Load imbalance Limit the performance gain Small Task Advantage and disadvantage are reverse of the above CMSC 838T – Presentation Techniques Task/Database Splitting cont. The paper’s intermediate approach Create large initial task by experience Communication and program startup are trivial for at least 10-20 input query sequences with 256M memory If the task is too large, split the databases For multiple databases, create roughly half of databases in each sub database For single database, split the database by half Uses virtual shared memory The actual database files are never sent to a worker until it actually requires them CMSC 838T – Presentation Techniques Database Splitting Split using NCBI database formatting program formatdb Analogy of binary tree All the combined leaves are the database The portion of the database to access depends on which node the worker has decided to be at Uses all leaves under the chosen node Advantage: Flexibility Deliver exact amount of data as needed Single copy of database CMSC 838T – Presentation Evaluation Experimental environment for test one Input data sets: 50 Expressed Sequence Tags (ESTs) Database used: Drosophila (1,170 sequences, 123 million nucleotides), GSS Division of GENBANK (1.27 million sequences, 651 million nucleotides) E Coli (400 sequences, 4.6 million nucleotides) A group of 500 Mhz PIII with 512K cache, 256M Memory,100Mb Ethernet Performance result for test one Serial version: 2131.8 second (wall clock time) Parallel version with 11 workers: 130.0 second. (Speedup = 16) CMSC 838T – Presentation Evaluation Experimental environment for test two Input data sets: Chromosomes 1, 2, 4 from the Arabidopsis Genome Database used: Swiss-Prot Protein database (12.8 Million peptides) A group of 500 Mhz PIII with 512K cache, 256M Memory,100Mb Ethernet Performance result for test two Serial version: 5 Days 19 hours and 13 minutes Parallel version with 11 workers: 12 hours, 54 minutes. (Speedup = 10.8) CMSC 838T – Presentation Evaluation Experimental environment for test three Input data sets: 500 mouse ESTs with 200-400 nucleotides each Database used: An NT database from NCBI (1,681,522,266 nucleotides) IBM linux cluster of 8 dual processor workstation Each workstation contains 2 996 PIII’s with 2 G memory, 100 Mbit ethernet Performance result for test three Serial version: 4945 second Parallel version with 8 workstations(16 workers): 357.03 second. (Speedup = 13.85) CMSC 838T – Presentation Evaluation Analysis Memory size vs. database size Thrashing avoidance for superlinear speedups Single query at a time Single query at each node Overhead Need to combine results TurboHub overhead Database transmission overhead CMSC 838T – Presentation Related Work Other parallel BLAST Blackstone's PowerBLAST (part of PowerCloud) Automate the splitting of query databases into smaller chunks Spread out over the cluster nodes' local disks for querying Automates the merging of BLAST results Use disk caching and scheduling techniques to speed up future queries of the same datasets Paracel's BLAST Machine Paracel actually got inside BLAST and parallelized the code Post impressive speed up numbers and the statistics Same as an unaltered BLAST query mpiBLAST Splits the database across each node in the cluster, so it can usually reside in the buffer-cache CMSC 838T – Presentation Related Work Words of Bill Pearson (FASTA) in response to why there are no MPI or PVM parallelized versions of BLAST Note: Paracel’s types of parallelization It is too fast and there is not much demand for it 95% of the time, BLAST is almost an in-memory grep Sequence comparison is embarrassingly parallel, and very easily threaded Distributing the sequence databases and collecting results has more overhead FASTA is 5 - 10X slower than BLAST Smith-Waterman is 5-20X slower than FASTA The communications overhead is low, and distributed systems work OK for FASTA, and great for Smith-Waterman CMSC 838T – Presentation Observations Observation Efficient due to the parallelism embedded in the BLAST algorithm Different database splitting techniques Feasible in practice (in computing power, user effort, etc…) Similar result to previous work Improvement Due to the requirement of not changing code on BLAST, superlinear speedup is only possible if existing thrashing is avoided. Larger memory and cache size Better load balancing technique Overhead reduction, flexibility vs performance CMSC 838T – Presentation