TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub
Bin Gan

CMSC 838 Presentation
Motivation


NCBI BLAST on a single processor has become too costly, inefficient, and time-consuming.

Sequence databases are exploding in size.
  - Growing at an exponential rate
  - Growth exceeds the rate of increase in hardware capabilities (Moore's Law)
  - Thrashing and buffer management
Goals

Faster results for life science laboratories

Do not change the BLAST algorithm

Avoid using costly multiprocessor machines

Use cheap clusters of machines as the alternative
CMSC 838T – Presentation
Talk Overview

Motivation
Techniques
  - Database partition
  - Use the sequential BLAST
  - Merge results
  - TurboHub infrastructure
Evaluation
  - 3 test runs and analysis
Related work
  - PowerBLAST
  - Paracel's BLAST Machine
  - mpiBLAST
  - Words of Bill Pearson (author of FASTA)
Observations
Techniques

Approach
Main intuition
Implementation
  - Clients, master, and workers
TurboHub System
  - Load balance
  - Fault recovery
Dynamic database partitioning
  - Binary tree analogy
Techniques

Approach







  - Split databases instead of query sequences, in binary-tree fashion
  - Algorithms decide how to split, with the goal of balancing overhead and load
  - Each processor runs the complete sequential BLAST on a database subset
  - Merge the results into XML format
  - Adjust BLAST statistics for database size
  - TurboHub provides backend support for scheduling, fault recovery, etc.
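The statistics adjustment can be illustrated with a minimal sketch. It assumes the Karlin-Altschul form E = K·m·n·e^(-λS), in which the E-value is proportional to the database length n, so a hit found in a partition can be rescaled by the ratio of full length to partition length. The function name and the numbers are hypothetical; the slides do not say how TurboBLAST performs the correction, and real BLAST uses effective lengths, so this linear rescale is only an approximation.

```python
def rescale_evalue(e_partition: float, partition_len: int, total_len: int) -> float:
    """Rescale an E-value computed against one partition to the full database.

    Assumes E is proportional to database length (E = K*m*n*exp(-lambda*S)),
    so E_full = E_partition * total_len / partition_len.
    """
    if partition_len <= 0 or total_len < partition_len:
        raise ValueError("need 0 < partition_len <= total_len")
    return e_partition * (total_len / partition_len)


# Hypothetical example: a hit with E = 1e-5 in a partition that holds
# one quarter of the residues in the full database.
adjusted = rescale_evalue(1e-5, partition_len=100_000_000, total_len=400_000_000)
print(adjusted)
```

A hit therefore looks four times less significant once the full search space is taken into account, which is why results from partitions cannot simply be concatenated.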
Main Intuitions




  - Divide and conquer
  - BLAST compares the target sequence with each sequence in the database individually
  - Very little communication is needed, and the communication is not order-dependent
  - Parallelism is easy to achieve by splitting the database and assembling the results
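The divide-and-conquer intuition can be sketched on a toy problem. The scoring function below is a stand-in (shared k-mer count), not the BLAST algorithm, and all names are hypothetical. The point it shows is that searching each partition independently and merging the per-partition top hits gives the same answer as one search over the whole database.

```python
from heapq import nlargest


def toy_score(query: str, subject: str, k: int = 3) -> int:
    """Stand-in for BLAST scoring: distinct k-mers of query found in subject."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return sum(1 for kmer in kmers if kmer in subject)


def search(query: str, database: dict, top: int = 3) -> list:
    """Score the query against every sequence individually; keep the best hits."""
    hits = [(toy_score(query, seq), name) for name, seq in database.items()]
    return nlargest(top, hits)


def partition(database: dict, parts: int) -> list:
    """Deal the sequences round-robin into `parts` sub-databases."""
    names = sorted(database)
    return [{n: database[n] for n in names[i::parts]} for i in range(parts)]


def split_search(query: str, database: dict, parts: int = 2, top: int = 3) -> list:
    """Search each partition independently, then merge the partial results."""
    # Each iteration is independent, so it could run on a separate worker.
    partial = [search(query, chunk, top) for chunk in partition(database, parts)]
    merged = [hit for hits in partial for hit in hits]
    return nlargest(top, merged)


db = {
    "seq1": "ACGTACGTGG",
    "seq2": "TTTTACGTAA",
    "seq3": "GGGGCCCCAA",
    "seq4": "ACGTACGTAC",
    "seq5": "CATCATCATC",
}
print(split_search("ACGTACG", db, parts=2))
```

Because every comparison involves only the query and one database sequence, the order in which partitions finish does not matter; only the final merge needs all partial results.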
Techniques

Implementations

Three-tier system

Client
  - End user submitting a job to the system
Master
  - Java application that accepts the job
  - Sets up the job for processing
  - Uses TurboHub to:
      - Manage task execution
      - Coordinate the workers
      - Support dynamic changes in the set of workers, fault tolerance, etc.
Workers
Techniques

Implementations Cont.

Workers
  - Each has a local copy of NCBI blastall
  - Partition the database so that the resulting portion fits into available physical memory
  - The initial task groups 10-20 query sequences against all the databases to limit startup cost
  - Some worker processes merge the results
  - Parse the output (stored in XML format)
  - Adjust BLAST statistics for database size
  - Scheduling uses the Piranha model (not discussed in the paper, but very important)
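A worker's call to its local blastall might look like the sketch below. It only builds the argument list. The options used (`-p` program, `-d` database, `-i` query file, `-o` output file, `-m 7` for XML output, `-e` expectation cutoff) belong to the legacy NCBI blastall tool, but which options TurboBLAST actually passes is not stated in the slides, so the function and the file names are assumptions.

```python
def build_blastall_cmd(program: str, database: str, query_file: str,
                       out_file: str, evalue: float = 10.0) -> list:
    """Build an argument list for legacy NCBI blastall with XML output.

    Hypothetical helper: the option set is an assumption, not taken
    from the TurboBLAST paper.
    """
    return [
        "blastall",
        "-p", program,      # search program, e.g. blastn or blastx
        "-d", database,     # formatted database (or database portion)
        "-i", query_file,   # FASTA file with the task's query sequences
        "-o", out_file,     # where to write the report
        "-m", "7",          # XML report, convenient for later merging
        "-e", str(evalue),  # expectation value cutoff
    ]


cmd = build_blastall_cmd("blastn", "ecoli.part0", "task_0001.fa", "task_0001.xml")
print(" ".join(cmd))
```

On a machine with blastall installed, the list could be passed to `subprocess.run(cmd, check=True)`; the XML report would then be parsed and merged by whichever worker is assigned the merge step.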
Techniques

TurboHub System

Developed by Scientific Computing Associates

Capabilities
  - Pipelining
  - Component replication
  - Parallel components, in combination with tools from SCA, MPI, PVM, OpenMP
Application in this talk
  - Each worker is a wrapped blastall component
  - Component scheduling
  - Fault recovery
Techniques

Task/Database Splitting

Two options

Large tasks
  Advantages
    - Maximize resource utilization
    - Minimize task startup overhead
  Disadvantages
    - Load imbalance
    - Limits the performance gain
Small tasks
  - Advantages and disadvantages are the reverse of the above
Techniques

Task/Database Splitting cont.

The paper's intermediate approach

Create a large initial task, sized by experience
  - Communication and program startup costs are negligible for tasks of at least 10-20 input query sequences on machines with 256 MB of memory
If the task is too large, split the databases
  - For multiple databases, put roughly half of the databases in each sub-task
  - For a single database, split the database in half
Uses virtual shared memory
  - The actual database files are never sent to a worker until it actually requires them
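The splitting rule can be sketched as a short recursion. This is an illustration under assumptions, not the paper's algorithm: the `Segment` type, the size-in-MB criterion, and the naming of the halves are all hypothetical. A task that does not fit the memory budget is halved, by database count when it covers several databases and by size when it covers one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A database, or a portion of one, with its size in MB (hypothetical type)."""
    name: str
    size_mb: int


def split_task(segments: list, memory_mb: int) -> list:
    """Recursively halve a task until each piece fits in memory_mb."""
    if sum(s.size_mb for s in segments) <= memory_mb:
        return [segments]
    if len(segments) > 1:
        # Multiple databases: roughly half of them go to each sub-task.
        mid = len(segments) // 2
        left, right = segments[:mid], segments[mid:]
    else:
        # Single database: split it in half.
        seg = segments[0]
        if seg.size_mb <= 1:
            return [segments]  # cannot be split any further
        half = seg.size_mb // 2
        left = [Segment(seg.name + ".0", half)]
        right = [Segment(seg.name + ".1", seg.size_mb - half)]
    return split_task(left, memory_mb) + split_task(right, memory_mb)


task = [Segment("A", 100), Segment("B", 600)]
pieces = split_task(task, memory_mb=256)
for piece in pieces:
    print([(s.name, s.size_mb) for s in piece])
```

With a 256 MB budget, database A is kept whole while B is halved twice, so the recursion stops as soon as a piece fits rather than always splitting to the smallest unit.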
Techniques

Database Splitting

Split using the NCBI database formatting program formatdb

Binary tree analogy
  - The leaves, taken together, are the whole database
  - The portion of the database a worker accesses depends on which node it has chosen
  - The worker uses all leaves under the chosen node

Advantages:
  - Flexibility
  - Delivers exactly the amount of data needed
  - Single copy of the database
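The binary tree analogy can be made concrete with a small sketch. It assumes a complete binary tree with heap-style numbering (root is node 1, children of node i are 2i and 2i+1) and a power-of-two number of leaf fragments; neither detail comes from the slides, and the function name is hypothetical.

```python
def leaves_under(node: int, num_leaves: int) -> range:
    """Return the indices of the leaf fragments covered by a tree node.

    Assumes heap numbering (root = 1) and num_leaves a power of two,
    so the leaves are nodes num_leaves .. 2*num_leaves - 1.
    """
    if num_leaves <= 0 or num_leaves & (num_leaves - 1):
        raise ValueError("num_leaves must be a power of two")
    if not 1 <= node < 2 * num_leaves:
        raise ValueError("node is not in the tree")
    first = last = node
    while first < num_leaves:  # descend to the leftmost and rightmost leaves
        first, last = 2 * first, 2 * last + 1
    return range(first - num_leaves, last - num_leaves + 1)


# Database stored once as 8 fragments; workers pick nodes of different depth.
print(list(leaves_under(1, 8)))   # root: the whole database
print(list(leaves_under(3, 8)))   # right half
print(list(leaves_under(12, 8)))  # a single fragment
```

A worker with plenty of memory can choose a node near the root, while a smaller worker chooses a deeper node, and both read from the same single set of fragments.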
Evaluation

Experimental environment for test one
  - Input data set: 50 expressed sequence tags (ESTs)
  - Databases used:
      - Drosophila (1,170 sequences, 123 million nucleotides)
      - GSS division of GenBank (1.27 million sequences, 651 million nucleotides)
      - E. coli (400 sequences, 4.6 million nucleotides)
  - A group of 500 MHz PIII machines with 512 KB cache, 256 MB memory, and 100 Mb Ethernet
Performance result for test one
  - Serial version: 2131.8 seconds (wall-clock time)
  - Parallel version with 11 workers: 130.0 seconds (speedup ≈ 16)
Evaluation


Experimental environment for test two
  - Input data set: chromosomes 1, 2, and 4 of the Arabidopsis genome
  - Database used: Swiss-Prot protein database (12.8 million peptides)
  - A group of 500 MHz PIII machines with 512 KB cache, 256 MB memory, and 100 Mb Ethernet
Performance result for test two
  - Serial version: 5 days, 19 hours, 13 minutes
  - Parallel version with 11 workers: 12 hours, 54 minutes (speedup = 10.8)
Evaluation

Experimental environment for test three
  - Input data set: 500 mouse ESTs of 200-400 nucleotides each
  - Database used: an NT database from NCBI (1,681,522,266 nucleotides)
  - IBM Linux cluster of 8 dual-processor workstations
  - Each workstation contains two 996 MHz PIIIs with 2 GB memory and 100 Mbit Ethernet
Performance result for test three
  - Serial version: 4945 seconds
  - Parallel version with 8 workstations (16 workers): 357.03 seconds (speedup = 13.85)
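The reported speedups can be checked from the timings of the three tests. The sketch below recomputes speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by worker count); the function name is hypothetical, and the inputs are the figures quoted on the evaluation slides.

```python
def speedup_and_efficiency(serial_s: float, parallel_s: float, workers: int):
    """Return (speedup, efficiency) for one serial/parallel pair of timings."""
    speedup = serial_s / parallel_s
    return speedup, speedup / workers


HOUR, DAY = 3600, 86400

# (serial seconds, parallel seconds, number of workers)
runs = {
    "test one": (2131.8, 130.0, 11),
    "test two": (5 * DAY + 19 * HOUR + 13 * 60, 12 * HOUR + 54 * 60, 11),
    "test three": (4945.0, 357.03, 16),
}

results = {name: speedup_and_efficiency(*run) for name, run in runs.items()}
for name, (speedup, efficiency) in results.items():
    print(f"{name}: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency above 1 means superlinear speedup, which is the case for test one (a speedup of about 16 on 11 workers); tests two and three stay just below linear.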
Evaluation Analysis

Memory size vs. database size
  - Thrashing avoidance for superlinear speedups
  - Single query at a time
  - Single query at each node
Overhead
  - Need to combine results
  - TurboHub overhead
  - Database transmission overhead
Related Work

Other parallel BLAST implementations

Blackstone's PowerBLAST (part of PowerCloud)
  - Automates the splitting of query databases into smaller chunks
  - Chunks are spread over the cluster nodes' local disks for querying
  - Automates the merging of BLAST results
  - Uses disk caching and scheduling techniques to speed up future queries of the same datasets
Paracel's BLAST Machine
  - Paracel actually got inside BLAST and parallelized the code
  - Posts impressive speedup numbers, and the statistics are the same as for an unaltered BLAST query
mpiBLAST
  - Splits the database across the nodes in the cluster, so each portion can usually reside in the buffer cache
Related Work

Words of Bill Pearson (FASTA), in response to why there are no MPI or PVM parallelized versions of BLAST

Note: Paracel's type of parallelization

  - BLAST is too fast, and there is not much demand for a parallel version
  - 95% of the time, BLAST is almost an in-memory grep
  - Sequence comparison is embarrassingly parallel, and very easily threaded
  - Distributing the sequence databases and collecting results has more overhead
  - FASTA is 5-10X slower than BLAST
  - Smith-Waterman is 5-20X slower than FASTA
  - The communications overhead is low, and distributed systems work OK for FASTA and great for Smith-Waterman
Observations

Observations
  - Efficient due to the parallelism inherent in the BLAST algorithm
  - Different database splitting techniques
  - Feasible in practice (in computing power, user effort, etc.)
  - Similar results to previous work

Improvements
  - Because the BLAST code may not be changed, superlinear speedup is possible only where existing thrashing is avoided
  - Larger memory and cache size
  - Better load balancing techniques
  - Overhead reduction; flexibility vs. performance