CMSC 838 Presentation
Authors: Dmitri Mikhailov, Haruna Cofer, Roberto Gomperts
SGI
Multiple Sequence Alignment (MSA)
Basis for phylogenetic analysis - Infer homology relationships
Building protein families - conserved region may imply common function
Aids in function/structure prediction of new proteins
Global MSA – Clustal W
Is it computationally expensive? Yes, for 100 or more sequences.
Goal: Parallelize Clustal W
Clustal W takes hours for 100 or more sequences
Parallelization possible for the algorithm
Contribution of the paper
Parallel Clustal W
Parallel version of basic Clustal W
HT Clustal
Parallelize heterogeneous Multiple Sequence Alignment problems
MULTICLUSTAL
Parallel version of an optimization on Clustal W
CMSC 838T – Presentation
Overview of talk
Motivation
Background
Sequential Clustal W
Parallel Clustal W
HT Clustal
Problem Statement
Optimizations
MULTICLUSTAL
Sequential Algorithm
Optimizations
Observations
Sequential Clustal W Algorithm
Given N sequences of length M each
Pairwise Alignment (PW)
Creates distance matrix N x N based on pairwise alignment scores
Evolutionary distance
Guide Tree (GT) construction (Phylogenetic tree)
Use Neighbor-joining algorithm
Progressive Multiple Alignment (PA)
Use guide tree to align closely related pairs of sequences
Progressively align next sequence to existing alignment
Problem Statement
Parallelize the Sequential Clustal W
Execution time breakup
PW = pairwise alignment, GT = guide tree, PA = progressive alignment
Pairwise Alignment Stage
N(N-1)/2 pairwise alignments
Send them randomly to different processors
Random, because jobs have different loads
Random assignment also produces a statistically uniform distribution (over a large set of jobs)
1.8X speedup achieved on a 1000 sequence MSA with 8 CPUs
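A minimal sketch of the random distribution described above, assuming a simple shuffle followed by a round-robin deal. The helper name `distribute_pairs` and the fixed seed are illustrative; the slide does not describe the actual dispatch mechanism.

```python
import itertools
import random

def distribute_pairs(n_sequences, n_cpus, seed=0):
    """Shuffle all N(N-1)/2 sequence pairs and deal them out to CPUs.

    Hypothetical helper: it only illustrates random assignment and is
    not the dispatch code used in Parallel Clustal W.
    """
    pairs = list(itertools.combinations(range(n_sequences), 2))
    random.Random(seed).shuffle(pairs)
    # Deal the shuffled pairs round-robin so each CPU gets a random subset.
    return [pairs[cpu::n_cpus] for cpu in range(n_cpus)]

buckets = distribute_pairs(n_sequences=100, n_cpus=8)
print([len(b) for b in buckets])
```

Because the pairs are shuffled before being dealt out, expensive and cheap alignments are mixed across CPUs without needing to know each job's cost in advance.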
Guide Tree Stage
Parallelize “find closest neighbors from distance matrix”
Used in the neighbor joining algorithm
Find minimum element of each row concurrently
Use this to find minimum element of matrix
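The row-wise decomposition above can be sketched as follows. This is an illustration only: it searches a plain distance matrix, whereas a full neighbor-joining step minimizes an adjusted criterion, and Python threads are used here just to show the structure, not to claim a speedup. The function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def row_minimum(args):
    """Return (value, i, j) for the smallest off-diagonal entry in row i."""
    i, row = args
    value, j = min((v, j) for j, v in enumerate(row) if j != i)
    return value, i, j

def matrix_minimum(dist, workers=4):
    """Find the closest pair by reducing per-row minima computed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        row_minima = list(pool.map(row_minimum, enumerate(dist)))
    # Final reduction: the smallest of the row minima is the matrix minimum.
    return min(row_minima)

# Toy symmetric distance matrix (made-up values).
dist = [
    [0.0, 0.7, 0.4, 0.9],
    [0.7, 0.0, 0.3, 0.8],
    [0.4, 0.3, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
]
print(matrix_minimum(dist))  # (0.3, 1, 2)
```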
Progressive Alignment Stage
Values of the function score(I,J) are precomputed in parallel
score(I,J) is the alignment score of sequences I and J
Little parallelism is available in the third stage
Overall Speedup
Speedup of 10x for 600 MA sequences using 16 CPUs
Time reduced from 1 hr 7 minutes to 6.5 minutes
Relative scaling is better for larger inputs
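As a quick check of the figures above, the speedup and the derived parallel efficiency (speedup divided by CPU count, a standard quantity not stated on the slide) can be computed directly:

```python
def speedup_and_efficiency(serial_minutes, parallel_minutes, n_cpus):
    """Return (speedup, parallel efficiency) from two wall-clock times."""
    speedup = serial_minutes / parallel_minutes
    return speedup, speedup / n_cpus

# Figures from the slide: 1 hr 7 min serial, 6.5 min on 16 CPUs.
s, e = speedup_and_efficiency(67.0, 6.5, 16)
print(f"speedup = {s:.1f}x, efficiency = {e:.0%}")
```

This gives a speedup of about 10.3x, consistent with the quoted 10x, at roughly 64% efficiency on 16 CPUs.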
Problem Statement
Calculate large numbers of MSAs of various sizes (independent problems)
Such problems seen in high-throughput (HT) research environments
Representative Problem (from paper):
Perform independent MSAs over 100 sets of sequences
Each set has between 20 and 100 sequences, with an average of 60
Average sequence length = 390
Basic Idea
Each MSA operation (on one set of sequences) is independent of the other
Run ClustalW as a uniprocessor job on one MSA problem
Launch multiple Clustal W jobs on different processors
Job Scheduling
Jobs of different duration – depends on sequence set
Two scheduling options explored:
Schedule dynamically: if a processor is free, schedule an MSA job chosen randomly
Schedule dynamically: jobs are presorted (based on file size)
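The two scheduling options can be compared with a small simulation. This is a sketch under stated assumptions: job durations are made-up numbers standing in for MSA run times, and "presorted" is taken to mean largest job first, which the slide does not specify.

```python
import heapq
import random

def makespan(job_times, n_cpus):
    """Simulate dynamic scheduling: each job goes to the first CPU to free up."""
    free_at = [0.0] * n_cpus            # time at which each CPU becomes idle
    heapq.heapify(free_at)
    for t in job_times:
        start = heapq.heappop(free_at)  # earliest-available CPU
        heapq.heappush(free_at, start + t)
    return max(free_at)

# Hypothetical job durations (arbitrary units).
rng = random.Random(1)
jobs = [rng.uniform(1, 30) for _ in range(100)]

random_order = jobs[:]
rng.shuffle(random_order)
presorted = sorted(jobs, reverse=True)  # assumption: largest jobs first

print("random   :", round(makespan(random_order, 32), 1))
print("presorted:", round(makespan(presorted, 32), 1))
```

Starting the longest jobs first tends to avoid the case where one large job is launched last and runs alone while the other CPUs sit idle.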
Speedups
Almost linear speedups
31x on 32 CPUs for the representative MSA problem
116X on 128 CPUs for a larger test case
Solution time reduced from 18.5 hours to 9.5 minutes
Speedup shown for the example MSA set:
Effect of presorting
Figure shows the effect of presorting for the example MSA set (32 CPUs, 100 sets, ~3 jobs per CPU)
If the average number of jobs per CPU is < 5, presorting helps
For a larger number of jobs per CPU, statistical averaging reduces load imbalance
MULTICLUSTAL Algorithm
A Perl script to generate high quality MSA with little user intervention
Searches for best combination of Clustal W input parameters
To reduce gaps, increase clustering
Parameters to vary:
Scoring matrices (pairwise and multiple)
Gap open and extension penalties (pairwise and multiple)
Sequential Algorithm:
1. Until all parameters are sufficiently varied {
2.     alignment = Run Clustal W ()
3.     Calculate quality of alignment
4.     Change parameters }
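The loop above can be sketched as an exhaustive search over a parameter grid. MULTICLUSTAL itself is a Perl script, so this Python version is only an illustration. The callables `run_alignment` and `score_alignment`, the grid values, and the exhaustive strategy are all assumptions rather than the script's actual behavior.

```python
import itertools

def search_parameters(run_alignment, score_alignment, grid):
    """Try every parameter combination and keep the best-scoring alignment.

    run_alignment and score_alignment are caller-supplied callables
    (hypothetical stand-ins for running Clustal W and scoring its output).
    """
    best = None
    names = list(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        alignment = run_alignment(params)
        quality = score_alignment(alignment)
        if best is None or quality > best[0]:
            best = (quality, params, alignment)
    return best

# Illustrative grid; the parameter names and values are placeholders.
grid = {
    "matrix": ["BLOSUM", "PAM", "GONNET"],
    "gap_open": [5.0, 10.0, 15.0],
    "gap_ext": [0.1, 0.2],
}

# Dummy stand-ins so the sketch runs without Clustal W installed.
fake_run = lambda p: p
fake_score = lambda a: -abs(a["gap_open"] - 10.0) - a["gap_ext"]

quality, params, _ = search_parameters(fake_run, fake_score, grid)
print(params)
```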
Quality of alignment
A numerical quantity based on
Identical amino acid matches
Conservative amino acid substitutions
Gap events and amino acid islands (i.e. -X-, -XX-, -XXX-, -XXXX-)
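Some of the ingredients listed above are easy to count on a toy alignment. The sketch below is an assumption-laden illustration: it omits conservative substitutions and does not combine the counts into a single score, since the slide does not give the weighting. The function names are hypothetical.

```python
import re

def gap_events(row):
    """Count maximal runs of '-' in one aligned sequence."""
    return len(re.findall(r"-+", row))

def islands(row, max_len=4):
    """Count residue runs of length 1..max_len flanked by gaps on both sides."""
    pattern = rf"(?<=-)[^-]{{1,{max_len}}}(?=-)"
    return len(re.findall(pattern, row))

def identical_columns(rows):
    """Count columns where every sequence has the same non-gap residue."""
    return sum(1 for col in zip(*rows) if len(set(col)) == 1 and col[0] != "-")

# Toy two-sequence alignment (made-up data).
rows = ["MK-A--LLV", "MKTA--LIV"]
print(identical_columns(rows), gap_events(rows[0]), islands(rows[0]))  # 5 2 1
```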
Optimization on MULTICLUSTAL
Run Clustal W once
Reuse tree generated in the PW/GT Stages
Guide tree calculated only once for multiple runs
Results in speedups from 1.5X to 3X
Use Parallel Clustal W for each run of Clustal W
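The tree-reuse idea can be sketched as hoisting the expensive stages out of the parameter loop. The callables `build_guide_tree` and `progressive_align` are hypothetical stand-ins for the PW/GT stages and the progressive alignment stage. The sketch also assumes that only multiple-alignment parameters vary between runs, since changing pairwise parameters would presumably change the tree.

```python
def multiclustal_with_reuse(sequences, param_sets, build_guide_tree, progressive_align):
    """Build the guide tree once, then rerun only the progressive stage.

    build_guide_tree and progressive_align are hypothetical callables,
    not functions provided by Clustal W or MULTICLUSTAL.
    """
    tree = build_guide_tree(sequences)  # expensive PW/GT work, done once
    return [progressive_align(sequences, tree, p) for p in param_sets]

# Dummy stand-ins that just record how often each stage runs.
calls = {"tree": 0, "align": 0}

def fake_tree(seqs):
    calls["tree"] += 1
    return "tree"

def fake_align(seqs, tree, params):
    calls["align"] += 1
    return (tree, params)

param_sets = [{"gap_open": g} for g in (5, 10, 15)]
results = multiclustal_with_reuse(["s1", "s2"], param_sets, fake_tree, fake_align)
print(calls)  # {'tree': 1, 'align': 3}
```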
Parallelizability
First (pairwise alignment) and second (guide tree) stages are parallelizable
Third stage is mostly sequential – speedup limited
100-sequence MSAs possible?
PIR at NBRF (Georgetown University) takes a maximum of 20 sequences for MSA
Speedup improves user response time; for 20 sequences a PC would be sufficient
Probable applications:
Research environments?
PIR servers?
Speedup only on a shared-memory SGI 3000 workstation?