Parallel Computation in Biological Sequence Analysis Xue Wu CMSC 838 Presentation

Motivation

Scanning and analyzing biological sequences are common and
repeated tasks in molecular biology



Homologous sequence searching
– Based on pairwise alignment
– Task is to find similarities between a particular query sequence
and all the sequences of a biosequence databank
Multiple sequence alignment
– Simultaneous alignment of three or more nucleotide or amino acid
sequences
Problems with sequential solution



With the exponential growth of the biosequence banks, homologous
sequence searching becomes time consuming
The automatic generation of an accurate multiple alignment is
computationally expensive
A parallel solution can reduce computation time and provide more
accurate results
CMSC 838T – Presentation
Talk Overview

Overview of talk

Motivation

Techniques and Evaluations

Similarity Sequence Searching
Multiple Sequence Alignment
Observations

Techniques – similarity sequence searching

Two main parallel methods to search sequence database

Fine grain approach for SIMD parallel computer

Parallelize the comparison algorithm itself
– All processors cooperate to determine the similarity score
Coarser grain approach for MIMD parallel computer





Parallelize the database searching
Each processor performs a selected number of comparisons
Method used in the paper
Parallelize Similarity Searching – coarser grain approach

Workload balancing is the key point for better parallelism

Partition the database, search each piece sequentially, and combine the
results; load balance requires equal-sized pieces of the database

Percentage of Load ImBalance (PLIB) as metric for load imbalance
\[ PLIB = \frac{LargestLoad - SmallestLoad}{LargestLoad} \times 100 \]
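As a small worked example of the PLIB metric, the sketch below computes it from a list of per-processor loads. The function name `plib` and the choice to return 0 when every load is zero are assumptions made for illustration; the slides only give the formula.

```python
def plib(loads):
    """Percentage of Load ImBalance for a list of per-processor loads."""
    largest = max(loads)
    smallest = min(loads)
    if largest == 0:
        # Assumption: an all-zero workload counts as perfectly balanced.
        return 0.0
    return 100.0 * (largest - smallest) / largest


# Three processors holding 100, 80 and 90 units of sequence data.
print(plib([100, 80, 90]))
```

A PLIB of 0 means every processor received the same load; values near 100 mean at least one processor is almost idle compared with the busiest one.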


Techniques – similarity sequence searching

Splitting up database

Unsorted portion method – first load balancing technique

Partition the database into a number of portions
– Portion_size = database_size / processors_number
– If assigning a sequence causes the sum of sequence lengths
in portion P to exceed the ideal size by more than X percent,
reassign the sequence to portion P+1
– Low communication overhead, but possibly high PLIB
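The portion rule can be sketched as follows. This is one reading of the slide, not the paper's code: sequences are taken in database order, the current portion keeps filling until the next sequence would push it past the ideal size by more than X percent, and the last portion absorbs whatever remains. The name `unsorted_portions` is hypothetical.

```python
def unsorted_portions(lengths, processors, x_percent):
    """Split sequences (given by length, in database order) into portions."""
    ideal = sum(lengths) / processors            # Portion_size
    limit = ideal * (1 + x_percent / 100.0)      # ideal size plus X percent
    portions = [[] for _ in range(processors)]
    sums = [0] * processors
    p = 0
    for index, length in enumerate(lengths):
        # Move on to portion P+1 when this sequence would overflow portion P.
        # Assumption: the last portion takes everything that is left.
        if p < processors - 1 and sums[p] + length > limit:
            p += 1
        portions[p].append(index)
        sums[p] += length
    return portions, sums


portions, sums = unsorted_portions([50, 40, 30, 20, 10, 50], 2, 10)
print(portions, sums)
```

Because the split depends on the order in which sequences happen to appear, the resulting loads can differ noticeably, which is the "possibly high PLIB" noted above.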
Sorted portion method – Master-worker method




Sequences are sorted in decreasing length order to
minimize PLIB
The master processor distributes the sequences to the
worker processors dynamically
Low PLIB, but high communication overhead
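A minimal sketch of the dynamic distribution idea, using threads and a shared queue in place of message passing on a real MIMD machine. The queue plays the role of the master, and each worker asks for the next sequence as soon as it is free. The names `master_worker_search` and `compare` are hypothetical, and `compare` stands in for whatever pairwise comparison is used.

```python
import queue
import threading


def master_worker_search(sequences, num_workers, compare):
    """Hand out sequences, longest first, to whichever worker is idle."""
    tasks = queue.Queue()
    for seq in sorted(sequences, key=len, reverse=True):
        tasks.put(seq)

    results = []
    load = [0] * num_workers          # total sequence length per worker
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                seq = tasks.get_nowait()   # one request per sequence
            except queue.Empty:
                return
            score = compare(seq)
            with lock:
                results.append((seq, score))
                load[worker_id] += len(seq)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, load
```

Every sequence costs one round trip to the master, which is where the high communication overhead comes from.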
Techniques – similarity sequence searching

Proposed bucket method


Statically apply sorted portion method
Algorithm




Sequences in the database are sorted in decreasing length
order
Starting from the longest-length sequence, place the
sequences in N buckets. For each sequence,
– Find the sum of the sequences length in each bucket
– Find the bucket with the smallest sum value
– Place the sequence in the bucket
– In the case of a tie, the smallest numbered bucket is
selected
Each of the N processors performs sequence search in its
own bucket
If only N/n processors are used, each processor searches n
buckets
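The bucket placement steps can be sketched as below. Sequences are represented only by their lengths, and the function name `bucket_partition` is an assumption for illustration.

```python
def bucket_partition(lengths, n_buckets):
    """Greedy static placement: longest sequence first, lightest bucket wins."""
    # Sort sequence indices by decreasing length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    buckets = [[] for _ in range(n_buckets)]
    sums = [0] * n_buckets
    for i in order:
        # min() returns the first minimum, so ties go to the
        # smallest numbered bucket.
        b = min(range(n_buckets), key=lambda k: sums[k])
        buckets[b].append(i)
        sums[b] += lengths[i]
    return buckets, sums


buckets, sums = bucket_partition([7, 5, 4, 3, 1], 2)
print(buckets, sums)
```

The placement is computed once, before the search starts, so no communication is needed while the processors are working.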
Techniques – similarity sequence searching

Evaluation and comparison

Comparison of Bucket and Portion method

Comparison of Bucket and Master-worker method




Algorithms are implemented on the Intel iPSC/860
Preprocessing is performed on a SPARCstation 2
Data source is GenBank (release 86.0)
Preprocessing overhead is added for Bucket method
Techniques – similarity sequence searching

Evaluation and comparison continued

Conclusions


In all tested cases, the proposed Bucket method has
– Lower PLIB than the Portion method
– Higher speedup than the master-worker method
The Bucket method has an obvious advantage when
– Sequence lengths are relatively small
– Processing with a large number of processors
Techniques – multiple sequence alignment

Sequential Berger-Munson algorithm
[Figure: flowchart with the steps Calculate alignment, Accept, Reject]
Techniques – multiple sequence alignment

Sequential Berger-Munson algorithm


Applies randomized techniques with optimization to iteratively improve
the multiple sequence alignment
Description
1. Randomly partition the input sequences into two groups
2. Align the two groups of sequences, instead of individual sequences, with
the alignment score calculated by
\[ S_{i,j} = \max \left\{ P_{i,j},\; S_{i-1,j-1} + sub'(a_i, b_j),\; Q_{i,j} \right\} \]
\[ P_{i,j} = \max \left\{ S_{i-1,j} + w_1,\; P_{i-1,j} + v \right\} \]
\[ Q_{i,j} = \max \left\{ S_{i,j-1} + w_1,\; Q_{i,j-1} + v \right\} \]
\[ sub'(X_i, Y_j) = \sum_{k=1}^{K} \sum_{l=1}^{L} sub(X_{k,i}, Y_{l,j}) + \sum_{k=1}^{K-1} \sum_{m=k+1}^{K} sub(X_{k,i}, X_{m,i}) + \sum_{l=1}^{L-1} \sum_{m=l+1}^{L} sub(Y_{l,j}, Y_{m,j}) \]
3. If the new alignment score is higher than the previous one, the
alignment is accepted and the gaps are inserted into the sequences
accordingly.
4. The modified or unmodified alignment is used as the input for the
next iteration. The process is stopped after q consecutive iterations
of rejection.
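To make the S, P, Q recurrences in step 2 concrete, here is a minimal sketch for the simplest case, where each group holds a single sequence (K = L = 1), so the within-group sums are empty and sub' reduces to sub. Several details are assumptions rather than anything stated on the slide: w1 and v are treated as negative penalties for opening and extending a gap, the boundary cells are filled by the same recurrences starting from S[0][0] = 0, and the name `pair_score` is hypothetical.

```python
import math


def pair_score(a, b, sub, w1=-3.0, v=-1.0):
    """Best alignment score of sequences a and b under the S/P/Q recurrences."""
    neg = -math.inf
    n, m = len(a), len(b)
    S = [[neg] * (m + 1) for _ in range(n + 1)]
    P = [[neg] * (m + 1) for _ in range(n + 1)]   # gap ending in column j
    Q = [[neg] * (m + 1) for _ in range(n + 1)]   # gap ending in row i
    S[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                P[i][j] = max(S[i - 1][j] + w1, P[i - 1][j] + v)
            if j > 0:
                Q[i][j] = max(S[i][j - 1] + w1, Q[i][j - 1] + v)
            diag = neg
            if i > 0 and j > 0:
                diag = S[i - 1][j - 1] + sub(a[i - 1], b[j - 1])
            S[i][j] = max(P[i][j], diag, Q[i][j])
    return S[n][m]


# Toy substitution score (an assumption): +2 for a match, -1 for a mismatch.
match = lambda x, y: 2.0 if x == y else -1.0
print(pair_score("ACGT", "AGT", match))
```

For real groups, `sub` would be replaced by the column score sub', which adds up every residue pair in the two aligned columns.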
Techniques – multiple sequence alignment

Parallel Berger-Munson algorithm with speculative computation

The iterations in a consecutive run of rejections do not depend on
each other and can be done in parallel
Sequential iteration number: 1 2 3 4 5 6 7 …
Decision sequence: AAAAARAAARRARRRARRRARRRARRRRR…
[Figure: grid of iteration numbers, four rows by 14 columns]
0   1   2   3   4   5   6   8   9   10  13  17  21  25
1   2   3   4   5   6   7   9   10  11  14  18  22  26
2   3   4   5   6   7   8   10  11  12  15  19  23  27
3   4   5   6   7   8   9   11  12  13  16  20  24  28
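The effect of speculation can be illustrated with a small simulation. It assumes that iterations are numbered from 1, that each parallel step evaluates as many consecutive iterations as there are processors, and that everything after the first accepted iteration in a step is discarded and recomputed. The decision string is taken as known in advance purely for the simulation; in a real run each decision only becomes known once the iteration has been computed. The name `speculative_schedule` is hypothetical.

```python
def speculative_schedule(decisions, processors):
    """Return the first iteration number evaluated in each parallel step."""
    starts = []
    nxt = 1                                   # iterations are numbered from 1
    while nxt <= len(decisions):
        starts.append(nxt)
        window = decisions[nxt - 1:nxt - 1 + processors]
        if "A" in window:
            # The accepted iteration changes the alignment, so later
            # speculative results in this step are thrown away.
            nxt += window.index("A") + 1
        else:
            # Every speculated iteration was rejected and all of them count.
            nxt += len(window)
    return starts


decisions = "AAAAARAAARRARRRARRRARRRARRRRR"
steps = speculative_schedule(decisions, 4)
print(steps)        # [1, 2, 3, 4, 5, 6, 8, 9, 10, 13, 17, 21, 25, 29]
print(len(steps))   # 14
```

With four processors the 29 sequential iterations finish in 14 parallel steps. The gain comes entirely from runs of rejections, since a step that begins with an accepted iteration advances by only one.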
Techniques – multiple sequence alignment

Evaluation

Method: Improve the alignments generated by experts and by another
program (CLUSTALV)

Data Source



Three different groups of immunoglobulin sequences from Kabat
Database (Beta Release 5.0)
Group Name   Num. of Seqs   Avg. Length   Initial Score
CLLC         93             62            1,574,724
HHC3         185            65            6,650,831
MKL5         324            83            31,393,504
The average sequence lengths of the three groups are similar
The number of sequences differs: each group has roughly twice as many
sequences as the previous group
Techniques – multiple sequence alignment

Evaluation continued

Alignment score comparison

The Berger-Munson method provides more accurate alignments, but at a
cost in computation time that is not negligible
Techniques – multiple sequence alignment

Parallel Algorithm Speedup factor

Conclusion




The original iterative method is a good tool for improving alignment results
With the parallel speculative computation technique, it can
– Increase the alignment score
– Reduce the computation time
Can achieve higher speedup factors when
– Processing large sequence groups
– Processing sequences with high alignment scores
Cannot be compared with the previous algorithm by Ishikawa et al.
Observations

Similarity Sequence Searching




With the increasing size of biosequence databases and the growth of
computation power, coarse grain parallelism for sequence searching
is simpler and more effective
Time required for processing any given sequence depends not only
on the length of the sequence, but also on
– The composition of the sequence
– CPU power and CPU availability
– So dynamic load balancing is still necessary.
To minimize communication and scheduling overhead,
– Distribute sequences in fixed- or variable-size blocks
– Apply a buffering strategy to reduce data starvation and hide
scheduling latency
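Block distribution can be sketched in two flavours. The slides do not say how variable-size blocks would be chosen, so the second function is only one possibility: block sizes shrink as the work runs out, in the spirit of guided self-scheduling. Both function names are hypothetical.

```python
import math


def fixed_blocks(items, block_size):
    """Cut the work list into blocks of equal size (the last may be shorter)."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def decreasing_blocks(items, workers, min_size=1):
    """Assumed variable-size scheme: each block is remaining / workers items."""
    blocks = []
    start = 0
    while start < len(items):
        remaining = len(items) - start
        size = max(min_size, math.ceil(remaining / workers))
        blocks.append(items[start:start + size])
        start += size
    return blocks


work = list(range(10))
print([len(b) for b in fixed_blocks(work, 4)])
print([len(b) for b in decreasing_blocks(work, 3)])
```

Large early blocks keep the number of requests low, while small late blocks let the workers finish at nearly the same time.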
Multiple Sequence Alignment

With increasing computation power, parallelizing a single multiple
sequence alignment is no longer necessary. However, using parallelism to
increase the alignment accuracy is still attractive.
– Trading computation time for alignment accuracy
Thank you!