Parallel Computation in Biological Sequence Analysis
Xue Wu
CMSC 838 Presentation

Motivation
- Scanning and analyzing biological sequences are common, repeated tasks in molecular biology
  - Homologous sequence searching
    - Based on pairwise alignment
    - The task is to find similarities between a particular query sequence and all the sequences in a biosequence databank
  - Multiple sequence alignment
    - Simultaneous alignment of three or more nucleotide or amino acid sequences
- Problems with sequential solutions
  - With the exponential growth of biosequence banks, homologous sequence searching becomes time consuming
  - Automatically generating an accurate multiple alignment is computationally expensive
- Parallel solutions can reduce computation time and provide more accurate results

CMSC 838T – Presentation

Talk Overview
- Motivation
- Techniques and Evaluations
  - Similarity Sequence Searching
  - Multiple Sequence Alignment
- Observations

Techniques – similarity sequence searching
- Two main parallel methods for searching a sequence database
  - Fine-grain approach for SIMD parallel computers
    - Parallelize the comparison algorithm itself
    - All processors cooperate to determine the similarity score
  - Coarser-grain approach for MIMD parallel computers
    - Parallelize the database search
    - Each processor performs a selected number of comparisons
    - This is the method used in the paper
- Parallelizing similarity searching with the coarser-grain approach
  - Workload balancing is the key to better parallelism
  - Partition the database, search each partition sequentially, and combine the results
  - Load balance requires equal-sized pieces of the database
  - Percentage of Load Imbalance (PLIB) is the metric for load imbalance:

    $$\mathrm{PLIB} = \frac{\mathrm{LargestLoad} - \mathrm{SmallestLoad}}{\mathrm{LargestLoad}} \times 100$$

Techniques – similarity sequence searching
Splitting up the database
- Unsorted portion method (first load balancing technique)
  - Partition the database into a number of portions
  - Portion_size = database_size / processors_number
  - If a sequence assignment causes the sum of sequence lengths in portion P to exceed the ideal size by more than X percent, reassign the sequence to portion P+1
  - Low communication overhead, but possibly high PLIB
- Sorted portion method (master-worker method)
  - Sequences are sorted in decreasing length order to minimize PLIB
  - The master processor distributes the sequences to the worker processors dynamically
  - Low PLIB, but high communication overhead

Techniques – similarity sequence searching
Proposed bucket method
- Statically applies the sorted portion method
- Algorithm
  - Sort the sequences in the database in decreasing length order
  - Starting from the longest sequence, place the sequences in N buckets. For each sequence:
    - Find the sum of the sequence lengths in each bucket
    - Find the bucket with the smallest sum
    - Place the sequence in that bucket
    - In case of a tie, select the lowest-numbered bucket
  - Each of the N processors searches the sequences in its own bucket
  - If only N/n processors are used, each processor searches n buckets

Techniques – similarity sequence searching
Evaluation and comparison
- Comparison of the Bucket and Portion methods
- Comparison of the Bucket and Master-worker methods
- Algorithms are implemented on the Intel iPSC/860
- Preprocessing is performed on a SPARCstation 2
- Data source is GenBank (release 86.0)
- Preprocessing overhead is included for the Bucket method

Techniques – similarity sequence searching
Evaluation and comparison, continued
- Conclusions
  - In all tested cases, the proposed Bucket method has
    - Lower PLIB than the Portion method
    - Higher speedup than the master-worker method
  - The Bucket method has a clear advantage when
    - Sequence lengths are relatively small
    - A large number of processors is used

Techniques – multiple sequence alignment
Sequential Berger-Munson algorithm
[Flowchart: calculate alignment, then accept or reject]

Techniques – multiple sequence alignment
Sequential
Berger-Munson algorithm
- Applies randomized techniques with optimization to iteratively improve the multiple sequence alignment
- Description
  1. Randomly partition the input sequences into two groups
  2. Align the two groups of sequences, rather than individual sequences, with the alignment score calculated by

     $$S_{i,j} = \max\{\, P_{i,j},\; S_{i-1,j-1} + \mathrm{sub}'(a_i, b_j),\; Q_{i,j} \,\}$$
     $$P_{i,j} = \max\{\, S_{i-1,j} + w_1,\; P_{i-1,j} + v \,\}$$
     $$Q_{i,j} = \max\{\, S_{i,j-1} + w_1,\; Q_{i,j-1} + v \,\}$$
     $$\mathrm{sub}'(X_i, Y_j) = \sum_{k=1}^{K}\sum_{l=1}^{L} \mathrm{sub}(X_{k,i}, Y_{l,j}) + \sum_{k=1}^{K-1}\sum_{m=k+1}^{K} \mathrm{sub}(X_{k,i}, X_{m,j}) + \sum_{l=1}^{L-1}\sum_{m=l+1}^{L} \mathrm{sub}(Y_{l,j}, Y_{m,j})$$

  3. If the new alignment score is higher than the previous one, the alignment is accepted and the gaps are inserted into the sequences accordingly
  4. The modified or unmodified alignment is used as the input for the next iteration
- The process stops after q consecutive rejected iterations

Techniques – multiple sequence alignment
Parallel Berger-Munson algorithm with speculative computation
- Consecutive rejected iterations do not depend on each other and can be done in parallel

Decision sequence for sequential iterations 1, 2, 3, ... (A = accept, R = reject):
  AAAAARAAARRARRRARRRARRRARRRRR…

Iterations evaluated in parallel (each column is one parallel step):
   0  1  2  3  4  5  6  8  9 10 13 17 21 25
   1  2  3  4  5  6  7  9 10 11 14 18 22 26
   2  3  4  5  6  7  8 10 11 12 15 19 23 27
   3  4  5  6  7  8  9 11 12 13 16 20 24 28

Techniques – multiple sequence alignment
Evaluation
- Method: improve the alignments generated by experts and by another program (CLUSTALV)
- Data source: three different groups of immunoglobulin sequences from the Kabat Database (Beta Release 5.0)

  Group Name   Num. of Seqs   Avg. Length   Initial Score
  CLLC         93             62            1,574,724
  HHC3         185            65            6,650,831
  MKL5         324            83            31,393,504

- The average sequence lengths of the three groups are similar
- The number of sequences differs: each group has roughly twice as many sequences as the previous one

Techniques – multiple sequence alignment
Evaluation, continued
- Alignment score comparison
- The Berger-Munson method provides more accurate alignments, at a cost in computation time that is not negligible

Techniques – multiple sequence alignment
Parallel Algorithm
- Speedup factor
- Conclusion
  - The original iterative method is a good tool for improving alignment results
  - With the parallel speculative computation technique, it can
    - Increase the alignment score
    - Reduce the computation time
  - It can achieve higher speedup factors when
    - Processing large sequence groups
    - Processing sequences with high alignment scores
  - It cannot be compared with the previous algorithm by Ishikawa et al.

Observations
- Similarity Sequence Searching
  - With the increasing size of biosequence databases and the growth of computational power, coarse-grain parallelism for sequence searching is simpler and more effective
  - The time required to process a given sequence depends not only on the length of the sequence, but also on
    - The composition of the sequence
    - CPU power and CPU availability
  - So dynamic load balancing is still necessary. To minimize communication and scheduling overhead:
    - Distribute sequences in fixed- or variable-size blocks
    - Apply a buffering strategy to reduce data starvation and hide scheduling latency
- Multiple Sequence Alignment
  - With increasing computational power, parallelizing a single multiple sequence alignment is not necessary
  - However, using parallelism to increase alignment accuracy is still attractive
    - Trading computation time for alignment accuracy

Thank you!
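
Supplementary sketch: bucket method and PLIB

The bucket assignment steps and the PLIB metric from the similarity searching slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation evaluated in the paper: the function names `assign_buckets` and `plib`, the use of a heap to find the least-loaded bucket, and the example sequence lengths are all hypothetical.

```python
import heapq


def assign_buckets(lengths, n_buckets):
    """Greedy bucket assignment (illustrative sketch, names are hypothetical).

    Sequences are visited in decreasing length order and each one is placed
    in the bucket whose current total length is smallest.
    """
    # Heap entries are (current load, bucket number). Tuple ordering means a
    # tie on load is resolved in favor of the lowest-numbered bucket.
    heap = [(0, b) for b in range(n_buckets)]
    buckets = [[] for _ in range(n_buckets)]
    # Indices of the sequences, longest first (stable for equal lengths).
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for i in order:
        load, b = heapq.heappop(heap)
        buckets[b].append(i)
        heapq.heappush(heap, (load + lengths[i], b))
    return buckets


def plib(loads):
    """Percentage of Load Imbalance: (largest - smallest) / largest * 100."""
    largest, smallest = max(loads), min(loads)
    if largest == 0:
        return 0.0  # no work assigned at all, so nothing is imbalanced
    return 100.0 * (largest - smallest) / largest


# Made-up sequence lengths, used only to exercise the functions.
lengths = [900, 700, 650, 400, 300, 250, 200, 100]
buckets = assign_buckets(lengths, 3)
loads = [sum(lengths[i] for i in bucket) for bucket in buckets]
print(buckets)
print(loads, round(plib(loads), 2))
```

Because the assignment is computed once before the search starts, no sequences need to be handed out at run time, which is the property the slides contrast with the master-worker method.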