Massively Parallel Solutions for Molecular Sequence Analysis
Prabhakar R. Gudla
CMSC 838T Presentation, 04/23/2003

Outline
• Motivation
• Smith-Waterman Algorithm
• Parallelization
• High Performance Computing
• Hybrid Architecture
• Fuzion 150
• Performance Evaluation
• Conclusions and Comments

Motivation
• Newly discovered sequences are analyzed by comparing them with databases
• Complexity is proportional to the product of query size and database size
☞ Analysis is too slow on sequential computers

Sequence Alignment
Two possible approaches:
• Heuristics, e.g. BLAST, FASTA: the more efficient the heuristic, the worse the quality of the results
• Parallel processing: get high-quality results in reasonable time
[Figure: search speed versus data quality for BLAST, FASTA and Smith-Waterman (S-W). Smith-Waterman is the slowest with the highest data quality, BLAST the fastest with the lowest, FASTA in between.]

Outline
• Motivation
• Smith-Waterman Algorithm
• Parallelization
• High Performance Computing
• Hybrid Architecture
• Fuzion 150
• Performance Evaluation
• Conclusion and Comments

Parallelization of S-W
[Figure: Smith-Waterman matrix for two sequences of lengths l1 and l2, with the cells of one diagonal assigned to processing elements P1, P2, ..., P6]
• Matrix cells along a single diagonal are computed in parallel
• The comparison is performed in l1 + l2 − 1 steps on l1 PEs

Parallel Architectures
Embedded massively parallel accelerators:
• Systola 1024: PC add-on board with 1024 processors
• Fuzion 150: 1536 processors on a single chip
• Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan

Outline
• Motivation
• Smith-Waterman Algorithm
• Parallelization
• High Performance Computing
• Hybrid Architecture
• Fuzion 150
• Performance Evaluation
• Conclusion and
  Comments

Previous Applications
• Volume Visualization [Schmidt '00]
• Automatic Visual Quality Control (automobile industry)
• Computer Tomography [Schmidt, Schimmler, and Schröder '98]
• Video Compression [Schmidt and Schimmler '99]
• Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder '99]
• Image Processing [Schimmler and Lang '96; Lenders and Schröder '90; Jiang, Edirisinghe, and Schröder '97]

Hybrid Architecture
[Figure: 16 Systola 1024 boards connected through a high-speed Myrinet switch]
• Hybrid computer: combines the SIMD and MIMD paradigms within one parallel architecture

Architecture of Systola 1024
• Instruction Systolic Array (ISA): 32 × 32 mesh of processing elements
• Wavefront instruction execution
[Figure: ISA with interface processors and RAM NORTH / RAM WEST, controller with program memory, host computer bus]

Mapping onto Systola 1024
[Figure: query characters a0 ... a1023 placed one per PE on the 32 × 32 array (a31 a30 ... a0 in one row, a63 a62 ... a32 in the next, up to a1023 a1022 ... a992); subject sequences ... c1 c0 X bk ... b1 b0 stream into the array]
• a: query sequence (length equal to 1024); b: subject sequence
• Efficient routing on the ISA: Row Ringshift and Broadcast
• Subject sequences can be pipelined with only one step delay
• k steps for a subject sequence of length k

Fuzion 150 Architecture
[Figure: linear SIMD array of 1536 PEs, each with 2 Kbytes DRAM; Host AGP; SIMD Controller; Instruction Fetch; FUZION Bus; 32-bit EPU (ARC); Video I/O; Display; Local Rambus Memory with 1, 2 or 4 channels (6.4 GB/s)]
• 0.25-µm, single-chip SIMD architecture
• 1536 PEs @ 200 MHz: 300 GOPS
• 600 GB/s on-chip, 6.4 GB/s off-chip bandwidth
• Multithreading (control units interact via semaphores)
• Developed by ClearSpeed Technology (UK) for graphics and network processing

Fuzion 150 Architecture
Diagram labels: Local Memory, Instructions,
Blocks 0 to 5 with PEs (0,0) to (5,255), 256 PEs per block, on the Fuzion Bus with a Block I/O Channel. Each PE: 8-bit ALU, 32-byte register file, 2 KByte DRAM of PE memory, links to its left and right PE.

Mapping onto the Fuzion 150
[Figure: query characters placed one per PE: a0, a1, ..., a255 in Block 0; a511, a510, ..., a256 in Block 1; up to a1535, a1534, ..., a1280 in Block 5; subject sequences ... c1 c0 X bk ... b1 b0 stream through the array]
• a: query sequence (length equal to 1536); b: subject sequence
• No fast global communication
• 2-step local communication
• Subject sequences can be pipelined with only one step delay

Contents
• Motivation
• Smith-Waterman Algorithm
• Parallelization
• High Performance Computing
• Hybrid Architecture
• Fuzion 150
• Performance Evaluation
• Conclusion and Comments

Performance Evaluation
Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths:

Query sequence length            256    512   1024   2048   4096
Fuzion 150 (s)                    12     22     42     82    162
  speedup to PIII 1 GHz           88     97    102    105    106
Systola 1024 (s)                 294    577   1137   2241   4611
  speedup to PIII 1 GHz            4      4      4      4      4
Cluster of 16 Systolas (s)        20     38     73    142    290
  speedup to PIII 1 GHz           53     56     58     60     59

• Parallel implementation scales linearly with sequence length
• Computing time dominates data transfer time
• Fuzion 150 is about 25 times faster than a single Systola 1024; the two differ in CMOS technology (0.25 µm vs 1.0 µm)

Performance Evaluation
Time comparison for a 10 Mbase search on different parallel architectures with different query lengths
[Chart: search time in seconds (log scale, 1 to 100) for query lengths 512, 1024 and 2048 on SAMBA, Fuzion 150, Kestrel and 16K-PE MasPar]
Fuzion 150 is:
• 4× faster than the 16K-PE MasPar
• 6× faster than Kestrel
• 5× faster than SAMBA (special-purpose 3-board architecture)

Performance Evaluation
Chart legend: USparc: Sun Ultrasparc 140 MHz; B-SYS: 470-PE ISA; Alpha: DEC Alpha 433 MHz; 1K MP2: 1K-PE MasPar; Paragon: 32-node Paragon; Decy-1: 1-board Decypher-II*; Merc1: 1-board Mercury+; Bcll-1: Biocellerator*; Samba: 2-board Samba+; 16-MP2: 16K-PE
MasPar; FDF-3: 5-board Paracell FDF+; Kestrel: 1-board Kestrel; Decy-15: 15-board Decypher-II*
+ (single purpose); * (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999

Outline
• Motivation
• Smith-Waterman Algorithm
• Parallelization
• High Performance Computing
• Hybrid Architecture
• Fuzion 150
• Performance Evaluation
• Conclusions and Comments

Conclusions
• Demonstrated how fine-grained and hybrid parallel architectures can be applied efficiently to comparative genomics
• Significant runtime savings for full genome comparisons and database searching
• The same systems can be used to accelerate other bioinformatics applications, e.g. Hidden Markov Models

Comments
☞ With hardware support, is S-W as fast as BLAST?
Comparative search speeds on a 600 MHz 21264A Alpha machine (comparable MCUPS to the Hybrid System and Fuzion 150). Time taken for the search against the Swiss-Prot DB, in seconds:

Search tool      Sequence under test
                 ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3           4.3         20.0           25.0
BLAST 2.2           1.0          4.0           10.0
SSearch (SW)        6.0        240.0          565.0
H'Ware Accl.        3.2         16.8           29.7

Source: Shane Sturrock, SCS, 2(1), April 2002

Comments
☞ Is it feasible to use S-W as the default?
• Currently offered as a default option at EBI (European Bioinformatics Institute), which handles 15K queries per month with a full implementation of S-W
• Depends on the "objectives" of the search
☞ Just how much more accurate is S-W?
• 5-10% more "sensitive" towards divergent matches than BLAST (Shpaer et al., Genomics 38, 179-191, 1996)
• BLAST will retrieve most biologically significant similarities, but will miss a few and will include some chance similarities

Comparison of S-W vs BLAST
Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996
☞ Is there a real difference in the results?
YES

Comparison of S-W, FASTA, and BLAST
Note: The numbers in the table show for how many protein SF the method in the column performed better than the one in the row

Acknowledgements
• Dr. Bertil Schmidt
• Dr. Chau-Wen Tseng

Q&A

Extra Slides

Full Genome Comparison
• 3918 protein sequences, 1,329,298 amino acids
• 4289 protein sequences, 1,359,008 amino acids
• Related organisms, but Tuberculosis causes a disease
• Find common and different parts
• 16 × 10^6 pairwise sequence comparisons

Smith-Waterman Algorithm
• Optimal local alignment of two sequences
• Performs an exhaustive search for the optimal local alignment
• Complexity O(nm) for sequence lengths n and m
• Based on the dynamic programming (DP) algorithm:
  • Fill the DP matrix using a substitution (mutation) matrix
  • Find the maximal value (score) in the matrix
  • Trace back from the score until a 0 value is reached

Smith-Waterman Algorithm
Aligning S1 and S2 of lengths l1 and l2 using the recurrences (α: penalty for opening a gap, β: penalty for extending it):
\[ H(i,j) = \max\{\,0,\ E(i,j),\ F(i,j),\ H(i-1,j-1) + Sbt(S1_i, S2_j)\,\}, \quad 1 \le i \le l_1,\ 1 \le j \le l_2 \]
\[ E(i,j) = \max\{\,H(i,j-1) - \alpha,\ E(i,j-1) - \beta\,\} \]
\[ F(i,j) = \max\{\,H(i-1,j) - \alpha,\ F(i-1,j) - \beta\,\} \]
\[ H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0 \]
Calculate three possible ways to extend the alignment:
• by one amino acid (AA) in each sequence
• by one AA in the first sequence, aligned with a gap in the second
• by one AA in the second sequence, aligned with a gap in the first

Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with
\[ Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases}, \qquad \alpha = 1,\ \beta = 1 \]
so that
\[ H(i,j) = \max\{\,0,\ H(i-1,j) - 1,\ H(i,j-1) - 1,\ H(i-1,j-1) + Sbt(S1_i, S2_j)\,\} \]
H matrix (columns: S1, rows: S2):

          A   T   C   T   C   G   T   A   T   G   A   T   G
      0   0   0   0   0   0   0   0   0   0   0   0   0   0
  G   0   0   0   0   0   0   2   1   0   0   2   1   0   2
  T   0   0   2   1   2   1   1   4   3   2   1   1   3   2
  C   0   0   1   4   3   4   3   3   3   2   1   0   2   2
  T   0   0   2   3   6   5   4   5   4   5   4   3   2   1
  A   0   2   1   2   5   5   4   4   7   6   5   6   5   4
  T   0   1   4   3   4   4   4   6   6   9   8   7   8   7
  C   0   0   3   6   5   6   5   5   5   8   8   7   7   7
  A   0   2   2   5   5   5   5   4   7   7   7  10   9   8
  C   0   1   1   4   4   7   6   5   6   6   6   9   9   8

Resulting alignment (maximum score 10):
  ATCTCGTATGATG
    GTC-TATCAC

Principles of the ISA
[Figures: mesh of processors; Communication Register; Interface Processors North and Interface Processors West of the ISA]

Instruction Systolic Array
[Figure: array of PEs driven by streams of instructions, column selectors and row selectors]
• Wavefront instruction execution
• Fast accumulation operations (e.g. row sum, broadcast, ringshift)

Advantage of ISAs: Performing Aggregate Functions
• Row Broadcast: C := C[WEST]
  [Figure: the value 234 moves eastward step by step until every PE in the row holds C = 234]
• Row Sum: C := C + C[WEST]
  [Figure: a row holding 1, 2, 3, 4 ends with the running sums C = 1, 3, 6, 10]
• Row Ringshift: C := C[WEST]; C := C[EAST]
  [Figure: step-by-step example of values rotating along a row]

Data Transfer
• Systola 1024: a new character (bj) is input into the lower western IP; when l1 > 2048, previously computed H, E, and F cells are also input and H, E, and F cells are output
• Fuzion 150: while the 16 new H-cells in each PE are computed, one new character is input via the Fuzion bus

Instruction Counts
Instruction count (IC) to update 2 H-cells on Systola 1024 and 16 H-cells on Fuzion 150:

Operations in each PE per iteration step                      Systola   Fuzion
Get H(i-1, j), F(i-1, j), bj, max_{i-1} from neighbor            20        22
Compute t = max{0, H(i-1, j-1) + Sbt(ai, bj)}                    20       576
Compute F(i, j) = max{H(i-1, j) - α, F(i-1, j) - β}               8       336
Compute E(i, j) = max{H(i, j-1) - α, E(i, j-1) - β}               8       448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                        8       368
Compute max_i = max{H(i, j), max_{i-1}}                           4       184
Sum                                                              68      1934

Maximum Characters/PE
• The memory per PE on Systola is 32 (16-bit) registers, so 2 characters per PE is the maximum possible (2 chars × 20-AA substitution row × 8 bits per substitution value = 20 registers)
• The memory per PE on Fuzion is 2 KByte; the maximum is 16 chars per PE, restricted by per-PE "indirect addressing"

Indirect Address
An addressing mode found in many processors' instruction sets in which the instruction contains the address of a memory location that holds the address of the operand (the "effective address"), or specifies a register that holds the effective address

Myrinet - Overview
• Myrinet is a cost-effective, high-performance packet-communication and switching technology that is widely used to interconnect clusters of workstations, PCs, servers, or single-board computers
• Conventional networks (e.g., Ethernet) can be used to build clusters, but do not provide the performance and features required for HPC or high-availability clustering

Myrinet - Characteristics
• Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports
• Flow control, error control, and "heartbeat" continuity monitoring on every link
• Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
• Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts
• Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets

lq ≠ number of processors: Hybrid
Length of query sequence = M, number of processors in the ISA = N², assuming M = k × N:
1.
k ≤ N: Each k × N subarray computes the alignment of the same query sequence with different subject sequences
2. k ≥ N:
   • k/N = 2: load 2 chars per PE
   • k/N > 2: split the query sequence into k/2N passes and load 2N² chars in each pass

lq ≠ number of processors: Fuzion 150
Length of query sequence = M, number of processors = 1536:
1. k × M = 1536: k alignments of the same query sequence with different subject sequences are carried out in parallel
2. k × 1536 = M:
   • Split into k passes; requires I/O of intermediate results in each step
   • Data transfers can be minimized by assigning k/M chars per PE; currently 16 chars per PE is the limit

Concept of true and false hits
The following cases were distinguished:
• true positives: alignments between proteins of similar structure that fall above a given threshold (defined by the sequence alignment method)
• false positives: alignments between proteins of dissimilar structure that fall above a given threshold of the sequence alignment
• true negatives: alignments between proteins of dissimilar structure that fall below a given threshold
• false negatives: alignments between proteins of similar structure that fall below a given threshold

Guidelines
When to use S-W?
• if you are looking for a protein distantly related to your query sequence (e.g., you have a known protein sequence and you want to find possible distant homologues)
• if you are looking for the protein encoded in your low-quality DNA query sequence (e.g., you have a badly sequenced cDNA clone)
• if you are looking for a DNA sequence corresponding to your protein query sequence (e.g., you want to identify potential homologues of your protein in the EST databases)
When to use BLAST?
• if you are looking for close matches and you don't mind missing lower-homology sequences
• if you want a quick answer

Performance Evaluation of SAMBA
Time in seconds, and speed-up of Samba over each workstation:

Query sequence length        10     30    100    300    1000    3000    10000
Samba                        25     25     26     30      40      77      210
DEC-Alpha, 150 MHz           57    120    350   1041    3468   11510    38450
  Speed-up                  2.3    4.8   13.5   34.7    86.7     150      183
SUN-Sparc 5, 110 MHz         95    239    746   2215    7300   24269    80300
  Speed-up                  3.8    9.5   28.6     74     183     315      382
DEC 5000/250, 40 MHz        182    548   1407   4054   12920   41169   131193
  Speed-up                  7.3     22     54    135     323     534      625

Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997
☞ The longer the query length, the better the speed-up

Performance Evaluation of Kestrel
Chart legend: USparc: Sun Ultrasparc 140 MHz; B-SYS: 470-PE ISA; Alpha: DEC Alpha 433 MHz; 1K MP2: 1K-PE MasPar; Paragon: 32-node Paragon; Decy-1: 1-board Decypher-II*; Merc1: 1-board Mercury+; Bcll-1: Biocellerator*; Samba: 2-board Samba+; 16-MP2: 16K-PE MasPar; FDF-3: 5-board Paracell FDF+; Kestrel: 1-board Kestrel; Decy-15: 15-board Decypher-II*
+ (single purpose); * (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999

Performance Evaluation of Splash-2

Hardware          Specifics             MCUPS
Splash-2          Unidir; 16 boards     43,000
Splash-2          Bidir; 16 boards      34,000
Splash-2          Unidir; 1 board        3,000
Splash-2          Bidir; 1 board         2,100
Splash-1          Bidir; 746 PEs           370
SPARC 10/30 GX    gcc -O2                  1.2
VAX 6620          VMS; CC                  1.0
SPARC-1           gcc -O2                 0.87
486DX-50 PC       DOS; gcc -O2            0.67

Source: Hoang, IEEE-CMM, 185-191, 1993
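
Extra: diagonal evaluation order, sketched in code
The "Parallelization of S-W" slide states that cells on one diagonal can be computed in parallel, giving l1 + l2 − 1 steps. The Python sketch below is only an illustration of that evaluation order, not the Systola or Fuzion implementation. It assumes the simplified recurrence of the worked example (match 2, mismatch −1, gap penalty 1), and the names sbt and sw_wavefront are hypothetical. The inner loop runs sequentially here; it stands in for the work that one PE per cell would do at the same time.

```python
def sbt(x, y):
    """Substitution score of the worked example: 2 for a match, -1 otherwise."""
    return 2 if x == y else -1


def sw_wavefront(s1, s2, gap=1):
    """Fill the Smith-Waterman matrix H one diagonal at a time.

    H[i][j] uses i over s1 and j over s2 (1-based; row and column 0 are zero).
    Returns (H, steps), where steps counts the diagonals processed.
    Assumes both sequences are non-empty and a linear gap penalty `gap`.
    """
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    steps = 0
    # All cells with the same d = i + j lie on one diagonal. They depend only
    # on diagonals d-1 and d-2, so they are independent of each other.
    for d in range(2, l1 + l2 + 1):
        lo = max(1, d - l2)
        hi = min(l1, d - 1)
        for i in range(lo, hi + 1):  # conceptually: one PE per value of i
            j = d - i
            H[i][j] = max(
                0,
                H[i - 1][j] - gap,
                H[i][j - 1] - gap,
                H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
            )
        steps += 1
    return H, steps


if __name__ == "__main__":
    H, steps = sw_wavefront("ATCTCGTATGATG", "GTCTATCAC")
    best = max(max(row) for row in H)
    print("diagonal steps:", steps)
    print("best local score:", best)
```

For the two example sequences (lengths 13 and 9) the loop visits 13 + 9 − 1 = 21 diagonals. Handling the affine-gap recurrences for E and F would need two more matrices updated in the same diagonal order.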