Massively Parallel Solutions for Molecular Sequence Analysis Prabhakar R. Gudla CMSC 838T Presentation

Massively Parallel Solutions for Molecular
Sequence Analysis
Prabhakar R. Gudla
CMSC 838T Presentation
04/23/2003
Outline

Motivation

Smith-Waterman Algorithm


Parallelization
High Performance Computing

Hybrid Architecture

Fuzion 150

Performance Evaluation

Conclusions and Comments
Motivation
Discovered sequences are
analyzed by comparison
with databases
Complexity is proportional
to the product of query size
times database size
☞ Analysis too slow on sequential computers
Sequence Alignment


Two possible approaches

Heuristics, e.g. BLAST, FASTA, but the more efficient the
heuristics, the worse the quality of the results

Parallel Processing, get high-quality results in reasonable time
[Figure: BLAST, FASTA, and Smith-Waterman (S-W) placed on two axes, search speed (slower to faster) and data quality (lower to higher); Smith-Waterman sits at the slow end and BLAST at the fast end]
Parallelization of S-W
[Figure: DP matrix for two sequences of lengths l1 and l2, mapped onto processing elements P1, P2, ... and filled diagonal by diagonal]

Matrix cells along a single diagonal are computed in parallel

The comparison is performed in l1 + l2 - 1 steps on l1 PEs
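The diagonal-by-diagonal schedule can be sketched sequentially in Python. This is an illustrative sketch only, not the code that runs on the accelerators: the function name `sw_wavefront`, the linear gap penalty, and the +2/-1 scoring (the values of the deck's worked example) are assumptions made for the illustration. Each pass of the outer loop corresponds to one parallel step, since all cells with the same i + j depend only on the two previous diagonals.

```python
def sw_wavefront(s1, s2, match=2, mismatch=-1, gap=1):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time.

    Returns the best local score and the number of diagonal steps.
    """
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    steps = 0
    for d in range(2, l1 + l2 + 1):  # cells with i + j == d
        # On a parallel machine, row i would be handled by PE i and all
        # cells of this diagonal would be updated in the same step.
        for i in range(max(1, d - l2), min(l1, d - 1) + 1):
            j = d - i
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap,
                          H[i - 1][j - 1] + sub)
        steps += 1
    best = max(max(row) for row in H)
    return best, steps


best, steps = sw_wavefront("ATCTCGTATGATG", "GTCTATCAC")
print(best, steps)
```

With the example sequences of lengths 13 and 9, the loop runs for 13 + 9 - 1 = 21 diagonal steps, which is the step count stated on the slide.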
Parallel Architectures

Embedded Massively Parallel Accelerators

Systola 1024: PC add-on
board with 1024 processors

Fuzion 150: 1536 processors on a
single chip

Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel,
SAMBA, P-NAC, Splash-2, BioScan
Previous Applications

Volume Visualization [Schmidt '00]

Automatic Visual Quality Control (Automobile Industry)

Computer Tomography [Schmidt, Schimmler, and Schröder '98]

Video Compression [Schmidt and Schimmler '99]

Range of Transforms (Fourier, Wavelet, Hough, Radon) [Schmidt, Schimmler, and Schröder '99]

Image Processing [Schimmler and Lang '96; Lenders and Schröder '90; Jiang, Edirisinghe, and Schröder '97]
Hybrid Architecture
[Figure: two rows of eight Systola 1024 boards (16 in total) connected through a high-speed Myrinet switch]

Combines the SIMD and MIMD paradigms within one parallel architecture ⇒ hybrid computer
Architecture of Systola 1024

Instruction Systolic Array (ISA):

32 × 32 mesh of processing elements

Wavefront instruction execution
[Figure: block diagram of the board: ISA with interface processors, RAM NORTH and RAM WEST, controller, program memory, and host computer bus]
Mapping onto Systola 1024
[Figure: the query characters a0 ... a1023 are laid out row by row on the 32 × 32 mesh (a0 ... a31, a32 ... a63, ..., a992 ... a1023); the subject sequences b = bk ... b1 b0 and c = ... c1 c0 are streamed into the array one after another]
a: query sequence (length equal to 1024)
b: subject sequence

Efficient routing on the ISA: Row Ringshift and Broadcast

Subject sequences can be pipelined with only a one-step delay ⇒ k steps for a subject sequence of length k
Fuzion 150 Architecture
[Figure: chip block diagram with the labels: linear SIMD array (1536 PEs, each with 2 Kbytes DRAM), SIMD controller, instruction fetch, FUZION bus, 32-bit EPU (ARC), host, AGP, video I/O, display, local memory, Rambus, 1, 2 or 4 channels (6.4 GB/s)]

0.25-µm, single-chip SIMD architecture

1536 PEs @ 200 MHz ⇒ 300 GOPS

600 GB/s on-chip, 6.4 GB/s off-chip bandwidth

Multithreading (control units interact via semaphores)

Developed by Clearspeed Technology (UK) for graphics and network processing
Fuzion 150 Architecture
[Figure: the PEs are organized in six blocks (Block 0 to Block 5) of 256 PEs each, PE (b,0) to PE (b,255), attached to the Fuzion bus through a block I/O channel and fed with instructions; each PE has an 8-bit ALU, a 32-byte register file, 2 KByte of DRAM as PE memory, and links to its left and right neighbor PEs]
Mapping onto the Fuzion 150
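[Figure: the query characters are distributed one per PE over the six blocks, with the direction alternating from block to block: Block 0 holds a0 ... a255, Block 1 holds a511 ... a256, ..., Block 5 holds a1535 ... a1280; the subject sequences b = bk ... b1 b0 and c = ... c1 c0 are streamed in one after another]
a: query sequence (length equal to 1536)
b: subject sequence

No fast global communication ⇒ 2-step local communication

Subject sequences can be pipelined with only a one-step delay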
Performance Evaluation


Scan times in seconds for TrEMBL 14 (351,834 protein sequences) for various query sequence lengths

Query sequence length          256    512   1024   2048   4096
Fuzion 150 (s)                  12     22     42     82    162
  speedup to PIII 1 GHz         88     97    102    105    106
Systola 1024 (s)               294    577   1137   2241   4611
  speedup to PIII 1 GHz          4      4      4      4      4
Cluster of 16 Systolas (s)      20     38     73    142    290
  speedup to PIII 1 GHz         53     56     58     60     59

Parallel implementation scales linearly with sequence length

Computing time dominates data transfer time

Fuzion 150 is about 25 times faster than a single Systola 1024; note the difference in CMOS technology (0.25 µm vs 1.0 µm)
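The "about 25 times" figure can be checked directly from the scan times in the table. The short Python sketch below only redoes that arithmetic; the variable and function names are illustrative and the numbers are copied from the table.

```python
# Scan times in seconds, copied from the table above.
QUERY_LENGTHS = [256, 512, 1024, 2048, 4096]
FUZION_S = [12, 22, 42, 82, 162]
SYSTOLA_S = [294, 577, 1137, 2241, 4611]
CLUSTER_S = [20, 38, 73, 142, 290]


def ratios(slow, fast):
    """Element-wise speedup of `fast` over `slow`, rounded to one decimal."""
    return [round(s / f, 1) for s, f in zip(slow, fast)]


fuzion_vs_systola = ratios(SYSTOLA_S, FUZION_S)
cluster_vs_systola = ratios(SYSTOLA_S, CLUSTER_S)

for n, f, c in zip(QUERY_LENGTHS, fuzion_vs_systola, cluster_vs_systola):
    print(f"query {n:5d}: Fuzion/Systola {f:5.1f}x, 16-board cluster/Systola {c:5.1f}x")
```

The Fuzion-to-Systola ratio stays between roughly 24 and 29 across all query lengths, and the 16-board cluster is about 15 to 16 times faster than a single board.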
Performance Evaluation

Time comparisons for a 10 Mbase search on different parallel architectures with different query lengths
[Figure: chart of search time in seconds (axis ticks 1, 10, 100) for query lengths 512, 1024, and 2048 on SAMBA, Fuzion 150, Kestrel, and a 16K-PE MasPar]

4× faster than 16K-PE MasPar

6× faster than Kestrel

5× faster than SAMBA (special-purpose 3-board architecture)
Performance Evaluation
[Figure: performance comparison of the systems listed in the legend below]
USparc: Sun Ultrasparc 140 MHz
B-SYS: 470-PE ISA
Alpha: DEC Alpha – 433 MHz
1K MP2: 1K-PE MasPar
Paragon: 32-node Paragon
Decy-1: 1-board Decypher-II*
Merc1: 1-board Mercury+
Bcll-1: Biocellerator*
Samba: 2-board Samba+
16-MP2: 16K-PE MasPar
FDF-3: 5-Board Paracell FDF+
Kestrel: 1-board Kestrel
Decy-15: 15-board Decypher-II*
+ (single purpose); * (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
Conclusions

Demonstrated how fine-grained and hybrid parallel
architectures can be applied efficiently for
Comparative Genomics

Significant runtime savings for full genome
comparisons and database searching

Same systems can be used for accelerating other
bioinformatics applications, e.g. Hidden Markov
Models
Comments
☞ With hardware support, is S-W as fast as BLAST?
Comparative search speeds on a 600 MHz 21264A Alpha machine
(comparable MCUPS to the Hybrid System and Fuzion 150)
Time taken for the search (seconds) against the Swiss-Prot DB, per sequence under test:

Search tool      ELVIS (5)   Metr (276)   Arp_arath (536)
FASTA 3.3           4.3         20.0           25.0
BLAST 2.2           1.0          4.0           10.0
SSearch (SW)        6.0        240.0          565.0
H'Ware Accl.        3.2         16.8           29.7

Source: Shane Sturrock, SCS, 2(1), April 2002
Comments
☞ Is it feasible to use S-W as the default ?

Currently offered as a default option at EBI (European
Bioinformatics Institute), handles 15K queries per month w/ full
implementation of S-W

Depends on the “objectives” of the search
☞ Just how much more accurate is S-W?

5-10% more “sensitive” towards divergent matches than
BLAST (Shpaer et al., Genomics 38, 179-191, 1996)

BLAST will retrieve most biologically significant similarities,
but will miss a few and will include some chance similarities
Comparison of S-W VS BLAST
Source: Shpaer et al., Genomics 38(2), pp. 179-191, 1996
☞ Is there a real difference in the results?
⇒ YES
Comparison of S-W, FASTA, and BLAST
Note: The numbers in the table show for how many protein superfamilies (SF) the
method in the column performed better than the one in the row
Acknowledgements
Dr. Bertil Schmidt
Dr. Chau-Wen Tseng
Q&A
Extra Slides
Full Genome Comparison
[Figure: two genomes compared, one with 3918 protein sequences (1,329,298 amino acids) and one with 4289 protein sequences (1,359,008 amino acids)]

Related organisms, but Tuberculosis causes a disease ⇒ find common and different parts

16 × 10^6 pairwise sequence comparisons
Smith-Waterman Algorithm
Optimal local alignment of two sequences
Performs an exhaustive search for the optimal local alignment

Complexity O(nm) for sequence lengths n and m
Based on the 'dynamic programming' (DP) algorithm

Fill the DP matrix using a substitution (mutation) matrix

Find the maximal value (score) in the matrix

Trace back from the score until a 0 value is reached
Smith-Waterman Algorithm

Aligning S1 and S2 of length l1 and l2 using recurrences:
\[
H(i,j) = \max\{0,\ E(i,j),\ F(i,j),\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}, \quad 1 \le i \le l_1,\ 1 \le j \le l_2
\]
\[
E(i,j) = \max\{H(i,j-1) - \alpha,\ E(i,j-1) - \beta\}, \qquad F(i,j) = \max\{H(i-1,j) - \alpha,\ F(i-1,j) - \beta\}
\]
\[
H(i,0) = E(i,0) = 0, \qquad H(0,j) = F(0,j) = 0
\]
Calculate three possible ways to extend the alignment

by one amino acid (AA) in each sequence

by one AA in the first sequence and align it with a gap in the second

by one AA in the second sequence and align it with a gap in the first
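The three recurrences translate almost line by line into code. The following Python sketch is an illustration under stated assumptions, not the accelerator implementation: `sw_affine` and `sbt_simple` are hypothetical names, `alpha` and `beta` stand for the two gap penalties of the E and F recurrences, and only the best score is returned (no traceback).

```python
def sbt_simple(x, y):
    """Toy substitution function: +2 for a match, -1 otherwise (assumed values)."""
    return 2 if x == y else -1


def sw_affine(s1, s2, sbt, alpha, beta):
    """Best local alignment score using the H/E/F recurrences."""
    l1, l2 = len(s1), len(s2)
    # Row 0 and column 0 stay at 0, matching the boundary conditions.
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    E = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    F = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best = 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            E[i][j] = max(H[i][j - 1] - alpha, E[i][j - 1] - beta)
            F[i][j] = max(H[i - 1][j] - alpha, F[i - 1][j] - beta)
            H[i][j] = max(0, E[i][j], F[i][j],
                          H[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]))
            best = max(best, H[i][j])
    return best


print(sw_affine("ATCTCGTATGATG", "GTCTATCAC", sbt_simple, alpha=1, beta=1))
```

When both penalties are equal, the E and F terms reduce to H(i, j-1) - 1 and H(i-1, j) - 1, so the sketch gives the same score as a plain linear-gap recurrence.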
Smith-Waterman Algorithm
Align S1 = ATCTCGTATGATG and S2 = GTCTATCAC with
\[
Sbt(x,y) = \begin{cases} 2 & \text{if } x = y \\ -1 & \text{else} \end{cases}, \qquad \alpha = 1,\ \beta = 1
\]
so that the recurrence becomes
\[
H(i,j) = \max\{0,\ H(i-1,j) - 1,\ H(i,j-1) - 1,\ H(i-1,j-1) + Sbt(S1_i, S2_j)\}
\]
[Figure: filled DP matrix with S1 along the columns and S2 along the rows; the maximum score of 10 is highlighted and the traceback path gives the alignment below]
ATCTCGTATGATG
GTC TATCAC
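The worked example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `smith_waterman` and the tie-breaking order in the traceback (diagonal first, then a gap in the second sequence, then a gap in the first) are assumptions, since the slide does not state how ties are resolved.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=1):
    """Linear-gap Smith-Waterman: fill the matrix, find the maximum, trace back."""
    l1, l2 = len(s1), len(s2)
    H = [[0] * (l2 + 1) for _ in range(l1 + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, l1 + 1):
        for j in range(1, l2 + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j] - gap, H[i][j - 1] - gap,
                          H[i - 1][j - 1] + sub)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Trace back from the maximum until a 0 cell is reached.
    a1, a2 = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = smith_waterman("ATCTCGTATGATG", "GTCTATCAC")
print(score)
print(top)
print(bottom)
```

For the two example sequences this yields a score of 10 with the aligned segments TCGTATGA and TC-TATCA, that is, one gap in S2 and one mismatch (G against C).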
Principles of the ISA
[Figure: ISA processing elements and their communication registers]
Interface Processors
[Figure: the ISA with interface processors along its northern and western edges]
Instruction Systolic Array
[Figure: processing-element array showing the instructions, column selectors, and row selectors that move through it]

Wavefront instruction execution ⇒ fast accumulation operations (e.g. row
sum, broadcast, ringshift)
Advantage of ISAs: Performing Aggregate Functions
• Row Broadcast: C := C[WEST]
[Figure: the instruction C := CW travels along the row as a wavefront; the value 234 entering from the west ends up in every PE (C = 234)]
• Row Sum: C := C + C[WEST]
[Figure: the first PE executes a noop and the others C := C + CW; for initial values 1, 2, 3, 4 the PEs end with C = 1, 3, 6, 10]
• Row Ringshift: C := C[WEST]; C := C[EAST]
[Figure: wavefronts of C := CW and C := CE instructions, with noops at the row ends, rotate the row contents]
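The row broadcast and row sum can be mimicked with a tiny sequential model. The sketch below is a simplification and not Systola code: it assumes that the instruction reaches PE k at step k and that each PE reads the communication register of its western neighbor after that neighbor has already executed the instruction. The names `wavefront` and `west_in` are made up for the illustration.

```python
def wavefront(row, instr, west_in=0):
    """Apply `instr(own, west)` to a row of registers, west to east.

    PE k executes at step k, so it sees the already-updated value of PE k-1.
    `west_in` is the value presented at the western border of the row.
    """
    c = list(row)
    for k in range(len(c)):
        west = c[k - 1] if k > 0 else west_in
        c[k] = instr(c[k], west)
    return c


# Row sum: C := C + C[WEST] leaves prefix sums; the last PE holds the row sum.
row_sum = wavefront([1, 2, 3, 4], lambda own, west: own + west)

# Row broadcast: C := C[WEST] copies the border value into every PE.
broadcast = wavefront([0, 0, 0, 0], lambda own, west: west, west_in=234)

print(row_sum)
print(broadcast)
```

A single instruction sweeping across the row is enough for both operations, which is why such aggregate functions are cheap on an ISA.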
Data Transfer


In Systola 1024:

input of a new character (bj) into the lower western IP, and

when l1 > 2048, input of previously computed H, E, and F
cells and output of H, E, and F cells
For Fuzion 150: while the 16 new H-cells in each PE are computed,
one new character is input via the Fuzion bus
Instruction Counts

Instruction count (IC) to update 2 and 16 H-cells in Systola 1024 and Fuzion 150, respectively:

Operations in each PE per iteration step                      Systola   Fuzion
Get H(i-1, j), F(i-1, j), bj, max_{i-1} from neighbor             20       22
Compute t = max{0, H(i-1, j-1) + Sbt(ai, bj)}                     20      576
Compute F(i, j) = max{H(i-1, j) - α, F(i-1, j) - β}                8      336
Compute E(i, j) = max{H(i, j-1) - α, E(i, j-1) - β}                8      448
Compute H(i, j) = max{t, E(i, j), F(i, j)}                         8      368
Compute max_i = max{H(i, j), max_{i-1}}                            4      184
Sum                                                               68     1934
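Because the two columns cover different numbers of H-cells (2 on Systola 1024, 16 on Fuzion 150), a per-cell figure makes them easier to compare. The Python sketch below only redoes that arithmetic with the counts from the table; the dictionary keys are shorthand labels invented for the illustration.

```python
# Instruction counts per iteration step, copied from the table above.
SYSTOLA_IC = {"get": 20, "t": 20, "F": 8, "E": 8, "H": 8, "max": 4}
FUZION_IC = {"get": 22, "t": 576, "F": 336, "E": 448, "H": 368, "max": 184}

# H-cells updated per PE in one iteration step.
SYSTOLA_CELLS = 2
FUZION_CELLS = 16

systola_total = sum(SYSTOLA_IC.values())
fuzion_total = sum(FUZION_IC.values())

systola_per_cell = systola_total / SYSTOLA_CELLS
fuzion_per_cell = fuzion_total / FUZION_CELLS

print(systola_total, systola_per_cell)
print(fuzion_total, fuzion_per_cell)
```

The totals reproduce the Sum row (68 and 1934), which works out to 34 instructions per H-cell on Systola 1024 and about 121 on Fuzion 150.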
Maximum Characters/PE


The memory per PE on Systola is 32 (16-bit) registers

2 characters per PE is the maximum possible

(2 chars × 20-AA substitution row × 8 bits per substitution value
= 320 bits = 20 registers)
The memory per PE on Fuzion is 2 Kbytes

maximum chars per PE is 16

restricted due to “indirect addressing” per PE
Indirect Address

An addressing mode found in many processors'
instruction sets where the instruction contains the
address of a memory location which contains the
address of the operand (the "effective address") or
specifies a register which contains the effective address
Myrinet - Overview


Myrinet is a cost-effective, high-performance, packet-communication and switching technology that is widely
used to interconnect clusters of workstations, PCs,
servers, or single-board computers
Conventional networks (e.g., Ethernet) can be used to
build clusters, but do not provide the
performance/features required for HPC or high-availability clustering
Myrinet - Characteristics





Full-duplex 2+2 Gigabit/second data rate links, switch ports, and
interface ports
Flow control, error control, and "heartbeat" continuity monitoring on
every link
Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications
Switch networks that can scale to tens of thousands of hosts, and that
can also provide alternative communication paths between hosts
Host interfaces that execute a control program to interact directly with
host processes ("OS bypass") for low-latency communication, and
directly with the network to send, receive, and buffer packets
lq ≠ number of processors: Hybrid
Query sequence length = M, number of processors
in ISA = N², assuming M = k × N:
1. k < N: Each k × N subarray computes the alignment of
the same query sequence with different subject
sequences
2. k ≥ N:
• k/N = 2: load 2 chars per PE
• k/N > 2: split query sequence into k/2N passes and load 2N²
chars in each pass
lq ≠ number of processors: Fuzion 150
Length of query sequence = M, number
of processors = 1536:
1. k × M = 1536: k alignments of the same query sequence with
different subject sequences are carried out in parallel
2. k × 1536 = M:
• Split into k passes; requires I/O of intermediate results in each
step
• Data transfers can be minimized by assigning k chars per PE;
currently 16 chars per PE is the limit
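The case distinction for both machines can be summarized in one small planning function. This Python sketch is a simplified reading of the two slides, not the authors' scheduler: `plan_query` and its result keys are hypothetical, and it assumes that a short query is replicated as often as it fits, while a long query first uses more characters per PE and only then extra passes.

```python
from math import ceil


def plan_query(M, num_pes, max_chars_per_pe):
    """Rough mapping of a query of length M onto `num_pes` processing elements."""
    if M <= num_pes:
        # Short query: several copies fit, each aligned to a different subject.
        return {"parallel_alignments": num_pes // M,
                "chars_per_pe": 1, "passes": 1}
    # Long query: pack several characters per PE, then fall back to passes.
    chars = min(max_chars_per_pe, ceil(M / num_pes))
    passes = ceil(M / (num_pes * chars))
    return {"parallel_alignments": 1, "chars_per_pe": chars, "passes": passes}


# Systola 1024: 32 x 32 PEs, at most 2 characters per PE.
print(plan_query(256, 1024, 2))
print(plan_query(4096, 1024, 2))
# Fuzion 150: 1536 PEs, at most 16 characters per PE.
print(plan_query(4096, 1536, 16))
```

Under these assumptions a query of length 4096 needs two passes on a single Systola 1024 but fits into one pass on the Fuzion 150 with three characters per PE.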
Concept of true and false hits
The following cases were distinguished:

true positives, alignments between proteins of similar
structure that fall above a given threshold (defined by
the sequence alignment method)

false positives, alignments between proteins of
dissimilar structure that fall above a given threshold of
the sequence alignment

true negatives, alignments between proteins of
dissimilar structure that fall below a given
threshold

false negatives, alignments between proteins of similar
structure that fall below a given threshold
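The four cases amount to a two-by-two decision on structural similarity and alignment score. The Python sketch below illustrates this; the function name `classify_hit` is made up, and treating a score exactly equal to the threshold as "above" is an assumption, since the definitions leave that boundary open.

```python
from collections import Counter


def classify_hit(similar_structure, score, threshold):
    """Label one alignment as true/false positive/negative."""
    above = score >= threshold  # assumption: ties count as above the threshold
    if similar_structure:
        return "true positive" if above else "false negative"
    return "false positive" if above else "true negative"


# Toy data: (proteins have similar structure?, alignment score).
hits = [(True, 120), (True, 40), (False, 95), (False, 10)]
counts = Counter(classify_hit(sim, score, threshold=80) for sim, score in hits)
print(counts)
```

Counting the four labels over a benchmark set is the usual starting point for comparing the sensitivity of search methods at a given threshold.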
Guidelines
When to use S-W ?

if you are looking for a protein distantly related to your query
sequence (e.g., you have a known protein sequence and you want
to find possible distant homologues)

if you are looking for the protein encoded in your low-quality DNA
query sequence (e.g., you have a badly sequenced cDNA clone)

if you are looking for a DNA sequence corresponding to your
protein query sequence (e.g., you want to identify potential
homologues of your protein in the EST databases)
When to use BLAST ?

if you are looking for close matches and you don't mind missing
lower homology sequences

if you want a quick answer
Performance Evaluation of SAMBA
Time in seconds

Query sequence length        10     30    100    300    1000    3000    10000
Samba                        25     25     26     30      40      77      210
DEC-Alpha – 150 MHz          57    120    350   1041    3468   11510    38450
  Speed-up                  2.3    4.8   13.5   34.7    86.7     150      183
SUN-Sparc 5 – 110 MHz        95    239    746   2215    7300   24269    80300
  Speed-up                  3.8    9.5   28.6     74     183     315      382
DEC 5000/250 – 40 MHz       182    548   1407   4054   12920   41169   131193
  Speed-up                  7.3     22     54    135     323     534      625

Source: Jamet and Lavenier, CABIOS, 12(7), 609-615, 1997
☞ The longer the query length, the better the speed-up
Performance Evaluation of Kestrel
[Figure: performance comparison of the systems listed in the legend below]
USparc: Sun Ultrasparc 140 MHz
B-SYS: 470-PE ISA
Alpha: DEC Alpha – 433 MHz
1K MP2: 1K-PE MasPar
Paragon: 32-node Paragon
Decy-1: 1-board Decypher-II*
Merc1: 1-board Mercury+
Bcll-1: Biocellerator*
Samba: 2-board Samba+
16-MP2: 16K-PE MasPar
FDF-3: 5-Board Paracell FDF+
Kestrel: 1-board Kestrel
Decy-15: 15-board Decypher-II*
+ (single purpose); * (FPGA)
Source: Dahle et al., PDPTA, 1243-1249, 1999
Performance Evaluation of Splash-2
Hardware          Specifics            MCUPS
Splash-2          Unidir; 16 boards    43,000
Splash-2          Bidir; 16 boards     34,000
Splash-2          Unidir; 1 board       3,000
Splash-2          Bidir; 1 board        2,100
Splash-1          Bidir; 746 PEs          370
SPARC 10/30 GX    gcc -O2                 1.2
VAX 6620          VMS; CC                 1.0
SPARC-1           gcc -O2                0.87
486DX-50 PC       DOS; gcc -O2           0.67

Source: Hoang, IEEE-CMM, 185-191, 1993