SAAHPC12 Presentation - plaza

FPGA-Accelerated Isotope Pattern Calculator
for Use in Simulated Mass Spectrometry
Peptide and Protein Chemistry
SAAHPC 2012
Carlo Pascoe (speaker), David Box,
Herman Lam, Alan George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Dept. of Electrical and Computer Engineering, University of Florida
Gainesville FL, USA
Email: {pascoe, box, hlam, george}@chrec.org
Wednesday July 11th, 2012
Motivation

Protein Identification Algorithms (PIAs)



Heavily utilized in pharmaceutical research and cancer diagnostics
Current industry standard methods unreliable (at best!) [1,2]
Highly accurate algorithms with potential to revolutionize accuracy
exist, but are unused or underused due to extreme computational
intensity and prohibitive execution times

Must accelerate for feasible use
Objective: Develop sustainable solution for increasing
the speed, and thus achievable accuracy, of many PIAs
Approach:
  Accelerate Isotope Pattern Calculator (IPC), a dominant subroutine common in de novo PIAs
  Provide customizable design for general use
  Capitalize on reconfigurable computing at scale to achieve sustainable supercomputing performance
2
Presentation Outline
Background
  Protein Identification
  De Novo PIAs
  Theoretical Mass Spectrum Generation
IPC Problem Description
  Elemental Isotope SADs
  Stage 1: SED Calculation
  Stage 2: SED Combination
  Additional IPC Functionality
A Configurable & Scalable IPC Hardware Architecture
  SED Calculation Reduced to LUTs
  SED Iterative Combination in Hardware
Performance Evaluation on Novo-G
  Single-FPGA Performance
  Multi-FPGA Performance
Summary & Conclusions
Future Work
Q&A
3
SAD: Single-Atom Distribution
SED: Single-Element Distribution
Protein Identification
Protein: biochemical molecule consisting of one or more polypeptides
  Macromolecular chains of linked amino acids
Current protein ID approach
  Methodically fragment protein sample
  Analyze with mass spectrometer
  Employ PIAs to generate string representing amino acid primary structure
  Algorithms classified as database or de novo
[Figure: three panels labeled "This …", "To this..." and "To this…", ending in the peptide amino acid sequence RPPGFSPFR]
4
De Novo PIAs
General de novo approach
  Make educated guess for amino acid string
  Generate theoretical mass spectra and compare to experimental spectrum
  Iteratively refine guess until theoretical and experimental spectra match
Theoretical need to consider all linear combinations of amino acids
  Number of candidates grows exponentially with final sequence length
Employ diverse heuristic pruning methods to limit protein search space
  Necessity for practical use on conventional computing systems
  Often leads to false identifications (e.g., N and GG can have same mass)
By accelerating key computation common in many de novo algorithms,
algorithm developers can employ less restrictive pruning criteria, potentially
allowing a greater degree of accuracy in less time
5
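The "N and GG can have same mass" remark and the exponential growth of candidates can be checked with a short sketch. The residue compositions below are standard chemistry values (amino acid minus one water) and are not taken from the slides; the helper names `composition` and `candidate_count`, and the 20-letter alphabet default, are illustrative assumptions.

```python
from collections import Counter

# Elemental composition of amino acid residues (amino acid minus H2O).
# Assumption: standard textbook formulas, not values from the slides.
RESIDUE = {
    "G": Counter({"C": 2, "H": 3, "N": 1, "O": 1}),  # glycine residue
    "N": Counter({"C": 4, "H": 6, "N": 2, "O": 2}),  # asparagine residue
}


def composition(sequence):
    """Sum residue compositions for a peptide string (terminal H2O ignored)."""
    total = Counter()
    for amino_acid in sequence:
        total += RESIDUE[amino_acid]
    return total


def candidate_count(length, alphabet_size=20):
    """Number of linear sequences of a given length over the alphabet."""
    return alphabet_size ** length


# One asparagine and two glycines have identical elemental compositions,
# so no isotope pattern (and no mass) can tell them apart.
same_composition = composition("N") == composition("GG")
```

Because the two compositions are identical, mass alone cannot separate these candidates, which is one reason aggressive pruning can lead to false identifications.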
Theoretical Mass Spectrum Generation
Majority of execution time for many highly accurate de novo algorithms
Calculation comprises:
  Decomposition of candidate sequence string into many amino acid substrings
  Generation of probable mass contributions for each predicted substring
  Histogram-like combination of probable masses to form theoretical distribution directly comparable to experimental mass spectra
Complicated by fact that, in nature, elements occur as mixture of isotopes
  Neutron quantity differences suggest distribution of possible molecule masses
Use IPC subroutine to predict possible masses
  Enumerates all possible combinations of constituent element isotopes
  Produces list of mass/probability pairs
Although a relatively simple calculation for the smallest of molecules, IPC
executions for medium- to large-sized molecules quickly become a
computational bottleneck of many chemistry applications,
most notably de novo protein identification.
6
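A minimal sketch of the "histogram-like combination" step, assuming fixed-width mass bins. The function name `bin_peaks` and the `bin_width` parameter are illustrative and do not come from the slides, which do not specify how the bins are chosen.

```python
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Accumulate (mass, probability) pairs into fixed-width mass bins.

    Returns a dict mapping bin index (mass // bin_width) to summed probability.
    """
    histogram = defaultdict(float)
    for mass, prob in peaks:
        histogram[int(mass // bin_width)] += prob
    return dict(histogram)


# Hypothetical mass/probability pairs from several fragments.
example = bin_peaks([(18.0106, 0.5), (18.9, 0.25), (19.01, 0.25)])
```

Each IPC result list would feed this accumulation once per predicted substring, so the IPC call count grows with the number of substrings considered.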
IPC Problem Description
Given a chemical formula and a database of element isotope SADs,
produce a list of mass/probability pairs representing the
distribution of possible molecular masses

Analogous to evaluating
$(E_1^1 + E_2^1 + \cdots)^{N_1}\,(E_1^2 + E_2^2 + \cdots)^{N_2}\,(E_1^3 + E_2^3 + \cdots)^{N_3} \cdots$
$E_i^j$ represents ith isotope of jth unique element in chemical formula
containing $N_j$ atoms of jth element type
Problem reducible to two-stage process
1. Compute each single element distribution (SED)
2. Combine SEDs to form final distribution
7
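As a small worked instance of the expression on this slide, consider water (H2O), taking hydrogen as the first unique element with two stable isotopes and oxygen as the second with three, as in the examples on the following slides. The term counts below follow directly from the expansion.

```latex
\underbrace{\left(E_1^1 + E_2^1\right)^{2}}_{\text{H, } N_1 = 2}
\underbrace{\left(E_1^2 + E_2^2 + E_3^2\right)^{1}}_{\text{O, } N_2 = 1}
=
\underbrace{\left[(E_1^1)^2 + 2\,E_1^1 E_2^1 + (E_2^1)^2\right]}_{\text{Stage 1: 3 peaks}}
\underbrace{\left[E_1^2 + E_2^2 + E_3^2\right]}_{\text{Stage 1: 3 peaks}}
\;\xrightarrow{\text{Stage 2}}\; 3 \times 3 = 9 \text{ peaks}
```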
Elemental Isotope SADs
8
Stage 1: SED Calculation
Consider SEDs of Hydrogen from SAD
H1: $({}^{1}_{1}H + {}^{2}_{1}H)^{1} \to {}^{1}_{1}H + {}^{2}_{1}H \to$
  M= 1.007825, p= 9.99885e-01,
  M= 2.014102, p= 1.15e-04
H2: $({}^{1}_{1}H + {}^{2}_{1}H)^{2} \to {}^{1}_{1}H\,{}^{1}_{1}H + 2\,{}^{1}_{1}H\,{}^{2}_{1}H + {}^{2}_{1}H\,{}^{2}_{1}H \to$
  M= 2.01565, p= 9.99770e-01,
  M= 3.02193, p= 2.2997e-04,
  M= 4.02820, p= 1.3225e-08
HN: $({}^{1}_{1}H + {}^{2}_{1}H)^{N} \to {}^{1}_{1}H_{N} + N\,{}^{1}_{1}H_{N-1}\,{}^{2}_{1}H_{1} + \frac{N(N-1)}{2}\,{}^{1}_{1}H_{N-2}\,{}^{2}_{1}H_{2} + \cdots$
  → A really long list with many low probability peaks!
  Impose Threshold Probability
ALGORITHM 1. Calculate HN SED
p0 ← 0.999885, m0 ← 1.007825
p1 ← 0.000115, m1 ← 2.014102
FOR n1 ← 0 to N
  n0 ← N – n1
  p ← $\binom{N}{n_1}\, p_0^{n_0}\, p_1^{n_1} = \frac{N!}{n_0!\,n_1!}\, p_0^{n_0}\, p_1^{n_1}$
  m ← n0 m0 + n1 m1
  PRINT (m, p)
END LOOP
9
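ALGORITHM 1 can be sketched in Python as follows. The function name `two_isotope_sed` and the `threshold` parameter are illustrative choices; the isotope masses and abundances are the ones listed in the algorithm. For N = 2 the heaviest peak evaluates to p1 squared, about 1.3225e-08.

```python
from math import comb


def two_isotope_sed(n_atoms, p0, m0, p1, m1, threshold=0.0):
    """Single-element distribution for an element with two stable isotopes.

    Returns (mass, probability) pairs for n1 = 0..n_atoms heavy atoms,
    dropping peaks whose probability falls below `threshold`.
    """
    peaks = []
    for n1 in range(n_atoms + 1):
        n0 = n_atoms - n1
        # Binomial coefficient times the isotope abundances raised to counts.
        p = comb(n_atoms, n1) * p0**n0 * p1**n1
        m = n0 * m0 + n1 * m1
        if p >= threshold:
            peaks.append((m, p))
    return peaks


HYDROGEN = dict(p0=0.999885, m0=1.007825, p1=0.000115, m1=2.014102)

h2 = two_isotope_sed(2, **HYDROGEN)
# With a threshold, most of the N + 1 peaks of a large SED are discarded.
h256 = two_isotope_sed(256, **HYDROGEN, threshold=1e-5)
```

Without a threshold the loop always emits N + 1 peaks; with the example threshold of 1e-5, H256 keeps only its first few peaks.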
Stage 1: SED Calculation
Can modify ALGORITHM 1. to handle any element with two
stable isotopes (e.g., Helium, Carbon, Nitrogen, etc.)
If an element has more than two stable isotopes?
Consider SEDs of Sulfur
SN: $({}^{1}_{16}S + {}^{2}_{16}S + {}^{3}_{16}S + {}^{4}_{16}S)^{N}$
  → A REALLY, REALLY long list!
ALGORITHM 2. Calculate SN SED
p0 ← 0.9493, m0 ← 31.972079
p1 ← 0.0076, m1 ← 32.971459
p2 ← 0.0429, m2 ← 33.967867
p3 ← 0.0002, m3 ← 35.967081
FOR n1 ← 0 to N
  FOR n2 ← 0 to N – n1
    FOR n3 ← 0 to N – (n1 + n2)
      n0 ← N – (n1 + n2 + n3)
      p ← $\binom{N}{n_1}\binom{N-n_1}{n_2}\binom{N-n_1-n_2}{n_3}\, p_0^{n_0} p_1^{n_1} p_2^{n_2} p_3^{n_3} = \frac{N!}{n_3!\,n_2!\,n_1!\,n_0!}\, p_0^{n_0} p_1^{n_1} p_2^{n_2} p_3^{n_3}$
      m ← n0 m0 + n1 m1 + n2 m2 + n3 m3
      PRINT (m, p)
    END LOOP
  END LOOP
END LOOP
Computation Significantly Increases as the
Number of Stable Isotopes Increases
10
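The nested loops of ALGORITHM 2 generalize to any number of stable isotopes. The sketch below is one way to write that generalization, not the presenters' implementation: `compositions` and `element_sed` are hypothetical helper names, and the sulfur values are the ones listed in the algorithm.

```python
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def element_sed(n_atoms, sad, threshold=0.0):
    """Single-element distribution from a single-atom distribution.

    `sad` is a list of (mass, abundance) pairs, one per stable isotope.
    """
    peaks = []
    for counts in compositions(n_atoms, len(sad)):
        coeff = factorial(n_atoms)  # becomes the multinomial coefficient
        p = 1.0
        m = 0.0
        for n_i, (m_i, p_i) in zip(counts, sad):
            coeff //= factorial(n_i)  # exact integer division at every step
            p *= p_i**n_i
            m += n_i * m_i
        p *= coeff
        if p >= threshold:
            peaks.append((m, p))
    return peaks


SULFUR_SAD = [(31.972079, 0.9493), (32.971459, 0.0076),
              (33.967867, 0.0429), (35.967081, 0.0002)]

s2 = element_sed(2, SULFUR_SAD)
# Number of isotope combinations visited for S64 (four isotopes).
s64_combinations = sum(1 for _ in compositions(64, 4))
```

For an element with k isotopes and N atoms the loop visits C(N + k - 1, k - 1) combinations, which is why the cost climbs quickly with the number of stable isotopes.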
Stage 2: SED Combination
With Stage 1 complete, analogous to evaluating
$(P_1^1 + P_2^1 + \cdots)\,(P_1^2 + P_2^2 + \cdots)\,(P_1^3 + P_2^3 + \cdots) \cdots$
$P_i^j$ represents ith peak from SED generated for jth unique element
Removal of exponent allows for straightforward combination
Simple Example) H2O:
H2: ${}^{1}_{1}H\,{}^{1}_{1}H + 2\,{}^{1}_{1}H\,{}^{2}_{1}H + {}^{2}_{1}H\,{}^{2}_{1}H \to$
  M= 2.01565, p= 9.9977e-01,
  M= 3.02193, p= 2.2997e-04,
  M= 4.02820, p= 1.3225e-08
O1: ${}^{1}_{8}O + {}^{2}_{8}O + {}^{3}_{8}O \to$
  M= 15.9949, p= 9.9757e-01,
  M= 16.9991, p= 3.8e-04,
  M= 17.9992, p= 2.05e-03
H2O:
M= 2.01565 + 15.9949 = 18.0106, p= 9.9977e-01 * 9.9757e-01 = 9.9734e-01,
M= 2.01565 + 16.9991 = 19.0148, p= 9.9977e-01 * 3.8e-04    = 3.7991e-04,
M= 2.01565 + 17.9992 = 20.0149, p= 9.9977e-01 * 2.05e-03   = 2.0495e-03,
M= 3.02193 + 15.9949 = 19.0168, p= 2.2997e-04 * 9.9757e-01 = 2.2941e-04,
M= 3.02193 + 16.9991 = 20.0210, p= 2.2997e-04 * 3.8e-04    = 8.7389e-08,
M= 3.02193 + 17.9992 = 21.0211, p= 2.2997e-04 * 2.05e-03   = 4.7144e-07,
M= 4.02820 + 15.9949 = 20.0231, p= 1.3225e-08 * 9.9757e-01 = 1.3193e-08,
M= 4.02820 + 16.9991 = 21.0273, p= 1.3225e-08 * 3.8e-04    = 5.0255e-12,
M= 4.02820 + 17.9992 = 22.0274, p= 1.3225e-08 * 2.05e-03   = 2.7111e-11
11
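The Stage 2 combination is a pairwise product of peak lists: masses add and probabilities multiply. A minimal sketch follows, with `combine` and `combine_all` as illustrative names. The oxygen SED is taken from the example on this slide, and the H2 probabilities are recomputed from p1 = 0.000115, so the heaviest H2 peak is about 1.3225e-08.

```python
def combine(dist_a, dist_b, threshold=0.0):
    """Combine two distributions: add masses, multiply probabilities."""
    return [(ma + mb, pa * pb)
            for ma, pa in dist_a
            for mb, pb in dist_b
            if pa * pb >= threshold]


def combine_all(seds, threshold=0.0):
    """Fold a list of SEDs into one full-formula distribution."""
    result = [(0.0, 1.0)]  # identity element: zero mass, probability one
    for sed in seds:
        result = combine(result, sed, threshold)
    return result


H2 = [(2.01565, 9.9977e-01), (3.02193, 2.2997e-04), (4.02820, 1.3225e-08)]
O1 = [(15.9949, 9.9757e-01), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]

water = combine_all([H2, O1])
```

The peak count is the product of the SED sizes (here 3 x 3 = 9), so thresholding between combinations matters for formulas with many unique elements.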
Additional IPC Functionality
Simple Example Continued, H2O:
M= 18.0106, p= 9.9734e-01,
M= 19.0148, p= 3.7991e-04,
M= 20.0149, p= 2.0495e-03,
M= 19.0168, p= 2.2941e-04,
M= 20.0210, p= 8.7389e-08,
M= 21.0211, p= 4.7144e-07,
M= 20.0231, p= 1.3193e-08,
M= 21.0273, p= 5.0255e-12,
M= 22.0274, p= 2.7111e-11
Probability Threshold: Filter prob < PT (e.g., PT = 1.0e-05)
M= 18.0106, p= 9.9734e-01,
M= 19.0148, p= 3.7991e-04,
M= 20.0149, p= 2.0495e-03,
M= 19.0168, p= 2.2941e-04
Sort by Mass
M= 18.0106, p= 9.9734e-01,
M= 19.0148, p= 3.7991e-04,
M= 19.0168, p= 2.2941e-04,
M= 20.0149, p= 2.0495e-03,
M= 20.0210, p= 8.7389e-08,
M= 20.0231, p= 1.3193e-08,
M= 21.0211, p= 4.7144e-07,
M= 21.0273, p= 5.0255e-12,
M= 22.0274, p= 2.7111e-11
Sort by Probability
M= 18.0106, p= 9.9734e-01,
M= 20.0149, p= 2.0495e-03,
M= 19.0148, p= 3.7991e-04,
M= 19.0168, p= 2.2941e-04,
M= 21.0211, p= 4.7144e-07,
M= 20.0210, p= 8.7389e-08,
M= 20.0231, p= 1.3193e-08,
M= 22.0274, p= 2.7111e-11,
M= 21.0273, p= 5.0255e-12
Window Filter: Filter any peaks after the Nth (e.g., N = 6)
After Sort by Mass:
M= 18.0106, p= 9.9734e-01,
M= 19.0148, p= 3.7991e-04,
M= 19.0168, p= 2.2941e-04,
M= 20.0149, p= 2.0495e-03,
M= 20.0210, p= 8.7389e-08,
M= 20.0231, p= 1.3193e-08
After Sort by Probability:
M= 18.0106, p= 9.9734e-01,
M= 20.0149, p= 2.0495e-03,
M= 19.0148, p= 3.7991e-04,
M= 19.0168, p= 2.2941e-04,
M= 21.0211, p= 4.7144e-07,
M= 20.0210, p= 8.7389e-08
Mass Peak Centroiding: Essentially moving average filter over
close peaks, weighted by probability
Applied to the mass-sorted window:
M= 18.0106, p= 9.9734e-01,
M= 19.0156, p= 6.0932e-04,
M= 20.0149, p= 2.0496e-03
12
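The four post-processing options can be sketched as small list operations. All function names are illustrative, and the centroiding `tolerance` of 0.5 mass units is an assumption, since the slide does not state how "close" peaks are defined. The `WATER` list uses peak probabilities recomputed from the isotope abundances.

```python
def threshold_filter(peaks, pt):
    """Drop peaks with probability below pt."""
    return [(m, p) for m, p in peaks if p >= pt]


def sort_by_mass(peaks):
    return sorted(peaks, key=lambda mp: mp[0])


def sort_by_probability(peaks):
    return sorted(peaks, key=lambda mp: mp[1], reverse=True)


def window_filter(peaks, n):
    """Keep only the first n peaks of an already sorted list."""
    return peaks[:n]


def centroid(peaks, tolerance=0.5):
    """Merge peaks within `tolerance` of the running centroid,
    using a probability-weighted average mass and summed probability."""
    merged = []
    for m, p in sort_by_mass(peaks):
        if merged and m - merged[-1][0] <= tolerance:
            cm, cp = merged[-1]
            total = cp + p
            merged[-1] = ((cm * cp + m * p) / total, total)
        else:
            merged.append((m, p))
    return merged


WATER = [(18.0106, 9.9734e-01), (19.0148, 3.7991e-04), (20.0149, 2.0495e-03),
         (19.0168, 2.2941e-04), (20.0210, 8.7389e-08), (21.0211, 4.7144e-07),
         (20.0231, 1.3193e-08), (21.0273, 5.0255e-12), (22.0274, 2.7111e-11)]

kept = threshold_filter(WATER, 1.0e-05)
window = window_filter(sort_by_mass(WATER), 6)
centroids = centroid(window)
```

Sorting before the window filter decides which peaks survive: a mass-sorted window keeps the lightest N peaks, while a probability-sorted window keeps the N most likely ones.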
A Configurable & Scalable IPC Hardware Architecture
Adapt two-stage procedure to a configurable & scalable hardware
architecture capable of converting a stream of independent chemical formula
queries into a delimited stream of variable-quantity mass/probability pairs
Single Module Handles
Stage 1 Functionality
Multiple Modules Handle Stage 2
Computation
No. of Modules
Independent from Input
Stream Data and Host
Stream Consists of
Chemical Formula
Query Information
and Control Data
Result Distributions
Returned in Same Order
as Received in Input
Stream
13
SED calculation reduced to LUTs
Precompute SEDs Exactly, Pull SEDs from LUTs at Runtime vs. SADs vs. FCFDs
Single Bank of LUTs Feed All Distribution Calculators
SEDs Presorted by Probability, Filtered at Runtime with Configurable Threshold Prob.
In-Stream Control [3]
Token-Based Round Robin Scheduler
Sample LUT Address Space for SEDs (addresses 0 to 2047):
  H1 − H256
  C1 − C256
  N1 − N256
  O1 − O256
  S1 − S64
  60 Other Elements with 16 SEDs per Element
Equation from Slide 7:
$(E_1^1 + E_2^1 + \cdots)^{N_1}\,(E_1^2 + E_2^2 + \cdots)^{N_2}\,(E_1^3 + E_2^3 + \cdots)^{N_3} \cdots$
14
SAD: Single-Atom Distribution
SED: Single-Element Distribution
FCFD: Full Chemical Formula Distribution
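The sample address space adds up to exactly 2048 entries (4 x 256 + 64 + 60 x 16), matching the address range 0 to 2047. The sketch below shows one possible address mapping; the packing order, the `other_index` numbering of the 60 remaining elements, and the function names are assumptions made for illustration, not details given on the slide.

```python
# Assumed layout: major elements packed in the order listed on the slide,
# followed by 60 'other' elements with 16 precomputed SEDs each.
MAJOR = [("H", 256), ("C", 256), ("N", 256), ("O", 256), ("S", 64)]
OTHER_COUNT, OTHER_DEPTH = 60, 16


def build_bases():
    """Return ({symbol: (base, depth)}, first address after the major elements)."""
    bases, addr = {}, 0
    for symbol, depth in MAJOR:
        bases[symbol] = (addr, depth)
        addr += depth
    return bases, addr


def lut_address(symbol, count, other_index=None):
    """Address of the precomputed SED for `count` atoms of `symbol`.

    `other_index` (0..59) selects one of the 'other' elements; its
    numbering is hypothetical.
    """
    bases, other_base = build_bases()
    if symbol in bases:
        base, depth = bases[symbol]
    else:
        base, depth = other_base + other_index * OTHER_DEPTH, OTHER_DEPTH
    if not 1 <= count <= depth:
        raise ValueError("atom count outside precomputed range")
    return base + count - 1


TOTAL_ENTRIES = sum(depth for _, depth in MAJOR) + OTHER_COUNT * OTHER_DEPTH
```

Under this layout a formula query reduces Stage 1 to one table lookup per unique element, at the cost of capping the atom count per element at the table depth.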
SED Iterative Combination in Hardware
A single-cycle SED combination architecture sized for the worst case
is excessively wasteful when processing the common case; employ
iterative combination to boost hardware utilization
X: No. of Parallel Multipliers and Adders
Y: Buffer Depth
ALGORITHM 3. Distribution Calculator Procedure
WHILE Control ≠ “done”
  SED[1…N] ← FIFO[1…N].pop(), Control ← FIFO[N+1].pop()
  IF Control = “begin”
    PrevItBuff[1…N] ← SED[1…N], PrevItBuff[N+1…Y] ← (-1,0)
    CurrItBuff[1…Y] ← (-1,0)
  IF Control = “middle” or Control = “end”
    WHILE tmp ← PrevItBuff[1…Y].shift() ≠ (-1,0)
      i ← 1
      WHILE i ≤ N and SED[i].prob > 0
        MultAdd[1…X].mass ← SED[i…i+X−1].mass + tmp.mass
        MultAdd[1…X].prob ← SED[i…i+X−1].prob ∗ tmp.prob
        PSort[1…X] ← Sort(Filter(MultAdd[1…X], TP))
        CurrItBuff[1…Y] ← InSort(CurrItBuff[1…Y], PSort[1…X])
        i ← i+X
      END LOOP
    END LOOP
    PrevItBuff[1…Y] ← CurrItBuff[1…Y]
    CurrItBuff[1…Y] ← (-1,0)
  IF Control = “end”
    FinalResBuff[1…Y] ← PrevItBuff[1…Y]
END LOOP
Result Reporting Circuitry Operates
Independently of Distribution Calculation
Insert Centroiding Here if so Desired
15
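A software model of the iterative combination can make the roles of X and Y concrete. This is a sketch under assumptions: it treats the buffers as probability-sorted lists truncated to Y entries, replaces the FIFO and control tokens with a plain list of SEDs, and uses `heapq.nlargest` in place of the hardware insertion sort. The function name and the sample SEDs are illustrative.

```python
import heapq


def combine_iteratively(seds, x=2, y=128, tp=0.0):
    """Model of one distribution calculator.

    seds: list of SEDs, each a list of (mass, prob) pairs
    x:    products formed per step (parallel multiplier/adder count)
    y:    buffer depth, i.e. maximum peaks kept after each iteration
    tp:   threshold probability
    """
    # "begin": load the first SED into the previous-iteration buffer.
    prev = heapq.nlargest(y, seds[0], key=lambda mp: mp[1])
    # "middle"/"end": combine each further SED with the buffered result.
    for sed in seds[1:]:
        curr = []
        for tmp_m, tmp_p in prev:
            for i in range(0, len(sed), x):
                chunk = sed[i:i + x]  # x peaks processed per step
                products = [(m + tmp_m, p * tmp_p) for m, p in chunk]
                products = [mp for mp in products if mp[1] >= tp]
                # Sorted insert, keeping only the y most probable peaks.
                curr = heapq.nlargest(y, curr + products, key=lambda mp: mp[1])
        prev = curr
    return prev


H2 = [(2.01565, 9.9977e-01), (3.02193, 2.2997e-04), (4.02820, 1.3225e-08)]
O1 = [(15.9949, 9.9757e-01), (16.9991, 3.8e-04), (17.9992, 2.05e-03)]

top6 = combine_iteratively([H2, O1], x=2, y=6)
full = combine_iteratively([H2, O1], x=2, y=128)
```

In this model a larger `x` reduces the number of inner-loop steps per buffered peak, and a smaller `y` bounds the work of every later iteration, mirroring the logic-versus-exactness tradeoff the architecture exposes.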
Performance Evaluation on Novo-G
Previously discussed hardware architecture implemented in VHDL and tested on Novo-G [4,5]
Initial experiments on single Altera Stratix IV E530 FPGA in GiDEL PROCStar IV board along with an Intel Xeon E5620 CPU for host support
Single-device implementation scaled up to a single Novo-G “ps4” compute node
  i.e., up to 16 E530s in 4 PROCStar IVs
Implications of scaling to multiple compute nodes of Novo-G discussed
Software baseline: highly optimized, serial C++ code mirroring hardware algorithm
  Executed on single E5620 core
  Orders of magnitude faster than code at [6]
Hardware and software results compared to confirm hardware correctness
16
Novo-G Annual Growth
2009: 96 top-end Stratix-III FPGAs,
each with 4.25GB SDRAM
2010: 96 more Stratix-III FPGAs,
each with 4.25GB SDRAM
2011: 96 top-end Stratix-IV FPGAs,
each with 8.50GB SDRAM
2012: 96 more Stratix-IV FPGAs,
each with 8.50GB SDRAM
Single-FPGA Performance
TABLE I. Single-FPGA performance for several parameter configurations.
Configuration*  N   X  Y    Qi.f Mass  Qi.f Prob  Cen  M   Freq (MHz)  Speedup†/DC  Speedup†/FPGA
 1.             16  1  128  16.16      1.31       N    12  115         −            72
 2.             16  1  128  14.8       1.15       N    20  145         −            115
 3.             16  2  128  14.8       1.15       N    15  145         −            115
 4.             16  3  128  14.8       1.15       N    10  110         −            87
 5.             12  2  128  14.8       1.15       N    14  155         −            123
 6.             16  2  80   14.8       1.15       N    21  150         −            120
 7.             12  2  80   14.8       1.15       N    22  155         −            127
 8.             16  1  128  16.16      1.31       Y    11  120         17           186
 9.             16  1  128  14.12      1.23       Y    13  135         18           236
10.             16  1  128  14.8       1.15       Y    16  140         24           384
11.             16  1  128  14.12      1.23       Y    13  135         18           236
12.             16  2  128  14.12      1.23       Y    10  130         32           325
13.             16  3  128  14.12      1.23       Y    6   100         35           214
14.             16  2  128  14.12      1.23       Y    10  130         32           325
15.             12  2  128  14.12      1.23       Y    10  135         34           338
16.             8   2  128  14.12      1.23       Y    11  135         35           381
17.             16  2  100  14.12      1.23       Y    13  135         34           444
18.             16  2  80   14.12      1.23       Y    14  130         32           454
19.             16  2  60   14.12      1.23       Y    17  130         33           566
20.             12  2  80   14.12      1.23       Y    15  135         34           516
Performance Trends for Various IPC
Parameter Configurations
Configurations Bandwidth Limited
Computation-bound problem in software
becomes I/O-bound in FPGAs
Reducing Calculation Word Width
Reduced Logic Usage & Increased Operating Frequency
vs.
Reduced Result Precision
Increasing Parallel Computations per DC
Increased Operations per Clock Cycle
vs.
Increased Logic Usage, Reduced Routability,
& Operating Frequency
Reducing Distribution Window Width
Reduced Logic Usage
vs.
Reduced Result Exactness
* N: max peaks/SED, X: number of parallel peak computations per DC,
Y: max peaks/output-window, Qi.f: fixed-point word width bits (integer.fractional),
Cen: centroiding capability enabled (Yes/No), M: DCs per FPGA.
† w.r.t. C++ software processing 16×2^20 statistically representative queries (based on
relative elemental abundance in amino acids) in 821 seconds on a single E5620 core.
Suitable “sweet spot,” achieving remarkable speedup
while ensuring results remain scientifically relevant
17
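A short arithmetic sketch relating the Table I columns. It assumes the workload in the footnote is 16 x 2^20 queries and that, for the centroiding rows, the per-FPGA speedup is roughly the per-DC speedup times M (the number of DCs per FPGA); both readings are inferred from the slide rather than stated outright, and the function names are illustrative.

```python
QUERIES = 16 * 2**20      # baseline workload from the Table I footnote
BASELINE_SECONDS = 821    # software runtime on a single E5620 core

# (configuration, M = DCs per FPGA, speedup per DC, speedup per FPGA)
# for the distinct centroiding-enabled rows of Table I.
ROWS = [(8, 11, 17, 186), (9, 13, 18, 236), (10, 16, 24, 384),
        (12, 10, 32, 325), (13, 6, 35, 214), (15, 10, 34, 338),
        (16, 11, 35, 381), (17, 13, 34, 444), (18, 14, 32, 454),
        (19, 17, 33, 566), (20, 15, 34, 516)]


def baseline_qps():
    """Software throughput in queries per second."""
    return QUERIES / BASELINE_SECONDS


def implied_fpga_qps(speedup_per_fpga):
    """Throughput implied by a reported per-FPGA speedup."""
    return baseline_qps() * speedup_per_fpga


def max_relative_gap(rows):
    """Largest relative difference between M * speedup/DC and speedup/FPGA."""
    return max(abs(m * dc - fpga) / fpga for _, m, dc, fpga in rows)


best_qps = implied_fpga_qps(566)
```

Under these assumptions the software baseline handles roughly 20 thousand queries per second, and the product M x speedup/DC stays within about 2% of the reported per-FPGA speedup for every listed row.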
Multi-FPGA Performance
Performance Trends of “sweet spot” for Various Novo-G Node Configurations
Increasing PROCStar IVs per Node
  Available system bandwidth far exceeds board link bandwidth bottleneck observed with single-board
  Scalability now limited by CPU resources
  Novo-G “ps4” nodes have 8 physical cores (16 logical with hyper-threading) vs. max 32 threads for row 9
  Expect increased scalability with system config. employing more physical cores
Increasing FPGAs per PROCStar IV
  Scalability limited by I/O-bandwidth
  PROCStar IV only supports 8 lane, Gen 1 PCIe
  Expect increased scalability with system config. employing more lanes and/or more recent Gen 3 PCIe standard
Multi-Node Scaling Expectations
  Assuming input queries are pre-partitioned, no required communication between compute nodes
  Overhead limited to initialization & completion synchronization so expect performance to scale almost linearly with additional nodes
  We plan to verify these expectations by scaling to multiple compute nodes in Novo-G as future work
TABLE II. Multi-FPGA performance for several node configurations.
     Boards/Node  FPGAs/Board  Total FPGAs  Speedup†/FPGA  Total Speedup†
1.   1            1            1            517            517
2.   1            2            2            516            1031
3.   1            3            3            398            1192
4.   1            4            4            315            1259
5.   2            1            2            516            1033
6.   2            2            4            502            2009
7.   2            4            8            313            2510
8.   4            1            4            492            1968
9.   4            4            16           209            3340
Multiple FPGA Advantage?
18
† w.r.t. C++ software processing 2^30 statistically representative queries (based on
relative elemental abundance) in 52,478 seconds on a single E5620 core.
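The scaling trends in Table II can be summarized as a parallel efficiency relative to the single-FPGA row. The sketch below only rearranges the reported numbers; the function name and the definition of efficiency (total speedup divided by FPGA count times the single-FPGA speedup) are illustrative choices.

```python
# (boards per node, FPGAs per board, speedup per FPGA, total speedup)
# for rows 1 to 9 of Table II.
ROWS = [(1, 1, 517, 517), (1, 2, 516, 1031), (1, 3, 398, 1192),
        (1, 4, 315, 1259), (2, 1, 516, 1033), (2, 2, 502, 2009),
        (2, 4, 313, 2510), (4, 1, 492, 1968), (4, 4, 209, 3340)]


def scaling_efficiency(rows):
    """Total speedup divided by ideal linear scaling of the first row."""
    single = rows[0][3]
    return [total / (boards * per_board * single)
            for boards, per_board, _, total in rows]


efficiency = scaling_efficiency(ROWS)
```

Four FPGAs spread over four boards retain about 95% efficiency, while four FPGAs on one board retain about 61%, which is consistent with the board link being the first bottleneck; the full 16-FPGA node lands near 40%.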
Summary & Conclusions
Presented first FPGA-based Isotope Pattern Calculator
  Computationally intense subroutine common in de novo PIAs
  Provides 23 customization parameters for general use
  Discussed parameter tradeoffs & experimentally demonstrated effect on performance
Between 72 and 566 speedup† on a single FPGA
Up to 1259 speedup† on a single board (4 FPGAs)
Up to 3340 speedup† on a single node (16 FPGAs)
Wide range of achieved single-node performance due to embarrassingly parallel scalability
restricted by real-world system limitations such as insufficient I/O bandwidth and CPU resources
Can enable use of previously dismissed protein identification
algorithms with potentially revolutionary accuracy yet obscene
execution time on conventional computing platforms
Still much to be done before this is a reality for protein identification
† with respect to a highly optimized, serial C++ IPC implementation
19
Future Work
Continue scaling design to multiple nodes of Novo-G
Integrate FPGA accelerated IPC into full de novo PIA
  First integrate with full theoretical spectrum generator
  Move more of algorithm onto FPGA to lessen bandwidth bottleneck issue
Explore the possibility of a GPU accelerated IPC
  GPU amenable given minor modifications to the algorithm as stated
  Preliminary design already mapped out, ready for implementation & testing
Implement non-sorted output option
  Sorting fundamentally integral to current DC design
  Non-sorting DC would allow greater parallelization while utilizing fewer resources
  If sorted distribution not required by targeted PIA, expect much greater performance
20
Thank You For Listening!
Any Questions?
21
References
[1] A. W. Bell et al., “A HUPO test sample study reveals common problems in mass
spectrometry-based proteomics,” Nat. Methods, vol 6, pp. 423-430, 2009.
[2] E. A. Kapp et al., "An evaluation, comparison, and accurate benchmarking of several
publicly available MS/MS search algorithms: Sensitivity and specificity analysis,"
Proteomics, vol 5, pp. 3475–3490, 2005.
[3] C. Pascoe et al., “Reconfigurable supercomputing with scalable systolic arrays and in-stream
control for wavefront genomics processing,” Proc. of Symposium on
Application Accelerators in High-Performance Computing (SAAHPC), TN, 2010.
[4] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, “Novo-G: A View at the HPC
Crossroads for Scientific Computing,” Proc. of the Int. Conf. on Eng. of Reconf. Sys.
and Algs. (ERSA), NV, 2010.
[5] A. George, H. Lam, and G. Stitt, “Novo-G: At the Forefront of Scalable Reconfigurable
Computing,” IEEE Computing in Sci. & Eng. (CiSE), Vol. 13, No. 1, Jan/Feb. 2011, pp.
82-86.
[6] Dirk (2005), Isotopic Pattern Calculator, http://isotopatcalc.sourceforge.net/index.php,
File: gips-0.7.tar.gz.
22