FPGA-Accelerated Isotope Pattern Calculator for Use in Simulated Mass Spectrometry Peptide and Protein Chemistry

SAAHPC 2012
Carlo Pascoe (speaker), David Box, Herman Lam, Alan George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Dept. of Electrical and Computer Engineering, University of Florida
Gainesville FL, USA
Email: {pascoe, box, hlam, george}@chrec.org
Wednesday July 11th, 2012

Motivation

- Protein Identification Algorithms (PIAs)
  - Heavily utilized in pharmaceutical research and cancer diagnostics
  - Current industry standard methods unreliable (at best!) [1,2]
  - Highly accurate algorithms with potential to revolutionize accuracy exist, but are unused or underused due to extreme computational intensity and prohibitive execution times
  - Must accelerate for feasible use
- Objective: Develop sustainable solution for increasing the speed, and thus achievable accuracy, of many PIAs
- Approach: Accelerate Isotope Pattern Calculator (IPC), a dominant subroutine common in de novo PIAs
  - Provide customizable design for general use
  - Capitalize on reconfigurable computing at scale to achieve sustainable supercomputing performance

Presentation Outline

- Background
  - Protein Identification
  - De Novo PIAs
  - Theoretical Mass Spectrum Generation
- IPC Problem Description
  - Elemental Isotope SADs
  - Stage 1: SED Calculation
  - Stage 2: SED Combination
  - Additional IPC Functionality
- A Configurable & Scalable IPC Hardware Architecture
  - SED Calculation Reduced to LUTs
  - SED Iterative Combination in Hardware
- Performance Evaluation on Novo-G
  - Single-FPGA Performance
  - Multi-FPGA Performance
- Summary & Conclusions
- Future Work
- Q&A

SAD: Single-Atom Distribution; SED: Single-Element Distribution

Protein Identification

- Protein: biochemical molecule consisting of one or more polypeptides
  - Macromolecular chains of linked amino acids
- Current protein ID approach
  - Methodically fragment protein sample
  - Analyze with mass spectrometer
  - Employ PIAs to generate string representing amino acid primary structure
  - Algorithms classified as database or de novo

[Figure: "This ... to this ... to this ..." progression ending in the peptide amino acid sequence RPPGFSPFR]

De Novo PIAs

- General de novo approach
  - Theoretical need to consider all linear combinations of amino acids
  - Make educated guess for amino acid string
  - Generate theoretical mass spectra and compare to experimental spectrum
  - Iteratively refine guess until theoretical and experimental spectra match
- Number of candidates grows exponentially with final sequence length
  - Employ diverse heuristic pruning methods to limit protein search space
  - Necessity for practical use on conventional computing systems
  - Often leads to false identifications (e.g., N and GG can have same mass)
- By accelerating a key computation common to many de novo algorithms, algorithm developers can employ less restrictive pruning criteria, potentially allowing a greater degree of accuracy in less time

Theoretical Mass Spectrum Generation

- Majority of execution time for many highly accurate de novo algorithms
- Calculation comprises:
  - Decomposition of candidate sequence string into many amino acid substrings
  - Generation of probable mass contributions for each predicted substring
  - Histogram-like combination of probable masses to form theoretical distribution directly comparable to experimental mass spectra
- Complicated by fact that, in nature, elements occur as mixture of isotopes
  - Neutron quantity differences suggest distribution of possible molecule masses
- Use IPC subroutine to predict possible masses
  - Enumerates all possible combinations of constituent element isotopes to produce list of mass/probability pairs
- Although a relatively simple calculation for the smallest of molecules, IPC executions for medium- to large-sized molecules quickly become a computational bottleneck of many chemistry applications, most notably de novo protein identification.

IPC Problem Description

- Given a chemical formula and a database of element isotope SADs, produce a list of mass/probability pairs representing the distribution of possible molecular masses
- Analogous to evaluating
  $(E_1^1 + E_2^1 + \cdots)^{N_1} (E_1^2 + E_2^2 + \cdots)^{N_2} (E_1^3 + E_2^3 + \cdots)^{N_3} \cdots$
  - $E_i^j$ represents the $i$th isotope of the $j$th unique element in a chemical formula containing $N_j$ atoms of the $j$th element type
- Problem reducible to two-stage process
  1. Compute each single element distribution (SED)
  2. Combine SEDs to form final distribution

Elemental Isotope SADs

[Figure: elemental isotope SADs]

Stage 1: SED Calculation

- Consider SEDs of Hydrogen from SAD
  - H1: $({}^{1}_{1}H + {}^{2}_{1}H)^{1} \rightarrow {}^{1}_{1}H + {}^{2}_{1}H \rightarrow$
    M= 1.007825, p= 9.99885e-01; M= 2.014102, p= 1.15e-04
  - H2: $({}^{1}_{1}H + {}^{2}_{1}H)^{2} \rightarrow {}^{1}_{1}H\,{}^{1}_{1}H + 2\,{}^{1}_{1}H\,{}^{2}_{1}H + {}^{2}_{1}H\,{}^{2}_{1}H \rightarrow$
    M= 2.01565, p= 9.99770e-01; M= 3.02193, p= 2.2997e-04; M= 4.02820, p= 1.3225e-08
  - HN: $({}^{1}_{1}H + {}^{2}_{1}H)^{N} \rightarrow {}^{1}_{1}H_{N} + N\,{}^{1}_{1}H_{N-1}\,{}^{2}_{1}H_{1} + \frac{N(N-1)}{2}\,{}^{1}_{1}H_{N-2}\,{}^{2}_{1}H_{2} + \cdots \rightarrow$
    A really long list with many low probability peaks! Impose Threshold Probability

ALGORITHM 1. Calculate HN SED
  p0 ← 0.999885, m0 ← 1.007825
  p1 ← 0.000115, m1 ← 2.014102
  FOR n1 ← 0 to N
    n0 ← N − n1
    p ← $\binom{N}{n_1} p_0^{n_0} p_1^{n_1} = \frac{N!}{n_0!\,n_1!}\, p_0^{n_0} p_1^{n_1}$
    m ← n0 m0 + n1 m1
    PRINT (m, p)
  END LOOP

Stage 1: SED Calculation (continued)

- Can modify ALGORITHM 1 to handle any element with two stable isotopes (e.g., Helium, Carbon, Nitrogen, etc.)
- What if an element has more than two stable isotopes? Consider SEDs of Sulfur
  - SN: $({}^{1}_{16}S + {}^{2}_{16}S + {}^{3}_{16}S + {}^{4}_{16}S)^{N} \rightarrow$ A REALLY, REALLY long list!

ALGORITHM 2. Calculate SN SED
  p0 ← 0.9493, m0 ← 31.972079
  p1 ← 0.0076, m1 ← 32.971459
  p2 ← 0.0429, m2 ← 33.967867
  p3 ← 0.0002, m3 ← 35.967081
  FOR n1 ← 0 to N
    FOR n2 ← 0 to N − n1
      FOR n3 ← 0 to N − (n1 + n2)
        n0 ← N − (n1 + n2 + n3)
        p ← $\binom{N}{n_1}\binom{N-n_1}{n_2}\binom{N-n_1-n_2}{n_3}\, p_0^{n_0} p_1^{n_1} p_2^{n_2} p_3^{n_3} = \frac{N!}{n_3!\,n_2!\,n_1!\,n_0!}\, p_0^{n_0} p_1^{n_1} p_2^{n_2} p_3^{n_3}$
        m ← n0 m0 + n1 m1 + n2 m2 + n3 m3
        PRINT (m, p)
      END LOOP
    END LOOP
  END LOOP

- Computation Significantly Increases as the Number of Stable Isotopes Increases

Stage 2: SED Combination

- With Stage 1 complete, analogous to evaluating
  $(P_1^1 + P_2^1 + \cdots)(P_1^2 + P_2^2 + \cdots)(P_1^3 + P_2^3 + \cdots) \cdots$
  - $P_i^j$ represents the $i$th peak from the SED generated for the $j$th unique element
  - Removal of exponent allows for straightforward combination
- Simple Example) H2O:
  - H2 SED: M= 2.01565, p= 9.9977e-01; M= 3.02193, p= 2.2997e-04; M= 4.02820, p= 1.3225e-08
  - O1 SED: M= 15.9949, p= 9.9757e-01; M= 16.9991, p= 3.8e-04; M= 17.9992, p= 2.05e-03
  - $({}^{1}_{1}H\,{}^{1}_{1}H + 2\,{}^{1}_{1}H\,{}^{2}_{1}H + {}^{2}_{1}H\,{}^{2}_{1}H)({}^{1}_{8}O + {}^{2}_{8}O + {}^{3}_{8}O) \rightarrow$
    M= 2.01565 + 15.9949 = 18.0106, p= 9.9977e-01 * 9.9757e-01 = 9.9734e-01
    M= 2.01565 + 16.9991 = 19.0148, p= 9.9977e-01 * 3.8e-04 = 3.7991e-04
    M= 2.01565 + 17.9992 = 20.0149, p= 9.9977e-01 * 2.05e-03 = 2.0495e-03
    M= 3.02193 + 15.9949 = 19.0168, p= 2.2997e-04 * 9.9757e-01 = 2.2941e-04
    M= 3.02193 + 16.9991 = 20.0210, p= 2.2997e-04 * 3.8e-04 = 8.7389e-08
    M= 3.02193 + 17.9992 = 21.0211, p= 2.2997e-04 * 2.05e-03 = 4.7144e-07
    M= 4.02820 + 15.9949 = 20.0231, p= 1.3225e-08 * 9.9757e-01 = 1.3193e-08
    M= 4.02820 + 16.9991 = 21.0273, p= 1.3225e-08 * 3.8e-04 = 5.0255e-12
    M= 4.02820 + 17.9992 = 22.0274, p= 1.3225e-08 * 2.05e-03 = 2.7111e-11

Additional IPC Functionality

- Simple Example Continued, H2O (peaks in generation order):
  M= 18.0106, p= 9.9734e-01; M= 19.0148, p= 3.7991e-04; M= 20.0149, p= 2.0495e-03; M= 19.0168, p= 2.2941e-04; M= 20.0210, p= 8.7389e-08; M= 21.0211, p= 4.7144e-07; M= 20.0231, p= 1.3193e-08; M= 21.0273, p= 5.0255e-12; M= 22.0274, p= 2.7111e-11
- Probability Threshold: Filter prob < PT (e.g., PT = 1.0e-05)
  M= 18.0106, p= 9.9734e-01; M= 19.0148, p= 3.7991e-04; M= 20.0149, p= 2.0495e-03; M= 19.0168, p= 2.2941e-04
- Sort by Mass
  M= 18.0106, p= 9.9734e-01; M= 19.0148, p= 3.7991e-04; M= 19.0168, p= 2.2941e-04; M= 20.0149, p= 2.0495e-03; M= 20.0210, p= 8.7389e-08; M= 20.0231, p= 1.3193e-08; M= 21.0211, p= 4.7144e-07; M= 21.0273, p= 5.0255e-12; M= 22.0274, p= 2.7111e-11
- Sort by Probability
  M= 18.0106, p= 9.9734e-01; M= 20.0149, p= 2.0495e-03; M= 19.0148, p= 3.7991e-04; M= 19.0168, p= 2.2941e-04; M= 21.0211, p= 4.7144e-07; M= 20.0210, p= 8.7389e-08; M= 20.0231, p= 1.3193e-08; M= 22.0274, p= 2.7111e-11; M= 21.0273, p= 5.0255e-12
- Window Filter: Filter any peaks after the Nth (e.g., N = 6)
  - First six peaks of the mass-sorted list:
    M= 18.0106, p= 9.9734e-01; M= 19.0148, p= 3.7991e-04; M= 19.0168, p= 2.2941e-04; M= 20.0149, p= 2.0495e-03; M= 20.0210, p= 8.7389e-08; M= 20.0231, p= 1.3193e-08
  - First six peaks of the probability-sorted list:
    M= 18.0106, p= 9.9734e-01; M= 20.0149, p= 2.0495e-03; M= 19.0148, p= 3.7991e-04; M= 19.0168, p= 2.2941e-04; M= 21.0211, p= 4.7144e-07; M= 20.0210, p= 8.7389e-08
- Mass Peak Centroiding: Essentially a moving average filter over close peaks, weighted by probability
  - Applied to the six mass-sorted peaks above:
    M= 18.0106, p= 9.9734e-01; M= 19.0156, p= 6.0932e-04; M= 20.0149, p= 2.0496e-03

A Configurable & Scalable IPC Hardware Architecture

- Adapt two-stage procedure to a configurable & scalable hardware architecture capable of converting a stream of independent chemical formula queries into a delimited stream of variable-quantity mass/probability pairs
  - Single Module Handles Stage 1 Functionality
  - Multiple Modules Handle Stage 2 Computation
  - No. of Modules Independent from Input Stream Data and Host
  - Stream Consists of Chemical Formula Query Information and Control Data
  - Result Distributions Returned in Same Order as Received in Input Stream

SED Calculation Reduced to LUTs

- Precompute SEDs Exactly, Pull SEDs from LUTs at Runtime (storing SEDs vs. SADs vs. FCFDs)
- Single Bank of LUTs Feeds All Distribution Calculators
- Sample LUT Address Space for SEDs
- SEDs Presorted by Probability, Filtered at Runtime with Configurable Threshold Prob.
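The LUT idea above can be sketched in software. The following is a minimal illustration, not the authors' hardware design: it precomputes exact SEDs by multinomial expansion of each SAD (Stage 1), stores them presorted by probability, and applies a configurable threshold at lookup time. The function names (`isotope_counts`, `compute_sed`, `build_lut`, `lookup`), the dictionary keyed by (element, atom count), and the small `max_atoms` limit are all assumptions made for this example; the SAD values are the hydrogen and oxygen figures quoted in the slides.

```python
from math import factorial

# SADs as (mass, probability) per isotope; values quoted from the slides.
SADS = {
    "H": [(1.007825, 0.999885), (2.014102, 0.000115)],
    "O": [(15.9949, 0.99757), (16.9991, 0.00038), (17.9992, 0.00205)],
}


def isotope_counts(n_atoms, n_isotopes):
    """Yield every tuple of non-negative isotope counts that sums to n_atoms."""
    if n_isotopes == 1:
        yield (n_atoms,)
        return
    for first in range(n_atoms + 1):
        for rest in isotope_counts(n_atoms - first, n_isotopes - 1):
            yield (first,) + rest


def compute_sed(sad, n_atoms):
    """Exact SED via multinomial expansion, sorted by descending probability."""
    peaks = []
    for counts in isotope_counts(n_atoms, len(sad)):
        # Multinomial coefficient N! / (n0! n1! ...), kept as an exact integer.
        coeff = factorial(n_atoms)
        for c in counts:
            coeff //= factorial(c)
        prob = float(coeff)
        mass = 0.0
        for c, (m, p) in zip(counts, sad):
            prob *= p ** c
            mass += c * m
        peaks.append((mass, prob))
    peaks.sort(key=lambda peak: peak[1], reverse=True)
    return peaks


def build_lut(sads, max_atoms):
    """Offline step: one presorted SED per (element, atom count) entry."""
    return {
        (element, n): compute_sed(sad, n)
        for element, sad in sads.items()
        for n in range(1, max_atoms + 1)
    }


def lookup(lut, element, n_atoms, threshold):
    """Runtime step: fetch a presorted SED and stop at the first peak below threshold."""
    kept = []
    for mass, prob in lut[(element, n_atoms)]:
        if prob < threshold:
            break  # presorted, so every later peak is below the threshold too
        kept.append((mass, prob))
    return kept


LUT = build_lut(SADS, max_atoms=4)  # hypothetical small limit for illustration
H2_SED = lookup(LUT, "H", 2, threshold=1.0e-05)
for mass, prob in H2_SED:
    print(f"M= {mass:.5f}, p= {prob:.4e}")
```

Because each entry is presorted by probability, the runtime filter is a simple early exit rather than a full scan, which is presumably part of what makes presorting attractive for a streaming hardware design.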
[Figure: sample LUT address space for SEDs, addresses 0 to 2047: H1 − H256, C1 − C256, N1 − N256, O1 − O256, S1 − S64, then 60 other elements with 16 SEDs per element]
[Figure labels: In-Stream Control [3]; Token-Based Round Robin Scheduler]

Equation from the IPC Problem Description:
$(E_1^1 + E_2^1 + \cdots)^{N_1} (E_1^2 + E_2^2 + \cdots)^{N_2} (E_1^3 + E_2^3 + \cdots)^{N_3} \cdots$

FCFD: Full Chemical Formula Distribution

SED Iterative Combination in Hardware

- A single-cycle SED combination architecture sized for the worst case is excessively wasteful when processing the common case; employ iterative combination to boost hardware utilization
  - X: No. of Parallel Multipliers and Adders
  - Y: Buffer Depth

ALGORITHM 3. Distribution Calculator Procedure
  WHILE Control ≠ "done"
    SED[1…N] ← FIFO[1…N].pop(), Control ← FIFO[N+1].pop()
    IF Control = "begin"
      PrevItBuff[1…N] ← SED[1…N], PrevItBuff[N+1…Y] ← (-1,0)
      CurrItBuff[1…Y] ← (-1,0)
    IF Control = "middle" or Control = "end"
      WHILE tmp ← PrevItBuff[1…Y].shift() ≠ (-1,0)
        i ← 1
        WHILE i ≤ N and SED[i].prob > 0
          MultAdd[1…X].mass ← SED[i…i+X−1].mass + tmp.mass
          MultAdd[1…X].prob ← SED[i…i+X−1].prob ∗ tmp.prob
          PSort[1…X] ← Sort(Filter(MultAdd[1…X], TP))
          CurrItBuff[1…Y] ← InSort(CurrItBuff[1…Y], PSort[1…X])
          i ← i+X
        END LOOP
      END LOOP
      PrevItBuff[1…Y] ← CurrItBuff[1…Y]
      CurrItBuff[1…Y] ← (-1,0)
    IF Control = "end"
      FinalResBuff[1…Y] ← PrevItBuff[1…Y]
  END LOOP

- Result Reporting Circuitry Operates Independently of Distribution Calculation
- Insert Centroiding Here if so Desired [figure annotation]

Performance Evaluation on Novo-G

- Previously discussed hardware architecture implemented in VHDL and tested on Novo-G [4,5]
  - Initial experiments on single Altera Stratix IV E530 FPGA in GiDEL PROCStar IV board along with an Intel Xeon E5620 CPU for host support
  - Single-device implementation scaled up to a single Novo-G "ps4" compute node, i.e., up to 16 E530s in 4 PROCStar IVs
  - Implications of scaling to multiple compute nodes of Novo-G discussed
- Software baseline: highly optimized, serial C++ code mirroring hardware algorithm
  - Executed on single E5620 core
  - Orders of magnitude faster than code at [6]
- Hardware and software results compared to confirm hardware correctness

Novo-G Annual Growth

- 2009: 96 top-end Stratix-III FPGAs, each with 4.25GB SDRAM
- 2010: 96 more Stratix-III FPGAs, each with 4.25GB SDRAM
- 2011: 96 top-end Stratix-IV FPGAs, each with 8.50GB SDRAM
- 2012: 96 more Stratix-IV FPGAs, each with 8.50GB SDRAM

Single-FPGA Performance

TABLE I. Single-FPGA performance for several parameter configurations.

Config*  N   X  Y    Qi.f Mass  Qi.f Prob  Cen  M   Freq (MHz)  Speedup†/DC  Speedup†/FPGA
1        16  1  128  16.16      1.31       N    12  115         −            72
2        16  1  128  14.8       1.15       N    20  145         −            115
3        16  2  128  14.8       1.15       N    15  145         −            115
4        16  3  128  14.8       1.15       N    10  110         −            87
5        12  2  128  14.8       1.15       N    14  155         −            123
6        16  2  80   14.8       1.15       N    21  150         −            120
7        12  2  80   14.8       1.15       N    22  155         −            127
8        16  1  128  16.16      1.31       Y    11  120         17           186
9        16  1  128  14.12      1.23       Y    13  135         18           236
10       16  1  128  14.8       1.15       Y    16  140         24           384
11       16  1  128  14.12      1.23       Y    13  135         18           236
12       16  2  128  14.12      1.23       Y    10  130         32           325
13       16  3  128  14.12      1.23       Y    6   100         35           214
14       16  2  128  14.12      1.23       Y    10  130         32           325
15       12  2  128  14.12      1.23       Y    10  135         34           338
16       8   2  128  14.12      1.23       Y    11  135         35           381
17       16  2  100  14.12      1.23       Y    13  135         34           444
18       16  2  80   14.12      1.23       Y    14  130         32           454
19       16  2  60   14.12      1.23       Y    17  130         33           566
20       12  2  80   14.12      1.23       Y    15  135         34           516

* N: max peaks/SED, X: number of parallel peak computations per DC, Y: max peaks/output-window, Qi.f: fixed-point word width bits (integer.fractional), Cen: centroiding capability enabled (Yes/No), M: DCs per FPGA.
† w.r.t. C++ software processing $16 \times 2^{20}$ statistically representative queries (based on relative elemental abundance in amino acids) in 821 seconds on a single E5620 core.

Performance Trends for Various IPC Parameter Configurations

- Configurations Bandwidth Limited
  - Computation-bound problem in software becomes I/O-bound in FPGAs
- Reducing Calculation Word Width
  - Reduced Logic Usage & Increased Operating Frequency vs. Reduced Result Precision
- Increasing Parallel Computations per DC
  - Increased Operations per Clock Cycle vs. Increased Logic Usage, Reduced Routability, & Reduced Operating Frequency
- Reducing Distribution Window Width
  - Reduced Logic Usage vs. Reduced Result Exactness
- Suitable "sweet spot," achieving remarkable speedup while ensuring results remain scientifically relevant [table annotation]

Multi-FPGA Performance

TABLE II. Multi-FPGA performance for several node configurations.

Row  Boards/Node  FPGAs/Board  Total FPGAs  Speedup†/FPGA  Total Speedup†
1    1            1            1            517            517
2    1            2            2            516            1031
3    1            3            3            398            1192
4    1            4            4            315            1259
5    2            1            2            516            1033
6    2            2            4            502            2009
7    2            4            8            313            2510
8    4            1            4            492            1968
9    4            4            16           209            3340

† w.r.t. C++ software processing $2^{30}$ statistically representative queries (based on relative elemental abundance) in 52,478 seconds on a single E5620 core.

Multiple FPGA Advantage? [slide annotation]

Performance Trends of "sweet spot" for Various Novo-G Node Configurations

- Increasing PROCStar IVs per Node
  - Available system bandwidth far exceeds the board link bandwidth bottleneck observed with a single board
  - Scalability now limited by CPU resources
  - Novo-G "ps4" nodes have 8 physical cores (16 logical with hyper-threading) vs. max 32 threads for row 9
  - Expect increased scalability with system config. employing more physical cores
- Increasing FPGAs per PROCStar IV
  - Scalability limited by I/O bandwidth
  - PROCStar IV only supports 8-lane, Gen 1 PCIe
  - Expect increased scalability with system config. employing more lanes and/or more recent Gen 3 PCIe standard

Multi-Node Scaling Expectations

- Assuming input queries are pre-partitioned, no communication is required between compute nodes
- Overhead limited to initialization & completion synchronization, so expect performance to scale almost linearly with additional nodes
- We plan to verify these expectations by scaling to multiple compute nodes in Novo-G as future work

Summary & Conclusions

- Presented first FPGA-based Isotope Pattern Calculator
  - Computationally intense subroutine common in de novo PIAs
  - Provides 23 customization parameters for general use
- Discussed parameter tradeoffs & experimentally demonstrated effect on performance
  - Between 72x and 566x speedup† on a single FPGA
  - Up to 1259x speedup† on a single board (4 FPGAs)
  - Up to 3340x speedup† on a single node (16 FPGAs)
  - Wide range of achieved single-node performance: embarrassingly parallel scalability is restricted by real-world system limitations such as insufficient I/O bandwidth and CPU resources
- Can enable use of previously dismissed protein identification algorithms with potentially revolutionary accuracy yet obscene execution time on conventional computing platforms
  - Still much to be done before this is a reality for protein identification

† with respect to a highly optimized, serial C++ IPC implementation

Future Work

- Continue scaling design to multiple nodes of Novo-G
- Integrate FPGA-accelerated IPC into full de novo PIA
  - First integrate with full theoretical spectrum generator
  - Move more of algorithm onto FPGA to lessen bandwidth bottleneck issue
- Explore the possibility of a GPU-accelerated IPC
  - GPU amenable given minor modifications to the algorithm as stated
  - Preliminary design already mapped out, ready for implementation & testing
- Implement non-sorted output option
  - Sorting fundamentally integral to current DC design
  - Non-sorting DC would allow greater parallelization while utilizing fewer resources
  - If sorted distribution not required by targeted PIA, expect much greater performance

Thank You for Listening! Any Questions?

References

[1] A. W. Bell et al., "A HUPO test sample study reveals common problems in mass spectrometry-based proteomics," Nat. Methods, vol. 6, pp. 423-430, 2009.
[2] E. A. Kapp et al., "An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: Sensitivity and specificity analysis," Proteomics, vol. 5, pp. 3475-3490, 2005.
[3] C. Pascoe et al., "Reconfigurable supercomputing with scalable systolic arrays and instream control for wavefront genomics processing," Proc. of Symposium on Application Accelerators in High-Performance Computing (SAAHPC), TN, 2010.
[4] A. George, H. Lam, A. Lawande, C. Pascoe, and G. Stitt, "Novo-G: A View at the HPC Crossroads for Scientific Computing," Proc. of the Int. Conf. on Eng. of Reconf. Sys. and Algs. (ERSA), NV, 2010.
[5] A. George, H. Lam, and G. Stitt, "Novo-G: At the Forefront of Scalable Reconfigurable Computing," IEEE Computing in Sci. & Eng. (CiSE), vol. 13, no. 1, pp. 82-86, Jan/Feb. 2011.
[6] Dirk (2005), Isotopic Pattern Calculator, http://isotopatcalc.sourceforge.net/index.php, File: gips-0.7.tar.gz.