Auto-Tuning Dedispersion for Many-Core Accelerators

Alessio Sclocco, Henri E. Bal, Jason Hessels, Joeri van Leeuwen, Rob V. van Nieuwpoort
International Parallel & Distributed Processing Symposium (IPDPS) 2014
22nd May 2014

A. Sclocco, H. Bal, J. Hessels, J. van Leeuwen, R. van Nieuwpoort, Auto-Tuning Dedispersion for Many-Core Accelerators

Outline
  1 Introduction
  2 Dedispersion
  3 Auto-Tuning
  4 Performance
  5 Impact of Auto-Tuning
  6 Conclusions

Modern Radio Astronomy
  Two trends in modern experimental science:
    1 Growing scale of instruments
    2 Use of software
  Radio astronomy is no exception:
    1 Petascale & Exascale instruments
    2 Software telescopes

A Big Software Telescope: LOFAR

Pulsars (1/2)
  Searching for pulsars is a “hot topic” in radio astronomy

Pulsars (2/2)
  Pulsars are extremely difficult to find:
    Radio Frequency Interference (RFI)
    Dispersion, scintillation, scattering
    Gravitational interactions
    Fast periods
    Faint signals
  A big multidimensional search space must be explored

An Effect of Distance: Dispersion
  $$\Delta \approx 4{,}150 \times DM \times \left( \frac{1}{f_i^2} - \frac{1}{f_h^2} \right)$$

Contributions
  Achieving real-time performance for dedispersion
  Using auto-tuning to adapt the algorithm for:
    Different many-core platforms
    Different observational scenarios
    Different numbers of Dispersion Measures (DMs)
  Showing that auto-tuning is necessary to achieve high performance
  Highlighting how optimal configurations are difficult to guess

The Algorithm
  $$\forall\, sample: \quad \sum_{channel=0}^{nrChannels} input[channel][sample + \Delta]$$

An Example of the ∆ Function
  [Figure: delay as a function of frequency (MHz), 1424.62 to 1724.54 MHz, for DM = 1.00, 0.75, 0.50, 0.25]

Complexity and Arithmetic Intensity
  Complexity:
    Single step: O(nrChannels × nrSamples)
    Search: O(nrDMs × nrChannels × nrSamples)
  Complexity is also increased by the number of different beams of the observation
  Dedispersion is a memory-bound algorithm:
    Arithmetic Intensity: $$AI = \frac{1}{4 + \epsilon} < \frac{1}{4}$$

Data Reuse
  [Figure: delay as a function of frequency (MHz), 1640.24 to 1698.82 MHz, for DM = 1.00, 0.75, 0.50, 0.25]
  Taking data reuse to the extreme:
    $$AI < \frac{1}{4 \times \left( \frac{1}{nrDMs} + \frac{1}{nrSamples} + \frac{1}{nrChannels} \right)}$$
  Theoretical bound, impossible to achieve in any realistic scenario
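The delay formula from the Dispersion slide, the summation from the Algorithm slide, and the bound from the Data Reuse slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' OpenCL kernel. It assumes frequencies in MHz, a delay in seconds that is rounded to the nearest sample, and an input buffer long enough to cover the largest shift. The function names and the `synthetic_pulse` helper are hypothetical.

```python
import numpy as np


def delay_seconds(dm, f_channel_mhz, f_high_mhz):
    """Dispersion delay: 4150 * DM * (1/f_i^2 - 1/f_h^2), frequencies in MHz."""
    return 4150.0 * dm * (1.0 / f_channel_mhz**2 - 1.0 / f_high_mhz**2)


def dedisperse(data, freqs_mhz, dm, samples_per_second, nr_out):
    """Sum all channels for one trial DM, shifting each channel by its delay.

    data has shape (nr_channels, nr_in); nr_in must be at least nr_out plus
    the largest shift, otherwise the slice below comes up short.
    Rounding the delay to whole samples is an assumption of this sketch.
    """
    f_high = float(np.max(freqs_mhz))
    out = np.zeros(nr_out, dtype=np.float64)
    for channel, f in enumerate(freqs_mhz):
        shift = int(round(delay_seconds(dm, f, f_high) * samples_per_second))
        out += data[channel, shift:shift + nr_out]
    return out


def reuse_bound(nr_dms, nr_samples, nr_channels):
    """Upper bound on arithmetic intensity with perfect data reuse."""
    return 1.0 / (4.0 * (1.0 / nr_dms + 1.0 / nr_samples + 1.0 / nr_channels))


def synthetic_pulse(freqs_mhz, dm, samples_per_second, t0, nr_in):
    """Hypothetical test input: one unit pulse per channel, delayed by dispersion."""
    f_high = float(np.max(freqs_mhz))
    data = np.zeros((len(freqs_mhz), nr_in))
    for channel, f in enumerate(freqs_mhz):
        shift = int(round(delay_seconds(dm, f, f_high) * samples_per_second))
        data[channel, t0 + shift] = 1.0
    return data
```

Dedispersing the synthetic input with the DM used to generate it lines the per-channel pulses up in a single output sample, while a wrong trial DM leaves them smeared across time. A search repeats `dedisperse` for every trial DM, which is where the nrDMs factor in the complexity comes from.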

Parallelization
  Parallel dedispersion algorithm implemented with OpenCL
  Four user-controlled parameters govern the number of DMs and samples computed per work-group and work-item
  The configuration of these parameters influences the algorithm’s Arithmetic Intensity
  We use auto-tuning to find the optimal configuration of these parameters

Auto-Tuning: Parameters
  The four user-controlled parameters:
    1 Work-items associated with different samples
    2 Work-items associated with different DMs
    3 Samples processed per work-item
    4 DMs processed per work-item
  How do these parameters interact?
    (1) × (2): work-items per work-group
    (3) × (4): registers per work-item
    (1) × (3): samples processed per work-group
    (2) × (4): DMs processed per work-group (data reuse)

Auto-Tuning: Platforms
  Platform               CEs        GFLOP/s   GB/s
  AMD HD7970             64 × 32    3,788     264
  Intel Xeon Phi 5110P   2 × 60     2,022     320
  NVIDIA GTX 680         192 × 8    3,090     192
  NVIDIA K20             192 × 13   3,519     208
  NVIDIA GTX Titan       192 × 14   4,500     288

Auto-Tuning: Scenarios
  Apertif
    Time resolution: 20 kHz
    Bandwidth: 300 MHz
    Channels: 1,204
    Frequency range: 1,420 – 1,720 MHz
  LOFAR
    Time resolution: 200 kHz
    Bandwidth: 6 MHz
    Channels: 32
    Frequency range: 138 – 145 MHz
  We also tune the algorithm for a varying number of DMs
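The parameter interactions listed on the Auto-Tuning: Parameters slide can be written down as a small search-space sketch. Everything here is illustrative: the `Config` class, the candidate values, the resource limits, and the `measure` callback (standing in for compiling and timing a kernel) are hypothetical names, not part of the authors' tuner.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    items_samples: int     # (1) work-items associated with different samples
    items_dms: int         # (2) work-items associated with different DMs
    samples_per_item: int  # (3) samples processed per work-item
    dms_per_item: int      # (4) DMs processed per work-item

    @property
    def work_items_per_group(self):
        return self.items_samples * self.items_dms        # (1) x (2)

    @property
    def registers_per_item(self):
        return self.samples_per_item * self.dms_per_item  # (3) x (4)

    @property
    def samples_per_group(self):
        return self.items_samples * self.samples_per_item  # (1) x (3)

    @property
    def dms_per_group(self):
        return self.items_dms * self.dms_per_item          # (2) x (4)


def candidate_configs(nr_samples, nr_dms, max_work_items, max_registers, values):
    """Yield configurations that respect the (assumed) platform limits
    and evenly divide the problem size."""
    for combo in product(values, repeat=4):
        cfg = Config(*combo)
        if cfg.work_items_per_group > max_work_items:
            continue
        if cfg.registers_per_item > max_registers:
            continue
        if nr_samples % cfg.samples_per_group or nr_dms % cfg.dms_per_group:
            continue
        yield cfg


def tune(configs, measure):
    """Return the configuration with the highest measured score."""
    return max(configs, key=measure)
```

In a real tuner, `measure` would run the kernel for each candidate and return the achieved GFLOP/s. The exhaustive loop is the simplest possible strategy and is used here only to make the size of the search space concrete.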

Auto-Tuning: Work-Items (Apertif)
  [Figure: work-items per work-group as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Auto-Tuning: Work-Items (LOFAR)
  [Figure: work-items per work-group as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Auto-Tuning: Registers (Apertif)
  [Figure: registers as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Auto-Tuning: Registers (LOFAR)
  [Figure: registers as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Analysis
  The algorithm adapts by exploiting the characteristics of both platforms and scenarios

  Platform               Apertif          LOFAR
  AMD HD7970             32, 8, 5, 4      64, 4, 25, 1
  Intel Xeon Phi 5110P   16, 1, 10, 8     16, 2, 10, 1
  NVIDIA GTX 680         32, 32, 25, 2    250, 4, 20, 2
  NVIDIA K20             32, 16, 25, 4    160, 4, 25, 2
  NVIDIA GTX Titan       32, 16, 25, 4    160, 4, 25, 2
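To read the Analysis table against the work-item and register plots, the products from the Parameters slide can be computed for each tuned configuration. The sketch below assumes that the four numbers in each table cell follow the order of the Parameters slide, (1) to (4). The slides do not state this order, so the derived values should be treated as illustrative.

```python
# Tuned configurations copied from the Analysis table.
# ASSUMPTION: each tuple is ordered as parameters (1), (2), (3), (4).
TUNED = {
    "AMD HD7970": {"Apertif": (32, 8, 5, 4), "LOFAR": (64, 4, 25, 1)},
    "Intel Xeon Phi 5110P": {"Apertif": (16, 1, 10, 8), "LOFAR": (16, 2, 10, 1)},
    "NVIDIA GTX 680": {"Apertif": (32, 32, 25, 2), "LOFAR": (250, 4, 20, 2)},
    "NVIDIA K20": {"Apertif": (32, 16, 25, 4), "LOFAR": (160, 4, 25, 2)},
    "NVIDIA GTX Titan": {"Apertif": (32, 16, 25, 4), "LOFAR": (160, 4, 25, 2)},
}


def summarize(config):
    """Derive the four products listed on the Parameters slide."""
    items_samples, items_dms, samples_per_item, dms_per_item = config
    return {
        "work_items_per_group": items_samples * items_dms,
        "registers_per_item": samples_per_item * dms_per_item,
        "samples_per_group": items_samples * samples_per_item,
        "dms_per_group": items_dms * dms_per_item,
    }


def report():
    """Print one line per platform and scenario."""
    for platform, scenarios in TUNED.items():
        for scenario, config in scenarios.items():
            print(platform, scenario, summarize(config))
```

Calling `report()` prints the derived quantities for all ten table cells, which makes it easier to compare platforms on data reuse (DMs per work-group) rather than on the raw parameter values.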

Performance (Apertif)
  [Figure: GFLOP/s as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan, real-time]

Performance (LOFAR)
  [Figure: GFLOP/s as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan, real-time]

Speedup over CPU (Intel Xeon E5-2620)
  [Figure: speedup as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Building Apertif
  Apertif operational requirements:
    DMs: 2,000
    Beams: 450
  Number of devices to build Apertif:
    CPUs: 1,800
    GPUs: 50
  Using many-cores and an auto-tuned algorithm, it will be feasible to build a real-time transient search instrument for Apertif

Signal-to-Noise Ratio (Apertif)
  [Figure: SNR as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Signal-to-Noise Ratio (LOFAR)
  [Figure: SNR as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]
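The device counts on the Building Apertif slide imply a few simple ratios, worked out below. The numbers are taken from the slide; the function name and the returned keys are hypothetical, and the sketch says nothing about how beams would actually be scheduled onto devices.

```python
def device_ratios(beams, cpus, gpus):
    """Derive per-beam and per-device ratios from the slide's totals."""
    return {
        "cpus_per_beam": cpus / beams,      # CPUs needed to process one beam
        "beams_per_gpu": beams / gpus,      # beams one GPU would handle
        "cpus_replaced_per_gpu": cpus / gpus,
    }


# Values from the Building Apertif slide: 450 beams, 1,800 CPUs, 50 GPUs.
APERTIF = device_ratios(beams=450, cpus=1800, gpus=50)
```

With the slide's figures, the CPU-based system needs several CPUs per beam, while each GPU covers several beams, which is the practical argument behind the feasibility claim.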

Example of Auto-Tuning Histogram
  [Figure: histogram of the number of configurations by achieved GFLOP/s (0 to 400); series: HD7970, Average]

Speedup over Best Fixed Configuration (Apertif)
  [Figure: speedup as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Speedup over Best Fixed Configuration (LOFAR)
  [Figure: speedup as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Speedup over Traditional Optimization (Apertif)
  [Figure: speedup as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Speedup over Traditional Optimization (LOFAR)
  [Figure: speedup as a function of the number of Dispersion Measures (1 to 8192); series: HD7970, Xeon Phi, GTX 680, K20, GTX Titan]

Conclusions
  Many-cores can be used to accelerate a memory-bound algorithm like dedispersion
  Auto-tuning:
    makes it possible to exploit data reuse, and thus to achieve higher performance
    allows algorithms to adapt to different platforms and scenarios
  The impact of auto-tuning on dedispersion’s performance is significant
  Guessing a good configuration without auto-tuning is difficult