Auto-Tuning Dedispersion for Many-Core
Accelerators
Alessio Sclocco, Henri E. Bal
Jason Hessels, Joeri van Leeuwen
Rob V. van Nieuwpoort
International Parallel & Distributed Processing Symposium (IPDPS) 2014
22nd May 2014
A. Sclocco, H. Bal, J. Hessels, J. van Leeuwen, R. van Nieuwpoort
Auto-Tuning Dedispersion for Many-Core Accelerators
Outline
1. Introduction
2. Dedispersion
3. Auto-Tuning
4. Performance
5. Impact of Auto-Tuning
6. Conclusions
Modern Radio Astronomy
Two trends in modern experimental science:
1. Growing scale of instruments
2. Use of software
Radio astronomy is no exception:
1. Petascale & Exascale instruments
2. Software telescopes
A Big Software Telescope: LOFAR
Pulsars (1/2)
Searching for pulsars is a “hot topic” in radio astronomy
Pulsars (2/2)
Pulsars are extremely difficult to find
Radio Frequency Interference (RFI)
Dispersion, scintillation, scattering
Gravitational interactions
Fast periods
Faint signals
A big multidimensional search space must be explored
An Effect of Distance: Dispersion
$\Delta \approx 4{,}150 \times DM \times \left( \frac{1}{f_i^2} - \frac{1}{f_h^2} \right)$
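As a rough numerical illustration, the sketch below evaluates the delay formula Δ ≈ 4,150 × DM × (1/f_i² − 1/f_h²). It assumes the usual convention for this constant: frequencies in MHz, DM in pc/cm³, and a result in seconds. The function name and the example frequencies are illustrative and not taken from the slides.

```python
def dispersion_delay(dm, f_i_mhz, f_h_mhz):
    """Delay in seconds of frequency f_i relative to the highest frequency f_h.

    Assumes frequencies in MHz and DM in pc/cm^3.
    """
    return 4150.0 * dm * (1.0 / f_i_mhz**2 - 1.0 / f_h_mhz**2)


# Lower frequencies arrive later, and the delay grows linearly with the DM.
for dm in (0.25, 0.50, 0.75, 1.00):
    print(dm, dispersion_delay(dm, 1420.0, 1720.0))
```

The delay is zero at the reference frequency f_h and increases as f_i decreases, which is why low-frequency channels must be shifted the most.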
Contributions
Achieving real-time performance for dedispersion
Using auto-tuning to adapt the algorithm for:
Different many-core platforms
Different observational scenarios
Different number of Dispersion Measures (DMs)
Showing that auto-tuning is necessary to achieve
high performance
Highlighting how optimal configurations are difficult to guess
The Algorithm
$\forall\, sample: \quad \sum_{channel=0}^{nrChannels} input[channel][sample + \Delta]$
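The summation above can be sketched as a sequential, brute-force loop. This is a minimal pure-Python illustration, not the OpenCL kernel from the talk. It assumes that delays are rounded to whole samples, that the highest channel frequency is the reference, and that the input is padded so that `sample + Δ` never runs out of bounds. All names are hypothetical.

```python
def delay_in_samples(dm, f_mhz, f_high_mhz, samples_per_second):
    """Delay of a channel relative to the highest frequency, in whole samples."""
    seconds = 4150.0 * dm * (1.0 / f_mhz**2 - 1.0 / f_high_mhz**2)
    return int(round(seconds * samples_per_second))


def dedisperse(input_data, freqs_mhz, dms, nr_samples, samples_per_second):
    """For every trial DM and every sample, sum the shifted channels."""
    f_high = max(freqs_mhz)
    nr_channels = len(freqs_mhz)
    output = []
    for dm in dms:
        shifts = [delay_in_samples(dm, f, f_high, samples_per_second)
                  for f in freqs_mhz]
        series = []
        for sample in range(nr_samples):
            total = 0.0
            for channel in range(nr_channels):
                total += input_data[channel][sample + shifts[channel]]
            series.append(total)
        output.append(series)
    return output


# Synthetic test signal: one pulse, dispersed with a made-up DM of 10.
freqs = [1400.0, 1500.0, 1600.0, 1700.0]
rate = 20000
true_dm = 10.0
nr_samples = 16
padding = delay_in_samples(true_dm, min(freqs), max(freqs), rate)
data = [[0.0] * (nr_samples + padding) for _ in freqs]
for ch, f in enumerate(freqs):
    data[ch][5 + delay_in_samples(true_dm, f, max(freqs), rate)] = 1.0

result = dedisperse(data, freqs, [0.0, true_dm], nr_samples, rate)
print(result[0][5], result[1][5])
```

With the correct trial DM the four channels line up and add coherently at sample 5, while with DM = 0 only the reference channel contributes.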
An Example of The ∆ Function
[Figure: delay (y-axis) versus frequency in MHz (x-axis, 1424.62 to 1724.54) for DM = 0.25, 0.50, 0.75 and 1.00]
Complexity and Arithmetic Intensity
Complexity:
Single step: O(nrChannels × nrSamples)
Search: O(nrDMs × nrChannels × nrSamples)
Complexity is also increased by the number of different beams of
the observation
Dedispersion is a memory-bound algorithm:
Arithmetic Intensity: $AI = \frac{1}{4 + \epsilon} < \frac{1}{4}$
Data Reuse
[Figure: delay (y-axis) versus frequency in MHz (x-axis, 1640.24 to 1698.82) for DM = 0.25, 0.50, 0.75 and 1.00]
Taking data reuse to the extreme: $AI < \frac{1}{4 \times \left( \frac{1}{nrDMs} + \frac{1}{nrSamples} + \frac{1}{nrChannels} \right)}$
Theoretical bound, impossible to achieve in any realistic scenario
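To get a feeling for how far this theoretical bound lies from the no-reuse limit of 1/4, the sketch below evaluates AI < 1 / (4 × (1/nrDMs + 1/nrSamples + 1/nrChannels)). The input sizes are illustrative values chosen for the example, not figures quoted on this slide.

```python
def ai_reuse_bound(nr_dms, nr_samples, nr_channels):
    """Upper bound on arithmetic intensity with perfect data reuse."""
    return 1.0 / (4.0 * (1.0 / nr_dms + 1.0 / nr_samples + 1.0 / nr_channels))


# Illustrative sizes only (hypothetical): 2,000 DMs, 20,000 samples, 1,024 channels.
print(ai_reuse_bound(2000, 20000, 1024))

# With a single DM there is nothing to reuse, so the bound drops below 1/4.
print(ai_reuse_bound(1, 20000, 1024))
```

The bound grows with the smallest of the three dimensions, which suggests why the number of DMs computed together matters for performance.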
Parallelization
Parallel dedispersion algorithm implemented with OpenCL
Four user-controlled parameters govern the number of DMs
and samples computed per work-group and work-item
The configuration of these parameters influences the
algorithm’s Arithmetic Intensity
We use auto-tuning to find the optimal configuration of these
parameters
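One simple way to picture such a tuner is an exhaustive search: run every candidate configuration, record the achieved performance, and keep the best. The sketch below assumes exhaustive search (the slides do not state the search strategy) and replaces the real compile-and-time step with a made-up scoring function. Both `auto_tune` and `fake_measure` are hypothetical names.

```python
def auto_tune(configurations, measure_gflops):
    """Try every configuration and return (best_configuration, all_results)."""
    results = {}
    for config in configurations:
        results[config] = measure_gflops(config)
    best = max(results, key=results.get)
    return best, results


# Hypothetical stand-in for compiling and timing the OpenCL kernel:
# it simply rewards configurations close to an arbitrary, made-up optimum.
def fake_measure(config):
    dms_per_group, samples_per_group = config
    return 100.0 - abs(dms_per_group - 8) - abs(samples_per_group - 32) / 4.0


candidates = [(d, s) for d in (1, 2, 4, 8, 16) for s in (8, 16, 32, 64)]
best, results = auto_tune(candidates, fake_measure)
print(best, results[best])
```

Keeping all results, not only the winner, also makes it possible to study the distribution of performance over the whole configuration space.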
Auto-Tuning: Parameters
The four user-controlled parameters:
1. Work-items associated with different samples
2. Work-items associated with different DMs
3. Samples processed per work-item
4. DMs processed per work-item
How do these parameters interact?
(1) × (2): work-items per work-group
(3) × (4): registers per work-item
(1) × (3): samples processed per work-group
(2) × (4): DMs processed per work-group (data reuse)
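The four products listed above can be written down directly, and they also suggest how a search space might be pruned before tuning. In the sketch below the limits on work-items and registers, and the requirement that the input divides evenly into work-groups, are assumptions made for illustration. The function names are hypothetical.

```python
from itertools import product


def derived_quantities(wi_samples, wi_dms, samples_per_wi, dms_per_wi):
    """Combine parameters (1)-(4) into the four derived quantities."""
    return {
        "work_items_per_group": wi_samples * wi_dms,       # (1) x (2)
        "registers_per_item": samples_per_wi * dms_per_wi,  # (3) x (4)
        "samples_per_group": wi_samples * samples_per_wi,   # (1) x (3)
        "dms_per_group": wi_dms * dms_per_wi,               # (2) x (4)
    }


def valid_configurations(candidates, max_work_items, max_registers,
                         nr_samples, nr_dms):
    """Yield 4-tuples that respect the (assumed) device and input limits."""
    for config in product(candidates, repeat=4):
        q = derived_quantities(*config)
        if q["work_items_per_group"] > max_work_items:
            continue
        if q["registers_per_item"] > max_registers:
            continue
        if nr_samples % q["samples_per_group"] or nr_dms % q["dms_per_group"]:
            continue
        yield config


print(derived_quantities(32, 8, 5, 4))
print(len(list(valid_configurations([1, 2, 4], 4, 4, 8, 8))))
```

Because the parameters interact through products, changing one of them shifts two derived quantities at once, which is part of what makes good configurations hard to guess.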
Auto-Tuning: Platforms
Platform              CEs       GFLOP/s  GB/s
AMD HD7970            64 × 32   3,788    264
Intel Xeon Phi 5110P  2 × 60    2,022    320
NVIDIA GTX 680        192 × 8   3,090    192
NVIDIA K20            192 × 13  3,519    208
NVIDIA GTX Titan      192 × 14  4,500    288
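For a memory-bound kernel, a simple roofline estimate relates these peak figures to attainable performance: attainable GFLOP/s is the minimum of the compute peak and arithmetic intensity times memory bandwidth. The sketch below applies that standard model to the platforms listed, assuming an arithmetic intensity of 1/4 FLOP per byte (the no-reuse limit). It is an estimate under that assumption, not a measured result.

```python
# (peak GFLOP/s, peak GB/s) as listed for each platform
PLATFORMS = {
    "AMD HD7970": (3788.0, 264.0),
    "Intel Xeon Phi 5110P": (2022.0, 320.0),
    "NVIDIA GTX 680": (3090.0, 192.0),
    "NVIDIA K20": (3519.0, 208.0),
    "NVIDIA GTX Titan": (4500.0, 288.0),
}


def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the roofline model."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


for name, (peak, bandwidth) in PLATFORMS.items():
    print(name, roofline_gflops(peak, bandwidth, 0.25))
```

Without data reuse every platform would be limited to a small fraction of its compute peak, so any gain has to come from raising the arithmetic intensity.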
Auto-Tuning: Scenarios
Apertif
Time resolution: 20 kHz (20,000 samples per second)
Bandwidth: 300 MHz
Channels: 1,024
Frequency range: 1,420 – 1,720 MHz
LOFAR
Time resolution: 200 kHz (200,000 samples per second)
Bandwidth: 6 MHz
Channels: 32
Frequency range: 138 – 145 MHz
We also tune the algorithm for a varying number of DMs
Auto-Tuning: Work-Items (Apertif)
[Figure: tuned number of work-items (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Auto-Tuning: Work-Items (LOFAR)
[Figure: tuned number of work-items (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Auto-Tuning: Registers (Apertif)
[Figure: tuned number of registers (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Auto-Tuning: Registers (LOFAR)
[Figure: tuned number of registers (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Analysis
The algorithm adapts by exploiting the characteristics of both
platforms and scenarios
Platform              Apertif        LOFAR
AMD HD7970            32, 8, 5, 4    64, 4, 25, 1
Intel Xeon Phi 5110P  16, 1, 10, 8   16, 2, 10, 1
NVIDIA GTX 680        32, 32, 25, 2  250, 4, 20, 2
NVIDIA K20            32, 16, 25, 4  160, 4, 25, 2
NVIDIA GTX Titan      32, 16, 25, 4  160, 4, 25, 2
Performance (Apertif)
[Figure: achieved GFLOP/s (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan, with the real-time threshold marked]
Performance (LOFAR)
[Figure: achieved GFLOP/s (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan, with the real-time threshold marked]
Speedup over CPU (Intel Xeon E5-2620)
[Figure: speedup over the CPU (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Building Apertif
Apertif operational requirements:
DMs: 2,000
Beams: 450
Number of devices to build Apertif:
CPUs: 1,800
GPUs: 50
Using many-cores and an auto-tuned algorithm, it will be feasible to
build a real-time transient-searching instrument for Apertif
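The scale of this requirement can be estimated with simple arithmetic. The sketch below assumes one floating-point operation per (DM, channel, sample) triple, and 1,024 channels at 20,000 samples per second for the Apertif scenario. The sustained per-device rate passed to `devices_needed` is a hypothetical input, not a measured value. All function names are illustrative.

```python
import math


def required_gflops(nr_dms, nr_channels, samples_per_second, nr_beams):
    """GFLOP/s needed to dedisperse all beams in real time (1 FLOP per element)."""
    return nr_dms * nr_channels * samples_per_second * nr_beams / 1e9


def devices_needed(total_gflops, sustained_gflops_per_device):
    """Number of devices required at a given sustained rate per device."""
    return math.ceil(total_gflops / sustained_gflops_per_device)


total = required_gflops(nr_dms=2000, nr_channels=1024,
                        samples_per_second=20000, nr_beams=450)
print(total)                          # aggregate requirement in GFLOP/s
print(devices_needed(total, 400.0))   # 400 GFLOP/s is a hypothetical rate
print(devices_needed(total, 10.0))    # 10 GFLOP/s is a hypothetical rate
```

Under these assumptions the aggregate requirement is on the order of tens of TFLOP/s, so the sustained rate per device directly determines whether tens or thousands of devices are needed.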
Signal-to-Noise Ratio (Apertif)
[Figure: SNR (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Signal-to-Noise Ratio (LOFAR)
[Figure: SNR (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
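One plausible reading of the signal-to-noise ratio in this context is the distance of the best configuration from the average, measured in standard deviations of the performance distribution. The slides do not spell out the definition, so the sketch below is an assumption. The function name and the sample numbers are made up.

```python
import statistics


def optimum_snr(gflops_per_configuration):
    """(best - mean) / standard deviation over all tried configurations."""
    best = max(gflops_per_configuration)
    mean = statistics.mean(gflops_per_configuration)
    std = statistics.pstdev(gflops_per_configuration)
    if std == 0:
        return 0.0  # all configurations perform identically
    return (best - mean) / std


# Made-up measurements: most configurations are mediocre, one stands out.
measurements = [40.0, 45.0, 50.0, 55.0, 60.0, 200.0]
print(optimum_snr(measurements))
```

A high value under this definition means the optimum is an outlier, so a configuration picked at random is unlikely to come close to it.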
Example of Auto-Tuning Histogram
[Figure: histogram of the number of configurations (y-axis) per achieved GFLOP/s (x-axis, 0 to 400) on the HD7970, with the average marked]
Speedup over Best Fixed Configuration (Apertif)
[Figure: speedup (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Speedup over Best Fixed Configuration (LOFAR)
[Figure: speedup (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
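A comparison of this kind can be sketched as follows, assuming that the "best fixed configuration" is the single configuration with the highest total GFLOP/s across all input instances. That definition is an assumption, since the slides do not state it. The configuration labels and performance numbers below are made up.

```python
def best_fixed_configuration(gflops):
    """gflops[config][instance] -> GFLOP/s; pick the config with the highest total."""
    return max(gflops, key=lambda config: sum(gflops[config].values()))


def speedup_over_best_fixed(gflops):
    """Per-instance ratio of the tuned optimum to the best fixed configuration."""
    fixed = best_fixed_configuration(gflops)
    return {
        instance: max(gflops[config][instance] for config in gflops)
        / gflops[fixed][instance]
        for instance in gflops[fixed]
    }


# Made-up measurements for two hypothetical configurations and two DM counts.
table = {
    "config_a": {2: 10.0, 1024: 100.0},
    "config_b": {2: 20.0, 1024: 60.0},
}
print(best_fixed_configuration(table))
print(speedup_over_best_fixed(table))
```

By construction the ratio is never below 1, and it is largest for the instances where the fixed configuration is a poor fit.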
Speedup over Traditional Optimization (Apertif)
[Figure: speedup (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Speedup over Traditional Optimization (LOFAR)
[Figure: speedup (y-axis) versus number of Dispersion Measures (x-axis, 1 to 8,192) for the HD7970, Xeon Phi, GTX 680, K20 and GTX Titan]
Conclusions
Many-cores can be used to accelerate a memory-bound
algorithm like dedispersion
Auto-tuning:
makes it possible to exploit data reuse, and thus to achieve higher
performance
allows algorithms to adapt to different platforms and scenarios
The impact that auto-tuning has on dedispersion’s
performance is significant
Guessing a good configuration without auto-tuning is difficult