Performance of Applications Using Dual-Rail InfiniBand 3D Torus Network on the Gordon Supercomputer
Dongju Choi, Glenn Lockwood, Robert Sinkovits,
Mahidhar Tatineni
San Diego Supercomputer Center
University of California, San Diego
Background
• SDSC's data-intensive supercomputer Gordon:
• 1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB DDR3-1333 memory
• 16 cores per node and 16 nodes (256 cores) per switch
• Large I/O nodes and local/global SSD storage
• Dual-rail QDR InfiniBand network supports I/O and compute communication separately.
• The second rail can also be scheduled for computation.
• We are interested in the effects of switch-to-switch communication oversubscription and switch/node topology on application performance.
Gordon System Architecture
[Figure: 3-D torus of switches on Gordon]
[Figure: Subrack-level network architecture on Gordon]
MVAPICH2 MPI Implementation
• MVAPICH2 versions 1.9 and 2.0 are currently available on the Gordon system
• Full control of dual-rail usage at the task level via user-settable environment variables:
• MV2_NUM_HCAS=2
• MV2_IBA_HCA=mlx4_0:mlx4_1
• MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8000: rail-sharing (striping) threshold, can be set as low as 8 KB
• MV2_SM_SCHEDULING=ROUND_ROBIN: explicitly distribute tasks over the rails
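
As a concrete illustration, a minimal sketch of how these variables might be passed to a launch (not the actual Gordon job scripts; the host file, task count, and executable below are hypothetical placeholders; MVAPICH2's mpirun_rsh accepts NAME=VALUE pairs on its command line):

    # Minimal sketch: launch an MPI executable under MVAPICH2 with the
    # dual-rail settings listed above. Host file, task count, and executable
    # are hypothetical placeholders, not values from the Gordon runs.
    import subprocess

    dual_rail_env = {
        "MV2_NUM_HCAS": "2",                             # use both HCAs
        "MV2_IBA_HCA": "mlx4_0:mlx4_1",                  # name the two rails explicitly
        "MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD": "8000",  # stripe messages above ~8 KB
        "MV2_SM_SCHEDULING": "ROUND_ROBIN",              # distribute tasks over the rails
    }

    cmd = (["mpirun_rsh", "-np", "256", "-hostfile", "hosts.txt"]
           + [f"{k}={v}" for k, v in dual_rail_env.items()]  # mpirun_rsh takes NAME=VALUE pairs
           + ["./my_mpi_app"])
    subprocess.run(cmd, check=True)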
OSU Micro-Benchmarks
• Compare the performance of single- and dual-rail QDR InfiniBand with FDR InfiniBand: evaluate the impact of rail-sharing, scheduling, and threshold parameters
• Bandwidth tests
• Latency tests
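
The osu_bw and osu_latency executables print '#'-prefixed header lines followed by two columns (message size, then MB/s or microseconds). Below is a small post-processing sketch of the kind of comparison shown on the following slides, assuming the output of a single-rail and a dual-rail run has been saved to the hypothetical files named here:

    # Sketch: compare two saved osu_bw outputs and flag where dual-rail QDR
    # pulls ahead. File names are hypothetical; osu_bw prints '#' header lines
    # followed by "<message size> <bandwidth in MB/s>" pairs.
    def read_osu(path):
        results = {}
        with open(path) as fh:
            for line in fh:
                if line.startswith("#") or not line.strip():
                    continue
                size, value = line.split()
                results[int(size)] = float(value)
        return results

    single = read_osu("osu_bw_single_rail_qdr.txt")
    dual = read_osu("osu_bw_dual_rail_qdr.txt")
    for size in sorted(single):
        flag = "  <-- dual rail ahead" if dual.get(size, 0.0) > single[size] else ""
        print(f"{size:>9} B  single {single[size]:9.1f} MB/s  dual {dual.get(size, 0.0):9.1f} MB/s{flag}")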
OSU Bandwidth Test Results for Single Rail QDR,
FDR, and Dual-Rail QDR Network Configurations
- Single-rail FDR performance is much better than single-rail QDR for message sizes larger than 4 KB
- Dual-rail QDR performance exceeds FDR performance at sizes greater than 32 KB
- FDR shows better performance between 4 KB and 32 KB because messages below the rail-sharing threshold are not striped across both rails
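
For context, a back-of-envelope check using nominal InfiniBand link rates (standard published figures, not Gordon measurements) shows why striped dual-rail QDR can overtake single-rail FDR once messages are large enough to use both rails:

    # Back-of-envelope peak unidirectional bandwidths from nominal link rates
    # (published InfiniBand figures, not measured Gordon numbers).
    lanes = 4                               # 4X links
    qdr = 10.0 * (8 / 10) * lanes / 8       # 10 Gbit/s/lane, 8b/10b encoding -> ~4.0 GB/s per rail
    fdr = 14.0625 * (64 / 66) * lanes / 8   # 14.0625 Gbit/s/lane, 64b/66b encoding -> ~6.8 GB/s
    print(f"QDR ~{qdr:.1f} GB/s, dual-rail QDR ~{2 * qdr:.1f} GB/s, FDR ~{fdr:.1f} GB/s")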
OSU Bandwidth Test Performance with MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8K
- Lowering the rail-sharing threshold bridges the dual-rail QDR/FDR performance gap down to 8 KB message sizes.
OSU Bandwidth Test Performance with MV2_SM_SCHEDULING=ROUND_ROBIN
- The explicit round-robin option makes tasks communicate over different rails.
OSU Latency Benchmark Results for QDR, Dual-Rail QDR with MVAPICH2 Defaults, FDR
- There is no latency penalty at small message sizes (expected, as only one rail is active below the striping threshold).
- Above the striping threshold a minor increase in latency is observed, but the performance is still better than single-rail FDR.
OSU Latency Benchmark Results for QDR, Dual-Rail QDR with Round Robin Option, FDR
- Distributing messages across HCAs using the round-robin option increases the latency at small message sizes.
- Again, the latency results are better than the FDR case.
Application Performance Benchmarks
• Applications
• P3DFFT Benchmark
• LAMMPS Water Box Benchmark
• AMBER Cellulose Benchmark
• Test Configuration
• Single Rail vs. Dual Rail
• Multiple-switch runs with maximum hops = 1 or no hop limit for the 512-core runs (two switches are involved)
P3DFFT Benchmark
• Parallel Three-Dimensional Fast Fourier Transforms
• Used for studies of turbulence, climatology, astrophysics, and materials science
• Depends strongly on the available bandwidth as the main
communication component is driven by transposes of
large arrays (alltoallv)
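
The communication pattern behind those transposes is an all-to-all exchange of sub-blocks. The mpi4py sketch below is illustrative only (equal-sized blocks via Alltoall rather than P3DFFT's alltoallv, and a toy problem size); it is not the P3DFFT implementation, just the pattern that makes the benchmark bandwidth-bound:

    # Minimal sketch of the distributed-transpose pattern behind P3DFFT's
    # alltoallv phase. Row-distributed N x N array with A[i, j] = i*N + j.
    # Run with e.g.: mpirun -np 4 python transpose_pattern_sketch.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    me, size = comm.Get_rank(), comm.Get_size()
    n = 4              # local rows per rank (toy size for illustration)
    N = n * size       # global array is N x N

    # This rank owns rows [me*n, (me+1)*n) of A.
    local = np.arange(me * n * N, (me + 1) * n * N, dtype=np.float64).reshape(n, N)

    # Pack one n x n block per destination rank, then exchange all blocks.
    send = np.ascontiguousarray(local.reshape(n, size, n).swapaxes(0, 1))
    recv = np.empty_like(send)
    comm.Alltoall(send, recv)  # the bandwidth-bound step (alltoallv in P3DFFT)

    # Reassemble: this rank now owns rows [me*n, (me+1)*n) of the transpose.
    transposed = recv.transpose(2, 0, 1).reshape(n, N)

    # Check against a locally rebuilt reference (fine at this toy size).
    A = np.arange(N * N, dtype=np.float64).reshape(N, N)
    assert np.allclose(transposed, A.T[me * n:(me + 1) * n, :])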
Simulation Results for P3DFFT Benchmark with 256 Cores, QDR and Dual-Rail QDR

Run #   QDR Wallclock Time (s)   Dual-Rail QDR Wallclock Time (s)
1       992                      761
2       985                      760
3       991                      766
4       993                      759

- Dual-rail runs are consistently faster than the single-rail runs, with an average performance gain of 23%.
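
The quoted 23% average gain follows directly from the four runs in the table; as a quick check:

    # Quick check of the ~23% average gain using the wallclock times above.
    single_rail = [992, 985, 991, 993]
    dual_rail = [761, 760, 766, 759]
    gain = 1 - sum(dual_rail) / sum(single_rail)
    print(f"average reduction in wallclock time: {gain:.1%}")  # ~23%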
Communication and Compute Time Breakdown for 256-Core, Single/Dual-Rail QDR P3DFFT Runs

Single Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       992              539              453
2       985              535              450
3       991              539              452
4       993              543              450

Dual Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       761              302              459
2       760              301              459
3       766              308              458
4       759              300              459

- Compute part is nearly identical in both sets of runs
- Performance improvement is almost entirely in the communication part of the code
- Shows that dual rail boosts the alltoallv performance and consequently speeds up the overall calculation
Communication and Compute Time Breakdown for 512-Core, Single/Dual-Rail QDR P3DFFT Runs, Maximum Switch Hops = 1

Single Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       802              592              210
2       802              592              210
3       804              594              210
4       803              592              211

Dual Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       537              322              215
2       538              322              216
3       538              322              216
4       538              322              216
- Shows similar dual-rail benefits
- The run spans fewer switches/links, reducing the likelihood of oversubscription due to other jobs
- It can also increase the likelihood of oversubscription because fewer switch-to-switch connections are available
P3DFFT Benchmark with 512 Cores, Single Rail QDR, No Switch Hop Restriction

Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       717              506              211
2       732              525              207
3       789              580              209
4       726              518              208
5       825              615              210
6       697              488              209
- Oversubscription is mitigated by the topology of the run, and the performance is nearly 15% better than the single-hop case. However, as seen from the results, a different topology may also lead to lower performance if the distribution is not optimal (whether through oversubscription by the job itself or by other jobs).
P3DFFT Benchmark with 512 Cores, Single Rail QDR, No Switch Hop Restriction (continued)
- Spreading the computation over several switches lowers the bandwidth requirements on a given set of switch-to-switch links
- This is bad for latency-bound codes (given the extra switch hops) but can benefit bandwidth-sensitive codes, depending on the topology of the run
- Nukada et al. utilize dynamic links to minimize congestion and achieve better performance in the dual-rail case

Nukada, A., Sato, K., and Matsuoka, S. 2012. Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 44, 10 pages.
Communication and Compute Time Breakdown for 1024-Core P3DFFT Runs

Single Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       404              307              97
2       408              310              98

Dual Rail Runs
Run #   Total Time (s)   Comm. Time (s)   Compute Time (s)
1       332              232              100
2       325              226              99
- No switch hop restrictions are placed on the runs.
- The communication time is greatly improved in the dual-rail cases, while the compute fraction is nearly identical in all the runs.
LAMMPS Water Box Benchmark
• Large-scale Atomic/Molecular Massively Parallel Simulator
(LAMMPS) is a widely used classical molecular dynamics
code.
• The input contains 12,000 water molecules (36,000 atoms).
• The simulation is run for 20 picoseconds.
LAMMPS Water Box Benchmark with Single/Dual-Rail QDR and 256 Cores

Run #   QDR Wallclock Time (s)   Dual-Rail QDR Wallclock Time (s)
1       57                       46
2       57                       46
3       58                       46
4       57                       46
- Dual-rail runs show better performance than the single-rail runs and mitigate the communication overhead, with an average improvement of 32% in the wallclock time used.
LAMMPS Water Box Benchmark with Single/Dual-Rail QDR and 512 Cores

Run #   Single Rail QDR, MAX_HOP=1   Single Rail QDR, no MAX_HOP limit   Dual Rail QDR
        Wallclock Time (s)           Wallclock Time (s)                  Wallclock Time (s)
1       69                           71                                  47
2       69                           70                                  47
3       70                           281                                 48
4       69                           450                                 47
- The application is not scaling due to the larger communication overhead (a consequence of the finer level of domain decomposition)
- The LAMMPS benchmark is very sensitive to topology and shows large variations if the maximum number of switch hops is not restricted
AMBER Cellulose Benchmark
• Amber is a package of programs for molecular dynamics
simulations of proteins and nucleic acids.
• 408,609 atoms are used for the tests.
Amber Cellulose Benchmark with Single/Dual-Rail QDR and 256 Cores

Run #   Single Rail QDR Wallclock Time (s)   Dual Rail QDR Wallclock Time (s)
1       218                                  212
2       219                                  213
3       218                                  212
4       219                                  212

- Communication overhead is low and the dual-rail benefit is minor (<3%)
Amber Cellulose Benchmark with Single/Dual-Rail QDR and 512 Cores

Run #   Single Rail QDR, MAX_HOP=1   Single Rail QDR, no MAX_HOP limit   Dual Rail QDR
        Wallclock Time (s)           Wallclock Time (s)                  Wallclock Time (s)
1       204                          332                                 168
2       202                          331                                 168
3       202                          396                                 168
4       202                          373                                 167
- There is a modest benefit (<5%) in the single-rail QDR runs
- Communication overhead increases with core count, leading to the drop-off in scaling; this can be mitigated with dual-rail QDR
- Dual-rail QDR performance is better by 17%
Amber Cellulose Benchmark with Single/Dual-Rail QDR and 512 Cores (continued)
- Dual rail enables the benchmark to scale to a higher core count
- The runs show sensitivity to the topology due to the larger number of switch hops and possible contention from other jobs
Summary
• The aggregate bandwidth obtained with dual-rail QDR exceeds the FDR performance.
• Applications show performance benefits from dual-rail QDR configurations.
• Gordon's 3-D torus of switches leads to variability in performance due to oversubscription/topology considerations.
• The switch topology of a run can be configured to mitigate the link-oversubscription bottleneck.
Summary
• Performance improvement also varies based on the
degree of communication overhead.
• Benchmark cases with larger communication fractions (with respect to
overall run time) show more improvement with dual rail QDR
configurations.
• Computational time scaled with the core count in both single- and dual-rail configurations for the benchmarked applications LAMMPS and Amber.
Acknowledgements
• This work was supported by NSF grant OCI #0910847, "Gordon: A Data Intensive Supercomputer."