Performance Benefits on HPCx from Power5 Chips and SMT
HPCx User Group Meeting
28 June 2006
Alan Gray
EPCC, University of Edinburgh
Contents
• Introduction and System Overview
• Benchmark Results
  – Synthetic
  – Applications
• Simultaneous Multithreading
• Conclusions
Introduction and System Overview
Introduction
• HPCx underwent an upgrade in November 2005 from Power4 to Power5 technology.
• New system features Simultaneous Multithreading (SMT)
• We will compare the new and old systems via benchmark results, both synthetic and involving real applications representing typical use of the system.
• The use of SMT is also investigated
• Also included for comparison are results from EPCC's Blue Gene/L system.
Systems for comparison
• Previous HPCx (Phase 2): 50 IBM e-Server p690+ nodes
– SMP cluster, 32 Power4 1.7 GHz processors per node
– 32 GB of RAM per node
– Federation interconnect
– 6.2 TFLOP/s Linpack
• HPCx (Phase 2a): 96 IBM e-Server p575 nodes
– SMP cluster, 16 Power5 1.5 GHz processors per node
– Power5 has an improved memory architecture over Power4
– 32 GB of RAM per node (twice as much per processor as Phase 2)
– Federation interconnect (same as Phase 2)
– 7.4 TFLOP/s Linpack, No. 46 on the Top500
• BlueSky: Single e-Server Blue Gene frame
– 1024 dual-core chips, 2048 PowerPC 440 processors, 700 MHz
– 512 MB of RAM per chip (distributed memory system), shared between the two cores
– 4.7 TFLOP/s Linpack, joint No. 73 on the Top500
Benchmark Results: Synthetic
Synthetic benchmarks: Intel MPI suite
• Ping Pong benchmark – 2 processes communicate, either over the switch or within a node via shared memory (a minimal sketch follows)
• Switch communication: insignificant difference (not surprising – same switch)
• Intra-node communication: Phase 2a has better asymptotic bandwidth but slightly higher latency
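For reference, a minimal ping-pong sketch in C/MPI (illustrative only, not the Intel MPI Benchmarks source; the message size and repetition count here are arbitrary assumptions):

/* Minimal MPI ping-pong sketch: rank 0 sends a buffer to rank 1 and
 * waits for it to come back; half the round-trip time approximates
 * latency plus message size / bandwidth. Run with 2 MPI processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;      /* 1 MB message (assumption) */
    const int reps = 100;
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
    if (rank == 0)
        printf("one-way time: %g s, bandwidth: %g MB/s\n",
               t, nbytes / t / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}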
Synthetic benchmarks: Intel MPI suite
• Multi Ping Pong: all available processors utilised
• Modified to ensure that all comms utilise the switch
• No difference between Phase 2 and 2a (not surprising – same switch)
Streams performance (scale)
• Streams benchmark gives a measure of memory bandwidth (a minimal sketch of the scale kernel follows)
• Hardware limit is 2 load+store per cycle
• Can clearly see caches
• Phase 2a significantly better than Phase 2 for all memory levels
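A minimal sketch of the scale kernel in C (illustrative only, not the official STREAM code; the array size and the POSIX timing call are assumptions):

/* STREAM-style "scale" kernel: bandwidth is estimated from the bytes
 * moved per iteration and the elapsed time. The arrays must be much
 * larger than the biggest cache to measure main-memory bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* ~160 MB per double array (assumption) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    const double scalar = 3.0;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = scalar * b[i];        /* scale: one load + one store */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* 2 * N doubles of traffic: read b, write a */
    printf("a[0] = %f, scale bandwidth: %.1f MB/s\n",
           a[0], 2.0 * N * sizeof(double) / secs / 1e6);

    free(a); free(b);
    return 0;
}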
Benchmark Results: Applications
CASTEP: Al2O3
• Density functional theory application (Payne et al., 2002; Segall et al., 2002)
• Widely used in the UK (largest user of HPCx)
• Benchmark: Al2O3, 270-atom slab sampled with 2 k-points
• Phase 2a around 1.3 times faster than Phase 2
  – Even though clocks are slower
  – Code is taking advantage of improved memory bandwidth
H2MOL
• Solves the time-dependent Schrödinger equation for laser-driven dissociation of H2 molecules
• Refines grid when increasing processor count, hence constant work/proc
• Phase 2a almost a factor of 2 faster than Phase 2
• Writing of intermediate results shows up poor IO on Blue Gene
PCHAN
• Finite difference code for turbulent flow: shock/boundary layer interaction (SBLI)
• Communications: halo exchanges between adjacent computational sub-domains (see the sketch below)
• Phase 2a around 2 times faster than Phase 2
• Very good scaling on all systems – HPCx superscales
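A minimal 1-D halo-exchange sketch in C/MPI (illustrative only, not the PCHAN source; PCHAN works on 3-D sub-domains, and the local array size here is an arbitrary assumption):

/* Each rank owns NLOCAL interior points plus one halo cell at each
 * end of a 1-D domain, and swaps boundary values with its neighbours
 * before each stencil update. */
#include <mpi.h>
#include <stdio.h>

#define NLOCAL 1024

int main(int argc, char **argv)
{
    double u[NLOCAL + 2];            /* u[0] and u[NLOCAL+1] are halos */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i <= NLOCAL + 1; i++)
        u[i] = (double)rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send rightmost interior point right, receive left halo from left. */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Send leftmost interior point left, receive right halo from right. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d halos: %f %f\n", rank, u[0], u[NLOCAL + 1]);

    MPI_Finalize();
    return 0;
}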
AIMPRO
• Ab Initio Modelling PROgram
• Determines the structure of atoms using the Born-Oppenheimer approximation
• Benchmark: DS-4 – 433 atoms, 12124 basis functions and 4 k-points
• Phase 2a outperforming Phase 2 by around a factor of 1.2
MDCASK
• MDCASK: classical molecular dynamics code to study radiation damage in metals
• Benchmark used: 1,372,000 atoms in a Ti lattice
• Performance is worse on Phase 2a than on Phase 2
  – by a factor larger than the clock frequency ratio
  – Scaling also worse on Phase 2a
• Classical molecular dynamics codes are characterised by many strided memory accesses (see the sketch below)
  – degradation could be due to sensitivity to increased latency in some part of the memory subsystem
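A small C sketch (not MDCASK code) of the kind of indirect, strided neighbour-list access that makes such codes sensitive to memory latency; the array sizes and neighbour count are arbitrary assumptions:

/* Positions are gathered through an index array, so consecutive loads
 * hit scattered cache lines and the loop is dominated by memory
 * latency rather than bandwidth. */
#include <stdio.h>
#include <stdlib.h>

#define NATOMS 100000
#define NNEIGH 64          /* neighbours per atom (assumption) */

int main(void)
{
    double *x     = malloc(NATOMS * sizeof(double));
    double *force = calloc(NATOMS, sizeof(double));
    int    *nlist = malloc((size_t)NATOMS * NNEIGH * sizeof(int));

    for (int i = 0; i < NATOMS; i++) x[i] = (double)i;
    for (long k = 0; k < (long)NATOMS * NNEIGH; k++)
        nlist[k] = rand() % NATOMS;          /* scattered neighbours */

    for (int i = 0; i < NATOMS; i++) {
        for (int n = 0; n < NNEIGH; n++) {
            int j = nlist[(long)i * NNEIGH + n];
            force[i] += x[i] - x[j];         /* x[j]: latency-bound gather */
        }
    }

    printf("force[0] = %f\n", force[0]);
    free(x); free(force); free(nlist);
    return 0;
}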
LAMMPS
• Classical molecular dynamics – can simulate a wide range of materials
• Rhodopsin benchmark: 2,048,000 atoms
• Performance degradation again on new system
• Factor is the clock ratio at low processor numbers, but scaling is worse on Phase 2a
NAMD 2.6b1: ApoA1, 92,224 atoms
• NAMD: classical molecular dynamics code designed for high-performance simulation of large biomolecular systems
• ApoA1 benchmark: 92,224 atoms
• Similarly to other classical molecular dynamics codes, performs worse on Phase 2a
DL_POLY3: Gramicidin, 792,960 atoms
• DL_POLY is a general purpose molecular dynamics package
• DL_POLY3 uses a distributed domain decomposition model
• The benchmark: a system of eight Gramicidin-A species (792,960 atoms)
• Performs slightly better on Phase 2a, but not as well as some of the other codes
Simultaneous Multithreading
Simultaneous Multithreading (SMT)
• Theoretical peak floating point performance of microprocessors has steadily risen in recent years
• Actual performance of apps, relative to theoretical peak, has dropped substantially
  – i.e. the number of cycles for which the floating point units are idle is rising
• Due to latencies involved with processor operations
• Compiler attempts to schedule instructions to minimise wasted cycles
  – but effectiveness is limited by a lack of independent instructions
• SMT: multiple threads can issue instructions to the functional units in each cycle
  – no. of independent instructions increases, no. of idle cycles decreases
Simultaneous Multithreading (SMT)
• Power5 processors on HPCx have 2 floating point units, and support SMT with 2 threads
• Hence have 2 virtual processes (MPI tasks or OpenMP threads) running per physical processor
• No SMT:
  #@ tasks_per_node = 16
• With SMT:
  #@ tasks_per_node = 32
  #@ requirements = (Feature == "SMT")
• Disadvantages:
– More communication
– Memory limit per task is halved
SMT: Streams
• Compare open squares (no SMT) with open circles (SMT)
• With SMT, there are twice as many tasks per node; for direct comparison, SMT results have been multiplied by a factor of 2
• No difference observed in memory bandwidth with SMT
• Of course, caches are effectively halved in size
• Therefore, any improvements in apps must be due to reduced memory latency (as expected)
SMT: Classical Molecular Dynamics
• Reminder: Classical Molecular
Dynamics codes did worse than
expected on Phase 2a, likely due
to sensitivity to increased memory
latency.
• Such codes benefit from SMT:
seems that latencies are
successfully hidden.
• Benefit limited to lower processor
counts. At high counts large
amount of communication takes
over.
• For NAMD, up to factor of 1.4
improvement and crossover point
is around 512 processors.
SMT: Classical Molecular Dynamics
• Reminder: Classical Molecular
Dynamics codes did worse than
expected on Phase 2a, likely due to
sensitivity to increased memory
latency.
• Such codes benefit from SMT:
seems that latencies are
successfully hidden.
• Benefit limited to lower processor
counts. At high counts large amount
of communication takes over.
• For MDCASK, up to factor of 1.4
improvement and crossover point is
around 256 processors.
SMT: Classical Molecular Dynamics
• Reminder: Classical Molecular
Dynamics codes did worse than
expected on Phase 2a, likely due to
sensitivity to increased memory
latency.
• Such codes benefit from SMT:
seems that latencies are
successfully hidden.
• Benefit limited to lower processor
counts. At high counts large amount
of communication takes over.
• For DL_POLY, up to factor of 1.2
improvement and crossover point is
around 64 processors.
SMT: CASTEP and H2MOL
• Reminder: performance of CASTEP and H2MOL codes improved on the new system
  – No performance benefit seen with SMT
  – SMT degrades performance in certain situations
Conclusions
• HPCx recently upgraded from Power4 to Power5 technology
• Although new chips have a slightly lower clock frequency, significant improvements observed in the majority of applications
  – due to better memory bandwidth
• Some types of application, in particular classical molecular dynamics, have not performed as well as expected on the new system
  – These apps are characterised by many strided memory accesses
  – Sensitivity to increased latency could be to blame
• Performance benefits with the use of SMT have been observed in certain situations
  – In particular for those codes which didn't do as well as expected
  – Users should benchmark their own codes