HPCS HPCchallenge Benchmark Suite

David Koester, Ph.D. (MITRE)
Jack Dongarra (UTK)
Piotr Luszczek (ICL/UTK)

28 September 2004
Outline
• Brief DARPA HPCS Overview
• Architecture/Application Characterization
• HPCchallenge Benchmarks
• Preliminary Results
• Summary
High Productivity Computing Systems

Create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007 – 2010)

Impact:
• Performance (time-to-solution): speed up critical national security applications by a factor of 10X to 40X
• Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
• Portability (transparency): insulate research and operational application software from the system
• Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

HPCS Program Focus Areas

Applications:
• Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology

Fill the critical technology and capability gap: today (late 80's HPC technology) … to … future (Quantum/Bio Computing)
High Productivity Computing Systems - Program Overview

Create a new generation of economically viable computing systems and a procurement methodology for the security/industrial community (2007 – 2010)

[Timeline figure: Phase 1 (Concept Study, new evaluation framework) → Phase 2 (2003-2005: Advanced Design & Prototypes, test evaluation framework; half-way point marked by the Phase 2 technology assessment review) → Phase 3 (2006-2010: Full Scale Development, Petascale/s systems from the vendors, validated procurement evaluation methodology).]
HPCS Program Goals‡

• HPCS overall productivity goals:
  – Execution (sustained performance)
    ▪ 1 Petaflop/sec (scalable to greater than 4 Petaflop/sec)
    ▪ Reference: Production workflow
  – Development
    ▪ 10X over today's systems
    ▪ Reference: Lone researcher and Enterprise workflows
• Productivity Framework
  – Baselined for today's systems
  – Successfully used to evaluate the vendors' emerging productivity techniques
  – Provide a solid reference for evaluation of the vendors' proposed Phase III designs
• Subsystem Performance Indicators
  1) 2+ PF/s LINPACK
  2) 6.5 PB/sec data STREAM bandwidth
  3) 3.2 PB/sec bisection bandwidth
  4) 64,000 GUPS

‡Bob Graybill (DARPA/IPTO) (Emphasis added)
Outline
• Brief DARPA HPCS Overview
• Architecture/Application Characterization
• HPCchallenge Benchmarks
• Preliminary Results
• Summary
Processor-Memory Performance Gap

[Chart: relative performance, 1980-2000, log scale 1 to 1000. CPU performance (µProc, "Moore's Law") improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50% per year.]

• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
• Caches in Pentium Pro: 64% area, 88% transistors

*Taken from Patterson-Keeton Talk to SigMod
Processing vs. Memory Access

• Doesn't cache solve this problem?
  – It depends. With small amounts of contiguous data, usually. With large amounts of non-contiguous data, usually not
  – In most computers the programmer has no control over cache
  – Often "a few" Bytes/FLOP is considered OK
• However, consider operations on the transpose of a matrix (e.g., for adjoint problems)
  – Xa = b versus X^T a = b
  – If X is big enough, 100% cache misses are guaranteed, and we need at least 8 Bytes/FLOP (assuming a and b can be held in cache)
• Latency and limited bandwidth of processor-memory and node-node communications are major limiters of performance for scientific computation
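To make the cache-miss argument concrete, here is a minimal C sketch (illustrative, not from the original slides) contrasting row-wise access of X with access to its transpose; the row-major layout and the matrix size n are assumptions of the sketch.

```c
/* Illustrative sketch: y = X*a (unit-stride, cache-friendly) versus
   y = X^T*a (stride-n access, mostly cache misses once X exceeds cache).
   X is stored row-major with leading dimension n. */
void xa(const double *X, const double *a, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += X[i * n + j] * a[j];   /* unit stride: roughly one miss per cache line */
        y[i] = s;
    }
}

void xta(const double *X, const double *a, double *y, int n)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += X[j * n + i] * a[j];   /* stride of n doubles: a miss on nearly every load of X */
        y[i] = s;
    }
}
```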
Processing vs. Memory Access
High Performance LINPACK

Consider another benchmark: Linpack
  Ax = b
Solve this linear equation for the vector x, where A is a known matrix and b is a known vector. Linpack uses the BLAS routines, which divide A into blocks.
On average, Linpack requires 1 memory reference for every 2 FLOPs, or 4 Bytes/FLOP.
Many of these can be cache references.
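As a hedged illustration of why so many of those references can hit in cache: the bulk of HPL's work is a blocked, DGEMM-like trailing-matrix update, and every block brought into cache is reused many times. The square NB x NB blocks below are an assumption for the sketch, not the actual HPL code.

```c
/* Sketch of a blocked update C -= A*B with NB x NB blocks.
   Each element of A and B loaded from memory is reused NB times,
   so memory traffic per FLOP shrinks as NB grows -- the reuse that
   lets Linpack run near peak on cache-based machines. */
#define NB 64   /* illustrative block size */

void block_update(double C[NB][NB], const double A[NB][NB], const double B[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++) {
            double aik = A[i][k];          /* loaded once, reused across the j loop */
            for (int j = 0; j < NB; j++)
                C[i][j] -= aik * B[k][j];  /* 2 FLOPs per inner iteration */
        }
}
```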
Processing vs. Memory Access
STREAM TRIAD

Consider the simple benchmark: STREAM TRIAD
  a(i) = b(i) + q * c(i)
a(i), b(i), and c(i) are vectors; q is a scalar
The vector length is chosen to be much longer than the cache size
Each iteration requires
  2 memory loads + 1 memory store
  2 FLOPs
  12 Bytes/FLOP (at 64-bit precision)
No computer has enough memory bandwidth to reference 12 Bytes for each FLOP!
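For reference, a minimal C version of the TRIAD kernel; this is a sketch rather than the official STREAM code, and the requirement that the arrays be much larger than cache is left to the caller.

```c
#include <stddef.h>

/* STREAM TRIAD sketch: a(i) = b(i) + q*c(i).
   Per iteration: 2 loads + 1 store of 8-byte doubles = 24 bytes of
   memory traffic for 2 FLOPs, i.e. 12 Bytes/FLOP. */
void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}
```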
Processing vs. Memory Access
RandomAccess

[Figure: data-driven memory access. A table T of 64-bit words, of length 2^n, fills half of memory. A data stream {a_i} of 64-bit values, of length 2^(n+2), drives the updates: for each a_i the address k is taken from the highest n bits, k = a_i<63, 64-n>, and the update is a bit-level exclusive-or, T[k] = T[k] XOR a_i.]

The expected number of accesses per memory location T[k] is
  E[ T[k] ] = 2^(n+2) / 2^n = 4

The commutative and associative nature of XOR allows processing in any order
Acceptable error: 1%
Look-ahead and storage: 1024 per "node"
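The update loop itself is tiny; a simplified serial sketch in C follows. The table size, the seed, and the 64-bit LCG standing in for the benchmark's pseudo-random stream are all illustrative assumptions: the official code uses a primitive-polynomial generator and exploits the reordering and 1024-deep look-ahead freedom described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified serial sketch of the RandomAccess (GUPS) update loop. */
int main(void)
{
    const int n = 20;                              /* table of 2^n 64-bit words (toy size) */
    const uint64_t tab_size = 1ULL << n;
    const uint64_t num_updates = tab_size << 2;    /* 2^(n+2) updates => E[accesses] = 4 */

    uint64_t *T = malloc(tab_size * sizeof *T);
    if (!T) return 1;
    for (uint64_t i = 0; i < tab_size; i++) T[i] = i;

    uint64_t a = 0x123456789ABCDEFULL;             /* stand-in stream, not the official generator */
    for (uint64_t i = 0; i < num_updates; i++) {
        a = a * 6364136223846793005ULL + 1442695040888963407ULL;  /* 64-bit LCG step */
        uint64_t k = a >> (64 - n);                /* address = highest n bits of a_i */
        T[k] ^= a;                                 /* data-driven read-modify-write: T[k] XOR a_i */
    }

    free(T);
    return 0;
}
```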
Bounding Mission Partner Applications

[Figure: HPCS productivity design points plotted by temporal locality (high to low) and spatial locality (high to low). HPL sits at high temporal/high spatial locality, PTRANS and STREAM at high spatial/low temporal, FFT at high temporal/low spatial, and RandomAccess at low temporal/low spatial; Mission Partner Applications fall inside the region bounded by these kernels.]
Outline
• Brief DARPA HPCS Overview
• Architecture/Application Characterization
• HPCchallenge Benchmarks
• Preliminary Results
• Summary
HPCS HPCchallenge Benchmarks

• HPCchallenge Benchmarks
  – Being developed by Jack Dongarra (ICL/UT)
  – Funded by the DARPA High Productivity Computing Systems (HPCS) program (Bob Graybill (DARPA/IPTO))

To examine the performance of High Performance Computer (HPC) architectures using kernels with more challenging memory access patterns than High Performance Linpack (HPL)
HPCchallenge Goals

• To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
  – HPL works well on all architectures ― even cache-based, distributed-memory multiprocessors ― due to
    1. Extensive memory reuse
    2. Scalability with respect to the amount of computation
    3. Scalability with respect to the communication volume
    4. Extensive optimization of the software
• To complement the Top500 list
• To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics ― e.g., spatial and temporal locality
HPCchallenge Benchmarks

Local
• DGEMM (matrix x matrix multiply)
• STREAM
  – COPY
  – SCALE
  – ADD
  – TRIAD
• EP-RandomAccess
• 1D FFT

Global
• High Performance LINPACK (HPL)
• PTRANS — parallel matrix transpose
• G-RandomAccess
• 1D FFT
• b_eff — interprocessor bandwidth and latency

HPCchallenge pushes spatial and temporal boundaries; sets performance bounds

Available for download: http://icl.cs.utk.edu/hpcc/
Web Site
http://icl.cs.utk.edu/hpcc/

• Home
• Rules
• News
• Download
• FAQ
• Links
• Collaborators
• Sponsors
• Upload
• Results
Outline
• Brief DARPA HPCS Overview
• Architecture/Application Characterization
• HPCchallenge Benchmarks
• Preliminary Results
• Summary
Preliminary Results
Machine List (1 of 2)

| Affiliation | Manufacturer System            | System           | Processor Type  | Procs |
|-------------|--------------------------------|------------------|-----------------|-------|
| U Tenn      | Atipa Cluster AMD 128 procs    | Conquest cluster | AMD Opteron     | 128   |
| AHPCRC      | Cray X1 124 procs              | X1               | Cray X1 MSP     | 124   |
| AHPCRC      | Cray X1 124 procs              | X1               | Cray X1 MSP     | 124   |
| AHPCRC      | Cray X1 124 procs              | X1               | Cray X1 MSP     | 124   |
| ERDC        | Cray X1 60 procs               | X1               | Cray X1 MSP     | 60    |
| ERDC        | Cray X1 60 procs               | X1               | Cray X1 MSP     | 60    |
| ORNL        | Cray X1 252 procs              | X1               | Cray X1 MSP     | 252   |
| ORNL        | Cray X1 252 procs              | X1               | Cray X1 MSP     | 252   |
| AHPCRC      | Cray X1 120 procs              | X1               | Cray X1 MSP     | 120   |
| ORNL        | Cray X1 64 procs               | X1               | Cray X1 MSP     | 64    |
| AHPCRC      | Cray T3E 1024 procs            | T3E              | Alpha 21164     | 1024  |
| ORNL        | HP zx6000 Itanium 2 128 procs  | Integrity zx6000 | Intel Itanium 2 | 128   |
| PSC         | HP AlphaServer SC45 128 procs  | AlphaServer SC45 | Alpha 21264B    | 128   |
| ERDC        | HP AlphaServer SC45 484 procs  | AlphaServer SC45 | Alpha 21264B    | 484   |
Preliminary Results
Machine List (2 of 2)

| Affiliation   | Manufacturer System             | System                        | Processor Type     | Procs |
|---------------|---------------------------------|-------------------------------|--------------------|-------|
| IBM           | IBM 655 Power4+ 64 procs        | eServer pSeries 655           | IBM Power 4+       | 64    |
| IBM           | IBM 655 Power4+ 128 procs       | eServer pSeries 655           | IBM Power 4+       | 128   |
| IBM           | IBM 655 Power4+ 256 procs       | eServer pSeries 655           | IBM Power 4+       | 256   |
| NAVO          | IBM p690 Power4 504 procs       | p690                          | IBM Power 4        | 504   |
| ARL           | IBM SP Power3 512 procs         | RS/6000 SP                    | IBM Power 3        | 512   |
| ORNL          | IBM p690 Power4 256 procs       | p690                          | IBM Power 4        | 256   |
| ORNL          | IBM p690 Power4 64 procs        | p690                          | IBM Power 4        | 64    |
| ARL           | Linux Networx Xeon 256 procs    | Powell                        | Intel Xeon         | 256   |
| U Manchester  | SGI Altix Itanium 2 32 procs    | Altix 3700                    | Intel Itanium 2    | 32    |
| ORNL          | SGI Altix Itanium 2 128 procs   | Altix                         | Intel Itanium 2    | 128   |
| U Tenn        | SGI Altix Itanium 2 32 procs    | Altix                         | Intel Itanium 2    | 32    |
| U Tenn        | SGI Altix Itanium 2 32 procs    | Altix                         | Intel Itanium 2    | 32    |
| U Tenn        | SGI Altix Itanium 2 32 procs    | Altix                         | Intel Itanium 2    | 32    |
| U Tenn        | SGI Altix Itanium 2 32 procs    | Altix                         | Intel Itanium 2    | 32    |
| NASA ASC      | SGI Origin 3900 R16K 256 procs  | Origin 3900                   | SGI MIPS R16000    | 256   |
| U Aachen/RWTH | SunFire 15K 128 procs           | Sun Fire 15k/6800 SMP-Cluster | Sun UltraSparc III | 128   |
| OSC           | Voltaire Cluster Xeon 128 procs | Pinnacle 2X200 Cluster        | Intel Xeon         | 128   |
STREAM TRIAD vs HPL
120-128 Processors

[Bar chart, "Basic Performance: 120-128 Processors": EP-STREAM TRIAD (TFlop/s) and HPL (TFlop/s), scale 0.0 to 2.5 TFlop/s, for the 120-128-processor systems in the machine list (Cray X1 systems, Atipa AMD Opteron cluster, HP zx6000 Itanium 2, HP AlphaServer SC45, IBM 655 Power4+, SGI Altix Itanium 2, Sun Fire 15K, Voltaire Xeon cluster). HPL solves Ax = b; STREAM TRIAD computes a(i) = b(i) + q*c(i).]
STREAM TRIAD vs HPL
>=252 Processors

[Bar chart, "Basic Performance: >=252 Processors": EP-STREAM TRIAD (TFlop/s) and HPL (TFlop/s), scale 0.0 to 2.5 TFlop/s, for the systems with 252 or more processors (Cray X1 252, Cray T3E 1024, IBM SP Power3, IBM p690 Power4, IBM 655 Power4+ 256, HP AlphaServer SC45, SGI Origin 3900, Linux Networx Xeon 256). HPL solves Ax = b; STREAM TRIAD computes a(i) = b(i) + q*c(i).]
STREAM ADD vs PTRANS
60-128 Processors

[Bar chart, "Basic Performance: 60-128 Processors", log scale 0.1 to 10,000.0 GB/s: EP-STREAM ADD (GB/s) and PTRANS (GB/s) for the 60-128-processor systems in the machine list (Cray X1 systems, IBM p690 Power4 and 655 Power4+, HP zx6000 Itanium 2, HP AlphaServer SC45, SGI Altix Itanium 2, Atipa AMD Opteron cluster, Sun Fire 15K, Voltaire Xeon cluster). STREAM ADD computes a(i) = b(i) + c(i); PTRANS computes A = A + B^T.]
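PTRANS stresses how fast matrix data can be moved across the machine by computing A = A + B^T over a distributed matrix. A minimal serial sketch of the arithmetic follows; it deliberately ignores the 2D block-cyclic distribution and the interprocessor communication that the real benchmark is designed to exercise.

```c
/* Serial sketch of the PTRANS operation A = A + B^T for an n x n
   matrix stored row-major. In the distributed benchmark A and B are
   spread block-cyclically over the processor grid, so the transposed
   accesses below turn into all-to-all communication. */
void ptrans_serial(double *A, const double *B, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] += B[j * n + i];
}
```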
STREAM ADD vs PTRANS
>=252 Processors

[Bar chart, "Basic Performance: >=252 Processors", log scale 0.1 to 10,000.0 GB/s: EP-STREAM ADD (GB/s) and PTRANS (GB/s) for the systems with 252 or more processors (Cray X1 252, Cray T3E 1024, IBM SP Power3, IBM p690 Power4, IBM 655 Power4+ 256, HP AlphaServer SC45, SGI Origin 3900, Linux Networx Xeon 256). STREAM ADD computes a(i) = b(i) + c(i); PTRANS computes A = A + B^T.]
Outline
• Brief DARPA HPCS Overview
• Architecture/Application Characterization
• HPCchallenge Benchmarks
• Preliminary Results
• Summary
Summary

• DARPA HPCS Subsystem Performance Indicators
  – 2+ PF/s LINPACK
  – 6.5 PB/sec data STREAM bandwidth
  – 3.2 PB/sec bisection bandwidth
  – 64,000 GUPS
• Important to understand architecture/application characterization
  – Where did all the lost "Moore's Law" performance go?
• HPCchallenge Benchmarks — http://icl.cs.utk.edu/hpcc/
  – Peruse the results!
  – Contribute!

[Figure: the same HPCS productivity design points chart as in "Bounding Mission Partner Applications", plotting HPL, PTRANS, STREAM, FFT, RandomAccess, and Mission Partner Applications by spatial and temporal locality.]