Survey of Computer Architecture

An Overview of High Performance Computing and Challenges for the Future
Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
3/23/2016
1
Outline
• Top500 Results
• Four Important Concepts that Will Affect Math Software
 Effective Use of Many-Core
 Exploiting Mixed Precision in Our Numerical Computations
 Self Adapting / Auto Tuning of Software
 Fault Tolerant Algorithms
2
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP: Ax=b, dense problem (TPP performance)
- Updated twice a year:
  SC'xy in the States in November
  Meeting in Germany in June
- All data available from www.top500.org
3
Performance Development
[Chart: Top500 performance, 1993-2007, log scale from 100 Mflop/s to 1 Pflop/s.
 Three trend lines: SUM of all 500 systems (4.92 PF/s in 2007, 1.17 TF/s in 1993),
 N=1 (281 TF/s, IBM BlueGene/L; 59.7 GF/s in 1993), and N=500 (4.0 TF/s).
 Earlier #1 systems marked: NEC Earth Simulator, IBM ASCI White, Intel ASCI Red, Fujitsu 'NWT'.
 'My Laptop' shown at 0.4 GF/s; a '6-8 years' annotation marks the lag between the curves.]
4
29th List: The TOP10

Rank  Manufacturer  Computer                            Rmax [TF/s]  Installation Site                      Country  Year  #Proc
  1   IBM           BlueGene/L, eServer Blue Gene          280.6     DOE/NNSA/LLNL                          USA      2005  131,072
  2   Cray          Jaguar, Cray XT3/XT4                   101.7     DOE/ORNL                               USA      2007   23,016
  3   Sandia/Cray   Red Storm, Cray XT3                    101.4     DOE/NNSA/Sandia                        USA      2006   26,544
  4   IBM           BGW, eServer Blue Gene                  91.29    IBM Thomas Watson                      USA      2005   40,960
  5   IBM           New York Blue, eServer Blue Gene        82.16    Stony Brook/BNL                        USA      2007   36,864
  6   IBM           ASC Purple, eServer pSeries p575        75.76    DOE/NNSA/LLNL                          USA      2005   12,208
  7   IBM           BlueGene/L, eServer Blue Gene           73.03    Rensselaer Polytechnic Institute/CCNI  USA      2007   32,768
  8   Dell          Abe, PowerEdge 1955, Infiniband         62.68    NCSA                                   USA      2007    9,600
  9   IBM           MareNostrum, JS21 Cluster, Myrinet      62.63    Barcelona Supercomputing Center        Spain    2006   12,240
 10   SGI           HLRB-II, SGI Altix 4700                 56.52    LRZ                                    Germany  2007    9,728

www.top500.org, 29th List / June 2007
5
Performance Projection
[Chart: extrapolation of the SUM, N=1, and N=500 trend lines from 1993 through 2015, on a log scale running from 100 Mflop/s up to 1 Eflop/s.]
6
Cores per System - June 2007
[Bar chart: number of systems by core count, in bins from 33-64 up to 64K-128K cores.]
7
[Chart: Rmax (Gflop/s) for each of the 500 ranked systems.
 14 systems > 50 Tflop/s; 88 systems > 10 Tflop/s; 326 systems > 5 Tflop/s.]
8
Chips Used in Each of the 500 Systems
[Pie chart: Intel EM64T 46%, AMD x86_64 21%, IBM Power 17%, Intel IA-32 6%, Intel IA-64 6%,
 HP PA-RISC 2%, Sun Sparc 1%, NEC 1%, HP Alpha 0%, Cray 0%.
 96% of the list uses one of three families: 58% Intel, 21% AMD, 17% IBM.]
9
Interconnects / Systems
[Area chart, 1993-2007: interconnect families across the 500 systems (N/A, Others, Cray
 Interconnect, SP Switch, Crossbar, Quadrics, Infiniband, Myrinet, Gigabit Ethernet).
 In June 2007: Gigabit Ethernet 206 systems, Infiniband 128, Myrinet 46.
 GigE + Infiniband + Myrinet = 74% of the list.]
10
Countries / Systems

Rank  Site            Manufacturer  Computer                    Procs  Rmax [Gflop/s]  Segment   Interconnect
  66  CINECA          IBM           eServer 326 Opteron Dual     5120      12,608      Academic  Infiniband
 132  SCS S.r.l.      HP            Cluster Platform 3000 Xeon   1024       7,987.2    Research  Infiniband
 271  Telecom Italia  HP            SuperDome 875 MHz            3072       5,591      Industry  Myrinet
 295  Telecom Italia  HP            Cluster Platform 3000 Xeon    740       5,239      Industry  GigE
 305  Esprinet        HP            Cluster Platform 3000 Xeon    664       5,179      Industry  GigE

www.top500.org, 29th List / June 2007
11
Power is an Industry Wide Problem
 Google facilities
   leveraging hydroelectric power
   old aluminum plants
   >500,000 servers worldwide
"Hiding in Plain Sight, Google Seeks More Power", by John Markoff, The New York Times, June 14, 2006
[Photo: new Google plant in The Dalles, Oregon, from the NYT, June 14, 2006]
12
Gflop/KWatt in the Top 20
[Bar chart: Gflop/s per KWatt for each of the top 20 systems, on an axis from 0 to 300.]
13
IBM BlueGene/L  #1  131,072 Cores
Total of 33 systems in the Top500
1.6 MWatts (about 1,600 homes); 43,000 ops/s/person
System (64 racks, 64x32x32): 131,072 processors, 180/360 TF/s, 32 TB DDR
  Rack (32 node boards, 8x8x16): 2,048 processors, 2.9/5.7 TF/s, 0.5 TB DDR
    Node Board (32 chips, 4x4x2): 16 compute cards, 64 processors, 90/180 GF/s, 16 GB DDR
      Compute Card (2 chips, 2x1x1): 4 processors, 17 watts, 5.6/11.2 GF/s, 1 GB DDR
        Chip (2 processors): 2.8/5.6 GF/s, 4 MB cache
The BlueGene/L compute node ASICs include all networking and processor functionality.
Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores
(note that L1 cache coherence is not maintained between these cores).
"Fastest Computer": BG/L, 700 MHz, 131K processors, 64 racks
  Peak: 367 Tflop/s
  LINPACK: 281 Tflop/s (77% of peak), n = 1.8M, about 13K seconds (roughly 3.6 hours)
14
Increasing the number of gates into a tight knot and decreasing the cycle time of the processor: increase clock rate and transistor density, lower voltage.
[Diagram: a single core with its cache evolving into chips with 2 and then 4 cores (C1-C4) sharing cache.]
We have seen increasing numbers of gates on a chip and increasing clock speed.
Heat is becoming an unmanageable problem; Intel processors now exceed 100 Watts.
We will not see the dramatic increases in clock speed in the future.
However, the number of gates on a chip will continue to increase.
15
Power Cost of Frequency
• Power ∝ Voltage^2 x Frequency (V^2·F)
• Frequency ∝ Voltage
• Power ∝ Frequency^3
16
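To see what this cubic relationship buys a multicore design: under the Power ∝ V^2·F, V ∝ F model above, two cores at half frequency deliver the same aggregate throughput as one full-speed core at a quarter of the power. The C snippet below is only a worked illustration of that arithmetic, in normalized units rather than measured values:

#include <stdio.h>

static double rel_power(double f) { return f * f * f; }   /* Power ~ Frequency^3 */

int main(void) {
    double p_one  = rel_power(1.0);            /* one core at full frequency      */
    double p_two  = 2.0 * rel_power(0.5);      /* two cores at half frequency     */
    double perf_one = 1.0;                     /* throughput ~ frequency          */
    double perf_two = 2.0 * 0.5;               /* assuming ideal parallel scaling */
    printf("one fast core : perf %.2f  power %.2f\n", perf_one, p_one);
    printf("two slow cores: perf %.2f  power %.2f (%.0f%% of the power)\n",
           perf_two, p_two, 100.0 * p_two / p_one);
    return 0;
}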
What's Next?
[Diagram: possible chip organizations - all large cores; mixed large and small cores;
 many small cores; all small cores; many floating-point cores plus 3D stacked memory (SRAM).]
Different classes of chips: Home, Games / Graphics, Business, Scientific
Novel Opportunities in Multicores
• Don't have to contend with uniprocessors
• Not your same old multiprocessor problem
 How does going from multiprocessors to multicores impact programs?
 What changed?
 Where is the impact?
    • Communication bandwidth
    • Communication latency
19
Communication Bandwidth
• How much data can be communicated between two cores?
• What changed?
 Number of wires
 Clock rate
 Multiplexing
• Impact on programming model?
 Massive data exchange is possible
 Data movement is not the bottleneck
 Processor affinity not that important
• A 10,000X change: from 32 Giga bits/sec (multiprocessor) to ~300 Tera bits/sec (multicore)
20
Communication Latency
• How long does it take for a round trip communication?
• What changed?
 Length of wire
 Pipeline stages
• Impact on programming model?
 Ultra-fast synchronization
 Can run real-time apps on multiple cores
• A 50X change: from ~200 cycles (multiprocessor) to ~4 cycles (multicore)
21
80 Core
• Intel's 80 core chip
 1 Tflop/s
 62 Watts
 1.2 TB/s internal bandwidth
22
NSF Track 1 - NCSA/UIUC
• $200M
• 10 Pflop/s
• 40K 8-core 4 GHz IBM Power7 chips
• 1.2 PB memory
• 5 PB/s global bandwidth
• interconnect bandwidth of 0.55 PB/s
• 18 PB disk at 1.8 TB/s I/O bandwidth
• For use by a few people
NSF UTK/JICS Track 2 proposal
• $65M over 5 years for a 1 Pflop/s system
 $30M over 5 years for equipment
    • 36 cabinets of a Cray XT5
    • (AMD 8-core/chip, 12 sockets/board, 3 GHz, 4 flops/cycle/core)
 $35M over 5 years for operations
    • Power cost: $1.1M/year
    • Cray maintenance: $1M/year
• To be used by the NSF community
 1000's of users
• Joins UCSD, PSC, TACC (last year's Track 2 award went to U of Texas)
A quick check of the peak arithmetic implied by these numbers follows.
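The per-chip and per-board peaks follow directly from the figures listed on the slide (3 GHz, 4 flops/cycle/core, 8 cores/chip, 12 sockets/board). The sketch below uses only those numbers; cabinet packaging is not specified on the slide, so it is left out, and the printed chip/board counts are only the implied minimums for a 1 Pflop/s peak:

#include <stdio.h>

int main(void) {
    double ghz = 3.0, flops_per_cycle = 4.0, cores = 8.0, sockets = 12.0;
    double chip_peak  = ghz * 1e9 * flops_per_cycle * cores;   /* flop/s per chip  */
    double board_peak = chip_peak * sockets;                   /* flop/s per board */
    double target     = 1e15;                                  /* 1 Pflop/s        */
    printf("per-chip peak : %6.1f Gflop/s\n", chip_peak  / 1e9);
    printf("per-board peak: %6.2f Tflop/s\n", board_peak / 1e12);
    printf("chips  for 1 Pflop/s: ~%.0f\n", target / chip_peak);
    printf("boards for 1 Pflop/s: ~%.0f\n", target / board_peak);
    return 0;
}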
Major Changes to Software
• Must rethink the design of our software
 Another disruptive technology
    • Similar to what happened with cluster computing and message passing
 Rethink and rewrite the applications, algorithms, and software
• Numerical libraries, for example, will change
 Both LAPACK and ScaLAPACK will undergo major changes to accommodate this
27
A New Generation of Software:
Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
- LINPACK (80's): vector operations; relies on Level-1 BLAS operations
- LAPACK (90's): blocking, cache friendly; relies on Level-3 BLAS operations
- PLASMA (00's): new, many-core friendly algorithms; rely on a DAG/scheduler, block data layout, and some extra kernels
These new algorithms
- have a very fine granularity and scale very well (multicore, petascale computing, ...)
- remove many of the dependencies among the tasks (multicore, distributed computing)
- avoid latency (distributed computing, out-of-core)
- rely on fast kernels
They need new kernels and rely on efficient scheduling algorithms; a sketch of the block data layout idea follows.
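To make the "block data layout" ingredient concrete, here is a minimal sketch of repacking a column-major matrix into contiguous nb-by-nb tiles, so that each task in the DAG touches one compact block of memory. The function and variable names are illustrative only and are not PLASMA's actual API:

#include <stdio.h>
#include <stdlib.h>

/* tiles[] holds the tiles in tile-column-major order; each tile is
   column-major inside and padded to nb*nb even at the matrix edge. */
static void lapack_to_tile(int n, int nb, const double *A, double *tiles) {
    int nt = (n + nb - 1) / nb;                    /* tiles per dimension */
    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            double *T = tiles + ((size_t)(tj * nt + ti)) * nb * nb;
            int rows = (ti == nt - 1) ? n - ti * nb : nb;
            int cols = (tj == nt - 1) ? n - tj * nb : nb;
            for (int j = 0; j < cols; j++)
                for (int i = 0; i < rows; i++)
                    T[j * nb + i] = A[(size_t)(tj * nb + j) * n + ti * nb + i];
        }
}

int main(void) {
    int n = 5, nb = 2, nt = (n + nb - 1) / nb;
    double *A = malloc(sizeof(double) * n * n);
    double *tiles = calloc((size_t)nt * nt * nb * nb, sizeof(double));
    for (int i = 0; i < n * n; i++) A[i] = i;      /* arbitrary test data */
    lapack_to_tile(n, nb, A, tiles);
    printf("tile (1,1) first entry: %g (expect A[2+2*5] = 12)\n",
           tiles[((size_t)(1 * nt + 1)) * nb * nb]);
    free(A); free(tiles);
    return 0;
}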
Steps in the LAPACK LU
- DGETF2 (LAPACK): factor a panel
- DLASWP (LAPACK): backward swap
- DLASWP (LAPACK): forward swap
- DTRSM (BLAS): triangular solve
- DGEMM (BLAS): matrix multiply
A sketch of these steps in code appears below.
31
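The same steps can be written out with LAPACKE/CBLAS calls. The sketch below is a schematic blocked right-looking LU, not LAPACK's own source: the panel is factored with dgetrf standing in for DGETF2, and the swaps, triangular solve, and trailing update use DLASWP, DTRSM, and DGEMM as named above (column-major storage; link with -llapacke and a CBLAS):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <lapacke.h>
#include <cblas.h>

/* One panel at a time: factor, swap, triangular solve, update (column-major). */
static void blocked_lu(int n, int nb, double *A, lapack_int *ipiv) {
    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;
        /* 1. Factor the panel A(j:n, j:j+jb) -- the DGETF2 role */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n - j, jb, &A[j + (size_t)j * n], n, &ipiv[j]);
        for (int i = j; i < j + jb; i++) ipiv[i] += j;       /* make pivots global */
        /* 2. Apply the row interchanges left and right of the panel -- DLASWP */
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, j, A, n, j + 1, j + jb, ipiv, 1);
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - j - jb, &A[(size_t)(j + jb) * n], n,
                       j + 1, j + jb, ipiv, 1);
        if (j + jb < n) {
            /* 3. U12 = L11^{-1} * A12 -- DTRSM */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                        jb, n - j - jb, 1.0,
                        &A[j + (size_t)j * n], n, &A[j + (size_t)(j + jb) * n], n);
            /* 4. A22 = A22 - L21 * U12 -- DGEMM does the bulk of the flops */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - j - jb, n - j - jb, jb, -1.0,
                        &A[(j + jb) + (size_t)j * n], n,
                        &A[j + (size_t)(j + jb) * n], n, 1.0,
                        &A[(j + jb) + (size_t)(j + jb) * n], n);
        }
    }
}

int main(void) {
    enum { N = 6, NB = 2 };
    double A[N * N], B[N * N];
    lapack_int ipiv_a[N], ipiv_b[N];
    srand(7);
    for (int i = 0; i < N * N; i++) A[i] = B[i] = (double)rand() / RAND_MAX;
    blocked_lu(N, NB, A, ipiv_a);                          /* this sketch      */
    LAPACKE_dgetrf(LAPACK_COL_MAJOR, N, N, B, N, ipiv_b);  /* LAPACK reference */
    double diff = 0.0;
    for (int i = 0; i < N * N; i++) diff = fmax(diff, fabs(A[i] - B[i]));
    printf("max |blocked LU - dgetrf| = %.2e\n", diff);
    return 0;
}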
LU Timing Profile (4 processor system)
[Chart: time spent in each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM) with
 threads and no lookahead, using a 1-D decomposition on an SGI Origin; execution proceeds
 in bulk synchronous phases.]
Adaptive Lookahead - Dynamic
Event driven multithreading: reorganizing algorithms to use this approach.
33
Fork-Join vs. Dynamic Execution
[Diagram: task traces for the same computation. The fork-join version (parallel BLAS) idles at
 each synchronization point; the DAG-based, dynamically scheduled version fills those gaps,
 saving time.]
Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads). A small task-DAG sketch follows.
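One way to get a feel for DAG-based dynamic execution is OpenMP task dependences (OpenMP 4.0 and later). The sketch below is not PLASMA's scheduler; it only shows how declaring per-tile dependences lets "panel" and "update" tasks from different steps overlap instead of waiting at a fork-join barrier (compile with -fopenmp):

#include <stdio.h>
#include <omp.h>

enum { NT = 4 };                    /* number of tiles, illustrative only */

static void panel(int k)         { printf("panel  %d      on thread %d\n", k, omp_get_thread_num()); }
static void update(int k, int j) { printf("update %d -> %d on thread %d\n", k, j, omp_get_thread_num()); }

int main(void) {
    int tile[NT];                   /* dependence tokens, one per tile */
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int k = 0; k < NT; k++) {
                /* the panel task writes tile k */
                #pragma omp task depend(inout: tile[k])
                panel(k);
                for (int j = k + 1; j < NT; j++) {
                    /* each update reads tile k and writes tile j; independent
                       updates run concurrently, and panel k+1 can start as soon
                       as update(k, k+1) finishes -- no global barrier needed */
                    #pragma omp task depend(in: tile[k]) depend(inout: tile[j])
                    update(k, j);
                }
            }
        }                           /* tasks drain at the end of the region */
    }
    return 0;
}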
With the Hype on Cell & PS3 We Became Interested
• The PlayStation 3's CPU is based on the "Cell" processor.
• Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPU + DMA engine).
 An SPE is a self-contained vector processor which acts independently from the others.
    • 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz
 204.8 Gflop/s peak!
 The catch is that this is for 32-bit floating point (single precision, SP),
 and 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    • Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues.
SPE ~ 25 Gflop/s peak
36
Performance of Single Precision on Conventional Processors
• We realized we have a similar situation on our commodity processors: SP is about 2X as fast as DP on many systems.
• The Intel Pentium and AMD Opteron have SSE2:
 2 flops/cycle DP
 4 flops/cycle SP
• IBM PowerPC has AltiVec:
 8 flops/cycle SP
 4 flops/cycle DP
 No DP on AltiVec

Architecture           Size  SGEMM/DGEMM  Size  SGEMV/DGEMV
AMD Opteron 246        3000      2.00     5000      1.70
UltraSparc-IIe         3000      1.64     5000      1.66
Intel PIII Coppermine  3000      2.03     5000      2.09
PowerPC 970            3000      2.04     5000      1.44
Intel Woodcrest        3000      1.81     5000      2.18
Intel XEON             3000      2.04     5000      1.82
Intel Centrino Duo     3000      2.71     5000      2.21

Single precision is faster because:
• Higher parallelism in SSE/vector units
• Reduced data motion
• Higher locality in cache
A quick way to measure this ratio on your own machine is sketched below.
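The SGEMM/DGEMM ratio in the table is easy to reproduce with any CBLAS implementation (for example, link with -lopenblas); the sketch below times one n = 3000 multiply in each precision. The timings are unwarmed wall-clock measurements, so treat the printed ratio as indicative only:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const int n = 3000;                     /* same size as the table above */
    size_t sz = (size_t)n * n;
    double *Ad = malloc(sz * sizeof *Ad), *Bd = malloc(sz * sizeof *Bd), *Cd = malloc(sz * sizeof *Cd);
    float  *As = malloc(sz * sizeof *As), *Bs = malloc(sz * sizeof *Bs), *Cs = malloc(sz * sizeof *Cs);
    for (size_t i = 0; i < sz; i++) {
        Ad[i] = Bd[i] = (double)rand() / RAND_MAX;
        As[i] = Bs[i] = (float)Ad[i];
        Cd[i] = 0.0;  Cs[i] = 0.0f;
    }
    double t0 = now();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, Ad, n, Bd, n, 0.0, Cd, n);
    double t_dp = now() - t0;
    t0 = now();
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0f, As, n, Bs, n, 0.0f, Cs, n);
    double t_sp = now() - t0;
    printf("DGEMM: %.2f s   SGEMM: %.2f s   SGEMM/DGEMM speed ratio: %.2f\n",
           t_dp, t_sp, t_dp / t_sp);
    free(Ad); free(Bd); free(Cd); free(As); free(Bs); free(Cs);
    return 0;
}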
32 or 64 bit Floating Point Precision?
• A long time ago 32-bit floating point was used
 Still used in scientific apps, but limited
• Most apps use 64-bit floating point
 Accumulation of round-off error
    • A 10 Tflop/s computer running for 4 hours performs about 1.4 x 10^17 operations.
 Ill-conditioned problems
 IEEE SP exponent bits too few (8 bits, 10^±38)
 Critical sections need higher precision
    • Sometimes need extended precision (128-bit floating point)
 However, some codes can get by with 32-bit floating point in some parts
• Mixed precision is a possibility
 Approximate in lower precision and then refine or improve the solution to high precision.
A small demonstration of round-off accumulation follows.
38
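A tiny demonstration of the round-off accumulation concern: summing 10^8 copies of 0.1 in 32-bit arithmetic stalls once the running sum is large enough that 0.1 falls below half an ulp, while the 64-bit sum stays close to the reference value. This example is illustrative and is not from the slides:

#include <stdio.h>

int main(void) {
    const long N = 100000000L;          /* 1e8 additions */
    float  s32 = 0.0f;
    double s64 = 0.0;
    for (long i = 0; i < N; i++) { s32 += 0.1f; s64 += 0.1; }
    printf("reference (0.1 * N): %.1f\n", 0.1 * (double)N);
    printf("float  sum         : %.1f\n", (double)s32);
    printf("double sum         : %.4f\n", s64);
    return 0;
}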
Idea Goes Something Like This...
• Exploit 32-bit floating point as much as possible
 Especially for the bulk of the computation
• Correct or update the solution with selective use of 64-bit floating point to provide a refined result
• Intuitively:
 Compute a 32-bit result,
 Calculate a correction to the 32-bit result using selected higher precision, and
 Perform the update of the 32-bit result with the correction using high precision.
39
Mixed-Precision Iterative Refinement
• Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                      SINGLE   O(n^3)
    x = U \ (L \ b)                  SINGLE   O(n^2)
    r = b - Ax                       DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = U \ (L \ r)              SINGLE   O(n^2)
        x = x + z                    DOUBLE   O(n^1)
        r = b - Ax                   DOUBLE   O(n^2)
    END

 Wilkinson, Moler, Stewart, & Higham provide error bounds for SP floating point results when using DP floating point.
 It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
    • Requires extra storage, total is 1.5 times normal
    • O(n^3) work is done in lower precision
    • O(n^2) work is done in high precision
    • Problems if the matrix is ill-conditioned in SP, i.e. a condition number around 10^8 or worse
A code sketch of this loop follows.
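The loop above maps almost line for line onto LAPACKE/CBLAS calls. The sketch below keeps the O(n^3) factorization and the triangular solves in single precision and the residual and update in double; LAPACK ships this algorithm as DSGESV, so this is only an illustration of the idea on a random test system (link with -llapacke and a CBLAS):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>
#include <lapacke.h>
#include <cblas.h>

int main(void) {
    const int n = 500, max_iter = 30;
    double *A = malloc(sizeof *A * n * n), *b = malloc(sizeof *b * n);
    double *x = malloc(sizeof *x * n),     *r = malloc(sizeof *r * n);
    float  *As = malloc(sizeof *As * n * n), *v = malloc(sizeof *v * n);
    lapack_int *ipiv = malloc(sizeof *ipiv * n);

    srand(1);
    for (int i = 0; i < n * n; i++) { A[i] = (double)rand() / RAND_MAX; As[i] = (float)A[i]; }
    for (int i = 0; i < n; i++)     { b[i] = (double)rand() / RAND_MAX; v[i]  = (float)b[i]; }
    double anorm = LAPACKE_dlange(LAPACK_COL_MAJOR, 'I', n, n, A, n);

    /* O(n^3) in single precision: LU = lu(A), then x = U \ (L \ b) */
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);
    LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, v, n);
    for (int i = 0; i < n; i++) x[i] = (double)v[i];

    for (int it = 0; it < max_iter; it++) {
        /* r = b - A x in double precision, O(n^2) */
        cblas_dcopy(n, b, 1, r, 1);
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, A, n, x, 1, 1.0, r, 1);
        double rnorm = 0.0, xnorm = 0.0;
        for (int i = 0; i < n; i++) { rnorm = fmax(rnorm, fabs(r[i])); xnorm = fmax(xnorm, fabs(x[i])); }
        printf("iter %2d   ||r||_inf = %.3e\n", it, rnorm);
        if (rnorm <= 2.0 * n * DBL_EPSILON * anorm * xnorm) break;   /* "small enough" */
        /* z = U \ (L \ r) in single precision, then x = x + z in double */
        for (int i = 0; i < n; i++) v[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, v, n);
        for (int i = 0; i < n; i++) x[i] += (double)v[i];
    }
    free(A); free(b); free(x); free(r); free(As); free(v); free(ipiv);
    return 0;
}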
Results for Mixed Precision Iterative Refinement for Dense Ax = b
• Single precision is faster than DP because:
 Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
 Reduced data motion: 32-bit data instead of 64-bit data
 Higher locality in cache: more data items in cache
[Bar chart: mixed precision iterative refinement results for Ax = b on the architectures below (BLAS library in parentheses).]
 1. Intel Pentium III Coppermine (Goto)
 2. Intel Pentium III Katmai (Goto)
 3. Sun UltraSPARC IIe (Sunperf)
 4. Intel Pentium IV Prescott (Goto)
 5. Intel Pentium IV-M Northwood (Goto)
 6. AMD Opteron (Goto)
 7. Cray X1 (libsci)
 8. IBM Power PC G5 (2.7 GHz) (VecLib)
 9. Compaq Alpha EV6 (CXML)
 10. IBM SP Power3 (ESSL)
 11. SGI Octane (ATLAS)
Results for Mixed Precision Iterative Refinement for Dense Ax = b

Architecture (BLAS-MPI)            # procs      n     DP Solve/SP Solve  DP Solve/Iter Ref  # iter
AMD Opteron (Goto - OpenMPI MX)       32      22627         1.85               1.79            6
AMD Opteron (Goto - OpenMPI MX)       64      32000         1.90               1.83            6
What about the Cell?
• PowerPC at 3.2 GHz
 DGEMM at 5 Gflop/s
 AltiVec peak at 25.6 Gflop/s
    • Achieved 10 Gflop/s SGEMM
• 8 SPUs
 204.8 Gflop/s peak!
 The catch is that this is for 32-bit floating point (single precision, SP),
 and 64-bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
    • Divide the SP peak by 14: a factor of 2 because of DP and 7 because of latency issues.
44
Moving Data Around on the Cell
 256 KB local store per SPE
 25.6 GB/s injection bandwidth
Worst case, memory-bound operations (no reuse of data):
3 data movements (2 in and 1 out) with 2 ops (SAXPY).
For the Cell that is about 4.3 Gflop/s (25.6 GB/s x 2 ops / 12 B).
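Worked out in code, the bound is just bandwidth times arithmetic intensity: a single-precision AXPY does 2 flops per element while moving 12 bytes (read x, read y, write y), so at 25.6 GB/s the ceiling is about 4.3 Gflop/s no matter how fast the SPEs compute:

#include <stdio.h>

int main(void) {
    double bw_bytes       = 25.6e9;   /* injection bandwidth, bytes/s          */
    double flops_per_elem = 2.0;      /* y[i] = a*x[i] + y[i]                  */
    double bytes_per_elem = 12.0;     /* 3 single-precision words per element  */
    printf("memory-bound SAXPY ceiling: %.2f Gflop/s\n",
           bw_bytes * flops_per_elem / bytes_per_elem / 1e9);
    return 0;
}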
IBM Cell 3.2 GHz, Ax = b
[Chart: Gflop/s vs. matrix size (up to 4500).
 SP peak (204 Gflop/s); 8 SGEMMs (embarrassingly parallel); SP Ax=b (IBM): 0.30 secs;
 DSGESV (mixed precision): 0.47 secs; DP peak (15 Gflop/s); DP Ax=b (IBM): 3.9 secs.
 The mixed precision solver runs 8.3X faster than the full double precision solve.]
47
Cholesky on the Cell, Ax=b, A=A^T, x^TAx > 0
[Chart: single precision performance, and mixed precision performance using iterative
 refinement, with the mixed precision method achieving 64-bit accuracy.]
For the SPEs: standard C code and C language SIMD extensions (intrinsics).
48
Cholesky - Using 2 Cell Chips
49
Intriguing Potential
• Exploit lower precision as much as possible
 Payoff in performance
    • Faster floating point
    • Less data to move
• Automatically switch between SP and DP to match the desired accuracy
 Compute the solution in SP and then a correction to the solution in DP
 Correction = A \ (b - Ax)
• Potential for GPUs, FPGAs, special purpose processors
 What about 16-bit floating point?
    • Use as little precision as you can get away with and improve the accuracy
• Applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems, wherever Newton's method is used.
50
IBM/Mercury Cell Blade
 From IBM or Mercury
 2 Cell chips
 Each w/ 8 SPEs
 512 MB/Cell
 ~$8K - 17K
 Some SW
51
Sony PlayStation 3 Cluster PS3-T
 Cell blade (from IBM or Mercury): 2 Cell chips, each w/ 8 SPEs, 512 MB/Cell, ~$8K - 17K, some SW
 PS3 (from WAL*MART): 1 Cell chip w/ 6 SPEs, 256 MB/PS3, $600, download SW, dual boot
52
Cell Hardware Overview
[Diagram: the Cell at 3.2 GHz; one PowerPC core plus 8 SPEs at 25.6 Gflop/s each, connected
 on chip at 200 GB/s, with 25 GB/s to 512 MiB of memory.]
 25 GB/s injection bandwidth
 200 GB/s between SPEs
 32-bit peak: 8 x 25.6 Gflop/s = 204.8 Gflop/s
 64-bit peak: 8 x 1.8 Gflop/s, about 14.6 Gflop/s
 512 MiB memory
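Where the 25.6 Gflop/s per-SPE figure comes from, stated here as an assumption rather than something on the slide: each SPE can issue a 4-wide single-precision fused multiply-add every cycle at 3.2 GHz; the double precision total is taken directly from the slide's 14.6 Gflop/s:

#include <stdio.h>

int main(void) {
    double hz = 3.2e9;                       /* SPE clock                      */
    double sp_per_spe = 4 * 2 * hz;          /* 4 SIMD lanes x (mul + add)     */
    double dp_total   = 14.6e9;              /* DP figure quoted on the slide  */
    printf("SP per SPE: %.1f Gflop/s, 8 SPEs: %.1f Gflop/s\n",
           sp_per_spe / 1e9, 8 * sp_per_spe / 1e9);
    printf("DP total  : %.1f Gflop/s, per SPE: ~%.2f Gflop/s\n",
           dp_total / 1e9, dp_total / 8 / 1e9);
    return 0;
}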
PS3 Hardware Overview
[Diagram: the same Cell at 3.2 GHz, but with only 6 SPEs usable: one SPE is disabled/broken
 (yield issues) and one is reserved by the GameOS hypervisor; 25 GB/s to 256 MiB of memory.]
 25 GB/s injection bandwidth
 200 GB/s between SPEs
 32-bit peak: 6 x 25.6 Gflop/s = 153.6 Gflop/s
 64-bit peak: 6 x 1.8 Gflop/s = 10.8 Gflop/s
 1 Gb/s NIC
 256 MiB memory
PlayStation 3 LU Codes
[Chart: Gflop/s vs. matrix size (up to 2500).
 SP peak (153.6 Gflop/s); 6 SGEMMs (embarrassingly parallel); SP Ax=b (IBM);
 DSGESV (mixed precision); DP peak (10.9 Gflop/s).]
56
Cholesky on the PS3, Ax=b, A=A^T, x^TAx > 0
57
HPC in the Living Room
58
Matrix Multiply on a 4 Node PlayStation 3 Cluster
What's good:
 Very cheap: ~$4 per Gflop/s (against the 32-bit theoretical peak)
 Fast local computations between SPEs
 Perfect overlap between communication and computation is possible (Open MPI running):
    PPE does communication via MPI
    SPEs do computation via SGEMMs
What's bad:
 Gigabit network card: 1 Gb/s is too little for such computational power (150 Gflop/s per node)
 Linux can only run on top of GameOS (hypervisor):
    Extremely high network access latencies (120 usec)
    Low bandwidth (600 Mb/s)
 Only 256 MB local memory
 Only 6 SPEs
[Trace: gold = computation, 8 ms; blue = communication, 20 ms]
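A rough illustration of why the 1 Gb/s NIC dominates: compare the time to update one single-precision tile at the node's ~150 Gflop/s against the time to ship that tile at the measured ~600 Mb/s. The tile size below is arbitrary, chosen only to make the comparison concrete:

#include <stdio.h>

int main(void) {
    int    nb      = 1024;                        /* illustrative tile size   */
    double flops   = 2.0 * nb * (double)nb * nb;  /* one SGEMM tile update    */
    double bytes   = 4.0 * nb * (double)nb;       /* one SP tile on the wire  */
    double gflops  = 150e9;                       /* per-node SP compute      */
    double net_Bps = 600e6 / 8.0;                 /* measured 600 Mb/s        */
    printf("compute one %dx%d tile update: %5.1f ms\n", nb, nb, 1e3 * flops / gflops);
    printf("send    one %dx%d tile       : %5.1f ms\n", nb, nb, 1e3 * bytes / net_Bps);
    return 0;
}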
Users Guide for SC on PS3
• SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
• See the webpage for details
Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the software side.
• Moreover, the return on investment is more favorable to software.
 Hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance ecosystem is out of balance
 Hardware, OS, compilers, software, algorithms, applications
    • No Moore's Law for software, algorithms and applications
Collaborators / Support
 Alfredo Buttari, UTK
 Julien Langou, University of Colorado
 Julie Langou, UTK
 Piotr Luszczek, MathWorks
 Jakub Kurzak, UTK
 Stan Tomov, UTK