Data Movement Dominates (DMD)
and
CoDEx: CoDesign for Exascale
Architectural Simulation and Modeling for Exascale Platform Development

Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ)
Curtis Janssen, Helgi Adalsteinsson, Gilbert Hendry: Sandia National Laboratories
John Shalf, David Donofrio, Paul Hargrove: Lawrence Berkeley National Laboratory
Dan Quinlan, Chunhua Liao: Lawrence Livermore National Laboratory
Keren Bergman: Columbia University
Bruce Jacob: U. Maryland
Sudhakar Yalamanchili: Georgia Tech
http://www.nersc.gov/projects/CoDEx
Codesign Tools Recap
Architectural Simulation to Accelerate CoDesign
• ROSE (application analysis)
– ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.
• ACE (node-level emulation)
– ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping of cycle-accurate models of manycore node designs.
• SST (system-level models)
– SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.
– SST Micro Software Simulators: Software simulation for node-level simulation.
Codesign Tools Recap
Projects behind the tools:
• CoDEx: CoDesign for Exascale. ASCR-funded Simulation Infrastructure Project
• SST: Structural Simulation Toolkit. NNSA-funded Simulation Tools (ASC Program)
• CAL: Computer Architecture Laboratory (Sandia/LBL)
Fidelity vs. Scope for Architectural Simulation Methods
[Figure: simulation scope/parallelism (10^0 to 10^7) plotted against simulation fidelity (crude guess, rough idea, cause and effect, very good estimates, exact hardware model). Methods shown: coarse-grained simulation (SST/macro), constitutive models, and emulation (Green Flash).]
ROSE Compiler
Full Program Understanding through Deep Source-Code Analysis
http://www.roseCompiler.org
[Figure: ROSE pipeline. C/C++/Fortran/OpenMP/UPC source enters the EDG front-end / Open Fortran Parser, passes through the EDG/Fortran-to-ROSE connector into the Internal Representation (IR), where program analysis and user program transformation are applied; the ROSE unparser emits transformed source code for the vendor compiler.]
Graphs built by ROSE for the example: control-flow, system-dependency, and sliced system-dependency.
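Slicing example:

    /* Returns its first argument; c is computed but never used. */
    int aFunction(int a, int b)
    {
        int c = b;
        return a;
    }

    int main(void)
    {
        int a = 0, b = 0, c, d, e;  /* a and b initialized so the loop reads defined values */
        int i = 4;
        for (i = 0; i < 10; i++)
        {
            int j = 55;
            c = i + j;
            c = aFunction(i, c);
            a = aFunction(a + 1, b);
        }
    #pragma SliceTarget
        a;                          /* slice criterion: the value of a at this point */
        return 0;
    }

[Figure: ROSE builds its intermediate representation from source code or from a binary executable (disassembly and representation plus instruction semantics). Graphs shown for the example: data-dependency and control-dependency.]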
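The slicing example above can be mimicked with a small dependence-closure sketch. This is only an illustration, not ROSE's implementation: the statement table is written out by hand, the analysis is flow-insensitive (so it conservatively keeps `int i = 4` even though `i = 0` overwrites it), and the call to `aFunction` is treated as using both arguments.

```python
# Illustrative backward slice over the loop in main(); not ROSE's algorithm.
# Each entry: (statement text, variable defined, variables used,
#              variables controlling whether the statement executes).
STATEMENTS = [
    ("int i = 4",               "i", set(),      set()),
    ("i = 0",                   "i", set(),      set()),
    ("i++",                     "i", {"i"},      {"i"}),
    ("int j = 55",              "j", set(),      {"i"}),
    ("c = i + j",               "c", {"i", "j"}, {"i"}),
    ("c = aFunction(i, c)",     "c", {"i", "c"}, {"i"}),
    ("a = aFunction(a + 1, b)", "a", {"a", "b"}, {"i"}),
]


def backward_slice(statements, target):
    """Return the statements that may influence `target`, in program order."""
    relevant = {target}   # variables whose definitions are still needed
    selected = set()      # indices of statements already in the slice
    changed = True
    while changed:        # iterate until no new statement is added
        changed = False
        for index, (_, defined, used, controls) in enumerate(statements):
            if defined in relevant and index not in selected:
                selected.add(index)
                relevant |= used | controls
                changed = True
    return [statements[i][0] for i in sorted(selected)]


slice_for_a = backward_slice(STATEMENTS, "a")
print(slice_for_a)
```

The statements that define `c` and `j` drop out of the slice because nothing on the path to `a` reads them.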
ExaSAT: Exascale Static Analysis Tool
Compiler-Automated Performance Model Extraction
• Can automatically predict performance for many input codes and software optimizations
• Predict performance under different architectural scenarios
• Much faster than hardware simulation and manual modeling
[Figure: combustion codes pass through compiler analysis to produce an <XML> performance model and a dependency graph for optimization; combined with machine parameters and user parameters, the model feeds a performance prediction spreadsheet.]
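As a rough illustration of predicting performance under different architectural scenarios, the sketch below evaluates one set of kernel counts against two sets of machine parameters. It assumes a roofline-style bound (runtime is the larger of compute time and memory time); that form, the class names, and all numbers are assumptions made for illustration and are not taken from ExaSAT.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    peak_gflops: float   # peak floating-point rate in GFLOP/s (hypothetical)
    mem_bw_gbs: float    # sustained memory bandwidth in GB/s (hypothetical)


@dataclass
class KernelCounts:
    flops: float         # floating-point operations per sweep
    bytes_moved: float   # bytes moved to or from memory per sweep


def predict_seconds(kernel, machine):
    """Roofline-style estimate: limited by compute or by memory traffic."""
    compute_time = kernel.flops / (machine.peak_gflops * 1e9)
    memory_time = kernel.bytes_moved / (machine.mem_bw_gbs * 1e9)
    return max(compute_time, memory_time)


# Hypothetical kernel counts, as a compiler analysis might report them.
stencil = KernelCounts(flops=8e9, bytes_moved=16e9)

# Two hypothetical architectural scenarios that differ only in bandwidth.
scenarios = {
    "baseline": Machine(peak_gflops=100.0, mem_bw_gbs=50.0),
    "more_bandwidth": Machine(peak_gflops=100.0, mem_bw_gbs=400.0),
}

predictions = {name: predict_seconds(stencil, m) for name, m in scenarios.items()}
for name, seconds in predictions.items():
    print(f"{name}: {seconds:.3f} s")
```

With these made-up numbers the baseline is memory-bound and the higher-bandwidth scenario becomes compute-bound, which is the kind of what-if question an extracted model can answer without running a simulator.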
SST/macro: Coarse-Grained Simulation
The SST/Macro Simulator
• Application: an application code with minor modifications; skeleton application threads run as lightweight threads (spawn)
• Software libraries: SST/Macro implementation of interfaces (MPI), which simulate execution and communication; complete library implementations (e.g. MPI, Compute)
• Services (e.g. job scheduler)
• Node: processor, memory, disk, and NIC; coarse-grain hardware models
• Network
• Discrete event simulation on established simulation platforms (SST/macro, SST/core, OMNeT++, SystemC)
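The idea of a skeleton application whose MPI-like calls only simulate execution and communication can be sketched as follows. This is a minimal illustration, not the SST/macro API: the `SimulatedRank` class, its methods, and the latency-plus-bandwidth cost model are all assumptions.

```python
class SimulatedRank:
    """One simulated process: no real work is done, only virtual time advances."""

    def __init__(self, rank, latency_s, bandwidth_bytes_per_s):
        self.rank = rank
        self.clock = 0.0                      # virtual time in seconds
        self.latency_s = latency_s
        self.bandwidth = bandwidth_bytes_per_s

    def compute(self, seconds):
        """Stand-in for a compute phase: just advance the virtual clock."""
        self.clock += seconds

    def send(self, dest, num_bytes):
        """Model a blocking send/receive pair; both ranks finish together."""
        start = max(self.clock, dest.clock)   # the receiver must be ready
        finish = start + self.latency_s + num_bytes / self.bandwidth
        self.clock = finish
        dest.clock = finish


def run_skeleton(num_iterations):
    """Skeleton of a two-rank exchange; returns the final virtual clocks."""
    a = SimulatedRank(0, latency_s=1e-6, bandwidth_bytes_per_s=1e9)
    b = SimulatedRank(1, latency_s=1e-6, bandwidth_bytes_per_s=1e9)
    for _ in range(num_iterations):
        a.compute(5e-6)
        b.compute(2e-6)
        a.send(b, 1000)   # 1000 bytes at 1 GB/s adds 1 microsecond
    return a.clock, b.clock


print(run_skeleton(2))
```

Because only clocks are updated, the cost of the simulation depends on the number of modeled events rather than on the work the real application would do.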
SST Simulation Project Overview
SST/micro: Cycle-Accurate Framework
(The left edge of this slide was cropped in extraction; "…" marks lost text.)
Goals
• …e standard architectural framework for HPC
• …evaluate future systems …workloads
• …computers to design …computers
Status
• Current Release (2.1) at code.google.com/p/sst-simulator/
• Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
• Has a general simulation framework for integrating models
• Simulation backend is parallel
Technical Approach
• …Discrete Event core with …optimization over MPI
• Tech. models for power …Panalyzer
• …d simple models for …network, and memory
• …non viral, modular
Consortium
• "Best of Breed" simulation suite
• Plenty of people involved
• Combine Lab, academic, & industry
Some Models Currently Integrated
Gem5 & MacSim
• Gem5
– M5: Modular platform for computer system architecture research, encompassing system-level architecture as well as processor microarchitecture.
– Provides detailed, full-system CPU models for x86, ARM, SPARC, Alpha
– Gem5 is a well-known architectural simulator with models for processors, caches, busses, and network components.
[Figure: Gem5 integration into SST. Labels: Gem5 SimObject, Port, Gem5 Queue, SST Queue, SST::Component, SST::Link.]
• MacSim
– Trace driven (x86 & PTX)
– Models OoO and GPU-like
– MacSim provides a model of GPU/CPU cores or heterogeneous computing nodes, which can be driven from x86 or PTX (CUDA) traces.

MacSim: Heterogeneous Architecture Timing & Power Simulator
Introduction (Wednesday, March 28, 2012)
MacSim is a heterogeneous architecture simulator that supports the x86 ISA and the NVIDIA PTX ISA. It is a trace-driven, cycle-level simulator that can run homogeneous-ISA and heterogeneous-ISA multicore simulations. It uses Ocelot for PTX trace generation and Pin to generate x86 traces. Both kinds of trace are converted to internal RISC-style uops, and those uops are simulated. MacSim is a microarchitecture simulator that models a detailed pipeline (in-order and out-of-order) and a memory system including caches, NoC, and memory controllers. It also supports asymmetric multicore configurations (small cores + medium cores + big cores) and SMT or MT architectures.
Currently an interconnection network model (based on IRIS) and a power model (based on McPAT) are connected. ARM ISA support is in progress. MacSim is also one of the components of SST, so multiple MacSim simulators can run concurrently.
Figure 1 shows the overview of the MacSim simulator.
[Figure 1. The overview of MacSim Simulator: CUDA code (*.cu) is compiled by NVCC to PTX code, which GPUOcelot (emulator/trace generator) turns into traces; x86 binaries are traced with the Pin trace generator; both trace types drive MacSim.]
[Figure: MacSim inside SST. Labels: MacSim CPU and GPU cores, cache, SST Mem, NoC.]
• IRIS provides a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.
• PhoenixSim models photonic networks.
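The component-and-link structure named on this slide (SST::Component, SST::Link) can be illustrated with a toy discrete-event core. This is a sketch in Python with hypothetical classes (`Simulator`, `Link`, `Core`, `Memory`); it does not reproduce SST's C++ interfaces, and the latencies are made up.

```python
import heapq
import itertools


class Simulator:
    """Minimal discrete-event core: a time-ordered queue of callbacks."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._tie = itertools.count()   # breaks ties between equal timestamps

    def schedule(self, delay, callback, payload):
        entry = (self.now + delay, next(self._tie), callback, payload)
        heapq.heappush(self._queue, entry)

    def run(self):
        while self._queue:
            self.now, _, callback, payload = heapq.heappop(self._queue)
            callback(payload)


class Link:
    """Delivers payloads to a destination component after a fixed latency."""

    def __init__(self, sim, latency, destination):
        self.sim, self.latency, self.destination = sim, latency, destination

    def send(self, payload):
        self.sim.schedule(self.latency, self.destination.handle, payload)


class Memory:
    """Replies to every request after a fixed access time."""

    def __init__(self, sim, access_cycles):
        self.sim, self.access_cycles, self.reply_link = sim, access_cycles, None

    def handle(self, address):
        self.sim.schedule(self.access_cycles, self.reply_link.send, address)


class Core:
    """Records the time at which each reply arrives."""

    def __init__(self, sim):
        self.sim, self.completions = sim, []

    def handle(self, address):
        self.completions.append((address, self.sim.now))


sim = Simulator()
core = Core(sim)
memory = Memory(sim, access_cycles=50)
to_memory = Link(sim, latency=10, destination=memory)
memory.reply_link = Link(sim, latency=10, destination=core)

to_memory.send(0x1000)   # the core issues one read at time 0
sim.run()
print(core.completions)
```

Because components interact only through links with explicit latencies, a different memory or core model can be swapped in without touching the others, which is the property that lets separately developed simulators be combined.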
Leveraging Embedded Design Automation For Design Space Exploration
Processor configuration
1. Select from menu
2. Automatic instruction discovery (XPRES Compiler)
3. Explicit instruction description (TIE)
Processor Generator (Tensilica)
• Application-optimized processor implementation (RTL/Verilog): Base CPU, OCD, Apps Datapaths, Cache, Timer, Extended Registers, FPU
• Tailored SW Tools: Compiler, debugger, simulators, Linux, other OS Ports (automatically generated together with the core). This stuff is essential!
• Build with any process in any fab
Embedded Design Automation
(Using FPGA emulation to do rapid prototyping)
• Same flow: processor configuration, Processor Generator (Tensilica), application-optimized processor implementation (RTL/Verilog), and tailored SW tools generated together with the core
• Build with any process in any fab, or "tape out" to FPGA
• RAMP FPGA-accelerated emulation of ASIC
Data Movement Dominates (Sandia, Micron, Columbia, LBL)
Understand the Potential of Intelligent, Stacked DRAM Technology
• Data movement is projected to account for over 75% of the power budget for an exascale platform
• Work to reduce that via
– Optical interconnect(s)
– 3D stacking (logic + memory + optics)
– New memory protocols
[Figure: 3D stack of DRAM layers, a logic layer, and a photonic layer, with laser source, waveguide, modulators, and receivers.]
Research Questions
– What is the performance potential of stacked memory (power & speed)?
– How much intelligence should be put into the logic layer?
• Atomics, gather/scatter, checksums, full-processor-in-memory
– What is the memory consistency model for intelligent DRAM?
– How do we program it if we embed more intelligence into DRAM?
The Cost of Moving Data
[Figure: energy per operation in picojoules (log scale, 1 to 10000), now versus 2018, for: DP FLOP, register, 1 mm on-chip, 5 mm on-chip, off-chip/DRAM, local interconnect, cross system. Annotated regions: on-chip/CMP communication, intranode/SMP communication, intranode/MPI communication.]
Locality Management is Key
What is the best combination of software and hardware mechanisms to maximize data movement efficiency?
• Vertical Locality Management: Temporal
• Horizontal Locality Management: Topological
Why Study Chip Stacking (TSVs)?
$\text{Energy} = (V^2 \cdot C) \cdot \text{Overhead} + E_{\text{comm}}$

DRAM Cells Efficient
• DRAM cells require < 1 pJ to access
• Current DRAM architectures are not power efficient
• Long distances ➔ high power
• We pay for more than we get at every level
– Cache: throw away 75-80%
– DRAM Row: Charge 1024B for each 64B access
– DIMM: Charge 8-9 chips/access
– ~800 pJ/byte total
• DRAM design driven by packaging constraints
– ~50% of DRAM chip cost is packaging, mainly in pins
– DIMMs use multiple chips with a few data pins to achieve high BW

TSVs Reduce Costs
• TSVs use orders of magnitude less energy
– 250 fJ/bit for reading DRAM
– 5 fJ/bit for TSV
– 250 fJ/bit for mem. controller
– ~0.5 pJ/bit (compared to 30 pJ for conventional DIMM)
– Don't have to access more data than needed
• Enables....
– Lower Capacitance: Narrower
– Lower Overhead: Smarter
– In-Memory computation
• Requires
– ...changes to how we view the machine & the memory
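The per-bit figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names, and the reading of the 30 pJ DIMM figure as a per-bit cost, are assumptions for illustration.

```python
FJ_PER_PJ = 1000.0

# Per-bit energy components of a stacked (TSV) access, in femtojoules.
stacked_fj_per_bit = {
    "dram_read": 250.0,
    "tsv": 5.0,
    "memory_controller": 250.0,
}

# 250 + 5 + 250 = 505 fJ/bit, i.e. roughly the ~0.5 pJ/bit on the slide.
stacked_pj_per_bit = sum(stacked_fj_per_bit.values()) / FJ_PER_PJ

# Conventional DIMM figure from the slide, read here as a per-bit cost.
dimm_pj_per_bit = 30.0
savings_factor = dimm_pj_per_bit / stacked_pj_per_bit

# Row overfetch: a 1024-byte row is charged for each 64-byte access.
row_overfetch = 1024 / 64

# Energy to move one 64-byte cache line under each figure, in picojoules.
line_bits = 64 * 8
stacked_line_pj = line_bits * stacked_pj_per_bit
dimm_line_pj = line_bits * dimm_pj_per_bit

print(f"stacked: {stacked_pj_per_bit:.3f} pJ/bit, "
      f"about {savings_factor:.0f}x less than the DIMM figure")
print(f"row overfetch: {row_overfetch:.0f}x")
print(f"64 B line: {stacked_line_pj:.1f} pJ stacked, {dimm_line_pj:.0f} pJ DIMM")
```

The TSV itself contributes about 1% of the stacked total, so the remaining cost sits in the DRAM read and the memory controller.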
Why Photonics?
Photonics changes the rules for Bandwidth-per-Watt.
PHOTONICS:
• Modulate/receive data stream once per communication event.
• Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream.
• Off-chip BW ≈ on-chip BW for nearly same power.
ELECTRONICS:
• Buffer, receive, and re-transmit at every router.
• Space Parallelism: Each bus lane routed independently (P ∝ N_LANES).
• Off-chip BW requires much more power than on-chip BW.
[Figure: electronic path with a TX/RX pair at every router versus a photonic path with a single TX/RX pair end to end.]
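The contrast above (re-transmit at every router versus modulate and receive once per communication event) can be written as a small parametric model. This is a sketch under simple assumptions: energy per bit is linear in the hop count for electronics and independent of it for photonics, and every numeric value is a placeholder rather than a measurement.

```python
def electronic_pj_per_bit(hops, router_pj_per_bit):
    """Electronics: the bit is buffered, received, and re-sent at every router."""
    return hops * router_pj_per_bit


def photonic_pj_per_bit(modulate_pj_per_bit, receive_pj_per_bit):
    """Photonics: modulate once and receive once, regardless of distance."""
    return modulate_pj_per_bit + receive_pj_per_bit


def electronic_bus_power(num_lanes, watts_per_lane):
    """Space parallelism: power grows in proportion to the number of lanes."""
    return num_lanes * watts_per_lane


# Placeholder parameters chosen only to show the shape of the comparison.
ROUTER_PJ = 1.0
MODULATE_PJ = 1.5
RECEIVE_PJ = 0.5

for hops in (1, 2, 4, 8):
    electronic = electronic_pj_per_bit(hops, ROUTER_PJ)
    photonic = photonic_pj_per_bit(MODULATE_PJ, RECEIVE_PJ)
    print(f"{hops} hops: electronic {electronic:.1f} pJ/bit, "
          f"photonic {photonic:.1f} pJ/bit")
```

Under this model the crossover point depends entirely on the chosen parameters; what the model captures is only that the electronic cost scales with hops and lanes while the photonic cost does not.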
Why Optically-Connected Memory?
Traditional Memory (Electronic Bus)
• Large pin-out
• Complex wiring
• Low bandwidth density
• Distance constrained by electrical limitations
• High power dissipation
Will not scale to meet power and bandwidth requirements of future high-performance computing systems
Optically-Connected Memory (Optical Link)
• All-optical link, no electronic bus to drive
• Bit-rate transparent link
• High bandwidth density, fewer pins
• Distance immunity at computer scale
• Low power dissipation
Enables scaling of high-performance computing through increased memory capacity and bandwidth
[Figure: CPU connected to HBDRAM modules over an electronic bus (traditional) and over an optical link (optically connected).]
Mixed Model Simulation
Cycle-accurate and energy-accurate models
[Figure: simulator infrastructure. Components shown: MPI traces (DUMPI), kernels, and skeleton app as workloads (C, C++, Fortran); workload translation; SST/macro; SystemC; processor model (SST/micro & Tensilica); address translation; NoC model (PhoenixSim); memory model (DRAMSim2, FLASHsim, NVRAM); checkpoint/restart; fault injection.]
Simulator Infrastructure: Interconnects
Cycle-accurate and energy-accurate models
[Figure: the simulator infrastructure diagram and the cost-of-moving-data chart, repeated from earlier slides.]
• Developed by Sandia Collaborators
• CoDEx project
Simulator Infrastructure: Memory
Cycle-accurate and energy-accurate models
[Figure: the simulator infrastructure diagram and the cost-of-moving-data chart, repeated from earlier slides.]
• Validated against Micron DRAM
• HMC model coming this summer
Simulator Infrastructure
Cycle-accurate and energy-accurate models
[Figure: the simulator infrastructure diagram and the cost-of-moving-data chart, repeated from earlier slides.]
• Rewrote Columbia PhoenixSim summer 2011
• Orion-2 energy model
• Validated against Cornell test parts
Simulator Infrastructure
Cycle-accurate and energy-accurate models
[Figure: the simulator infrastructure diagram and the cost-of-moving-data chart, repeated from earlier slides.]
• Full gate-level RTL model of processor
• Well-characterized energy model