Part VI: 1hr: Discussions of using MacSim and Ocelot

MacSim Tutorial (In ICPADS 2013)
| The Structural Simulation Toolkit: A Parallel Architectural
Simulator (for HPC)




A parallel simulation environment based on MPI
Fully modular design that enables extensive exploration of an
individual system parameter without the need for intrusive changes to
the simulator
Includes parallel simulation core, configuration, power models, basic
network and processor models, and interface to detailed memory
model
SST-download link: http://sst.sandia.gov/
| Processor Components


MacSim
Gem5
| Memory Components



DRAMSim2
VaultSim (3D memory model)
MemHierarchy
| Network Components


Merlin
Iris
| Multiple MacSim components can be instantiated
| Each of them can act as

An entire GPU node (composed of multiple SMs)
A heterogeneous computing node (CPU + GPU)
A GPU/CPU core
Any combination of the above
| MacSim can talk to

memHierarchy
| MacSim can make use of memHierarchy's cache hierarchy.
This means MacSim can be configured with whatever memory system is
connected to memHierarchy:

DRAMSim2 or VaultSim.
| Pipeline Stages with memHierarchy
[Figure: MacSim pipeline stages (Front-end, Decode, Rename, Schedule, Execution, Retire). The I-Cache and D-Cache are provided by memHierarchy (MH) and connect over SST Links to VaultSim.]
| MacSim can directly talk to


DRAMSim2
VaultSim
| MacSim's versatile memory controller interface lets it talk
directly to DRAMSim2 and VaultSim.
| Pipeline Stages with external memory component
[Figure: MacSim pipeline stages (Front-end, Decode, Rename, Schedule, Execution, Retire). The I-Cache and D-Cache are MacSim's own (MS) and connect over SST Links to VaultSim.]
| An SST component that models a memory hierarchy, such
as multiple cache levels

Subcomponents: Cache, Bus, Memory Controller
| Usage

Processor Component(s) + memHierarchy(s) + Memory Component(s)



MacSim + L1/L2 cache + DRAMSim2
MacSim + L1/L2 cache + (3D memory model)
(MacSim + private L1 cache) + (Gem5 + private L1 cache) + shared L2
cache + (DRAMSim2 or 3D memory model)
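
The third composition above can be written down as a small link list to check that every processor actually reaches a memory component. This is only an illustrative sketch in plain Python: the component labels and the `downstream` helper are hypothetical and are not part of SST or its configuration format.

```python
def downstream(start, links):
    """Return every component reachable from start by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in links:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


# (MacSim + private L1) + (Gem5 + private L1) + shared L2 + DRAMSim2
# Each pair stands for one link between two components (labels are made up).
links = [
    ("MacSim", "L1 (MacSim)"),
    ("Gem5", "L1 (Gem5)"),
    ("L1 (MacSim)", "shared L2"),
    ("L1 (Gem5)", "shared L2"),
    ("shared L2", "DRAMSim2"),
]

for cpu in ("MacSim", "Gem5"):
    # Every processor component should end up at the memory component.
    assert "DRAMSim2" in downstream(cpu, links)
    print(cpu, "reaches", sorted(downstream(cpu, links)))
```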
| MacSim is encapsulated as an SST Component; SST feeds clocks into
MacSim and provides communication channels.
| By talking to memHierarchy, MacSim can indirectly communicate with a
range of memory components without modifying its interface.
[Figure: example topology. MacSim instances (SST::Component) connect through SST::Links to L1 caches (memHierarchy); the memory side is either an L2 (memHierarchy) backed by DRAMSim2 (SST::Component) or VaultSim.]
[Figure: example heterogeneous topologies. MacSim and Gem5 instances (each an SST::Component) connect through SST::Links to L1 caches (memHierarchy) and an L2 (memHierarchy); the memory side combines DRAMSim2 (SST::Component) and VaultSim, with VaultSim also shown as an LLC.]
| Make sure the macsimComponent directory does not contain a .ignore file;
otherwise the SST build system will skip the component
| How to build: see the instructions on the SST website
| How to execute: pay special attention to the following files

SDL (or XML): SST component configuration
trace_file_list: which traces to execute. Can be specified in the
aforementioned SDL file
params.in: MacSim configuration, in which you can specify…

Whether MacSim uses its internal cache or memHierarchy as its cache
Which DRAM controller to use: its internal FCFS/FRFCFS-based
controller, the DRAMSim2 controller, or the VaultSim controller
| Specific examples follow in the next slides
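
The .ignore check is easy to automate before building. A minimal sketch, assuming the component lives in a directory called `macsimComponent`; the path used below is hypothetical and depends on your SST checkout.

```python
from pathlib import Path


def component_is_ignored(component_dir):
    """Return True if the directory contains a .ignore marker file."""
    return (Path(component_dir) / ".ignore").is_file()


# Hypothetical location; adjust to where macsimComponent sits in your SST tree.
path = Path("sst/elements/macsimComponent")
if component_is_ignored(path):
    print(f"{path}/.ignore exists: remove it so the component gets built")
else:
    print(f"no .ignore file found in {path}")
```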
| params.in


use_memhierarchy = 0
dram_scheduling_policy = FRFCFS or FCFS
| SDL (or XML)


Only the macsimComponent configuration is needed
In this case, the link configuration is not used
| params.in


use_memhierarchy = 1
Note, when use_memhierarchy is set to 1, MacSim’s DRAM controller
configuration has no effect at all
| SDL (or XML)


Specify memHierarchy’s cache configuration like the following
Similar configuration for D-cache as well
| params.in


use_memhierarchy = 1
Note, when use_memhierarchy is set to 1, MacSim’s DRAM controller
configuration has no effect at all
| SDL (or XML)


Specify MemController configuration for DRAMSim2 like the following
Note, DRAMSim2 configurations should be appended
| params.in


use_memhierarchy = 1
Note, when use_memhierarchy is set to 1, MacSim’s DRAM controller
configuration has no effect at all
| SDL (or XML)


Specify MemController configuration for VaultSim like the following
Note, VaultSim configurations should be appended
| params.in


use_memhierarchy = 0
dram_scheduling_policy = DRAMSIM
| SDL (or XML)

Specify configurations for DRAMSim2 like the following
| params.in


use_memhierarchy = 0
dram_scheduling_policy = VAULTSIM
| SDL (or XML)

Nothing special is needed, except that macsimComponent's mem_link must
match VaultSim's toCPU link
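
The params.in settings from the configuration slides above can be summarized in a few lines of code. This is a sketch only: `params_in_lines` is a hypothetical helper, not part of MacSim, and the `key = value` layout simply copies the slides, so check it against the params.in shipped with your MacSim version.

```python
# Values for dram_scheduling_policy named on the slides.
INTERNAL_POLICIES = {"FCFS", "FRFCFS"}      # MacSim's own DRAM controller
EXTERNAL_POLICIES = {"DRAMSIM", "VAULTSIM"}  # DRAMSim2 / VaultSim controllers


def params_in_lines(use_memhierarchy, dram_policy=None):
    """Build the params.in lines for one of the setups described above."""
    if use_memhierarchy:
        # With memHierarchy, MacSim's DRAM controller setting has no effect,
        # so only the cache switch is emitted.
        return ["use_memhierarchy = 1"]
    if dram_policy not in INTERNAL_POLICIES | EXTERNAL_POLICIES:
        raise ValueError(f"unknown dram_scheduling_policy: {dram_policy!r}")
    return ["use_memhierarchy = 0", f"dram_scheduling_policy = {dram_policy}"]


# MacSim's internal cache talking directly to VaultSim:
print("\n".join(params_in_lines(False, "VAULTSIM")))
```

The memory component behind memHierarchy (DRAMSim2 or VaultSim) is chosen in the SDL file rather than in params.in, which is why the helper takes no such argument when `use_memhierarchy` is set.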
Front-end
• Thread fetch policies
• Branch predictor
Memory System
• Software and hardware prefetcher
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
Misc.
• Power model
[Figure: Trace Generator (PIN, GPUOcelot) → MacSim Frontend → Memory System with Hardware Prefetcher]
Software prefetch instructions
PTX: prefetch, prefetchu
x86: prefetcht0, prefetcht1, prefetchnta
Hardware prefetch requests
Stream, stride, GHB, …
• Many-thread Aware Prefetching Mechanism [Lee et al., MICRO-43, 2010]
• When prefetching works, when it doesn't, and why [Lee et al., ACM TACO, 2012]
• Spare Register Aware Prefetching for Graph Algorithms on GPUs [Lakshminarayana, HPCA 2014]
| Cache studies: sharing, inclusion property
| On-chip interconnection studies
[Figure: private caches ($) connected through an interconnection to a shared cache]
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]
| Heterogeneous link configuration
[Figure: ring networks connecting CPU cores (C0 to C2), GPU cores (G0 to G2), L3 slices, and memory controllers (M0, M1), shown with two different node orderings]
Different topologies
[Figure: grid layouts of CPU (C), GPU (G), and memory controller (M) nodes]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al., JPDC 2013]
[Figure: Trace Generator (GPUOcelot) → Frontend → Execution → DRAM]
Frontend (fetch policies): RR, ICOUNT, FAIR, LRF, …
DRAM (scheduling policies): FCFS, FRFCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]
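
To make the difference between two of the DRAM policies named above concrete, here is a minimal sketch of textbook FCFS and FR-FCFS request selection for a single bank. It is not MacSim's implementation; the `Request` class and function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # arrival order (smaller = older)
    row: int      # DRAM row the request targets


def pick_fcfs(queue):
    """FCFS: always service the oldest request."""
    return min(queue, key=lambda r: r.arrival)


def pick_frfcfs(queue, open_row):
    """FR-FCFS: prefer the oldest row hit; fall back to the oldest request."""
    hits = [r for r in queue if r.row == open_row]
    return pick_fcfs(hits) if hits else pick_fcfs(queue)


queue = [Request(0, row=7), Request(1, row=3), Request(2, row=3)]
print(pick_fcfs(queue).arrival)                # 0: the oldest request
print(pick_frfcfs(queue, open_row=3).arrival)  # 1: the oldest hit on row 3
```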
[Figure: a DRAM controller in front of a DRAM bank holds request queues W0 to W3 for Core-0 and for Core-1; each queued request is a row hit (RH) or a row miss (RM).]

Potential of requests from Core-0:
$|W_0|^{\alpha} + |W_1|^{\alpha} + |W_2|^{\alpha} + |W_3|^{\alpha} = 4^{\alpha} + 3^{\alpha} + 5^{\alpha}$, with $\alpha < 1$

Tolerance(Core-0) < Tolerance(Core-1) → select Core-0

Reduction in potential if:
a row hit from a queue of length $L$ is serviced next → $L^{\alpha} - (L - 1)^{\alpha}$
a row miss from a queue of length $L$ is serviced next → $L^{\alpha} - (L - 1/m)^{\alpha}$
$m$ = cost of servicing a row miss / cost of servicing a row hit

Servicing a row hit from W1 (of Core-0) results in the greatest reduction
in potential, so row hits from W1 are serviced next.

• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al., IEEE CAL, 2011]
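
The potential-function arithmetic on this slide can be checked numerically. The sketch below uses the queue lengths 4, 3 and 5 from the example; the value α = 0.5 is an assumption (the slide only requires α < 1), and the function names are illustrative rather than taken from MacSim.

```python
def potential(queue_lengths, alpha):
    """Sum of L**alpha over the request-queue lengths of one core."""
    return sum(length ** alpha for length in queue_lengths)


def reduction_row_hit(length, alpha):
    """Potential drop when a row hit from a queue of this length is serviced."""
    return length ** alpha - (length - 1) ** alpha


def reduction_row_miss(length, alpha, m):
    """Potential drop for a row miss; m = row-miss cost / row-hit cost."""
    return length ** alpha - (length - 1 / m) ** alpha


alpha = 0.5          # assumed value; the slide only states alpha < 1
lengths = [4, 3, 5]  # queue lengths from the slide's example

print("potential:", potential(lengths, alpha))
for length in lengths:
    print("length", length, "row-hit reduction", reduction_row_hit(length, alpha))
```

With α < 1 the reduction shrinks as the queue grows, so the queue of length 3 gives the largest reduction. If the three terms correspond to W0, W1 and W2 in order, that queue is W1, which agrees with the slide's conclusion.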
[Figure: out-of-the-box MacSim. The Trace Generator (PIN, GPUOcelot) supplies CPU traces (x86) and GPU traces (CUDA) to the Frontend; memory requests flow through the Memory System (Cache Hierarchy) to Off-Chip Memory and to the 3D Stacked DRAM Model (new module) with its DRAM stacks.]
Configure 3-D Stack as
• DRAM caches
• Part of main memory
• Resilient Die-stacked DRAM Caches [Sim et al., ISCA-40, 2013]
• A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch [Sim et al., MICRO, 2012]
| Verifying simulator and GTX580
| Modeling X86-CPU power
| Modeling GPU power

Still on-going research
[Figure: bar chart comparing Modeled and Measured values per benchmark (axis label: IPC); the benchmark names were not recoverable.]
[Figure: power breakdown by component: Fetch 3%, Decode 1%, Schedule 3%, RF 5%, MMU 0%, L1 27%, EX_alu 6%, EX_LD/ST 3%, EX_SFU 1%, EX_fpu 48%, ConstCache 1%, TextureCache 1%, SharedMem 1%.]
ARM Architecture
Mobile Platform
Power/Energy Model
2013 ~ 2014
Download