Overview of MacSim

HPArch Research Group
| Part 2. Overview of MacSim
Introduction
For black box approach users
| Part 3. Details of MacSim
For computer architecture researchers
| Part 4.
MacSim-SST case studies
Ocelot-MacSim case studies
Research using Ocelot
Research using MacSim
MacSim Tutorial (In ISCA-39, 2012)
| Heterogeneous architecture simulator (x86 + PTX)
| Developed at Georgia Tech
| Trace-driven simulator
Internal RISC-style micro-op generation module
x86 traces are generated with Pin; PTX traces are generated with GPUOcelot
| Cycle-level simulator
Cores, caches, and memory systems are modeled
| Supports various simulations: single/multi-threaded applications, multi-program workloads, heterogeneous (CPU+GPU) workloads
| Flexible design to support various platforms
| Integration with a parallel simulator (SST) to support high-performance computing systems
| From mobile to exascale computing systems
[Figure: MacSim tool flow. CUDA code (.cu) is compiled by NVCC into PTX code, which goes through the GPUOcelot trace generator (Prof. Yalamanchili, Georgia Tech). X86 binaries go through the PIN trace generator. OpenGL code goes through Attila (OpenGL emulator) and a PIN API generator (ongoing work). The resulting instruction and thread information feeds the heterogeneous architecture timing & power simulator.]
| Getting MacSim
Stable version: Google Code project
http://macsim.googlecode.com/files/macsim-1.0.tar.gz
Latest code: SVN repository
| Directions are explained at
http://code.google.com/p/macsim/wiki/GettingMacsim
| How to build
http://code.google.com/p/macsim/wiki/BuildingMacsim
Chapter 2 of the manual provides build instructions
README file in the simulator directory
| MacSim package
IRIS (NoC simulator from Prof. Yalamanchili’s group) is included
CPU trace generator
Download PIN separately. The trace generator tool is in the MacSim package
GPU trace generator
Download Ocelot separately. The trace generator is in the Ocelot package
| MacSim-SST
SST needs to be downloaded separately
| Energy Introspector (from Prof. Yalamanchili’s group)
EI is a power model based on McPAT and HotSpot.
Because of a McPAT license issue, EI cannot currently be distributed, but we will resolve this issue soon
| Once the build succeeds, the binary is created in
macsim-top/trunk/bin/macsim
[Figure: screenshot of a simulation]
| Now, how do we configure the simulation models?
| Knob variables can be set in three ways
Default value in the source code
params.in
Command line
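The three sources can be pictured as layers. The sketch below assumes that a value in params.in overrides the source-code default and that a command-line value overrides both; the slides do not state this precedence, so treat it as an assumption. `resolve_knobs` and the example values are hypothetical and not part of MacSim.

```python
def resolve_knobs(defaults, params_in, command_line):
    """Merge knob values; later sources win (assumed precedence)."""
    knobs = dict(defaults)       # 1. default values from the source code
    knobs.update(params_in)      # 2. values read from params.in
    knobs.update(command_line)   # 3. values given on the command line
    return knobs


# Illustrative values only.
defaults = {"num_sim_cores": 4, "repeat_trace": 0}
params_in = {"num_sim_cores": 8, "repeat_trace": 1}
command_line = {"num_sim_cores": 2}

knobs = resolve_knobs(defaults, params_in, command_line)
print(knobs)
```

Keeping each source as a separate dictionary makes it easy to see where a final knob value came from.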
[Figure: three groups of cores (core type 1, core type 2, core type 3) and memory]
| Configuration: 4 cores, 2-way SMT

.def:
param<NUM_SIM_CORES, num_sim_cores, int, 4>

params.in:
num_sim_cores 4 // 4 cores
num_sim_small_cores 0
num_sim_medium_cores 0
num_sim_large_cores 4
max_threads_per_large_core 2
large_core_type x86
repeat_trace 1

Command line:
./macsim --num_sim_cores=4
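To make the params.in layout concrete, here is a minimal sketch of a reader for it. It assumes the format seen on the slides: one `name value` pair per line, with `//` or `#` starting a comment. `parse_params` is a hypothetical helper, not MacSim's own parser, and it keeps every value as a string.

```python
def parse_params(text):
    """Parse 'name value' lines; '//' and '#' start comments (assumed syntax)."""
    knobs = {}
    for raw in text.splitlines():
        # Drop comments, then surrounding whitespace.
        line = raw.split("//", 1)[0].split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a name without a value is skipped in this sketch
        knobs[parts[0]] = parts[1].strip()
    return knobs


sample = """\
num_sim_cores 4
// 4 cores
num_sim_small_cores 0
num_sim_medium_cores 0
num_sim_large_cores 4
max_threads_per_large_core 2
large_core_type x86
repeat_trace 1
"""

params = parse_params(sample)
print(params["num_sim_cores"], params["large_core_type"])
```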
| To configure a CPU+GPU architecture
Set the number of cores and their types accordingly:
num_sim_cores 8 // 4 CPUs + 4 GPUs
num_sim_small_cores 4 // 4 GPUs
num_sim_medium_cores 0
num_sim_large_cores 4 // 4 CPUs
core_type ptx // specify small cores
large_core_type x86
cpu_frequency 3
gpu_frequency 1.5
repeat_trace 1
| Usually, we use small cores for the GPU and large cores for the CPU
| A GPU internally has multiple processing elements (N-wide SIMD)
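In both sample configurations, num_sim_cores equals the sum of the small, medium, and large core counts. The sketch below checks that relationship; it is an assumption drawn from the two examples, not a documented rule, and `core_counts_consistent` is a hypothetical helper.

```python
def core_counts_consistent(knobs):
    """Return True if num_sim_cores equals small + medium + large cores."""
    per_type = ("num_sim_small_cores",
                "num_sim_medium_cores",
                "num_sim_large_cores")
    return int(knobs["num_sim_cores"]) == sum(int(knobs[k]) for k in per_type)


hetero = {
    "num_sim_cores": 8,          # 4 CPUs + 4 GPUs
    "num_sim_small_cores": 4,    # GPU cores (ptx)
    "num_sim_medium_cores": 0,
    "num_sim_large_cores": 4,    # CPU cores (x86)
}

print(core_counts_consistent(hetero))
```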
| Multiple Applications
Set up from trace_file_list:
4 <- number of applications
/sample/mcf/trace.txt <- application 1
/sample/gcc/trace.txt <- application 2
/sample/mm/trace.txt <- application 3
/sample/blackscholes/trace.txt <- application 4
[Figure: the applications MCF, GCC, MM (thread 1, thread 2), and Blackscholes]
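A minimal sketch of reading such a list, assuming the file holds the application count on its first line followed by one trace path per line (the `<-` annotations on the slide are taken to be explanatory, not file content). `parse_trace_file_list` is a hypothetical helper.

```python
def parse_trace_file_list(text):
    """Return the trace paths listed after the leading application count."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    count = int(lines[0])
    paths = lines[1:1 + count]
    if len(paths) != count:
        raise ValueError(f"expected {count} traces, found {len(paths)}")
    return paths


trace_list = """\
4
/sample/mcf/trace.txt
/sample/gcc/trace.txt
/sample/mm/trace.txt
/sample/blackscholes/trace.txt
"""

paths = parse_trace_file_list(trace_list)
print(paths)
```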
| Execution time differs from application to application.
| An option is provided to repeat short traces until the longest trace ends.
[Figure: Program 1 runs mcf once; Program 2 runs gcc repeatedly; Program 3 runs bfs repeatedly]
| Is this the right way to simulate?
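The effect of repeating short traces can be estimated with a small calculation. The sketch below assumes each trace has a fixed standalone length in cycles and ignores interference between applications, so it is only a rough illustration; the lengths are made up and `repeat_counts` is a hypothetical helper.

```python
def repeat_counts(lengths):
    """How many times each trace starts before the longest trace ends."""
    longest = max(lengths.values())
    # Ceiling division: a partially finished repetition still counts as a start.
    return {name: -(-longest // n) for name, n in lengths.items()}


lengths = {"mcf": 1000, "gcc": 300, "bfs": 250}  # illustrative cycle counts
counts = repeat_counts(lengths)
print(counts)
```

Here gcc starts four times and its last repetition is cut off when mcf ends, which is one reason to ask whether this is the right way to simulate.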
| Sample configuration files in macsim-top/trunk/params

File name             Contents
params_8800gt         GeForce 8800 GT (G80)
params_gtx280         GeForce GTX 280 (GT200)
params_gtx465         GeForce GTX 465 (Fermi)
params_x86            Intel’s Sandy Bridge (CPU part only)
params_hetero_4c_4g   Intel’s Sandy Bridge (CPU + GPU)
| Thread spawn is modeled.
| Locks are not modeled.
[Figure: host thread, main thread, thread spawn, GPU kernel invocation, and barrier, shown across four cores]
| This will be covered in Part-III.
| The trace generator generates thread execution information automatically.
| Users do not need to worry about this.
| MacSim has 5 different clock domains

Domain                    Knob        Value
CPU                       clock_cpu   3
GPU                       clock_gpu   1.5
Last-level cache          clock_l3    1
Interconnection network   clock_noc   1
DRAM                      clock_mc    1.6
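To see what separate clock domains imply, the sketch below counts how many cycles each domain completes in the same time window. It assumes the knob values are frequencies in GHz, which the slide does not state; `cycles_in` is a hypothetical helper and not MacSim code.

```python
from fractions import Fraction

# Knob values from the slide, assumed here to be frequencies in GHz.
CLOCKS_GHZ = {
    "clock_cpu": "3",
    "clock_gpu": "1.5",
    "clock_l3": "1",
    "clock_noc": "1",
    "clock_mc": "1.6",
}


def cycles_in(window_ns, clocks=CLOCKS_GHZ):
    """Whole cycles each domain completes in window_ns nanoseconds."""
    # 1 GHz corresponds to one cycle per nanosecond.
    return {name: int(Fraction(freq) * window_ns)
            for name, freq in clocks.items()}


print(cycles_in(10))
```

Under this assumption, the CPU domain advances 30 cycles in the time the GPU domain advances 15 and the memory controller 16.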
| X86 instructions are mapped to uops
| PTX instructions are mapped to uops (almost 1-1 mapping)
[Figure: Pin (XED) produces a trace of macro instructions with decoded information from Pin’s XED; the MacSim decoder turns them into uops for the timing/power simulator]
| Pipeline stages: Front-end, Decode, Rename, Schedule, Execution, Memory, Retire
| Front-end, decode/rename: just a simple FIFO queue
fetch_latency 5 // front-end depth
alloc_latency 5 // decode/allocation depth
width // pipeline width (the same width for all pipeline stages)
bp_dir_mech gshare
bp_hist_length 14 // branch history length
| Rename: creates RAW dependencies (map structure)
rob_size 96 // ROB size
| Scheduler: in-order scheduler, out-of-order (ooo) scheduler
schedule io, ooo // instruction scheduling policy
| Execution latency
Fixed uop latency (macsim-top/def/uop_latency_[x86,ptx].def)
Variable latency: cache/memory latency
| Instruction scheduling rates
isched_rate 4 // # of integer instructions that can be executed per cycle
msched_rate 2 // # of memory instructions that can be executed per cycle
fsched_rate 2 // # of FP instructions that can be executed per cycle
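The scheduling rates act as per-cycle budgets for each instruction class. The following is a simplified sketch of that idea only: it ignores data dependencies and pipeline width, and it lets younger uops of another class pass a blocked one. `select_for_cycle` and the tuple layout are hypothetical, not MacSim's scheduler.

```python
# Per-cycle limits, mirroring isched_rate, msched_rate, and fsched_rate.
RATES = {"int": 4, "mem": 2, "fp": 2}


def select_for_cycle(ready, rates=RATES):
    """Pick ready uops oldest-first, subject to per-class limits."""
    budget = dict(rates)
    issued, remaining = [], []
    for uop_id, kind in ready:
        if budget[kind] > 0:
            budget[kind] -= 1
            issued.append(uop_id)
        else:
            remaining.append((uop_id, kind))  # waits for a later cycle
    return issued, remaining


ready = [(0, "mem"), (1, "mem"), (2, "mem"), (3, "int"), (4, "fp")]
issued, remaining = select_for_cycle(ready)
print(issued, remaining)
```

With msched_rate at 2, the third memory uop has to wait for the next cycle even though it is ready.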
| Cache configuration
# of sets, associativity, line size, # of banks, etc. (see the manual)
| Cache size = # of sets x assoc x line_size x # of tiles (# of tiles applies to L3 only)
| DRAM configuration
Frequency, bus width, column/activate/precharge latency
# of memory controllers, # of banks, # of channels, row buffer size, DRAM scheduling policy
Simple but fast DRAM model that captures the key features
| MacSim is connected to DRAMSim2
Users can use DRAMSim2 for detailed DRAM timing simulation
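The cache size formula from this slide can be checked with a few lines of arithmetic. The parameter values below are illustrative, not MacSim defaults, and `cache_size_bytes` is a hypothetical helper.

```python
def cache_size_bytes(num_sets, assoc, line_size, num_tiles=1):
    """Cache size = # of sets x assoc x line_size x # of tiles."""
    return num_sets * assoc * line_size * num_tiles


# A private cache: 64 sets, 8-way, 64-byte lines.
private = cache_size_bytes(64, 8, 64)
# A tiled L3: 1024 sets, 16-way, 64-byte lines, 4 tiles.
l3 = cache_size_bytes(1024, 16, 64, num_tiles=4)

print(private // 1024, "KB")
print(l3 // (1024 * 1024), "MB")
```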
| Statistics
Simulation outputs: *.stat.out
Stat definitions are in macsim/trunk/def (more details in Part-III)
| Important stats
IPC = INST_COUNT_TOT/CYC_COUNT_TOT
CPI = CYC_COUNT_TOT/INST_COUNT_TOT
| Per-core stats
IPC for core 0 = INST_COUNT_CORE_0/CYC_COUNT_CORE_0
| Multiple-application stats
*.stat.out.<application_id>, e.g., memory.stat.out.0, bp.stat.out.1
Each stat file contains stats only for the first run (repeated simulations are ignored)
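As a small worked example of the IPC and CPI definitions, the sketch below computes them from a dictionary of counters. It assumes the counters have already been read from the stat files (the file layout is not described here), and the counter values are made up. `ipc` and `cpi` are hypothetical helpers.

```python
def _suffix(core):
    """Stat-name suffix: TOT for the whole run, CORE_<n> for one core."""
    return "TOT" if core is None else f"CORE_{core}"


def ipc(stats, core=None):
    """Instructions per cycle, overall or for one core."""
    s = _suffix(core)
    return stats[f"INST_COUNT_{s}"] / stats[f"CYC_COUNT_{s}"]


def cpi(stats, core=None):
    """Cycles per instruction, overall or for one core."""
    s = _suffix(core)
    return stats[f"CYC_COUNT_{s}"] / stats[f"INST_COUNT_{s}"]


stats = {  # illustrative counter values
    "INST_COUNT_TOT": 2000, "CYC_COUNT_TOT": 1000,
    "INST_COUNT_CORE_0": 600, "CYC_COUNT_CORE_0": 1000,
}

print(ipc(stats), cpi(stats), ipc(stats, core=0))
```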
| Memory systems
L[1-3]_HIT_CPU/L[1-3]_HIT_GPU
L[1-3]_MISS_CPU/L[1-3]_MISS_GPU
| Front-end
BP_ON_PATH_[CORRECT/MISPREDICT/MISFETCH]
| Instruction profiling
Based on instruction category: inst.stat.out
| More details regarding statistics are in the documentation
| We will provide a simple script to fetch stat data
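The hit and miss counters combine naturally into a hit rate. The sketch below expands the `L[1-3]_HIT_CPU` style names for one level and one unit and assumes hit rate = hits / (hits + misses); the counter values are made up and `hit_rate` is a hypothetical helper, not the script mentioned on the slide.

```python
def hit_rate(stats, level, unit):
    """Hit rate for cache level 1-3 and unit 'CPU' or 'GPU'."""
    hits = stats[f"L{level}_HIT_{unit}"]
    misses = stats[f"L{level}_MISS_{unit}"]
    total = hits + misses
    return hits / total if total else 0.0  # no accesses: report 0.0


stats = {  # illustrative counter values
    "L1_HIT_CPU": 900, "L1_MISS_CPU": 100,
    "L3_HIT_GPU": 0, "L3_MISS_GPU": 0,
}

print(hit_rate(stats, 1, "CPU"))
print(hit_rate(stats, 3, "GPU"))
```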
| Multi-threading support is already there.
| Different ISAs: handled using micro-ops
| Warps
One warp is treated as one thread. Each thread generates its own trace file. Active-bit information is included.
The trace format will be explained in Part-III.
| Thread and block scheduling
Block-level barrier, block-level scheduling/retirement
More details will be explained in Part-III.
| Different memory structures
Memory systems
[Figure: a SIMD load instruction accessing Addr 0 to Addr 7. Coalesced: one memory instruction of 128B, recorded in the trace file as a single TraceInst. Uncoalesced: a 64B request and two 32B requests, recorded in the trace file as TraceInst_begin (start-of-memory-instruction marker), TraceMem1, TraceMem2, TraceMem3, TraceInst_end (end-of-memory-instruction marker).]
| Include the memory access of each thread of a warp as a separate instruction in the trace
| In the trace, mark these accesses as coming from the same warp
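One way to picture the marked trace is as a bracketed sequence of records. The sketch below emits a begin marker, one memory record per access, and an end marker, all tagged with the same warp. The record names follow the figure labels, but the tuple layout is an assumption; the actual trace format is covered in Part-III. `warp_mem_records` is a hypothetical helper.

```python
def warp_mem_records(warp_id, accesses):
    """Build trace records for one warp memory instruction.

    accesses: list of (address, size_in_bytes) pairs.
    """
    records = [("TraceInst_begin", warp_id)]      # start-of-memory-instruction marker
    for addr, size in accesses:
        records.append(("TraceMem", warp_id, addr, size))
    records.append(("TraceInst_end", warp_id))    # end-of-memory-instruction marker
    return records


# Illustrative uncoalesced access: one 64B and two 32B requests.
records = warp_mem_records(7, [(0x1000, 64), (0x2000, 32), (0x3000, 32)])
for rec in records:
    print(rec)
```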
[Figure: the trace file holds TraceInst_begin (start-of-memory-instruction marker), TraceMem1, TraceMem2, TraceMem3, …, TraceMemN, TraceInst_end (end-of-memory-instruction marker). In MacSim these become a parent uop (Mem_type: ld, #children: 8) with children uops for addr0, addr1, …, addrN.]
| During simulation, MacSim forms a “parent” uop that holds all the individual memory accesses as its child uops
| The parent uop flows through the pipeline; only in the memory stage are the individual child uops issued to the memory system
The parent uop is ready for retirement when all of its children have completed
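The parent/child relationship can be sketched as a small data structure. This is an illustration of the retirement rule only, not MacSim's uop class; `ParentUop` and its methods are hypothetical names.

```python
class ParentUop:
    """A memory uop that retires only after all child accesses complete."""

    def __init__(self, mem_type, addresses):
        self.mem_type = mem_type
        self.children = list(addresses)                # one child uop per address
        self.pending = set(range(len(self.children)))  # children not yet completed

    @property
    def num_children(self):
        return len(self.children)

    def complete_child(self, index):
        """Called when the memory system finishes child number index."""
        self.pending.discard(index)

    def ready_to_retire(self):
        return not self.pending


parent = ParentUop("ld", [0x100 + 4 * i for i in range(8)])
for i in range(7):
    parent.complete_child(i)
print(parent.ready_to_retire())   # one child still outstanding
parent.complete_child(7)
print(parent.ready_to_retire())
```

Children are tracked by index rather than by address, so two threads that access the same address still count as two children.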
| IRIS (from Prof. Yalamanchili’s group)
Flit-level interconnection network simulator
Virtual channels, credit-based flow control, deadlock avoidance, …
Part-IV will cover more.
[Figure: nodes attached to routers, connected in a topology (ring, mesh, torus, ...)]
| MacSim-SST
Parallel simulation