MacSim Tutorial (In ICPADS 2013)

• The Structural Simulation Toolkit (SST): A Parallel Architectural Simulator (for HPC)
  • A parallel simulation environment based on MPI
  • Fully modular design that enables extensive exploration of an individual system parameter without intrusive changes to the simulator
  • Includes a parallel simulation core, configuration, power models, basic network and processor models, and an interface to a detailed memory model
  • SST download link: http://sst.sandia.gov/

• Processor Components
  • MacSim
  • Gem5
• Memory Components
  • DRAMSim2
  • VaultSim (3D memory model)
  • memHierarchy
• Network Components
  • Merlin
  • Iris

• Multiple MacSim components can be instantiated
• Each of them can act as
  • An entire GPU node (composed of multiple SMs)
  • A heterogeneous computing node (CPU + GPU)
  • A GPU/CPU core
  • Any combination of the above

• MacSim can talk to memHierarchy
• MacSim can use memHierarchy's cache hierarchy, so it can be configured with whatever memory system is connected to memHierarchy: DRAMSim2 or VaultSim.
• Pipeline stages with memHierarchy
  [Figure: Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MH) and D-Cache (MH); SST Link; VaultSim]

• MacSim can talk directly to DRAMSim2 and VaultSim
• Through its versatile memory controller interface, MacSim can talk directly to DRAMSim2 and VaultSim.
• Pipeline stages with external memory component
  [Figure: Front-end, Decode, Rename, Schedule, Execution, Retire; I-Cache (MS) and D-Cache (MS); SST Link; VaultSim]

• memHierarchy: an SST component that models a memory hierarchy, such as multiple cache levels
  • Subcomponents: Cache, Bus, Memory Controller
• Usage
  • Processor component(s) + memHierarchy(s) + memory component(s)
  • MacSim + L1/L2 cache + DRAMSim2
  • MacSim + L1/L2 cache + 3D memory model
  • (MacSim + private L1 cache) + (Gem5 + private L1 cache) + shared L2 cache + (DRAMSim2 or 3D memory model)

• MacSim is encapsulated as an SST component; SST feeds clocks into MacSim and provides communication channels.
• By talking to memHierarchy, MacSim can communicate indirectly with a range of memory components without modifying its interface.
  [Figure: MacSim components (SST::Component) connected through SST::Link to L1 and L2 (memHierarchy), then through SST::Link to DRAMSim2 (SST::Component) or VaultSim]

  [Figure: MacSim and Gem5 components (SST::Component), L1 and L2 (memHierarchy), LLC (VaultSim), DRAMSim2 (SST::Component), and VaultSim, connected by SST::Link]

• Make sure macsimComponent does not contain a .ignore file; otherwise the SST build system will ignore the component.
• How to build: see the instructions on the SST website.
• How to execute: pay special attention to the following files
  • SDL (or XML): SST component configuration
  • trace_file_list: which trace to execute.
    It can be specified in the aforementioned SDL file.
  • params.in: MacSim configuration, in which you can specify
    • Whether MacSim uses its internal cache or memHierarchy as its cache
    • Which DRAM controller to use: its internal FCFS/FRFCFS-based controller, the DRAMSim2 controller, or the VaultSim controller
• Specific examples are given in the following slides.

• params.in
  • use_memhierarchy = 0
  • dram_scheduling_policy = FRFCFS or FCFS
• SDL (or XML)
  • Nothing except the macsimComponent configuration
  • In this case, the link configuration is not used

• params.in
  • use_memhierarchy = 1
  • Note: when use_memhierarchy is set to 1, MacSim's DRAM controller configuration has no effect
• SDL (or XML)
  • Specify memHierarchy's cache configuration
    [Listing: memHierarchy cache configuration in SDL]
  • Use a similar configuration for the D-cache

• params.in
  • use_memhierarchy = 1
  • Note: when use_memhierarchy is set to 1, MacSim's DRAM controller configuration has no effect
• SDL (or XML)
  • Specify the MemController configuration for DRAMSim2
    [Listing: MemController configuration for DRAMSim2 in SDL]
  • Note: the DRAMSim2 configurations should be appended

• params.in
  • use_memhierarchy = 1
  • Note: when use_memhierarchy is set to 1, MacSim's DRAM controller configuration has no effect
• SDL (or XML)
  • Specify the MemController configuration for VaultSim
    [Listing: MemController configuration for VaultSim in SDL]
  • Note: the VaultSim configurations should be appended

• params.in
  • use_memhierarchy = 0
  • dram_scheduling_policy = DRAMSIM
• SDL (or XML)
  • Specify the configurations for DRAMSim2
    [Listing: DRAMSim2 configuration in SDL]

• params.in
  • use_memhierarchy = 0
  • dram_scheduling_policy = VAULTSIM
• SDL (or XML)
  • Nothing special, except that macsimComponent's mem_link must match VaultSim's toCPU link

Front-end
• Thread fetch policies
• Branch predictor
Memory System
• Software and hardware prefetcher
• Cache studies (sharing, inclusion)
• DRAM scheduling
• Interconnection studies
Misc.
• Power model

[Figure: MacSim: Trace Generator (PIN, GPUOcelot), Frontend, Memory System, Hardware Prefetcher]
• Software prefetch instructions
  • PTX: prefetch, prefetchu
  • x86: prefetcht0, prefetcht1, prefetchnta
• Hardware prefetch requests
  • Stream, stride, GHB, …
• Many-thread Aware Prefetching Mechanism [Lee et al. MICRO-43, 2010]
• When prefetching works, when it doesn't, and why [Lee et al. ACM TACO, 2012]
• Spare Register Aware Prefetching for Graph Algorithms on GPUs [Lakshminarayana, HPCA 2014]

• Cache studies: sharing, inclusion property
• On-chip interconnection studies
  [Figure: private caches ($) connected through an interconnection to a shared cache]
• TLP-Aware Cache Management Policy [Lee and Kim, HPCA-18, 2012]

• Heterogeneous link configuration
  [Figure: ring network connecting CPU cores (C0-C2), GPU cores (G0-G2), L3 slices, and memory controllers (M0, M1), shown in two placements]
• Different topologies
  [Figure: grid placements of C, G, and M nodes]
• On-chip Interconnection for CPU-GPU Heterogeneous Architecture [Lee et al.
JPDC 2013]

[Figure: Trace Generator (GPUOcelot), Frontend, Execution, DRAM]
• Frontend: RR, ICOUNT, FAIR, LRF, …
• DRAM: FCFS, FRFCFS, FAIR, …
• Effect of Instruction Fetch and Memory Scheduling on GPU Performance [Lakshminarayana and Kim, LCA-GPGPU, 2010]

[Figure: DRAM controller with per-warp request queues (W0-W3) for Core-0 and Core-1, holding row-hit (RH) and row-miss (RM) requests to a DRAM bank]
• Tolerance(Core-0) < Tolerance(Core-1), so select Core-0
• Potential of requests from Core-0 = $|W_0|^\alpha + |W_1|^\alpha + |W_2|^\alpha + |W_3|^\alpha = 4^\alpha + 3^\alpha + 5^\alpha$ ($\alpha < 1$)
• Reduction in potential if
  • a row hit from a queue of length $L$ is serviced next: $L^\alpha - (L - 1)^\alpha$
  • a row miss from a queue of length $L$ is serviced next: $L^\alpha - (L - 1/m)^\alpha$
  • $m$ = cost of servicing a row miss / cost of servicing a row hit
• Servicing a row hit from W1 (of Core-0) gives the greatest reduction in potential, so service row hits from W1 next
• DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function [Lakshminarayana et al. IEEE CAL, 2011]

[Figure: out-of-the-box MacSim (Trace Generator (PIN, GPUOcelot) with CPU traces (x86) and GPU traces (CUDA), Frontend, Cache Hierarchy, Memory System, Off-Chip Memory); memory requests go to the 3D stacked DRAM model (new module) with DRAM stacks]
• Configure the 3D stack as
  • DRAM caches
  • Part of main memory
• Resilient Die-stacked DRAM Caches [Sim et al., ISCA-40, 2013]
• A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch [Sim et al., MICRO, 2012]

• Verifying the simulator against a GTX580
  [Figure: bar chart of modeled vs. measured values across microbenchmarks (axis label: IPC)]
• Modeling x86 CPU power
• Modeling GPU power
  [Figure: pie chart of power breakdown by component: Fetch, Decode, Schedule, RF, MMU, L1, ConstCache, TextureCache, SharedMem, EX_alu, EX_LD/ST, EX_SFU, EX_fpu]
• Still ongoing research

• ARM architecture
• Mobile platform
• Power/energy model
2013 ~ 2014
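
The potential-function slide on DRAM scheduling can be made concrete with a short sketch. This is not MacSim's implementation: the function names are hypothetical, the queue lengths come from the slide's 4 + 3 + 5 example (their assignment to particular warps is assumed), and alpha = 0.5 is an arbitrary choice that satisfies alpha < 1. The tolerance-based choice between cores is not modeled; the sketch covers only the per-core step of picking which queue to service.

```python
# Sketch of the potential-function selection step; names and values are illustrative.

def potential(queue_lengths, alpha):
    """Sum of |W_i|**alpha over a core's non-empty per-warp queues."""
    return sum(length ** alpha for length in queue_lengths if length > 0)


def reduction(length, alpha, m=1.0, row_hit=True):
    """Drop in potential when one request from a queue of `length` is serviced.

    A row hit shortens the queue by 1; a row miss counts as 1/m, where
    m = cost of row miss / cost of row hit (assumed m >= 1).
    """
    if length < 1:
        raise ValueError("queue must hold at least one request")
    step = 1.0 if row_hit else 1.0 / m
    return length ** alpha - (length - step) ** alpha


def best_row_hit_queue(queue_lengths, alpha):
    """Index of the non-empty queue whose row hit reduces potential the most."""
    candidates = [i for i, n in enumerate(queue_lengths) if n > 0]
    return max(candidates, key=lambda i: reduction(queue_lengths[i], alpha))


if __name__ == "__main__":
    lengths = [4, 3, 5]  # queue lengths from the slide's 4 + 3 + 5 example
    alpha = 0.5          # hypothetical; the slide only requires alpha < 1
    print(potential(lengths, alpha))
    print(best_row_hit_queue(lengths, alpha))
```

Because x^alpha is concave for alpha < 1, the reduction for a row hit shrinks as the queue grows, so the shortest non-empty queue (length 3 here) is selected. A row miss with m > 1 always yields a smaller reduction than a row hit from the same queue.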