ISCA Tutorial
June 5th, 2005
Mike Marty, Brad Beckmann, Luke Yen, Alaa Alameldeen, Min Xu, Kevin Moore
Please Ask Questions
(C) 2005 Multifacet Project
http://www.cs.wisc.edu/gems

What do you want to simulate?
[Diagram: candidate target systems built from processors (P) and caches ($)]
• Uniprocessor
• Chip Multiprocessor (CMP)
• Symmetric Multiprocessor
• Glueless Multiprocessor
• Multiple-CMP

Open Source Release of GEMS
• GEMS v1.1 released as GPL software
  – http://www.cs.wisc.edu/gems
• Contributors
  – Alaa Alameldeen, Brad Beckmann, Ross Dickson, Pacia Harper, Milo Martin, Mike Marty
  – Carl Mauer, Kevin Moore, Manoj Plakal, Dan Sorin, Min Xu, Luke Yen
• Multifacet Project directed by Mark Hill & David Wood

GEMS Requirements
• Virtutech Simics 2.0.x or 2.2.x
  – Personal academic licenses available
  – http://www.virtutech.com
• Host Machine
  – x86 (32 or 64-bit) Linux or Sparc/Solaris host machine
  – > 1 GB Memory
• Workload Checkpoints YOU Create
  – License issues w/ releasing checkpoints

GEMS From 50,000 Feet
[Diagram: Simics, Opal (detailed processor model), the random tester, and the microbenchmarks (deterministic, trace file, contended locks) around the memory system model]
• Simics: Full-System Functional Simulator
  – Boots unmodified Solaris 9
  – BUT, each instruction 1-cycle
  – www.virtutech.com
• Ruby: Memory System Model
  – Flexible multiprocessor memory hierarchy
  – Includes domain-specific language
• Opal: OoO Processor Model
  – Implements partial SPARC v9 ISA
  – Modeled after MIPS R10000

GEMS From 50,000 Feet: Other Drivers
• Testing independent of Simics
• Microbenchmarks

Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Demo: Full-System Simulation with GEMS
• Steps:
  – Choosing a Ruby protocol
  – Building Ruby and Opal
  – Starting and configuring Simics
  – Loading and configuring Ruby
  – Loading and configuring Opal
  – Running simulation
  – Getting results

Demo: Choosing the Ruby System/Protocol
• Included with GEMS release v1.1
  – CMP protocols
    • MOESI_CMP_token: M-CMP token coherence
    • MSI_MOSI_CMP_directory: 2-level Directory
    • MOESI_CMP_directory: higher performing 2-level Directory
  – SMP protocols
    • MOSI_SMP_bcast: snooping on ordered interconnect
    • MOSI_SMP_directory
    • MOSI_SMP_hammer: based on AMD Hammer
    • And more

Demo: Building Ruby and Opal
• Ruby module
    cd $GEMS_ROOT/ruby
  – Set compile-time defaults
    vi config/rubyconfig.defaults
  – Build module, choosing protocol and destination dir
    make PROTOCOL=MOESI_CMP_token DESTINATION=MOESI_CMP_token
  – SLICC runs, generates HTML and additional C++ files
  – Ruby module built and moved to $GEMS_ROOT/simics/home/MOESI_CMP_token
• Build Opal
    cd $GEMS_ROOT/opal
    make module DESTINATION=MOESI_CMP_token

Demo: Starting Simics
• Start non-GUI Simics
    maya(9)% cd $GEMS_ROOT/simics/home/MOESI_CMP_token/
    maya(10)% ./simics
    Checking out a license... done: academic license.
    Looking for additional Simics modules in ./modules

    +----------------+    Copyright 1998-2004 by Virtutech, All Rights Reserved
    |   Virtutech    |    Version:  simics-2.0.23
    |     Simics     |    Compiled: Thu Oct 14 20:27:36 CEST 2004
    +----------------+
    www.simics.com        "Virtutech" and "Simics" are trademarks of Virtutech AB

    Type 'copyright' for details on copyright.
    Type 'license' for details on warranty, copying, etc.
    Type 'readme' for further information about this version.
    Type 'help help' for info on the on-line documentation.

    simics>

Demo: Checkpoint and Configuration
• Checkpoints should be created first
  – Simics-only process
    simics> read-configuration ../../checkpoints-u3/jbb/jbb-16p.check
  – SpecJBB checkpoint loaded
• Load python scripts
    simics> @sys.path.append("../../../gen-scripts")
    simics> @import mfacet
• Configure Simics
    simics> istc-disable
    Turning I-STC off and flushing old data
    simics> dstc-disable
    Turning D-STC off and flushing old data
    simics> instruction-fetch-mode instruction-fetch-trace
    simics> magic-break-enable

Demo: Load and Configure Ruby
• Load module
    simics> load-module ruby
• Setting # processors is required
    simics> ruby0.setparam g_NUM_PROCESSORS 16
• Create a M-CMP system (4 chips, 4 procs/chip)
    simics> ruby0.setparam g_PROCS_PER_CHIP 4
• Override compile-time defaults
    simics> ruby0.setparam g_NUM_L2_BANKS 32
    simics> ruby0.setparam L2_CACHE_ASSOC 4
    simics> ruby0.setparam L2_CACHE_NUM_SETS_BITS 16
    simics> ruby0.setparam NETWORK_LINK_LATENCY 50
• Initialize
    simics> ruby0.init

Demo: Optionally Load and Configure Opal
• Load module
    simics> load-module opal
• Initialize default processor
    simics> opal0.init
    simics> opal0.listparam
• Start opal (but do not start simulating)
    simics> opal0.sim-start "output.opal"

Demo: Running simulation
• Setup transaction-based simulation
  – "magic breakpoints"
  – Five JBB transactions
    simics> @mfacet.setup_run_for_n_transactions(5,1)
• Start simulating
  – Ruby only (Simics drives Ruby):
    simics> c
  – Opal is loaded (Opal steps Simics):
    simics> opal0.sim-step 9999999999

Demo: Dumping Some Output
• Opal stats
    simics> opal0.stats
• Ruby stats
    simics> ruby0.dump-stats ruby.stats
• Ruby short stats
    simics> ruby0.dump-short-stats
  – Ruby_cycles is a good runtime metric

Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
  – Overview (Drivers & Memory System)
  – Event-driven simulation
  – Interconnection network
  – SLICC: Specifying the logic of the system
  – Simple example: SMP MI protocol
  – Limitations
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

High-Level Infrastructure Map
[Diagram: the memory system drivers: Simics, Opal (detailed processor model), the random tester, and the microbenchmarks (deterministic, trace file, contended locks)]

Ruby Driver: Random Tester
• "Verifying a Multiprocessor Cache Controller Using Random Test Generation" [Wood et al. 90]
• Purpose: Excite cache coherency bugs
• Competing actions performed then checked
• Utilizes false sharing
  – Multiple writers: action
  – Single read: check
• Randomly inserted delay

Ruby Driver: Microbenchmarks
• Deterministic tester
  – Simple sequence of requests
    • GETX, SeriesGETS, Inv
  – Sanity checking and performance tuning
  – DeterministicDriver.C
• Trace file
  – Issues requests one at a time
  – Similar to cache warmup mechanism
  – '-z <trace_file.gz>'
• Contended locks
  – Compare and swap atomic op.
  – RequestGenerator.C / SyntheticDriver.C

Ruby Driver: In-order Processor Model
• Simics blocking interface (in-order processor)
  – Single issue, non-pipelined processor
  – Only one outstanding request per CPU
• SIMICS_RUBY_MULTIPLIER > 1
  – Estimates a higher performance processor
  – Multiple simics processor cycles == one ruby cycle

Ruby Driver: In-order Processor Model
[Diagram: the Simics in-order processor model (P0 to P3) sends instructions to the Ruby memory system model; Ruby stalls processors through the Simics time queue]
• Implements Simics' mh_memorytracer_possible_cache_miss()
• "Callback" Simics with SIM_stall_cycle(proc_ptr, 0)

Ruby Driver: Out-of-order Processor Model
• Opal (out-of-order processor)
  – Super-scalar pipelined processor
  – Multiple outstanding requests per CPU
• OPAL_RUBY_MULTIPLIER > 1
  – Faster processor core frequency than memory
  – Simulation execution optimization

What are they driving?

Ruby Multiprocessor Memory System
• Physical Components
  – Caches
  – Memory
  – System Interconnect
• Determines the timing of memory requests
  – Driver issues memory request to Ruby
  – Ruby simulates the requests
  – Ruby eventually calls back the driver with the latency
• Ruby's purpose: Return memory latency

Discrete Event-driven Simulation
• Discrete event-driven simulation
  – Events change system state
  – Series of scheduled events
• Global EventQueue
  – Heart of Ruby
  – Priority heap of event/time pairs
    • Not a true queue: not in FIFO order
    • Self-sorting queue
  – Events in a given cycle occur in arbitrary order
  – All events must be at least one unit of time in the future
[Diagram: Global EventQueue holding event/time pairs: *Event A 4, *Event G 7, *Event B 5, *Event J 3, *Event S 3]

Events and Consumers
• Event = Consumer Wakeup
  – Consumer determines event type
  – Consumer changes system state
• Typical event
  – Consumer wakes up to observe its input ports
  – Consumer acts upon the incoming message(s)
    • Change system state
    • Enqueue outgoing messages
  – Consumer pops the incoming message(s)
  – Consumer schedules the consumers of the outgoing message(s)
[Diagram: a consumer's output port feeding the input port of the next consumer, whose output port feeds another consumer]

Events and Consumers
• Stalled event
  – Consumer wakes up to observe its input ports
  – Consumer encounters a stall
  – Consumer schedules itself again
    • Doesn't pop incoming queue
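
The event queue and consumer behavior described on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Ruby's C++ implementation: the names EventQueue, Consumer, receive, and busy_until are hypothetical, a stall is modeled as a simple "busy until cycle N" condition, and same-cycle wakeups are served in insertion order, which is just one of the arbitrary orders the slides allow.

```python
import heapq
from collections import deque


class EventQueue:
    """Global queue of (wakeup time, consumer) pairs kept in a binary heap."""

    def __init__(self):
        self.now = 0
        self._heap = []
        self._seq = 0  # tie-breaker so consumer objects are never compared

    def schedule(self, consumer, delta):
        # Every event must lie at least one time unit in the future.
        if delta < 1:
            raise ValueError("events must be scheduled >= 1 cycle ahead")
        heapq.heappush(self._heap, (self.now + delta, self._seq, consumer))
        self._seq += 1

    def run(self):
        # An event is nothing more than a consumer wakeup at a given time.
        while self._heap:
            self.now, _, consumer = heapq.heappop(self._heap)
            consumer.wakeup(self)


class Consumer:
    """Hypothetical controller with a single in-order input port."""

    def __init__(self, name, busy_until=0):
        self.name = name
        self.in_port = deque()        # (ready_time, message) pairs
        self.busy_until = busy_until  # assumed stall condition
        self.handled = []             # (cycle, message) pairs processed
        self.stalls = 0

    def receive(self, eq, message, latency):
        # Sender side: enqueue the message and schedule this consumer.
        # Assumes messages are enqueued with non-decreasing ready times.
        self.in_port.append((eq.now + latency, message))
        eq.schedule(self, latency)

    def wakeup(self, eq):
        if not self.in_port or self.in_port[0][0] > eq.now:
            return  # nothing is ready on the input port
        if eq.now < self.busy_until:
            # Stalled event: do not pop the message, just try again later.
            self.stalls += 1
            eq.schedule(self, 1)
            return
        _, message = self.in_port.popleft()
        self.handled.append((eq.now, message))


def demo():
    eq = EventQueue()
    cache = Consumer("cache", busy_until=5)  # stalls until cycle 5
    directory = Consumer("directory")        # never stalls
    cache.receive(eq, "GETX", latency=3)
    directory.receive(eq, "PUTX", latency=3)
    eq.run()
    return cache, directory
```

Calling demo() wakes both consumers at cycle 3. The directory pops its message immediately, while the cache stalls, leaves the message on its port, reschedules itself one cycle at a time, and finally handles the message once it is no longer busy.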

Interconnection Network
• A single flexible infrastructure
  – Point-to-point links and switches: Consumers
  – Both intra-chip and inter-chip networks
• Dynamic network creation
  – Routing tables created at runtime
  – Utilizes input parameters
• Two ways to generate topologies
  1. Auto-generated
     – Intra-chip network: Single on-chip switch
     – Inter-chip network: 4 included (next slide)
  2. Customized
     – TopologyType_FILE_SPECIFIED
     – Adjust individual link latency and bandwidth
     – Specify one link per line
[Diagram: a link (Throttle.C) feeding a switch (PerfectSwitch.C)]

Auto-generated Inter-chip Network Topologies
[Diagram: the four auto-generated topologies]
• TopologyType_TORUS_2D
• TopologyType_PT_TO_PT
• TopologyType_HIERARCHICAL_SWITCH
• TopologyType_CROSSBAR

Network Characteristics
• Link latency
  1. Auto-generated
     – ON_CHIP_LINK_LATENCY
     – NETWORK_LINK_LATENCY
  2. Customized
     – 'link_latency:'
• Link bandwidth
  – Bandwidth specified in 1000ths of a byte
  1. Auto-generated
     – On-chip = 10 x g_endpoint_bandwidth
     – Off-chip = g_endpoint_bandwidth
  2. Customized
     – Individual link bandwidth = 'bw_multiplier:' x g_endpoint_bandwidth
• Buffer size
  – Infinite by default
  – Customized network supports finite buffering
    • Prevent 2D-mesh network deadlock through e-cube restrictive routing
    • 'link_weight'
    • Perfect switch bandwidth

Specification Language for Implementing Cache Coherence (SLICC)
• Domain-specific language
  – Designed to specify state machines for cache coherence
  – Syntactically similar to C/C++/Java
  – Constrained to hardware-like structures (e.g. no loops)
  – Generates C++ tightly coupled to Ruby
• Two purposes
  1. Specify system coherence
     – Per-memory-block State Machines
     – I.e. cache and memory controller logic
  2. Glue components together
     – Caches with transaction buffers
     – Network ports with controllers
[Diagram: a SLICC state machine between network in-ports and network out-ports]

System Flexibility via SLICC
• Substantial portion of Ruby code generated
  – In combination with dynamic network creation
  – Permits a tremendously flexible simulation infrastructure
• protocols/<protocol_name>.slicc
  – Indicates the SLICC files needed by the protocol
  – Specifies the necessary generated objects
    • Controller state machines
    • Network messages
      – Snooping protocol: requests and response messages
      – Directory protocol: requests, forwarded requests, and responses
  – Allocates only C++ objects needed by the particular protocol
    • Ex. Shadow tags for an exclusive two-level cache
    • Ex. Persistent Request Table for Token coherence

Inside a SLICC State Machine
• Network buffers
  – Outgoing and incoming ports
• States
  – Base and transient states
• Events
  – Internal events that cause state transitions
• Ruby Structures
  – Caches, transaction buffers… etc.
• Trigger events
  – Incoming messages trigger internal events
• Actions
  – Operations performed on structures
• Transitions
  – Cross-product of possible states and events
  – Performs atomic sequence of actions
[Diagram: layout of <controller_name>.sm: network ports, states, events, ruby structures, trigger events, actions, transitions]

Demo: Creating a protocol with SLICC
• MI-example protocol
  – Simple, SMP directory protocol
  – Cache and directory/memory controller
  – Assume ordered interconnect (for simplicity)
[Diagram: three directories (dir) and three caches ($) connected by the Ruby interconnect; example block states M and I]

Demo: MI Cache Controller – States and Events
    // STATES
    enumeration(State, desc="Cache states") {
      // stable states
      I, desc="Not Present/Invalid";
      M, desc="Modified";

      // transient states
      MI, desc="Modified, issued PUT";
      II, desc="Not Present/Invalid, issued PUT";
      IS, desc="Issued request for IFETCH/GETX";
      IM, desc="Issued request for STORE/ATOMIC";
    }

    // EVENTS
    enumeration(Event, desc="Cache events") {
      // from processor
      Load, desc="Load request from processor";
      Ifetch, desc="Ifetch request from processor";
      Store, desc="Store request from processor";

      Data, desc="Data from network";
      Fwd_GETX, desc="Forward from network";

      Replacement, desc="Replace a block";

      Writeback_Ack, desc="Ack from the directory for a writeback";
      Writeback_Nack, desc="Nack from the directory for a writeback";
    }

Demo: MI Cache Controller – Network Ports
    // NETWORK BUFFERS
    MessageBuffer requestFromCache, network="To", virtual_network="0", ordered="true";
    MessageBuffer responseFromCache, network="To", virtual_network="1", ordered="true";
    MessageBuffer forwardToCache, network="From", virtual_network="2", ordered="true";
    MessageBuffer responseToCache, network="From", virtual_network="1", ordered="true";

    // NETWORK PORTS
    out_port(requestNetwork_out, RequestMsg, requestFromCache);
    out_port(responseNetwork_out, ResponseMsg, responseFromCache);

    in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
      if (forwardRequestNetwork_in.isReady()) {
        peek(forwardRequestNetwork_in, RequestMsg) {
          if (in_msg.Type == CoherenceRequestType:GETX) {
            trigger(Event:Fwd_GETX, in_msg.Address);
          } else if (in_msg.Type == CoherenceRequestType:WB_ACK) {
            trigger(Event:Writeback_Ack, in_msg.Address);
          } else {
            error("Unexpected message");
          }
        }
      }
    }

Demo: MI Cache Controller – Structures
    // CacheEntry
    structure(Entry, desc="...", interface="AbstractCacheEntry") {
      State CacheState, desc="cache state";
      bool Dirty, desc="Is the data dirty (different than memory)?";
      DataBlock DataBlk, desc="data for the block";
    }

    external_type(CacheMemory) {
      bool cacheAvail(Address);
      Address cacheProbe(Address);
      void allocate(Address);
      void deallocate(Address);
      Entry lookup(Address);
      void changePermission(Address, AccessPermission);
      bool isTagPresent(Address);
    }

    CacheMemory cacheMemory, template_hack="<L1Cache_Entry>",
      constructor_hack='L1_CACHE_NUM_SETS_BITS, L1_CACHE_ASSOC, MachineType_L1Cache, int_to_string(i)+"_L1"',
      abstract_chip_ptr="true";

Demo: MI Cache Controller – "Mandatory Queue"
    // Mandatory Queue
    in_port(mandatoryQueue_in, CacheMsg, mandatoryQueue, desc="...") {
      if (mandatoryQueue_in.isReady()) {
        peek(mandatoryQueue_in, CacheMsg) {
          if (cacheMemory.isTagPresent(in_msg.Address) == false &&
              cacheMemory.cacheAvail(in_msg.Address) == false) {
            // make room for the block
            trigger(Event:Replacement, cacheMemory.cacheProbe(in_msg.Address));
          } else {
            trigger(mandatory_request_type_to_event(in_msg.Type), in_msg.Address);
          }
        }
      }
    }

Demo: MI Cache Controller – Transitions
Each transition performs an atomic sequence of actions.
    transition(I, Store, IM) {
      v_allocateTBE;
      i_allocateL1CacheBlock;
      a_issueRequest;
      m_popMandatoryQueue;
    }

    transition(IM, Data, M) {
      u_writeDataToCache;
      s_store_hit;
      w_deallocateTBE;
      n_popResponseQueue;
    }

    transition(M, Fwd_GETX, I) {
      e_sendData;
      o_popForwardedRequestQueue;
    }

    transition(M, Replacement, MI) {
      v_allocateTBE;
      b_issuePUT;
      x_copyDataFromCacheToTBE;
      h_deallocateL1CacheBlock;
    }

Demo: MI Cache Controller – Actions
    action(a_issueRequest, "a", desc="Issue a request") {
      enqueue(requestNetwork_out, RequestMsg, latency="ISSUE_LATENCY") {
        out_msg.Address := address;
        out_msg.Type := CoherenceRequestType:GETX;
        out_msg.Requestor := machineID;
        out_msg.Destination.add(map_Address_to_Directory(address));
        out_msg.MessageSize := MessageSizeType:Control;
      }
    }

    action(e_sendData, "e", desc="Send data from cache to requestor") {
      peek(forwardRequestNetwork_in, RequestMsg) {
        enqueue(responseNetwork_out, ResponseMsg, latency="CACHE_RESPONSE_LATENCY") {
          out_msg.Address := address;
          out_msg.Type := CoherenceResponseType:DATA;
          out_msg.Sender := machineID;
          out_msg.Destination.add(in_msg.Requestor);
          out_msg.DataBlk := cacheMemory[address].DataBlk;
          out_msg.MessageSize := MessageSizeType:Response_Data;
        }
      }
    }

Demo: SLICC-generated HTML tables
• http://www.cs.wisc.edu/gems/MI_example_html/

Demo: Testing MI_example
• Build Protocol
    cd $GEMS_ROOT/ruby
    make PROTOCOL=MI_example
• Random test
  – stresses protocol with simultaneous false-sharing requests
  – 16 processors (-p), 10000 requests (-l)
    ./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -l 10000
• Deterministic test with transition trace
  – use a trace, requests handled one at a time
  – input trace (-z), compressed or non-compressed
  – transition debug (-s) starting at cycle 1
    ./amd64_linux/generated/MI_example/bin/tester.exec -p 16 -z ruby.trace.gz -s 1

Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
  – Overview
  – Pipeline
  – Example: Load instruction
  – Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Overview
• What is OPAL?
  – Out-of-Order SPARC processor simulator
    • (modeled after MIPS R10K)
  – Uses Timing-First design
  – Realized as a Simics module, like RUBY
  – Does NOT use Simics' MAI interface
• Goal of this section
  – Starting point for hacking Opal
• Learning approaches
  – Code review / summarization (using Control Flow Graphs)
  – Example: a load instruction
  – Analogies to SimpleScalar… pay attention to the differences

Ruby Driver: In-order Processor Model
[Diagram: the Simics in-order processor model (P0 to P3) sends instructions to the Ruby memory system model; Ruby stalls processors through the Simics time queue]
• Implements Simics' mh_memorytracer_possible_cache_miss()
• "Callback" Simics with SIM_stall_cycle(proc_ptr, 0)

Preview: OPAL & Simics
[Diagram: numbered steps of an instruction moving through Opal's fetch, decode, schedule/execute, and retire stages; IFETCH and LOAD requests go to Ruby (HIT), and Opal interacts with Simics (P0, Phy_mem)]
• Use opal's opal0.sim-step command

Timing-First Simulation [Mauer Sigmetrics 02]
• Timing Simulator (Opal)
  – functional execution of user/supervisor operations
  – speculative, OoO multiprocessor timing simulation
  – does NOT implement full ISA or any devices
• Functional Simulator (Simics)
  – full-system multiprocessor simulation
  – does NOT model detailed micro-architectural timing
KEY: Reload state if Opal state != Simics state

Measured Deviations
• Less than 20 deviations per 100,000 instructions (0.02%)
• Worst case performance error: 2.4% (assuming deviation latency is pipeline flush)

Opal and UltraSparc
• Functionally simulates 103 of 183 UltraSparc ISA instructions (99.99% of all dynamic instr in workloads)
• Sample of unimplemented instrs:
  – ARRAY, EDGE, SHUTDOWN
  – FEXPAND, FMUL8x16, SIAM
  – FPADD, FPMERGE, SIR
  – RDSOFTINT, RDSTICK
  – WRSOFTINT, WRSTICK
• Does not functionally simulate devices or any I/O instructions
  – SCSI controllers and disks
  – PCI and SBUS interfaces
  – interrupt and DMA controllers
  – temperature sensors

    Correctness type    % error
    Functional          0
    Performance         2.4 (worst case)

Simulation Control (system.[C h])
[Flowchart: system_t::simulate(int instrs)]
1. Disable all simics procs
2. Simulated enough instrs? (For MP sims: P0's instrs counted here)
   – Yes: return
   – No: continue
3. Forall seq->advanceCycle() (Pipeline is modeled here)
4. ruby->advanceTime()
5. global_cycle++, then back to step 2

What's done in a cycle?
[Flowchart: pseq::advanceCycle()]
1. FetchInstructions()
2. DecodeInstructions()
3. ScheduleInstructions()
4. RetireInstructions()
5. Scheduler->execute()
6. return
Uses separate queues (finitecycle.h) to record how many instructions are available for each stage. The order is in fact not important here.
• SimpleScalar uses a reverse order, why?

Pipeline Model (pseq.[C h])
• Instructions stored/tracked in a RUU-like structure (iwindow.[C h])
• Flexible multi-stage pipeline
  – Delay modeled with separate queues (finitecycle.h)
• Models fully-pipelined FUs
  – Types: CONFIG_ALU_MAPPING
  – Number: CONFIG_NUM_ALUS
  – Latency: determined by CONFIG_ALU_LATENCY
[Diagram: fetch (F; FETCH_STAGES, MAX_FETCH), decode (D; DECODE_STAGES, MAX_DECODE), scheduler (Sched; MAX_DISPATCH), functional units (FU0, FU1; MAX_EXECUTE), retire (R; RETIRE_STAGES, MAX_RETIRE)]

Instructions ({dynamic,statici,memop,controlop}.[C h])
[Diagram: instruction class structure]
• Dynamic: decoded instr (statici.[C h]), Event Times, Wait List ptr, Registers, Traps, Seq #
• Control (controlop.[C h]): Predicted Addr, Actual Addr, Taken/Not Taken
• Memory (memop.[C h]): Virtual/Phys Addr, LSQ index
• ALU (dynamic.[C h])

Fetch
[Flowchart]
1. Is Fetch Ready?
   – If not: stall fetch
2. Address Translation: I-TLB Miss?
   – Yes: emit NOP / stall fetch
3. Read instruction: pseq::getInstr()
4. Invoke Ruby to simulate Ifetch timing
5. Create Dynamic Instr (load_inst_t)

Decode
[Flowchart]
1. Get load instr from instr window
2. dynamic_inst_t::decode()
   – Get current source operand mappings: arf::readDecodeMap() (regmap.[C h], arf.[C h])
   – Rename dest reg: arf::allocateRegister() (regmap.[C h], arf.[C h])
3. Insert decoded load inst in decode queue

Schedule
[Flowchart]
1. Get load instr from instr window
2. Exceeded scheduling window?
   – Yes: stop scheduling
3. TestSourceReadiness()
   – Source not ready: WAIT_XX_STAGE, until Wakeup
   – Source is ready: All sources ready?
     • Yes: Scheduler->schedule()
     • No: remain in WAIT_XX_STAGE

Execute
[Flowchart]
1. Read port avail?
   – No: reschedule
2. D-TLB address translate (memory_inst_t::addresstranslate()): TLB Miss?
   – Yes: raise TLB miss exception
3. Invoke Ruby to simulate load timing (rubycache_t::access()): Cache Miss?
   – Yes: CACHE_MISS_STAGE
4. Read value from Simics memory (pseq->readPhysicalMemory())
5. pseq->complete()

Retire
[Flowchart]
1. Step Simics (pseq->advanceSimics()) and get completed LD inst
2. Traps?
   – Yes: takeTrap() (set trap state, squash pipeline)
3. checkChangedState() (verify load value; memory consistency)
4. checkCriticalstate(): (PC, NPC, regs)
   – Match: Retire LD
   – Otherwise: FullSquash() (reload state & refetch from instr following LD)

Opal-Ruby Interface
[Diagram: eight numbered steps of an asynchronous LD. Opal side: load_inst_t::Execute() and Complete(), rubycache_t::access() and complete(), system_t::rubyCompletedRequest(), pseq_t::completedRequest(). Ruby side: OpalInterface::isReady(), makeRequest(), hitCallback()]

Branch Prediction
• pseq_t::createInstruction { … s_instr->nextPC() … }
• dynamic_inst_t:: nextPC_call(), nextPC_predicated_branch(), nextPC_predict_branch(), nextPC_indirect()
  – call Predict() on the branch predictor
• Controlop_t::
  – Execute() { (check prediction and flush if mispredict) }
  – Retire() { … Bpred->Update() … }
• Branch predictor (fetch/{yags.[C h], …}): Predict(), Update()

Common Config Parameters
• Processor Width: MAX_FETCH, MAX_DECODE, MAX_DISPATCH, MAX_EXECUTE, MAX_RETIRE
• Pipeline Stages: FETCH_STAGES, DECODE_STAGES, RETIRE_STAGES
• Register File Sizes: CONFIG_IREG_PHYSICAL (int), CONFIG_FPREG_PHYSICAL (fp), CONFIG_CCREG_PHYSICAL (cond code)
• ROB Size: IWINDOW_ROB_SIZE
• Scheduling Window Size: IWINDOW_WIN_SIZE

Opal: Present and Future
• Implements Sparc instructions
  – Simulating additional Sparc instructions: easy task
  – Porting to x86: substantial code rewrite
• Simulates timing of weaker memory consistency models
  – Add SC checks in Opal
  – Add write buffers for weaker models (like TSO)
• No functional simulation of I/O
  – Plug in disk simulator that interacts with Opal
• Not currently using MAI interface
  – Possible to replace Opal w/ MAI module that interacts with Ruby
• Aggressive micro-architectural techniques not modeled
  – Add support for trace caches, mem. dependence pred., etc.

Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
  – Breakdown network stats
  – Example: Network contention with and without Opal
  – Simulation runtimes
• GEMS Source Code Tour and Extending Ruby
• Building Workloads

Demo: Breaking Down Ruby Stats Files
• Ruby system config print
  – Values of all ruby config parameters
• Overall runtime
  – Target and host machine runtimes, IPC, etc.
• Cache profiling: L1I, L1D, L2… etc.
• Structure occupancy
  – Demand for cache ports, transaction buffers
• Latency breakdown
• Request vs. system state (optional)
• Message delay cycles (optional)
• Network stats
  – Link and switch utilization
• CC event / transition counts
[Diagram: sections of <system_config>.stats, in the order listed above]
system state Message delay cycles Network stats Event / transition counts http://www.cs.wisc.edu/gems Demo • • • • • Two GEMS are Better than One Network behavior with and without Opal 8 processor CMP SPLASH benchmark: ocean 8 byte-wide links between CPUs & L2 cache banks Two runs using a customized network 1. Ruby only • • • • Allows only one requests per processor Maximum 8 outstanding requests Low network utilization Little network contention 2. Ruby & Opal • • • • Slide 75 Allows multiple outstanding requests Maximum 128 outstanding requests Higher network utilization Noticeable network contention http://www.cs.wisc.edu/gems Demo Two GEMS are Better than One Ruby Only Ruby_cycles: 41361869 Message Delayed Cycles ---------------------- Total_delay_cycles: [binsize: 16 max: 553 count: 22892759 average: 0.534205 | standard deviation: 4.18656 | 22855760 20077 1945 325 309 175 105 3935 7681 518 338 254 397 273 166 130 142 33 41 25 26 29 15 10 10 2 0 2 0 4 4 10 7 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ] Network Stats ------------- utilized_percent_switch_0_link_3: 4.38966 bw: 8000 utilized_percent_switch_0_link_4: 4.36838 bw: 8000 links_ base_latency: 1 links_ base_latency: 1 Slide 76 http://www.cs.wisc.edu/gems Demo Two GEMS are Better than One Ruby & Opal Ruby_cycles: 72550169 (41361869) Message Delayed Cycles ---------------------- Total_delay_cycles: [binsize: 16 max: 703 count: 22893122 average: 1.35992 (0.534205) | standard deviation: 6.55126 | 22608266 220366 29575 9084 4686 3248 2009 1687 6018 1798 1143 828 625 516 384 272 271 288 398 319 299 228 203 161 92 51 41 26 12 9 30 39 48 43 25 20 3 0 0 1 0 2 4 4 0 0 0 0 0 0 ] Network Stats ------------- utilized_percent_switch_0_link_3: 7.81863 (4.38966) links_ bw: 8000 base_latency: 1 utilized_percent_switch_0_link_4: 7.64388 (4.36838) links_ bw: 8000 base_latency: 1 Slide 77 http://www.cs.wisc.edu/gems Simulation Time Comparison • Comparisons of Runtimes – Progressively add more simulation fidelity • Simics only • Simics 
+ Ruby • Simics + Ruby + Opal – Accuracy vs. simulation time tradeoff • Target Machine – 8 UltraSPARC™ iii processor SMP (1 GHz) – 4 GBs of memory • Host Machine – AMD Opteron™ uniprocessor (2.2 GHz) – 4 GBs of memory Slide 78 http://www.cs.wisc.edu/gems Simulation Slowdown 2000 JBB Transactions Time Target 20 ms Simics 1 minute Simics + Ruby Simics + Ruby + Opal Slowdown 1 Slowdown / CPU 1 3000 x 380 x 15 minutes 45000 x 5600 x 45 minutes 140000 x 17000 x CAVEAT: These performance numbers may not reflect the optimal configuration of Virtutech Simics. For example, running Simics in “fast mode” (or emulation-only mode) can reduce the slowdown (per CPU) of Simics, compared to real hardware, to less than 10x Slide 79 http://www.cs.wisc.edu/gems Outline • Introduction and Motivation • Demo: Simulating a Multiple-CMP System with GEMS • Ruby: Memory system model • BREAK • Opal: Out-of-order processor model • Demo: Two gems are better than one • GEMS Source Code Tour and Extending Ruby – GEMS software structure – Directory Tour – Demo: Extending Ruby and a CMP Protocol • Building Workloads Slide 80 http://www.cs.wisc.edu/gems GEMS Software Structure System Chip generated/<protocol>/Chip.h Driver common/Driver.h Network network/simple Profiler profiler/Profiler.h Internal Drivers Random Tester tester/Tester.h Topology network/simple/Topology.h Deterministic Tester tester/DeterministicDriver.h Multiple Instantiations Contended Locks tester/SyntheticDriver.h Simics Interface simics/SimicsInterface.h Opal Interface One Instantiation interface/OpalInterface.h http://www.cs.wisc.edu/gems Ruby Software Structure Chip generated/<protocol>/Chip.h SLICC Ruby Sequencer system/Sequencer.h Network Ports Caches Directory system/CacheMemory.h system/DirectoryMemory.h Ruby buffer/MessageBuffer.h SLICC Cache Line generated/<protocol>/L1Cache_Entry.h Directory State generated/<protocol>/Directory_Entry.h Cache Controllers generated/<protocol>/L1Cache_Controller.h 
    generated/<protocol>/L2Cache_Controller.h
  Directory Controller (SLICC): generated/<protocol>/Directory_Controller.h

Map of Directories: Top-Level
Top-Level Directory
  ruby: Memory System Components
  opal: Processor Components
  slicc: Generator Code
  protocols: Protocol Specification Files
  common: Common GEMS C++ code
  gen-scripts: Generated Simics Interface Scripts
  scripts: Common GEMS scripts
  microbenchmarks: Separate Microbenchmark Executables
  results: Simulation Output
  LICENSE
  README
  KNOWN_ISSUES

Map of Directories: ruby
ruby
  Makefile
  buffers: MessageBuffer between consumers
  eventqueue: Global eventqueue
  profiler: Profiling code
  interfaces: Ruby → Opal & Simics
  recorder: cache and trace recorders
  common: Common Ruby C++ structs
  module: The ruby simics module
  simics: Simics → Ruby
  system: Physical memory components
  tester: Random tester & microbenchmarks
  platform: Object files & executables
  html: Protocol tables
  ruby.trace.gz: Example trace file
  init.h/.C: Ruby initializer & destroyer
  config: Ruby config files for Module and tester
  network: Simple network code
  slicc_interface: Abstract classes interface with different protocols
  generated: SLICC generated C++ files
  README.debugging: Ruby debug flag info

Map of Directories: ruby/system
ruby/system
  CacheMemory.h: cache template data structure
  DirectoryMemory.h/C: memory data structure
  MachineID.h: object that uniquely identifies all ruby machines
  NodeID.h: object that identifies a unique chip or machine instantiation
  PersistentArbiter.h/C, PersistentTable.h/C, NodePersistentTable.h/C:
    specific to token protocol
  Sequencer.h/C: manages memory requests between the driver and L1 cache controller
  TBETable.h/C: transaction buffer entry table used by cache controllers for transient requests
  PerfectCacheMemory.h: a fully associative, unbounded cache memory template
  StoreBuffer.h/C: used to simulate TSO-like timing
  StoreCache.h/C: used to simulate TSO-like timing
  TimerTable.h: specific to token protocol
  System.h/C: top-level object of the ruby memory system; all ruby objects can be accessed via the g_system_ptr

Map of Directories: ruby/slicc_interface
ruby/slicc_interface
  AbstractCacheEntry.h/C: ruby abstract class for the protocol specific cache entries
  AbstractChip.h/C: ruby abstract class for the protocol specific chip object
  AbstractProtocol.h/C: contains booleans to define protocol characteristics to ruby
  Message.h: parent class of all messages communicated between consumers via MessageBuffers
  NetworkMessage.h: parent class of all network messages; each protocol implements unique network message objects to communicate between controllers
  RubySlicc_ComponentMapping.h: all address manipulation to determine location and set mapping is here
  RubySlicc_Profiler_interface.h/C: interface between the generated protocol logic and the ruby profiler code
  RubySlicc_Util.h: miscellaneous ruby functions used by the generated controllers
  RubySlicc_includes.h: wrapper for the RubySlicc interface files

Map of Directories: slicc
slicc
  Makefile: Makefile for the SLICC code generator executable
  parser: contains the lexer and parser that construct a protocol’s AST
  main.h/C: main function of the SLICC executable
  ast: Abstract Syntax Tree code
  doc: contains some old but useful documentation
  symbols: contains SLICC objects created during the first pass of the AST; the majority of code is generated by these symbols
  generator: file, html and MIF generator code
  generated: generated lexer and parser files
  platform: Object files & executables
  README: Summary of how SLICC works
  slicc_global.h: defines typedef, namespaces, etc.

Map of Directories: opal
opal
  Makefile
  benchmark: Micro-architecture benchmarks
  config: Module and tester config files
  python: Misc test and graphing scripts
  design: Helpful informal design docs
  regression: Golden results for tester
  bypassing: Misc. proc structs
  fetch: Predictors (branch, Trap, RAS)
  sparc: Implementation-specific defines
  tester: Opal tester files
  trace: Files for branch, memory traces
  platform: Object files & executables
  common: Global Opal structs
  module: Code for Opal module
  system: Pipeline model
  generated: Files for parsing config params
  TODO: Todo wish list
  README: Describes building & running Opal
  README.memory_consistency: Opal handling of mem. consistency

Map of Directories: opal/system (1)
opal/system
  actor.[C h]: General micro-arch. structure class
  arf.[C h]: Register file interface
  cache.[C h]: Opal’s built-in cache structures
  chain.[C h]: Used to analyze mem dependencies
  checkresult.h: Structs used in validation w/ Simics
  config.include: Type defines for config params
  controlop.[C h]: Branch instr type class
  decode.[C h]: Per opcode stats collector class
  dtlb.[C h]: TLB implementation for stand-alone sims
  dx.[C h i]: Code for execution of dynamic instrs
  dynamic.[C h]: Top-level class for all dynamic instrs
  flatarf.[C h]: Non-renamed register file interface
  flow.[C h]: CFG class
  hfa.C: Opal-Simics interface
  hfa_init.h: Opal-Simics interface externs
  histogram.[C h]: Histogram stats class

Map of Directories: opal/system (2)
opal/system
  ipage.[C h]: Instruction page class
  ipagemap.[C h]: Instruction page cache class
  lockstat.[C h]: Stats on locks in system
  lsq.[C h]: LSQ structure
  mf_api.h: Symlink to Opal-Ruby interface
  pseq.[C h]: Top-level proc sequencer
  mshr.[C h]: MSHR structure (used in Opal cache hierarchy
    only)
  iwindow.[C h]: RUU-like struct for storing/tracking instrs
  ix.[C h]: Code to execute CFG instructions
  memop.[C h]: Memory instr class
  memstat.[C h]: Memory addr stats class
  pipepool.[C h]: Wait-list object. Used to model MSHR when running w/ Ruby
  pipestate.[C h]: Single waiter object for pipepool
  pstate.[C h]: Functions used for API calls to Simics
  ptrace.[C h]: Used for analyzing memory traces
  regbox.[C h]: Contains interface ptrs to registers.

Map of Directories: opal/system (3)
opal/system
  regfile.[C h]: Models the register file itself
  regmap.[C h]: Rename map structure
  rubycache.[C h]: Handles all Opal Ruby memory transactions
  scheduler.[C h]: Global event queue
  simdist12.C: Dummy Simics functions for tester
  sparx.C: Several includes
  sstat.[C h]: Stats class for static insts
  statici.[C h]: Decoded instr class
  stopwatch.[C h]: Timer class, used to collect time stats
  sysstat.[C h]: Stats class for dynamic insts
  system.[C h]: Top-level class for manipulating sim
  threadstat.[C h]: Stats class for tracking per-thread stats
  wait.[C h]: Wait-list object for dynamic insts

Demo: Extending Ruby
• Goal:
  – Add new functionality to Ruby and interface to SLICC
• DemoPrefetcher
  – Simple, L2->memory next-line prefetcher
  – Module implemented as C++ object (DemoPrefetcher.C)
  – New type added to SLICC
  – Observes L1 GETS requests via function call
  – Triggers event for prefetch in next cycle
• Object is connected to an in_port
  – Not the only way (or the right way) of implementing a prefetcher

Demo: Implementing
DemoPrefetcher
• Creating an object that can “wakeup” a controller

DemoPrefetcher.h

    class DemoPrefetcher {
    public:
      // An object in a SLICC controller will be passed a Chip*
      DemoPrefetcher(Chip* chip_ptr);

      // Allow an in_port to be attached
      void setConsumer(Consumer* consumer_ptr) { m_consumer_ptr = consumer_ptr; }

      // When wakeup() is called, ensure it should do something
      bool isReady() const;

      // functions to implement simple next-line prefetching
      const Address& popNextPrefetch();
      const Address& peekNextPrefetch() const;
      void cancelNextPrefetch();
      void observeL1Request(const Address& address);

      // (private members such as m_consumer_ptr and m_prefetch_queue
      //  are not shown on the slide)
    };

Demo: Implementing DemoPrefetcher

DemoPrefetcher.C

    void DemoPrefetcher::observeL1Request(const Address& address)
    {
      // next-line prefetch address
      Address prefetch_addr = address;
      prefetch_addr.makeNextStrideAddress(1);

      // add to prefetch queue
      m_prefetch_queue.push( prefetch_addr );

      // when to wake up: choose 1 cycle later
      Time ready_time = g_eventQueue_ptr->getTime() + 1;

      // schedule a wakeup() so that the L2 controller can trigger
      g_eventQueue_ptr->scheduleEventAbsolute(m_consumer_ptr, ready_time);
    }

Demo: Interfacing DemoPrefetcher to SLICC

    external_type(DemoPrefetcher, inport="yes") {
      bool isReady();
      Address popNextPrefetch();
      void cancelNextPrefetch();
      Address peekNextPrefetch();
      void observeL1Request(Address);
    }

    DemoPrefetcher prefetcher;

    // wakeup logic
    in_port(prefetcher_in, Null, prefetcher) {
      if (prefetcher_in.isReady() ) {
        if (L2cacheMemory.cacheAvail(prefetcher.peekNextPrefetch()) ||
            L2cacheMemory.isTagPresent(prefetcher.peekNextPrefetch())) {
          if ( getState(prefetcher.peekNextPrefetch()) == State:I ||
               getState(prefetcher.peekNextPrefetch()) == State:NP ) {
            trigger(Event:Prefetch, prefetcher.popNextPrefetch());
          } else {
            // tag is already present in a non-invalid state
            prefetcher.cancelNextPrefetch();
          }
        } else {
          trigger(Event:L2_Replacement,
                  L2cacheMemory.cacheProbe(prefetcher.peekNextPrefetch()));
        }
      }
    }

Demo: Implementing DemoPrefetcher
• Nice property of TokenCMP: no tracking of prefetch
  – A tag is allocated and a request issued to memory
  – keeps received tokens/data if tag allocated

MOESI_CMP_tokenDEMO-L2cache.sm

    transition(NP, Prefetch, I) {
      vv_allocateL2CacheBlock;
      a_issuePrefetch;
    }

    transition(I, Prefetch) {
      a_issuePrefetch;
    }

    transition({S,O,M,I_L,S_L}, Prefetch) {
      // do nothing
    }

Workloads for Simics/GEMS
• Unfortunately, we cannot release our workloads (legal reasons)
• Steps for Workload Development
  – Simple Example: Barnes-Hut
  – What about more complex applications?
• Workload Simulation Methodology
  – Simulating transactions/requests
  – Coping with workload variability

Workload Setup
• Simple Example: Barnes-Hut (Splash2 suite)
  – Commands not to be taken literally!
    (might be different in different versions)
• Main Steps:
  – Build OS checkpoint
  – Copy application source or binary to simulation
  – Create initial (cold) application checkpoint in Simics
  – Create warm application checkpoint with Simics/Ruby

Build OS Checkpoint
• Use Simics to boot your OS and get a checkpoint (assuming 16 processor serengeti target machine)
  – cd simics/home/sarek
  – ./simics -x sarek-16p.simics
    • Script loads configuration and boots Solaris
    • Scripts should be provided with your Simics distribution assuming you have Solaris license (contact Virtutech Simics Forum)
    • Modify scripts to fit your target configuration (e.g., memory, disk, network)
  – At the end of your script, take a system snapshot (checkpoint):
      simics> write-configuration CHKPT_DIR/sarek-16p.check
      simics> quit
  – Use this checkpoint to build all your workloads’ 16 processor checkpoints

Copy Barnes Source or Binary
• Develop benchmark on real machine (if available)
  – Use Simics “magic” instructions after initialization
    • See Simics reference manual for magic instruction use
  – Compile benchmark with such instructions before running in Simics
• Load from your OS checkpoint
  – ./simics
      simics> read-configuration CHKPT_DIR/sarek-16p.check
      simics> magic-break-enable
• Copy binary into simulated machine (or copy source and compile)
  – Console commands:
      mount /host
      cp -r /host/workloads/splash2/codes/apps/barnes/BARNES .
    • See Simics reference manual on the use of the /host filesystem

Obtain Initial Barnes Checkpoint
• Warm up application in Simics
  – Console Commands:
      ./BARNES < input-warm
    • input-warm specifies Barnes parameters
      ./BARNES < input-warm
    • Use this second run to warm up cache (see next slide)
      ./BARNES < input-run > output; magic_call break
• After initial run, write checkpoint
      simics> write-configuration CHKPT_DIR/barnes-cold-16p.check
      simics> quit
• Checkpoint is ready for GEMS run

Obtain Warm Barnes Checkpoint
• Load initial checkpoint
  – setenv CHECKPOINT_AT_END yes
  – setenv TRANSACTIONS 1
  – setenv PROCESSORS 16
  – setenv CHECKPOINT CHKPT_DIR/barnes-cold-16p.check
  – ./simics -no-win -x GEMS_ROOT/gen-scripts/go.simics
• Script (provided in release) should load ruby and run till the end of the warmup run
  – Also writes checkpoint at the end
• Edit checkpoint to remove ruby object
  – Modify script to suit your needs

What About More Complex Applications?
• Setup on real hardware
  – Tune workload, OS parameters
  – Scale-down for PC memory limits
  – Re-tune
  – For details, [Alameldeen et al., IEEE Computer, Feb’03]
• What if we don’t have access to real hardware?
  – Install applications and setup in Simics
  – Checkpoint often
  – Not optimal for large scale applications!

Simulating Transactions/Requests
• Throughput-based applications
  – Work-based unit to compare configurations
  – IPC not always meaningful
• Counting Transactions during Simulation
  – Enable magic breaks in Simics
  – Benchmark traps to Simics on every magic instruction
  – Count magic breaks until we reach required number of transactions
  – Cope with benchmark variability

Why Consider Variability?
[Figure: OLTP]

Workload Variability
• How can slower memory lead to faster workload?
• Answer: Multithreaded workload takes different paths
  – Different lock race outcomes
  – Different scheduling decisions
  → Runs from same initial conditions can be different
  This can lead to wrong conclusions for deterministic simulations
• Solution with deterministic simulation
  – Add pseudo-random delay on memory accesses (MEMORY_LATENCY)
  – Simulate base (and enhanced) system multiple times
  – Use simple or complex statistics [Alameldeen and Wood, HPCA 2003]

The End
• Download and Subscribe to Mailing Lists
  http://www.cs.wisc.edu/gems
• We encourage your contributions
  – Workloads
  – Additional timing fidelity

Additional Opal Slides

Sensitivity Analysis
[Figure]

Sensitivity Results
[Figure]

Opal and Memory Consistency
• Designed to be aggressive OoO processor
• Our use of Simics is sequentially consistent execution
• Models the performance of weaker models (such as TSO) for only SC memory interleavings
• Violations of SC in Opal:
  – Identical MSHR entry for memory requests with same addr
  – Executes Ld/St out of program order
  – No snooping of LSQ for external stores

Implemented UltraSPARC Instructions (1)
add bpe addc bpg bpge addcc bpgu addccc bpl alignaddr bple alignaddrl
bpleu bpn and bpne andcc bpneg andn bpos andncc bppos bpvc ba bpvs
bcc brgez bcs brgz be brlez brlz bg brnz bge brz bgu bshuffle bl bvc
bvs ble call bleu casa bmask casxa bn cmp done bne retry bneg fabsd
bpa fabsq bpcc fabss bpcs faddd faddq fadds faligndata fba fbe fbg
fbge fbl fble fblg fbn fbne fbo fbpa fbpe fbpg fbpge fbpl fbple fbplg
fbpn fbpne fbpo fbpu fbpue fbpug fbpuge fbpul fbpule fbu fbue fbug
fbuge fbul fbule fcmpd fcmped fcmpeq fcmpeq16 fcmpeq32 fcmpes fcmpgt16
fcmpgt32 fcmple16 fcmple32 fcmpne16 fcmpne32 fcmpq fcmps fdivd fdivq
fdivs fdmulq fdtoi
fdtoq fdtos fdtox fitod fitoq fitos flush flushw fmovd fmovda fmovdcc
fmovdcs fmovde fmovdg fmovfqlg fmovql fmovdge fmovfqn fmovqle fmovdgu
fmovfqne fmovqleu fmovdl fmovfqo fmovqn fmovdle fmovfqu fmovqne
fmovdleu fmovfque fmovqneg fmovdn fmovfqug fmovqpos fmovdne fmovfquge
fmovqvc fmovdneg fmovfqul fmovqvs fmovdpos fmovfqule fmovrdgez
fmovrdgz fmovdvc fmovfsa fmovrdlez fmovdvs fmovfse fmovfda fmovfsg
fmovrdlz fmovfde fmovfsge fmovrdnz fmovrdz fmovfdg fmovfsl fmovfdge
fmovfsle fmovrqgez fmovfdl fmovfslg fmovrqgz fmovfdle fmovfsn
fmovrqlez fmovfdlg fmovfsne fmovrqlz fmovfdn fmovfso fmovrqnz fmovfdne
fmovfsu fmovrqz fmovfdo fmovfsue fmovrsgez fmovfdu fmovfsug fmovrsgz
fmovfdue fmovfsuge fmovrslez fmovfdug fmovfsul fmovrslz fmovfduge
fmovfsule fmovrsnz fmovrsz fmovfdul fmovq fmovs fmovfdule fmovqa
fmovfqa fmovqcc fmovsa fmovfqe fmovqcs fmovscc fmovscs fmovfqg fmovqe
fmovse fmovfqge fmovqg fmovfql fmovqge fmovsg fmovfqle fmovqgu fmovsge
fmovsgu fmovsl fmovsle fmovsleu fmovsn fmovsne fmovsneg fmovspos
fmovsvc fmovsvs fmuld fmulq fmuls fnegd fnegq fnegs fqtod fqtoi fqtos
fqtox fsmuld fsqrtd fsqrtq fsqrts fsrc1 fstod fstoi fstoq fstox fsubd
fsubq fsubs fxtod fxtoq

Implemented UltraSPARC Instructions (2)
fxtos fzero fzeros ill impdep1 impdep2 jmp jmpl ldblk ldd ldda lddf
lddfa ldf ldfa ldfsr ldqa ldqf ldqfa ldsb ldsba ldsh ldsha ldstub
ldstuba ldsw ldswa ldub lduba lduh lduha lduw lduwa ldx ldxa ldxfsr
membar mov mova movcc movcs move movfa movfe movfg movfge movfl movfle
movflg movfn movfne movfo movfu movfue movfug movfuge movful movfule
movg movge movgu movl movle movleu movn movne movneg movpos movrgez
movrgz movrlez movrlz movrnz movrz movvc movvs mulscc mulx nop not or
orcc orn orncc popc prefetch prefetcha rd rdcc rdpr restore restored
retrn save saved sdiv sdivcc sdivx sethi sll sllx smul smulcc sra srax
srl srlx stb stba stbar stblk std stda stdf stdfa stf stfa stfsr sth
stha stqf stqfa stw stwa stx stxa stxfsr sub subc subcc
subccc swap swapa ta taddcc taddcctv tcc tcs te tg tge tgu tl tle tleu
tn tne tneg tpos trap tsubcc tsubcctv tvc tvs udiv udivcc udivx umul
umulcc wr wrcc wrpr xnor xnorcc xor xorcc

TLB Misses
• ITLB Misses
  – emit special NOP instruction: STATIC_INSTR_MOP; stall fetch
  – does NOT update PC, NPC
  – fetch resumes whenever any instr (including special NOP) squashes
• DTLB Misses
  – Set DTLB miss trap for instruction (setTrapType()) in Execute()
  – In retireInstruction(), retrieve trap and call takeTrap() to set trap state for DTLB handler
  – refetch from DTLB handler

Example: Load instruction
• In dynamic_t::Schedule(), load waits until all operands ready (WAIT_XX_STAGE cases)
• Scheduler gets invoked when all operands ready
• Load waits until read port to L1 is available
• load_inst_t::Execute() gets called
  – Generates virtual address
  – Performs D-TLB address translation
  – Inserts entry in LSQ
  – Initiates cache access (via Ruby or Opal’s built-in simple cache hierarchy)
  – If cache miss -> put on wait list (CACHE_MISS_STAGE) and is woken up by rubycache_t::complete()
• Invokes Simics to read actual memory value in load_inst_t::Complete()
• Retirement check of load value & squash if value deviates from Simics

Modifying Opal-Ruby Interface
• Ruby->Opal interface defined in mf_opal_api object (ruby/interfaces/mf_api.h)
• Opal->Ruby interface defined in mf_ruby_api object
• To create new Ruby->Opal callback (ex: hitCallback())
  – Define function in ruby/interfaces/OpalInterface.C
  – Add new function pointer to mf_opal_api object
  – Create a new function handler in opal/system/system.C and assign m_opal_api object’s new function pointer to this function handler
• To create new Opal->Ruby callback (ex: makeRequest())
  – Define function in ruby/interfaces/OpalInterface.C
  – Add new function pointer to mf_ruby_api object
  – Assign function pointer to new function in
    OpalInterface::installInterface()
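
The callback steps on the Opal-Ruby interface slide amount to filling in two tables of function pointers, one per direction. The following is a minimal, self-contained sketch of that pattern only. The struct names demo_opal_api and demo_ruby_api, the two-argument signatures, and every helper function here are hypothetical stand-ins; the real mf_opal_api and mf_ruby_api definitions in ruby/interfaces/mf_api.h differ and should be consulted before changing the interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the two function-pointer tables.
// These are NOT the real mf_opal_api / mf_ruby_api layouts.
struct demo_opal_api {   // Ruby -> Opal callbacks
    void (*hitCallback)(int cpu, std::uint64_t addr) = nullptr;
};

struct demo_ruby_api {   // Opal -> Ruby callbacks
    void (*makeRequest)(int cpu, std::uint64_t addr) = nullptr;
};

static demo_opal_api g_opal_api;
static demo_ruby_api g_ruby_api;
static std::vector<std::uint64_t> g_completed;  // addresses Opal saw complete

// "Opal side": handler that Ruby invokes when a request completes.
static void opalHitHandler(int /*cpu*/, std::uint64_t addr) {
    g_completed.push_back(addr);
}

// "Ruby side": handler that Opal invokes to issue a request.
// This toy memory system completes every request immediately.
static void rubyMakeRequest(int cpu, std::uint64_t addr) {
    if (g_opal_api.hitCallback != nullptr) {
        g_opal_api.hitCallback(cpu, addr);
    }
}

// Plays the role the slide gives to OpalInterface::installInterface().
static void installRubyInterface(demo_ruby_api& api) {
    api.makeRequest = &rubyMakeRequest;
}

// Plays the role of the assignment the slide places in opal/system/system.C.
static void installOpalInterface(demo_opal_api& api) {
    api.hitCallback = &opalHitHandler;
}

// Wires both tables, issues two requests, and returns how many completed.
std::size_t runInterfaceDemo() {
    g_completed.clear();
    installOpalInterface(g_opal_api);
    installRubyInterface(g_ruby_api);
    assert(g_ruby_api.makeRequest != nullptr);
    g_ruby_api.makeRequest(0, 0x1000);
    g_ruby_api.makeRequest(1, 0x2040);
    return g_completed.size();
}
```

Calling runInterfaceDemo() from a main() returns 2, one completion per request. Adding a new callback in this sketch takes the same three steps the slide lists: write the function, add a pointer for it to the appropriate table, and assign the pointer in the matching install routine.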