Introduction to SimpleScalar
(Based on SimpleScalar Tutorial)
CSCE614
Texas A&M University
2015-04-09

Overview
• What is an architectural simulator?
  – A tool that reproduces the behavior of a computing device
• Why use a simulator?
  – Leverages a faster, more flexible software development cycle
    • Permits more design space exploration
    • Facilitates validation before H/W becomes available
    • Level of abstraction is tailored to the design task
    • Possible to increase/improve system instrumentation
    • Usually less expensive than building a real system

Advantages of SimpleScalar
• Highly flexible
  – Functional simulator + performance simulator
• Portable
  – Host: virtual target runs on most Unix-like systems
  – Target: simulators can support multiple ISAs
• Extensible
  – Source is included for compiler, libraries, simulators
  – Easy to write simulators
• Performance
  – Runs codes approaching ‘real’ sizes

Simulation Tools
[Figure: taxonomy tree of architectural simulators with nodes Functional, Performance, Trace-Driven, Exec-Driven, Direct Execution, Interpreters, Inst Schedulers, and Cycle Timers. Shaded tools are included in the SimpleScalar tool set.]

Functional vs. Performance Simulators
• Functional simulators implement the architecture
  – Perform real execution
  – Implement what programmers see
• Performance simulators implement the microarchitecture
  – Model system resources/internals
  – Are concerned with timing
  – Do not implement what programmers see

Trace Driven vs. Execution Driven Simulators
• Trace-Driven
  – Simulator reads a ‘trace’ of the instructions captured during a previous execution
  – Easy to implement
  – No functional components necessary
  – No feedback to the trace (e.g., on mis-prediction)
• Execution-Driven
  – Simulator runs the program (trace-on-the-fly)
  – Hard to implement
  – Advantages
    • Faster than tracing
    • No need to store traces
    • Register and memory values are available (they usually are not in a trace)
    • Supports mis-speculation cost modeling

Instruction Schedulers vs. Cycle Timers
• Instruction Schedulers
  – Simulator schedules each instruction when resources are available
  – Instructions are processed one at a time
  – Simpler, but less detailed
• Cycle Timers
  – Simulator tracks microarchitecture state each cycle
  – Simulator state == microarchitecture state
  – Well suited to microarchitecture simulation

SimpleScalar Release 3.0
• SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP
• All simulators now support external I/O traces (EIO traces), generated with a new simulator (sim-eio)
• Supports more platforms
• Explicit fault support
• And many more

Simulator Suite
(ordered from highest simulation performance to most detail)
• Sim-Fast: 300 lines, functional, 4+ MIPS
• Sim-Safe: 350 lines, functional w/ checks
• Sim-Profile: 900 lines, functional, lots of stats
• Sim-Cache / Sim-Cheetah / Sim-BPred: < 1000 lines, functional, cache stats, branch stats
• Sim-Outorder: 3900 lines, performance, OoO issue, branch pred., mis-spec., ALUs, cache, TLB, 200+ KIPS

Sim-Fast
• Functional simulation
• Optimized for speed
• Assumes no cache
• Assumes no instruction checking
• Does not support Dlite!
• Does not allow command line arguments
• <300 lines of code

Sim-Safe
• Functional simulation
• Checks for instruction errors
• Optimized for speed
• Assumes no cache
• Supports Dlite!
• Does not allow command line arguments

Sim-Cache
• Cache simulation
• Ideal for fast simulation of caches (when the effect of cache performance on execution time is not needed)
• Accepts command line arguments for:
  – Level 1 & 2 instruction and data caches
  – TLB configuration (data and instruction)
  – Flush and compress
  – and more
• Ideal for performing high-level cache studies that don’t take access time of the caches into account

Sim-Cache (cont'd)
• Generates one- and two-level cache hierarchy statistics and profiles
• Extra options (also supported on sim-outorder):
    -cache:dl1 <config>     - level 1 data cache configuration
    -cache:dl2 <config>     - level 2 data cache configuration
    -cache:il1 <config>     - level 1 instruction cache configuration
    -cache:il2 <config>     - level 2 instruction cache configuration
    -tlb:dtlb <config>      - data TLB configuration
    -tlb:itlb <config>      - instruction TLB configuration
    -flush <config>         - flush caches on system calls
    -icompress              - remaps 64-bit inst addresses to 32-bit equiv.
    -pcstat <stat>          - record statistic <stat> by text address

Specifying Cache Configurations
• All cache and TLB configurations are specified with the same format:
    <name>:<nsets>:<bsize>:<assoc>:<repl>
• where:
    <name>   - cache name (make this unique)
    <nsets>  - number of sets
    <bsize>  - block size in bytes (page size for TLBs)
    <assoc>  - associativity (number of “ways”)
    <repl>   - set replacement policy
                 l - for LRU
                 f - for FIFO
                 r - for RANDOM
• Examples:
    il1:1024:32:2:l     2-way set-assoc 64k-byte cache, LRU
    dtlb:1:4096:64:r    64-entry fully assoc TLB w/ 4k pages, random replacement

Sim-Bpred
• Simulates different branch prediction mechanisms
• Generates prediction hit and miss rate reports
• Does not simulate the effect of branch prediction on total execution time
• Predictor types:
    nottaken
    taken
    perfect
    bimod     bimodal predictor
    2lev      2-level adaptive predictor
    comb      combined predictor (bimodal and 2-level)

Sim-Profile
● Program profiler
● Generates detailed profiles, by symbol and by address
● Keeps track of and reports
  ● Dynamic instruction counts
  ● Instruction class counts
  ● Branch class counts
  ● Usage of address modes
  ● Profiles of the text & data segment

Sim-Outorder
• Most complicated and detailed simulator
• Supports out-of-order issue and execution
• Provides reports on
  – branch prediction
  – cache
  – external memory
  – various configurations

Sim-Outorder: Detailed Performance Simulator
• Generates timing statistics for a detailed out-of-order issue processor core with a two-level cache memory hierarchy and main memory
• Extra options:
    -fetch:ifqsize <size>   - instruction fetch queue size (in insts)
    -fetch:mplat <cycles>   - extra branch mis-prediction latency (cycles)
    -bpred <type>           - specify the branch predictor
    -decode:width <insts>   - decoder bandwidth (insts/cycle)
    -issue:width <insts>    - RUU issue bandwidth (insts/cycle)
    -issue:inorder          - constrain instruction issue to program order
    -issue:wrongpath        - permit instruction issue after mis-speculation
    -ruu:size <insts>       - capacity of RUU (insts)
    -lsq:size <insts>       - capacity of load/store queue (insts)
    -cache:dl1 <config>     - level 1 data cache configuration
    -cache:dl1lat <cycles>  - level 1 data cache hit latency
    -cache:dl2 <config>     - level 2 data cache configuration
    -cache:dl2lat <cycles>  - level 2 data cache hit latency
    -cache:il1 <config>     - level 1 instruction cache configuration
    -cache:il1lat <cycles>  - level 1 instruction cache hit latency
    -cache:il2 <config>     - level 2 instruction cache configuration
    -cache:il2lat <cycles>  - level 2 instruction cache hit latency
    -cache:flush            - flush all caches on system calls
    -cache:icompress        - remap 64-bit inst addresses to 32-bit equiv.
    -mem:lat <1st> <next>   - specify memory access latency (first, rest)
    -mem:width              - specify width of memory bus (in bytes)
    -tlb:itlb <config>      - instruction TLB configuration
    -tlb:dtlb <config>      - data TLB configuration
    -tlb:lat <cycles>       - latency (in cycles) to service a TLB miss
    -res:ialu               - specify number of integer ALUs
    -res:imult              - specify number of integer multiplier/dividers
    -res:memports           - specify number of first-level cache ports
    -res:fpalu              - specify number of FP ALUs
    -res:fpmult             - specify number of FP multiplier/dividers
    -pcstat <stat>          - record statistic <stat> by text address
    -ptrace <file> <range>  - generate pipetrace

Specifying the Branch Predictor
• Specifying the branch predictor type:
    -bpred <type>
• The supported predictor types are:
    nottaken  always predict not taken
    taken     always predict taken
    perfect   perfect predictor
    bimod     bimodal predictor (BTB w/ 2 bit counters)
    2lev      2-level adaptive predictor
• Configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
    -bpred:bimod <size>     size of direct-mapped BTB

Specifying the Branch Predictor (cont'd)
• Configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
    -bpred:2lev <l1size> <l2size> <hist_size> <xor>
• Configurations: N, M, W, X
    N: # entries in first level (# of shift register(s))
    M: # entries in 2nd level (# of counters, or other FSM)
    W: width of shift register(s) (# of bits in each shift register)
    X: xor history and address for 2nd level index (yes-1/no-0; we use 0 for this homework)
• Sample predictors:
    GAg: 1, M, W, 0  where M = 2^W
    GAp: 1, M, W, 0  where M = C * 2^W, C is # of per-address prediction tables
    PAg: N, M, W, 0  where M = 2^W
    PAp: N, M, W, 0  where M = N * 2^W

Performance Comparison of GAg, GAp, PAg and PAp
• GAp: 1 global history register and 8 per-address prediction tables
[Figure: (a) GAp; (b) (2,2) predictor. Labels: branch address, 2-bits per branch predictor, 2-bit global branch history, prediction.]

Hack the state machine of Branch Predictor!
[Figure: two 2-bit predictor state machines, each with two T states and two NT states connected by Taken / Not taken transitions. (a) A3 (same as shown in the textbook); (b) A2 (original SimpleScalar implementation).]

Sim-Outorder HW Architecture
[Figure: pipeline diagram. Stages: Fetch, Dispatch, Scheduler (Register Scheduler, Memory Scheduler), Exe, Mem, Writeback, Commit. Memory system: I-Cache, I-TLB, D-Cache, D-TLB, Virtual Memory.]

Sim-Outorder (Main Loop)
• sim_main() in sim-outorder.c

    ruu_init();
    for (;;) {
        ruu_commit();
        ruu_writeback();
        lsq_refresh();
        ruu_issue();
        ruu_dispatch();
        ruu_fetch();
    }

• The loop body is executed once for each simulated machine cycle
• Walks the pipeline from Commit to Fetch
  – Reverse traversal handles inter-stage latch synchronization in a single pass

Sim-Outorder (RUU/LSQ)
• RUU (Register Update Unit)
  – Handles register synchronization/communication
  – Serves as reorder buffer and reservation stations
  – Performs out-of-order issue when register and memory dependences are satisfied
• LSQ (Load/Store Queue)
  – Handles memory synchronization/communication
  – Contains all loads and stores in program order
• Relationship between RUU and LSQ
  – Memory dependences are resolved by the LSQ
  – Load/store effective addresses are calculated in the RUU

Sim-Outorder: Fetch
● ruu_fetch()
● Models machine fetch bandwidth
● Fetches instructions from one I-cache/memory block, stalling until I-cache misses are resolved
● Instructions are put into the instruction fetch queue, named fetch_data in sim-outorder.c (it is also called the dispatch queue in the paper)
● Probes the branch predictor to obtain the cache line for the next cycle

Sim-Outorder: Dispatch
● ruu_dispatch()
● Models instruction decoding and register renaming
● Takes instructions from fetch_data
● Decodes instructions
● Enters and links instructions into the RUU and LSQ
● Splits memory operations into two separate instructions

Sim-Outorder: Scheduler
● lsq_refresh()
● Models instruction selection, wakeup and issue
  ● Separate schedulers track register and memory dependences
  ● Locates instructions with all register inputs ready and all memory inputs ready
    ● Issue of a ready load is stalled if there is a store with an unresolved effective address in the LSQ
    ● If an earlier store's address matches the load address, the store's value is forwarded to the load
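
The two load rules on the Scheduler slide (stall behind a store whose address is unresolved, forward from a matching earlier store) can be sketched in C as below. This is a minimal illustration under stated assumptions, not the code of lsq_refresh() in sim-outorder.c: struct lsq_entry, enum load_action, and schedule_load() are hypothetical names, the queue is modeled as a plain array in program order, and addresses are compared for exact equality only (no partial-overlap handling).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical LSQ entry; field names are illustrative only. */
struct lsq_entry {
    bool     is_store;    /* true for a store, false for a load */
    bool     addr_known;  /* effective address has been computed */
    uint32_t addr;        /* effective address (valid if addr_known) */
    uint32_t value;       /* store data (meaningful for stores only) */
};

enum load_action { LOAD_STALL, LOAD_FORWARD, LOAD_ACCESS_CACHE };

/* Decide what the load at lsq[load_idx] may do this cycle.
 * Entries lsq[0 .. load_idx-1] are older, in program order. */
enum load_action schedule_load(const struct lsq_entry *lsq, size_t load_idx,
                               uint32_t *fwd_value)
{
    assert(lsq != NULL && fwd_value != NULL);
    assert(!lsq[load_idx].is_store && lsq[load_idx].addr_known);

    /* Rule 1: any older store with an unresolved address stalls the load,
     * because that store might write the location the load reads. */
    for (size_t i = 0; i < load_idx; i++)
        if (lsq[i].is_store && !lsq[i].addr_known)
            return LOAD_STALL;

    /* Rule 2: scan older entries from youngest to oldest; the first store
     * to the same address supplies the load's value. */
    for (size_t i = load_idx; i-- > 0; )
        if (lsq[i].is_store && lsq[i].addr == lsq[load_idx].addr) {
            *fwd_value = lsq[i].value;
            return LOAD_FORWARD;
        }

    /* No conflict with older stores: the load can go to the D-cache. */
    return LOAD_ACCESS_CACHE;
}
```

Scanning from the youngest older store backwards matters when several earlier stores write the same address: the load must see the most recent one in program order.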

Sim-Outorder: Execute
● ruu_issue()
● Models functional units, D-cache issue and execute latencies
● Gets instructions that are ready
● Reserves a free functional unit
● Schedules writeback events using the latency of the functional unit
● Latencies are hardcoded in fu_config[] in sim-outorder.c

Sim-Outorder: Writeback
● ruu_writeback()
● Models writeback bandwidth, detects mis-predictions, and initiates the mis-prediction recovery sequence
● Gets instructions that have finished execution (specified in the event queue)
● Wakes up instructions that depend on a completed instruction, following the dependence chains of its outputs
● Detects branch mis-predictions and rolls state back to the checkpoint

Sim-Outorder: Commit
● ruu_commit()
● Models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
● While the head of the RUU/LSQ is ready to commit:
  ● D-TLB miss handling
  ● Retire store to D-cache
  ● Update register file and rename table
  ● Reclaim RUU/LSQ resources

Sim-Outorder: Processor core and other specifications
• Instruction fetch, decode and issue bandwidth
• Capacity of RUU and LSQ
• Branch mis-prediction latency
• Number of functional units
  – Integer ALU, integer multipliers/dividers
  – FP ALU, FP multipliers/dividers
• Latency of I-cache/D-cache, memory and TLB
• Record statistic by text address

Global Options
• These are supported on most simulators:
    -h                  print help message
    -d                  enable debug messages
    -i                  start up in the Dlite! debugger
    -q                  quit immediately (use with -dumpconfig)
    -config <file>      read config parameters from <file>
    -dumpconfig <file>  save config parameters into <file>

How to get help from us
• Drop by during the TA’s office hours
• E-mail: khkim@cse.tamu.edu
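
Worked example: cache configuration strings
As a check on the <name>:<nsets>:<bsize>:<assoc>:<repl> format used by the -cache:* and -tlb:* options, the total capacity of a configuration is nsets × bsize × assoc. The C sketch below parses a configuration string and computes that product. cache_capacity_bytes() is a hypothetical helper written for this illustration; it is not part of SimpleScalar and does not reproduce the simulators' own option parsing.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "<name>:<nsets>:<bsize>:<assoc>:<repl>" and return the capacity
 * nsets * bsize * assoc in bytes, or -1 if the string is malformed.
 * Hypothetical helper for illustration only. */
long cache_capacity_bytes(const char *cfg)
{
    char name[32];
    char repl;
    long nsets, bsize, assoc;

    assert(cfg != NULL);

    /* Five fields separated by ':'; the name is limited to 31 characters. */
    if (sscanf(cfg, "%31[^:]:%ld:%ld:%ld:%c",
               name, &nsets, &bsize, &assoc, &repl) != 5)
        return -1;

    /* Sizes must be positive; replacement policy is l (LRU), f (FIFO)
     * or r (random), as listed on the slide. */
    if (nsets <= 0 || bsize <= 0 || assoc <= 0)
        return -1;
    if (repl != 'l' && repl != 'f' && repl != 'r')
        return -1;

    return nsets * bsize * assoc;
}
```

For il1:1024:32:2:l this gives 1024 × 32 × 2 = 65536 bytes, the 64k-byte 2-way cache from the slide. For dtlb:1:4096:64:r it gives 1 × 4096 × 64 = 262144 bytes, i.e. 64 fully associative entries each mapping a 4k page.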