Introduction to SimpleScalar
(Based on SimpleScalar Tutorial)
CSCE614
Texas A&M University
2015-04-09
Overview
• What is an architectural simulator
– a tool that reproduces the behavior of a computing device
• Why use a simulator
– Leverage a faster, more flexible software development cycle
    • Permit more design space exploration
    • Facilitates validation before H/W becomes available
    • Level of abstraction is tailored by design task
    • Possible to increase/improve system instrumentation
    • Usually less expensive than building a real system
Advantages of SimpleScalar
• Highly flexible
– functional simulator + performance simulator
• Portable
– Host: virtual target runs on most Unix-like systems
– Target: simulators can support multiple ISAs
• Extensible
– Source is included for compiler, libraries, simulators
– Easy to write simulators
• Performance
– Runs codes approaching ‘real’ sizes
Simulation Tools
[Figure: taxonomy tree rooted at Architectural Simulators, with nodes Functional, Performance, Trace-Driven, Exec-Driven, Interpreters, Direct Execution, Inst Schedulers and Cycle Timers]
Shaded tools are included in SimpleScalar Tool Set
Functional vs. Performance Simulators
• Functional simulators implement the architecture
– perform real execution
– Implement what programmers see
• Performance simulators implement the microarchitecture
– Model system resources/internals
– Concerned with timing
– Do not implement what programmers see
Trace Driven vs. Execution Driven Simulators
• Trace-Driven
– Simulator reads a ‘trace’ of the instructions captured during a previous execution
– Easy to implement
– No functional components necessary
– No feedback to trace (e.g., mis-prediction)
• Execution-Driven
– Simulator runs the program (trace-on-the-fly)
– Hard to implement
– Advantages
    • Faster than tracing
    • No need to store traces
    • Register and memory values usually are not in trace
    • Support mis-speculation cost modeling
Instruction Schedulers vs. Cycle Timers
• Instruction Schedulers
– Simulator schedules instruction when resources are available
– Instructions are processed one at a time
– Simpler, but less detailed
• Cycle Timers
– Simulator tracks microarch. state each cycle
– Simulator state == microarchitecture state
– Perfect for microarchitecture simulation
SimpleScalar Release 3.0
• SimpleScalar now executes multiple instruction sets: SimpleScalar PISA (the old "SimpleScalar ISA") and Alpha AXP.
• All simulators now support external I/O traces (EIO traces), generated with a new simulator (sim-eio)
• Supports more platforms
• Explicit fault support
• And many more
Simulator Suite
• Sim-Fast
– 300 lines
– functional
– 4+ MIPS
• Sim-Safe
– 350 lines
– functional w/ checks
• Sim-Profile
– 900 lines
– functional
– lots of stats
• Sim-Cache, Sim-Cheetah, Sim-BPred
– < 1000 lines
– functional
– cache stats
– branch stats
• Sim-Outorder
– 3900 lines
– performance
– OoO issue
– branch pred.
– mis-spec.
– ALUs
– cache
– TLB
– 200+ KIPS
[Figure: the simulators are arranged along two axes, Performance and Detail]
Sim-Fast
• Functional simulation
• Optimized for speed
• Assumes no cache
• Assumes no instruction checking
• Does not support Dlite!
• Does not allow command line arguments
• <300 lines of code
Sim-Safe
• Functional simulation
• Checks for instruction errors
• Optimized for speed
• Assumes no cache
• Supports Dlite!
• Does not allow command line arguments
Sim-Cache
• Cache simulation
• Ideal for fast simulation of caches (if the effect of cache
performance on execution time is not necessary)
• Accepts command line arguments for:
– level 1 & 2 instruction and data caches
– TLB configuration (data and instruction)
– Flush and compress
– and more
• Ideal for performing high-level cache studies that don’t
take access time of the caches into account
Sim-Cache (cont'd)
• generates one- and two-level cache hierarchy statistics and
profiles
• extra options (also supported on sim-outorder):
-cache:dl1 <config> - level 1 data cache configuration
-cache:dl2 <config> - level 2 data cache configuration
-cache:il1 <config> - level 1 instruction cache configuration
-cache:il2 <config> - level 2 instruction cache configuration
-tlb:dtlb <config> - data TLB configuration
-tlb:itlb <config> - instruction TLB configuration
-flush <config> - flush caches on system calls
-icompress - remaps 64-bit inst addresses to 32-bit equiv.
-pcstat <stat> - record statistic <stat> by text address
Specifying Cache Configurations
• all caches and TLB configurations specified with same format:
<name>:<nsets>:<bsize>:<assoc>:<repl>
• where:
<name> - cache name (make this unique)
<nsets> - number of sets
<bsize> - block size in bytes (page size for a TLB)
<assoc> - associativity (number of “ways”)
<repl> - set replacement policy
    l - for LRU
    f - for FIFO
    r - for RANDOM
• examples:
il1:1024:32:2:l - 2-way set-assoc 64k-byte cache, LRU
dtlb:1:4096:64:r - 64-entry fully assoc TLB w/ 4k pages, random replacement
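The configuration format above can be checked with a small parser. The following is a minimal sketch, not the code used by sim-cache; the struct `cache_cfg` and the functions `parse_cache_cfg`, `cache_bytes` and `cache_entries` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed configuration (not a SimpleScalar type). */
struct cache_cfg {
    char name[32];  /* cache name, e.g. "il1" */
    int  nsets;     /* number of sets */
    int  bsize;     /* block size in bytes (page size for a TLB) */
    int  assoc;     /* associativity (number of ways) */
    char repl;      /* 'l' = LRU, 'f' = FIFO, 'r' = random */
};

/* Parse "<name>:<nsets>:<bsize>:<assoc>:<repl>".
 * Returns 0 on success and -1 on a malformed string. */
int parse_cache_cfg(const char *s, struct cache_cfg *c)
{
    assert(s != NULL && c != NULL);
    if (sscanf(s, "%31[^:]:%d:%d:%d:%c",
               c->name, &c->nsets, &c->bsize, &c->assoc, &c->repl) != 5)
        return -1;
    if (c->nsets <= 0 || c->bsize <= 0 || c->assoc <= 0)
        return -1;
    if (strchr("lfr", c->repl) == NULL)
        return -1;
    return 0;
}

/* Total capacity in bytes: sets * block size * ways. */
long cache_bytes(const struct cache_cfg *c)
{
    return (long)c->nsets * c->bsize * c->assoc;
}

/* Number of blocks (or TLB entries): sets * ways. */
long cache_entries(const struct cache_cfg *c)
{
    return (long)c->nsets * c->assoc;
}
```

Applied to the two examples, `il1:1024:32:2:l` gives 1024 * 32 * 2 = 65536 bytes (64k), and `dtlb:1:4096:64:r` gives 1 * 64 = 64 entries in a single set, which is what makes it fully associative.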
Sim-Bpred
• Simulate different branch prediction mechanisms
• Generate prediction hit and miss rate reports
• Does not simulate the effect of branch prediction on total
execution time
• Supported predictor types:
nottaken - always predict not taken
taken - always predict taken
perfect - perfect predictor
bimod - bimodal predictor
2lev - 2-level adaptive predictor
comb - combined predictor (bimodal and 2-level)
Sim-Profile
• Program profiler
• Generates detailed profiles, by symbol and by address
• Keeps track of and reports
– Dynamic instruction counts
– Instruction class counts
– Branch class counts
– Usage of address modes
– Profiles of the text & data segment
Sim-Outorder
• Most complicated and detailed simulator
• Supports out-of-order issue and execution
• Provides reports
– branch prediction
– cache
– external memory
– various configurations
Sim-Outorder: Detailed Performance Simulator
• generates timing statistics for a detailed out-of-order issue
processor core with two-level cache memory hierarchy and
main memory
• extra options:
-fetch:ifqsize <size> - instruction fetch queue size (in insts)
-fetch:mplat <cycles> - extra branch mis-prediction latency (cycles)
-bpred <type> - specify the branch predictor
-decode:width <insts> - decoder bandwidth (insts/cycle)
-issue:width <insts> - RUU issue bandwidth (insts/cycle)
-issue:inorder - constrain instruction issue to program order
-issue:wrongpath - permit instruction issue after mis-speculation
-ruu:size <insts> - capacity of RUU (insts)
-lsq:size <insts> - capacity of load/store queue (insts)
-cache:dl1 <config> - level 1 data cache configuration
-cache:dl1lat <cycles> - level 1 data cache hit latency
-cache:dl2 <config> - level 2 data cache configuration
-cache:dl2lat <cycles> - level 2 data cache hit latency
-cache:il1 <config> - level 1 instruction cache configuration
-cache:il1lat <cycles> - level 1 instruction cache hit latency
-cache:il2 <config> - level 2 instruction cache configuration
-cache:il2lat <cycles> - level 2 instruction cache hit latency
-cache:flush - flush all caches on system calls
-cache:icompress - remap 64-bit inst addresses to 32-bit equiv.
-mem:lat <1st> <next> - specify memory access latency (first, rest)
-mem:width - specify width of memory bus (in bytes)
-tlb:itlb <config> - instruction TLB configuration
-tlb:dtlb <config> - data TLB configuration
-tlb:lat <cycles> - latency (in cycles) to service a TLB miss
-res:ialu - specify number of integer ALUs
-res:imult - specify number of integer multiplier/dividers
-res:memports - specify number of first-level cache ports
-res:fpalu - specify number of FP ALUs
-res:fpmult - specify number of FP multiplier/dividers
-pcstat <stat> - record statistic <stat> by text address
-ptrace <file> <range> - generate pipetrace
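One plausible reading of how `-mem:lat <1st> <next>` and `-mem:width` combine is sketched below: a block is moved in bus-width chunks, the first chunk costs the first latency, and each further chunk costs the next latency. The function name `mem_access_latency` and its parameters are assumptions made for this illustration, and the numbers in the usage note are example values only.

```c
#include <assert.h>

/* Cycles to transfer blk_bytes over a memory bus that is bus_width bytes wide.
 * lat_first is the latency of the first chunk and lat_next the latency of
 * every following chunk, mirroring the two arguments of -mem:lat. */
int mem_access_latency(int blk_bytes, int bus_width, int lat_first, int lat_next)
{
    int chunks;

    assert(blk_bytes > 0 && bus_width > 0);
    chunks = (blk_bytes + bus_width - 1) / bus_width;  /* round up */
    return lat_first + lat_next * (chunks - 1);
}
```

For example, a 32-byte block on an 8-byte bus with latencies 18 and 2 needs four chunks, so the sketch returns 18 + 3 * 2 = 24 cycles.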
Specifying the Branch Predictor
• specifying the branch predictor type:
-bpred <type>
• the supported predictor types are:
nottaken - always predict not taken
taken - always predict taken
perfect - perfect predictor
bimod - bimodal predictor (BTB w/ 2 bit counters)
2lev - 2-level adaptive predictor
• configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
-bpred:bimod <size> - size of direct-mapped BTB
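A bimodal predictor of this kind can be sketched as a direct-mapped table of 2-bit saturating counters. This is an illustrative sketch, not the SimpleScalar source: the table size `BIMOD_SIZE`, the index function and the function names are assumptions, and the index simply drops two low PC bits.

```c
#include <assert.h>

#define BIMOD_SIZE 2048  /* assumed table size; must be a power of two */

/* One 2-bit counter per entry: 0,1 predict not taken; 2,3 predict taken. */
static unsigned char bimod_table[BIMOD_SIZE];

/* Hypothetical index function: drop two low PC bits, then mask to table size. */
unsigned bimod_index(unsigned pc)
{
    return (pc >> 2) & (BIMOD_SIZE - 1);
}

/* Returns 1 if the branch at pc is predicted taken, 0 otherwise. */
int bimod_predict(unsigned pc)
{
    return bimod_table[bimod_index(pc)] >= 2;
}

/* Train the counter with the actual outcome, saturating at 0 and 3. */
void bimod_update(unsigned pc, int taken)
{
    unsigned char *ctr = &bimod_table[bimod_index(pc)];

    if (taken) {
        if (*ctr < 3)
            ++*ctr;
    } else {
        if (*ctr > 0)
            --*ctr;
    }
}
```

Because the counter saturates, a single not-taken outcome in a long run of taken branches moves it from 3 to 2 and the prediction stays "taken".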
Specifying the Branch Predictor (cont'd)
• configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
-bpred:2lev <l1size> <l2size> <hist_size> <xor>
Configurations: N, M, W, X
N: # entries in first level (# of shift register(s))
M: # entries in 2nd level (# of counters, or other FSM)
W: width of shift register(s) (# of bits in each shift register)
X: (yes-1/no-0) xor history and address for 2nd level index (We use 0 for this homework.)
Sample predictors:
GAg: 1,M,W,0 where M = 2^W
GAp: 1,M,W,0 where M = C*2^W, C is # of per-address prediction tables
PAg: N,M,W,0 where M = 2^W
PAp: N,M,W,0 where M = N * 2^W
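The four sample predictors differ only in how N and M are chosen. The sketch below computes both values from the formulas above; the enum `scheme` and the functions `l1_entries` and `l2_entries` are hypothetical helpers, not part of SimpleScalar.

```c
#include <assert.h>

/* The four sample two-level schemes from the slide. */
enum scheme { GAG, GAP, PAG, PAP };

/* First-level entries N: global schemes use a single history register. */
unsigned l1_entries(enum scheme s, unsigned n)
{
    return (s == GAG || s == GAP) ? 1u : n;
}

/* Second-level entries M for history width w.
 * n is the number of per-address history registers (PAp only),
 * c is the number of per-address prediction tables (GAp only). */
unsigned long l2_entries(enum scheme s, unsigned n, unsigned w, unsigned c)
{
    unsigned long per_table;

    assert(w < 32);
    per_table = 1UL << w;  /* 2^W counters per pattern table */
    switch (s) {
    case GAG: return per_table;
    case GAP: return c * per_table;
    case PAG: return per_table;
    case PAP: return n * per_table;
    }
    return 0;
}
```

With W = 8, for instance, GAg would be configured as `-bpred:2lev 1 256 8 0`, and a PAp predictor with N = 4 and W = 4 as `-bpred:2lev 4 64 4 0`.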
Performance Comparison of GAg, GAp, PAg and PAp
• GAp: 1 global history register and 8 per-address prediction tables
[Figure: (a) GAp, (b) (2,2) predictor. Labels: branch address (4), 2-bits per branch predictor, 2-bit global branch history, prediction]
Hack the state machine of Branch Predictor!
[Figure: two 2-bit predictor state machines with T and NT states and Taken / Not taken transitions. (a) A3 (same as shown in the textbook), (b) A2 (original SimpleScalar implementation)]
Sim-Outorder HW Architecture
[Figure: pipeline stages Fetch, Dispatch, Scheduler (Register Scheduler and Memory Scheduler), Exe, Mem, Writeback, Commit, together with I-Cache, I-TLB, D-Cache, D-TLB and Virtual Memory]
Sim-Outorder (Main Loop)
• sim_main() in sim-outorder.c
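ruu_init();
for (;;) {
    ruu_commit();     /* in-order retirement */
    ruu_writeback();  /* finished instructions, mis-prediction detection */
    lsq_refresh();    /* memory scheduler */
    ruu_issue();      /* execute */
    ruu_dispatch();   /* decode and rename */
    ruu_fetch();      /* fetch */
}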
• Executed once for each simulated machine cycle
• Walks pipeline from Commit to Fetch
– Reverse traversal handles inter-stage latch synchronization in a single pass
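Why the stages are walked from Commit back to Fetch can be seen in a toy three-stage pipeline with single-entry latches. This is a sketch under simplifying assumptions, not sim-outorder code: the struct `pipe`, the stage functions and the two `cycle_*` drivers are invented for illustration.

```c
#include <assert.h>

#define EMPTY (-1)

/* Hypothetical single-entry latches between three stages. */
struct pipe {
    int fetch_to_exec;   /* written by fetch, read by execute */
    int exec_to_commit;  /* written by execute, read by commit */
    int next_inst;       /* id of the next instruction to fetch */
    int committed;       /* instructions retired so far */
};

void pipe_init(struct pipe *p)
{
    p->fetch_to_exec = EMPTY;
    p->exec_to_commit = EMPTY;
    p->next_inst = 0;
    p->committed = 0;
}

void stage_commit(struct pipe *p)
{
    if (p->exec_to_commit != EMPTY) {
        p->committed++;
        p->exec_to_commit = EMPTY;
    }
}

void stage_execute(struct pipe *p)
{
    if (p->fetch_to_exec != EMPTY && p->exec_to_commit == EMPTY) {
        p->exec_to_commit = p->fetch_to_exec;
        p->fetch_to_exec = EMPTY;
    }
}

void stage_fetch(struct pipe *p)
{
    if (p->fetch_to_exec == EMPTY)
        p->fetch_to_exec = p->next_inst++;
}

/* Reverse order: each stage drains its input latch before the earlier
 * stage refills it, so an instruction advances one stage per cycle. */
void cycle_reverse(struct pipe *p)
{
    stage_commit(p);
    stage_execute(p);
    stage_fetch(p);
}

/* Forward order, shown for contrast: an instruction falls through all
 * three stages within a single simulated cycle. */
void cycle_forward(struct pipe *p)
{
    stage_fetch(p);
    stage_execute(p);
    stage_commit(p);
}
```

With `cycle_reverse`, the first instruction retires in cycle 3; with `cycle_forward` it would retire in cycle 1, which hides the pipeline latency unless every latch is double-buffered.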
Sim-Outorder (RUU/LSQ)
• RUU (Register Update Unit)
– Handles register synchronization/communication
– Serves as reorder buffer and reservation stations
– Performs out-of-order issue when register and memory
dependences are satisfied
• LSQ (Load/Store Queue)
– Handles memory synchronization/communication
– Contains all loads and stores in program order
• Relationship between RUU and LSQ
– Memory dependencies are resolved by LSQ
– Load/Store effective address calculated in RUU
Sim-Outorder: Fetch
• ruu_fetch()
• Models machine fetch bandwidth
• Fetches instructions from one I-cache/memory
– Block until I-cache misses are resolved
– Instructions are put into the instruction fetch queue named fetch_data in sim-outorder.c (it is also called dispatch queue in the paper)
– Probes branch predictor to obtain the cache line for next cycle
Sim-Outorder: Dispatch
• ruu_dispatch()
• Models instruction decoding and register renaming
• Takes instructions from fetch_data
• Decodes instructions
• Enters and links instructions into RUU and LSQ
• Splits memory operations into two separate instructions
Sim-Outorder: Scheduler
• lsq_refresh()
• Models instruction selection, wakeup and issue
– Separate schedulers track register and memory dependences.
– Locates instructions with all register inputs ready and all memory inputs ready
    • Issue of ready loads is stalled if there is a store with unresolved effective address in LSQ.
    • If earlier store address matches load address, target value is forwarded to load.
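The two load rules above (stall behind an unresolved store, otherwise forward from a matching earlier store) can be sketched as a scan over the older LSQ entries. The struct `lsq_entry`, the enum `load_action` and `check_load` are hypothetical and much simpler than the structures in sim-outorder.c; in particular, store data is assumed to be available as soon as the store address is.

```c
#include <assert.h>

enum load_action { LOAD_STALL, LOAD_FORWARD, LOAD_ACCESS_CACHE };

/* Hypothetical LSQ entry, kept in program order (index 0 is oldest). */
struct lsq_entry {
    int      is_store;    /* 1 for a store, 0 for a load */
    int      addr_ready;  /* effective address already computed? */
    unsigned addr;        /* effective address, valid if addr_ready */
    int      value;       /* store data (assumed ready with the address) */
};

/* Decide what the load at q[load_idx] may do this cycle. */
enum load_action check_load(const struct lsq_entry *q, int load_idx, int *fwd_value)
{
    int i;
    int match = -1;

    assert(!q[load_idx].is_store && q[load_idx].addr_ready);

    /* Walk from the youngest older entry back to the oldest one. */
    for (i = load_idx - 1; i >= 0; i--) {
        if (!q[i].is_store)
            continue;
        if (!q[i].addr_ready)
            return LOAD_STALL;       /* unresolved store address */
        if (match < 0 && q[i].addr == q[load_idx].addr)
            match = i;               /* nearest earlier matching store */
    }
    if (match >= 0) {
        *fwd_value = q[match].value; /* store-to-load forwarding */
        return LOAD_FORWARD;
    }
    return LOAD_ACCESS_CACHE;
}
```

The scan remembers only the nearest matching store, since that is the value the load would observe in program order.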
Sim-Outorder: Execute
• ruu_issue()
• Models functional units, D-cache issue and execute latencies
• Gets instructions that are ready
• Reserves free functional unit
• Schedules writeback events using latency of the functional unit
• Latencies are hardcoded in fu_config[] in sim-outorder.c
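Reserving a functional unit and scheduling the writeback can be sketched as follows. This is an illustration only: the struct `fu_pool`, the pool size `NUM_UNITS` and the two latency fields are assumptions and do not reproduce the layout of fu_config[].

```c
#include <assert.h>

#define NUM_UNITS 2  /* assumed pool size, in the spirit of -res:ialu */

/* Hypothetical pool of identical functional units. */
struct fu_pool {
    long busy_until[NUM_UNITS];  /* first cycle at which each unit is free */
    int  op_latency;             /* cycles until the result is available */
    int  issue_latency;          /* cycles until the unit accepts a new op */
};

/* Try to issue one operation at cycle now.
 * Returns the cycle of the writeback event, or -1 if every unit is busy. */
long fu_issue(struct fu_pool *p, long now)
{
    int i;

    assert(p->op_latency > 0 && p->issue_latency > 0);
    for (i = 0; i < NUM_UNITS; i++) {
        if (p->busy_until[i] <= now) {
            p->busy_until[i] = now + p->issue_latency;  /* reserve the unit */
            return now + p->op_latency;                 /* writeback time */
        }
    }
    return -1;  /* structural hazard: retry in a later cycle */
}
```

Setting `issue_latency` to 1 models a fully pipelined unit, while setting it equal to `op_latency` models an unpipelined one such as a divider.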
Sim-Outorder: Writeback
• ruu_writeback()
• Models writeback bandwidth, detects mis-predictions, initiates mis-prediction recovery sequence
• Gets execution finished instructions (specified in event queue)
• Wakes up instructions that are dependent on completed instruction on the dependence chains of instruction output
• Detects branch mis-prediction and rolls state back to checkpoint
Sim-Outorder: Commit
• ruu_commit()
• Models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
• While head of RUU/LSQ ready to commit
– D-TLB miss handling
– Retire store to D-cache
– Update register file and rename table
– Reclaim RUU/LSQ resources
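In-order retirement from the head of a circular RUU can be sketched as below. The struct layout, `RUU_SIZE` and the `width` parameter (a per-cycle commit limit) are assumptions for illustration; store commits and D-TLB handling are left out.

```c
#include <assert.h>

#define RUU_SIZE 8  /* assumed capacity, in the spirit of -ruu:size */

struct ruu_entry {
    int completed;  /* result written back, ready to retire */
};

/* Hypothetical circular buffer holding instructions in program order. */
struct ruu {
    struct ruu_entry e[RUU_SIZE];
    int head;  /* index of the oldest instruction */
    int num;   /* number of occupied entries */
};

/* Retire up to width completed instructions, oldest first.
 * Stops at the first instruction that has not completed, so younger
 * completed instructions wait behind it. Returns the number retired. */
int commit(struct ruu *r, int width)
{
    int retired = 0;

    assert(width > 0);
    while (r->num > 0 && retired < width && r->e[r->head].completed) {
        r->e[r->head].completed = 0;         /* reclaim the entry */
        r->head = (r->head + 1) % RUU_SIZE;  /* advance the head */
        r->num--;
        retired++;
    }
    return retired;
}
```

Stopping at the first incomplete entry is what keeps retirement in program order even though execution completes out of order.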
Sim-Outorder: Processor core and other specifications
• Instruction fetch, decode and issue bandwidth
• Capacity of RUU and LSQ
• Branch mis-prediction latency
• Number of functional units
– integer ALU, integer multipliers/dividers
– FP ALU, FP multipliers/dividers
• Latency of I-cache/D-cache, memory and TLB
• Record statistic by text address
Global Options
• These are supported on most simulators
-h - print help message
-d - enable debug message
-i - start up in Dlite! debugger
-q - quit immediately (use with -dumpconfig)
-config <file> - read config parameters from <file>
-dumpconfig <file> - save config parameters into <file>
How to get help from us
• Drop by during TA’s office hour
• E-Mail khkim@cse.tamu.edu