GEMS Tutorial ISCA05

advertisement
ISCA Tutorial
June 5th, 2005
Mike Marty, Brad Beckmann, Luke Yen, Alaa Alameldeen,
Min Xu, Kevin Moore
Please Ask Questions
(C) 2005 Multifacet Project
http://www.cs.wisc.edu/gems
What do you want to simulate?
CPU
Uniprocessor
P
P
P
P
$
$
$
$
Chip Multiprocessor (CMP)
Slide 2
Glueless
Multiprocessor
Symmetric
Multiprocessor
CMP
CMP
CMP
CMP
Multiple-CMP
http://www.cs.wisc.edu/gems
Open Source Release of GEMS
• GEMS v1.1 released as GPL software
http://www.cs.wisc.edu/gems
• Contributors
Alaa Alameldeen
Carl Mauer
Brad Beckmann
Kevin Moore
Ross Dickson
Manoj Plakal
Pacia Harper
Dan Sorin
Milo Martin
Min Xu
Mike Marty
Luke Yen
• Multifacet Project directed by Mark Hill & David Wood
Slide 3
http://www.cs.wisc.edu/gems
GEMS Requirements
• Virtutech Simics 2.0.x or 2.2.x
– Personal academic licenses available
– http://www.virtutech.com
• Host Machine
– x86 (32 or 64-bit) Linux or Sparc/Solaris host machine
– > 1 GB Memory
• Workload Checkpoints YOU Create
– License issues w/ releasing checkpoints
Slide 4
http://www.cs.wisc.edu/gems
Trace flie
Contended locks
Random
Tester
Deterministic
GEMS From 50,000 Feet
Microbenchmarks
Slide 5
Simics
Opal
Detailed
Processor
Model
http://www.cs.wisc.edu/gems
Trace flie
Contended locks
Random
Tester
Deterministic
GEMS From 50,000 Feet
Microbenchmarks
Simics
Opal
Detailed
Processor
Model
Full-System Functional Simulator
•
•
Boots unmodified Solaris 9
BUT, each instruction 1-cycle
•
www.virtutech.com
Slide 6
http://www.cs.wisc.edu/gems
GEMS From 50,000 Feet
Contended locks
Deterministic
Memory
System Model
Random
Opal
Trace flie
Simics
• Flexible multiprocessor memory hierarchy
• Includes domain-specific language
Detailed
Tester
Microbenchmarks
Slide 7
Processor
Model
http://www.cs.wisc.edu/gems
Trace flie
Contended locks
Random
Tester
Deterministic
GEMS From 50,000 Feet
Microbenchmarks
OoO Processor Model
Simics
Opal
Detailed
Processor
Model
• Implements partial SPARC v9 ISA
• Modeled after MIPS R10000
Slide 8
http://www.cs.wisc.edu/gems
Trace flie
Contended locks
Random
Tester
Deterministic
GEMS From 50,000 Feet
Simics
Microbenchmarks
Opal
Detailed
Processor
Model
Other Drivers
• Testing independent of Simics
• Microbenchmarks
Slide 9
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 10
http://www.cs.wisc.edu/gems
Demo
Full-System Simulation with GEMS
• Steps:
–
–
–
–
–
–
–
Slide 11
Choosing a Ruby protocol
Building Ruby and Opal
Starting and configuring Simics
Loading and configuring Ruby
Loading and configuring Opal
Running simulation
Getting results
http://www.cs.wisc.edu/gems
Demo
Choosing the Ruby System/Protocol
• Included with GEMS release v1.1
– CMP protocols
• MOESI_CMP_token: M-CMP token coherence
• MSI_MOSI_CMP_directory: 2-level Directory
• MOESI_CMP_directory: higher performing 2-level Directory
– SMP protocols
•
•
•
•
Slide 12
MOSI_SMP_bcast: snooping on ordered interconnect
MOSI_SMP_directory
MOSI_SMP_hammer: based on AMD Hammer
And more
http://www.cs.wisc.edu/gems
Demo
Building Ruby and Opal
• Ruby module
cd $GEMS_ROOT/ruby
– set compile-time defaults
vi config/rubyconfig.defaults
– Build module, choosing protocol and destination dir
make PROTOCOL=MOESI_CMP_token DESTINATION=MOESI_CMP_token
– SLICC runs, generates HTML and additional C++ files
– Ruby module built and moved to
$GEMS_ROOT/simics/home/MOESI_CMP_token
• Build Opal
cd $GEMS_ROOT/opal
make module DESTINATION=MOESI_CMP_token
Slide 13
http://www.cs.wisc.edu/gems
Demo
Starting Simics
• Start non-GUI Simics
maya(9)% cd $GEMS_ROOT/simics/home/MOESI_CMP_token/
maya(10)% ./simics
Checking out a license... done: academic license.
Looking for additional Simics modules in ./modules
+----------------+
Reserved
|
Virtutech
|
|
Simics
|
+----------------+
www.simics.com
Virtutech AB
Type
Type
Type
Type
Copyright 1998-2004 by Virtutech, All Rights
Version: simics-2.0.23
Compiled: Thu Oct 14 20:27:36 CEST 2004
"Virtutech" and "Simics" are trademarks of
'copyright' for details on copyright.
'license' for details on warranty, copying, etc.
'readme' for further information about this version.
'help help' for info on the on-line documentation.
simics>
Slide 14
http://www.cs.wisc.edu/gems
Demo
Checkpoint and Configuration
• Checkpoints should be created first
– Simics-only process
simics> read-configuration ../../checkpoints-u3/jbb/jbb-16p.check
– SpecJBB checkpoint loaded
• Load python scripts
simics> @sys.path.append("../../../gen-scripts")
simics> @import mfacet
• Configure Simics
simics>
Turning
simics>
Turning
simics>
simics>
Slide 15
istc-disable
I-STC off and flushing old data
dstc-disable
D-STC off and flushing old data
instruction-fetch-mode instruction-fetch-trace
magic-break-enable
http://www.cs.wisc.edu/gems
Demo
Load and Configure Ruby
Load module
simics> load-module ruby
Setting # processors is required
simics> ruby0.setparam g_NUM_PROCESSORS 16
Create a M-CMP system (4 chips, 4 procs/chip)
simics> ruby0.setparam g_PROCS_PER_CHIP 4
Override compile-time defaults
simics>
simics>
simics>
simics>
ruby0.setparam
ruby0.setparam
ruby0.setparam
ruby0.setparam
g_NUM_L2_BANKS 32
L2_CACHE_ASSOC 4
L2_CACHE_NUM_SETS_BITS 16
NETWORK_LINK_LATENCY 50
Initialize
simics> ruby0.init
Slide 16
http://www.cs.wisc.edu/gems
Demo
Optionally Load and Configure Opal
Load module
simics> load-module opal
Initialize default processor
simics> opal0.init
simics> opal0.listparam
Start opal (but do not start simulating)
simics> opal0.sim-start “output.opal"
Slide 17
http://www.cs.wisc.edu/gems
Demo
Running simulation
• Setup transaction-based simulation
– “magic breakpoints”
– Five JBB transactions
simics> @mfacet.setup_run_for_n_transactions(5,1)
• Start simulating
– Ruby only (Simics drives Ruby):
simics> c
– Opal is loaded (Opal steps Simics):
simics> opal0.sim-step 9999999999
Slide 18
http://www.cs.wisc.edu/gems
Demo
Dumping Some Output
• Opal stats
simics> opal0.stats
• Ruby stats
simics> ruby0.dump-stats ruby.stats
• Ruby short stats
simics> ruby0.dump-short-stats
– Ruby_cycles is a good runtime metric
Slide 19
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
–
–
–
–
–
–
Overview (Drivers & Memory System)
Event-driven simulation
Interconnection network
SLICC: Specifying the logic of the system
Simple example: SMP MI protocol
Limitations
• BREAK
•
•
•
•
Slide 20
Opal: Out-of-order processor model
Demo: Two gems are better than one
GEMS Source Code Tour and Extending Ruby
Building Workloads
http://www.cs.wisc.edu/gems
Trace flie
Contended locks
Microbenchmarks
Simics
Opal
Detailed
Processor
Model
Memory System
Drivers
Random
Tester
Deterministic
High-Level Infrastructure Map
Slide 21
http://www.cs.wisc.edu/gems
Ruby Driver: Random Tester
• “Verifying a Multiprocessor Cache Controller Using
Random Test Generation” [Wood et al. 90]
• Purpose: Excite cache coherency bugs
• Competing actions performed then checked
• Utilizes false sharing
– Multiple writers - action
– Single read - check
• Randomly inserted delay
Slide 22
Random
Tester
http://www.cs.wisc.edu/gems
Ruby Driver: Microbenchmarks
• Deterministic tester
– Compare and swap atomic op.
– RequestGenerator.C / SyntheticDriver.C
• Trace file
Trace file
• Contended locks
Deterministic
• GETX, SeriesGETS, Inv
Contended locks
– Simple sequence of requests
– Sanity checking and performance tuning
– DeterministicDriver.C
Microbenchmarks
– Issues requests one at a time
– Similar to cache warmup mechanism
– ‘-z <trace_file.gz>’
Slide 23
http://www.cs.wisc.edu/gems
Ruby Driver: In-order Processor Model
• Simics blocking interface (in-order processor)
– Single issue, non-pipelined processor
– Only one outstanding request per CPU
Simics
• SIMICS_RUBY_MULTIPLIER > 1
– Estimates a higher performance processor
– Multiple simics processor cycles == one ruby cycle
Slide 24
http://www.cs.wisc.edu/gems
Ruby Driver: In-order Processor Model
instructions
SIMICS
P0
Simics
in-order
processor
model
P2
P1
P3
Ruby
Memory System Model
Simics time queue
• Implements Simics’ mh_memorytracer_possible_cache_miss()
• “Callback” Simics with SIM_stall_cycle(proc_ptr, 0)
Slide 25
http://www.cs.wisc.edu/gems
Ruby Driver: Out-of-order Processor Model
• Opal (out-of-order processor)
– Super-scalar pipelined processor
– Multiple outstanding requests per CPU
Opal
• OPAL_RUBY_MULTIPLIER > 1
– Faster processor core frequency than memory
– Simulation execution optimization
Detailed
Processor
Model
What are they driving?
Slide 26
http://www.cs.wisc.edu/gems
Ruby Multiprocessor Memory System
• Physical Components
– Caches
– Memory
– System Interconnect
Ruby
Memory System Model
• Determines the timing of memory requests
– Driver issues memory request to Ruby
– Ruby simulates the requests
– Ruby eventually callbacks the driver with the latency
• Ruby’s purpose:
Return memory latency
Slide 27
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
–
–
–
–
–
–
Overview (Drivers & Memory System)
Event-driven simulation
Interconnection network
SLICC: Specifying the logic of the system
Simple example: SMP MI protocol
Limitations
• BREAK
•
•
•
•
Slide 28
Opal: Out-of-order processor model
Demo: Two gems are better than one
GEMS Source Code Tour and Extending Ruby
Building Workloads
http://www.cs.wisc.edu/gems
Discrete Event-driven Simulation
• Discrete event-driven simulation
– Events change system state
– Series of scheduled events
*Event A 4
• Global EventQueue
– Heart of Ruby
– Priority heap of event/time pairs
• Not a true queue - not in FIFO order
• Self-sorting queue
– Given cycle events occur in arbitrary order
– All events must be at least one unit of time
Slide 29
Global
EventQueue
Event | Time
*Event G 7
*Event B
5
*Event J
3
*Event S
3
http://www.cs.wisc.edu/gems
Events and Consumers
• Event = Consumer Wakeup
– Consumer determines event type
– Consumer changes system state
• Typical event
– Consumer wakes up to observe its input ports
– Consumer acts upon the incoming message(s)
• Change system state
• Enqueue outgoing messages
– Consumer pops the incoming message(s)
– Consumer schedules outgoing message(s) consumers
Output Port
Consumer
Input Port
Consumer
Output Port
Consumer
Slide 30
http://www.cs.wisc.edu/gems
Events and Consumers
• Stalled event
– Consumer wakes up to observer its input ports
– Consumer encounters a stall
– Consumer schedules itself again
• Doesn’t pop incoming queue
Output Port
Consumer
Input Port
Consumer
Output Port
Consumer
Slide 31
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
–
–
–
–
–
–
Overview (Drivers & Memory System)
Event-driven simulation
Interconnection network
SLICC: Specifying the logic of the system
Simple example: SMP MI protocol
Limitations
• BREAK
•
•
•
•
Slide 32
Opal: Out-of-order processor model
Demo: Two gems are better than one
GEMS Source Code Tour and Extending Ruby
Building Workloads
http://www.cs.wisc.edu/gems
Interconnection Network
•
A single flexible infrastructure
– Point-to-point links and switches: Consumers
– Both intra-chip and inter-chip networks
•
Dynamic network creation
– Routing tables created at runtime
– Utilizes input parameters
•
Throttle.C
Link
Two ways to generate topologies
1. Auto-generated
– Intra-chip network: Single on-chip switch
– Inter-chip network: 4 included (next slide)
2. Customized
PerfectSwitch.C
– TopologyType_FILE_SPECIFIED
– Adjust individual link latency and bandwidth
Switch
– Specify one link per line
Slide 33
http://www.cs.wisc.edu/gems
Auto-generated Inter-chip Network Topologies
TopologyType_TORUS_2D
TopologyType_PT_TO_PT
Slide 34
TopologyType_HIERARCHICAL_SWITCH
TopologyType_CROSSBAR
http://www.cs.wisc.edu/gems
Network Characteristics
•
Link latency
1. Auto-generated
–
–
ON_CHIP_LINK_LATENCY
NETWORK_LINK_LATENCY
2. Customized
–
•
‘link_latency:’
Link bandwidth
– Bandwidth specified in 1000th of byte
1. Auto-generated
–
–
On-chip = 10 x g_endpoint_bandwidth
Off-chip = g_endpoint_bandwidth
2. Customized
–
•
Buffer size
–
–
Infinite by default
Customized network supports finite buffering
•
•
•
Slide 35
Individual link bandwidth = ‘bw_multiplier:’ x g_endpoint_bandwidth
Prevent 2D-mesh network deadlock through e-cube restrictive routing
‘link_weight’
Perfect switch bandwidth
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
–
–
–
–
–
–
Overview (Drivers & Memory System)
Event-driven simulation
Interconnection network
SLICC: Specifying the logic of the system
Simple example: SMP MI protocol
Limitations
• BREAK
•
•
•
•
Slide 36
Opal: Out-of-order processor model
Demo: Two gems are better than one
GEMS Source Code Tour and Extending Ruby
Building Workloads
http://www.cs.wisc.edu/gems
Specification Language for Implementing Cache
Coherence (SLICC)
• Domain-specific language
–
–
–
–
Designed to specify state machines for cache coherence
Syntactically similar to C/C++/Java
Constrains to hardware-like structures (i.e. no loops)
Generates C++ tightly coupled to Ruby
Network
Out-ports
Network
In-ports
• Two purposes
SLICC
State
Machine
1. Specify system coherence
– Per-memory-block State Machines
– I.e. cache and memory controller logic
2. Glue components together
– Caches with transaction buffers
– Network ports with controllers
Slide 37
http://www.cs.wisc.edu/gems
System Flexibility via SLICC
• Substantial portion of Ruby code generated
– In combination with dynamic network creation
– Permits a tremendously flexible simulation infrastructure
• protocols/<protocol_name>.slicc
– Indicates the SLICC files needed by the protocol
– Specifies the necessary generated objects
• Controller state machines
• Network messages
– Snooping protocol: requests and response messages
– Directory protocol: requests, forwarded requests, and responses
– Allocates only C++ objects needed by the particular protocol
• Ex. Shadow tags for an exclusive two-level cache
• Ex. Persistent Request Table for Token coherence
Slide 38
http://www.cs.wisc.edu/gems
Inside a SLICC State Machine
• Network buffers
– Outgoing and incoming ports
• States
– Base and transient states
• Events
– Internal events that cause state transitions
<controller_name>.sm
network ports
states
events
ruby structures
trigger events
• Ruby Structures
– Caches, transaction buffers… etc.
actions
• Trigger events
– Incoming messages trigger internal events
• Actions
– Operations performed on structures
transitions
• Transitions
Slide 39
– Cross-product of possible states and events
– Performs atomic sequence of actions
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
–
–
–
–
–
–
Overview (Drivers & Memory System)
Event-driven simulation
Interconnection network
SLICC: Specifying the logic of the system
Simple example: SMP MI protocol
Limitations
• BREAK
•
•
•
•
Slide 40
Opal: Out-of-order processor model
Demo: Two gems are better than one
GEMS Source Code Tour and Extending Ruby
Building Workloads
http://www.cs.wisc.edu/gems
Demo
Creating a protocol with SLICC
• MI-example protocol
– Simple, SMP directory protocol
– Cache and directory/memory controller
– Assume ordered interconnect (for simplicity)
dir
dir
dir
M
Ruby
interconnect
$
Slide 41
$
$
I
http://www.cs.wisc.edu/gems
Demo
MI Cache Controller – States and Events
// STATES
enumeration(State, desc="Cache states") {
// stables states
I, desc="Not Present/Invalid";
M, desc="Modified";
// transient states
MI, desc="Modified, issued PUT";
II, desc="Not Present/Invalid, issued PUT";
IS, desc="Issued request for IFETCH/GETX";
IM, desc="Issued request for STORE/ATOMIC";
}
// EVENTS
enumeration(Event, desc="Cache events") {
// from processor
Load,
desc="Load request from processor";
Ifetch,
desc="Ifetch request from processor";
Store,
desc="Store request from processor";
Data,
Fwd_GETX,
desc="Data from network";
desc="Forward from network";
Replacement, desc="Replace a block";
Writeback_Ack,
desc="Ack from the directory for a writeback";
Writeback_Nack,
desc="Nack from the directory for a writeback";
}
Slide 42
http://www.cs.wisc.edu/gems
Demo
MI Cache Controller – Network Ports
// NETWORK BUFFERS
MessageBuffer requestFromCache, network="To", virtual_network="0",
ordered="true";
MessageBuffer responseFromCache, network="To", virtual_network="1",
ordered="true";
MessageBuffer forwardToCache, network="From", virtual_network="2",
ordered="true";
MessageBuffer responseToCache, network="From", virtual_network="1",
ordered="true";
// NETWORK PORTS
out_port(requestNetwork_out, RequestMsg, requestFromCache);
out_port(responseNetwork_out, ResponseMsg, responseFromCache);
in_port(forwardRequestNetwork_in, RequestMsg, forwardToCache) {
if (forwardRequestNetwork_in.isReady()) {
peek(forwardRequestNetwork_in, RequestMsg) {
if (in_msg.Type == CoherenceRequestType:GETX) {
trigger(Event:Fwd_GETX, in_msg.Address);
}
else if (in_msg.Type == CoherenceRequestType:WB_ACK) {
trigger(Event:Writeback_Ack, in_msg.Address);
}
else {
error("Unexpected message");
}
Slide 43
http://www.cs.wisc.edu/gems
}
Demo
MI Cache Controller – Structures
// CacheEntry
structure(Entry, desc="...", interface="AbstractCacheEntry") {
State CacheState,
desc="cache state";
bool Dirty,
desc="Is the data dirty (different than memory)?";
DataBlock DataBlk,
desc="data for the block";
}
external_type(CacheMemory) {
bool cacheAvail(Address);
Address cacheProbe(Address);
void allocate(Address);
void deallocate(Address);
Entry lookup(Address);
void changePermission(Address, AccessPermission);
bool isTagPresent(Address);
}
CacheMemory cacheMemory, template_hack="<L1Cache_Entry>",
constructor_hack='L1_CACHE_NUM_SETS_BITS, L1_CACHE_ASSOC, MachineType_L1Cache,
int_to_string(i)+"_L1"', abstract_chip_ptr="true";
Slide 44
http://www.cs.wisc.edu/gems
Demo
MI Cache Controller – “Mandatory Queue”
// Mandatory Queue
in_port(mandatoryQueue_in, CacheMsg, mandatoryQueue, desc="...") {
if (mandatoryQueue_in.isReady()) {
peek(mandatoryQueue_in, CacheMsg) {
if (cacheMemory.isTagPresent(in_msg.Address) == false &&
cacheMemory.cacheAvail(in_msg.Address) == false ) {
// make room for the block
trigger(Event:Replacement, cacheMemory.cacheProbe(in_msg.Address));
}
else {
trigger(mandatory_request_type_to_event(in_msg.Type), in_msg.Address);
}
}
}
}
Slide 45
http://www.cs.wisc.edu/gems
Demo
MI Cache Controller – Transitions
transition(I, Store, IM) {
v_allocateTBE;
i_allocateL1CacheBlock;
a_issueRequest;
m_popMandatoryQueue;
}
Atomic sequence of actions
transition(IM, Data, M) {
u_writeDataToCache;
s_store_hit;
w_deallocateTBE;
n_popResponseQueue;
}
transition(M, Fwd_GETX, I) {
e_sendData;
o_popForwardedRequestQueue;
}
transition(M, Replacement, MI) {
v_allocateTBE;
b_issuePUT;
x_copyDataFromCacheToTBE;
h_deallocateL1CacheBlock;
}
Slide 46
http://www.cs.wisc.edu/gems
Demo
MI Cache Controller – Actions
action(a_issueRequest, "a", desc="Issue a request") {
enqueue(requestNetwork_out, RequestMsg, latency="ISSUE_LATENCY") {
out_msg.Address := address;
out_msg.Type := CoherenceRequestType:GETX;
out_msg.Requestor := machineID;
out_msg.Destination.add(map_Address_to_Directory(address));
out_msg.MessageSize := MessageSizeType:Control;
}
}
action(e_sendData, "e", desc="Send data from cache to requestor") {
peek(forwardRequestNetwork_in, RequestMsg) {
enqueue(responseNetwork_out, ResponseMsg, latency="CACHE_RESPONSE_LATENCY")
{
out_msg.Address := address;
out_msg.Type := CoherenceResponseType:DATA;
out_msg.Sender := machineID;
out_msg.Destination.add(in_msg.Requestor);
out_msg.DataBlk := cacheMemory[address].DataBlk;
out_msg.MessageSize := MessageSizeType:Response_Data;
}
}
}
Slide 47
http://www.cs.wisc.edu/gems
Demo
SLICC-generated HTML tables
• http://www.cs.wisc.edu/gems/MI_example_html/
Slide 48
http://www.cs.wisc.edu/gems
Demo
Testing MI_example
Build Protocol
cd $GEMS_ROOT/ruby
make PROTOCOL=MI_example
Random test
– stresses protocol with simultaneous false-sharing requests
– 16 processors (-p), 10000 requests (-l)
./amd64_linux/generated/MI_example/bin/tester.exec –p 16 –l 10000
Deterministic test with transition trace
– use a trace, requests handled one at a time
– input trace (-z), compressed or non-compressed
– transition debug (-s) starting at cycle 1
./amd64_linux/generated/MI_example/bin/tester.exec –p 16 –z ruby.trace.gz –s 1
Slide 49
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
–
–
–
–
Overview
Pipeline
Example: Load instruction
Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 50
http://www.cs.wisc.edu/gems
Overview
• What is OPAL?
– Out-of-Order SPARC processor simulator
• (modeled after MIPS R10K)
– Uses Timing-First design
– Realized as a Simics module – like RUBY
– Does NOT use Simics’ MAI interface
• Goal of this section
– Starting point for hacking Opal
• Learning approaches
– Code review / summarization (using Control Flow Graphs)
– Example: a load instruction
– Analogies to SimpleScalar…pay attention to the differences
Slide 51
http://www.cs.wisc.edu/gems
Ruby Driver: In-order Processor Model
instructions
SIMICS
P0
Simics
in-order
processor
model
P2
P1
P3
Ruby
Memory System Model
Simics time queue
• Implements Simics’ mh_memorytracer_possible_cache_miss()
• “Callback” Simics with SIM_stall_cycle(proc_ptr, 0)
Slide 52
http://www.cs.wisc.edu/gems
Preview: OPAL & Simics
OPAL
fetch
SIMICS
8
7
6 5
Phy_mem
2 4 13 1
P0
HIT
IFETCH
RUBY
decode
Schedule/execute
HIT
LOAD
retire
Instruction
• Use opal’s opal0.sim-step command
Slide 53
http://www.cs.wisc.edu/gems
Timing-First Simulation [Mauer Sigmetrics 02]
• Timing Simulator (Opal)
– functional execution of user/supervisor operations
– speculative, OoO multiprocessor timing simulation
– does NOT implement full ISA or any devices
• Functional Simulator (Simics)
– full-system multiprocessor simulation
– does NOT model detailed micro-architectural timing
KEY: Reload state if Opal state != Simics state
Slide 54
http://www.cs.wisc.edu/gems
Measured Deviations
• Less than 20 deviations per 100,000 instructions (0.02%)
Worst case performance error:
Slide 55
additional timing slides
2.4% (assuming deviation
latency is pipeline flush)
http://www.cs.wisc.edu/gems
Opal and UltraSparc
• Functionally simulates 103 of 183 of UltraSparc ISA instructions
(99.99% of all dynamic instr in workloads)
LIST
• Sample of unimplemented instrs:
– ARRAY -FEXPAND
– EDGE
-FMUL8x16
– SHUTDOWN
-SIAM
-FPADD
-RDSOFTINT
-FPMERGE
-RDSTICK
-SIR
-WRSOFTINT
-WRSTICK
• Does not functionally simulate devices or any I/O instructions
–
–
–
–
SCSI controllers and disks
PCI and SBUS interfaces
interrupt and DMA controllers
temperature sensors
Correctness type
Functional
Performance
Slide 56
% error
0
2.4 (worst case)
http://www.cs.wisc.edu/gems
Simulation Control (system.[C h])
system_t::simulate(int instrs)
Disable all simics procs
Simulated enough instrs?
For MP sims: P0’s instrs
counted here
Yes
return
No
Forall seq->advanceCycle()
Pipeline is modeled here
ruby->advanceTime()
global_cycle++
Slide 57
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
–
–
–
–
Overview
Pipeline
Example: Load instruction
Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 58
http://www.cs.wisc.edu/gems
What’s done in a cycle?
pseq::advanceCycle()
FetchInstructions()
DecodeInstructions()
ScheduleInstructions()
Uses separate queues
(finitecycle.h) to record
how many instructions are
available for each stage.
The order is in fact not
important here.
RetireInstructions()
Scheduler->execute()
return
• SimpleScalar uses a reverse order, why?
Slide 59
http://www.cs.wisc.edu/gems
Pipeline Model (pseq.[C h])
MAX_FETCH
F
– Delay modeled with separate queues
(finitecycle.h)
F
– Types: CONFIG_ALU_MAPPING
– Number: CONFIG_NUM_ALUS
MAX_DISPATCH
D
Sched
FU1
FU0
FU1
FU0
MAX_DECODE
FU0
R
MAX_RETIRE
Slide 60
R
http://www.cs.wisc.edu/gems
RETIRE_STAGES
Determined by
CONFIG_ALU_LATENCY
MAX_EXECUTE
D
DECODE_STAGES
• Models fully-pipelined FUs
F
FETCH_STAGES
• Instructions stored/tracked in a
RUU-like structure (iwindow.[C h])
• Flexible multi-stage pipeline
Instructions ({dynamic,statici,memop,controlop}.[C h] )
decoded instr
(statici.[C h])
Dynamic
Event Times
Wait List ptr
Control
(controlop.[C h])
Predicted
Addr
Actual Addr
Taken/Not
Taken
Slide 61
Memory
(memop.[C h])
Registers
Traps
Seq #
ALU (dynamic.[C h])
Virtual/Phys
Addr
LSQ
index
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
–
–
–
–
Overview
Pipeline
Example: Load instruction
Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 62
http://www.cs.wisc.edu/gems
Fetch
Stall
fetch
Is Fetch
Ready?
Emit NOP/Stall
Fetch
Yes
Address
Translation
Yes
I-TLB
Miss?
No
Read instruction:
pseq::getInstr()
Invoke Ruby to
simulate Ifetch
timing
Create
Dynamic Instr
(load_inst_t)
Slide 63
http://www.cs.wisc.edu/gems
Decode
Get load instr
from instr
window
Get current source
operand mappings :
arf::readDecodeMap()
(regmap.[C h], arf.[C h])
dynamic_inst_t::
decode()
Rename dest reg :
arf::allocateRegister()
Insert decoded
load inst in
decode queue
Slide 64
(regmap.[C h], arf.[C h])
http://www.cs.wisc.edu/gems
Schedule
Get load instr
from instr
window
Exceeded
scheduling
window?
TestSourceReadiness()
Stop
scheduling
Yes
Source not
ready
Wakeup
Source is ready
Scheduler->schedule()
Slide 65
WAIT_XX_STAGE
Yes
All sources ready?
http://www.cs.wisc.edu/gems
No
Execute
No, reschedule
Read port
avail?
D-TLB address translate
(memory_inst_t::addresstranslate())
Yes
Yes
Raise TLB miss exception
TLB Miss?
No
Invoke Ruby to simulate load timing
(rubycache_t::access())
Cache Miss?
Yes
No
Read value from Simics memory
(pseq->readPhysicalMemory()) pseq->complete()
Slide 66
CACHE_MISS_STAGE
http://www.cs.wisc.edu/gems
Retire
Step Simics
(pseq->advanceSimics())
Get completed LD inst
takeTrap()
(set trap state,squash pipeline)
Yes
Traps?
No
checkChangedState()
(verify load value)
Match
checkCriticalstate():
(PC, NPC,regs)
Match
Retire LD
Slide 67
FullSquash()
(reload state &
refetch from instr
following LD)
Memory
Consistency
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
–
–
–
–
Overview
Pipeline
Example: Load instruction
Additional Tidbits
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 68
http://www.cs.wisc.edu/gems
Opal-Ruby Interface
Asynchronous
LD
load_inst_t::
Execute() 1
8 Complete()
OPAL
7
rubycache_t:
access()
2
complete()
5
system_t:
rubyCompletedRequest()
RUBY
OpalInterface:
isReady()
3
makeRequest()
4 hitCallback()
6
pseq_t:
completedRequest()
Slide 69
http://www.cs.wisc.edu/gems
Branch Prediction
pseq_t::createInstruction{
…
s_instr->nextPC()
…
}
Controlop_t::
Execute(){
(check prediction and
flush if mispredict)
}
Retire(){
…
Bpred->Update()
…
}
Slide 70
dynamic_inst_t::
nextPC_call(),
nextPC_predicated_branch(),
nextPC_predict_branch(),
nextPC_indirect()
Predict()
Update()
Branch predictor
(fetch/{yags.[C h], …} :
Predict()
Update()
http://www.cs.wisc.edu/gems
Common Config Parameters
Processor Width:
MAX_FETCH
_DECODE
_DISPATCH
_EXECUTE
_RETIRE
Pipeline Stages:
FETCH_STAGES
DECODE_STAGES
RETIRE_STAGES
Register File Sizes:
CONFIG_IREG_PHYSICAL (int)
CONFIG_FPREG_PHYSICAL (fp)
CONFIG_CCREG_PHYSICAL (cond code)
ROB Size:
IWINDOW_ROB_SIZE
Scheduling Window Size:
IWINDOW_WIN_SIZE
Slide 71
http://www.cs.wisc.edu/gems
Opal : Present and Future
• Implements Sparc instructions
– Simulating additional Sparc instructions easy task
– Porting to x86 substantial code rewrite
• Simulates timing of weaker memory consistency
models
– Add SC checks in Opal
– Add write buffers for weaker models (like TSO)
• No functional simulation of I/O
– Plug in disk simulator that interacts with Opal
• Not currently using MAI interface
– Possible to replace Opal w/ MAI module that interacts with
Ruby
• Aggressive micro-architectural techniques not
modeled
– Add support for trace caches, mem. dependence pred., etc
Slide 72
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with
GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
– Breakdown network stats
– Example: Network contention with and without Opal
– Simulation runtimes
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 73
http://www.cs.wisc.edu/gems
Demo
Breaking Down Ruby Stats Files
• Ruby system config print
– Values of all ruby config parameters
<system_config>.stats
Ruby config
• Overall runtime
– Target and host machine runtimes, IPC, etc.
• Cache profiling: L1I, L1D, L2…etc.
• Structure occupancy
– Demand for cache ports, transaction buffers
•
•
•
•
Latency breakdown
Request vs. system state (optional)
Message delay cycles (optional)
Network stats
– Link and switch utilization
• CC event / transition counts
Slide 74
Overall runtime
Cache profiling
Structure
occupancy
Latency
breakdown
Request vs.
system state
Message
delay cycles
Network stats
Event /
transition
counts
http://www.cs.wisc.edu/gems
Demo
•
•
•
•
•
Two GEMS are Better than One
Network behavior with and without Opal
8 processor CMP
SPLASH benchmark: ocean
8 byte-wide links between CPUs & L2 cache banks
Two runs using a customized network
1. Ruby only
•
•
•
•
Allows only one requests per processor
Maximum 8 outstanding requests
Low network utilization
Little network contention
2. Ruby & Opal
•
•
•
•
Slide 75
Allows multiple outstanding requests
Maximum 128 outstanding requests
Higher network utilization
Noticeable network contention
http://www.cs.wisc.edu/gems
Demo
Two GEMS are Better than One
Ruby Only
Ruby_cycles: 41361869
Message Delayed Cycles
----------------------
Total_delay_cycles: [binsize: 16 max: 553 count: 22892759
average: 0.534205 | standard deviation: 4.18656 | 22855760
20077
1945 325 309 175 105 3935 7681 518 338 254 397 273 166 130 142 33 41 25 26
29 15 10 10 2 0 2 0 4 4 10 7 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Network Stats
-------------
utilized_percent_switch_0_link_3: 4.38966
bw: 8000
utilized_percent_switch_0_link_4: 4.36838
bw: 8000
links_
base_latency: 1
links_
base_latency: 1
Slide 76
http://www.cs.wisc.edu/gems
Demo
Two GEMS are Better than One
Ruby & Opal
Ruby_cycles: 72550169 (41361869)
Message Delayed Cycles
----------------------
Total_delay_cycles: [binsize: 16 max: 703 count: 22893122
average: 1.35992 (0.534205) | standard deviation: 6.55126
| 22608266 220366 29575 9084 4686 3248 2009 1687 6018 1798 1143 828 625 516
384 272 271 288 398 319 299 228 203 161 92 51 41 26 12 9 30 39 48 43 25 20 3
0 0 1 0 2 4 4 0 0 0 0 0 0 ]
Network Stats
-------------
utilized_percent_switch_0_link_3: 7.81863 (4.38966)
links_
bw: 8000 base_latency: 1
utilized_percent_switch_0_link_4: 7.64388 (4.36838)
links_
bw: 8000 base_latency: 1
Slide 77
http://www.cs.wisc.edu/gems
Simulation Time Comparison
• Comparisons of Runtimes
– Progressively add more simulation fidelity
• Simics only
• Simics + Ruby
• Simics + Ruby + Opal
– Accuracy vs. simulation time tradeoff
• Target Machine
– 8 UltraSPARC™ iii processor SMP (1 GHz)
– 4 GBs of memory
• Host Machine
– AMD Opteron™ uniprocessor (2.2 GHz)
– 4 GBs of memory
Slide 78
http://www.cs.wisc.edu/gems
Simulation Slowdown
2000 JBB Transactions
Time
Target
20 ms
Simics
1 minute
Simics + Ruby
Simics + Ruby + Opal
Slowdown
1
Slowdown / CPU
1
3000 x
380 x
15 minutes
45000 x
5600 x
45 minutes
140000 x
17000 x
CAVEAT: These performance numbers may not reflect the optimal
configuration of Virtutech Simics. For example, running Simics in “fast
mode” (or emulation-only mode) can reduce the slowdown (per CPU) of
Simics, compared to real hardware, to less than 10x
Slide 79
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with
GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
– GEMS software structure
– Directory Tour
– Demo: Extending Ruby and a CMP Protocol
• Building Workloads
Slide 80
http://www.cs.wisc.edu/gems
GEMS Software Structure
System
Chip
generated/<protocol>/Chip.h
Driver
common/Driver.h
Network
network/simple
Profiler
profiler/Profiler.h
Internal Drivers
Random Tester
tester/Tester.h
Topology
network/simple/Topology.h
Deterministic Tester
tester/DeterministicDriver.h
Multiple
Instantiations
Contended Locks
tester/SyntheticDriver.h
Simics
Interface
simics/SimicsInterface.h
Opal
Interface
One
Instantiation
interface/OpalInterface.h
http://www.cs.wisc.edu/gems
Ruby Software Structure
Chip
generated/<protocol>/Chip.h
SLICC
Ruby
Sequencer
system/Sequencer.h
Network
Ports
Caches
Directory
system/CacheMemory.h
system/DirectoryMemory.h
Ruby
buffer/MessageBuffer.h
SLICC
Cache
Line
generated/<protocol>/L1Cache_Entry.h
Directory
State
generated/<protocol>/Directory_Entry.h
Cache
Controllers
generated/<protocol>/L1Cache_Controller.h
generated/<protocol>/L2Cache_Controller.h
Directory
Controller
generated/<protocol>/Directory_Controller.h
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with
GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
– GEMS software structure
– Directory Tour
– Demo: Extending Ruby and a CMP Protocol
• Building Workloads
Slide 83
http://www.cs.wisc.edu/gems
Map of Directories: Top-Level
Top-Level Directory
ruby
Memory
System
Components
common
Common
GEMS
C++ code
results
Simulation
Output
Slide 84
opal
Processor
Components
gen-scripts
slicc
Generator
Code
scripts
Generated
Simics
Interface
Scripts
Common
GEMS
scripts
LICENSE
README
protocols
Protocol
Specification
Files
microbenchmarks
Separate
Microbenchmark
Executables
KNOWN_ISSUES
http://www.cs.wisc.edu/gems
Map of Directories: ruby
ruby
Makefile
buffers
MessageBuffer
between consumers
eventqueue
Global
eventqueue
profiler
Profiling code
interfaces
Ruby → Opal &
Simics
recorder
cache and trace
recorders
common
Common Ruby
C++ structs
module
The ruby simics
module
simics
Simics → Ruby
system
tester
platform
Physical memory
components
Random tester
& ubenchmarks
Object files &
executables
html
Protocol
tables
Slide 85
ruby.trace.gz
Example trace
file
init.h/.C
Ruby initializer &
destroyer
config
Ruby config files for
Module and tester
network
Simple network code
slicc_interface
Abstract classes interface
with different protocols
generated
SLICC generated C++
files
README.debugging
Ruby debug flag info
http://www.cs.wisc.edu/gems
Map of Directories: ruby/system
ruby/system
CacheMemory.h
cache template
data structure
DirectoryMemory.h/C
memory data
structure
MachineID.h
NodeID.h
object that uniquely
identifies all ruby
machines
object that identifies a
unique chip or machine
instatiation
PresistentArbiter.h/C
PersistentTable.h/C
NodePresistentTable.h/C
specific to token
protocol
Sequencer.h/C
manages memory
requests between the
driver and L1 cache
controller
TBETable.h/C
transaction buffer
entry table used by
cache controllers for
transient requests
Slide 86
PerfrectCacheMemory.h
a fully associative,
unbounded cache
memory template
StoreBuffer.h/C
used to simulate
TSO-like timing
TimerTable.h
specific to token
protocol
specific to token
protocol
StoreCache.h/C
used to simulate
TSO-like timing
System.h/C
top-level object of the
ruby memory system,
all ruby objects can be
accessed via the
g_system_ptr
specific to token
protocol
http://www.cs.wisc.edu/gems
Map of Directories: ruby/slicc_interface
ruby/slicc_interface
AbstractCacheEntry.h/C
ruby abstract class for
the protocol specific
cache entries
Message.h
parent class of all messages
messages communicated
between consumers via
MessageBuffers
RubySlicc_Profiler_interface.h/C
interface between the
generated protocol logic
and the ruby profiler code
Slide 87
AbstractChip.h/C
ruby abstract class for
the protocol specific
chip object
NetworkMessage.h
parent class of all
network messages,
each protocol
implements unique
network message
objects to communicate
between controllers
RubySlicc_Util.h
miscellaneous ruby
functions used by
the generated controllers
AbstractProtocol.h/C
contains booleans to define
protocol characteristics to ruby
RubySlicc_ComponentMapping.h
All address manipulation to
determine location and set
mapping is here
RubySlicc_includes.h
wrapper for the
RubySlicc interface files
http://www.cs.wisc.edu/gems
Map of Directories: slicc
slicc
Makefile
Makefile for the
SLICC code
generator
executable
parser
contains the lexer
and parser that
construct a
protocol’s AST
main.h/C
main function
of the SLICC
executable
Slide 88
ast
doc
Abstract Syntax
Tree code
contains some
old but useful
documentation
symbols
platform
contains SLICC
objects created
during the first
pass of the AST,
majority of code
generated by
these symbols
README
Summary of
how SLICC works
generator
file, html and
MIF generator
code
generated
Object files &
executables
generated lexer
and parser files
slicc_global.h
defines typedef, namespaces, etc.
http://www.cs.wisc.edu/gems
Map of Directories: opal
opal
Makefile
benchmark
Micro-architecture
benchmarks
config
Module and tester
config files
python
Misc test and
graphing scripts
design
Helpful informal
design docs
regression
Golden results
for tester
bypassing
Misc. proc structs
fetch
Predictors
(branch,Trap,RAS)
sparc
Implementationspecific defines
tester
trace
platform
Opal tester files
Files for branch,
memory traces
Object files &
executables
TODO
Todo wish list
Slide 89
README
Describes building
& running Opal
common
Global Opal structs
module
Code for Opal module
system
Pipeline model
generated
Files for parsing config
params
README.memory_consistency
Opal handling of mem. consistency
http://www.cs.wisc.edu/gems
Map of Directories: opal/system (1)
opal/system
actor.[C h]
General micro-arch.
structure class
checkresult.h
Structs used in
validation w/ Simics
dtlb.[C h]
TLB implementation
for stand-alone sims
flow.[C h]
CFG class
Slide 90
arf.[C h]
Register file
interface
config.include
Type defines for
config params
dx.[C h i]
Code for execution
of dynamic instrs
hfa.C
Opal-Simics interface
cache.[C h]
chain.[C h]
Opal’s built-in cache
structures
Used to analyze mem
dependencies
controlop.[C h]
decode.[C h]
Per opcode stats
collector class
Branch instr type
class
dynamic.[C h]
Top-level class
for all dynamic instrs
flatarf.[C h]
Non-renamed register
file interface
histogram.[C h]
hfa_init.h
Opal-Simics interface
externs
Histogram stats
class
http://www.cs.wisc.edu/gems
Map of Directories: opal/system (2)
opal/system
ipage.[C h]
Instruction page
class
lockstat.[C h]
Stats on locks in
system
mf_api.h
Simlink to Opal-Ruby
interface
pseq.[C h]
Top-level proc
sequencer
Slide 91
ipagemap.[C h]
Instruction page
cache class
lsq.[C h]
LSQ structure
mshr.[C h]
MSHR structure
(used in Opal cache
hierarchy only)
iwindow.[C h]
ix.[C h]
RUU-like struct for
storing/tracking instrs
Code to execute CFG
instructions
memop.[C h]
memstat.[C h]
Memory instr class
Memory addr stats
class
pipepool.[C h]
pipestate.[C h]
Wait-list object . Used
to model MSHR when
running w/ Ruby
Single waiter object
for pipepool
pstate.[C h]
ptrace.[C h]
regbox.[C h]
Functions used for
API calls to Simics
Used for analyzing
memory traces
Contains interface
ptrs to registers.
http://www.cs.wisc.edu/gems
Map of Directories: opal/system (3)
opal/system
regfile.[C h]
Models the register
file itself
simdist12.C
Dummy Simics
functions for tester
stopwatch.[C h]
Timer class, used to
collect time stats
regmap.[C h]
Rename map
structure
sparx.C
Several includes
sysstat.[C h]
Stats class for
dynamic insts
rubycache.[C h]
Handles all Opal Ruby memory
transactions
Global event queue
statici.[C h]
sstat.[C h]
Decoded instr class
Stats class for
static insts
system.[C h]
Top-level class for
manipulating sim
scheduler.[C h]
threadstat.[C h]
Stats class for
tracking per-thread
stats
wait.[C h]
Wait-list object for
dynamic insts
Slide 92
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with
GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
– GEMS software structure
– Directory Tour
– Demo: Extending Ruby and a CMP Protocol
• Building Workloads
Slide 93
http://www.cs.wisc.edu/gems
Demo
Extending Ruby
• Goal:
– Add new functionality to Ruby and interface to SLICC
• DemoPrefetcher
–
–
–
–
–
Simple, L2->memory next-line prefetcher
Module implemented as C++ object (DemoPrefetcher.C)
New type added to SLICC
Observes L1 GETS requests via function call
Triggers event for prefetch in next cycle
• Object is connected to an in_port
– Not the only way (or the right way) of implementing a
prefetcher
Slide 94
http://www.cs.wisc.edu/gems
Demo
Implementing DemoPrefetcher
• Creating an object that can “wakeup” a controller
DemoPrefetcher.h
class DemoPrefetcher {
public:
// An object in a SLICC controller will be passed a Chip*
DemoPrefetcher(Chip* chip_ptr);
// Allow an in_port to be attached
void setConsumer(Consumer* consumer_ptr) { m_consumer_ptr =
consumer_ptr; }
// When wakeup() is called, ensure it should do something
bool isReady() const;
// functions to implement simple next-line prefetching
const Address& popNextPrefetch();
const Address& peekNextPrefetch() const;
void cancelNextPrefetch();
void observeL1Request(const Address& address);
Slide 95
http://www.cs.wisc.edu/gems
Demo
Implementing DemoPrefetcher
DemoPrefetcher.C
void DemoPrefetcher::observeL1Request(const Address& address)
{
// next-line prefetch address
Address prefetch_addr = address;
prefetch_addr.makeNextStrideAddress(1);
// add to prefetch queue
m_prefetch_queue.push( prefetch_addr );
// when to wakeup-- choose 1 cycles later
Time ready_time = g_eventQueue_ptr->getTime() + 1;
// schedule a wakeup() so that the L2 controller can trigger
g_eventQueue_ptr->scheduleEventAbsolute(m_consumer_ptr,
ready_time);
}
Slide 96
http://www.cs.wisc.edu/gems
Demo
Interfacing DemoPrefetcher to SLICC
external_type(DemoPrefetcher, inport="yes") {
bool isReady();
Address popNextPrefetch();
void cancelNextPrefetch();
Address peekNextPrefetch();
void observeL1Request(Address);
}
DemoPrefetcher prefetcher;
// wakeup logic
in_port(prefetcher_in, Null, prefetcher) {
if (prefetcher_in.isReady() ) {
if (L2cacheMemory.cacheAvail(prefetcher.peekNextPrefetch()) ||
L2cacheMemory.isTagPresent(prefetcher.peekNextPrefetch())) {
if ( getState(prefetcher.peekNextPrefetch()) == State:I ||
getState(prefetcher.peekNextPrefetch()) == State:NP ) {
trigger(Event:Prefetch, prefetcher.popNextPrefetch());
}
else {
// tag is already present in a non-invalid state
prefetcher.cancelNextPrefetch();
}
}
else {
trigger(Event:L2_Replacement,
L2cacheMemory.cacheProbe(prefetcher.peekNextPrefetch()));
}
}
}
Slide 97
http://www.cs.wisc.edu/gems
Demo
Implementing DemoPrefetcher
• Nice property of TokenCMP: no tracking of prefetch
– A tag is allocated and a request issued to memory
– keeps received tokens/data if tag allocated
MOESI_CMP_tokenDEMO-L2cache.sm
transition(NP, Prefetch, I) {
vv_allocateL2CacheBlock;
a_issuePrefetch;
}
transition(I, Prefetch) {
a_issuePrefetch;
}
transition({S,O,M,I_L,S_L}, Prefetch) {
// do nothing
}
Slide 98
http://www.cs.wisc.edu/gems
Outline
• Introduction and Motivation
• Demo: Simulating a Multiple-CMP System with
GEMS
• Ruby: Memory system model
• BREAK
• Opal: Out-of-order processor model
• Demo: Two gems are better than one
• GEMS Source Code Tour and Extending Ruby
• Building Workloads
Slide 99
http://www.cs.wisc.edu/gems
Workloads for Simics/GEMS
• Unfortunately, we cannot release our workloads
(legal reasons)
• Steps for Workload Development
– Simple Example: Barnes-Hut
– What about more complex applications?
• Workload Simulation Methodology
– Simulating transactions/requests
– Coping with workload variability
Slide 100
http://www.cs.wisc.edu/gems
Workload Setup
• Simple Example: Barnes-Hut (Splash2 suite)
– Commands not to be taken literally! (might be different in
different versions)
• Main Steps:
–
–
–
–
Slide 101
Build OS checkpoint
Copy application source or binary to simulation
Create initial (cold) application checkpoint in Simics
Create warm application checkpoint with Simics/Ruby
http://www.cs.wisc.edu/gems
Build OS Checkpoint
• Use Simics to boot your OS and get a checkpoint
(assuming 16 processor serengeti target machine)
– cd simics/home/sarek
– ./simics –x sarek-16p.simics
• Script loads configuration and boots Solaris
• Scripts should be provided with your Simics distribution
assuming you have Solaris license (contact Virtutech Simics
Forum)
• Modify scripts to fit your target configuration (e.g., memory,
disk, network)
– At the end of your script, take a system snapshot
(checkpoint):
simics> write-configuration CHKPT_DIR/sarek-16p.check
simics> quit
– Use this checkpoint to build all your workloads’ 16 processor
checkpoints
Slide 102
http://www.cs.wisc.edu/gems
Copy Barnes Source or Binary
• Develop benchmark on real machine (if available)
– Use Simics “magic” instructions after initialization
• See Simics reference manual for magic instruction use
– Compile benchmark with such instructions before running in
Simics
• Load from your OS checkpoint
– ./simics
simics> read-configuration CHKPT_DIR/sarek-16p.check
simics> magic-break-enable
• Copy binary into simulated machine (or copy source
and compile)
– Console commands:
mount /host
cp –r /host/workloads/splash2/codes/apps/barnes/BARNES .
• See Simics reference manual on the use of the /host filesystem
Slide 103
http://www.cs.wisc.edu/gems
Obtain Initial Barnes Checkpoint
• Warm up application in Simics
– Console Commands:
./BARNES < input-warm
• input_warm specifies Barnes parameters
./BARNES < input-warm
• Use this second run to warm up cache (see next slide)
./BARNES < input-run > output; magic_call break
• After initial run, write checkpoint
simics> write-configuration CHKPT_DIR/barnes-cold16p.check
simics> quit
• Checkpoint is ready for GEMS run
Slide 104
http://www.cs.wisc.edu/gems
Obtain Warm Barnes Checkpoint
• Load initial checkpoint
–
–
–
–
–
setenv CHECKPOINT_AT_END yes
setenv TRANSACTIONS 1
setenv PROCESSORS 16
setenv CHECKPOINT CHKPT_DIR/barnes-cold-16p.check
./simics -no-win -x GEMS_ROOT/gen-scripts/go.simics
• Script (provided in release) should load ruby and run
till the end of the warmup run
– Also writes checkpoint at the end
• Edit checkpoint to remove ruby object
– Modify script to suit your needs
Slide 105
http://www.cs.wisc.edu/gems
What About More Complex Applications?
• Setup on real hardware
–
–
–
–
Tune workload, OS parameters
Scale-down for PC memory limits
Re-tune
For details, [Alameldeen et al., IEEE Computer, Feb’03]
• What if we don’t have access to real hardware?
– Install applications and setup in Simics
– Checkpoint often
– Not optimal for large scale applications!
Slide 106
http://www.cs.wisc.edu/gems
Simulating Transactions/Requests
• Throughput-based applications
– Work-based unit to compare configurations
– IPC not always meaningful
• Counting Transactions during Simulation
– Enable magic breaks in Simics
– Benchmark traps to Simics on every magic instruction
– Count magic breaks until we reach required number of
transactions
– Cope with benchmark variability
Slide 107
http://www.cs.wisc.edu/gems
Why Consider Variability?
OLTP
Slide 108
http://www.cs.wisc.edu/gems
Workload Variability
• How can slower memory lead to faster workload?
• Answer: Multithreaded workload takes different paths
– Different lock race outcomes
– Different scheduling decisions
→ Runs from same initial conditions can be different
 This can lead to wrong conclusions for deterministic
simulations
• Solution with deterministic simulation
– Add pseudo-random delay on memory accesses
(MEMORY_LATENCY)
– Simulate base (and enhanced) system multiple times
– Use simple or complex statistics [Alameldeen and Wood,
HPCA 2003]
Slide 109
http://www.cs.wisc.edu/gems
The End
• Download and Subscribe to Mailing Lists
http://www.cs.wisc.edu/gems
• We encourage your contributions
– Workloads
– Additional timing fidelity
Slide 110
http://www.cs.wisc.edu/gems
Additional Opal Slides
Slide 111
http://www.cs.wisc.edu/gems
Sensitivity Analysis
Slide 112
return
http://www.cs.wisc.edu/gems
Sensitivity Results
Slide 113
return
http://www.cs.wisc.edu/gems
Opal and Memory Consistency
• Designed to be aggressive OoO processor
• Our use of Simics is sequentially consistent
execution
• Models the performance of weaker models (such as
TSO) for only SC memory interleavings
• Violations of SC in Opal:
– Identical MSHR entry for memory requests with same addr
– Executes Ld/St out of program order
– No snooping of LSQ for external stores
Return
Slide 114
http://www.cs.wisc.edu/gems
Implemented UltraSparc Instructions (1)
add
bpe
addc
bpg
bpge
addcc
bpgu
addccc
bpl
alignaddr bple
alignaddrl bpleu
bpn
and
bpne
andcc
bpneg
andn
bpos
andncc
bppos
bpvc
ba
bpvs
bcc
brgez
bcs
brgz
be
brlez
brlz
bg
brnz
bge
brz
bgu
bshuffle
bl
bvc
bvs
ble
call
bleu
casa
bmask
casxa
bn
cmp
done
bne
retry
bneg
fabsd
bpa
fabsq
bpcc
fabss
bpcs
Slide 115
faddd
faddq
fadds
falignda
ta
fba
fbe
fbg
fbge
fbl
fble
fblg
fbn
fbne
fbo
fbpa
fbpe
fbpg
fbpge
fbpl
fbple
fbplg
fbpn
fbpne
fbpo
fbpu
fbpue
fbpug
fbpuge
fbpul
fbpule
fbu
fbue
fbug
fbuge
fbul
fbule
fcmpd
fcmped
fcmpeq
fcmpeq16
fcmpeq32
fcmpes
fcmpgt16
fcmpgt32
fcmple16
fcmple32
fcmpne16
fcmpne32
fcmpq
fcmps
fdivd
fdivq
fdivs
fdmulq
fdtoi
fdtoq
fdtos
fdtox
fitod
fitoq
fitos
flush
flushw
fmovd
fmovda
fmovdcc
fmovdcs
fmovde
fmovdg
fmovfqlg fmovql
fmovdge fmovfqn fmovqle
fmovdgu fmovfqne fmovqleu
fmovdl
fmovfqo fmovqn
fmovdle
fmovfqu fmovqne
fmovdleu fmovfque fmovqneg
fmovdn
fmovfqug fmovqpos
fmovdne fmovfquge fmovqvc
fmovdneg fmovfqul fmovqvs
fmovdpos fmovfqule fmovrdgez
fmovrdgz
fmovdvc fmovfsa
fmovrdlez
fmovdvs fmovfse
fmovfda fmovfsg fmovrdlz
fmovfde fmovfsge fmovrdnz
fmovrdz
fmovfdg fmovfsl
fmovfdge fmovfsle fmovrqgez
fmovfdl
fmovfslg fmovrqgz
fmovfdle fmovfsn fmovrqlez
fmovfdlg fmovfsne fmovrqlz
fmovfdn fmovfso fmovrqnz
fmovfdne fmovfsu fmovrqz
fmovfdo fmovfsue fmovrsgez
fmovfdu fmovfsug fmovrsgz
fmovfdue fmovfsuge fmovrslez
fmovfdug fmovfsul fmovrslz
fmovfduge fmovfsule fmovrsnz
fmovrsz
fmovfdul fmovq
fmovs
fmovfdule fmovqa
fmovfqa fmovqcc fmovsa
fmovfqe fmovqcs fmovscc
fmovscs
fmovfqg fmovqe
fmovse
fmovfqge fmovqg
fmovfql
fmovqge fmovsg
fmovfqle fmovqgu fmovsge
fmovsgu
fmovsl
fmovsle
fmovsleu
fmovsn
fmovsne
fmovsneg
fmovspos
fmovsvc
fmovsvs
fmuld
fmulq
fmuls
fnegd
fnegq
fnegs
fqtod
fqtoi
fqtos
fqtox
fsmuld
fsqrtd
fsqrtq
fsqrts
fsrc1
fstod
fstoi
fstoq
fstox
fsubd
fsubq
fsubs
fxtod
fxtoq
http://www.cs.wisc.edu/gems
Implemented UltraSparc Instructions (2)
fxtos
fzero
fzeros
ill
impdep1
impdep2
jmp
jmpl
ldblk
ldd
ldda
lddf
lddfa
ldf
ldfa
ldfsr
ldqa
ldqf
ldqfa
ldsb
ldsba
ldsh
ldsha
ldstub
ldstuba
ldsw
ldswa
ldub
lduba
lduh
lduha
lduw
lduwa
ldx
Slide 116
ldxa
ldxfsr
membar
mov
mova
movcc
movcs
move
movfa
movfe
movfg
movfge
movfl
movfle
movflg
movfn
movfne
movfo
movfu
movfue
movfug
movfuge
movful
movfule
movg
movge
movgu
movl
movle
movleu
movn
movne
movneg
movpos
movrgez
movrgz
movrlez
movrlz
movrnz
movrz
movvc
movvs
mulscc
mulx
nop
not
or
orcc
orn
orncc
popc
prefetch
prefetcha
rd
rdcc
rdpr
restore
restored
retrn
save
saved
sdiv
sdivcc
sdivx
sethi
sll
sllx
smul
smulcc
sra
srax
srl
srlx
stb
stba
stbar
stblk
std
stda
stdf
stdfa
stf
stfa
stfsr
sth
stha
stqf
stqfa
stw
stwa
stx
stxa
stxfsr
sub
subc
subcc
subccc
swap
swapa
ta
taddcc
taddcctv
tcc
tcs
te
tg
tge
tgu
tl
tle
tleu
tn
tne
tneg
tpos
trap
tsubcc
tsubcctv
tvc
tvs
udiv
udivcc
udivx
umul
umulcc
wr
wrcc
wrpr
xnor
xnorcc
xor
xorcc
return
http://www.cs.wisc.edu/gems
TLB Misses
• ITLB Misses
– emit special NOP instruction: STATIC_INSTR_MOP; stall
fetch
– does NOT update PC, NPC
– fetch resumes whenever any instr (including special NOP)
squashes
• DTLB Misses
– Set DTLB miss trap for instruction (setTrapType()) in
Execute()
– In retireInstruction(), retrieve trap and call takeTrap() to set
trap state for DTLB handler
– refetch from DTLB handler
Slide 117
http://www.cs.wisc.edu/gems
Example: Load instruction
• In dynamic_t::Schedule(), load waits until all operands ready
(WAIT_XX_STAGE cases)
• Scheduler gets invoked when all operands ready
• Load waits until read port to L1 is available
• Load_inst_t::Execute() gets called
–
–
–
–
Generates virtual address
Performs D-TLB address translation
Inserts entry in LSQ
Initiates cache access (via Ruby or Opal’s built-in simple cache
hierarchy)
– If cache miss -> put on wait list (CACHE_MISS_STAGE) and is
woken up by rubycache_t::complete()
• Invokes Simics to read actual memory value in
load_inst_t::Complete()
• Retirement check of load value & squash if value deviates from
Simics
Slide 118
http://www.cs.wisc.edu/gems
Modifying Opal-Ruby Interface
• Ruby->Opal interface defined in mf_opal_api object
(ruby/interfaces/mf_api.h)
• Opal->Ruby interface defined in mf_ruby_api object
• To create new Ruby->Opal callback (ex: hitCallback())
– Define function in ruby/interfaces/OpalInterface.C
– Add new function pointer to mf_opal_api object
– Create a new function handler in opal/system/system.C and
assign m_opal_api object’s new function pointer to this
function handler
• To create new Opal->Ruby callback (ex: makeRequest())
– Define function in ruby/interfaces/OpalInterface.C
– Add new function pointer to mf_ruby_api object
– Assign function pointer to new function in
OpalInterface::installInterface()
Slide 119
http://www.cs.wisc.edu/gems
Download