
Analytic Evaluation of Shared-Memory
Systems with ILP Processors
Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve,
Mary K. Vernon, and David A. Wood
Presented by
Vijeta Johri
17 March 2004
Motivation
• Architectural simulators for shared-memory systems with processors that aggressively exploit ILP require hours to simulate a few seconds of real execution
• Goal: develop an analytical model that produces results similar to the simulators' in seconds
– Tractable system of equations
– Small set of simple input parameters
• Previous models assume a fixed number of outstanding memory requests
System Assumptions
• Cache-coherent, release-consistent, shared-memory multiprocessor
– MSI directory-based protocol
• Mesh interconnection network
– Wormhole routing
– Separate request & reply networks
• L1 cache
– write-through, multiported & nonblocking
• L2 cache
– write-back, write-allocate & nonblocking
• MSHRs track the status of all outstanding misses
– Misses to the same line are coalesced
• Instructions retire from the instruction window after completing execution
• Directory accesses overlapped with memory accesses
Model Parameters
• System architecture parameters
• Application parameters
– Performance-determining characteristics
– Insensitive to architectural parameters
– ILP parameters
• Ability of the processor to overlap multiple memory requests
– Homogeneous applications
• Each processor has the same value for each parameter
• Memory requests equally distributed across processors
Application Parameters
Fast Application Parameter Estimation
• Parameters that don't depend on ILP can be measured using fast multiprocessor simulators
• FastILP is used for the ILP parameters
– High performance achieved by abstracting the ILP processor & memory system
– Speeds up processor simulation
• A completion timestamp is calculated for each instruction
– Non-memory instructions: timestamp determined by source-register & functional-unit availability
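As a concrete illustration of the timestamp rule for non-memory instructions, the completion time can be taken as the later of source-register readiness and functional-unit availability, plus the execution latency. This is a hedged sketch; the function name, the FU-occupancy rule, and the latencies are illustrative assumptions, not FastILP's actual implementation.

```python
# Hypothetical sketch of a completion-timestamp rule for a non-memory
# instruction: it can issue once all source registers are ready and a
# functional unit is free, and completes after its execution latency.
# (Names and latencies are illustrative, not from the paper.)

def completion_timestamp(src_ready, fu_available, latency):
    """Return (completion time, time the functional unit frees up)."""
    issue = max(max(src_ready, default=0), fu_available)
    done = issue + latency
    return done, done  # toy model: the FU stays busy until the op completes

# e.g. sources ready at cycles 3 and 5, FU free at cycle 4, 2-cycle op:
done, fu_next = completion_timestamp([3, 5], 4, 2)  # -> (7, 7)
```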
Fast Application Parameter Estimation (cont.)
• Speeds up memory-system simulation
– Does not explicitly simulate any part of the memory system beyond the cache hierarchy
– Divides simulated time into eras
• An era starts when one or more memory replies unblock the processor
• An era ends when the processor blocks again waiting for a memory reply
• Eras allow timestamp computation for loads & stores
• Estimation of τ, CV_τ & f_M
• Trace-driven simulation of a single processor
– Homogeneous applications
– Traces generated by a trace-generation utility
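The era mechanism above can be sketched as follows. The representation is an assumption made for illustration (each processor stall as a `(block_time, reply_time)` pair); it shows only the idea of carving unblocked execution into eras, not FastILP's actual data structures.

```python
# Illustrative sketch of dividing simulated time into eras. Assumed
# representation: each processor stall is a (block_time, reply_time)
# pair; an era runs from the reply that unblocks the processor to the
# next time the processor blocks waiting for a memory reply.

def eras(start, stalls, end):
    """Return the (era_start, era_end) intervals of unblocked execution."""
    intervals = []
    t = start
    for block_time, reply_time in stalls:
        intervals.append((t, block_time))  # era ends when the processor blocks
        t = reply_time                     # next era starts at the memory reply
    intervals.append((t, end))
    return intervals

# e.g. blocks at t=10 (reply at 15) and t=20 (reply at 30), run ends at 40:
print(eras(0, [(10, 15), (20, 30)], 40))  # [(0, 10), (15, 20), (30, 40)]
```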
Analytic Model
• System throughput is measured as a function of the dynamically changing number of outstanding memory requests before the processor stalls
• Synchronous blocking (SB) submodel
– Fraction of time the processor is stalled on a load or read-modify-write instruction until its data returns
– Number of customers per processor = maximum number of read requests the processor can issue before blocking
• MSHR blocking (MB) submodel
– Computes the additional fraction of time the processor is stalled because the MSHRs are full
– Number of customers per processor = number of MSHRs
– Blocking time when all MSHRs are occupied by writes is also computed
Analytic Model
• Mean time each customer occupies the processor:
– τ_MB = τ / U_SB (U_SB = fraction of time the processor isn't stalled in the SB submodel)
– τ_SB = τ / U_MB
• If M = M_hw, all stalls are due to full MSHRs
– Customers represent both reads & writes
• Overall throughput = weighted sum of the computed throughputs, weighted by the frequency with which each number of outstanding MSHRs would be observed in the system
• A separate queueing model is used for locks
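The coupling τ_MB = τ / U_SB and τ_SB = τ / U_MB suggests a simple fixed-point iteration between the two submodels. The toy `solve_submodel` below is a stand-in assumption for the paper's full MVA solution of each submodel; the sketch only illustrates how each submodel's utilization inflates the other's effective processor-occupancy time.

```python
# Toy fixed-point iteration between the SB and MB submodels.
# solve_submodel is an illustrative stand-in (NOT the paper's MVA
# equations): it returns the fraction of time the processor is not
# stalled, given an effective think time and a memory demand shared
# among the submodel's customers.

def solve_submodel(tau_eff, demand, customers):
    # Utilization of the processor "station" in a toy closed model.
    return tau_eff / (tau_eff + demand / customers)

def fixed_point(tau, demand, n_reads, n_mshrs, iters=50):
    u_sb = u_mb = 1.0
    for _ in range(iters):
        u_sb = solve_submodel(tau / u_mb, demand, n_reads)   # uses tau_SB = tau / U_MB
        u_mb = solve_submodel(tau / u_sb, demand, n_mshrs)   # uses tau_MB = tau / U_SB
    return u_sb, u_mb
```

Iterating until the two utilizations stop changing mirrors how the submodels are solved jointly rather than in isolation.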
Model Equations
• The SB and MB submodels use a set of customized MVA equations to compute the mean delay for customers at the processor, local & remote memory buses, directories & network interfaces
– R_SB = average round-trip time (RTT) in the SB submodel
– R_j^synch = total average residence time for a read at memory-system resource j
– Z = total fixed delay for a read request
• Read transactions retired per unit time = M / R_SB
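The throughput relation above can be illustrated directly with hypothetical numbers: the mean read round trip is the fixed delay Z plus the residence times at each memory-system resource, and with M customers per processor the read-retirement rate follows from Little's law.

```python
# Hypothetical numbers illustrating the SB-submodel throughput relation:
# round trip R_SB = Z + sum_j R_j^synch, and with M customers the read
# transactions retired per unit time are M / R_SB (Little's law).

def sb_throughput(M, Z, residence_times):
    R_SB = Z + sum(residence_times)  # R_SB = Z + sum over resources j
    return M / R_SB                  # reads retired per unit time

x = sb_throughput(M=4, Z=20.0, residence_times=[5.0, 10.0, 5.0])  # 0.1
```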
Model Equations
– NI_out denotes the queues used to transmit messages into the switching network; its delay term is the sum, over all such queues, of the total mean delay for each type of synchronous transaction y that can visit the queues as a request (r) or data (d) message
– To compute this delay, the average number of visits that a synchronous type-x message from a type-y transaction makes to the local NI is multiplied by the sum of the average waiting and service times at the queue
Model Equations
– The utilization of the local outgoing NI queue by type x
messages of a type y transaction is equal to the average total
number of visits for these messages (per round trip in the
SB model) times their service time divided by the average
round trip time of the transactions in the SB model
– The average waiting time at the outgoing local NI queue
due to traffic from remote nodes is equal to the sum over all
transaction types of the synchronous and asynchronous
traffic generated remotely that is seen at the queue
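The NI-queue utilization relation described above reduces to a one-line sketch (parameter names are illustrative assumptions): utilization by type-x messages of a type-y transaction equals their average visits per SB round trip, times their service time, divided by the average round-trip time.

```python
# Sketch of the outgoing-NI-queue utilization relation (illustrative
# names): visits per SB round trip * service time / average RTT.

def ni_utilization(visits_per_rtt, service_time, round_trip_time):
    return visits_per_rtt * service_time / round_trip_time

u = ni_utilization(2, 3.0, 30.0)  # 0.2
```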
Model Equations
– The waiting time at queue q of the local NI due to remote traffic of type y, whether synchronous or asynchronous, is equal to the sum over all message types x of the total number of remote customers times the waiting time their type-y transactions cause on local queue q. This waiting time equals the time a customer would wait for those customers in the queue, plus the time it would wait for the customer in service
Results
• FastILP can generate parameters that lead to < 12% error in throughput compared with RSIM-generated parameters
• The model estimates application throughputs ranging from 0.88 to 2.93 instructions retired per cycle, within 10% relative error, in only a few seconds versus the hours required by simulators