Analytic Evaluation of Shared-Memory Systems with ILP Processors
Daniel J. Sorin, Vijay S. Pai, Sarita V. Adve, Mary K. Vernon, and David A. Wood
Presented by Vijeta Johri, 17 March 2004

Motivation
• Architectural simulators for shared-memory systems with processors that aggressively exploit ILP require hours to simulate a few seconds of real execution
• Goal: develop an analytical model that produces results comparable to simulators in seconds
  – Tractable system of equations
  – Small set of simple input parameters
• Previous models assume a fixed number of outstanding memory requests
• System modeled: cache-coherent, release-consistent, shared-memory multiprocessor
  – MSI directory-based protocol
• Mesh interconnection network
  – Wormhole routing
  – Separate reply & request networks
• L1 cache: write-through, multiported & nonblocking
• L2 cache: write-back, write-allocate & nonblocking
• MSHRs track the status of all outstanding misses
  – Coalesced misses
• Instructions retire from the instruction window after completing execution
• Directory accesses are overlapped with memory accesses

Model Parameters
• System architecture parameters
• Application parameters
  – Performance-determining characteristics
  – Insensitive to architectural parameters
  – ILP parameters
• Ability of the processor to overlap multiple memory requests
  – Homogeneous applications
• Each processor has the same value for each parameter
• Memory requests are equally distributed

Application Parameters

Fast Application Parameter Estimation
• Parameters that don't depend on ILP can be measured using fast multiprocessor simulators
• FastILP for the ILP parameters
  – Achieves high performance by abstracting the ILP processor & memory system
  – Speeds up processor simulation
• Completion timestamp calculated for each instruction
  – Non-memory instructions: timestamp determined by source-register and functional-unit availability

Fast Application Parameter Estimation
• Speeds up memory system simulation
  – Does not explicitly simulate any part of the memory system beyond the cache hierarchy
  – Divides simulated time into eras
• An era starts when one or more memory replies unblock the processor
• An era ends when the processor blocks again, waiting for a memory reply
• Allows timestamp computation for loads & stores
• Estimation of τ, CV_τ & f_M
• Trace-driven simulation of 1 processor
  – Homogeneous applications
  – Traces generated by a trace-generation utility

Analytic Model
• System throughput is measured
  – A function of the dynamically changing number of outstanding memory requests before the processor stalls
• Synchronous blocking (SB) submodel
  – Fraction of time the processor is stalled on a load or read-modify-write instruction until the data returns
  – Number of customers per processor = maximum number of read requests the processor can issue before blocking
• MSHR blocking (MB) submodel
  – Computes the additional fraction of time the processor is stalled because the MSHRs are full
  – Number of customers per processor = number of MSHRs
  – Blocking time when all MSHRs are occupied by writes is also computed

Analytic Model
• Mean time each customer occupies the processor
  – τ_MB = τ / U_SB, where U_SB is the fraction of time the processor is not stalled in the SB submodel
  – τ_SB = τ / U_MB
• If M = M_hw, all stalls are due to full MSHRs
  – Customers represent both reads & writes
• Overall throughput is a weighted sum of the submodel throughputs, weighted by the frequency with which each throughput value would be observed for the number of MSHRs in the system
• A separate queuing model is used for locks

Model Equations
• The SB and MB submodels use a set of customized MVA equations to compute the mean delay for customers at the processor, local and remote memory buses, directories & network interfaces
  – R_SB = average round-trip time in the SB submodel
  – R_j^synch = total average residence time for a read at memory system resource j
  – Z = total fixed delay for a read request
• Read transactions retired per unit time = M / R_SB

Model Equations
  – NI_out denotes the queues used to transmit messages into the switching network; its delay term is the sum, over all such queues, of the total mean delay for each type of synchronous transaction y that can visit the queues as a request (r) or data (d) message
  – To compute this, the average number of visits that a synchronous type-x message from a type-y transaction makes to the local NI is multiplied by the sum of the average waiting and service times at the queue

Model Equations
  – The utilization of the local outgoing NI queue by type-x messages of a type-y transaction equals the average total number of visits for these messages (per round trip in the SB model) times their service time, divided by the average round-trip time of the transactions in the SB model
  – The average waiting time at the outgoing local NI queue due to traffic from remote nodes equals the sum, over all transaction types, of the synchronous and asynchronous traffic generated remotely that is seen at the queue

Model Equations
  – The waiting time at queue q of the local NI caused by remote traffic of type y (synchronous or asynchronous) equals the sum, over all message types x, of the total number of remote customers times the waiting time their type-y transactions cause at local queue q
  – This waiting time equals the time a customer would wait for those customers already in the queue, plus the time it would wait for the customer in service

Results
• FastILP can generate parameters that lead to < 12% error in throughput compared with RSIM-generated parameters
• The model estimates application throughputs ranging from 0.88 to 2.93 instructions retired per cycle with under 10% relative error, in only a few seconds versus the hours required by simulators
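The prose equations in the Model Equations slides can be transcribed compactly. This is a hedged reconstruction from the definitions given above (V for visit counts, s for service times; the exact subscript conventions are assumptions, not the paper's notation):

```latex
\begin{align}
R_{SB} &= \sum_j R_j^{synch} + Z
  && \text{round trip = residence times + fixed delay} \\
X_{read} &= \frac{M}{R_{SB}}
  && \text{read transactions retired per unit time} \\
U_{NI_{out}}^{x,y} &= \frac{V_{x,y}\, s_x}{R_{SB}}
  && \text{NI queue utilization by type-$x$ messages of transaction $y$}
\end{align}
```

The last line is the slide's statement that utilization equals visits per SB round trip times service time, divided by the average SB round-trip time.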
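The FastILP completion-timestamp rule for non-memory instructions described above can be sketched in a few lines. This is an illustrative reconstruction, not FastILP's actual code: the function name and argument layout are assumptions.

```python
# Sketch of FastILP-style completion-timestamp calculation for a
# non-memory instruction (function and parameter names are illustrative).
def completion_timestamp(src_ready_times, fu_available_at, latency):
    """Completion time = issue point (all source registers ready AND the
    functional unit free) plus the instruction's execution latency."""
    sources_ready = max(src_ready_times, default=0)  # last source to arrive
    issue = max(sources_ready, fu_available_at)      # gated by the FU, too
    return issue + latency
```

For example, with source operands ready at cycles 3 and 5, the functional unit free at cycle 4, and a 2-cycle latency, the instruction completes at cycle 7.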
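The coupling between the SB and MB submodels (τ_MB = τ / U_SB and τ_SB = τ / U_MB) is solved as a fixed point. A minimal sketch, with each submodel reduced to a toy two-station closed queueing network (processor plus one aggregate memory station) solved by exact MVA — the paper's actual submodels cover buses, directories, and network interfaces, and all numeric parameters here are illustrative:

```python
# Toy sketch of the interacting SB/MB submodel fixed point.
def proc_utilization(s_proc, s_mem, customers):
    """Exact MVA for a closed network of two FCFS stations; returns the
    processor station's utilization (throughput * service demand)."""
    q_proc = q_mem = 0.0
    x = 0.0
    for n in range(1, customers + 1):
        r_proc = s_proc * (1.0 + q_proc)       # arrival theorem
        r_mem = s_mem * (1.0 + q_mem)
        x = n / (r_proc + r_mem)               # system throughput
        q_proc, q_mem = x * r_proc, x * r_mem  # Little's law
    return x * s_proc

def solve_submodels(tau, mem_delay, reads_before_block, num_mshrs, iters=100):
    """Iterate the SB/MB coupling: each submodel sees a processor whose
    effective time per customer is inflated by the *other* submodel's stalls."""
    u_sb = u_mb = 1.0
    for _ in range(iters):
        new_sb = proc_utilization(tau / u_mb, mem_delay, reads_before_block)
        new_mb = proc_utilization(tau / u_sb, mem_delay, num_mshrs)
        u_sb = 0.5 * (u_sb + new_sb)  # damping keeps the iteration stable
        u_mb = 0.5 * (u_mb + new_mb)
    return u_sb, u_mb
```

With stalls attributed across submodels this way, both utilizations settle in (0, 1), and the fraction of time the processor is doing useful work shrinks as memory delay grows or the number of requests it can overlap falls.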