Integrated Memory Controllers with Parallel Coherence Streams
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida

Talk in One Slide
Ever-increasing on-die integration
– Faster memory controllers and coherence processors
– Leads to new trade-offs in the domain of programmable coherence engines for scalable directory-based DSM multiprocessors
– We show that multiple coherence engines are unnecessary in such environments
– We develop a useful analytical model to quickly decide the coherence bandwidth requirement of parallel applications

Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary

Background: Integrated MC
A direct solution to reduce round-trip cache miss latency
– Other advantages related to maintenance and glueless multiprocessing
Widely accepted in high-end industry
– Alpha 21364, IBM Power5, AMD Opteron, Sun UltraSPARC III and IV, Sun Niagara
Shared memory multiprocessors employing iMCs are naturally DSMs
– Bandwidth-thrifty directory coherence is the natural choice

Background: Directory Processing
Home-based coherence protocols
– Each cache block has a home node, determined by the upper few bits of the physical address
– Each coherence request (miss or dirty eviction from the last level of cache) is first sent to the home node of the cache block
– At the home node, sharing information for the cache block is maintained in a data structure called the directory (in SRAM or DRAM)
– The coherence controller at the home looks up the directory and takes appropriate actions
Each node has at least one embedded directory coherence controller

Background: Directory Processing
Two different trends in directory coherence controller architecture
– Hardwired controllers: less flexible, tedious verification, often on the project’s critical path, but high-performance
  MIT Alewife, KSR1, SGI Origin, Stanford DASH
– Custom programmable controllers: execute protocol software on a protocol processor embedded in the memory controller; flexible in choice of protocol, easier to verify the protocol, some loss of performance
  Compaq Piranha, Opteron-Horus, Stanford FLASH, Sequent STiNG, Sun S3.mp

Background: Flexible Processing
Past research reports up to 12% performance loss [Stanford FLASH]
– The main reason why industry shies away from this option
Coherence controller occupancy has emerged as the most important parameter
– Naturally, hardwired controllers get the upper hand
– Past research has established the importance of multiple hardwired controllers in SMP nodes

Background: Flexible Processing
New technology often changes the trade-offs
– Reconsider programmable directory controllers in the light of increased integration
  Bring the programmable controller on die
  Faster clock rates lead to lowered occupancy
– New research questions: Can the integrated programmable controllers offer enough coherence bandwidth? Do we need multiple of them? Can the integrated controllers cope with the extra pressure of emerging multi-threaded nodes?
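The claim that faster clock rates lower occupancy can be sanity-checked with a quick back-of-envelope script. This is my own illustration, not code from the talk; the Op values are taken from the per-application model-validation tables presented later in the deck. If handler occupancy were purely compute-bound, quadrupling the protocol processor clock from 400 MHz to 1.6 GHz would cut Op to a quarter:

```python
# Back-of-envelope check (not from the talk's simulator): measured
# protocol handler occupancy Op in ns, from the model-validation tables.
op_400mhz = {"FFT": 30.3, "FFTW": 29.4, "LU": 25.3,
             "Ocean": 36.6, "Radix-Sort": 33.8, "Water": 28.1}
op_1_6ghz = {"FFT": 8.4, "FFTW": 7.5, "LU": 6.3,
             "Ocean": 9.4, "Radix-Sort": 8.8, "Water": 7.2}

for app, op in op_400mhz.items():
    ideal = op / 4  # perfect 4x frequency scaling
    print(f"{app}: ideal {ideal:.2f} ns vs. measured {op_1_6ghz[app]} ns")
```

Measured occupancy tracks the ideal 4x scaling closely (within about 1 ns for every application), which is why on-die integration, and the higher clock it permits, directly attacks the occupancy problem that hurt earlier off-chip programmable controllers.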
Background: Flexible Processing
Executes coherence protocol handlers in software with hardware support
– Does not require interrupts
Two major architectures
– Integrated custom protocol processor(s): one or more simple cores (this work considers one or two static dual-issue in-order cores, each with a dedicated level-one cache)
– Reserved protocol thread context(s) in a simultaneously multi-threaded (SMT) node
  SMTp: SMT with one or more protocol contexts; eliminates the protocol processor

Background: Flexible Processing
[Block diagram of the two node organizations. Left: an OOO SMT core running application threads (ATs) with IL1/DL1 and an L2 cache, plus one or two in-order protocol processors (PPs) with their own IL1/DL1, all attached to the memory controller (MC), SDRAM banks, and router. Right: SMTp, where the OOO SMT core runs both ATs and protocol threads (PTs); no separate PP.]

Aside: A Protocol Handler
Computes the directory address from the requested address (simple hash)
Loads the directory entry into a register
Computes coherence actions based on the directory state and message header
– Simple integer arithmetic
Sends out coherence messages as needed
– Custom instructions or uncached stores write the header and address to the send unit
– Messages may carry cache line data read from DRAM

Scope of this Work
Distributed shared memory (DSM) NUMA
– Up to 16 nodes, directory-based or directory-less broadcast coherence
– Each node has an SMT processor capable of running four application threads, an integrated memory controller, one or two integrated protocol processors (PPs) or protocol threads (PTs), and an integrated hypercube router
– Six parallel applications and 4-way multiprogrammed workloads
Applicability to multi-node chip multiprocessors

Contributions
Two primary contributions
– Evaluates two kinds of programmable coherence engines in the light of on-die integration and multi-threading
– Develops a simple and generic analytical model relating directory protocol occupancy, DRAM latency, and DRAM channel bandwidth
  Introduces the concept of occupancy margin
  Offers valuable insights on hot-spot situations
  Accurately predicts whether an additional coherence engine is helpful

Highlights
Key results
– A single integrated programmable controller offers sufficient coherence bandwidth for directory-based systems (in contrast with off-chip controller studies)
  The analytical model helps explain why typical hot-spot situations (involving locks and flags) cannot be improved with parallel coherence stream processing
– Directory-less broadcast systems (e.g., AMD Opteron) enjoy significant benefit from parallel coherence stream processing with multiple programmable “snoop” engines

Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary

Memory Controller Architecture: PP
[Block diagram: processor interface (PI) and network interface (NI) inbound queues and a software queue (SWQ) are fed by the processor and router virtual channels (VCs); a round-robin dispatcher moves messages from the queue heads, after a CAM lookup against the outstanding message buffer (OMB), into the protocol processor work queue (PPWQ); the protocol processor, with its own Icache and Dcache, drives the SDRAM banks and the send unit toward the PI and NI outbound paths.]

Memory Controller Architecture: SMTp
[Block diagram: same dispatch structure (SWQ, PI/NI inbound queues, round-robin dispatch, CAM lookup, PPWQ, OMB, send unit), but handlers run on a protocol thread of the main core instead of a dedicated protocol processor with its own caches.]

Memory Controller Architecture: SMTp
The protocol thread participates in ICOUNT fetching when its PC is valid
Shares pipeline resources with the application threads, including the entire cache hierarchy
Deadlock avoidance with reserved resources
– Queue buffers, branch checkpoints, integer registers, integer and load/store queue entries, store buffers, MSHRs

Parallel Coherence Streams
Replicated resources
– Multiple protocol processors or protocol threads
– Multiple OMBs
– Multi-ported Icache and Dcache for the protocol processor (does not apply to SMTp)
Control flow
– Mutual exclusion in directory access: requires a six-ported CAM lookup in the OMB and PPWQ
– Schedule a message every cycle; protocol processors/threads arbitrate for the PPWQ read port and the smallest id wins (dynamic priority)

Parallel Coherence Streams
Critical sections
– Conventional LL/SC and test-and-set locks
– For higher throughput, test-and-set locks are maintained in on-chip registers
– The software queue and its related state (e.g., occupancy) are the major shared read/write variables
– Leads to an increased average dynamic instruction count per handler: trades per-handler occupancy for concurrency across handlers

Parallel Coherence Streams
Boot sequence
– Only one protocol processor/thread executes the entire boot sequence, initializing the memory controller and peripheral state
– The other processors/threads only initialize their architectural register state
Out-of-order issue
– An address conflict between the six queue heads and the PPWQ/OMB may lead to idle schedule cycles
– Consider all requests in the five queues, not just the heads (the SWQ is still FIFO)
– Queues need to be collapsible, with address CAMs

Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary

Analytical Model
The goal of the model is to decide whether a second protocol engine can improve performance
The model is applicable to any system exercising directory-based protocols
– Not limited to systems with integrated controllers
We analyze the time spent by a batch of requests in the memory system
– Focus only on the handling of read and read-exclusive requests at the home (most time-consuming)
– These require a DRAM access (initiated speculatively)

Analytical Model
Three parts in the life of a request after it is dispatched
– DRAM occupancy or access latency (Om)
– Protocol handler occupancy or protocol processing latency (Op)
– DRAM channel occupancy or cache line transfer latency (Oc)
We look at four scenarios involving two concurrently arriving bank-parallel requests (consider only Om > Op)
– Single- and dual-channel DRAM controller with one and two protocol engines

Single-channel DRAM Controller
[Timeline diagram: requests R1 and R2 with one protocol processing unit (1PPU) vs. two (2PPU); Oi, Mi, Ci mark the protocol occupancy, DRAM access, and channel transfer of request i.]
What if 2Op > Om + 2Oc?

Single-channel DRAM Controller
[Timeline diagram: with two engines, O2 overlaps M1 and the batch finishes earlier.]
Saved: 2Op – (Om + 2Oc)

Dual-channel DRAM Controller
[Timeline diagram: as above, with the two cache line transfers proceeding on separate channels.]
What if 2Op > Om + Oc?

Dual-channel DRAM Controller
[Timeline diagram: the dual-channel case with two engines.]
Saved: 2Op – (Om + Oc)

General Formulation
Burst arrival of requests: k at a time
Single-channel DRAM controller:
– The total protocol occupancy must get exposed if adding a second coherence engine is to be effective
– Required condition: kOp > Om + kOc
– Re-arranging: Op > Om/k + Oc
Dual-channel DRAM controller:
– Required condition: kOp > Om + kOc/2
– Re-arranging: Op > Om/k + Oc/2
Occupancy margin (OM): left-hand side minus right-hand side

Take-away Points
For highly bursty requests (high k)
– Om has a diminishing effect
– The balance between Op and Oc becomes the most important determinant: a tension between two competing bandwidths
For small k
– A large occupancy margin is unlikely, as the contribution from Om is large
– Adding a second coherence engine would not be useful: less concurrency
Extra DRAM bandwidth may convert a negative occupancy margin to a positive one

Hot-spots and Bank Conflicts
Bank conflicts can delay the DRAM accesses of a burst of requests
– Hot-spots often arise from system-wide access to the same cache block: an obvious case of bank conflict
– Can multiple coherence engines help?
Since Om > Op on average, for two conflicting requests the total protocol occupancy (2Op) is hidden under the serialized memory access latency (2Om)
– Multiple coherence engines will not improve performance

Hot-spots and Bank Conflicts
What about row buffer hits?
– The first request in a batch suffers a row buffer miss
– The subsequent ones enjoy hits with high probability
– Row buffer hits lower the average value of Om and may uncover portions of kOp even in the case of k conflicting requests
– Required condition: kOp > Omiss + (k–1)Ohit + kOc/w for a w-channel DRAM controller
– Simplifying (with Om now the lowered average access latency): Op > Om + Oc/w, which contradicts Om > Op (typical of integrated controllers)

Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary

Evaluation Framework
Evaluates both integrated protocol processors (PPs) and protocol threads in SMTp
Depending on the integration level, the memory controller and PP can be clocked at different frequencies
– Explores 400 MHz, 800 MHz, and 1.6 GHz with a 1.6 GHz main SMT processor
Protocol threads in SMTp, by design, always run at full frequency (1.6 GHz)
Each node has an SMT processor and runs up to four ATs (64-threaded applications) and two PTs

Flashback: Flexible Processing
[Block diagram repeated from the background section: the PP-based node vs. the SMTp node.]

Sketch
Background
Memory controller architecture
Analytical model
Evaluation framework
Simulation results
– Validation of model for directory protocols
– Directory-less broadcast protocols
– Multiprogramming
Summary

400 MHz Protocol Processors: Execution Time
[Figure: execution time comparison; the largest improvement from a second PP is 8%.]

400 MHz Protocol Processors: Dispatcher’s Wait Cycles
[Figure.]
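The occupancy-margin inequality from the analytical model (Op > Om/k + Oc/w for a w-channel controller) is easy to mechanize. The helper below is my own sketch, not code from the talk; with the single-channel setting it reproduces the OM column of the model-validation tables from the measured Op, Om, and kmax values.

```python
def occupancy_margin(Op, Om, Oc, k, channels=1):
    """Occupancy margin OM = Op - (Om/k + Oc/channels), in ns.

    A second coherence engine can only pay off when OM > 0, i.e. when
    the protocol occupancy of a k-request burst sticks out beyond the
    overlapped DRAM access and cache-line transfer time.
    """
    return Op - (Om / k + Oc / channels)

# 400 MHz PP numbers for FFT and FFTW (Oc = 20 ns: 128 B at 6.4 GB/s)
print(round(occupancy_margin(30.3, 54.6, 20, 5), 1))  # -0.6: no benefit
print(round(occupancy_margin(29.4, 54.3, 20, 7), 1))  # 1.6: small benefit
```

A negative margin (FFT) predicts that a second engine cannot help; only the applications with a positive margin (e.g., Ocean at 8.8 ns) stand to gain.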
400 MHz Protocol Processors: Occupancy
[Figure.]

Model Validation: 400 MHz PP
Oc is fixed at 20 ns (128 B at 6.4 GB/s)
Predicted from 1PP measurements:

  App         Op (ns)  Om (ns)  kmax  OM (ns)
  FFT         30.3     54.6     5     -0.6
  FFTW        29.4     54.3     7      1.6
  LU          25.3     40.1     6     -1.4
  Ocean       36.6     54.4     7      8.8
  Radix-Sort  33.8     45.8     4      2.3
  Water       28.1     40.0     4     -1.9

1.6 GHz Protocol Processors: Execution Time
[Figure.]

1.6 GHz Protocol Processors: Dispatcher’s Wait Cycles
[Figure.]

1.6 GHz Protocol Processors: Occupancy
[Figure.]

Model Validation: 1.6 GHz PP
Oc is fixed at 20 ns (128 B at 6.4 GB/s)
Predicted from 1PP measurements:

  App         Op (ns)  Om (ns)  kmax  OM (ns)
  FFT         8.4      54.4     12    -16.1
  FFTW        7.5      53.9     16    -15.9
  LU          6.3      40.2     16    -16.2
  Ocean       9.4      54.6     16    -14.0
  Radix-Sort  8.8      45.8     8     -16.9
  Water       7.2      40.0     8     -17.8

SMTp: Execution Time
[Figure.]

SMTp: Dispatcher’s Wait Cycles
[Figure.]

SMTp: Occupancy
[Figure.]

Model Validation: SMTp
Oc is fixed at 20 ns (128 B at 6.4 GB/s)
Predicted from 1PP measurements:

  App         Op (ns)  Om (ns)  kmax  OM (ns)
  FFT         15.3     55.2     10    -10.2
  FFTW        15.6     57.6     14    -8.5
  LU          13.1     42.3     16    -9.5
  Ocean       17.5     55.9     16    -6.0
  Radix-Sort  16.6     51.8     11    -8.1
  Water       14.4     40.1     8     -10.6

Summary: Execution Time
[Figure.]

Take-away Points
Doubling the controller frequency is always better than adding a second controller
– Reducing individual handler occupancy is more important than reducing the occupancy of a burst
– Ocean shows the importance of the burst mix
Increasing frequency has diminishing returns
– Instead of building complex hardwired protocol engines, dedicate a thread or core in emerging processors

Application to Many-core
Multi-node many-core systems have a natural hierarchy of coherence
– Private last levels (L1 or L2) are kept coherent with the next shared level (L2 or L3) via a directory protocol (Niagara, Power5)
– Multiple nodes employ a global directory protocol
– Our model applies to both levels
– The on-chip coherence controllers per bank in a shared NUCA are unique in one sense: Om and Oc are much smaller than for DRAM, so our model may dictate a positive OM depending on burstiness (k) and the protocol’s complexity (Op)

Directory-less Broadcast Protocols
The requester sends its request to the home; the home broadcasts it to all nodes and replies to the requester; all nodes snoop their local caches and reply to the requester; the requester picks the correct response
– Still NUMA (a la AMD Opteron)
– 26.1 messages per miss (compare with 2.5 in the directory protocol)
A lot of concurrency in the coherence processing layer
– 16.1% average improvement when a second coherence engine is introduced (averaged across FFT, FFTW, Ocean, and Radix-Sort)

Single-node Multiprogramming
Multiprogrammed workloads have a large data footprint and no sharing across threads
– Many outer-level cache misses, exercising the coherence engine much more than the parallel applications do
– Our model correctly predicts that there is no gain in introducing a second controller: Op is too small to satisfy the inequalities
– Take-away point: coherence bandwidth requirement is not directly related to cache miss rate

Single-node Multiprogramming
[Figure.]

Prior Research
Programmable coherence controllers
– Stanford FLASH, Wisconsin Typhoon, Sequent STiNG, Sun S3.mp, Compaq Piranha CMP, Newisys Opteron-Horus
– All controllers are off-chip (and hence lag by at least two process generations)
Multiple coherence controllers
– Explored with SMP nodes having off-chip directory controllers (IBM, UIUC, Purdue)
– Local/remote address partitioning in Opteron-Horus, STiNG, and S3.mp

Summary
A useful model for coherence layer designers
– A simple and intuitive inequality governs the decision
– DRAM latency contributes little to this decision in the case of highly bursty applications
– Bank-conflicting requests enjoy little or no benefit from parallel coherence stream processing (the common case for hot locks/flags)
Two controllers improve performance by up to 8% when the frequency ratio is four
Broadcast protocols enjoy a larger benefit

Acknowledgments
Anshuman Gupta (UCSD)
– Preliminary simulations (part of an independent study at IIT Kanpur)
Varun Khaneja (AMD)
– Development of the directory-less broadcast protocol (part of an MTech thesis at IIT Kanpur)
IIT Kanpur Security Center
– For hosting part of the simulation infrastructure

Integrated Memory Controllers with Parallel Coherence Streams
THANK YOU!
Mainak Chaudhuri, IIT Kanpur
Mark Heinrich, University of Central Florida
To appear in IEEE TPDS