Fair Queuing Memory Systems
Kyle Nesbit, Nidhi Aggarwal, Jim Laudon*, and Jim Smith
University of Wisconsin - Madison, Department of Electrical and Computer Engineering
Sun Microsystems*

Slide 2: Motivation: Multicore Systems
- Significant memory bandwidth limitations
- Bandwidth-constrained operating points will occur more often in the future
- Systems must perform well at bandwidth-constrained operating points
- Systems must respond in a predictable manner

Slide 3: Bandwidth Interference
[Chart: IPC (0 to 1) of vpr alone, vpr with crafty, and vpr with art; vpr's IPC drops sharply when co-run with art]
- Desktops: soft real-time constraints
- Servers: fair sharing / billing
- Interference decreases overall throughput

Slide 4: Solution
- A memory scheduler based on:
  - First-Ready FCFS memory scheduling
  - Network Fair Queuing (FQ)
- System software allocates memory system bandwidth to individual threads
- The proposed FQ memory scheduler:
  1. Offers threads their allocated bandwidth
  2. Distributes excess bandwidth fairly

Slide 5: Background
- Memory basics
- Memory controllers
- First-Ready FCFS memory scheduling
- Network fair queuing

Slide 6: Background: Memory Basics

Slide 7: Micron DDR2-800 Timing Constraints (measured in DRAM address bus cycles)
  tRCD   Activate to read                           5 cycles
  tCL    Read to data bus valid                     5 cycles
  tWL    Write to data bus valid                    4 cycles
  tCCD   CAS to CAS (a CAS is a read or a write)    2 cycles
  tWTR   Write to read                              3 cycles
  tWR    Internal write to precharge                6 cycles
  tRTP   Internal read to precharge                 3 cycles
  tRP    Precharge to activate                      5 cycles
  tRRD   Activate to activate (different banks)     3 cycles
  tRAS   Activate to precharge                      18 cycles
  tRC    Activate to activate (same bank)           22 cycles
  BL/2   Burst length (cache line size / 64 bits)   4 cycles
  tRFC   Refresh to activate                        51 cycles
  tRFC   Max refresh to refresh                     28,000 cycles

Slide 8: Background: Memory Controller
[Diagram: CMP chip with two processors, each with L1 caches and a private L2 cache, sharing an on-chip memory controller; SDRAM sits beyond the chip boundary]

Slide 9: Background: Memory Controller
- Translates memory requests into SDRAM commands: Activate, Read, Write, and Precharge
- Tracks SDRAM timing constraints, e.g., activate latency tRCD and CAS latency tCL (see the sketch after slide 10)
- Buffers and reorders requests in order to improve memory system throughput

Slide 10: Background: Memory Scheduler
[Diagram: memory request datapath; the processor data buses feed arrival-time assignment and a transaction buffer with per-bank request queues (banks 1 through n), plus cache line read and write buffers; the FR-FCFS scheduler drives the SDRAM address bus, and data moves over the SDRAM data bus; data, request/command, and control paths are shown separately]
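To make the controller's timing bookkeeping concrete, here is a minimal sketch (Python, with hypothetical names and structure; the slides give no implementation) of per-bank readiness tracking for a small subset of the Micron DDR2-800 constraints from slide 7. A real controller tracks many more constraints (tCCD, tWTR, tRRD, tRC, refresh) across banks, ranks, and channels.

```python
# A minimal sketch of per-bank SDRAM timing bookkeeping (hypothetical names).
# It tracks the earliest cycle at which each command type becomes "ready",
# using a small subset of the Micron DDR2-800 constraints from slide 7.

tRCD, tCL, tRP, tRAS, tRTP = 5, 5, 5, 18, 3  # DRAM address bus cycles

class Bank:
    def __init__(self):
        # Earliest cycle at which each command may issue to this bank.
        self.ready = {"ACT": 0, "READ": 0, "PRE": 0}

    def issue(self, cmd, now):
        assert now >= self.ready[cmd], f"{cmd} not ready until {self.ready[cmd]}"
        if cmd == "ACT":
            self.ready["READ"] = now + tRCD   # activate-to-read
            self.ready["PRE"] = now + tRAS    # activate-to-precharge
        elif cmd == "READ":
            # Data is valid tCL cycles later; internal read-to-precharge is
            # tRTP, but activate-to-precharge (tRAS) must also hold.
            self.ready["PRE"] = max(self.ready["PRE"], now + tRTP)
        elif cmd == "PRE":
            self.ready["ACT"] = now + tRP     # precharge-to-activate

bank = Bank()
bank.issue("ACT", 0)     # open a row
bank.issue("READ", 5)    # first ready at cycle tRCD = 5
bank.issue("PRE", 18)    # first ready at max(0 + tRAS, 5 + tRTP) = 18
bank.issue("ACT", 23)    # first ready at 18 + tRP = 23 (tRC = 22 also holds)
```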
Slide 11: Background: FR-FCFS Memory Scheduler
- First-Ready FCFS prioritizes:
  1. Ready commands (ready with respect to the SDRAM timing constraints)
  2. CAS commands over RAS commands
  3. Earliest arrival time
- FR-FCFS is a good general-purpose scheduling policy [Rixner 2004]
- But it has issues in multithreaded systems

Slide 12: Example: Two Threads
- Thread 1: bursty MLP, bandwidth constrained
- Thread 2: isolated misses, latency sensitive
[Diagram: thread 1 alternates computation with bursts of overlapping requests (a1-a4, then a5-a8); thread 2 alternates computation with isolated requests a1 through a5, each exposing the full memory latency]

Slide 13: First Come First Serve
[Diagram: under FCFS on the shared memory system, thread 1's bursts a1-a8 monopolize service and delay thread 2's isolated requests a1 and a2]

Slide 14: Background: Network Fair Queuing
- Network fair queuing (FQ) provides QoS in communication networks
- Routers use FQ algorithms to offer flows their allocated bandwidth
- Network flows are allocated bandwidth on each network link along the flow's path
- Minimum bandwidth bounds end-to-end communication delay through the network
- We leverage FQ theory to provide QoS in memory systems

Slide 15: Background: Virtual Finish-Time Algorithm
- The kth packet on flow i is denoted $p_i^k$
- $p_i^k$ virtual start-time: $S_i^k = \max\{a_i^k,\ F_i^{k-1}\}$
- $p_i^k$ virtual finish-time: $F_i^k = S_i^k + L_i^k / \phi_i$
- $\phi_i$ is flow i's share of the network link
- A virtual clock (VC) determines the arrival time $a_i^k$; the VC algorithm determines the fairness policy

Slide 16: Quality of Service
- Each thread is allocated a fraction $\phi_i$ of the memory system bandwidth
  - Desktop: soft real-time applications
  - Server: differentiated service, billing
- The proposed FQ memory scheduler:
  1. Offers threads their allocated bandwidth, regardless of the load on the memory system
  2. Distributes excess bandwidth according to the FQ memory scheduler's fairness policy

Slide 17: Quality of Service
- Minimum bandwidth ⇒ QoS
- A thread allocated a fraction $\phi_i$ of the memory system bandwidth will perform at least as well as the same thread on a private memory system operating at $\phi_i$ of the frequency

Slide 18: Fair Queuing Memory Scheduler
- A VTMS (virtual time memory system) is used to calculate memory request deadlines
- Request deadlines are virtual finish-times
- The FQ scheduler selects the first-ready pending request with the earliest deadline first (EDF; see the sketch after slide 22)
[Diagram: per-thread request streams (threads 1 through m), each with its own VTMS, feed a deadline/finish-time algorithm and a shared transaction buffer; the FQ scheduler issues to SDRAM]

Slide 19: Fair Queuing Memory Scheduler
[Diagram: thread 1's bursts a1-a8 and thread 2's requests a1-a4 are assigned deadlines in virtual time; each thread's memory latency is dilated by the reciprocal of its share $\phi_i$]

Slide 20: Virtual Time Memory System
- Each thread has its own VTMS to model its private memory system
- A VTMS consists of multiple resources: banks and channels
- In hardware, a VTMS consists of one register for each memory bank and channel resource
- A VTMS register holds the virtual time at which the virtual resource will be ready to start the next request

Slide 21: Virtual Time Memory System
- A request's deadline is its virtual finish-time: the time the request would finish if the request's thread were running on a private memory system operating at $\phi_i$ of the frequency
- The VTMS model captures fundamental SDRAM timing characteristics and abstracts away some details in order to apply network FQ theory

Slide 22: Priority Inversion
- First-ready scheduling is required to improve bandwidth utilization
- Low-priority ready commands can block higher-priority (earlier virtual finish-time) commands
- Most priority inversion blocking occurs at active banks, e.g., during a sequence of row hits
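The virtual finish-time algorithm of slide 15 and the EDF selection of slide 18 fit in a few lines. Below is a minimal sketch (Python, with hypothetical names; not the paper's hardware) of the per-thread clock update $S_i^k = \max\{a_i^k, F_i^{k-1}\}$, $F_i^k = S_i^k + L_i^k/\phi_i$, and of picking, among ready requests, the one with the earliest virtual finish time.

```python
# Minimal sketch of virtual finish-time deadlines (hypothetical names).
# Per thread i: S_i^k = max(a_i^k, F_i^{k-1});  F_i^k = S_i^k + L_i^k / phi_i.

class ThreadClock:
    def __init__(self, phi):
        self.phi = phi          # thread's allocated share of bandwidth
        self.last_finish = 0.0  # F_i^{k-1}

    def deadline(self, arrival, service_time):
        start = max(arrival, self.last_finish)               # S_i^k
        self.last_finish = start + service_time / self.phi   # F_i^k
        return self.last_finish

def pick(requests):
    """Among ready pending requests, pick the earliest virtual finish time."""
    ready = [r for r in requests if r["ready"]]
    return min(ready, key=lambda r: r["deadline"]) if ready else None

clk = {1: ThreadClock(phi=0.5), 2: ThreadClock(phi=0.5)}
reqs = [
    {"thread": 1, "ready": True, "deadline": clk[1].deadline(0, 10)},  # F = 20
    {"thread": 1, "ready": True, "deadline": clk[1].deadline(0, 10)},  # F = 40
    {"thread": 2, "ready": True, "deadline": clk[2].deadline(5, 10)},  # F = 25
]
print(pick(reqs)["thread"])  # thread 1's first request (deadline 20) wins
```

Note how thread 1's back-to-back burst pushes its own virtual clock forward (20, then 40), so thread 2's isolated request (deadline 25) is served before thread 1's second one, which is the behavior slide 19 illustrates.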
Slide 23: Bounding Priority Inversion Blocking Time
1. When a bank is inactive, and for the first tRAS cycles after a bank has been activated, prioritize ready requests: first-ready, virtual finish-time first (FR-VFTF)
2. After a bank has been active for tRAS cycles, the FQ scheduler selects the command with the earliest virtual finish-time and waits for it to become ready

Slide 24: Evaluation
- Simulator originally developed at IBM Research
- Structural model; adopts the ASIM modeling methodology
- Detailed model of finite memory system resources
- Simulate 20 statistically representative 100M-instruction SPEC2000 traces

Slide 25: System Configuration (4GHz Processor)
  Issue Buffer         64 entries
  Issue Width          8 units (2 FXU, 2 LSU, 2 FPU, 1 BRU, 1 CRU)
  Reorder Buffer       128 entries
  Load / Store Queues  32-entry load reorder queue, 32-entry store reorder queue
  I-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 8 MSHRs
  D-Cache              32KB private, 4-way, 64-byte lines, 2-cycle latency, 16 MSHRs
  L2 Cache             512KB private, 8-way, 64-byte lines, 12-cycle latency, 16 store merge buffer entries, 32 transaction buffer entries
  Memory Controller    16 transaction buffer entries per thread, 8 write buffer entries per thread, closed page policy
  SDRAM Channels       1 channel
  SDRAM Ranks          1 rank
  SDRAM Banks          8 banks

Slide 26: Evaluation: Single-Thread Data Bus Utilization
[Chart: data bus utilization (0-100%) for the 20 SPEC2000 benchmarks, from crafty (lowest) up through mcf, equake, and art (highest)]
- We use data bus utilization to roughly approximate "aggressiveness"

Slide 27: Evaluation
- We present results for two-thread workloads that stress the memory system
- Construct 19 workloads by combining each benchmark (the subject thread) with art, the most aggressive benchmark (the background thread)
- Static partitioning of memory bandwidth: $\phi_i = 0.5$
- IPC is normalized to the QoS IPC: the benchmark's IPC on a private memory system at 0.5 of the frequency (0.5 of the bandwidth)
- More results in the paper

Slide 28: [Charts: normalized IPC of the subject thread and of the background thread (art), FQ vs. FR-FCFS, for all 19 two-thread workloads]

Slide 29: Throughput: Harmonic Mean of Normalized IPCs
[Chart: harmonic mean of the two threads' normalized IPCs, FR-FCFS vs. FQ, per workload and overall hmean]

Slide 30: [Chart: subject thread of two-thread workload (background thread is art)]

Slide 31: Summary and Conclusions
- Existing techniques can lead to unfair sharing of memory bandwidth resources ⇒ destructive interference
- Fair queuing is a good technique for providing QoS in memory systems
- Providing threads QoS eliminates destructive interference, which can significantly improve system throughput

Slide 32: Backup Slides

Slide 33: Generalized Processor Sharing
- Ideal generalized processor sharing (GPS): each flow i is allocated a share $\phi_i$ of the shared network link
- A GPS server services all backlogged flows simultaneously, in proportion to their allocated shares (see the sketch after this slide)
[Diagram: four flows with shares $\phi_1$ through $\phi_4$ sharing one link]
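The GPS fluid model can be made concrete with a few lines of code. Below is a minimal sketch (Python, with hypothetical names; the slides give no implementation) of one coarse time step of a GPS server: every backlogged flow is drained simultaneously in proportion to its share, and idle flows' capacity is redistributed to the rest.

```python
# Minimal fluid sketch of Generalized Processor Sharing (hypothetical names).
# A GPS server drains all backlogged flows simultaneously, in proportion to
# their shares phi_i; idle flows' capacity is split among the active ones.

def gps_step(backlog, phi, capacity, dt):
    """Advance the fluid model by dt; return the service given to each flow.
    Coarse step: if a flow empties mid-step, its share is not redistributed
    within dt (an exact GPS simulation would re-split at that instant)."""
    served = {}
    active = [i for i in backlog if backlog[i] > 0]
    total_phi = sum(phi[i] for i in active)
    for i in active:
        rate = capacity * phi[i] / total_phi   # proportional share of link
        served[i] = min(backlog[i], rate * dt)
        backlog[i] -= served[i]
    return served

backlog = {1: 8.0, 2: 8.0, 3: 0.0}   # flow 3 is idle
phi = {1: 0.25, 2: 0.25, 3: 0.5}
print(gps_step(backlog, phi, capacity=1.0, dt=4.0))  # {1: 2.0, 2: 2.0}
```

With flow 3 idle, flows 1 and 2 each drain at rate 0.5 even though their shares are 0.25; that surplus is the "excess bandwidth" whose distribution the FQ memory scheduler's fairness policy (slide 36) governs.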
Slide 34: Background: Network Fair Queuing
- Network FQ algorithms model each flow as if it were on a private link
- Flow i's private link has $\phi_i$ times the bandwidth of the real link
- The algorithm calculates packet deadlines: a packet's deadline is the virtual time at which the packet finishes its transmission on its private link

Slide 35: Virtual Time Memory System Finish-Time Algorithm
- Thread i's kth memory request is denoted $m_i^k$
- $m_i^k$ bank j virtual start-time: $B_j.S_i^k = \max\{a_i^k,\ B_j.F_i^{k'}\}$, where $k'$ is thread i's previous request to bank j
- $m_i^k$ bank j virtual finish-time: $B_j.F_i^k = B_j.S_i^k + B_j.L_i^k / \phi_i$
- $m_i^k$ channel virtual start-time: $C.S_i^k = \max\{B_j.F_i^k,\ C.F_i^{k-1}\}$
- $m_i^k$ channel virtual finish-time: $C.F_i^k = C.S_i^k + C.L_i^k / \phi_i$

Slide 36: Fairness Policy
- FQMS fairness policy: distribute excess bandwidth to the thread that has consumed the least excess bandwidth (relative to its service share) in the past
- Differs from the fairness policy commonly used in networks, because a memory system is an integral part of a closed system

Slide 37: Background: SDRAM Memory Systems
- SDRAM 3D structure: banks, rows, columns
- SDRAM commands: activate a row, read or write columns, precharge the bank

Slide 38: Virtual Time Memory System Service Requirements
  SDRAM Command   B_cmd.L (bank)               C_cmd.L (channel)
  Activate        tRCD                         n/a
  Read            tCL                          BL/2
  Write           tWL                          BL/2
  Precharge       tRP + (tRAS - tRCD - tCL)    n/a
- The tRAS timing constraint overlaps the read and write bank timing constraints; the precharge bank service requirement accounts for the overlap
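As a worked check, plugging the Micron DDR2-800 numbers from slide 7 into the precharge row gives

$$B_{PRE}.L = t_{RP} + (t_{RAS} - t_{RCD} - t_{CL}) = 5 + (18 - 5 - 5) = 13 \text{ cycles.}$$

That is, the activate and read service times (tRCD + tCL = 10 cycles) cover only part of the 18-cycle tRAS window, so the remaining 8 cycles, plus tRP itself, are charged to the precharge.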