Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research

Multi-Core Systems
[Figure: a multi-core chip with CORE 0, CORE 1, CORE 2, and CORE 3, each with a private L2 cache, sharing a DRAM memory controller and DRAM banks 0-7. The shared DRAM memory system is where the unfairness discussed in this talk arises.]

DRAM Bank Operation
[Figure: a DRAM bank with a row decoder, a row buffer, and a column decoder. An access to a column of the row currently held in the row buffer is a row hit; an access to a different row (or to an empty row buffer) is a row conflict, which must first load the requested row into the row buffer before the column can be read.]

DRAM Controllers
- A row-conflict memory access takes significantly longer than a row-hit access
- Current controllers take advantage of the row buffer
- Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA'00]:
  (1) Row-hit (column) first: service row-hit memory accesses first
  (2) Oldest-first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput, but it is unfair when multiple threads share the DRAM system

Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
  - Fairness definition
  - Algorithm
  - Implementation
  - System software support
- Experimental Evaluation
- Conclusions

The Problem
- Multiple threads share the DRAM controller
- DRAM controllers are designed to maximize DRAM throughput
- DRAM scheduling policies are thread-unaware and unfair
  - Row-hit first: unfairly prioritizes threads with high row-buffer locality (streaming threads that keep accessing the same row)
  - Oldest-first: unfairly prioritizes memory-intensive threads

The Problem: An Example
[Figure: the request buffer holds interleaved requests from T0 (a streaming thread repeatedly accessing Row 0) and T1 (a non-streaming thread accessing Rows 5, 111, and 16) while Row 0 is open in the row buffer. With an 8 KB row and 64 B cache blocks, up to 128 row-hit requests of T0 are serviced before T1's requests.]

Consequences of Unfairness in DRAM
[Chart: memory slowdowns of co-scheduled applications when DRAM is the only shared resource - 7.74x, 4.72x, 1.85x, and 1.05x.]
- Vulnerability to denial of service [Moscibroda & Mutlu, USENIX Security'07]
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability

Fairness in Shared DRAM Systems
- A thread's DRAM performance depends on its inherent row-buffer locality and bank parallelism
- Interference between threads can destroy either or both
- A fair DRAM scheduler should take into account all factors affecting each thread's DRAM performance
  - Not solely bandwidth or solely request latency
- Observation: a thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads

Stall-Time Fairness in Shared DRAM Systems
- A DRAM system is fair if it slows down equal-priority threads equally
  - Compared to when each thread is run alone on the same system
  - Fairness notion similar to SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, MICRO'06], and shared caches [Kim, PACT'04]
- Tshared: DRAM-related stall-time when the thread runs with other threads
- Talone: DRAM-related stall-time when the thread runs alone
- Memory-slowdown = Tshared / Talone
- The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown across threads, without sacrificing performance
  - Considers the inherent DRAM performance of each thread
  - (A small numeric sketch of this metric follows below.)
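As a concrete illustration of the metric (not of the hardware mechanism), here is a minimal sketch of how memory slowdowns and the resulting unfairness value are computed from per-thread stall times; the thread names and stall-time values are made-up examples.

```python
# Toy illustration of the stall-time fairness metric (not the STFM hardware).
# Stall times are hypothetical example values, in cycles.
t_shared = {"T0": 12_000, "T1": 30_000}  # DRAM-related stall time when sharing DRAM
t_alone  = {"T0": 10_000, "T1": 15_000}  # DRAM-related stall time when running alone

# Memory-slowdown = Tshared / Talone for each thread
slowdown = {t: t_shared[t] / t_alone[t] for t in t_shared}

# Unfairness = MAX slowdown / MIN slowdown (1.0 means perfectly stall-time fair)
unfairness = max(slowdown.values()) / min(slowdown.values())

print(slowdown)    # {'T0': 1.2, 'T1': 2.0}
print(unfairness)  # ~1.67: T1 is slowed down much more than T0
```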
STFM Scheduling Algorithm (1)
- During each time interval, for each thread, the DRAM controller
  - Tracks Tshared
  - Estimates Talone
- At the beginning of a scheduling cycle, the DRAM controller
  - Computes Slowdown = Tshared / Talone for each thread with an outstanding legal request
  - Computes Unfairness = MAX Slowdown / MIN Slowdown
- If Unfairness < α
  - Use the DRAM-throughput-oriented baseline scheduling policy: (1) row-hit first, (2) oldest-first

STFM Scheduling Algorithm (2)
- If Unfairness ≥ α
  - Use the fairness-oriented scheduling policy: (1) requests from the thread with MAX Slowdown first, (2) row-hit first, (3) oldest-first
- Maximizes DRAM throughput if it cannot improve fairness
  - Does NOT waste useful bandwidth to improve fairness
  - If a request does not interfere with any other, it is scheduled

How Does STFM Prevent Unfairness?
[Animated example: the request buffer again holds T0's row-hit requests to Row 0 and T1's requests to Rows 5, 111, and 16. As requests are scheduled, the estimated slowdowns of T0 and T1 (values between 1.00 and 1.14 in the animation) and the resulting unfairness (between 1.00 and about 1.06) are updated, and the row buffer contents change from Row 0 to Row 111 to Row 16. Whenever unfairness reaches the threshold, T1's row-conflict requests are prioritized over T0's row hits, keeping the two slowdowns close.]

Implementation
- Tracking Tshared: relatively easy
  - The processor increments a counter whenever the thread cannot commit instructions because the oldest instruction requires DRAM access
- Estimating Talone: more involved, because the thread is not running alone
  - Difficult to estimate directly
  - Observation: Talone = Tshared - Tinterference
  - So instead estimate Tinterference: the extra stall-time due to interference

Estimating Tinterference (1)
- When a DRAM request from thread C is scheduled
  - Thread C can incur extra stall-time: the request's row-buffer hit status might be affected by interference
  - Estimate the row that would have been in the row buffer if the thread were running alone
  - Estimate the extra bank access latency the request incurs:
      Tinterference(C) += Extra Bank Access Latency / (# Banks Servicing C's Requests)
  - The extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism)

Estimating Tinterference (2)
- When a DRAM request from thread C is scheduled
  - Any other thread C' with outstanding requests incurs extra stall-time
  - Interference on the DRAM data bus:
      Tinterference(C') += Bus Transfer Latency of Scheduled Request
  - Interference in the DRAM bank (see paper):
      Tinterference(C') += K * (Bank Access Latency of Scheduled Request) / (# Banks Needed by C''s Requests)

Hardware Cost
- <2 KB storage cost for an 8-core system with a 128-entry memory request buffer
- Arithmetic operations are approximated
  - Fixed-point arithmetic
  - Divisions using lookup tables
- Not on the critical path: the scheduler makes a decision only every DRAM cycle
- More details in the paper

Support for System Software
- Supporting system-level thread weights/priorities
  - Thread weights are communicated to the memory controller
  - Larger-weight threads should be slowed down less
  - Each thread's slowdown is scaled by its weight, and the weighted slowdown is used for scheduling
  - This favors threads with larger weights; the OS can choose thread weights to satisfy QoS requirements
- α: maximum tolerable unfairness, set by system software
  - Don't need fairness? Set α large.
  - Need strict fairness? Set α close to 1.
  - Other values of α trade off fairness and throughput
- (The sketch below shows how α and the weighted slowdowns enter the scheduling decision.)
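The sketch below puts the algorithm together in software form: computing weighted slowdowns, comparing unfairness against α, and choosing between the baseline and fairness-oriented prioritization. It is a simplified model, not the hardware controller: the request fields and helper names are invented for illustration, and multiplying the slowdown by the thread weight is only one plausible reading of "slowdown scaled by its weight" on the slide.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread: str    # owning thread
    row: int       # DRAM row this request targets
    arrival: int   # arrival time (smaller = older)

def pick_request(requests, t_shared, t_alone, weights, alpha, open_row):
    """Choose the next request to schedule, STFM-style (simplified model)."""
    threads = {r.thread for r in requests}
    # Weighted slowdown per thread with an outstanding request.
    # NOTE: scaling by multiplying with the weight is an assumption.
    slowdown = {t: (t_shared[t] / t_alone[t]) * weights.get(t, 1.0) for t in threads}
    unfairness = max(slowdown.values()) / min(slowdown.values())

    def row_hit(r):
        return r.row == open_row

    if unfairness < alpha:
        # Baseline, throughput-oriented policy: (1) row-hit first, (2) oldest first
        key = lambda r: (not row_hit(r), r.arrival)
    else:
        # Fairness-oriented policy: (1) thread with MAX slowdown first,
        # (2) row-hit first, (3) oldest first
        key = lambda r: (-slowdown[r.thread], not row_hit(r), r.arrival)
    return min(requests, key=key)

# Example: T0 has a row hit, but T1's slowdown (2.0) pushes unfairness above alpha,
# so T1's row-conflict request is scheduled first.
reqs = [Request("T0", row=0, arrival=1), Request("T1", row=111, arrival=0)]
nxt = pick_request(reqs, {"T0": 12_000, "T1": 30_000}, {"T0": 10_000, "T1": 15_000},
                   weights={}, alpha=1.10, open_row=0)
print(nxt.thread)  # "T1"
```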
Evaluation Methodology
- 2-, 4-, 8-, and 16-core systems
  - x86 processor model based on the Intel Pentium M
  - 4 GHz processor, 128-entry instruction window
  - 512 KB private L2 cache per core
- Detailed DRAM model based on Micron DDR2-800
  - 128-entry memory request buffer
  - 8 banks, 2 KB row buffer
  - Row-hit round-trip latency: 35 ns (140 cycles)
  - Row-conflict latency: 70 ns (280 cycles)
- Benchmarks
  - SPEC CPU2006 and some Windows desktop applications
  - 256, 32, and 3 benchmark combinations for the 4-, 8-, and 16-core experiments

Comparison with Related Work
- Baseline FR-FCFS [Rixner et al., ISCA'00]
  - Unfairly penalizes non-intensive threads and threads with low row-buffer locality
- FCFS
  - Low DRAM throughput
  - Unfairly penalizes non-intensive threads
- FR-FCFS+Cap
  - Static cap on how many younger row-hits can bypass older accesses
  - Still unfairly penalizes non-intensive threads
- Network Fair Queueing (NFQ) [Nesbit et al., MICRO'06]
  - Per-thread virtual-time-based scheduling
    - A thread's private virtual time increases when its request is scheduled
    - Prioritizes the request from the thread with the earliest virtual time
  - Equalizes bandwidth across equal-priority threads
  - Does not consider the inherent DRAM performance of each thread
  - Unfairly penalizes threads with non-bursty access patterns (idleness problem)
  - Unfairly penalizes threads with unbalanced bank usage (in paper)

Idleness/Burstiness Problem in Fair Queueing
[Animated example: only Thread 1 needs DRAM during intervals [t1,t2], [t2,t3], and [t3,t4], so only Thread 1 is serviced and its virtual time keeps increasing even though no other thread needs DRAM. When Threads 2, 3, and 4 later issue requests, their virtual times are smaller than Thread 1's, so they are prioritized and Thread 1 is starved.]
- The non-bursty thread suffers a large performance loss even though it fairly utilized DRAM when no other thread needed it (the toy model below illustrates this effect)
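To make the idleness problem concrete, here is a toy virtual-time model loosely in the spirit of fair queueing (it is not Nesbit et al.'s actual controller, and all numbers are made up): a thread's virtual time advances only when it is serviced, so a thread that ran alone for a while is later deprioritized against threads that were idle.

```python
# Toy model of the idleness problem in virtual-time (fair-queueing-style) scheduling.
SERVICE_COST = 1.0  # virtual-time charge per serviced request

def run(num_steps, active):
    """active(step) -> set of thread names with outstanding requests at that step."""
    vtime = {"T1": 0.0, "T2": 0.0, "T3": 0.0, "T4": 0.0}
    serviced = {t: 0 for t in vtime}
    for step in range(num_steps):
        ready = active(step) & set(vtime)
        if not ready:
            continue
        # Fair-queueing-style choice: service the ready thread with the smallest
        # virtual time, and charge it for the service it receives.
        winner = min(ready, key=lambda t: vtime[t])
        vtime[winner] += SERVICE_COST
        serviced[winner] += 1
    return serviced

# Steps 0-49: only T1 needs DRAM, so its virtual time grows to 50.
# Steps 50-99: T2-T4 burst while T1 still needs DRAM.
def active(step):
    return {"T1"} if step < 50 else {"T1", "T2", "T3", "T4"}

print(run(100, active))
# Typical output: {'T1': 50, 'T2': 17, 'T3': 17, 'T4': 16}
# During the burst T1 is starved (it gets almost none of the last 50 slots),
# even though it only used DRAM earlier when nobody else needed it.
```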
Unfairness on 4-, 8-, and 16-core Systems
- Unfairness = MAX memory slowdown / MIN memory slowdown
[Bar chart: unfairness of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems (values up to about 5.5). STFM has the lowest unfairness in all cases; its improvement is annotated as 1.26X, 1.27X, and 1.81X for the 4-, 8-, and 16-core systems, respectively.]

System Performance
[Bar chart: normalized weighted speedup for the five schedulers on 4-, 8-, and 16-core systems. STFM's improvement is annotated as 5.8%, 4.1%, and 4.6%, respectively.]

Hmean Speedup (Throughput-Fairness Balance)
[Bar chart: normalized harmonic-mean speedup for the five schedulers on 4-, 8-, and 16-core systems. STFM's improvement is annotated as 10.8%, 9.5%, and 11.2%, respectively.]

Conclusions
- A new definition of DRAM fairness: stall-time fairness
  - Equal-priority threads should experience equal memory-related slowdowns
  - Takes into account the inherent memory performance of threads
- A new DRAM scheduling algorithm enforces this definition
  - Flexible and configurable fairness substrate
  - Supports system-level thread priorities/weights and QoS policies
- Results across a wide range of workloads and systems show:
  - Improving DRAM fairness also improves system throughput
  - STFM provides better fairness and system performance than previously proposed DRAM schedulers

Thank you. Questions?

Backup Slides

Structure of the STFM Controller
[Block diagram of the STFM memory controller; see the paper for details.]

Comparison using NFQ QoS Metrics
- Nesbit et al. [MICRO'06] proposed the following target for quality of service:
  - A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
  - That is, the baseline is the same thread with memory bandwidth scaled down by N
- We compared the different DRAM schedulers' effectiveness using this metric
  - Number of violations of the above QoS target
  - Harmonic mean of IPC normalized to the above baseline
  - (A small sketch of this comparison appears at the end of these notes.)

Violations of the NFQ QoS Target
[Bar chart: percentage of workloads where the QoS objective is NOT satisfied, for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems (values between 0% and about 60%).]

Hmean of Normalized IPC (using the NFQ Baseline)
[Bar chart: harmonic mean of IPC normalized to the NFQ baseline for the five schedulers on 4-, 8-, and 16-core systems; STFM's annotated improvements range from 5.1% to 10.3%.]

Shortcomings of the NFQ QoS Target
- Low baseline (easily achievable target) for equal-priority threads
  - With N equal-priority threads, a thread only needs to do better than on a system with 1/Nth of the memory bandwidth
  - This target is usually very easy to achieve
- Unachievable target in some cases, especially when N is large
  - Consider two threads that always access the same bank in an interleaved fashion: too much interference
- Baseline performance is very difficult to determine in a real system
  - Memory frequency cannot be scaled arbitrarily
  - Not knowing the baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)

A Case Study
- mcf, libquantum, GemsFDTD, and astar running together
[Bar chart: memory slowdown (normalized memory stall time) of each application under each scheduler. Unfairness: 7.28 (FR-FCFS), 2.07 (FCFS), 2.08 (FR-FCFS+Cap), 1.87 (NFQ), 1.27 (STFM).]

Additional backup slides: Windows Desktop Workloads; Enforcing Thread Weights; Effect of α; Effect of Banks and Row Buffer Size.
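For completeness, a minimal sketch of the NFQ QoS-metric comparison described in the backup slides, assuming per-thread IPC measurements are available for the scheduler under test and for the 1/Nth-frequency private-memory baseline; the thread names and IPC values are hypothetical.

```python
from statistics import harmonic_mean

# Hypothetical per-thread IPCs for one 4-thread workload.
ipc_scheduler = {"T0": 0.92, "T1": 0.45, "T2": 0.71, "T3": 0.60}  # scheduler under test
ipc_baseline  = {"T0": 0.80, "T1": 0.50, "T2": 0.65, "T3": 0.55}  # private memory at 1/N frequency

# NFQ QoS target: every thread should run at least as fast as on the baseline.
violations = [t for t in ipc_scheduler if ipc_scheduler[t] < ipc_baseline[t]]
qos_satisfied = not violations

# Throughput-fairness summary: harmonic mean of IPC normalized to the baseline.
hmean_norm_ipc = harmonic_mean([ipc_scheduler[t] / ipc_baseline[t] for t in ipc_scheduler])

print(violations)                # ['T1'] -> this workload violates the QoS target
print(qos_satisfied)             # False
print(round(hmean_norm_ipc, 3))  # ~1.05
```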