CoScale: Coordinating CPU and Memory System DVFS in Server Systems
Qingyuan Deng, David Meisner+, Abhishek Bhattacharjee, Thomas F. Wenisch*, and Ricardo Bianchini
Rutgers University, +Facebook Inc., *University of Michigan

Server power challenges
[Figure: server power breakdown (CPU, memory, others) for the ILP, MID, MEM, and MIX workloads]
• CPU and memory power represent the vast majority of server power

Need to conserve both CPU and memory energy
• Related work
  • Many prior works on CPU DVFS
  • MemScale: active low-power modes for memory [ASPLOS'11]
• Uncoordinated DVFS causes poor behavior
  • Conflicts, oscillations, unstable behavior
  • May not produce the best energy savings
  • Difficult to bound the performance degradation
• Coordinated CPU and memory DVFS is needed to achieve the best results
• Challenge: constrain the search space to good frequency combinations

CoScale: Coordinating CPU and memory DVFS
• Key goal
  • Conserve significant energy while meeting performance constraints
• Hardware mechanisms
  • New performance counters
  • Frequency scaling (DFS) of the channels, DIMMs, and DRAM devices
  • Voltage and frequency scaling (DVFS) of the memory controller and CPU cores
• Approach
  • Online profiling to estimate performance and power consumption
  • Epoch-based modeling and control to meet performance constraints
• Main result
  • Energy savings of up to 24% (16% on average) within a 10% performance target; 4% on average within a 1% target

Outline
• Motivation and overview
• CoScale
• Results
• Conclusions

CoScale design
• Goal: minimize energy under a user-specified performance bound
• Approach: epoch-based, OS-managed CPU/memory frequency tuning
• Each epoch (e.g., an OS quantum):
  1. Profile performance and CPU/memory boundedness
     • Performance counters track mem-CPI, CPU-CPI, and cache performance
  2. Efficiently search for the best frequency combination
     • Models estimate CPU/memory performance and power
  3.
Re-lock to the best frequencies; continue tracking performance
     • Slack: delta between estimated and observed performance
  4. Carry slack forward into the performance target for the next epoch

Frequency and slack management
[Figure: per-epoch timeline of actual vs. target performance across Epochs 1-4; positive/negative slack is measured each epoch, performance/energy slack vs. the target is estimated via models, and core and memory (MC, bus + DRAM) frequencies are raised or lowered accordingly]

Frequency search algorithm
[Figure: exhaustive offline search over the memory-frequency × core-frequency space]
• Offline exhaustive search is impractical: O(M × C^N)
  • M: number of memory frequencies
  • C: number of CPU frequencies
  • N: number of CPU cores
• CoScale metric: ΔPower/ΔPerformance
[Figure: CoScale greedy search; at each step, the component (memory or a core) with the best ΔPower/ΔPerformance ratio takes the next frequency step]
• Core grouping: balance the impact of memory and cores
• Complexity: O(M + C × N²)

Outline
• Motivation and overview
• CoScale
• Results
• Conclusions

Methodology
• Detailed simulation
  • 16 cores, 16MB LLC, 4 DDR3 channels, 8 DIMMs
  • Multi-programmed workloads from the SPEC suites
• Power modes
  • Memory: 10 frequencies between 200 and 800 MHz
  • CPU: 10 frequencies between 2.2 GHz and 4 GHz
• Power models
  • Micron's DRAM power model
  • McPAT CPU power model

Results – energy savings and performance
[Figure: average energy savings (full-system, memory, CPU) and performance overhead (multiprogram average, worst program in mix, vs. the 10% performance-loss bound) for MEM, MID, ILP, MIX, and AVG]
• Higher CPU energy savings on MEM; higher memory savings on ILP
• System energy savings of 16% (up to 24%); always within the perf.
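The epoch loop with slack carry-forward can be illustrated with a minimal Python sketch. All names and the linear slack bookkeeping are illustrative assumptions for exposition; the paper's counter-driven models are more detailed, and a real controller would also raise frequencies when slack goes negative.

```python
class EpochController:
    """Sketch of an epoch-based DVFS controller with slack carry-forward.

    Illustrative only: the class and method names are not from the paper.
    """

    def __init__(self, max_slowdown):
        self.max_slowdown = max_slowdown  # user performance bound, e.g. 0.10 for 10%
        self.slack = 0.0                  # slack banked from past epochs

    def epoch_budget(self):
        # Slowdown allowed this epoch: the bound plus (or minus) banked slack.
        return self.max_slowdown + self.slack

    def end_epoch(self, target_time, actual_time):
        # Positive when the epoch ran faster than its target, negative when
        # slower; normalized to the target and carried into the next epoch.
        self.slack += (target_time - actual_time) / target_time
```

For example, an epoch that finishes 5% ahead of its target banks 0.05 of slack, letting the next epoch pick more aggressive (lower) frequencies; an epoch that overshoots draws the budget back down.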
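The greedy ΔPower/ΔPerformance search can be sketched as follows. This is a simplified, assumed reconstruction: it only steps frequencies downward, takes toy `power_model`/`perf_model` callables in place of the paper's counter-based models, and omits the core-grouping optimization; every name here is illustrative.

```python
def greedy_search(freqs, power_model, perf_model, perf_budget):
    """Greedily lower the frequency of whichever component (memory or a core)
    offers the best marginal trade-off (power saved per unit slowdown added),
    while the predicted slowdown stays within perf_budget.

    freqs: dict mapping component name -> index into its frequency table
           (lower index = lower frequency).
    perf_model(freqs) returns predicted slowdown vs. running at full speed.
    """
    while True:
        best = None
        for comp, idx in freqs.items():
            if idx == 0:
                continue  # already at this component's lowest frequency
            trial = dict(freqs)
            trial[comp] = idx - 1
            if perf_model(trial) > perf_budget:
                continue  # this step would violate the performance bound
            d_perf = perf_model(trial) - perf_model(freqs)   # slowdown added
            d_power = power_model(freqs) - power_model(trial)  # power saved
            score = d_power / max(d_perf, 1e-9)
            if best is None or score > best[0]:
                best = (score, trial)
        if best is None:
            return freqs  # no step improves within the bound
        freqs = best[1]
```

Each iteration examines one candidate step per component, so the work per epoch grows with the number of frequency steps and components rather than with the exponential O(M × C^N) space of the exhaustive search.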
bound

Alternative approaches
• Memory system DVFS only: MemScale
• CPU DVFS only
  • Selects the best combination of core frequencies
• Uncoordinated
  • CPU and memory DVFS controllers make independent decisions
• Semi-coordinated
  • CPU and memory DVFS controllers coordinate by sharing slack
• Offline
  • Selects the best combination of memory and core frequencies
  • Unrealistic: the search space is exponential in the number of cores

Results – dynamic behavior
[Figure: timeline of core and memory frequencies for the milc application in MIX2 under (a) CoScale, (b) Uncoordinated, and (c) Semi-Coordinated]

Results – comparison to alternative approaches
[Figure: full-system energy savings and performance overhead (multiprogram average, worst in mix, vs. the performance-loss bound) for each approach]
• CoScale achieves energy savings comparable to Offline
• Uncoordinated fails to bound the performance loss

Results – sensitivity analysis
[Figure: system energy reduction and worst performance degradation under 1%, 5%, 10%, 15%, and 20% performance bounds; results for the MID workloads]

Conclusions
• CoScale contributions:
  • First coordinated DVFS strategy for CPU and memory
  • New perf.
counters to capture energy and performance
  • Smart OS policy to choose the best power modes dynamically
  • Average 16% (up to 24%) full-system energy savings
  • Framework for coordinating techniques across components
• In the paper
  • Details of the search algorithm, performance counters, and models
  • Sensitivity analyses (e.g., rest-of-system power, prefetching)
  • CoScale on in-order vs. out-of-order CPUs

THANKS!
SPONSORS: