[Figure: a multi-core chip with four cores (CORE 0 through CORE 3), each with a private L2 cache, sharing a single DRAM memory controller and a shared DRAM memory system of eight banks (DRAM Bank 0 through Bank 7); the shared controller is where unfairness arises.]
[Figure: organization of a DRAM bank — rows and columns of cells, a row buffer that holds the currently open row, and a column decoder that selects the requested data.]
A row-conflict memory access takes much longer than a row-hit access (100ns vs. 50ns)
Current controllers take advantage of the row buffer
Commonly used scheduling policy (FR-FCFS):
Row-hit first: Service row-hit memory accesses first
Oldest-first: Then service older accesses first
This scheduling policy aims to
Maximize DRAM throughput
But it opens the door to a new form of denial of service when multiple threads share the memory system
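To make the policy concrete, here is a minimal sketch of FR-FCFS request selection over a simple array-based request buffer; the struct and function names are illustrative, not taken from the paper or any real controller:

    struct request {
        int  row;       /* DRAM row this request targets       */
        long arrival;   /* arrival time, used for oldest-first */
        int  valid;     /* slot holds a pending request        */
    };

    /* Return the index of the request FR-FCFS would service next,
       given the row currently open in the row buffer, or -1 if the
       buffer holds no pending request. */
    int frfcfs_pick(const struct request *buf, int n, int open_row)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!buf[i].valid)
                continue;
            if (best < 0) { best = i; continue; }
            int i_hit = (buf[i].row == open_row);
            int b_hit = (buf[best].row == open_row);
            if (i_hit != b_hit) {
                if (i_hit) best = i;                    /* 1. row-hit first */
            } else if (buf[i].arrival < buf[best].arrival) {
                best = i;                               /* 2. oldest first  */
            }
        }
        return best;
    }

A thread whose requests keep targeting the already-open row keeps winning step (1); that behavior is exactly what the rest of the talk exploits.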
Thomas Moscibroda and Onur Mutlu
Computer Architecture Group
Microsoft Research
Outline
The Problem
Memory Performance Hogs
Preventing Memory Performance Hogs
Our Solution
Fair Memory Scheduling
Experimental Evaluation
Conclusions
Multiple threads share the DRAM controller
DRAM controllers are designed to maximize DRAM throughput
DRAM scheduling policies are thread-unaware and unfair
Row-hit first: unfairly prioritizes threads with high row buffer locality
Streaming threads
Threads that keep accessing the same row
Oldest-first: unfairly prioritizes memory-intensive threads
Vulnerable to denial of service attacks
A Memory Performance Hog (MPH) is a thread that exploits unfairness in the DRAM controller to deny memory service (for long time periods) to threads co-scheduled on the same chip
Can severely degrade other threads’ performance
While maintaining most of its own performance!
Easy to write an MPH
An MPH can be designed intentionally (malicious)
Or, unintentionally!
Implications in commodity multi-core systems:
widespread performance degradation and unpredictability
user discomfort and loss of productivity
vulnerable hardware could lose market share
STREAM
- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive
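A hypothetical sketch of such a streaming kernel (illustrative only; not the exact program from the paper):

    #include <stdlib.h>

    #define N (64 * 1024 * 1024)    /* large enough to defeat the L2 cache */

    /* Sequential walk: consecutive 64B cache blocks fall into the same 8KB
       DRAM row, so nearly every memory access hits the open row buffer. */
    void stream_like(void)
    {
        char *a = malloc(N);
        for (size_t i = 0; i < N; i += 64)
            a[i]++;
        free(a);
    }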
RDARRAY
- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive
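A correspondingly hypothetical sketch of a random-access kernel (again illustrative, not the paper’s code):

    #include <stdlib.h>

    #define N (64 * 1024 * 1024)    /* same footprint as the streaming kernel */

    /* Random indices land in different DRAM rows (and banks), so most
       accesses are row conflicts rather than row hits. A simple linear
       congruential generator keeps the sketch self-contained. */
    void rdarray_like(void)
    {
        char *a = malloc(N);
        unsigned long long x = 12345;
        for (long i = 0; i < N / 64; i++) {
            x = x * 6364136223846793005ULL + 1442695040888963407ULL;
            a[(x >> 16) % N]++;
        }
        free(a);
    }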
[Figure: the DRAM request buffer and row buffer when STREAM and RDARRAY run together. Row size: 8KB, data size: 64B. STREAM’s sequential requests keep hitting the open row and are serviced first under row-hit-first scheduling, while RDARRAY’s row-conflict requests wait.]
[Figure: measured slowdowns when STREAM and RDARRAY run together — RDARRAY suffers a 2.82X slowdown while STREAM suffers only a 1.18X slowdown.]
Results on Intel Pentium D running Windows XP
(Similar results for Intel Core Duo and AMD Turion, and on Fedora)
Simulated configuration:
- 512KB private L2s
- 3-wide processors
- 128-entry instruction window
Preventing Memory Performance Hogs
MPHs cannot be prevented in software
Fundamentally hard to distinguish between malicious and unintentional MPHs
Software has no direct control over DRAM controller’s scheduling
OS/VMM can adjust applications’ virtual-to-physical page mappings
No control over the DRAM controller’s hardwired page-to-bank mappings
MPHs should be prevented in hardware
Hardware also cannot distinguish malicious vs unintentional
Goal: Contain and limit MPHs by providing fair memory scheduling
A thread’s DRAM performance depends on its inherent memory access behavior (e.g., its row-buffer locality)
Well-studied notions of fairness are inadequate
Fair scheduling cannot simply
Equalize memory bandwidth among threads
Unfairly penalizes threads with good bandwidth utilization
DRAM scheduling is not a pure bandwidth allocation problem
Unlike a network wire, DRAM is a stateful system
Equalize the latency or number of requests among threads
Unfairly penalizes threads with good row-buffer locality
Row-hit requests are less costly than row-conflict requests
A DRAM system is fair if it slows down each thread equally
Compared to when the thread is run alone
Experienced Latency = EL(i) = Cumulative DRAM latency of thread i running with other threads
Ideal Latency = IL(i) = Cumulative DRAM latency of thread i if it were run alone on the same hardware resources
DRAM-Slowdown(i) = EL(i) / IL(i)
The goal of a fair scheduling policy is to equalize DRAM-Slowdown(i) for all threads i
Considers inherent DRAM performance of each thread
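As a tiny illustration with made-up numbers: a thread whose requests accumulate 1200ns of DRAM latency when co-scheduled, but would accumulate only 600ns running alone, has a DRAM slowdown of 2.0.

    /* DRAM-Slowdown(i) = EL(i) / IL(i); the numbers below are invented
       purely to illustrate the definition. */
    double dram_slowdown(double experienced_ns, double ideal_ns)
    {
        return experienced_ns / ideal_ns;
    }
    /* dram_slowdown(1200.0, 600.0) == 2.0 */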
Our Solution: Fair Memory Scheduling
During each time interval (a configurable number of cycles), for each thread, the DRAM controller
Tracks EL (Experienced Latency)
Estimates IL (Ideal Latency)
At the beginning of a scheduling cycle, DRAM controller
Computes Slowdown = EL/IL for each thread
Computes unfairness index = MAX Slowdown / MIN Slowdown
If unfairness < the maximum tolerable unfairness threshold
Use baseline scheduling policy to maximize DRAM throughput
(1) row-hit first
(2) oldest-first
If unfairness ≥ the maximum tolerable unfairness threshold
Use fairness-oriented scheduling policy
(1) requests from thread with MAX Slowdown first (if any)
(2) row-hit first
(3) oldest-first
Maximizes DRAM throughput when fairness cannot be improved
Two parameters are set by system software and conveyed to the hardware via privileged instructions:
One sets the maximum tolerable unfairness
The other is a lever between short-term and long-term fairness
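A minimal sketch of this two-mode decision, assuming the per-thread EL and IL counters already exist; all names are illustrative, and max_tolerable_unfairness stands in for the threshold parameter set by system software:

    struct sched_request {
        int  thread;    /* issuing thread                 */
        int  row;       /* target DRAM row                */
        long arrival;   /* arrival time, for oldest-first */
        int  valid;
    };

    /* Row-hit first, then oldest first, optionally restricted to one
       thread (pass only_thread = -1 to consider every thread). */
    static int pick(const struct sched_request *buf, int n,
                    int open_row, int only_thread)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!buf[i].valid ||
                (only_thread >= 0 && buf[i].thread != only_thread))
                continue;
            if (best < 0) { best = i; continue; }
            int i_hit = (buf[i].row == open_row);
            int b_hit = (buf[best].row == open_row);
            if (i_hit != b_hit) { if (i_hit) best = i; }
            else if (buf[i].arrival < buf[best].arrival) best = i;
        }
        return best;
    }

    /* Pick the next request to service: baseline FR-FCFS while the
       unfairness index stays below the threshold, otherwise prioritize
       the thread with the largest slowdown. */
    int fair_pick(const struct sched_request *buf, int n, int open_row,
                  const double *el, const double *il, int nthreads,
                  double max_tolerable_unfairness)
    {
        double max_sd = 0.0, min_sd = 1e300;
        int slowest = -1;
        for (int t = 0; t < nthreads; t++) {
            double sd = el[t] / il[t];            /* DRAM-Slowdown(t) */
            if (sd > max_sd) { max_sd = sd; slowest = t; }
            if (sd < min_sd) min_sd = sd;
        }
        if (max_sd / min_sd < max_tolerable_unfairness)
            return pick(buf, n, open_row, -1);    /* baseline FR-FCFS */

        int r = pick(buf, n, open_row, slowest);  /* favor MAX-slowdown thread */
        return r >= 0 ? r : pick(buf, n, open_row, -1);
    }

If the most-slowed-down thread has no pending request, the sketch falls back to the baseline policy, matching the rule that DRAM throughput is maximized whenever fairness cannot be improved.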
[Figure: worked example of fair scheduling — the request buffer holds requests from threads T0 and T1 to different rows (e.g., T0: Row 0, T1: Row 5, T1: Row 111); the controller tracks each thread’s slowdown and the resulting unfairness index (e.g., 1.05) to decide which request to service next.]
Tracking Experienced Latency
Relatively easy
Update a per-thread latency counter as long as there is an outstanding memory request for a thread
Estimating Ideal Latency
More involved because thread is not running alone
Need to estimate inherent row buffer locality
Keep track of the row that would have been in the row buffer if thread were running alone
For each memory request, update a per-thread latency counter based on the “would-have-been” row buffer state
Efficient implementation possible – details in paper
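A rough sketch of that bookkeeping, reusing the 50ns/100ns row-hit/row-conflict latencies from earlier; the structure and names are illustrative, not the paper’s hardware design:

    #define ROW_HIT_NS      50
    #define ROW_CONFLICT_NS 100

    struct thread_state {
        int    shadow_row;   /* row that would be open if the thread ran alone */
        double el_ns;        /* experienced latency, measured directly         */
        double il_ns;        /* ideal (alone) latency, estimated               */
    };

    /* Called once per serviced memory request of this thread. The experienced
       latency is whatever the request actually took in the shared system; the
       ideal latency is charged as a hit or conflict against the thread's own
       "would-have-been" row buffer state. */
    void account_request(struct thread_state *t, int row, double measured_ns)
    {
        t->el_ns += measured_ns;
        t->il_ns += (row == t->shadow_row) ? ROW_HIT_NS : ROW_CONFLICT_NS;
        t->shadow_row = row;    /* this request would leave its row open */
    }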
Experimental Evaluation
Instruction-level performance simulation
Detailed x86 processor model based on Intel Pentium M
4 GHz processor, 128-entry instruction window
512 Kbyte per core private L2 caches
Detailed DRAM model
8 banks, 2Kbyte row buffer
Row-hit latency: 50ns (200 cycles)
Row-conflict latency: 100ns (400 cycles)
Benchmarks
Stream (MPH) and rdarray
SPEC CPU 2000 and Olden benchmark suites
MPH with non-intensive VPR:
Fairness improvement: 7.9X (unfairness index reduced to 1.11)
System throughput improvement: 4.5X
MPH with memory-intensive MCF:
Fairness improvement: 4.8X (unfairness index reduced to 1.08)
System throughput improvement: 1.9X
• Non-memory intensive applications suffer the most (VPR)
• Low row-buffer locality hurts (MCF)
• Fairness improvement: 6.6X
• System throughput improvement: 2.2X
• Fairness improvement: 10.1X
• System throughput improvement: 1.97X
• Vulnerability increases with row buffer size
• Also with memory latency (in paper)
Conclusions
Introduced the notion of memory performance attacks in shared DRAM memory systems
Unfair DRAM scheduling is the cause of the vulnerability
More severe problem in future many-core systems
We provide a novel definition of DRAM fairness
Threads should experience equal slowdowns
New DRAM scheduling algorithm enforces this definition
Effectively prevents memory performance attacks
Preventing attacks also improves system throughput!