
Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research
Multi-Core Systems

[Figure: A multi-core chip with CORE 0 through CORE 3, each with a private L2 cache, connected to a shared DRAM memory controller and DRAM Banks 0 through 7: the shared DRAM memory system, where unfairness arises.]
DRAM Bank Operation

[Figure: DRAM bank operation. The row decoder loads the row named in the access address into the row buffer; the column decoder then selects the requested column and returns the data. An access to the row already in the row buffer is a row HIT; an access to a different row (or to an empty row buffer) is a row-buffer CONFLICT, which requires closing the open row and activating the new one first.]
DRAM Controllers

- A row-conflict memory access takes significantly longer than a row-hit access.
- Current controllers take advantage of the row buffer.
- Commonly used scheduling policy: FR-FCFS [Rixner et al., ISCA'00]
  - (1) Row-hit (column) first: service row-hit memory accesses first
  - (2) Oldest-first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput.
- But it is unfair when multiple threads share the DRAM system.
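To make the two-rule policy concrete, here is a minimal sketch of FR-FCFS request selection in Python. The Request fields and the open-row bookkeeping are simplified assumptions for illustration, not the actual controller interface:

```python
# Minimal sketch of FR-FCFS request selection (illustrative model,
# not the real controller logic).
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int  # cycle at which the request entered the buffer

def fr_fcfs_pick(requests, open_row):
    """Pick the next request: (1) row-hit first, (2) oldest-first.

    `open_row` maps bank -> row currently held in that bank's row buffer.
    """
    def priority(req):
        row_hit = open_row.get(req.bank) == req.row
        # Sort key: hits (0) before misses (1), then older requests first.
        return (0 if row_hit else 1, req.arrival_time)
    return min(requests, key=priority)
```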
Outline

- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling
  - Fairness definition
  - Algorithm
  - Implementation
  - System software support
- Experimental Evaluation
- Conclusions
The Problem

- Multiple threads share the DRAM controller.
- DRAM controllers are designed to maximize DRAM throughput.
- DRAM scheduling policies are thread-unaware and unfair:
  - Row-hit first: unfairly prioritizes threads with high row-buffer locality (streaming threads that keep accessing the same row)
  - Oldest-first: unfairly prioritizes memory-intensive threads
[Figure: Request-buffer example. T0 is a streaming thread that keeps requesting Row 0; T1 is a non-streaming thread requesting Rows 5, 111, and 16. With an 8KB row and 64B cache blocks, up to 128 row-hit requests of T0 are serviced before a single request of T1.]
Consequences of Unfairness in DRAM

[Figure: Memory slowdowns of four threads for which DRAM is the only shared resource: 7.74, 4.72, 1.85, and 1.05.]

- Vulnerability to denial of service [Moscibroda & Mutlu, USENIX Security'07]
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability
Fairness in Shared DRAM Systems

- A thread's DRAM performance is dependent on its inherent
  - Row-buffer locality
  - Bank parallelism
- Interference between threads can destroy either or both.
- A fair DRAM scheduler should take into account all factors affecting each thread's DRAM performance
  - Not solely bandwidth or solely request latency
- Observation: A thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads.
Stall-Time Fairness in Shared DRAM Systems

- A DRAM system is fair if it slows down equal-priority threads equally
  - Compared to when each thread is run alone on the same system
  - Fairness notion similar to SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, MICRO'06], and shared caches [Kim, PACT'04]
- T_shared: DRAM-related stall-time when the thread is running with other threads
- T_alone: DRAM-related stall-time when the thread is running alone
- Memory-slowdown = T_shared / T_alone
- The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize memory-slowdown for all threads, without sacrificing performance
  - Considers the inherent DRAM performance of each thread
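The definitions above reduce to simple arithmetic; here is a small worked example with hypothetical stall-time values:

```python
# Memory-slowdown and unfairness from the definitions above,
# using hypothetical stall times (in cycles).
t_shared = {"T0": 1_100_000, "T1": 1_400_000}  # stall time when sharing
t_alone  = {"T0": 1_000_000, "T1": 1_000_000}  # stall time when alone

slowdown = {t: t_shared[t] / t_alone[t] for t in t_shared}
unfairness = max(slowdown.values()) / min(slowdown.values())
print(slowdown)              # {'T0': 1.1, 'T1': 1.4}
print(round(unfairness, 2))  # 1.27 -> T1 is slowed down 1.27x more than T0
```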
STFM Scheduling Algorithm (1)

- During each time interval, for each thread, the DRAM controller
  - Tracks T_shared
  - Estimates T_alone
- At the beginning of a scheduling cycle, the DRAM controller
  - Computes Slowdown = T_shared / T_alone for each thread with an outstanding legal request
  - Computes unfairness = MAX Slowdown / MIN Slowdown
- If unfairness < α:
  - Use the DRAM-throughput-oriented baseline scheduling policy: (1) row-hit first, (2) oldest-first
STFM Scheduling Algorithm (2)

- If unfairness ≥ α:
  - Use the fairness-oriented scheduling policy: (1) requests from the thread with MAX Slowdown first, (2) row-hit first, (3) oldest-first
- Maximizes DRAM throughput if it cannot improve fairness.
- Does NOT waste useful bandwidth to improve fairness:
  - If a request does not interfere with any other, it is scheduled.
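Putting the two slides together, here is a sketch of the per-cycle STFM decision, reusing the Request type and fr_fcfs_pick from the earlier FR-FCFS sketch. This is a simplified software model: the real controller uses fixed-point arithmetic, and the bandwidth-preservation rule (immediately scheduling non-interfering requests) is not modeled here:

```python
def stfm_pick(requests, open_row, slowdown, alpha):
    """One STFM scheduling decision (simplified model).

    `slowdown` maps thread_id -> T_shared / T_alone estimate;
    `alpha` is the maximum tolerable unfairness threshold.
    """
    unfairness = max(slowdown.values()) / min(slowdown.values())
    if unfairness < alpha:
        # Throughput-oriented baseline: row-hit first, then oldest-first.
        return fr_fcfs_pick(requests, open_row)
    # Fairness-oriented policy: (1) requests from the thread with the
    # maximum slowdown first, (2) row-hit first, (3) oldest-first.
    slowest = max(slowdown, key=slowdown.get)
    def priority(req):
        row_hit = open_row.get(req.bank) == req.row
        return (0 if req.thread_id == slowest else 1,
                0 if row_hit else 1,
                req.arrival_time)
    return min(requests, key=priority)
```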
How Does STFM Prevent Unfairness?

[Figure: Worked example. T0 streams to Row 0 while T1 requests Rows 5, 111, and 16. As requests are serviced, the controller updates each thread's slowdown (T0: 1.00 up to 1.10; T1: 1.00 up to 1.14) and the resulting unfairness. Whenever unfairness would exceed the threshold of 1.05, STFM services T1 (the thread with the larger slowdown) instead of another row-hit from T0, keeping unfairness bounded.]
Implementation

- Tracking T_shared
  - Relatively easy
  - The processor increments a counter if the thread cannot commit instructions because the oldest instruction requires DRAM access
- Estimating T_alone
  - More involved, because the thread is not running alone
  - Difficult to estimate directly
  - Observation: T_alone = T_shared - T_interference
  - Estimate T_interference: the extra stall-time due to interference
Estimating T_interference (1)

- When a DRAM request from thread C is scheduled, thread C itself can incur extra stall-time:
  - The request's row-buffer hit status might be affected by interference
  - Estimate the row that would have been in the row buffer if the thread were running alone
  - Estimate the extra bank access latency the request incurs:

    T_interference(C) += Extra Bank Access Latency / (# Banks Servicing C's Requests)

  - The extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism)

Estimating T_interference (2)

- When a DRAM request from thread C is scheduled, any other thread C' with outstanding requests incurs extra stall-time:
  - Interference in the DRAM data bus:

    T_interference(C') += Bus Transfer Latency of Scheduled Request

  - Interference in the DRAM bank (see paper):

    T_interference(C') += Bank Access Latency of Scheduled Request / (# Banks Needed by C' Requests × K)
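A toy version of this bookkeeping is sketched below. The latencies, the bank-parallelism counts, and the constant K are parameters of the real design; treat the structure, not the values, as the point:

```python
def update_interference(scheduled, t_interference,
                        extra_bank_latency, bus_latency, bank_latency,
                        banks_servicing, banks_needed, K=2):
    """Update per-thread T_interference estimates (toy model) when a
    request from thread `scheduled` is issued. All threads present in
    `t_interference` are assumed to have outstanding requests."""
    # (1) The scheduled thread's own extra bank-access latency,
    # amortized across the banks servicing its requests (MLP).
    t_interference[scheduled] += (
        extra_bank_latency / max(1, banks_servicing[scheduled]))
    for other in t_interference:
        if other == scheduled:
            continue
        # (2) Data-bus interference: other threads wait out the bus
        # transfer of the scheduled request.
        t_interference[other] += bus_latency
        # (3) Bank interference, damped by the other thread's bank
        # parallelism and the scaling constant K (see paper).
        t_interference[other] += (
            bank_latency / (max(1, banks_needed[other]) * K))
```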
Hardware Cost

- <2KB storage cost for an 8-core system with a 128-entry memory request buffer
- Arithmetic operations approximated
  - Fixed-point arithmetic
  - Divisions using lookup tables
- Not on the critical path
  - The scheduler makes a decision only once every DRAM cycle
- More details in the paper
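As an illustration of the "divisions using lookup tables" point: hardware can precompute reciprocals at reduced precision and turn each division into a table lookup plus a multiply. This is a generic sketch of that trick, not the circuit from the paper:

```python
# Reciprocal lookup table: RECIP_TABLE[d] ~= (1/d) in 16-bit fixed point.
RECIP_BITS = 8   # supports denominators 1..255
FRAC_BITS = 16   # fixed-point fraction bits
RECIP_TABLE = [0] + [round((1 << FRAC_BITS) / d)
                     for d in range(1, 1 << RECIP_BITS)]

def approx_div_fx(numerator, denominator):
    """Approximate numerator/denominator as a 16.16 fixed-point value,
    using one table lookup and one multiply instead of a divider."""
    assert 0 < denominator < (1 << RECIP_BITS)
    return numerator * RECIP_TABLE[denominator]

# Example: 14/11 ~= 1.27, computed in fixed point.
q = approx_div_fx(14, 11)
print(q / (1 << FRAC_BITS))  # ~1.2727
```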
Support for System Software

- Supporting system-level thread weights/priorities
  - Thread weights are communicated to the memory controller
  - Larger-weight threads should be slowed down less
  - Each thread's slowdown is scaled by its weight
  - The weighted slowdown is used for scheduling
    - Favors threads with larger weights
  - The OS can choose thread weights to satisfy QoS requirements
- α: maximum tolerable unfairness, set by system software
  - Don't need fairness? Set α large.
  - Need strict fairness? Set α close to 1.
  - Other values of α trade off fairness and throughput.
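The slides say each thread's slowdown is "scaled by its weight"; one plausible reading, shown below as an assumption (the paper defines the exact scaling), is to multiply the measured slowdown by the weight, so a larger-weight thread appears more slowed down and is therefore prioritized sooner:

```python
# Hypothetical weighted-slowdown scaling (assumption for illustration;
# see the paper for the exact formula). A larger weight inflates the
# slowdown the scheduler sees, so the fairness rule kicks in earlier
# for that thread and its *real* slowdown ends up smaller.
def weighted_slowdown(slowdown, weight):
    return slowdown * weight

slowdowns = {"T0": 1.10, "T1": 1.10}
weights   = {"T0": 2.0,  "T1": 1.0}   # T0 carries a higher OS priority
scaled = {t: weighted_slowdown(slowdowns[t], weights[t]) for t in slowdowns}
print(scaled)  # {'T0': 2.2, 'T1': 1.1} -> T0 treated as most slowed down
```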
Evaluation Methodology

- 2-, 4-, 8-, and 16-core systems
  - x86 processor model based on Intel Pentium M
  - 4 GHz processor, 128-entry instruction window
  - 512 KB private L2 cache per core
- Detailed DRAM model based on Micron DDR2-800
  - 128-entry memory request buffer
  - 8 banks, 2 KB row buffer
  - Row-hit round-trip latency: 35 ns (140 cycles)
  - Row-conflict latency: 70 ns (280 cycles)
- Benchmarks
  - SPEC CPU2006 and some Windows desktop applications
  - 256, 32, and 3 benchmark combinations for the 4-, 8-, and 16-core experiments
Comparison with Related Work

- Baseline FR-FCFS [Rixner et al., ISCA'00]
  - Unfairly penalizes non-intensive threads with low row-buffer locality
- FR-FCFS+Cap
  - Static cap on how many younger row-hits can bypass older accesses
  - Unfairly penalizes non-intensive threads
- FCFS
  - Low DRAM throughput
  - Unfairly penalizes non-intensive threads
- Network Fair Queueing (NFQ) [Nesbit et al., MICRO'06]
  - Per-thread virtual-time-based scheduling
    - A thread's private virtual time increases when its request is scheduled
    - Prioritizes requests from the thread with the earliest virtual time
  - Equalizes bandwidth across equal-priority threads
    - Does not consider the inherent performance of each thread
  - Unfairly prioritizes threads with non-bursty access patterns (idleness problem)
  - Unfairly penalizes threads with unbalanced bank usage (in paper)
Idleness/Burstiness Problem in Fair Queueing

[Figure: Timeline example. Only Thread 1's virtual time increases in intervals [t1,t2] and [t2,t3], even though no other thread needs DRAM. Threads 2, 3, and 4 are then serviced in interval [t3,t4], since their virtual times are smaller than Thread 1's.]

- The non-bursty thread suffers a large performance loss, even though it fairly utilized DRAM when no other thread needed it.
Unfairness on 4-, 8-, 16-Core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

[Figure: Unfairness of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems (y-axis from 1 to 6). STFM reduces unfairness by 1.26X, 1.27X, and 1.81X on the 4-, 8-, and 16-core systems, respectively.]
System Performance

[Figure: Normalized weighted speedup of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM. STFM improves system performance by 5.8%, 4.1%, and 4.6% on the 4-, 8-, and 16-core systems, respectively.]
Hmean-Speedup (Throughput-Fairness Balance)

[Figure: Normalized harmonic-mean speedup of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM. STFM improves the throughput-fairness balance by 10.8%, 9.5%, and 11.2% on the 4-, 8-, and 16-core systems, respectively.]
Conclusions

- A new definition of DRAM fairness: stall-time fairness
  - Equal-priority threads should experience equal memory-related slowdowns
  - Takes into account the inherent memory performance of threads
- A new DRAM scheduling algorithm enforces this definition
  - Flexible and configurable fairness substrate
  - Supports system-level thread priorities/weights → QoS policies
- Results across a wide range of workloads and systems show:
  - Improving DRAM fairness also improves system throughput
  - STFM provides better fairness and system performance than previously proposed DRAM schedulers
Thank you. Questions?

Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research

Backup

Structure of the STFM Controller

[Figure: Block diagram of the STFM memory controller.]
Comparison using NFQ QoS Metrics

- Nesbit et al. [MICRO'06] proposed the following quality-of-service target:
  - A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
  - Baseline: the same thread with memory bandwidth scaled down by N
- We compared different DRAM schedulers' effectiveness using this metric:
  - Number of violations of the above QoS target
  - Harmonic mean of IPC normalized to the above baseline
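With hypothetical IPC measurements, checking the target reduces to a single comparison per thread; a sketch of how violations could be counted:

```python
# Counting NFQ QoS-target violations (illustrative; the IPC values are
# hypothetical measurements, not data from the paper).
def violates_qos(ipc_shared, ipc_private_1_over_n):
    """True if the thread runs slower with 1/Nth of the shared bandwidth
    than on a private memory system at 1/Nth of the frequency."""
    return ipc_shared < ipc_private_1_over_n

measurements = [(0.52, 0.50), (0.41, 0.45), (0.80, 0.66)]
violations = sum(violates_qos(s, p) for s, p in measurements)
print(violations)  # 1
```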
Violations of the NFQ QoS Target

[Figure: Percentage of workloads where the QoS objective is not satisfied (y-axis from 0% to 60%), for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems.]
Hmean of Normalized IPC (using NFQ's Baseline)

[Figure: Harmonic mean of IPC normalized to the NFQ baseline, for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. Annotated STFM improvements range from 5.1% to 10.3%.]
Shortcomings of the NFQ QoS Target

- Low baseline (easily achievable target) for equal-priority threads
  - With N equal-priority threads, a thread should do better than on a system with 1/Nth of the memory bandwidth
  - This target is usually very easy to achieve, especially when N is large
- Unachievable target in some cases
  - Consider two threads always accessing the same bank in an interleaved fashion → too much interference
- Baseline performance is very difficult to determine in a real system
  - Cannot scale memory frequency arbitrarily
  - Not knowing the baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)
A Case Study

[Figure: Normalized memory stall time (memory slowdown) of mcf, libquantum, GemsFDTD, and astar under FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM (y-axis from 0 to 8). Unfairness: 7.28 (FR-FCFS), 2.07 (FCFS), 2.08 (FR-FCFS+Cap), 1.87 (NFQ), 1.27 (STFM).]
Windows Desktop Workloads

Enforcing Thread Weights

Effect of α

Effect of Banks and Row Buffer Size