Improving Memory Bank-Level Parallelism in the Presence of Prefetching
Chang Joo Lee
Veynu Narasiman
Onur Mutlu*
Yale N. Patt
Electrical and Computer Engineering
The University of Texas at Austin
* Electrical and Computer Engineering
Carnegie Mellon University
4/10/2020 1
• Crucial to high-performance computing
• Made of DRAM chips
• Multiple banks
→ Each bank can be accessed independently
[Figure: DRAM system with Bank 0 and Bank 1. Two requests in the DRAM request buffer, Req B0 (to Bank 0) and Req B1 (to Bank 1), are serviced by the DRAM controller in parallel: their bank access times overlap, and only the data transfers on the data bus are serialized → DRAM throughput increased]
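A minimal sketch (not from the talk) of why independently accessible banks raise throughput: requests to different banks can overlap, so total service time is bounded by the busiest bank rather than by the request count. The latency constant and function names are illustrative.

```python
# Illustrative model: compare serialized vs. bank-overlapped service time.
BANK_LATENCY = 10  # arbitrary time units per bank access (assumption)

def total_service_time(requests, overlap_across_banks):
    """requests: list of bank ids, in arrival order."""
    if not overlap_across_banks:
        # Serialized: each request waits for the previous one to finish.
        return len(requests) * BANK_LATENCY
    # Overlapped: requests to different banks proceed in parallel,
    # so each bank only serializes its own requests.
    per_bank = {}
    for b in requests:
        per_bank[b] = per_bank.get(b, 0) + 1
    return max(per_bank.values()) * BANK_LATENCY

reqs = [0, 1]  # Req B0 to Bank 0, Req B1 to Bank 1
print(total_service_time(reqs, overlap_across_banks=False))  # 20
print(total_service_time(reqs, overlap_across_banks=True))   # 10
```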
• Out-of-order execution, prefetching, runahead execution, etc.
• Increase the number of outstanding memory requests on the chip
– Memory-Level Parallelism (MLP) [Glew’98]
• Hope: many requests will be serviced in parallel in the memory system
• Higher performance can be achieved when Bank-Level Parallelism (BLP) is exposed to the DRAM controller
• On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size
– Limit the BLP exposed to the DRAM controller
– E.g., requests to the same bank fill up the MSHRs
• In CMPs, memory requests from different cores are mixed together in the DRAM request buffers
– Destroys the BLP of each application running on the CMP
Request issue policies are critical to the BLP exploited by the DRAM controller
1. Maximize the BLP exposed from each core to the DRAM controller
→ Increase DRAM throughput for useful requests
BLP-Aware Prefetch Issue (BAPI): decides the order in which prefetches are sent from the prefetcher to the MSHRs
2. Preserve the BLP of each application in CMPs
→ Increase system performance
BLP-Preserving Multi-core Request Issue (BPMRI): decides the order in which memory requests are sent from each core to the DRAM request buffers
• BLP-Aware Prefetch Issue (BAPI)
• BLP-Preserving Multi-core Request Issue (BPMRI)
• Miss Status Holding Registers (MSHRs) are NOT large enough to handle many memory requests [Tuck, MICRO’06]
– MSHRs keep track of all outstanding misses for a core
→ Total number of demand/prefetch requests ≤ total number of MSHR entries
– Complex, latency-critical, and power-hungry
→ Not scalable
The request issue policy to the MSHRs affects the level of BLP exploited by the DRAM controller
[Figure: FIFO issue (as in Intel Core) vs. BLP-aware issue of prefetches into the MSHRs. With FIFO issue, the MSHRs fill up with requests to Bank 0 (Dem B0, Pref B0) while prefetches to Bank 1 (Pref B1) wait in the prefetch request buffer, so the two banks are serviced serially. BLP-aware issue sends a Bank 1 prefetch early, overlapping the two banks’ service times and shortening total DRAM service time. Increasing the number of requests ≠ high DRAM BLP; a simple issue policy improves DRAM BLP]
• Sends prefetches to the MSHRs based on the current BLP exposed in the memory system
– Sends the prefetch mapped to the least busy DRAM bank
• Adaptively limits the issue of prefetches based on prefetch accuracy estimation
– Low prefetch accuracy → fewer prefetches issued to the MSHRs
– High prefetch accuracy → maximize BLP
• FIFO prefetch request buffer per DRAM bank
– Stores the prefetches mapped to the corresponding DRAM bank
• MSHR occupancy counter per DRAM bank
– Keeps track of the number of outstanding requests to the corresponding DRAM bank
• Prefetch accuracy register
– Stores the estimated prefetch accuracy, updated periodically
1. Make the oldest prefetch to each bank valid only if the bank’s MSHR occupancy counter ≤ prefetch send threshold
2. Among valid prefetches, select the one to the bank with the minimum MSHR occupancy counter value
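The two-step selection above can be sketched as follows. This is a hedged software model, not the authors’ hardware: the per-bank FIFOs, occupancy dictionary, and function name are illustrative stand-ins for the structures on the previous slide.

```python
from collections import deque

def bapi_select(prefetch_fifos, mshr_occupancy, send_threshold):
    """Return the bank whose oldest prefetch should be issued, or None.

    prefetch_fifos: dict bank -> deque of prefetch addresses (oldest first)
    mshr_occupancy: dict bank -> number of outstanding requests to that bank
    """
    # Step 1: the oldest prefetch to a bank is valid only if that bank's
    # MSHR occupancy counter is at or below the prefetch send threshold.
    valid_banks = [b for b, fifo in prefetch_fifos.items()
                   if fifo and mshr_occupancy[b] <= send_threshold]
    if not valid_banks:
        return None  # no prefetch may be issued this cycle
    # Step 2: among valid prefetches, pick the bank with the minimum
    # MSHR occupancy counter, i.e., the least busy bank.
    return min(valid_banks, key=lambda b: mshr_occupancy[b])

fifos = {0: deque([0x100]), 1: deque([0x200])}
occupancy = {0: 3, 1: 1}
print(bapi_select(fifos, occupancy, send_threshold=7))  # 1 (least busy bank)
```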
• Prefetch Send Threshold
– Reserves MSHR entries for prefetches to different banks
– Adjusted based on prefetch accuracy
• Low prefetch accuracy → low prefetch send threshold
• High prefetch accuracy → high prefetch send threshold
• BLP-Aware Prefetch Issue (BAPI)
• BLP-Preserving Multi-core Request Issue (BPMRI)
• DRAM request buffers are shared by multiple cores
– To exploit the BLP of a core, that BLP must be exposed to the DRAM request buffers
– The BLP potential of a core can be destroyed by interference from other cores’ requests
The request issue policy from each core to the DRAM request buffers affects the BLP of each application
[Figure: Round-robin vs. BLP-preserving issue of requests from Core A and Core B into the DRAM request buffers. Round-robin issue interleaves Req A0/Req A1 with Req B0/Req B1, so each core’s requests queue behind the other core’s in the banks and are serviced serially: both cores stall for the full service time. BLP-preserving issue sends Core A’s requests (Req A0, Req A1) consecutively so they are serviced in parallel across Bank 0 and Bank 1, saving stall cycles for Core A at the cost of a small increase for Core B. The issue policy should preserve DRAM BLP]
• Consecutively sends requests from one core to the DRAM request buffers
• Limits the maximum number of consecutive requests sent from one core
– Prevents starvation of memory non-intensive applications
• Prioritizes memory non-intensive applications
– Impact of delaying requests from a memory non-intensive application > impact of delaying requests from a memory-intensive application
• Last-level (L2) cache miss counter per core
– Stores the number of L2 cache misses from the core
• Rank register per core
– Fewer L2 cache misses → higher rank
– More L2 cache misses → lower rank
Every request issue cycle:
if (number of consecutive requests from the selected core ≥ request send threshold)
selected core ← highest-ranked core
issue the oldest request from the selected core
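The per-cycle decision can be sketched as below. This is an illustrative software model (the state dictionary, core names, and function signature are assumptions): cores with fewer L2 misses rank higher, so memory non-intensive cores are prioritized when the issuer switches.

```python
def bpmri_issue_cycle(state, l2_misses, request_queues, send_threshold):
    """One request issue cycle of a BPMRI-style issuer.

    state: dict with the currently 'selected' core and its 'consecutive'
           issue count; l2_misses: dict core -> L2 miss count;
    request_queues: dict core -> list of requests (oldest first).
    """
    if state["consecutive"] >= send_threshold:
        # Switch to the highest-ranked core: fewer L2 misses -> higher rank.
        ranked = sorted(request_queues, key=lambda c: l2_misses[c])
        state["selected"] = ranked[0]
        state["consecutive"] = 0
    core = state["selected"]
    if request_queues[core]:
        req = request_queues[core].pop(0)  # oldest request from that core
        state["consecutive"] += 1
        return core, req
    return core, None  # selected core has nothing to issue this cycle

queues = {"A": ["A0", "A1"], "B": ["B0"]}
misses = {"A": 100, "B": 2}  # Core B is memory non-intensive
state = {"selected": "A", "consecutive": 0}
for _ in range(3):
    print(bpmri_issue_cycle(state, misses, queues, send_threshold=2))
# Core A issues A0 and A1 back to back, then the threshold is reached
# and the issuer switches to the higher-ranked Core B.
```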
• x86 cycle accurate simulator
• Baseline processor configuration
– Per core
• 4-wide issue, out-of-order, 128-entry ROB
• Stream prefetcher (prefetch degree: 4, prefetch distance: 64)
• 32-entry MSHRs
• 512KB 8-way L2 cache
– Shared
• On-chip, demand-first FR-FCFS memory controller(s)
• 1, 2, 4 DRAM channels for 1, 4, 8-core systems
• 64, 128, 512-entry DRAM request buffers for 1, 4 and 8-core systems
• DDR3 1600 DRAM, 15-15-15ns, 8KB row buffer
• Workloads
– 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for the single-core system
– 30 and 15 multiprogrammed SPEC 2000/2006 workloads for the 4- and 8-core CMPs, chosen pseudo-randomly
• BAPI’s prefetch send threshold:
Prefetch accuracy (%):  0~40   40~85   85~100
Threshold:              1      7       27
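The accuracy-to-threshold mapping can be written as a simple lookup; this is a sketch, and the handling of the exact boundary values (40% and 85%) is an assumption not stated on the slide.

```python
def prefetch_send_threshold(accuracy_pct):
    """Map estimated prefetch accuracy (0-100%) to BAPI's send threshold."""
    if accuracy_pct < 40:
        return 1    # low accuracy: issue few prefetches per bank
    if accuracy_pct < 85:
        return 7    # medium accuracy
    return 27       # high accuracy: maximize BLP

print(prefetch_send_threshold(30))  # 1
print(prefetch_send_threshold(60))  # 7
print(prefetch_send_threshold(90))  # 27
```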
• BPMRI’s request send threshold: 10
• Prefetch accuracy estimation and rank decision are made every 100K cycles
[Figure: Performance normalized to the no-prefetching baseline for 1-core, 4-core, and 8-core systems (bars: No pref, Pref, and BAPI / BLP-aware). The BLP-aware issue policies improve performance over the prefetching baseline by 8.5% (1-core), 13.8% (4-core), and 13.6% (8-core)]
Hardware cost:
          Cost (bits)
BAPI         94,368
BPMRI            72
Total        94,440
• Total storage: 94,440 bits (~11.5KB)
– 0.6% of L2 cache data storage
• Logic is not on the critical path
– The issue decision can be made more slowly than the processor clock
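A quick arithmetic check of the storage claim. Treating the 0.6% figure as relative to the 4-core configuration’s combined per-core L2 data storage (4 × 512KB) is an assumption; the slide does not say which system it refers to.

```python
# Verify the storage-cost figures: BAPI + BPMRI bits, KB, and % of L2.
total_bits = 94_368 + 72          # BAPI cost + BPMRI cost
print(total_bits)                 # 94440

kib = total_bits / 8 / 1024       # bits -> KiB
print(round(kib, 1))              # 11.5

l2_bits = 4 * 512 * 1024 * 8      # assumed: 4 cores x 512KB L2 data storage
print(round(100 * total_bits / l2_bits, 1))  # 0.6
```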
• Uncontrolled memory request issue policies limit the level of BLP exploited by the DRAM controller
• BLP-Aware Prefetch Issue (BAPI)
– Increases the BLP of useful requests from each core exposed to the DRAM controller
• BLP-Preserving Multi-core Request Issue (BPMRI)
– Ensures that requests from the same core can be serviced in parallel by the DRAM controller
• Simple, low storage cost
• Significantly improves DRAM throughput and performance for both single- and multi-core systems
• Applicable to other memory technologies