Improving Memory Bank-Level Parallelism in the Presence of Prefetching

Chang Joo Lee

Veynu Narasiman

Onur Mutlu*

Yale N. Patt

Electrical and Computer Engineering

The University of Texas at Austin

* Electrical and Computer Engineering

Carnegie Mellon University

4/10/2020 1

Main Memory System

• Crucial to high performance computing

• Made of DRAM chips

• Multiple banks

→ Each bank can be accessed independently

Memory Bank-Level Parallelism (BLP)

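[Figure: A DRAM system with two banks (DRAM bank 0 and DRAM bank 1). The DRAM controller's DRAM request buffer holds Req B0 and Req B1, ordered by age. A timeline shows the two bank accesses overlapping in time, with the data for Req B0 and the data for Req B1 returned one after the other on the data bus.]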

DRAM throughput increased
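
The overlap described above can be illustrated with a toy timing model. This is only a sketch: the latencies (30 cycles per bank access, 4 cycles per data-bus transfer) and the function name `service_time` are assumptions made for illustration and do not come from the slides.

```python
def service_time(requests, bank_latency=30, bus_latency=4):
    """Return the cycle at which the last data transfer finishes.

    requests: list of bank indices, oldest request first.
    Assumed model: a bank serves one access at a time and stays busy
    until its data has been transferred; all banks share one data bus.
    """
    bank_free = {}  # bank index -> cycle at which the bank is free again
    bus_free = 0    # cycle at which the shared data bus is free again
    for bank in requests:
        start = bank_free.get(bank, 0)
        data_ready = start + bank_latency        # bank access completes
        transfer_start = max(data_ready, bus_free)
        bus_free = transfer_start + bus_latency  # transfer occupies the bus
        bank_free[bank] = bus_free
    return bus_free


# Two requests to the same bank are serialized; two requests to
# different banks overlap their bank accesses.
print(service_time([0, 0]))  # same bank
print(service_time([0, 1]))  # different banks
```

Under these assumed latencies the same-bank pair takes 68 cycles and the different-bank pair takes 38 cycles, because only the short data-bus transfers are serialized in the second case.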

Memory Latency-Tolerance Mechanisms

• Out-of-order execution, prefetching, runahead execution, etc.

• Increase outstanding memory requests on the chip

– Memory-Level Parallelism (MLP) [Glew’98]

• Hope many requests will be serviced in parallel in the memory system

• Higher performance can be achieved when BLP is exposed to the DRAM controller

Problems

• On-chip buffers, e.g., Miss Status Holding Registers (MSHRs), are limited in size

– Limits the BLP exposed to the DRAM controller

– E.g., requests to the same bank fill up MSHRs

• In CMPs, memory requests from different cores are mixed together in DRAM request buffers

– Destroys the BLP of each application running on CMPs

Request issue policies are critical to the BLP exploited by the DRAM controller

1. Maximize the BLP exposed from each core to the DRAM controller

→ Increase DRAM throughput for useful requests

BLP-Aware Prefetch Issue (BAPI):

Decides the order in which prefetches are sent from prefetcher to MSHRs

2. Preserve the BLP of each application in CMPs

→ Increase system performance

BLP-Preserving Multi-core Request Issue (BPMRI):

Decides the order in which memory requests are sent from each core to DRAM request buffers

DRAM BLP-Aware Request Issue Policies

• BLP-Aware Prefetch Issue (BAPI)

• BLP-Preserving Multi-core Request Issue (BPMRI)

What Can Limit DRAM BLP?

• Miss Status Holding Registers (MSHRs) are NOT large enough to handle many memory requests [Tuck, MICRO’06]

– MSHRs keep track of all outstanding misses for a core

→ Total number of demand/prefetch requests ≤ total number of MSHR entries

– Complex, latency-critical, and power-hungry

→ Not scalable

Request issue policy to MSHRs affects the level of BLP exploited by DRAM controller

What Can Limit DRAM BLP?

[Figure: A core's prefetch request buffer (ordered by age) feeds MSHRs that are full, which in turn feed the DRAM request buffers for Bank 0 and Bank 1. The requests shown are one demand (Dem B0) and prefetches to both banks (Pref B0, Pref B1, Pref B1). Two timelines, FIFO (Intel Core) and BLP-aware, each mark the overlapped time and the DRAM service time for the two banks.]

Increasing the number of requests ≠ high DRAM BLP

Simple issue policy improves DRAM BLP

BLP-Aware Prefetch Issue (BAPI)

• Sends prefetches to MSHRs based on current BLP exposed in the memory system

– Sends a prefetch mapped to the least busy DRAM bank

• Adaptively limits the issue of prefetches based on prefetch accuracy estimation

– Low prefetch accuracy

→ Fewer prefetches issued to MSHRs

– High prefetch accuracy

→ Maximize BLP

Implementation of BAPI

• FIFO prefetch request buffer per DRAM bank

– Stores prefetches mapped to the corresponding DRAM bank

• MSHR occupancy counter per DRAM bank

– Keeps track of the number of outstanding requests to the corresponding DRAM bank

• Prefetch accuracy register

– Stores the estimated prefetch accuracy periodically

BAPI Policy

Every prefetch issue cycle

1. Make the oldest prefetch to each bank valid only if the bank’s MSHR occupancy counter ≤ prefetch send threshold

2. Among valid prefetches, select the request to the bank with minimum MSHR occupancy counter value
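
The two steps above can be sketched in software as follows. This is a minimal illustration, not the hardware described in the talk: the function names, the use of Python lists and deques for the per-bank FIFOs and counters, and the tie-break (lowest bank index wins) are all assumptions.

```python
from collections import deque


def select_prefetch_bank(prefetch_queues, mshr_occupancy, send_threshold):
    """Return the bank whose oldest prefetch should be sent, or None.

    prefetch_queues: one FIFO (deque, oldest at the left) per DRAM bank.
    mshr_occupancy:  outstanding requests per DRAM bank.
    """
    # Step 1: the oldest prefetch to a bank is valid only if that bank's
    # MSHR occupancy counter is <= the prefetch send threshold.
    valid = [
        bank
        for bank, queue in enumerate(prefetch_queues)
        if queue and mshr_occupancy[bank] <= send_threshold
    ]
    if not valid:
        return None
    # Step 2: among valid prefetches, pick the bank with the minimum
    # occupancy. Ties go to the lowest bank index (an assumption).
    return min(valid, key=lambda bank: mshr_occupancy[bank])


def issue_prefetch(prefetch_queues, mshr_occupancy, send_threshold):
    """Send one prefetch to the MSHRs and update the occupancy counter."""
    bank = select_prefetch_bank(prefetch_queues, mshr_occupancy, send_threshold)
    if bank is None:
        return None
    request = prefetch_queues[bank].popleft()
    mshr_occupancy[bank] += 1
    return bank, request


queues = [deque(["pref_b0_x", "pref_b0_y"]), deque(["pref_b1_x"])]
occupancy = [2, 0]
print(issue_prefetch(queues, occupancy, send_threshold=1))
print(issue_prefetch(queues, occupancy, send_threshold=1))
```

In the usage lines, bank 0 already holds two outstanding requests, so with a threshold of 1 only the prefetch to bank 1 is valid; once it is sent, no valid prefetch remains and the second call returns `None`.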

Adaptivity of BAPI

• Prefetch Send Threshold

– Reserves MSHR entries for prefetches to different banks

– Adjusted based on prefetch accuracy

• Low prefetch accuracy → low prefetch send threshold

• High prefetch accuracy → high prefetch send threshold

DRAM BLP-Aware Request Issue Policies

• BLP-Aware Prefetch Issue (BAPI)

• BLP-Preserving Multi-core Request Issue (BPMRI)

BLP Destruction in CMP Systems

• DRAM request buffers are shared by multiple cores

– To exploit the BLP of a core, the BLP should be exposed to DRAM request buffers

– BLP potential of a core can be destroyed by the interference from other cores’ requests

Request issue policy from each core to DRAM request buffers affects BLP of each application

Why is DRAM BLP Destroyed?

[Figure: Core A (Req A0, Req A1) and Core B (Req B0, Req B1) send requests through a request issuer to the DRAM request buffers for Bank 0 and Bank 1 in the DRAM controller. Two timelines, Round-robin and BLP-Preserving, show the bank service order and the stall time of each core. The figure marks saved cycles for Core A and increased cycles for Core B.]

Issue policy should preserve DRAM BLP

BLP-Preserving Multi-Core Request Issue (BPMRI)

• Consecutively sends requests from one core to DRAM request buffers

• Limits the maximum number of consecutive requests sent from one core

– Prevents starvation of memory non-intensive applications

• Prioritizes memory non-intensive applications

– Impact of delaying requests from a memory non-intensive application > impact of delaying requests from a memory-intensive application

Implementation of BPMRI

• Last-level (L2) cache miss counter per core

– Stores the number of L2 cache misses from the core

• Rank register per core

– Fewer L2 cache misses → higher rank

– More L2 cache misses → lower rank
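
As a small illustration of the ranking rule above, the sketch below orders cores by their L2 miss counters. The function name `rank_cores` and the tie-break by core id are assumptions; the slides only state that fewer misses give a higher rank.

```python
def rank_cores(l2_miss_counts):
    """Return core ids ordered from highest rank to lowest rank.

    l2_miss_counts[i] is the L2 cache miss counter of core i.
    Fewer misses -> higher rank. Equal counts are ordered by core id,
    which is an assumption made for this sketch.
    """
    return sorted(range(len(l2_miss_counts)), key=lambda core: l2_miss_counts[core])


# Core 1 and core 3 miss least, so they rank highest.
print(rank_cores([500, 20, 90, 20]))  # [1, 3, 2, 0]
```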

BPMRI Policy

Every request issue cycle

If consecutive requests from selected core ≥ request send threshold then
    selected core ← highest ranked core
Issue oldest request from selected core
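
The policy above can be sketched as a small Python class. This is an illustration under stated assumptions, not the hardware from the talk: the class name `BPMRIIssuer` is hypothetical, "highest ranked core" is taken to mean the highest ranked core that has a pending request, and the issuer also reselects when the selected core has nothing left to send.

```python
from collections import deque


class BPMRIIssuer:
    def __init__(self, ranked_cores, send_threshold=10):
        self.ranked_cores = ranked_cores    # core ids, highest rank first
        self.send_threshold = send_threshold
        self.selected = None                # core currently being served
        self.consecutive = 0                # requests sent in a row from it

    def issue(self, queues):
        """Issue one request. queues maps core id -> deque (oldest at left)."""
        must_reselect = (
            self.selected is None
            or self.consecutive >= self.send_threshold
            or not queues[self.selected]    # assumption: nothing left to send
        )
        if must_reselect:
            pending = [core for core in self.ranked_cores if queues[core]]
            if not pending:
                return None
            self.selected = pending[0]      # highest ranked core with a request
            self.consecutive = 0
        self.consecutive += 1
        return self.selected, queues[self.selected].popleft()


issuer = BPMRIIssuer(ranked_cores=[0, 1], send_threshold=2)
queues = {0: deque(["A0", "A1"]), 1: deque(["B0", "B1"])}
order = [issuer.issue(queues) for _ in range(4)]
print(order)  # [(0, 'A0'), (0, 'A1'), (1, 'B0'), (1, 'B1')]
```

Each core's requests reach the DRAM request buffers back to back rather than interleaved, which is the property the policy is meant to preserve.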

Simulation Methodology

• x86 cycle-accurate simulator

• Baseline processor configuration

– Per core

• 4-wide issue, out-of-order, 128-entry ROB

• Stream prefetcher (prefetch degree: 4, prefetch distance: 64)

• 32-entry MSHRs

• 512KB 8-way L2 cache

– Shared

• On-chip, demand-first FR-FCFS memory controller(s)

• 1, 2, 4 DRAM channels for 1, 4, 8-core systems

• 64, 128, 512-entry DRAM request buffers for 1, 4 and 8-core systems

• DDR3 1600 DRAM, 15-15-15ns, 8KB row buffer

Simulation Methodology

• Workloads

– 14 most memory-intensive SPEC CPU 2000/2006 benchmarks for single-core system

– 30 and 15 SPEC 2000/2006 workloads for 4 and 8-core CMPs

• Pseudo-randomly chosen multiprogrammed workloads

• BAPI’s prefetch send threshold:

Prefetch accuracy (%)    0~40    40~85    85~100
Threshold                1       7        27

• BPMRI’s request send threshold: 10

• Prefetch accuracy estimation and rank decision are made every 100K cycles
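
The threshold table above maps directly to a lookup function. The sketch below is illustrative only: the function name is hypothetical, and because the ranges share their endpoints (40 and 85), it assumes a boundary value belongs to the higher range.

```python
def prefetch_send_threshold(accuracy_pct):
    """Map an estimated prefetch accuracy (in percent) to BAPI's threshold.

    Values follow the table above. Assumption: an accuracy of exactly
    40 or 85 falls into the higher range.
    """
    if not 0 <= accuracy_pct <= 100:
        raise ValueError("accuracy must be between 0 and 100 percent")
    if accuracy_pct < 40:
        return 1
    if accuracy_pct < 85:
        return 7
    return 27


for accuracy in (10, 60, 95):
    print(accuracy, prefetch_send_threshold(accuracy))
```

In the evaluated configuration this lookup would be re-evaluated once per 100K-cycle interval, when the prefetch accuracy estimate is refreshed.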

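Performance of BLP-Aware Issue Policies

[Figure: Bar charts in three panels (1-core, 4-core, 8-core), each with a vertical axis from 0 to 1.2. Each panel compares No pref, Pref, and the proposed policies (labeled BAPI in the 1-core panel and BLP-aware in the 4-core and 8-core panels). The chart is annotated, in slide order, with 8.5%, 13.8%, and 13.6%.]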

Hardware Storage Cost for 4-core CMP

               BAPI      BPMRI    Total
Cost (bits)    94,368    72       94,440

• Total storage: 94,440 bits (11.5KB)

– 0.6% of L2 cache data storage

• Logic is not on the critical path

– Issue decision can be made slower than processor cycle
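
The storage figures above can be checked with a few lines of arithmetic. The sketch assumes that "L2 cache data storage" means the four private 512KB L2 caches of the 4-core baseline listed under Simulation Methodology, and that 1KB = 1024 bytes.

```python
BAPI_BITS = 94_368
BPMRI_BITS = 72

total_bits = BAPI_BITS + BPMRI_BITS
total_kb = total_bits / 8 / 1024   # bits -> bytes -> KB

# Assumption: 4 cores, each with a 512KB L2 cache (data storage only).
l2_data_kb = 4 * 512
fraction = total_kb / l2_data_kb

print(total_bits)                  # 94440
print(round(total_kb, 1))          # 11.5
print(round(100 * fraction, 1))    # 0.6
```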

Conclusion

• Uncontrolled memory request issue policies limit the level of BLP exploited by DRAM controller

• BLP-Aware Prefetch Issue

– Increases the BLP of useful requests from each core exposed to the DRAM controller

• BLP-Preserving Multi-core Request Issue

– Ensures requests from the same core can be serviced in parallel by DRAM controller

• Simple, low-storage cost

• Significantly improve DRAM throughput and performance for both single and multi-core systems

• Applicable to other memory technologies

Questions?
