
Designing On-chip Memory Systems for Throughput Architectures

Ph.D. Proposal

Jeff Diamond

Advisor: Stephen Keckler

Turning to Heterogeneous Chips

“We’ll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005

AMD - TRINITY

NVIDIA Tegra 3

Intel – Ivy Bridge

1

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

◦ Architectural Enhancements

 Thread Scheduling

 Cache Policies

◦ Methodology

Proposed Work

2

Throughput Architectures (TA)

Key Features:

◦ Break single application into threads

 Use explicit parallelism

◦ Optimize hardware for performance density

 Not single thread performance

Benefits:

◦ Drop voltage, peak frequency

 quadratic improvement in power efficiency

◦ Cores smaller, more energy efficient

 Further amortize through multithreading, SIMD

 Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, low-latency ALUs

3

Scope – Highly Threaded TA

Architecture Continuum:

◦ Multithreading

 Large number of threads mask long latency

 Small amount of cache primarily for bandwidth

◦ Caching

 Large amounts of cache to reduce latency

 Small number of threads

Can we get the benefits of both?

Power7: 4 threads/core, ~1 MB/thread

SPARC T4: 8 threads/core, ~80 KB/thread

GTX 580: 48 threads/core, ~2 KB/thread

4

Problem - Technology Mismatch

Computation is cheap, data movement is expensive

◦ Hit in L1 cache, 2.5x power of 64-bit FMADD

◦ Move across chip, 50x power

◦ Fetch from DRAM, 320x power

Limited off-chip bandwidth

◦ Exponential growth in cores saturates BW

◦ Performance capped

DRAM latency currently hundreds of cycles

◦ Need hundreds of threads/core in flight to cover DRAM latency

5

The Downward Spiral

Little’s Law

◦ Threads needed is proportional to average latency

◦ Opportunity cost in on-chip resources

 Thread contexts

 In flight memory accesses

Too many threads – negative feedback

◦ Adding threads to cover latency increases latency

◦ Slower register access, thread scheduling

◦ Reduced Locality

 Reduces bandwidth and DRAM efficiency

 Reduces effectiveness of caching

◦ Parallel starvation
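Rearranging the throughput model from the modeling slides makes the feedback quantitative; as a sketch in the deck's own symbols:

$$N_T = \frac{P_{CHIP} \times L_{AVG}}{\mathrm{ILP}}$$

so the threads (and in-flight accesses) that must be provisioned grow in direct proportion to average latency, and any latency added by extra threads feeds the spiral above.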

6

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

◦ Architectural Enhancements

 Thread Scheduling

 Cache Policies

◦ Methodology

Proposed Work

7

Goal: Increase Parallel Efficiency

Problem: Too Many Threads!

◦ Increase Parallel Efficiency, i.e.

 Fewer threads needed for a given level of performance

 Improves throughput performance

Apply low-latency caches

◦ Leverage the upward spiral

 Difficult to mix multithreading and caching

 Typically used just for bandwidth amplification

◦ Important ancillary factors

 Thread scheduling

 Instruction Scheduling (per thread parallelism)

8

Contributions

Quantifying the impact of single thread performance on throughput performance

Developing a mathematical analysis of throughput performance

Building a novel hybrid-trace based simulation infrastructure

Demonstrating unique architectural enhancements in thread scheduling and cache policies

9

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

 Cache Performance

 The Valley

◦ Architectural Enhancements

 Thread Throttling

 Cache Policies

◦ Methodology

Proposed Work

10

Mathematical Analysis

Why take a mathematical approach?

◦ Be very precise about what we want to optimize

◦ Understand the relationships and sensitivities of throughput performance to:

 Single thread performance

 Cache improvements

 Application characteristics

◦ Rapid evaluation of design space

◦ Suggest most fruitful architectural improvements

11

Modeling Throughput Performance

$P_{CHIP}$ = total throughput performance

$P_{ST}$ = single-thread performance

$N_T$ = total active threads

$L_{AVG}$ = average instruction latency

$$P_{CHIP} = N_T \times P_{ST}$$

$$P_{ST}\ \bigl((\mathrm{Ins/Sec})/\mathrm{Thread}\bigr) = \frac{\mathrm{ILP}\ (\mathrm{Ins/Thread})}{L_{AVG}\ (\mathrm{Sec})}$$

$$L_{AVG} = A \cdot L_{ALU} + M \cdot L_{MEM}$$

(A = arithmetic intensity, M = 1 − A = memory intensity; both defined on later slides)

$$\mathrm{Power}_{CHIP} = E_{AVG}\ (\mathrm{Joules/Ins}) \times P_{CHIP}$$
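As a quick worked check (the target and latency numbers here are ours, purely illustrative): to sustain 100 ins/cycle with ILP = 1 and a 10-cycle average latency,

$$P_{ST} = \frac{1}{10} = 0.1\ \mathrm{ins/cycle}, \qquad N_T = \frac{100}{0.1} = 1000\ \mathrm{threads}$$

so halving $L_{AVG}$ halves the threads needed for the same chip throughput.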

How can caches help throughput performance?

12

Cache As A Performance Unit

Area comparison: one FPU = 2-11 KB SRAM, 8-40 KB eDRAM

FMADD:

◦ Active power: 20 pJ/Op

◦ Leakage power: 1 W/mm²

SRAM:

◦ Active power: 50 pJ/L1 access, 1.1 nJ/L2 access

◦ Leakage power: 70 mW/mm²

Caches make loads 150x faster, 300x more energy efficient

◦ Use 10-15x less power/mm² than FPUs

Leakage power comparison: one FPU = ~64 KB SRAM / 256 KB eDRAM

Key: How much does a thread need?
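The 10-15x density figure follows directly from the leakage numbers above:

$$\frac{1\ \mathrm{W/mm^2}}{70\ \mathrm{mW/mm^2}} \approx 14\times$$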

13

Performance From Caching

Ignore changes to DRAM latency & off-chip BW

◦ We will simulate these

Assume ideal caches (frequency)

What is the maximum performance benefit?

A = arithmetic intensity of the application (fraction of non-memory instructions)

M = 1 − A = memory intensity

$N_T$ = total active threads on chip

L = latency

$$\Delta L(c) = (1 - A) \cdot (L_{miss} - L_{hit}) \cdot H(c) = M \cdot (L_{miss} - L_{hit}) \cdot H(c)$$

where H(c) is the hit rate at cache capacity c per thread.

For power, replace L with E, the average energy per instruction

Qualitatively identical, but differences more dramatic
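To make the formula concrete, a minimal Python sketch of the model; the function name and the parameter values in the example are our illustrative assumptions, not numbers from the proposal:

```python
# Hedged sketch of the latency-improvement model above.
def latency_improvement(A: float, L_miss: float, L_hit: float, hit_rate: float) -> float:
    """Delta-L(c) = (1 - A) * (L_miss - L_hit) * H(c), in cycles."""
    M = 1.0 - A                        # memory intensity
    return M * (L_miss - L_hit) * hit_rate

# Example: 30% memory instructions, 400-cycle miss, 4-cycle hit, 80% hit rate:
print(latency_improvement(A=0.7, L_miss=400.0, L_hit=4.0, hit_rate=0.8))  # ~95 cycles
```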

14

Ideal Cache = Frequency Cache

Hit rate depends on amount of cache, application working set

Store items used the most times

◦ This is the concept of “frequency”

Once we know an application’s memory access characteristics, we can model throughput performance

15

Modeling Cache Performance

Let F(c) be the access-frequency profile, H(c) the hit rate, and $P_{ST}(c)$ single-thread performance, all as functions of cache capacity per thread:

$$H(c) = \int_0^c F(c)\,dc$$

$$P_{ST}(c) = \frac{1}{L_{NC} - L_{CI}} = \frac{1}{(A \cdot L_{ALU} + M \cdot L_{MISS}) - M \cdot (L_{MISS} - L_{HIT}) \cdot H(c)}$$

where $L_{NC}$ is the no-cache latency and $L_{CI}$ the cache improvement.

16

Cache Performance Per Thread

[Figure: for a flat access profile, hit rate vs. cache per thread (fraction of working set), hit rate vs. total active threads, and $P_{ST}$ vs. total active threads]

$$F(c) = \mathrm{const} = \frac{1}{\mathrm{WorkingSet}}$$

$$H(c) = \int_0^c F(c)\,dc = c \quad \text{(c as a fraction of the working set)}$$

$$c(N_T) = \frac{1}{N_T}, \qquad \frac{\partial c}{\partial N_T} = -\frac{1}{N_T^2}$$

$P_{ST}(N_T)$ is a steep reciprocal
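A small Python sketch of this flat-profile model (the parameter defaults are illustrative assumptions, not measured values); sweeping $N_T$ exposes the valley in chip throughput:

```python
# Flat access profile: H(c) = c, with c = 1/N_T of the working set per thread.
def p_st(n_threads: int, A: float = 0.7, L_alu: float = 1.0,
         L_miss: float = 400.0, L_hit: float = 4.0) -> float:
    """Single-thread performance (instructions/cycle) under the flat-profile model."""
    M = 1.0 - A
    h = min(1.0 / n_threads, 1.0)      # hit rate = cached fraction of working set
    L_avg = A * L_alu + M * (h * L_hit + (1.0 - h) * L_miss)
    return 1.0 / L_avg

# Chip throughput P_chip = N_T * P_ST(N_T): high with a few well-cached threads,
# dips in the valley, then climbs again in the pure multithreading regime.
for n in (1, 10, 100, 1000):
    print(n, n * p_st(n))
```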

17

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

 Cache Performance

 “The Valley”

◦ Architectural Enhancements

 Thread Throttling

 Cache Policies

◦ Methodology

Proposed Work

18

“The Valley” in Cache Space

$$P_{Chip} = N_T \times P_{ST}, \qquad N_T = \frac{1}{c}$$

[Figure: chip throughput across cache space for a flat access profile, with the high-performance operating points marked]

19

“The Valley” In Thread Space

[Figure: throughput vs. total active threads, with and without cache, showing the cache regime, the valley (and its width), and the MT regime]

20

Prior Work

Hong et al, 2009, 2010

◦ Simple, cacheless GPU models

◦ Used to predict “MT peak”

Guz et al, 2008, 2010

◦ Graphed throughput performance with assumed cache profile

◦ Identified “valley” structure

◦ Validated against PARSEC benchmarks

◦ No mathematical analysis

◦ Minimal focus on bandwidth limited regime

◦ CMP benchmarks

Galal et al, 2011

◦ Excellent mathematical analysis

◦ Focused on FPU+Register design

21

“The Valley” In Thread Space

[Figure: the valley in thread space (cache regime, valley width, MT regime; curves with and without cache), alongside the hit-rate profile H(c)]

22

Energy vs Latency

! " #$ #%&'( ) * &+,- ,. /

! " #$%&'%(&) *) +', - -

! " #$%&'/0 12 2

0 56) '57) +, ( - 8''. '9 9

1: : ) 88'; <='>?10

0 56) '57) +, ( - ', : +588': A%7

B CC#: A%7'D %+)

?) , - 'C+59 '2 ?10

E'=%FF'2 , FFGH'IJ 2 J >'<) G( 5&) H'34. .

! "#$%&' ( ) %$( &' *+,' *' - . &( /0*+,12-

!" #$ %&

!) #$ %&

!, #$ %&

- . / 0 #122344

) '

, ( (

"

*

' (

+'

" ( ( (

" *( ( (

.

34

3!

@4

. 444

@44

. ! 444

23

Valley – Energy Efficiency

[Figure: energy per op and hit rate vs. total active threads, for the same hit-rate profile H(c); energy efficiency exhibits the same valley structure]

24

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

 Cache Performance

 The Valley

◦ Architectural Enhancements

 Thread Throttling

 Cache Policies

◦ Methodology

Proposed Work

25

Contribution – Thread Throttling

Have real-time information:

◦ Arithmetic intensity

◦ Bandwidth utilization

◦ Current hit rate

 Conservatively approximate locality

Approximate optimum operating points

Shut down / Activate threads to increase performance

◦ Concentrate power and overclock

◦ Clock off unused cache if no benefit

26

Prior Work

Many studies in CMP and GPU area scale back threads

◦ CMP – miss rates get too high

◦ GPU – off-chip bandwidth is saturated

◦ A single, simple target, approached from one direction

Valley is much more complex

◦ Two points to hit

◦ Three different operating regimes

Mathematical analysis lets us approximate both points with as little as two samples

 Both off-chip bandwidth and reciprocal of hit rate are nearly linear for a wide range of applications
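As a hedged sketch of how two samples could suffice (the fitting scheme and all names below are our illustration, not the proposal's algorithm): sample hit rate and bandwidth at two thread counts, fit each near-linear quantity, then solve for the two candidate operating points.

```python
# Illustrative two-sample estimator (names, thresholds, and values are assumptions).
def fit_line(x1, y1, x2, y2):
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def operating_points(n1, hit1, bw1, n2, hit2, bw2, hit_target=0.5, bw_cap=1.0):
    # Reciprocal hit rate 1/H(N_T) is nearly linear in N_T for many applications.
    m_h, b_h = fit_line(n1, 1 / hit1, n2, 1 / hit2)
    n_cache = (1 / hit_target - b_h) / m_h   # threads where hit rate falls to target
    # Bandwidth utilization is nearly linear in N_T until it saturates.
    m_b, b_b = fit_line(n1, bw1, n2, bw2)
    n_bw = (bw_cap - b_b) / m_b              # threads where off-chip BW saturates
    return n_cache, n_bw

# Two samples: 64 threads (90% hits, 20% BW) and 128 threads (80% hits, 40% BW).
print(operating_points(64, 0.9, 0.2, 128, 0.8, 0.4))
```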

27

Finding Optimal Points

[Figure: the valley in thread space (cache regime, valley width, MT regime; curves with and without cache), with the optimal operating points marked]

28

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

 Cache Performance

 The Valley

◦ Architectural Enhancements

 Thread Throttling

 Cache Policies (Indexing, replacement)

◦ Methodology

Proposed Work

29

From Mathematical Analysis:

Need to work like an LFU (least frequently used) cache

◦ Hard to implement in practice

Still very little cache per thread

◦ Policies make big differences for small caches

◦ Associativity a big issue for small caches

Cannot cache every line referenced

◦ Beyond “dead line” prediction

◦ Stream (don't cache) lines with lower reuse

30

Contribution – Odd Set Indexing

Conflict misses are a pathological issue

◦ Most often seen with power-of-2 strides

◦ Idea: map to 2^N − 1 sets/banks instead

True “Silver Bullet”

◦ Virtually eliminates conflict misses in every setting we’ve tried

 Reduced scratchpad banks from 32 to 7 at same level of bank conflicts

Fastest, most efficient implementation

◦ Adds just a few gate delays

◦ Logic area < 4% of a 32-bit integer multiplier

◦ Can still access last bank
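As an illustration of why the hardware can be cheap: reduction modulo 2^N − 1 needs only narrow adds with end-around carry (digit folding). A minimal sketch, ours rather than the proposal's exact circuit:

```python
def odd_set_index(addr: int, n_bits: int = 5) -> int:
    """Map an address to one of 2**n_bits - 1 sets (e.g. 31 sets for n_bits=5).

    Because 2**n_bits = 1 (mod 2**n_bits - 1), folding the address into
    n_bits-wide chunks and adding them computes addr mod (2**n_bits - 1)
    using adders alone (no divider needed).
    """
    mod = (1 << n_bits) - 1
    x = addr
    while x > mod:
        x = (x & mod) + (x >> n_bits)   # fold high bits into low bits
    return 0 if x == mod else x         # a result equal to mod wraps to set 0

# Power-of-2 strides no longer collide: addresses 0, 64, 128, ... spread out.
print([odd_set_index(i * 64) for i in range(8)])  # [0, 2, 4, 6, 8, 10, 12, 14]
```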

31

More Preliminary Results

[Figure: PARSEC L2 hit rates with 64 threads, direct-mapped with prime banking]

32

Prior Work

Prime number of banks/sets thought ideal

◦ No efficient implementation

◦ Mersenne primes not so convenient:

 2^N − 1 gives 3, 7, 15, 31, 63, 127, 255 (not all prime)

 We demonstrated primality wasn't an issue

Yang, ISCA ‘92 - prime strides for vector computers

 Showed 3x speedup

 We get correct offset for free

Kharbutli, HPCA 04 – showed prime sets as hash function for caches worked well

 Our implementation faster, more features

 Benefits hard to appreciate on SPEC

33

(Re)placement Policies

Not all data should be cached

◦ Recent papers on last-level caches (LLCs)

◦ Hard drive cache algorithms

Frequency over Recency

◦ Frequency hard to implement

◦ ARC good compromise

Direct Mapping Replacement dominates

◦ Look for explicit approaches

◦ Priority Classes

◦ Epochs
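The proposal doesn't yet spell these policies out; as one hedged illustration of the priority-class idea (the names and tie-breaking rule below are our assumptions), an insertion test for a direct-mapped cache might look like:

```python
# Hypothetical sketch of priority-class placement in a direct-mapped cache.
class Line:
    def __init__(self, tag: int, priority: int):
        self.tag, self.priority = tag, priority

def access(cache: dict, set_index: int, tag: int, priority: int) -> bool:
    """Return True on a hit; on a miss, insert only if the incoming line's
    priority class is at least the resident line's, else stream it uncached."""
    line = cache.get(set_index)
    if line is not None and line.tag == tag:
        return True                                 # hit
    if line is None or priority >= line.priority:
        cache[set_index] = Line(tag, priority)      # place / replace
    return False                                    # miss (possibly streamed)
```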

34

Prior Work

Belady – solved it all

◦ Three hierarchies of methods

◦ Best utilized information of prior line usage

◦ Light on implementation details

Approximations

◦ Hallnor & Reinhardt, ISCA 2000

 Generational Replacement

◦ Megiddo & Modha, USENIX FAST 2003, ARC cache

 ghost entries

 recency and frequency groups

◦ Qureshi, 2006, 2007 – Adaptive Insertion policies

◦ Multi-queue, LRU-K, D-NUCA, etc.

35

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

 Cache Performance

 The Valley

◦ Architectural Enhancements

 Thread Throttling

 Cache Policies (Indexing, replacement)

◦ Methodology (Applications, Simulation)

Proposed Work

36

Benchmarks

Initially studied regular HPC kernels/applications in CMP environment

◦ Dense Matrix Multiply

◦ Fast Fourier Transform

◦ HOMME weather simulation

Added CUDA throughput benchmarks

◦ Parboil – old school MPI, coarse grained

◦ Rodinia – fine grained, varied

 Benchmarks typical of historical GPGPU applications

Will add irregular benchmarks

◦ SparseMM, Adaptive Finite Elements, Photon mapping

37

Subset of Benchmarks

[Figure: dynamic instruction mix per benchmark, split into operations, scratchpad accesses, and cache accesses]

38

Preliminary Results

Most of the benchmarks should benefit:

◦ Small working sets

◦ Concentrated working sets

◦ Hit rate curves easy to predict

39

Typical Concentration of Locality

[Figure: HWT working set per task, with reuse count and hit rate vs. working set size (bytes); locality is concentrated in a small fraction of the working set]

40

Scratchpad Task Locality

[Figure: fraction of scratchpad accesses to a single address vs. 2-128 addresses, per benchmark]

41

Hybrid Simulator Design

C++/CUDA source (can simulate a different architecture than the one traced)

◦ NVCC compiles to a PTX intermediate assembly listing

◦ A modified Ocelot functional simulator with a custom trace module emits dynamic trace blocks with attachment points

◦ Compressed trace data feeds the custom simulator

Goals: fast simulation; overcome compiler issues for a reasonable base case
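A hedged sketch of how such a pipeline might consume its trace blocks (the record fields and on-disk format are our guesses for illustration; the deck does not specify them):

```python
# Hypothetical trace-block record and replay loop (format is our assumption).
import gzip
import pickle
from dataclasses import dataclass, field

@dataclass
class TraceBlock:
    pc: int                                         # attachment point in the PTX listing
    warp: int                                       # warp that executed this dynamic block
    addresses: list = field(default_factory=list)   # memory addresses touched

def replay(path: str):
    """Yield compressed trace blocks one at a time to a timing model."""
    with gzip.open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)                # one dynamic trace block
            except EOFError:
                return
```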

42

Talk Outline

Introduction

The Problem

◦ Throughput Architectures

◦ Dissertation Goals

The Solution

◦ Modeling Throughput Performance

 Cache Performance

 The Valley

◦ Architectural Enhancements

 Thread Throttling

 Cache Policies (Indexing, replacement)

◦ Methodology (Applications, Simulation)

Proposed Work

43

Phase 1 – HPC Applications

Looked at GEMM, FFT & Homme in CMP setting

◦ Learned implementation algorithms, alternative algorithms

◦ Expertise allows for credible throughput analysis

◦ Valuable Lessons in multithreading and caching

Dense Matrix Multiply

◦ Blocking to maximize arithmetic intensity

◦ Enough contexts to cover latency

Fast Fourier Transform

◦ Pathologically hard on memory system

◦ Communication & synchronization

HOMME – weather modeling

◦ Intra-chip scaling incredibly difficult

◦ Memory system performance variation

◦ Replacing data movement with computation

First author publications:

◦ PPoPP 2008, ISPASS 2011 (Best Paper)

44

Phase 2 – Benchmark Characterization

Memory access characteristics of the Rodinia and Parboil benchmarks

Apply Mathematical Analysis

◦ Validate model

◦ Find optimum operating points for benchmarks

◦ Find optimum TA topology for benchmarks

NEARLY COMPLETE

45

Phase 3 – Evaluate Enhancements

Automatic Thread Throttling

Low latency hierarchical cache

Benefits of odd-sets/odd-banking

Benefits of explicit placement (Priority/Epoch)

NEED FINAL EVALUATION and explicit placement study

46

Final Phase – Extend Domain

Study regular HPC applications in throughput setting

Add at least two irregular benchmarks

◦ Less likely to benefit from caching

◦ New opportunities for enhancement

Explore impact of future TA topologies

◦ Memory Cubes, TSV DRAM, etc.

47

Conclusion

Dissertation Goals:

◦ Quantify the degree to which single thread performance affects throughput performance for an important class of applications

◦ Improve parallel efficiency through thread scheduling, cache topology, and cache policies

Feasibility

◦ Regular Benchmarks show promising memory behavior

◦ Cycle accurate simulator nearly completed

48

Proposed Timeline

Phase 1 – HPC applications – completed

Phase 2 – Mathematical model &

Benchmark Characterization

◦ MAY-JUNE

Phase 3 – Architectural Enhancements

◦ JULY-AUGUST

Phase 4 – Domain enhancement / new features

◦ SEPTEMBER-NOVEMBER

49

Related Publications To Date

50

Any Questions?

51


One Outlier

PNS Working Set Per Task

[Figure: reuse count and hit rate vs. working set size (bytes, up to 200,000); locality is far less concentrated than in the typical case]

54

Priority Scheduling

[Figure: priority scheduling with 8 warps, showing the per-warp instruction issue breakdown (Warp0-Warp7)]

55

Talk Outline

Introduction

◦ Throughput Architectures - The Problem

◦ Dissertation Overview

Modeling Throughput Performance

◦ Throughput

◦ Caches

◦ The Valley

Methodology

Architectural Enhancements

◦ Thread Scheduling

◦ Cache Policies

 Odd-set/Odd-bank caches

 Placement Policies

◦ Cache Topology

Dissertation Timeline

56

Modeling Throughput Performance

$N_T$ = total active threads

$P_{CHIP}$ = total throughput performance

$P_{ST}$ = single-thread performance

$L_{AVG}$ = average latency per instruction

$$P_{CHIP} = N_T \times P_{ST}$$

$$P_{ST}\ \bigl((\mathrm{Ins/Sec})/\mathrm{Thread}\bigr) = \frac{\mathrm{ILP}\ (\mathrm{Ins/Thread})}{L_{AVG}\ (\mathrm{Sec})}$$

$$L_{AVG} = A \cdot L_{ALU} + M \cdot L_{MEM}$$

$$\mathrm{Power}_{CHIP} = E_{AVG}\ (\mathrm{Joules/Ins}) \times P_{CHIP}$$

57

Phase 1 – HPC Applications

Looked at GEMM, FFT & HOMME in CMP setting

◦ Learned implementation algorithms, alternative algorithms

◦ Expertise allows for credible throughput analysis

◦ Valuable Lessons in multithreading and caching

Dense Matrix Multiply

◦ Blocking to maximize arithmetic intensity

◦ Need enough contexts to cover latency

Fast Fourier Transform

◦ Pathologically hard on memory system

◦ Communication & synchronization

HOMME – weather modeling

◦ Intra-chip scaling incredibly difficult

◦ Memory system performance variation

◦ Replacing data movement with computation

Most significant publications:

58

Odd Banking – Scratchpad

[Figure: bank conflicts on a 32-bank scratchpad, per benchmark]

59

Talk Outline

Introduction

◦ Throughput Architectures - The Problem

◦ Dissertation Overview

Modeling Throughput Performance

◦ Throughput

◦ Caches

◦ The Valley

Methodology

Architectural Enhancements

◦ Thread Scheduling

◦ Cache Policies

 Odd-set/Odd-bank caches

 Placement Policies

◦ Cache Topology

Dissertation Timeline

60

Problem - Technology Mismatch

Computation is cheap, data movement is expensive:

! " #$#%&'( ! ) &*+%+, -

! " #$%&'%(&) *) +', - -

! " #$%&'/0 12 2

0 56) '57) +, ( - 8''. '9 9

1: : ) 88'; <='>?10

0 56) '57) +, ( - ', : +588': A%7

B CC#: A%7'D%+)

?) , - 'C+59 '2 ?10

* Bill Dally, IPDPS Keynote, 2011

. 444

@44

. ! 444

.

34

3!

@4

Exponential growth in cores saturates off-chip bandwidth

- Performance capped

Latency to off-chip DRAM now hundreds of cycles

- Need hundreds of threads per core to mask

61

Talk Outline

Introduction

◦ Throughput Architectures - The Problem

◦ Dissertation Overview

Modeling Throughput Performance

◦ Throughput

◦ Caches

◦ The Valley

Methodology

Architectural Enhancements

◦ Thread Scheduling

◦ Cache Policies

 Odd-set/Odd-bank caches

 Placement Policies

◦ Cache Topology

Dissertation Timeline

62

The Power Wall

Socket power economically capped

DARPA's UHPC Exascale Initiative:

◦ Supercomputers now power capped

◦ 10-20x power efficiency by 2017

◦ Supercomputing Moore’s Law:

 Double power efficiency every year

“Post-PC” client era requires >20x power efficiency of desktop

Even Throughput Architectures aren’t efficient enough!

63

Short Latencies Also Matter

[Figure: BackProp CPI at 8 warps, split into L1 and DRAM components for the CMP, RELAXED, and FERMI configurations]

64

Importance of Scratchpad

[Figure: scratchpad vs. cache accesses per instruction, per benchmark]

65

Talk Outline

Introduction

◦ Throughput Architectures - The Problem

◦ Dissertation Overview

Modeling Throughput Performance

◦ Throughput

◦ Caches

◦ The Valley

Methodology

Architectural Enhancements

◦ Thread Scheduling

◦ Cache Policies

 Odd-set/Odd-bank caches

 Placement Policies

◦ Cache Topology

Dissertation Timeline

66

Work Finished To Date

Mathematical Analysis

Architectural algorithms

Benchmark Characterization

Nearly finished full chip simulator

◦ Currently simulates one core at a time

Almost ready to publish 2 papers…

67

Benchmark Characterization (May-June)

Latency Sensitivity with cache feedback, multiple blocks per core

Global caching, BW across cores

Validate mathematical model with benchmarks

Compiler Controls

68

Architectural Evaluation (July-August)

Priority Thread Scheduling

Automatic Thread Throttling

Optimized Cache Topology

◦ Low latency / fast path

◦ Odd-set banking

◦ Explicit Epoch placement

69

Extending the Domain (Sep-Nov)

Extend benchmarks

◦ Port HPC applications/kernels to throughput environment

◦ Add at least two irregular applications

 E.g. Sparse MM, Photon Mapping, Adaptive Finite Elements

Extend topologies, enhancements

◦ Explore design space of emerging architectures

◦ Examine optimizations beneficial to irregular applications

70

Questions?

71

Contributions

Mathematical Analysis of Throughput Performance

◦ Caching, saturated bandwidth, sensitivities to application characteristics, latency

Quantify Importance of Single Thread Latency

Demonstrate novel enhancements

◦ Valley based thread throttling

◦ Priority Scheduling

◦ Subcritical Caching Techniques

72

HOMME

73

Dense Matrix Multiply

74

PARSEC L2 64KB Hit Rates

[Figure: hit rates for the FAL64, DMPB64, and DM64 configurations]

75

Odd Banking, L1 Cache Access

[Figure: DRAM cycle utilization for the base configuration vs. 2, 3, and 4 striped banks, 256B lines]

76

Local vs Global Working Sets

[Figure: single-task fraction of scratchpad accesses, per benchmark]

77

Dynamic Working Sets

78

Fast Fourier Transform (blocked)

79

Performance From Caching

Assume ideal caches

Ignore changes to DRAM latency & offchip BW

80
