Concurrent VLSI Architecture Group

Efficient Microarchitecture for Network-on-Chip Routers

Daniel U. Becker

PhD Oral Examination

8/21/2012

Outline

• INTRODUCTION

• Allocator Implementations

• Buffer Management

• Infrastructure

• Conclusions


Networks-on-Chip

[Diagram: a chip composed of many core tiles]

• Moore’s Law alive & well

• Many cores per chip

• Must work together

• Networks-on-Chip (NoCs) aim to provide a scalable, efficient communication fabric


Why Does the Network Matter?

[Pie chart: energy breakdown for Radix Sort (SPLASH-2): NoC 45%, Core 31%, DRAM 14%, Caches 10%]

• Performance

– Latency

– Throughput

– Fairness, QoS

• Cost

– Die area

– Wiring resources

– Design complexity

• Power & energy efficiency

[Harting et al., “Energy and Performance Benefits of Active Messages”]


Optimizing the Network


Router Microarchitecture Overview

[Router microarchitecture diagram; Part 1 and Part 2 mark the components addressed in the two parts of the talk; from Peh and Dally: “A Delay Model for Router Microarchitectures”]

Outline

• Introduction

• ALLOCATOR IMPLEMENTATIONS

• Buffer Management

• Infrastructure

• Conclusions

[Becker and Dally: “Allocator Implementations for Network-on-Chip Routers,” SC’09]


Allocators

• Fundamental part of router control logic

• Manage access to network resources

• Orchestrate flow of packets through router

• Affect network utilization

• Potentially affect cycle time


Virtual Channel Allocation

• Virtual channels (VCs) allow multiple packets to be interleaved on physical channels

• Like lanes on a highway, VCs allow blocked traffic to be bypassed

• Before a packet can use a network channel, it must claim ownership of a VC

• VC allocator assigns output VCs to waiting packets
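As a rough illustration of this assignment step, the sketch below hands each waiting packet a free VC at its output port; the data layout and names are assumptions for the example, not the router's actual logic.

```python
# Illustrative sketch of VC allocation (not the router's actual logic):
# each waiting packet requests a free VC at its output port, and free VCs
# are handed out in index order.

def allocate_vcs(waiting, free_vcs):
    """waiting: list of (packet_id, output_port) pairs.
    free_vcs: dict mapping output_port -> set of free VC indices.
    Returns packet_id -> (output_port, granted VC index)."""
    grants = {}
    for packet_id, out_port in waiting:
        candidates = sorted(free_vcs[out_port])
        if candidates:                         # claim the lowest-numbered free VC
            vc = candidates[0]
            free_vcs[out_port].remove(vc)
            grants[packet_id] = (out_port, vc)
    return grants

if __name__ == "__main__":
    free = {0: {0, 1, 2, 3}, 1: {2}}
    waiting = [("in0.vc0", 1), ("in2.vc1", 1), ("in3.vc0", 0)]
    print(allocate_vcs(waiting, free))
    # {'in0.vc0': (1, 2), 'in3.vc0': (0, 0)} -- in2.vc1 must wait for a free VC
```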


Sparse VC Allocation

[Diagram, single input port shown: with 8 VCs, the VC allocator handles P×8 requests; with 2×4 VCs, P×4 requests; with 2×2×2 VCs, P×2 requests]
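To make the request reduction in the diagram concrete, here is a minimal Python sketch; the two-class partition of the 8 VCs is a hypothetical example, not the configuration used in the thesis.

```python
# Illustrative sketch of sparse VC allocation. Assumes the 8 VCs of each
# output port are partitioned into traffic classes (a hypothetical two-class
# split is shown) and that a packet may only request VCs of its own class.

P = 5                       # router ports
V = 8                       # VCs per port
CLASSES = {                 # hypothetical partition of VC indices by class
    "request": range(0, 4),
    "reply":   range(4, 8),
}

def dense_candidates():
    """Baseline: any VC on any output port is a candidate (P x V requests)."""
    return [(port, vc) for port in range(P) for vc in range(V)]

def sparse_candidates(traffic_class):
    """Sparse: only VCs of the packet's class are candidates (P x V/2 here)."""
    return [(port, vc) for port in range(P) for vc in CLASSES[traffic_class]]

if __name__ == "__main__":
    print(len(dense_candidates()), "candidate output VCs per packet")            # 40
    print(len(sparse_candidates("reply")), "candidates with sparse allocation")  # 20
```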

VC Allocator Delay

[Bar chart: VC allocator delay for sep in, sep out, wf unr, wf rot, and wf rep implementations, canonical design vs. sparse VC allocation; delay reductions range from 30% to 58%]

VC Allocator Area

[Bar chart: VC allocator area for sep in, sep out, wf unr, wf rot, and wf rep implementations; sparse VC allocation reduces area by 50% to 78%]

Switch Allocation

• Once a VC is allocated, packet can be forwarded

• Broken down into flits

• For each flit, must request crossbar access

• Switch allocator generates crossbar schedule

[Crossbar diagram with labeled inputs and outputs; from Enright Jerger and Peh, “On-Chip Networks”]
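As a behavioral illustration of generating a crossbar schedule, the sketch below models a separable input-first allocator with round-robin arbiters; this is one common organization, shown only to make the two arbitration stages concrete, not necessarily the exact RTL structure.

```python
# Illustrative model of a separable input-first switch allocator with
# round-robin arbiters.

NUM_PORTS = 5

def rr_arbiter(candidates, last):
    """Pick the first candidate after `last` in round-robin order."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: (c - last - 1) % NUM_PORTS)

def switch_allocate(requests, in_last, out_last):
    """requests: dict input_port -> set of requested output ports.
    Returns the crossbar schedule as input_port -> granted output port."""
    # Stage 1: each input picks one of the outputs it is requesting.
    input_pick = {i: rr_arbiter(outs, in_last[i]) for i, outs in requests.items() if outs}
    # Stage 2: each output picks one of the inputs that selected it.
    grants = {}
    for o in range(NUM_PORTS):
        contenders = {i for i, pick in input_pick.items() if pick == o}
        winner = rr_arbiter(contenders, out_last[o])
        if winner is not None:
            grants[winner] = o
    return grants

if __name__ == "__main__":
    requests = {0: {2, 3}, 1: {2}, 4: {0}}
    print(switch_allocate(requests, in_last=[0] * 5, out_last=[0] * 5))
    # {4: 0, 1: 2} -- inputs 0 and 1 conflict on output 2; input 1 wins this cycle
```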

Speculative Switch Allocation

• Reduce pipeline latency by attempting switch allocation in parallel with VC allocation

Speculate that VC will be assigned!

• But mis-speculation wastes crossbar bandwidth

• Must prioritize non-speculative requests


Pessimistic Speculation

• Speculation matters most when network is lightly loaded

• At low network load, most requests are granted

• Idea: Assume all non-spec. requests will be granted!

[Block diagram: non-spec. requests feed the non-spec. allocator to produce non-spec. grants; spec. requests feed the spec. allocator, and conflict detection against the non-spec. requests masks the spec. grants]
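A minimal sketch of the masking idea, assuming requests and grants are represented as (input port, output port) pairs: any speculative grant that shares an input or output with some non-speculative request is suppressed, since that request is pessimistically assumed to win.

```python
# Sketch of pessimistic speculation masking: a speculative grant is dropped
# whenever any non-speculative request targets the same input or output port,
# because every non-speculative request is assumed to be granted.

def mask_speculative_grants(spec_grants, nonspec_requests):
    busy_inputs = {i for i, _ in nonspec_requests}
    busy_outputs = {o for _, o in nonspec_requests}
    return [(i, o) for i, o in spec_grants
            if i not in busy_inputs and o not in busy_outputs]

if __name__ == "__main__":
    spec_grants = [(0, 2), (3, 1)]
    nonspec_requests = [(1, 2)]          # any request for output 2 blocks (0, 2)
    print(mask_speculative_grants(spec_grants, nonspec_requests))   # [(3, 1)]
```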

Performance with Speculation

[Line plot: latency vs. offered load (flits/cycle/node) for nonspec, pessimistic, and canonical designs; Mesh, 2 VCs, UR traffic. Annotations: -21% zero-load latency with speculation; <2% difference between pessimistic and canonical]

Area and Delay Impact

[Plot: router area vs. cycle time (FO4) for nonspec, pessimistic, and canonical designs; full router, Mesh, 2 VCs, TSMC 45nm GP. Annotations: +16% max. clock freq., -13% area @ 1.2 GHz, -5% area @ 1 GHz]

Additional Contributions

• Fast loop-free wavefront allocators

• Priority-based speculation

• Practical combined VC and switch allocation

• Details in thesis


Summary

• Sparse VC allocation exploits traffic classes to reduce VC allocator complexity

– Reduces delay by 30-60%, area by 50-80%

– No change in functionality

• Pessimistic speculation reduces overhead for speculative switch allocation

– Reduces overall router area by up to 13%

– Reduces critical path delay by up to 14%

– At the cost of some throughput loss near saturation


Outline

• Introduction

• Allocator Implementations

• BUFFER MANAGEMENT

• Infrastructure

• Conclusions

[Becker et al.: “Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks,” to appear in ICCD’12]


Buffer Cost

[Pie chart: TRIPS total network power: buffers 35%, crossbar 33%, channels 31%, allocators 1%; from Wang et al.: “Power-driven Design of Router Microarchitectures in On-chip Networks”]

Buffer Management

• Many designs divide buffer statically among VCs

– Assign each VC its fair share

• But optimal buffer organization depends on load

– Low load favors deep VCs

– High load favors many VCs

• For fixed buffer size, static schemes must pick one or the other

⇒ Improve utilization by allowing buffer space to be shared among VCs
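One common way to realize such sharing is a linked-list organization, along the lines of the "linked-list based scheme" evaluated on the next slide; the sketch below is a generic illustration of the idea, with field names and structure chosen for this example rather than taken from the evaluated design.

```python
# Generic linked-list organization for a buffer shared among VCs: free slots
# sit on a free list, and each VC threads the slots it occupies into a FIFO,
# so any VC can grow as long as shared slots remain.

class SharedBuffer:
    def __init__(self, num_slots, num_vcs):
        self.data = [None] * num_slots
        self.next = [None] * num_slots        # per-slot next pointer
        self.free = list(range(num_slots))    # free list
        self.head = [None] * num_vcs          # per-VC FIFO head
        self.tail = [None] * num_vcs          # per-VC FIFO tail

    def push(self, vc, flit):
        if not self.free:
            return False                      # no shared slot left: backpressure
        slot = self.free.pop()
        self.data[slot], self.next[slot] = flit, None
        if self.head[vc] is None:
            self.head[vc] = slot
        else:
            self.next[self.tail[vc]] = slot
        self.tail[vc] = slot
        return True

    def pop(self, vc):
        slot = self.head[vc]
        if slot is None:
            return None
        self.head[vc] = self.next[slot]
        if self.head[vc] is None:
            self.tail[vc] = None
        self.free.append(slot)                # slot returns to the shared pool
        return self.data[slot]

if __name__ == "__main__":
    buf = SharedBuffer(num_slots=4, num_vcs=2)
    for flit in ["a", "b", "c"]:
        buf.push(0, flit)                     # VC 0 takes 3 of the 4 shared slots
    print(buf.push(1, "x"), buf.pop(0))       # True a
```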


Buffer Management Performance

[Plot: static vs. dynamic buffer management, plotted against saturation rate (flits/cycle/node); linked-list based scheme, harmonic mean across traffic patterns. Annotations: +8%, -28%, -18%]

Buffer Monopolization

[Diagram: upstream and downstream routers with their switch allocators and the credit path between them]

• Congestion leads to buffer monopolization

• Uncongested traffic sees reduced buffer space

– Increases latency, reduces throughput

⇒ Congestion spreads across VCs!


Adaptive Backpressure

• Avoid unproductive use of buffer space

• Impose quotas on outstanding credits

– Share freely under benign conditions

– Limit sharing to avoid performance pathologies

⇒ Vary backpressure based on demand

[Router block diagram with credit and flit flows between routers. LAR: lookahead routing logic, IVC: input VC state, OVC: output VC state, SOT: shared occupancy tracker; each output VC tracks busy, occupancy, and quota state alongside the combined allocator]
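A minimal sketch of how a credit quota could be enforced in the upstream output VC state; the class and field names are hypothetical, chosen only to make the mechanism concrete.

```python
# Sketch of credit quota enforcement in the upstream output VC state.
# Credits are consumed when a flit is sent and returned when the downstream
# slot is freed; a flit may only be sent while the outstanding-credit count
# is below the current quota.

class OutputVCState:
    def __init__(self, shared_buffer_depth):
        self.outstanding = 0                   # downstream slots held by this VC
        self.quota = shared_buffer_depth       # adaptive limit set by the heuristic

    def can_send(self, shared_credits_available):
        # Need both a free shared slot downstream and headroom under the quota.
        return shared_credits_available > 0 and self.outstanding < self.quota

    def on_flit_sent(self):
        self.outstanding += 1

    def on_credit_returned(self):
        self.outstanding -= 1

if __name__ == "__main__":
    ovc = OutputVCState(shared_buffer_depth=8)
    ovc.quota = 2                              # heuristic has tightened the quota
    sent = 0
    while ovc.can_send(shared_credits_available=8 - sent):
        ovc.on_flit_sent()
        sent += 1
    print(sent, "flits in flight before the quota stalls this VC")   # 2
```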


Buffer Quota Heuristic

• Goal: Set quota values just high enough to support observed throughput for each VC

– Allow credit stalls that overlap with other stalls

– Drain unproductive buffer occupancy

• Difficult to measure throughput directly

• Instead, infer from credit round trip times

– In absence of congestion, set quota to RTT

– For each downstream stall cycle, reduce by one
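The rule above fits in a few lines; the clamp to a minimum of one credit and the notion of a per-window stall count are assumptions for this sketch.

```python
# Sketch of the quota rule above: start from the credit round-trip time and
# subtract one for every observed downstream stall cycle.

def compute_quota(credit_rtt, downstream_stall_cycles, min_quota=1):
    """Quota just high enough to cover the round trip minus cycles lost to stalls."""
    return max(min_quota, credit_rtt - downstream_stall_cycles)

if __name__ == "__main__":
    print(compute_quota(credit_rtt=6, downstream_stall_cycles=0))   # 6: uncongested
    print(compute_quota(credit_rtt=6, downstream_stall_cycles=4))   # 2: congested
    print(compute_quota(credit_rtt=6, downstream_stall_cycles=9))   # 1: clamped
```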


Buffer Quota Motivation (1)

[Timeline diagrams for Router 0 and Router 1. Left: with credit round-trip time T_crt,0, full throughput is achieved in steady state. Right: congestion causes a downstream stall (T_stall), stretching the credit round trip to T_crt,0 + T_stall and leaving excess flits as unproductive buffer occupancy]

Buffer Quota Motivation (2)

[Timeline diagrams for Router 0 and Router 1. Left: an insufficient credit supply causes an idle cycle (T_idle) downstream. Right: a credit stall (T_stall) drains the excess flit and resolves the unproductive buffer occupancy]

Network Stability

[Two plots vs. offered load (flits/cycle/node) under tornado traffic, comparing baseline and adaptive backpressure; annotated 6.3x in favor of the adaptive scheme]

Traffic Isolation

[Two plots: zero-load latency increase of uniform random foreground traffic vs. background offered load (flits/cycle/node), with uniform random and hotspot background traffic; adaptive backpressure reduces the increase by 33% and 38% relative to the baseline]

Zero-load Latency with Background

[Bar chart: zero-load latency per traffic pattern (bitcomp, bitrev, shuffle, tornado, transpose, uniform) with 50% uniform random background traffic, for baseline and adaptive, with the w/o-background latency for reference; -31% mean reduction with adaptive backpressure]

Throughput with Background

[Bar chart: throughput with 50% uniform random background traffic for baseline and adaptive, with the w/o-background value for reference; annotations: -13%, 3.3x]


Application Performance Setup

• Model traffic in heterogeneous CMP

• Each node generates two types of traffic:

– PARSEC application traffic models a latency-optimized core

– Streaming traffic to memory controllers models an array of throughput-optimized cores

[Tile diagram: CPU with L1 and L2 caches, I/O, an array of SP (streaming processor) cores, network interfaces, and memory banks]

Application Performance

[Bar chart: application performance per workload (bscholes, canneal, dedup, ferret, fanimate, vips, x264) for baseline and adaptive, with the w/o-background value for reference; 12.5% injection rate for streaming traffic; annotations: -31% (gmean)]

Summary

• Sharing improves buffer utilization, but can lead to pathological performance

• Adaptive Backpressure minimizes unproductive use of shared buffer space

• Mitigates performance degradation in presence of adversarial traffic

• But maintains key benefits of buffer sharing under benign conditions


Infrastructure

• Open source NoC router RTL

– State-of-the-art router implementation

– Highly parameterized

• Topology, routing, allocators, buffers, …

– Pervasive clock gating

– Fully synthesizable

– 100 files, >22k LOC of Verilog-2001

– Used in research efforts both inside and outside our group


Conclusions

• Future large-scale chip multiprocessors will require efficient on-chip networks

• Router microarchitecture is one of many aspects that need to be optimized

• Allocation has direct impact on router delay and throughput

• By exploiting higher-level properties, we can reduce cost and delay without degrading performance

• Input buffers are attractive candidates for optimization

• However, care must be taken to avoid performance pathologies

• By avoiding unproductive use of buffer space, Adaptive Backpressure mitigates undesired interference effects


Acknowledgements

• Bill

• Christos and Kunle

• Prof. Nishi

• George, Ted, Curt & the rest of the CVA gang



That’s it for today.

THANK YOU!

