
Networks-on-Chip: Motivations and Architecture

Federico Angiolini fangiolini@deis.unibo.it

DEIS Università di Bologna

Why NoCs

The System Interconnect

Chips tend to have more than one “core”

Control processors

Accelerators

Memories

I/O

How do we get them to talk to each other?

This is called the “System Interconnect”

Traditional Answer: with Buses

Shared bus topology

Aimed at simple, cost-effective integration of components

[Figure: shared bus connecting Master 0–2 and Slave 0–2]

Typical example: ARM Ltd. AMBA AHB

Arbitration among multiple masters

Single outstanding transaction allowed

If wait states are needed, everybody waits

So, Are We All Set?

Well... not really.

Let’s consider two trends

System/architectural: systems are becoming highly parallel

Physical: wires are becoming slower (especially in relative terms)

System/Architectural Level

Parallel systems... OK, but how much?

CPUs: currently four cores (not so many...)

Playstation/Cell: currently nine engines (still OK)

GPUs: currently 100+ shaders (hey!)

Your next cellphone: 100+ cores (!!!)

And the trend is: double every 18 months

Multicore Testimonials 1

“We believe that Intel’s Chip Level Multiprocessing (CMP) architectures represent the future of microprocessors because they deliver massive performance scaling while effectively managing power and heat.”

White paper “Platform 2015: Intel Processor and Platform Evolution for the Next Decade”

Intel IXP2800 with 16 micro-engines and one Intel XScale core

Multicore Testimonials 2

Intel: 80-core chip shown at ISSCC 2007

Rapport: Kilocore (1024 cores), for gaming & media

Expected mid 2007

"The next 25 years of digital signal processing technology will literally integrate hundreds of processors on a single chip to conceive applications beyond our imagination.”

Mike Hames, senior VP, Texas Instruments

“Focus here is on parallelism and what's referred to as multi-core technology.”

Phil Hester, CTO, AMD

What Does This Mean for the Interconnect?

A new set of requirements!

High performance

Many cores will want to communicate, fast

High parallelism (bandwidth)

Many cores will want to communicate, simultaneously

High heterogeneity/flexibility

Cores will operate at different frequencies, data widths, maybe with different protocols

Physical Level

Logic becomes faster and faster

Global wires don’t

And If We Consider a Floorplan...

[Figure: chip floorplan, 2 cm across]

If you assume a shared bus, the wires have to go all around the chip (i.e. they are very long)

Propagation delay

Spaghetti wiring

What Does This Mean for the Interconnect?

A new set of requirements!

Short wiring

Point-to-point and local is best

Simple, structured wiring

Bundles of many wires are impractical to route

System Interconnect Evolution

[Figure: evolution along two axes — topology evolution from a traditional shared bus to hierarchical buses (e.g. an AHB crossbar connecting Masters 0–3 to AHB Layer 0 and AHB Layer 1 with their slaves), and protocol evolution towards advanced bus protocols]

Help with the issues, but do not fully solve them

More scalable solutions needed

An Answer: Networks-on-Chip (NoCs)

[Figure: example NoC — CPU, DSP, DMA, MPEG, DRAM and accelerator cores, each attached through an NI to a fabric of switches]

Packet-based communication

NIs convert transactions by cores into packets

Switches route transactions across the system

First Assessment of NoCs

High performance

High parallelism (bandwidth)

Yes: just add links and switches as you add cores

High heterogeneity/flexibility

Yes: just design appropriate NIs, then plug in

Short wiring

Yes: point-to-point, then just place switches as close as needed

Simple, structured wiring

Yes: links are point-to-point, width can be tuned

Problem Solved?

Maybe, but... buses excel in simplicity, low power and low area

When designing a NoC, remember that tradeoffs will be required to keep those under control

Not all designs will require a NoC, only the “complex” ones

How to Design NoCs

How to Make NoCs Tick

A NoC is a small network

Many of the same architectural degrees of freedom

Some problems are less stringent

Static number of nodes

(Roughly) known traffic patterns and requirements

Some problems are much tougher

MANY fewer resources to solve problems

Latencies of nanoseconds, not milliseconds

But... what characterizes a network?

Key NoC Properties

Topology

Routing policy (where)

Switching policy (how)

Flow control policy (when)

Synchronous, asynchronous or mesochronous operation

...and many others!

Huge design space

NoC Topologies

Must comply with demands of…

 performance (bandwidth & latency)

 area, power, routability

Can be split in…

 direct: every switch has a node attached to it

 indirect: only a subset of the switches have nodes attached

NoC Topology Examples: Hypercubes

Compositional design

Example: hypercube topologies

Arrange N = 2^n nodes in an n-dimensional cube

At most n hops from source to destination

High bisection bandwidth

 good for traffic (but can you use it?)

 bad for cost [O(n^2)]

Exploit locality

Node size grows

 as n [input/output cardinality]

 as n^2 [internal crossbar]

Adaptive routing may be possible

[Figure: hypercube topologies of dimension 0-D through 4-D]
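A small worked example (not from the original slides) of why the hop count is bounded by n: nodes can be labelled with n-bit IDs so that neighbours differ in exactly one bit, and the minimum distance between two nodes is then the Hamming distance of their labels. A minimal C sketch:

    #include <stdio.h>

    /* In an n-dimensional hypercube, nodes carry n-bit IDs and two nodes are
     * neighbours iff their IDs differ in exactly one bit.  The minimum hop
     * count is therefore the Hamming distance between the IDs, at most n. */
    static int hop_count(unsigned src, unsigned dst)
    {
        unsigned diff = src ^ dst;
        int hops = 0;
        while (diff) {          /* count the set bits of the XOR */
            hops += diff & 1u;
            diff >>= 1;
        }
        return hops;
    }

    int main(void)
    {
        /* 4-D hypercube (16 nodes): going from node 0000b to node 1011b
         * takes 3 hops, one per differing dimension. */
        printf("hops = %d\n", hop_count(0x0, 0xB));
        return 0;
    }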

NoC Topology Examples: Multistage Topologies

Need to fix hypercube resource requirements

Idea: unroll hypercube vertices

 switch sizes are now bounded, but

 loss of locality

 more hops

 can be blocking; non-blocking with even more stages

NoC Topology Examples: k-ary n-cubes (Mesh Topologies)

Alternate reduction from hypercube: restrict to fewer than log2(N) dimensions, e.g. 2D mesh (2-cube), 3D mesh (3-cube)

Matches with physical world structure and allows for locality

Bounds degree at node

Even more potential for bottlenecks

[Figure: 2D mesh topology]

NoC Topology Examples: Torus

Need to improve mesh performance

Idea: wrap around n-cube ends

2-cube → cylinder

3-cube → donut

Halves worst-case hop count

Can be laid out reasonably efficiently

 maybe 2x cost in channel width?

NoC Topology Examples: Fat-Tree Topologies

Fatter links (actually: more of them) as you go to the root, so bisection bandwidth scales

NoC Routing Policies

Static

 e.g. source routing or coordinate-based

 simpler to implement and validate

Adaptive

 e.g. congestion-based

 potentially faster

 much more expensive

 allows for out-of-order packet delivery

 possibly a bad idea for NoCs

Huge issue: deadlocks
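As a concrete sketch of a static, coordinate-based policy (illustrative C, not taken from the slides): dimension-ordered XY routing on a 2D mesh. The port names and coordinate encoding are invented for the example; the useful property is that packets never turn from the Y dimension back to X, which rules out cyclic channel dependencies and hence routing deadlock on a mesh.

    #include <stdio.h>

    /* Illustrative dimension-ordered (XY) routing on a 2D mesh: correct the X
     * coordinate first, then Y.  Port names and coordinates are invented. */
    typedef struct { int x, y; } coord_t;
    typedef enum { GO_EAST, GO_WEST, GO_NORTH, GO_SOUTH, GO_LOCAL } port_t;

    static port_t xy_route(coord_t here, coord_t dest)
    {
        if (dest.x > here.x) return GO_EAST;
        if (dest.x < here.x) return GO_WEST;
        if (dest.y > here.y) return GO_NORTH;
        if (dest.y < here.y) return GO_SOUTH;
        return GO_LOCAL;   /* arrived: eject towards the local NI */
    }

    int main(void)
    {
        coord_t here = {1, 1}, dest = {3, 0};
        /* The switch at (1,1) forwards east, since the X offset is fixed first. */
        printf("next port = %d\n", (int)xy_route(here, dest));   /* 0 = GO_EAST */
        return 0;
    }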

Deadlocks

[Figure: three nodes A, B, C holding resources in a cycle]

A would like to talk to C, B to A, and C to B

Everybody is stuck!!

Showstopper problem

 avoid by mapping: no route loops

 avoid by architecture: e.g. virtual channels

 provide deadlock recovery

Critical for adaptive routing

 livelocks also possible

NoC Switching Policies

Packet switching

 maximizes global network usage dynamically

 store-and-forward: minimum logic, but higher latency, needs more buffers

 wormhole: minimum buffering, but deadlock-prone, induces congestion

Circuit switching

 optimizes specific transactions

 no contention, no jitter

 requires handshaking; may fail completely

 setup overhead
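A back-of-the-envelope comparison of the two packet-switching options, under simplifying assumptions (one flit per link per cycle, one cycle per switch traversal, no contention); the numbers are illustrative only:

    #include <stdio.h>

    /* Simplified, contention-free latency model:
     *   store-and-forward: each switch waits for the whole packet,
     *                      latency ~ hops * (packet_flits + 1) cycles
     *   wormhole:          only the header pays the per-hop cost,
     *                      latency ~ hops + packet_flits cycles */
    int main(void)
    {
        int hops = 5, packet_flits = 16;

        int saf      = hops * (packet_flits + 1);
        int wormhole = hops + packet_flits;

        printf("store-and-forward: %d cycles\n", saf);        /* 85 cycles */
        printf("wormhole:          %d cycles\n", wormhole);   /* 21 cycles */
        return 0;
    }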

Virtual Channels

Performance improvement using virtual channels

[Figure: two snapshots of a five-node pipeline — without virtual channels, packet A blocks the link and packet B stalls behind it short of its destination; with virtual channels, B advances past the blocked A to its destination]

NoC Flow Control Policies

We need it because...

Sender may inject bursty traffic

Receiver buffers may fill up

Sender and receiver may operate at different frequencies

Arbitrations may be lost

How?

TDMA: pre-defined time slots

Speculative: send first, then wait for confirmation (acknowledge, ACK)

Conservative: wait for token, then send (credit-based)

Remember... links may be pipelined
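A behavioural sketch (an assumption for illustration, not from the slides) of the conservative, credit-based scheme: the sender holds one credit per free slot in the receiver's buffer, spends a credit per flit sent, and regains one whenever the receiver drains a flit.

    #include <stdio.h>

    /* Credit-based flow control, sender side: credits mirror the free slots
     * in the receiver buffer; no credit means the sender must stall. */
    typedef struct { int credits; } sender_t;

    static int try_send(sender_t *s)
    {
        if (s->credits == 0)
            return 0;        /* receiver buffer full: stall the flit */
        s->credits--;        /* one receiver slot is now committed   */
        return 1;
    }

    static void credit_return(sender_t *s)
    {
        s->credits++;        /* the receiver drained one flit */
    }

    int main(void)
    {
        sender_t s = { .credits = 2 };   /* assume a 2-flit receiver buffer */
        printf("%d\n", try_send(&s));    /* 1: first flit goes out          */
        printf("%d\n", try_send(&s));    /* 1: second flit goes out         */
        printf("%d\n", try_send(&s));    /* 0: stalled, no credits left     */
        credit_return(&s);               /* a credit travels back           */
        printf("%d\n", try_send(&s));    /* 1: sending can resume           */
        return 0;
    }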

Example: ACK/NACK Flow Control

[Figure: animation of ACK/NACK flow control on a pipelined link — transmission, ACK and buffering, NACK, ACK/NACK propagation, memory deallocation, retransmission (go-back-N)]
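A minimal sketch of the go-back-N bookkeeping on the sender side (the sequence numbering is invented for illustration): an ACK releases buffered flits up to the acknowledged sequence number, while a NACK rewinds the send pointer so that the NACKed flit and everything after it is retransmitted in order.

    #include <stdio.h>

    /* Go-back-N bookkeeping on the sender side.  Flits between seq_base and
     * seq_next are in flight and kept in the retransmission buffer. */
    static int seq_base = 0;   /* oldest unacknowledged flit */
    static int seq_next = 0;   /* next flit to (re)transmit  */

    static void on_ack(int k)  { seq_base = k + 1; }  /* free buffer up to k */
    static void on_nack(int k) { seq_next = k; }      /* go back to flit k   */

    int main(void)
    {
        seq_next = 5;     /* flits 0..4 have been sent      */
        on_ack(2);        /* flits 0..2 arrived intact      */
        on_nack(4);       /* flit 4 rejected: resend from 4 */
        printf("base=%d next=%d\n", seq_base, seq_next);   /* base=3 next=4 */
        return 0;
    }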

NoC Timing: Synchronous

Flip-flops everywhere, clock tree

Much more streamlined design

The clock tree can burn around 40% of the power budget, plus the flip-flops themselves

Not easy to integrate cores at different speeds

Increasingly difficult to constrain skew and process variance

Worst-case design

NoC Timing: Asynchronous

Potentially allows for data to arrive at any time, solves process variance etc.

Average-case behaviour

Lower power consumption

Maximum flexibility in IP integration

More secure for encryption logic

Less EMI

Much larger area

Can be much slower (if really robust)

Two-way handshake removes the “bet” of synchronous logic

Intermediate implementations exist

Much tougher to design

NoC Timing: Mesochronous

Attempts to optimize latency of long paths

Everybody uses the same clock

Senders embed their clock within packets

Data is sent over long links and arrives out of sync with receiver clock

Embedded clock is used to sample incoming packets

Dual-clocked FIFO restores synchronization

Tough to design

Somewhat defeats the NoC principles

[Figure: mesochronous link — the sender transmits data plus a strobe over the link; a dual-clocked FIFO at the receiver restores synchronization with the receiver clock]

The xpipes NoC

 xpipes is a library of NoC components

Network Interface (NI), Switch, Link

Configurability of parameters such as flit width, amount of buffering, flow control and arbitration policies…

xpipes is designed to be scalable to future technology nodes, architecturally and physically

Leverages a cell synthesis flow, no hard macros

Pipelined links to tackle wire propagation delay

A complete CAD flow is provided to move from the application task graph level to the chip floorplan

The xpipes NoC: the Network Interface

[Figure: initiator NI and target NI — on the OCP side, a packeting block with a routing LUT feeds the request channel and an unpacketing block receives the response channel, with mirror-image blocks in the target NI; the OCP clock and xpipes clock domains meet inside the NI]

Performs packeting/unpacketing

OCP 2.0 protocol to connect to IP cores

Source routing via routing Look-Up Tables

Dual clock operation (OCP clock vs. xpipes clock)
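A hypothetical, much-simplified model of the initiator-NI packeting step (field layouts, the LUT contents and the flit width are invented; the real NI is configurable as described above): look up the source route for the target in the routing LUT, emit a header flit carrying the route, then emit address and data as payload flits.

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_FLITS 8

    /* Invented routing LUT: target id -> source route (a packed list of
     * switch output ports to take, one nibble per hop). */
    static const uint8_t route_lut[4] = { 0x00, 0x12, 0x21, 0x05 };

    static int packetize(int target_id, uint32_t addr, uint32_t wdata,
                         uint32_t flits[MAX_FLITS])
    {
        int n = 0;
        flits[n++] = 0x80000000u | route_lut[target_id]; /* header: route bits */
        flits[n++] = addr;                               /* payload: OCP MAddr */
        flits[n++] = wdata;                              /* payload: OCP MData */
        return n;                                        /* packet length      */
    }

    int main(void)
    {
        uint32_t flits[MAX_FLITS];
        int n = packetize(2, 0x1000, 0xCAFEBABEu, flits);
        for (int i = 0; i < n; i++)
            printf("flit %d: 0x%08X\n", i, (unsigned)flits[i]);
        return 0;
    }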

Basic OCP Concepts

Point-to-point, unidirectional, synchronous

Easy physical implementation

Master/slave, request/response

Well-defined, simple roles

Extensions

Added functionality to support cores with more complex interface requirements

Configurability

Match a core’s requirements exactly

Tailor design to required features only

Reference: [SonicsInc]

Basic OCP Protocol

Master signals: MCmd [3], MAddr [32], MData [32], MRespAccept

Slave signals: SCmdAccept, SResp [2], SData [32]

Request phase: master drives MCmd/MAddr/MData, slave accepts with SCmdAccept

Response phase: slave drives SResp/SData, master accepts with MRespAccept

[Figure: timing diagrams of a read transaction and a write transaction]
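To make the request/response phasing concrete, a behavioural C sketch of a single read transaction (signal-level timing is collapsed into a function call; the toy slave memory and the encodings are invented for illustration):

    #include <stdio.h>

    /* OCP-like read, phases collapsed into calls.  MCmd/SResp encodings and
     * the toy slave memory are invented for the example. */
    enum { MCMD_IDLE, MCMD_WRITE, MCMD_READ };
    enum { SRESP_NULL, SRESP_DVA };                 /* DVA = data valid */

    static unsigned mem[16] = { [4] = 0xDEADBEEF }; /* toy slave memory */

    /* Slave side: accept the request (SCmdAccept) and produce a response. */
    static int slave(int mcmd, unsigned maddr, int *sresp, unsigned *sdata)
    {
        if (mcmd == MCMD_READ) { *sresp = SRESP_DVA;  *sdata = mem[maddr & 0xF]; }
        else                   { *sresp = SRESP_NULL; *sdata = 0; }
        return 1;                                   /* SCmdAccept asserted */
    }

    int main(void)
    {
        int sresp; unsigned sdata;
        int scmdaccept = slave(MCMD_READ, 4, &sresp, &sdata);  /* request phase  */
        if (scmdaccept && sresp == SRESP_DVA)                  /* response phase */
            printf("SData = 0x%08X\n", sdata);                 /* MRespAccept    */
        return 0;
    }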

OCP Extensions

Simple Extensions

Byte Enables

Bursts

Flow Control/Data Handshake

Complex Extensions

Threads and Connections

Sideband Signals

Interrupts, etc.

Testing Signals

The xpipes NoC: the Switch

[Figure: switch block diagram — allocator, arbiter, crossbar, routing & flow control logic]

Input and/or output buffering

Wormhole switching

Supports multiple flow control policies
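One plausible arbitration policy a switch allocator might use is round-robin among the input ports requesting the same output port; the actual policy in xpipes is configurable, so the following C sketch is only illustrative.

    #include <stdio.h>

    #define PORTS 4

    /* Round-robin arbitration: start scanning just after the last winner so
     * that every requesting port is eventually served. */
    static int rr_arbitrate(unsigned req_mask, int last_grant)
    {
        for (int i = 1; i <= PORTS; i++) {
            int p = (last_grant + i) % PORTS;
            if (req_mask & (1u << p))
                return p;
        }
        return -1;                       /* no requests this cycle */
    }

    int main(void)
    {
        int grant = 3;
        /* Input ports 0 and 2 keep requesting the same output port. */
        grant = rr_arbitrate(0x5, grant); printf("grant %d\n", grant);  /* 0 */
        grant = rr_arbitrate(0x5, grant); printf("grant %d\n", grant);  /* 2 */
        return 0;
    }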

The xpipes NoC: the ACK/NACK Link

[Figure: ACK/NACK pipelined link — sender S, a chain of repeaters each carrying FLIT/REQ/ACK wires, receiver R]

Repeaters are pure registers

Buffering and retransmission logic in the sender

The xpipes NoC: the STALL/GO Link

[Figure: STALL/GO pipelined link — sender S, a chain of repeaters each carrying FLIT/REQ/STALL wires, receiver R]

Repeaters are two-entry FIFOs

No retransmission allowed

Quality of Service and the Æthereal NoC

Speaking of Quality of Service...

Signal processing: hard real time, very regular load, high quality, worst-case design, typically on DSPs

Media processing: hard real time, irregular load, high quality, average-case design, SoC/media processors

Multi-media: soft real time, irregular load, limited quality, average-case design, PC/desktop

Very challenging!

Multimedia Application Demands

Increasing functionality and heterogeneity

Higher semantic content/entropy

More dynamism

[Figure: load over time of a VBR MPEG DVD stream — instantaneous load, running average, structural load, and worst-case load]

[Goossens03]

Negotiating NoC Resources

[Goossens03]

[Figure: the same VBR MPEG DVD stream load curve, annotated with steady states and (re)negotiation points]

A QoS Approach

Essential to recover global predictability and improve performance

Applications require it!

It fits well with protocol stack concept

What is QoS?

Requester poses the service request (negotiation)

Provider either commits to or rejects the request

Renegotiate when requirements change

After negotiation, steady states that are predictable

Guaranteed versus best-effort service

Types of commitment:

 correctness, e.g. uncorrupted data

 completion, e.g. no packet loss

 bounds, e.g. maximum latency

QoS + Best Effort

Best-effort services have better average resource utilisation at the cost of unpredictable/unbounded worst-case behaviour

The combination of best-effort & guaranteed services is useful!

QoS in the Æthereal NoC

Conceptually, two disjoint networks

 a network with throughput+latency guarantees (GT)

 a network without those guarantees (best-effort, BE)

Several types of commitment in the network

 combine guaranteed worst-case behaviour with good average resource usage

 the best-effort router programs the guaranteed router

 the guaranteed router gets priority in arbitration

Æthereal Router Architecture

Best-effort router

Wormhole routing

Input queueing

Source routing

Guaranteed throughput router

Contention-free routing

 synchronous, using slot tables

 time-division multiplexed circuits

Store-and-forward routing

Headerless packets

 information is present in slot table

A lot of hardware overhead!!!

Æthereal: Contention-Free Routing

Latency guarantees are easy in circuit switching

With packet switching, need to “emulate”

Schedule packet injection into the network such that packets never contend for the same link at the same time

 in space: disjoint paths

 in time: time-division multiplexing

Use best-effort packets to set up connections

Distributed, concurrent, pipelined, consistent

Compute slot assignment at build time, run time, or combination

Connection opening may be rejected
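A minimal sketch of the slot-table idea (the slot count, table contents and per-hop shift are invented for illustration): a guaranteed-throughput connection can be admitted only if the slot it would occupy is free at every hop along its path, with the slot index advancing by one per hop.

    #include <stdio.h>

    #define SLOTS 4
    #define HOPS  3

    /* slot_table[hop][slot] = id of the connection owning that slot, 0 = free.
     * Contents are invented for illustration. */
    static const int slot_table[HOPS][SLOTS] = {
        {1, 0, 2, 0},     /* router on hop 0 */
        {0, 1, 0, 2},     /* router on hop 1 */
        {2, 0, 1, 0},     /* router on hop 2 */
    };

    /* A new connection may be admitted only if its slot is free at every hop;
     * the slot index advances by one per hop as the flit moves forward. */
    static int can_admit(int first_slot)
    {
        for (int h = 0; h < HOPS; h++)
            if (slot_table[h][(first_slot + h) % SLOTS] != 0)
                return 0;
        return 1;
    }

    int main(void)
    {
        for (int s = 0; s < SLOTS; s++)
            printf("slot %d: %s\n", s, can_admit(s) ? "free path" : "rejected");
        return 0;
    }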
