Federico Angiolini fangiolini@deis.unibo.it
DEIS Università di Bologna
Chips tend to have more than one “core”
Control processors
Accelerators
Memories
I/O
How do we get them to talk to each other?
This is called “ System Interconnect ”
Shared bus topology
Aimed at simple, cost-effective integration of components
Master 0 Master 1 Master 2
Shared Bus
Slave 0 Slave 1 Slave 2
Typical example: ARM Ltd. AMBA AHB
Arbitration among multiple masters
Single outstanding transaction allowed
If wait states are needed, everybody waits
Well... not really.
Let’s consider two trends
System/architectural: systems are becoming highly parallel
Physical: wires are becoming slower
(especially in relative terms)
Parallel systems... OK, but how much?
CPUs: currently four cores (not so many...)
Playstation/Cell: currently nine engines (still OK)
GPUs: currently 100+ shaders (hey!)
Your next cellphone: 100+ cores (!!!)
And the trend is: double every 18 months
“We believe that Intel’s Chip Level Multiprocessing (CMP) architectures represent the future of microprocessors because they deliver massive performance scaling while effectively managing power and heat.”
White paper “Platform 2015: Intel Processor and Platform Evolution for the Next Decade”
Intel IXP2800 with 16 micro-engines and one Intel XScale core
Intel: 80-core chip shown at ISSCC 2007
Rapport: Kilocore (1024 cores), for gaming & media
Expected mid 2007
"The next 25 years of digital signal processing technology will literally integrate hundreds of processors on a single chip to conceive applications beyond our imagination.”
Mike Hames, senior VP,
Texas Instruments
“Focus here is on parallelism and what's referred to as multi-core technology.”
Phil Hester, CTO, AMD
What Does This Mean for the Interconnect?
A new set of requirements!
High performance
Many cores will want to communicate, fast
High parallelism (bandwidth)
Many cores will want to communicate, simultaneously
High heterogeneity/flexibility
Cores will operate at different frequencies, data widths, maybe with different protocols
Logic becomes faster and faster
Global wires don’t
And If We Consider a Floorplan...
2 cm
If you assume a shared bus, the wires have to go all around the chip (i.e. are very long)
Propagation delay
Spaghetti wiring
What Does This Mean for the Interconnect?
A new set of requirements!
Short wiring
Point-to-point and local is best
Simple, structured wiring
Bundles of many wires are impractical to route
Topology Evolution: Hierarchical Buses
(Figure: evolution from the traditional shared bus, through hierarchical/multi-layer buses (AHB Layer 0 and AHB Layer 1) and AHB crossbars, to advanced bus protocols, connecting Masters 0-3 to Slaves 0-6.)
Help with the issues, but do not fully solve them
More scalable solutions needed
An Answer: Networks-on-Chip (NoCs)
(Figure: an example NoC. Cores such as a CPU, DSP, DMA engine, MPEG accelerator, DRAM and a hardware accelerator each attach through a Network Interface (NI) to a fabric of interconnected switches.)
Packet-based communication
NIs convert transactions by cores into packets
Switches route the resulting packets across the system
High performance
High parallelism (bandwidth)
Yes: just add links and switches as you add cores
High heterogeneity/flexibility
Yes: just design appropriate NIs, then plug in
Short wiring
Yes: point-to-point, then just place switches as close as needed
Simple, structured wiring
Yes: links are point-to-point, width can be tuned
Maybe, but... buses excel in simplicity, low power and low area
When designing a NoC, remember that tradeoffs will be required to keep those under control
Not all designs will require a NoC, only the
“complex” ones
A NoC is a small network
Many of the same architectural degrees of freedom
Some problems are less stringent
Static number of nodes
(Roughly) known traffic patterns and requirements
Some problems are much tougher
MANY fewer resources available to solve problems
Latencies of nanoseconds, not milliseconds
But... what characterizes a network?
Topology
Routing policy (where)
Switching policy (how)
Flow control policy (when)
Synchronous, asynchronous or mesochronous operation
...and many others!
Huge design space
Must comply with demands of…
performance (bandwidth & latency)
area, power, routability
Topologies can be split into…
direct: every switch has a node attached to it
indirect: nodes attach only to a specific subset of the switches
NoC Topology Examples: Hypercubes
Compositional design
Example: hypercube topologies
Arrange N = 2^n nodes in an n-dimensional cube
At most n hops from source to destination
High bisection bandwidth
good for traffic (but can you use it?), bad for cost [O(n²)]
Exploit locality
Node size grows: as n [input/output cardinality], as n² [internal crossbar]
Adaptive routing may be possible
(Figure: hypercubes of dimension 0-D through 4-D.)
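As a concrete illustration (not from the original slides), the sketch below computes hypercube hop counts and a deterministic e-cube route in Python; the node IDs, dimension order and example values are assumptions.

```python
# A minimal sketch: hop count and e-cube routing in a binary n-cube.
# Node IDs are n-bit integers; two nodes are linked iff their IDs differ
# in exactly one bit, so the minimum hop count is the Hamming distance
# and is at most n.

def hop_count(src: int, dst: int) -> int:
    """Minimum hops between two hypercube nodes = Hamming distance."""
    return bin(src ^ dst).count("1")

def ecube_route(src: int, dst: int, n: int):
    """Deterministic e-cube routing: fix differing bits in a fixed
    dimension order (0..n-1), yielding one node ID per hop."""
    path, cur = [src], src
    for dim in range(n):
        if (cur ^ dst) & (1 << dim):
            cur ^= 1 << dim          # traverse the link along this dimension
            path.append(cur)
    return path

if __name__ == "__main__":
    n = 4                                  # 4-D hypercube, N = 2**n = 16 nodes
    print(hop_count(0b0000, 0b1011))       # 3 hops
    print(ecube_route(0b0000, 0b1011, n))  # [0, 1, 3, 11]
```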
NoC Topology Examples: Multistage Topologies
Need to fix hypercube resource requirements
Idea: unroll the hypercube vertices into stages
switch sizes are now bounded, but
locality is lost and hop counts grow; can be blocking, non-blocking only with even more stages
NoC Topology Examples: k-ary n-cubes (Mesh Topologies)
Alternate reduction from the hypercube: restrict to fewer than log2(N) dimensions
e.g. 2D mesh (2-cube), 3D mesh (3-cube)
Matches with physical world structure and allows for locality
Bounds degree at node
Even more potential for bottlenecks
(Figure: a 2D mesh.)
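A minimal sketch of dimension-ordered (XY) routing, one common deterministic choice for 2D-mesh NoCs; the coordinate representation and example nodes are assumptions, not part of the original material.

```python
# Dimension-ordered (XY) routing on a 2D mesh: route fully along X first,
# then along Y. This ordering never creates cyclic channel dependencies,
# so it is deadlock-free.

def xy_route(src, dst):
    """Return the list of (x, y) coordinates visited from src to dst."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # walk the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

if __name__ == "__main__":
    print(xy_route((0, 0), (2, 3)))
    # [(0,0), (1,0), (2,0), (2,1), (2,2), (2,3)] -> 5 hops
```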
NoC Topology Examples: Torus
Need to improve mesh performance
Idea: wrap around the n-cube ends
2-cube: cylinder; 3-cube: donut
Halves the worst-case hop count
Can be laid-out reasonably efficiently
maybe 2x cost in channel width?
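A tiny sketch of why the wrap-around links help: per-dimension distance on a ring versus a line. The node count k and the example endpoints are arbitrary assumptions.

```python
# Per-dimension hop count on a k-ary ring (one torus dimension), showing
# how wrap-around links halve the worst-case distance compared to a mesh
# of the same size.

def ring_hops(a: int, b: int, k: int) -> int:
    """Shortest distance on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)        # go the short way around

if __name__ == "__main__":
    k = 8
    print(abs(0 - 7))           # mesh: 7 hops end to end
    print(ring_hops(0, 7, k))   # torus: 1 hop via the wrap-around link
```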
NoC Topology Examples: Fat-Tree Topologies
Fatter links (actually: more of them) as you go to the root, so bisection bandwidth scales
Routing Policies
Static
e.g. source routing or coordinate-based
simpler to implement and validate
Adaptive
e.g. congestion-based
potentially faster, but much more expensive
allows out-of-order packet delivery: possibly a bad idea for NoCs
Huge issue: deadlocks
(Figure: three nodes A, B and C holding resources in a cycle.)
A would like to talk to C, B to A, and C to B
Everybody is stuck!!
Showstopper problem
avoid by mapping: no route loops
avoid by architecture: e.g. virtual channels
provide deadlock recovery
Critical for adaptive routing
livelocks also possible
Packet switching
maximizes global network usage dynamically
store-and-forward: minimum logic, but higher latency, needs more buffers
wormhole: minimum buffering, but deadlock-prone, induces congestion
Circuit switching
optimizes specific transactions: no contention, no jitter
requires handshaking: may fail completely, setup overhead
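To make the latency trade-off concrete, here is a rough contention-free model (an assumption, not from the slides) comparing store-and-forward and wormhole delivery for a packet of L flits over H hops, with one cycle per flit per hop.

```python
# Zero-contention latency estimates: store-and-forward buffers the whole
# packet at every hop, while wormhole pipelines flits across hops.

def store_and_forward_latency(H: int, L: int) -> int:
    return H * L            # each hop waits for the full packet

def wormhole_latency(H: int, L: int) -> int:
    return H + L - 1        # header takes H cycles, body streams behind it

if __name__ == "__main__":
    H, L = 5, 16
    print(store_and_forward_latency(H, L))  # 80 cycles
    print(wormhole_latency(H, L))           # 20 cycles
```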
Performance improvement using virtual channels
(Figure: packets A and B crossing Nodes 1 to 5. Without virtual channels, B blocks behind A even though B's destination is free; with virtual channels, B bypasses the blocked A and reaches its destination.)
We need it because...
Sender may inject bursty traffic
Receiver buffers may fill up
Sender and receiver may operate at different frequencies
Arbitrations may be lost
How?
TDMA: pre-defined time slots
Speculative: send first, then wait for confirmation (acknowledge, ACK)
Conservative: wait for tokens, then send (credit-based)
Remember... links may be pipelined
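A minimal sketch of the conservative, credit-based scheme: the sender starts with as many credits as the receiver has buffer slots and stalls when they run out. Class and parameter names are illustrative assumptions.

```python
from collections import deque

class CreditLink:
    """Credit-based flow control: the sender holds one credit per free
    slot in the receiver's buffer, so it can never overrun it."""

    def __init__(self, buffer_depth: int):
        self.credits = buffer_depth      # tokens held by the sender
        self.rx_buffer = deque()

    def send(self, flit) -> bool:
        if self.credits == 0:
            return False                 # sender must stall
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain(self):
        flit = self.rx_buffer.popleft()
        self.credits += 1                # credit travels back to the sender
        return flit

if __name__ == "__main__":
    link = CreditLink(buffer_depth=2)
    print([link.send(f) for f in "abc"])  # [True, True, False]
    link.drain()
    print(link.send("c"))                 # True again after a credit returns
```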
Example: ACK/NACK Flow Control
(Figure: ACK/NACK timeline: transmission, ACK and buffering, NACK, ACK/NACK propagation, memory deallocation, retransmission in Go-back-N fashion.)
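A sketch of the sender side of Go-back-N retransmission as it might sit behind an ACK/NACK link; the interface (transmit/on_ack/on_nack) is a hypothetical simplification, not the actual hardware.

```python
class GoBackNSender:
    """Transmitted flits stay buffered until ACKed; on a NACK the sender
    rewinds and resends everything from the NACKed flit onwards."""

    def __init__(self, flits):
        self.flits = list(flits)
        self.base = 0            # oldest unacknowledged flit
        self.next = 0            # next flit to (re)transmit

    def transmit(self):
        if self.next < len(self.flits):
            flit = self.flits[self.next]
            self.next += 1
            return flit
        return None

    def on_ack(self, idx):
        self.base = idx + 1      # deallocate buffered flits up to idx

    def on_nack(self, idx):
        self.next = idx          # go back: resend from the rejected flit

if __name__ == "__main__":
    s = GoBackNSender("ABCD")
    print(s.transmit(), s.transmit(), s.transmit())  # A B C
    s.on_nack(1)                                     # B was rejected
    print(s.transmit(), s.transmit())                # B C again
```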
Synchronous design: flip-flops everywhere, clock tree
Much more streamlined design
Clock tree burns 40% power budget, plus flip flops themselves
Not easy to integrate cores at different speeds
Increasingly difficult to constrain skew and process variance
Worst-case design
Asynchronous design: potentially allows data to arrive at any time, solves process variance etc.
Average-case behaviour
Lower power consumption
Maximum flexibility in IP integration
More secure for encryption logic
Less EMI
Much larger area
Can be much slower (if really robust)
Two-way handshake removes the “bet” of synchronous logic
Intermediate implementations exist
Much tougher to design
Mesochronous (source-synchronous) links: attempt to optimize the latency of long paths
Everybody uses the same clock
Senders embed their clock within packets
Data is sent over long links and arrives out of sync with receiver clock
Embedded clock is used to sample incoming packets
Dual-clocked FIFO restores synchronization
Tough to design
Somewhat defeats the NoC principles
(Figure: source-synchronous link: the sender drives Data plus a Strobe carrying its clock (CK) over the link; a dual-clocked FIFO at the receiver resynchronizes the data to the receiver clock (CK).)
xpipes is a library of NoC components
Network Interface (NI), Switch, Link
Configurability of parameters such as flit width, amount of buffering, flow control and arbitration policies…
xpipes is designed to be scalable to future technology nodes, architecturally and physically
Leverages a cell synthesis flow, no hard macros
Pipelined links to tackle wire propagation delay
A complete CAD flow is provided to move from the application task graph level to the chip floorplan
The xpipes NoC: the Network Interface
(Figure: initiator NI and target NI. On the request channel, a routing LUT and packeting logic turn OCP transactions into packets injected into the NoC topology; on the response channel, returning packets are unpacketed. The OCP side runs on the OCP clk, the NoC side on the xpipes clk.)
Performs packeting/unpacketing
OCP 2.0 protocol to connect to IP cores
Source routing via routing Look-Up Tables
Dual Clock operation
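A toy sketch of the initiator-NI idea: look up a source route in a LUT, prepend it as a header, and chop the payload into flits. The table contents, flit width and packet format are invented for illustration and do not reflect the real xpipes packet layout.

```python
ROUTE_LUT = {            # hypothetical table: target -> source route
    "DRAM": [0, 2, 1],   # output port to take at each switch along the path
    "Accel": [1, 0],
}

def packetize(target: str, payload: bytes, flit_bytes: int = 4):
    """Build a packet: header flit with the source route, then body flits."""
    header = {"route": ROUTE_LUT[target]}   # source routing: path decided here
    body = [payload[i:i + flit_bytes]
            for i in range(0, len(payload), flit_bytes)]
    return [header] + body

if __name__ == "__main__":
    print(packetize("DRAM", b"\xde\xad\xbe\xef\x00\x01"))
```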
OCP
Point-to-point, unidirectional, synchronous
Easy physical implementation
Master/slave, request/response
Well-defined, simple roles
Extensions
Added functionality to support cores with more complex interface requirements
Configurability
Match a core’s requirements exactly
Tailor design to required features only
Reference: [SonicsInc]
(Figure: OCP read and write transaction timing. Request-phase signals: MCmd [3], MAddr [32], MData [32], SCmdAccept; response-phase signals: SResp [2], SData [32], MRespAccept.)
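A highly simplified behavioural sketch of the two phases above (request handshaked by SCmdAccept, read response carried on SResp/SData); it ignores timing, bursts and most of the OCP 2.0 specification, and the helper names are assumptions.

```python
# Toy slave model: writes are posted (accepted, no response phase),
# reads return a response with data. "DVA" stands for data valid/accept.

MEM = {}

def slave_request(mcmd, maddr, mdata=None):
    """Request phase: the slave accepts the command (SCmdAccept)."""
    if mcmd == "WR":
        MEM[maddr] = mdata
        return {"SCmdAccept": 1}                 # posted write, no response
    if mcmd == "RD":
        return {"SCmdAccept": 1,
                "SResp": "DVA",
                "SData": MEM.get(maddr, 0)}
    return {"SCmdAccept": 0}

if __name__ == "__main__":
    slave_request("WR", 0x1000, 0xCAFE)          # write transaction
    resp = slave_request("RD", 0x1000)           # read transaction
    print(hex(resp["SData"]))                    # 0xcafe
```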
Simple Extensions
Byte Enables
Bursts
Flow Control/Data Handshake
Complex Extensions
Threads and Connections
Sideband Signals
Interrupts, etc.
Testing Signals
The xpipes NoC: the Switch
(Figure: switch block diagram: allocator, arbiter, crossbar, routing & flow control logic.)
Input and/or output buffering
Wormhole switching
Supports multiple flow control policies
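A minimal sketch of one possible arbitration policy, round-robin among input ports contending for the same output; the actual xpipes arbiter is configurable and not necessarily implemented this way.

```python
def round_robin_arbiter(requests, last_grant):
    """requests: one boolean per input port. Returns the granted port,
    starting the search just after last_grant so no port is starved."""
    n = len(requests)
    for i in range(1, n + 1):
        port = (last_grant + i) % n
        if requests[port]:
            return port
    return None                       # no port is requesting

if __name__ == "__main__":
    grant = 0
    for reqs in ([True, True, False, True],) * 3:
        grant = round_robin_arbiter(reqs, grant)
        print(grant)                  # 1, 3, 0: fairness across cycles
```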
The xpipes NoC: the ACK/NACK Link
(Figure: pipelined link from sender S to receiver R; each repeater stage carries FLIT, REQ and ACK signals.)
Repeaters are pure registers
Buffering and retransmission logic in the sender
The xpipes NoC: the STALL/GO Link
(Figure: pipelined link from sender S to receiver R; each repeater stage carries FLIT, REQ and STALL signals.)
Repeaters are two-entry FIFOs
No retransmission allowed
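A behavioural sketch of why a two-entry FIFO per repeater suffices under STALL/GO: the extra slot absorbs the flit already in flight when STALL is raised, so nothing is dropped and no retransmission is needed. Class and method names are assumptions.

```python
from collections import deque

class StallGoStage:
    """One STALL/GO repeater stage with a two-entry buffer and a STALL
    backpressure signal toward the upstream stage."""

    def __init__(self):
        self.fifo = deque()

    @property
    def stall(self) -> bool:
        return len(self.fifo) >= 2      # backpressure: buffer is full

    def accept(self, flit):
        assert not self.stall           # upstream must honour STALL
        self.fifo.append(flit)

    def forward(self, downstream_stall: bool):
        if self.fifo and not downstream_stall:
            return self.fifo.popleft()
        return None

if __name__ == "__main__":
    stage = StallGoStage()
    stage.accept("f0"); stage.accept("f1")
    print(stage.stall)                                          # True
    print(stage.forward(downstream_stall=False), stage.stall)   # 'f0', False
```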
Speaking of Quality of Service...
Signal processing: hard real time, very regular load, high quality, worst-case design, typically on DSPs
Media processing: hard real time, irregular load, high quality, average-case design, SoC/media processors
Multi-media: soft real time, irregular load, limited quality, average-case design, PC/desktop
Very challenging!
Multimedia Application Demands
Increasing functionality and heterogeneity
Higher semantic content/entropy
More dynamism
(Figure [Gossens03]: load of a VBR MPEG DVD stream over time, showing the instantaneous load, running average, structural load and worst-case load. A second annotated plot marks the steady states and the points where requirements are (re)negotiated.)
Essential to recover global predictability and improve performance
Applications require it!
It fits well with protocol stack concept
What is QoS?
Requester poses the service request ( negotiation )
Provider either commits to or rejects the request
Renegotiate when requirements change
After negotiation, steady states that are predictable
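A minimal sketch of the negotiation step as simple admission control on one link's bandwidth; capacities, units and the interface are assumptions chosen only to illustrate commit/reject/renegotiate.

```python
class LinkQoS:
    """Commit to a bandwidth request only if enough capacity remains,
    otherwise reject; the requester may renegotiate later."""

    def __init__(self, capacity_mbps: float):
        self.capacity = capacity_mbps
        self.committed = {}              # connection id -> reserved bandwidth

    def negotiate(self, conn_id: str, mbps: float) -> bool:
        if sum(self.committed.values()) + mbps <= self.capacity:
            self.committed[conn_id] = mbps
            return True                  # commitment: guaranteed service
        return False                     # rejection: fall back to best effort

    def release(self, conn_id: str):
        self.committed.pop(conn_id, None)

if __name__ == "__main__":
    link = LinkQoS(capacity_mbps=800)
    print(link.negotiate("mpeg", 600))   # True
    print(link.negotiate("audio", 300))  # False: would exceed capacity
    link.release("mpeg")
    print(link.negotiate("audio", 300))  # True after renegotiation
```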
Guaranteed versus best-effort service
Types of commitment: correctness (e.g. uncorrupted data), completion (e.g. no packet loss), bounds (e.g. maximum latency)
Best-effort services have better average resource utilisation at the cost of unpredictable/unbounded worst-case behaviour
The combination of best-effort & guaranteed services is useful!
Conceptually, two disjoint networks
a network with throughput+latency guarantees (GT)
a network without those guarantees (best-effort, BE)
Several types of commitment in the network
combine guaranteed worst-case behaviour with good average resource usage: the guaranteed router wins priority/arbitration, and is programmed through the best-effort router
Best-effort router
Wormhole routing
Input queueing
Source routing
Guaranteed throughput router
Contention-free routing
synchronous, using slot tables (time-division multiplexed circuits)
Store-and-forward routing
Headerless packets
information is present in slot table
A lot of hardware overhead!!!
Æthereal: Contention-Free Routing
Latency guarantees are easy in circuit switching
With packet switching, need to “emulate”
Schedule packet injection into the network such that packets never contend for the same link at the same time
in space: disjoint paths
in time: time-division multiplexing
Use best-effort packets to set up connections
Distributed, concurrent, pipelined, consistent
Compute slot assignment at build time, run time, or combination
Connection opening may be rejected
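A minimal sketch of contention-free slot assignment in this style, where a packet advances one hop per slot; the slot-table size, link naming and the incremental-slot rule are assumptions for illustration.

```python
# Time is divided into S slots; a connection that leaves its source in
# slot s uses slot (s + i) mod S on the i-th link of its path. A request
# is committed only if every required (link, slot) pair is still free.

S = 8                                    # slot table size (assumed)
tables = {}                              # link -> set of reserved slots

def try_open(path_links, slot):
    """Reserve slot at the first link, slot+i at link i. Returns True on
    commitment, False on rejection (the caller may retry another slot)."""
    needed = [(link, (slot + i) % S) for i, link in enumerate(path_links)]
    if any(s in tables.setdefault(l, set()) for l, s in needed):
        return False                     # contention: reject the request
    for l, s in needed:
        tables[l].add(s)
    return True

if __name__ == "__main__":
    print(try_open(["A->B", "B->C"], slot=0))   # True: reserves A->B@0, B->C@1
    print(try_open(["B->C", "C->D"], slot=1))   # False: B->C slot 1 is taken
    print(try_open(["B->C", "C->D"], slot=2))   # True
```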