Federico Angiolini fangiolini@deis.unibo.it
DEIS Università di Bologna
Chips tend to have more than one “core”
Control processors
Accelerators
Memories
I/O
How do we get them to talk to each other?
This is called “ System Interconnect ”
Shared bus topology
Aimed at simple, cost-effective integration of components
Master 0 Master 1 Master 2
Shared Bus
Slave 0 Slave 1 Slave 2
Typical example: ARM Ltd. AMBA AHB
Arbitration among multiple masters
Single outstanding transaction allowed
If wait states are needed, everybody waits
Well... not really.
Let’s consider two trends
System/architectural: systems are becoming highly parallel
Physical: wires are becoming slower
(especially in relative terms)
Parallel systems... OK, but how much?
CPUs: currently four cores (not so many...)
Playstation/Cell: currently nine engines (still OK)
GPUs: currently 100+ shaders (hey!)
Your next cellphone: 100+ cores (!!!)
And the trend is: double every 18 months
“We believe that Intel’s Chip Level Multiprocessing (CMP) architectures represent the future of microprocessors because they deliver massive performance scaling while effectively managing power and heat.”
White paper “Platform 2015: Intel Processor and Platform Evolution for the Next Decade”
Intel IXP2800 with 16 micro-engines and one Intel XScale core
Intel: 80-core chip shown at ISSCC 2007
Rapport: Kilocore (1024 cores), for gaming & media
Expected mid 2007
"The next 25 years of digital signal processing technology will literally integrate hundreds of processors on a single chip to conceive applications beyond our imagination.”
Mike Hames, senior VP,
Texas Instruments
“Focus here is on parallelism and what's referred to as multi-core technology.”
Phil Hester, CTO, AMD
What Does This Mean for the Interconnect?
A new set of requirements!
High performance
Many cores will want to communicate, fast
High parallelism (bandwidth)
Many cores will want to communicate, simultaneously
High heterogeneity/flexibility
Cores will operate at different frequencies, data widths, maybe with different protocols
Logic becomes faster and faster
Global wires don’t
And If We Consider a Floorplan...
2 cm
If you assume a shared bus, the wires have to go all around the chip (i.e. are very long)
Propagation delay
Spaghetti wiring
What Does This Mean for the Interconnect?
A new set of requirements!
Short wiring
Point-to-point and local is best
Simple, structured wiring
Bundles of many wires are impractical to route
Topology Evolution: Hierarchical Buses
(Figure: evolution from the traditional shared bus, through hierarchical/multi-layer buses (AHB Layer 0 and AHB Layer 1) and AHB crossbars, to advanced bus protocols, connecting Masters 0-3 to Slaves 0-6.)
Help with the issues, but do not fully solve them
More scalable solutions needed
An Answer: Networks-on-Chip (NoCs)
(Figure: an example NoC. Cores such as a CPU, DSP, DMA engine, MPEG accelerator, DRAM and a hardware accelerator each attach through a Network Interface (NI) to a fabric of interconnected switches.)
Packet-based communication
NIs convert transactions by cores into packets
Switches route the resulting packets across the system
High performance
High parallelism (bandwidth)
Yes: just add links and switches as you add cores
High heterogeneity/flexibility
Yes: just design appropriate NIs, then plug in
Short wiring
Yes: point-to-point, then just place switches as close as needed
Simple, structured wiring
Yes: links are point-to-point, width can be tuned
Maybe, but... buses excel in simplicity, low power and low area
When designing a NoC, remember that tradeoffs will be required to keep those under control
Not all designs will require a NoC, only the
“complex” ones
A NoC is a small network
Many of the same architectural degrees of freedom
Some problems are less stringent
Static number of nodes
(Roughly) known traffic patterns and requirements
Some problems are much tougher
MANY fewer resources available to solve problems
Latencies of nanoseconds, not milliseconds
But... what characterizes a network?
Topology
Routing policy (where)
Switching policy (how)
Flow control policy (when)
Synchronous, asynchronous or mesochronous operation
...and many others!
Huge design space
Must comply with demands of…
performance (bandwidth & latency)
area, power, routability
Topologies can be split into…
direct: every switch has a node attached to it
indirect: nodes attach only to a specific subset of the switches
NoC Topology Examples: Hypercubes
Compositional design
Example: hypercube topologies
Arrange N = 2^n nodes in an n-dimensional cube
At most n hops from source to destination
High bisection bandwidth
good for traffic (but can you use it?), bad for cost [O(n²)]
Exploit locality
Node size grows: as n [input/output cardinality], as n² [internal crossbar]
Adaptive routing may be possible
(Figure: hypercubes of dimension 0-D through 4-D.)
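As a concrete illustration (not from the original slides), the sketch below computes hypercube hop counts and a deterministic e-cube route in Python; the node IDs, dimension order and example values are assumptions.

```python
# A minimal sketch: hop count and e-cube routing in a binary n-cube.
# Node IDs are n-bit integers; two nodes are linked iff their IDs differ
# in exactly one bit, so the minimum hop count is the Hamming distance
# and is at most n.

def hop_count(src: int, dst: int) -> int:
    """Minimum hops between two hypercube nodes = Hamming distance."""
    return bin(src ^ dst).count("1")

def ecube_route(src: int, dst: int, n: int):
    """Deterministic e-cube routing: fix differing bits in a fixed
    dimension order (0..n-1), yielding one node ID per hop."""
    path, cur = [src], src
    for dim in range(n):
        if (cur ^ dst) & (1 << dim):
            cur ^= 1 << dim          # traverse the link along this dimension
            path.append(cur)
    return path

if __name__ == "__main__":
    n = 4                                  # 4-D hypercube, N = 2**n = 16 nodes
    print(hop_count(0b0000, 0b1011))       # 3 hops
    print(ecube_route(0b0000, 0b1011, n))  # [0, 1, 3, 11]
```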
NoC Topology Examples: Multistage Topologies
Need to fix hypercube resource requirements
Idea: unroll the hypercube vertices into stages
switch sizes are now bounded, but
locality is lost and hop counts grow; can be blocking, non-blocking only with even more stages
NoC Topology Examples: k-ary n-cubes (Mesh Topologies)
Alternate reduction from the hypercube: restrict to fewer than log2(N) dimensions
e.g. 2D mesh (2-cube), 3D mesh (3-cube)
Matches with physical world structure and allows for locality
Bounds degree at node
Even more potential for bottlenecks
(Figure: a 2D mesh.)
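A minimal sketch of dimension-ordered (XY) routing, one common deterministic choice for 2D-mesh NoCs; the coordinate representation and example nodes are assumptions, not part of the original material.

```python
# Dimension-ordered (XY) routing on a 2D mesh: route fully along X first,
# then along Y. This ordering never creates cyclic channel dependencies,
# so it is deadlock-free.

def xy_route(src, dst):
    """Return the list of (x, y) coordinates visited from src to dst."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # walk the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

if __name__ == "__main__":
    print(xy_route((0, 0), (2, 3)))
    # [(0,0), (1,0), (2,0), (2,1), (2,2), (2,3)] -> 5 hops
```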
NoC Topology Examples: Torus
Need to improve mesh performance
Idea: wrap around the n-cube ends
2-cube: cylinder; 3-cube: donut
Halves the worst-case hop count
Can be laid-out reasonably efficiently
maybe 2x cost in channel width?
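A tiny sketch of why the wrap-around links help: per-dimension distance on a ring versus a line. The node count k and the example endpoints are arbitrary assumptions.

```python
# Per-dimension hop count on a k-ary ring (one torus dimension), showing
# how wrap-around links halve the worst-case distance compared to a mesh
# of the same size.

def ring_hops(a: int, b: int, k: int) -> int:
    """Shortest distance on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)        # go the short way around

if __name__ == "__main__":
    k = 8
    print(abs(0 - 7))           # mesh: 7 hops end to end
    print(ring_hops(0, 7, k))   # torus: 1 hop via the wrap-around link
```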
NoC Topology Examples: Fat-Tree Topologies
Fatter links (actually: more of them) as you go to the root, so bisection bandwidth scales
Routing Policies
Static
e.g. source routing or coordinate-based
simpler to implement and validate
Adaptive
e.g. congestion-based
potentially faster, but much more expensive
allows out-of-order packet delivery: possibly a bad idea for NoCs
Huge issue: deadlocks
(Figure: three nodes A, B and C holding resources in a cycle.)
A would like to talk to C, B to A, and C to B
Everybody is stuck!!
Showstopper problem
avoid by mapping: no route loops
avoid by architecture: e.g. virtual channels
provide deadlock recovery
Critical for adaptive routing
livelocks also possible
Packet switching
maximizes global network usage dynamically
store-and-forward: minimum logic, but higher latency, needs more buffers
wormhole: minimum buffering, but deadlock-prone, induces congestion
Circuit switching
optimizes specific transactions: no contention, no jitter
requires handshaking: may fail completely, setup overhead
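To make the latency trade-off concrete, here is a rough contention-free model (an assumption, not from the slides) comparing store-and-forward and wormhole delivery for a packet of L flits over H hops, with one cycle per flit per hop.

```python
# Zero-contention latency estimates: store-and-forward buffers the whole
# packet at every hop, while wormhole pipelines flits across hops.

def store_and_forward_latency(H: int, L: int) -> int:
    return H * L            # each hop waits for the full packet

def wormhole_latency(H: int, L: int) -> int:
    return H + L - 1        # header takes H cycles, body streams behind it

if __name__ == "__main__":
    H, L = 5, 16
    print(store_and_forward_latency(H, L))  # 80 cycles
    print(wormhole_latency(H, L))           # 20 cycles
```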
Performance improvement using virtual channels
(Figure: packets A and B crossing Nodes 1 to 5. Without virtual channels, B blocks behind A even though B's destination is free; with virtual channels, B bypasses the blocked A and reaches its destination.)
We need it because...
Sender may inject bursty traffic
Receiver buffers may fill up
Sender and receiver may operate at different frequencies
Arbitrations may be lost
How?
TDMA: pre-defined time slots
Speculative: send first, then wait for confirmation (acknowledge, ACK)
Conservative: wait for tokens, then send (credit-based)
Remember... links may be pipelined
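A minimal sketch of the conservative, credit-based scheme: the sender starts with as many credits as the receiver has buffer slots and stalls when they run out. Class and parameter names are illustrative assumptions.

```python
from collections import deque

class CreditLink:
    """Credit-based flow control: the sender holds one credit per free
    slot in the receiver's buffer, so it can never overrun it."""

    def __init__(self, buffer_depth: int):
        self.credits = buffer_depth      # tokens held by the sender
        self.rx_buffer = deque()

    def send(self, flit) -> bool:
        if self.credits == 0:
            return False                 # sender must stall
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain(self):
        flit = self.rx_buffer.popleft()
        self.credits += 1                # credit travels back to the sender
        return flit

if __name__ == "__main__":
    link = CreditLink(buffer_depth=2)
    print([link.send(f) for f in "abc"])  # [True, True, False]
    link.drain()
    print(link.send("c"))                 # True again after a credit returns
```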
Example: ACK/NACK Flow Control
(Figure: ACK/NACK timeline: transmission, ACK and buffering, NACK, ACK/NACK propagation, memory deallocation, retransmission in Go-back-N fashion.)
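A sketch of the sender side of Go-back-N retransmission as it might sit behind an ACK/NACK link; the interface (transmit/on_ack/on_nack) is a hypothetical simplification, not the actual hardware.

```python
class GoBackNSender:
    """Transmitted flits stay buffered until ACKed; on a NACK the sender
    rewinds and resends everything from the NACKed flit onwards."""

    def __init__(self, flits):
        self.flits = list(flits)
        self.base = 0            # oldest unacknowledged flit
        self.next = 0            # next flit to (re)transmit

    def transmit(self):
        if self.next < len(self.flits):
            flit = self.flits[self.next]
            self.next += 1
            return flit
        return None

    def on_ack(self, idx):
        self.base = idx + 1      # deallocate buffered flits up to idx

    def on_nack(self, idx):
        self.next = idx          # go back: resend from the rejected flit

if __name__ == "__main__":
    s = GoBackNSender("ABCD")
    print(s.transmit(), s.transmit(), s.transmit())  # A B C
    s.on_nack(1)                                     # B was rejected
    print(s.transmit(), s.transmit())                # B C again
```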
Synchronous design: flip-flops everywhere, clock tree
Much more streamlined design
Clock tree burns 40% power budget, plus flip flops themselves
Not easy to integrate cores at different speeds
Increasingly difficult to constrain skew and process variance
Worst-case design
Asynchronous design: potentially allows data to arrive at any time, solves process variance etc.
Average-case behaviour
Lower power consumption
Maximum flexibility in IP integration
More secure for encryption logic
Less EMI
Much larger area
Can be much slower (if really robust)
Two-way handshake removes the “bet” of synchronous logic
Intermediate implementations exist
Much tougher to design
Mesochronous (source-synchronous) links: attempt to optimize the latency of long paths
Everybody uses the same clock
Senders embed their clock within packets
Data is sent over long links and arrives out of sync with receiver clock
Embedded clock is used to sample incoming packets
Dual-clocked FIFO restores synchronization
Tough to design
Somewhat defeats the NoC principles
(Figure: source-synchronous link: the sender drives Data plus a Strobe carrying its clock (CK) over the link; a dual-clocked FIFO at the receiver resynchronizes the data to the receiver clock (CK).)
xpipes is a library of NoC components
Network Interface (NI), Switch, Link
Configurability of parameters such as flit width, amount of buffering, flow control and arbitration policies…
xpipes is designed to be scalable to future technology nodes, architecturally and physically
Leverages a cell synthesis flow, no hard macros
Pipelined links to tackle wire propagation delay
A complete CAD flow is provided to move from the application task graph level to the chip floorplan
The xpipes NoC: the Network Interface
(Figure: initiator NI and target NI. On the request channel, a routing LUT and packeting logic turn OCP transactions into packets injected into the NoC topology; on the response channel, returning packets are unpacketed. The OCP side runs on the OCP clk, the NoC side on the xpipes clk.)
Performs packeting/unpacketing
OCP 2.0 protocol to connect to IP cores
Source routing via routing Look-Up Tables
Dual Clock operation
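A toy sketch of the initiator-NI idea: look up a source route in a LUT, prepend it as a header, and chop the payload into flits. The table contents, flit width and packet format are invented for illustration and do not reflect the real xpipes packet layout.

```python
ROUTE_LUT = {            # hypothetical table: target -> source route
    "DRAM": [0, 2, 1],   # output port to take at each switch along the path
    "Accel": [1, 0],
}

def packetize(target: str, payload: bytes, flit_bytes: int = 4):
    """Build a packet: header flit with the source route, then body flits."""
    header = {"route": ROUTE_LUT[target]}   # source routing: path decided here
    body = [payload[i:i + flit_bytes]
            for i in range(0, len(payload), flit_bytes)]
    return [header] + body

if __name__ == "__main__":
    print(packetize("DRAM", b"\xde\xad\xbe\xef\x00\x01"))
```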
OCP
Point-to-point, unidirectional, synchronous
Easy physical implementation
Master/slave, request/response
Well-defined, simple roles
Extensions
Added functionality to support cores with more complex interface requirements
Configurability
Match a core’s requirements exactly
Tailor design to required features only
Reference: [SonicsInc]
(Figure: OCP read and write transaction timing. Request-phase signals: MCmd [3], MAddr [32], MData [32], SCmdAccept; response-phase signals: SResp [2], SData [32], MRespAccept.)
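A highly simplified behavioural sketch of the two phases above (request handshaked by SCmdAccept, read response carried on SResp/SData); it ignores timing, bursts and most of the OCP 2.0 specification, and the helper names are assumptions.

```python
# Toy slave model: writes are posted (accepted, no response phase),
# reads return a response with data. "DVA" stands for data valid/accept.

MEM = {}

def slave_request(mcmd, maddr, mdata=None):
    """Request phase: the slave accepts the command (SCmdAccept)."""
    if mcmd == "WR":
        MEM[maddr] = mdata
        return {"SCmdAccept": 1}                 # posted write, no response
    if mcmd == "RD":
        return {"SCmdAccept": 1,
                "SResp": "DVA",
                "SData": MEM.get(maddr, 0)}
    return {"SCmdAccept": 0}

if __name__ == "__main__":
    slave_request("WR", 0x1000, 0xCAFE)          # write transaction
    resp = slave_request("RD", 0x1000)           # read transaction
    print(hex(resp["SData"]))                    # 0xcafe
```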
Simple Extensions
Byte Enables
Bursts
Flow Control/Data Handshake
Complex Extensions
Threads and Connections
Sideband Signals
Interrupts, etc.
Testing Signals
The xpipes NoC: the Switch
(Figure: switch block diagram: allocator, arbiter, crossbar, routing & flow control logic.)
Input and/or output buffering
Wormhole switching
Supports multiple flow control policies
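A minimal sketch of one possible arbitration policy, round-robin among input ports contending for the same output; the actual xpipes arbiter is configurable and not necessarily implemented this way.

```python
def round_robin_arbiter(requests, last_grant):
    """requests: one boolean per input port. Returns the granted port,
    starting the search just after last_grant so no port is starved."""
    n = len(requests)
    for i in range(1, n + 1):
        port = (last_grant + i) % n
        if requests[port]:
            return port
    return None                       # no port is requesting

if __name__ == "__main__":
    grant = 0
    for reqs in ([True, True, False, True],) * 3:
        grant = round_robin_arbiter(reqs, grant)
        print(grant)                  # 1, 3, 0: fairness across cycles
```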
The xpipes NoC: the ACK/NACK Link
(Figure: pipelined link from sender S to receiver R; each repeater stage carries FLIT, REQ and ACK signals.)
Repeaters are pure registers
Buffering and retransmission logic in the sender
The xpipes NoC: the STALL/GO Link
(Figure: pipelined link from sender S to receiver R; each repeater stage carries FLIT, REQ and STALL signals.)
Repeaters are two-entry FIFOs
No retransmission allowed
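A behavioural sketch of why a two-entry FIFO per repeater suffices under STALL/GO: the extra slot absorbs the flit already in flight when STALL is raised, so nothing is dropped and no retransmission is needed. Class and method names are assumptions.

```python
from collections import deque

class StallGoStage:
    """One STALL/GO repeater stage with a two-entry buffer and a STALL
    backpressure signal toward the upstream stage."""

    def __init__(self):
        self.fifo = deque()

    @property
    def stall(self) -> bool:
        return len(self.fifo) >= 2      # backpressure: buffer is full

    def accept(self, flit):
        assert not self.stall           # upstream must honour STALL
        self.fifo.append(flit)

    def forward(self, downstream_stall: bool):
        if self.fifo and not downstream_stall:
            return self.fifo.popleft()
        return None

if __name__ == "__main__":
    stage = StallGoStage()
    stage.accept("f0"); stage.accept("f1")
    print(stage.stall)                                          # True
    print(stage.forward(downstream_stall=False), stage.stall)   # 'f0', False
```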
Speaking of Quality of Service...
Signal processing: hard real time, very regular load, high quality, worst-case design, typically on DSPs
Media processing: hard real time, irregular load, high quality, average-case design, SoC/media processors
Multi-media: soft real time, irregular load, limited quality, average-case design, PC/desktop
Very challenging!
Multimedia Application Demands
Increasing functionality and heterogeneity
Higher semantic content/entropy
More dynamism
(Figure [Gossens03]: load of a VBR MPEG DVD stream over time, showing the instantaneous load, running average, structural load and worst-case load. A second annotated plot marks the steady states and the points where requirements are (re)negotiated.)
Essential to recover global predictability and improve performance
Applications require it!
It fits well with protocol stack concept
What is QoS?
Requester poses the service request ( negotiation )
Provider either commits to or rejects the request
Renegotiate when requirements change
After negotiation, steady states that are predictable
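A minimal sketch of the negotiation step as simple admission control on one link's bandwidth; capacities, units and the interface are assumptions chosen only to illustrate commit/reject/renegotiate.

```python
class LinkQoS:
    """Commit to a bandwidth request only if enough capacity remains,
    otherwise reject; the requester may renegotiate later."""

    def __init__(self, capacity_mbps: float):
        self.capacity = capacity_mbps
        self.committed = {}              # connection id -> reserved bandwidth

    def negotiate(self, conn_id: str, mbps: float) -> bool:
        if sum(self.committed.values()) + mbps <= self.capacity:
            self.committed[conn_id] = mbps
            return True                  # commitment: guaranteed service
        return False                     # rejection: fall back to best effort

    def release(self, conn_id: str):
        self.committed.pop(conn_id, None)

if __name__ == "__main__":
    link = LinkQoS(capacity_mbps=800)
    print(link.negotiate("mpeg", 600))   # True
    print(link.negotiate("audio", 300))  # False: would exceed capacity
    link.release("mpeg")
    print(link.negotiate("audio", 300))  # True after renegotiation
```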
Guaranteed versus best-effort service
Types of commitment: correctness (e.g. uncorrupted data), completion (e.g. no packet loss), bounds (e.g. maximum latency)
Best-effort services have better average resource utilisation at the cost of unpredictable/unbounded worst-case behaviour
The combination of best-effort & guaranteed services is useful!
Conceptually, two disjoint networks
a network with throughput+latency guarantees (GT)
a network without those guarantees (best-effort, BE)
Several types of commitment in the network
combine guaranteed worst-case behaviour with good average resource usage: the guaranteed router wins priority/arbitration, and is programmed through the best-effort router
Best-effort router
Wormhole routing
Input queueing
Source routing
Guaranteed throughput router
Contention-free routing
synchronous, using slot tables (time-division multiplexed circuits)
Store-and-forward routing
Headerless packets
information is present in slot table
A lot of hardware overhead!!!
Æthereal: Contention-Free Routing
Latency guarantees are easy in circuit switching
With packet switching, need to “emulate”
Schedule packet injection into the network such that packets never contend for the same link at the same time
in space: disjoint paths
in time: time-division multiplexing
Use best-effort packets to set up connections
Distributed, concurrent, pipelined, consistent
Compute slot assignment at build time, run time, or combination
Connection opening may be rejected
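A minimal sketch of contention-free slot assignment in this style, where a packet advances one hop per slot; the slot-table size, link naming and the incremental-slot rule are assumptions for illustration.

```python
# Time is divided into S slots; a connection that leaves its source in
# slot s uses slot (s + i) mod S on the i-th link of its path. A request
# is committed only if every required (link, slot) pair is still free.

S = 8                                    # slot table size (assumed)
tables = {}                              # link -> set of reserved slots

def try_open(path_links, slot):
    """Reserve slot at the first link, slot+i at link i. Returns True on
    commitment, False on rejection (the caller may retry another slot)."""
    needed = [(link, (slot + i) % S) for i, link in enumerate(path_links)]
    if any(s in tables.setdefault(l, set()) for l, s in needed):
        return False                     # contention: reject the request
    for l, s in needed:
        tables[l].add(s)
    return True

if __name__ == "__main__":
    print(try_open(["A->B", "B->C"], slot=0))   # True: reserves A->B@0, B->C@1
    print(try_open(["B->C", "C->D"], slot=1))   # False: B->C slot 1 is taken
    print(try_open(["B->C", "C->D"], slot=2))   # True
```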