Introduction to Network-on-Chip (NOC)

Networks-on-Chip
Ben Abdallah Abderazek
The University of Aizu,
Graduate School of Computer Science and Eng.
Adaptive Systems Laboratory,
E-mail: benab@u-aizu.ac.jp
03/01/2010
Hong Kong University of Science and Technology, March 2010
Part I
Application Requirements
Network on Chip: A paradigm Shift in VLSI
Critical problems addressed by NoC
Traffic abstractions
Data Abstraction
Network delay modeling
Application Requirements
Domain              Timing           Load           Quality   Typical platform
Signal processing   Hard real time   Very regular   High      Typically on DSPs
Media processing    Hard real time   Irregular      High      SoC/media processors
Multimedia          Soft real time   Irregular      Limited   PC/desktop
Very challenging!
What Does the Internet Need?
A huge and increasing volume of packets, plus routing, packet classification, encryption, QoS, new applications and protocols, etc.
• ASIC (large, expensive to develop, not flexible)
• General-purpose RISC (not capable enough)
• SoC, MCSoC?
  o High processing power
  o Support wire speed
  o Programmable
  o Scalable
  o Specialized for network applications
Example - Network Processor (NP)
IBM PowerNP
• 16 pico-processors and 1 PowerPC
• Each pico-processor
  o Supports 2 hardware threads
  o 3-stage pipeline: fetch/decode/execute
• Dyadic Processing Unit
  o Two pico-processors
  o 2KB shared memory
  o Tree search engine
• Focus is layers 2-4
• PowerPC 405 for control plane operations
  o 16K I and D caches
• Target is OC-48
Example - Network Processor (NP)
• NP can be applied in various network layers and applications
  o Traditional apps: forwarding, classification
  o Advanced apps: transcoding, URL-based switching, security, etc.
  o New apps
Telecommunication Systems and NoC Paradigm
• The trend nowadays is to integrate telecommunication systems on complex multicore SoCs (MCSoC):
  o Network processors,
  o Multimedia hubs, and
  o Base-band telecom circuits
• These applications have tight time-to-market and performance constraints
Telecommunication Systems and NoC Paradigm
• A telecommunication multicore SoC is composed of 4 kinds of components:
  1. Software tasks,
  2. Processors executing software,
  3. Specific hardware cores, and
  4. Global on-chip communication network
• The global on-chip communication network is the most challenging part.
Technology & Architecture Trends
• Technology trends:
  o Vast transistor budgets
  o Relatively poor interconnect scaling
  o Need to manage complexity and power
  o Build flexible designs (multi-/general-purpose)
• Architectural trends:
  o Go parallel!
  o Keep core complexity constant or simplify
• Result is lots of modules (cores, memories, off-chip interfaces, specialized IP cores, etc.)
Wire Delay vs. Logic Delay
Operation                               Delay (0.13 µm)   Delay (0.05 µm)
32-bit ALU operation                    650 ps            250 ps
32-bit register read                    325 ps            125 ps
Read 32 bits from 8KB RAM               780 ps            300 ps
Transfer 32 bits across chip (10 mm)    1400 ps           2300 ps
Transfer 32 bits across chip (20 mm)    2800 ps           4600 ps
Global on-chip communication to operation delay: 2:1 at 0.13 µm, rising to 9:1 in 2010 (0.05 µm)
Ref: W.J. Dally, HPCA panel presentation, 2002
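The 2:1 and 9:1 ratios quoted on the wire-delay slide can be checked directly from the table. The sketch below is only an arithmetic illustration; it assumes the ratio compares a 10 mm cross-chip transfer against one 32-bit ALU operation, and the dictionary keys are names chosen here for readability.

```python
# Delays in picoseconds, copied from the wire-delay table.
DELAYS_PS = {
    "0.13um": {"alu_32bit": 650, "cross_chip_10mm": 1400},
    "0.05um": {"alu_32bit": 250, "cross_chip_10mm": 2300},
}

def comm_to_op_ratio(node: str) -> float:
    """Ratio of a 10 mm cross-chip transfer to one 32-bit ALU operation."""
    d = DELAYS_PS[node]
    return d["cross_chip_10mm"] / d["alu_32bit"]

for node in DELAYS_PS:
    print(f"{node}: about {comm_to_op_ratio(node):.1f}:1")
```

Logic gets faster with scaling while the wire gets slower, which is why the ratio grows instead of staying constant.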
Communication Reliability
• Information transfer is inherently unreliable at the electrical level, due to:
  o Timing errors
  o Cross-talk
  o Electro-magnetic interference (EMI)
  o Soft errors
• The problem will get increasingly worse as technology scales down
Evolution of on-chip communication
Traditional SoC nightmare
• Variety of dedicated interfaces
• Design and verification complexity
• Unpredictable performance
• Many underutilized wires
[Figure: bus-based SoC with DMA, CPU, DSP, control signals, CPU bus, bridge, peripheral bus, IO blocks, and modules A, B, C]
Network on Chip: A Paradigm Shift in VLSI
• From: dedicated signal wires
• To: shared network
[Figure: computing modules attached to network switches (s) that are joined by point-to-point links]
NoC essentials
• Communication by packets of bits
• Routing of packets through several hops, via switches
• Efficient sharing of wires
• Parallelism
[Figure: modules connected through a network of switches (s)]
Characteristics of a paradigm shift
• Solves a critical problem
• Step-up in abstraction
• Design is affected:
  o Design becomes more restricted
  o New tools
• The changes enable higher complexity and capacity
• Jump in design productivity
Origins of the NoC concept
• The idea was talked about in the 90’s, but actual research came in the new millennium.
• Some well-known early publications:
  o Guerrier and Greiner (2000) “A generic architecture for on-chip packet-switched interconnections”
  o Hemani et al. (2000) “Network on chip: An architecture for billion transistor era”
  o Dally and Towles (2001) “Route packets, not wires: on-chip interconnection networks”
  o Wingard (2001) “MicroNetwork-based integration of SoCs”
  o Rijpkema, Goossens and Wielage (2001) “A router architecture for networks on silicon”
  o Kumar et al. (2002) “A Network on chip architecture and design methodology”
  o De Micheli and Benini (2002) “Networks on chip: A new paradigm for systems on chip design”
Don't we already know how to design interconnection networks?
• Many network topologies, router designs and theories have already been developed for high-end supercomputers and telecom switches
• Yes, and we'll cover some of this material, but the trade-offs on-chip lead to very different designs!
Critical problems addressed by NoC
1) Global interconnect design problem: delay, power, noise, scalability, reliability
2) System integration productivity problem
3) Chip Multi Processors (key to power-efficient computing)
1(a): NoC and global wire delay
• Long wire delay is dominated by resistance
• Add repeaters
• Repeaters become latches (with clock frequency scaling)
• Latches evolve to NoC routers
[Figure: a chain of NoC routers]
1(b): Wire design for NoC
• NoC links:
  o Regular
  o Point-to-point (no fanout tree)
  o Can use transmission-line layout
  o Well-defined current return path
  o Can be optimized for noise / speed / power (low swing, current mode, ...)
1(c): NoC scalability
• For the same performance, compare the wire area and power:
Architecture      Wire area    Power
NoC               O(n)         O(n)
Simple bus        O(n^3 √n)    O(n √n)
Segmented bus     O(n^2 √n)    O(n √n)
Point-to-point    O(n^2 √n)    O(n √n)
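To get a feel for what the asymptotic wire-area expressions on the scalability slide imply, the sketch below compares how the NoC and simple-bus costs grow when the number of modules n is quadrupled. Constant factors are unknown and ignored, so only the growth trend is meaningful; the function and dictionary names are illustrative.

```python
import math

# Asymptotic wire-area expressions for two of the architectures compared
# on the slide. Constants are ignored: only relative growth is meaningful.
WIRE_AREA = {
    "NoC": lambda n: n,                            # O(n)
    "Simple bus": lambda n: n**3 * math.sqrt(n),   # O(n^3 * sqrt(n))
}

def growth_factor(cost, n_small, n_large):
    """How many times the cost grows when going from n_small to n_large modules."""
    return cost(n_large) / cost(n_small)

for name, cost in WIRE_AREA.items():
    print(f"{name}: x{growth_factor(cost, 16, 64):.0f} when n goes from 16 to 64")
```

Going from 16 to 64 modules multiplies the NoC wire area by 4 but the simple-bus wire area by 128, which is the scalability argument in numeric form.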
1(d): NoC and communication reliability
• Fault tolerance & error correction
[Figure: routers connected through micro-modem (UMODEM) link interfaces over the interconnect. Labels: input buffer, error correction, synchronization, ISI reduction, parallel-to-serial converter, modulation, link interface]
A. Morgenshtein, E. Bolotin, I. Cidon, A. Kolodny, R. Ginosar, “Micro-modem – reliability solution for NOC communications”, ICECS 2004
1(e): NoC and GALS
• Modules in a NoC system use different clocks
  o May use different voltages
• NoC can take care of synchronization
• NoC design may be asynchronous
  o No waste of power when the links and routers are idle
2: NoC and engineering productivity
• NoC eliminates ad-hoc global wire engineering
• NoC separates computation from communication
• NoC supports modularity and reuse of cores
• NoC is a platform for system integration, debugging and testing
3: NoC and CMP
• Uniprocessors cannot provide power-efficient performance growth
  o Interconnect dominates dynamic power
  o Global wire delay doesn’t scale
  o Instruction-level parallelism is limited
• Power efficiency requires many parallel local computations
  o Chip Multi Processors (CMP)
  o Thread-Level Parallelism (TLP)
• Network is a natural choice for CMP!
[Figure: uniprocessor dynamic power split into interconnect, gate and diff. components (Magen et al., SLIP 200…); uniprocessor performance vs. die area (or power)]
Why is now the time for NoC?
• Difficulty of DSM wire design
• Productivity pressure
• CMPs
Traffic abstractions
• Traffic models are generally captured from actual traces of functional simulation
• A statistical distribution is often assumed for messages
Flow      Bandwidth   Packet size   Latency
1->10     400kb/s     1kb           5ns
2->10     1.8Mb/s     3kb           12ns
1->4      230kb/s     2kb           6ns
4->10     50kb/s      1kb           3ns
4->5      300kb/s     3kb           4ns
3->10     34kb/s      0.5kb         15ns
5->10     400kb/s     1kb           4ns
6->10     699kb/s     2kb           1ns
8->10     300kb/s     3kb           12ns
9->8      1.8Mb/s     5kb           7ns
9->10     200kb/s     5kb           10ns
7->10     200kb/s     3kb           12ns
11->10    300kb/s     4kb           10ns
12->10    500kb/s     5kb           12ns
[Figure: communication graph of processing elements PE1 to PE12]
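A flow table like the one on the traffic-abstraction slide is typically the input to mapping and link-sizing tools. As a minimal sketch (the data layout and function name are assumptions made here, and 1.8 Mb/s is taken as 1800 kb/s), the inbound bandwidth per destination can be summed to spot hotspots:

```python
from collections import defaultdict

# (source PE, destination PE, bandwidth in kb/s), copied from the flow table.
FLOWS = [
    (1, 10, 400), (2, 10, 1800), (1, 4, 230), (4, 10, 50), (4, 5, 300),
    (3, 10, 34), (5, 10, 400), (6, 10, 699), (8, 10, 300), (9, 8, 1800),
    (9, 10, 200), (7, 10, 200), (11, 10, 300), (12, 10, 500),
]

def inbound_bandwidth(flows):
    """Sum the bandwidth demand arriving at each destination PE."""
    total = defaultdict(int)
    for _src, dst, kbps in flows:
        total[dst] += kbps
    return dict(total)

inbound = inbound_bandwidth(FLOWS)
hotspot = max(inbound, key=inbound.get)
print(f"Hotspot: PE{hotspot} receives {inbound[hotspot]} kb/s")
```

With this table PE10 receives most of the traffic, so the links and router ports around it are the natural candidates for extra capacity.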
Data abstractions
Layers of abstraction in network modeling
• Software layers
  o Application, OS
• Network & transport layers
  o Network topology: e.g. crossbar, ring, mesh, torus, fat tree, ...
  o Switching: circuit / packet switching (SAF, VCT), wormhole
  o Addressing: logical/physical, source/destination, flow, transaction
  o Routing: static/dynamic, distributed/source, deadlock avoidance
  o Quality of Service: e.g. guaranteed-throughput, best-effort
  o Congestion control, end-to-end flow control
• Data link layer
  o Flow control (handshake)
  o Handling of contention
  o Correction of transmission errors
• Physical layer
  o Wires, drivers, receivers, repeaters, signaling, circuits, ...
How to select architecture?
• Architecture choices depend on system needs.
• A large range of solutions!
[Figure: reconfiguration rate (at design time, at boot time, during run time) vs. flexibility (single application to general purpose or embedded systems), positioning ASIC, FPGA, ASSP and CMP/multicore]
Example: OASIS
• ASIC assumed
  o Traffic requirements are known a priori
• Features
  o Packet switching (wormhole)
  o Quality of service
  o Mesh topology
K. Mori, A. Ben Abdallah, and K. Kuroda, “Design and Evaluation of a Complexity Effective Network-on-Chip Architecture on FPGA", The 19th Intelligent System Symposium (FAN 2009), pp. 318-321, Sep. 2009.
S. Miura, A. Ben Abdallah, and K. Kuroda, "PNoC - Design and Preliminary Evaluation of a Parameterizable NoC for MCSoC Generation and Design Space Exploration", The 19th Intelligent System Symposium (FAN 2009), pp. 314-317, Sep. 2009.
Perspective 1: NoC vs. Bus
NoC
• Aggregate bandwidth grows
• Link speed unaffected by N
• Concurrent spatial reuse
• Pipelining is built-in
• Distributed arbitration
• Separate abstraction layers
However:
• No performance guarantee
• Extra delay in routers
• Area and power overhead?
• Modules need NI
• Unfamiliar methodology
Bus
• Bandwidth is limited, shared
• Speed goes down as N grows
• No concurrency
• Pipelining is tough
• Central arbitration
• No layers of abstraction (communication and computation are coupled)
However:
• Fairly simple and familiar
Perspective 2: NoC vs. Off-chip Networks
NoC
• Sensitive to cost:
  o area
  o power
• Wires are relatively cheap
• Latency is critical
• Traffic may be known a priori
• Design time specialization
• Custom NoCs are possible
Off-Chip Networks
• Cost is in the links
• Latency is tolerable
• Traffic/applications unknown
• Changes at runtime
• Adherence to networking standards
VLSI CAD problems
• Application mapping
• Floorplanning / placement
• Routing
• Buffer sizing
• Timing closure
• Simulation
• Testing
VLSI CAD problems in NoC
• Application mapping (map tasks to cores)
• Floorplanning / placement (within the network)
• Routing (of messages)
• Buffer sizing (size of FIFO queues in the routers)
• Timing closure (link bandwidth capacity allocation)
• Simulation (network simulation, traffic/delay/power modeling)
• Other NoC design problems (topology synthesis, switching, virtual channels, arbitration, flow control, ...)
Typical NoC design flow
Place modules -> Determine routing and adjust link capacities
Timing closure in NoC
Flow: Define inter-module traffic -> Place modules -> QoS satisfied? If no, increase link capacities and check again; if yes, finish.
• Too low capacity results in poor QoS
• Too high capacity wastes area
• Uniform link capacities are a waste in ASIP system
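The timing-closure loop on the slide above can be sketched as a small iteration. The QoS test used here (every link's utilization must stay at or below a target) is a stand-in assumption, not the deck's actual QoS model, and the names `close_timing`, `link_load` and `max_utilization` are hypothetical.

```python
def close_timing(link_load, capacity, max_utilization=0.5, step=1.0):
    """Raise link capacities until every link meets the utilization target.

    link_load and capacity map a link name to a bandwidth value in the
    same (arbitrary) unit. Only the worst link is upgraded per iteration,
    so lightly loaded links keep their small capacity.
    """
    capacity = dict(capacity)  # do not modify the caller's dictionary
    while True:
        worst = max(link_load, key=lambda l: link_load[l] / capacity[l])
        if link_load[worst] / capacity[worst] <= max_utilization:
            return capacity  # "QoS satisfied? Yes -> finish"
        capacity[worst] += step  # "No -> increase link capacities"

result = close_timing({"a": 4.0, "b": 1.0}, {"a": 2.0, "b": 2.0})
print(result)
```

The result is non-uniform: the heavily loaded link ends up with a larger capacity than the lightly loaded one, which is the point of the slide's remark that uniform capacities waste area.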
Network delay modeling
• Analysis of mean packet delay in a wormhole network with:
  o Multiple virtual channels
  o Different link capacities
  o Different communication demands
NoC design requirements
• High-performance interconnect
  o High throughput, latency, power, area
• Complex functionality (performance again)
  o Support for virtual channels
  o QoS
• Synchronization
  o Reliability, high throughput, low latency
ISO/OSI network protocol stack model
Part II
NoC topologies
Switching strategies
Routing algorithms
Flow control schemes
Clocking schemes
QoS
Basic Building Blocks
Status and Open Problems
NoC Topology
• The connection map between PEs
• Adopted from large-scale networks and parallel computing
• Topology classifications:
  o Direct topologies
  o Indirect topologies
Direct topologies
• Each switch (SW) is connected to a single PE
• As the number of nodes in the system increases, the total bandwidth also increases
[Figure: four PE/SW pairs with the switches linked to each other; each PE is connected to only a single SW]
Direct topologies: Mesh
• 2D mesh is most popular
• All links have the same length
  o Eases physical design
• Area grows linearly with the number of nodes
[Figure: 4x4 mesh]
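The claim that mesh area grows linearly with the number of nodes can be illustrated by counting links. This is a minimal sketch for a square k x k mesh; the coordinate scheme and function names are choices made here, and link count is used only as a rough proxy for wiring area.

```python
def mesh_neighbors(x, y, k):
    """Neighbors of node (x, y) in a k x k 2D mesh (no wraparound links)."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < k and 0 <= j < k]

def mesh_link_count(k):
    """Number of bidirectional links; each link is seen from both ends."""
    degree_sum = sum(len(mesh_neighbors(x, y, k))
                     for x in range(k) for y in range(k))
    return degree_sum // 2

for k in (4, 8, 16):
    nodes, links = k * k, mesh_link_count(k)
    print(f"{k}x{k}: {nodes} nodes, {links} links, {links / nodes:.2f} links per node")
```

The links-per-node figure stays below 2 however large the mesh gets, so the total number of equal-length links grows in proportion to the node count.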
Direct topologies: Torus and Folded Torus
• Torus
  o Similar to a regular mesh
  o Excessive delay problem due to long end-around connections
• Folded Torus
  o Overcomes the long link limitation of a 2-D torus
  o Links have the same size
[Figure: torus and folded torus layouts, each PE attached to a router (R)]
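The end-around connections of a torus shorten paths compared with a mesh of the same size. The sketch below counts minimal hops in both, assuming a square k x k network and (x, y) node coordinates; it says nothing about the physical length of the wraparound wires, which is the issue the folded torus addresses.

```python
def ring_distance(a, b, k, wraparound):
    """Hops between positions a and b along one dimension of size k."""
    d = abs(a - b)
    return min(d, k - d) if wraparound else d

def hops(src, dst, k, wraparound):
    """Minimal hop count between two (x, y) nodes in a k x k mesh or torus."""
    (sx, sy), (dx, dy) = src, dst
    return (ring_distance(sx, dx, k, wraparound)
            + ring_distance(sy, dy, k, wraparound))

corner_to_corner_mesh = hops((0, 0), (3, 3), 4, wraparound=False)
corner_to_corner_torus = hops((0, 0), (3, 3), 4, wraparound=True)
print(corner_to_corner_mesh, corner_to_corner_torus)
```

In a 4x4 network the corner-to-corner path drops from 6 hops in the mesh to 2 hops in the torus.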
Direct topologies: Octagon topology
• Messages sent between any 2 nodes require at most two hops
• More octagons can be tiled together to accommodate larger designs
[Figure: eight PEs arranged in an octagon around a switch (SW)]
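The two-hop property of the octagon can be checked with a breadth-first search. The sketch assumes the usual octagon wiring, in which each of the 8 nodes links to its two ring neighbors and to the node directly across; that wiring is not spelled out on the slide, so treat it as an assumption.

```python
from collections import deque

NODES = 8

def octagon_neighbors(node):
    """Ring neighbors plus the node directly across the octagon."""
    return [(node + 1) % NODES, (node - 1) % NODES, (node + NODES // 2) % NODES]

def hops(src, dst):
    """Minimal hop count from src to dst, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return dist[cur]
        for nxt in octagon_neighbors(cur):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    raise ValueError("destination unreachable")

diameter = max(hops(s, d) for s in range(NODES) for d in range(NODES))
print(f"Worst-case hop count: {diameter}")
```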
Indirect topologies
• A set of PEs is connected to a switch (router).
• Fat tree topology
  o Nodes are connected only to the leaves of the tree
  o More links near root, where bandwidth requirements are higher
[Figure: fat tree of switches (SW) with PEs attached at the leaves]
Indirect topologies: k-ary n-fly butterfly network
• Blocking multi-stage network: packets may be temporarily blocked or dropped in the network if contention occurs
[Figure: example 2-ary 3-fly butterfly network]
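Butterfly networks are commonly routed with destination-tag routing: the destination address is written in base k, and the switch at stage i uses digit i (most significant first) to pick its output port. The slide does not describe a routing scheme, so this is a sketch of that standard technique under illustrative names.

```python
def destination_tag_route(dest, k, n):
    """Output port chosen at each of the n stages of a k-ary n-fly.

    Returns the base-k digits of dest, most significant digit first.
    """
    if not 0 <= dest < k ** n:
        raise ValueError("destination out of range for this network")
    digits = []
    for _ in range(n):
        digits.append(dest % k)
        dest //= k
    return digits[::-1]

k, n = 2, 3                      # the 2-ary 3-fly example
terminals = k ** n               # number of input (and output) terminals
switches_per_stage = k ** (n - 1)
print(terminals, switches_per_stage, destination_tag_route(5, k, n))
```

Because there is exactly one path per source and destination pair, two packets that need the same switch output at the same time contend, which is why the network is blocking.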
Indirect topologies: (m, n, r) symmetric Clos network
• 3-stage network in which each stage is made up of a number of crossbar switches
  o m: number of middle-stage switches
  o n: number of input/output nodes on each input/output switch
  o r: number of input and output switches
• Non-blocking network
• Expensive (several full crossbars)
[Figure: example (3, 3, 4) Clos network]
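Whether a symmetric Clos network is non-blocking depends on m relative to n. The classic conditions are m >= 2n - 1 for strictly non-blocking operation and m >= n for rearrangeably non-blocking operation. The sketch below applies those textbook conditions and counts crosspoints; the function names are illustrative.

```python
def classify_clos(m, n):
    """Classify an (m, n, r) symmetric Clos network by its blocking behavior."""
    if m >= 2 * n - 1:
        return "strictly non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"

def crosspoints(m, n, r):
    """Crosspoints in r input (n x m), m middle (r x r) and r output (m x n) switches."""
    return 2 * r * n * m + m * r * r

print(classify_clos(3, 3), crosspoints(3, 3, 4))
```

By these conditions the (3, 3, 4) example is non-blocking in the rearrangeable sense; strictly non-blocking operation with n = 3 would need m = 5 middle switches.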
Indirect topologies: Benes network
• Rearrangeable network in which paths may have to be rearranged to provide a connection, requiring an appropriate controller
• Clos topology composed of 2 x 2 switches
Example: (2, 2, 4) re-arrangeable Clos network constructed using two (2, 2, 2) Clos networks with 4 x 4 middle switches.
Irregular Topologies
• Customized for an application
• Usually a mix of shared bus, direct, and indirect network topologies
[Figure: Example 1: Reduced mesh. Example 2: Cluster-based hybrid topology]
Example 1: Partially irregular 2D-Mesh topology
• Contains oversized, rectangularly shaped PEs.
[Figure: 2D mesh of PEs and routers (R) in which some PEs span 2∆x or 2∆y instead of ∆x by ∆y]
Example 2: Irregular Mesh
• This kind of chip does not limit the shape of the PEs or the placement of the routers. It may be considered a "custom" NoC.
[Figure: irregularly placed routers (R)]
How to Select a Topology?
• The application decides the topology type
  o If PEs = a few tens: Star and Mesh topologies are recommended
  o If PEs = 100 or more: Hierarchical Star and Mesh are recommended
• Some topologies are better for certain designs than others
• Most of the time, when one topology is better in performance, it is worse in power consumption!
Part II
NoC topologies
NoC Switching strategies
Routing algorithms
Flow control schemes
Clocking schemes
QoS
Basic Building Blocks
Status and Open Problems
NoC Switching Strategies
• Switching determines how flits and packets flow through routers in the network
• There are two basic modes:
  o Circuit switching
  o Packet switching
Circuit Switching
• Network resources (channels) are reserved before a packet is sent
• Entire path must be reserved first
• The packets do not contain routing information, but rather data and information about the data.
• Circuit-switched networks require no overhead for packetisation, packet header processing or packet buffering
Circuit Switching
[Figure: timing diagram across routers R1, R2 and R3 showing the header, ACK and data phases, the routing + switching delay at each router, the setup time and the transfer time]
Circuit Switching
• Once the circuit is set up, router latency and control overheads are very low
• Very poor use of channel bandwidth if lots of short packets must be sent to many different destinations
  o More commonly seen in embedded SoC applications where traffic patterns may be static and involve streaming large amounts of data between different IP blocks
Packet Switching
• We can aim to make better use of channel resources by buffering packets. We then arbitrate for access to network resources dynamically.
• We distinguish between different approaches by the granularity at which we reserve resources (e.g. channels and buffers) and the conditions that must be met for a packet to advance to the next node
Packet Switching
• Packet-buffer flow control
  o Store-and-forward (SaF): advance when the entire packet is buffered + L free flit buffers at the next node
  o Cut-through: advance when L free flit buffers at the next node
• Flit-buffer flow control
  o Wormhole: can advance when at least one flit buffer is available
L: packet length
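The three advance conditions listed on the slide above can be written as one predicate. This is a minimal sketch; the function name, argument names and scheme strings are illustrative, and buffer space is counted in flits.

```python
def can_advance(scheme, flits_buffered_here, free_flit_buffers_next, packet_length):
    """Return True if the packet (or its head flit) may move to the next node."""
    if scheme == "store_and_forward":
        # Entire packet buffered locally AND room for all L flits downstream.
        return (flits_buffered_here == packet_length
                and free_flit_buffers_next >= packet_length)
    if scheme == "cut_through":
        # Room for all L flits downstream; no need to wait for the tail.
        return free_flit_buffers_next >= packet_length
    if scheme == "wormhole":
        # A single free flit buffer downstream is enough.
        return free_flit_buffers_next >= 1
    raise ValueError(f"unknown scheme: {scheme}")

# A 5-flit packet whose first 2 flits have arrived, with 1 free buffer downstream.
for scheme in ("store_and_forward", "cut_through", "wormhole"):
    print(scheme, can_advance(scheme, 2, 1, 5))
```

Only wormhole lets the head flit move on in this situation, which is why it needs the least buffering per router.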
Packet Switching: Store and Forward (SAF)
• A packet is sent from one router to the next only if the receiving router has buffer space for the entire packet
• Buffer size in the router is at least equal to the size of a packet
[Figure: store-and-forward switching, forwarding packet by packet through a chain of buffer + switch stages; header flit followed by data flits]
Packet Switching: Wormhole (WH)
• A flit is forwarded to a router if space exists for that flit
• Parts of the packet can be distributed among two or more routers
• Buffer requirements are reduced to one flit, instead of an entire packet
[Figure: WH switching technique, forwarding flit by flit through a chain of buffer + switch stages; header flit followed by data flits]
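The practical difference between forwarding packet by packet and flit by flit shows up in zero-load latency. The sketch below uses a deliberately simple model that is an assumption, not something taken from the slides: one flit crosses one hop per cycle, there is no contention, and router processing delay is ignored.

```python
def saf_latency(hops, packet_flits):
    """Store-and-forward: the whole packet is received before each forward."""
    return hops * packet_flits

def wormhole_latency(hops, packet_flits):
    """Wormhole: the header takes `hops` cycles, the other flits follow pipelined."""
    return hops + packet_flits - 1

hops, packet_flits = 3, 5
print("SAF cycles:", saf_latency(hops, packet_flits))
print("Wormhole cycles:", wormhole_latency(hops, packet_flits))
```

Under this model store-and-forward latency scales with the product of hop count and packet length, while wormhole latency scales with their sum.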
Packet Switching: Virtual Channel (VC)
• Improves the performance of WH routing by preventing a single packet from blocking a free channel
  o e.g. if the green packet is blocked, the red packet may still make progress through the network
• We can interleave flits from different packets over the same channel
Part II
NoC topologies
NoC Switching strategies
Routing algorithms
Flow control schemes
Clocking schemes
QoS
Basic Building Blocks
Status and Open Problems