A HIGH BANDWIDTH AREA EFFICIENT SPATIAL DIVISION
MULTIPLEXING BASED NETWORK ON CHIP

SUBMITTED BY
MANMOHAN MANOHARAN

DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING
FACULTY OF ENGINEERING

IN PARTIAL FULFILMENT OF THE
REQUIREMENTS FOR THE
DEGREE OF
MASTER OF SCIENCE

NATIONAL UNIVERSITY OF SINGAPORE
NOVEMBER 2010
ABSTRACT
The shift in trend from single-processor chips to MPSoCs has resulted in a search for alternative interconnect technologies for the various components of the MPSoC. Networks-on-Chip provide a promising solution in this direction. Leveraging the concepts of networks, which are already well established and stable, the idea of implementing an MPSoC as a network of components holds great scope for improvement. Networks-on-Chip are also more scalable than bus-based architectures as the number of cores increases. An area efficient, high bandwidth, SDM-based NoC is presented in this thesis. Furthermore, an SDF model of this NoC has been developed that enables application performance prediction on the underlying hardware.
ACKNOWLEDGEMENT
I would like to express my sincere thanks to everyone who has supported me to complete this
thesis to the very best of my abilities. First and foremost, I would like to thank my supervisor, Dr
Akash Kumar, for accepting me under his wings and giving me an idea to start with. No amount
of words would be sufficient to describe the support and encouragement he has given me during
the course of this project. Without his help and guidance, I would not have come so far. His
infinite patience while listening to the various issues I had faced during the course of this project
and pointing out zillions of my silly mistakes and his understanding of my capabilities as a
student gave me the necessary confidence to carry on till the end. I am grateful to Mr Shakith
Fernando, for providing me all the necessary assistance and knowledge with respect to FPGAs
and developing my love towards these wonderful logic devices. I would also like to thank Dr
Bharadwaj Veeravalli, Associate Professor, Dept of Electrical and Computer Engineering, NUS
for giving me permission to use his lab facilities for the purpose of my project. I also thank Mr.
Eric Poon, Lab Officer of the CNDS lab for providing me with all the necessary setup to work on
my project. I also thank my dearest friends Ganesh, Deepu, Jerrin, Pramod, Sheng, and Rajesh
for helping me make the time I spent at the University both productive and fun-filled.
Last but not least, I am thankful to my most wonderful parents and sister, who have supported me throughout my life in every situation and in all the decisions I have taken so far.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS AND ABBREVIATIONS

CHAPTER 1  INTRODUCTION
1.1 OVERVIEW OF MULTIPROCESSOR SYSTEM-ON-CHIPS
1.2 INTRODUCTION TO NETWORK-ON-CHIPS
1.3 KEY CONTRIBUTIONS
1.4 THESIS ORGANIZATION

CHAPTER 2  HIGH BANDWIDTH AREA EFFICIENT SDM BASED NETWORK ON CHIP
2.1 BASIC ARCHITECTURE
2.2 ARCHITECTURAL IMPROVEMENTS
2.2.1 MULTIPLE CLOCK SUPPORT
2.2.2 FLOW CONTROL
2.2.3 OPTIMIZED NETWORK INTERFACE ARCHITECTURE
2.3 RESULTS AND ANALYSIS

CHAPTER 3  SDF MODELLING
3.1 SDF GRAPH BASICS
3.2 SDF MODEL OF SDM NETWORK ON CHIP
3.3 APPLICATION MAPPING AND THROUGHPUT ANALYSIS
3.4 CONCLUSIONS

CHAPTER 4  CASE STUDY

CHAPTER 5  CONCLUSIONS AND FUTURE WORK
5.1 CONCLUSIONS
5.2 FUTURE WORK

BIBLIOGRAPHY
LIST OF FIGURES

CHAPTER 1
Figure 1    Intel CPU Introductions from 1975-2010
Figure 1.1  A typical Network-on-Chip based system
Figure 1.2a Time Division Multiplexing
Figure 1.2b Spatial Division Multiplexing
Figure 1.3  NI described in [11]

CHAPTER 2
Figure 2.1  NoC structure [12]
Figure 2.2  Network Interface for the existing structure [12]
Figure 2.3  Data transmission through wires [12]
Figure 2.4  k-way router [12]
Figure 2.5  Overview of multiple clocking implementation
Figure 2.6  Multiple clocking and flow control at transmission
Figure 2.7  Multiple clocking and flow control at receiver
Figure 2.8  Flow control mechanism in the NoC
Figure 2.9  Area breakdown of existing NoC [12]

CHAPTER 3
Figure 3.1  Example of a three-actor SDF graph
Figure 3.2  SDF model with self-edge
Figure 3.3  SDF model with a back edge to model a buffer
Figure 3.4  SDF model of the SDM based NoC
Figure 3.5  Optimized hardware generation flow using SDF graph
Figure 3.6  Modelling of tasks running on two processors
Figure 3.7  Snippet of XML specification of SDF graph

CHAPTER 4
Figure 4.1  SDF model of producer-consumer
Figure 4.2  SDF model of producer-consumer with NoC
Figure 4.3  Number of lanes vs. iterations per second on SDF3
Figure 4.4  Number of lanes vs. iterations per second on FPGA
Figure 4.5  Measured data vs. SDF3 for TXCLK = SYSCLK
Figure 4.6  Measured data vs. SDF3 for TXCLK = 5*SYSCLK

CHAPTER 5
Figure 5.1  Proposed area efficient architecture for Send Data Distributor
LIST OF TABLES

Table 1: Resource comparison between existing NoC [12] and [11]
Table 2: Bandwidth per connection for different TXCLK and wires
Table 3: Predicted period vs. measured period for producer-consumer model
LIST OF SYMBOLS AND ABBREVIATIONS

FIFO     First-In First-Out
FPGA     Field Programmable Gate Array
FSL      Fast Simplex Link
IP       Intellectual Property
MPSoC    Multi-Processor System-on-Chip
NI       Network Interface
NoC      Network-on-Chip
SDF      Synchronous Data Flow
SDM      Spatial Division Multiplexing
SoC      System-on-Chip
TDM      Time Division Multiplexing
VHDL     Very High Speed Integrated Circuit Hardware Description Language
CHAPTER 1
INTRODUCTION
The invention of the transistor by William Shockley, John Bardeen, and Walter Brattain at AT&T's Bell Labs in 1947 has been described as one of the most important inventions in the history of mankind. This tiny device now forms the basic building block of any modern electronic device manufactured today. The development of integrated circuits (ICs) that contain transistors fabricated on a semiconductor substrate such as silicon, forming a complete electronic circuit with both active and passive components, has revolutionized the world of electronics. The ability to mass-produce ICs using a highly automated process has enabled very low unit costs, further driving down the costs of electronic devices.
With the growth of silicon processing technology, more and more transistors could be packed on
the surface of a silicon substrate. The first ICs contained only a few tens of transistors. This
level of integration was known as Small Scale Integration (SSI). The next level of integration
was called Medium Scale Integration (MSI) and was developed in the late 1960s. From then on, the level of integration increased to Large Scale Integration (LSI) in the 1970s, with tens of thousands of transistors on a single chip, and then to Very Large Scale Integration (VLSI) starting in the 1980s and continuing even now. The number of transistors per chip has increased to
several billion in many of the computer processor chips released in the last decade. The feature
size has also reduced to 32 nm, with many commercial microprocessor chips being manufactured
at this feature size.
The level of integration of transistors has closely followed the trend noted by Gordon E. Moore in 1965 [1], popularly known as Moore's Law, which states that the number of transistors will double every 18 to 24 months. As the number of transistors that could be integrated on silicon continued to increase, the performance of processor chips showed a similar trend. This can be seen from the figure below, which shows the trend of Intel CPUs from 1975 to 2010.
Figure 1: Intel CPU Introductions from 1975 – 2010
Source : http://www.gotw.ca/publications/concurrency-ddj.htm
1.1 Overview of Multiprocessor System-on-Chips
Even though the number of transistors that could be integrated on silicon has continued to increase, clock speed has not seen a corresponding increase since 2003. The most promising direction for improving processor performance is therefore to have multiple cores on the same chip. Such processors are called Chip Multi-Processors (CMPs). Furthermore, owing to the extremely small feature size, multiple processor cores and custom IPs can be integrated into a single chip. Such a device is called a Multiprocessor System-on-Chip (MPSoC) and can be considered a complete system implemented on a single silicon chip. The processing cores can all be of the same type, or the MPSoC can include heterogeneous cores, such as a DSP. The advantages of an MPSoC are manifold in terms of power, programmability and performance. Since MPSoCs are developed as a platform rather than a specialized product [2], they may allow for different implementations of the same product. A few examples of MPSoC platforms are the Philips Nexperia [3] and TI OMAP [4].
In [5], some of the key architectural decisions pertaining to MPSoC architecture design are given. The decisions on hardware design mainly revolve around the number of processors, their homogeneity and/or heterogeneity, interprocessor communication, the memory hierarchy, special modes for power reduction, etc.
1.2 Introduction to Network-on-Chips
As the number of cores increases, the interconnect architecture has to support the exponentially increasing traffic. The connection has to provide very high bandwidth so as to satisfy the communication requirements of the applications. It also has to be scalable so as to support a large number of cores. Currently, bus-based architectures for interprocessor communication do not scale well with an increasing number of cores. The bottleneck in the performance of an MPSoC therefore moves from the computational elements to the communication network.
A new way to solve the above-mentioned communication bottleneck in MPSoCs is to borrow concepts from networking and implement the MPSoC as a network of different components [6]. Such communication interconnects are termed Networks-on-Chip (NoCs). Compared to a bus-based architecture, a NoC is more modular and scales better when designing systems with a large number of cores.
Three types of services should be considered while designing a NoC: best effort (BE), guaranteed throughput (GT), and hybrid, with support for both BE and GT. In a BE-based NoC, there is no guarantee on the throughput that is achieved from the NoC. A number of packet-based NoCs have been developed which provide BE services; examples are QNoC [7] and MANGO [8]. GT services make it possible to ensure that an application meets its required throughput. The NoC has to be predictable so that the resource allocation for the application can be done at design time. There are two approaches to providing GT services: Time Division Multiplexing (TDM) and Spatial Division Multiplexing (SDM).
In a typical NoC-based MPSoC system, as shown in Fig 1.1, an IP connects to the communication network through a component called the network interface (NI). The communication network is made up of routers that act as the switching points for data.
Figure 1.1 A typical Network-on-Chip based system
In TDM, the available links between two routers are multiplexed on a time-sharing basis between the various connections. The bandwidth obtained by each connection is proportional to the number of time slots allocated to it. However, the routers have to store all the switching tables within themselves, which can introduce a large area and power overhead into the system. Examples of TDM-based NoCs are Aethereal [9] and Nostrum [10].
Another approach to providing GT services is Spatial Division Multiplexing. The links between routers consist of a number of wires. In the SDM approach, a subset of the wires is allocated to a particular connection, depending on its bandwidth requirements. The available bandwidth increases with the number of wires allocated. A brief comparison between the TDM and SDM approaches to bandwidth sharing is shown in Figs 1.2a and 1.2b respectively.
Figure 1.2a Time Division Multiplexing
Figure 1.2b Spatial Division Multiplexing
Consider four separate connections A, B, C and D with different bandwidth requirements, multiplexing a link with 8 wires. Assume that connections A, B, C and D require 2, 3, 2 and 1 time slots respectively in TDM to satisfy their bandwidth requirements, and that each connection uses all the available wires in its time slots. In SDM, the multiplexing is done in terms of wires, so A, B, C and D receive 2, 3, 2 and 1 wires respectively. The advantage of SDM is that the router does not have to switch at every time slot and hence there is no need to store switching tables in the router. The network interface becomes more complex as compared to the router, but considerable savings in power and area are achieved.
An SDM-based NoC with a flexible architecture but a higher area cost is described in [11]. This NoC makes use of an N-bit to M-bit serializer, which serializes N-bit data onto M wires, as shown in Fig 1.3. Each connection has its own message queue and serializer.
Figure 1.3 NI described in [11]
The router of this NoC has full flexibility, and hence each incoming wire can connect to any of the outgoing wires. However, the area overhead associated with this router is very large. As an improvement to this design, a new NoC has been proposed and developed in [12]. The new design is described in detail in Chapter 2.
1.3 Key Contributions
The contributions of this thesis are as follows:
• Multiple clock support has been added to the Network-on-Chip presented in [12]. Having separate clocks for the transmission network and the IP interface allows the transmission clock to be varied to achieve higher bandwidth and throughput.
• The existing architecture of the network interface has been modified to make the design more area efficient.
• An SDF-based model of the Network-on-Chip has been developed so that the throughput of applications can be calculated at design time.
1.4 Thesis Organization
Chapter 2 of this thesis describes the architecture of the existing NoC in detail, along with the architectural improvements brought about in this thesis. Chapter 3 explains the basics of SDF modelling of applications and also discusses the SDF model of the Network-on-Chip. In Chapter 4, a case study of a producer-consumer SDF model using the Network-on-Chip is discussed, including the variation in throughput as the number of lanes and the transmission clock frequency are increased. Chapter 5 presents the conclusions and directions for future work.
CHAPTER 2
HIGH BANDWIDTH AREA EFFICIENT SDM BASED NETWORK ON
CHIP
An SDM-based NoC has been developed which offers considerable improvements compared to the NoC described in [11]. This chapter describes the basic architecture of this Network-on-Chip and the modifications made to improve its performance. Multiple clocking support has been added so that the data transmission network can be clocked at a much higher frequency than the IP, which allows for higher bandwidth. Further, area improvements have been made in the architecture of the existing NoC. Compared to the design in [11], the area improvements in the new design are indicated in Table 1.
Component                      Number of Slices    Power (mW)
32-bit to M-bit serializer     13319               134.38
sendDataDistributor            183                 2.08
32-bit to 1-bit serializer     48                  2.27

Table 1: Resource comparison between existing NoC [12] and [11]
2.1 Basic Architecture
A salient feature of the NoC described is dynamic reconfigurability, which enables run time link
configuration of the NoC. This feature helps when multiple use cases are supported in the
system. The NoC can be reconfigured whenever the use case switches.
The NoC consists of two components, a) Control network and b) Data network. The structure of
the NoC is as shown in Fig 2.1.
Figure 2.1: NoC structure [12]
Each part is described in detail below.
a) Control Network: The control network is used to program the NoC when the configuration changes. It is a lightweight packet-based network with a very low area overhead compared to the whole NoC. The links between nodes consist of 8 wires each. The node at (0, 0) is connected to the east and north, while all other nodes are connected to the east. The programming data is broadcast from this node to all the other nodes in the network. Each node in the mesh has a unique id through which it can be identified. A protocol for programming the NoC has been designed, through which the programming data is forwarded through the network.
b) Data Network: The data network consists of the NIs, which connect to the IPs through FSL links, and the routers, which guide the data along the corresponding paths.
Network Interface: The structure of the existing network interface, which supports 3 sending channels of 32-bit width and 8 sending and receiving wires forming the data transmission network, is given in Fig 2.2.
Figure 2.2 Network Interface for the existing structure [12]
There exists a data distributor for each of the three channels and a 32-bit to 1-bit serializer for each of the sending wires. The role of the distributor is to send the 32-bit data to the wires allocated to that channel in a sequential manner. In this NI, an entire 32-bit data word is sent on the same wire rather than being distributed over all the wires allocated to the particular channel, as shown in Fig 2.3: for example, all the bits of three separate data packets A, B and C are sent along three separate lines. The data from the data distributors are ORed and fed to all the serializers, which ensures that each data distributor can send data to all the serializers. A two-bit handshaking mechanism exists between the serializer and the data distributor.
Figure 2.3: Data transmission through the wires[12].
The serializer receives the data packet from the distributor and serializes it along the wire. The start of data transmission is indicated by a start bit which is active high. For receiving data, the whole process is followed in the reverse order.
Router: The router used in this NoC is a 1-way router. A k-way router is one in which each incoming line can be switched to k lines in each direction. Fig 2.4 illustrates k-way switching for different values of k.
(a) 4-way   (b) 2-way   (c) 1-way
Figure 2.4 k-way router [12]
In this NoC, we use a 1-way router, which restricts to one the number of outgoing wires that any incoming wire can switch to. On one hand, this reduces the area overhead; on the other hand, it may not be possible to satisfy all connection requirements using this router. To mitigate this, the application should be mapped to processors such that there is more than one possible path between a source and a destination. In this way, the probability that a 1-way router can satisfy all the connection requirements increases.
2.2 Architectural Improvements
In this section, a few architectural improvements that have been incorporated into the existing
NoC design are discussed.
2.2.1 Multiple Clock Support
In the existing NoC design, there is a single clock for the IP, the network interface and the data transmission network. As a result, the bandwidth of the network is restricted by the maximum frequency at which the network interface can operate. The motivation for implementing multiple clocking in the NoC is the flexibility it brings to the network. A few advantages of having a separate clock domain for the transmission network are as follows:
• IPs running at different frequencies can connect to a communication network running at a common speed.
• The communication frequency is isolated from the computational frequency and can therefore be tuned separately. This can help save power, for example by allocating more lanes running at a lower frequency instead of fewer lanes running at a higher frequency.
Figure 2.5 shows the separation of clock domains between the transmission network and the NI in the multiple clocking scheme. The NIs are clocked by the system clock (SYSCLK), while the routers that form the transmission network run on the transmission clock (TXCLK). The NI communicates with the IP using SYSCLK, while it transmits data through the wires using TXCLK as the clock reference.
Figure 2.5 Overview of Multiple clocking implementation
Implementation: The separation of clock domains at the NI is implemented at the serializer level. The structure of the serializer was modified to take into account the different clock frequencies involved. In the existing design, the serializer and the data distributor were clocked by the same source. The modified serializer is clocked by two sources: the system clock, which is common with the data distributor, and the transmission clock, which clocks the entire data transmission network.
The structure of the 32 to 1 bit serializer was also modified to support the two clocks as shown in
Figure 2.6.
Figure 2.6 Multiple clocking and flow control at transmission
There are two components in the structure: the interface unit, clocked by the system clock, which performs a two-way handshake with the data distributor, and the serializer, which receives the 32-bit data directly from the data distributor. The interface checks whether the serializer is busy serializing data and only then accepts new data for transmission. When new data is accepted, the interface triggers the serializer to start transmission. After the transmission is complete, the serializer signals the interface that it has finished.
Similarly, at the receiver end, there are two clock domains at the deserializer. The structure of the deserializer is given in Fig 2.7.
By separating the network interface and the transmission network into two clock domains, there is much more flexibility in the clocking of the NoC. The IP is connected to the network interface through FSL links, which enables IPs operating at different frequencies to be connected to the same NoC, since the FSL links support an asynchronous mode of operation.
2.2.2 Flow Control
When a transmitter sends data to a receiver at a higher rate than the receiver can process, the receiver is overwhelmed and data packets are lost. To prevent this, buffers can be placed between the transmitter and the receiver so that data packets are stored temporarily until the receiver processes them. When the traffic load is very high, however, the amount of buffering required can become very large, which translates into a larger area overhead. To prevent this, a flow control mechanism has been implemented in the design. The flow control is implemented at the level of the network interface, where a full signal is sent from the receiving network interface to the transmitter side. This is shown in Fig 2.8. The figure shows three data wires allocated between a transmitter and a receiver; a flow control wire is allocated for each of the three data wires.
At the receiver end, the deserializer checks whether the receive FSL buffer is full. If it detects that the receive buffer is full, it asserts a full signal. When transmitting data, the interface unit of the serializer checks whether the full signal is asserted, and the data is transmitted only if the full signal is deasserted (Fig 2.6). This flow control is implemented for each of the wires of the network. Fig 2.7 shows the architecture of the deserializer at the receiving node of the NoC.
Figure 2.7 Multiple clocking and flow control at receiver
Figure 2.8 Flow control mechanism in the NoC
As a result of implementing flow control, the router now switches an additional wire for each of the incoming wires. However, the area overhead is only approximately 10% in this case. The k-way routing is followed here as well, with the router switching the data wires and the flow control wires in pairs.
2.2.3 Optimized Network Interface Architecture:
The current NoC architecture is not fully optimized in terms of the area and the FPGA resources
it consumes. The various components of the NoC can be further optimized to reduce the area
overhead so that a more efficient design can be generated. It is found that the network interface
takes up a large share of the resources being consumed by the entire NoC architecture [12].
(a) Top   (b) NI
Figure 2.9 Area breakdown of existing NoC [12]
A breakdown of the NI shows that the data distributor and the serializer/deserializer are the major resource-consuming components, and an effort was made to further optimize the architecture of these components. Firstly, the HDL code of the serializer and deserializer was optimized to obtain lower resource usage. In the earlier design, the serializer used a state machine with 33 states. In the new design, the number of states was reduced from 33 to 3 and a simple counter was used to count the number of bits sent or received. Using the new VHDL code, the resources used by the serializer reduced from 48 slices [12] to 42, a reduction of about 12.5%.
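To make the state-machine change concrete, the following is a minimal behavioural sketch, in Python rather than the actual VHDL, of a counter-based serializer of the kind described above: a three-state machine (idle, start bit, shift) that uses a bit counter instead of one state per transmitted bit, and that starts a transmission only while the receiver's full signal is deasserted (Section 2.2.2). All names and the exact interface are illustrative assumptions, not the signals of the real design.

```python
# Behavioural sketch (illustrative only) of a counter-based 32-bit to 1-bit serializer.
# Three states replace the 33-state machine of the earlier design; a counter tracks
# how many bits have been shifted out.

IDLE, START, SHIFT = range(3)

class Serializer:
    def __init__(self, width=32):
        self.width = width
        self.state = IDLE
        self.data = 0
        self.count = 0

    def load(self, word):
        # Called by the interface unit after the two-way handshake completes.
        if self.state == IDLE:
            self.data = word
            self.state = START
            return True          # word accepted
        return False             # still busy with the previous word

    def tick(self, full):
        # One TXCLK cycle; returns the bit driven on the wire (None = line idle).
        if self.state == IDLE:
            return None
        if self.state == START:
            if full:             # flow control: transmit only if 'full' is deasserted
                return None
            self.state, self.count = SHIFT, 0
            return 1             # active-high start bit
        bit = (self.data >> self.count) & 1
        self.count += 1
        if self.count == self.width:   # all 32 data bits sent after the start bit
            self.state = IDLE          # corresponds to the 'finish' signal to the interface
        return bit
```

With this structure a 32-bit packet occupies 33 TXCLK cycles on the wire (one start bit plus 32 data bits), which matches the serialization time used in the SDF model of Chapter 3.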
2.3 Results and Analysis
In this section, the overall results of the various architectural improvements made to the design are discussed. As mentioned earlier, the aim was to improve the available bandwidth of the NoC as a whole while reducing the area overhead as much as possible. The testing of the architectural implementation was done on a Xilinx Virtex-4 XC4VLX25 FPGA board.
Bandwidth Improvement using Multiple Clocking
We used a 1 by 2 mesh with one sending channel and one receiving channel to test the bandwidth scaling with multiple clock support. The network had 8 transmitting and receiving wires, with 32-bit data packets being received from the IP core. The IP used was a MicroBlaze processor running at 50 MHz. The NoC was tested at different transmission clock frequencies, keeping the system clock at 50 MHz. We operated the NoC at a maximum frequency of 250 MHz before timing errors were detected. The available bandwidth per wire increased linearly with the transmission clock frequency. The maximum bandwidth available to a connection can be calculated as
B = N * TXCLK
where N is the number of wires allocated to the connection and TXCLK is the transmission clock of the NoC. Hence, the number of wires required to meet a given connection bandwidth decreases as the clock frequency increases. The area overhead associated with multiple clocking support was almost nil in terms of the number of slices consumed. Table 2 shows the maximum bandwidth available per connection (in Mbit/s, assuming one bit per wire per TXCLK cycle) as the TXCLK frequency increases.
Wires   TXCLK = 50 MHz   100 MHz   150 MHz   200 MHz   250 MHz
  1            50           100       150       200       250
  2           100           200       300       400       500
  3           150           300       450       600       750
  4           200           400       600       800      1000
  5           250           500       750      1000      1250
  6           300           600       900      1200      1500
  7           350           700      1050      1400      1750
  8           400           800      1200      1600      2000

Table 2: Bandwidth per connection for different TXCLK and number of wires
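As a quick check of the entries in Table 2, each value can be reproduced directly from the relation B = N * TXCLK. The short Python sketch below is only an illustration of that calculation, not part of the design flow; the one-bit-per-wire-per-TXCLK-cycle assumption gives the raw line rate and ignores the per-packet start bit discussed in Section 2.1.

```python
# Reproduces Table 2 from B = N * TXCLK (raw line rate in Mbit/s).

def connection_bandwidth(wires: int, txclk_mhz: float) -> float:
    """Maximum bandwidth of a connection in Mbit/s, from B = N * TXCLK."""
    return wires * txclk_mhz

if __name__ == "__main__":
    txclks = [50, 100, 150, 200, 250]                       # TXCLK values in MHz
    print("Wires " + "".join(f"{f:>9} MHz" for f in txclks))
    for wires in range(1, 9):                               # 1 to 8 allocated wires
        row = "".join(f"{connection_bandwidth(wires, f):>13.0f}" for f in txclks)
        print(f"{wires:>5} {row}")
```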
Area Optimization of the NI
The area-optimized version of the serializer and deserializer gave, on average, about a 12.5% decrease in the number of slices consumed. The number of slices consumed by the entire NoC also reduced by about 15%, which could be attributed to better packing of the design into slices.
CHAPTER 3
SDF MODELLING
As multimedia applications become more and more computationally intensive, it is essential to have a good model of the application for two main reasons: a) to verify that the underlying hardware is able to support the throughput requirements of the application, and b) to obtain an optimized mapping and scheduling of the various tasks of the application in a multicore environment. Hence a good model of the application is paramount in determining the performance of an application on a particular platform. This chapter describes the basics of Synchronous Data Flow (SDF) graphs and the modelling of applications using SDF graphs. An SDF model of the NoC described in the earlier chapter is also developed and presented here.
3.1 SDF GRAPH BASICS
Synchronous Data Flow (SDF) graphs are used to model DSP and multimedia applications, especially streaming applications such as video and MPEG encoding/decoding, that can be implemented using task-level parallelism on multiple processors. The various tasks of an application are modelled as actors, which form the vertices of the SDF graph. The communication between the various tasks or actors is represented by the edges of the graph, which model the communication channels between the tasks. The worst-case execution time (WCET) of each actor can be annotated in the SDF graph; such graphs are known as timed SDF graphs. Since worst-case times are used, the throughput obtained from an SDF model of the application is conservative in nature.
Fig 3.1 shows an SDF graph of a simple three-task application. The tasks a0, a1 and a2 each take 100 clock cycles to execute, as indicated inside the vertices. The connecting edges between the tasks are also shown.
Figure 3.1 Example of a three-actor SDF graph
In a typical streaming application, the execution of a particular task begins after data is received. The execution of a task is called the firing of the corresponding actor in the SDF graph. At the end of the execution of a task, data is produced. The data produced and consumed by the actors are called tokens, and the rate at which each actor produces and consumes tokens is called the token rate. In the SDF graph, each edge forms a channel on which actors produce and consume tokens, and the token rates are indicated at the ends of the edges, as shown.
For an actor to fire, sufficient tokens should be present on all its input channels. In Figure 3.1, actor a0 can start execution because of a token present initially on its incoming edge. This token is called an initial token and is represented by the bullet in Fig 3.1. After its execution, a0 produces two tokens on the edge between a0 and a1. This allows actor a1 to fire, since it now has sufficient tokens on its incoming edge.
Since a0 produces two tokens after its firing and the input token rate of a1 is only one, two executions of actor a1 can begin simultaneously. For this to happen, a1 has to be mapped to two separate processors. However, if only one processor is allocated for this task, then only one instance of a1 can run. To model this, we make use of a self-edge, an edge leading from an actor to itself, with one initial token to indicate that only one instance of that actor can run at a time. This is shown in Fig 3.2.
Figure 3.2 SDF model with self-edge
By varying the number of initial tokens on this edge, we can vary the number of simultaneous executions of an actor of the SDF graph. This is called the auto-concurrency of the actor.
Buffers can also be modelled using edges and tokens in an SDF graph. A buffer can be modelled as an edge whose number of initial tokens equals the buffer size. The direction of this edge is opposite to that of the channel to which the particular actor is writing.
Figure 3.3 SDF model with a back edge to model a buffer
For example, if actor a0 writes two tokens to the channel towards a1, and this channel has a limited buffer capacity of only two tokens, the buffer can be represented as a back edge from a1 to a0. The buffer size of the channel is indicated by the number of initial tokens on this back edge, as shown in Fig 3.3. When a0 executes, the two tokens on the back edge are consumed and two tokens are produced on the channel from a0 to a1. Only after a1 executes twice, thereby replenishing the buffer space, can a0 execute again.
Terminology of SDF graphs:
In this section, the main terminology of SDF graphs is described. The definitions of all the terms mentioned here can be found in [13].
1. Iteration: An SDF graph is said to have executed a single iteration if all the actors of the graph have fired the required number of times and the state of the graph has returned to its initial state.
2. Repetition Vector: The repetition vector of an SDF graph defines the number of firings of each actor in a single iteration of the SDF graph. The repetition vector of the graph shown in Fig 3.3 is [a2 a1 a0] = [1 2 1].
3. Period: The period of an SDF graph is the execution time of one iteration of the SDF graph.
4. Throughput: The throughput of an SDF graph is the number of iterations per second of the SDF graph. This is equal to the inverse of the period of the SDF graph.
For an application, it might take a few iterations before the application settles into its periodic behaviour. Hence the execution of the application can have two phases: a) the transient phase, which refers to the initial iterations before the periodic behaviour sets in, and b) the steady-state phase, which refers to the stable iterations that the application graph performs [13]. The average throughput that an application achieves refers to the steady-state throughput of the graph. A comprehensive description of SDF graphs, application modelling and throughput analysis using SDF graphs is given in [15].
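The repetition vector introduced above can be computed mechanically from the token rates by solving the balance equations: for every edge, the number of firings of the producer times its production rate must equal the number of firings of the consumer times its consumption rate. The sketch below is a small illustrative solver written in Python (it is not SDF3), and the token rates used for the three-actor example are assumptions chosen to be consistent with the repetition vector [1 2 1] quoted above.

```python
# Illustrative computation of an SDF repetition vector from the balance equations
# prod(a) * reps[a] == cons(b) * reps[b] for every edge (a, b). Assumes a connected,
# consistent graph; rational arithmetic recovers the smallest integer solution.
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    reps = {actors[0]: Fraction(1)}           # fix the first actor to one firing
    changed = True
    while changed:                            # propagate rates until a fixed point
        changed = False
        for src, dst, prod, cons in edges:
            if src in reps and dst not in reps:
                reps[dst] = reps[src] * prod / cons
                changed = True
            elif dst in reps and src not in reps:
                reps[src] = reps[dst] * cons / prod
                changed = True
    scale = lcm(*(r.denominator for r in reps.values()))
    return {a: int(r * scale) for a, r in reps.items()}

# Three-actor example; rates are assumptions consistent with [a0 a1 a2] = [1 2 1].
edges = [("a0", "a1", 2, 1),   # a0 produces 2 tokens, a1 consumes 1 per firing
         ("a1", "a2", 1, 2)]   # a1 produces 1 token,  a2 consumes 2 per firing
print(repetition_vector(["a0", "a1", "a2"], edges))   # {'a0': 1, 'a1': 2, 'a2': 1}
```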
3.2 SDF MODEL OF THE SDM NETWORK ON CHIP
In a normal SDF graph, the communication channels between actors, indicated by edges, are assumed to have infinite bandwidth and infinite buffer size. In a real system this is not the case: the communication channel between two processors adds communication latency due to its limited bandwidth and limited buffer size. Hence it is necessary to include a model of the interconnecting channel between two processors, so that the application performance can be predicted as accurately as possible. For this purpose, we need to introduce actors and edges into the SDF graph of the application so that the performance of the interconnect is included in the performance analysis. This helps in determining the mapping of the application and in identifying how much of the application performance is linked to the performance of the interconnection network. It also gives us opportunities to determine the type of interconnect that would achieve better bandwidth and to design an efficient interconnect for the processors. This section describes an SDF model of the NoC presented in Chapter 2 that can be incorporated into the SDF graph of any application.
Figure 3.4 describes the SDF model of the SDM based NoC. The various actors of the SDF
graph are used to model the various components of the NoC as described.
Figure 3.4 SDF model of the SDM based NoC
In Fig 3.4, two tasks modelled by actors named ACTOR run on two different processors connected to each other by the SDM-based NoC. The WA and RA actors model the functions that write and read data to and from the FSL links that connect the processors to the NoC. The generated NoC has a data width of 32 bits, so the token size and buffer size in Fig 3.4 are assumed to be in terms of 32-bit data packets. Accordingly, the WA and RA actors are assumed to be tasks that pump 32-bit tokens into the FSL links. However, this model can be replaced by more complex models depending on the implementation of the function in software.
The various actors of the SDF graph that form the NoC model are described below.
a) SREG: This actor models the reading of data from the FSL links at the NI block. The FSL can transfer data at one packet every clock cycle. The FSL buffer is modelled by the back edge from SREG to WA, and the FSL buffer size is indicated by the initial tokens on this edge.
b) SDD: This actor models the data distributor of the NI. It distributes the data to the allocated lanes of the transmission port. A packet arriving at the SDD has to wait until the SDD has switched its output to the correct outgoing wire; in the worst case, the SDD has to pass through all the outgoing wires before a packet is forwarded for transmission. The data distributor takes two clock cycles to forward the incoming data to an output serializer. Hence the WCET of the SDD is (2 + w - l) clock cycles, where w is the number of wires in the NI output port and l is the number of wires allocated to the connection.
c) SRLR: This actor models the serial transmission of the data packet. It receives the data from the data distributor and converts it to serial format. A 32-bit data packet takes 33 clock cycles to transmit, since it includes a start bit at the start of transmission. The number of lanes allocated to the connection is modelled as the auto-concurrency of this actor, which ensures that up to l lanes can be used simultaneously for transmission on the connection.
d) ROUTER: This actor models the clock latency associated with the routers on the connection path. For each router, the execution time of the actor increases by one clock cycle. The auto-concurrency of this actor is also set equal to the number of lanes l.
e) DESRLR: This actor models the deserializer that receives the data and converts it back to the 32-bit packet format. The auto-concurrency of this actor is also set equal to the number of lanes l, since there is a deserializer for each wire of the NoC.
f) RDC: This actor models the receive data collector that receives the data from each lane sequentially. It behaves similarly to the SDD: the data collector scans all the incoming wires sequentially and takes two clock cycles to read the data from a deserializer. In the worst case, a packet arrives on a wire just after the data collector has scanned that wire for new data. Hence the WCET of this actor is also (2 + w - l) clock cycles.
g) RREG: This actor models the writing of the received data to the FSL buffer and is similar to SREG. It writes the received data packet to the FSL buffer at the receiver end, at a rate of one packet per clock cycle.
h) The flow control signal is modelled by a token that is sent from the actor RA to the SRLR through a series of routers. The functioning of the flow control is as described in Chapter 2.
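Putting the timing figures above together, the following sketch (illustrative Python, not an SDF3 input file) collects the worst-case execution times assigned to the NoC-model actors for a given output-port width w and lane allocation l. The per-actor numbers follow the description above; treating the NI-side actors as SYSCLK cycles and the serializer and router actors as TXCLK cycles is my reading of Section 2.2.1, and the DESRLR time is not derived here.

```python
# Worst-case execution times (in clock cycles) of the actors in the SDF model of the
# SDM-based NoC, as described in Section 3.2. 'w' is the number of wires on the NI
# output port, 'l' the number of wires (lanes) allocated to the connection.

def noc_actor_wcet(w: int = 8, l: int = 1, routers_on_path: int = 1, data_bits: int = 32):
    assert 1 <= l <= w
    return {
        "SREG":   1,                    # one packet read from the send FSL per cycle
        "SDD":    2 + (w - l),          # worst case: scan the other outgoing wires first
        "SRLR":   data_bits + 1,        # 32 data bits plus the active-high start bit
        "ROUTER": routers_on_path,      # one clock cycle of latency per router on the path
        "RDC":    2 + (w - l),          # receive data collector, symmetric to the SDD
        "RREG":   1,                    # one packet written to the receive FSL per cycle
    }                                   # DESRLR is not derived here (see Fig 4.2, Chapter 4)

# Example: an 8-wire port with 4 lanes allocated gives SDD and RDC a WCET of 6 cycles.
print(noc_actor_wcet(w=8, l=4))
```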
3.3 APPLICATION MAPPING AND THROUGHPUT ANALYSIS
When analyzing the performance of an application on a multi-processor system, the mapping of the
various tasks of the application needs to be done first. This gives an idea of the channels where the SDF
model of the NoC needs to be inserted into the application SDF graph. A typical design flow to generate
the optimal hardware for a NoC based multi-processor system is shown in Fig 3.5.
Figure 3.5 Optimized hardware generation flow using SDF graph (mapped SDF graph → bandwidth/buffer size calculation → hardware synthesis)
In this section, we describe the method to incorporate the SDF model of the NoC into the SDF
graph of the application and calculate the throughput of the application.
Figure 3.6 Modelling of tasks running on two separate processors
Fig 3.6 shows the SDF model of two tasks, ACTOR1 and ACTOR2, that run on two separate processors connected to each other by an interconnect mechanism. The actor Ac models the function that writes data into the interconnection network, and its self-edge indicates that the tokens are written sequentially into the interconnection network. The actor As models the latency associated with sending the data across the interconnection network. This model of the interconnect can be replaced by the model of the SDM-based NoC described earlier.
For throughput analysis of SDF graphs, a tool called SDF3 [14] has been developed and is freely available online. This tool reads in the SDF graph of the application, specified in XML format, and incorporates various features to calculate the parameters of the application graph. This powerful tool can be utilized for all analysis and simulation operations on SDF graphs. A snippet of the XML format that specifies the description of the SDF graph of an application is given in Fig 3.7.
Figure 3.7 Snippet of XML specification of SDF graph
The XML of the SDF graph declares each actor of the application and defines the various parameters of
the actors such as execution time, port names and port directions, token rate at each edge, etc. The various
edges that represent the communication channels are also described in the XML file.
In order to analyze the application throughput when multiple clocking of the NoC is enabled, the method to be followed differs slightly from the normal flow. In the normal flow, it is assumed that the NoC and all the processors execute at the same clock frequency, and the execution times of all actors are specified in clock cycles of this common clock; the throughput of the graph is then calculated for this single common clock frequency. To model the multiple clocking capability of the NoC, the methodology for calculating the throughput of the application graph has to be modified. For a multiple-clock NoC, there are two cases:
a) The transmission clock TXCLK is an integer multiple of the system clock SYSCLK, i.e. TXCLK = SYSCLK * N, where N is an integer. In this case, multiply the execution times of all actors running in the SYSCLK domain by N and then calculate the period of the graph using SDF3. If Ps is the period calculated by SDF3 and Pc is the actual period of the graph in SYSCLK cycles, then Pc = Ps/N.
For example, consider TXCLK to be 200 MHz and SYSCLK to be 50 MHz in a multicore system with the cores connected by the NoC. The execution times of all the actors running in the SYSCLK domain are multiplied by 4, since N = 200/50 = 4. The period of the application obtained is then in terms of TXCLK cycles; to get the period in terms of SYSCLK, divide the period predicted by SDF3 by 4.
b) The transmission clock TXCLK is a fractional multiple of SYSCLK. In this case, the easiest method is to find a clock frequency Fc that is a common multiple (LCM) of both SYSCLK and TXCLK. If Fc = l * TXCLK and Fc = m * SYSCLK, then the execution times of the actors in the SYSCLK domain are multiplied by m and those of the actors in the TXCLK domain are multiplied by l. For example, consider SYSCLK to be 66 MHz and TXCLK to be 99 MHz, so that TXCLK = 1.5 * SYSCLK. We can choose Fc = 198 MHz, such that Fc = 3 * SYSCLK and Fc = 2 * TXCLK, and multiply the execution times of all actors in the SYSCLK domain by 3 and those in the TXCLK domain by 2. The period of the application obtained would be in terms of Fc; to obtain it in terms of SYSCLK cycles, simply divide the period predicted by SDF3 by 3. (A short sketch of this conversion is given below.)
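The conversion described in cases (a) and (b) can be captured in a few lines, as in the Python sketch below: execution times in each clock domain are scaled up to a common clock that is an integer multiple of both SYSCLK and TXCLK, SDF3 is run on the scaled graph, and the predicted period is divided back down into SYSCLK cycles. The function names are hypothetical, and the sketch only computes the scaling factors; it does not invoke SDF3.

```python
# Scaling execution times to a common clock before SDF3 analysis, and converting the
# predicted period back to SYSCLK cycles (cases (a) and (b) of Section 3.3).
from fractions import Fraction
from math import lcm

def scale_factors(sysclk_mhz: int, txclk_mhz: int):
    """Return (m, l): multipliers for SYSCLK-domain and TXCLK-domain execution times."""
    fc = lcm(sysclk_mhz, txclk_mhz)         # common analysis clock Fc
    return fc // sysclk_mhz, fc // txclk_mhz

def period_in_sysclk(period_from_sdf3: int, sysclk_mhz: int, txclk_mhz: int) -> Fraction:
    """Convert a period predicted on the Fc-scaled graph back into SYSCLK cycles."""
    m, _ = scale_factors(sysclk_mhz, txclk_mhz)
    return Fraction(period_from_sdf3, m)

# Case (a): TXCLK = 200 MHz = 4 * SYSCLK -> multiply SYSCLK-domain times by 4,
# TXCLK-domain times by 1, and divide the SDF3 period by 4.
print(scale_factors(50, 200))        # (4, 1)
# Case (b): SYSCLK = 66 MHz, TXCLK = 99 MHz -> Fc = 198 MHz, multipliers (3, 2),
# and the SDF3 period is divided by 3 to obtain SYSCLK cycles.
print(scale_factors(66, 99))         # (3, 2)
```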
The same method can be used in the analysis of application graphs where different actors are executed by processors running at different but synchronous clock speeds. It is assumed in the above cases that the different clocks of the NoC are not asynchronous in nature; the analysis of throughput with multiple clocking in such asynchronous scenarios is not considered.
3.4 CONCLUSIONS
In this chapter, a brief introduction to SDF graphs was provided. The various features of SDF graphs were described, along with methods to describe the data flow of a streaming application using SDF graphs. Further, SDF3, a tool for the analysis of SDF graphs, was discussed. An SDF model of the SDM-based NoC and methods to calculate the throughput of applications running on multicore systems using the SDM-based NoC as the interconnect were also described in this chapter.
CHAPTER 4
CASE STUDY
In this chapter we study two SDF models, a simple producer-consumer model and a JPEG decoder model, so as to evaluate the effect of the architectural improvements on the NoC. The SDF analysis was carried out using the SDF3 tool available from [14], and the implementation was carried out on a Memec Virtex-4 FPGA board with a Virtex-4 XC4VLX25 FPGA. The NoC had one sending channel and one receiving channel, with 8 sending and receiving wires for transmission and reception. The data width was chosen to be 32 bits.
The SDF model of an example producer-consumer graph is given in Fig 4.1. The execution time of each actor is very small, while the number of tokens produced per firing, 10000 32-bit words, is very large. This model was chosen so as to stress the communication network between the two processors. The actors PRODUCER and CONSUMER each execute in 110 clock cycles on a MicroBlaze processor. The auto-concurrency of each actor is one.
Figure 4.1: SDF model of producer-consumer
From SDF3 analysis, the period of this SDF graph is calculated as 110 clock cycles. After incorporating the SDF model of the NoC, the application graph becomes as shown in Figure 4.2, where the execution time of each actor is indicated in parentheses.
Figure 4.2 SDF model of producer-consumer with NoC (actor execution times in clock cycles: PRODUCER and CONSUMER 110, WA and RA 7, SREG and RREG 1, SDD and RDC 2+8-l, SRLR 36, ROUTER 2, DESRLR 3)
The producer and consumer actors were run on MicroBlaze processors running at 50 MHz. The clock frequency of the NoC was varied from 50 MHz to 250 MHz, beyond which the design showed timing errors during synthesis. The results obtained after varying the different parameters of the NoC are given in Table 3. Figure 4.3 shows the variation in the iterations per second of the SDF graph as the number of lanes allocated to the communication channel is varied, calculated using the SDF3 tool. The linear increase in throughput with the number of lanes allocated to the channel is clearly seen in this graph. The maximum throughput obtained for the SDF model was approximately 712 iterations per second from the SDF3 analysis. Figure 4.4 shows the results of actually executing this model on the MicroBlaze system described earlier. The measured data follows the SDF3 model very closely; the maximum throughput obtained from the actual system was 715 iterations per second.
Figure 4.3 Number of lanes vs. iterations per second on SDF3 (TXCLK = 50, 100, 150, 200, 250 MHz)
Figure 4.4 Number of lanes vs. iterations per second on FPGA (clk = 50, 100, 150, 200, 250 MHz)
Comparing the two graphs, the close correlation between the simulation and the measured data is evident. Analyzing the graphs also indicates that the bandwidth available to the connection increases linearly with the clock frequency as well. Figs 4.5 and 4.6 show the comparison of the period calculated via the SDF3 analysis and the actual measured data for two different values of TXCLK, keeping SYSCLK constant.
Figure 4.5 Measured data vs. SDF3 for TXCLK = SYSCLK (SYSCLK = TXCLK = 50 MHz)
Figure 4.6 Measured data vs. SDF3 for TXCLK = 5*SYSCLK (SYSCLK = 50 MHz, TXCLK = 250 MHz)
PRODUCER-CONSUMER MODEL: predicted (SDF3) and measured period for different TXCLK frequencies

            50 MHz             100 MHz            150 MHz            200 MHz            250 MHz
Wires    SDF3   Measured    SDF3   Measured    SDF3   Measured    SDF3   Measured    SDF3   Measured
  1     360000   349994    180000   179997    120000   120000     90000    89999     80000    79997
  2     180000   174997     90000    90000     70110    70014     70110    70014     70110    70014
  3     120000   116665     70110    70014     70110    70014     70110    70014     70110    70014
  4      90000    87499     70110    70014     70110    70014     70110    70014     70110    70014
  5      70110    70014     70110    70014     70110    70014     70110    70014     70110    70014
  6      70110    70014     70110    70014     70110    70014     70110    70014     70110    70014
  7      70110    70014     70110    70014     70110    70014     70110    70014     70110    70014
  8      70110    70014     70110    70014     70110    70014     70110    70014     70110    70014

Table 3: Predicted period vs. measured period for the producer-consumer model
These results indicate the flexibility we get when choosing the appropriate number of lanes for a connection. Consider a scenario where the connection requirements between two nodes are large but the number of wires available between the nodes is limited. In such a case, we can use multiple clocking to increase the bandwidth available per wire to a suitable value, so that more connection requirements can be satisfied using the same number of available wires. In another scenario, a higher clock frequency of the NoC can result in hotspot development and unbalanced power dissipation across the chip. One solution is to route the connection along alternate, less congested paths over the NoC, using more wires at a reduced transmission frequency.
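One way to make this trade-off concrete is to enumerate, for a required connection bandwidth, the (lanes, TXCLK) combinations that satisfy it using the relation B = N * TXCLK from Section 2.3. The Python sketch below illustrates only this selection step; the candidate clock list and the bandwidth figure in the example are assumptions, not part of the NoC design flow.

```python
# Enumerate (lanes, TXCLK) pairs that satisfy a required connection bandwidth,
# using B = N * TXCLK (Mbit/s). Fewer lanes at a higher clock or more lanes at a
# lower clock can then be chosen depending on wire availability and hotspot/power
# concerns.

def feasible_allocations(required_mbps: float, max_lanes: int = 8,
                         txclk_options=(50, 100, 150, 200, 250)):
    options = []
    for txclk in txclk_options:
        for lanes in range(1, max_lanes + 1):
            if lanes * txclk >= required_mbps:
                options.append((lanes, txclk))
                break                        # smallest lane count for this clock
    return options

# Example: a 400 Mbit/s connection can use 8 lanes at 50 MHz, 4 at 100 MHz,
# 3 at 150 MHz, 2 at 200 MHz, or 2 at 250 MHz.
print(feasible_allocations(400))
```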
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
This chapter describes the major conclusions of this thesis and the major issues that still need to be resolved.
5.1 CONCLUSIONS
The SDM-based NoC described in Chapter 2 offers a very promising solution for providing a guaranteed-throughput interconnection mechanism for MPSoCs. The area overhead associated with storing switching tables in the routers, as in TDM-based NoCs, has been eliminated in this NoC. The existing NoC has been modified to provide multiple clocking support, as described in Chapter 2. The flow control incorporated into the NoC has made the NoC more robust and has reduced the buffer requirements at the receive side, so that data packets are not lost. The serializer and deserializer have been optimized further in terms of resources, resulting in about 12.5% fewer slices than before. The multiple clocking capability has improved the transmission bandwidth and flexibility of the SDM-based NoC: a slower IP can be connected to the NoC while the data transmission occurs at very high speeds. This feature has also added another dimension to how the NoC can be configured for particular scenarios.
The SDF model developed for the NoC has been found to be very accurate in predicting application performance on a multicore system based on the SDM-based NoC. A method to modify the throughput analysis so as to incorporate the multiple clock feature of the NoC is also described in Chapter 3. A case study using a producer-consumer model to evaluate the SDF model is described in Chapter 4. The linear increase in the bandwidth of the connection with the number of lanes and the transmission clock frequency is seen from the results of the experiments.
5.2 FUTURE WORK
A few improvements that can be brought about on the current SDM based NoC are listed here.
a) Area optimization of the Network Interface: The Network Interface of the SDM NoC is very complex and takes up a large amount of resources. As described in Chapter 2, the data distributor takes up nearly 10% of the NI area. The current structure of the data distributor is very complex and involves routing a large number of wires: there is a data distributor for each sending channel, and each data distributor has a separate data bus to each of the serializers. This structure can be optimized so that significant resource savings can be achieved. A new architecture for the NI is proposed in Figure 5.1. Instead of a separate data distributor for each of the channels, as in the present design, an N:1 switch for the N incoming channels can be used. The output of the switch connects to the inputs of all the serializers of the individual outgoing wires, and the controller communicates with each serializer using the existing two-way handshaking mechanism. The area savings of this architecture look promising, and resource savings of up to 50% in terms of slices used are expected.
Figure 5.1 Proposed area efficient NI architecture
The work on this architecture is still in progress.
b) Fault tolerance for the NoC: As more and more nodes are incorporated into the chip, the size of the NoC that connects them also increases, and the chance of a component of the NoC failing increases with it. Adding fault tolerance capability to the NoC can help make the communication system more robust.
BIBLIOGRAPHY

[1] Moore, Gordon E., "Cramming more components onto integrated circuits," Electronics Magazine, pp. 114-117, 1965.

[2] Wolf, Wayne, "Embedded computer architectures in the MPSoC age," WCAE '05: Proceedings of the 2005 Workshop on Computer Architecture Education, 2005.

[3] J.A. De Oliveira and H. Van Antwerpen, "The Philips Nexperia digital video platforms," in Winning the SoC Revolution, Kluwer Academic Publishers, 2003.

[4] Peter Cumming, "The TI OMAP platform approach to SoC," in Winning the SoC Revolution, Kluwer Academic Publishers, 2003.

[5] Martin, G., "Overview of the MPSoC design challenge," Design Automation Conference, 2006, 43rd ACM/IEEE, pp. 274-279, September 2006.

[6] Benini, L. and De Micheli, G., "Network-on-chips: A new paradigm for systems on chip design," Design, Automation and Test in Europe Conference and Exhibition, 2002, Proceedings, pp. 418-419.

[7] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "QNoC: QoS architecture and design process for network on chip," Journal of Systems Architecture, vol. 50, pp. 105-128, 2004.

[8] T. Bjerregaard and J. Sparso, "A router architecture for connection oriented service guarantees in the MANGO clockless network-on-chip," in Proceedings - Design, Automation and Test in Europe, DATE '05, 2005, pp. 1226-1231.

[9] K. Goossens, J. Dielissen, and A. Radulescu, "Æthereal network on chip: concepts, architectures, and implementations," IEEE Design & Test of Computers, vol. 22, pp. 414-421, 2005.

[10] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, "Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip," in Proceedings - Design, Automation and Test in Europe Conference and Exhibition, 2004, pp. 890-895.

[11] A. Leroy, D. Milojevic, D. Verkest, F. Robert, and F. Catthoor, "Concepts and implementation of spatial division multiplexing for guaranteed throughput in networks-on-chip," IEEE Transactions on Computers, vol. 57, pp. 1182-1195, 2008.

[12] Joseph, Yang Zhiyao, "An area efficient dynamically reconfigurable spatial division multiplexing Network on Chip with static throughput guarantee," FYP Thesis, May 2010.

[13] Kumar, Akash, "Analysis, Design and Management of Multimedia Multiprocessor Systems," PhD Thesis, April 2010.

[14] SDF3, a tool for SDF graph analysis, available for download at http://www.es.ele.tue.nl/sdf3/

[15] Stuijk, Sander, "Predictable Mapping of Streaming Applications on Multiprocessors," PhD Thesis, October 2007.