XTRANC: A Routing Algorithm for Dynamically Reconfigurable Networks-on-chip
Arash Farhadi Beldachi *1, Simon Hollis **2, Jose L. Nunez-Yanez *3
* Department of Electronic Engineering, University of Bristol, UK
1 arash.beldachi@bristol.ac.uk
3 j.l.nunez-yanez@bristol.ac.uk
** Department of Computer Science, University of Bristol, UK
2 simon@compsci.bristol.ac.uk
Abstract—This paper presents a novel routing algorithm called XTRANC which supports topologies based on a
variable number and size of inner-torus building blocks. The inner-tori partition a traditional mesh network into an
arbitrary number of sub-networks to increase the mesh performance. The sub-networks can form non-regular
global topologies, which are also supported by the XTRANC algorithm. XTRANC is especially suitable for
dynamically reconfigurable networks mapped to commercial FPGAs, in which additional links are added to the mesh
topology at run-time to reduce congestion depending on application behaviour and resource availability. XTRANC
allows links to be inserted as requested by different parts of the application without centralized control, and this
research shows that, despite this dynamic behaviour, the routing algorithm remains deadlock-free.
1. Introduction
NoC (Network-on-Chip) architectures can be constructed with regular (i.e. uniform) or irregular (i.e. custom)
topologies. Regular topologies such as the traditional mesh are more popular because, for example, their
modularity and predictability help their design, verification and analysis. However, mapping
irregular task graphs onto these topologies can result in local congestion and lower link utilization. Custom
topologies avoid this by being designed for specific applications; they usually affect the number of routers and
links and offer better system performance. On the other hand, they lack general purpose computation
capabilities and are complex to design [1].
Among regular networks, the mesh (e.g. 2D/3D) is the most popular way of achieving a scalable communication
model in NoC systems [2]. The torus topology is a good alternative that increases the throughput and reduces the
average latency but, when the network is large, the achievable clock frequency can degrade because of the
long links needed between the border nodes. Partitioning the mesh network into smaller networks that form
inner-tori could be a good way to increase the mesh network performance and bridge the performance gap
between regular and customized networks. Creating the inner-tori means that the internal routers must be
able to modify their architecture to support additional communication ports. Also, the additional links affect the
regularity of the original topology and demand a special routing algorithm. In this scenario, the topology is not
completely irregular, but has a regular underlying topology with extra links.
In this work, we propose a new routing algorithm called eXtended Torus Routing Algorithm for NoC
(XTRANC), which is based on the Torus Routing Algorithm for NoC (TRANC) [3]. We have selected TRANC
because it uses only one virtual channel per physical channel and is a deadlock-free routing algorithm for the
Torus NoC. TRANC and XTRANC take advantage of the Interconnection Routing Notation (IRN) [3], which is
a map-based systematic approach to designing routing algorithms for mesh and torus NoCs and can be extended
to other topologies as well.
XTRANC is suitable for both homogeneous and irregular networks. It helps mesh networks to achieve better
performance by partitioning them into sub-networks and it is helpful when there is a need to add extra link(s) to
the mesh networks to reduce the latency in specific paths according to the current traffic pattern. This routing
algorithm is designed to work in NoCs implemented in FPGAs (Field Programmable Gate Arrays) that support
dynamic reconfiguration. Dynamic reconfiguration allows changing the network logic at run-time and is
available in state-of-the-art FPGAs from major manufacturers. The contributions of this paper can be summarised as
follows:
1) We propose and develop the deadlock-free XTRANC algorithm for dynamically reconfigurable networks
using Interconnection Routing Notation.
2) We use a physical prototype based on the SoCWire NoC to evaluate the complexity, latency and throughput
of the XTRANC algorithm.
3) We evaluate the performance of XTRANC in homogeneous and irregular networks and compare it with the
popular 2D mesh topology.
The rest of the paper is organized as follows: Section 2 reviews related work. Section 3 briefly introduces the
Interconnection Routing Notation used to develop XTRANC. XTRANC is introduced in Section 4, focusing on
its deadlock-free behaviour. Section 5 introduces our performance evaluation platform based on FPGA devices.
Section 6 investigates XTRANC deployed in a homogeneous network resulting from partitioning a mesh
network into an arbitrary number of inner-tori. Section 7 evaluates XTRANC after adding extra long-range links
to the mesh network, creating irregular topologies. Finally, Section 8 concludes the paper.
2. Related Works
Several routing algorithms based on the traditional ‘X-Y’ algorithm have been proposed and here we highlight
key work. A comprehensive review is available in [4] that includes routing algorithms for NoC architectures at
all abstraction levels, from the algorithmic level to actual implementation.
Odd-even turn [5] is a routing algorithm that avoids deadlock by restricting the locations where
turns can take place. DyAd [6] combines oe-fix, which is a deterministic routing algorithm, with
odd-even routing. In this scheme, routers can switch between the two routing algorithms depending on the
congestion conditions of the network.
Intermittent X, Y (IX/Y) [7] was introduced to achieve fewer collisions. This algorithm uses the
‘X-Y’ and ‘Y-X’ algorithms intermittently to choose the next hop, which improves the average delay. A variable
that toggles between ‘0’ and ‘1’ implements the selection mechanism: the variable is set to ‘0’ for the
first packet to select the ‘X-Y’ routing algorithm and to ‘1’ for the second packet to select the ‘Y-X’ routing
algorithm. The toggling between ‘0’ and ‘1’ continues for subsequent packets. The authors used a simulator to
evaluate a 4x4 mesh network.
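The alternation described for IX/Y can be sketched in a few lines of Python. This is only an illustration of the toggling idea, not the implementation from [7]: the names `xy_route`, `yx_route` and `IntermittentRouter` are hypothetical, and keeping one toggle per router object is an assumption.

```python
def _walk(path, axis, target):
    """Extend path hop by hop along one axis (0 = X, 1 = Y) until target is reached."""
    pos = list(path[-1])
    while pos[axis] != target:
        pos[axis] += 1 if target > pos[axis] else -1
        path.append(tuple(pos))


def xy_route(src, dst):
    """Deterministic X-Y routing on a 2D mesh: X dimension first, then Y."""
    path = [src]
    _walk(path, 0, dst[0])
    _walk(path, 1, dst[1])
    return path


def yx_route(src, dst):
    """Deterministic Y-X routing on a 2D mesh: Y dimension first, then X."""
    path = [src]
    _walk(path, 1, dst[1])
    _walk(path, 0, dst[0])
    return path


class IntermittentRouter:
    """Alternates between X-Y and Y-X routing on successive packets."""

    def __init__(self):
        self.toggle = 0  # 0 selects X-Y, 1 selects Y-X (assumed per-router state)

    def route(self, src, dst):
        path = xy_route(src, dst) if self.toggle == 0 else yx_route(src, dst)
        self.toggle ^= 1  # flip the selector for the next packet
        return path
```

With this sketch, two consecutive packets between the same pair of nodes take the two different dimension orders, which is how the scheme spreads traffic over both paths.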
Dynamic XY (DyXY) [8] is an adaptive routing algorithm based on the congestion conditions in the
proximity of the router. In this algorithm, packets are only routed along shortest paths between source and
destination. When there are multiple shortest paths between source and destination, one of them is selected
according to the network congestion conditions. A simulator was employed to measure the performance.
XYX [9] is a fault-tolerant routing algorithm. It employs the ‘X-Y’ and ‘Y-X’
algorithms and is designed for higher traffic loads. In this routing algorithm, the source node adds an error
detection code to the header of the packet and copies the packet before sending. It keeps the
copy as a redundant packet and sends the packet to the destination. The destination node checks the packet,
calculates the error detection code and compares it with the one in the packet header. The packet is accepted
when there is no error. If there is an error, the destination node sends a NACK message to the source node
to request that the packet be resent using the redundant copy. The source node sends original packets to the
destination using the ‘X-Y’ routing algorithm and sends redundant packets using the ‘Y-X’ routing
algorithm. Simulation was used for the experimental analysis in this work.
[10] proposed an adaptive routing algorithm called Adaptive XY, which operates as deterministic ‘X-Y’ routing or
as adaptive routing depending on the congestion conditions of the current router's neighbours. In this scheme, a
switch acting as a context-aware agent routes the packets according to the network congestion status, and a
packet is routed through a less congested path in the high congestion scenario. This work used simulation to
evaluate the algorithm.
In contrast to these works, XTRANC is specially designed to work in NoCs implemented in FPGAs and
supports both homogeneous and irregular networks. It helps mesh networks to achieve better performance by
partitioning them into inner-tori, and it is helpful when there is a need to add extra long-range link(s) to the mesh
networks to reduce the latency in specific paths according to the current traffic pattern using dynamic
reconfiguration.
Other efforts investigate alternative regular topologies to the mesh and the addition of extra long-range
links to a mesh network to increase the network performance. We highlight relevant examples in
this section.
A new NoC topology, called a partially interconnected mesh (full-mesh) network, and a routing
algorithm to support the architecture are presented in [11, 12]. This architecture extends mesh networks by
adding four bidirectional channels to remove congestion and hot spots compared to mesh networks, but it
introduces a significant overhead.
Diametrical 2D mesh [13] adds eight extra links to the mesh topology to reduce
the mesh diameter when it is expanded by a large number of IP cores, and tries to minimize the area and power
consumption. The number of extra links is fixed at eight and does not grow as the number of
IP cores increases. Adding eight extra links to a mesh network partitions it into four overlapping
sub-networks, and these extra links connect the edges of the sub-networks to each other. The main issue is that the
sub-network size grows with the number of IP cores, which makes the extra links longer and causes difficulty in
the later stages of the implementation, such as place and route. In addition, the traffic load is not well balanced
and distributed in this method.
[14] presented mesh networks with random additional links. This work proved
that adding random extra links can increase the critical load of a network, beyond which performance
degrades. This work is theoretical and calls for a practical analysis and implementation.
[15] introduces a technique to synthesize an optimal irregular mesh topology for a specific application and
demonstrates that an irregular mesh can increase the performance compared to a traditional mesh topology. In
this method, a computationally expensive analysis of the frequency and magnitude of communication between IP
cores in the application is performed to identify the optimum locations to add Long Range Links (LRLs) to the
traditional mesh topology. The frequency of communication between nodes during the execution
of the application on a traditional mesh must therefore be identified, so this technique is suitable for applications
with predictable communication behaviour. This method does not need any global information for routing but needs
local information (at least 1 neighbour) about the network to compute the distance. The shorter route calculation
takes time, which could increase the router latency. The resulting network is application
specific and is not suitable for general purpose computing.
[16] introduced the Skip-link architecture, which
dynamically reconfigures NoC topologies to reduce the overall switching activity in many-core systems. This
architecture allows the creation of long-range skip links at run-time to reduce the logical distance between
regularly communicating nodes. However, the logical distance between some other nodes is increased, which
causes hop penalties, so Skip-links may only be placed when, summed over all of the traffic flows, the
hop-count savings outweigh the penalties [16]. In addition, the architecture is not targeted at reconfigurable chips.
To the best of our knowledge, this is the first work to propose a routing algorithm for both modified regular and
irregular meshes with extra long-range links that is specially designed for dynamically reconfigurable FPGAs. Our
proposed routing algorithm does not need any lookup tables or any global or local information when the extra links
are added, and it is designed to work on reconfigurable chips that can alter the network infrastructure at run-time.
3. Background
Interconnection Routing Notation (IRN) [3] is a map-based systematic approach to designing routing algorithms for
mesh and torus based NoCs. This notation is useful for understanding and formulating routing
algorithms for NoCs and for validating new algorithms as deadlock-free. This approach can be extended to create
routing algorithms for other interconnection topologies as well. The IRN explores the rule of movement along one
dimension and is suitable for an ‘X-Y’ based routing algorithm in which a packet first traverses the ‘X’ and then
the ‘Y’ dimension to reach its destination. The IRN has two parts, the IRN Graph and the IRN Map. The IRN Graph
uses arrows to show the direction of movement from a source node to each of the other (destination) nodes. The
IRN Map is based on the IRN Graph and displays the direction that a packet should traverse from the current node
to get closer to the destination.
Fig.1 shows the IRN for a 5×5 mesh topology with the ‘X-Y’ routing algorithm. As can be seen in Fig.1.a, the
first row of the IRN Graph, which is marked ‘Adjacent moves’, displays a row of the network to
reveal the topology. The following rows use arrows to display the direction of movement from a node to the other
(destination) nodes. In the original IRN Graph, only directions of movement for destinations more
than one hop away are considered. In Fig.1.a, we show all the directions for clarity. Fig.1.b
displays the IRN Map, which is based on the IRN Graph of Fig.1.a. Each row number represents a current node
and each column number represents a destination node. Each cell contains the direction that a packet should
traverse from the current node to get closer to the destination. The ‘+’ and ‘-’ signs show
movement in the positive and negative directions of the dimension respectively. The ‘+’ and ‘-’ signs in circles
represent movements which are unchangeable because it is not reasonable to select the opposite direction to
reach the destination. The ‘+’ and ‘-’ signs without circles are selectable and can create different routing
algorithms, some of which are deadlock-free while others can generate deadlock. To find the best
routing algorithm using the IRN, one which is deadlock-free and optimal, the following rules should be
considered [3]:
1) There should not be more than one sign change in each column, to avoid livelock. Likewise, there must not be
more than one sign change in a row.
2) It is better to have an equal number of ‘+’s and ‘-’s in the selectable area and in each row to achieve a more
optimal routing.
3) There should be one row in which all selectable movements are ‘-’ and another in which they are all ‘+’ for the
algorithm to be deadlock-free [3].
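As an illustration of the notation, the following Python sketch builds the IRN Map for one dimension of a mesh and checks the first rule above. It is a minimal sketch under stated assumptions: the function names are hypothetical, the diagonal (current node equals destination) is marked with ‘.’, and a "sign change" is counted between consecutive ‘+’/‘-’ cells with the diagonal skipped, which is one possible reading of the rule in [3].

```python
def mesh_irn_map(n):
    """IRN Map for one dimension of an n-node mesh.

    Row = current node, column = destination node.
    '+' moves towards higher indices, '-' towards lower ones, '.' is the node itself.
    """
    return [['.' if d == c else ('+' if d > c else '-') for d in range(n)]
            for c in range(n)]


def sign_changes(cells):
    """Count changes between consecutive '+'/'-' cells, skipping any other symbol."""
    signs = [s for s in cells if s in '+-']
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def at_most_one_sign_change(irn_map):
    """Check the first rule: no row and no column has more than one sign change."""
    rows_ok = all(sign_changes(row) <= 1 for row in irn_map)
    cols_ok = all(sign_changes(col) <= 1 for col in zip(*irn_map))
    return rows_ok and cols_ok
```

For a 5-node mesh, `mesh_irn_map(5)` gives rows such as `- . + + +` for node 1, and the check passes because every row and column changes sign exactly once around the diagonal.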
[Figure: a) IRN Graph, with an ‘Adjacent moves’ row listing nodes 0 to 4 and rows ‘From 0’ to ‘From 4’ showing
the movements to nodes 0 to 4 as arrows]
b) IRN Map
From \ To   0   1   2   3   4
    0           +   +   +   +
    1       -       +   +   +
    2       -   -       +   +
    3       -   -   -       +
    4       -   -   -   -
Fig.1. The IRN for a 5×5 mesh using ‘X-Y’ routing algorithm
[Figure: nested IRN Maps (rows: From, columns: To) for Torus 4x4, Torus 5x5, Torus 6x6, Torus 7x7 and Torus 8x8]
Fig.2. The IRN Map for the TRANC in 1 dimension of an n×n torus (n=4 to 8) [3]
The IRN has been employed to propose a routing algorithm called Torus Routing Algorithm for NoC (TRANC) [3],
which is a deadlock-free deterministic routing algorithm for torus NoCs and uses only one virtual channel
per physical channel. Fig.2 presents the TRANC IRN Map for one dimension of the 4×4 to 8×8 torus networks.
The IRN Map for a ring with n nodes is generated by adding a row and a column to the IRN Map of the ring with
n-1 nodes. In the newly added row, the rightmost two cells are filled with ‘-’ and the others with ‘+’ signs;
in the newly added column, the two lowest cells are filled with ‘+’ and the others with ‘-’ signs. A loop
of movements in the positive or negative direction causes deadlock. TRANC is a deadlock-free routing
algorithm because it breaks both loops. For example, in the 8×8 torus of Fig.2, there is no positive movement
of more than one hop from node 5 to other nodes, which breaks the positive loop in the network. In addition,
there is no negative movement of more than one hop from node 6 to other nodes, which breaks the negative loop in
the network as well.
4. Proposed Routing Algorithm
4.1. Algorithm description
Fig.3 shows a ring topology with 16 nodes in one dimension, which is ‘X’ in this case. This topology is the basic
building block of the torus topology. In this topology, the border nodes, which are the first and the last
nodes (i.e. nodes R0 and R15), are connected by a wrap-around link. When a node sends a packet to another
node, the routing algorithm decides, according to the destination address, whether to send the packet to the left
(west) or right (east) port of the router.
[Figure: nodes R0 to R15 connected along the X dimension, with a wrap-around link between R0 and R15]
Fig.3. Ring topology with 16 nodes
[Figure: nodes R0 to R15 along the X dimension with an extra link ‘i’; labelled regions: Outside the inner-ring
(left), Left border, Inside the inner-ring, Right border, Outside the inner-ring (right)]
Fig.4. An inner-ring with 9 nodes
Fig.4 shows an inner-ring with 9 nodes. In this inner-ring, an extra link, ‘i’, connects the left (Rs) and right
(Rs+8) border nodes. There are two types of nodes in an inner-ring: the nodes inside the ring and the
border nodes. Consequently, a routing algorithm for the inner-ring topology must handle both
types of nodes.
[Figure: IRN Graph and IRN Map for an inner-ring with nodes Xs, Xs+1, Xs+2 and Xs+3, including the extra
direction ‘i’ between the border nodes Xs and Xs+3]
Fig.5. The IRN for a 4×4 inner-ring (1D XTRANC)
The IRN has been employed to derive the routing algorithm for the inner-ring topology, called 1D XTRANC. Fig.5
shows the IRN Graph and Map for an inner-ring with four nodes which starts at Xs. As can be seen in this
figure, there are two main differences between the IRN Graph and Map for the inner-torus and for the torus: the
node numbers are shifted by an offset, which is the left border of the inner-torus, and the direction ‘i’ is added to
the IRN Graph and Map. The ‘i’ is a bidirectional link between the left and right borders of the inner-torus. The
‘i’ direction is treated as the negative direction when the left border node is the source and the right border node
is the destination, and as the positive direction when the right border node is the source and the left
border node is the destination. In this scheme, there is no more than one sign change in each column, which
avoids livelock, and no more than one sign change in each row. In addition, there is no loop
of movements in the positive or negative direction, which would cause deadlock. As can be seen in Fig.5, this
algorithm is deadlock-free because there is no positive movement of more than one hop from
node Xs+1 to other nodes, which breaks the positive loop in the network, and no negative
movement of more than one hop from node Xs+2 to other nodes, which breaks the negative loop in the network as
well.
Fig.6 presents the IRN Map for the inner-ring topology with offset Xs for radix n=4 to n=8. The IRN
Map for a radix n ring with offset Xs is generated by adding a row and a column to the IRN Map of the radix n-1
ring with offset Xs. In the newly added row, the rightmost two cells are filled with ‘i/-’ and the others with ‘+’
signs. The ‘i/-’ means that ‘i’ is selected when this row is the first row and ‘-’ is selected when the row
is not the first row. In the newly added column, the two lowest cells are filled with ‘i’ (first) and ‘+’ (second)
and the others with ‘-’ signs. In the 8×8 inner-torus of Fig.6, there is no positive
movement of more than one hop from node Xs+5 to other nodes, which breaks the positive loop in the network.
In addition, there is no negative movement of more than one hop from node Xs+6 to other nodes, which breaks the
negative loop in the network as well. This shows that XTRANC is a deadlock-free routing algorithm according
to the proof proposed in [3] and summarized in Section 3.
[Figure: nested IRN Maps (rows: From, columns: To, nodes Xs to Xs+7) for Inner-ring (n=4), Inner-ring (n=5),
Inner-ring (n=6), Inner-ring (n=7) and Inner-ring (n=8), with ‘+’, ‘-’, ‘i’ and ‘i/-’ entries]
Fig.6. The IRN Map for the inner-ring topology (1D XTRANC, n=4 to 8)
4.2. Operational Examples
We assume that an inner-ring topology contains no nested or overlapping inner-rings
and that each node can use only one extra link in this dimension. Under these constraints, one or
more inner-rings can be added to the topology. In this scenario, the routing algorithm is divided into more than one
segment. Each segment uses either XTRANC or the ‘X’ routing algorithm, both of which are deadlock-free. The
routing algorithm used to connect the different segments is ‘X’, which is also deadlock-free. The
following examples illustrate this concept. Fig.7 and Fig.8 show two examples of the inner-ring topology. As
can be seen in these figures, it is possible to have only inner-rings or regular nodes in the topology.
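The segmentation described above can be sketched as a small Python helper. This is an illustration only, not part of the XTRANC implementation: the function name `segment_row`, the tuple layout and the labels "X" and "XTRANC" are hypothetical, and inner-rings are assumed to be given as inclusive (left border, right border) index pairs.

```python
def segment_row(num_nodes, inner_rings):
    """Split nodes 0..num_nodes-1 of one dimension into routing segments.

    inner_rings: list of (start, end) border indices, inclusive.
    Returns a list of (first_node, last_node, algorithm) tuples, where the
    algorithm is "XTRANC" for an inner-ring and "X" for regular nodes.
    """
    segments = []
    next_free = 0  # first node not yet assigned to a segment
    for start, end in sorted(inner_rings):
        if not (0 <= start < end < num_nodes):
            raise ValueError("inner-ring borders out of range")
        if start < next_free:
            # Also rejects shared borders: a node may use only one extra link.
            raise ValueError("inner-rings must not overlap or nest")
        if start > next_free:
            segments.append((next_free, start - 1, "X"))
        segments.append((start, end, "XTRANC"))
        next_free = end + 1
    if next_free < num_nodes:
        segments.append((next_free, num_nodes - 1, "X"))
    return segments
```

For a 16-node row with a single inner-ring between R4 and R12, `segment_row(16, [(4, 12)])` yields three segments: regular nodes R0 to R3, the inner-ring R4 to R12, and regular nodes R13 to R15.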
[Figure: nodes R0 to R15 along the X dimension with an extra link ‘i’; Segment 1: outside the inner-ring (left),
Segment 2: inner-ring from the left border to the right border, Segment 3: outside the inner-ring (right)]
Fig.7. An inner-ring topology with 3 segments
[Figure: nodes R0 to R15 along the X dimension with extra links i1, i2 and i3; Segment 1: Inner-ring 1,
Segment 2: Inner-ring 2, Segment 3: Regular nodes, Segment 4: Inner-ring 3]
Fig.8. An inner-ring topology with 4 segments
Fig.7 displays an inner-ring topology which has been divided into 3 segments: 4 regular nodes outside the left
border (segment 1), an inner-ring with 9 nodes (segment 2) and 3 regular nodes outside the right border
(segment 3). There are three routing domains: segments 1 and 3 use the ‘X’ routing algorithm and segment 2
employs XTRANC. We study all the different communications among the segments:
1. The source node is in segment 1 and the destination is in segment 3: As an example, we
consider node R1 as the source and node R15 as the destination. First, segment 1 uses the ‘X’ routing
algorithm to send the packet from node R1 to node R2 and then to node R3, which is the right border of this
segment. Then, this node passes the packet to its right neighbour, node R4, the left border of
segment 2, via the ‘X’ routing algorithm. After that, node R4 sends the packet from the left border to node R12,
which is the right border of segment 2, through the extra link (i). Then, the right border of segment 2
sends the packet to its right neighbour, node R13, which is the left border of segment 3, using the ‘X’ routing
algorithm. Finally, node R13 uses the ‘X’ routing algorithm to send the packet via node R14 to its destination in
this segment, node R15. A similar procedure is used when the source node
is in segment 3 and the destination is in segment 1.
2. The source node is in segment 1 and the destination is in segment 2: As an example, we
consider node R1 as the source and node R7 or R10 as the destination. First, segment 1 uses the
‘X’ routing algorithm to send the packet from node R1 to node R3, which is the right border of this segment.
Then this node passes the packet to its right neighbour, node R4, the left border of segment 2,
using the ‘X’ routing algorithm. At this step, the packet is inside segment 2 and node R4 sends the packet to
the destination according to the IRN Map for the inner-ring topology. When the destination is node R7, node R4
sends the packet through nodes R5 and R6 to the destination. When the destination is node R10, node R4 uses
the extra link ‘i’ and the packet travels through nodes R12 and R11 to the destination.
A similar procedure is used when the source node is in segment 3 and the destination is
in segment 2.
3. The source node is in segment 2 and the destination is in segment 1 or segment 3: As an example, we
consider node R7 as the source and node R1 as the destination. First, segment 2 uses the IRN
Map for the inner-ring topology to send the packet from node R7 via nodes R6 and R5 to node R4, which is the
left border of this segment. Then, this node passes the packet to its left neighbour, node R3, the
right border of segment 1, using the ‘X’ routing algorithm. Finally, node R3 uses the ‘X’ routing algorithm to
send the packet through node R2 to the destination node, which is node R1.
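To see how much the extra link shortens the first example, the hop counts can be compared with a plain breadth-first search. Note that this sketch is not XTRANC: it only computes shortest paths on the row of Fig.7 with and without the extra link, and XTRANC routes need not always coincide with shortest paths. The function name and the link list format are hypothetical.

```python
from collections import deque


def shortest_path(num_nodes, extra_links, src, dst):
    """Shortest path on a row of nodes 0..num_nodes-1 with optional extra links.

    extra_links: list of (a, b) pairs of nodes joined by a bidirectional link.
    """
    neighbours = {v: set() for v in range(num_nodes)}
    for v in range(num_nodes - 1):  # regular X links between adjacent nodes
        neighbours[v].add(v + 1)
        neighbours[v + 1].add(v)
    for a, b in extra_links:  # inner-ring links such as 'i'
        neighbours[a].add(b)
        neighbours[b].add(a)

    parent = {src: None}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            break
        for w in sorted(neighbours[v]):
            if w not in parent:
                parent[w] = v
                queue.append(w)

    path = []
    v = dst
    while v is not None:  # walk back from the destination to the source
        path.append(v)
        v = parent[v]
    return path[::-1]
```

On the plain 16-node row, R1 to R15 takes 14 hops. With the extra link between R4 and R12, the search returns R1, R2, R3, R4, R12, R13, R14, R15, i.e. 7 hops, which is the same node sequence as in the first example above.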
Fig.8 has 4 segments: three inner-rings and one segment with 2 regular nodes. The segment with
regular nodes employs the ‘X’ routing algorithm and the inner-ring segments use XTRANC. The connections
between the different segments use the ‘X’ routing algorithm as well. For example, we consider
node R1 as the source and node R13 as the destination. First, node R1 uses the IRN Map for the inner-ring
topology and passes the packet via node R0 to node R3, which is the right border of segment 1. Then, node R3
sends the packet to its right neighbour, node R4 in segment 2, using the ‘X’ routing algorithm. After that, node R4,
which is the left border of inner-ring 2 in segment 2, sends the packet to the right border of this segment,
which is node R8, via the extra link i2. Then node R8 sends the packet to its right neighbour, node R9, which is
the left border of segment 3, using the ‘X’ routing algorithm. Node R9 sends the packet to node R10, the right
border of segment 3, using the ‘X’ routing algorithm. After that, node R10 employs the ‘X’ routing algorithm to
send the packet to its right neighbour, node R11, the left border of segment 4. Finally, node R11 uses the IRN Map
for the inner-ring topology to send the packet via node R12 towards the destination node in this segment, which
is R13.
4.3. Implementing the 1D XTRANC
Two parameters, Ring and Node, are used for router initialisation in the
network to implement the routing algorithm. If the node is a regular node, both Ring and Node are
set to zero. Ring is set to 1 when the node is inside the inner-ring and Node is set to 1 when
the node is on the border of the inner-ring. Regular nodes, which are not in an inner-ring, employ the ‘X’
routing algorithm. To implement the IRN Map for an inner-ring, we consider the location of the current node in the
inner-ring, which is either a) an inside node or b) a border node. Fig.9 and Fig.10 show the routing
algorithm pseudo code for the nodes inside and the border nodes of the inner-rings respectively. When the
source node is inside and the destination node is outside the inner-ring, the packet is routed to the inner-ring
border on the same side as the destination node. When the source and destination nodes are both in the
inner-ring, the packet is routed according to the IRN Map for the inner-ring topology.
Inputs: X coordinates of the current node (Xcurrent) and of the inner-ring borders (Xstart, Xend),
X coordinate of the destination node (Xdest);
Output: Selected output channel;
n := Xend-Xstart+1;
if (Xdest = Xcurrent) then return Local port;
-- Xtarget is the destination itself when it lies in the inner-ring,
-- otherwise the inner-ring border on the same side as the destination.
if (Xdest <= Xend and Xdest >= Xstart) then Xtarget := Xdest;
elsif (Xdest < Xstart) then Xtarget := Xstart;
else Xtarget := Xend;
-- Movement in the positive direction
if (Xtarget-1 = Xcurrent-n) or
   ((Xcurrent = Xstart+n-4) and (Xtarget = Xstart+n-2)) or
   ((Xcurrent >= Xstart+n-2) and (Xtarget <= Xstart+n-4)) or
   ((Xtarget-Xcurrent > 0) and (Xtarget <= Xstart+n-3)) or
   (Xtarget-1 = Xcurrent)
then return X+;
-- Movement in the negative direction
if (Xtarget-Xcurrent = -1) or (Xtarget-Xcurrent = n-1) or
   ((Xcurrent = Xstart+n-1) and (Xtarget = Xstart+n-3)) or
   ((Xcurrent <= Xstart+n-5) and (Xtarget >= Xstart+n-2)) or
   ((Xtarget-Xcurrent < 0) and (Xcurrent <= Xstart+n-3)) or
   ((Xtarget-Xcurrent > 1) and (Xtarget = Xstart+n-1))
then return X-;
End;
Fig.9. Pseudo code for the nodes inside the inner-rings
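For experimentation, the pseudo code of Fig.9 can be transliterated into Python as follows. This is a sketch rather than the authors' implementation: the function and argument names are hypothetical, the three branches of the pseudo code (destination in the inner-ring, left of it, right of it) are folded into a single target coordinate, and returning `None` when neither condition holds is an assumption, since the pseudo code does not state what happens in that case.

```python
def route_inside_node(x_current, x_dest, x_start, x_end):
    """Output port for a node inside an inner-ring, following the conditions of Fig.9.

    x_start and x_end are the left and right borders of the inner-ring.
    """
    n = x_end - x_start + 1
    if x_dest == x_current:
        return "Local"

    # Destinations outside the inner-ring are replaced by the border on the same side.
    if x_dest < x_start:
        t = x_start
    elif x_dest > x_end:
        t = x_end
    else:
        t = x_dest

    c, s = x_current, x_start
    if (t - 1 == c - n
            or (c == s + n - 4 and t == s + n - 2)
            or (c >= s + n - 2 and t <= s + n - 4)
            or (t - c > 0 and t <= s + n - 3)
            or t - 1 == c):
        return "X+"
    if (t - c == -1 or t - c == n - 1
            or (c == s + n - 1 and t == s + n - 3)
            or (c <= s + n - 5 and t >= s + n - 2)
            or (t - c < 0 and c <= s + n - 3)
            or (t - c > 1 and t == s + n - 1)):
        return "X-"
    return None  # assumption: no port selected when neither condition matches
```

With the inner-ring of Fig.7 (borders R4 and R12), node R7 sending to R1 selects X-, node R5 sending to R7 selects X+ and node R11 sending to R10 selects X-, in line with the hops described in the operational examples.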
Inputs: X coordinates of the current node (Xcurrent) and of the inner-ring borders (Xstart, Xend),
X coordinate of the destination node (Xdest);
Output: Selected output channel;
n := Xend-Xstart+1;
if (Xdest = Xcurrent) then return Local port;
elsif (Xcurrent = Xend) then
{ -- The current node is the right border
  if (Xdest <= Xend and Xdest >= Xstart) then
  { if (Xdest-1 = Xcurrent) or
       ((Xcurrent = Xstart+n-4) and (Xdest = Xstart+n-2)) or
       ((Xcurrent >= Xstart+n-2) and (Xdest <= Xstart+n-4)) or
       ((Xdest-Xcurrent > 0) and (Xdest <= Xstart+n-3))
    then return i port;
    elsif (Xdest-Xcurrent = -1) or (Xdest-Xcurrent = n-1) or
       ((Xcurrent = Xstart+n-1) and (Xdest = Xstart+n-3)) or
       ((Xcurrent <= Xstart+n-5) and (Xdest >= Xstart+n-2)) or
       ((Xdest-Xcurrent < 0) and (Xcurrent <= Xstart+n-3)) or
       ((Xdest-Xcurrent > 1) and (Xdest = Xstart+n-1))
    then return X- port; }
  elsif (Xdest <= Xstart) then return i port;
  elsif (Xdest > Xend) then return X+ port; }
elsif (Xcurrent = Xstart) then
{ -- The current node is the left border
  if (Xdest <= Xend and Xdest >= Xstart) then
  { if (Xdest-Xcurrent = -1) or (Xdest-Xcurrent = Xstart+n-1) or
       ((Xcurrent = Xstart+n-1) and (Xdest = Xstart+n-3)) or
       ((Xcurrent <= Xstart+n-5) and (Xdest >= Xstart+n-2)) or
       ((Xdest-Xcurrent < 0) and (Xcurrent <= Xstart+n-3)) or
       ((Xdest-Xcurrent > 1) and (Xdest = Xstart+n-1))
    then return i port;
    elsif (Xdest-1 = Xcurrent) or (Xdest-1 = Xcurrent-Xstart-n) or
       ((Xcurrent = Xstart+n-4) and (Xdest = Xstart+n-2)) or
       ((Xcurrent >= Xstart+n-2) and (Xdest <= Xstart+n-4)) or
       ((Xdest-Xcurrent > 0) and (Xdest <= Xstart+n-3))
    then return X+ port; }
  elsif (Xdest >= Xend) then return i port;
  elsif (Xdest <= Xstart) then return X- port; }
End;
Fig.10. Pseudo code for the border nodes of the inner-rings
The inner-ring topology is the basic building block and is also used in 2D (i.e. ‘X-Y’) networks. As in the 1D case,
XTRANC for 2D networks is deadlock and livelock free. In the 2D scenario, a packet first travels towards its
destination along the X dimension and then along the Y dimension. While traversing a dimension, XTRANC allows
a packet to overshoot its destination and backtrack within the same dimension. By allowing turns only in this
way, 2D XTRANC prevents any cyclic dependency between messages that reserve and use network channels. The
proof is the same as for the restricted turn model in [17]. The XTRANC pseudo code for 2D networks is the same
as for 1D, with the routing algorithm considering first the ‘X’ and then the ‘Y’ direction. XTRANC for the Y
direction is obtained by changing ‘X-’ to ‘Y-’, ‘X+’ to ‘Y+’ and ‘i’ to ‘j’.
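The X-then-Y ordering described above can be sketched as a thin wrapper around any 1D routing function. This is a minimal illustration rather than the XTRANC implementation: the names `route_2d`, `Route1D` and `plain_1d` are hypothetical, and `plain_1d` is a stand-in that only chooses between the plus and minus ports, without the inner-ring ‘i’ and ‘j’ links.

```python
from typing import Callable, Tuple

# Hypothetical type for a 1D routing function: (current, destination) -> port name.
Route1D = Callable[[int, int], str]


def plain_1d(plus: str, minus: str) -> Route1D:
    """Stand-in 1D router: plain step towards the destination, no inner-ring links."""
    def route(cur: int, dest: int) -> str:
        return plus if dest > cur else minus
    return route


def route_2d(cur: Tuple[int, int], dest: Tuple[int, int],
             route_x: Route1D, route_y: Route1D) -> str:
    """Resolve the X dimension completely before the Y dimension."""
    (xc, yc), (xd, yd) = cur, dest
    if xc != xd:
        return route_x(xc, xd)
    if yc != yd:
        return route_y(yc, yd)
    return "Local"


route_x = plain_1d("X+", "X-")
route_y = plain_1d("Y+", "Y-")
print(route_2d((1, 1), (3, 0), route_x, route_y))  # X is resolved first: X+
```

Replacing `plain_1d` with a function that follows the 1D pseudo code, once for X and once with the Y port names, would give the 2D behaviour described in the text.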
5. Performance Evaluation Platform
The System-on-Chip Wire (SoCWire) [18, 19], which supports dynamic partial reconfiguration, has been
employed in this work to build the networks evaluated with different topologies. We have modified the
SoCWire Switch to create a low-overhead router, called the SoCWire Router [20], which adds logical addressing
for large regular networks with many nodes. To verify and evaluate the resulting networks, we have designed
and implemented an on-chip NoC performance evaluation platform [20] on the FPGA and examined partial
reconfiguration in a physical prototype. The platform is built around the Leon3 SoC available in the GRLIB IP
library [21]. The Leon3 is a SPARC-compatible soft-core processor, developed by Aeroflex Gaisler, that
interfaces to the AMBA bus architecture. The IP blocks for the AMBA bus, the Leon3 processor and the other SoC
peripherals are available in the GRLIB IP library. An additional IP block added to the GRLIB library is the
reconfiguration controller [22], which loads new bitstreams stored in external DDR memory into the tiles
without host intervention via the ICAP (Internal Configuration Access Port) hardware unit, which allows direct
access to the device fabric. This external memory provides a good trade-off between memory size, transfer
overheads and on-chip resource utilisation.
[Fig.11 block diagram. Static part: LEON3 processor, reconfiguration controller, ICAP, JTAG debug link, AMBA AHB bus, AHB controller, SDRAM memory controller, SRAM, I/O, PROM, CODECs, dual-port RAM (Tx/Rx), PGs and PRs. Dynamic part: SoCWire network.]
Fig.11. The performance evaluation platform for SoCWire networks
Fig.11 presents the architecture of this performance evaluation platform. We have considered the SoCWire
network as the dynamic part while the rest of the system is statically configured. This allows the
communication interconnect to be mapped as a single block to a centralized area of the device which is
connected to the evaluation platform. This approach fits the current partial reconfiguration design flows well,
since it allows the available FPGA resources to be used optimally. If the number of local/communication ports
in a router changes, the wiring infrastructure changes as well; this is handled by letting the FPGA
place-and-route (P&R) tools manage the resources in the assigned communication area without imposing excessive
constraints. Dynamic Partial Reconfiguration (DPR) is useful for systems with multiple functions that can
time-share the same FPGA resources. We have used the DPR technique to create the dynamic part of the system
while the rest of the evaluation platform is statically configured. The resulting device layout is shown in
Fig.12, with the area occupied by the communication network clearly marked.
[Fig.12 layout regions: SoCWire Network (dynamic part); Performance Evaluation Platform (static part).]
Fig.12. The resulting device layout
In this work we consider a single rectangular area that implements the network, with the processing elements
located outside this area. This method is efficient for small to medium networks. It is possible to use
partitioning methods with DPR that include multiple area groups and non-rectangular PR regions [23], which are
suitable for larger networks. This is part of our future work.
Our performance evaluation system is based on different traffic types representing synthetic and realistic
application traffic patterns. Performance is analyzed in terms of throughput and latency after implementing the
system on the target board running at a normalized frequency of 100 MHz. The throughput shows how efficiently
packets are delivered in the network and depends on the topology and the routing policy, which is ‘X-Y’ in all
cases to allow a fair comparison. The latency of a network is the time required to traverse it. The average
latency is the mean of the latencies of all received packets in the topology [17].
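The two metrics can be written down in a few lines. The sketch below is only illustrative: the `PacketRecord` type and its fields are hypothetical, and throughput is assumed here to mean delivered flits per clock cycle, which the text does not define precisely. The average latency follows the definition given above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class PacketRecord:
    sent_cycle: int      # cycle at which the packet was injected
    received_cycle: int  # cycle at which the packet was delivered
    flits: int           # packet length in flits


def average_latency(records: List[PacketRecord]) -> float:
    """Mean latency, in cycles, over all received packets."""
    return mean(r.received_cycle - r.sent_cycle for r in records)


def throughput(records: List[PacketRecord], total_cycles: int) -> float:
    """Assumed definition: delivered flits per cycle over the whole run."""
    return sum(r.flits for r in records) / total_cycles


records = [
    PacketRecord(0, 40, 15),
    PacketRecord(10, 70, 15),
    PacketRecord(20, 100, 15),
]
print(average_latency(records))   # latencies are 40, 60 and 80 cycles
print(throughput(records, 100))   # 45 flits delivered in 100 cycles
```

At the normalized 100 MHz clock, one cycle corresponds to 10 ns, so cycle counts convert directly to time.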
6. XTRANC for homogeneous networks
Partitioning a mesh network into an arbitrary number of smaller networks based on the inner-torus topology and
the XTRANC routing algorithm is a promising way to increase mesh network performance. For example, Fig.13
shows a typical 10×10 mesh network that has been partitioned into four inner-tori of 25 nodes each; each
inner-torus uses XTRANC as its routing algorithm for the 2D topology.
[Fig.13 diagram: routers R0 to R99 arranged as a 10×10 grid.]
Fig.13. A typical 10×10 network which is partitioned to inner-torus networks
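The partitioning in Fig.13 can be illustrated by mapping each router to its 5×5 inner-torus. The sketch assumes row-major numbering of R0 to R99 (the numbering order is not stated in the text), and the helper name `locate` and its return format are hypothetical. The returned ranges play the role of the start and end coordinates used by the routing pseudo code.

```python
from typing import Dict, Tuple, Union

MESH_SIZE = 10    # 10x10 mesh, as in Fig.13
TORUS_SIZE = 5    # each inner-torus has 5x5 = 25 nodes
PARTS_PER_ROW = MESH_SIZE // TORUS_SIZE


def locate(router_id: int) -> Dict[str, Union[int, Tuple[int, int]]]:
    """Return the partition index and the X/Y bounds of the inner-torus of a router."""
    # Assumption: routers are numbered row by row, R0 at coordinate (0, 0).
    x, y = router_id % MESH_SIZE, router_id // MESH_SIZE
    px, py = x // TORUS_SIZE, y // TORUS_SIZE
    return {
        "partition": py * PARTS_PER_ROW + px,
        "x_range": (px * TORUS_SIZE, px * TORUS_SIZE + TORUS_SIZE - 1),
        "y_range": (py * TORUS_SIZE, py * TORUS_SIZE + TORUS_SIZE - 1),
    }


print(locate(47))  # R47 sits at (7, 4), i.e. in the second inner-torus of the top row
```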
6.1. Evaluation of an inner-torus network
Fig.14. A 4×4 inner-torus partition
Fig.14 shows one 16-node inner-torus partition, which has been connected to the performance evaluation
platform introduced in the previous section and prototyped on a Virtex-5 LX110T device. For comparison with
the inner-torus, we have also evaluated 16-node mesh and full-mesh (interconnected mesh) [11, 12] networks.
Fig.15 displays the full-mesh network [11, 12].
Fig.15. A 4×4 full-mesh network
6.1.1 Power and Maximum Achieved Frequency
Table 1 shows the maximum achieved frequency and the power consumption of the networks. XPower, the Xilinx
power analysis tool, has been used to estimate the power consumed by the networks. To compare the networks we
have considered the hierarchy power, which includes DCM power, BRAM power, signal power and logic power. As
can be seen in the table, the maximum achieved frequency of the mesh network is 10.01% higher than that of the
inner-torus network, and that of the inner-torus is 6.6% higher than that of the full-mesh network. The maximum
achieved frequencies of the networks are highly dependent on the mapped area. In addition, the inner-torus
network consumes more power than the mesh network but reduces power by 38.1% compared to the full-mesh network.
TABLE 1
THE MAXIMUM ACHIEVED FREQUENCY AND THE CONSUMED POWER
                        Mesh      Inner-torus   Full-mesh
Max. Frequency (MHz)    119.9     107.77        100.65
Power (mW)              40.93     70.17         113.36
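The comparisons drawn from Table 1 can be re-derived from the table entries. The short sketch below assumes that each gap is expressed relative to the larger of the two values; this convention is an inference, not something stated in the text. It reproduces the 6.6% and 38.1% figures, while the rounded table entries give roughly 10.1% for the mesh versus inner-torus frequency gap.

```python
# Values copied from Table 1.
MAX_FREQ_MHZ = {"mesh": 119.9, "inner_torus": 107.77, "full_mesh": 100.65}
POWER_MW = {"mesh": 40.93, "inner_torus": 70.17, "full_mesh": 113.36}


def gap_percent(larger: float, smaller: float) -> float:
    """Gap between two values as a percentage of the larger one (assumed convention)."""
    return 100.0 * (larger - smaller) / larger


freq_mesh_vs_torus = gap_percent(MAX_FREQ_MHZ["mesh"], MAX_FREQ_MHZ["inner_torus"])
freq_torus_vs_full = gap_percent(MAX_FREQ_MHZ["inner_torus"], MAX_FREQ_MHZ["full_mesh"])
power_torus_vs_full = gap_percent(POWER_MW["full_mesh"], POWER_MW["inner_torus"])

print(round(freq_mesh_vs_torus, 1))   # mesh vs inner-torus frequency gap
print(round(freq_torus_vs_full, 1))   # inner-torus vs full-mesh frequency gap
print(round(power_torus_vs_full, 1))  # power saved by inner-torus vs full-mesh
```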
6.1.2 Performance
We have evaluated the networks with two different traffic patterns:
A. Uniform Traffic Pattern 1
In this scenario, each node sends 100 packets of 15 flits each to randomly selected targets among the other
nodes. A random time interval separates consecutive packets and represents the average time between data
transfer requests. We have evaluated the networks with different ranges of random firing intervals. A lower
value indicates a more heavily loaded NoC.
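A traffic pattern of this kind can be generated in software as follows. This is a sketch under stated assumptions, not the generator used on the platform: the function name `uniform_traffic`, the uniform distribution of the firing interval between 1 and `max_interval` cycles, and the fixed seed are all illustrative choices.

```python
import random
from typing import List, Tuple

FLITS_PER_PACKET = 15  # packet length used in the experiments


def uniform_traffic(num_nodes: int, packets_per_node: int, max_interval: int,
                    seed: int = 0) -> List[Tuple[int, int, int]]:
    """Return (send_cycle, source, destination) tuples sorted by send time."""
    rng = random.Random(seed)
    schedule = []
    for src in range(num_nodes):
        targets = [n for n in range(num_nodes) if n != src]  # any node but itself
        cycle = 0
        for _ in range(packets_per_node):
            cycle += rng.randint(1, max_interval)  # random firing interval (assumed uniform)
            schedule.append((cycle, src, rng.choice(targets)))
    return sorted(schedule)


schedule = uniform_traffic(num_nodes=16, packets_per_node=100, max_interval=50)
print(len(schedule))                     # 16 nodes x 100 packets
print(len(schedule) * FLITS_PER_PACKET)  # total flits injected
```

Lowering `max_interval` shortens the gaps between packets and so models a more heavily loaded NoC.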
Fig.16 displays the throughput and Fig.17 the average latency of the 4×4 inner-torus, full-mesh and mesh
networks for this uniform traffic pattern. These figures reveal that the inner-torus network increases the
throughput by up to 25.94% and reduces the average latency by up to 79.2% compared to the mesh network.
Compared to the full-mesh network, the inner-torus network increases the throughput by up to 1.7% and reduces
the average latency by up to 13.7%. This indicates that the inner-torus network achieves a performance
comparable to the full-mesh despite using fewer resources and being more power efficient.
Fig.16. The networks throughput for the uniform traffic pattern 1
Fig.17. The networks average latency for the uniform traffic pattern 1
B. Uniform Traffic Pattern 2
In this case, each node sends between 1 and 100 packets of 15 flits each to a random target among the other
nodes. The traffic is bursty, with no random time interval between consecutive packets. Fig.18 displays the
throughput and Fig.19 the average latency of the 4×4 inner-torus, full-mesh and mesh networks in this
scenario. These figures reveal that, for the applied test, the inner-torus network increases the throughput by
between 23.4% and 59% and reduces the average latency by between 28.2% and 40.7% compared to the mesh network.
In addition, the inner-torus network increases the throughput by between 6.1% and 63.9% and reduces the average
latency by between 14.5% and 22.2% compared to the full-mesh in this scenario.
Fig.18. The networks throughput for the uniform traffic pattern 2
Fig.19. The networks average latency for the uniform traffic pattern 2
6.1.3 Complexity
Table 2 summarizes the complexity of the networks. As can be seen in the table, the configuration with the
inner-torus network consumes 44% more LUTs than the traditional mesh network. The additional logic is
required due to the extra ports in the inner routers. The full-mesh is, as expected, even more complex and
requires 53.8% more LUTs than the inner-torus network.
TABLE 2
COMPLEXITY FOR THE 4×4 INNER-TORUS, FULL-MESH AND THE MESH NETWORKS
Networks       Slice Registers   Slice LUTs   Block RAM/FIFO
Mesh           10864             16828        48
Inner-torus    13029             24237        56
Full-mesh      19294             37285        80
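The LUT overheads quoted for Table 2 follow directly from the table entries when each increase is taken relative to the smaller (baseline) network. The sketch below is an illustrative check; the dictionary layout and the helper name `increase_percent` are hypothetical.

```python
# Values copied from Table 2.
TABLE_2 = {
    "mesh":        {"registers": 10864, "luts": 16828, "bram": 48},
    "inner_torus": {"registers": 13029, "luts": 24237, "bram": 56},
    "full_mesh":   {"registers": 19294, "luts": 37285, "bram": 80},
}


def increase_percent(network: str, baseline: str, resource: str) -> float:
    """Extra resource usage of `network` as a percentage of `baseline`."""
    base = TABLE_2[baseline][resource]
    return 100.0 * (TABLE_2[network][resource] - base) / base


print(round(increase_percent("inner_torus", "mesh", "luts"), 1))       # inner-torus vs mesh
print(round(increase_percent("full_mesh", "inner_torus", "luts"), 1))  # full-mesh vs inner-torus
```

The same helper applies to the register and block RAM columns by changing the `resource` argument.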
7. XTRANC for irregular networks
Extra links can be added only to the areas of the network that suffer from congestion. The extra links are
added to the mesh network in the ‘X’ and ‘Y’ directions between nodes that experience long latencies. The key
point is to detect the area that requires extra link(s) and to replace the network, or a part of it, with a
modified version that includes the extra link(s). Note that adding one extra link in the current setup requires
a reconfiguration of the network area shown in Fig.12.
7.1. Possible mechanism to detect and add extra link(s) to the network at run-time
This section introduces a possible mechanism to detect and add extra link(s) at run-time in commercial FPGAs
using the state-of-the-art Dynamic Partial Reconfiguration technique.
Previous work [24] presents a distributed stochastic run-time strategy that incrementally maps an
application onto a large run-time reconfigurable multicore platform. This method uses a mesh network as its
communication infrastructure. The proposed task mapping employs specific tasks, named task managers (TMs),
to map applications. Each application has its own TM, which monitors its tasks during execution to perform
fault detection, task migration and energy management. A TM could map tasks on demand and could also map more
than one application to reduce the overhead caused by the TM but, for the sake of simplicity and clarity, this
paper considers one task manager per application.
In this scenario, when a network has more than one TM, each TM maps its tasks without being aware of the other
TMs, and some applications have unpredictable traffic patterns. The communication latency is therefore
unpredictable because of congestion. In addition, some nodes need to receive data from other nodes within a
specific latency but, because of the congestion, receive it with a longer latency. A method is therefore needed
to control the communication latency between the nodes. In the proposed method, when a node sends a packet to
another node, the sending time stamp is added to the header of the packet. When a node receives a packet, it
extracts the sending time, compares it with the current time and makes one of two decisions: if the latency is
within the expected and acceptable latency, the node accepts the packet; if the latency exceeds the expected
latency, the node sends a message to its TM asking for help. This message contains the sender and receiver
information.
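The receive-side check described above can be sketched as follows. The `Packet` and `HelpMessage` types, the cycle-based time stamp and the function name `check_latency` are hypothetical. Treating a latency exactly equal to the limit as acceptable is also an assumption, since the text only distinguishes "less" and "more" than the expected latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    sender: str
    receiver: str
    sent_cycle: int  # time stamp written into the header by the sender


@dataclass
class HelpMessage:
    sender: str
    receiver: str


def check_latency(packet: Packet, now_cycle: int, max_latency: int) -> Optional[HelpMessage]:
    """Return None if the packet met its latency limit, else a help message for the TM."""
    latency = now_cycle - packet.sent_cycle
    if latency <= max_latency:  # assumption: the limit itself is still acceptable
        return None
    return HelpMessage(packet.sender, packet.receiver)


on_time = check_latency(Packet("f", "g", sent_cycle=100), now_cycle=150, max_latency=60)
too_late = check_latency(Packet("f", "g", sent_cycle=100), now_cycle=200, max_latency=60)
print(on_time)   # None: 50 cycles is within the 60-cycle limit
print(too_late)  # HelpMessage(sender='f', receiver='g')
```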
When a TM receives the help message from a node, it performs five tasks:
1. The TM orders its nodes and the other TMs to inform their nodes that a reconfiguration is in progress. The
nodes pause their communications immediately when they receive the command.
2. The TM is aware of the different configuration bitstreams stored in external DDR memory. It calculates
which configuration with extra link(s) should be selected to improve the latency.
3. The TM waits for a specific number of clock cycles so that the packets already inside the network can be
delivered. This waiting time can vary depending on the network size and traffic pattern.
4. The TM orders the replacement of the network, or part of the network, with a mesh network with extra link(s)
using the DPR technique.
5. The TM sends a message to its related nodes and the other TMs to inform them that the network
reconfiguration is complete and that communications can restart.
The whole process takes place at the Network Interface (NI) and TM level, so the IPs do not incur any overhead
from it.
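The five steps above can be summarised as a small control sequence. The sketch only records each action in a log instead of driving real hardware. The function name `handle_help`, the lookup of a bitstream by (sender, receiver) pair and the bitstream file name are all hypothetical stand-ins for the selection logic, which the text does not specify.

```python
from typing import Dict, List, Tuple


def handle_help(sender: str, receiver: str,
                bitstreams: Dict[Tuple[str, str], str],
                drain_cycles: int, log: List[str]) -> bool:
    """Walk through the five TM steps; return True if a reconfiguration was requested."""
    # Step 1: pause the TM's own nodes and notify the other TMs.
    log.append("pause nodes and notify other TMs")
    # Step 2: choose a stored configuration with extra link(s) (assumed: simple lookup).
    bitstream = bitstreams.get((sender, receiver))
    # Step 3: wait for packets already in the network to be delivered.
    log.append(f"wait {drain_cycles} cycles")
    # Step 4: replace the network, or part of it, using DPR.
    if bitstream is not None:
        log.append(f"load {bitstream}")
    # Step 5: tell the nodes and the other TMs that communication can restart.
    log.append("resume communications")
    return bitstream is not None


log: List[str] = []
stored = {("f", "g"): "mesh_extra_link_f_g.bit"}  # hypothetical bitstream name
print(handle_help("f", "g", stored, drain_cycles=200, log=log))
print(log)
```

Communication is resumed even when no suitable configuration is found, so a failed lookup does not leave the nodes paused; this fallback is a design choice of the sketch.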
7.2. Irregular network performance analysis
In this section, we consider the whole network, or a part of it, as the dynamic part that needs to be
reconfigured, while the rest of the system is statically configured. The experiments in 7.2.1 to 7.2.3
demonstrate that XTRANC is a routing algorithm capable of reducing the communication latency between specific
nodes. We have considered three cases as a proof of concept:
7.2.1. Case 1: 2 TMs with 1 task each and 1 extra link
We have considered a network area onto which two different tasks have been mapped at run-time. Fig.20 shows
these two simple task graphs and Fig.21 the tasks mapped on the mesh network. We have connected the sub-network
to the performance evaluation platform, treated the evaluation platform as the static part and the network as
the dynamic part, and implemented the system on the FPGA. We have assumed that all communications between
nodes meet their latency limits except the communication between nodes ‘f’ and ‘g’. We have used PR to replace
the mesh network with the mesh network with an extra link, which can be seen in Fig.22. Table 3 shows the
complexity and maximum achieved frequency of the mesh and of the mesh network with an extra link in this case.
The table shows that adding an extra link requires 7.5% more LUTs. In addition, as mentioned before, the
maximum achieved frequency is highly dependent on the area selected to map the networks. In this case, adding
an extra link has a negligible effect on the maximum achieved frequency.
Fig.20. Two simple task graphs.
Fig.21. Run-time mapped applications on mesh network
Fig.22. Run-time mapped applications on mesh network with one extra link
TABLE 3. OCCUPIED AREA AND MAXIMUM FREQUENCY OF THE NETWORKS
Networks               Mesh      Mesh-4x4-1-extra-link
Slice Registers        10864     11168
Slice LUTs             16828     18094
Block RAM/FIFOs        48        49
Max frequency (MHz)    108.719   108.684
Fig.23. The latency between nodes ‘f’ and ‘g’ with mesh and mesh with 1 extra link networks
Fig.24. The average network latency between nodes for the mesh and mesh with 1 extra link networks
As can be seen in Fig.23, adding the extra link decreases the average latency between nodes ‘f’ and ‘g’ by
30.42% when each transaction contains a packet with 16 flits. Fig.24 shows the average latency of the
networks: adding the extra link reduces the average network latency by 8.58%.
Fig.25. Run-time mapped applications on mesh network
7.2.2. Case 2: 2 TMs with 1 task each and 2 extra links
We have considered a network area with the same two tasks as in the previous case but mapped differently, as
can be seen in Fig.25. We have connected the sub-network to the performance evaluation platform, treated the
evaluation platform as the static part and the network as the dynamic part, and implemented the system on the
FPGA. We have assumed that all communications between nodes meet their latency limits except the
communication between nodes ‘f’ and ‘g’. We have used PR to replace the mesh network with the mesh network
with two extra links, which can be seen in Fig.26. Table 4 shows the complexity and maximum achieved frequency
of the networks. The table reveals that adding two extra links in this case increases the consumed LUTs by
16.42% in comparison to the mesh network. As can be seen in Fig.27, adding the extra links decreases the
average latency between nodes ‘f’ and ‘g’ by 38.67% when each transaction contains a packet with 16 flits.
Fig.28 shows the average latency of the networks; as can be seen in this figure, adding the extra links
reduces the average network latency.
Fig.26. Run-time mapped applications on mesh network with two extra links
TABLE 4. COMPLEXITY AND MAXIMUM FREQUENCY OF THE NETWORKS.
Networks               Mesh      Mesh_2extra_links
Slice Registers        10864     11438
Slice LUTs             16828     19591
Block RAM/FIFO         48        50
Max frequency (MHz)    103.842   100.827
Fig.27. The latency between nodes ‘f’ and ‘g’ with mesh and mesh with 2 extra links network
Fig.28. The average network latency between nodes for the mesh and mesh with 2 extra links network
7.2.3. Case 3: 1 static partition, 1 dynamic partition, 1 TM with 1 task for each partition, 1 extra link
We have considered two network areas with 8 and 6 nodes; each has one TM that is responsible for one task. We
have connected the networks to the performance evaluation platform, treating the evaluation platform and the
6-node sub-network as the static part and the 8-node sub-network as the dynamic part. Fig.29 shows the task
graphs (TGs) and Fig.30 displays how the tasks have been mapped to the two sub-networks. We have assumed that
all communications between nodes meet their latency limits except the communication between nodes ‘g’ and
‘a’. We have used PR to replace the mesh network with the mesh network with one and with two extra links, which
can be seen in Fig.31 and Fig.32 respectively. Note that a direct link between ‘g’ and ‘a’ is not compatible
with XTRANC.
Fig.29. Two task graphs.
[Figure legend: Task 0 (dynamic part); Task 1 (static part); traffic direction ‘g’ to ‘a’; traffic direction ‘a’ to ‘g’.]
Fig.30. Run-time mapped applications on mesh network.
Fig.31. Run-time mapped applications on mesh network with one extra link.
Fig.32. Run-time mapped applications on mesh network with two extra links.
Table 5 shows the complexity and maximum achieved frequency of the 8-node sub-networks. The table reveals that
adding one and two extra links requires 6.2% and 15.36% more LUTs, respectively, than the mesh network.
Fig.33 shows the average latency between nodes ‘g’ and ‘a’. The figure shows that adding one extra link reduces
the average latency from node ‘g’ to node ‘a’ by 15.4%. In addition, adding two extra links reduces the average
latency from node ‘g’ to node ‘a’ by 17.2% and from node ‘a’ to node ‘g’ by 31.3%. Fig.34 displays the average
network latency; adding one or two extra links reduces it.
TABLE 5
COMPLEXITY AND MAXIMUM FREQUENCY OF THE MESH AND THE NETWORKS.
                       2X4_mesh   2X4_mesh_1-extralink   2X4_mesh_2-extralinks
Slice Registers        5875       5878                   6129
Slice LUTs             8947       9497                   10321
Block RAM/FIFO         26         27                     28
Max Frequency (MHz)    119.36     124.347                124.611
Fig.33. The latency between nodes ‘g’ and ‘a’ with mesh and mesh with 1 and 2 extra links networks
Fig.34. The average network latency between nodes for the mesh and mesh with 1 and 2 extra links networks
8. Conclusions
We have created the XTRANC algorithm as a routing algorithm suitable for dynamically reconfigurable NoCs formed
by mesh networks extended with additional links. The experimental analysis reveals that XTRANC achieves better
performance than the plain mesh and remains deadlock-free despite the topology changes introduced by the
additional links. XTRANC can be used to improve the use of limited hardware resources in FPGAs by adding links
only to areas of traffic congestion and dynamically reconfiguring the links as requirements change.
The results indicate that the partial reconfiguration features available in modern FPGAs could be used to deploy
different interconnects at run-time depending on the active application and design goal. The current work is
based on a Xilinx V5 LX110T which, with 69K logic cells, does not offer enough resources to build larger
systems. With new FPGAs such as the latest Xilinx Virtex-7, with millions of logic cells, it will be possible
to create very large communication networks with hundreds of processing elements and to study the scalability
of this method in these cases. This is part of our future work. In addition, the IRN produces both
deterministic and adaptive routing algorithms. In this work we have employed the deterministic XTRANC because
of its inherent simplicity, being based on the ‘X-Y’ algorithm. Future work will consider adaptive XTRANC and
compare it with deterministic XTRANC and other adaptive algorithms.
References
[1] Ababei, C.: ‘Efficient Congestion-Oriented Custom Network-on-Chip Topology Synthesis’ Reconfigurable
Computing and FPGAs (ReConFig), 2010 International Conference on, Cancun, Quintana Roo, Mexico,
Dec. 2010, pp.352- 357.
[2] Duato, J., Yalamanchili, S., and Ni, L.: ‘Interconnection Networks: An Engineering Approach’. Morgan
Kaufmann, 2003.
[3] Rahmati, D., Sarbazi-Azad, H., Hessabi, S., and Eslami Kiasari, A.: ‘Power-efficient deterministic and
adaptive routing in torus networks-on-chip,’ Microprocessors and Microsystems - Embedded Hardware
Design, 2012, 36, (7), pp. 571-585.
[4] Palesi, M., and Daneshtalab, M. (Eds.): ‘Routing Algorithms in Networks-on-Chip’, Springer, 2013.
[5] Chiu, C. M.: ‘The odd-even turn model for adaptive routing’, IEEE Trans. on Parallel and Distributed
Systems, 2000, 11, (7), pp. 729-738.
[6] Hu, J., C., and Marculescu, R.: ‘DyAD - smart routing for networks-on-chip’, In Proc. Design Automation
Conference, July 2004, pp. 260 - 263.
[7] Shafiee, A. M., Montazeri, M., and Nikdast, M.: ‘An Innovational Intermittent Routing Algorithm in
Network-on-Chip’, International Conference on Computer Science and Engineering, France, September
2008.
[8] Li, M., Zeng, Q., Jone, W.: ‘DyXY - a proximity congestion-aware deadlock-free dynamic routing method
for network on chip’, Design Automation Conference, 43rd ACM/IEEE, San Francisco, CA, USA, July
2006, pp. 849-852.
[9] Patooghy, A.; Miremadi, S.G.: ‘XYX: A Power & Performance Efficient Fault-Tolerant Routing Algorithm
for Network on Chip,’ Parallel, Distributed and Network-based Processing, 2009 17th Euromicro
International Conference on, Weimar, Germany, Feb 2009, pp. 245-251.
[10] Nickray, M., Dehyadgari, M., Afzali-Kusha, A.: ‘Adaptive routing using context-aware agents for networks
on chips’, Design and Test Workshop (IDT), 2009 4th International , Riyadh, Saudi Arabia, Nov. 2009, pp.
1-6.
[11] Choudhary, S. and Qureshi, S.: ‘A new NoC architecture based on partial interconnection of mesh
networks’ in IEEE Symposium on Computers & Informatics (ISCI), Kuala Lumpur, 2011.
[12] Choudhary, S. and Qureshi, S.: ‘Performance Evaluation of Mesh-based NoCs: Implementation of a New
Architecture and Routing Algorithm’, International Journal of Automation and Computing , 2012, 9, (4),
pp. 403-413.
[13] Reshadi, M., Khademzadeh, A., Reza, A., and Bahmani, M.: ‘A Novel Mesh Architecture for On-Chip
Networks’, ‘D & R Industry Articles,’ http://www.design-reuse.com/articles/23347/on-chipnetwork,
accessed Sep. 2013.
[14] Fuks, H., and Lawniczak, A.T.: ’Performance of data networks with random links’, In Mathematics and
Computers in Simulation, 1999, 51, (1-2), pp. 101-117.
[15] Ogras, U. Y., and Marculescu, R.: ‘It's a small world after all: Noc performance optimization via long-range
link insertion’, IEEE Trans. on Very Large Scale Integration Systems, Special Section on
Hardware/Software Codesign and System Synthesis, 2006, 14, (7), pp. 693-706.
[16] Jackson, C.; Hollis, S.J.: ‘Skip-links: A dynamically reconfiguring topology for energy-efficient NoCs’,
System on Chip (SoC), 2010 International Symposium on , Tampere, Finland, Sep.2010 , pp. 49-54.
[17] Dally, W.J., and Towles, B.P.: ‘Principles and practices of interconnection networks’, 2004, The
Morgan Kaufmann series in computer architecture and design. Morgan Kaufmann, Burlington.
[18] ‘System-on-Chip Wire, IDA, 5 May 2009’, www.socwire.org, accessed Sep. 2013.
[19] Osterloh, B., Michalik, H., and Fiethe, B.: ‘SoCWire: A Robust and Fault Tolerant Network-on-Chip
Approach for a Dynamic Reconfigurable System-on-Chip in FPGAs,’ in Architecture of Computing
Systems - ARCS 2009. 2009, vol 5455, Delft, Netherlands: Springer Berlin / Heidelberg, pp. 50-59.
[20] Beldachi, A.F., Hosseinabady, M., and Nunez-Yanez, J.: ‘Configurable Router Design for Dynamically
Reconfigurable Systems based on the SoCWire NoC’ International Journal of Reconfigurable and
Embedded Systems (IJRES),2013, 2, (1).
[21] ‘Leon3 Processor/GRLIB’, http://www.gaisler.com, accessed Sep. 2013.
[22] Nabina, A., Nunez-Yanez, J.: ‘Dynamic Reconfiguration Optimisation with Streaming Data
Decompression,’ FPL, 2010 International Conference on Field Programmable Logic and Applications,
Milan, Italy, Sep. 2010, pp. 602-607.
[23] ‘Xilinx website’, http://www.xilinx.com/support/answers/25018.htm, accessed Sep. 2013.
[24] Hosseinabady, M., and Nunez-Yanez, J. L.: ‘Run-time stochastic task mapping on a large scale network-on-chip with dynamically reconfigurable tiles’, Computers & Digital Techniques, IET, 2012, 6, (1), pp. 1-11.