norc_nunezyanez_revised_2 - Explore Bristol Research

advertisement
Adaptive Routing Strategies for Fault-tolerant
On-chip Networks in Dynamically Reconfigurable Systems.
Jose Luis Nunez-Yanez, Department of Electrical&Electronic Engineering, Bristol University,
Woodland Road, BS8 1UB, UK, e_mail: j.l.nunez-yanez@bristol.ac.uk
Doug Edwards, School of Computer Science, University of Manchester, Oxford Road, M13 9PL,UK,
e_mail: doug.edwards@manchester.ac.uk
Antonio Marcello Coppola, STMicroelectronics, 12, rue Jules Horowith
Grenoble, FR 38019, e_mail: marcello.coppola@st.com
Abstract
This paper presents an investigation into an effective and low-complexity adaptive routing strategy
based on stochastic principles for an asynchronous network-on-chip platform that includes
dynamically reconfigurable computing nodes. The approach is compared with classic deterministic
routing and it is shown to have good properties in terms of throughput and excellent fault-tolerance
capabilities. The challenge of how to deliver reliability is one of the problems that multiprocessor
systems architects and manufactures will face as feature sizes and voltage supplies shrink and deepsubmicron effects reduce the ability to carry out deterministic computing. It is likely that a new type of
deep-submicron complex multicore systems will emerge which will be required to deliver high
performance within strict energy and area budgets and operate over unreliable silicon. Within this
context, the paper studies an on-chip communication infrastructure suitable for these systems.
1. Introduction
Network-on-Chips (NoCs) have been proposed as a technology able to overcome the performance
bottleneck faced by traditional shared bus designs. Conventional bus structures (AMBA, Coreconnect
etc) have being coping with the increasing number of connected IP blocks with the introduction of
complex hierarchies. This has helped to alleviate problems such as shared-medium access contention
and wasteful data broadcast but the scalability of these solutions is limited. Additionally, the increases
in noise and crosstalk expected in sub-65 nm dimensions will also hinder the performance of
conventional bus structures [1]. The multicore technology of the future is poised to use on-chip packet
or circuit switched networks for effective communication. This will enable increases in bandwidth as
more processing nodes are added, as well as providing support for error resilience and energy saving
techniques [2].
Synchronous and asynchronous NoCs have been proposed by different academic and commercial
research teams. Asynchronous NoCs offer some clear advantages in solving power and delay problems
associated with clock distribution in large chips. They also have some serious challenges regarding
verification of the self-timed logic together with the possible area overheads introduced by having to
add wires and logic to eliminate problematic glitches and support the handshaking protocols that
replace the global clock. The lack of commercial EDA support for asynchronous logic is also a
limiting factor in increasing its wide acceptance. Intensive work in the computational domain
involving the design of complex self-time processor cores has been conducted in the Amulet group in
Manchester University [3]. The challenges of self-timed logic are evident in this work and it is hard to
imagine that this type of logic will replace well-understood synchronous design in the processing
elements of a multicore platform in the short to medium future. On the other hand, the higher
regularity and fewer gates of the communication domain mean that the advantages of asynchronous
design are more apparent [4].
The dynamic reconfiguration feature means that a single silicon platform could support multiple
applications via rapid reconfiguration of the logic gates. Existing and emerging consumer audio/video
standards, and the need for product differentiation by introducing new feature updates in already
deployed devices, mean that the cost of the design, verification and fabrication of billion-transistor
dedicated ASICs [5] will be increasingly difficult to justify compared with a reconfigurable platform.
Current platform FPGAs already incorporate a highly structured network at a very fine grain level
used to connect the basic logic cells that form the FPGA computing fabric [6]. FPGAs however tend to
consume a lot of energy most of it on the routing wires which can follow convoluted routes in the
fabric as determined by the routing algorithm. Routing complexity increases with chip size and this
tendency also decreases performance since the high capacitance of long wires negatively affects the
achievable clock rates. Introducing the NoC offers a natural way to structure this complexity so that
FPGA routing domains are confined within the areas defined by the NoC wires and global
communication is limited to using NoC resources.
The combination of an asynchronous communication infrastructure with dynamically reconfigurable
computing nodes presents some interesting opportunities and challenges in the design of an effective
communication protocol and their exploration constitutes the objective of this work. The paper
investigates both the throughput and the fault-tolerance performance of this adaptive multicore
platform which has been named NoRC (Network on a Reconfigurable Chip). The main focus is how
the presence of dynamically reconfigurable computing resources affects the design of a flexible
routing strategy based on stochastic principles. Stochastic routing means that routes are discovered
dynamically with little area overhead. Throughput, however, can degrade since typically more hops
are needed for a packet to reach its destination. The paper is organized as follows. Section 2 describes
related work in the area of fault-tolerance strategies and adaptive routing in on-chip communication
architectures. Section 3 presents the features of the NoRC platform and section 4 describes how
adaptive routing is implemented by NoRC. Section 5 studies the trade-off between throughput and
platform complexity, varying the number of functions supported in the processing nodes in NoRC,
compared with a deterministic routing strategy. Then in section 6, an investigation is carried into the
fault-tolerance capabilities and throughput offered by the stochastic approach compared with the same
deterministic routing approach. Finally section 7 presents the conclusions and indicates future research
directions. It should be noted that the simulator developed in this work does not pretend to be a cycle
accurate model of a particular process but rather its role is to provide an initial insight into the features
and limitations of the proposed NoRC platform.
2. Related work in NoC fault-tolerance and adaptive routing
Reliability in a MPSoC (Multi Processor System on Chip) platform such as NoRC could be affected
by soft/transient errors (cosmic rays, alpha particles, crosstalk, coupling noise) and hard/permanent
errors (defective open or short circuits, electro-migration) introduced by the low yield and process
variations that are likely in deep-submicron processes. The GALS (Globally Asynchronous Locally
Synchronous) platform combined with DVS (Dynamic Voltage Scaling) could also generate
synchronization errors between the asynchronous and synchronous domains that become difficult to
trace. Both transient and permanent errors should be taken into account when designing a faulttolerance strategy.
Traditionally analysis of transient faults assumes that some bits in the data packet are flipped to the
wrong value and the literature focuses on the use of linear codes such as CRC (Cyclic Redundancy
Check) and Hamming to detect and/or correct these errors and the area/energy/performance trade-offs
involved.
In [7] a detailed study is conducted on the energy-reliability trade-off comparing correction at the
receiver stage versus retransmission of the corrupted data. Error correcting-detection is based on cyclic
codes and Hamming codes. The paper points out that advance interconnect schemes such as NoCs
tend to have shorter point-to-point links which should favour retransmission rather than error
correction from an energy point of view. Communication reliability can be obtained at different levels
of granularity: it should be possible, for example, to add Hamming codes to each data flit while using
a single CRC code to protect the whole packet. Retransmission at the switch level translates to a need
for power hungry buffering resources so that end-to-end retransmission at the NI level is generally
preferred. Similar conclusions are reached in [8] which also investigates the advantages of either
retransmitting the erroneous packets after error detection or fixing the problem using error correcting
techniques without the need for retransmission. The paper finds that the overheads of error correction
in terms of extra area are high and this also translates into high energy consumption. A hybrid scheme
is proposed that uses error detection and retransmission for short distances and error correction for
long distances with a crossover point of 6 hops. This approach offers the best energy trade off but it
requires the presence of hardware modules for error detection and correction in the network interfaces
and its area overhead is large.
The work presented in [9] studies the energy consumption incurred by different error
correcting/recovery schemes depending on whether they are deployed at the switch or network
interface level. This work assumes static routing with paths set at the sender network interface. Critical
information such as packet headers is protected with hardware redundancy such as TMR (Triple
Modular Redundancy). The scheme relies heavily on data buffering and these buffers are found to be
the main cause of the power overhead. The conclusions are that switch-based error control works
better for local communication while end-to-end is better suited for global communication.
In [10] a technique is proposed that uses different levels of error protection depending on the
reliability of the communication link. As reliability of the link changes depending on level of noise
present, the error detection scheme changes from single error bit detection to double and triple error
detection. The work focuses on error detection coupled with retransmission, and obtains energy
savings because the complexity of the error detection scheme is adjusted to meet the channel needs.
The obtained energy savings are modest at 10% since the hardware required to support the most
complex scheme needs to be physically present event if it is not used.
In [11] Hamming codes are also used to detect errors and retransmission used as the recovery method
as it is has been reported as being more energy efficient. The paper recognizes that even with the
addition of error correcting capabilities to the system, the approach will not work in the presence of
permanent errors since the error could affect the data in such a way that the error correcting technique
proves ineffective and multiple retransmissions would only deliver multiple copies of incorrect data.
To tackle this situation, the approach adds triple modular redundancy to the hardware with spare
components. A reconfiguration strategy is used to use to select components that are functionally
correct after error detection using Hamming codes. This works well so long as the extra links
provided are able to eliminate the problem, but there is an obvious area overhead.
Additionally, it is reasonable to assume that errors could affect not only the data lines but also the
control lines of the communication links and the logic of the router nodes [12]. This means that
transient errors could result in routers which temporarily stop responding to requests or stop
forwarding data correctly. A fault-tolerance strategy should also take into account these types of
problems and the described techniques that rely on cycle-redundancy-check codes (CRC)/ Hamming
codes and retransmission and/or correction cannot do that on their own.
In [13] a comprehensive study of the most common failure types expected in NoCs is conducted and
architectural solutions proposed for a generic synchronous NoC router. The approach is based on a
combination of error detecting/correcting codes and retransmission. Buffers are heavily used in the
router nodes to support hop-to-hop retransmission when the error correction codes are insufficient for
solving the problem. Deadlock recovery is also included as part of the fault tolerance infrastructure.
Both intra router errors in the combinatorial logic and classic link faults are considered. It is apparent
in the paper that the extra complexity added to the router to support all these fault-tolerant features is
considerable.
The routing protocol employed has also a major impact on fault-tolerance. Dynamic and adaptive
approaches offer fault-tolerance to temporal and permanent errors in the network compared with static
schemes. The disadvantage is the complexity introduced in the routing nodes with extra processing,
larger buffers and large routing tables that can quickly make the communication network complexity
unmanageable in an embedded system. Dynamic routing algorithms typically use tables to determine
the position of the receiver of a data packet [14]. The route is dynamically discovered and path
changes can be accommodated easily.
A simpler approach to the dynamic discovery of the route that removes the need for routing tables and
complex processing is the randomized gossip protocols proposed in [15]. In this case, if a router has a
packet pending to send, it will forward it to a randomly chosen subset of neighbour routers. These
routers will upon reception of the packet behave in the same way. This type of probabilistic broadcast
algorithm has very good fault-tolerance properties but generates a lot of redundant traffic with
multiple replicated copies of the same packet using scarce resources in the network which could result
in energy waste and network congestion. The work conducted in [16] investigates the flooding
mechanism used in [15] and concludes that excessive unnecessary redundancy is generated. A
variation is proposed in which a limited number of packets move towards the destination in a directed
way. This means that routing decisions are not totally random but the routers know the destination of
the packet and some weighting is performed so that the ports that lead towards the destination have a
higher probability of being chosen than the other ports.
The routing approach used in this work shares some of the stochastic features of [15] and avoids
packet replication by using a single request flit to establish the communication followed by a variable
number of deterministically routed data packets. It combines packet switching for the initial single-flit
request packets with circuit switching for the data flits. It is designed to exploit the features of the
dynamically reconfigurable platform by replacing inflexible destination addresses with service
requests. This adaptive routing offers fault-tolerance to permanent failures in the processing and
routing nodes. For transient errors affecting the packet header (that identifies the service requested)
and/or payload, a combination of error detecting Hamming codes at the flit level and CRC codes at
that packet level can be used to detect erroneous bits packet headers and payloads. Upon detection of
errors, the packets are dropped and a retransmission requested since this technique is simpler and it has
been found in the reviewed literature to be more efficient that full blown error correction. One of the
most attractive features of the routing used in NoRC compared with the reviewed methods is its
simplicity since it does not require buffering of the packets at the router level, routing tables or
hardware redundancy to operate efficiently.
3. The NoRC platform
Fig. 1 illustrates the NoRC platform which combines network-on-chip, a reconfigurable fabric,
hardwire processing functions such as a RISC CPU and embedded DSP blocks, Dynamic Voltage
Scaling and GALS technology in a single silicon chip.
Our research into this system is based on the On Chip Communication Network [17]. OCCN is an
open-source object oriented SystemC library which provides an efficient framework for the modelling,
simulation and design space exploration of on-chip communication architectures. OCCN enables the
separation between communication and computation with the use of the communication interface
which is the only way to interact with the behaviour of a processing node. This enables both the
communication and behaviour components to be described at different levels of detail and these
modules to be freely intermixed as along as the interface protocols remain the same. Packets in OCCN
are called PDUs (Protocol Data Units) and typically consist of a header field, payload field and a
trailer field. Packets are defined as the smallest part of a message that can be routed independently
through the network. The packets are transmitted divided into flits.
Segmentation and reassembly are implemented among different layers of communication (transport,
network and link layer). Communication refinement enables for example replacing the abstract
communication channel that transfer 32-bit flits between the Master and Slave ports into a detailed
description of a dual-rail return-to-zero asynchronous protocol similar to CHAIN links.
In this work we have ported the SystemC OCCN library and classes to the ModelSim simulation
environment. Modelsim is considered the standard in RTL (VHDL, Verilog) simulation and for some
time now it also supports SystemC simulation. The three languages can be intermixed in the same
simulation and this opens some interesting avenues for communication and computation refinement.
The initial NoRC platform is based on a 2D mesh Manhattan topology that have been shown to be
VLSI-friendly [18] and well suited for allocation of the regular blocks of reconfigurable fabric.
Typical parameters in the SystemC code describing NoRC include: network dimensions, packet
injection rate, transient error rate, permanent error location and timing, router and link latency, retry
latency, buffer, packet and flit size. Fig. 2 gives an overview of the different components used in the
SystemC NoRC model. Each computing tile can behave both as a master or slave. The figure only
shows a control thread running in the router but in reality five control threads (one per port) run in
parallel monitoring activity in each of the ports.
We have aimed at minimising the complexity of routers since they represent the main area and power
overhead in the on-chip network [19], Buffering in the router nodes is typically introduced to provide
GT so routers can buffer packets if high priority packets arrived that must be forwarded immediately.
Additionally, buffers temporarily store clean copies of data units (packets, flits) and can be used to
retransmit the data if errors are detected obtaining fault-tolerance. On the negative side, buffering also
means that delay times go up and the buffers can quickly become the highest contributor to the area
overhead. In the following sections we investigate a routing/switching strategy that can work without
buffers while providing good levels of fault-tolerance and throughput in the context of the dynamically
reconfigurable NoRC platform.
4. Flexible routing in the NoRC platform
NoRC aims at supporting the high-demand processing power expected in multimedia MPSoC. The
initial NoRC configuration includes processing nodes (transmitter and receiver), routers and channel
classes which have been extended to collect statistical data for throughput and packet loss. The
network interfaces move data from the synchronous processing domain to the network asynchronous
domain. Computation is synchronised with a clock signal modelling a GALS platform. Buffering is
limited to the network interface with the use of asynchronous FIFOs. No extra buffering is provided in
the router ports to keep complexity low. To further eliminate the need for buffering in the network
interface, NoRC provides a connection-oriented communication protocol so the delivery of the
payload flits is done in order avoiding the extra complexity of a reordering scheme.
NoRC delivers flexibility by not having the IP of the processing nodes determined at design time.
Consequently, more resources can be dedicated to the communication infrastructure at design and
fabrication time since these costs can be shared over the multiple applications in the multimedia
domain. The predictable wire-lengths in the proposed homogenous NoRC mean that inductance,
resistance and capacitive parasitic effects can be controlled allowing aggressive circuit design
techniques such as low swing drivers and receivers that will be beneficial for the energy efficient
design paradigm.
Communication in NoRC starts with a single request flit used to set-up the connection between
transmitter and receiver followed by a variable number of flits containing the data payload. The final
flit contains the CRC code used to verify error-free communications. Successful CRC verification
results in a single acknowledge flit send to the transmitter which is also used to release the resources
reserved by the initial request flit.
The initial request flit follows a store-and-forward switching approach with the single flit packet being
stored in the router before a decision on where to forward it is made. Once the circuit is established
payload transfer is done using circuit switching with no further arbitration being required. This enables
in order delivery and good levels of GT [20]. The presence of circuits removes the need for buffers
since only enough capacity to store a single flit is required.
In NoRC we envisage that processing nodes perform request for services sending data packets that
need certain processing to be done on them (for example motion estimation in a macroblock in video
coding). The request flits routing could be based in a randomised algorithm because we assume that
the service provider location is not known in this distributed computing platform based on dynamic
reconfiguration and FPGA-like silicon structures. This function-based random routing strategy means
that the request flits do not contain a destination address but a function identifier. A single copy of a
request flit exists and when it is received by a router an initial check is performed to verify if the
processing node attached to that router can service the function request. If that is the case the request
flit is delivered to the processing node which will acknowledge it once it is ready to start accepting the
data payload. In this work the flit size has been configured to 32 bits although other flit sizes are
certainly possible.
This strategy means that the circuits established for the data payload transfer are not fixed but they can
change between packets so adaptation to dynamic traffic is provided. When the first choice of output
port for the request flit is busy a hot potato/deflective approach is used to choose another router port
randomly. Request packets are constantly transferred among routers until they reach a processing node
that can service the request contained in the flit avoiding the need for extra buffering. If all the output
ports are busy the request flit bounces back with a circuit unsuccessful flag set. The sender will then
wait some small random amount of time before requesting the same service. This new request will
follow a different route from the one used by the unsuccessful request.
Fig. 3 shows a flow chart of the routing decisions taken place in one of the five ports available in each
router. There are three main flit types: request flits, acknowledge flits and data flits. Each flit arriving
at a router port is checked to identify it as one of these three types. A flit identified as a request flit that
cannot be serviced in the current processing node, results on a forward operation using a randomly
chosen port and setting the corresponding input port, output ports and direction as busy. For example,
if the north-south direction is reserved future flits entering trough the north port are automatically
forwarded to the south port without further arbitration. Similarly, flits entering through the south port
are forwarded to the north port without further arbitration. This direction reservation is maintained
until an acknowledge flit is used to free the reserved resources. Acknowledge flits are also generated
when a request flit cannot progress any further and bounces back (connection failure). This situation
could arise with busy or faulty ports. This specially flagged acknowledge flit releases resources as
normal and informs the source node that a connection failure has taken place. Possible errors affecting
the bits in these controls flits could be detected using error detection codes. Additionally, if an
acknowledge flit gets lost in transit the source is designed to retry the transmission after a certain time
out period. Fig. 3 indicates the possibility of collisions among request flits. This circumstance can take
place when two request flits originating from different nodes arrive at two ports which share the same
channel. For example, ports west and east in routers horizontally adjacent. Under this condition both
request flits bounce back resulting in two connection failures.
Figs. 4 and 5 show examples of successful connections established between source and receiver nodes.
Source nodes cannot request a function that can be provided by themselves. The dark squares
represent unavailable busy ports so the router is forced to randomly choose another port. Request flits
keep moving sometimes crossing more than once the same router as shown in Fig. 5 until a node that
can provide the requested function is found. Fig. 6 shows a situation in which a flit arrives at a router
that has its two possible exit ports busy resulting in a connection failure. The bounced flit is used to
release all the reserved resources up to that point and inform the source about the problem. A new
retransmission will result in a new route and node chosen to service the request.
To compare this dynamic random-walk/stochastic routing approach we have implemented the
deterministic XY algorithm routing leaving the rest of the NoRC infrastructure intact. In XY routing
the positions of the processing nodes are described by coordinates, the X-coordinate for the horizontal
and the Y-coordinate for the vertical position. Packets are routed to the correct horizontal position first
and then in the vertical direction. XY routing produces minimal paths and simple routers but its
tolerance to permanent and transient errors is low.
In order to be able to make direct comparisons with the stochastic approach, the initial single-flit
request is routed along the horizontal direction until the router location Y index matches the Y part of
the address. Then, the request flit moves along the vertical direction until the flit finds its destination.
If the receiver accepts the request and acknowledge flit is sent back to the transmitter and a connection
established between sender and transmitter which will persist until the receiver sends an acknowledge
flit indicating successful or erroneous payload transfer. Independently of the data payload transmission
being successful or erroneous, the transmitter will need to re-arbitrate for the connection resources
using another request flit before more data can be sent. If the request packet cannot proceed according
to the XY route, it bounces back releasing the resources reserved up to that point and informs the
transmitter about the failure to establish the connection. Similarly to the random case, the transmitter
waits a small random amount of time before trying to establish a connection again.
The XY strategy assumes that the transmitter node knows the location of the receiver node. For
example in a multimedia application involving video coding, the fact that a computing node providing
motion estimation services is located at position [3,2] should be known by a control thread processing
a particular macroblock in position [1,1]. The stochastic approach is more flexible and it does not
require this knowledge to operate correctly. It is, therefore, more suitable to a platform such as NoRC
which enables dynamic reconfiguration of the nodes based on power or processing demands so the
number and location of processing nodes being able to perform a particular task could vary as
requirements change.
In summary, transmitter and receivers behave differently for the XY routing and for the random case.
In XY routing the transmitter chooses a receiver based on X and Y coordinates while in the random
case it chooses some processing needed on a data packet rather than which is the processing node that
is going to perform the task.
In principle the random strategy should generate more link transversals and consume more energy
since XY follows the shortest path, but this depends on which is the ratio between processing nodes
and request functions. If the number of processing nodes is much larger than the number of functions
available in the system the chances are that a request flit will soon find a node able to service the
requested function. Once the link is established the flits containing the data payload are transmitted
using the reserved resources in the path without further arbitration.
5. Throughput Analysis
For the throughput investigation we have configured the NoRC SystemC model with a link delay of
1.3 ns corresponding to a link length of 2 mm in a CHAIN-style network using 0.18 μm technology
[21]. This link delay includes all the transitions required in a return-to-zero dual-rail encoding protocol
to transmit one bit of data. We have also configured the router latency to 5 ns and the clock used by
the computing nodes to 10 ns. These values are used by both configurations: XY and routing and
random so a fair comparison can be carried out.
We have defined four operation points in terms of network load. These load points enable the study of
network behaviour under different levels of congestion. The load is determined by a random variable
with different bounds depending on the level of congestion required. Table 1 gives details of the four
loading points used.
Fig.7 shows the behaviour of a reliable (no retransmissions or failures included) 4x4 NoRC for XY
routing and random. The Y axis shows throughput as a cumulative value obtained by averaging the
throughput obtained in multiple small time windows of 50 ns. The X axis shows the simulation time in
nanoseconds. The NoRC using stochastic routing is configured with four functions distributed
randomly as shown in Fig.8. From the figure it is apparent that for this configuration, if a node
implements a particular service there is a high chance that the neighbouring nodes will implement the
other three services. This is not always the case since, for example, if the top-right node needs service
number 3 the request flit will need to transverse the whole network before a node servicing the request
can be found.
In Fig.7 we observe that the throughput for the XY and stochastic cases under light load conditions is
very similar. If we increase the load we observe that the XY case saturates the network much earlier
than the stochastic case. This means that the random routing can delivered almost 30% extra
bandwidth compared with XY routing for heavy load conditions.
Fig.9 investigates the behaviour of random routing varying the number of services available in the
system for the high load case. We compare the case with 4 services shown in Fig.7 with 6,8 and 11
services. Increasing the number of services means that the probability of quickly finding a node that
can service a particular request decreases and network capacity is reduced. In a real implementation it
is expected that the number of services will change dynamically and the capacity of the network as
well. Fig.9 shows that in all the cases the network reaches the steady state quickly and continues to
operate without service interruptions (no deadlocks) with data payloads successfully transferred
between request originators and service providers. Fig.10 shows the function distribution in NoRC for
each of the new three cases.
Fig. 11 shows the behaviour of the 4x4 mesh for the stochastic 6-service and XY routing with the
network saturating at around 350 Mbytes per second. Both algorithms offer similar levels of
performance in this particular configuration assuming a reliable NoRC. Since in the random case the
transmitter does not need to know where a particular service provider is located, packet data could
contain not only payloads needed to be processed but also bitstreams defining the functionality for the
node. For example, a network node that is failing to meet processing deadlines because some
particular function is not being performed fast enough could request the generation of a bitstream
packet implementing that function. A configuration node could then generate a request for an
available node which is not performing any useful function. A node replying to this request could
accept the bitstream packet and start the new service. The dynamic discovery of the route provided by
the random algorithm means that no explicit communication to identify service provider locations is
required.
Fig.12 compares the capacity of XY and random routing using different NoRC sizes: 4x4, 5x5, 6x6
and 7x7. The performance of the 6-service 4x4 and 5x5 NoRCs is very similar, but for the 6x6 and 7x7
cases it is apparent that the stochastic algorithm works better. We then investigate how many extra
services the stochastic algorithm can accommodate before its performance is similar to XY. This is
shown in Fig. 12 for the 9-service case which works very similarly to XY. This is then the service
count threshold between both algorithms. Increasing the service count further means that the capacity
of stochastic is lower than XY but, in any case, it is important to note that capacity is stable and grid
lock states are not detected. The next section investigates the advantages of stochastic compared to XY
in terms of fault tolerance.
6. Fault-tolerant Analysis
To explore the fault-tolerance properties of the NoRC approach we have used a simple 4x4 NoRC
configured with a total of 6 different services and high traffic load. The service distribution and the
port errors progressively introduced are illustrated in Fig. 13 while Fig. 14 shows the relative timing of
these errors.
Fig.15 shows the effects of a single permanent fault in the south port of router [2,2] in a 4x4 NoRC.
The graphs use an instantaneous window throughput so each point in the graph represents the
throughput of the network in a small window of 50 ns. The fault takes place at 400 μs and the XY
algorithm shown in darker colour immediately suffers a degradation in throughput. It is expected that
as more request flits try to make use of this router the bandwidth will decrease further until zero
packets can get through. On the other hand the random algorithm continues to operate without
showing any significant reduction in throughput.
Figs. 16 and 17 show the effects of transient and permanent router malfunctions respectively. The
same router [2,2] in the 4x4 NoRC is chosen for these experiments.
From the figures it is apparent that the XY routing strategy has no fault-tolerance capabilities and
immediately breaks causing the whole platform to stop processing. The reason is that all the
processing nodes, within a short time interval, try to route some data through router [2,2]. Despite the
constant failures the algorithm has no flexibility to choose alternative paths and the network quickly
locks up. The random strategy has a reduction in throughput when router [2,2] stops working but it
continues to operate in a stable state and it shows the good fault-tolerant properties expected in an
adaptive routing strategy.
Fig. 18 shows the effects of a failure in router [0,3] instead of router [2,2] to investigate how the
relative location of the router affects the drop in performance. The XY router again encounters a dead
lock situation as nodes progressively try to use a router that keeps bouncing back packets. The
negative effect in the stochastic case is less noticeable since router [0,3] has one port less and less
chances of being used to route traffic. In Figures 19, 20 and 21 the number of routers failing is
increased to two ([0,3] and [1,2]) , three (add [2,3] to the previous two) and four (add [2,0] to the
previous three) respectively. It can be observed in the figures that a failure in a single router for the
XY case always hangs the system while for the stochastic case the degradation in performance is
evident with the increase in the number of faulty routers but it is not critical. The limit can be seen in
Fig. 21 during the time interval in which four routers fail in parallel and there are some time windows
in which data does not get through the system. This suggests that for this particular distribution of
load, services and network size a failure rate superior to 4 could hang the network. It is also important
to notice that in these experiments we have made sure that at least on service provider for each service
in the network remains fault free. Otherwise, requests would be left in the network and the system
would eventually hang as progressively each node makes a request for the unavailable service.
7. Conclusions and Future work
This paper has presented the NoRC platform and study how the features expected in a dynamically
reconfigurable platforms can be exploited by a simple but efficient stochastic routing algorithm. The
initial results are compared with a simple deterministic algorithm and show competitive levels of
performance in terms of throughput. The main advantage of the stochastic strategy is its excellent
fault-tolerant capabilities and the flexibility introduced in the distributed multiprocessor platform with
a service-based strategy. This means that there is no need to maintain up-to-date location tables in the
processing nodes as the platform dynamically changes. Error resiliency is expected to be a
fundamental feature that systems fabricated with billions of transistors in deep-submicron processes
should possess to be cost effective. The regularity of the proposed NoRC together with stochastic
routing could be the steps in the right direction. The dynamic reconfiguration obtained with the
FPGA-like processing nodes should enable one device service different requirements in the
multimedia domain. Traditionally multimedia applications can tolerate small service drops and could
be a good match to the stochastic approach. For example, a few erroneous pixels in one video frame of
a high definition display working at 50 frames per second do not invalidate the system. Economics
dictate that the fabrication costs of one of these complex devices in ultra-deep submicron technology
will be very high, of the order of a few million dollars per mask set, and as such this cost should be
amortized over the device volume required for multiple applications.
GT traffic could be one of the challenges faced by a stochastic approach such as the one proposed in
this paper. A simple option could be to reserve the routing resources needed by GT traffic and let the
stochastic algorithm find an alternative path for non critical traffic. Additionally, the dynamic
reconfiguration capabilities could be used to move physically closer transmitter and receivers involved
in GT traffic. We expect to investigate these alternatives as work progresses.
Further work should also investigate the energy consumption introduce by the stochastic alternative. In
principle more flits need to be retransmitted to found a service provider but the request flits are very
small and need a fraction of the energy required by a full packet transmission. The connection oriented
transmission of data packets without arbitration once a circuit has been successful established should
limit energy waste. To introduce a suitable energy model in the SystemC NoRC model constitutes
part of our future work. We will also investigate the suitability for asynchronous realization of the
stochastic router using the Balsa synthesis toolset [22] and extract data in terms of latency and energy
consumption to further refine the initial model presented in this paper.
[1] S. Furber, J. Bainbridge, “Future Trends in SoC Interconnect”, Proceedings 2005 International
Symposium on System-on-Chip, pp. 183-186, November 2005, Tampere, Finland.
[2] R. Marculescu, D. Marculescu, L. Pileggi, "Toward an Integrated Design Methodology for FaultTolerant, Multiple Clock/Voltage Integrated Systems", Proc. IEEE Intl. Conference on Computer
Design (ICCD), San Jose, CA, Oct. 2004.
[3] J. V. Woods, P. Day, S. B. Furber, J. D. Garside, N. C. Paver, and S. Temple, “AMULET1: An
asynchronous ARM microprocessor,” IEEE Trans. Comput., vol. 48, pp. 375–398, Apr. 1997.
[4] J. Bainbridge, L. A. Plana, S. B. Furber, "The Design and Test of a Smartcard Chip Using a
CHAIN Self-Timed Network-on-Chip," date, p. 30274, Design, Automation and Test in Europe
Conference and Exhibition Designers’ Forum (DATE'04), 2004.
[5] D. Bursky, “We Must Hold the Line on Soaring ASIC Design Costs”, Electronic Design,
www.elecdesign.com, October, 2002.
[6] L. Benini and G. De Micheli. “Networks on chips: A new SoC paradigm”, IEEE Computer,
35(1):70–80, 2002.
[7] D. Bertozzi, L. Benini, G. De Micheli, “Error control schemes for on-chip communication links:
the energy-reliability tradeoff”, Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, Vol.24, Iss. 6, pp. 818- 831, June 2005.
[8] P. Bhojwani, R. Singhal, G. Choi, and R. Mahapatra. “Forward error correction for on-chip
networks”. Proc. of Workshop for Unique Chips and Systems (UCAS-2), March 2006.
[9] Murali, S.; Theocharides, T.; Vijaykrishnan, N.; Irwin, M. J.; Benini, L.; De Micheli, G.; "Analysis
of error recovery schemes for networks on chips", IEEE Design & Test of Computers, Volume 22,
Issue 5, Sept.-Oct. 2005, pp. 434--442.
[10] L. Li, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, “Adaptive Error Protection for Energy
Efficiency”, 2003 International Conference on Computer Aided Design (ICCAD’03) , pages 2-7 ,
November 9-13, 2003, San Jose, CA, USA
[11] Teijo Lehtonen, Pasi Liljeberg, and Juha Plosila, “Online Reconfigurable Self-Timed Links for
Fault Tolerant NoC,” VLSI Design, vol. 2007, 13 pages, 2007
[12] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, L. Alvisi, “Modeling the Effect of
Technology Trends on the Soft Error Rate of Combinational Logic”, Proc. of Dependable Systems
and Networks (DSN), pp. 389-398, 2002.
[13] D. Park, C. Nicopoulos, J. Kim, N. Vijaykrishnan, C. R. Das. “Exploring Fault-Tolerant
Network-on-Chip Architectures”, Proc. of the 2006 International Conference on Dependable Systems
and Networks, June, pp. 93- 104, 2006.
[14] M. Ali, M. Welzl, M. Zwicknagl, and S. Hellebrand. “Considerations for fault-tolerant network
on chips”. Proc. of the 17th Intl. Conf. on Microelectronics (ICM), 2005.
[15] T. Dumitras, R, Marculescu. On-Chip Stochastic Communication. Proc. Design Automation &
Test in Europe (DATE), March, 2003
[16] M. Pirretti, G. M. Link, R. R. Brooks, N. Vijaykrishnan, M. Kandemir, and I. M. J. “Fault
tolerant algorithms for network-on-chip interconnect”. In Proc. of the ISVLSI, 2004.
[17] M. Coppola, S. Curaba, M. D. Grammatikakis, G. Maruccia, F. Papariello, "OCCN: A NoC
Modeling Framework for Design Exploration", Journal on System Architecture, vol. 50, pp. 129--163,
February 2004.
[18] D. Wiklund, D. Liu, “SoCBUS: Switched Network on Chip for Hard Real Time Embedded
Systems”, International Parallel and Distributed Processing Symposium (IPDPS'03) , pp. 78a, France,
2003.
[19] Seongmoo Heo and Krste Asanovic, "Replacing Global Wires with an On-Chip Network: A
Power Analysis" International Symposium on Low Power Electronics and Design (ISLPED'05), San
Diego, CA, August 2005.
[20] Umit Y. Ogras, Jingcao Hu, Radu Marculescu. “Key Research Problems in NoC Design: A
Holistic Perspective” International Conference on Hardware - Software Codesign and System
Synthesis, September, 2005.
[21] J. Bainbridge, S. B. Furber. “Chain: A delay-insensitive chip area interconnect”. IEEE Micro,
22(5):16--23, 2002.
[22] D.A. Edwards, A. Bardsley, “Balsa: An Asynchronous Hardware Synthesis Language”, The
Computer Journal, Volume 45 (1), Pages 12-18, January 2002.
Router
Router
Network
Interface
Local
SRAM
Memory
Reconfigurable
Fabric
DSP
functions
Network
Interface
Local
SRAM
Memory
RISC
Processor
Router
Reconfigurable
Fabric
DSP
functions
RISC
Processor
Router
Network
Interface
Local
SRAM
Memory
Reconfigurable
Fabric
Network
Interface
Embedded
DRAM
DSP
functions
RISC
Processor
DRAM
Controller
Switch Voltage/Clock Domain
Asynchronous Router
Asyncnronous Buffer
Tile Voltage/Clock Domain
Network Interface
RISC CPU
RECONFIGURABLE
FABRIC
Embedded
DSP
Functions
AMBA AHB BUS
Fig.1 NoRC components
EMBEDDED
SRAM MEMORY
SRAM
controller
OCCN StdChanel
ROUTER
(Crossbar no buffering)
Thread Route
North
OCCN StdChannel
Link Layer
OCCN StdChanel
TRANSMITTER
RECEIVER
MODULE
(Computing Tile)
O
CC
N
St
dC
ha
n
el
OCCN StdChanel
MasterPhy Port
Physical Layer
Thread
LinkLayer
Receive
Thread
TansportLayer
Receive
Assembly
Acknowledge
Event
Thread
Link Layer
Transmit
Segmentation
Transport Layer
Thread Transport Layer
Transmit
Fig.2 SystemC NoRC components
Description
Light load
Medium load
High Load
Maximum Load
Time between
request packets
min
max
0 ns
5000 ns
0 ns
500 ns
0 ns
50 ns
0 ns
5 ns
Table 1. Network load points
Flit received in
port
YES
NO
Request flit?
YES
YES
Function request
= node function?
NO
NO
Pick an output
port randomly
NO
Port PE busy?
YES
Acknowledge
flit?
NO
Bounce flit?
YES
Forward data flit
Generate
bounce flit
(connection
failure)
Release
resources
Set port to busy
NO
YES
Port busy?
Pick next port
NO
Set port to busy
YES
Colision?
NO
All ports
checked?
Forward
acknowledge flit
YES
YES
Colision?
NO
Send request flit
to PE
Generate
Bounce flit
(connection
failure)
Fig.3 Router activity flow chart
Forward
request flit
R
R
7
R
R
9
R
10
R
R
R
R
13
12
3
R
R
R
18
4
9
12
R
18
R
0
R
2
5
R
7
3
Fig.4 Process 10 requests service 7
R
R
7
R
R
9
R
10
R
R
R
R
13
12
3
R
R
R
18
4
9
12
R
18
R
0
R
2
5
R
7
3
Fig.5 Process 12 requests service 3
R
R
7
R
R
9
R
10
R
R
R
R
4
9
18
R
0
R
2
18
R
R
R
12
13
12
3
R
5
R
7
3
Fig.6 Process 12 connection failure
Fig.7 Throughput analysis (4-service stochastic and XY)
R
R
3
R
R
1
R
2
R
R
R
R
0
1
2
R
0
R
2
2
R
R
R
0
1
0
3
R
1
R
3
Fig.8 Distribution for 4 services.
3
Fig.9 Throughput analysis (service count impact on stochastic routing)
R
R
1
R
R
1
R
4
R
R
R
R
R
R
R
R
4
9
18
R
0
R
2
18
R
R
R
12
R
R
R
3
13
12
3
R
R
R
5
7
9
10
R
R
R
8
0
2
7
R
R
R
8
4
9
2
R
3
2
3
5
R
R
R
R
3
9
0
5
R
R
R
R
0
2
7
2
R
R
R
R
0
1
0
2
R
R
R
5
2
5
R
5
R
7
3
Fig.10 Distribution for 6,8 and 11 services.
Fig.11 Throughput analysis (6-service stochastic and XY).
Fig.12 Throughput analysis (network size).
R
R
1
R
R
1
R
4
2,0
1,2
R
R
2
R
0
2,2
1
R
0
5
2
5
0,3
2
2,3
0
R
2
Fig.13 Router error distribution.
5
R
3
5
0 ms
0.2 ms 0.3 ms 0.4 ms
0.5 ms 0.6 ms 0.7 ms
0.35 ms
Router [2,2]/
Router [0,3]
Router [1,2]
Router [2,3]
Router [2,0]
Fig.14 Router error relative timing.
1 ms
Fig.15 Fault-tolerant analysis (faulty south port in router [2,2]).
Fig.16 Fault-tolerant analysis (temporal fault in router [2,2]).
Fig.17 Fault-tolerant analysis (permanent fault in router [2,2).
Fig.18 Fault-tolerant analysis (temporal fault in router [0,3]).
Fig.19 Fault-tolerant analysis (temporal fault in router [0,3] and [1,2]).
Fig.20 Fault-tolerant analysis (temporal fault in router [0,3], [1,2] and [2,3]).
Fig.21 Fault-tolerant analysis (temporal fault in router [0,3], [1,2], [2,3] and [2,0]).
Download