DEPARTMENT OF ELECTRICAL & COMPUTER ENGINEERING
FACULTY OF ENGINEERING

A HIGH BANDWIDTH AREA EFFICIENT SPATIAL DIVISION MULTIPLEXING BASED NETWORK ON CHIP

SUBMITTED BY
MANMOHAN MANOHARAN

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
NOVEMBER 2010

ABSTRACT

The shift from single-processor chips to MPSoCs has prompted a search for alternative technologies for interconnecting the various components of an MPSoC. Networks-on-Chip provide a promising solution in this direction. Leveraging the concepts of computer networks, which are already well established and stable, the idea of implementing an MPSoC as a network of components holds great scope for improvement. A Network-on-Chip is also more scalable than a bus-based architecture as the number of cores increases. An area-efficient, high-bandwidth SDM-based NoC is presented in this thesis. Furthermore, an SDF model of this NoC has been developed that enables application performance to be predicted on the underlying hardware.

ACKNOWLEDGEMENT

I would like to express my sincere thanks to everyone who has supported me in completing this thesis to the very best of my abilities. First and foremost, I would like to thank my supervisor, Dr Akash Kumar, for taking me under his wing and giving me an idea to start with. No amount of words would be sufficient to describe the support and encouragement he has given me during the course of this project. Without his help and guidance, I would not have come so far. His infinite patience while listening to the various issues I faced during the course of this project, his pointing out of zillions of my silly mistakes, and his understanding of my capabilities as a student gave me the necessary confidence to carry on till the end. I am grateful to Mr Shakith Fernando for providing me with all the necessary assistance and knowledge with respect to FPGAs and for developing my love for these wonderful logic devices. I would also like to thank Dr Bharadwaj Veeravalli, Associate Professor, Department of Electrical and Computer Engineering, NUS, for giving me permission to use his lab facilities for the purpose of my project. I also thank Mr Eric Poon, Lab Officer of the CNDS lab, for providing me with all the necessary setup to work on my project. I also thank my dearest friends Ganesh, Deepu, Jerrin, Pramod, Sheng, and Rajesh for helping me make the time I spent in the University both productive and fun filled. Last but not least, I am thankful for my most wonderful parents and sister, who have supported me throughout my life, in whatever situation I have been in and in all the decisions I have taken until now.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS AND ABBREVIATIONS
CHAPTER 1 INTRODUCTION
 1.1 OVERVIEW OF MULTIPROCESSOR SYSTEM-ON-CHIPS
 1.2 INTRODUCTION TO NETWORK-ON-CHIPS
 1.3 KEY CONTRIBUTIONS
 1.4 THESIS ORGANIZATION
CHAPTER 2 HIGH BANDWIDTH AREA EFFICIENT SDM BASED NETWORK ON CHIP
 2.1 BASIC ARCHITECTURE
 2.2 ARCHITECTURAL IMPROVEMENTS
  2.2.1 MULTIPLE CLOCK SUPPORT
  2.2.2 FLOW CONTROL
  2.2.3 OPTIMIZED NETWORK INTERFACE ARCHITECTURE
 2.3 RESULTS AND ANALYSIS
CHAPTER 3 SDF MODELLING
 3.1 SDF GRAPH BASICS
 3.2 SDF MODEL OF SDM NETWORK ON CHIP
 3.3 APPLICATION MAPPING AND THROUGHPUT ANALYSIS
 3.4 CONCLUSIONS
CHAPTER 4 CASE STUDY
CHAPTER 5 CONCLUSIONS AND FUTURE WORK
 5.1 CONCLUSIONS
 5.2 FUTURE WORK
BIBLIOGRAPHY

LIST OF FIGURES

Figure 1: Intel CPU introductions from 1975-2010
Figure 1.1: A typical Network-on-Chip based system
Figure 1.2a: Time Division Multiplexing
Figure 1.2b: Spatial Division Multiplexing
Figure 1.3: NI described in [11]
Figure 2.1: NoC structure [12]
Figure 2.2: Network interface of the existing structure [12]
Figure 2.3: Data transmission through the wires [12]
Figure 2.4: k-way router [12]
Figure 2.5: Overview of multiple clocking implementation
Figure 2.6: Multiple clocking and flow control at transmission
Figure 2.7: Multiple clocking and flow control at receiver
Figure 2.8: Flow control mechanism in the NoC
Figure 2.9: Area breakdown of existing NoC [12]
Figure 3.1: Example of a three-actor SDF graph
Figure 3.2: SDF model with self-edge
Figure 3.3: SDF model with back edge to model a buffer
Figure 3.4: SDF model of the SDM based NoC
Figure 3.5: Optimized hardware generation flow using SDF graph
Figure 3.6: Modelling of tasks running on two processors
Figure 3.7: Snippet of XML specification of SDF graph
Figure 4.1: SDF model of producer-consumer
Figure 4.2: SDF model of producer-consumer with NoC
Figure 4.3: Number of lanes vs. iterations per second on SDF3
Figure 4.4: Number of lanes vs. iterations per second on FPGA
Figure 4.5: Measured data vs. SDF3 for TXCLK = SYSCLK
Figure 4.6: Measured data vs. SDF3 for TXCLK = 5 × SYSCLK
Figure 5.1: Proposed area-efficient architecture for the send data distributor

LIST OF TABLES

Table 1: Resource comparison between the existing NoC [12] and [11]
Table 2: Bandwidth per connection for different TXCLK and wires
Table 3: Predicted period vs. measured period for producer-consumer model

LIST OF SYMBOLS AND ABBREVIATIONS

FIFO First-In First-Out
FPGA Field Programmable Gate Array
FSL Fast Simplex Link
IP Intellectual Property
MPSoC Multi-Processor System-on-Chip
NI Network Interface
NoC Network-on-Chip
SDF Synchronous Data Flow
SDM Spatial Division Multiplexing
SoC System-on-Chip
TDM Time Division Multiplexing
VHDL Very High Speed Integrated Circuit Hardware Description Language

CHAPTER 1 INTRODUCTION

The invention of the transistor by William Shockley, John Bardeen, and Walter Brattain at AT&T's Bell Labs in 1947 has been described as one of the most important inventions in the history of mankind. This tiny device now forms the basic building block of any modern electronic device manufactured today. The development of integrated circuits (ICs), which contain transistors fabricated on a semiconductor substrate such as silicon to form a complete electronic circuit with both active and passive components, has revolutionized the world of electronics. The ability to mass produce ICs using a highly automated process has enabled very low unit costs, further driving down the cost of electronic devices. With the growth of silicon processing technology, more and more transistors could be packed on the surface of a silicon substrate. The first ICs contained only a few tens of transistors; this level of integration was known as Small Scale Integration (SSI). The next level, Medium Scale Integration (MSI), was reached in the late 1960s. From then on, the level of integration increased to Large Scale Integration (LSI) in the 1970s, with tens of thousands of transistors on a single chip, and then to Very Large Scale Integration (VLSI), starting in the 1980s and continuing even now. The number of transistors per chip has grown to several billion in many of the processor chips released in the last decade, and the feature size has shrunk to 32 nm, with many commercial microprocessors manufactured at this feature size.

The level of integration of transistors has closely followed the trend noted by Gordon E. Moore in 1965 [1], popularly known as Moore's Law, which states that the number of transistors on a chip doubles every 18 to 24 months. As the number of transistors that could be integrated on silicon continued to increase, the performance of processor chips showed a similar trend. This can be seen from the figure below, which shows the trend of Intel CPUs over the period 1975 to 2010.

Figure 1: Intel CPU introductions from 1975-2010. Source: http://www.gotw.ca/publications/concurrency-ddj.htm

1.1 Overview of Multiprocessor System-on-Chips

Even though the number of transistors that could be integrated on silicon continued to increase, clock speed has not seen a linear increase since 2003.
To improve the performance of processors, the most promising direction is to place multiple cores on the same chip. Such processors are called Chip Multi-Processors (CMPs). Moreover, due to the extremely small feature size, multiple processor cores and custom IPs can be integrated into a single chip. Such a device is called a Multiprocessor System-on-Chip (MPSoC) and can be considered a complete system implemented on a single silicon chip. The processing cores can all be of the same type, or the chip can contain heterogeneous cores, such as a DSP. The advantages of an MPSoC are manifold in terms of power, programmability and performance. Since MPSoCs are developed as a platform rather than a specialized product [2], they allow for different implementations of the same product. A few examples of MPSoC platforms are Philips Nexperia [3] and TI OMAP [4]. In [5], some of the key architectural decisions pertaining to MPSoC design are given. The hardware design decisions mainly revolve around the number of processors, their homogeneity and/or heterogeneity, interprocessor communication, the memory hierarchy, special modes for power reduction, and so on.

1.2 Introduction to Network-on-Chips

As the number of cores increases, the interconnect architecture has to support the exponentially increasing traffic. The connection has to provide very high bandwidth, so as to satisfy the communication requirements of the applications, and it has to be scalable, so as to support a large number of cores. The bus-based architectures currently used for interprocessor communication do not scale well with an increasing number of cores. The bottleneck in the performance of an MPSoC therefore moves from the computational elements to the communication network.

A new way to solve this communication bottleneck in MPSoCs is to borrow the concepts of networking and implement the MPSoC as a network of different components [6]. Such communication interconnects are termed Networks-on-Chip (NoCs). Compared to a bus-based architecture, a NoC is more modular and scales better when designing systems with a large number of cores. Three types of services should be considered while designing a NoC: best effort (BE), guaranteed throughput (GT), and hybrid, with support for both BE and GT. In a BE-based NoC, there is no guarantee on the throughput obtained from the NoC. A number of packet-based NoCs providing BE services have been developed; some examples are QNoC [7] and MANGO [8]. GT services make it possible to ensure that an application meets its required throughput. The NoC has to be predictable so that resource allocation for the application can be done at design time. There are two approaches to providing GT services: Time Division Multiplexing (TDM) and Spatial Division Multiplexing (SDM).

In a typical NoC-based MPSoC system, as shown in Fig 1.1, each IP connects to the communication network through a component called the network interface (NI). The communication network is made up of routers that act as the switching points for data.

Figure 1.1: A typical Network-on-Chip based system

In TDM, the available links between two routers are multiplexed on a time-sharing basis between the various connections.
The bandwidth obtained by each connection is proportional to the number of time slots allocated to it. However, the routers have to store all the switching tables within themselves, which can introduce a large area and power overhead into the system. TDM-based NoCs that have been developed include Aethereal [9] and Nostrum [10]. Another approach to providing GT services is Spatial Division Multiplexing. The link between two routers consists of a number of wires; in the SDM approach, a subset of these wires is allocated to a particular connection, depending on its bandwidth requirements. The available bandwidth increases with the number of wires allocated. A brief comparison between the TDM and SDM approaches to bandwidth sharing is shown in Figs 1.2a and 1.2b respectively.

Figure 1.2a: Time Division Multiplexing (connections A-D sharing time slots 0-7)
Figure 1.2b: Spatial Division Multiplexing (connections A-D sharing lanes 0-7)

Consider four separate connections A, B, C and D with different bandwidth requirements, multiplexed over a link of 8 wires. Assume that connections A, B, C and D require 2, 3, 2 and 1 time slots respectively in TDM to satisfy their bandwidth requirements, and that each connection uses all the available wires in its time slot. In SDM, the multiplexing is done over the wires instead: A, B, C and D receive 2, 3, 2 and 1 wires respectively. The advantage of SDM is that the router does not have to switch at every time slot, and hence there is no need to store switching tables in the router. The network interface becomes more complex, but in the router we achieve considerable savings in power and area.

An SDM-based NoC is described in [11], with a flexible architecture but a higher area cost. This NoC makes use of an N-bit to M-bit serializer, which serializes N-bit data onto M wires, as shown in Fig 1.3. Each connection has its own message queue and serializer.

Figure 1.3: NI described in [11]

The router of this NoC has full flexibility: each incoming wire can connect to any outgoing wire. However, the area overhead associated with this router is very large. As an improvement on this design, a new NoC has been proposed and developed in [12]. The new design is described in detail in Chapter 2.

1.3 Key Contributions

The contributions of this thesis are as follows:
- Multiple clock support has been added to the Network-on-Chip presented in [12]. Separate clocks for the transmission network and the IP interface allow the transmission clock to be varied independently to achieve higher bandwidth and throughput.
- The existing architecture of the network interface has been modified to make the design more area efficient.
- An SDF-based model of the Network-on-Chip has been developed so that the throughput of applications can be calculated at design time.

1.4 Thesis Organization

Chapter 2 of this thesis describes the architecture of the existing NoC in detail, along with the architectural improvements introduced in this thesis. Chapter 3 explains the basics of SDF modelling of applications and discusses the SDF model of the Network-on-Chip. In Chapter 4, a case study of a producer-consumer SDF model using the Network-on-Chip is discussed, including the variation in throughput as the number of lanes and the transmission clock frequency are increased. Chapter 5 concludes the thesis and outlines directions for future work.
CHAPTER 2 HIGH BANDWIDTH AREA EFFICIENT SDM BASED NETWORK ON CHIP

An SDM-based NoC has been developed with considerable improvements over the NoC of [11]. This chapter describes the basic architecture of this Network-on-Chip and the modifications made to improve its performance. Multiple clocking support has been added, which allows the data transmission network to be clocked at a much higher frequency than the IP, enabling higher bandwidth. Further area improvements have been made to the architecture of the existing NoC. Compared to the design in [11], the area improvements in the new design are indicated in Table 1.

Component                        Number of Slices   Power (mW)
32-bit to M-bit serializer       13319              134.38
sendDataDistributor              183                2.08
32-bit to 1-bit serializer       48                 2.27

Table 1: Resource comparison between the existing NoC [12] and [11]

2.1 Basic Architecture

A salient feature of the NoC described here is dynamic reconfigurability, which enables run-time link configuration of the NoC. This feature helps when multiple use cases are supported in the system: the NoC can be reconfigured whenever the use case switches. The NoC consists of two components, a) a control network and b) a data network. The structure of the NoC is shown in Fig 2.1.

Figure 2.1: NoC structure [12] (SDM-based data network alongside a packet-based control network)

Each part is described in detail below.

a) Control network: The control network is used to program the NoC when changing its configuration. It is a lightweight packet-based network with a very low area overhead compared to the whole NoC. The links between nodes consist of 8 wires each. The node at (0, 0) is connected to the east and north, while all other nodes are connected to the east. The programming data is broadcast from this node to all the other nodes in the network. Each node in the mesh has a unique id through which it can be identified. A protocol has been designed for programming the NoC, according to which the programming data is forwarded through the network.

b) Data network: The data network consists of the NIs, which connect to the IPs through FSL links, and the routers, which guide the data along the corresponding paths.

Network interface: The structure of the existing network interface, which supports 3 sending channels of 32-bit width and 8 sending and receiving wires forming the data transmission network, is given in Fig 2.2.

Figure 2.2: Network interface of the existing structure [12] (three send data distributors feeding 32-bit to 1-bit serializers between the output message queues and the router)

There is a data distributor for each of the three channels and a 32-bit to 1-bit serializer for each of the sending wires. The role of the distributor is to send the 32-bit data words to the wires allocated to that channel in a sequential manner. In this NI, an entire 32-bit word is sent on a single wire rather than being spread over all the wires allocated to the particular channel, as shown in Fig 2.3: for example, all the bits of three separate data packets A, B and C are sent along three separate lines.

Figure 2.3: Data transmission through the wires [12]

The outputs of the data distributors are ORed together and fed to all the serializers; this ensures that any data distributor can send data to any of the serializers. A two-bit handshaking mechanism exists between the serializer and the data distributor. The sketch below illustrates the distribution policy.
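To make the round-robin policy concrete, the following minimal Python sketch mimics the behaviour just described: each 32-bit word travels whole on one lane, and successive words cycle through the channel's allocated lanes in sequence. The function name and the lane representation are illustrative only; the actual design is VHDL.

```python
# Sketch of the send-data-distributor policy: each 32-bit word is sent whole
# on one lane, and successive words cycle through the allocated lanes.

def distribute(words, lanes):
    """Assign each word to one of the allocated lanes, sequentially."""
    schedule = {lane: [] for lane in lanes}
    for i, word in enumerate(words):
        lane = lanes[i % len(lanes)]   # next lane in round-robin order
        schedule[lane].append(word)    # the whole word is serialized on this lane
    return schedule

# Packets A, B and C on a channel that was allocated lanes 0, 1 and 2:
print(distribute(["A", "B", "C"], lanes=[0, 1, 2]))
# {0: ['A'], 1: ['B'], 2: ['C']} -- each packet travels on its own wire
```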
The serializer receives the data packet from the distributor and serializes it onto the wire. The start of a data transmission is indicated by a start bit, which is active high. For receiving data, the whole process is followed in reverse order.

Router: The routers used in this NoC are 1-way routers. A k-way router is a router in which each incoming line can be switched to k lines in each direction. Fig 2.4 illustrates k-way switching for different values of k.

Figure 2.4: k-way router [12] — (a) 4-way, (b) 2-way, (c) 1-way

In this NoC we use a 1-way router, which restricts to one the number of lines that any incoming wire can switch to. On the one hand, this reduces the area overhead; on the other hand, it may not be possible to satisfy all connection requirements using this router. To mitigate this, the application is mapped to processors such that more than one possible path exists between a source and a destination. In this way, the probability that a 1-way router can satisfy all the connection requirements increases.

2.2 Architectural Improvements

In this section, a few architectural improvements incorporated into the existing NoC design are discussed.

2.2.1 Multiple Clock Support

In the existing NoC design, a single clock drives the IP, the network interface and the data transmission network. As a result, the bandwidth of the network was restricted by the maximum frequency at which the network interface could operate. The motivation for implementing multiple clocking in the NoC was the flexibility it brings to the network. A few advantages of having a separate clock domain for the transmission network are as follows:
- IPs running at different frequencies can connect to a communication network running at a common speed.
- The communication frequency is isolated from the computation frequency and can therefore be tuned separately. This can help save power, for example by allocating more lanes running at a lower frequency instead of fewer lanes running at a higher frequency.

Figure 2.5 indicates the separation of clock domains between the transmission network and the NI in the multiple clocking scheme. The NIs are clocked at a separate clock frequency (SYSCLK) from the routers that form the transmission network, which run at the transmission frequency (TXCLK). The NI communicates with the IP using SYSCLK, while it transmits data through the wires using TXCLK as the clock reference.

Figure 2.5: Overview of multiple clocking implementation (IP cores and NIs in the SYSCLK domain, transmission network in the TXCLK domain)

Implementation: The separation of clock domains at the NI is effected at the serializer level. To implement multiple clocking, the structure of the serializer was modified to take the different clock frequencies into account. In the existing design, the serializer and the data distributor were clocked by the same source. The modified serializer is clocked by two sources: the system clock, which it shares with the data distributor, and the transmission clock, which clocks the entire data transmission network. The structure of the 32-bit to 1-bit serializer was modified to support the two clocks, as shown in Figure 2.6.
Figure 2.6: Multiple clocking and flow control at transmission (interface unit in the SYSCLK domain handshaking with the serializer in the TXCLK domain)

There are two components in this structure: the interface unit, clocked by the system clock, which performs a two-way handshake with the data distributor, and the serializer, which receives the 32-bit word directly from the data distributor. The interface checks whether the serializer is busy serializing data before accepting new data for transmission. When new data is accepted, the interface triggers the serializer to start transmission; after the transmission is complete, the serializer signals the interface. Similarly, at the receiver end there are two clock domains at the deserializer. The structure of the deserializer is given in Fig 2.7. By separating the network interface and the transmission network into two clock domains, there is much more flexibility in clocking the NoC. The IP is connected to the network interface through FSL links; this enables IPs operating at different frequencies to be connected to the same NoC, since the FSL links support an asynchronous mode of operation.

2.2.2 Flow Control

When a transmitter is allowed to send data to a receiver at a rate higher than the receiver can process, the receiver is overwhelmed and data packets are lost. To prevent this, buffers can be placed between the transmitter and receiver, so that data packets are stored temporarily until the receiver processes them. When the traffic load is very high, however, the amount of buffering required can become very large, which translates into a larger area overhead. To prevent this, a flow control mechanism has been implemented in the design. Flow control is implemented at the level of the network interface, where a full signal is sent from the receiving network interface to the transmitting side, as shown in Fig 2.8. The figure shows three data wires allocated between a transmitter and a receiver, with a flow control wire allocated for each of the three data wires. At the receiver end, the deserializer checks whether the receive FSL buffer is full; if it detects that the buffer is full, it asserts the full signal. When transmitting data, the interface unit of the serializer checks whether the full signal is asserted, and data is transmitted only if the full signal is deasserted (Fig 2.6). This flow control is implemented for each of the wires of the network. Fig 2.7 shows the architecture of the deserializer at the receiving node of the NoC.

Figure 2.7: Multiple clocking and flow control at receiver (deserializer in the TXCLK domain handing 32-bit words to a register in the SYSCLK domain)

Figure 2.8: Flow control mechanism in the NoC (a flow control wire paired with each data wire between the transmitting and receiving NIs)

As a result of implementing flow control, the router now switches an additional wire for each incoming data wire; the associated area overhead is approximately 10%. The k-way routing is followed in this case as well, with the router switching the data wires and the flow control wires in pairs.

2.2.3 Optimized Network Interface Architecture

The current NoC architecture is not fully optimized in terms of the area and FPGA resources it consumes. The various components of the NoC can be further optimized to reduce the area overhead and yield a more efficient design. The network interface takes up a large share of the resources consumed by the entire NoC architecture [12].

Figure 2.9: Area breakdown of existing NoC [12] — (a) top level, (b) NI

A breakdown of the NI shows that the data distributor and the serializer/deserializer are the major resource-consuming components, so an effort was made to optimize the architecture of these components further. First, the HDL code of the serializer and deserializer was rewritten to obtain lower resource usage. In the earlier design, the serializer used a state machine with 33 states; in the new design the number of states was reduced from 33 to 3, and a simple counter is used to track the number of bits sent or received. Using the new VHDL code, the resources used by the serializer fell from 48 slices [12] to 42, a reduction of about 12.5%. A behavioural sketch of the reduced state machine follows.
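The Python sketch below mimics the reduced serializer: three states plus a bit counter replace the 33-state machine, and transmission is gated on the flow-control full signal as described in Section 2.2.2. The signal names, the LSB-first bit order and the exact handshake timing are assumptions for illustration; the real design is VHDL clocked by TXCLK.

```python
# Behavioural sketch of the optimized serializer: 3 states and a counter
# instead of one state per bit. Bit order and signal names are assumptions.

IDLE, START, SHIFT = range(3)

class Serializer:
    def __init__(self):
        self.state, self.shift_reg, self.count = IDLE, 0, 0

    def tick(self, new_data, data, full):
        """One TXCLK cycle. Returns (serial_out, finished)."""
        if self.state == IDLE:
            if new_data and not full:                  # flow control: wait while
                self.shift_reg, self.count = data, 0   # the receive buffer is full
                self.state = START
            return None, False
        if self.state == START:
            self.state = SHIFT
            return 1, False                    # start bit, active high
        bit = (self.shift_reg >> self.count) & 1   # SHIFT: one bit per cycle,
        self.count += 1                            # counted rather than encoded
        if self.count == 32:                       # as 32 separate states
            self.state = IDLE
            return bit, True                   # FINISH back to the interface unit
        return bit, False

ser = Serializer()
out = [ser.tick(new_data=(i == 0), data=0xA5A5A5A5, full=False) for i in range(34)]
# 34 cycles: accept, start bit, then 32 data bits; out[-1][1] is True (finished).
```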
2.3 Results and Analysis

In this section, the overall results of the various architectural improvements made to the design are discussed. As mentioned earlier, the aim was to improve the available bandwidth of the NoC as a whole while reducing the area overhead as much as possible. The architectural implementation was tested on a Xilinx Virtex-4 XC4VLX25 FPGA board.

Bandwidth improvement using multiple clocking: We used a 1-by-2 mesh with one sending channel and one receiving channel to test the bandwidth scaling with multiple clock support. The network had 8 transmitting and receiving wires, with 32-bit data packets received from the IP core. The IP used was a MicroBlaze running at 50 MHz. The NoC was tested at different transmission clock frequencies, keeping the system clock at 50 MHz; we operated the NoC at a maximum frequency of 250 MHz, beyond which timing errors were detected. The available bandwidth per wire increased linearly with the transmission clock frequency. The maximum bandwidth available to a connection can be calculated as B = N × TXCLK, where N is the number of wires allocated to the connection and TXCLK is the transmission clock frequency of the NoC. Hence, the number of wires required for a connection of a given bandwidth decreases linearly with increasing clock frequency. The area overhead associated with multiple clocking support was almost nil in terms of the number of slices consumed. Table 2 shows the bandwidth available to a connection (in Mbit/s) as the TXCLK frequency and the number of allocated wires increase; a short sketch regenerating these figures follows the table.

Wires:            1     2     3     4     5     6     7     8
TXCLK = 50 MHz:  50   100   150   200   250   300   350   400
TXCLK = 100 MHz: 100   200   300   400   500   600   700   800
TXCLK = 150 MHz: 150   300   450   600   750   900  1050  1200
TXCLK = 200 MHz: 200   400   600   800  1000  1200  1400  1600
TXCLK = 250 MHz: 250   500   750  1000  1250  1500  1750  2000

Table 2: Bandwidth per connection (Mbit/s) for different TXCLK frequencies and wire counts
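The entries in Table 2 follow directly from B = N × TXCLK, since each allocated wire carries one bit per transmission-clock cycle (ignoring the start-bit overhead of the serializer):

```python
# Regenerating Table 2 from B = N * TXCLK: one bit per allocated wire per
# transmission-clock cycle (start-bit overhead ignored, as in the text).

def bandwidth_mbps(wires, txclk_mhz):
    """Peak raw bandwidth of a connection, in Mbit/s."""
    return wires * txclk_mhz

for txclk in (50, 100, 150, 200, 250):
    row = [bandwidth_mbps(w, txclk) for w in range(1, 9)]
    print(f"TXCLK = {txclk:3d} MHz:", row)
# TXCLK =  50 MHz: [50, 100, 150, 200, 250, 300, 350, 400]
# ...
# TXCLK = 250 MHz: [250, 500, 750, 1000, 1250, 1500, 1750, 2000]
```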
Area optimization of the NI: The area-optimized version of the serializer and deserializer gave, on average, a 12.5% decrease in the number of slices consumed. The number of slices consumed by the entire NoC also reduced by about 15%, which can be attributed to better packing of the design into slices.

CHAPTER 3 SDF MODELLING

As multimedia applications become more and more computationally intensive, a good model of the application is required for two main reasons: a) to verify that the underlying hardware can support the throughput requirements of the application, and b) to obtain an optimized mapping and scheduling of the application's tasks in a multicore environment. The presence of a good application model is therefore paramount in determining the performance of an application on a particular platform. This chapter describes the basics of Synchronous Data Flow (SDF) graphs and the modelling of applications using them. An SDF model of the NoC described in the previous chapter is also developed here.

3.1 SDF GRAPH BASICS

Synchronous Data Flow (SDF) graphs are used to model DSP and multimedia applications, especially streaming applications such as video or MPEG encoding/decoding, that can be implemented with task-level parallelism on multiple processors. The tasks of an application are modelled as actors, which form the vertices of the SDF graph. The communication between the tasks is represented by the edges of the graph, which model the communication channels between them. The worst-case execution time (WCET) of each actor can be annotated on the graph; such graphs are known as timed SDF graphs. Since worst-case times are used, the throughput obtained from an SDF model of an application is conservative in nature.

Fig 3.1 shows the SDF graph of a simple three-task application. The tasks a0, a1 and a2 each take 100 clock cycles to execute, which is indicated inside the vertices, and the connecting edges between the tasks are also shown.

Figure 3.1: Example of a three-actor SDF graph

In a typical streaming application, the execution of a task begins after data is received; the execution of a task is called the firing of the corresponding actor. At the end of the execution of a task, data is produced. The data items produced and consumed by the actors are called tokens, and the rate at which each actor produces and consumes tokens is called the token rate. Each edge forms a channel on which actors produce and consume tokens, and the token rate of an actor is annotated at the ends of its edges, as shown.

For an actor to fire, sufficient tokens must be present on all its input channels. In Figure 3.1, actor a0 can start executing because of a token present initially on its incoming edge; this token is called an initial token and is represented by the bullet in Fig 3.1. After its execution, a0 produces two tokens on the edge between a0 and a1, which allows actor a1 to fire, since it now has sufficient tokens on its incoming edge. Since a0 produces two tokens per firing and the input token rate of a1 is only one, two executions of actor a1 can begin simultaneously. For this to happen, a1 has to be mapped to two separate processors; if only one processor is allocated to this task, only one instance of a1 can run. To model this, we make use of self-edges: an edge leading from an actor to itself, with one initial token, indicates that only one instance of that actor can run at a time. This is shown in Fig 3.2.

Figure 3.2: SDF model with self-edge

By varying the number of initial tokens on this edge, we can vary the number of simultaneous executions of an actor of the SDF graph; this is called the auto-concurrency of the actor. The small sketch below illustrates the firing rule with a self-edge.
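The following Python fragment spells out the firing rule just described: an actor may fire only when every input edge, including its self-edge, carries enough tokens, so the initial tokens on the self-edge cap the number of concurrent firings. It is a toy illustration, not part of any SDF tool.

```python
# Firing rule with a self-edge: the self-edge's tokens cap concurrency.

def ready_firings(tokens_in, rate_in, self_tokens):
    """How many firings of a1 can start right now."""
    by_data = tokens_in // rate_in   # limited by tokens produced by a0
    by_self = self_tokens            # limited by the self-edge (processors)
    return min(by_data, by_self)

# a0 has just fired, producing 2 tokens; a1 consumes 1 token per firing.
print(ready_firings(tokens_in=2, rate_in=1, self_tokens=2))  # 2: two copies run
print(ready_firings(tokens_in=2, rate_in=1, self_tokens=1))  # 1: one processor
```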
Buffers can also be modelled using edges and tokens in an SDF graph. A buffer can be considered an edge whose initial tokens equal the buffer size, directed opposite to the channel to which the actor is writing.

Figure 3.3: SDF model with back edge to model a buffer

For example, if actor a0 writes two tokens to the channel towards a1, and that channel has a buffer capacity of only two tokens, this is represented as a back edge from a1 to a0, with the buffer size indicated by the number of initial tokens on the back edge. This is shown in Fig 3.3. When a0 fires, the two tokens on the back edge are consumed and two tokens are produced on the channel from a0 to a1. Only after a1 has fired twice, thereby replenishing the buffer space, can a0 fire again.

Terminologies of SDF graphs: The definitions of the terms below are given in [13].

1. Iteration: An SDF graph has executed a single iteration when all the actors of the graph have fired in sequence and the state of the graph has returned to its initial state.
2. Repetition vector: The repetition vector of an SDF graph defines the number of firings of each actor within a single iteration. The repetition vector of the graph shown in Fig 3.3 is [a2 a1 a0] = [1 2 1]; the sketch after this list shows how such a vector follows from the edge rates.
3. Period: The period of an SDF graph is the execution time of one iteration of the graph.
4. Throughput: The throughput of an SDF graph is the number of iterations per second, which is the inverse of the period.

For an application, it may take a few iterations before periodic behaviour starts. The execution of the application therefore has two phases: a) the transient phase, the initial cycles before the periodic behaviour kicks in, and b) the steady-state phase, the stable iterations that the application graph then performs [13]. The average throughput that an application achieves refers to the steady-state throughput of the graph. A comprehensive description of SDF graphs, application modelling and throughput analysis using SDF graphs is given in [15].
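A repetition vector follows from the balance equations: for every edge, the firings of the source actor times its production rate must equal the firings of the destination actor times its consumption rate. The sketch below solves these equations for a small graph; the edge rates used are assumptions chosen to be consistent with the vector [1 2 1] quoted above, since the exact rates are only partially legible in Fig 3.3.

```python
# Solving the SDF balance equations: q[src] * prod == q[dst] * cons per edge.

from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """edges: (src, dst, prod_rate, cons_rate). Returns firings per iteration."""
    q = {actors[0]: Fraction(1)}            # fix one actor, propagate the rest
    for _ in actors:                        # enough passes for a connected graph
        for src, dst, p, c in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * p / c
            elif dst in q and src not in q:
                q[src] = q[dst] * c / p
    scale = lcm(*(f.denominator for f in q.values()))   # smallest integer vector
    return {a: int(f * scale) for a, f in q.items()}

print(repetition_vector(
    ["a0", "a1", "a2"],
    [("a0", "a1", 2, 1), ("a1", "a2", 1, 2), ("a2", "a0", 1, 1)]))
# {'a0': 1, 'a1': 2, 'a2': 1} -- the repetition vector stated above
```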
3.2 SDF MODEL OF THE SDM NETWORK ON CHIP

In a normal SDF graph, the communication channel between two actors, indicated by an edge, is assumed to be a connection with infinite bandwidth and infinite buffer size. In a real system this is not the case: the communication channel between two processors adds latency due to its limited bandwidth and limited buffer size. It is therefore necessary to include a model of the interconnecting channel between two processors, so that application performance can be predicted as accurately as possible. For this purpose, we introduce actors and edges into the SDF graph of the application so that the performance of the interconnect is included in the performance analysis. This helps in determining the mapping of applications and in identifying how much of the application's performance is tied to the performance of the interconnection network. It also gives us the opportunity to determine the type of interconnect that would achieve better bandwidth, and to design an efficient interconnect for the processors. This section describes an SDF model of the NoC of Chapter 2 that can be incorporated into the SDF graph of any application. Figure 3.4 shows this model; its actors model the various components of the NoC.

Figure 3.4: SDF model of the SDM based NoC

In Fig 3.4, two tasks, modelled by the actors named ACTOR, run on two different processors connected to each other by the SDM-based NoC. The WA and RA actors model the functions that write and read data on the FSL links connecting the processors to the NoC. The generated NoC has a data width of 32 bits, so the token size and buffer sizes in Fig 3.4 are in units of 32-bit data packets; accordingly, WA and RA are assumed to be tasks that pump 32-bit tokens into the FSL links. These models can, however, be replaced by more complex ones depending on the software implementation of these functions. The actors of the SDF graph that form the NoC model are described below; a sketch collecting their execution times follows the list.

a) SREG: models the reading of data from the FSL link at the NI block. The FSL can transfer one packet per clock cycle. The FSL buffer is modelled by the back edge from SREG to WA, with the buffer size given by the initial tokens on this edge.

b) SDD: models the data distributor of the NI, which distributes data to the lanes allocated to the transmission port. A packet arriving at the SDD has to wait until the SDD has switched its output to the correct outgoing wire; in the worst case, the SDD has to pass over all the other outgoing wires before the packet is forwarded for transmission. The data distributor takes two clock cycles to forward incoming data to an output serializer, so the WCET of SDD is (2 + w − l) clock cycles, where w is the number of wires in the NI output port and l the number of wires allocated to the connection.

c) SRLR: models the serial transmission of a data packet. It receives the data from the data distributor and converts it to serial format. A 32-bit data packet takes 33 clock cycles to transmit, since it includes a start bit at the start of transmission. The number of lanes allocated to the connection is modelled as the auto-concurrency of this actor, which ensures that up to l lanes can be transmitting at a time on the connection.

d) ROUTER: models the clock latency of the routers along the connection path; for each router on the path, the execution time of this actor increases by one clock cycle. The auto-concurrency of this actor is also set to the number of lanes l.

e) DESRLR: models the deserializer, which receives the data and converts it back to the 32-bit packet format. The auto-concurrency of this actor is also l, since there is a deserializer for each wire of the NoC.

f) RDC: models the receive data collector, which reads the data from each lane sequentially and thus behaves like the SDD. The data collector scans all the incoming wires sequentially and takes two clock cycles to read the data from a deserializer. In the worst case, a packet arrives on a wire just after the data collector has scanned that wire for new data; hence the WCET of this actor is also (2 + w − l) clock cycles.

g) RREG: models the writing of received data to the FSL buffer at the receiver end, similarly to SREG, at a rate of one packet per clock cycle.

h) The flow control signal is modelled by a token sent from the actor RA to SRLR through a series of routers; the functioning of the flow control is as described in Chapter 2.
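To collect the timings above in one place, the sketch below fills in the worst-case execution times of the NoC model's actors for a connection using l of a port's w wires and a given number of routers on the path. The DESRLR value is taken from the Chapter 4 instantiation (Fig 4.2, which also uses 36 rather than 33 cycles for SRLR); these should be treated as the assumptions of that instantiation rather than universal constants.

```python
# Worst-case execution times (clock cycles) of the NoC model actors.

def noc_actor_times(w, l, routers):
    """w: wires on the NI port, l: wires allocated, routers: hops on the path."""
    return {
        "SREG":   1,            # one packet per cycle from the FSL link
        "SDD":    2 + (w - l),  # worst case: scans the other w - l outputs first
        "SRLR":   33,           # 32 data bits plus the start bit (the Chapter 4
                                #   model of Fig 4.2 uses 36)
        "ROUTER": routers,      # one cycle of latency per router on the path
        "DESRLR": 3,            # value used in the Chapter 4 model (Fig 4.2)
        "RDC":    2 + (w - l),  # same sequential-scan worst case as the SDD
        "RREG":   1,            # one packet per cycle into the FSL buffer
    }

print(noc_actor_times(w=8, l=3, routers=2))
# {'SREG': 1, 'SDD': 7, 'SRLR': 33, 'ROUTER': 2, 'DESRLR': 3, 'RDC': 7, 'RREG': 1}
```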
3.3 APPLICATION MAPPING AND THROUGHPUT ANALYSIS

When analyzing the performance of an application on a multiprocessor system, the mapping of the application's tasks has to be done first. This identifies the channels at which the SDF model of the NoC needs to be inserted into the application's SDF graph. A typical design flow to generate the optimal hardware for a NoC-based multiprocessor system is shown in Fig 3.5.

Figure 3.5: Optimized hardware generation flow using SDF graph (mapped SDF graph → bandwidth/buffer size calculation → hardware synthesis)

In this section, we describe the method for incorporating the SDF model of the NoC into the SDF graph of an application and calculating the application's throughput. Fig 3.6 shows the SDF graph of two tasks, ACTOR1 and ACTOR2, running on two separate processors connected to each other by an interconnect mechanism. The actor Ac models the writing of data into the interconnection network; its self-edge indicates that tokens are written sequentially into the network. The actor As models the latency of sending data across the interconnection network. The model of Ac can be replaced by the model of the SDM-based NoC described earlier.

Figure 3.6: Modelling of tasks running on two separate processors

For the throughput analysis of SDF graphs, a tool called SDF3 [14] has been developed and is available online for free download. The tool reads in the SDF graph of the application, specified in XML format, and incorporates various features for calculating the parameters of the application graph; it can be used for all analysis and simulation operations on SDF graphs. A snippet of the XML format that specifies the description of an SDF graph is given in Fig 3.7.

Figure 3.7: Snippet of XML specification of SDF graph

The XML of the SDF graph declares each actor of the application and defines its parameters, such as execution time, port names and directions, and the token rate at each edge. The edges that represent the communication channels are likewise described in the XML file.

To analyze the application throughput when multiple clocking of the NoC is enabled, the method to be followed differs slightly from the normal flow. In the normal flow, it is assumed that the NoC and all the processors execute at the same clock frequency, and the execution times of all actors are specified in clock cycles of this common frequency; the throughput of the graph can then be calculated for that single common clock frequency.
To model the multiple clocking capability of the NoC, the methodology for calculating the throughput of the application graph has to be modified. With a multiple-clock NoC there are two cases:

a) The transmission clock TXCLK is an integer multiple of the system clock SYSCLK, i.e. TXCLK = N × SYSCLK, where N is an integer. In this case, multiply the execution times of all actors in the SYSCLK domain by N and then calculate the period of the graph with SDF3. If Ps is the period calculated by SDF3 and Pc the actual period of the graph, then Pc = Ps / N. For example, consider TXCLK = 200 MHz and SYSCLK = 50 MHz in a multicore system whose cores are connected by the NoC. The execution times of all actors running in the SYSCLK domain are multiplied by N = 200/50 = 4, so the period obtained is in terms of TXCLK cycles. To get the period of the application in terms of SYSCLK, divide the period predicted by SDF3 by 4.

b) The transmission clock TXCLK is a fractional multiple of SYSCLK. In this case, the easiest method is to find a clock frequency Fc that is a common multiple (LCM) of SYSCLK and TXCLK. If Fc = l × TXCLK and Fc = m × SYSCLK, then the execution times of the actors in the SYSCLK domain are multiplied by m and those in the TXCLK domain by l. For example, consider SYSCLK = 66 MHz and TXCLK = 99 MHz, so that TXCLK = 1.5 × SYSCLK. We can choose Fc = 198 MHz, such that Fc = 3 × SYSCLK and Fc = 2 × TXCLK, and multiply the execution times of all actors in the SYSCLK domain by 3 and those in the TXCLK domain by 2. The period of the application obtained is then in terms of Fc cycles; to obtain it in terms of SYSCLK cycles, simply divide the period predicted by SDF3 by 3.

The same method can be used to analyze application graphs in which different actors are executed by processors running at different but synchronous clock speeds. It is assumed in the above cases that the different clocks of the NoC are not asynchronous to one another; throughput analysis with asynchronous clocks is not considered.
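The normalization in both cases reduces to scaling all execution times to a common base frequency before invoking SDF3 and scaling the resulting period back. A small sketch of this bookkeeping, using the 66 MHz / 99 MHz example from the text (actor names are placeholders):

```python
# Scale execution times to a common frequency Fc = lcm(SYSCLK, TXCLK) before
# running SDF3, then divide SDF3's period by m to get SYSCLK cycles.

from math import lcm

def normalize(sysclk_mhz, txclk_mhz, sys_times, tx_times):
    """Returns scaled execution times and the divisor for SDF3's period."""
    fc = lcm(sysclk_mhz, txclk_mhz)             # common base frequency
    m, l = fc // sysclk_mhz, fc // txclk_mhz    # Fc = m*SYSCLK = l*TXCLK
    scaled = {a: t * m for a, t in sys_times.items()}
    scaled.update({a: t * l for a, t in tx_times.items()})
    return scaled, m                            # period_SYSCLK = period_SDF3 / m

# SYSCLK = 66 MHz, TXCLK = 99 MHz: Fc = 198 MHz, m = 3, l = 2.
scaled, m = normalize(66, 99, {"PRODUCER": 110}, {"SRLR": 36})
print(scaled, "-> divide the SDF3 period by", m)
# {'PRODUCER': 330, 'SRLR': 72} -> divide the SDF3 period by 3
```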
3.4 CONCLUSIONS

In this chapter, a brief introduction to SDF graphs was provided. The various features of SDF graphs were described, along with methods for describing the data flow of a streaming application using SDF graphs. Further, SDF3, a tool for the analysis of SDF graphs, was discussed. An SDF model of the SDM-based NoC was developed, together with methods for calculating the throughput of applications running on multicore systems that use the SDM-based NoC as their interconnect.

CHAPTER 4 CASE STUDY

In this chapter we study a simple producer-consumer SDF model in order to evaluate the results of the architectural improvements to the NoC. The SDF analysis was carried out using the SDF3 tool available from [14], and the implementation was carried out on a Memec Virtex-4 FPGA board with a Virtex-4 XC4VLX25 FPGA on it. The NoC had one sending channel and one receiving channel, with 8 sending and receiving wires for transmission and reception, and the data width was chosen to be 32 bits.

The SDF model of the example producer-consumer graph is given in Fig 4.1. The execution time of each actor is very small, while the number of tokens produced, 10000 32-bit words, is very large; this model was chosen so as to stress the communication network between the two processors. The actors PRODUCER and CONSUMER each execute in 110 clock cycles on a MicroBlaze processor, and the auto-concurrency of each actor is one.

Figure 4.1: SDF model of producer-consumer (PRODUCER and CONSUMER, 110 cycles each, exchanging 10000 tokens)

From the SDF3 analysis, the period of this SDF graph is 110 cycles. After incorporating the SDF graph of the NoC, the application graph becomes as shown in Figure 4.2; the execution time of each actor is indicated in parentheses.

Figure 4.2: SDF model of producer-consumer with NoC (PRODUCER/CONSUMER: 110, WA/RA: 7, SREG/RREG: 1, SDD/RDC: 2 + 8 − l, SRLR: 36, ROUTER: 2, DESRLR: 3 cycles; FSL buffers of 256 tokens)

The producer and consumer actors were run on MicroBlaze processors at 50 MHz. The clock frequency of the NoC was varied from 50 MHz to 250 MHz, beyond which the design showed timing errors during synthesis. The results obtained by varying the different parameters of the NoC are given in Table 3. Figure 4.3 shows the variation in iterations per second of the SDF graph as the number of lanes allocated to the communication channel is varied, calculated with the SDF3 tool; the linear increase in throughput with the number of allocated lanes is clearly visible. The maximum throughput obtained for the SDF model was approximately 712 iterations per second in the SDF3 analysis. Figure 4.4 shows the results of actually executing this model on the MicroBlaze system described above. The measured data follows the SDF3 model very closely; the maximum throughput obtained from the actual system was 715 iterations per second.

Figure 4.3: Number of lanes vs. iterations per second on SDF3 (TXCLK from 50 to 250 MHz)

Figure 4.4: Number of lanes vs. iterations per second on FPGA (TXCLK from 50 to 250 MHz)

Comparing the two graphs, the close correlation between the simulation and the measured data is evident. The graphs also show that the bandwidth available to the connection increases linearly with the clock frequency. Figs 4.5 and 4.6 compare the period calculated via the SDF3 analysis with the actual measured data for two different values of TXCLK, keeping SYSCLK constant.

Figure 4.5: Measured data vs. SDF3 for TXCLK = SYSCLK = 50 MHz

Figure 4.6: Measured data vs. SDF3 for SYSCLK = 50 MHz, TXCLK = 250 MHz (TXCLK = 5 × SYSCLK)
Wires   50 MHz              100 MHz             150 MHz             200 MHz             250 MHz
        SDF3    Measured    SDF3    Measured    SDF3    Measured    SDF3    Measured    SDF3    Measured
1       360000  349994      180000  179997      120000  120000      90000   89999       80000   79997
2       180000  174997      90000   90000       70110   70014       70110   70014       70110   70014
3       120000  116665      70110   70014       70110   70014       70110   70014       70110   70014
4       90000   87499       70110   70014       70110   70014       70110   70014       70110   70014
5       70110   70014       70110   70014       70110   70014       70110   70014       70110   70014
6       70110   70014       70110   70014       70110   70014       70110   70014       70110   70014
7       70110   70014       70110   70014       70110   70014       70110   70014       70110   70014
8       70110   70014       70110   70014       70110   70014       70110   70014       70110   70014

Table 3: Predicted period vs. measured period (in clock cycles) of the producer-consumer model, for different TXCLK frequencies and numbers of allocated wires

These results indicate the flexibility we gain in choosing the appropriate number of lanes for a connection. Consider a scenario in which the connection requirements between two nodes are large but the number of wires available between the nodes is limited. In such a case, we can use multiple clocking to increase the bandwidth available per wire to a suitable value, so that more connection requirements can be satisfied using the same number of available wires. In another scenario, a higher clock frequency of the NoC could result in hotspots and unbalanced power dissipation across the chip. One solution is to find alternative routes for the same connection across less congested parts of the NoC, using more wires and a reduced transmission frequency.

CHAPTER 5 CONCLUSIONS AND FUTURE WORK

This chapter describes the major conclusions from this thesis and the major issues that still need to be resolved.

5.1 CONCLUSIONS

The SDM-based NoC described in Chapter 2 offers a very promising guaranteed-throughput interconnection mechanism for MPSoCs. The area overhead associated with storing switching tables in the routers, as in TDM-based NoCs, has been eliminated in this NoC. The existing NoC has been modified to provide multiple clocking support, as described in Chapter 2. The flow control incorporated into the NoC has made it more robust and has reduced the buffer requirements at the receive side needed to ensure that data packets are not lost. The serializer and deserializer have been further optimized in terms of resources, consuming 12.5% fewer slices than before. The multiple clocking capability has improved the transmission bandwidth and flexibility of the SDM-based NoC: a slower IP can be connected to the NoC while data transmission occurs at much higher speeds. This feature has also added another dimension along which the NoC can be configured for particular scenarios. The SDF model developed for the NoC has been found to be accurate in predicting application performance on a multicore system based on the SDM NoC. A method to adapt the throughput analysis to the multiple-clock feature of the NoC was described in Chapter 3, and a case study using a producer-consumer model for evaluating the SDF model was presented in Chapter 4. The experiments show a linear increase in the bandwidth of a connection with both the number of lanes and the transmission clock frequency.

5.2 FUTURE WORK

A few improvements that could be made to the current SDM-based NoC are listed here.
a) Area optimization of the network interface: The network interface of the SDM NoC is very complex and takes up a large amount of resources. As described in Chapter 2, the data distributor takes up nearly 10% of the NI area. The current structure of the data distributor is very complex and involves routing a large number of wires: there is a data distributor for each sending channel, and each data distributor has a separate data bus to each of the serializers. This structure can be optimized to achieve significant resource savings. A new architecture for the NI is proposed in Figure 5.1. Instead of a separate data distributor for each channel, as in the present design, an N:1 switch over the N incoming channels is used. The output of the switch connects to the inputs of the serializers of all the individual outgoing wires, and the controller communicates with each serializer using the existing two-way handshaking mechanism. The area savings of this architecture look promising; resource savings of up to 50% in terms of slices used are expected. Work on this architecture is still in progress.

Figure 5.1: Proposed area-efficient NI architecture (an N:1 switch and controller between the incoming channels and the per-wire serializers)

b) Fault tolerance for the NoC: As more and more nodes are incorporated into the chip, the size of the NoC that connects them also increases, and with it the chance that a component of the NoC fails. Adding fault tolerance to the NoC can help make the communication system more robust.

BIBLIOGRAPHY

[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics Magazine, pp. 114-117, 1965.
[2] W. Wolf, "Embedded computer architectures in the MPSoC age," in WCAE '05: Proceedings of the 2005 Workshop on Computer Architecture Education, 2005.
[3] J. A. de Oliveira and H. van Antwerpen, "The Philips Nexperia digital video platforms," in Winning the SoC Revolution, Kluwer Academic Publishers, 2003.
[4] P. Cumming, "The TI OMAP platform approach to SoC," in Winning the SoC Revolution, Kluwer Academic Publishers, 2003.
[5] G. Martin, "Overview of the MPSoC design challenge," in Proceedings of the 43rd ACM/IEEE Design Automation Conference, pp. 274-279, September 2006.
[6] L. Benini and G. De Micheli, "Network-on-chips: A new paradigm for systems on chip design," in Proceedings of Design, Automation and Test in Europe (DATE), pp. 418-419, 2002.
[7] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "QNoC: QoS architecture and design process for network on chip," Journal of Systems Architecture, vol. 50, pp. 105-128, 2004.
[8] T. Bjerregaard and J. Sparso, "A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip," in Proceedings of Design, Automation and Test in Europe (DATE '05), pp. 1226-1231, 2005.
[9] K. Goossens, J. Dielissen, and A. Radulescu, "Æthereal network on chip: concepts, architectures, and implementations," IEEE Design & Test of Computers, vol. 22, pp. 414-421, 2005.
[10] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, "Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip," in Proceedings of Design, Automation and Test in Europe (DATE), pp. 890-895, 2004.
[11] A. Leroy, D. Milojevic, D. Verkest, F. Robert, and F. Catthoor, "Concepts and implementation of spatial division multiplexing for guaranteed throughput in networks-on-chip," IEEE Transactions on Computers, vol. 57, pp. 1182-1195, 2008.
[12] Joseph, Yang Zhiyao, "An area efficient dynamically reconfigurable spatial division multiplexing network on chip with static throughput guarantee," FYP Thesis, May 2010.
[13] A. Kumar, "Analysis, Design and Management of Multimedia Multiprocessor Systems," PhD Thesis, April 2010.
[14] SDF3, a tool for SDF graph analysis, available for download at http://www.es.ele.tue.nl/sdf3/
[15] S. Stuijk, "Predictable Mapping of Streaming Applications on Multiprocessors," PhD Thesis, October 2007.