JOURNAL OF ELECTRONIC SCIENCE AND TECHNOLOGY OF CHINA, VOL. 7, NO. 2, JUNE 2009

Delay Optimized Architecture for On-Chip Communication

Sheraz Anjum, Jie Chen, Pei-Pei Yue, and Jian Liu

Abstract: Networks-on-chip (NoC), a new system on chip (SoC) paradigm, has become a major focus of research during the last few years. Among the NoC architectures proposed so far, 2D-Mesh has proved to be the best architecture for implementation due to its regular and simple interconnection structure. In this paper, we propose a new interconnect architecture called 2D-diagonal mesh (2DDgl-Mesh) for on-chip communication. The 2DDgl-Mesh is similar to the traditional 2D-Mesh in cost, area, and implementation, but it can outperform the latter in delay. Both architectures are compared by using NS-2 (a network simulator) and CINSIM (a component based interconnection simulator) under the same traffic models and parametric conditions. The results of the comparison show that under the proposed architecture, packets can almost always be routed to their destinations in less time. In addition, our architecture can sometimes achieve a lower drop ratio than 2D-Mesh for special fixed traffic models.

Index Terms: 2D-Mesh, networks-on-chip, network simulator 2, traffic models, system on chip.

Manuscript received February 27, 2008; revised April 20, 2008. This work was supported by the National Natural Science Foundation of China under Grant No. 60425413 and COMSATS Institute of Information Technology, Pakistan.
S. Anjum is with COMSATS Institute of Information Technology, Pakistan and working as Assistant Professor at the Department of Electrical Engineering, CIIT, Quaid Avenue, Wah Cantt Campus, Pakistan (e-mail: sheraz1976@hotmail.com).
J. Chen, P.-P. Yue, and J. Liu are with Institute of Microelectronics, Chinese Academy of Sciences, Beijing, 100029, China (e-mail: jchen@ime.ac.cn, yuepeipei@ime.ac.cn, and liujian04@mails.gucas.ac.cn).

1. Introduction

According to the International Technology Roadmap for Semiconductors (ITRS)[1], billions of transistors can be fabricated on a single chip by using 45 nm process technology by the end of this decade. Current system on chip (SoC) design methodologies are not scaling well with the advancement of process technologies. The use of buses in today’s SoCs for interconnecting heterogeneous resources is becoming a bottleneck due to contention and congestion issues. Moreover, long wire delays, limited reusability, low modularity, and poor scalability add to the problems of current bus-based SoCs. Consequently, more modular and scalable design methodologies[2]-[4] have been proposed, known as network on chip (NoC), a new SoC paradigm. The use of the globally asynchronous locally synchronous concept in NoCs decouples the design of resources from the rest of the network. Its use could enhance the scalability, modularity, and reusability of IP. The design and selection of appropriate architectures for on-chip communication play a key role in the design and implementation of the complete platform for NoC.

Different on-chip interconnect architectures were proposed, evaluated, or analyzed in [5]-[8]. Among all the NoC architectures presented until now, 2D-Mesh has proved to be the best architecture in terms of implementation due to its regular and simple interconnection structure. The 2D-Mesh architecture is also more compatible with ultra deep submicron fabrication technologies. In this paper, we propose a new interconnect architecture, named 2D-diagonal mesh (2DDgl-Mesh), which is similar to the traditional 2D-Mesh but can perform better than the latter in delay and sometimes in drop ratio too.

2. Related Work

Many research teams have focused on the architectural aspects of NoCs. Kumar et al. introduced a new methodology for designing mesh architecture for NoC[3]. Karim et al.
proposed a novel communication network architecture for 8-CPU distributed-memory systems that has the potential to deliver the throughput required in next generation routers[5]. Vahdatpour et al. proposed a new network on chip architecture called hierarchical graph[6], where NS-2 (network simulator 2) was applied for the simulation and analysis of their proposed architecture. In [7] Pande et al. developed a consistent and meaningful evaluation methodology to compare the performance and characteristics of a variety of NoC architectures. Hossain et al. introduced the extended butterfly FAT tree interconnection (EFTI) and provided a routing algorithm for EFTI and its comparative analysis through simulation results[8]. In [9] Sun et al. constructed a proto-model using the public domain network simulator NS-2[10] and evaluated design options for a specific NoC architecture which has a two dimensional mesh of switches.

3. Architecture of 2DDgl-Mesh

Our proposed architecture, named 2DDgl-Mesh, and the traditional 2D-Mesh are shown in Fig. 1 and Fig. 2, respectively. It is evident from the two figures that our architecture has been derived by introducing two diagonal links in the traditional 2D-Mesh architecture. In Fig. 1 and Fig. 2, ‘Rt’ represents a router and ‘r’ represents a heterogeneous resource.

Fig. 1. 2DDgl-Mesh architecture.
Fig. 2. 2D-Mesh architecture.
Fig. 3. Worst case delay comparison.
Fig. 4. Discrete components of 2DDgl-Mesh: (a) and (b) are quarter cross components and (c) to (f) are half cross components.

Let $N = X^2$ be the total number of resources required to be interconnected on-chip. The architectures in Fig. 1 and Fig. 2 are shown for the specific value X = 5. Both architectures are scalable and can accommodate more resources N as X increases.
Let $L$ be the total number of links between the routers and $L_{Dgl}$ the total number of diagonal links that have been introduced in the proposed architecture; then $L_{Dgl} = 2(X-1)$. Therefore, the proposed architecture, 2DDgl-Mesh, is derived by adding $L_{Dgl}$ links to the traditional 2D-Mesh and approximately $2L_{Dgl}$ ports to the routers that fall on the diagonal links. The added diagonal links help reduce the average delay of packets routed between a source/destination pair. The worst case delay is associated with resources located on opposite ends of the chip. For a sender/receiver pair situated on opposite ends of the chip, the number of hops using 2D-Mesh is $2(X-1)$, whereas using 2DDgl-Mesh it is $(X-1)$, as indicated in Fig. 3. Therefore, 2DDgl-Mesh needs half as many hops as the traditional 2D-Mesh architecture in the worst case, which makes it about two times faster in terms of the worst case delay.

3.1 Scalability

The proposed 2DDgl-Mesh is a scalable architecture and can therefore accommodate as many resources as required for the implementation of larger NoCs without performance degradation. The architecture shown in Fig. 1 has only 25 nodes, and the discrete components from which it is built are shown in Fig. 4. Fig. 4 (a) and (b) are the basic discrete components, known as quarter cross components, and Fig. 4 (c) to (f) are known as half cross components. Components (c) and (d) are derived by overlapping (b) on the right side of (a) and (a) on the top side of (b), respectively. Similarly, (e) and (f) are also derived from (a) and (b). In the same way, the architecture of 2DDgl-Mesh is derived by overlapping (d) on the left side of (e) or overlapping (c) on the top side of (f). The scaled architecture of 2DDgl-Mesh for N = 81 is shown in Fig. 5. The architecture of Fig. 5 is derived by overlapping the mirror sides of four 2DDgl-Meshes.
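The link and hop counts stated for the 25-node case can be checked with a short sketch. It is only an illustration under stated assumptions: the diagonal links are assumed to run along the two main diagonals of the router grid (inferred from $L_{Dgl} = 2(X-1)$, since Fig. 1 is not reproduced here), routers are identified by hypothetical (row, column) tuples, and hop counts are found by breadth-first search rather than by any method named by the authors.

```python
from collections import deque


def build_links(x, diagonal=False):
    """Return the set of undirected links of an x-by-x router grid.

    Routers are (row, col) tuples. With diagonal=True, links along the two
    main diagonals are added (assumed placement, not taken from Fig. 1).
    """
    links = set()
    for r in range(x):
        for c in range(x):
            if c + 1 < x:
                links.add(frozenset({(r, c), (r, c + 1)}))  # horizontal link
            if r + 1 < x:
                links.add(frozenset({(r, c), (r + 1, c)}))  # vertical link
    if diagonal:
        for i in range(x - 1):
            links.add(frozenset({(i, i), (i + 1, i + 1)}))              # main diagonal
            links.add(frozenset({(i, x - 1 - i), (i + 1, x - 2 - i)}))  # anti-diagonal
    return links


def adjacency(links):
    """Turn a set of links into a router -> set-of-neighbors mapping."""
    adj = {}
    for link in links:
        a, b = tuple(link)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def hops_from(adj, src):
    """Breadth-first search: minimum hop count from src to every router."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def average_hops(adj):
    """Mean hop count over all ordered source/destination pairs."""
    total = pairs = 0
    for src in adj:
        for dst, d in hops_from(adj, src).items():
            if dst != src:
                total += d
                pairs += 1
    return total / pairs


X = 5
mesh = build_links(X)
dgl = build_links(X, diagonal=True)
mesh_adj, dgl_adj = adjacency(mesh), adjacency(dgl)
corner_a, corner_b = (0, 0), (X - 1, X - 1)

extra_links = len(dgl) - len(mesh)  # corresponds to LDgl
extra_ports = 2 * extra_links       # one port at each end of an added link
mesh_corner_hops = hops_from(mesh_adj, corner_a)[corner_b]
dgl_corner_hops = hops_from(dgl_adj, corner_a)[corner_b]

print("diagonal links:", extra_links, "extra ports:", extra_ports)
print("corner-to-corner hops:", mesh_corner_hops, "vs", dgl_corner_hops)
print("average hops:", average_hops(mesh_adj), "vs", average_hops(dgl_adj))
```

Under these assumptions and X = 5, the sketch counts 8 added links and 16 added ports, and the corner-to-corner path shrinks from 8 hops to 4, in line with $2(X-1)$ versus $(X-1)$. Hop count is only a proxy for delay; the simulations in this paper also account for bandwidth and queuing.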
In this way, the architecture of 2DDgl-Mesh can scale to any number of nodes/routers. Any number of quarter cross and half cross components can also be used in the same way to add any desired number of nodes.

3.2 Routing

The routing algorithm determines the next hop for a packet in an intermediate router on its way to the destination node or IP. The selection of an appropriate routing algorithm for the NoC architecture plays an important role in the overall performance of the NoC system. As our proposed architecture contains diagonal links in addition to horizontal and vertical ones, the routing algorithm should be able to utilize these diagonal links effectively. After careful analysis, we selected Dijkstra’s shortest path first (SPF) routing algorithm for our NoC architecture. SPF finds a shortest path between a source/destination pair, so the added diagonal links are used whenever they shorten the route. The routers decode the routing information contained in the packet headers in order to route the packets to their destinations.

On the one hand, the addition of diagonal links helps reduce the average packet delay from source to destination; on the other hand, more ports are required on the routers that fall on the diagonal or on the quarter cross or half cross components. The added ports can be considered the router expense that has to be paid to achieve a lower average packet delay. The addition of one 2DDgl-Mesh requires 16 more ports than the traditional 2D-Mesh architecture (consistent with $2L_{Dgl} = 4(X-1)$ for X = 5). Similarly, the addition of one half cross or one quarter cross component requires 8 or 4 more ports, respectively, on the corresponding routers of 2DDgl-Mesh as compared with the traditional 2D-Mesh architecture.

Fig. 5. 2DDgl-Mesh for N = 81 resources.

4. Simulation Environment

Due to the lack of tools available for NoC simulation, many researchers, as in [6] and [9], have used NS-2 for the simulation of NoC architectures and algorithms. In view of the facilities and documentation[10],[11] available for NS-2, we also apply NS-2 for the simulation and comparison of the proposed 2DDgl-Mesh with the traditional 2D-Mesh architecture under the same parametric conditions. All simulation parameters are set as described in Section 4 of [6], including N = 25 resource nodes, exponential traffic sources for senders, a sender/receiver pair associated with each resource node, and normalized values of bandwidth and delay. In order to cover the traffic behavior of a large set of applications, three different source/destination selection models are used; they are detailed in Section 5.

4.1 Performance Metrics

Two performance metrics, average delay and drop ratio of packets, are used to compare the efficiency of both architectures. Let $D_a$ represent the average delay, $DR$ the drop ratio, $P$ the total number of packets generated in one simulation, $DL_i$ the end to end delay of packet $i$, and $P_D$ the number of packets dropped; then we have

$$D_a = \frac{1}{P} \sum_{i=0}^{P} DL_i \qquad (1)$$

$$DR = \frac{P_D}{P} \qquad (2)$$

5. Results of Comparison

The comparison of both architectures is performed by using three different traffic models. In the following subsections we briefly discuss these three traffic models and the results obtained for the architectures under consideration.

5.1 2D and 2DDgl Traffic Model

Fig. 6 gives the details of the model. The major difference between the 2D and 2DDgl traffic models is that neighbors and non-neighbors are selected according to the 2D-Mesh and 2DDgl-Mesh architectures, respectively; i.e., a resource at a corner has only two neighbors in the 2D traffic model but three neighbors in the 2DDgl traffic model. In both models, Range 1 is set between 0 and 1.
If Range 1 is set to 0, the algorithm will always select a random non-neighbor of Source i; if Range 1 is set to 1, a random neighbor of Source i will always be selected. Intermediate values of Range 1 change the probability of selecting between neighbors and non-neighbors of Source i. Fig. 7 and Fig. 9 show the average packet delay vs. traffic rate comparison for both architectures using the 2D and 2DDgl traffic models. It is evident from the graphs that 2DDgl-Mesh almost always has less average delay than 2D-Mesh. Similarly, Fig. 8 and Fig. 10 show that the two architectures have almost the same drop ratio.

Fig. 6. 2D and 2DDgl traffic selection model. (Flowchart: start with Source = Si; generate a random number Rd1 between 0 and 1; if Rd1 < Range 1, generate List 1 containing the n neighbors of Resource i that are at Hop 1 from i and select a random neighbor Di from List 1; otherwise generate List 2 containing the n non-neighbors of Resource i that are at more than 1 hop from i and select a random non-neighbor Di from List 2; set Destination = Di and proceed to the next source. Note: Si and Di are integers ranging between 0 and N−1.)
Fig. 7. Delay vs. traffic rate using 2D random traffic model. (Axes: average delay (μs) vs. traffic (Mb/s); curves: 2D-Mesh and 2DDgl-Mesh.)
Fig. 8. Drop vs. traffic rate using 2D random traffic model. (Axes: drop ratio vs. traffic (Mb/s); curves: 2D-Mesh and 2DDgl-Mesh.)
Fig. 9. Delay vs. traffic using 2DDgl 75% local traffic model. (Axes: average delay (μs) vs. traffic (Mb/s); curves: 2D-Mesh and 2DDgl-Mesh.)

5.2 Special Fixed Traffic Model

In this model, the destination is always selected according to the following equation:

$$D_i = (N - 1) - S_i \qquad (3)$$

where $D_i$ and $S_i$ are the ith destination and the ith source, respectively, and $N$ is the total number of resources used.
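Both destination-selection rules described so far, the Range 1 neighbor/non-neighbor choice of Section 5.1 and Eq. (3), can be sketched in a few lines. This is a minimal illustration, not the authors' NS-2 script: resources are assumed to be numbered row by row on the X-by-X grid, the diagonal links are assumed to lie on the two main diagonals, and the function names are hypothetical.

```python
import random


def mesh_neighbors(src, x, diagonal=False):
    """Resources one hop from src in an x-by-x grid numbered row by row.

    diagonal=True adds the links along the two main diagonals
    (assumed placement of the 2DDgl-Mesh diagonal links).
    """
    r, c = divmod(src, x)
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        if r == c:                 # router lies on the main diagonal
            steps += [(-1, -1), (1, 1)]
        if r + c == x - 1:         # router lies on the anti-diagonal
            steps += [(-1, 1), (1, -1)]
    result = []
    for dr, dc in steps:
        nr, nc = r + dr, c + dc
        if 0 <= nr < x and 0 <= nc < x:
            result.append(nr * x + nc)
    return result


def select_destination(src, x, range1, diagonal=False, rng=random):
    """Fig. 6 style choice: a neighbor if Rd1 < Range 1, else a non-neighbor.

    Assumes x >= 3 so that both candidate lists are non-empty.
    """
    neighbors = mesh_neighbors(src, x, diagonal)
    non_neighbors = [d for d in range(x * x) if d != src and d not in neighbors]
    rd1 = rng.random()  # uniform in [0, 1)
    pool = neighbors if rd1 < range1 else non_neighbors
    return rng.choice(pool)


def fixed_destination(src, n):
    """Eq. (3): mirror the source index across the resource numbering."""
    return (n - 1) - src


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(select_destination(0, 5, 0.75, diagonal=True, rng=rng))
print([fixed_destination(s, 25) for s in (0, 4, 12)])
```

With Range 1 = 1 the comparison `rd1 < range1` always holds and with Range 1 = 0 it never does, which matches the two limiting cases in the text. A Range 1 of 0.75 is used here on the assumption that it corresponds to the "75% local" label of Fig. 9 and Fig. 10. Note also that for odd N, Eq. (3) maps the center resource (index 12 for N = 25) to itself.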
This model is specifically designed to check the worst case delay behavior of both architectures. The equation always selects the source/destination pair from opposite sides of the chip, as a mirror image. For instance, if N = 25 and Si = 0, then Di = 24 (opposite side), and if Si = 4, then Di = 20 (again on the opposite side), etc. Fig. 11 shows that 2DDgl-Mesh has even less average delay than 2D-Mesh under the special fixed traffic selection model. Similarly, Fig. 12 shows that our proposed architecture has a lower drop ratio under this type of fixed traffic model.

Fig. 10. Drop vs. traffic rate using 2DDgl 75% local traffic model. (Axes: drop ratio vs. traffic (Mb/s); curves: 2D-Mesh and 2DDgl-Mesh.)
Fig. 11. Delay vs. traffic rate using special fixed traffic model. (Axes: average delay (μs) vs. traffic (Mb/s); curves: 2D-Mesh and 2DDgl-Mesh.)
Fig. 12. Drop vs. traffic rate using special fixed traffic model. (Axes: drop ratio vs. traffic (Mb/s); curves: 2D-Mesh and 2DDgl-Mesh.)

5.3 Random Traffic Model

Fig. 13. Average delay vs. simulation runs using random traffic. (Axes: average delay (μs) vs. simulation runs; curves: 2D-Mesh and 2DDgl-Mesh.)

To simulate IPs or resources of different sizes, we implement diverse kinds of traffic distributions, namely geometric, periodic, and Pareto distributions, on different nodes along with random traffic generation. The geometric distribution is implemented on resources r0 to r4 and r20 to r24 (see Fig. 1). The periodic distribution is implemented on resources r5 to r9 and r15 to r19, and the Pareto distribution is implemented on resources r10 to r14. In this way, the 2D-Mesh and 2DDgl-Mesh architectures behave as if they had nodes of different sizes.
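The per-resource assignment of distributions can be written down as a small lookup, shown here only as a sketch. The resource ranges come from the text; everything else is an assumption made for illustration: the text does not say which quantity follows each distribution, so the sketch treats it as the gap between packets, and the names `distribution_for` and `next_gap`, the mean gap of 10 time units, and the Pareto shape of 2.0 are all hypothetical.

```python
import random


def distribution_for(resource):
    """Traffic distribution assigned to resource r0..r24, as listed in the text."""
    if 0 <= resource <= 4 or 20 <= resource <= 24:
        return "geometric"
    if 5 <= resource <= 9 or 15 <= resource <= 19:
        return "periodic"
    if 10 <= resource <= 14:
        return "pareto"
    raise ValueError("resource index must be in 0..24")


def next_gap(resource, rng, mean_gap=10.0):
    """Draw one inter-packet gap (hypothetical time units) for a resource."""
    kind = distribution_for(resource)
    if kind == "periodic":
        return mean_gap                      # constant spacing
    if kind == "pareto":
        # paretovariate(2.0) has mean 2, so scale it to the requested mean
        return rng.paretovariate(2.0) * mean_gap / 2.0
    # geometric: slots until the first success with probability 1 / mean_gap
    p = 1.0 / mean_gap
    gap = 1
    while rng.random() >= p:
        gap += 1
    return float(gap)


rng = random.Random(0)  # seeded only to make the sketch repeatable
for r in (0, 7, 12, 17, 24):
    print(r, distribution_for(r), next_gap(r, rng))
```

In an actual CINSIM or NS-2 setup, the simulator's own traffic source configuration would replace these hand-written samplers; the sketch only makes the row-wise pattern (outer rows geometric, next rows periodic, middle row Pareto) explicit.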
The component based interconnection network simulator (CINSIM)[12] is used to simulate the scenario of different node sizes described above under the same parametric conditions and the random traffic model. CINSIM can simulate both the steady state and the transient behavior of any interconnection network. Therefore, the behavior of both architectures is analyzed for the steady state and for the first 50 clock cycles. The steady state simulation is run ten times, and each run adds approximately 10% traffic load to the previous one. Fig. 13 shows the estimated average delay of packets vs. simulation runs. It is clear from Fig. 13 that 2DDgl-Mesh has less average delay than 2D-Mesh for every simulation run. Also, the transient behavior of 2DDgl-Mesh is superior to that of 2D-Mesh in terms of average delay for the first 50 clock cycles, as shown in Fig. 14.

Fig. 14. Average delay vs. clock cycle using random traffic. (Axes: average delay (μs) vs. clock cycle; curves: 2D-Mesh and 2DDgl-Mesh.)

6. Conclusions

In this paper, we have proposed a new interconnect architecture for on-chip communication named 2DDgl-Mesh. The architecture is similar to the traditional 2D-Mesh in terms of cost, area, and implementation issues, but it can achieve a lower delay than 2D-Mesh. We used NS-2 and CINSIM to simulate both architectures under the same traffic models and parametric conditions. The results of the comparison using different traffic selection models show that the delay of 2DDgl-Mesh is almost always less than that of the 2D-Mesh architecture. In addition, 2DDgl-Mesh can also have a lower drop ratio under the fixed traffic model. The improvement in delay or drop ratio is achieved by adding a few diagonal links and ports to the routers that fall on the diagonals of the meshes. Therefore, we suggest that the 2DDgl-Mesh architecture can be used instead of 2D-Mesh for on-chip communication without much overhead.

References
[1] International Technology Roadmap for Semiconductors, 2004 ed., Semiconductor Industry Association, World Semiconductor Council, 2004.
[2] L. Benini and G. De Micheli, “Networks on chip: a new SoC paradigm,” IEEE Computer, vol. 35, no. 1, pp. 70-78, 2002.
[3] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä, and A. Hemani, “A network on chip architecture and design methodology,” in Proc. of IEEE Computer Society Annual Symposium on VLSI, Pittsburgh, Pennsylvania, USA, 2002, pp. 117-124.
[4] A. Jantsch and H. Tenhunen, Networks on Chip, Stockholm, NY: Kluwer Academic Pub., 2003, pp. 85-106.
[5] F. Karim, A. Nguyen, and S. Dey, “On-chip communication architecture for OC-768 network processors,” in Proc. 2001 DAC Conf., Las Vegas, 2001, pp. 678-683.
[6] A. Vahdatpour, A. Tavakoli, and M. H. Falaki, “Hierarchical graph: a new cost effective architecture for network on chip,” in Proc. 2005 EUC (IFIP) Conf., Nagasaki, Japan, 2005, pp. 311-320.
[7] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, “Performance evaluation and design trade-offs for network-on-chip interconnect architectures,” IEEE Trans. Computers, vol. 54, no. 8, pp. 1025-1040, 2005.
[8] H. Hossain, M. M. Akbar, and M. M. Islam, “Extended-butterfly FAT tree interconnection (EFTI) architecture for NoC,” in Proc. IEEE Pacific Rim Conf. on Communication, Computers and Signal Processing, Victoria, B.C., Canada, 2005, pp. 613-616.
[9] Y. R. Sun, S. Kumar, and A. Jantsch, “Simulation and evaluation for a network on chip architecture using NS-2,” presented at The 20th NORCHIP conference, Copenhagen, Denmark, November 11-12, 2002.
[10] The ns manual, The VINT Project, UC Berkeley, LBL, USC/ISI and Xerox PARC, [Online]. http://www.isi.edu/nsnam/ns/ns-documentation.html.
[11] NS Simulator for Beginners, Lecture Notes, Univ. de Los Andes, Merida, Venezuela and ESSI, Sophia-Antipolis, France, [Online]. http://www-sop.inria.fr/mistral/personnel/Eitan.Altman/COURS-NS/n3.pdf.
[12] A. Walter, M. Kühm, D. Tutsch, D. Lüdtke, and C. Zimmermann, CINSim Handbook: Installation and User's Guide, [Online]. http://dontcry.cs.tu-berlin.de/cinsim/docbook/html/handbook.html.

Sheraz Anjum was born in Okara, Pakistan in 1976. He graduated in electronics from Quaid-i-Azam University, Islamabad, Pakistan, in 1999, and received the M.S. degree in computer engineering from University of Engineering and Technology, Taxila, Pakistan, in 2005 and the Ph.D. degree in microelectronics and solid-state electronics from Institute of Micro-Electronics, Chinese Academy of Sciences, Beijing, China, in 2008. Currently he is working as an assistant professor with the Department of Electrical Engineering, COMSATS Wah Cantt Campus, Pakistan. His research interests include design of advanced DSP architectures, multi-processor SoC, and network-on-chip.

Jie Chen received his M.S. and Ph.D. degrees in electrical engineering from the University of Electro-Communications, Tokyo, Japan in 1991 and 1994, respectively. He is presently a director professor with Institute of Microelectronics, and a professor with the Graduate School of Chinese Academy of Sciences. Before joining Chinese Academy of Sciences in 2001, he was an associate professor with the Graduate School of Information Systems, UEC, a research project leader of Advanced IC Development Center, YOZAN Inc., Tokyo, Japan, from 1995 to 1997, and a research associate with UEC from 1994 to 1995. He received the fund for 100 Talent-Scientists of Chinese Academy of Science in 2001, and the Chinese Prime Minister Fund for Distinguished Chinese Young Scholars in 2004. His current research interests include SOC design for wireless communications and multimedia signal processing.

Pei-Pei Yue received the B.E. degree in electronic information science & engineering from Shandong University, China, in 2003. She is now a Ph.D. candidate with Institute of Microelectronics, Chinese Academy of Sciences. Her research interests include architecture design of networks-on-chip.

Jian Liu received the B.E. degree from the Department of Information Science and Electronic Engineering, Zhejiang University, in 2004. He is currently pursuing the Ph.D. degree with Institute of Microelectronics, Chinese Academy of Sciences. His research interests include modelling and simulation of NoCs.