www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13696-13703

A 16-Core Processor with Shared-Memory and Message-Passing Communications

Shaik Mahmed Basha, 2nd Year M.Tech VLSI, Department of ECE, Audisankara Institute Of Technology, Gudur.
G. Nageswararao (Ph.D.), Associate Professor, Department of ECE, Audisankara Institute Of Technology, Gudur.

Abstract—A 16-core processor with both message-passing and shared-memory inter-core communication mechanisms is implemented in 65 nm CMOS. Message-passing communication is enabled by a 3×6 mesh packet-switched network-on-chip, and shared-memory communication is supported by the shared memory within each cluster. The processor occupies 9.1 mm² and is fully functional at a clock rate of 750 MHz at 1.2 V, with a maximum of 800 MHz at 1.3 V. Each core dissipates 34 mW under typical conditions at 750 MHz and 1.2 V while executing embedded applications such as an LDPC decoder, a 3780-point FFT module, an H.264 decoder and an LTE channel estimator.

Index Terms—Chip multiprocessor, cluster-based, FFT, H.264 decoder, inter-core communication, inter-core synchronization, LDPC decoder, LTE channel estimator, message-passing, multi-core, network-on-chip, NoC, shared-memory, SIMD.

I. INTRODUCTION

Power budgets of embedded processors are under greater pressure than ever, driven by the massive deployment of mobile computing devices, and advances in communication and multimedia applications exacerbate the situation further. Chip multiprocessors have emerged as a promising solution, and many efforts aim to increase parallelism and optimize the memory hierarchy concurrently in order to meet stringent power budgets while still enhancing performance [7]–[9]. However, even when performance and power are rebalanced, multi-core architectures introduce new challenges in inter-core communication, which quickly becomes the key to further performance improvement. In particular, the efficiency of inter-core communication has a direct impact on the performance and power metrics of embedded processors. When an application is mapped onto a multicore system, it is usually pipelined, and the throughput depends on both the computing capability of the cores and the communication efficiency between them. Although various performance-enhancing techniques exist, such as super-scalar execution and Very Long Instruction Word (VLIW) architectures, mature solutions for inter-core communication are still lacking. Hence, the next stage of research should focus on improving the efficiency of inter-core communication. Historically, shared-memory communication has been the most widely used mechanism because of its simple programming model [3]–[5], [15]. However, it fails to provide sufficient scalability to accommodate increasing core counts. Multicore processor designers have therefore turned to the message-passing communication mechanism, which offers better scalability and greater potential for next-generation embedded multicore processors [6]–[9]. In this paper, we summarize the key features of shared-memory and message-passing communications.
We show that different inter-core communication methods suit different scenarios, which implies that higher performance and power efficiency can be obtained by integrating both inter-core communication mechanisms. We propose a 16-core processor adopting a hybrid inter-core communication scheme with both shared-memory and message-passing communications. A 2D Mesh Network-on-Chip (NoC) is adopted to support message-passing communications, while a cluster-based memory hierarchy including shared memory enables shared-memory communications. We also propose a hardware-aided mailbox inter-core synchronization method to support inter-core communications, and a new memory hierarchy to achieve higher energy efficiency. A prototype 16-core processor chip has been fabricated in a TSMC 65 nm Low Power (LP) CMOS process and is fully functional.

This paper is organized as follows. Section II describes the key features of the 16-core processor and related work. Section III details its design and implementation. Section IV presents the simulation results and chip measurements. Section V concludes the paper.

Fig.1: Motivation: Reduce the efficiency gap while maintaining the flexibility.

II. MOTIVATION AND KEY FEATURES

The primary motivation of our work is to improve the performance and power efficiency of embedded multicore processors while still maintaining flexibility, in other words, to reduce the efficiency gap between multicore processors and ASICs as shown in Fig. 1. Several key features are implemented, which are detailed in the following subsections.

A. Chip Multiprocessor and Inter-Core Communications

Inter-core communications are becoming increasingly important in the era of chip multiprocessors. As multicore became the mainstream architecture, processor designers began to place more significance on inter-core communications, since the overall performance of a multiprocessor relies heavily on them [1]. Embedded multicore processors in particular face much more pressure to improve their inter-core communication efficiency. First, power and cost budgets limit the possibility of integrating processor cores with strong standalone computing capability, which forces designers to extract extra performance from inter-core cooperation. Second, embedded applications usually have a stream-processing characteristic: the data stream flows through several processor cores until the results are produced, which creates the challenge of moving large amounts of data across different cores, so the throughput is highly dependent on the efficiency of inter-core communications. Finally, the core count of embedded multicore processors keeps growing, and more cores make efficient inter-core communication even harder to achieve. Accordingly, inter-core communication efficiency plays a critical role in embedded multiprocessors.

B. Hybrid Inter-Core Communication Mechanism

Several inter-core communication schemes have been proposed in previous work. For traditional Symmetric Multi-Processing (SMP) processors, shared cache or memory units support shared-memory inter-core communications. Typical examples are MRTP [3], Hydra [4], UltraSPARC T1 [5] and Cortex-A9 [15]. Although simple to program, shared-memory communication faces several challenges that limit its use in future manycore processors.
First, its low scalability does not allow many more cores: even with only 8 cores, the interconnect consumes power equivalent to one core and occupies area equivalent to three cores [2]. Second, cache coherence issues are extremely complex, resulting in large hardware overhead and power consumption. Conversely, the message-passing communication mechanism has attracted a lot of attention recently because of its better scalability. Typical examples are Raw [6], TILE64 [7], the Intel 80-Tile [8] and AsAP [9]. They adopt a NoC as the channel linking a large number of cores, and it is convenient to add or remove cores under a given topology.

Table I: Comparison of Shared-Memory and Message-Passing Inter-Core Communications

Method           | Usage                      | Pro                | Con               | Medium              | Scenario
Shared-memory    | Large, unsplit data blocks | Simple programming | Lower scalability | Shared cache/memory | Computation data flow
Message-passing  | Frequent, scattered data   | Better scalability | Uncertain channel | Network-on-chip     | Control data flow

However, for message-passing, the benefit of strong scalability is undermined by its complex programming model and the difficulty of guaranteeing QoS (Quality of Service). In fact, shared-memory and message-passing inter-core communications are suitable for different scenarios, as Table I shows. For typical multicore embedded applications, the data flows can be classified into two categories: computation data flow and control data flow. Most computation data flows are continuous block transfers, which suit shared-memory communication, while control data flows are usually occasional, scattered data packets, which suit message-passing communication. Since different inter-core communication schemes fit different scenarios, better efficiency can be achieved by integrating shared-memory and message-passing communications; therefore a hybrid on-chip inter-core communication scheme is proposed in this paper. The recently proposed 48-core IA-32 processor employs a 2D Mesh NoC that supports message-passing communication, with a Message Passing Buffer (MPB) used to improve the performance of the message-passing programming model [30]. The pioneering 'Alewife' work from MIT [38] proposed integrating shared-memory and message-passing communication for multi-board supercomputing in the 1990s, and we believe it is now time to enable both mechanisms for on-chip multi-core communication with state-of-the-art NoC techniques.

C. Cluster-Based Memory Hierarchy

The memory hierarchy of a processor has an enormous impact on its overall performance and power consumption, especially in multicore systems. First, with increasing core counts, competition for memory resources among cores grows, resulting in higher access latency. Second, cache coherence issues become too complex to solve within limited hardware and power budgets. Finally, the "Memory Wall" becomes more significant in multicore systems, limiting the growth in the number of cores [10]. Hence it is necessary to optimize the memory hierarchy for multicore processors. Although the traditional SMP hierarchy is widely used in many chip multiprocessors, such as Power4 [31] and the Core i7-940 [16], it still shows low efficiency for most embedded applications.
Some designers have tried to solve this problem with cache-free architectures, such as the 167-processor computational platform [32] with a flatter memory hierarchy, and Imagine [33] with a limited application domain. Others have suggested partitioning the cache into layers with private and shared parts to improve efficiency, such as Merrimac [34] and TRIPS [35]. Memory access operations consume a large proportion of the power budget, and the associated latency also reduces overall performance [11]. Thus our primary goal in memory hierarchy optimization is to avoid frequent memory accesses and to increase data locality.

III. PROCESSOR WITH HYBRID COMMUNICATIONS

The proposed 16-core processor has a 3×6 2D Mesh NoC that links sixteen processor cores (PCore) based on the MIPS 4KE and two memory cores (MCore). A hybrid inter-core communication scheme is employed, supporting both shared-memory and message-passing communications. The shared memory in each MCore enables shared-memory communications within the cluster, and the NoC enables message-passing among all PCores. A cluster-based architecture is employed with two clusters implemented. Each cluster comprises eight PCores and one MCore; the shared memory in the MCore can be accessed by the PCores in the same cluster. Data enters the processor through the input First-In First-Out (FIFO) buffer and exits through the output FIFO. An on-chip Voltage-Controlled Oscillator (VCO) generates the system clock, complemented by static and dynamic clock-gating schemes. An external clock can be selected by a mux. A test mode allows monitoring of internal operation flows. Fig. 2 depicts the architecture overview and key features of the proposed processor [12]. The PCore includes a typical Reduced Instruction Set Computer (RISC) style processor core with a six-stage pipeline, a 2k-word instruction memory, a 1k-word private data memory, a router and interfaces for inter-core communications. The MCore includes an 8k-word shared memory with four banks.

A. Design of Key Modules

1) Processor Core: Fig. 3 shows the architecture of the PCore. Two input FIFOs are implemented to support both core-to-core and core-to-memory inter-core communications. The mailbox is used for inter-core synchronization, which will be detailed later. An arbitrator manages private and shared memory accesses. The processor core has a six-stage pipeline illustrated in Fig. 4. In the IFetch stage, instructions are fetched according to the PC (Program Counter). The Decode stage generates control signals and fetches operands from the register file or the input FIFO. Operations and address calculations are done in the Execution stage by function blocks including the ALU (Arithmetic Logic Unit), Shifter and MDU (Multiplication Division Unit). The Memory stage handles data memory access: a private data memory access typically takes one clock cycle, while a shared memory access requires 2 cycles when no contention occurs. In the Align stage, data is aligned before being written to the register file or the output FIFO in the Write Back stage.

Fig.2: Architecture overview of the proposed 16-core processor.
Fig.3: Architecture overview of the PCore.
Fig.4: The six-stage pipeline of the processor core.
Fig.5: Read and write datapath of the extended register file.
Fig.6: Architecture overview of the MCore.
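To make the cluster organization described above concrete, the short C sketch below decides which communication path is available between two PCores: shared memory inside a cluster, message-passing over the NoC otherwise. The core-numbering convention (PCores 0–7 in one cluster, PCores 8–15 in the other) is an assumption introduced for illustration only; the physical placement of cores and MCores follows Fig. 2.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_PCORES          16
#define PCORES_PER_CLUSTER   8   /* each cluster: 8 PCores + 1 MCore (shared memory) */

/* Hypothetical numbering: cores 0-7 form cluster 0, cores 8-15 form cluster 1. */
static int cluster_of(int core_id) { return core_id / PCORES_PER_CLUSTER; }

/* Shared-memory communication is only possible inside a cluster;
 * any pair of PCores can always fall back to message-passing over the NoC. */
static bool can_use_shared_memory(int src, int dst)
{
    return cluster_of(src) == cluster_of(dst);
}

int main(void)
{
    printf("core 2 -> core 5 : %s\n", can_use_shared_memory(2, 5)  ? "shared memory" : "NoC message");
    printf("core 2 -> core 12: %s\n", can_use_shared_memory(2, 12) ? "shared memory" : "NoC message");
    return 0;
}
```

This mirrors the hybrid scheme of Section II: large block transfers stay inside the cluster, while inter-cluster and control traffic goes over the mesh.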
Fig.7: Architecture of the proposed voltage controlled oscillator.

Data-Level Parallelism (DLP), configurability and simplicity are the three principles underlying the processor design. DLP is enhanced using a Single Instruction Multiple Data (SIMD) Instruction Set Architecture (ISA) supporting three data widths, 8 b, 16 b and 32 b, since the most common data widths in embedded applications are 8 b and 16 b. We reconstruct the datapath (ALU, shifter and MDU) with a configurable data width. Three computing modes are provided to support the SIMD ISA, namely scalar-scalar, vector-vector and scalar-vector SIMD operations, which are detailed in our previous work [13].

It is necessary to increase data locality to reduce power consumption. In the original MIPS 4KE processor, which is also our baseline, the register file has 32 words, limiting data locality. Hence, we extended the register file to 64 words. The processor benefits from the extended register file in three ways. First, more available registers provide more capacity for data used by SIMD instructions, improving performance. Second, data locality is enhanced, resulting in fewer memory accesses and lower power consumption. Third, the extended register file serves as a set of FIFO-mapped ports. As Fig. 5 shows, the FIFO read and write ports are mapped to registers $24 and $25, respectively. A special instruction (regconfig) activates selected parts of the register file and the FIFO ports, as illustrated in Fig. 5, so the FIFO ports can be accessed directly with register-based instructions. Load/store instructions are therefore unnecessary here, reducing the instruction count. The PCore has no cache, which reduces chip area and avoids complex cache coherence issues; this cache-free design targets the low power required by most embedded applications.

2) Memory Core: An MCore includes an 8k-word shared data memory partitioned into four banks, providing a bandwidth of 102.4 Gbps at 800 MHz, together with an input FIFO to receive data from the NoC, as illustrated in Fig. 6. All PCores in the same cluster can access the MCore via direct hardwires with a fixed priority order, which simplifies the arbitration logic, optimizes the critical path, and yields high performance at low cost. On the other hand, software programmers must take this into consideration to avoid live-lock. In theory, live-lock is a possible risk for lower-priority cores; in practice it is rarely observed. From the software perspective, the shared memory is used to transfer data between the different modules of an application that are mapped to different cores, and a processor stalls when the data it needs in the shared memory is not yet ready. Thus even the core with the highest priority does not keep occupying the shared memory, which allows the lower-priority cores to gain access. The latency of accessing the MCore without contention is 2 cycles. However, since a fixed-priority shared memory access scheme is adopted, when several cores access the MCore at the same time, the cores with lower priority are stalled, causing a latency larger than 2 cycles whose exact value depends on the MCore access pattern.
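The fixed-priority access scheme described above can be summarized as a small behavioural model in C: among the PCores of a cluster requesting the shared memory in a given cycle, the lowest-numbered (highest-priority) requester is granted and the rest stall and retry. This is a sketch under an assumed request encoding (one request bit per PCore), not the RTL of the arbiter.

```c
#include <stdint.h>
#include <stdio.h>

#define NO_GRANT (-1)

/* Behavioural model of one fixed-priority grant decision: bit i of `requests`
 * is set when PCore i of the cluster asks for the shared memory this cycle.
 * PCore 0 is assumed to be the top-left (highest-priority) core. */
static int fixed_priority_grant(uint8_t requests)
{
    for (int core = 0; core < 8; core++) {
        if (requests & (1u << core))
            return core;      /* highest-priority requester wins; the others stall */
    }
    return NO_GRANT;          /* no request this cycle */
}

int main(void)
{
    /* cores 3, 5 and 6 request simultaneously: core 3 is granted, 5 and 6 retry */
    printf("granted core: %d\n", fixed_priority_grant(0x68));
    return 0;
}
```

Because stalled cores simply retry, the scheme relies on the software behaviour described above to keep the highest-priority core from monopolizing the shared memory.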
The latency increases dramatically as hardwire length increases, so hardwires are only implemented inside the cluster to avoid long-distance interconnects.

3) VCO & Clock-Gating: An on-chip VCO is integrated to generate the clock; its architecture is shown in Fig. 7. The VCO includes a typical saturated-type ring oscillator, a level-shift module, a duty-cycle correction module and a frequency divider [14]. Test results show that the VCO can generate clocks ranging from 200 MHz to 1.6 GHz. Two clock-gating schemes are implemented, static and dynamic; Fig. 8 shows the two clock-gating domains. The static clock-gating scheme turns off the clock of a PCore excluding its router, so the clocks of unused PCores can be selectively shut down. Static clock-gating is configured manually via the clock-gating signal shown in Fig. 8. The dynamic scheme turns off the clocks of certain components, including the extended register file, the MDU, the private data memory and the shared memory banks, when they are idle. This process is autonomous and self-activated: a dynamic clock-gating management unit shuts down the clock of a module whenever it is idle. Test results show that dynamic clock gating reduces overall power consumption by 28.6% on average when running various applications under the same conditions. It is worth mentioning that no performance overhead occurs, since the static scheme is configured at initialization while the dynamic one is controlled by standard clock-gating cells.

Fig.8: The static and dynamic clock-gating domains in PCore: (a) PCore static clock-gating; (b) PCore dynamic clock-gating.

B. Design of Hybrid Inter-Core Communications

A hybrid inter-core communication mechanism is employed to cater to different communication scenarios, integrating both the shared-memory and message-passing communication schemes. Fig. 9 illustrates the implementation of the hybrid inter-core communications. The shared data memory in the MCore supports shared-memory communication inside the cluster, which is ideal for transferring large blocks of data. Meanwhile, the 2D Mesh NoC supports message-passing communication, which is suitable for transferring frequent, scattered data packets. Moreover, message-passing is more extensible than shared-memory, so the processor is scaled by multiplying the number of clusters while the number of cores within one cluster stays fixed. Adding more cores within one cluster would increase the shared memory access latency and complicate the shared memory arbitration, adding hardware cost and access latency, so it is not practical.

Fig.9: Implementation of the hybrid inter-core communications: (a) Shared-memory via MCore; (b) Message-passing via NoC.

1) Shared-Memory Communication: Two features distinguish the proposed shared-memory communication scheme from previous work [15]–[17].
First, only the eight PCores in the same cluster can access the shared memory, with a fixed access priority order for hardware simplicity, i.e., the PCore in the top-left corner has the highest priority and the PCore in the bottom-right corner has the lowest priority. The rationale for this scheme, and the software mechanism that avoids live-lock, were detailed in the previous section. Second, a hardware-aided mailbox mechanism is used to achieve high inter-core synchronization efficiency, which will be mentioned later.

Typically, three steps complete a shared-memory communication. First, the Src PCore stores data into the shared memory. Next, it sends a synchronization signal to the Dest PCore, indicating that the data is ready. Third, the Dest PCore loads the data after the synchronization signal is confirmed. Fig. 10 illustrates these three steps.

2) Message-Passing Communication: While shared-memory communication is only allowed within the same cluster, message-passing enables all of the PCores to communicate with each other. A 3×6 2D Mesh NoC is implemented to support message-passing communications, where an XY dimension-ordered, deadlock-free wormhole routing algorithm [18] is adopted (a minimal sketch of this routing decision is given at the end of this section). Even with better scalability and adaptability for frequent, scattered data transfers, the efficiency of message-passing communication is limited by two bottlenecks. The first is the uncertainty of the communication channel: a NoC under heavy traffic load will block data packets and increase latency. Fortunately, with the aid of shared-memory communication within the cluster, the traffic load on the NoC can be reduced significantly. The second bottleneck lies in the data transfer between the processor core and the router. Usually, FIFO ports are mapped into the data memory address space [19] and accessed with load/store instructions; extra operations are then needed to calculate the memory address, resulting in extra power consumption. We propose two solutions to overcome the second bottleneck. First, the destination can be either the PCore or the private data memory: there are two input FIFOs in the PCore, one for the processor core and the other for the private data memory, and the first bit of the packet head determines the destination, as shown in Fig. 11. Second, two FIFO port mapping schemes are provided. One is the traditional method that maps the FIFO ports into the data memory address space, using load/store to access the FIFO. The other maps the FIFO ports into the extended register file address space, so register instructions access the FIFO directly. The number of communication instructions is reduced by 50% by eliminating redundant load/store operations (e.g., the two instructions "lw $24, 0($3)" and "add $1, $2, $24" are needed when mapping to memory space, while only the single instruction "add $1, $2, $24" is needed when mapping to the register file space).

Fig.10: Three steps in a typical shared-memory communication: (1) Src PCore stores data to shared memory in MCore; (2) Src PCore sends a synchronization signal to the Dest PCore; (3) Dest PCore loads data from shared memory when the synchronization signal is received.

Fig.11: Datapath of the message-passing communication in PCore.

IV. SIMULATION RESULTS

The simulation of the proposed design was carried out in Verilog HDL using the Xilinx ISE simulator; the simulation results are shown in Fig. 12. Message-passing communications are supported by the 3×6 2D Mesh NoC, and shared-memory communications are supported by the shared memory units in the memory cores. The proposed cluster-based memory hierarchy makes the processor well-suited for most embedded applications.
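The routing sketch referred to above is given here. XY dimension-ordered routing resolves the X offset completely before moving in Y, which is what makes it deadlock-free on a mesh. The port names and coordinate convention in this C sketch are assumptions for illustration; the actual router is the wormhole router of the 3×6 NoC described in Section III.

```c
#include <stdio.h>

typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* One hop of XY dimension-ordered routing on a 2D mesh:
 * travel along X until the destination column is reached, then along Y. */
static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;            /* arrived: eject to the attached core */
}

int main(void)
{
    static const char *name[] = { "local", "east", "west", "north", "south" };
    /* route a packet from node (0,0) towards node (2,1) one hop at a time */
    printf("first hop: %s\n", name[xy_route(0, 0, 2, 1)]);   /* east  */
    printf("then:      %s\n", name[xy_route(2, 0, 2, 1)]);   /* north */
    printf("finally:   %s\n", name[xy_route(2, 1, 2, 1)]);   /* local */
    return 0;
}
```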
The processor chip has a total of 256 KB of on-chip memory: each processor core has an 8 KB instruction memory and a 4 KB private data memory, and each memory core has a 32 KB shared memory. The processor is fabricated in TSMC 65 nm LP CMOS with a chip area of 9.1 mm², while each core occupies 0.43 mm². Typically, each processor core runs at 750 MHz at 1.2 V while dissipating 34 mW, with an energy efficiency of 45 pJ/Op for 32-bit operations and 22 pJ/Op for 16-bit operations.

Fig.12: Simulation results of the proposed design

V. CONCLUSIONS

A 16-core processor for embedded applications with hybrid inter-core communications is proposed in this paper. The processor has 16 processor cores and 2 memory cores.

REFERENCES

[1] G. Blake, R. G. Dreslinski, and T. Mudge, "A survey of multicore processors: A review of their common attributes," IEEE Signal Process. Mag., pp. 26–37, Nov. 2009.
[2] R. Kumar, V. Zyuban, and D. Tullsen, "Interconnections in multi-core architecture: Understanding mechanisms, overheads and scaling," in Proc. 32nd Int. Symp. Computer Architecture (ISCA'05), 2005, pp. 408–419.
[3] H.-Y. Kim, Y.-J. Kim, J.-H. Oh, and L.-S. Kim, "A reconfigurable SIMT processor for mobile ray tracing with contention reduction in shared memory," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 4, pp. 938–950, Apr. 2013.
[4] L. Hammond, B.-A. Hubbert, M. Siu, M.-K. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, vol. 20, no. 2, pp. 71–84, 2000.
[5] A. S. Leon, B. Langley, and L. S. Jinuk, "The UltraSPARC T1 processor: CMT reliability," in Proc. Custom Integrated Circuits Conf. (CICC'06) Dig. Tech. Papers, 2006, pp. 555–562.
[6] M.-B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The Raw microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, Mar./Apr. 2002.
[7] Tilera Corp., Tilepro64 Processor Tilera Product Brief, 2008 [Online]. Available: http://www.tilera.com/pdf/ProductBrief_TILEPro64_Web_v2.pdf
[8] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, "An 80-tile sub-100-W teraflops processor in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29–41, Jan. 2008.
[9] Z. Yu, M. J. Meeuwsen, R. W. Apperson, O. Sattari, M. Lai, J. W. Webb, E. W. Work, D. Truong, T. Mohsenin, and B. M. Baas, "AsAP: An asynchronous array of simple processors," IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 695–705, Mar. 2008.
[10] B. Rogers, A. Krishna, G. Bell, and K. Vu, "Scaling the bandwidth wall: Challenges and avenues for CMP scaling," in Proc. ACM Int. Symp. Computer Architecture (ISCA'09), 2009, pp. 371–382.
AUTHORS

1. Shaik Mahmed Basha received his B.Tech degree in Electronics and Communication Engineering from Siddhartha Institute Of Science & Technology, Puttur, Chittoor (Dist), affiliated to JNTU Anantapur. He is currently pursuing M.Tech VLSI at Audisankara Institute Of Technology, Gudur (Autonomous), Nellore (Dist), affiliated to JNTU Anantapur.

2. G. Nageswararao is pursuing a Ph.D. in wireless communications at Nagarjuna University, Guntur. He received his M.Tech in VLSI from Samuel Institute Of Engineering & Technology, Markapur, Prakasam (Dist). He has 16 years of teaching experience. He is presently working as Associate Professor in the Department of ECE, Audisankara Institute Of Technology, Gudur, affiliated to JNTU Anantapur.