Introduction to the Tiled HW Architecture of SHAPES

Pier Stanislao Paolucci 1,2,**, Francesca Lo Cicero 1, Alessandro Lonardo 1, Mersia Perra 1, Davide Rossetti 1, Carlo Sidore 1, Piero Vicini 1, Marcello Coppola 3, Luigi Raffo 4, Gianni Mereu 4, Francesca Palumbo 4, Luca Fanucci 5, Sergio Saponara 5, Francesco Vitullo 5

1 INFN Roma, 2 Atmel Roma, 3 ST Microelectronics - Grenoble, 4 DIEE, Università di Cagliari, 5 Università di Pisa

* corresponding author e-mail: pier.paolucci@roma1.infn.it. SHAPES (scalable Software Hardware Computing Architecture for Embedded Systems) is a European Project (FET-FP6-2004-IST-4.2.3.4(viii) - Advanced Comp. Arch.) started Jan 2006. See www.shapes-p.org for complete documentation.

Abstract

Nanoscale systems on chip dedicated to embedded systems and numerical computations will integrate a few hundred million gates. The challenge is to find a scalable HW/SW design style for future CMOS technologies. The main HW problem is wiring, which threatens Moore’s law. Tiled architectures suggest a possible HW path: “small” processing tiles connected by “short wires”. A second HW problem is the management of the design complexity. A tiled design style reuses stable Intellectual Properties requiring a few million gates: a manageable complexity. A typical SHAPES tile always contains one Distributed Network Processor (DNP) for inter-tile communications, plus one VLIW DSP processor for computation and/or one RISC processor for control. The DNP is supported by a NoC (for inter-tile, intra-chip communications) and by an N-dimensional toroidal network for off-chip communications. The SW challenge is to provide a simple and efficient programming environment for a (massive) tiled parallel architecture. This paper introduces the HW architecture.

1. Introduction

There is no processing power ceiling for low-consumption, low-cost, dense Numerical Embedded Scalable Systems dedicated to future human-centric applications, which will manage multi-channel audio, video and multi-sensorial inputs and outputs. Nanoscale systems on chip dedicated to embedded systems and numerical computations will integrate a few hundred million gates. A serious challenge is to identify a scalable HW/SW design style for future CMOS technologies enabling high gate counts [1-3]. The main HW problem is wiring [4,5], which threatens Moore’s law. A second HW problem is the management of the design complexity of high gate count designs. Tiled architectures [6-9] suggest a possible HW path: “small” processing tiles connected by “short wires”. A tiled design style extensively reuses processing tiles, each tile composed of stable Intellectual Properties requiring only a few million gates: a manageable complexity.

The SHAPES project targets three main objectives: investigate the tiled HW paradigm, experiment with a real-time, communication-aware system SW, and validate the HW and System SW Platform through a set of benchmarking applications. This paper introduces the HW architecture. For an introduction to the SHAPES System SW, see [10].

Figure 1. The Tiled HW Architecture of SHAPES

2. Tiled HW paradigm

The SHAPES project (scalable Software Hardware Computing Architecture for Embedded Systems) investigates a specific Tiled HW paradigm (see Figure 1). Each Tile includes a few million gates, for optimal balance among parallelism, local memory, and IP reuse on future technologies. The SHAPES inter-tile routing fabric connects on-chip and off-chip tiles, weaving a distributed packet switching network. 3D next-neighbours engineering methodologies will be studied for off-chip networking and maximum system density, leveraging the know-how accumulated by INFN during the design and development of several generations of massively parallel processors [11-14] dedicated to numerical computations.

Figure 2 describes a typical Tile of SHAPES. Each tile of SHAPES always contains one Distributed Network Processor (DNP) for inter-tile communications, plus one VLIW floating-point DSP (Digital Signal Processor) for numerical computations, and/or a RISC processor for control-intensive codes. Intra-tile communications are sustained by a Multilayer Bus Matrix, while inter-tile communications are supported by the NoC (on-chip) and by the 3DT (off-chip 3 Dim. Toroidal next-neighbours interconnection network). The DNP acts as a generalized DMA controller, offloading the RISC and DSP processors from the task of managing the packets flowing through the inter-tile network.

Figure 2. A typical Tile of SHAPES: RISC + VLIW DSP + DNP

SHAPES is a Distributed Memory Architecture. Each Tile is equipped with distributed on-chip memories and can be associated with an external distributed memory (DXM). Each tile may also contain a POT (a set of Peripherals On Tile). In its first implementation, the SHAPES tile will be developed as the combination of a new generation of the DIOPSIS (RISC + DSP) MPSOC [8,15], designed by ATMEL Roma, with the DNP, designed by INFN.

3. Distributed Network Processor (DNP)

DNP stands for Distributed Network Processor. The main task of the DNP (designed by INFN) is to provide inter-tile communication services. A secondary service is to act as a DMA controller for intra-tile communications.

The DNP offers its services to other masters in the tile, typically a RISC or DSP processor (Figure 3 describes the interfaces offered by the DNP). The DNP receives commands issued by the initiating processors on a slave port. The DNP is also equipped with two master ports to sustain the data traffic. In a typical situation the master ports act simultaneously to sustain the flow between the source and destination targets (memory buffers exposed to the DNP by the tile). For intra-tile communications the set of master and slave interface ports would suffice. For inter-tile communication, which requires the cooperation of at least two DNPs located on different tiles, the DNP is equipped with a set of inter-tile interfaces. A first set of those interfaces is used for off-chip communications on the 3DT (3D toroidal next-neighbours topology) and on a collective communication tree (CTV). A second set of interfaces will be used for on-chip communication through a dedicated NoC architecture.

Figure 3. The interfaces of the Distributed Network Processor

The DNP may host a small processor, called DNP-P in the following, to implement some advanced features in an easy way, without resorting to complex VHDL coding while allowing for easy upgrade and bug fixing.

4. mAgicV VLIW Floating-Point DSP Core

The mAgicV VLIW DSP (designed by Atmel Roma) is a fully C programmable, high performance Digital Signal Processor delivering 10 floating-point operations per cycle and 16 ops per cycle. It is a new member of the mAgic [16] processor family used in the Atmel Diopsis product line (multiprocessor systems on chip combining a RISC and a DSP).

Figure 4. mAgicV Floating-point VLIW DSP

In a conventional numerical processor, which detects parallelism at execution time and pushes the clock speed to the limit, the die area required for control logic overhead frequently dwarfs that which is required by the functional units. For a discussion of the efficiency of DSPs relying on parallelism detected before execution and exploited by VLIW and appropriate software scheduling see [18]. In our opinion and experience, moderate clock speeds are an ideal complement to VLIW architectures because they reduce pipeline depth, bypass logic, and speculation correction logic. This choice simplifies the task of high-level language compilers. A moderate clock speed also allows simpler clock tree management and lower supply voltages.

The classical drawback of VLIW architectures is that the longer instruction words require more memory. While conventional code compression/decompression schemes mitigate this problem [19], they also increase control logic overhead. We achieved efficient usage of the internal program memory by designing a small-footprint (18 Kgate) VLIW Dynamic Program memory Decompression system (DyProDe) [20].

The architecture of mAgicV is co-designed with its C-Compiler, using the Retargetable Compiler generation techniques of Target Compiler Technologies [17]. The C-oriented Architecture completely frees the user from the burden of dealing with the parallelism of the DSP processor resources when the program is written for a single processor. For multiprocessor architectures, the problem is addressed by the System SW developed by the SHAPES project. A rich library of C-written DSP routines is available. The DSP library reaches an impressive level of performance, comparable to or higher than the performance obtained by optimized assembler in previous generations of the mAgic Processor family.

The mAgicV floating-point VLIW DSP is equipped with one AHB master port and one AHB slave port for system-on-chip integration. It includes 256 data registers, 64 address registers, 10 independent arithmetic operating units, 2 independent address generation units and a DMA engine driving the AHB Master Port. To sustain the internal parallelism, the data bandwidth among the Register File, the Operators and the Data Memory System amounts to 80 bytes/cycle. The Data Memory System is designed to transfer 28 bytes/cycle. For instance, activating all the computing units, mAgicV can produce one complete FFT butterfly per cycle.
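The "one FFT butterfly per cycle" figure can be related to the arithmetic cost of a butterfly. The sketch below is only an illustration: it assumes a standard radix-2 decimation-in-time butterfly (the text does not say which butterfly variant is meant), and the function name is hypothetical. Written out in real arithmetic, such a butterfly takes 4 multiplications and 6 additions, i.e. 10 floating-point operations, which is consistent with the 10 floating-point operations per cycle quoted for mAgicV.

```python
def radix2_butterfly(a, b, w):
    """Radix-2 DIT butterfly: returns (a + w*b, a - w*b, flop_count).

    Assumption: this is the textbook radix-2 butterfly, written with
    explicit real operations so that each one can be counted.
    """
    ar, ai = a.real, a.imag
    br, bi = b.real, b.imag
    wr, wi = w.real, w.imag

    # Complex product t = w * b: 4 real multiplications, 2 real additions.
    tr = wr * br - wi * bi
    ti = wr * bi + wi * br

    # Complex sum and difference: 4 real additions.
    top = complex(ar + tr, ai + ti)
    bottom = complex(ar - tr, ai - ti)

    flops = {"mul": 4, "add": 6}
    return top, bottom, flops


top, bottom, flops = radix2_butterfly(1 + 2j, 3 - 1j, 1j)
print(top, bottom, sum(flops.values()))
```

Ten independent arithmetic units can in principle issue these ten operations in the same cycle once the software pipeline is full; how the compiler actually schedules them is not described here.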
mAgicV operates on IEEE 754 40-bit extended precision floating-point and 32-bit integer numeric formats for numerical computations, while internal memory accesses are supported by a powerful 16-bit MAGU (Multiple Address Generation Unit). It includes an on-chip 16K x 40-bit, 6-access/cycle data memory system and 8K x 128-bit dual-port program memory locations.

5. The Spidergon NoC

Spidergon-STNoC (S-STNoC) is the Network on Chip (NoC) technology currently developed in STMicroelectronics, with the support of the Universities of Cagliari and Pisa. It is made of a network of micro-routers interconnected in a Spidergon [21] topology. The main task of S-STNoC in SHAPES is to provide the inter-tile on-chip communication services. Different tiles can be connected on the same silicon chip through the DNP to S-STNoC interconnection.

Figure 5. Spidergon NoC Topology

The DNP has a port connected to the S-STNoC Network Interface (S-STNoC NI), a block responsible for converting packets travelling from the DNP to the S-STNoC into an S-STNoC compatible packet format and, vice versa, packets travelling from the S-STNoC to the DNP into the DNP packet format.

Traditional SoC interconnection architectures have generally been based upon buses and point-to-point links. Bus architectures are not ultimately scalable: as more units are added to a bus, the power dissipation for bus access grows as a consequence of the increased capacitive load. Dedicated point-to-point links are optimal in terms of performance, but the number of links required grows quadratically with the number of cores (N(N-1)/2 links for full connectivity among N cores), leading to a potential area and routing problem. For maximum flexibility and scalability it is generally accepted that a move towards a shared segmented global communication structure is needed [22]. This definition leads in turn to a data-routing network consisting of communication links and routing nodes that are implemented on the chip.
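As a rough illustration of the scalability argument, the following sketch compares the number of links needed for full point-to-point connectivity with the number of router-to-router links in a Spidergon arrangement. It assumes the usual Spidergon definition from [21]: an even number N of routers placed on a ring, each also linked to the diametrically opposite router. The function names are hypothetical and the code is not taken from the S-STNoC design.

```python
def spidergon_neighbours(node, n):
    """Clockwise, counter-clockwise and across neighbours of a router.

    Assumption: n routers numbered 0..n-1 on a ring, n even and >= 4.
    """
    if n < 4 or n % 2 != 0:
        raise ValueError("Spidergon needs an even number of routers, at least 4")
    return ((node + 1) % n, (node - 1) % n, (node + n // 2) % n)


def spidergon_links(n):
    """Count distinct bidirectional router-to-router links."""
    links = set()
    for node in range(n):
        for other in spidergon_neighbours(node, n):
            links.add(frozenset((node, other)))
    return len(links)


def point_to_point_links(n):
    """Links needed to connect every pair of n cores directly."""
    return n * (n - 1) // 2


for n in (8, 16, 32):
    print(n, spidergon_links(n), point_to_point_links(n))
```

The Spidergon link count grows linearly with the number of routers (3N/2), whereas dedicated pairwise links grow as N(N-1)/2; the price is that most packets traverse more than one hop.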
Networks-on-Chip are mainly based around three components: 1) Routers, 2) Network Interfaces and 3) Physical Links. Each Router has four connections: three to other Routers and one to a Network Interface. NoCs hide the interconnect-specific implementation details from the IP resources they interface. All that an external IP needs in order to transmit data through the network is a specifically designed Network Interface.

The architecture of the S-STNoC Network Interface has been designed with modularity in mind. It has two main components, shell and kernel. The shell handles the processing activity related to the signals coming from the IP resource side, while the kernel takes care of the activity related to the NoC side of the Network Interface. The shell architecture depends on the specific IP protocol while the kernel architecture does not. S-STNoC Network Interfaces support data size conversion and, thanks to FIFO synchronizers, allow the kernel and shell subsystems to run at totally unrelated frequencies.

The S-STNoC Physical Link handles Router-to-Router and Router-to-NI connections [23]. It is made up of two interfaces: 1) the Upstream (US) interface, associated with the transmitter module, and 2) the Downstream (DS) interface, associated with the receiver module. Links are half-duplex and support both synchronous and mesochronous clocking strategies. The latter enables full bandwidth communication (1 GHz in our experiments in 65 nm CMOS) between IP macrocells clocked by signals with the same frequency but an arbitrary amount of skew. The verification of S-STNoC building blocks is based on Specman [24].

6. Tile aware System SW and benchmarks

Even though this paper is an introduction to the SHAPES HW architecture, we find it useful to summarize a few key ideas about the System SW, introduced in [10]. The SHAPES project will adopt innovative layered system software, which does not destroy the information about algorithmic parallelism, data and workload distribution and real-time requirements provided by the programmer.
The System SW will be fully aware of the Tiled HW paradigm. For efficiency and predictability, the system SW manages intra-tile and inter-tile latencies, bandwidths and computing resources, using static and dynamic profiling. The application is described using a model-based approach in terms of a network of actors with explicit real-time constraint annotations, and it is mapped (i.e. bound to the processing and communication resources and instrumented with scheduling strategies) by an iterative automated (or semi-automated) multi-objective optimization. The mapping procedure uses performance estimates obtained at different levels of abstraction: from analytic prediction down to queries answered by the simulation environment.

The layered structure of the software separates the application code from the Hardware dependent Software (HdS). Therefore it is possible to debug the application and System SW layers independently and obtain higher simulation speeds. The application can be debugged using a Virtual Architecture, which is agnostic of the HdS details and of the particular Hardware Abstraction Layer offered by the real HW. The HdS generator refines the System SW, first through a Transaction Accurate step, and then down to the Virtual Prototype level, where the generated System SW contains all the details needed to drive the actual HW platform.

The SW accesses the on-chip and off-chip networks through a homogeneous interface. The same HW and SW interface can be adopted for integration with signal acquisition and reconfigurable logic tiles. Generation after generation, the number of tiles on a single chip will grow, but the application will be portable.
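The mapping step described in this section (binding a network of actors to tiles) can be illustrated with a deliberately small toy model. Everything in the sketch below is an assumption made for illustration: the actor names, the cost figures, the weighted-sum cost that collapses two objectives (worst tile load and inter-tile traffic) into one, and the exhaustive search that stands in for the iterative optimization. It is not the SHAPES mapping tool.

```python
from itertools import product


def map_actors(loads, traffic, n_tiles, alpha=1.0):
    """Exhaustively search actor-to-tile bindings for the lowest cost.

    loads:   dict actor -> computation cost (hypothetical units)
    traffic: dict (src, dst) -> data volume exchanged on that edge
    cost = heaviest tile load + alpha * traffic crossing tile boundaries
    Returns (best_cost, best_binding).
    """
    actors = sorted(loads)
    best = None
    for assignment in product(range(n_tiles), repeat=len(actors)):
        binding = dict(zip(actors, assignment))
        tile_load = [0.0] * n_tiles
        for actor, tile in binding.items():
            tile_load[tile] += loads[actor]
        inter_tile = sum(volume for (src, dst), volume in traffic.items()
                         if binding[src] != binding[dst])
        cost = max(tile_load) + alpha * inter_tile
        if best is None or cost < best[0]:
            best = (cost, binding)
    return best


# Hypothetical four-actor pipeline mapped onto two tiles.
loads = {"src": 2, "fir": 4, "fft": 4, "sink": 2}
traffic = {("src", "fir"): 1, ("fir", "fft"): 3, ("fft", "sink"): 1}
cost, binding = map_actors(loads, traffic, n_tiles=2)
print(cost, binding)
```

In this toy instance the search splits the pipeline in the middle, balancing the two tiles at the price of sending the heaviest edge across the inter-tile network. A realistic flow would replace the fixed cost figures with the performance estimates mentioned above and would not enumerate all bindings.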
The SHAPES HW and SW platform will be benchmarked through a set of applications characterized by a large inherent parallelism and, typically, by real-time constraints: wave field synthesis for arrays of sound sources reproduced by large arrays of loudspeakers, treatment of signals acquired by arrays of microphones, Ultrasound Scanners, and Theoretical Physics (Lattice Quantum Chromo-Dynamics).

7. Acknowledgements

SHAPES (scalable Software Hardware Computing Architecture for Embedded Systems) is a project partially funded by the European Commission (FET-FP6-2004-IST-4.2.3.4(viii) - Advanced Comp. Arch.). The SHAPES HW Team acknowledges the fruitful contribution of the project partners working on System SW and Benchmarking Applications. In particular we would like to thank Lothar Thiele (ETH Zurich), Rainer Leupers (RWTH Aachen), Ahmed A. Jerraya (TIMA Grenoble), Gert Goossens (TARGET Compiler Technologies) and Thomas Sporer (Fraunhofer IDMT) for discussions and feedback.

8. References

[1] D. Sylvester and K. Keutzer, “Impact of Small Process Geometries on Microarchitectures in Systems on a Chip”, Proc. IEEE, 89-4 (2001) 467-489.
[2] W.J. Dally and S. Lacy, “VLSI Architectures: Past, Present and Future”, Proc. Advanced Research in VLSI Conf., IEEE Press (1999) 232-241.
[3] A. Allan et al., “2001 Technology Roadmap for Semiconductors”, IEEE Computer 35-1 (2002) 42-53.
[4] R. Ho, K. Mai and M. Horowitz, “The Future of Wires”, Proc. IEEE, 89-4 (2001) 490-504.
[5] J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits, 2nd Edition, Prentice-Hall (2003), Chapters 4 and 9.
[6] M.B. Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs”, IEEE Micro 22-2 (2002) 25-35.
[7] L.P. Carloni, A.L. Sangiovanni-Vincentelli, “Coping with latency in SOC Design”, IEEE Micro 22-5 (2002) 24-35.
[8] P.S. Paolucci et al., “Janus: A gigaflop VLIW+RISC Soc Tile”, Hot Chips 15 IEEE Stanford Conference (2003). http://www.hotchips.org (note: Janus was the development name of the first generation of DIOPSIS).
[9] P.S. Paolucci, “The Diopsis Multiprocessor Tile of SHAPES”, 6th International Forum on Application-Specific Multi-Processor SoC MPSOC’06 (Colorado, August 2006).
[10] P.S. Paolucci, A.A. Jerraya, R. Leupers, L. Thiele, and P. Vicini, “SHAPES: a tiled scalable software hardware architecture platform for embedded systems”, Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (Seoul, Korea, 2006), CODES+ISSS '06, ACM Press, 167-172. DOI= http://doi.acm.org/10.1145/1176254.1176297
[11] A. Bartoloni, P.S. Paolucci et al., “A Hardware Implementation of the APE100 Architecture”, Int. Journ. Mod. Phys. C 4 (1993) 969.
[12] N. Cabibbo and P.S. Paolucci, “SIMD algorithm for Matrix Transposition”, Int. Journ. Mod. Phys. C 6 (1995) 183.
[13] F. Aglietti, P.S. Paolucci et al., “The teraflop supercomputer APEmille: architecture, software and project status report”, Computer Physics Communications 110, 1-3 (May 1998) 216-219.
[14] F. Belletti et al., “Computing for LQCD: apeNEXT”, Computing in Science and Engineering 8-1 (Jan/Feb 2006) 18-29.
[15] ATMEL Roma, “DIOPSIS: Dual Inter Operable Processor in A Single Silicon”, www.atmelroma.it
[16] P.S. Paolucci, P. Kajfasz et al., “mAgic-FPU and MADE: A customizable VLIW core and the modular VLIW processor architecture description environment”, Computer Physics Communications 139 (2001) 132-143.
[17] “Chess/Checkers, a retargetable tool-suite for embedded processors”, Target Compiler Technologies, http://www.retarget.com/doc/target-whitepaper.pdf.
[18] P. Faraboschi, G. Desoli, J.A. Fisher, “The Latest Word in Digital and Media Processing”, IEEE Signal Processing Mag. 15-2 (1998) 59-85.
[19] R.P. Colwell, J. O’Donnell, D.P. Papworth, P.K. Rodman, “Instruction Storage Method with a Compressed Format Using a Mask Word”, U.S. Patent 5057837 (Oct 1991).
[20] P.S. Paolucci, “Apparatus and Method for Dynamic Program Decompression”, U.S. Patent 6,766,439 (Jul 2004).
[21] R. Locatelli, G. Maruccia, L. Pieralisi, A. Scandurra, M. Coppola, “Spidergon: a novel on-chip communication network”, International Symposium on System-on-Chip, 2004, pp. 15-26.
[22] L. Benini, G. De Micheli, “Powering network-on-chips”, The 14th International Symposium on System Synthesis (ISSS), 2001, pp. 33-38.
[23] D. Mangano, R. Locatelli, A. Scandurra, C. Pistritto, M. Coppola, L. Fanucci, F. Vitullo, D. Zandri, “Skew Insensitive Physical Links for Network on Chip”, Nano-Net 2006, September 14-16, 2006.
[24] The e functional verification language, http://www.ieee1647.org/