A Coarse-Grained Reconfigurable Array for High-Performance Computing Applications

Philipp S. Käsgen, Osnabrück University of Applied Sciences, Email: p.kaesgen@hs-osnabrueck.de
Markus Weinhardt, Osnabrück University of Applied Sciences, Email: m.weinhardt@hs-osnabrueck.de
Christian Hochberger, Technische Universität Darmstadt, Email: hochberger@rs.tu-darmstadt.de

Abstract—Coarse-Grained Reconfigurable Arrays (CGRAs) are mostly used as hardware accelerators for multimedia applications, which in most cases require only integer operations. We design a CGRA with multiple Floating Point Units suitable for High-Performance Computing (HPC). This PhD project is about halfway completed.

Index Terms—Coarse-Grained Reconfigurable Array, High-Performance Computing, Hardware Accelerator.

I. INTRODUCTION

The computational parallelism and energy efficiency inherent in CGRAs have been a subject of research for many years, mostly for multimedia applications. These strengths of CGRAs are, however, also beneficial for other application domains such as HPC, since single-core and multi-core systems may soon hit scaling limits [1]. Hence, we propose a novel CGRA designed for HPC applications: the High-Performance Reconfigurable Processor (HiPReP).¹ In this paper, we focus mostly on the architecture and refrain from presenting mappings of HPC applications, as the software stack is still being developed. So far, we use hand-crafted mappings of Matrix-Matrix Multiplication, Fast Fourier Transform, a Finite Impulse Response Filter (FIR Filter), and Stencil Codes, which guided the early design decisions on the CGRA's topology and on the computational requirements of the Processing Elements (PEs).

We proceed as follows: Section II summarizes CGRAs which can process Floating Point values and which, because of this ability, might process HPC applications. Then, we introduce our architecture in Section III, before we present an in-depth view of our PEs in Section IV. In Section V we explain how we test our CGRA and how it is classified for better comparison with other CGRAs. Section VI concludes this paper.

¹ Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 283321772.

II. RELATED WORK

FloRA [2] explicitly examines the possibility to process Floating Point values by using two PEs which are only capable of integer operations: one PE calculates the mantissa and the other one calculates the exponent. The joint process is steered by a PE-internal Finite State Machine (FSM). Besides that, the 8×8 CGRA can be configured such that a pipeline is established in which a pipeline stage corresponds to one row of PEs. It is even possible to map applications which are executed speculatively. In contrast to FloRA, our CGRA provides full Floating Point Units in each PE.

Dynamic Synthesized Execution Resources (DySER) [3] is an 8×8 CGRA which is integrated in the OpenSPARC processor pipeline as a functional unit. It is a heterogeneous array which supports both integer and Floating Point operations. Unlike DySER, our CGRA is loosely coupled to the host system, and hence the Host-CPU can concurrently process other tasks.

In [4], a scheduler for a CGRA integrated in an AMIDAR processor [5] is proposed, capable of mapping nested loops, control flow, and double-precision Floating Point operations. This CGRA provides a global context counter to control the current contexts in each PE, whereas our PEs have individual context counters.
The commercial CGRA by Wave Computing [6] is specifically designed to accelerate the training and inference of Deep Neural Networks (DNNs). Because of its simplistic PE design (only nearest-neighbor connections, accumulator machine), it scales very well and compensates the low computing power of a single PE with the sheer number of PEs (16 clusters with 32×32 PEs each). In addition, its clock runs at 6.7 GHz.

III. THE CGRA

Our CGRA is a co-processor template designed as a SystemC simulation model which is connected to a host system via Direct Memory Access (DMA). A block diagram of a 4×4 instance is given in Figure 1. The CGRA itself consists of the memory controller and an array of PEs (a 4×4 array in this case). The host system comprises, highly simplified, the Host-CPU and a memory. Directed edges represent directed connections, undirected edges represent bidirectional connections. Thin edges are 32 bits wide to transmit single-precision Floating Point numbers, and bold ones are multiples of 32 bits wide. The thin edges may be extended to 64 bits in the future to support double-precision Floating Point numbers.

[Fig. 1: Integration of our co-processor in a Host-System]

The co-processor is either idle, in the configuration phase, or in the processing phase. It is invoked by writing the start address of the configuration to the corresponding address in the contiguous address space (Memory-Mapped I/O). When it receives this start address, it loads the PE contexts and data address components from the main memory. The PE contexts are forwarded to the corresponding PEs, and the data address components remain in the memory controller. As soon as the co-processor has finished the configuration, the processing phase begins. In this phase, it consumes and produces data (streaming). After data processing, the co-processor is idle again, unless it is reconfigured.

Beside the long-lines from the memory controller, the PEs are each connected to their eight nearest neighbors, i.e. they can exchange data using a handshake protocol. The incoming long-lines from the memory controller to the rows and columns of the PE array can only be written by the memory controller. In other words, the PEs cannot use them for data exchange. For the outgoing long-lines from the PEs to the memory controller, it is the other way around, i.e. the PEs can only write to but not read from them. The memory controller can send data to specific PEs as well as broadcast to a whole row or column. For handling the shared output long-lines from the PEs, the PEs send valid signals to the memory controller, and the memory controller grants write access to one PE in a row.

If a PE has to wait for an input value before it can continue to process contexts, this PE is undersupplied. If a PE wants to send a value to another one, but the receiving PE signals the sending PE to wait, the sending PE experiences back-pressure. It has to wait until the receiving PE has consumed the preceding input value and is ready to accept a new one. Both undersupply and back-pressure may be signs of unbalanced context loads between the PEs.

A process terminates when either the memory controller does not generate more addresses (hence, all results have supposedly been written back to memory), the PEs encounter the END instruction and signal this to the memory controller, or the computation is interrupted by new configuration data.
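The valid/wait handshake between PEs can be illustrated with a small, cycle-based model. The following C++ sketch is purely illustrative; the names (InputRegister, try_send, consume) are ours and not taken from the HiPReP SystemC model. It shows how a value is accepted only when it is offered as valid and the receiver does not assert wait, and how back-pressure stalls the sender until the previously sent value has been consumed.

```cpp
#include <cstdio>
#include <optional>

// One input register: holds at most one unread value (write-once-read-once).
struct InputRegister {
    std::optional<float> slot;            // occupied slot => wait is asserted
    bool wait() const { return slot.has_value(); }

    // Sender side: the value is accepted only if valid data is offered (valid = 1)
    // and the register does not signal wait (wait = 0).
    bool try_send(float v) {
        if (wait()) return false;          // back-pressure: previous value unread
        slot = v;
        return true;
    }

    // Receiver side: reading the value exactly once clears the wait signal.
    std::optional<float> consume() {
        std::optional<float> v = slot;
        slot.reset();
        return v;
    }
};

int main() {
    InputRegister in;
    const float stream[] = {1.0f, 2.0f, 3.0f};
    int produced = 0;

    for (int cycle = 0; produced < 3 || in.wait(); ++cycle) {
        // Sending PE: stalls (back-pressure) while the receiver asserts wait.
        if (produced < 3 && in.try_send(stream[produced]))
            ++produced;

        // Receiving PE: consumes only every other cycle; in the other cycles the
        // sender is stalled. The opposite situation, a receiver waiting for a
        // value that has not yet arrived, would be undersupply.
        if (cycle % 2 == 1) {
            if (std::optional<float> v = in.consume())
                std::printf("cycle %d: consumed %.1f\n", cycle, *v);
        }
    }
    return 0;
}
```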
IV. THE PROCESSING ELEMENT

First of all, we distinguish between a PE Setup and a PE Configuration. The individual PE Setups define which Functional Units (FUs) are implemented in hardware in the specific PE, i.e. which Floating Point or integer operations are supported. This enables us to generate homogeneous and heterogeneous CGRAs. For our simulator, the PEs are designed as templates, i.e. the Setups can easily be changed for each PE. Of course, the Setups do not change during a simulation run. In addition, multiple FUs can be defined for one PE. The PE Configurations define which contexts are executed during the simulation.

The general structure of the PE is depicted in Figure 2. Basically, it contains a RISC-like processor pipeline with a context fetch stage (similar to an instruction fetch stage), one or more execution stages, and a write-back stage. The bold arrows represent wires which are multiple bits wide (mostly data), and the thin arrows represent one-bit wires used for control signals. The block shapes indicate how the corresponding modules behave: Rounded rectangles represent combinatorial elements. Normal rectangles with fine lines are combinatorial memories. Bold rectangles are standard registers. Rounded rectangles with bold lines are complex modules with internal states. The trapezoidal blocks labeled MUX are multiplexers. Dashed blocks are optional registers which might be implemented for timing closure.

[Fig. 2: PE Overview]

The capabilities of this PE template are motivated by three design goals: Firstly, we want to support several FUs with differing latencies and possibly differing initiation intervals. Secondly, we provide an Internal Register File. Note that, due to the processing pipeline within the PE, we have to handle data and structural hazards. This is done in hardware to ease the compiler implementation. Thirdly, we also support branches as contexts. By doing so, each PE can process small loops whose loop bodies consist of differing operations.

A. Context Fetch

The Program Counter (PC) addresses one of the entries in the Context Memory (CM). The PC is incremented if the current opcode is not END and the PE is neither pressured back by other PEs nor stalled by local contexts waiting for data (undersupply). The multiplexer selecting the actual next PC value is controlled by two signals: J and branch invalid. It handles the static branch prediction and the roll-back to the old PC value. More precisely, the PE statically predicts that a branch will be taken in order to handle loops efficiently: If J is false and there was no invalid branch before, the incremented PC is forwarded. If J is true and there was no invalid branch before, the target address from the context is forwarded as the next PC. But if there was an invalid branch before, the Back Up PC is forwarded. The Back Up PC stores the incremented PC if a branch context is detected, no preceding branch was invalid, and, in the case of a conditional branch, the corresponding operands are available. These context fetching strategies are mostly handled by the decoder.
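The next-PC selection just described can be summarized in a few lines. The following C++ sketch is only an illustration of the selection logic; names such as FetchState, backup_pc, and next_pc are ours, and the stall conditions (END opcode, back-pressure, undersupply) as well as the operand-availability check for conditional branches are omitted.

```cpp
#include <cstdio>
#include <cstdint>

// Sketch of the static branch prediction in the Context Fetch stage:
// branches are predicted taken, and the Back Up PC allows rolling back
// to the fall-through path when a taken branch turns out to be invalid.
struct FetchState {
    uint16_t pc = 0;         // Program Counter addressing the Context Memory
    uint16_t backup_pc = 0;  // incremented PC saved when a branch is predicted taken
};

// j              : current context is a branch (J signal)
// branch_target  : target address encoded in the branch context
// branch_invalid : a previously predicted-taken branch was evaluated as not taken
void next_pc(FetchState& s, bool j, uint16_t branch_target, bool branch_invalid) {
    const uint16_t incremented = s.pc + 1;

    if (branch_invalid) {
        s.pc = s.backup_pc;         // roll back to the stored fall-through address
    } else if (j) {
        s.backup_pc = incremented;  // remember fall-through for a possible roll-back
        s.pc = branch_target;       // static prediction: branch taken
    } else {
        s.pc = incremented;         // sequential context fetch
    }
}

int main() {
    FetchState s;
    next_pc(s, /*j=*/true, /*branch_target=*/5, /*branch_invalid=*/false);
    std::printf("after taken branch: pc=%u backup=%u\n", (unsigned)s.pc, (unsigned)s.backup_pc);
    next_pc(s, /*j=*/false, /*branch_target=*/0, /*branch_invalid=*/true);
    std::printf("after roll-back:    pc=%u backup=%u\n", (unsigned)s.pc, (unsigned)s.backup_pc);
    return 0;
}
```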
B. Input Registers

The Input Registers are the interface for all incoming operands from the neighboring PEs and the long-lines. They handle values such that every Input Register accepts values from only one neighboring PE and every input value is read exactly once. The PEs process input signals as streams, i.e. they communicate with a handshake protocol. If an input signal is valid (valid = 1) and the receiving Input Register does not signal to wait (wait = 0), then the value is accepted. In any other case, the input value is not accepted.

C. Decoder

The decoder interprets the context and selects the respective FU, and it checks whether the required operands are available, i.e.
• whether a required Input Register value has not yet been read, by looking at the Input Wait signals (see Section IV-B), or
• whether a required Internal Register File value is valid, by looking at the Data Hazard Detector.

A context may proceed to an FU if all of the following hold:
• The operands (both sources and target) required by the context are available (i.e. the respective wait is set and no data hazard is detected).
• No back-pressure is present.
• If applicable, the condition of the branch taken before has been evaluated as valid.
• The FU is not busy.

Only one context at a time may be promoted to execution, i.e. at most one of the Enable Signals for the FUs is set by the decoder in the same time step. Data and structural hazards are treated in a similar way as in [7].

D. Write-Back

In the Write-Back stage, a PE has to handle three things: the Output Registers (ORs), the Internal Register File, and the Data Hazard Detector. When an operation has finished computation, the highest bit of the target address provided by the Structural Hazard Detector selects whether the result is written to the ORs or to the Internal Register File. At the same time, the respective accompanying control signals are set (set valid, or reset the Data Hazard Detector). In the next clock cycle, the result is visible either to another PE or to the next context.

An OR also senses back-pressure when it is about to be written while being notified to wait by the receiving PE. This happens when the receiving PE has not yet processed the preceding value. In this case, the back-pressure detected signal is set until the receiving PE resets the wait signal.

V. EVALUATION

In order to verify the functionality of our CGRA, we simulate our SystemC models with the Open SystemC Initiative (OSCI) simulator. In addition, for debugging and demonstration purposes, we implemented a visualization of our CGRA simulation in Java.

For a Floating Point FIR Filter, each PE holds one coefficient as well as one tap, and forwards the most recent input value and the multiply-accumulated value to the next PE. This enables the CGRA to process any FIR Filter whose order is smaller than or equal to the number of PEs plus one. The simulation shows that the hand-mapped filter generates a result every ten clock cycles.
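The arithmetic of such a chained multiply-accumulate mapping can be mimicked by a plain C++ reference model: each stage holds one coefficient and one tap and passes a sample plus a running sum to the next stage. This is only a functional sketch under our own naming (FirPE, step); it is not the cycle-accurate HiPReP mapping and does not model the PE handshakes or pipelining.

```cpp
#include <cstdio>
#include <tuple>
#include <utility>
#include <vector>

// One FIR stage, corresponding to one PE: it stores one coefficient and one
// tap (the sample it received in the previous step).
struct FirPE {
    float coeff;      // filter coefficient configured into this PE
    float tap = 0.f;  // delayed sample held by this PE

    // Receive a sample and a partial sum, update the tap, and forward the
    // previously held sample together with the new partial sum to the next PE.
    std::pair<float, float> step(float sample_in, float acc_in) {
        const float forwarded = tap;  // older sample moves one PE further down the chain
        tap = sample_in;
        return {forwarded, acc_in + coeff * tap};
    }
};

int main() {
    // 3-tap example filter; one PE per coefficient.
    std::vector<FirPE> pes = {{0.25f}, {0.5f}, {0.25f}};
    const float input[] = {1.f, 2.f, 3.f, 4.f};

    for (float x : input) {
        float sample = x, acc = 0.f;
        for (FirPE& pe : pes)  // one multiply-accumulate per PE
            std::tie(sample, acc) = pe.step(sample, acc);
        std::printf("y = %f\n", acc);  // y[n] = sum_k coeff[k] * x[n-k]
    }
    return 0;
}
```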
Due to the write-once-read-once policy for data exchange between the PEs, and due to the resolution of possible conflicts with other operands within a PE (see Section IV-C), a consecutive value written to the same input has to wait for at least one cycle in practice. In our experience so far, this does not seem to pose a performance issue: the contexts of the PEs rarely communicate with each other in every clock cycle, and, due to memory latencies, we do not expect to process new input values in every clock cycle either. Future evaluations will show whether this holds; otherwise the communication protocol will be revised.

For a comparison of CGRAs, several surveys such as [8], [9], and [10] propose suitable points of comparison. The most refined categories can be found in [10]; we apply them to our CGRA in Table I.

TABLE I
ARCHITECTURE CLASSIFICATION ACCORDING TO [10]

Structure
  Network Type        Direct connection
  Granularity         32-bit
  Multi-granular      No
  Register files      Yes
  Memory hierarchy    Scratchpad
Control
  Oper. Scheduling    Dynamic
  Part. Reconfig.     Dynamic
  Network Config.     Dynamic
  Custom operations   No
Integration
  Integration         Accelerator
  Coupling            Loosely coupled
  Resource sharing    No
Tooling
  Compiler            Imperative language
  Place and Route     Not yet
  DSE                 Not yet

VI. CONCLUSION

In summary, we presented the architecture of our novel CGRA designed for HPC applications. We briefly explained the interface to the Host-System and covered the PEs and the PE array in detail. In the future, we will elaborate on the interaction of the CGRA and the Host-System in depth, and show how HPC applications can be mapped onto the CGRA using a software stack, which is the topic of the second PhD project funded within this research project. In addition, we will benchmark the system in order to optimize an Application-Specific Integrated Circuit implementation with respect to chip area, power consumption, and performance.

REFERENCES

[1] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in 2011 38th Annual Int. Symp. on Computer Architecture (ISCA), June 2011, pp. 365–376.
[2] D. Lee, M. Jo, K. Han, and K. Choi, "FloRA: Coarse-grained reconfigurable architecture with floating-point operation capability," in Proc. of the 2009 Int. Conf. on Field-Programmable Technology (FPT'09), 2009, pp. 376–379.
[3] V. Govindaraju, C. H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim, "DySER: Unifying functionality and parallelism specialization for energy-efficient computing," IEEE Micro, vol. 32, no. 5, pp. 38–51, 2012.
[4] T. Ruschke, L. J. Jung, D. Wolf, and C. Hochberger, "Scheduler for inhomogeneous and irregular CGRAs with support for complex control flow," in 2016 IEEE Int. Parallel and Distributed Processing Symp. Workshops (IPDPSW), 2016, pp. 198–207.
[5] S. Gatzka and C. Hochberger, "The AMIDAR class of reconfigurable processors," J. of Supercomputing, vol. 32, no. 2, pp. 163–181, 2005.
[6] C. Nicol, "A coarse grain reconfigurable array (CGRA) for statically scheduled data flow computing," Wave Computing, Inc., Tech. Rep., 2016.
[7] J. Thornton, "Parallel operation in the Control Data 6600," in AFIPS '64 (Fall, Part II): Proc. of the Fall Joint Computer Conference, Part II: Very High Speed Computer Systems, 1964.
[8] G. Theodoridis, D. Soudris, and S. Vassiliadis, "A survey of coarse-grain reconfigurable architectures and CAD tools," in Fine- and Coarse-Grain Reconfigurable Computing. Springer, 2008, pp. 89–149.
[9] B. De Sutter, P. Raghavan, and A. Lambrechts, "Coarse-grained reconfigurable array architectures," in Handbook of Signal Processing Systems, 2nd ed., 2013, pp. 553–592.
[10] M. Wijtvliet, L. Waeijen, and H. Corporaal, "Coarse grained reconfigurable architectures in the past 25 years: Overview and classification," in Proc. 16th Int. Conf. on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS 2016), 2017, pp. 235–244.