A Coarse-Grained Reconfigurable Array for High-Performance Computing Applications
Philipp S. Käsgen, Osnabrück University of Applied Sciences, Email: p.kaesgen@hs-osnabrueck.de
Markus Weinhardt, Osnabrück University of Applied Sciences, Email: m.weinhardt@hs-osnabrueck.de
Christian Hochberger, Technische Universität Darmstadt, Email: hochberger@rs.tu-darmstadt.de
Abstract—Coarse-Grained Reconfigurable Arrays (CGRAs) are mostly used as hardware accelerators for multimedia applications, which in most cases require only integer operations. We design a CGRA with multiple Floating Point Units which is suitable for High-Performance Computing (HPC). This PhD project is about halfway completed.
Index Terms—Coarse-Grained Reconfigurable Array, High-Performance Computing, Hardware Accelerator.
I. INTRODUCTION
The computational parallelism and energy efficiency inherent in CGRAs have been a subject of research for many years, mostly for multimedia applications. These strengths of CGRAs are, however, also beneficial for other application domains like HPC, since single-core and multi-core systems may soon hit scaling limits [1]. Hence, we propose a novel CGRA designed for HPC applications: the High-Performance Reconfigurable Processor (HiPReP)1. In this paper, we mostly focus on the architecture and refrain from presenting mappings of HPC applications, as the software stack is still being developed. So far, we use hand-made mappings of Matrix-Matrix Multiplication, Fast Fourier Transform, Finite Impulse Response Filter (FIR Filter), and Stencil Codes, which guided the early design process of the CGRA's topology and of the computational requirements of the Processing Elements (PEs).

1 Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 283321772
We proceed as follows: Section II summarizes CGRAs which can process Floating Point values and which, because of this ability, might process HPC applications. Then we introduce our architecture in Section III, before presenting an in-depth view of our PEs in Section IV. In Section V, we explain how we test our CGRA and how it is classified for better comparison with other CGRAs. Section VI concludes this paper.
II. RELATED WORK
FloRA [2] explicitly examines the possibility to process Floating Point values by using two PEs capable of integer operations only: one PE calculates the mantissa and the other one the exponent. The joint process is steered by a PE-internal Finite State Machine (FSM). Besides that, the 8×8 CGRA can be configured such that a pipeline is established, where a pipeline stage corresponds to one row of PEs. It is even possible to map applications which are executed speculatively. In contrast to FloRA, our CGRA provides full Floating Point Units in each PE.
Dynamic Synthesized Execution Resources (DySER) [3] is an 8 × 8 CGRA which is integrated into the OpenSPARC processor pipeline as a functional unit. It is a heterogeneous array which
supports both integer and Floating Point operations. Unlike
DySER, our CGRA is loosely coupled to the host system, and
hence, the host-CPU can concurrently process other tasks.
In [4], a Scheduler for a CGRA integrated in an AMIDAR
processor [5] is proposed, capable of mapping nested loops,
control flow, and double-precision Floating Point operations.
This CGRA provides a global context counter to control the
current contexts in each PE, whereas our PEs have individual
context counters.
The commercial CGRA by Wave Computing [6] is specifically designed to accelerate the training and inference of Deep Neural Networks (DNNs). Because of the simplistic PE design (only nearest-neighbor connections, accumulator machine), it scales very well and compensates the lack of individual PE computing power with the sheer number of PEs (16 clusters with 32 × 32 PEs each). Its clock also runs at a frequency of 6.7 GHz.
III. THE CGRA
Our CGRA is a co-processor template designed as a SystemC simulation model which is connected to a host system via Direct Memory Access (DMA). A block diagram of a 4 × 4 instance is given in Figure 1. The CGRA itself consists of the memory controller and an array of PEs (a 4 × 4 array in this case). The host system comprises, highly simplified, the Host-CPU and a memory. Directed edges represent directed connections, undirected edges represent bidirectional connections. Thin edges are 32 bits wide to transmit single-precision Floating Point numbers, and bold ones are multiples of 32 bits wide. The thin edges may be extended to a width of 64 bits in the future to support double-precision Floating Point numbers.

Fig. 1. Integration of our co-processor in a Host-System
The co-processor is either idle, in the configuration phase, or in the processing phase. It can be invoked by sending the start address of the configuration to the according address in the contiguous address space (Memory-Mapped I/O). When it
receives this start address, it loads the PE contexts and data
address components from the main memory. The PE contexts
are forwarded to the according PEs, and the data address
components remain in the memory controller. As soon as
the co-processor has finished the configuration, the processing
phase begins. In this phase, it consumes and produces data
(Streaming). After data processing, the co-processor returns to the idle state, unless it is reconfigured.
Besides the long-lines from the memory controller, each PE is connected to its eight nearest neighbors, i.e. they can exchange data using a handshake protocol. The incoming long-lines from the memory controller to the rows and columns of the PE array can only be written by the memory controller. In other words, the PEs cannot use them for data exchange. For the outgoing long-lines from the PEs to the memory controller, it is the other way around, i.e. the PEs can only write to but not read from them. The memory controller can send data to specific PEs as well as broadcast to a whole row or column. For handling the shared output long-lines from the PEs, the PEs send valid signals to the memory controller, and the memory controller grants write access to one PE in a row.
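For illustration, a minimal C++ sketch of such a per-row grant decision is given below; the fixed-priority policy, the names, and the array size are our assumptions, since the arbitration scheme is not detailed here.

    #include <array>
    #include <optional>

    // Hypothetical sketch: every PE in a row raises a valid flag when it
    // wants to write to the shared output long-line, and the memory
    // controller grants write access to exactly one of them per cycle.
    // A fixed-priority policy (lowest column index first) is assumed.
    constexpr int kCols = 4;

    std::optional<int> grant_for_row(const std::array<bool, kCols>& valid) {
        for (int col = 0; col < kCols; ++col)
            if (valid[col]) return col;  // grant the first requesting PE
        return std::nullopt;             // no PE requests a write this cycle
    }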
If a PE has to wait for an input value before continuing to process contexts, this PE is undersupplied. If a PE wants to send a value to another one, but the receiving PE signals the sending PE to wait, the sending PE experiences back-pressure. It has to wait until the receiving PE has consumed the preceding input value and is ready to accept a new one. Both undersupply and back-pressure might be signs of unbalanced context loads between the PEs.
A process terminates when either the memory controller does not generate more addresses (hence, all results have supposedly been written back to memory), the PEs encounter the END instruction and signal this to the memory controller, or the computation is interrupted by new configuration data.

IV. THE PROCESSING ELEMENT
First of all, we distinguish between a PE Setup and a
PE Configuration. The individual PE Setups define which
Functional Units (FUs) are implemented in hardware in the
specific PE, i.e. which Floating Point or integer operations
are supported. This enables us to generate homogeneous
and heterogeneous CGRAs. For our simulator, the PEs are
designed as templates, i.e. the setups can easily be changed
for each PE. Of course, the setups do not change during a
simulation run. In addition, multiple FUs can be defined for
one PE. The PE Configurations define which contexts are
executed during the simulation.
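To illustrate the distinction, the following C++ sketch models a PE Setup as a list of FU descriptors; all type and field names (FuDesc, latency, initiation) are hypothetical and do not reflect the simulator's actual template interface.

    #include <string>
    #include <vector>

    // Hypothetical sketch of a PE Setup: it fixes, before a simulation run,
    // which FUs a PE implements. Multiple FUs per PE are allowed.
    struct FuDesc {
        std::string op;   // supported operation, e.g. "FADD", "FMUL", "IADD"
        int latency;      // pipeline latency in clock cycles
        int initiation;   // initiation interval in clock cycles
    };

    struct PeSetup {
        std::vector<FuDesc> fus;
    };

    // A heterogeneous example: one Floating Point PE and one integer PE.
    PeSetup fp_pe  { {{"FADD", 3, 1}, {"FMUL", 4, 1}} };
    PeSetup int_pe { {{"IADD", 1, 1}, {"ISHL", 1, 1}} };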
The general structure of the PE is depicted in Figure 2. Basically, it contains a RISC-like processor pipeline with a context fetch stage (similar to an instruction fetch stage), one or more execution stages, and a write-back stage. The bold arrows represent wires that are multiple bits wide (mostly data), and the thin arrows represent one-bit wires used for control signal transmission. The block shapes indicate how the according modules behave: rounded rectangles represent combinational elements. Normal rectangles with fine lines are combinational memories. Bold rectangles are standard registers. Rounded rectangles with bold lines are complex modules with internal states. The trapezoidal blocks labeled MUX are multiplexers. Dashed blocks are optional registers which might be implemented for timing closure.

Fig. 2. PE Overview
The capabilities of this PE template follow from our design goals: firstly, we want to support several FUs with differing latencies and possibly differing initiation intervals. Secondly, we provide an Internal Register File. Note that, due to the processing pipeline within the PE, we have to treat data and structural hazards. This is done in hardware to ease the compiler implementation. Thirdly, we also support branches as contexts. By doing so, each PE can process small loops whose loop bodies consist of differing operations.
A. Context Fetch
The Program Counter (PC) addresses one of the entries in the Context Memory (CM). The PC will be incremented if the current opcode is not END and the PE is neither stalled by back-pressure from other PEs nor by local contexts waiting for data (undersupply). The multiplexer selecting the actual next
PC value is controlled by two signals: J and branch invalid.
It handles the static branch prediction and roll-back to the old
PC value. Precisely, the PE statically predicts that a branch
will be taken for efficiently handling loops: If J is false and
there was no invalid branch before, the incremented PC will be
forwarded. If J is true and there was no invalid branch before,
the target address from the context is forwarded as the next
PC. But if there was an invalid branch before, the Back Up PC
is forwarded. The Back Up PC will store the incremented PC,
if a branch context is detected, no preceding branch was invalid, and, in the case of a conditional branch, the according operands are available. These context fetching strategies are
mostly handled by the decoder.
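The next-PC selection described above can be summarized by the following C++ sketch; the signal names follow the text, while the Back Up PC update is simplified to the unconditional case and is therefore only an approximation.

    #include <cstdint>

    // Hypothetical sketch of the next-PC multiplexer. j marks a branch
    // context (statically predicted taken); branch_invalid marks a branch
    // whose prediction turned out to be wrong.
    uint32_t next_pc(uint32_t pc, uint32_t target, uint32_t backup_pc,
                     bool j, bool branch_invalid) {
        if (branch_invalid) return backup_pc;  // roll back to the old PC
        if (j)              return target;     // predict taken: jump to target
        return pc + 1;                         // sequential context fetch
    }

    // The Back Up PC latches the fall-through address when a branch context
    // is fetched, so that a misprediction can be rolled back later.
    uint32_t update_backup_pc(uint32_t pc, bool is_branch, uint32_t old_backup) {
        return is_branch ? pc + 1 : old_backup;
    }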
B. Input Registers
The Input Registers are the interface for all incoming operands from the neighboring PEs and the long-lines. They are managed such that every input register accepts values from exactly one neighboring PE and every input value is read exactly once.
The PEs process input signals as streams, i.e. they communicate with a handshake protocol. If an input signal is valid (valid = 1) and the receiving Input Register does not signal to wait (wait = 0), then the value will be accepted. In any other case, the input value will not be accepted.
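A minimal C++ sketch of this write-once-read-once handshake is shown below; the InputRegister type and its method names are illustrative and not the simulator's actual interface.

    #include <cstdint>
    #include <optional>

    // Hypothetical sketch of an Input Register with valid/wait handshaking.
    struct InputRegister {
        std::optional<uint32_t> slot;  // empty, or holding exactly one value

        // wait = 1 while the previous value has not been consumed yet.
        bool wait() const { return slot.has_value(); }

        // Called by the sending PE; returns true if the value was accepted
        // (valid = 1 and wait = 0), i.e. each value is written exactly once.
        bool offer(bool valid, uint32_t data) {
            if (!valid || wait()) return false;
            slot = data;
            return true;
        }

        // Called by the local decoder; the value is read exactly once.
        std::optional<uint32_t> consume() {
            auto value = slot;
            slot.reset();
            return value;
        }
    };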
C. Decoder
The decoder interprets the context and dispatches it to the respective FU, and it checks whether the required operands are available, i.e.
• whether a required Input Register value has not been read yet, by looking at the Input Wait signals (see Section IV-B), or
• whether a required Internal Register File value is valid, by looking at the Data Hazard Detector.
A context may proceed to an FU if all of the following hold:
• The operands (both sources and target) required by the context are available (i.e. the respective wait is set and no data hazard is detected).
• No back-pressure is present.
• If applicable, the condition of the branch taken before has been evaluated as valid.
• The FU is not busy.
Only one context at a time may be promoted to execution, i.e.
at most one of the Enable Signals for the FUs is set by the
decoder at the same time step.
Data and structural hazards are treated in a similar way as
in [7].
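The issue check can be condensed into the following C++ sketch; the ContextStatus fields merely mirror the four conditions listed above and are not taken from the actual implementation.

    // Hypothetical sketch of the decoder's issue check: a context is
    // promoted to execution only if all four conditions hold, so at most
    // one FU enable signal is set per time step.
    struct ContextStatus {
        bool operands_ready;   // waits set, no data hazard detected
        bool back_pressure;    // the target output cannot accept a result
        bool branch_resolved;  // a preceding branch condition evaluated valid
        bool fu_busy;          // structural hazard on the required FU
    };

    bool may_issue(const ContextStatus& c) {
        return c.operands_ready && !c.back_pressure
            && c.branch_resolved && !c.fu_busy;
    }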
TABLE I
ARCHITECTURE CLASSIFICATION ACCORDING TO [10]

Property              Manifestation
Structure
  Network Type        Direct connection
  Granularity         32-bit
  Multi-granular      No
  Register files      Yes
  Memory hierarchy    Scratchpad
Control
  Oper. Scheduling    Dynamic
  Part. Reconfig.     Dynamic
  Network Config.     Dynamic
  Custom operations   No
Integration
  Integration         Accelerator
  Coupling            Loosely coupled
  Resource sharing    No
Tooling
  Compiler            Imperative language
  Place and Route     Not yet
  DSE                 Not yet
D. Write-Back
In the Write-Back stage, a PE has to handle three things:
The Output Registers (ORs), the Internal Register File, and
the Data Hazard Detector. When an operation has finished
computation, the highest bit of the target address provided
by the Structural Hazard Detector enables writing the result
either to the ORs or to the Internal Register File. In the meantime, the respective accompanying control signals are set (set valid or reset the Data Hazard Detector). In the next clock
cycle, the result is either visible to another PE or to the next
context.
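The routing decision can be sketched in C++ as follows; the 8-bit target address and the polarity of the select bit are assumptions made for illustration only.

    #include <cstdint>

    // Hypothetical sketch of the write-back routing: the highest bit of the
    // target address selects between the Output Registers (visible to other
    // PEs) and the Internal Register File (visible to the next context).
    void write_back(uint8_t target, uint32_t result,
                    uint32_t* output_regs, uint32_t* internal_regs) {
        uint8_t index = target & 0x7F;      // low bits: register index
        if (target & 0x80)                  // assumed polarity of the MSB
            output_regs[index] = result;
        else
            internal_regs[index] = result;
    }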
An OR also senses back-pressure when it is about to be
written while being notified to wait by the receiving PE.
This happens when the receiving PE has not processed the preceding value. In this case, the back-pressure detected
signal is set until the receiving PE resets the wait signal.
V. EVALUATION
In order to verify the functionality of our CGRA, we simulate our SystemC models with the Open SystemC Initiative
(OSCI) simulator. In addition, for debugging and demonstration purposes, we implemented a visualization of our CGRA
simulation in Java.
For a Floating Point FIR Filter, each PE can hold one coefficient as well as one tap, and forwards the most recent input value and the multiply-accumulated value to the next PE, enabling the CGRA to process any FIR Filter whose order is less than or equal to the number of PEs plus one. The simulation shows that the hand-mapped filter generates a result every ten clock cycles.
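The per-PE behavior of this hand mapping can be sketched in C++ as follows; the FirPe type is illustrative and not part of the simulator.

    #include <utility>

    // Hypothetical sketch of one FIR PE: it holds one coefficient, adds its
    // product to the partial sum from the previous PE, and forwards both
    // the unchanged sample and the updated partial sum to the next PE.
    struct FirPe {
        float coeff;

        std::pair<float, float> step(float sample, float partial_sum) const {
            return { sample, partial_sum + coeff * sample };
        }
    };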
Also, for a comparison of CGRAs, several surveys such
as [8], [9], and [10] propose according points of comparison.
In [10], the most refined categories can be found which we
adapted to our CGRA in Table I.
VI. CONCLUSION
In summary, we presented the architecture of our novel
CGRA designed for HPC applications. We briefly explained
the interface to the Host-System, and covered the PEs and the
PE array in detail.
In the future, we will elaborate in depth on the interaction of the CGRA and the Host-System, as well as show how HPC applications can be mapped onto the CGRA using a software stack, which is the topic of the second PhD project funded within this project. In addition, we will benchmark the system in order to optimize the chip area, power consumption, and performance of an Application-Specific Integrated Circuit implementation.
Due to the write-once-read-once policy for the data exchange between the PEs, and due to the resolution of possible conflicts with other operands within a PE (see Section IV-C), a consecutive value to be written to the same input has to wait at least one cycle in practice. In our experience so far, this does not pose a performance issue, as the contexts of the PEs rarely need to communicate with each other in every clock cycle, and, due to memory latencies, we do not expect to receive new input values in every clock cycle either. Future evaluations will show whether this holds; otherwise, the communication protocol will be revised.
REFERENCES
[1] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and
D. Burger, “Dark silicon and the end of multicore scaling,” in 2011
38th Annual Int. Symp. on Computer Architecture (ISCA), June 2011,
pp. 365–376.
[2] D. Lee, M. Jo, K. Han, and K. Choi, “FloRA: Coarse-grained reconfigurable architecture with floating-point operation capability,” in Proc. of
the 2009 Int. Conf. on Field-Programmable Technology, FPT’09, 2009,
pp. 376–379.
[3] V. Govindaraju, C. H. Ho, T. Nowatzki, J. Chhugani, N. Satish,
K. Sankaralingam, and C. Kim, “DySER: Unifying functionality and
parallelism specialization for energy-efficient computing,” IEEE Micro,
vol. 32, no. 5, pp. 38–51, 2012.
[4] T. Ruschke, L. J. Jung, D. Wolf, and C. Hochberger, “Scheduler
for Inhomogeneous and Irregular CGRAs with Support for Complex
Control Flow,” 2016 IEEE Int. Parallel and Distributed Processing
Symp. Workshops (IPDPSW), pp. 198–207, 2016.
[5] S. Gatzka and C. Hochberger, “The AMIDAR class of reconfigurable
processors,” J. of Supercomputing, vol. 32, no. 2, pp. 163–181, 2005.
[6] C. Nicol, “A Coarse Grain Reconfigurable Array (CGRA) for Statically
Scheduled Data Flow Computing,” Wave Computing, Inc., Tech. Rep.,
2016.
[7] J. Thornton, “Parallel operation in the control data 6600,” AFIPS ’64
(Fall, part II): Proc. of the October 27-29, 1964, fall joint computer
conference, part II: very high speed computer systems, 1964.
[8] G. Theodoridis, D. Soudris, and S. Vassiliadis, “A Survey of Coarse-Grain Reconfigurable Architectures and CAD Tools,” in Fine- and Coarse-Grain Reconfigurable Computing. Springer, 2008, pp. 89–149.
[9] B. De Sutter, P. Raghavan, and A. Lambrechts, “Coarse-grained reconfigurable array architectures,” in Handbook of Signal Processing Systems:
Second Edition, 2013, pp. 553–592.
[10] M. Wijtvliet, L. Waeijen, and H. Corporaal, “Coarse grained reconfigurable architectures in the past 25 years: Overview and classification,”
in Proc. 16th Int. Conf. on Embedded Computer Systems: Architectures,
Modeling and Simulation, SAMOS 2016, 2017, pp. 235–244.