Dynamic Communication in a Coarse Grained Reconfigurable Array

Robin Panda, Scott Hauck
Dept. of Electrical Engineering
University of Washington
Seattle, WA 98195
{robin, hauck}@ee.washington.edu
Abstract - Coarse Grained Reconfigurable Arrays (CGRAs)
are typically very efficient for a single task. However, all
functional units are required to perform in lock step, wasting
resources and making complex programming flows difficult.
Massively Parallel Processor Arrays (MPPAs) excel at
executing unrelated tasks simultaneously, but limit the amount
of resources dedicated to a single task. We propose an
architecture with an MPPA’s design flexibility and a CGRA’s
throughput, capable of processing and transferring data
according to a pre-compiled schedule, with dynamic transfers between
components. Alternative interconnect strategies are compared
for silicon area cost and power utilization.
Keywords - CGRA; MPPA; flow control; interconnect
I. INTRODUCTION
Field programmable gate arrays (FPGAs) have long
been used for accelerating compute intensive applications
without the cost and difficulty of custom ASICs. However,
their ability to be programmed for any logic function at the
granularity of individual bits is unnecessary for many
applications. If we restrict ourselves primarily to standard logic
operations performed on entire words of data, logic units
can be used instead of LUTs. If the routing also operates on
a word granularity, area can be saved by not storing or
processing redundant configuration information.
Coarse Grained Reconfigurable Arrays (CGRAs)
attempt these word optimizations with a sea of ALUs
connected with a FPGA-like interconnect that operates on a
word’s worth of data at a time. The reduction in
configuration information allows for time-multiplexing
several configurations onto the same hardware, increasing
hardware utilization. An alternate method of wiring
computational elements together is a Massively Parallel
Processor Array (MPPA). This is a network of more
independent processors and their memory which
communicate by passing messages.
CGRAs are good at using many processors for a single
task, but the need for the whole device to operate in lock
step is inefficient for control and handling multiple tasks in
an application. MPPAs are great for control and a large
number of tasks or applications, but are less efficient for
pipelined tasks and cannot automatically spread operations
across hardware. Our goal is to design an architecture that
combines the benefits of both.
II. WHAT IS A CGRA
Our base CGRA [1] is composed of word-width
functional units, including ALUs, shifters, or other special-purpose processing elements, connected with a word-oriented interconnect. It also includes LUTs and single-bit
communication channels to form a basic FPGA within the
architecture for control and bitwise logic. Memory is local
like in an FPGA with no native coherency mechanisms for
shared memory. Block memories are managed by the
application code, while registers required for timing and
synchronization are managed by the CAD tools.
Each cycle, the functional units receive their configuration:
the opcodes to execute and the addresses to request from the
register banks. These opcodes are scheduled by
the compiler using modulo scheduling [2] for automatically
pipelining user code. The ALUs in a CGRA cannot change
contexts independently. In addition to sharing the context
selection mechanism with other hardware, they must all
follow the same schedule to ensure the proper
synchronization of operations. This limits independence of
separate threads and therefore the thread level parallelism
that can be used. It is still a very good architecture for the
parallelism and pipelining extracted by the compiler.
The interconnect is handled similarly. Switchboxes
(Fig. 1) connect 32-bit buses to other switchboxes and
processing elements. An incoming bus (A) will fan out to
multiplexers (B, C) in all other directions. Configuration
memories (D) cycle through the configurations to select the
appropriate inputs to the multiplexer B. After passing
through the multiplexer, the bus is registered before being
driven across the long wires to the next switchbox.
Figure 1. Switchbox schematic
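
As a concrete illustration of the scheduled interconnect, the following minimal Python sketch models one switchbox output cycling through its configuration memory each clock; the class name, the bus ordering, and the three-context schedule are illustrative assumptions rather than Mosaic specifics.

class SwitchboxOutput:
    """One output of the switchbox in Fig. 1: a multiplexer (B), its
    configuration memory (D), and the pipeline register that drives the
    long wires. A behavioral sketch under assumed names."""

    def __init__(self, schedule):
        # schedule[phase]: which incoming bus the mux selects in that
        # phase of the initiation interval (the configuration memory D).
        self.schedule = schedule
        self.register = 0          # pipeline register after the mux
        self.cycle = 0

    def tick(self, inputs):
        # Drive last cycle's registered word onto the long wires, then
        # capture the input chosen by the current configuration phase.
        out = self.register
        phase = self.cycle % len(self.schedule)
        self.register = inputs[self.schedule[phase]]
        self.cycle += 1
        return out

# Three contexts time-multiplex the same physical mux and register.
north_out = SwitchboxOutput(schedule=[0, 2, 1])    # pick W, S, E in turn
for cycle in range(6):
    buses = [10 + cycle, 20 + cycle, 30 + cycle]   # W, S, E arrivals
    north_out.tick(buses)
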
A CGRA’s cyclic schedules waste hardware because
data-dependent execution must use predication instead of
branching. When executing multiple
applications, all tasks are locked to the same execution rate,
slowing down some tasks and wasting more hardware on
others. Some tasks may process data at significantly
different rates [3], requiring careful optimization by the
programmer.
III. MPPAS
In the Ambric MPPA [4], the ALUs from the CGRA are
replaced with small processors with full, independent
branching. This makes it well suited for small control tasks.
However, since the processors and network are no longer
executing in a lock-step manner, the rest of the architecture
is more complicated. The processors must be individually
programmed. The network routing can still be configured at
compile time, but special sets of registers (Fig. 2) are
needed to coordinate operations between different
processors. The interconnect is flow-controlled, which
allows processors to know when to operate on the data in
the channel and can retain data in flight if the processor or
network downstream is not ready to receive data.
Fig. 2 represents one stage of the communication
network. In the absence of congestion, data is sent from the
upstream stage via Data In/Valid In, held in register B, and
sent out via Data Out/Valid Out. If the downstream stage is
unable to receive data, it will deassert Ready In, and this
stage will maintain the data in B. It will alert the upstream
stage via Ready Out. However, since the upstream stage
may have already forwarded a new value, this stage includes
a second register C to hold this value.
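
The following Python sketch is a behavioral model of one such stage; the register names B and C follow Fig. 2, while the handshake timing details are modeling assumptions rather than Ambric's actual circuit.

class FlowControlStage:
    """One stage of a flow-controlled channel (Fig. 2). B is the primary
    register; C catches the word already in flight when the downstream
    stage stalls, since Ready Out reaches the upstream stage one cycle
    late. A behavioral sketch only."""

    def __init__(self):
        self.b = None            # primary register (None = not valid)
        self.c = None            # second register, used only during stalls
        self.ready_out = True    # upstream reacts to this one cycle later

    def tick(self, data_in, valid_in, ready_in):
        data_out, valid_out = self.b, self.b is not None
        accept = self.ready_out and valid_in   # sent on last cycle's ready

        if ready_in and valid_out:             # downstream consumes B
            self.b, self.c = self.c, None      # drain C first
        if accept:                             # catch the arriving word
            if self.b is None:
                self.b = data_in
            else:
                self.c = data_in               # C is free here by construction

        self.ready_out = self.c is None        # stall upstream while C holds data
        return data_out, valid_out

When the downstream stage stalls, B retains its word, the one word already in flight lands in C, and Ready Out stops the upstream stage before anything can be lost.
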
Many applications do not easily break down into the
hundreds of sub-tasks required for an MPPA in a human-comprehensible manner. Increasing throughput without
these sub-tasks requires either heavyweight processors or
dedicated hard blocks. Even when broken down,
communication overhead can be a problem [6] when
executing the tiny subtasks required for high throughput.
Since hard blocks go unutilized in many applications and
large, out-of-order processors are not efficient uses of die
area and power, another solution is required.
IV. ARCHITECTURES
Ideally, a CGRA could be divided into a few regions,
called control domains, that operate on different schedules. If at
least some of the control domains have the ability to
conditionally branch like MPPAs, control tasks could be
implemented in a compact manner and eliminate costly
predication within the main computation blocks. Multiple
tasks, from the same application or not, could operate
independently of one another, and each task would be
spread over its maximum utilizable area. This spreading can
be done automatically via CGRA-style tools, allowing the
computation to be written in a logical, integrated manner
that is easier for programmers to understand and compose.
Figure 2. Ambric register set for dynamic flow control (adapted from [5])
This hybrid will operate on a fixed schedule in large
processing blocks, but also have smaller control flow
oriented tasks that communicate dynamically with each
other and the computation blocks. Because the processing
blocks take up more area, most of the architecture will be
operating in the scheduled mode. To be most efficient, the
system will need to be able to allocate resources to large,
CGRA-style blocks or small, MPPA-style tasks on an
application-by-application basis.
For such a system, independent scheduled and flow-controlled networks would be inefficient. Instead, there
should be flow controlled communications on underlying
scheduled resources, or scheduled communications on
underlying flow controlled resources. Since much of the
array is likely to be dedicated to CGRA-like computations,
having the underlying hardware be scheduled is likely to be
the most efficient. This paper will therefore explore
adapting a scheduled network for dynamic communication,
using the existing single bit resources for implementing the
data valid and backpressure signals.
A. Dedicated and borrowed storage
The most straightforward way to shoehorn the
handshake circuit of Fig. 2 into the switchbox of Fig. 1 is by
adding dedicated shadow registers in parallel with each
existing register, like the gray register at A in Fig. 3. However,
there are already more registers conveniently located in the
same switchbox that might be borrowed. Each wire entering
the switchbox fans out to multiplexers headed in every
direction. For example, Fig. 3 shows a track from the West
fanning out to the North and East. Each multiplexer is
followed by a register, which can store the extra data during
a stall if its mux is programmed properly. By adding the
gray wire from B’s register to A, the data captured at B
during a stall can be recovered when normal operation resumes.
Figure 3. Fanout inside a switchbox
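
At a behavioral level, the borrowing can be sketched in Python as below; the direction names and the explicit stall/resume methods are illustrative assumptions, since in hardware the same effect comes purely from mux programming.

class BorrowedRegisterPair:
    """Structural sketch of borrowing (Fig. 3): when one output stalls,
    a perpendicular output's register is reprogrammed to catch the word
    arriving from the West, and the gray feedback wire returns it when
    the stall clears. Names are illustrative."""

    def __init__(self):
        self.north = None    # stalled channel's register (A in Fig. 3)
        self.east = None     # perpendicular register borrowed as shadow (B)

    def stall_cycle(self, west_in):
        # North holds its word; East's mux selects the West track instead
        # of its own traffic, capturing the in-flight word.
        self.east = west_in

    def resume_cycle(self):
        # North drives its held word and refills from East via the
        # feedback path; East returns to normal routing.
        out, self.north, self.east = self.north, self.east, None
        return out
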
This design should be more efficient than adding
special-purpose shadow registers. When registers are dedicated
to the task, every channel equipped with them pays the register
overhead of supporting the handshaking, even if it is never
used; this is very inefficient when handshaking is rare. When
borrowing registers from another channel, the primary
overhead when the handshaking capability is unused comes
from adding the additional input to each multiplexer and the
logic gates for the handshake. However, each channel
actually using handshaking will be wasting a set of
perpendicular driver buffers that are quite large. Therefore,
it is inefficient when handshaking is heavily used.
B. Half-bandwidth
The extra registers are required because it takes a cycle
to notify the upstream register after the downstream register
stops; since the upstream register may have already sent a
value, storage is needed to save the incoming value. An
alternative is to require a register to hold its value until the
receiving register is guaranteed empty. Thus, if a stall
occurs, any value in flight already has an empty register
ahead of it.
While this means we do not need to add storage to the
interconnect, it only allows a data word to be sent every
other clock cycle. Fortunately, it is rare to have an initiation
interval of 1 in a CGRA [2]. Similarly, meaningful loops on
an MPPA processor generally contain more than one
instruction, like in [3] and in [6]. Therefore, in many cases it
is impossible to send or receive data on every clock cycle
anyway, and a half-bandwidth channel is sufficient.
The handshake must be modified as follows: If a register
set holds valid data, then it must deassert ready to prevent
upstream data from being lost in case the set downstream
from it becomes no longer ready. If a set does not hold valid
data, then it is ready to receive data. Thus, the ready signal
output is simply the inverse of its own valid bit. When used
for dynamic flow control, a half-bandwidth channel only
uses up one 32-bit bus and two single bit routes, the
minimum for implementing a handshake.
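
A behavioral Python sketch of this rule, under the same modeling assumptions as the earlier stage sketch (registered ready, illustrative names):

class HalfBandwidthStage:
    """One register set of a half-bandwidth channel (section IV.B). The
    exported ready is simply the inverse of the stored valid bit, so no
    skid storage is needed; because ready drops for a cycle after every
    accepted word, at most one word moves every other clock. A sketch,
    not the actual switchbox logic."""

    def __init__(self):
        self.data = None         # the single register set (None = empty)
        self.ready_out = True    # registered; upstream sees it next cycle

    def tick(self, data_in, valid_in, ready_in):
        data_out, valid_out = self.data, self.data is not None
        if valid_out and ready_in:          # downstream register is empty
            self.data = None                # the word moves on
        if self.ready_out and valid_in:     # we had advertised empty
            self.data = data_in
        self.ready_out = self.data is None  # ready = not valid
        return data_out, valid_out
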
C. Interleaved channels
If a sending task can produce data every cycle and its
receiving task can receive every cycle, they may benefit
from a full bandwidth channel. If either is slower, eventually
the storage available in the channel will be exhausted and
the faster task will have to periodically stall until its data
rate matches its partner’s. Thus, since full-rate channels are
rare, we would like to avoid adding hardware to the
architecture specifically for these cases.
This need for a full-rate channel can be satisfied by
alternating between two half-rate channels within the
processing control domain. The sequence of words A, B, C,
D would be sent with A and C on one channel and B and D
on the other. Note that interleaving two half-bandwidth
channels is roughly the same cost as borrowing registers,
only requiring two extra single-bit channels. This cost is
only borne by channels that actually require this full
bandwidth, since half-bandwidth and interleaved full-bandwidth channels can coexist in the same mapping.
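
In Python, the interleaving itself reduces to the sketch below; plain lists stand in for the two half-bandwidth channels, and the helper names are illustrative.

def send_interleaved(words, ch_even, ch_odd):
    """Alternate outgoing words: A, C, ... on one half-rate channel and
    B, D, ... on the other, so the pair sustains one word per cycle."""
    for i, word in enumerate(words):
        (ch_even if i % 2 == 0 else ch_odd).append(word)

def receive_interleaved(ch_even, ch_odd):
    """Reassemble the original order by alternating reads."""
    out = []
    while ch_even or ch_odd:
        if ch_even:
            out.append(ch_even.pop(0))
        if ch_odd:
            out.append(ch_odd.pop(0))
    return out

even, odd = [], []
send_interleaved(["A", "B", "C", "D"], even, odd)
assert even == ["A", "C"] and odd == ["B", "D"]
assert receive_interleaved(even, odd) == ["A", "B", "C", "D"]
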
D. Minimal changes
The interconnect itself does not need to perform the
handshake; it can be performed by the proper programming
of the regular compute elements. To avoid using flow
control and buffering within the channel, we instead
pipeline the channel without stalls, and have a large enough
FIFO at the end of the channel to catch all sent values when
the receiving task stalls. If the FIFO begins to fill, a
ready/stall signal is sent via a separate one-bit channel to
the sender to throttle the data in the stream.
Ready must be predicted many clock cycles in the future
to account for the longer propagation delay between where
ready is generated and where the signal takes effect. For a
channel of length N, the FIFO deasserts ready when fewer
than 2N words of space remain: it will take N clock cycles for
ready to propagate to the sending device and, at that time,
there may be N more words in the channel.
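
A Python sketch of the receiver-side FIFO and its predicted ready; the capacity is an illustrative assumption, and the 2N threshold follows directly from the argument above.

class ChannelEndFifo:
    """Receiver-side FIFO for a stall-free pipelined channel of length N
    (section IV.D). Deasserting ready takes N cycles to reach the sender
    and up to N more words may already be in flight, so ready must drop
    once fewer than 2N entries remain free."""

    def __init__(self, capacity, channel_length):
        self.fifo = []
        self.capacity = capacity
        self.n = channel_length

    @property
    def ready(self):
        # Predicted ready: at least 2N free entries must remain.
        return self.capacity - len(self.fifo) >= 2 * self.n

    def push(self, word):
        # The FIFO logic itself must never stall; the channel cannot
        # retain words in flight, so an overflow would lose data.
        assert len(self.fifo) < self.capacity, "in-flight word lost"
        self.fifo.append(word)

    def pop(self):
        return self.fifo.pop(0) if self.fifo else None

fifo = ChannelEndFifo(capacity=16, channel_length=4)  # 2N = 8 entries of slack
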
In practice, an entire memory block would be dedicated
to the receiver FIFO. The memory cannot be shared with
any other functions because the full memory bandwidth is
potentially required. Note that special provision must be
made to ensure the FIFO logic itself does not stall.
Otherwise, data already in flight in the channel will be lost.
In fact, for all of the flow-controlled networks in this paper
the network itself cannot stall with the surrounding logic;
since these channels may traverse completely independent
tasks, allowing those tasks to stall the channels can easily
produce deadlock.
E. Tokens
For custom handshake processors, the memory required
can be reduced even further by calculating when to stall on
the sending side. A local register stores the count of free
slots in the FIFO. Every time a new word is sent by the
sender, the count of storage remaining is decremented.
Whenever the FIFO is read by the receiver, it sends a
single-bit token back upstream; when the token arrives, the
counter is incremented.
When this counter reaches zero, the sending task must stall
or its data may arrive at a full FIFO. Because of the reduced
latency between where the stall signal is generated and
where it stops the sender, this design can operate with fewer
than 2N entries in the FIFO by throttling the send rate so no
more data is in flight than can be handled. Therefore, it is
more versatile when the FIFO memory is limited.
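
A Python sketch of this credit scheme; the class and method names are illustrative, and the single-bit token is modeled as a method call.

class TokenSender:
    """Sender-side credit counter (section IV.E). The counter starts at
    the FIFO's capacity; each send spends one credit and each returned
    token refunds one, so the sender stalls exactly when the FIFO might
    be full."""

    def __init__(self, fifo_slots):
        self.credits = fifo_slots          # free FIFO slots known locally

    def try_send(self, word, channel):
        if self.credits == 0:              # FIFO may be full: must stall
            return False
        self.credits -= 1                  # one slot is now spoken for
        channel.append(word)
        return True

    def receive_token(self):
        self.credits += 1                  # the receiver freed one slot

class TokenReceiver:
    """Receiver side: every read releases a token back upstream."""

    def __init__(self):
        self.fifo = []

    def read(self, sender):
        if not self.fifo:
            return None
        sender.receive_token()             # single-bit token, one per read
        return self.fifo.pop(0)

Because the credit count can never exceed the free space in the FIFO, overflow is impossible regardless of FIFO size: a small FIFO simply throttles the send rate rather than losing data.
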
V. RESULTS
To evaluate these designs, we compare their gate
area and power consumption. In our base architecture, delay
in the interconnect is dominated by the long wires between
switchboxes and is not significantly different between the
designs. We calculate the area of the logic gates added to a
single scheduled channel, from one switchbox to another, to
provide the dynamic flow control capability. To account
for the existing logic used by the design we add the area of
the scheduled resources for each dynamic channel actually
used. For power, we calculate the energy used when an
individual bit on the bus toggles. This is added to the energy
used by the handshake logic for those channels using the
handshake.
The areas and powers are calculated using the models
developed in a 65nm process for the Mosaic project [1]. The
Mosaic CGRA is used for wire length and driver
requirements. Because both the built FIFO and the token
counter are only required once per channel, their expenses
are amortized over the average channel length from the
Mosaic benchmark suite. To put the calculated costs of
each design in context, the average area and energy
breakdowns of the various components in Mosaic are used
to estimate the final effect on full chip area and power. The
full chip power and area are graphed versus the percentage of
the dynamic-capable channels operating in dynamic mode. This
is repeated while varying the percentage of scheduled channels
capable of handshaking.
In Fig. 4, the lines labeled built represent a FIFO
composed of existing processing elements as described in
section IV.D. ½ BW is the design that operates at half the data rate.
Borrow uses registers from perpendicular channels. Dual
interleaves two half-rate channels for full rate. Extra uses
dedicated shadow registers. Token builds dedicated FIFO
hardware with token counters.
As not all channels need be able to operate dynamically,
we only add this capability to 25% of the channels; we then
plot the full chip area for each design as the percentage of
the dynamic channels used increases. For area, the ½-bandwidth design is the best in almost all situations. Full-bandwidth communication will still require another method.
If less than 20% of the available dynamic channels are in
use, borrowing registers from an adjacent channel is the
most efficient, with the dual interleaved channels a close
runner-up that will be superior if there are channels that
only need half bandwidth. However, as the usage increases,
such as when everything is operating like an MPPA, the
token method becomes the most area-efficient. If we
increase the percentage of handshake-capable channels
beyond 25%, the absolute areas increase, but the comparison
between the designs remains roughly the same.
The energy graph tells a different story. The ½-bandwidth design is one of the worst power-wise for large
channel utilization. The token counting FIFO has the best
energy performance for all utilizations. We can combine the
two graphs into the area-energy product. For this balance,
we see that the ½-bandwidth and token versions provide the
best results.
VI. CONCLUSION
The best implementation of a hybrid interconnect is
dependent on usage. For full bandwidth signaling with most
channels used, dedicated FIFO hardware with tokens is
ideal. However, in real applications this high utilization and
bandwidth is unlikely. Therefore, the best design is likely to
be the half-bandwidth channels with interleaving when full
bandwidth is required. To verify this, future work is needed to
develop benchmarks for this sort of hybrid architecture to
fully characterize the channel requirements for length,
bandwidth, and density. With appropriate CAD tool support,
the effects of stealing resources from a control domain can
be determined along with more precise cost estimates.
REFERENCES
[1] B. Van Essen, "Improving the Energy Efficiency of Coarse-Grained Reconfigurable Arrays," Ph.D. Thesis, University of Washington, Dept. of CSE, 2010.
[2] S. Friedman, A. Carroll, B. Van Essen, B. Ylvisaker, C. Ebeling, S. Hauck, "SPR: An Architecture-Adaptive CGRA Mapping Tool," ACM/SIGDA Symposium on Field-Programmable Gate Arrays, pp. 191-200, 2009.
[3] M. Haselman, N. Johnson-Williams, C. Jerde, M. Kim, S. Hauck, T. K. Lewellen, R. Miyaoka, "FPGA vs. MPPA for Positron Emission Tomography Pulse Processing," International Conference on Field-Programmable Technology, 2009.
[4] M. Butts, A. M. Jones, P. Wasson, "A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing," IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 55-64, April 2007.
[5] A. M. Jones, "Asynchronous communication among hardware object nodes in IC with receive and send ports protocol registers using temporary register bypass select for validity information," U.S. Patent 7409533, Aug. 5, 2008.
[6] R. Panda, J. Xu, S. Hauck, "Software Managed Distributed Memories in MPPAs," International Conference on Field Programmable Logic and Applications, 2010.
Figure 4. Normalized full chip area, energy, and area-energy product