Issues and Approaches to Coarse-Grain Reconfigurable Architecture Development

Ken Eguro and Scott Hauck
Department of Electrical Engineering
University of Washington
Seattle, WA 98195 USA
{eguro, hauck}@ee.washington.edu
Abstract
Although domain-specialized FPGAs can offer
significant area, speed and power improvements over
conventional reconfigurable devices, there are several
unique and unexplored design problems that complicate
their development. One source of these problems is that
designers often opt to replace more universal, fine-grain
logic elements with a specialized set of coarse-grain
functional units to improve computation speed and reduce
routing complexity. One issue this introduces is that it is
not obvious how to simultaneously consider all
applications in a domain and determine the most
appropriate overall number and ratio of the different
functional units. In this paper, we illustrate how this
problem manifests itself during the development of an
encryption-specialized FPGA architecture. We present
three algorithms that solve this problem by balancing the
hardware needs of the domain while considering
performance and area requirements. We believe these
concerns need to be addressed by future CAD tools in
order to develop more sophisticated application-specialized reconfigurable devices.
1 Introduction
While flexibility is an important feature of
reconfigurable devices, conventional FPGAs are simply
too generic to provide high performance in many
situations. General-purpose reconfigurable devices, while
well suited to small or irregular functions, typically suffer
a stiff penalty when implementing wide and complex
arithmetic operations. These types of functions need to be
built from too many small logical resources and end up
being spread across too general a routing structure to be
efficient.
However, if the range of applications that a device is
intended for is known beforehand, a designer can
specialize the logic, memory and routing resources to
enhance the performance of the device while still
providing adequate flexibility to accommodate all
anticipated uses. Common and complex operations can
be implemented much more efficiently on specialized
coarse-grain functional units, while routing and memory resources can be tuned to better reflect the requirements.
An example of such a specialized reconfigurable device is the RaPiD architecture [10], which was originally designed to implement applications in the DSP domain. This architecture consists of dedicated 16-bit multipliers, ALUs and RAM modules that are connected through a programmable and pipelined word-wise data bus. While this architecture clearly lacks much of the flexibility of a more conventional FPGA, [8] has shown that it successfully improves performance while minimally affecting usability, providing significant speed, area and power advantages across a wide range of DSP applications.
Although domain-specialized FPGAs such as RaPiD can offer great advantages over general-purpose reconfigurable devices, they also present some new and unique challenges. While the design choices that affect the performance and flexibility of classical FPGAs are relatively clearly defined and well understood, the effects that fundamental architecture decisions have on specialized reconfigurable devices are largely unknown and difficult to quantify.
2 Implications of Domain-Specialized Devices
The development of a conventional reconfigurable
array that can accommodate a group of applications is
relatively straightforward. Based upon the results of past
research, consistently good results can be obtained using
an array consisting of 4-LUT based CLBs [2] surrounded
by a predetermined mix of short and long routing tracks
[4]. Using existing CAD tools, the only real parameters
that would need to be determined are the overall size of
the array and the number of routing tracks in the system.
However, when developing domain-specialized
reconfigurable devices, the sea of fine-grained logical
resources found in conventional FPGAs is typically
replaced with a clearly defined set of coarse-grained
functional units. While this can greatly improve the
performance of the system, developers need to closely
consider the specific ways in which each algorithm uses
the provided resources since the logical elements are no
longer universally flexible. Merely given a domain of
applications, it is not obvious what the most appropriate
set of functional units would be, much less what routing
architecture would be appropriate, what implications this
might have on necessary CAD tools, or how any of these
factors might affect each other.
The first challenge we mentioned, the selection of
functional units, can be subdivided into three steps. First,
the applications in a domain must be analyzed to
determine what operations they require. Crucial parts
such as wide multipliers or fast adders should be
identified. Next, this preliminary set of functional units
can be distilled to a smaller set by capitalizing on
potential overlap or partial reuse of other types of units.
Different sizes of memories, for example, may be
combined through the use of multi-mode addressing
schemes. Lastly, based upon design constraints, the exact
number of each type of unit in the array should be
determined. For example, if the applications are memory-intensive rather than computation-intensive, the relative
number of memory units versus ALUs should reflect this.
This paper will primarily focus on the problem of
functional unit allocation – determining the most
appropriate quantity and ratio of functional units across
the domain. We will use the 15 candidate algorithms of
the Advanced Encryption Standard (AES) competition
[12] to illustrate the issues that make component
allocation difficult.
While operator identification and optimization are both
complex problems unique to coarse-grain architectures,
we will not address these issues here since the algorithms
in the target domain often provide an obvious starting
point. Normally, algorithms are described using strongly
typed functions such as multiplications or Boolean
operations. The logical optimization and technology
mapping for such strongly typed operators is relatively
simple to perform by hand. Although this may overlook
subtle optimizations, such as the exact tradeoffs between
having a separate multiplier and adder versus creating a
dedicated multiply-accumulate unit, manual analysis does
provide an acceptable working set and offers a starting
point for our discussion concerning functional unit
allocation.
3 Difficulties of Functional Unit Allocation
Although it is relatively straightforward to establish the absolute minimum area required to support a domain of applications, determining the best way to allocate additional resources is more difficult. While developers of conventional reconfigurable devices can determine the best overall size of the array for any set of design constraints by building performance-versus-area curves for each algorithm in the domain, these curves do not offer enough information to assist designers of coarse-grained reconfigurable devices. When developing domain-specialized FPGAs, the individual functional unit needs of every algorithm in the domain contend for control over what the overall architecture will look like.
To illustrate the complicated relationships between functional unit demands across a domain, we analyzed the 15 candidate algorithms of the AES competition over a range of performance levels. We identified the resource requirements to implement each of the algorithms at natural unrolling points, from relatively small, time-multiplexed elements to completely unrolled implementations. From this analysis, we demonstrate four main factors that obscure the relationship between hardware resources and performance.
First, although the algorithms in our domain share many common operations, the ratio of the different functional units varies considerably between algorithms. Without any prioritization, it is unclear how to distribute resources. For example, if we consider the fully rolled implementations for six encryption algorithms, as in Figure 1, we can see the wide variation in RAM, crossbar, and runtime requirements among the different algorithms.
Algorithm (Baseline)    RAM Blocks    XBars    Runtime
CAST-256 (1x)               16            0        48
DEAL (1x)                    1            7        96
HPC (1x)                    24           52         8
Loki97 (1x)                 40            7       128
Serpent (1x)                 8           32        32
Twofish (1x)                 8            0        16
Average                   16.2         16.3      54.7
Std. Dev.                 14.1         21.1      47.6
Figure 1 – Ratio Complications
This table compares the RAM, crossbar and runtime
requirements for the baseline implementations of six
encryption algorithms. Notice that in all three categories
the deviation in requirements is comparable to the average
value.
To complicate matters, if we attempt to equalize any
one requirement over the entire set, the variation among
the other requirements becomes more extreme. This can
be seen in Figure 2. In this case, if we consider the RAM
resources that an architecture should provide, we notice
that Loki97 requires at least 40 RAM modules. If we
attempt to develop an architecture that caters to this
constraint and unroll the other algorithms to take
advantage of the available memory, we see that the
deviation in the number of crossbars and runtime
increases sharply.
Algorithm (Unrolling Factor)    RAM Blocks    XBars    Runtime
CAST-256 (2x)                       32            0        24
DEAL (32x)                          32          104         3
HPC (1x)                            24           52         8
Loki97 (1x)                         40            7       128
Serpent (8x)                        32           32         4
Twofish (4x)                        32            0         4
Average                             32         32.5      28.5
Std. Dev.                          5.6         40.6      49.4

Figure 2 – Equalization Difficulties
This table displays the compounded problems that occur when attempting to normalize the RAM requirements across algorithms. The other algorithms are unrolled to make use of the memory ceiling set by Loki97. Notice that the total deviation in crossbars roughly doubles as compared to the baseline comparison and that the deviation in runtime becomes almost twice the new average value.

The second factor that complicates the correlation between hardware availability and performance is that the algorithms have vastly different complexities. This means that the hardware required by each algorithm to support a given throughput differs considerably. In Figure 3 we see an example of five different encryption algorithms that are implemented to have similar throughput, but that have a wide variation in hardware requirements. It is difficult to fairly quantify the performance-versus-hardware tradeoff of any domain that has a wide complexity gap.

Algorithm (Unrolling Factor)    RAM Blocks    XBars    Runtime
CAST-256 (2x)                       32            0        24
DEAL (4x)                            4           16        24
Loki97 (8x)                        320            7        16
Magenta (4x)                        64            0        18
Twofish (1x)                         8            0        16
Average                           85.6          4.6      22.8
Std. Dev.                        133.2          7.1       6.3

Figure 3 – Complexity Disparity
An illustration of the imbalance that occurs when attempting to equalize throughput across algorithms. In this case we chose Twofish as a baseline and unrolled the rest of the algorithms to best match its throughput. Notice that the deviation in RAM and crossbar requirements is well above the average value.

The third problem of allocating hardware resources is that the requirements of the algorithms do not necessarily scale linearly or monotonically when loops are unrolled. See Figure 4 for an example of this non-uniform behavior. This phenomenon makes it difficult to foresee the effect of decreasing the population of one type of functional unit and increasing another.

Algorithm (Unrolling Factor)    RAM Blocks    Muxes    Runtime
FROG (1x)                            8           23       512
FROG (4x)                            8           72       128
FROG (16x)                           8          256        32
FROG (64x)                          16          120         8
FROG (256x)                         64           30         2

Figure 4 – Scaling Behavior
An example of the unpredictable nature of hardware demands when unrolling algorithms.

The last problem of estimating performance from available resources is that if a particular implementation requires more functional units of a certain type than are available, the needed functionality can often be emulated with combinations of the other, under-utilized units. For example, a regular bit permutation could be accomplished with a mixture of shifting and masking. Although this flexibility may improve resource utilization, it also dramatically increases the number of designs to be evaluated.
4 Functional Unit Allocation
To produce an efficient platform for a diverse group of
applications, an effective solution to the functional unit
allocation problem must have the flexibility needed to
simultaneously address the multi-dimensional hardware
requirements of the entire domain while maximizing
usability and maintaining hard or soft area and
performance constraints. In the following sections we
propose three solutions to this problem. The first
algorithm addresses hard performance constraints. The
second and third algorithms attempt to maximize the
overall performance given softer constraints.
4.1 Performance-Constrained Algorithm
The first algorithm we developed uses a hard minimum
throughput constraint to guide the functional unit
allocation. As described earlier, we began the exploration
of this domain by establishing the hardware requirements
of all of the algorithms for a variety of performance
levels. We use this table of results in all of our functional
unit allocation techniques to determine the minimum
hardware that each algorithm needs in order to support a
given throughput.
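As a concrete illustration, this hardware/throughput table might be represented as follows. This is a minimal sketch: the unit names are illustrative, and the sample entries are drawn from the baseline and unrolled implementations reported in Figures 1 and 2.

```python
# algorithm -> list of (clock cycles per block, {unit type: count}),
# ordered from fully rolled (slowest) to most unrolled (fastest).
REQUIREMENTS = {
    "CAST-256": [(48, {"ram": 16, "xbar": 0}),
                 (24, {"ram": 32, "xbar": 0})],
    "DEAL":     [(96, {"ram": 1, "xbar": 7}),
                 (3,  {"ram": 32, "xbar": 104})],
    "Loki97":   [(128, {"ram": 40, "xbar": 7})],
}

def slowest_satisfying(algorithm, max_cycles):
    """Slowest implementation of `algorithm` meeting the cycle budget, or None."""
    feasible = [impl for impl in REQUIREMENTS[algorithm] if impl[0] <= max_cycles]
    return max(feasible, key=lambda impl: impl[0]) if feasible else None
```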
Our first algorithm begins by determining the
hardware requirements to run each algorithm at a
specified minimum throughput. We then examine these
requirements to establish the maximum necessary number
of each type of functional unit. To calculate the overall
performance for this superset of resources, we reexamine
each algorithm to determine if there are sufficient resources to allow for greater throughput, then apply the cost function described by this equation:

Cost = \sum_{i=0}^{N-1} CC_i
In this equation, N is the total number of algorithms in
the domain and CCi is the number of clock cycles
required to encrypt a single block of text in the highest
throughput configuration of algorithm i that will fit on the
architecture. See Figure 5 for an illustrated example of
the performance-constrained functional unit selection
process.
Figure 5 – Performance-Constrained Functional Unit Selection
Illustration of the performance-constrained selection algorithm on three example algorithms (X, Y, Z), each with implementations at several performance levels (number of functional units required versus clock cycles per block). Given a hard throughput constraint of 4 clock cycles/block: first, determine the slowest implementation of each algorithm that still satisfies the minimum throughput requirement and eliminate implementations below the performance threshold; next, based on this subset of implementations, determine the minimum number of each resource type; finally, if possible, further unroll the algorithms to better utilize the available resources. Final Cost = 4 + 1 + 3 = 8.
Note that this is a greedy algorithm and, due to the
non-linear and non-monotonic behavior of hardware
requirements, does not necessarily find the minimum area
or maximum performance for the system. Because the
starting point is chosen solely on the basis of throughput
without considering hardware requirements, it is possible
that higher throughput implementations of a given
algorithm may have lower resource demands for
particular functional unit types. If that algorithm becomes
the limiting factor when determining the number of any
resource type, it will likely affect the overall area and
performance results.
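The sketch below shows one possible realization of this greedy procedure, assuming the hypothetical hardware/throughput matrix representation introduced above; it is an illustration of the method described here, not our production tool.

```python
def allocate_performance_constrained(requirements, max_cycles):
    """Greedy allocation under a hard throughput constraint (cycles/block)."""
    # 1. Take the slowest implementation of each algorithm that still meets
    #    the cycle budget.
    chosen = {}
    for alg, impls in requirements.items():
        feasible = [(cc, need) for cc, need in impls if cc <= max_cycles]
        if not feasible:
            raise ValueError(f"{alg} cannot meet {max_cycles} cycles/block")
        chosen[alg] = max(feasible, key=lambda impl: impl[0])

    # 2. Size each resource type to the maximum demanded by any chosen impl.
    arch = {}
    for _, need in chosen.values():
        for unit, count in need.items():
            arch[unit] = max(arch.get(unit, 0), count)

    # 3. Re-examine each algorithm: use the fastest implementation that now
    #    fits, and sum the clock cycles (Cost = sum over i of CC_i).
    cost = 0
    for alg, impls in requirements.items():
        fitting = [cc for cc, need in impls
                   if all(arch.get(u, 0) >= n for u, n in need.items())]
        cost += min(fitting)
    return arch, cost
```

The function returns both the resource superset and the summed cycle count, mirroring the cost function above.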
4.2 Area-Constrained Algorithm
The next two algorithms we developed use simulated annealing to provide more sophisticated solutions that are able to capitalize on softer constraints to improve average throughput. The second algorithm begins by randomly adding functional units to the architecture until limited by a given area constraint. The quality of this configuration is evaluated by determining the highest performance implementation for each algorithm, given the existing resources, then applying the cost function described by this equation:

Cost = \sum_{i=0}^{N-1} \begin{cases} CC_i, & \text{if algorithm } i \text{ fits on the architecture} \\ PC \cdot A, & \text{otherwise} \end{cases}
In this equation, N is the total number of algorithms in
the domain and CCi is the number of clock cycles
required to encrypt a single block of text in the highest
throughput configuration of algorithm i that will fit on the
array. However, if an algorithm cannot be implemented
on the available hardware, we impose an exclusion
penalty proportional to A, the additional area necessary to
map the slowest implementation of the algorithm to the
array. In our evaluations, we used a large constant penalty scaling factor (PC) since we wanted our system to include all of the candidate algorithms. However, this
factor is completely application-dependent and must be
tuned depending on the size of the functional units, how
many algorithms are in the domain, what the average
runtime is, and how critical it is that the system is able to
implement the entire domain. While this penalty system
does not necessarily guide the simulated annealing to the
best solution, since a higher throughput implementation
may be closer to the existing configuration, it does
provide some direction to the tool to help prevent the
potentially unwanted exclusion of some of the algorithms
in the domain.
After calculating the quality of the configuration we
perturb the system by randomly picking two types of
components, removing enough of the first type to replace
it with at least one of the second, then adding enough of
the second type to fill up the available area. Finally, the
quality of the new configuration is evaluated in the same
manner as before. If the new configuration provides the
same or better throughput, it is accepted. If it does not
provide better performance, based on the current
temperature and relative performance degradation, it may
or may not be accepted. This process is based on the
simple acceptance function and adaptive cooling schedule
described in [5]. See Figure 6 for an illustration of this
procedure.
Figure 6 – Area-Constrained Functional Unit Selection
Illustration of our area-constrained selection algorithm. Five unit types are packed against a maximum area bound while the annealer perturbs the mix: 1) starting configuration; 2) remove unit type 4; 3) add unit type 5; 4) evaluate and accept; 5) remove unit type 5; 6) add unit type 2; 7) evaluate and reject.
Note that, as described earlier, some operations may be
emulated by combinations of other functional units. For
simplicity, in this example we do not directly deal with
this possibility, but there is no inherent limitation in either
of the area-constrained solutions that would prevent this from being addressed with a larger hardware/throughput matrix.
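A minimal sketch of this annealing loop appears below. The geometric cooling schedule and Boltzmann acceptance test are simplified stand-ins for the adaptive schedule of [5], and the unit areas, penalty constant and move generator are assumptions for illustration only.

```python
import math
import random

def anneal_area_constrained(requirements, unit_area, max_area,
                            PC=1e6, temp=100.0, cooling=0.95, steps=5000):
    """Sketch of the Section 4.2 annealer over functional unit counts."""
    units = list(unit_area)

    def free_area(arch):
        return max_area - sum(arch[u] * unit_area[u] for u in units)

    def cost(arch):
        total = 0.0
        for alg, impls in requirements.items():
            fitting = [cc for cc, need in impls
                       if all(arch.get(u, 0) >= n for u, n in need.items())]
            if fitting:
                total += min(fitting)          # CC_i of fastest fitting impl
            else:                              # exclusion penalty PC * A
                _, need = max(impls, key=lambda t: t[0])
                total += PC * sum(max(0, n - arch.get(u, 0)) * unit_area[u]
                                  for u, n in need.items())
        return total

    # Start: randomly add functional units until limited by the area bound.
    arch = {u: 0 for u in units}
    while any(free_area(arch) >= unit_area[u] for u in units):
        u = random.choice([u for u in units if free_area(arch) >= unit_area[u]])
        arch[u] += 1

    cur = cost(arch)
    for _ in range(steps):
        new = dict(arch)
        give, take = random.sample(units, 2)
        # Remove enough of `give` to admit at least one `take`,
        # then add `take` units to fill up the available area.
        while new[give] > 0 and free_area(new) < unit_area[take]:
            new[give] -= 1
        while free_area(new) >= unit_area[take]:
            new[take] += 1
        delta = cost(new) - cur
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            arch, cur = new, cur + delta
        temp *= cooling
    return arch, cur
```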
4.3 Improved Area-Constrained Algorithm
Our last functional unit selection algorithm attempts to
balance performance and area constraints. First, we
eliminate implementations from the hardware/throughput
matrix that do not provide enough throughput to meet a
specified minimum performance requirement. Then, we
randomly select one of the remaining implementations of
each algorithm for our current arrangement.
We
determine the minimum hardware and area requirements
necessary to fit all of the algorithms at their current
settings, and then establish if any algorithms can be
expanded to a higher performance level given the
calculated hardware resources. The quality of this
arrangement is determined by the number of clock cycles
required to run all of the algorithms at their current
settings and a penalty based on any excess area needed by
the system. The cost function is described by this
equation:
Cost = \sum_{i=0}^{N-1} CC_i + \text{Area Penalty}
In this equation, N is the total number of algorithms in
the domain and CCi is the number of clock cycles
required to encrypt a single block of text in the highest
throughput configuration of algorithm i that will fit on the
architecture.
If the area required for the current
configuration is larger than the specified maximum
allowable area, we also add an area penalty that is
described by this equation:
\text{Area Penalty} = PC \cdot (CA / MA)
In this case, PC is a constant penalty scaling factor, CA
is the calculated area requirement of the current
configuration and MA is the specified maximum
allowable area. Again, since we wanted a hard area
constraint for our evaluation, we set PC to a large value.
However, similar to the previous functional unit selection
algorithm, this term is application-specific and must be
tuned depending on how hard or soft an area constraint is
desired. After calculating the quality of the configuration,
we then perturb the system by arbitrarily choosing one
algorithm and randomly changing the setting to a different
performance level. Finally, the quality is re-evaluated and
compared to the original arrangement in the same
simulated-annealing manner as described in Section 4.2.
See Figure 7 for an illustration of this process.
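The following sketch captures the structure of this improved search, annealing over the choice of one implementation per algorithm rather than over raw unit counts. The acceptance test and cooling schedule are again simplified, and the requirements matrix and unit-area table are the same hypothetical structures used in the earlier sketches; it assumes every algorithm has at least one implementation meeting the threshold.

```python
import math
import random

def anneal_improved(requirements, unit_area, max_area, min_cycles,
                    PC=1e6, temp=100.0, cooling=0.95, steps=5000):
    """Sketch of the Section 4.3 annealer over implementation choices."""
    # Keep only implementations meeting the minimum performance requirement.
    pruned = {alg: [(cc, need) for cc, need in impls if cc <= min_cycles]
              for alg, impls in requirements.items()}

    def evaluate(choice):
        # Architecture = per-type maximum over the chosen implementations.
        arch = {}
        for _, need in choice.values():
            for u, n in need.items():
                arch[u] = max(arch.get(u, 0), n)
        # Expand each algorithm to the fastest implementation that already fits.
        cycles = 0
        for alg, impls in pruned.items():
            fitting = [cc for cc, need in impls
                       if all(arch.get(u, 0) >= n for u, n in need.items())]
            cycles += min(fitting)
        # Cost = sum of CC_i + Area Penalty (PC * (CA / MA) when over budget).
        area = sum(n * unit_area[u] for u, n in arch.items())
        penalty = PC * (area / max_area) if area > max_area else 0.0
        return cycles + penalty

    choice = {alg: random.choice(impls) for alg, impls in pruned.items()}
    cur = evaluate(choice)
    for _ in range(steps):
        alg = random.choice(list(pruned))
        new = dict(choice)
        new[alg] = random.choice(pruned[alg])   # re-pick one algorithm's setting
        delta = evaluate(new) - cur
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            choice, cur = new, cur + delta
        temp *= cooling
    return choice, cur
```

Because every unit count in the candidate architecture is the maximum demanded by some chosen implementation, at least one application fully utilizes each functional unit type, which is the key difference from the sketch in Section 4.2.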
Figure 7 – Improved Area-Constrained Functional Unit Selection
Illustration of our improved area-constrained selection algorithm, assuming the throughput threshold is set at 10 cycles/block. First, eliminate any implementations below the given performance threshold, then randomly choose a throughput level for each algorithm, determine the minimum hardware requirements, unroll the algorithms further if possible, and evaluate the throughput, penalizing any excessive area required by the resulting architecture (Cost = 6 + 100 = 106). Then randomly choose a new implementation for one algorithm (Z in this case) and determine the hardware requirements for the new configuration; despite the lower performance, the new state will be accepted due to a much lower area penalty (Cost = 14 + 20 = 34).

5 Functional Unit Allocation Results
The evaluation of the functional unit allocation
techniques began by using the performance-constrained
method as a baseline for comparison. We first identified
all of the distinct throughput levels between all of the algorithms in the domain. Then, each of these distinct throughput constraints was fed into the performance-constrained functional unit selection algorithm. The area requirements for each were recorded and then used as inputs to the two area-constrained techniques.
The three techniques we developed produce very different results when applied to our example domain. As expected, the hard throughput constraint of the performance-driven approach has limitations. In Figure 8 and Figure 9 we plot the results of all three functional unit selection algorithms over ten area scenarios. Figure 8 shows the maximum number of clock cycles per block required by any algorithm in the domain as a function of the area of the system. Since the number of clock cycles needed to encrypt each block of data is inversely proportional to the throughput, we can see from this graph that, for the majority of the architectures we examined, the performance-constrained algorithm indeed produces the best minimum performance among the three allocation methods. Also, as expected, the limitations of the performance-driven algorithm regarding non-linear and non-monotonic hardware requirements allow the improved area-constrained technique to occasionally obtain somewhat better minimum performance.
Figure 8 – Minimum Throughput Results of Functional Unit Selection
Graph of the maximum number of clock cycles required by any application in the domain (worst-case clock cycles/block) as a function of area (10 million units). Notice that while the improved area-constrained allocation technique occasionally produces better results, the performance-constrained method obtains the best worst-case performance over the majority of the design space.
Figure 9 – Performance Results of Functional Unit Selection Across the Domain
Graph of the total number of clock cycles required to run all of the applications in the domain as a function of area (10 million units). Notice that the overall performance of the higher-throughput systems produced by the performance-constrained algorithm lags considerably behind that of the architectures generated by either of the area-constrained techniques.
In contrast, though, when we plot the total number of clock cycles required by all of the algorithms in the domain as a function of area, as in Figure 9, we see a completely different picture for the performance-constrained selection method. The results in this graph directly reflect the average performance of the system for a given configuration. Figure 9 shows that the average performance of the system across the domain is reduced by as much as almost 50% when using the performance-constrained selection method as compared to using either of the area-driven techniques. The poor average throughput is particularly apparent in the larger architectures. This means that if the design constraints allow for some flexibility in terms of the minimum acceptable performance, better average throughput may be obtained by using either of the area-driven approaches.
When comparing the two area-constrained techniques, Figure 9 shows that the average performance results of the improved area-constrained technique are marginally better than those from the original area-driven method. Furthermore, when we consider the area requirements for the generated architectures, the improved area-constrained method generally produces architectures with equal or smaller area requirements. In addition, Figure 8 shows that the improved area-constrained method consistently produces architectures with an equal or lower maximum number of clock cycles for the worst-case encryption algorithm compared to the basic area-constrained technique.
All of these observations can likely be attributed to the
same source. Because the original area-constrained
functional unit selection algorithm is based upon
randomly adding and subtracting different types of
components to the system, it is likely that none of the
applications in the domain fully utilize any of the
functional unit types in the resultant architecture.
Conversely, since the improved area-constrained
technique is based upon choosing sets of particular
implementations, it is guaranteed that at least one
application will fully utilize each of the functional unit
types. It is likely that this fundamental difference creates
more noise in the original area-constrained selection
technique and thus makes it more difficult for the
algorithm to converge. In addition, even if the original
area-constrained technique were to converge on a similar
mixture of components as the improved method, it is very
possible that there may still be some functional unit types
that are not fully utilized by any implementation. Of
course, this will result in a larger architecture than is
necessary.
To better illustrate the quality of the proposed
algorithms we could compare them to the optimal
architectures found by brute-force search. Unfortunately,
the original set of 15 algorithms from the AES
competition, with up to 10 implementations each, is too
complex to solve via such methods. However, the AES
competition also produced a smaller set of 5 finalists. We have run this set through each of our algorithms and compared them to the optimal results found by a brute-force search. Figure 10 and Figure 11 show the results of the testing repeated for this smaller domain. In this case, the allocation problem is much simpler and both the performance-constrained and improved area-constrained techniques find the optimal architecture for all area scenarios. In addition to showing that our allocation methods can achieve optimal results, this test also confirms the suspicion that the original area-constrained method does not perform as well as the improved technique. While this domain does not present all of the same challenges as the full set of AES candidates, it is likely that our earlier results at least approach the optimal architecture.

Figure 10 – Minimum Throughput Results of Functional Unit Selection on a Limited Domain
Graph of the maximum number of clock cycles required by any of the five finalist algorithms as a function of area (million units). Both the performance-constrained and improved area-constrained techniques achieve the optimal results.

Figure 11 – Performance Results of Functional Unit Selection Across a Limited Domain
Graph of the total number of clock cycles required to run all five of the AES finalist ciphers as a function of area (million units). Again, both the performance-constrained and improved area-constrained techniques achieve the optimal results.
6 Future Work
In addition to the operator identification and
optimization problems described earlier, domain-specialized reconfigurable devices also raise several
completely new concerns. One interesting question
surrounds the issue of adaptability. While a carefully
designed architecture will likely perform well on the
limited set of applications that directly affected the design,
what methodology is needed to sufficiently encapsulate
the needs of the applications as a domain? Extending this
to the example of encryption, would algorithm updates or
completely different ciphers also perform well on this
system?
Another issue, related to adaptability, is specialization.
It is expected that the larger the group of applications
used to develop a domain-specialized reconfigurable
device, the better it will handle unanticipated netlists.
However, the size of the target domain is also connected
to the specificity of the array and its overall performance.
It is vital that developers understand how these factors
affect each other.
7 Conclusions
In this paper we have shown that the development of a
coarse-grained reconfigurable architecture raises several unique and unaddressed design problems. We presented three techniques for allocating functional units that attempt to balance performance and area constraints on domains that have vastly different hardware requirements. By taking special consideration of stable, high-performance implementations and the possibility of future flexibility, designers can use these functional unit allocation techniques to develop future coarse-grained reconfigurable devices.
The first algorithm produces architectures under a
guaranteed hard performance requirement. The second
algorithm allows designers to trade versatility for better average throughput. The third algorithm produces efficient architectures that can take advantage of softer area constraints. While the performance-constrained algorithm can be used when designers are only concerned with the minimum performance of a system, the area-constrained algorithms were shown to produce better average performance given similar area. Although the original area-constrained technique allows designers to potentially improve overall performance by excluding very demanding applications, the improved area-constrained technique consistently produced better results when considering the entire domain. It is likely that the improved area-constrained method would be the most appropriate choice unless the minimum performance of the system needs to be absolutely guaranteed.
Although we encountered the difficulties of functional
unit selection while exploring an encryption-specific
domain, we believe that the causes of the problem are not
exclusive to encryption and can be expected to be
common in many complex groups of applications. The
functional unit selection problem will become more
difficult as reconfigurable devices are expected to offer
better and better performance over large domain spaces.
Increased specialization of function units and growing
domain size combined with the need for resource
utilization optimization techniques such as functional unit
emulation will soon complicate architecture exploration
beyond that which can be analyzed by hand. In the
future, designers will need CAD tools that are aware of
these issues in order to create devices that retain the
flexibility required for customization over a domain of
applications while maintaining good throughput and area
characteristics.
8 References
[1] Adams, C. and J. Gilchrist. “The CAST-256 Encryption Algorithm.” First AES Candidate Conference, Aug. 20-22, 1998.

[2] Ahmed, E. and J. Rose. “The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density.” International Symposium on Field Programmable Gate Arrays, 2000: 3-12.

[3] Anderson, Ross, Eli Biham and Lars Knudsen. “Serpent: A Proposal for the Advanced Encryption Standard.” First AES Candidate Conference, Aug. 20-22, 1998.

[4] Betz, Vaughn and Jonathan Rose. “FPGA Routing Architecture: Segmentation and Buffering to Optimize Speed and Density.” International Symposium on Field Programmable Gate Arrays, 1998: 59-68.

[5] Betz, Vaughn and Jonathan Rose. “VPR: A New Packing, Placement and Routing Tool for FPGA Research.” International Workshop on Field Programmable Logic and Applications, 1997: 213-22.

[6] Brown, Lawrie and Josef Pieprzyk. “Introducing the New LOKI97 Block Cipher.” First AES Candidate Conference, Aug. 20-22, 1998.

[7] Burwick, Carolynn, Don Coppersmith, Edward D'Avignon, Rosario Gennaro, Shai Halevi, Charanjit Jutla, Stephen M. Matyas Jr., Luke O'Connor, Mohammad Peyravian, David Safford and Nevenko Zunic. “Mars – A Candidate Cipher for AES.” First AES Candidate Conference, Aug. 20-22, 1998.

[8] Cronquist, Darren C., Paul Franklin, Chris Fisher, Miguel Figueroa and Carl Ebeling. “Architecture Design of Reconfigurable Pipelined Datapaths.” Twentieth Anniversary Conference on Advanced Research in VLSI, 1999: 23-40.

[9] Daemen, Joan and Vincent Rijmen. “AES Proposal: Rijndael.” First AES Candidate Conference, Aug. 20-22, 1998.

[10] Ebeling, Carl, Darren C. Cronquist and Paul Franklin. “RaPiD – Reconfigurable Pipelined Datapath.” The 6th International Workshop on Field-Programmable Logic and Applications, 1996: 126-35.

[11] Eguro, Ken. “RaPiD-AES: Developing an Encryption-Specific FPGA Architecture.” Master's Thesis. Dept. of Electrical Engineering, University of Washington, Dec. 2002.

[12] National Institute of Standards and Technology. Advanced Encryption Standard (AES) Development Effort. Nov. 11, 2002. <http://csrc.nist.gov/encryption/aes/index2.html>.

[13] Rivest, Ron, M. J. B. Robshaw, R. Sidney and Y. L. Yin. “The RC6 Block Cipher.” First AES Candidate Conference, Aug. 20-22, 1998.

[14] Schneier, B., J. Kelsey, D. Whiting, D. Wagner, C. Hall and N. Ferguson. “Twofish: A 128-bit Block Cipher.” First AES Candidate Conference, Aug. 20-22, 1998.