Area, Delay, and Power Characteristics of Standard-Cell
Implementations of the AES S-Box
Stefan Tillich, Martin Feldhofer, Thomas Popp
Institute for Applied Information Processing and Communications
Graz University of Technology, Inffeldgasse 16a, A–8010 Graz, Austria
{stillich,mfeldhof,tpopp}@iaik.tugraz.at
Johann Großschädl
University of Bristol, Department of Computer Science
Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, U.K.
johann.groszschaedl@cs.bris.ac.uk
Abstract
Cryptographic substitution boxes (S-boxes) are an integral part of modern block ciphers
like the Advanced Encryption Standard (AES). There exists a rich literature devoted to the
efficient implementation of cryptographic S-boxes, wherein hardware designs for FPGAs and
standard cells received particular attention. In this paper we present a comprehensive study of
different standard-cell implementations of the AES S-box with respect to timing (i.e. critical
path), silicon area, power consumption, and combinations of these cost metrics. We examine
implementations which exploit the mathematical properties of the AES S-box, constructions
based on hardware look-up tables, and dedicated low-power solutions. Our results show that
the timing, area, and power properties of the different S-box realizations can vary by up
to almost an order of magnitude. In terms of area and area-delay product, the best choices
are implementations which calculate the S-box output. On the other hand, the hardware
look-up solutions are characterized by the shortest critical path. The dedicated low-power
implementations not only reduce power consumption considerably, but also show
good timing properties and offer the best power-delay and power-area products.
1 Introduction
The Internet of the 21st century will consist of billions of non-traditional computing systems like
cell phones, PDAs, sensor nodes, and other mobile devices (“gadgets”) with wireless networking
capability. Wireless networking, along with the fact that many of these devices (e.g. sensor nodes)
are easily accessible, has raised a number of security concerns. Sophisticated security protocols, in
combination with well-established cryptographic primitives, can ensure privacy and integrity of
communication over insecure networks. Consequently, there is an increasing demand to implement
cryptographic algorithms on resource-limited embedded devices like cell phones, PDAs, or sensor
nodes. Even some extremely constrained systems like Radio Frequency Identification (RFID) tags
are required to perform cryptographic operations.
Journal of Signal Processing Systems, vol. 50, no. 2, pp. 251–261, Feb. 2008. © Springer-Verlag, 2008.
A preliminary version of this paper was published in the proceedings of SAMOS 2006, LNCS 4017, pp. 457–466.
The Advanced Encryption Standard (AES), which was announced by NIST in 2001, defines
one of the most important symmetric ciphers for the long-term future [15]. The AES algorithm
is a variant of the Rijndael cipher [4] and can be implemented efficiently in both hardware
and software. Common AES hardware implementations take the form of stand-alone ASICs and
cryptographic coprocessors for system-on-chip (SoC) integration. In addition, hardware/software
co-design techniques like extending the instruction set of a general-purpose processor have been
investigated in the recent past [19]. Due to the high performance of modern microprocessors, AES
software can reach throughput rates that are sufficient for most applications. Therefore, hardware
implementations of the AES algorithm are mainly important for high-end server systems with
extreme performance requirements and for embedded devices with a demand for low power
consumption and small silicon area.
Most of the published AES hardware designs focus on high speed and high throughput for
implementation in FPGAs [3, 11, 16]. In addition, some ASIC implementations have been reported
in the literature. For example, Hodjat et al. developed a 3.84 Gbit/s AES crypto coprocessor with
modes-of-operation support based on a 0.18 µm CMOS technology [7]. Their design features a
128-bit datapath and encrypts a block of data in 11 clock cycles. A completely different design
approach is necessary when optimizing AES hardware for low power consumption or small silicon
area. Feldhofer et al. introduced an AES implementation suited for passively-powered devices like
RFID tags [6]. It comprises an 8-bit datapath which occupies an area of 3,595 gates (including
registers and control logic) when synthesized using a 0.35 µm standard cell library. These results
show that the AES algorithm allows for a wide range of trade-offs between performance, power
consumption, and hardware cost [5].
Symmetric ciphers like the AES require non-linear functions in order to resist linear cryptanalysis. Substitution is a common function for introducing non-linearity. A substitution function,
generally referred to as S-box, can be realized in the form of an arbitrary mapping from input bits
to output bits (e.g. DES [14]) or via algebraic operations (e.g. AES). Different cipher algorithms
use different numbers of S-boxes. For example, DES uses eight S-boxes which map six to four
bits, while AES employs a single S-box which is a bijective mapping from eight to eight bits. The
AES algorithm makes use of its S-box in the SubBytes round transformation and the key expansion
[4]. From a mathematical point of view, the AES S-box is defined as an inversion in the finite field
$\mathbb{F}_{2^8}$ with a specific irreducible polynomial [9], followed by an affine transformation. The inverse
S-box, which is required for the InvSubBytes round transformation for decryption, is simply the
inverse of the affine transformation, followed by an inversion in $\mathbb{F}_{2^8}$.
The S-box is a costly and performance-critical building block of the AES algorithm. Results
from previous work [7, 21] show that the S-box lies on the critical path of many AES architectures
and, hence, limits the maximum clock frequency. In addition, the S-box also impacts area and power
consumption of AES hardware [17, 13]. Therefore, the AES S-box has been a subject of intensive
research in recent years, which has led to a rich literature on efficient S-box design and implementation. The proposed designs can be roughly categorized into S-boxes that contain optimized
circuits for arithmetic in $\mathbb{F}_{2^8}$ [2, 17, 20], constructions using hardware look-up tables [8, 11], and
dedicated low-power solutions [1], all of which have their specific advantages and disadvantages
with respect to area, delay, and power consumption. Although most papers introducing new S-box
designs provide implementation results and discuss related work, it is generally difficult to compare
the different design approaches since, for example, the implementations may have been produced
using different design flows and tools, different standard cell libraries, or different optimizations
(speed, area) for the synthesis process.
In this paper we analyze and compare silicon area, critical path delay, and power consumption
characteristics of the most common standard-cell designs of the AES S-box in a uniform and
coherent way. We consider in our study designs which exploit the mathematical properties of the
S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. In
contrast to our previous work [18] where we used a 0.35 µm standard-cell library to evaluate
different S-box designs, we conducted the present study on the basis of a more modern 0.25 µm process
technology in order to provide practical insights and results that are closer to the state-of-the-art in
VLSI manufacturing. We put similar effort into optimizing each of the evaluated S-box designs to
ensure a fair comparison. Our results show that the area, delay, and power figures of the different
S-box designs vary significantly (up to almost an order of magnitude), which underpins the importance of selecting the best-suited S-box with respect to the requirements of the application.
The remainder of this paper is organized as follows. Section 2 briefly explains the AES
algorithm and discusses hardware implementation aspects. In Section 3 we give an overview of different
implementation strategies for the AES S-box. The particular S-box implementations that we used
for our evaluation of area, delay, and power consumption are described in Section 4. Section 5
provides background information on the design flow and evaluation methodology. In Section 6 we
discuss our experimental results and we finally conclude in Section 7.
2 The Advanced Encryption Standard
In November 2001, after several years of public evaluation, the National Institute of Standards and
Technology (NIST) officially announced the algorithm for the new Federal Information Processing
Standard FIPS-197 [15], also called Advanced Encryption Standard (AES). The block cipher
Rijndael [4] was chosen from 15 submitted candidates and has thenceforward become the AES
algorithm.
The AES is a very flexible algorithm suitable for implementation on many platforms in software
as well as in hardware. Its simplicity and symmetry properties facilitate optimization towards
different objectives such as high performance or low cost. The AES algorithm has a fixed block
size of 128 bits. Each block is organized as a 4 × 4 matrix of bytes, referred to as State. The
FIPS-197 standard defines three different key lengths: 128, 192, and 256 bits. Similar to most
symmetric ciphers, the AES algorithm encrypts an input block by applying a round transformation
several times. Depending on the key length, the number of rounds is either 10, 12, or 14. The
round transformation modifies the 128-bit State from its initial value (i.e. the plaintext) to obtain
the ciphertext after the last round. Each round consists of non-linear, linear, and key-dependent
transformations, which can all be described by means of algebraic operations over the finite field
$\mathbb{F}_{2^8}$. These operations, called SubBytes, ShiftRows, MixColumns, and AddRoundKey, scramble
the bytes of the State either individually, row-wise, or column-wise. Before the first round an initial
AddRoundKey is performed, while in the last round the MixColumns operation is omitted.
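The round structure described above can be summarized in a short Python sketch. It only lists the order of the transformations for a given key length and does not implement them; the function name round_schedule is an assumption made for illustration.

```python
# Illustrative sketch: order of AES transformations for one block encryption.
ROUNDS = {128: 10, 192: 12, 256: 14}  # key length in bits -> number of rounds


def round_schedule(key_bits):
    """Return the names of the transformations applied to the State, in order."""
    nr = ROUNDS[key_bits]
    ops = ["AddRoundKey"]                  # initial key addition before round 1
    for rnd in range(1, nr + 1):
        ops += ["SubBytes", "ShiftRows"]
        if rnd != nr:                      # MixColumns is omitted in the last round
            ops.append("MixColumns")
        ops.append("AddRoundKey")
    return ops
```

For a 128-bit key the list contains ten SubBytes steps, nine MixColumns steps, and eleven AddRoundKey steps.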
The SubBytes transformation substitutes each byte of the State independently. This byte substitution is defined by the so-called S-box, which can be expressed through arithmetic operations
in the finite fields $\mathbb{F}_2$ and $\mathbb{F}_{2^8}$. More specifically, it is composed of an inversion in $\mathbb{F}_{2^8}$ followed
by an affine transformation. The affine transformation consists of a multiplication with a constant
polynomial over $\mathbb{F}_2$ and addition of another constant polynomial. The SubBytes transformation is
the only non-linear function of the AES algorithm. Its implementation has a major impact on the
area, performance, and power consumption of an AES hardware module.
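To make the two-step definition concrete, the following Python sketch computes the S-box as an inversion followed by the affine transformation. It is a software illustration only, not one of the hardware designs evaluated in this paper; the helper names are assumptions, and the reduction polynomial x^8 + x^4 + x^3 + x + 1 and the constant 0x63 are taken from FIPS-197.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:          # reduce as soon as the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Inverse computed as a^254; the input 0 is mapped to 0, as in the AES."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def rotl8(x, s):
    """Rotate an 8-bit value left by s bit positions."""
    return ((x << s) | (x >> (8 - s))) & 0xFF


def affine(x):
    """Affine transformation over F_2, written with byte rotations."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def sbox(x):
    """Forward S-box: inversion in F_{2^8} followed by the affine transformation."""
    return affine(gf_inv(x))


SBOX = [sbox(x) for x in range(256)]
```

The exponentiation by 254 is chosen here for brevity; it says nothing about how the inversion is best realized in hardware.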
ShiftRows rotates each row of the State to the left using a specific offset. The offset equals the
row index (starting at 0), which means that the first row is not rotated at all and the last row is
rotated by three bytes to the left.
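A minimal Python sketch of this rotation, assuming the State is held as a list of four rows with four bytes each (the representation is chosen for illustration only):

```python
def shift_rows(state):
    """Rotate row r of a 4x4 State (given as a list of four rows) left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]
```

Row 0 is returned unchanged, and row 3 is rotated by three bytes to the left.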
MixColumns operates on columns of the State. Each column is interpreted as a polynomial
of degree 3 with coefficients from the field $\mathbb{F}_{2^8}$. This polynomial is multiplied by a polynomial
with fixed coefficients, and the result is reduced modulo $g(t) = \{1\}t^4 + \{1\}$ (where $\{1\} \in \mathbb{F}_{2^8}$). The
MixColumns operation is often expressed as a multiplication of a constant 4 × 4 matrix of $\mathbb{F}_{2^8}$
elements with the input column (interpreted as four elements of $\mathbb{F}_{2^8}$), yielding the respective output
column.
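The matrix form can be sketched in Python as follows. The matrix entries are those specified in FIPS-197; the function names are assumptions made for this illustration.

```python
def gf_mul(a, b):
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


# Constant 4 x 4 matrix of MixColumns (entries are elements of F_{2^8}).
MIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def mix_column(col):
    """Multiply the constant matrix with one State column of four bytes."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)   # addition in F_{2^8} is XOR
        out.append(acc)
    return out
```

For example, the column (0xdb, 0x13, 0x53, 0x45) is mapped to (0x8e, 0x4d, 0xa1, 0xbc).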
The three aforementioned transformations form the substitution permutation network of the
AES algorithm, wherein SubBytes represents the substitution part (to increase confusion) and
ShiftRows and MixColumns constitute the permutation part (increasing diffusion). AddRoundKey
simply combines the State with a round key by applying an XOR-operation over all 128 bits.
The KeySchedule transformation produces the 128-bit round keys, whereby the first round key
is equal to the cipher key. All other round keys are computed from the previous round key by using
the S-box functionality and some constants referred to as Rcon. The decryption function recovers
the plaintext from a given ciphertext by executing the inverse round transformations (InvSubBytes,
InvShiftRows, InvMixColumns, and AddRoundKey) in reverse order. All round keys are also used
in reverse order.
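The role of the S-box and of Rcon in the key schedule can be illustrated for 128-bit keys with the following Python sketch. It follows the key expansion of FIPS-197; the names expand_key_128 and sbox are assumptions, and the S-box is recomputed from its algebraic definition only to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply in F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(x):
    """S-box as inversion (x^254) followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    rot = lambda v, s: ((v << s) | (v >> (8 - s))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [sbox(x) for x in range(256)]


def expand_key_128(key):
    """Return the 11 round keys (16 bytes each) derived from a 16-byte cipher key."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # rotate the word by one byte
            temp = [SBOX[b] for b in temp]      # S-box applied to each byte
            temp[0] ^= rcon                     # add the round constant
            rcon = gf_mul(rcon, 2)              # next Rcon value
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [bytes(b for w in words[4 * r:4 * r + 4] for b in w) for r in range(11)]
```

The first round key equals the cipher key, and only every fourth word of the expansion passes through the S-box.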
2.1 Hardware Implementation Aspects
The AES is a flexible algorithm well suited for implementation in hardware. A multitude of hardware architectures are possible, which allows for optimization toward different requirements,
ranging from high performance to low power consumption and small silicon area. A considerable
literature exists that is devoted to efficient hardware implementation of the AES [3, 6, 7, 11, 16, 17,
21]. Depending on the target application, AES architectures can have a datapath width of between
8 and 128 bits. Additionally, it is possible to unroll several rounds and insert pipeline stages into
the design. However, to support different modes of operation like the CBC mode [4], often only
one round is realized in hardware and used repeatedly.
The width of the datapath determines the main characteristics (i.e. performance, area, power
consumption) of an AES implementation. Since the AES is byte-oriented, an 8-bit architecture with
a single S-box is the natural choice for applications where small area and low power dissipation are
crucial, e.g. smart cards or RFID tags. At the other end of the spectrum are 128-bit architectures
containing 16 S-boxes to compute the SubBytes function of a 128-bit data block in one pass. Due
to this massive parallelism, 128-bit architectures can reach high throughput rates at the expense
of large silicon area. 32-bit architectures with four S-boxes constitute a good compromise between
the two aforementioned extremes; they allow for much higher performance than 8-bit architectures
but demand only a fraction of the area of 128-bit implementations.
3 Implementation Strategies for the AES S-Box
All AES architectures sketched in the previous section have a common feature in that the SubBytes
transformation occupies a significant portion of the overall silicon area. The size of SubBytes
is, in turn, determined by the number of S-boxes and their concrete implementation. Various
implementation options for the AES S-box have been investigated in the recent past, which has led
to an abundant literature [1, 2, 8, 10, 12, 13, 17, 20].
The SubBytes transformation substitutes all 16 bytes of the State independently using the
S-box. Furthermore, the S-box is also used in the AES key expansion. In software, the S-box
is typically realized in the form of a look-up table since inversion in the finite field $\mathbb{F}_{2^8}$ cannot
be calculated efficiently on general-purpose processors. In hardware, on the other hand, the
implementation of the S-box is directed by the desired trade-off between area, delay, and power
consumption. The most obvious implementation approach for the S-box takes the form of hardware
look-up tables [11]. However, since encryption and decryption require different tables, and each
table contains 2048 bits, the overall hardware cost of this approach is relatively high.
An implementation option related to standard cells is the use of ROM compilers to produce
hardware macros. For the technology that we used, a sufficiently large ROM would require a
considerable amount of silicon area. The critical path delay would be similar to a hardware look-up
approach, but the power consumption of generated ROMs is about two to three orders of magnitude
higher.¹ Therefore, we do not consider the implementation of the S-box as ROM in this paper.
More sophisticated approaches calculate the S-box function in hardware using its algebraic
properties [4]. The focus of such implementations is the efficient realization of the inversion in
$\mathbb{F}_{2^8}$, which can be achieved by decomposing the finite field into the sub-fields $\mathbb{F}_{2^4}$ and $\mathbb{F}_{2^2}$. An
inversion in a finite field of characteristic 2 can be carried out in different ways, depending on
the basis which is used to represent the field elements [9]. The two most common types of bases
for $\mathbb{F}_{2^m}$ are the polynomial basis and the normal basis. A polynomial basis is a basis of the
form $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ where $\alpha$ is a root of an irreducible polynomial $p(t)$ of degree $m$ with
coefficients from $\mathbb{F}_2$. On the other hand, a normal basis can be found by selecting a field element
$\beta \in \mathbb{F}_{2^m}$ such that the elements of the set $\{\beta, \beta^2, \beta^4, \ldots, \beta^{2^{m-1}}\}$ are linearly independent.
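The linear-independence condition for a normal basis can be checked mechanically. The Python sketch below does so for m = 8, assuming that field elements are represented as bytes with respect to the AES reduction polynomial; the function names is_normal and rank_gf2 are introduced for illustration only.

```python
def gf_mul(a, b):
    """Multiply in F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def rank_gf2(vectors):
    """Rank over F_2 of integers interpreted as bit vectors (Gaussian elimination)."""
    pivots = {}                       # leading bit position -> reduced vector
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]          # cancel the leading bit and continue
    return len(pivots)


def is_normal(beta, m=8):
    """True if {beta, beta^2, beta^4, ..., beta^(2^(m-1))} is linearly independent."""
    conjugates = []
    x = beta
    for _ in range(m):
        conjugates.append(x)
        x = gf_mul(x, x)              # squaring maps beta^(2^i) to beta^(2^(i+1))
    return rank_gf2(conjugates) == m


normal_elements = [b for b in range(256) if is_normal(b)]
```

Every element of normal_elements generates a normal basis of $\mathbb{F}_{2^8}$ in this representation. By comparison, the polynomial basis built from the byte 0x02 is simply the set of the eight single-bit values.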
A third approach for implementing the AES S-box was proposed by Bertoni et al. in [1].
By using an intermediate one-hot encoding of the input, arbitrary logic functions (including
cryptographic S-boxes) can be realized with minimal power consumption. The main drawback of
this approach is that it results in relatively large silicon area.
4 Implementation Details
All AES S-box implementations analyzed in this paper can perform forward and inverse byte
substitution for encryption and decryption, respectively. We implemented the S-boxes either from
scratch or obtained the HDL descriptions from the authors of the respective publications. The
implementations examined consist solely of combinatorial logic, i.e. no pipeline stages have been
inserted since pipelining does not make sense when a feedback mode of operation like OFB or
CBC is used [7]. In the following we describe a total of eight different implementations of the
AES S-box which can be grouped into three basic categories: look-up implementations, calculating
implementations, and low-power implementations. Four of the eight S-box implementations are
illustrated in Figure 1.
The simplest design in our comparison is a straight-forward implementation of a hardware
look-up table [11]. The synthesizer transforms the behavioral description of the look-up table into
a mass of unstructured standard cells. This approach will be denoted as hw-lut. A modification of
hw-lut is to use sub-tables in order to minimize switching activity in the look-up tables to reduce
power consumption. We examined such solutions with sub-tables of size 16, 32, 64, 128, and 256
bytes, but in this paper we only specify results for size 16 (sub16-lut).
¹ Unfortunately, the exact performance figures for ROMs were not accessible for the technology we used.
Figure 1: Comparison of four S-box implementations (sub16-lut, hw-lut, hybrid-lut, bertoni)
Implementations which calculate the S-box transformation in hardware were first proposed by
Wolkerstorfer et al. [20] and Satoh et al. [17]. The former approach decomposes the elements
of $\mathbb{F}_{2^8}$ into polynomials over the sub-field $\mathbb{F}_{2^4}$ and performs inversion there. Our implementation of
this solution is denoted as wolkerstorfer. Satoh’s solution decomposes the field elements further
into polynomials over the sub-field $\mathbb{F}_{2^2}$, where inversion is a trivial swap of the lower and higher
bit of the representation. This implementation is referred to as satoh in the following. Both of
these approaches represent the field elements by using a polynomial basis. Canright improved the
calculation of the S-box by switching the representation to a normal basis [2]. As in Satoh’s
solution, the elements of $\mathbb{F}_{2^8}$ are mapped to a polynomial over the sub-field $\mathbb{F}_{2^2}$. This approach will
be denoted as canright.
A compromise between hardware look-up and calculation has also been examined. In this implementation (denoted as hybrid-lut) only the inversion in $\mathbb{F}_{2^8}$ is realized as a look-up table. Since
the inversion is used for both encryption and decryption, the size of the look-up table is halved in
relation to the hw-lut approach. The affine and inverse affine transformations are performed via
logic circuits just as in the calculating implementations of wolkerstorfer, satoh, and canright.
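The structure of the hybrid approach can be modeled in software as follows. This Python sketch is a behavioral illustration only and not the evaluated VHDL design; the names INV and hybrid_sbox are assumptions, and the constants of the affine and inverse affine transformations are those of FIPS-197.

```python
def gf_mul(a, b):
    """Multiply in F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Inverse computed as a^254; 0 is mapped to 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


# The single shared look-up table: 256 entries of 8 bits, used in both directions.
INV = [gf_inv(x) for x in range(256)]


def rotl8(x, s):
    """Rotate an 8-bit value left by s bit positions."""
    return ((x << s) | (x >> (8 - s))) & 0xFF


def affine(x):
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y):
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def hybrid_sbox(x, decrypt=False):
    """Mode bit selects where the logic sits around the shared inversion table."""
    if decrypt:
        return INV[inv_affine(x)]     # inverse affine first, then inversion
    return affine(INV[x])             # inversion first, then affine
```

Because the same table INV serves both directions, only one 256 × 8-bit table is needed, which is the source of the halved table size mentioned above.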
The low-power approach of Bertoni et al. [1] uses a decode stage to convert the eight bits of the
input byte and the control bit which selects encryption or decryption into a one-hot representation
consisting of $2^9 = 512$ bits. The substitution itself is just a rearrangement of these bits and can be
done efficiently in hardware by a rewiring of lines as illustrated in Figure 1. Since two of the lines
always map to the same 8-bit result (one for encryption and one for decryption), these line pairs can
be combined with a logical OR to yield a one-hot decoded representation of the result consisting
of 256 bits. A subsequent encoder stage transforms this result back to an 8-bit binary value. Due
to this decoder-permute-encoder structure, there is only very little signal activity within the circuit
when the input changes, resulting in low power consumption. Note that the structure of Bertoni’s
approach makes it easily possible to introduce pipeline stages. However, it may be necessary to add
a large number of additional flip-flops when the pipeline stage is placed between the decoder and
encoder, i.e. on the one-hot encoded signal lines. These flip-flops will increase power consumption
considerably and can easily negate the low-power advantages of this solution. For design scenarios
where both power consumption and silicon area are of minor importance, Bertoni’s approach can
offer the best opportunity for reaching very high clock frequencies.
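The decoder-permute-encoder structure can be modeled behaviorally as in the following Python sketch. It is not the circuit of [1]; the function name one_hot_sbox is an assumption, and the tables fwd and inv used in the example are a hypothetical stand-in bijection and its inverse rather than the AES S-box, chosen only to keep the example short.

```python
def one_hot_sbox(x, decrypt, fwd, inv):
    """Behavioral model of a decode-permute-encode substitution.

    fwd and inv are 256-entry tables for the forward and inverse substitution.
    """
    # Decoder: mode bit plus input byte (9 bits) drive exactly one of 512 lines.
    lines_in = [0] * 512
    lines_in[(int(decrypt) << 8) | x] = 1

    # Permutation: pure rewiring. Each of the 256 output lines is the OR of
    # one line from the encryption half and one from the decryption half.
    lines_out = [0] * 256
    for v in range(256):
        lines_out[fwd[v]] |= lines_in[v]
        lines_out[inv[v]] |= lines_in[256 + v]

    # Encoder: convert the one-hot result back to an 8-bit binary value.
    return lines_out.index(1)


# Hypothetical stand-in bijection (NOT the AES S-box) and its inverse table.
fwd = [(7 * v + 3) % 256 for v in range(256)]
inv = [0] * 256
for v, y in enumerate(fwd):
    inv[y] = v
```

In this model a change of the input moves the single active line on each of the two one-hot buses, which corresponds to the low signal activity described above.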
We tested two implementations of Bertoni’s approach: One implementation uses a decoder with
four stages as proposed in the original publication for minimal power consumption (bertoni). The
second implementation, denoted as bertoni-2stg, uses a different decoder structure with only two
stages in order to reduce the critical path of the circuit.
In the remainder of this paper we will refer to wolkerstorfer, satoh, and canright as calculating
implementations. We will denote hw-lut and hybrid-lut as look-up implementations, and sub16-lut, bertoni, and bertoni-2stg as low-power implementations.
5 Design Flow and Evaluation Methodology
In contrast to our previous work [18] where we used a 0.35 µm standard cell library from Austriamicrosystems, all results in this paper were obtained with the VST250 standard cells from Virtual
Silicon. These standard cells are built upon the 0.25 µm process technology L250 of UMC, which
provides one poly-silicon layer and five metal layers. The nominal supply voltage of the VST250
cell library is 2.5 V.
We implemented the eight S-box designs described in Section 4 in VHDL according to the
specifications in the respective papers. In order to ensure a fair comparison and a common interface
for all implementations, we provided the input and output of each S-box with 8-bit registers. The
integration of the registers made it possible to optimize for area and delay during synthesis. The
logic synthesis was done using the Physically Knowledgeable Synthesis (PKS) tool from Cadence.
We varied the constraints for the delay time (i.e. maximum clock frequency) from the minimum
value to a value where the constraints could just be met. The delays given in Table 1 are the actual
delays of the synthesized circuit. Empty cells in the table indicate that the respective target delay
could not be achieved by the synthesizer.
After synthesis, the placement and routing of the standard cells was performed with the Cadence
tool First Encounter. We did not include I/O cells into the designs, i.e. we analyzed only the core of
the S-boxes consisting of standard cells and the power supply rings. During placement we used an
area utilization of 70%. All the figures in Table 1 are results from synthesis excluding the clock tree
for the input and output registers. After the routing step we integrated the layouts of the standard
cells into the design, which gave us the full layout in GDS2 format.
Design         Result            Target delay (ns)
                                 2.00   3.00   4.00   5.00   6.00   7.00   8.00   9.00
canright       Act. delay (ns)   –      –      –      4.98   5.00   6.55   6.55   6.55
               Area (GE)         –      –      –      496    400    303    303    303
               Power (µA)        –      –      –      1.78   1.78   1.81   1.81   1.81
satoh          Act. delay (ns)   –      –      –      –      5.93   6.55   6.99   6.99
               Area (GE)         –      –      –      –      438    409    385    385
               Power (µA)        –      –      –      –      2.00   1.73   1.51   1.51
wolkerstorfer  Act. delay (ns)   –      –      –      4.93   5.94   6.48   7.51   7.51
               Area (GE)         –      –      –      625    412    415    392    392
               Power (µA)        –      –      –      1.87   1.97   1.75   1.53   1.53
hw-lut         Act. delay (ns)   1.95   2.91   3.90   4.98   5.88   6.61   6.61   6.61
               Area (GE)         1545   1415   1351   1352   1302   1301   1301   1301
               Power (µA)        1.18   0.97   1.00   0.97   0.93   1.00   1.00   1.00
sub16-lut      Act. delay (ns)   –      2.94   3.92   4.46   4.46   4.46   4.46   4.46
               Area (GE)         –      2040   1979   1957   1957   1957   1957   1957
               Power (µA)        –      0.56   0.53   0.55   0.58   0.58   0.58   0.58
hybrid-lut     Act. delay (ns)   –      2.93   3.92   4.86   5.83   6.49   6.49   6.49
               Area (GE)         –      1222   840    810    799    798    798    798
               Power (µA)        –      1.34   1.02   0.98   0.95   0.98   0.98   0.98
bertoni        Act. delay (ns)   1.86   2.90   3.31   3.31   3.31   3.31   3.31   3.31
               Area (GE)         2016   1433   1399   1399   1399   1399   1399   1399
               Power (µA)        0.42   0.30   0.27   0.27   0.27   0.27   0.27   0.27
bertoni-2stg   Act. delay (ns)   1.98   2.79   3.53   3.26   3.26   3.26   3.26   3.26
               Area (GE)         1941   1446   1436   1421   1421   1421   1421   1421
               Power (µA)        0.42   0.32   0.31   0.33   0.33   0.33   0.33   0.33
Table 1: Synthesis results of the eight S-box designs depending on the target delay
We extracted a Spectre netlist from the layout using Assura RCX, where we only considered
resistors larger than 1 Ω and capacitors larger than 1 pF. In contrast to our previous work [18], we
obtained the power consumption of the different S-box designs through simulation with Synopsys
NanoSim. All simulations were performed with BSIM3v3 transistor models characterized for the
UMC L250 technology and the built-in NanoSim models for resistors and capacitors. The results of
the NanoSim simulations shown in Table 1 represent the mean current consumption of the S-boxes
at a supply voltage of 2.5 V. We used a clock frequency of 50 MHz (i.e. new input values are
applied to the circuit with a period of 20 ns) and simulated all 256 possible input patterns.
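For orientation, a mean supply current can be converted into mean power and into energy per S-box evaluation as sketched below in Python. The function name is an assumption; the default values are the supply voltage and clock frequency stated above, and the example value takes the unit given in Table 1 at face value.

```python
def power_and_energy(mean_current_a, vdd_v=2.5, f_clk_hz=50e6):
    """Return (mean power in W, energy per input pattern in J).

    One new input pattern is applied per clock period, so the energy per
    S-box evaluation is the mean power multiplied by the period 1 / f_clk.
    """
    power_w = vdd_v * mean_current_a
    energy_j = power_w / f_clk_hz
    return power_w, energy_j


# Example: a mean current of 0.27 µA (the Table 1 entry for bertoni).
power_w, energy_j = power_and_energy(0.27e-6)
```

Since the supply voltage and the clock frequency are the same for all designs, comparing mean currents is equivalent to comparing mean power or energy per evaluation.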
6 Experimental Results
We synthesized all eight S-box implementations mentioned in Section 4 using the design flow
described previously. For each implementation several synthesis runs were carried out, whereby
we specified different target values for the maximum critical path delay, ranging from 2 ns to
9 ns. Table 1 summarizes the actual delay, the area of the synthesized design, and the mean power
consumption. We omitted the results of all synthesis runs where the timing constraints were not
met, i.e. when the actual delay was higher than the target delay.
Figure 2: Area vs. critical path delay (x-axis: target value for critical path delay in ns; y-axis: area in GE)
Figure 2 shows the area of the eight S-box designs when synthesized for a specific critical path
delay. The area is given in gate equivalents (GE), calculated as total area divided by the size of a
2-input NAND with the lowest drive strength, which is the NAND20 cell of the library we used.
Amongst the three calculating implementations (at the bottom of the figure), canright is clearly
the best. It has the smallest size of all eight S-boxes, but suffers from a longer critical path than the
hardware look-up implementations and the low-power solutions. The calculating implementations
are smaller than the other two approaches because they make use of the algebraic structure of the
S-box to implement the substitution. On the other hand, this structure has a relatively long critical
path. The shortest critical path can be achieved with bertoni, but its size is about three times that
of canright. Look-up implementations ignore the algebraic structure of the S-box and just aim at a
straightforward realization of the boolean equations given by the input-output relation. Hence, the
synthesizer has a much higher degree of freedom for optimizing the circuit, which allows for a
shorter critical path at the expense of silicon area.
The low-power implementations also ignore the algebraic properties of the substitution and
simply implement the boolean equations of the input-output relation. However, they use a specific
structure (decode-permute-encode) to reduce signal activity. Although the critical path is about as
short as for look-up implementations, the one-hot encoding requires more silicon area than the
look-up implementations. The sub16-lut approach also has a significant area overhead introduced
by the address decoding of the sub-tables, which makes it the most costly solution in terms of silicon
area. Moreover, the address decode logic causes a longer critical path. As expected, the compromise
between hardware look-up and calculation (hybrid-lut) lies somewhere between hw-lut and the
calculating implementations with regard to both critical path delay and area.
Figure 3 shows the total power consumption plotted against the critical path delay. All power
values are normalized with respect to the power consumption of hw-lut for a delay of 5.0 ns. The
low-power S-boxes based on the approach of Bertoni (bertoni, bertoni-2stg) are the clear winners
of this comparison. The original implementation bertoni shows the best overall results among all
eight examined designs, closely followed by the modified version bertoni-2stg. Bertoni’s approach
is solely directed towards low power consumption with a minimal level of signal activity in the
circuit. The sub16-lut approach, on the other hand, tries to improve a straightforward look-up table
implementation (hw-lut) with low-power measures. However, sub16-lut requires almost twice as
much power as bertoni, while hw-lut consumes about three times more power. The hybrid-lut
approach requires roughly the same amount of power as hw-lut.
Figure 3: Total power consumption vs. critical path delay (x-axis: target value for critical path delay in ns; y-axis: normalized total power)
The power consumption of the calculating implementations is much higher than that of the
low-power and look-up versions. The algebraic evaluation of the S-box function in calculating
implementations causes a large number of internal nodes to transition even if only a few input bits
toggle. This behavior entails high signal activity and, in turn, high power consumption. In look-up
implementations a change of a few input bits affects the evaluation of all output bits separately. As
normally some output bits will remain unchanged, the signal activity within this particular path
is low, which limits the overall power consumption. The implementation of canright consumes
almost twice as much power as hw-lut, and roughly an order of magnitude more power than
bertoni. The other two calculating implementations, wolkerstorfer and satoh, have power
characteristics similar to those of canright.
Figure 4: Power-area product vs. critical path delay (x-axis: target value for critical path delay in ns; y-axis: normalized power × area)
Figure 4 shows the results of the eight S-box implementations in terms of the power-area
product. This metric is particularly relevant for applications with a need for both small silicon area
and low power consumption, e.g. cryptographically enhanced RFID tags or sensor nodes.
Due to their large area requirements, hw-lut and sub16-lut have the worst power-area product among all eight examined implementations. The calculating S-boxes also show a relatively
poor power-area product, which is mainly caused by the high power consumption of the S-box
evaluation. All three calculating implementations have similar characteristics for relaxed critical
path conditions. Both satoh and wolkerstorfer also have similar properties for more stringent
constraints on the critical path, whereas canright becomes more and more advantageous for faster
designs. The hybrid-lut implementation is even slightly better than canright when synthesized for
a delay of 5 ns. However, hybrid-lut becomes very unattractive if the critical path delay needs to be
smaller. The low-power approach of bertoni achieves the best overall power-area product, closely
followed by bertoni-2stg.
The power-area products shown in Figure 4 differ from those in [18] because we used a different
standard cell library and a different approach for evaluating the power consumption. According to
our results, the calculating implementations are more attractive than the look-up implementations
and sub16-lut is the best look-up implementation for short critical paths. The low-power designs
achieve the best results for the power-area product in our study as well as in [18]. However, while
our study found slight advantages for bertoni, the results in [18] show bertoni-2stg as the winner.
Figure 5: Total power consumption vs. area
(Plot omitted; x-axis: area in GE, 0 to 2100; y-axis: total power, normalized; an arrow marks the direction of decreasing critical path delay; curves: satoh, canright, hw-lut, bertoni, wolkerstorfer, hybrid-lut, bertoni-2stg, sub16-lut.)
Figure 5 illustrates the power consumption in relation to the required silicon area. In general,
the points further away from the point of origin represent synthesis results for shorter critical
path delays. The figure shows that calculating implementations tend to sacrifice power efficiency
to achieve higher speed. On the other hand, the low-power implementations trade silicon area
for a shorter critical path. The sub16-lut implementation shows similar behavior. The look-up
implementations hw-lut and hybrid-lut sacrifice area as well as power efficiency to roughly the
same degree.
In order to minimize the critical path delay, the synthesizer applies a number of optimization
techniques like using standard cells with higher drive strengths or the duplication of logic paths,
which causes considerable power consumption in circuits with high switching activity. Calculating
S-box implementations have an inherently high number of signal switches and, therefore, incur an
over-proportional increase in power consumption when reducing the critical path delay. Low-power
implementations, on the other hand, are characterized by little signal activity and, therefore, a
moderate increase in power consumption for shorter critical paths.
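The link between switching activity and the power cost of timing optimization can be made explicit with the standard first-order model of dynamic power in CMOS logic. This model is not given in the text and is added here only as an illustration; the symbols are our own: \alpha is the average switching activity, C_L the switched load capacitance, V_{DD} the supply voltage, and f the clock frequency. Short-circuit and leakage power are ignored.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C_L \, V_{DD}^{2} \, f
\qquad\Longrightarrow\qquad
\Delta P_{\mathrm{dyn}} \approx \alpha \, \Delta C_L \, V_{DD}^{2} \, f
```

Cells with higher drive strength and duplicated logic paths add capacitance \Delta C_L. Under this model, the resulting power increase scales with \alpha, so a circuit with high signal activity pays more for the same added capacitance than a circuit with low activity.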
When compared to the results reported in [18] (which are based on a 0.35 µm standard-cell
library), the silicon area and critical path delay figures correspond quite well to the current ones
obtained with the UMC 0.25 µm technology. Regarding power consumption, we notice that the
current figures indicate a less dramatic difference among the examined S-box implementations than
those given in [18]. We attribute this discrepancy to the different standard cell libraries and the
different power evaluation methods. While the results in [18] were obtained via estimations from
the synthesis tool, our current figures result from a much more accurate simulation of the placed
and routed designs using NanoSim. This, of course, has also led to slight differences in all other
metrics which include the power consumption results.
7
Conclusions
In this paper we examined eight AES S-box implementations which follow three different design
strategies. We analyzed and compared various cost metrics like critical path delay, silicon area, and
power consumption of these implementations based on synthesis runs with a 0.25 µm CMOS
standard cell library. Our simulation results clearly show that the characteristics of the eight S-box
implementations differ significantly. For example, the power consumption of the different S-boxes
varies by almost an order of magnitude, which underpins the importance of selecting the proper
S-box with respect to the requirements of the target application. We found that Canright’s S-box
design is the best choice for applications where small silicon area is the main criterion (e.g. RFID
tags). Bertoni’s S-box is very well suited for applications with a demand for low power or energy
consumption, e.g. wireless sensor nodes. In addition, the Bertoni S-box has the shortest critical
path, followed by the look-up implementations. While the results for the calculating implementations only apply to the AES S-box, the insights from the other two implementation strategies
(look-up, except hybrid-lut, and low-power) are also useful for other cryptographic S-boxes.
Acknowledgements
The authors would like to thank Johannes Wolkerstorfer and David Canright for providing the
HDL source code of several AES S-box implementations. The research described in this paper has
been supported by the Austrian Science Fund (FWF) under grant P16952–N04, the FIT-IT initiative
of the Austrian Federal Ministry of Transport, Innovation, and Technology (project SNAP), and the
EPSRC under grant EP/E001556/1. The research described in this paper has also been supported, in
part, by the European Commission through the IST Programme under contract IST-2002-507932
ECRYPT. The information in this document reflects only the authors’ views, is provided as is and
no guarantee or warranty is given that the information is fit for any particular purpose. The user
thereof uses the information at its sole risk and liability.
References
[1] G. Bertoni, M. Macchetti, L. Negri, and P. Fragneto. Power-efficient ASIC synthesis of cryptographic
Sboxes. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI (GLSVLSI 2004), pp. 277–281.
ACM Press, 2004.
[2] D. Canright. A very compact S-Box for AES. In Cryptographic Hardware and Embedded Systems —
CHES 2005, vol. 3659 of Lecture Notes in Computer Science, pp. 441–455. Springer Verlag, 2005.
[3] P. Chodowiec and K. Gaj. Very compact FPGA implementation of the AES algorithm. In Cryptographic Hardware and Embedded Systems — CHES 2003, vol. 2779 of Lecture Notes in Computer
Science, pp. 319–333. Springer Verlag, 2003.
[4] J. Daemen and V. Rijmen. The Design of Rijndael: AES – The Advanced Encryption Standard. Springer
Verlag, 2002.
[5] M. Feldhofer, K. Lemke, E. Oswald, F.-X. Standaert, T. Wollinger, and J. Wolkerstorfer. State of the Art
in Hardware Architectures. ECRYPT deliverable D.VAM.2, available for download at
http://www.ecrypt.eu.org/documents/D.VAM.2-1.0.pdf, Sept. 2005.
[6] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen. AES implementation on a grain of sand. IEE Proceedings Information Security, 152(1):13–20, Oct. 2005.
[7] A. Hodjat, D. D. Hwang, B.-C. Lai, K. Tiri, and I. M. Verbauwhede. A 3.84 Gbits/s AES crypto coprocessor with modes of operation in a 0.18-µm CMOS technology. In Proceedings of the 15th ACM
Great Lakes Symposium on VLSI (GLSVLSI 2005), pp. 351–356. ACM Press, 2005.
[8] H. Li. A parallel S-box architecture for AES byte substitution. In Proceedings of the 2nd International
Conference on Communications, Circuits and Systems (ICCCAS 2004), vol. 1, pp. 1–3. IEEE, 2004.
[9] R. Lidl and H. Niederreiter. Finite Fields, vol. 20 of Encyclopedia of Mathematics and Its Applications.
Cambridge University Press, 1996.
[10] M. Macchetti and G. Bertoni. Hardware implementation of the Rijndael SBOX: A case study. ST
Journal of System Research, 0(0):84–91, July 2003.
[11] M. McLoone and J. V. McCanny. High performance single-chip FPGA Rijndael algorithm implementations. In Cryptographic Hardware and Embedded Systems — CHES 2001, vol. 2162 of Lecture Notes
in Computer Science, pp. 65–76. Springer Verlag, 2001.
[12] N. Mentens, L. Batina, B. Preneel, and I. M. Verbauwhede. Systematic evaluation of compact hardware
implementations for the Rijndael S-box. In Topics in Cryptology — CT-RSA 2005, vol. 3376 of Lecture
Notes in Computer Science, pp. 323–333. Springer Verlag, 2005.
[13] S. Morioka and A. Satoh. An optimized S-Box circuit architecture for low power AES design. In Cryptographic Hardware and Embedded Systems — CHES 2002, vol. 2523 of Lecture Notes in Computer
Science, pp. 172–186. Springer Verlag, 2002.
[14] National Institute of Standards and Technology (NIST). Data Encryption Standard (DES). Federal
Information Processing Standards (FIPS) Publication 46-3, Oct. 1999.
[15] National Institute of Standards and Technology (NIST). Advanced Encryption Standard (AES). Federal
Information Processing Standards (FIPS) Publication 197, Nov. 2001.
[16] N. Pramstaller and J. Wolkerstorfer. A universal and efficient AES co-processor for field programmable
logic arrays. In Field Programmable Logic and Application — FPL 2004, vol. 3203 of Lecture Notes
in Computer Science, pp. 565–574. Springer Verlag, 2004.
[17] A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A compact Rijndael hardware architecture with
S-Box optimization. In Advances in Cryptology — ASIACRYPT 2001, vol. 2248 of Lecture Notes in
Computer Science, pp. 239–254. Springer Verlag, 2001.
[18] S. Tillich, M. Feldhofer, and J. Großschädl. Area, delay, and power characteristics of standard-cell
implementations of the AES S-box. In Embedded Computer Systems: Architectures, Modeling, and
Simulation — SAMOS 2006, vol. 4017 of Lecture Notes in Computer Science, pp. 457–466. Springer
Verlag, 2006.
[19] S. Tillich and J. Großschädl. Instruction set extensions for efficient AES implementation on 32-bit
processors. In Cryptographic Hardware and Embedded Systems — CHES 2006, vol. 4249 of Lecture
Notes in Computer Science, pp. 270–284. Springer Verlag, 2006.
[20] J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC implementation of the AES SBoxes. In
Topics in Cryptology — CT-RSA 2002, vol. 2271 of Lecture Notes in Computer Science, pp. 67–78.
Springer Verlag, 2002.
[21] X. Zhang and K. K. Parhi. High-speed VLSI architectures for the AES algorithm. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 12(9):957–967, Sept. 2004.